Thursday, 15 January 2015

Hypotheses are for testing, but assumption is the mother of disaster

If I were to ask what is the most important element to being a good programmer I would say the ability to avoid making assumptions and instead make hypotheses that can be tested. Knowing you are right because you have data to back it up will allow projects to remain on track. Ploughing on based on guess work and things a colleague told you once (without providing you with data & methodology to back up these statements) will likely lead to ruin.

I does not matter how long someone is programming it is always possible that someone heard something once, took it as gospel and never tested it to make sure it was true. I encountered this with a couple of colleagues and an SQL query. The SQL query was built from a ORM-like query generator and included the syntax WHERE 1=1. Both my manager and a senior developer stated, "well that is the problem WHERE 1=1 will always cause a full table scan rather than use indexes".

When I looked at the same query I thought that the table may be missing indexes and the  GROUP BY clause was potentially causing the problem, however, rather than assuming I was right I set about testing variants of the query.

My testing was in MySQL so I was employing SQL_NO_CACHE in my select statements to make sure the behaviour was not effected by query caching. This command is vital in any query optimisation testing. 

My testing proved that WHERE 1=1 has no discernible effect on query perform and does not prevent the query from caching. Of course removing it would be a few less characters to be sent over wire, read and parsed by the engine, but it certainly does not trigger a full table scan. 

WHERE 1=1 is an easy way of setting up where clauses so that you can just concatenate AND condition, without having to remove the first conditions AND. It is even useful for testing your own queries in a tool like MySQL Workbench you can comment out various ANDs and not have to worry about the syntax breaking.

Reading the MySQL documentation also makes it clear that clauses like 1=1 are removed by the constant condition removal portion of the WHERE clause optimisation. There may be some SQL variant that does not manage to optimise this condition away but it is certainly none of the major vendors I have worked with over the last 10 years.

Testing illustrated that the GROUP BY clauses caused the biggest performance issue, there were also some other optimisations available by removing unnecessary JOINs. I was slightly surprised that the ORDER BY on an unindexed column did not cause any real impact. Please do not be fooled by myths, dig into the code and test for yourself. Testing that 1=1 does not impact performance takes a few seconds if you have a MySQL database running, going down the root of assuming that it is a problem can waste hours "fixing" code for no benefit.

Through testing though you can come up with some interesting discoveries, such as 

count(DISTINCT id) FROM `table`

is substantially slower than 

count(*) FROM `table`

even though id is a unique primary key column. The system I was using was automatically constructing the first query, which on very large tables was able to take 1-2 seconds, whereas the second always came back in less than 0.001 and provided the same result. 

However, as I said don't take my word for it, form a hypothesis by all means, but never make assumptions, hypotheses are always tested, but assumption is the mother of disaster.

No comments: