An Object Lesson In Database Optimization

I have a tendency to regard a database engine as a black box: I just formulate my queries and count on the engine to make them fast, somehow. It's similar to the faith that people tend to place in language compilers: just write the C code and the compiler will magically optimize it, and if the code isn't fast enough, maybe you should use a higher optimization level. Of course, a computer scientist ought to be able to analyze an algorithm's running time and spot opportunities for theoretical improvement, rather than relying on the compiler to insert a faster machine instruction here or there.

The FATE database is getting slower as it grows larger, so I started reading up on MySQL optimization strategies. There are a few things to understand about how the database works under the covers, and they are quite intuitive to anyone who has a semester of data structures and algorithms coursework. The fastest-growing table is test_result: each build currently generates 111 new rows, one for each active test specification.
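
For context, here is a sketch of the two columns that matter for this discussion. This is purely illustrative, not the actual FATE schema; the real test_result table carries more columns (the actual test results) and the exact types may differ:

mysql> CREATE TABLE test_result (
       build_record INT NOT NULL,  -- which build produced this row of results
       test_spec    INT NOT NULL,  -- which test specification was run
       test_passed  TINYINT        -- hypothetical stand-in for the real result columns
       );

Counting the rows attached to a recent build record bears out that 111 figure: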

mysql> SELECT COUNT(test_spec) 
       FROM test_result 
       WHERE build_record=6742;
+------------------+
| COUNT(test_spec) |
+------------------+
|              111 |
+------------------+
1 row in set (4.12 sec)


A 3-4 second query is typical right now, but that time will obviously grow along with the table. MySQL (and other databases as well) has an EXPLAIN keyword that can be prepended to a SELECT statement to reveal how the engine plans to execute the query:

mysql> EXPLAIN SELECT * 
       FROM test_result 
       WHERE build_record=6742;
+----+-------------+-------------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table       | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
+----+-------------+-------------+------+---------------+------+---------+------+--------+-------------+
|  1 | SIMPLE      | test_result | ALL  | NULL          | NULL | NULL    | NULL | 443099 | Using where |
+----+-------------+-------------+------+---------------+------+---------+------+--------+-------------+

From what I read about the foregoing data, a type of "ALL" is as bad as it gets: it means that every row in the table was examined. Sure enough, all 443k rows had to be studied to come up with the query result. This makes sense from an algorithmic perspective, since the engine has nothing better to do than walk over each row in a linear manner.

How could this be improved with a rudimentary knowledge of algorithms and data structures? Since the query filters rows on build_record, how about maintaining an ordered index of build_record numbers that point to their rows, and doing something like a binary search on that index whenever build_record shows up in the WHERE clause? In fact, MySQL has the same idea. I'm not sure of the exact technical underpinnings of the indexing mechanism (MySQL's standard storage engines implement indexes as B-trees, which serve much the same purpose as that sorted list), but this is how you ask MySQL to build and maintain an index:

mysql> CREATE INDEX build_record_index 
       ON test_result (build_record);
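
As an aside, SHOW INDEX will list the indexes a table already has, and the same index could just as easily have been added with an ALTER TABLE statement; both forms are standard MySQL:

mysql> SHOW INDEX FROM test_result;

mysql> ALTER TABLE test_result 
       ADD INDEX build_record_index (build_record);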

What kind of improvement? Check out the explanation (using a different build record to avoid query caching):

mysql> EXPLAIN SELECT * 
       FROM test_result 
       WHERE build_record=6745;
+----+-------------+-------------+------+--------------------+--------------------+---------+-------+------+-------+
| id | select_type | table       | type | possible_keys      | key                | key_len | ref   | rows | Extra |
+----+-------------+-------------+------+--------------------+--------------------+---------+-------+------+-------+
|  1 | SIMPLE      | test_result | ref  | build_record_index | build_record_index | 4       | const |  109 |       |
+----+-------------+-------------+------+--------------------+--------------------+---------+-------+------+-------+

Nice. Only 109 rows were reportedly inspected, which looks kind of strange, since the query actually matches 111 rows; as it turns out, the rows column in EXPLAIN output is an estimate derived from index statistics rather than an exact count:

mysql> SELECT COUNT(test_spec) 
       FROM test_result 
       WHERE build_record=6745;
+------------------+
| COUNT(test_spec) |
+------------------+
|              111 |
+------------------+
1 row in set (0.50 sec)
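
Since that 109 figure comes from index statistics, it is worth knowing that ANALYZE TABLE refreshes those statistics if they ever drift too far out of date (harmless here, just good housekeeping on a table that grows constantly):

mysql> ANALYZE TABLE test_result;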

More importantly, the average query of this type now shapes up to be well under a second.
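
One more aside: earlier I dodged the query cache by picking a different build record. Another option is the SQL_NO_CACHE hint, which keeps the result of that statement out of the query cache, handy when timing the same query over and over:

mysql> SELECT SQL_NO_CACHE COUNT(test_spec) 
       FROM test_result 
       WHERE build_record=6745;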

Anyway, I found it all terribly interesting. Better yet, certain aspects of FATE will be much faster now and will scale much better as the data set grows. Maybe someone else will stumble onto this article through Google and find this practical example more helpful than the employee-salary database examples that litter the web.