I have a confession– it has been a very long time since I created any meaningful backup of the FATE database. The thing has been operational for over 8 months now and has collected hoards of data. I already had one adventure in dramatically trimming the data size. But when I try to create a backup, it just goes on and on and on.
I finally thought to look up what MySQL facilities might help me diagnose this. Here’s something: SHOW TABLE STATUS. It turns out that the table that stores the build records is nearly 600 MB. The table that holds the test results exceeds 230 MB (at least it’s not 2.3 GB like the last time I visited the size issue). The same culprit is to blame– stdout and stderr:
mysql> SELECT
    ->   SUM(LENGTH(stdout)),
    ->   SUM(LENGTH(stderr))
    -> FROM build_record;
+---------------------+---------------------+
| SUM(LENGTH(stdout)) | SUM(LENGTH(stderr)) |
+---------------------+---------------------+
|           424930434 |           165123451 |
+---------------------+---------------------+
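(For the record, those per-table sizes come from SHOW TABLE STATUS; checking an individual table looks something like this, with the \G terminator giving vertical output. Data_length is the column to watch.)

mysql> SHOW TABLE STATUS LIKE 'build_record'\G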
The database has evolved before and it’s time for the next evolution, based on the premise that there is almost no reason whatsoever to store the stdout data. I know of one circumstance where it helps– it is useful to read the configure script output to verify configuration options (especially after I update the gcc SVN configurations). And there is definitely a good reason to keep stderr around– if the build fails, the stderr is the first stop for diagnosing the problem. Even if the build succeeds, it should be theoretically possible to compare stderr across builds to find common warnings that should be eliminated (possible future expansion).
So, my solution: Periodically retire old data. In this case, I was planning to retire stdout/stderr on build records from before September. But, well… mistakes were made and I managed to retire ALL the stdout/stderr data for all existing build records, thus further underscoring the need for responsible periodic backups. Eh, I don’t think anyone looked at that data anyway. And the table is nicely compact now:
+---------------------+---------------------+
| SUM(LENGTH(stdout)) | SUM(LENGTH(stderr)) |
+---------------------+---------------------+
|              665532 |              665532 |
+---------------------+---------------------+
It’s actually possible to back it up now. A curious observation: MySQL’s SHOW TABLE STATUS still reports a Data_length of 591471720 bytes for this table, which concerned me at first. But then I spied the Data_free column, which reports 588306412 bytes. The numbers are indeed related: the retired stdout/stderr bytes are still allocated to the table file, just marked as free space for reuse, so the table won’t actually shrink on disk until it is rebuilt.
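Reclaiming that space should just be a matter of rebuilding the table; a minimal sketch (note that this locks the table while it runs, so the CGI scripts would stall in the meantime):

mysql> OPTIMIZE TABLE build_record;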
I’m still deciding on more efficient policies for the test results table, especially since I have plans to expand to other platforms soon.
What’s the problem with database growth of 1 GB/year? Surely if you only back up every three months or so, the amount of time taken is a non-issue?
Well, I sincerely want to back up more frequently, like once a week or so. At the current database size, it takes over 3 minutes to download all the data (about 300 MB). And I’m really not sure what effect the backup operation has on database additions (CGI scripts that insert data seem to hang during the operation). So, quicker backups are preferable.
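I haven’t dug into it, but if the backup is a plain mysqldump, the hanging inserts are probably the table locks that mysqldump takes by default while it reads each table. Something like the following (the database name here is just a placeholder) streams rows instead of buffering whole tables and, for InnoDB tables, takes a consistent snapshot without locking at all:

# --quick streams rows instead of buffering each table in memory;
# --single-transaction dumps from a consistent snapshot without
# locking (InnoDB only; MyISAM tables still get locked)
mysqldump --quick --single-transaction fate_db | gzip > fate-backup.sql.gz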
Is three whole minutes a problem? Can you automate it so you don’t personally have to wait on it?
Well, there are those 3 minutes, plus all the other databases on the site that I have to back up. It really adds up.
What’s done is done with the deleted data. But I still think it will be a good policy going forward to delete old stdout data from the build records table. That data gets useless really fast.
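Something like this is what I have in mind for the periodic retirement job; the build_date column name is only illustrative since I don’t have the schema in front of me, and this time the WHERE clause gets double-checked before hitting enter:

-- retire stdout payloads older than 3 months (build_date is a
-- stand-in for the real timestamp column)
UPDATE build_record
   SET stdout = NULL
 WHERE build_date < DATE_SUB(NOW(), INTERVAL 3 MONTH);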
I should point out that the database has been storing the line counts for both stdout and stderr since March. I just haven’t gotten around to exposing the data yet. That data will not go away and is arguably more interesting than the ancient stdout/stderr data itself.
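Exposing it could be as simple as a query along these lines (the column names stdout_lines and stderr_lines, and build_id, are placeholders for whatever the real schema calls them):

-- peek at the line counts for the most recent builds
SELECT build_id, stdout_lines, stderr_lines
  FROM build_record
 ORDER BY build_id DESC
 LIMIT 10;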