Breaking Eggs And Making Omelettes

Topics On Multimedia Technology and Reverse Engineering


Archives:

Adjusting The Timetable and SQL Shame

August 15th, 2012 by Multimedia Mike

My Game Music Appreciation website has a big problem that many visitors quickly notice and comment upon. The problem looks like this:

[screenshot: a player track listing in which every song’s length reads 2:30]

The problem is that all of these songs are 2m30s in length. During the initial import process, unless a chiptune file already had curated length metadata attached, my metadata utility emitted a default play length of 150 seconds. This is not good if you want to listen to all the songs in a soundtrack without interacting with the player page, but the soundtrack has various short songs (think “game over” or other quick jingles) that are over in a few seconds. Such songs were still padded out to a full 150 seconds, mostly silence.

So I needed to correct this. Possible solutions:

  1. Manually: At first, I figured I could ask the database which songs needed fixing and listen to them to determine the proper lengths. Then I realized that there were well over 1400 games affected by this problem. This just screams “automated solution”.
  2. Automatically: Ask the database which songs need fixing and then somehow ask the computer to listen to the songs and decide their proper lengths. This sounds like a winner, provided that I can figure out how to programmatically determine if a song has “finished”.

SQL Shame
This play adjustment task has been on my plate for a long time. A key factor that has blocked me is that I couldn’t figure out a single SQL query to feed to the SQLite database underlying the site which would give me all the songs I needed. To be clear, it was very simple and obvious to me how to write a program that would query the database in phases to get all the information. However, I felt that it would be impure to proceed with the task unless I could figure out one giant query to get all the information.

This always seems to come up whenever I start interacting with a database in any serious way. I call it SQL shame. This task got some traction when I got over this nagging doubt and told myself that there’s nothing wrong with the multi-step query program if it solves the problem at hand.

Suddenly, I had a flash of inspiration about why the so-called NoSQL movement exists. Maybe there are a lot more people who don’t like trying to derive such long queries and are happy to allow other languages to pick up the slack.

Estimating Lengths
Anyway, my solution involved writing a Python script to iterate through all the games whose metadata was output by a certain engine (the one that makes the default play length 150 seconds). For each of those games, the script queries the song table and determines if each song is exactly 150 seconds. If it is, then go to work trying to estimate the true length.

The foregoing paragraph describes what I figured was possible with only a single (possibly large) SQL query.
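With SQL purity set aside, the phased approach comes down to a few lines of Python with the sqlite3 module. This is only a sketch against a throwaway in-memory database with made-up rows; the real script’s tables and filters (e.g., restricting to games from one particular metadata engine) are more involved:

```python
import sqlite3

# Sketch of the multi-step approach: one query for the candidate games,
# then a per-game query for that game's songs. Column names follow the
# schema posted later in this article; the sample data is invented.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE games (id INTEGER PRIMARY KEY, game TEXT)")
db.execute("CREATE TABLE songs (game_id INTEGER, song_number INTEGER,"
           " play_length INTEGER)")
db.execute("INSERT INTO games VALUES (1, 'Example Quest')")
db.executemany("INSERT INTO songs VALUES (?, ?, ?)",
               [(1, 1, 150000), (1, 2, 95000)])

needs_fixing = []
for game_id, game_title in db.execute("SELECT id, game FROM games"):
    for number, play_length in db.execute(
            "SELECT song_number, play_length FROM songs WHERE game_id = ?",
            (game_id,)):
        if play_length == 150000:  # exactly the 150-second default
            needs_fixing.append((game_id, number))

print(needs_fixing)  # -> [(1, 1)]
```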

For each song represented in the chiptune file, I ran it through a custom length estimator program. My brilliant (err, naïve) solution to the length estimation problem was to synthesize audio one second at a time, up to a maximum of 120 seconds (tightening up the default length just a bit), and count how many consecutive seconds consisted entirely of 0 samples. If the count reached 5 consecutive seconds of silence, the estimator rewound the running length by 5 seconds, declared that to be the proper length, and updated the database.
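Here is a sketch of that estimator logic. The synthesize_second() callable is a stand-in for the real chiptune synthesizer call, which isn’t shown here; it is assumed to return one second’s worth of PCM samples:

```python
MAX_SECONDS = 120   # tightened ceiling on estimated length
SILENCE_LIMIT = 5   # consecutive all-zero seconds before stopping

def estimate_length(synthesize_second):
    """Return an estimated song length in seconds.

    synthesize_second(n) stands in for the real synthesizer: it
    returns the PCM samples for second n of the song.
    """
    silent_run = 0
    for second in range(MAX_SECONDS):
        samples = synthesize_second(second)
        if all(s == 0 for s in samples):
            silent_run += 1
            if silent_run == SILENCE_LIMIT:
                # rewind past the run of silence
                return second + 1 - SILENCE_LIMIT
        else:
            silent_run = 0
    return MAX_SECONDS  # never went silent; keep the ceiling

# fake song: 12 seconds of audio followed by pure silence
fake = lambda n: [1, -1] * 4 if n < 12 else [0] * 8
print(estimate_length(fake))  # -> 12
```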

There were about 1430 chiptune files whose songs needed updates. Some files had a single song; some had over 100. When I let the script run, it took nearly 65 minutes to process all the files. That was a single-threaded solution, of course. Even though I already had the data I needed, I wanted to try my hand at parallelizing the script. So I went to work with Python’s multiprocessing module and quickly refactored it to use all 4 CPU threads on the machine where the files live. Results:

  • Single-threaded solution: 64m42s to process corpus (22 games/minute)
  • Multi-threaded solution: 18m48s with 4 CPU threads (75 games/minute)

More than a 3x speedup across 4 CPU threads, which is decent for a primarily CPU-bound operation.

Epilogue
I suspect that this task will require some refinement or manual intervention. Maybe there are songs which actually have more than 5 legitimate seconds of silence. Also, I entertained the possibility that some songs would generate very low amplitude noise rather than being perfectly silent. In that case, I could refine the script to stipulate that amplitudes below a certain threshold count as 0. Fortunately, I marked which games were modified by this method, so I can run a new script as necessary.
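That amplitude-threshold refinement would amount to a one-line change to the silence test; the threshold value below is an arbitrary guess for 16-bit samples, not a tuned number:

```python
NOISE_FLOOR = 16  # assumed threshold for 16-bit PCM; not tuned

def is_silent(samples, threshold=NOISE_FLOOR):
    # Treat very low-amplitude noise the same as digital silence.
    return all(abs(s) <= threshold for s in samples)

print(is_silent([0, 3, -2, 1]))    # -> True
print(is_silent([0, 3, -200, 1]))  # -> False
```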

SQL Schema
Here is the schema of my SQLite3 database, for those who want to try their hand at a proper query. I am confident that it’s possible; I just didn’t have the patience to work it out. The task is to retrieve all the rows from the games table for which every corresponding song in the songs table is 150000 milliseconds long.

  CREATE TABLE games
      (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        uncompressed_sha1 TEXT,
        uncompressed_size INTEGER,
        compressed_sha1 TEXT,
        compressed_size INTEGER,
        system TEXT,
        game TEXT,
        gme_system TEXT default NULL,
        canonical_url TEXT default NULL,
        extension TEXT default "gamemusicxz",
        enabled INTEGER default 1,
        redirect_to_id INT DEFAULT -1,
        play_lengths_modified INT DEFAULT NULL);
  CREATE TABLE songs
      (
        game_id INTEGER,
        song_number INTEGER NOT NULL,
        song TEXT,
        author TEXT,
        copyright TEXT,
        dumper TEXT,
        length INTEGER,
        intro_length INTEGER,
        loop_length INTEGER,
        play_length INTEGER,
        play_order INTEGER default -1);
  CREATE TABLE tags
      (
        game_id INTEGER,
        tag TEXT NOT NULL,
        tag_type TEXT default "filename");
  CREATE INDEX gameid_index_songs ON songs(game_id);
  CREATE INDEX gameid_index_tag ON tags(game_id);
  CREATE UNIQUE INDEX sha1_index ON games(uncompressed_sha1);

Posted in General | 14 Comments »

14 Responses

  1. Z.T. Says:

    email me the output of `sqlite3 your_db_file.db ‘.schema’` and I will write SQL for you.

  2. Multimedia Mike Says:

    @Z.T.: I updated the post with the schema. Go for it.

  3. Z.T. Says:

    Does this do what you want?

    select * from games where not exists (select 1 from songs where game_id = id and length != 150000);

    It means: “get me all columns about every game, for which there is no record of a song that has a length other than 150000”.

  4. Z.T. Says:

    It could also be read as: “get me all the games, but if a game has at least one song that has a length other than 150000, then I’m not interested in that game”.

    BTW, this means that games with no songs are returned. If you don’t want to get games with no songs, you need to add a “and (select count(*) from songs where game_id = id) > 0” to that query.

    In imperative programming, that query would be like this:

    for each game in games:
        does_exist = False
        for each song in get_songs_of_game(game):
            if song.length != 150000:
                does_exist = True
                break
        if not does_exist:
            print game

    This has the same bug that games with zero number of songs are returned.

    Another way to achieve what you want would be:

    select * from games where id in (select game_id from songs group by game_id having min(length) == 150000 and max(length) == 150000);

    This means: “group all songs by game_id, and for each group calculate the min and max length of a song. Give me the game ids for groups in which the min and max song length were both 150000. Now give me the data on the games that correspond to these game_ids.”

    The second may be more natural, and doesn’t need the fix for the “game has zero songs” case.

  5. Multimedia Mike Says:

    @Z.T.: This didn’t work:

    ‘select * from games where not exists (select 1 from songs where game_id = id and length != 150000);’

    It returned one row of the database which didn’t even meet the criteria.

    This query:

    ‘select * from games where id in (select game_id from songs group by game_id having min(length) == 150000 and max(length) == 150000);’

    turns up an empty set. However, you nailed it on the imperative algorithm– that’s exactly how my Python script looked.

  6. Multimedia Mike Says:

    @Z.T.: Oops, this query does, in fact, work when using play_length instead of length:

    ‘select * from games where not exists (select 1 from songs where game_id = id and play_length != 150000);’

    Actually, that returns 1444 rows. This one returns 1443:

    ‘select * from games where id in (select game_id from songs group by game_id having min(play_length) == 150000 and max(play_length) == 150000);’

    I’ll have to investigate the discrepancy.

    I know the length fields are confusing: length vs. intro_length vs. loop_length vs. play_length. That comes from Game Music Emu.

  7. Multimedia Mike Says:

    BTW, I have to state that, even though the queries work, I still can’t understand them. “game_id = id” particularly hurts my head in this case. I’m glad I didn’t delay this project further while trying to figure out a query.

  8. Z.T. Says:

    There is an “EXCEPT” operator: SELECT * … EXCEPT SELECT * … gives you the rows returned by the first query minus the rows returned by the second query. There is also INTERSECT, UNION and UNION ALL (UNION returns unique rows, UNION ALL just concatenates). Is there a game with zero songs?

  9. Multimedia Mike Says:

    @Z.T.: There is at least one game record with 0 songs (which, incidentally, was the one game that was returned with your original query that used length instead of play_length). It only exists as a redirect record to another row (redirect_to_id is a number other than -1).

  10. Z.T. Says:

    One query is nested inside another query. The code in the inner loop has to refer to the variable of the outer loop. This is not very different from a join, which is a nested loop:

    for each record_a in table_a:
        for each record_b in table_b:
            if record_b.key == record_a.key:
                print record_a, record_b

    You have to be able to refer to the current record_a when looping over records from table_b. So record_a is in scope and available to you.

    With SQL, the query optimizer can choose which table should be the outer loop and which the inner loop (the small table that fits in ram/cache is the inner loop and the big table that is streamed from disk is the outer loop), or even implement the join in a completely different way (sort-merge join, hash-join), using statistics of the real data to make the decision, then build the program to perform this query in ram and execute it (possibly with JIT).

    An RDBMS would cache the built program and use it again for queries that look the same (with a parameter instead of “150000”, not hardcoded) and invalidate this cached program if more than 10% of the rows in the affected tables change or if indexes over these tables are created/removed/changed.

    And of course, an RDBMS returns results that are correct for the moment when the query was issued, not affected by later changes to the data, or uncommitted previous changes to the data.

  11. Multimedia Mike Says:

    @Z.T.: I totally accept that an RDBMS is smarter than me. :-)

  12. Z.T. Says:

    Only in the sense that a C compiler is smarter than you. :-)

  13. Z.T. Says:

    The beauty of specifying what you want done instead of how to do it is that optimizers often do as well as humans, are almost always faster, and improve with time. By writing “multiply_matrices” instead of a hand coded nested loop in C, you can get automatic vectorization, automatic parallelization, automatic loop blocking, automatic GPU code, etc.

    A C compiler doing all those things has to extract the higher level meaning out of the lower level details in your input and optimize that. By providing the higher level meaning, you save it work and future-proof your code.

    Of course, relying on tools can be a weakness, especially if you get yourself into a single vendor situation. Also, by enabling useful work to be done without understanding how the tools work, you create “engineers” who are helpless when their tools misbehave.

  14. Multimedia Mike Says:

    Yeah, the SQL optimizer can magically make things fast. But you can also hurt yourself if you’re not careful. I once wrote an SQL query that brought down my machine.