No matter how much I think I know about about character encoding or trying to work around issues arising from the same, I’ll always get bitten.
For a long time, one particular FATE configuration has shown 87/127 tests succeeding, even though the total number of test specs has crept up over to 300. I investigated this errant configuration on the client side and concluded that it was, in fact, executing all of the tests and sending all the results over the server in one neat package. Apparently, the problem was on the server side. Since it was an older Intel C compiler configuration, I didn’t care about investigating much further.
At one point, some bad bit of code was checked into FFmpeg and all of the results started showing xy/127 tests succeeded. This made the issue a bit more pressing. Mans discovered that the problem had to do with the svq3 test spec failing. The bad code affecting the SVQ3 test was quickly fixed so I didn’t worry about it again until yesterday when, once again, FATE’s various configs were only reporting that 127 tests had been run.
Here’s what was happening: FATE stores the stderr output of a test only if the test spec fails. This is a key data point since everything is fine when the test is successful and FATE tosses the output. The sample used for the SVQ3 test outputs the following metadata (among other data) in the stderr (seen, for example, in this test result):
copyright : ? Vertical Online 2001 copyright-eng : ? Vertical Online 2001
Those mystery characters map to the byte 0xA9 which is the c-in-a-circle copyright symbol according to UTF tables I can find (at least, U+00A9 is). That byte is making the system choke somewhere along the line, which annoys me greatly. When the client-side Python script executes the test and stuffs the stdout and stderr into the SQLite database, the relevant field is supposed to be a blob– a binary large object. The receiving PHP script on the server is also supposed to honor that blob schema.
Mans’ solution is to specifically encode the stdout/stderr blobs as UTF8 strings in the client-side Python script. That fixes this problem. But I’m confused as to why this is necessary in the first place. Was the PHP script doing its best to interpret the data inside the blob and falling over? Or was the SQLite engine on the server confused by the 0xA9 character in the blob?
Also, I suddenly find myself wondering how the A9 search company got its name.
Update: Thanks for MichaelK. for pointing out the problem. While I was properly converting (since that’s necessary) stdout/stderr from build records to binary type, I never did the same for test result stdout/stderr. I had to do it for the build record output since that was compressed before going into the database. Since the test result data is “just strings”, or so I thought, no reason to do so.
I always assumed that A9 is short for A + strlen(“mazon.com”), since it’s their search engine.
sqlite docs.. “The second argument .. is the statement to be compiled, encoded as either UTF-8 or UTF-16”
Sqlite queries need to be encoded in UTF-8. one should be using prepared statements and checking for errors, if done this way, such problems would have been easily found.
If you’d like a set of eyes to look over the sqlite/php code you think needs attention, please direct me to it.
I had recently a similar funny experience: I was debugging the parsing code for a third-party text-based data source. Theoretically I should have received only “U” (for up) and “D” (for down), but from time-to-time I kept getting a weird sequence of 3 characters.
Turns out – because of an oversight on their end – they sometimes transmitted the “up arrow” / “down arrow” unicode character encoded in UTF-8 instead of “U” or “D” :-)