Adventures in Unicode

Tangential to multimedia hacking is proper metadata handling. Recently, I developed an interest in processing a large corpus of multimedia files that are likely to contain metadata strings falling outside the lower ASCII set. This is significant because the lower ASCII set coincides exactly with my own programming comfort zone. Indeed, all of my programming life, I have insisted on covering my ears and loudly asserting “LA LA LA LA LA! ALL TEXT EVERYWHERE IS ASCII!” I suspect I’m not alone in this.

Thus, I took this as an opportunity to conquer my longstanding fear of Unicode. I developed a self-learning course consisting of a series of exercises that add up to this diagram:

[Diagram: the processing pipeline: Unicode text pulled from a binary file, through Python, into an SQLite3 database, out through PHP, and onto a web page]

Part 1: Understanding Text Encoding
Python has regular strings by default and then it has Unicode strings. The latter are prefixed by the letter ‘u’. This is what ‘ö’ looks like encoded in each type.

>>> 'ö', u'ö'
('\xc3\xb6', u'\xf6')
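
One immediate giveaway (assuming the terminal and source encoding are UTF-8, as above) is that the regular string holds 2 bytes while the Unicode string holds a single character:

>>> len('ö'), len(u'ö')
(2, 1)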

A large part of my frustration with Unicode comes from Python yelling at me about UnicodeDecodeErrors and an inability to handle the number 0xc3 for some reason. This usually comes up when I’m trying to wrap my head around an unrelated problem and don’t care to get sidetracked by text encoding issues. However, when I studied the above output, I finally understood where the 0xc3 comes from. I just didn’t understand what the encoding represents exactly.
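
The classic complaint is easy to reproduce (a quick sketch, again assuming a UTF-8 terminal); mixing the two string types forces an implicit ASCII decode of the byte string, which chokes on that first byte:

>>> u'metadata: ' + 'ö'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)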

I can see from assorted tables that ‘ö’ is character 0xF6 in various encodings (in Unicode and Latin-1), so u'\xf6' makes sense. But what does '\xc3\xb6' mean? It’s my style to excavate straight down to the lowest levels, and I wanted to understand exactly how characters are represented in memory. The UTF-8 encoding tables inform us that any Unicode code point above 0x7F but less than 0x800 will be encoded with 2 bytes:

 110xxxxx 10xxxxxx

Applying this pattern to the \xc3\xb6 encoding:

            hex: 0xc3      0xb6
           bits: 11000011  10110110
 important bits: ---00011  --110110
      assembled: 00011110110
     code point: 0xf6

I was elated when I drew that out and made the connection. Maybe I’m the last programmer to figure this stuff out. But I’m still happy that I actually understand those Python errors pertaining to the number 0xc3 and that I won’t have to apply canned solutions without understanding the core problem.
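
The same arithmetic is easy to double-check from the Python prompt (a quick sketch; the masks simply discard the 110xxxxx/10xxxxxx marker bits):

>>> b1, b2 = ord('\xc3'), ord('\xb6')
>>> hex(((b1 & 0x1f) << 6) | (b2 & 0x3f))
'0xf6'
>>> '\xc3\xb6'.decode('utf-8')
u'\xf6'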

I’m cheating on this part of the exercise just a little bit, since the diagram implies that the Unicode text needs to come from a binary file. I’ll return to that in a bit. For now, I’ll just contrive the following Unicode string from the Python REPL:

>>> u = u'Üñìçôđé'
>>> u
u'\xdc\xf1\xec\xe7\xf4\u0111\xe9'

Part 2: From Python To SQLite3
The next step is to see what happens when I use Python’s SQLite3 module to dump the string into a new database. Will the Unicode encoding be preserved on disk? What will UTF-8 look like on disk anyway?

>>> import sqlite3
>>> conn = sqlite3.connect('unicode.db')
>>> conn.execute("CREATE TABLE t (t text)")
>>> conn.execute("INSERT INTO t VALUES (?)", (u, ))
>>> conn.commit()
>>> conn.close()

Next, I manually view the resulting database file (unicode.db) using a hex editor and look for strings. Here we go:

000007F0   02 29 C3 9C  C3 B1 C3 AC  C3 A7 C3 B4  C4 91 C3 A9

Look at that! It’s just like the \xc3\xb6 encoding we saw in the regular Python string earlier.
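
Reading the value back through the sqlite3 module (a quick sketch against the same unicode.db) hands back a Unicode string again, so the UTF-8 bytes only exist on disk:

>>> import sqlite3
>>> conn = sqlite3.connect('unicode.db')
>>> conn.execute("SELECT t FROM t").fetchone()[0]
u'\xdc\xf1\xec\xe7\xf4\u0111\xe9'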

Part 3: From SQLite3 To A Web Page Via PHP
Finally, use PHP (love it or hate it, but it’s what’s most convenient on my hosting provider) to query the string from the database and display it on a web page, completing the outlined processing pipeline.
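
The script itself only needs a few lines. Here is a rough sketch of the sort of thing I mean (assuming PHP’s SQLite3 class is available on the host; PDO would work just as well):

<?php
// Sketch: pull the lone text value back out of the database created above.
$db = new SQLite3('unicode.db');
$row = $db->query('SELECT t FROM t')->fetchArray();
?>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<?php echo $row['t']; ?>
</body>
</html>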

I tested the foregoing PHP script on 3 separate browsers that I had handy (Firefox, Internet Explorer, and Chrome):

[Screenshot: the test string ‘Üñìçôđé’ rendering correctly in all 3 browsers]

I’d say that counts as success! It’s important to note that the “meta http-equiv” tag is absolutely necessary. Omit it and you see something like this:

[Screenshot: the same string mangled into a series of Latin-1 characters (‘Ã’, ‘Ä’, and friends) when the meta tag is omitted]

Since we know what the UTF-8 stream looks like, it’s pretty obvious how the mapping is operating here: 0xc3 and 0xc4 correspond to ‘Ã’ and ‘Ä’, respectively. That mapping comes from an encoding named ISO/IEC 8859-1, a.k.a. Latin-1. Speaking of which…
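
The mangling is easy to reproduce from the Python prompt (a quick sketch using the ‘ö’ from part 1): encode to UTF-8, then misread the resulting bytes as Latin-1:

>>> u'\xf6'.encode('utf-8').decode('latin-1')
u'\xc3\xb6'

Latin-1 code points 0xC3 and 0xB6 are ‘Ã’ and ‘¶’, which is exactly the kind of hash in the screenshot above.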

Part 4: Converting Binary Data To Unicode
At the start of the experiment, I was trying to extract metadata strings from these binary multimedia files and I noticed characters like our friend ‘ö’ from above. In the bytestream, this was represented simply with 0xf6. I mistakenly believed that this was the on-disk representation of UTF-8. Wrong. Turns out it’s Latin-1.

However, I still need to solve the problem of transforming such strings into Unicode to be shoved through the pipeline diagrammed above. For this experiment, I created a 9-byte file with the Latin-1 string ‘Üñìçôdé’ couched by 0’s, to simulate yanking a string out of a binary file. Here’s unicode.file:

00000000   00 DC F1 EC  E7 F4 64 E9  00         ......d..

(Aside: this experiment uses a plain ‘d’ since ‘đ’, the ‘d’ with a bar through it, doesn’t occur in Latin-1; it shows up all over the place in Vietnamese, at least.)
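
For the record, the test file can be generated straight from the Python prompt (a quick sketch):

>>> open('unicode.file', 'wb').write('\x00\xdc\xf1\xec\xe7\xf4\x64\xe9\x00')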

I’ve been mashing around Python code via the REPL, trying to get this string into a Unicode-friendly format. I eventually landed on a successful method, though it’s probably not the best.
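
One roundabout way to get there (a sketch, certainly not the cleanest) builds the Unicode string one character at a time, leaning on the fact that Latin-1 bytes map directly onto the first 256 Unicode code points:

>>> data = open('unicode.file', 'rb').read()
>>> latin1_string = data.strip('\x00')
>>> latin1_string
'\xdc\xf1\xec\xe7\xf4d\xe9'
>>> u = u''.join([unichr(ord(c)) for c in latin1_string])
>>> u
u'\xdc\xf1\xec\xe7\xf4d\xe9'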

Conclusion
Dealing with text encoding matters reminds me of dealing with integer endian-ness concerns. When you’re just dealing with one system, you probably don’t need to think too much about it because the system is usually handling everything consistently underneath the covers.

However, when the data leaves one system and will be interpreted by another system, that’s when a programmer needs to be cognizant of matters such as integer endianness or text encoding.

13 thoughts on “Adventures in Unicode”

  1. apexo

    if you have binary strings à la ‘\xdc\xf1\xec\xe7\xf4d\xe9’ you can simply use str.decode(encoding) to get the unicode representation, so if you know the encoding is latin1:

    >>> print '\xdc\xf1\xec\xe7\xf4d\xe9'.decode("ISO-8859-1") #latin1 works as well
    Üñìçôdé

    likewise, with unicode.encode(encoding) you’ll get a binary string (in whatever encoding you chose, generally utf-8 is preferred over latin1 since it encodes a wider range of characters)

  2. Multimedia Mike Post author

    @apexo: Thanks! That’s what I’ve been trying to figure out this whole time (but I’m glad I had this adventure anyway). The decode/encode semantics are so fuzzy to me when it comes to these Python text encodings.

  3. Z.T.

    Please replace the word “Python” with “Python2” in your text. This is part of the reason adoption of Python3 is taking so long.

    Instead of screwing around with interpreted languages with abysmal support for Unicode, implement in straight C with no iconv a program that reads an input file and outputs the text found inside as valid, shortest encoding, UTF-8. The input file can be:
    1. ASCII
    2. ISO-8859-1
    3. Windows-1252 (spoiler: not the same thing)
    4. UTF16LE
    5. UTF16BE
    6. UCS4LE
    7. UCS4BE
    8. UTF-16 encoded with UTF-8 so that Unicode code points outside the BMP take 6 bytes instead of 4 (what Java produces by default when you ask for UTF-8)
    9. UTF-8 with some invalid bytes (what you get if you cut and concatenate UTF-8 strings in PHP)
    10. UTF-8 with some characters encoded not using the shortest encoding (which is invalid but accepted by many decoders and leads to security problems).

    When you understand how that works in C, you can easily learn what any language runtime does, encoding-wise.

    Warning: encoding is the easiest part of Unicode. Actual Unicode support is a much larger problem.

    http://98.245.80.27/tcpc/OSCON2011/gbu.html

  4. Multimedia Mike Post author

    @Z.T.: Thanks for the feedback. I haven’t really thought to draw a distinction between Python2 and Python3 and doing so makes them sound like 2 totally different languages. That’s frightening. BTW, everything I have done so far has been Python2.

    Your suggestion is another route that I have entertained. I was thinking of writing a C program that digs out the metadata strings in these binary files and prints out JSON data, which requires Unicode text strings. This is easy to interpret in Python and subsequently toss into a database.

    I have successfully used this strategy for extracting metadata from another type of file corpus. However, I should really revisit that corpus since I know I added some hacks in the Python processing scripts to circumvent problems caused by character 0xc3.

  5. SvdB

    I was also going to suggest ‘iconv’, but not the library, but the command line tool (which uses the library).
    The syntax is simple:
    iconv -f <from-encoding> -t <to-encoding>
    It reads its input from stdin, and writes to stdout.
    e.g.
    printf '\xdc\xf1\xec\xe7\xf4d\xe9\n' | iconv -f latin1 -t utf-8
    (Make sure that your terminal is set to UTF-8 for this example to work.)
    And to figure out the hex values, you could use xxd, od, hexdump or something similar:
    echo Üñìçôdé | iconv -f utf-8 -t latin1 | xxd

    Also, using meta-http-equiv is an undesirable hack. You’re specifying the type of the document inside of the document itself, which will only work as long as the parser can understand the character encoding of the document well enough to parse the meta tag.

    The “right” way is to put the content type in the HTTP headers.
    From PHP, you can do this with the following line:
    header('Content-type: text/html; charset=utf-8')
    You need to execute this line before starting to write the body.

    You can also use a UTF-8 byte order marker, by putting \xef\xbb\xbf as the first characters of your document, but imho this is another suboptimal solution.

    Btw, endianness is in fact an issue when dealing with 16- or 32-bit character encodings, e.g. UTF-16LE vs. UTF-16BE.

  6. Multimedia Mike Post author

    @Z.T.: “Warning: encoding is the easiest part of Unicode. Actual Unicode support is a much larger problem.” I wondered what you meant by that. Then I got to read through that slide deck you linked. Very illuminating. You’re right, encoding sounds like the easiest part. Still, I’m thankful that I’m minimally equipped to comprehend the broader issues now.

    Funny that the presentation rated Python 2.7 as pretty much the worst for Unicode support since that’s what I mostly used for this journey.

  7. Multimedia Mike Post author

    @SvdB: Thanks for the reminder. I remember reading about the fact that the server should send the UTF-8 signal in the HTTP headers as a first measure. http-equiv is a backup measure.

    Thanks for the ideas on using the command line tools. I always wondered what ‘iconv’ was for.

  8. Z.T.

    Python3 _is_ a different language from Python2:
    http://docs.python.org/3.2/whatsnew/3.0.html

    HTTP header, BOM, HTML meta tag, etc. precedence rules:
    http://www.w3.org/International/questions/qa-html-encoding-declarations#precedence

    BTW, the program I specified should detect the type of the input by itself. And I said _no iconv_!

    I’ve implemented PNG and JPEG decoders in C (including the DEFLATE part), after reading your blog post (http://multimedia.cx/eggs/learn-multimedia-with-jpeg/). You can man up and acknowledge that being a practicing programmer in 2012 who doesn’t know this stuff is the same as not knowing what IEEE 754 is, or TCP/IP. The XML specification said in 1998 that UTF-8 is the default encoding of XML unless you specify another, and UTF-8 was designed by Ken Thompson in 1992. There is no excuse, not even if you spent the last two decades doing COBOL on S/390.

    http://www.w3.org/TR/1998/REC-xml-19980210

    http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

  9. Reimar

    Honestly I think they should update the specs to require any browser to assume UTF-8 if nothing else is specified, completely solving that part of the problem.
    If they are really concerned, they can specify falling back to Latin-1 on coding errors; thanks to the redundancy in UTF-8, that is going to be nearly bulletproof.

  10. Multimedia Mike Post author

    @Z.T.: Glad to know someone took that little multimedia self-learning course.

    “…being a practicing programmer in 2012 who doesn’t know this stuff…” Yeah, yeah. Joel Spolsky already covered this much more thoroughly in his essay, “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)”

    http://www.joelonsoftware.com/articles/Unicode.html

  11. Reimar

    But one has to admit that you’re really disadvantaged over there.
    I mean, no strange letters like ö in your names to observe the wonders of UTF-8-treated-as-ANSI whenever you get mail from the US (and on the other hand being deeply surprised by a hotel employee in San Francisco who not only got the ö unmangled from the computer system but even wrote it down correctly on a hand-written notice as if it was nothing special).
    No new currency like the Euro (hey, with all that complaining about it going on I have to say something good about it!), which gives you the opportunity to observe how its € sign gets mangled into ? when you use it online.
    (As a joke people still sometimes write ?, as in “It was 10?, so I bought it”)
    No funny stuff like letters that have no uppercase – or at least they can’t really decide on one (ß).
    No especially funny stuff like that the uppercase of the normal ASCII i is, in fact, not I (guys: always test your software after setting the locale to tr_TR. You might profit from a profound feeling of confusion over your HTTP parser no longer working correctly because you used strcasecmp).
    So you really should adopt a more complicated language, then a lot of learning would come naturally without extra effort :-)

Comments are closed.