Monthly Archives: November 2012

Adventures in Unicode

Tangential to multimedia hacking is proper metadata handling. Recently, I have developed an interest in processing a large corpus of multimedia files that are likely to contain metadata strings outside the lower ASCII set. This is significant because the lower ASCII set intersects perfectly with my own programming comfort zone. Indeed, all of my programming life, I have insisted on covering my ears and loudly asserting “LA LA LA LA LA! ALL TEXT EVERYWHERE IS ASCII!” I suspect I’m not alone in this.

Thus, I took this as an opportunity to conquer my longstanding fear of Unicode. I developed a self-learning course consisting of a series of exercises which add up to this diagram:



Part 1: Understanding Text Encoding
Python 2 has regular (byte) strings by default and then it has Unicode strings. The latter are prefixed by the letter ‘u’. This is what ‘ö’ looks like in each type when typed into an interpreter on a UTF-8 terminal.

>>> 'ö', u'ö'
('\xc3\xb6', u'\xf6')

A large part of my frustration with Unicode comes from Python yelling at me about UnicodeDecodeErrors and an inability to handle the number 0xc3 for some reason. This usually comes when I’m trying to wrap my head around an unrelated problem and don’t care to get sidetracked by text encoding issues. However, when I studied the above output, I finally understood where the 0xc3 comes from. I just didn’t understand what the encoding represents exactly.
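
For anyone who wants to reproduce the 0xc3 complaint in isolation, here is a minimal sketch. It is written in Python 3 syntax, which is an assumption on my part since the session above is Python 2, and the variable names are purely illustrative. It shows the single code point for ‘ö’, the two bytes UTF-8 uses to store it, and the error raised when those bytes are decoded as ASCII.

```python
# Python 3 syntax (assumption: the interpreter session above is Python 2).
text = "\u00f6"  # the character 'ö' as a single Unicode code point, U+00F6

# Encoding to UTF-8 turns that one code point into two bytes.
utf8_bytes = text.encode("utf-8")
print(hex(ord(text)))  # the code point value
print(utf8_bytes)      # the UTF-8 byte sequence

# Decoding those bytes as ASCII fails on the first byte, 0xc3,
# because ASCII only defines byte values 0x00 through 0x7f.
try:
    utf8_bytes.decode("ascii")
except UnicodeDecodeError as err:
    error_message = str(err)
    print(error_message)

# Decoding with the matching codec recovers the original string.
roundtrip = utf8_bytes.decode("utf-8")
print(roundtrip == text)
```

The two printed values line up with the interpreter output above: 0xf6 is the code point held by the Unicode string, while 0xc3 0xb6 is how that code point is stored as UTF-8 bytes in the regular string.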