Breaking Eggs And Making Omelettes

Topics On Multimedia Technology and Reverse Engineering


XML Monkey

July 25th, 2009 by Multimedia Mike

I’m trying to come to terms with the reality that is XML. I may not like the format but that won’t change the fact that I have to interoperate with various XML data formats already in the wild. In other words, treat it like any random multimedia format. For example, suppose I want to write software to interpret the various comics that I’ve created with Taco Bell’s series of Comics Constructors CD-ROMs.

Amazon Raiders: XML Monkey, top panel

Amazon Raiders: XML Monkey, bottom panels

The comics are saved as XML files that look something like this:

  <comic>
    <page0 name="pgt1">
      <sq1 mirror="0" rotation="0" scale="350" y="283" x="388" bg="bg07">
        <object sq="1" libType="characters" depth="1" mirror="0" rotation="0" scale="100" y="368" x="196" name="ch01" />
        <object sq="1" libType="characters" depth="2" mirror="1" rotation="0" scale="100" y="370" x="338" name="ch10" />
        <object sq="1" libType="characters" depth="3" mirror="0" rotation="0" scale="100" y="376" x="342" name="0" />
        <object sq="1" libType="objects" depth="4" mirror="0" rotation="0" scale="100" y="367" x="469" name="ob02" />
        <object txtColor="" cont="We might as well face it-- XML isn&apos;t going away" sq="1" libType="bubbles" depth="5" mirror="1" rotation="0" scale="100" y="265" x="216" name="bu01" />
        <object sq="1" libType="characters" depth="6" mirror="0" rotation="0" scale="80" y="321" x="168" name="ch19" />
      </sq1>
      ...
    </page0>
    ...
  </comic>

How to even begin with this? Sometimes a good book can help. Yesterday, I found an old book from 1999 called “Just XML” by John E. Simpson. It weighs in at nearly 400 pages. I thought XML was supposed to be relatively straightforward to understand.

The book is supposed to be geared toward web programmers. I’m not a web programmer, but I do wish to know how to programmatically access this data. I have seen that Python has interfaces to libraries that parse XML. So I shoved xml-monkey.xml through the example code shown at the end of Python’s xml.parsers.expat documentation. This yields:

Start element: COMIC {}
Start element: PAGE0 {u'name': u'pgt1'}
Start element: SQ1 {u'scale': u'350', u'bg': u'bg07',
  u'mirror': u'0', u'y': u'283', u'x': u'388', u'rotation': u'0'}
Start element: OBJECT {u'scale': u'100', u'name': u'ch01', 
  u'sq': u'1', u'depth': u'1', u'mirror': u'0', u'y': u'368', u'x':
  u'196', u'rotation': u'0', u'libType': u'characters'}
End element: OBJECT
Start element: OBJECT {u'scale': u'100', u'name': u'ch10', 
  u'sq': u'1', u'depth': u'2', u'mirror': u'1', u'y': u'370', u'x': 
  u'338', u'rotation': u'0', u'libType': u'characters'}
End element: OBJECT
Start element: OBJECT {u'scale': u'100', u'name': u'0', u'sq':
  u'1', u'depth': u'3', u'mirror': u'0', u'y': u'376', u'x': u'342',
  u'rotation': u'0', u'libType': u'characters'}
End element: OBJECT
Start element: OBJECT {u'scale': u'100', u'name': u'ob02', 
  u'sq': u'1', u'depth': u'4', u'mirror': u'0', u'y': u'367', u'x': 
  u'469', u'rotation': u'0', u'libType': u'objects'}
End element: OBJECT
Start element: OBJECT {u'scale': u'100', 
  u'cont': u"We might as well face it-- XML isn't going away", 
  u'name': u'bu01', u'sq': u'1', u'txtColor': u'', u'depth': u'5', u'mirror': 
  u'1', u'y': u'265', u'x': u'216', u'libType': u'bubbles', u'rotation': u'0'}

So that’s something. I thought XML documents were required to start with a little more boilerplate such as <?xml version="1.0" encoding="UTF-8"?>, but the parser evidently accepts the file without it. I see that there are a few levels to XML validity. The first is “well-formed”, in which the document adheres to the basic XML syntactic rules. Then there’s actually being “valid”, which requires a document type definition (DTD) to validate against. That DTD, I do not have.
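
To make the “well-formed” level concrete, here is a minimal sketch that asks expat whether a string parses at all. It assumes Python 3; the helper name `is_well_formed` and the three tiny test strings are made up for illustration and are not taken from the real comic file. It only checks well-formedness, not validity against a DTD.

```python
import xml.parsers.expat

def is_well_formed(text):
    """Return True if expat parses `text` without raising a syntax error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True

# Hypothetical snippets, loosely modeled on the comic format.
no_declaration = '<comic><page0 name="pgt1"></page0></comic>'
with_declaration = '<?xml version="1.0" encoding="UTF-8"?>' + no_declaration
mismatched = '<comic><page0 name="pgt1"></comic>'  # page0 is never closed

print(is_well_formed(no_declaration))    # True
print(is_well_formed(with_declaration))  # True
print(is_well_formed(mismatched))        # False
```

The first two results show that the XML declaration is optional as far as the parser is concerned; the third shows what failing the well-formedness test looks like.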

But this is still a good start. I can see how I might start processing the data using Python. This is good since I am encountering more and more XML files that I’m interested in manipulating.
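
As a rough idea of what that processing might look like, the sketch below uses the same expat start-element callback to gather each object into a plain dictionary. It assumes Python 3 and a trimmed-down sample in the shape of the file shown earlier; the function name `collect_objects` and the choice of which attributes to keep are my own assumptions, not anything dictated by the comic format.

```python
import xml.parsers.expat

# Trimmed-down sample modeled on the comic file shown earlier.
SAMPLE = """<comic>
  <page0 name="pgt1">
    <sq1 mirror="0" rotation="0" scale="350" y="283" x="388" bg="bg07">
      <object sq="1" libType="characters" depth="1" mirror="0" rotation="0" scale="100" y="368" x="196" name="ch01" />
      <object txtColor="" cont="We might as well face it-- XML isn&apos;t going away" sq="1" libType="bubbles" depth="5" mirror="1" rotation="0" scale="100" y="265" x="216" name="bu01" />
    </sq1>
  </page0>
</comic>"""

def collect_objects(text):
    """Parse `text` and return one dict per <object> element."""
    objects = []

    def start_element(name, attrs):
        # The listing above prints element names in upper case,
        # so compare case-insensitively to be safe.
        if name.lower() == "object":
            objects.append({
                "name": attrs["name"],
                "type": attrs["libType"],
                "x": int(attrs["x"]),
                "y": int(attrs["y"]),
                "depth": int(attrs["depth"]),
                "text": attrs.get("cont"),  # only speech bubbles carry text
            })

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(text, True)
    return objects

for obj in collect_objects(SAMPLE):
    print(obj)
```

Converting the coordinates and depth to integers up front means later code can sort or position the objects without re-parsing strings.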

Posted in Programming, Python

9 Responses

  1. DrV Says:

    I have in my possession a book entitled “XML Bible (Gold Edition)” from 2001 (it was given to me; I am not exactly an XML fan either) that weighs in at a hefty 1565 pages, if you count the index. It makes a great monitor stand – the binding is nearly three inches thick.

  2. Peter Says:

    I don’t care much for their religion either…

  3. Multimedia Mike Says:

    Then again, it’s probably not fair to judge the complexity of a given computer topic by the thickness of the tomes published on the topic. Some publishers can publish 1600 pages about anything, usually by reprinting the publicly-available API documentation for a language (see also: any Java book).

  4. SvdB Says:

    Brevity isn’t everything. And you’ll warm up to XML when you get to XPath.
    And you don’t need a 400-page book. I’d start with a simple online tutorial.

    P.S. Could you add a “preview” button to the comment form?

  5. Tomer Gabel Says:

    Sorry to say this, but anyone who writes up over 1000 pages on XML is an idiot (and/or full of horseshit).

    XML, in its raw, simple form, is nothing more than an information interchange format. Any junior programmer can learn to use XML (conceptually as well as programmatically) in a couple of hours, and you can take your time learning XML schema, XSLT, XQuery or even none of the above. The only thing you “really” need to learn is XPath, and it makes so much intuitive sense that you’re unlikely to ever need to open up the reference.

    The whole “XML is the bomb” vs “XML is crap” debate is getting really, really old, and makes about as much sense as the VHS vs Beta argument of old.

  6. Mans Says:

    Calling xpath intuitive is the single most stupid thing I’ve heard all week. Sure, it allows some complex things to be expressed, but INTUITIVE? No way! The same goes for xslt, which relies heavily on xpath. Both xpath and xslt are prime examples of the wrong tool being used for the job.

    I do agree about someone writing 1000 pages on xml being an idiot; it does not deserve that much attention.

  7. Multimedia Mike Says:

    Thanks for keeping the flame [war] burning bright, Mans. :-)

  8. follower Says:

    You might want to take a look at ElementTree (in Python standard library since 2.5) for playing with XML & Python.

    I was reminded of it the other day when someone posted about how “intuitive” it made accessing structures.


  9. Tomer Gabel Says:

    @Mans: What exactly isn’t intuitive about XPath? “/root/somePath/@someAttribute” takes about three seconds to learn. Want predicates? Right: “/root/somePath[@someAttribute="someValue"]”.

    If you don’t use complex functions (or user-defined functions), axes and XML namespaces you can pick it up in minutes. Using any of the above is almost always unnecessary anyway.
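
For reference, the ElementTree module suggested in the comments supports a limited subset of XPath, including attribute predicates like the one above. The following is a minimal sketch under a few assumptions: Python 3, a trimmed-down sample in the shape of the comic file from the post, and a path written by hand for that sample. As far as I know, that subset does not include absolute paths on an element or a trailing `/@attribute` step, so the path is relative to the root and the attribute is read with `get()`.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample modeled on the comic file from the post.
SAMPLE = """<comic>
  <page0 name="pgt1">
    <sq1 mirror="0" rotation="0" scale="350" y="283" x="388" bg="bg07">
      <object sq="1" libType="characters" depth="1" mirror="0" rotation="0" scale="100" y="368" x="196" name="ch01" />
      <object txtColor="" cont="We might as well face it-- XML isn&apos;t going away" sq="1" libType="bubbles" depth="5" mirror="1" rotation="0" scale="100" y="265" x="216" name="bu01" />
    </sq1>
  </page0>
</comic>"""

root = ET.fromstring(SAMPLE)  # root is the <comic> element

# Select every speech bubble in the first panel with an attribute predicate.
bubbles = root.findall("./page0/sq1/object[@libType='bubbles']")

# Read the bubble text from the "cont" attribute of each match.
texts = [bubble.get("cont") for bubble in bubbles]
print(texts)
```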