Codec Gold Mine; More On The gentree Perl Script

I’m feeling good about these new MS multimedia libraries with debug symbols. At first I was a little disappointed to see that ‘only’ WMA8, WMA9 and WMV9 were covered by these libraries.

Digging a little deeper, it appears that the WMA9 module covers the WM Series 9 audio codecs. This includes WMA9 multi-channel, WM Voice, and WM Lossless. Witness these functions:

  • _prvWMADECCreateForWMAVoice
  • _prvWMADECDecodeDataForWMAVoice
  • _prvWMADECDeleteForWMAVoice
  • _prvWMADECGetPCMForWMAVoice
  • _prvWMADECInitForWMAVoice
  • _prvWMADECInitFrameForWMAVoice
  • _prvWMADECResetForWMAVoice
  • _prvWMADECResetSMSwitchForWMAVoice
  • _WMAVoiceRealForwFFT_INT
  • _WMAVoiceFFT4DCT
  • _prvDecodeSubFrameChannelRawPCMPureLosslessMode
  • _prvDecodeSubFrameChannelResiduePureLosslessModeVerB
  • _prvDecodeSubFrameHeaderPureLosslessMode
  • _prvDecodeSubFramePureLosslessMode

The WMV9 file has a bunch of functions prefixed with WMV2 (a.k.a. WMV8) in addition to WMV3 (a.k.a. WMV9). Not all of WMV8 has been RE’d and this may help fill in some gaps.

I think the next goal should be to put together a good tree detailing the various decoding algorithms. I would like the gentree Perl script to do it for me. But when I try to make it recurse down and show me the whole underlying tree, Perl complains, something about “recursion too deep” or some such. Frankly, that tells me that there is something wrong with my script. Unless a function calls itself, or two or more functions call each other in a loop, I cannot imagine that the calling hierarchy could possibly sink that deep.
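For what it is worth, Perl's “Deep recursion on subroutine” message is a warning that fires once a subroutine is nested 100 calls deep, so a single call cycle in the disassembly would be enough to trigger it. Here is a minimal sketch of a tree walk that survives such cycles. It is written in Python rather than Perl, it is not the gentree script itself, and the function names and call relationships in the sample graph are made up for illustration.

```python
def build_tree(calls, root, depth=0, path=(), out=None):
    """Walk a call graph depth-first and return indented tree lines.

    `calls` maps a function name to the list of functions it calls.
    `path` holds the functions currently being expanded; if `root` is
    already on it, we have found a cycle and stop instead of recursing.
    """
    if out is None:
        out = []
    indent = "  " * depth
    if root in path:
        out.append(indent + root + " (recursive, not expanded)")
        return out
    out.append(indent + root)
    for callee in calls.get(root, []):
        build_tree(calls, callee, depth + 1, path + (root,), out)
    return out


# Hypothetical call graph: one function calls itself, and two
# functions call each other.
calls = {
    "_decode_frame": ["_decode_subframe", "_apply_window"],
    "_decode_subframe": ["_read_bits", "_decode_frame"],
    "_read_bits": ["_read_bits"],
}

tree_lines = build_tree(calls, "_decode_frame")
print("\n".join(tree_lines))
```

Tracking only the current path (rather than every function ever visited) means a helper called from several places is still printed under each caller, while anything that loops back on itself is flagged once and left alone.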

Of course, there are other imperfections in the script, among them:

  • register calls and other dynamic calls (e.g., “call eax” or “call [ebp-1ch]”): to be fair, this is not really a problem with the script; the actual call destinations are unknown until run time, when the register or memory operand holds a concrete address. Actually, lines like “call [ebp-1ch]” make Perl’s regex engine very unhappy, presumably because unescaped square brackets are regex metacharacters. That is why such lines get classified as “dynamic” instead of being pushed through the regex engine.
  • numerical address calls (e.g., “call 00000022”): these apparently come from relative rather than absolute call instructions. Perhaps the script needs to analyze the instruction’s opcode and decide if it is a relative call. If it is, the script should probably disregard the call since it really only has the option of jumping somewhere else in the same function. This is relocatable code so the function cannot safely jump to neighboring functions.
  • too many “top-level” functions: not sure if my script can take care of this one. There are way more functions at the top level than there ought to be. This is related to the aforementioned register and dynamic calls. For example, these 4 functions are listed as top-level functions:
    • _ApplySmoothing
    • _ApplySmoothing_Improved
    • _ApplySmoothing_KNI
    • _ApplySmoothing_MMX

    These are obviously processor-specific optimizations for a certain, well, optimizable operation. In the decoder, there will be some data field called “ApplySmoothingFunc” which will be initialized to the address of one of the above functions depending on which processor features are present. The decoder then calls through ApplySmoothingFunc, so the functions that do the actual work are never referenced by name in any call instruction in the static disassembly. In this context, there is not much to be done about it. It sure does distract from what the actual top-level functions are.
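To make the three imperfections above concrete, here is a rough Python sketch (again, not the gentree script) that sorts call operands into symbol, numeric and dynamic, then lists every function that is never the target of a symbol call. The sample listing is invented: _DecodeFrame and _ReadHeader are hypothetical names, and the calls shown are assumptions rather than anything taken from the real disassembly.

```python
import re

# A symbol starts with a letter or underscore; a numeric target starts
# with a digit so that names made of hex letters are not misread.
SYMBOL = re.compile(r"^[A-Za-z_][A-Za-z0-9_@?$]*$")
NUMERIC = re.compile(r"^[0-9][0-9A-Fa-f]*h?$")
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}


def classify_call(operand):
    """Classify the operand of a call instruction."""
    operand = operand.strip()
    # Plain string tests come first, so bracketed operands never
    # reach the regex engine at all.
    if "[" in operand or operand.lower() in REGISTERS:
        return "dynamic"
    if NUMERIC.match(operand):
        return "numeric"
    if SYMBOL.match(operand):
        return "symbol"
    return "unknown"


def top_level(calls_by_function):
    """Return (functions never called by name, count of dynamic call sites)."""
    called = set()
    dynamic_sites = 0
    for operands in calls_by_function.values():
        for operand in operands:
            kind = classify_call(operand)
            if kind == "symbol":
                called.add(operand.strip())
            elif kind == "dynamic":
                dynamic_sites += 1
    roots = sorted(name for name in calls_by_function if name not in called)
    return roots, dynamic_sites


# Hypothetical listing: function name -> call operands found in its body.
listing = {
    "_DecodeFrame": ["_ReadHeader", "[ebp-1ch]", "00000022"],
    "_ReadHeader": [],
    "_ApplySmoothing": [],
    "_ApplySmoothing_MMX": [],
}

roots, dynamic_sites = top_level(listing)
print(roots)          # functions with no named caller
print(dynamic_sites)  # call sites whose target is only known at run time
```

In this toy listing the two _ApplySmoothing variants show up as roots next to _DecodeFrame, for the same reason as in the real output: their only caller reaches them through a pointer. Reporting the number of dynamic call sites next to the root list at least hints at how many of those "top-level" functions are probably not top-level at all.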