So I have managed to automatically de-obfuscate an obfuscated Java project. Remember, there are two major challenges in reverse engineering: 1) understanding the original code flow, and 2) understanding what the original identifier names could have been. My experiment was focused on problem #2. Problem #1 is generally a non-issue in decompiled Java code since Java classes retain so much information about the original code flow.
Are there better approaches for obfuscating Java code?
One of my RE colleagues has conjectured that the next level of Java obfuscators will focus on making the bytecode harder to decompile, for example by generating bytecode that simply cannot be expressed in the Java language. I am not entirely clear on the technical validity of this or, more importantly, its reliability (i.e., would the code still work correctly?). However, I can imagine making code a little more scrambled, perhaps by finding switch-case constructs and converting them into large if-then-else constructs. Those are less straightforward to RE than a clean switch-case construct. This could be easier to do in a syntactic pre-processing phase before sending the source code over to the Java compiler.
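To make the switch-case idea concrete, here is a minimal sketch of what such a pre-processing pass might emit. Everything in it is hypothetical: the class name, the method names, and the values are made up for illustration and are not taken from any real obfuscator or from On2's code. The point is only that the two methods behave identically while the second one takes longer to read.

```java
// Hypothetical illustration; names and values are invented for this example.
class SwitchFlattening {
    // The original, easy-to-read form.
    static int blockSizeClean(int mode) {
        switch (mode) {
            case 0: return 4;
            case 1: return 8;
            case 2: return 16;
            default: return -1;
        }
    }

    // The same mapping rewritten as a nested if-else chain with
    // negated and reordered tests, as a pre-processor might produce.
    static int blockSizeScrambled(int mode) {
        if (mode != 1) {
            if (mode == 2) {
                return 16;
            } else if (mode != 0) {
                return -1;
            }
            return 4;
        }
        return 8;
    }

    public static void main(String[] args) {
        // Print both results side by side for a few inputs.
        for (int mode = -1; mode <= 3; mode++) {
            System.out.println(mode + ": " + blockSizeClean(mode)
                    + " " + blockSizeScrambled(mode));
        }
    }
}
```

With only three cases the scrambled version is still easy to untangle; the effect would presumably only become annoying on switches with dozens of cases.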
Also, during the de-obfuscator experiment, I learned that Java does, in fact, have a limited goto-like construct: continue (and break) used in conjunction with labels. Perhaps this feature could be exploited and subverted in the pre-processing phase to make the code flow harder to follow, as long as the substituted code could still be proven correct.
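For anyone who has not run into labeled jumps before, here is a small sketch of how they could be used to muddy an otherwise ordinary nested loop. The class and method names are hypothetical and the task (counting rows that contain no zero) is chosen only because it is easy to check by hand.

```java
// Hypothetical illustration of labeled continue/break used as a goto.
class LabeledJumps {
    // Straightforward version: count the rows that contain no zero.
    static int countClean(int[][] rows) {
        int count = 0;
        for (int[] row : rows) {
            boolean hasZero = false;
            for (int v : row) {
                if (v == 0) {
                    hasZero = true;
                    break;
                }
            }
            if (!hasZero) {
                count++;
            }
        }
        return count;
    }

    // Same result, but the loop structure is hidden behind an
    // endless loop and jumps to a label.
    static int countWithLabels(int[][] rows) {
        int count = 0;
        int i = -1;
        outer:
        while (true) {
            i++;
            if (i >= rows.length) {
                break outer;        // leave the endless loop
            }
            for (int j = 0; j < rows[i].length; j++) {
                if (rows[i][j] == 0) {
                    continue outer; // skip the increment below
                }
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        int[][] rows = { {1, 2}, {0, 3}, {4, 5}, {} };
        System.out.println(countClean(rows) + " " + countWithLabels(rows));
    }
}
```

A decompiler may well recover the labels as written, so this on its own probably does not buy much; it just shows that the language allows this kind of jump.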
Oh well, it’s really not my problem to work out the fine details of better obfuscation techniques.
However, I see that On2 has been hard at work on the problem. When I decompiled an earlier version of their Java-based VP6 decoder, the source code contained over 35 unique data tables. The purposes of several were very obvious (bitmasks, frame maps, dequantizers, prediction coordinates). However, in their latest VP6 Java applet version, On2 has apparently managed to combine all the data tables into some kind of unholy mega-datatable. The de-obfuscated table is called jumpingSpider[] and contains line after line of disjointed, apparently unrelated numbers:
private static final int jumpingSpider[] = { 0x41030, 0x81438b26, 0x8e180011, 0x1112345b, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 9, 0xf0000000, 0, 2330, 0x50000000, 0, 0x8c94a, 0xd0000000, 0, 0x878c9bb, 0, 8, ...
The code uses a function ironically renamed by the de-obfuscator as streamline() to unpack various pieces of this table into working data tables. For example, to initialize a variable determined to be the dezigzag matrix for IDCT operations:
streamline(dezigzagMatrix, 0, -7, 4);
Somehow, the function unpacks information from the mega-datatable into a number of smaller arrays.
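I have not worked out what streamline() actually does or what its arguments mean, so the following is not On2's scheme. It is only the simplest sketch I can think of for the general idea: several small tables concatenated into one array and copied back out by offset at startup. The names PACKED and unpack(), the offsets, and the table contents are all hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration of a merged data table; not On2's actual format.
class MegaTable {
    // Two unrelated small tables stored back to back with padding:
    //   indexes 0..2  : three bitmasks
    //   indexes 3..4  : padding zeros
    //   indexes 5..10 : the first six entries of a scan order
    static final int[] PACKED = { 0x7, 0xF, 0x1F, 0, 0, 0, 1, 8, 16, 9, 2 };

    // Fill dest with dest.length values starting at the given offset.
    static void unpack(int[] dest, int offset) {
        System.arraycopy(PACKED, offset, dest, 0, dest.length);
    }

    public static void main(String[] args) {
        int[] masks = new int[3];
        int[] scanOrder = new int[6];
        unpack(masks, 0);
        unpack(scanOrder, 5);
        System.out.println(Arrays.toString(masks));
        System.out.println(Arrays.toString(scanOrder));
    }
}
```

Even in this toy version, someone reading PACKED cold cannot tell where one table ends and the next begins, or whether a given zero is padding or data, without first tracing every call to unpack().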
So there you have another possible method for thwarting the intrepid reverse engineer. Provided, of course, that they do not have access to an earlier version of the software that did not use this technique.