Automated Java De-obfuscation

If they value their intellectual property, responsible Java-developing software companies use some kind of code obfuscator as part of their build process. This way, there is very little chance that a Java class file will be released with its original identifiers intact. One such Java obfuscator is called Retroguard. But can the tool be used for its opposite purpose?

As an aside, I need to point out some irony: I once took a Java course where the instructor claimed that Java is desirable to proprietary software outfits because it can be compiled, and IP implicitly guarded, whereas programs written in interpreted languages, notably Perl, need their source code redistributed. In fact, Java code is trivial to decompile, whereas languages like Perl can be compiled for redistribution.

I have a lot of respect for Retroguard. It does its job well. I also have ideas about how to use it to subvert its own handiwork. Let’s look at some of the strategies Retroguard employs:

In the samples I have viewed, Retroguard makes liberal use of slightly modified reserved words. Many methods end up with names like _mthelse(), _mthif(), _mthcase(), etc. I think this is supposed to be a psychological trick: your eye catches the keyword and your brain goes off thinking about something else.

Another way Retroguard renames identifiers is to use very terse character sequences, e.g. ‘aQ’, ‘b’, ‘tS’. These renaming tricks do not faze me much, to be honest. I am so used to disassembled code where I have to give variables impromptu names like localvar1, globalvar2, etc, that my brain just ignores it. I am just happy to have the code structure intact. However, one thing that bugs me is when Retroguard uses single-character identifiers (‘a’, ‘b’, ‘t’, whatever). Identifiers like ‘aQ’ are nice because that sequence is unlikely to occur randomly within the decompiled source. Thus, when it becomes clear that aQ is a stand-in for the member variable ‘iHeight’, search and replace is easy. Not so simple with single-character identifiers.
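
To make the search-and-replace point concrete, here is a minimal sketch of a whole-word rename over decompiled source text. The class name RenameDemo and the sample line are made up for illustration, and the regex approach is just an assumption about how one might do this by hand; it knows nothing about Java scoping.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a purely textual, whole-word identifier rename.
class RenameDemo {
    static String rename(String source, String oldName, String newName) {
        // \b only matches at word boundaries, so "a" will not match inside "aQ".
        Pattern p = Pattern.compile("\\b" + Pattern.quote(oldName) + "\\b");
        return p.matcher(source).replaceAll(Matcher.quoteReplacement(newName));
    }

    public static void main(String[] args) {
        String line = "int aQ = a.b(aQ) + a(t);";

        // A distinctive name like aQ is replaced cleanly.
        System.out.println(rename(line, "aQ", "iHeight"));
        // prints: int iHeight = a.b(iHeight) + a(t);

        // A single-character name hits every unrelated 'a' on the line,
        // here both an object reference and a method call.
        System.out.println(rename(line, "a", "x"));
        // prints: int aQ = x.b(aQ) + x(t);
    }
}
```

The second call is the problem case: the text-level replace cannot tell the object a from the method a().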

Another trick that Retroguard uses (and this one really torques me): It will find various methods in a class that have unique declarations, that is, different parameter lists. Then it will rename all of those method identifiers to be the same. Thanks to the miracle of overloading, this is perfectly valid. In one source file, Retroguard managed to assign 27 unique (and probably unrelated) functions the identifier a(). This can cause a lot of trouble when the code in question is:
a(8);
Does that correspond to this?
void a(int aQ);
Or perhaps this?
int a(byte t);
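
The compiler itself is never confused by a call like this, since it resolves the overload from the static type of the argument; the trouble is all on the human side. A small sketch, assuming a made-up class OverloadDemo whose methods return a label (unlike the declarations above) so the chosen overload is visible:

```java
// Hypothetical class imitating obfuscated output: three unrelated
// methods that all share the identifier a().
class OverloadDemo {
    static String a(int aQ) { return "a(int)"; }
    static String a(byte t) { return "a(byte)"; }
    static String a(long b) { return "a(long)"; }

    public static void main(String[] args) {
        byte small = 8;
        System.out.println(a(8));        // prints: a(int), the literal 8 is an int
        System.out.println(a(small));    // prints: a(byte)
        System.out.println(a((byte) 8)); // prints: a(byte), the cast picks the overload
        System.out.println(a(8L));       // prints: a(long)
    }
}
```

So a bare a(8) goes to the int version, but when the argument is a variable you have to go track down its declared type before you know which of the 27 a() methods is being called.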

Retroguard is published under GNU LGPL, so I am free to modify the source code and redistribute it as I see fit. This is fortunate and I just might take advantage of it. It stands to reason that Retroguard does not care what the identifiers inside a given class file look like; it just cares about a set of identifiers and mangling their appearance. So, the idea is to modify Retroguard so that it will make sure that all identifiers are mangled back into unique character sequences. I think the program is aware of basic types so it would be nice if it could put some notation in front of variables to identify their type and scope (e.g., iQdf is some integer value belonging to the current scope, m_fUrb is a float class member). Apply the same renaming to the overloaded functions and that would greatly ease the pain of manually solving a decompiled Java cryptogram.
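
Here is a rough sketch of the kind of naming scheme I have in mind. None of this is Retroguard code or its API; the class UniqueNamer, its method names, and the prefix choices are all hypothetical. The only real-world detail it leans on is the JVM's one-character type descriptors (I for int, F for float, B for byte, and so on), plus method descriptors like (I)V to tell overloads apart.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: hand out one unique, type-tagged name per
// (owner, old name, descriptor) triple. Not part of Retroguard.
class UniqueNamer {
    private final Map<String, String> assigned = new HashMap<>();
    private int counter = 0;

    // Map the first character of a JVM type descriptor to a short prefix.
    // The prefixes themselves are an arbitrary choice for this sketch.
    static String typePrefix(String descriptor) {
        switch (descriptor.charAt(0)) {
            case 'I': return "i";   // int
            case 'F': return "f";   // float
            case 'D': return "d";   // double
            case 'J': return "l";   // long
            case 'S': return "s";   // short
            case 'B': return "by";  // byte
            case 'C': return "c";   // char
            case 'Z': return "b";   // boolean
            case '[': return "a";   // array
            default:  return "o";   // object reference (L...;)
        }
    }

    // Fields and variables: scope marker + type prefix + unique number.
    String renameField(String owner, String oldName, String descriptor, boolean isMember) {
        String key = owner + "." + oldName + ":" + descriptor;
        return assigned.computeIfAbsent(key, k ->
            (isMember ? "m_" : "") + typePrefix(descriptor)
                + String.format("Var%04d", counter++));
    }

    // Methods: the descriptor is part of the key, so every overload of
    // a() gets its own name.
    String renameMethod(String owner, String oldName, String descriptor) {
        String key = owner + "." + oldName + descriptor;
        return assigned.computeIfAbsent(key, k -> String.format("mth%04d", counter++));
    }

    public static void main(String[] args) {
        UniqueNamer namer = new UniqueNamer();
        System.out.println(namer.renameField("Foo", "a", "I", true));  // prints: m_iVar0000
        System.out.println(namer.renameField("Foo", "b", "F", true));  // prints: m_fVar0001
        System.out.println(namer.renameMethod("Foo", "a", "(I)V"));    // prints: mth0002
        System.out.println(namer.renameMethod("Foo", "a", "(B)I"));    // prints: mth0003
        System.out.println(namer.renameMethod("Foo", "a", "(I)V"));    // prints: mth0002
    }
}
```

The numbered suffixes are ugly, but they are unique across the whole program, which is exactly what makes later search and replace painless.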

Breaking Eggs And Making Omelettes

Topics On Multimedia Technology and Reverse Engineering
