“Unnamed RE Project” is the impromptu name I gave to a program that I hastily wanted to start but couldn’t be bothered to come up with even a quasi-clever name. Moreover, I actually got it to do something. I can’t believe I actually made a go of this, perhaps one of the most useless reverse engineering exercises.
Aside: Does this still qualify for my “outlandish brainstorms” blog category if I actually made it work?
The basic idea is one that a lot of reverse engineers surely kick around at some point: A set of CPU registers can be abstracted as a set of global C program variables and individual assembly language instructions map quite neatly onto C program statements. Thus, what about an automatic conversion utility that can take an ASM disassembly and convert it into a C program that can be portably compiled? Not optimal, but it might be a start for other RE projects.
Traditionally, I objected to this approach on the basis of its inherent impurity– one of my objectives in this RE journey is to understand the algorithms being recovered. Technically, while it sounded like a simple enough concept, when one actually sits down to think about, all kinds of problems crop up. One of the most immediate is how case statements (jumps using dynamic tables) would be handled.
Putting aside all uncertainty, I decided to go for it and see what could happen. Believe it or not, I met with some success while also discovering a number of problems I hadn’t yet realized (for example, the dream of portability goes right out the window). I hope to write up some more about this shortly. But for tonight, I will just show the results of the first experiment.
This is the static disassembly of one of my favorite little RE puzzles, a simple bitstream reader:
DEF0 55 push ebp DEF1 8BEC mov ebp, esp DEF3 53 push ebx DEF4 56 push esi DEF5 57 push edi DEF6 8B7D08 mov edi, dword[ebp+08] DEF9 8B750C mov esi, dword[ebp+0C] DEFC 8B4F04 mov ecx, dword[edi+04] DEFF 8B5F10 mov ebx, dword[edi+10] DF02 8BD1 mov edx, ecx DF04 83E107 and ecx, 00000007 DF07 C1EA03 shr edx, 03 DF0A 8B0413 mov eax, dword[ebx+edx] DF0D 0FC8 bswap eax DF0F D3E0 shl eax, cl DF11 B920000000 mov ecx, 00000020 DF16 2BCE sub ecx, esi DF18 D3E8 shr eax, cl DF1A 017704 add dword[edi+04], esi DF1D 5F pop edi DF1E 5E pop esi DF1F 5B pop ebx DF20 5D pop ebp DF21 C3 ret
Note that is doesn’t involve any branching logic (the ‘ret’ notwithstanding), which limited the scope of the experiment for the time being. This is the automatically translated C output (which uses some register and stack abstraction infrastructure not shown):
And I’ll have you know that this simple experiment worked! I wrote a test program that uses both the original opcode stream and the translated C function and they both produced the same output.
Followup Post:
I’ve been thinking of something similar, but more aimed at the fpu stack. The problem I want to solve is easy recovering of floating point loops. A simple IIR filter gets quite complex after a compiler has assembled it. Most of the times you get a brief understanding of the loop and you can easily see what goes in and what comes out. But the actual recurence formula is hard to recover. I’ve been thinking of some virtual fpu stack machine that uses symbol aritmetic to do the recovery process.
Indeed, sky’s the limit. And perhaps sequences of SIMD instructions?
Actually, I tend to think a sequence of FPU instructions would be simpler to recover because there is less weirdness that could be thrown in. But only experiments will confirm this.
Well, as of my practice, such kind of bare asm-to-c translation is merely useless. Surely, you can then de-optimize it to more high-level constructions, but in most cases it’s just an unneccessary waste of time.
@VAG: Yeah, I had a feeling it was useless. But I just wanted to see if I could make it work anyway, at least for a small case. For the limited success I’ve had so far, I think there are far too many little problems with the concept. I’ll try to write them up in a future post.
Had a look at uncc ?
http://cvs.savannah.nongnu.org/viewcvs/?root=uncc
It’s old, unmaintained, buggy… But it is supposed to be a bit more advanced that your try.
Anyway, I also think that this is mostly useless. A C decompiler would need to be much more intelligent to produce a code which is easier to understand than raw asm.
I knew I couldn’t possibly be the first person to actually try this idea. I figured that posting an attempt would be the fastest method of finding out about other efforts.
I usually do this by hand optimizing on-the-fly. And anyway this may be a good tool to do blackbox RE-ing. And I suspect it may stuck in such places:
– flags
– stack (when first push’ing something then lea[ebp+X+arg_0] )
– FPU and SIMD (already discussed)
Somebody a long time ago talked on xine-codec-devel about collecting different asm patterns and their corresponding C code. Dare to open another Wiki?
@Kostya–
The MultimediaWiki is open for all such purposes.
The flags are something I was putting off with this experiment; fortunately, there were no conditional branches in the ASM fragment I selected so flags were moot. I was thinking that any arithmetic operation would also need to update the flags. A subsequent pass could remove a huge number of irrelevant flag ops.
I am not sure how your stack example would trip up the re-targeted code.
What about Boomerang(1)?
1-http://boomerang.sourceforge.net/
@Lucio: I have worked with Boomerang several times in the past with limited degrees of practical success. Thanks for reminding me that I really need to write up some experimental results.
http://hexrays.net/
Hm, the way they actually use “implements unpublished algorithms” in ad copy like it’s a good thing is enough to practically annoy me into writing one. Too bad i haven’t read any good compiler books.
That Hex Rays program looks extremely interesting. However, it’s very expensive by itself, and requires another piece of software that is also very expensive (IDA Pro 5.1).