Many have postulated on a program that can take the machine code of a binary program and automatically translate it to C. I am not talking about elegant C, with proper indentation and control loop structures. I mean C that looks like it was directly translated from ASM, GOTO statements and all. When you start reverse engineering i386 code and extracting algorithms from it, you very quickly notice the tight ASM-C relationship. It’s no wonder that C is sometimes referred to as “portable assembly language”. ASM constructs map quite cleanly into the C language (or is it the other way around?).
Based on that, why can’t we write a program that automatically converts those low-level ASM instructions into a series of C statements? Then it could be compiled for any CPU architecture. Personally, I have traditionally been opposed to this idea, simply on the grounds that one of my overriding goals is actually understanding the underlying technology that I am reverse engineering.
However, I also realize that intelligence-gathering is an essential component of reverse engineering activity. To that end, this type of program could be another tool in the RE toolbox, another way to look at the problem. I would call this approach brute-force reverse engineering since it might– just might– be workable, but not elegant by any stretch of the imagination. Let’s call this hypothetical program “BFRE”.
Let’s talk hard, technical details. A computer program expressed in assembly language consists of a series of simple operations. These operations basically include:
- load operations from memory into CPU registers
- store operations from CPU registers back into memory
- arithmetic operations on data sitting in either memory or CPU registers
- conditional or unconditional program execution branches
As usual, there are some very difficult problems. Otherwise, there would already be a generalized solution. First and foremost is the dreaded case-switch construct. I will explain briefly for the uninitiated: In academic computer science parlance, automatically reverse engineering case-switch constructs has been proven to be an NP-complete problem due to the fact that a program can not look at another program and differentiate between code and data. In terms of bits and bytes, case-switch constructs usually compile down to dynamic branches where an address is loaded into a register at run-time and a “jump [register]” instruction is executed. If you are looking at a static disassembly of a program, it is very difficult to figure out where the program will land.
So that is one problem that can easily trip up our BFRE program. One solution is to feed intelligence into the BFRE program about jump targets observed in execution.
Some aspiring reverse engineers with whom I have corresponded dream of writing an automated RE program that can digest an entire binary file at once, spitting out C-like code to represent the whole thing. I like to start small since the problems at hand are already hard enough. To that end, I prefer to focus on individual functions. Using some other custom tools as well as human intelligence, it is possible to discern the boundaries of the relevant functions. Feed each of these into the BFRE machine in order to ease development pain and focus efforts in the final RE process.
How to go about writing such a program? My first idea would be to virtualize the underlying CPU as a set of global variables. That means declaring “int eax, ebx, ecx, edx, esi, edi, ebp, esp and eflags” at the top level. Declaring eip seems somehow unnecessary though I suppose it is remotely possible that a function might play fast and loose with the eip but I can not think of how right now. Further, declaring variables corresponding to special machine state registers is silly since the scope of this project is to RE user-space programs. Extended register sets such as those used for floating point and SIMD operations? That would be a nice future feature after proof of concept has been established. We will stick to the basics for now. One more item at the top level: An array of bytes to represent the stack. Make it sufficiently large, like a megabyte. Expand as needed.
Let’s think about the actual program. I am thinking of Perl. I have written a fair number of little RE tools in Perl. Why? Because there are a lot of great tools out there for taking apart i386 binary files and presenting neatly formatted text. I would rather sub-contract to one of those programs for the heavy lifting involved in disassembly and then use a language that is good at manipulating text (like Perl) to chew on the output.
I am envisioning that the program will take the text output from a disassembler, the starting address and ending addresses of a function and the suspected prototype of a function. That last item might be a bit much to key in at the command line and process subsequently. Instead, perhaps specify the number of 32-bit parameters the program is suspected to take and whether or not the program will return a 32-bit integer or null. Actually, since this is likely to be an iterative process, it may be worthwhile to create a parameter file specification with all kinds of intelligence, hints, and heuristics about various functions in the target module.
First pass at the actual BFRE logic which is very naive and well, brutish, hence the project name; optimizations will come after the first phase proof of concept:
- initialize the virtual state machine– this basically entails setting all of the virtual registers to 0x00000000 except for esp which should be set to STACK_SIZE (where the stack array is stack[STACK_SIZE]); technically, this will put esp out of array bounds which does not matter since the first stack push operation will decrement esp before storing data on the stack
- read in the starting and ending addresses, number of parameters, and whether there is a return value
- read in lines from disassembly file until first address of function is located, error out if exact match is not found
- declare a generic name function (or allow the user to hint at the name) with void or int as the return type
- for each parameter in the list (p = (# of parms – 1) downto 0), declare a parameter “int parm{p}” (e.g., for 3 parms, declare parm2, parm1, and parm0) and push each on the stack; also, keep internal accounting of which position each variable is at on the stack (sounds like a good job for a hash array where the key is the stack address and the value is the parameter name)
- set up is done, time to start iterating through assembly language instructions until the current instruction address is greater than the user-specified ending address
- process each line; for starters, this will entail outputting a label in the form of “_$current_address” since program flow will consist of a series of GOTOs– look, I’m sorry professor Dijkstra, but if it is good enough for the C compiler to generate for the machine to execute, it should be good enough for the machine-generated C code to use
- transform each line of assembly language into an equivalent set of C statements; this will likely take a lot of trial and error; too tired tonight to brainstorm on this point further (of course, it is the biggest point of this project)
Hopefully, putting these ideas into practice will yield little atomic functions that can be successfully re-compiled and run, and will produce identical output as the original binary code. At that point, it will be time to look for optimizations.
Future expansion: Look for opportunities to apply the Cifuentes thesis. Or, since this is the brute-force reverse engineering program, apply solutions that require less brainpower. For example, it can get a little messy to try to guess which local variable or function parameter [esp+0x28] refers to after the function builds momentum and has pushed some data and gotten past a few conditional branches. So, we could add calls before such references inside the analyzed function to custom functions that compare addresses such as [esp+0x28] and will plainly advertise which variable corresponds to that memory location. This is where that hash array with parameter stack locations would come into play. Use that data to provide feedback to the BFRE program on subsequent executions. It is an iterative feedback process. But if it provides good, accurate intelligence, that is what matters.
So, what do you think? Does it seem too crazy to be useful? Or, better yet, has someone else already created a package that does the exact same thing outlined here (and more)?