Implementing The Re-targeter

It was nearly a year ago that I tried my hand at writing a re-targeter: a program that takes machine opcodes and automatically translates them into a portable C program, which certainly sounds simple and intuitive enough. I was quite busy around this time last year, and I don’t remember how I found time for the re-targeter experiment in the first place. But it looks like I had time to write up some notes that I never fleshed out and published. It was hard enough just to locate the old source code. I was completely surprised to find that I had actually managed to write the re-targeter in Python; I had no idea I knew so much of that language (which, granted, isn’t much).

Here are some of the problems I encountered when I took a stab at writing a re-targeter; let’s see if I can remember the specifics a year later:

Managing the stack and parameters: I made the effort to re-target entire functions at a time. Functions take parameters, establish a stack frame at the start, clean it up at the end, and return a value. For stack considerations, I virtualized a small stack along with the necessary register set. The re-targeter simulated the ASM function call by pushing all the known parameters onto the virtual stack at the start of the function (the parameter count has to be determined in advance and fed to the re-targeter).

Return values are the easy part: just “return eax”, using the virtual eax register.
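As a rough illustration of the approach described above, here is a minimal sketch of what the generated C might look like for a two-parameter function that adds its arguments. Everything named here (vm_t, vm_push, vm_pop, vm_load32, retargeted_add, the 1024-byte stack size, and the placeholder return-address slot) is an assumption made for the example; the original re-targeter's output was never published.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical names throughout; none of this is the original tool's output. */
#define VSTACK_BYTES 1024

typedef struct {
    uint32_t eax, ebx, ecx, edx, esi, edi, ebp, esp; /* virtual register set */
    uint8_t stack[VSTACK_BYTES];                     /* small virtual stack  */
} vm_t;

/* Registers hold byte offsets into vm->stack rather than host pointers. */
static void vm_push(vm_t *vm, uint32_t value) {
    vm->esp -= 4;
    memcpy(&vm->stack[vm->esp], &value, 4);
}

static uint32_t vm_pop(vm_t *vm) {
    uint32_t value;
    memcpy(&value, &vm->stack[vm->esp], 4);
    vm->esp += 4;
    return value;
}

static uint32_t vm_load32(const vm_t *vm, uint32_t addr) {
    uint32_t value;
    memcpy(&value, &vm->stack[addr], 4);
    return value;
}

uint32_t retargeted_add(uint32_t arg0, uint32_t arg1) {
    vm_t vm;
    memset(&vm, 0, sizeof vm);
    vm.esp = VSTACK_BYTES;          /* stack grows downward from the top */

    /* Simulate the call: parameters pushed right to left, then a
       placeholder return address so [ebp+08] lines up as in the ASM. */
    vm_push(&vm, arg1);
    vm_push(&vm, arg0);
    vm_push(&vm, 0);

    /* push ebp / mov ebp, esp */
    vm_push(&vm, vm.ebp);
    vm.ebp = vm.esp;

    /* mov eax, dword[ebp+08] */
    vm.eax = vm_load32(&vm, vm.ebp + 0x08);
    /* add eax, dword[ebp+0C] */
    vm.eax += vm_load32(&vm, vm.ebp + 0x0C);

    /* pop ebp / ret */
    vm.ebp = vm_pop(&vm);
    return vm.eax;
}
```

Calling retargeted_add(2, 3) walks through the simulated prologue, loads both parameters relative to the virtual ebp, and hands back the virtual eax. Keeping registers as offsets into the virtual stack, rather than as real pointers, sidesteps pointer-size differences on the host.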

Segmented registers: This was painful when I hit up against it. Sure, you have a set of neat, 32-bit, more or less general purpose registers (eax, ebx, ecx, edx, esi, edi, ebp, and esp). However, all of those registers started out their careers as 16-bit registers, and the lower 16 bits of each are still independently addressable (as ax, bx, cx, dx, si, di, bp, and sp). It gets worse with eax..edx since the lower 16 bits can also be sliced into two 8-bit halves (al and ah, for example) and independently addressed in that manner. I dealt with this by simply hacking in only as much support as I needed for the selected sample snippet, using an admittedly inelegant series of bit shifts and masks.
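The shifts and masks in question were not preserved, but a sketch of the general idea might look like the following. The helper names (get_lo16, set_hi8, and so on) are made up for this example. Each setter leaves the untouched bits of the 32-bit register alone, which matches how writes to ax, al, and ah behave in 32-bit x86 code.

```c
#include <stdint.h>

/* Hypothetical helpers; the original shifts and masks were never published. */

/* Read a sub-register out of a 32-bit virtual register. */
static inline uint16_t get_lo16(uint32_t reg) { return (uint16_t)(reg & 0xFFFFu); }      /* ax */
static inline uint8_t  get_lo8(uint32_t reg)  { return (uint8_t)(reg & 0xFFu); }         /* al */
static inline uint8_t  get_hi8(uint32_t reg)  { return (uint8_t)((reg >> 8) & 0xFFu); }  /* ah */

/* Write a sub-register, preserving every bit outside of it. */
static inline uint32_t set_lo16(uint32_t reg, uint16_t v) {
    return (reg & 0xFFFF0000u) | (uint32_t)v;            /* mov ax, v */
}

static inline uint32_t set_lo8(uint32_t reg, uint8_t v) {
    return (reg & 0xFFFFFF00u) | (uint32_t)v;            /* mov al, v */
}

static inline uint32_t set_hi8(uint32_t reg, uint8_t v) {
    return (reg & 0xFFFF00FFu) | ((uint32_t)v << 8);     /* mov ah, v */
}
```

With helpers like these, an instruction such as “mov ah, 0xAB” could be emitted as “eax = set_hi8(eax, 0xAB);”. A union of overlapping fields is the other common trick, but its layout depends on host byte order, so plain shifts and masks are the more portable choice.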

Endianness: I once wrote a little post on this blog about when you need to worry about endianness in your program. The idea of re-targeting to portable code seemed to go right out the window on this exercise. The original x86 code will do a direct load of a 32-bit value with, e.g., “edi = *(unsigned int *)(ebp+0x08);”. That same construct would not be portable to big-endian architectures, where the same four bytes in memory would be read in the opposite order.

But on second thought, maybe it is possible. The nice thing about my simple experiment is that I did not have to go to great lengths to parse parameters. The C statement above started out as the ASM statement “mov edi, dword[ebp+08]”. Rather than doing any complicated parsing, I just replaced “dword[whatever]” with “*(unsigned int *)(whatever)”. But I suppose I could replace it with a macro called “LOAD_DWORD(whatever)”, which would assemble the bytes in the proper order regardless of the host’s endianness.

I mean, if I cared to continue this experiment.
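For the curious, a minimal sketch of what such a LOAD_DWORD macro could look like follows. It assumes the argument is a pointer into byte-addressable memory holding x86 (little-endian) data. The helper functions and the matching STORE_DWORD macro are hypothetical additions for the example, not something the experiment ever contained.

```c
#include <stdint.h>

/* Build a 32-bit value from four bytes stored in little-endian (x86) order.
   Works one byte at a time, so the result is the same on any host. */
static inline uint32_t load_dword_le(const void *addr) {
    const uint8_t *p = (const uint8_t *)addr;
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Hypothetical counterpart: write a 32-bit value out in little-endian order. */
static inline void store_dword_le(void *addr, uint32_t value) {
    uint8_t *p = (uint8_t *)addr;
    p[0] = (uint8_t)(value & 0xFFu);
    p[1] = (uint8_t)((value >> 8) & 0xFFu);
    p[2] = (uint8_t)((value >> 16) & 0xFFu);
    p[3] = (uint8_t)((value >> 24) & 0xFFu);
}

#define LOAD_DWORD(addr)         load_dword_le((const void *)(addr))
#define STORE_DWORD(addr, value) store_dword_le((void *)(addr), (value))
```

The re-targeter's text substitution would then turn “dword[whatever]” into “LOAD_DWORD(whatever)” on the read side. Byte-wise access also avoids the unaligned-load problems that a raw pointer cast can run into on stricter architectures.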
