Thanks to Sean Barrett for helping me compile his bb86 app. I let it rip on this code snippet from the Unnamed RE Project. It’s interesting stuff. I omitted the push, pop, and ret instructions since basic blocks pertain to linear sequences of load, store, and arithmetic instructions:
$ ./bb86 < ~/basic-block.asm
Reading stdin
Warning: unknown opcode 'bswap' in line 9
Memory locations:
mem1 EQU dword+(ebp_0)+08
mem5 EQU dword+(mem4)+((mem3 >> 03))
mem4 EQU dword+(mem1)+10
mem3 EQU dword+(mem1)+04
mem2 EQU dword+(ebp_0)+0c
Integer registers:
eax = ((mem5 << cl_0) >> cl_0)
ebx = mem4
ecx = (00000020 - mem2)
edx = (mem3 >> 03)
esi = mem2
edi = mem1
Floating point stack:
st(0) = fp3
st(1) = fp2
st(2) = fp1
st(3) = fp0
Memory locations:
[dword+(mem1)+04] <= (mem3 + mem2)
I am pretty sure that all of those register states are true at the end of the block, though they are listed in the traditional sequence rather than the logical order. I.e., cl needs to be set before eax could be correct.
I tried out bb86 on a basic block of floating point instructions (using a computation I understand, like the distance between 2 points, rather than a Fourier transform), and it was less than successful: it crashed. But I cannot fault the program, since I am feeding it data disassembled by objdump (-Mintel) rather than Microsoft's official format. Again, bb86 is an interesting effort, and I was impressed when I examined the output of test.asm that was packaged with the code (seen in Sean's original comment).
The cl bug is a partial register thing; the code treats ecx/cx/cl as totally separate registers, so if you do something like ‘xor ecx,ecx;mov cl;…ecx…’ it’ll get it wrong… or in this case loading ecx and using cl. It probably wouldn’t be hard to fix either of those cases in a naive way (e.g. by just treating them as total aliases of each other).
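To make the naive fix concrete, here is a minimal Python sketch of the "total aliases" idea. It is not bb86's actual code; the names `ALIASES`, `canonical`, and `RegState` are hypothetical, and the `_0` suffix for initial values only imitates the output format shown above.

```python
# Hypothetical sketch: treat every partial register as a total alias of its
# 32-bit parent, so a write to ecx is visible when cl is read later.
ALIASES = {}
for letter in "abcd":
    full = "e" + letter + "x"
    for name in (full, letter + "x", letter + "l", letter + "h"):
        ALIASES[name] = full
for name in ("si", "di", "bp", "sp"):
    ALIASES[name] = "e" + name
    ALIASES["e" + name] = "e" + name


def canonical(reg):
    """Map a register name (any case) to its 32-bit parent."""
    reg = reg.lower()
    return ALIASES.get(reg, reg)


class RegState:
    """Symbolic register file holding one expression string per full register."""

    def __init__(self):
        self.values = {}

    def write(self, reg, expr):
        self.values[canonical(reg)] = expr

    def read(self, reg):
        key = canonical(reg)
        # A register that was never written keeps its symbolic initial value.
        return self.values.get(key, key + "_0")


state = RegState()
state.write("ecx", "(00000020 - mem2)")
shift = state.read("cl")  # sees the ecx write instead of a stale initial value
state.write("eax", "((mem5 << %s) >> %s)" % (shift, shift))
print(state.read("eax"))
```

The trade-off is the one described above: this is wrong whenever the code depends on a partial write leaving the rest of the register intact (on real hardware, writing cl changes only the low 8 bits of ecx), but it does cover the load-ecx-then-use-cl pattern.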
Doing it symbolically would be kind of a pain; every time you wrote FOO to ecx, you’d also have to write ‘(FOO & 0xffff)’ to cx, ‘(FOO & 0xff)’ to cl, and ‘((FOO >> 8) & 0xff)’ to ch. A “real” decompiler would definitely want to get this stuff right, but for something quick and dirty, treating them as aliases of each other probably catches most of the cases. (It would certainly handle this one.)
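As a rough illustration of the symbolic bookkeeping, the sketch below derives the three narrower views from a single write to ecx, once as expression strings and once on a concrete value. The function names are hypothetical and only ecx is handled.

```python
def views_of_ecx_write(foo):
    """Expression strings each sub-register would hold after writing foo to ecx."""
    return {
        "ecx": foo,
        "cx": "(%s & 0xffff)" % foo,
        "cl": "(%s & 0xff)" % foo,
        "ch": "((%s >> 8) & 0xff)" % foo,
    }


def concrete_views(value):
    """The same projections applied to an actual 32-bit value, as a sanity check."""
    value &= 0xFFFFFFFF
    return {
        "ecx": value,
        "cx": value & 0xFFFF,
        "cl": value & 0xFF,
        "ch": (value >> 8) & 0xFF,
    }


for reg, expr in views_of_ecx_write("FOO").items():
    print(reg, "=", expr)
print({reg: hex(v) for reg, v in concrete_views(0x12345678).items()})
```

Writes in the other direction (to cl, ch, or cx) would need the matching merge into ecx, which this sketch leaves out.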
As to the other, I’m not going to support bb86 in the long run, but if it’ll make a difference between useful and not-useful, I’ll be happy to put in support for another asm format… I just need a sample.
To clarify…
“you’d have to write ‘(FOO & 0xffff)’ to cx, ‘(FOO & 0xff)’ to cl, and ‘((FOO >> 8) & 0xff)’ to ch”
and then when you used these expressions, they’d propagate, and since the system is just tossing strings around and not simplifying them, you might end up with huge giant chains of redundant ((((FOO & 0xff) & 0xff) & 0xff) & 0xff) as the values. Which would be lame.
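One possible way around the redundant chains, sketched in Python under the assumption that expressions are kept as small tuples rather than raw strings (which is not how bb86 is described as working): merge a new mask into an existing outer mask instead of nesting it. The helpers `mask` and `render` are hypothetical.

```python
def mask(expr, m):
    """Build expr & m, folding m into an existing outer mask if there is one."""
    if isinstance(expr, tuple) and expr[0] == "and":
        _, inner, old = expr
        return ("and", inner, old & m)  # (x & a) & b == x & (a & b)
    return ("and", expr, m)


def render(expr):
    """Turn a tuple expression back into the string form used in the listing."""
    if isinstance(expr, tuple):
        _, inner, m = expr
        return "(%s & 0x%x)" % (render(inner), m)
    return expr


naive = "FOO"
folded = "FOO"
for _ in range(4):
    naive = "(%s & 0xff)" % naive  # plain string pasting keeps nesting
    folded = mask(folded, 0xFF)    # folding keeps a single mask

print(naive)           # ((((FOO & 0xff) & 0xff) & 0xff) & 0xff)
print(render(folded))  # (FOO & 0xff)
```

This only handles masks stacked directly on top of each other; a shift or add in between would need real simplification rules.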
Oh yeah (“just one more thing”)…
The easy thing you can do (although this is stupid in the long term) is just edit the asm so the shifts are by ecx instead of cl, and see what bb86 pops out.
Indeed, those partial registers (cl/ch/cx/ecx) are a huge headache when doing RE-type stuff on x86. It makes me wonder if the same experiments on more refined RISC architectures would be simpler. It’s hard to say exactly since most of the interesting stuff to RE comes in x86 form.