About support levels of instructions

As a decompiler writer you have implemented the most instructions and for some small test programs the decompiler produces a good output. But in larger programs other instructions pop up which may not map quite nicely onto the semantics of output language. So what are the options here? You can also look at problem from another point of view: as a person looking at the output of different decompilers with the goal of judging how good they are.

Level 0 is that the disassembler which the decompiler uses doesn’t understand the instruction. You will see something like:

asm("invalid");

in the output. A variant of this is that the decompiler outputs some assembler instruction which looks reasonable but is wrong.

Level 0.5 is that the decompiler correctly disassembles the instruction and shows it in the output like

asm("into");

The major issue of this level is that even if the person reading the decompiler output understands the instruction, the instruction will probably be in the wrong place because it has no dependencies modeled and therefore data flow algorithms will ignore it.

Level 1 solves this dependency problem by making these dependencies explicit by calling a magic builtin (aka intrinsic) function which supposedly implements the instruction semantics. This magic builtin function is probably not implemented anywhere. So you may have something like:

edx, eax = _rdtsc();

// OR
_rdtsc(out edx, out eax);

You see in this example that the decompiler has to support returning multiple return values in some way. It would be even better to make the function name better so that the decompiler user doesn’t have to know the instructions. And also to name the parameters and return values:

{high:edx, low:eax} = _readTimeStampCounter();

// OR
_readTimeStampCounter(out high:edx, out low:eax);
  
// OR
struct timestamp ts = _readTimeStampCounter();
edx = ts.high;
eax = ts.low;
  
// OR
timeStampCounterOf8Bytes = _readTimeStampCounter();
edx = getHighDoubleWord(timeStampCounterOf8Bytes);
eax = getLowDoubleWord(timeStampCounterOf8Bytes);

But even the first variant is a lot better than level 0/0.5 because from the perspective of the decompiler it is not a foreign object anymore but a normal function call.

Looking behind level 1 it is clear that for basic operations (addition, or, …) we do not expect function calls but operators (Level 2). There is no clear set of operations for this but the operators of the C language and instructions common to multiple CPU architectures are a strong hint. So for example shifting an integer is available in C and in the ISAs but rotate isn’t available in C directly. Here the decompiler developer has to decide if she introduces an operator or let rotating become a second class citizen by using builtin functions or converting rotates to shifts, AND and OR.

Another point I want to mention is that C like functions (and this includes builtin functions) are not generic. So you have to have a different function for different parameter types (“_rotateLeftB”, “_rotateLeftW”, …) while an operator can work with different types (there is no + for 2 byte integers and a different + for 4 byte integers).

Coming back to the area between builtin functions and operators. The area is more murky than it seems on first sight. Some instructions can not be replaced or represented by something else. So I’m pretty sure that “rdtsc” and “cli” are best modeled as builtin functions. But what for example about “bswap”? This instruction can be expressed by other means (and, or, shifts). Is it better for the user reading the decompiler output to see a “_byte_swap(…)” call or a mix of and, or and shifts? IMHO in this example the call would be better.

BUT (there is always a but) if the input is a concrete integer I would like to see the result and not the call. This means that the decompiler has to know the body of the builtin function and be able to inline it on demand. Another example is “cpuid”. It could be modeled as:

some_generic_return_value  _cpuid(int4 selectorInEax);

but on the other side it would improve the output if the decompiler knows about the specific values you can pass in eax since the value is probably constant in almost all cases. So you may have for eax=2 a function:

cache_information  _cpuid_cache_information();

And a third example is “pext/pdep”. If the mask is unknown/a variable it is very hard to express these instruction in some other ways and thus the decompiler developer would go for a function call. If on the other side the mask is fixed (as we can expect in common code?!) it is possible to convert it to a mix of and, or and shifts which may make the output easier to understand for the user. Also in this case (like bswap) I would expect that the decompiler can calculate the result if all input is known. One could call such data depending behavior of the decompiler support level 1.5.

Thank you for reading and please send questions or feedback via email to holdec@kronotai.com or contact me on Twitter.

]]>
This entry was posted in decompiler and tagged , , , , , . Bookmark the permalink.