Holdec Version 2.0

After more than 8 years a new version of the holistic decompiler (holdec) is available. During this time a lot has changed: on PCs 64 bit became the standard, AArch64 (aka ARM64) became widespread in mobile devices and is, with Apple’s M1 chip, on the jump to the desktop, and last but not least the NSA open-sourced their decompiler Ghidra. The development of holdec also didn’t stop during this time. A more detailed (but still incomplete) list of changes is in the Changelog. I want to go over the biggest changes here in more detail.

Link to the download page.

Support for x64

A lot of work went into another rewrite of the x86 disassembler to support x64. Hopefully this third version will last a bit longer. Of course the new registers and longer integer values had to be added.

Support for AArch64/A64

To extend the existing 2.5 architectures (M68k and x86/x64) with another one I had to decide which one to add. I chose AArch64/A64 since the majority of mobile devices used it at the time. In retrospect I think it is a nice ISA since it is easy to understand and decode, and it also has SIMD built in and not as an extension like x86 or RISC-V.

Of course the support starts with a disassembler. Since I prefer a pure Java solution and none was available, I “had” to write a new disassembler. This time I open-sourced it (jarmo). I used objdump from GNU binutils as a reference implementation. And of course every complex software has bugs. In general it was a nice experience getting to know such a modern ISA.

Support for x87

While AArch64/A64 is pretty new stuff, the old FPU instructions (stack based, all starting with an “f”) are quite old. Still, they are used in older binaries. I executed my usual cycle: look at the problem, create a test program, run other decompilers on the test program, file bugs for the other decompilers, implement the feature in holdec.

One recurring struggle was how to model each instruction. I wrote more about this issue in a separate post.

The decompiler doesn’t perform the usual arithmetic transformations like “x+x => 2x”. There is however a new option “-f/--eval-float-ops” which performs floating point evaluation of concrete values. This includes the basic arithmetic operators but also functions like sin() or sqrt().

Support for large binaries

Inspired by a blog post about obfuscation techniques used by Snapchat, I started looking into this to see how holdec performs. Well, it turns out that the binary is 133 MB and contains 89 MB of code. Since it is AArch64/A64 code this means 22 million instructions. The disassembler used to perform a linear search through all the possible instructions to decode one. This didn’t scale, and the code now uses a binary-tree-based approach.

Loading all these 22 million instructions in the global phase required more heap memory than my computer has. Since binaries will in general only grow larger with time, I rewrote the global phase to use a streaming approach: only a few instructions are disassembled at a time and supporting data structures are built in memory to write the method cache.

The second surprise was that holdec identified 1 million functions in the binary. Previously the decompiler wrote one file for each identified function containing all the lines of this function, and all the files went into one directory. Not surprisingly, the filesystem doesn’t like a directory with 1 million files. The next step was to use the standard approach of nested directories. This was successful but took a long time and used 11 GB on the hard disk. Quite a waste (of development effort, execution time and disk space). Another iteration replaced this with one simple text file which just contains text offsets for each function into the global text file of disassembled instructions. With these changes the global phase runs for 20 minutes on the Snapchat binary I use.
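The offset-index idea can be sketched like this (illustrative only, not holdec’s actual file format): one global listing of disassembled text plus a byte offset per function, so showing one function is a single seek instead of one file per function.

```c
#include <string.h>

// return a pointer to one function's text inside the global listing,
// using a precomputed byte offset instead of scanning from the start
const char *function_text(const char *listing, const long *offsets, int funcId) {
    return listing + offsets[funcId];
}
```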

So while you can now decompile any subset of the 1 million functions, you have to choose which ones to focus on. During the global phase certain measurements are stored for every function. These include the number of blocks, number of instructions, number of jumps (conditional, unconditional, and indexed), number of call instructions, McCabe complexity, number of backward jumps (indicating loops), number of other functions called and number of other functions calling this one. Currently you can only see this information with the new CLI command “-c printTopMethodCache” which prints the top 10 functions for the various metrics. More support is required here.
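For reference, the McCabe (cyclomatic) complexity mentioned above can be computed directly from the CFG counts that the global phase already stores; a minimal sketch:

```c
// McCabe (cyclomatic) complexity M = E - N + 2P for a control flow
// graph with E edges, N nodes (blocks) and P connected components
// (P = 1 for a single function)
int mccabe(int edges, int nodes, int components) {
    return edges - nodes + 2 * components;
}
```

A straight-line function (one block, no jumps) gives 1; an if/else diamond (4 blocks, 4 edges) gives 2.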

With this rather large internal change holdec now supports a very wide range of subjects: from 64-byte demos for MS-DOS to 100 MB mobile applications.

Correctness of x86/x64

As outlined in another post, one can test whether the semantic model of the decompiler matches a real-world silicon CPU. Other decompilers (reko, snowman, ghidra, hex-rays) have certain issues with the 32-bit and 64-bit versions of the test, while holdec is the only one which passes all test methods.

One outcome of the x64 test is that holdec had multiple bugs related to the fact that assigning a 32-bit register in 64-bit mode clears the upper 32 bits. This is not surprising for simple movs (which may even be nops in 32-bit mode), but other cases include rotating with count=0 or CMOVcc when the condition doesn’t hold. On the other hand, the documentation for CMPXCHG states that an assignment happens in both cases, so one could assume that the upper bits are cleared in both cases, but this is not the case.
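The zeroing rule can be sketched in C (a simplified model for illustration, not holdec’s actual code):

```c
#include <stdint.h>

// x64 rule: writing a 32-bit register zeroes the upper 32 bits of the
// containing 64-bit register. So "mov eax, value" applied to rax must
// be modeled like this:
uint64_t write_eax(uint64_t old_rax, uint32_t value) {
    (void)old_rax;              // the old upper bits do NOT survive
    return (uint64_t)value;     // upper 32 bits become zero
}
```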

Conclusion

As always there are a lot of ideas and not enough time. It is difficult to decide what to work on:

  • increase width: support more CPU architectures / formats
  • go deeper: model existing CPUs in more detail
  • increase robustness: run the decompiler on a large set of binaries and look at errors
  • front-end (CPU architectures from above) vs middle part (e.g. track value sets across blocks) vs output (e.g. better detection of for loops)
  • more tests: tests which show the features but also the limits of the current decompilers

Let us hope that the next version of holdec does not take this much time.

Thank you for reading and please send questions or feedback via email to holdec@kronotai.com or contact me on Twitter.


About support levels of instructions

As a decompiler writer you have implemented most instructions, and for some small test programs the decompiler produces good output. But in larger programs other instructions pop up which may not map quite nicely onto the semantics of the output language. So what are the options here? You can also look at the problem from another point of view: as a person looking at the output of different decompilers with the goal of judging how good they are.

Level 0 is that the disassembler which the decompiler uses doesn’t understand the instruction. You will see something like:

asm("invalid");

in the output. A variant of this is that the decompiler outputs some assembler instruction which looks reasonable but is wrong.

Level 0.5 is that the decompiler correctly disassembles the instruction and shows it in the output like

asm("into");

The major issue of this level is that even if the person reading the decompiler output understands the instruction, the instruction will probably be in the wrong place because it has no dependencies modeled and therefore data flow algorithms will ignore it.

Level 1 solves this dependency problem by making these dependencies explicit by calling a magic builtin (aka intrinsic) function which supposedly implements the instruction semantics. This magic builtin function is probably not implemented anywhere. So you may have something like:

edx, eax = _rdtsc();

// OR
_rdtsc(out edx, out eax);

You see in this example that the decompiler has to support returning multiple return values in some way. It would be even better to choose a more descriptive function name so that the decompiler user doesn’t have to know the instructions, and also to name the parameters and return values:

{high:edx, low:eax} = _readTimeStampCounter();

// OR
_readTimeStampCounter(out high:edx, out low:eax);
  
// OR
struct timestamp ts = _readTimeStampCounter();
edx = ts.high;
eax = ts.low;
  
// OR
timeStampCounterOf8Bytes = _readTimeStampCounter();
edx = getHighDoubleWord(timeStampCounterOf8Bytes);
eax = getLowDoubleWord(timeStampCounterOf8Bytes);

But even the first variant is a lot better than level 0/0.5 because from the perspective of the decompiler it is not a foreign object anymore but a normal function call.

Looking beyond level 1 it is clear that for basic operations (addition, or, …) we do not expect function calls but operators (level 2). There is no clear set of such operations, but the operators of the C language and instructions common to multiple CPU architectures are a strong hint. So for example shifting an integer is available in C and in the ISAs, but rotate isn’t available in C directly. Here the decompiler developer has to decide whether she introduces an operator or lets rotating become a second-class citizen by using builtin functions or converting rotates to shifts, AND and OR.
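For illustration, here is rotate-left lowered to shifts and OR, the “second-class citizen” option (a sketch; the count masking follows x86 behavior):

```c
#include <stdint.h>

// rotate-left expressed with shifts and OR
uint32_t rotl32(uint32_t x, unsigned count) {
    count &= 31;                  // x86 masks the rotate count to 5 bits
    if (count == 0) return x;     // avoid shifting by 32 (undefined in C)
    return (x << count) | (x >> (32 - count));
}
```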

Another point I want to mention is that C-like functions (and this includes builtin functions) are not generic. So you have to have a different function for different parameter types (“_rotateLeftB”, “_rotateLeftW”, …) while an operator can work with different types (there is no + for 2-byte integers and a different + for 4-byte integers).

Coming back to the area between builtin functions and operators: it is murkier than it seems at first sight. Some instructions cannot be replaced or represented by something else. So I’m pretty sure that “rdtsc” and “cli” are best modeled as builtin functions. But what about, for example, “bswap”? This instruction can be expressed by other means (AND, OR, shifts). Is it better for the user reading the decompiler output to see a “_byte_swap(…)” call or a mix of AND, OR and shifts? IMHO in this example the call would be better.
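For comparison, this is what bswap looks like when expressed with AND, OR and shifts (an illustrative sketch, not any particular decompiler’s output):

```c
#include <stdint.h>

// byte swap of a 32-bit value built from AND, OR and shifts
uint32_t byte_swap32(uint32_t v) {
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}
```

A single `_byte_swap(v)` call arguably reads better than these four terms.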

BUT (there is always a but): if the input is a concrete integer I would like to see the result and not the call. This means that the decompiler has to know the body of the builtin function and be able to inline it on demand. Another example is “cpuid”. It could be modeled as:

some_generic_return_value  _cpuid(int4 selectorInEax);

but on the other hand it would improve the output if the decompiler knew about the specific values you can pass in eax, since the value is probably constant in almost all cases. So for eax=2 you may have a function:

cache_information  _cpuid_cache_information();

And a third example is “pext”/“pdep”. If the mask is unknown/a variable it is very hard to express these instructions in some other way and thus the decompiler developer would go for a function call. If on the other hand the mask is fixed (as we can expect in common code?!) it is possible to convert it to a mix of AND, OR and shifts which may make the output easier to understand for the user. Also in this case (like bswap) I would expect that the decompiler can calculate the result if all input is known. One could call such data-dependent behavior of the decompiler support level 1.5.
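A straightforward (slow) software model of pext, useful as the “body” a decompiler would need in order to evaluate the instruction on known inputs (illustrative sketch):

```c
#include <stdint.h>

// bit-by-bit model of pext (parallel bit extract): gathers the bits of
// src selected by mask into the low bits of the result
uint64_t pext_soft(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    unsigned out = 0;
    for (unsigned bit = 0; bit < 64; bit++) {
        if (mask & (1ULL << bit)) {
            if (src & (1ULL << bit))
                result |= 1ULL << out;
            out++;                // each mask bit consumes one output bit
        }
    }
    return result;
}
```

With a fixed mask the loop unrolls into exactly the AND/OR/shift mix described above.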

Thank you for reading and please send questions or feedback via email to holdec@kronotai.com or contact me on Twitter.


About passing floating point parameters

On i386 Linux, floating point numbers are passed on the stack and not in registers. This creates two issues:

  1. How does a decompiler recognize that a floating point number is passed?
  2. How does a decompiler express an individual call?

There are multiple sources for type recognition in general. For floating point numbers the relevant ones are:

  • if the value is in a floating point register (x87 ST*, XMM/SSE, …) it is with high probability a floating point value. This is not available when stack memory is used.
  • caller: if the caller writes the memory from a known floating point value (like from one of the registers) one can assume it is a floating point value
  • usage: if the called function performs floating point operations (at least loading) with the value it is a floating point value
  • printf: if other information say that it is a floating point value. The most common are printf format strings.

Let us take a look at these. We choose the C double type because it is 8 bytes long and therefore takes up two “stack slots”. This makes it easy to see whether the decompiler sees a method taking two ints or one taking a double parameter.

If the caller passes in a float literal, this literal is not distinguishable from two 4-byte integers. If the called method just looks at the raw bytes, we call this “bytes” in the table below.
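This indistinguishability is easy to demonstrate (a sketch assuming a little-endian machine with IEEE 754 doubles, like i386):

```c
#include <stdint.h>
#include <string.h>

// an 8-byte double occupies two 4-byte stack slots whose bit patterns
// are indistinguishable from two ints (little-endian layout assumed)
uint32_t double_low_slot(double d)  { uint32_t s[2]; memcpy(s, &d, sizeof d); return s[0]; }
uint32_t double_high_slot(double d) { uint32_t s[2]; memcpy(s, &d, sizeof d); return s[1]; }
```

For the literal 2.31 used later in this post the two slots are 0x47ae147b and 0x40027ae1, exactly the integer literals that show up in the Ghidra output.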

The source code, executable and decompiler output are on github.

The first case (unknown_to_unknown) is not distinguishable from passing around two 4-byte integers in the binary form of the program. Therefore the decompilers can’t recognize this. It is included as a baseline.

Caller  | Usage in function | Function
literal | bytes             | unknown_to_unknown
literal | double            | unknown_to_double
double  | bytes             | double_to_unknown
double  | double            | double_to_double

And the decompiler reactions:

Function           | reko       | Ghidra   | retdec
unknown_to_unknown | two uint32 | two uint | two uint32_t
unknown_to_double  | real64     | double   | two uint32_t
double_to_unknown  | two uint32 | two uint | float80_t and uint32_t
double_to_double   | real64     | double   | float80_t

We see that reko and Ghidra only use the usage information: they look at the function body and not at the caller. retdec on the other hand looks at the caller, but the actual types are wrong.

When we look at the calls for functions where the signature is correctly recognized, we see that reko passes one argument to unknown_to_double and double_to_double. Ghidra on the other hand passes two integers to unknown_to_double while double_to_double is ok. retdec passes one double to double_to_double but in the other cases it doesn’t produce correct output.


When we look at the two printf calls, the decompilers have to solve two issues:
  • parse the format string to get the types of the parameters (and then output the call according to these types)
  • support long double (“%Lf” in the format string) which is 10 bytes and takes three “stack slots”
None of the three decompilers tested gets this right. See my inline comments.
// ===== Original source
printf("unknown: int-a=%d double=%f int-b=%d long double=%Lf int-c=%d\n",
100, 2.31, 101, (long double)2.32, 102);

printf("double: int-a=%d double=%f int-b=%d long double=%Lf int-c=%d\n",
200, 2.41+argc, 201, (long double)(2.42+argc), 202);

// ===== reko
// - tLoc3C is not defined
// - doesn't realize that %Lf takes 3 stack slots
// - 2920577024 aka 0xae147800 is part of the long double
printf("unknown: int-a=%d double=%f int-b=%d long double=%Lf int-c=%d\n",
100, 2.31, 101, tLoc3C, 2920577024);

printf("double: int-a=%d double=%f int-b=%d long double=%Lf int-c=%d\n",
200, rLoc1_134 + g_r804A100, 0xC9, tLoc3C, (word32) (real80) (g_r804A0F8 + rLoc1_134));

// ===== ghidra
// - extra uVar3
// - int literals and not floating literals
dVar2 = (double)param_1 + 1.24;
uVar3 = (undefined4)((ulonglong)dVar2 >> 0x20);
printf("unknown: int-a=%d double=%f int-b=%d long double=%Lf int-c=%d\n",
100, 0x47ae147b, 0x40027ae1, 0x65,
0xae147800, 0x947ae147, 0, 0x66, uVar3);

// - doesn't known that the 3 stack slots belong together
fVar1 = (float10)2.42 + (float10)param_1;
printf("double: int-a=%d double=%f int-b=%d long double=%Lf int-c=%d\n",
200, (double)((float10)param_1 +(float10)2.41),
0xc9,
SUB104(fVar1,0),
(int)((unkuint10)fVar1 >> 0x20),
(char)((unkuint10)fVar1 >> 0x40),
0xca, uVar3);

// ===== retdec
// - How? Why? The only positive is that it gets
// the number of extra arguments correct
printf("unknown: int-a=%d double=%f int-b=%d long double=%Lf int-c=%d\n",
100, 5.9415882152956413e-315, 0x40027ae1, 3.68165152720129934855e-4949L, -1);

// - the double and long double part is good
// - the ints are wrong
printf("double: int-a=%d double=%f int-b=%d long double=%Lf int-c=%d\n",
200, (float64_t)(v1 + 2.41L), 0, v1 + 2.42L);

While floating point numbers have been with us for a long time (the 80387 is from 1987) and we are not doing anything fancy like SIMD, the decompilers are lacking in this regard: retdec is just strange/broken, Ghidra doesn’t support floating point literals and reko has its own issues.

Thank you for reading and please send questions or feedback via email to holdec@kronotai.com or contact me on Twitter.


How well do decompilers support x87 instructions?

subject.exe is a hand-written program which tests all the x87 FPU instructions. As usual the motivation is to test other decompilers but also to have a good test case/demo for my own decompiler. The contestants are:
  • reko version b08e86b
  • retdec version 31ffc7c
  • ghidra version 9.1.2_PUBLIC

reko

Looking at the output one can see:
  • high: reko doesn’t load constant values (neither float nor int) from read-only sections
  • reko doesn’t evaluate functions like pow or fabs with constant arguments
  • reko doesn’t perform basic arithmetic operations like 1.0+1.0
  • fbstp is decompiled as a cast but the operation is more complex
  • The output of FCMOV (if (Test(LT,SLICE(SLICE(dwArg04, byte, 8), bool, 1)))) is a bit too magic for me
  • fcmovu for unordered is unfinished (seeing only P). Also fcmovnu is the same as fcmovu
  • looks like fcmovnb is decompiled as fcmovb
  • fcom uses a magic cond() function
  • additional noise in integer code in the fcom function
  • output for fcomi and ftst is incomplete
  • fist and frndint don’t respect the rounding mode
  • wrong parameter values for fpatan
  • looping because of partial remainder looks wrong
  • leaves unused calls for fptan, fsincos, fxtract
  • wrong parameter used for fyl2x
  • modelling of the status word is incorrect/not easy to understand
12 issues on github

retdec

  • wrong modelling of f2xm1
  • doesn’t evaluate functions like pow or fabs with constant arguments
  • confusing use of float80_t for fbst
  • wrong decompilation of fbld
  • missing code for fcom and ftst
  • gets confused in fcomi, fptan and fsincos
  • fist and frndint don’t respect the rounding mode
  • fld for 32bit floats loads wrong number
  • bad modeling of fpatan with only one parameter
  • incorrect implementation of fprem1
  • doesn’t model C2 change of fprem
  • missing memory write for 32bit fst
  • missing slot is inf for fxam
  • multiple copies of __pseudo functions for fxtract
  • superfluous calls in fxtract
  • modelling of the status word is incorrect/not easy to understand
16 issues on github

ghidra

  • displays floating point arguments to printf as hex
  • fbst is a simple assignment
  • fbld is incomplete
  • fcom and fcomi don’t test for unordered
  • doesn’t evaluate functions like fcos with constant arguments
  • fist and frndint don’t respect the rounding mode
  • fld a 80bit float doesn’t work
  • fprem and fprem1 not working
  • doesn’t model that fprem updates the status word
  • fxtract,fyl2x, fyl2xp1 is just identity
  • modelling of the status word is incorrect/not easy to understand
10 issues on github and an older one

Summary

All examined decompilers model the x87 instructions to a certain degree, which is probably good enough for common binaries. There are however, in all decompilers, x87 details which are not supported: rounding modes, handling of unordered compares and the BCD format.

Thank you for reading and please send questions or feedback via email to holdec@kronotai.com or contact me on Twitter.


The Internal Representation of holdec

Overview

Well there are multiple representations: a linear list of assembler lines, assembler lines in blocks, blocks of statements and high level control flow instructions (internally named Regions). The first two are nothing special and the last is worth a blog entry of its own. So this blog post is about blocks of statements. One indicator that this is the default representation is that it doesn’t have its own name in the holdec code. Let us call it IR in this blog post.

The IR represents one function and can be seen at three different levels. The first level is that a function consists of a set of blocks. Each block has a list of statements (which can be empty) and a list of jumps. A jump consists of the target block id and an optional condition. On this first level the code performs most of the CFG algorithms.

The second level of the IR is about the meat in each block in the form of statements. The two most common statement types are assignment and return. The assignment has a RHS (an expression) and an optional LHS (a register). The return statement has an optional return value which is also an expression. This level is used by data flow analysis and is relevant for SSA.

This brings us to the third level: the expressions. There are dozens of expression types like add, sin, register, function call or load. An expression is a tree. Leaf types are for example a global variable, a register or an integer value. Most of the expression types however are not leaf types but form the inner tree nodes. Each expression has a basic type (integer of various widths but without information about signedness, float of various sizes, bool/single bit, void, but no pointer). This basic type is always there and is always checked, unlike the full type analysis which is optional. Simplification, for example, takes place on this level.
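A minimal sketch of such a typed expression tree with its basic-type check (the names are illustrative, not holdec’s actual classes):

```c
#include <stddef.h>

typedef enum { T_INT32, T_FLOAT64, T_BOOL, T_VOID } BasicType;
typedef enum { E_CONST_INT, E_REGISTER, E_ADD } ExprKind;

typedef struct Expr {
    ExprKind kind;
    BasicType type;                    // every expression carries a basic type
    const struct Expr *left, *right;   // inner nodes have children, leaves don't
} Expr;

// the basic-type invariant for ADD: both children must have the
// parent's basic type
int well_typed(const Expr *e) {
    if (e->kind == E_ADD)
        return e->left && e->right
            && e->left->type == e->type
            && e->right->type == e->type;
    return 1;  // leaves are trivially consistent
}
```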

This design worked out quite well in the past. In the following I want to go over some details in an unordered way after describing the expressions in more detail.

Expression types

Leaf types

  • Concrete values can be integer, float, boolean and string (think printf format string).
  • There is an expression type for a function. A function call expression uses this if the function is known and not computed.
  • Global variable
  • Name used for accessing structs and co.
  • NOT_MODELED (holdec doesn’t model some semantics), UNDEF (CPU documentation says that the value is undefined) and TRASHED (eflags after a function call). All three are undefined for ISDEF(x).
  • A parameter: used to access the parameter value in the called function. Is independent of the parameter transfer place.
  • Register expression: an important expression type

There are some additional helper expressions: an expression for the initial stack pointer, the initial value (used for all other CPU registers to detect passed values) or an expression for the fake edge from an infinite loop to the exit block.

Integer arithmetic

  • ABS(x): rarely used
  • NEG(x): can also be used for floats.
  • ADD(x,y) : can have more than two children. Can also be used for floats.
  • ADD_OVERFLOW_SIGNED(x,y): returns a bool
  • SUB(x,y): can also be used for floats
  • SUB_BORROWS_SIGNED(x,y): returns a bool
  • SMULT(x,y), UMULT(x,y): signed and unsigned multiply, can have more than two children
  • SDIV(x,y), UDIV(x,y): signed and unsigned division
  • SMOD(x,y), UMOD(x,y): signed and unsigned remainder

Integer bitwise

  • AND(x,y): can have more than two children
  • NOT(x)
  • OR(x,y): can have more than two children
  • XOR(x,y): can have more than two children

Integer shift and rotate

  • SHL(src, count): shift left
  • SSHR(src,count), USHR(src,count): signed and unsigned shift right

There are four rotate expressions but these are only used by M68k. It looks like the M68k part can be changed to behave like the ia32 part (which converts them to shifts).

Boolean operations

  • BAND(x,y): boolean and. Can have more than two children.
  • BOR(x,y): boolean or. Can have more than two children.
  • BXOR(x,y): boolean xor. Can have more than two children. Can be used to nicely compute the parity bit. Could be replaced with CMP(x!=y).
  • BNOT(x): negate the boolean

Only top-level expressions can have side effects, so for example the children of BAND cannot. Therefore the short-circuit behavior of && or || from C is not relevant here.
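As an aside, the parity computation mentioned for BXOR boils down to XOR-ing all bits (illustrative sketch; note that x86’s PF is the complement, set when the low byte has an even number of one bits):

```c
#include <stdint.h>

// XOR over all bits of a byte: returns 1 iff an odd number of bits are set
int odd_parity8(uint8_t v) {
    int p = 0;
    for (int i = 0; i < 8; i++)
        p ^= (v >> i) & 1;
    return p;
}
```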

Comparison

  • CMP(value1, relation, value2): can be used for integer and booleans. Relation is one of ==, != or signed/unsigned <,>,<=,>=. Has a very large number of simplification rules.
  • FCMP(value1, relation, value2): can be used for floats. Relation is one of ==, !=, <,>,<=,>=.

Both comparison expression types return a boolean.

Float operations

  • ADD, SUB and NEG from above
  • FMULT(x,y)
  • FDIV(x,y)
  • SIN(x)
  • FLOAT_ISINFINITE(x)

The list will grow when x87 is fully implemented. Or maybe not when the additional instructions are expressed with builtin functions.

Type conversions

  • NARROW(x): just takes the lower bits (as much as the target type has)
  • WIDEN(x), UNSIGNED_EXTEND(x): prefix the value with zeros
  • SIGNED_EXTEND(x): repeat the MSB
  • BIT_TO_INT(x): fancy way of saying “!!x” or “x?1:0” but can be helpful since the target value range is known to be 0 or 1.
  • SIGNED_INT_TO_FLOAT(x) and UNSIGNED_INT_TO_FLOAT(x): pretty clear
  • FLOAT_TO_SIGNED_INT(x) and FLOAT_TO_UNSIGNED_INT(x): this is currently a bit rough. Either add a rounding mode as a parameter, or have a separate expression for each rounding mode, or make sure that a rounding is always performed before.
  • INT_TO_FLOATBITS(x), FLOATBITS_TO_INT(x): the bit size and the bit pattern stays the same but the type and therefore the interpretation of the bit pattern changes.
  • FLOAT_NARROW(x) and FLOAT_WIDEN(x): for converting between different float types
  • COMBINE_HIGH_LOW(high,low): just paste these two together

Memory related

  • LOAD(addr): reads a value from memory. For special memory accesses (like “ldar” from AArch64) builtin functions are used. The lock prefix of ia32 is currently ignored.
  • STORE(addr, value): writes a value to memory
  • ADDR(var): gets the address of global variable, local variable or a parameter.

Other expression types

  • F_CALL(func, para1, para2,…): a function call
  • PHI(x,y): for SSA. Can have more than two children.
  • BIT_SET(bitNo, src, bitValue): updates a bit in src to a new value
  • BIT_TEST(bitNo, src): queries one bit in src
  • COMBINE_BITS(bit1, bit2, bit3,…): Can be used to model bytes with undefined bits. Number of children must match number of bits of one of the integer types. For example the decompiler doesn’t have to model the complete x87 FPU status word since most of the time only a few bits (used for testing the comparison result) matter.
  • EXTRACT_BITS(value, numBits, srcPos, destPos): extracts numBits from the src at the srcPos (and going to the MSB) and shift them to destPos afterwards. So something like “((value >> srcPos) & maskOf(numBits))<<destPos”.
  • SEGMENT2ADDR(segmentReg)
  • TERNARY(cond, trueValue, falseValue)
  • CONSTRUCT(named1, named2,…), NAMED(name, value) and EXTRACT(base, name) for supporting multiple return values
  • ISDEF(x): false iff x is undefined
  • ACCESS_ARRAY(base, index), ACCESS_MEM(base, name), ACCESS_UNION(base, name): generated by the full type analysis to denote different access types.
  • MARKER_SIGNED(x), MARKER_UNSIGNED(x): Related to full type analysis. Generated early but then converted to type traits.
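The EXTRACT_BITS description above translates directly to C (illustrative sketch):

```c
#include <stdint.h>

// EXTRACT_BITS(value, numBits, srcPos, destPos):
// ((value >> srcPos) & maskOf(numBits)) << destPos
uint32_t extract_bits(uint32_t value, unsigned numBits,
                      unsigned srcPos, unsigned destPos) {
    uint32_t mask = (numBits >= 32) ? 0xFFFFFFFFu : ((1u << numBits) - 1u);
    return ((value >> srcPos) & mask) << destPos;
}
```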

Some More Details and Design Notes

All memory access is modeled with expressions. There is a load expression which takes the address. The basic type of the load expression determines the access width. The store expression takes the address and the value to store. The basic type of the store expression is void.


An expression can have a side effect (a memory store or a call to a function which isn’t annotated as side-effect free). Such expressions are only valid at the top level of the RHS of an assignment statement. This means that each statement has at most one side effect. The return value and jump conditions are side-effect free. So the list of statements is the only ordering feature of the IR. Or in other words: there is no comma operator.


In computer science a basic block has at most two outgoing edges. Supporting more than two outgoing edges has benefits and I didn’t encounter (or maybe didn’t see) any disadvantages. One benefit is the easy construction of short-circuit boolean expressions. It is easy to see that:

 if(cond1) goto block 3;
if(cond2) goto block 3;
goto block 4;

can be transformed to:

 if(cond1 || cond2) goto block 3;
goto block 4;

The and case can be transformed from:

 if(!cond1) goto block 4;
if(!cond2) goto block 4;
goto block 3;

to

 if(!cond1 || !cond2) goto block 4;
goto block 3;

and finally

 if(cond1 && cond2) goto block 3;
goto block 4;

If the jumps are not together we can move a jump up the list by adding the negated condition of the previous jump. So

 if(cond1) goto block 3;
if(cond2) goto block 4;
if(cond3) goto block 3;
goto block 5;

can be converted to

 if(cond1) goto block 3;
if(!cond2 && cond3) goto block 3;
if(cond2) goto block 4;
goto block 5;

and finally

 if(cond1 || (!cond2 && cond3)) goto block 3;
if(cond2) goto block 4;
goto block 5;

Most of the expression types require that their children have the same basic type. So you can not add a 2-byte integer to a 2-byte float. Most of the expression types also require that the parent’s basic type and the child(ren)’s basic type are the same. There are of course type conversion expressions like “unsigned extend”. The parameters of a function call have to match the basic types of the signature. This means that for functions with variable parameter types (varargs in C) the signature has to be forked and made concrete for each call site, based on for example the format string. Also the basic types of the LHS and RHS of an assignment statement have to be equal. In general the checks related to the basic type caught quite a few bugs, the most recent one involving lea with an address override prefix.


Repeat after me: “Registers are good, memory is evil!” Both are aliased: when the instruction inc eax is executed, this also changes ax, ah and al. That is the aliasing of registers. This aliasing is broken up during the transformation into the IR, meaning that the instruction will create multiple assignment statements:

 reg_eax = reg_eax+1;
reg_ax = getLowWord(reg_eax);
reg_ah = getHighByte(reg_ax);
reg_al = getLowByte(reg_ax);

So there is no aliasing anymore from registers in the IR. The register expressions will also be transformed into SSA and the decompiler can easily reason about their usage. For memory this is sadly not the case. This is especially important for local stack variables but also for memory in general. In the best case the decompiler can reason that a certain stack range is not aliased and convert this to a register expression. It is possible that such a register is later removed since no code depended on its value.
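The helper functions used in the IR snippet above could look like this in C (an illustrative sketch):

```c
#include <stdint.h>

// sub-register extraction helpers for the x86 register aliasing rules
uint16_t getLowWord(uint32_t x)  { return (uint16_t)x; }
uint8_t  getHighByte(uint16_t x) { return (uint8_t)(x >> 8); }
uint8_t  getLowByte(uint16_t x)  { return (uint8_t)x; }
```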


The SSA form of the IR only differs in two details from the non-SSA form:

  • register expressions (which are the only ones which are versioned currently) get a version suffix. So “reg_a” becomes “reg_a_2”. Look at the example below.
  • assignment statements with PHI expressions in the RHS are added where required. Such statements are always at the start of a block.

Note that holdec only puts register expressions under SSA, but it tries to convert local variables to register expressions and propagates memory values if it knows that these values do not change. One side effect here is that the user can declare certain memory areas as volatile, meaning that they can change by other means (hardware, other threads, OS, …).


Per se the addresses of the instructions are not carried over into the IR. So neither a statement nor a block knows its origin. I didn’t find any good reason for carrying this information along. After all the passes, the blocks and statements are usually so heavily modified that it doesn’t matter where they originated. There are/were exceptions when the instruction address is required for identification purposes in relation to the info file.


Holdec supports the return of multiple values. This makes, for example, modeling CPU instructions easier (like the values from the “cpuid” instruction) and is required to describe functions found in the wild (which may even return data in the flags register). C allows you to pass pointers which are used by the called function to return multiple values. Or structs are used. Common to these approaches is that they use memory. And memory is evil. So there are three magic expressions in holdec to return multiple values without using memory.

// Code in the called method:

return Construct(Named("result1", 42),
                 Named("result2", 12.34));

// Calling method and extracting:
// reg_a has a dummy basic type here

reg_a = theMethod(...);
reg_int_value = Extract(reg_a, "result1");
reg_float_value = Extract(reg_a, "result2");

A function has exactly one return statement. This is easy to accomplish in the blocks-of-assembler-lines stage by adding a new block and replacing the “ret”s with jumps to the new block. But there are also methods which do not return. There are a lot of different ways: Swift uses breakpoints when an overflow is detected, libc has exit(), then there is pthread_exit(), Linux has sys_exit and other systems have other means. If the decompiler doesn’t know about this property it will decompile after the non-returning instruction, yielding incorrect output. The first step is of course recognizing such cases. The second is how to model it in the IR. My first attempt was a terminate statement. This caused all sorts of special cases all over the place in the code which didn’t feel right. I later changed this to insert an endless loop with a special body after such an instruction:

block30:
  (void) exit(0);
  goto block31;

block31:
  __postNoReturnMarker();
  goto block31;

Such a special loop is detected during output and the string “// not reached” is shown instead.

The other way of contradicting the invariant of having exactly one return statement are functions which never return at all. For these another special builtin function is added and the whole function body is wrapped in a new condition which is also skipped in the output:

void myexit() {
  if(__alwaysTrue()) {
    printf("ERROR: ...", ...);
    exit();
    // not reached
  }
  return;
}

A list of the basic types:

  Type              Suffix   C syntax
  void              v        void
  boolean           z        bit
  1 byte integer    b        d1
  2 byte integer    w        d2
  4 byte integer    l        d4
  8 byte integer    q        d8
  16 byte integer   o        d16
  1 byte float      c        f1
  2 byte float      h        f2
  4 byte float      s        f4
  8 byte float      d        f8
  10 byte float     e        f10
  16 byte float     g        f16

The suffixes can be seen in a verbose output variant:

reg_pp_20.l = PHI(reg_var5_0.l, reg_var5_1.l).l;
reg_pp_19.q = PHI(reg_var4_0.q, reg_var4_1.q).q;
reg_pp_18.q = PHI(reg_var3_0.q, reg_var3_1.q).q;
reg_pp_17.q = PHI(reg_var2_0.q, reg_var2_1.q).q;
reg_ao_0.l = LOAD(ADD(reg_pp_19.q, 12.q).q).l;
reg_ap_0.l = UNSIGNED_EXTEND(LOAD(ADD(reg_pp_19.q, 25.q).q).b).l;

The names of register expressions do not match the physical register names you may expect: for i386 you see “reg_a” instead of “reg_eax” or “reg_Bl” instead of “reg_es”. There are multiple reasons:

  • the renaming makes the names shorter and easier to understand (my opinion)
  • for addressing vector lanes you would need to include special characters like []
  • there are other sources of register expressions than physical registers. Examples are the number of elements on the x87 FPU stack or register expressions from converting local stack memory.

Comparing with other IRs

The IR of Hex-Rays (called microcode) is nicely described in a conference talk by Ilfak Guilfanov. One example from this talk is the following four instructions:

004014FB mov eax, [ebx+4]
004014FE mov dl, [eax+1]
00401501 sub dl, 61h ; 'a'
00401504 jz  short loc_401517

which have the following initial 20 microinstructions:

2. 0 mov   ebx.4, eoff.4          ; 4014FB u=ebx.4                       d=eoff.4
2. 1 mov   ds.2, seg.2            ; 4014FB u=ds.2                        d=seg.2
2. 2 add   eoff.4, #4.4, eoff.4   ; 4014FB u=eoff.4                      d=eoff.4
2. 3 ldx   seg.2, eoff.4, et1.4   ; 4014FB u=eoff.4,seg.2,(STACK,GLBMEM) d=et1.4
2. 4 mov   et1.4, eax.4           ; 4014FB u=et1.4                       d=eax.4

2. 5 mov   eax.4, eoff.4          ; 4014FE u=eax.4                       d=eoff.4
2. 6 mov   ds.2, seg.2            ; 4014FE u=ds.2                        d=seg.2
2. 7 add   eoff.4, #1.4, eoff.4   ; 4014FE u=eoff.4                      d=eoff.4
2. 8 ldx   seg.2, eoff.4, t1.1    ; 4014FE u=eoff.4,seg.2,(STACK,GLBMEM) d=t1.1
2. 9 mov   t1.1, dl.1             ; 4014FE u=t1.1                        d=dl.1

2.10 mov   #0x61.1, t1.1          ; 401501 u=                            d=t1.1
2.11 setb  dl.1, t1.1, cf.1       ; 401501 u=dl.1,t1.1                   d=cf.1
2.12 seto  dl.1, t1.1, of.1       ; 401501 u=dl.1,t1.1                   d=of.1
2.13 sub   dl.1, t1.1, dl.1       ; 401501 u=dl.1,t1.1                   d=dl.1
2.14 setz  dl.1, #0.1, zf.1       ; 401501 u=dl.1                        d=zf.1
2.15 setp  dl.1, #0.1, pf.1       ; 401501 u=dl.1                        d=pf.1
2.16 sets  dl.1, sf.1             ; 401501 u=dl.1                        d=sf.1

2.17 mov   cs.2, seg.2            ; 401504 u=cs.2                        d=seg.2
2.18 mov   #0x401517.4, eoff.4    ; 401504 u=                            d=eoff.4
2.19 jcnd  zf.1, $loc_401517      ; 401504 u=zf.1

After some transformation steps the final microcode instruction is:

jz [ds.2:([ds.2:(ebx.4+#4.4)].4+#1.4)].1, #0x61.1, @7

I put these four instructions and some additional code into a full example to look at how holdec sees them. The blocked assembler stage is:

08049079.0+3 mov eax,ds:[ebx+0x4]
0804907c.0+3 mov dl,ds:[eax+0x1]
0804907f.0+3 sub dl,0x61
08049082.0+2 l.jz L6

In the next step the initial IR of this block is as follows:

  reg_a = LOAD(SEGMENT2ADDR(reg_Bk) + reg_b + 4);
  reg_l = _holdec_getLowWordL(reg_a);
  reg_Bb = _holdec_getLowByteL(reg_a);
  reg_w = _holdec_getHighByteL(reg_a);

  reg_Be = LOAD(SEGMENT2ADDR(reg_Bk) + reg_a + 1);
  reg_d = _holdec_setLowByteL(reg_d, reg_Be);
  reg_o = _holdec_setLowByteW(reg_o, reg_Be);

  reg_eflags_sf = BIT_TEST(7, reg_Be + 97 * -1);
  reg_eflags_zf = reg_Be + 97 * -1 == 0;
  reg_eflags_pf = _holdec_ia32_parity(reg_Be + 97 * -1);
  reg_eflags_af = _holdec_ia32_adjustFlagAddSub(reg_Be, 97, reg_Be + 97 * -1);
  reg_eflags_of = SUB_BORROWS_SIGNED(reg_Be, 97);
  reg_eflags_cf = reg_Be < 97;
  reg_cmp_as = reg_eflags_af;
  reg_cmp_b = MARKER_UNSIGNED(reg_Be) < MARKER_UNSIGNED(97);
  reg_cmp_be = MARKER_UNSIGNED(reg_Be) <= MARKER_UNSIGNED(97);
  reg_cmp_ds = reg_eflags_df;
  reg_cmp_l = MARKER_SIGNED(reg_Be) < MARKER_SIGNED(97);
  reg_cmp_le = MARKER_SIGNED(reg_Be) <= MARKER_SIGNED(97);
  reg_cmp_o = reg_eflags_of;
  reg_cmp_p = UNDEF;
  reg_cmp_s = UNDEF;
  reg_cmp_z = reg_Be == 97;
  reg_Be = reg_Be + -97;
  reg_d = _holdec_setLowByteL(reg_d, reg_Be);
  reg_o = _holdec_setLowByteW(reg_o, reg_Be);

  if(ISDEF(reg_cmp_z) ? reg_cmp_z : reg_eflags_zf) goto L6;
  goto L3;

To understand these you need the translation table:

  reg_a    eax
  reg_b    ebx
  reg_d    edx
  reg_l    ax
  reg_o    dx
  reg_w    ah
  reg_Bb   al
  reg_Be   dl
  reg_Bk   ds

Some aspects I want to point out:

  • SEGMENT2ADDR(reg_x) will become zero if reg_x is not modified in the function
  • there is always the struggle to decide which modeling tool to use: low-level expressions like AND and OR, higher level ones like SUB_BORROWS_SIGNED or builtin functions like _holdec_ia32_parity. There will be another blog post about this topic.
  • the MARKER_SIGNED/MARKER_UNSIGNED expressions are related to the full type analysis. Which is also a topic for another blog post.
  • There is the leaf expression UNDEF (and also NOT_MODELED) and a predicate expression ISDEF.
  • There are two ways of flag tracking: the low-level one (reg_eflags_*) is modeled after the bits in the eflags register and the high-level one (reg_cmp_*) matches the condition codes that jumps and other instructions have. As you can see the high-level ones are used if possible. If the flag register is accessed directly (for example with lahf) the low-level ones are used.

After bringing the code into SSA form and removing the unused registers we get:

  reg_a_2 = LOAD(SEGMENT2ADDR(reg_Bk_0) + reg_b_0 + 4);
  reg_Be_0 = LOAD(SEGMENT2ADDR(reg_Bk_0) + reg_a_2 + 1);
  reg_eflags_zf_3 = reg_Be_0 + 97 * -1 == 0;
  reg_cmp_z_3 = reg_Be_0 == 97;
  if(ISDEF(reg_cmp_z_3) ? reg_cmp_z_3 : reg_eflags_zf_3) goto L6;
  goto L3;

Register values are propagated:

  reg_a_2 = LOAD(LOAD(InitStackPointerReg + 8) + 4);
  reg_Be_0 = LOAD(LOAD(LOAD(InitStackPointerReg + 8) + 4) + 1);
  reg_eflags_zf_3 = LOAD(LOAD(LOAD(InitStackPointerReg + 8) + 4) + 1) == 97;
  reg_cmp_z_3 = LOAD(LOAD(LOAD(InitStackPointerReg + 8) + 4) + 1) == 97;
  if(LOAD(LOAD(LOAD(InitStackPointerReg + 8) + 4) + 1) == 97) goto L6;
  goto L3;

Unused registers are removed again:

  if(LOAD(LOAD(LOAD(InitStackPointerReg + 8) + 4) + 1) == 97) goto L6;
  goto L3;

And finally, after the pass which looks at the types of certain stack areas:

   if(LOAD(LOAD(LOAD(&argv) + 4) + 1) == 97) goto L6;
  goto L3;

Thank you for reading and please send questions or feedback via email to holdec@kronotai.com or contact me on Twitter.

Posted in decompiler, holdec

Design of the holdec Decompiler

Maybe the best way to look at the design of a program is through the idea “Software development is all about managing complexity.” A decompiler has to solve low-level problems like reading both little- and big-endian ELF binaries, algorithmic problems (I found transforming FROM static single assignment (SSA) form quite hard) and subjective ones like “is a top-level continue more readable in a loop or an if?”. There are a lot of problems a decompiler must tackle and the design should support this.

The following describes the design of holdec but also outlines some of the problems a decompiler writer has to think about. The actual decompiler program is of course more complex because there are low-level problems not mentioned here. So let us look at the design of holdec.

Holdec has two major phases: a global phase and a per-function phase. The output of the first one is cached assembler lines for each identified function. In the second phase these are read and transformed into human-readable output.

Global Phase

The global phase starts with the Loader. This one is quite straightforward:

  • it identifies the file format,
  • creates an image which has all the sections loaded at the correct addresses and performs relocations and other operations the OS loader would do and
  • provides the relevant information (symbols, sections, start address, …).

The Disassembler Step is the second step. It consists of the driver, multiple CPU specific disassemblers and multiple CPU specific preparers. There is nothing decompiler specific about the disassemblers: you can use any disassembler library. I wrote 2 of the 3 disassemblers in holdec myself, mostly because I wanted to do so and no suitable libraries were available in Java (the language holdec is implemented in). I open sourced Jarmo (the Arm64 disassembler) on Github. To make the next steps easier x86 prefixes start with a slash and the size of the instruction is added to each line (e.g. "0x4200+3 nop" for a 3 byte nop). Also, coming from M68k, I found and still think that specifying the operand size neither in Intel syntax (mov DWORD PTR [ecx],0x0) nor in AT&T syntax (movl $0x0,(%ecx)) is easy to parse. So the holdec x86 disassembler adds a prefix resulting in l.mov ds:[ecx],0x0 (see also xkcd). While these changes make the disassembled instructions a bit more similar in their general syntax they are still very much CPU specific.

The Preparers are sometimes magic (a blog post about them is coming) but serve two purposes:

  1. transform the assembler instructions so that jumps and calls are visible and the control flow will not change later and
  2. reduce the work for later steps by expressing instructions with other instructions.

In the first category are jump tables, which are identified by patterns; the indirect jump is replaced with an indexed jump (a range of valid values and for each value the target address). Also in the first category is the handling of instructions/functions/OS syscalls/… which terminate the running program. These instructions have to be annotated. A third example is undoing tail recursion elimination. In the second category are mostly instructions from ia32 like JECXZ, XCHG, CMOVcc, the string instructions, everything with a rep prefix and so on. But also CSEL and friends from AArch64 can be expressed with simpler instructions and thus reduce the surface area of later stages.

Before we discuss the driver I have to touch another design decision: the Info File. The name is not great but you know about the hard problems of computer science… The general idea is that the decompiler can be run without human interaction but still with human input. The human input is saved in a file (the info file) and is the second input (in addition to the binary) for the non-interactive decompilation. While a human can provide input (for example a function signature) the decompiler also updates the info file (however never overwriting something from a human) with information such as newly found functions and their assumed signatures (based on calling conventions and usage). The decompiler will also write other information (like jump tables) to the info file which the human may change. In general it is a good idea to re-run the decompiler when the info file is changed.

Coming back to the Disassembler Driver. The tasks of the driver are:

  1. to disassemble all code areas (=code sections minus data areas in the code sections)
  2. to identify all block starts

Regarding the data areas in code sections: while an object format like ELF specifies which areas contain code, not all of a code section has to contain code. In addition there are object formats which don’t have sections at all. Constants (string, integer, floating point) and jump tables are commonly found in code sections. These can be specified in the info file and the driver will exclude them.

There are at least the following sources for block starts:

  • program start address
  • jump targets
  • call targets
  • next instruction after an unconditional control transfer (jump, return)
  • code section start addresses
  • next byte after a data area in a code section
  • function start addresses from the object format
  • addresses pointing into a code section. Such an address may occur in a data section. Relocation information can help here.

So in a loop the driver runs the disassembler on all code areas with a linear sweep, lets the preparers run over the whole list of assembler instructions and searches for new block starts. If no new starts are found the loop ends. The driver also takes care of overlapping instructions. But this is a topic for another blog post.

The third step is the Function Splitting: for each known function collect all blocks and write the assembler lines to a file. Note that blocks can be shared by multiple functions. Thus the result of the global phase is a set of cached assembler lines for each known function.

Per Function Phase

Holdec allows you to select on the command line which functions should be decompiled. This can be a single function, all functions or most functions (all functions excluding some known common ones like libc startup code). Each function from the selected set is decompiled on its own. The only information transfer is through the info file. So I can repeat here: it may make sense to re-run the decompiler after an info file change.

The decompilation of a function is nicely split into passes. There are currently around 50 different pass types and in total around 90 passes are executed. Each pass writes some output (assembler lines, internal representation, dot file for graphviz) but outside of development only the outputs of the last passes are used.

The major passes for the start are:

  1. split assembler lines into blocks
  2. transform each assembler line into the internal representation (aka parsing, lifting or rewriting). This pass is of course CPU specific while the remaining ones are CPU agnostic.
  3. transform to SSA
  4. remove unused variables. A major milestone is reached after this pass since the parser has to generate a lot of statements which may or may not be used. This pass reduces the number of statements to 5-15%.
  5. simplification (aka symbolic execution). Execute a lot of rules which make expressions simpler (x+3==4 => x==1, x XOR x => 0) or canonical (x<=9 => x<10). This also includes a boolean minimization engine using Karnaugh maps.
  6. register value propagation
  7. memory value propagation

In the middle special one-time passes are executed and the basic ones (remove unused variables, simplification and register value propagation) are repeated after each special one. Example of such special passes are:

  • recognize division by multiplication
  • undo SSA and transform to SSA again (this is not a nop and allows further passes a fresh view on the program)
  • load constants (string, integer, float) from read-only sections
  • recognize format strings (printf, scanf and co)
  • replace stack area/variable with real variable under certain conditions

Passes executed in the end are:

  • undo SSA
  • recognize high level control flow (heavy)
  • transform code based on the high level control flow (for example introduce induction variable into a loop) (heavy)
  • optional: full type recognition (heavy)
  • output in multiple formats: as high level C-like source code, blocks with statements and/or .dot for graphviz

In the following blog posts I want to go into some detail for specific parts. So stay tuned.

Thank you for reading and please send questions or feedback via email to holdec@kronotai.com or contact me on Twitter.

Posted in decompiler, holdec

Testing decompiler semantics

easier to understand for a programmer and uses the same semantics as the binary program. Both properties (don’t change the semantics and produce high level source code) are required. Without the first you get a nice looking program which doesn’t represent the starting binary. Without the latter you may just look at a disassembler listing.

Modelling instructions

One important early step of preserving the semantics is the transformation from CPU instructions to a (CPU agnostic) internal representation. This can be done in two sub-steps (binary instructions to disassembled instructions and then disassembled instructions to internal representation) or in one sub-step. One advantage of using two sub-steps is that a 3rd party disassembler can be used which reduces development time. Regardless of the implementation the outcome is a list of internal statements for each binary instruction. For i386 such statements usually include: the actual operation, the update of the flags and the update of smaller and larger registers (inc %ax will also update %al, %ah and %eax). The task of verifying that all the side effects of the binary instructions were correctly modeled is made more difficult by the fact that unused side effects are discarded by the decompiler. So the test design should capture all side effects. This is done by folding all side effects into the return value. I used two test function types: one for the register values and one for the flag values. In the end a test method looks like this:
080805b3 <inst_101_values_var_7>:
// save registers
 80805b3:       push   %ebx
 80805b4:       push   %edx
 80805b5:       push   %ebp
 80805b6:       push   %esi
 80805b7:       push   %edi
 80805b8:       push   %gs
// initialize all registers with random values
 80805ba:       mov    $0xf2e8b1b3,%eax
 80805bf:       mov    $0xf96ff545,%ebx
 80805c4:       mov    $0xc8a6ed47,%ecx
 80805c9:       mov    $0x4e0f4c05,%edx
 80805ce:       mov    $0x50690c3e,%ebp
 80805d3:       mov    $0xb7e7569c,%esi
 80805d8:       mov    $0xb3138312,%edi
// set all flags to a defined state
 80805dd:       neg    %eax
// the operation under test
 80805df:       inc    %ax
// merge all into %eax
 80805e1:       add    %ebx,%eax
 80805e3:       add    %ecx,%eax
 80805e5:       add    %edx,%eax
 80805e7:       add    %ebp,%eax
 80805e9:       add    %esi,%eax
 80805eb:       add    %edi,%eax
// subtract the expected value; the value comes from
// the execution on a real CPU
 80805ed:       sub    $0xd8a162cb,%eax
// restore registers and return
 80805f2:       pop    %gs
 80805f4:       pop    %edi
 80805f5:       pop    %esi
 80805f6:       pop    %ebp
 80805f7:       pop    %edx
 80805f8:       pop    %ebx
 80805f9:       ret
The variant which tests flags is similar:
080b7493 <inst_226_flags_var_0>:
...
 80b74bd:       neg    %eax
// the operation under test
 80b74bf:       sbb    $0x20,%ah
// %ebp = CF?1:0
 80b74c2:       mov    $0x0,%ebp
 80b74c7:       mov    $0x1,%eax
 80b74cc:       cmovb  %eax,%ebp
// %edi = OF?2:0
 80b74cf:       mov    $0x0,%edi
 80b74d4:       mov    $0x2,%eax
 80b74d9:       cmovo  %eax,%edi
// %esi = PF?4:0
 80b74dc:       mov    $0x0,%esi
 80b74e1:       mov    $0x4,%eax
 80b74e6:       cmovp  %eax,%esi
// %edx = SF?8:0
 80b74e9:       mov    $0x0,%edx
 80b74ee:       mov    $0x8,%eax
 80b74f3:       cmovs  %eax,%edx
// %ecx = AF?16:0
// there is no condition code for AF but with
// the aaa instruction we can convert AF to CF
 80b74f6:       mov    $0x0,%ecx
 80b74fb:       mov    $0x0,%al
 80b74fd:       aaa
 80b74fe:       mov    $0x10,%eax
 80b7503:       cmovb  %eax,%ecx
// add them together
 80b7506:       xor    %eax,%eax
 80b7508:       add    %ebp,%eax
 80b750a:       add    %edi,%eax
 80b750c:       add    %esi,%eax
 80b750e:       add    %edx,%eax
 80b7510:       add    %ecx,%eax
// subtract the expected value
 80b7512:       sub    $0x6,%eax
 80b7515:       pop    %gs
...

Reducing expressions

As you can see from the listing these tests assume that the decompiler evaluates/emulates/simplifies the instructions into a simple return 0. I know from the past that some decompilers are stronger and some are weaker in their ability to reduce complex expressions to simpler ones. To avoid additional barriers I chose for the above tests only simple instructions and corresponding transformations. In general a decompiler has to perform transformations to achieve the goal of high-level output. This can be as simple as x^x => 0 or merging additions and shifts into a multiplication (see for example this godbolt test). Normally at least one of the expressions is a variable (for example a function parameter with unknown value) while in the tests from above all values are fixed. However supporting the transformation of such expressions shouldn’t be difficult if the framework for expressions with a variable is already in place. And yes, this means that the decompiler is in fact a CPU emulator.

Results

This is for the integer instructions of i386+. The ELF binary and the results are on github. This binary test file contains two methods (one for the register values and one for the flag values) for each of the 290 instruction variants defined in the input file. So 580 functions in total.

Ghidra (version 9.1.2)

For some reason ghidra thinks in 81% of the cases (469/580) that edx is an input register but doesn’t model it as a function parameter. Also odd is that the return type is 64bit and that the register is shifted 32 bits to the left. But in general this looks like the expected “return 0”:
longlong inst_0_values_var_0(void)
{
  uint in_EDX;
  return (ulonglong)in_EDX << 0x20;
}
In around 100 cases the parity flag is not modeled:
undefined8 inst_0_flags_var_0(void)
{
  undefined4 in_EDX;
  int iVar1;
  bool in_PF;
  iVar1 = 0;
  if (in_PF) {
    iVar1 = 4;
  }
  return CONCAT44(in_EDX,iVar1 + -4);
}
In the remaining cases the instruction is modeled with a loop and ghidra is unable to reduce this loop when all values are constant. The following case contains a bit scan forward (bsf):
undefined8 inst_30_values_var_0(void)
{
  int iVar1;
  undefined4 in_EDX;
  iVar1 = 0;
  while ((0x25e0215eU >> iVar1 & 1) == 0) {
    iVar1 = iVar1 + 1;
  }
  return CONCAT44(in_EDX,iVar1 + -1);
}

Retdec (current git master@a82f16e9)

In 97% (563/580) of the cases retdec returns 0 as expected. All other cases are non-zero concrete values.

Reko (current git master@9ed122)

There are 133 flag value functions which return zero. It looks like they are from instructions which do not change flags at all. Reko is sadly not able to reduce basic expressions further:
word32 inst_10_values_var_0()
{
  return -2802897708 + 2802897708;
}
This makes it harder to assess the degree of correctness. Using grep I found 35 cases of the above with identical values. Other cases are also ok but can’t be easily detected (-0xD0AA51D7 == 794144297):
word32 inst_7_values_var_0()
{
  return -794144297 - 0xD0AA51D7;
}
Reko doesn’t remove branches it detects as not taken (if(false)) which probably prevents a simplification:
word32 inst_7_flags_var_0()
{
  word32 ebp_33 = 0x00;
  if (false)
    ebp_33 = 0x01;
  word32 edi_37 = 0x00;
  if (OVERFLOW(0x5F))
    edi_37 = 0x02;
  word32 esi_42 = 0x00;
  if (!P)
    esi_42 = 0x04;
  word32 edx_48 = 0x00;
  if (false)
    edx_48 = 0x08;
  return ebp_33 + edi_37 + esi_42 + edx_48 - 0x07;
}
In the above case it is also unclear where P comes from.

Snowman (current git master@aba7af)

Snowman returns zero in 133 cases. Similar to reko, snowman also doesn’t perform the additions:
int32_t inst_222_values_var_0() {
  return -0x616cb887 - 0x52aee46f + 0x1aec8e77 - 0xab5e0a0 - 0x1e414626 - 0x7a54e6ce - 0x588d02fe - 0x6af7e0ee;
}
Similar to reko, branches not taken are not removed, which makes the output harder to understand:
int32_t inst_63_values_var_0() {
  int32_t ebp1;
  ebp1 = 0x3f43bd7f;
  if (0) {
    ebp1 = 0x223bf959;
  }
  return -0x4632d959 - 0x20980a29 + 0x4f46ab54 + 0x223bf959 + ebp1 - 0x1fbdb32d - 0x52def38c + 0x28a1280f;
}
It looks like some of the tested instructions are not supported by snowman:
int32_t inst_40_values_var_0() {
  __asm__("btc si, 0xcd");
  return -0x6cc9a392 + 0x712f984b + 0x594b4859 + 0x4a23f663 - 0x1f526abe + 0x36d033e2 - 0x756e1d3d - 0x49e4bf5c;
}
and
int32_t inst_56_values_var_0() {
  int32_t ebx1;
  ebx1 = 0xd3699e30;
  if (__intrinsic()) {
    *reinterpret_cast<int16_t*>(&ebx1) = 0x3dc3;
  }
  return -0x9cea9113 + ebx1 + 0x58c3d6e6 + 0x4deac45f + 0x6398c301 - 0x43786f8a + 0x8993dc3 - 0x5e7399c;
}

Hopper (4.5.29-demo)

I can’t export the output code to a file and therefore can only look at the output at random:
  • no simple arithmetic transformation similar to reko and snowman
  • no value propagation (like from the eax assignment to the return statement)
  • saving and restoring registers is visible
  • it is unclear where the magic CPU_FLAGS comes from
  • various instructions like rcl not implemented
  • no masking of shift counts
Some sample functions:
int inst_20_values_var_0() {
    eax = -0x21ab6f3b + 0x6855b113 - 0x46aa41d8;
    gs = gs;
    return eax;
}
int inst_20_flags_var_0() {
    stack[-4] = ebx;
    stack[-8] = edx;
    stack[-12] = ebp;
    stack[-16] = esi;
    stack[-20] = edi;
    esp = esp - 0x18;
    stack[-24] = gs;
    esi = 0xfa629d75;
    ebp = 0x0;
    if (esi < 0x0) {
            ebp = 0x1;
    }
    edi = 0x0;
    if (CPU_FLAGS & O) {
            edi = 0x2;
    }
    edx = 0x0;
    if (CPU_FLAGS & S) {
            edx = 0x8;
    }
    esi = 0xfa629d75;
    eax = edx + esi + edi + ebp + 0x0 - 0x8;
    gs = stack[-24];
    return eax;
}
int inst_187_values_var_0() {
  eax = (-0x35f0dfb3 << 0x4596d164) + 0xd4303404 - 0x752238d4;
  gs = gs;
  return eax;
}

Holdec (unreleased version)

The current version returns zero for all functions but it took 2 weeks to arrive at this point. Quite a few bugs were found and also some TODOs are now done. While it is known that C has undefined behavior it turns out that i386+ CPUs also have their share (mostly in the flags department). Also related to flags, I learned that the Adjust Flag (AF) is set by some instructions but only used by aaa, aas, daa and das, which work with packed and unpacked BCD values. These instructions are very rarely used and were removed in x64. Or in other words: supporting AF correctly is a rather low prio task.

The future

  1. Ensure that holdec correctly models the additional instructions like popcnt.
  2. Create a new test for the instructions of x64 and test that holdec produces “return 0”s.
  3. Create a new test for i386 FPU instructions and test that holdec supports these correctly.
  4. Create new tests for AARCH64: integer instructions, integer and float vector instructions, …

Posted in decompiler

Introducing the DMI the Decompiler Maturity Index

Decompiler Maturity Index (DMI). A decompiler with a higher index is not automatically “better” but has more features.

Criteria

DMIC stands for DMI Criterion.

Basic

  • DMIC-A1: I’m able to build and run the decompiler
  • DMIC-A2: The decompiler is able to work on one subject (maybe provided by the project, otherwise chosen by me)
  • DMIC-A3: The project is active: There is a commit in the last 3 months.
  • DMIC-A4: The project has a history: There are 100 commits.
  • DMIC-A5: The decompiler is able to detect and output a simple loop.
  • DMIC-A6: The decompiler performs simple expression simplifications.
  • DMIC-A7: The decompiler produces some basic output for the hexdump subject (or a similar complex subject).
  • DMIC-A8: The decompiler supports ia32 ELF. This is important since a lot of test subjects are in the ia32 ELF format.
  • DMIC-A9: The decompiler models all flag changes but also removes unused flag assignments.
  • DMIC-A10: The decompiler recognizes the number of arguments for a stack based method call.
  • DMIC-A11: The decompiler propagates register values to other statements, in the same block and into other blocks of the same function.
  • DMIC-A12: The decompiler understands indexed jumps using a jump table as generated by switch-case.

Intermediate

  • DMIC-B1: The decompiler supports more than one CPU architecture. This ensures that the core is quite generic. i386 and AMD64 count here as one.
  • DMIC-B2: The decompiler supports more than one executable format.
  • DMIC-B3: The decompiler understands advanced i386 opcodes like string operations with rep prefix or the cpuid instruction.
  • DMIC-B4: The decompiler has a GUI.
  • DMIC-B5: The decompiler understands printf/scanf format strings and passes the correct number of arguments in the function call.
  • DMIC-B6: The decompiler outputs a normal incrementing loop as a for loop.
  • DMIC-B7: The decompiler detects and outputs short circuit boolean expressions (|| and &&).
  • DMIC-B8: The decompiler outputs string/int literals from the data segment when possible (for example read-only).
  • DMIC-B9: The decompiler supports FPU operations and types.
  • DMIC-B10: The decompiler detects and removes jumps which can not be taken and dead blocks.
  • DMIC-B11: The decompiler propagates memory values to other statements.
  • DMIC-B12: The decompiler is able to output the fields of a struct when these fields are used.
  • DMIC-B13: The decompiler is able to cope with code which casts/uses union to interpret values in a conflicting way.
  • DMIC-B14: The decompiler detects local variables and replaces unstructured stack with local variables.

Advanced

  • DMIC-C1: The decompiler understands at least one way where the subject calls the OS directly. Think Linux syscalls, MS-DOS interrupts or Amiga libraries.
  • DMIC-C2: The decompiler detects common library methods in statically compiled subjects. Something like FLIRT.
  • DMIC-C3: The decompiler knows the signatures of common libraries like libc and applies these signatures.
  • DMIC-C4: The decompiler knows some advanced expression transformations. For example division by multiplication.
  • DMIC-C5: The decompiler supports SIMD: at least it models these instructions with internal functions which capture the inputs and outputs, or decomposes them into their “real” semantics.
  • DMIC-C6: The decompiler detects the number of elements and the element sizes from a loop which goes over the array.

Of course these criteria and the categories are preliminary. Only when some decompiler projects are examined will it become clear how well the DMI is able to capture the project state. For some of the criteria focused test subjects are still missing. Stay tuned for the next blog post with some data.

Posted in decompiler

Holdec 1.3

screenshots and you should also not miss the demo video. You can download the decompiler from the download page. The complete changelog consists of:

  • big feature: add a graphical user interface
  • feature: add support for MS-DOS COM file format
  • feature: add --treat-unknown-files-as-ms-dos-com command line option
  • feature: add the concept of holdec home where executable-independent information is stored
  • feature: add --home command line option
  • feature: add the ability to specify signature files for libraries; add --load-libs command line option
  • fix: some 16bit opcode issues
  • fix: recognize some more i386 opcodes as nops
  • fix: show C like signature for decompiled functions
  • change: add a dummy return block if the function has no return

Posted in decompiler, holdec

Holdec 1.2

  • fully supported: the decompiler knows what the inputs and outputs (registers, flags, memory locations) are and how the output is calculated
  • adc, add, and, bsf, bsr, bt, btc, btr, bts, call, cbw, clc,
    cld, cli, cmc, cmovcc, cmp, cmps, cmpxchg, cwd, dec, div,
    enter, idiv, imul, inc, iret, jcc, jcxz/jecxz, jmp, lahf,
    lea, leave, lods, loop, mov, movs, movsx, movzx, mul, neg,
    nop, not, or, pop, popa, popf, push, pusha, pushf, rol,
    ror, rcl, rcr, shl, shr, sar, ret, sahf, sbb, scas, setcc,
    shld, shrd, stc, std, sti, stos, sub, test, xadd, xchg,
    xlat, xor
  • input/output supported: the decompiler knows what the inputs and outputs are and uses a builtin function in the decompiled source code
    aaa, aad, aam, aas, bound, bswap, cmpxchg8b, cpuid, daa,
    das, in, ins, int, out, outs, rdtsc, rep
  • unsupported: nothing is known about these opcodes and they are modeled as inline assembler
    arpl, clts, enter, hlt, lar, lgdt, lidt,
    lds/les/lfs/lgs/lss, lldt, lmsw, lsl, ltr, sgdt, sidt,
    sldt, smsw, str, verr, verw, wait, invd, invlpg, wbinvd,
    rdmsr, wrmsr
The complete list of changes:
    • feature: support (in various levels) hopefully all x86 opcodes excluding FPU, x64, MMX, SSE
    • feature: add --help command line option
    • feature: add --hide-addresses command line option
    • feature: -c now supports multiple functions, matched on function name with a regexp
    • feature: distinguish between three types of undefined values: a CPU opcode producing an undefined result, a value not modeled by the decompiler, and a value trashed by a function call
    • feature: add new type ‘bit’ to the external type system
    • feature: simplify cond?0:1 to BIT_TO_INT(!cond) and cond?1:0 to BIT_TO_INT(cond)
    • feature: simplify x-y>42 to x>y+42
    • feature: a new file ‘symbol_table.txt’ is written in the current directory after each run
    • feature: support ‘pc’ in the m68k code
    • change: use register names with two letters if required e.g. ‘reg_da’
    • change: replace the builtin function prefix ‘intern_’ with ‘holdec’
    • change: create names according to their definition place; this should make generated names more reproducible
    • change: rework how concrete numbers are treated internally
    • change: treat signed and unsigned comparisons differently internally
    • change: format numbers depending on the context (bit context -> unsigned hex, signed context -> signed decimal, unsigned and unknown context -> unsigned decimal)
    • change: rework the SSA generation to use an algorithm based on Aycock and Horspool
    • change: do not convert tail-controlled loops with a fixed number of iterations to a for(…) loop
    • fix: only issue a warning and do not die if a jump with a constant false condition is removed
    • fix: add dummy values for local->register converted variables without an initial assignment
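    The simplification of x-y>42 to x>y+42 listed above is sound over mathematical integers (with wrapping machine arithmetic the rewrite needs more care). A quick exhaustive check over a small range where neither side can overflow:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Check that "x - y > 42" and "x > y + 42" agree on a range of
     * 64-bit values small enough that no overflow can occur. */
    int main(void) {
        for (int64_t x = -100; x <= 100; x++)
            for (int64_t y = -100; y <= 100; y++)
                assert((x - y > 42) == (x > y + 42));
        return 0;
    }
    ```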
    One of the things I have learned is that a signed right shift is not equivalent to a signed division.
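    The difference is easy to demonstrate: an arithmetic right shift rounds toward negative infinity, while C's signed division truncates toward zero, so the two disagree for negative odd operands. (Right-shifting a negative value is implementation-defined in C, but compilers on two's-complement targets emit an arithmetic shift.)

    ```c
    #include <assert.h>

    int main(void) {
        int x = -7;
        assert((x >> 1) == -4);      /* sar: floor(-7 / 2) = -4 */
        assert((x / 2) == -3);       /* idiv: truncation gives -3 */
        assert((7 >> 1) == (7 / 2)); /* non-negative values agree */
        return 0;
    }
    ```

    This is why a compiler that lowers a signed division by a power of two to a shift must first add a bias to the dividend when it is negative, and a decompiler has to recognize that bias sequence to recover the division.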

    Posted in decompiler, holdec