- separation of code and data: this should be as automatic as possible but user input is still required
- reliable function identification: determine the code ranges of all functions
- understand special idioms: this does not mean supporting compiler-specific idioms, but single assembler instructions or small groups of instructions that are not easily representable in a language like C. This includes indexed jumps, the rep-prefixed instructions of i386, SIMD instructions, or converting
    ror $0x8,%cx
    ror $0x10,%ecx
    ror $0x8,%cx
  to a swab32(...) call.
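To make the rotate idiom above concrete, here is a minimal sketch in C that models the three instructions step by step and compares the result with a plain byte swap. The helper names (ror8_low16, ror16, rotate_sequence, idiom_matches) are made up for this illustration, and swab32 is written out by hand here rather than taken from any particular header.

```c
#include <stdint.h>

/* Model of "ror $0x8,%cx": rotate only the low 16 bits right by 8,
   leaving the upper 16 bits of the register untouched. */
static uint32_t ror8_low16(uint32_t ecx) {
    uint16_t cx = (uint16_t)ecx;
    cx = (uint16_t)((cx >> 8) | (cx << 8));
    return (ecx & 0xFFFF0000u) | cx;
}

/* Model of "ror $0x10,%ecx": rotate the full 32-bit register right by 16. */
static uint32_t ror16(uint32_t ecx) {
    return (ecx >> 16) | (ecx << 16);
}

/* The three-instruction sequence, applied in order. */
static uint32_t rotate_sequence(uint32_t ecx) {
    ecx = ror8_low16(ecx);
    ecx = ror16(ecx);
    ecx = ror8_low16(ecx);
    return ecx;
}

/* Hand-written 32-bit byte swap: what the decompiled source should call. */
static uint32_t swab32(uint32_t x) {
    return ((x & 0x000000FFu) << 24) | ((x & 0x0000FF00u) << 8) |
           ((x & 0x00FF0000u) >> 8)  | ((x & 0xFF000000u) >> 24);
}

/* Returns 1 if both forms agree on a few sample values, 0 otherwise. */
static int idiom_matches(void) {
    static const uint32_t samples[] = {
        0x00000000u, 0x12345678u, 0xDEADBEEFu, 0xFFFFFFFFu, 0x000000FFu
    };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        if (rotate_sequence(samples[i]) != swab32(samples[i]))
            return 0;
    }
    return 1;
}
```

Calling idiom_matches() from a small main is enough to check the equivalence on the sample values. A real decompiler would have to match such a pattern on its intermediate representation, not on concrete values.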
- stack and function calls: both depend on each other. Sub-problems are:
  - identify saved registers
  - identify how parameters are passed (at the caller and at the callee site)
  - construct the actual calls
  - handle multiple entries and exits
- beautification/compactification: this part usually uses a control-flow and data-flow graph:
  - value propagation
  - simplification of expressions
  - recognizing high-level control flow (if, if-else, loops)
  - reordering statements
  - reducing the number of memory accesses
  - type analysis
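As a rough illustration of value propagation combined with expression simplification, the following sketch runs a single forward pass over straight-line code. The three-operation mini-IR (OP_CONST, OP_MOV, OP_ADD), the Insn layout and the function name propagate_constants are all assumptions made for this example. A real decompiler would work on a control-flow and data-flow graph, not on a flat instruction array.

```c
#include <stdbool.h>
#include <stdint.h>

enum { NUM_REGS = 4 };                 /* hypothetical register file size */

typedef enum { OP_CONST, OP_MOV, OP_ADD } Op;

typedef struct {
    Op       op;
    int      dst;   /* destination register index */
    int      src;   /* source register (OP_MOV, OP_ADD) */
    uint32_t imm;   /* immediate value (OP_CONST) */
} Insn;

/* Forward pass over straight-line code. Whenever every input of an
   instruction is a known constant, the instruction is rewritten in place
   to OP_CONST with the computed value. Returns the number of rewrites. */
static int propagate_constants(Insn *code, int n) {
    bool     known[NUM_REGS] = { false };   /* nothing is known on entry */
    uint32_t value[NUM_REGS] = { 0 };
    int rewritten = 0;

    for (int i = 0; i < n; i++) {
        Insn *in = &code[i];
        switch (in->op) {
        case OP_CONST:                      /* dst = imm */
            known[in->dst] = true;
            value[in->dst] = in->imm;
            break;
        case OP_MOV:                        /* dst = src */
            if (known[in->src]) {
                in->op  = OP_CONST;
                in->imm = value[in->src];
                known[in->dst] = true;
                value[in->dst] = in->imm;
                rewritten++;
            } else {
                known[in->dst] = false;     /* result is no longer known */
            }
            break;
        case OP_ADD:                        /* dst = dst + src */
            if (known[in->dst] && known[in->src]) {
                in->op  = OP_CONST;
                in->imm = value[in->dst] + value[in->src];
                value[in->dst] = in->imm;
                rewritten++;
            } else {
                known[in->dst] = false;
            }
            break;
        }
    }
    return rewritten;
}
```

For the sequence r0 = 2; r1 = r0; r1 = r1 + r0, the pass turns the last two instructions into r1 = 2 and r1 = 4. A later dead-code pass could then remove the first assignment to r1.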
- output
The resulting source code should compile, so that it can be worked on further with other tools such as IDEs. Almost every one of these problem areas is big enough in itself. There is literature on most of them, in varying amounts. ]]>