
The binary is missing various attributes such as precise types, variables (stack, global, heap), control-flow structures, and exact function arguments and return types. The goal of this work is to experimentally evaluate the varying precision of a particular attribute, recovered from binary code, on clients such as compiler optimizations and analyses (like pointer analysis and ??) or source-level symbolic execution. For the last 16 years a large amount of research has been carried out on inferring different attributes from binary code. This work aims to evaluate how well a particular client can leverage that information. It might also help answer what the minimal set of attributes is that can facilitate a particular client.

The plan here is to recover a particular attribute from the binary at various precision levels: starting with a binary carrying the most imprecise attribute information, then adding attribute information to mimic different existing implementations (for example, adding type information so as to mimic TIE, a well-known type inferencer for binary code), and going up to a binary representation with the most precise attribute information. Once recovered, those attributes (at a particular precision level) need to be applied to the binary code in a way that the client can leverage. Next, the different precision levels of recovered attributes will be evaluated with respect to a particular client.

In this work we are planning to study the effects of the following attributes:

  • Per-procedure physical stack frames
  • Variable and aggregate (structure/array) information
  • Type information
  • Function signatures

On the following clients:

  • Pointer analysis
  • Source-level symbolic execution
  • Automatic parallelization

Approaches

One approach could be to start with the binary, or with McSema's decompiled IR (which is the CFG-recovered binary in LLVM IR form, devoid of the attributes mentioned above), and add attributes (e.g. type information) to it using a particular attribute recovery method (e.g. type recovery). The issues with this approach are the following.

  • Developing an attribute recovery method is not the goal of this work. Existing attribute recovery methods are either still in progress, not open sourced, or not targeting LLVM IR.
  • Applying an existing attribute recovery transformation may not yield attribute information as precise as what is present in the source code, because of limitations in the underlying attribute recovery implementation or because of practical challenges in recovering that attribute.

Another approach is to start with the source code, let's call it version 1, and strip off a particular attribute (like type information) to get version 2. Now version 2 is very different from the binary, as a number of low-level implementation details are introduced during compilation, such as stack frames, calling conventions, exception implementation, and data layout, whereas in the source code those are abstracted away.

Yet another approach is to start with the binary (or a slightly higher-level representation, as in McSema's IR), let's call it version 1, and use the debug info from the source code to add a particular attribute, getting version 2. Now versions 1 and 2 are good candidates for checking the effect of the attribute on a particular client.

Meta Comments

Maybe we can focus on a single attribute type and try to evaluate different precisions of type information. The 2016 CSUR survey on type inference for binary code gives an idea of the different levels of precision that we can achieve.

Immediate Challenges

  • To propagate the debug info from the binary to the decompiled IR. The issue here is how to propagate the debug info in the binary to the LLVM IR generated by McSema. There exist IDA DWARF plugins that can consume debug symbols. (Need to experiment with them.)
  • To have the ability to selectively apply attributes. That can be achieved either by modifying the DWARF file or by modifying the plugin.

While searching for how source-level clients are applied to binaries, and whether special versions of those clients exist that work better on binary code, I found the following: the disassembled binary is subjected to various analyses, which infer the attributes lost during compilation, to produce code that is amenable to source-level analysis. But there does not seem to exist any specialized version of a client that can itself recover the attributes relevant to its own application.

https://www.cs.colorado.edu/~bec/papers/sas06-decompilers.pdf
https://pdfs.semanticscholar.org/177a/250daaeff8dd97e9612ac67073216de1ed42.pdf