Writing a Naive LLVM-based Devirtualizer | eversinc33

If you are into deobfuscation, and especially devirtualization, chances are high that you have read or heard about lifting to LLVM IR as one approach. In a way, obfuscation can be thought of as compiler de-optimization, and LLVM, with its many efficient optimization passes, can work great for deobfuscation. However, most available resources are either very academic and so technical that they are inaccessible, or they skip the details of the lifting process and describe pipelines only at a high level, without showing any code. As someone new to this topic, I was confused by the different tools and frameworks, each of which has its own caveats, a steep learning curve or simply does not work outside of specific PoCs.

So I decided to get my feet wet, stop reading and just do what I always do when I want to learn something new: start implementing something in the most naive way I can come up with. As such, this should be read more like a diary entry / devlog rather than a tutorial or intro to devirtualization.

I have written many virtual machines and reversed some custom VMs I encountered in different malware samples, but I usually just reversed the handlers, wrote a disassembler and analyzed the bytecode by hand. This works against simple malware unpacking VMs, but not if your target is protected with something like Themida or VMProtect. I always wanted to write a proper devirtualizer, so I pulled up a random VM crackme from tuts4you and went for it.

The plan was simple: I lift the virtualized bytecode to LLVM, optimize it, recompile it, and I have the original program, stripped of its virtualization layer.

I will not go into how to reverse custom VMs, what VMs are or explain LLVM in much detail. Here are some links that could be helpful if you are unsure about these topics

The crackme is a simple flag checker, in which the flag check logic is virtualized:

Crackme VM

The goal is to reverse the VM and figure out the correct keys.

Virtualization-based obfuscation

Very briefly, a custom virtual machine, as a means of software protection, can be thought of as simply an emulator for a fictitious (or real) architecture which executes some sort of program (the "bytecode"). The VM from this crackme is a typical stack-based virtual machine, which somewhat resembles the x86 architecture and is really very simple to reverse - if you have never reverse engineered a custom VM, this one is a great intro.

As mentioned earlier, the classic approach to reverse engineering a VM like this is to find the entry of the VM and figure out its dispatch mechanism: usually either a central fetch-decode-execute loop or a model where each handler dispatches to the next handler itself (which is what this VM does). Then you can analyze each opcode handler and write a disassembler.
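
To make the two dispatch styles concrete, here is a minimal sketch of a toy stack VM in Python that implements both. The opcodes, handler names and bytecode are invented for illustration and have nothing to do with the crackme's actual handlers.

```python
# Toy stack VM illustrating two dispatch styles. Opcode values, handler
# names and the bytecode are invented for illustration only.
PUSH, ADD, HALT = 0, 1, 2

def run_loop(code):
    """Central fetch-decode-execute loop: one dispatcher selects every handler."""
    stack, vip = [], 0
    while True:
        op = code[vip]                      # fetch
        if op == PUSH:                      # decode + execute
            stack.append(code[vip + 1])
            vip += 2
        elif op == ADD:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
            vip += 1
        elif op == HALT:
            return stack

def run_threaded(code):
    """Threaded dispatch: each handler looks up and returns its successor."""
    state = {"stack": [], "vip": 0}

    def next_handler():
        return handlers[code[state["vip"]]]

    def h_push():
        state["stack"].append(code[state["vip"] + 1])
        state["vip"] += 2
        return next_handler()

    def h_add():
        b, a = state["stack"].pop(), state["stack"].pop()
        state["stack"].append(a + b)
        state["vip"] += 1
        return next_handler()

    def h_halt():
        return None

    handlers = {PUSH: h_push, ADD: h_add, HALT: h_halt}
    handler = next_handler()
    while handler is not None:              # trampoline standing in for a tail jump
        handler = handler()
    return state["stack"]

program = [PUSH, 2, PUSH, 40, ADD, HALT]
print(run_loop(program), run_threaded(program))
```

In native code the threaded variant has no central loop at all: every handler ends with a jump to the next handler, which is why you have to follow the handlers themselves to recover the dispatch logic.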

This blog however is not about this specific VM but about the lifting process, so I will just show an example of reversing one opcode.

Reversing VM Handlers manually

This is our VMENTER, the function which initializes the virtual machine's registers and stack space by pushing the host stack:

VMENTER

From this we already know that R13 is the VIP (virtual instruction pointer), R14 is the image base that the dispatcher later needs to turn RVAs into real addresses, and R15 is our VSP (virtual stack pointer).

With this knowledge, reversing the PUSH_REG64 handler is very straightforward:

PUSH_REG64

In this case, we can easily reverse all handlers manually pretty quickly because there is no obfuscation applied to the handlers at all.

Alternative: Using symbolic execution

Besides manually reversing the handlers, we can also use symbolic execution to run each handler and record how it transforms symbolic values we care about. In this case, I implemented it as an IDA script which executes the handler pointed to by the current cursor using the miasm framework and the IDA domain API.

We inferred from manual analysis that R13 is the VIP, R14 the image base and R15 our VSP, so I encode that in a pretty-printing method in the following script:

from ida_domain import Database  
from miasm.analysis.machine import Machine  
from miasm.core.locationdb import LocationDB  
from miasm.ir.symbexec import SymbolicExecutionEngine  
from miasm.expression.expression import ExprInt, ExprId, ExprMem  
from miasm.ir.ir import AssignBlock  

def render(expr):
    # pretty print expr: replace register names with our aliases, print values as signed hex integers etc.
    # [...]
    return str(expr)  # placeholder so the elided body is valid Python; the real formatting is omitted

with Database.open() as db:
    ea = db.current_ea
    code = db.bytes.get_bytes_at(ea, 0x200) # enough for our handlers
    # lift to miasm IR
    loc_db = LocationDB()
    machine = Machine("x86_64")
    mdis = machine.dis_engine(code, loc_db=loc_db)
    asmcfg = mdis.dis_multiblock(0)
    ira = machine.ira(loc_db)
    ircfg = ira.new_ircfg_from_asmcfg(asmcfg)

    symb = SymbolicExecutionEngine(ira)
    # init register values
    init_blk = AssignBlock({
        ExprId("RAX", 64): ExprInt(0, 64),
        ExprId("R13", 64): ExprInt(ea, 64), # VIP
        ExprId("R14", 64): ExprInt(0x140000, 64), # image base
        ExprId("R15", 64): ExprInt(0x1000, 64), # VSP
        # stack mem:
        ExprMem(ExprInt(0x1000, 64), 64): ExprInt(0xAAAAAAAAAAAAAAAA, 64)
    })
    symb.eval_assignblk(init_blk)

    # Symbolically execute handler
    entry = next(iter(ircfg.blocks))
    irblock = ircfg.blocks[entry]
    for assignblk in irblock:
        symb.eval_updt_assignblk(assignblk)
        if any("IRDst" in str(dst) for dst in assignblk.keys()):
            break # stop on jmp

    print("----------------------------")
    print("EA:", hex(ea))
    print("VSP:", render(symb.eval_expr(ExprId("R15", 64))))
    print("VIP:", render(symb.eval_expr(ExprId("R13", 64))))
    print("STACK[0]:", render(symb.eval_expr(ExprMem(ExprId("R15", 64), 64))))

If we now run this on an example handler, in this case the one at 0x140016195, we get a dump of its symbolic effects in the IDA Python console:

Symbolic execution

From this we can already infer enough meaning to figure out what the handler is doing:

Instruction pointer increments by size of the instruction
8 bytes of memory pointed to by VIP were pushed to the stack (the immediate of the instruction)
VSP advances by 8 bytes

So this is a PUSH_IMM64 instruction, pushing a 64 bit immediate onto the stack!

The benefit of this approach is that we can now take these symbolic deltas and use them as signatures for the instructions. If we encounter this VM another time and it has some obfuscation applied, such as shuffled opcodes, we can simply symbolically execute each handler again and match them to the signatures to figure out each handler's meaning.
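
As a rough sketch of what such a signature table could look like: the `Delta` fields, the effect strings and the ADD64 entry below are my own assumptions for illustration. Only the PUSH_IMM64 entry mirrors the deltas described above (12 bytes of instruction, one 8-byte slot pushed).

```python
from typing import NamedTuple, Optional

class Delta(NamedTuple):
    vip_delta: int   # bytes the VIP advanced
    vsp_delta: int   # bytes the VSP moved (negative = push, stack grows down)
    effect: str      # normalized description of the new top of stack

# Hypothetical signature table keyed by the observed symbolic deltas.
SIGNATURES = {
    Delta(12, -8, "STACK[0] = imm64 @ VIP"): "PUSH_IMM64",
    Delta(4, 8, "STACK[0] = STACK[0] + STACK[1]"): "ADD64",  # assumed, not verified
}

def classify(delta: Delta) -> Optional[str]:
    """Return the handler name for a set of deltas, or None if unknown."""
    return SIGNATURES.get(delta)

observed = Delta(12, -8, "STACK[0] = imm64 @ VIP")
print(classify(observed))
```

A real implementation would compare normalized miasm expressions instead of strings, but the lookup idea stays the same.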

Once you have reversed all instructions and their opcodes, you can write a bytecode parser, or rather, a disassembler for this VM. While the VM implements more opcodes, the following are the ones actually used by the bytecode, so they are the only ones we care about for this crackme. The names should be self-explanatory:

//===------------------------------------------------------------------===//
// OPCODE, {operands, name, address (same as opcode), instruction length}
{0x1606A, {1, "PUSH_REG64", 0x1606A, 5}},
{0x16090, {1, "PUSH_REG32", 0x16090, 5}},
{0x16101, {1, "POP64", 0x16101, 5}},
{0x16126, {1, "POP32", 0x16126, 5}},
{0x16194, {1, "PUSH_IMM64", 0x16194, 12}},
{0x161B1, {1, "PUSH_IMM32", 0x161B1, 8}},
{0x1620A, {0, "PUSH_VSP64", 0x1620A, 4}},
{0x1626C, {0, "POP_VSP64", 0x1626C, 4}},
{0x1627D, {0, "POP_VSP32", 0x1627D, 4}},
{0x162B1, {0, "ADD64", 0x162B1, 4}},
{0x162C9, {0, "ADD32", 0x162C9, 4}},
{0x1631F, {0, "SUB64", 0x1631F, 4}},
{0x16337, {0, "SUB32", 0x16337, 4}},
{0x163A5, {0, "XOR32", 0x163A5, 4}},
{0x163FB, {0, "AND64", 0x163FB, 4}},
{0x16413, {0, "AND32", 0x16413, 4}},
{0x16481, {0, "OR32", 0x16481, 4}},
{0x16689, {0, "LOAD64", 0x16689, 4}},
{0x1669F, {0, "LOAD32", 0x1669F, 4}},
{0x166EC, {0, "STORE64", 0x166EC, 4}},
{0x16707, {0, "STORE32", 0x16707, 4}},
{0x1676B, {1, "JNZ", 0x1676B, 8}}
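
A linear-sweep disassembler over such a table can be sketched in a few lines of Python. Note the assumption baked in here: judging by the instruction lengths in the table (4, 5, 8 and 12 bytes), I treat every instruction as a 4-byte little-endian opcode followed directly by its operand. The sample bytecode is made up for the demo.

```python
import struct

# opcode (handler RVA) -> (name, operand size in bytes); subset of the table above.
# Assumption: an instruction is a 4-byte little-endian opcode followed by its operand.
OPCODES = {
    0x1606A: ("PUSH_REG64", 1),
    0x16101: ("POP64", 1),
    0x16194: ("PUSH_IMM64", 8),
    0x162B1: ("ADD64", 0),
    0x1676B: ("JNZ", 4),
}

def disassemble(bytecode: bytes, base_vip: int = 0):
    """Linear sweep: decode instructions back to back until the buffer ends."""
    out, off = [], 0
    while off < len(bytecode):
        (opcode,) = struct.unpack_from("<I", bytecode, off)
        if opcode not in OPCODES:
            raise ValueError(f"unknown opcode {opcode:#x} at offset {off:#x}")
        name, opsize = OPCODES[opcode]
        raw = bytecode[off + 4 : off + 4 + opsize]
        operands = [int.from_bytes(raw, "little")] if opsize else []
        out.append({"vip": base_vip + off, "name": name, "operands": operands})
        off += 4 + opsize
    return out

# Made-up sample: PUSH_IMM64 0x1337; POP64 r3; ADD64
sample = (struct.pack("<IQ", 0x16194, 0x1337)
          + struct.pack("<IB", 0x16101, 3)
          + struct.pack("<I", 0x162B1))
listing = disassemble(sample)
for inst in listing:
    print(hex(inst["vip"]), inst["name"], inst["operands"])
```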

On a side note, we could solve the whole crackme using symbolic execution, but that approach does not really scale for more complex and/or obfuscated VMs. We are interested in scalable workflows and actually devirtualizing the program, so let's start lifting.

Lifting to LLVM

I already mentioned that I was overwhelmed by the available lifting frameworks and lifters, which is why I decided to roll my own using the LLVM C++ API. This turned out to be a bit of a headache. My plan was to lift each opcode to LLVM IR, so that my disassembler could translate the virtualized bytecode into a new LLVM program that I can compile and throw into IDA.

The general approach was thus to use the API to emit LLVM instructions manually, for example to load a value from a virtual register:

llvm::LoadInst* reg_val = builder.CreateLoad(i64t, regs[reg]);

With these building blocks I could then write an LLVM IR representation of each instruction such as PUSH_IMM64, JNZ and so on.

The basic pipeline looks as follows:

VM bytecode
     ↓
Disassembler
     ↓
Opcode stream
     ↓
Lifting to LLVM IR
     ↓
Unoptimized IR
     ↓
LLVM optimization passes
     ↓
Devirtualized program

Now what caused me some headaches was how I should emulate the VM stack, if at all.

At first, I tried to offload the VM stack into the lifter itself instead of into LLVM. I had a std::vector of llvm::Value*s which I used to track which variable a current instruction references and inserted that variable. My idea was that I do not want the stack in the resulting code, as it belongs to the VM, so I tried to keep it out of LLVM.

However, manually tracking values gets in the way of LLVM itself and broke its SSA form on many levels: when my lifter produced valid IR at all, it was hard to optimize, and in many iterations the IR was completely broken. Optimization passes reason only about the IR. By keeping part of the VM state in external C++ bookkeeping structures, I was effectively hiding semantics from the optimizer. Passes like dead store elimination, mem2reg and instruction combining could no longer reason correctly about value flow, because the actual dependencies only existed in my lifter and not in the IR itself.

After a while, I decided to stop doing this and implement the stack logic in LLVM IR as well. At first, this seemed counterintuitive to me: we are trying to strip the virtualization layer, in which the stack plays a part, so why would we implement it in our LLVM code?

MrExodia showed me how much I underestimated the power of LLVM optimization passes. If properly implemented, they can fold away the stack completely, leaving only the semantics we are interested in. You can see it in action with the following emulated stack-based add operation:

LLVM optimization

After all optimization passes, the only thing left is the actual operation, and the stack operations are stripped away completely. Since the stack accesses are all local and deterministic, LLVM can optimize them away and fold the emulated stack into SSA values.
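
To get an intuition for why this works, here is a toy model in Python. It is not LLVM's algorithm, just a sketch of the effect on straight-line code: once every stack slot index is a known constant, stored values can be forwarded to later loads of the same slot, and the stores become dead. The tuple-based mini IR is invented for this example.

```python
def fold_stack(ops):
    """Forward stored values to later loads of the same slot, then drop the
    now-dead stores. Only valid because every slot index is a known constant
    and the stack does not escape the function."""
    slot_value = {}   # slot offset -> name of the value currently stored there
    alias = {}        # load result -> forwarded value

    def resolve(name):
        return alias.get(name, name)

    out = []
    for op in ops:
        kind = op[0]
        if kind == "store":
            _, slot, val = op
            slot_value[slot] = resolve(val)
        elif kind == "load":
            _, dst, slot = op
            alias[dst] = slot_value[slot]
        elif kind == "add":
            _, dst, a, b = op
            out.append(("add", dst, resolve(a), resolve(b)))
        elif kind == "ret":
            out.append(("ret", resolve(op[1])))
    return out

# push a; push b; ADD64 (pop, pop, add, push); pop the result
naive = [
    ("store", -8, "a"),
    ("store", -16, "b"),
    ("load", "t0", -16),
    ("load", "t1", -8),
    ("add", "t2", "t1", "t0"),
    ("store", -8, "t2"),
    ("load", "t3", -8),
    ("ret", "t3"),
]
folded = fold_stack(naive)
print(folded)
```

Eight stack-shuffling operations collapse into a single add and a return, which is the same shape of result the real passes produce on the lifted IR.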

Now knowing that I do not have to try to hide the VM stack from LLVM as it will be optimized away, I finished my implementation by implementing a lifting function for each opcode handler, which would emit the corresponding LLVM instructions.

You can view the whole code here.

Since this crackme only virtualized a single routine and had relatively simple control flow, I modeled the entire bytecode program as one LLVM function with multiple basic blocks, taking the input value as a parameter. The stack is a simple byte array in a local variable, the VSP is a byte offset into it, and the VM's registers are llvm::Value*s, which I hold in a vector in my lifter to make them accessible by index in my instruction lifters:

// Flat byte array for the VM stack
// STACK_HEADROOM bytes above cover positive-offset frame accesses (access to caller stack frame)
arrayType = llvm::ArrayType::get(i8t, STACK_SIZE * 8 + STACK_HEADROOM);

// init stack (8byte aligned so i64 stores/loads into it are not UB)
stack = builder.CreateAlloca(arrayType, nullptr, "mem");
stack->setAlignment(llvm::Align(8));

// VSP is tracked as an i64 byte offset from the stack base
// Starts at STACK_SIZE * 8
vsp = builder.CreateAlloca(i64t, nullptr, "vsp");
builder.CreateStore(builder.getInt64(STACK_SIZE * 8), vsp)->setAlignment(llvm::Align(8));

// init registers
regs = std::vector<llvm::Value*>(0x11, nullptr);
for (size_t i = 0; i < regs.size(); ++i)
{
	std::string name = "r" + std::to_string(i);
	regs.at(i) = builder.CreateAlloca(i64t, nullptr, name);
	builder.CreateStore(llvm::ConstantInt::get(i64t, 0), regs.at(i))->setAlignment(llvm::Align(8));
}

// Store function input parameter into r3 (maps to rcx/first argument)
llvm::Value* input = mainFunc->getArg(0);
input->setName("input");
llvm::Value* input64 = builder.CreateZExt(input, i64t, "input_zext");
builder.CreateStore(input64, regs[3])->setAlignment(llvm::Align(8));

I then had to implement some stack management functions such as PUSH and POP:

void Lifter::vm_push64(llvm::Value* value)
{
    // Decrement VSP (an i64 byte offset from the stack base) by 8 bytes (one i64 slot)
    llvm::Value* cur_vsp = load_vsp();
    llvm::Value* new_vsp = builder.CreateAdd(cur_vsp, builder.getInt64(-8), "vsp_dec");
    store_vsp(new_vsp);
    // Turn the offset into a slot pointer relative to the stack base and store the value there
    llvm::Value* slot = builder.CreateGEP(i8t, stack, new_vsp);
    builder.CreateStore(value, slot)->setAlignment(llvm::Align(8));
}
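
If the builder API obscures what is going on, the same push and pop semantics can be modeled in plain Python: a flat byte array plus a VSP that is a byte offset starting at the top and growing downwards. The class name and the `STACK_SIZE` value are placeholders I picked for the sketch, and the headroom region is left out.

```python
import struct

STACK_SIZE = 16  # number of 8-byte slots (hypothetical value for this sketch)

class VMStack:
    def __init__(self):
        self.mem = bytearray(STACK_SIZE * 8)
        self.vsp = STACK_SIZE * 8  # byte offset; starts at the top, grows down

    def push64(self, value: int) -> None:
        # Decrement VSP first, then store at the new top of stack
        self.vsp -= 8
        struct.pack_into("<Q", self.mem, self.vsp, value & 0xFFFFFFFFFFFFFFFF)

    def pop64(self) -> int:
        # Load from the current top of stack, then increment VSP
        (value,) = struct.unpack_from("<Q", self.mem, self.vsp)
        self.vsp += 8
        return value

s = VMStack()
s.push64(35)
s.push64(-1)  # wraps to 0xFFFFFFFFFFFFFFFF like an i64 would
print(hex(s.pop64()), s.pop64(), s.vsp)
```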

Armed with these stack primitives, we can then model the different opcodes rather intuitively (this reminded me of emulator development, just with a weird programming API):

void Lifter::op_pop64(const uint64_t reg)
{
    // Pops a 64 bit value off the stack into the register reg
    builder.CreateStore(vm_pop64(), regs.at(reg))->setAlignment(llvm::Align(8));
}

void Lifter::op_push_reg32(const uint64_t reg)
{
    // Load full i64 reg, truncate to lower 32 bits, then push
    llvm::LoadInst* reg_val = builder.CreateLoad(i64t, regs[reg]);
    reg_val->setAlignment(llvm::Align(8));
    llvm::Value* reg_val32 = builder.CreateTrunc(reg_val, i32t, "trunc32");
    vm_push32(reg_val32);
}

Special care has to be given to the JNZ instruction, which is the only one that actually creates control flow. To create basic blocks at jump targets, I implemented a two-pass lifting pipeline, where the first pass creates basic blocks for jump targets and for their "fallthrough" blocks (LLVM has no fallthrough in the classic sense, only explicit branches, so we have to make the fallthrough explicit):

// First pass: create basic blocks for JNZ targets and their fallthroughs
std::unordered_map<uint64_t, llvm::BasicBlock*> addr_to_bb;
for (size_t i = 0; i < instructions.size(); ++i)
{
    const auto& inst = instructions[i];
    if (inst.name == "JNZ")
    {
        // BB for the JNZ target address
        uint64_t target_addr = inst.operands[0];
        if (addr_to_bb.find(target_addr) == addr_to_bb.end())
            addr_to_bb[target_addr] = llvm::BasicBlock::Create(context, "bb_" + std::to_string(target_addr), mainFunc);

        // BB for the fallthrough (instruction immediately after JNZ)

        if (i + 1 < instructions.size())
        {
            uint64_t ft_addr = instructions[i + 1].vip;
            if (addr_to_bb.find(ft_addr) == addr_to_bb.end())
                addr_to_bb[ft_addr] = llvm::BasicBlock::Create(context, "bb_" + std::to_string(ft_addr), mainFunc);
        }
    }
}
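
Stripped of the LLVM API, the block discovery boils down to the classic "leaders" idea. The following Python sketch works on a disassembly listing of dicts with `vip`, `name` and `operands` keys (a layout I made up for the example) and assumes, like the C++ code, that the JNZ operand already holds the absolute VIP of the target.

```python
def find_leaders(instructions):
    """First pass: every JNZ target and every instruction after a JNZ starts a block."""
    leaders = {instructions[0]["vip"]}
    for i, inst in enumerate(instructions):
        if inst["name"] == "JNZ":
            leaders.add(inst["operands"][0])
            if i + 1 < len(instructions):
                leaders.add(instructions[i + 1]["vip"])
    return leaders

def split_blocks(instructions):
    """Second pass: cut the linear listing at each leader."""
    leaders = find_leaders(instructions)
    blocks, current = {}, None
    for inst in instructions:
        if inst["vip"] in leaders:
            current = blocks.setdefault(inst["vip"], [])
        current.append(inst)
    return blocks

def successors(blocks):
    """Make fallthrough explicit: every block lists the blocks it branches to."""
    starts = sorted(blocks)
    succ = {}
    for idx, start in enumerate(starts):
        last = blocks[start][-1]
        nxt = [starts[idx + 1]] if idx + 1 < len(starts) else []
        if last["name"] == "JNZ":
            succ[start] = [last["operands"][0]] + nxt  # taken target, then fallthrough
        else:
            succ[start] = nxt                          # unconditional branch to next block
    return succ

listing = [
    {"vip": 0,  "name": "PUSH_IMM64", "operands": [1]},
    {"vip": 12, "name": "JNZ",        "operands": [24]},
    {"vip": 20, "name": "ADD64",      "operands": []},
    {"vip": 24, "name": "SUB64",      "operands": []},
]
blocks = split_blocks(listing)
print(successors(blocks))
```

The block starting at 20 does not end in a JNZ, yet it still needs an explicit edge to 24. In LLVM terms, that is the unconditional branch you have to emit yourself before switching the insert point to the next basic block.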

In the second pass a big conditional statement in my disassembler loop then takes care to emit the respective LLVM IR belonging to the opcode:

if (inst.name == "PUSH_REG64")
    op_push_reg64(inst.operands[0]);
else if (inst.name == "PUSH_REG32")
    op_push_reg32(inst.operands[0]);
else if (inst.name == "PUSH_VSP64")
    op_push_vsp64();
else if (inst.name == "POP64")
    op_pop64(inst.operands[0]);
else if (inst.name == "POP32")
    op_pop32(inst.operands[0]);
else if (inst.name == "PUSH_IMM64")
    op_push_imm64(inst.operands[0]);
else if (inst.name == "PUSH_IMM32")
    op_push_imm32(inst.operands[0]);
else if (inst.name == "SUB64")
    op_sub64();
else if (inst.name == "ADD64")
    op_add64();
// [...]

The JNZ instruction simply links the basic blocks which were created in the first pass.

else if (inst.name == "JNZ")
{
    // Find target BB
    uint64_t target_addr = inst.operands[0];
    auto tgt_it = addr_to_bb.find(target_addr);
    llvm::BasicBlock* target_bb = tgt_it->second;

    // Find fallthrough
    uint64_t ft_addr = instructions[i + 1].vip;
    auto ft_it = addr_to_bb.find(ft_addr);
    llvm::BasicBlock* fallthrough_bb = ft_it->second;

    // Create branches for target BB and fallthrough BB
    llvm::Value* zf = vm_pop64();
    llvm::Value* is_nz = builder.CreateICmpEQ(zf, builder.getInt64(0), "jnz_cond"); // JNZ is taken when the popped ZF is 0, i.e. the previous result was not zero
    builder.CreateCondBr(is_nz, target_bb, fallthrough_bb);
}

After this lifting, we verify our function is valid IR using LLVM's llvm::verifyFunction API and run optimization passes on the resulting IR:

void Lifter::optimize()
{
    llvm::PassBuilder PB;
    llvm::LoopAnalysisManager LAM;
    llvm::FunctionAnalysisManager FAM;
    llvm::CGSCCAnalysisManager CGAM;
    llvm::ModuleAnalysisManager MAM;

    PB.registerModuleAnalyses(MAM);
    PB.registerCGSCCAnalyses(CGAM);
    PB.registerFunctionAnalyses(FAM);
    PB.registerLoopAnalyses(LAM);
    PB.crossRegisterProxies(LAM, FAM, CGAM, MAM);

    llvm::ModulePassManager MPM = PB.buildPerModuleDefaultPipeline(llvm::OptimizationLevel::O1);
    MPM.run(*module, MAM);
}

Before optimization, the IR was about 8000 lines of stack shuffling and memory accesses:

// [...]
%vsp_dec248 = add i64 %vsp_off247, -8
store i64 %vsp_dec248, ptr %vsp, align 8
%121 = getelementptr i8, ptr %mem, i64 %vsp_dec248
store i64 35, ptr %121, align 8
%vsp_off249 = load i64, ptr %vsp, align 8
%122 = getelementptr i8, ptr %mem, i64 %vsp_off249
%popped64250 = load i64, ptr %122, align 8
%popped32251 = trunc i64 %popped64250 to i32
%vsp_inc252 = add i64 %vsp_off249, 8
store i64 %vsp_inc252, ptr %vsp, align 8
%vsp_off253 = load i64, ptr %vsp, align 8
%123 = getelementptr i8, ptr %mem, i64 %vsp_off253
%popped64254 = load i64, ptr %123, align 8
%popped32255 = trunc i64 %popped64254 to i32
// [...]

After -O1 optimizations, the full resulting IR became:

; ModuleID = 'byteshield_module'
source_filename = "byteshield_module"

; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(none)
define i32 @main(i32 %input) local_unnamed_addr #0 {
main:
  %zf1168 = icmp eq i32 %input, 1859
  %zf1764 = icmp eq i32 %input, 2418
  %narrow = or i1 %zf1764, %zf1168
  %zf2362 = icmp eq i32 %input, 1638
  %narrow5460 = or i1 %zf2362, %narrow
  %zf3003 = icmp eq i32 %input, 299902
  %zf3694 = icmp eq i32 %input, 29763
  %0 = or i1 %zf3694, %zf3003
  %narrow5462 = select i1 %0, i1 true, i1 %narrow5460
  %mem.sroa.579.4.off32 = zext i1 %narrow5462 to i32
  ret i32 %mem.sroa.579.4.off32
}

attributes #0 = { mustprogress nofree norecurse nosync nounwind willreturn memory(none) }
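
Read as Python, the optimized function is nothing more than a membership test. The constants below are copied from the icmp instructions above; the function and variable names are mine.

```python
# Constants copied from the icmp instructions in the optimized IR.
VALID_KEYS = {1859, 2418, 1638, 299902, 29763}

def check(key: int) -> int:
    """Mirror of @main: returns 1 if the 32-bit input equals one of the keys, else 0."""
    return int((key & 0xFFFFFFFF) in VALID_KEYS)

for candidate in (1859, 1234):
    print(candidate, check(candidate))
```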

All that was left now was compiling the resulting .ll file to an executable with clang devirt.ll -o program. We fully devirtualized the bytecode! I could then analyze the devirtualized program in IDA, where all calculation logic was folded away so that only the flag checks remain:

Devirtualized program

One of my previous failed attempts showcased what the bytecode actually does. Most flags are hardcoded, except for one flag which depends on various arithmetic operations. (In my previous flawed approaches, this was not folded, as I failed to model the stack correctly):

Previous failed attempt

We can verify that we correctly cracked the crackme by testing the keys:

Solved crackme

Outlook

As this was my first experience emitting LLVM IR, I learned quite a lot about how it is structured and some of its pitfalls.

The main thing to watch out for is making sure you emit your IR in a way that is both correct and optimizable, so that LLVM can work its optimization magic. Once the semantics are represented correctly in LLVM IR, LLVM's optimization passes can eliminate large parts of the virtualization automatically.

Devirtualization and lifting are fun. I suggest you give this crackme a try if you want.

Happy Hacking!