redpwnCTF 2021 - rp2sm

rp2sm is a two-part reversing and pwn challenge that I wrote for redpwnCTF 2021 (you can find all our challenges here!), and easily the largest CTF challenge I’ve written to date. It involves reversing and then exploiting a toy JITing VM, with a bytecode language based loosely off of wasm (except without types and structured control flow, since I decided those would be too complex to reverse). The challenge consisted of 2 components, a chall binary which was dynamically linked against librp2sm.so (the VM implementation); there was only one server for both challenges, with both flags loaded in the container. The source for this challenge, along with my assembler and reference solutions, can be found here; bundles of the challenge assets and the “fixed” version released after the CTF are in the releases.

Rev#

VM overview#

The design of rp2sm’s bytecode resembles that of other “real” bytecode formats (e.g. wasm, JVM, CIL/CLR) in that functions, with a defined structure to their arguments, returns, and local variables, exist. The object format consists of a header containing references to loaded sections and functions. rp2sm has both a data and rodata sections, and the header entries describe their lengths, as well as the size and offset in the object for their initial data, making for a bare-bones segment header. Next is a table of functions, which each contain the number of locals, return values (a function can return multiple values), and the offset to and length of the actual bytecode body.

As for the bytecode itself, it is a fairly standard stack-based machine (again like wasm and the JVM), with a separate stack per function (i.e. arguments are accessed through instructions, not popped off of the stack). It supports 32-bit values only, and as such there are no types. The design of the test and branch operations was stolen almost entirely from wasm, with the exception of the addition of a test.nez instruction (that I now realize is fairly useless) and the lack of structured control flow. rp2sm’s control flow transfer instructions can only jump to labels placed in the code by a special label instruction (which emits no code during compilation), or to the end of the function via a ret. The conditional and unconditional branch instructions both take the index of a label to jump to instead of a bytecode address. This scheme was chosen mostly to simplify the compiler implementation.

As I wrote the challenge, I of course cannot give a very good explanation of how to reverse the bytecode format, but instead here is an outline of how I expected teams to approach this. At the core of the VM’s function compiler is a function which looks up each opcode in a large table; the struct looks something like this:

struct {
  uint8_t is_valid;
  uint8_t opcode;
  uint8_t immediate_length;
  uint8_t n_args;
  uint8_t n_ret;
  uint8_t pad[3];
  void (*compile_func)(Context* context, uint8_t opcode);
};

A bit of poking around the first few opcodes should reveal the purpose of most of the fields here, as well as the primitive stack checker that the compiler has; before compiling a single instruction, it checks that enough values are on the stack for the number of arguments the instruction expects, and then updates the stack height using the number of values the instruction returns. From here, teams could either analyze the individual opcode compilation functions directly, or just craft a program that used them and dump the compiled output. The control-flow instructions were the most complex to reverse, simply due to the complexity involved in actually implementing them (also there is both a std::list and std::vector involved, and those rarely turn out pretty when compiled).

`chall` binary#

The chall binary essentially just loaded the input you gave it into the rp2sm VM, and then asked you to solve a math problem with your VM code; if you computed the correct answer, it would then print the first flag. You must implement a function that, given a random 16-bit number and a prime (greater than 2^31), finds the inverse of the number mod the prime. To solve, we can implement the extended Euclidean algorithm, which Wikipedia helpfully provides pseudocode for. Unfortunately we only have 32 bit numbers and our prime doesn’t fit in a signed 32-bit integer, so dealing with negatives is a bit tricky; in my solution, I checked if my inverse was greater than the modulus (unsigned), and added the modulus if this was the case; that gave me a “good enough” success rate, though still not 100%. One team instead opted to implement modular multiplication and then solve for the inverse via Fermat’s Little Theorem, computing a**(p - 2) mod p, which gives a 100% reliable solution (though I do not know how long it takes to run).

Pwn#

None of the code ever touches the second flag file on the server, so we’ll need to pop a shell to solve part 2.

Obvious and unintended#

There was an incredibly obvious unintended bug that I didn’t catch in the implementation: memory accesses to the two data segments were not bounds-checked at all. I believe all teams who solved the pwn during the CTF used this to solve (if you are one of those teams and would like to try the intended solution, you should stop reading here 😉). Since these segments were mmaped, this meant they were at a predictable offset from the other shared libraries loaded, and so by overwriting either malloc/free hooks in libc or rtld structs in ld, teams were able to get a shell. Moral of the story here? Get someone else to test-solve your challenge first; I was too focused on the intended vulnerability to notice.

After the CTF ended I released a patched librp2sm.so that allocated PROT_NONE guard pages in the address space around the two data segments that were reachable from 32-bit offsets to prevent the OOB read/write from accessing anything important.

Intended#

The intended vulnerability lies in the VM’s primitive stack-checker, whose job is to ensure that a function never underflows its stack. However, this stack checker entirely ignores control-flow instructions (other than call and ret, which do have the stack checked correctly in their “compile” functions by looking up the properties of the function called or the current function). And so, we can get the stack pointer up outside of our current function’s stack frame by simply constructing a loop which pops values in its body. As long as the function is still “valid” when all branches are never taken, then the compiler will accept it without aborting. For example:

  push 0 // pad
  push 0 // pad
l1:
  drop
  dup
  eqz
  br_if l1

will move the stack pointer up the stack (towards higher addresses) until it finds the first qword with a non-zero lower dword. We can use this to overwrite the return address and saved rbp on the stack, allowing us to pivot once we have addresses. However, there’s two problems here: first, we need some absolute addresses to pivot to; second, we need to write 64-bit values, despite the JITed code only emitting 64-bit writes of 32-bit values.

The answer to both of these issues lies in the way local variables are handled. When entering a function, local slots on the stack are not cleared, meaning we can read stack data left behind on the stack—and right before a function is called for the first time, the JIT compiler is invoked, which happens to save a lot of important addresses on the stack. The entry into the compiler saves r14 and r13, giving us addresses of our two data segments, and going up a bit more also gives us a return address inside librp2sm.so, giving us a library base address. Since local slots are 4 bytes wide, a pair of uninitialized locals lets us read a whole 64-bit value. Meanwhile, stack space is reserved simply by subtracting the size of the local slots from rsp—it is not aligned to a multiple of 8 bytes. This means we can use a function with an odd number of locals to misalign the stack pointer by 4 bytes, and subsequently underflow the stack to write the high 4 bytes of a value where we’ve previously set the lower 4 bytes.

Combining all of this together, we can leak a base address to get ROP gadgets, and then build up a ropchain in the data segment, and finally pivot to it by returning to a leave; ret; gadget in librp2sm.so (which is present in the immediate bytes of code responsible for assembling the leave; ret; epilogue of JITed functions—even though no AMD64 compiler emits leave). From there, we can use standard gadgets for rdi and rsi control, and then use a handy mov edx, 5; call mprotect; gadget from the JIT compiler to set the rodata section to r-x. Finally, jump to the newly-executable shellcode and pop a shell.

More unintended?#

Remember the trick with using immediate bytes as code? M30W from Balsn showed me an exploit that leveraged this on the JITed code instead. In another screw-up on my part, I had put a REX.W prefix on the shift instructions, which meant there was a method to get (limited) control over a 64-bit value. A push 3; shr; push 3; shl would zero the lowest 3 bits of the 64-bit value currently at the top of the stack, and M30W used this to nudge the return pointer of a function to land inside the immediate bytes of another 32-bit push in the caller. Since the bytecode push instructions are JITed into real x86 code, we have semi-arbitrary bytes in an executable section (4 bytes arbitrary preceded by an opcode byte, as many times as desired). From there, one can write shellcode 2 bytes at a time, with the other 2 bytes of the 4-byte immediate taken up by eb01, which skips the next byte (the 68 primary opcode for pushl) to jump into the next immediate, which is enough to get the register control needed to call execve("/bin/sh", NULL, NULL).

Author notes#

Exploiting big JIT engines like V8’s Turbofan is a hot topic these days (even picoCTF has a excellent set of intro challenges), but how well-explored are the very basics? With rp2sm, I wanted to create a toy JIT VM with a codebase small enough to reasonably be reversed in a CTF setting. Overall, feedback was very positive, and I think the challenge was a success.

On the implementation side, I consciously tried to make reversing not unnecessarily complicated. C++ STL containers can produce some notoriously convoluted output when compiled, and I wanted the challenge to be about the VM, not figuring out that a giant blob of code is in fact std::vector::emplace_back; to that end, I implemented my own fixed-length dynamically-allocated array and used that whenever possible, only falling back to a single std::list and std::vector in the compiler for tracking relocations (the types of each container chosen to simplify the logic as much as possible). Also, C++ gave me some opportunities to ~~nerd snipe myself~~ write cleaner implementation code than would be possible with C, for instance with a compile-time unhexlify such that emitting a fixed assembly sequence looked like:

// pop rsi; mov eax, dword [rsi+r14]; push rax;
cc.append_code_frag(unhexlify("5e428b043650"));

As for designing the exploitation, most of the intended solution path was based around actual bugs I introduced while writing the VM (of course the unintended path originated this way too). In designing, I only planned for the stack checker bug (simply because implementing a correct stack checker was too much effort both to write and to reverse); the rest of the bugs for making the ROP work were not planned for ahead-of-time.

I understand a lot of the apprehension behind a revpwn challenge (we as a team have been on the receiving end quite a few times), so I tried to make the pwn part based on the code players reversed as much as possible. Personally, I find challenges where you have to reverse a lot of code to just find a vulnerability which could have existed in a much smaller program not very interesting; the reversing just feels like a tedious first step to a standard pwnable. Thus, I designed the intended solution of rp2sm around a bug in the core component of the VM players were reversing, requiring knowledge of the compiler and bytecode to see the bug, and more in-depth knowledge of the bytecode to be able to exploit it. The design of the first part plays into this as well; instead of having players reverse VM bytecode, I had players write a module themselves. Whereas with reversing bytecode, there is a reference where you can observe how things work from code written by someone who presumably understands the VM (or just dump compiled/JITed code and ignore the bytecode), when writing code there is no such reference. I hoped that this would push teams to truly learn the how the VM worked, which would lead them to the intended stack checker bug.

Of course the unintended OOB write negates all of this, since writing hooks is likely powerful enough on its own to be able to pop a shell, and the rest of the VM machinery becomes not too relevant. So, to anyone who found part 2 disappointing after all of the reversing that came first, I apologize for this (and give pwning the patched library a try!); again, this shows that playtesting of challenges is very important!

On part 1, in hindsight leaving out multiply/divide/remainder instructions wasn’t a great idea either; even with those instructions, being able to implement a modular inverse in the VM was proof enough that you understood the VM, and forcing players to also implement divide and multiply is just tedious. Interestingly enough, this isn’t the first time I left out multiply instructions in a VM reversing challenge either. In gfc3, a challenge I wrote for NACTF 2020, I had straight-up forgotten and only realized when I was writing the VM program to run; instead of adding the instructions, I implemented a software multiplier instead. Something similar happened in rp2sm: the x86 mul/div/rem instructions are a different format compared to both the basic binary operations (add/sub/xor/etc.) and the shift operations, and I didn’t feel like figuring out the details of how to assemble them; instead, I wrote functions to do multiply and divmod in bytecode. At the time I justified this by thinking this would get people to use the ability to have multiple functions in the VM, however I underestimated the laziness of CTF players; inlining the multiply/divide is going to be less effort than figuring out another part of the VM, so naturally that is what some of the teams did. Also, due to the unintended OOB, even solving the pwn wouldn’t require knowledge of how to create/call functions either.

And speaking of the unintended OOB, it seems I predicted this a year ago while preparing for redpwnCTF 2020…

Oops.

Finally, the inclusion of a separate rodata section is a obscure, subtle nod to a DiceCTF challenge: poortho’s “sourceless rust wasm pwn”. In that challenge, exploitation abused the fact that wasm has (at least in the current spec) only a single memory, and thus no read-only memory. Thus, I went in designing the VM to have a read-only section, just in case we end up with a “sourceless rust rp2sm pwn” sometime ;).

Ethan Wu

redpwnCTF 2021 - rp2sm

Rev#

VM overview#

`chall` binary#

Pwn#

Obvious and unintended#

Intended#

More unintended?#

Author notes#

Ethan Wu

Ethan Wu

redpwnCTF 2021 - rp2sm

Rev#

VM overview#

chall binary#

Pwn#

Obvious and unintended#

Intended#

More unintended?#

Author notes#

Ethan Wu

`chall` binary#