UIUCTF 2021 Ropfuscated Writeup

# UIUCTF 2021 Ropfuscated Writeup ## 0x00 > Table of Contents * [Table of Contents](#0x00-gt-Table-of-Contents) * [Complains and Stuff](#0x01-gt-Time-time-time%E2%80%A6) * [Challenge Introduction](#0x02-gt-Introduction) * [Extract ROP](#0x03-gt-Extract-ROP) * [ROP to Assembly](#0x04-gt-ROP-to-assembly) * [Reduce Code Size](#0x05-gt-Reduce-code-size) * [Reverse the Logic](#0x06-gt-Reverse-the-Logic) * [Discussion](#0x07-gt-Discussion) ## 0x01 > Time, time, time... Work has kept me extremely busy these days, and I just can't seem to find the motivation and time to play smaller CTFs anymore. Maybe it's about time to rethink my time allocation and life choice... Anyways, even though I didn't play UIUCTF this year, one of my friends kept bugging me to try Ropfuscated, claiming it was something I would enjoy. After spending like 4 hours on Tuesday night and finally solving it, I found the concept interesting, but it was also quite a painful journey. (I am a pwn guy, and not that into reversing stuff after all) In the spirit of sharing the suffering I experienced, I decided to do a quick writeup on my approach, which is what you are now reading. ## 0x02 > Introduction The program was quite simple, it presents a android like lockscreen, and we have to figure out the correct pattern to the puzzle An image is shown below ![](https://imgur.com/Qrh7Hv7.png) Playing with it for a bit, we can see that the program utilizes terminal mouse support for IO, and outputs a failure message whenever user finishes drawing (releases the mouse) The panel is also reset after failure ![](https://imgur.com/mWM1W99.png) That's about all we can get from blackboxing stuff, time to fire up IDA and start reversing ## 0x03 > Extract ROP As expected, the **main()** function starts by calling tcsetattr to set up terminal to display stuff correctly ![](https://imgur.com/KS7WAaP.png) **main()** calls function **rop()** for the actual program logic. **rop()** then sets **rsp** to **rop_chain**, and returns. Interesting, so the entire program is actually implemented with an **ROPchain**, hence the name **Ropfuscated**, now let's see if this is doable manually (I'll be very disappointed if it is, but still check anyway) ![](https://imgur.com/qYL5CQh.png) Scrolling through the **ROPchain**, We can see that the entire ROPchain spans from **0x414070** to **0x6a1340**, which is a whopping 2.5MB. Manual reading everything is definitely out of question. Let's start parsing it. Extracting gadgets from ROPchain is usually pretty simple. Treat each 8 bytes in the stack as a gadget entry, and retrieve the gadget from the hardcoded address. The one common exception where a stack entry is not a gadget is when we encounter the **pop** instruction, under this case, the entry is treated as an argument instead. There are of course trickier cases if we encounter **rsp** manipulating gadgets such as **add rsp, 8** or things like that, but let's ignore those for now and handle them later if required. I quickly scripted a parser to pull out data from raw binary, decode it, and write the parsed assembly to a file. As one can see, the code is quite short Some quick highlights on the code below 1. Parsing ROP gadgets properly is quite a hassle, so I took the easy way. First assume each gadget is at most 0x20 bytes long, this holds for real life ROP chains where gadgets are usually one single instruction. Then throw the extracted snippet to pwntools builtin disassembler, finally, find the first **ret** instruction in disassembled assembly and mark that as end of gadget. 2. Pwntools **disasm()** is slow, and since it is expected that the total count of unique gadgets is small, cache previously disassembled gadgets for later reference 3. For parsed gadget, traverse each instruction and check if it is **pop**. This is required to obtain correct count of entries that are arguments instead of gadgets ```python= def loadFile(fname): with open(fname,'rb') as f: return f.read() def resolveAddrOffset(addr): #addr <-> offsets mappings can be easily resolved by examining readelf results offset = addr-0x400000 if offset>=0x3e10: offset-=0x1000 return offset def retrieveData(data,addr,length): offset = resolveAddrOffset(addr) return data[offset:offset+length] def parseROP(data,ROPchain,base,outFile=None): def getOutFile(outFile): if outFile is None: f = sys.stdout else: f = open(outFile,'w') return f def extractAsm(data,gadgetAddr,length): machineCode = retrieveData(data,gadgetAddr,length) assembly = disasm(machineCode).split('\n') parsed = '' argcnt = 0 for line in assembly: line = line.split(' ')[-1].strip() parsed+=line+'; ' if 'pop' in line: argcnt+=1 if 'ret' in line: break if 'ret' not in parsed: print('parse failed') exit() return parsed, argcnt def getArgs(data, offset, cnt): args = [] for i in range(cnt): args.append(data[offset+i*8:offset+(i+1)*8]) return args f = getOutFile(outFile) gadgets = {} L = len(ROPchain) rsp = 0 while rsp<L: if rsp&0xfff==0: print(f'{hex(rsp)}/{hex(L)}') gadgetAddr = u64(ROPchain[rsp:rsp+8]) if gadgetAddr in gadgets: gadgetAssembly, gadgetArgcnt = gadgets[gadgetAddr] else: gadgetAssembly, gadgetArgcnt = extractAsm(data,gadgetAddr,0x20) gadgets[gadgetAddr] = (gadgetAssembly, gadgetArgcnt) args = getArgs(ROPchain, rsp+8, gadgetArgcnt) print(f'{hex(base+rsp)} :\t{gadgetAssembly}\t({", ".join(map(hex,map(u64,args)))})',file=f) rsp+=8+gadgetArgcnt* ``` Easy peasy, now we have assembly instead of ROP, can we start reading code? Well, it is of course an option, but reading 200k+ lines of code is not something I would feel like doing, let's see if there is a better way of approching this. ![](https://imgur.com/t6801ZX.png) ## 0x04 > ROP to assembly Upon examining the assembly, some patterns rise pretty quickly. 1. jumps are always done by exchanging esp and eax ``` xchg esp,eax; rol bl,0x48; cmove eax,ebx; ret; () ``` 2. Conditional branches are done by a unique pattern. First, comparison results are stored into memory. Notice how sete(or in other cases setl cmove cmovl) is used to set **al** according to result ``` 0x457ee0 : cmp rcx,rbx; ret; () 0x457ee8 : pop rax; ret; (0x0) 0x457ef8 : sete al; ret; () 0x457f00 : pop rbx; ret; (0x404074) 0x457f10 : mov QWORD PTR [rbx],rax; ret; () 0x457f18 : pop rax; ret; (0x404074) 0x457f28 : mov rbx,QWORD PTR [rax]; ret; () 0x457f30 : pop rax; ret; (0x404074) 0x457f40 : mov rcx,QWORD PTR [rax]; ret; () 0x457f48 : xchg rbx,rax; ret; () 0x457f50 : mov rbx,rcx; ret; () 0x457f58 : add rax,rbx; ret; () 0x457f60 : pop rbx; ret; (0x404074) 0x457f70 : mov QWORD PTR [rbx],rax; ret; () ``` Then the memory slot is doubled there times, so a final of multiplication by 8 is calculated(one round of doubling is shown below) ``` 0x457f18 : pop rax; ret; (0x404074) 0x457f28 : mov rbx,QWORD PTR [rax]; ret; () 0x457f30 : pop rax; ret; (0x404074) 0x457f40 : mov rcx,QWORD PTR [rax]; ret; () 0x457f48 : xchg rbx,rax; ret; () 0x457f50 : mov rbx,rcx; ret; () 0x457f58 : add rax,rbx; ret; () 0x457f60 : pop rbx; ret; (0x404074) 0x457f70 : mov QWORD PTR [rbx],rax; ret; () ``` The **cmp** result \*8 is then added with 0xc8 and some **rsp** related value. Notice that cmp result is encoded with **sete** or similar instruction, meaning the summation has only two possible results that differ by a fixed value of 8. ``` 0x45a000 : pop rbx; ret; (0x404074) 0x45a010 : mov QWORD PTR [rbx],rax; ret; () 0x45a018 : pop rax; ret; (0xc8) 0x45a028 : pop rbx; ret; (0x404080) 0x45a038 : mov QWORD PTR [rbx],rax; ret; () 0x45a040 : pop rax; ret; (0x404080) 0x45a050 : mov rbx,QWORD PTR [rax]; ret; () 0x45a058 : pop rax; ret; (0x404074) 0x45a068 : mov rcx,QWORD PTR [rax]; ret; () 0x45a070 : xchg rbx,rax; ret; () 0x45a078 : mov rbx,rcx; ret; () 0x45a080 : add rax,rbx; ret; () 0x45a088 : pop rbx; ret; (0x404074) 0x45a098 : mov QWORD PTR [rbx],rax; ret; () 0x45a0a0 : mov rax,rsp; ret; () 0x45a0a8 : pop rbx; ret; (0x404080) 0x45a0b8 : mov QWORD PTR [rbx],rax; ret; () 0x45a0c0 : pop rax; ret; (0x404080) 0x45a0d0 : mov rbx,QWORD PTR [rax]; ret; () 0x45a0d8 : pop rax; ret; (0x404074) 0x45a0e8 : mov rcx,QWORD PTR [rax]; ret; () 0x45a0f0 : xchg rbx,rax; ret; () 0x45a0f8 : mov rbx,rcx; ret; () 0x45a100 : add rax,rbx; ret; () 0x45a108 : pop rbx; ret; (0x404074) 0x45a118 : mov QWORD PTR [rbx],rax; ret; () ``` **rax** is finally assigned to the summation result. The summed addr always points to the instruction with 3 **pop** or its first argument **0x401350**. ``` 0x45a130 : mov rax,QWORD PTR [rbx]; ret; () 0x45a138 : xchg esp,eax; rol bl,0x48; cmove eax,ebx; ret; () 0x45a140 : 0: 66 2e 0f 1f 84 00 00 cs nop WORD PTR [rax+rax*1+0x0]; 7: 00 00 00; endbr64; ret; () 0x45a148 : 0: 66 2e 0f 1f 84 00 00 cs nop WORD PTR [rax+rax*1+0x0]; 7: 00 00 00; endbr64; ret; () 0x45a150 : 0: 66 2e 0f 1f 84 00 00 cs nop WORD PTR [rax+rax*1+0x0]; 7: 00 00 00; endbr64; ret; () 0x45a158 : 0: 66 2e 0f 1f 84 00 00 cs nop WORD PTR [rax+rax*1+0x0]; 7: 00 00 00; endbr64; ret; () 0x45a160 : 0: 66 2e 0f 1f 84 00 00 cs nop WORD PTR [rax+rax*1+0x0]; 7: 00 00 00; endbr64; ret; () 0x45a168 : 0: 66 2e 0f 1f 84 00 00 cs nop WORD PTR [rax+rax*1+0x0]; 7: 00 00 00; endbr64; ret; () 0x45a170 : pop rbp; pop r14; pop r15; ret; (0x401350, 0x45b4c8, 0x401376) ``` For people familiar with polyglot ROP, this is a classic dispatcher pattern, where the arguments are actually valied gadgets. Here **0x401350** and **0x401376** corresponds to the code below ``` 0x45a170 : pop rbp; pop r14; pop r15; ret; (0x401350, 0x45b4c8, 0x401376) 0x45a178 : pop rax; (0x45b4c8) 0x45a188 : xchg esp,eax; rol bl,0x48; cmove eax,ebx; ret; () ``` Summarizing the entire progress above, the cmp result is used as a switch which jumps to either another jump (with hard coded destination address), or to a 3 **pop** insn which skips the jump, and continues executing the following **ROPchain**. 3. many unoptimized register passing are done, you can easily spot all kinds op **xchg**, **mov** and kinds in the snippet above. Those instructions are required when we have an extreme limited insn set like in ROP, but only tampers readability while analyzing code. Of course many other patterns also exist in the ASM, but we focus on the above 3 patterns for the following reason. * Redundant register assignment should be reduced if we perform some basic compile time optimization to the code (e.g. constant folding/dead elimination) * Control flow is the only place where we have to deal with the actual address where **ROPchain** is loaded, making a bit tricky to handle in downstream optimizations So to this point, my approach is quite simple. Transform **ROP** to pure assembly that does not rely on stack anymore, and optimize the code. The first part of my plan is **ROP** to assembly, but how can I do that? Recall that I mentioned earlier that **ROPgadgets** are usually short in length. Upon examining the code, this assumption is confirmed. In fact, most gadgets are shorted than 8 bytes, which means we can easily replace the ROPchain entry with the gadget directly, and pad up remaining space with **nop**. There are a few exceptions where the gadgets are long, or where a slot is never reached and the gadget is not legal assembly, but those cases are so few and can be easily circumvented. Below is the function written to parse the previous extracted ROP list, and replace those long/invalid gadgets with a shorted, functionally equivalent counterpart. ```python= def loadRasm(inFile): hex2int = lambda x : int(x[2:],16) loaded = [] with open(inFile,'r') as f: Rasm = f.read().strip().split('\n') for line in Rasm: line = line.split('\t') addr = int(line[0][2:-2],16) if line[1]=='add eax,0xc3d83948; cmp rcx,rbx; ret; ': line[1] = 'cmp rcx, rbx; ret; ' elif line[1][:2]=='0:': line[1] = 'ret; ' opcodes = line[1].split('; ')[:-2] try: args = list(map(hex2int,line[2][1:-1].split(', '))) except: args = [] loaded.append({'addr':addr,'opcodes':opcodes,'args':args}) return loaded ``` After parsing, we still have to transform ROPgadgets to legal assembly, most gadgets can be transformed by simply removing the **ret** instruction at end of gadget (which is also done in the function shown above) The only exception is **pop** gadgets and **control flow related** gadgets. For **pop**, I replaced the gadget with a **mov reg, arg**, while **mov** instructions are inheritently longer than other instructions, we have the privilege of using additional ROP slots orginally allocated for arguments, so that won't cause any problem here. The real problem is those control flow one. There are 3 main gadgets we need to process with care. 1. For **xchg esp, eax** instructions, since I am doing inplace replacement, all gadget offsets are guaranteed to stay intact, so this can be directly replaced with a **jmp rax** 2. For conditional jumps (the 3 pops instruction), it can be formulated as below. ``` If jumped to start of gadget: skip the entire 4 slots Else: jump to address recorded in second arg ``` A simple snippet that achieves this is shown below, direct replacement will do. ``` 00: jmp 0x1e 02: nop ... 07: nop 08: mov rax, ARG2 0f: jmp rax ``` 3. For **mov rax, rsp** instructions, the idea is to make **rax** point to instruction that immediately follows current one. Since we are lifting **rsp** out of assembly, this must be changed to a **rip** related instruction. **lea rax, [rip+1]** is the perfect choice here. Again, direct replacement will do. function that does the above operations is shown below, some highlights yet again 1. This function can serve 2 purpose, generate text based assembly for human reading or machine code that can be executed. Current shown code is for machine code generation. Switch to text generation by uncommenting stuff in code. 2. For machine code generation, we once again face the slowness of pwntools **asm()**, this is even slower that **disasm()**, so cache implementation is critical for speedup 3. For text, notice that we generate some illegal **jmp IMM** instructions. This is for readability and downstream parsing, which I will discuss later. ```python= def transpileRasm(Rasm): Asm = b'' seen = {} simplified = open('SIMP','w') for lidx, line in enumerate(Rasm): if lidx%1000==0: print(f'{lidx}/{len(Rasm)}') curarg = 0 if len(line['args'])==3: ###implementation of if/else switch, special handler here Asm+=b'\xeb\x1e\x90\x90\x90\x90\x90\x90' #print(f'{hex(line["addr"])} :\tjmp {hex(line["addr"]+0x20)}',file=simplified) #line = {'addr':line['addr']+0x8, 'opcodes':[f'jmp {hex(line["args"][1])}'], 'args':[0,0]} line = {'addr':line['addr']+0x8, 'opcodes':[f'mov rax, {hex(line["args"][1])}; jmp rax'], 'args':[0,0]} else: for idx,opcode in enumerate(line['opcodes']): if opcode=='xchg eax,esp' or opcode=='xchg esp,eax': ###implementation of jump line = {'addr':line['addr'], 'opcodes':['jmp rax'],'args':[]} break elif opcode=='mov rax,rsp': ###load relative offset for jump line = {'addr':line['addr'], 'opcodes':['lea rax, [rip+1]'],'args':[]} break elif 'pop' in opcode: opr = opcode.split(' ')[-1] line['opcodes'][idx] = f'mov {opr},{hex(line["args"][curarg])}' curarg+=1 #print(f'{hex(line["addr"])} :\t{"; ".join(line["opcodes"])}',file=simplified) length = len(line['args'])*8+8 opcodes = ';'.join(line['opcodes']) if opcodes not in seen: seen[opcodes] = asm(opcodes) if len(seen[opcodes])>length: print('error') print(opcodes) input() Asm+=seen[opcodes].ljust(length,b'\x90') return Asm ``` So now we have the assembly version at hand, time to see if there are any parsing errors. A simple way to do this is patch the assembly back into original binary and see if it runs properly. Here's the code that patches assembly into binary, adds execution privilege to the page, and change **rop()** function to call assembly instead of return to ROPchain. Works like charm! ```python= from pwn import * import Myassembly context.arch = 'amd64' with open('ropfuscated','rb') as f: data = f.read() stub = asm(''' push rbp mov rbp, rsp mov rax, 0x414070 call rax leave ret ''') data = data[:0x40+0x38*5+4]+p32(7)+data[0x40+0x38*5+8:] data = data[:0x13070]+Myassembly.code+data[0x2a0340:] data = data[:0x1310]+stub+data[0x1310+len(stub):] with open('rop2','wb') as f: f.write(data) ``` So at this point, I wondered if this is already good enough, IDA is decent at analysing assembly, so maybe if I hand this version of binary to IDA, it can decompile properly? The result is no. IDA not only fails to decompile, it even fails in analysing the function. Adding program prologue/epilogue to assembly does not help, and it is understandable. The main reason of failure is size of function. We will have to reduce it one way or another. ## 0x05 > Reduce code size As mentioned earlier, the code is bloated due to redundant register value juggling, and could be dramatically optimized with constant folding. While this is not hard to implement, I'm 2.5 hours into reversing and kind of tired at this point. Hand carving optimization is not really something I'd like to do. An alternative option is to parse assembly into c code and hand it to gcc and see how it does on optimizing the code. This seems like an easier path, so I'll take it. This is where the text version of assembly comes in handy. First, since I don't want to enforce the location of my code, we have to do some memory setup to ensure all memory references won't fail. Here's the prologue for my c code. Aside from the setting up terminal/memory, I also allocated variables for each register, a temporary one (T), and another two for compare results (F,L). ```python= Ccode = ''' #include<unistd.h> #include<stdio.h> #include<sys/mman.h> int main(){ struct termios TERM; tcgetattr(0, &TERM); TERM.c_lflag&=0xfffffff5; tcsetattr(0, 0, &TERM); unsigned long long int rax, rbx, rcx, rdx, rdi, rsi, T, F, L; mmap(0x400000,0x20000,3,0x22,-1,0); ''' ``` Parsing C code is quite easy in most cases, since there are only a handful instruction used in the assembly. The exception is **cmp** related stuff and **jmp rax**. Why is **jmp rax** especially tricky to handle? Well, in c we can imitate **jmp** behaviour with **goto**, but this requires static tags, and we can't provide a variable as **goto** destination. This challenge can be split into two part 1. generate tags for each goto destination 2. resolve the correct tag for each goto Generating tags for each goto destination is quite simple, just treat the address of instructions as tags. More specifically, generate a tag T_addr for each instruction (e.g. T_0x414070). While I reckon most instructions won't need tags, gcc most mind won't complain if we generate excessive ones, so I just blindly sprayed tags all ver the code. Now recall that we have two kinds of **jmp**s in the text file 1. jmp IMM 2. jmp rax The **jmp IMM** case can directly be replace with a goto to corresponding tag. The real problem is when we have those dynamic **jmp**s. Once again, **jmp rax** can be split into two cases 1. **rax** is a fixed address, just that I didn't resolve it in previous stages 2. **rax** is a dynamic value(the case of conditional jumps) To generate code properly, I will have to differentiate between those two. The proper way of telling between those 2 is to do a small scale backwards constant folding and see whether **rax** resolves to a static value, but that is too much effort. Instead, I read the assembly a bit and noticed that 1. for all cases where **rax** is a static value, the last **IMM** occuring before jmp is the value assigned to **rax** 2. for all cases where **rax** is a dynamic value, there exist a memory dereference before **jmp** and after all **IMM** occurance This might be a bit hard to comprehend, to illustrate it properly, see the illustration below ``` STATIC rax mov reg, IMM1 ... ... <- no memory dereference, IMM in between ... jmp rax (rax must be IMM1) DYNAMIC rax mov reg, QWORD PTR [ADDR1] ... ... <- no IMM in between ... jmp rax ``` So with this we can differentiate between static **rax**, and dynamic **rax**. Moreover, we can actually resolve all static **rax** by finding the last **IMM** preceding **jmp**. This leave us with only dynamic **rax** to handle. We already know that all dynamic **rax** are only used in conditional **jmp**s, and conditional **jmp**s have only one form of syntax, so we can abstract the logic as below Find the nearest following **jmp** after dynamic jmp. This jmp must be the one of the two candidate destination for **rax**. The other target will be this addr+8. Encoding this knowledge into valid c code, we can get ``` ASSEMBLY jmp rax //dynamic rax ... <- garbage assembly never reached ADDR1 : jmp IMM ADDR2 : jmp IMM C CODE if(rax==ADDR1) goto ADDR1 else goto ADDR2 ``` The snippet that implements jump logic c code generation is shown below. Notice that there is a IGNORE variable for dynamic **rax** case. This is because some garbage assembly follows right after those **jmp**s, and parsing them is a waste of time, just mark them as redundant and skip ```python= for idx,line in enumerate(data): line = line.split('\t') addr = int(line[0][2:-2],16) if IGNORE!=-1: if IGNORE!=addr: continue else: IGNORE = -1 tag = f'T_{hex(addr)}' Ccode+=f'{tag}:\n' code = line[1] if 'jmp rax' in line[1]: pidx = idx-1 while True: if '0x' in data[pidx].split('\t')[1] or '[' in data[pidx]: break pidx-=1 if '[' in data[pidx]: pidx = idx+1 while True: if 'jmp' in data[pidx]: break pidx+=1 targetaddr = int(data[pidx].split('\t')[0][2:-2],16) Ctmp = f'if(rax=={hex(targetaddr)}) goto T_{hex(targetaddr)};\nelse goto T_{hex(targetaddr+8)};\n' IGNORE = targetaddr else: targetaddr = data[pidx].split('\t')[1] targetaddr = targetaddr[targetaddr.find('0x'):] targetaddr = targetaddr.split(' ')[0] Ctmp = f'goto T_{targetaddr};\n' Ccode+=Ctmp elif 'jmp' in line[1]: targetaddr = line[1][line[1].find('0x'):] targetaddr = targetaddr.split(' ')[0] Ctmp = f'goto T_{targetaddr};\n' Ccode+=Ctmp ... ``` Another case that requires additional handling is the **syscall** instruction. While we can directly write **syscall()** in c code, I doubt gcc will optimize those into proper functions, so resolving those at code generation time might be a good idea. Applying strace to the binary, it is easy to notice the only two syscalls ever used in assembly is **read** and **write**. Further examining assembly shows that the last IMM occuring before **syscall** is always assigned to **rax** (aka. syscall number) Below is the snippet that transforms all syscalls into c function calls ```python= for idx,line in enumerate(data): ... elif 'syscall' in line[1]: #The correct way here is to do constant folding, but i'm too lazy for that, and rax always seems to be the last assigned value anyway pidx = idx-1 while True: if '0x1' in data[pidx] or '0x0' in data[pidx]: break pidx-=1 if '0x1' in data[pidx]: Ctmp = f'rax = write(rdi, (char*)rsi, rdx);\n' else: Ctmp = f'rax = read(rdi, (char*)rsi, rdx);\n' Ccode+=Ctmp ... ``` The rest of the assembly -> c code translation is trivial (just remeber to handle mov properly : ) ) ```python= for idx,line in enumerate(data): ... else: opcode = line[1].split('; ')[-1] #some operations have a leading ??? bl,al instruction that can be ignored opcode = opcode.split() opc = opcode[0] if opc=='mov': if opcode[1]=='QWORD': opr1, opr2 = opcode[3].split(',') opr1 = opr1[1:-1] Ctmp = f'*(unsigned long long int*){opr1} = {opr2};\n' elif 'QWORD' in opcode[1]: opr1 = opcode[1].split(',')[0] opr2 = opcode[3][1:-1] Ctmp = f'{opr1} = *(unsigned long long int*){opr2};\n' else: opr1, opr2 = opcode[1].split(',') Ctmp = f'{opr1} = {opr2};\n' elif opc=='xchg': opr1, opr2 = opcode[1].split(',') Ctmp = f'T = {opr1};\n{opr1} = {opr2};\n{opr2} = T;\n' elif opc=='add': opr1, opr2 = opcode[1].split(',') Ctmp = f'{opr1}+={opr2};\n' elif opc=='xor': opr1, opr2 = opcode[1].split(',') Ctmp = f'{opr1}^={opr2};\n' elif opc=='xor': opr1, opr2 = opcode[1].split(',') Ctmp = f'{opr1}^={opr2};\n' elif opc=='cmp': opr1, opr2 = opcode[1].split(',') if opr2=='': #some parsing oopsies in previous stage opr2 = opcode[2] Ctmp = f'if({opr1}=={opr2})'+'{F=1;L=0;}\n'+f'else if({opr1}<{opr2})'+'{F=0;L=1;}\nelse{F=0;L=0;}\n' elif opc=='sete': #always al Ctmp = f'if(F==1) rax=1;\n' elif opc=='setl': #always al Ctmp = f'if(L==1) rax=1;\n' elif opc=='cmove': opr1, opr2 = opcode[1].split(',') Ctmp = f'if(F==1) {opr1}={opr2};\n' elif opc=='cmovl': opr1, opr2 = opcode[1].split(',') Ctmp = f'if(L==1) {opr1}={opr2};\n' elif opc=='lea': #always lea rax, [rip+1] Ctmp = f'rax = {hex(addr+8)};\n' else: print(opc) input() Ccode+=Ctmp ``` The generated c code looks messy af, but who cares as long as it can compile! ```c= T_0x458108: rcx = *(unsigned long long int*)rax; T_0x458110: T = rbx; rbx = rax; rax = T; T_0x458118: rbx = rcx; T_0x458120: rax+=rbx; T_0x458128: rbx = 0x404074; T_0x458138: *(unsigned long long int*)rbx = rax; T_0x458140: rbx = 0x404074; T_0x458150: rax = *(unsigned long long int*)rbx; T_0x458158: if(rax==0x458190) goto T_0x458190; else goto T_0x458198; T_0x458190: goto T_0x4581b0; T_0x458198: goto T_0x4594d0; T_0x4581b0: rax = 0x4040bc; T_0x4581c0: rbx = *(unsigned long long int*)rax; T_0x4581c8: T = rbx; rbx = rax; rax = T; T_0x4581d0: T = rcx; rcx = rax; rax = T; ``` Compiling with O3, the result binary size is reduced by more than 30 folds, and runs corrrectly! ![](https://imgur.com/kidPIQ9.png) Seems like we can finally start reading some quality code. ## 0x06 > Reverse the Logic Popping the new binary into IDA, it now decompiles without error. The recompiled c code is still fairly long, about 10000 lines in length, but this is within range of manageable size if we reverse smartly. Before we get started, it is helpful to identify a few code patterns There are 4 major patterns worth noticing Stack operations ```c= //PUSH M[0x40408C] = -1LL; M[0x404080] = 0x4040c4LL; v3 = M[0x4040B4] - 1LL; M[0x4040AC] = v3; *(_QWORD *)(8 * v3 + 0x4040C4) = 0LL; M[0x4040B4] = M[0x4040AC]; //POP M[0x404074] = 0x4040c4LL; M[0x40409C] = 1LL; M[0x404094] = 8LL * M[0x4040AC] + 0x4040c4; v4 = *(_QWORD *)(8LL * M[0x4040AC] + 0x4040C4); M[0x4040B4] = M[0x4040AC] + 1LL; ``` Memory operations ```c= //STORE M[0x404074] = -2LL; M[0x404070] = ~M[0x404098]; *(_QWORD *)(8LL * M[0x4040BC] + 0x403894) = v4 + M[0x404074] + 1; M[0x404080] = 0x4040c4LL; //LOAD M[0x404094] = 8 * (M[0x4040BC] - 275LL) + 0x4040c4; M[0x40409C] = M[0x4040BC] - 275LL; v8 = *(_QWORD *)(8 * (M[0x4040BC] - 275LL) + 0x4040C4); ``` Obviously this code is compiled from some language that implements a fixed memory pool + stack References are done relative to address **0x4040c4**, which should be a memory base. Memories has a base index stored in **M[0x4040bc]**, and stack top index is stored in **M[0x4040b4]**. All memory/stack entries have size of 8 bytes. Occasionally **idx\*8+0x4040c4** is precalculated as in the **STORE** operation shown above Identifying those patterns allows us to once again optimize the code w.r.t the conventions, but that seems too much effort. I'll just settle with reading the code directly at this point. My first step of approaching large binaries is identifying where user interaction if involved. Fortunately, there is only one read() in the entire binary, so we can ignore anything preceding the read, and focus on how input data is processed. The read is here, input byte is checked against being '\x00' ```c= read(1, (void *)0x404094, 1uLL); v134 = M[0x404094]; M[0x404080] = 0x4040c4LL; if ( !M[0x404094] ) v134 = -1LL; ``` Then the code derefs a counter from Memory[-883] ```c= v136 = *(_QWORD *)(8 * (M[0x4040BC] - 883LL) + 0x4040C4); M[0x404080] = 0x4040c4LL; M[0x40408C] = -1LL; M[0x4040AC] = M[0x4040B4] - 1LL; *(_QWORD *)(8LL * M[0x4040AC] + 0x4040C4) = v136; M[0x40409C] = 0LL; M[0x4040B4] = M[0x4040AC]; M[0x404074] = 0x4040c4LL; M[0x404094] = 8LL * M[0x4040AC] + 0x4040c4; v137 = *(_QWORD *)(8LL * M[0x4040AC] + 0x4040C4); M[0x4040B4] = M[0x4040AC] + 1LL; M[0x404080] = 4749152LL; if ( !v137 ){ ... ``` The counter is used to switch between cases from 0~5. Since IDA code is too long, I'll present the pseudo code for this part. ```python= def MainLoop(): inputL = 0 while True: inputC = getchar() if inputL==0: if inputC=='\x1b': inputL+=1 elif inputL==1: if inputC=='[': inputL+=1 else: inputL = 0 elif inputL==2: if inputC=='M': inputL+=1 else: inputL = 0 elif inputL==3: M[-301] = inputC inputL+=1 elif inputL==4: inputL+=1 doStuff() elif inputL==5: inputL = 0 doStuff() if M[-301]==' ' or M[-301]=='@': doStuff() elif M[-301]=='#': checkInput(Input) ``` We can clearly see this part of code is responsible for parsing the ansi codes for mouse support. Some google will reveal that the 4th byte (inputL==3) corresponds to the action taken * ' ' stands for when the mouse is clicked * '@' stands for when the mouse is held down * '#' stands for when the mouse is released So making some educated guess based on the program behaviour, we can conclude that the check is done when mouse is released. Let's focus on that part. First, some large arrays are populated, no idea what it is right now, below is part of the population code just to show what is done ```c= *(_QWORD *)(8LL * M[0x4040BC] + 0x402EFC) = 7777853LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402F04) = 6222378LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402F0C) = 3546017LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402F14) = 4445136LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402F1C) = 7780945LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402F24) = 3462586LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402F2C) = 3111820LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402F34) = 2405140LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402F3C) = 3624625LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402F44) = 6968615LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402F4C) = 3176867LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402F54) = 3710589LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402F5C) = 7702269LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402F64) = 3192178LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402F6C) = 649731LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402F74) = 7800749LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402F7C) = 6017677LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402F84) = 6189630LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402F8C) = 1975056LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402F94) = 2694116LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402F9C) = 3038398LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402FA4) = 1663188LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402FAC) = 6543815LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402FB4) = 4176440LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402FBC) = 1696171LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402FC4) = 2471993LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402FCC) = 1030495LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402FD4) = 1229599LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402FDC) = 6638142LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402FE4) = 7858312LL; *(_QWORD *)(8LL * M[0x4040BC] + 0x402FEC) = 5114362LL; ``` First some value is asserted to be 10, if not, the program skips the large chunk inside an if block At first, I just noted this check and while having no idea what it should imply. The comment of input sequence length is added later when I read more code, so you can ignore it for now ```c= //Deref -274 and compare //v243 = M[-274] (-274 seems to be input sequence length) M[0x40408C] = -274LL; M[0x404074] = 0x4040c4LL; M[0x40409C] = M[0x4040BC] - 274LL; M[0x404094] = 8 * (M[0x4040BC] - 274LL) + 0x4040c4; v242 = *(_QWORD *)(8 * (M[0x4040BC] - 274LL) + 0x4040C4); M[0x40408C] = -1LL; M[0x404080] = 0x4040c4LL; M[0x4040AC] = M[0x4040B4] - 1LL; *(_QWORD *)(8LL * M[0x4040AC] + 0x4040C4) = v242; M[0x40409C] = 10LL; M[0x4040B4] = M[0x4040AC]; M[0x404074] = 0x4040c4LL; M[0x404094] = 8LL * M[0x4040AC] + 0x4040c4; v243 = *(_QWORD *)(8LL * M[0x4040AC] + 0x4040C4); M[0x4040B4] = M[0x4040AC] + 1LL; if ( v243 == 10 ){ ... ``` Then a large for loop does the stuff below. Ignore the name INPUT/OUT for now ```python= #407834 INPUT = Input #40660c OUT = [0 for i in range(81)] idx = 0 for i in range(81): if (ARR1[i]+ARR2[80-i])&0xffffffffffffffff!=0: OUT[i] = ARR1[i]-1 else: OUT[i] = ARR1[i]+INPUT[idx] idx+=1 ``` This part is quite interesting, cause ARR1, ARR2 both points inside the previously populated large array. Parsing their content and we can get ![](https://imgur.com/QbjBfwV.png) applying the addition and we can then get, notice how there are exactly ten 0s in the result. This inspired me to guess that the previous seen assert is the input length, and the INPUT array is actually the input ``` [7, 6, 0, 3, 8, 5, 1, 0, 9, 8, 4, 9, 1, 2, 6, 7, 5, 3, 1, 0, 3, 4, 7, 9, 8, 2, 6, 6, 7, 1, 8, 9, 2, 4, 3, 5, 9, 3, 4, 5, 1, 7, 0, 6, 8, 0, 2, 8, 6, 4, 3, 9, 1, 7, 4, 0, 6, 0, 3, 1, 5, 9, 2, 3, 9, 7, 2, 5, 4, 6, 0, 1, 2, 1, 5, 0, 6, 0, 3, 7, 4] ``` Following the parsing, there comes 3 loops which does these stuff Anyone that has the ever played with sudoku before should immediately recognize this as the sudoku check ```python= #4065ac CNT = [0 for i in range(9)] for i in range(0,81,9): for j in range(9): CNT[j] = 0 for j in range(9): idx = (OUT[i+j]+ARR2[80-(i+j)])&0xffffffffffffffff if CNT[idx]!=0: fail() return CNT[idx] = 1 for i in range(9): for j in range(9): CNT[j] = 0 for j in range(0,81,9): idx = (OUT[i+j]+ARR2[80-(i+j)])&0xffffffffffffffff if CNT[idx]!=0: fail() return CNT[idx] = 1 for i in range(0,81,27): for j in range(0,9,3): for k in range(9): CNT[k] = 0 for k in range(0,27,9): for l in range(3): idx = (OUT[i+j+k+l]+ARR2[80-(i+j+k+l)])&0xffffffffffffffff if CNT[idx]!=0: fail() return CNT[idx] = 1 ``` Rearranging the difference between ARR1 and ARR2 we can get ``` +-------+-------+-------+ | 7 6 | 3 8 5 | 1 9 | | 8 4 9 | 1 2 6 | 7 5 3 | | 1 3 | 4 7 9 | 8 2 6 | +-------+-------+-------+ | 6 7 1 | 8 9 2 | 4 3 5 | | 9 3 4 | 5 1 7 | 6 8 | | 2 8 | 6 4 3 | 9 1 7 | +-------+-------+-------+ | 4 6 | 3 1 | 5 9 2 | | 3 9 7 | 2 5 4 | 6 1 | | 2 1 5 | 6 | 3 7 4 | +-------+-------+-------+ ``` Manually fill out the puzzle to get solution ``` [1,3,4,1,4,7,6,7,8,7] ``` The last step is to map those inputs to the lockscreen. At this point, I don't feel like actually reversing input processing, and just made a simple guess that each number from 1~9 corresponds to one node in the lockscreen. As there are only a limited combinations of how those indices can be logically assigned it is quite easy to figure out the correct layout with a few attemps (actually the first attempt proved to be correct). Layout shown below ``` 1 2 3 4 5 6 7 8 9 ``` Providing the solution the lockscreen, we are greeted with the flag ![](https://imgur.com/lRqd7Lo.png) ## 0x07 > Discussion This challenge is quite interesting in that I never thought parsing **ROPchain** could be so challenging, at the same time, it is actually quite straightforward once you figure out a correct approach. There are a few things that had me worrying throughout the entire progress of solving it 1. Self modifying ROP. This would be a critical change if it is ever implemented, as any static parsing would be futile and we must do dynamic runtime analysis on ROPchains. Ironically, self modifying ROP is actually quite easy to implement despite it's effectiveness in obfuscating stuff. I still haven't came up with any good approach, and if anyone has any ideas, do let me know 2. Complex jumping logic. The jumps in this challenge are quite simple. If it is even slightlier complex (in number of jump patterns or even the order of how registers are assigned), it would require solver to implement proper optimizations 3. Mixing check with other stuff. This challenge clearly split the input checking procedure from other stuff (input parsing, literal printing, flag string parsing). If any of these are mixed into the flag checking procedure, I would have been forced to spend a lot more time reading those functions. Overall, the challenge was pretty well designed, though I think the strange ARR1+ARR2 stuff is makes reversing unnecessarily painful. Kudos to the authors for such a great challenge, and let's hope I collect enough motivation to play UIUCTF next year. [Top](#UIUCTF-2021-Ropfuscated-Writeup)