CVE-2024-58239 & CVE-2025-39682 1-day analysis

# CVE-2024-58239 1-day analysis

This is my first 1-day exploit.

Patch: [tls: stop recv() if initial process_rx_list gave us non-DATA](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=fdfbaec5923d9359698cbb286bc0deadbb717504)

I saw `exp407` on the [kernelCTF spreadsheet](https://docs.google.com/spreadsheets/d/e/2PACX-1vS1REdTA29OJftst8xN5B5x8iIUcxuK6bXdzF8G1UXCmRtoNsoQ9MbebdRdFnj6qZ0Yd7LwQfvYC2oF/pubhtml#) and tried to reproduce the exploit. I managed to run the exploit on the kernelCTF environment and get the flag:

![image](https://hackmd.io/_uploads/Bkntd3P6eg.png)

The exploit log is rather long because of a kernel WARNING in the middle. However, a warning is not an error, and it doesn't kill my exploit ;)

![image](https://hackmd.io/_uploads/rkTNzTv6ge.png)

Exploit of CVE-2024-58239: https://github.com/khoatran107/cve-2024-58239

Exploit of the bypass - CVE-2025-39682: https://github.com/khoatran107/cve-2025-39682

## Vulnerability Overview

| Field              | Value                                   |
| ------------------ | --------------------------------------- |
| Product            | Linux                                   |
| Vendor             | Linux                                   |
| Severity           | High                                    |
| Affected Versions  | From 5.1 to versions prior to 5.4.270, 5.10.211, 5.15.150, 6.1.80, 6.6.19, 6.7.7, 6.8 |
| Tested Versions    | kernelCTF mitigation instance, 6.6.0+   |
| Impact             | Elevation of Privilege                  |
| CVE ID             | CVE-2024-58239                          |
| CWE                | CWE-416: Use After Free                 |
| PoC available?     | Yes - the test in the patch commit      |
| Patch available?   | Yes                                     |
| Exploit available? | Yes - my exploit                        |

## Notes on the TLS subsystem before exploiting

For this exploit, I have several important notes:
- kTLS is an upper layer protocol on top of TCP, which essentially means one more "wrapping" layer on the transmit side, and one more "unwrapping" layer on the receive side.
- On the receive side:
    - A TLS record, decrypted or not, is stored inside a `struct sk_buff`.
    - `rx_list` contains records that are decrypted but not yet received (a record is pushed here on `recvmsg` with the `MSG_PEEK` option, or when we haven't received the full record size);
    - `sk->sk_receive_queue` contains undecrypted records.
    - `anchor` is an skb allocated when setting up TLS; its `frag_list` pointer points to the first `skb` of the next TLS record in the receive queue.

## Vulnerability analysis

Patch:
```diff
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 78aedfc682ba84..43dd0d82b6ed7a 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1971,7 +1971,7 @@ int tls_sw_recvmsg(struct sock *sk,
 		goto end;
 
 	copied = err;
-	if (len <= copied)
+	if (len <= copied || (copied && control != TLS_RECORD_TYPE_DATA))
 		goto end;
 
 	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
```

The second condition makes sure that when `copied != 0` and `control != TLS_RECORD_TYPE_DATA`, the flow ends. So to reach the buggy path on an unpatched kernel, we need a state in which that second condition would be true.

```clike
err = process_rx_list(ctx, msg, &control, 0, len, is_peek);
if (err < 0)
	goto end;

copied = err;
if (len <= copied) // patch here
	goto end;
```

`err` is the length of the record just processed from `rx_list`, and `control` is the type of that record.

Record types are defined [here](https://elixir.bootlin.com/linux/v6.6/source/include/net/tls_prot.h#L16) as an enum. But throughout the code, the kernel treats them as two types:
- DATA (= 23)
- non-DATA ~ control (!= 23)

From [this patch](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/net/tls?id=62708b9452f8eb77513115b17c4f8d1a22ebf843) of `CVE-2025-39682`, I know that for a TLS socket, a `recvmsg` syscall should either:
* return one non-DATA record and stop;
* return the merged content of adjacent DATA records.

The bug is that when `process_rx_list` gives a non-DATA record, `recvmsg` has to stop, but the vulnerable version continues processing the next record in the queue.
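To make the effect of the one-line patch concrete, here is a minimal user-space sketch of the early-exit check that follows the first `process_rx_list()` call. The function names `stop_vulnerable` and `stop_patched` are hypothetical helpers of mine; only the two conditions themselves are taken from the diff above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TLS_RECORD_TYPE_DATA 23 /* DATA, as noted in this write-up */

/* Pre-patch check: stop only when the user buffer is already full. */
static bool stop_vulnerable(size_t len, size_t copied, unsigned char control)
{
	(void)control; /* the record type is ignored before the patch */
	return len <= copied;
}

/* Post-patch check: also stop once rx_list has yielded a non-DATA record. */
static bool stop_patched(size_t len, size_t copied, unsigned char control)
{
	return len <= copied || (copied && control != TLS_RECORD_TYPE_DATA);
}
```

For example, with `len = 0x200`, `copied = 0x100` and `control = 100`, `stop_vulnerable()` returns false (the loop goes on to the next record) while `stop_patched()` returns true.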
(\*)

From the mailing list [messages](https://lore.kernel.org/all/018f1633d5471684c65def5fe390de3b15c3d683.1708007371.git.sd@queasysnail.net/), all I knew was that two consecutive non-DATA records get merged, and that there was a PoC for it. It took me another week to fully calm down and think of it like (\*).

When I set the record type of the second record to DATA, I get this:

![image](https://hackmd.io/_uploads/r1PYb456ll.png)

To pin down the root bug, I rebuilt the vulnerable version with KASAN and ran the PoC again:

```bash
user@lts-6:/tmp$ ./exploit
Server listening on port 8080...
Connected from 127.0.0.1:46588
Connected to server!
TLS enabled on connection!
Sent rec1: 1111
Server received: 1111 (record type: 100)
Server sent packet A: AAAA
Server sent packet B: BBBB
Peek packet A: AAAA (record type: 100)
00000000 | 41 41 41 41 00 | AAAA.
Recv packet A: AAAA (record type: 100)
00000000 | 41 41 41 41 00 | AAAA.
Recv packet B: 6�d� (record type: 23)
00000000 | 36 d5 15 64 cc | 6..d.
[ 22.388509] ==================================================================
[ 22.390887] BUG: KASAN: slab-use-after-free in tls_strp_done+0x57/0xc0
[ 22.393386] Read of size 8 at addr ffff88800b71ac00 by task exploit/180
[ 22.395760]
[ 22.396310] CPU: 1 PID: 180 Comm: exploit Not tainted 6.6.18+ #1
[ 22.398596] Hardware name: QEMU Ubuntu 24.04 PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
[ 22.402651] Call Trace:
[ 22.403700] <TASK>
[ 22.404595] dump_stack_lvl+0x43/0x60
[ 22.406073] print_report+0xc5/0x660
[ 22.407520] ? preempt_count_add+0x1c/0xc0
[ 22.408585] ? preempt_count_sub+0x14/0xc0
[ 22.409573] ? __virt_addr_valid+0xef/0x190
[ 22.410021] kasan_report+0xc3/0x100
[ 22.410387] ? tls_strp_done+0x57/0xc0
[ 22.410813] ? tls_strp_done+0x57/0xc0
[ 22.411215] tls_strp_done+0x57/0xc0
[ 22.411575] tls_sk_proto_close+0x23a/0x4c0
[ 22.412040] ? __pfx_tls_sk_proto_close+0x10/0x10
[ 22.412590] ? preempt_count_sub+0x14/0xc0
[ 22.413038] ? down_write+0xd2/0x130
[ 22.413441] ? __pfx_down_write+0x10/0x10
[ 22.413872] inet_release+0xa5/0x110
[ 22.414257] __sock_release+0x63/0x120
[ 22.414835] sock_close+0x11/0x20
[ 22.415375] __fput+0x1d8/0x450
[ 22.415714] __x64_sys_close+0x51/0x90
[ 22.416118] do_syscall_64+0x61/0x90
[ 22.416538] ? exit_to_user_mode_prepare+0x1a/0x150
[ 22.417081] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 22.417589] RIP: 0033:0x422bd4
[ 22.417913] Code: 00 00 00 0f 1f 00 f3 0f 1e fa 31 c9 e9 15 67 04 00 0f 1f 44 00 00 f3 0f 1e fa 80 3d 8d 34 09 00 00 74 13 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3c c3d
[ 22.419830] RSP: 002b:00007ffdd7cfd068 EFLAGS: 00000202 ORIG_RAX: 0000000000000003
[ 22.420754] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 0000000000422bd4
[ 22.421819] RDX: 0000000001ca0911 RSI: 0000000001ca0910 RDI: 0000000000000003
[ 22.422654] RBP: 00007ffdd7cfd0e0 R08: 00000000004b5820 R09: 0000000000000000
[ 22.423541] R10: 0000000000000001 R11: 0000000000000202 R12: 00007ffdd7cfd1f8
[ 22.424505] R13: 00007ffdd7cfd208 R14: 00000000004b0828 R15: 0000000000000001
[ 22.425417] </TASK>
[ 22.425742]
[ 22.426006] Allocated by task 180:
[ 22.426374] kasan_save_stack+0x2f/0x50
[ 22.426780] kasan_set_track+0x21/0x30
[ 22.427194] __kasan_slab_alloc+0x6a/0x70
[ 22.427598] kmem_cache_alloc_node+0x190/0x3c0
[ 22.428057] __alloc_skb+0x1b3/0x230
[ 22.428452] tls_strp_init+0x3b/0xb0
[ 22.428822] tls_set_sw_offload+0x805/0x920
[ 22.429325] tls_setsockopt+0x859/0x8c0
[ 22.429759] __sys_setsockopt+0x188/0x320
[ 22.430192] __x64_sys_setsockopt+0x60/0x70
[ 22.430619] do_syscall_64+0x61/0x90
[ 22.431051] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 22.431911]
[ 22.432099] Freed by task 180:
[ 22.432593] kasan_save_stack+0x2f/0x50
[ 22.433037] kasan_set_track+0x21/0x30
[ 22.433521] kasan_save_free_info+0x27/0x40
[ 22.433948] ____kasan_slab_free+0x11f/0x1a0
[ 22.434480] kmem_cache_free+0x185/0x390
[ 22.434891] process_rx_list+0x2be/0x310
[ 22.435430] tls_sw_recvmsg+0x313/0xd30
[ 22.435860] inet_recvmsg+0x22f/0x240
[ 22.436245] sock_recvmsg+0xff/0x140
[ 22.436655] ____sys_recvmsg+0x142/0x370
[ 22.437079] ___sys_recvmsg+0xe8/0x160
[ 22.437510] __sys_recvmsg+0xe7/0x170
[ 22.437905] do_syscall_64+0x61/0x90
[ 22.438272] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 22.438857]
[ 22.439121] The buggy address belongs to the object at ffff88800b71ab40
[ 22.439121] which belongs to the cache skbuff_head_cache of size 224
[ 22.440870] The buggy address is located 192 bytes inside of
[ 22.440870] freed 224-byte region [ffff88800b71ab40, ffff88800b71ac20)
```

Great, a UAF, with a detailed KASAN report about where the object was allocated and where it was previously freed.

## KASAN report analysis

### UAF Read at offset 192

* From `tls_strp_done`.
```c
tls_strp_done()
  tls_strp_anchor_free()
    struct skb_shared_info *shinfo = skb_shinfo(strp->anchor); // anchor->head @ offset 192

#define skb_shinfo(SKB) ((struct skb_shared_info *)(skb_end_pointer(SKB)))

#ifdef NET_SKBUFF_DATA_USES_OFFSET
static inline unsigned char *skb_end_pointer(const struct sk_buff *skb)
{
	return skb->head + skb->end;
}
// [...] (#else branch omitted)
#endif
```
* The field `sk_buff::head` is at offset 192:
```C
gef> p (*(struct sk_buff*)0).head
Cannot access memory at address 0xc0
gef> p/d 0xc0
$4 = 192
```
* So we can conclude that the object being used after free is a `struct sk_buff`.

### Allocate

* From the backtrace, I check the function `tls_strp_init` (initializing the stream parser):
```C
int tls_strp_init(struct tls_strparser *strp, struct sock *sk)
{
	memset(strp, 0, sizeof(*strp));

	strp->sk = sk;

	strp->anchor = alloc_skb(0, GFP_KERNEL); // [*] here
	if (!strp->anchor)
		return -ENOMEM;

	INIT_WORK(&strp->work, tls_strp_work);

	return 0;
}
```
* The freed skb object is the anchor object. It's allocated from a dedicated cache, `skbuff_head_cache` (the name KASAN reports above). So… I either have to do a cross-cache attack, or attack by allocating another skb. The first option is blocked, as I am targeting the mitigation instance, which has `CONFIG_SLAB_VIRTUAL`.
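Going back to the KASAN numbers for a moment, the offset of the faulting read can be cross-checked with plain pointer arithmetic. This is only a sanity-check sketch; `offset_in_object` is a hypothetical helper name, and the three addresses are the ones printed in the report above.

```c
#include <assert.h>
#include <stdint.h>

/* Distance of an address from the start of the slab object KASAN names. */
static uint64_t offset_in_object(uint64_t addr, uint64_t object_base)
{
	return addr - object_base;
}

/* Addresses copied from the KASAN report in this write-up. */
static const uint64_t kasan_object_base = 0xffff88800b71ab40ULL;
static const uint64_t kasan_object_end  = 0xffff88800b71ac20ULL;
static const uint64_t kasan_buggy_addr  = 0xffff88800b71ac00ULL;
```

`offset_in_object(kasan_buggy_addr, kasan_object_base)` gives 0xc0 (192), which matches the gdb offset of `sk_buff::head`, and the region length is 0xe0 (224), the object size KASAN reports for the cache.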
### Deallocate

* It's freed inside `consume_skb` of `process_rx_list`. My question is: *how does the anchor even get in there?*
* I can confirm that it is in fact inside `rx_list` during my second receive (when the BBBB packet is expected):
```C
gef> chain &ctx->rx_list
[+] head address: 0xffff888001a1b630
[+] next pointer offset: 0x0
[1] -> 0xffff88800b81b500
[2] -> 0xffff888001a1b630 (head)
gef> p ctx->strp->anchor
$6 = (struct sk_buff *) 0xffff88800b81b500
```

### How does the `anchor` get into `rx_list`?

* One hypothesis I can think of is that the skb `next` pointer of the `BBBB` skb points to `anchor`. But that might not be the case, since the first packet of a TLS record is only added as the `frag_list` of anchor:
```
skb_shinfo(strp->anchor)->frag_list = first;
```
* I think the magic lies in the first receive. So let's walk through what happens there.
```C
int tls_sw_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int flags, int *addr_len)
{
	// [...]
	err = process_rx_list(ctx, msg, &control, 0, len, is_peek); // return AAAA - record type 100
	if (len <= copied) // the patch [2/5] makes it stop right here
		goto end;
	// [...]
	while (/* ... */) {
		// [...]
		if (zc_capable /* true */ && to_decrypt <= len /* 0 <= 0x100, true */ &&
		    tlm->control == TLS_RECORD_TYPE_DATA /* true */)
			darg.zc = true; // set zc = true
		// [...]
		err = tls_rx_one_record(sk, msg, &darg); // [1]
		// [...]
		err = tls_record_content_type(msg, tls_msg(darg.skb), &control); // [2]
		if (err <= 0) { // err = 0
			DEBUG_NET_WARN_ON_ONCE(darg.zc);
			tls_rx_rec_done(ctx);
put_on_rx_list_err:
			__skb_queue_tail(&ctx->rx_list, darg.skb); // [3] here darg.skb = ctx->strp->anchor
			goto recv_end;
		}
		// [...]
	}
	// [...]
}
```
* After [1], we already have `darg.skb = ctx->strp->anchor`. That suggests another deep dive, into the `tls_rx_one_record` function, with `zc` being true.
```C
static int tls_rx_one_record(struct sock *sk, struct msghdr *msg,
			     struct tls_decrypt_arg *darg)
{
	// [...]
	err = tls_decrypt_device(sk, msg, tls_ctx, darg);
	if (!err) // always true
		err = tls_decrypt_sw(sk, tls_ctx, msg, darg);
	// [...]
}

static int tls_decrypt_sw(struct sock *sk, struct tls_context *tls_ctx,
			  struct msghdr *msg, struct tls_decrypt_arg *darg)
{
	// [...]
	err = tls_decrypt_sg(sk, &msg->msg_iter, NULL, darg);
	// [...]
}

static int tls_decrypt_sg(struct sock *sk, struct iov_iter *out_iov,
			  struct scatterlist *out_sg,
			  struct tls_decrypt_arg *darg)
{
	// [...]
	if (darg->zc /* true */ &&
	    (out_iov /* out_iov = &msg->msg_iter != NULL */ || out_sg)) {
		clear_skb = NULL; // [4]

		if (out_iov)
			n_sgout = 1 + tail_pages +
				iov_iter_npages_cap(out_iov, INT_MAX, data_len);
		else
			n_sgout = sg_nents(out_sg);
	} else {
		darg->zc = false;

		clear_skb = tls_alloc_clrtxt_skb(sk, skb, rxm->full_len);
		// [...]
	}
	// [...]
	darg->skb = clear_skb /* NULL, from [4] */ ?: tls_strp_msg(ctx); // [5]
	clear_skb = NULL;
	// [...]
}

static inline struct sk_buff *tls_strp_msg(struct tls_sw_context_rx *ctx)
{
	DEBUG_NET_WARN_ON_ONCE(!ctx->strp.msg_ready || !ctx->strp.anchor->len);
	return ctx->strp.anchor; // [6]
}
```
* Because of [4], in [5] & [6], `darg->skb` is set to `ctx->strp.anchor`. And because at [2] the record type of the current record (DATA) differs from that of the previous record (non-DATA), `darg->skb` is inserted into `rx_list` at [3].
* The `BBBB` record is not dropped. It's just that the return value does not reflect the real length of the response. When I hexdump the whole buffer passed into the syscall, I get the output below, which means that with `zc` being true the data is decrypted straight into the userspace buffer, and the record itself is never queued into `rx_list`:
```
Recv packet A (record type: 100)
00000000 | 41 41 41 41 00 42 42 42 42 00 | AAAA.BBBB. |
```
* It seems `darg->skb` is always equal to the anchor when the record type is DATA. What makes the difference here is that `tls_record_content_type` returns 0 because the record type differs from the previous one.
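The type check at [2] is what decides whether `darg.skb` gets parked on `rx_list`, so here is a simplified user-space model of it. This is a sketch under assumptions: `record_content_type` is a hypothetical stand-in, the cmsg reporting the kernel does is left out, and the `-EBADMSG` return for type 0 is my reading of the "0 is not a valid content type" comment in the later patch, not something shown in the excerpts above.

```c
#include <assert.h>
#include <errno.h>

#define TLS_RECORD_TYPE_DATA 23

/*
 * *control is 0 until the first record of this recvmsg() call is reported.
 * Returns 1 to keep going, 0 on a type mismatch with the earlier record,
 * and -EBADMSG for a record of type 0 (assumption, see lead-in).
 */
static int record_content_type(unsigned char rec_type, unsigned char *control)
{
	if (!*control) {
		*control = rec_type;       /* first record sets the call's type */
		if (!*control)
			return -EBADMSG;
	} else if (*control != rec_type) {
		return 0;                  /* caller queues darg.skb on rx_list */
	}
	return 1;
}
```

In the PoC flow, `control` is already 100 after `process_rx_list()`, so checking the DATA record `BBBB` returns 0, and that is the branch that queues the anchor.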
## Exploit

### Brainstorm

Here's the original flow of actions that triggers the bug:

1. server sends A * 100h (record type non-DATA)
2. server sends B * 100h (record type DATA)
3. client peeks 100h → A * 100h is in `rx_list`; `skb_shinfo(anchor)->frag_list` = head of packet B
4. client receives 200h → gets A * 100h and B * 100h; `anchor` is in `rx_list`. B is freed, but `skb_shinfo(anchor)->frag_list` is still B.
5. client receives 200h → `anchor` gets freed but is still referenced by `ctx->strp.anchor`; some mysterious data is returned.
6. close client fd → UAF-read of the `anchor` object.

If we spray skb objects to reclaim the victim between (5) and (6), we may get something.

Since this is an `skb` double free, I assume I can turn it into a page UAF: spray a bunch of skbs with frags[0] = an order-0 page, free one using the UAF-free, reallocate the page with a pipe spray, then free it again with my sprayer. That gives us a page UAF. The last step is a `signalfd` spray to reclaim that order-0 page as a slab in the `filp` cache.

### Full process

1. Trigger the bug, so that `anchor` ends up inside `rx_list` (while still pointed to by `ctx->strp.anchor`); `skb_shinfo(anchor)->frag_list = skb B`, with `skb B` already freed.
2. Spray to reclaim `skb B`.
3. Free both `anchor` and `skb B` through the `splice` syscall.
4. Reclaim `anchor` by spraying `skb` objects with a specific size (0xec0 + 0x1000), so that `skb_shinfo(anchor)->frags[0]` is a victim page of order 0. Now `anchor` is referenced by both an `sk->sk_receive_queue` and `ctx->strp.anchor`, and `skb_shinfo(anchor)->frag_list = NULL`.
5. Release the anchor and the page through `sk->sk_receive_queue`, with the `recv` syscall. That also UAF-frees the order-0 page to the buddy allocator.
6. Reclaim that order-0 page using a pipe spray.
7. Release that page back to the buddy allocator, by freeing anchor through `ctx->strp.anchor`.
8. Reclaim that order-0 page as a slab in the `filp` cache.
9. Overwrite `core_pattern`.
10. Trigger a segfault -> ROOT.
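Step 4 depends on picking a payload size whose paged part is exactly one order-0 page. Below is a rough sketch of that size arithmetic. The order-selection rule (take the largest order up to 3 whose size does not exceed the page-aligned remainder) is my assumption about how the split behaves, `0xec0` is the empirically observed linear size from this write-up, and `frag_orders` and the `PG_*` macros are hypothetical names.

```c
#include <assert.h>

#define PG_SIZE          0x1000UL
#define PG_COSTLY_ORDER  3        /* PAGE_ALLOC_COSTLY_ORDER */
#define SKB_LINEAR_BYTES 0xec0UL  /* observed in this write-up, not derived */
#define PG_ALIGN(x)      (((x) + PG_SIZE - 1) & ~(PG_SIZE - 1))

/*
 * Fill orders[] with the page order of each frag needed for a payload of
 * payload_len bytes. Returns the number of frags, or -1 if orders[] is
 * too small.
 */
static int frag_orders(unsigned long payload_len, int *orders, int max_frags)
{
	unsigned long rest = payload_len > SKB_LINEAR_BYTES ?
			     payload_len - SKB_LINEAR_BYTES : 0;
	int order = PG_COSTLY_ORDER;
	int n = 0;

	while (rest) {
		unsigned long chunk;

		if (n == max_frags)
			return -1;
		/* shrink the order until it fits the page-aligned remainder */
		while (order && PG_ALIGN(rest) < (PG_SIZE << order))
			order--;
		orders[n++] = order;
		chunk = PG_SIZE << order;
		rest -= rest < chunk ? rest : chunk;
	}
	return n;
}
```

Under these assumptions a payload of `0xec0 + 0x1000` bytes yields a single frag of order 0, which is the shape step 4 needs, while `0xec0 + 0x9000` would yield an order-3 frag followed by an order-0 one.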
## Side notes

### `skb` is sprayable

The `skb` object isn't completely wiped out in `consume_skb()`: its `skb_shinfo(skb)->frags[]` is kept. So double freeing an `skb` object leads to double freeing a `struct page *` ~ double freeing a page.

### Spray object

From experimenting, I found out that for an `AF_UNIX` TCP packet of length X, the first `0xec0` bytes go into the linear buffer at `skb->head`. The remaining `Y = X - 0xec0` bytes are split into pages stored in `skb_shinfo(skb)->frags[]`, and the kernel tries to use the smallest number of pages, each of order `[0, PAGE_ALLOC_COSTLY_ORDER (3)]`.
If Y is smaller than `PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER` (0x8000), it's rounded to the nearest page size. If not, a page of order 3 is allocated, Y -= `PAGE_SIZE << 3`, and so on.
With that, I wrote an `SkbSprayer` object to spray skb objects, with a customizable number of objects and data size.

### Double free

The SLUB allocator only panics on `free(x)` if x is the head of its slab's freelist. So we can spray multiple targets and deallocate some of them before freeing the second time, to make sure x is not the head of its slab's freelist => no panic.

### Why `splice`?

* After being reclaimed by a unix TCP skb, the `next` and `prev` fields of the object point to the `struct sk_buff_head sk_receive_queue` field of `struct sock`, creating a cyclic linked list. When we do a normal `tls_sw_recvmsg`, the skb isn't unlinked from the queue, causing `consume_skb` to traverse the `next` pointer and free an invalid pointer (`sock + 0xd8`). So I looked for places where `ctx->rx_list` is used and, preferably, where we have `__skb_dequeue`. I found 2 places:
    * `tls_sw_splice_read`
    * `tls_sw_read_sock`

I tried the first one and it has what I need:

```C
ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos,
			   struct pipe_inode_info *pipe,
			   size_t len, unsigned int flags)
{
	// [...]
	if (!skb_queue_empty(&ctx->rx_list)) {
		skb = __skb_dequeue(&ctx->rx_list); // skb->next & prev set to NULL
	} else {
		// [...]
	}

	rxm = strp_msg(skb);
	tlm = tls_msg(skb);

	if (tlm->control != TLS_RECORD_TYPE_DATA /* false */) {
		err = -EINVAL;
		goto splice_requeue;
	}

	chunk = min_t(unsigned int, rxm->full_len, len);
	copied = skb_splice_bits(skb, sk, rxm->offset, pipe, chunk, flags);
	if (copied < 0 /* false */)
		goto splice_requeue;

	if (chunk < rxm->full_len /* can set to false */) {
		rxm->offset += len;
		rxm->full_len -= len;
		goto splice_requeue;
	}

	consume_skb(skb); // no more invalid free here.
	// [...]
}
```

Using this, there is no more invalid free in the SLUB allocator.

### Pipe spray weird behavior

During my first n-day analysis, I learned that for pipe spraying, the kernel only allocates an order-0 page for the pipe when we call `write(0x1000)`. However, when developing this exploit some weird things occurred: I do `write(0x1000)`, then I try to `read(0x2000)`; the first 50-ish pipes return 0x2000, and the rest return 0x1000. I would dive into this problem if I had more time. As a quick fix, I just assume the first page is allocated when creating the socket, and only care about the last page when doing the spray.

### Page UAF to ROOT

For this step, I do the same as in the previous n-day: a `signalfd` spray. Except that I use the arbitrary write primitive to write straight into `core_pattern`, instead of just writing `cred`.

### Weakness of this exploit

My exploit relies on 2 facts about the kernel:
- freeing a `struct sk_buff` object doesn't clear its `frags` array.
- double free detection in the SLUB allocator is naive: it only panics when the slot being freed is the **head** of its slab's freelist.

If either of those two is fixed -> the exploit dies.

## Success rate

I ran `calc_AC_rate.py` to test the exploit 100 times, and the result is `50/100`. That can definitely be improved. In most of the failed attempts, I get this backtrace:
```
[ 4.050330] ? die+0x32/0x80
[ 4.050809] ? do_trap+0xd6/0x100
[ 4.051365] ? __slab_free+0x16c/0x380
[ 4.052198] ? do_error_trap+0x6a/0x90
[ 4.052767] ? __slab_free+0x16c/0x380
[ 4.053350] ? exc_invalid_op+0x4c/0x60
[ 4.053916] ? __slab_free+0x16c/0x380
[ 4.055031] ? asm_exc_invalid_op+0x16/0x20
[ 4.055675] ? tcp_rcv_state_process+0x791/0xef0
[ 4.056527] ? __slab_free+0x16c/0x380
[ 4.057097] ? lock_timer_base+0x61/0x80
[ 4.057795] ? tcp_get_metrics+0x142/0x380
[ 4.058560] ? tcp_rcv_state_process+0x791/0xef0
[ 4.059265] kmem_cache_free+0x599/0x5e0
[ 4.059915] tcp_rcv_state_process+0x791/0xef0
[ 4.060918] ? security_sock_rcv_skb+0x31/0x50
[ 4.061857] ? sk_filter_trim_cap+0x11a/0x290
[ 4.062558] tcp_v4_do_rcv+0xcd/0x280
[ 4.063281] tcp_v4_rcv+0xf81/0x1010
[ 4.063832] ? raw_local_deliver+0xcd/0x250
[ 4.064668] ip_protocol_deliver_rcu+0x32/0x320
[ 4.065355] ip_local_deliver_finish+0x7a/0xa0
[ 4.066455] ip_sublist_rcv_finish+0x7e/0x90
[ 4.067231] ip_sublist_rcv+0x1e1/0x220
[ 4.067919] ? __netif_receive_skb_core.constprop.0+0xbf/0x1080
[ 4.068822] ip_list_rcv+0x139/0x170
[ 4.069362] __netif_receive_skb_list_core+0x29d/0x2c0
[ 4.070173] ? __pfx_csum_block_add_ext+0x10/0x10
[ 4.071054] netif_receive_skb_list_internal+0x1e1/0x310
[ 4.072068] napi_complete_done+0x6e/0x1a0
[ 4.072752] virtnet_poll+0x40d/0x5a0
[ 4.073313] __napi_poll+0x28/0x1c0
[ 4.073854] net_rx_action+0x14c/0x2d0
[ 4.074622] ? vp_vring_interrupt+0x73/0x90
[ 4.075746] __do_softirq+0xf6/0x30f
[ 4.076379] __irq_exit_rcu+0x79/0xc0
[ 4.077356] common_interrupt+0xb9/0xd0
```
I think there must be a way, via `setsockopt` or something similar, to disable that.

## Patch bypass

### CVE-2025-39682

Patch commit: [tls: fix handling of zero-length records on the rx_list](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/net/tls?id=62708b9452f8eb77513115b17c4f8d1a22ebf843)

Exploit code: https://github.com/khoatran107/cve-2025-39682

![image](https://hackmd.io/_uploads/BkgxSYo6xl.png)

With a similar exploit approach, I wrote an exploit for CVE-2025-39682. Success rate: `79/100`. I developed this exploit later and changed it a little bit, so it's more stable.

Diff:
```diff
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 51c98a007ddac4..bac65d0d4e3e1e 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1808,6 +1808,9 @@ int decrypt_skb(struct sock *sk, struct scatterlist *sgout)
 	return tls_decrypt_sg(sk, NULL, sgout, &darg);
 }
 
+/* All records returned from a recvmsg() call must have the same type.
+ * 0 is not a valid content type. Use it as "no type reported, yet".
+ */
 static int tls_record_content_type(struct msghdr *msg, struct tls_msg *tlm,
 				   u8 *control)
 {
@@ -2051,8 +2054,10 @@ int tls_sw_recvmsg(struct sock *sk,
 	if (err < 0)
 		goto end;
 
+	/* process_rx_list() will set @control if it processed any records */
 	copied = err;
-	if (len <= copied || (copied && control != TLS_RECORD_TYPE_DATA) || rx_more)
+	if (len <= copied || rx_more ||
+	    (control && control != TLS_RECORD_TYPE_DATA))
 		goto end;
 
 	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
```

The previous patch can be bypassed by making `copied` equal to 0, which means sending a zero-length record. Then the flow goes on like the original bug: `tls_record_content_type` returns 0, causing `anchor` to go into `rx_list`.

On the transmit side, the kTLS wrapper doesn't allow sending records of length 0 => I don't set up kTLS encryption for transmission; instead I encrypt the attack record beforehand and send it as a normal TCP segment.

Of course I don't write my own AES_CCM_128 encryption function in C. I take the server's encrypted segment from [the test for the patch](https://lore.kernel.org/all/20250820021952.143068-2-kuba@kernel.org/).
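The bypass can be seen by putting the two versions of the check from the diff side by side in a small user-space sketch. `stop_first_patch` and `stop_second_patch` are hypothetical names of mine; the conditions are copied from the `-` and `+` lines of the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TLS_RECORD_TYPE_DATA 23

/* Check as it stood before the CVE-2025-39682 fix ('-' line of the diff). */
static bool stop_first_patch(size_t len, size_t copied,
			     unsigned char control, bool rx_more)
{
	return len <= copied ||
	       (copied && control != TLS_RECORD_TYPE_DATA) || rx_more;
}

/* Check after the fix ('+' lines): keyed on control, not on copied. */
static bool stop_second_patch(size_t len, size_t copied,
			      unsigned char control, bool rx_more)
{
	return len <= copied || rx_more ||
	       (control && control != TLS_RECORD_TYPE_DATA);
}
```

For a zero-length non-DATA record taken from `rx_list` (`copied = 0`, `control = 100`), `stop_first_patch()` returns false, so `recvmsg` keeps going as in the original bug, while `stop_second_patch()` returns true.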
## Useful links

* [Linux Kernel TLS Part 1 & 2 - Pumpkin (@u1f383)](https://u1f383.github.io/linux/2025/01/20/linux-kernel-tls-part-1.html)
* [Analysis of CVE-2025-37756, an UAF Vulnerability in Linux KTLS - Pumpkin (@u1f383)](https://u1f383.github.io/linux/2025/09/03/analysis-of-CVE-2025-37756-an-uaf-vulnerability-in-linux-ktls.html)
* [Kernel TLS docs](https://docs.kernel.org/networking/tls.html)
* [tools/testing/selftests/net/tls.c - utilities & example for setting up a TLS connection](https://elixir.bootlin.com/linux/v6.6/source/tools/testing/selftests/net/tls.c)
* [CVE-2024-58239 test of the patch](https://lore.kernel.org/all/018f1633d5471684c65def5fe390de3b15c3d683.1708007371.git.sd@queasysnail.net/)
* [CVE-2025-39682 test of the patch](https://lore.kernel.org/all/20250820021952.143068-2-kuba@kernel.org/)

## Thoughts

This bug really gave me the experience of doing 1-day analysis (I think half of the process felt like 0-day analysis).

Reading the original "primitive" - `merging 2 non-DATA records`, I continuously questioned whether or not I was being trolled. I thought that `exp401` might be someone testing their automatic form submission script, and that the CVE was in fact unexploitable. The thought didn't come from nowhere: the Linux CVE team has been known for giving away CVE numbers to weird, non-dangerous bugs.

I kept the doubt in my head, trying to exploit, while saying to myself "this is a troll". I was so stressed about it. That's what people experience when they do 0-day research on the Linux kernel, right? Having a bug and not knowing if it gives a strong enough primitive to do LPE.

In addition to that, I was given a deadline of one month to do the 1-day analysis. And I dared to choose `net/tls` - a subsystem I had never touched. I didn't even know a thing about TLS, lol.

The moment the crash popped up, I was so relieved. It took weeks to reach that point. And I thought that if it was a normal UAF, 1 week might be enough....

But a week and a half passed, and I realized it was not that easy.

The last week was full of emotions. The pipe spray method I always trusted backfired on me; there was some unusual behavior that took me 2 days to figure out. I finished the exploit code and got root 2 hours before the end of my deadline.