Linux.Seccomp-and-Ptrace
2024-03-04 15:37:44 # CTF

Background

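Recently ACTF featured a very tightly restricted sandbox, and a pwn player on our university team dug up some writeups that bypass seccomp by using ptrace to modify a child process's rax.

Our own school contest was under way and I was swamped with writing challenges, so I didn't think it through at the time.

Still, I remembered that seccomp is a hook inside the kernel, whereas ptrace, going by my impression of debuggers, changes the registers of an attached child from user space. If that were true, ptrace's changes would land before seccomp's check, so I doubted the trick could work.

Once I had time I looked into it. It indeed doesn't really work, just not for the reason I had imagined...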

Intro

Before we start, let me introduce three concepts:

  • seccomp
  • prctl
  • ptrace

Unless stated otherwise, all code quoted in this post comes from linux-6.6.

prctl / seccomp

prctl is a Linux system call for controlling a process.

SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
        unsigned long, arg4, unsigned long, arg5)
{
    struct task_struct *me = current;
    unsigned char comm[sizeof(me->comm)];
    long error;

    error = security_task_prctl(option, arg2, arg3, arg4, arg5);
    if (error != -ENOSYS)
        return error;

    error = 0;
    switch (option) {
    case PR_SET_PDEATHSIG:
        if (!valid_signal(arg2)) {
            error = -EINVAL;
            break;
        }
        me->pdeath_signal = arg2;
        break;
    /* ... several cases omitted ... */
    case PR_GET_SECCOMP:
        error = prctl_get_seccomp();
        break;
    /* ... several cases omitted ... */
    default:
        error = -EINVAL;
        break;
    }
    return error;
}

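Reading the source and the man page, prctl mainly implements two kinds of commands, SET and GET: changing the process's runtime state and querying information about it.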
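To make the SET/GET split concrete, here is a minimal user-space sketch that round-trips the PR_SET_PDEATHSIG case from the quoted source through its GET counterpart, PR_GET_PDEATHSIG. The helper name set_and_get_pdeathsig is my own invention for illustration, not something from the kernel or libc.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/prctl.h>

/* Hypothetical helper: SET the parent-death signal, then GET it back.
 * Returns the value the kernel stored, or -1 if either prctl call failed. */
static int set_and_get_pdeathsig(int sig)
{
    int stored = -1;

    /* Fails with EINVAL when sig does not pass the valid_signal() check. */
    if (prctl(PR_SET_PDEATHSIG, (unsigned long)sig, 0, 0, 0) != 0)
        return -1;
    /* The GET variant writes the current value through the pointer in arg2. */
    if (prctl(PR_GET_PDEATHSIG, &stored, 0, 0, 0) != 0)
        return -1;
    return stored;
}
```

Calling set_and_get_pdeathsig(SIGTERM) should hand SIGTERM back, while an out-of-range signal number takes the error = -EINVAL branch shown in the kernel code.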

seccomp can be enabled through prctl (the dedicated seccomp system call ends up in the same do_seccomp):

case PR_SET_SECCOMP:
    error = prctl_set_seccomp(arg2, (char __user *)arg3);

This leads to the following call chain:

--> prctl
--> prctl_set_seccomp
--> do_seccomp
--> seccomp_set_mode_filter
--> seccomp_attach_filter
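Seen from user space, the entry to this chain is a single prctl call. The following is a minimal sketch, not a production filter: the helper name install_deny_getppid and the choice of getppid as the forbidden call are my own assumptions, and for brevity the filter skips the seccomp_data.arch check that a real filter should do first.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: make getppid fail with EPERM, allow everything else.
 * Returns 0 on success, -1 on failure. */
static int install_deny_getppid(void)
{
    struct sock_filter insns[] = {
        /* A = seccomp_data.nr */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* if A == getppid run the next instruction, else skip it */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };

    /* Required so that an unprivileged process may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    /* prctl -> prctl_set_seccomp -> do_seccomp -> seccomp_set_mode_filter */
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

After a successful call, syscall(SYS_getppid) in the same process should return -1 with errno set to EPERM, while other system calls keep working.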

The core of seccomp_attach_filter is:

filter->prev = current->seccomp.filter;
seccomp_cache_prepare(filter);
current->seccomp.filter = filter;
atomic_inc(&current->seccomp.filter_count);

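current is a macro that yields a pointer to the task_struct of the task that is running right now; that structure holds the task's state.

So registering a seccomp filter really just sets the filter rules on the current process. When are those rules actually applied?

I'll answer that in the syscall analysis below.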
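The four lines of seccomp_attach_filter push the new filter onto a singly linked chain through filter->prev. Below is a toy user-space model of that chain, assuming nothing beyond what the snippet shows plus the documented seccomp rule that every attached filter runs and the most restrictive verdict wins. All toy_* names are made up for illustration and are not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Toy verdicts: a lower value is stricter. */
enum toy_action { TOY_KILL = 0, TOY_ERRNO = 1, TOY_ALLOW = 2 };

struct toy_filter {
    struct toy_filter *prev;            /* older filter, like filter->prev */
    enum toy_action (*run)(int nr);     /* verdict for a syscall number */
};

struct toy_task {
    struct toy_filter *filter;          /* newest filter, like seccomp.filter */
    int filter_count;
};

/* Mirrors the quoted lines: link to the old head, then become the head. */
static void toy_attach(struct toy_task *t, struct toy_filter *f)
{
    assert(t != NULL && f != NULL);
    f->prev = t->filter;
    t->filter = f;
    t->filter_count++;
}

/* Walk the whole chain; the strictest verdict wins. */
static enum toy_action toy_run_filters(const struct toy_task *t, int nr)
{
    enum toy_action verdict = TOY_ALLOW;

    for (const struct toy_filter *f = t->filter; f != NULL; f = f->prev) {
        enum toy_action a = f->run(nr);
        if (a < verdict)
            verdict = a;
    }
    return verdict;
}

/* Two sample filters: one rejects syscall 59, one allows everything. */
static enum toy_action deny_59(int nr) { return nr == 59 ? TOY_ERRNO : TOY_ALLOW; }
static enum toy_action allow_all(int nr) { (void)nr; return TOY_ALLOW; }
```

Attaching allow_all after deny_59 does not lift the restriction, which is why a sandboxed process cannot simply install a second, more permissive filter.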

ptrace

ptrace is a system call for tracing processes.

When ptrace is used with PTRACE_SYSCALL, which is how we usually hijack system calls,

the call chain is:

--> PTRACE_SYSCALL
--> arch_ptrace
--> ptrace_request
--> ptrace_resume
--> set_task_syscall_work

As you can see, it ends in set_task_syscall_work:

#define set_task_syscall_work(t, fl) \
    set_bit(SYSCALL_WORK_BIT_##fl, &task_thread_info(t)->syscall_work)

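This macro uses task_thread_info to get the address of the traced task's thread_info. (While the tracee itself is running, current refers to its task, but at this point it is the tracer that is running, so the address has to be obtained through task_thread_info.)

With that address in hand, it sets a flag bit in syscall_work, here SYSCALL_WORK_BIT_SYSCALL_TRACE.

In other words, ptrace's PTRACE_SYSCALL and prctl's PR_SET_SECCOMP both merely record something on the task; the real work is deferred until the task enters a system call.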
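From the tracer's side, all of this is driven by repeated PTRACE_SYSCALL requests. Here is a minimal sketch that traces a forked child and counts its syscall-stops. The helper name count_syscall_stops and its calls parameter are hypothetical, and I use PTRACE_O_TRACESYSGOOD so that syscall-stops show up as SIGTRAP | 0x80 and can be told apart from other stops.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that issues `calls` getppid syscalls,
 * trace it with PTRACE_SYSCALL and return the number of syscall-stops seen.
 * Returns -1 on error. */
static int count_syscall_stops(int calls)
{
    int status, stops = 0;
    pid_t pid;

    assert(calls >= 0);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) != 0)
            _exit(127);
        raise(SIGSTOP);                 /* hand control to the tracer */
        for (int i = 0; i < calls; i++)
            syscall(SYS_getppid);
        _exit(0);
    }

    if (waitpid(pid, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1;
    if (ptrace(PTRACE_SETOPTIONS, pid, NULL,
               (void *)(long)PTRACE_O_TRACESYSGOOD) < 0)
        goto fail;

    for (;;) {
        /* Each request marks the child to stop at the next syscall boundary. */
        if (ptrace(PTRACE_SYSCALL, pid, NULL, NULL) < 0)
            goto fail;
        if (waitpid(pid, &status, 0) < 0)
            return -1;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            return stops;
        if (WSTOPSIG(status) == (SIGTRAP | 0x80))
            stops++;
    }
fail:
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return -1;
}
```

Every traced system call is reported twice, once on entry and once on exit, so count_syscall_stops(3) should see at least six stops.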

syscall

So how does the syscall path handle seccomp and ptrace?

It goes through the following call chain:

--> entry_SYSCALL_64
--> do_syscall_64
--> syscall_enter_from_user_mode
--> __syscall_enter_from_user_work
--> syscall_trace_enter

syscall_trace_enter looks like this:

static long syscall_trace_enter(struct pt_regs *regs, long syscall,
                unsigned long work)
{
    long ret = 0;

    /*
     * Handle Syscall User Dispatch. This must comes first, since
     * the ABI here can be something that doesn't make sense for
     * other syscall_work features.
     */
    if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) {
        if (syscall_user_dispatch(regs))
            return -1L;
    }

    /* Handle ptrace */
    if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
        ret = ptrace_report_syscall_entry(regs);
        if (ret || (work & SYSCALL_WORK_SYSCALL_EMU))
            return -1L;
    }

    /* Do seccomp after ptrace, to catch any tracer changes. */
    if (work & SYSCALL_WORK_SECCOMP) {
        ret = __secure_computing(NULL);
        if (ret == -1L)
            return ret;
    }

    /* Either of the above might have changed the syscall number */
    syscall = syscall_get_nr(current, regs);

    if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT))
        trace_sys_enter(regs, syscall);

    syscall_enter_audit(regs, syscall);

    return ret ? : syscall;
}

Here work is obtained from READ_ONCE(current_thread_info()->syscall_work):

static __always_inline long
__syscall_enter_from_user_work(struct pt_regs *regs, long syscall)
{
    unsigned long work = READ_ONCE(current_thread_info()->syscall_work);

    if (work & SYSCALL_WORK_ENTER)
        syscall = syscall_trace_enter(regs, syscall, work);

    return syscall;
}

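From the analysis above, PTRACE_SYSCALL ultimately just sets a bit in syscall_work (SYSCALL_WORK_BIT_SYSCALL_TRACE).

So the checks here, as the comments say, are where the seccomp and ptrace state we saw earlier actually gets handled.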
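The flag mechanics are simple enough to model in a few lines. This is a toy sketch under my own assumptions: the toy_* names and bit numbers are made up (the real bit numbers live in the kernel headers), and the plain |= stands in for the kernel's atomic set_bit.

```c
#include <assert.h>

/* Toy bit numbers, chosen for illustration only. */
enum { TOY_BIT_SECCOMP, TOY_BIT_SYSCALL_TRACE };

#define TOY_WORK_SECCOMP        (1UL << TOY_BIT_SECCOMP)
#define TOY_WORK_SYSCALL_TRACE  (1UL << TOY_BIT_SYSCALL_TRACE)
#define TOY_WORK_ENTER          (TOY_WORK_SECCOMP | TOY_WORK_SYSCALL_TRACE)

struct toy_thread_info {
    unsigned long syscall_work;
};

/* What PTRACE_SYSCALL and PR_SET_SECCOMP boil down to: set one bit. */
static void toy_set_work(struct toy_thread_info *ti, int bit)
{
    assert(ti != 0);
    ti->syscall_work |= 1UL << bit;     /* the kernel uses atomic set_bit() */
}

/* Syscall entry: the slow path runs only if some entry work is pending. */
static int toy_needs_slow_path(const struct toy_thread_info *ti)
{
    return (ti->syscall_work & TOY_WORK_ENTER) != 0;
}
```

A task with no bits set skips the slow path entirely; once either the tracer or seccomp has set its bit, every system call of that task goes through the checks.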

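Now look at ptrace_report_syscall, the function that actually handles PTRACE_SYSCALL.

It sends SIGTRAP through ptrace_notify, which stops the current process until the tracer has dealt with it.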

/*
 * ptrace report for syscall entry and exit looks identical.
 */
static inline int ptrace_report_syscall(unsigned long message)
{
    int ptrace = current->ptrace;
    int signr;

    if (!(ptrace & PT_PTRACED))
        return 0;

    signr = ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0),
                  message);

    /*
     * this isn't the same as continuing with a signal, but it will do
     * for normal use. strace only continues with a signal if the
     * stopping signal is not SIGTRAP. -brl
     */
    if (signr)
        send_sig(signr, current, 1);

    return fatal_signal_pending(current);
}

/**
 * ptrace_report_syscall_entry - task is about to attempt a system call
 * @regs: user register state of current task
 *
 * This will be called if %SYSCALL_WORK_SYSCALL_TRACE or
 * %SYSCALL_WORK_SYSCALL_EMU have been set, when the current task has just
 * entered the kernel for a system call. Full user register state is
 * available here. Changing the values in @regs can affect the system
 * call number and arguments to be tried. It is safe to block here,
 * preventing the system call from beginning.
 *
 * Returns zero normally, or nonzero if the calling arch code should abort
 * the system call. That must prevent normal entry so no system call is
 * made. If @task ever returns to user mode after this, its register state
 * is unspecified, but should be something harmless like an %ENOSYS error
 * return. It should preserve enough information so that syscall_rollback()
 * can work (see asm-generic/syscall.h).
 *
 * Called without locks, just after entering kernel mode.
 */
static inline __must_check int ptrace_report_syscall_entry(
    struct pt_regs *regs)
{
    return ptrace_report_syscall(PTRACE_EVENTMSG_SYSCALL_ENTRY);
}

As the comment says, any register changes a tracer makes after intercepting a system call through ptrace happen at this point:

This will be called if %SYSCALL_WORK_SYSCALL_TRACE or
%SYSCALL_WORK_SYSCALL_EMU have been set, when the current task has just
entered the kernel for a system call. Full user register state is
available here. Changing the values in @regs can affect the system
call number and arguments to be tried. It is safe to block here,
preventing the system call from beginning.

And this step comes before seccomp, so even if the tracer intercepts the system call and rewrites the syscall number, seccomp still checks the rewritten call.
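This can be checked from user space. Below is a rough sketch, assuming x86_64 (it relies on orig_rax in struct user_regs_struct) and a kernel that runs seccomp after ptrace. The child forbids getppid with seccomp and then calls getpid; the tracer rewrites that getpid into getppid at the syscall-entry stop. The function names and the getpid/getppid pairing are my own choices for the demonstration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child: exits 0 if its getpid call was rejected with EPERM, 1 if the call
 * went through, 2 if the setup itself failed. */
static void filtered_child(void)
{
    struct sock_filter insns[] = {
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_getppid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = { .len = 4, .filter = insns };

    if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) != 0)
        _exit(2);
    raise(SIGSTOP);                         /* let the tracer take over */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0 ||
        prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0)
        _exit(2);
    errno = 0;
    long r = syscall(SYS_getpid);           /* the tracer rewrites this call */
    _exit(r == -1 && errno == EPERM ? 0 : 1);
}

/* Tracer: returns the child's exit status, or -1 on error. */
static int rewrite_getpid_to_getppid(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0)
        filtered_child();
    if (waitpid(pid, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1;

    for (;;) {
        struct user_regs_struct regs;

        if (ptrace(PTRACE_SYSCALL, pid, NULL, NULL) < 0 ||
            waitpid(pid, &status, 0) < 0)
            break;
        if (WIFEXITED(status))
            return WEXITSTATUS(status);
        if (WIFSIGNALED(status))
            return -1;
        if (ptrace(PTRACE_GETREGS, pid, NULL, &regs) < 0)
            break;
        /* Entry stop of getpid: swap in the forbidden syscall number. */
        if (regs.orig_rax == SYS_getpid) {
            regs.orig_rax = SYS_getppid;
            if (ptrace(PTRACE_SETREGS, pid, NULL, &regs) < 0)
                break;
        }
    }
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return -1;
}
```

On a kernel with the ordering shown above I would expect rewrite_getpid_to_getppid() to return 0: the filter sees the rewritten number and rejects it. A return value of 1 would mean the rewritten call slipped past the filter.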

So why are there writeups online that do exactly this?

Here is the code from linux-4.7:

/*
 * We can return 0 to resume the syscall or anything else to go to phase
 * 2. If we resume the syscall, we need to put something appropriate in
 * regs->orig_ax.
 *
 * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
 * are fully functional.
 *
 * For phase 2's benefit, our return value is:
 * 0: resume the syscall
 * 1: go to phase 2; no seccomp phase 2 needed
 * anything else: go to phase 2; pass return value to seccomp
 */
unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
{
    struct thread_info *ti = pt_regs_to_thread_info(regs);
    unsigned long ret = 0;
    u32 work;

    if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
        BUG_ON(regs != task_pt_regs(current));

    work = ACCESS_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY;

#ifdef CONFIG_SECCOMP
    /*
     * Do seccomp first -- it should minimize exposure of other
     * code, and keeping seccomp fast is probably more valuable
     * than the rest of this.
     */
    if (work & _TIF_SECCOMP) {
        struct seccomp_data sd;

        sd.arch = arch;
        sd.nr = regs->orig_ax;
        sd.instruction_pointer = regs->ip;
#ifdef CONFIG_X86_64
        if (arch == AUDIT_ARCH_X86_64) {
            sd.args[0] = regs->di;
            sd.args[1] = regs->si;
            sd.args[2] = regs->dx;
            sd.args[3] = regs->r10;
            sd.args[4] = regs->r8;
            sd.args[5] = regs->r9;
        } else
#endif
        {
            sd.args[0] = regs->bx;
            sd.args[1] = regs->cx;
            sd.args[2] = regs->dx;
            sd.args[3] = regs->si;
            sd.args[4] = regs->di;
            sd.args[5] = regs->bp;
        }

        BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
        BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);

        ret = seccomp_phase1(&sd);
        if (ret == SECCOMP_PHASE1_SKIP) {
            regs->orig_ax = -1;
            ret = 0;
        } else if (ret != SECCOMP_PHASE1_OK) {
            return ret; /* Go directly to phase 2 */
        }

        work &= ~_TIF_SECCOMP;
    }
#endif

    /* Do our best to finish without phase 2. */
    if (work == 0)
        return ret; /* seccomp and/or nohz only (ret == 0 here) */

#ifdef CONFIG_AUDITSYSCALL
    if (work == _TIF_SYSCALL_AUDIT) {
        /*
         * If there is no more work to be done except auditing,
         * then audit in phase 1. Phase 2 always audits, so, if
         * we audit here, then we can't go on to phase 2.
         */
        do_audit_syscall_entry(regs, arch);
        return 0;
    }
#endif

    return 1; /* Something is enabled that we can't handle in phase 1 */
}

/* Returns the syscall nr to run (which should match regs->orig_ax). */
long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
                unsigned long phase1_result)
{
    struct thread_info *ti = pt_regs_to_thread_info(regs);
    long ret = 0;
    u32 work = ACCESS_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY;

    if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
        BUG_ON(regs != task_pt_regs(current));

#ifdef CONFIG_SECCOMP
    /*
     * Call seccomp_phase2 before running the other hooks so that
     * they can see any changes made by a seccomp tracer.
     */
    if (phase1_result > 1 && seccomp_phase2(phase1_result)) {
        /* seccomp failures shouldn't expose any additional code. */
        return -1;
    }
#endif

    if (unlikely(work & _TIF_SYSCALL_EMU))
        ret = -1L;

    if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
        tracehook_report_syscall_entry(regs))
        ret = -1L;

    if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
        trace_sys_enter(regs, regs->orig_ax);

    do_audit_syscall_entry(regs, arch);

    return ret ?: regs->orig_ax;
}

seccomp is handled in syscall_trace_enter_phase1, while ptrace is handled by tracehook_report_syscall_entry in syscall_trace_enter_phase2.

So here seccomp's filtering runs before ptrace.

That means on kernels older than 4.8, this attack does work.
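The whole difference comes down to the order of two steps. Here is a toy model of both orderings, a sketch under my own simplifications: the toy filter forbids one syscall number, the toy tracer rewrites one number into another, and all names (toy_enter_old, toy_enter_new and so on) are hypothetical.

```c
#include <assert.h>

#define TOY_NR_READ    0
#define TOY_NR_EXECVE  59       /* the call the toy filter forbids */

/* Toy seccomp filter: nonzero means the syscall is allowed. */
static int toy_seccomp_allows(long nr)
{
    return nr != TOY_NR_EXECVE;
}

/* Toy tracer: at the entry stop, rewrite one syscall number into another. */
struct toy_tracer {
    long from;
    long to;
};

static long toy_ptrace_stop(const struct toy_tracer *t, long nr)
{
    assert(t != 0);
    return nr == t->from ? t->to : nr;
}

/* 4.8 and later: ptrace first, then seccomp checks what the tracer left. */
static long toy_enter_new(const struct toy_tracer *t, long nr)
{
    nr = toy_ptrace_stop(t, nr);
    return toy_seccomp_allows(nr) ? nr : -1;
}

/* Before 4.8: seccomp checks the original number, then ptrace may change it. */
static long toy_enter_old(const struct toy_tracer *t, long nr)
{
    if (!toy_seccomp_allows(nr))
        return -1;
    return toy_ptrace_stop(t, nr);
}
```

With a tracer that turns read into execve, toy_enter_old lets the forbidden execve through, while toy_enter_new rejects it with -1.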

Tricks

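So is ptrace completely useless for escaping the sandbox? Not quite.

After talking with @cnitlrt, I learned a rather cheeky trick.

Connect twice with nc, which gives you two processes. If the first one can use ptrace to stop the second one's prctl call before it takes effect and turn it into some unrelated call, the sandbox is bypassed.

There are three problems here:

First, how to get the second process's PID: in an environment as clean as a CTF box, you can assume the two PIDs are close together, so adding 1 or 2 to the current PID is enough.

Second, how to attach to the second process in the window before it installs seccomp: poll PTRACE_ATTACH from the first process until it succeeds and returns 0. This can still lose the race.

Third, and what ultimately limits this trick: as we all know, ptrace is normally only allowed to attach to a process's own descendants unless /proc/sys/kernel/yama/ptrace_scope is set to 0. On a personal machine this option is often 0 for the convenience of gdb and other debuggers. However, when I casually started an Ubuntu docker container and took a look: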

$ cat /proc/sys/kernel/yama/ptrace_scope
1

Ah. Well then, never mind.
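For completeness, here is a rough sketch of the two checks the trick depends on: reading ptrace_scope and polling PTRACE_ATTACH on a guessed PID. Both helper names, the retry count and the delay are my own assumptions. For reference, Yama documents scope 0 as classic same-user attach, 1 as descendants only, 2 as admin only and 3 as no attach at all.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns Yama's ptrace_scope value, or -1 if the file cannot be read
 * (for example when Yama is not enabled). */
static int read_ptrace_scope(void)
{
    int scope = -1;
    FILE *f = fopen("/proc/sys/kernel/yama/ptrace_scope", "r");

    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &scope) != 1)
        scope = -1;
    fclose(f);
    return scope;
}

/* Hypothetical helper: poll PTRACE_ATTACH on a guessed PID.
 * Returns 0 once attached and stopped, -1 if every attempt failed. */
static int attach_with_retry(pid_t target, int attempts, useconds_t delay_us)
{
    assert(attempts > 0);
    for (int i = 0; i < attempts; i++) {
        /* PTRACE_ATTACH returns 0 on success, -1 with errno set otherwise. */
        if (ptrace(PTRACE_ATTACH, target, NULL, NULL) == 0) {
            int status;

            /* Attaching sends SIGSTOP; wait until the tracee has stopped. */
            if (waitpid(target, &status, 0) == target && WIFSTOPPED(status))
                return 0;
            return -1;
        }
        usleep(delay_us);
    }
    return -1;
}
```

In the scenario above the call would be something like attach_with_retry(getpid() + 1, 1000, 100), and it only has a chance when read_ptrace_scope() reports 0.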