Linux.Seccomp-and-Ptrace

2024-03-04 15:37:44 # CTF #Pwn #linux

Background

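A recent ACTF challenge shipped a very strict sandbox, and one of the pwn players on our school team dug up some writeups that bypass seccomp by using ptrace to change the child process's rax.

Our own school competition was coming up and I was swamped with writing challenges, so I did not think it through at the time.

Still, I remembered that seccomp is a hook inside the kernel, while ptrace, going by my impression of debuggers, changes the registers of an attached child ~~in user space~~. That would put the ptrace handling before seccomp, so the bypass did not look feasible to me.

Once I had time to dig in, it turned out that it really is not feasible, just not for the reason I had imagined…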

Intro

Before we start, here are the three things involved:

seccomp
prctl
ptrace

Unless otherwise noted, all kernel code quoted in this post comes from linux-6.6.

prctl / seccomp

prctl is a Linux system call for manipulating a process.

SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
		unsigned long, arg4, unsigned long, arg5)
{
	struct task_struct *me = current;
	unsigned char comm[sizeof(me->comm)];
	long error;

	error = security_task_prctl(option, arg2, arg3, arg4, arg5);
	if (error != -ENOSYS)
		return error;

	error = 0;
	switch (option) {
	case PR_SET_PDEATHSIG:
		if (!valid_signal(arg2)) {
			error = -EINVAL;
			break;
		}
		me->pdeath_signal = arg2;
		break;
	/*
	.............
	(several cases omitted)
	.............
	*/
	case PR_GET_SECCOMP:
		error = prctl_get_seccomp();
		break;
	/*
	.............
	(several cases omitted)
	.............
	*/
	default:
		error = -EINVAL;
		break;
	}
	return error;
}

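Reading the source and the man page, prctl mostly implements two kinds of options, SET and GET: one kind changes the runtime state of the calling process, the other reads information about it.

seccomp filters are installed through prctl (or through the dedicated seccomp(2) system call, which reaches the same do_seccomp).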

	case PR_SET_SECCOMP:
		error = prctl_set_seccomp(arg2, (char __user *)arg3);

This goes through the following call chain:

-->prctl 
	-->prctl_set_seccomp
		-->do_seccomp
			-->seccomp_set_mode_filter
				--> seccomp_attach_filter

The core of seccomp_attach_filter is:

filter->prev = current->seccomp.filter;
seccomp_cache_prepare(filter);
current->seccomp.filter = filter;
atomic_inc(&current->seccomp.filter_count);

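current is a macro that evaluates to the task_struct of the task running on the current CPU; that structure holds the state of the process.

So registering a seccomp filter really just stores the filter rules on the current process. When are those rules actually applied?

The answer comes in the syscall section.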

ptrace

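ptrace is a system call for tracing another process.

When a tracer issues PTRACE_SYSCALL, the request usually used to hijack system calls, ptrace goes through the following call chain: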

-->PTRACE_SYSCALL 
	-->arch_ptrace
		-->ptrace_request
			--> ptrace_resume
				-->set_task_syscall_work

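The chain ends in the set_task_syscall_work macro:

#define set_task_syscall_work(t, fl) \
	set_bit(SYSCALL_WORK_BIT_##fl, &task_thread_info(t)->syscall_work)

The macro uses task_thread_info to get the address of the traced process's thread_info. (While the traced process is running, current leads to the same structure, but at this point it is the tracer that is running, so the address has to be looked up through task_thread_info.)

With that address in hand it sets one of the SYSCALL_WORK_BIT_* flags, which for PTRACE_SYSCALL is SYSCALL_TRACE.

In other words, both ptrace with PTRACE_SYSCALL and prctl with PR_SET_SECCOMP only record some state on the process; the real work is deferred until the process makes a system call.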

syscall

So how does the syscall path handle seccomp and ptrace?

It goes through the following call chain:

-->entry_SYSCALL_64
	-->do_syscall_64
		-->syscall_enter_from_user_mode
			-->__syscall_enter_from_user_work
				-->syscall_trace_enter

The code of syscall_trace_enter:

static long syscall_trace_enter(struct pt_regs *regs, long syscall,
				unsigned long work)
{
	long ret = 0;

	/*
	 * Handle Syscall User Dispatch.  This must comes first, since
	 * the ABI here can be something that doesn't make sense for
	 * other syscall_work features.
	 */
	if (work & SYSCALL_WORK_SYSCALL_USER_DISPATCH) {
		if (syscall_user_dispatch(regs))
			return -1L;
	}

	/* Handle ptrace */
	if (work & (SYSCALL_WORK_SYSCALL_TRACE | SYSCALL_WORK_SYSCALL_EMU)) {
		ret = ptrace_report_syscall_entry(regs);
		if (ret || (work & SYSCALL_WORK_SYSCALL_EMU))
			return -1L;
	}

	/* Do seccomp after ptrace, to catch any tracer changes. */
	if (work & SYSCALL_WORK_SECCOMP) {
		ret = __secure_computing(NULL);
		if (ret == -1L)
			return ret;
	}

	/* Either of the above might have changed the syscall number */
	syscall = syscall_get_nr(current, regs);

	if (unlikely(work & SYSCALL_WORK_SYSCALL_TRACEPOINT))
		trace_sys_enter(regs, syscall);

	syscall_enter_audit(regs, syscall);

	return ret ? : syscall;
}

work comes from READ_ONCE(current_thread_info()->syscall_work):

static __always_inline long
__syscall_enter_from_user_work(struct pt_regs *regs, long syscall)
{
	unsigned long work = READ_ONCE(current_thread_info()->syscall_work);

	if (work & SYSCALL_WORK_ENTER)
		syscall = syscall_trace_enter(regs, syscall, work);

	return syscall;
}

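From the analysis above, PTRACE_SYSCALL ultimately just sets a SYSCALL_WORK_BIT_* flag.

So the checks here are, as the comments say, where the seccomp and ptrace state we saw earlier is actually handled.

Now look at ptrace_report_syscall, the function that does the real work for PTRACE_SYSCALL.

It raises SIGTRAP through ptrace_notify, which stops the current process until the tracer has dealt with it.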

/*
 * ptrace report for syscall entry and exit looks identical.
 */
static inline int ptrace_report_syscall(unsigned long message)
{
	int ptrace = current->ptrace;
	int signr;

	if (!(ptrace & PT_PTRACED))
		return 0;

	signr = ptrace_notify(SIGTRAP | ((ptrace & PT_TRACESYSGOOD) ? 0x80 : 0),
			      message);

	/*
	 * this isn't the same as continuing with a signal, but it will do
	 * for normal use.  strace only continues with a signal if the
	 * stopping signal is not SIGTRAP.  -brl
	 */
	if (signr)
		send_sig(signr, current, 1);

	return fatal_signal_pending(current);
}

/**
 * ptrace_report_syscall_entry - task is about to attempt a system call
 * @regs:		user register state of current task
 *
 * This will be called if %SYSCALL_WORK_SYSCALL_TRACE or
 * %SYSCALL_WORK_SYSCALL_EMU have been set, when the current task has just
 * entered the kernel for a system call.  Full user register state is
 * available here.  Changing the values in @regs can affect the system
 * call number and arguments to be tried.  It is safe to block here,
 * preventing the system call from beginning.
 *
 * Returns zero normally, or nonzero if the calling arch code should abort
 * the system call.  That must prevent normal entry so no system call is
 * made.  If @task ever returns to user mode after this, its register state
 * is unspecified, but should be something harmless like an %ENOSYS error
 * return.  It should preserve enough information so that syscall_rollback()
 * can work (see asm-generic/syscall.h).
 *
 * Called without locks, just after entering kernel mode.
 */
static inline __must_check int ptrace_report_syscall_entry(
	struct pt_regs *regs)
{
	return ptrace_report_syscall(PTRACE_EVENTMSG_SYSCALL_ENTRY);
}

As the comment says, any register changes a tracer makes after intercepting a system call with ptrace happen at this point.

This will be called if %SYSCALL_WORK_SYSCALL_TRACE or
%SYSCALL_WORK_SYSCALL_EMU have been set, when the current task has just
entered the kernel for a system call. Full user register state is
available here. Changing the values in @regs can affect the system
call number and arguments to be tried. It is safe to block here,
preventing the system call from beginning.

This step runs before seccomp, so even if the tracer intercepts the call and rewrites the syscall number, seccomp still checks the rewritten call.

So why are there writeups online that do exactly this?

Here is the code from linux-4.7:

/*
 * We can return 0 to resume the syscall or anything else to go to phase
 * 2.  If we resume the syscall, we need to put something appropriate in
 * regs->orig_ax.
 *
 * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
 * are fully functional.
 *
 * For phase 2's benefit, our return value is:
 * 0:			resume the syscall
 * 1:			go to phase 2; no seccomp phase 2 needed
 * anything else:	go to phase 2; pass return value to seccomp
 */
unsigned long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch)
{
	struct thread_info *ti = pt_regs_to_thread_info(regs);
	unsigned long ret = 0;
	u32 work;

	if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
		BUG_ON(regs != task_pt_regs(current));

	work = ACCESS_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY;

#ifdef CONFIG_SECCOMP
	/*
	 * Do seccomp first -- it should minimize exposure of other
	 * code, and keeping seccomp fast is probably more valuable
	 * than the rest of this.
	 */
	if (work & _TIF_SECCOMP) {
		struct seccomp_data sd;

		sd.arch = arch;
		sd.nr = regs->orig_ax;
		sd.instruction_pointer = regs->ip;
#ifdef CONFIG_X86_64
		if (arch == AUDIT_ARCH_X86_64) {
			sd.args[0] = regs->di;
			sd.args[1] = regs->si;
			sd.args[2] = regs->dx;
			sd.args[3] = regs->r10;
			sd.args[4] = regs->r8;
			sd.args[5] = regs->r9;
		} else
#endif
		{
			sd.args[0] = regs->bx;
			sd.args[1] = regs->cx;
			sd.args[2] = regs->dx;
			sd.args[3] = regs->si;
			sd.args[4] = regs->di;
			sd.args[5] = regs->bp;
		}

		BUILD_BUG_ON(SECCOMP_PHASE1_OK != 0);
		BUILD_BUG_ON(SECCOMP_PHASE1_SKIP != 1);

		ret = seccomp_phase1(&sd);
		if (ret == SECCOMP_PHASE1_SKIP) {
			regs->orig_ax = -1;
			ret = 0;
		} else if (ret != SECCOMP_PHASE1_OK) {
			return ret;  /* Go directly to phase 2 */
		}

		work &= ~_TIF_SECCOMP;
	}
#endif

	/* Do our best to finish without phase 2. */
	if (work == 0)
		return ret;  /* seccomp and/or nohz only (ret == 0 here) */

#ifdef CONFIG_AUDITSYSCALL
	if (work == _TIF_SYSCALL_AUDIT) {
		/*
		 * If there is no more work to be done except auditing,
		 * then audit in phase 1.  Phase 2 always audits, so, if
		 * we audit here, then we can't go on to phase 2.
		 */
		do_audit_syscall_entry(regs, arch);
		return 0;
	}
#endif

	return 1;  /* Something is enabled that we can't handle in phase 1 */
}

/* Returns the syscall nr to run (which should match regs->orig_ax). */
long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
				unsigned long phase1_result)
{
	struct thread_info *ti = pt_regs_to_thread_info(regs);
	long ret = 0;
	u32 work = ACCESS_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY;

	if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
		BUG_ON(regs != task_pt_regs(current));

#ifdef CONFIG_SECCOMP
	/*
	 * Call seccomp_phase2 before running the other hooks so that
	 * they can see any changes made by a seccomp tracer.
	 */
	if (phase1_result > 1 && seccomp_phase2(phase1_result)) {
		/* seccomp failures shouldn't expose any additional code. */
		return -1;
	}
#endif

	if (unlikely(work & _TIF_SYSCALL_EMU))
		ret = -1L;

	if ((ret || test_thread_flag(TIF_SYSCALL_TRACE)) &&
	    tracehook_report_syscall_entry(regs))
		ret = -1L;

	if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
		trace_sys_enter(regs, regs->orig_ax);

	do_audit_syscall_entry(regs, arch);

	return ret ?: regs->orig_ax;
}

seccomp is handled in syscall_trace_enter_phase1, while tracehook_report_syscall_entry, which handles ptrace, is called in syscall_trace_enter_phase2.

That is, the seccomp filter runs before ptrace.

So on kernels older than 4.8 this attack does work.

Tricks

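So is ptrace completely useless for escaping a sandbox? Not quite.

After talking with @cnitlrt, I learned a rather sneaky trick.

Connect with nc twice to get two processes. If the first one can stop the second with ptrace before it installs the sandbox, intercept its prctl call and turn it into some unrelated system call, the sandbox is bypassed.

There are three problems:

First, how to get the PID of the second process: in an environment as clean as a CTF box, the two PIDs can be assumed to be close, so adding 1 or 2 to the current PID is enough.

Second, how to attach to the second process in the window before it runs seccomp: one process can poll PTRACE_ATTACH until it succeeds (returns 0). The race can still be lost.

Third, and this is what ultimately limits the trick: as we all know, under Yama's restricted mode ptrace can only attach to the caller's own descendants unless /proc/sys/kernel/yama/ptrace_scope is set to 0. On personal machines the option is often 0 to keep gdb and other debuggers convenient. However, when I spun up a random Ubuntu docker container and checked: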

$ cat proc/sys/kernel/yama/ptrace_scope
1

Oh. Well, never mind then.
