Recently I was asked to write an “OS device driver”, a.k.a. a Linux kernel driver, or, as I came to find out, really just a kernel module, since there was no hardware involved. After locking up my system completely twice (glad it runs a journaling filesystem) and holding down the hardware “reset” button for a few seconds each time, I took a sufficiently sharp, albeit dangerous, stab at the world of kernel programming. I had suspected as much, but a segfault in kernel-land is a major faux pas, as is killing a user-space helper process that the kernel was supposed to kill. Of course, it is an understatement to say that a few guides have already been written on these topics. Here are some of the requirements I explored in my project:
- intercept calls to kernel routines (mainly C system calls) from userspace in near real-time
- time how fast said calling processes would then exit after the kernel zapped them with a friendly signal
- do the aforementioned without creating feedback loops or locking up the kernel
- manage the nuances of intercepting kernel routines (can this be done for every process? every call? should it?)
Initially, before the requirements were ready, I wrote a quick shell script to discover the target process, send it a SIGTERM, and time the exit. As it turned out, it was probably too slow to catch syscalls that completed quickly. There may have been other ways to use strace to fulfill the requirements, but I ended up taking a different approach.
    #!/usr/bin/env bash
    which strace >/dev/null || { echo 'need `strace` to run. Quitting.'; exit; }
    p=`pidof java | tail -n1`
    [ -n "$p" ] || { echo 'please start the java process before running this script'; exit; }
    echo 'Sending SIGTERM to Java process '$p' on 1st "write()" syscall'
    strace -e write -fp $p 2>&1 | read
    kill -SIGTERM $p
    t=`( time {
            while kill -0 $p 2>/dev/null; do
                sleep 0.000001
            done; } ) 2>&1 \
        | awk '/real/ { print $NF }'`
    echo "Time to terminate: ${t}"
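For comparison, the polling half of that script (`kill -0` in a loop) can be sketched in C. This is only a minimal sketch, not code from the project: `process_exists` and `wait_for_exit` are hypothetical helper names, and the poll count and interval are arbitrary assumptions. It relies on the fact that `kill(pid, 0)` delivers no signal and only reports whether the PID could be signalled.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>   /* getpid(), for callers that probe themselves */

/* Hypothetical helper: 1 if a process with this PID exists, 0 otherwise.
 * EPERM means the process exists but we are not allowed to signal it. */
int process_exists(pid_t pid)
{
    assert(pid > 0);   /* pid <= 0 would address process groups instead */
    return kill(pid, 0) == 0 || errno == EPERM;
}

/* Hypothetical helper: polls up to max_polls times, sleeping poll_us
 * microseconds between polls. Returns the number of polls that found the
 * process still alive before it disappeared, or -1 if it never did. */
int wait_for_exit(pid_t pid, int max_polls, long poll_us)
{
    const struct timespec nap = {
        .tv_sec  = poll_us / 1000000L,
        .tv_nsec = (poll_us % 1000000L) * 1000L,
    };

    for (int polls = 0; polls < max_polls; polls++) {
        if (!process_exists(pid))
            return polls;
        nanosleep(&nap, NULL);
    }
    return -1;
}
```

A caller would send the signal first, for example `kill(pid, SIGTERM)`, and then call `wait_for_exit(pid, 100000, 100)`. One caveat with this kind of polling: a zombie child still counts as existing until its parent reaps it.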
I ended up writing an application that is part kernel-space, part user-space. Minimal time is spent in kernel-space: just enough to intercept the syscall and send a message of sorts (a real-time signal) to the user-space application, which handles the friendly zapping of the lucky processes being monitored (with a non-real-time SIGTERM or SIGKILL signal). One benefit of this approach is that real-time signals are queued: several instances of the same signal can be pending at once, each carrying its own payload, rather than being merged into one as standard signals are. This allows the userspace side to branch off worker threads or fork processes to handle the requests in a less sensitive context.
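The queueing behaviour can be tried entirely in userspace. The sketch below is an illustration under stated assumptions, not part of the project: `sigqueue()` stands in for the kernel-side sender, `SIG_DEMO` and `queue_and_drain` are hypothetical names, and the signal number is an arbitrary pick from the real-time range. It blocks the signal, queues several integer payloads to its own process, then drains them with `sigtimedwait()`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

#define SIG_DEMO (SIGRTMIN + 10)   /* hypothetical choice of real-time signal */

/* Queues n payloads to this process while SIG_DEMO is blocked, then drains
 * them without waiting. Returns how many were received (or -1 if the signal
 * could not be blocked); the received values are stored in out. */
int queue_and_drain(const int *payloads, int n, int *out)
{
    const struct timespec no_wait = { .tv_sec = 0, .tv_nsec = 0 };
    sigset_t set, old;
    int got = 0;

    assert(payloads != NULL && out != NULL && n >= 0);

    sigemptyset(&set);
    sigaddset(&set, SIG_DEMO);
    if (sigprocmask(SIG_BLOCK, &set, &old) != 0)
        return -1;

    /* Each sigqueue() call adds one pending instance with its own payload. */
    for (int i = 0; i < n; i++) {
        union sigval v;
        v.sival_int = payloads[i];
        if (sigqueue(getpid(), SIG_DEMO, v) != 0)
            break;
    }

    /* Pending instances of one real-time signal come back in queue order. */
    while (got < n) {
        siginfo_t si;
        if (sigtimedwait(&set, &si, &no_wait) != SIG_DEMO)
            break;
        out[got++] = si.si_value.sival_int;
    }

    sigprocmask(SIG_SETMASK, &old, NULL);
    return got;
}
```

Queueing the payloads 101, 202 and 303 and draining them should hand back all three in that order, which is the property the kernel-to-userspace channel depends on: a second intercepted call does not overwrite the first one's PID while the helper is busy.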
The results of the subsequent “time-to-exit” timings, as well as other output, are logged using syslog. Don’t ask me what the ultimate use case for this application is; I also thought it was a bit strange when the client asked me to work on it. In any case, it was a great learning experience, and I enjoyed the opportunity to learn about an aspect of Linux programming that had, until then, intimidated me. I came to the conclusion that, for a beginner, writing a module is one of the best ways to cut your teeth on kernel programming. Here is a sample session intercepting a toy Java program:
- Top-Left: the userspace process receives a real-time signal from the kernel module containing the target PID, then fires a SIGTERM to that PID while timing how long it takes the target to exit
- Top-Right: the Java toy app opens a dummy text file and writes to it. Since this does not take very long, it simulates a delay before exiting. A real “uninterruptible delay” to exit would involve clean-up by the target in the case of SIGTERM; in the case of SIGKILL it would require a state of “uninterruptible sleep”, such as blocking I/O
- Bottom-Left: dmesg displays kernel messages; our module’s output begins at +47.. seconds. As one can see, the sys_write() call, i.e. the C-level write() call, is being intercepted using Linux Kprobes
- Bottom-Right: my kernel module is inserted and removed from the kernel
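To make the “clean-up on SIGTERM” case concrete, here is a minimal C sketch of a target that delays its own exit. It is an assumption-laden stand-in, not the Java toy app from the session: `install_sigterm_handler`, `sigterm_seen` and `slow_cleanup` are hypothetical names, and the clean-up is simulated with a sleep.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <time.h>

static volatile sig_atomic_t term_requested = 0;

/* The handler only sets a flag; the slow work happens outside of it. */
static void on_sigterm(int signo)
{
    (void)signo;
    term_requested = 1;
}

/* Installs the SIGTERM handler. Returns 0 on success, -1 on failure. */
int install_sigterm_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigterm;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}

/* 1 once SIGTERM has been received, 0 before that. */
int sigterm_seen(void)
{
    return term_requested != 0;
}

/* Stand-in for real clean-up work: sleeps for ms milliseconds, resuming
 * the remaining time if a signal interrupts the sleep. */
void slow_cleanup(long ms)
{
    struct timespec left = {
        .tv_sec  = ms / 1000,
        .tv_nsec = (ms % 1000) * 1000000L,
    };

    assert(ms >= 0);
    while (nanosleep(&left, &left) == -1 && errno == EINTR)
        ;
}
```

A target built on this would install the handler, loop on `pause()` until `sigterm_seen()` returns 1, call `slow_cleanup()` and then exit, so the time measured by the helper is roughly the clean-up duration. SIGKILL cannot be handled this way, which is why that case needs a process stuck in uninterruptible sleep instead.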
Here is the module (disclaimer: I know my use of CamelCase/camelCase/lowercase_separated_by_underscores/UPPERCASE_SEPARATED_BY_UNDERSCORES and general coding style is probably way-off, and that some of the things my code is doing may be wrong/dangerous, and also that bla bla bla…. so yes this is “pre-alpha” code):
    #include <asm/siginfo.h>        // siginfo
    #include <linux/debugfs.h>
    #include <linux/init.h>
    #include <linux/kernel.h>
    #include <linux/kprobes.h>
    #include <linux/module.h>
    #include <linux/moduleparam.h>
    #include <linux/rcupdate.h>     // rcu_read_lock
    #include <linux/sched.h>        // find_task_by_pid_type
    #include <linux/uaccess.h>      // copy_from_user
    #include <linux/version.h>
    #include "my_syscall_display.h"

    #define USERMODE_EXEC_PATH "/home/pablo/Desktop/probe_io_timer/probe_io_timer_u"
    #define MY_DEBUGFS_MAX_COUNT 10
    #define SIG_TEST 44 // we choose 44 as our signal number
                        // (real-time signals are in the range of 33 to 64)

    static char *target_taskname = "timer_test_target";
    static char *target_syscall = "sys_write";
    static int myoffset = 0;
    static int myskipcount = 0;

    /*
     * module_param(foo, int, 0000)
     * The first param is the parameter's name
     * The second param is its data type
     * The final argument is the permissions bits,
     * for exposing parameters in sysfs (if non-zero) at a later stage.
     */
    module_param(target_taskname, charp, 0);
    MODULE_PARM_DESC(target_taskname, "Name of program to monitor, Use this to get pid. Default is \"timer_test_target\"");
    module_param(target_syscall, charp, 0);
    MODULE_PARM_DESC(target_syscall, "Name of system call or another OS symbol to monitor. Default is \"sys_write\"");
    module_param(myoffset, int, 0);
    MODULE_PARM_DESC(myoffset, "hexadecimal offset from the OS symbol name where monitoring will be done. Default is 0");
    module_param(myskipcount, int, 0);
    MODULE_PARM_DESC(myskipcount, "How many symbols to skip before taking action. Default is 0");

    struct siginfo info;
    struct task_struct *userspace_task;
    static unsigned int userspace_task_pid;
    struct dentry *file = NULL;
    static unsigned int counter = 0;

    int Pre_Handler(struct kprobe *p, struct pt_regs *regs)
    {
        /* current->comm holds at most TASK_COMM_LEN - 1 (15) characters,
         * so compare no more than that; longer names would never match */
        if (strncmp(current->comm, target_taskname, TASK_COMM_LEN - 1) == 0) {
            printk("probe pid %d `%s`, count: %d/%d\n",
                   current->pid, current->comm, ++counter, myskipcount);
            //printk("%s( %lu, %s, %lu)\n", target_syscall, regs->di, (char *)(regs->si), regs->dx);
            mySyscallPrint(target_syscall, regs);
            if (counter >= myskipcount) {
                if (userspace_task == NULL)
                    printk("\tuserspace_task is still NULL. No signal sent.\n");
                else {
                    info.si_int = current->pid;
                    send_sig_info(SIG_TEST, &info, userspace_task);
                }
            }
        }
        return 0;
    }

    void Post_Handler(struct kprobe *p, struct pt_regs *regs, unsigned long flags)
    {
        ;
    }

    static ssize_t write_pid(struct file *file, const char __user *buf,
                             size_t count, loff_t *ppos)
    {
        char mybuf[MY_DEBUGFS_MAX_COUNT + 1];   /* room for a terminating NUL */

        if (count > MY_DEBUGFS_MAX_COUNT)
            return -EINVAL;
        if (copy_from_user(mybuf, buf, count) != 0)
            return -EFAULT;
        mybuf[count] = '\0';                    /* sscanf needs a terminated string */
        if (sscanf(mybuf, "%u", &userspace_task_pid) != 1)
            return -EINVAL;
        printk(" helper pid = %u\n", userspace_task_pid);
        rcu_read_lock();
        if (userspace_task_pid != 0)
            userspace_task = pid_task(find_pid_ns(userspace_task_pid, &init_pid_ns), PIDTYPE_PID);
        if (userspace_task == NULL) {
            printk(" no such pid!\n");
            //rcu_read_unlock();
            //return -ENODEV;
        }
        rcu_read_unlock();
        return count;
    }

    static const struct file_operations my_fops = {
        .write = write_pid,
    };

    static struct kprobe kp;

    static int __init myinit(void)
    {
        int ret;
        char *argv[] = { USERMODE_EXEC_PATH, NULL };
        char *envp[] = { "PATH=/sbin:/usr/sbin:/bin:/usr/bin", NULL };

        printk("__Kprobekill module__\n");
        printk("target_taskname: %s\n", target_taskname);
        printk("target_syscall: %s\n", target_syscall);
        printk("myoffset (inside call): %d\n", myoffset);
        printk("myskipcount (before signal):%d\n", myskipcount);
        printk("Signal sent to Target Progr:%s\n", "SIGTERM");

        kp.pre_handler = Pre_Handler;
        kp.post_handler = Post_Handler;
        /* kallsyms_lookup_name() is no longer exported to modules since
         * Linux 5.7; on such kernels set kp.symbol_name and kp.offset instead */
        kp.addr = (kprobe_opcode_t *) kallsyms_lookup_name(target_syscall);
        //kprobe_lookup_name( target_syscall, kp.addr);
        kp.addr += myoffset;
        //kp.addr = (kprobe_opcode_t *)0xffffffff815950a0;

        memset(&info, 0, sizeof(struct siginfo));
        info.si_signo = SIG_TEST;
        info.si_code = SI_QUEUE;

        /* we need to know the pid of the user space process
         * -> we use debugfs for this. So 1st create the debugfs file,
         * then set the userspace_task_pid to 0,
         * exec the userspace, wait for successful exec,
         * and block on it to write its PID to debugfs (w/in timeout?),
         * Finally, get the task struct for that PID in order to deliver signals to it.
         * If any of these steps in this module init chain fail, module must
         * exit because it depends on that userspace process.
         * Of course, since that userspace can be killed or crash, we must continue
         * to not assume it is still there on subsequent transactions, and
         * restart it, or module-exit if userspace is gone.
         */
        file = debugfs_create_file("signalconfpid", 0200, NULL, NULL, &my_fops);
        userspace_task_pid = 0;
        userspace_task = NULL;

        printk("usermodehelper: init -");
        ret = call_usermodehelper(USERMODE_EXEC_PATH, argv, envp, UMH_WAIT_EXEC);
        if (ret != 0)
            printk(" error: %i\n", ret);
        else
            printk(" success\n");

        /* do not report success if the probe could not be inserted */
        ret = register_kprobe(&kp);
        if (ret < 0) {
            printk("register_kprobe failed: %d\n", ret);
            debugfs_remove(file);
            return ret;
        }
        printk("`%s` probe inserted on `%s` tasks\n", target_syscall, target_taskname);
        return 0;
    }

    void myexit(void)
    {
        // same PID -- different signal
        if (userspace_task != NULL) {
            info.si_signo = SIGTERM;
            info.si_int = 1234; // no msg to deliver except "bye-bye"
            send_sig_info(SIGTERM, &info, userspace_task); //send the signal
        }
        // could've done this earlier?
        if (file != NULL)
            debugfs_remove(file);
        unregister_kprobe(&kp);
        printk("`%s` probe removed for `%s` tasks\n", target_syscall, target_taskname);
    }

    module_init(myinit);
    module_exit(myexit);
    MODULE_AUTHOR("Pablo");
    //MODULE_AUTHOR("Manoj");
    MODULE_DESCRIPTION("KPROBE MODULE");
    MODULE_LICENSE("GPL");
and userspace:
    /*
     * TODO:
     * 1. SIGTERM hndlr for clean shutdown
     * 2. process forks up to MAX_PROCS2TIME for timing workers
     * 3. circular FIFO pid2term of size MAX_PROCS2TIME
     * 4. does SIG_TEST really have to be hard-coded arbitrarily?
     */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <limits.h>
    #include <stdbool.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <syslog.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define MY_SYSCONF_PATH "/sys/kernel/debug/signalconfpid"
    #define MY_SYSCONF_MAX_MSG 10
    #define MAX_PROCS2TIME 10
    #define SIG_TEST 44 /* hard-coded since SIGRTMIN is different in user and in kernel space */
    #define MY_ERR(s) syslog(LOG_NOTICE, s ": %s", strerror(errno))

    unsigned long getMicrotime(void);

    /* written by the signal handler, read by main() */
    volatile sig_atomic_t pid2term = 0;

    void receiveData(int n, siginfo_t *info, void *unused)
    {
        pid2term = info->si_int;
    }

    int main(int argc, char **argv)
    {
        int configfd;
        char buf[MY_SYSCONF_MAX_MSG];
        struct sigaction sig;

        /* set up the signal handler for SIG_TEST before the kernel learns our
         * pid, so an early signal cannot hit the default action (terminate)
         * SA_SIGINFO -> we want the signal handler function with 3 arguments
         */
        memset(&sig, 0, sizeof(sig));
        sigemptyset(&sig.sa_mask);
        sig.sa_sigaction = receiveData;
        sig.sa_flags = SA_SIGINFO;
        sigaction(SIG_TEST, &sig, NULL);

        /* kernel needs to know our pid to be able to send us a signal ->
         * we use debugfs for this -> do not forget to mount the debugfs!
         */
        configfd = open(MY_SYSCONF_PATH, O_WRONLY);
        if (configfd < 0) {
            MY_ERR("open");
            return -1;
        }
        snprintf(buf, sizeof(buf), "%i", getpid());
        if (write(configfd, buf, strlen(buf) + 1) < 0) {
            MY_ERR("write");
            return -1;
        }

        /* for now copying one int to another, pid2term => pid2term_cpy
         * as an "atomic" way to "safely" handle subsequent interrupts
         * part-way through a timing op
         */
        int pid2term_cpy;
        unsigned long b4, diff, sec;

        syslog(LOG_NOTICE, "entering wait loop");
        while (true) {
            pause();
            if (pid2term != 0) {
                pid2term_cpy = pid2term;
                pid2term = 0;
                syslog(LOG_NOTICE, "Sending SIGTERM to PID %d\n", pid2term_cpy);
                b4 = getMicrotime();
                kill(pid2term_cpy, SIGTERM);
                /* poll until the target is gone */
                while (kill(pid2term_cpy, 0) != -1)
                    usleep(1);
                if (errno != ESRCH)
                    continue;
                diff = getMicrotime() - b4;
                sec = diff / 1000000UL;
                syslog(LOG_NOTICE, "PID %d: time2term was %lus,%luus\n",
                       pid2term_cpy, sec, diff % 1000000UL);
            }
        }
        return 0;
    }

    unsigned long getMicrotime(void)
    {
        struct timeval currentTime;

        gettimeofday(&currentTime, NULL);
        return (unsigned long)currentTime.tv_sec * 1000000UL + currentTime.tv_usec;
    }
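The elapsed-time arithmetic in the loop above can also be kept in integers and taken from a monotonic clock, which does not jump if the wall-clock time is adjusted mid-measurement. The following is a minimal sketch under assumptions: `now_monotonic`, `elapsed_us` and `split_us` are hypothetical names, and using `clock_gettime(CLOCK_MONOTONIC)` in place of `gettimeofday()` is a substitution, not what the program above does.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Current reading of the monotonic clock. */
struct timespec now_monotonic(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts;
}

/* Whole microseconds from start to end (end is expected to be later). */
long long elapsed_us(struct timespec start, struct timespec end)
{
    long long ns =
        ((long long)end.tv_sec - (long long)start.tv_sec) * 1000000000LL
        + ((long long)end.tv_nsec - (long long)start.tv_nsec);

    return ns / 1000;
}

/* Splits a microsecond count into whole seconds and the remainder. */
void split_us(long long us, long long *sec, long long *rem_us)
{
    assert(us >= 0 && sec != NULL && rem_us != NULL);
    *sec = us / 1000000LL;
    *rem_us = us % 1000000LL;
}
```

In the wait loop this would mean taking `now_monotonic()` just before `kill()` and again once the target is gone, then passing the result of `elapsed_us()` through `split_us()` to get the two numbers for the syslog line.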
What did I learn?
- My hybrid K&R/Stroustrup/Allman-8/Whitesmiths/Horstmann/Ratliff/Lisp indentation style is WAY-wrong and could result in permanent, irreparable damage to the universe
- pass module params to module
- share params between kernel and userspace using debugfs
- real-time signals kernel to userspace
- Linux KProbes!
- some things strace can do/not-do
- tell a userspace process to fork-off from the kernel
- write log output from the kernel (printk) and from userspace (syslog)
- receive a parameter through a signal
External Links:
https://opensourceforu.com/2011/04/kernel-debugging-using-kprobe-and-jprobe/
https://www.ibm.com/developerworks/library/l-kprobes/index.html
https://blog.tanelpoder.com/2013/02/21/peeking-into-linux-kernel-land-using-proc-filesystem-for-quickndirty-troubleshooting/
http://www.vantagepoint.sg/blog/82-hooking-android-system-calls-for-pleasure-and-benefit
http://manpages.ubuntu.com/manpages/artful/man8/kprobe-perf.8.html
https://www.scalingphpbook.com/blog/2016/04/03/my-favorite-simple-php-debugging-tool.html
https://jlmedina123.wordpress.com/2013/08/13/current-variable-and-task_struct/
http://www.xml.com/ldd/chapter/book/ch02.html#t6
https://lwn.net/Articles/288056/
https://linux-kernel-labs.github.io/master/labs/kernel_modules.html#
https://syscalls.kernelgrok.com/
https://www.gnu.org/software/libc/manual/html_node/Atomic-Types.html
https://www.wikitechy.com/tutorials/linux/linux-api-to-list-running-processes
https://idea.popcount.org/2012-12-11-linux-process-states/
https://eklitzke.org/uninterruptible-sleep
https://stackoverflow.com/questions/223644/what-is-an-uninterruptable-process
http://stupefydeveloper.blogspot.com/2009/06/linux-creation-of-new-process.html
http://jkukunas.blogspot.com/2010/05/x86-linux-networking-system-calls.html
https://jvns.ca/blog/2016/01/18/guessing-linux-kernel-registers/
https://lwn.net/Articles/604515/
https://stackoverflow.com/questions/9782660/using-ptrace-to-find-out-what-exactly-does-the-arguments-signify-for-a-system-ca?rq=1
https://stackoverflow.com/questions/44612136/how-to-simulate-hung-task-in-linux/44612553#44612553
https://tuxthink.blogspot.com/2012/05/module-to-print-open-files-of-process.html