Recently I was asked to write an “OS device driver”, a.k.a. a Linux kernel driver, or, as I came to find out, really just a kernel module, since no hardware was involved. After locking up my system completely twice (glad it runs a journaling filesystem) and holding down the hardware “reset” button for a few seconds each time, I took a sufficiently sharp, albeit dangerous, stab at the world of kernel programming. I had suspected as much, but a segfault in kernel-land is a major faux pas, as is killing a user-space helper process that the kernel was supposed to kill. To say that a few guides have already been written on these topics would be an understatement. Here are the requirements I explored in my project:
- intercept calls to kernel routines (mainly C system calls) from userspace in near real-time
- time how fast said calling processes would then exit after the kernel zapped them with a friendly signal
- do the aforementioned without creating feedback loops or locking up the kernel
- manage the nuances of intercepting kernel routines (can this be done for every process? every call? should it?)
Initially, before the requirements were ready, I wrote a quick shell script to discover the target process, send it a SIGTERM, and time the exit. As it turned out, it was probably too slow to catch syscalls that completed quickly. There may have been other ways to use strace to fulfill the requirements, but I ended up taking a different approach.
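#!/usr/bin/env bash
command -v strace >/dev/null || { echo 'need `strace` to run. Quitting.' >&2; exit 1; }
# pidof prints all matching PIDs on one line; take the last one
p=$(pidof java | awk '{ print $NF }')
[ -n "$p" ] || { echo 'please start the java process before running this script' >&2; exit 1; }
echo "Sending SIGTERM to Java process $p on 1st \"write()\" syscall"
# wait for the first traced write(); strace's own "attached" messages are skipped
strace -e trace=write -fp "$p" 2>&1 | grep -m1 'write(' >/dev/null
kill -SIGTERM "$p"
# poll until the PID is gone and keep only the "real" line of `time`
t=$( ( time {
        while kill -0 "$p" 2>/dev/null; do
            sleep 0.000001
        done
    } ) 2>&1 | awk '/real/ { print $NF }' )
echo "Time to terminate: ${t}"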
I ended up deciding to write an application that would be part kernel-space, part user-space. Minimal time is spent in kernel-space: just enough to intercept the syscall and send a message of sorts (a real-time signal) to the user-space application, which handles the friendly zapping of the lucky processes being monitored (with an ordinary, non-real-time SIGTERM or SIGKILL). One benefit of this approach is that real-time signals are queued: if several arrive before the helper gets to them, they are delivered one by one instead of being merged into a single pending signal. This allows the userspace side to branch off worker threads or fork processes to handle the requests in a less sensitive context.
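To make the queueing behaviour concrete, here is a minimal userspace-only sketch. It does not involve the kernel module at all: the process queues a few real-time signals to itself with sigqueue(), each carrying an int payload, and then dequeues them with sigtimedwait(). The names `DEMO_SIG` and `queue_and_drain` are made up for this illustration, and `SIGRTMIN + 10` is an arbitrary choice of real-time signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical demo signal: any value in SIGRTMIN..SIGRTMAX would do. */
#define DEMO_SIG (SIGRTMIN + 10)

/* Queue n real-time signals to this process, one per entry of payloads,
 * then dequeue them and store what arrived in received.
 * Returns the number of signals dequeued, or -1 on error. */
int queue_and_drain(const int *payloads, int n, int *received)
{
    sigset_t set, old;
    struct timespec timeout = { 0, 100 * 1000 * 1000 }; /* 100 ms */
    int got = 0;

    assert(payloads != NULL && received != NULL);

    /* Block the signal so every instance stays pending until we ask for it. */
    sigemptyset(&set);
    sigaddset(&set, DEMO_SIG);
    if (sigprocmask(SIG_BLOCK, &set, &old) != 0)
        return -1;

    for (int i = 0; i < n; i++) {
        union sigval v;
        v.sival_int = payloads[i];
        if (sigqueue(getpid(), DEMO_SIG, v) != 0) {
            sigprocmask(SIG_SETMASK, &old, NULL);
            return -1;
        }
    }

    /* Each queued instance is returned separately, with its own payload. */
    while (got < n) {
        siginfo_t si;
        if (sigtimedwait(&set, &si, &timeout) != DEMO_SIG)
            break;
        received[got++] = si.si_value.sival_int;
    }

    sigprocmask(SIG_SETMASK, &old, NULL);
    return got;
}
```

Calling `queue_and_drain()` with three payloads should hand back all three, in the order they were sent. A classic signal such as SIGTERM, sent three times while blocked, would only be seen once.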
The results of the subsequent “time-to-exit” timings, as well as other output, are logged using syslog. Don’t ask me what the ultimate use case for this application is; I also thought it was a bit strange when the client asked me to work on it. In any case, it was a great learning experience, and I enjoyed the opportunity to learn about an aspect of Linux programming that had, until then, intimidated me. I came away convinced that, for a beginner, writing a module is one of the best ways to cut your teeth on kernel programming. Here is a sample session intercepting a toy Java program:

- Top-Left: the userspace process receives a real-time signal from the kernel module containing the target PID, then fires a SIGTERM to that PID while timing how long it takes the target to exit
- Top-Right: the Java toy app opens a dummy text file and writes to it. Since this does not take very long, the app simulates a delay before exiting. A real “uninterruptible delay” to exit would involve clean-up by the target in the case of SIGTERM; in the case of SIGKILL, it would require a state of “uninterruptible sleep”, such as blocking I/O
- Bottom-Left: dmesg displays kernel messages; our module’s output begins at +47.. seconds. As one can see, the sys_write() call (the C-level write() call) is being intercepted using Linux Kprobes
- Bottom-Right: my kernel module is inserted and removed from the kernel
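The point about clean-up on SIGTERM can be illustrated with a small C stand-in for the target. This is not the Java toy app from the session, just a hypothetical sketch: the handler only sets a flag, and the main loop of the target would notice the flag, do its clean-up, and then exit. The time between the signal arriving and the process actually exiting is what the helper ends up measuring. The names `install_term_handler` and `should_stop` are invented for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Set from the signal handler, polled from normal code. */
static volatile sig_atomic_t term_requested = 0;

/* Keep the handler tiny: only async-signal-safe work belongs here. */
static void on_term(int signo)
{
    (void)signo;
    term_requested = 1;
}

/* Install the SIGTERM handler; returns 0 on success, -1 on failure. */
int install_term_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_term;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}

/* A target's main loop would call this between units of work and,
 * once it returns nonzero, flush/close its files and then exit. */
int should_stop(void)
{
    assert(term_requested == 0 || term_requested == 1);
    return term_requested != 0;
}
```

A target built this way exits only as fast as its clean-up allows, which is exactly the delay being timed. SIGKILL cannot be caught, so no handler like this ever runs for it.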
Here is the module (disclaimer: I know my use of CamelCase/camelCase/lowercase_separated_by_underscores/UPPERCASE_SEPARATED_BY_UNDERSCORES and general coding style is probably way-off, and that some of the things my code is doing may be wrong/dangerous, and also that bla bla bla…. so yes this is “pre-alpha” code):
#include <asm/siginfo.h> //siginfo
#include <linux/debugfs.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/kprobes.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/rcupdate.h> //rcu_read_lock
#include <linux/sched.h> //find_task_by_pid_type
#include <linux/uaccess.h> //copy_from_user
#include <linux/version.h>
#include "my_syscall_display.h"
#define USERMODE_EXEC_PATH "/home/pablo/Desktop/probe_io_timer/probe_io_timer_u"
#define MY_DEBUGFS_MAX_COUNT 10
#define SIG_TEST 44 // we choose 44 as our signal number
// (real-time signals are in the range of 33 to 64)
static char *target_taskname = "timer_test_target";
static char *target_syscall = "sys_write";
static int myoffset = 0;
static int myskipcount = 0;
/*
* module_param(foo, int, 0000)
* The first param is the parameter's name
* The second param is its data type
* The final argument is the permissions bits,
* for exposing parameters in sysfs (if non-zero) at a later stage.
*/
module_param(target_taskname, charp, 0);
MODULE_PARM_DESC(target_taskname
, "Name of program to monitor. Use this to get pid. Default is \"timer_test_target\"");
module_param(target_syscall, charp, 0);
MODULE_PARM_DESC(target_syscall
, "Name of system call or another OS symbol to monitor. Default is \"sys_write\"");
module_param(myoffset, int, 0);
MODULE_PARM_DESC(myoffset
, "hexadecimal offset from the OS symbol name where monitoring will be done. Default is 0");
module_param(myskipcount, int, 0);
MODULE_PARM_DESC(myskipcount, "How many symbols to skip before taking action. Default is 0");
struct siginfo info;
struct task_struct *userspace_task;
static unsigned int userspace_task_pid;
struct dentry *file = NULL;
static unsigned int counter = 0;
int Pre_Handler(struct kprobe *p, struct pt_regs *regs)
{
    /* only act on tasks whose command name matches the monitored program */
    if( strcmp( current->comm, target_taskname)== 0){
        printk( "probe pid %d `%s`, count: %u/%d\n",
            current->pid, current->comm, ++counter, myskipcount);
        //printk("%s( %lu, %s, %lu)\n", target_syscall, regs->di, (char *)(regs->si), regs->dx);
        mySyscallPrint( target_syscall, regs);
        if ( counter >= myskipcount){
            if(userspace_task == NULL)
                printk("\tuserspace_task is still NULL. No signal sent.\n");
            else{
                /* the signal payload is the PID of the intercepted task */
                info.si_int = current->pid;
                send_sig_info(SIG_TEST, &info, userspace_task);
            }
        }
    }
    return 0;
}
void Post_Handler(struct kprobe *p, struct pt_regs *regs, unsigned long flags)
{
;
}
static ssize_t write_pid(struct file *file, const char __user *buf, size_t count, loff_t *ppos)
{
    char mybuf[MY_DEBUGFS_MAX_COUNT + 1]; /* +1 for the terminating NUL */
    if(count > MY_DEBUGFS_MAX_COUNT)
        return -EINVAL;
    if( copy_from_user(mybuf, buf, count) != 0)
        return -EFAULT;
    mybuf[count] = '\0'; /* user data is not guaranteed to be NUL-terminated */
    if( sscanf(mybuf, "%u", &userspace_task_pid) != 1)
        return -EINVAL;
    printk(" helper pid = %u\n", userspace_task_pid);
    rcu_read_lock();
    /* NOTE: no reference is taken on the task here, so this pointer
     * can go stale if the helper exits while the module is loaded */
    if( userspace_task_pid != 0)
        userspace_task = pid_task( find_pid_ns( userspace_task_pid, &init_pid_ns), PIDTYPE_PID);
    if( userspace_task == NULL){
        printk( " no such pid!\n");
        //rcu_read_unlock();
        //return -ENODEV;
    }
    rcu_read_unlock();
    return count;
}
static const struct file_operations my_fops = {
.write = write_pid,
};
static struct kprobe kp;
static int __init myinit(void)
{
    int ret;
    char *argv[] = {USERMODE_EXEC_PATH, NULL };
    char *envp[] = {"PATH=/sbin:/usr/sbin:/bin:/usr/bin", NULL };
    printk("__Kprobekill module__\n");
    printk("target_taskname: %s\n", target_taskname);
    printk("target_syscall: %s\n", target_syscall);
    printk("myoffset (inside call): %d\n", myoffset);
    printk("myskipcount (before signal):%d\n", myskipcount);
    printk("Signal sent to Target Progr:%s\n", "SIGTERM");
    kp.pre_handler = Pre_Handler;
    kp.post_handler = Post_Handler;
    kp.addr = (kprobe_opcode_t *) kallsyms_lookup_name( target_syscall);
    //kprobe_lookup_name( target_syscall, kp.addr);
    if( kp.addr == NULL){
        /* unknown symbol: do not add an offset to a NULL address */
        printk("symbol `%s` not found\n", target_syscall);
        return -ENOENT;
    }
    kp.addr += myoffset;
    //kp.addr = (kprobe_opcode_t *)0xffffffff815950a0;
    memset(&info, 0, sizeof(struct siginfo));
    info.si_signo = SIG_TEST;
    info.si_code = SI_QUEUE;
    /* we need to know the pid of the user space process
    * -> we use debugfs for this. So 1st create the debugfs file,
    * then set the userspace_task_pid to 0,
    * exec the userspace, wait for successful exec,
    * and block on it to write its PID to debugfs (w/in timeout?),
    * Finally, get the task struct for that PID in order to deliver signals to it.
    * If any of these steps in this module init chain fail, module must
    * exit because it depends on that userspace process.
    * Of course, since that userspace can be killed or crash, we must continue
    * to not assume it is still there on subsequent transactions, and
    * restart it, or module-exit if userspace is gone.
    */
    file = debugfs_create_file("signalconfpid", 0200, NULL, NULL, &my_fops);
    userspace_task_pid = 0;
    userspace_task = NULL;
    printk("usermodehelper: init -");
    ret = call_usermodehelper(USERMODE_EXEC_PATH, argv, envp, UMH_WAIT_EXEC);
    if (ret != 0)
        printk(" error: %i\n", ret);
    else
        printk(" success\n");
    ret = register_kprobe(&kp);
    if (ret < 0){
        /* without the probe the module is useless: undo and bail out */
        printk("register_kprobe failed: %d\n", ret);
        debugfs_remove(file);
        return ret;
    }
    printk("`%s` probe inserted on `%s` tasks\n", target_syscall, target_taskname);
    return 0;
}
static void __exit myexit(void)
{
    // same PID -- different signal
    if(userspace_task != NULL){
        info.si_signo = SIGTERM;
        info.si_int = 1234; // no msg to deliver except "bye-bye"
        send_sig_info( SIGTERM, &info, userspace_task); //send the signal
    }
    // could've done this earlier?
    if( file != NULL)
        debugfs_remove(file);
    unregister_kprobe(&kp);
    printk("`%s` probe removed for `%s` tasks\n", target_syscall, target_taskname);
}
module_init(myinit);
module_exit(myexit);
MODULE_AUTHOR("Pablo");
//MODULE_AUTHOR("Manoj");
MODULE_DESCRIPTION("KPROBE MODULE");
MODULE_LICENSE("GPL");
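The debugfs write handler in the module has to turn an untrusted, possibly unterminated byte buffer into a PID. Since that logic is easy to get subtly wrong, here is a userspace sketch of the same parsing step that can be compiled and tested outside the kernel. The name `parse_pid_msg` and the `PID_MSG_MAX` constant are invented for this example (the latter mirrors `MY_DEBUGFS_MAX_COUNT`). Inside the kernel, I believe a helper such as kstrtouint_from_user() covers roughly the same ground, but treat that as a pointer to look up rather than a drop-in recommendation.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define PID_MSG_MAX 10 /* hypothetical limit, mirrors MY_DEBUGFS_MAX_COUNT */

/* Parse a decimal PID from count bytes at buf, which need not be
 * NUL-terminated. Returns 0 and stores the PID in *out on success,
 * or -EINVAL if the input is empty, too long, or not a positive number. */
int parse_pid_msg(const char *buf, size_t count, unsigned int *out)
{
    char tmp[PID_MSG_MAX + 1];
    char *end;
    unsigned long val;

    assert(buf != NULL && out != NULL);
    if (count == 0 || count > PID_MSG_MAX)
        return -EINVAL;

    /* Copy into a buffer with room for a terminator we add ourselves. */
    memcpy(tmp, buf, count);
    tmp[count] = '\0';

    /* strtoul() tolerates leading whitespace and signs; we do not. */
    if (tmp[0] < '0' || tmp[0] > '9')
        return -EINVAL;

    errno = 0;
    val = strtoul(tmp, &end, 10);
    if (errno != 0 || val == 0 || val > INT_MAX)
        return -EINVAL;

    /* Allow one trailing newline (e.g. from echo), nothing else. */
    if (*end == '\n')
        end++;
    if (*end != '\0')
        return -EINVAL;

    *out = (unsigned int)val;
    return 0;
}
```

The helper in this post writes its PID including the terminating NUL byte, which this sketch accepts; a PID written with a plain `echo` would also parse.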
And here is the userspace helper:
/*
* TODO:
* 1. SIGTERM hndlr for clean shutdown
* 2. process forks up to MAX_PROCS2TIME for timing workers
* 3. circular FIFO pid2term of size MAX_PROCS2TIME
* 4. does SIG_TEST really have to be hard-coded arbitrarily?
*/
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdbool.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <syslog.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>
#define MY_SYSCONF_PATH "/sys/kernel/debug/signalconfpid"
#define MY_SYSCONF_MAX_MSG 10
#define MAX_PROCS2TIME 10
#define SIG_TEST 44 /* hard-coded since SIGRTMIN is different in user and in kernel space */
#define MY_ERR(s) syslog(LOG_NOTICE,s": %s",strerror( errno))
unsigned long getMicrotime(void);
/* written by the signal handler, read by main() */
volatile sig_atomic_t pid2term=0;
void receiveData( int n, siginfo_t *info, void *unused)
{
    pid2term = info->si_int;
}
int main( int argc, char **argv )
{
    int configfd;
    char buf[ MY_SYSCONF_MAX_MSG];
    /* setup the signal handler for SIG_TEST first, so that it is in
    * place before the kernel learns our PID and can start signalling.
    * SA_SIGINFO -> we want the signal handler function with 3 arguments
    */
    struct sigaction sig;
    memset( &sig, 0, sizeof( sig));
    sigemptyset( &sig.sa_mask);
    sig.sa_sigaction = receiveData;
    sig.sa_flags = SA_SIGINFO;
    if( sigaction(SIG_TEST, &sig, NULL) < 0) {
        MY_ERR( "sigaction");
        return -1;
    }
    /* kernel needs to know our pid to be able to send us a signal ->
    * we use debugfs for this -> do not forget to mount the debugfs!
    */
    configfd = open( MY_SYSCONF_PATH, O_WRONLY);
    if( configfd < 0) {
        MY_ERR( "open");
        return -1;
    }
    snprintf( buf, sizeof( buf), "%i", getpid());
    if ( write( configfd, buf, strlen(buf) + 1) < 0) {
        MY_ERR( "write");
        return -1;
    }
    close( configfd);
    /* for now copying one int to another, pid2term => pid2term_cpy
    * as an "atomic" way to "safely" handle subsequent interrupts
    * part-way through a timing op
    */
    int pid2term_cpy;
    unsigned long b4, diff, sec;
    syslog(LOG_NOTICE, "entering wait loop");
    while( true){
        pause();
        if( pid2term != 0){
            pid2term_cpy = pid2term;
            pid2term = 0;
            syslog(LOG_NOTICE, "Sending SIGTERM to PID %d\n", pid2term_cpy);
            b4 = getMicrotime();
            kill( pid2term_cpy, SIGTERM);
            /* poll until the PID no longer exists */
            while( kill( pid2term_cpy, 0) != -1)
                usleep( 1);
            if( errno != ESRCH) continue;
            diff = getMicrotime() - b4;
            sec = diff / 1000000UL;
            syslog(LOG_NOTICE, "PID %d: time2term was %lus,%luus\n"
                , pid2term_cpy, sec, diff % 1000000UL);
        }
    }
    return 0;
}
unsigned long getMicrotime(void)
{
    struct timeval currentTime;
    gettimeofday(&currentTime, NULL);
    return (unsigned long)currentTime.tv_sec * 1000000UL + currentTime.tv_usec;
}
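One caveat about the timing in the helper: gettimeofday() reads the wall clock, so if the clock is stepped during a measurement (NTP adjustment, manual date change) the computed interval is skewed. A possible alternative, sketched here under the assumption that only elapsed time matters, is clock_gettime() with CLOCK_MONOTONIC. The function names `elapsed_us` and `time_nap_us` are made up for the illustration and are not part of the original code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Microseconds from *start to *end; assumes *end is not earlier than *start. */
int64_t elapsed_us(const struct timespec *start, const struct timespec *end)
{
    assert(start != NULL && end != NULL);

    /* Work in nanoseconds first so a borrow from tv_nsec is handled. */
    int64_t ns = ((int64_t)end->tv_sec - (int64_t)start->tv_sec) * 1000000000LL
               + ((int64_t)end->tv_nsec - (int64_t)start->tv_nsec);
    return ns / 1000;
}

/* Demo: time a nanosleep() of nap_ns nanoseconds (must be < 1e9).
 * Returns the measured duration in microseconds, or -1 on clock error. */
int64_t time_nap_us(long nap_ns)
{
    struct timespec start, end;
    struct timespec nap = { 0, nap_ns };

    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1;
    nanosleep(&nap, NULL);
    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1;
    return elapsed_us(&start, &end);
}
```

In the helper, the two clock_gettime() calls would bracket the kill()/poll loop in the same way that the two getMicrotime() calls do now.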
What did I learn?
- My hybrid K&R/Stroustrup/Allman-8/Whitesmiths/Horstmann/Ratliff/Lisp indentation style is WAY wrong and could result in permanent, irreparable damage to the universe
- how to pass parameters to a module
- how to share parameters between kernel and userspace using debugfs
- how to send real-time signals from the kernel to userspace
- Linux Kprobes!
- some of the things strace can and cannot do
- how to have the kernel fork off a userspace process (call_usermodehelper)
- how to write log output from the kernel (printk) and from userspace (syslog)
- how to receive a parameter through a signal