Kernel Land

Recently I was asked to write an “OS device driver”, a.k.a. a Linux kernel driver, or, as I came to find out, really just a kernel module, since there was no hardware involved. After locking up my system completely twice (glad it runs a journaling filesystem), I held down the hardware “reset” button for a few seconds, then crafted a sufficiently sharp, albeit dangerous, stab into the world of kernel programming. I had suspected it, but a segfault in kernel-land is a major faux pas, as is killing a user-space helper process that the kernel was supposed to kill. It is an understatement to say that a few guides have been written on these topics. Here are some requirements that were explored in my project:

  1. intercept calls to kernel routines (mainly C system calls) from userspace in near real-time
  2. time how fast said calling processes would then exit after the kernel zapped them with a friendly signal
  3. do the aforementioned without creating feedback loops or locking up the kernel
  4. manage the nuances of intercepting kernel routines (can this be done for every process? every call? should it?)

Initially, before the requirements were ready, I wrote a quick shell script to discover the target process, send it a SIGTERM, and time the exit. As it turned out, it was probably too slow to catch syscalls that completed quickly. There may have been other ways to use strace to fulfill the requirements, but I ended up taking a different approach.

#!/usr/bin/env bash
command -v strace >/dev/null || { echo 'need `strace` to run. Quitting.' >&2; exit 1; }
# pidof prints every match on one line; -s limits it to a single PID
p=$(pidof -s java)
[ -n "$p" ] || { echo 'please start the java process before running this script' >&2; exit 1; }
echo "Sending SIGTERM to Java process $p on 1st \"write()\" syscall"
# strace reports on stderr; block until the first traced write() line shows up
strace -e trace=write -fp "$p" 2>&1 | grep -q 'write('
kill -SIGTERM "$p"
t=$( ( time {
    while kill -0 "$p" 2>/dev/null; do
        sleep 0.000001
    done;
    } ) 2>&1 \
| awk '/real/ { print $NF }')
echo "Time to terminate: ${t}"

I ended up deciding to write an application that would be part kernel-space, part user-space. Minimal time is spent in kernel-space: just enough to intercept the syscall and send a message of sorts (a real-time signal) to the user-space application, which handles the friendly zapping of the lucky processes being monitored (with a non-real-time SIGTERM or SIGKILL signal). One benefit of this approach is that real-time kernel-to-userspace signals are queued: several instances of the same signal can be pending at once, each carrying its own payload, instead of being merged into one as standard signals are. This allows the userspace to branch off worker threads or fork processes to handle the requests in a less sensitive context.

The results of the subsequent “time-to-exit” timings, as well as other output, are logged using syslog. Don’t ask me what the ultimate use-case is for this application; I also thought it was a bit strange when the client asked me to work on it. In any case, it was a great learning experience for me, and I enjoyed the opportunity to learn about an aspect of Linux programming that had, until then, intimidated me. That said, I came to the conclusion that for a beginner, writing a module is one of the best ways to cut your teeth on kernel programming. Here is a sample session intercepting a toy Java program:

Module Screenshot

  • Top-Left: the userspace process receives a real-time signal from the kernel module containing the target PID, then fires a SIGTERM to that PID while timing how long it takes the target to exit
  • Top-Right: the Java toy app opens a dummy text file and writes to it. Since this does not take very long, it simulates a delay to exit. A real “uninterruptible delay” to exit would involve a clean-up by the target in the case of SIGTERM; in the case of SIGKILL, it would require a condition of “uninterruptible sleep”, such as blocking I/O
  • Bottom-Left: dmesg displays kernel messages; our module’s output begins at +47.. seconds. As one can see, the sys_write() call, or C-level write() call, is being intercepted using Linux Kprobes
  • Bottom-Right: my kernel module is inserted and removed from the kernel

Here is the module (disclaimer: I know my use of CamelCase/camelCase/lowercase_separated_by_underscores/UPPERCASE_SEPARATED_BY_UNDERSCORES and general coding style is probably way-off, and that some of the things my code is doing may be wrong/dangerous, and also that bla bla bla…. so yes this is “pre-alpha” code):

#include <asm/siginfo.h>    //siginfo
#include <linux/debugfs.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/kprobes.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/rcupdate.h> //rcu_read_lock
#include <linux/sched.h>    //task_struct, current
#include <linux/uaccess.h>  //copy_from_user
#include <linux/version.h>

#include "my_syscall_display.h"

#define USERMODE_EXEC_PATH      "/home/pablo/Desktop/probe_io_timer/probe_io_timer_u"
#define MY_DEBUGFS_MAX_COUNT    10
#define SIG_TEST                44  // we choose 44 as our signal number
                                    // (kernel real-time signals are in the range of 32 to 64)
static char *target_taskname = "timer_test_target";
static char *target_syscall = "sys_write";
static int  myoffset = 0;
static int  myskipcount = 0;

/*
 * module_param(foo, int, 0000)
 * The first param is the parameter's name
 * The second param is its data type
 * The final argument is the permissions bits,
 * for exposing parameters in sysfs (if non-zero) at a later stage.
 */

module_param(target_taskname, charp, 0);
MODULE_PARM_DESC(target_taskname
        , "Name of program to monitor, Use this to get pid. Default is \"timer_test_target\"");
module_param(target_syscall, charp, 0);
MODULE_PARM_DESC(target_syscall
        , "Name of system call or another OS symbol to monitor. Default is \"sys_write\"");
module_param(myoffset, int, 0);
MODULE_PARM_DESC(myoffset
        , "hexadecimal offset from the OS symbol name where monitoring will be done. Default is 0");
module_param(myskipcount, int, 0);
MODULE_PARM_DESC(myskipcount, "How many probe hits to skip before taking action. Default is 0");

struct siginfo info;
struct task_struct *userspace_task;
static unsigned int userspace_task_pid;
struct dentry *file = NULL;

static unsigned int counter = 0;

int Pre_Handler(struct kprobe *p, struct pt_regs *regs)
{
    if( strcmp( current->comm, target_taskname)== 0){
        printk( "probe pid %d `%s`, count: %d/%d\n",
                current->pid, current->comm, ++counter, myskipcount);
        //printk("%s( %lu, %s, %lu)\n", target_syscall, regs->di, (char *)(regs->si), regs->dx);
        mySyscallPrint( target_syscall, regs);

        if ( counter >= myskipcount){
            if(userspace_task == NULL)
                printk("\tuserspace_task is still NULL. No signal sent.\n");
            else{
                info.si_int = current->pid;
                send_sig_info(SIG_TEST, &info, userspace_task);
    }   }   }
    return 0;
}

void Post_Handler(struct kprobe *p, struct pt_regs *regs, unsigned long flags)
{
    ;
}

static ssize_t write_pid(struct file *file, const char __user *buf, size_t count, loff_t *ppos)
{
    char mybuf[MY_DEBUGFS_MAX_COUNT + 1];   // +1 so we can always NUL-terminate
    if(count > MY_DEBUGFS_MAX_COUNT)
        return -EINVAL;
    if( copy_from_user(mybuf, buf, count) != 0)
        return -EFAULT;
    mybuf[count] = '\0';    // userspace is not obliged to send the terminator
    if( sscanf(mybuf, "%u", &userspace_task_pid) != 1)
        return -EINVAL;
    printk("   helper pid = %u\n", userspace_task_pid);

    rcu_read_lock();
    if( userspace_task_pid != 0)
        userspace_task = pid_task( find_pid_ns( userspace_task_pid, &init_pid_ns), PIDTYPE_PID);
    if( userspace_task == NULL){
        printk( "    no such pid!\n");
        //rcu_read_unlock();
        //return -ENODEV;
    }
    rcu_read_unlock();

    return count;
}

static const struct file_operations my_fops = {
    .write = write_pid,
};

static struct kprobe kp;

static int __init myinit(void)
{
    int ret;
    char *argv[] = {USERMODE_EXEC_PATH, NULL };
    char *envp[] = {"PATH=/sbin:/usr/sbin:/bin:/usr/bin", NULL };

    printk("__Kprobekill module__\n");
    printk("target_taskname:            %s\n", target_taskname);
    printk("target_syscall:             %s\n", target_syscall);
    printk("myoffset (inside call):     %d\n", myoffset);
    printk("myskipcount (before signal):%d\n", myskipcount);
    printk("Signal sent to Target Progr:%s\n", "SIGTERM");

    kp.pre_handler = Pre_Handler;
    kp.post_handler = Post_Handler;
    kp.addr = (kprobe_opcode_t *) kallsyms_lookup_name( target_syscall);
    //kprobe_lookup_name( target_syscall, kp.addr);
    kp.addr += myoffset;
    //kp.addr = (kprobe_opcode_t *)0xffffffff815950a0; 

    memset(&info, 0, sizeof(struct siginfo));
    info.si_signo = SIG_TEST;
    info.si_code = SI_QUEUE;
    
    /*  we need to know the pid of the user space process
     *  -> we use debugfs for this. So 1st create the debugfs file,
     *  then set the userspace_task_pid to 0,
     *  exec the userspace, wait for successful exec,
     *  and block on it to write its PID to debugfs (w/in timeout?),
     *  Finally, get the task struct for that PID in order to deliver signals to it.
     *  If any of these steps in this module init chain fail, module must
     *  exit because it depends on that userspace process.
     *  Of course, since that userspace can be killed or crash, we must continue
     *  to not assume it is still there on subsequent transactions, and
     *  restart it, or module-exit if userspace is gone.
    */
    file = debugfs_create_file("signalconfpid", 0200, NULL, NULL, &my_fops);

    userspace_task_pid = 0;
    userspace_task = NULL;   

    printk("usermodehelper: init -");
    ret = call_usermodehelper(USERMODE_EXEC_PATH, argv, envp, UMH_WAIT_EXEC);
    if (ret != 0)
        printk(" error: %i\n", ret);
    else
        printk(" success\n");

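    ret = register_kprobe(&kp);
    if (ret < 0) {  // e.g. target_syscall did not resolve to an address
        printk("register_kprobe failed: %d\n", ret);
        debugfs_remove(file);
        return ret;
    }
    printk("`%s` probe inserted on `%s` tasks\n", target_syscall, target_taskname);

    return 0;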
}

static void __exit myexit(void)
{
    // same PID -- different signal
    if(userspace_task != NULL){
        info.si_signo = SIGTERM;
        info.si_int = 1234; // no msg to deliver except "bye-bye"
        send_sig_info( SIGTERM, &info, userspace_task);    //send the signal
    }
    // could've done this earlier?
    if( file != NULL)
        debugfs_remove(file);

    unregister_kprobe(&kp);
    printk("`%s` probe removed for `%s` tasks\n", target_syscall, target_taskname);
}

module_init(myinit);
module_exit(myexit);
MODULE_AUTHOR("Pablo");
//MODULE_AUTHOR("Manoj");
MODULE_DESCRIPTION("KPROBE MODULE");
MODULE_LICENSE("GPL");

and userspace:

/*
 * TODO:
 * 1. SIGTERM hndlr for clean shutdown
 * 2. process forks up to MAX_PROCS2TIME for timing workers
 * 3. circular FIFO pid2term of size MAX_PROCS2TIME
 * 4. does SIG_TEST really have to be hard-coded arbitrarily?
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdbool.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <syslog.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

#define MY_SYSCONF_PATH     "/sys/kernel/debug/signalconfpid" 
#define MY_SYSCONF_MAX_MSG  10
#define MAX_PROCS2TIME      10
#define SIG_TEST            44 /* hard-coded since SIGRTMIN is different in user and in kernel space */ 

#define MY_ERR(s)           syslog(LOG_NOTICE,s": %s",strerror( errno))

unsigned long getMicrotime(void);

volatile sig_atomic_t pid2term = 0;    /* written by the signal handler, read by main() */

void receiveData( int n, siginfo_t *info, void *unused)
{
    pid2term = info->si_int;
}

int main( int argc, char **argv )
{
	int configfd;
	char buf[ MY_SYSCONF_MAX_MSG];
	struct sigaction sig;

	/* setup the signal handler for SIG_TEST before the kernel learns our pid,
 	 * otherwise an early signal would hit the default action (terminate)
 	 * SA_SIGINFO -> we want the signal handler function with 3 arguments
 	 */
	memset( &sig, 0, sizeof(sig));
	sigemptyset( &sig.sa_mask);
	sig.sa_sigaction = receiveData;
	sig.sa_flags = SA_SIGINFO;
	if( sigaction(SIG_TEST, &sig, NULL) < 0) {
		MY_ERR( "sigaction");
		return -1;
	}

	/* kernel needs to know our pid to be able to send us a signal ->
 	 * we use debugfs for this -> do not forget to mount the debugfs!
 	 */
	configfd = open( MY_SYSCONF_PATH, O_WRONLY);
	if( configfd < 0) {
		MY_ERR( "open");
		return -1;
	}
	snprintf( buf, sizeof(buf), "%i", getpid());
	if ( write( configfd, buf, strlen(buf) + 1) < 0) {
		MY_ERR( "write");
		return -1;
	}
	close( configfd);

    /* for now copying one int to another, pid2term => pid2term_cpy 
     * as an "atomic" way to "safely" handle subsequent interrupts
     * part-way through a timing op
     */
    int pid2term_cpy;
    unsigned long b4, diff, sec;

    syslog(LOG_NOTICE, "entering wait loop");

    while( true){

        pause();

        if( pid2term != 0){            
            pid2term_cpy = pid2term;
            pid2term = 0;
            syslog(LOG_NOTICE, "Sending SIGTERM to PID %d\n", pid2term_cpy);
            b4 = getMicrotime();            
            kill( pid2term_cpy, SIGTERM);            
            while( kill( pid2term_cpy, 0) != -1)
                usleep( 1);
            if( errno != ESRCH) continue;
            diff = getMicrotime() - b4;
            sec = diff / 1000000UL;
            syslog(LOG_NOTICE, "PID %d: time2term was %lus,%luus\n"
                    , pid2term_cpy, sec, diff % 1000000UL);
        }
    }
	return 0;
}

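unsigned long getMicrotime(void)
{
    struct timeval currentTime;
    gettimeofday(&currentTime, NULL);
    /* unsigned arithmetic: differences stay correct even if the product wraps */
    return (unsigned long)currentTime.tv_sec * 1000000UL + (unsigned long)currentTime.tv_usec;
}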

What did I learn?

  • My hybrid K&R/Stroustrup/Allman-8/Whitesmiths/Horstmann/Ratliff/Lisp indentation style is WAY-wrong and could result in permanent, irreparable damage to the universe
  • pass module params to module
  • share params between kernel and userspace using debugfs
  • real-time signals kernel to userspace
  • Linux KProbes!
  • some things strace can do/not-do
  • launch a userspace helper process from the kernel (call_usermodehelper)
  • write log output from the kernel (printk) and from userspace (syslog)
  • receive a parameter through a signal

External Links:
https://opensourceforu.com/2011/04/kernel-debugging-using-kprobe-and-jprobe/
https://www.ibm.com/developerworks/library/l-kprobes/index.html
https://blog.tanelpoder.com/2013/02/21/peeking-into-linux-kernel-land-using-proc-filesystem-for-quickndirty-troubleshooting/
http://www.vantagepoint.sg/blog/82-hooking-android-system-calls-for-pleasure-and-benefit
http://manpages.ubuntu.com/manpages/artful/man8/kprobe-perf.8.html
https://www.scalingphpbook.com/blog/2016/04/03/my-favorite-simple-php-debugging-tool.html
https://jlmedina123.wordpress.com/2013/08/13/current-variable-and-task_struct/
http://www.xml.com/ldd/chapter/book/ch02.html#t6
https://lwn.net/Articles/288056/
https://linux-kernel-labs.github.io/master/labs/kernel_modules.html#
https://syscalls.kernelgrok.com/
https://www.gnu.org/software/libc/manual/html_node/Atomic-Types.html
https://www.wikitechy.com/tutorials/linux/linux-api-to-list-running-processes
https://idea.popcount.org/2012-12-11-linux-process-states/
https://eklitzke.org/uninterruptible-sleep
https://stackoverflow.com/questions/223644/what-is-an-uninterruptable-process
http://stupefydeveloper.blogspot.com/2009/06/linux-creation-of-new-process.html
http://jkukunas.blogspot.com/2010/05/x86-linux-networking-system-calls.html
https://jvns.ca/blog/2016/01/18/guessing-linux-kernel-registers/
https://lwn.net/Articles/604515/
https://stackoverflow.com/questions/9782660/using-ptrace-to-find-out-what-exactly-does-the-arguments-signify-for-a-system-ca?rq=1
https://stackoverflow.com/questions/44612136/how-to-simulate-hung-task-in-linux/44612553#44612553
https://tuxthink.blogspot.com/2012/05/module-to-print-open-files-of-process.html
