From b8430a39b5138998dca22e7cd3b7962272838881 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=E5=AF=BB=E8=A7=85?=
Date: Tue, 13 Aug 2024 09:24:26 +0800
Subject: [PATCH] =?UTF-8?q?[=E6=9B=B4=E6=96=B0]=20=E6=B7=BB=E5=8A=A0md?=
 =?UTF-8?q?=E6=A0=BC=E5=BC=8F=E6=96=87=E6=A1=A3?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 nhmk.md | 2645 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 2645 insertions(+)
 create mode 100644 nhmk.md

diff --git a/nhmk.md b/nhmk.md
new file mode 100644
index 0000000..95d39f8
--- /dev/null
+++ b/nhmk.md
@@ -0,0 +1,2645 @@
+![image](assets/cover-with-names.png)
+
+Introduction
+============
+
+The Linux Kernel Module Programming Guide is a free book; you may
+reproduce and/or modify it under the terms of the [Open Software
+License](https://opensource.org/licenses/OSL-3.0), version 3.0.
+
+This book is distributed in the hope that it will be useful, but
+without any warranty, without even the implied warranty of
+merchantability or fitness for a particular purpose.
+
+The author encourages wide distribution of this book for personal or
+commercial use, provided the above copyright notice remains intact and
+the method adheres to the provisions of the [Open Software
+License](https://opensource.org/licenses/OSL-3.0). In summary, you may
+copy and distribute this book free of charge or for a profit. No
+explicit permission is required from the author for reproduction of this
+book in any medium, physical or electronic.
+
+Derivative works and translations of this document must be placed under
+the Open Software License, and the original copyright notice must remain
+intact. If you have contributed new material to this book, you must make
+the material and source code available for your revisions. Please make
+revisions and updates available directly to the document maintainer, Jim
+Huang <jserv@ccns.ncku.edu.tw>.
This will allow for the merging of
+updates and provide consistent revisions to the Linux community.
+
+If you publish or distribute this book commercially, donations,
+royalties, and/or printed copies are greatly appreciated by the author
+and the [Linux Documentation Project](https://tldp.org/) (LDP).
+Contributing in this way shows your support for free software and the
+LDP. If you have questions or comments, please contact the address
+above.
+
+Authorship
+----------
+
+The Linux Kernel Module Programming Guide was initially authored by Ori
+Pomerantz for Linux v2.2. As the Linux kernel evolved, Ori’s
+availability to maintain the document diminished. Consequently, Peter
+Jay Salzman assumed the role of maintainer and updated the guide for
+Linux v2.4. Similar constraints arose for Peter when tracking
+developments in Linux v2.6, leading to Michael Burian joining as a
+co-maintainer to bring the guide up to speed with Linux v2.6. Bob
+Mottram contributed to the guide by updating examples for Linux v3.8 and
+later. Jim Huang then undertook the task of updating the guide for
+recent Linux versions (v5.0 and beyond), along with revising the LaTeX
+document.
+
+Acknowledgements
+----------------
+
+The following people have contributed corrections or good suggestions:
+
+Amit Dhingra, Andy Shevchenko, Arush Sharma, Benno Bielmeier, Bob Lee,
+Brad Baker, Che-Chia Chang, Cheng-Shian Yeh, Chih-En Lin, Chih-Hsuan
+Yang, Chih-Yu Chen, Ching-Hua (Vivian) Lin, Chin Yik Ming, cvvletter,
+Cyril Brulebois, Daniele Paolo Scarpazza, David Porter, demonsome, Dimo
+Velev, Ekang Monyet, Ethan Chan, Francois Audeon, Gilad Reti,
+heartofrain, Horst Schirmeier, Hsin-Hsiang Peng, Ignacio Martin, I-Hsin
+Cheng, Iûnn Kiàn-îng, Jian-Xing Wu, Johan Calle, keytouch, Kohei Otsuka,
+Kuan-Wei Chiu, manbing, Marconi Jiang, mengxinayan, Meng-Zong Tsai,
+Peter Lin, Roman Lakeev, Sam Erickson, Shao-Tse Hung, Shih-Sheng Yang,
+Stacy Prowell, Steven Lung, Tristan Lelong, Tse-Wei Lin, Tucker Polomik,
+Tyler Fanelli, VxTeemo, Wei-Hsin Yeh, Wei-Lun Tsai, Xatierlike Lee,
+Yen-Yu Chen, Yin-Chiuan Chen, Yi-Wei Lin, Yo-Jung Lin, Yu-Hsiang Tseng,
+YYGO.
+
+What Is A Kernel Module?
+------------------------
+
+Developing Linux kernel modules requires a solid grounding in the C
+programming language and experience writing conventional programs that
+run as processes. This is a domain where a single stray pointer can
+wipe out an entire file system and leave you with no option but to
+reboot.
+
+A Linux kernel module is a piece of code that can be dynamically loaded
+into and unloaded from the kernel as needed. Modules extend the
+kernel’s capabilities without requiring a system reboot. A notable
+example is the device driver module, which lets the kernel interact
+with hardware attached to the system. In the absence of modules, the
+prevailing approach leans toward monolithic kernels, requiring direct
+integration of new functionalities into the kernel image.
This approach leads to larger kernels and necessitates
+kernel rebuilding and subsequent system rebooting when new
+functionalities are desired.
+
+Kernel module package
+---------------------
+
+Linux distributions provide the commands `modprobe`, `insmod` and
+`depmod` within a package.
+
+On Ubuntu/Debian GNU/Linux:
+
+    sudo apt-get install build-essential kmod
+
+On Arch Linux:
+
+    sudo pacman -S gcc kmod
+
+What Modules are in my Kernel?
+------------------------------
+
+To discover what modules are already loaded within your current kernel
+use the command `lsmod`.
+
+    sudo lsmod
+
+The list of loaded modules is exposed through the file `/proc/modules`,
+so you can also see them with:
+
+    sudo cat /proc/modules
+
+This can be a long list, and you might prefer to search for something
+particular. To search for the `fat` module:
+
+    sudo lsmod | grep fat
+
+Is there a need to download and compile the kernel?
+---------------------------------------------------
+
+No, you do not have to download or compile a kernel to follow this
+guide. It is nonetheless prudent to run the examples within a test
+distribution on a virtual machine, so that a mistake cannot disrupt
+your main system.
+
+Before We Begin
+---------------
+
+Before delving into code, a few matters require attention. Everyone’s
+system is different and everyone has their own way of working, so
+getting the first “hello world” module to compile and load correctly
+can sometimes be tricky. Rest assured that once you are past this
+initial obstacle, things go much more smoothly.
+
+1. Modversioning. A module compiled for one kernel will not load if a
+   different kernel is booted, unless `CONFIG_MODVERSIONS` is
+   enabled in the kernel. Module versioning will be discussed later in
+   this guide.
Until module versioning is covered, the examples in this
+   guide may not work correctly if running a kernel with modversioning
+   turned on. Note that most stock Linux distribution kernels come with
+   modversioning enabled. If difficulties arise when loading the
+   modules due to versioning errors, consider compiling a kernel with
+   modversioning turned off.
+
+2. Using X Window System. It is highly recommended to extract, compile,
+   and load all the examples discussed in this guide from a console.
+   Working on these tasks within the X Window System is discouraged.
+
+   Modules cannot directly print to the screen like `printf()` can, but
+   they can log information and warnings that are eventually displayed
+   on the screen, specifically within a console. If a module is loaded
+   from an `xterm`, the information and warnings will be logged, but
+   solely within the systemd journal. These logs will not be visible
+   unless you consult `journalctl`. Refer to chapter 4, Hello World,
+   for more information. For instant access to this information, it is
+   advisable to perform all tasks from the console.
+
+3. SecureBoot. Many modern computers ship with UEFI SecureBoot
+   enabled, a security standard that ensures the machine boots only
+   software trusted by the original equipment manufacturer. Certain
+   Linux distributions even ship with the default Linux kernel
+   configured to support SecureBoot. In these cases, kernel modules
+   must be signed with a trusted key.
+
+   Failing this, an attempt to insert your first “hello world” module
+   would result in the message: “*ERROR: could not insert module*”. If
+   the message *Lockdown: insmod: unsigned module loading is
+   restricted; see man kernel lockdown.7* appears in the `dmesg`
+   output, the simplest approach is to disable UEFI SecureBoot
+   from the boot menu of your PC or laptop, which allows the
+   “hello world” module to be inserted.
The alternative
+   is to go through the more intricate procedure of generating keys,
+   installing them as system keys, and signing the module. That
+   process is less appropriate for beginners. If you are interested,
+   more detailed steps for
+   [SecureBoot](https://wiki.debian.org/SecureBoot) are available.
+
+Headers
+=======
+
+Before building anything, it is necessary to install the header files
+for the kernel.
+
+On Ubuntu/Debian GNU/Linux:
+
+    sudo apt-get update
+    apt-cache search linux-headers-$(uname -r)
+
+The commands above list the kernel header packages that are available.
+Then, for example:
+
+    sudo apt-get install kmod linux-headers-5.4.0-80-generic
+
+On Arch Linux:
+
+    sudo pacman -S linux-headers
+
+On Fedora:
+
+    sudo dnf install kernel-devel kernel-headers
+
+Examples
+========
+
+All the examples from this document are available within the `examples`
+subdirectory.
+
+Should compile errors occur, it may be due to a more recent kernel
+version being in use, or there might be a need to install the
+corresponding kernel header files.
+
+Hello World
+===========
+
+The Simplest Module
+-------------------
+
+Most people learning to program start with some variant of a *hello
+world* example. It is unclear what becomes of those who break with this
+tradition, but it seems prudent to adhere to it. We will begin with a
+series of hello world programs that illustrate various fundamental
+aspects of writing a kernel module.
+
+Presented next is the simplest possible module.
+
+Make a test directory:
+
+    mkdir -p ~/develop/kernel/hello-1
+    cd ~/develop/kernel/hello-1
+
+Paste this into your favorite editor and save it as `hello-1.c`:
+
+Now you will need a `Makefile`. If you copy and paste this, change the
+indentation to use *tabs*, not spaces.
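
For orientation, here is a minimal sketch of what `hello-1.c` and its `Makefile` might look like. It is an assumption pieced together from what this chapter describes (an `init_module()` that logs a message and returns 0, a `cleanup_module()`, and the `<linux/module.h>` and `<linux/printk.h>` includes) and from the two-module Makefile shown later in this chapter. The log messages are illustrative only. It builds only against installed kernel headers, not as an ordinary user-space program.

```c
/*
 * hello-1.c - sketch of the simplest kernel module (illustrative only).
 *
 * A matching Makefile, modelled on the one shown later in this chapter
 * (the two recipe lines must start with a tab):
 *
 *   obj-m += hello-1.o
 *
 *   PWD := $(CURDIR)
 *
 *   all:
 *           make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
 *
 *   clean:
 *           make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
 */
#include <linux/module.h> /* Needed by all modules */
#include <linux/printk.h> /* Needed for pr_info() */

int init_module(void)
{
	pr_info("Hello world 1.\n");

	/* A non-zero return means init_module failed; the module is not loaded. */
	return 0;
}

void cleanup_module(void)
{
	pr_info("Goodbye world 1.\n");
}

MODULE_LICENSE("GPL");
```

The `init_module()` and `cleanup_module()` names are the classic entry and exit points discussed in the rest of this chapter.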
+
+In the `Makefile`, `$(CURDIR)` is set to the absolute pathname of the
+current working directory (after all `-C` options are processed, if any).
+See more about `CURDIR` in the [GNU make
+manual](https://www.gnu.org/software/make/manual/make.html).
+
+And finally, just run `make` directly.
+
+    make
+
+If there is no `PWD := $(CURDIR)` statement in the Makefile, the build
+may fail when run with `sudo make`, because some environment variables
+are controlled by the security policy and are not inherited. The
+default security policy is `sudoers`. In the `sudoers` security policy,
+`env_reset` is enabled by default, which restricts environment
+variables. Specifically, path variables are not retained from the user
+environment; they are set to default values (for more information see the
+[sudoers manual](https://www.sudo.ws/docs/man/sudoers.man/)). You can
+see the environment variable settings by:
+
+    $ sudo -s
+    # sudo -V
+
+Here is a simple Makefile as an example to demonstrate the problem
+mentioned above.
+
+\\begin{code}
+all:
+	echo $(PWD)
+\\end{code}
+
+Then, we can use the \\verb|-p| flag to print out the environment variable values from the Makefile.
+
+\\begin{verbatim}
+$ make -p | grep PWD
+PWD = /home/ubuntu/temp
+OLDPWD = /home/ubuntu
+	echo $(PWD)
+\\end{verbatim}
+
+The \\verb|PWD| variable won't be inherited with \\verb|sudo|.
+
+\\begin{verbatim}
+$ sudo make -p | grep PWD
+	echo $(PWD)
+\\end{verbatim}
+
+However, there are three ways to solve this problem.
+
+\\begin{enumerate}
+  \\item {
+    You can use the \\verb|-E| flag to temporarily preserve them.
+
+    \\begin{codebash}
+    $ sudo -E make -p | grep PWD
+    PWD = /home/ubuntu/temp
+    OLDPWD = /home/ubuntu
+    echo $(PWD)
+    \\end{codebash}
+  }
+
+  \\item {
+    You can disable \\verb|env_reset| by editing \\verb|/etc/sudoers| as root with \\verb|visudo|.
+
+    \\begin{code}
+    \#\# sudoers file.
+    \#\#
+    ...
+    Defaults env_reset
+    \#\# Change env_reset to !env_reset in previous line to keep all environment variables
+    \\end{code}
+
+    Then execute \\verb|env| and \\verb|sudo env| individually.
+
+    \\begin{codebash}
+    \# disable the env_reset
+    echo "user:" > non-env_reset.log; env >> non-env_reset.log
+    echo "root:" >> non-env_reset.log; sudo env >> non-env_reset.log
+    \# enable the env_reset
+    echo "user:" > env_reset.log; env >> env_reset.log
+    echo "root:" >> env_reset.log; sudo env >> env_reset.log
+    \\end{codebash}
+
+    You can view and compare these logs to find differences between \\verb|env_reset| and \\verb|!env_reset|.
+  }
+
+  \\item {You can preserve environment variables by appending them to \\verb|env_keep| in \\verb|/etc/sudoers|.
+
+    \\begin{code}
+    Defaults env_keep += "PWD"
+    \\end{code}
+
+    After applying the above change, you can check the environment variable settings by:
+
+    \\begin{verbatim}
+    $ sudo -s
+    \# sudo -V
+    \\end{verbatim}
+  }
+\\end{enumerate}
+
+If all goes smoothly you should then find that you have a compiled \\verb|hello-1.ko| module.
+You can find info on it with the command:
+\\begin{codebash}
+modinfo hello-1.ko
+\\end{codebash}
+
+At this point the command:
+\\begin{codebash}
+sudo lsmod | grep hello
+\\end{codebash}
+
+should return nothing.
+You can try loading your shiny new module with:
+\\begin{codebash}
+sudo insmod hello-1.ko
+\\end{codebash}
+
+The dash character will get converted to an underscore, so when you again try:
+\\begin{codebash}
+sudo lsmod | grep hello
+\\end{codebash}
+
+You should now see your loaded module. It can be removed again with:
+\\begin{codebash}
+sudo rmmod hello_1
+\\end{codebash}
+
+Notice that the dash was replaced by an underscore.
+To see what just happened in the logs:
+\\begin{codebash}
+sudo journalctl --since "1 hour ago" | grep kernel
+\\end{codebash}
+
+You now know the basics of creating, compiling, installing and removing modules.
+Now for more of a description of how this module works.
+
+Kernel modules must have at least two functions: a "start" (initialization) function called \\cpp|init_module()| which is called when the module is \\sh|insmod|ed into the kernel, and an "end" (cleanup) function called \\cpp|cleanup_module()| which is called just before it is removed from the kernel.
+Actually, things have changed starting with kernel 2.3.13.
+You can now use whatever name you like for the start and end functions of a module, and you will learn how to do this in Section \\ref{hello_\*-\*-\*_n_\*-\*-\*_goodbye}.
+In fact, the new method is the preferred method.
+However, many people still use \\cpp|init_module()| and \\cpp|cleanup_module()| for their start and end functions.
+
+Typically, \\cpp|init_module()| either registers a handler for something with the kernel, or it replaces one of the kernel functions with its own code (usually code to do something and then call the original function).
+The \\cpp|cleanup_module()| function is supposed to undo whatever \\cpp|init_module()| did, so the module can be unloaded safely.
+
+Lastly, every kernel module needs to include \\verb|<linux/module.h>|.
+We needed to include \\verb|<linux/printk.h>| only for the macro expansion for the \\cpp|pr_alert()| log level, which you'll learn about in Section \\ref{sec:printk}.
+
+\\begin{enumerate}
+  \\item A point about coding style.
+  Another thing which may not be immediately obvious to anyone getting started with kernel programming is that indentation within your code should be using \\textbf{tabs} and \\textbf{not spaces}.
+  It is one of the coding conventions of the kernel.
+  You may not like it, but you'll need to get used to it if you ever submit a patch upstream.
+
+  \\item Introducing print macros.
+  \\label{sec:printk}
+  In the beginning there was \\cpp|printk|, usually followed by a priority such as \\cpp|KERN_INFO| or \\cpp|KERN_DEBUG|.
+  More recently this can also be expressed in abbreviated form using a set of print macros, such as \\cpp|pr_info| and \\cpp|pr_debug|.
+  This just saves some mindless keyboard bashing and looks a bit neater.
+  They can be found within \\href{https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/printk.h}{include/linux/printk.h}.
+  Take time to read through the available priority macros.
+
+  \\item About Compiling.
+  Kernel modules need to be compiled a bit differently from regular userspace apps.
+  Former kernel versions required us to care much about these settings, which are usually stored in Makefiles.
+  Although hierarchically organized, many redundant settings accumulated in sublevel Makefiles and made them large and rather difficult to maintain.
+  Fortunately, there is a new way of doing these things, called kbuild, and the build process for external loadable modules is now fully integrated into the standard kernel build mechanism.
+  To learn more on how to compile modules which are not part of the official kernel (such as all the examples you will find in this guide), see file \\href{https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/kbuild/modules.rst}{Documentation/kbuild/modules.rst}.
+
+  Additional details about Makefiles for kernel modules are available in \\href{https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/kbuild/makefiles.rst}{Documentation/kbuild/makefiles.rst}.
Be sure to read this and the related files before starting to hack Makefiles. It will probably save you lots of work.
+
+\\begin{quote}
+Here is another exercise for the reader.
+See that comment above the return statement in \\cpp|init_module()|?
+Change the return value to something negative, recompile and load the module again.
+What happens?
+\\end{quote}
+\\end{enumerate}
+
+\\subsection{Hello and Goodbye}
+\\label{hello_\*-\*-\*_n_\*-\*-\*_goodbye}
+In early kernel versions you had to use the \\cpp|init_module| and \\cpp|cleanup_module| functions, as in the first hello world example, but these days you can name those anything you want by using the \\cpp|module_init| and \\cpp|module_exit| macros.
+These macros are defined in \\href{https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/module.h}{include/linux/module.h}.
+The only requirement is that your init and cleanup functions must be defined before calling those macros, otherwise you'll get compilation errors.
+Here is an example of this technique:
+
+\\samplec{examples/hello-2.c}
+
+So now we have two real kernel modules under our belt. Adding another module is as simple as this:
+
+\\begin{code}
+obj-m += hello-1.o
+obj-m += hello-2.o
+
+PWD := $(CURDIR)
+
+all:
+	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
+
+clean:
+	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
+\\end{code}
+
+Now have a look at
+[drivers/char/Makefile](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/char/Makefile)
+for a real world example. As you can see, some things got hardwired into
+the kernel (`obj-y`) but where have all those `obj-m` gone? Those
+familiar with shell scripts will easily be able to spot them.
For those
+who are not, the `obj-$(CONFIG_FOO)` entries you see everywhere
+expand into `obj-y` or `obj-m`, depending on whether the
+`CONFIG_FOO` variable has been set to `y` or `m`. While we are at
+it, those were exactly the kind of variables that you have set in the
+`.config` file in the top-level directory of the Linux kernel source tree,
+the last time you ran `make menuconfig` or something like that.
+
+The __init and __exit Macros
+----------------------------
+
+The `__init` macro causes the init function to be
+discarded and its memory freed once the init function finishes for
+built-in drivers, but not loadable modules. If you think about when the
+init function is invoked, this makes perfect sense.
+
+There is also an `__initdata` which works similarly to
+`__init` but for init variables rather than functions.
+
+The `__exit` macro causes the omission of the function
+when the module is built into the kernel, and like
+`__init`, has no effect for loadable modules. Again,
+if you consider when the cleanup function runs, this makes complete
+sense; built-in drivers do not need a cleanup function, while loadable
+modules do.
+
+These macros are defined in
+[include/linux/init.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/init.h)
+and serve to free up kernel memory. When you boot your kernel and see
+something like `Freeing unused kernel memory: 236k freed`, this is
+precisely what the kernel is freeing.
+
+Licensing and Module Documentation
+----------------------------------
+
+Honestly, who loads or even cares about proprietary modules? If you do
+then you might have seen something like this:
+
+    $ sudo insmod xxxxxx.ko
+    loading out-of-tree module taints kernel.
+    module license 'unspecified' taints kernel.
+
+You can use a few macros to indicate the license for your module.
Some
+examples are "GPL", "GPL v2", "GPL and additional rights", "Dual
+BSD/GPL", "Dual MIT/GPL", "Dual MPL/GPL" and "Proprietary". They are
+defined within
+[include/linux/module.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/module.h).
+
+To state which license you are using, a macro called
+`MODULE_LICENSE` is available. This and a few other macros describing the
+module are illustrated in the example below.
+
+Passing Command Line Arguments to a Module
+------------------------------------------
+
+Modules can take command line arguments, but not with the argc/argv you
+might be used to.
+
+To allow arguments to be passed to your module, declare the variables
+that will take the values of the command line arguments as global and
+then use the `module_param()` macro (defined in
+[include/linux/moduleparam.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/moduleparam.h))
+to set the mechanism up. At runtime, `insmod` will fill the variables
+with any command line arguments that are given, like `insmod mymodule.ko
+myvariable=5`. The variable declarations and macros should be placed at
+the beginning of the module for clarity. The example code should clear
+up my admittedly lousy explanation.
+
+The `module_param()` macro takes 3 arguments: the name of the
+variable, its type and permissions for the corresponding file in sysfs.
+Integer types can be signed as usual or unsigned. If you’d like to use
+arrays of integers or strings see
+`module_param_array()` and
+`module_param_string()`.
+
+    int myint = 3;
+    module_param(myint, int, 0);
+
+Arrays are supported too, but things are a bit different now than they
+were in the olden days. To keep track of the number of parameters you
+need to pass a pointer to a count variable as third parameter. At your
+option, you could also ignore the count and pass `NULL` instead.
We show
+both possibilities here:
+
+    int myintarray[2];
+    module_param_array(myintarray, int, NULL, 0); /* not interested in count */
+
+    short myshortarray[4];
+    int count;
+    module_param_array(myshortarray, short, &count, 0); /* put count into "count" variable */
+
+A good use for this is to have the module variable’s default values set,
+like a port or IO address. If the variables contain the default values,
+then perform autodetection (explained elsewhere). Otherwise, keep the
+current value. This will be made clear later on.
+
+Lastly, there is a macro function,
+`MODULE_PARM_DESC()`, that is used to document
+arguments that the module can take. It takes two parameters: a variable
+name and a free form string describing that variable.
+
+It is recommended to experiment with the following code:
+
+    $ sudo insmod hello-5.ko mystring="bebop" myintarray=-1
+    $ sudo dmesg -t | tail -7
+    myshort is a short integer: 1
+    myint is an integer: 420
+    mylong is a long integer: 9999
+    mystring is a string: bebop
+    myintarray[0] = -1
+    myintarray[1] = 420
+    got 1 arguments for myintarray.
+
+    $ sudo rmmod hello-5
+    $ sudo dmesg -t | tail -1
+    Goodbye, world 5
+
+    $ sudo insmod hello-5.ko mystring="supercalifragilisticexpialidocious" myintarray=-1,-1
+    $ sudo dmesg -t | tail -7
+    myshort is a short integer: 1
+    myint is an integer: 420
+    mylong is a long integer: 9999
+    mystring is a string: supercalifragilisticexpialidocious
+    myintarray[0] = -1
+    myintarray[1] = -1
+    got 2 arguments for myintarray.
+
+    $ sudo rmmod hello-5
+    $ sudo dmesg -t | tail -1
+    Goodbye, world 5
+
+    $ sudo insmod hello-5.ko mylong=hello
+    insmod: ERROR: could not insert module hello-5.ko: Invalid parameters
+
+Modules Spanning Multiple Files
+-------------------------------
+
+Sometimes it makes sense to divide a kernel module between several
+source files.
+
+Here is an example of such a kernel module.
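
As a rough sketch of such a split, the two files below put the entry function in one source file and the exit function in another. The file names `start.c` and `stop.c`, the combined object name `startstop`, and the log messages are assumptions chosen for illustration. Like any kernel module, this builds only against installed kernel headers.

```c
/*
 * start.c - first half of a two-file module (illustrative sketch).
 *
 * Assumed kbuild lines that combine both files into one module:
 *
 *   obj-m += startstop.o
 *   startstop-objs := start.o stop.o
 */
#include <linux/module.h> /* Needed by all modules */
#include <linux/printk.h> /* Needed for pr_info() */

int init_module(void)
{
	pr_info("Hello, world - this is the kernel speaking\n");
	return 0;
}

MODULE_LICENSE("GPL");
```

```c
/*
 * stop.c - second half of the same module (illustrative sketch).
 */
#include <linux/module.h> /* Needed by all modules */
#include <linux/printk.h> /* Needed for pr_info() */

void cleanup_module(void)
{
	pr_info("Short is the life of a kernel module\n");
}
```

The first kbuild line names the combined module object, and the second lists the object files that make it up.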
+
+The next file:
+
+And finally, the makefile:
+
+This is the complete makefile for all the examples we have seen so far.
+The first five lines are nothing special, but for the last example we
+will need two lines. First we invent an object name for our combined
+module, second we tell `make` what object files are part of that module.
+
+Building modules for a precompiled kernel
+-----------------------------------------
+
+We strongly suggest that you recompile your kernel, so that you
+can enable a number of useful debugging features, such as forced module
+unloading (`MODULE_FORCE_UNLOAD`): when this option is
+enabled, you can force the kernel to unload a module even when it
+believes it is unsafe, via a `sudo rmmod -f module` command. This option
+can save you a lot of time and a number of reboots during the
+development of a module. If you do not want to recompile your kernel
+then you should consider running the examples within a test distribution
+on a virtual machine. If you mess anything up then you can easily reboot
+or restore the virtual machine (VM).
+
+There are a number of cases in which you may want to load your module
+into a precompiled running kernel, such as the ones shipped with common
+Linux distributions, or a kernel you have compiled in the past. In
+certain circumstances you may need to compile and insert a module
+into a running kernel which you are not allowed to recompile, or on a
+machine that you prefer not to reboot. If you can’t think of a case that
+will force you to use modules for a precompiled kernel you might want to
+skip this and treat the rest of this chapter as a big footnote.
+
+Now, if you just install a kernel source tree, use it to compile your
+kernel module and try to insert your module into the kernel, in most
+cases you will get an error like this:
+
+    insmod: ERROR: could not insert module poet.ko: Invalid module format
+
+Less cryptic information is logged to the systemd journal:
+
+    kernel: poet: disagrees about version of symbol module_layout
+
+In other words, your kernel refuses to accept your module because
+version strings (more precisely, *version magic*, see
+[include/linux/vermagic.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/vermagic.h))
+do not match. Incidentally, version magic strings are stored in the
+module object in the form of a static string, starting with `vermagic:`.
+Version data are inserted in your module when it is linked against the
+`kernel/module.o` file. To inspect version magics and other strings
+stored in a given module, issue the command `modinfo module.ko`:
+
+    $ modinfo hello-4.ko
+    description:    A sample driver
+    author:         LKMPG
+    license:        GPL
+    srcversion:     B2AA7FBFCC2C39AED665382
+    depends:
+    retpoline:      Y
+    name:           hello_4
+    vermagic:       5.4.0-70-generic SMP mod_unload modversions
+
+To overcome this problem we could resort to the `--force-vermagic`
+option, but this solution is potentially unsafe, and unquestionably
+unacceptable in production modules. Consequently, we want to compile our
+module in an environment identical to the one in which our
+precompiled kernel was built. How to do this is the subject of the
+remainder of this chapter.
+
+First of all, make sure that a kernel source tree is available, having
+exactly the same version as your current kernel. Then, find the
+configuration file which was used to compile your precompiled kernel.
+Usually, this is available in your current `boot` directory, under a
+name like `config-5.14.x`. You may just want to copy it to your kernel
+source tree: `cp /boot/config-$(uname -r) .config`.
+
+Let’s focus again on the previous error message: a closer look at the
+version magic strings suggests that, even with two configuration files
+which are exactly the same, a slight difference in the version magic
+could be possible, and it is sufficient to prevent insertion of the
+module into the kernel. That slight difference, namely the custom string
+which appears in the module’s version magic and not in the kernel’s one,
+is due to a modification with respect to the original, in the makefile
+that some distributions include. Then, examine your `Makefile`, and make
+sure that the specified version information matches exactly the one used
+for your current kernel. For example, your makefile could start as
+follows:
+
+    VERSION = 5
+    PATCHLEVEL = 14
+    SUBLEVEL = 0
+    EXTRAVERSION = -rc2
+
+In this case, you need to restore the value of symbol **EXTRAVERSION**
+to **-rc2**. We suggest keeping a backup copy of the makefile used to
+compile your kernel available in `/lib/modules/5.14.0-rc2/build`. A
+simple command like the following should suffice.
+
+    cp /lib/modules/$(uname -r)/build/Makefile linux-$(uname -r)
+
+Here `linux-$(uname -r)` is the Linux kernel source you are attempting to
+build.
+
+Now, please run `make` to update configuration and version headers and
+objects:
+
+    $ make
+      SYNC    include/config/auto.conf.cmd
+      HOSTCC  scripts/basic/fixdep
+      HOSTCC  scripts/kconfig/conf.o
+      HOSTCC  scripts/kconfig/confdata.o
+      HOSTCC  scripts/kconfig/expr.o
+      LEX     scripts/kconfig/lexer.lex.c
+      YACC    scripts/kconfig/parser.tab.[ch]
+      HOSTCC  scripts/kconfig/preprocess.o
+      HOSTCC  scripts/kconfig/symbol.o
+      HOSTCC  scripts/kconfig/util.o
+      HOSTCC  scripts/kconfig/lexer.lex.o
+      HOSTCC  scripts/kconfig/parser.tab.o
+      HOSTLD  scripts/kconfig/conf
+
+If you do not desire to actually compile the kernel, you can interrupt
+the build process (CTRL-C) just after the SPLIT line, because at that
+time, the files you need are ready.
Now you can turn back to the
+directory of your module and compile it: It will be built exactly
+according to your current kernel settings, and it will load into it
+without any errors.
+
+Preliminaries
+=============
+
+How modules begin and end
+-------------------------
+
+A typical program starts with a `main()` function, executes a series of
+instructions, and terminates after completing these instructions. Kernel
+modules, however, follow a different pattern. A module always begins
+with either the `init_module` function or a function designated
+by the `module_init` call. This function acts as the module’s
+entry point, informing the kernel of the module’s functionalities and
+preparing the kernel to utilize the module’s functions when necessary.
+After performing these tasks, the entry function returns, and the module
+remains inactive until the kernel requires its code.
+
+All modules conclude by invoking either `cleanup_module` or a
+function specified through the `module_exit` call. This serves
+as the module’s exit function, reversing the actions of the entry
+function by unregistering the previously registered functionalities.
+
+It is mandatory for every module to have both an entry and an exit
+function. While there are multiple methods to define these functions,
+the terms “entry function” and “exit function” are generally used.
+However, they may occasionally be referred to as `init_module`
+and `cleanup_module`, which are understood to mean the same.
+
+Functions available to modules
+------------------------------
+
+Programmers use functions they do not define all the time. A prime
+example of this is `printf()`. You use these library functions which are
+provided by the standard C library, libc.
+The definitions for these
+functions do not actually enter your program until the linking stage,
+which ensures that the code (for `printf()` for example) is available,
+and fixes the call instruction to point to that code.
+
+Kernel modules are different here, too. In the hello world example, you
+might have noticed that we used a function, `pr_info()`, but did not
+include a standard I/O library. That is because modules are object
+files whose symbols get resolved upon running `insmod` or `modprobe`.
+The definition for the symbols comes from the kernel itself; the only
+external functions you can use are the ones provided by the kernel. If
+you’re curious about what symbols have been exported by your kernel,
+take a look at `/proc/kallsyms`.
+
+One point to keep in mind is the difference between library functions
+and system calls. Library functions are higher level, run completely in
+user space and provide a more convenient interface for the programmer to
+the functions that do the real work: system calls. System calls run in
+kernel mode on the user’s behalf and are provided by the kernel itself.
+The library function `printf()` may look like a very general printing
+function, but all it really does is format the data into strings and
+write the string data using the low-level system call `write()`, which
+then sends the data to standard output.
+
+Would you like to see what system calls are made by `printf()`? It is
+easy! Compile the following program:
+
+    #include <stdio.h>
+
+    int main(void)
+    {
+        printf("hello");
+        return 0;
+    }
+
+with `gcc -Wall -o hello hello.c`. Run the executable with `strace
+./hello`. Are you impressed? Every line you see corresponds to a system
+call. [strace](https://strace.io/) is a handy program that gives you
+details about what system calls a program is making, including which
+call is made, what its arguments are and what it returns.
+It is an
+invaluable tool for figuring out things like what files a program is
+trying to access. Towards the end, you will see a line which looks like
+`write(1, "hello", 5hello)`. There it is. The face behind the `printf()`
+mask. You may not be familiar with `write`, since most people use library
+functions for file I/O (like `fopen`, `fputs`, `fclose`). If that is the
+case, try looking at `man 2 write`. The 2nd man section is devoted to
+system calls (like `kill()` and `read()`). The 3rd man section is
+devoted to library calls, which you would probably be more familiar with
+(like `cosh()` and `random()`).
+
+You can even write modules to replace the kernel’s system calls, which
+we will do shortly. Crackers often make use of this sort of thing for
+backdoors or trojans, but you can write your own modules to do more
+benign things, like have the kernel write “Tee hee, that tickles!” every
+time someone tries to delete a file on your system.
+
+User Space vs Kernel Space
+--------------------------
+
+The kernel primarily manages access to resources, be it a video card,
+hard drive, or memory. Programs frequently vie for the same resources.
+For instance, as a document is saved, updatedb might commence updating
+the locate database. Sessions in editors like vim and processes like
+updatedb can simultaneously utilize the hard drive. The kernel’s role is
+to maintain order, ensuring that users do not access resources
+indiscriminately.
+
+To manage this, CPUs operate in different modes, each offering varying
+levels of system control. The Intel 80386 architecture, for example,
+featured four such modes, known as rings. Unix, however, utilizes only
+two of these rings: the highest ring (ring 0, also known as “supervisor
+mode”, where all actions are permissible) and the lowest ring, referred
+to as “user mode”.
+
+Recall the discussion about library functions vs system calls.
+Typically, you use a library function in user mode.
+The library function
+calls one or more system calls, and these system calls execute on the
+library function’s behalf, but do so in supervisor mode since they are
+part of the kernel itself. Once the system call completes its task, it
+returns and execution gets transferred back to user mode.
+
+Name Space
+----------
+
+When you write a small C program, you use variables which are convenient
+and make sense to the reader. If, on the other hand, you are writing
+routines which will be part of a bigger program, any global variables
+you have are part of a community of other people’s global variables;
+some of the variable names can clash. When a program has lots of global
+variables which aren’t meaningful enough to be distinguished, you get
+namespace pollution. In large projects, effort must be made to remember
+reserved names, and to develop a scheme for choosing unique names for
+variables and symbols.
+
+When writing kernel code, even the smallest module will be linked
+against the entire kernel, so this is definitely an issue. The best way
+to deal with this is to declare all your variables as static and to use
+a well-defined prefix for your symbols. By convention, all kernel
+prefixes are lowercase. If you do not want to declare everything as
+static, another option is to declare a symbol table and register it with
+the kernel. We will get to this later.
+
+The file `/proc/kallsyms` holds all the symbols that the kernel knows
+about and which are therefore accessible to your modules since they
+share the kernel’s codespace.
+
+Code space
+----------
+
+Memory management is a very complicated subject and the majority of
+O’Reilly’s [Understanding The Linux
+Kernel](https://www.oreilly.com/library/view/understanding-the-linux/0596005652/)
+exclusively covers memory management! We are not setting out to be
+experts on memory management, but we do need to know a couple of facts
+to even begin worrying about writing real modules.
+
+If you have not thought about what a segfault really means, you may be
+surprised to hear that pointers do not actually point to memory
+locations. Not real ones, anyway. When a process is created, the kernel
+sets aside a portion of real physical memory and hands it to the process
+to use for its executing code, variables, stack, heap and other things
+which a computer scientist would know about. This memory begins with
+0x00000000 and extends up to whatever it needs to be. Since the memory
+spaces of any two processes do not overlap, every process that can
+access a memory address, say 0xbffff978, would be accessing a different
+location in real physical memory! The processes would be accessing an
+index named 0xbffff978 which points to some kind of offset into the
+region of memory set aside for that particular process. For the most
+part, a process like our Hello, World program can’t access the space of
+another process, although there are ways which we will talk about later.
+
+The kernel has its own space of memory as well. Since a module is code
+which can be dynamically inserted and removed in the kernel (as opposed
+to a semi-autonomous object), it shares the kernel’s codespace rather
+than having its own. Therefore, if your module segfaults, the kernel
+segfaults. And if you start writing over data because of an off-by-one
+error, then you’re trampling on kernel data (or code). This is even
+worse than it sounds, so try your best to be careful.
+
+It should be noted that the aforementioned discussion applies to any
+operating system utilizing a monolithic kernel. This concept differs
+slightly from *“building all your modules into the kernel”*, although
+the underlying principle is similar. In contrast, there are
+microkernels, where modules are allocated their own code space.
+Two
+notable examples of microkernels include the [GNU
+Hurd](https://www.gnu.org/software/hurd/) and the [Zircon
+kernel](https://fuchsia.dev/fuchsia-src/concepts/kernel) of Google’s
+Fuchsia.
+
+Device Drivers
+--------------
+
+One class of module is the device driver, which provides functionality
+for hardware like a serial port. On Unix, each piece of hardware is
+represented by a file located in `/dev` named a device file which
+provides the means to communicate with the hardware. The device driver
+provides the communication on behalf of a user program. So the es1370.ko
+sound card device driver might connect the `/dev/sound` device file to
+the Ensoniq ES1370 sound card. A userspace program like mp3blaster can
+use `/dev/sound` without ever knowing what kind of sound card is
+installed.
+
+Let’s look at some device files. Here are device files which represent
+the first three partitions on the primary master IDE hard drive:
+
+    $ ls -l /dev/hda[1-3]
+    brw-rw----  1 root  disk  3, 1 Jul  5  2000 /dev/hda1
+    brw-rw----  1 root  disk  3, 2 Jul  5  2000 /dev/hda2
+    brw-rw----  1 root  disk  3, 3 Jul  5  2000 /dev/hda3
+
+Notice the column of numbers separated by a comma. The first number is
+called the device’s major number. The second number is the minor number.
+The major number tells you which driver is used to access the hardware.
+Each driver is assigned a unique major number; all device files with the
+same major number are controlled by the same driver. All the above major
+numbers are 3, because they’re all controlled by the same driver.
+
+The minor number is used by the driver to distinguish between the
+various hardware it controls. Returning to the example above, although
+all three devices are handled by the same driver they have unique minor
+numbers because the driver sees them as being different pieces of
+hardware.
+
+Devices are divided into two types: character devices and block devices.
+The difference is that block devices have a buffer for requests, so they
+can choose the best order in which to respond to the requests. This is
+important in the case of storage devices, where it is faster to read or
+write sectors which are close to each other, rather than those which are
+further apart. Another difference is that block devices can only accept
+input and return output in blocks (whose size can vary according to the
+device), whereas character devices are allowed to use as many or as few
+bytes as they like. Most devices in the world are character devices,
+because they don’t need this type of buffering, and they don’t operate
+with a fixed block size. You can tell whether a device file is for a
+block device or a character device by looking at the first character in
+the output of `ls -l`. If it is ‘b’ then it is a block device, and if it
+is ‘c’ then it is a character device. The devices you see above are
+block devices. Here are some character devices (the serial ports):
+
+    crw-rw----  1 root  dial 4, 64 Feb 18 23:34 /dev/ttyS0
+    crw-r-----  1 root  dial 4, 65 Nov 17 10:26 /dev/ttyS1
+    crw-rw----  1 root  dial 4, 66 Jul  5  2000 /dev/ttyS2
+    crw-rw----  1 root  dial 4, 67 Jul  5  2000 /dev/ttyS3
+
+If you want to see which major numbers have been assigned, you can look
+at
+[Documentation/admin-guide/devices.txt](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/admin-guide/devices.txt).
+
+When the system was installed, all of those device files were created by
+the `mknod` command. To create a new char device named `coffee` with
+major number 12 and minor number 2, simply do `mknod /dev/coffee c 12 2`.
+You do not have to put your device files into `/dev`, but it is done by
+convention. Linus put his device files in `/dev`, and so should you.
+However, when creating a device file for testing purposes, it is
+probably OK to place it in your working directory where you compile the
+kernel module.
+Just be sure to put it in the right place when you’re
+done writing the device driver.
+
+A few final points, although implicit in the previous discussion, are
+worth stating explicitly for clarity. When a device file is accessed,
+the kernel utilizes the file’s major number to identify the appropriate
+driver for handling the access. This indicates that the kernel does not
+necessarily rely on or need to be aware of the minor number. It is the
+driver that concerns itself with the minor number, using it to
+differentiate between various pieces of hardware.
+
+It is important to note that when referring to *“hardware”*, the term is
+used in a slightly more abstract sense than just a physical PCI card
+that can be held in hand. Consider the following two device files:
+
+    $ ls -l /dev/sda /dev/sdb
+    brw-rw---- 1 root disk 8,  0 Jan  3 09:02 /dev/sda
+    brw-rw---- 1 root disk 8, 16 Jan  3 09:02 /dev/sdb
+
+By now you can look at these two device files and know instantly that
+they are block devices and are handled by the same driver (block major
+8). Sometimes two device files with the same major but different minor
+number can actually represent the same piece of physical hardware. So
+just be aware that the word “hardware” in our discussion can mean
+something very abstract.
+
+Character Device drivers
+========================
+
+The file_operations Structure
+-----------------------------
+
+The `file_operations` structure is defined in
+[include/linux/fs.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/fs.h),
+and holds pointers to functions defined by the driver that perform
+various operations on the device. Each field of the structure
+corresponds to the address of some function defined by the driver to
+handle a requested operation.
+
+For example, every character driver needs to define a function that
+reads from the device.
+The `file_operations` structure holds
+the address of the module’s function that performs that operation. Here
+is what the definition looks like for kernel 5.4:
+
+    struct file_operations {
+        struct module *owner;
+        loff_t (*llseek) (struct file *, loff_t, int);
+        ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
+        ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
+        ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
+        ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+        int (*iopoll)(struct kiocb *kiocb, bool spin);
+        int (*iterate) (struct file *, struct dir_context *);
+        int (*iterate_shared) (struct file *, struct dir_context *);
+        __poll_t (*poll) (struct file *, struct poll_table_struct *);
+        long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
+        long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
+        int (*mmap) (struct file *, struct vm_area_struct *);
+        unsigned long mmap_supported_flags;
+        int (*open) (struct inode *, struct file *);
+        int (*flush) (struct file *, fl_owner_t id);
+        int (*release) (struct inode *, struct file *);
+        int (*fsync) (struct file *, loff_t, loff_t, int datasync);
+        int (*fasync) (int, struct file *, int);
+        int (*lock) (struct file *, int, struct file_lock *);
+        ssize_t (*sendpage) (struct file *, struct page *, int, size_t,
+                             loff_t *, int);
+        unsigned long (*get_unmapped_area)(struct file *, unsigned long,
+                                           unsigned long, unsigned long,
+                                           unsigned long);
+        int (*check_flags)(int);
+        int (*flock) (struct file *, int, struct file_lock *);
+        ssize_t (*splice_write)(struct pipe_inode_info *, struct file *,
+                                loff_t *, size_t, unsigned int);
+        ssize_t (*splice_read)(struct file *, loff_t *,
+                               struct pipe_inode_info *, size_t, unsigned int);
+        int (*setlease)(struct file *, long, struct file_lock **, void **);
+        long (*fallocate)(struct file *file, int mode, loff_t offset,
+                          loff_t len);
+        void (*show_fdinfo)(struct seq_file *m, struct file *f);
+        ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
+                                   loff_t, size_t, unsigned int);
+        loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
+                                   struct file *file_out, loff_t pos_out,
+                                   loff_t len, unsigned int remap_flags);
+        int (*fadvise)(struct file *, loff_t, loff_t, int);
+    } __randomize_layout;
+
+Some operations are not implemented by a driver. For example, a driver
+that handles a video card will not need to read from a directory
+structure. The corresponding entries in the `file_operations`
+structure should be set to `NULL`.
+
+There is a gcc extension that makes assigning to this structure more
+convenient. You will see it in modern drivers, and it may catch you by
+surprise. This is what the new way of assigning to the structure looks
+like:
+
+    struct file_operations fops = {
+        read: device_read,
+        write: device_write,
+        open: device_open,
+        release: device_release
+    };
+
+However, there is also a C99 way of assigning to elements of a
+structure, [designated
+initializers](https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html),
+and this is definitely preferred over using the GNU extension.
+You
+should use this syntax in case someone wants to port your driver. It
+will help with compatibility:
+
+    struct file_operations fops = {
+        .read = device_read,
+        .write = device_write,
+        .open = device_open,
+        .release = device_release
+    };
+
+The meaning is clear, and you should be aware that any member of the
+structure which you do not explicitly assign will be initialized to
+`NULL` by gcc.
+
+An instance of `struct file_operations` containing pointers to
+functions that are used to implement `read`, `write`, `open`, … system
+calls is commonly named `fops`.
+
+Since Linux v3.14, the read, write and seek operations are guaranteed
+to be thread-safe through the `f_pos`-specific lock, which makes file
+position updates mutually exclusive. So, we can safely implement those
+operations without unnecessary locking.
+
+Additionally, since Linux v5.6, the `proc_ops` structure was
+introduced to replace the use of the `file_operations` structure when
+registering proc handlers. See section 7.1 for more information.
+
+The file structure
+------------------
+
+Each device is represented in the kernel by a file structure, which is
+defined in
+[include/linux/fs.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/fs.h).
+Be aware that a file is a kernel level structure and never appears in a
+user space program. It is not the same thing as a `FILE`, which is
+defined by glibc and would never appear in a kernel space function.
+Also, its name is a bit misleading; it represents an abstract open
+‘file’, not a file on a disk, which is represented by a structure named
+`inode`.
+
+An instance of `struct file` is commonly named `filp`. You’ll also see
+it referred to as a struct file object. Resist the temptation.
+
+Go ahead and look at the definition of file. Most of the entries you
+see, like `struct dentry`, are not used by device drivers, and you can
+ignore them.
+This is because drivers do not fill file directly; they
+only use structures contained in file which are created elsewhere.
+
+Registering A Device
+--------------------
+
+As discussed earlier, char devices are accessed through device files,
+usually located in `/dev`. This is by convention. When writing a driver,
+it is OK to put the device file in your current directory. Just make
+sure you place it in `/dev` for a production driver. The major number
+tells you which driver handles which device file. The minor number is
+used only by the driver itself to differentiate which device it is
+operating on, just in case the driver handles more than one device.
+
+Adding a driver to your system means registering it with the kernel.
+This is synonymous with assigning it a major number during the module’s
+initialization. You do this by using the `register_chrdev`
+function, defined by
+[include/linux/fs.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/fs.h).
+
+    int register_chrdev(unsigned int major, const char *name,
+                        struct file_operations *fops);
+
+Here `unsigned int major` is the major number you want to request,
+`const char *name` is the name of the device as it will appear in
+`/proc/devices` and `struct file_operations *fops` is a pointer to the
+`file_operations` table for your driver. A negative return value means
+the registration failed. Note that we didn’t pass the minor number to
+`register_chrdev`. That is because the kernel doesn’t care about the
+minor number; only our driver uses it.
+
+Now the question is, how do you get a major number without hijacking one
+that’s already in use? The easiest way would be to look through
+[Documentation/admin-guide/devices.txt](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/admin-guide/devices.txt)
+and pick an unused one. That is a bad way of doing things because you
+will never be sure if the number you picked will be assigned later.
+The
+answer is that you can ask the kernel to assign you a dynamic major
+number.
+
+If you pass a major number of 0 to `register_chrdev`, the return value
+will be the dynamically allocated major number. The downside is that
+you cannot make a device file in advance, since you do not know what
+the major number will be. There are a few ways to work around this.
+First, the driver itself can print the newly assigned number and we can
+make the device file by hand. Second, the newly registered device will
+have an entry in `/proc/devices`, and we can either make the device
+file by hand or write a shell script to read the file in and make the
+device file. The third method is that we can have our driver make the
+device file using the `device_create` function after a successful
+registration and `device_destroy` during the call to `cleanup_module`.
+
+However, `register_chrdev()` would occupy a range of minor numbers
+associated with the given major. The recommended way to reduce waste
+for char device registration is to use the cdev interface.
+
+The newer interface completes the char device registration in two
+distinct steps. First, we should register a range of device numbers,
+which can be completed with `register_chrdev_region` or
+`alloc_chrdev_region`.
+
+    int register_chrdev_region(dev_t from, unsigned count, const char *name);
+    int alloc_chrdev_region(dev_t *dev, unsigned baseminor, unsigned count,
+                            const char *name);
+
+The choice between the two functions depends on whether you know the
+major number for your device. Use `register_chrdev_region` if you know
+the device major number and `alloc_chrdev_region` if you would like the
+kernel to allocate a major number dynamically.
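
Both functions above work with `dev_t` values, which combine a major and a minor number in a single integer. The following is only a rough user-space sketch of that packing, assuming the 20-bit minor field used by the kernel's `MKDEV`, `MAJOR` and `MINOR` macros in `include/linux/kdev_t.h`. The names `sketch_dev_t`, `sketch_mkdev`, `sketch_major`, `sketch_minor` and `sketch_region_last` are hypothetical helpers for illustration, not kernel API.

```c
#include <stdint.h>

/* Assumption: mirrors MINORBITS in include/linux/kdev_t.h (20 bits). */
#define SKETCH_MINORBITS 20
#define SKETCH_MINORMASK ((1u << SKETCH_MINORBITS) - 1)

/* Hypothetical user-space stand-in for the kernel's dev_t. */
typedef uint32_t sketch_dev_t;

/* Pack a major and a minor number, in the spirit of MKDEV();
 * unlike the kernel macro, this sketch masks the minor to 20 bits. */
sketch_dev_t sketch_mkdev(unsigned int major, unsigned int minor)
{
    return ((sketch_dev_t)major << SKETCH_MINORBITS) |
           (minor & SKETCH_MINORMASK);
}

/* Extract the major number, in the spirit of MAJOR(). */
unsigned int sketch_major(sketch_dev_t dev)
{
    return dev >> SKETCH_MINORBITS;
}

/* Extract the minor number, in the spirit of MINOR(). */
unsigned int sketch_minor(sketch_dev_t dev)
{
    return dev & SKETCH_MINORMASK;
}

/* Last device number in a region of `count` consecutive minors
 * starting at `first` (count must be at least 1). */
sketch_dev_t sketch_region_last(sketch_dev_t first, unsigned int count)
{
    return first + count - 1;
}
```

Under these assumptions, `sketch_mkdev(12, 2)` corresponds to the `coffee` device created earlier with `mknod /dev/coffee c 12 2`, and a region of four minors starting at major 4, minor 64 ends at minor 67, matching the serial ports listed before.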
+
+Second, we should initialize the data structure `struct cdev` for our
+char device and associate it with the device numbers. To initialize the
+`struct cdev`, we can use a sequence similar to the following code.
+
+    struct cdev *my_cdev = cdev_alloc();
+    my_cdev->ops = &my_fops;
+
+However, the common usage pattern will embed the `struct cdev` within a
+device-specific structure of your own. In this case, we’ll need
+`cdev_init` for the initialization.
+
+    void cdev_init(struct cdev *cdev, const struct file_operations *fops);
+
+Once we finish the initialization, we can add the char device to the
+system by using `cdev_add`.
+
+    int cdev_add(struct cdev *p, dev_t dev, unsigned count);
+
+To find an example using the interface, you can see `ioctl.c` described
+in section 9.
+
+Unregistering A Device
+----------------------
+
+We cannot allow the kernel module to be `rmmod`’ed whenever root feels
+like it. If the device file is opened by a process and then we remove
+the kernel module, using the file would cause a call to the memory
+location where the appropriate function (read/write) used to be. If we
+are lucky, no other code was loaded there, and we’ll get an ugly error
+message. If we are unlucky, another kernel module was loaded into the
+same location, which means a jump into the middle of another function
+within the kernel. The results of this would be impossible to predict,
+but they cannot be very positive.
+
+Normally, when you do not want to allow something, you return an error
+code (a negative number) from the function which is supposed to do it.
+With `cleanup_module` that’s impossible because it is a void
+function. However, there is a counter which keeps track of how many
+processes are using your module. You can see what its value is by
+looking at the 3rd field with the command `cat /proc/modules` or `sudo
+lsmod`.
+If this number isn’t zero, `rmmod` will fail. Note that you do
+not have to check the counter within `cleanup_module` because
+the check will be performed for you by the system call
+`sys_delete_module`, defined in
+[include/linux/syscalls.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/syscalls.h).
+You should not use this counter directly, but there are functions
+defined in
+[include/linux/module.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/module.h)
+which let you increase, decrease and display this counter:
+
+-   `try_module_get(THIS_MODULE)`: Increment the reference count of
+    the current module.
+
+-   `module_put(THIS_MODULE)`: Decrement the reference count of the
+    current module.
+
+-   `module_refcount(THIS_MODULE)`: Return the value of the reference
+    count of the current module.
+
+It is important to keep the counter accurate; if you ever do lose track
+of the correct usage count, you will never be able to unload the module;
+it’s now reboot time, boys and girls. This is bound to happen to you
+sooner or later during a module’s development.
+
+chardev.c
+---------
+
+The next code sample creates a char driver named `chardev`. You can dump
+its device file.
+
+    cat /proc/devices
+
+(or open the file with a program) and the driver will put the number of
+times the device file has been read from into the file. We do not
+support writing to the file (like `echo "hi" > /dev/hello`), but
+catch these attempts and tell the user that the operation is not
+supported. Don’t worry if you don’t see what we do with the data we read
+into the buffer; we don’t do much with it. We simply read in the data
+and print a message acknowledging that we received it.
+
+In a multi-threaded environment, unprotected concurrent access to the
+same memory may lead to race conditions and will not preserve
+performance.
+In the kernel module, this problem may happen
+due to multiple instances accessing the shared resources. Therefore, a
+solution is to enforce exclusive access. We use atomic
+Compare-And-Swap (CAS) to maintain the states, `CDEV_NOT_USED` and
+`CDEV_EXCLUSIVE_OPEN`, to determine whether the file is currently
+opened by someone or not. CAS compares the contents of a memory
+location with the expected value and, only if they are the same,
+modifies the contents of that memory location to the desired value. See
+section 12 for more concurrency details.
+
+Writing Modules for Multiple Kernel Versions
+--------------------------------------------
+
+The system calls, which are the major interface the kernel shows to the
+processes, generally stay the same across versions. A new system call
+may be added, but usually the old ones will behave exactly like they
+used to. This is necessary for backward compatibility: a new kernel
+version is not supposed to break regular processes. In most cases, the
+device files will also remain the same. On the other hand, the internal
+interfaces within the kernel can and do change between versions.
+
+There are differences between different kernel versions, and if you want
+to support multiple kernel versions, you will find yourself having to
+code conditional compilation directives. The way to do this is to
+compare the macro `LINUX_VERSION_CODE` to the macro `KERNEL_VERSION`.
+In version `a.b.c` of the kernel, the value of this macro would be
+$2^{16}a + 2^{8}b + c$.
+
+The /proc File System
+=====================
+
+In Linux, there is an additional mechanism for the kernel and kernel
+modules to send information to processes: the `/proc` file system.
+Originally designed to allow easy access to information about processes
+(hence the name), it is now used by every bit of the kernel which has
+something interesting to report, such as `/proc/modules` which provides
+the list of modules and `/proc/meminfo` which gathers memory usage
+statistics.
+
+The method to use the proc file system is very similar to the one used
+with device drivers: a structure is created with all the information
+needed for the `/proc` file, including pointers to any handler functions
+(in our case there is only one, the one called when somebody attempts to
+read from the `/proc` file). Then, `init_module` registers the
+structure with the kernel and `cleanup_module` unregisters it.
+
+Normal file systems are located on a disk, rather than just in memory
+(which is where `/proc` is), and in that case the index-node (inode for
+short) number is a pointer to a disk location where the file’s inode is
+located. The inode contains information about the file, for example the
+file’s permissions, together with a pointer to the disk location or
+locations where the file’s data can be found.
+
+Because we don’t get called when the file is opened or closed, there’s
+nowhere for us to put `try_module_get` and `module_put` in this module,
+and if the file is opened and then the module is removed, there’s no way
+to avoid the consequences.
+
+Here is a simple example showing how to use a `/proc` file. This is the
+HelloWorld for the `/proc` filesystem. There are three parts: create the
+file `/proc/helloworld` in the function `init_module`, return a
+value (and a buffer) when the file `/proc/helloworld` is read in the
+callback function `procfile_read`, and delete the file
+`/proc/helloworld` in the function `cleanup_module`.
+
+The `/proc/helloworld` is created when the module is loaded with the
+function `proc_create`.
+The return value is a pointer to
+`struct proc_dir_entry`, and it will be used to configure the file
+`/proc/helloworld` (for example, the owner of this file). A null return
+value means that the creation has failed.
+
+Every time the file `/proc/helloworld` is read, the function
+`procfile_read` is called. Two parameters of this function are
+very important: the buffer (the second parameter) and the offset (the
+fourth one). The content of the buffer will be returned to the
+application which read it (for example the `cat` command). The offset is
+the current position in the file. If the return value of the function is
+not zero, then this function is called again. So be careful with this
+function: if it never returns zero, the read function is called
+endlessly.
+
+    $ cat /proc/helloworld
+    HelloWorld!
+
+The proc_ops Structure
+----------------------
+
+The `proc_ops` structure is defined in
+[include/linux/proc_fs.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/proc_fs.h)
+in Linux v5.6+. Older kernels used `file_operations` for custom hooks
+in the `/proc` file system, but it contains some members that are
+unnecessary in VFS, and every time VFS expands the `file_operations`
+set, the `/proc` code becomes bloated. Besides saving space, this
+structure also saves some operations to improve performance. For
+example, a file which never disappears in `/proc` can set `proc_flags`
+to `PROC_ENTRY_PERMANENT` to save 2 atomic ops, 1 allocation and 1 free
+per open/read/close sequence.
+
+Read and Write a /proc File
+---------------------------
+
+We have seen a very simple example for a `/proc` file where we only read
+the file `/proc/helloworld`. It is also possible to write in a `/proc`
+file. It works the same way as read: a function is called when the
+`/proc` file is written.
+But there is a small difference from read: the data comes from the
+user, so you have to import it from user space to kernel space (with
+`copy_from_user` or `get_user`).
+
+The reason for `copy_from_user` or `get_user` is that Linux memory (on
+Intel architecture, it may be different under some other processors) is
+segmented. This means that a pointer, by itself, does not reference a
+unique location in memory, only a location in a memory segment, and you
+need to know which memory segment it is to be able to use it. There is
+one memory segment for the kernel, and one for each of the processes.
+
+The only memory segment accessible to a process is its own, so when
+writing regular programs to run as processes, there is no need to worry
+about segments. When you write a kernel module, normally you want to
+access the kernel memory segment, which is handled automatically by the
+system. However, when the content of a memory buffer needs to be passed
+between the currently running process and the kernel, the kernel
+function receives a pointer to the memory buffer which is in the process
+segment. The `put_user` and `get_user` macros allow you to access that
+memory. These macros handle only a single value (such as one
+character); you can handle several characters with `copy_to_user` and
+`copy_from_user`. Because the module’s own buffer is in kernel space,
+the write function has to import the data, which comes from user space,
+whereas in the read function the data to be returned is already in
+kernel space.
+
+Manage /proc file with standard filesystem
+------------------------------------------
+
+We have seen how to read and write a `/proc` file with the `/proc`
+interface. But it is also possible to manage a `/proc` file with inodes.
+The main reason to do so is to use advanced features, like permissions.
+
+In Linux, there is a standard mechanism for file system registration.
+Since every file system has to have its own functions to handle inode
+and file operations, there is a special structure to hold pointers to
+all those functions, `struct inode_operations`, which includes a
+pointer to `struct proc_ops`.
+
+The difference between file and inode operations is that file operations
+deal with the file itself whereas inode operations deal with ways of
+referencing the file, such as creating links to it.
+
+In `/proc`, whenever we register a new file, we’re allowed to specify
+which `struct inode_operations` will be used to access it. This is the
+mechanism we use: a `struct inode_operations` which includes a pointer
+to a `struct proc_ops`, which in turn includes pointers to our
+`procfs_read` and `procfs_write` functions.
+
+Another interesting point here is the `module_permission` function.
+This function is called whenever a process tries to do something with
+the `/proc` file, and it can decide whether to allow access or not.
+Right now it is only based on the operation and the uid of the current
+user (as available in `current`, a pointer to a structure which
+includes information on the currently running process), but it could be
+based on anything we like, such as what other processes are doing with
+the same file, the time of day, or the last input we received.
+
+It is important to note that the standard roles of read and write are
+reversed in the kernel. Read functions are used for output, whereas
+write functions are used for input. The reason is that read and write
+refer to the user’s point of view: if a process reads something from
+the kernel, then the kernel needs to output it, and if a process writes
+something to the kernel, then the kernel receives it as input.
+
+Still hungry for procfs examples? First of all, keep in mind that there
+are rumors claiming that procfs is on its way out, so consider using
+`sysfs` instead.
+Consider using this mechanism in case you want to document something
+kernel related yourself.
+
+Manage /proc file with seq_file
+-------------------------------
+
+As we have seen, writing a `/proc` file may be quite “complex”. To help
+people write `/proc` files, there is an API named `seq_file` that helps
+format a `/proc` file for output. It is based on a sequence, which is
+composed of 3 functions: `start()`, `next()`, and `stop()`. The
+`seq_file` API starts a sequence when a user reads the `/proc` file.
+
+A sequence begins with a call to the function `start()`. If it returns
+a non-`NULL` value, the function `next()` is called; otherwise, the
+`stop()` function is called directly. The `next()` function is an
+iterator whose goal is to go through all the data. Each time `next()`
+is called, the function `show()` is also called. It writes data values
+into the buffer read by the user. The function `next()` is called until
+it returns `NULL`. The sequence ends when `next()` returns `NULL`, and
+then the function `stop()` is called.
+
+BE CAREFUL: when a sequence is finished, another one starts. That means
+that at the end of function `stop()`, the function `start()` is called
+again. This loop finishes when the function `start()` returns `NULL`.
+You can see a scheme of this in the
+Figure [img:seqfile].
+
+The `seq_file` API provides basic functions for `proc_ops`, such as
+`seq_read`, `seq_lseek`, and some others, but nothing for writing to
+the `/proc` file. Of course, you can still handle writes the same way
+as in the previous example.
+
+If you want more information, you can read this web page:
+
+-
+
+-
+
+You can also read the code of
+[fs/seq_file.c](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs/seq_file.c)
+in the Linux kernel.
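
The call order described in this section can be modelled in plain
userspace C. The following is only a sketch of the protocol, not kernel
code: the `model_*` names, the three-item data set and the trace buffer
are all made up for illustration, and the real callbacks have different
signatures.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical data set that the "file" exposes, one item per line. */
#define MODEL_ITEMS 3
static const int model_data[MODEL_ITEMS] = {10, 20, 30};

/* Records the order in which the callbacks run. */
static char model_trace[128];

static void model_log(const char *name)
{
    strcat(model_trace, name);
    strcat(model_trace, " ");
}

/* start(): return the item at *pos, or NULL when nothing is left. */
static const int *model_start(size_t *pos)
{
    model_log("start");
    return *pos < MODEL_ITEMS ? &model_data[*pos] : NULL;
}

/* next(): advance the position; return the next item, or NULL at the end. */
static const int *model_next(size_t *pos)
{
    model_log("next");
    ++*pos;
    return *pos < MODEL_ITEMS ? &model_data[*pos] : NULL;
}

/* show(): format one item into the output buffer. */
static void model_show(const int *item, char *out)
{
    model_log("show");
    sprintf(out + strlen(out), "%d\n", *item);
}

/* stop(): called once at the end of every sequence. */
static void model_stop(void)
{
    model_log("stop");
}

/* One sequence: start, then show/next until next() returns NULL, then stop. */
static void model_sequence(size_t *pos, char *out)
{
    const int *item = model_start(pos);

    while (item) {
        model_show(item, out);
        item = model_next(pos);
    }
    model_stop();
}
```

Calling `model_sequence` twice with the same position variable mirrors
the loop above: the first sequence walks all items, and the second one
ends at once because `start()` finds nothing left and returns `NULL`.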
+
+sysfs: Interacting with your module
+===================================
+
+*sysfs* allows you to interact with the running kernel from userspace by
+reading or setting variables inside of modules. This can be useful for
+debugging purposes, or just as an interface for applications or scripts.
+You can find sysfs directories and files under the `/sys` directory on
+your system.
+
+    ls -l /sys
+
+Attributes can be exported for kobjects in the form of regular files in
+the filesystem. Sysfs forwards file I/O operations to methods defined
+for the attributes, providing a means to read and write kernel
+attributes.
+
+An attribute definition is simply:
+
+    struct attribute {
+        char *name;
+        struct module *owner;
+        umode_t mode;
+    };
+
+    int sysfs_create_file(struct kobject * kobj, const struct attribute * attr);
+    void sysfs_remove_file(struct kobject * kobj, const struct attribute * attr);
+
+For example, the driver model defines `struct device_attribute` like:
+
+    struct device_attribute {
+        struct attribute attr;
+        ssize_t (*show)(struct device *dev, struct device_attribute *attr,
+                        char *buf);
+        ssize_t (*store)(struct device *dev, struct device_attribute *attr,
+                         const char *buf, size_t count);
+    };
+
+    int device_create_file(struct device *, const struct device_attribute *);
+    void device_remove_file(struct device *, const struct device_attribute *);
+
+To read or write attributes, a `show()` or `store()` method must be
+specified when declaring the attribute. For the common cases,
+[include/linux/sysfs.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/sysfs.h)
+provides convenience macros (`__ATTR`, `__ATTR_RO`, `__ATTR_WO`, etc.)
+that make defining attributes easier and keep the code concise and
+readable.
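
To make the `show()`/`store()` contract more concrete, here is a small
userspace model of it. This is a sketch under stated assumptions, not
the kernel API: `attr_show_model`, `attr_store_model` and
`model_variable` are hypothetical names, and the real methods take
extra arguments (the object and the attribute). What the model keeps is
the convention that `show()` formats the value as text into the buffer
and returns the number of bytes written, while `store()` parses the
text it is given and returns the number of bytes it consumed or a
negative error code.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h> /* ssize_t */

/* Hypothetical variable standing in for a module variable. */
static int model_variable;

/* Model of show(): format the value as text, return the byte count. */
static ssize_t attr_show_model(char *buf)
{
    return sprintf(buf, "%d\n", model_variable);
}

/* Model of store(): parse the text, return bytes consumed or -EINVAL. */
static ssize_t attr_store_model(const char *buf, size_t count)
{
    int value;

    if (sscanf(buf, "%d", &value) != 1)
        return -EINVAL;
    model_variable = value;
    return (ssize_t)count;
}
```

Storing the three bytes `"32\n"` and then calling the show model yields
the same text back, which is the round trip a userspace write followed
by a read of the attribute file would perform.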
+
+An example of a hello world module which includes the creation of a
+variable accessible via sysfs is given below.
+
+Make and install the module:
+
+    make
+    sudo insmod hello-sysfs.ko
+
+Check that it exists:
+
+    sudo lsmod | grep hello_sysfs
+
+What is the current value of `myvariable`?
+
+    sudo cat /sys/kernel/mymodule/myvariable
+
+Set the value of `myvariable` and check that it changed.
+
+    echo "32" | sudo tee /sys/kernel/mymodule/myvariable
+    sudo cat /sys/kernel/mymodule/myvariable
+
+Finally, remove the test module:
+
+    sudo rmmod hello_sysfs
+
+In the above case, we use a simple kobject to create a directory under
+sysfs, and communicate with its attributes. Since Linux v2.6.0, the
+`kobject` structure made its appearance. It was initially meant as a
+simple way of unifying kernel code which manages reference counted
+objects. After a bit of mission creep, it is now the glue that holds
+much of the device model and its sysfs interface together. For more
+information about kobject and sysfs, see
+[Documentation/driver-api/driver-model/driver.rst](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/driver-api/driver-model/driver.rst)
+and .
+
+Talking To Device Files
+=======================
+
+Device files are supposed to represent physical devices. Most physical
+devices are used for output as well as input, so there has to be some
+mechanism for device drivers in the kernel to get the output to send to
+the device from processes. This is done by opening the device file for
+output and writing to it, just like writing to a file. In the following
+example, this is implemented by `device_write`.
+
+This is not always enough. Imagine you had a serial port connected to a
+modem (even if you have an internal modem, it is still implemented from
+the CPU’s perspective as a serial port connected to a modem, so you
+don’t have to tax your imagination too hard).
+The natural thing to do would be to use the device file to write things
+to the modem (either modem commands or data to be sent through the
+phone line) and read things from the modem (either responses for
+commands or the data received through the phone line). However, this
+leaves open the question of what to do when you need to talk to the
+serial port itself, for example to configure the rate at which data is
+sent and received.
+
+The answer in Unix is to use a special function called `ioctl` (short
+for Input Output ConTroL). Every device can have its own `ioctl`
+commands, which can be read ioctl’s (to send information from a process
+to the kernel), write ioctl’s (to return information to a process), both
+or neither. Notice here the roles of read and write are reversed again,
+so in ioctl’s read is to send information to the kernel and write is to
+receive information from the kernel.
+
+The ioctl function is called with three parameters: the file descriptor
+of the appropriate device file, the ioctl number, and a parameter, which
+is of type long so you can use a cast to pass anything through it. You
+will not be able to pass a structure this way, but you will be able to
+pass a pointer to the structure. Here is an example:
+
+You can see there is an argument called `cmd` in the
+`test_ioctl_ioctl()` function. It is the ioctl number. The ioctl number
+encodes the major device number, the type of the ioctl, the command,
+and the type of the parameter. This ioctl number is usually created by
+a macro call (`_IO`, `_IOR`, `_IOW` or `_IOWR`, depending on the type)
+in a header file. This header file should then be included both by the
+programs which will use ioctl (so they can generate the appropriate
+ioctl’s) and by the kernel module (so it can understand them). In the
+example below, the header file is `chardev.h` and the program which
+uses it is `userspace_ioctl.c`.
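
How the macros pack several fields into one ioctl number can be checked
from userspace with the decoding macros that `<linux/ioctl.h>` also
provides (`_IOC_DIR`, `_IOC_TYPE`, `_IOC_NR`, `_IOC_SIZE`). The sketch
below is not the book's `chardev.h`: the magic byte `'x'`, the
`EXAMPLE_*` commands and the `decode_ioctl` helper are hypothetical and
exist only for illustration. Note that the macro names take the user's
point of view: `_IOW` marks data that userspace writes to the kernel,
and `_IOR` marks data that userspace reads back.

```c
#include <linux/ioctl.h>

/* Hypothetical magic byte and commands, for illustration only. */
#define EXAMPLE_MAGIC 'x'
#define EXAMPLE_SET_MSG _IOW(EXAMPLE_MAGIC, 0, char *) /* user -> kernel */
#define EXAMPLE_GET_MSG _IOR(EXAMPLE_MAGIC, 1, char *) /* kernel -> user */
#define EXAMPLE_RESET   _IO(EXAMPLE_MAGIC, 2)          /* no argument */

/* The four fields packed into one ioctl number. */
struct ioctl_fields {
    unsigned int dir;  /* _IOC_NONE, _IOC_READ, _IOC_WRITE, or both */
    unsigned int type; /* the "magic" byte identifying the driver */
    unsigned int nr;   /* the command number within that driver */
    unsigned int size; /* size of the argument type, in bytes */
};

/* Split an ioctl number back into its fields. */
static struct ioctl_fields decode_ioctl(unsigned int cmd)
{
    struct ioctl_fields f = {
        _IOC_DIR(cmd), _IOC_TYPE(cmd), _IOC_NR(cmd), _IOC_SIZE(cmd),
    };

    return f;
}
```

Decoding `EXAMPLE_SET_MSG` gives back the direction `_IOC_WRITE`, the
type `'x'`, the number 0 and the size of a `char *`, which is why the
same header must be shared by the module and the program that calls it.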
+
+If you want to use ioctls in your own kernel modules, it is best to
+receive an official ioctl assignment, so if you accidentally get
+somebody else’s ioctls, or if they get yours, you’ll know something is
+wrong. For more information, consult the kernel source tree at
+[Documentation/userspace-api/ioctl/ioctl-number.rst](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/userspace-api/ioctl/ioctl-number.rst).
+
+Also, we need to be careful, because concurrent access to shared
+resources leads to race conditions. The solution is to use atomic
+Compare-And-Swap (CAS), which we mentioned in Section 6.5, to enforce
+exclusive access.
+
+System Calls
+============
+
+So far, the only thing we’ve done was to use well defined kernel
+mechanisms to register `/proc` files and device handlers. This is fine
+if you want to do something the kernel programmers thought you’d want,
+such as write a device driver. But what if you want to do something
+unusual, to change the behavior of the system in some way? Then, you are
+mostly on your own.
+
+Should one choose not to use a virtual machine, kernel programming can
+become risky. For example, while writing the code below, the `open()`
+system call was inadvertently disrupted. This resulted in an inability
+to open any files, run programs, or shut down the system, necessitating
+a restart of the virtual machine. Fortunately, no critical files were
+lost in this instance. However, if such modifications were made on a
+live, mission-critical system, the consequences could be severe. To
+mitigate the risk of file loss, even in a test environment, it is
+advised to execute `sync` right before using `insmod` and `rmmod`.
+
+Forget about `/proc` files, forget about device files. They are just
+minor details. Minutiae in the vast expanse of the universe. The real
+process to kernel communication mechanism, the one used by all
+processes, is *system calls*.
+When a process requests a service from the kernel (such as opening a
+file, forking to a new process, or requesting more memory), this is the
+mechanism used. If you want to change the behaviour of the kernel in
+interesting ways, this is the place to do it. By the way, if you want
+to see which system calls a program uses, run `strace <arguments>`.
+
+In general, a process is not supposed to be able to access the kernel.
+It cannot access kernel memory and it cannot call kernel functions. The
+hardware of the CPU enforces this (that is the reason why it is called
+“protected mode” or “page protection”).
+
+System calls are an exception to this general rule. What happens is that
+the process fills the registers with the appropriate values and then
+calls a special instruction which jumps to a previously defined location
+in the kernel (of course, that location is readable by user processes,
+but it is not writable by them). Under Intel CPUs, this is done by means
+of interrupt 0x80. The hardware knows that once you jump to this
+location, you are no longer running in restricted user mode, but as the
+operating system kernel, and therefore you’re allowed to do whatever
+you want.
+
+The location in the kernel a process can jump to is called
+`system_call`. The procedure at that location checks the system call
+number, which tells the kernel what service the process requested.
+Then, it looks at the table of system calls (`sys_call_table`) to see
+the address of the kernel function to call. Then it calls the function,
+and after it returns, does a few system checks and then returns to the
+process (or to a different process, if the process time ran out). If
+you want to read this code, it is in the source file
+`arch/$(architecture)/kernel/entry.S`, after the line
+`ENTRY(system_call)`.
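
From userspace, the role of the system call number can be seen with the
C library's generic `syscall()` wrapper, which takes the number
explicitly and performs the register setup and the trap into the
kernel. The sketch below is only an illustration of that idea;
`raw_getpid` is a hypothetical helper name, and `getpid` was picked
because it needs no arguments.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for the syscall() declaration */
#endif
#include <sys/syscall.h> /* SYS_getpid */
#include <sys/types.h>   /* pid_t */
#include <unistd.h>      /* syscall(), getpid() */

/*
 * Ask the kernel for our process ID by system call number instead of
 * going through the dedicated getpid() wrapper.
 */
static pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}
```

Both `raw_getpid()` and `getpid()` end up at the same kernel service,
so they return the same value; only the way the request is named
differs.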
+
+So, if we want to change the way a certain system call works, what we
+need to do is to write our own function to implement it (usually by
+adding a bit of our own code, and then calling the original function)
+and then change the pointer at `sys_call_table` to point to our
+function. Because we might be removed later and we don’t want to leave
+the system in an unstable state, it’s important for `cleanup_module` to
+restore the table to its original state.
+
+To modify the content of `sys_call_table`, we need to consider the
+control register. A control register is a processor register that
+changes or controls the general behavior of the CPU. For x86
+architecture, the `cr0` register has various control flags that modify
+the basic operation of the processor. The `WP` flag in `cr0` stands for
+write protection. Once the `WP` flag is set, the processor disallows
+further write attempts to the read-only sections. Therefore, we must
+disable the `WP` flag before modifying `sys_call_table`. Since Linux
+v5.3, the `write_cr0` function cannot be used for this, because
+sensitive `cr0` bits are pinned for security reasons: otherwise, an
+attacker could write into CPU control registers to disable CPU
+protections like write protection. As a result, we have to provide a
+custom assembly routine to bypass it.
+
+However, the `sys_call_table` symbol is unexported to prevent misuse.
+There are a few ways to get the symbol anyway: manual symbol lookup and
+`kallsyms_lookup_name`. Here we use both, depending on the kernel
+version.
+
+*Control-flow integrity* is a technique that prevents an attacker from
+redirecting execution, by making sure that indirect calls go to the
+expected addresses and that return addresses are not changed.
+Since Linux v5.7, the kernel has carried a patch series for
+*control-flow enforcement* (CET) on x86, and some configurations of
+GCC, like GCC versions 9 and 10 in Ubuntu Linux, enable CET (the
+`-fcf-protection` option) by default. Using such a GCC to compile the
+kernel with retpoline off may result in CET being enabled in the
+kernel. You can use the following command to check whether the
+`-fcf-protection` option is enabled:
+
+    $ gcc -v -Q -O2 --help=target | grep protection
+    Using built-in specs.
+    COLLECT_GCC=gcc
+    COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper
+    ...
+    gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
+    COLLECT_GCC_OPTIONS='-v' '-Q' '-O2' '--help=target' '-mtune=generic' '-march=x86-64'
+    /usr/lib/gcc/x86_64-linux-gnu/9/cc1 -v ... -fcf-protection ...
+    GNU C17 (Ubuntu 9.3.0-17ubuntu1~20.04) version 9.3.0 (x86_64-linux-gnu)
+    ...
+
+However, CET should not be enabled in the kernel, because it may break
+Kprobes and bpf. Consequently, CET has been disabled since v5.11. To
+guarantee that the manual symbol lookup works, we only use it up to
+v5.4.
+
+Unfortunately, since Linux v5.7 `kallsyms_lookup_name` is also
+unexported, so a certain trick is needed to get the address of
+`kallsyms_lookup_name`. If `CONFIG_KPROBES` is enabled, we can retrieve
+function addresses by means of Kprobes, which dynamically break into a
+specific kernel routine. Kprobes inserts a breakpoint at the entry of a
+function by replacing the first bytes of the probed instruction. When a
+CPU hits the breakpoint, registers are stored, and control passes to
+Kprobes. It passes the addresses of the saved registers and the Kprobe
+struct to the handler you defined, then executes it. Kprobes can be
+registered by symbol name or address. When a symbol name is given, the
+kernel resolves the address.
+
+Otherwise, specify the address of `sys_call_table` from
+`/proc/kallsyms` or `/boot/System.map` in the `sym` parameter. The
+following is a sample usage for `/proc/kallsyms`:
+
+    $ sudo grep sys_call_table /proc/kallsyms
+    ffffffff82000280 R x32_sys_call_table
+    ffffffff820013a0 R sys_call_table
+    ffffffff820023e0 R ia32_sys_call_table
+    $ sudo insmod syscall-steal.ko sym=0xffffffff820013a0
+
+When using the address from `/boot/System.map`, be careful about
+`KASLR` (Kernel Address Space Layout Randomization). `KASLR` may
+randomize the address of kernel code and data at every boot, so that
+the static address listed in `/boot/System.map` is offset by some
+entropy. The purpose of `KASLR` is to protect the kernel space from
+attackers. Without `KASLR`, an attacker can easily find the target
+address, because it is fixed. Then the attacker can use return-oriented
+programming to insert some malicious code to execute, or receive the
+target data through a tampered pointer. `KASLR` mitigates these kinds
+of attacks because the attacker cannot immediately know the target
+address, but a brute-force attack can still work. If the address of a
+symbol in `/proc/kallsyms` is different from the address in
+`/boot/System.map`, `KASLR` is enabled in the kernel your system is
+running.
+
+    $ grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
+    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
+    $ sudo grep sys_call_table /boot/System.map-$(uname -r)
+    ffffffff82000300 R sys_call_table
+    $ sudo grep sys_call_table /proc/kallsyms
+    ffffffff820013a0 R sys_call_table
+    # Reboot
+    $ sudo grep sys_call_table /boot/System.map-$(uname -r)
+    ffffffff82000300 R sys_call_table
+    $ sudo grep sys_call_table /proc/kallsyms
+    ffffffff86400300 R sys_call_table
+
+If `KASLR` is enabled, we have to look up the address in
+`/proc/kallsyms` again each time we reboot the machine. In order to use
+the address from `/boot/System.map`, make sure that `KASLR` is
+disabled. You can add `nokaslr` to the kernel command line to disable
+`KASLR` from the next boot:
+
+    $ grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
+    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
+    $ sudo perl -i -pe 'm/quiet/ and s//quiet nokaslr/' /etc/default/grub
+    $ grep quiet /etc/default/grub
+    GRUB_CMDLINE_LINUX_DEFAULT="quiet nokaslr splash"
+    $ sudo update-grub
+
+For more information, check out the following:
+
+- [Cook: Security things in Linux
+  v5.3](https://lwn.net/Articles/804849/)
+
+- [Unexporting the system call table](https://lwn.net/Articles/12211/)
+
+- [Control-flow integrity for the
+  kernel](https://lwn.net/Articles/810077/)
+
+- [Unexporting
+  kallsyms_lookup_name()](https://lwn.net/Articles/813350/)
+
+- [Kernel Probes
+  (Kprobes)](https://www.kernel.org/doc/Documentation/kprobes.txt)
+
+- [Kernel address space layout
+  randomization](https://lwn.net/Articles/569635/)
+
+The source code here is an example of such a kernel module. We want to
+“spy” on a certain user, and to `pr_info()` a message whenever that
+user opens a file.
+Towards this end, we replace the system call to open a file with our
+own function, called `our_sys_openat`. This function checks the uid
+(user’s id) of the current process, and if it is equal to the uid we
+spy on, it calls `pr_info()` to display the name of the file to be
+opened. Then, either way, it calls the original `openat()` function
+with the same parameters, to actually open the file.
+
+The `init_module` function replaces the appropriate location in
+`sys_call_table` and keeps the original pointer in a variable. The
+`cleanup_module` function uses that variable to restore everything back
+to normal. This approach is dangerous, because of the possibility of
+two kernel modules changing the same system call. Imagine we have two
+kernel modules, A and B. A’s openat system call will be `A_openat` and
+B’s will be `B_openat`. Now, when A is inserted into the kernel, the
+system call is replaced with `A_openat`, which will call the original
+`sys_openat` when it is done. Next, B is inserted into the kernel,
+which replaces the system call with `B_openat`, which will call what it
+thinks is the original system call, `A_openat`, when it’s done.
+
+Now, if B is removed first, everything will be well: it will simply
+restore the system call to `A_openat`, which calls the original.
+However, if A is removed and then B is removed, the system will crash.
+A’s removal will restore the system call to the original, `sys_openat`,
+cutting B out of the loop. Then, when B is removed, it will restore the
+system call to what it thinks is the original, `A_openat`, which is no
+longer in memory.
+At first glance, it appears we could solve this particular problem by
+checking if the system call is equal to our open function and if so not
+changing it at all (so that B won’t change the system call when it is
+removed), but that will cause an even worse problem. When A is removed,
+it sees that the system call was changed to `B_openat` so that it is no
+longer pointing to `A_openat`, so it will not restore it to
+`sys_openat` before it is removed from memory. Unfortunately,
+`B_openat` will still try to call `A_openat` which is no longer there,
+so that even without removing B the system would crash.
+
+For x86 architecture, the system call table is no longer used to invoke
+system calls after commit
+[1e3ad78](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1e3ad78334a69b36e107232e337f9d693dcc9df2),
+which landed in v6.9. This commit has been backported to long term
+stable kernels, like v5.15.154+, v6.1.85+, v6.6.26+ and v6.8.5+; see
+this [answer](https://stackoverflow.com/a/78607015) for more details.
+In this case, thanks to Kprobes, a hook can be placed on the system
+call entry instead to intercept the system call.
+
+Note that all the related problems make syscall stealing unfeasible for
+production use. In order to keep people from doing potentially harmful
+things, `sys_call_table` is no longer exported. This means that if you
+want to do something more than a mere dry run of this example, you will
+have to patch your current kernel in order to have `sys_call_table`
+exported.
+
+Blocking Processes and threads
+==============================
+
+Sleep
+-----
+
+What do you do when somebody asks you for something you cannot do right
+away? If you are a human being and you are bothered by a human being,
+the only thing you can say is: "*Not right now, I’m busy. Go away!*".
+But if you are a kernel module and you are bothered by a process, you
+have another possibility. You can put the process to sleep until you can
+service it. After all, processes are being put to sleep by the kernel
+and woken up all the time (that is the way multiple processes appear to
+run at the same time on a single CPU).
+
+This kernel module is an example of this. The file (called
+`/proc/sleep`) can only be opened by a single process at a time. If the
+file is already open, the kernel module calls
+`wait_event_interruptible`. The easiest way to keep a file open is to
+open it with:
+
+    tail -f
+
+The `wait_event_interruptible` function changes the status of the task
+(a task is the kernel data structure which holds information about a
+process and the system call it is in, if any) to `TASK_INTERRUPTIBLE`,
+which means that the task will not run until it is woken up somehow,
+and adds it to WaitQ, the queue of tasks waiting to access the file.
+Then, the function calls the scheduler to context switch to a different
+process, one which has some use for the CPU.
+
+When a process is done with the file, it closes it, and `module_close`
+is called. That function wakes up all the processes in the queue
+(there’s no mechanism to only wake up one of them). It then returns and
+the process which just closed the file can continue to run. In time,
+the scheduler decides that that process has had enough and gives
+control of the CPU to another process. Eventually, one of the processes
+which was in the queue will be given control of the CPU by the
+scheduler. It starts at the point right after the call to
+`wait_event_interruptible`.
+
+This means that the process is still in kernel mode: as far as the
+process is concerned, it issued the open system call and the system call
+has not returned yet. The process does not know somebody else used the
+CPU for most of the time between the moment it issued the call and the
+moment it returned.
+
+It can then proceed to set a global variable to tell all the other
+processes that the file is still open and go on with its life. When the
+other processes get a piece of the CPU, they’ll see that global variable
+and go back to sleep.
+
+So we will use `tail -f` to keep the file open in the background, while
+trying to access it with another process (again in the background, so
+that we need not switch to a different vt). As soon as the first
+background process is killed with `kill %1`, the second is woken up, is
+able to access the file and finally terminates.
+
+To make our life more interesting, `module_close` does not have a
+monopoly on waking up the processes which wait to access the file. A
+signal, such as *Ctrl+c* (**SIGINT**), can also wake up a process. This
+is because we used `wait_event_interruptible`. We could have used
+`wait_event` instead, but that would have resulted in extremely angry
+users whose *Ctrl+c*’s are ignored.
+
+In that case, we want to return with `-EINTR` immediately. This is
+important so users can, for example, kill the process before it receives
+the file.
+
+There is one more point to remember. Sometimes processes don’t want to
+sleep; they want either to get what they want immediately, or to be told
+it cannot be done. Such processes use the `O_NONBLOCK` flag when
+opening the file. The kernel is supposed to respond by returning the
+error code `-EAGAIN` from operations which would otherwise block, such
+as opening the file in this example. The program `cat_nonblock`,
+available in the `examples/other` directory, can be used to open a file
+with `O_NONBLOCK`.
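
The userspace side of this convention can be sketched without the
module. The code below is not the book's `cat_nonblock`; it uses a pipe
instead of `/proc/sleep` so that it runs anywhere, and `set_nonblocking`
and `try_read` are hypothetical helper names. It shows the same idea:
with `O_NONBLOCK` set, an operation that would otherwise sleep fails at
once and `errno` is `EAGAIN`.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Turn on O_NONBLOCK for an open descriptor; 0 on success, -errno on error. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -errno;
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -errno;
    return 0;
}

/*
 * Read without sleeping. Returns the number of bytes read, or a negative
 * errno value; -EAGAIN means "this would have blocked, try again later".
 */
static ssize_t try_read(int fd, void *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);

    return n < 0 ? -errno : n;
}
```

On an empty pipe, `try_read` returns `-EAGAIN` instead of putting the
caller to sleep; once data has been written to the other end, the same
call returns the number of bytes available.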
+
+    $ sudo insmod sleep.ko
+    $ cat_nonblock /proc/sleep
+    Last input:
+    $ tail -f /proc/sleep &
+    Last input:
+    Last input:
+    Last input:
+    Last input:
+    Last input:
+    Last input:
+    Last input:
+    tail: /proc/sleep: file truncated
+    [1] 6540
+    $ cat_nonblock /proc/sleep
+    Open would block
+    $ kill %1
+    [1]+  Terminated              tail -f /proc/sleep
+    $ cat_nonblock /proc/sleep
+    Last input:
+    $
+
+Completions
+-----------
+
+Sometimes one thing should happen before another within a module having
+multiple threads. Rather than using `/bin/sleep` commands, the kernel
+has another way to do this which allows timeouts or interrupts to also
+happen.
+
+Completions, as a code synchronization mechanism, have three main
+parts: the initialization of the `struct completion` synchronization
+object, the waiting or barrier part through `wait_for_completion()`,
+and the signalling side through a call to `complete()`.
+
+In the subsequent example, two threads are initiated: crank and
+flywheel. It is imperative that the crank thread starts before the
+flywheel thread. A completion state is established for each of these
+threads, with a distinct completion defined for both the crank and
+flywheel threads. At the exit point of each thread the respective
+completion state is updated, and `wait_for_completion` is used by the
+flywheel thread to ensure that it does not begin prematurely. The crank
+thread uses the `complete_all()` function to update the completion,
+which lets the flywheel thread continue.
+
+So even though `flywheel_thread` is started first, you should notice
+when you load this module and run `dmesg` that turning the crank always
+happens first, because the flywheel thread waits for the crank thread
+to complete.
+
+There are other variations of the `wait_for_completion` function, which
+include timeouts or being interrupted, but this basic mechanism is
+enough for many common situations without adding a lot of complexity.
+
+Avoiding Collisions and Deadlocks
+=================================
+
+If processes running on different CPUs or in different threads try to
+access the same memory, then it is possible that strange things can
+happen or your system can lock up. To avoid this, various types of
+mutual exclusion kernel functions are available. These indicate if a
+section of code is "locked" or "unlocked" so that simultaneous attempts
+to run it cannot happen.
+
+Mutex
+-----
+
+You can use kernel mutexes (mutual exclusions) in much the same manner
+that you might deploy them in userland. This may be all that is needed
+to avoid collisions in most cases.
+
+Spinlocks
+---------
+
+As the name suggests, spinlocks lock up the CPU that the code is running
+on, taking 100% of its resources. Because of this you should only use
+the spinlock mechanism around code which is likely to take no more than
+a few milliseconds to run and so will not noticeably slow anything down
+from the user’s point of view.
+
+The example here is `"irq safe"` in that if interrupts happen during the
+lock then they will not be forgotten and will activate when the unlock
+happens, using the `flags` variable to retain their state.
+
+Taking 100% of a CPU’s resources comes with greater responsibility.
+Situations where the kernel code monopolizes a CPU are called **atomic
+contexts**. Holding a spinlock is one of those situations. Sleeping in
+atomic contexts may leave the system hanging, as the occupied CPU
+devotes 100% of its resources doing nothing but sleeping. In some worse
+cases the system may crash. Thus, sleeping in atomic contexts is
+considered a bug in the kernel. Such bugs are sometimes called
+“sleep-in-atomic-context” in some materials.
+ +Note that sleeping here is not limited to calling the sleep functions +explicitly. If subsequent function calls eventually invoke a function +that sleeps, it is also considered sleeping. Thus, it is important to +pay attention to functions being used in atomic context. There’s no +documentation recording all such functions, but code comments may help. +Sometimes you may find comments in kernel source code stating that a +function “may sleep”, “might sleep”, or more explicitly “the caller +should not hold a spinlock”. Those comments are hints that a function +may implicitly sleep and must not be called in atomic contexts. + +Read and write locks +-------------------- + +Read and write locks are specialised kinds of spinlocks so that you can +exclusively read from something or write to something. Like the earlier +spinlocks example, the one below shows an "irq safe" situation in which +if other functions were triggered from irqs which might also read and +write to whatever you are concerned with then they would not disrupt the +logic. As before it is a good idea to keep anything done within the lock +as short as possible so that it does not hang up the system and cause +users to start revolting against the tyranny of your module. + +Of course, if you know for sure that there are no functions triggered by +irqs which could possibly interfere with your logic then you can use the +simpler `read_lock(&myrwlock)` and +`read_unlock(&myrwlock)` or the corresponding write functions. + +Atomic operations +----------------- + +If you are doing simple arithmetic (adding, subtracting or bitwise +operations), then there is another way in the multi-CPU and +multi-hyperthreaded world to stop other parts of the system from messing +with your mojo. By using atomic operations you can be confident that +your addition, subtraction or bit flip did actually happen and was not +overwritten by some other shenanigans. An example is shown below.
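As a rough user-space analogue, the same idea can be tried out with C11 `<stdatomic.h>`. This sketch is not kernel code: the kernel has its own interface (`atomic_t` with helpers such as `atomic_add()` and `atomic_sub()`, plus bit operations such as `set_bit()`, `clear_bit()` and `change_bit()`), and the helper names below (`counter_adjust()`, `bit_set()`, `bit_clear()`, `bit_toggle()`, `bitmask_demo()`) are hypothetical.

```c
/* User-space sketch of atomic arithmetic and bit operations using
 * C11 atomics. Helper names are hypothetical, chosen to echo the
 * kernel's atomic_add()/set_bit() family. */
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Apply a signed delta to a shared counter and return the new value.
 * atomic_fetch_add() returns the value held *before* the addition. */
int counter_adjust(atomic_int *v, int delta)
{
    return atomic_fetch_add(v, delta) + delta;
}

/* Atomically set bit nr in word. */
void bit_set(atomic_ulong *word, unsigned int nr)
{
    assert(nr < sizeof(unsigned long) * CHAR_BIT);
    atomic_fetch_or(word, 1UL << nr);
}

/* Atomically clear bit nr in word. */
void bit_clear(atomic_ulong *word, unsigned int nr)
{
    assert(nr < sizeof(unsigned long) * CHAR_BIT);
    atomic_fetch_and(word, ~(1UL << nr));
}

/* Atomically flip bit nr in word. */
void bit_toggle(atomic_ulong *word, unsigned int nr)
{
    assert(nr < sizeof(unsigned long) * CHAR_BIT);
    atomic_fetch_xor(word, 1UL << nr);
}

/* Run a short sequence of bit operations and return the final pattern. */
unsigned long bitmask_demo(void)
{
    atomic_ulong word = 0;

    bit_set(&word, 3);    /* 0x008 */
    bit_set(&word, 5);    /* 0x028 */
    bit_clear(&word, 3);  /* 0x020 */
    bit_toggle(&word, 8); /* 0x120 */

    return atomic_load(&word);
}
```

Each call is a single read-modify-write step, so a concurrent update from another thread cannot be lost between the read and the write. That is the property the kernel's atomic helpers provide to module code.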
+ +Before the C11 standard adopted built-in atomic types, the kernel +already provided a small set of atomic types by using a bunch of tricky +architecture-specific code. Implementing the atomic types with C11 +atomics may allow the kernel to throw away the architecture-specific +code and make the kernel code more friendly to people who +understand the standard. But there are some problems, such as the memory +model of the kernel not matching the model formed by the C11 atomics. +For further details, see: + +- [kernel documentation of atomic + types](https://www.kernel.org/doc/Documentation/atomic_t.txt) + +- [Time to move to C11 atomics?](https://lwn.net/Articles/691128/) + +- [Atomic usage patterns in the + kernel](https://lwn.net/Articles/698315/) + +Replacing Print Macros +====================== + +Replacement +----------- + +In Section +1.7, +it was noted that the X Window System and kernel module programming are +not conducive to integration. This remains valid during the development +of kernel modules. However, in practical scenarios, the necessity +emerges to relay messages to the tty (teletype) originating the module +load command. + +The term “tty” originates from *teletype*, which initially referred to a +combined keyboard-printer for Unix system communication. Today, it +signifies a text stream abstraction employed by Unix programs, +encompassing physical terminals, xterms in X displays, and network +connections like SSH. + +To achieve this, the “current” pointer is leveraged to access the active +task’s tty structure. Within this structure lies a pointer to a string +write function, facilitating the string’s transmission to the tty. + +Flashing keyboard LEDs +---------------------- + +In certain conditions, you may desire a simpler and more direct way to +communicate to the external world. Flashing keyboard LEDs can be such a +solution: It is an immediate way to attract attention or to display a +status condition.
Keyboard LEDs are present on all hardware, they are +always visible, they do not need any setup, and their use is rather +simple and non-intrusive, compared to writing to a tty or a file. + +From v4.14 to v4.15, the timer API underwent a series of changes to improve +memory safety. A buffer overflow in the area of a `timer_list` +structure may be able to overwrite the `function` and `data` fields, +providing the attacker with a way to use return-oriented programming +(ROP) to call arbitrary functions within the kernel. Also, the function +prototype of the callback, containing an `unsigned long` argument, +prevents any type checking. Furthermore, the function prototype +with an `unsigned long` argument may be an obstacle to the forward-edge +protection of *control-flow integrity*. Thus, it is better to use a +unique prototype to separate from the cluster that takes an +`unsigned long` argument. The timer callback should be passed a pointer to the +`timer_list` structure rather than an `unsigned long` argument. +The module then wraps all the information the callback needs, including the +`timer_list` structure, into a larger structure, and the callback can recover it with +the `container_of` macro instead of the `unsigned long` value. +For more information see: [Improving the kernel timers +API](https://lwn.net/Articles/735887/). + +Before Linux v4.14, `setup_timer` was used to initialize the +timer and the `timer_list` structure looked like: + +`struct timer_list { unsigned long expires; void (*function)(unsigned long); unsigned long data; u32 flags; /* ... */ };` + +`void setup_timer(struct timer_list *timer, void (*callback)(unsigned long), unsigned long data);` + +Since Linux v4.14, `timer_setup` has been adopted and the kernel has been +converted step by step from `setup_timer` to `timer_setup`. +One of the reasons the API was changed this way is that the new interface +needed to coexist with the old one.
Moreover, +`timer_setup` was implemented by `setup_timer` at +first. + +`void timer_setup(struct timer_list *timer, void (*callback)(struct timer_list *), unsigned int flags);` + +`setup_timer` was then removed in v4.15. As a result, +the `timer_list` structure changed to the following. + +`struct timer_list { unsigned long expires; void (*function)(struct timer_list *); u32 flags; /* ... */ };` + +The following source code illustrates a minimal kernel module which, +when loaded, starts blinking the keyboard LEDs until it is unloaded. + +If none of the examples in this chapter fit your debugging needs, there +might yet be some other tricks to try. Ever wondered what +`CONFIG_LL_DEBUG` in `make menuconfig` is good for? If +you activate that you get low level access to the serial port. While +this might not sound very powerful by itself, you can patch +[kernel/printk.c](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/kernel/printk.c) +or any other essential syscall to print ASCII characters, thus making it +possible to trace virtually everything your code does over a serial +line. If you find yourself porting the kernel to some new and formerly +unsupported architecture, this is usually amongst the first things that +should be implemented. Logging over a netconsole might also be worth a +try. + +While you have seen lots of stuff that can be used to aid debugging +here, there are some things to be aware of. Debugging is almost always +intrusive. Adding debug code can change the situation enough to make the +bug seem to disappear. Thus, you should keep debug code to a minimum and +make sure it does not show up in production code. + +Scheduling Tasks +================ + +There are two main ways of running tasks: tasklets and work queues. +Tasklets are a quick and easy way of scheduling a single function to be +run.
For example, a tasklet can be scheduled from an interrupt handler. Work queues, +by contrast, are more complicated but also better suited to running multiple things +in a sequence. + +It is possible that in future tasklets may be replaced by *threaded +irqs*. However, discussion about that has been ongoing since 2007 +([Eliminating tasklets](https://lwn.net/Articles/239633)), so do not +hold your breath. See the section +15.1 +if you wish to avoid the tasklet debate. + +Tasklets +-------- + +Here is an example tasklet module. The `tasklet_fn` function +runs for a few seconds. In the meantime, execution of the +`example_tasklet_init` function may continue to the +exit point, depending on whether it is interrupted by **softirq**. + +So with this example loaded `dmesg` should show: + + tasklet example init + Example tasklet starts + Example tasklet init continues... + Example tasklet ends + +Although tasklets are easy to use, they come with several drawbacks, and +developers are discussing getting rid of tasklets in the Linux kernel. +The tasklet callback runs in atomic context, inside a software +interrupt, meaning that it cannot sleep or access user-space data, so +not all work can be done in a tasklet handler. Also, the kernel only +allows one instance of any given tasklet to be running at any given +time; multiple different tasklet callbacks can run in parallel. + +In recent kernels, tasklets can be replaced by workqueues, timers, or +threaded interrupts.[1] While the removal of tasklets remains a +longer-term goal, the current kernel contains more than a hundred uses +of tasklets. Now developers are proceeding with the API changes and the +macro `DECLARE_TASKLET_OLD` exists for compatibility. +For further information, see . + +Work queues +----------- + +To add a task to the scheduler we can use a workqueue. The kernel then +uses the Completely Fair Scheduler (CFS) to execute work within the +queue.
+ +Interrupt Handlers +================== + +Interrupt Handlers +------------------ + +Except for the last chapter, everything we did in the kernel so far we +have done as a response to a process asking for it, either by dealing +with a special file, sending an `ioctl()`, or issuing a system call. But +the job of the kernel is not just to respond to process requests. +Another job, which is every bit as important, is to speak to the +hardware connected to the machine. + +There are two types of interaction between the CPU and the rest of the +computer’s hardware. The first type is when the CPU gives orders to the +hardware; the other is when the hardware needs to tell the CPU +something. The second, called interrupts, is much harder to implement +because it has to be dealt with when convenient for the hardware, not +the CPU. Hardware devices typically have a very small amount of RAM, and +if you do not read their information when available, it is lost. + +Under Linux, hardware interrupts are called IRQs (Interrupt ReQuests). +There are two types of IRQs, short and long. A short IRQ is one which +is expected to take a very short period of time, during which the rest +of the machine will be blocked and no other interrupts will be handled. +A long IRQ is one which can take longer, and during which other +interrupts may occur (but not interrupts from the same device). If at +all possible, it is better to declare an interrupt handler to be long. + +When the CPU receives an interrupt, it stops whatever it is doing +(unless it is processing a more important interrupt, in which case it +will deal with this one only when the more important one is done), saves +certain parameters on the stack and calls the interrupt handler. This +means that certain things are not allowed in the interrupt handler +itself, because the system is in an unknown state. The Linux kernel solves +the problem by splitting interrupt handling into two parts.
The first +part executes right away and masks the interrupt line. Hardware +interrupts must be handled quickly, and that is why we need the second +part to handle the heavy work deferred from an interrupt handler. +Historically, BH (Linux naming for *Bottom Halves*) statically +book-kept the deferred functions. **Softirq** and its higher level +abstraction, **Tasklet**, have replaced BH since Linux 2.3. + +The way to implement this is to call `request_irq()` to get +your interrupt handler called when the relevant IRQ is received. + +In practice IRQ handling can be a bit more complex. Hardware is often +designed in a way that chains two interrupt controllers, so that all the +IRQs from interrupt controller B are cascaded to a certain IRQ from +interrupt controller A. Of course, that requires that the kernel finds +out which IRQ it really was afterwards and that adds overhead. Other +architectures offer some special, very low overhead, so called "fast +IRQs" or FIQs. Taking advantage of them requires handlers to be written +in assembly language, so they do not really fit into the kernel. They +can be made to work similarly to the others, but after that procedure, +they are no longer any faster than "common" IRQs. SMP enabled kernels +running on systems with more than one processor need to solve another +truckload of problems. It is not enough to know if a certain IRQ has +happened; it’s also important to know what CPU(s) it was for. People +still interested in more details might want to refer to "APIC" now. + +The `request_irq()` function receives the IRQ number, the name of the function, flags, +a name for `/proc/interrupts` and a parameter to be passed to the +interrupt handler. Usually there is a certain number of IRQs available. +How many IRQs there are is hardware-dependent. + +The flags can be used to specify behaviors of the IRQ.
For example, use +`IRQF_SHARED` to indicate you are willing to share the IRQ with +other interrupt handlers (usually because a number of hardware devices +sit on the same IRQ); use `IRQF_ONESHOT` to indicate that +the IRQ is not reenabled after the handler finished. It should be noted +that in some materials, you may encounter another set of IRQ flags named +with the `SA` prefix, for example `SA_SHIRQ` and +`SA_INTERRUPT`. Those are the IRQ flags of older +kernels. They have been removed completely. Today only the `IRQF` flags +are in use. This function will only succeed if there is not already a +handler on this IRQ, or if you are both willing to share. + +Detecting button presses +------------------------ + +Many popular single board computers, such as Raspberry Pi or +Beagleboards, have a bunch of GPIO pins. Attaching buttons to those and +then having a button press do something is a classic case in which you +might need to use interrupts: instead of having the CPU waste +time and battery power polling for a change in input state, the input +triggers the CPU to run a particular handling +function. + +Here is an example where buttons are connected to GPIO numbers 17 and 18 +and an LED is connected to GPIO 4. You can change those numbers to +whatever is appropriate for your board. + +Bottom Half +----------- + +Suppose you want to do a bunch of stuff inside of an interrupt routine. +A common way to do that without rendering the interrupt unavailable for +a significant duration is to combine it with a tasklet. This pushes the +bulk of the work off into the scheduler. + +The example below modifies the previous example to also run an +additional task when an interrupt is triggered. + +Threaded IRQ +------------ + +Threaded IRQ is a mechanism to organize both top-half and bottom-half of +an IRQ at once.
A threaded IRQ splits the one handler in +`request_irq()` into two: one for the top-half, the other for +the bottom-half. `request_threaded_irq()` is the +function for using threaded IRQs. Two handlers are registered at once in +`request_threaded_irq()`. + +Those two handlers run in different contexts. The top-half handler runs +in interrupt context. It is the equivalent of the handler passed to +`request_irq()`. The bottom-half handler on the other hand runs +in its own thread. This thread is created on registration of a threaded +IRQ. Its sole purpose is to run this bottom-half handler. This is where +a threaded IRQ is “threaded”. If `IRQ_WAKE_THREAD` is +returned by the top-half handler, that bottom-half serving thread will +wake up. The thread then runs the bottom-half handler. + +Here is an example of how to do the same thing as before, with top and +bottom halves, but using threads. + +A threaded IRQ is registered using +`request_threaded_irq()`. This function takes only one +more parameter than `request_irq()`: the bottom-half +handling function that runs in its own thread. In this example it is +`button_bottom_half()`. Usage of the other parameters is +the same as in `request_irq()`. + +Presence of both handlers is not mandatory. If either of them is not +needed, pass `NULL` instead. A `NULL` top-half handler implies that +no action is taken except to wake up the bottom-half serving thread, +which runs the bottom-half handler. Similarly, a `NULL` bottom-half +handler effectively acts as if `request_irq()` were used. In +fact, this is how `request_irq()` is implemented. + +Note that passing `NULL` to both handlers is considered an error and +will make registration fail.
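The `NULL` rules above can be made concrete with a toy user-space model. This is a sketch under stated assumptions, not the kernel implementation: every `model_*` name, `struct button_stats` and `button_top_half()` are hypothetical, and the bottom half is called synchronously here, whereas the kernel would wake a dedicated thread to run it.

```c
/* Toy user-space model of the top-half/bottom-half registration rules.
 * All model_* names are hypothetical stand-ins, not kernel symbols. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum model_irqreturn {
    MODEL_IRQ_NONE,
    MODEL_IRQ_HANDLED,
    MODEL_IRQ_WAKE_THREAD,
};

typedef enum model_irqreturn (*model_handler_t)(int irq, void *dev_id);

struct model_irqaction {
    model_handler_t top;    /* would run in interrupt context */
    model_handler_t bottom; /* would run in the IRQ thread */
    void *dev_id;
};

/* Stand-in top half used when the caller passes NULL: only wake the thread. */
static enum model_irqreturn model_default_top(int irq, void *dev_id)
{
    (void)irq;
    (void)dev_id;
    return MODEL_IRQ_WAKE_THREAD;
}

int model_request_threaded_irq(struct model_irqaction *act,
                               model_handler_t top, model_handler_t bottom,
                               void *dev_id)
{
    assert(act != NULL);
    if (!top) {
        if (!bottom)
            return -EINVAL; /* both handlers NULL: registration fails */
        top = model_default_top;
    }
    act->top = top;
    act->bottom = bottom;
    act->dev_id = dev_id;
    return 0;
}

/* The non-threaded variant is the threaded one with a NULL bottom half. */
int model_request_irq(struct model_irqaction *act, model_handler_t top,
                      void *dev_id)
{
    return model_request_threaded_irq(act, top, NULL, dev_id);
}

/* Simulate one interrupt arriving on the registered line. */
enum model_irqreturn model_raise_irq(const struct model_irqaction *act, int irq)
{
    enum model_irqreturn ret = act->top(irq, act->dev_id);

    if (ret == MODEL_IRQ_WAKE_THREAD && act->bottom)
        ret = act->bottom(irq, act->dev_id);
    return ret;
}

/* Example handlers that count how often each half runs. */
struct button_stats {
    int top_calls;
    int bottom_calls;
};

enum model_irqreturn button_top_half(int irq, void *dev_id)
{
    struct button_stats *s = dev_id;

    (void)irq;
    s->top_calls++;
    return MODEL_IRQ_WAKE_THREAD;
}

enum model_irqreturn button_bottom_half(int irq, void *dev_id)
{
    struct button_stats *s = dev_id;

    (void)irq;
    s->bottom_calls++;
    return MODEL_IRQ_HANDLED;
}
```

Registering with both handlers runs the top half and then the bottom half for each simulated interrupt. Registering with a `NULL` top half runs only the bottom half, and passing `NULL` for both is rejected with `-EINVAL` in this model.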
+ +Virtual Input Device Driver +=========================== + +The input device driver is a module that provides a way to communicate +with the interaction device through events. For example, the keyboard can +send a press or release event to tell the kernel what we want to do. +The input device driver allocates a new input structure with +`input_allocate_device()` and sets up input bitfields, +device id, version, etc. After that, it registers the device by calling +`input_register_device()`. + +Here is an example, vinput. It is an API to allow easy development of +virtual input drivers. A driver needs to export a +`vinput_device()` that contains the virtual device name and a +`vinput_ops` structure that describes: + +- the init function: `init()` + +- the input event injection function: `send()` + +- the readback function: `read()` + +Then `vinput_register_device()` and +`vinput_unregister_device()` add a device to, or remove it from, +the list of supported virtual input devices. + +`int init(struct vinput *);` + +This function is passed a `struct vinput` already initialized with an +allocated `struct input_dev`. The `init()` function is +responsible for initializing the capabilities of the input device and +registering it. + +`int send(struct vinput *, char *, int);` + +This function will receive a user string to interpret and inject the +event using the `input_report_XXXX` or +`input_event` call. The string is already copied from user. + +`int read(struct vinput *, char *, int);` + +This function is used for debugging and should fill the buffer parameter +with the last event sent in the virtual input device format. The buffer +will then be copied to user. + +vinput devices are created and destroyed using sysfs, and event +injection is done through a `/dev` node. The device name will be used by +the userland to export a new virtual input device.
+ +The `class_attribute` structure is similar to other attribute +types we talked about in section +8: + +`struct class_attribute { struct attribute attr; ssize_t (*show)(struct class *class, struct class_attribute *attr, char *buf); ssize_t (*store)(struct class *class, struct class_attribute *attr, const char *buf, size_t count); };` + +In `vinput.c`, the macro +`CLASS_ATTR_WO(export/unexport)` defined in +[include/linux/device.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/device.h) +(in this case, `device.h` is included in +[include/linux/input.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/input.h)) +will generate the `class_attribute` structures which are named +`class_attr_export/unexport`. Then, put them into the +`vinput_class_attrs` array and the macro +`ATTRIBUTE_GROUPS(vinput_class)` will generate the +`struct attribute_group vinput_class_group` +that should be assigned in `vinput_class`. Finally, call +`class_register(&vinput_class)` to create attributes +in sysfs. + +To create a `vinputX` sysfs entry and `/dev` node: + +`echo "vkbd" | sudo tee /sys/class/vinput/export` + +To unexport the device, just echo its id in unexport: + +`echo "0" | sudo tee /sys/class/vinput/unexport` + +Here the virtual keyboard is one example of using vinput. It supports +all `KEY_MAX` keycodes. The injection format is the +`KEY_CODE` such as defined in +[include/linux/input.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/input.h). +A positive value means `KEY_PRESS` while a negative value is a +`KEY_RELEASE`. The keyboard supports repetition when the key +stays pressed for too long. The following demonstrates how the simulation +works.
+ +Simulate a key press on "g" (`KEY_G` = 34): + +`echo "+34" | sudo tee /dev/vinput0` + +Simulate a key release on "g" (`KEY_G` = 34): + +`echo "-34" | sudo tee /dev/vinput0` + +Standardizing the interfaces: The Device Model +============================================== + +Up to this point we have seen all kinds of modules doing all kinds of +things, but there was no consistency in their interfaces with the rest +of the kernel. To impose some consistency, such that there is at minimum +a standardized way to start, suspend and resume a device, a device model was +added. An example is shown below, and you can use this as a template to +add your own suspend, resume or other interface functions. + +Optimizations +============= + +Likely and Unlikely conditions +------------------------------ + +Sometimes you might want your code to run as quickly as possible, +especially if it is handling an interrupt or doing something which might +cause noticeable latency. If your code contains boolean conditions and +if you know that the conditions are almost always likely to evaluate as +either `true` or `false`, then you can allow the compiler to optimize +for this using the `likely` and `unlikely` macros. For example, when +allocating memory you are almost always expecting this to succeed. + +`bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx); if (unlikely(!bvl)) { mempool_free(bio, bio_pool); bio = NULL; goto out; }` + +When the `unlikely` macro is used, the compiler alters its machine +instruction output, so that it continues along the false branch and only +jumps if the condition is true. That avoids flushing the processor +pipeline. The opposite happens if you use the `likely` macro. + +Static keys +----------- + +Static keys allow us to enable or disable kernel code paths based on the +runtime state of a key.
Their APIs have been available since 2010 (most +architectures are already supported) and use self-modifying code to +eliminate the overhead of cache and branch prediction. The most typical +use case of static keys is for performance-sensitive kernel code, such +as tracepoints, context switching, networking, etc. These hot paths of +the kernel often contain branches and can be optimized easily using this +technique. Before we can use static keys in the kernel, we need to make +sure that gcc supports `asm goto` inline assembly, and the following +kernel configurations are set: + +CONFIG_JUMP_LABEL=y +CONFIG_HAVE_ARCH_JUMP_LABEL=y +CONFIG_HAVE_ARCH_JUMP_LABEL_RELATIVE=y + +To declare a static key, we need to define a global variable using the +`DEFINE_STATIC_KEY_FALSE` or +`DEFINE_STATIC_KEY_TRUE` macro defined in +[include/linux/jump_label.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/jump_label.h). +This macro initializes the key with the given initial value, which is +either false or true, respectively. For example, to declare a static key +with an initial value of false, we can use the following code: + +`DEFINE_STATIC_KEY_FALSE(fkey);` + +Once the static key has been declared, we need to add branching code to +the module that uses the static key. For example, the code includes a +fastpath, where a no-op instruction will be generated at compile time as +the key is initialized to false and the branch is unlikely to be taken.
+ +`pr_info("fastpath 1"); if (static_branch_unlikely(&fkey)) pr_alert("do unlikely thing"); pr_info("fastpath 2");` + +If the key is enabled at runtime by calling +`static_branch_enable(&fkey)`, the fastpath will be +patched with an unconditional jump instruction to the slowpath code +`pr_alert`, so the branch will always be taken until the key is +disabled again. + +The following kernel module, derived from `chardev.c`, demonstrates how +the static key works. + +To check the state of the static key, we can use the +`/dev/key_state` interface. + +`cat /dev/key_state` + +This will display the current state of the key, which is disabled by +default. + +To change the state of the static key, we can perform a write operation +on the file: + +`echo enable > /dev/key_state` + +This will enable the static key, causing the code path to switch from +the fastpath to the slowpath. + +In some cases the key is enabled or disabled at initialization and +never changed. We can then declare the static key as read-only, which means +that it can only be toggled in the module init function. To declare a +read-only static key, we can use the +`DEFINE_STATIC_KEY_FALSE_RO` or +`DEFINE_STATIC_KEY_TRUE_RO` macro +instead. Attempts to change the key at runtime will result in a page +fault. For more information, see [Static +keys](https://www.kernel.org/doc/Documentation/static-keys.txt). + +Common Pitfalls +=============== + +Using standard libraries +------------------------ + +You cannot do that. In a kernel module, you can only use kernel +functions, which are the functions you can see in `/proc/kallsyms`. + +Disabling interrupts +-------------------- + +You might need to do this for a short time and that is OK, but if you do +not enable them afterwards, your system will be stuck and you will have +to power it off.
+ +Where To Go From Here? +====================== + +For those deeply interested in kernel programming, +[kernelnewbies.org](https://kernelnewbies.org) and the +[Documentation](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation) +subdirectory within the kernel source code are highly recommended. +Although the latter may not always be straightforward, it serves as a +valuable initial step for further exploration. Echoing Linus Torvalds’ +perspective, the most effective method to understand the kernel is +through personal examination of the source code. + +Contributions to this guide are welcome, especially if there are any +significant inaccuracies identified. To contribute or report an issue, +please initiate an issue at . Pull +requests are greatly appreciated. + +Happy hacking! + +[1] The goal of threaded interrupts is to push more of the work to +separate threads, so that the minimum needed for acknowledging an +interrupt is reduced, and therefore the time spent handling the +interrupt (where it can’t handle any other interrupts at the same time) +is reduced. See .