diff --git a/nhmk.md b/nhmk.md
new file mode 100644
index 0000000..95d39f8
--- /dev/null
+++ b/nhmk.md
@@ -0,0 +1,2645 @@
+![image](assets/cover-with-names.png)
+
+Introduction
+============
+
+The Linux Kernel Module Programming Guide is a free book; you may
+reproduce and/or modify it under the terms of the [Open Software
+License](https://opensource.org/licenses/OSL-3.0), version 3.0.
+
+This book is distributed in the hope that it will be useful, but
+without any warranty, without even the implied warranty of
+merchantability or fitness for a particular purpose.
+
+The author encourages wide distribution of this book for personal or
+commercial use, provided the above copyright notice remains intact and
+the method adheres to the provisions of the [Open Software
+License](https://opensource.org/licenses/OSL-3.0). In summary, you may
+copy and distribute this book free of charge or for a profit. No
+explicit permission is required from the author for reproduction of this
+book in any medium, physical or electronic.
+
+Derivative works and translations of this document must be placed under
+the Open Software License, and the original copyright notice must remain
+intact. If you have contributed new material to this book, you must make
+the material and source code available for your revisions. Please make
+revisions and updates available directly to the document maintainer, Jim
+Huang <jserv@ccns.ncku.edu.tw>. This will allow for the merging of
+updates and provide consistent revisions to the Linux community.
+
+If you publish or distribute this book commercially, donations,
+royalties, and/or printed copies are greatly appreciated by the author
+and the [Linux Documentation Project](https://tldp.org/) (LDP).
+Contributing in this way shows your support for free software and the
+LDP. If you have questions or comments, please contact the address
+above.
+
+Authorship
+----------
+
+The Linux Kernel Module Programming Guide was initially authored by Ori
+Pomerantz for Linux v2.2. As the Linux kernel evolved, Ori’s
+availability to maintain the document diminished. Consequently, Peter
+Jay Salzman assumed the role of maintainer and updated the guide for
+Linux v2.4. Similar constraints arose for Peter when tracking
+developments in Linux v2.6, leading to Michael Burian joining as a
+co-maintainer to bring the guide up to speed with Linux v2.6. Bob
+Mottram contributed to the guide by updating examples for Linux v3.8 and
+later. Jim Huang then undertook the task of updating the guide for
+recent Linux versions (v5.0 and beyond), along with revising the LaTeX
+document.
+
+Acknowledgements
+----------------
+
+The following people have contributed corrections or good suggestions:
+
+Amit Dhingra, Andy Shevchenko, Arush Sharma, Benno Bielmeier, Bob Lee,
+Brad Baker, Che-Chia Chang, Cheng-Shian Yeh, Chih-En Lin, Chih-Hsuan
+Yang, Chih-Yu Chen, Ching-Hua (Vivian) Lin, Chin Yik Ming, cvvletter,
+Cyril Brulebois, Daniele Paolo Scarpazza, David Porter, demonsome, Dimo
+Velev, Ekang Monyet, Ethan Chan, Francois Audeon, Gilad Reti,
+heartofrain, Horst Schirmeier, Hsin-Hsiang Peng, Ignacio Martin, I-Hsin
+Cheng, Iûnn Kiàn-îng, Jian-Xing Wu, Johan Calle, keytouch, Kohei Otsuka,
+Kuan-Wei Chiu, manbing, Marconi Jiang, mengxinayan, Meng-Zong Tsai,
+Peter Lin, Roman Lakeev, Sam Erickson, Shao-Tse Hung, Shih-Sheng Yang,
+Stacy Prowell, Steven Lung, Tristan Lelong, Tse-Wei Lin, Tucker Polomik,
+Tyler Fanelli, VxTeemo, Wei-Hsin Yeh, Wei-Lun Tsai, Xatierlike Lee,
+Yen-Yu Chen, Yin-Chiuan Chen, Yi-Wei Lin, Yo-Jung Lin, Yu-Hsiang Tseng,
+YYGO.
+
+What Is A Kernel Module?
+------------------------
+
+Developing Linux kernel modules requires a solid grounding in the C
+programming language and experience writing conventional programs that
+run as processes. Kernel code operates in a domain where a single stray
+pointer can wipe out an entire file system or crash the machine and
+force a complete system reboot.
+
+A Linux kernel module is a piece of code that can be dynamically loaded
+into and unloaded from the kernel as needed. Modules extend the kernel’s
+capabilities without requiring a system reboot. A notable example is the
+device driver module, which lets the kernel interact with hardware
+connected to the system. Without modules, every new feature would have
+to be built directly into the kernel image, as in a purely monolithic
+kernel. That approach leads to larger kernels and requires rebuilding
+the kernel and rebooting the system whenever new functionality is
+desired.
+
+Kernel module package
+---------------------
+
+Linux distributions provide the commands `modprobe`, `insmod` and
+`depmod` within a package.
+
+On Ubuntu/Debian GNU/Linux:
+
+    sudo apt-get install build-essential kmod
+
+On Arch Linux:
+
+    sudo pacman -S gcc kmod
+
+What Modules are in my Kernel?
+------------------------------
+
+To discover what modules are already loaded within your current kernel,
+use the command `lsmod`.
+
+    sudo lsmod
+
+The list of loaded modules is exposed through the file `/proc/modules`,
+so you can also see them with:
+
+    sudo cat /proc/modules
+
+This can be a long list, and you might prefer to search for something
+particular. To search for the `fat` module:
+
+    sudo lsmod | grep fat
+
+Is there a need to download and compile the kernel?
+---------------------------------------------------
+
+No, you do not need to download or compile the kernel to follow this
+guide. Nonetheless, it is prudent to run the examples in a test
+distribution on a virtual machine, which avoids any risk of disrupting
+your main system.
+
+Before We Begin
+---------------
+
+Before delving into code, a few matters require attention. Everyone’s
+system is different, and everyone has their own way of working. Getting
+the first “hello world” module to compile and load correctly can
+sometimes be tricky. Rest assured that once you are past this initial
+obstacle, the rest usually goes smoothly.
+
+1.  Modversioning. A module compiled for one kernel will not load if a
+    different kernel is booted, unless `CONFIG_MODVERSIONS` is
+    enabled in the kernel. Module versioning will be discussed later in
+    this guide. Until module versioning is covered, the examples in this
+    guide may not work correctly if running a kernel with modversioning
+    turned on. However, most stock Linux distribution kernels come with
+    modversioning enabled. If difficulties arise when loading the
+    modules due to versioning errors, consider compiling a kernel with
+    modversioning turned off.
+
+2.  Using X Window System. It is highly recommended to extract, compile,
+    and load all the examples discussed in this guide from a console.
+    Working on these tasks within the X Window System is discouraged.
+
+    Modules cannot print directly to the screen the way `printf()` can,
+    but they can log information and warnings that are eventually
+    displayed on the screen, though only on a console. If a module is
+    loaded from an `xterm`, the information and warnings will be logged,
+    but only to the systemd journal. These logs will not be visible
+    unless you consult `journalctl`. Refer to chapter 4, Hello World,
+    for more information. For instant access to this information, it is
+    advisable to perform all tasks from the console.
+
+3.  SecureBoot. Many modern computers ship with UEFI SecureBoot enabled,
+    a security standard that ensures the machine boots only software
+    trusted by the original equipment manufacturer. Some Linux
+    distributions even ship with the default Linux kernel configured to
+    support SecureBoot. In these cases, kernel modules must be signed
+    with a trusted key.
+
+    Failing this, an attempt to insert your first “hello world” module
+    would result in the message: “*ERROR: could not insert module*”. If
+    the message *Lockdown: insmod: unsigned module loading is
+    restricted; see man kernel_lockdown.7* appears in the `dmesg`
+    output, the simplest approach is to disable UEFI SecureBoot from the
+    boot menu of your PC or laptop, which allows the “hello world”
+    module to be inserted. The alternative is a more intricate procedure
+    of generating keys, installing them on the system, and signing the
+    module. That process is less suitable for beginners. If you are
+    interested, the Debian wiki documents detailed steps for
+    [SecureBoot](https://wiki.debian.org/SecureBoot).
+
+Headers
+=======
+
+Before building anything, it is necessary to install the header files
+for the kernel.
+
+On Ubuntu/Debian GNU/Linux:
+
+    sudo apt-get update
+    apt-cache search linux-headers-`uname -r`
+
+The above command lists the available kernel header packages. Then, for
+example:
+
+    sudo apt-get install kmod linux-headers-5.4.0-80-generic
+
+On Arch Linux:
+
+    sudo pacman -S linux-headers
+
+On Fedora:
+
+    sudo dnf install kernel-devel kernel-headers
+
+Examples
+========
+
+All the examples from this document are available within the `examples`
+subdirectory.
+
+Should compile errors occur, it may be due to a more recent kernel
+version being in use, or there might be a need to install the
+corresponding kernel header files.
+
+Hello World
+===========
+
+The Simplest Module
+-------------------
+
+Most individuals beginning their programming journey typically start
+with some variant of a *hello world* example. It is unclear what the
+outcomes are for those who deviate from this tradition, but it seems
+prudent to adhere to it. The learning process will begin with a series
+of hello world programs that illustrate various fundamental aspects of
+writing a kernel module.
+
+Presented next is the simplest possible module.
+
+Make a test directory:
+
+    mkdir -p ~/develop/kernel/hello-1
+    cd ~/develop/kernel/hello-1
+
+Paste this into your favorite editor and save it as `hello-1.c`:
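+
+The listing itself is not reproduced at this point, so the following is
+only a minimal sketch of what `hello-1.c` could look like, based on the
+description later in this section: an `init_module()` that logs a
+message and returns 0, and a matching `cleanup_module()`. The message
+strings and the use of `pr_info()` are assumptions, not necessarily the
+text of the file in the `examples` subdirectory.
+
+```c
+/*
+ * hello-1.c - a sketch of the simplest kernel module.
+ * The log messages are illustrative assumptions.
+ */
+#include <linux/module.h> /* Needed by all modules */
+#include <linux/printk.h> /* Needed for pr_info() */
+
+int init_module(void)
+{
+	pr_info("Hello world 1.\n");
+
+	/* A nonzero return means init_module failed; module can't be loaded. */
+	return 0;
+}
+
+void cleanup_module(void)
+{
+	pr_info("Goodbye world 1.\n");
+}
+
+MODULE_LICENSE("GPL");
+```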
+
+Now you will need a `Makefile`. If you copy and paste this, change the
+indentation to use *tabs*, not spaces.
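+
+The Makefile is also not shown here. A plausible version for a single
+module, modeled on the two-module Makefile that appears later in this
+chapter, is sketched below. It assumes the kernel headers are installed
+under `/lib/modules/$(uname -r)/build`; the recipe lines must start
+with a tab.
+
+```makefile
+# Build hello-1.c as a loadable module (hello-1.ko).
+obj-m += hello-1.o
+
+# Use make's own CURDIR rather than the inherited PWD variable.
+PWD := $(CURDIR)
+
+all:
+	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
+
+clean:
+	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
+```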
+
+In the `Makefile`, `$(CURDIR)` is set to the absolute pathname of the
+current working directory (after all `-C` options are processed, if
+any). See more about `CURDIR` in the [GNU make
+manual](https://www.gnu.org/software/make/manual/make.html).
+
+And finally, just run `make` directly.
+
+    make
+
+If there is no `PWD := $(CURDIR)` statement in the Makefile, the module
+may not compile correctly with `sudo make`, because some environment
+variables are controlled by the security policy and cannot be
+inherited. The default security policy is `sudoers`. In the `sudoers`
+security policy, `env_reset` is enabled by default, which restricts
+environment variables. Specifically, path variables are not retained
+from the user environment; they are set to default values (for more
+information see the [sudoers
+manual](https://www.sudo.ws/docs/man/sudoers.man/)). You can see the
+environment variable settings by:
+
+ $ sudo -s
+ # sudo -V
+
+Here is a simple Makefile as an example to demonstrate the problem
+mentioned above.
+
+    all:
+    	echo $(PWD)
+
+Then, we can use the `-p` flag to print out the environment variable
+values from the Makefile.
+
+    $ make -p | grep PWD
+    PWD = /home/ubuntu/temp
+    OLDPWD = /home/ubuntu
+        echo $(PWD)
+
+The `PWD` variable won’t be inherited with `sudo`.
+
+    $ sudo make -p | grep PWD
+        echo $(PWD)
+
+However, there are three ways to solve this problem.
+
+1.  You can use the `-E` flag to temporarily preserve them.
+
+        $ sudo -E make -p | grep PWD
+        PWD = /home/ubuntu/temp
+        OLDPWD = /home/ubuntu
+            echo $(PWD)
+
+2.  You can disable `env_reset` by editing `/etc/sudoers` as root with
+    `visudo`.
+
+        ## sudoers file.
+        ##
+        ...
+        Defaults env_reset
+        ## Change env_reset to !env_reset in previous line to keep all environment variables
+
+    Then execute `env` and `sudo env` individually.
+
+        # disable the env_reset
+        echo "user:" > non-env_reset.log; env >> non-env_reset.log
+        echo "root:" >> non-env_reset.log; sudo env >> non-env_reset.log
+        # enable the env_reset
+        echo "user:" > env_reset.log; env >> env_reset.log
+        echo "root:" >> env_reset.log; sudo env >> env_reset.log
+
+    You can view and compare these logs to find differences between
+    `env_reset` and `!env_reset`.
+
+3.  You can preserve environment variables by appending them to
+    `env_keep` in `/etc/sudoers`.
+
+        Defaults env_keep += "PWD"
+
+    After applying the above change, you can check the environment
+    variable settings by:
+
+        $ sudo -s
+        # sudo -V
+
+If all goes smoothly you should then find that you have a compiled
+`hello-1.ko` module. You can find info on it with the command:
+
+    modinfo hello-1.ko
+
+At this point the command:
+
+    sudo lsmod | grep hello
+
+should return nothing. You can try loading your shiny new module with:
+
+    sudo insmod hello-1.ko
+
+The dash character will get converted to an underscore, so when you
+again try:
+
+    sudo lsmod | grep hello
+
+you should now see your loaded module. It can be removed again with:
+
+    sudo rmmod hello_1
+
+Notice that the dash was replaced by an underscore. To see what just
+happened in the logs:
+
+    sudo journalctl --since "1 hour ago" | grep kernel
+
+You now know the basics of creating, compiling, installing and removing
+modules. Now for more of a description of how this module works.
+
+Kernel modules must have at least two functions: a “start”
+(initialization) function called `init_module()` which is called when
+the module is `insmod`ed into the kernel, and an “end” (cleanup)
+function called `cleanup_module()` which is called just before it is
+removed from the kernel. Actually, things have changed starting with
+kernel 2.3.13. You can now use whatever name you like for the start and
+end functions of a module, and you will learn how to do this in the
+“Hello and Goodbye” section. In fact, the new method is the preferred
+method. However, many people still use `init_module()` and
+`cleanup_module()` for their start and end functions.
+
+Typically, `init_module()` either registers a handler for something
+with the kernel, or it replaces one of the kernel functions with its own
+code (usually code to do something and then call the original function).
+The `cleanup_module()` function is supposed to undo whatever
+`init_module()` did, so the module can be unloaded safely.
+
+Lastly, every kernel module needs to include `<linux/module.h>`. We
+needed to include `<linux/printk.h>` only for the macro expansion for
+the `pr_alert()` log level, which you’ll learn about in the note on
+print macros below.
+
+1.  A point about coding style. Another thing which may not be
+    immediately obvious to anyone getting started with kernel
+    programming is that indentation within your code should be using
+    **tabs** and **not spaces**. It is one of the coding conventions of
+    the kernel. You may not like it, but you’ll need to get used to it
+    if you ever submit a patch upstream.
+
+2.  Introducing print macros. In the beginning there was `printk`,
+    usually followed by a priority such as `KERN_INFO` or `KERN_DEBUG`.
+    More recently this can also be expressed in abbreviated form using a
+    set of print macros, such as `pr_info` and `pr_debug`. This just
+    saves some mindless keyboard bashing and looks a bit neater. They
+    can be found within
+    [include/linux/printk.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/printk.h).
+    Take time to read through the available priority macros.
+
+3.  About Compiling. Kernel modules need to be compiled a bit
+    differently from regular userspace apps. Former kernel versions
+    required us to care much about these settings, which are usually
+    stored in Makefiles. Although hierarchically organized, many
+    redundant settings accumulated in sublevel Makefiles and made them
+    large and rather difficult to maintain. Fortunately, there is a new
+    way of doing these things, called kbuild, and the build process for
+    external loadable modules is now fully integrated into the standard
+    kernel build mechanism. To learn more on how to compile modules
+    which are not part of the official kernel (such as all the examples
+    you will find in this guide), see file
+    [Documentation/kbuild/modules.rst](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/kbuild/modules.rst).
+
+    Additional details about Makefiles for kernel modules are available
+    in
+    [Documentation/kbuild/makefiles.rst](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/kbuild/makefiles.rst).
+    Be sure to read this and the related files before starting to hack
+    Makefiles. It will probably save you lots of work.
+
+    > Here is another exercise for the reader. See that comment above
+    > the return statement in `init_module()`? Change the return value
+    > to something negative, recompile and load the module again. What
+    > happens?
+
+Hello and Goodbye
+-----------------
+
+In early kernel versions you had to use the `init_module` and
+`cleanup_module` functions, as in the first hello world example, but
+these days you can name those anything you want by using the
+`module_init` and `module_exit` macros. These macros are defined in
+[include/linux/module.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/module.h).
+The only requirement is that your init and cleanup functions must be
+defined before calling those macros, otherwise you’ll get compilation
+errors. An example of this technique is in `examples/hello-2.c`.
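+
+As a rough illustration of the technique, a module with freely named
+entry and exit functions might look like the sketch below. The function
+names `hello_2_init` and `hello_2_exit` and the messages are assumptions
+made for this sketch; only `module_init()` and `module_exit()` come from
+the text above.
+
+```c
+/*
+ * hello-2.c - sketch: naming the entry and exit functions freely
+ * with module_init() and module_exit().
+ */
+#include <linux/init.h>   /* Needed for the macros */
+#include <linux/module.h> /* Needed by all modules */
+#include <linux/printk.h> /* Needed for pr_info() */
+
+/* Both functions are defined before the macros that refer to them. */
+static int hello_2_init(void)
+{
+	pr_info("Hello, world 2\n");
+	return 0;
+}
+
+static void hello_2_exit(void)
+{
+	pr_info("Goodbye, world 2\n");
+}
+
+module_init(hello_2_init);
+module_exit(hello_2_exit);
+
+MODULE_LICENSE("GPL");
+```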
+
+So now we have two real kernel modules under our belt. Adding another
+module is as simple as this:
+
+    obj-m += hello-1.o
+    obj-m += hello-2.o
+
+    PWD := $(CURDIR)
+
+    all:
+    	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
+
+    clean:
+    	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
+
+Now have a look at
+[drivers/char/Makefile](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/char/Makefile)
+for a real world example. As you can see, some things got hardwired into
+the kernel (`obj-y`) but where have all those `obj-m` gone? Those
+familiar with shell scripts will easily be able to spot them. For those
+who are not, the `obj-$(CONFIG_FOO)` entries you see everywhere
+expand into `obj-y` or `obj-m`, depending on whether the
+`CONFIG_FOO` variable has been set to `y` or `m`. While we are at
+it, those were exactly the kind of variables that you have set in the
+`.config` file in the top-level directory of the Linux kernel source
+tree, the last time you ran `make menuconfig` or something like that.
+
+The \_\_init and \_\_exit Macros
+--------------------------------
+
+The `__init` macro causes the init function to be discarded and its
+memory freed once the init function finishes for built-in drivers, but
+not loadable modules. If you think about when the init function is
+invoked, this makes perfect sense.
+
+There is also an `__initdata` which works similarly to `__init` but for
+init variables rather than functions.
+
+The `__exit` macro causes the omission of the function when the module
+is built into the kernel, and like `__init`, has no effect for loadable
+modules. Again, if you consider when the cleanup function runs, this
+makes complete sense; built-in drivers do not need a cleanup function,
+while loadable modules do.
+
+These macros are defined in
+[include/linux/init.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/init.h)
+and serve to free up kernel memory. When you boot your kernel and see
+something like `Freeing unused kernel memory: 236k freed`, this is
+precisely what the kernel is freeing.
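+
+To make the three annotations concrete, here is a small sketch of a
+module that uses them. The variable `hello3_data`, the function names
+and the messages are assumptions chosen for illustration.
+
+```c
+/*
+ * hello-3.c - sketch: marking init data, the init function and the
+ * exit function with __initdata, __init and __exit.
+ */
+#include <linux/init.h>   /* Needed for the macros */
+#include <linux/module.h> /* Needed by all modules */
+#include <linux/printk.h> /* Needed for pr_info() */
+
+/* Only used during initialization, so it may be discarded afterwards. */
+static int hello3_data __initdata = 3;
+
+static int __init hello_3_init(void)
+{
+	pr_info("Hello, world %d\n", hello3_data);
+	return 0;
+}
+
+/* Omitted entirely when the code is built into the kernel. */
+static void __exit hello_3_exit(void)
+{
+	pr_info("Goodbye, world 3\n");
+}
+
+module_init(hello_3_init);
+module_exit(hello_3_exit);
+
+MODULE_LICENSE("GPL");
+```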
+
+Licensing and Module Documentation
+----------------------------------
+
+Honestly, who loads or even cares about proprietary modules? If you do
+then you might have seen something like this:
+
+ $ sudo insmod xxxxxx.ko
+ loading out-of-tree module taints kernel.
+ module license 'unspecified' taints kernel.
+
+You can use a few macros to indicate the license for your module. Some
+examples are "GPL", "GPL v2", "GPL and additional rights", "Dual
+BSD/GPL", "Dual MIT/GPL", "Dual MPL/GPL" and "Proprietary". They are
+defined within
+[include/linux/module.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/module.h).
+
+To declare which license your module uses, a macro called
+`MODULE_LICENSE` is available. This and a few other macros describing
+the module are illustrated in the example below.
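+
+A sketch of such a module follows. The author and description strings
+are taken from the `modinfo hello-4.ko` output shown later in this
+chapter; the function names and log messages are assumptions.
+
+```c
+/*
+ * hello-4.c - sketch: documenting a module with metadata macros.
+ */
+#include <linux/init.h>   /* Needed for the macros */
+#include <linux/module.h> /* Needed by all modules */
+#include <linux/printk.h> /* Needed for pr_info() */
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("LKMPG");
+MODULE_DESCRIPTION("A sample driver");
+
+static int __init init_hello_4(void)
+{
+	pr_info("Hello, world 4\n");
+	return 0;
+}
+
+static void __exit cleanup_hello_4(void)
+{
+	pr_info("Goodbye, world 4\n");
+}
+
+module_init(init_hello_4);
+module_exit(cleanup_hello_4);
+```
+
+With a GPL-compatible license string declared, the “module license
+'unspecified' taints kernel” message quoted above should no longer
+apply, although loading an out-of-tree module is still reported.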
+
+Passing Command Line Arguments to a Module
+------------------------------------------
+
+Modules can take command line arguments, but not with the argc/argv you
+might be used to.
+
+To allow arguments to be passed to your module, declare the variables
+that will take the values of the command line arguments as global and
+then use the `module_param()` macro (defined in
+[include/linux/moduleparam.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/moduleparam.h))
+to set the mechanism up. At runtime, `insmod` will fill the variables
+with any command line arguments that are given, like
+`insmod mymodule.ko myvariable=5`. The variable declarations and macros
+should be placed at the beginning of the module for clarity. The example
+code should clear up my admittedly lousy explanation.
+
+The `module_param()` macro takes 3 arguments: the name of the
+variable, its type and permissions for the corresponding file in sysfs.
+Integer types can be signed as usual or unsigned. If you’d like to use
+arrays of integers or strings see `module_param_array()` and
+`module_param_string()`.
+
+    int myint = 3;
+    module_param(myint, int, 0);
+
+Arrays are supported too, but things are a bit different now than they
+were in the olden days. To keep track of the number of parameters you
+need to pass a pointer to a count variable as third parameter. At your
+option, you could also ignore the count and pass `NULL` instead. We show
+both possibilities here:
+
+    int myintarray[2];
+    module_param_array(myintarray, int, NULL, 0); /* not interested in count */
+
+    short myshortarray[4];
+    int count;
+    module_param_array(myshortarray, short, &count, 0); /* put count into "count" variable */
+
+A good use for this is to have the module variable’s default values set,
+like a port or IO address. If the variables contain the default values,
+then perform autodetection (explained elsewhere). Otherwise, keep the
+current value. This will be made clear later on.
+
+Lastly, there is a macro function, `MODULE_PARM_DESC()`, that is used
+to document arguments that the module can take. It takes two
+parameters: a variable name and a free form string describing that
+variable.
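+
+The macros described above can be combined as in the following sketch
+of a `hello-5.c`. The variable names and default values (`1`, `420`,
+`9999`, an array of two `420`s) are inferred from the session output
+shown below; the default string `"blah"`, the count variable
+`arr_argc`, the sysfs permission values and the description strings are
+assumptions.
+
+```c
+/*
+ * hello-5.c - sketch: passing command line arguments to a module.
+ */
+#include <linux/init.h>
+#include <linux/kernel.h> /* for ARRAY_SIZE() */
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/printk.h>
+
+MODULE_LICENSE("GPL");
+
+static short int myshort = 1;
+static int myint = 420;
+static long int mylong = 9999;
+static char *mystring = "blah"; /* assumed default */
+static int myintarray[2] = { 420, 420 };
+static int arr_argc; /* number of array elements given at load time */
+
+/*
+ * module_param(name, type, perm): perm is the permission of the
+ * matching sysfs file; 0 means the parameter is not exposed in sysfs.
+ */
+module_param(myshort, short, 0644);
+MODULE_PARM_DESC(myshort, "A short integer");
+module_param(myint, int, 0644);
+MODULE_PARM_DESC(myint, "An integer");
+module_param(mylong, long, 0400);
+MODULE_PARM_DESC(mylong, "A long integer");
+module_param(mystring, charp, 0000);
+MODULE_PARM_DESC(mystring, "A character string");
+
+/* The third argument receives the number of elements supplied. */
+module_param_array(myintarray, int, &arr_argc, 0000);
+MODULE_PARM_DESC(myintarray, "An array of integers");
+
+static int __init hello_5_init(void)
+{
+	int i;
+
+	pr_info("Hello, world 5\n");
+	pr_info("myshort is a short integer: %hd\n", myshort);
+	pr_info("myint is an integer: %d\n", myint);
+	pr_info("mylong is a long integer: %ld\n", mylong);
+	pr_info("mystring is a string: %s\n", mystring);
+
+	for (i = 0; i < ARRAY_SIZE(myintarray); i++)
+		pr_info("myintarray[%d] = %d\n", i, myintarray[i]);
+
+	pr_info("got %d arguments for myintarray.\n", arr_argc);
+	return 0;
+}
+
+static void __exit hello_5_exit(void)
+{
+	pr_info("Goodbye, world 5\n");
+}
+
+module_init(hello_5_init);
+module_exit(hello_5_exit);
+```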
+
+It is recommended to experiment with the following code:
+
+ $ sudo insmod hello-5.ko mystring="bebop" myintarray=-1
+ $ sudo dmesg -t | tail -7
+ myshort is a short integer: 1
+ myint is an integer: 420
+ mylong is a long integer: 9999
+ mystring is a string: bebop
+ myintarray[0] = -1
+ myintarray[1] = 420
+ got 1 arguments for myintarray.
+
+ $ sudo rmmod hello-5
+ $ sudo dmesg -t | tail -1
+ Goodbye, world 5
+
+ $ sudo insmod hello-5.ko mystring="supercalifragilisticexpialidocious" myintarray=-1,-1
+ $ sudo dmesg -t | tail -7
+ myshort is a short integer: 1
+ myint is an integer: 420
+ mylong is a long integer: 9999
+ mystring is a string: supercalifragilisticexpialidocious
+ myintarray[0] = -1
+ myintarray[1] = -1
+ got 2 arguments for myintarray.
+
+ $ sudo rmmod hello-5
+ $ sudo dmesg -t | tail -1
+ Goodbye, world 5
+
+ $ sudo insmod hello-5.ko mylong=hello
+ insmod: ERROR: could not insert module hello-5.ko: Invalid parameters
+
+Modules Spanning Multiple Files
+-------------------------------
+
+Sometimes it makes sense to divide a kernel module between several
+source files.
+
+Here is an example of such a kernel module.
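+
+The source files are not reproduced here. As a sketch, the first file
+could hold only the entry function. The file name `start.c` and the
+message are assumptions.
+
+```c
+/*
+ * start.c - sketch: first half of a module split across two files.
+ */
+#include <linux/module.h> /* Needed by all modules */
+#include <linux/printk.h> /* Needed for pr_info() */
+
+int init_module(void)
+{
+	pr_info("Hello, world - this is the kernel speaking\n");
+	return 0;
+}
+
+MODULE_LICENSE("GPL");
+```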
+
+The next file:
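+
+A matching sketch for the second file, holding only the exit function.
+Again, the file name `stop.c` and the message are assumptions.
+
+```c
+/*
+ * stop.c - sketch: second half of a module split across two files.
+ */
+#include <linux/module.h> /* Needed by all modules */
+#include <linux/printk.h> /* Needed for pr_info() */
+
+void cleanup_module(void)
+{
+	pr_info("Short is the life of a kernel module\n");
+}
+
+MODULE_LICENSE("GPL");
+```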
+
+And finally, the makefile:
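+
+A Makefile along the following lines would build all the examples so
+far, assuming the split module consists of two source files named
+`start.c` and `stop.c` and that the combined object is called
+`startstop` (all three names are hypothetical). Recipe lines must start
+with a tab.
+
+```makefile
+obj-m += hello-1.o
+obj-m += hello-2.o
+obj-m += hello-3.o
+obj-m += hello-4.o
+obj-m += hello-5.o
+# Name of the combined module, then the objects it is linked from.
+obj-m += startstop.o
+startstop-objs := start.o stop.o
+
+PWD := $(CURDIR)
+
+all:
+	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
+
+clean:
+	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
+```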
+
+This is the complete makefile for all the examples we have seen so far.
+The first five lines are nothing special, but for the last example we
+will need two lines. First we invent an object name for our combined
+module, second we tell `make` what object files are part of that module.
+
+Building modules for a precompiled kernel
+-----------------------------------------
+
+Obviously, we strongly suggest that you recompile your kernel, so that
+you can enable a number of useful debugging features, such as forced
+module unloading (`MODULE_FORCE_UNLOAD`): when this option is enabled,
+you can force the kernel to unload a module even when it believes it is
+unsafe, via a `sudo rmmod -f module` command. This option can save you a
+lot of time and a number of reboots during the development of a module.
+If you do not want to recompile your kernel then you should consider
+running the examples within a test distribution on a virtual machine. If
+you mess anything up then you can easily reboot or restore the virtual
+machine (VM).
+
+There are a number of cases in which you may want to load your module
+into a precompiled running kernel, such as the ones shipped with common
+Linux distributions, or a kernel you have compiled in the past. In
+certain circumstances you may need to compile and insert a module into
+a running kernel which you are not allowed to recompile, or on a
+machine that you prefer not to reboot. If you can’t think of a case that
+will force you to use modules for a precompiled kernel you might want to
+skip this and treat the rest of this chapter as a big footnote.
+
+Now, if you just install a kernel source tree, use it to compile your
+kernel module and you try to insert your module into the kernel, in most
+cases you would obtain an error as follows:
+
+ insmod: ERROR: could not insert module poet.ko: Invalid module format
+
+Less cryptic information is logged to the systemd journal:
+
+    kernel: poet: disagrees about version of symbol module_layout
+
+In other words, your kernel refuses to accept your module because
+version strings (more precisely, *version magic*, see
+[include/linux/vermagic.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/vermagic.h))
+do not match. Incidentally, version magic strings are stored in the
+module object in the form of a static string, starting with `vermagic:`.
+Version data are inserted in your module when it is linked against the
+`kernel/module.o` file. To inspect version magics and other strings
+stored in a given module, issue the command `modinfo module.ko`:
+
+    $ modinfo hello-4.ko
+    description:    A sample driver
+    author:         LKMPG
+    license:        GPL
+    srcversion:     B2AA7FBFCC2C39AED665382
+    depends:
+    retpoline:      Y
+    name:           hello_4
+    vermagic:       5.4.0-70-generic SMP mod_unload modversions
+
+To overcome this problem we could resort to the `--force-vermagic`
+option, but this solution is potentially unsafe, and unquestionably
+unacceptable in production modules. Consequently, we want to compile our
+module in an environment identical to the one in which our precompiled
+kernel was built. How to do this is the subject of the remainder of this
+chapter.
+
+First of all, make sure that a kernel source tree is available, having
+exactly the same version as your current kernel. Then, find the
+configuration file which was used to compile your precompiled kernel.
+Usually, this is available in your current `boot` directory, under a
+name like `config-5.14.x`. You may just want to copy it to your kernel
+source tree: `` cp /boot/config-`uname -r` .config ``.
+
+Let’s focus again on the previous error message: a closer look at the
+version magic strings suggests that, even with two configuration files
+which are exactly the same, a slight difference in the version magic
+could be possible, and it is sufficient to prevent insertion of the
+module into the kernel. That slight difference, namely the custom string
+which appears in the module’s version magic and not in the kernel’s one,
+is due to a modification with respect to the original, in the makefile
+that some distributions include. Then, examine your `Makefile`, and make
+sure that the specified version information matches exactly the one used
+for your current kernel. For example, your makefile could start as
+follows:
+
+ VERSION = 5
+ PATCHLEVEL = 14
+ SUBLEVEL = 0
+ EXTRAVERSION = -rc2
+
+In this case, you need to restore the value of symbol **EXTRAVERSION**
+to **-rc2**. We suggest keeping a backup copy of the makefile used to
+compile your kernel available in `/lib/modules/5.14.0-rc2/build`. A
+simple command like the following should suffice.
+
+    cp /lib/modules/`uname -r`/build/Makefile linux-`uname -r`
+
+Here `` linux-`uname -r` `` is the Linux kernel source you are
+attempting to build.
+
+Now, please run `make` to update configuration and version headers and
+objects:
+
+ $ make
+ SYNC include/config/auto.conf.cmd
+ HOSTCC scripts/basic/fixdep
+ HOSTCC scripts/kconfig/conf.o
+ HOSTCC scripts/kconfig/confdata.o
+ HOSTCC scripts/kconfig/expr.o
+ LEX scripts/kconfig/lexer.lex.c
+ YACC scripts/kconfig/parser.tab.[ch]
+ HOSTCC scripts/kconfig/preprocess.o
+ HOSTCC scripts/kconfig/symbol.o
+ HOSTCC scripts/kconfig/util.o
+ HOSTCC scripts/kconfig/lexer.lex.o
+ HOSTCC scripts/kconfig/parser.tab.o
+ HOSTLD scripts/kconfig/conf
+
+If you do not want to actually compile the kernel, you can interrupt
+the build process (CTRL-C) just after the SPLIT line, because at that
+time, the files you need are ready. Now you can turn back to the
+directory of your module and compile it. It will be built exactly
+according to your current kernel settings, and it will load into it
+without any errors.
+
+Preliminaries
+=============
+
+How modules begin and end
+-------------------------
+
+A typical program starts with a `main()` function, executes a series of
+instructions, and terminates after completing these instructions. Kernel
+modules, however, follow a different pattern. A module always begins
+with either the `init_module` function or a function designated by the
+`module_init` call. This function acts as the module’s entry point,
+informing the kernel of the module’s functionalities and preparing the
+kernel to utilize the module’s functions when necessary. After
+performing these tasks, the entry function returns, and the module
+remains inactive until the kernel requires its code.
+
+All modules conclude by invoking either `cleanup_module` or a function
+specified through the `module_exit` call. This serves as the module’s
+exit function, reversing the actions of the entry function by
+unregistering the previously registered functionalities.
+
+It is mandatory for every module to have both an entry and an exit
+function. While there are multiple methods to define these functions,
+the terms “entry function” and “exit function” are generally used.
+However, they may occasionally be referred to as `init_module` and
+`cleanup_module`, which are understood to mean the same.
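+
+As a rough illustration of the entry and exit functions described
+above, here is a minimal sketch of a module that only logs when it is
+loaded and removed. The names `skeleton_init` and `skeleton_exit` are
+placeholders chosen for this sketch, and it has to be built against
+kernel headers rather than as an ordinary program.
+
+```c
+#include <linux/init.h>   /* __init, __exit */
+#include <linux/module.h> /* module_init, module_exit, MODULE_LICENSE */
+#include <linux/printk.h> /* pr_info */
+
+/* Entry function: runs once when the module is loaded. */
+static int __init skeleton_init(void)
+{
+    pr_info("skeleton: loaded\n");
+    return 0; /* a non-zero return value makes the load fail */
+}
+
+/* Exit function: runs once when the module is removed. */
+static void __exit skeleton_exit(void)
+{
+    pr_info("skeleton: unloaded\n");
+}
+
+/* Tell the kernel which functions play the two roles. */
+module_init(skeleton_init);
+module_exit(skeleton_exit);
+
+MODULE_LICENSE("GPL");
+```
+
+Because the two functions are designated through `module_init` and
+`module_exit`, they can carry any name; a real module would register
+its functionality in the first and unregister it in the second.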
+
+Functions available to modules
+------------------------------
+
+Programmers use functions they do not define all the time. A prime
+example of this is `printf()`. You use these library functions which are
+provided by the standard C library, libc. The definitions for these
+functions do not actually enter your program until the linking stage,
+which ensures that the code (for `printf()` for example) is available,
+and fixes the call instruction to point to that code.
+
+Kernel modules are different here, too. In the hello world example, you
+might have noticed that we used a function, `pr_info()`, but did not
+include a standard I/O library. That is because modules are object
+files whose symbols get resolved upon running `insmod` or `modprobe`.
+The definition for the symbols comes from the kernel itself; the only
+external functions you can use are the ones provided by the kernel. If
+you’re curious about what symbols have been exported by your kernel,
+take a look at `/proc/kallsyms`.
+
+One point to keep in mind is the difference between library functions
+and system calls. Library functions are higher level, run completely in
+user space and provide a more convenient interface for the programmer to
+the functions that do the real work: system calls. System calls run in
+kernel mode on the user’s behalf and are provided by the kernel itself.
+The library function `printf()` may look like a very general printing
+function, but all it really does is format the data into strings and
+write the string data using the low-level system call `write()`, which
+then sends the data to standard output.
+
+Would you like to see what system calls are made by `printf()`? It is
+easy! Compile the following program:
+
+    #include <stdio.h>
+
+    int main(void)
+    {
+        printf("hello");
+        return 0;
+    }
+
+with `gcc -Wall -o hello hello.c`. Run the executable with
+`strace ./hello`. Are you impressed? Every line you see corresponds to a
+system call. [strace](https://strace.io/) is a handy program that gives
+you details about what system calls a program is making, including which
+call is made, what its arguments are and what it returns. It is an
+invaluable tool for figuring out things like what files a program is
+trying to access. Towards the end, you will see a line which looks like
+`write(1, "hello", 5hello)`. There it is. The face behind the `printf()`
+mask. You may not be familiar with write, since most people use library
+functions for file I/O (like `fopen`, `fputs`, `fclose`). If that is the
+case, try looking at `man 2 write`. The 2nd man section is devoted to
+system calls (like `kill()` and `read()`). The 3rd man section is
+devoted to library calls, which you would probably be more familiar with
+(like `cosh()` and `random()`).
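+
+To make the relationship between `printf()` and `write()` concrete, the
+following user-space sketch skips the C library's buffered I/O and
+issues the system call directly. The helper name `say` is made up for
+this example.
+
+```c
+#include <string.h>
+#include <unistd.h>
+
+/*
+ * Write a string to a file descriptor with the write() system call,
+ * bypassing stdio entirely. Returns the number of bytes written, or -1
+ * on error. say() is a hypothetical helper name.
+ */
+ssize_t say(int fd, const char *msg)
+{
+    return write(fd, msg, strlen(msg));
+}
+```
+
+Calling `say(STDOUT_FILENO, "hello")` from a small `main()` and running
+the result under `strace` should show the same kind of `write(1, ...)`
+line as the `printf()` version, without the library in between.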
+
+You can even write modules to replace the kernel’s system calls, which
+we will do shortly. Crackers often make use of this sort of thing for
+backdoors or trojans, but you can write your own modules to do more
+benign things, like have the kernel write *Tee hee, that tickles!* every
+time someone tries to delete a file on your system.
+
+User Space vs Kernel Space
+--------------------------
+
+The kernel primarily manages access to resources, be it a video card,
+hard drive, or memory. Programs frequently vie for the same resources.
+For instance, as a document is saved, updatedb might commence updating
+the locate database. Sessions in editors like vim and processes like
+updatedb can simultaneously utilize the hard drive. The kernel’s role is
+to maintain order, ensuring that users do not access resources
+indiscriminately.
+
+To manage this, CPUs operate in different modes, each offering varying
+levels of system control. The Intel 80386 architecture, for example,
+featured four such modes, known as rings. Unix, however, utilizes only
+two of these rings: the highest ring (ring 0, also known as “supervisor
+mode”, where all actions are permissible) and the lowest ring, referred
+to as “user mode”.
+
+Recall the discussion about library functions vs system calls.
+Typically, you use a library function in user mode. The library function
+calls one or more system calls, and these system calls execute on the
+library function’s behalf, but do so in supervisor mode since they are
+part of the kernel itself. Once the system call completes its task, it
+returns and execution gets transferred back to user mode.
+
+Name Space
+----------
+
+When you write a small C program, you use variables which are convenient
+and make sense to the reader. If, on the other hand, you are writing
+routines which will be part of a bigger program, any global variables
+you have are part of a community of other people’s global variables;
+some of the variable names can clash. When a program has lots of global
+variables which aren’t meaningful enough to be distinguished, you get
+namespace pollution. In large projects, effort must be made to remember
+reserved names, and to develop a scheme for choosing unique variable
+names and symbols.
+
+When writing kernel code, even the smallest module will be linked
+against the entire kernel, so this is definitely an issue. The best way
+to deal with this is to declare all your variables as static and to use
+a well-defined prefix for your symbols. By convention, all kernel
+prefixes are lowercase. If you do not want to declare everything as
+static, another option is to declare a symbol table and register it with
+the kernel. We will get to this later.
+
+The file `/proc/kallsyms` holds all the symbols that the kernel knows
+about and which are therefore accessible to your modules since they
+share the kernel’s codespace.
+
+Code space
+----------
+
+Memory management is a very complicated subject and the majority of
+O’Reilly’s [Understanding The Linux
+Kernel](https://www.oreilly.com/library/view/understanding-the-linux/0596005652/)
+exclusively covers memory management! We are not setting out to be
+experts on memory management, but we do need to know a couple of facts
+to even begin worrying about writing real modules.
+
+If you have not thought about what a segfault really means, you may be
+surprised to hear that pointers do not actually point to memory
+locations. Not real ones, anyway. When a process is created, the kernel
+sets aside a portion of real physical memory and hands it to the process
+to use for its executing code, variables, stack, heap and other things
+which a computer scientist would know about. This memory begins with
+0x00000000 and extends up to whatever it needs to be. Since the memory
+space for any two processes does not overlap, every process that can
+access a memory address, say 0xbffff978, would be accessing a different
+location in real physical memory! The processes would be accessing an
+index named 0xbffff978 which points to some kind of offset into the
+region of memory set aside for that particular process. For the most
+part, a process like our Hello, World program can’t access the space of
+another process, although there are ways which we will talk about later.
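+
+The claim that the same address refers to different physical memory in
+different processes can be observed from user space. In the sketch
+below, which uses only standard POSIX calls, a child process overwrites
+a variable at the very same virtual address the parent uses, and the
+parent's copy stays untouched. The function name is invented for this
+illustration.
+
+```c
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+static int value = 1;
+
+/*
+ * Fork a child that overwrites `value`, then check in the parent that
+ * its own copy is untouched. Returns 1 if the parent still sees 1,
+ * 0 if it does not, and -1 if fork() or waitpid() failed.
+ */
+int child_write_is_private(void)
+{
+    pid_t pid = fork();
+
+    if (pid < 0)
+        return -1;
+    if (pid == 0) {
+        value = 2; /* same virtual address as in the parent */
+        _exit(0);
+    }
+    if (waitpid(pid, NULL, 0) < 0)
+        return -1;
+    return value == 1;
+}
+```
+
+Both processes use the same pointer value for `value`, yet the write in
+the child lands in memory that belongs to the child only.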
+
+The kernel has its own space of memory as well. Since a module is code
+which can be dynamically inserted and removed in the kernel (as opposed
+to a semi-autonomous object), it shares the kernel’s codespace rather
+than having its own. Therefore, if your module segfaults, the kernel
+segfaults. And if you start writing over data because of an off-by-one
+error, then you’re trampling on kernel data (or code). This is even
+worse than it sounds, so try your best to be careful.
+
+It should be noted that the aforementioned discussion applies to any
+operating system utilizing a monolithic kernel. This concept differs
+slightly from *“building all your modules into the kernel”*, although
+the underlying principle is similar. In contrast, there are
+microkernels, where modules are allocated their own code space. Two
+notable examples of microkernels include the [GNU
+Hurd](https://www.gnu.org/software/hurd/) and the [Zircon
+kernel](https://fuchsia.dev/fuchsia-src/concepts/kernel) of Google’s
+Fuchsia.
+
+Device Drivers
+--------------
+
+One class of module is the device driver, which provides functionality
+for hardware like a serial port. On Unix, each piece of hardware is
+represented by a file located in `/dev` named a device file which
+provides the means to communicate with the hardware. The device driver
+provides the communication on behalf of a user program. So the es1370.ko
+sound card device driver might connect the `/dev/sound` device file to
+the Ensoniq ES1370 sound card. A userspace program like mp3blaster can
+use `/dev/sound` without ever knowing what kind of sound card is
+installed.
+
+Let’s look at some device files. Here are device files which represent
+the first three partitions on the primary master IDE hard drive:
+
+ $ ls -l /dev/hda[1-3]
+ brw-rw---- 1 root disk 3, 1 Jul 5 2000 /dev/hda1
+ brw-rw---- 1 root disk 3, 2 Jul 5 2000 /dev/hda2
+ brw-rw---- 1 root disk 3, 3 Jul 5 2000 /dev/hda3
+
+Notice the column of numbers separated by a comma. The first number is
+called the device’s major number. The second number is the minor number.
+The major number tells you which driver is used to access the hardware.
+Each driver is assigned a unique major number; all device files with the
+same major number are controlled by the same driver. All the above major
+numbers are 3, because they’re all controlled by the same driver.
+
+The minor number is used by the driver to distinguish between the
+various hardware it controls. Returning to the example above, although
+all three devices are handled by the same driver they have unique minor
+numbers because the driver sees them as being different pieces of
+hardware.
+
+Devices are divided into two types: character devices and block devices.
+The difference is that block devices have a buffer for requests, so they
+can choose the best order in which to respond to the requests. This is
+important in the case of storage devices, where it is faster to read or
+write sectors which are close to each other, rather than those which are
+further apart. Another difference is that block devices can only accept
+input and return output in blocks (whose size can vary according to the
+device), whereas character devices are allowed to use as many or as few
+bytes as they like. Most devices in the world are character, because
+they don’t need this type of buffering, and they don’t operate with a
+fixed block size. You can tell whether a device file is for a block
+device or a character device by looking at the first character in the
+output of `ls -l`. If it is ‘b’ then it is a block device, and if it is
+‘c’ then it is a character device. The devices you see above are block
+devices. Here are some character devices (the serial ports):
+
+ crw-rw---- 1 root dial 4, 64 Feb 18 23:34 /dev/ttyS0
+ crw-r----- 1 root dial 4, 65 Nov 17 10:26 /dev/ttyS1
+ crw-rw---- 1 root dial 4, 66 Jul 5 2000 /dev/ttyS2
+ crw-rw---- 1 root dial 4, 67 Jul 5 2000 /dev/ttyS3
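+
+The same information that `ls -l` prints can be read programmatically.
+The sketch below assumes a Linux system with glibc, where
+`<sys/sysmacros.h>` provides `major()` and `minor()`; the helper name
+`dev_numbers` is hypothetical.
+
+```c
+#include <sys/stat.h>
+#include <sys/sysmacros.h> /* major(), minor() (glibc) */
+
+/*
+ * Look up the device numbers of a device file. On success the major
+ * and minor numbers are stored through the pointers and the function
+ * returns 'b' for a block device or 'c' for a character device.
+ * It returns -1 if the path cannot be examined or is not a device file.
+ */
+int dev_numbers(const char *path, unsigned int *maj, unsigned int *min)
+{
+    struct stat st;
+
+    if (stat(path, &st) != 0)
+        return -1;
+    if (!S_ISBLK(st.st_mode) && !S_ISCHR(st.st_mode))
+        return -1;
+
+    *maj = major(st.st_rdev); /* selects the driver */
+    *min = minor(st.st_rdev); /* interpreted by the driver */
+    return S_ISBLK(st.st_mode) ? 'b' : 'c';
+}
+```
+
+For a path such as `/dev/ttyS0` from the listing above, this would be
+expected to report a character device with major 4 and minor 64.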
+
+If you want to see which major numbers have been assigned, you can look
+at
+[Documentation/admin-guide/devices.txt](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/admin-guide/devices.txt).
+
+When the system was installed, all of those device files were created by
+the `mknod` command. To create a new char device named `coffee` with
+major/minor number 12 and 2, simply do `mknod /dev/coffee c 12 2`. You
+do not have to put your device files into `/dev`, but it is done by
+convention. Linus put his device files in `/dev`, and so should you.
+However, when creating a device file for testing purposes, it is
+probably OK to place it in your working directory where you compile the
+kernel module. Just be sure to put it in the right place when you’re
+done writing the device driver.
+
+A few final points, although implicit in the previous discussion, are
+worth stating explicitly for clarity. When a device file is accessed,
+the kernel utilizes the file’s major number to identify the appropriate
+driver for handling the access. This indicates that the kernel does not
+necessarily rely on or need to be aware of the minor number. It is the
+driver that concerns itself with the minor number, using it to
+differentiate between various pieces of hardware.
+
+It is important to note that when referring to *“hardware”*, the term is
+used in a slightly more abstract sense than just a physical PCI card
+that can be held in hand. Consider the following two device files:
+
+ $ ls -l /dev/sda /dev/sdb
+ brw-rw---- 1 root disk 8, 0 Jan 3 09:02 /dev/sda
+ brw-rw---- 1 root disk 8, 16 Jan 3 09:02 /dev/sdb
+
+By now you can look at these two device files and know instantly that
+they are block devices and are handled by the same driver (block major 8).
+Sometimes two device files with the same major but different minor
+number can actually represent the same piece of physical hardware. So
+just be aware that the word “hardware” in our discussion can mean
+something very abstract.
+
+Character Device drivers
+========================
+
+The file_operations Structure
+-----------------------------
+
+The `file_operations` structure is defined in
+[include/linux/fs.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/fs.h),
+and holds pointers to functions defined by the driver that perform
+various operations on the device. Each field of the structure
+corresponds to the address of some function defined by the driver to
+handle a requested operation.
+
+For example, every character driver needs to define a function that
+reads from the device. The `file_operations` structure holds the address
+of the module’s function that performs that operation. Here is what the
+definition looks like for kernel 5.4:
+
+    struct file_operations {
+        struct module *owner;
+        loff_t (*llseek) (struct file *, loff_t, int);
+        ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
+        ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
+        ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
+        ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
+        int (*iopoll)(struct kiocb *kiocb, bool spin);
+        int (*iterate) (struct file *, struct dir_context *);
+        int (*iterate_shared) (struct file *, struct dir_context *);
+        __poll_t (*poll) (struct file *, struct poll_table_struct *);
+        long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
+        long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
+        int (*mmap) (struct file *, struct vm_area_struct *);
+        unsigned long mmap_supported_flags;
+        int (*open) (struct inode *, struct file *);
+        int (*flush) (struct file *, fl_owner_t id);
+        int (*release) (struct inode *, struct file *);
+        int (*fsync) (struct file *, loff_t, loff_t, int datasync);
+        int (*fasync) (int, struct file *, int);
+        int (*lock) (struct file *, int, struct file_lock *);
+        ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
+        unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
+        int (*check_flags)(int);
+        int (*flock) (struct file *, int, struct file_lock *);
+        ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
+        ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
+        int (*setlease)(struct file *, long, struct file_lock **, void **);
+        long (*fallocate)(struct file *file, int mode, loff_t offset, loff_t len);
+        void (*show_fdinfo)(struct seq_file *m, struct file *f);
+        ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
+        loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
+                                   struct file *file_out, loff_t pos_out,
+                                   loff_t len, unsigned int remap_flags);
+        int (*fadvise)(struct file *, loff_t, loff_t, int);
+    } __randomize_layout;
+
+Some operations are not implemented by a driver. For example, a driver
+that handles a video card will not need to read from a directory
+structure. The corresponding entries in the `file_operations` structure
+should be set to `NULL`.
+
+There is a gcc extension that makes assigning to this structure more
+convenient. You will see it in modern drivers, and it may catch you by
+surprise. This is what the new way of assigning to the structure looks
+like:
+
+    struct file_operations fops = {
+        read: device_read,
+        write: device_write,
+        open: device_open,
+        release: device_release
+    };
+
+However, there is also a C99 way of assigning to elements of a
+structure, [designated
+initializers](https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html),
+and this is definitely preferred over using the GNU extension. You
+should use this syntax in case someone wants to port your driver. It
+will help with compatibility:
+
+    struct file_operations fops = {
+        .read = device_read,
+        .write = device_write,
+        .open = device_open,
+        .release = device_release
+    };
+
+The meaning is clear, and you should be aware that any member of the
+structure which you do not explicitly assign will be initialized to
+`NULL` by gcc.
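+
+The zero-initialization of unnamed members can be tried out in user
+space. The sketch below uses a made-up, cut-down stand-in called
+`struct mini_ops` rather than the real `file_operations`, and a
+hypothetical `call_write()` helper that checks a slot before calling
+it, much as a caller of such a table has to.
+
+```c
+#include <stddef.h>
+
+/* Hypothetical, cut-down stand-in for struct file_operations. */
+struct mini_ops {
+    int (*open)(void);
+    int (*release)(void);
+    long (*read)(char *buf, size_t len);
+    long (*write)(const char *buf, size_t len);
+};
+
+static int demo_open(void)
+{
+    return 0;
+}
+
+static long demo_read(char *buf, size_t len)
+{
+    (void)buf; /* this toy read only reports the requested length */
+    return (long)len;
+}
+
+/* Only two members are named; the others are implicitly zeroed. */
+const struct mini_ops demo_ops = {
+    .open = demo_open,
+    .read = demo_read,
+};
+
+/* Call the write slot if it is provided, otherwise report failure. */
+long call_write(const struct mini_ops *ops, const char *buf, size_t len)
+{
+    if (!ops->write)
+        return -1; /* operation not provided */
+    return ops->write(buf, len);
+}
+```
+
+Here `demo_ops.write` and `demo_ops.release` are null pointers even
+though they were never mentioned in the initializer, so `call_write()`
+reports the operation as unavailable instead of jumping through an
+uninitialized pointer.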
+
+An instance of `struct file_operations` containing pointers to
+functions that are used to implement `read`, `write`, `open`, … system
+calls is commonly named `fops`.
+
+Since Linux v3.14, the read, write and seek operations are guaranteed
+to be thread-safe through an `f_pos`-specific lock, which makes updates
+of the file position mutually exclusive. So, we can safely implement
+those operations without unnecessary locking.
+
+Additionally, since Linux v5.6, the `proc_ops` structure was
+introduced to replace the use of the `file_operations` structure when
+registering proc handlers. See section 7.1 for more information.
+
+The file structure
+------------------
+
+Each device is represented in the kernel by a file structure, which is
+defined in
+[include/linux/fs.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/fs.h).
+Be aware that a file is a kernel level structure and never appears in a
+user space program. It is not the same thing as a `FILE`, which is
+defined by glibc and would never appear in a kernel space function.
+Also, its name is a bit misleading; it represents an abstract open
+‘file’, not a file on a disk, which is represented by a structure named
+`inode`.
+
+An instance of struct file is commonly named `filp`. You’ll also see it
+referred to as a struct file object. Resist the temptation.
+
+Go ahead and look at the definition of file. Most of the entries you
+see, like struct dentry are not used by device drivers, and you can
+ignore them. This is because drivers do not fill file directly; they
+only use structures contained in file which are created elsewhere.
+
+Registering A Device
+--------------------
+
+As discussed earlier, char devices are accessed through device files,
+usually located in `/dev`. This is by convention. When writing a driver,
+it is OK to put the device file in your current directory. Just make
+sure you place it in `/dev` for a production driver. The major number
+tells you which driver handles which device file. The minor number is
+used only by the driver itself to differentiate which device it is
+operating on, just in case the driver handles more than one device.
+
+Adding a driver to your system means registering it with the kernel.
+This is synonymous with assigning it a major number during the module’s
+initialization. You do this by using the `register_chrdev` function,
+defined by
+[include/linux/fs.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/fs.h).
+
+    int register_chrdev(unsigned int major, const char *name,
+                        struct file_operations *fops);
+
+Where `unsigned int major` is the major number you want to request,
+`const char *name` is the name of the device as it will appear in
+`/proc/devices` and `struct file_operations *fops` is a pointer to the
+`file_operations` table for your driver. A negative return value means
+the registration failed. Note that we didn’t pass the minor number to
+`register_chrdev`. That is because the kernel doesn’t care about the
+minor number; only our driver uses it.
+
+Now the question is, how do you get a major number without hijacking one
+that’s already in use? The easiest way would be to look through
+[Documentation/admin-guide/devices.txt](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/admin-guide/devices.txt)
+and pick an unused one. That is a bad way of doing things because you
+will never be sure if the number you picked will be assigned later. The
+answer is that you can ask the kernel to assign you a dynamic major
+number.
+
+If you pass a major number of 0 to `register_chrdev`, the return value
+will be the dynamically allocated major number. The downside is that
+you can not make a device file in advance, since you do not know what
+the major number will be. There are a couple of ways to do this. First,
+the driver itself can print the newly assigned number and we can make
+the device file by hand. Second, the newly registered device will have
+an entry in `/proc/devices`, and we can either make the device file by
+hand or write a shell script to read the file in and make the device
+file. The third method is that we can have our driver make the device
+file using the `device_create` function after a successful registration
+and `device_destroy` during the call to `cleanup_module`.
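+
+The second approach, reading the major number back from
+`/proc/devices`, could look roughly like the user-space sketch below.
+It assumes the usual `<major> <name>` line format of that file, and
+`find_major` is a hypothetical helper name.
+
+```c
+#include <stdio.h>
+#include <string.h>
+
+/*
+ * Scan text in the format of /proc/devices ("<major> <name>" per line)
+ * and return the major number of the first entry called `name`, or -1
+ * if the name is not listed. Character devices are listed before block
+ * devices, so a character device wins if the name appears twice.
+ */
+int find_major(FILE *fp, const char *name)
+{
+    char line[128], entry[64];
+    int major;
+
+    while (fgets(line, sizeof line, fp)) {
+        /* Header lines such as "Character devices:" fail to parse. */
+        if (sscanf(line, "%d %63s", &major, entry) == 2 &&
+            strcmp(entry, name) == 0)
+            return major;
+    }
+    return -1;
+}
+```
+
+With the stream opened on `/proc/devices`, the returned number is what
+would be passed to `mknod` as the major number of the new device file.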
+
+However, `register_chrdev()` would occupy a range of minor numbers
+associated with the given major. The recommended way to reduce waste
+for char device registration is to use the cdev interface.
+
+The newer interface completes the char device registration in two
+distinct steps. First, we should register a range of device numbers,
+which can be completed with `register_chrdev_region` or
+`alloc_chrdev_region`.
+
+    int register_chrdev_region(dev_t from, unsigned count, const char *name);
+    int alloc_chrdev_region(dev_t *dev, unsigned baseminor, unsigned count,
+                            const char *name);
+
+The choice between the two functions depends on whether you know the
+major number for your device. Use `register_chrdev_region` if you know
+the device major number and `alloc_chrdev_region` if you would like a
+dynamically allocated major number.
+
+Second, we should initialize the data structure `struct cdev` for our
+char device and associate it with the device numbers. To initialize the
+`struct cdev`, we can use a sequence similar to the following code.
+
+    struct cdev *my_cdev = cdev_alloc();
+    my_cdev->ops = &my_fops;
+
+However, the common usage pattern will embed the `struct cdev` within a
+device-specific structure of your own. In this case, we’ll need
+`cdev_init` for the initialization.
+
+    void cdev_init(struct cdev *cdev, const struct file_operations *fops);
+
+Once we finish the initialization, we can add the char device to the
+system by using `cdev_add`.
+
+    int cdev_add(struct cdev *p, dev_t dev, unsigned count);
+
+To find an example using the interface, you can see `ioctl.c` described
+in section 9.
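+
+Put together, the two registration steps might look like the following
+kernel-side sketch. It is not the book's `ioctl.c`; the `demo_` names
+are placeholders, the operations table is left almost empty, and the
+code has to be built as a module against kernel headers.
+
+```c
+#include <linux/cdev.h>
+#include <linux/fs.h>
+#include <linux/module.h>
+
+static dev_t demo_devno;      /* first device number of our range */
+static struct cdev demo_cdev; /* statically embedded, so cdev_init() */
+
+static const struct file_operations demo_fops = {
+    .owner = THIS_MODULE,
+};
+
+static int __init demo_init(void)
+{
+    int ret;
+
+    /* Step 1: reserve one minor (starting at 0) under a dynamic major. */
+    ret = alloc_chrdev_region(&demo_devno, 0, 1, "demo");
+    if (ret < 0)
+        return ret;
+
+    /* Step 2: bind the operations table and make the device live. */
+    cdev_init(&demo_cdev, &demo_fops);
+    demo_cdev.owner = THIS_MODULE;
+    ret = cdev_add(&demo_cdev, demo_devno, 1);
+    if (ret < 0)
+        unregister_chrdev_region(demo_devno, 1);
+    return ret;
+}
+
+static void __exit demo_exit(void)
+{
+    /* Undo the two steps in reverse order. */
+    cdev_del(&demo_cdev);
+    unregister_chrdev_region(demo_devno, 1);
+}
+
+module_init(demo_init);
+module_exit(demo_exit);
+MODULE_LICENSE("GPL");
+```
+
+Only a single minor number is reserved here, which is the saving over
+`register_chrdev()` that the text above refers to.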
+
+Unregistering A Device
+----------------------
+
+We can not allow the kernel module to be `rmmod`’ed whenever root feels
+like it. If the device file is opened by a process and then we remove
+the kernel module, using the file would cause a call to the memory
+location where the appropriate function (read/write) used to be. If we
+are lucky, no other code was loaded there, and we’ll get an ugly error
+message. If we are unlucky, another kernel module was loaded into the
+same location, which means a jump into the middle of another function
+within the kernel. The results of this would be impossible to predict,
+but they can not be very positive.
+
+Normally, when you do not want to allow something, you return an error
+code (a negative number) from the function which is supposed to do it.
+With `cleanup_module` that’s impossible because it is a void function.
+However, there is a counter which keeps track of how many processes are
+using your module. You can see what its value is by looking at the 3rd
+field with the command `cat /proc/modules` or `sudo lsmod`. If this
+number isn’t zero, `rmmod` will fail. Note that you do not have to check
+the counter within `cleanup_module` because the check will be performed
+for you by the system call `sys_delete_module`, defined in
+[include/linux/syscalls.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/syscalls.h).
+You should not use this counter directly, but there are functions
+defined in
+[include/linux/module.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/module.h)
+which let you increase, decrease and display this counter:
+
+- `try_module_get(THIS_MODULE)`: Increment the reference count of the
+  current module.
+
+- `module_put(THIS_MODULE)`: Decrement the reference count of the
+  current module.
+
+- `module_refcount(THIS_MODULE)`: Return the value of the reference
+  count of the current module.
+
+It is important to keep the counter accurate; if you ever do lose track
+of the correct usage count, you will never be able to unload the module;
+it’s now reboot time, boys and girls. This is bound to happen to you
+sooner or later during a module’s development.
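+
+One way to keep the counter balanced is to pair the two calls in the
+driver's open and release handlers, as in the kernel-side fragment
+below. The handler names are placeholders, and the fragment is meant to
+be dropped into a module rather than compiled on its own.
+
+```c
+#include <linux/errno.h>
+#include <linux/fs.h>
+#include <linux/module.h>
+
+/* Called when a process opens the device file. */
+static int demo_open(struct inode *inode, struct file *file)
+{
+    /* try_module_get() returns false if the module is being removed. */
+    if (!try_module_get(THIS_MODULE))
+        return -ENODEV;
+    return 0;
+}
+
+/* Called when the last reference to the open file is dropped. */
+static int demo_release(struct inode *inode, struct file *file)
+{
+    module_put(THIS_MODULE); /* balance the get in demo_open() */
+    return 0;
+}
+```
+
+Every successful open takes exactly one reference and every release
+gives exactly one back, so the count returns to zero once all users
+have closed the device and `rmmod` can succeed again.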
+
+chardev.c
+---------
+
+The next code sample creates a char driver named `chardev`. You can dump
+its device file.
+
+    cat /proc/devices
+
+(or open the file with a program) and the driver will put the number of
+times the device file has been read from into the file. We do not
+support writing to the file (like `echo "hi" > /dev/hello`), but
+catch these attempts and tell the user that the operation is not
+supported. Don’t worry if you don’t see what we do with the data we read
+into the buffer; we don’t do much with it. We simply read in the data
+and print a message acknowledging that we received it.
+
+In a multi-threaded environment, unprotected concurrent access to the
+same memory may lead to a race condition and will not preserve
+performance. In a kernel module, this problem may happen when multiple
+instances access shared resources. Therefore, a solution is to enforce
+exclusive access. We use atomic Compare-And-Swap (CAS) to maintain the
+states, `CDEV_NOT_USED` and `CDEV_EXCLUSIVE_OPEN`, to determine whether
+the file is currently opened by someone or not. CAS compares the
+contents of a memory location with the expected value and, only if they
+are the same, modifies the contents of that memory location to the
+desired value. See section 12 for more concurrency details.
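+
+The compare-and-swap idea can be tried outside the kernel as well. The
+sketch below is a user-space analogue built on C11 `<stdatomic.h>`, not
+the kernel's own atomic API; the two state names are taken from the
+text, while `claim_device` and `release_device` are invented for the
+example.
+
+```c
+#include <stdatomic.h>
+
+enum { CDEV_NOT_USED = 0, CDEV_EXCLUSIVE_OPEN = 1 };
+
+static atomic_int already_open = CDEV_NOT_USED;
+
+/* Try to claim the device; returns 0 on success, -1 if already claimed. */
+int claim_device(void)
+{
+    int expected = CDEV_NOT_USED;
+
+    /* Swap in EXCLUSIVE_OPEN only if the state is still NOT_USED. */
+    if (atomic_compare_exchange_strong(&already_open, &expected,
+                                       CDEV_EXCLUSIVE_OPEN))
+        return 0;
+    return -1;
+}
+
+/* Give the device back so that the next claim can succeed. */
+void release_device(void)
+{
+    atomic_store(&already_open, CDEV_NOT_USED);
+}
+```
+
+Because the comparison and the update happen as one atomic step, two
+callers racing on `claim_device()` cannot both succeed; the loser sees
+the state already changed and gets -1.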
+
+Writing Modules for Multiple Kernel Versions
+--------------------------------------------
+
+The system calls, which are the major interface the kernel shows to the
+processes, generally stay the same across versions. A new system call
+may be added, but usually the old ones will behave exactly like they
+used to. This is necessary for backward compatibility – a new kernel
+version is not supposed to break regular processes. In most cases, the
+device files will also remain the same. On the other hand, the internal
+interfaces within the kernel can and do change between versions.
+
+There are differences between different kernel versions, and if you want
+to support multiple kernel versions, you will find yourself having to
+code conditional compilation directives. The way to do this is to
+compare the macro `LINUX_VERSION_CODE` to the macro `KERNEL_VERSION`. In
+version `a.b.c` of the kernel, the value of this macro would be
+$2^{16}a + 2^{8}b + c$.
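+
+The arithmetic and the resulting conditional compilation can be tried
+in user space with a local copy of the macro. In the sketch below the
+`DEMO_` names are stand-ins for the real `KERNEL_VERSION` and
+`LINUX_VERSION_CODE`, which come from the kernel headers, and the
+5.10.3 target is an arbitrary assumption. Recent kernels additionally
+cap the last component at 255, which this copy does not do.
+
+```c
+/* User-space copy of the classic definition, renamed to avoid clashes. */
+#define DEMO_KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))
+
+/* Pretend this module is being built against Linux 5.10.3. */
+#define DEMO_LINUX_VERSION_CODE DEMO_KERNEL_VERSION(5, 10, 3)
+
+/* Returns 1 when the proc_ops interface (v5.6+) would be selected. */
+int have_proc_ops(void)
+{
+#if DEMO_LINUX_VERSION_CODE >= DEMO_KERNEL_VERSION(5, 6, 0)
+    return 1;
+#else
+    return 0;
+#endif
+}
+```
+
+For version 5.6.0 the formula gives 5 · 65536 + 6 · 256 + 0 = 329216,
+so any version code at or above that number takes the first branch.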
+
+The /proc File System
+=====================
+
+In Linux, there is an additional mechanism for the kernel and kernel
+modules to send information to processes — the `/proc` file system.
+Originally designed to allow easy access to information about processes
+(hence the name), it is now used by every bit of the kernel which has
+something interesting to report, such as `/proc/modules` which provides
+the list of modules and `/proc/meminfo` which gathers memory usage
+statistics.
+
+The method to use the proc file system is very similar to the one used
+with device drivers: a structure is created with all the information
+needed for the `/proc` file, including pointers to any handler functions
+(in our case there is only one, the one called when somebody attempts to
+read from the `/proc` file). Then, `init_module` registers the
+structure with the kernel and `cleanup_module` unregisters it.
+
+Normal file systems are located on a disk, rather than just in memory
+(which is where `/proc` is), and in that case the index-node (inode for
+short) number is a pointer to a disk location where the file’s inode is
+located. The inode contains information about the file, for example the
+file’s permissions, together with a pointer to the disk location or
+locations where the file’s data can be found.
+
+Because we don’t get called when the file is opened or closed, there’s
+nowhere for us to put |try_\*-\*-\*_module_\*-\*-\*_get| and
+|module_\*-\*-\*_put| in this module, and if the file is opened and then
+the module is removed, there’s no way to avoid the consequences.
+
+Here is a simple example showing how to use a `/proc` file. This is the
+HelloWorld for the `/proc` filesystem. There are three parts: create the
+file `/proc/helloworld` in the function `init_module`, return a value
+(and a buffer) when the file `/proc/helloworld` is read in the callback
+function `procfile_read`, and delete the file `/proc/helloworld` in the
+function `cleanup_module`.
+
+The file `/proc/helloworld` is created when the module is loaded, with
+the function `proc_create`. The return value is a pointer to
+`struct proc_dir_entry`, and it will be used to configure the file
+`/proc/helloworld` (for example, the owner of this file). A null return
+value means that the creation has failed.
+
+Every time the file `/proc/helloworld` is read, the function
+`procfile_read` is called. Two parameters of this function are very
+important: the buffer (the second parameter) and the offset (the fourth
+one). The content of the buffer will be returned to the application
+which read it (for example the `cat` command). The offset is the
+current position in the file. If the return value of the function is
+not zero, then this function is called again. So be careful with this
+function: if it never returns zero, the read function is called
+endlessly.
+
+    $ cat /proc/helloworld
+    HelloWorld!
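+
+The end-of-file rule described above can be sketched in plain userspace
+C. This is only an illustration under stated assumptions, not the
+module's code: `sketch_read` is a hypothetical stand-in for a read
+handler, the message is passed in as an argument, and `memcpy` takes
+the place of the copy to user space.
+
+```c
+#include <stddef.h>
+#include <string.h>
+
+/* Hypothetical userspace stand-in for a /proc read handler.
+ * Copies at most len bytes of msg, starting at *offset, into buf,
+ * advances *offset, and returns the number of bytes copied.
+ * Returns 0 once *offset has reached the end of msg (end of file). */
+long sketch_read(char *buf, size_t len, long *offset, const char *msg)
+{
+    size_t msg_len = strlen(msg);
+    size_t remaining;
+
+    if (*offset < 0 || (size_t)*offset >= msg_len)
+        return 0; /* nothing left: tells the caller to stop reading */
+
+    remaining = msg_len - (size_t)*offset;
+    if (len > remaining)
+        len = remaining;
+
+    memcpy(buf, msg + *offset, len); /* a real module copies to user space */
+    *offset += (long)len;
+    return (long)len;
+}
+```
+
+Called repeatedly with the same offset variable, the function returns
+the number of bytes copied until the message is exhausted and then
+returns zero, which is what lets a reader such as `cat` stop.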
+
+The proc_ops Structure
+----------------------
+
+The `proc_ops` structure is defined in
+[include/linux/proc_fs.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/proc_fs.h)
+in Linux v5.6+. Older kernels used `file_operations` for custom hooks
+in the `/proc` file system, but that structure contains some members
+that `/proc` does not need, and every time VFS expands the
+`file_operations` set, the `/proc` code becomes bloated. The dedicated
+structure saves not only space but also some operations, which improves
+performance. For example, a file which never disappears from `/proc`
+can set `proc_flags` to `PROC_ENTRY_PERMANENT` to save 2 atomic ops, 1
+allocation, and 1 free per open/read/close sequence.
+
+Read and Write a /proc File
+---------------------------
+
+We have seen a very simple example for a `/proc` file where we only read
+the file `/proc/helloworld`. It is also possible to write to a `/proc`
+file. It works the same way as read: a function is called when the
+`/proc` file is written. There is one difference from read, though:
+the data comes from the user, so you have to import it from user space
+to kernel space (with `copy_from_user` or `get_user`).
+
+The reason for `copy_from_user` or `get_user` is that Linux memory (on
+Intel architecture, it may be different under some other processors) is
+segmented. This means that a pointer, by itself, does not reference a
+unique location in memory, only a location in a memory segment, and you
+need to know which memory segment it is to be able to use it. There is
+one memory segment for the kernel, and one for each of the processes.
+
+The only memory segment accessible to a process is its own, so when
+writing regular programs to run as processes, there is no need to worry
+about segments. When you write a kernel module, normally you want to
+access the kernel memory segment, which is handled automatically by the
+system. However, when the content of a memory buffer needs to be passed
+between the currently running process and the kernel, the kernel
+function receives a pointer to the memory buffer which is in the process
+segment. The `put_user` and `get_user` macros allow you to access that
+memory. These macros handle only a single value, such as one character;
+you can handle several characters with `copy_to_user` and
+`copy_from_user`. As the buffer (in the read or write function) is in
+kernel space, the write function needs to import data because it comes
+from user space, but the read function does not because the data is
+already in kernel space.
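+
+The import step for a write handler can be sketched in userspace C. The
+names `sketch_write` and `SKETCH_BUF_SIZE` are hypothetical, and
+`memcpy` only stands in for `copy_from_user`, which a real module must
+use because the source pointer refers to the memory of the process. The
+point of the sketch is the length check that has to come before the
+copy.
+
+```c
+#include <stddef.h>
+#include <string.h>
+
+#define SKETCH_BUF_SIZE 16 /* hypothetical size of the module's buffer */
+
+/* Hypothetical stand-in for a /proc write handler. user_data plays the
+ * role of the user space pointer. Returns the number of bytes stored
+ * in kernel_buf, which is always left NUL-terminated. */
+size_t sketch_write(char kernel_buf[SKETCH_BUF_SIZE],
+                    const char *user_data, size_t len)
+{
+    /* Never trust the length supplied by the caller: clamp it so that
+     * the terminating NUL always fits. */
+    if (len > SKETCH_BUF_SIZE - 1)
+        len = SKETCH_BUF_SIZE - 1;
+
+    memcpy(kernel_buf, user_data, len); /* copy_from_user() in a module */
+    kernel_buf[len] = '\0';
+    return len;
+}
+```
+
+Writing more bytes than the buffer can hold is truncated rather than
+overflowing; whether to truncate or to reject the write is a design
+choice for the module author.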
+
+Manage /proc file with standard filesystem
+------------------------------------------
+
+We have seen how to read and write a `/proc` file with the `/proc`
+interface. But it is also possible to manage `/proc` file with inodes.
+The main concern is to use advanced functions, like permissions.
+
+In Linux, there is a standard mechanism for file system registration.
+Since every file system has to have its own functions to handle inode
+and file operations, there is a special structure to hold pointers to
+all those functions, `struct inode_operations`, which includes a
+pointer to `struct proc_ops`.
+
+The difference between file and inode operations is that file operations
+deal with the file itself whereas inode operations deal with ways of
+referencing the file, such as creating links to it.
+
+In `/proc`, whenever we register a new file, we’re allowed to specify
+which `struct inode_operations` will be used to access it. This is the
+mechanism we use: a `struct inode_operations` which includes a pointer
+to a `struct proc_ops`, which in turn includes pointers to our
+`procfs_read` and `procfs_write` functions.
+
+Another interesting point here is the `module_permission` function.
+This function is called whenever a process tries to do something with
+the `/proc` file, and it can decide whether to allow access or not.
+Right now it is only based on the operation and the uid of the current
+user (as available in `current`, a pointer to a structure which
+includes information on the currently running process), but it could be
+based on anything we like, such as what other processes are doing with
+the same file, the time of day, or the last input we received.
+
+It is important to note that the standard roles of read and write are
+reversed in the kernel. Read functions are used for output, whereas
+write functions are used for input. The reason for that is that read and
+write refer to the user’s point of view — if a process reads something
+from the kernel, then the kernel needs to output it, and if a process
+writes something to the kernel, then the kernel receives it as input.
+
+Still hungry for procfs examples? First of all, keep in mind that there
+are rumors claiming that procfs is on its way out, so consider using
+`sysfs` instead. Consider using this mechanism in case you want to
+document something kernel related yourself.
+
+Manage /proc file with seq_file
+-------------------------------
+
+As we have seen, writing a `/proc` file may be quite “complex”. So to
+help people write `/proc` files, there is an API named `seq_file` that
+helps format a `/proc` file for output. It is based on a sequence,
+which is composed of 3 functions: `start()`, `next()`, and `stop()`.
+The `seq_file` API starts a sequence when a user reads the `/proc`
+file.
+
+A sequence begins with the call of the function `start()`. If the
+return is a non-`NULL` value, the function `next()` is called;
+otherwise, the `stop()` function is called directly. The `next()`
+function is an iterator; its goal is to go through all the data. Each
+time `next()` is called, the function `show()` is also called. It
+writes data values into the buffer read by the user. The function
+`next()` is called until it returns `NULL`. The sequence ends when
+`next()` returns `NULL`, and then the function `stop()` is called.
+
+BE CAREFUL: when a sequence is finished, another one starts. That means
+that at the end of function `stop()`, the function `start()` is called
+again. This loop finishes when the function `start()` returns `NULL`.
+You can see a scheme of this in
+Figure [img:seqfile].
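+
+The calling order described above can be imitated in userspace C. This
+is a simplified sketch, not the kernel implementation: every
+`sketch_*` name is hypothetical, the callbacks omit the
+`struct seq_file` argument that the real ones receive, and the records
+are simply the numbers 0 to 2.
+
+```c
+#include <stddef.h>
+#include <stdio.h>
+
+#define SKETCH_ITEMS 3 /* hypothetical number of records to emit */
+
+/* Begin a sequence only while records remain. */
+void *sketch_start(long *pos)
+{
+    return *pos < SKETCH_ITEMS ? pos : NULL;
+}
+
+/* Advance to the next record, or return NULL at the end. */
+void *sketch_next(void *v, long *pos)
+{
+    (void)v;
+    (*pos)++;
+    return *pos < SKETCH_ITEMS ? pos : NULL;
+}
+
+void sketch_stop(void *v)
+{
+    (void)v; /* nothing to release in this sketch */
+}
+
+/* Formats one record into out; returns the snprintf() result. */
+int sketch_show(void *v, char *out, size_t room)
+{
+    return snprintf(out, room, "%ld\n", *(long *)v);
+}
+
+/* Runs sequences until start() returns NULL. Returns the number of
+ * times start() was called, or -1 if out is too small. */
+int sketch_run(char *out, size_t room)
+{
+    long pos = 0;
+    size_t used = 0;
+    int starts = 0;
+
+    if (room == 0)
+        return -1;
+    out[0] = '\0';
+
+    for (;;) {
+        void *v = sketch_start(&pos);
+        int began = (v != NULL);
+
+        starts++;
+        while (v) {
+            int n = sketch_show(v, out + used, room - used);
+
+            if (n < 0 || (size_t)n >= room - used)
+                return -1;
+            used += (size_t)n;
+            v = sketch_next(v, &pos);
+        }
+        sketch_stop(v);
+        if (!began)
+            break; /* start() returned NULL: no further sequence */
+    }
+    return starts;
+}
+```
+
+With three records, `sketch_start` runs twice: once to begin the only
+real sequence and once more after `sketch_stop`, when it returns `NULL`
+and ends the loop.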
+
+The `seq_file` API provides basic functions for `proc_ops`, such as
+`seq_read`, `seq_lseek`, and some others, but nothing to write to the
+`/proc` file. Of course, you can still use the same approach as in the
+previous example.
+
+If you want more information, you can read the code of
+[fs/seq_file.c](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs/seq_file.c)
+in the Linux kernel.
+
+sysfs: Interacting with your module
+===================================
+
+*sysfs* allows you to interact with the running kernel from userspace by
+reading or setting variables inside of modules. This can be useful for
+debugging purposes, or just as an interface for applications or scripts.
+You can find sysfs directories and files under the `/sys` directory on
+your system.
+
+    ls -l /sys
+
+Attributes can be exported for kobjects in the form of regular files in
+the filesystem. Sysfs forwards file I/O operations to methods defined
+for the attributes, providing a means to read and write kernel
+attributes.
+
+An attribute definition is simply:
+
+    struct attribute {
+        char *name;
+        struct module *owner;
+        umode_t mode;
+    };
+
+    int sysfs_create_file(struct kobject *kobj, const struct attribute *attr);
+    void sysfs_remove_file(struct kobject *kobj, const struct attribute *attr);
+
+For example, the driver model defines `struct device_attribute` like:
+
+    struct device_attribute {
+        struct attribute attr;
+        ssize_t (*show)(struct device *dev, struct device_attribute *attr,
+                        char *buf);
+        ssize_t (*store)(struct device *dev, struct device_attribute *attr,
+                         const char *buf, size_t count);
+    };
+
+    int device_create_file(struct device *, const struct device_attribute *);
+    void device_remove_file(struct device *, const struct device_attribute *);
+
+To read or write attributes, a `show()` or `store()` method must be
+specified when declaring the attribute. For the common cases,
+[include/linux/sysfs.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/sysfs.h)
+provides convenience macros (`__ATTR`, `__ATTR_RO`, `__ATTR_WO`, etc.)
+that make defining attributes easier and the code more concise and
+readable.
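+
+What a `show()` and `store()` pair does with its text buffer can be
+sketched in userspace C. The functions and the variable below are
+hypothetical and deliberately drop the kobject and attribute arguments
+that the real methods receive; a real `store()` would also return a
+negative errno value such as `-EINVAL` instead of the plain -1 used
+here.
+
+```c
+#include <stdio.h>
+
+/* Hypothetical variable exported through the attribute. */
+int sketch_variable;
+
+/* show(): format the current value as text into buf and return the
+ * number of bytes produced. */
+long sketch_show_attr(char *buf, size_t room)
+{
+    return snprintf(buf, room, "%d\n", sketch_variable);
+}
+
+/* store(): parse the text written by the user and update the variable.
+ * Returns count so the caller sees the whole write as consumed, or -1
+ * if the text does not start with a decimal integer. */
+long sketch_store_attr(const char *buf, size_t count)
+{
+    int value;
+
+    if (sscanf(buf, "%d", &value) != 1)
+        return -1;
+    sketch_variable = value;
+    return (long)count;
+}
+```
+
+Attribute values travel as text in both directions, which is why
+ordinary shell tools are enough to read and change them.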
+
+An example of a hello world module which includes the creation of a
+variable accessible via sysfs is given below.
+
+Make and install the module:
+
+    make
+    sudo insmod hello-sysfs.ko
+
+Check that it exists:
+
+    sudo lsmod | grep hello_sysfs
+
+What is the current value of `myvariable`?
+
+    sudo cat /sys/kernel/mymodule/myvariable
+
+Set the value of `myvariable` and check that it changed:
+
+    echo "32" | sudo tee /sys/kernel/mymodule/myvariable
+    sudo cat /sys/kernel/mymodule/myvariable
+
+Finally, remove the test module:
+
+    sudo rmmod hello_sysfs
+
+In the above case, we use a simple kobject to create a directory under
+sysfs, and communicate with its attributes. The `kobject` structure
+made its appearance in Linux v2.6.0. It was initially meant as a
+simple way of unifying kernel code which manages reference counted
+objects. After a bit of mission creep, it is now the glue that holds
+much of the device model and its sysfs interface together. For more
+information about kobject and sysfs, see
+[Documentation/driver-api/driver-model/driver.rst](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/driver-api/driver-model/driver.rst).
+
+Talking To Device Files
+=======================
+
+Device files are supposed to represent physical devices. Most physical
+devices are used for output as well as input, so there has to be some
+mechanism for device drivers in the kernel to get the output to send to
+the device from processes. This is done by opening the device file for
+output and writing to it, just like writing to a file. In the following
+example, this is implemented by `device_write`.
+
+This is not always enough. Imagine you had a serial port connected to a
+modem (even if you have an internal modem, it is still implemented from
+the CPU’s perspective as a serial port connected to a modem, so you
+don’t have to tax your imagination too hard). The natural thing to do
+would be to use the device file to write things to the modem (either
+modem commands or data to be sent through the phone line) and read
+things from the modem (either responses for commands or the data
+received through the phone line). However, this leaves open the question
+of what to do when you need to talk to the serial port itself, for
+example to configure the rate at which data is sent and received.
+
+The answer in Unix is to use a special function called `ioctl` (short
+for Input Output ConTroL). Every device can have its own `ioctl`
+commands, which can be read ioctls (to return information from the
+kernel to a process), write ioctls (to send information from a process
+to the kernel), both or neither. As with the `read` and `write` system
+calls, the direction is named from the point of view of the process:
+a read ioctl receives information from the kernel, and a write ioctl
+sends information to the kernel.
+
+The ioctl function is called with three parameters: the file descriptor
+of the appropriate device file, the ioctl number, and a parameter, which
+is of type long so you can use a cast to use it to pass anything. You
+will not be able to pass a structure this way, but you will be able to
+pass a pointer to the structure. Here is an example:
+
+You can see there is an argument called `cmd` in the
+`test_ioctl_ioctl()` function. It is the ioctl number. The ioctl number
+encodes the major device number, the type of the ioctl, the command,
+and the type of the parameter. This ioctl number is usually created by
+a macro call (`_IO`, `_IOR`, `_IOW` or `_IOWR`, depending on the type)
+in a header file. This header file should then be included both by the
+programs which will use ioctl (so they can generate the appropriate
+ioctls) and by the kernel module (so it can understand them). In the
+example below, the header file is `chardev.h` and the program which
+uses it is `userspace_ioctl.c`.
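+
+How the macros pack fields into one number can be checked from
+userspace on Linux, where `<sys/ioctl.h>` exposes both the constructor
+macros and the `_IOC_*` accessors. The commands below are hypothetical
+and the magic value `'k'` is an arbitrary choice for illustration, not
+an official assignment. As seen through the accessors, the fields are
+the direction, the type (the slot that the text above identifies with
+the major device number), the command number, and the size of the
+parameter.
+
+```c
+#include <sys/ioctl.h>
+
+/* Hypothetical commands; 'k' is an arbitrary, unregistered magic. */
+#define SKETCH_IOC_MAGIC 'k'
+#define SKETCH_IOC_RESET _IO(SKETCH_IOC_MAGIC, 0)
+#define SKETCH_IOC_GET   _IOR(SKETCH_IOC_MAGIC, 1, int)
+#define SKETCH_IOC_SET   _IOW(SKETCH_IOC_MAGIC, 2, int)
+
+/* Decoding helpers built on the accessor macros. */
+unsigned int sketch_ioc_dir(unsigned int cmd)  { return _IOC_DIR(cmd); }
+unsigned int sketch_ioc_type(unsigned int cmd) { return _IOC_TYPE(cmd); }
+unsigned int sketch_ioc_nr(unsigned int cmd)   { return _IOC_NR(cmd); }
+unsigned int sketch_ioc_size(unsigned int cmd) { return _IOC_SIZE(cmd); }
+```
+
+Because both sides include the same header, the kernel module can
+decode exactly the fields that the user program encoded.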
+
+If you want to use ioctls in your own kernel modules, it is best to
+receive an official ioctl assignment, so if you accidentally get
+somebody else’s ioctls, or if they get yours, you’ll know something is
+wrong. For more information, consult the kernel source tree at
+[Documentation/userspace-api/ioctl/ioctl-number.rst](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/userspace-api/ioctl/ioctl-number.rst).
+
+Also, we need to be careful, because concurrent access to shared
+resources can lead to race conditions. The solution is to use atomic
+Compare-And-Swap (CAS), which we mentioned in section 6.5, to enforce
+exclusive access.
+
+System Calls
+============
+
+So far, the only thing we’ve done was to use well defined kernel
+mechanisms to register `/proc` files and device handlers. This is fine
+if you want to do something the kernel programmers thought you’d want,
+such as write a device driver. But what if you want to do something
+unusual, to change the behavior of the system in some way? Then, you are
+mostly on your own.
+
+Should one choose not to use a virtual machine, kernel programming can
+become risky. For example, while writing the code below, the `open()`
+system call was inadvertently disrupted. This resulted in an inability
+to open any files, run programs, or shut down the system, necessitating
+a restart of the virtual machine. Fortunately, no critical files were
+lost in this instance. However, if such modifications were made on a
+live, mission-critical system, the consequences could be severe. To
+mitigate the risk of file loss, even in a test environment, it is
+advised to execute `sync` right before using `insmod` and `rmmod`.
+
+Forget about `/proc` files, forget about device files. They are just
+minor details. Minutiae in the vast expanse of the universe. The real
+process to kernel communication mechanism, the one used by all
+processes, is *system calls*. When a process requests a service from the
+kernel (such as opening a file, forking to a new process, or requesting
+more memory), this is the mechanism used. If you want to change the
+behaviour of the kernel in interesting ways, this is the place to do it.
+By the way, if you want to see which system calls a program uses, run
+`strace <arguments>`.
+
+In general, a process is not supposed to be able to access the kernel.
+It can not access kernel memory and it can’t call kernel functions. The
+hardware of the CPU enforces this (that is the reason why it is called
+“protected mode” or “page protection”).
+
+System calls are an exception to this general rule. What happens is that
+the process fills the registers with the appropriate values and then
+calls a special instruction which jumps to a previously defined location
+in the kernel (of course, that location is readable by user processes,
+but it is not writable by them). On 32-bit Intel CPUs, this was
+traditionally done by means of interrupt 0x80; x86-64 provides the
+dedicated `syscall` instruction. The hardware knows that once you jump
+to this location, you are no longer running in restricted user mode,
+but as the operating system kernel, and therefore you’re allowed to do
+whatever you want.
+
+The location in the kernel a process can jump to is called
+`system_call`. The procedure at that location checks the system call
+number, which tells the kernel what service the process requested.
+Then, it looks at the table of system calls (`sys_call_table`) to see
+the address of the kernel function to call. Then it calls the function,
+and after it returns, does a few system checks and then returns to the
+process (or to a different process, if the process time ran out). If
+you want to read this code, it is at the source file
+`arch/$(architecture)/kernel/entry.S`, after the line
+`ENTRY(system_call)`.
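+
+The role of the system call number can be seen from userspace with the
+C library's `syscall()` function, which takes the number explicitly
+instead of hiding it behind a wrapper. The helper name
+`sketch_raw_getpid` is hypothetical, and the sketch assumes Linux with
+glibc.
+
+```c
+#define _GNU_SOURCE
+#include <sys/syscall.h>
+#include <unistd.h>
+
+/* Invokes getpid by system call number rather than through the usual
+ * wrapper. SYS_getpid is the number the kernel looks up in its table
+ * of system calls. */
+long sketch_raw_getpid(void)
+{
+    return syscall(SYS_getpid);
+}
+```
+
+The result is the same process id that `getpid()` returns, since both
+paths end in the same kernel function.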
+
+So, if we want to change the way a certain system call works, what we
+need to do is to write our own function to implement it (usually by
+adding a bit of our own code, and then calling the original function)
+and then change the pointer at `sys_call_table` to point to our
+function. Because we might be removed later and we don’t want to leave
+the system in an unstable state, it’s important for `cleanup_module` to
+restore the table to its original state.
+
+To modify the content of `sys_call_table`, we need to consider the
+control register. A control register is a processor register that
+changes or controls the general behavior of the CPU. For x86
+architecture, the `cr0` register has various control flags that modify
+the basic operation of the processor. The `WP` flag in `cr0` stands for
+write protection. Once the `WP` flag is set, the processor disallows
+further write attempts to the read-only sections. Therefore, we must
+disable the `WP` flag before modifying `sys_call_table`. Since Linux
+v5.3, the `write_cr0` function cannot be used for this, because the
+sensitive `cr0` bits are pinned for security reasons: otherwise an
+attacker could write into CPU control registers to disable CPU
+protections like write protection. As a result, we have to provide a
+custom assembly routine to bypass it.
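+
+The bit manipulation involved can be sketched separately from the
+privileged part. On x86 the `WP` flag is bit 16 of `cr0`. The helpers
+below are hypothetical and only compute the new register value;
+loading it into `cr0` requires privileged code and is not shown.
+
+```c
+#include <stdint.h>
+
+/* Bit 16 of cr0 is the WP (write protect) flag on x86. */
+#define SKETCH_CR0_WP (UINT64_C(1) << 16)
+
+/* Value to load in order to allow writes to read-only pages. */
+uint64_t sketch_cr0_without_wp(uint64_t cr0)
+{
+    return cr0 & ~SKETCH_CR0_WP;
+}
+
+/* Value to load in order to restore write protection afterwards. */
+uint64_t sketch_cr0_with_wp(uint64_t cr0)
+{
+    return cr0 | SKETCH_CR0_WP;
+}
+
+int sketch_cr0_wp_is_set(uint64_t cr0)
+{
+    return (cr0 & SKETCH_CR0_WP) != 0;
+}
+```
+
+A module that clears the flag should set it again as soon as the table
+has been updated, so that the protection is off for as short a time as
+possible.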
+
+However, the `sys_call_table` symbol is unexported to prevent misuse.
+There are a few ways to get the symbol anyway: manual symbol lookup and
+`kallsyms_lookup_name`. Here we use both, depending on the kernel
+version.
+
+*Control-flow integrity* is a technique to prevent an attacker from
+redirecting execution: it makes sure that indirect calls go to the
+expected addresses and that return addresses are not changed. Since
+Linux v5.7, the kernel has carried a patch series for *control-flow
+enforcement* (CET) on x86, and some configurations of GCC, like GCC
+versions 9 and 10 in Ubuntu Linux, enable CET (the `-fcf-protection`
+option) by default. Using such a GCC to compile the kernel with
+retpoline off may result in CET being enabled in the kernel. You can
+use the following command to check whether the `-fcf-protection`
+option is enabled:
+
+    $ gcc -v -Q -O2 --help=target | grep protection
+    Using built-in specs.
+    COLLECT_GCC=gcc
+    COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper
+    ...
+    gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
+    COLLECT_GCC_OPTIONS='-v' '-Q' '-O2' '--help=target' '-mtune=generic' '-march=x86-64'
+    /usr/lib/gcc/x86_64-linux-gnu/9/cc1 -v ... -fcf-protection ...
+    GNU C17 (Ubuntu 9.3.0-17ubuntu1~20.04) version 9.3.0 (x86_64-linux-gnu)
+    ...
+
+But CET should not be enabled in the kernel, because it may break
+Kprobes and bpf. Consequently, CET is disabled since v5.11. To
+guarantee that the manual symbol lookup works, we only use it up to
+v5.4.
+
+Unfortunately, since Linux v5.7 `kallsyms_lookup_name` is also
+unexported, so a certain trick is needed to get the address of
+`kallsyms_lookup_name`. If `CONFIG_KPROBES` is enabled, we can
+retrieve function addresses by means of Kprobes, which dynamically
+break into a specific kernel routine. Kprobes inserts a breakpoint at
+the entry of a function by replacing the first bytes of the probed
+instruction. When a CPU hits the breakpoint, registers are stored, and
+control passes to Kprobes. It passes the addresses of the saved
+registers and the Kprobe struct to the handler you defined, then
+executes it. Kprobes can be registered by symbol name or address. When
+a symbol name is given, the address is resolved by the kernel.
+
+Otherwise, pass the address of `sys_call_table`, taken from
+`/proc/kallsyms` or `/boot/System.map`, in the `sym` parameter. The
+following is a sample usage for `/proc/kallsyms`:
+
+    $ sudo grep sys_call_table /proc/kallsyms
+    ffffffff82000280 R x32_sys_call_table
+    ffffffff820013a0 R sys_call_table
+    ffffffff820023e0 R ia32_sys_call_table
+    $ sudo insmod syscall-steal.ko sym=0xffffffff820013a0
+
+When using the address from `/boot/System.map`, be careful about
+`KASLR` (Kernel Address Space Layout Randomization). `KASLR` may
+randomize the address of kernel code and data at every boot, so that
+the static address listed in `/boot/System.map` is offset by some
+entropy. The purpose of `KASLR` is to protect the kernel space from
+attackers. Without `KASLR`, an attacker can easily find the target at
+a fixed address. Then the attacker can use return-oriented programming
+to execute malicious code or to read the target data through a
+tampered pointer. `KASLR` mitigates these kinds of attacks because the
+attacker cannot immediately know the target address, but a brute-force
+attack can still work. If the address of a symbol in `/proc/kallsyms`
+is different from the address in `/boot/System.map`, `KASLR` is
+enabled in the kernel your system is running.
+
+    $ grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
+    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
+    $ sudo grep sys_call_table /boot/System.map-$(uname -r)
+    ffffffff82000300 R sys_call_table
+    $ sudo grep sys_call_table /proc/kallsyms
+    ffffffff820013a0 R sys_call_table
+    # Reboot
+    $ sudo grep sys_call_table /boot/System.map-$(uname -r)
+    ffffffff82000300 R sys_call_table
+    $ sudo grep sys_call_table /proc/kallsyms
+    ffffffff86400300 R sys_call_table
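+
+The comparison made by eye in the listing above can also be computed.
+The helpers below are a hypothetical userspace sketch: they parse two
+addresses in the hexadecimal form printed by `/proc/kallsyms` and
+`System.map`, and return their difference.
+
+```c
+#include <stdint.h>
+#include <stdlib.h>
+
+/* Parses a hexadecimal address as printed in /proc/kallsyms or
+ * System.map. */
+uint64_t sketch_parse_addr(const char *hex)
+{
+    return (uint64_t)strtoull(hex, NULL, 16);
+}
+
+/* Returns the offset between the address seen at run time and the
+ * static one; 0 means the symbol was not moved. */
+uint64_t sketch_kaslr_offset(const char *runtime_hex,
+                             const char *static_hex)
+{
+    return sketch_parse_addr(runtime_hex) - sketch_parse_addr(static_hex);
+}
+```
+
+For the addresses shown after the reboot, `ffffffff86400300` and
+`ffffffff82000300`, the offset is `0x4400000`.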
+
+If `KASLR` is enabled, we have to look up the address in
+`/proc/kallsyms` again each time we reboot the machine. In order to use
+the address from `/boot/System.map`, make sure that `KASLR` is
+disabled. You can add `nokaslr` to the kernel command line to disable
+`KASLR` at the next boot:
+
+    $ grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
+    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
+    $ sudo perl -i -pe 'm/quiet/ and s//quiet nokaslr/' /etc/default/grub
+    $ grep quiet /etc/default/grub
+    GRUB_CMDLINE_LINUX_DEFAULT="quiet nokaslr splash"
+    $ sudo update-grub
+
+For more information, check out the following:
+
+- [Cook: Security things in Linux
+ v5.3](https://lwn.net/Articles/804849/)
+
+- [Unexporting the system call table](https://lwn.net/Articles/12211/)
+
+- [Control-flow integrity for the
+ kernel](https://lwn.net/Articles/810077/)
+
+- [Unexporting kallsyms_lookup_name()](https://lwn.net/Articles/813350/)
+
+- [Kernel Probes
+ (Kprobes)](https://www.kernel.org/doc/Documentation/kprobes.txt)
+
+- [Kernel address space layout
+ randomization](https://lwn.net/Articles/569635/)
+
+The source code here is an example of such a kernel module. We want to
+“spy” on a certain user, and to `pr_info()` a message whenever that
+user opens a file. Towards this end, we replace the system call to open
+a file with our own function, called `our_sys_openat`. This function
+checks the uid (user’s id) of the current process, and if it is equal
+to the uid we spy on, it calls `pr_info()` to display the name of the
+file to be opened. Then, either way, it calls the original `openat()`
+function with the same parameters, to actually open the file.
+
+The `init_module` function replaces the appropriate location in
+`sys_call_table` and keeps the original pointer in a variable. The
+`cleanup_module` function uses that variable to restore everything back
+to normal. This approach is dangerous, because of the possibility of
+two kernel modules changing the same system call. Imagine we have two
+kernel modules, A and B. A’s openat system call will be `A_openat` and
+B’s will be `B_openat`. Now, when A is inserted into the kernel, the
+system call is replaced with `A_openat`, which will call the original
+`sys_openat` when it is done. Next, B is inserted into the kernel,
+which replaces the system call with `B_openat`, which will call what it
+thinks is the original system call, `A_openat`, when it’s done.
+
+Now, if B is removed first, everything will be well: it will simply
+restore the system call to `A_openat`, which calls the original.
+However, if A is removed and then B is removed, the system will crash.
+A’s removal will restore the system call to the original, `sys_openat`,
+cutting B out of the loop. Then, when B is removed, it will restore the
+system call to what it thinks is the original, `A_openat`, which is no
+longer in memory. At first glance, it appears we could solve this
+particular problem by checking if the system call is equal to our open
+function and, if it is not, leaving it unchanged (so that B won’t
+change the system call when it is removed), but that will cause an even
+worse problem. When A is removed, it sees that the system call was
+changed to `B_openat` so that it is no longer pointing to `A_openat`,
+so it will not restore it to `sys_openat` before it is removed from
+memory. Unfortunately, `B_openat` will still try to call `A_openat`
+which is no longer there, so that even without removing B the system
+would crash.
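+
+The removal-order problem can be reproduced in a harmless userspace
+model. Everything below is hypothetical: a single function pointer
+stands in for the system call table entry, and a `loaded` flag stands
+in for a module being present in memory, so that a stale pointer can be
+detected instead of followed.
+
+```c
+#include <stddef.h>
+
+typedef int (*sketch_call_t)(void);
+
+int sketch_original(void) { return 0; } /* the "real" system call */
+int sketch_a_hook(void) { return 1; }   /* module A's replacement */
+int sketch_b_hook(void) { return 2; }   /* module B's replacement */
+
+/* One-entry stand-in for the system call table. */
+sketch_call_t sketch_table = sketch_original;
+
+struct sketch_module {
+    sketch_call_t hook;  /* function this module installs */
+    sketch_call_t saved; /* what it found in the table when inserted */
+    int loaded;          /* nonzero while the module is "in memory" */
+};
+
+struct sketch_module sketch_a = { sketch_a_hook, NULL, 0 };
+struct sketch_module sketch_b = { sketch_b_hook, NULL, 0 };
+
+void sketch_insert(struct sketch_module *m)
+{
+    m->saved = sketch_table; /* may be another module's hook */
+    sketch_table = m->hook;
+    m->loaded = 1;
+}
+
+void sketch_remove(struct sketch_module *m)
+{
+    sketch_table = m->saved; /* blindly restore what was saved */
+    m->loaded = 0;
+}
+
+/* Returns 1 if the table points into a module that is gone. */
+int sketch_table_is_dangling(void)
+{
+    if (sketch_table == sketch_a.hook)
+        return !sketch_a.loaded;
+    if (sketch_table == sketch_b.hook)
+        return !sketch_b.loaded;
+    return 0;
+}
+```
+
+Inserting A then B and removing them in reverse order leaves the table
+pointing at `sketch_original`. Removing A first and B second leaves it
+pointing at A's hook although A is no longer loaded, which in a real
+kernel is the crash described above.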
+
+For x86 architecture, the system call table cannot be used to invoke a
+system call after commit
+[1e3ad78](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1e3ad78334a69b36e107232e337f9d693dcc9df2)
+since v6.9. This commit has been backported to long term stable kernels,
+like v5.15.154+, v6.1.85+, v6.6.26+ and v6.8.5+, see this
+[answer](https://stackoverflow.com/a/78607015) for more details. In this
+case, thanks to Kprobes, a hook can be used instead on the system call
+entry to intercept the system call.
+
+Note that all the related problems make syscall stealing unfeasible for
+production use. In order to keep people from doing potentially harmful
+things, `sys_call_table` is no longer exported. This means that if you
+want to do something more than a mere dry run of this example, you will
+have to patch your current kernel in order to have `sys_call_table`
+exported.
+
+Blocking Processes and threads
+==============================
+
+Sleep
+-----
+
+What do you do when somebody asks you for something you cannot do right
+away? If you are a human being and you are bothered by a human being,
+the only thing you can say is: "*Not right now, I’m busy. Go away!*".
+But if you are a kernel module and you are bothered by a process, you
+have another possibility. You can put the process to sleep until you can
+service it. After all, processes are being put to sleep by the kernel
+and woken up all the time (that is the way multiple processes appear to
+run at the same time on a single CPU).
+
+This kernel module is an example of this. The file (called
+`/proc/sleep`) can only be opened by a single process at a time. If the
+file is already open, the kernel module calls
+`wait_event_interruptible`. The easiest way to keep a file open is to
+open it with:
+
+    tail -f
+
+`wait_event_interruptible` changes the status of the task (a task is
+the kernel data structure which holds information about a process and
+the system call it is in, if any) to `TASK_INTERRUPTIBLE`, which means
+that the task will not run until it is woken up somehow, and adds it to
+WaitQ, the queue of tasks waiting to access the file. Then, the
+function calls the scheduler to context switch to a different process,
+one which has some use for the CPU.
+
+When a process is done with the file, it closes it, and `module_close`
+is called. That function wakes up all the processes in the queue
+(there’s no mechanism to only wake up one of them). It then returns and
+the process which just closed the file can continue to run. In time,
+the scheduler decides that that process has had enough and gives
+control of the CPU to another process. Eventually, one of the processes
+which was in the queue will be given control of the CPU by the
+scheduler. It starts at the point right after the call to
+`wait_event_interruptible`.
+
+This means that the process is still in kernel mode - as far as the
+process is concerned, it issued the open system call and the system call
+has not returned yet. The process does not know somebody else used the
+CPU for most of the time between the moment it issued the call and the
+moment it returned.
+
+It can then proceed to set a global variable to tell all the other
+processes that the file is still open and go on with its life. When the
+other processes get a piece of the CPU, they’ll see that global variable
+and go back to sleep.
+
+So we will use `tail -f` to keep the file open in the background, while
+trying to access it with another process (again in the background, so
+that we need not switch to a different vt). As soon as the first
+background process is killed with `kill %1`, the second is woken up, is
+able to access the file and finally terminates.
+
+To make our life more interesting, `module_close` does not have a
+monopoly on waking up the processes which wait to access the file. A
+signal, such as *Ctrl+c* (**SIGINT**) can also wake up a process. This
+is because we used `wait_event_interruptible`. We could have used
+`wait_event` instead, but that would have resulted in extremely angry
+users whose *Ctrl+c*’s are ignored.
+
+In that case, we want to return with `-EINTR` immediately. This is
+important so users can, for example, kill the process before it receives
+the file.
+
+There is one more point to remember. Sometimes processes don’t want to
+sleep; they want either to get what they want immediately, or to be
+told it cannot be done. Such processes use the `O_NONBLOCK` flag when
+opening the file. The kernel is supposed to respond by returning the
+error code `-EAGAIN` from operations which would otherwise block, such
+as opening the file in this example. The program `cat_nonblock`,
+available in the `examples/other` directory, can be used to open a file
+with `O_NONBLOCK`.
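+
+From the user side, the non-blocking contract looks roughly like the
+following sketch. It is not the `cat_nonblock` program itself: the
+helper name `sketch_try_open` and its return convention are
+assumptions made for illustration.
+
+```c
+#include <errno.h>
+#include <fcntl.h>
+#include <unistd.h>
+
+/* Tries to open path for reading without blocking.
+ * Returns 1 if the open succeeded, 0 if it would have blocked
+ * (errno == EAGAIN), and -1 on any other error. */
+int sketch_try_open(const char *path)
+{
+    int fd = open(path, O_RDONLY | O_NONBLOCK);
+
+    if (fd >= 0) {
+        close(fd);
+        return 1;
+    }
+    return errno == EAGAIN ? 0 : -1;
+}
+```
+
+Against `/proc/sleep`, a return value of 0 would correspond to the case
+where another process already holds the file open.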
+
+    $ sudo insmod sleep.ko
+    $ cat_nonblock /proc/sleep
+    Last input:
+    $ tail -f /proc/sleep &
+    Last input:
+    Last input:
+    Last input:
+    Last input:
+    Last input:
+    Last input:
+    Last input:
+    tail: /proc/sleep: file truncated
+    [1] 6540
+    $ cat_nonblock /proc/sleep
+    Open would block
+    $ kill %1
+    [1]+ Terminated tail -f /proc/sleep
+    $ cat_nonblock /proc/sleep
+    Last input:
+    $
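+
+The open and close logic described in this section can be sketched
+roughly as follows. This is only a fragment, not the full `sleep.c`
+listing: the names `sketch_waitq`, `sketch_already_open`, `sketch_open`
+and `sketch_close` are made up for illustration, and the two functions
+are assumed to be wired into the file operations of the `/proc` entry.
+
+```c
+#include <linux/atomic.h>
+#include <linux/errno.h>
+#include <linux/fs.h>
+#include <linux/wait.h>
+
+/* Hypothetical names; the real module may organise this differently. */
+static DECLARE_WAIT_QUEUE_HEAD(sketch_waitq);
+static atomic_t sketch_already_open = ATOMIC_INIT(0);
+
+static int sketch_open(struct inode *inode, struct file *file)
+{
+    /* atomic_cmpxchg() returns the old value: 0 means we got the file. */
+    while (atomic_cmpxchg(&sketch_already_open, 0, 1)) {
+        /* O_NONBLOCK callers are told to retry instead of sleeping. */
+        if (file->f_flags & O_NONBLOCK)
+            return -EAGAIN;
+
+        /* Sleep until the file is released or a signal arrives. */
+        if (wait_event_interruptible(sketch_waitq,
+                                     !atomic_read(&sketch_already_open)))
+            return -EINTR;
+    }
+    return 0;
+}
+
+static int sketch_close(struct inode *inode, struct file *file)
+{
+    atomic_set(&sketch_already_open, 0); /* the file is free again */
+    wake_up(&sketch_waitq);              /* let the sleepers retry */
+    return 0;
+}
+```
+
+`wait_event_interruptible` returns a non-zero value when a signal ended
+the sleep, which is why the sketch can tell the two wake-up reasons
+apart.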
+
+Completions
+-----------
+
+Sometimes one thing should happen before another within a module having
+multiple threads. Rather than using `/bin/sleep` commands, the kernel
+has another way to do this which allows timeouts or interrupts to also
+happen.
+
+Completions, as a code synchronization mechanism, have three main parts:
+initialization of the `struct completion` synchronization object, the
+waiting or barrier part through `wait_for_completion()`, and the
+signalling side through a call to `complete()`.
+
+In the subsequent example, two threads are initiated: crank and
+flywheel. It is imperative that the crank thread runs before the
+flywheel thread. A completion state is established for each of these
+threads, with a distinct completion defined for both the crank and
+flywheel threads. At the exit point of each thread the respective
+completion state is updated, and `wait_for_completion` is used by the
+flywheel thread to ensure that it does not begin prematurely. The crank
+thread uses the `complete_all()` function to update the completion,
+which lets the flywheel thread continue.
+
+So even though `flywheel_thread` is started first, you should notice
+when you load this module and run `dmesg` that turning the crank always
+happens first, because the flywheel thread waits for the crank thread
+to complete.
+
+There are other variations of the `wait_for_completion` function, which
+include timeouts or being interrupted, but this basic mechanism is
+enough for many common situations without adding a lot of complexity.
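+
+A minimal sketch of the crank and flywheel idea is shown here. It is not
+the original listing: the function names and log messages are invented,
+and instead of `complete_all()` it signals through the combined
+"complete and exit" helper, which is assumed to be called
+`kthread_complete_and_exit` from v5.17 on and `complete_and_exit`
+before that.
+
+```c
+#include <linux/completion.h>
+#include <linux/err.h>
+#include <linux/init.h>
+#include <linux/kthread.h>
+#include <linux/module.h>
+#include <linux/printk.h>
+#include <linux/version.h>
+
+/* The exit helper was renamed in v5.17. */
+#if LINUX_VERSION_CODE >= KERNEL_VERSION(5, 17, 0)
+#define sketch_complete_and_exit kthread_complete_and_exit
+#else
+#define sketch_complete_and_exit complete_and_exit
+#endif
+
+static struct completion crank_comp;
+static struct completion flywheel_comp;
+
+static int crank_fn(void *arg)
+{
+    pr_info("Turn the crank\n");
+    /* Signal crank_comp and terminate this thread in one step. */
+    sketch_complete_and_exit(&crank_comp, 0);
+}
+
+static int flywheel_fn(void *arg)
+{
+    wait_for_completion(&crank_comp); /* barrier: the crank goes first */
+    pr_info("Flywheel spins up\n");
+    sketch_complete_and_exit(&flywheel_comp, 0);
+}
+
+static int __init completions_sketch_init(void)
+{
+    struct task_struct *crank, *flywheel;
+
+    init_completion(&crank_comp);
+    init_completion(&flywheel_comp);
+
+    /* Start the waiter first on purpose; it blocks on crank_comp. */
+    flywheel = kthread_run(flywheel_fn, NULL, "sketch_flywheel");
+    if (IS_ERR(flywheel))
+        return PTR_ERR(flywheel);
+
+    crank = kthread_run(crank_fn, NULL, "sketch_crank");
+    if (IS_ERR(crank)) {
+        /* Release the flywheel thread so it can finish, then wait. */
+        complete(&crank_comp);
+        wait_for_completion(&flywheel_comp);
+        return PTR_ERR(crank);
+    }
+    return 0;
+}
+
+static void __exit completions_sketch_exit(void)
+{
+    /* The flywheel finishes last, so waiting for it covers both. */
+    wait_for_completion(&flywheel_comp);
+    pr_info("completions sketch exit\n");
+}
+
+module_init(completions_sketch_init);
+module_exit(completions_sketch_exit);
+
+MODULE_DESCRIPTION("Completions sketch");
+MODULE_LICENSE("GPL");
+```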
+
+Avoiding Collisions and Deadlocks
+=================================
+
+If processes running on different CPUs or in different threads try to
+access the same memory, then it is possible that strange things can
+happen or your system can lock up. To avoid this, various types of
+mutual exclusion kernel functions are available. These indicate if a
+section of code is "locked" or "unlocked" so that simultaneous attempts
+to run it cannot happen.
+
+Mutex
+-----
+
+You can use kernel mutexes (mutual exclusions) in much the same manner
+that you might deploy them in userland. This may be all that is needed
+to avoid collisions in most cases.
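+
+As an illustration, a small module that protects a counter with a mutex
+might look like the following. The names `sketch_mutex`,
+`shared_counter` and `bump_counter` are invented for this sketch.
+
+```c
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/mutex.h>
+#include <linux/printk.h>
+
+static DEFINE_MUTEX(sketch_mutex);
+static int shared_counter; /* protected by sketch_mutex */
+
+static void bump_counter(void)
+{
+    mutex_lock(&sketch_mutex); /* may sleep: process context only */
+    shared_counter++;
+    mutex_unlock(&sketch_mutex);
+}
+
+static int __init mutex_sketch_init(void)
+{
+    bump_counter();
+
+    /* mutex_trylock() returns 1 if the lock was taken, 0 otherwise. */
+    if (mutex_trylock(&sketch_mutex)) {
+        pr_info("counter = %d\n", shared_counter);
+        mutex_unlock(&sketch_mutex);
+    } else {
+        pr_info("mutex is busy\n");
+    }
+    return 0;
+}
+
+static void __exit mutex_sketch_exit(void)
+{
+    pr_info("mutex sketch exit\n");
+}
+
+module_init(mutex_sketch_init);
+module_exit(mutex_sketch_exit);
+
+MODULE_DESCRIPTION("Mutex sketch");
+MODULE_LICENSE("GPL");
+```
+
+Because `mutex_lock` can sleep, this pattern belongs in process context
+and not in interrupt handlers.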
+
+Spinlocks
+---------
+
+As the name suggests, spinlocks lock up the CPU that the code is running
+on, taking 100% of its resources. Because of this you should only use
+the spinlock mechanism around code which is likely to take no more than
+a few milliseconds to run and so will not noticeably slow anything down
+from the user’s point of view.
+
+The example here is "irq safe" in that if interrupts happen during the
+lock then they will not be forgotten and will activate when the unlock
+happens, using the `flags` variable to retain their state.
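+
+A minimal sketch of such an irq safe critical section could look like
+this. The lock name, the counter and the helper function are made up
+for illustration.
+
+```c
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/printk.h>
+#include <linux/spinlock.h>
+
+static DEFINE_SPINLOCK(sketch_lock);
+static unsigned long event_count; /* protected by sketch_lock */
+
+static unsigned long count_event(void)
+{
+    unsigned long flags, seen;
+
+    /* Disable local interrupts, remember their state in flags. */
+    spin_lock_irqsave(&sketch_lock, flags);
+    seen = ++event_count; /* keep the locked region tiny */
+    spin_unlock_irqrestore(&sketch_lock, flags);
+
+    return seen;
+}
+
+static int __init spinlock_sketch_init(void)
+{
+    pr_info("events so far: %lu\n", count_event());
+    return 0;
+}
+
+static void __exit spinlock_sketch_exit(void)
+{
+    pr_info("spinlock sketch exit\n");
+}
+
+module_init(spinlock_sketch_init);
+module_exit(spinlock_sketch_exit);
+
+MODULE_DESCRIPTION("Spinlock sketch");
+MODULE_LICENSE("GPL");
+```
+
+Note that the `pr_info` call sits outside the locked region, so the
+time spent holding the lock stays as short as possible.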
+
+Taking 100% of a CPU’s resources comes with greater responsibility.
+Situations where the kernel code monopolizes a CPU are called **atomic
+contexts**. Holding a spinlock is one of those situations. Sleeping in
+atomic contexts may leave the system hanging, as the occupied CPU
+devotes 100% of its resources doing nothing but sleeping. In worse
+cases the system may crash. Thus, sleeping in atomic contexts is
+considered a bug in the kernel. Such bugs are sometimes called
+“sleep-in-atomic-context” in some materials.
+
+Note that sleeping here is not limited to calling the sleep functions
+explicitly. If subsequent function calls eventually invoke a function
+that sleeps, it is also considered sleeping. Thus, it is important to
+pay attention to functions being used in atomic context. There’s no
+documentation recording all such functions, but code comments may help.
+Sometimes you may find comments in kernel source code stating that a
+function “may sleep”, “might sleep”, or more explicitly “the caller
+should not hold a spinlock”. Those comments are hints that a function
+may implicitly sleep and must not be called in atomic contexts.
+
+Read and write locks
+--------------------
+
+Read and write locks are specialised kinds of spinlocks so that you can
+exclusively read from something or write to something. Like the earlier
+spinlocks example, the one below shows an "irq safe" situation in which
+if other functions were triggered from irqs which might also read and
+write to whatever you are concerned with then they would not disrupt the
+logic. As before it is a good idea to keep anything done within the lock
+as short as possible so that it does not hang up the system and cause
+users to start revolting against the tyranny of your module.
+
+Of course, if you know for sure that there are no functions triggered by
+irqs which could possibly interfere with your logic then you can use the
+simpler `read_lock(&myrwlock)` and `read_unlock(&myrwlock)` or the
+corresponding write functions.
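+
+The irq safe variant can be sketched as follows. Only `myrwlock` comes
+from the text above; `shared_value` and the two helper functions are
+invented for this illustration.
+
+```c
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/printk.h>
+#include <linux/spinlock.h> /* also provides the rwlock API */
+
+static DEFINE_RWLOCK(myrwlock);
+static int shared_value; /* protected by myrwlock */
+
+static int read_value(void)
+{
+    unsigned long flags;
+    int v;
+
+    read_lock_irqsave(&myrwlock, flags); /* many readers may hold this */
+    v = shared_value;
+    read_unlock_irqrestore(&myrwlock, flags);
+    return v;
+}
+
+static void write_value(int v)
+{
+    unsigned long flags;
+
+    write_lock_irqsave(&myrwlock, flags); /* excludes readers and writers */
+    shared_value = v;
+    write_unlock_irqrestore(&myrwlock, flags);
+}
+
+static int __init rwlock_sketch_init(void)
+{
+    write_value(42);
+    pr_info("value = %d\n", read_value());
+    return 0;
+}
+
+static void __exit rwlock_sketch_exit(void)
+{
+    pr_info("rwlock sketch exit\n");
+}
+
+module_init(rwlock_sketch_init);
+module_exit(rwlock_sketch_exit);
+
+MODULE_DESCRIPTION("Read/write lock sketch");
+MODULE_LICENSE("GPL");
+```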
+
+Atomic operations
+-----------------
+
+If you are doing simple arithmetic: adding, subtracting or bitwise
+operations, then there is another way in the multi-CPU and
+multi-hyperthreaded world to stop other parts of the system from messing
+with your mojo. By using atomic operations you can be confident that
+your addition, subtraction or bit flip did actually happen and was not
+overwritten by some other shenanigans. An example is shown below.
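+
+A short sketch of the kernel's own atomic helpers is given here. The
+starting values are arbitrary and the module name is invented; the
+comments show the value after each step.
+
+```c
+#include <linux/atomic.h>
+#include <linux/bitops.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/printk.h>
+
+static int __init atomic_sketch_init(void)
+{
+    atomic_t counter = ATOMIC_INIT(50);
+    unsigned long word = 0;
+
+    /* Arithmetic on an atomic_t. */
+    atomic_add(7, &counter); /* 57 */
+    atomic_sub(2, &counter); /* 55 */
+    atomic_inc(&counter);    /* 56 */
+    atomic_dec(&counter);    /* 55 */
+    pr_info("counter = %d\n", atomic_read(&counter));
+
+    /* Atomic bit operations on an unsigned long. */
+    set_bit(1, &word);    /* 0x02 */
+    set_bit(5, &word);    /* 0x22 */
+    clear_bit(1, &word);  /* 0x20 */
+    change_bit(3, &word); /* 0x28 */
+    if (test_and_set_bit(3, &word)) /* returns the old bit value */
+        pr_info("bit 3 was already set\n");
+    pr_info("word = 0x%lx\n", word);
+
+    return 0;
+}
+
+static void __exit atomic_sketch_exit(void)
+{
+    pr_info("atomic sketch exit\n");
+}
+
+module_init(atomic_sketch_init);
+module_exit(atomic_sketch_exit);
+
+MODULE_DESCRIPTION("Atomic operations sketch");
+MODULE_LICENSE("GPL");
+```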
+
+Before the C11 standard adopted built-in atomic types, the kernel
+already provided a small set of atomic types implemented with tricky
+architecture-specific code. Implementing the atomic types with C11
+atomics might allow the kernel to throw away the architecture-specific
+code and make the kernel code friendlier to people who understand the
+standard. But there are some problems; for example, the memory model of
+the kernel does not match the model defined by the C11 atomics. For
+further details, see:
+
+- [kernel documentation of atomic
+ types](https://www.kernel.org/doc/Documentation/atomic_t.txt)
+
+- [Time to move to C11 atomics?](https://lwn.net/Articles/691128/)
+
+- [Atomic usage patterns in the
+ kernel](https://lwn.net/Articles/698315/)
+
+Replacing Print Macros
+======================
+
+Replacement
+-----------
+
+In Section 1.7, it was noted that the X Window System and kernel module
+programming are not conducive to integration. This remains valid during
+the development of kernel modules. However, in practical scenarios, the
+necessity emerges to relay messages to the tty (teletype) originating
+the module load command.
+
+The term “tty” originates from *teletype*, which initially referred to a
+combined keyboard-printer for Unix system communication. Today, it
+signifies a text stream abstraction employed by Unix programs,
+encompassing physical terminals, xterms in X displays, and network
+connections like SSH.
+
+To achieve this, the “current” pointer is leveraged to access the active
+task’s tty structure. Within this structure lies a pointer to a string
+write function, facilitating the string’s transmission to the tty.
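+
+A sketch of that approach is shown below, under some assumptions: the
+function name `print_string` and the message are illustrative, and the
+write operation is called directly without the locking the tty layer
+normally does, which is acceptable for a demonstration but not for
+production code.
+
+```c
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/string.h>
+#include <linux/tty.h>
+#include <linux/tty_driver.h>
+
+static void print_string(const char *str)
+{
+    /* The tty of the task that is running insmod or rmmod, if any. */
+    struct tty_struct *my_tty = get_current_tty();
+
+    if (!my_tty)
+        return; /* e.g. loaded by a daemon without a terminal */
+
+    if (my_tty->driver->ops->write) {
+        my_tty->driver->ops->write(my_tty, str, strlen(str));
+        /* ttys want CR and LF to start a new line at column 0. */
+        my_tty->driver->ops->write(my_tty, "\r\n", 2);
+    }
+
+    tty_kref_put(my_tty); /* drop the reference taken above */
+}
+
+static int __init print_string_init(void)
+{
+    print_string("The module has been inserted. Hello world!");
+    return 0;
+}
+
+static void __exit print_string_exit(void)
+{
+    print_string("The module has been removed. Farewell world!");
+}
+
+module_init(print_string_init);
+module_exit(print_string_exit);
+
+MODULE_DESCRIPTION("tty output sketch");
+MODULE_LICENSE("GPL");
+```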
+
+Flashing keyboard LEDs
+----------------------
+
+In certain conditions, you may desire a simpler and more direct way to
+communicate to the external world. Flashing keyboard LEDs can be such a
+solution: it is an immediate way to attract attention or to display a
+status condition. Keyboard LEDs are present on almost all hardware, they
+are always visible, they do not need any setup, and their use is rather
+simple and non-intrusive, compared to writing to a tty or a file.
+
+From v4.14 to v4.15, the timer API went through a series of changes to
+improve memory safety. A buffer overflow in the area of a `timer_list`
+structure may be able to overwrite the `function` and `data` fields,
+providing the attacker with a way to use return-oriented programming
+(ROP) to call arbitrary functions within the kernel. Also, the callback
+prototype, which takes an `unsigned long` argument, prevents the
+compiler from doing any type checking. Furthermore, that prototype may
+be an obstacle to the forward-edge protection of *control-flow
+integrity*. Thus, it is better to use a unique prototype, separate from
+the cluster of functions that take an `unsigned long` argument. The
+timer callback should be passed a pointer to the `timer_list` structure
+rather than an `unsigned long` argument. The caller then wraps all the
+information the callback needs, including the `timer_list` structure,
+into a larger structure, and the callback can use the `container_of`
+macro instead of the `unsigned long` value. For more information see:
+[Improving the kernel timers API](https://lwn.net/Articles/735887/).
+
+Before Linux v4.14, `setup_timer` was used to initialize the timer and
+the `timer_list` structure looked like:
+
+    struct timer_list {
+        unsigned long expires;
+        void (*function)(unsigned long);
+        unsigned long data;
+        u32 flags;
+        /* ... */
+    };
+
+    void setup_timer(struct timer_list *timer, void (*callback)(unsigned long),
+                     unsigned long data);
+
+Since Linux v4.14, `timer_setup` has been adopted and the kernel has
+been converted step by step from `setup_timer` to `timer_setup`. One
+reason the API was changed this way is that the new interface needed to
+coexist with the old one. In fact, `timer_setup` was at first
+implemented in terms of `setup_timer`.
+
+    void timer_setup(struct timer_list *timer,
+                     void (*callback)(struct timer_list *), unsigned int flags);
+
+`setup_timer` was then removed in v4.15. As a result, the `timer_list`
+structure changed to the following.
+
+    struct timer_list {
+        unsigned long expires;
+        void (*function)(struct timer_list *);
+        u32 flags;
+        /* ... */
+    };
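+
+The `container_of` pattern can be tried outside the kernel. The sketch
+below is plain user-space C: `struct timer_list` here is only a
+stand-in with the two fields needed, the macro is a simplified version
+of the kernel one, and `struct blinker`, `blinker_cb` and `fire` are
+invented names.
+
+```c
+#include <stddef.h>
+
+/* Simplified form of the kernel macro, without its type checking. */
+#define container_of(ptr, type, member) \
+    ((type *)((char *)(ptr) - offsetof(type, member)))
+
+/* User-space stand-in for the kernel structure (assumption). */
+struct timer_list {
+    unsigned long expires;
+    void (*function)(struct timer_list *);
+};
+
+/* The driver's own data embeds the timer instead of pointing to it. */
+struct blinker {
+    int led_state;
+    struct timer_list timer;
+};
+
+/* The callback only receives the timer and recovers its container. */
+static void blinker_cb(struct timer_list *t)
+{
+    struct blinker *b = container_of(t, struct blinker, timer);
+
+    b->led_state = !b->led_state;
+}
+
+/* What a timer core does on expiry: call back with the timer itself. */
+static void fire(struct timer_list *t)
+{
+    t->function(t);
+}
+```
+
+Since the callback no longer receives an opaque `unsigned long`, the
+compiler can check that it is given a `struct timer_list *`.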
+
+The following source code illustrates a minimal kernel module which,
+when loaded, starts blinking the keyboard LEDs until it is unloaded.
+
+If none of the examples in this chapter fit your debugging needs, there
+might yet be some other tricks to try. Ever wondered what
+`CONFIG_LL_DEBUG` in `make menuconfig` is good for? If you activate
+that, you get low-level access to the serial port. While this might not
+sound very powerful by itself, you can patch
+[kernel/printk.c](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/kernel/printk.c)
+or any other essential syscall to print ASCII characters, thus making it
+possible to trace virtually everything your code does over a serial
+line. If you find yourself porting the kernel to some new and formerly
+unsupported architecture, this is usually amongst the first things that
+should be implemented. Logging over a netconsole might also be worth a
+try.
+
+While you have seen lots of stuff that can be used to aid debugging
+here, there are some things to be aware of. Debugging is almost always
+intrusive. Adding debug code can change the situation enough to make the
+bug seem to disappear. Thus, you should keep debug code to a minimum and
+make sure it does not show up in production code.
+
+Scheduling Tasks
+================
+
+There are two main ways of running tasks: tasklets and work queues.
+Tasklets are a quick and easy way of scheduling a single function to be
+run, for example when triggered from an interrupt, whereas work queues
+are more complicated but also better suited to running multiple things
+in a sequence.
+
+It is possible that in future tasklets may be replaced by *threaded
+irqs*. However, discussion about that has been ongoing since 2007
+([Eliminating tasklets](https://lwn.net/Articles/239633)), so do not
+hold your breath. See Section 15.1 if you wish to avoid the tasklet
+debate.
+
+Tasklets
+--------
+
+Here is an example tasklet module. The `tasklet_fn` function runs for a
+few seconds. In the meantime, execution of the `example_tasklet_init`
+function may continue to the exit point, depending on whether it is
+interrupted by **softirq**.
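+
+A sketch of such a module follows. The function names and log messages
+are taken from this section; the delays are assumptions, and the small
+`#ifndef` shim is there because `DECLARE_TASKLET_OLD` is assumed to be
+missing on older kernels, where `DECLARE_TASKLET` still took a data
+argument.
+
+```c
+#include <linux/delay.h>
+#include <linux/init.h>
+#include <linux/interrupt.h>
+#include <linux/module.h>
+#include <linux/printk.h>
+
+/* Older kernels only have the three-argument DECLARE_TASKLET. */
+#ifndef DECLARE_TASKLET_OLD
+#define DECLARE_TASKLET_OLD(name, func) DECLARE_TASKLET(name, func, 0L)
+#endif
+
+static void tasklet_fn(unsigned long data)
+{
+    pr_info("Example tasklet starts\n");
+    mdelay(5000); /* busy-wait: tasklets must not sleep */
+    pr_info("Example tasklet ends\n");
+}
+
+static DECLARE_TASKLET_OLD(mytask, tasklet_fn);
+
+static int __init example_tasklet_init(void)
+{
+    pr_info("tasklet example init\n");
+    tasklet_schedule(&mytask);
+    mdelay(200);
+    pr_info("Example tasklet init continues...\n");
+    return 0;
+}
+
+static void __exit example_tasklet_exit(void)
+{
+    pr_info("tasklet example exit\n");
+    tasklet_kill(&mytask); /* wait for a running tasklet to finish */
+}
+
+module_init(example_tasklet_init);
+module_exit(example_tasklet_exit);
+
+MODULE_DESCRIPTION("Tasklet example");
+MODULE_LICENSE("GPL");
+```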
+
+So with this example loaded `dmesg` should show:
+
+    tasklet example init
+    Example tasklet starts
+    Example tasklet init continues...
+    Example tasklet ends
+
+Although tasklets are easy to use, they come with several drawbacks, and
+developers are discussing getting rid of them in the Linux kernel. The
+tasklet callback runs in atomic context, inside a software interrupt,
+meaning that it cannot sleep or access user-space data, so not all work
+can be done in a tasklet handler. Also, the kernel only allows one
+instance of any given tasklet to be running at any given time, although
+multiple different tasklet callbacks can run in parallel.
+
+In recent kernels, tasklets can be replaced by workqueues, timers, or
+threaded interrupts.[1] While the removal of tasklets remains a
+longer-term goal, the current kernel contains more than a hundred uses
+of tasklets. Developers are now proceeding with the API changes, and the
+macro `DECLARE_TASKLET_OLD` exists for compatibility. For further
+information, see .
+
+Work queues
+-----------
+
+To add a task to the scheduler we can use a workqueue. The kernel then
+uses the Completely Fair Scheduler (CFS) to execute work within the
+queue.
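+
+A minimal sketch of queuing one work item might look like this. The
+queue name, the handler and the choice of `WQ_UNBOUND` are assumptions
+made for the illustration.
+
+```c
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/printk.h>
+#include <linux/workqueue.h>
+
+static struct workqueue_struct *queue;
+static struct work_struct work;
+
+static void work_handler(struct work_struct *data)
+{
+    /* Runs in process context, so sleeping would be allowed here. */
+    pr_info("work handler function.\n");
+}
+
+static int __init sched_init(void)
+{
+    queue = alloc_workqueue("sketch_wq", WQ_UNBOUND, 1);
+    if (!queue)
+        return -ENOMEM;
+
+    INIT_WORK(&work, work_handler);
+    queue_work(queue, &work);
+    return 0;
+}
+
+static void __exit sched_exit(void)
+{
+    /* Drains pending work before freeing the queue. */
+    destroy_workqueue(queue);
+}
+
+module_init(sched_init);
+module_exit(sched_exit);
+
+MODULE_DESCRIPTION("Workqueue sketch");
+MODULE_LICENSE("GPL");
+```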
+
+Interrupt Handlers
+==================
+
+Interrupt Handlers
+------------------
+
+Except for the last chapter, everything we did in the kernel so far we
+have done as a response to a process asking for it, either by dealing
+with a special file, sending an |ioctl()|, or issuing a system call. But
+the job of the kernel is not just to respond to process requests.
+Another job, which is every bit as important, is to speak to the
+hardware connected to the machine.
+
+There are two types of interaction between the CPU and the rest of the
+computer’s hardware. The first type is when the CPU gives orders to the
+hardware, the other is when the hardware needs to tell the CPU
+something. The second, called interrupts, is much harder to implement
+because it has to be dealt with when convenient for the hardware, not
+the CPU. Hardware devices typically have a very small amount of RAM, and
+if you do not read their information when available, it is lost.
+
+Under Linux, hardware interrupts are called IRQs (Interrupt ReQuests).
+There are two types of IRQs, short and long. A short IRQ is one which
+is expected to take a very short period of time, during which the rest
+of the machine will be blocked and no other interrupts will be handled.
+A long IRQ is one which can take longer, and during which other
+interrupts may occur (but not interrupts from the same device). If at
+all possible, it is better to declare an interrupt handler to be long.
+
+When the CPU receives an interrupt, it stops whatever it is doing
+(unless it is processing a more important interrupt, in which case it
+will deal with this one only when the more important one is done), saves
+certain parameters on the stack and calls the interrupt handler. This
+means that certain things are not allowed in the interrupt handler
+itself, because the system is in an unknown state. The Linux kernel
+solves the problem by splitting interrupt handling into two parts. The
+first part executes right away and masks the interrupt line. Hardware
+interrupts must be handled quickly, and that is why we need the second
+part to handle the heavy work deferred from an interrupt handler.
+Historically, BH (Linux naming for *Bottom Halves*) statically
+book-kept the deferred functions. **Softirq** and its higher-level
+abstraction, **Tasklet**, have replaced BH since Linux 2.3.
+
+The way to implement this is to call `request_irq()` to get your
+interrupt handler called when the relevant IRQ is received.
+
+In practice IRQ handling can be a bit more complex. Hardware is often
+designed in a way that chains two interrupt controllers, so that all the
+IRQs from interrupt controller B are cascaded to a certain IRQ from
+interrupt controller A. Of course, that requires that the kernel finds
+out which IRQ it really was afterwards, and that adds overhead. Other
+architectures offer some special, very low overhead, so-called "fast
+IRQs" or FIQs. Taking advantage of them requires handlers to be written
+in assembly language, so they do not really fit into the kernel. They
+can be made to work similarly to the others, but after that procedure,
+they are no longer any faster than "common" IRQs. SMP-enabled kernels
+running on systems with more than one processor need to solve another
+truckload of problems. It is not enough to know if a certain IRQ has
+happened, it’s also important to know what CPU(s) it was for. People
+still interested in more details might want to refer to "APIC" now.
+
+`request_irq()` receives the IRQ number, the handler function, flags, a
+name for `/proc/interrupts` and a parameter to be passed to the
+interrupt handler. Usually there is a certain number of IRQs available.
+How many IRQs there are is hardware-dependent.
+
+The flags can be used to specify behaviors of the IRQ. For example, use
+`IRQF_SHARED` to indicate you are willing to share the IRQ with other
+interrupt handlers (usually because a number of hardware devices sit on
+the same IRQ); use `IRQF_ONESHOT` to indicate that the IRQ is not
+reenabled after the handler has finished. It should be noted that in
+some materials, you may encounter another set of IRQ flags named with
+the `SA` prefix, for example `SA_SHIRQ` and `SA_INTERRUPT`. Those are
+the IRQ flags of older kernels. They have been removed completely. Today
+only the `IRQF` flags are in use. `request_irq()` will only succeed if
+there is not already a handler on this IRQ, or if you are both willing
+to share.
+Detecting button presses
+------------------------
+
+Many popular single-board computers, such as Raspberry Pi or
+Beagleboards, have a bunch of GPIO pins. Attaching buttons to those and
+then having a button press do something is a classic case in which you
+might need to use interrupts: instead of having the CPU waste time and
+battery power polling for a change in input state, it is better for the
+input to trigger the CPU to then run a particular handling function.
+
+Here is an example where buttons are connected to GPIO numbers 17 and 18
+and an LED is connected to GPIO 4. You can change those numbers to
+whatever is appropriate for your board.
+
+Bottom Half
+-----------
+
+Suppose you want to do a bunch of stuff inside of an interrupt routine.
+A common way to do that without rendering the interrupt unavailable for
+a significant duration is to combine it with a tasklet. This pushes the
+bulk of the work off into the scheduler.
+
+The example below modifies the previous example to also run an
+additional task when an interrupt is triggered.
+
+Threaded IRQ
+------------
+
+Threaded IRQ is a mechanism to organize both top-half and bottom-half of
+an IRQ at once. A threaded IRQ splits the one handler in `request_irq()`
+into two: one for the top-half, the other for the bottom-half.
+`request_threaded_irq()` is the function for using threaded IRQs. Two
+handlers are registered at once in `request_threaded_irq()`.
+
+Those two handlers run in different contexts. The top-half handler runs
+in interrupt context. It is the equivalent of the handler passed to
+`request_irq()`. The bottom-half handler, on the other hand, runs in its
+own thread. This thread is created on registration of a threaded IRQ.
+Its sole purpose is to run this bottom-half handler. This is where a
+threaded IRQ is “threaded”. If `IRQ_WAKE_THREAD` is returned by the
+top-half handler, that bottom-half serving thread will wake up. The
+thread then runs the bottom-half handler.
+
+Here is an example of how to do the same thing as before, with top and
+bottom halves, but using threads.
+
+A threaded IRQ is registered using `request_threaded_irq()`. This
+function takes only one parameter more than `request_irq()`: the
+bottom-half handling function that runs in its own thread. In this
+example it is `button_bottom_half()`. The other parameters are used in
+the same way as in `request_irq()`.
+
+Presence of both handlers is not mandatory. If either of them is not
+needed, pass `NULL` instead. A `NULL` top-half handler implies that no
+action is taken except to wake up the bottom-half serving thread, which
+runs the bottom-half handler. Similarly, a `NULL` bottom-half handler
+effectively acts as if `request_irq()` were used. In fact, this is how
+`request_irq()` is implemented.
+
+Note that passing `NULL` to both handlers is considered an error and
+will make registration fail.
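+
+Leaving the GPIO details aside, the registration itself can be sketched
+as follows. The `irq` module parameter, the handler names and the
+cookie are assumptions; a real driver would check its device in the
+top half and return `IRQ_NONE` when the interrupt is not its own.
+
+```c
+#include <linux/init.h>
+#include <linux/interrupt.h>
+#include <linux/module.h>
+#include <linux/printk.h>
+
+/* Hypothetical parameter: which IRQ line to attach to. */
+static int irq = -1;
+module_param(irq, int, 0444);
+MODULE_PARM_DESC(irq, "IRQ line to attach to");
+
+static int sketch_cookie; /* unique dev_id for the shared line */
+
+static irqreturn_t sketch_top_half(int irq_nr, void *dev_id)
+{
+    /* Interrupt context: do the minimum, then hand over to the thread. */
+    return IRQ_WAKE_THREAD;
+}
+
+static irqreturn_t sketch_bottom_half(int irq_nr, void *dev_id)
+{
+    /* Runs in its own kernel thread, so it is allowed to sleep. */
+    pr_info_ratelimited("bottom half for IRQ %d\n", irq_nr);
+    return IRQ_HANDLED;
+}
+
+static int __init threaded_irq_sketch_init(void)
+{
+    if (irq < 0)
+        return -EINVAL;
+
+    return request_threaded_irq(irq, sketch_top_half, sketch_bottom_half,
+                                IRQF_SHARED, "threaded_irq_sketch",
+                                &sketch_cookie);
+}
+
+static void __exit threaded_irq_sketch_exit(void)
+{
+    free_irq(irq, &sketch_cookie); /* also stops the handler thread */
+}
+
+module_init(threaded_irq_sketch_init);
+module_exit(threaded_irq_sketch_exit);
+
+MODULE_DESCRIPTION("Threaded IRQ sketch");
+MODULE_LICENSE("GPL");
+```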
+
+Virtual Input Device Driver
+===========================
+
+The input device driver is a module that provides a way to communicate
+with an interaction device via events. For example, the keyboard can
+send a press or release event to tell the kernel what we want to do.
+The input device driver allocates a new input structure with
+`input_allocate_device()`, sets up the input bitfields, device id,
+version, etc., and then registers it by calling
+`input_register_device()`.
+
+Here is an example, vinput. It is an API to allow easy development of
+virtual input drivers. The driver needs to export a `vinput_device()`
+that contains the virtual device name and a `vinput_ops` structure that
+describes:
+
+- the init function: `init()`
+
+- the input event injection function: `send()`
+
+- the readback function: `read()`
+
+Then using `vinput_register_device()` and `vinput_unregister_device()`
+will add a new device to, or remove it from, the list of supported
+virtual input devices.
+
+    int init(struct vinput *);
+
+This function is passed a `struct vinput` already initialized with an
+allocated `struct input_dev`. The `init()` function is responsible for
+initializing the capabilities of the input device and registering it.
+
+    int send(struct vinput *, char *, int);
+
+This function will receive a user string to interpret and inject the
+event using the `input_report_XXXX` or `input_event` call. The string is
+already copied from user space.
+
+    int read(struct vinput *, char *, int);
+
+This function is used for debugging and should fill the buffer parameter
+with the last event sent in the virtual input device format. The buffer
+will then be copied to user space.
+
+vinput devices are created and destroyed using sysfs, and event
+injection is done through a `/dev` node. The device name will be used by
+the userland to export a new virtual input device.
+
+The `class_attribute` structure is similar to other attribute types we
+talked about in section 8:
+
+    struct class_attribute {
+        struct attribute attr;
+        ssize_t (*show)(struct class *class, struct class_attribute *attr,
+                        char *buf);
+        ssize_t (*store)(struct class *class, struct class_attribute *attr,
+                         const char *buf, size_t count);
+    };
+
+In `vinput.c`, the macro `CLASS_ATTR_WO(export/unexport)` defined in
+[include/linux/device.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/device.h)
+(in this case, `device.h` is included in
+[include/linux/input.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/input.h))
+will generate the `class_attribute` structures which are named
+`class_attr_export/unexport`. Then, put them into the
+`vinput_class_attrs` array and the macro `ATTRIBUTE_GROUPS(vinput_class)`
+will generate the `struct attribute_group vinput_class_group` that
+should be assigned in `vinput_class`. Finally, call
+`class_register(&vinput_class)` to create attributes in sysfs.
+
+To create a `vinputX` sysfs entry and `/dev` node:
+
+    echo "vkbd" | sudo tee /sys/class/vinput/export
+
+To unexport the device, just echo its id in unexport:
+
+    echo "0" | sudo tee /sys/class/vinput/unexport
+
+Here the virtual keyboard is one example of using vinput. It supports
+all `KEY_MAX` keycodes. The injection format is the `KEY_CODE` as
+defined in
+[include/linux/input.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/input.h).
+A positive value means `KEY_PRESS` while a negative value is a
+`KEY_RELEASE`. The keyboard supports repetition when the key stays
+pressed for too long. The following demonstrates how the simulation
+works.
+
+Simulate a key press on "g" (`KEY_G` = 34):
+
+    echo "+34" | sudo tee /dev/vinput0
+
+Simulate a key release on "g" (`KEY_G` = 34):
+
+    echo "-34" | sudo tee /dev/vinput0
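+
+The signed injection format is easy to decode. The helper below is a
+user-space sketch of that decoding only, not the code of the virtual
+keyboard: `parse_key_event` and `SKETCH_KEY_MAX` are invented names,
+and the limit is assumed to equal the value of `KEY_MAX` in current
+kernel headers.
+
+```c
+#include <errno.h>
+#include <stdlib.h>
+
+/* Assumed to match KEY_MAX in current kernel headers. */
+#define SKETCH_KEY_MAX 0x2ff
+
+/*
+ * Decode a string such as "+34" or "-34".
+ * On success returns 0 and stores the key code and whether it is a
+ * press (1) or a release (0). Returns -EINVAL on malformed input.
+ */
+static int parse_key_event(const char *s, int *code, int *pressed)
+{
+    char *end;
+    long v;
+
+    errno = 0;
+    v = strtol(s, &end, 10); /* accepts an optional leading sign */
+    if (errno || end == s)
+        return -EINVAL;
+    if (v == 0 || v > SKETCH_KEY_MAX || v < -SKETCH_KEY_MAX)
+        return -EINVAL;
+
+    *pressed = v > 0;
+    *code = (int)(v > 0 ? v : -v);
+    return 0;
+}
+```
+
+A trailing newline, as produced by `echo`, is tolerated because parsing
+stops at the first character that is not a digit.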
+
+Standardizing the interfaces: The Device Model
+==============================================
+
+Up to this point we have seen all kinds of modules doing all kinds of
+things, but there was no consistency in their interfaces with the rest
+of the kernel. To impose some consistency such that there is at minimum
+a standardized way to start, suspend and resume, a device model was
+added. An example is shown below, and you can use this as a template to
+add your own suspend, resume or other interface functions.
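+
+A minimal sketch of such a template, written as a platform driver, is
+given here. The driver name and all function names are invented, and
+without a matching platform device registered elsewhere the probe
+function is never called.
+
+```c
+#include <linux/module.h>
+#include <linux/platform_device.h>
+#include <linux/pm.h>
+
+static int sketch_probe(struct platform_device *pdev)
+{
+    /* Called when a device with a matching name shows up. */
+    dev_info(&pdev->dev, "sketch probe\n");
+    return 0;
+}
+
+static int sketch_suspend(struct device *dev)
+{
+    dev_info(dev, "sketch suspend\n");
+    return 0;
+}
+
+static int sketch_resume(struct device *dev)
+{
+    dev_info(dev, "sketch resume\n");
+    return 0;
+}
+
+/* Power management callbacks the core invokes on our behalf. */
+static const struct dev_pm_ops sketch_pm_ops = {
+    .suspend = sketch_suspend,
+    .resume = sketch_resume,
+};
+
+static struct platform_driver sketch_driver = {
+    .driver = {
+        .name = "devicemodel_sketch",
+        .pm = &sketch_pm_ops,
+    },
+    .probe = sketch_probe,
+};
+
+/* Generates the module init and exit functions for the driver. */
+module_platform_driver(sketch_driver);
+
+MODULE_DESCRIPTION("Device model sketch");
+MODULE_LICENSE("GPL");
+```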
+
+Optimizations
+=============
+
+Likely and Unlikely conditions
+------------------------------
+
+Sometimes you might want your code to run as quickly as possible,
+especially if it is handling an interrupt or doing something which might
+cause noticeable latency. If your code contains boolean conditions and
+if you know that the conditions are almost always likely to evaluate as
+either `true` or `false`, then you can allow the compiler to optimize
+for this using the `likely` and `unlikely` macros. For example, when
+allocating memory you are almost always expecting this to succeed.
+
+    bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx);
+    if (unlikely(!bvl)) {
+        mempool_free(bio, bio_pool);
+        bio = NULL;
+        goto out;
+    }
+
+When the `unlikely` macro is used, the compiler alters its machine
+instruction output, so that it continues along the false branch and only
+jumps if the condition is true. That avoids flushing the processor
+pipeline. The opposite happens if you use the `likely` macro.
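+
+The two macros are thin wrappers around a compiler builtin, so the
+effect can be tried in user space with GCC or Clang. In this sketch the
+macro definitions mirror the usual kernel ones, while `checked_div` is
+an invented example function.
+
+```c
+/* Hints for the compiler about the expected truth value of x. */
+#define likely(x)   __builtin_expect(!!(x), 1)
+#define unlikely(x) __builtin_expect(!!(x), 0)
+
+/*
+ * Divide a by b. Returns 0 and stores the quotient on success,
+ * or -1 without touching *out when b is zero.
+ */
+static int checked_div(int a, int b, int *out)
+{
+    /* The error path is rare, so keep it off the straight-line path. */
+    if (unlikely(b == 0))
+        return -1;
+
+    *out = a / b;
+    return 0;
+}
+```
+
+The hint changes only how the generated code is laid out; the result of
+the function is the same with or without it.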
+
+Static keys
+-----------
+
+Static keys allow us to enable or disable kernel code paths based on the
+runtime state of a key. Their APIs have been available since 2010 (most
+architectures are already supported) and use self-modifying code to
+eliminate the overhead of cache and branch prediction. The most typical
+use case of static keys is for performance-sensitive kernel code, such
+as tracepoints, context switching, networking, etc. These hot paths of
+the kernel often contain branches and can be optimized easily using this
+technique. Before we can use static keys in the kernel, we need to make
+sure that gcc supports `asm goto` inline assembly, and the following
+kernel configurations are set:
+
+    CONFIG_JUMP_LABEL=y
+    CONFIG_HAVE_ARCH_JUMP_LABEL=y
+    CONFIG_HAVE_ARCH_JUMP_LABEL_RELATIVE=y
+
+To declare a static key, we need to define a global variable using the
+`DEFINE_STATIC_KEY_FALSE` or `DEFINE_STATIC_KEY_TRUE` macro defined in
+[include/linux/jump_label.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/jump_label.h).
+This macro initializes the key with the given initial value, which is
+either false or true, respectively. For example, to declare a static key
+with an initial value of false, we can use the following code:
+
+    DEFINE_STATIC_KEY_FALSE(fkey);
+
+Once the static key has been declared, we need to add branching code to
+the module that uses the static key. For example, the code includes a
+fastpath, where a no-op instruction will be generated at compile time as
+the key is initialized to false and the branch is unlikely to be taken.
+
+    pr_info("fastpath 1");
+    if (static_branch_unlikely(&fkey))
+        pr_alert("do unlikely thing");
+    pr_info("fastpath 2");
+
+If the key is enabled at runtime by calling
+`static_branch_enable(&fkey)`, the fastpath will be patched with an
+unconditional jump instruction to the slowpath code `pr_alert`, so the
+branch will always be taken until the key is disabled again.
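+
+The calls described so far fit into a very small module. The sketch
+below is not the chardev-based listing: the `enable` module parameter
+and the function names are invented so that the effect of flipping the
+key can be seen in `dmesg`.
+
+```c
+#include <linux/init.h>
+#include <linux/jump_label.h>
+#include <linux/module.h>
+#include <linux/printk.h>
+
+/* Hypothetical parameter: flip the key while loading the module. */
+static bool enable;
+module_param(enable, bool, 0444);
+MODULE_PARM_DESC(enable, "Enable the static key during init");
+
+static DEFINE_STATIC_KEY_FALSE(fkey);
+
+static void run_path(void)
+{
+    pr_info("fastpath 1\n");
+    if (static_branch_unlikely(&fkey)) /* a no-op while fkey is false */
+        pr_alert("do unlikely thing\n");
+    pr_info("fastpath 2\n");
+}
+
+static int __init static_key_sketch_init(void)
+{
+    run_path(); /* key is false: only the fastpath lines appear */
+
+    if (enable) {
+        static_branch_enable(&fkey); /* patches the branch site */
+        run_path();                  /* now the alert appears too */
+    }
+    return 0;
+}
+
+static void __exit static_key_sketch_exit(void)
+{
+    pr_info("static key sketch exit\n");
+}
+
+module_init(static_key_sketch_init);
+module_exit(static_key_sketch_exit);
+
+MODULE_DESCRIPTION("Static key sketch");
+MODULE_LICENSE("GPL");
+```
+
+Loading it with `enable=1` should make the alert line show up between
+the second pair of fastpath messages.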
+
+The following kernel module, derived from `chardev.c`, demonstrates how
+the static key works.
+
+To check the state of the static key, we can use the `/dev/key_state`
+interface.
+
+    cat /dev/key_state
+
+This will display the current state of the key, which is disabled by
+default.
+
+To change the state of the static key, we can perform a write operation
+on the file:
+
+    echo enable > /dev/key_state
+
+This will enable the static key, causing the code path to switch from
+the fastpath to the slowpath.
+
+In some cases, the key is enabled or disabled at initialization and
+never changed. In that case we can declare the static key as read-only,
+which means that it can only be toggled in the module init function. To
+declare a read-only static key, we can use the
+`DEFINE_STATIC_KEY_FALSE_RO` or `DEFINE_STATIC_KEY_TRUE_RO` macro
+instead. Attempts to change the key at runtime will result in a page
+fault. For more information, see [Static
+keys](https://www.kernel.org/doc/Documentation/static-keys.txt).
+
+Common Pitfalls
+===============
+
+Using standard libraries
+------------------------
+
+You cannot do that. In a kernel module, you can only use kernel
+functions, which are the functions you can see in `/proc/kallsyms`.
+
+Disabling interrupts
+--------------------
+
+You might need to do this for a short time and that is OK, but if you do
+not enable them afterwards, your system will be stuck and you will have
+to power it off.
+
+Where To Go From Here?
+======================
+
+For those deeply interested in kernel programming,
+[kernelnewbies.org](https://kernelnewbies.org) and the
+[Documentation](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation)
+subdirectory within the kernel source code are highly recommended.
+Although the latter may not always be straightforward, it serves as a
+valuable initial step for further exploration. Echoing Linus Torvalds’
+perspective, the most effective method to understand the kernel is
+through personal examination of the source code.
+
+Contributions to this guide are welcome, especially if there are any
+significant inaccuracies identified. To contribute or report an issue,
+please initiate an issue at . Pull
+requests are greatly appreciated.
+
+Happy hacking!
+
+[1] The goal of threaded interrupts is to push more of the work to
+separate threads, so that the minimum needed for acknowledging an
+interrupt is reduced, and therefore the time spent handling the
+interrupt (where it can’t handle any other interrupts at the same time)
+is reduced. See .