![image](assets/cover-with-names.png)

Introduction
============

The Linux Kernel Module Programming Guide is a free book; you may
reproduce and/or modify it under the terms of the [Open Software
License](https://opensource.org/licenses/OSL-3.0), version 3.0.

This book is distributed in the hope that it would be useful, but
without any warranty, without even the implied warranty of
merchantability or fitness for a particular purpose.

The author encourages wide distribution of this book for personal or
commercial use, provided the above copyright notice remains intact and
the method adheres to the provisions of the [Open Software
License](https://opensource.org/licenses/OSL-3.0). In summary, you may
copy and distribute this book free of charge or for a profit. No
explicit permission is required from the author for reproduction of this
book in any medium, physical or electronic.

Derivative works and translations of this document must be placed under
the Open Software License, and the original copyright notice must remain
intact. If you have contributed new material to this book, you must make
the material and source code available for your revisions. Please make
revisions and updates available directly to the document maintainer, Jim
Huang <jserv@ccns.ncku.edu.tw>. This will allow for the merging of
updates and provide consistent revisions to the Linux community.

If you publish or distribute this book commercially, donations,
royalties, and/or printed copies are greatly appreciated by the author
and the [Linux Documentation Project](https://tldp.org/) (LDP).
Contributing in this way shows your support for free software and the
LDP. If you have questions or comments, please contact the address
above.

Authorship
----------

The Linux Kernel Module Programming Guide was initially authored by Ori
Pomerantz for Linux v2.2. As the Linux kernel evolved, Ori’s
availability to maintain the document diminished. Consequently, Peter
Jay Salzman assumed the role of maintainer and updated the guide for
Linux v2.4. Similar constraints arose for Peter when tracking
developments in Linux v2.6, leading to Michael Burian joining as a
co-maintainer to bring the guide up to speed with Linux v2.6. Bob
Mottram contributed to the guide by updating examples for Linux v3.8 and
later. Jim Huang then undertook the task of updating the guide for
recent Linux versions (v5.0 and beyond), along with revising the LaTeX
document.

Acknowledgements
----------------

The following people have contributed corrections or good suggestions:

Amit Dhingra, Andy Shevchenko, Arush Sharma, Benno Bielmeier, Bob Lee,
Brad Baker, Che-Chia Chang, Cheng-Shian Yeh, Chih-En Lin, Chih-Hsuan
Yang, Chih-Yu Chen, Ching-Hua (Vivian) Lin, Chin Yik Ming, cvvletter,
Cyril Brulebois, Daniele Paolo Scarpazza, David Porter, demonsome, Dimo
Velev, Ekang Monyet, Ethan Chan, Francois Audeon, Gilad Reti,
heartofrain, Horst Schirmeier, Hsin-Hsiang Peng, Ignacio Martin, I-Hsin
Cheng, Iûnn Kiàn-îng, Jian-Xing Wu, Johan Calle, keytouch, Kohei Otsuka,
Kuan-Wei Chiu, manbing, Marconi Jiang, mengxinayan, Meng-Zong Tsai,
Peter Lin, Roman Lakeev, Sam Erickson, Shao-Tse Hung, Shih-Sheng Yang,
Stacy Prowell, Steven Lung, Tristan Lelong, Tse-Wei Lin, Tucker Polomik,
Tyler Fanelli, VxTeemo, Wei-Hsin Yeh, Wei-Lun Tsai, Xatierlike Lee,
Yen-Yu Chen, Yin-Chiuan Chen, Yi-Wei Lin, Yo-Jung Lin, Yu-Hsiang Tseng,
YYGO.

What Is A Kernel Module?
------------------------

Developing Linux kernel modules requires a solid foundation in the C
programming language and experience writing conventional programs that
run as ordinary processes. Kernel code is far less forgiving: a single
stray pointer can wipe out an entire file system and leave you with no
option but to reboot.

A Linux kernel module is a piece of code that can be loaded into and
unloaded from the kernel on demand. Modules extend the kernel's
capabilities without requiring a reboot. A typical example is a device
driver, which lets the kernel interact with hardware attached to the
system. Without modules, every new feature would have to be built
directly into a monolithic kernel image. That approach leads to larger
kernels and requires rebuilding the kernel and rebooting the system
whenever new functionality is needed.

Kernel module package
---------------------

Linux distributions provide the commands `modprobe`, `insmod` and
`depmod` within a package.

On Ubuntu/Debian GNU/Linux:

    sudo apt-get install build-essential kmod

On Arch Linux:

    sudo pacman -S gcc kmod

What Modules are in my Kernel?
------------------------------

To discover which modules are already loaded in your current kernel,
use the command `lsmod`.

    sudo lsmod

The list of loaded modules is exposed through the file `/proc/modules`,
so you can also see them with:

    sudo cat /proc/modules

This can be a long list, and you might prefer to search for something
particular. To search for the `fat` module:

    sudo lsmod | grep fat

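The same information can be read programmatically. The following is a minimal user-space sketch, assuming each line of `/proc/modules` begins with the module name, its size in bytes and its use count (for example `fat 86016 1 vfat, Live 0x0000000000000000`). The type `struct module_entry` and the helpers `parse_module_line()` and `find_module()` are hypothetical names introduced for illustration only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed entry of /proc/modules (hypothetical helper type). */
struct module_entry {
    char name[64];
    unsigned long size; /* memory footprint in bytes */
    int refcount;       /* number of current users */
};

/*
 * Parse the first three fields of one /proc/modules line.
 * Returns 0 on success, -1 if the line does not have the assumed layout.
 */
int parse_module_line(const char *line, struct module_entry *out)
{
    if (sscanf(line, "%63s %lu %d", out->name, &out->size,
               &out->refcount) != 3)
        return -1;
    return 0;
}

/*
 * Scan an already opened /proc/modules-style stream for an exact name.
 * Returns 1 and fills *out if found, 0 otherwise.
 */
int find_module(FILE *fp, const char *wanted, struct module_entry *out)
{
    char line[512];

    while (fgets(line, sizeof(line), fp)) {
        if (parse_module_line(line, out) == 0 &&
            strcmp(out->name, wanted) == 0)
            return 1;
    }
    return 0;
}
```

A caller would open `/proc/modules` with `fopen()` and pass the stream to `find_module()`. Unlike `grep fat`, which also matches names such as `vfat`, this sketch compares the whole module name.
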
Is there a need to download and compile the kernel?
---------------------------------------------------

No, you do not need to do so in order to follow this guide. However, it
is prudent to run the examples in a test distribution on a virtual
machine, so that a mistake cannot disrupt your main system.

Before We Begin
---------------

Before diving into code, a few matters need attention. Everyone's
system is different, and everyone has their own way of working. Getting
your first “hello world” module to compile and load correctly can
sometimes be tricky. Rest assured that once you are past this initial
hurdle, the rest goes much more smoothly.

1.  Modversioning. A module compiled for one kernel will not load if a
    different kernel is booted, unless `CONFIG_MODVERSIONS` is enabled
    in the kernel. Module versioning will be discussed later in this
    guide. Until module versioning is covered, the examples in this
    guide may not work correctly if running a kernel with modversioning
    turned on. However, most stock Linux distribution kernels come with
    modversioning enabled. If difficulties arise when loading the
    modules due to versioning errors, consider compiling a kernel with
    modversioning turned off.

2.  Using X Window System. It is highly recommended to extract, compile,
    and load all the examples discussed in this guide from a console.
    Working on these tasks within the X Window System is discouraged.

    Modules cannot print directly to the screen the way `printf()` can,
    but they can log information and warnings that are eventually
    displayed on the screen, specifically on a console. If a module is
    loaded from an `xterm`, the information and warnings will be logged,
    but solely within the systemd journal. These logs will not be
    visible unless you consult `journalctl`. Refer to Section
    <a href="#sec:helloworld" data-reference-type="ref" data-reference="sec:helloworld">4</a>
    for more information. For instant access to this information, it is
    advisable to perform all tasks from the console.

3.  SecureBoot. Many modern computers ship with UEFI SecureBoot
    enabled, a security standard that ensures the machine boots only
    software trusted by the original equipment manufacturer. Some Linux
    distributions even ship a default kernel configured to support
    SecureBoot. In these cases, the kernel module must be signed with a
    security key.

    Otherwise, an attempt to insert your first “hello world” module
    results in the message “*ERROR: could not insert module*”. If the
    message *Lockdown: insmod: unsigned module loading is restricted;
    see man kernel_lockdown.7* appears in the `dmesg` output, the
    simplest approach is to disable UEFI SecureBoot from the boot menu
    of your PC or laptop, which allows the “hello world” module to be
    inserted. Alternatively, you can go through the more intricate
    procedure of generating keys, installing them on the system, and
    signing the module. However, this process is less appropriate for
    beginners. If you are interested, more detailed steps are described
    on the Debian [SecureBoot](https://wiki.debian.org/SecureBoot) page.

Headers
=======

Before building anything, it is necessary to install the header files
for the kernel.

On Ubuntu/Debian GNU/Linux:

    sudo apt-get update
    apt-cache search linux-headers-`uname -r`

The second command lists the kernel header packages available for the
running kernel. Then install the matching package, for example:

    sudo apt-get install kmod linux-headers-5.4.0-80-generic

On Arch Linux:

    sudo pacman -S linux-headers

On Fedora:

    sudo dnf install kernel-devel kernel-headers

Examples
========

All the examples from this document are available within the `examples`
subdirectory.

If compile errors occur, you may be running a more recent kernel
version than the examples were written for, or you may need to install
the corresponding kernel header files.

Hello World
===========

The Simplest Module
-------------------

Most people learning to program start with some variant of a *hello
world* example. It is unclear what happens to those who break with this
tradition, but it seems prudent to follow it. We will begin with a
series of hello world programs that illustrate various fundamental
aspects of writing a kernel module.

Presented next is the simplest possible module.

Make a test directory:

    mkdir -p ~/develop/kernel/hello-1
    cd ~/develop/kernel/hello-1

Paste this into your favorite editor and save it as `hello-1.c`:

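A minimal `hello-1.c` consistent with the description in this section might look like the following sketch. The message strings and the choice of `pr_info()` are assumptions made for illustration; `init_module()`, `cleanup_module()`, `MODULE_LICENSE()` and the two headers are the interfaces this chapter refers to. The file is meant to be built against the kernel headers with kbuild, not compiled as an ordinary user-space program.

```c
/*
 * hello-1.c - The simplest kernel module (illustrative sketch).
 */
#include <linux/module.h> /* Needed by all modules */
#include <linux/printk.h> /* Needed for pr_info() */

int init_module(void)
{
    pr_info("Hello world 1.\n");

    /* A non 0 return means init_module failed; module can't be loaded. */
    return 0;
}

void cleanup_module(void)
{
    pr_info("Goodbye world 1.\n");
}

MODULE_LICENSE("GPL");
```

The start function runs when the module is inserted and the cleanup function runs just before it is removed, so each message should appear once in the kernel log per load and unload.
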
Now you will need a `Makefile`. If you copy and paste this, change the
indentation to use *tabs*, not spaces.

In the `Makefile`, `$(CURDIR)` is set to the absolute pathname of the
current working directory (after all `-C` options are processed, if
any). See more about `CURDIR` in the [GNU make
manual](https://www.gnu.org/software/make/manual/make.html).

And finally, just run `make` directly.

    make

If there is no `PWD := $(CURDIR)` statement in the Makefile, the build
may not work correctly with `sudo make`, because some environment
variables are controlled by the security policy and cannot be
inherited. The default security policy is `sudoers`. In the `sudoers`
security policy, `env_reset` is enabled by default, which restricts
environment variables. Specifically, path variables are not retained
from the user environment; they are set to default values (for more
information see the
[sudoers manual](https://www.sudo.ws/docs/man/sudoers.man/)). You can
see the environment variable settings with:

    $ sudo -s
    # sudo -V

Here is a simple Makefile as an example to demonstrate the problem
mentioned above.

    all:
    	echo $(PWD)

Then, we can use the `-p` flag to print out the environment variable
values from the Makefile.

    $ make -p | grep PWD
    PWD = /home/ubuntu/temp
    OLDPWD = /home/ubuntu
    	echo $(PWD)

The `PWD` variable won't be inherited with `sudo`.

    $ sudo make -p | grep PWD
    	echo $(PWD)

However, there are three ways to solve this problem.

1.  You can use the `-E` flag to temporarily preserve them.

        $ sudo -E make -p | grep PWD
        PWD = /home/ubuntu/temp
        OLDPWD = /home/ubuntu
        	echo $(PWD)

2.  You can disable `env_reset` by editing `/etc/sudoers` as root with
    `visudo`.

        ## sudoers file.
        ##
        ...
        Defaults env_reset
        ## Change env_reset to !env_reset in previous line to keep all environment variables

    Then execute `env` and `sudo env` individually.

        # disable the env_reset
        echo "user:" > non-env_reset.log; env >> non-env_reset.log
        echo "root:" >> non-env_reset.log; sudo env >> non-env_reset.log
        # enable the env_reset
        echo "user:" > env_reset.log; env >> env_reset.log
        echo "root:" >> env_reset.log; sudo env >> env_reset.log

    You can view and compare these logs to find differences between
    `env_reset` and `!env_reset`.

3.  You can preserve environment variables by appending them to
    `env_keep` in `/etc/sudoers`.

        Defaults env_keep += "PWD"

    After applying the above change, you can check the environment
    variable settings by:

        $ sudo -s
        # sudo -V

If all goes smoothly you should then find that you have a compiled
`hello-1.ko` module. You can find info on it with the command:

    modinfo hello-1.ko

At this point the command:

    sudo lsmod | grep hello

should return nothing. You can try loading your shiny new module with:

    sudo insmod hello-1.ko

The dash character will get converted to an underscore, so when you
again try:

    sudo lsmod | grep hello

you should now see your loaded module. It can be removed again with:

    sudo rmmod hello_1

Notice that the dash was replaced by an underscore.

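The naming rule seen here (`hello-1.ko` is listed as `hello_1`) can be sketched in a few lines of user-space C. The helper `module_name_from_file()` is a hypothetical name used only for this illustration; it mirrors the rule described above and is not the code the kernel or the `kmod` tools actually use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Derive the name a module is listed under from its file name:
 * drop any directory part and a trailing ".ko", then turn '-' into '_'.
 * Returns 0 on success, -1 if the result would not fit into out.
 */
int module_name_from_file(const char *path, char *out, size_t outlen)
{
    const char *base = strrchr(path, '/');
    size_t len;
    size_t i;

    base = base ? base + 1 : path;
    len = strlen(base);
    if (len > 3 && strcmp(base + len - 3, ".ko") == 0)
        len -= 3; /* strip the ".ko" suffix */
    if (len + 1 > outlen)
        return -1;
    for (i = 0; i < len; i++)
        out[i] = (base[i] == '-') ? '_' : base[i];
    out[len] = '\0';
    return 0;
}
```

This is the name to look for in the `lsmod` output and to pass to `rmmod`.
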
To see what just happened in the logs:

    sudo journalctl --since "1 hour ago" | grep kernel

You now know the basics of creating, compiling, installing and removing
modules. Now for more of a description of how this module works.

Kernel modules must have at least two functions: a "start"
(initialization) function called `init_module()` which is called when
the module is `insmod`ed into the kernel, and an "end" (cleanup)
function called `cleanup_module()` which is called just before it is
removed from the kernel. Actually, things have changed starting with
kernel 2.3.13. You can now use whatever name you like for the start and
end functions of a module, and you will learn how to do this in the
"Hello and Goodbye" section. In fact, the new method is the preferred
method. However, many people still use `init_module()` and
`cleanup_module()` for their start and end functions.

Typically, `init_module()` either registers a handler for something
with the kernel, or it replaces one of the kernel functions with its own
code (usually code to do something and then call the original function).
The `cleanup_module()` function is supposed to undo whatever
`init_module()` did, so the module can be unloaded safely.

Lastly, every kernel module needs to include `<linux/module.h>`. We
needed to include `<linux/printk.h>` only for the macro expansion for
the `pr_alert()` log level, which you'll learn about under "Introducing
print macros" below.

1.  A point about coding style. Another thing which may not be
    immediately obvious to anyone getting started with kernel
    programming is that indentation within your code should be using
    **tabs** and **not spaces**. It is one of the coding conventions of
    the kernel. You may not like it, but you'll need to get used to it
    if you ever submit a patch upstream.

2.  <span id="sec:printk"></span>Introducing print macros. In the
    beginning there was `printk`, usually followed by a priority such as
    `KERN_INFO` or `KERN_DEBUG`. More recently this can also be
    expressed in abbreviated form using a set of print macros, such as
    `pr_info` and `pr_debug`. This just saves some mindless keyboard
    bashing and looks a bit neater. They can be found within
    [include/linux/printk.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/printk.h).
    Take time to read through the available priority macros.

3.  About Compiling. Kernel modules need to be compiled a bit
    differently from regular userspace apps. Former kernel versions
    required us to care much about these settings, which are usually
    stored in Makefiles. Although hierarchically organized, many
    redundant settings accumulated in sublevel Makefiles and made them
    large and rather difficult to maintain. Fortunately, there is a new
    way of doing these things, called kbuild, and the build process for
    external loadable modules is now fully integrated into the standard
    kernel build mechanism. To learn more on how to compile modules
    which are not part of the official kernel (such as all the examples
    you will find in this guide), see file
    [Documentation/kbuild/modules.rst](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/kbuild/modules.rst).

    Additional details about Makefiles for kernel modules are available
    in
    [Documentation/kbuild/makefiles.rst](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/kbuild/makefiles.rst).
    Be sure to read this and the related files before starting to hack
    Makefiles. It will probably save you lots of work.

    > Here is another exercise for the reader. See that comment above
    > the return statement in `init_module()`? Change the return value
    > to something negative, recompile and load the module again. What
    > happens?

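To make the relationship between `printk` priorities and the `pr_*` macros concrete, here is a user-space mock, offered as a sketch rather than as the kernel's implementation. It assumes that a priority such as `KERN_INFO` is a short string prefix (an ASCII SOH byte followed by a digit) and that `pr_info(fmt, ...)` simply prepends that prefix before calling `printk`. The function `mock_printk()` and the variables `last_level` and `last_message` are hypothetical stand-ins; the real definitions live in `include/linux/printk.h` and related headers.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* User-space stand-ins for the kernel's priority prefixes. */
#define KERN_SOH   "\001"       /* ASCII Start Of Header */
#define KERN_ALERT KERN_SOH "1"
#define KERN_INFO  KERN_SOH "6"

char last_message[256]; /* text most recently "logged" by the mock */
int last_level = -1;    /* its numeric priority, -1 if none was given */

/* Mock printk(): split off the "\001<digit>" prefix, then format the rest. */
int mock_printk(const char *fmt, ...)
{
    const char *body = fmt;
    va_list ap;
    int n;

    last_level = -1;
    if (body[0] == '\001' && body[1] >= '0' && body[1] <= '7') {
        last_level = body[1] - '0';
        body += 2;
    }
    va_start(ap, fmt);
    n = vsnprintf(last_message, sizeof(last_message), body, ap);
    va_end(ap);
    return n;
}

/* The abbreviated forms only prepend the priority to the format string. */
#define pr_alert(fmt, ...) mock_printk(KERN_ALERT fmt, ##__VA_ARGS__)
#define pr_info(fmt, ...)  mock_printk(KERN_INFO fmt, ##__VA_ARGS__)
```

With these definitions, `pr_info("Hello %s\n", "world")` and `mock_printk(KERN_INFO "Hello %s\n", "world")` are the same call, which is all the shorthand amounts to in this sketch. The `##__VA_ARGS__` form is a GNU extension accepted by GCC and Clang.
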
Hello and Goodbye
-----------------

<span id="hello_n_goodbye"></span>In early kernel versions you had to
use the `init_module` and `cleanup_module` functions, as in the first
hello world example, but these days you can name those anything you want
by using the `module_init` and `module_exit` macros. These macros are
defined in
[include/linux/module.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/module.h).
The only requirement is that your init and cleanup functions must be
defined before calling those macros, otherwise you'll get compilation
errors. An example of this technique is `examples/hello-2.c`.

So now we have two real kernel modules under our belt. Adding another
module is as simple as this:

    obj-m += hello-1.o
    obj-m += hello-2.o

    PWD := $(CURDIR)

    all:
    	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

    clean:
    	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

Now have a look at
[drivers/char/Makefile](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/char/Makefile)
for a real world example. As you can see, some things got hardwired into
the kernel (`obj-y`) but where have all those `obj-m` gone? Those
familiar with shell scripts will easily be able to spot them. For those
who are not, the `obj-$(CONFIG_FOO)` entries you see everywhere expand
into `obj-y` or `obj-m`, depending on whether the `CONFIG_FOO` variable
has been set to `y` or `m`. While we are at it, those were exactly the
kind of variables that you have set in the `.config` file in the
top-level directory of the Linux kernel source tree, the last time you
ran `make menuconfig` or something like that.

The \_\_init and \_\_exit Macros
--------------------------------

The `__init` macro causes the init function to be discarded and its
memory freed once the init function finishes for built-in drivers, but
not loadable modules. If you think about when the init function is
invoked, this makes perfect sense.

There is also an `__initdata` which works similarly to `__init` but for
init variables rather than functions.

The `__exit` macro causes the omission of the function when the module
is built into the kernel, and like `__init`, has no effect for loadable
modules. Again, if you consider when the cleanup function runs, this
makes complete sense; built-in drivers do not need a cleanup function,
while loadable modules do.

These macros are defined in
[include/linux/init.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/init.h)
and serve to free up kernel memory. When you boot your kernel and see
something like `Freeing unused kernel memory: 236k freed`, this is
precisely what the kernel is freeing.

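The macros described in this section can be combined with `module_init` and `module_exit` as in the following sketch. The function names, the variable `hello3_data` and the message strings are assumptions chosen for illustration. Like the other module sources in this guide, it has to be built against the kernel headers with kbuild.

```c
/*
 * Illustrative sketch of the __init, __initdata and __exit macros.
 */
#include <linux/init.h>   /* Needed for the macros */
#include <linux/module.h> /* Needed by all modules */
#include <linux/printk.h> /* Needed for pr_info() */

/* Only used during initialization, so it can be marked __initdata. */
static int hello3_data __initdata = 3;

static int __init hello_3_init(void)
{
    pr_info("Hello, world %d\n", hello3_data);
    return 0;
}

static void __exit hello_3_exit(void)
{
    pr_info("Goodbye, world 3\n");
}

/* Both functions are defined above, as module_init/module_exit require. */
module_init(hello_3_init);
module_exit(hello_3_exit);

MODULE_LICENSE("GPL");
```

When this code is built as a loadable module the annotations have no effect, as described above; they matter when the same source is compiled into the kernel image.
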
Licensing and Module Documentation
----------------------------------

Honestly, who loads or even cares about proprietary modules? If you do
then you might have seen something like this:

    $ sudo insmod xxxxxx.ko
    loading out-of-tree module taints kernel.
    module license 'unspecified' taints kernel.

You can use a few macros to indicate the license for your module. Some
examples of license strings are "GPL", "GPL v2", "GPL and additional
rights", "Dual BSD/GPL", "Dual MIT/GPL", "Dual MPL/GPL" and
"Proprietary". They are defined within
[include/linux/module.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/module.h).

To declare which license you are using, there is a macro called
`MODULE_LICENSE`. This and a few other macros describing the module are
illustrated in the example below.

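A module that documents itself with these macros might look like the following sketch. The author and description strings match the `modinfo` output shown later in this chapter; the function names and log messages are assumptions made for illustration.

```c
/*
 * Illustrative sketch of the module documentation macros.
 */
#include <linux/init.h>   /* Needed for the macros */
#include <linux/module.h> /* Needed by all modules */
#include <linux/printk.h> /* Needed for pr_info() */

MODULE_LICENSE("GPL");
MODULE_AUTHOR("LKMPG");
MODULE_DESCRIPTION("A sample driver");

static int __init init_hello_4(void)
{
    pr_info("Hello, world 4\n");
    return 0;
}

static void __exit cleanup_hello_4(void)
{
    pr_info("Goodbye, world 4\n");
}

module_init(init_hello_4);
module_exit(cleanup_hello_4);
```

After building, `modinfo` on the resulting `.ko` file reports the strings given to these macros in its `license`, `author` and `description` fields.
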
Passing Command Line Arguments to a Module
------------------------------------------

Modules can take command line arguments, but not with the argc/argv you
might be used to.

To allow arguments to be passed to your module, declare the variables
that will take the values of the command line arguments as global and
then use the `module_param()` macro (defined in
[include/linux/moduleparam.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/moduleparam.h))
to set the mechanism up. At runtime, `insmod` will fill the variables
with any command line arguments that are given, like
`insmod mymodule.ko myvariable=5`. The variable declarations and macros
should be placed at the beginning of the module for clarity. The example
code should clear up my admittedly lousy explanation.

The `module_param()` macro takes 3 arguments: the name of the variable,
its type and permissions for the corresponding file in sysfs. Integer
types can be signed as usual or unsigned. If you’d like to use arrays of
integers or strings see `module_param_array()` and
`module_param_string()`.

    int myint = 3;
    module_param(myint, int, 0);

Arrays are supported too, but things are a bit different now than they
were in the olden days. To keep track of the number of parameters you
need to pass a pointer to a count variable as third parameter. At your
option, you could also ignore the count and pass `NULL` instead. We show
both possibilities here:

    int myintarray[2];
    module_param_array(myintarray, int, NULL, 0); /* not interested in count */

    short myshortarray[4];
    int count;
    module_param_array(myshortarray, short, &count, 0); /* put count into "count" variable */

A good use for this is to have the module variable’s default values set,
like a port or IO address. If the variables contain the default values,
then perform autodetection (explained elsewhere). Otherwise, keep the
current value. This will be made clear later on.

Lastly, there is a macro function, `MODULE_PARM_DESC()`, that is used to
document arguments that the module can take. It takes two parameters: a
variable name and a free form string describing that variable.

It is recommended to experiment with the following commands:

    $ sudo insmod hello-5.ko mystring="bebop" myintarray=-1
    $ sudo dmesg -t | tail -7
    myshort is a short integer: 1
    myint is an integer: 420
    mylong is a long integer: 9999
    mystring is a string: bebop
    myintarray[0] = -1
    myintarray[1] = 420
    got 1 arguments for myintarray.

    $ sudo rmmod hello-5
    $ sudo dmesg -t | tail -1
    Goodbye, world 5

    $ sudo insmod hello-5.ko mystring="supercalifragilisticexpialidocious" myintarray=-1,-1
    $ sudo dmesg -t | tail -7
    myshort is a short integer: 1
    myint is an integer: 420
    mylong is a long integer: 9999
    mystring is a string: supercalifragilisticexpialidocious
    myintarray[0] = -1
    myintarray[1] = -1
    got 2 arguments for myintarray.

    $ sudo rmmod hello-5
    $ sudo dmesg -t | tail -1
    Goodbye, world 5

    $ sudo insmod hello-5.ko mylong=hello
    insmod: ERROR: could not insert module hello-5.ko: Invalid parameters

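A module producing log lines like those in the session above could be written along the following lines. This is a sketch under stated assumptions: the default values are inferred from the output shown, while the default for `mystring`, the sysfs permission values, the description strings and the opening "Hello" banner are made up for illustration.

```c
/*
 * Illustrative sketch: passing command line arguments to a module.
 */
#include <linux/init.h>
#include <linux/kernel.h> /* for ARRAY_SIZE() */
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/printk.h>

MODULE_LICENSE("GPL");

static short myshort = 1;
static int myint = 420;
static long mylong = 9999;
static char *mystring = "blah"; /* assumed default */
static int myintarray[2] = { 420, 420 };
static int arr_argc; /* number of array elements given on the command line */

/* module_param(name, type, permissions of the sysfs file) */
module_param(myshort, short, 0644);
MODULE_PARM_DESC(myshort, "A short integer");
module_param(myint, int, 0644);
MODULE_PARM_DESC(myint, "An integer");
module_param(mylong, long, 0400);
MODULE_PARM_DESC(mylong, "A long integer");
module_param(mystring, charp, 0000);
MODULE_PARM_DESC(mystring, "A character string");

/* The third argument receives the number of elements actually supplied. */
module_param_array(myintarray, int, &arr_argc, 0000);
MODULE_PARM_DESC(myintarray, "An array of integers");

static int __init hello_5_init(void)
{
    int i;

    pr_info("Hello, world 5\n");
    pr_info("myshort is a short integer: %hd\n", myshort);
    pr_info("myint is an integer: %d\n", myint);
    pr_info("mylong is a long integer: %ld\n", mylong);
    pr_info("mystring is a string: %s\n", mystring);

    for (i = 0; i < ARRAY_SIZE(myintarray); i++)
        pr_info("myintarray[%d] = %d\n", i, myintarray[i]);

    pr_info("got %d arguments for myintarray.\n", arr_argc);
    return 0;
}

static void __exit hello_5_exit(void)
{
    pr_info("Goodbye, world 5\n");
}

module_init(hello_5_init);
module_exit(hello_5_exit);
```

Passing a value that does not parse as the declared type, such as `mylong=hello`, is rejected at load time, which is the "Invalid parameters" error shown in the session.
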
Modules Spanning Multiple Files
-------------------------------

Sometimes it makes sense to divide a kernel module between several
source files.

Here is an example of such a kernel module.

The next file:

And finally, the makefile:

This is the complete makefile for all the examples we have seen so far.
The first five lines are nothing special, but for the last example we
will need two lines. First we invent an object name for our combined
module, second we tell `make` what object files are part of that module.

|
Building modules for a precompiled kernel
|
|||
|
-----------------------------------------
|
|||
|
|
|||
|
Obviously, we strongly suggest you to recompile your kernel, so that you
|
|||
|
can enable a number of useful debugging features, such as forced module
|
|||
|
unloading (|MODULE_\*-\*-\*_FORCE_\*-\*-\*_UNLOAD|): when this option is
|
|||
|
enabled, you can force the kernel to unload a module even when it
|
|||
|
believes it is unsafe, via a |sudo rmmod -f module| command. This option
|
|||
|
can save you a lot of time and a number of reboots during the
|
|||
|
development of a module. If you do not want to recompile your kernel
|
|||
|
then you should consider running the examples within a test distribution
|
|||
|
on a virtual machine. If you mess anything up then you can easily reboot
|
|||
|
or restore the virtual machine (VM).
|
|||
|
|
|||
|
There are a number of cases in which you may want to load your module
|
|||
|
into a precompiled running kernel, such as the ones shipped with common
|
|||
|
Linux distributions, or a kernel you have compiled in the past. In
|
|||
|
certain circumstances you could require to compile and insert a module
|
|||
|
into a running kernel which you are not allowed to recompile, or on a
|
|||
|
machine that you prefer not to reboot. If you can’t think of a case that
|
|||
|
will force you to use modules for a precompiled kernel you might want to
|
|||
|
skip this and treat the rest of this chapter as a big footnote.
|
|||
|
|
|||
|
Now, if you just install a kernel source tree, use it to compile your
|
|||
|
kernel module and you try to insert your module into the kernel, in most
|
|||
|
cases you would obtain an error as follows:
|
|||
|
|
|||
|
insmod: ERROR: could not insert module poet.ko: Invalid module format
|
|||
|
|
|||
|
Less cryptic information is logged to the systemd journal:
|
|||
|
|
|||
|
    kernel: poet: disagrees about version of symbol module_layout
|
|||
|
|
|||
|
In other words, your kernel refuses to accept your module because the
version strings (more precisely, the *version magic*, see
[include/linux/vermagic.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/vermagic.h))
do not match. Incidentally, version magic strings are stored in the
module object in the form of a static string, starting with `vermagic:`.
Version data are inserted in your module when it is linked against the
`kernel/module.o` file. To inspect version magics and other strings
stored in a given module, issue the command `modinfo module.ko`:
|
|||
|
|
|||
|
    $ modinfo hello-4.ko
    description:    A sample driver
    author:         LKMPG
    license:        GPL
    srcversion:     B2AA7FBFCC2C39AED665382
    depends:
    retpoline:      Y
    name:           hello_4
    vermagic:       5.4.0-70-generic SMP mod_unload modversions
|
|||
|
|
|||
|
To overcome this problem we could resort to the `--force-vermagic`
option, but this solution is potentially unsafe, and unquestionably
unacceptable in production modules. Consequently, we want to compile our
module in an environment identical to the one in which our precompiled
kernel was built. How to do this is the subject of the remainder of
this chapter.
|
|||
|
|
|||
|
First of all, make sure that a kernel source tree is available, having
exactly the same version as your current kernel. Then, find the
configuration file which was used to compile your precompiled kernel.
Usually, this is available in your current `boot` directory, under a
name like `config-5.14.x`. You may just want to copy it to your kernel
source tree: ``cp /boot/config-`uname -r` .config``.
|
|||
|
|
|||
|
Let’s focus again on the previous error message. A closer look at the
version magic strings shows that, even with two identical configuration
files, the version magic can still differ slightly, and that is enough
to prevent the module from being inserted into the kernel. The
difference, namely the custom string which appears in the module’s
version magic and not in the kernel’s, comes from a change that some
distributions make to the original makefile. So examine your `Makefile`
and make sure that the version information it specifies exactly matches
that of your current kernel. For example, your makefile could start as
follows:
|
|||
|
|
|||
|
    VERSION = 5
    PATCHLEVEL = 14
    SUBLEVEL = 0
    EXTRAVERSION = -rc2
|
|||
|
|
|||
|
In this case, you need to restore the value of symbol **EXTRAVERSION**
to **-rc2**. We suggest keeping a backup copy of the makefile used to
compile your kernel, which is available in
`/lib/modules/5.14.0-rc2/build`. A simple command like the following
should suffice.
|
|||
|
|
|||
|
    cp /lib/modules/`uname -r`/build/Makefile linux-`uname -r`
|
|||
|
|
|||
|
Here ``linux-`uname -r` `` is the Linux kernel source you are attempting
to build.
|
|||
|
|
|||
|
Now, please run `make` to update configuration and version headers and
objects:
|
|||
|
|
|||
|
    $ make
      SYNC    include/config/auto.conf.cmd
      HOSTCC  scripts/basic/fixdep
      HOSTCC  scripts/kconfig/conf.o
      HOSTCC  scripts/kconfig/confdata.o
      HOSTCC  scripts/kconfig/expr.o
      LEX     scripts/kconfig/lexer.lex.c
      YACC    scripts/kconfig/parser.tab.[ch]
      HOSTCC  scripts/kconfig/preprocess.o
      HOSTCC  scripts/kconfig/symbol.o
      HOSTCC  scripts/kconfig/util.o
      HOSTCC  scripts/kconfig/lexer.lex.o
      HOSTCC  scripts/kconfig/parser.tab.o
      HOSTLD  scripts/kconfig/conf
|
|||
|
|
|||
|
If you do not want to actually compile the kernel, you can interrupt
the build process (CTRL-C) just after the SPLIT line, because by then
the files you need are ready. Now you can return to the directory of
your module and compile it: it will be built exactly according to your
current kernel settings, and it will load into the kernel without any
errors.
|
|||
|
|
|||
|
Preliminaries
|
|||
|
=============
|
|||
|
|
|||
|
How modules begin and end
|
|||
|
-------------------------
|
|||
|
|
|||
|
A typical program starts with a `main()` function, executes a series of
instructions, and terminates after completing these instructions. Kernel
modules, however, follow a different pattern. A module always begins
with either the `init_module` function or a function designated by the
`module_init` call. This function acts as the module’s entry point,
informing the kernel of the module’s functionalities and preparing the
kernel to utilize the module’s functions when necessary. After
performing these tasks, the entry function returns, and the module
remains inactive until the kernel requires its code.
|
|||
|
|
|||
|
All modules conclude by invoking either `cleanup_module` or a function
specified through the `module_exit` call. This serves as the module’s
exit function, reversing the actions of the entry function by
unregistering the previously registered functionalities.
|
|||
|
|
|||
|
It is mandatory for every module to have both an entry and an exit
function. While there are multiple methods to define these functions,
the terms “entry function” and “exit function” are generally used.
However, they may occasionally be referred to as `init_module` and
`cleanup_module`, which are understood to mean the same.
|
|||
|
|
|||
|
Functions available to modules
|
|||
|
------------------------------
|
|||
|
|
|||
|
Programmers use functions they do not define all the time. A prime
example of this is `printf()`. These library functions are provided by
the standard C library, libc. Their definitions do not actually enter
your program until the linking stage, which ensures that the code (for
`printf()`, for example) is available, and fixes the call instruction to
point to that code.
|
|||
|
|
|||
|
Kernel modules are different here, too. In the hello world example, you
might have noticed that we used a function, `pr_info()`, but did not
include a standard I/O library. That is because modules are object
files whose symbols get resolved upon running `insmod` or `modprobe`.
The definition for the symbols comes from the kernel itself; the only
external functions you can use are the ones provided by the kernel. If
you’re curious about what symbols have been exported by your kernel,
take a look at `/proc/kallsyms`.
|
|||
|
|
|||
|
One point to keep in mind is the difference between library functions
and system calls. Library functions are higher level, run completely in
user space and provide a more convenient interface for the programmer to
the functions that do the real work: system calls. System calls run in
kernel mode on the user’s behalf and are provided by the kernel itself.
The library function `printf()` may look like a very general printing
function, but all it really does is format the data into strings and
write the string data using the low-level system call `write()`, which
then sends the data to standard output.
|
|||
|
|
|||
|
Would you like to see what system calls are made by `printf()`? It is
easy! Compile the following program:
|
|||
|
|
|||
|
    #include <stdio.h>

    int main(void)
    {
        printf("hello");
        return 0;
    }
|
|||
|
|
|||
|
with `gcc -Wall -o hello hello.c`. Run the executable with
`strace ./hello`. Are you impressed? Every line you see corresponds to a
system call. [strace](https://strace.io/) is a handy program that gives
you details about what system calls a program is making, including which
call is made, what its arguments are and what it returns. It is an
invaluable tool for figuring out things like what files a program is
trying to access. Towards the end, you will see a line which looks like
`write(1, "hello", 5hello)` (the stray `hello` before the closing
parenthesis is the program’s own output, interleaved with the trace).
There it is. The face behind the `printf()` mask. You may not be
familiar with `write`, since most people use library functions for file
I/O (like `fopen`, `fputs`, `fclose`). If that is the case, try looking
at `man 2 write`. The 2nd man section is devoted to system calls (like
`kill()` and `read()`). The 3rd man section is devoted to library calls,
which you would probably be more familiar with (like `cosh()` and
`random()`).
|
|||
|
|
|||
|
You can even write modules to replace the kernel’s system calls, which
we will do shortly. Crackers often make use of this sort of thing for
backdoors or trojans, but you can write your own modules to do more
benign things, like have the kernel write *Tee hee, that tickles!* every
time someone tries to delete a file on your system.
|
|||
|
|
|||
|
User Space vs Kernel Space
|
|||
|
--------------------------
|
|||
|
|
|||
|
The kernel primarily manages access to resources, be it a video card,
|
|||
|
hard drive, or memory. Programs frequently vie for the same resources.
|
|||
|
For instance, as a document is saved, updatedb might commence updating
|
|||
|
the locate database. Sessions in editors like vim and processes like
|
|||
|
updatedb can simultaneously utilize the hard drive. The kernel’s role is
|
|||
|
to maintain order, ensuring that users do not access resources
|
|||
|
indiscriminately.
|
|||
|
|
|||
|
To manage this, CPUs operate in different modes, each offering varying
|
|||
|
levels of system control. The Intel 80386 architecture, for example,
|
|||
|
featured four such modes, known as rings. Unix, however, utilizes only
|
|||
|
two of these rings: the highest ring (ring 0, also known as “supervisor
|
|||
|
mode”, where all actions are permissible) and the lowest ring, referred
|
|||
|
to as “user mode”.
|
|||
|
|
|||
|
Recall the discussion about library functions vs system calls.
|
|||
|
Typically, you use a library function in user mode. The library function
|
|||
|
calls one or more system calls, and these system calls execute on the
|
|||
|
library function’s behalf, but do so in supervisor mode since they are
|
|||
|
part of the kernel itself. Once the system call completes its task, it
|
|||
|
returns and execution gets transferred back to user mode.
|
|||
|
|
|||
|
Name Space
|
|||
|
----------
|
|||
|
|
|||
|
When you write a small C program, you use variables which are convenient
and make sense to the reader. If, on the other hand, you are writing
routines which will be part of a bigger program, any global variables
you have are part of a community of other people’s global variables;
some of the variable names can clash. When a program has lots of global
variables which aren’t meaningful enough to be distinguished, you get
namespace pollution. In large projects, effort must be made to remember
reserved names, and to develop a scheme for giving variables and
symbols unique names.
|
|||
|
|
|||
|
When writing kernel code, even the smallest module will be linked
|
|||
|
against the entire kernel, so this is definitely an issue. The best way
|
|||
|
to deal with this is to declare all your variables as static and to use
|
|||
|
a well-defined prefix for your symbols. By convention, all kernel
|
|||
|
prefixes are lowercase. If you do not want to declare everything as
|
|||
|
static, another option is to declare a symbol table and register it with
|
|||
|
the kernel. We will get to this later.
|
|||
|
|
|||
|
The file `/proc/kallsyms` holds all the symbols that the kernel knows
|
|||
|
about and which are therefore accessible to your modules since they
|
|||
|
share the kernel’s codespace.
|
|||
|
|
|||
|
Code space
|
|||
|
----------
|
|||
|
|
|||
|
Memory management is a very complicated subject and the majority of
O’Reilly’s [Understanding The Linux
Kernel](https://www.oreilly.com/library/view/understanding-the-linux/0596005652/)
exclusively covers memory management! We are not setting out to be
experts on memory management, but we do need to know a couple of facts
to even begin worrying about writing real modules.
|
|||
|
|
|||
|
If you have not thought about what a segfault really means, you may be
surprised to hear that pointers do not actually point to memory
locations. Not real ones, anyway. When a process is created, the kernel
sets aside a portion of real physical memory and hands it to the process
to use for its executing code, variables, stack, heap and other things
which a computer scientist would know about. This memory begins with
0x00000000 and extends up to whatever it needs to be. Since the memory
spaces of any two processes do not overlap, every process that can
access a memory address, say 0xbffff978, would be accessing a different
location in real physical memory! The processes would be accessing an
index named 0xbffff978 which points to some kind of offset into the
region of memory set aside for that particular process. For the most
part, a process like our Hello, World program can’t access the space of
another process, although there are ways which we will talk about later.
|
|||
|
|
|||
|
The kernel has its own space of memory as well. Since a module is code
|
|||
|
which can be dynamically inserted and removed in the kernel (as opposed
|
|||
|
to a semi-autonomous object), it shares the kernel’s codespace rather
|
|||
|
than having its own. Therefore, if your module segfaults, the kernel
|
|||
|
segfaults. And if you start writing over data because of an off-by-one
|
|||
|
error, then you’re trampling on kernel data (or code). This is even
|
|||
|
worse than it sounds, so try your best to be careful.
|
|||
|
|
|||
|
It should be noted that the aforementioned discussion applies to any
|
|||
|
operating system utilizing a monolithic kernel. This concept differs
|
|||
|
slightly from *“building all your modules into the kernel”*, although
|
|||
|
the underlying principle is similar. In contrast, there are
|
|||
|
microkernels, where modules are allocated their own code space. Two
|
|||
|
notable examples of microkernels include the [GNU
|
|||
|
Hurd](https://www.gnu.org/software/hurd/) and the [Zircon
|
|||
|
kernel](https://fuchsia.dev/fuchsia-src/concepts/kernel) of Google’s
|
|||
|
Fuchsia.
|
|||
|
|
|||
|
Device Drivers
|
|||
|
--------------
|
|||
|
|
|||
|
One class of module is the device driver, which provides functionality
|
|||
|
for hardware like a serial port. On Unix, each piece of hardware is
|
|||
|
represented by a file located in `/dev` named a device file which
|
|||
|
provides the means to communicate with the hardware. The device driver
|
|||
|
provides the communication on behalf of a user program. So the es1370.ko
|
|||
|
sound card device driver might connect the `/dev/sound` device file to
|
|||
|
the Ensoniq ES1370 sound card. A userspace program like mp3blaster can
|
|||
|
use `/dev/sound` without ever knowing what kind of sound card is
|
|||
|
installed.
|
|||
|
|
|||
|
Let’s look at some device files. Here are device files which represent
|
|||
|
the first three partitions on the primary master IDE hard drive:
|
|||
|
|
|||
|
    $ ls -l /dev/hda[1-3]
    brw-rw----  1 root  disk  3, 1 Jul  5  2000 /dev/hda1
    brw-rw----  1 root  disk  3, 2 Jul  5  2000 /dev/hda2
    brw-rw----  1 root  disk  3, 3 Jul  5  2000 /dev/hda3
|
|||
|
|
|||
|
Notice the column of numbers separated by a comma. The first number is
|
|||
|
called the device’s major number. The second number is the minor number.
|
|||
|
The major number tells you which driver is used to access the hardware.
|
|||
|
Each driver is assigned a unique major number; all device files with the
|
|||
|
same major number are controlled by the same driver. All the above major
|
|||
|
numbers are 3, because they’re all controlled by the same driver.
|
|||
|
|
|||
|
The minor number is used by the driver to distinguish between the
|
|||
|
various hardware it controls. Returning to the example above, although
|
|||
|
all three devices are handled by the same driver they have unique minor
|
|||
|
numbers because the driver sees them as being different pieces of
|
|||
|
hardware.
|
|||
|
|
|||
|
Devices are divided into two types: character devices and block devices.
|
|||
|
The difference is that block devices have a buffer for requests, so they
|
|||
|
can choose the best order in which to respond to the requests. This is
|
|||
|
important in the case of storage devices, where it is faster to read or
|
|||
|
write sectors which are close to each other, rather than those which are
|
|||
|
further apart. Another difference is that block devices can only accept
|
|||
|
input and return output in blocks (whose size can vary according to the
|
|||
|
device), whereas character devices are allowed to use as many or as few
|
|||
|
bytes as they like. Most devices in the world are character, because
|
|||
|
they don’t need this type of buffering, and they don’t operate with a
|
|||
|
fixed block size. You can tell whether a device file is for a block
|
|||
|
device or a character device by looking at the first character in the
|
|||
|
output of `ls -l`. If it is ‘b’ then it is a block device, and if it is
|
|||
|
‘c’ then it is a character device. The devices you see above are block
|
|||
|
devices. Here are some character devices (the serial ports):
|
|||
|
|
|||
|
    crw-rw----  1 root  dial 4, 64 Feb 18 23:34 /dev/ttyS0
    crw-r-----  1 root  dial 4, 65 Nov 17 10:26 /dev/ttyS1
    crw-rw----  1 root  dial 4, 66 Jul  5  2000 /dev/ttyS2
    crw-rw----  1 root  dial 4, 67 Jul  5  2000 /dev/ttyS3
|
|||
|
|
|||
|
If you want to see which major numbers have been assigned, you can look
at
[Documentation/admin-guide/devices.txt](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/admin-guide/devices.txt).
|
|||
|
|
|||
|
When the system was installed, all of those device files were created by
|
|||
|
the `mknod` command. To create a new char device named `coffee` with
major/minor number 12 and 2, simply do `mknod /dev/coffee c 12 2`. You
|
|||
|
do not have to put your device files into `/dev`, but it is done by
|
|||
|
convention. Linus put his device files in `/dev`, and so should you.
|
|||
|
However, when creating a device file for testing purposes, it is
|
|||
|
probably OK to place it in your working directory where you compile the
|
|||
|
kernel module. Just be sure to put it in the right place when you’re
|
|||
|
done writing the device driver.
|
|||
|
|
|||
|
A few final points, although implicit in the previous discussion, are
|
|||
|
worth stating explicitly for clarity. When a device file is accessed,
|
|||
|
the kernel utilizes the file’s major number to identify the appropriate
|
|||
|
driver for handling the access. This indicates that the kernel does not
|
|||
|
necessarily rely on or need to be aware of the minor number. It is the
|
|||
|
driver that concerns itself with the minor number, using it to
|
|||
|
differentiate between various pieces of hardware.
|
|||
|
|
|||
|
It is important to note that when referring to *“hardware”*, the term is
|
|||
|
used in a slightly more abstract sense than just a physical PCI card
|
|||
|
that can be held in hand. Consider the following two device files:
|
|||
|
|
|||
|
    $ ls -l /dev/sda /dev/sdb
    brw-rw---- 1 root disk 8,  0 Jan  3 09:02 /dev/sda
    brw-rw---- 1 root disk 8, 16 Jan  3 09:02 /dev/sdb
|
|||
|
|
|||
|
By now you can look at these two device files and know instantly that
|
|||
|
they are block devices and are handled by the same driver (block major 8).
|
|||
|
Sometimes two device files with the same major but different minor
|
|||
|
number can actually represent the same piece of physical hardware. So
|
|||
|
just be aware that the word “hardware” in our discussion can mean
|
|||
|
something very abstract.
|
|||
|
|
|||
|
Character Device drivers
|
|||
|
========================
|
|||
|
|
|||
|
The file_operations Structure
-----------------------------
|
|||
|
|
|||
|
The `file_operations` structure is defined in
[include/linux/fs.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/fs.h),
and holds pointers to functions defined by the driver that perform
various operations on the device. Each field of the structure
corresponds to the address of some function defined by the driver to
handle a requested operation.
|
|||
|
|
|||
|
For example, every character driver needs to define a function that
reads from the device. The `file_operations` structure holds the
address of the module’s function that performs that operation. Here is
what the definition looks like for kernel 5.4:
|
|||
|
|
|||
|
    struct file_operations {
        struct module *owner;
        loff_t (*llseek) (struct file *, loff_t, int);
        ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
        ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
        ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
        ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
        int (*iopoll)(struct kiocb *kiocb, bool spin);
        int (*iterate) (struct file *, struct dir_context *);
        int (*iterate_shared) (struct file *, struct dir_context *);
        __poll_t (*poll) (struct file *, struct poll_table_struct *);
        long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
        long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
        int (*mmap) (struct file *, struct vm_area_struct *);
        unsigned long mmap_supported_flags;
        int (*open) (struct inode *, struct file *);
        int (*flush) (struct file *, fl_owner_t id);
        int (*release) (struct inode *, struct file *);
        int (*fsync) (struct file *, loff_t, loff_t, int datasync);
        int (*fasync) (int, struct file *, int);
        int (*lock) (struct file *, int, struct file_lock *);
        ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
        unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
        int (*check_flags)(int);
        int (*flock) (struct file *, int, struct file_lock *);
        ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
        ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
        int (*setlease)(struct file *, long, struct file_lock **, void **);
        long (*fallocate)(struct file *file, int mode, loff_t offset, loff_t len);
        void (*show_fdinfo)(struct seq_file *m, struct file *f);
        ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
        loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, loff_t len, unsigned int remap_flags);
        int (*fadvise)(struct file *, loff_t, loff_t, int);
    } __randomize_layout;
|
|||
|
|
|||
|
Some operations are not implemented by a driver. For example, a driver
that handles a video card will not need to read from a directory
structure. The corresponding entries in the `file_operations` structure
should be set to `NULL`.
|
|||
|
|
|||
|
There is a gcc extension that makes assigning to this structure more
convenient. You will see it in modern drivers, and it may catch you by
surprise. This is what the new way of assigning to the structure looks
like:
|
|||
|
|
|||
|
    struct file_operations fops = {
        read: device_read,
        write: device_write,
        open: device_open,
        release: device_release
    };
|
|||
|
|
|||
|
However, there is also a C99 way of assigning to elements of a
structure, [designated
initializers](https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html),
and this is definitely preferred over using the GNU extension. You
should use this syntax in case someone wants to port your driver, as it
helps with compatibility:
|
|||
|
|
|||
|
    struct file_operations fops = {
        .read = device_read,
        .write = device_write,
        .open = device_open,
        .release = device_release
    };
|
|||
|
|
|||
|
The meaning is clear, and you should be aware that any member of the
structure which you do not explicitly assign will be initialized to
`NULL`. This is guaranteed by the C standard, not only by gcc.
|
|||
|
|
|||
|
An instance of `struct file_operations` containing pointers to
functions that are used to implement `read`, `write`, `open`, … system
calls is commonly named `fops`.
|
|||
|
|
|||
|
Since Linux v3.14, the read, write and seek operations are guaranteed
to be thread-safe through an `f_pos`-specific lock, which makes updates
of the file position mutually exclusive. So, we can safely implement
those operations without unnecessary locking.
|
|||
|
|
|||
|
Additionally, since Linux v5.6, the `proc_ops` structure was
introduced to replace the use of the `file_operations`
structure when registering proc handlers. See more information in the
|
|||
|
<a href="#sec:proc_*-*-*_ops" data-reference-type="ref" data-reference="sec:proc_*-*-*_ops">7.1</a>
|
|||
|
section.
|
|||
|
|
|||
|
The file structure
|
|||
|
------------------
|
|||
|
|
|||
|
Each device is represented in the kernel by a file structure, which is
defined in
[include/linux/fs.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/fs.h).
Be aware that a file is a kernel level structure and never appears in a
user space program. It is not the same thing as a `FILE`, which is
defined by glibc and would never appear in a kernel space function.
Also, its name is a bit misleading; it represents an abstract open
‘file’, not a file on a disk, which is represented by a structure named
`inode`.
|
|||
|
|
|||
|
An instance of `struct file` is commonly named `filp`. You’ll also see
it referred to as a struct file object. Resist the temptation.
|
|||
|
|
|||
|
Go ahead and look at the definition of `file`. Most of the entries you
see, like `struct dentry`, are not used by device drivers, and you can
ignore them. This is because drivers do not fill `file` directly; they
only use structures contained in `file` which are created elsewhere.
|
|||
|
|
|||
|
Registering A Device
|
|||
|
--------------------
|
|||
|
|
|||
|
As discussed earlier, char devices are accessed through device files,
|
|||
|
usually located in `/dev`. This is by convention. When writing a driver,
|
|||
|
it is OK to put the device file in your current directory. Just make
|
|||
|
sure you place it in `/dev` for a production driver. The major number
|
|||
|
tells you which driver handles which device file. The minor number is
|
|||
|
used only by the driver itself to differentiate which device it is
|
|||
|
operating on, just in case the driver handles more than one device.
|
|||
|
|
|||
|
Adding a driver to your system means registering it with the kernel.
This is synonymous with assigning it a major number during the module’s
initialization. You do this by using the `register_chrdev` function,
defined by
[include/linux/fs.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/fs.h).
|
|||
|
|
|||
|
    int register_chrdev(unsigned int major, const char *name,
                        struct file_operations *fops);
|
|||
|
|
|||
|
Where `unsigned int major` is the major number you want to request,
`const char *name` is the name of the device as it will appear in
`/proc/devices` and `struct file_operations *fops` is a pointer to the
`file_operations` table for your driver. A negative return value means
the registration failed. Note that we didn’t pass the minor number to
`register_chrdev`. That is because the kernel doesn’t care about the
minor number; only our driver uses it.
|
|||
|
|
|||
|
Now the question is, how do you get a major number without hijacking one
that’s already in use? The easiest way would be to look through
[Documentation/admin-guide/devices.txt](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/admin-guide/devices.txt)
and pick an unused one. That is a bad way of doing things because you
will never be sure whether the number you picked will be assigned later.
The answer is that you can ask the kernel to assign you a dynamic major
number.
|
|||
|
|
|||
|
If you pass a major number of 0 to `register_chrdev`, the return value
will be the dynamically allocated major number. The downside is that you
cannot make a device file in advance, since you do not know what the
major number will be. There are a few ways to deal with this. First, the
driver itself can print the newly assigned number and we can make the
device file by hand. Second, the newly registered device will have an
entry in `/proc/devices`, and we can either make the device file by hand
or write a shell script to read the file in and make the device file.
Third, we can have our driver make the device file using the
`device_create` function after a successful registration and
`device_destroy` during the call to `cleanup_module`.
|
|||
|
|
|||
|
However, `register_chrdev()` reserves a whole range of minor numbers
(256 of them, minors 0 to 255) for the given major. The recommended way
to reduce this waste when registering a char device is to use the cdev
interface.
|
|||
|
|
|||
|
The newer interface completes the char device registration in two
distinct steps. First, we should register a range of device numbers,
which can be completed with `register_chrdev_region` or
`alloc_chrdev_region`.
|
|||
|
|
|||
|
    int register_chrdev_region(dev_t from, unsigned count, const char *name);
    int alloc_chrdev_region(dev_t *dev, unsigned baseminor, unsigned count,
                            const char *name);
|
|||
|
|
|||
|
The choice between two different functions depends on whether you know
|
|||
|
the major numbers for your device. Using
|
|||
|
|register_\*-\*-\*_chrdev_\*-\*-\*_region| if you know the device major
|
|||
|
number and |alloc_\*-\*-\*_chrdev_\*-\*-\*_region| if you would like to
|
|||
|
allocate a dynamically-allocated major number.
|
|||
|
|
|||
|
Second, we should initialize the data structure |struct cdev| for our
|
|||
|
char device and associate it with the device numbers. To initialize the
|
|||
|
|struct cdev|, we can achieve by the similar sequence of the following
|
|||
|
codes.
|
|||
|
|
|||
|
struct cdev \*my_\*-\*-\*_dev = cdev_\*-\*-\*_alloc();
|
|||
|
my_\*-\*-\*_cdev->ops = &my_\*-\*-\*_fops;
|
|||
|
|
|||
|
However, the common usage pattern will embed the |struct cdev| within a
|
|||
|
device-specific structure of your own. In this case, we’ll need
|
|||
|
|cdev_\*-\*-\*_init| for the initialization.
|
|||
|
|
|||
|
void cdev_\*-\*-\*_init(struct cdev \*cdev, const struct
|
|||
|
file_\*-\*-\*_operations \*fops);
|
|||
|
|
|||
|
Once we finish the initialization, we can add the char device to the
|
|||
|
system by using the |cdev_\*-\*-\*_add|.
|
|||
|
|
|||
|
int cdev_\*-\*-\*_add(struct cdev \*p, dev_\*-\*-\*_t dev, unsigned
|
|||
|
count);
|
|||
|
|
|||
|
To find an example using the interface, you can see `ioctl.c` described
|
|||
|
in section
|
|||
|
<a href="#sec:device_*-*-*_files" data-reference-type="ref" data-reference="sec:device_*-*-*_files">9</a>.
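To make "a range of device numbers" concrete, here is a small user-space sketch of how a `dev_t` value combines a major and a minor number. It mirrors the kernel's `MKDEV`, `MAJOR` and `MINOR` macros, which use a 20-bit minor field; the `sketch_` names are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/*
 * User-space stand-ins for the kernel's MKDEV/MAJOR/MINOR macros.
 * The 20-bit minor width mirrors MINORBITS in include/linux/kdev_t.h;
 * every sketch_ name below is an assumption made for this example.
 */
#define SKETCH_MINORBITS 20u
#define SKETCH_MINORMASK ((1u << SKETCH_MINORBITS) - 1u)

typedef uint32_t sketch_dev_t;

/* Combine a major and a minor number into one device number. */
sketch_dev_t sketch_mkdev(uint32_t major, uint32_t minor)
{
    return (major << SKETCH_MINORBITS) | (minor & SKETCH_MINORMASK);
}

/* Extract the major number from a device number. */
uint32_t sketch_major(sketch_dev_t dev)
{
    return dev >> SKETCH_MINORBITS;
}

/* Extract the minor number from a device number. */
uint32_t sketch_minor(sketch_dev_t dev)
{
    return dev & SKETCH_MINORMASK;
}

/* Last device number in a range of `count` minors starting at `first`. */
sketch_dev_t sketch_range_last(sketch_dev_t first, uint32_t count)
{
    return first + count - 1u;
}
```

With this encoding, a region that starts at major 235, minor 0 and has a count of 4 covers minors 0 through 3 of that major, which is the kind of range passed as `from` and `count` to `register_chrdev_region`.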
|
|||
|
|
|||
|
Unregistering A Device
----------------------

We cannot allow the kernel module to be `rmmod`’ed whenever root feels
like it. If the device file is opened by a process and then we remove
the kernel module, using the file would cause a call to the memory
location where the appropriate function (read/write) used to be. If we
are lucky, no other code was loaded there, and we’ll get an ugly error
message. If we are unlucky, another kernel module was loaded into the
same location, which means a jump into the middle of another function
within the kernel. The results of this would be impossible to predict,
but they cannot be very positive.

Normally, when you do not want to allow something, you return an error
code (a negative number) from the function which is supposed to do it.
With `cleanup_module` that’s impossible because it is a void function.
However, there is a counter which keeps track of how many processes are
using your module. You can see its value by looking at the 3rd field in
the output of `cat /proc/modules` or `sudo lsmod`. If this number isn’t
zero, `rmmod` will fail. Note that you do not have to check the counter
within `cleanup_module` because the check will be performed for you by
the system call `sys_delete_module`, declared in
[include/linux/syscalls.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/syscalls.h).
You should not use this counter directly, but there are functions
defined in
[include/linux/module.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/module.h)
which let you increase, decrease and display this counter:

- `try_module_get(THIS_MODULE)`: Increment the reference count of the
  current module.

- `module_put(THIS_MODULE)`: Decrement the reference count of the
  current module.

- `module_refcount(THIS_MODULE)`: Return the value of the reference
  count of the current module.

It is important to keep the counter accurate; if you ever do lose track
of the correct usage count, you will never be able to unload the module;
it’s now reboot time, boys and girls. This is bound to happen to you
sooner or later during a module’s development.
|
|||
|
|
|||
|
chardev.c
---------

The next code sample creates a char driver named `chardev`. You can
check that it has been registered with

    cat /proc/devices

and then dump its device file with `cat` (or open the file with a
program); the driver will put the number of times the device file has
been read from into the file. We do not support writing to the file
(like `echo "hi" > /dev/hello`), but we catch these attempts and tell
the user that the operation is not supported. Don’t worry if you don’t
see what we do with the data we read into the buffer; we don’t do much
with it. We simply read in the data and print a message acknowledging
that we received it.

In a multi-threaded environment, without any protection, concurrent
access to the same memory may lead to race conditions and will not
preserve performance. In a kernel module, this problem may happen when
multiple instances access shared resources. Therefore, a solution is to
enforce exclusive access. We use atomic Compare-And-Swap (CAS) to
maintain the states, `CDEV_NOT_USED` and `CDEV_EXCLUSIVE_OPEN`, to
determine whether the file is currently opened by someone or not. CAS
compares the contents of a memory location with the expected value and,
only if they are the same, modifies the contents of that memory location
to the desired value. See more concurrency details in the
<a href="#sec:synchronization" data-reference-type="ref" data-reference="sec:synchronization">12</a>
section.
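The CAS-based exclusive open described above can be tried out in user space with C11 atomics. This is only a sketch under stated assumptions: `sketch_open` and `sketch_release` are hypothetical names, `atomic_compare_exchange_strong` from `<stdatomic.h>` stands in for the kernel's `atomic_cmpxchg`, and a plain `-1` stands in for a kernel error code.

```c
#include <stdatomic.h>

/* The two states named in the text. */
enum { CDEV_NOT_USED = 0, CDEV_EXCLUSIVE_OPEN = 1 };

/* Shared state: is the device currently opened by someone? */
static atomic_int already_open = CDEV_NOT_USED;

/*
 * Try to take exclusive ownership. The swap to CDEV_EXCLUSIVE_OPEN
 * happens only if the current value is still CDEV_NOT_USED, and the
 * compare and the write are one indivisible step.
 * Returns 0 on success, -1 if the device is already open.
 */
int sketch_open(void)
{
    int expected = CDEV_NOT_USED;

    if (!atomic_compare_exchange_strong(&already_open, &expected,
                                        CDEV_EXCLUSIVE_OPEN))
        return -1;
    return 0;
}

/* Give the device back so that the next open can succeed. */
void sketch_release(void)
{
    atomic_store(&already_open, CDEV_NOT_USED);
}
```

Because the comparison and the update cannot be separated, two callers racing on `sketch_open` cannot both see `CDEV_NOT_USED` and both succeed, which is exactly the property a separate "check, then set" sequence would lack.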
|
|||
|
|
|||
|
Writing Modules for Multiple Kernel Versions
--------------------------------------------

The system calls, which are the major interface the kernel shows to the
processes, generally stay the same across versions. A new system call
may be added, but usually the old ones will behave exactly like they
used to. This is necessary for backward compatibility: a new kernel
version is not supposed to break regular processes. In most cases, the
device files will also remain the same. On the other hand, the internal
interfaces within the kernel can and do change between versions.

There are differences between kernel versions, and if you want to
support multiple kernel versions, you will find yourself having to write
conditional compilation directives. The way to do this is to compare the
macro `LINUX_VERSION_CODE` to the macro `KERNEL_VERSION`. In version
`a.b.c` of the kernel, the value of `LINUX_VERSION_CODE` would be
2<sup>16</sup>*a* + 2<sup>8</sup>*b* + *c*.
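The formula is easy to check with a small user-space sketch. The macro and function names below are hypothetical; in a real module you would include `<linux/version.h>` and compare `LINUX_VERSION_CODE` against `KERNEL_VERSION(a, b, c)` inside `#if` directives. The v5.6 threshold is the one this guide gives later for `proc_ops`.

```c
/*
 * Hypothetical user-space copy of the formula from the text:
 * 2^16 * a + 2^8 * b + c. Kernel code would use KERNEL_VERSION
 * from <linux/version.h> instead.
 */
#define SKETCH_KERNEL_VERSION(a, b, c) (((a) << 16) + ((b) << 8) + (c))

/*
 * Example of a version check: report whether a kernel with the given
 * version code is at least v5.6.0.
 */
int sketch_at_least_5_6(unsigned int version_code)
{
    return version_code >= SKETCH_KERNEL_VERSION(5, 6, 0);
}
```

For instance, version 5.6.0 encodes to 5 × 65536 + 6 × 256 + 0 = 329216, so any version code at or above that value belongs to a v5.6+ kernel.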
|
|||
|
|
|||
|
The /proc File System
=====================

In Linux, there is an additional mechanism for the kernel and kernel
modules to send information to processes: the `/proc` file system.
Originally designed to allow easy access to information about processes
(hence the name), it is now used by every bit of the kernel which has
something interesting to report, such as `/proc/modules` which provides
the list of modules and `/proc/meminfo` which gathers memory usage
statistics.

The method to use the proc file system is very similar to the one used
with device drivers: a structure is created with all the information
needed for the `/proc` file, including pointers to any handler functions
(in our case there is only one, the one called when somebody attempts to
read from the `/proc` file). Then, `init_module` registers the structure
with the kernel and `cleanup_module` unregisters it.

Normal file systems are located on a disk, rather than just in memory
(which is where `/proc` is), and in that case the index-node (inode for
short) number is a pointer to a disk location where the file’s inode is
located. The inode contains information about the file, for example the
file’s permissions, together with a pointer to the disk location or
locations where the file’s data can be found.

Because we don’t get called when the file is opened or closed, there’s
nowhere for us to put `try_module_get` and `module_put` in this module,
and if the file is opened and then the module is removed, there’s no way
to avoid the consequences.

Here is a simple example showing how to use a `/proc` file. This is the
HelloWorld for the `/proc` filesystem. There are three parts: create the
file `/proc/helloworld` in the function `init_module`, return a value
(and a buffer) when the file `/proc/helloworld` is read in the callback
function `procfile_read`, and delete the file `/proc/helloworld` in the
function `cleanup_module`.

The file `/proc/helloworld` is created when the module is loaded, using
the function `proc_create`. The return value is a pointer to
`struct proc_dir_entry`, and it will be used to configure the file
`/proc/helloworld` (for example, the owner of this file). A null return
value means that the creation has failed.

Every time the file `/proc/helloworld` is read, the function
`procfile_read` is called. Two parameters of this function are very
important: the buffer (the second parameter) and the offset (the fourth
one). The content of the buffer will be returned to the application
which read it (for example the `cat` command). The offset is the current
position in the file. If the return value of the function is not zero,
then this function is called again. So be careful with this function: if
it never returns zero, the read function is called endlessly.

    $ cat /proc/helloworld
    HelloWorld!
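The role of the offset and of the zero return value can be illustrated with a user-space simulation. This is a sketch, not the module's code: the `sketch_` names are hypothetical, `memcpy` stands in for the kernel's `copy_to_user`, and the driver loop imitates what a reader such as `cat` does.

```c
#include <string.h>
#include <sys/types.h>

/*
 * User-space stand-in for a /proc read handler. It hands out the whole
 * message once, advances the offset, and returns 0 on the next call to
 * signal end of file.
 */
ssize_t sketch_read(char *buffer, size_t buffer_length, long *offset)
{
    static const char msg[] = "HelloWorld!\n";
    size_t len = sizeof(msg) - 1;

    /* Already delivered, or the caller's buffer is too small: stop. */
    if (*offset >= (long)len || buffer_length < len)
        return 0;

    memcpy(buffer, msg, len);   /* copy_to_user() in a real module */
    *offset += (long)len;
    return (ssize_t)len;
}

/*
 * Imitate a reader: keep calling the handler until it returns 0.
 * Returns the number of calls that were made.
 */
int sketch_read_until_eof(char *out, size_t cap)
{
    long offset = 0;
    size_t total = 0;
    int calls = 0;
    ssize_t n;

    do {
        n = sketch_read(out + total, cap - total, &offset);
        total += (size_t)n;
        calls++;
    } while (n > 0);

    return calls;
}
```

With a large enough buffer the handler is called twice: the first call returns the message, the second returns 0 and ends the loop. If the handler never advanced the offset, the loop would not terminate, which is the endless read the text warns about.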
|
|||
|
|
|||
|
The proc_ops Structure
----------------------

The `proc_ops` structure is defined in
[include/linux/proc_fs.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/proc_fs.h)
in Linux v5.6+. Older kernels used `file_operations` for custom hooks in
the `/proc` file system, but that structure contains members that
`/proc` does not need, and every time VFS expands the `file_operations`
set, the `/proc` code becomes bloated. Besides saving space, the new
structure also saves some operations to improve performance. For
example, a file which never disappears in `/proc` can set `proc_flags`
to `PROC_ENTRY_PERMANENT` to save 2 atomic ops, 1 allocation, and 1 free
per open/read/close sequence.
|
|||
|
|
|||
|
Read and Write a /proc File
---------------------------

We have seen a very simple example for a `/proc` file where we only read
the file `/proc/helloworld`. It is also possible to write to a `/proc`
file. It works the same way as read: a function is called when the
`/proc` file is written. But there is a little difference from read: the
data comes from the user, so you have to import it from user space to
kernel space (with `copy_from_user` or `get_user`).

The reason for `copy_from_user` or `get_user` is that Linux memory (on
Intel architecture, it may be different under some other processors) is
segmented. This means that a pointer, by itself, does not reference a
unique location in memory, only a location in a memory segment, and you
need to know which memory segment it is to be able to use it. There is
one memory segment for the kernel, and one for each of the processes.

The only memory segment accessible to a process is its own, so when
writing regular programs to run as processes, there is no need to worry
about segments. When you write a kernel module, normally you want to
access the kernel memory segment, which is handled automatically by the
system. However, when the content of a memory buffer needs to be passed
between the currently running process and the kernel, the kernel
function receives a pointer to the memory buffer which is in the process
segment. The `put_user` and `get_user` macros allow you to access that
memory. These macros handle only one character; you can handle several
characters with `copy_to_user` and `copy_from_user`. As the buffer (in
the read or write function) is in kernel space, for the write function
you need to import the data because it comes from user space, but not
for the read function because the data is already in kernel space.
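A write handler has to import a caller-supplied number of bytes into a buffer of fixed size, so the length must be clamped before copying. The following user-space sketch shows only that bookkeeping; `sketch_store_input` is a hypothetical name, and `memcpy` stands in for `copy_from_user`, which cannot be called outside the kernel.

```c
#include <string.h>

/*
 * Copy at most cap - 1 bytes from src into dst and NUL-terminate dst.
 * Returns the number of bytes actually stored. In a kernel write
 * handler the memcpy() below would be copy_from_user(), whose failure
 * would also have to be checked.
 */
size_t sketch_store_input(char *dst, size_t cap, const char *src, size_t len)
{
    if (cap == 0)
        return 0;

    /* Never trust the caller's length: clamp it to the buffer size. */
    if (len > cap - 1)
        len = cap - 1;

    memcpy(dst, src, len);
    dst[len] = '\0';
    return len;
}
```

Clamping first means an oversized write is truncated instead of overflowing the kernel-side buffer, and the terminating NUL keeps the stored data safe to print as a string.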
|
|||
|
|
|||
|
Manage /proc file with standard filesystem
------------------------------------------

We have seen how to read and write a `/proc` file with the `/proc`
interface. But it is also possible to manage a `/proc` file with inodes.
The main reason to do so is to use advanced features, such as
permissions.

In Linux, there is a standard mechanism for file system registration.
Since every file system has to have its own functions to handle inode
and file operations, there is a special structure to hold pointers to
all those functions, `struct inode_operations`, which includes a pointer
to `struct proc_ops`.

The difference between file and inode operations is that file operations
deal with the file itself whereas inode operations deal with ways of
referencing the file, such as creating links to it.

In `/proc`, whenever we register a new file, we’re allowed to specify
which `struct inode_operations` will be used to access it. This is the
mechanism we use: a `struct inode_operations` which includes a pointer
to a `struct proc_ops`, which in turn includes pointers to our
`procfs_read` and `procfs_write` functions.

Another interesting point here is the `module_permission` function. This
function is called whenever a process tries to do something with the
`/proc` file, and it can decide whether to allow access or not. Right
now it is only based on the operation and the uid of the current user
(as available in `current`, a pointer to a structure which includes
information on the currently running process), but it could be based on
anything we like, such as what other processes are doing with the same
file, the time of day, or the last input we received.

It is important to note that the standard roles of read and write are
reversed in the kernel. Read functions are used for output, whereas
write functions are used for input. The reason is that read and write
refer to the user’s point of view: if a process reads something from the
kernel, then the kernel needs to output it, and if a process writes
something to the kernel, then the kernel receives it as input.

Still hungry for procfs examples? Well, first of all keep in mind that
there are rumors claiming that procfs is on its way out, so consider
using `sysfs` instead. Consider using this mechanism in case you want to
document something kernel related yourself.
|
|||
|
|
|||
|
Manage /proc file with seq_file
-------------------------------

As we have seen, writing a `/proc` file may be quite “complex”. So to
help people write `/proc` files, there is an API named `seq_file` that
helps format a `/proc` file for output. It is based on a sequence, which
is composed of 3 functions: `start()`, `next()`, and `stop()`. The
`seq_file` API starts a sequence when a user reads the `/proc` file.

A sequence begins with a call to the function `start()`. If the return
is a non-`NULL` value, the function `next()` is called; otherwise, the
`stop()` function is called directly. The `next()` function is an
iterator whose goal is to go through all the data. Each time `next()` is
called, the function `show()` is also called. It writes data values into
the buffer read by the user. The function `next()` is called until it
returns `NULL`. The sequence ends when `next()` returns `NULL`, and then
the function `stop()` is called.

BE CAREFUL: when a sequence is finished, another one starts. That means
that at the end of function `stop()`, the function `start()` is called
again. This loop finishes when the function `start()` returns `NULL`.
You can see a scheme of this in the
Figure <a href="#img:seqfile" data-reference-type="ref" data-reference="img:seqfile">[img:seqfile]</a>.
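The calling order described above can be sketched as a plain user-space loop. This is a simplified model rather than the kernel's implementation: every `sketch_` name is hypothetical, the data set is a fixed array of three integers, and one sequence is assumed to cover all the data, whereas the real `seq_file` code may end a sequence early when its buffer fills up.

```c
#include <stddef.h>

/* Hypothetical data set that the sequence walks through. */
static const int sketch_items[] = { 10, 20, 30 };
#define SKETCH_COUNT (sizeof(sketch_items) / sizeof(sketch_items[0]))

static int shown_sum;    /* what show() has emitted so far */
static int stop_calls;   /* how many times stop() ran */

/* start(): return the element at *pos, or NULL when past the end. */
static const int *sketch_start(size_t *pos)
{
    return *pos < SKETCH_COUNT ? &sketch_items[*pos] : NULL;
}

/* next(): advance the position and return the new element or NULL. */
static const int *sketch_next(size_t *pos)
{
    ++*pos;
    return *pos < SKETCH_COUNT ? &sketch_items[*pos] : NULL;
}

/* show(): "print" one element; here it is just added to a sum. */
static void sketch_show(const int *v)
{
    shown_sum += *v;
}

/* stop(): end of one sequence. */
static void sketch_stop(void)
{
    stop_calls++;
}

/* Drive sequences until start() returns NULL; return what was shown. */
int sketch_read_all(void)
{
    size_t pos = 0;
    const int *v;

    shown_sum = 0;
    stop_calls = 0;
    while ((v = sketch_start(&pos)) != NULL) {
        while (v != NULL) {
            sketch_show(v);
            v = sketch_next(&pos);
        }
        sketch_stop();   /* the sequence is finished */
    }
    sketch_stop();       /* start() returned NULL: stop() runs directly */
    return shown_sum;
}

/* Accessor so that callers can inspect how often stop() ran. */
int sketch_stop_calls(void)
{
    return stop_calls;
}
```

In this model `stop()` runs twice: once when `next()` returns `NULL` and once more after the second `start()` returns `NULL`, which matches the loop described in the text.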
|
|||
|
|
|||
|
The `seq_file` API provides basic functions for `proc_ops`, such as
`seq_read`, `seq_lseek`, and some others, but nothing for writing to the
`/proc` file. Of course, you can still use the same approach as in the
previous example.

If you want more information, you can read these web pages:

- <https://lwn.net/Articles/22355/>

- <https://kernelnewbies.org/Documents/SeqFileHowTo>

You can also read the code of
[fs/seq_file.c](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs/seq_file.c)
in the Linux kernel.
|
|||
|
|
|||
|
sysfs: Interacting with your module
===================================

*sysfs* allows you to interact with the running kernel from userspace by
reading or setting variables inside of modules. This can be useful for
debugging purposes, or just as an interface for applications or scripts.
You can find sysfs directories and files under the `/sys` directory on
your system.

    ls -l /sys

Attributes can be exported for kobjects in the form of regular files in
the filesystem. Sysfs forwards file I/O operations to methods defined
for the attributes, providing a means to read and write kernel
attributes.

An attribute definition is simply:

    struct attribute {
        char *name;
        struct module *owner;
        umode_t mode;
    };

    int sysfs_create_file(struct kobject *kobj, const struct attribute *attr);
    void sysfs_remove_file(struct kobject *kobj, const struct attribute *attr);

For example, the driver model defines `struct device_attribute` like:

    struct device_attribute {
        struct attribute attr;
        ssize_t (*show)(struct device *dev, struct device_attribute *attr,
                        char *buf);
        ssize_t (*store)(struct device *dev, struct device_attribute *attr,
                         const char *buf, size_t count);
    };

    int device_create_file(struct device *, const struct device_attribute *);
    void device_remove_file(struct device *, const struct device_attribute *);

To read or write attributes, a `show()` or `store()` method must be
specified when declaring the attribute. For the common cases,
[include/linux/sysfs.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/sysfs.h)
provides convenience macros (`__ATTR`, `__ATTR_RO`, `__ATTR_WO`, etc.)
to make defining attributes easier and the code more concise and
readable.
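What a `show()`/`store()` pair does with its text buffer can be sketched in user space. This is an illustration under stated assumptions, not the module's code: the `sketch_` functions are hypothetical, they drop the kobject and attribute parameters of the real callbacks, and they use `snprintf` and `sscanf` from the C library on an integer named `myvariable`.

```c
#include <stdio.h>
#include <sys/types.h>

/* The value exposed through the attribute. */
static int myvariable;

/*
 * show(): format the current value as text into buf.
 * Returns the number of characters written (excluding the NUL).
 */
ssize_t sketch_show(char *buf, size_t size)
{
    return snprintf(buf, size, "%d\n", myvariable);
}

/*
 * store(): parse the text the user wrote and update the value.
 * Returns count on success so that the whole write is consumed,
 * or -1 if the input is not a number.
 */
ssize_t sketch_store(const char *buf, size_t count)
{
    int value;

    if (sscanf(buf, "%d", &value) != 1)
        return -1;

    myvariable = value;
    return (ssize_t)count;
}
```

Reading the attribute file corresponds to a call to `show()`, and writing a string such as `32` to it corresponds to a call to `store()`; the text format is the whole interface between user space and the variable.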
|
|||
|
|
|||
|
An example of a hello world module which includes the creation of a
variable accessible via sysfs is given below.

Make and install the module:

    make
    sudo insmod hello-sysfs.ko

Check that it exists:

    sudo lsmod | grep hello_sysfs

What is the current value of `myvariable`?

    sudo cat /sys/kernel/mymodule/myvariable

Set the value of `myvariable` and check that it changed.

    echo "32" | sudo tee /sys/kernel/mymodule/myvariable
    sudo cat /sys/kernel/mymodule/myvariable

Finally, remove the test module:

    sudo rmmod hello_sysfs

In the above case, we use a simple kobject to create a directory under
sysfs, and communicate with its attributes. The `kobject` structure made
its appearance in Linux v2.6.0. It was initially meant as a simple way
of unifying kernel code which manages reference counted objects. After a
bit of mission creep, it is now the glue that holds much of the device
model and its sysfs interface together. For more information about
kobject and sysfs, see
[Documentation/driver-api/driver-model/driver.rst](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/driver-api/driver-model/driver.rst)
and <https://lwn.net/Articles/51437/>.
|
|||
|
|
|||
|
Talking To Device Files
=======================

Device files are supposed to represent physical devices. Most physical
devices are used for output as well as input, so there has to be some
mechanism for device drivers in the kernel to get the output to send to
the device from processes. This is done by opening the device file for
output and writing to it, just like writing to a file. In the following
example, this is implemented by `device_write`.

This is not always enough. Imagine you had a serial port connected to a
modem (even if you have an internal modem, it is still implemented from
the CPU’s perspective as a serial port connected to a modem, so you
don’t have to tax your imagination too hard). The natural thing to do
would be to use the device file to write things to the modem (either
modem commands or data to be sent through the phone line) and read
things from the modem (either responses to commands or the data received
through the phone line). However, this leaves open the question of what
to do when you need to talk to the serial port itself, for example to
configure the rate at which data is sent and received.

The answer in Unix is to use a special function called `ioctl` (short
for Input Output ConTroL). Every device can have its own `ioctl`
commands, which can be read ioctls (to send information from a process
to the kernel), write ioctls (to return information to a process), both
or neither. Notice here the roles of read and write are reversed again,
so in ioctls read is to send information to the kernel and write is to
receive information from the kernel.

The ioctl function is called with three parameters: the file descriptor
of the appropriate device file, the ioctl number, and a parameter, which
is of type long so you can use a cast to pass anything. You will not be
able to pass a structure this way, but you will be able to pass a
pointer to the structure. Here is an example:

You can see there is an argument called `cmd` in the
`test_ioctl_ioctl()` function. It is the ioctl number. The ioctl number
encodes the major device number, the type of the ioctl, the command, and
the type of the parameter. This ioctl number is usually created by a
macro call (`_IO`, `_IOR`, `_IOW` or `_IOWR`, depending on the type) in
a header file. This header file should then be included both by the
programs which will use ioctl (so they can generate the appropriate
ioctls) and by the kernel module (so it can understand them). In the
example below, the header file is `chardev.h` and the program which uses
it is `userspace_ioctl.c`.
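To see what such a macro call packs into the ioctl number, the fields can be taken apart again with the decoding macros from `<linux/ioctl.h>`. This is a user-space sketch that assumes a Linux system with the kernel's user-space headers installed; the magic value `'k'`, the two command names and the `sketch_` helpers are hypothetical and not taken from `chardev.h`.

```c
#include <linux/ioctl.h>

/* Hypothetical magic number and commands, for illustration only. */
#define SKETCH_MAGIC 'k'
#define SKETCH_SET_VALUE _IOW(SKETCH_MAGIC, 0, int)
#define SKETCH_GET_VALUE _IOR(SKETCH_MAGIC, 1, int)

/* The "type" (magic) field that identifies the driver. */
unsigned int sketch_cmd_type(unsigned int cmd)
{
    return _IOC_TYPE(cmd);
}

/* The sequential command number within that driver. */
unsigned int sketch_cmd_nr(unsigned int cmd)
{
    return _IOC_NR(cmd);
}

/* The size of the parameter type given to the macro. */
unsigned int sketch_cmd_size(unsigned int cmd)
{
    return _IOC_SIZE(cmd);
}

/*
 * Direction bits: per the comments in the kernel header, _IOC_WRITE
 * means that user space is writing and the kernel is reading.
 */
int sketch_cmd_user_writes(unsigned int cmd)
{
    return (_IOC_DIR(cmd) & _IOC_WRITE) != 0;
}
```

Because both the program and the module include the same header, both sides compute identical numbers, and the module can dispatch on `cmd` with a `switch` over the command macros.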
|
|||
|
|
|||
|
If you want to use ioctls in your own kernel modules, it is best to
receive an official ioctl assignment, so if you accidentally get
somebody else’s ioctls, or if they get yours, you’ll know something is
wrong. For more information, consult the kernel source tree at
[Documentation/userspace-api/ioctl/ioctl-number.rst](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/userspace-api/ioctl/ioctl-number.rst).

Also, we need to be careful, because concurrent access to shared
resources will lead to race conditions. The solution is to use atomic
Compare-And-Swap (CAS), which we mentioned in section
<a href="#sec:chardev_*-*-*_c" data-reference-type="ref" data-reference="sec:chardev_*-*-*_c">6.5</a>,
to enforce exclusive access.
|
|||
|
|
|||
|
System Calls
============

So far, the only thing we’ve done was to use well defined kernel
mechanisms to register `/proc` files and device handlers. This is fine
if you want to do something the kernel programmers thought you’d want,
such as write a device driver. But what if you want to do something
unusual, to change the behavior of the system in some way? Then, you are
mostly on your own.

Should one choose not to use a virtual machine, kernel programming can
become risky. For example, while writing the code below, the `open()`
system call was inadvertently disrupted. This resulted in an inability
to open any files, run programs, or shut down the system, necessitating
a restart of the virtual machine. Fortunately, no critical files were
lost in this instance. However, if such modifications were made on a
live, mission-critical system, the consequences could be severe. To
mitigate the risk of file loss, even in a test environment, it is
advised to execute `sync` right before using `insmod` and `rmmod`.

Forget about `/proc` files, forget about device files. They are just
minor details. Minutiae in the vast expanse of the universe. The real
process to kernel communication mechanism, the one used by all
processes, is *system calls*. When a process requests a service from the
kernel (such as opening a file, forking to a new process, or requesting
more memory), this is the mechanism used. If you want to change the
behaviour of the kernel in interesting ways, this is the place to do it.
By the way, if you want to see which system calls a program uses, run
`strace <arguments>`.

In general, a process is not supposed to be able to access the kernel.
It cannot access kernel memory and it can’t call kernel functions. The
hardware of the CPU enforces this (that is the reason why it is called
“protected mode” or “page protection”).

System calls are an exception to this general rule. What happens is that
the process fills the registers with the appropriate values and then
calls a special instruction which jumps to a previously defined location
in the kernel (of course, that location is readable by user processes,
but it is not writable by them). On Intel CPUs, this is done by means of
interrupt 0x80. The hardware knows that once you jump to this location,
you are no longer running in restricted user mode, but as the operating
system kernel, and therefore you’re allowed to do whatever you want.

The location in the kernel a process can jump to is called
`system_call`. The procedure at that location checks the system call
number, which tells the kernel what service the process requested. Then,
it looks at the table of system calls (`sys_call_table`) to see the
address of the kernel function to call. Then it calls the function, and
after it returns, does a few system checks and then returns to the
process (or to a different process, if the process time ran out). If you
want to read this code, it is in the source file
`arch/$(architecture)/kernel/entry.S`, after the line
`ENTRY(system_call)`.

So, if we want to change the way a certain system call works, what we
need to do is to write our own function to implement it (usually by
adding a bit of our own code, and then calling the original function)
and then change the pointer at `sys_call_table` to point to our
function. Because we might be removed later and we don’t want to leave
the system in an unstable state, it’s important for `cleanup_module` to
restore the table to its original state.
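The idea of dispatching through a table of function pointers, swapping one entry for a wrapper, and restoring it afterwards can be shown with an ordinary user-space array. This is only a model under stated assumptions: the two-entry `sketch_table` and every `sketch_` function are hypothetical, and nothing here touches the real `sys_call_table`.

```c
#include <stddef.h>

typedef long (*sketch_syscall_fn)(long arg);

/* Two toy "system calls". */
static long sketch_sys_double(long arg) { return arg * 2; }
static long sketch_sys_negate(long arg) { return -arg; }

/* Hypothetical table standing in for sys_call_table. */
static sketch_syscall_fn sketch_table[] = {
    sketch_sys_double,
    sketch_sys_negate,
};
#define SKETCH_NR_CALLS (sizeof(sketch_table) / sizeof(sketch_table[0]))

static sketch_syscall_fn original_call;  /* saved entry, for restoring */
static size_t hooked_nr;                 /* which slot we replaced */
static long hook_hits;                   /* how often the hook ran */

/* Our replacement: a bit of our own code, then the original function. */
static long sketch_hooked(long arg)
{
    hook_hits++;
    return original_call(arg);
}

/* Look up the call number in the table and call through the pointer. */
long sketch_dispatch(size_t nr, long arg)
{
    if (nr >= SKETCH_NR_CALLS)
        return -1;
    return sketch_table[nr](arg);
}

/* What init_module would do: save the old pointer, install ours. */
int sketch_install_hook(size_t nr)
{
    if (nr >= SKETCH_NR_CALLS)
        return -1;
    hooked_nr = nr;
    original_call = sketch_table[nr];
    sketch_table[nr] = sketch_hooked;
    return 0;
}

/* What cleanup_module would do: put the original pointer back. */
void sketch_remove_hook(void)
{
    if (original_call != NULL) {
        sketch_table[hooked_nr] = original_call;
        original_call = NULL;
    }
}

/* Accessor for the number of times the hook was invoked. */
long sketch_hook_hits(void)
{
    return hook_hits;
}
```

Callers see the same results before, during and after the hook is installed; the only difference is that the wrapper runs first while it is in place. Forgetting the restore step would leave the table pointing at code that may no longer exist, which is the unstable state the text warns about.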
|
|||
|
|
|||
|
To modify the content of `sys_call_table`, we need to consider the
control register. A control register is a processor register that
changes or controls the general behavior of the CPU. On the x86
architecture, the `cr0` register has various control flags that modify
the basic operation of the processor. The `WP` flag in `cr0` stands for
write protection. Once the `WP` flag is set, the processor disallows
further write attempts to read-only sections. Therefore, we must disable
the `WP` flag before modifying `sys_call_table`. Since Linux v5.3, the
`write_cr0` function cannot be used for this, because the sensitive
`cr0` bits are pinned for security reasons: otherwise an attacker could
write to the CPU control registers to disable CPU protections such as
write protection. As a result, we have to provide a custom assembly
routine to bypass it.

However, the `sys_call_table` symbol is unexported to prevent misuse.
There are still a few ways to get the symbol: manual symbol lookup and
`kallsyms_lookup_name`. Here we use one or the other, depending on the
kernel version.

*Control-flow integrity* is a technique to prevent an attacker from
redirecting code execution, by making sure that indirect calls go to the
expected addresses and that return addresses are not changed. Since
Linux v5.7, the kernel has carried a series of *control-flow
enforcement* (CET) patches for x86, and some configurations of GCC, like
GCC versions 9 and 10 in Ubuntu Linux, add CET (the `-fcf-protection`
option) to the kernel by default. Using such a GCC to compile the kernel
with retpoline off may result in CET being enabled in the kernel. You
can use the following command to check whether the `-fcf-protection`
option is enabled:
|
|||
|
|
|||
|
    $ gcc -v -Q -O2 --help=target | grep protection
    Using built-in specs.
    COLLECT_GCC=gcc
    COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/9/lto-wrapper
    ...
    gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)
    COLLECT_GCC_OPTIONS='-v' '-Q' '-O2' '--help=target' '-mtune=generic' '-march=x86-64'
    /usr/lib/gcc/x86_64-linux-gnu/9/cc1 -v ... -fcf-protection ...
    GNU C17 (Ubuntu 9.3.0-17ubuntu1~20.04) version 9.3.0 (x86_64-linux-gnu)
    ...
|
|||
|
|
|||
|
But CET should not be enabled in the kernel, because it may break
Kprobes and bpf. Consequently, CET has been disabled since v5.11. To
guarantee that the manual symbol lookup works, we only use it up to
v5.4.
|
|||
|
|
|||
|
Unfortunately, since Linux v5.7 `kallsyms_lookup_name` is also
unexported, so it takes a certain trick to get its address. If
`CONFIG_KPROBES` is enabled, we can retrieve function addresses by means
of Kprobes, which dynamically break into a specific kernel routine.
Kprobes inserts a breakpoint at the entry of a function by replacing the
first bytes of the probed instruction. When a CPU hits the breakpoint,
the registers are stored and control passes to Kprobes. It passes the
addresses of the saved registers and the Kprobe struct to the handler
you defined, then executes it. Kprobes can be registered by symbol name
or by address. When a symbol name is given, the kernel resolves the
address.
|
|||
|
|
|||
|
Otherwise, specify the address of `sys_call_table` from `/proc/kallsyms`
or `/boot/System.map` in the `sym` parameter. The following is sample
usage for `/proc/kallsyms`:
|
|||
|
|
|||
|
    $ sudo grep sys_call_table /proc/kallsyms
    ffffffff82000280 R x32_sys_call_table
    ffffffff820013a0 R sys_call_table
    ffffffff820023e0 R ia32_sys_call_table
    $ sudo insmod syscall-steal.ko sym=0xffffffff820013a0
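
Note that `grep` also matches `x32_sys_call_table` and
`ia32_sys_call_table`, so the right line has to be picked by exact name.
As an illustration only, the sketch below parses one line in the
`<address> <type> <name>` layout shown above and accepts it only on an
exact match; the function name `kallsyms_line_match` is hypothetical and
not part of the book's examples.

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse one "/proc/kallsyms"-style line ("<hex address> <type> <name>").
 * When the symbol name equals `wanted` exactly, store the address in
 * *addr and return 1; otherwise return 0 and leave *addr untouched.
 */
int kallsyms_line_match(const char *line, const char *wanted,
                        unsigned long long *addr)
{
    unsigned long long value;
    char type;
    char name[256];

    if (sscanf(line, "%llx %c %255s", &value, &type, name) != 3)
        return 0;
    if (strcmp(name, wanted) != 0)
        return 0;
    *addr = value;
    return 1;
}
```

Feeding it each line of the listing would yield `0xffffffff820013a0` for
`sys_call_table`, which is the value passed as `sym` above.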
|
|||
|
|
|||
|
When using the address from `/boot/System.map`, be careful about `KASLR`
(Kernel Address Space Layout Randomization). `KASLR` may randomize the
addresses of kernel code and data at every boot, so that the static
addresses listed in `/boot/System.map` are offset by some random amount.
The purpose of `KASLR` is to protect the kernel space from attackers.
Without `KASLR`, an attacker can easily find the target at a fixed
address, and can then use return-oriented programming to execute
malicious code or obtain the target data through a tampered pointer.
`KASLR` mitigates these kinds of attacks because the attacker cannot
immediately know the target address, although a brute-force attack can
still work. If the address of a symbol in `/proc/kallsyms` is different
from the address in `/boot/System.map`, `KASLR` is enabled in the kernel
your system is running.
|
|||
|
|
|||
|
    $ grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
    $ sudo grep sys_call_table /boot/System.map-$(uname -r)
    ffffffff82000300 R sys_call_table
    $ sudo grep sys_call_table /proc/kallsyms
    ffffffff820013a0 R sys_call_table
    # Reboot
    $ sudo grep sys_call_table /boot/System.map-$(uname -r)
    ffffffff82000300 R sys_call_table
    $ sudo grep sys_call_table /proc/kallsyms
    ffffffff86400300 R sys_call_table
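
The displacement between the two files is a simple subtraction. The
sketch below is an illustration with hypothetical helper names
(`kaslr_offset`, `kaslr_apply`). It treats the offset as a single value,
which is a simplification made here for illustration.

```c
#include <stdint.h>

/* Offset between the address seen at run time and the static one. */
uint64_t kaslr_offset(uint64_t runtime_addr, uint64_t static_addr)
{
    return runtime_addr - static_addr;
}

/* Apply a known offset to a static address from System.map. */
uint64_t kaslr_apply(uint64_t static_addr, uint64_t offset)
{
    return static_addr + offset;
}
```

With the values listed after the reboot, `0xffffffff86400300` minus
`0xffffffff82000300` gives an offset of `0x4400000`.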
|
|||
|
|
|||
|
If `KASLR` is enabled, we have to look up the address in
`/proc/kallsyms` again each time we reboot the machine. In order to use
the address from `/boot/System.map`, make sure that `KASLR` is disabled.
You can add `nokaslr` to the kernel command line to disable `KASLR` from
the next boot:
|
|||
|
|
|||
|
    $ grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
    $ sudo perl -i -pe 'm/quiet/ and s//quiet nokaslr/' /etc/default/grub
    $ grep quiet /etc/default/grub
    GRUB_CMDLINE_LINUX_DEFAULT="quiet nokaslr splash"
    $ sudo update-grub
|
|||
|
|
|||
|
For more information, check out the following:
|
|||
|
|
|||
|
- [Cook: Security things in Linux
|
|||
|
v5.3](https://lwn.net/Articles/804849/)
|
|||
|
|
|||
|
- [Unexporting the system call table](https://lwn.net/Articles/12211/)
|
|||
|
|
|||
|
- [Control-flow integrity for the
|
|||
|
kernel](https://lwn.net/Articles/810077/)
|
|||
|
|
|||
|
- [Unexporting
kallsyms\_lookup\_name()](https://lwn.net/Articles/813350/)
|
|||
|
|
|||
|
- [Kernel Probes
|
|||
|
(Kprobes)](https://www.kernel.org/doc/Documentation/kprobes.txt)
|
|||
|
|
|||
|
- [Kernel address space layout
|
|||
|
randomization](https://lwn.net/Articles/569635/)
|
|||
|
|
|||
|
The source code here is an example of such a kernel module. We want to
“spy” on a certain user, and to `pr_info()` a message whenever that user
opens a file. Towards this end, we replace the system call to open a
file with our own function, called `our_sys_openat`. This function
checks the uid (user’s id) of the current process, and if it is equal to
the uid we spy on, it calls `pr_info()` to display the name of the file
to be opened. Then, either way, it calls the original `openat()`
function with the same parameters, to actually open the file.
|
|||
|
|
|||
|
The `init_module` function replaces the appropriate location in
`sys_call_table` and keeps the original pointer in a variable. The
`cleanup_module` function uses that variable to restore everything back
to normal. This approach is dangerous, because of the possibility of two
kernel modules changing the same system call. Imagine we have two kernel
modules, A and B. A’s openat system call will be `A_openat` and B’s will
be `B_openat`. Now, when A is inserted into the kernel, the system call
is replaced with `A_openat`, which will call the original `sys_openat`
when it is done. Next, B is inserted into the kernel, which replaces the
system call with `B_openat`, which will call what it thinks is the
original system call, `A_openat`, when it’s done.
|
|||
|
|
|||
|
Now, if B is removed first, everything will be well; it will simply
restore the system call to `A_openat`, which calls the original.
However, if A is removed and then B is removed, the system will crash.
A’s removal will restore the system call to the original, `sys_openat`,
cutting B out of the loop. Then, when B is removed, it will restore the
system call to what it thinks is the original, `A_openat`, which is no
longer in memory. At first glance, it appears we could solve this
particular problem by checking if the system call is equal to our open
function and if so not changing it at all (so that B won’t change the
system call when it is removed), but that will cause an even worse
problem. When A is removed, it sees that the system call was changed to
`B_openat` so that it is no longer pointing to `A_openat`, so it will
not restore it to `sys_openat` before it is removed from memory.
Unfortunately, `B_openat` will still try to call `A_openat` which is no
longer there, so that even without removing B the system would crash.
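
The ordering problem can be reproduced with ordinary function pointers.
The following is a userspace model only: `table_slot` stands in for one
entry of the system call table, and every name in it is hypothetical. In
userspace the stale pointer still refers to valid code, whereas in the
kernel it would point into the freed memory of an unloaded module.

```c
typedef int (*openat_fn)(const char *path);

/* Stand-in for the openat entry of the system call table. */
openat_fn table_slot;

/* Stand-in for the original system call. */
int sys_openat(const char *path) { (void)path; return 0; }

/* Each "module" remembers what it found in the slot when loaded. */
openat_fn a_saved, b_saved;

int a_openat(const char *path) { return a_saved(path); }
int b_openat(const char *path) { return b_saved(path); }

void load_a(void)   { a_saved = table_slot; table_slot = a_openat; }
void unload_a(void) { table_slot = a_saved; }
void load_b(void)   { b_saved = table_slot; table_slot = b_openat; }
void unload_b(void) { table_slot = b_saved; }

/* Load A then B, unload in the chosen order, return the final entry. */
openat_fn final_slot_after(int unload_a_first)
{
    table_slot = sys_openat;
    load_a();
    load_b();
    if (unload_a_first) {
        unload_a();   /* slot <- sys_openat, B is cut out */
        unload_b();   /* slot <- a_openat, which is now stale */
    } else {
        unload_b();   /* slot <- a_openat */
        unload_a();   /* slot <- sys_openat */
    }
    return table_slot;
}
```

Unloading B first leaves the slot at `sys_openat`; unloading A first
leaves it pointing at `a_openat` although A is gone.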
|
|||
|
|
|||
|
For x86 architecture, the system call table cannot be used to invoke a
|
|||
|
system call after commit
|
|||
|
[1e3ad78](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1e3ad78334a69b36e107232e337f9d693dcc9df2)
|
|||
|
since v6.9. This commit has been backported to long term stable kernels,
such as v5.15.154+, v6.1.85+, v6.6.26+ and v6.8.5+; see this
[answer](https://stackoverflow.com/a/78607015) for more details. In this
case, thanks to Kprobes, a hook on the system call entry can be used
instead to intercept the system call.
|
|||
|
|
|||
|
Note that all the related problems make syscall stealing infeasible for
production use. In order to keep people from doing potentially harmful
things, `sys_call_table` is no longer exported. This means that if you
want to do something more than a mere dry run of this example, you will
have to patch your current kernel in order to have `sys_call_table`
exported.
|
|||
|
|
|||
|
Blocking Processes and threads
|
|||
|
==============================
|
|||
|
|
|||
|
Sleep
|
|||
|
-----
|
|||
|
|
|||
|
What do you do when somebody asks you for something you cannot do right
away? If you are a human being and you are bothered by a human being,
the only thing you can say is: "*Not right now, I’m busy. Go away!*".
But if you are a kernel module and you are bothered by a process, you
have another possibility. You can put the process to sleep until you can
service it. After all, processes are being put to sleep by the kernel
and woken up all the time (that is the way multiple processes appear to
run at the same time on a single CPU).
|
|||
|
|
|||
|
This kernel module is an example of this. The file (called
`/proc/sleep`) can only be opened by a single process at a time. If the
file is already open, the kernel module calls
`wait_event_interruptible`. The easiest way to keep a file open is to
open it with:
|
|||
|
|
|||
|
tail -f
|
|||
|
|
|||
|
`wait_event_interruptible` changes the status of the task (a task is the
kernel data structure which holds information about a process and the
system call it is in, if any) to `TASK_INTERRUPTIBLE`, which means that
the task will not run until it is woken up somehow, and adds it to
WaitQ, the queue of tasks waiting to access the file. Then, the function
calls the scheduler to context switch to a different process, one which
has some use for the CPU.
|
|||
|
|
|||
|
When a process is done with the file, it closes it, and `module_close`
is called. That function wakes up all the processes in the queue
(there’s no mechanism to only wake up one of them). It then returns and
the process which just closed the file can continue to run. In time, the
scheduler decides that that process has had enough and gives control of
the CPU to another process. Eventually, one of the processes which was
in the queue will be given control of the CPU by the scheduler. It
starts at the point right after the call to `wait_event_interruptible`.
|
|||
|
|
|||
|
This means that the process is still in kernel mode - as far as the
|
|||
|
process is concerned, it issued the open system call and the system call
|
|||
|
has not returned yet. The process does not know somebody else used the
|
|||
|
CPU for most of the time between the moment it issued the call and the
|
|||
|
moment it returned.
|
|||
|
|
|||
|
It can then proceed to set a global variable to tell all the other
|
|||
|
processes that the file is still open and go on with its life. When the
|
|||
|
other processes get a piece of the CPU, they’ll see that global variable
|
|||
|
and go back to sleep.
|
|||
|
|
|||
|
So we will use `tail -f` to keep the file open in the background, while
trying to access it with another process (again in the background, so
that we need not switch to a different vt). As soon as the first
background process is killed with `kill %1`, the second is woken up, is
able to access the file and finally terminates.
|
|||
|
|
|||
|
To make our life more interesting, `module_close` does not have a
monopoly on waking up the processes which wait to access the file. A
signal, such as *Ctrl+c* (**SIGINT**) can also wake up a process. This
is because we used `wait_event_interruptible`. We could have used
`wait_event` instead, but that would have resulted in extremely angry
users whose *Ctrl+c*’s are ignored.
|
|||
|
|
|||
|
In that case, we want to return with `-EINTR` immediately. This is
|
|||
|
important so users can, for example, kill the process before it receives
|
|||
|
the file.
|
|||
|
|
|||
|
There is one more point to remember. Sometimes processes don’t want to
sleep; they want either to get what they want immediately, or to be told
it cannot be done. Such processes use the `O_NONBLOCK` flag when opening
the file. The kernel is supposed to respond by returning with the error
code `-EAGAIN` from operations which would otherwise block, such as
opening the file in this example. The program `cat_nonblock`, available
in the `examples/other` directory, can be used to open a file with
`O_NONBLOCK`.
|
|||
|
|
|||
|
    $ sudo insmod sleep.ko
    $ cat_nonblock /proc/sleep
    Last input:
    $ tail -f /proc/sleep &
    Last input:
    Last input:
    Last input:
    Last input:
    Last input:
    Last input:
    Last input:
    tail: /proc/sleep: file truncated
    [1] 6540
    $ cat_nonblock /proc/sleep
    Open would block
    $ kill %1
    [1]+ Terminated tail -f /proc/sleep
    $ cat_nonblock /proc/sleep
    Last input:
    $
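
On the userspace side, a non-blocking open comes down to passing
`O_NONBLOCK` and checking `errno` for `EAGAIN`. The following is a
minimal sketch of that pattern, not the source of `cat_nonblock`; the
function name `open_nonblock` is hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

/*
 * Try to open `path` read-only without blocking.
 * Returns a file descriptor, or -1 after reporting why the open failed.
 */
int open_nonblock(const char *path)
{
    int fd = open(path, O_RDONLY | O_NONBLOCK);

    if (fd == -1) {
        if (errno == EAGAIN)
            puts("Open would block");  /* the kernel returned -EAGAIN */
        else
            perror("open");
    }
    return fd;
}
```

A caller that gets `-1` with `EAGAIN` can retry later or give up, instead
of being put to sleep inside the open call.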
|
|||
|
|
|||
|
Completions
|
|||
|
-----------
|
|||
|
|
|||
|
Sometimes one thing should happen before another within a module having
|
|||
|
multiple threads. Rather than using `/bin/sleep` commands, the kernel
|
|||
|
has another way to do this which allows timeouts or interrupts to also
|
|||
|
happen.
|
|||
|
|
|||
|
Completions as a code synchronization mechanism have three main parts:
initialization of the `struct completion` synchronization object, the
waiting or barrier part through `wait_for_completion()`, and the
signalling side through a call to `complete()`.
|
|||
|
|
|||
|
In the subsequent example, two threads are initiated: crank and
flywheel. It is imperative that the crank thread starts before the
flywheel thread. A completion state is established for each of these
threads, with a distinct completion defined for both the crank and
flywheel threads. At the exit point of each thread the respective
completion state is updated, and `wait_for_completion` is used by the
flywheel thread to ensure that it does not begin prematurely. The crank
thread uses the `complete_all()` function to update the completion,
which lets the flywheel thread continue.
|
|||
|
|
|||
|
So even though `flywheel_thread` is started first, you should notice
when you load this module and run `dmesg` that turning the crank always
happens first, because the flywheel thread waits for the crank thread to
complete.
|
|||
|
|
|||
|
There are other variations of the
|
|||
|
`wait_for_completion` function, which include timeouts
|
|||
|
or being interrupted, but this basic mechanism is enough for many common
|
|||
|
situations without adding a lot of complexity.
|
|||
|
|
|||
|
Avoiding Collisions and Deadlocks
|
|||
|
=================================
|
|||
|
|
|||
|
If processes running on different CPUs or in different threads try to
|
|||
|
access the same memory, then it is possible that strange things can
|
|||
|
happen or your system can lock up. To avoid this, various types of
|
|||
|
mutual exclusion kernel functions are available. These indicate if a
|
|||
|
section of code is "locked" or "unlocked" so that simultaneous attempts
|
|||
|
to run it cannot happen.
|
|||
|
|
|||
|
Mutex
|
|||
|
-----
|
|||
|
|
|||
|
You can use kernel mutexes (mutual exclusions) in much the same manner
|
|||
|
that you might deploy them in userland. This may be all that is needed
|
|||
|
to avoid collisions in most cases.
|
|||
|
|
|||
|
Spinlocks
|
|||
|
---------
|
|||
|
|
|||
|
As the name suggests, spinlocks lock up the CPU that the code is running
|
|||
|
on, taking 100% of its resources. Because of this you should only use
|
|||
|
the spinlock mechanism around code which is likely to take no more than
|
|||
|
a few milliseconds to run and so will not noticeably slow anything down
|
|||
|
from the user’s point of view.
|
|||
|
|
|||
|
The example here is "irq safe" in that if interrupts happen during the
lock then they will not be forgotten and will activate when the unlock
happens, using the `flags` variable to retain their state.
|
|||
|
|
|||
|
Taking 100% of a CPU’s resources comes with greater responsibility.
|
|||
|
Situations where the kernel code monopolizes a CPU are called **atomic
|
|||
|
contexts**. Holding a spinlock is one of those situations. Sleeping in
|
|||
|
atomic contexts may leave the system hanging, as the occupied CPU
|
|||
|
devotes 100% of its resources doing nothing but sleeping. In worse cases
the system may crash. Thus, sleeping in atomic contexts is considered a
bug in the kernel. Such bugs are sometimes called
“sleep-in-atomic-context” in some materials.
|
|||
|
|
|||
|
Note that sleeping here is not limited to calling the sleep functions
|
|||
|
explicitly. If subsequent function calls eventually invoke a function
|
|||
|
that sleeps, it is also considered sleeping. Thus, it is important to
|
|||
|
pay attention to functions being used in atomic context. There’s no
|
|||
|
documentation recording all such functions, but code comments may help.
|
|||
|
Sometimes you may find comments in kernel source code stating that a
|
|||
|
function “may sleep”, “might sleep”, or more explicitly “the caller
|
|||
|
should not hold a spinlock”. Those comments are hints that a function
|
|||
|
may implicitly sleep and must not be called in atomic contexts.
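
To see why a waiting CPU is fully occupied, it helps to look at what
"spinning" means. The sketch below is a userspace toy built on C11
`atomic_flag`, not the kernel's spinlock implementation, and it does not
deal with interrupts at all; all `toy_` names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace spinlock; not the kernel's spinlock_t. */
typedef struct {
    atomic_flag locked;
} toy_spinlock;

void toy_spin_init(toy_spinlock *l)
{
    atomic_flag_clear(&l->locked);
}

/* Returns true if the lock was free and is now held by the caller. */
bool toy_spin_trylock(toy_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

/* Busy-wait: the caller burns CPU cycles until the lock is released. */
void toy_spin_lock(toy_spinlock *l)
{
    while (!toy_spin_trylock(l))
        ;
}

void toy_spin_unlock(toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

The empty loop in `toy_spin_lock` is the reason critical sections must
stay short: every waiter keeps its CPU busy for as long as the holder
keeps the lock.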
|
|||
|
|
|||
|
Read and write locks
|
|||
|
--------------------
|
|||
|
|
|||
|
Read and write locks are specialised kinds of spinlocks so that you can
|
|||
|
exclusively read from something or write to something. Like the earlier
|
|||
|
spinlocks example, the one below shows an "irq safe" situation in which
|
|||
|
if other functions were triggered from irqs which might also read and
|
|||
|
write to whatever you are concerned with then they would not disrupt the
|
|||
|
logic. As before it is a good idea to keep anything done within the lock
|
|||
|
as short as possible so that it does not hang up the system and cause
|
|||
|
users to start revolting against the tyranny of your module.
|
|||
|
|
|||
|
Of course, if you know for sure that there are no functions triggered by
|
|||
|
irqs which could possibly interfere with your logic then you can use the
|
|||
|
simpler `read_lock(&myrwlock)` and
`read_unlock(&myrwlock)` or the corresponding write functions.
|
|||
|
|
|||
|
Atomic operations
|
|||
|
-----------------
|
|||
|
|
|||
|
If you are doing simple arithmetic: adding, subtracting or bitwise
|
|||
|
operations, then there is another way in the multi-CPU and
|
|||
|
multi-hyperthreaded world to stop other parts of the system from messing
|
|||
|
with your mojo. By using atomic operations you can be confident that
|
|||
|
your addition, subtraction or bit flip did actually happen and was not
|
|||
|
overwritten by some other shenanigans. An example is shown below.
|
|||
|
|
|||
|
Before the C11 standard adopted built-in atomic types, the kernel
already provided a small set of atomic types implemented with tricky
architecture-specific code. Implementing the atomic types with C11
atomics might allow the kernel to throw away the architecture-specific
code and make the kernel code friendlier to people who understand the
standard. But there are some problems, such as the kernel’s memory model
not matching the one defined by the C11 atomics. For further details,
see:
|
|||
|
|
|||
|
- [kernel documentation of atomic
|
|||
|
types](https://www.kernel.org/doc/Documentation/atomic_t.txt)
|
|||
|
|
|||
|
- [Time to move to C11 atomics?](https://lwn.net/Articles/691128/)
|
|||
|
|
|||
|
- [Atomic usage patterns in the
|
|||
|
kernel](https://lwn.net/Articles/698315/)
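
For readers more familiar with the C11 side of that comparison, the
following userspace sketch performs add, subtract and bit operations
with `<stdatomic.h>`. It illustrates the standard's interface only and
is not the kernel's `atomic_t` API; the function names are hypothetical.

```c
#include <stdatomic.h>

/* Add and subtract on a counter; each step is one indivisible update. */
int atomic_counter_demo(void)
{
    atomic_int counter;

    atomic_init(&counter, 50);
    atomic_fetch_add(&counter, 7);  /* 50 -> 57 */
    atomic_fetch_sub(&counter, 2);  /* 57 -> 55 */
    atomic_fetch_add(&counter, 1);  /* 55 -> 56 */
    return atomic_load(&counter);
}

/* Set, clear and toggle individual bits of a word atomically. */
unsigned int atomic_bits_demo(void)
{
    atomic_uint word;

    atomic_init(&word, 0u);
    atomic_fetch_or(&word, 1u << 1);      /* set bit 1:    0x2 */
    atomic_fetch_or(&word, 1u << 2);      /* set bit 2:    0x6 */
    atomic_fetch_and(&word, ~(1u << 1));  /* clear bit 1:  0x4 */
    atomic_fetch_xor(&word, 1u << 3);     /* toggle bit 3: 0xc */
    return atomic_load(&word);
}
```

Each call is a single read-modify-write, so a concurrent update from
another thread cannot be lost between the read and the write.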
|
|||
|
|
|||
|
Replacing Print Macros
|
|||
|
======================
|
|||
|
|
|||
|
Replacement
|
|||
|
-----------
|
|||
|
|
|||
|
In Section
|
|||
|
<a href="#sec:preparation" data-reference-type="ref" data-reference="sec:preparation">1.7</a>,
|
|||
|
it was noted that the X Window System and kernel module programming are
|
|||
|
not conducive to integration. This remains valid during the development
|
|||
|
of kernel modules. However, in practical scenarios, the necessity
|
|||
|
emerges to relay messages to the tty (teletype) originating the module
|
|||
|
load command.
|
|||
|
|
|||
|
The term “tty” originates from *teletype*, which initially referred to a
|
|||
|
combined keyboard-printer for Unix system communication. Today, it
|
|||
|
signifies a text stream abstraction employed by Unix programs,
|
|||
|
encompassing physical terminals, xterms in X displays, and network
|
|||
|
connections like SSH.
|
|||
|
|
|||
|
To achieve this, the “current” pointer is leveraged to access the active
|
|||
|
task’s tty structure. Within this structure lies a pointer to a string
|
|||
|
write function, facilitating the string’s transmission to the tty.
|
|||
|
|
|||
|
Flashing keyboard LEDs
|
|||
|
----------------------
|
|||
|
|
|||
|
In certain conditions, you may desire a simpler and more direct way to
|
|||
|
communicate to the external world. Flashing keyboard LEDs can be such a
|
|||
|
solution: It is an immediate way to attract attention or to display a
|
|||
|
status condition. Keyboard LEDs are present on every hardware, they are
|
|||
|
always visible, they do not need any setup, and their use is rather
|
|||
|
simple and non-intrusive, compared to writing to a tty or a file.
|
|||
|
|
|||
|
From v4.14 to v4.15, the timer API went through a series of changes to
improve memory safety. A buffer overflow in the area of a `timer_list`
structure may be able to overwrite the `function` and `data` fields,
providing the attacker with a way to use return-oriented programming
(ROP) to call arbitrary functions within the kernel. Also, the function
prototype of the callback, which takes an `unsigned long` argument,
prevents any type checking. Furthermore, a function prototype with an
`unsigned long` argument may be an obstacle to the forward-edge
protection of *control-flow integrity*. Thus, it is better to use a
unique prototype that stands apart from the cluster of functions taking
an `unsigned long` argument. The timer callback should be passed a
pointer to the `timer_list` structure rather than an `unsigned long`
argument. The caller then wraps all the information the callback needs,
including the `timer_list` structure, into a larger structure, and the
callback can use the `container_of` macro instead of the `unsigned long`
value. For more information see: [Improving the kernel timers
API](https://lwn.net/Articles/735887/).
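
The embedding pattern described above can be tried outside the kernel.
The sketch below is a userspace model: `struct toy_timer` stands in for
`timer_list`, `toy_container_of` mimics the idea of `container_of`, and
`struct blinker` is a made-up driver state. None of these names come
from the kernel.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct timer_list. */
struct toy_timer {
    void (*function)(struct toy_timer *);
};

/* Member pointer -> pointer to the structure that embeds it. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Driver state that embeds the timer, so no extra argument is needed. */
struct blinker {
    int led_state;
    struct toy_timer timer;
};

/* Typed callback: recovers its own state from the timer pointer. */
void blinker_callback(struct toy_timer *t)
{
    struct blinker *b = toy_container_of(t, struct blinker, timer);

    b->led_state = !b->led_state;
}

/* What a timer core would do on expiry: call back with the timer. */
void toy_timer_fire(struct toy_timer *t)
{
    t->function(t);
}
```

Because the callback receives a typed pointer, there is no `unsigned
long` to cast, and the compiler can check the callback's prototype.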
|
|||
|
|
|||
|
Before Linux v4.14, `setup_timer` was used to initialize the timer and
the `timer_list` structure looked like:
|
|||
|
|
|||
|
    struct timer_list {
        unsigned long expires;
        void (*function)(unsigned long);
        unsigned long data;
        u32 flags;
        /* ... */
    };

    void setup_timer(struct timer_list *timer,
                     void (*callback)(unsigned long),
                     unsigned long data);
|
|||
|
|
|||
|
Since Linux v4.14, `timer_setup` has been adopted, and the kernel has
been converted step by step from `setup_timer` to `timer_setup`. One
reason a new function was introduced, rather than changing the old one
in place, is that it needed to coexist with the old interface. Moreover,
`timer_setup` was at first implemented on top of `setup_timer`.
|
|||
|
|
|||
|
    void timer_setup(struct timer_list *timer,
                     void (*callback)(struct timer_list *),
                     unsigned int flags);
|
|||
|
|
|||
|
`setup_timer` was then removed in v4.15. As a result, the `timer_list`
structure changed to the following.
|
|||
|
|
|||
|
    struct timer_list {
        unsigned long expires;
        void (*function)(struct timer_list *);
        u32 flags;
        /* ... */
    };
|
|||
|
|
|||
|
The following source code illustrates a minimal kernel module which,
|
|||
|
when loaded, starts blinking the keyboard LEDs until it is unloaded.
|
|||
|
|
|||
|
If none of the examples in this chapter fit your debugging needs, there
might yet be some other tricks to try. Ever wondered what
`CONFIG_LL_DEBUG` in `make menuconfig` is good for? If you activate that
you get low level access to the serial port. While this might not sound
very powerful by itself, you can patch
[kernel/printk.c](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/kernel/printk.c)
or any other essential syscall to print ASCII characters, thus making it
possible to trace virtually everything your code does over a serial
line. If you find yourself porting the kernel to some new and formerly
unsupported architecture, this is usually amongst the first things that
should be implemented. Logging over a netconsole might also be worth a
try.
|
|||
|
|
|||
|
While you have seen lots of stuff that can be used to aid debugging
|
|||
|
here, there are some things to be aware of. Debugging is almost always
|
|||
|
intrusive. Adding debug code can change the situation enough to make the
|
|||
|
bug seem to disappear. Thus, you should keep debug code to a minimum and
|
|||
|
make sure it does not show up in production code.
|
|||
|
|
|||
|
Scheduling Tasks
|
|||
|
================
|
|||
|
|
|||
|
There are two main ways of running tasks: tasklets and work queues.
|
|||
|
Tasklets are a quick and easy way of scheduling a single function to be
run, for example when triggered from an interrupt, whereas work queues
are more complicated but also better suited to running multiple things
in a sequence.
|
|||
|
|
|||
|
It is possible that in future tasklets may be replaced by *threaded
|
|||
|
irqs*. However, discussion about that has been ongoing since 2007
|
|||
|
([Eliminating tasklets](https://lwn.net/Articles/239633)), so do not
|
|||
|
hold your breath. See the section
|
|||
|
<a href="#sec:irq" data-reference-type="ref" data-reference="sec:irq">15.1</a>
|
|||
|
if you wish to avoid the tasklet debate.
|
|||
|
|
|||
|
Tasklets
|
|||
|
--------
|
|||
|
|
|||
|
Here is an example tasklet module. The `tasklet_fn` function runs for a
few seconds. In the meantime, execution of the `example_tasklet_init`
function may continue to the exit point, depending on whether it is
interrupted by **softirq**.
|
|||
|
|
|||
|
So with this example loaded `dmesg` should show:
|
|||
|
|
|||
|
tasklet example init
|
|||
|
Example tasklet starts
|
|||
|
Example tasklet init continues...
|
|||
|
Example tasklet ends
|
|||
|
|
|||
|
Although tasklets are easy to use, they come with several drawbacks, and
developers are discussing getting rid of tasklets in the Linux kernel.
|
|||
|
The tasklet callback runs in atomic context, inside a software
|
|||
|
interrupt, meaning that it cannot sleep or access user-space data, so
|
|||
|
not all work can be done in a tasklet handler. Also, the kernel only
|
|||
|
allows one instance of any given tasklet to be running at any given
|
|||
|
time; multiple different tasklet callbacks can run in parallel.
|
|||
|
|
|||
|
In recent kernels, tasklets can be replaced by workqueues, timers, or
|
|||
|
threaded interrupts.[1] While the removal of tasklets remains a
|
|||
|
longer-term goal, the current kernel contains more than a hundred uses
|
|||
|
of tasklets. Now developers are proceeding with the API changes and the
|
|||
|
macro `DECLARE_TASKLET_OLD` exists for compatibility.
|
|||
|
For further information, see <https://lwn.net/Articles/830964/>.
|
|||
|
|
|||
|
Work queues
|
|||
|
-----------
|
|||
|
|
|||
|
To add a task to the scheduler we can use a workqueue. The kernel then
|
|||
|
uses the Completely Fair Scheduler (CFS) to execute work within the
|
|||
|
queue.
|
|||
|
|
|||
|
Interrupt Handlers
|
|||
|
==================
|
|||
|
|
|||
|
Interrupt Handlers
|
|||
|
------------------
|
|||
|
|
|||
|
Except for the last chapter, everything we did in the kernel so far we
|
|||
|
have done as a response to a process asking for it, either by dealing
|
|||
|
with a special file, sending an `ioctl()`, or issuing a system call. But
|
|||
|
the job of the kernel is not just to respond to process requests.
|
|||
|
Another job, which is every bit as important, is to speak to the
|
|||
|
hardware connected to the machine.
|
|||
|
|
|||
|
There are two types of interaction between the CPU and the rest of the
|
|||
|
computer’s hardware. The first type is when the CPU gives orders to the
|
|||
|
hardware, the other is when the hardware needs to tell the CPU
|
|||
|
something. The second, called interrupts, is much harder to implement
|
|||
|
because it has to be dealt with when convenient for the hardware, not
|
|||
|
the CPU. Hardware devices typically have a very small amount of RAM, and
|
|||
|
if you do not read their information when available, it is lost.
|
|||
|
|
|||
|
Under Linux, hardware interrupts are called IRQ’s (Interrupt ReQuests).
|
|||
|
There are two types of IRQ’s, short and long. A short IRQ is one which
|
|||
|
is expected to take a very short period of time, during which the rest
|
|||
|
of the machine will be blocked and no other interrupts will be handled.
|
|||
|
A long IRQ is one which can take longer, and during which other
|
|||
|
interrupts may occur (but not interrupts from the same device). If at
|
|||
|
all possible, it is better to declare an interrupt handler to be long.
|
|||
|
|
|||
|
When the CPU receives an interrupt, it stops whatever it is doing
|
|||
|
(unless it is processing a more important interrupt, in which case it
|
|||
|
will deal with this one only when the more important one is done), saves
|
|||
|
certain parameters on the stack and calls the interrupt handler. This
|
|||
|
means that certain things are not allowed in the interrupt handler
|
|||
|
itself, because the system is in an unknown state. The Linux kernel
solves the problem by splitting interrupt handling into two parts. The
first part executes right away and masks the interrupt line. Hardware
interrupts must be handled quickly, and that is why we need the second
part to handle the heavy work deferred from an interrupt handler.
Historically, BH (Linux naming for *Bottom Halves*) statically
book-kept the deferred functions. **Softirq** and its higher level
abstraction, **Tasklet**, have replaced BH since Linux 2.3.
|
|||
|
|
|||
|
The way to implement this is to call `request_irq()` to get your
interrupt handler called when the relevant IRQ is received.

In practice IRQ handling can be a bit more complex. Hardware is often
designed in a way that chains two interrupt controllers, so that all the
IRQs from interrupt controller B are cascaded to a certain IRQ from
interrupt controller A. Of course, that requires the kernel to find out
afterwards which IRQ it really was, and that adds overhead. Other
architectures offer some special, very low overhead, so-called "fast
IRQs" or FIQs. Taking advantage of them requires handlers to be written
in assembly language, so they do not really fit into the kernel. They
can be made to work similarly to the others, but after that procedure,
they are no longer any faster than "common" IRQs. SMP-enabled kernels
running on systems with more than one processor need to solve another
truckload of problems. It is not enough to know that a certain IRQ has
happened; it is also important to know which CPU(s) it was for. Readers
interested in more details might want to look up "APIC".

The `request_irq()` function receives the IRQ number, the name of the
handler function, flags, a name for `/proc/interrupts` and a parameter
to be passed to the interrupt handler. Usually there is a certain number
of IRQs available. How many IRQs there are is hardware-dependent.

The flags can be used to specify behaviors of the IRQ. For example, use
`IRQF_SHARED` to indicate you are willing to share the IRQ with other
interrupt handlers (usually because a number of hardware devices sit on
the same IRQ); use `IRQF_ONESHOT` to indicate that the IRQ is not
re-enabled after the hard IRQ handler has finished. Note that in some
materials you may encounter another set of IRQ flags named with the `SA`
prefix, for example `SA_SHIRQ` and `SA_INTERRUPT`. Those are the IRQ
flags of older kernels. They have been removed completely; today only
the `IRQF` flags are in use. `request_irq()` will only succeed if there
is not already a handler on this IRQ, or if you are both willing to
share.

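The sharing rule in the last sentence can be sketched as a small
user-space model. This is not kernel code: `model_claim_irq()`,
`MODEL_SHARED` and `MODEL_NR_IRQS` are hypothetical names invented for
this illustration, and returning `-EBUSY` for a refused request is an
assumption made to mirror the usual kernel convention.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_SHARED 0x1u /* hypothetical stand-in for IRQF_SHARED */
#define MODEL_NR_IRQS 4   /* hypothetical number of IRQ lines */

struct model_irq_line {
    int handlers;    /* number of handlers currently registered */
    bool all_shared; /* true if every registered handler allows sharing */
};

static struct model_irq_line lines[MODEL_NR_IRQS];

/* Register a handler on an IRQ line.
 * Succeeds if the line is free, or if both the handlers already on the
 * line and the new one asked for sharing.
 * Returns 0 on success and a negative errno value otherwise. */
static int model_claim_irq(unsigned int irq, unsigned int flags)
{
    bool wants_shared = (flags & MODEL_SHARED) != 0;

    if (irq >= MODEL_NR_IRQS)
        return -EINVAL;
    assert(lines[irq].handlers >= 0);

    if (lines[irq].handlers > 0 &&
        !(lines[irq].all_shared && wants_shared))
        return -EBUSY;

    if (lines[irq].handlers == 0)
        lines[irq].all_shared = wants_shared;
    lines[irq].handlers++;
    return 0;
}
```

In this model a second request on an exclusively held line fails even if
the newcomer is willing to share, because the first owner was not.
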
Detecting button presses
------------------------

Many popular single-board computers, such as the Raspberry Pi or
BeagleBoards, have a bunch of GPIO pins. Attaching buttons to those and
then having a button press do something is a classic case in which you
might need to use interrupts: instead of having the CPU waste time and
battery power polling for a change in input state, it is better for the
input to trigger the CPU to run a particular handling function.

Here is an example where buttons are connected to GPIO numbers 17 and 18
and an LED is connected to GPIO 4. You can change those numbers to
whatever is appropriate for your board.

Bottom Half
-----------

Suppose you want to do a bunch of stuff inside of an interrupt routine.
A common way to do that without rendering the interrupt unavailable for
a significant duration is to combine it with a tasklet. This pushes the
bulk of the work off into the scheduler.

The example below modifies the previous example to also run an
additional task when an interrupt is triggered.

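The idea of moving work out of the interrupt routine can be pictured
with a small user-space model. This is only a sketch, not kernel code:
the names `top_half()`, `bottom_half()` and `QUEUE_SIZE` are invented
for illustration, and a real driver would rely on the kernel's own
deferral mechanisms, such as tasklets, rather than a hand-written queue.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_SIZE 8 /* hypothetical capacity of the deferred-work queue */

/* Events recorded by the top half and consumed later by the bottom half. */
static int queue[QUEUE_SIZE];
static size_t head, tail; /* head: next slot to read, tail: next to write */
static int processed_sum; /* stands in for the result of the heavy work */

/* Top half: do the minimum (store the event) and return immediately.
 * Returns 0 on success, -1 if the queue is full and the event is dropped. */
static int top_half(int event)
{
    if (tail - head == QUEUE_SIZE)
        return -1;
    queue[tail % QUEUE_SIZE] = event;
    tail++;
    return 0;
}

/* Bottom half: runs later and drains everything that was queued.
 * Returns the number of events handled in this run. */
static int bottom_half(void)
{
    int handled = 0;

    assert(tail - head <= QUEUE_SIZE);
    while (head != tail) {
        processed_sum += queue[head % QUEUE_SIZE]; /* the deferred work */
        head++;
        handled++;
    }
    return handled;
}
```

Calling `top_half()` three times and then `bottom_half()` once handles
all three events in a single deferred run: the urgent part stays short,
and the expensive part runs when it is convenient.
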
Threaded IRQ
------------

Threaded IRQ is a mechanism to organize both the top-half and the
bottom-half of an IRQ at once. A threaded IRQ splits the single handler
of `request_irq()` into two: one for the top-half, the other for the
bottom-half. `request_threaded_irq()` is the function for using threaded
IRQs; it registers both handlers at once.

Those two handlers run in different contexts. The top-half handler runs
in interrupt context. It is the equivalent of the handler passed to
`request_irq()`. The bottom-half handler, on the other hand, runs in its
own thread. This thread is created on registration of a threaded IRQ.
Its sole purpose is to run this bottom-half handler. This is where a
threaded IRQ is “threaded”. If `IRQ_WAKE_THREAD` is returned by the
top-half handler, that bottom-half serving thread will wake up. The
thread then runs the bottom-half handler.

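The hand-off between the two handlers can be modelled in user space as
follows. This is a sketch under simplifying assumptions: the `model_`
names are hypothetical stand-ins for the kernel's types and return
values, the bottom half is called directly instead of running in a real
thread, and the choice of line 17 as the one that needs slow processing
is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's handler return values. */
enum model_irqreturn {
    MODEL_IRQ_NONE,
    MODEL_IRQ_HANDLED,
    MODEL_IRQ_WAKE_THREAD
};

typedef enum model_irqreturn (*model_handler_t)(int irq);

static int thread_runs; /* how often the bottom-half "thread" has run */

/* Top half: decides quickly whether the slow part is needed at all. */
static enum model_irqreturn button_top_half(int irq)
{
    return irq == 17 ? MODEL_IRQ_WAKE_THREAD : MODEL_IRQ_HANDLED;
}

/* Bottom half: in the kernel this would run in its own thread. */
static enum model_irqreturn button_bottom_half(int irq)
{
    (void)irq;
    thread_runs++;
    return MODEL_IRQ_HANDLED;
}

/* Deliver one interrupt: always run the top half, and run the bottom
 * half only if the top half asked for the thread to be woken. */
static enum model_irqreturn model_deliver(int irq, model_handler_t top,
                                          model_handler_t bottom)
{
    enum model_irqreturn ret;

    assert(top != NULL && bottom != NULL);
    ret = top(irq);
    if (ret == MODEL_IRQ_WAKE_THREAD)
        ret = bottom(irq);
    return ret;
}
```

Delivering interrupt 17 runs both halves, while any other line is fully
handled by the top half and never wakes the bottom half.
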
Here is an example of how to do the same thing as before, with top and
bottom halves, but using threads.

A threaded IRQ is registered using `request_threaded_irq()`. This
function takes only one more parameter than `request_irq()`: the
bottom-half handling function that runs in its own thread. In this
example it is `button_bottom_half()`. The other parameters are used in
the same way as in `request_irq()`.

Presence of both handlers is not mandatory. If either of them is not
needed, pass `NULL` instead. A `NULL` top-half handler implies that no
action is taken except to wake up the bottom-half serving thread, which
runs the bottom-half handler. Similarly, a `NULL` bottom-half handler
effectively acts as if `request_irq()` were used. In fact, this is how
`request_irq()` is implemented.

Note that passing `NULL` for both handlers is considered an error and
will make registration fail.

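The rules for `NULL` handlers can be summarised in a short user-space
model. It is a sketch only: every `model_` name and `sample_handler()`
are hypothetical, the return value 1 merely stands for "wake the
thread", and `-EINVAL` is assumed as the error code for the case where
both handlers are missing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef int (*model_handler_t)(int irq);

struct model_irqaction {
    model_handler_t handler;   /* top half */
    model_handler_t thread_fn; /* bottom half, may be NULL */
};

/* Default top half used when the caller passes NULL: it does nothing
 * except ask for the bottom-half thread to be woken (modelled as 1). */
static int model_default_top_half(int irq)
{
    (void)irq;
    return 1;
}

/* Example handler used only to exercise the model. */
static int sample_handler(int irq)
{
    (void)irq;
    return 0;
}

/* Validate and store the handler pair, following the rules above. */
static int model_request_threaded_irq(struct model_irqaction *action,
                                      model_handler_t handler,
                                      model_handler_t thread_fn)
{
    assert(action != NULL);

    if (!handler) {
        if (!thread_fn)
            return -EINVAL; /* both NULL: registration fails */
        handler = model_default_top_half;
    }
    action->handler = handler;
    action->thread_fn = thread_fn;
    return 0;
}

/* A request without a bottom half is the plain, non-threaded case. */
static int model_request_irq(struct model_irqaction *action,
                             model_handler_t handler)
{
    return model_request_threaded_irq(action, handler, NULL);
}
```

Writing the plain request as a thin wrapper around the threaded one
reflects the statement above that a `NULL` bottom half behaves like
`request_irq()`.
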
Virtual Input Device Driver
===========================

The input device driver is a module that provides a way to communicate
with an interaction device via events. For example, a keyboard can send
press or release events to tell the kernel what we want to do. The input
device driver allocates a new input structure with
`input_allocate_device()`, sets up the input bitfields, device id,
version, etc., and then registers it by calling
`input_register_device()`.

Here is an example, vinput. It is an API that allows easy development of
virtual input drivers. A driver needs to export a `vinput_device()` that
contains the virtual device name and a `vinput_ops` structure that
describes:

- the init function: `init()`

- the input event injection function: `send()`

- the readback function: `read()`

Then using `vinput_register_device()` and `vinput_unregister_device()`
will add a new device to, or remove it from, the list of supported
virtual input devices.

    int init(struct vinput *);

This function is passed a `struct vinput` already initialized with an
allocated `struct input_dev`. The `init()` function is responsible for
initializing the capabilities of the input device and registering it.

    int send(struct vinput *, char *, int);

This function receives a user string to interpret and injects the event
using the `input_report_XXXX` or `input_event` call. The string has
already been copied from user space.

    int read(struct vinput *, char *, int);

This function is used for debugging and should fill the buffer parameter
with the last event sent, in the virtual input device format. The buffer
will then be copied to user space.

vinput devices are created and destroyed using sysfs, and event
injection is done through a `/dev` node. The device name will be used by
userland to export a new virtual input device.

The `class_attribute` structure is similar to other attribute types we
talked about in section
<a href="#sec:sysfs" data-reference-type="ref" data-reference="sec:sysfs">8</a>:

    struct class_attribute {
        struct attribute attr;
        ssize_t (*show)(struct class *class, struct class_attribute *attr,
                        char *buf);
        ssize_t (*store)(struct class *class, struct class_attribute *attr,
                         const char *buf, size_t count);
    };

In `vinput.c`, the macro `CLASS_ATTR_WO(export/unexport)` defined in
[include/linux/device.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/device.h)
(in this case, `device.h` is included in
[include/linux/input.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/input.h))
will generate the `class_attribute` structures which are named
`class_attr_export/unexport`. Then, put them into the
`vinput_class_attrs` array, and the macro
`ATTRIBUTE_GROUPS(vinput_class)` will generate the
`struct attribute_group vinput_class_group` that should be assigned in
`vinput_class`. Finally, call `class_register(&vinput_class)` to create
the attributes in sysfs.

To create a `vinputX` sysfs entry and `/dev` node:

    echo "vkbd" | sudo tee /sys/class/vinput/export

To unexport the device, just echo its id into `unexport`:

    echo "0" | sudo tee /sys/class/vinput/unexport

The virtual keyboard is one example of how to use vinput. It supports
all `KEY_MAX` keycodes. The injection format is the `KEY_CODE` as
defined in
[include/linux/input.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/input.h).
A positive value means `KEY_PRESS` while a negative value is a
`KEY_RELEASE`. The keyboard supports repetition when the key stays
pressed for too long. The following demonstrates how the simulation
works.

Simulate a key press on "g" (`KEY_G` = 34):

    echo "+34" | sudo tee /dev/vinput0

Simulate a key release on "g" (`KEY_G` = 34):

    echo "-34" | sudo tee /dev/vinput0

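The injection format used in these commands can be sketched in user
space as follows. This is an illustration rather than the driver's
actual code: `model_parse_key()`, `struct model_key_event` and
`MODEL_KEY_MAX` are hypothetical names, the value `0x2ff` is assumed to
match `KEY_MAX` in current kernel headers, and rejecting code 0 is an
assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define MODEL_KEY_MAX 0x2ff /* assumed to match KEY_MAX */

struct model_key_event {
    long code;   /* key code, e.g. 34 for KEY_G */
    int pressed; /* 1 for a press, 0 for a release */
};

/* Parse an injection string such as "+34" or "-34".
 * Returns 0 on success or -EINVAL if the string is not a usable code. */
static int model_parse_key(const char *buf, struct model_key_event *ev)
{
    char *end;
    long value;

    assert(buf != NULL && ev != NULL);

    errno = 0;
    value = strtol(buf, &end, 10); /* accepts a leading '+' or '-' */
    if (end == buf || errno != 0)
        return -EINVAL;
    if (*end != '\0' && *end != '\n') /* allow the newline from echo */
        return -EINVAL;
    if (value == 0 || value > MODEL_KEY_MAX || value < -MODEL_KEY_MAX)
        return -EINVAL;

    ev->pressed = value > 0;               /* positive: key press */
    ev->code = value > 0 ? value : -value; /* magnitude: key code */
    return 0;
}
```

With this model, `"+34\n"` is read as a press of key 34 and `"-34"` as a
release of the same key.
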
Standardizing the interfaces: The Device Model
==============================================

Up to this point we have seen all kinds of modules doing all kinds of
things, but there was no consistency in their interfaces with the rest
of the kernel. To impose some consistency, such that there is at minimum
a standardized way to start, suspend and resume a device, a device model
was added. An example is shown below, and you can use this as a template
to add your own suspend, resume or other interface functions.

Optimizations
=============

Likely and Unlikely conditions
------------------------------

Sometimes you might want your code to run as quickly as possible,
especially if it is handling an interrupt or doing something which might
cause noticeable latency. If your code contains boolean conditions and
you know that they almost always evaluate to the same value, either
`true` or `false`, then you can allow the compiler to optimize for this
using the `likely` and `unlikely` macros. For example, when allocating
memory you almost always expect this to succeed.

    bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx);
    if (unlikely(!bvl)) {
        mempool_free(bio, bio_pool);
        bio = NULL;
        goto out;
    }

When the `unlikely` macro is used, the compiler alters its machine
instruction output, so that it continues along the false branch and only
jumps if the condition is true. That avoids flushing the processor
pipeline. The opposite happens if you use the `likely` macro.

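Outside the kernel, the same hints can be tried with the
`__builtin_expect` builtin of GCC and Clang, on which the kernel macros
are built. The following is a minimal user-space sketch; the helper
functions `alloc_counters()` and `sum_valid()` are hypothetical and
exist only to give the macros something to annotate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Branch hints built on a GCC/Clang builtin, as in the kernel. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Allocate a zeroed buffer of n ints; failure is the rare path. */
static int *alloc_counters(size_t n)
{
    int *buf = calloc(n, sizeof(*buf));

    if (unlikely(!buf))
        return NULL; /* error handling kept off the hot path */
    return buf;
}

/* Sum only the non-negative entries; negatives are assumed to be rare. */
static long sum_valid(const int *values, size_t n)
{
    long sum = 0;

    assert(values != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (likely(values[i] >= 0))
            sum += values[i];
    }
    return sum;
}
```

The hints change only how the compiler lays out the generated code, not
the result: a wrong hint makes the code slower, never incorrect.
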
Static keys
-----------

Static keys allow us to enable or disable kernel code paths based on the
runtime state of a key. The API has been available since 2010 (most
architectures are already supported) and uses self-modifying code to
eliminate the overhead of cache and branch prediction. The most typical
use case of static keys is performance-sensitive kernel code, such as
tracepoints, context switching, networking, etc. These hot paths of the
kernel often contain branches and can be optimized easily using this
technique. Before we can use static keys in the kernel, we need to make
sure that gcc supports `asm goto` inline assembly, and that the
following kernel configurations are set:

    CONFIG_JUMP_LABEL=y
    CONFIG_HAVE_ARCH_JUMP_LABEL=y
    CONFIG_HAVE_ARCH_JUMP_LABEL_RELATIVE=y

To declare a static key, we need to define a global variable using the
`DEFINE_STATIC_KEY_FALSE` or `DEFINE_STATIC_KEY_TRUE` macro defined in
[include/linux/jump_label.h](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/include/linux/jump_label.h).
This macro initializes the key with the given initial value, which is
either false or true, respectively. For example, to declare a static key
with an initial value of false, we can use the following code:

    DEFINE_STATIC_KEY_FALSE(fkey);

Once the static key has been declared, we need to add branching code to
the module that uses the static key. For example, the code includes a
fastpath, where a no-op instruction will be generated at compile time as
the key is initialized to false and the branch is unlikely to be taken.

    pr_info("fastpath 1\n");
    if (static_branch_unlikely(&fkey))
        pr_alert("do unlikely thing\n");
    pr_info("fastpath 2\n");

If the key is enabled at runtime by calling
`static_branch_enable(&fkey)`, the fastpath will be patched with an
unconditional jump instruction to the slowpath code `pr_alert`, so the
branch will always be taken until the key is disabled again.

The following kernel module, derived from `chardev.c`, demonstrates how
the static key works.

To check the state of the static key, we can use the `/dev/key_state`
interface.

    cat /dev/key_state

This will display the current state of the key, which is disabled by
default.

To change the state of the static key, we can perform a write operation
on the file:

    echo enable > /dev/key_state

This will enable the static key, causing the code path to switch from
the fastpath to the slowpath.

In some cases, the key is enabled or disabled at initialization and
never changed. We can then declare the static key as read-only, which
means that it can only be toggled in the module init function. To
declare a read-only static key, we can use the
`DEFINE_STATIC_KEY_FALSE_RO` or `DEFINE_STATIC_KEY_TRUE_RO` macro
instead. Attempts to change the key at runtime will result in a page
fault. For more information, see [Static
keys](https://www.kernel.org/doc/Documentation/static-keys.txt).

Common Pitfalls
===============

Using standard libraries
------------------------

You cannot do that. In a kernel module, you can only use kernel
functions, which are the functions you can see in `/proc/kallsyms`.

Disabling interrupts
--------------------

You might need to do this for a short time and that is OK, but if you do
not enable them afterwards, your system will be stuck and you will have
to power it off.

Where To Go From Here?
======================

For those deeply interested in kernel programming,
[kernelnewbies.org](https://kernelnewbies.org) and the
[Documentation](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation)
subdirectory within the kernel source code are highly recommended.
Although the latter may not always be straightforward, it serves as a
valuable initial step for further exploration. Echoing Linus Torvalds’
perspective, the most effective method to understand the kernel is
through personal examination of the source code.

Contributions to this guide are welcome, especially if there are any
significant inaccuracies identified. To contribute or report an issue,
please open an issue at <https://github.com/sysprog21/lkmpg>. Pull
requests are greatly appreciated.

Happy hacking!

[1] The goal of threaded interrupts is to push more of the work to
separate threads, so that the minimum needed for acknowledging an
interrupt is reduced, and therefore the time spent handling the
interrupt (where it can’t handle any other interrupts at the same time)
is reduced. See <https://lwn.net/Articles/302043/>.