Saturday, June 28, 2014

Interacting with kernel using sysfs

। जय श्री भगवान् ।
In the last post I talked about how to add a system call on an x86 or x86_64 system. There are several reasons a user space application might want to interact with the kernel: I/O operations, performance statistics, or a special device with its own set of ioctl calls that the program wants to use. We saw one way to do this, a system call, but it is neither practical nor necessary to add a new system call for every such need.

Here we'll take a look at one of the simplest interfaces for interacting with the kernel: sysfs. It's not quite as simple as you may think, but we can leave out a lot of things like locking, allocating memory, file operations etc., and focus on one thing: the easiest way to get data into and out of the kernel. Sysfs is usually used for interacting with modules and exporting device specific information, but it can be used for almost anything you want to accomplish. So let's dive into the basics first: what exactly is sysfs?

The basic idea of having a Sysfs

Sysfs was created mainly for devices and kernel modules wishing to export/import information. The information exported through a single file can't be more than PAGE_SIZE (usually 4KiB), although depending on how you implement the Sysfs files you can accomplish quite a lot more. The basic idea is that when a module wants interaction from user space, for example a device that the root user can turn off by writing a specific command to one of the device's registers, nobody should have to write a program doing ioctls. Instead, the device's driver module can create Sysfs entries allowing the root user to just do echo <command_value> > /sys/<sysfs_file>, which would take care of everything.

The whole sysfs is based on the idea of kobject, which represents some kind of entity. A kobject may have a parent and may have many children which again are kobjects. So the basic idea is something like shown below,

Figure: Sysfs structure (sysfs layout)
So basically the idea is to group kobjects of a certain type and put them under that type. By default you can see the sysfs entries in /sys, and depending on the type of kobject you wish to implement, it could be added under one of the directories there. Since this is a very gentle introduction to kobjects, we won't use any parent and will add our kobject directly under /sys. So let's see what we need to know in order to do this.

The following is the listing of kobject structure,

struct kobject {
        const char              *name;
        struct list_head        entry;
        struct kobject          *parent;
        struct kset             *kset;
        struct kobj_type        *ktype;
        struct sysfs_dirent     *sd;
        struct kref             kref;
        unsigned int state_initialized:1;
        unsigned int state_in_sysfs:1;
        unsigned int state_add_uevent_sent:1;
        unsigned int state_remove_uevent_sent:1;
        unsigned int uevent_suppress:1;
};

The above structure seems daunting and there's a lot going on in it, but we don't need to bother about most of it right now; we just need the name, the kref and the parent. The rest is used for internal kobject maintenance. The leaf kobjects are where the real thing happens. These leaf kobjects are implemented by the module writer using attributes, or more specifically kobj_attribute. The following shows the listing for both,

struct attribute {
        const char              *name;
        umode_t                 mode;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
        bool                    ignore_lockdep:1;
        struct lock_class_key   *key;
        struct lock_class_key   skey;
#endif
};

struct kobj_attribute {
        struct attribute attr;
        ssize_t (*show)(struct kobject *kobj, struct kobj_attribute *attr,
                        char *buf);
        ssize_t (*store)(struct kobject *kobj, struct kobj_attribute *attr,
                         const char *buf, size_t count);
};

As you can see, kobj_attribute embeds the attribute structure and also provides show and store methods, which move information from the kernel to user space and from user space to the kernel respectively. Thus it all boils down to the following steps:

  1. Create a parent for our attributes. This is required since we don't want our attributes to show up directly under /sys.
  2. Create some kobj_attribute structures and set the store and show methods on these.

The following module shows how to accomplish this. The module itself is heavily commented, but we'll take a look at some interesting pieces.

Kernel Module using Sysfs, Kobjects and kobj_attribute

The idea of this module is to have 
  1. A parent directory, that is a parent Kobject.
  2. Two attributes that store their information in a static array. 
 
#include <common.h>
#include <linux/sysfs.h>

#define ROOT_KOBJ_NAME          "pks_kobj"
#define ROOT_ATTR1_NAME         "pks_kobj_attr1"
#define ROOT_ATTR2_NAME         "pks_kobj_attr2"
#define ROOT_ATTRS_COUNT        2

ssize_t rootfs_show(struct kobject *kobj, struct kobj_attribute *attr,
                        char *buf);

ssize_t rootfs_store(struct kobject *kobj, struct kobj_attribute *attr,
                        const char *buf, size_t count);


/* 1 Word of storage per attribute */
#define ROOT_ATTR_STORAGE_SIZE  (ROOT_ATTRS_COUNT * sizeof(unsigned long))
/*
 * This is our directory sort of for our sysfs files
 */
struct kobject *root_kobj;

/*
 * These are the files under pks_kobj we'll see. 
 * */
struct kobj_attribute root_kobj_attr1 = __ATTR(root_kobj_attr1, S_IWUSR|S_IRUGO,
                                        rootfs_show, rootfs_store);

struct kobj_attribute root_kobj_attr2 = __ATTR(root_kobj_attr2, S_IWUSR|S_IRUGO,
                                        rootfs_show, rootfs_store);

const struct attribute *root_kobj_attr[] = {    &root_kobj_attr1.attr,
                                        &root_kobj_attr2.attr, NULL};
/*
 * We need storage to get/put data from/to user land. Let's just create 
 * a static array for this.
 */
static char attribute_storage[ROOT_ATTR_STORAGE_SIZE];

static int __init init_sysfs_objs(struct kobject *root_kobj_parent)
{
        int err = 0;
        root_kobj = kobject_create_and_add(ROOT_KOBJ_NAME, root_kobj_parent);
        if (!root_kobj) {
                err = -ENOMEM;
                goto no_root_kobj;
        }
        err = sysfs_create_files(root_kobj, root_kobj_attr);
        if (err)
                goto err_create_files;
        return 0;

err_create_files:
        kobject_put(root_kobj);
no_root_kobj:
        return err;
}
static int __init load_module(void)
{
        return init_sysfs_objs(NULL);
}
static void __exit cleanup_sysfs_objs(void)
{
        sysfs_remove_files(root_kobj, root_kobj_attr);
        kobject_put(root_kobj);
}

static void __exit unload_module(void)
{
        cleanup_sysfs_objs();
}

ssize_t rootfs_show(struct kobject *kobj, struct kobj_attribute *attr,
                        char *buf)
{
        /* Slot 0 belongs to attr1, slot 1 to attr2 */
        unsigned long *storage = (unsigned long*)attribute_storage;
        //pr_debug("Copying to user space from attribute %s\n", attr->attr.name);
        if (attr == &root_kobj_attr2)
                storage++;
        /* buf is a PAGE_SIZE kernel buffer; hand back the raw value */
        *( (unsigned long*)buf) = *storage;
        return sizeof(unsigned long);
}

ssize_t rootfs_store(struct kobject *kobj, struct kobj_attribute *attr,
                        const char *buf, size_t count)
{
        /* Slot 0 belongs to attr1, slot 1 to attr2 */
        unsigned long *storage = (unsigned long*)attribute_storage;
        //pr_debug("Copying from user space to attribute %s\n", attr->attr.name);
        /* We expect one raw unsigned long; reject shorter writes */
        if (count < sizeof(unsigned long))
                return -EINVAL;
        if (attr == &root_kobj_attr2)
                storage++;
        pr_debug("Changing from %lu to %lu\n", *storage, *( (unsigned long*)buf));
        *storage = *( (unsigned long*)buf);
        /* Report the whole write as consumed so user space doesn't retry */
        return count;
}

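module_init(load_module);
module_exit(unload_module);

/* kobject_create_and_add() and sysfs_create_files() are GPL-only exports */
MODULE_LICENSE("GPL");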


Creating the root directory for our kobj_attributes

In the above listing, we've created one directory represented by our root_kobj. Kobjects should be created dynamically, not statically, so we've used the function kobject_create_and_add for this purpose. Note the final argument of that function: we've supplied NULL, which means this kobject doesn't have any parent and will thus appear directly under /sys.

Creating the kobj_attributes

The attributes you would like to show will almost always be declared statically, since you know what you want to show in sysfs for your device or for whatever purpose you are creating those entries. To facilitate this the kernel provides the macro __ATTR for initializing the kobj_attribute. This macro takes the attribute name as its first argument and stringifies it, so we don't even need the names defined at the top. In this module the files therefore show up as root_kobj_attr1 and root_kobj_attr2.

Another important thing to note here is that for each attribute you'll have to specify a show and a store method. Most of the time you'll have some common code to be executed, so there are two ways in which you can do this:
  1. Provide a common routine and check which attribute is passed in by comparing the pointer to your statically defined kobj_attribute.
  2. Wrap the kobj_attribute in your own structure and then use container_of to get from the attribute to the containing structure. This requires a bit more work and you may not even need it.

In this simple example we'll take approach 1.
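
For comparison, the pointer arithmetic behind approach 2 can be sketched in plain user-space C. This is only an illustration under assumptions: struct fake_attr stands in for kobj_attribute, struct pks_attr and pks_attr_value are hypothetical names, and the macro is a simplified version of the kernel's container_of.

```c
#include <stddef.h>

/* Simplified user-space stand-in for the kernel's container_of() */
#define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for struct kobj_attribute; only what the sketch needs */
struct fake_attr {
        const char *name;
};

/* Hypothetical wrapper: per-attribute storage plus the embedded attribute */
struct pks_attr {
        unsigned long value;
        struct fake_attr attr;
};

/* A show-like callback only receives a pointer to the embedded attribute */
unsigned long pks_attr_value(struct fake_attr *attr)
{
        struct pks_attr *wrapper = container_of(attr, struct pks_attr, attr);

        return wrapper->value;
}
```

With a wrapper like this the callback no longer needs to compare pointers, since every attribute carries its own storage.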

We've used an available wrapper function, sysfs_create_files. The first argument of this function is the kobject under which the attributes will be created, while the second is an array of pointers. See how we've specified NULL at the end of this array: this is mandatory, since no length is supplied and the function iterates over the array until it finds a NULL entry.
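
To see why the terminating NULL matters, here is a small user-space sketch of walking such an array. It is not the kernel's implementation; struct fake_attribute and count_attrs are made-up names for illustration.

```c
#include <stddef.h>

/* Stand-in for struct attribute; only the name matters here */
struct fake_attribute {
        const char *name;
};

/*
 * Walk a NULL-terminated array of attribute pointers and count the
 * entries. Without the NULL the loop would run off the end of the array.
 */
size_t count_attrs(const struct fake_attribute *const *attrs)
{
        size_t n = 0;

        while (attrs[n] != NULL)
                n++;
        return n;
}
```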

Copying Data to/from user space

The store method implies that
  • You are copying data from user land to kernel
  • You will return how much data you've consumed. Usually you just return the same amount as was passed in, but copy whatever amount you really want.
The show method implies that
  • You are copying data from kernel to user land
  • You'll return how much data you've copied into the buffer.

The pointer passed in as buf in the above code is actually a page mapped within the kernel, so you can do memcpy or just assign directly as I've done above. Remember that the buffer is exactly PAGE_SIZE, so don't go above that limit.
 
The internal buffer is just an array. It holds the value as an unsigned long for each of the attributes.
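
Since the module exchanges a raw unsigned long rather than text, a plain echo won't produce a meaningful value. A minimal user-space sketch for talking to it could look like the following; the helper names are my own, and the path you would pass (for example /sys/pks_kobj/root_kobj_attr1) is an assumption based on the names used in the module.

```c
#include <fcntl.h>
#include <unistd.h>

/* Write one raw unsigned long to the attribute file; returns 0 on success */
int attr_write_ulong(const char *path, unsigned long value)
{
        int fd = open(path, O_WRONLY);
        ssize_t n;

        if (fd < 0)
                return -1;
        n = write(fd, &value, sizeof(value));
        close(fd);
        return n == (ssize_t)sizeof(value) ? 0 : -1;
}

/* Read one raw unsigned long back; returns 0 on success */
int attr_read_ulong(const char *path, unsigned long *value)
{
        int fd = open(path, O_RDONLY);
        ssize_t n;

        if (fd < 0)
                return -1;
        n = read(fd, value, sizeof(*value));
        close(fd);
        return n == (ssize_t)sizeof(*value) ? 0 : -1;
}
```

The attributes are writable only by their owner (S_IWUSR), so the write helper would have to run as root.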

Cleaning up

You'll need to remove the files you created, just as you added them. Just be sure to do it in reverse order: first remove the files, then remove the parent kobject. To remove root_kobj all you need to do is call kobject_put. This decrements the reference count of the kobject, and when the count goes to 0 the kobject is cleaned up. This is why you should remove the files first and the parent kobject afterwards.
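
The reference counting idea can be modelled with a toy example. This is a user-space sketch, not the kernel's kref or kobject code; all names are hypothetical, and it assumes the object starts life holding one reference.

```c
/* Toy stand-in for a reference-counted object; not the kernel's kref */
struct toy_obj {
        int refcount;
        int released;           /* set once the last reference is dropped */
};

/* Creation leaves the caller holding one reference */
void toy_init(struct toy_obj *obj)
{
        obj->refcount = 1;
        obj->released = 0;
}

/* Take an extra reference */
void toy_get(struct toy_obj *obj)
{
        obj->refcount++;
}

/* Drop a reference; the object is cleaned up when the count hits 0 */
void toy_put(struct toy_obj *obj)
{
        if (--obj->refcount == 0)
                obj->released = 1;      /* a real implementation frees here */
}
```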

Exercises

  1. Modify the above module so that the first byte of each attribute's storage area holds 8 bits of flags. That is, data can be stored in only 3 of the 4 bytes on a 32 bit computer and 7 of the 8 bytes on a 64 bit computer.
  2. Write test programs, a producer and a consumer, that will write and read data respectively. Use the flag byte for any synchronization you may need. If the buffer is already full and the consumer hasn't consumed the data, check the flag byte to see whether the data can be overwritten or not. This flag will be set/unset randomly by your producer on each write. If the data can't be overwritten, return an error code or just 0 to convey that nothing was written.
  3. Try implementing a wrapper over attributes and see how you can use container_of to accomplish the same. Think about what you'll need in your wrapper structure.

We'll again visit Sysfs later on for sure when we dive into device drivers.

Sunday, June 15, 2014

Adding a new System Call in Linux (x86 and x86_64)

। जय श्री भगवान् ।
In the previous post we saw how we can modify the running kernel by dynamically adding or removing modules. In this post I'll talk about how to add a new system call on an x86 or x86_64 system. Most of the material available on the internet corresponds to 2.6 kernels; since then some files have been moved, which makes the required changes simpler.

So let's dive into how we'll write a new system call. Firstly you will need the kernel sources and then you'll need to change the following file(s). I've tested this on a 32 bit system however 64 bit should be similar. 

Making changes to the sources 

 

  1. Locate the directory arch/x86/syscalls. This directory contains the syscall table files for both 32 and 64 bit.
  2. Open the file syscall_32.tbl for 32 bit, or syscall_64.tbl for 64 bit.
  3. You can see all the system calls listed here with their number on the extreme left. The format is also shown in the first line of the file. We won't discuss which ABI to use, but for 32 bit you can use i386.
  4. Go to the end of this table (in my tree the last entry is #350, sys_finit_module) and add your system call there. Remember to set the correct number and name. You don't need to supply a compat_ version. The end of the table will then look like the listing below.


348     i386    process_vm_writev       sys_process_vm_writev           compat_sys_process_vm_writev
349     i386    kcmp                    sys_kcmp
350     i386    finit_module            sys_finit_module
351     i386    pks_first_call          sys_pks_first_call


The above change shows that I added a system call named pks_first_call, while its entry point is sys_pks_first_call. The SYSCALL_DEFINE* macros prepend sys_ to the name of the function, hence the entry point is named that way.

Adding the system call code

To add the system call code, we'll create a new directory in the arch/x86 directory and then modify the top-level Kbuild file of arch/x86 to compile our new system call. So let's start by creating the required files first.

mkdir /usr/src/linux/arch/x86/pks_first

Now create a new file in this directory. I named the file after my system call (pks_first_call.c), but you can choose anything you want. My file looks as shown below,


#include <linux/kernel.h>
#include <linux/syscalls.h>

SYSCALL_DEFINE0(pks_first_call)
{
        /* The trailing newline makes sure the message is flushed to the log */
        printk(KERN_INFO "Inside %s\n", __func__);
        return 0;
}

As you can see I've used SYSCALL_DEFINE0 since our system call takes no arguments. There are other versions, SYSCALL_DEFINE1 through SYSCALL_DEFINE6, for system calls that do take arguments. You are encouraged to use these macros rather than declaring the function by hand, since they also create the ftrace metadata for the call when syscall tracing (CONFIG_FTRACE_SYSCALLS) is enabled.

In the same directory you'll also need a Makefile which does nothing but tell Kbuild which files need to be compiled, so it's very simple as shown below,


obj-y           += pks_first_call.o

Changing the Top-Level Kbuild

All we need to do now is change the top-level Kbuild file, the one located at arch/x86/Kbuild. The listing is as shown below,

ifneq ($(CONFIG_XEN),y)
obj-y += realmode/
endif
obj-y += kernel/
obj-y += mm/

obj-y += crypto/
obj-y += vdso/
obj-$(CONFIG_IA32_EMULATION) += ia32/

obj-y += platform/
obj-y += net/
obj-y += pks_first/

I added my directory at the end of the top-level Kbuild. The final change is to let the kernel know about the system call's prototype, so we need to change one more file: include/linux/syscalls.h. The change looks as shown below,

asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
                         unsigned long idx1, unsigned long idx2);
asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
asmlinkage long sys_pks_first_call(void);
#endif  

In the above listing I've added my system call at the end, right before #endif. We don't need to change __NR_syscalls; the build system takes care of it for x86. You can now build and install your new kernel and check your new system call with a simple program as shown below.


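#include <stdio.h>
#include <unistd.h>             /* declares syscall() */
#include <sys/syscall.h>
#include <errno.h>
#define NR_pks_first_call 351
int main(void)
{
        /* A non-zero return means the call failed; errno says why */
        if (syscall(NR_pks_first_call)) {
                perror("OOPS");
        }
        return 0;
}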

When you run the above program you shouldn't get any message, but if you look at the dmesg output or your system log file (usually /var/log/messages) you should see the message posted by our system call. This was quite a lot of information; next time we'll see other ways to interact with the kernel using something simpler that doesn't require recompiling and installing the whole kernel.
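
If you run the test program on a kernel that doesn't have the new call, syscall() fails with ENOSYS. A small helper can make that case explicit. This is only a sketch; call_syscall0 is a name I made up, and the number 351 applies only to the 32 bit table shown in this post.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

/*
 * Invoke an argument-less system call by number.
 * Returns the call's result, or -errno if it failed
 * (-ENOSYS means the running kernel has no such call).
 */
long call_syscall0(long nr)
{
        long ret;

        errno = 0;
        ret = syscall(nr);
        if (ret == -1 && errno != 0)
                return -(long)errno;
        return ret;
}
```

Calling call_syscall0(351) on the rebuilt 32 bit kernel should then give 0, while a kernel without the change would give -ENOSYS.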

Saturday, June 14, 2014

KBuild System and Hello World Part 2

। जय श्री भगवान् ।
In the last post we saw how to compile a module, and by using the insmod and rmmod commands we were able to see the messages printed at module initialization and unload time. In this post I'll try to show you how to extend that.

When you have several similar modules you might want a common Kbuild file. This common Kbuild file may hold flags that are shared across most of the files. Note that you still need to define the include paths within each separate Kbuild file. Let's make another module, create a top-level Kbuild file, and build the two modules with some shared flags and some flags specific to a module or a particular file.

I created another directory called pks_modules, put the module_1 files under the directory module_1 and created another directory, module_2. This is how it looks,


pranay@linux-y7pi:~/pks_modules> ls -d */
common/  module_1/  module_2/
pranay@linux-y7pi:~/pks_modules> 
pranay@linux-y7pi:~/pks_modules> ls module_1/
Kbuild  Makefile  my_module1.c  my_module1.h
pranay@linux-y7pi:~/pks_modules> 
pranay@linux-y7pi:~/pks_modules> ls module_2/
Kbuild  Makefile  module_2.h  module2_p1.c  module2_p2.c
pranay@linux-y7pi:~/pks_modules> 
pranay@linux-y7pi:~/pks_modules> ls common/
common.h
pranay@linux-y7pi:~/pks_modules> 

I've created some more files for our new module_2. The files in module_1 are still unchanged. Now we'll invoke make in the top-level directory pks_modules instead of invoking it in each directory separately. To do this only some minor changes are needed. To begin with we need a Kbuild file and a Makefile in the top directory. The Makefile is exactly the same as the one in Part 1, so I won't post it again. The Kbuild file for the top directory looks like below,

obj-m=module_1/ module_2/
subdir-ccflags-y := -DMY_DEBUG_FLAG

So in the above Kbuild file all we do is list the directories to be built as modules. We also define one flag which will be available to all files in module_1 and module_2. Note that I've used subdir-ccflags-y instead of ccflags-y because I want all subdirectories to get this flag; ccflags-y is limited to the directory in which the Kbuild file exists. That's it for the top-level directory; now let's move to the module_2 directory. The listing for module_2 is shown below,

obj-m := module2.o
module2-objs := module2_p1.o module2_p2.o
ccflags-y := -I$(src)/../common
CFLAGS_module2_p2.o := -DDEBUG

In the above listing I've defined a file specific flag using CFLAGS_<filename>.o. You'll see how it affects the output when you compile and run the module. The ccflags-y value is used for all files built by this Kbuild instance, while a file specific flag applies only to the named file. This is where you can define new flags for a particular file instead of setting them globally. Here are the listings of the rest of the files: module_2.h, module2_p1.c and module2_p2.c, in that order,


#ifndef __MODULE2__H
#define __MODULE2__H
/*
 * Let's include some more files.
 */
#include <linux/highmem.h>
#include <linux/list.h>

int print_and_ret0(const char *str);
#endif

#include <common.h>
#include <module_2.h>

#ifndef MY_DEBUG_FLAG
int print_and_ret0(const char *str)
{
        pr_debug("MY_DEBUG_FLAG isn't set:%s",str);
        return 0;
}
#else
int print_and_ret0(const char *str)
{
        pr_debug("MY_DEBUG_FLAG is set:%s",str);
        pr_debug("The printed string above is at address %p\n",str);
        return 0;
}
#endif

#include <common.h>
#include <module_2.h>

static int __init module2_init(void)
{
        return print_and_ret0("Loading module 2\n");
}

static void __exit module2_exit(void)
{
        pr_devel("Unloading module 2. This is also enabled"
                        " when debug option is set\n");
}

module_init(module2_init);
module_exit(module2_exit);

Try to insert this module and check the output. I've used pr_devel for printing the exit message just to show that the DEBUG flag was defined only for that file. Since DEBUG wasn't defined for module2_p1.c, you won't see any messages when you load the module. On unloading the module, however, you should see the message.

Exercise 1.3

Try to create another module so that you have 3 directories instead of 2. Undefine a flag passed down from the top directory and try to compile.

KBuild System and Hello World Module Part 1

। जय श्री भगवान् ।
In this post I'll talk about how to get your code into a running kernel. We'll start off with pretty basic stuff and then try to understand the Kbuild system in detail. We'll see how to compile a module spanning multiple files and how to use Kbuild to include files from project specific directories.

A Very Simple Hello World Kernel Module

Let's write some code for our hello world module, the following listing shows that

#include <common.h>
#include <my_module1.h>
static int __init module1_init(void)
{
        pr_err("Hello World module 1\n");
        return 0;
}
static void __exit module1_exit(void)
{
        pr_debug("Unloading module 1\n");
}
module_init(module1_init);
module_exit(module1_exit);

In the above code both includes were created by me for this demonstration; they are not provided by the kernel. We'll see how we can specify the path to include directories in the Kbuild system, but before doing that let's understand what the above code does. As you would've guessed, the pr_ functions are used to print to the system log.

This output may not appear on the console when you load the module, but it is logged in the system log depending on the log level of the system. Usually you can see these messages via dmesg, or you can look in /var/log/messages, but check whether that's the file where your distribution keeps its logs.

I've used two ways to print messages to the system log. The difference is in how they end up in the compiled code: the pr_debug message is compiled in only if you've defined DEBUG while compiling. So let's check where and how to do that. Don't worry about __init and __exit for now, as your module will compile even without them. (Yes ok, go ahead, compile it and see for yourself.... :-D)

The Kbuild File 


obj-m := module1.o
module1-objs := my_module1.o
DEBUG_FLAGS := -DDEBUG
ccflags-y := -I$(src)/../common $(DEBUG_FLAGS)

The above Kbuild file is fairly easy to understand. Line 1 says what your module is called; that is, when everything is done successfully we'll have a file named module1.ko along with several others. This is the actual kernel module we are interested in. We are building a single module, so there is a single name here, and it can be anything since it's just a file name, as long as it differs from the names of the object files listed on Line 2.

Line 2 tells the Kbuild system what comprises this module file. So in short this is a dependency list that tells Kbuild system which files need to be compiled to make this module. In this case we only have one file by the name of my_module1.c and we are saying that my_module1.o is required to make module1.o.

Line 3 is a normal make variable which we'll use in Line 4 where we are specifying the flags required to compile all the files. These ccflags can be on a file-by-file basis as well. We'll see an example of this for a module spanning multiple files. For now all the files compiled by this invocation of Kbuild will use the above defined ccflags.

When make is called, Kbuild is invoked separately in each directory, so every invocation is isolated, although it inherits the Kbuild variables passed down from the parent directory. The variable src, accessed using $(src), contains the absolute path of the directory where Kbuild is currently running. In the above case we use that to say which directory to search for includes. There's no space between -I and $; if you have more directories just add more -I flags like the one above. After the include flag I've appended the DEBUG_FLAGS defined earlier. My directory listing looks like the following,

pranay@linux-y7pi:~/pks_modules> ls -d */
common/  module_1/  module_2/  test/

So when we invoke make in the module_1 directory, the include path needs to point one directory up and into common. The current directory is always searched for includes, which is why I didn't add -I$(src)/. here; if you create another sub-directory, however, you'll need to tell Kbuild about it just like above.

So if you change DEBUG_FLAGS to be just a space instead of -DDEBUG then you'll see that pr_debug string will not be printed however pr_err would still be printed.
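
The compile-time behaviour can be imitated in user space. The sketch below is not how the kernel defines pr_debug; my_debug, emit_one_debug and debug_enabled are made-up names, and it models only the plain case where the message is either compiled in or compiled out depending on DEBUG.

```c
#include <stdio.h>

static int debug_calls;         /* how many debug messages were emitted */

#ifdef DEBUG
#define my_debug(...) (debug_calls++, fprintf(stderr, __VA_ARGS__))
#define DEBUG_ENABLED 1
#else
#define my_debug(...) ((void)0)         /* compiled out entirely */
#define DEBUG_ENABLED 0
#endif

/* Tries to emit one debug message and reports how many really went out */
int emit_one_debug(void)
{
        debug_calls = 0;
        my_debug("value is %d\n", 42);
        return debug_calls;
}

/* 1 when the file was built with -DDEBUG, 0 otherwise */
int debug_enabled(void)
{
        return DEBUG_ENABLED;
}
```

Building this file with and without -DDEBUG changes what emit_one_debug returns, just as changing DEBUG_FLAGS in the Kbuild file changes whether the pr_debug message is there.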


The Makefile

The following is the Makefile for the above module

KDIR ?= /lib/modules/$(shell uname -r)/build
all:
	$(MAKE) -C $(KDIR) M=$(PWD) modules

clean:
	$(MAKE) -C $(KDIR) M=$(PWD) clean

install:
	$(MAKE) -C $(KDIR) M=$(PWD) modules_install

KDIR is just a make variable. If somebody supplies this variable while invoking make, we'll use that as the kernel build directory; otherwise we use the build directory of the running kernel. There's nothing else you need to do here, but you can add more targets if you want, just like in a normal Makefile.


A word about __init and __exit

In my earlier post I told you a bit about sections. Well, __init and __exit exist exactly for that purpose. The idea is that initialization code is usually a one shot thing, so there's no point holding on to that code after the module has been loaded. Cleanup code is similar: it is needed only when the module is unloaded, so if the code can never be unloaded, keeping it around just wastes memory.

To be able to free up this space the kernel relies on these two macros. What they do is put the marked code into separate sections: any function marked __init goes into one section (.init.text) while any function marked __exit goes into another (.exit.text). We need separate sections because code is loaded page aligned and the kernel's memory allocation granularity is a page rather than an arbitrary size.

So when memory is allocated for the __init section the kernel knows how many pages it requires, and it can free them after initialization is done. The rest of the code, which is not marked with __init or __exit, still goes into the normal .text section as we saw earlier. The following is the readelf output for module1.ko

There are 35 section headers, starting at offset 0x13e18:

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .note.gnu.build-i NOTE            00000000 000034 000024 00   A  0   0  4
  [ 2] .text             PROGBITS        00000000 000058 000000 00  AX  0   0  4
  [ 3] .init.text        PROGBITS        00000000 000058 000011 00  AX  0   0  1
  [ 4] .rel.init.text    REL             00000000 014390 000010 08     33   3  4
  [ 5] .exit.text        PROGBITS        00000000 000069 000028 00  AX  0   0  1
  [ 6] .rel.exit.text    REL             00000000 0143a0 000020 08     33   5  4
  [ 7] .rodata           PROGBITS        00000000 000091 00000d 00   A  0   0  1
  [ 8] .rodata.str1.1    PROGBITS        00000000 00009e 000034 01 AMS  0   0  1
  [ 9] .rodata.str1.4    PROGBITS        00000000 0000d4 00002f 01 AMS  0   0  4
  [10] .eh_frame         PROGBITS        00000000 000104 000050 00   A  0   0  4
  [11] .rel.eh_frame     REL             00000000 0143c0 000010 08     33  10  4
  [12] .modinfo          PROGBITS        00000000 000154 000069 00   A  0   0  1
  [13] __versions        PROGBITS        00000000 0001c0 0000c0 00   A  0   0 32
  [14] .data             PROGBITS        00000000 000280 000000 00  WA  0   0  4
  [15] __verbose         PROGBITS        00000000 000280 000018 00  WA  0   0  8
  [16] .rel__verbose     REL             00000000 0143d0 000020 08     33  15  4
  [17] .gnu.linkonce.thi PROGBITS        00000000 0002a0 000178 00  WA  0   0 32
  [18] .rel.gnu.linkonce REL             00000000 0143f0 000010 08     33  17  4
  [19] .bss              NOBITS          00000000 000418 000000 00  WA  0   0  4
  [20] .comment          PROGBITS        00000000 000418 000086 01  MS  0   0  1
  [21] .note.GNU-stack   PROGBITS        00000000 00049e 000000 00      0   0  1
  [22] .debug_aranges    PROGBITS        00000000 00049e 000040 00      0   0  1
  [23] .rel.debug_arange REL             00000000 014400 000020 08     33  22  4
  [24] .debug_info       PROGBITS        00000000 0004de 00b701 00      0   0  1
  [25] .rel.debug_info   REL             00000000 014420 005198 08     33  24  4
  [26] .debug_abbrev     PROGBITS        00000000 00bbdf 0005d2 00      0   0  1
  [27] .debug_line       PROGBITS        00000000 00c1b1 000b19 00      0   0  1
  [28] .rel.debug_line   REL             00000000 0195b8 000010 08     33  27  4
  [29] .debug_str        PROGBITS        00000000 00ccca 006fdc 01  MS  0   0  1
  [30] .debug_ranges     PROGBITS        00000000 013ca6 000030 00      0   0  1
  [31] .rel.debug_ranges REL             00000000 0195c8 000040 08     33  30  4
  [32] .shstrtab         STRTAB          00000000 013cd6 000142 00      0   0  1
  [33] .symtab           SYMTAB          00000000 019608 000280 10     34  35  4
  [34] .strtab           STRTAB          00000000 019888 0000e5 00      0   0  1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings)
  I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown)
  O (extra OS processing required) o (OS specific), p (processor specific)


Another important point about __exit is what happens when the code is built in, that is, part of the kernel image rather than a separate .ko file. Built-in code can never be unloaded, so its exit function is never called and the kernel discards code marked __exit instead of keeping it in memory. __init still works in that case, since the initialization space can still be reclaimed after boot.
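
To make the section placement concrete, here is a user-space sketch of the same trick. It assumes GCC or Clang on an ELF system such as Linux; my_init, setup_once, run_setup and the section name .my_init.text are all made up for illustration and are not the kernel's definitions.

```c
/* Hypothetical marker, loosely modelled on what __init does with GCC */
#define my_init __attribute__((section(".my_init.text"), noinline))

/* This function is emitted into .my_init.text instead of .text */
static int my_init setup_once(void)
{
        return 42;
}

/* Ordinary code in .text that calls into the specially placed function */
int run_setup(void)
{
        return setup_once();
}
```

Compiling this to an object file and running readelf -S on it should list .my_init.text as a separate section, much like .init.text in the module above.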

Listing of common.h


/*
 * Put all common includes here.
 */
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/net.h>



Exercise 1.1

Try to print a variable using pr_err and pr_debug, just like you would with printf. Try to print the addresses of the functions pr_debug and pr_err.

Saturday, June 7, 2014

Beginning System Programming. (Compilers And Segments)

। जय श्री भगवान् ।
To begin with system programming one must understand the compiler and the architecture one is using. With application programming we can perhaps forget about whether caches are being fully utilized or whether the data we have is the correct one. So in this post I'll talk about how to look at things from the point of view of the compiler, since it is essential to understand the tool that does most of the hard work of making code run on bare metal.


A Simple Hello World Dis-assembly

Let's take an example. We are not even going to print hello world here; we just want to see where the variables of our program are located. The basic idea is to understand the memory segments which a program gets when it starts to run. Each of these segments is generally described by a section in the executable, except the stack segment, since it's not actually part of the executable. To give an example let's write a simple function as shown below,

 
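int myfunc(void)
{
   int a = 5;        /* matches the movl $0x5 in the listing below */
   a = a ^ (~a);     /* x ^ ~x sets every bit: 0xffffffff */
   return !a;        /* 0, since a is non-zero */
}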

The disassembly of the above function, as reported by objdump -D, is shown below.

Disassembly of section .text:

00000000 <myfunc>:
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   83 ec 10                sub    $0x10,%esp
   6:   c7 45 fc 05 00 00 00    movl   $0x5,-0x4(%ebp)
   d:   c7 45 fc ff ff ff ff    movl   $0xffffffff,-0x4(%ebp)
  14:   83 7d fc 00             cmpl   $0x0,-0x4(%ebp)
  18:   0f 94 c0                sete   %al
  1b:   0f b6 c0                movzbl %al,%eax
  1e:   c9                      leave  
  1f:   c3                      ret    

First, note that the function is located in the .text section of the executable. This is where all the code lives, and it is usually marked read-only since code isn't supposed to change while executing. The second important point is the variable we declared inside the function. Look closely and you'll see that the variable's name is not mentioned anywhere within the function; that's because the variable is created on the stack, where it is addressed as -0x4(%ebp). (See how esp is moved by 16 bytes even though the variable uses only 4 bytes on a 32 bit machine.) The compiler assumes that the stack pointer is always valid and simply subtracts from the current value of esp to make room for the variable.
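To watch this happen from C itself, here is a small sketch that compares the address of a local in a caller with the address of a local in the function it calls. The function names are made up for illustration, and the sign of the result assumes the compiler keeps the two functions as separate frames (no inlining); only the fact that the two addresses differ is guaranteed.

```c
#include <stdint.h>

/*
 * Hypothetical helper: takes the address of a local that lives in the
 * caller's frame and returns how many bytes it sits above a local in
 * this (deeper) frame.
 */
intptr_t distance_from(const volatile int *outer)
{
    volatile int inner = 2;   /* volatile keeps the variable in memory */
    return (intptr_t)((uintptr_t)outer - (uintptr_t)&inner);
}

/*
 * Hypothetical helper: creates a local and asks the callee how far away
 * its own local is. Typically positive on x86, where the stack grows down.
 */
intptr_t frame_distance(void)
{
    volatile int outer = 1;
    return distance_from(&outer);
}
```

Printing frame_distance() on an x86 machine usually gives a small positive number, meaning the callee's variable was placed at a lower address than the caller's.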

All such auto storage class variables are created by moving the stack pointer down (or up, depending on how the stack grows; on x86 and x86_64 it grows down). This is one reason big structures are usually passed as pointers and not as the structures themselves: it avoids wasting a huge amount of stack space and doing unnecessary memory copies. Now let's see what happens to data which is declared global,

 
Disassembly of section .data:

00000000 <my_global_var>:
   0:   01 00                   add    %eax,(%eax)
        ...

I created a variable named my_global_var and it goes into the .data section. This section actually occupies space on disk, because it holds the initial value of the variable: the bytes 01 00 ... above are that value, and the add instruction is only objdump -D trying to decode data as if it were code. When the executable is loaded, the loader allocates memory for the section while parsing through the sections and fills it from the file. Therefore this memory is not allocated and destroyed on every function call, as we saw for the auto storage class variables above.
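As a rough sketch of where different declarations usually end up, consider the file below. Only my_global_var comes from the example above; the other names are made up for illustration, and the section placement described in the comments is what gcc typically does on ELF targets, not something the C language guarantees.

```c
/* Initialized global: its initial value (1) is stored in the file,
 * typically in .data. */
int my_global_var = 1;

/* Global without an initializer: starts as zero and typically goes to
 * .bss (or is emitted as a COMMON symbol by older gcc defaults), so it
 * takes no space in the file. */
int my_zeroed_var;

/* Constant data: typically placed in .rodata. */
const char my_message[] = "hello, world";

/* Hypothetical helper: the local below exists only while a call is
 * running, whereas my_global_var keeps its value between calls. */
int bump_global(void)
{
    int step = 1;             /* auto variable, lives on the stack */
    my_global_var += step;
    return my_global_var;
}
```

Compiling this with gcc -c and running objdump -D or readelf -S on the object file should show the three globals in their own sections, while step shows up only as a stack offset inside bump_global.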

There may be several other sections in the compiled binary, which you can find using readelf; however, not all sections need to be loaded. Some are there for information purposes only. There are also sections like .rodata, where read-only data such as constant strings or variables declared const is stored. Try using the readelf command to see the sections as shown below,

pranay@linux-y7pi:~/pks_modules/test> readelf -S test.o 
There are 13 section headers, starting at offset 0x184:

Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .text             PROGBITS        00000000 000034 00003f 00  AX  0   0  4
  [ 2] .rel.text         REL             00000000 000488 000010 08     11   1  4
  [ 3] .data             PROGBITS        00000000 000074 000004 00  WA  0   0  4
  [ 4] .bss              NOBITS          00000000 000078 000000 00  WA  0   0  4
  [ 5] .rodata           PROGBITS        00000000 000078 00000e 00   A  0   0  1
  [ 6] .comment          PROGBITS        00000000 000086 000043 01  MS  0   0  1
  [ 7] .note.GNU-stack   PROGBITS        00000000 0000c9 000000 00      0   0  1
  [ 8] .eh_frame         PROGBITS        00000000 0000cc 000058 00   A  0   0  4
  [ 9] .rel.eh_frame     REL             00000000 000498 000010 08     11   8  4
  [10] .shstrtab         STRTAB          00000000 000124 00005f 00      0   0  1
  [11] .symtab           SYMTAB          00000000 00038c 0000d0 10     12   9  4
  [12] .strtab           STRTAB          00000000 00045c 000029 00      0   0  1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings)
  I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown)
  O (extra OS processing required) o (OS specific), p (processor specific)

The sections with the A flag are the ones that require memory to be allocated when the file is loaded. Note that there's no stack section, since the stack is allocated by the OS when the executable is loaded.
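The letters in the Flg column are just bits in each section header's flags field. A minimal sketch of decoding them is shown below; the bit values are the ones defined by the ELF specification, while the enum and function names are made up for illustration (a real program would normally take the constants from elf.h).

```c
#include <stddef.h>
#include <stdint.h>

/* Flag bits of the sh_flags field, with values from the ELF specification. */
enum {
    FLAG_WRITE   = 0x1,   /* W: writable at run time */
    FLAG_ALLOC   = 0x2,   /* A: occupies memory when loaded */
    FLAG_EXEC    = 0x4,   /* X: contains executable instructions */
    FLAG_MERGE   = 0x10,  /* M: contents may be merged */
    FLAG_STRINGS = 0x20   /* S: contains null-terminated strings */
};

/*
 * Hypothetical helper: writes readelf-style letters for the given flags
 * into out, which must have room for at least 6 bytes. Returns out.
 */
char *flags_to_letters(uint32_t flags, char *out)
{
    size_t n = 0;

    if (flags & FLAG_WRITE)   out[n++] = 'W';
    if (flags & FLAG_ALLOC)   out[n++] = 'A';
    if (flags & FLAG_EXEC)    out[n++] = 'X';
    if (flags & FLAG_MERGE)   out[n++] = 'M';
    if (flags & FLAG_STRINGS) out[n++] = 'S';
    out[n] = '\0';
    return out;
}

/* Hypothetical helper: non-zero when memory must be reserved for the section. */
int needs_memory(uint32_t flags)
{
    return (flags & FLAG_ALLOC) != 0;
}
```

With these helpers, the flags of .text decode to AX and need memory, while the flags of .comment decode to MS and do not.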

One interesting section is .rel.text, which is very useful while loading the executable. As you can see, none of the sections has a meaningful Addr value. Addr is the start address of the section; however, since the final location of each section is decided later, by the linker or by whatever loads the object (the kernel's module loader, for a .ko file), the compiler doesn't fill these in and everything is done relative to address 0.

The problem with this is that there needs to be a fixup at load time: function calls, data access instructions and so on all have to be patched. There may also be a .rel.data for data, but in our case there's only .rel.text. This section is not allocated, as you can see from its missing A flag, but the loader uses it, once the sections have been given memory, to fix up function call addresses and any other instruction that refers to memory.
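A rough sketch of what one such fixup amounts to on 32 bit x86 is given below. It covers the two common kinds: absolute (S + A) and PC-relative (S + A - P), where S is the symbol's final address, A is the addend already stored at the location being patched, and P is the address of that location. The function and parameter names are made up for illustration; a real loader takes the offsets and symbols from the entries in .rel.text rather than from arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 32-bit little-endian value, as stored in an x86 object file. */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Store a 32-bit value in little-endian byte order. */
void write_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/*
 * Hypothetical helper: patch the 4 bytes at `offset` inside a section
 * that has been placed at `load_addr`. `sym_addr` is the final address
 * of the symbol being referred to. Returns the value written.
 */
uint32_t apply_fixup(uint8_t *section, uint32_t load_addr, uint32_t offset,
                     uint32_t sym_addr, int pc_relative)
{
    assert(section != NULL);

    uint32_t addend = read_le32(section + offset);  /* A: stored in place */
    uint32_t place  = load_addr + offset;           /* P: address patched */
    uint32_t value  = sym_addr + addend;            /* S + A */

    if (pc_relative)
        value -= place;                             /* S + A - P */

    write_le32(section + offset, value);
    return value;
}
```

For a call encoded as e8 fc ff ff ff, loaded at address 0x1000 and calling a function at 0x2000, the patched displacement comes out as 0xffb, which is 0x2000 minus the address of the next instruction, 0x1005.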
 

Exercise 0.1

A simple exercise would be to declare a static variable inside a function and see what happens to that variable. Then create a static global variable with the same name and see what happens to that one. Which section does it go to? Is the name of the variable the same as what you put in the code?

Exercise 0.2

Try to force a variable into the read-only section without using the const keyword. Hint: see how to use __attribute__((section(......))) when using gcc. See if you can create a new section with your own name and put the variable there instead.