TL;DR: These simpler, step-by-step methods equip you to apply BPF tracing technology to real-word problems—no specialized tools or libraries required.
BPF, a tracing technology in the Linux kernel for network stack tracing, has become popular recently thanks to new extensions that enable novel use-cases outside of BPF’s original scope. Today it can be used to implement program performance analysis tools, system and program dynamic tracing utilities, and much more.
In this blog post we’ll show you how to use the Linux implementation of BPF to write tools that access system and program events. The excellent tools from IO Visor make it possible for users to easily harness BPF technology without the considerable time investment of writing specialized tools in native code languages.
BPF itself is just a way to express a program, and a runtime interpreter for executing that program “safely.” It’s a set of specifications for virtual architecture, detailing how virtual machines dedicated to running its code should behave. The latest extensions to BPF have not only introduced new, really useful helper functions (such as reading a process’ memory), but also new registers and more stack space for the BPF bytecode.
Our main goal is to help you to take advantage of BPF and apply it to real-world problems without depending on external tools or libraries that may have been written with different goals and requirements in mind.
You can find the examples in this post in our repository. Please note that the code is simplified to focus on the concepts. This means that, where possible, we skip error checking and proper resource cleanup.
Even though we won’t be handwriting BPF assembly, it’s useful to know the code limitations since the in-kernel verifier will reject our instructions if we break its rules.
BPF programs are extremely simple, being made of only a single function. Instructions are sent to the kernel as an array of opcodes, meaning there’s no executable file format involved. Without sections, it’s not possible to have things like global variables or string literals; everything has to live on the stack, which can only hold up to 512 bytes. Branches are allowed, but it is only since kernel version 5.3 that jump opcodes can go backward—provided the verifier can prove the code will not execute forever.
The only other way to use loops without requiring recent kernel versions is to unroll them, but this will potentially use a lot of instructions, and older Linux versions will not load any program that exceeds the 4096 opcode count limit (see BPF_MAXINSNS under linux/bpf_common.h). Error handling in some cases is mandatory, and the verifier will prevent you from using resources that may fail initialization by rejecting the program.
These limitations are extremely important since these programs can get hooked on kernel code. When a verifier challenges the correctness of the code, it’s possible to prevent system crashes or slowdowns from loading malformed code.
To make BPF programs truly useful, they need ways to communicate with a user mode process and manage long-term data, i.e., via maps and perf event outputs.
Although many map types exist, they all essentially behave like key-value databases, and are commonly used to share data between user modes and/or other programs. Some of these types store data in per-CPU storage, making it easy to save and retrieve state when the same BPF program is run concurrently from different CPU cores.
Perf event outputs are generally used to send data to user mode programs and services, and are implemented as circular buffers.
Without some data to process, our programs will just sit around doing nothing. BPF probes on Linux can be attached to several different event sources. For our purpose, we’re mainly interested in function tracing events.
Similar to code hooking, BPF programs can be attached to any function. The probe type depends on where the target code lives. Kprobes are used when tracing kernel functions, while Uprobes are used when working with user mode libraries or binaries.
While Kprobe and Uprobe events are emitted when entering the monitored functions, Kretprobe and Uretprobe events are generated whenever the function returns. This works correctly even if the function being traced has multiple exit points. This kind of event does not forward typed syscall parameters and only comes with a pt_regs structure that contains the register values at the time of the call. Knowledge about the function prototype and system ABI is required to map back the function arguments to the right register.
It’s not always ideal to rely on function hooking when writing a tool, because the risk of breakage increases as the kernel or software gets updated. In most cases, it’s best to use a more stable event source such as a tracepoint.
There are two types of tracepoints:
Both types of tracepoints are defined in the source code by the programmer, essentially defining a stable interface that shouldn’t change unless strictly necessary.
If DebugFS has been enabled and mounted, registered tracepoints will all appear under the /sys/kernel/debug/tracing folder. Similar to Kprobes and Kretprobes, each system call defined in the Linux kernel comes with two different tracepoints. The first one, sys_enter, is activated whenever a program in the system transitions to a syscall handler inside the kernel, and carries information about the parameters that have been received. The second (and last) one, sys_exit, only contains the exit code of the function and is invoked whenever the syscall function terminates.
Even though there’s no plan to use external libraries, we still have a few dependencies. The most important thing is to have access to a recent LLVM toolchain compiled with BPF support. If your system does not satisfy this requirement, it is possible—and actually encouraged—to make use of the osquery toolchain. You’ll also need CMake, as that’s what I use for the sample code.
When running inside the BPF environment, our programs make use of special helper functions that require a kernel version that’s at least above 4.18. While it’s possible to avoid using them, it would severely limit what we can do from our code.
Using Ubuntu 20.04 or equivalent is a good bet, as it comes with both a good kernel version and an up-to-date LLVM toolchain with BPF support.
Some LLVM knowledge is useful, but the code doesn’t require any advanced LLVM expertise. The Kaleidoscope language tutorial on the official site is a great introduction, if needed.
There are many new concepts to introduce, so we’ll start simple: our first example loads a program that returns without doing anything.
First, we create a new LLVM module and a function that contains our logic:
Since we’re not going to handle event arguments, the function we created does not accept any parameters. Not much else is happening here except the return instruction. Remember, each BPF program has exactly one function, so it’s best to ask LLVM to store them in separate sections. This makes it easier to retrieve them once the module is compiled.
We can now JIT our module to BPF bytecode using the ExecutionEngine class from LLVM:
Our custom SectionMemoryManager class mostly acts as a passthrough to the original SectionMemoryManager class from LLVM—it’s only there to keep track of the sections that the ExecutionEngine object creates when compiling our IR.
Once the code is built, we get back a vector of bytes for each function that was created inside the module:
Loading the program is not hard, but as you may have noticed, there is no helper function defined for the bpf() system call we’re using. The tracepoint is the easiest event type to set up, and it’s what we’re using for the time being.
Once the BPF_PROG_LOAD command is issued, the in-kernel verifier will validate our program and also provide a disassembly of it inside the log buffer we’ve provided. The operation will fail if kernel output is longer than the bytes available, so only provide a log buffer in production code if the load has already failed.
Another important field in the attr union is the program license; specifying any value other than GPL may disable some of the features that are exposed to BPF. I’m not a licensing expert, but it should be possible to use different licenses for the generator and the generated code (but please speak to a lawyer and/or your employer first!).
We can now assemble the main() function using the helpers we built:
If everything works correctly, no error is printed when the binary is run as the root user. You can find the source code for the empty program in the 00-empty folder of the companion code repository.
But...this program isn’t very exciting, since it doesn’t do anything! Now we’ll update it so we can execute it when a certain system event happens.
In order to actually execute our BPF programs, we have to attach them to an event source.
Creating a new tracepoint event is easy; it only involves reading and writing some files from under the debugfs folder:
To create the event file descriptor, we have to find the tracepoint identifier, which is in a special file called (unsurprisingly) “id.”
For our last step, we attach the program to the tracepoint event we just created. This is trivial and can be done with a couple of ioctl calls on the event’s file descriptor:
Our program should finally succeed in running our BPF code, but no output is generated yet since our module only really contained a return opcode. The easiest way to generate some output is to use the bpf_trace_printk helper to print a fixed string:
Importing new helper functions from BPF is quite easy. The first thing we need is the prototype, which can be taken from the linux/bpf.h include header. The one relative to printk reads as follows:
Once the function type matches, we only have to assemble a call that uses the helper function ID as the destination address: BPF_FUNC_trace_printk. The generatePrintk function can now be added to our program right before we create the return instruction inside generateBPFModule.
The full source code for this program can be found in the 01-hello_open folder.
Running the program again will show the “Hello!!” string inside the /sys/kernel/debug/tracing/trace_pipe file every time the tracepoint event is emitted. Using text output can be useful, but due to the BPF VM limitations the printf helper is not as useful as can be in a standard C program.
In the next section, we’ll take a look at maps and how to use them as data storage for our programs.
Using maps to store data
Maps are a major component in most programs, and can be used in a number of different ways. Since they’re accessible from both kernel and user mode, they can be useful in storing data for later processing either from additional probes or user programs. Given the limitations that BPF imposes, they’re also commonly used to provide scratch space for handling temporary data that does not fit on the stack.
There are many map types; some are specialized for certain uses, such as storing stack traces. Others are more generic, and suitable for use as custom data containers.
Concurrency and thread safety are not just user mode problems, and BPF comes with two really useful special map types that have dedicated storage for storing values in CPU scope. These maps are commonly used to replace the stack, as a per-CPU map can be easily referenced by programs without having to worry about synchronization.
It’s rather simple to create and use maps since they all share the same interface, regardless of type. The following table, taken from the BPF header file comments, documents the most common operations:
The only important thing to remember is that when operating on per-CPU maps the value is not just a single entry, but an array of values that has as many items as CPU cores.
Creating a map
Before we can create our map, we have to determine which type we want to use. The following enum declaration has been taken from the linux/bpf.h header file:
Most of the time we’ll use hash maps and arrays. We have to create a bpf_attr union, initializing key and value size as well as the maximum amount of entries it can hold.
Not every available operation always makes sense for all map types. For example, it’s not possible to delete entries when working with an array. Lookup operations are also going to behave differently, as they will only fail when the specified index is beyond the last element.
Here’s the code to read a value from a map:
Writing a BPF program to count syscall invocations
In this example we’ll build a probe that counts how many times the tracepoint we're tracing gets called. We’ll create a counter for each processor core, using a per-CPU array map that only contains a single item.
Referencing this map from the BPF code is not too hard but requires some additional operations:
The map address can be obtained through a special LLVM intrinsic called “pseudo.”
Importing the bpf_map_lookup_elem helper function follows the same procedure we used to import the bpf_trace_printk one. Looking at the linux/bpf.h, the prototype reads:
Notice how the key parameter is passed by pointer and not by value. We’ll have to allocate the actual key on the stack using CreateAlloca. Since allocations should always happen in the first (entry) basic block, our function will accept a pre-filled buffer as key. The return type is a void pointer, but we can save work if we directly declare the function with the correct value type.
Back to the BPF program generator, we can now call the new bpfMapLookupElem to retrieve the first value in our array map:
Since we are using a per-CPU array map, the pointer that returns from this function references a private array entry for the core we’re running on. Before we can use it, however, we have to test whether the function has succeeded; otherwise, the verifier will reject the program. This is trivial and can be done with a comparison instruction and a new basic block.
The pointer to the counter value can now be dereferenced without causing a validation error from the verifier.
There is no need to import and use the bpf_map_update_elem() helper function since we can directly increment the value from the pointer we received. We only have to load the value from the pointer, increment it, and then store it back where it was.
Once we have finished with our tracer, we can retrieve the counters and inspect them:
When dealing with per-CPU maps, it is important to not rely on get_nprocs_conf and use /sys/devices/system/cpu/possible instead. On VMware Fusion for example, the vcpu.hotadd setting will cause Linux to report 128 possible CPUs when enabled, regardless of how many cores have been actually assigned to the virtual machine.
The full sample code can be found in the 02-syscall_counter folder.
One interesting experiment is to attach this program to the system call tracepoint used by the chmod command line tool to update file modes. The strace debugging utility can help determine which syscall is being used. In this case we are going to be monitoring the following tracepoint: syscalls/sys_enter_fchmodat.
The taskset command can be altered to force the fchmodat syscall to be called from a specific processor:
Maps can be a really powerful way to store data for later processing, but it’s impossible for user mode programs to know when and where new data is available for reading.
Perf event outputs can help solve this problem, since they enable the program to be notified whenever new data is available. Additionally, since they behave like a circular buffer, we do not have the same size limitations we have when setting map values.
In this section, we’ll build an application that can measure how much time it takes to handle a system call. To make this work, we’ll attach a program to both the entry and exit points of a tracepoint to gather timestamps.
Before we start creating our perf output, we have to create a structure to hold our resources. In total, we’ll have a file descriptor for the map and then a perf output per processor, along with its own memory mapping.
To initialize it, we have to create a BPF map of the type PERF_EVENT_ARRAY first. This special data structure maps a specific CPU index to a private perf event output specified as a file descriptor. For it to function properly, we must use the following parameters when creating the map:
When we looked at maps in the previous sections, we only focused on reading. For the next steps we also need to write new values, so let’s take a look at how to set keys.
This is not too different from how we read map values, but this time we don’t have to deal with the chance that the key may not be present. As always when dealing with per-CPU maps, the data pointer should be considered as an array containing one value per CPU.
The next step is to create a perf event output for each online processor with the perf_event_open system call, using the special PERF_COUNT_SW_BPF_OUTPUT config value.
Now that we have the file descriptors, we can populate the perf event array map we created:
Finally, we create a memory mapping for each perf output:
This is the memory we’ll read from when capturing the BPF program output.
Writing a BPF program to profile system calls
Now that we have a file descriptor of the perf event array map, we can use it from within the BPF code to send data with the bpf_perf_event_output helper function. Here’s the prototype from linux/bpf.h:
The ctx parameter must be always set to the value of the first argument received in the entry point function of the BPF program.
The map address is obtained with the LLVM pseudo intrinsic that we imported in the previous section. Data and size are self-explanatory, but it is important to remember that the memory pointer must reside inside the BPF program (i.e., we can’t pass a user pointer).
The last parameter, flags, can be used as a CPU index mask to select the perf event output this data should be sent to. A special value can be passed to ask the BPF VM to automatically use the index of the processor we’re running on.
The file descriptor and flags parameters are most likely known at compile time, so we can make the function a little more user friendly by accepting integer types. The buffer size, however, is often determined at runtime, so it’s best to use an llvm::Value pointer.
While it’s possible to just send the raw timestamps whenever we enter and leave the system call of our choice, it’s much easier and more efficient to compute what we need directly inside the BPF code. To do this we’ll use a per-CPU hash map shared across two different BPF programs: one for the sys_enter event, and another one for the sys_exit.
From the enter program, we’ll save the system timestamp in the map. When the exit program is invoked, we’ll retrieve it and use it to determine how much time it took. The resulting value is then sent to the user mode program using the perf output.
Creating the map is easy, and we can re-use the map helpers we wrote in the previous sections. Both the timestamp and the map key are 64-bit values, so we’ll use 8 bytes for both:
Writing the enter program
We will need to generate a key for our map. A combination of the process ID and thread ID is a good candidate for this:
Then the system timestamp needs to be acquired. Even though the ktime_get_ns helper function counts the time from the boot, it’s still a good alternative since we only have to use it to calculate the execution time.
By now you should be well versed in importing them, so here are the two definitions:
We can now use the newly defined functions to generate a map key and acquire the system timestamp:
For this program we have replaced the array map we used in the previous sections with a hash map. It’s no longer possible to use the bpf_map_lookup_elem() helper since the map key we have will fail with ENOENT if the element does not exist.
To fix this, we have to import a new helper named bpf_map_update_elem():
We’ll keep the map file descriptor and flag values as integers, since we know their values before the module is compiled.
We can now store the timestamp inside the map and close the enter program:
Writing the exit program
In this program we’ll retrieve the timestamp we stored and use it to measure how much time we’ve spent inside the system call. Once we have the result, we’ll send it to user mode using the perf output.
When creating the llvm::Function for this program, we must define at least one argument. This value will be required later for the ctx parameter that we have to pass to the bpf_perf_event_output() helper.
First we have to acquire the map entry; as always, we must check for any possible error or the verifier will not let us load our program.
Next, we want to read our previous timestamp and subtract it from the current time:
The bpf_perf_event_output expects a buffer, so we have to store our result somewhere in memory. We can re-use the map value address so we don’t have to allocate more stack space:
Remember, we have to pass the first program argument to the ctx parameter; the arg_begin method of an llvm::Function will return exactly that. When sending data, the bpf_perf_event_output() helper expects a pointer. We can re-use the timestamp pointer we obtained from the map and avoid allocating additional memory to the very limited stack we have:
Using -1UL as the flag value means that BPF will automatically send this data to the perf event output associated with the CPU we’re running on.
Reading data from the perf outputs
In our user mode program, we can access the perf buffers through the memory mappings we created. The list of perf event output descriptors can be used together with the poll() function using an array of pollfd structures. When one of the fd we have set is readable, the corresponding memory mapping will contain the data sent by the BPF program.
Inside the memory we have mapped, the perf_event_mmap_page header will describe the properties and boundaries of the allocated circular buffer.
The structure is too big to be reported here, but the most important fields are:
The base of the data allocation is located at the offset data_offset; to find the start of our buffer, however, we have to add it to the data_tail value, making sure to wrap around whenever we exceed the data allocation size specified by the data_size field:
Similarly, the data_head field can be used to find the end of the buffer:
If the end of the buffer is at a lower offset compared to the start, then data is wrapping at the data_size edge and the read has to happen with two operations.
When extracting data, the program is expected to confirm the read by updating the data_tail value and adding the amount of bytes processed, while the kernel will advance the data_head field automatically as new bytes are received. Data is lost when the data_head offset wraps around and crosses data_tail; a special structure inside this buffer will warn the program if this happens.
Program data is packaged inside the data we have just extracted, preceded by two headers. The first one is the perf_event_header structure:
The second one is an additional 32-bit size field that accounts for itself and the data that follows. Multiple consecutive writes from the BPF program may be added under the same object. Data is, however, grouped by type, which can be used to determine what kind of data to expect after the header. When using BPF, we’ll only have to deal with either our data or a notification of type PERF_RECORD_LOST, which is used to inform the program that a bpf_perf_event_output() call has overwritten data in the ring buffer before we could have a chance to read it.
Here’s some annotated code that shows how the whole procedure works:
Writing the main function
While it is entirely possible (and sometimes useful, in order to share types) to use a single LLVM module and context for both the enter and exit programs, we will create two different modules to avoid changing the previous sample code we’ve built.
The program generation goes through the usual steps, but now we are loading two instead of one, so the previous code has been changed to reflect that.
The new and interesting part is the main loop where the perf event output data is read and processed:
The full source code can be found in the 03-syscall_profiler folder.
Running the sample program as root should print something similar to the following output:
BPF is in active development and is becoming more and more useful with each update, enabling new use cases that extend the original vision. Recently, newly added BPF functionality allowed us to write a simple system-wide syscall fault injector using nothing but BPF and a compatible kernel that supported the required bpf_override_return functionality.
If you want to keep up with how this technology evolves, one of the best places to start with is Brendan’s Gregg blog. The IO Visor Project repository also contains a ton of code and documentation that is extremely useful if you plan on writing your own BPF-powered tools.