2023-08-30 17:31:07 +02:00
|
|
|
###############
Timerlat tracer
###############
|
|
|
|
|
|
|
|
The timerlat tracer aims to help preemptive kernel developers find
sources of wakeup latency in real-time threads. Like cyclictest,
the tracer sets a periodic timer that wakes up a thread. The thread then
computes a *wakeup latency* value as the difference between the *current
time* and the *absolute time* at which the timer was set to expire. The
main goal of timerlat is to trace in a way that helps kernel developers.
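The latency computation described above can be sketched in user space, in the spirit of cyclictest. This is only an illustration under stated assumptions, not the tracer's in-kernel code: the helper names ``ts_to_ns`` and ``wakeup_latency_ns`` are hypothetical.

```c
#include <stdint.h>
#include <time.h>

/* Convert a timespec to nanoseconds. */
static int64_t ts_to_ns(struct timespec ts)
{
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Wakeup latency: the time observed on wakeup minus the absolute
 * time at which the timer was programmed to expire.
 */
static int64_t wakeup_latency_ns(struct timespec now, struct timespec expiry)
{
	return ts_to_ns(now) - ts_to_ns(expiry);
}
```

A periodic loop would typically sleep with clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &expiry, NULL), read the clock with clock_gettime(CLOCK_MONOTONIC, &now) right after waking up, and pass both values to the helper.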
|
|
|
|
|
|
|
|
Usage
-----
|
|
|
|
|
|
|
|
Write the ASCII text "timerlat" into the current_tracer file of the
tracing system (generally mounted at /sys/kernel/tracing).
|
|
|
|
|
|
|
|
For example::
|
|
|
|
|
|
|
|
        [root@f32 ~]# cd /sys/kernel/tracing/
        [root@f32 tracing]# echo timerlat > current_tracer
|
|
|
|
|
|
|
|
It is possible to follow the trace by reading the trace file::
|
|
|
|
|
|
|
|
        [root@f32 tracing]# cat trace
        # tracer: timerlat
        #
        #                              _-----=> irqs-off
        #                             / _----=> need-resched
        #                            | / _---=> hardirq/softirq
        #                            || / _--=> preempt-depth
        #                            || /
        #                            ||||             ACTIVATION
        #         TASK-PID      CPU# ||||   TIMESTAMP    ID            CONTEXT                LATENCY
        #            | |         |   ||||      |         |                  |                       |
                <idle>-0       [000] d.h1    54.029328: #1     context    irq timer_latency       932 ns
                 <...>-867     [000] ....    54.029339: #1     context thread timer_latency     11700 ns
                <idle>-0       [001] dNh1    54.029346: #1     context    irq timer_latency      2833 ns
                 <...>-868     [001] ....    54.029353: #1     context thread timer_latency      9820 ns
                <idle>-0       [000] d.h1    54.030328: #2     context    irq timer_latency       769 ns
                 <...>-867     [000] ....    54.030330: #2     context thread timer_latency      3070 ns
                <idle>-0       [001] d.h1    54.030344: #2     context    irq timer_latency       935 ns
                 <...>-868     [001] ....    54.030347: #2     context thread timer_latency      4351 ns
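For post-processing, each sample line in the output above can be split into its ACTIVATION ID, CONTEXT, and LATENCY fields. The following is a minimal sketch that assumes the line layout shown in this document; ``struct timerlat_sample`` and ``parse_timerlat_line`` are hypothetical names, not part of any kernel or tooling API.

```c
#include <stdio.h>
#include <string.h>

struct timerlat_sample {
	unsigned long id;          /* ACTIVATION ID, the number after '#' */
	char context[8];           /* "irq" or "thread" */
	unsigned long long lat_ns; /* LATENCY column, in nanoseconds */
};

/*
 * Parse one timerlat sample line from the trace file.
 * Returns 0 on success, -1 if the line is not a timerlat sample
 * (for example, a header line or an osnoise: event).
 */
static int parse_timerlat_line(const char *line, struct timerlat_sample *s)
{
	/* The sample starts at the '#' that follows the timestamp. */
	const char *p = strstr(line, ": #");

	if (!p)
		return -1;

	if (sscanf(p, ": #%lu context %7s timer_latency %llu ns",
		   &s->id, s->context, &s->lat_ns) != 3)
		return -1;

	return 0;
}
```

Two samples from the same CPU with the same ``id``, one with context "irq" and one with context "thread", belong to the same activation.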
|
|
|
|
|
|
|
|
|
|
|
|
The tracer creates a per-cpu kernel thread with real-time priority that
prints two lines at every activation. The first is the *timer latency*
observed at the *hardirq* context before the activation of the thread.
The second is the *timer latency* observed by the thread. The ACTIVATION
ID field serves to relate the *irq* execution to its respective *thread*
execution.
|
|
|
|
|
|
|
|
The *irq*/*thread* split is important to clarify which context an
unexpectedly high value comes from. The *irq* context can be
delayed by hardware-related actions, such as SMIs, NMIs, and IRQs,
or by threads masking interrupts. Once the timer fires, the delay
can also be influenced by blocking caused by threads, for example, by
postponing the scheduler execution via preempt_disable(), by the
scheduler execution itself, or by masking interrupts. Threads can also
be delayed by interference from other threads and IRQs.
|
|
|
|
|
|
|
|
Tracer options
---------------------
|
|
|
|
|
|
|
|
The timerlat tracer is built on top of the osnoise tracer, so its
configuration is also done in the osnoise/ config directory.
The timerlat configs are:
|
|
|
|
|
|
|
|
 - cpus: CPUs on which a timerlat thread will execute.
 - timerlat_period_us: the period of the timerlat thread, in microseconds.
 - stop_tracing_us: stop the system tracing if a timer latency at the
   *irq* context is higher than the configured value, in microseconds.
   Writing 0 disables this option.
 - stop_tracing_total_us: stop the system tracing if a timer latency at
   the *thread* context is higher than the configured value, in
   microseconds. Writing 0 disables this option.
 - print_stack: save the stack of the IRQ occurrence. The stack is printed
   after the *thread context* event, or at the IRQ handler if *stop_tracing_us*
   is hit.
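The options above are plain files, so a program can set them the same way a shell redirection does. The sketch below is an assumption-laden illustration: ``osnoise_write_option`` is a hypothetical helper, it only handles integer values, and the caller is expected to pass the osnoise/ directory of the mounted tracing system (for example /sys/kernel/tracing/osnoise) and to have the required privileges.

```c
#include <stdio.h>

/*
 * Write an integer value to <osnoise_dir>/<option>.
 * Returns 0 on success, -1 on error.
 */
static int osnoise_write_option(const char *osnoise_dir, const char *option,
				long value)
{
	char path[512];
	FILE *fp;
	int ret = 0;
	int n;

	n = snprintf(path, sizeof(path), "%s/%s", osnoise_dir, option);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	fp = fopen(path, "w");
	if (!fp)
		return -1;

	if (fprintf(fp, "%ld\n", value) < 0)
		ret = -1;

	/* The buffered data reaches the file here, so check fclose() too. */
	if (fclose(fp) != 0)
		ret = -1;

	return ret;
}
```

For instance, osnoise_write_option("/sys/kernel/tracing/osnoise", "stop_tracing_total_us", 25) would mirror writing 25 to that file from the shell.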
|
|
|
|
|
|
|
|
timerlat and osnoise
----------------------------
|
|
|
|
|
|
|
|
The timerlat tracer can also take advantage of the osnoise: trace events.
For example::
|
|
|
|
|
|
|
|
        [root@f32 ~]# cd /sys/kernel/tracing/
        [root@f32 tracing]# echo timerlat > current_tracer
        [root@f32 tracing]# echo 1 > events/osnoise/enable
        [root@f32 tracing]# echo 25 > osnoise/stop_tracing_total_us
        [root@f32 tracing]# tail -10 trace
             cc1-87882   [005] d..h...   548.771078: #402268 context    irq timer_latency     13585 ns
             cc1-87882   [005] dNLh1..   548.771082: irq_noise: local_timer:236 start 548.771077442 duration 7597 ns
             cc1-87882   [005] dNLh2..   548.771099: irq_noise: qxl:21 start 548.771085017 duration 7139 ns
             cc1-87882   [005] d...3..   548.771102: thread_noise:      cc1:87882 start 548.771078243 duration 9909 ns
      timerlat/5-1035    [005] .......   548.771104: #402268 context thread timer_latency     39960 ns
|
|
|
|
|
|
|
|
In this case, the timer latency does not have a single root cause but
multiple ones. First, the timer IRQ was delayed for 13 us, which may
point to a long IRQ disabled section (see the IRQ stacktrace section).
Then the timer interrupt that wakes up the timerlat thread took 7597 ns,
and the qxl:21 device IRQ took 7139 ns. Finally, the cc1 thread noise
took 9909 ns before the context switch. Such evidence helps the
developer choose other tracing methods to debug and optimize the system.
|
|
|
|
|
|
|
|
It is worth mentioning that the *duration* values reported
by the osnoise: events are *net* values. For example, the
thread_noise does not include the duration of the overhead caused
by the IRQ execution (which accounted for 14736 ns, that is,
7597 ns + 7139 ns). But the values reported by the timerlat tracer
(timer_latency) are *gross* values.
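Because the osnoise: durations are net values, the time spent in IRQs has to be added up separately from the thread noise. A minimal sketch of that accounting follows, assuming the "duration N ns" suffix shown in the trace output of this section; ``noise_duration_ns`` and ``sum_noise_ns`` are hypothetical helper names.

```c
#include <stdio.h>
#include <string.h>

/*
 * Extract the "duration N ns" value of an osnoise: event line.
 * Returns 0 on success, -1 if the line carries no duration.
 */
static int noise_duration_ns(const char *line, unsigned long long *ns)
{
	const char *p = strstr(line, "duration ");

	if (!p || sscanf(p, "duration %llu ns", ns) != 1)
		return -1;

	return 0;
}

/* Sum the net durations of every line that contains the event name. */
static unsigned long long sum_noise_ns(const char *const *lines, size_t n,
				       const char *event)
{
	unsigned long long total = 0, ns;
	size_t i;

	for (i = 0; i < n; i++) {
		if (strstr(lines[i], event) &&
		    noise_duration_ns(lines[i], &ns) == 0)
			total += ns;
	}

	return total;
}
```

Summing the "irq_noise:" lines of one activation gives the IRQ time that is excluded from the thread_noise duration reported for the same window.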
|
|
|
|
|
|
|
|
The art below illustrates a CPU timeline and how the timerlat tracer
observes it at the top and the osnoise: events at the bottom. Each "-"
in the timelines means circa 1 us, and the time moves ==>::
|
|
|
|
|
|
|
|
      External     timer irq                   thread
       clock        latency                    latency
       event        13585 ns                   39960 ns
         |             ^                         ^
         v             |                         |
         |-------------|                         |
         |-------------+-------------------------|
                       ^                         ^
  ========================================================================
                    [tmr irq]  [dev irq]
  [another thread...^       v..^       v.......][timerlat/ thread]  <-- CPU timeline
  =========================================================================
                    |-------|  |-------|
                            |--^       v-------|
                            |          |       |
                            |          |       + thread_noise: 9909 ns
                            |          +-> irq_noise: 7139 ns
                            +-> irq_noise: 7597 ns
|
|
|
|
|
|
|
|
IRQ stacktrace
---------------------------
|
|
|
|
|
|
|
|
The osnoise/print_stack option is helpful for the cases in which thread
noise is the major factor for the timer latency, because of preemption
or IRQs being disabled. For example::
|
|
|
|
|
|
|
|
        [root@f32 tracing]# echo 500 > osnoise/stop_tracing_total_us
        [root@f32 tracing]# echo 500 > osnoise/print_stack
        [root@f32 tracing]# echo timerlat > current_tracer
        [root@f32 tracing]# tail -21 per_cpu/cpu7/trace
          insmod-1026    [007] dN.h1..   200.201948: irq_noise: local_timer:236 start 200.201939376 duration 7872 ns
          insmod-1026    [007] d..h1..   200.202587: #29800 context    irq timer_latency      1616 ns
          insmod-1026    [007] dN.h2..   200.202598: irq_noise: local_timer:236 start 200.202586162 duration 11855 ns
          insmod-1026    [007] dN.h3..   200.202947: irq_noise: local_timer:236 start 200.202939174 duration 7318 ns
          insmod-1026    [007] d...3..   200.203444: thread_noise:   insmod:1026 start 200.202586933 duration 838681 ns
      timerlat/7-1001    [007] .......   200.203445: #29800 context thread timer_latency    859978 ns
      timerlat/7-1001    [007] ....1..   200.203446: <stack trace>
  => timerlat_irq
  => __hrtimer_run_queues
  => hrtimer_interrupt
  => __sysvec_apic_timer_interrupt
  => asm_call_irq_on_stack
  => sysvec_apic_timer_interrupt
  => asm_sysvec_apic_timer_interrupt
  => delay_tsc
  => dummy_load_1ms_pd_init
  => do_one_initcall
  => do_init_module
  => __do_sys_finit_module
  => do_syscall_64
  => entry_SYSCALL_64_after_hwframe
|
|
|
|
|
|
|
|
In this case, it is possible to see that the thread added the highest
contribution to the *timer latency*, and the stack trace, saved during
the timerlat IRQ handler, points to a function named
dummy_load_1ms_pd_init, which had the following code (on purpose)::
|
|
|
|
|
|
|
|
	static int __init dummy_load_1ms_pd_init(void)
	{
		preempt_disable();
		mdelay(1);
		preempt_enable();
		return 0;
	}
|
|
|
|
|
|
|
|
User-space interface
---------------------------
|
|
|
|
|
|
|
|
Timerlat allows user-space threads to use the timerlat infrastructure to
measure scheduling latency. This interface is accessible via a per-CPU
file descriptor inside $tracing_dir/osnoise/per_cpu/cpu$ID/timerlat_fd.
|
|
|
|
|
|
|
|
This interface is accessible under the following conditions:
|
|
|
|
|
|
|
|
 - The timerlat tracer is enabled
 - The osnoise workload option is set to NO_OSNOISE_WORKLOAD
 - The user-space thread is affined to a single processor
 - The thread opens the file associated with its single processor
 - Only one thread can access the file at a time
|
|
|
|
|
|
|
|
The open() syscall will fail if any of these conditions are not met.
After opening the file descriptor, the user-space thread can read from it.
|
|
|
|
|
|
|
|
The read() system call will run timerlat code that arms the timer in
the future and waits for it, as the regular kernel thread does.
|
|
|
|
|
|
|
|
When the timer IRQ fires, the timerlat IRQ handler will execute, report
the IRQ latency, and wake up the thread waiting in the read. The thread
will be scheduled and report the thread latency via the tracer, as the
kernel thread does.
|
|
|
|
|
|
|
|
The difference from the in-kernel timerlat is that, instead of re-arming
the timer, timerlat will return from the read() system call. At this
point, the user can run any code.
|
|
|
|
|
|
|
|
If the application rereads the timerlat file descriptor, the tracer
will report the return from user-space latency, which is the total
latency. If this is the end of the work, it can be interpreted as the
response time for the request.
|
|
|
|
|
|
|
|
After reporting the total latency, timerlat will restart the cycle, arm
a timer, and go to sleep for the following activation.
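The cycle described above (wait in read(), run user code, read again) can be sketched as a small helper. This is an illustrative assumption rather than an official pattern: ``timerlat_run_cycles`` is a hypothetical name, and the file descriptor is expected to be an already opened timerlat_fd that satisfies the conditions listed earlier.

```c
#include <unistd.h>

/*
 * Run up to max_cycles activations on an already opened timerlat_fd,
 * calling work() (if not NULL) after each wakeup. Returns the number
 * of completed cycles; stops early if read() fails.
 */
static int timerlat_run_cycles(int timerlat_fd, int max_cycles,
			       void (*work)(void))
{
	char buffer[1024];
	int cycles = 0;

	while (cycles < max_cycles) {
		/*
		 * Waits for the next activation. From the second call on,
		 * re-entering read() is also what marks the return from
		 * user space for the previous cycle.
		 */
		if (read(timerlat_fd, buffer, sizeof(buffer)) < 0)
			break;

		/* User-space load; its duration is part of the total latency. */
		if (work)
			work();

		cycles++;
	}

	return cycles;
}
```

Keeping the work in a callback makes it clear which code runs between two reads, and therefore which code contributes to the reported total latency.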
|
|
|
|
|
|
|
|
If at any time one of the conditions is broken, e.g., the thread migrates
while in user space, or the timerlat tracer is disabled, the SIGKILL
signal will be sent to the user-space thread.
|
|
|
|
|
|
|
|
Here is a basic example of user-space code for timerlat::
|
|
|
|
|
|
|
|
	#define _GNU_SOURCE	/* for gettid(), sched_setaffinity() and CPU_* macros */
	#include <errno.h>
	#include <fcntl.h>
	#include <sched.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		char buffer[1024];
		int timerlat_fd;
		int retval;
		long cpu = 0;	/* place in CPU 0 */
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(cpu, &set);

		/* The thread must be affined to a single processor. */
		if (sched_setaffinity(gettid(), sizeof(set), &set) == -1)
			return 1;

		snprintf(buffer, sizeof(buffer),
			"/sys/kernel/tracing/osnoise/per_cpu/cpu%ld/timerlat_fd",
			cpu);

		timerlat_fd = open(buffer, O_RDONLY);
		if (timerlat_fd < 0) {
			printf("error opening %s: %s\n", buffer, strerror(errno));
			exit(1);
		}

		/* Each read() waits for the next timerlat activation. */
		for (;;) {
			retval = read(timerlat_fd, buffer, 1024);
			if (retval < 0)
				break;
		}

		close(timerlat_fd);
		exit(0);
	}
|