Systems Performance: Enterprise and the Cloud — Brendan Gregg
Chapter 6 CPU
Concepts
CPU Run Queue: a queue of runnable threads that are waiting to be serviced by CPUs. For Solaris, it is often called a dispatch queue.
Instruction Pipeline: a CPU architecture feature that allows multiple instructions to be executed in parallel by splitting execution into stages. (Note this is different from HTTP/2 pipelining; a CPU pipeline is real hardware parallelism.)
Instruction Width: multiple functional units of the same type can be included, so that even more instructions can make forward progress with each clock cycle. Modern processors are 3–4 wide.
CPI: Cycles per instruction (IPC: instructions per cycle)
CPU utilization: the percentage of time a CPU instance is busy performing work during an interval. A CPU may be highly utilized yet mostly stalled waiting for memory I/O rather than executing instructions. CPU utilization is often split into separate kernel- and user-time metrics.
User-Time/Kernel-Time: kernel time includes time spent in system calls, kernel threads, and interrupts; CPU time spent executing user-level application code is user time. A computation-intensive application (e.g., image processing) has a high user/kernel ratio; an I/O-intensive one has a high system call rate and thus a lower user/kernel ratio.
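The user/kernel split can be observed directly with the shell's time builtin. A rough sketch contrasting a CPU-bound pipeline with a syscall-heavy one (workloads chosen arbitrarily for illustration):

```shell
# Mostly user time: compressing a stream of zeros is pure computation.
time dd if=/dev/zero bs=1M count=256 2>/dev/null | gzip > /dev/null

# Mostly system time: 1-byte blocks force a syscall per byte copied.
time dd if=/dev/zero of=/dev/null bs=1 count=200000 2>/dev/null
```

In the first case expect user to dominate sys; in the second, sys dominates because nearly every cycle is spent entering and leaving the kernel.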
Priority inversion: a lower-priority thread holds a resource and blocks a higher-priority thread from running. The system can mitigate this by temporarily boosting the priority of the lower-priority thread (priority inheritance).
CPU Architecture
CPU hardware includes the processor and its subsystems, and the CPU interconnect for multiprocessor systems. The control unit is the heart of the CPU, performing instruction fetch and decode, managing execution, and storing results.
CPU caches include:
- Level 1 instruction cache & data cache (~0.5 ns)
- Translation Lookaside Buffer (TLB)
- Level 2 cache (~5 ns)
- Level 3 cache, optional (~30 ns)
- Main memory (~100 ns)
Cache line: a range of bytes that are stored and transferred as a unit, improving memory throughput.
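On Linux, the cache line size can be queried from userspace; a quick sketch (exact values vary by CPU, though 64-byte lines are typical on x86):

```shell
# Cache line size in bytes (commonly 64 on x86):
getconf LEVEL1_DCACHE_LINESIZE

# The same information exposed via sysfs:
cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
```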
Cache coherency: when one CPU modifies memory, all other CPUs must learn that their cached copy of that data is stale and discard it.
MMU: memory management unit, responsible for virtual-to-physical address translation. TLB is used as a cache for address translations. Cache misses are satisfied by translation tables in main memory (page tables).
Interconnects: connect processors to each other, and connect processors to other system components.
CPU Performance Counters. Processor registers that can be programmed to count low-level CPU activity:
- CPU cycles: including stall cycles & types of stall cycles
- CPU instructions: executed
- Level 1,2,3 cache accesses: hit, miss
- Floating-point unit: operations
- Memory I/O: read/write/stall cycles
- Resource I/O: read/write/stall cycles
Idle Thread: runs on a CPU when there is no other runnable thread. It informs the processor that execution may be halted or throttled down to conserve power; the CPU wakes up on the next hardware interrupt.
Methodology
Tools:
- uptime: load averages show whether CPU load is increasing or decreasing. A load average above the number of CPUs in the system usually indicates saturation.
- vmstat: check the idle column to see how much headroom there is. Less than 10% can be a problem.
- mpstat: check for individual hot CPUs, identifying possible thread scalability issues.
- top/prstat: which processes and users are top CPU consumers.
- pidstat/prstat: break down the top CPU consumers into user-system time.
- perf/dtrace/stap/oprofile: profile CPU usage stack traces for either user/kernel time, to identify why the CPUs are in use.
- perf/cpustat: measure CPI.
Workload Characterization
- Load averages (utilization + saturation)
- User-time to system-time ratio
- Syscall rate
- Voluntary context switch rate
- Interrupt rate
Workload Characterization Checklist:
- what is the CPU utilization rate system wide? Per CPU?
- how parallel is the CPU load? Is it single threaded? How many threads?
- which application/users are using the CPUs? how much?
- which kernel threads are using the CPUs? how much?
- what is the CPU usage of interrupts?
- what is the CPU interconnect utilization?
- why are the CPUs being used (user/kernel level call path)?
- what types of stall cycles are encountered?
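Several of the checklist items above can be answered with standard Linux tools; a sketch (assumes the sysstat package is installed for mpstat and pidstat):

```shell
# System-wide and per-CPU utilization (one 1-second sample):
mpstat -P ALL 1 1

# Which processes are using CPU, split into %usr/%system:
pidstat -u 1 1

# Interrupt counts per CPU, for the interrupt-usage question:
head /proc/interrupts
```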
Profiling
Select -> Begin -> Wait -> End -> Process
Cycle Analysis
Advanced technique: use CPU performance counters to see where the cycles are going; particularly worth applying when CPI is high, to identify the types of stall cycles.
Key Metrics
- Utilization
- Saturation
CPU Binding
Bind processes and threads to individual CPUs to improve cache warmth and memory I/O performance.
Analysis
Section 6.6 (Analysis) of the book deep-dives into the tools mentioned above (uptime, vmstat, etc.); extremely useful.
a. uptime
uptime is one of several commands that print the system load averages (1-, 5-, and 15-minute). Load averages indicate the demand for CPU resources and are calculated by summing the number of threads running plus the number queued waiting to run (saturation).
Interpretation: if the load average is higher than the CPU count, there are not enough CPUs to service all the threads.
Linux is a freak: it also counts tasks in the uninterruptible state (typically performing disk I/O) in the load averages, which means they reflect CPU or disk load. It is best to use other metrics, like vmstat or mpstat, to understand CPU load on Linux.
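A quick way to put the load average next to the CPU count (a minimal Linux sketch):

```shell
cpus=$(nproc)
load1=$(cut -d ' ' -f1 /proc/loadavg)   # 1-minute load average
echo "1-min load: $load1, CPUs: $cpus"

# Flag possible saturation when load exceeds the CPU count
# (on Linux this can also mean disk, because of uninterruptible tasks):
awk -v l="$load1" -v c="$cpus" 'BEGIN { print (l > c) ? "check saturation" : "headroom" }'
```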
b. vmstat
system-wide CPU averages are in the last few columns.
- r: run-queue length, the total number of runnable threads waiting to run
- us: user time
- sy: system time
- id: idle
- wa: wait I/O
- st: stolen; in virtualized environments, CPU time spent servicing other tenants
Example:
pi@pi-aw:~$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 3932516 466184 2848040 0 0 36 44 26 58 1 0 99 0 0
0 0 0 3932508 466184 2848040 0 0 0 0 62 220 0 0 100 0 0
c. mpstat
Report statistics per CPU in multi processor environment.
pi@pi-aw:~$ mpstat -P ALL 1
Linux 4.15.0-38-generic (pi-aw) 01/03/2019 _x86_64_ (8 CPU)

12:50:29 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
12:50:30 AM all 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:50:30 AM 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:50:30 AM 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:50:30 AM 2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:50:30 AM 3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:50:30 AM 4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:50:30 AM 5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:50:30 AM 6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
12:50:30 AM 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
d. sar/ps/top/pidstat/time
e. DTrace
This is fairly complicated: tracing scripts are written in the D language. Looks super powerful though; worth a deep dive later.
f. SystemTap
g. perf
System profiling: perf record -a -g -F 997 sleep 10
-a: all CPUs; -g: capture call stacks; -F 997: sample at 997 Hz; --stdio (for perf report): print output to stdout. (Remember to use sudo; otherwise the kernel frames won't be decoded.)
$sudo perf record -a -g -F 997 sleep 10
$sudo perf report --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 51 of event 'cycles:ppp'
# Event count (approx.): 66068260
#
# Children Self Command Shared Object Symbol
# ........ ........ ....... .................. ......................................................................
#
87.51% 0.00% swapper [kernel.kallsyms] [k] secondary_startup_64
|
---secondary_startup_64
|
|--81.23%--start_secondary
| cpu_startup_entry
| do_idle
| |
| |--67.22%--call_cpuidle
| | cpuidle_enter
| | cpuidle_enter_state
| | |
| | |--61.91%--intel_idle
| | |
| | |--4.29%--apic_timer_interrupt
| | | smp_apic_timer_interrupt
| | | irq_enter
| | | tick_irq_enter
| | | tick_do_update_jiffies64.part.10
| | | update_wall_time
| | | update_vsyscall
| | |
| | --1.02%--call_function_single_interrupt
| | smp_call_function_single_interrupt
| | generic_smp_call_function_single_interrupt
| | flush_smp_call_function_queue
| | remote_function
| | event_function
| | __perf_event_enable
| | ctx_resched
| | x86_pmu_enable
| | intel_pmu_enable_all
| | __intel_pmu_enable_all.constprop.19
| |
| --14.01%--tick_nohz_idle_exit
|
--6.28%--x86_64_start_kernel
x86_64_start_reservations
start_kernel
rest_init
cpu_startup_entry
do_idle
call_cpuidle
cpuidle_enter
cpuidle_enter_state
|
--6.17%--intel_idle

87.51% 0.00% swapper [kernel.kallsyms] [k] cpu_startup_entry
|
---cpu_startup_entry
Process profiling: perf record -g command
perf stat: a high level summary of CPU cycle behavior based on CPC.
perf list: list all counters that can be examined.
perf stat -e cache-misses ls
h. cpustat
Tuning
- compiler options
- $nice -n 19 command to set priority manually for the command.
- $renice to change the priority of a running process.
- On Linux systems, config options can be set for kernel schedulers.
- process binding: $taskset -pc 7-10 10790 (set pid 10790 to run only on CPUs 7–10)
- cgroups for resource control.
- BIOS tuning.
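The priority and binding tunables above can be combined; a sketch (the binary name and pid are made up for illustration):

```shell
# Start a hypothetical batch job below the default priority:
nice -n 10 ./batch_job &

# Push it further down while it runs (raising niceness needs no
# privileges; lowering it back would need root):
renice -n 19 -p $!

# Pin an existing process (hypothetical pid 10790) to CPUs 7-10:
taskset -pc 7-10 10790
```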
Questions
> Faster CPU clock rate, more throughput?
Even if the CPU in your system appears to be fully utilized, a faster clock rate may not speed up performance — it depends on what those fast CPU cycles are actually doing. If they are mostly stall cycles while waiting on memory access, executing them more quickly doesn’t actually increase CPU instruction rate or workload throughput.
> How to tell if it is because of stall cycles or not?
By checking CPI. A high CPI indicates that CPUs are often stalled, typically for memory access. Installing faster memory, improving memory locality, or reducing the amount of memory I/O would help.
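With perf available, the CPI/IPC check is a one-liner; a sketch (needs perf installed, and may need root or a relaxed perf_event_paranoid setting):

```shell
# Count cycles and instructions for a short workload; perf prints
# "insn per cycle" (IPC). CPI is its reciprocal: CPI = cycles / instructions.
perf stat -e cycles,instructions gzip -c /dev/null > /dev/null
```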
Unanswered Questions:
- In a multi-tenancy scenario with CPU core enforcement, is it possible for the memory bus to become saturated, resulting in CPUs effectively throttled by inefficient memory access?