Published on

Debugging Delta 0.10.x slowdown problem on Linux

Authors
  • avatar
    Name
    ttyS3
    Twitter

其实从 0.10.x 开始就觉得 delta 老慢了, 但是一直没怀疑到是 delta 自身的bug.

有一天, 正好在 github 上面看到有人提 issue, 说当系统进程较多时, delta 扫描所有进程, 这里花了大量时间导致变慢.

然后作者很快发布了一个新版本 0.11.2 , 说这个问题解决了扫进程时慢的问题.

但是我到手一测试, 发现依然可以肉眼感受到这个"慢".

flamegraph 分析: 定位问题

先上 flamegraph 分析一波.

完整的交互式svg flamegraph (需要用web浏览器打开):

delta-flamegraph.svg

找到tower top:

delta-flamegraph-overview-2021-12-21_01-35.png

展开:

delta-tower-top-flamegraph-2021-12-21_01-36.png

很快发现, 锅是 sysinfo::linux::processor::get_cpu_frequency 的, 它发起的系统调用(用来检测每个 cpu logic core 的频率)很慢.

delta::utils::process::determine_calling_process
|
V
sysinfo::linux::system::System::refresh_processors
|
v
sysinfo::linux::processor::get_cpu_frequency

最简单验证, 直接注释掉 determine_calling_processsysinfo::linux::system::System::refresh_processors 的调用,

速度马上恢复了. 从原来的 300ms 降到了 50ms 左右了.

不过禁用这个调用, 并没有真正解决问题, 而只是屏蔽了这个问题.

为什么需要获取进程信息, delta 的作者特意详细解释了:

The downside of disabling that call is that delta will not be able to handle any of the following:

git diff --word-diff / git diff --color-words git grep and other grep's (but it will handle rg --json) git show $commit:$file git blame language detection So that would be a shame, and I'd like to find a solution whereby we can do the process inspection in a way that is fast for all users.

如果不能获取进程信息, 那么这些高级功能(0.10版本引入的, 0.9没有)全没了. 这显然是不可接受的.


delta src/utils/process.rs sysinfo::System::new():

impl ProcInfo {
    fn new() -> Self {
        ProcInfo {
            info: sysinfo::System::new(),
        }
    }
}

sysinfo::System::new() :

    fn new() -> Self {
        Self::new_with_specifics(RefreshKind::new())
    }

new_with_specifics will always try to refresh_processors:

by default, !refreshes.cpu() result in !false , so it is true:

fn new_with_specifics(refreshes: RefreshKind) -> System {
       // ...
        if !refreshes.cpu() {
            s.refresh_processors(None); // We need the processors to be filled.
        }
        s.refresh_specifics(refreshes);
 // ...
}

there seems no way to forbbide refresh_processors call.

if we change the new to sysinfo::System::new_with_specifics(sysinfo::RefreshKind::new().with_cpu())

the first s.refresh_processors(None); will not get called.

but in s.refresh_specifics(refreshes);, it get called again.

所以, 这种情况, 不修改上游代码无法解决问题.

模拟复现

简单写了个 demo (完整的demo代码见repo: https://github.com/ttys3/get-cpu-frequency-slow):

use std::fs::File;
use std::io::Read;
use std::time::Instant;
use std::env;

const DEFAULT_CPU_NUM: i32 = 16;

fn main() {
    let mut args = env::args();
    // skip self
    args.next();

    // your logic CPU core number
    let num_cpu = if let Some(num_cpu) = args.next() {
        num_cpu.parse::<i32>().unwrap_or(DEFAULT_CPU_NUM)
    } else {
        DEFAULT_CPU_NUM
    };

    // weather force read from /proc/cpuinfo, it is important because when
    // the system has NO /sys/devices/system/cpu/cpu{}/cpufreq/scaling_cur_freq
    // this program run rather fast
    let force_cpuinfo = if let Some(force_cpuinfo) = args.next() {
        force_cpuinfo.parse::<bool>().unwrap_or(false)
    } else {
        false
    };

    println!("num_cpu={} force_cpuinfo={}", num_cpu, force_cpuinfo);

    if force_cpuinfo {
        println!("skip readding cpu freq from sysfs, directly to /proc/cpuinfo");
    }

    let start = Instant::now();

    // this for loop simulate the logic in `refresh_processors()`
    // see https://github.com/GuillaumeGomez/sysinfo/blob/01218743c7e656b7f12f530713ba417d2c5940ad/src/linux/system.rs#L146
    for i in 0..num_cpu {
        get_cpu_frequency(force_cpuinfo, i as usize);
    }

    let duration = start.elapsed();
    println!("total time elapsed in get_cpu_frequency()x{} is: {:?}", num_cpu, duration);
}

// the func `get_cpu_frequency` took from https://docs.rs/crate/sysinfo/0.19.2/source/src/linux/processor.rs
// mainly for debugging and verify that delta 0.10.x slowdown problem is caused by `sysinfo` crate
// see https://github.com/dandavison/delta/issues/839
// added force_cpuinfo parameter
fn get_cpu_frequency(force_cpuinfo: bool, cpu_core_index: usize) -> u64 {
    let mut s = String::new();

    if !force_cpuinfo {
        if File::open(format!(
            "/sys/devices/system/cpu/cpu{}/cpufreq/scaling_cur_freq",
            cpu_core_index
        ))
            .and_then(|mut f| f.read_to_string(&mut s))
            .is_ok()
        {
            let freq_option = s.trim().split('\n').next();
            if let Some(freq_string) = freq_option {
                if let Ok(freq) = freq_string.parse::<u64>() {
                    return freq / 1000;
                }
            } else {
                println!("try get cpu freq from sysfs failed, fallback to /proc/cpuinfo");
            }
        }
        s.clear();
    }

    if File::open("/proc/cpuinfo")
        .and_then(|mut f| f.read_to_string(&mut s))
        .is_err()
    {
        return 0;
    }
    let find_cpu_mhz = s.split('\n').find(|line| {
        line.starts_with("cpu MHz\t")
            || line.starts_with("BogoMIPS")
            || line.starts_with("clock\t")
            || line.starts_with("bogomips per cpu")
    });
    find_cpu_mhz
        .and_then(|line| line.split(':').last())
        .and_then(|val| val.replace("MHz", "").trim().parse::<f64>().ok())
        .map(|speed| speed as u64)
        .unwrap_or_default()
}

验证结果:

当从 /sys/devices/system/cpu/cpu{}/cpufreq/scaling_cur_freq 读取时, 存在系统调用获取单个 cpu 频率, 非常慢, 如果是解析 /proc/cpuinfo 则非常快.

很明显, 我的情况是从/sys/devices/system/cpu/cpu{}/cpufreq/scaling_cur_freq 读取.

由于我的 CPU 是 8c16t的:

a single run will cost about 16ms, thus 16 core CPU result in about 256ms

所以, CPU 逻辑核越多, delta 越慢.

另一哥们就更惨了, 他是 32 thread 的 CPU, delta 每次执行差不多要花1秒多! (see https://github.com/dandavison/delta/issues/844)

国外开源氛围

我给 delta 提了个 issue, 然后发现上游有问题.

delta 作者 dandavison 当时提出了两个方案: 一是给上游提个issue尝试解决, 二是换别的库. 比如 https://docs.rs/heim/0.0.11/heim/process/struct.Process.html

最终 delta 作者去 upstream 提 issue了: https://github.com/GuillaumeGomez/sysinfo/issues/632

几经折腾, sysinfo 的作者终于将代码改成满足 delta 的需求场景了.

delta 的需求是: 只需要获取进程的父进程id, 而不需要执行进程的 cpu 执行时间.

目前 delta 0.11.3 已经发布, 解决了上述 Linux 下面慢的问题. https://github.com/dandavison/delta/releases/tag/0.11.3

所以, 国外的开源作者, 一般还都是挺友好的, 大家互相合作.

delta 作者并没有因为是我提的 issue 而要求我去 sysinfo 作者那里继续提 issue, 而是 默默地自己去主动提 issue 解决了.

sysinfo 作者也并没有用任何借口拒绝重构代码,添加选项. 更没有以"这是你的小众需求, 我这个代码在90% 的情况下都不会有你说的性能问题的." 而关闭 issue, 而是三番五次的改进, 最终, 他还主动跑到 delta 的 PR 下面来确认工作 OK (see https://github.com/dandavison/delta/pull/845#issuecomment-991650195).

其它思考

  1. 这个问题跟Linux系统调用有关, 而且是跟硬件有关, CPU 逻辑核越多的电脑上越慢. 至少 CPU 逻辑核心数量 > 10 个以上, 人才可能容易感知到这种执行速度下降.
  1. 不能盲目相信 benchmark 工作的结果, 因为 benchmark 的时候调用存在缓存, 导致结果错误. 一个简单的 sleep 1 可以模拟真实场景.

using benchmark tool like hyperfine will NOT present the problem,

something like hyperfine -r 1000 'delta Cargo.toml Cargo.toml.bak' will give you wrong conclusion

the reason is caching affects the real world problem. although the benchmark tool is OK, and it also works as expected.

it shows the result like this (6 core machine):

❯ hyperfine -r 1000 'delta Cargo.toml Cargo.toml.bak'
Benchmark 1: delta Cargo.toml Cargo.toml.bak
  Time (mean ± σ):      55.3 ms ±   8.7 ms    [User: 47.1 ms, System: 5.8 ms]
  Range (min … max):    50.7 ms … 149.7 ms    1000 runs

but in real world, users will not always run git diff continuous. so the cacehing will not work. and every time the user may get the delay.

in real world, it is more like this:

I use a simple sleep 1 to make the caching not work any more.

for (( c=1; c<=10; c++ ))
do
   sleep 1
   time delta Cargo.toml Cargo.toml.bak > /dev/null
done

the result on a 6 core CPU, most of the execution cost >= 100ms :

# this test run under a 6 core CPU machine
for (( c=1; c<=10; c++ ))
do
   sleep 1
   time delta Cargo.toml Cargo.toml.bak > /dev/null
done
delta Cargo.toml Cargo.toml.bak > /dev/null  0.05s user 0.00s system 62% cpu 0.091 total
delta Cargo.toml Cargo.toml.bak > /dev/null  0.05s user 0.00s system 45% cpu 0.122 total
delta Cargo.toml Cargo.toml.bak > /dev/null  0.05s user 0.00s system 41% cpu 0.136 total
delta Cargo.toml Cargo.toml.bak > /dev/null  0.04s user 0.01s system 52% cpu 0.110 total
delta Cargo.toml Cargo.toml.bak > /dev/null  0.05s user 0.01s system 39% cpu 0.142 total
delta Cargo.toml Cargo.toml.bak > /dev/null  0.05s user 0.01s system 45% cpu 0.122 total
delta Cargo.toml Cargo.toml.bak > /dev/null  0.05s user 0.00s system 39% cpu 0.143 total
delta Cargo.toml Cargo.toml.bak > /dev/null  0.05s user 0.01s system 46% cpu 0.121 total
delta Cargo.toml Cargo.toml.bak > /dev/null  0.05s user 0.01s system 101% cpu 0.055 total
delta Cargo.toml Cargo.toml.bak > /dev/null  0.05s user 0.00s system 40% cpu 0.141 total

in a benchmark situation, it is more like this (with out the sleep), yes, you may found that it is at low latency, about 50ms

# this test run under a 6 core CPU machine
for (( c=1; c<=10; c++ ))
do
   time delta Cargo.toml Cargo.toml.bak > /dev/null
done
delta Cargo.toml Cargo.toml.bak > /dev/null  0.05s user 0.01s system 100% cpu 0.055 total
delta Cargo.toml Cargo.toml.bak > /dev/null  0.05s user 0.01s system 101% cpu 0.054 total
delta Cargo.toml Cargo.toml.bak > /dev/null  0.05s user 0.00s system 101% cpu 0.055 total
delta Cargo.toml Cargo.toml.bak > /dev/null  0.05s user 0.01s system 100% cpu 0.056 total
delta Cargo.toml Cargo.toml.bak > /dev/null  0.05s user 0.01s system 100% cpu 0.055 total
delta Cargo.toml Cargo.toml.bak > /dev/null  0.04s user 0.02s system 101% cpu 0.055 total
delta Cargo.toml Cargo.toml.bak > /dev/null  0.05s user 0.01s system 90% cpu 0.063 total
delta Cargo.toml Cargo.toml.bak > /dev/null  0.06s user 0.00s system 85% cpu 0.068 total
delta Cargo.toml Cargo.toml.bak > /dev/null  0.05s user 0.00s system 99% cpu 0.058 total
delta Cargo.toml Cargo.toml.bak > /dev/null  0.05s user 0.01s system 100% cpu 0.055 total
  1. 使用开源工具遇到问题, 不要把全部的问题抛出给原作者, 有能力的情况下, 可以协助解决, 很多作者还是会热心回答和采纳的. 解决问题的过程中, 自己也能学到新的东西.

  2. 留下一个问题: 为什么获取单个核心 CPU 频率的系统调用 (通过读取 sysfs 触发的) 那么慢 (每个调用大概耗时在 15ms 以上)? 暂时没空看了, 文章先写到这吧.

Refs

https://github.com/dandavison/delta/issues/839

https://github.com/dandavison/delta/issues/824

https://github.com/dandavison/delta/pull/845

https://github.com/GuillaumeGomez/sysinfo/issues/632

https://github.com/dandavison/delta/releases/tag/0.11.3