Hi! I am aware of tools like top, htop, atop, and sar that can be used to monitor usage. The *top programs only seem to report in real time, while sar only seems to provide historical usage data (as a percentage per CPU).

What I am trying to find out is which processes are running, and their stats, at times when the system is unresponsive (which makes the *top programs impossible to use).

What is the best way to continuously log process stats so that, when the system becomes unresponsive and requires a reboot, we can look back at what state the system was in and hopefully troubleshoot what caused it to become unresponsive?

Thank you!

  • frongt@lemmy.zip
    12 hours ago

    Kernel dumps? I doubt that any monitoring agent would be any more responsive than what you’ve already listed.

  • suicidaleggroll@lemmy.world
    20 hours ago

    I use node_exporter + VictoriaMetrics + Grafana for network-wide system monitoring. node_exporter also has provisions to include text files placed in a directory you specify, as long as they’re written out in the right format. I use that capability on my systems to include some custom metrics, including CPU and memory usage of the top 5 processes on the system, for exactly this reason.

    The resulting file looks like:

    # HELP cpu_usage CPU usage for top processes in %
    # TYPE cpu_usage gauge
    cpu_usage{process="/usr/bin/dockerd",pid="187613"} 1.8
    cpu_usage{process="/usr/local/bin/python3",pid="190047"} 1.4
    cpu_usage{process="/usr/bin/cadvisor",pid="188999"} 1.0
    cpu_usage{process="/opt/mealie/bin/python3",pid="190114"} 0.9
    cpu_usage{process="/opt/java/openjdk/bin/java",pid="190080"} 0.9
    
    # HELP mem_usage Memory usage for top processes in %
    # TYPE mem_usage gauge
    mem_usage{process="/usr/local/bin/python3",pid="190047"} 3.0
    mem_usage{process="/usr/bin/Xvfb",pid="196573"} 2.4
    mem_usage{process="/usr/bin/Xvfb",pid="193606"} 2.4
    mem_usage{process="next-server",pid="194634"} 1.2
    mem_usage{process="/opt/mealie/bin/python3",pid="190114"} 1.2
    

    And it gets scraped every 15 seconds for all of my systems. The result looks like this for CPU and memory. Pretty boring most of the time, but it can be very valuable to see what was going on with the active processes in the moments leading up to a problem.
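For anyone who wants to try something similar, here is a minimal sketch of a script that could produce a file in the format shown above. It is not the script from the comment, which was not posted. It assumes a Linux system with procps `ps`; the function names, the `top_processes.prom` file name, and the choice of `comm` (the short command name rather than the full path seen in the sample) are all illustrative assumptions. Note also that `ps` reports `pcpu` averaged over the lifetime of the process, so a tool that samples over a short interval would be needed for truly instantaneous values.

```python
import os
import subprocess
import tempfile


def parse_ps(output):
    """Parse `ps -eo pid=,pcpu=,pmem=,comm=` output into a list of dicts."""
    procs = []
    for line in output.splitlines():
        parts = line.split(None, 3)  # the command name may contain spaces
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        pid, cpu, mem, name = parts
        try:
            procs.append({"pid": pid, "cpu": float(cpu),
                          "mem": float(mem), "name": name.strip()})
        except ValueError:
            continue  # skip rows whose numeric columns do not parse
    return procs


def escape_label(value):
    """Escape backslash, double quote and newline for a label value."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n"))


def render_metric(name, help_text, procs, key, n=5):
    """Render the top-n processes by `key` as one gauge metric family."""
    top = sorted(procs, key=lambda p: p[key], reverse=True)[:n]
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for p in top:
        label = escape_label(p["name"])
        pid = p["pid"]
        lines.append(f'{name}{{process="{label}",pid="{pid}"}} {p[key]:.1f}')
    return "\n".join(lines) + "\n"


def write_atomic(path, text):
    """Write to a temp file in the same directory, then rename into place,
    so a scrape never sees a half-written file."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        fh.write(text)
    os.chmod(tmp, 0o644)  # mkstemp creates 0600; the exporter may run as another user
    os.replace(tmp, path)


def write_snapshot(path):
    """Take one `ps` snapshot and write CPU and memory metrics to `path`."""
    out = subprocess.run(["ps", "-eo", "pid=,pcpu=,pmem=,comm="],
                         capture_output=True, text=True, check=True).stdout
    procs = parse_ps(out)
    text = (render_metric("cpu_usage", "CPU usage for top processes in %", procs, "cpu")
            + "\n"
            + render_metric("mem_usage", "Memory usage for top processes in %", procs, "mem"))
    write_atomic(path, text)
```

Calling `write_snapshot()` from a cron job or systemd timer, with a path ending in `.prom` inside the directory passed to node_exporter via `--collector.textfile.directory`, should be enough for the metrics to show up on the next scrape. The write-then-rename step is there because the exporter could otherwise read the file while it is only partly written.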