Driver status

To check whether the driver is working, run nvidia-smi in a terminal. A working driver reports the GPU(s) it found on the system and the version of the loaded driver.

$ nvidia-smi
Tue Jul 25 22:14:24 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05              Driver Version: 535.86.05    CUDA Version: 12.2     |
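
If you want to script this check, here is a minimal sketch. It assumes the --query-gpu and --format options of nvidia-smi that ship with current drivers; the check_nvidia_driver function name is made up for this example.

```shell
# Print one "GPU name, driver version" line per GPU, or a hint about what is missing.
check_nvidia_driver() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found: the driver package is probably not installed"
        return 1
    fi
    if ! nvidia-smi --query-gpu=name,driver_version --format=csv,noheader; then
        echo "nvidia-smi failed: the kernel module may not be loaded"
        return 1
    fi
}

# Do not abort the shell if the check fails; just show the message.
check_nvidia_driver || true
```

A non-zero return from check_nvidia_driver is the signal to move on to the reinstall steps.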

How to reinstall

If nvidia-smi fails or reports no GPU, purge and reinstall the drivers on the system.

sudo apt purge ~nnvidia
sudo apt install nvidia-driver-535
sudo reboot
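
In the purge command, ~n is an apt pattern that matches package names against a regular expression, so ~nnvidia selects every package with nvidia in its name. Quoting it as '~nnvidia' keeps the shell from attempting tilde expansion on it. To preview roughly what would be affected before purging, something like the sketch below may help. It uses dpkg-query instead of apt itself, and list_nvidia_packages is a hypothetical helper name.

```shell
# List packages known to dpkg whose name contains "nvidia",
# prefixed with their status ("ii" means installed).
list_nvidia_packages() {
    dpkg-query -W -f='${db:Status-Abbrev} ${Package} ${Version}\n' 2>/dev/null | grep nvidia
}

list_nvidia_packages || echo "no package names containing 'nvidia' were found"
```

This only reads the package database, so it is safe to run before deciding whether to purge.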

System doesn’t have display on boot

Follow https://support.system76.com/articles/bootloader/ and then repeat the reinstall steps above.

Freezes and suspend/resume issues

These are most often related to power management. You can partially rule this out by disabling PCIe Active State Power Management (ASPM), either in the firmware or with the pcie_aspm=off kernel boot option. Ideally you would leave ASPM enabled to conserve energy and reduce heat.

Use sudo kernelstub -a {{OPTION}} to add a boot option, and sudo kernelstub -d {{OPTION}} to remove one.
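
Boot options only take effect after a reboot, so it can be worth confirming that the running kernel really received them. A small sketch, assuming the standard /proc/cmdline file; has_boot_option is a name invented for this example.

```shell
# Succeed if the option appears as a whole word in the given kernel command line.
has_boot_option() {
    option=$1
    cmdline=$2
    case " $cmdline " in
        *" $option "*) return 0 ;;
        *) return 1 ;;
    esac
}

# After `sudo kernelstub -a pcie_aspm=off` and a reboot, check the running kernel.
if has_boot_option pcie_aspm=off "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "pcie_aspm=off is active"
else
    echo "pcie_aspm=off is not active on this boot"
fi
```

The whole-word match avoids false positives from options that merely start with the same text.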

Some systems have fatal errors when the CPU migrates to a low power state. The deepest state the CPU may enter can be limited with the processor.max_cstate or intel_idle.max_cstate kernel boot parameters. A value of processor.max_cstate=0 disables low power states entirely, which will similarly cause higher energy drain and heat. If that resolves the problem, raise the value one step at a time until the issue recurs, then keep the last value that worked.
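
To see which limit the running kernel picked up, the module parameters can usually be read back from sysfs. This sketch assumes the paths /sys/module/processor/parameters/max_cstate and /sys/module/intel_idle/parameters/max_cstate, which only exist when the corresponding driver is in use; show_max_cstate and its optional base directory argument are hypothetical.

```shell
# Print the current max_cstate value for each idle driver that exposes one.
show_max_cstate() {
    base=${1:-/sys/module}
    for module in processor intel_idle; do
        file="$base/$module/parameters/max_cstate"
        if [ -r "$file" ]; then
            echo "$module.max_cstate=$(cat "$file")"
        fi
    done
}

# Example: after `sudo kernelstub -a intel_idle.max_cstate=1` and a reboot.
show_max_cstate
```

If nothing is printed, neither parameter file is readable on this system.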

If you’re certain that the issue is caused by the NVIDIA driver, you can try out different driver options by creating a file in /etc/modprobe.d/, such as a hypothetical /etc/modprobe.d/zz-nvidia.conf.

Some of these options are generated automatically by system76-power when switching between graphics modes. If you set them manually, be aware that they can conflict with other graphics modes, and that system76-power.conf will override your settings if your file’s name sorts alphabetically before it.

All systems should have at least this defined, unless you are using the NVIDIA dGPU only for compute.

options nvidia-drm modeset=1

For hybrid graphics laptops, it will be necessary to define these

blacklist i2c_nvidia_gpu
alias i2c_nvidia_gpu off
options nvidia NVreg_DynamicPowerManagement=0x02

However, if the hardware has issues with GC6, change NVreg_DynamicPowerManagement to

options nvidia NVreg_DynamicPowerManagement=0x01

Also, systems with issues after resuming from S3 suspend may require

options nvidia NVreg_PreserveVideoMemoryAllocations=1
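
To keep these options together in one file, the sketch below prints a candidate config that you can review before installing it. The nvidia_modprobe_conf function and its "desktop" and "hybrid" labels are invented for this example; only the option lines themselves come from the text above.

```shell
# Print modprobe options for the given system type ("desktop" or "hybrid").
nvidia_modprobe_conf() {
    mode=$1
    echo "options nvidia-drm modeset=1"
    if [ "$mode" = "hybrid" ]; then
        echo "blacklist i2c_nvidia_gpu"
        echo "alias i2c_nvidia_gpu off"
        echo "options nvidia NVreg_DynamicPowerManagement=0x02"
    fi
}

# Review the output first.
nvidia_modprobe_conf hybrid

# To install it under the hypothetical file name used above:
#   nvidia_modprobe_conf hybrid | sudo tee /etc/modprobe.d/zz-nvidia.conf
```

After a reboot, cat /sys/module/nvidia_drm/parameters/modeset should normally show whether modeset was applied, assuming the nvidia_drm module is loaded.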

In an absolute worst-case scenario where suspend is totally broken, you can try disabling these

sudo systemctl disable --now nvidia-hibernate.service nvidia-resume.service nvidia-suspend.service

Remember to undo these changes after driver updates to check whether the new driver has resolved the issue on your system.
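
Undoing the workaround means enabling the same three units again. A short sketch, assuming systemd's standard is-enabled and enable subcommands; the enable command is printed instead of executed so you can review it first.

```shell
units="nvidia-hibernate.service nvidia-resume.service nvidia-suspend.service"

# Report the current state of each unit when systemd is available.
if command -v systemctl >/dev/null 2>&1; then
    for unit in $units; do
        echo "$unit: $(systemctl is-enabled "$unit" 2>/dev/null || true)"
    done
fi

# Command that reverses the workaround.
echo "sudo systemctl enable $units"
```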

Bad multi-monitor performance

Open nvidia-settings and enable “Force Full Composition Pipeline” on all monitors. Disable “Sync to VBlank” and “Allow Flipping” in the OpenGL settings. Then edit /etc/environment and set CLUTTER_DEFAULT_FPS to your highest supported refresh rate and __GL_SYNC_DISPLAY_DEVICE to the output that provides it. If it is 144 Hz on video output DP-1, you would set:

CLUTTER_DEFAULT_FPS=144
__GL_SYNC_DISPLAY_DEVICE=DP-1
__GL_SYNC_TO_VBLANK=0
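
If you are unsure which output name and refresh rate to use, both can usually be read from xrandr on an X11 session. The sketch below parses that output with awk; max_refresh_rates is a made-up helper, and it assumes the usual xrandr layout of an output line followed by indented mode lines.

```shell
# Read `xrandr --query` output on stdin and print "OUTPUT MAX_RATE"
# for every connected output.
max_refresh_rates() {
    awk '
        / connected/    { if (out != "") print out, max; out = $1; max = 0; next }
        / disconnected/ { if (out != "") print out, max; out = ""; next }
        out != "" && /^ +[0-9]+x[0-9]+/ {
            # Fields after the resolution are refresh rates, possibly marked with * or +.
            for (i = 2; i <= NF; i++) {
                rate = $i
                gsub(/[*+]/, "", rate)
                if (rate + 0 > max) max = rate + 0
            }
        }
        END { if (out != "") print out, max }
    '
}

if command -v xrandr >/dev/null 2>&1; then
    xrandr --query 2>/dev/null | max_refresh_rates
fi
```

Rates are printed as reported, so a value such as 143.97 would be rounded to 144 for CLUTTER_DEFAULT_FPS.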

High energy consumption

Powerful graphics cards may lean more aggressively toward performance than energy efficiency by default. You can monitor estimated energy consumption by running nvidia-smi dmon in a terminal. The pwr column shows the estimated watts used by the GPU.

The settings below do not persist across reboots.

If you want a power limit of 100 watts, you can set that with sudo nvidia-smi -pl 100. Use nvidia-smi -q -d POWER to get the min and max power limit.

On my desktop RTX 3080 graphics card, a 100 W limit drops energy consumption while watching a 1080p video on YouTube from 110-125W to 99W.
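
To compare numbers like these on your own hardware, you can sample the reported power draw for a few seconds before and after changing the limit. A rough sketch, assuming the power.draw query field of nvidia-smi; average_watts and sample_power_draw are hypothetical helper names.

```shell
# Average the numbers read from stdin, one per line.
average_watts() {
    awk '{ sum += $1; n++ } END { if (n > 0) printf "%.1f\n", sum / n }'
}

# Print the reported power draw once per second, N times (default 10).
sample_power_draw() {
    count=${1:-10}
    i=0
    while [ "$i" -lt "$count" ]; do
        nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits
        sleep 1
        i=$((i + 1))
    done
}

if command -v nvidia-smi >/dev/null 2>&1; then
    sample_power_draw 10 | average_watts
fi
```

On a system with several GPUs this averages across all of them, and GPUs that do not report power draw will skew the result, so treat the number as a rough guide.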

To further restrict energy consumption, an upper limit for graphics and memory clocks can be set. Use nvidia-smi -q -d CLOCK to get the maximum clocks. Then set a desired range for graphics clocks with sudo nvidia-smi -lgc {{MIN}},{{MAX}}, and a desired range for memory clocks with sudo nvidia-smi -lmc {{MIN}},{{MAX}}. Note that the NVIDIA driver may not honor the exact values you define.

By forcing minimum clocks as below, that same YouTube video drops to 46W with no perceivable difference.

sudo nvidia-smi -lgc 0,210
sudo nvidia-smi -lmc 0,405
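
Locked clocks can be released again without rebooting. The sketch below uses the -rgc and -rmc reset options of nvidia-smi and wraps them in a hypothetical run helper that only prints the commands unless DRY_RUN=0 is set, so nothing changes by accident.

```shell
# Print the command by default; execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run sudo nvidia-smi -rgc   # reset graphics clocks to the driver default
run sudo nvidia-smi -rmc   # reset memory clocks to the driver default
```

Set DRY_RUN=0 in the environment to actually apply the resets.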

I found a workaround

Do share what solutions you’ve found for your hardware, and the graphics model that was affected. For laptops, it would be useful to us and others if you also shared the DMI IDs of the affected system. DMI IDs can be helpful for those searching the web for issues with their laptop, and can also be used by system76-power to automatically apply known workarounds for known-affected systems.

You can run this script in a terminal to print DMI info:

for dmi_file in /sys/devices/virtual/dmi/id/*_{name,version}; do
    echo "$dmi_file"; echo -n '  '; cat "$dmi_file"
done
  • RickRussell_CA@lemmy.world
    1 year ago

    My laptop (HP Omen Intel i7 + Nvidia 2060) lost the discrete graphics after the PopOS updater installed Nvidia driver 535.

    I was able to fix it by re-installing driver 470. However, I’m still having problems with the discrete graphics going dark after screen blank or suspend.

    The above info is interesting, and I’d love to use it to fix my issues, but even as something of a UNIX/Linux power user, I have trouble parsing the jargon.

    What does this do?

    sudo apt purge ~nnvidia
    

    I don’t understand the use of the tilde and double-n notation. I know that tilde is used as a home directory shortcut, but that’s not how it’s used here? I haven’t been able to Google anything on it either, none of the other apt purge examples I found are using this notation.

    I definitely have issues with the graphics failing to wake up after suspend. What do the hybrid graphics commands do? With respect to GC6 and Suspend S3, how would I know whether I need to do anything about those? I understand that they are some kind of power saving modes, but how would I know whether they are causing problems?

    I’d love to be able to use the latest drivers & for suspend to work right, but I have to admit I’m out of my depth.

    • RickRussell_CA@lemmy.world
      1 year ago

      Obligatory hardware details:

      root@pop-os:/home/rickr# for dmi_file in /sys/devices/virtual/dmi/id/*_{name,version}; do echo $dmi_file; echo -n '  '; cat $dmi_file; done

          /sys/devices/virtual/dmi/id/board_name    878A
          /sys/devices/virtual/dmi/id/product_name    OMEN Laptop 15-ek0xxx
          /sys/devices/virtual/dmi/id/bios_version    F.14
          /sys/devices/virtual/dmi/id/board_version    17.29
          /sys/devices/virtual/dmi/id/chassis_version    Chassis Version
          /sys/devices/virtual/dmi/id/product_version