Driver status
To check whether you have a functioning driver, run nvidia-smi in a terminal. If the driver is working, it will report the GPU(s) it found on the system and the version of the loaded driver.
$ nvidia-smi
Tue Jul 25 22:14:24 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.05 Driver Version: 535.86.05 CUDA Version: 12.2 |
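If you would rather check this from a script, here is a minimal sketch. The function name check_nvidia_driver is made up for illustration, and the two failure messages are only likely explanations, not a diagnosis.

```shell
#!/bin/sh
# Hypothetical helper: print "<GPU name>, <driver version>" for each GPU,
# or a hint about what is probably wrong.
check_nvidia_driver() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found: the driver packages are probably not installed"
        return 1
    fi
    # One CSV line per GPU, without the header row.
    if ! nvidia-smi --query-gpu=name,driver_version --format=csv,noheader; then
        echo "nvidia-smi failed: the kernel module may not be loaded"
        return 1
    fi
}

check_nvidia_driver || echo "Driver check failed; consider reinstalling the driver."
```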
How to reinstall
If this is not working, purge and reinstall the drivers on the system.
sudo apt purge ~nnvidia
sudo apt install nvidia-driver-535
sudo reboot
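The ~nnvidia argument is an APT pattern: ~n selects packages whose name matches the regular expression that follows, here nvidia. Before purging, it can help to preview what the pattern matches. This is a sketch that assumes APT 2.0 or newer (where patterns are available); preview_nvidia_purge is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: list the installed packages whose name matches "nvidia",
# i.e. the packages that `sudo apt purge ~nnvidia` would remove.
preview_nvidia_purge() {
    # stderr is silenced only to hide apt's "no stable CLI interface" warning.
    apt list --installed '~nnvidia' 2>/dev/null
}

if command -v apt >/dev/null 2>&1; then
    preview_nvidia_purge
fi
```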
System doesn’t have display on boot
Follow https://support.system76.com/articles/bootloader/ and then repeat the reinstall steps above.
Freezes and suspend/resume issues
These are most often related to power management. You can partially rule this out by disabling PCIe Active State Power Management (ASPM), either in the firmware or with the pcie_aspm=off kernel boot option. Ideally you would leave ASPM enabled, since it conserves energy and reduces heat.
Use sudo kernelstub -a {{OPTION}} to add a boot option, and sudo kernelstub -d {{OPTION}} to remove one.
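Boot options only take effect after a reboot, so it is worth confirming that the running kernel actually received them. A small sketch, using pcie_aspm=off as the example option; has_boot_option is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the kernel command line ($2) contains
# the exact option ($1) as a whole word.
has_boot_option() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Add the option first, then reboot:
#   sudo kernelstub -a pcie_aspm=off
# /proc/cmdline holds the options the running kernel was booted with.
if has_boot_option "pcie_aspm=off" "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "pcie_aspm=off is active"
else
    echo "pcie_aspm=off is not active"
fi
```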
Some systems hit fatal errors when the CPU moves to a low-power state. The deepest allowed state can be limited with the processor.max_cstate or intel_idle.max_cstate kernel boot parameters. A value of processor.max_cstate=0 disables them entirely, which will similarly cause higher energy drain and heat. If that resolves the problem, raise the value incrementally until the issue recurs, then keep the highest value that was stable.
If you’re certain that the issue is caused by the NVIDIA driver, you can try out different driver options by creating a file in /etc/modprobe.d/, such as a hypothetical /etc/modprobe.d/zz-nvidia.conf.
Some of these options are automatically generated by system76-power when switching between graphics modes. If you set them manually, be aware that they can conflict with different graphics modes, and that system76-power.conf will override your settings if your file’s name sorts alphabetically before it (hence the zz- prefix in the example above).
All systems should have at least this defined, unless you are using the NVIDIA dGPU only for compute.
options nvidia-drm modeset=1
For hybrid graphics laptops, it will be necessary to define these
blacklist i2c_nvidia_gpu
alias i2c_nvidia_gpu off
options nvidia NVreg_DynamicPowerManagement=0x02
However, if the hardware has issues with GC6, change NVreg_DynamicPowerManagement from 0x02 (fine-grained power control) to 0x01 (coarse-grained power control):
options nvidia NVreg_DynamicPowerManagement=0x01
Also, systems with issues after resuming from S3 suspend may require
options nvidia NVreg_PreserveVideoMemoryAllocations=1
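After editing the file and rebooting, you can check which values the loaded driver is actually using. This is a sketch that assumes the driver exposes its settings at /sys/module/nvidia_drm/parameters/modeset and /proc/driver/nvidia/params; show_nvidia_options is a hypothetical helper name, and both paths can be overridden as arguments.

```shell
#!/bin/sh
# Hypothetical helper: print the driver options currently in effect.
# $1 and $2 override the default file locations (assumed, see above).
show_nvidia_options() {
    modeset_file=${1:-/sys/module/nvidia_drm/parameters/modeset}
    params_file=${2:-/proc/driver/nvidia/params}
    if [ -r "$modeset_file" ]; then
        echo "nvidia-drm modeset: $(cat "$modeset_file")"
    else
        echo "nvidia-drm modeset: unavailable (module not loaded?)"
    fi
    if [ -r "$params_file" ]; then
        # The trailing colon avoids matching longer keys with the same prefix.
        grep -E '^(DynamicPowerManagement|PreserveVideoMemoryAllocations):' "$params_file"
    fi
}

show_nvidia_options
```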
In an absolute worst-case scenario where suspend is totally broken, you can try disabling these services:
sudo systemctl disable --now nvidia-hibernate.service nvidia-resume.service nvidia-suspend.service
Remember to undo these changes after a driver update, so you can check whether the new driver has resolved the issue on your system.
Bad multi-monitor performance
Open nvidia-settings and enable “Force Full Composition Pipeline” on all monitors. Disable “Sync to VBlank” and “Allow Flipping” in the OpenGL settings. Then edit /etc/environment and set the variables below, using your highest supported refresh rate and the video output that provides it. If that is 144 Hz on video output DP-1, you would set:
CLUTTER_DEFAULT_FPS=144
__GL_SYNC_DISPLAY_DEVICE=DP-1
__GL_SYNC_TO_VBLANK=0
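If you are not sure which output name and refresh rate to use, the values can be read from xrandr. This is a rough sketch: max_refresh_rates is a hypothetical helper name, it assumes an X11 session, and it assumes the output names xrandr reports are the ones the driver expects in __GL_SYNC_DISPLAY_DEVICE.

```shell
#!/bin/sh
# Hypothetical helper: read `xrandr --query` output on stdin and print
# "<output> <highest refresh rate, rounded>" for each connected output.
max_refresh_rates() {
    awk '
        / connected/    { out = $1; next }
        / disconnected/ { out = "";  next }
        out != "" && /^ +[0-9]+x[0-9]+/ {
            # Fields after the resolution are refresh rates, possibly
            # decorated with * (current) and + (preferred).
            for (i = 2; i <= NF; i++) {
                rate = $i
                gsub(/[*+]/, "", rate)
                if (rate + 0 > max[out]) max[out] = rate + 0
            }
        }
        END { for (o in max) printf "%s %.0f\n", o, max[o] }
    ' | sort
}

if command -v xrandr >/dev/null 2>&1; then
    xrandr --query 2>/dev/null | max_refresh_rates
fi
```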
High energy consumption
Powerful graphics cards may lean more toward performance than energy efficiency by default. You can monitor estimated energy consumption by running nvidia-smi dmon in a terminal. The pwr column shows the estimated watts used by the GPU.
The settings below will not persist across reboots.
If you want a power limit of 100 watts, you can set that with sudo nvidia-smi -pl 100. Use nvidia-smi -q -d POWER to get the minimum and maximum power limits.
On my desktop RTX 3080 graphics card, this would drop energy consumption while watching a 1080p video on YouTube from 110-125W to 99W.
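To avoid requesting a limit the card does not support, the range can be queried and checked first. A sketch under a few assumptions: it only looks at GPU 0, and watts_in_range and set_power_limit are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: succeed if $1 watts lies within [$2, $3].
# awk is used because nvidia-smi reports decimals such as "100.00".
watts_in_range() {
    awk -v w="$1" -v lo="$2" -v hi="$3" \
        'BEGIN { exit !(w + 0 >= lo + 0 && w + 0 <= hi + 0) }'
}

# Hypothetical helper: apply a power limit to GPU 0 only if it is supported.
set_power_limit() {
    want=$1
    # Prints e.g. "100.00, 370.00" (minimum, maximum) for GPU 0.
    limits=$(nvidia-smi -i 0 --query-gpu=power.min_limit,power.max_limit \
        --format=csv,noheader,nounits) || return 1
    min=${limits%%,*}
    max=${limits##*, }
    if watts_in_range "$want" "$min" "$max"; then
        sudo nvidia-smi -i 0 -pl "$want"
    else
        echo "Requested ${want} W is outside the supported range ${min}-${max} W"
        return 1
    fi
}

# Usage: set_power_limit 100
```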
To further restrict energy consumption, an upper limit for graphics and memory clocks can be set. Use nvidia-smi -q -d CLOCK to get the maximum clocks. Then set a desired range for graphics clocks with sudo nvidia-smi -lgc {{MIN}},{{MAX}}, and a desired range for memory clocks with sudo nvidia-smi -lmc {{MIN}},{{MAX}}. Note that the NVIDIA driver may not honor the exact values you define.
With the clocks capped at their minimum as below, that same YouTube video draws 46W with no perceivable difference.
sudo nvidia-smi -lgc 0,210
sudo nvidia-smi -lmc 0,405
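If the limits turn out to be too restrictive, for example before gaming, they can be lifted without rebooting. A minimal sketch; reset_gpu_clocks is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: remove the clock limits set with -lgc and -lmc.
reset_gpu_clocks() {
    # -rgc is short for --reset-gpu-clocks
    sudo nvidia-smi -rgc || return 1
    # -rmc is short for --reset-memory-clocks
    sudo nvidia-smi -rmc
}

# Usage: reset_gpu_clocks
```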
I found a workaround
Do share what solutions you’ve found for your hardware, and the graphics model that was affected. For laptops, sharing the DMI IDs of the affected system would be useful for us and others. DMI IDs can be helpful for those searching the web for issues with their laptop, and can also be used by system76-power to automatically apply known workarounds for known-affected systems.
You can run this script in a terminal to print DMI info:
for dmi_file in /sys/devices/virtual/dmi/id/*_{name,version}; do
    echo "$dmi_file"; echo -n ' '; cat "$dmi_file"
done
My laptop (HP Omen Intel i7 + Nvidia 2060) lost the discrete graphics after the PopOS updater installed Nvidia driver 535.
I was able to fix it by re-installing driver 470. However, I’m still having problems with the discrete graphics going dark after screen blank or suspend.
The above info is interesting, and I’d love to use it to fix my issues, but even as something of a UNIX/Linux power user, I have trouble parsing the jargon.
What does this do?
sudo apt purge ~nnvidia
I don’t understand the use of the tilde and double-n notation. I know that tilde is used as a home directory shortcut, but that’s not how it’s used here? I haven’t been able to Google anything on it either; none of the other apt purge examples I found use this notation.
I definitely have issues with the graphics failing to wake up after suspend. What do the hybrid graphics commands do? With respect to GC6 and Suspend S3, how would I know whether I need to do anything about those? I understand that they are some kind of power saving modes, but how would I know whether they are causing problems?
I’d love to be able to use the latest drivers & for suspend to work right, but I have to admit I’m out of my depth.
Obligatory hardware details:
root@pop-os:/home/rickr# for dmi_file in /sys/devices/virtual/dmi/id/*_{name,version}; do echo $dmi_file; echo -n ' '; cat $dmi_file; done
/sys/devices/virtual/dmi/id/board_name
 878A
/sys/devices/virtual/dmi/id/product_name
 OMEN Laptop 15-ek0xxx
/sys/devices/virtual/dmi/id/bios_version
 F.14
/sys/devices/virtual/dmi/id/board_version
 17.29
/sys/devices/virtual/dmi/id/chassis_version
 Chassis Version
/sys/devices/virtual/dmi/id/product_version