As everyone knows, we just rebuilt our massive cracking server. On the old server I was doing temperature monitoring with LM_Sensors, which is very good for CPU temps, but I was not too sure about its accuracy on GPU temperatures. So I decided to look into another solution.
If you are running an X server, you can simply open the nvidia-settings GUI and get your temps that way. Our server, on the other hand, does not run X; it is a pure CLI environment. Monitoring GPU temps has never been of paramount importance on Linux, since the people who cared most were gamers and, as we all know, most games suck on Linux, so the GPU temp applications are primarily Windows-based.
Now that GPU-based tools are becoming more popular on Linux, monitoring our GPU temps has become increasingly important. So, back to my dilemma: I knew that nvidia-settings could be used from the CLI, but it still requires an X server, which I did not want to install on the server.
I could almost have kicked myself when I actually found the solution. It turns out there is another utility, installed along with the drivers, called nvidia-smi. The "smi" stands for “System Management Interface”.
Here is the output from --help:

    [root@tools ~]# nvidia-smi --help

    nvidia-smi [OPTION1] [OPTION2 ARG] ...

    NVIDIA System Management Interface program for Tesla S870

    -h, --help
        Show usage and exit
    -x, --xml-format
        Produce XML log (to stdout by default, unless a file is specified with -f or --filename=FILE
    -l, --loop-continuously
        Probe continuously, clobbers old logfile if not printing to stdout
    -t NUM, --toggle-led=NUM
        Toggle LED state for Unit <NUM>
    -i SEC, --interval=SEC
        Probe once every <SEC> seconds if the -l option is selected (default and minimum: 1 second)
    -f FILE, --filename=FILE
        Specify log file name
    --gpu=GPUID --compute-mode-rules=RULESET
        Set rules for compute programs where GPUID is the number of the GPU (starting at zero) in the system and RULESET is one of:
            0: Normal mode
            1: Compute-exclusive mode (only one compute program per GPU allowed)
            2: Compute-prohibited mode (no compute programs may run on this GPU)
    -g GPUID -c RULESET
        (short form of the previous command)
    --gpu=GPUID --show-compute-mode-rules
    -g GPUID -s
        (short form of the previous command)
    -L, --list-gpus
    -lsa, --list-standalone-gpus-also
        Also list standalone GPUs in the system along with their temperatures. Can be used with the -l, --loop-continuously option
    -lso, --list-standalone-gpus-only
        Only list standalone GPUs in the system along with their temperatures. Can be used with the -l, --loop-continuously option
The help output shows all the available options, but for now all I am concerned with is getting the temps of our cards. Since we have four GTX 295 cards in a 4U rack case, I am guessing they run pretty hot. I have been told that 100 C is about the maximum temperature we want to run this card at, so I am hoping we are no higher than that; otherwise I will have to figure out some more cooling options.
I run the command “nvidia-smi -lso”:
    [root@tools ~]# nvidia-smi -lso
    GPU 0:
        Product Name : GeForce GTX 295
        Serial : 2074402432753
        PCI ID : 5eb10de
        Temperature : 95 C
    GPU 1:
        Product Name : GeForce GTX 295
        Serial : 562607522042
        PCI ID : 5eb10de
        Temperature : 99 C
    GPU 2:
        Product Name : GeForce GTX 295
        Serial : 3045216059627
        PCI ID : 5e010de
        Temperature : 100 C
    GPU 3:
        Product Name : GeForce GTX 295
        Serial : 1249785487812
        PCI ID : 5e010de
        Temperature : 99 C
    GPU 4:
        Product Name : GeForce GTX 295
        Serial : 2761580421585
        PCI ID : 5e010de
        Temperature : 100 C
    GPU 5:
        Product Name : GeForce GTX 295
        Serial : 418726093573
        PCI ID : 5e010de
        Temperature : 98 C
    GPU 6:
        Product Name : GeForce GTX 295
        Serial : 420974240487
        PCI ID : 5e010de
        Temperature : 96 C
    GPU 7:
        Product Name : GeForce GTX 295
        Serial : 1243494032209
        PCI ID : 5e010de
        Temperature : 97 C
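Since the whole point is to get these numbers into another script, here is a minimal Python sketch of how the listing could be parsed. It assumes the `-lso` flag and the "GPU N:" / "Temperature : NN C" layout shown in this post (other driver versions may print something different), and the names `parse_temps`, `hot_gpus`, `read_temps` and `MAX_SAFE_C` are made up for illustration.

```python
import re
import subprocess

# "About the maximum" according to the post; an assumption, not a vendor spec.
MAX_SAFE_C = 100

# One match per GPU: the index from "GPU N:" and the first temperature after it.
# DOTALL lets the pattern span the lines between the two fields.
GPU_TEMP_RE = re.compile(
    r"GPU\s+(\d+)\s*:.*?Temperature\s*:\s*(\d+)\s*C", re.DOTALL
)


def parse_temps(text):
    """Return {gpu_index: temperature_in_celsius} from `nvidia-smi -lso` text."""
    return {int(gpu): int(temp) for gpu, temp in GPU_TEMP_RE.findall(text)}


def hot_gpus(temps, limit=MAX_SAFE_C):
    """Return the sorted GPU indexes whose temperature is at or above `limit`."""
    return sorted(gpu for gpu, temp in temps.items() if temp >= limit)


def read_temps():
    """Run nvidia-smi once and parse its output (needs the NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "-lso"], capture_output=True, text=True, check=True
    )
    return parse_temps(result.stdout)
```

On a box with the driver installed, `hot_gpus(read_temps())` would give the list of GPUs at or over the limit; fed the listing above, that would be GPUs 2 and 4.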
As you can see, I can now grab the temps of all 8 GPUs (each GTX 295 carries two) over an SSH session, which is what I wanted to do all along. There are other options in this tool I have not yet explored, but we plan to maybe write a custom Cacti plugin or something to monitor and graph GPU temps. I know the solution here may seem simple, but I had literally been looking for days for a way to do this and get the data written to stdout so that I can use it in another script or program.
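For the Cacti idea, a script-based data input usually hands several values back as space-separated `name:value` pairs on one line of stdout. A rough Python sketch of that last step follows. It assumes the temperatures have already been collected into a dict keyed by GPU index, and both the `gpuN` field names and the `to_cacti_fields` function are hypothetical, so they would have to match whatever output fields get defined in Cacti.

```python
def to_cacti_fields(temps):
    """Format {gpu_index: temp_c} as 'gpu0:95 gpu1:99 ...' for a data input script."""
    # Sort by GPU index so the field order is stable from one poll to the next.
    return " ".join(f"gpu{gpu}:{temp}" for gpu, temp in sorted(temps.items()))


def print_cacti_fields(temps):
    """Write the single line that the poller would read from stdout."""
    print(to_cacti_fields(temps))
```

Keeping the formatting separate from the collection step means the same dict could also feed a log file or an alert script later.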