So as everyone knows, we just rebuilt our massive cracking server. On the old server I was doing temperature monitoring with lm_sensors, which is very good for CPU temps, but I was not too sure about its accuracy on GPU temperatures. So I decided to investigate another solution.

If you are running an X server, you can simply open the nvidia-settings GUI and get your temps that way. Our server, on the other hand, does not run X and is a pure CLI environment. Monitoring GPU temps has never been of paramount importance on Linux, since the primary reason for doing so was gaming and, as we all know, most games suck on Linux, so the GPU temp applications are primarily Windows-based.

Now that GPU-based tools on Linux are becoming more popular, it has become increasingly important to monitor our GPU temps. So back to my dilemma. I knew that nvidia-settings could be used from the CLI; however, it still required an X server, which I did not want to install on the server.

I could have almost kicked myself when I actually found the solution. It turns out there is another utility, installed with the drivers, called nvidia-smi. The “smi” part stands for “System Management Interface”.

Here is the output from --help:

[root@tools ~]# nvidia-smi --help
nvidia-smi [OPTION1] [OPTION2 ARG] ...
NVIDIA System Management Interface program for Tesla S870

        -h, --help                                  Show usage and exit
        -x, --xml-format                            Produce XML log (to stdout by default, unless
                                                    a file is specified with -f or --filename=FILE
        -l, --loop-continuously                     Probe continuously, clobbers old logfile if not printing to stdout
        -t NUM, --toggle-led=NUM                    Toggle LED state for Unit <NUM>
        -i SEC, --interval=SEC                      Probe once every <SEC> seconds if the -l option
                                                    is selected (default and minimum: 1 second)
        -f FILE, --filename=FILE                    Specify log file name
        --gpu=GPUID --compute-mode-rules=RULESET    Set rules for compute programs
                                                    where GPUID is the number of the GPU (starting at zero) in the system
                                                    and RULESET is one of:
                                                    0: Normal mode
                                                    1: Compute-exclusive mode (only one compute program per GPU allowed)
                                                    2: Compute-prohibited mode (no compute programs may run on this GPU)
        -g GPUID -c RULESET                         (short form of the previous command)
        --gpu=GPUID --show-compute-mode-rules
        -g GPUID -s                                 (short form of the previous command)
        -L, --list-gpus
        -lsa, --list-standalone-gpus-also           Also list standalone GPUs in the system along with their temperatures.
                                                    Can be used with the -l, --loop-continuously option
        -lso, --list-standalone-gpus-only           Only list standalone GPUs in the system along with their temperatures.
                                                    Can be used with the -l, --loop-continuously option

The help section shows all the available options, but for now all I am concerned with is getting the temps on our cards. Since we have four GTX 295 cards in a 4U rack case, I am guessing they are pretty hot. I have been told that 100 C is about the maximum temperature we want to run this card at, so I am hoping we are no higher than that; otherwise I will have to figure out some more cooling options.

I run the command “nvidia-smi -lso”:

[root@tools ~]# nvidia-smi -lso

GPU 0:
        Product Name            : GeForce GTX 295
        Serial                  : 2074402432753
        PCI ID                  : 5eb10de
        Temperature             : 95 C
GPU 1:
        Product Name            : GeForce GTX 295
        Serial                  : 562607522042
        PCI ID                  : 5eb10de
        Temperature             : 99 C
GPU 2:
        Product Name            : GeForce GTX 295
        Serial                  : 3045216059627
        PCI ID                  : 5e010de
        Temperature             : 100 C
GPU 3:
        Product Name            : GeForce GTX 295
        Serial                  : 1249785487812
        PCI ID                  : 5e010de
        Temperature             : 99 C
GPU 4:
        Product Name            : GeForce GTX 295
        Serial                  : 2761580421585
        PCI ID                  : 5e010de
        Temperature             : 100 C
GPU 5:
        Product Name            : GeForce GTX 295
        Serial                  : 418726093573
        PCI ID                  : 5e010de
        Temperature             : 98 C
GPU 6:
        Product Name            : GeForce GTX 295
        Serial                  : 420974240487
        PCI ID                  : 5e010de
        Temperature             : 96 C
GPU 7:
        Product Name            : GeForce GTX 295
        Serial                  : 1243494032209
        PCI ID                  : 5e010de
        Temperature             : 97 C

As you can see, I can now grab the temps of all 8 GPUs (each GTX 295 card carries two) via an SSH session, which is what I wanted to do all along. There are other options in this tool that I have not yet explored, but we plan to maybe write a custom Cacti plugin or something to monitor and graph GPU temps. I know the solution here may seem simple, but I have literally been looking for days for a way to do this and get the data written to stdout so that I can use it in another script or program.
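
Since the whole point was to get the temps on stdout for use in another script, here is a minimal sketch of what that script could look like in Python. It assumes the output keeps the "GPU N:" / "Temperature : NN C" layout shown above, and that the -lso flag exists in your driver version (it comes from the help text above; other nvidia-smi versions may differ). The function names and the 100 C default limit are just illustrative choices, not anything nvidia-smi provides.

```python
import re
import subprocess

# Patterns match the "GPU N:" and "Temperature : NN C" lines shown above.
GPU_HEADER = re.compile(r"^GPU\s+(\d+):")
TEMP_LINE = re.compile(r"^\s*Temperature\s*:\s*(\d+)\s*C")


def parse_temps(text):
    """Return {gpu_index: temperature_c} parsed from `nvidia-smi -lso` style text."""
    temps = {}
    current = None
    for line in text.splitlines():
        header = GPU_HEADER.match(line)
        if header:
            current = int(header.group(1))
            continue
        temp = TEMP_LINE.match(line)
        if temp and current is not None:
            temps[current] = int(temp.group(1))
    return temps


def hot_gpus(temps, limit_c=100):
    """Return the sorted GPU indexes at or above limit_c (assumed limit: 100 C)."""
    return sorted(idx for idx, t in temps.items() if t >= limit_c)


def read_temps():
    """Run nvidia-smi once and parse its output (assumes -lso is supported)."""
    result = subprocess.run(
        ["nvidia-smi", "-lso"], capture_output=True, text=True, check=True
    )
    return parse_temps(result.stdout)
```

On the real box you would call read_temps() and pass the result to hot_gpus(), for example from a cron job that mails you when the list is not empty. The parsing is kept separate from the subprocess call so it can be tried against a saved copy of the output first.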

3 Responses to “GPU Linux Shell Temp: Get Nvidia GPU Temperatures Via Linux CLI”
  1. Kifni says:

    Many people claim this VGA card shows awesome benchmarks

    [Reply]

    alex Reply:

    Hello Kifni,

    Maybe because it does? Guess it depends on what you are comparing it to.

    Thanks.
    alex

    [Reply]
