Use GPU Nodes
You can use GPU cloud hosts as cluster nodes in UK8s as follows:
Image Instructions
When using cloud host models equipped with high cost-performance graphics cards (such as High Cost-Performance Graphics Card 3, 5, and 6) as nodes in a UK8s cluster, you need to use the standard image Ubuntu 20.04 High Cost-Performance.
High cost-performance graphics cards are supported in the following availability zones:
- North China 2A
- Shanghai 2B
- Beijing 2B
Graphics Card | Image | Driver Version | CUDA Version |
---|---|---|---|
High cost-performance graphics cards (High Cost-Performance 3, High Cost-Performance 5, High Cost-Performance 6) | Ubuntu 20.04 High Cost-Performance | 550.120 | 12.4 |
Non-high cost-performance graphics cards (such as T4, V100S, P40, etc.) | Ubuntu 20.04 | 550.90.12 | 12.4 |
Non-high cost-performance graphics cards (such as T4, V100S, P40, etc.) | CentOS 7.6 | 450.80.02 | 11.0 |
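To confirm that a node's driver and CUDA versions match the table above, you can check directly on the GPU node over SSH; a minimal check, assuming the NVIDIA driver from the standard image is installed:
$ nvidia-smi --query-gpu=driver_version --format=csv,noheader
$ nvidia-smi | grep "CUDA Version"
The reported versions should match the row for the node's image.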
Create a Cluster
When creating a cluster, in the Node configuration select "GPU Type G" as the machine type, and then select the specific GPU card type and configuration.
Note: If you choose a high cost-performance graphics card, you must select the standard image Ubuntu 20.04 High Cost-Performance as the node image.
Add Nodes
When adding a node to an existing cluster, select "GPU Type G" as the machine type, and then select the specific GPU card type and configuration.
Add Existing Hosts
Add an existing GPU cloud host to the cluster and choose the appropriate node image.
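After the host joins the cluster, you can confirm that the node registered successfully; a minimal check (in UK8s the node name is normally the host's internal IP, which is assumed here):
$ kubectl get nodes -o wide
The new GPU node should appear in the list with status Ready.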
Instructions for Use
- By default, containers do not share GPUs. Each container can request one or more GPUs; a fraction of a GPU cannot be requested.
- The Master nodes of the cluster do not currently support GPU models.
- The standard image provided by UK8s comes with the NVIDIA driver pre-installed. In addition, the nvidia-device-plugin component is installed in the cluster by default, so GPU resources are automatically recognized and registered once GPU nodes are added to the cluster.
- How to verify that a GPU node works correctly:
  - Check whether the node exposes the nvidia.com/gpu resource (see the check command after the example output below).
  - Run the following example, which requests an NVIDIA GPU through the nvidia.com/gpu resource type, and check that the log output is correct.
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: uhub.genesissai.com/uk8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
$ kubectl logs gpu-pod
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
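To perform the nvidia.com/gpu check mentioned in the first verification step above, you can inspect the node's capacity and allocatable resources; a minimal check (replace <node-name> with the GPU node's name):
$ kubectl describe node <node-name> | grep nvidia.com/gpu
A non-zero value under both Capacity and Allocatable indicates that the device plugin has registered the GPU.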
GPU Cloud Host NCCL TOPO File Passthrough to Pod
If NCCL performance testing in a GPU pod does not meet expectations, consider passing the virtual machine's NCCL topology file through to the pod. The specific steps are as follows:
Prerequisites: The node is equipped with 8 high cost-performance GPUs (e.g., High Cost-Performance GPU 6, High Cost-Performance GPU 6 Pro, A800, etc.).
Step 1: Verify Topology File Existence
Check if the virtualTopology.xml file exists in the path /var/run/nvidia-topologyd/ on the GPU node:
- If it exists, proceed to Step 2.
- If it does not exist, contact technical support to obtain the file. Copy the file to /var/run/nvidia-topologyd/virtualTopology.xml on the GPU node, then proceed to Step 2.
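A quick way to perform this check over SSH on the GPU node:
$ ls /var/run/nvidia-topologyd/virtualTopology.xml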
Step 2: Add Configuration to GPU Pod YAML
Add the following content to the gpu-pod.yaml file:
containers:
  - volumeMounts:
      - mountPath: /var/run/nvidia-topologyd
        name: topologyd
        readOnly: true
volumes:
  - name: topologyd
    hostPath:
      path: /var/run/nvidia-topologyd
      type: Directory
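After recreating the pod with this configuration, you can confirm that the file is visible inside the container; a minimal check (the pod name gpu-pod follows the earlier example and is an assumption here):
$ kubectl exec gpu-pod -- ls /var/run/nvidia-topologyd
If the hostPath mount works, the output lists virtualTopology.xml.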
Plugin Upgrade
Upgrade the nvidia-device-plugin to the latest version to address GPU node instability issues.
Upgrade Methods
- Method 1: Use kubectl set image to change the image version of the nvidia-device-plugin-daemonset to v0.14.1:
$ kubectl set image daemonset nvidia-device-plugin-daemonset -n kube-system nvidia-device-plugin-ctr=uhub.service.ucloud.cn/uk8s/nvidia-k8s-device-plugin:v0.14.1
daemonset.apps/nvidia-device-plugin-daemonset image updated
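You can then wait for the DaemonSet to finish rolling out the new image; a standard check:
$ kubectl rollout status daemonset nvidia-device-plugin-daemonset -n kube-system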
- Method 2: Modify the yaml file of nvidia-device-plugin-daemonset:
- Execute the command:
$ kubectl edit daemonset nvidia-device-plugin-daemonset -n kube-system
- Locate the spec.template.spec.containers.image field in the configuration to view the current image information:
- image: uhub.service.ucloud.cn/uk8s/nvidia-k8s-device-plugin:1.0.0-beta4
- Update the image to uhub.service.ucloud.cn/uk8s/nvidia-k8s-device-plugin:v0.14.1 and save the changes.
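With either method, you can confirm the image currently configured on the DaemonSet; a minimal check:
$ kubectl get daemonset nvidia-device-plugin-daemonset -n kube-system -o jsonpath='{.spec.template.spec.containers[0].image}'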
UPHost Core Pinning
UPHost hosts now support core pinning by default, which improves GPU efficiency in certain scenarios.
To enable core pinning, delete the file /var/lib/kubelet/cpu_manager_state on the UPHost node and ensure the Kubelet is configured with the following parameters (refer to official documentation for Topology Manager Policy and CPU Manager Policy):
--cpu-manager-policy=static
--topology-manager-policy=best-effort
Check if the node is configured with core pinning parameters by running:
ps aux | grep kubelet | grep topology-manager-policy
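After the kubelet restarts with these flags, you can also confirm the active policy recorded in the regenerated CPU manager state file; a minimal check (the exact JSON layout may vary between kubelet versions):
$ cat /var/lib/kubelet/cpu_manager_state
The policyName field should read static.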
Verify Core Pinning Success
- Preparations Before Creating a Test Pod
- Ensure that the CPU, Memory, and GPU quantities in limits and requests are consistent, and CPU values must be integers.
- Set spec.nodeName to the IP address of the UPHost to ensure the pod is scheduled to this node.
Now create the test Pod:
apiVersion: v1
kind: Pod
metadata:
  name: dcgmproftester
spec:
  nodeName: "10.60.159.170" # Replace with the IP of the UPHost node
  restartPolicy: OnFailure
  containers:
    - name: dcgmproftester12-1
      image: uhub.service.ucloud.cn/uk8s/dcgm:3.3.0
      command: ["/usr/bin/dcgmproftester12"]
      args: ["--no-dcgm-validation", "-t 1004", "-d 3600"] # -d specifies runtime in seconds
      resources: # Modify values based on the actual machine configuration
        limits: ## Keep limits identical to requests
          nvidia.com/gpu: 1
          memory: 10Gi
          cpu: 10
        requests:
          nvidia.com/gpu: 1
          memory: 10Gi
          cpu: 10
      securityContext:
        capabilities:
          add: ["SYS_ADMIN"]
- After the Pod status becomes Running, log in to the bare metal node via SSH.
- Run crictl ps to list the containers on the node and find the ID of the container just created, based on the creation time or the container name. Then run crictl inspect <container-id> | grep pid to get the process PID.
- Run taskset -c -p <pid> to get the CPU affinity information. In this example the CPU affinity range of the Pod is 1-5,65-69 rather than the full CPU list, which shows that CPU pinning succeeded.
- Run nvidia-smi to obtain GPU information and check GPU affinity. The Processes section of the output records which GPU the process is running on; in this example it is GPU2, which shows that GPU binding succeeded.
- Run nvidia-smi topo -m to check whether the GPU and CPU are on the same NUMA node. Here the CPU Affinity of GPU3 is 0-7,64-71, and the CPU affinity list obtained with taskset above is 1-5,65-69, which confirms that the GPU and CPU belong to the same NUMA node.
- Supplementary instructions for the Best-effort strategy:
When using the best-effort policy, you need to know the bare metal machine's configuration in advance in order to decide how much CPU and GPU the Pod should request. For example, consider a bare metal cloud host with the following configuration:
CPU Cores | GPUs | Memory (GB, does not affect core pinning) | NUMA Nodes |
---|---|---|---|
128 | 8 | 1024 | 8 |
CPU and NUMA parameters can be obtained by the command lscpu. GPU parameters can be obtained by the command nvidia-smi topo -m.
Through the command lscpu, we can know the relationship between NUMA nodes and CPU cores:
...
NUMA node0 CPU(s): 0-7,64-71
NUMA node1 CPU(s): 8-15,72-79
NUMA node2 CPU(s): 16-23,80-87
NUMA node3 CPU(s): 24-31,88-95
NUMA node4 CPU(s): 32-39,96-103
NUMA node5 CPU(s): 40-47,104-111
NUMA node6 CPU(s): 48-55,112-119
NUMA node7 CPU(s): 56-63,120-127
...
The relationship between NUMA nodes and GPUs can be obtained with the command nvidia-smi topo -m (the CPU Affinity column of its output shows which CPU cores each GPU is affined to).
In this example, each NUMA node therefore contains 16 CPU cores and 1 GPU. To ensure that the CPU and the GPU are affined to the same NUMA node, the number of NUMA nodes spanned by the requested CPU cores must equal the number of NUMA nodes spanned by the requested GPUs; otherwise the CPU and GPU may end up on different NUMA nodes. The following configurations achieve affinity:
GPUs Requested | CPU Cores Requested |
---|---|
1 | 1 ~ 16 |
n | > 16 * (n - 1) and ≤ 16 * n |
Here, the values 16 (CPU cores per NUMA node) and 1 (GPU per NUMA node) come from the configuration obtained with the commands above. Different machines have different configurations, so check them yourself.
GPU/CPU ratios other than those above cannot achieve NUMA affinity.
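As a concrete illustration based on the machine above (the values below are an example, not a required configuration): a pod requesting 2 GPUs should request more than 16 and at most 32 CPU cores, for instance:
resources:
  limits:
    nvidia.com/gpu: 2
    cpu: 32
    memory: 20Gi
  requests:
    nvidia.com/gpu: 2
    cpu: 32
    memory: 20Gi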
For details, please refer to the official documentation.