The Role of RDMA

RDMA is a network communication technology that bypasses the operating-system kernel: the NIC reads and writes remote memory directly, avoiding the data copies and context switches of the traditional TCP/IP protocol stack. NVIDIA GPU Direct[^2] connects GPU memory directly to the NIC's DMA engine, so when a GPU needs to communicate with a remote node, data travels over the InfiniBand or RoCE NIC without staging through host memory.

Network Virtualization

Macvlan and SR-IOV are two common network-virtualization schemes. Macvlan creates virtual NIC interfaces for containers that appear as independent devices on the physical network. SR-IOV uses the hardware virtualization capability of the physical NIC to split a single physical function (PF) into multiple virtual functions (VFs), each of which can be assigned directly to a Pod.

Technology Paths

RDMA currently has two main implementations: InfiniBand and RoCE. InfiniBand supports RDMA natively but requires dedicated switches and a subnet manager to build a separate network, which is costly. RoCEv2 runs on conventional Ethernet infrastructure and relies on flow-control mechanisms such as PFC and ECN to guarantee lossless transport, so it is widely adopted by internet companies.

Cloud-Platform RDMA Network Topology

The cluster network provides two planes: RoCEv2 carries high-speed business traffic (data flow) over the RDMA network, while ordinary Ethernet carries control traffic and low-speed business traffic (control flow). The dual network planes isolate high-speed from low-speed traffic.

Construction

multus-cni gives each container two networks:
- The primary CNI uses the host's physical NICs lan0 and lan1, aggregated into the bond01 interface for NIC high availability; bond01 connects Pods through Calico or Cilium.
- The secondary CNI uses RoCEv2. Each RoCEv2 device corresponds to a lanX--mlx5_X pair on the host. rdma-shared-device-plugin registers these devices with the kubelet as resources; when a container is scheduled onto a node it is allocated a RoCEv2 physical device, and multus-cni uses macvlan to map the allocated device into the container as a network interface named netX.

Configuration

rdmaSharedDevicePlugin configuration: on the physical machine, lan2, lan3, lan4, and lan5 correspond to mlx5_0, mlx5_1, mlx5_2, and mlx5_3 respectively. The four NICs are exposed as Kubernetes extended resources named nvidia.com/mlx5_0, nvidia.com/mlx5_1, nvidia.com/mlx5_2, and nvidia.com/mlx5_3, each defined as 100 shares.

```yaml
rdmaSharedDevicePlugin:
  deploy: true
  image: k8s-rdma-shared-dev-plugin
  repository: ghcr.io/mellanox
  version: sha-fe7f371c7e1b8315bf900f71cd25cfc1251dc775
  useCdi: false
  resources:
    - resourcePrefix: nvidia.com
      resourceName: mlx5_0
      rdmaHcaMax: 100
      vendors: [15b3]
      ifNames: [lan2]
    - resourcePrefix: nvidia.com
      resourceName: mlx5_1
      rdmaHcaMax: 100
      vendors: [15b3]
      ifNames: [lan3]
    - resourcePrefix: nvidia.com
      resourceName: mlx5_2
      rdmaHcaMax: 100
      vendors: [15b3]
      ifNames: [lan4]
    - resourcePrefix: nvidia.com
      resourceName: mlx5_3
      rdmaHcaMax: 100
      vendors: [15b3]
      ifNames: [lan5]
```

MacvlanNetwork configuration: create four MacvlanNetwork objects, rdma-net-ipam-lan2 through rdma-net-ipam-lan5, one per NIC. Each carries its own in-cluster IP pool, and networkNamespace is set to the Kubernetes namespace where the business Pods run (prod in this example).

```yaml
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdma-net-ipam-lan2
spec:
  networkNamespace: prod
  master: lan2
  mode: bridge
  mtu: 1500
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.4.0/22",
      "gateway": "192.168.4.1",
      "log_file": "/var/log/whereabouts.log",
      "log_level": "info"
    }
```

rdma-net-ipam-lan3, rdma-net-ipam-lan4, and rdma-net-ipam-lan5 are identical except for the name, the master (lan3, lan4, lan5), the range (192.168.8.0/22, 192.168.12.0/22, 192.168.16.0/22), and the gateway (192.168.8.1, 192.168.12.1, 192.168.16.1).

Workload Job configuration: the Job requests all four mlx NIC resources, lists the four MacvlanNetwork attachments in the Pod annotation, and sets suitable NCCL environment variables to enable the RoCE network.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: rdma-test
  namespace: prod
spec:
  maxRetry: 3
  minAvailable: 1
  plugins:
    pytorch:
      - --master=master
      - --worker=worker
      - --port=23456
  policies:
    - action: RestartJob
      event: PodEvicted
  queue: default
  schedulerName: volcano
  tasks:
    - maxRetry: 3
      minAvailable: 1
      name: master
      replicas: 1
      template:
        metadata:
          annotations:
            k8s.v1.cni.cncf.io/networks: rdma-net-ipam-lan2,rdma-net-ipam-lan3,rdma-net-ipam-lan4,rdma-net-ipam-lan5
        spec:
          schedulerName: volcano
          containers:
            - name: pytorch
              image: torch
              command:
                - /bin/bash
                - -c
                - sleep 1440h
              env:
                - name: NCCL_DEBUG
                  value: INFO
                - name: NCCL_IB_DISABLE
                  value: "0"
                - name: NCCL_NET_GDR_READ
                  value: "1"
                - name: NCCL_IB_HCA
                  value: mlx5
                - name: NCCL_IB_GID_INDEX
                  value: "5"
                - name: NCCL_SOCKET_IFNAME
                  value: eth0
              resources:
                limits:
                  nvidia.com/gpu: 8
                  nvidia.com/mlx5_0: 1
                  nvidia.com/mlx5_1: 1
                  nvidia.com/mlx5_2: 1
                  nvidia.com/mlx5_3: 1
                requests:
                  nvidia.com/gpu: 8
                  nvidia.com/mlx5_0: 1
                  nvidia.com/mlx5_1: 1
                  nvidia.com/mlx5_2: 1
                  nvidia.com/mlx5_3: 1
```

Performance Testing

Reference articles:
- https://zhuanlan.zhihu.com/p/694555753
- https://www.zhihu.com/question/454800042
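The four per-NIC entries in the rdmaSharedDevicePlugin configuration follow a regular pattern; as an illustration only (the helper name and mapping dict are ours, taken from the lanX-to-mlx5_X mapping described above), a short script can generate them:

```python
# Generate the rdma-shared-device-plugin "resources" entries from the
# host's lanX -> mlx5_X mapping (illustrative sketch, not part of the deploy).
NIC_MAP = {"lan2": "mlx5_0", "lan3": "mlx5_1", "lan4": "mlx5_2", "lan5": "mlx5_3"}

def make_resources(nic_map, prefix="nvidia.com", shares=100):
    return [
        {
            "resourcePrefix": prefix,
            "resourceName": rdma_dev,
            "rdmaHcaMax": shares,   # each NIC is split into 100 schedulable shares
            "vendors": ["15b3"],    # Mellanox PCI vendor ID
            "ifNames": [ifname],
        }
        for ifname, rdma_dev in sorted(nic_map.items())
    ]

resources = make_resources(NIC_MAP)
print(resources[0]["resourceName"])  # mlx5_0
```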
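The whereabouts IPAM blocks of the four MacvlanNetwork objects differ only in range and gateway, and each gateway is the first host address of its /22. A minimal sketch (the `ipam_config` helper is ours) that renders the JSON with Python's standard `ipaddress` module:

```python
import ipaddress
import json

def ipam_config(cidr):
    """Render the whereabouts IPAM JSON for one MacvlanNetwork.

    Sketch only: assumes the gateway is the first host of the range,
    as in the four configurations above.
    """
    net = ipaddress.ip_network(cidr)
    return json.dumps({
        "type": "whereabouts",
        "datastore": "kubernetes",
        "kubernetes": {
            "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
        },
        "range": cidr,
        "gateway": str(next(net.hosts())),  # e.g. 192.168.4.1 for 192.168.4.0/22
        "log_file": "/var/log/whereabouts.log",
        "log_level": "info",
    }, indent=2)

print(json.loads(ipam_config("192.168.4.0/22"))["gateway"])  # 192.168.4.1
```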
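In the Job above, NCCL_IB_HCA is set to the bare string mlx5; NCCL prefix-matches this against the available HCAs, so all four devices mlx5_0 through mlx5_3 are selected. A toy model of that selection logic (an illustration of the prefix-matching behavior, not NCCL's actual implementation):

```python
def select_hcas(hca_filter, available):
    """Toy model of NCCL_IB_HCA matching: a comma-separated list of
    entries, each of which prefix-matches device names (the default
    NCCL behavior when the list has no '=' or '^' prefix)."""
    wanted = hca_filter.split(",")
    return [dev for dev in available if any(dev.startswith(w) for w in wanted)]

devices = ["mlx5_0", "mlx5_1", "mlx5_2", "mlx5_3"]
print(select_hcas("mlx5", devices))  # all four match the "mlx5" prefix
```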