• Installing a 3-node FATE v1.8.0 cluster on Kubernetes (k8s)


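    This guide installs a FATE cluster on three CentOS nodes using a full Kubernetes (k8s) cluster rather than minikube.
    After installing v1.8.0 I could log in to FATEBoard but could not transfer data, and I never resolved the problem. I therefore installed v1.7.2 instead; that write-up has more specific configuration and clearer steps. See "Installing a 3-node federated learning FATE cluster v1.7.2 on k8s (the most detailed guide online, with many pitfalls solved)":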
    https://blog.csdn.net/Acecai01/article/details/128253844?spm=1001.2014.3001.5502

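    Cluster configuration

    The configuration of the three nodes is shown in the figure below:
    [Figure: configuration of the three nodes]

    When the latest KubeFATE release was 1.9.0, its recommended versions of Kubernetes and ingress-nginx were: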
    Recommended version of dependent software:
    Kubernetes: v1.23.5
    Ingress-nginx: v1.1.3

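    Upgrading Kubernetes to 1.23.5

    If your cluster's Kubernetes version is higher than 1.19.0, you can skip this step.
    Reference posts:
    https://blog.csdn.net/RivenDong/article/details/121213109
    https://www.cnblogs.com/cloud-yongqing/p/16629666.html
    Repeat the steps below once per minor version to upgrade Kubernetes step by step from 1.18.x to 1.23.5. The commands show version numbers from two different rounds (1.19.16 on the master, 1.20.15 on the worker); within a round, use the same target version on every node.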

    Master node
    yum install -y kubeadm-1.19.16-0 --disableexcludes=kubernetes
    kubeadm version
    kubectl drain harbor.clife.io --delete-emptydir-data --ignore-daemonsets
    kubeadm upgrade plan --ignore-preflight-errors=CoreDNSUnsupportedPlugins,CoreDNSMigration
    kubeadm upgrade apply v1.19.16  --ignore-preflight-errors=CoreDNSUnsupportedPlugins,CoreDNSMigration
    yum install -y kubelet-1.19.16-0 kubectl-1.19.16-0
    systemctl daemon-reload
    systemctl restart kubelet
    kubectl uncordon  harbor.clife.io
    
    Worker node gpu-51
    Run on the master node:       kubectl drain gpu-51 --ignore-daemonsets
    yum install -y kubeadm-1.20.15-0 --disableexcludes=kubernetes
    kubeadm upgrade node
    yum install -y kubelet-1.20.15-0 kubectl-1.20.15-0 --disableexcludes=kubernetes
    systemctl daemon-reload
    systemctl restart kubelet
    Run on the master node:       kubectl uncordon gpu-51
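Because kubeadm does not support skipping minor versions, going from 1.18.x to 1.23.5 takes five rounds of the steps above. As a small sketch (the function name `upgrade_hops` is my own and not part of kubeadm), the minor versions to pass through can be listed like this. The patch release to install in each round, such as 1.19.16 or 1.20.15 above, depends on what your yum repository offers.

```shell
# Print the minor versions kubeadm has to pass through, one per line.
# upgrade_hops is a hypothetical helper name, not a kubeadm command.
upgrade_hops() {
  from_minor=$1   # current minor version, e.g. 18 for v1.18.x
  to_minor=$2     # target minor version, e.g. 23 for v1.23.x
  minor=$((from_minor + 1))
  while [ "$minor" -le "$to_minor" ]; do
    echo "1.$minor"
    minor=$((minor + 1))
  done
}

upgrade_hops 18 23
```

Each printed version corresponds to one full pass over the master and then the worker nodes.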
    

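    Removing an old FATE installation

    If FATE has never been installed on your cluster, skip this step.
    Find any previously installed old version of FATE and delete it.
    List the namespaces:
    kubectl get ns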

    NAME                              STATUS        AGE
    default                           Active        504d
    fate-10000                        Active        459d
    fate-9999                         Active        459d
    ingress-nginx                     Active        465d
    istio-system                      Active        497d
    kube-fate                         Active        465d
    kube-node-lease                   Active        504d
    kube-public                       Active        504d
    kube-system                       Active        504d
    kubernetes-dashboard              Terminating   504d
    kubernetes-dashboard2             Active        4d17h
    kubesphere-controls-system        Active        489d
    kubesphere-monitoring-federated   Active        489d
    kubesphere-monitoring-system      Active        489d
    minio                             Active        363d
    monitoring                        Active        362d
    seldon                            Active        159d
    seldon-system                     Active        502d
    

    Delete the FATE namespaces:
    kubectl delete namespace fate-10000
    kubectl delete namespace fate-9999
    kubectl delete namespace kube-fate

    KubeFATE download

    Link: link
    Package: kubefate-k8s-v1.8.0.tar.gz

    All of the following operations are performed on the master node.

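    Deploying ingress-nginx

    Reference: https://blog.csdn.net/qq_41296573/article/details/125809696
    The deploy.yaml below is the manifest for deploying ingress-nginx (version 1.1.3; the latest at the time was 1.5.0). Downloading it may require a proxy:
    https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.1.3/deploy/static/provider/cloud/deploy.yaml
    The manifest references two images that can only be pulled through a proxy. Replace them with images from a mirror registry in China (three occurrences in total):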

    k8s.gcr.io/ingress-nginx/controller:v1.1.3@sha256:31f47c1e202b39fadecf822a9b76370bd4baed199a005b3e7d4d1455f4fd3fe2
    Change to:
    registry.cn-hangzhou.aliyuncs.com/google_containers/nginx-ingress-controller:v1.1.3
    
    k8s.gcr.io/ingress-nginx/kube-webhook-certgen:v1.1.1@sha256:64d8c73dca984af206adf9d6d7e46aa550362b1d7a01f3a0a91b20cc67868660
    Change to:
    registry.cn-hangzhou.aliyuncs.com/google_containers/kube-webhook-certgen:v1.1.1
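The same replacements can probably be applied with sed instead of by hand. The sketch below rests on assumptions: `rewrite_ingress_images` is a hypothetical helper name, and the image references in deploy.yaml are assumed to look exactly like the two strings above. Like the manual replacement, it drops the `@sha256:` digest.

```shell
# Rewrite the two k8s.gcr.io image references to the Aliyun mirror names
# used above. Reads a manifest on stdin, writes the result to stdout.
# rewrite_ingress_images is a hypothetical helper name.
rewrite_ingress_images() {
  sed -E \
    -e 's#k8s\.gcr\.io/ingress-nginx/controller:(v[0-9.]+)(@sha256:[0-9a-f]+)?#registry.cn-hangzhou.aliyuncs.com/google_containers/nginx-ingress-controller:\1#' \
    -e 's#k8s\.gcr\.io/ingress-nginx/kube-webhook-certgen:(v[0-9.]+)(@sha256:[0-9a-f]+)?#registry.cn-hangzhou.aliyuncs.com/google_containers/kube-webhook-certgen:\1#'
}

# Demo on a single line; for the real file one would run something like:
#   rewrite_ingress_images < deploy.yaml > deploy_cn.yaml
printf '%s\n' 'image: k8s.gcr.io/ingress-nginx/kube-webhook-certgen:v1.1.1@sha256:64d8c73dca984af206adf9d6d7e46aa550362b1d7a01f3a0a91b20cc67868660' \
  | rewrite_ingress_images
```

Writing to a new file (here the assumed name deploy_cn.yaml) keeps the downloaded manifest intact for comparison with `diff`.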
    

    Then deploy ingress-nginx:
    kubectl apply -f ./deploy.yaml
    Check whether ingress-nginx was deployed successfully:

    [root@harbor kubefate]#  kubectl get  pods -n ingress-nginx -o wide
    NAME                                        READY   STATUS      RESTARTS   AGE     IP            NODE         NOMINATED NODE   READINESS GATES
    ingress-nginx-admission-create-zh96h        0/1     Completed   0          2d23h   10.244.1.26   gpu-51       <none>           <none>
    ingress-nginx-admission-patch-hmgr5         0/1     Completed   1          2d23h   10.244.1.27   gpu-51       <none>           <none>
    ingress-nginx-controller-6995ffb95b-m87gh   1/1     Running     0          2d18h   172.17.0.8    k8s-node02   <none>           <none>
    

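    The ingress-nginx controller was scheduled onto the k8s-node02 node rather than the master. This is normal: even when you run the command on the master, the pod may land on another node.
    Run the following command to check that the configuration has taken effect:
    kubectl -n ingress-nginx get svc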

    NAME                                 TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)                      AGE
    ingress-nginx-controller             LoadBalancer   10.1.196.14   <pending>     80:30428/TCP,443:30338/TCP   16m
    ingress-nginx-controller-admission   ClusterIP      10.1.32.33    <none>        443/TCP                      16m
    

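    The EXTERNAL-IP of ingress-nginx-controller is stuck in the pending state. After some research I followed this post:
    Link: link
    Edit the ingress-nginx-controller service so that its EXTERNAL-IP is the IP of the k8s-node02 node:
    kubectl edit -n ingress-nginx service/ingress-nginx-controller
    Add externalIPs at roughly the following position: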

    spec:
      allocateLoadBalancerNodePorts: true
      clusterIP: 10.1.86.240
      clusterIPs:
      - 10.1.86.240
      externalIPs:
      - 10.6.17.106
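If you would rather not edit the service interactively, `kubectl patch` should be able to make the same change. This is a sketch rather than what I actually ran: `external_ip_patch` is a hypothetical helper that only builds the patch body, and the IP is the address used above.

```shell
# Build a merge-patch body that sets spec.externalIPs to a single address.
# external_ip_patch is a hypothetical helper name.
external_ip_patch() {
  printf '{"spec":{"externalIPs":["%s"]}}\n' "$1"
}

# Print the patch body for inspection.
external_ip_patch 10.6.17.106

# Assumed usage (not executed here):
#   kubectl -n ingress-nginx patch service ingress-nginx-controller \
#     -p "$(external_ip_patch 10.6.17.106)"
```

Printing the body first makes it easy to check the JSON before sending it to the cluster.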
    

    Check again; the EXTERNAL-IP is now set:

    [root@harbor kubefate]# kubectl -n ingress-nginx get svc
    NAME                                 TYPE           CLUSTER-IP    EXTERNAL-IP   PORT(S)                      AGE
    ingress-nginx-controller             LoadBalancer   10.1.86.240   10.6.17.106   80:31872/TCP,443:32412/TCP   2d23h
    ingress-nginx-controller-admission   ClusterIP      10.1.41.173   <none>        443/TCP                      2d23h
    

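    Deploying the KubeFATE service

    1. Load the KubeFATE service image
    Next, download the KubeFATE service image v1.4.4:
    curl -LO https://github.com/FederatedAI/KubeFATE/releases/download/v1.8.0/kubefate-v1.4.4.docker
    Note: the release tag in the URL is v1.8.0, while the image file is v1.4.4.
    Then load it into the local Docker environment:
    docker load < kubefate-v1.4.4.docker
    Create a directory:
    mkdir /home/FATE_V180
    Copy kubefate-k8s-v1.8.0.tar.gz into the new directory and extract it:
    tar -zxvf kubefate-k8s-v1.8.0.tar.gz
    The extracted directory contains the kubefate executable, which can be moved into a directory on the PATH for convenience:
    chmod +x ./kubefate && sudo mv ./kubefate /usr/bin
    Check that the kubefate command works:
    kubefate version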

    * kubefate commandLine version=v1.4.4
    * kubefate service connection error, resp.StatusCode=404, error:
    404 - Not Found


    The error above is expected at this stage and is resolved later.

    Apply rbac-config.yaml to create the namespace for the KubeFATE service:
    kubectl apply -f ./rbac-config.yaml

    Because Docker Hub recently changed its pull-limit terms of service (Dockerhub latest limitation), I recommend using the NetEase Cloud registry in China instead of the official Docker Hub.

    1. In kubefate.yaml, change the image federatedai/kubefate:v1.4.4 to hub.c.163.com/federatedai/kubefate:v1.4.4
    2. sed 's/mariadb:10/hub.c.163.com\/federatedai\/mariadb:10/g' kubefate.yaml > kubefate_163.yaml
    3. sed 's/registry: ""/registry: "hub.c.163.com\/federatedai"/g' cluster.yaml > cluster_163.yaml
    

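    Deploy the KubeFATE service into the kube-fate namespace. The required YAML file is already in the working directory, so simply run kubectl apply:
    kubectl apply -f ./kubefate_163.yaml
    [Note] If you deleted kubefate and ingress-nginx and are re-running this step, an error may occur. For a fix, see: https://blog.csdn.net/qq_39218530/article/details/115372879

    Wait a moment (ten-odd seconds), then check whether the KubeFATE service has been deployed:
    kubectl get all,ingress -n kube-fate
    One problem that can make the kubefate pod crash: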

    Startup probe failed: Get "http://10.244.1.34:8080/": dial tcp 10.244.1.34:8080: connect: connection refused
    

    If the output resembles the following (in particular, the pod STATUS shows Running), the KubeFATE service has been deployed and is running normally:

    [root@harbor kubefate]# kubectl get all,ingress -n kube-fate
    NAME                            READY   STATUS                   RESTARTS   AGE
    pod/kubefate-5bf485957b-9wltd   0/1     Evicted                  0          2d20h
    pod/kubefate-5bf485957b-bh774   0/1     ContainerStatusUnknown   1          3d1h
    pod/kubefate-5bf485957b-bs8zc   0/1     Evicted                  0          2d20h
    pod/kubefate-5bf485957b-cj7j7   0/1     Evicted                  0          2d20h
    pod/kubefate-5bf485957b-hn2xm   0/1     Evicted                  0          2d20h
    pod/kubefate-5bf485957b-m4hn6   0/1     Evicted                  0          2d20h
    pod/kubefate-5bf485957b-ncbc2   0/1     Evicted                  0          2d20h
    pod/kubefate-5bf485957b-tznw6   1/1     Running                  0          2d20h
    pod/mariadb-574d4679f8-f5wc2    1/1     Running                  0          2d20h
    pod/mariadb-574d4679f8-mw9np    0/1     ContainerStatusUnknown   1          3d1h
    
    NAME               TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)          AGE
    service/kubefate   NodePort    10.1.151.34    <none>        8080:30053/TCP   3d1h
    service/mariadb    ClusterIP   10.1.150.151   <none>        3306/TCP         3d1h
    
    NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
    deployment.apps/kubefate   1/1     1            1           3d1h
    deployment.apps/mariadb    1/1     1            1           3d1h
    
    NAME                                  DESIRED   CURRENT   READY   AGE
    replicaset.apps/kubefate-5bf485957b   1         1         1       3d1h
    replicaset.apps/mariadb-574d4679f8    1         1         1       3d1h
    
    NAME                                 CLASS   HOSTS         ADDRESS       PORTS   AGE
    ingress.networking.k8s.io/kubefate   nginx   example.com   10.6.17.106   80      3d1h
    

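    2. Add example.com to the hosts file
    We access the KubeFATE service through the example.com domain (defined in the ingress; change it if you need to), so the hosts file must be configured on the machine where the kubefate command line runs (note: this is the machine hosting ingress-nginx, not the Kubernetes machine; see the ingress-nginx section above). The FATE clusters deployed later also use example.com as their default domain. If your network has a DNS service, you can instead point example.com at the master machine's IP address and skip the hosts file. (Be sure to replace the IP address with your own.)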
    sudo -- sh -c "echo \"10.6.17.106 example.com\" >> /etc/hosts"

    [root@harbor kubefate]# ping example.com
    PING example.com (10.6.17.106) 56(84) bytes of data.
    64 bytes from k8s-master (10.6.17.106): icmp_seq=1 ttl=64 time=0.041 ms
    64 bytes from k8s-master (10.6.17.106): icmp_seq=2 ttl=64 time=0.054 ms
    64 bytes from k8s-master (10.6.17.106): icmp_seq=3 ttl=64 time=0.050 ms
    ^C
    --- example.com ping statistics ---
    3 packets transmitted, 3 received, 0% packet loss, time 2000ms
    rtt min/avg/max/mdev = 0.041/0.048/0.054/0.007 ms
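Re-running the `echo ... >> /etc/hosts` command above adds a duplicate line each time. A minimal sketch of an idempotent variant follows; `add_host_entry` is a hypothetical helper name, and the demo writes to a temporary file rather than /etc/hosts so it can be tried without root.

```shell
# Append "IP NAME" to a hosts-style file only if that exact line is absent.
# add_host_entry is a hypothetical helper name.
add_host_entry() {
  hosts_file=$1
  ip=$2
  name=$3
  # -x matches whole lines only, -F treats the entry as a fixed string
  if ! grep -qxF "$ip $name" "$hosts_file" 2>/dev/null; then
    printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
  fi
}

# Demo against a temporary file; calling it twice still leaves one line.
demo_hosts=$(mktemp)
add_host_entry "$demo_hosts" 10.6.17.106 example.com
add_host_entry "$demo_hosts" 10.6.17.106 example.com
cat "$demo_hosts"
```

For the real file, the call would presumably be run as root with /etc/hosts as the first argument.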
    

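    Edit config.yaml with vi. Only serviceurl needs to change: append the NodePort mapped to port 80, for example serviceurl: example.com:31872 (use the port from your own output). If you have forgotten it, look up the port mapped to port 80 again: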

    [root@harbor kubefate]# kubectl -n ingress-nginx get svc
    NAME                                 TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
    ingress-nginx-controller             LoadBalancer   10.1.130.161   10.6.17.106   80:32415/TCP,443:32491/TCP   3d19h
    ingress-nginx-controller-admission   ClusterIP      10.1.78.36     <none>        443/TCP                      3d19h
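Reading the port off the table by eye is error-prone, so here is a small sketch that extracts it with awk. `http_nodeport` is a hypothetical helper name; it assumes the output format shown above, with the port mappings in a single comma-separated field.

```shell
# Read `kubectl -n ingress-nginx get svc` output on stdin and print the
# NodePort that the ingress-nginx-controller service maps to port 80.
# http_nodeport is a hypothetical helper name.
http_nodeport() {
  awk '$1 == "ingress-nginx-controller" {
    for (i = 1; i <= NF; i++) {
      if ($i ~ "^80:[0-9]+/TCP") {
        split($i, parts, "[:/]")   # "80:32415/TCP,..." -> 80, 32415, ...
        print parts[2]
      }
    }
  }'
}

# Demo using the output shown above.
sample_svc='NAME                                 TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)                      AGE
ingress-nginx-controller             LoadBalancer   10.1.130.161   10.6.17.106   80:32415/TCP,443:32491/TCP   3d19h
ingress-nginx-controller-admission   ClusterIP      10.1.78.36     <none>        443/TCP                      3d19h'
port=$(printf '%s\n' "$sample_svc" | http_nodeport)
echo "serviceurl: example.com:$port"
```

On a live cluster the function would be fed by `kubectl -n ingress-nginx get svc | http_nodeport`.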
    

    After the change, check the version again; the output should look like this:

    [root@harbor kubefate]# kubefate version
    * kubefate commandLine version=v1.4.4
    * kubefate service version=v1.4.4
    

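    Installing FATE with KubeFATE

    As planned earlier, we install three federation parties with IDs 9998, 9999 and 10000. In reality these three parties would be completely independent, isolated organizations. To simulate that, we first create a separate Kubernetes namespace for each of them: fate-9998 for party 9998, fate-9999 for party 9999, and fate-10000 for party 10000.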

    kubectl create namespace fate-9998
    kubectl create namespace fate-9999
    kubectl create namespace fate-10000
    

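    The examples directory ships with three preset examples: /kubefate/examples/party-9998/, /kubefate/examples/party-9999/ and /kubefate/examples/party-10000. Taking /kubefate/examples/party-9999/cluster.yaml as an example, each party's cluster.yaml can be modified as follows: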
    party-9998:

    name: fate-9998
    namespace: fate-9998
    chartName: fate
    chartVersion: v1.8.0
    partyId: 9998
    registry: "hub.c.163.com/federatedai"    # switched to a registry mirror in China
    imageTag: 1.8.0-release
    pullPolicy: 
    imagePullSecrets: 
    - name: myregistrykey
    persistence: false
    istio:
      enabled: false
    podSecurityPolicy:
      enabled: false
    ingressClassName: nginx
    modules:
      - rollsite
      - clustermanager
      - nodemanager
      - mysql
      - python
      - fateboard
      - client
    
    backend: eggroll
    
    ingress:
      fateboard:
        hosts:
        - name: party9998.fateboard.example.com
      client:  
        hosts:
        - name: party9998.notebook.example.com
    
    rollsite: 
      type: NodePort
      nodePort: 30081
      partyList:
      - partyId: 10000
        partyIp: 10.6.17.104
        partyPort: 30101
      - partyId: 9999
        partyIp: 10.6.17.106
        partyPort: 30091
    
    python:
      type: NodePort
      httpNodePort: 30087
      grpcNodePort: 30082
      logLevel: INFO
    
    servingIp: 10.6.14.13
    servingPort: 30085
    

    party-9999:

    name: fate-9999
    namespace: fate-9999
    chartName: fate
    chartVersion: v1.8.0
    partyId: 9999
    registry: "hub.c.163.com/federatedai"
    imageTag: 1.8.0-release
    pullPolicy: 
    imagePullSecrets: 
    - name: myregistrykey
    persistence: false
    istio:
      enabled: false
    podSecurityPolicy:
      enabled: false
    ingressClassName: nginx
    modules:
      - rollsite
      - clustermanager
      - nodemanager
      - mysql
      - python
      - fateboard
      - client
    
    backend: eggroll
    
    ingress:
      fateboard:
        hosts:
        - name: party9999.fateboard.example.com
      client:  
        hosts:
        - name: party9999.notebook.example.com
    
    rollsite: 
      type: NodePort
      nodePort: 30091
      partyList:
      - partyId: 10000
        partyIp: 10.6.17.104
        partyPort: 30101
      - partyId: 9998
        partyIp: 10.6.14.13
        partyPort: 30081
    
    python:
      type: NodePort
      httpNodePort: 30097
      grpcNodePort: 30092
      logLevel: INFO
    
    servingIp: 10.6.17.106
    servingPort: 30095
    

    party-10000:

    name: fate-10000
    namespace: fate-10000
    chartName: fate
    chartVersion: v1.8.0
    partyId: 10000
    registry: "hub.c.163.com/federatedai"
    imageTag: 1.8.0-release
    pullPolicy: 
    imagePullSecrets: 
    - name: myregistrykey
    persistence: false
    istio:
      enabled: false
    podSecurityPolicy:
      enabled: false
    ingressClassName: nginx
    modules:
      - rollsite
      - clustermanager
      - nodemanager
      - mysql
      - python
      - fateboard
      - client
    
    backend: eggroll
    
    ingress:
      fateboard: 
        hosts:
        - name: party10000.fateboard.example.com
      client:  
        hosts:
        - name: party10000.notebook.example.com
    
    rollsite: 
      type: NodePort
      nodePort: 30101
      partyList:
      - partyId: 9999
        partyIp: 10.6.17.106
        partyPort: 30091
      - partyId: 9998
        partyIp: 10.6.14.13
        partyPort: 30081
        
    python:
      type: NodePort
      httpNodePort: 30107
      grpcNodePort: 30102
      logLevel: INFO
    
    servingIp: 10.6.17.104
    servingPort: 30105
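In all three files, the partyList of one party is simply the other two parties with their IP and rollsite nodePort. As a cross-check against copy-paste mistakes, the entries can be generated from a single table. This is only a sketch: `PARTIES` and `party_list_for` are names I made up, and the table values are copied from the three configurations above.

```shell
# One row per party: party ID, node IP, rollsite nodePort
# (values copied from the three cluster.yaml files above).
PARTIES='9998 10.6.14.13 30081
9999 10.6.17.106 30091
10000 10.6.17.104 30101'

# Print the rollsite partyList entries for every party except the given one.
# party_list_for is a hypothetical helper name.
party_list_for() {
  self=$1
  printf '%s\n' "$PARTIES" | while read -r id ip port; do
    if [ "$id" = "$self" ]; then
      continue
    fi
    printf '%s\n' "- partyId: $id" "  partyIp: $ip" "  partyPort: $port"
  done
}

party_list_for 9998
```

The order of the generated entries follows the table and may differ from the hand-written files.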
    

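    Installing the FATE clusters
    If everything is in order, use kubefate cluster install to deploy the three FATE clusters (for steps where I hit no pitfalls, just follow the official instructions):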

    kubefate cluster install -f ./examples/party-10000/cluster10000.yaml
    kubefate cluster install -f ./examples/party-9999/cluster9999.yaml
    kubefate cluster install -f ./examples/party-9998/cluster9998.yaml
    

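    KubeFATE now creates three jobs, one to deploy each of the three FATE clusters. You can list the jobs with kubefate job ls, or simply watch the cluster status in KubeFATE until it becomes Running: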

    [root@harbor kubefate]# watch kubefate cluster ls
    UUID                                    NAME            NAMESPACE       REVISION        STATUS          CHART   ChartVERSION    AGE
    7bca70c1-236c-4931-81f8-1350cce579d4    fate-9998       fate-9998       1               Running         fate    v1.8.0          18m
    143378db-b84d-4045-8615-11d36335d5b2    fate-9999       fate-9999       0               Creating        fate    v1.8.0          17m
    d3e27a39-c8de-4615-96f2-29012f3edc68    fate-10000      fate-10000      0               Creating        fate    v1.8.0          17m
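Instead of watching the table by eye, the STATUS column can be checked in a script. The following is a sketch assuming the column layout shown above, where STATUS is the fifth field; `all_clusters_running` is a hypothetical helper name.

```shell
# Read `kubefate cluster ls` output on stdin; succeed only if every
# cluster row has STATUS (the fifth column) equal to Running.
# all_clusters_running is a hypothetical helper name.
all_clusters_running() {
  awk 'NR > 1 && NF >= 5 && $5 != "Running" { pending = 1 }
       END { exit (pending ? 1 : 0) }'
}

# Demo using the output shown above.
sample_clusters='UUID                                    NAME            NAMESPACE       REVISION        STATUS          CHART   ChartVERSION    AGE
7bca70c1-236c-4931-81f8-1350cce579d4    fate-9998       fate-9998       1               Running         fate    v1.8.0          18m
143378db-b84d-4045-8615-11d36335d5b2    fate-9999       fate-9999       0               Creating        fate    v1.8.0          17m
d3e27a39-c8de-4615-96f2-29012f3edc68    fate-10000      fate-10000      0               Creating        fate    v1.8.0          17m'

if printf '%s\n' "$sample_clusters" | all_clusters_running; then
  status_msg="all clusters running"
else
  status_msg="still deploying"
fi
echo "$status_msg"

# Assumed usage on a live system (not executed here):
#   until kubefate cluster ls | all_clusters_running; do sleep 30; done
```

With the sample above, two clusters are still in the Creating state, so the check reports that deployment is not finished.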
    

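    This step pulls roughly 10 GB of images from the NetEase Cloud registry, so the first run takes a while depending on your network (wait patiently until the status becomes Running). You can check the download progress with: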

    kubectl get po -n fate-9998
    kubectl get po -n fate-9999
    kubectl get po -n fate-10000
    

    Once all images have been downloaded, the result looks like this:

    [root@harbor kubefate]# kubectl get po -n fate-9998
    NAME                             READY   STATUS    RESTARTS   AGE
    client-7ccbc89559-rfr2l          1/1     Running   0          20m
    clustermanager-fcb86747f-z9vq9   1/1     Running   0          20m
    mysql-6d546bd578-r5fl2           1/1     Running   0          20m
    nodemanager-0-66dfd58cdc-6z7mc   2/2     Running   0          20m
    nodemanager-1-7b7c65c685-fz9bb   2/2     Running   0          20m
    python-594cd5c47b-5l88p          2/2     Running   0          20m
    rollsite-6b77d9f5f7-ll9sv        1/1     Running   0          20m
    

    Verifying the FATE deployment

    From the kubefate cluster ls output above, the cluster ID of fate-9998 is 7bca70c1-236c-4931-81f8-1350cce579d4, that of fate-9999 is 143378db-b84d-4045-8615-11d36335d5b2, and that of fate-10000 is d3e27a39-c8de-4615-96f2-29012f3edc68. Use kubefate cluster describe to query a cluster's access details:

    [root@harbor kubefate]# kubefate cluster describe 7bca70c1-236c-4931-81f8-1350cce579d4
    UUID            7bca70c1-236c-4931-81f8-1350cce579d4       
    Name            fate-9998                                  
    NameSpace       fate-9998                                  
    ChartName       fate                                       
    ChartVersion    v1.8.0                                     
    Revision        1                                          
    Age             27m                                        
    Status          Running                                    
    Spec            backend: eggroll                           
                    chartName: fate                            
                    chartVersion: v1.8.0                       
                    imagePullSecrets:                          
                    - name: myregistrykey                      
                    imageTag: 1.8.0-release                    
                    ingress:                                   
                      client:                                  
                        hosts:                                 
                        - name: party9998.notebook.example.com 
                      fateboard:                               
                        hosts:                                 
                        - name: party9998.fateboard.example.com
                    ingressClassName: nginx                    
                    istio:                                     
                      enabled: false                           
                    modules:                                   
                    - rollsite                                 
                    - clustermanager                           
                    - nodemanager                              
                    - mysql                                    
                    - python                                   
                    - fateboard                                
                    - client                                   
                    name: fate-9998                            
                    namespace: fate-9998                       
                    partyId: 9998                              
                    persistence: false                         
                    podSecurityPolicy:                         
                      enabled: false                           
                    pullPolicy: null                           
                    python:                                    
                      grpcNodePort: 30082                      
                      httpNodePort: 30087                      
                      logLevel: INFO                           
                      type: NodePort                           
                    registry: hub.c.163.com/federatedai        
                    rollsite:                                  
                      nodePort: 30081                          
                      partyList:                               
                      - partyId: 10000                         
                        partyIp: 10.6.17.104                   
                        partyPort: 30101                       
                      - partyId: 9999                          
                        partyIp: 10.6.17.106                   
                        partyPort: 30091                       
                      type: NodePort                           
                    servingIp: 10.6.14.13                      
                    servingPort: 30085                         
                                                               
    Info            dashboard:                                 
                    - party9998.notebook.example.com           
                    - party9998.fateboard.example.com          
                    ip: 10.6.17.106                            
                    port: 30081                                
                    status:                                    
                      containers:                              
                        client: Running                        
                        clustermanager: Running                
                        fateboard: Running                     
                        mysql: Running                         
                        nodemanager-0: Running                 
                        nodemanager-0-eggrollpair: Running     
                        nodemanager-1: Running                 
                        nodemanager-1-eggrollpair: Running     
                        python: Running                        
                        rollsite: Running                      
                      deployments:                             
                        client: Available                      
                        clustermanager: Available              
                        mysql: Available                       
                        nodemanager-0: Available               
                        nodemanager-1: Available               
                        python: Available                      
                        rollsite: Available         
    

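    In the output, Info->dashboard contains:

    1. The Jupyter Notebook address: party9998.notebook.example.com. This is the platform where data scientists do modeling and analysis; FATE-Clients is already integrated.
    2. The FATEBoard address: party9998.fateboard.example.com. FATEBoard shows the status of the current training jobs.

    Looking at fate-10000 in the same way, the dashboard hostnames differ but the ip is still 10.6.17.106, the address of ingress-nginx. So even a request to party10000.fateboard.example.com goes to 10.6.17.106 first, not to 10.6.17.104, the host where fate-10000 runs.

    On the machine whose browser will access the FATE clusters, configure the host entries

    On Windows, add the hostname mappings to C:\WINDOWS\system32\drivers\etc\hosts: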

    10.6.17.106 party9998.notebook.example.com
    10.6.17.106 party9998.fateboard.example.com
    10.6.17.106 party9999.notebook.example.com
    10.6.17.106 party9999.fateboard.example.com
    10.6.17.106 party10000.notebook.example.com
    10.6.17.106 party10000.fateboard.example.com
    

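    Note that every hostname above maps to the IP 10.6.17.106.
    Log in to party10000's FATEBoard at party10000.fateboard.example.com:32415; the username and password are shown in the figure below:
    [Figure: FATEBoard username and password]

    Problems:

    1. After one day, the FATEBoard pages for the fate-9998 and fate-10000 namespaces were no longer reachable; only fate-9999 still worked. Checking the pods: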

    [root@harbor kubefate]# kubectl get pods -n fate-9998
    NAME                             READY   STATUS             RESTARTS         AGE
    client-7ccbc89559-njr2m          1/1     Running            0                3d21h
    clustermanager-fcb86747f-8zzh7   1/1     Running            0                3d21h
    mysql-6d546bd578-9mfvn           1/1     Running            37 (117m ago)    3d21h
    nodemanager-0-66dfd58cdc-76wqc   2/2     Running            0                3d21h
    nodemanager-1-7b7c65c685-jb2gs   2/2     Running            0                3d21h
    python-594cd5c47b-vl4mb          1/2     CrashLoopBackOff   473 (117s ago)   3d21h
    rollsite-6b77d9f5f7-lk6dm        1/1     Running            0                3d21h
    

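    The python pod is in CrashLoopBackOff. It contains two containers, fateboard and ping-mysql; check the logs of its ping-mysql container:
    [root@harbor kubefate]# kubectl logs -f python-594cd5c47b-vl4mb -n fate-9998 -c ping-mysql
    The logs showed that mysql was at fault, so I restarted the mysql deployment of fate-9998:
    kubectl rollout restart deployment mysql -n fate-9998
    and then restarted the python deployment of fate-9998:
    kubectl rollout restart deployment python -n fate-9998
    This resolved the problem.

    References: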
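To spot this kind of failure sooner, the pod listing can be filtered for anything that is not fully ready. A minimal sketch, assuming the `kubectl get pods` column layout shown above; `unhealthy_pods` is a hypothetical helper name. Note that it would also list pods in the Completed state, which is harmless for one-off jobs.

```shell
# Read `kubectl get pods` output on stdin and print the names of pods
# whose STATUS is not Running or whose READY count is incomplete.
# unhealthy_pods is a hypothetical helper name.
unhealthy_pods() {
  awk 'NR > 1 && NF >= 3 {
    split($2, ready, "/")          # "1/2" -> ready[1]=1, ready[2]=2
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}

# Demo using three rows from the listing above.
sample_pods='NAME                             READY   STATUS             RESTARTS         AGE
mysql-6d546bd578-9mfvn           1/1     Running            37 (117m ago)    3d21h
python-594cd5c47b-vl4mb          1/2     CrashLoopBackOff   473 (117s ago)   3d21h
rollsite-6b77d9f5f7-lk6dm        1/1     Running            0                3d21h'
printf '%s\n' "$sample_pods" | unhealthy_pods
```

On a live cluster it would be fed by, for example, `kubectl get pods -n fate-9998 | unhealthy_pods`.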
    https://blog.csdn.net/qq_32202885/article/details/125998028
    https://blog.csdn.net/haveanybody/article/details/108253667

  • Original article: https://blog.csdn.net/Acecai01/article/details/127979608