第 10 章 集群日志记录故障排除

10.1. 查看集群日志记录状态

您可以查看 Cluster Logging Operator 的状态以及多个集群日志记录组件的状态。

10.1.1. 查看 Cluster Logging Operator 的状态

您可以查看 Cluster Logging Operator 的状态。

先决条件

  • 必须安装 Cluster Logging 和 Elasticsearch。

流程

  1. 进入 openshift-logging 项目。

    $ oc project openshift-logging
  2. 查看集群日志记录状态:

    1. 获取集群日志记录状态:

      $ oc get clusterlogging instance -o yaml

      输出示例

      apiVersion: logging.openshift.io/v1
      kind: ClusterLogging
      
      ....
      
      status:  1
        collection:
          logs:
            fluentdStatus:
              daemonSet: fluentd  2
              nodes:
                fluentd-2rhqp: ip-10-0-169-13.ec2.internal
                fluentd-6fgjh: ip-10-0-165-244.ec2.internal
                fluentd-6l2ff: ip-10-0-128-218.ec2.internal
                fluentd-54nx5: ip-10-0-139-30.ec2.internal
                fluentd-flpnn: ip-10-0-147-228.ec2.internal
                fluentd-n2frh: ip-10-0-157-45.ec2.internal
              pods:
                failed: []
                notReady: []
                ready:
                - fluentd-2rhqp
                - fluentd-54nx5
                - fluentd-6fgjh
                - fluentd-6l2ff
                - fluentd-flpnn
                - fluentd-n2frh
        logstore: 3
          elasticsearchStatus:
          - ShardAllocationEnabled:  all
            cluster:
              activePrimaryShards:    5
              activeShards:           5
              initializingShards:     0
              numDataNodes:           1
              numNodes:               1
              pendingTasks:           0
              relocatingShards:       0
              status:                 green
              unassignedShards:       0
            clusterName:             elasticsearch
            nodeConditions:
              elasticsearch-cdm-mkkdys93-1:
            nodeCount:  1
            pods:
              client:
                failed:
                notReady:
                ready:
                - elasticsearch-cdm-mkkdys93-1-7f7c6-mjm7c
              data:
                failed:
                notReady:
                ready:
                - elasticsearch-cdm-mkkdys93-1-7f7c6-mjm7c
              master:
                failed:
                notReady:
                ready:
                - elasticsearch-cdm-mkkdys93-1-7f7c6-mjm7c
      visualization:  4
          kibanaStatus:
          - deployment: kibana
            pods:
              failed: []
              notReady: []
              ready:
              - kibana-7fb4fd4cc9-f2nls
            replicaSets:
            - kibana-7fb4fd4cc9
            replicas: 1

      1
      在输出中,集群状态字段显示在 status 小节中。
      2
      Fluentd Pod 的相关信息。
      3
      Elasticsearch Pod 的相关信息,包括 Elasticsearch 集群健康状态 greenyellowred
      4
      Kibana Pod 的相关信息。

10.1.1.1. 情况消息示例

以下示例是来自集群日志记录实例的 Status.Nodes 部分的一些情况消息。

类似于以下内容的状态消息表示节点已超过配置的低水位线,并且没有分片将分配给此节点:

输出示例

  nodes:
  - conditions:
    - lastTransitionTime: 2019-03-15T15:57:22Z
      message: Disk storage usage for node is 27.5gb (36.74%). Shards will be not
        be allocated on this node.
      reason: Disk Watermark Low
      status: "True"
      type: NodeStorage
    deploymentName: example-elasticsearch-clientdatamaster-0-1
    upgradeStatus: {}

类似于以下内容的状态消息表示节点已超过配置的高水位线,并且分片将重新定位到其他节点:

输出示例

  nodes:
  - conditions:
    - lastTransitionTime: 2019-03-15T16:04:45Z
      message: Disk storage usage for node is 27.5gb (36.74%). Shards will be relocated
        from this node.
      reason: Disk Watermark High
      status: "True"
      type: NodeStorage
    deploymentName: cluster-logging-operator
    upgradeStatus: {}

类似于以下内容的状态消息表示 CR 中的 Elasticsearch 节点选择器与集群中的任何节点都不匹配:

输出示例

    Elasticsearch Status:
      Shard Allocation Enabled:  shard allocation unknown
      Cluster:
        Active Primary Shards:  0
        Active Shards:          0
        Initializing Shards:    0
        Num Data Nodes:         0
        Num Nodes:              0
        Pending Tasks:          0
        Relocating Shards:      0
        Status:                 cluster health unknown
        Unassigned Shards:      0
      Cluster Name:             elasticsearch
      Node Conditions:
        elasticsearch-cdm-mkkdys93-1:
          Last Transition Time:  2019-06-26T03:37:32Z
          Message:               0/5 nodes are available: 5 node(s) didn't match node selector.
          Reason:                Unschedulable
          Status:                True
          Type:                  Unschedulable
        elasticsearch-cdm-mkkdys93-2:
      Node Count:  2
      Pods:
        Client:
          Failed:
          Not Ready:
            elasticsearch-cdm-mkkdys93-1-75dd69dccd-f7f49
            elasticsearch-cdm-mkkdys93-2-67c64f5f4c-n58vl
          Ready:
        Data:
          Failed:
          Not Ready:
            elasticsearch-cdm-mkkdys93-1-75dd69dccd-f7f49
            elasticsearch-cdm-mkkdys93-2-67c64f5f4c-n58vl
          Ready:
        Master:
          Failed:
          Not Ready:
            elasticsearch-cdm-mkkdys93-1-75dd69dccd-f7f49
            elasticsearch-cdm-mkkdys93-2-67c64f5f4c-n58vl
          Ready:

类似于以下内容的状态消息表示请求的 PVC 无法绑定到 PV:

输出示例

      Node Conditions:
        elasticsearch-cdm-mkkdys93-1:
          Last Transition Time:  2019-06-26T03:37:32Z
          Message:               pod has unbound immediate PersistentVolumeClaims (repeated 5 times)
          Reason:                Unschedulable
          Status:                True
          Type:                  Unschedulable

类似于以下内容的状态消息表示无法调度 Fluentd Pod,因为节点选择器与任何节点都不匹配:

输出示例

Status:
  Collection:
    Logs:
      Fluentd Status:
        Daemon Set:  fluentd
        Nodes:
        Pods:
          Failed:
          Not Ready:
          Ready:

10.1.2. 查看集群日志记录组件的状态

您可以查看多个集群日志记录组件的状态。

先决条件

  • 必须安装 Cluster Logging 和 Elasticsearch。

流程

  1. 进入 openshift-logging 项目。

    $ oc project openshift-logging
  2. 查看集群日志记录环境的状态:

    $ oc describe deployment cluster-logging-operator

    输出示例

    Name:                   cluster-logging-operator
    
    ....
    
    Conditions:
      Type           Status  Reason
      ----           ------  ------
      Available      True    MinimumReplicasAvailable
      Progressing    True    NewReplicaSetAvailable
    
    ....
    
    Events:
      Type    Reason             Age   From                   Message
      ----    ------             ----  ----                   -------
      Normal  ScalingReplicaSet  62m   deployment-controller  Scaled up replica set cluster-logging-operator-574b8987df to 1----

  3. 查看集群日志记录副本集的状态:

    1. 获取副本集的名称:

      输出示例

      $ oc get replicaset

      输出示例

      NAME                                      DESIRED   CURRENT   READY   AGE
      cluster-logging-operator-574b8987df       1         1         1       159m
      elasticsearch-cdm-uhr537yu-1-6869694fb    1         1         1       157m
      elasticsearch-cdm-uhr537yu-2-857b6d676f   1         1         1       156m
      elasticsearch-cdm-uhr537yu-3-5b6fdd8cfd   1         1         1       155m
      kibana-5bd5544f87                         1         1         1       157m

    2. 获取副本集的状态:

      $ oc describe replicaset cluster-logging-operator-574b8987df

      输出示例

      Name:           cluster-logging-operator-574b8987df
      
      ....
      
      Replicas:       1 current / 1 desired
      Pods Status:    1 Running / 0 Waiting / 0 Succeeded / 0 Failed
      
      ....
      
      Events:
        Type    Reason            Age   From                   Message
        ----    ------            ----  ----                   -------
        Normal  SuccessfulCreate  66m   replicaset-controller  Created pod: cluster-logging-operator-574b8987df-qjhqv----