OpenShift General Troubleshooting Recipe

From a browser

  1. Review overall status: Administrator } Home } Overview } Status
  2. Review node resource usage: Administrator } Home } Overview } Cluster Utilization

    Click on the utilization number and use the dropdown to get different views, e.g. By Node
  3. Review critical and warning alerts: Administrator } Observe (or Monitoring) } Alerting } Click "Filter", and check "Critical" and "Warning"
  4. Review recent warning and error events: Administrator } Home } Events } Change "All types" to "Warning"
  5. Review deep dive utilization: Administrator } Observe (or Monitoring) } Dashboards } Node Exporter / USE Method / Cluster

From the command line

  1. Ensure you're logged in with oc
  2. Review the overall cluster status:
    $ oc get clusterversion
    NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.10.10   True        False         87d     Error while reconciling 4.10.10: an unknown error has occurred: MultipleErrors
    1. If status includes "MultipleErrors", display all errors with:
      oc get clusterversion -o 'jsonpath={.items[*].status.conditions}'
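      Optionally, the conditions may be printed one per line with jsonpath's range syntax (a sketch):
      oc get clusterversion version -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'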
  3. Review the status of nodes:
    $ oc get nodes                                                                        
    NAME          STATUS     ROLES    AGE    VERSION
    master0       Ready      master   201d   v1.20.0+df9c838
    master1       Ready      master   201d   v1.20.0+df9c838
    master2       Ready      master   201d   v1.20.0+df9c838
    worker0       Ready      worker   201d   v1.20.0+df9c838
    worker1       NotReady   worker   11d    v1.20.0+df9c838
    worker2       Ready      worker   201d   v1.20.0+df9c838
    1. Describe any that are Status=NotReady and search for Conditions:
      $ oc describe node worker1
      Name:               worker1
      [...]
      Conditions:
        Type           Status  LastHeartbeatTime               LastTransitionTime              Reason            Message
        ----           ------  -----------------               ------------------              ------            -------
        MemoryPressure Unknown Fri, 03 Dec 2021 18:07:43 -0600 Tue, 11 Jan 2022 16:12:38 -0600 NodeStatusUnknown Kubelet stopped posting node status.
        DiskPressure   Unknown Fri, 03 Dec 2021 18:07:43 -0600 Tue, 11 Jan 2022 16:12:38 -0600 NodeStatusUnknown Kubelet stopped posting node status.
        PIDPressure    Unknown Fri, 03 Dec 2021 18:07:43 -0600 Tue, 11 Jan 2022 16:12:38 -0600 NodeStatusUnknown Kubelet stopped posting node status.
        Ready          Unknown Fri, 03 Dec 2021 18:07:43 -0600 Tue, 11 Jan 2022 16:12:38 -0600 NodeStatusUnknown Kubelet stopped posting node status.
      If no issues are obvious, debug the node in more depth.
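      For example, a sketch of reviewing the kubelet logs and service status (assuming the kubelet is still reachable from the cluster; otherwise access the node directly, e.g. over SSH):
      oc adm node-logs worker1 -u kubelet | tail -100
      oc debug node/worker1 -t
      chroot /host systemctl status kubelet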
  4. Review node resource usage:
    $ oc adm top nodes
    NAME      CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%     
    master0   1990m        26%    13070Mi         89%         
    master1   1614m        21%    10982Mi         75%         
    master2   1016m        13%    10138Mi         69%         
    worker0   4986m        32%    17360Mi         57%         
    worker1   4986m        32%    17360Mi         57%         
    worker2   2634m        16%    16352Mi         54%         
    1. Describe any that have high usage of CPU and/or memory:
      $ oc describe node master0
      Name:               master0
      [...]
      Allocatable:
        cpu:                7500m
        ephemeral-storage:  95069439022
        memory:             14871872Ki
        pods:               250
      [...]
      Non-terminated Pods:                      (32 in total)
        Namespace                  Name                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
        ---------                  ----                    ------------  ----------  ---------------  -------------  ---
        openshift-kube-apiserver   kube-apiserver-master0  290m (3%)     0 (0%)      1224Mi (8%)      0 (0%)         7d16h
      [...]
      Allocated resources:
        (Total limits may be over 100 percent, i.e., overcommitted.)
        Resource           Requests      Limits
        --------           --------      ------
        cpu                1595m (21%)   0 (0%)
        memory             5241Mi (36%)  0 (0%)
      If no CPU or memory culprits are obvious, debug the node in more depth.
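      For example, a sketch of listing the pods running on the busy node to find likely consumers (spec.nodeName is a standard field selector):
      oc get pods --all-namespaces --field-selector spec.nodeName=master0 -o wide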
  5. Review critical and warning alerts:
    1. Critical:
      curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)” https://$(oc -n openshift-monitoring get route alertmanager-main -o jsonpath='{.spec.host}')/api/v1/alerts?filter=severity=critical
    2. Warning:
      curl -k -H "Authorization: Bearer $(oc -n openshift-monitoring sa get-token prometheus-k8s)" https://$(oc -n openshift-monitoring get route alertmanager-main -o jsonpath='{.spec.host}')/api/v1/alerts?filter=severity=warning
  6. Review recent warning and error events:
    oc get events --sort-by='.lastTimestamp' --all-namespaces --field-selector type=Warning | tail -10
  7. Review top pod resource usage by CPU:
    $ oc adm top pod --all-namespaces --sort-by=cpu | head
    NAMESPACE                              NAME                                         CPU(cores)   MEMORY(bytes)   
    openshift-kube-apiserver               kube-apiserver-master0                       940m         6526Mi          
    openshift-operators                    service-binding-operator-c4896b966-js9t9     547m         662Mi           
    openshift-etcd                         etcd-master1                                 480m         2185Mi          
    openshift-kube-apiserver               kube-apiserver-master1                       304m         4751Mi          
    openshift-kube-apiserver               kube-apiserver-master2                       284m         4787Mi          
    openshift-operator-lifecycle-manager   olm-operator-64fbc79dbc-47mvq                261m         1507Mi          
    openshift-monitoring                   prometheus-k8s-1                             243m         2355Mi          
    openshift-etcd                         etcd-master0                                 243m         1902Mi          
    openshift-monitoring                   prometheus-k8s-0                             237m         2249Mi          
  8. Review top pod resource usage by memory:
    $ oc adm top pod --all-namespaces --sort-by=memory | head
    NAMESPACE                 NAME                    CPU(cores)   MEMORY(bytes)   
    openshift-kube-apiserver  kube-apiserver-master0  1220m        6396Mi          
    openshift-kube-apiserver  kube-apiserver-master2  351m         4828Mi          
    openshift-kube-apiserver  kube-apiserver-master1  276m         4763Mi          
    rook-ceph                 csi-rbdplugin-htblh     1m           2464Mi          
    openshift-monitoring      prometheus-k8s-1        359m         2355Mi          
    openshift-monitoring      prometheus-k8s-0        373m         2265Mi          
    openshift-etcd            etcd-master1            331m         2195Mi          
    openshift-etcd            etcd-master0            119m         1943Mi          
    openshift-etcd            etcd-master2            252m         1759Mi          
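    Optionally, a per-container breakdown may help isolate the consumer within a pod (a sketch using the standard --containers flag):
    oc adm top pod --all-namespaces --containers --sort-by=memory | head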
  9. Get the status of cluster operators:
    $ oc get clusteroperators
    NAME        VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
    dns         4.7.13    True        False         True       2d19h
    1. Describe any that are Degraded=True:
      $ oc describe clusteroperators dns
      Name:         dns
      [...]
      Status:
        Conditions:
          Last Transition Time:  2022-02-09T06:40:54Z
          Message:               DNS default is degraded
          Reason:                DNSDegraded
          Status:                True
          Type:                  Degraded
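      For example, the degraded operator's own pods and logs may be reviewed next (a sketch; for the dns operator, the operator typically runs in the openshift-dns-operator namespace and its operands in openshift-dns):
      oc get pods -n openshift-dns-operator
      oc get pods -n openshift-dns
      oc logs -n openshift-dns-operator deployment/dns-operator | tail -50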
  10. Check for overcommit issues on worker nodes:
    1. oc get nodes
    2. oc debug node/$NODE -t
    3. chroot /host journalctl --grep="Killed"
    4. Overcommit ratios may be tuned. Alternatively, disable overcommit by setting each pod's resource requests equal to its limits.
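      For example, a sketch of setting a deployment's requests equal to its limits (the deployment name and values are illustrative):
      oc set resources deployment/$DEPLOYMENT --requests=cpu=500m,memory=512Mi --limits=cpu=500m,memory=512Mi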

Troubleshooting Tips

  1. Troubleshoot Networking
  2. For a pod status of Pending, review oc describe pod $POD and check the Events section at the end of the output
  3. The horizontal pod autoscaler initially has a value of <unknown> and might take ~5 minutes to update. A persistent value of <unknown> might indicate that the deployment does not define resource requests for the metric and the autoscaler will not activate.
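    To see the autoscaler's current state and the metrics it is reading, something like the following may help (a sketch; $HPA and $NAMESPACE are placeholders):
    oc get hpa --all-namespaces
    oc describe hpa $HPA -n $NAMESPACE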
  4. Investigate pod errors due to permissions:
    oc get pod/$POD -o yaml | oc adm policy scc-subject-review -f -
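    Relatedly, oc adm policy scc-review can show which SCCs a pod's service account is allowed to use (a sketch; pod.yaml is a hypothetical pod definition):
    oc adm policy scc-review -z $SERVICEACCOUNT -n $NAMESPACE -f pod.yaml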