SAP Data Intelligence 3 on OpenShift Container Platform 4

In general, the installation of SAP Data Intelligence (SDI) follows these steps:

  • Install Red Hat OpenShift Container Platform
  • Configure the prerequisites for SAP Data Intelligence Foundation
  • Install SDI Observer
  • Install SAP Data Intelligence Foundation on OpenShift Container Platform

If you are interested in the installation of SAP Data Hub or SAP Vora, please refer to the corresponding installation guides instead.

1. OpenShift Container Platform validation version matrix

The following version combinations of SDI 3.X, OCP, RHEL or RHCOS have been validated for production environments:

SAP Data Intelligence | OpenShift Container Platform | Operating System | Infrastructure and (Storage) | Confirmed & Supported by SAP
3.0 | 4.2 | RHCOS (nodes), RHEL 8.1+ or Fedora (Management host) | VMware vSphere (OCS 4.2) | supported
3.0 Patch 3 | 4.2, 4.4 | RHCOS (nodes), RHEL 8.2+ or Fedora (Management host) | VMware vSphere (OCS 4) | supported
3.0 Patch 4 | 4.4 | RHCOS (nodes), RHEL 8.2+ or Fedora (Management host) | VMware vSphere (OCS 4, NetApp Trident 20.04) | supported
3.0 Patch 8 | 4.6 | RHCOS (nodes), RHEL 8.2+ or Fedora (Management host) | KVM/libvirt (OCS 4) | supported
3.1 | 4.4 | RHCOS (nodes), RHEL 8.3+ or Fedora (Management host) | VMware vSphere (OCS 4) | not supported¹
3.1 | 4.6 | RHCOS (nodes), RHEL 8.3+ or Fedora (Management host) | VMware vSphere (OCS 4², NetApp Trident 20.10 + StorageGRID), Bare metal (OCS 4²) | supported

The referenced OCP release is no longer supported by Red Hat!
¹ 3.1 on OCP 4.4 is supported by SAP only for the purpose of an upgrade to OCP 4.6
² For the full functionality (including SDI backup&restore), OCS 4.6.4 or newer is required.
Validated on two different hardware configurations:

  • (Dev/PoC level) Lenovo 4 bare metal hosts setup composed of:

    • 3 schedulable control plane nodes running both OCS and SDI (Lenovo ThinkSystem SR530)
    • 1 compute node running SDI (Lenovo ThinkSystem SR530)

    Note that this particular setup cannot be fully supported by Red Hat because running OCS in compact mode is still a Technology Preview as of 4.6.

  • (Production level) Dell Technologies bare metal cluster composed of:

    • 1 CSAH node (Dell EMC PowerEdge R640s)
    • 3 control plane nodes (Dell EMC PowerEdge R640s)
    • 3 dedicated OCS nodes (Dell EMC PowerEdge R640s)
    • 3 dedicated SDI nodes (Dell EMC PowerEdge R740xd)

    CSI-supported external Dell EMC storage options and cluster sizing options are available.
    CSAH stands for Cluster System Admin Host - an equivalent of the Management host.

Please refer to the compatibility matrix for version combinations that are considered working.

SAP Note #2871970 lists more details.

2. Requirements

2.1. Hardware/VM and OS Requirements

2.1.1. OpenShift Cluster

Make sure to consult the following official cluster requirements:

2.1.1.1. Node Kinds

There are 4 kinds of cluster nodes plus the Management host:

  • Bootstrap Node - A temporary bootstrap node needed for the OCP deployment. The node is either destroyed automatically by the installer (when using installer-provisioned infrastructure -- aka IPI) or can be deleted manually by the administrator. Alternatively, it can be re-used as a worker node. Please refer to the Installation process (4.6) / (4.4) for more information.
  • Master Nodes (4.6) / (4.4) - The control plane manages the OpenShift Container Platform cluster. The control plane can be made schedulable to enable SDI workload there as well.
  • Compute Nodes (4.6) / (4.4) - Run the actual workload (e.g. SDI pods). They are optional on a three-node cluster (where the master nodes are schedulable).
  • OCS Nodes (4.6) / (4.4) - Run OpenShift Container Storage (aka OCS) -- currently supported only on AWS and VMware vSphere. The nodes can be divided into starting (running both OSDs and monitors) and additional nodes (running only OSDs). Needed only when OCS shall be used as the backing storage provider.
  • Management host (aka administrator's workstation or Jump host) - The Management host is used among other things for:

    • accessing the OCP cluster via a configured command line client (oc or kubectl)
    • configuring OCP cluster
    • running Software Lifecycle Container Bridge (SLC Bridge)

The hardware/software requirements for the Management host are:

  • OS: Red Hat Enterprise Linux 8.1+, RHEL 7.6+ or Fedora 30+
  • Disk space: 20 GiB for /

2.1.1.2. A note on disconnected and air-gapped environments

By the term "disconnected host", it is referred to a host having no access to internet.
By the term "disconnected cluster", it is referred to a cluster where each host is disconnected.
A disconnected cluster can be managed from a Management host that is either connected (having access to the internet) or disconnected.
The latter scenario (both cluster and management host being disconnected) will be referred to by the term "air-gapped".
Unless stated otherwise, whatever applies to a disconnected host, cluster or environment, applies also to the "air-gapped".

2.1.1.3. Minimum Hardware Requirements

The table below lists the minimum requirements and the minimum number of instances for each node type for the latest validated SDI and OCP 4.X releases. This is sufficient for PoC (Proof of Concept) environments.

Type | Count | Operating System | vCPU | RAM (GB) | Storage (GB) | AWS Instance Type
Bootstrap | 1 | RHCOS | 4 | 16 | 120 | m4.xlarge
Master | 3 | RHCOS | 4 | 16 | 120 | m4.xlarge
Compute | 3+ | RHEL 7.8 or 7.9 or RHCOS | 8 | 32 | 120 | m4.2xlarge

On a three-node cluster, it would look like this:

Type | Count | Operating System | vCPU | RAM (GB) | Storage (GB) | AWS Instance Type
Bootstrap | 1 | RHCOS | 4 | 16 | 120 | m4.xlarge
Master/Compute | 3 | RHCOS | 10 | 40 | 120 | m4.xlarge

If using OCS 4.6 in internal mode, at least 3 additional (starting) nodes are recommended. Alternatively, the Compute nodes outlined above can also run OCS pods. In that case, the hardware specifications need to be extended accordingly. The following table lists the minimum requirements for each additional node:

Type | Count | Operating System | vCPU | RAM (GB) | Storage (GB) | AWS Instance Type
OCS starting (OSD+MON) | 3 | RHCOS | 10 | 24 | 120 + 2048 | m5.4xlarge

2.1.1.4. Minimum Production Hardware Requirements

The minimum requirements for production systems for the latest validated SDI and OCP 4 releases are the following:

Type | Count | Operating System | vCPU | RAM (GB) | Storage (GB) | AWS Instance Type
Bootstrap | 1 | RHCOS | 4 | 16 | 120 | m4.xlarge
Master | 3+ | RHCOS | 8 | 16 | 120 | c5.xlarge
Compute | 3+ | RHEL 7.8 or 7.9 or RHCOS | 16 | 64 | 120 | m4.4xlarge

On a three-node cluster, it would look like this:

Type | Count | Operating System | vCPU | RAM (GB) | Storage (GB) | AWS Instance Type
Bootstrap | 1 | RHCOS | 4 | 16 | 120 | m4.xlarge
Master/Compute | 3 | RHCOS | 22 | 72 | 120 | c5.9xlarge

If using OCS 4 in internal mode, at least 3 additional (starting) nodes are recommended. Alternatively, the Compute nodes outlined above can also run OCS pods. In that case, the hardware specifications need to be extended accordingly. The following table lists the minimum requirements for each additional node:

Type | Count | Operating System | vCPU | RAM (GB) | Storage (GB) | AWS Instance Type
OCS starting (OSD+MON) | 3 | RHCOS | 20 | 49 | 120 + 6×2048 | c5a.8xlarge

Please refer to OCS Platform Requirements (4.6) / (4.4) and OCS Sizing and scaling recommendations (4.4) for more information.
Running in a compact mode (on control plane) remains a Technology Preview as of OCS 4.6.
1 physical core provides 2 vCPUs when hyper-threading is enabled. 1 physical core provides 1 vCPU when hyper-threading is not enabled.

2.2. Software Requirements

2.2.1. Compatibility Matrix

Later versions of SAP Data Intelligence support newer versions of Kubernetes and OpenShift Container Platform. Even if not listed in the OCP validation version matrix above, the following version combinations are considered fully working and supported:

SAP Data Intelligence | OpenShift Container Platform | Worker Node | Management host | Infrastructure | Storage | Object Storage
3.0 Patch 3 or higher | 4.3, 4.4 | RHCOS | RHEL 8.1 or newer | Cloud, VMware vSphere | OCS 4, NetApp Trident 20.04 or newer, vSphere volumes | OCS, NetApp StorageGRID 11.3 or newer
3.0 Patch 8 or higher | 4.4, 4.5, 4.6 | RHCOS | RHEL 8.1 or newer | Cloud, VMware vSphere | OCS 4, NetApp Trident 20.04 or newer, vSphere volumes | OCS, NetApp StorageGRID 11.3 or newer
3.1 | 4.4, 4.5, 4.6 | RHCOS | RHEL 8.1 or newer | Cloud, VMware vSphere, Bare metal | OCS 4, NetApp Trident 20.04 or newer, vSphere volumes | OCS², NetApp StorageGRID 11.4 or newer

Cloud means any cloud provider supported by OpenShift Container Platform. For a complete list of tested and supported infrastructure platforms, please refer to OpenShift Container Platform 4.x Tested Integrations. The persistent storage in this case must be provided by the cloud provider. Please refer to Understanding persistent storage (4.6) / (4.4) for a complete list of supported storage providers.
This persistent storage provider does not offer a supported object storage service required by SDI's checkpoint store and is therefore suitable only for SAP Data Intelligence development and PoC clusters. It needs to be complemented by an object storage solution for the full SDI functionality.
² For the full functionality (including SDI backup&restore), OCS 4.6.4 or newer is required. Alternatively, OCS external mode can be used while utilizing RGW for SDI backup&restore (checkpoint store).

Unless stated otherwise, the compatibility of a listed SDI version covers all its patch releases as well.

2.2.2. Persistent Volumes

Persistent storage is needed for SDI. It is required to use storage that can be created dynamically. You can find more information in the Understanding persistent storage (4.6) / (4.4) document.

2.2.3. Container Image Registry

The SDI installation requires a secured Image Registry where images are first mirrored from an SAP Registry and then delivered to the OCP cluster nodes. The integrated OpenShift Container Registry (4.6) / (4.4) is not appropriate for this purpose. For now, another image registry needs to be set up instead.

The requirements listed here are a subset of the official requirements listed in Container Registry (3.1) / (3.0).

NOTE: as of now, AWS ECR Registry cannot be used for this purpose either.

The word secured in this context means that the communication is encrypted using TLS, ideally with certificates signed by a trusted certificate authority. If the registry is also exposed publicly, it must require authentication and authorization in order to pull SAP images.

Such a registry can be deployed directly on OCP cluster using for example SDI Observer, please refer to Deploying SDI Observer for more information.

When finished, you should have an external image registry up and running. We will use the URL local.image.registry:5000 as an example. You can verify its readiness with the following command:

# curl -k https://local.image.registry:5000/v2/
{"errors":[{"code":"UNAUTHORIZED","message":"authentication required","detail":null}]}

2.2.4. Checkpoint store enablement

In order to enable SAP Vora Database streaming tables, the checkpoint store needs to be enabled. The store is an object storage on a particular storage back-end. The SDI installer supports several back-end types, covering most cloud storage providers.

The enablement is strongly recommended for production clusters. Clusters having this feature disabled are suitable only for test, development or PoC use-cases.

Make sure to create a desired bucket before the SDI Installation. If the checkpoint store shall reside in a directory on a bucket, the directory needs to exist as well.
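
For illustration, on any S3-compatible object store reachable with the AWS CLI, the bucket (and an optional directory prefix in it) could be created as follows. The endpoint URL, bucket name and prefix are placeholders only, and the CLI is assumed to be already configured with valid credentials:

    # aws --endpoint-url https://s3.example.com s3 mb s3://sdi-checkpoint-store
    # # optionally create a "directory" (prefix) inside the bucket
    # aws --endpoint-url https://s3.example.com s3api put-object \
        --bucket sdi-checkpoint-store --key checkpoints/

When OCS is used as the object storage provider, the mksdibuckets script described later in this guide creates the buckets for you.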

2.2.5. SDI Observer

SDI Observer is a pod that monitors the SDI namespace and modifies objects there to enable running SDI on top of OCP. The observer shall run in a dedicated namespace. It must be deployed before the SDI installation is started. The SDI Observer section will guide you through the deployment process.

3. Install Red Hat OpenShift Container Platform

3.1. Prepare the Management host

Note that the following has been tested on RHEL 8.4. The steps shall be similar for other RPM-based Linux distributions. Recommended are RHEL 7.7+, Fedora 30+ and CentOS 7+.

3.1.1. Prepare the connected Management host

  1. Subscribe the Management host at least to the following repositories:

    # OCP_RELEASE=4.6
    # sudo subscription-manager repos                 \
        --enable=rhel-8-for-x86_64-appstream-rpms     \
        --enable=rhel-8-for-x86_64-baseos-rpms        \
        --enable=rhocp-${OCP_RELEASE:-4.6}-for-rhel-8-x86_64-rpms
    
  2. Install jq binary. This installation guide has been tested with jq 1.6.

    • on RHEL 8, make sure rhocp-4.6-for-rhel-8-x86_64-rpms repository or newer is enabled and install it from there:

      # dnf install jq-1.6
      
    • on earlier releases or other distributions, download the binary from upstream:

      # sudo curl -L -o /usr/local/bin/jq https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64
      # sudo chmod a+x /usr/local/bin/jq
      
  3. Download and install OpenShift client binaries.

    # sudo dnf install -y openshift-clients
    

3.1.2. Prepare the disconnected RHEL Management host

Please refer to KB#3176811 Creating a Local Repository and Sharing With Disconnected/Offline/Air-gapped Systems and KB#29269 How can we regularly update a disconnected system (A system without internet connection)?.

Install jq-1.6 and openshift-clients from your local RPM repository.

3.2. Install OpenShift Container Platform

Install OpenShift Container Platform on your desired cluster hosts. Follow the OpenShift installation guide (4.6) / (4.4)

Several changes need to be made to the compute nodes running SDI workloads before the SDI installation. These include:

  1. pre-loading the needed kernel modules
  2. increasing the PIDs limit of the CRI-O container engine
  3. configuring an insecure registry (if an insecure registry shall be used)

These steps are described in the next section.

3.3. OCP Post Installation Steps

3.3.1. (optional) Install OpenShift Container Storage

Red Hat OpenShift Container Storage (OCS) has been validated as the persistent storage provider for SAP Data Intelligence. Please refer to the OCS documentation (4.6) / (4.4)

Please make sure to read and follow Disconnected Environment if you install on a disconnected cluster.

3.3.2. (optional) Install NetApp Trident

NetApp Trident together with StorageGRID have been validated for SAP Data Intelligence and OpenShift. More details can be found at SAP Data Intelligence on OpenShift 4 with NetApp Trident.

3.3.3. Configure SDI compute nodes

Some SDI components require changes at the OS level of the compute nodes. These could impact other workloads running on the same cluster. To prevent that from happening, it is recommended to dedicate a set of nodes to the SDI workload. The following needs to be done:

  1. Chosen nodes must be labeled e.g. using the node-role.kubernetes.io/sdi="" label.
  2. MachineConfigs specific to SDI need to be created, they will be applied only to the selected nodes.
  3. MachineConfigPool must be created to associate the chosen nodes with the newly created MachineConfigs.
    • no changes will be applied to the nodes until this point
  4. (optional) Apply a node selector to sdi, sap-slcbridge and datahub-system projects.
    • SDI Observer can be configured to do that with SDI_NODE_SELECTOR parameter

Before modifying the recommended approach below, please make yourself familiar with the custom pools concept of the machine config operator.

3.3.3.1. Air-gapped environment

If the Management host does not have access to the internet, you will need to clone the sap-data-intelligence git repository to some other host and make it available on the Management host. For example:

# cd /var/run/user/1000/usb-disk/
# git clone https://github.com/redhat-sap/sap-data-intelligence

Then on the Management host:

  • unless the local checkout already exists, copy it from the disk:

    # git clone /var/run/user/1000/usb-disk/sap-data-intelligence ~/sap-data-intelligence
    
  • otherwise, re-apply local changes (if any) to the latest code:

    # cd ~/sap-data-intelligence
    # git stash         # temporarily remove local changes
    # git remote add drive /var/run/user/1000/usb-disk/sap-data-intelligence
    # git fetch drive
    # git merge drive   # apply the latest changes from drive to the local checkout
    # git stash pop     # re-apply the local changes on top of the latest code
    
3.3.4.1. Label the compute nodes for SAP Data Intelligence

Choose compute nodes for the SDI workload and label them from the Management host like this:

# oc label node/sdi-worker{1,2,3} node-role.kubernetes.io/sdi=""

3.3.4.2. Pre-load needed kernel modules

To apply the desired changes to the existing and future SDI compute nodes, please create another machine config like this (a sketch of what it contains follows the commands below):

  • (connected management host)

    # oc apply -f https://raw.githubusercontent.com/redhat-sap/sap-data-intelligence/master/snippets/mco/mc-75-worker-sap-data-intelligence.yaml
    
  • (disconnected management host)

    # oc apply -f sap-data-intelligence/master/snippets/mco/mc-75-worker-sap-data-intelligence.yaml
    
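
For reference, the applied MachineConfig roughly corresponds to the sketch below. It is not a verbatim copy of the file from the repository: the object name, the Ignition version (3.1.0 corresponds to OCP 4.6) and the exact file contents may differ, and the real MachineConfig additionally sets up the sdi-modules-load.service systemd unit checked in the verification section further below. The module list matches the modules verified later in this guide.

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      name: 75-worker-sap-data-intelligence
      labels:
        # targets the custom "sdi" role inherited by the sdi MachineConfigPool
        machineconfiguration.openshift.io/role: sdi
    spec:
      config:
        ignition:
          version: 3.1.0
        storage:
          files:
            # load the kernel modules needed by SDI on every boot
            - path: /etc/modules-load.d/sdi-dependencies.conf
              mode: 0644
              overwrite: true
              contents:
                source: data:,iptable_nat%0Aiptable_filter%0Axt_owner%0Axt_REDIRECT%0Anfsd%0Anfsv4%0Aip_tables%0A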

NOTE: If the warning below appears, it can usually be ignored. It indicates that the resource already exists on the cluster and was created by neither of the listed commands. In earlier versions of this documentation, plain oc create used to be recommended instead.

Warning: oc apply should be used on resource created by either oc create --save-config or oc apply

3.3.4.3. Change the maximum number of PIDs per Container

The process of configuring the nodes is described in Modifying Nodes (4.6) / (4.4). In the SDI case, the required setting is .spec.containerRuntimeConfig.pidsLimit in a ContainerRuntimeConfig. The result is a modified /etc/crio/crio.conf configuration file on each affected worker node with pids_limit set to the desired value. Please create a ContainerRuntimeConfig like this (a sketch of its content follows the commands below):

  • (connected management host)

    # oc apply -f https://raw.githubusercontent.com/redhat-sap/sap-data-intelligence/master/snippets/mco/ctrcfg-sdi-pids-limit.yaml
    
  • (disconnected management host)

    # oc apply -f sap-data-intelligence/master/snippets/mco/ctrcfg-sdi-pids-limit.yaml
    
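
For reference, the applied ContainerRuntimeConfig looks roughly like the sketch below; the object name is illustrative, while the pool selector label and the PIDs value correspond to the workload=sapdataintelligence label and the 16384 limit used elsewhere in this guide.

    apiVersion: machineconfiguration.openshift.io/v1
    kind: ContainerRuntimeConfig
    metadata:
      name: sdi-pids-limit
    spec:
      machineConfigPoolSelector:
        matchLabels:
          # applied to MachineConfigPools carrying this label (see the following sections)
          workload: sapdataintelligence
      containerRuntimeConfig:
        # results in pids_limit = 16384 in /etc/crio/crio.conf on the affected nodes
        pidsLimit: 16384
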
3.3.4.4. (obsolete) Enable net-raw capability for containers on schedulable nodes

NOTE: Having effect only on OCP 4.6 or newer.
NOTE: Shall be executed prior to the OCP upgrade to 4.6 when SDI is already running.
NOTE: No longer necessary for SDI 3.1 Patch 1 or newer

Starting with OCP 4.6, NET_RAW capability is no longer granted to containers by default. Some SDI containers assume otherwise. To allow them to run on OCP 4.6, the following MachineConfig must be applied to the compute nodes:

(connected management host)

    # oc apply -f https://raw.githubusercontent.com/redhat-sap/sap-data-intelligence/master/snippets/mco/mc-97-crio-net-raw.yaml

(disconnected management host)

    # oc apply -f sap-data-intelligence/master/snippets/mco/mc-97-crio-net-raw.yaml

3.3.4.5. Associate MachineConfigs to the Nodes

If previously associated, disassociate workload=sapdataintelligence from the worker MachineConfigPool using the following command executed in bash:

# tmpl=$'{{with $wl := index $m.labels "workload"}}{{if and $wl (eq $wl "sapdataintelligence")}}{{$m.name}}\n{{end}}{{end}}'; \
  if [[ "$(oc get  mcp/worker -o go-template='{{with $m := .metadata}}'"$tmpl"'{{end}}')" == "worker" ]]; then
    oc label mcp/worker workload-;
  fi

Define a new MachineConfigPool associating MachineConfigs to the nodes. The nodes will inherit all the MachineConfigs targeting the worker and sdi roles (a sketch of the resulting pool follows the commands below).

  • (connected management host)

    # oc apply -f https://raw.githubusercontent.com/redhat-sap/sap-data-intelligence/master/snippets/mco/mcp-sdi.yaml
    
  • (disconnected management host)

    # oc apply -f sap-data-intelligence/master/snippets/mco/mcp-sdi.yaml
    
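
For reference, the applied MachineConfigPool roughly corresponds to the following sketch; the selectors reflect what is described above (inheriting MachineConfigs with the worker and sdi roles and selecting nodes labeled node-role.kubernetes.io/sdi=""), and the workload label is what the ContainerRuntimeConfig shown earlier matches on:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfigPool
    metadata:
      name: sdi
      labels:
        # matched by the ContainerRuntimeConfig setting the PIDs limit
        workload: sapdataintelligence
    spec:
      machineConfigSelector:
        matchExpressions:
          - key: machineconfiguration.openshift.io/role
            operator: In
            values: [worker, sdi]
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/sdi: ""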

Note that you may see a warning if the MachineConfigPool exists already.

The changes will be rendered into machineconfigpool/sdi. The workers will be restarted one-by-one until the changes are applied to all of them. See Applying configuration changes to the cluster (4.6) / (4.4) for more information.

The following command can be used to wait until the change gets applied to all the worker nodes:

# oc wait mcp/sdi --all --for=condition=updated

After performing the changes above, you should end up with a new role sdi assigned to the chosen nodes and a new MachineConfigPool containing the nodes:

# oc get nodes
NAME          STATUS   ROLES        AGE   VERSION
ocs-worker1   Ready    worker       32d   v1.19.0+9f84db3
ocs-worker2   Ready    worker       32d   v1.19.0+9f84db3
ocs-worker3   Ready    worker       32d   v1.19.0+9f84db3
sdi-worker1   Ready    sdi,worker   32d   v1.19.0+9f84db3
sdi-worker2   Ready    sdi,worker   32d   v1.19.0+9f84db3
sdi-worker3   Ready    sdi,worker   32d   v1.19.0+9f84db3
master1       Ready    master       32d   v1.19.0+9f84db3
master2       Ready    master       32d   v1.19.0+9f84db3
master3       Ready    master       32d   v1.19.0+9f84db3

# oc get mcp
NAME     CONFIG                 UPDATED  UPDATING  DEGRADED  MACHINECOUNT  READYMACHINECOUNT  UPDATEDMACHINECOUNT  DEGRADEDMACHINECOUNT
master   rendered-master-15f⋯   True     False     False     3             3                  3                    0
sdi      rendered-sdi-f4f⋯      True     False     False     3             3                  3                    0
worker   rendered-worker-181⋯   True     False     False     3             3                  3                    0

3.3.4.5.1. Enable SDI on control plane

If the control plane (or master nodes) shall be used for running SDI workload, in addition to the previous step, one needs to perform the following:

  1. Please make sure the control plane is schedulable
  2. Duplicate the machine configs for master nodes:

    # oc get -o json mc -l machineconfiguration.openshift.io/role=sdi | jq  '.items[] |
        select((.metadata.annotations//{}) |
            has("machineconfiguration.openshift.io/generated-by-controller-version") | not) |
        .metadata |= ( .name   |= sub("^(?<i>(\\d+-)*)(worker-)?"; "\(.i)master-") |
                       .labels |= {"machineconfiguration.openshift.io/role": "master"} )' | oc apply -f -
    

    Note that you may see a couple of warnings if this has been done earlier.

  3. Make the master machine config pool inherit the PID limits changes:

    # oc label mcp/master workload=sapdataintelligence
    

The following command can be used to wait until the change gets applied to all the master nodes:

# oc wait mcp/master --all --for=condition=updated

3.3.4.6. Verification of the node configuration

The following steps assume that the node-role.kubernetes.io/sdi="" label has been applied to nodes running the SDI workload. All the commands shall be executed on the Management host. All the diagnostics commands will be run in parallel on such nodes.

  1. (disconnected only) Make one of the tools images available for your cluster:

    • Either use the image stream openshift/tools:

      1. Make sure the image stream has been populated:

        # oc get -n openshift istag/tools:latest
        

        Example output:

        NAME           IMAGE REFERENCE                                                UPDATED
        tools:latest   quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:13c...   17 hours ago
        

        If it is not the case, make sure your registry mirror CA certificate is trusted.

      2. Set the following variable:

        # ocDebugArgs="--image-stream=openshift/tools:latest"
        
    • Or make registry.redhat.io/rhel8/support-tools image available in your local registry:

      # LOCAL_REGISTRY=local.image.registry:5000
      # podman login registry.redhat.io
      # podman login "$LOCAL_REGISTRY"    # if the local registry requires authentication
      # skopeo copy --remove-signatures \
          docker://registry.redhat.io/rhel8/support-tools:latest \
          docker://"$LOCAL_REGISTRY/rhel8/support-tools:latest"
      # ocDebugArgs="--image=$LOCAL_REGISTRY/rhel8/support-tools:latest"
      
  2. Verify that the PID limit has been increased to 16384:

    # oc get nodes -l node-role.kubernetes.io/sdi= -o name | \
        xargs -P 6 -n 1 -i oc debug $ocDebugArgs {} -- chroot /host /bin/bash -c \
            "crio-status config | awk '/pids_limit/ {
                print ENVIRON[\"HOSTNAME\"]\":\t\"\$0}'" |& grep pids_limit
    

    NOTE: $ocDebugArgs is set only in a disconnected environment, otherwise it shall be empty.

    An example output could look like this:

    sdi-worker3:    pids_limit = 16384
    sdi-worker1:    pids_limit = 16384
    sdi-worker2:    pids_limit = 16384
    
  3. Verify that the kernel modules have been loaded:

    # oc get nodes -l node-role.kubernetes.io/sdi= -o name | \
        xargs -P 6 -n 1 -i oc debug $ocDebugArgs {} -- chroot /host /bin/sh -c \
            "lsmod | awk 'BEGIN {ORS=\":\t\"; print ENVIRON[\"HOSTNAME\"]; ORS=\",\"}
                /^(nfs|ip_tables|iptable_nat|[^[:space:]]+(REDIRECT|owner|filter))/ {
                    print \$1
                }'; echo" 2>/dev/null
    

    An example output could look like this:

    sdi-worker2:  iptable_filter,iptable_nat,xt_owner,xt_REDIRECT,nfsv4,nfs,nfsd,nfs_acl,ip_tables,
    sdi-worker3:  iptable_filter,iptable_nat,xt_owner,xt_REDIRECT,nfsv4,nfs,nfsd,nfs_acl,ip_tables,
    sdi-worker1:  iptable_filter,iptable_nat,xt_owner,xt_REDIRECT,nfsv4,nfs,nfsd,nfs_acl,ip_tables,
    

    If any of the following modules is missing on any of the SDI nodes, the module loading does not work: iptable_nat, nfsv4, nfsd, ip_tables, xt_owner

    To further debug missing modules, one can execute also the following command:

    # oc get nodes -l node-role.kubernetes.io/sdi= -o name | \
        xargs -P 6 -n 1 -i oc debug $ocDebugArgs {} -- chroot /host /bin/bash -c \
             "( for service in {sdi-modules-load,systemd-modules-load}.service; do \
                 printf '%s:\t%s\n' \$service \$(systemctl is-active \$service); \
             done; find /etc/modules-load.d -type f \
                 -regex '.*\(sap\|sdi\)[^/]+\.conf\$' -printf '%p\n';) | \
             awk '{print ENVIRON[\"HOSTNAME\"]\":\t\"\$0}'" 2>/dev/null
    

    Please make sure that both systemd services are active and at least one *.conf file is listed for each host like shown in the following example output:

    sdi-worker3:  sdi-modules-load.service:       active
    sdi-worker3:  systemd-modules-load.service:   active
    sdi-worker3:  /etc/modules-load.d/sdi-dependencies.conf
    sdi-worker1:  sdi-modules-load.service:       active
    sdi-worker1:  systemd-modules-load.service:   active
    sdi-worker1:  /etc/modules-load.d/sdi-dependencies.conf
    sdi-worker2:  sdi-modules-load.service:       active
    sdi-worker2:  systemd-modules-load.service:   active
    sdi-worker2:  /etc/modules-load.d/sdi-dependencies.conf
    
  4. (obsolete) Verify that the NET_RAW capability is granted by default to the pods:

    # # no longer needed for SDI 3.1 or newer
    # oc get nodes -l node-role.kubernetes.io/sdi= -o name | \
        xargs -P 6 -n 1 -i oc debug $ocDebugArgs {} -- /bin/sh -c \
            "find /host/etc/crio -type f -print0 | xargs -0 awk '/^[[:space:]]#/{next}
                /NET_RAW/ {print ENVIRON[\"HOSTNAME\"]\":\t\"FILENAME\":\"\$0}'" |& grep NET_RAW
    

    An example output could look like:

    sdi-worker2:  /host/etc/crio/crio.conf.d/01-mc-defaultCapabilities:    default_capabilities = ["CHOWN", "DAC_OVERRIDE", "FSETID", "FOWNER", "NET_RAW", "SETGID", "SETUID", "SETPCAP", "NET_BIND_SERVICE", "SYS_CHROOT", "KILL"]
    sdi-worker2:  /host/etc/crio/crio.conf.d/90-default-capabilities:        "NET_RAW",
    sdi-worker1:  /host/etc/crio/crio.conf.d/90-default-capabilities:        "NET_RAW",
    sdi-worker1:  /host/etc/crio/crio.conf.d/01-mc-defaultCapabilities:    default_capabilities = ["CHOWN", "DAC_OVERRIDE", "FSETID", "FOWNER", "NET_RAW", "SETGID", "SETUID", "SETPCAP", "NET_BIND_SERVICE", "SYS_CHROOT", "KILL"]
    sdi-worker3:  /host/etc/crio/crio.conf.d/90-default-capabilities:        "NET_RAW",
    sdi-worker3:  /host/etc/crio/crio.conf.d/01-mc-defaultCapabilities:    default_capabilities = ["CHOWN", "DAC_OVERRIDE", "FSETID", "FOWNER", "NET_RAW", "SETGID", "SETUID", "SETPCAP", "NET_BIND_SERVICE", "SYS_CHROOT", "KILL"]
    

    Please make sure that at least one line is produced for each host.

3.3.5. Deploy persistent storage provider

Unless your platform already offers a supported persistent storage provider, one needs to be deployed. Please refer to Understanding persistent storage (4.6) / (4.4) for an overview of possible options.

On OCP, one can deploy OpenShift Container Storage (OCS) (4.6) / (4.4) running converged on OCP nodes providing both persistent volumes and object storage. Please refer to OCS Planning your Deployment (4.6) / (4.4) and Deploying OpenShift Container Storage (4.6) / (4.4) for more information and installation instructions.

OCS can be deployed also in a disconnected environment.
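
Whichever provider is chosen, it is worth verifying before the SDI installation that the expected storage classes exist (and, unless you plan to select them explicitly during the Advanced Installation, that one of them is annotated as the default):

    # oc get storageclass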

3.3.6. Configure S3 access and bucket

Object storage is required for SDI features such as backup&restore (checkpoint store) and the SDI data lake.

Several interfaces to the object storage are supported by SDI. The S3 interface is one of them. Please take a look at Checkpoint Store Type in Required Input Parameters (3.1) / (3.0) for the complete list. The SAP help page covers the preparation of an object store (3.1) / (3.0) for a couple of cloud service providers.

Backup&restore can be enabled against OCS NooBaa's S3 endpoint as long as OCS is of version 4.6.4 or newer, or against RADOS Object Gateway S3 endpoint when OCS is deployed in the external mode.

3.3.6.1. Using NooBaa or RADOS Object Gateway S3 endpoint as object storage

OCS contains NooBaa object data service for hybrid and multi cloud environments which provides S3 API one can use with SAP Data Intelligence. Starting from OCS release 4.6.4, it can be used also for SDI's backup&restore functionality. Alternatively, the functionality can be enabled against RADOS Object Gateway S3 endpoint (from now on just RGW) which is available when OCS is deployed in the external mode.

For SDI, one needs to provide the following:

  • S3 host URL prefixed either with https:// or http://
  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • bucket name

NOTE: In case of https://, the endpoint must be secured by certificates signed by a trusted certificate authority. Self-signed CAs will not work out of the box as of now.

Once OCS is deployed, one can create the access keys and buckets using one of the following:

  • (internal mode only) via NooBaa Management Console by default exposed at noobaa-mgmt-openshift-storage.apps.<cluster_name>.<base_domain>
  • (both internal and external modes) via CLI with mksdibuckets script

In both cases, the S3 endpoint provided to SAP Data Intelligence cannot be secured with a self-signed certificate as of now. Unless the endpoints are secured with a properly signed certificate, one must use an insecure HTTP connection. Both NooBaa and RGW come with such an insecure service reachable from inside the cluster (within the SDN); it cannot be reached from outside of the cluster unless exposed via e.g. a route.

The following two URLs are example endpoints on an OCP cluster with OCS deployed.

  1. http://s3.openshift-storage.svc.cluster.local - NooBaa S3 Endpoint available always
  2. http://rook-ceph-rgw-ocs-external-storagecluster-cephobjectstore.openshift-storage.svc.cluster.local:8080 - RGW endpoint that shall be preferably used when OCS is deployed in the external mode

To enable SDI's backup&restore functionality, one must use the endpoint with rgw in its name (if available) unless running OCS 4.6.4 or newer.
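
To check which of these endpoints are available on a particular cluster, the services in the openshift-storage namespace can be listed; the exact service names may differ slightly depending on the OCS deployment:

    # oc get svc -n openshift-storage | grep -iE 's3|rgw'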

3.3.6.1.1. Creating an S3 bucket using CLI

The buckets can be created with the command below executed from the Management host. Be sure to switch to the appropriate project/namespace (e.g. sdi) first before executing the following command, or append the -n SDI_NAMESPACE parameter to it.

  • (connected management host)

    # bash <(curl -s https://raw.githubusercontent.com/redhat-sap/sap-data-intelligence/master/utils/mksdibuckets)
    
  • (disconnected management host)

    # bash sap-data-intelligence/master/utils/mksdibuckets
    

By default, two buckets will be created. You can list them this way:

  • (connected management host)

    # bash <(curl -s https://raw.githubusercontent.com/redhat-sap/sap-data-intelligence/master/utils/mksdibuckets) list
    
  • (disconnected management host)

    # bash sap-data-intelligence/master/utils/mksdibuckets list
    

Example output:

Bucket claim namespace/name:  sdi/sdi-checkpoint-store  (Status: Bound, Age: 7m33s)
  Cluster internal URL:       http://s3.openshift-storage.svc.cluster.local
  Bucket name:                sdi-checkpoint-store-ef4999e0-2d89-4900-9352-b1e1e7b361d9
  AWS_ACCESS_KEY_ID:          LQ7YciYTw8UlDLPi83MO
  AWS_SECRET_ACCESS_KEY:      8QY8j1U4Ts3RO4rERXCHGWGIhjzr0SxtlXc2xbtE
Bucket claim namespace/name:  sdi/sdi-data-lake  (Status: Bound, Age: 7m33s)
  Cluster internal URL:       http://s3.openshift-storage.svc.cluster.local
  Bucket name:                sdi-data-lake-f86a7e6e-27fb-4656-98cf-298a572f74f3
  AWS_ACCESS_KEY_ID:          cOxfi4hQhGFW54WFqP3R
  AWS_SECRET_ACCESS_KEY:      rIlvpcZXnonJvjn6aAhBOT/Yr+F7wdJNeLDBh231

# # NOTE: for more information and options, run the command with --help

The example above uses OCS NooBaa's S3 endpoint which is always the preferred choice for OCS internal mode.

The values of the claim sdi-checkpoint-store shall be passed to the following SLC Bridge parameters during SDI's installation in order to enable the backup&restore (previously known as checkpoint store) functionality.

Parameter | Example value
Object Store Type | S3 compatible object store
Access Key | LQ7YciYTw8UlDLPi83MO
Secret Key | 8QY8j1U4Ts3RO4rERXCHGWGIhjzr0SxtlXc2xbtE
Endpoint | http://s3.openshift-storage.svc.cluster.local
Path | sdi-checkpoint-store-ef4999e0-2d89-4900-9352-b1e1e7b361d9
Disable Certificate Validation | Yes

3.3.6.1.2. Increasing object bucket limits

NOTE: needed only for RGW (OCS external mode)

When performing the checkpoint store validation during the SDI installation, the installer creates a temporary bucket. For that to work with RGW, the bucket owner's limit on the maximum number of allocatable buckets needs to be increased. The limit is set to 1 by default.

You can use the following command to perform the needed changes for the bucket assigned to the backup&restore (checkpoint store). Please execute it on the management node of the external Red Hat Ceph Storage cluster (or on the host where the external RGW service runs). The last argument is the "Bucket name", not the "Bucket claim name".

  • (connected management host)

    # bash <(curl -s https://raw.githubusercontent.com/redhat-sap/sap-data-intelligence/master/utils/rgwtunebuckets) \
            sdi-checkpoint-store-ef4999e0-2d89-4900-9352-b1e1e7b361d9
    
  • (disconnected management host)

    # bash sap-data-intelligence/master/utils/rgwtunebuckets \
            sdi-checkpoint-store-ef4999e0-2d89-4900-9352-b1e1e7b361d9
    

For more information and additional options, append --help parameter at the end.
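
For reference, the same kind of quota change can be performed manually with radosgw-admin on the Red Hat Ceph Storage side; the owner uid below is purely illustrative (the bucket's owner can be looked up with radosgw-admin bucket stats):

    # radosgw-admin bucket stats --bucket=sdi-checkpoint-store-ef4999e0-2d89-4900-9352-b1e1e7b361d9 | grep owner
    # radosgw-admin user modify --uid=<bucket_owner_uid> --max-buckets=5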

3.3.7. Set up a Container Image Registry

If you haven't done so already, please follow the Container Image Registry prerequisite.

3.3.8. Configure an insecure registry

NOTE: It is now required to use a registry secured by TLS for SDI. Plain HTTP will not do.

If the registry's certificate is signed by a proper trusted (not self-signed) certificate authority, this section may be skipped.

There are two ways to make OCP trust an additional registry that uses certificates signed by a self-signed certificate authority.

3.3.9. Configure the OpenShift Cluster for SDI

3.3.9.1. Becoming a cluster-admin

Many commands below require cluster admin privileges. To become a cluster-admin, you can do one of the following:

  • Use the auth/kubeconfig generated in the working directory during the installation of the OCP cluster:

    INFO Install complete!
    INFO Run 'export KUBECONFIG=<your working directory>/auth/kubeconfig' to manage the cluster with 'oc', the OpenShift CLI.
    INFO The cluster is ready when 'oc login -u kubeadmin -p <provided>' succeeds (wait a few minutes).
    INFO Access the OpenShift web-console here: https://console-openshift-console.apps.demo1.openshift4-beta-abcorp.com
    INFO Login to the console with user: kubeadmin, password: <provided>
    # export KUBECONFIG=working_directory/auth/kubeconfig
    # oc whoami
    system:admin
    
  • As the system:admin user or a member of the cluster-admin group, make another user a cluster admin to allow them to perform the SDI installation:

    1. As a cluster-admin, configure the authentication (4.6) / (4.4) and add the desired user (e.g. sdiadmin).
    2. As a cluster-admin, grant the user a permission to administer the cluster:

      # oc adm policy add-cluster-role-to-user cluster-admin sdiadmin
      

You can learn more about the cluster-admin role in Cluster Roles and Local Roles article (4.6) / (4.4)

4. SDI Observer

SDI Observer monitors SDI and SLC Bridge namespaces and applies changes to SDI deployments to allow SDI to run on OpenShift. Among other things, it does the following:

  • adds an additional persistent volume to the vsystem-vrep StatefulSet to allow it to run on RHCOS
  • grants fluentd pods permissions to access logs
  • reconfigures the fluentd pods to parse plain text file container logs on the OCP 4 nodes
  • exposes the SDI System Management service
  • (optional) deploys a container image registry suitable for mirroring, storing and serving SDI images and for use by the Pipeline Modeler
  • (optional) creates the cmcertificates secret to allow SDI to talk to a container image registry secured by a self-signed CA certificate early during the installation

It is deployed from an OpenShift template. Its behaviour is controlled by the template's parameters, which are mirrored to its environment variables.

Deploy SDI Observer in its own k8s namespace (e.g. sdi-observer). Please refer to its documentation for the complete list of issues that it currently attempts to solve.

4.1. Prerequisites

The following must be satisfied before SDI Observer can be deployed:

4.2.1. Prerequisites for Connected OCP Cluster

In order to build images needed for SDI Observer, a secret with credentials for registry.redhat.io needs to be created in the namespace of SDI Observer. Please visit Red Hat Registry Service Accounts to obtain the OpenShift secret. For more details, please refer to Red Hat Container Registry Authentication. Once you have downloaded the OpenShift secret file (e.g. rht-registry-secret.yaml) with your credentials, you can import it into the SDI Observer's $NAMESPACE like this:

# oc create -n "${NAMESPACE:-sdi-observer}" -f rht-registry-secret.yaml
secret/123456-username-pull-secret created

4.2.2. Prerequisites for a Disconnected OCP Cluster

On a disconnected OCP cluster, it is necessary to mirror a pre-built image of SDI Observer to a local container image registry. Please follow Disconnected OCP cluster instructions.

4.2.3. Instantiation of Observer's Template

Assuming the SDI will be run in the SDI_NAMESPACE which is different from the observer NAMESPACE, instantiate the template with default parameters like this:

  1. Prepare the script and images depending on your system connectivity.

    • In a connected environment, download the run script from git repository like this:

      # curl -O https://raw.githubusercontent.com/redhat-sap/sap-data-intelligence/master/observer/run-observer-template.sh
      
    • In a disconnected environment where the Management host is connected to the internet:

      Mirror the SDI Observer image to the local registry. For example, on RHEL8:

      # podman login local.image.registry:5000    # if the local registry requires authentication
      # skopeo copy \
          docker://quay.io/redhat-sap-cop/sdi-observer:latest-ocp4.6 \
          docker://local.image.registry:5000/sdi-observer:latest-ocp4.6
      

      Please make sure to modify the 4.6 suffix according to your OCP server minor release.

    • In an air-gapped environment (assuming the observer repository has been already cloned to the Management host):

      1. On a host with access to the internet, copy the SDI Observer image to an archive on USB drive. For example, on RHEL8:

        # skopeo copy \
            docker://quay.io/redhat-sap-cop/sdi-observer:latest-ocp4.6 \
            oci-archive:/var/run/user/1000/usb-disk/sdi-observer.tar:latest-ocp4.6
        
      2. Plug the USB drive to the Management host (with no access to internet) and mirror the image from it to your local.image.registry:5000:

        # skopeo copy \
            oci-archive:/var/run/user/1000/usb-disk/sdi-observer.tar:latest-ocp4.6 \
            docker://local.image.registry:5000/sdi-observer:latest-ocp4.6
        
  2. Edit the downloaded run-observer-template.sh file in your favorite editor. Especially, mind
    the FLAVOUR, NAMESPACE and SDI_NAMESPACE parameters.

    • for a disconnected environment, make sure to set FLAVOUR to ocp-prebuilt and IMAGE_PULL_SPEC to your local.image.registry:5000
    • for an air-gapped environment, set also SDI_OBSERVER_REPOSITORY=to/local/git/repo/checkout

  3. Run it in bash like this:

    # bash ./run-observer-template.sh
    
  4. Keep the modified script around in case of updates.

4.2.4. SDI Observer Registry

If the observer is configured to deploy container image registry via DEPLOY_SDI_REGISTRY=true parameter, it will deploy the deploy-registry job which does the following:

  1. builds the container-image-registry image and pushes it to the integrated OpenShift Image Registry
  2. generates or uses configured credentials for the registry
  3. deploys container-image-registry deployment config that runs this image and requires authentication
  4. exposes the registry using a route

    • if observer's SDI_REGISTRY_ROUTE_HOSTNAME parameter is set, it will be used as its hostname
    • otherwise the registry's hostname will be container-image-registry-${NAMESPACE}.apps.<cluster_name>.<base_domain>
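
The apps.<cluster_name>.<base_domain> part of the default hostname is the cluster's ingress domain, which can be determined like this:

    # oc get ingresses.config/cluster -o jsonpath='{.spec.domain}{"\n"}'
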
4.2.4.1. Registry Template parameters

The following Observer's Template Parameters influence the deployment of the registry:

Parameter | Example value | Description
DEPLOY_SDI_REGISTRY | true | Whether to deploy a container image registry for the purpose of SAP Data Intelligence.
REDHAT_REGISTRY_SECRET_NAME | 123456-username-pull-secret | Name of the secret with credentials for the registry.redhat.io registry. Please visit Red Hat Registry Service Accounts to obtain the OpenShift secret. For more details, please refer to Red Hat Container Registry Authentication. Must be provided in order to build the registry's image.
SDI_REGISTRY_ROUTE_HOSTNAME | registry.cluster.tld | This variable will be used as the container image registry's hostname when creating the corresponding route. Defaults to container-image-registry-$NAMESPACE.<cluster_name>.<base_domain>. If set, the domain name must resolve to the IP of the ingress router.
INJECT_CABUNDLE | true | Inject the CA certificate bundle into SAP Data Intelligence pods. The bundle can be specified with CABUNDLE_SECRET_NAME. It is needed if either the registry or the S3 endpoint is secured by a self-signed certificate. The letsencrypt method is preferred.
CABUNDLE_SECRET_NAME | custom-ca-bundle | The name of the secret containing the certificate authority bundle that shall be injected into Data Intelligence pods. By default, the secret bundle is obtained from the openshift-ingress-operator namespace where the router-ca secret contains the certificate authority used to sign all the edge and reencrypt routes that are, among others, used for SDI_REGISTRY and S3 API services. The secret name may be optionally prefixed with $namespace/.
SDI_REGISTRY_STORAGE_CLASS_NAME | ocs-storagecluster-cephfs | Unless given, the default storage class will be used. If possible, prefer volumes with ReadWriteMany (RWX) access mode.
REPLACE_SECRETS | true | By default, the existing SDI_REGISTRY_HTPASSWD_SECRET_NAME secret will not be replaced if it already exists. If the registry credentials shall be changed while using the same secret name, this must be set to true.
SDI_REGISTRY_AUTHENTICATION | none | Set to none if the registry shall not require any authentication at all. The default is to secure the registry with an htpasswd file, which is necessary if the registry is publicly available (e.g. when exposed via an ingress route which is globally resolvable).
SDI_REGISTRY_USERNAME | registry-user | Will be used to generate the htpasswd file to provide authentication data to the SDI registry service as long as SDI_REGISTRY_HTPASSWD_SECRET_NAME does not exist or REPLACE_SECRETS is true. Unless given, it will be autogenerated by the job.
SDI_REGISTRY_PASSWORD | secure-password | ditto
SDI_REGISTRY_HTPASSWD_SECRET_NAME | registry-htpasswd | A secret with the htpasswd file containing authentication data for the SDI registry service. If given and the secret exists, it will be used instead of SDI_REGISTRY_USERNAME and SDI_REGISTRY_PASSWORD. Defaults to container-image-registry-htpasswd. Please make sure to follow the official guidelines on generating the htpasswd file.
SDI_REGISTRY_VOLUME_CAPACITY | 250Gi | Volume space available for container images. Defaults to 120Gi.
SDI_REGISTRY_VOLUME_ACCESS_MODE | ReadWriteMany | If the given SDI_REGISTRY_STORAGE_CLASS_NAME or the default storage class supports the ReadWriteMany ("RWX") access mode, please change this to ReadWriteMany. For example, the ocs-storagecluster-cephfs storage class, deployed by the OCS operator, does support it.

To use them, please set the desired parameters in the run-observer-template.sh script in the section above.

Monitoring registry's deployment

# oc logs -n "${NAMESPACE:-sdi-observer}" -f job/deploy-registry

4.2.4.2. Determining Registry's credentials

The username and password are separated by a colon in the SDI_REGISTRY_HTPASSWD_SECRET_NAME secret:

# # make sure to change the NAMESPACE and secret name according to your environment
# oc get -o json -n "${NAMESPACE:-sdi-observer}" secret/container-image-registry-htpasswd | \
    jq -r '.data[".htpasswd.raw"] | @base64d'
user-qpx7sxeei:OnidDrL3acBHkkm80uFzj697JGWifvma

4.2.4.3. Testing the connection

In this example, it is assumed that the INJECT_CABUNDLE and DEPLOY_SDI_REGISTRY are true and other parameters use the defaults.

  1. Obtain Ingress Router's default self-signed CA certificate:

    # oc get secret -n openshift-ingress-operator -o json router-ca | \
        jq -r '.data as $d | $d | keys[] | select(test("\\.crt$")) | $d[.] | @base64d' >router-ca.crt
    
  2. Do a simple test using curl:

    # # determine registry's hostname from its route
    # hostname="$(oc get route -n "${NAMESPACE:-sdi-observer}" container-image-registry -o jsonpath='{.spec.host}')"
    # curl -I --user user-qpx7sxeei:OnidDrL3acBHkkm80uFzj697JGWifvma --cacert router-ca.crt \
        "https://$hostname/v2/"
    HTTP/1.1 200 OK
    Content-Length: 2
    Content-Type: application/json; charset=utf-8
    Docker-Distribution-Api-Version: registry/2.0
    Date: Sun, 24 May 2020 17:54:31 GMT
    Set-Cookie: d22d6ce08115a899cf6eca6fd53d84b4=9176ba9ff2dfd7f6d3191e6b3c643317; path=/; HttpOnly; Secure
    Cache-control: private
    
  3. Using podman:

    # # determine registry's hostname from its route
    # hostname="$(oc get route -n "${NAMESPACE:-sdi-observer}" container-image-registry -o jsonpath='{.spec.host}')"
    # sudo mkdir -p "/etc/containers/certs.d/$hostname"
    # sudo cp router-ca.crt "/etc/containers/certs.d/$hostname/"
    # podman login -u user-qpx7sxeei "$hostname"
    Password:
    Login Succeeded!
    
4.2.4.4. Configuring OCP

Configure OpenShift to trust the deployed registry if using a self-signed CA certificate.
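
A minimal sketch of one way to do that, assuming the router-ca.crt file obtained in the previous section and an illustrative ConfigMap name, is to add the CA to the cluster-wide image configuration (the ConfigMap key must equal the registry's hostname):

    # hostname="$(oc get route -n "${NAMESPACE:-sdi-observer}" container-image-registry -o jsonpath='{.spec.host}')"
    # oc create configmap sdi-registry-cabundle -n openshift-config \
        --from-file="$hostname"=router-ca.crt
    # oc patch image.config.openshift.io/cluster --type=merge \
        -p '{"spec":{"additionalTrustedCA":{"name":"sdi-registry-cabundle"}}}'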

4.2.4.5. SDI Observer Registry tenant configuration

NOTE: Only applicable once the SDI installation is complete.

Each newly created tenant needs to be configured to be able to talk to the SDI Registry. The initial tenant (the default) does not need to be configured manually as it is configured during the installation.

There are two steps that need to be performed for each new tenant:

  • import CA certificate for the registry via SDI Connection Manager if the CA certificate is self-signed (the default unless letsencrypt controller is used)
  • create and import credential secret using the SDI System Management and update the modeler secret

Import the CA certificate

  1. Obtain the router-ca.crt from the secret as documented in the previous section.
  2. Follow the Manage Certificates guide (3.1) / (3.0) to import the router-ca.crt via the SDI Connection Management.

Import the credential secret

Determine the credentials and import them using the SDI System Management by following the official Provide Access Credentials for a Password Protected Container Registry (3.1) / (3.0).

As an alternative to the step "1. Create a secret file that contains the container registry credentials and …", you can also use the following way to create the vsystem-registry-secret.txt file:

# # determine registry's hostname from its route
# hostname="$(oc get route -n "${NAMESPACE:-sdi-observer}" container-image-registry -o jsonpath='{.spec.host}')"
# oc get -o json -n "${NAMESPACE:-sdi-observer}" secret/container-image-registry-htpasswd | \
    jq -r '.data[".htpasswd.raw"] | @base64d | gsub("\\s+"; "") | split(":") |
        [{"username":.[0], "password":.[1], "address":"'"$hostname"'"}]' | \
    json2yaml > vsystem-registry-secret.txt

NOTE: the json2yaml binary from the remarshal project must be installed on the Management host in addition to jq.

4.3. Managing SDI Observer

4.3.1. Viewing and changing the current configuration

View the current configuration of SDI Observer:

# oc set env --list -n "${NAMESPACE:-sdi-observer}" dc/sdi-observer

Change the settings:

  • it is recommended to modify the run-observer-template.sh and re-run it
  • it is also possible to set the desired parameter directly without triggering an image build:

    # # instruct the observer to schedule SDI pods only on the matching nodes
    # oc set env -n "${NAMESPACE:-sdi-observer}" dc/sdi-observer SDI_NODE_SELECTOR="node-role.kubernetes.io/sdi="
    

4.3.2. Re-deploying SDI Observer

Re-deployment is useful in the following cases:

  • SDI Observer shall be updated to the latest release.
  • SDI has been uninstalled and its namespace deleted and/or re-created.
  • Parameter being reflected in multiple resources (not just in the DeploymentConfig) needs to be changed (e.g. OCP_MINOR_RELEASE)
  • Different SDI instance in another namespace shall be observed.

Before updating to the latest SDI Observer code, please be sure to check the Update instructions.

NOTE: Re-deployment preserves generated secrets and persistent volumes unless REPLACE_SECRETS or REPLACE_PERSISTENT_VOLUMES are true.

  1. Backup the previous run-observer-template.sh script and open it as long as available. If not available, run the following to see the previous environment variables:

    # oc set env --list dc/sdi-observer -n "${NAMESPACE:-sdi-observer}"
    
  2. Download the run script from git repository like this:

    # curl -O https://raw.githubusercontent.com/redhat-sap/sap-data-intelligence/master/observer/run-observer-template.sh
    
  3. Edit the downloaded run-observer-template.sh file in your favorite editor. Especially, mind the FLAVOUR, NAMESPACE, SDI_NAMESPACE and OCP_MINOR_RELEASE parameters. Compare it against the old run-observer-template.sh or against the output of oc set env --list dc/sdi-observer and update the parameters accordingly.

  4. Run it in bash like this:

    # bash ./run-observer-template.sh
    
  5. Keep the modified script around in case of updates.

5. Install SDI on OpenShift

5.1. Install Software Lifecycle Container Bridge

Please follow the official documentation (3.1) / (3.0).

5.1.1. Important Parameters

Parameter | Condition | Description
Mode | Always | Make sure to choose the Expert Mode.
Address of the Container Image Repository | Always | This is the Host value of the container-image-registry route in the observer namespace if the registry is deployed by SDI Observer.
Image registry username | if … | The value recorded in the SDI_REGISTRY_HTPASSWD_SECRET_NAME secret if using the registry deployed with SDI Observer.
Image registry password | if … | ditto
Namespace of the SLC Bridge | Always | If you override the default (sap-slcbridge), make sure to deploy SDI Observer with the corresponding SLCB_NAMESPACE value.
Service Type | SLC Bridge Base installation | On vSphere, make sure to use NodePort. On AWS, please use LoadBalancer.
Cluster No Proxy | Required in conjunction with the HTTPS Proxy value | Make sure to extend it with additional mandatory entries.

If the registry requires authentication. The one deployed with SDI Observer does.
Make sure to include at least the entries located in OCP cluster's proxy settings.

# # get the internal OCP cluster's NO_PROXY settings
# noProxy="$(oc get -o jsonpath='{.status.noProxy}' proxy/cluster)"; echo "$noProxy"
.cluster.local,.local,.nip.io,.ocp.vslen,.sap.corp,.svc,10.0.0.0/16,10.128.0.0/14,10.17.69.0/23,127.0.0.1,172.30.0.0/16,192.168.0.0/16,api-int.morrisville.ocp.vslen,etcd-0.morrisville.ocp.vslen,etcd-1.morrisville.ocp.vslen,etcd-2.morrisville.ocp.vslen,localhost,lu0602v0,registry.redhat.io

For more details, please refer to Configuring the cluster-wide proxy (4.6) / (4.4)

NOTE: SLC Bridge service cannot be used via routes (Ingress Operator) as of now. Doing so will result in timeouts. This will be addressed in the future. For now, one must use either the NodePort or LoadBalancer service directly.

On vSphere, in order to access the slcbridgebase-service NodePort service, one needs to have either direct access to one of the SDI Compute nodes or modify the external load balancer to add an additional route to the service.

5.1.2. Install SLC Bridge

Please install SLC Bridge according to Making the SLC Bridge Base available on Kubernetes (3.1) / (3.0) while paying attention to the notes on the installation parameters.

5.1.2.1. Using an external load balancer to access SLC Bridge's NodePort

NOTE: applicable only when "Service Type" was set to "NodePort".

Once the SLC Bridge is deployed, its NodePort shall be determined in order to point the load balancer at it.

# oc get svc -n "${SLCB_NAMESPACE:-sap-slcbridge}" slcbridgebase-service -o jsonpath='{.spec.ports[0].nodePort}{"\n"}'
31875

The load balancer shall point at all the compute nodes running SDI workload. The following is an example for HAProxy sw load balancer:

# # in the example, the <cluster_name> is "boston" and <base_domain> is "ocp.vslen"
# cat /etc/haproxy/haproxy.cfg
....
frontend        slcb
    bind        *:9000
    mode        tcp
    option      tcplog
    # # commented blocks are useful for multiple OCP clusters or multiple SLC Bridge services
    #tcp-request inspect-delay      5s
    #tcp-request content accept     if { req_ssl_hello_type 1 }

    use_backend  boston-slcb       #if { req_ssl_sni -m end -i boston.ocp.vslen  }
    #use_backend raleigh-slcb      #if { req_ssl_sni -m end -i raleigh.ocp.vslen }

backend         boston-slcb
    balance     source
    mode        tcp
    server      sdi-worker1        sdi-worker1.boston.ocp.vslen:31875   check
    server      sdi-worker2        sdi-worker2.boston.ocp.vslen:31875   check
    server      sdi-worker3        sdi-worker3.boston.ocp.vslen:31875   check

backend         raleigh-slcb
....

The SLC Bridge can then be accessed at the URL https://boston.ocp.vslen:9000/docs/index.html as long as boston.ocp.vslen resolves correctly to the load balancer's IP.

5.2. SDI Installation Parameters

Please follow SAP's guidelines on configuring the SDI while paying attention to the following additional comments:

Name Condition Recommendation
Kubernetes Namespace Always Must match the project name chosen in the Project Setup (e.g. sdi)
Installation Type Installation or Update Choose Advanced Installation if you need to specify you want to choose particular storage class or there is no default storage class (4.4) set or you want to deploy multiple SDI instances on the same cluster.
Container Image Repository Installation Must be set to the container image registry.
Backup Configuration Installation or Upgrade from a system in which backups are not enabled For a production environment, please choose yes.
Checkpoint Store Configuration Installation Recommended for production deployments. If backup is enabled, it is enabled by default.
Checkpoint Store Type If Checkpoint Store Configuration parameter is enabled. Set to S3 compatible object store if using for example OCS or NetApp StorageGRID as the object storage. See Using NooBaa as object storage gateway or NetApp StorageGRID for more details.
Disable Certificate Validation If Checkpoint Store Configuration parameter is enabled. Please choose yes if using HTTPS for your object storage endpoint secured with a certificate signed by a self-signed CA. For OCS NooBaa, you can set it to no.
Checkpoint Store Validation Installation Please make sure to validate the connection during the installation time. Otherwise in case an incorrect value is supplied, the installation will fail at a later point.
Container Registry Settings for Pipeline Modeler Advanced Installation Shall be changed if the same registry is used for more than one SAP Data Intelligence instance. Either another <registry> or a different <prefix> or both will do.
StorageClass Configuration Advanced Installation Configure this if you want to choose different dynamic storage provisioners for different SDI components or if there's no default storage class (4.6) / (4.4) set or you want to choose non-default storage class for the SDI components.
Default StorageClass Advanced Installation and if storage classes are configured Set this if there's no default storage class (4.6) / (4.4) set or you want to choose non-default storage class for the SDI components.
Enable Kaniko Usage Advanced Installation Must be enabled on OCP 4.
Container Image Repository Settings for SAP Data Intelligence Modeler Advanced Installation or Upgrade If using the same registry for multiple SDI instances, choose "yes".
Container Registry for Pipeline Modeler Advanced Installation and if "Use different one" option is selected in the previous selection. If using the same registry for multiple SDI instances, it is required to use either different prefix (e.g. local.image.registry:5000/mymodelerprefix2) or a different registry.
Loading NFS Modules Advanced Installation Feel free to say "no". This is no longer of concern as long as the loading of the needed kernel modules has been configured.
Additional Installer Parameters Advanced Installation (optional) Useful for reducing the minimum memory requirements of the HANA pod and much more.

Note that the validated S3 API endpoint providers are OCS' NooBaa 4.6.4 or newer, OCS 4.6 in external mode and NetApp StorageGRID.

5.3. Project setup

It is assumed the sdi project has been already created during SDI Observer's prerequisites.

Login to OpenShift as a cluster-admin, and perform the following configurations for the installation:

# # change to the SDI_NAMESPACE project using: oc project "${SDI_NAMESPACE:-sdi}"
# oc adm policy add-scc-to-group anyuid "system:serviceaccounts:$(oc project -q)"
# oc adm policy add-scc-to-user privileged -z "$(oc project -q)-elasticsearch"
# oc adm policy add-scc-to-user privileged -z "$(oc project -q)-fluentd"
# oc adm policy add-scc-to-user privileged -z default
# oc adm policy add-scc-to-user privileged -z mlf-deployment-api
# oc adm policy add-scc-to-user privileged -z vora-vflow-server
# oc adm policy add-scc-to-user privileged -z "vora-vsystem-$(oc project -q)"
# oc adm policy add-scc-to-user privileged -z "vora-vsystem-$(oc project -q)-vrep"

5.4. Install SDI

Please follow the official procedure according to Install using SLC Bridge in a Kubernetes Cluster with Internet Access (3.1) / (3.0).

5.5. SDI Post installation steps

5.5.1. (Optional) Expose SDI services externally

There are multiple possibilities for making SDI services accessible outside of the cluster. Compared to Kubernetes, OpenShift offers an additional method, which is recommended for most scenarios including the SDI System Management service. It is based on the OpenShift Ingress Operator (4.6) / (4.4).

For SAP Vora Transaction Coordinator and SAP HANA Wire, please use the official suggested method available to your environment (3.1) / (3.0).

5.5.1.1. Using OpenShift Ingress Operator

NOTE: Instead of using this manual approach, it is now recommended to let the SDI Observer manage the route creation and updates. If the SDI Observer has been deployed with MANAGE_VSYSTEM_ROUTE, this section can be skipped. To enable it after the SDI Observer has already been deployed, please execute the following:

# oc set env -n "${NAMESPACE:-sdi-observer}" dc/sdi-observer MANAGE_VSYSTEM_ROUTE=true
# # wait for the observer to get re-deployed
# oc rollout status -n "${NAMESPACE:-sdi-observer}" -w dc/sdi-observer

Otherwise, please continue with the manual route creation.

OpenShift allows you to access the Data Intelligence services via Ingress Controllers (4.6) / (4.4) as opposed to regular NodePorts (4.6) / (4.4). For example, instead of accessing the vsystem service via https://worker-node.example.com:32322, after the service exposure, you will be able to access it at https://vsystem-sdi.apps.<cluster_name>.<base_domain>. This is an alternative to the official documentation to Expose the Service On Premise (3.1) / (3.0).

There are two kinds of routes secured with TLS. The reencrypt kind allows for a custom signed or self-signed certificate to be used. The other is the passthrough kind, which uses the pre-installed certificate generated by the installer or passed to the installer.

5.5.1.1.1. Export services with a reencrypt route

With this kind of route, different certificates are used on client and service sides of the route. The router stands in the middle and re-encrypts the communication coming from either side using a certificate corresponding to the opposite side. In this case, the client side is secured by a provided certificate and the service side is encrypted with the original certificate generated or passed to the SAP Data Intelligence installer. This is the same kind of route SDI Observer creates automatically.

The reencrypt route allows for securing the client connection with a proper signed certificate.

  1. Look up the vsystem service:

    # oc project "${SDI_NAMESPACE:-sdi}"            # switch to the Data Intelligence project
    # oc get services | grep "vsystem "
    vsystem   ClusterIP   172.30.227.186   <none>   8797/TCP   19h
    

    When exported, the resulting hostname will look like vsystem-${SDI_NAMESPACE}.apps.<cluster_name>.<base_domain>. However, an arbitrary hostname can be chosen instead as long as it resolves correctly to the IP of the router.

  2. Get, generate or use the default certificates for the route. In this example, the default self-signed certificate used by router is used to secure the connection between the client and OCP's router. The CA certificate for clients can be obtained from the router-ca secret located in the openshift-ingress-operator namespace:

    # oc get secret -n openshift-ingress-operator -o json router-ca | \
        jq -r '.data as $d | $d | keys[] | select(test("\\.crt$")) | $d[.] | @base64d' >router-ca.crt
    
  3. Obtain the SDI's root certificate authority bundle generated at the SDI's installation time. The generated bundle is available in the ca-bundle.pem secret in the sdi namespace.

    # oc get -n "${SDI_NAMESPACE:-sdi}" -o go-template='{{index .data "ca-bundle.pem"}}' \
        secret/ca-bundle.pem | base64 -d >sdi-service-ca-bundle.pem
    
  4. Create the reencrypt route for the vsystem service like this:

    # oc create route reencrypt -n "${SDI_NAMESPACE:-sdi}" --dry-run -o json \
            --dest-ca-cert=sdi-service-ca-bundle.pem --service vsystem \
            --insecure-policy=Redirect | \
        oc annotate --local -o json -f - haproxy.router.openshift.io/timeout=2m | \
        oc apply -f -
    # oc get route
    NAME      HOST/PORT                                                  SERVICES  PORT      TERMINATION         WILDCARD
    vsystem   vsystem-<SDI_NAMESPACE>.apps.<cluster_name>.<base_domain>  vsystem   vsystem   reencrypt/Redirect  None
    
  5. Verify the connection:

    # # use the HOST/PORT value obtained from the previous command instead
    # curl --cacert router-ca.crt https://vsystem-<SDI_NAMESPACE>.apps.<cluster_name>.<base_domain>/
    
5.5.1.1.2. Export services with a passthrough route

With the passthrough route, the communication is encrypted by the SDI service's certificate all the way to the client.

NOTE: If possible, please prefer the reencrypt route because the hostname of the vsystem certificate cannot be verified by clients, as can be seen in the following output:

# oc get -n "${SDI_NAMESPACE:-sdi}" -o go-template='{{index .data "ca-bundle.pem"}}' \
    secret/ca-bundle.pem | base64 -d >sdi-service-ca-bundle.pem
# openssl x509 -noout -subject -in sdi-service-ca-bundle.pem
subject=C = DE, ST = BW, L = Walldorf, O = SAP, OU = Data Hub, CN = SAPDataHub

  1. Look up the vsystem service:

    # oc project "${SDI_NAMESPACE:-sdi}"            # switch to the Data Intelligence project
    # oc get services | grep "vsystem "
    vsystem   ClusterIP   172.30.227.186   <none>   8797/TCP   19h
    
  2. Create the route:

    # oc create route passthrough --service=vsystem --insecure-policy=Redirect
    # oc get route
    NAME      HOST/PORT                                                  PATH  SERVICES  PORT      TERMINATION           WILDCARD
    vsystem   vsystem-<SDI_NAMESPACE>.apps.<cluster_name>.<base_domain>        vsystem   vsystem   passthrough/Redirect  None
    

    You can modify the hostname with the --hostname parameter (see the example after this list). Make sure it resolves to the router's IP.

  3. Access the System Management service at https://vsystem-<SDI_NAMESPACE>.apps.<cluster_name>.<base_domain> to verify.
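
For illustration, a passthrough route with a custom hostname (the hostname below is just an example following the naming used earlier in this guide) could be created like this:

# oc create route passthrough --service=vsystem --insecure-policy=Redirect \
    --hostname=vsystem.apps.boston.ocp.vslen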

5.5.1.2. Using NodePorts

NOTE: For OpenShift, exposure using routes is preferred, although it is only possible for the System Management service (aka vsystem).

Exposing SAP Data Intelligence vsystem

  • Either with an auto-generated node port:

    # oc expose service vsystem --type NodePort --name=vsystem-nodeport --generator=service/v2
    # oc get -o jsonpath='{.spec.ports[0].nodePort}{"\n"}' services vsystem-nodeport
    30617
    
  • Or with a specific node port (e.g. 32123):

    # oc expose service vsystem --type NodePort --name=vsystem-nodeport --generator=service/v2 --dry-run -o yaml | \
        oc patch -p '{"spec":{"ports":[{"port":8797, "nodePort": 32123}]}}' --local -f - -o yaml | oc apply -f -
    

The original service remains accessible on the same ClusterIP:Port as before. Additionally, it is now accessible from outside of the cluster under the node port.
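
For example, the exposed service can be quickly checked from outside of the cluster with curl (the compute node hostname and the node port below are illustrative):

# curl -k https://sdi-worker1.boston.ocp.vslen:32123/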

Exposing SAP Vora Transaction Coordinator and HANA Wire

# oc expose service vora-tx-coordinator-ext --type NodePort --name=vora-tx-coordinator-nodeport --generator=service/v2
# oc get -o jsonpath='tx-coordinator:{"\t"}{.spec.ports[0].nodePort}{"\n"}hana-wire:{"\t"}{.spec.ports[1].nodePort}{"\n"}' \
    services vora-tx-coordinator-nodeport
tx-coordinator: 32445
hana-wire:      32192

The output shows the generated node ports for the newly exposed services.

5.5.2. Configure the Connection to Data Lake

Please follow the official post-installation instructions at Configure the Connection to DI_DATA_LAKE (3.1) / (3.0).

In case the OCS is used as a backing object storage provider, please make sure to use the HTTP service endpoint as documented in Using NooBaa or RADOS Object Gateway S3 endpoint as object storage.

Based on the example output in that section, the configuration may look like this:

Parameter Value
Connection Type SDL
Id DI_DATA_LAKE
Object Storage Type S3
Endpoint http://s3.openshift-storage.svc.cluster.local
Access Key ID cOxfi4hQhGFW54WFqP3R
Secret Access Key rIlvpcZXnonJvjn6aAhBOT/Yr+F7wdJNeLDBh231
Root Path sdi-data-lake-f86a7e6e-27fb-4656-98cf-298a572f74f3
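
If the bucket has been provisioned through an ObjectBucketClaim (as described in the referenced section), the values can typically be read from the ConfigMap and Secret created alongside the claim. The following is a sketch assuming a claim named sdi-data-lake living in the SDI namespace; the actual name and namespace depend on how the claim was created:

# # the claim name "sdi-data-lake" and its namespace are assumptions; adjust to your environment
# oc get configmap sdi-data-lake -n "${SDI_NAMESPACE:-sdi}" -o jsonpath='{.data.BUCKET_NAME}{"\n"}'
# oc get secret sdi-data-lake -n "${SDI_NAMESPACE:-sdi}" -o jsonpath='{.data.AWS_ACCESS_KEY_ID}' | base64 -d; echo
# oc get secret sdi-data-lake -n "${SDI_NAMESPACE:-sdi}" -o jsonpath='{.data.AWS_SECRET_ACCESS_KEY}' | base64 -d; echo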

5.5.3. SDI Validation

Validate SDI installation on OCP to make sure everything works as expected. Please follow the instructions in Testing Your Installation (3.1) / (3.0).

5.5.3.1. Log On to SAP Data Intelligence Launchpad

In case the vsystem service has been exposed using a route, the URL can be determined like this:

# oc get route -n "${SDI_NAMESPACE:-sdi}"
NAME      HOST/PORT                                                  SERVICES  PORT      TERMINATION  WILDCARD
vsystem   vsystem-<SDI_NAMESPACE>.apps.<cluster_name>.<base_domain>  vsystem   vsystem   reencrypt    None

The HOST/PORT value needs to be then prefixed with https://, for example:

https://vsystem-sdi.apps.boston.ocp.vslen

5.5.3.2. Check Your Machine Learning Setup

In order to upload training and test datasets using the ML Data Manager, the user needs to be assigned the sap.dh.metadata policy. Please make sure to follow Using SAP Data Intelligence Policy Management (3.1) / (3.0) to assign the policies to the users that need them.

5.5.4. Configuration of additional tenants

When a new tenant is created, for example using the Manage Clusters instructions (3.1) / (3.0), it is not configured to work with the container image registry. Therefore, the Pipeline Modeler is unusable and will fail to start until configured.

There are two steps that need to be performed for each new tenant:

  • import CA certificate for the registry via SDI Connection Manager if the CA certificate is self-signed
  • create and import credential secret using the SDI System Management and update the modeler secret if the container image registry requires authentication

If the SDI Registry deployed by the SDI Observer is used, please follow the SDI Observer Registry tenant configuration. Otherwise, please make sure to execute the official instructions in the following articles according to your registry configuration:

6. OpenShift Container Platform Upgrade

This section is useful as a guide for performing OCP upgrades to the latest asynchronous release of the same minor version or to the newer minor release supported by the running SDI instance without upgrading SDI itself.

6.1. Pre-upgrade procedures

  1. Before upgrading the cluster to a release equal to or newer than 4.3, make sure to upgrade SDI at least to release 3.0 Patch 3 by following the SAP Data Hub Upgrade procedures - starting from pre-upgrade without performing the steps marked with (ocp-upgrade).
  2. Make yourself familiar with the OpenShift's upgrade guide (4.2 ⇒ 4.3) / (4.3 ⇒ 4.4) / (4.4 ⇒ 4.5) / (4.5 ⇒ 4.6).
  3. Plan for SDI downtime.
  4. Make sure to re-configure SDI compute nodes.
  5. (OCP 4.2 only) Pin vsystem-vrep to the current node

6.1.1. Stop SAP Data Intelligence

In order to speed up the cluster upgrade and/or to ensure SDI's consistency, it is possible to stop the SDI before performing the upgrade.

The procedure is outlined in the official Administration Guide (3.1) / (3.0). However, please note that the command described there is erroneous as of December 2020. Please execute it this way:

# oc -n "${SDI_NAMESPACE}" patch datahub default --type='json' -p '[
    {"op":"replace","path":"/spec/runLevel","value":"Stopped"}]'

6.2. Upgrade OCP

The following instructions outline a process of OCP upgrade to a minor release 2 versions higher than the current one. If only an upgrade to the latest asynchronous release of the same minor version is desired, please skip steps 5 and 6.

  1. Upgrade OCP to a higher minor release or the latest asynchronous release(⇒ 4.3) / (⇒ 4.5).
  2. If having OpenShift Container Storage deployed, update OCS to the latest supported release for the current OCP release according to the interoperability matrix.
  3. Update OpenShift client tools on the Management host to match the target OCP release. On RHEL 8.2, one can do it like this:

    # current=4.2; new=4.4
    # sudo subscription-manager repos \
        --disable=rhocp-${current}-for-rhel-8-x86_64-rpms --enable=rhocp-${new}-for-rhel-8-x86_64-rpms
    # sudo dnf update -y openshift-clients
    
  4. Update SDI Observer to use the OCP client tools matching the target OCP release by following Re-Deploying SDI Observer while reusing the previous parameters.

  5. Upgrade OCP to a higher minor release or the latest asynchronous release (⇒ 4.4) / (⇒ 4.6).
  6. If having OpenShift Container Storage deployed, update OCS to the latest supported release for the current OCP release according to the interoperability matrix.

For the initial OCP release 4.X, the target release is 4.(X+2); if performing just the latest asynchronous release upgrade, the target release is 4.X.

6.3. Post-upgrade procedures

  1. Start SAP Data Intelligence as outlined in the official Administration Guide (3.1) / (3.0). However, please note the command as described there is erroneous as of December 2020. Please execute it this way:

    # oc -n "${SDI_NAMESPACE}" patch datahub default --type='json' -p '[
        {"op":"replace","path":"/spec/runLevel","value":"Started"}]'
    
  2. (OCP 4.2 (initial) only) Unpin vsystem-vrep from the current node

7. SAP Data Intelligence Upgrade or Update

NOTE This section covers both an upgrade from SAP Data Hub 2.7 and an upgrade of SAP Data Intelligence to a newer minor, micro or patch release. Sections related only to the former or the latter will be annotated with the following annotations:

  • (DH-upgrade) to denote a section specific to an upgrade from Data Hub 2.7 to Data Intelligence 3.0
  • (DI-upgrade) to denote a section specific to an upgrade from Data Intelligence to a newer minor release (3.X ⇒ 3.(X+1))
  • (update) to denote a section specific to an update of Data Intelligence to a newer micro/patch release (3.X.Y ⇒ 3.X.(Y+1))
  • annotation-free are sections relating to any upgrade or update procedure

The following steps must be performed in the given order. Unless an OCP upgrade is needed, the steps marked with (ocp-upgrade) can be skipped.

7.1. Pre-upgrade or pre-update procedures

  1. Make sure to get familiar with the official SAP Upgrade guide (3.0 ⇒ 3.1) / (DH 2.7 ⇒ 3.0).
  2. (ocp-upgrade) Make yourself familiar with the OpenShift's upgrade guide (4.2 ⇒ 4.3) / (4.3 ⇒ 4.4) / (4.4 ⇒ 4.5) / (4.5 ⇒ 4.6).
  3. Plan for a downtime.
  4. Make sure to re-configure SDI compute nodes.
  5. Pin vsystem-vrep to the current node only when having OCP 4.2.

7.1.1. (DH-upgrade) Container image registry preparation

Unlike SAP Data Hub, SAP Data Intelligence requires a secured container image registry. A plain HTTP connection cannot be used anymore.

There are the following options to satisfy this requirement:

  • The registry used by SAP Data Hub is already accessible over HTTPS and its serving TLS certificates have been signed by a trusted certificate authority. In this case, the rest of this section can be skipped until Execute SDI's Pre-Upgrade Procedures.
  • The registry used by SAP Data Hub is already accessible or will be made accessible over HTTPS, but its serving TLS certificate is not signed by a trusted certificate authority. In this case, one of the following must be performed unless already done:

    The rest of this section can then be skipped.

  • A new registry shall be used.

In the last case, please refer to Container Image Registry prerequisite for more details. Also note that the provisioning of the registry can be done by SDI Observer deployed in the subsequent step.

NOTE: the newly deployed registry must contain all the images used by the current SAP Data Hub release as well in order for the upgrade to succeed. There are multiple ways to accomplish this, for example, on the Jump host, execute one of the following:

  • using the manual installation method of SAP Data Hub, one can invoke the install.sh script with the following arguments:

    • --prepare-images to cause the script to just mirror the images to the desired registry and terminate immediately afterwards
    • --registry HOST:PORT to point the script to the newly deployed registry
  • inspect the currently running containers in the SDH project and copy their images directly from the old local registry to the new one (without SAP registry being involved); it can be performed on the Jump host in bash; in the following example, jq, podman and skopeo binaries are assumed to be available:

    # export OLD_REGISTRY=local.image.registry:5000
    # export NEW_REGISTRY=HOST:PORT
    # SDH_NAMESPACE=sdh
    # # login to the old registry using either docker or podman if it requires authentication
    # podman login --tls-verify=false -u username $OLD_REGISTRY
    # # login to the new registry using either docker or podman if it requires authentication
    # podman login --tls-verify=false -u username $NEW_REGISTRY
    # function mirrorImage() {
        local src="$1"
        local dst="$NEW_REGISTRY/${src#*/}"
        skopeo copy --src-tls-verify=false --dest-tls-verify=false "docker://$src" "docker://$dst"
    }
    # export -f mirrorImage
    # # get the list of source images to copy
    # images="$(oc get pods -n "${SDH_NAMESPACE:-sdh}" -o json | jq -r '.items[] | . as $ps |
        [$ps.spec.containers[] | .image] + [($ps.spec.initContainers // [])[] | .image] |
        .[]' | grep -F "$OLD_REGISTRY" | sort -u)"
    # # more portable way to copy the images (up to 6 in parallel) using GNU xargs
    # xargs -n 1 -r -P 6 -i /bin/bash -c 'mirrorImage {}' <<<"${images:-}"
    # # an alternative way using GNU Parallel
    # parallel -P 6 --lb mirrorImage <<<"${images:-}"
    

7.1.2. Execute SDI's Pre-Upgrade Procedures

Please follow the official Pre-Upgrade procedures (3.0 ⇒ 3.1) / (DH 2.7 ⇒ 3.0).

7.1.2.1. (upgrade) Manual route removal

If you exposed the vsystem service using routes, delete the route:

# # note the hostname in the output of the following command
# oc get route -n "${SDI_NAMESPACE:-sdi}"
# # delete the route
# oc delete route -n "${SDI_NAMESPACE:-sdi}" --all

7.1.2.2. (update) Automated route removal

SDI Observer now allows managing the creation and updates of the vsystem route for external access. It takes care of updating the route's destination certificate during SDI's update. It can also be instructed to keep the route deleted, which is useful during SDI updates. If the SDI Observer is of version 0.1.0 or higher, you can instruct it to delete the route like this:

  1. ensure SDI Observer version is 0.1.0 or higher:

    # oc label -n "${NAMESPACE:-sdi-observer}" --list dc/sdi-observer | grep sdi-observer/version
    sdi-observer/version=0.1.0
    

    if there is no output or the version is lower, please follow the Manual route removal instead.

  2. ensure SDI Observer is managing the route already:

    # oc set env -n "${NAMESPACE:-sdi-observer}" --list dc/sdi-observer | grep MANAGE_VSYSTEM_ROUTE
    MANAGE_VSYSTEM_ROUTE=true
    

    if there is no output or MANAGE_VSYSTEM_ROUTE is not one of true, yes or 1, please follow the Manual route removal instead.

  3. instruct the observer to keep the route removed:

    # oc set env -n "${NAMESPACE:-sdi-observer}" dc/sdi-observer MANAGE_VSYSTEM_ROUTE=removed
    # # wait for the observer to get re-deployed
    # oc rollout status -n "${NAMESPACE:-sdi-observer}" -w dc/sdi-observer
    

7.1.3. (ocp-upgrade) Upgrade OpenShift

At this time, depending on target SDI release, OCP cluster must be upgraded either to a newer minor release or to the latest asynchronous release for the current minor release.

Current SDI release Target SDI release Desired and validated OCP Releases
3.0 3.1 4.4 (latest)
DH 2.7 3.0 4.2 (latest)

Make sure to follow the official upgrade instructions (4.4) / (4.2).

Please also update the OpenShift client tools on the Management host. The example below can be used on RHEL 8.

    # current=4.2; new=4.4
    # sudo subscription-manager repos \
        --disable=rhocp-${current}-for-rhel-8-x86_64-rpms --enable=rhocp-${new}-for-rhel-8-x86_64-rpms
    # sudo dnf update -y openshift-clients

7.1.4. Deploy or update SDI Observer

Please execute one of the subsections below. Unless an upgrade of Data Hub is performed, please choose to update SDI Observer.

7.1.4.1. (DH-upgrade) Deploying SDI Observer for the first time

If the current SDH Observer is deployed in a different namespace than SDH's namespace, it must be deleted manually. The easiest way is to delete the project unless shared with other workloads. If it shares the namespace of SDH, no action is needed - it will be deleted automatically.

Please follow the instructions in SDI Observer section to deploy it while paying attention to the following:

  • SDI Observer shall be located in a different namespace than SAP Data Hub and Data Intelligence (e.g. sdi-observer).
  • SDI_NAMESPACE shall be set to the namespace where SDH is currently running

7.1.4.2. Updating SDI Observer

Please follow Re-deploying SDI Observer to update the observer. Please make sure to set MANAGE_VSYSTEM_ROUTE to removed until the SDI update is finished.
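
If the observer is already deployed and managing the route, this can be done the same way as in the automated route removal section above:

# oc set env -n "${NAMESPACE:-sdi-observer}" dc/sdi-observer MANAGE_VSYSTEM_ROUTE=removed
# # wait for the observer to get re-deployed
# oc rollout status -n "${NAMESPACE:-sdi-observer}" -w dc/sdi-observer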

7.1.5. (DH-upgrade) Prepare SDH/SDI Project

SAP Data Hub running in a particular project/namespace on OCP cluster will be substituted by SAP Data Intelligence in the same project/namespace. The existing project must be modified in order to host the latter.

Grant the needed security context constraints to the new service accounts by executing the commands from the project setup. NOTE: Re-running commands that have already been run will do no harm.

(OCP 4.2 only) To be able to amend the potential volume attachment problems, make sure to dump a mapping between the SDH pods and nodes they run on:

# oc get pods -n "${SDH_NAMESPACE:-sdh}" -o wide >sdh-pods-pre-upgrade.out

(optional) If an object storage provided by OCS is available, a new storage bucket can be created for the SDL Data Lake connection (3.0). Please follow Creating an S3 bucket using CLI section. Note that the existing checkpoint store bucket used by SAP Data Hub will continue to be used by SAP Data Intelligence if configured.

7.2. Update or Upgrade SDH or SDI

7.2.1. Update Software Lifecycle Container Bridge

Please follow the official documentation (3.1) / (3.0) to obtain the binary and perform the following steps:

  1. If exposed via a load-balancer, make sure to note down the current service port and node port:

    # oc get -o jsonpath='{.spec.ports[0].nodePort}{"\n"}' -n sap-slcbridge \
        svc/slcbridgebase-service
    31555
    
  2. Once the binary is available on the Management host, execute it as slcb init and choose Update when prompted for a deployment option.

  3. If exposed via a load-balancer, re-set the nodePort to the previous value so no changes on load-balancer side are necessary.

    # nodePort=31555    # change your value to the desired one
    # oc patch --type=json -n sap-slcbridge svc/slcbridgebase-service -p '[{
        "op":"add", "path":"/spec/ports/0/nodePort","value":'"$nodePort"'}]'
    

7.2.2. (DH-upgrade) Upgrade SAP Data Hub to SAP Data Intelligence

Execute the SDH or SDI upgrade according to the official instructions (DH 2.7 ⇒ 3.0).

Please be aware of the potential issue during the upgrade when using OCS 4 as the storage provider.

7.2.3. (DI-upgrade) Upgrade SAP Data Intelligence to a newer minor release

Execute the SDI upgrade according to the official instructions (3.0 ⇒ 3.1).

7.3. (ocp-upgrade) Upgrade OpenShift

Depending on the target SDI release, OCP cluster must be upgraded either to a newer minor release or to the latest asynchronous release for the current minor release.

Upgraded/Current SDI release Desired and validated OCP Releases
3.1 4.6
3.0 4.4

If the current OCP release is two or more releases behind the desired, OCP cluster must be upgraded iteratively to each successive minor release until the desired one is reached.

  1. (optional) Stop the SAP Data Intelligence as it will speed up the cluster update and ensure SDI's consistency.
  2. Make sure to follow the official upgrade instructions for your upgrade path:

  3. (optional) Start the SAP Data Intelligence again if stopped earlier in step 1).

  4. Upgrade OpenShift client tools on the Management host. The example below can be used on RHEL 8:

    # current=4.4; new=4.6
    # sudo subscription-manager repos \
        --disable=rhocp-${current}-for-rhel-8-x86_64-rpms --enable=rhocp-${new}-for-rhel-8-x86_64-rpms
    # sudo dnf update -y openshift-clients
    

7.4. SAP Data Intelligence Post-Upgrade Procedures

  1. Execute the Post-Upgrade Procedures for the SDH (3.1) / (3.0).

  2. Re-create the route for the vsystem service using one of the following methods:

    • (recommended) instruct SDI Observer to manage the route:

      # oc set env -n "${NAMESPACE:-sdi-observer}" dc/sdi-observer MANAGE_VSYSTEM_ROUTE=true
      # # wait for the observer to get re-deployed
      # oc rollout status -n "${NAMESPACE:-sdi-observer}" -w dc/sdi-observer
      
    • follow Expose SDI services externally to recreate the route manually from scratch

  3. (DH-upgrade) Unpin vsystem-vrep from the current node

7.5. Validate SAP Data Intelligence

Validate SDI installation on OCP to make sure everything works as expected. Please follow the instructions in Testing Your Installation (3.1) / (3.0).

8. Appendix

8.1. SDI uninstallation

Please follow the SAP documentation Uninstalling SAP Data Intelligence using the SLC Bridge (3.1) / (3.0).

Additionally, make sure to delete the sdi project, e.g.:

# oc delete project sdi

NOTE: With this, SDI Observer loses permissions to view and modify resources in the deleted namespace. If a new SDI installation shall take place, SDI observer needs to be re-deployed.

Optionally, one can also delete SDI Observer's namespace, e.g.:

# oc delete project sdi-observer

NOTE: this will also delete the container image registry if deployed using SDI Observer which means the mirroring needs to be performed again during a new installation. If SDI Observer (including the registry and its data) shall be preserved for the next installation, please make sure to re-deploy it once the sdi project is re-created.

When done, you may continue with a new installation round in the same or another namespace.

8.2. Configure OpenShift to trust container image registry

If the registry's certificate is signed by a self-signed certificate authority, one must make OpenShift aware of it.

If the registry runs on the OpenShift cluster itself and is exposed via a reencrypt or edge route with the default TLS settings (no custom TLS certificates set), the CA certificate used is available in the secret router-ca in openshift-ingress-operator namespace.

To make the registry exposed via such a route trusted, set the route's hostname in the registry variable and execute the following code in bash:

# registry="local.image.registry:5000"
# caBundle="$(oc get -n openshift-ingress-operator -o json secret/router-ca | \
    jq -r '.data as $d | $d | keys[] | select(test("\\.(?:crt|pem)$")) | $d[.] | @base64d')"
# # determine the name of the CA configmap if it exists already
# cmName="$(oc get images.config.openshift.io/cluster -o json | \
    jq -r '.spec.additionalTrustedCA.name // "trusted-registry-cabundles"')"
# if oc get -n openshift-config "cm/$cmName" 2>/dev/null; then
    # configmap already exists -> just update it
    oc get -o json -n openshift-config "cm/$cmName" | \
        jq '.data["'"${registry//:/..}"'"] |= "'"$caBundle"'"' | \
        oc replace -f - --force
  else
      # creating the configmap for the first time
      oc create configmap -n openshift-config "$cmName" \
          --from-literal="${registry//:/..}=$caBundle"
      oc patch images.config.openshift.io cluster --type=merge \
          -p '{"spec":{"additionalTrustedCA":{"name":"'"$cmName"'"}}}'
  fi

If using a registry running outside of OpenShift or not secured by the default ingress CA certificate, take a look at the official guideline at Configuring a ConfigMap for the Image Registry Operator (4.6) / (4.4).

To verify that the CA certificate has been deployed, execute the following and check whether the supplied registry name appears among the file names in the output:

# oc rsh -n openshift-image-registry "$(oc get pods -n openshift-image-registry -l docker-registry=default | \
        awk '/Running/ {print $1; exit}')" ls -1 /etc/pki/ca-trust/source/anchors
container-image-registry-sdi-observer.apps.boston.ocp.vslen
image-registry.openshift-image-registry.svc..5000
image-registry.openshift-image-registry.svc.cluster.local..5000

If this is not feasible, one can also mark the registry as insecure.

8.3. Configure insecure registry

As a less secure alternative to Configure OpenShift to trust container image registry, the registry may also be marked as insecure, which poses a potential security risk. Please follow Configuring image settings (4.6) / (4.4) and add the registry to the .spec.registrySources.insecureRegistries array. For example:

apiVersion: config.openshift.io/v1
kind: Image
metadata:
  annotations:
    release.openshift.io/create-only: "true"
  name: cluster
spec:
  registrySources:
    insecureRegistries:
    - local.image.registry:5000
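
Alternatively, the same change can be applied non-interactively with a merge patch; note that this replaces the whole insecureRegistries array (the registry below is the example one used throughout this guide):

# oc patch images.config.openshift.io cluster --type=merge \
    -p '{"spec":{"registrySources":{"insecureRegistries":["local.image.registry:5000"]}}}'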

NOTE: it may take tens of minutes until the nodes are reconfigured. You can use the following commands to monitor the progress:

  • watch oc get machineconfigpool
  • watch oc get nodes

8.4. Running multiple SDI instances on a single OCP cluster

Two instances of SAP Data Intelligence running in parallel on a single OCP cluster have been validated. Running more instances is possible, but most probably needs an extra support statement from SAP.

Please consider the following before deploying more than one SDI instance to a cluster:

  • Each SAP Data Intelligence instance must run in its own namespace/project.
  • Each SAP Data Intelligence instance must use a different prefix or container image registry for the Pipeline Modeler. For example, the first instance can configure "Container Registry Settings for Pipeline Modeler" as local.image.registry:5000/sdi30blue and the second as local.image.registry:5000/sdi30green.
  • It is recommended to dedicate particular nodes to each SDI instance.
  • It is recommended to use network policy (4.6) / (4.4) SDN mode for completely granular network isolation configuration and improved security. Check network policy configuration (4.6) / (4.4) for further references and examples. This, however, cannot be changed post OCP installation.
  • If running the production and test (aka blue-green) SDI deployments on a single OCP cluster, mind also the following:
    • There is no way to test an upgrade of OCP cluster before an SDI upgrade.
    • The idle (non-productive) landscape should have the same network security as the live (productive) one.

To deploy a new SDI instance to OCP cluster, please repeat the steps from project setup starting from point 6 with a new project name and continue with SDI Installation.

8.5. Installing remarshal utilities on RHEL

For a few example snippets throughout this guide, either yaml2json or json2yaml scripts are necessary.

They are provided by the remarshal project and shall be installed on the Management host in addition to jq. On RHEL 8.2, one can install it this way:

# sudo dnf install -y python3-pip
# sudo pip3 install remarshal
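
A quick smoke test of the installed scripts may look like this (each command should print the converted document to stdout; the exact output formatting can differ between remarshal versions):

# echo 'cpu: 200m' | yaml2json
# echo '{"cpu": "200m"}' | json2yaml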

8.6. Pin vsystem-vrep to the current node

On OCP 4.2 with the openshift-storage.rbd.csi.ceph.com dynamic storage provisioner used for SDI workload, please make sure to schedule the vsystem-vrep pod on the node where it currently runs in order to prevent A pod is stuck in ContainerCreating phase from happening during an upgrade:

# nodeName="$(oc get pods -n "${SDI_NAMESPACE:-sdi}" vsystem-vrep-0 -o jsonpath='{.spec.nodeName}')"
# oc patch statefulset/vsystem-vrep -n "${SDI_NAMESPACE:-sdi}" \
    --type strategic --patch '{"spec": {"template":
        {"spec": {"nodeSelector": {"kubernetes.io/hostname": "'"${nodeName}"'"}}}
    }}'

To revert the change, please follow Unpin vsystem-vrep from the current node.

To be able to amend other potential volume attachment problems, make sure to dump a mapping between the SDH pods and the nodes they run on:

# oc get pods -n "${SDH_NAMESPACE:-sdh}" -o wide >sdh-pods-pre-upgrade.out

8.7. Unpin vsystem-vrep from the current node

On OCP 4.4, the vsystem-vrep pod no longer needs to be pinned to a particular node in order to prevent A pod is stuck in ContainerCreating phase from occurring.

One can then revert the node pinning with the following command. Note that jq binary is required.

# oc get statefulset/vsystem-vrep -n "${SDI_NAMESPACE:-sdi}" -o json | \
    jq 'del(.spec.template.spec.nodeSelector) | del(.spec.template.spec.affinity.nodeAffinity)' | oc replace -f -

8.8. (footnote ) Upgrading to the next minor release from the latest asynchronous release

If the OCP cluster is subscribed to the stable channel, its latest available micro release for the current minor release may not be upgradable to a newer minor release.

Consider the following example:

  • The OCP cluster is of release 4.5.24.
  • The latest asynchronous release available in stable-4.5 channel is 4.5.30.
  • The latest stable 4.6 release is 4.6.15 (available in stable-4.6 channel).
  • From the 4.5.24 micro release, one can upgrade to one of 4.5.27, 4.5.28, 4.5.30, 4.6.13 or 4.6.15.
  • However, from the 4.5.30 one cannot upgrade to any newer release because no upgrade path has been validated/provided yet in the stable channel.

Therefore, the OCP cluster can get stuck on the 4.5 release if it is first upgraded to the latest asynchronous release 4.5.30 instead of being upgraded directly to one of the 4.6 minor releases. However, at the same time, the fast-4.6 channel contains the 4.6.16 release with an upgrade path from 4.5.30. The 4.6.16 release appears in the stable-4.6 channel sooner or later after being introduced in the fast channel first.

To amend the situation without waiting for an upgrade path to appear in the stable channel:

  1. Temporarily switch to the fast-4.X channel.
  2. Perform the upgrade.
  3. Switch back to the stable-4.X channel.
  4. Continue performing upgrades to the latest micro release available in the stable-4.X channel.
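
For illustration, the channel switch and the upgrade can be performed from the command line as sketched below (using the release numbers from the example above; the web console can be used just as well):

# # switch to the fast channel
# oc patch clusterversion version --type=merge -p '{"spec":{"channel":"fast-4.6"}}'
# # review the newly offered update paths
# oc adm upgrade
# # upgrade to the desired release
# oc adm upgrade --to=4.6.16
# # once the upgrade is finished, switch back to the stable channel
# oc patch clusterversion version --type=merge -p '{"spec":{"channel":"stable-4.6"}}'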

9. Troubleshooting Tips

9.1. Installation or Upgrade problems

9.1.1. Privileged security context unassigned

If there are pods, replicasets, or statefulsets not coming up and you can see an event similar to the one below, you need to add the privileged security context constraint to the corresponding service account.

# oc get events | grep securityContext
1m          32m          23        diagnostics-elasticsearch-5b5465ffb.156926cccbf56887                          ReplicaSet                                                                            Warning   FailedCreate             replicaset-controller                  Error creating: pods "diagnostics-elasticsearch-5b5465ffb-" is forbidden: unable to validate against any security context constraint: [spec.initContainers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed spec.initContainers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed spec.initContainers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]

Copy the name in the fourth column (the event name - diagnostics-elasticsearch-5b5465ffb.156926cccbf56887) and determine its corresponding service account name.

# eventname="diagnostics-elasticsearch-5b5465ffb.156926cccbf56887"
# oc get -o go-template=$'{{with .spec.template.spec.serviceAccountName}}{{.}}{{else}}default{{end}}\n' \
    "$(oc get events "${eventname}" -o jsonpath='{.involvedObject.kind}/{.involvedObject.name}{"\n"}')"
sdi-elasticsearch

The obtained service account name (sdi-elasticsearch) now needs to be assigned privileged SCC:

# oc adm policy add-scc-to-user privileged -z sdi-elasticsearch

The pod then shall come up on its own unless this was the only problem.

9.1.2. No Default Storage Class set

If pods are failing because of PVCs not being bound, the problem may be that the default storage class has not been set and no storage class was specified to the installer.

# oc get pods
NAME                                                  READY     STATUS    RESTARTS   AGE
hana-0                                                0/1       Pending   0          45m
vora-consul-0                                         0/1       Pending   0          45m
vora-consul-1                                         0/1       Pending   0          45m
vora-consul-2                                         0/1       Pending   0          45m

# oc describe pvc data-hana-0
Name:          data-hana-0
Namespace:     sdi
StorageClass:
Status:        Pending
Volume:
Labels:        app=vora
               datahub.sap.com/app=hana
               vora-component=hana
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
Events:
  Type    Reason         Age                  From                         Message
  ----    ------         ----                 ----                         -------
  Normal  FailedBinding  47s (x126 over 30m)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set

To fix this, either make sure to set the Default StorageClass (4.6) / (4.4) or provide the storage class name to the installer.
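
For example, an existing storage class (here ocs-storagecluster-ceph-rbd, used purely as an illustration) can be annotated as the default one like this:

# oc patch storageclass ocs-storagecluster-ceph-rbd --type=merge \
    -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'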

9.1.3. vsystem-app pods not coming up

If you have SELinux in enforcing mode you may see the pods launched by vsystem crash-looping because of the container named vsystem-iptables like this:

# oc get pods
NAME                                                          READY     STATUS             RESTARTS   AGE
auditlog-59b4757cb9-ccgwh                                     1/1       Running            0          40m
datahub-app-db-gzmtb-67cd6c56b8-9sm2v                         2/3       CrashLoopBackOff   11         34m
datahub-app-db-tlwkg-5b5b54955b-bb67k                         2/3       CrashLoopBackOff   10         30m
...
internal-comm-secret-gen-nd7d2                                0/1       Completed          0          36m
license-management-gjh4r-749f4bd745-wdtpr                     2/3       CrashLoopBackOff   11         35m
shared-k98sh-7b8f4bf547-2j5gr                                 2/3       CrashLoopBackOff   4          2m
...
vora-tx-lock-manager-7c57965d6c-rlhhn                         2/2       Running            3          40m
voraadapter-lsvhq-94cc5c564-57cx2                             2/3       CrashLoopBackOff   11         32m
voraadapter-qkzrx-7575dcf977-8x9bt                            2/3       CrashLoopBackOff   11         35m
vsystem-5898b475dc-s6dnt                                      2/2       Running            0          37m

When you inspect one of those pods, you can see an error message similar to the one below:

# oc logs voraadapter-lsvhq-94cc5c564-57cx2 -c vsystem-iptables
2018-12-06 11:45:16.463220|+0000|INFO |Execute: iptables -N VSYSTEM-AGENT-PREROUTING -t nat||vsystem|1|execRule|iptables.go(56)
2018-12-06 11:45:16.465087|+0000|INFO |Output: iptables: Chain already exists.||vsystem|1|execRule|iptables.go(62)
Error: exited with status: 1
Usage:
  vsystem iptables [flags]

Flags:
  -h, --help               help for iptables
      --no-wait            Exit immediately after applying the rules and don't wait for SIGTERM/SIGINT.
      --rule stringSlice   IPTables rule which should be applied. All rules must be specified as string and without the iptables command.

In the audit log on the node where the pod got scheduled, you should be able to find an AVC denial similar to the following. On RHCOS nodes, you may need to inspect the output of the dmesg command instead.

# grep 'denied.*iptab' /var/log/audit/audit.log
type=AVC msg=audit(1544115868.568:15632): avc:  denied  { module_request } for  pid=54200 comm="iptables" kmod="ipt_REDIRECT" scontext=system_u:system_r:container_t:s0:c826,c909 tcontext=system_u:system_r:kernel_t:s0 tclass=system permissive=0
...
# # on RHCOS
# dmesg | grep denied

To fix this, the ipt_REDIRECT kernel module needs to be loaded. Please refer to Pre-load needed kernel modules.

9.1.4. License Manager cannot be initialized

The installation may fail with the following error.

2019-07-22T15:07:29+0000 [INFO] Initializing system tenant...
2019-07-22T15:07:29+0000 [INFO] Initializing License Manager in system tenant...2019-07-22T15:07:29+0000 [ERROR] Couldn't start License Manager!
The response: {"status":500,"code":{"component":"router","value":8},"message":"Internal Server Error: see logs for more info"}Error: http status code 500 Internal Server Error (500)
2019-07-22T15:07:29+0000 [ERROR] Failed to initialize vSystem, will retry in 30 sec...

In the log of license management pod, you can find an error like this:

# oc logs deploy/license-management-l4rvh
Found 2 pods, using pod/license-management-l4rvh-74595f8c9b-flgz9
+ iptables -D PREROUTING -t nat -j VSYSTEM-AGENT-PREROUTING
+ true
+ iptables -F VSYSTEM-AGENT-PREROUTING -t nat
+ true
+ iptables -X VSYSTEM-AGENT-PREROUTING -t nat
+ true
+ iptables -N VSYSTEM-AGENT-PREROUTING -t nat
iptables v1.6.2: can't initialize iptables table `nat': Permission denied
Perhaps iptables or your kernel needs to be upgraded.

This means the vsystem-iptables container in the pod lacks permissions to manipulate iptables. Please make sure to pre-load the needed kernel modules.

9.1.5. Diagnostics Prometheus Node Exporter pods not starting

During an installation or upgrade, it may happen that the Node Exporter pods keep restarting:

# oc get pods  | grep node-exporter
diagnostics-prometheus-node-exporter-5rkm8                        0/1       CrashLoopBackOff   6          8m
diagnostics-prometheus-node-exporter-hsww5                        0/1       CrashLoopBackOff   6          8m
diagnostics-prometheus-node-exporter-jxxpn                        0/1       CrashLoopBackOff   6          8m
diagnostics-prometheus-node-exporter-rbw82                        0/1       CrashLoopBackOff   7          8m
diagnostics-prometheus-node-exporter-s2jsz                        0/1       CrashLoopBackOff   6          8m

The possible reason is that the limits on resource consumption set on the pods are too low. To address this post-installation, you can patch the DaemonSet like this (in the SDI's namespace):

# oc patch -p '{"spec": {"template": {"spec": {"containers": [
    { "name": "diagnostics-prometheus-node-exporter",
      "resources": {"limits": {"cpu": "200m", "memory": "100M"}}
    }]}}}}' ds/diagnostics-prometheus-node-exporter

To address this during the installation (using any installation method), add the following parameters:

-e=vora-diagnostics.resources.prometheusNodeExporter.resources.limits.cpu=200m
-e=vora-diagnostics.resources.prometheusNodeExporter.resources.limits.memory=100M

9.1.6. Graph builds fail to pull images from the registry

If the graph builds hang in the Pending state or fail completely, you may find the following pod not coming up in the sdi namespace because its image cannot be pulled from the registry:

# oc get pods | grep vflow
datahub.post-actions.validations.validate-vflow-9s25l             0/1     Completed          0          14h
vflow-bus-fb1d00052cc845c1a9af3e02c0bc9f5d-5zpb2                  0/1     ImagePullBackOff   0          21s
vflow-graph-9958667ba5554dceb67e9ec3aa6a1bbb-com-sap-demo-dljzk   1/1     Running            0          94m
# oc describe pod/vflow-bus-fb1d00052cc845c1a9af3e02c0bc9f5d-5zpb2 | sed -n '/^Events:/,$p'
Events:
  Type     Reason     Age                From                    Message
  ----     ------     ----               ----                    -------
  Normal   Scheduled  30s                default-scheduler       Successfully assigned sdi/vflow-bus-fb1d00052cc845c1a9af3e02c0bc9f5d-5zpb2 to sdi-moworker3
  Normal   BackOff    20s (x2 over 21s)  kubelet, sdi-moworker3  Back-off pulling image "container-image-registry-sdi-observer.apps.morrisville.ocp.vslen/sdi3modeler-blue/vora/vflow-node-f87b598586d430f955b09991fc1173f716be17b9:3.0.23-com.sap.sles.base-20200617-174600"
  Warning  Failed     20s (x2 over 21s)  kubelet, sdi-moworker3  Error: ImagePullBackOff
  Normal   Pulling    6s (x2 over 21s)   kubelet, sdi-moworker3  Pulling image "container-image-registry-sdi-observer.apps.morrisville.ocp.vslen/sdi3modeler-blue/vora/vflow-node-f87b598586d430f955b09991fc1173f716be17b9:3.0.23-com.sap.sles.base-20200617-174600"
  Warning  Failed     6s (x2 over 21s)   kubelet, sdi-moworker3  Failed to pull image "container-image-registry-sdi-observer.apps.morrisville.ocp.vslen/sdi3modeler-blue/vora/vflow-node-f87b598586d430f955b09991fc1173f716be17b9:3.0.23-com.sap.sles.base-20200617-174600": rpc error: code = Unknown desc = Error reading manifest 3.0.23-com.sap.sles.base-20200617-174600 in container-image-registry-sdi-observer.apps.morrisville.ocp.vslen/sdi3modeler-blue/vora/vflow-node-f87b598586d430f955b09991fc1173f716be17b9: unauthorized: authentication required
  Warning  Failed     6s (x2 over 21s)   kubelet, sdi-moworker3  Error: ErrImagePull

To amend this, one needs to link the secret for the modeler's registry to a corresponding service account associated with the failed pod. In this case, the default one.

# oc get -n "${SDI_NAMESPACE:-sdi}" -o jsonpath='{.spec.serviceAccountName}{"\n"}' \
    pod/vflow-bus-fb1d00052cc845c1a9af3e02c0bc9f5d-5zpb2
default
# oc create secret -n "${SDI_NAMESPACE:-sdi}" docker-registry sdi-registry-pull-secret \
    --docker-server=container-image-registry-sdi-observer.apps.morrisville.ocp.vslen \
    --docker-username=user-n5137x --docker-password=ec8srNF5Pf1vXlPTRLagEjRRr4Vo3nIW
# oc secrets link -n "${SDI_NAMESPACE:-sdi}" --for=pull default sdi-registry-pull-secret
# oc delete -n "${SDI_NAMESPACE:-sdi}" pod/vflow-bus-fb1d00052cc845c1a9af3e02c0bc9f5d-5zpb2

Also please make sure to restart the Pipeline Modeler and the failing graph builds in the affected tenant.

9.1.7. A pod is stuck in ContainerCreating phase

NOTE: Applies to OCP 4.2 in combination with block storage persistent volumes.

The issue can be reproduced when using a ReadWriteOnce persistent volume provisioned by a block device dynamic provisioner like openshift-storage.rbd.csi.ceph.com with a corresponding storage class ocs-storagecluster-ceph-rbd.

# oc get pods | grep ContainerCreating
vsystem-vrep-0                                                    0/2     ContainerCreating   0          10m20s
# oc describe pod vsystem-vrep-0 | sed -n '/^Events/,$p'
Events:
  Type     Reason                  Age                  From                     Message
  ----     ------                  ----                 ----                     -------
  Normal   Scheduled               114m                 default-scheduler        Successfully assigned sdhup/vsystem-vrep-0 to sdi-moworker1
  Normal   SuccessfulAttachVolume  114m                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-fafdd37a-b654-11ea-b795-001c14db4273"
  Normal   SuccessfulAttachVolume  114m                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-f61bd233-b654-11ea-b795-001c14db4273"
  Warning  FailedMount             17m (x39 over 113m)  kubelet, sdi-moworker1   MountVolume.MountDevice failed for volume "pvc-f61bd233-b654-11ea-b795-001c14db4273" : rpc error: code = Internal desc = rbd image ocs-storagecluster-cephblockpool/csi-vol-f6380abf-b654-11ea-8cb4-0a580a83020b is still being used
  Warning  FailedMount             64s (x50 over 111m)  kubelet, sdi-moworker1   Unable to mount volumes for pod "vsystem-vrep-0_sdhup(fddd32f3-b7c4-11ea-b795-001c14db4273)": timeout expired waiting for volumes to attach or mount for pod "sdhup"/"vsystem-vrep-0". list of unmounted volumes=[layers-volume]. list of unattached volumes=[layers-volume exports app-parameters uaa-tls-cert hana-tls-cert vrep-cert-tls vsystem-root-ca-path vora-vsystem-sdhup-vrep-token-wrmxk]

The issue can happen for example during an upgrade from SAP Data Hub. In that case, the upgrade starts to hang at the following step:

# ./slcb execute --url https://boston.ocp.vslen:9000 --useStackXML ~/MP_Stack_1000954710_20200519_.xml
...
time="2020-06-30T06:51:40Z" level=warning msg="Waiting for certificates to be renewed..."
time="2020-06-30T06:51:50Z" level=warning msg="Waiting for certificates to be renewed..."
time="2020-06-30T06:52:00Z" level=info msg="Switching Datahub to runlevel: Started"

For the reference, the corresponding persistent volume can look like this:

# oc get pv | grep f61bd233-b654-11ea-b795-001c14db4273
pvc-f61bd233-b654-11ea-b795-001c14db4273    10Gi       RWO            Delete           Bound    sdhup/layers-volume-vsystem-vrep-0                ocs-storagecluster-ceph-rbd            45h

The solution to the problem is to schedule the vsystem-vrep pod on a particular node.

9.1.7.1. Schedule vsystem-vrep pod on particular node

Make sure to run the pod on the same node it used to run on before being re-scheduled:

  1. Identify previous compute node name depending on whether the pod is running or not.

    • If the vsystem-vrep pod is running currently, please record the node (sdi-moworker3) it is running on now like this:

      # oc get pods -n "${SDI_NAMESPACE:-sdi}" -o wide -l vora-component=vsystem-vrep
      NAME             READY   STATUS    RESTARTS   AGE    IP            NODE            NOMINATED NODE   READINESS GATES
      vsystem-vrep-0   2/2     Running   0          3d1h   10.128.0.31   sdi-moworker3   <none>           <none>
      
    • In case the pod is no longer running, inspect the sdh-pods-pre-upgrade.out created as suggested in the Prepare SDH/SDI Project step and extract the name of the node for the pod in question. In our case, the vsystem-vrep-0 pod used to run on sdi-moworker3.

  2. (if not running) Scale its corresponding deployment (in our case statefulset/vsystem-vrep) down to zero replicas:

    # oc scale -n "${SDI_NAMESPACE:-sdi}" --replicas=0 statefulset/vsystem-vrep
    
  3. Pin vsystem-vrep to the current node with the following command while changing the nodeName.

    # nodeName=sdi-moworker3    # change the name
    # oc patch statefulset/vsystem-vrep -n "${SDI_NAMESPACE:-sdi}" --type strategic --patch \
        '{"spec": {"template": {"spec": {"nodeSelector": {"kubernetes.io/hostname": "'"${nodeName}"'"}}}}}'
    
  4. (if not running) Scale the deployment back to 1:

    # oc scale -n "${SDI_NAMESPACE:-sdi}" --replicas=1 statefulset/vsystem-vrep
    

Verify the pod is scheduled to the given node and becomes ready. If the upgrade process is in progress, it should continue in a while.

# oc get pods -n "${SDI_NAMESPACE:-sdi}" -o wide | grep vsystem-vrep-0
vsystem-vrep-0                                                    2/2     Running     0          5m48s   10.128.4.239   sdi-moworker3   <none>           <none>

9.1.8. Container fails with "Permission denied"

If pods fail with an error similar to the one below, the containers are most probably not allowed to run under the desired UID.

# oc get pods
NAME                                READY   STATUS             RESTARTS   AGE
datahub.checks.checkpoint-m82tj     0/1     Completed          0          12m
vora-textanalysis-6c9789756-pdxzd   0/1     CrashLoopBackOff   6          9m18s
# oc logs vora-textanalysis-6c9789756-pdxzd
Traceback (most recent call last):
  File "/dqp/scripts/start_service.py", line 413, in <module>
    sys.exit(Main().run())
  File "/dqp/scripts/start_service.py", line 238, in run
    **global_run_args)
  File "/dqp/python/dqp_services/services/textanalysis.py", line 20, in run
    trace_dir = utils.get_trace_dir(global_trace_dir, self.config)
  File "/dqp/python/dqp_utils.py", line 90, in get_trace_dir
    return get_dir(global_trace_dir, conf.trace_dir)
  File "/dqp/python/dqp_utils.py", line 85, in get_dir
    makedirs(config_value)
  File "/usr/lib64/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 13] Permission denied: 'textanalysis'

To remedy that, be sure to apply all the oc adm policy add-scc-to-* commands from the project setup section. The one that has not been applied in this case is:

# oc adm policy add-scc-to-group anyuid "system:serviceaccounts:$(oc project -q)"

9.1.9. Jobs failing during installation or upgrade

If the installation jobs are failing with the following error, either the anyuid security context constraint has not been applied or the cluster is too old.

# oc logs solution-reconcile-vsolution-vsystem-ui-3.0.9-vnnbf
Error: mkdir /.vsystem: permission denied.
2020-03-05T15:51:18+0000 [WARN] Could not login to vSystem!
2020-03-05T15:51:23+0000 [INFO] Retrying...
Error: mkdir /.vsystem: permission denied.
2020-03-05T15:51:23+0000 [WARN] Could not login to vSystem!
2020-03-05T15:51:28+0000 [INFO] Retrying...
Error: mkdir /.vsystem: permission denied.
...
2020-03-05T15:52:13+0000 [ERROR] Timeout while waiting to login to vSystem...

The reason is that the vctl binary in the containers determines the HOME directory for its user from /etc/passwd. On older OCP clusters (<4.2.32), or when the container is not run with the desired UID, the value is incorrectly set to /. The binary then lacks permissions to write to the root directory.

To remedy that, please make sure:

  1. you are running OCP cluster 4.2.32 or newer
  2. anyuid SCC has been applied to the SDI namespace

    To verify, make sure the SDI namespace's service account group is listed in the output of the following command:

    # oc get -o json scc/anyuid | jq -r '.groups[]'
    system:cluster-admins
    system:serviceaccounts:sdi
    

    When the jobs are rerun, the anyuid SCC will be assigned to them:

    # oc get pods -n "${SDI_NAMESPACE:-sdi}" -o json | jq -r '.items[] | select((.metadata.ownerReferences // []) |
        any(.kind == "Job")) | "\(.metadata.name)\t\(.metadata.annotations["openshift.io/scc"])"' | column -t
    datahub.voracluster-start-1d3ffe-287c16-d7h7t                    anyuid
    datahub.voracluster-start-b3312c-287c16-j6g7p                    anyuid
    datahub.voracluster-stop-5a6771-6d14f3-nnzkf                     anyuid
    ...
    strategy-reconcile-strat-system-3.0.34-3.0.34-pzn79              anyuid
    tenant-reconcile-default-3.0.34-wjlfs                            anyuid
    tenant-reconcile-system-3.0.34-gf7r4                             anyuid
    vora-config-init-qw9vc                                           anyuid
    vora-dlog-admin-f6rfg                                            anyuid
    
  3. additionally, please make sure that all the other oc adm policy add-scc-to-* commands listed in the project setup have been applied to the same $SDI_NAMESPACE.

9.1.10. vsystem-vrep cannot export NFS on RHCOS

If the vsystem-vrep-0 pod fails with the following error, it is unable to start an NFS server on top of overlayfs.

# oc logs -n ocpsdi1 vsystem-vrep-0 vsystem-vrep
2020-07-13 15:46:05.054171|+0000|INFO |Starting vSystem version 2002.1.15-0528, buildtime 2020-05-28T18:5856, gitcommit ||vsystem|1|main|server.go(107)
2020-07-13 15:46:05.054239|+0000|INFO |Starting Kernel NFS Server||vrep|1|Start|server.go(83)
2020-07-13 15:46:05.108868|+0000|INFO |Serving liveness probe at ":8739"||vsystem|9|func2|server.go(149)
2020-07-13 15:46:10.303625|+0000|WARN |no backup or restore credentials mounted, not doing backup and restore||vsystem|1|NewRcloneBackupRestore|backup_restore.go(76)
2020-07-13 15:46:10.311488|+0000|INFO |vRep components are initialised successfully||vsystem|1|main|server.go(249)
2020-07-13 15:46:10.311617|+0000|ERROR|cannot parse duration from "SOLUTION_LAYER_CLEANUP_DELAY" env variable: time: invalid duration ||vsystem|16|CleanUpSolutionLayersJob|manager.go(351)
2020-07-13 15:46:10.311719|+0000|INFO |Background task for cleaning up solution layers will be triggered every 12h0m0s||vsystem|16|CleanUpSolutionLayersJob|manager.go(358)
2020-07-13 15:46:10.312402|+0000|INFO |Recreating volume mounts||vsystem|1|RemountVolumes|volume_service.go(339)
2020-07-13 15:46:10.319334|+0000|ERROR|error re-loading NFS exports: exit status 1
exportfs: /exports does not support NFS export||vrep|1|AddExportsEntry|server.go(162)
2020-07-13 15:46:10.319991|+0000|FATAL|Error creating runtime volume: error exporting directory for runtime data via NFS: export error||vsystem|1|Fail|termination.go(22)

There are two solutions to the problem; both result in an additional volume mounted at /exports, which is the root directory of all exports. A verification sketch follows the list below.

  • (recommended) deploy SDI Observer, which will request an additional persistent volume of size 500Mi for the vsystem-vrep-0 pod, and make sure it is running
  • add -e=vsystem.vRep.exportsMask=true to the Additional Installer Parameters, which will mount an emptyDir volume at /exports in the same pod

    • on particular versions of OCP this may fail nevertheless
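
In either case, one can verify that /exports ends up backed by a dedicated volume rather than by the container's overlay filesystem; a minimal check (assuming the pod is running and df is available in the image):

# oc exec -n "${SDI_NAMESPACE:-sdi}" vsystem-vrep-0 -c vsystem-vrep -- df -h /exports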

9.1.11. Kaniko cannot push images to a registry

Symptoms:

  • kaniko is enabled in SDI (mandatory on OCP 4)
  • the registry is secured by TLS with a self-signed certificate
  • other SDI and OCP components can use the registry without issues
  • the pipeline modeler crashes with a traceback preceded by the following error:

    # oc logs -f -c vflow  "$(oc get pods -o name \
      -l vsystem.datahub.sap.com/template=pipeline-modeler | head -n 1)" | grep 'push permissions'
    error checking push permissions -- make sure you entered the correct tag name, and that you are authenticated correctly, and try again: checking push permission for "container-image-registry-miminar-sdi-observer.apps.sydney.example.com/vora/vflow-node-f87b598586d430f955b09991fc11
    73f716be17b9:3.0.27-com.sap.sles.base-20201001-102714": BLOB_UPLOAD_UNKNOWN: blob upload unknown to registry
    

Resolution:

The root cause has not been identified yet. To work around it, the modeler shall be configured to use an insecure registry accessible via plain HTTP (without TLS) that requires no authentication. Such a registry can be provisioned with SDI Observer. If the existing registry has been provisioned by SDI Observer, one can modify it to require no authentication like this:

  1. Initiate an update of SDI Observer.
  2. Re-configure sdi-observer for no authentication:

    # oc set env -n "${NAMESPACE:-sdi-observer}" SDI_REGISTRY_AUTHENTICATION=none dc/sdi-observer
    
  3. Wait until the registry gets re-deployed.

  4. Verify that the registry is running and that neither REGISTRY_AUTH_HTPASSWD_REALM nor REGISTRY_AUTH_HTPASSWD_PATH are present in the output of the following command:

    # oc set env -n "${NAMESPACE:-sdi-observer}" --list dc/container-image-registry
    REGISTRY_HTTP_SECRET=mOjuXMvQnyvktGLeqpgs5f7nQNAiNMEE
    
  5. Note the registry service address which can be determined like this:

    # # <service-name>.<namespace>.cluster.local:<service-port>
    # oc project "${NAMESPACE:-sdi-observer}"
    # printf "$(oc get -o jsonpath='{.metadata.name}.{.metadata.namespace}.svc.%s:{.spec.ports[0].port}' \
            svc container-image-registry)\n" \
        "$(oc get dnses.operator.openshift.io/default -o jsonpath='{.status.clusterDomain}')"
    container-image-registry.sdi-observer.svc.cluster.local:5000
    
  6. Verify that the service is responsive over plain HTTP from inside of the OCP cluster and requires no authentication:

    # registry_url=http://container-image-registry.sdi-observer.svc.cluster.local:5000
    # oc rsh -n openshift-authentication "$(oc get pods -n openshift-authentication | \
        awk '/oauth-openshift.*Running/ {print $1; exit}')" curl -I "$registry_url"
    HTTP/1.1 200 OK
    Content-Length: 2
    Content-Type: application/json; charset=utf-8
    Docker-Distribution-Api-Version: reg
    

    Note: the service URL is not reachable from outside of the OCP cluster.

  7. For each SDI tenant using the registry:

    1. Log in to the tenant as an administrator and open System Management.
    2. View Application Configuration and Secrets.

      (Screenshot: Access Application Configuration and Secrets)

    3. Set the following properties to the registry address:

      • Modeler: Base registry for pulling images
      • Modeler: Docker registry for Modeler images
    4. Unset the following properties:

      • Modeler: Name of the vSystem secret containing the credentials for Docker registry
      • Modeler: Docker image pull secret for Modeler

      The end result should look like:

      (Screenshot: Modified registry parameters for Modeler)

    5. Return to "Applications" in the System Management and select Modeler.

    6. Delete all the instances.
    7. Create a new instance with the plus button.
    8. Access the instance to verify it is working; a pod-level check is sketched below.
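
      To confirm that the new Modeler instance starts successfully, its pod can be watched until it becomes ready; a sketch re-using the label selector from the symptom check above:

      # oc get pods -n "${SDI_NAMESPACE:-sdi}" -l vsystem.datahub.sap.com/template=pipeline-modeler -w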

9.1.12. SLCBridge pod fails to deploy

If the initialisation phase of Software Lifecycle Container Bridge fails with an error like the one below, you are probably running SLCB version 1.1.53 configured to push to a registry requiring basic authentication.

*************************************************
* Executing Step WaitForK8s SLCBridgePod Failed *
*************************************************

  Execution of step WaitForK8s SLCBridgePod failed
  Synchronizing Deployment slcbridgebase failed (pod "slcbridgebase-5bcd7946f4-t6vfr" failed) [1.116647047s]
  .
  Choose "Retry" to retry the step.
  Choose "Rollback" to undo the steps done so far.
  Choose "Cancel" to cancel deployment immediately.

# oc logs -n sap-slcbridge -c slcbridge -l run=slcbridge --tail=13
----------------------------
Code: 401
Scheme: basic
"realm": "basic-realm"
{"errors":[{"code":"UNAUTHORIZED","message":"authentication required","detail":null}]}
----------------------------
2020-09-29T11:49:33.346Z        INFO    images/registry.go:182  Access check of registry "container-image-registry-sdi-observer.apps.sydney.example.com" returned AuthNeedBasic
2020-09-29T11:49:33.346Z        INFO    slp/server.go:199       Shutting down server
2020-09-29T11:49:33.347Z        INFO    hsm/hsm.go:125  Context closed
2020-09-29T11:49:33.347Z        INFO    hsm/state.go:56 Received Cancel
2020-09-29T11:49:33.347Z        DEBUG   hsm/hsm.go:118  Leaving event loop
2020-09-29T11:49:33.347Z        INFO    slp/server.go:208       Server shutdown complete
2020-09-29T11:49:33.347Z        INFO    slcbridge/master.go:64  could not authenticate at registry SLP_BRIDGE_REPOSITORY container-image-registry-sdi-observer.apps.sydney.example.com
2020-09-29T11:49:33.348Z        INFO    globals/goroutines.go:63        Shutdown complete (exit status 1).

More information can be found in SAP Note #2589449.

To fix this, please download an SLCB version newer than 1.1.53 as described in SAP Note #2589449.
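
Which SLCB build is currently deployed can be checked, for example, by inspecting the image of the slcbridgebase deployment; a sketch assuming the default sap-slcbridge namespace seen in the log above:

# oc get deployment/slcbridgebase -n sap-slcbridge \
    -o jsonpath='{.spec.template.spec.containers[*].image}{"\n"}'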

9.1.13. Kibana pod fails to start

When the kibana pod is stuck in the CrashLoopBackOff state and the following error shows up in its log, you will need to delete the existing index.

# oc logs -n "${SDI_NAMESPACE:-sdi}" -c diagnostics-kibana -l datahub.sap.com/app-component=kibana --tail=5
{"type":"log","@timestamp":"2020-10-07T14:40:23Z","tags":["status","plugin:ui_metric@7.3.0-SNAPSHOT","info"],"pid":1,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","@timestamp":"2020-10-07T14:40:23Z","tags":["status","plugin:visualizations@7.3.0-SNAPSHOT","info"],"pid":1,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","@timestamp":"2020-10-07T14:40:23Z","tags":["status","plugin:elasticsearch@7.3.0-SNAPSHOT","info"],"pid":1,"state":"green","message":"Status changed from yellow to green - Ready","prevState":"yellow","prevMsg":"Waiting for Elasticsearch"}
{"type":"log","@timestamp":"2020-10-07T14:40:23Z","tags":["info","migrations"],"pid":1,"message":"Creating index .kibana_1."}
{"type":"log","@timestamp":"2020-10-07T14:40:23Z","tags":["warning","migrations"],"pid":1,"message":"Another Kibana instance appears to be migrating the index. Waiting for that migration to complete. If no other Kibana instance is attempting migrations, you can get past this message by deleting index .kibana_1 and restarting Kibana."}

Note the name of the index in the last warning message; in this case it is .kibana_1. Execute the following commands, with the proper index name at the end of the curl command, to delete the index and then delete the kibana pod as well.

# oc exec -n "${SDI_NAMESPACE:-sdi}" -it diagnostics-elasticsearch-0 -c diagnostics-elasticsearch \
    -- curl -X DELETE 'http://localhost:9200/.kibana_1'
# oc delete pod -n "${SDI_NAMESPACE:-sdi}" -l datahub.sap.com/app-component=kibana

A new kibana pod will be spawned and shall become Running in a few minutes, as long as the diagnostics pods it depends on are running as well.
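
The progress can be followed with the same label selector, for example:

# oc get pods -n "${SDI_NAMESPACE:-sdi}" -l datahub.sap.com/app-component=kibana -w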

9.1.14. Fluentd pods cannot access /var/lib/docker/containers

If you see the following errors, fluentd cannot access the container logs on the hosts.

  • Error from SLC Bridge:

    2021-01-26T08:28:49.810Z  INFO  cmd/cmd.go:243  1> DataHub/kub-slcbridge/default [Pending]
    2021-01-26T08:28:49.810Z  INFO  cmd/cmd.go:243  1> └── Diagnostic/kub-slcbridge/default [Failed]  [Start Time:  2021-01-25 14:26:03 +0000 UTC]
    2021-01-26T08:28:49.811Z  INFO  cmd/cmd.go:243  1>     └── DiagnosticDeployment/kub-slcbridge/default [Failed]  [Start Time:  2021-01-25 14:26:29 +0000 UTC]
    2021-01-26T08:28:49.811Z  INFO  cmd/cmd.go:243  1>
    2021-01-26T08:28:55.989Z  INFO  cmd/cmd.go:243  1> DataHub/kub-slcbridge/default [Pending]
    2021-01-26T08:28:55.989Z  INFO  cmd/cmd.go:243  1> └── Diagnostic/kub-slcbridge/default [Failed]  [Start Time:  2021-01-25 14:26:03 +0000 UTC]
    2021-01-26T08:28:55.989Z  INFO  cmd/cmd.go:243  1>     └── DiagnosticDeployment/kub-slcbridge/default [Failed]  [Start Time:  2021-01-25 14:26:29 +0000 UTC]
    
  • Fluentd pod description:

    # oc describe pod diagnostics-fluentd-bb9j7
    Name:           diagnostics-fluentd-bb9j7
    …
      Warning  FailedMount  6m35s                 kubelet, compute-4  Unable to attach or mount volumes: unmounted volumes=[varlibdockercontainers], unattached volumes=[vartmp kub-slcbridge-fluentd-token-k5c9n settings varlog varlibdockercontainers]: timed out waiting for the condition
      Warning  FailedMount  2m1s (x2 over 4m19s)  kubelet, compute-4  Unable to attach or mount volumes: unmounted volumes=[varlibdockercontainers], unattached volumes=[varlibdockercontainers vartmp kub-slcbridge-fluentd-token-k5c9n settings varlog]: timed out waiting for the condition
      Warning  FailedMount  23s (x12 over 8m37s)  kubelet, compute-4  MountVolume.SetUp failed for volume "varlibdockercontainers" : hostPath type check failed: /var/lib/docker/containers is not a directory
    
  • Log from one of the pods:

    # oc logs $(oc get pods -o name -l datahub.sap.com/app-component=fluentd | head -n 1) | tail -n 20
      2019-04-15 18:53:24 +0000 [error]: unexpected error error="Permission denied @ rb_sysopen - /var/log/es-containers-sdh25-mortal-garfish.log.pos"
      2019-04-15 18:53:24 +0000 [error]: suppressed same stacktrace
      2019-04-15 18:53:25 +0000 [warn]: '@' is the system reserved prefix. It works in the nested configuration for now but it will be rejected: @timestamp
      2019-04-15 18:53:26 +0000 [error]: unexpected error error_class=Errno::EACCES error="Permission denied @ rb_sysopen - /var/log/es-containers-sdh25-mortal-garfish.log.pos"
      2019-04-15 18:53:26 +0000 [error]: /usr/lib64/ruby/gems/2.5.0/gems/fluentd-0.14.8/lib/fluent/plugin/in_tail.rb:151:in `initialize'
      2019-04-15 18:53:26 +0000 [error]: /usr/lib64/ruby/gems/2.5.0/gems/fluentd-0.14.8/lib/fluent/plugin/in_tail.rb:151:in `open'
    ...
    

These errors are fixed automatically by SDI Observer; please make sure it is running and can access the SDI_NAMESPACE.

One can also apply a fix manually with the following commands:

# oc -n "${SDI_NAMESPACE:-sdi}" patch dh default --type='json' -p='[
    { "op": "replace"
    , "path": "/spec/diagnostic/fluentd/varlibdockercontainers"
    , "value":"/var/log/pods" }]'
# oc -n "${SDI_NAMESPACE:-sdi}" patch ds/diagnostics-fluentd -p '{"spec":{"template":{"spec":{
    "containers": [{"name":"diagnostics-fluentd", "securityContext":{"privileged": true}}]}}}}'

9.2. SDI Runtime troubleshooting

9.2.1. 504 Gateway Time-out

If you access SDI services exposed via OCP's Ingress Controller (as routes) and experience 504 Gateway Time-out errors, they are most likely caused by the following factors:

  1. SDI components accessed for the first time (on a per-tenant and per-user basis) require a new pod to be started, which takes a considerable amount of time
  2. the default server-connection timeout configured on the load balancers is usually too short to tolerate containers being pulled, initialized and started

To address this, make sure to do the following:

  1. set the "haproxy.router.openshift.io/timeout" annotation to "2m" on the vsystem route like this (assuming the route is named vsystem):

    # oc annotate -n "${SDI_NAMESPACE:-sdi}" route/vsystem haproxy.router.openshift.io/timeout=2m
    

    This results in the following haproxy settings being applied to the ingress router and the route in question:

    # oc rsh -n openshift-ingress $(oc get pods -o name -n openshift-ingress | \
            awk '/\/router-default/ {print;exit}') cat /var/lib/haproxy/conf/haproxy.config | \
        awk 'BEGIN { p=0 }
            /^backend.*:'"${SDI_NAMESPACE:-sdi}:vsystem"'/ { p=1 }
            { if (p) { print; if ($0 ~ /^\s*$/) {exit} } }'
    Defaulting container name to router.
    Use 'oc describe pod/router-default-6655556d4b-7xpsw -n openshift-ingress' to see all of the containers in this pod.
    backend be_secure:sdi:vsystem
      mode http
      option redispatch
      option forwardfor
      balance leastconn
      timeout server  2m
    
  2. set the same server timeout (2 minutes) on the external load balancer forwarding traffic to OCP's Ingress routers; the following is an example configuration for haproxy:

    frontend                                    https
        bind                                    *:443
        mode                                    tcp
        option                                  tcplog
        timeout     server                      2m
        tcp-request inspect-delay               5s
        tcp-request content accept              if { req_ssl_hello_type 1 }
    
        use_backend sydney-router-https         if { req_ssl_sni -m end -i apps.sydney.example.com }
        use_backend melbourne-router-https      if { req_ssl_sni -m end -i apps.melbourne.example.com }
        use_backend registry-https              if { req_ssl_sni -m end -i registry.example.com }
    
    backend         sydney-router-https
        balance     source
        server      compute1                     compute1.sydney.example.com:443     check
        server      compute2                     compute2.sydney.example.com:443     check
        server      compute3                     compute3.sydney.example.com:443     check
    
    backend         melbourne-router-https
        ....
    

9.2.2. HANA backup pod cannot pull an image from an authenticated registry

If the configured container image registry requires authentication, HANA backup jobs might fail as shown in the following example:

# oc get pods | grep backup-hana
default-chq28a9-backup-hana-sjqph                                 0/2     ImagePullBackOff   0          15h
default-hfiew1i-backup-hana-zv8g2                                 0/2     ImagePullBackOff   0          38h
default-m21kt3d-backup-hana-zw7w4                                 0/2     ImagePullBackOff   0          39h
default-w29xv3w-backup-hana-dzlvn                                 0/2     ImagePullBackOff   0          15h

# oc describe pod default-hfiew1i-backup-hana-zv8g2 | tail -n 6
  Warning  Failed          12h (x5 over 12h)       kubelet            Error: ImagePullBackOff
  Warning  Failed          12h (x3 over 12h)       kubelet            Failed to pull image "sdi-registry.apps.shanghai.ocp.vslen/com.sap.datahub.linuxx86_64/hana:2010.22.0": rpc error: code = Unknown desc = Error reading manifest 2010.22.0 in sdi-registry.apps.shanghai.ocp.vslen/com.sap.datahub.linuxx86_64/hana: unauthorized: authentication required
  Warning  Failed          12h (x3 over 12h)       kubelet            Error: ErrImagePull
  Normal   Pulling         99m (x129 over 12h)     kubelet            Pulling image "sdi-registry.apps.shanghai.ocp.vslen/com.sap.datahub.linuxx86_64/hana:2010.22.0"
  Warning  Failed          49m (x3010 over 12h)    kubelet            Error: ImagePullBackOff
  Normal   BackOff         4m21s (x3212 over 12h)  kubelet            Back-off pulling image "sdi-registry.apps.shanghai.ocp.vslen/com.sap.datahub.linuxx86_64/hana:2010.22.0"

Resolution: There are two ways to fix this:

  • The recommended approach is to update SDI Observer to version 0.1.9 or newer.

  • A manual alternative fix is to execute the following:

    1. Determine the currently configured image pull secret:

      # oc get -n "${SDI_NAMESPACE:-sdi}" vc/vora -o jsonpath='{.spec.docker.imagePullSecret}{"\n"}'
      slp-docker-registry-pull-secret
      
    2. Link the secret with the default service account:

      # oc secret link -n "${SDI_NAMESPACE:-sdi}" --for=pull default slp-docker-registry-pull-secret
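
    3. Optionally, verify that the secret is now listed among the default service account's image pull secrets (a quick check):

      # oc get -n "${SDI_NAMESPACE:-sdi}" sa/default -o jsonpath='{.imagePullSecrets[*].name}{"\n"}'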
      

9.3. SDI Observer troubleshooting

9.3.1. Build is failing due to a repository outage

If the build of SDI Observer or SDI Registry is failing with an error similar to the one below, the chosen Fedora repository mirror is probably temporarily down:

# oc logs -n "${NAMESPACE:-sdi-observer}" -f bc/sdi-observer
Extra Packages for Enterprise Linux Modular 8 - 448  B/s |  16 kB     00:36
Failed to download metadata for repo 'epel-modular'
Error: Failed to download metadata for repo 'epel-modular'
subprocess exited with status 1
subprocess exited with status 1
error: build error: error building at STEP "RUN dnf install -y   https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm &&   dnf install -y parallel procps-ng bc git httpd-tools && dnf clean all -y": exit status 1

Please try to start the build again after a minute or two like this:

# oc start-build -n "${NAMESPACE:-sdi-observer}" -F bc/sdi-observer

9.3.2. Build is failing due to proxy issues

If you see the following build error in a cluster where HTTP(S) proxy is used, make sure to update the proxy configuration.

# oc logs -n "${NAMESPACE:-sdi-observer}" -f bc/sdi-observer
Caching blobs under "/var/cache/blobs".

Pulling image registry.redhat.io/ubi8/ubi@sha256:cd014e94a9a2af4946fc1697be604feb97313a3ceb5b4d821253fcdb6b6159ee ...
Warning: Pull failed, retrying in 5s ...
Warning: Pull failed, retrying in 5s ...
Warning: Pull failed, retrying in 5s ...
error: build error: failed to pull image: After retrying 2 times, Pull image still failed due to error: while pulling "docker://registry.redhat.io/ubi8/ubi@sha256:cd014e94a9a2af4946fc1697be604feb97313a3ceb5b4d821253fcdb6b6159ee" as "registry.redhat.io/ubi8/ubi@sha256:cd014e94a9a2af4946fc1697be604feb97313a3ceb5b4d821253fcdb6b6159ee": Error initializing source docker://registry.redhat.io/ubi8/ubi@sha256:cd014e94a9a2af4946fc1697be604feb97313a3ceb5b4d821253fcdb6b6159ee: can't talk to a V1 docker registry

The registry.redhat.io host either needs to be whitelisted in the HTTP proxy server or it must be added to the NO_PROXY settings as in the following bash snippet. When executed, the snippet adds the registry to NO_PROXY only if it is not there yet.

# addreg="registry.redhat.io"
# oc get proxies.config.openshift.io/cluster -o json | \
    jq '.spec.noProxy |= (. | [split("\\s*,\\s*";"")[] | select((. | length) > 0)] | . as $npa |
        "'"$addreg"'" as $r | if [$npa[] | . == $r] | any then $npa else $npa + [$r] end | join(","))' | \
    oc replace -f -
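
To verify the change, the resulting noProxy list can be printed like this:

# oc get proxies.config.openshift.io/cluster -o jsonpath='{.spec.noProxy}{"\n"}'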

Wait until the machine config pools are updated and then restart the build:

# oc get machineconfigpool
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED
master   rendered-master-204c0009fca2b46a9d754371404ad169   True      False      False
worker   rendered-worker-d3738db56394537bb525ab5cf008dc4f   True      False      False

For more information, please refer to Docker pull fails to GET registry.redhat.io/ content.
