SAP Data Intelligence 3 on OpenShift Container Platform 4


In general, the installation of SAP Data Intelligence (SDI) follows these steps:

  • Install Red Hat OpenShift Container Platform
  • Configure the prerequisites for SAP Data Intelligence Foundation
  • Install SDI Observer
  • Install SAP Data Intelligence Foundation on OpenShift Container Platform

If you are interested in the installation of SAP Data Hub or SAP Vora, please refer to the corresponding installation guides.

1. OpenShift Container Platform validation version matrix

The following version combinations of SDI 3.0, OCP, RHEL and RHCOS have been validated for production environments:

SAP Data Intelligence | OpenShift Container Platform | Operating System                                      | Infrastructure and (Storage)
3.0                   | 4.2                          | RHCOS (nodes), RHEL 8.1+ or Fedora (Management host)  | VMware vSphere (OCS 4.2)
3.0 Patch 3           | 4.4                          | RHCOS (nodes), RHEL 8.2+ or Fedora (Management host)  | VMware vSphere (OCS 4.4)

Please refer to the compatibility matrix below for version combinations that are considered to be working.

2. Requirements

2.1. Hardware/VM and OS Requirements

2.1.1. OpenShift Cluster

Make sure to consult the following official cluster requirements:

The following kinds of nodes are involved:

  • Bootstrap Node - A temporary bootstrap node needed for the OCP deployment. The node is either destroyed by the installer (when using installer-provisioned infrastructure -- aka IPI) or can be deleted manually by the administrator. Alternatively, it can be re-used as a worker node. Please refer to the Installation process (4.4) / (4.2) for more information.
  • Master Nodes (4.4) / (4.2) - The control plane manages the OpenShift Container Platform cluster.
  • Compute Nodes (4.4) / (4.2) - Run the actual workload - all SDI pods among others.
  • OCS Nodes (4.4) / (4.2) - Run OpenShift Container Storage (aka OCS) -- currently supported only on AWS and VMware vSphere. The nodes can be divided into starting (running both OSDs and monitors) and additional nodes (running only OSDs). Needed only when OCS shall be used as the backing storage provider.
  • Management host (aka administrator's workstation or jump host) - The Management host is used, among other things, for:

    • accessing the OCP cluster via a configured command line client (oc or kubectl)
    • configuring OCP cluster
    • running Software Lifecycle Container Bridge (SLC Bridge)

The hardware/software requirements for the Management host are:

  • OS: Red Hat Enterprise Linux 8.1+, RHEL 7.6+ or Fedora 30+
  • Disk space: 20 GiB for /

2.1.1.1. Minimum Hardware Requirements

The table below lists the minimum requirements and the minimum number of instances for each node type for the latest validated SDI and OCP 4.X releases. This is sufficient for PoC (Proof of Concept) environments.

Type      | Count | Operating System  | vCPU | RAM (GB) | Storage (GB) | AWS Instance Type
Bootstrap | 1     | RHCOS             | 4    | 16       | 120          | m4.xlarge
Master    | 3+    | RHCOS             | 4    | 16       | 120          | m4.xlarge
Compute   | 3+    | RHEL 7.6 or RHCOS | 8    | 32       | 120          | m4.2xlarge

If using OCS 4, at least 3 additional (starting) nodes are recommended. Alternatively, the Compute nodes outlined above can also run OCS pods; in that case, their hardware specifications need to be extended accordingly. The following table lists the minimum requirements for each additional node:

Type                   | Count | Operating System | vCPU | RAM (GB) | Storage (GB)    | AWS Instance Type
OCS starting (OSD+MON) | 3     | RHCOS            | 16   | 64       | 120 + 2048 + 10 | m5.4xlarge

2.1.1.2. Minimum Production Hardware Requirements

The minimum requirements for production systems running the latest validated SDI and OCP 4 releases are the following:

Type      | Count | Operating System  | vCPU | RAM (GB) | Storage (GB) | AWS Instance Type
Bootstrap | 1     | RHCOS             | 4    | 16       | 120          | m4.xlarge
Master    | 3+    | RHCOS             | 8    | 16       | 120          | c5.xlarge
Compute   | 4+    | RHEL 7.6 or RHCOS | 16   | 64       | 120          | m4.4xlarge

If using OCS 4, at least 3 additional (starting) nodes are needed. The following table lists the minimum requirements for each node:

Type                   | Count | Operating System | vCPU | RAM (GB) | Storage (GB)      | AWS Instance Type
OCS starting (OSD+MON) | 3     | RHCOS            | 16   | 64       | 120 + 3×2048 + 10 | m5.4xlarge
OCS additional (OSD)   | 1     | RHCOS            | 16   | 64       | 120 + 3×2048      | m5.4xlarge

Please refer to OCS Node Requirements (4.4) / (4.2) and OCS Sizing and scaling recommendations (4.4) / (4.2) for more information.

2.2. Software Requirements

2.2.1. Compatibility Matrix

Later versions of SAP Data Intelligence support newer versions of Kubernetes and OpenShift Container Platform. Even if not listed in the OCP validation version matrix above, the following version combinations are considered fully working:

SAP Data Intelligence | OpenShift Container Platform | Worker Node       | Management host   | Infrastructure and (Storage)
3.0                   | 4.2                          | RHCOS or RHEL 7.6 | RHEL 8.1 or newer | Cloud, VMware vSphere (OCS 4) or (vSphere volumes)
3.0 Patch 3 or higher | 4.3, 4.4                     | RHCOS or RHEL 7.6 | RHEL 8.1 or newer | Cloud, VMware vSphere (OCS 4) or (vSphere volumes)

Cloud means any cloud provider supported by OpenShift Container Platform. For a complete list of tested and supported infrastructure platforms, please refer to OpenShift Container Platform 4.x Tested Integrations. The persistent storage in this case must be provided by the cloud provider. Please refer to Understanding persistent storage (4.4) / (4.2) for a complete list of supported storage providers.

The vSphere volumes storage provider does not offer a supported object storage service required by SDI's checkpoint store and is therefore suitable only for SAP Data Intelligence development and PoC clusters.

Unless stated otherwise, the compatibility of a listed SDI version covers all its patch releases as well.

2.2.2. Persistent Volumes

Persistent storage is needed for SDI. It is required to use storage that can be created dynamically. You can find more information in the Understanding persistent storage (4.4) / (4.2) document.
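
You can check which storage classes offer dynamic provisioning in your cluster, for example as follows (the storage class name below is only an example):

# # list the available storage classes; the default one (if any) is marked as "(default)"
# oc get storageclass
# # inspect the provisioner and parameters of a particular class
# oc describe storageclass ocs-storagecluster-cephfs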

2.2.3. Container Image Registry

The SDI installation requires a secured image registry where images are first mirrored from the SAP registry and then delivered to the OCP cluster nodes. The integrated OpenShift Container Registry (4.4) / (4.2) is not appropriate for this purpose. For now, another image registry needs to be set up instead.

NOTE: as of now, AWS ECR Registry cannot be used for this purpose either.

The word secured in this context means that the communication is encrypted using TLS, ideally with certificates signed by a trusted certificate authority. If the registry is also exposed publicly, it must require authentication and authorization in order to pull SAP images.

Such a registry can be deployed directly on the OCP cluster, for example using SDI Observer; please refer to Deploying SDI Observer for more information.

When finished, you should have an external image registry up and running at My_Image_Registry_FQDN. You can verify that with the following command:

# curl -k https://My_Image_Registry_FQDN/v2/
{"errors":[{"code":"UNAUTHORIZED","message":"authentication required","detail":null}]}

2.2.4. Checkpoint store enablement

In order to enable SAP Vora Database streaming tables, the checkpoint store needs to be enabled. The store is an object storage on a particular storage back-end. The SDI installer supports several back-end types, covering most cloud storage providers.

The enablement is strongly recommended for production clusters. Clusters having this feature disabled are suitable only for test, development or PoC use-cases.

Make sure to create a desired bucket before the SDI Installation. If the checkpoint store shall reside in a directory on a bucket, the directory needs to exist as well.
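
For example, if the checkpoint store shall reside on AWS S3, the bucket (the name below is hypothetical) can be created with the AWS CLI; creation of a bucket on OCS's NooBaa is covered later in Creating an S3 bucket using CLI:

# # create the bucket (for regions other than us-east-1, add --create-bucket-configuration LocationConstraint=<region>)
# aws s3api create-bucket --bucket sdi-checkpoint-store --region us-east-1
# # optionally create the directory (prefix) where the checkpoint store shall reside
# aws s3api put-object --bucket sdi-checkpoint-store --key checkpoints/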

2.2.5. SDI Observer

SDI Observer is a pod that monitors SDI's namespace and modifies objects there to enable SDI to run on top of OCP. The observer shall run in a dedicated namespace and must be deployed before the SDI installation is started. The SDI Observer section will guide you through the deployment process.

3. Install Red Hat OpenShift Container Platform

3.1. Prepare the Management host

Note: the following has been tested on RHEL 8.2. The steps should be similar for other RPM-based Linux distributions; RHEL 7.7+, Fedora 30+ and CentOS 7+ are recommended.

  1. Subscribe the Management host at least to the following repositories:

    # OCP_RELEASE=4.4
    # sudo subscription-manager repos        \
        --enable=rhel-8-for-x86_64-appstream-rpms     \
        --enable=rhel-8-for-x86_64-baseos-rpms        \
        --enable=rhocp-${OCP_RELEASE:-4.4}-for-rhel-8-x86_64-rpms
    
  2. Install jq binary. This installation guide has been tested with jq 1.6.

    # sudo curl -L -o /usr/local/bin/jq https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64
    # sudo chmod a+x /usr/local/bin/jq
    
  3. Download and install OpenShift client binaries.

    # sudo dnf install -y openshift-clients
    

    NOTE: the OpenShift client repository must correspond to the same minor release (e.g. 4.4) as the cluster nodes -- rhocp-X.Y-for-rhel-8-x86_64-rpms on RHEL 8 or rhel-7-server-ose-X.Y-rpms on RHEL 7.

3.2. Install OpenShift Container Platform

Install OpenShift Container Platform on your desired cluster hosts. Follow the OpenShift installation guide (4.4) / (4.2).

If you choose the Installer Provisioned Infrastructure (IPI) (4.4) / (4.2), please follow the Installing a cluster on AWS with customizations (4.4) / (4.2) methods to allow for customizations.

On VMware vSphere, please follow Installing a cluster on vSphere (4.4) / (4.2).

Several changes need to be made to the compute nodes running SDI workloads before the SDI installation. These include:

  1. choosing a sufficient number and type of compute instances for the SDI workload
  2. pre-loading the needed kernel modules
  3. increasing the PID limit of the CRI-O container engine
  4. configuring an insecure registry (only if an insecure registry shall be used)

The first two items can be performed during or after the OpenShift installation; the others only after it.

3.2.1. Customizing IPI or UPI installation on AWS or VMware vSphere

In order to allow for customizations, the installation needs to be performed in steps:

  1. create the installation configuration file

    followed up by Modifying the installation configuration file

  2. create the ignition configuration files

    followed up by Modifying the worker ignition configuration file

  3. create the cluster

3.2.1.1. Modifying the installation configuration file

After the configuration file is created by the installer, you can specify the desired instance type of compute nodes by editing <installation_directory>/install-config.yaml. A shortened example for AWS could look like this:

apiVersion: v1
compute:
- hyperthreading: Enabled
  name: worker
  platform:
    aws:
      region: us-east-1
      type: m4.2xlarge
  replicas: 3

On AWS, to satisfy SDI's production requirements, you can change compute.platform.aws.type to r5.2xlarge and compute.replicas to 4.

For VMware vSphere, take a look at Sample install-config.yaml file (4.4) / (4.2).

3.2.1.1.1. (optional) Add proxy settings

If there is a (network/company)-wide HTTP(S) proxy, the proxy settings need to be configured (4.4) / (4.2) in order for the installation to succeed.

In addition to the recommended NO_PROXY values, be sure to include:

  • the base domain of the cluster (e.g. .<base_domain>)
  • the address of the external container image registry (if located within the proxied network and outside of OCP cluster)
  • IP addresses of the load balancers (both external and internal)
  • registry.redhat.io if not accessible via proxy; please see also the troubleshooting section
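
A minimal proxy stanza in <installation_directory>/install-config.yaml could then look like the following sketch (all values are placeholders):

apiVersion: v1
baseDomain: example.com
proxy:
  httpProxy: http://proxy.example.com:3128
  httpsProxy: http://proxy.example.com:3128
  noProxy: .example.com,external.registry.tld,192.168.128.5,192.168.128.6
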
3.2.1.1.2. Generate ignition files

Then continue to generate the ignition configuration files from the installation configuration file by running the following command. Note that the configuration file will be deleted; therefore, it may make sense to back it up first.

# openshift-install create ignition-configs --dir <installation_directory>

3.2.1.2. Modifying the worker ignition configuration file

The following kernel modules need to be pre-loaded on compute nodes running SDI workloads: iptable_filter, iptable_nat, ip_tables, ipt_owner, ipt_REDIRECT, nfsd and nfsv4.

That can be achieved by creating a file under /etc/modules-load.d on each node containing the list of the modules. The following snippet shows how to do it in bash. Make sure jq is installed (see Prepare the Management host) and to change to the correct <installation_directory>.

# pushd <installation_directory>
# modules=( nfsd nfsv4 ip_tables ipt_REDIRECT ipt_owner )
# content="$(printf '%s\n' "${modules[@]}" | base64 -w0)"
# cp worker.{,bak.}ign                      # make a backup of the ignition file
# jq -c '.storage |= {"files": ((.files // []) + [{
    "contents": {
      "source": "data:text/plain;charset=utf-8;base64,'"${content}"'",
      "verification": {}
    },
    "filesystem": "root",
    "mode": 420,
    "path": "/etc/modules-load.d/sdi-dependencies.conf"
}])} | .systemd |= {"units": ((.units // []) + [{
    "contents": "[Unit]\nDescription=Pre-load kernel modules for SAP Data Intelligence\nAfter=network.target\n\n[Service]\nType=oneshot\nExecStart=/usr/sbin/modprobe iptable_nat\nExecStart=/usr/sbin/modprobe iptable_filter\nRemainAfterExit=yes\n\n[Install]\nWantedBy=multi-user.target",
    "enabled": true,
    "name": "sdi-modules-load.service"
}])}' worker.bak.ign >worker.ign
# popd

3.2.1.3. (IPI only) Continue the installation by creating the cluster

To continue the IPI (e.g. on AWS) installation, execute the following command:

# openshift-install create cluster --dir <installation_directory>

3.3. OCP Post Installation Steps

3.3.1. (optional) Install OpenShift Container Storage

On AWS and WMware vSphere platforms, you have the option to deploy OCS to host the persistent storage for Data Intelligence. Please refer to the OCS documentation (4.4) / (4.2).

3.3.2. Change the count and instance type of compute nodes

Please refer to Creating a MachineSet (4.4) / (4.2) for changing an instance type and Manually scaling a MachineSet (4.4) / (4.2) or Applying autoscaling to an OpenShift Container Platform cluster (4.4) / (4.2) for information on scaling the nodes.
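
For example, to manually scale an existing MachineSet of compute nodes (names are environment-specific):

# # list the MachineSets in the cluster
# oc get machinesets -n openshift-machine-api
# # scale the chosen MachineSet to the desired number of replicas
# oc scale machineset <machineset_name> -n openshift-machine-api --replicas=4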

3.3.3. Pre-load needed kernel modules

To apply the desired changes to the existing compute nodes, all that is necessary is to create a new machine config that will be merged with the existing configuration:

# modules=( nfsd nfsv4 ip_tables ipt_REDIRECT ipt_owner )
# content="$(printf '%s\n' "${modules[@]}" | base64 -w0)"
# oc create -f - <<EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 75-worker-sap-data-intelligence
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
      - contents:
          source: "data:text/plain;charset=utf-8;base64,$content"
          verification: {}
        filesystem: root
        mode: 420
        path: /etc/modules-load.d/sdi-dependencies.conf
    systemd:
      units:
      - contents: |
          [Unit]
          Description=Pre-load kernel modules for SAP Data Intelligence
          After=network.target

          [Service]
          Type=oneshot
          ExecStart=/usr/sbin/modprobe iptable_nat
          ExecStart=/usr/sbin/modprobe iptable_filter
          RemainAfterExit=yes

          [Install]
          WantedBy=multi-user.target
        enabled: true
        name: sdi-modules-load.service
EOF

The changes will be rendered into machineconfigpool/worker. The workers will be restarted one-by-one until the changes are applied to all of them. See Applying configuration changes to the cluster (4.4) / (4.2) for more information.

3.3.4. Change the maximum number of PIDs per Container

The process of configuring the nodes is described in Modifying Nodes (4.4) / (4.2). In SDI's case, the required setting is .spec.containerRuntimeConfig.pidsLimit in a ContainerRuntimeConfig. The result is a modified /etc/crio/crio.conf configuration file on each affected worker node with pids_limit set to the desired value.

  1. Label the particular pool of worker nodes.

    # oc label machineconfigpool/worker workload=sapdataintelligence
    
  2. Create the following ContainerRuntimeConfig resource.

    # oc create -f - <<EOF
    apiVersion: machineconfiguration.openshift.io/v1
    kind: ContainerRuntimeConfig
    metadata:
     name: bumped-pid-limit
    spec:
     machineConfigPoolSelector:
       matchLabels:
         workload: sapdataintelligence
     containerRuntimeConfig:
       pidsLimit: 16384
    EOF
    
  3. Wait until the machineconfigpool/worker becomes updated.

    # watch oc get machineconfigpool/worker
    NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED
    worker   rendered-worker-8f91dd5fdd2f6c5555c405294ce5f83c   True      False      False
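
    Optionally, verify on one of the worker nodes that the new limit has been applied; the following is a quick check (the exact location of the setting may differ between OCP releases):

    # node="$(oc get nodes -l node-role.kubernetes.io/worker -o jsonpath='{.items[0].metadata.name}')"
    # oc debug "node/$node" -- chroot /host grep -r pids_limit /etc/crio/
    /etc/crio/crio.conf:pids_limit = 16384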
    

3.3.5. Deploy persistent storage provider

Unless your platform already offers a supported persistent storage provider, one needs to be deployed. Please refer to Understanding persistent storage (4.4) / (4.2) for an overview of possible options.

On AWS and VMware, one can deploy OpenShift Container Storage (OCS) (4.4) / (4.2) running converged on OCP nodes providing both persistent volumes and object storage. Please refer to OCS Planning your Deployment (4.4) / (4.2) and Deploying OpenShift Container Storage (4.4) / (4.2) for more information and installation instructions.

3.3.6. Configure S3 access and bucket

Object storage is required by several SDI features, most notably the checkpoint store and the SDI Data Lake.

SDI supports several interfaces to the object storage; S3 is one of them. Please take a look at Checkpoint Store Type under Required Input Parameters for the complete list. The SAP help page covers the preparation of an object store for a couple of cloud service providers.

3.3.6.1. Using NooBaa as object storage gateway

OCS contains the NooBaa object data service for hybrid and multi-cloud environments, which provides an S3 API that can be used with SAP Data Intelligence. For SDI, one needs to provide the following:

  • S3 host URL prefixed either with https:// or http://
  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • bucket name

NOTE: In case of https://, the endpoint must be secured by certificates signed by a trusted certificate authority. Self-signed CAs will not work out of the box as of now.

Once OCS is deployed, one can create the access keys and bucket using one of the following:

  • via NooBaa Management Console by default exposed at noobaa-mgmt-openshift-storage.apps.<cluster_name>.<base_domain>
  • via OpenShift command line interface covered below

In both cases, the S3 endpoint provided to SAP Data Intelligence cannot be secured with a self-signed certificate as of now. Unless NooBaa's endpoints are secured with a properly signed certificate, one must use the insecure HTTP connection. NooBaa comes with such an insecure service reachable at the following URL, where s3 is the service name and openshift-storage is the namespace where OCS is installed:

http://s3.openshift-storage.svc.cluster.local

The service is resolvable and reachable only from within the cluster; it cannot be reached from outside. One can verify that the service is available with the following command.

# oc get svc -n openshift-storage -l app=noobaa
NAME          TYPE           CLUSTER-IP       EXTERNAL-IP   PORT(S)                                                                   AGE
noobaa-mgmt   LoadBalancer   172.30.154.162   <pending>     80:31351/TCP,443:32681/TCP,8444:31938/TCP,8445:31933/TCP,8446:31943/TCP   7d1h
s3            LoadBalancer   172.30.44.242    <pending>     80:31487/TCP,443:30071/TCP                                                7d1h

3.3.6.1.1. Creating an S3 bucket using CLI

The buckets can be created with the command below. Make sure to double-check the storage class name (e.g. using oc get sc). The claims can live in any OpenShift project (e.g. sdi-infra); be sure to switch to the appropriate project/namespace (e.g. sdi) first before executing the following.

# for claimName in sdi-checkpoint-store sdi-data-lake; do
    oc create -f - <<EOF
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: ${claimName}
spec:
  generateBucketName: ${claimName}
  storageClassName: openshift-storage.noobaa.io
EOF
done

After a while, the object buckets will be created, the claims will get bound and the secrets with the same names (sdi-checkpoint-store and sdi-data-lake in our case) as the ObjectBucketClaim (aka obc) will be created. When ready, the obc will be bound:

# oc get obc -w
NAME                   STORAGE-CLASS                 PHASE   AGE
sdi-checkpoint-store   openshift-storage.noobaa.io   Bound   41s
sdi-data-lake          openshift-storage.noobaa.io   Bound   41s

The name of the created bucket can be determined with the following command:

# oc get cm sdi-data-lake -o jsonpath=$'{.data.BUCKET_NAME}\n'
sdi-data-lake-f86a7e6e-27fb-4656-98cf-298a572f74f3

To determine the access keys, execute the following in bash:

# for claimName in sdi-checkpoint-store sdi-data-lake; do
    printf 'Bucket/claim %s:\n  Bucket name:\t%s\n' "$claimName" "$(oc get obc -o jsonpath='{.spec.bucketName}' "$claimName")"
    for key in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
      printf '  %s:\t%s\n' "$key" "$(oc get secret "$claimName" -o jsonpath="{.data.$key}" | base64 -d)"
    done
  done | column -t -s $'\t'

An example output value can be:

Bucket/claim sdi-checkpoint-store:
  Bucket name:                      sdi-checkpoint-store-ef4999e0-2d89-4900-9352-b1e1e7b361d9
  AWS_ACCESS_KEY_ID:                LQ7YciYTw8UlDLPi83MO
  AWS_SECRET_ACCESS_KEY:            8QY8j1U4Ts3RO4rERXCHGWGIhjzr0SxtlXc2xbtE
Bucket/claim sdi-data-lake:
  Bucket name:                      sdi-data-lake-f86a7e6e-27fb-4656-98cf-298a572f74f3
  AWS_ACCESS_KEY_ID:                cOxfi4hQhGFW54WFqP3R
  AWS_SECRET_ACCESS_KEY:            rIlvpcZXnonJvjn6aAhBOT/Yr+F7wdJNeLDBh231

The values of sdi-checkpoint-store shall be passed to the following SLC Bridge parameters during SDI's installation in order to enable checkpoint store.

Parameter                      | Example value
Amazon S3 Access Key           | LQ7YciYTw8UlDLPi83MO
Amazon S3 Secret Access Key    | 8QY8j1U4Ts3RO4rERXCHGWGIhjzr0SxtlXc2xbtE
Amazon S3 bucket and directory | sdi-checkpoint-store-ef4999e0-2d89-4900-9352-b1e1e7b361d9
Amazon S3 Region (optional)    | (please leave unset)

3.3.7. Set up a Container Image Registry

If you haven't done so already, please follow the Container Image Registry prerequisite.

3.3.8. Configure an insecure registry

NOTE: It is now required to use a registry secured by TLS for SDI. Plain HTTP will not do.

If the registry's certificates are signed by a proper trusted (not self-signed) certificate authority, this section may be skipped.

There are two ways to make OCP trust an additional registry that uses certificates signed by a self-signed certificate authority; one of them is sketched below.
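
One approach is to add the CA certificate to the cluster-wide image configuration. The following is a sketch with placeholder hostnames (the "..5000" key encodes the ":5000" port of the registry):

# # store the CA certificate(s) in a config map in the openshift-config namespace,
# # keyed by the registry hostname (and port, if any)
# oc create configmap registry-cas -n openshift-config \
    --from-file=registry.example.com=ca.crt \
    --from-file=registry.example.com..5000=ca.crt
# # reference the config map from the cluster image configuration
# oc patch image.config.openshift.io/cluster --type merge \
    -p '{"spec":{"additionalTrustedCA":{"name":"registry-cas"}}}'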

3.3.9. Configure the OpenShift Cluster for SDI

3.3.9.1. Becoming a cluster-admin

Many commands below require cluster admin privileges. To become a cluster-admin, you can do one of the following:

  • Use the auth/kubeconfig generated in the working directory during the installation of the OCP cluster:

    INFO Install complete!
    INFO Run 'export KUBECONFIG=<your working directory>/auth/kubeconfig' to manage the cluster with 'oc', the OpenShift CLI.
    INFO The cluster is ready when 'oc login -u kubeadmin -p <provided>' succeeds (wait a few minutes).
    INFO Access the OpenShift web-console here: https://console-openshift-console.apps.demo1.openshift4-beta-abcorp.com
    INFO Login to the console with user: kubeadmin, password: <provided>
    # export KUBECONFIG=working_directory/auth/kubeconfig
    # oc whoami
    system:admin
    
  • As a system:admin user or a member of the cluster-admin group, make another user a cluster admin to allow them to perform the SDI installation:

    1. As a cluster-admin, configure the authentication (4.4) / (4.2) and add the desired user (e.g. sdiadmin).
    2. As a cluster-admin, grant the user a permission to administer the cluster:

      # oc adm policy add-cluster-role-to-user cluster-admin sdiadmin
      

You can learn more about the cluster-admin role in Cluster Roles and Local Roles article (4.4) / (4.2).

4. SDI Observer

Deploy sdi-observer in its own namespace (e.g. sdi-observer). Please refer to its documentation for the complete list of issues that it currently attempts to solve.

It is deployed as an OpenShift template. Its behaviour is controlled by the template's parameters, which are mirrored to its environment variables.

4.1. Important Parameters of Observer's Template

The following parameters are the most important.

Parameter Name Mandatory Example Description
NAMESPACE yes sdi-observer The desired namespace to deploy resources to. Defaults to the current one.
SDI_NAMESPACE yes sdi The name of the SAP Data Intelligence namespace to manage. Defaults to the current one. It must be set only in case the SDI Observer is running in a different namespace (see NAMESPACE).
SLCB_NAMESPACE no sap-slcbridge The name of the namespace where SLC Bridge runs.
OCP_MINOR_RELEASE yes 4.4 Minor release of OpenShift Container Platform (e.g. 4.4). This value must match the OCP server version. The biggest tolerated difference between the versions is 1 in the second digit.
DRY_RUN no false If set to true, no action will be performed. The pod will just print what would have been executed.
FORCE_REDEPLOY no false Whether to forcefully replace existing objects and configuration files. To replace existing secrets as well, RECREATE_SECRETS needs to be set.
NODE_LOG_FORMAT no text Format of the logging files on the nodes. Allowed values are "json" and "text". Initially, SDI fluentd pods are configured to parse "json" while OpenShift 4 uses "text" format by default. If not given, the default is "text".
DEPLOY_SDI_REGISTRY no false Whether to deploy container image registry for the purpose of SAP Data Intelligence. Requires project admin role attached to the sdi-observer service account. If enabled, REDHAT_REGISTRY_SECRET_NAME must be provided.
DEPLOY_LETSENCRYPT no false Whether to deploy letsencrypt controller. It allows to secure exposed routes with trusted certificates provided by Let's Encrypt open certificate authority. The mandatory prerequisite is a publicly resolvable application subdomain (*.apps.<cluster_name>.<base_domain>).
REDHAT_REGISTRY_SECRET_NAME no unless… 123456-username-pull-secret Name of the secret with credentials for registry.redhat.io registry. Please visit Red Hat Registry Service Accounts to obtain the OpenShift secret. For more details, please refer to Red Hat Container Registry Authentication
INJECT_CABUNDLE no false Inject CA certificate bundle into SAP Data Intelligence pods. The bundle can be specified with CABUNDLE_SECRET_NAME. It is needed if the container image registry is secured by a self-signed certificate.
REGISTRY no external.registry.tld:5000 The registry to mark as insecure. If not given, it will be determined from the vflow-secret in the SDI_NAMESPACE. If DEPLOY_SDI_REGISTRY is set to "true", this variable will be used as the container image registry's hostname when creating the corresponding route. Please do not set unless an external registry is used and it shall be marked as insecure or you want to use a custom hostname for the registry route.
MARK_REGISTRY_INSECURE no false Set to true if the given or configured REGISTRY shall be marked as insecure in all instances of Pipeline Modeler.
CABUNDLE_SECRET_NAME no unless… openshift-ingress-operator/router-ca The name of the secret containing the certificate authority bundle that shall be injected into Data Intelligence pods. By default, the secret bundle is obtained from the openshift-ingress-operator namespace where the router-ca secret contains the certificate authority used to sign all edge and reencrypt routes that are, inter alia, used for the SDI_REGISTRY and NooBaa S3 API services. The secret name may be optionally prefixed with $namespace/. All the entries present in the "data" field having a ".crt" or ".pem" suffix will be concatenated to form the resulting certificate file.
EXPOSE_WITH_LETSENCRYPT no true Whether to mark created service routes for exposure by letsencrypt controller. If not specified, defaults to the value of DEPLOY_LETSENCRYPT.

Notes on the table above:

  • Mandatory indicates whether the parameter must be provided when instantiating the template.
  • The example value is also the default value.
  • Please make sure to deploy SDI and SDI Observer to different namespaces. Deploying both to a single namespace is possible and supported but not recommended.
  • To see the actions that would have been executed once the observer is deployed, use oc rollout status -n "${NAMESPACE:-sdi-observer}" -w dc/sdi-observer
  • REDHAT_REGISTRY_SECRET_NAME is optional unless either DEPLOY_SDI_REGISTRY or DEPLOY_LETSENCRYPT is set to true.
  • INJECT_CABUNDLE=true also makes SDI Observer take care of Setting Up Certificates by creating the cmcertificates secret.
This is needed only if the REGISTRY is either:

  • not secured with TLS - using plain HTTP
  • secured by a self-signed certificate and cabundle is not provided (INJECT_CABUNDLE is false)

If deploying registry using the SDI Observer (using DEPLOY_SDI_REGISTRY=true), MARK_REGISTRY_INSECURE shall not be set as long as one of the following applies:

  • SDI Observer is run with INJECT_CABUNDLE set to true
  • letsencrypt controller is managing routes in SDI Observer's namespace and EXPOSE_WITH_LETSENCRYPT is set to true
  • OCP's ingress operator is configured with a proper trusted wildcard certificate (not self-signed)

Unless INJECT_CABUNDLE is true.
For example, in the default value "openshift-ingress-operator/router-ca", the "openshift-ingress-operator" stands for secret's namespace and "router-ca" stands for secret's name. If no $namespace prefix is given, the secret is expected to reside in the NAMESPACE where the SDI observer runs.
Letsencrypt must be either provisioned by SDI Observer (using DEPLOY_LETSENCRYPT=true) or deployed manually and configured to monitor SDI Observer's namespace.

You can inspect all the available parameters and their semantics like this:

# oc process --parameters -f https://raw.githubusercontent.com/redhat-sap/sap-data-intelligence/master/observer/ocp-template.json

4.2. Deploying SDI Observer

SDI Observer monitors the SDI and SLC Bridge namespaces and applies changes to SDI deployments to allow SDI to run on OpenShift. Among other things, it does the following:

  • adds an additional persistent volume to the vsystem-vrep StatefulSet to allow it to run on RHCOS
  • grants fluentd pods permissions to access logs
  • reconfigures the fluentd pods to parse plain-text container log files on the OCP 4 nodes
  • (optional) marks containers manipulating iptables on RHCOS hosts as privileged when the needed kernel modules are not pre-loaded on the nodes
  • (optional) deploys container image registry suitable for mirroring, storing and serving SDI images and for use by the Pipeline Modeler
  • (optional) deploys letsencrypt controller taking care of trusted certificate management
  • (optional) creates cmcertificates secret to allow SDI to talk to container image registry secured by a self-signed CA certificate early during the installation time
  • (optional) enables the Pipeline Modeler (aka vflow) to talk to an (HTTP) insecure registry; it is however preferred to use HTTPS

4.2.1. Prerequisites

The following must be satisfied before SDI Observer can be deployed:

  1. OpenShift cluster must be fully operational including the Image Registry. Make sure that all the nodes are ready, all cluster operators are available and none of them is degraded.

    # oc get co
    # oc get nodes
    
  2. The namespaces for SLC Bridge, SDI and SDI Observer must exist. Execute the following to create them:

    # # change the namespace names according to your preferences
    # NAMESPACE=sdi-observer SDI_NAMESPACE=sdi SLCB_NAMESPACE=sap-slcbridge
    # for nm in $SDI_NAMESPACE $SLCB_NAMESPACE $NAMESPACE; do oc new-project $nm; done
    
  3. In order to build images needed for SDI Observer, a secret with credentials for registry.redhat.io needs to be created in the namespace of SDI Observer. Please visit Red Hat Registry Service Accounts to obtain the OpenShift secret. For more details, please refer to Red Hat Container Registry Authentication. Once you have downloaded the OpenShift secret file (e.g. rht-registry-secret.yaml) with your credentials, you can import it into SDI Observer's namespace like this:

    # oc create -n "${NAMESPACE:-sdi-observer}" -f rht-registry-secret.yaml
    secret/123456-username-pull-secret created
    

4.2.2. Instantiation of Observer's Template

Deploy the SDI Observer by processing the template.

# NAMESPACE=sdi-observer
# SDI_NAMESPACE=sdi
# OCP_MINOR_RELEASE=4.4
# DEPLOY_SDI_REGISTRY=true
# INJECT_CABUNDLE=true
# REDHAT_REGISTRY_SECRET_NAME=123456-username-pull-secret
# oc process -f https://raw.githubusercontent.com/redhat-sap/sap-data-intelligence/master/observer/ocp-template.json \
        NAMESPACE="${NAMESPACE:-sdi-observer}" \
        SDI_NAMESPACE="${SDI_NAMESPACE:-sdi}" \
        OCP_MINOR_RELEASE="${OCP_MINOR_RELEASE:-4.4}" \
        DEPLOY_SDI_REGISTRY="${DEPLOY_SDI_REGISTRY:-true}" \
        INJECT_CABUNDLE="${INJECT_CABUNDLE:-true}" \
        REDHAT_REGISTRY_SECRET_NAME="$REDHAT_REGISTRY_SECRET_NAME" | oc create -f -

This will deploy the observer to the sdi-observer namespace in such a way that the observer deploys a container image registry and injects the default CA bundle into the SDI pods so that they trust that registry.

It may take a couple of minutes until the sdi-observer image is built and deployed.

You can monitor the progress of build and deployment with:

# oc logs -n "${NAMESPACE:-sdi-observer}" -f bc/sdi-observer
# oc rollout status -n "${NAMESPACE:-sdi-observer}" -w dc/sdi-observer
replication controller "sdi-observer-2" successfully rolled out
# # see the actions that observer performs
# oc logs -n "${NAMESPACE:-sdi-observer}" -f dc/sdi-observer

4.2.2.1. Using an alternative image

By default, SDI Observer is built on the Red Hat Universal Base Image (UBI). This requires access to the registry.redhat.io registry including the credentials provided with the REDHAT_REGISTRY_SECRET_NAME. Using this base image is the only supportable option.

However, for a proof of concept or development cases, it is possible to provide a custom image from another registry. The instantiation will then look like this:

# NAMESPACE=sdi-observer
# SDI_NAMESPACE=sdi
# OCP_MINOR_RELEASE=4.4
# DEPLOY_SDI_REGISTRY=true
# INJECT_CABUNDLE=true
# SOURCE_IMAGE_PULL_SPEC=registry.centos.org/centos:8
# SOURCE_IMAGESTREAM_NAME=centos8
# oc process -f https://raw.githubusercontent.com/redhat-sap/sap-data-intelligence/master/observer/ocp-custom-source-image-template.json \
        NAMESPACE="${NAMESPACE:-sdi-observer}" \
        SDI_NAMESPACE="${SDI_NAMESPACE:-sdi}" \
        OCP_MINOR_RELEASE="${OCP_MINOR_RELEASE:-4.4}" \
        DEPLOY_SDI_REGISTRY="${DEPLOY_SDI_REGISTRY:-true}" \
        INJECT_CABUNDLE="${INJECT_CABUNDLE:-true}" \
        SOURCE_IMAGE_PULL_SPEC="${SOURCE_IMAGE_PULL_SPEC:-registry.centos.org/centos:8}" \
        SOURCE_IMAGESTREAM_NAME="${SOURCE_IMAGESTREAM_NAME:-centos8}" | oc create -f -

The template already contains registry.centos.org/centos:8 as the default, so both SOURCE_IMAGE_PULL_SPEC and SOURCE_IMAGESTREAM_NAME can be left out completely if centos is the desired base image. This registry does not require authentication.

However, please make sure that the registry of your choice is allowed for import (4.4) / (4.2) in your cluster.
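
A quick way to check this is sketched below (an empty list means that imports are not restricted):

# oc get image.config.openshift.io/cluster -o jsonpath=$'{.spec.allowedRegistriesForImport}\n'
# # if the list is not empty, it must include the chosen registry; note that the following
# # merge patch replaces the whole list, so include all previously allowed registries too
# oc patch image.config.openshift.io/cluster --type merge \
    -p '{"spec":{"allowedRegistriesForImport":[{"domainName":"registry.centos.org","insecure":false}]}}'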

4.2.2.2. SDI Observer Registry

If the observer is configured to deploy container image registry via DEPLOY_SDI_REGISTRY=true parameter, it will deploy the deploy-registry job which does the following:

  1. builds the container-image-registry image and pushes it to the integrated OpenShift Image Registry
  2. generates or uses configured credentials for the registry
  3. deploys container-image-registry deployment config that runs this image and requires authentication
  4. exposes the registry using a route

    • if observer's REGISTRY parameter is set, it will be used as its hostname
    • otherwise the registry's hostname will be container-image-registry-${NAMESPACE}.apps.<cluster_name>.<base_domain>
  5. (optional) annotates the route for the letsencrypt controller to secure it with a trusted certificate

4.2.2.2.1. Registry Template parameters

The following Observer's Template Parameters influence the deployment of the registry:

Parameter Example value Description
DEPLOY_SDI_REGISTRY true Whether to deploy container image registry for the purpose of SAP Data Intelligence.
REDHAT_REGISTRY_SECRET_NAME 123456-username-pull-secret Name of the secret with credentials for the registry.redhat.io registry. Please visit Red Hat Registry Service Accounts to obtain the OpenShift secret. For more details, please refer to Red Hat Container Registry Authentication. Must be provided in order to build the registry's image.
REGISTRY registry.cluster.tld This variable will be used as the container image registry's hostname when creating the corresponding route. Defaults to container-image-registry-$NAMESPACE.<cluster_name>.<base_domain>. If set, the domain name must resolve to the IP of the ingress router.
INJECT_CABUNDLE true Inject CA certificate bundle into SAP Data Intelligence pods. The bundle can be specified with CABUNDLE_SECRET_NAME. It is needed if either registry or s3 endpoint is secured by a self-signed certificate. The letsencrypt method is preferred.
CABUNDLE_SECRET_NAME custom-ca-bundle The name of the secret containing the certificate authority bundle that shall be injected into Data Intelligence pods. By default, the secret bundle is obtained from the openshift-ingress-operator namespace where the router-ca secret contains the certificate authority used to sign all the edge and reencrypt routes that are, among others, used for the SDI_REGISTRY and NooBaa S3 API services. The secret name may be optionally prefixed with $namespace/.
SDI_REGISTRY_STORAGE_CLASS_NAME ocs-storagecluster-cephfs Unless given, the default storage class will be used. If possible, prefer volumes with ReadWriteMany (RWX) access mode.
REPLACE_SECRETS true By default, existing SDI_REGISTRY_HTPASSWD_SECRET_NAME secret will not be replaced if it already exists. If the registry credentials shall be changed while using the same secret name, this must be set to true.
SDI_REGISTRY_USERNAME registry-user Will be used to generate htpasswd file to provide authentication data to the sdi registry service as long as SDI_REGISTRY_HTPASSWD_SECRET_NAME does not exist or REPLACE_SECRETS is true. Unless given, it will be autogenerated by the job.
SDI_REGISTRY_PASSWORD secure-password ditto
SDI_REGISTRY_HTPASSWD_SECRET_NAME registry-htpasswd A secret with htpasswd file with authentication data for the sdi image container. If given and the secret exists, it will be used instead of SDI_REGISTRY_USERNAME and SDI_REGISTRY_PASSWORD. Defaults to container-image-registry-htpasswd. Please make sure to follow the official guidelines on generating the htpasswd file.
SDI_REGISTRY_VOLUME_CAPACITY 250Gi Volume space available for container images. Defaults to 120Gi.
SDI_REGISTRY_VOLUME_ACCESS_MODE ReadWriteMany If the given SDI_REGISTRY_STORAGE_CLASS_NAME or the default storage class supports the ReadWriteMany ("RWX") access mode, please change this to ReadWriteMany. For example, the ocs-storagecluster-cephfs storage class, deployed by the OCS operator, does support it.
DEPLOY_LETSENCRYPT true Whether to deploy letsencrypt controller. Requires project admin role attached to the sdi-observer service account.
EXPOSE_WITH_LETSENCRYPT true Whether to expose route for the registry annotated for letsencrypt controller. Letsencrypt controller must be deployed either via the observer or cluster-wide for this to have an effect. Defaults to the value of DEPLOY_LETSENCRYPT.

Monitoring the registry's deployment:

# oc logs -n "${NAMESPACE:-sdi-observer}" -f job/deploy-registry

4.2.2.2.2. Determining Registry's credentials

The username and password are separated by a colon in the SDI_REGISTRY_HTPASSWD_SECRET_NAME secret:

# # make sure to change the NAMESPACE and secret name according to your environment
# oc get -o json -n "${NAMESPACE:-sdi-observer}" secret/container-image-registry-htpasswd | \
    jq -r '.data[".htpasswd.raw"] | @base64d'
user-qpx7sxeei:OnidDrL3acBHkkm80uFzj697JGWifvma

4.2.2.2.3. Testing the connection

In this example, it is assumed that the INJECT_CABUNDLE and DEPLOY_SDI_REGISTRY are true and other parameters use the defaults.

  1. Obtain Ingress Router's default self-signed CA certificate:

    # oc get secret -n openshift-ingress-operator -o json router-ca | \
        jq -r '.data as $d | $d | keys[] | select(test("\\.crt$")) | $d[.] | @base64d' >router-ca.crt
    
  2. Do a simple test using curl:

    # # determine registry's hostname from its route
    # hostname="$(oc get route -n "${NAMESPACE:-sdi-observer}" container-image-registry -o jsonpath='{.spec.host}')"
    # curl -I --user user-qpx7sxeei:OnidDrL3acBHkkm80uFzj697JGWifvma --cacert router-ca.crt \
        "https://$hostname/v2/"
    HTTP/1.1 200 OK
    Content-Length: 2
    Content-Type: application/json; charset=utf-8
    Docker-Distribution-Api-Version: registry/2.0
    Date: Sun, 24 May 2020 17:54:31 GMT
    Set-Cookie: d22d6ce08115a899cf6eca6fd53d84b4=9176ba9ff2dfd7f6d3191e6b3c643317; path=/; HttpOnly; Secure
    Cache-control: private
    
  3. Using podman:

    # # determine registry's hostname from its route
    # hostname="$(oc get route -n "${NAMESPACE:-sdi-observer}" container-image-registry -o jsonpath='{.spec.host}')"
    # sudo mkdir -p "/etc/containers/certs.d/$hostname"
    # sudo cp router-ca.crt "/etc/containers/certs.d/$hostname/"
    # podman login -u user-qpx7sxeei "$hostname"
    Password:
    Login Succeeded!
    
4.2.2.2.4. Configuring OCP

Configure OpenShift to trust the deployed registry if using a self-signed CA certificate.

4.2.2.2.5. SDI Observer Registry tenant configuration

NOTE: Only applicable once the SDI installation is complete.

Each newly created tenant needs to be configured to be able to talk to the SDI Registry. The initial tenant (the default) does not need to be configured manually as it is configured during the installation.

There are two steps that need to be performed for each new tenant:

  • import CA certificate for the registry via SDI Connection Manager if the CA certificate is self-signed (the default unless letsencrypt controller is used)
  • create and import a credential secret using the SDI System Management and update the modeler secret

Import the CA certificate

  1. Obtain the router-ca.crt from the router-ca secret as documented in the previous section.
  2. Follow the Manage Certificates guide to import the router-ca.crt via the SDI Connection Management.

Import the credential secret

Determine the credentials and import them using the SDI System Management by following the official Provide Access Credentials for a Password Protected Container Registry.

As an alternative to the step "1. Create a secret file that contains the container registry credentials and …", you can also use the following way to create the vsystem-registry-secret.txt file:

# # determine registry's hostname from its route
# hostname="$(oc get route -n "${NAMESPACE:-sdi-observer}" container-image-registry -o jsonpath='{.spec.host}')"
# oc get -o json -n "${NAMESPACE:-sdi-observer}" secret/container-image-registry-htpasswd | \
    jq -r '.data[".htpasswd.raw"] | @base64d | gsub("\\s+"; "") | split(":") |
        [{"username":.[0], "password":.[1], "address":"'"$hostname"'"}]' | \
    json2yaml > vsystem-registry-secret.txt

NOTE: the json2yaml binary from the remarshal project must be installed on the Management host in addition to jq.
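
For example, assuming Python 3 and pip are available on the Management host, remarshal (which ships the json2yaml binary) can be installed like this:

# sudo pip3 install remarshal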

4.3. Managing SDI Observer

4.3.1. Viewing and changing the current configuration

View the current configuration of SDI Observer:

# oc set env --list -n "${NAMESPACE:-sdi-observer}" dc/sdi-observer

Change the settings:

# # instruct the observer to deploy letsencrypt controller to make the
# # services like registry trusted without injecting self-signed CA into pods
# oc set env -n "${NAMESPACE:-sdi-observer}" dc/sdi-observer DEPLOY_LETSENCRYPT=true INJECT_CABUNDLE=false

4.3.2. Re-deploying SDI Observer

Re-deployment is useful in the following cases:

  • To update to the latest SDI Observer code. Please be sure to check the Update instructions before updating to the latest release.
  • SDI has been uninstalled, its namespace deleted and re-created.
  • A parameter reflected in multiple resources (not just in the DeploymentConfig) needs to be changed (e.g. OCP_MINOR_RELEASE).
  • Different SDI instance in another namespace shall be observed.

NOTE: Re-deployment preserves generated secrets and persistent volumes unless FORCE_REDEPLOY, REPLACE_SECRETS and REPLACE_PERSISTENT_VOLUMES are true.

The template needs to be processed again with the desired parameters and existing objects replaced like this:

# oc process -f https://raw.githubusercontent.com/redhat-sap/sap-data-intelligence/master/observer/ocp-template.json \
        NAMESPACE="${NAMESPACE:-sdi-observer}" \
        SDI_NAMESPACE="${SDI_NAMESPACE:-sdi}" \
        OCP_MINOR_RELEASE="${OCP_MINOR_RELEASE:-4.4}" \
        DEPLOY_SDI_REGISTRY="${DEPLOY_SDI_REGISTRY:-true}" \
        INJECT_CABUNDLE="${INJECT_CABUNDLE:-true}" \
        REDHAT_REGISTRY_SECRET_NAME="$REDHAT_REGISTRY_SECRET_NAME" | oc replace -f -
# watch oc get pods -n "${NAMESPACE:-sdi-observer}"
# # trigger a new build if it does not start automatically
# oc start-build -n "${NAMESPACE:-sdi-observer}" -F bc/sdi-observer

An alternative is to delete the NAMESPACE where the SDI Observer is deployed and deploy it again. Note however, that this may delete SDI Registry deployed by the observer including the mirrored images if the DEPLOY_SDI_REGISTRY was true in the previous run.

4.3.2.1. Re-deploying while reusing the previous parameters

Another alternative that reuses the parameters used last time is shown in the next example. It overrides a single variable (OCP_MINOR_RELEASE) which is useful when updating OpenShift cluster. Make sure to execute it in bash.

# OCP_MINOR_RELEASE=4.4
# tmpl=https://raw.githubusercontent.com/redhat-sap/sap-data-intelligence/master/observer/ocp-template.json; \
  oc process -f $tmpl $(oc set env -n "${NAMESPACE:-sdi-observer}" --list dc/sdi-observer | grep -v '^#\|=$' | grep -F -f \
        <(oc process -f $tmpl --parameters | sed -n 's/^\([A-Z_]\+\)\s\+.*/\1/p' | tail -n +2) | \
            sed 's/\(OCP_MINOR_RELEASE\)=.*/\1='"$OCP_MINOR_RELEASE"'/') | oc replace -f -
# watch oc get pods -n "${NAMESPACE:-sdi-observer}"
# # trigger a new build if it does not start automatically
# oc start-build -n "${NAMESPACE:-sdi-observer}" -F bc/sdi-observer

5. Install SDI on OpenShift

5.1. Install Software Lifecycle Container Bridge

Please follow the official documentation.

5.1.1. Important Parameters

Parameter Condition Description
Mode Always Make sure to choose the Expert Mode.
Address of the Container Image Repository Always This is the Host value of the container-image-registry route in the sdi-observer namespace if the registry is deployed by SDI Observer.
Image registry user name if … The value recorded in the SDI_REGISTRY_HTPASSWD_SECRET_NAME if using the registry deployed with SDI Observer.
Image registry password if … ditto
Namespace of the SLC Bridge Always If you override the default (sap-slcbridge), make sure to deploy SDI Observer with the corresponding SLCB_NAMESPACE value.
Service Type SLC Bridge Base installation On vSphere, make sure to use NodePort. On AWS, please use LoadBalancer.
Cluster No Proxy Required in conjunction with the HTTPS Proxy value Make sure to extend with additional mandatory entries.

Only applies if the registry requires authentication; the one deployed with SDI Observer does.
Make sure to include at least the entries located in OCP cluster's proxy settings.

# # get the internal OCP cluster's NO_PROXY settings
# noProxy="$(oc get -o jsonpath='{.status.noProxy}' proxy/cluster)"; echo "$noProxy"
.cluster.local,.local,.nip.io,.ocp.vslen,.sap.corp,.svc,10.0.0.0/16,10.128.0.0/14,10.17.69.0/23,127.0.0.1,172.30.0.0/16,192.168.0.0/16,api-int.morrisville.ocp.vslen,etcd-0.morrisville.ocp.vslen,etcd-1.morrisville.ocp.vslen,etcd-2.morrisville.ocp.vslen,localhost,lu0602v0,registry.redhat.io

For more details, please refer to Configuring the cluster-wide proxy (4.4) / (4.2).

NOTE: SLC Bridge service cannot be used via routes (Ingress Operator) as of now. Doing so will result in timeouts. This will be addressed in the future. For now, one must use either the NodePort or LoadBalancer service directly.

On vSphere, in order to access the slcbridgebase-service NodePort service, one needs either direct access to one of the SDI compute nodes or to modify the external load balancer to add an additional route to the service.

5.1.2. Install SLC Bridge

Please install SLC Bridge according to Making the SLC Bridge Base available on Kubernetes while paying attention to the notes on the installation parameters.

5.1.2.1. Using an external load balancer to access SLC Bridge's NodePort

NOTE: applicable only when "Service Type" was set to "NodePort".

Once the SLC Bridge is deployed, its NodePort shall be determined in order to point the load balancer at it.

# oc get svc -n "${SLCB_NAMESPACE:-sap-slcbridge}" slcbridgebase-service -o jsonpath=$'{.spec.ports[0].nodePort}\n'
31875

The load balancer shall point at all the compute nodes running SDI workloads. The following is an example configuration for the HAProxy software load balancer:

# # in the example, the <cluster_name> is "boston" and <base_domain> is "ocp.vslen"
# cat /etc/haproxy/haproxy.cfg
....
frontend        slcb
    bind        *:9000
    mode        tcp
    option      tcplog
    # # commented blocks are useful for multiple OCP clusters or multiple SLC Bridge services
    #tcp-request inspect-delay      5s
    #tcp-request content accept     if { req_ssl_hello_type 1 }

    use_backend  boston-slcb       #if { req_ssl_sni -m end -i boston.ocp.vslen  }
    #use_backend raleigh-slcb      #if { req_ssl_sni -m end -i raleigh.ocp.vslen }

backend         boston-slcb
    balance     source
    mode        tcp
    server      sdi-worker1        sdi-worker1.boston.ocp.vslen:31875   check
    server      sdi-worker2        sdi-worker2.boston.ocp.vslen:31875   check
    server      sdi-worker3        sdi-worker3.boston.ocp.vslen:31875   check

backend         raleigh-slcb
....

The SLC Bridge can then be accessed at the URL https://boston.ocp.vslen:9000/docs/index.html as long as boston.ocp.vslen resolves correctly to the load balancer's IP.

5.2. SDI Installation Parameters

Please follow SAP's guidelines on configuring the SDI while paying attention to the following additional comments:

Name Condition Recommendation
Kubernetes Namespace Always Must match the project name chosen in the Project Setup (e.g. sdi)
Installation Type Installation or Update Choose Advanced Installation if you want to choose a particular storage class, if there is no default storage class (4.4) / (4.2) set, or if you want to deploy multiple SDI instances on the same cluster.
Container Registry Installation Must be set to the container image registry.
Checkpoint Store Configuration Installation Recommended for production deployments.
Checkpoint Store Type If Checkpoint Store Configuration parameter is enabled. Set to s3 if using for example OCS's NooBaa service as the object storage. See Using NooBaa as object storage gateway for more details.
Checkpoint Store Validation Installation Please make sure to validate the connection at installation time; otherwise, if an incorrect value is supplied, the installation will fail at a later point.
Container Registry Settings for Pipeline Modeler Advanced Installation Shall be changed if the same registry is used for more than one SAP Data Intelligence instance. Either another <registry> or a different <prefix> or both will do.
StorageClass Configuration Advanced Installation Configure this if you want to choose different dynamic storage provisioners for different SDI components or if there's no default storage class (4.4) / (4.2) set or you want to choose non-default storage class for the SDI components.
Default StorageClass Advanced Installation and if storage classes are configured Set this if there's no default storage class (4.4) / (4.2) set or you want to choose non-default storage class for the SDI components.
Enable Kaniko Usage Advanced Installation Must be enabled on OCP 4.
Container Image Repository Settings for SAP Data Intelligence Modeler Advanced Installation or Upgrade If using the same registry for multiple SDI instances, choose "yes".
Container Registry for Pipeline Modeler Advanced Installation and if "Use different one" option is selected in the previous selection. If using the same registry for multiple SDI instances, it is required to use either different prefix (e.g. My_Image_Registry_FQDN/mymodelerprefix2) or a different registry.
Loading NFS Modules Advanced Installation Please choose "yes".
Additional Installer Parameters Advanced Installation (optional) Useful for reducing the minimum memory requirements of the HANA pod and much more.

5.3. Project setup

It is assumed that the sdi project has already been created as part of SDI Observer's prerequisites.

Login to OpenShift as a cluster-admin, and perform the following configurations for the installation:

# # change to the SDI_NAMESPACE project using: oc project "${SDI_NAMESPACE:-sdi}"
# oc adm policy add-scc-to-group anyuid "system:serviceaccounts:$(oc project -q)"
# oc adm policy add-scc-to-user privileged -z "$(oc project -q)-elasticsearch"
# oc adm policy add-scc-to-user privileged -z "$(oc project -q)-fluentd"
# oc adm policy add-scc-to-user privileged -z default
# oc adm policy add-scc-to-user privileged -z mlf-deployment-api
# oc adm policy add-scc-to-user privileged -z vora-vflow-server
# oc adm policy add-scc-to-user privileged -z "vora-vsystem-$(oc project -q)"
# oc adm policy add-scc-to-user privileged -z "vora-vsystem-$(oc project -q)-vrep"

5.4. Install SDI

Please follow the official procedure according to Install using SLC Bridge in a Kubernetes Cluster with Internet Access.

5.5. SDI Post installation steps

5.5.1. (Optional) Expose SDI services externally

There are multiple possibilities how to make SDI services accessible outside of the cluster. Compared to plain Kubernetes, OpenShift offers an additional method, which is recommended for most scenarios, including the SDI System Management service. It is based on the OpenShift Ingress Operator (4.4) / (4.2).

For SAP Vora Transaction Coordinator and SAP HANA Wire, please use the official suggested method available to your environment.

5.5.1.1. Using OpenShift Ingress Operator

OpenShift allows you to access the Data Intelligence services via Ingress Controllers (4.4) / (4.2) as opposed to regular NodePorts (4.4) / (4.2). For example, instead of accessing the vsystem service via https://worker-node.example.com:32322, after the service exposure, you will be able to access it at https://vsystem-sdi.apps.<cluster_name>.<base_domain>. This is an alternative to the official guide documentation to Expose the Service On Premise (3.0).

There are two kinds of routes secured with TLS. The reencrypt kind allows a custom signed or self-signed certificate to be used. The passthrough kind uses the pre-installed certificate generated by the installer or passed to the installer.

5.5.1.1.1. Export services with a reencrypt route

With this kind of route, different certificates are used on client and service sides of the route. The router stands in the middle and re-encrypts the communication coming from either side using a certificate corresponding to the opposite side. In this case, the client side is secured by a provided certificate and the service side is encrypted with the original certificate generated or passed to the SAP Data Intelligence installer.

The reencrypt route allows for securing the client connection with a proper signed certificate.

  1. Look up the vsystem service:

    # oc project "${SDI_NAMESPACE:-sdi}"            # switch to the Data Intelligence project
    # oc get services | grep "vsystem "
    vsystem   ClusterIP   172.30.227.186   <none>   8797/TCP   19h
    

    When exported, the resulting hostname will look like vsystem-${SDI_NAMESPACE}.apps.<cluster_name>.<base_domain>. However, an arbitrary hostname can be chosen instead as long as it resolves correctly to the IP of the router.

  2. Get, generate or use the default certificates for the route. In this example, the default self-signed certificate used by router is used to secure the connection between the client and OCP's router. The CA certificate for clients can be obtained from the router-ca secret located in the openshift-ingress-operator namespace:

    # oc get secret -n openshift-ingress-operator -o json router-ca | \
        jq -r '.data as $d | $d | keys[] | select(test("\\.crt$")) | $d[.] | @base64d' >router-ca.crt
    
  3. Obtain the SDI's root certificate authority bundle generated at the SDI's installation time. The generated bundle is available in the ca-bundle.pem secret in the sdi namespace.

    # oc get -n "${SDI_NAMESPACE:-sdi}" -o go-template='{{index .data "ca-bundle.pem"}}' \
        secret/ca-bundle.pem | base64 -d >sdi-service-ca-bundle.pem
    
  4. Create the reencrypt route for the vsystem service like this:

    # oc create route reencrypt -n "${SDI_NAMESPACE:-sdi}" \
        --dest-ca-cert=sdi-service-ca-bundle.pem --service vsystem
    # oc get route
    NAME      HOST/PORT                                                  SERVICES  PORT      TERMINATION  WILDCARD
    vsystem   vsystem-<SDI_NAMESPACE>.apps.<cluster_name>.<base_domain>  vsystem   vsystem   reencrypt    None
    
  5. Verify the connection:

    # # use the HOST/PORT value obtained from the previous command instead
    # curl --cacert router-ca.crt https://vsystem-<SDI_NAMESPACE>.apps.<cluster_name>.<base_domain>/
    
5.5.1.1.2. Export services with a passthrough route

With the passthrough route, the communication is encrypted by the SDI service's certificate all the way to the client.

NOTE: If possible, please prefer the reencrypt route, because with passthrough the hostname of the vsystem certificate cannot be verified by clients, as can be seen in the following output:

# oc get -n "${SDI_NAMESPACE:-sdi}" -o go-template='{{index .data "ca-bundle.pem"}}' \
    secret/ca-bundle.pem | base64 -d >sdi-service-ca-bundle.pem
# openssl x509 -noout -subject -in sdi-service-ca-bundle.pem
subject=C = DE, ST = BW, L = Walldorf, O = SAP, OU = Data Hub, CN = SAPDataHub
  1. Look up the vsystem service:

    # oc project "${SDI_NAMESPACE:-sdi}"            # switch to the Data Intelligence project
    # oc get services | grep "vsystem "
    vsystem   ClusterIP   172.30.227.186   <none>   8797/TCP   19h
    
  2. Create the route:

    # oc create route passthrough --service=vsystem
    # oc get route
    NAME      HOST/PORT                                                  PATH  SERVICES  PORT      TERMINATION  WILDCARD
    vsystem   vsystem-<SDI_NAMESPACE>.apps.<cluster_name>.<base_domain>        vsystem   vsystem   passthrough  None
    

    You can modify the hostname with the --hostname parameter. Make sure it resolves to the router's IP.

  3. Access the System Management service at https://vsystem-<SDI_NAMESPACE>.apps.<cluster_name>.<base_domain> to verify.
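
    From the command line, the verification could look like this (a sketch; because the SDI certificate's hostname does not match the route's hostname, certificate verification is relaxed with --insecure):

    # curl --insecure https://vsystem-<SDI_NAMESPACE>.apps.<cluster_name>.<base_domain>/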

5.5.1.2. Using NodePorts

NOTE: For OpenShift, exposure using routes is preferred, although it is only possible for the System Management service (aka vsystem).

Exposing SAP Data Intelligence vsystem

  • Either with an auto-generated node port:

    # oc expose service vsystem --type NodePort --name=vsystem-nodeport --generator=service/v2
    # oc get -o jsonpath=$'{.spec.ports[0].nodePort}\n' services vsystem-nodeport
    30617
    
  • Or with a specific node port (e.g. 32123):

    # oc expose service vsystem --type NodePort --name=vsystem-nodeport --generator=service/v2 --dry-run -o yaml | \
        oc patch -p '{"spec":{"ports":[{"port":8797, "nodePort": 32123}]}}' --local -f - -o yaml | oc create -f -
    

The original service remains accessible on the same ClusterIP:Port as before. Additionally, it is now accessible from outside of the cluster under the node port.
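
For illustration, the exposed service could then be reached via any reachable node address and the node port, e.g. (a sketch; the node address lookup below is an assumption - use whatever node address is resolvable and reachable from your network):

# nodePort="$(oc get -o jsonpath='{.spec.ports[0].nodePort}' services vsystem-nodeport)"
# nodeAddress="$(oc get nodes -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')"
# curl --insecure "https://${nodeAddress}:${nodePort}/"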

Exposing SAP Vora Transaction Coordinator and HANA Wire

# oc expose service vora-tx-coordinator-ext --type NodePort --name=vora-tx-coordinator-nodeport --generator=service/v2
# oc get -o jsonpath=$'tx-coordinator:\t{.spec.ports[0].nodePort}\nhana-wire:\t{.spec.ports[1].nodePort}\n' \
    services vora-tx-coordinator-nodeport
tx-coordinator: 32445
hana-wire:      32192

The output shows the generated node ports for the newly exposed services.

5.5.2. Configure the Connection to DI_DATA_LAKE

Please follow the official post-installation instructions at Configure the Connection to DI_DATA_LAKE.

In case the OCS' NooBaa is used as a backing object storage provider, please make sure to use the HTTP service endpoint as documented in Using NooBaa as object storage gateway.

Based on the example output in that section, the configuration may look like this:

Parameter Value
Connection Type SDL
Id DI_DATA_LAKE
Object Storage Type S3
Endpoint http://s3.openshift-storage.svc.cluster.local
Access Key ID cOxfi4hQhGFW54WFqP3R
Secret Access Key rIlvpcZXnonJvjn6aAhBOT/Yr+F7wdJNeLDBh231
Root Path sdi-data-lake-f86a7e6e-27fb-4656-98cf-298a572f74f3
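
If the bucket has been requested via an ObjectBucketClaim (as in the referenced NooBaa section), the values above can usually be read from the claim's generated configmap and secret. The following is a sketch assuming a claim named sdi-data-lake located in the SDI namespace; both the claim name and its namespace are hypothetical:

# claim=sdi-data-lake     # hypothetical ObjectBucketClaim name
# # Root Path (the generated bucket name)
# oc get configmap "$claim" -n "${SDI_NAMESPACE:-sdi}" -o go-template=$'{{.data.BUCKET_NAME}}\n'
# # Access Key ID and Secret Access Key
# oc get secret "$claim" -n "${SDI_NAMESPACE:-sdi}" \
    -o go-template=$'{{.data.AWS_ACCESS_KEY_ID | base64decode}}\n{{.data.AWS_SECRET_ACCESS_KEY | base64decode}}\n'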

5.5.3. SDI Validation

Validate SDI installation on OCP to make sure everything works as expected. Please follow the instructions in Testing Your Installation (3.0).

5.5.3.1. Log On to SAP Data Intelligence Launchpad

In case the vsystem service has been exposed using a route, the URL can be determined like this:

# oc get route -n "${SDI_NAMESPACE:-sdi}"
NAME      HOST/PORT                                                  SERVICES  PORT      TERMINATION  WILDCARD
vsystem   vsystem-<SDI_NAMESPACE>.apps.<cluster_name>.<base_domain>  vsystem   vsystem   reencrypt    None

The HOST/PORT value needs to be then prefixed with https://, for example:

https://vsystem-sdi.apps.boston.ocp.vslen
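
Alternatively, the URL can be printed with a one-liner like this:

# echo "https://$(oc get route vsystem -n "${SDI_NAMESPACE:-sdi}" -o jsonpath='{.spec.host}')"
https://vsystem-sdi.apps.boston.ocp.vslen
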
5.5.3.2. Check Your Machine Learning Setup

In order to upload training and test datasets using ML Data Manager, the user needs to be assigned the sap.dh.metadata policy. Please make sure to follow Using SAP Data Intelligence Policy Management to assign the policies to the users that need them.

5.5.4. Configuration of additional tenants

When a new tenant is created, for example using the Manage Clusters instructions, it is not configured to work with the container image registry. Therefore, the Pipeline Modeler is unusable and will fail to start until configured.

There are two steps that need to be performed for each new tenant:

  • import CA certificate for the registry via SDI Connection Manager if the CA certificate is self-signed
  • create and import credential secret using the SDI System Management and update the modeler secret if the container image registry requires authentication

If the SDI Registry deployed by the SDI Observer is used, please follow the SDI Observer Registry tenant configuration. Otherwise, please make sure to execute the official instructions in the following articles according to your registry configuration:

6. OpenShift Container Platform Upgrade

This section outlines the instructions for performing an OpenShift upgrade while SAP Data Intelligence is installed.

6.1. Pre-upgrade procedures

  1. Before upgrading the cluster to release 4.3 or newer, make sure to upgrade SDI at least to release 3.0 Patch 3 by following the SAP Data Hub Upgrade procedures - starting from pre-upgrade without performing the steps marked with (ocp-upgrade).
  2. Make yourself familiar with the OpenShift's upgrade guide (4.4) / (4.3).
  3. Plan for SDI downtime.
  4. Make sure to pre-load the needed kernel modules
  5. Make sure to increase the maximum number of PIDs per Container
  6. Pin vsystem-vrep to the current node

6.1.1. Stop SAP Data Intelligence

It is strongly recommended to stop the SDI before performing the upgrade.

The procedure is outlined in the official Administration Guide. However, please note the command as described there is erroneous as of mid-July 2020. Please execute it this way:

# oc -n "${OCP_NAMESPACE}" patch datahub default --type='json' -p '[{"op":"replace","path":"/spec/runLevel","value":"Stopped"}]'
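
To verify that SDI is being stopped, one can, for example, check the configured run level and watch the SDI pods terminate (a minimal sketch):

# oc -n "${OCP_NAMESPACE}" get datahub default -o jsonpath=$'{.spec.runLevel}\n'
Stopped
# watch oc get pods -n "${OCP_NAMESPACE}"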

6.2. Upgrade OCP

  1. Upgrade OCP 4.2 to OCP 4.3
  2. If having OpenShift Container Storage deployed, update OCS 4.2 to OCS 4.3
  3. Update OpenShift client tools on the Management host from 4.2 to 4.4. On RHEL 8.2, one can do it like this:

    # sudo subscription-manager repos --disable=rhocp-4.2-for-rhel-8-x86_64-rpms --enable=rhocp-4.4-for-rhel-8-x86_64-rpms
    # sudo dnf update -y openshift-clients
    
  4. Update SDI Observer to use the OCP 4.4 client tools, for example by re-deploying it while reusing the previous parameters

  5. Upgrade OCP 4.3 to OCP 4.4
  6. If having OpenShift Container Storage deployed, update OCS 4.3 to OCS 4.4

6.3. Post-upgrade procedures

  1. Start SAP Data Intelligence as outlined in the official Administration Guide. However, please note the command as described there is erroneous as of mid-July 2020. Please execute it this way:

    # oc -n "${OCP_NAMESPACE}" patch datahub default --type='json' -p '[{"op":"replace","path":"/spec/runLevel","value":"Started"}]'
    
  2. Unpin vsystem-vrep from the current node

7. SAP Data Hub Upgrade

This section will guide you through the upgrade of SAP Data Hub 2.7 to SAP Data Intelligence 3.0.

The following steps must be performed in the given order. Unless an OCP upgrade is needed, the steps marked with (ocp-upgrade) can be skipped.

7.1. Pre-upgrade procedures

  1. Make sure to get familiar with the official SAP Upgrade guide (3.0).
  2. (ocp-upgrade) Make yourself familiar with the OpenShift's upgrade guide (4.4) / (4.3).
  3. Plan for a downtime.
  4. Make sure to pre-load the needed kernel modules
  5. Make sure to increase the maximum number of PIDs per Container
  6. Pin vsystem-vrep to the current node

7.1.1. Execute SDI's Pre-Upgrade Procedures

Please follow the official Pre-Upgrade procedures.

If you exposed the vsystem service using routes, delete the route:

# oc get route vsystem -o yaml >route-vsystem.bak.yaml  # make a backup
# oc delete route vsystem

7.1.2. (ocp-upgrade) Upgrade OpenShift

It is recommended to upgrade to the latest asynchronous OpenShift release. Make sure to follow the official upgrade instructions (4.4) / (4.2).

7.1.3. Container image registry preparation

Unlike SAP Data Hub, SAP Data Intelligence requires a secured container image registry. Plain HTTP connection cannot be used anymore.

There are the following options to satisfy this requirement:

  • The registry used by SAP Data Hub is already accessible over HTTPS and its serving TLS certificates have been signed by a trusted certificate authority. In this case, the rest of this section can be skipped until SDI Pre-Upgrade procedures.
  • The registry used by SAP Data Hub is already accessible or will be made accessible over HTTPS, but its serving TLS certificate is not signed by a trusted certificate authority. In this case, one of the following must be performed unless already done:

    The rest of this section can then be skipped.

  • A new registry shall be used.

In the last case, please refer to Container Image Registry prerequisite for more details. Also note that the provisioning of the registry can be done by SDI Observer deployed in the next step.

NOTE: in order for the upgrade to succeed, the newly deployed registry must also contain all the images used by the current SAP Data Hub release. There are multiple ways to accomplish this; for example, on the Jump host, execute one of the following:

  • using the manual installation method of SAP Data Hub, one can invoke the install.sh script with the following arguments:

    • --prepare-images to cause the script to just mirror the images to the desired registry and terminate immediately afterwards
    • --registry HOST:PORT to point the script to the newly deployed registry
  • inspect the currently running containers in the SDH project and copy their images directly from the old local registry to the new one (without SAP registry being involved); it can be performed on the Jump host in bash; in the following example, jq, podman and skopeo binaries are assumed to be available:

    # export OLD_REGISTRY=My_Image_Registry_FQDN:5000
    # export NEW_REGISTRY=HOST:PORT
    # SDH_NAMESPACE=sdh
    # # login to the old registry using either docker or podman if it requires authentication
    # podman login --tls-verify=false -u username $OLD_REGISTRY
    # # login to the new registry using either docker or podman if it requires authentication
    # podman login --tls-verify=false -u username $NEW_REGISTRY
    # function mirrorImage() {
        local src="$1"
        local dst="$NEW_REGISTRY/${src#*/}"
        skopeo copy --src-tls-verify=false --dest-tls-verify=false "docker://$src" "docker://$dst"
    }
    # export -f mirrorImage
    # # get the list of source images to copy
    # images="$(oc get pods -n "${SDH_NAMESPACE:-sdh}" -o json | jq -r '.items[] | . as $ps |
        [$ps.spec.containers[] | .image] + [($ps.spec.initContainers // [])[] | .image] |
        .[]' | grep -F "$OLD_REGISTRY" | sort -u)"
    # # more portable way to copy the images (up to 5 in parallel) using GNU xargs
    # xargs -n 1 -r -P 5 -i /bin/bash -c 'mirrorImage {}' <<<"${images:-}"
    # # an alternative way using GNU Parallel
    # parallel -P 5 --lb mirrorImage <<<"${images:-}"
    

7.1.4. Deploy SDI Observer

If the current SDH Observer is deployed in a different namespace than SDH's namespace, it must be deleted manually. The easiest way is to delete its project unless it is shared with other workloads. If it shares the namespace with SDH, no action is needed - it will be deleted automatically.

Please follow the instructions in SDI Observer section to deploy it while paying attention to the following:

  • SDI Observer shall be located in a different namespace than SAP Data Hub and Data Intelligence.
  • SDI_NAMESPACE shall be set to the namespace where SDH is currently running

7.1.5. Prepare SDH/SDI Project

SAP Data Hub running in a particular project/namespace on OCP cluster will be substituted by SAP Data Intelligence in the same project/namespace. The existing project must be modified in order to host the latter.

Grant the needed security context constraints to the new service accounts by executing the commands from the project setup. NOTE: Re-running commands that have been run already will do no harm.

To be able to amend potential volume attachment problems, make sure to dump a mapping between the SDH pods and the nodes they run on:

# oc get pods -n "${SDH_NAMESPACE:-sdh}" -o wide >sdh-pods-pre-upgrade.out

(Optional) If object storage is available and provided by NooBaa, a new storage bucket can be created for the SDL Data Lake connection. Please follow the Creating an S3 bucket using CLI section. Note that the existing checkpoint store bucket used by SAP Data Hub will continue to be used by SAP Data Intelligence if configured.
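
For reference, such a bucket request could look roughly like the following ObjectBucketClaim (a sketch only; the claim name is hypothetical and openshift-storage.noobaa.io is the storage class usually created by OCS' NooBaa - verify the name in your cluster):

# cat <<EOF | oc create -n "${SDH_NAMESPACE:-sdh}" -f -
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: sdi-data-lake
spec:
  generateBucketName: sdi-data-lake
  storageClassName: openshift-storage.noobaa.io
EOF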

7.2. Upgrade SAP Data Hub to SAP Data Intelligence

Execute the SDH upgrade according to the official instructions.

Please be aware of the potential issue during the upgrade when using OCS 4 as the storage provider.

7.3. SAP Data Intelligence Post-Upgrade Procedures

  1. Execute the Post-Upgrade Procedures for the SDH (3.0).

  2. If you exposed the vsystem service using routes, re-create the route. If no backup is available, please follow Using OpenShift Router and routes.

    # oc create -f route-vsystem.bak.yaml
    
  3. Unpin vsystem-vrep from the current node

7.4. Validate SAP Data Intelligence

Validate SDI installation on OCP to make sure everything works as expected. Please follow the instructions in Testing Your Installation (3.0).

8. Appendix

8.1. SDI uninstallation

Please follow the SAP documentation Uninstalling SAP Data Intelligence using the SLC Bridge (3.0).

Additionally, make sure to delete the sdi project, e.g.:

# oc delete project sdi

Optionally, one can also delete SDI Observer's namespace, e.g.:

# oc delete project sdi-observer

NOTE: this will also delete the container image registry if it was deployed using SDI Observer, which means the mirroring needs to be performed again during a new installation. If SDI Observer (including the registry and its data) shall be preserved for the next installation, please make sure to re-deploy it once the sdi project is re-created.

When done, you may continue with a new installation round in the same or another namespace.

8.2. Configure OpenShift to trust container image registry

If the registry's certificate is signed by a self-signed certificate authority, one must make OpenShift aware of it.

If the registry runs on the OpenShift cluster itself and is exposed via a reencrypt or edge route with the default TLS settings (no custom TLS certificates set), the CA certificate used is available in the secret router-ca in openshift-ingress-operator namespace.

To make the registry exposed via such a route trusted, set the route's hostname in the registry variable and execute the following code in bash:

# registry="container-image-registry-<NAMESPACE>.apps.<cluster_name>.<base_domain>"
# caBundle="$(oc get -n openshift-ingress-operator -o json secret/router-ca | \
    jq -r '.data as $d | $d | keys[] | select(test("\\.(?:crt|pem)$")) | $d[.] | @base64d')"
# # determine the name of the CA configmap if it exists already
# cmName="$(oc get images.config.openshift.io/cluster -o json | \
    jq -r '.spec.additionalTrustedCA.name // "trusted-registry-cabundles"')"
# if oc get -n openshift-config "cm/$cmName" 2>/dev/null; then
    # configmap already exists -> just update it
    oc get -o json -n openshift-config "cm/$cmName" | \
        jq --arg registry "$registry" --arg caBundle "$caBundle" \
            '.data[$registry] = $caBundle' | \
        oc replace -f - --force
  else
      # creating the configmap for the first time
      oc create configmap -n openshift-config "$cmName" \
          --from-literal="$registry=$caBundle"
      oc patch images.config.openshift.io cluster --type=merge \
          -p '{"spec":{"additionalTrustedCA":{"name":"'"$cmName"'"}}}'
  fi

If using a registry running outside of OpenShift or not secured by the default ingress CA certificate, take a look at the official guideline at Configuring a ConfigMap for the Image Registry Operator (4.4) / (4.2).

To verify that the CA certificate has been deployed, execute the following and check whether the supplied registry name appears among the file names in the output:

# oc rsh -n openshift-image-registry "$(oc get pods -n openshift-image-registry -l docker-registry=default | \
        awk '/Running/ {print $1; exit}')" ls -1 /etc/pki/ca-trust/source/anchors
container-image-registry-sdi-observer.apps.boston.ocp.vslen
image-registry.openshift-image-registry.svc..5000
image-registry.openshift-image-registry.svc.cluster.local..5000

If this is not feasible, one can also mark the registry as insecure.

8.3. Configure insecure registry

As a less secure alternative to Configure OpenShift to trust container image registry, the registry may also be marked as insecure, which poses a potential security risk. Please follow Configuring image settings (4.4) / (4.2) and add the registry to the .spec.registrySources.insecureRegistries array. For example:

apiVersion: config.openshift.io/v1
kind: Image
metadata:
  annotations:
    release.openshift.io/create-only: "true"
  name: cluster
spec:
  registrySources:
    insecureRegistries:
    - My_Image_Registry_FQDN

NOTE: it may take tens of minutes until the nodes are reconfigured. You can use the following commands to monitor the progress:

  • watch oc get machineconfigpool
  • watch oc get nodes

8.4. Marking the vflow registry as insecure

NOTE: applicable before, during, or after SDI installation.
NOTE: if the registry uses HTTPS and is signed by a self-signed CA certificate, it is recommended to configure SDI Observer with INJECT_CABUNDLE=true instead.

If the modeler is configured to use a registry over plain HTTP, the registry must be marked as insecure. This can be done neither via the installer nor in the UI.

Without the registry marked as insecure, the kaniko builder cannot push the built images into the registry configured for the Pipeline Modeler (see the "Container Registry for Pipeline Modeler" input parameter in the official SAP Data Intelligence documentation (3.0)).

To mark the configured vflow registry as insecure, the SDI Observer needs to be deployed with the MARK_REGISTRY_INSECURE=true parameter. If it is already deployed, it can be re-configured to take care of insecure registries by executing the following command in SDI Observer's namespace:

# oc set env dc/sdi-observer MARK_REGISTRY_INSECURE=true

Once the observer is deployed, all the existing Pipeline Modeler pods will be patched. It may take tens of seconds until all the modified pods become available again.

For more information, take a look at SDI Helpers.

8.5. Running multiple SDI instances on a single OCP cluster

Two instances of SAP Data Intelligence running in parallel on a single OCP cluster have been validated. Running more instances is possible, but most probably needs an extra support statement from SAP.

Please consider the following before deploying more than one SDI instance to a cluster:

  • Each SAP Data Intelligence instance must run in its own namespace/project.
  • Each SAP Data Intelligence instance must use a different prefix or container image registry for the Pipeline Modeler. For example, the first instance can configure "Container Registry Settings for Pipeline Modeler" as My_Image_Registry_FQDN/sdi30blue and the second as My_Image_Registry_FQDN/sdi30green.
  • It is recommended to dedicate particular nodes to each SDI instance.
  • It is recommended to use network policy (4.4) / (4.2) SDN mode for completely granular network isolation configuration and improved security. Check network policy configuration (4.4) / (4.2) for further references and examples. This, however, cannot be changed post OCP installation.
  • If running the production and test (aka blue-green) SDI deployments on a single OCP cluster, mind also the following:
    • There is no way to test an upgrade of OCP cluster before an SDI upgrade.
    • The idle (non-productive) landscape should have the same network security as the live (productive) one.

To deploy a new SDI instance to OCP cluster, please repeat the steps from project setup starting from point 6 with a new project name and continue with SDI Installation.

8.6. Running SDI pods on particular nodes

Due to shortcomings in SDI's installer, the validation of the SDI installation fails if its daemonsets are not deployed to all the nodes in the cluster.
Therefore, the installation should be executed without a restriction on nodes. After the installation is done, the pods can be re-scheduled to the desired nodes like this:

  1. choose a label to apply to the SAP Data Intelligence project and the desired nodes (e.g. run-sdi-project=sdhblue)

  2. label the desired nodes (in this example worker1, worker2, worker3 and worker4)

    # for node in worker{1,2,3,4}; do oc label node/$node run-sdi-project=sdhblue; done
    
  3. set the project node selector of the sdhblue namespace to match the label

    # oc patch namespace sdhblue -p '{"metadata":{"annotations":{"openshift.io/node-selector":"run-sdi-project=sdhblue"}}}'
    
  4. evacuate the pods from all the other nodes by killing them (requires jq utility installed)

    # oc project sdhblue                    # switch to the SDI project
    # label="run-sdi-project=sdhblue"       # set the chosen label
    # nodeNames="$(oc get nodes -o json | jq -c '[.items[] |
        select(.metadata.labels["'"${label%=*}"'"] == "'"${label#*=}"'") | .metadata.name]')"
    # oc get pods -o json | jq -r '.items[] | . as $pod |
        select(('"$nodeNames"' | all(. != $pod.spec.nodeName))) | "pod/\(.metadata.name)"' | xargs -r oc delete
    

NOTE: Please make sure the Data Intelligence instance is not being used because killing its pods will cause a downtime.

The pods will be re-launched on the nodes labeled with run-sdi-project=sdhblue. It may take several minutes before the SDI becomes available again.
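
To double-check, one can print the project's node selector annotation and see on which nodes the pods now run, e.g.:

# oc get namespace sdhblue -o go-template='{{index .metadata.annotations "openshift.io/node-selector"}}{{"\n"}}'
run-sdi-project=sdhblue
# oc get pods -n sdhblue -o wide | awk 'NR > 1 {print $7}' | sort | uniq -c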

8.7. Installing remarshal utilities on RHEL

For a few example snippets throughout this guide, either yaml2json or json2yaml scripts are necessary.

They are provided by the remarshal project and shall be installed on the Management host in addition to jq. On RHEL 8.2, one can install it this way:

# sudo dnf install -y python3-pip
# sudo pip3 install remarshal
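
A quick way to check that the utilities work is to convert an arbitrary JSON document to YAML, for example:

# echo '{"kind": "List", "items": []}' | json2yaml
# oc get route vsystem -n "${SDI_NAMESPACE:-sdi}" -o json | json2yaml | head -n 6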

8.8. Pin vsystem-vrep to the current node

On OCP 4.2 with the openshift-storage.rbd.csi.ceph.com dynamic storage provisioner used for the SDI workload, please make sure to pin the vsystem-vrep pod to the node where it currently runs in order to prevent the issue A pod is stuck in ContainerCreating phase from happening during an upgrade:

# nodeName="$(oc get pods -n "${SDI_NAMESPACE:-sdi}" vsystem-vrep-0 -o jsonpath='{.spec.nodeName}')"
# oc patch statefulset/vsystem-vrep -n "${SDI_NAMESPACE:-sdi}" --type strategic --patch '{"spec": {"template": {"spec": {"nodeSelector": {"kubernetes.io/hostname": "'"${nodeName}"'"}}}}}'

To revert the change, please follow Unpin vsystem-vrep from the current node.

To be able to amend other potential volume attachment problems, make sure to dump a mapping between the SDH pods and the nodes they run on:

# oc get pods -n "${SDH_NAMESPACE:-sdh}" -o wide >sdh-pods-pre-upgrade.out

8.9. Unpin vsystem-vrep from the current node

On OCP 4.4, the vsystem-vrep pod no longer needs to be pinned to a particular node in order to prevent A pod is stuck in ContainerCreating phase from occurring.

One can then revert the node pinning with the following command. Note that jq binary is required.

# oc get statefulset/vsystem-vrep -n "${SDI_NAMESPACE:-sdi}" -o json | \
    jq 'del(.spec.template.spec.nodeSelector) | del(.spec.template.spec.affinity.nodeAffinity)' | oc replace -f -

9. Troubleshooting Tips

9.1. Installation or Upgrade problems

9.1.1. Privileged security context unassigned

If there are pods, replicasets, or statefulsets not coming up and you can see an event similar to the one below, you need to add the privileged security context constraint to their service account.

# oc get events | grep securityContext
1m          32m          23        diagnostics-elasticsearch-5b5465ffb.156926cccbf56887                          ReplicaSet                                                                            Warning   FailedCreate             replicaset-controller                  Error creating: pods "diagnostics-elasticsearch-5b5465ffb-" is forbidden: unable to validate against any security context constraint: [spec.initContainers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed spec.initContainers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed spec.initContainers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]

Copy the name in the fourth column (the event name - diagnostics-elasticsearch-5b5465ffb.156926cccbf56887) and determine its corresponding service account name.

# eventname="diagnostics-elasticsearch-5b5465ffb.156926cccbf56887"
# oc get -o go-template=$'{{with .spec.template.spec.serviceAccountName}}{{.}}{{else}}default{{end}}\n' \
    "$(oc get events "${eventname}" -o jsonpath=$'{.involvedObject.kind}/{.involvedObject.name}\n')"
sdi-elasticsearch

The obtained service account name (sdi-elasticsearch) now needs to be assigned privileged scc:

# oc adm policy add-scc-to-user privileged -z sdi-elasticsearch

The pod shall then come up on its own, provided this was the only problem.
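
To double-check that the service account is now covered by the privileged SCC, its fully qualified name can be looked up in the SCC's users list, for example:

# oc get scc privileged -o json | jq -r '.users[]' | \
    grep -F "system:serviceaccount:$(oc project -q):sdi-elasticsearch"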

9.1.2. No Default Storage Class set

If pods are failing because of PVCs not being bound, the problem may be that the default storage class has not been set and no storage class was specified to the installer.

# oc get pods
NAME                                                  READY     STATUS    RESTARTS   AGE
hana-0                                                0/1       Pending   0          45m
vora-consul-0                                         0/1       Pending   0          45m
vora-consul-1                                         0/1       Pending   0          45m
vora-consul-2                                         0/1       Pending   0          45m

# oc describe pvc data-hana-0
Name:          data-hana-0
Namespace:     sdi
StorageClass:
Status:        Pending
Volume:
Labels:        app=vora
               datahub.sap.com/app=hana
               vora-component=hana
Annotations:   <none>
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
Events:
  Type    Reason         Age                  From                         Message
  ----    ------         ----                 ----                         -------
  Normal  FailedBinding  47s (x126 over 30m)  persistentvolume-controller  no persistent volumes available for this claim and no storage class is set

To fix this, either make sure to set the Default StorageClass (4.4) / (4.2) or provide the storage class name to the installer.
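
For illustration, an existing storage class (here ocs-storagecluster-ceph-rbd is just an example name) can be marked as the default one with the standard annotation:

# oc patch storageclass ocs-storagecluster-ceph-rbd --type merge \
    -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'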

9.1.3. vsystem-app pods not coming up

If you have SELinux in enforcing mode you may see the pods launched by vsystem crash-looping because of the container named vsystem-iptables like this:

# oc get pods
NAME                                                          READY     STATUS             RESTARTS   AGE
auditlog-59b4757cb9-ccgwh                                     1/1       Running            0          40m
datahub-app-db-gzmtb-67cd6c56b8-9sm2v                         2/3       CrashLoopBackOff   11         34m
datahub-app-db-tlwkg-5b5b54955b-bb67k                         2/3       CrashLoopBackOff   10         30m
...
internal-comm-secret-gen-nd7d2                                0/1       Completed          0          36m
license-management-gjh4r-749f4bd745-wdtpr                     2/3       CrashLoopBackOff   11         35m
shared-k98sh-7b8f4bf547-2j5gr                                 2/3       CrashLoopBackOff   4          2m
...
vora-tx-lock-manager-7c57965d6c-rlhhn                         2/2       Running            3          40m
voraadapter-lsvhq-94cc5c564-57cx2                             2/3       CrashLoopBackOff   11         32m
voraadapter-qkzrx-7575dcf977-8x9bt                            2/3       CrashLoopBackOff   11         35m
vsystem-5898b475dc-s6dnt                                      2/2       Running            0          37m

When you inspect one of those pods, you can see an error message similar to the one below:

# oc logs voraadapter-lsvhq-94cc5c564-57cx2 -c vsystem-iptables
2018-12-06 11:45:16.463220|+0000|INFO |Execute: iptables -N VSYSTEM-AGENT-PREROUTING -t nat||vsystem|1|execRule|iptables.go(56)
2018-12-06 11:45:16.465087|+0000|INFO |Output: iptables: Chain already exists.||vsystem|1|execRule|iptables.go(62)
Error: exited with status: 1
Usage:
  vsystem iptables [flags]

Flags:
  -h, --help               help for iptables
      --no-wait            Exit immediately after applying the rules and don't wait for SIGTERM/SIGINT.
      --rule stringSlice   IPTables rule which should be applied. All rules must be specified as string and without the iptables command.

In the audit log on the node where the pod got scheduled, you should be able to find an AVC denial similar to the following. On RHCOS nodes, you may need to inspect the output of the dmesg command instead.

# grep 'denied.*iptab' /var/log/audit/audit.log
type=AVC msg=audit(1544115868.568:15632): avc:  denied  { module_request } for  pid=54200 comm="iptables" kmod="ipt_REDIRECT" scontext=system_u:system_r:container_t:s0:c826,c909 tcontext=system_u:system_r:kernel_t:s0 tclass=system permissive=0
...
# # on RHCOS
# dmesg | grep denied

To fix this, the ipt_REDIRECT kernel module needs to be loaded. Please refer to Pre-load needed kernel modules.
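
Whether the module is already loaded on the affected node can be checked, for example, like this (a sketch; the pod name is taken from the example output above):

# node="$(oc get pod voraadapter-lsvhq-94cc5c564-57cx2 -o jsonpath='{.spec.nodeName}')"
# oc debug "node/$node" -- chroot /host lsmod | grep -i redirect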

9.1.4. License Manager cannot be initialized

The installation may fail with the following error.

2019-07-22T15:07:29+0000 [INFO] Initializing system tenant...
2019-07-22T15:07:29+0000 [INFO] Initializing License Manager in system tenant...2019-07-22T15:07:29+0000 [ERROR] Couldn't start License Manager!
The response: {"status":500,"code":{"component":"router","value":8},"message":"Internal Server Error: see logs for more info"}Error: http status code 500 Internal Server Error (500)
2019-07-22T15:07:29+0000 [ERROR] Failed to initialize vSystem, will retry in 30 sec...

In the log of license management pod, you can find an error like this:

# oc logs deploy/license-management-l4rvh
Found 2 pods, using pod/license-management-l4rvh-74595f8c9b-flgz9
+ iptables -D PREROUTING -t nat -j VSYSTEM-AGENT-PREROUTING
+ true
+ iptables -F VSYSTEM-AGENT-PREROUTING -t nat
+ true
+ iptables -X VSYSTEM-AGENT-PREROUTING -t nat
+ true
+ iptables -N VSYSTEM-AGENT-PREROUTING -t nat
iptables v1.6.2: can't initialize iptables table `nat': Permission denied
Perhaps iptables or your kernel needs to be upgraded.

This means the vsystem-iptables container in the pod lacks permissions to manipulate iptables and needs to be marked as privileged. Please follow Deploy SDI Observer and restart the installation.

9.1.5. Diagnostics Prometheus Node Exporter pods not starting

During an installation or upgrade, it may happen that the Node Exporter pods keep restarting:

# oc get pods  | grep node-exporter
diagnostics-prometheus-node-exporter-5rkm8                        0/1       CrashLoopBackOff   6          8m
diagnostics-prometheus-node-exporter-hsww5                        0/1       CrashLoopBackOff   6          8m
diagnostics-prometheus-node-exporter-jxxpn                        0/1       CrashLoopBackOff   6          8m
diagnostics-prometheus-node-exporter-rbw82                        0/1       CrashLoopBackOff   7          8m
diagnostics-prometheus-node-exporter-s2jsz                        0/1       CrashLoopBackOff   6          8m

A possible reason is that the resource consumption limits set on the pods are too low. To address this post-installation, you can patch the daemonset like this (in the SDI namespace):

# oc patch -p '{"spec": {"template": {"spec": {"containers": [
    { "name": "diagnostics-prometheus-node-exporter",
      "resources": {"limits": {"cpu": "200m", "memory": "100M"}}
    }]}}}}' ds/diagnostics-prometheus-node-exporter

To address this during the installation (using any installation method), add the following parameters:

-e=vora-diagnostics.resources.prometheusNodeExporter.resources.limits.cpu=200m
-e=vora-diagnostics.resources.prometheusNodeExporter.resources.limits.memory=100M

9.1.6. Pipeline Modeler graph builds fail due to image pull errors

If the graph builds hang in the Pending state or fail completely, you may find the following pod not coming up in the sdi namespace because its image cannot be pulled from the registry:

# oc get pods | grep vflow
datahub.post-actions.validations.validate-vflow-9s25l             0/1     Completed          0          14h
vflow-bus-fb1d00052cc845c1a9af3e02c0bc9f5d-5zpb2                  0/1     ImagePullBackOff   0          21s
vflow-graph-9958667ba5554dceb67e9ec3aa6a1bbb-com-sap-demo-dljzk   1/1     Running            0          94m
# oc describe pod/vflow-bus-fb1d00052cc845c1a9af3e02c0bc9f5d-5zpb2 | sed -n '/^Events:/,$p'
Events:
  Type     Reason     Age                From                    Message
  ----     ------     ----               ----                    -------
  Normal   Scheduled  30s                default-scheduler       Successfully assigned sdi/vflow-bus-fb1d00052cc845c1a9af3e02c0bc9f5d-5zpb2 to sdi-moworker3
  Normal   BackOff    20s (x2 over 21s)  kubelet, sdi-moworker3  Back-off pulling image "container-image-registry-sdi-observer.apps.morrisville.ocp.vslen/sdi3modeler-blue/vora/vflow-node-f87b598586d430f955b09991fc1173f716be17b9:3.0.23-com.sap.sles.base-20200617-174600"
  Warning  Failed     20s (x2 over 21s)  kubelet, sdi-moworker3  Error: ImagePullBackOff
  Normal   Pulling    6s (x2 over 21s)   kubelet, sdi-moworker3  Pulling image "container-image-registry-sdi-observer.apps.morrisville.ocp.vslen/sdi3modeler-blue/vora/vflow-node-f87b598586d430f955b09991fc1173f716be17b9:3.0.23-com.sap.sles.base-20200617-174600"
  Warning  Failed     6s (x2 over 21s)   kubelet, sdi-moworker3  Failed to pull image "container-image-registry-sdi-observer.apps.morrisville.ocp.vslen/sdi3modeler-blue/vora/vflow-node-f87b598586d430f955b09991fc1173f716be17b9:3.0.23-com.sap.sles.base-20200617-174600": rpc error: code = Unknown desc = Error reading manifest 3.0.23-com.sap.sles.base-20200617-174600 in container-image-registry-sdi-observer.apps.morrisville.ocp.vslen/sdi3modeler-blue/vora/vflow-node-f87b598586d430f955b09991fc1173f716be17b9: unauthorized: authentication required
  Warning  Failed     6s (x2 over 21s)   kubelet, sdi-moworker3  Error: ErrImagePull

To amend this, one needs to link the secret for the modeler's registry to a corresponding service account associated with the failed pod. In this case, the default one.

# oc get -n "${SDI_NAMESPACE:-sdi}" -o jsonpath=$'{.spec.serviceAccountName}\n' \
    pod/vflow-bus-fb1d00052cc845c1a9af3e02c0bc9f5d-5zpb2
default
# oc create secret -n "${SDI_NAMESPACE:-sdi}" docker-registry sdi-registry-pull-secret \
    --docker-server=container-image-registry-sdi-observer.apps.morrisville.ocp.vslen \
    --docker-username=user-n5137x --docker-password=ec8srNF5Pf1vXlPTRLagEjRRr4Vo3nIW
# oc secrets link -n "${SDI_NAMESPACE:-sdi}" --for=pull default sdi-registry-pull-secret
# oc delete -n "${SDI_NAMESPACE:-sdi}" pod/vflow-bus-fb1d00052cc845c1a9af3e02c0bc9f5d-5zpb2

Also, please make sure to restart the Pipeline Modeler and the failing graph builds in the affected tenant.

9.1.7. A pod is stuck in ContainerCreating phase

NOTE: Applies to OCP 4.2 in combination with block storage persistent volumes.

The issue can be reproduced when using a ReadWriteOnce persistent volume provisioned by a block device dynamic provisioner like openshift-storage.rbd.csi.ceph.com with a corresponding storage class ocs-storagecluster-ceph-rbd.

# oc get pods | grep ContainerCreating
vsystem-vrep-0                                                    0/2     ContainerCreating   0          10m20s
# oc describe pod vsystem-vrep-0 | sed -n '/^Events/,$p'
Events:
  Type     Reason                  Age                  From                     Message
  ----     ------                  ----                 ----                     -------
  Normal   Scheduled               114m                 default-scheduler        Successfully assigned sdhup/vsystem-vrep-0 to sdi-moworker1
  Normal   SuccessfulAttachVolume  114m                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-fafdd37a-b654-11ea-b795-001c14db4273"
  Normal   SuccessfulAttachVolume  114m                 attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-f61bd233-b654-11ea-b795-001c14db4273"
  Warning  FailedMount             17m (x39 over 113m)  kubelet, sdi-moworker1   MountVolume.MountDevice failed for volume "pvc-f61bd233-b654-11ea-b795-001c14db4273" : rpc error: code = Internal desc = rbd image ocs-storagecluster-cephblockpool/csi-vol-f6380abf-b654-11ea-8cb4-0a580a83020b is still being used
  Warning  FailedMount             64s (x50 over 111m)  kubelet, sdi-moworker1   Unable to mount volumes for pod "vsystem-vrep-0_sdhup(fddd32f3-b7c4-11ea-b795-001c14db4273)": timeout expired waiting for volumes to attach or mount for pod "sdhup"/"vsystem-vrep-0". list of unmounted volumes=[layers-volume]. list of unattached volumes=[layers-volume exports app-parameters uaa-tls-cert hana-tls-cert vrep-cert-tls vsystem-root-ca-path vora-vsystem-sdhup-vrep-token-wrmxk]

The issue can happen for example during an upgrade from SAP Data Hub. In that case, the upgrade starts to hang at the following step:

# ./slcb execute --url https://boston.ocp.vslen:9000 --useStackXML ~/MP_Stack_1000954710_20200519_.xml
...
time="2020-06-30T06:51:40Z" level=warning msg="Waiting for certificates to be renewed..."
time="2020-06-30T06:51:50Z" level=warning msg="Waiting for certificates to be renewed..."
time="2020-06-30T06:52:00Z" level=info msg="Switching Datahub to runlevel: Started"

For the reference, the corresponding persistent volume can look like this:

# oc get pv | grep f61bd233-b654-11ea-b795-001c14db4273
pvc-f61bd233-b654-11ea-b795-001c14db4273    10Gi       RWO            Delete           Bound    sdhup/layers-volume-vsystem-vrep-0                ocs-storagecluster-ceph-rbd            45h

The solution to the problem is to schedule the vsystem-vrep pod on a particular node.

9.1.7.1. Schedule vsystem-vrep pod on particular node

Make sure to run the pod on the same node as it used to run before being re-scheduled:

  1. Identify previous compute node name depending on whether the pod is running or not.

    • If the vsystem-vrep pod is running currently, please record the node (sdi-moworker3) it is running on now like this:

      # oc get pods -n "${SDI_NAMESPACE:-sdi}" -o wide -l vora-component=vsystem-vrep
      NAME             READY   STATUS    RESTARTS   AGE    IP            NODE            NOMINATED NODE   READINESS GATES
      vsystem-vrep-0   2/2     Running   0          3d1h   10.128.0.31   sdi-moworker3   <none>           <none>
      
    • In case the pod is no longer running, inspect the sdh-pods-pre-upgrade.out created as suggested in the Prepare SDH/SDI Project step and extract the name of the node for the pod in question. In our case, the vsystem-vrep-0 pod used to run on sdi-moworker3.

  2. (if not running) Scale its corresponding deployment (in our case statefulset/vsystem-vrep) down to zero replicas:

    # oc scale -n "${SDI_NAMESPACE:-sdi}" --replicas=0 statefulset/vsystem-vrep
    
  3. Pin vsystem-vrep to the current node with the following command while changing the nodeName.

    # nodeName=sdi-moworker3    # change the name
    # oc patch statefulset/vsystem-vrep -n "${SDI_NAMESPACE:-sdi}" --type strategic --patch '{"spec": {"template": {"spec": {"nodeSelector": {"kubernetes.io/hostname": "'"${nodeName}"'"}}}}}'
    
  4. (if not running) Scale the deployment back to 1:

    # oc scale -n "${SDI_NAMESPACE:-sdi}" --replicas=1 statefulset/vsystem-vrep
    

Verify the pod is scheduled to the given node and becomes ready. If the upgrade process is in progress, it should continue in a while.

# oc get pods -n "${SDI_NAMESPACE:-sdi}" -o wide | grep vsystem-vrep-0
vsystem-vrep-0                                                    2/2     Running     0          5m48s   10.128.4.239   sdi-moworker3   <none>           <none>

9.1.8. Container fails with "Permission denied"

If pods fail with an error similar to the one below, the containers most probably are not allowed to run under the desired UID.

# oc get pods
NAME                                READY   STATUS             RESTARTS   AGE
datahub.checks.checkpoint-m82tj     0/1     Completed          0          12m
vora-textanalysis-6c9789756-pdxzd   0/1     CrashLoopBackOff   6          9m18s
# oc logs vora-textanalysis-6c9789756-pdxzd
Traceback (most recent call last):
  File "/dqp/scripts/start_service.py", line 413, in <module>
    sys.exit(Main().run())
  File "/dqp/scripts/start_service.py", line 238, in run
    **global_run_args)
  File "/dqp/python/dqp_services/services/textanalysis.py", line 20, in run
    trace_dir = utils.get_trace_dir(global_trace_dir, self.config)
  File "/dqp/python/dqp_utils.py", line 90, in get_trace_dir
    return get_dir(global_trace_dir, conf.trace_dir)
  File "/dqp/python/dqp_utils.py", line 85, in get_dir
    makedirs(config_value)
  File "/usr/lib64/python2.7/os.py", line 157, in makedirs
    mkdir(name, mode)
OSError: [Errno 13] Permission denied: 'textanalysis'

To remedy that, be sure to apply all the oc adm policy add-scc-to-* commands from the project setup section. The one that has not been applied in this case is:

# oc adm policy add-scc-to-group anyuid "system:serviceaccounts:$(oc project -q)"

9.1.9. Jobs failing during installation or upgrade

If the installation jobs are failing with the following error, either the anyuid security context constraint has not been applied or the cluster is too old.

# oc logs solution-reconcile-vsolution-vsystem-ui-3.0.9-vnnbf
Error: mkdir /.vsystem: permission denied.
2020-03-05T15:51:18+0000 [WARN] Could not login to vSystem!
2020-03-05T15:51:23+0000 [INFO] Retrying...
Error: mkdir /.vsystem: permission denied.
2020-03-05T15:51:23+0000 [WARN] Could not login to vSystem!
2020-03-05T15:51:28+0000 [INFO] Retrying...
Error: mkdir /.vsystem: permission denied.
...
2020-03-05T15:52:13+0000 [ERROR] Timeout while waiting to login to vSystem...

The reason is that the vctl binary in the containers determines the HOME directory for its user from /etc/passwd. On older OCP clusters (<4.2.32), or when the container is not run with the desired UID, the value is set incorrectly to /. The binary then lacks permissions to write to the root directory.

To remedy that, please make sure:

  1. you are running OCP cluster 4.2.32 or newer
  2. anyuid scc has been applied to the SDI namespace

    To verify, make sure the SDI namespace appears in the group system:serviceaccounts:<SDI_NAMESPACE> in the output of the following command:

    # oc get -o json scc/anyuid | jq -r '.groups[]'
    system:cluster-admins
    system:serviceaccounts:sdi
    

    When the jobs are rerun, the anyuid scc will be assigned to them:

    # oc get pods -n "${SDI_NAMESPACE:-sdi}" -o json | jq -r '.items[] | select((.metadata.ownerReferences // []) | any(.kind == "Job")) |
        "\(.metadata.name)\t\(.metadata.annotations["openshift.io/scc"])"' | column -t
    datahub.voracluster-start-1d3ffe-287c16-d7h7t                    anyuid
    datahub.voracluster-start-b3312c-287c16-j6g7p                    anyuid
    datahub.voracluster-stop-5a6771-6d14f3-nnzkf                     anyuid
    ...
    strategy-reconcile-strat-system-3.0.34-3.0.34-pzn79              anyuid
    tenant-reconcile-default-3.0.34-wjlfs                            anyuid
    tenant-reconcile-system-3.0.34-gf7r4                             anyuid
    vora-config-init-qw9vc                                           anyuid
    vora-dlog-admin-f6rfg                                            anyuid
    
  3. additionally, please make sure that all the other oc adm policy add-scc-to-* commands listed in the project setup have been applied to the same $SDI_NAMESPACE.

9.1.10. vsystem-vrep cannot export NFS on RHCOS

If the vsystem-vrep-0 pod fails with the following error, it means it is unable to start an NFS server on top of overlayfs.

# oc logs -n ocpsdi1 vsystem-vrep-0 vsystem-vrep
2020-07-13 15:46:05.054171|+0000|INFO |Starting vSystem version 2002.1.15-0528, buildtime 2020-05-28T18:5856, gitcommit ||vsystem|1|main|server.go(107)
2020-07-13 15:46:05.054239|+0000|INFO |Starting Kernel NFS Server||vrep|1|Start|server.go(83)
2020-07-13 15:46:05.108868|+0000|INFO |Serving liveness probe at ":8739"||vsystem|9|func2|server.go(149)
2020-07-13 15:46:10.303625|+0000|WARN |no backup or restore credentials mounted, not doing backup and restore||vsystem|1|NewRcloneBackupRestore|backup_restore.go(76)
2020-07-13 15:46:10.311488|+0000|INFO |vRep components are initialised successfully||vsystem|1|main|server.go(249)
2020-07-13 15:46:10.311617|+0000|ERROR|cannot parse duration from "SOLUTION_LAYER_CLEANUP_DELAY" env variable: time: invalid duration ||vsystem|16|CleanUpSolutionLayersJob|manager.go(351)
2020-07-13 15:46:10.311719|+0000|INFO |Background task for cleaning up solution layers will be triggered every 12h0m0s||vsystem|16|CleanUpSolutionLayersJob|manager.go(358)
2020-07-13 15:46:10.312402|+0000|INFO |Recreating volume mounts||vsystem|1|RemountVolumes|volume_service.go(339)
2020-07-13 15:46:10.319334|+0000|ERROR|error re-loading NFS exports: exit status 1
exportfs: /exports does not support NFS export||vrep|1|AddExportsEntry|server.go(162)
2020-07-13 15:46:10.319991|+0000|FATAL|Error creating runtime volume: error exporting directory for runtime data via NFS: export error||vsystem|1|Fail|termination.go(22)

There are two solutions to the problem. Both of them result in an additional volume being mounted at /exports, which is the root directory of all exports.

  • (recommended) deploy SDI Observer, which will request an additional persistent volume of size 500Mi for the vsystem-vrep-0 pod, and make sure it is running
  • add -e=vsystem.vRep.exportsMask=true to the Additional Installer Parameters, which will mount an emptyDir volume at /exports in the same pod

    • on particular versions of OCP this may fail nevertheless

9.2. SDI Observer troubleshooting

9.2.1. Build is failing due to a repository outage

If the build of SDI Observer or SDI Registry is failing with an error similar to the one below, the chosen Fedora repository mirror is probably temporarily down:

# oc logs -n "${NAMESPACE:-sdi-observer}" -f bc/sdi-observer
Extra Packages for Enterprise Linux Modular 8 - 448  B/s |  16 kB     00:36
Failed to download metadata for repo 'epel-modular'
Error: Failed to download metadata for repo 'epel-modular'
subprocess exited with status 1
subprocess exited with status 1
error: build error: error building at STEP "RUN dnf install -y   https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm &&   dnf install -y parallel procps-ng bc git httpd-tools && dnf clean all -y": exit status 1

Please try to start the build again after a minute or two like this:

# oc start-build -n "${NAMESPACE:-sdi-observer}" -F bc/sdi-observer

9.2.2. Build is failing due to proxy issues

If you see the following build error in a cluster where HTTP(S) proxy is used, make sure to update the proxy configuration.

# oc logs -n "${NAMESPACE:-sdi-observer}" -f bc/sdi-observer
Caching blobs under "/var/cache/blobs".

Pulling image registry.redhat.io/ubi8/ubi@sha256:cd014e94a9a2af4946fc1697be604feb97313a3ceb5b4d821253fcdb6b6159ee ...
Warning: Pull failed, retrying in 5s ...
Warning: Pull failed, retrying in 5s ...
Warning: Pull failed, retrying in 5s ...
error: build error: failed to pull image: After retrying 2 times, Pull image still failed due to error: while pulling "docker://registry.redhat.io/ubi8/ubi@sha256:cd014e94a9a2af4946fc1697be604feb97313a3ceb5b4d821253fcdb6b6159ee" as "registry.redhat.io/ubi8/ubi@sha256:cd014e94a9a2af4946fc1697be604feb97313a3ceb5b4d821253fcdb6b6159ee": Error initializing source docker://registry.redhat.io/ubi8/ubi@sha256:cd014e94a9a2af4946fc1697be604feb97313a3ceb5b4d821253fcdb6b6159ee: can't talk to a V1 docker registry

The registry.redhat.io registry either needs to be whitelisted in the HTTP proxy server, or it must be added to the NO_PROXY settings as in the following bash snippet. When executed, the snippet adds the registry to NO_PROXY only if it is not there yet.

# addreg="registry.redhat.io"
# oc get proxies.config.openshift.io/cluster -o json | \
    jq '.spec.noProxy |= (. | [split("\\s*,\\s*";"")[] | select((. | length) > 0)] | . as $npa |
        "'"$addreg"'" as $r | if [$npa[] | . == $r] | any then $npa else $npa + [$r] end | join(","))' \
    | oc replace -f -

Wait until the machine config pools are updated and then restart the build:

# oc get machineconfigpool
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED
master   rendered-master-204c0009fca2b46a9d754371404ad169   True      False      False
worker   rendered-worker-d3738db56394537bb525ab5cf008dc4f   True      False      False

For more information, please refer to Docker pull fails to GET https://registry.redhat.io/ content.