SAP Data Intelligence 3 on OpenShift Container Platform 4
In general, the installation of SAP Data Intelligence (SDI) follows these steps:
- Install Red Hat OpenShift Container Platform
- Configure the prerequisites for SAP Data Intelligence Foundation
- Install SDI Observer
- Install SAP Data Intelligence Foundation on OpenShift Container Platform
If you are interested in the installation of SAP Data Hub or SAP Vora, please refer to one of the other installation guides:
- SAP Data Hub 2 on OpenShift Container Platform 4
- SAP Data Hub 2 on OpenShift Container Platform 3
- Install SAP Data Hub 1.X Distributed Runtime on OpenShift Container Platform
- Installing SAP Vora 2.1 on Red Hat OpenShift 3.7
1. OpenShift Container Platform validation version matrix
The following version combinations of SDI 3.X, OCP, RHEL and RHCOS have been validated for production environments:
SAP Data Intelligence | OpenShift Container Platform | Operating System | Infrastructure and (Storage) | Confirmed&Supported by SAP |
---|---|---|---|---|
3.0 | 4.2 † | RHCOS (nodes), RHEL 8.1+ or Fedora (Management host) | VMware vSphere (OCS 4.2) | supported † |
3.0 Patch 3 | 4.2 †, 4.4 | RHCOS (nodes), RHEL 8.2+ or Fedora (Management host) | VMware vSphere (OCS 4) | supported |
3.0 Patch 4 | 4.4 | RHCOS (nodes), RHEL 8.2+ or Fedora (Management host) | VMware vSphere (OCS 4), (NetApp Trident 20.04) | supported |
3.0 Patch 8 | 4.6 | RHCOS (nodes), RHEL 8.2+ or Fedora (Management host) | KVM/libvirt (OCS 4) | supported |
3.1 | 4.4 | RHCOS (nodes), RHEL 8.3+ or Fedora (Management host) | VMware vSphere (OCS 4) | not supported¹ |
3.1 | 4.6 | RHCOS (nodes), RHEL 8.3+ or Fedora (Management host) | VMware vSphere (OCS 4), Bare metal ∗ (OCS 4) | supported ¡ |
3.1 | 4.6 | RHCOS (nodes), RHEL 8.3+ or Fedora (Management host) | VMware vSphere (NetApp Trident 20.10 + StorageGRID) | supported |
† The referenced OCP release is no longer supported by Red Hat!
¹ 3.1 on OCP 4.4 is supported by SAP only for the purpose of upgrade to OCP 4.6
∗ Validated on two different hardware configurations:
- (Dev/PoC level) Lenovo setup of 4 bare metal hosts composed of:
  - 3 schedulable control plane nodes running both OCS and SDI (Lenovo ThinkSystem SR530)
  - 1 compute node running SDI (Lenovo ThinkSystem SR530)
  Note that this particular setup cannot be fully supported by Red Hat because running OCS in compact mode is still a Technology Preview as of 4.6.
- (Production level) Dell Technologies bare metal cluster composed of:
  - 1 CSAH node (Dell EMC PowerEdge R640s)
  - 3 control plane nodes (Dell EMC PowerEdge R640s)
  - 3 dedicated OCS nodes (Dell EMC PowerEdge R640s)
  - 3 dedicated SDI nodes (Dell EMC PowerEdge R740xd)
  CSI-supported external Dell EMC storage options and cluster sizing options are available.
  CSAH stands for Cluster System Admin Host, an equivalent of the Management host.
Please refer to the compatibility matrix for version combinations that are considered working.
SAP Note #2871970 lists more details.
2. Requirements
2.1. Hardware/VM and OS Requirements
2.1.1. OpenShift Cluster
Make sure to consult the following official cluster requirements:
- of SAP Data Intelligence in SAP's documentation:
- of OpenShift 4 (Minimum resource requirements (4.6) / (4.4))
- additionally, if deploying OpenShift Container Storage (aka OCS), please consult also OCS Supported configurations (4.6) / (4.4)
- if deploying on VMware vSphere, please consider also VMware vSphere infrastructure requirements (4.6) / (4.4)
- if deploying NetApp Trident, please consult also NetApp Hardware/VM and OS Requirements
There are 4 kinds of nodes:
- Bootstrap Node - A temporary bootstrap node needed for the OCP deployment. The node can be either destroyed by the installer (using infrastructure-provisioned-installation -- aka IPI) or can be deleted manually by the administrator. Alternatively, it can be re-used as a worker node. Please refer to the Installation process (4.6) / (4.4) for more information.
- Master Nodes (4.6) / (4.4) - The control plane manages the OpenShift Container Platform cluster. The control plane can be made schedulable to enable SDI workload there as well.
- Compute Nodes (4.6) / (4.4) - Run the actual workload (e.g. SDI pods). They are optional on a three-node cluster (where the master nodes are schedulable).
- OCS Nodes (4.6) / (4.4) - Run OpenShift Container Storage (aka OCS) -- currently supported only on AWS and VMware vSphere. The nodes can be divided into starting (running both OSDs and monitors) and additional nodes (running only OSDs). Needed only when OCS shall be used as the backing storage provider.
- NOTE: Running in a compact mode (on control plane) remains a Technology Preview as of OCS 4.6.
- Management host (aka administrator's workstation or Jump host) - The Management host is used, among other things, for:
  - accessing the OCP cluster via a configured command line client (oc or kubectl)
  - configuring the OCP cluster
  - running the Software Lifecycle Container Bridge (SLC Bridge)
The hardware/software requirements for the Management host are:
- OS: Red Hat Enterprise Linux 8.1+, RHEL 7.6+ or Fedora 30+
- Disk space: 20 GiB for /
2.1.1.1. Minimum Hardware Requirements
The table below lists the minimum requirements and the minimum number of instances for each node type for the latest validated SDI and OCP 4.X releases. This is sufficient for a PoC (Proof of Concept) environment.
Type | Count | Operating System | vCPU ⑃ | RAM (GB) | Storage (GB) | AWS Instance Type |
---|---|---|---|---|---|---|
Bootstrap | 1 | RHCOS | 4 | 16 | 120 | m4.xlarge |
Master | 3 | RHCOS | 4 | 16 | 120 | m4.xlarge |
Compute | 3+ | RHEL 7.8 or 7.9 or RHCOS | 8 | 32 | 120 | m4.2xlarge |
On a three-node cluster, it would look like this:
Type | Count | Operating System | vCPU ⑃ | RAM (GB) | Storage (GB) | AWS Instance Type |
---|---|---|---|---|---|---|
Bootstrap | 1 | RHCOS | 4 | 16 | 120 | m4.xlarge |
Master/Compute | 3 | RHCOS | 10 | 40 | 120 | m4.xlarge |
If using OCS 4.6, at least 3 additional (starting) nodes are recommended. Alternatively, the Compute nodes outlined above can also run OCS pods ⑂. In that case, the hardware specifications need to be extended accordingly. The following table lists the minimum requirements for each additional node:
Type | Count | Operating System | vCPU ⑃ | RAM (GB) | Storage (GB) | AWS Instance Type |
---|---|---|---|---|---|---|
OCS starting (OSD+MON) | 3 | RHCOS | 10 | 24 | 120 + 2048 ♢ | m5.4xlarge |
2.1.1.2. Minimum Production Hardware Requirements
The minimum requirements for production systems for the latest validated SDI and OCP 4 releases are the following:
Type | Count | Operating System | vCPU ⑃ | RAM (GB) | Storage (GB) | AWS Instance Type |
---|---|---|---|---|---|---|
Bootstrap | 1 | RHCOS | 4 | 16 | 120 | m4.xlarge |
Master | 3+ | RHCOS | 8 | 16 | 120 | c5.xlarge |
Compute | 3+ | RHEL 7.8 or 7.9 or RHCOS | 16 | 64 | 120 | m4.4xlarge |
On a three-node cluster, it would look like this:
Type | Count | Operating System | vCPU ⑃ | RAM (GB) | Storage (GB) | AWS Instance Type |
---|---|---|---|---|---|---|
Bootstrap | 1 | RHCOS | 4 | 16 | 120 | m4.xlarge |
Master/Compute | 3 | RHCOS | 22 | 72 | 120 | c5.9xlarge |
If using OCS 4, at least 3 additional (starting) nodes are recommended. Alternatively, the Compute nodes outlined above can also run OCS pods ⑂. In that case, the hardware specifications need to be extended accordingly. The following table lists the minimum requirements for each additional node:
Type | Count | Operating System | vCPU ⑃ | RAM (GB) | Storage (GB) | AWS Instance Type |
---|---|---|---|---|---|---|
OCS starting (OSD+MON) | 3 | RHCOS | 20 | 49 | 120 + 6×2048 ♢ | c5a.8xlarge |
♢ Please refer to OCS Platform Requirements (4.6) / (4.4) and OCS Sizing and scaling recommendations (4.4) for more information.
⑂ Running in a compact mode (on control plane) remains a Technology Preview as of OCS 4.6.
⑃ 1 physical core provides 2 vCPUs when hyper-threading is enabled. 1 physical core provides 1 vCPU when hyper-threading is not enabled.
2.2. Software Requirements
2.2.1. Compatibility Matrix
Later versions of SAP Data Intelligence support newer versions of Kubernetes and OpenShift Container Platform. Even if not listed in the OCP validation version matrix above, the following version combinations are considered fully working and supported:
SAP Data Intelligence | OpenShift Container Platform | Worker Node | Management host | Infrastructure | Storage | Object Storage |
---|---|---|---|---|---|---|
3.0 Patch 3 or higher | 4.3, 4.4 | RHCOS | RHEL 8.1 or newer | Cloud ❄, VMware vSphere | OCS 4, NetApp Trident 20.04 or newer, vSphere volumes ♣ | OCS' NooBaa, NetApp StorageGRID 11.3 or newer |
3.0 Patch 8 or higher | 4.4, 4.5, 4.6 | RHCOS | RHEL 8.1 or newer | Cloud ❄, VMware vSphere | OCS 4, NetApp Trident 20.04 or newer, vSphere volumes ♣ | OCS' NooBaa, NetApp StorageGRID 11.3 or newer |
3.1 | 4.4, 4.5, 4.6 | RHCOS | RHEL 8.1 or newer | Cloud ❄, VMware vSphere, Bare metal | OCS 4, NetApp Trident 20.04 or newer, vSphere volumes ♣ | OCS' NooBaa ¡, NetApp StorageGRID 11.4 or newer |
❄ Cloud means any cloud provider supported by OpenShift Container Platform. For a complete list of tested and supported infrastructure platforms, please refer to OpenShift Container Platform 4.x Tested Integrations. The persistent storage in this case must be provided by the cloud provider. Please refer to Understanding persistent storage (4.6) / (4.4) for a complete list of supported storage providers.
♣ This persistent storage provider does not offer a supported object storage service required by SDI's checkpoint store and is therefore suitable only for SAP Data Intelligence development and PoC clusters. It needs to be complemented by an object storage solution for the full SDI functionality.
¡ Supported Lifecycle scenario: installation and upgrade - without backup/restore.
Unless stated otherwise, the compatibility of a listed SDI version covers all its patch releases as well.
2.2.2. Persistent Volumes
Persistent storage is needed for SDI. The storage must support dynamic provisioning. You can find more information in the Understanding persistent storage (4.6) / (4.4) document.
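For example, you can list the storage classes available in your cluster and check which one is marked as the default; the class name below is only illustrative:
# # the default storage class is marked with "(default)"
# oc get sc
# # inspect the provisioner and parameters of a particular class
# oc get sc ocs-storagecluster-ceph-rbd -o yaml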
2.2.3. Container Image Registry
The SDI installation requires a secured Image Registry where images are first mirrored from an SAP Registry and then delivered to the OCP cluster nodes. The integrated OpenShift Container Registry (4.6) / (4.4) is not appropriate for this purpose. For now, another image registry needs to be set up instead.
The requirements listed here are a subset of the official requirements listed in Container Registry (3.1) / (3.0).
NOTE: as of now, AWS ECR Registry cannot be used for this purpose either.
The word secured in this context means that the communication is encrypted using TLS, ideally with certificates signed by a trusted certificate authority. If the registry is also exposed publicly, it must require authentication and authorization in order to pull SAP images.
Such a registry can be deployed directly on the OCP cluster using, for example, SDI Observer; please refer to Deploying SDI Observer for more information.
When finished, you should have an external image registry up and running at the URL My_Image_Registry_FQDN. You can verify that with the following command:
# curl -k https://My_Image_Registry_FQDN/v2/
{"errors":[{"code":"UNAUTHORIZED","message":"authentication required","detail":null}]}
2.2.4. Checkpoint store enablement
In order to enable SAP Vora Database streaming tables, the checkpoint store needs to be enabled. The store is an object storage on a particular storage back-end. The SDI installer supports several back-end types that cover most cloud storage providers.
The enablement is strongly recommended for production clusters. Clusters having this feature disabled are suitable only for test, development or PoC use-cases.
Make sure to create the desired bucket before the SDI installation. If the checkpoint store shall reside in a directory on a bucket, the directory needs to exist as well.
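As an illustration only, if the checkpoint store will reside on AWS S3, the bucket and an optional directory (prefix) could be created with the AWS CLI like this; bucket name, prefix and region are placeholders:
# aws s3 mb s3://sdi-checkpoint-store --region eu-central-1
# # optionally create a directory (prefix) inside the bucket
# aws s3api put-object --bucket sdi-checkpoint-store --key checkpoints/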
2.2.5. SDI Observer
SDI Observer is a pod that monitors the SDI namespace and modifies objects there to enable running SDI on top of OCP. The observer shall run in a dedicated namespace and must be deployed before the SDI installation is started. The SDI Observer section will guide you through the deployment process.
3. Install Red Hat OpenShift Container Platform
3.1. Prepare the Management host
Note that the following has been tested on RHEL 8.2. The steps shall be similar for other RPM-based Linux distributions. Recommended are RHEL 7.7+, Fedora 30+ and CentOS 7+.
- Subscribe the Management host at least to the following repositories:
# OCP_RELEASE=4.6
# sudo subscription-manager repos \
    --enable=rhel-8-for-x86_64-appstream-rpms \
    --enable=rhel-8-for-x86_64-baseos-rpms \
    --enable=rhocp-${OCP_RELEASE:-4.6}-for-rhel-8-x86_64-rpms
- Install the jq binary. This installation guide has been tested with jq 1.6.
# sudo curl -L -o /usr/local/bin/jq https://github.com/stedolan/jq/releases/download/jq-1.6/jq-linux64
# sudo chmod a+x /usr/local/bin/jq
- Download and install the OpenShift client binaries.
# sudo dnf install -y openshift-clients
NOTE: rhel-7-server-ose-X.Y-rpms repositories corresponding to the same minor release version (e.g. 4.6) as on the cluster nodes need to be enabled.
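Once the client is installed and the cluster is reachable, you can compare the client and server versions to confirm they match (a quick sanity check; requires a configured KUBECONFIG or a prior oc login):
# oc version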
3.2. Install OpenShift Container Platform
Install OpenShift Container Platform on your desired cluster hosts. Follow the OpenShift installation guide (4.6) / (4.4).
If you choose the Installer Provisioned Infrastructure (IPI) (4.6) / (4.4), please follow the Installing a cluster on AWS with customizations (4.6) / (4.4) method to allow for customizations.
On VMware vSphere, please follow Installing a cluster on vSphere (4.4).
Several changes need to be made to the compute nodes running SDI workloads before the SDI installation. These include:
- choosing a sufficient number and type of compute instances for the SDI workload
- pre-loading the needed kernel modules
- increasing the PID limit of the CRI-O container engine
- configuring an insecure registry (only if an insecure registry shall be used)
The first two items can be performed during or after the OpenShift installation. The others only after the OpenShift installation.
3.2.1. Customizing IPI or UPI installation on AWS or VMware vSphere
In order to allow for customizations, the installation needs to be performed in these steps:
- create the installation configuration file, followed by Modifying the installation configuration file
- create the ignition configuration files
- create the cluster
3.2.1.1. Modifying the installation configuration file
After the configuration file is created by the installer, you can specify the desired instance type of compute nodes by editing <installation_directory>/install-config.yaml. A shortened example for AWS could look like this:
apiVersion: v1
compute:
- hyperthreading: Enabled
  name: worker
  platform:
    aws:
      region: us-east-1
      type: m4.2xlarge
  replicas: 3
On AWS, to satisfy SDI's production requirements, you can change compute.platform.aws.type to r5.2xlarge and compute.replicas to 4.
For VMware vSphere, take a look at the Sample install-config.yaml file (4.6) / (4.4).
3.2.1.1.1. (optional) Add proxy settings
If there is a (network/company)-wide HTTP(S) proxy, the proxy settings need to be configured (4.6) / (4.4) in order for the installation to succeed.
In addition to the recommended NO_PROXY values, be sure to include:
- the base domain of the cluster (e.g. .<base_domain>)
- the address of the external container image registry (if located within the proxied network and outside of the OCP cluster)
- IP addresses of the load balancers (both external and internal)
- registry.redhat.io if not accessible via proxy; please see also the troubleshooting section
3.2.1.2. (IPI only) Continue the installation by creating the cluster
To continue the IPI (e.g. on AWS) installation, execute the following command:
# openshift-install create cluster --dir <installation_directory>
3.3. OCP Post Installation Steps
3.3.1. (optional) Install OpenShift Container Storage
On AWS and VMware vSphere platforms, you have the option to deploy OCS to host the persistent storage for Data Intelligence. Please refer to the OCS documentation (4.6) / (4.4).
3.3.2. (optional) Install NetApp Trident
NetApp Trident together with StorageGRID have been validated for SAP Data Intelligence and OpenShift. More details can be found at SAP Data Intelligence on OpenShift 4 with NetApp Trident.
3.3.3. Change the count and instance type of compute nodes
Please refer to Creating a MachineSet (4.6) / (4.4) for changing an instance type and Manually scaling a MachineSet (4.6) / (4.4) or Applying autoscaling to an OpenShift Container Platform cluster (4.6) / (4.4) for information on scaling the nodes.
3.3.4. Configure SDI compute nodes
Some SDI components require changes on the OS level of compute nodes. These could impact other workloads running on the same cluster. To prevent that from happening, it is recommended to dedicate a set of nodes to SDI workload. The following needs to be done:
- Chosen nodes must be labeled, e.g. using the node-role.kubernetes.io/sdi="" label.
- MachineConfigs specific to SDI need to be created; they will be applied only to the selected nodes.
- A MachineConfigPool must be created to associate the chosen nodes with the newly created MachineConfigs.
  - No change will be done to the nodes until this point.
- (optional) Apply a node selector to the sdi, sap-slcbridge and datahub-system projects.
  - SDI Observer can be configured to do that with the SDI_NODE_SELECTOR parameter.
Before modifying the recommended approach below, please make yourself familiar with the custom pools concept of the machine config operator.
3.3.4.1. Label the compute nodes for SAP Data Intelligence
Choose compute nodes for the SDI workload and label them like this:
# oc label node/sdi-worker{1,2,3} node-role.kubernetes.io/sdi=""
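For example, the following lists the nodes that now carry the label (a quick sanity check):
# oc get nodes -l node-role.kubernetes.io/sdi= -o name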
3.3.4.2. Enable net-raw capability for containers on schedulable nodes
NOTE: This has an effect only on OCP 4.6 or newer.
NOTE: This shall be executed prior to the OCP upgrade to 4.6 when SDI is already running.
NOTE: no longer necessary for SDI 3.1 Patch 1 or newer
Starting with OCP 4.6, the NET_RAW capability is no longer granted to containers by default. Some SDI containers assume otherwise. To allow them to run on OCP 4.6, the following MachineConfig must be applied to the compute nodes:
# oc create -f https://raw.githubusercontent.com/redhat-sap/sap-data-intelligence/master/snippets/mco/mc-97-crio-net-raw.yaml
If the command produces the following error, please run the command with oc replace -f instead of oc create -f:
Error from server (AlreadyExists): error when creating "STDIN": machineconfigs.machineconfiguration.openshift.io "97-crio-net-raw" already exists
3.3.4.3. Pre-load needed kernel modules
To apply the desired changes to the existing compute nodes, please create another machine config like this:
# oc create -f https://raw.githubusercontent.com/redhat-sap/sap-data-intelligence/master/snippets/mco/mc-75-worker-sap-data-intelligence.yaml
If the command produces the following error, please run the command with oc replace -f instead of oc create -f:
Error from server (AlreadyExists): error when creating "STDIN": machineconfigs.machineconfiguration.openshift.io "75-worker-sap-data-intelligence" already exists
3.3.4.4. Change the maximum number of PIDs per Container
The process of configuring the nodes is described at Modifying Nodes (4.6) / (4.4). In the SDI case, the required setting is .spec.containerRuntimeConfig.pidsLimit in a ContainerRuntimeConfig. The result is a modified /etc/crio/crio.conf configuration file on each affected worker node with pids_limit set to the desired value. Please create a ContainerRuntimeConfig like this:
# oc create -f https://raw.githubusercontent.com/redhat-sap/sap-data-intelligence/master/snippets/mco/ctrcfg-sdi-pids-limit.yaml
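To quickly verify that the ContainerRuntimeConfig object has been created and carries the expected PID limit (a sanity check only; the full node-level verification is covered in the verification section below), you can list the objects together with their configured limits:
# oc get ctrcfg -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containerRuntimeConfig.pidsLimit}{"\n"}{end}'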
3.3.4.5. Associate MachineConfigs to the Nodes
If previously associated, disassociate workload=sapdataintelligence from the worker MachineConfigPool using the following command executed in bash:
# tmpl=$'{{with $wl := index $m.labels "workload"}}{{if and $wl (eq $wl "sapdataintelligence")}}{{$m.name}}\n{{end}}{{end}}'; \
if [[ "$(oc get mcp/worker -o go-template='{{with $m := .metadata}}'"$tmpl"'{{end}}')" == "worker" ]]; then
oc label mcp/worker workload-;
fi
Define a new MachineConfigPool associating MachineConfigs to the nodes. The nodes will inherit all the MachineConfigs targeting the worker and sdi roles.
# oc create -f https://raw.githubusercontent.com/redhat-sap/sap-data-intelligence/master/snippets/mco/mcp-sdi.yaml
The changes will be rendered into machineconfigpool/sdi. The workers will be restarted one by one until the changes are applied to all of them. See Applying configuration changes to the cluster (4.6) / (4.4) for more information.
The following command can be used to wait until the change gets applied to all the worker nodes:
# oc wait mcp/sdi --all --for=condition=updated
After performing the changes above, you should end up with a new role sdi assigned to the chosen nodes and a new MachineConfigPool containing the nodes:
# oc get nodes
NAME STATUS ROLES AGE VERSION
ocs-worker1 Ready worker 32d v1.19.0+9f84db3
ocs-worker2 Ready worker 32d v1.19.0+9f84db3
ocs-worker3 Ready worker 32d v1.19.0+9f84db3
sdi-worker1 Ready sdi,worker 32d v1.19.0+9f84db3
sdi-worker2 Ready sdi,worker 32d v1.19.0+9f84db3
sdi-worker3 Ready sdi,worker 32d v1.19.0+9f84db3
master1 Ready master 32d v1.19.0+9f84db3
master2 Ready master 32d v1.19.0+9f84db3
master3 Ready master 32d v1.19.0+9f84db3
# oc get mcp
NAME     CONFIG                 UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT
master rendered-master-15f⋯ True False False 3 3 3 0
sdi rendered-sdi-f4f⋯ True False False 3 3 3 0
worker rendered-worker-181⋯ True False False 3 3 3 0
3.3.4.5.1. Enable SDI on control plane
If the control plane (or master nodes) shall be used for running SDI workload, in addition to the previous step, one needs to perform the following:
- Please make sure the control plane is schedulable
- Duplicate the machine configs for the master nodes:
# oc get -o json mc -l machineconfiguration.openshift.io/role=sdi | jq '.items[] | select((.metadata.annotations//{}) | has("machineconfiguration.openshift.io/generated-by-controller-version") | not) | .metadata |= ( .name |= sub("^(?<i>(\\d+-)*)(worker-)?"; "\(.i)master-") | .labels |= {"machineconfiguration.openshift.io/role": "master"} )' | oc create -f -
If the command produces an error like the following one, please run the command with oc replace -f - instead of oc create -f -:
Error from server (AlreadyExists): error when creating "STDIN": machineconfigs.machineconfiguration.openshift.io "75-master-sap-data-intelligence" already exists
- Make the master machine config pool inherit the PID limit changes:
# oc label mcp/master workload=sapdataintelligence
The following command can be used to wait until the change gets applied to all the master nodes:
# oc wait mcp/master --all --for=condition=updated
3.3.4.6. Verification of the node configuration
The following steps assume that the node-role.kubernetes.io/sdi="" label has been applied to the nodes running the SDI workload. All the diagnostics commands will be run in parallel on such nodes.
- Verify that the PID limit has been increased to 16384:
# oc get nodes -l node-role.kubernetes.io/sdi= -o name | \
    xargs -P 6 -n 1 -i oc debug {} -- /bin/sh -c \
    "find /host/etc/crio/ -type f -print0 | xargs -0 awk '/^[[:space:]]*#/ {next} /pids_limit/ {print ENVIRON[\"HOSTNAME\"]\":\t\"FILENAME\":\"\$0}'" |& grep pids_limit
An example output could look like this:
sdi-worker3:  /host/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit:  pids_limit = 16384
sdi-worker1:  /host/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit:  pids_limit = 16384
sdi-worker2:  /host/etc/crio/crio.conf.d/01-ctrcfg-pidsLimit:  pids_limit = 16384
- Verify that the kernel modules have been loaded:
# oc get nodes -l node-role.kubernetes.io/sdi= -o name | \
    xargs -P 6 -n 1 -i oc debug {} -- chroot /host /bin/sh -c \
    "lsmod | awk 'BEGIN {ORS=\":\t\"; print ENVIRON[\"HOSTNAME\"]; ORS=\",\"} /^(nfs|ip_tables|iptable_nat|[^[:space:]]+(REDIRECT|owner|filter))/ { print \$1 }'; echo" 2>/dev/null
An example output could look like this:
sdi-worker2: iptable_filter,iptable_nat,xt_owner,xt_REDIRECT,nfsv4,nfs,nfsd,nfs_acl,ip_tables,
sdi-worker3: iptable_filter,iptable_nat,xt_owner,xt_REDIRECT,nfsv4,nfs,nfsd,nfs_acl,ip_tables,
sdi-worker1: iptable_filter,iptable_nat,xt_owner,xt_REDIRECT,nfsv4,nfs,nfsd,nfs_acl,ip_tables,
If any of the following modules is missing on any of the SDI nodes, the module loading does not work: iptable_nat, nfsv4, nfsd, ip_tables.
To further debug missing modules, one can also execute the following command:
# oc get nodes -l node-role.kubernetes.io/sdi= -o name | \
    xargs -P 6 -n 1 -i oc debug {} -- chroot /host /bin/bash -c \
    "( for service in {sdi-modules-load,systemd-modules-load}.service; do \
        printf '%s:\t%s\n' \$service \$(systemctl is-active \$service); \
    done; find /etc/modules-load.d -type f \
        -regex '.*\(sap\|sdi\)[^/]+\.conf\$' -printf '%p\n';) | \
    awk '{print ENVIRON[\"HOSTNAME\"]\":\t\"\$0}'" 2>/dev/null
Please make sure that both systemd services are active and at least one *.conf file is listed for each host like shown in the following example output:
sdi-worker3: sdi-modules-load.service: active
sdi-worker3: systemd-modules-load.service: active
sdi-worker3: /etc/modules-load.d/sdi-dependencies.conf
sdi-worker1: sdi-modules-load.service: active
sdi-worker1: systemd-modules-load.service: active
sdi-worker1: /etc/modules-load.d/sdi-dependencies.conf
sdi-worker2: sdi-modules-load.service: active
sdi-worker2: systemd-modules-load.service: active
sdi-worker2: /etc/modules-load.d/sdi-dependencies.conf
- (no longer needed for SDI 3.1 or newer) Verify that the NET_RAW capability is granted by default to the pods:
# oc get nodes -l node-role.kubernetes.io/sdi= -o name | \
    xargs -P 6 -n 1 -i oc debug {} -- /bin/sh -c \
    "find /host/etc/crio -type f -print0 | xargs -0 awk '/^[[:space:]]#/{next} /NET_RAW/ {print ENVIRON[\"HOSTNAME\"]\":\t\"FILENAME\":\"\$0}'" |& grep NET_RAW
An example output could look like:
sdi-worker2: /host/etc/crio/crio.conf.d/01-mc-defaultCapabilities: default_capabilities = ["CHOWN", "DAC_OVERRIDE", "FSETID", "FOWNER", "NET_RAW", "SETGID", "SETUID", "SETPCAP", "NET_BIND_SERVICE", "SYS_CHROOT", "KILL"]
sdi-worker2: /host/etc/crio/crio.conf.d/90-default-capabilities:   "NET_RAW",
sdi-worker1: /host/etc/crio/crio.conf.d/90-default-capabilities:   "NET_RAW",
sdi-worker1: /host/etc/crio/crio.conf.d/01-mc-defaultCapabilities: default_capabilities = ["CHOWN", "DAC_OVERRIDE", "FSETID", "FOWNER", "NET_RAW", "SETGID", "SETUID", "SETPCAP", "NET_BIND_SERVICE", "SYS_CHROOT", "KILL"]
sdi-worker3: /host/etc/crio/crio.conf.d/90-default-capabilities:   "NET_RAW",
sdi-worker3: /host/etc/crio/crio.conf.d/01-mc-defaultCapabilities: default_capabilities = ["CHOWN", "DAC_OVERRIDE", "FSETID", "FOWNER", "NET_RAW", "SETGID", "SETUID", "SETPCAP", "NET_BIND_SERVICE", "SYS_CHROOT", "KILL"]
Please make sure that at least one line is produced for each host.
3.3.5. Deploy persistent storage provider
Unless your platform already offers a supported persistent storage provider, one needs to be deployed. Please refer to Understanding persistent storage (4.6) / (4.4) for an overview of possible options.
On OCP, one can deploy OpenShift Container Storage (OCS) (4.6) / (4.4) running converged on OCP nodes providing both persistent volumes and object storage. Please refer to OCS Planning your Deployment (4.6) / (4.4) and Deploying OpenShift Container Storage (4.6) / (4.4) for more information and installation instructions.
3.3.6. Configure S3 access and bucket
Object storage is required for the following features of SDI:
- checkpoint store feature providing regular back-ups of its database
- SDL Data Lake connection (3.1) / (3.0) for the machine learning scenarios
SDI supports several interfaces to the object storage; S3 is one of them. Please take a look at Checkpoint Store Type at Required Input Parameters (3.1) / (3.0) for the complete list. The SAP help page covers the preparation of an object store (3.1) / (3.0) for a couple of cloud service providers.
3.3.6.1. Using NooBaa as object storage gateway
OCS contains the NooBaa object data service for hybrid and multi-cloud environments, which provides an S3 API one can use with SAP Data Intelligence. For SDI, one needs to provide the following:
- S3 host URL prefixed either with https:// or http://
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
- bucket name
NOTE: In case of https://, the endpoint must be secured by certificates signed by a trusted certificate authority. Self-signed CAs will not work out of the box as of now.
Once OCS is deployed, one can create the access keys and bucket using one of the following:
- via the NooBaa Management Console, by default exposed at noobaa-mgmt-openshift-storage.apps.<cluster_name>.<base_domain>
- via the OpenShift command line interface, covered below
In both cases, the S3 endpoint provided to SAP Data Intelligence cannot be secured with a self-signed certificate as of now. Unless NooBaa's endpoints are secured with a properly signed certificate, one must use the insecure HTTP connection. NooBaa comes with such an insecure service reachable at the following URL, where s3 stands for the service name and openshift-storage for the namespace where OCS is installed:
http://s3.openshift-storage.svc.cluster.local
The service is resolvable only within the cluster and cannot be reached from outside of it. One can verify that the service is available with the following command:
# oc get svc -n openshift-storage -l app=noobaa
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
noobaa-mgmt LoadBalancer 172.30.154.162 <pending> 80:31351/TCP,443:32681/TCP,⋯,8446:31943/TCP 7d1h
s3 LoadBalancer 172.30.44.242 <pending> 80:31487/TCP,443:30071/TCP 7d1h
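To verify that the internal S3 endpoint responds from within the cluster, one can run a short-lived pod; this is just a quick reachability check and the image choice is arbitrary (any image containing curl will do):
# oc run s3-check -it --rm --restart=Never \
    --image=registry.access.redhat.com/ubi8/ubi -- \
    curl -sS http://s3.openshift-storage.svc.cluster.local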
3.3.6.1.1. Creating an S3 bucket using CLI
The bucket can be created with the command below. Make sure to double-check the storage class name (e.g. using oc get sc). It can live in any OpenShift project (e.g. sdi-infra). Be sure to switch to the appropriate project/namespace (e.g. sdi) first before executing the following.
# for claimName in sdi-checkpoint-store sdi-data-lake; do
oc create -f - <<EOF
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
name: ${claimName}
spec:
generateBucketName: ${claimName}
storageClassName: openshift-storage.noobaa.io
EOF
done
After a while, the object buckets will be created, the claims will get bound and the secrets with the same names (sdi-checkpoint-store and sdi-data-lake in our case) as the ObjectBucketClaims (aka obc) will be created. When ready, the obc will be bound:
# oc get obc -w
NAME STORAGE-CLASS PHASE AGE
sdi-checkpoint-store openshift-storage.noobaa.io Bound 41s
sdi-data-lake openshift-storage.noobaa.io Bound 41s
The name of the created bucket can be determined with the following command:
# oc get cm sdi-data-lake -o jsonpath='{.data.BUCKET_NAME}{"\n"}'
sdi-data-lake-f86a7e6e-27fb-4656-98cf-298a572f74f3
To determine the access keys, execute the following in bash:
# for claimName in sdi-checkpoint-store sdi-data-lake; do
printf 'Bucket/claim %s:\n Bucket name:\t%s\n' "$claimName" "$(oc get obc -o jsonpath='{.spec.bucketName}' "$claimName")"
for key in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
printf ' %s:\t%s\n' "$key" "$(oc get secret "$claimName" -o jsonpath="{.data.$key}" | base64 -d)"
done
done | column -t -s $'\t'
An example output value can be:
Bucket/claim sdi-checkpoint-store:
Bucket name: sdi-checkpoint-store-ef4999e0-2d89-4900-9352-b1e1e7b361d9
AWS_ACCESS_KEY_ID: LQ7YciYTw8UlDLPi83MO
AWS_SECRET_ACCESS_KEY: 8QY8j1U4Ts3RO4rERXCHGWGIhjzr0SxtlXc2xbtE
Bucket/claim sdi-data-lake:
Bucket name: sdi-data-lake-f86a7e6e-27fb-4656-98cf-298a572f74f3
AWS_ACCESS_KEY_ID: cOxfi4hQhGFW54WFqP3R
AWS_SECRET_ACCESS_KEY: rIlvpcZXnonJvjn6aAhBOT/Yr+F7wdJNeLDBh231
The values of sdi-checkpoint-store shall be passed to the following SLC Bridge parameters during SDI's installation in order to enable the checkpoint store.
Parameter | Example value |
---|---|
Amazon S3 Access Key | LQ7YciYTw8UlDLPi83MO |
Amazon S3 Secret Access Key | 8QY8j1U4Ts3RO4rERXCHGWGIhjzr0SxtlXc2xbtE |
Amazon S3 bucket and directory | sdi-checkpoint-store-ef4999e0-2d89-4900-9352-b1e1e7b361d9 |
Amazon S3 Region (optional) ◀ | `` |
◀ please leave unset
3.3.7. Set up a Container Image Registry
If you haven't done so already, please follow the Container Image Registry prerequisite.
3.3.8. Configure an insecure registry
NOTE: It is now required to use a registry secured by TLS for SDI. Plain HTTP will not do.
If the registry's certificate is signed by a proper trusted (not self-signed) certificate authority, this section may be skipped.
There are two ways to make OCP trust an additional registry using certificates signed by a self-signed certificate authority:
- (recommended) update the CA certificate trust in OCP's image configuration.
- (less secure) mark the registry as insecure
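A minimal sketch of both approaches follows; the registry hostname and the certificate path are placeholders. The recommended way is adding the registry's CA certificate to the cluster image configuration; marking the registry as insecure should be a last resort:
# # (recommended) trust the registry's CA cluster-wide; for a registry listening on a
# # non-standard port, the configmap key must use ".." in place of ":" (e.g. "my-registry.example.com..5000")
# oc create configmap registry-cas -n openshift-config \
    --from-file=my-registry.example.com=/path/to/registry-ca.crt
# oc patch image.config.openshift.io/cluster --type=merge \
    -p '{"spec":{"additionalTrustedCA":{"name":"registry-cas"}}}'
# # (less secure) mark the registry as insecure instead
# oc patch image.config.openshift.io/cluster --type=merge \
    -p '{"spec":{"registrySources":{"insecureRegistries":["my-registry.example.com"]}}}'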
3.3.9. Configure the OpenShift Cluster for SDI
3.3.9.1. Becoming a cluster-admin
Many commands below require cluster admin privileges. To become a cluster-admin, you can do one of the following:
- Use the auth/kubeconfig generated in the working directory during the installation of the OCP cluster:
INFO Install complete!
INFO Run 'export KUBECONFIG=<your working directory>/auth/kubeconfig' to manage the cluster with 'oc', the OpenShift CLI.
INFO The cluster is ready when 'oc login -u kubeadmin -p <provided>' succeeds (wait a few minutes).
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.demo1.openshift4-beta-abcorp.com
INFO Login to the console with user: kubeadmin, password: <provided>
# export KUBECONFIG=working_directory/auth/kubeconfig
# oc whoami
system:admin
- As a system:admin user or a member of the cluster-admin group, make another user a cluster admin to allow them to perform the SDI installation:
  - As a cluster-admin, configure the authentication (4.6) / (4.4) and add the desired user (e.g. sdiadmin).
  - As a cluster-admin, grant the user a permission to administer the cluster:
# oc adm policy add-cluster-role-to-user cluster-admin sdiadmin
You can learn more about the cluster-admin role in the Cluster Roles and Local Roles article (4.6) / (4.4).
4. SDI Observer
Deploy sdi-observer in its own namespace (e.g. sdi-observer). Please refer to its documentation for the complete list of issues that it currently attempts to solve.
It is deployed as an OpenShift template. Its behavior is controlled by the template's parameters, which are mirrored to its environment variables.
4.1. Important Parameters of Observer's Template
The following parameters are the most important.
Parameter Name | Mandatory ◎ | Since Version ♠ | Example | Description |
---|---|---|---|---|
`NAMESPACE` | yes | | `sdi-observer` | The desired namespace to deploy resources to. Defaults to the current one. |
`SDI_NAMESPACE` ♫ | yes | | `sdi` | The name of the SAP Data Intelligence namespace to manage. Defaults to the current one. It must be set only in case the SDI Observer is running in a different namespace (see `NAMESPACE`). |
`SLCB_NAMESPACE` | no | | `sap-slcbridge` ◐ | The name of the namespace where SLC Bridge runs. |
`OCP_MINOR_RELEASE` | yes | | `4.6` ◐ | Minor release of OpenShift Container Platform (e.g. `4.6`). This value must match the OCP server version. The biggest tolerated difference between the versions is 1 in the second digit. |
`DRY_RUN` ◙ | no | | `false` ◐ | If set to true, no action will be performed. The pod will just print what would have been executed. |
`FORCE_REDEPLOY` | no | | `false` ◐ | Whether to forcefully replace existing objects and configuration files. To replace existing secrets as well, `RECREATE_SECRETS` needs to be set. |
`NODE_LOG_FORMAT` | no | | `text` ◐ | Format of the logging files on the nodes. Allowed values are `"json"` and `"text"`. Initially, SDI fluentd pods are configured to parse `"json"` while OpenShift 4 uses `"text"` format by default. If not given, the default is `"text"`. |
`DEPLOY_SDI_REGISTRY` | no | | `false` ◐ | Whether to deploy a container image registry for the purpose of SAP Data Intelligence. Requires the project admin role attached to the sdi-observer service account. If enabled, `REDHAT_REGISTRY_SECRET_NAME` must be provided. |
`DEPLOY_LETSENCRYPT` | no | | `false` ◐ | Whether to deploy the letsencrypt controller. It allows securing exposed routes with trusted certificates provided by the Let's Encrypt open certificate authority. The mandatory prerequisite is a publicly resolvable application subdomain (`*.apps.<cluster_name>.<base_domain>`). |
`REDHAT_REGISTRY_SECRET_NAME` | no unless… ◆ | | `123456-username-pull-secret` | Name of the secret with credentials for the registry.redhat.io registry. Please visit Red Hat Registry Service Accounts to obtain the OpenShift secret. For more details, please refer to Red Hat Container Registry Authentication. |
`INJECT_CABUNDLE` ▲ | no | | `false` ◐ | Inject a CA certificate bundle into SAP Data Intelligence pods. The bundle can be specified with `CABUNDLE_SECRET_NAME`. It is needed if the container image registry is secured by a self-signed certificate. |
`REGISTRY` | no | | `external.registry.tld:5000` | The registry to mark as insecure. If not given, it will be determined from the vflow-secret in the `SDI_NAMESPACE`. If `DEPLOY_SDI_REGISTRY` is set to `"true"`, this variable will be used as the container image registry's hostname when creating the corresponding route. Please do not set unless an external registry is used and it shall be marked as insecure or you want to use a custom hostname for the registry route. |
`MARK_REGISTRY_INSECURE` ★ | no | | `false` ◐ | Set to true if the given or configured `REGISTRY` shall be marked as insecure in all instances of the Pipeline Modeler. |
`CABUNDLE_SECRET_NAME` | no unless… ◘ | | `openshift-ingress-operator/router-ca` ◐ | The name of the secret containing the certificate authority bundle that shall be injected into Data Intelligence pods. By default, the secret bundle is obtained from the openshift-ingress-operator namespace where the router-ca secret contains the certificate authority used to sign all edge and reencrypt routes that are, inter alia, used for `SDI_REGISTRY` and NooBaa S3 API services. The secret name may be optionally prefixed ☼ with `$namespace/`. All the entries present in the "data" field having a ".crt" or ".pem" suffix will be concatenated to form the resulting cert file. |
`EXPOSE_WITH_LETSENCRYPT` ♯ | no | | `true` | Whether to mark created service routes for exposure by the letsencrypt controller. If not specified, defaults to the value of `DEPLOY_LETSENCRYPT`. |
`MANAGE_VSYSTEM_ROUTE` | no | 0.1.0 | `true` | Whether to create a vsystem route for the vsystem service in `SDI_NAMESPACE`. The route will be of reencrypt type. The destination CA certificate for communication with the vsystem service will be kept up to date by the observer. If set to `remove`, the route will be deleted, which is useful to temporarily disable access to the vsystem service during SDI updates. |
`VSYSTEM_ROUTE_HOSTNAME` | no | 0.1.0 | `local-registry.example.com` | Expose the vsystem service at the provided hostname using a route. The value is applied only if `MANAGE_VSYSTEM_ROUTE` is enabled. The hostname defaults to `vsystem-<SDI_NAMESPACE>.<clustername>.<basedomainname>`. |
`SDI_NODE_SELECTOR` | no | 0.1.4 | `node-role.kubernetes.io/sdi=` | Node selector determining nodes that will be dedicated to the SDI workload. A comma separated list of `key=value` pairs. When unset (the default), the node selector will not be managed. When set to `removed`, the node selector will be removed from namespaces and daemonsets. |
♠ The first version of SDI Observer supporting the given parameter.
◎ Whether this parameter must be provided when instantiating the template.
◐ The example value is also the default.
♫ Please make sure to deploy SDI and SDI Observer to different namespaces. Deploying to a single namespace is possible and supported but not recommended.
◙ To see the actions that would have been executed once the observer is deployed, use oc rollout status -n "${NAMESPACE:-sdi-observer}" -w dc/sdi-observer
◆ Unless either DEPLOY_SDI_REGISTRY or DEPLOY_LETSENCRYPT is set to true.
▲ The INJECT_CABUNDLE=true also makes SDI Observer take care of Setting Up Certificates (3.1) / (3.0) by creating the cmcertificates secret.
★ This is needed only if the REGISTRY is either:
- not secured with TLS - using plain HTTP
- secured by a self-signed certificate and the CA bundle is not provided (INJECT_CABUNDLE is false)
If deploying the registry using the SDI Observer (using DEPLOY_SDI_REGISTRY=true), MARK_REGISTRY_INSECURE shall not be set as long as one of the following applies:
- SDI Observer is run with INJECT_CABUNDLE set to true
- the letsencrypt controller is managing routes in SDI Observer's namespace and EXPOSE_WITH_LETSENCRYPT is set to true
- OCP's ingress operator is configured with a proper trusted wildcard certificate (not self-signed)
◘ Unless INJECT_CABUNDLE is true.
☼ For example, in the default value "openshift-ingress-operator/router-ca", "openshift-ingress-operator" stands for the secret's namespace and "router-ca" stands for the secret's name. If no $namespace prefix is given, the secret is expected to reside in the NAMESPACE where the SDI observer runs.
♯ Letsencrypt must be either provisioned by SDI Observer (using DEPLOY_LETSENCRYPT=true) or deployed manually and configured to monitor SDI Observer's namespace.
You can inspect all the available parameters and their semantics like this:
# oc process --parameters -f https://raw.githubusercontent.com/redhat-sap/sap-data-intelligence/master/observer/ocp-template.json
4.2. Deploying SDI Observer
SDI Observer monitors SDI and SLC Bridge namespaces and applies changes to SDI deployments to allow SDI to run on OpenShift. Among other things, it does the following:
- adds an additional persistent volume to the vsystem-vrep StatefulSet to allow it to run on the RHCOS system
- grants fluentd pods permissions to logs
- reconfigures the fluentd pods to parse plain text file container logs on the OCP 4 nodes
- (optional) marks containers manipulating iptables on RHCOS hosts as privileged when the needed kernel modules are not pre-loaded on the nodes
- (optional) deploys a container image registry suitable for mirroring, storing and serving SDI images and for use by the Pipeline Modeler
- (optional) deploys the letsencrypt controller taking care of trusted certificate management
- (optional) creates the cmcertificates secret to allow SDI to talk to a container image registry secured by a self-signed CA certificate early during the installation time
- (optional) enables the Pipeline Modeler (aka vflow) to talk to an insecure (HTTP) registry; it is however preferred to use HTTPS
4.2.1. Prerequisites
The following must be satisfied before SDI Observer can be deployed:
- OpenShift cluster must be fully operational including the Image Registry. Make sure that all the nodes are ready, all cluster operators are available and none of them is degraded.
# oc get co
# oc get nodes
- The namespaces for SLC Bridge, SDI and SDI Observer must exist. Execute the following to create them:
# # change the namespace names according to your preferences
# NAMESPACE=sdi-observer SDI_NAMESPACE=sdi SLCB_NAMESPACE=sap-slcbridge
# for nm in $SDI_NAMESPACE $SLCB_NAMESPACE $NAMESPACE; do oc new-project $nm; done
- In order to build the images needed for SDI Observer, a secret with credentials for registry.redhat.io needs to be created in the namespace of SDI Observer. Please visit Red Hat Registry Service Accounts to obtain the OpenShift secret. For more details, please refer to Red Hat Container Registry Authentication. Once you have downloaded the OpenShift secret file (e.g. rht-registry-secret.yaml) with your credentials, you can import it into the observer's NAMESPACE like this:
# oc create -n "${NAMESPACE:-sdi-observer}" -f rht-registry-secret.yaml
secret/123456-username-pull-secret created
4.2.2. Instantiation of Observer's Template
Deploy the SDI Observer by processing the template.
# NAMESPACE=sdi-observer
# SDI_NAMESPACE=sdi
# OCP_MINOR_RELEASE=4.6
# DEPLOY_SDI_REGISTRY=true
# INJECT_CABUNDLE=true
# MANAGE_VSYSTEM_ROUTE=true
# REDHAT_REGISTRY_SECRET_NAME=123456-username-pull-secret
# oc process -f https://raw.githubusercontent.com/redhat-sap/sap-data-intelligence/master/observer/ocp-template.json \
NAMESPACE="${NAMESPACE:-sdi-observer}" \
SDI_NAMESPACE="${SDI_NAMESPACE:-sdi}" \
OCP_MINOR_RELEASE="${OCP_MINOR_RELEASE:-4.6}" \
DEPLOY_SDI_REGISTRY="${DEPLOY_SDI_REGISTRY:-true}" \
INJECT_CABUNDLE="${INJECT_CABUNDLE:-true}" \
MANAGE_VSYSTEM_ROUTE="${MANAGE_VSYSTEM_ROUTE:-true}" \
REDHAT_REGISTRY_SECRET_NAME="$REDHAT_REGISTRY_SECRET_NAME" | oc create -f -
This will deploy the observer in the sdi-observer namespace in such a way that the observer will deploy a container image registry and inject the default CA bundle into SDI pods in order to trust the registry.
It may take a couple of minutes until the sdi-observer image is built and deployed.
You can monitor the progress of build and deployment with:
# oc logs -n "${NAMESPACE:-sdi-observer}" -f bc/sdi-observer
# oc rollout status -n "${NAMESPACE:-sdi-observer}" -w dc/sdi-observer
replication controller "sdi-observer-2" successfully rolled out
# # see the actions that observer performs
# oc logs -n "${NAMESPACE:-sdi-observer}" -f dc/sdi-observer
NOTE: It is also recommended to dedicate a set of nodes to the SDI workload. Please consider adding the SDI_NODE_SELECTOR=node-role.kubernetes.io/sdi= parameter. Before doing so, please double-check that the nodes have the corresponding label/role.
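For example, assuming the nodes have already been labeled as described in the node configuration section, you could verify the label and set the parameter on an already deployed observer using the same oc set env pattern shown later in this guide:
# oc get nodes -l node-role.kubernetes.io/sdi= -o name
# oc set env -n "${NAMESPACE:-sdi-observer}" dc/sdi-observer \
    SDI_NODE_SELECTOR=node-role.kubernetes.io/sdi=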
4.2.2.1. Using an alternative image
By default, SDI Observer is built on the Red Hat Universal Base Image (UBI). This requires access to the registry.redhat.io registry, including the credentials provided via the REDHAT_REGISTRY_SECRET_NAME secret. Using this base image is the only supportable option.
However, for a proof of concept or development cases, it is possible to provide a custom image from another registry. The instantiation will then look like this:
# NAMESPACE=sdi-observer
# SDI_NAMESPACE=sdi
# OCP_MINOR_RELEASE=4.6
# DEPLOY_SDI_REGISTRY=true
# SDI_REGISTRY_VOLUME_ACCESS_MODE=ReadWriteOnce # set to ReadWriteMany if the default storage class supports it (e.g. cephfs)
# INJECT_CABUNDLE=true
# MANAGE_VSYSTEM_ROUTE=true
# SOURCE_IMAGE_PULL_SPEC=registry.centos.org/centos:8
# SOURCE_IMAGESTREAM_NAME=centos8
# oc process -f https://raw.githubusercontent.com/redhat-sap/sap-data-intelligence/master/observer/ocp-custom-source-image-template.json \
NAMESPACE="${NAMESPACE:-sdi-observer}" \
SDI_NAMESPACE="${SDI_NAMESPACE:-sdi}" \
OCP_MINOR_RELEASE="${OCP_MINOR_RELEASE:-4.6}" \
DEPLOY_SDI_REGISTRY="${DEPLOY_SDI_REGISTRY:-true}" \
SDI_REGISTRY_VOLUME_ACCESS_MODE="{SDI_REGISTRY_VOLUME_ACCESS_MODE:-ReadWriteOnce}" \
INJECT_CABUNDLE="${INJECT_CABUNDLE:-true}" \
MANAGE_VSYSTEM_ROUTE="${MANAGE_VSYSTEM_ROUTE:-true}" \
SOURCE_IMAGE_PULL_SPEC="${SOURCE_IMAGE_PULL_SPEC:-registry.centos.org/centos:8}" \
SOURCE_IMAGESTREAM_NAME="${SOURCE_IMAGESTREAM_NAME:-centos8}" | oc create -f -
The template already contains registry.centos.org/centos:8 as the default, so both SOURCE_IMAGE_PULL_SPEC and SOURCE_IMAGESTREAM_NAME can be left out completely if CentOS is the desired base image. This registry does not require authentication.
However, please make sure that the registry of your choice is allowed for import (4.6) / (4.4) in your cluster.
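To check whether image imports are restricted on your cluster (an empty or absent list means there is no restriction), you can inspect the image configuration; the patch below is only an illustration and replaces the whole list, so merge any existing entries manually:
# oc get image.config.openshift.io/cluster -o jsonpath='{.spec.allowedRegistriesForImport}{"\n"}'
# # example only: restricts imports to the listed registries
# oc patch image.config.openshift.io/cluster --type=merge \
    -p '{"spec":{"allowedRegistriesForImport":[{"domainName":"registry.centos.org","insecure":false}]}}'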
4.2.2.2. SDI Observer Registry
If the observer is configured to deploy a container image registry via the DEPLOY_SDI_REGISTRY=true parameter, it will deploy the deploy-registry job which does the following:
- builds the container-image-registry image and pushes it to the integrated OpenShift Image Registry
- generates or uses configured credentials for the registry
- deploys the container-image-registry deployment config that runs this image and requires authentication
- exposes the registry using a route
  - if the observer's REGISTRY parameter is set, it will be used as its hostname
  - otherwise the registry's hostname will be container-image-registry-${NAMESPACE}.apps.<cluster_name>.<base_domain>
- (optional) annotates the route for the letsencrypt controller to secure it with a trusted certificate
4.2.2.2.1. Registry Template parameters
The following Observer's Template Parameters influence the deployment of the registry:
Parameter | Example value | Description |
---|---|---|
`DEPLOY_SDI_REGISTRY` | `true` | Whether to deploy container image registry for the purpose of SAP Data Intelligence. |
`REDHAT_REGISTRY_SECRET_NAME` | `123456-username-pull-secret` | Name of the secret with credentials for the registry.redhat.io registry. Please visit Red Hat Registry Service Accounts to obtain the OpenShift secret. For more details, please refer to Red Hat Container Registry Authentication. Must be provided in order to build the registry's image. |
`REGISTRY` | `registry.cluster.tld` | This variable will be used as the container image registry's hostname when creating the corresponding route. Defaults to `container-image-registry-$NAMESPACE.<cluster_name>.<base_domain>`. If set, the domain name must resolve to the IP of the ingress router. |
`INJECT_CABUNDLE` | `true` | Inject CA certificate bundle into SAP Data Intelligence pods. The bundle can be specified with `CABUNDLE_SECRET_NAME`. It is needed if either the registry or the S3 endpoint is secured by a self-signed certificate. The letsencrypt method is preferred. |
`CABUNDLE_SECRET_NAME` | `custom-ca-bundle` | The name of the secret containing the certificate authority bundle that shall be injected into Data Intelligence pods. By default, the secret bundle is obtained from the openshift-ingress-operator namespace where the router-ca secret contains the certificate authority used to sign all the edge and reencrypt routes that are, among others, used for `SDI_REGISTRY` and NooBaa S3 API services. The secret name may be optionally prefixed with `$namespace/`. |
`SDI_REGISTRY_STORAGE_CLASS_NAME` | `ocs-storagecluster-cephfs` | Unless given, the default storage class will be used. If possible, prefer volumes with ReadWriteMany (`RWX`) access mode. |
`REPLACE_SECRETS` | `true` | By default, the existing `SDI_REGISTRY_HTPASSWD_SECRET_NAME` secret will not be replaced if it already exists. If the registry credentials shall be changed while using the same secret name, this must be set to `true`. |
`SDI_REGISTRY_AUTHENTICATION` | `none` | Set to `none` if the registry shall not require any authentication at all. The default is to secure the registry with a htpasswd file, which is necessary if the registry is publicly available (e.g. when exposed via an ingress route which is globally resolvable). |
`SDI_REGISTRY_USERNAME` | `registry-user` | Will be used to generate the htpasswd file to provide authentication data to the SDI registry service as long as `SDI_REGISTRY_HTPASSWD_SECRET_NAME` does not exist or `REPLACE_SECRETS` is `true`. Unless given, it will be autogenerated by the job. |
`SDI_REGISTRY_PASSWORD` | `secure-password` | ditto |
`SDI_REGISTRY_HTPASSWD_SECRET_NAME` | `registry-htpasswd` | A secret with the htpasswd file with authentication data for the SDI image container. If given and the secret exists, it will be used instead of `SDI_REGISTRY_USERNAME` and `SDI_REGISTRY_PASSWORD`. Defaults to `container-image-registry-htpasswd`. Please make sure to follow the official guidelines on generating the htpasswd file. |
`SDI_REGISTRY_VOLUME_CAPACITY` | `250Gi` | Volume space available for container images. Defaults to `120Gi`. |
`SDI_REGISTRY_VOLUME_ACCESS_MODE` | `ReadWriteMany` | If the given `SDI_REGISTRY_STORAGE_CLASS_NAME` or the default storage class supports ReadWriteMany ("RWX") access mode, please change this to `ReadWriteMany`. For example, the `ocs-storagecluster-cephfs` storage class, deployed by the OCS operator, does support it. |
`DEPLOY_LETSENCRYPT` | `true` | Whether to deploy the letsencrypt controller. Requires the project admin role attached to the sdi-observer service account. |
`EXPOSE_WITH_LETSENCRYPT` | `true` | Whether to expose the route for the registry annotated for the letsencrypt controller. The letsencrypt controller must be deployed either via the observer or cluster-wide for this to have an effect. Defaults to the value of `DEPLOY_LETSENCRYPT`. |
Monitoring registry's deployment
# oc logs -n "${NAMESPACE:-sdi-observer}" -f job/deploy-registry
4.2.2.2.2. Determining Registry's credentials
The username and password are separated by a colon in the SDI_REGISTRY_HTPASSWD_SECRET_NAME secret:
# # make sure to change the NAMESPACE and secret name according to your environment
# oc get -o json -n "${NAMESPACE:-sdi-observer}" secret/container-image-registry-htpasswd | \
jq -r '.data[".htpasswd.raw"] | @base64d'
user-qpx7sxeei:OnidDrL3acBHkkm80uFzj697JGWifvma
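For scripted use, the two values can be split into shell variables; a small convenience snippet executed in bash, following the same jq expression as above:
# creds="$(oc get -o json -n "${NAMESPACE:-sdi-observer}" secret/container-image-registry-htpasswd | \
    jq -r '.data[".htpasswd.raw"] | @base64d | gsub("\\s+"; "")')"
# registry_user="${creds%%:*}"; registry_password="${creds#*:}"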
4.2.2.2.3. Testing the connection
In this example, it is assumed that INJECT_CABUNDLE and DEPLOY_SDI_REGISTRY are true and other parameters use the defaults.
- Obtain the Ingress Router's default self-signed CA certificate:
# oc get secret -n openshift-ingress-operator -o json router-ca | \
    jq -r '.data as $d | $d | keys[] | select(test("\\.crt$")) | $d[.] | @base64d' >router-ca.crt
- Do a simple test using curl:
# # determine registry's hostname from its route
# hostname="$(oc get route -n "${NAMESPACE:-sdi-observer}" container-image-registry -o jsonpath='{.spec.host}')"
# curl -I --user user-qpx7sxeei:OnidDrL3acBHkkm80uFzj697JGWifvma --cacert router-ca.crt \
    "https://$hostname/v2/"
HTTP/1.1 200 OK
Content-Length: 2
Content-Type: application/json; charset=utf-8
Docker-Distribution-Api-Version: registry/2.0
Date: Sun, 24 May 2020 17:54:31 GMT
Set-Cookie: d22d6ce08115a899cf6eca6fd53d84b4=9176ba9ff2dfd7f6d3191e6b3c643317; path=/; HttpOnly; Secure
Cache-control: private
- Using podman:
# # determine registry's hostname from its route
# hostname="$(oc get route -n "${NAMESPACE:-sdi-observer}" container-image-registry -o jsonpath='{.spec.host}')"
# sudo mkdir -p "/etc/containers/certs.d/$hostname"
# sudo cp router-ca.crt "/etc/containers/certs.d/$hostname/"
# podman login -u user-qpx7sxeei "$hostname"
Password:
Login Succeeded!
4.2.2.2.4. Configuring OCP
Configure OpenShift to trust the deployed registry if using a self-signed CA certificate.
4.2.2.2.5. SDI Observer Registry tenant configuration
NOTE: Only applicable once the SDI installation is complete.
Each newly created tenant needs to be configured to be able to talk to the SDI Registry. The initial tenant (the default) does not need to be configured manually as it is configured during the installation.
There are two steps that need to be performed for each new tenant:
- import CA certificate for the registry via SDI Connection Manager if the CA certificate is self-signed (the default unless letsencrypt controller is used)
- create and import credential secret using the SDI System Management and update the modeler secret
Import the CA certificate
- Obtain the router-ca.crt of the secret as documented in the previous section.
- Follow the Manage Certificates guide (3.1) / (3.0) to import the router-ca.crt via the SDI Connection Management.
Import the credential secret
Determine the credentials and import them using the SDI System Management by following the official Provide Access Credentials for a Password Protected Container Registry (3.1) / (3.0).
As an alternative to the step "1. Create a secret file that contains the container registry credentials and …", you can also use the following way to create the vsystem-registry-secret.txt file:
# # determine registry's hostname from its route
# hostname="$(oc get route -n "${NAMESPACE:-sdi-observer}" container-image-registry -o jsonpath='{.spec.host}')"
# oc get -o json -n "${NAMESPACE:-sdi-observer}" secret/container-image-registry-htpasswd | \
jq -r '.data[".htpasswd.raw"] | @base64d | gsub("\\s+"; "") | split(":") |
[{"username":.[0], "password":.[1], "address":"'"$hostname"'"}]' | \
json2yaml > vsystem-registry-secret.txt
NOTE: the json2yaml binary from the remarshal project must be installed on the Management host in addition to jq.
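If json2yaml is not yet present, one possible way to install it (assuming Python 3 and pip are available on the Management host) is via the remarshal package:
# sudo pip3 install remarshal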
4.3. Managing SDI Observer
4.3.1. Viewing and changing the current configuration
View the current configuration of SDI Observer:
# oc set env --list -n "${NAMESPACE:-sdi-observer}" dc/sdi-observer
Change the settings:
# # instruct the observer to deploy letsencrypt controller to make the
# # services like registry trusted without injecting self-signed CA into pods
# oc set env -n "${NAMESPACE:-sdi-observer}" dc/sdi-observer DEPLOY_LETSENCRYPT=true INJECT_CABUNDLE=false
4.3.2. Re-deploying SDI Observer
Re-deploying the SDI Observer is useful in the following cases:
- To update to the latest SDI Observer code. Please be sure to check the Update instructions before updating to the latest release.
- SDI has been uninstalled, its namespace deleted and re-created.
- A parameter that is reflected in multiple resources (not just in the DeploymentConfig) needs to be changed (e.g. OCP_MINOR_RELEASE).
- A different SDI instance in another namespace shall be observed.
NOTE: Re-deployment preserves generated secrets and persistent volumes unless FORCE_REDEPLOY, REPLACE_SECRETS and REPLACE_PERSISTENT_VOLUMES are true.
The template needs to be processed again with the desired parameters and existing objects replaced like this:
# oc process -f https://raw.githubusercontent.com/redhat-sap/sap-data-intelligence/master/observer/ocp-template.json \
NAMESPACE="${NAMESPACE:-sdi-observer}" \
SDI_NAMESPACE="${SDI_NAMESPACE:-sdi}" \
OCP_MINOR_RELEASE="${OCP_MINOR_RELEASE:-4.6}" \
DEPLOY_SDI_REGISTRY="${DEPLOY_SDI_REGISTRY:-true}" \
INJECT_CABUNDLE="${INJECT_CABUNDLE:-true}" \
MANAGE_VSYSTEM_ROUTE="${MANAGE_VSYSTEM_ROUTE:-true}" \
REDHAT_REGISTRY_SECRET_NAME="$REDHAT_REGISTRY_SECRET_NAME" | oc replace -f -
# watch oc get pods -n "${NAMESPACE:-sdi-observer}"
# # trigger a new build if it does not start automatically
# oc start-build -n "${NAMESPACE:-sdi-observer}" -F bc/sdi-observer
An alternative is to delete the NAMESPACE where the SDI Observer is deployed and deploy it again. Note, however, that this may delete the SDI Registry deployed by the observer, including the mirrored images, if DEPLOY_SDI_REGISTRY was true in the previous run.
4.3.2.1. Re-deploying while reusing the previous parameters
Another alternative that reuses the parameters used last time is shown in the next example. It overrides a single variable (OCP_MINOR_RELEASE), which is useful when updating the OpenShift cluster. Make sure to execute it in bash.
# OCP_MINOR_RELEASE=4.6
# tmpl=https://raw.githubusercontent.com/redhat-sap/sap-data-intelligence/master/observer/ocp-template.json; \
oc process -f $tmpl $(oc set env -n "${SDI_NAMESPACE:-sdi}" --list dc/sdi-observer | grep -v '^#\|=$' | grep -F -f \
<(oc process -f $tmpl --parameters | sed -n 's/^\([A-Z_]\+\)\s\+.*/\1/p' | tail -n +2) | \
sed 's/\(OCP_MINOR_RELEASE\)=.*/\1='"$OCP_MINOR_RELEASE"'/') | oc replace -f -
# watch oc get pods -n "${NAMESPACE:-sdi-observer}"
# # trigger a new build if it does not start automatically
# oc start-build -n "${NAMESPACE:-sdi-observer}" -F bc/sdi-observer
5. Install SDI on OpenShift
5.1. Install Software Lifecycle Container Bridge
Please follow the official documentation (3.1) / (3.0).
5.1.1. Important Parameters
Parameter | Condition | Description |
---|---|---|
Mode | Always | Make sure to choose the Expert Mode. |
Address of the Container Image Repository | Always | This is the Host value of the container-image-registry route in the sdi-observer if the registry is deployed by SDI Observer. |
Image registry user name | if … ‡ | The value recorded in the SDI_REGISTRY_HTPASSWD_SECRET_NAME if using the registry deployed with SDI Observer. |
Image registry password | if … ‡ | ditto |
Namespace of the SLC Bridge | Always | If you override the default (sap-slcbridge ), make sure to deploy SDI Observer with the corresponding SLCB_NAMESPACE value. |
Service Type | SLC Bridge Base installation | On vSphere, make sure to use NodePort . On AWS, please use LoadBalancer . |
Cluster No Proxy | Required in conjunction with the HTTPS Proxy value | Make sure to extend with additional mandatory entries ▽. |
‡ If the registry requires authentication. The one deployed with SDI Observer does.
▽ Make sure to include at least the entries located in OCP cluster's proxy settings.
# # get the internal OCP cluster's NO_PROXY settings
# noProxy="$(oc get -o jsonpath='{.status.noProxy}' proxy/cluster)"; echo "$noProxy"
.cluster.local,.local,.nip.io,.ocp.vslen,.sap.corp,.svc,10.0.0.0/16,10.128.0.0/14,10.17.69.0/23,127.0.0.1,172.30.0.0/16,192.168.0.0/16,api-int.morrisville.ocp.vslen,etcd-0.morrisville.ocp.vslen,etcd-1.morrisville.ocp.vslen,etcd-2.morrisville.ocp.vslen,localhost,lu0602v0,registry.redhat.io
For more details, please refer to Configuring the cluster-wide proxy (4.6) / (4.4)
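As a convenience, the value for the "Cluster No Proxy" parameter can be built from the cluster's settings shown above; the following is only a sketch where the additional entries are environment-specific placeholders:

# # append the additional mandatory entries to the cluster's noProxy value
# clusterNoProxy="${noProxy},<additional-entry-1>,<additional-entry-2>"
# echo "$clusterNoProxy"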
NOTE: The SLC Bridge service cannot be used via routes (Ingress Operator) as of now; doing so will result in timeouts. This will be addressed in the future. For now, one must use either the NodePort or LoadBalancer service directly.
On vSphere, in order to access the slcbridgebase-service NodePort service, one needs either direct access to one of the SDI compute nodes or an additional route to the service configured on the external load balancer.
5.1.2. Install SLC Bridge
Please install SLC Bridge according to Making the SLC Bridge Base available on Kubernetes (3.1) / (3.0) while paying attention to the notes on the installation parameters.
5.1.2.1. Using an external load balancer to access SLC Bridge's NodePort
NOTE: applicable only when "Service Type" was set to "NodePort".
Once the SLC Bridge is deployed, its NodePort shall be determined in order to point the load balancer at it.
# oc get svc -n "${SLCB_NAMESPACE:-sap-slcbridge}" slcbridgebase-service -o jsonpath='{.spec.ports[0].nodePort}{"\n"}'
31875
The load balancer shall point at all the compute nodes running SDI workload. The following is an example configuration for the HAProxy software load balancer:
# # in the example, the <cluster_name> is "boston" and <base_domain> is "ocp.vslen"
# cat /etc/haproxy/haproxy.cfg
....
frontend slcb
bind *:9000
mode tcp
option tcplog
# # commented blocks are useful for multiple OCP clusters or multiple SLC Bridge services
#tcp-request inspect-delay 5s
#tcp-request content accept if { req_ssl_hello_type 1 }
use_backend boston-slcb #if { req_ssl_sni -m end -i boston.ocp.vslen }
#use_backend raleigh-slcb #if { req_ssl_sni -m end -i raleigh.ocp.vslen }
backend boston-slcb
balance source
mode tcp
server sdi-worker1 sdi-worker1.boston.ocp.vslen:31875 check
server sdi-worker2 sdi-worker2.boston.ocp.vslen:31875 check
server sdi-worker3 sdi-worker3.boston.ocp.vslen:31875 check
backend raleigh-slcb
....
The SLC Bridge can then be accessed at the URL https://boston.ocp.vslen:9000/docs/index.html as long as boston.ocp.vslen resolves correctly to the load balancer's IP.
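To quickly check that the load balancer forwards the traffic, a curl probe against the documentation endpoint can be used; -k (insecure) is passed because the bridge's certificate may not be trusted by the client, and the hostname and port follow the example above:

# curl -k -I https://boston.ocp.vslen:9000/docs/index.html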
5.2. SDI Installation Parameters
Please follow SAP's guidelines on configuring the SDI while paying attention to the following additional comments:
Name | Condition | Recommendation |
---|---|---|
Kubernetes Namespace | Always | Must match the project name chosen in the Project Setup (e.g. sdi ) |
Installation Type | Installation or Update | Choose Advanced Installation if you want to choose a particular storage class, if there is no default storage class (4.4) set, or if you want to deploy multiple SDI instances on the same cluster. |
Container Image Repository | Installation | Must be set to the container image registry. |
Backup Configuration | Installation or Upgrade from a system in which backups are not enabled | For a production environment, please choose yes if your object storage provider is NetApp StorageGRID or OCS' NooBaa 4.7 or newer. |
Checkpoint Store Configuration | Installation | Recommended for production deployments. If backup is enabled, it is enabled by default. |
Checkpoint Store Type | If Checkpoint Store Configuration parameter is enabled. | Set to S3 compatible object store if using for example OCS's NooBaa service or NetApp StorageGRID as the object storage. See Using NooBaa as object storage gateway or NetApp StorageGRID for more details. |
Disable Certificate Validation | If Checkpoint Store Configuration parameter is enabled. | Please choose yes if using the HTTPS for your object storage endpoint secured with a certificate having a self-signed CA. For OCS NooBaa, you can set it to no. |
Checkpoint Store Validation | Installation | Please make sure to validate the connection during the installation time. Otherwise in case an incorrect value is supplied, the installation will fail at a later point. |
Container Registry Settings for Pipeline Modeler | Advanced Installation | Shall be changed if the same registry is used for more than one SAP Data Intelligence instance. Either another <registry> or a different <prefix> or both will do. |
StorageClass Configuration | Advanced Installation | Configure this if you want to choose different dynamic storage provisioners for different SDI components or if there's no default storage class (4.6) / (4.4) set or you want to choose non-default storage class for the SDI components. |
Default StorageClass | Advanced Installation and if storage classes are configured | Set this if there's no default storage class (4.6) / (4.4) set or you want to choose non-default storage class for the SDI components. |
Enable Kaniko Usage | Advanced Installation | Must be enabled on OCP 4. |
Container Image Repository Settings for SAP Data Intelligence Modeler | Advanced Installation or Upgrade | If using the same registry for multiple SDI instances, choose "yes". |
Container Registry for Pipeline Modeler | Advanced Installation and if "Use different one" option is selected in the previous selection. | If using the same registry for multiple SDI instances, it is required to use either different prefix (e.g. My_Image_Registry_FQDN/mymodelerprefix2 ) or a different registry. |
Loading NFS Modules | Advanced Installation | Feel free to say "no". This is no longer of concern as long as the loading of the needed kernel modules has been configured. |
Additional Installer Parameters | Advanced Installation | (optional) Useful for reducing the minimum memory requirements of the HANA pod and much more. |
5.3. Project setup
It is assumed the sdi project has already been created during SDI Observer's prerequisites.
Login to OpenShift as a cluster-admin, and perform the following configurations for the installation:
# # change to the SDI_NAMESPACE project using: oc project "${SDI_NAMESPACE:-sdi}"
# oc adm policy add-scc-to-group anyuid "system:serviceaccounts:$(oc project -q)"
# oc adm policy add-scc-to-user privileged -z "$(oc project -q)-elasticsearch"
# oc adm policy add-scc-to-user privileged -z "$(oc project -q)-fluentd"
# oc adm policy add-scc-to-user privileged -z default
# oc adm policy add-scc-to-user privileged -z mlf-deployment-api
# oc adm policy add-scc-to-user privileged -z vora-vflow-server
# oc adm policy add-scc-to-user privileged -z "vora-vsystem-$(oc project -q)"
# oc adm policy add-scc-to-user privileged -z "vora-vsystem-$(oc project -q)-vrep"
5.4. Install SDI
Please follow the official procedure according to Install using SLC Bridge in a Kubernetes Cluster with Internet Access (3.1) / (3.0).
5.5. SDI Post installation steps
5.5.1. (Optional) Expose SDI services externally
There are multiple ways to make SDI services accessible outside of the cluster. Compared to plain Kubernetes, OpenShift offers an additional method, which is recommended for most scenarios including the SDI System Management service. It is based on the OpenShift Ingress Operator (4.6) / (4.4).
For SAP Vora Transaction Coordinator and SAP HANA Wire, please use the official suggested method available to your environment (3.1) / (3.0).
5.5.1.1. Using OpenShift Ingress Operator
NOTE: Instead of using this manual approach, it is now recommended to let SDI Observer manage the route creation and updates. If the SDI Observer has been deployed with MANAGE_VSYSTEM_ROUTE enabled, this section can be skipped. To configure it ex post, please execute the following:
# oc set env -n "${NAMESPACE:-sdi-observer}" dc/sdi-observer MANAGE_VSYSTEM_ROUTE=true
# # wait for the observer to get re-deployed
# oc rollout status -n "${NAMESPACE:-sdi-observer}" -w dc/sdi-observer
Otherwise, please continue with the manual route creation below.
OpenShift allows you to access the Data Intelligence services via Ingress Controllers (4.6) / (4.4) as opposed to regular NodePorts (4.6) / (4.4). For example, instead of accessing the vsystem service via https://worker-node.example.com:32322, after the service exposure you will be able to access it at https://vsystem-sdi.apps.<cluster_name>.<base_domain>. This is an alternative to the official documentation to Expose the Service On Premise (3.1) / (3.0).
There are two kinds of routes secured with TLS. The reencrypt kind allows for a custom signed or self-signed certificate to be used. The other is the passthrough kind, which uses the pre-installed certificate generated by the installer or passed to the installer.
5.5.1.1.1. Export services with a reencrypt route
With this kind of route, different certificates are used on client and service sides of the route. The router stands in the middle and re-encrypts the communication coming from either side using a certificate corresponding to the opposite side. In this case, the client side is secured by a provided certificate and the service side is encrypted with the original certificate generated or passed to the SAP Data Intelligence installer. This is the same kind of route SDI Observer creates automatically.
The reencrypt route allows for securing the client connection with a properly signed certificate.
- Look up the vsystem service:

  # oc project "${SDI_NAMESPACE:-sdi}"   # switch to the Data Intelligence project
  # oc get services | grep "vsystem "
  vsystem   ClusterIP   172.30.227.186   <none>   8797/TCP   19h

  When exported, the resulting hostname will look like vsystem-${SDI_NAMESPACE}.apps.<cluster_name>.<base_domain>. However, an arbitrary hostname can be chosen instead as long as it resolves correctly to the IP of the router.
- Get, generate or use the default certificates for the route. In this example, the default self-signed certificate used by the router is used to secure the connection between the client and OCP's router. The CA certificate for clients can be obtained from the router-ca secret located in the openshift-ingress-operator namespace:

  # oc get secret -n openshift-ingress-operator -o json router-ca | \
      jq -r '.data as $d | $d | keys[] | select(test("\\.crt$")) | $d[.] | @base64d' >router-ca.crt
- Obtain the SDI's root certificate authority bundle generated at the SDI's installation time. The generated bundle is available in the ca-bundle.pem secret in the sdi namespace:

  # oc get -n "${SDI_NAMESPACE:-sdi}" -o go-template='{{index .data "ca-bundle.pem"}}' \
      secret/ca-bundle.pem | base64 -d >sdi-service-ca-bundle.pem
- Create the reencrypt route for the vsystem service like this:

  # oc create route reencrypt -n "${SDI_NAMESPACE:-sdi}" --dry-run -o json \
      --dest-ca-cert=sdi-service-ca-bundle.pem --service vsystem \
      --insecure-policy=Redirect | \
    oc annotate --local -o json -f - haproxy.router.openshift.io/timeout=2m | \
    oc create -f -
  # oc get route
  NAME      HOST/PORT                                                    SERVICES   PORT      TERMINATION          WILDCARD
  vsystem   vsystem-<SDI_NAMESPACE>.apps.<cluster_name>.<base_domain>    vsystem    vsystem   reencrypt/Redirect   None
- Verify the connection:

  # # use the HOST/PORT value obtained from the previous command instead
  # curl --cacert router-ca.crt https://vsystem-<SDI_NAMESPACE>.apps.<cluster_name>.<base_domain>/
5.5.1.1.2. Export services with a passthrough route
With the passthrough route, the communication is encrypted by the SDI service's certificate all the way to the client.
NOTE: If possible, please prefer the reencrypt route, because the hostname of the vsystem certificate cannot be verified by clients, as can be seen in the following output:
# oc get -n "${SDI_NAMESPACE:-sdi}" -o go-template='{{index .data "ca-bundle.pem"}}' \
secret/ca-bundle.pem | base64 -d >sdi-service-ca-bundle.pem
# openssl x509 -noout -subject -in sdi-service-ca-bundle.pem
subject=C = DE, ST = BW, L = Walldorf, O = SAP, OU = Data Hub, CN = SAPDataHub
- Look up the vsystem service:

  # oc project "${SDI_NAMESPACE:-sdi}"   # switch to the Data Intelligence project
  # oc get services | grep "vsystem "
  vsystem   ClusterIP   172.30.227.186   <none>   8797/TCP   19h
- Create the route:

  # oc create route passthrough --service=vsystem --insecure-policy=Redirect
  # oc get route
  NAME      HOST/PORT                                                    PATH   SERVICES   PORT      TERMINATION            WILDCARD
  vsystem   vsystem-<SDI_NAMESPACE>.apps.<cluster_name>.<base_domain>           vsystem    vsystem   passthrough/Redirect   None

  You can modify the hostname with the --hostname parameter. Make sure it resolves to the router's IP.
- Access the System Management service at https://vsystem-<SDI_NAMESPACE>.apps.<cluster_name>.<base_domain> to verify.
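When verifying the passthrough route from the command line, certificate verification needs to be skipped (or the SDI CA bundle passed explicitly) because, as noted above, the hostname of the service certificate cannot be verified; for example:

# curl -k https://vsystem-<SDI_NAMESPACE>.apps.<cluster_name>.<base_domain>/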
5.5.1.2. Using NodePorts
NOTE: For OpenShift, exposure using routes is preferred, although it is only possible for the System Management service (aka vsystem).
Exposing SAP Data Intelligence vsystem
- Either with an auto-generated node port:

  # oc expose service vsystem --type NodePort --name=vsystem-nodeport --generator=service/v2
  # oc get -o jsonpath='{.spec.ports[0].nodePort}{"\n"}' services vsystem-nodeport
  30617
- Or with a specific node port (e.g. 32123):

  # oc expose service vsystem --type NodePort --name=vsystem-nodeport --generator=service/v2 --dry-run -o yaml | \
      oc patch -p '{"spec":{"ports":[{"port":8797, "nodePort": 32123}]}}' --local -f - -o yaml | oc create -f -
The original service remains accessible on the same ClusterIP:Port as before. Additionally, it is now accessible from outside of the cluster under the node port.
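For a quick check from outside the cluster, the node port can be probed directly on any compute node; the node name and port below are illustrative, and -k is used because the service certificate is not trusted by the client by default:

# curl -k -I https://sdi-worker1.boston.ocp.vslen:32123/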
Exposing SAP Vora Transaction Coordinator and HANA Wire
# oc expose service vora-tx-coordinator-ext --type NodePort --name=vora-tx-coordinator-nodeport --generator=service/v2
# oc get -o jsonpath='tx-coordinator:{"\t"}{.spec.ports[0].nodePort}{"\n"}hana-wire:{"\t"}{.spec.ports[1].nodePort}{"\n"}' \
services vora-tx-coordinator-nodeport
tx-coordinator: 32445
hana-wire: 32192
The output shows the generated node ports for the newly exposed services.
5.5.2. Configure the Connection to Data Lake
Please follow the official post-installation instructions at Configure the Connection to DI_DATA_LAKE (3.1) / (3.0).
In case the OCS' NooBaa is used as a backing object storage provider, please make sure to use the HTTP service endpoint as documented in Using NooBaa as object storage gateway.
Based on the example output in that section, the configuration may look like this:
Parameter | Value |
---|---|
Connection Type | SDL |
Id | DI_DATA_LAKE |
Object Storage Type | S3 |
Endpoint | http://s3.openshift-storage.svc.cluster.local |
Access Key ID | cOxfi4hQhGFW54WFqP3R |
Secret Access Key | rIlvpcZXnonJvjn6aAhBOT/Yr+F7wdJNeLDBh231 |
Root Path | sdi-data-lake-f86a7e6e-27fb-4656-98cf-298a572f74f3 |
5.5.3. SDI Validation
Validate SDI installation on OCP to make sure everything works as expected. Please follow the instructions in Testing Your Installation (3.1) / (3.0).
5.5.3.1. Log On to SAP Data Intelligence Launchpad
In case the vsystem service has been exposed using a route, the URL can be determined like this:
# oc get route -n "${SDI_NAMESPACE:-sdi}"
NAME HOST/PORT SERVICES PORT TERMINATION WILDCARD
vsystem vsystem-<SDI_NAMESPACE>.apps.<cluster_name>.<base_domain> vsystem vsystem reencrypt None
The HOST/PORT value then needs to be prefixed with https://, for example:
https://vsystem-sdi.apps.boston.ocp.vslen
5.5.3.2. Check Your Machine Learning Setup
In order to upload training and test datasets using the ML Data Manager, the user needs to be assigned the sap.dh.metadata policy. Please make sure to follow Using SAP Data Intelligence Policy Management (3.1) / (3.0) to assign the policies to the users that need them.
5.5.4. Configuration of additional tenants
When a new tenant is created, for example using the Manage Clusters instructions (3.1) / (3.0), it is not configured to work with the container image registry. Therefore, the Pipeline Modeler is unusable and will fail to start until configured.
There are two steps that need to be performed for each new tenant:
- import CA certificate for the registry via SDI Connection Manager if the CA certificate is self-signed
- create and import credential secret using the SDI System Management and update the modeler secret if the container image registry requires authentication
If the SDI Registry deployed by the SDI Observer is used, please follow the SDI Observer Registry tenant configuration. Otherwise, please make sure to execute the official instructions in the following articles according to your registry configuration:
- Provide Access Credentials for a Password Protected Container Registry (3.1) / (3.0) (as long as your registry requires authentication)
- Manage Certificates (3.1) / (3.0) (as long as your registry for the Pipeline Modeler uses TLS with a self-signed CA)
6. OpenShift Container Platform Upgrade
This section is useful as a guide for performing OCP upgrades to the latest asynchronous releaseⁿ of the same minor version or to the newer minor release supported by the running DI instance without upgrading DI itself.
6.1. Pre-upgrade procedures
- Before upgrading cluster to release equal to or newer than 4.3, make sure to upgrade SDI at least to the release 3.0 Patch 3 by following SAP Data Hub Upgrade procedures - starting from pre-upgrade without performing steps marked with (ocp-upgrade).
- Make yourself familiar with the OpenShift's upgrade guide (4.2 ⇒ 4.3) / (4.3 ⇒ 4.4) / (4.4 ⇒ 4.5) / (4.5 ⇒ 4.6).
- Plan for SDI downtime.
- Make sure to re-configure SDI compute nodes.
- (OCP 4.2 only) Pin vsystem-vrep to the current node
6.1.1. Stop SAP Data Intelligence
In order to speed up the cluster upgrade and/or to ensure DI's consistency, it is possible to stop the SDI before performing the upgrade.
The procedure is outlined in the official Administration Guide (3.1) / (3.0). However, please note that the command described there is erroneous as of December 2020. Please execute it this way:
# oc -n "${SDI_NAMESPACE}" patch datahub default --type='json' -p '[
{"op":"replace","path":"/spec/runLevel","value":"Stopped"}]'
6.2. Upgrade OCP
The following instructions outline a process of OCP upgrade to a minor release 2 versions higher than the current one. If only an upgrade to the latest asynchronous releaseⁿ of the same minor version is desired, please skip steps 5 and 6.
- Upgrade OCP to a higher minor release or the latest asynchronous release(⇒ 4.3) / (⇒ 4.5)ⁿ.
- If having OpenShift Container Storage deployed, update OCS to the latest supported release for the current OCP release according to the interoperability matrix.
- Update OpenShift client tools on the Management host to match the target ※ OCP release. On RHEL 8.2, one can do it like this:

  # current=4.2; new=4.4
  # sudo subscription-manager repos \
      --disable=rhocp-${current}-for-rhel-8-x86_64-rpms --enable=rhocp-${new}-for-rhel-8-x86_64-rpms
  # sudo dnf update -y openshift-clients
- Update SDI Observer to use the OCP client tools matching the target ※ OCP release by following Re-Deploying SDI Observer while reusing the previous parameters.
- Upgrade OCP to a higher minor release or the latest asynchronous release (⇒ 4.4) / (⇒ 4.6)ⁿ.
- If having OpenShift Container Storage deployed, update OCS to the latest supported release for the current OCP release according to the interoperability matrix.
※ For the initial OCP release 4.X, the target release is 4.(X+2); if performing just the latest asynchronous releaseⁿ upgrade, the target release is 4.X.
6.3. Post-upgrade procedures
- Start SAP Data Intelligence as outlined in the official Administration Guide (3.1) / (3.0). However, please note the command as described there is erroneous as of December 2020. Please execute it this way:

  # oc -n "${SDI_NAMESPACE}" patch datahub default --type='json' -p '[
      {"op":"replace","path":"/spec/runLevel","value":"Started"}]'
- (OCP 4.2 (initial) only) Unpin vsystem-vrep from the current node
7. SAP Data Intelligence Upgrade or Update
NOTE This section covers both an upgrade from SAP Data Hub 2.7 and an upgrade of SAP Data Intelligence to a newer minor, micro or patch release. Sections related only to the former or the latter will be annotated with the following annotations:
- (DH-upgrade) to denote a section specific to an upgrade from Data Hub 2.7 to Data Intelligence 3.0
- (DI-upgrade) to denote a section specific to an upgrade from Data Intelligence to a newer minor release (3.X ⇒ 3.(X+1))
- (update) to denote a section specific to an update of Data Intelligence to a newer micro/patch release (3.X.Y ⇒ 3.X.(Y+1))
- annotation-free are sections relating to any upgrade or update procedure
The following steps must be performed in the given order. Unless an OCP upgrade is needed, the steps marked with (ocp-upgrade) can be skipped.
7.1. Pre-upgrade or pre-update procedures
- Make sure to get familiar with the official SAP Upgrade guide (3.0 ⇒ 3.1) / (DH 2.7 ⇒ 3.0).
- (ocp-upgrade) Make yourself familiar with the OpenShift's upgrade guide (4.2 ⇒ 4.3) / (4.3 ⇒ 4.4) / (4.4 ⇒ 4.5) / (4.5 ⇒ 4.6).
- Plan for a downtime.
- Make sure to re-configure SDI compute nodes.
- Pin vsystem-vrep to the current node only when having OCP 4.2.
7.1.1. (DH-upgrade) Container image registry preparation
Unlike SAP Data Hub, SAP Data Intelligence requires a secured container image registry. Plain HTTP connection cannot be used anymore.
There are the following options to satisfy this requirement:
- The registry used by SAP Data Hub is already accessible over HTTPS and its serving TLS certificates have been signed by a trusted certificate authority. In this case, the rest of this section can be skipped until Execute SDI's Pre-Upgrade Procedures.
- The registry used by SAP Data Hub is already accessible or will be made accessible over HTTPS, but its serving TLS certificate is not signed by a trusted certificate authority. In this case, one of the following must be performed unless already done:
- the CA chain needs to be passed to SAP Data Hub to make the connection trusted via Connection Manager
- if kaniko has been enabled for SDH, mark the registry as insecure
The rest of this section can then be skipped.
- A new registry shall be used.
In the last case, please refer to Container Image Registry prerequisite for more details. Also note that the provisioning of the registry can be done by SDI Observer deployed in the subsequent step.
NOTE: the newly deployed registry must contain all the images used by the current SAP Data Hub release as well in order for the upgrade to succeed. There are multiple ways to accomplish this, for example, on the Jump host, execute one of the following:
- using the manual installation method of SAP Data Hub, one can invoke the install.sh script with the following arguments:
  - --prepare-images to cause the script to just mirror the images to the desired registry and terminate immediately afterwards
  - --registry HOST:PORT to point the script to the newly deployed registry
- inspect the currently running containers in the SDH project and copy their images directly from the old local registry to the new one (without SAP registry being involved); it can be performed on the Jump host in bash; in the following example, jq, podman and skopeo binaries are assumed to be available:

  # export OLD_REGISTRY=My_Image_Registry_FQDN:5000
  # export NEW_REGISTRY=HOST:PORT
  # SDH_NAMESPACE=sdh
  # # login to the old registry using either docker or podman if it requires authentication
  # podman login --tls-verify=false -u username $OLD_REGISTRY
  # # login to the new registry using either docker or podman if it requires authentication
  # podman login --tls-verify=false -u username $NEW_REGISTRY
  # function mirrorImage() {
      local src="$1"
      local dst="$NEW_REGISTRY/${src#*/}"
      skopeo copy --src-tls-verify=false --dest-tls-verify=false "docker://$src" "docker://$dst"
    }
  # export -f mirrorImage
  # # get the list of source images to copy
  # images="$(oc get pods -n "${SDH_NAMESPACE:-sdh}" -o json | jq -r '.items[] | . as $ps |
      [$ps.spec.containers[] | .image] + [($ps.spec.initContainers // [])[] | .image] | .[]' |
      grep -F "$OLD_REGISTRY" | sort -u)"
  # # more portable way to copy the images (up to 5 in parallel) using GNU xargs
  # xargs -n 1 -r -P 6 -i /bin/bash -c 'mirrorImage {}' <<<"${images:-}"
  # # an alternative way using GNU Parallel
  # parallel -P 6 --lb mirrorImage <<<"${images:-}"
7.1.2. Execute SDI's Pre-Upgrade Procedures
Please follow the official Pre-Upgrade procedures (3.0 ⇒ 3.1) / (DH 2.7 ⇒ 3.0).
7.1.2.1. (upgrade) Manual route removal
If you exposed the vsystem service using routes, delete the route:
# # note the hostname in the output of the following command
# oc get route -n "${SDI_NAMESPACE:-sdi}"
# # delete the route
# oc delete route -n "${SDI_NAMESPACE:-sdi}" --all
7.1.2.2. (update) Automated route removal
SDI Observer now allows managing the creation and updates of the vsystem route for external access. It takes care of updating the route's destination certificate during SDI's update. It can also be instructed to keep the route deleted, which is useful during SDI updates. If the SDI Observer is of version 0.1.0 or higher, you can instruct it to delete the route like this:
- ensure SDI Observer version is 0.1.0 or higher:

  # oc label -n "${NAMESPACE:-sdi-observer}" --list dc/sdi-observer | grep sdi-observer/version
  sdi-observer/version=0.1.0

  If there is no output or the version is lower, please follow the Manual route removal instead.
- ensure SDI Observer is managing the route already:

  # oc set env -n "${NAMESPACE:-sdi-observer}" --list dc/sdi-observer | grep MANAGE_VSYSTEM_ROUTE
  MANAGE_VSYSTEM_ROUTE=true

  If there is no output or MANAGE_VSYSTEM_ROUTE is not one of true, yes or 1, please follow the Manual route removal instead.
- instruct the observer to keep the route removed:

  # oc set env -n "${NAMESPACE:-sdi-observer}" dc/sdi-observer MANAGE_VSYSTEM_ROUTE=removed
  # # wait for the observer to get re-deployed
  # oc rollout status -n "${NAMESPACE:-sdi-observer}" -w dc/sdi-observer
7.1.3. (ocp-upgrade) Upgrade OpenShift
At this time, depending on target SDI release, OCP cluster must be upgraded either to a newer minor release or to the latest asynchronous releaseⁿ for the current minor release.
Current SDI release | Target SDI release | Desired and validated OCP Releases |
---|---|---|
3.0 | 3.1 | 4.4 (latest) |
DH 2.7 | 3.0 | 4.2 (latest) |
Make sure to follow the official upgrade instructions (4.4) / (4.2).
Please also update OpenShift client tools on the Management host. The example below can be used on RHEL8.
# current=4.2; new=4.4
# sudo subscription-manager repos \
--disable=rhocp-${current}-for-rhel-8-x86_64-rpms --enable=rhocp-${new}-for-rhel-8-x86_64-rpms
# sudo dnf update -y openshift-clients
7.1.4. Deploy or update SDI Observer
Please execute one of the subsections below. Unless an upgrade of Data Hub is performed, please choose to update SDI Observer.
7.1.4.1. (DH-upgrade) Deploying SDI Observer for the first time
If the current SDH Observer is deployed in a different namespace than SDH's namespace, it must be deleted manually. The easiest way is to delete the project unless shared with other workloads. If it shares the namespace of SDH, no action is needed - it will be deleted automatically.
Please follow the instructions in SDI Observer section to deploy it while paying attention to the following:
- SDI Observer shall be located in a different namespace than SAP Data Hub and Data Intelligence (e.g. sdi-observer).
- SDI_NAMESPACE shall be set to the namespace where SDH is currently running.
7.1.4.2. Updating SDI Observer
Please follow the Re-deploying SDI Observer to update the observer. Please make sure to set MANAGE_VSYSTEM_ROUTE to remove until the SDI's update is finished.
7.1.5. (DH-upgrade) Prepare SDH/SDI Project
SAP Data Hub running in a particular project/namespace on OCP cluster will be substituted by SAP Data Intelligence in the same project/namespace. The existing project must be modified in order to host the latter.
Grant the needed security context constraints to the new service accounts by executing the commands from the project setup. NOTE: Re-running commands that have already been run will do no harm.
(OCP 4.2 only) To be able to amend the potential volume attachment problems, make sure to dump a mapping between the SDH pods and nodes they run on:
# oc get pods -n "${SDH_NAMESPACE:-sdh}" -o wide >sdh-pods-pre-upgrade.out
(optionally) If an object storage is available and provided by NooBaa, a new storage bucket can be created for the SDL Data Lake connection (3.0). Please follow Creating an S3 bucket using CLI section. Note that the existing checkpoint store bucket used by SAP Data Hub will continue to be used by SAP Data Intelligence if configured.
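For reference, when NooBaa provided by OCS is the object storage, such a bucket can be requested with an ObjectBucketClaim. The following is only a minimal sketch with illustrative names; the referenced Creating an S3 bucket using CLI section remains the authoritative procedure:

# oc create -f - <<EOF
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: sdi-data-lake
  namespace: ${SDH_NAMESPACE:-sdh}
spec:
  generateBucketName: sdi-data-lake
  storageClassName: openshift-storage.noobaa.io
EOF
# # the bucket name and endpoint end up in a configmap, the credentials in a secret,
# # both named after the claim in the same namespace
# oc get -n "${SDH_NAMESPACE:-sdh}" cm/sdi-data-lake secret/sdi-data-lake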
7.2. Update or Upgrade SDH or SDI
7.2.1. Update Software Lifecycle Container Bridge
Please follow the official documentation (3.1) / (3.0) to obtain the binary and perform the following steps:
- If exposed via a load-balancer, make sure to note down the current service port and node port:

  # oc get -o jsonpath='{.spec.ports[0].nodePort}{"\n"}' -n sap-slcbridge \
      svc/slcbridgebase-service
  31555

- Once the binary is available on the Management host, execute it as slcb init and choose Update when prompted for a deployment option.
- If exposed via a load-balancer, re-set the nodePort to the previous value so no changes on the load-balancer side are necessary:

  # nodePort=31555   # change your value to the desired one
  # oc patch --type=json -n sap-slcbridge svc/slcbridgebase-service -p '[{
      "op":"add", "path":"/spec/ports/0/nodePort","value":'"$nodePort"'}]'
7.2.2. (DH-upgrade) Upgrade SAP Data Hub to SAP Data Intelligence
Execute the SDH or SDI upgrade according to the official instructions (DH 2.7 ⇒ 3.0).
Please be aware of the potential issue during the upgrade when using OCS 4 as the storage provider.
7.2.3. (DI-upgrade) Upgrade SAP Data Intelligence to a newer minor release
Execute the SDI upgrade according to the official instructions (3.0 ⇒ 3.1).
7.3. (ocp-upgrade) Upgrade OpenShift
Depending on the target SDI release, OCP cluster must be upgraded either to a newer minor release or to the latest asynchronous releaseⁿ for the current minor release.
Upgraded/Current SDI release | Desired and validated OCP Releases |
---|---|
3.1 | 4.6 |
3.0 | 4.4 |
If the current OCP release is two or more releases behind the desired, OCP cluster must be upgraded iteratively to each successive minor release until the desired one is reached.
- (optional) Stop the SAP Data Intelligence as it will speed up the cluster update and ensure DI's consistency.
-
Make sure to follow the official upgrade instructions for your upgrade path:
-
(optional) Start the SAP Data Intelligence again if stopped earlier in step 1).
-
Upgrade OpenShift client tools on the Management host. The example below can be used on RHEL8:
# current=4.4; new=4.6 # sudo subscription-manager repos \ --disable=rhocp-${current}-for-rhel-8-x86_64-rpms --enable=rhocp-${new}-for-rhel-8-x86_64-rpms # sudo dnf update -y openshift-clients
7.4. SAP Data Intelligence Post-Upgrade Procedures
- Execute the Post-Upgrade Procedures for the SDH (3.1) / (3.0).
- Re-create the route for the vsystem service using one of the following methods:
  - (recommended) instruct SDI Observer to manage the route:

    # oc set env -n "${NAMESPACE:-sdi-observer}" dc/sdi-observer MANAGE_VSYSTEM_ROUTE=true
    # # wait for the observer to get re-deployed
    # oc rollout status -n "${NAMESPACE:-sdi-observer}" -w dc/sdi-observer

  - follow Expose SDI services externally to recreate the route manually from scratch
- (DH-upgrade) Unpin vsystem-vrep from the current node
7.5. Validate SAP Data Intelligence
Validate SDI installation on OCP to make sure everything works as expected. Please follow the instructions in Testing Your Installation (3.1) / (3.0).
8. Appendix
8.1. SDI uninstallation
Please follow the SAP documentation Uninstalling SAP Data Intelligence using the SLC Bridge (3.1) / (3.0).
Additionally, make sure to delete the sdi project as well, e.g.:
# oc delete project sdi
NOTE: With this, SDI Observer loses permissions to view and modify resources in the deleted namespace. If a new SDI installation shall take place, SDI observer needs to be re-deployed.
Optionally, one can also delete SDI Observer's namespace, e.g.:
# oc delete project sdi-observer
NOTE: this will also delete the container image registry if deployed using SDI Observer, which means the mirroring needs to be performed again during a new installation. If SDI Observer (including the registry and its data) shall be preserved for the next installation, please make sure to re-deploy it once the sdi project is re-created.
When done, you may continue with a new installation round in the same or another namespace.
8.2. Configure OpenShift to trust container image registry
If the registry's certificate is signed by a self-signed certificate authority, one must make OpenShift aware of it.
If the registry runs on the OpenShift cluster itself and is exposed via a reencrypt or edge route with the default TLS settings (no custom TLS certificates set), the CA certificate used is available in the router-ca secret in the openshift-ingress-operator namespace.
To make the registry exposed via such a route trusted, set the route's hostname into the registry variable and execute the following code in bash:
# registry="container-image-registry-<NAMESPACE>.apps.<cluster_name>.<base_domain>"
# caBundle="$(oc get -n openshift-ingress-operator -o json secret/router-ca | \
jq -r '.data as $d | $d | keys[] | select(test("\\.(?:crt|pem)$")) | $d[.] | @base64d')"
# # determine the name of the CA configmap if it exists already
# cmName="$(oc get images.config.openshift.io/cluster -o json | \
jq -r '.spec.additionalTrustedCA.name // "trusted-registry-cabundles"')"
# if oc get -n openshift-config "cm/$cmName" 2>/dev/null; then
# configmap already exists -> just update it
oc get -o json -n openshift-config "cm/$cmName" | \
jq '.data["'"${registry//:/..}"'"] |= "'"$caBundle"'"' | \
oc replace -f - --force
else
# creating the configmap for the first time
oc create configmap -n openshift-config "$cmName" \
--from-literal="${registry//:/..}=$caBundle"
oc patch images.config.openshift.io cluster --type=merge \
-p '{"spec":{"additionalTrustedCA":{"name":"'"$cmName"'"}}}'
fi
If using a registry running outside of OpenShift or not secured by the default ingress CA certificate, take a look at the official guideline at Configuring a ConfigMap for the Image Registry Operator (4.6) / (4.4)
To verify that the CA certificate has been deployed, execute the following and check whether the supplied registry name appears among the file names in the output:
# oc rsh -n openshift-image-registry "$(oc get pods -n openshift-image-registry -l docker-registry=default | \
awk '/Running/ {print $1; exit}')" ls -1 /etc/pki/ca-trust/source/anchors
container-image-registry-sdi-observer.apps.boston.ocp.vslen
image-registry.openshift-image-registry.svc..5000
image-registry.openshift-image-registry.svc.cluster.local..5000
If this is not feasible, one can also mark the registry as insecure.
8.3. Configure insecure registry
As a less secure alternative to Configure OpenShift to trust container image registry, the registry may also be marked as insecure, which poses a potential security risk. Please follow Configuring image settings (4.6) / (4.4) and add the registry to the .spec.registrySources.insecureRegistries array. For example:
apiVersion: config.openshift.io/v1
kind: Image
metadata:
annotations:
release.openshift.io/create-only: "true"
name: cluster
spec:
registrySources:
insecureRegistries:
- My_Image_Registry_FQDN
NOTE: it may take tens of minutes until the nodes are reconfigured. You can use the following commands to monitor the progress:
watch oc get machineconfigpool
watch oc get nodes
8.4. Marking the vflow registry as insecure
NOTE: applicable before, during or after SDI installation.
NOTE: if the registry uses HTTPS and is signed by a self-signed CA certificate, it is recommended to configure SDI Observer with INJECT_CABUNDLE=true instead.
If the modeler is configured to use a registry over plain HTTP, the registry must be marked as insecure. This can be done neither via the installer nor in the UI.
Without the insecure registry set, kaniko builder cannot push built images into the configured registry for the Pipeline Modeler (see "Container Registry for Pipeline Modeler" Input Parameter at the official SAP Data Intelligence documentation (3.1) / (3.0)).
To mark the configured vflow registry as insecure, the SDI Observer needs to be deployed with the MARK_REGISTRY_INSECURE=true parameter. If it is already deployed, it can be re-configured to take care of insecure registries by executing the following command in the sdi namespace:
# oc set env dc/sdi-observer MARK_REGISTRY_INSECURE=true
Once deployed, all the existing pipeline modeler pods will be patched. It will take a couple of tens of seconds until all the modified pods become available.
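Whether the observer has picked up the setting can be checked the same way as the other parameters:

# oc set env --list dc/sdi-observer | grep MARK_REGISTRY_INSECURE
MARK_REGISTRY_INSECURE=true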
For more information, take a look at SDI Helpers.
8.5. Running multiple SDI instances on a single OCP cluster
Two instances of SAP Data Intelligence running in parallel on a single OCP cluster have been validated. Running more instances is possible, but most probably needs an extra support statement from SAP.
Please consider the following before deploying more than one SDI instance to a cluster:
- Each SAP Data Intelligence instance must run in its own namespace/project.
- Each SAP Data Intelligence instance must use a different prefix or container image registry for the Pipeline Modeler. For example, the first instance can configure "Container Registry Settings for Pipeline Modeler" as My_Image_Registry_FQDN/sdi30blue and the second as My_Image_Registry_FQDN/sdi30green.
- It is recommended to dedicate particular nodes to each SDI instance.
- It is recommended to use network policy (4.6) / (4.4) SDN mode for completely granular network isolation configuration and improved security. Check network policy configuration (4.6) / (4.4) for further references and examples. This, however, cannot be changed post OCP installation.
- If running the production and test (aka blue-green) SDI deployments on a single OCP cluster, mind also the following:
- There is no way to test an upgrade of OCP cluster before an SDI upgrade.
- The idle (non-productive) landscape should have the same network security as the live (productive) one.
To deploy a new SDI instance to OCP cluster, please repeat the steps from project setup starting from point 6 with a new project name and continue with SDI Installation.
8.6. Installing remarshal utilities on RHEL
For a few example snippets throughout this guide, either the yaml2json or the json2yaml script is necessary. They are provided by the remarshal project and shall be installed on the Management host in addition to jq. On RHEL 8.2, one can install them this way:
# sudo dnf install -y python3-pip
# sudo pip3 install remarshal
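A quick sanity check that the converters are available on the PATH (the input below is arbitrary):

# echo '{"check": "ok"}' | json2yaml
check: ok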
8.7. Pin vsystem-vrep to the current node
On OCP 4.2 with the openshift-storage.rbd.csi.ceph.com dynamic storage provisioner used for the SDI workload, please make sure to schedule the vsystem-vrep pod to the node where it currently runs in order to prevent A pod is stuck in ContainerCreating phase from happening during an upgrade:
# nodeName="$(oc get pods -n "${SDI_NAMESPACE:-sdi}" vsystem-vrep-0 -o jsonpath='{.spec.nodeName}')"
# oc patch statefulset/vsystem-vrep -n "${SDI_NAMESPACE:-sdi}" --type strategic --patch '{"spec": {"template": {"spec": {"nodeSelector": {"kubernetes.io/hostname": "'"${nodeName}"'"}}}}}'
To revert the change, please follow Unpin vsystem-vrep from the current node.
To be able to amend other potential volume attachment problems, make sure to dump a mapping between the SDH pods and the nodes they run on:
# oc get pods -n "${SDH_NAMESPACE:-sdh}" -o wide >sdh-pods-pre-upgrade.out
8.8. Unpin vsystem-vrep from the current node
On OCP 4.4, the vsystem-vrep pod no longer needs to be pinned to a particular node in order to prevent A pod is stuck in ContainerCreating phase from occurring.
One can then revert the node pinning with the following command. Note that the jq binary is required.
# oc get statefulset/vsystem-vrep -n "${SDI_NAMESPACE:-sdi}" -o json | \
jq 'del(.spec.template.spec.nodeSelector) | del(.spec.template.spec.affinity.nodeAffinity)' | oc replace -f -
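To confirm that the node selector has been removed and to see where the pod got re-scheduled:

# oc get statefulset/vsystem-vrep -n "${SDI_NAMESPACE:-sdi}" -o jsonpath='{.spec.template.spec.nodeSelector}{"\n"}'
# oc get pods -n "${SDI_NAMESPACE:-sdi}" -o wide | grep vsystem-vrep-0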
8.9. (footnote ⁿ) Upgrading to the next minor release from the latest asynchronous release
If the OCP cluster is subscribed to the stable channel, its latest available micro release for the current minor release may not be upgradable to a newer minor release.
Consider the following example:
- The OCP cluster is of release 4.5.24.
- The latest asynchronous release available in the stable-4.5 channel is 4.5.30.
- The latest stable 4.6 release is 4.6.15 (available in the stable-4.6 channel).
- From the 4.5.24 micro release, one can upgrade to one of 4.5.27, 4.5.28, 4.5.30, 4.6.13 or 4.6.15.
- However, from the 4.5.30 release one cannot upgrade to any newer release because no upgrade path has been validated/provided yet in the stable channel.
Therefore, the OCP cluster can get stuck on the 4.5 release if it is first upgraded to the latest asynchronous release 4.5.30 instead of being upgraded directly to one of the 4.6 minor releases. However, at the same time, the fast-4.6 channel contains the 4.6.16 release with an upgrade path from 4.5.30. The 4.6.16 release appears in the stable-4.6 channel sooner or later after being introduced in the fast channel first.
To amend the situation without waiting for an upgrade path to appear in the stable channel (see the example sketch after this list):
- Temporarily switch to the fast-4.X channel.
- Perform the upgrade.
- Switch back to the stable-4.X channel.
- Continue performing upgrades to the latest micro release available in the stable-4.X channel.
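For illustration, the channel switch on these OCP releases can be performed by patching the ClusterVersion resource; the channel names below correspond to the 4.5 ⇒ 4.6 example above, and this sketch is not a replacement for the official upgrade documentation:

# # temporarily switch to the fast channel
# oc patch clusterversion version --type merge -p '{"spec":{"channel":"fast-4.6"}}'
# # review the available updates and perform the upgrade
# oc adm upgrade
# oc adm upgrade --to-latest=true
# # once on 4.6, switch back to the stable channel
# oc patch clusterversion version --type merge -p '{"spec":{"channel":"stable-4.6"}}'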
9. Troubleshooting Tips
9.1. Installation or Upgrade problems
9.1.1. Privileged security context unassigned
If there are pods, replicasets, or statefulsets not coming up and you can see an event similar to the one below, you need to add privileged security context constraint to its service account.
# oc get events | grep securityContext
1m 32m 23 diagnostics-elasticsearch-5b5465ffb.156926cccbf56887 ReplicaSet Warning FailedCreate replicaset-controller Error creating: pods "diagnostics-elasticsearch-5b5465ffb-" is forbidden: unable to validate against any security context constraint: [spec.initContainers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed spec.initContainers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed spec.initContainers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]
Copy the name in the fourth column (the event name - diagnostics-elasticsearch-5b5465ffb.156926cccbf56887) and determine its corresponding service account name.
# eventname="diagnostics-elasticsearch-5b5465ffb.156926cccbf56887"
# oc get -o go-template=$'{{with .spec.template.spec.serviceAccountName}}{{.}}{{else}}default{{end}}\n' \
"$(oc get events "${eventname}" -o jsonpath='{.involvedObject.kind}/{.involvedObject.name}{"\n"}')"
sdi-elasticsearch
The obtained service account name (sdi-elasticsearch) now needs to be assigned the privileged SCC:
# oc adm policy add-scc-to-user privileged -z sdi-elasticsearch
The pod shall then come up on its own if this was the only problem.
9.1.2. No Default Storage Class set
If pods are failing because of PVCs not being bound, the problem may be that the default storage class has not been set and no storage class was specified to the installer.
# oc get pods
NAME READY STATUS RESTARTS AGE
hana-0 0/1 Pending 0 45m
vora-consul-0 0/1 Pending 0 45m
vora-consul-1 0/1 Pending 0 45m
vora-consul-2 0/1 Pending 0 45m
# oc describe pvc data-hana-0
Name: data-hana-0
Namespace: sdi
StorageClass:
Status: Pending
Volume:
Labels: app=vora
datahub.sap.com/app=hana
vora-component=hana
Annotations: <none>
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal FailedBinding 47s (x126 over 30m) persistentvolume-controller no persistent volumes available for this claim and no storage class is set
To fix this, either make sure to set the Default StorageClass (4.6) / (4.4) or provide the storage class name to the installer.
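For example, an existing storage class can be annotated as the default one like this (the class name is illustrative):

# oc patch storageclass ocs-storagecluster-ceph-rbd --type merge \
    -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'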
9.1.3. vsystem-app pods not coming up
If you have SELinux in enforcing mode, you may see the pods launched by vsystem crash-looping because of the container named vsystem-iptables, like this:
# oc get pods
NAME READY STATUS RESTARTS AGE
auditlog-59b4757cb9-ccgwh 1/1 Running 0 40m
datahub-app-db-gzmtb-67cd6c56b8-9sm2v 2/3 CrashLoopBackOff 11 34m
datahub-app-db-tlwkg-5b5b54955b-bb67k 2/3 CrashLoopBackOff 10 30m
...
internal-comm-secret-gen-nd7d2 0/1 Completed 0 36m
license-management-gjh4r-749f4bd745-wdtpr 2/3 CrashLoopBackOff 11 35m
shared-k98sh-7b8f4bf547-2j5gr 2/3 CrashLoopBackOff 4 2m
...
vora-tx-lock-manager-7c57965d6c-rlhhn 2/2 Running 3 40m
voraadapter-lsvhq-94cc5c564-57cx2 2/3 CrashLoopBackOff 11 32m
voraadapter-qkzrx-7575dcf977-8x9bt 2/3 CrashLoopBackOff 11 35m
vsystem-5898b475dc-s6dnt 2/2 Running 0 37m
When you inspect one of those pods, you can see an error message similar to the one below:
# oc logs voraadapter-lsvhq-94cc5c564-57cx2 -c vsystem-iptables
2018-12-06 11:45:16.463220|+0000|INFO |Execute: iptables -N VSYSTEM-AGENT-PREROUTING -t nat||vsystem|1|execRule|iptables.go(56)
2018-12-06 11:45:16.465087|+0000|INFO |Output: iptables: Chain already exists.||vsystem|1|execRule|iptables.go(62)
Error: exited with status: 1
Usage:
vsystem iptables [flags]
Flags:
-h, --help help for iptables
--no-wait Exit immediately after applying the rules and don't wait for SIGTERM/SIGINT.
--rule stringSlice IPTables rule which should be applied. All rules must be specified as string and without the iptables command.
And in the audit log on the node where the pod got scheduled, you should be able to find an AVC denial similar to the following. On RHCOS nodes, you may need to inspect the output of the dmesg command instead.
# grep 'denied.*iptab' /var/log/audit/audit.log
type=AVC msg=audit(1544115868.568:15632): avc: denied { module_request } for pid=54200 comm="iptables" kmod="ipt_REDIRECT" scontext=system_u:system_r:container_t:s0:c826,c909 tcontext=system_u:system_r:kernel_t:s0 tclass=system permissive=0
...
# # on RHCOS
# dmesg | grep denied
To fix this, the ipt_REDIRECT kernel module needs to be loaded. Please refer to Pre-load needed kernel modules.
9.1.4. License Manager cannot be initialized
The installation may fail with the following error.
2019-07-22T15:07:29+0000 [INFO] Initializing system tenant...
2019-07-22T15:07:29+0000 [INFO] Initializing License Manager in system tenant...2019-07-22T15:07:29+0000 [ERROR] Couldn't start License Manager!
The response: {"status":500,"code":{"component":"router","value":8},"message":"Internal Server Error: see logs for more info"}Error: http status code 500 Internal Server Error (500)
2019-07-22T15:07:29+0000 [ERROR] Failed to initialize vSystem, will retry in 30 sec...
In the log of license management pod, you can find an error like this:
# oc logs deploy/license-management-l4rvh
Found 2 pods, using pod/license-management-l4rvh-74595f8c9b-flgz9
+ iptables -D PREROUTING -t nat -j VSYSTEM-AGENT-PREROUTING
+ true
+ iptables -F VSYSTEM-AGENT-PREROUTING -t nat
+ true
+ iptables -X VSYSTEM-AGENT-PREROUTING -t nat
+ true
+ iptables -N VSYSTEM-AGENT-PREROUTING -t nat
iptables v1.6.2: can't initialize iptables table `nat': Permission denied
Perhaps iptables or your kernel needs to be upgraded.
This means the vsystem-iptables container in the pod lacks permissions to manipulate iptables. It needs to be marked as privileged. Please follow the appendix Deploy SDI Observer and restart the installation.
9.1.5. Diagnostics Prometheus Node Exporter pods not starting
During an installation or upgrade, it may happen, that the Node Exporter pods keep restarting:
# oc get pods | grep node-exporter
diagnostics-prometheus-node-exporter-5rkm8 0/1 CrashLoopBackOff 6 8m
diagnostics-prometheus-node-exporter-hsww5 0/1 CrashLoopBackOff 6 8m
diagnostics-prometheus-node-exporter-jxxpn 0/1 CrashLoopBackOff 6 8m
diagnostics-prometheus-node-exporter-rbw82 0/1 CrashLoopBackOff 7 8m
diagnostics-prometheus-node-exporter-s2jsz 0/1 CrashLoopBackOff 6 8m
The possible reason is that the limits on resource consumption set on the pods are too low. To address this post-installation, you can patch the DaemonSet like this (in the SDI's namespace):
# oc patch -p '{"spec": {"template": {"spec": {"containers": [
{ "name": "diagnostics-prometheus-node-exporter",
"resources": {"limits": {"cpu": "200m", "memory": "100M"}}
}]}}}}' ds/diagnostics-prometheus-node-exporter
To address this during the installation (using any installation method), add the following parameters:
-e=vora-diagnostics.resources.prometheusNodeExporter.resources.limits.cpu=200m
-e=vora-diagnostics.resources.prometheusNodeExporter.resources.limits.memory=100M
9.1.6. Builds are failing in the Pipeline Modeler
If the graph builds hang in the Pending state or fail completely, you may find the following pod not coming up in the sdi namespace because its image cannot be pulled from the registry:
# oc get pods | grep vflow
datahub.post-actions.validations.validate-vflow-9s25l 0/1 Completed 0 14h
vflow-bus-fb1d00052cc845c1a9af3e02c0bc9f5d-5zpb2 0/1 ImagePullBackOff 0 21s
vflow-graph-9958667ba5554dceb67e9ec3aa6a1bbb-com-sap-demo-dljzk 1/1 Running 0 94m
# oc describe pod/vflow-bus-fb1d00052cc845c1a9af3e02c0bc9f5d-5zpb2 | sed -n '/^Events:/,$p'
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 30s default-scheduler Successfully assigned sdi/vflow-bus-fb1d00052cc845c1a9af3e02c0bc9f5d-5zpb2 to sdi-moworker3
Normal BackOff 20s (x2 over 21s) kubelet, sdi-moworker3 Back-off pulling image "container-image-registry-sdi-observer.apps.morrisville.ocp.vslen/sdi3modeler-blue/vora/vflow-node-f87b598586d430f955b09991fc1173f716be17b9:3.0.23-com.sap.sles.base-20200617-174600"
Warning Failed 20s (x2 over 21s) kubelet, sdi-moworker3 Error: ImagePullBackOff
Normal Pulling 6s (x2 over 21s) kubelet, sdi-moworker3 Pulling image "container-image-registry-sdi-observer.apps.morrisville.ocp.vslen/sdi3modeler-blue/vora/vflow-node-f87b598586d430f955b09991fc1173f716be17b9:3.0.23-com.sap.sles.base-20200617-174600"
Warning Failed 6s (x2 over 21s) kubelet, sdi-moworker3 Failed to pull image "container-image-registry-sdi-observer.apps.morrisville.ocp.vslen/sdi3modeler-blue/vora/vflow-node-f87b598586d430f955b09991fc1173f716be17b9:3.0.23-com.sap.sles.base-20200617-174600": rpc error: code = Unknown desc = Error reading manifest 3.0.23-com.sap.sles.base-20200617-174600 in container-image-registry-sdi-observer.apps.morrisville.ocp.vslen/sdi3modeler-blue/vora/vflow-node-f87b598586d430f955b09991fc1173f716be17b9: unauthorized: authentication required
Warning Failed 6s (x2 over 21s) kubelet, sdi-moworker3 Error: ErrImagePull
To amend this, one needs to link the secret for the modeler's registry to the corresponding service account associated with the failed pod. In this case, the default one.
# oc get -n "${SDI_NAMESPACE:-sdi}" -o jsonpath='{.spec.serviceAccountName}{"\n"}' \
pod/vflow-bus-fb1d00052cc845c1a9af3e02c0bc9f5d-5zpb2
default
# oc create secret -n "${SDI_NAMESPACE:-sdi}" docker-registry sdi-registry-pull-secret \
--docker-server=container-image-registry-sdi-observer.apps.morrisville.ocp.vslen \
--docker-username=user-n5137x --docker-password=ec8srNF5Pf1vXlPTRLagEjRRr4Vo3nIW
# oc secrets link -n "${SDI_NAMESPACE:-sdi}" --for=pull default sdi-registry-pull-secret
# oc delete -n "${SDI_NAMESPACE:-sdi}" pod/vflow-bus-fb1d00052cc845c1a9af3e02c0bc9f5d-5zpb2
Also please make sure to restart the Pipeline Modeler and the failing graph builds in the affected tenant.
9.1.7. A pod is stuck in ContainerCreating phase
NOTE: Applies to OCP 4.2 in combination with block storage persistent volumes.
The issue can be reproduced when using a ReadWriteOnce persistent volume provisioned by a block device dynamic provisioner like openshift-storage.rbd.csi.ceph.com with a corresponding storage class ocs-storagecluster-ceph-rbd.
# oc get pods | grep ContainerCreating
vsystem-vrep-0 0/2 ContainerCreating 0 10m20s
# oc describe pod vsystem-vrep-0 | sed -n '/^Events/,$p'
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 114m default-scheduler Successfully assigned sdhup/vsystem-vrep-0 to sdi-moworker1
Normal SuccessfulAttachVolume 114m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-fafdd37a-b654-11ea-b795-001c14db4273"
Normal SuccessfulAttachVolume 114m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-f61bd233-b654-11ea-b795-001c14db4273"
Warning FailedMount 17m (x39 over 113m) kubelet, sdi-moworker1 MountVolume.MountDevice failed for volume "pvc-f61bd233-b654-11ea-b795-001c14db4273" : rpc error: code = Internal desc = rbd image ocs-storagecluster-cephblockpool/csi-vol-f6380abf-b654-11ea-8cb4-0a580a83020b is still being used
Warning FailedMount 64s (x50 over 111m) kubelet, sdi-moworker1 Unable to mount volumes for pod "vsystem-vrep-0_sdhup(fddd32f3-b7c4-11ea-b795-001c14db4273)": timeout expired waiting for volumes to attach or mount for pod "sdhup"/"vsystem-vrep-0". list of unmounted volumes=[layers-volume]. list of unattached volumes=[layers-volume exports app-parameters uaa-tls-cert hana-tls-cert vrep-cert-tls vsystem-root-ca-path vora-vsystem-sdhup-vrep-token-wrmxk]
The issue can happen, for example, during an upgrade from SAP Data Hub. In that case, the upgrade hangs at the following step:
# ./slcb execute --url https://boston.ocp.vslen:9000 --useStackXML ~/MP_Stack_1000954710_20200519_.xml
...
time="2020-06-30T06:51:40Z" level=warning msg="Waiting for certificates to be renewed..."
time="2020-06-30T06:51:50Z" level=warning msg="Waiting for certificates to be renewed..."
time="2020-06-30T06:52:00Z" level=info msg="Switching Datahub to runlevel: Started"
For reference, the corresponding persistent volume can look like this:
# oc get pv | grep f61bd233-b654-11ea-b795-001c14db4273
pvc-f61bd233-b654-11ea-b795-001c14db4273 10Gi RWO Delete Bound sdhup/layers-volume-vsystem-vrep-0 ocs-storagecluster-ceph-rbd 45h
The solution is to schedule the vsystem-vrep pod on a particular node.
9.1.7.1. Schedule the vsystem-vrep pod on a particular node
Make sure to run the pod on the same node as it used to run before being re-scheduled:
- Identify the previous compute node name, depending on whether the pod is currently running or not.
  - If the vsystem-vrep pod is currently running, record the node (sdi-moworker3) it is running on like this:
# oc get pods -n "${SDI_NAMESPACE:-sdi}" -o wide -l vora-component=vsystem-vrep
NAME             READY   STATUS    RESTARTS   AGE    IP            NODE            NOMINATED NODE   READINESS GATES
vsystem-vrep-0   2/2     Running   0          3d1h   10.128.0.31   sdi-moworker3   <none>           <none>
  - In case the pod is no longer running, inspect the sdh-pods-pre-upgrade.out file created as suggested at the Prepare SDH/SDI Project step and extract the name of the node for the pod in question. In our case, the vsystem-vrep-0 pod used to run on sdi-moworker3.
- (if not running) Scale its corresponding deployment (in our case statefulset/vsystem-vrep) down to zero replicas:
# oc scale -n "${SDI_NAMESPACE:-sdi}" --replicas=0 statefulset/vsystem-vrep
- Pin vsystem-vrep to that node with the following command while changing the nodeName:
# nodeName=sdi-moworker3   # change the name
# oc patch statefulset/vsystem-vrep -n "${SDI_NAMESPACE:-sdi}" --type strategic --patch \
    '{"spec": {"template": {"spec": {"nodeSelector": {"kubernetes.io/hostname": "'"${nodeName}"'"}}}}}'
- (if not running) Scale the deployment back to 1 replica:
# oc scale -n "${SDI_NAMESPACE:-sdi}" --replicas=1 statefulset/vsystem-vrep
Verify the pod is scheduled to the given node and becomes ready. If the upgrade process is in progress, it should continue in a while.
# oc get pods -n "${SDI_NAMESPACE:-sdi}" -o wide | grep vsystem-vrep-0
vsystem-vrep-0 2/2 Running 0 5m48s 10.128.4.239 sdi-moworker3 <none> <none>
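Once the upgrade has completed and the pod is healthy, you may want to remove the node pin again so the scheduler can place the pod freely. A minimal sketch, assuming the nodeSelector on the statefulset was added only by the patch above (the pod is rescheduled on its next restart):
# oc patch statefulset/vsystem-vrep -n "${SDI_NAMESPACE:-sdi}" --type json \
    --patch '[{"op": "remove", "path": "/spec/template/spec/nodeSelector"}]'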
9.1.8. Container fails with "Permission denied"
If pods fail with an error similar to the one below, the containers are most probably not allowed to run under the desired UID.
# oc get pods
NAME READY STATUS RESTARTS AGE
datahub.checks.checkpoint-m82tj 0/1 Completed 0 12m
vora-textanalysis-6c9789756-pdxzd 0/1 CrashLoopBackOff 6 9m18s
# oc logs vora-textanalysis-6c9789756-pdxzd
Traceback (most recent call last):
File "/dqp/scripts/start_service.py", line 413, in <module>
sys.exit(Main().run())
File "/dqp/scripts/start_service.py", line 238, in run
**global_run_args)
File "/dqp/python/dqp_services/services/textanalysis.py", line 20, in run
trace_dir = utils.get_trace_dir(global_trace_dir, self.config)
File "/dqp/python/dqp_utils.py", line 90, in get_trace_dir
return get_dir(global_trace_dir, conf.trace_dir)
File "/dqp/python/dqp_utils.py", line 85, in get_dir
makedirs(config_value)
File "/usr/lib64/python2.7/os.py", line 157, in makedirs
mkdir(name, mode)
OSError: [Errno 13] Permission denied: 'textanalysis'
To remedy that, be sure to apply all the oc adm policy add-scc-to-*
commands from the project setup section. The one that has not been applied in this case is:
# oc adm policy add-scc-to-group anyuid "system:serviceaccounts:$(oc project -q)"
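As an additional check, you can inspect the openshift.io/scc annotation of the re-created pod to see which SCC it was admitted under (the pod name below is the one from the example; use the name of your currently running pod, and expect anyuid once the SCC has been granted):
# oc get pod vora-textanalysis-6c9789756-pdxzd -o json | jq -r '.metadata.annotations["openshift.io/scc"]'
anyuid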
9.1.9. Jobs failing during installation or upgrade
If the installation jobs fail with the following error, either the anyuid security context constraint has not been applied or the cluster is too old.
# oc logs solution-reconcile-vsolution-vsystem-ui-3.0.9-vnnbf
Error: mkdir /.vsystem: permission denied.
2020-03-05T15:51:18+0000 [WARN] Could not login to vSystem!
2020-03-05T15:51:23+0000 [INFO] Retrying...
Error: mkdir /.vsystem: permission denied.
2020-03-05T15:51:23+0000 [WARN] Could not login to vSystem!
2020-03-05T15:51:28+0000 [INFO] Retrying...
Error: mkdir /.vsystem: permission denied.
...
2020-03-05T15:52:13+0000 [ERROR] Timeout while waiting to login to vSystem...
The reason is that the vctl binary in the containers determines the HOME directory for its user from /etc/passwd. On older OCP clusters (<4.2.32), or when the container is not run with the desired UID, the value is set incorrectly to /. The binary then lacks permissions to write to the root directory.
To remedy that, please make sure:
- you are running OCP cluster 4.2.32 or newer
- the anyuid SCC has been applied to the SDI namespace
  To verify, make sure the group of the SDI namespace's service accounts (e.g. system:serviceaccounts:sdi) is listed in the output of the following command:
# oc get -o json scc/anyuid | jq -r '.groups[]'
system:cluster-admins
system:serviceaccounts:sdi
  When the jobs are rerun, the anyuid SCC will be assigned to them:
# oc get pods -n "${SDI_NAMESPACE:-sdi}" -o json | \
    jq -r '.items[] | select((.metadata.ownerReferences // []) | any(.kind == "Job")) | "\(.metadata.name)\t\(.metadata.annotations["openshift.io/scc"])"' | column -t
datahub.voracluster-start-1d3ffe-287c16-d7h7t        anyuid
datahub.voracluster-start-b3312c-287c16-j6g7p        anyuid
datahub.voracluster-stop-5a6771-6d14f3-nnzkf         anyuid
...
strategy-reconcile-strat-system-3.0.34-3.0.34-pzn79  anyuid
tenant-reconcile-default-3.0.34-wjlfs                anyuid
tenant-reconcile-system-3.0.34-gf7r4                 anyuid
vora-config-init-qw9vc                               anyuid
vora-dlog-admin-f6rfg                                anyuid
- additionally, please make sure that all the other oc adm policy add-scc-to-* commands listed in the project setup have been applied to the same $SDI_NAMESPACE.
9.1.10. vsystem-vrep cannot export NFS on RHCOS
If the vsystem-vrep-0 pod fails with the following error, it is unable to start an NFS server on top of overlayfs.
# oc logs -n ocpsdi1 vsystem-vrep-0 vsystem-vrep
2020-07-13 15:46:05.054171|+0000|INFO |Starting vSystem version 2002.1.15-0528, buildtime 2020-05-28T18:5856, gitcommit ||vsystem|1|main|server.go(107)
2020-07-13 15:46:05.054239|+0000|INFO |Starting Kernel NFS Server||vrep|1|Start|server.go(83)
2020-07-13 15:46:05.108868|+0000|INFO |Serving liveness probe at ":8739"||vsystem|9|func2|server.go(149)
2020-07-13 15:46:10.303625|+0000|WARN |no backup or restore credentials mounted, not doing backup and restore||vsystem|1|NewRcloneBackupRestore|backup_restore.go(76)
2020-07-13 15:46:10.311488|+0000|INFO |vRep components are initialised successfully||vsystem|1|main|server.go(249)
2020-07-13 15:46:10.311617|+0000|ERROR|cannot parse duration from "SOLUTION_LAYER_CLEANUP_DELAY" env variable: time: invalid duration ||vsystem|16|CleanUpSolutionLayersJob|manager.go(351)
2020-07-13 15:46:10.311719|+0000|INFO |Background task for cleaning up solution layers will be triggered every 12h0m0s||vsystem|16|CleanUpSolutionLayersJob|manager.go(358)
2020-07-13 15:46:10.312402|+0000|INFO |Recreating volume mounts||vsystem|1|RemountVolumes|volume_service.go(339)
2020-07-13 15:46:10.319334|+0000|ERROR|error re-loading NFS exports: exit status 1
exportfs: /exports does not support NFS export||vrep|1|AddExportsEntry|server.go(162)
2020-07-13 15:46:10.319991|+0000|FATAL|Error creating runtime volume: error exporting directory for runtime data via NFS: export error||vsystem|1|Fail|termination.go(22)
There are two solutions to the problem. Both result in an additional volume mounted at /exports, which is the root directory of all exports. Either way, you can verify the mount as sketched below the list.
- (recommended) deploy SDI Observer, which will request an additional persistent volume of size 500Mi for the vsystem-vrep-0 pod, and make sure it is running
- add -e=vsystem.vRep.exportsMask=true to the Additional Installer Parameters, which will mount an emptyDir volume at /exports in the same pod
  - on particular versions of OCP this may fail nevertheless
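To verify that the fix is in effect, you can check that /exports inside the container is backed by a dedicated mount rather than the overlay root filesystem; a minimal sketch, assuming df is available in the image (the filesystem shown should not be overlay):
# oc exec -n "${SDI_NAMESPACE:-sdi}" vsystem-vrep-0 -c vsystem-vrep -- df -h /exports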
9.1.11. Kaniko cannot push images to a registry
Symptoms:
- kaniko is enabled in SDI (mandatory on OCP 4)
- registry is secured by TLS certificates with a self-signed certificate
- other SDI and OCP components can use the registry without issues
- the pipeline modeler crashes with a traceback preceded with the following error:
# oc logs -f -c vflow "$(oc get pods -o name \
    -l vsystem.datahub.sap.com/template=pipeline-modeler | head -n 1)" | grep 'push permissions'
error checking push permissions -- make sure you entered the correct tag name, and that you are authenticated correctly, and try again: checking push permission for "container-image-registry-miminar-sdi-observer.apps.sydney.example.com/vora/vflow-node-f87b598586d430f955b09991fc1173f716be17b9:3.0.27-com.sap.sles.base-20201001-102714": BLOB_UPLOAD_UNKNOWN: blob upload unknown to registry
Resolution:
The root cause has not been identified yet. To work around it, the modeler must be configured to use an insecure registry accessible via plain HTTP (without TLS) and requiring no authentication. Such a registry can be provisioned with SDI Observer. If the existing registry was provisioned by SDI Observer, one can modify it to require no authentication like this (a verification sketch follows the steps):
- Initiate an update of SDI Observer.
- Re-configure sdi-observer for no authentication:
# oc set env -n "${NAMESPACE:-sdi-observer}" SDI_REGISTRY_AUTHENTICATION=none dc/sdi-observer
- Wait until the registry gets re-deployed.
- Verify that the registry is running and that neither REGISTRY_AUTH_HTPASSWD_REALM nor REGISTRY_AUTH_HTPASSWD_PATH is present in the output of the following command:
# oc set env -n "${NAMESPACE:-sdi-observer}" --list dc/container-image-registry
REGISTRY_HTTP_SECRET=mOjuXMvQnyvktGLeqpgs5f7nQNAiNMEE
- Note the registry service address, which can be determined like this:
# # <service-name>.<namespace>.cluster.local:<service-port>
# oc project "${NAMESPACE:-sdi-observer}"
# printf "$(oc get -o jsonpath='{.metadata.name}.{.metadata.namespace}.svc.%s:{.spec.ports[0].port}' \
    svc container-image-registry)\n" \
    "$(oc get dnses.operator.openshift.io/default -o jsonpath='{.status.clusterDomain}')"
container-image-registry.sdi-observer.svc.cluster.local:5000
- Verify that the service is responsive over plain HTTP from inside the OCP cluster and requires no authentication:
# registry_url=http://container-image-registry.sdi-observer.svc.cluster.local:5000
# oc rsh -n openshift-authentication "$(oc get pods -n openshift-authentication | \
    awk '/oauth-openshift.*Running/ {print $1; exit}')" curl -I "$registry_url"
HTTP/1.1 200 OK
Content-Length: 2
Content-Type: application/json; charset=utf-8
Docker-Distribution-Api-Version: reg
Note: the service URL is not reachable from outside of the OCP cluster
- For each SDI tenant using the registry:
  - Login to the tenant as an administrator and open System Management.
  - View Application Configuration and Secrets.
  - Set the following properties to the registry address:
    - Modeler: Base registry for pulling images
    - Modeler: Docker registry for Modeler images
  - Unset the following properties:
    - Modeler: Name of the vSystem secret containing the credentials for Docker registry
    - Modeler: Docker image pull secret for Modeler
The end result should be that both registry properties point to the new plain-HTTP registry address and both secret properties are empty.
- Return to "Applications" in the System Management and select Modeler.
- Delete all the instances.
- Create a new instance with the plus button.
- Access the instance to verify it is working.
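Once the new Modeler instance is up, you can re-run the check from the symptoms above to confirm that the push-permission error no longer appears in the modeler log; a small sketch (the fallback message is illustrative):
# oc logs -c vflow "$(oc get pods -o name \
    -l vsystem.datahub.sap.com/template=pipeline-modeler | head -n 1)" | grep 'push permissions' || echo "no push errors found"
no push errors found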
9.1.12. SLCBridge pod fails to deploy
If the initialisation phase of Software Lifecycle Container Bridge fails with an error like the one below, you are probably running SLCB version 1.1.53 configured to push to a registry requiring basic authentication.
*************************************************
* Executing Step WaitForK8s SLCBridgePod Failed *
*************************************************
Execution of step WaitForK8s SLCBridgePod failed
Synchronizing Deployment slcbridgebase failed (pod "slcbridgebase-5bcd7946f4-t6vfr" failed) [1.116647047s]
.
Choose "Retry" to retry the step.
Choose "Rollback" to undo the steps done so far.
Choose "Cancel" to cancel deployment immediately.
# oc logs -n sap-slcbridge -c slcbridge -l run=slcbridge --tail=13
----------------------------
Code: 401
Scheme: basic
"realm": "basic-realm"
{"errors":[{"code":"UNAUTHORIZED","message":"authentication required","detail":null}]}
----------------------------
2020-09-29T11:49:33.346Z INFO images/registry.go:182 Access check of registry "container-image-registry-sdi-observer.apps.sydney.example.com" returned AuthNeedBasic
2020-09-29T11:49:33.346Z INFO slp/server.go:199 Shutting down server
2020-09-29T11:49:33.347Z INFO hsm/hsm.go:125 Context closed
2020-09-29T11:49:33.347Z INFO hsm/state.go:56 Received Cancel
2020-09-29T11:49:33.347Z DEBUG hsm/hsm.go:118 Leaving event loop
2020-09-29T11:49:33.347Z INFO slp/server.go:208 Server shutdown complete
2020-09-29T11:49:33.347Z INFO slcbridge/master.go:64 could not authenticate at registry SLP_BRIDGE_REPOSITORY container-image-registry-sdi-observer.apps.sydney.example.com
2020-09-29T11:49:33.348Z INFO globals/goroutines.go:63 Shutdown complete (exit status 1).
To fix this, please download an SLCB version newer than 1.1.53. More information can be found in SAP Note #2589449.
9.1.13. Kibana pod fails to start
When the kibana pod is stuck in CrashLoopBackOff status and the following error shows up in its log, you will need to delete the existing index.
# oc logs -n "${SDI_NAMESPACE:-sdi}" -c diagnostics-kibana -l datahub.sap.com/app-component=kibana --tail=5
{"type":"log","@timestamp":"2020-10-07T14:40:23Z","tags":["status","plugin:ui_metric@7.3.0-SNAPSHOT","info"],"pid":1,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","@timestamp":"2020-10-07T14:40:23Z","tags":["status","plugin:visualizations@7.3.0-SNAPSHOT","info"],"pid":1,"state":"green","message":"Status changed from uninitialized to green - Ready","prevState":"uninitialized","prevMsg":"uninitialized"}
{"type":"log","@timestamp":"2020-10-07T14:40:23Z","tags":["status","plugin:elasticsearch@7.3.0-SNAPSHOT","info"],"pid":1,"state":"green","message":"Status changed from yellow to green - Ready","prevState":"yellow","prevMsg":"Waiting for Elasticsearch"}
{"type":"log","@timestamp":"2020-10-07T14:40:23Z","tags":["info","migrations"],"pid":1,"message":"Creating index .kibana_1."}
{"type":"log","@timestamp":"2020-10-07T14:40:23Z","tags":["warning","migrations"],"pid":1,"message":"Another Kibana instance appears to be migrating the index. Waiting for that migration to complete. If no other Kibana instance is attempting migrations, you can get past this message by deleting index .kibana_1 and restarting Kibana."}
Please note the name of the index in the last warning message; in this case it is .kibana_1. Execute the following command with the proper index name at the end of the curl command to delete the index, and then delete the kibana pod as well.
# oc exec -n "${SDI_NAMESPACE:-sdi}" -it diagnostics-elasticsearch-0 -c diagnostics-elasticsearch \
-- curl -X DELETE 'http://localhost:9200/.kibana_1'
# oc delete pod -n "${SDI_NAMESPACE:-sdi}" -l datahub.sap.com/app-component=kibana
A new kibana pod will be spawned and should become Running in a few minutes, as long as its dependent diagnostics pods are running as well.
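For example, you can list the kibana pod by its label (optionally with -w to watch) until it reports Running:
# oc get pods -n "${SDI_NAMESPACE:-sdi}" -l datahub.sap.com/app-component=kibana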
9.1.14. Fluentd pods cannot access /var/lib/docker/containers
If you see the following errors, fluentd cannot access the container logs on the hosts.
- Error from SLC Bridge:
2021-01-26T08:28:49.810Z INFO cmd/cmd.go:243 1>  DataHub/kub-slcbridge/default [Pending]
2021-01-26T08:28:49.810Z INFO cmd/cmd.go:243 1>  └── Diagnostic/kub-slcbridge/default [Failed] [Start Time: 2021-01-25 14:26:03 +0000 UTC]
2021-01-26T08:28:49.811Z INFO cmd/cmd.go:243 1>      └── DiagnosticDeployment/kub-slcbridge/default [Failed] [Start Time: 2021-01-25 14:26:29 +0000 UTC]
2021-01-26T08:28:49.811Z INFO cmd/cmd.go:243 1>
2021-01-26T08:28:55.989Z INFO cmd/cmd.go:243 1>  DataHub/kub-slcbridge/default [Pending]
2021-01-26T08:28:55.989Z INFO cmd/cmd.go:243 1>  └── Diagnostic/kub-slcbridge/default [Failed] [Start Time: 2021-01-25 14:26:03 +0000 UTC]
2021-01-26T08:28:55.989Z INFO cmd/cmd.go:243 1>      └── DiagnosticDeployment/kub-slcbridge/default [Failed] [Start Time: 2021-01-25 14:26:29 +0000 UTC]
- Fluentd pod description:
# oc describe pod diagnostics-fluentd-bb9j7
Name:           diagnostics-fluentd-bb9j7
…
  Warning  FailedMount  6m35s                 kubelet, compute-4  Unable to attach or mount volumes: unmounted volumes=[varlibdockercontainers], unattached volumes=[vartmp kub-slcbridge-fluentd-token-k5c9n settings varlog varlibdockercontainers]: timed out waiting for the condition
  Warning  FailedMount  2m1s (x2 over 4m19s)  kubelet, compute-4  Unable to attach or mount volumes: unmounted volumes=[varlibdockercontainers], unattached volumes=[varlibdockercontainers vartmp kub-slcbridge-fluentd-token-k5c9n settings varlog]: timed out waiting for the condition
  Warning  FailedMount  23s (x12 over 8m37s)  kubelet, compute-4  MountVolume.SetUp failed for volume "varlibdockercontainers" : hostPath type check failed: /var/lib/docker/containers is not a directory
- Log from one of the pods:
# oc logs $(oc get pods -o name -l datahub.sap.com/app-component=fluentd | head -n 1) | tail -n 20
2019-04-15 18:53:24 +0000 [error]: unexpected error error="Permission denied @ rb_sysopen - /var/log/es-containers-sdh25-mortal-garfish.log.pos"
2019-04-15 18:53:24 +0000 [error]: suppressed same stacktrace
2019-04-15 18:53:25 +0000 [warn]: '@' is the system reserved prefix. It works in the nested configuration for now but it will be rejected: @timestamp
2019-04-15 18:53:26 +0000 [error]: unexpected error error_class=Errno::EACCES error="Permission denied @ rb_sysopen - /var/log/es-containers-sdh25-mortal-garfish.log.pos"
2019-04-15 18:53:26 +0000 [error]: /usr/lib64/ruby/gems/2.5.0/gems/fluentd-0.14.8/lib/fluent/plugin/in_tail.rb:151:in `initialize'
2019-04-15 18:53:26 +0000 [error]: /usr/lib64/ruby/gems/2.5.0/gems/fluentd-0.14.8/lib/fluent/plugin/in_tail.rb:151:in `open'
...
Those errors are fixed automatically by SDI Observer; please make sure it is running and can access the SDI_NAMESPACE.
One can also apply a fix manually with the following commands:
# oc -n "${SDI_NAMESPACE:-sdi}" patch dh default --type='json' -p='[
{ "op": "replace"
, "path": "/spec/diagnostic/fluentd/varlibdockercontainers"
, "value":"/var/log/pods" }]'
# oc -n "${SDI_NAMESPACE:-sdi}" patch ds/diagnostics-fluentd -p '{"spec":{"template":{"spec":{
"containers": [{"name":"diagnostics-fluentd", "securityContext":{"privileged": true}}]}}}}'
9.2. SDI Runtime troubleshooting
9.2.1. 504 Gateway Time-out
If you access SDI services exposed via OCP's Ingress Controller (as routes) and experience 504 Gateway Time-out errors, this is most likely caused by the following factors:
- SDI components accessed for the first time on a per-tenant and per-user basis require a new pod to be started, which takes a considerable amount of time
- the default server timeout configured on the load balancers is usually too short to tolerate containers being pulled, initialized and started
To amend that, make sure to do the following:
- set the haproxy.router.openshift.io/timeout annotation to 2m on the vsystem route like this (assuming the route is named vsystem):
# oc annotate -n "${SDI_NAMESPACE:-sdi}" route/vsystem haproxy.router.openshift.io/timeout=2m
  This results in the following haproxy settings being applied to the ingress router and the route in question:
# oc rsh -n openshift-ingress $(oc get pods -o name -n openshift-ingress | \
    awk '/\/router-default/ {print;exit}') cat /var/lib/haproxy/conf/haproxy.config | \
    awk 'BEGIN { p=0 } /^backend.*:'"${SDI_NAMESPACE:-sdi}:vsystem"'/ { p=1 } { if (p) { print; if ($0 ~ /^\s*$/) {exit} } }'
Defaulting container name to router.
Use 'oc describe pod/router-default-6655556d4b-7xpsw -n openshift-ingress' to see all of the containers in this pod.
backend be_secure:sdi:vsystem
  mode http
  option redispatch
  option forwardfor
  balance leastconn
  timeout server  2m
- set the same server timeout (2 minutes) on the external load balancer forwarding traffic to OCP's Ingress routers; the following is an example configuration for haproxy:
frontend                        https
    bind                        *:443
    mode                        tcp
    option                      tcplog
    timeout server              2m
    tcp-request inspect-delay   5s
    tcp-request content accept  if { req_ssl_hello_type 1 }
    use_backend sydney-router-https    if { req_ssl_sni -m end -i apps.sydney.example.com }
    use_backend melbourne-router-https if { req_ssl_sni -m end -i apps.melbourne.example.com }
    use_backend registry-https         if { req_ssl_sni -m end -i registry.example.com }

backend sydney-router-https
    balance source
    server compute1 compute1.sydney.example.com:443 check
    server compute2 compute2.sydney.example.com:443 check
    server compute3 compute3.sydney.example.com:443 check

backend melbourne-router-https
    ....
9.2.2. HANA backup pod cannot pull an image from an authenticated registry
If the configured container image registry requires authentication, HANA backup jobs might fail as shown in the following example:
# oc get pods | grep backup-hana
default-chq28a9-backup-hana-sjqph 0/2 ImagePullBackOff 0 15h
default-hfiew1i-backup-hana-zv8g2 0/2 ImagePullBackOff 0 38h
default-m21kt3d-backup-hana-zw7w4 0/2 ImagePullBackOff 0 39h
default-w29xv3w-backup-hana-dzlvn 0/2 ImagePullBackOff 0 15h
# oc describe pod default-hfiew1i-backup-hana-zv8g2 | tail -n 6
Warning Failed 12h (x5 over 12h) kubelet Error: ImagePullBackOff
Warning Failed 12h (x3 over 12h) kubelet Failed to pull image "sdi-registry.apps.shanghai.ocp.vslen/com.sap.datahub.linuxx86_64/hana:2010.22.0": rpc error: code = Unknown desc = Error reading manifest 2010.22.0 in sdi-registry.apps.shanghai.ocp.vslen/com.sap.datahub.linuxx86_64/hana: unauthorized: authentication required
Warning Failed 12h (x3 over 12h) kubelet Error: ErrImagePull
Normal Pulling 99m (x129 over 12h) kubelet Pulling image "sdi-registry.apps.shanghai.ocp.vslen/com.sap.datahub.linuxx86_64/hana:2010.22.0"
Warning Failed 49m (x3010 over 12h) kubelet Error: ImagePullBackOff
Normal BackOff 4m21s (x3212 over 12h) kubelet Back-off pulling image "sdi-registry.apps.shanghai.ocp.vslen/com.sap.datahub.linuxx86_64/hana:2010.22.0"
Resolution: There are two ways:
- The recommended approach is to update SDI Observer to version 0.1.9 or newer.
- A manual alternative fix is to execute the following:
  - Determine the currently configured image pull secret:
# oc get -n "${SDI_NAMESPACE:-sdi}" vc/vora -o jsonpath='{.spec.docker.imagePullSecret}{"\n"}'
slp-docker-registry-pull-secret
  - Link the secret with the default service account:
# oc secrets link -n "${SDI_NAMESPACE:-sdi}" --for=pull default slp-docker-registry-pull-secret
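Note that pods already stuck in ImagePullBackOff do not pick up a newly linked pull secret; as with the modeler pod earlier, they typically need to be re-created so that fresh pods are spawned with the secret attached. A hedged sketch, assuming the stuck pods are owned by jobs that are still active and will recreate them:
# oc get pods -n "${SDI_NAMESPACE:-sdi}" -o name | grep backup-hana | \
    xargs -r oc delete -n "${SDI_NAMESPACE:-sdi}"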
9.3. SDI Observer troubleshooting
9.3.1. Build is failing due to a repository outage
If the build of SDI Observer or SDI Registry fails with an error similar to the one below, the chosen Fedora repository mirror is probably temporarily down:
# oc logs -n "${NAMESPACE:-sdi-observer}" -f bc/sdi-observer
Extra Packages for Enterprise Linux Modular 8 - 448 B/s | 16 kB 00:36
Failed to download metadata for repo 'epel-modular'
Error: Failed to download metadata for repo 'epel-modular'
subprocess exited with status 1
subprocess exited with status 1
error: build error: error building at STEP "RUN dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm && dnf install -y parallel procps-ng bc git httpd-tools && dnf clean all -y": exit status 1
Please try to start the build again after a minute or two like this:
# oc start-build -n "${NAMESPACE:-sdi-observer}" -F bc/sdi-observer
9.3.2. Build is failing due to proxy issues
If you see the following build error in a cluster where HTTP(S) proxy is used, make sure to update the proxy configuration.
# oc logs -n "${NAMESPACE:-sdi-observer}" -f bc/sdi-observer
Caching blobs under "/var/cache/blobs".
Pulling image registry.redhat.io/ubi8/ubi@sha256:cd014e94a9a2af4946fc1697be604feb97313a3ceb5b4d821253fcdb6b6159ee ...
Warning: Pull failed, retrying in 5s ...
Warning: Pull failed, retrying in 5s ...
Warning: Pull failed, retrying in 5s ...
error: build error: failed to pull image: After retrying 2 times, Pull image still failed due to error: while pulling "docker://registry.redhat.io/ubi8/ubi@sha256:cd014e94a9a2af4946fc1697be604feb97313a3ceb5b4d821253fcdb6b6159ee" as "registry.redhat.io/ubi8/ubi@sha256:cd014e94a9a2af4946fc1697be604feb97313a3ceb5b4d821253fcdb6b6159ee": Error initializing source docker://registry.redhat.io/ubi8/ubi@sha256:cd014e94a9a2af4946fc1697be604feb97313a3ceb5b4d821253fcdb6b6159ee: can't talk to a V1 docker registry
The registry.redhat.io host either needs to be whitelisted in the HTTP proxy server or it must be added to the NO_PROXY settings as in the following bash snippet. When executed, the snippet adds the registry to NO_PROXY only if it is not there yet.
# addreg="registry.redhat.io"
# oc get proxies.config.openshift.io/cluster -o json | \
    jq '.spec.noProxy |= (. | [split("\\s*,\\s*";"")[] | select((. | length) > 0)] | . as $npa |
        "'"$addreg"'" as $r | if [$npa[] | . == $r] | any then $npa else $npa + [$r] end | join(","))' | \
    oc replace -f -
Wait until the machine config pools are updated and then restart the build:
# oc get machineconfigpool
NAME CONFIG UPDATED UPDATING DEGRADED
master rendered-master-204c0009fca2b46a9d754371404ad169 True False False
worker rendered-worker-d3738db56394537bb525ab5cf008dc4f True False False
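Once both pools report UPDATED=True, the build can be restarted the same way as in the previous section, for example:
# oc start-build -n "${NAMESPACE:-sdi-observer}" -F bc/sdi-observer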
For more information, please refer to Docker pull fails to GET registry.redhat.io/ content.