Authors:
@Piotr Godowski - Senior Technical Staff Member, Sovereign Core and Cloud Pak Foundational Services, Master Inventor, IBM Software
@Shikha Srivastava - Distinguished Engineer, Sovereign Core and MCSP, Master Inventor, IBM Software
Overview
IBM Sovereign Core provides a comprehensive sovereign cloud platform that enables organizations to maintain complete control over their data and infrastructure while meeting stringent regulatory and compliance requirements. The bare metal installation approach ensures maximum performance, security, and sovereignty by deploying directly onto physical hardware within customer-controlled boundaries.
The installation process leverages a sophisticated automation framework that orchestrates the deployment of:
- Landing Zone: Red Hat Enterprise Linux (RHEL)-based deployment orchestration server
- Control Plane: Three-node OpenShift cluster managing platform operations
- Tenant Plane: Scalable worker nodes for tenant workloads
- Storage Layer: External Ceph/ODF storage for persistent data
- External Services: DNS, NTP, email for network infrastructure

Figure 1: IBM Sovereign Core High-Level Deployment and Network architecture
Why It Matters: What Customers Struggle With
Organizations deploying sovereign cloud infrastructure face several critical challenges:
- Complex Multi-Component Setup: Coordinating bare metal provisioning, network configuration, storage integration, and platform deployment requires deep expertise across multiple domains
- Configuration Management: Managing hundreds of parameters across multiple configuration files is error-prone and time-consuming
- Network Isolation Requirements: Ensuring complete data sovereignty while maintaining necessary external connectivity for image registries and updates
- Compliance Validation: Proving that data never leaves the sovereign boundary during installation and operation
- Installation Time: Traditional manual approaches can take days or weeks, with high risk of configuration errors
Hardware & Infrastructure Requirements
Before beginning the installation, ensure all hardware and network infrastructure meets the minimum requirements. Proper hardware provisioning is critical for a successful deployment.
Network Equipment
The network infrastructure requires two switches to support the three-network architecture:
| Component | Specification | Quantity | Purpose |
| --- | --- | --- | --- |
| 10G Network Switch | 12 ports minimum | 1 | High-speed compute network for cluster communication |
| 1G Network Switch | 12 ports minimum (10G acceptable) | 1 | IPMI/management network for out-of-band access |
Network Segmentation: The switches must support VLAN configuration to maintain separation between the three network segments (IPMI, Compute, Public).
Server Infrastructure
The minimum server configuration for a production deployment consists of the following physical servers:
| Server Type | CPU | RAM | Storage | Special Requirements | Quantity |
| --- | --- | --- | --- | --- | --- |
| Landing Zone | 16 cores | 32 GB | 2 TB | RHEL 10 compatible | 1 |
| Control Plane | 64 cores | 128 GB | 500 GB | Redfish compatible BMC | 3 |
| Tenant Plane | 64 cores | 128 GB | 500 GB | Redfish compatible BMC | 1+ |
Additionally, the Proof-of-Concept deployment uses a 3-node Ceph cluster with the following specifications:
| Server Type | CPU | RAM | Storage | Special Requirements | Quantity |
| --- | --- | --- | --- | --- | --- |
| Storage | 16 cores | 64 GB | 100 GB + 1 TB NVMe | Ceph compatible | 3 |
Key Hardware Notes
- Landing Zone: Requires 2TB disk space. This server hosts the mirror registry and installation artifacts.
- Control Plane: Must have Redfish-compatible BMC for automated provisioning. These servers form the OpenShift control plane.
- Tenant Plane: Scalable based on workload requirements. Minimum 1 node; additional nodes are recommended for high availability. Must have Redfish-compatible BMC.
- Storage server (Ceph): NVMe storage recommended for Ceph OSD performance. Each node should have dedicated disks for Ceph.
Redfish BMC Requirement: All control plane and data plane nodes must have Redfish-compatible Baseboard Management Controllers (BMC) for automated bare metal provisioning. IPMI-only systems are not supported.
Network Configuration
Three distinct network segments are required for proper operation and security isolation. For the purposes of this document, the network addresses below are used; the actual address space depends on the customer's choice.
| Network Type | Purpose | Example Range | Subnet Mask | Gateway |
| --- | --- | --- | --- | --- |
| IPMI Network | Out-of-band management | 100.64.1.0/24 | 255.255.255.0 | 100.64.1.1 |
| Compute Network | Internal cluster communication | 192.168.2.0/24 | 255.255.255.0 | 192.168.2.1 |
| Public Network | External connectivity | 10.0.2.0/24 | 255.255.255.0 | 10.0.2.1 |
Network Purposes
- IPMI Network: Isolated network for Redfish/IPMI access. Used exclusively for bare metal provisioning and management. No production traffic. Access to this network is configured ONLY from the Landing Zone machine. Neither the control plane nor the tenant plane may have access to the IPMI Network.
- Compute Network: Primary network for all cluster communication. All pods, services, and internal traffic use this network.
- Public Network: Controlled external access through the gateway. Used for initial image mirroring and optional external connectivity.
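A network plan like the one above can be sanity-checked before any hardware is touched. The following is a minimal sketch using Python's standard `ipaddress` module; it uses the example ranges from this article, and the `check_segments` helper is a hypothetical name introduced here for illustration, not part of the Sovereign Core tooling.

```python
import ipaddress
from itertools import combinations

# Example ranges and gateways copied from the table above; real deployments will differ.
SEGMENTS = {
    "ipmi": ("100.64.1.0/24", "100.64.1.1"),
    "compute": ("192.168.2.0/24", "192.168.2.1"),
    "public": ("10.0.2.0/24", "10.0.2.1"),
}

def check_segments(segments):
    """Return a list of problems; an empty list means the plan looks consistent."""
    problems = []
    networks = {}
    for name, (cidr, gateway) in segments.items():
        net = ipaddress.ip_network(cidr)
        networks[name] = net
        # Each gateway must be an address inside its own segment.
        if ipaddress.ip_address(gateway) not in net:
            problems.append(f"{name}: gateway {gateway} is outside {cidr}")
    # The three segments must not overlap, or isolation cannot hold.
    for (a, net_a), (b, net_b) in combinations(networks.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{a} and {b} overlap")
    return problems

print(check_segments(SEGMENTS) or "no problems found")
```

This only checks the address plan on paper. It does not verify VLAN configuration or that routing between segments is actually blocked.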
Virtual IP Addresses (VIPs)
Reserve the following Virtual IPs on the Compute network.
Critical: VIPs must NOT be attached to any other network device:
| Service | Purpose | Example IP | Notes |
| --- | --- | --- | --- |
| API Endpoint | Kubernetes API access | 192.168.2.201 | Load balanced across 3 control plane nodes |
| Ingress Endpoint | Application ingress | 192.168.2.202 | Wildcard DNS for *.apps.cluster.domain |
| Internal-DNS | Internal DNS service | 192.168.2.203 | Dynamic DNS for tenant clusters |
VIP Requirements:
- MUST be within the Compute network range
- MUST NOT be assigned to any physical or virtual interface
- MUST be excluded from DHCP pools
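The three VIP rules above lend themselves to a quick scripted check. The sketch below is illustrative only: the VIPs and the Compute range are the example values from this article, while the DHCP pool boundaries and the `check_vips` helper are assumptions made up for the example.

```python
import ipaddress

COMPUTE_NET = ipaddress.ip_network("192.168.2.0/24")
VIPS = {"api": "192.168.2.201", "ingress": "192.168.2.202", "internal-dns": "192.168.2.203"}
# Addresses already assigned to interfaces (taken from the server examples in this article).
ASSIGNED = {"192.168.2.10", "192.168.2.21", "192.168.2.22", "192.168.2.23"}
# Hypothetical DHCP pool; replace with the range your DHCP server actually hands out.
DHCP_POOL = (ipaddress.ip_address("192.168.2.100"), ipaddress.ip_address("192.168.2.150"))

def check_vips(vips, network, assigned, dhcp_pool):
    """Return a list of rule violations for the planned VIPs."""
    problems = []
    low, high = dhcp_pool
    for name, raw in vips.items():
        ip = ipaddress.ip_address(raw)
        if ip not in network:
            problems.append(f"{name}: {raw} is outside {network}")
        if raw in assigned:
            problems.append(f"{name}: {raw} is already assigned to an interface")
        if low <= ip <= high:
            problems.append(f"{name}: {raw} falls inside the DHCP pool")
    return problems

print(check_vips(VIPS, COMPUTE_NET, ASSIGNED, DHCP_POOL) or "no problems found")
```

A check like this covers the planning spreadsheet only. Confirming that no live device answers on a VIP still requires a test on the Compute network itself.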
Server Network Configuration Examples
Landing Zone Server:
- IPMI: 100.64.1.10
- Compute: 192.168.2.10
- Public: 10.0.2.10
- Operating System: Red Hat Enterprise Linux 10.1
Control Plane Nodes:
- Node 1: IPMI 100.64.1.21, Compute 192.168.2.21
- Node 2: IPMI 100.64.1.22, Compute 192.168.2.22
- Node 3: IPMI 100.64.1.23, Compute 192.168.2.23
Tenant Plane Nodes:
- Node 1: IPMI 100.64.1.31, Compute 192.168.2.31
- Node 2: IPMI 100.64.1.32, Compute 192.168.2.32
- Node 3: IPMI 100.64.1.33, Compute 192.168.2.33
Storage Nodes:
- Node 1: IPMI 100.64.1.41, Compute 192.168.2.41
- Node 2: IPMI 100.64.1.42, Compute 192.168.2.42
- Node 3: IPMI 100.64.1.43, Compute 192.168.2.43
Certificate Requirements
Critical: Public CA-signed certificates are required for production deployments:
- API Certificate: `api.<cluster>.<domain>` (e.g., `api.core.sovereign.local`)
- Ingress Certificate: `*.apps.<cluster>.<domain>` (e.g., `*.apps.core.sovereign.local`)
Certificate Specifications:
- Must be valid (not expired)
- Must match cluster domain exactly
- Must include complete certificate chain (root + all intermediates)
- Private keys must be unencrypted
- PEM format required
Self-signed certificates can be used for development/testing, but otherwise they are discouraged.
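Two of the certificate specifications above (a complete chain and an unencrypted private key) can be checked at a glance from the PEM text itself. The sketch below is a rough pre-check under that assumption. The helper names are hypothetical, and it does not verify signatures, expiry, or domain matching.

```python
import re

# One match per PEM certificate block in a bundle.
CERT_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.+?-----END CERTIFICATE-----", re.DOTALL
)

def count_certificates(pem_text):
    """Count certificates in a PEM bundle (leaf + intermediates + root)."""
    return len(CERT_BLOCK.findall(pem_text))

def key_is_encrypted(pem_text):
    """Detect the two common markers of a passphrase-protected PEM key."""
    # PKCS#8 encrypted keys use an "ENCRYPTED PRIVATE KEY" label; traditional
    # OpenSSL encrypted keys carry a "Proc-Type: 4,ENCRYPTED" header.
    return (
        "BEGIN ENCRYPTED PRIVATE KEY" in pem_text
        or "Proc-Type: 4,ENCRYPTED" in pem_text
    )
```

For example, a bundle for which `count_certificates` returns 1 probably contains only the leaf certificate and is missing its intermediates and root.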
Email Service Requirements
The email service (SMTP) is required so that Sovereign Core can send tenant account invitations and password reset emails.
The following is required for the external SMTP service:
- SMTP server address, which must be routable from within the control plane network
- SMTP server must support STARTTLS
- SMTP server TLS certificate chain (leaf + intermediates + root CA), PEM encoded
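Whether a given SMTP server advertises STARTTLS can be probed with Python's standard `smtplib`. This is a minimal sketch: the default port 587 is an assumption (the article does not state which port Sovereign Core uses), and `supports_starttls` is a hypothetical helper name.

```python
import smtplib

def supports_starttls(host, port=587, timeout=10.0):
    """Return True if the server advertises STARTTLS, False if not or unreachable."""
    try:
        with smtplib.SMTP(host, port, timeout=timeout) as server:
            server.ehlo()  # ask the server for its list of extensions
            return server.has_extn("starttls")
    except OSError:
        # Covers refused connections, timeouts, DNS failures and SMTP errors.
        return False
```

A call such as `supports_starttls("smtp.example.internal")` (a placeholder hostname) should be run from a machine on the control plane network, since that is where routability matters.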
Installation User Configuration
Critical: Do NOT install as the root user
- Use a non-root user with sudo privileges (recommended: glab)
- User must have password-less sudo access
- SSH key must be generated and accessible
Timeline Expectations
Total Installation Time: 6-8 hours
- Phase 1 (Prepare): 15-30 minutes
- Phase 2 (Mirror): 1-2 hours
- Phase 3 (Deploy): 2-3 hours
- Phase 4-7 (Configure & IBM Stack): 4-6 hours
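As a quick planning aid, the per-phase estimates can be totaled. The sketch below simply adds the ranges listed above. It covers only the four listed entries and excludes the pre-installation steps.

```python
# Per-phase estimates from the list above, in minutes (low, high).
PHASES = {
    "Phase 1 (Prepare)": (15, 30),
    "Phase 2 (Mirror)": (60, 120),
    "Phase 3 (Deploy)": (120, 180),
    "Phases 4-7 (Configure & IBM Stack)": (240, 360),
}

def total_hours(phases):
    """Sum the low and high ends of every phase and convert to hours."""
    low = sum(lo for lo, _ in phases.values())
    high = sum(hi for _, hi in phases.values())
    return low / 60, high / 60

low, high = total_hours(PHASES)
print(f"Sum of phase estimates: {low:.2f} to {high:.2f} hours")
```

The straight sum comes to 7.25 to 11.5 hours, which is wider than the headline figure, so it may be prudent to budget toward the upper end for a first installation.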
Session Management:
- Use the tmux or screen Linux CLI tools for session management. This is mandatory for long-running installations; otherwise session timeouts interrupt the installation process
- Installation cannot be interrupted without starting over
- Network disconnections will not affect the installation if you use tmux or screen
Installation Process Summary
The IBM Sovereign Core installation follows a carefully orchestrated multi-phase approach that transforms bare metal hardware into a fully operational sovereign cloud platform. Each phase builds upon the previous one, with built-in validation ensuring readiness before proceeding.
Phase 1: Prepare (15-30 minutes)
The preparation phase focuses on gathering and organizing all required installation artifacts on the Landing Zone server.
Key Activities:
- Download installation binaries and extract to `~/SovereignCore`
- Download Red Hat CoreOS (RHCOS) content for bare metal provisioning
- Verify installation package integrity and completeness
- Place configuration files (global.yaml, certificates.yaml, cloud_infra.yaml, template.env) in correct locations
- Generate SSH keys if not already present
- Validate configuration file syntax and completeness
This phase is quick but critical - ensuring all artifacts are in place prevents delays during the automated installation phases.
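A small script can confirm that the four configuration files are present and non-empty before the installer is started. This is a sketch only: the file names come from the list above, but the directory passed in is an assumption, since the exact target locations depend on the installation package.

```python
from pathlib import Path

# File names as listed in the key activities above.
REQUIRED_FILES = ("global.yaml", "certificates.yaml", "cloud_infra.yaml", "template.env")

def missing_or_empty(config_dir):
    """Return the required files that are absent or zero bytes in config_dir."""
    base = Path(config_dir).expanduser()
    return [
        name
        for name in REQUIRED_FILES
        if not (base / name).is_file() or (base / name).stat().st_size == 0
    ]
```

For example, `missing_or_empty("~/SovereignCore")` returns an empty list when all four files are in place in that (assumed) directory. It does not validate the YAML content itself.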
Phase 1.5: Infrastructure Preparation (2-4 hours - Pre-Installation)
**Note**: This phase should be completed before running the installation script.
The foundation phase establishes the physical and network infrastructure required for the sovereign cloud. This begins with provisioning bare metal servers - typically a minimum of nine physical machines: three for the control plane, three for data plane worker nodes, and three for the Ceph storage cluster. Each server must have dual network interfaces to support the segregated network architecture.
The Landing Zone server, running RHEL 10, serves as the deployment orchestration platform. This machine hosts the mirror registry that will cache all container images, the bootstrap scripts that drive the installation, and the configuration files that define the deployment. The Landing Zone must have sufficient disk space (2 TB, per the hardware requirements above) to store mirrored container images and installation artifacts.
Finally, the external Ceph storage cluster is configured to provide both block storage (via RBD) for persistent volumes and object storage (via RadosGW) for the Quay container registry. The storage configuration wizard generates the necessary connection parameters and credentials that will be embedded in the global.yaml configuration file.
Phase 2: Mirror (1-2 hours)
The mirroring phase is one of the most time-consuming but critical phases, ensuring complete data sovereignty by caching all required container images within the boundary.
What Happens:
The bootstrap script connects to external container registries (quay.io, registry.redhat.io, icr.io) and pulls hundreds of container images totaling tens of gigabytes. Each image is then pushed to the local mirror registry running on the Landing Zone. This process includes:
- OpenShift platform images (control plane, operators, console)
- Red Hat CoreOS images for bare metal provisioning
- IBM Sovereign Core platform service images
- Operator images for storage, networking, and monitoring
- Base images for various middleware components
Why It Matters:
Once mirroring completes, the installation becomes completely air-gapped. No external network access is required for the remaining phases, ensuring that all data and images remain within the sovereign boundary. The mirror registry serves all subsequent image pulls, providing fast, local access to container images.
Performance Factors:
- Network bandwidth to external registries (typically 100-500 Mbps)
- Landing Zone disk I/O performance
- Number of images to mirror
- External registry authentication and rate limiting
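The effect of bandwidth on mirroring time can be estimated with simple arithmetic. The sketch below assumes an illustrative 50 GB of images (the article says only "tens of gigabytes") and an idealized link running at its full sustained rate, so real runs will take longer once registry rate limiting and disk I/O are factored in.

```python
def transfer_minutes(size_gb, bandwidth_mbps):
    """Idealized transfer time for decimal gigabytes over a link rated in megabits per second."""
    megabits = size_gb * 8 * 1000  # 1 GB = 8000 megabits (decimal units)
    return megabits / bandwidth_mbps / 60

for mbps in (100, 500):
    print(f"50 GB at {mbps} Mbps: about {transfer_minutes(50, mbps):.0f} minutes")
```

Under these assumptions the raw transfer takes about 67 minutes at 100 Mbps and about 13 minutes at 500 Mbps, which suggests that at higher bandwidths the other factors in the list are likely to dominate.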
Phase 2.5: Configuration Generation (30 minutes - Pre-Installation)
Note: This phase uses the pre-installation tool and should be completed before running the installation script.
With infrastructure in place, the configuration phase uses the IBM Sovereign Core pre-installation tool to generate validated configuration files. This web-based wizard eliminates the error-prone process of manually editing YAML files.
The pre-installation tool runs as a containerized application on the Landing Zone, accessible via web browser on port 3003. The tool presents a guided interface organized into three main sections: Infrastructure, Control Plane, and Tenant Worker Planes. Each section contains multiple forms that capture specific configuration parameters.
In the Infrastructure section, you define network parameters (CIDR ranges, gateway addresses, VIPs), DNS settings (domain names, nameserver addresses), and certificate information (API certificates, ingress certificates, CA root certificates). The tool validates each field in real-time, ensuring IP addresses are properly formatted, CIDR ranges are valid, and certificates match their corresponding private keys.
The Control Plane section captures details about the three master nodes: MAC addresses for network booting, IP addresses for cluster communication, Redfish/IPMI endpoints for out-of-band management, and root disk identifiers. The tool validates that MAC addresses are unique, IP addresses fall within the defined machine network, and Redfish endpoints are accessible.
The Tenant Worker Planes section similarly captures data plane node information, allowing you to define as many worker nodes as needed for your deployment. Each node requires the same set of parameters as control plane nodes.
As you complete each form, the tool auto-saves your progress to the persistent volume, allowing you to pause and resume configuration at any time. A progress indicator shows completion status for each section, helping you track your progress through the configuration process.
When all required fields are completed, the tool generates a compressed archive containing four configuration files: global.yaml (cluster-wide settings), certificates.yaml (TLS certificates and keys), cloud_infra.yaml (tenant plane node definitions), and template.env (environment variables for the bootstrap script). These files are validated against the expected schema before generation, catching any inconsistencies or missing required fields.
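The kinds of checks the tool is described as performing (unique MAC addresses, node IPs inside the machine network) can be illustrated in a few lines. This is a sketch of the idea, not the tool's implementation: the `name`, `mac`, and `ip` field names and the MAC addresses are hypothetical and do not reflect the real configuration schema.

```python
import ipaddress
from collections import Counter

def check_nodes(nodes, machine_cidr):
    """Return a list of problems found in a list of node definitions."""
    network = ipaddress.ip_network(machine_cidr)
    problems = []
    # MAC addresses are compared case-insensitively.
    macs = Counter(node["mac"].lower() for node in nodes)
    for mac, count in macs.items():
        if count > 1:
            problems.append(f"duplicate MAC address {mac}")
    for node in nodes:
        if ipaddress.ip_address(node["ip"]) not in network:
            problems.append(f"{node['name']}: {node['ip']} is outside {machine_cidr}")
    return problems

# Control plane IPs from this article's examples; MAC addresses are made up.
NODES = [
    {"name": "cp-1", "mac": "52:54:00:00:00:21", "ip": "192.168.2.21"},
    {"name": "cp-2", "mac": "52:54:00:00:00:22", "ip": "192.168.2.22"},
    {"name": "cp-3", "mac": "52:54:00:00:00:23", "ip": "192.168.2.23"},
]
print(check_nodes(NODES, "192.168.2.0/24") or "no problems found")
```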
Phase 3: Deploy OpenShift Cluster into control plane (2-3 hours)
The deployment phase provisions the bare metal servers and installs the OpenShift control plane cluster.
Bare Metal Provisioning:
The Metal3 baremetal-operator and Ironic services work together to provision the three control plane nodes:
1. Machine Discovery: Ironic connects to each node's Redfish/IPMI interface to inventory hardware
2. Power Management: Nodes are powered on and configured for network boot
3. ISO Mounting: Ironic mounts the RHCOS ISO via Redfish virtual media
4. OS Installation: RHCOS is installed to the root disk specified in global.yaml
5. Network Configuration: Network interfaces are configured per the machine network settings
6. Disk Partitioning: Disks are partitioned for container storage and system use
OpenShift Installation:
Using the Agent-based installer, the bootstrap script creates the initial control plane:
- Bootstrap Node: Temporary bootstrap node initializes the cluster
- etcd Cluster: Distributed key-value store for cluster state (3 replicas)
- API Server: OpenShift API endpoint for cluster management
- Controller Manager: Manages cluster controllers and operators
- Scheduler: Assigns pods to nodes based on resource requirements
- Networking: Software-defined networking (OVN-Kubernetes) is configured
- Bootstrap Removal: Temporary bootstrap node is decommissioned
Validation:
The script validates that all control plane nodes are healthy, etcd has quorum, and the API server is responding before proceeding to the next phase.
Phase 4-7: Configure & Deploy IBM Sovereign Core Stack (4-6 hours)
This extended phase encompasses post-installation configuration, operator deployment, and IBM Sovereign Core platform services installation.
Phase 4: Post-Install Configuration (30-60 minutes)
- Configure OpenShift authentication and authorization
- Set up cluster monitoring and alerting
- Configure network policies and security contexts
- Apply cluster-wide configuration settings
- Install required operators (ODF, ACM, GitOps)
Phase 5: Storage Integration (30-60 minutes)
- Deploy OpenShift Data Foundation (ODF) in external mode
- Connect to Ceph storage cluster using odfExternalConfig
- Create storage classes for block and object storage
- Validate persistent volume provisioning
- Deploy Enterprise Quay registry with object storage backend
Phase 6: Hardware Discovery & Data Plane (1-2 hours)
- Discovery service runs persistently on Landing Zone
- Ironic and baremetal operators discover data plane nodes
- Nodes are tagged with labels for workload placement
- Data plane nodes are provisioned using cloud_infra.yaml configuration
- Nodes join the cluster as workers
- Node validation ensures all nodes are ready
Phase 7: IBM Sovereign Core Stack Deployment (2-3 hours)
- **ArgoCD Installation**: GitOps engine for continuous deployment
- **Pre-Requirements**: Install dependencies and prerequisites
- **Pipeline Execution**: Tekton pipeline orchestrates IBM software stack installation
- **Platform Services**: Deploy MSP UI, IAM, Catalog, Observability components
- **Service Validation**: Health checks ensure all services are operational
- **DNS Configuration**: Add tenant cluster DNS entries or update /etc/hosts
Monitoring Progress:
- ArgoCD provides deployment visibility - check for sync errors and application health
- Review pipeline runs for IBM stack installation progress
- Monitor logs in `~/logs/` for detailed execution traces
- Check OpenShift console for operator and pod status
Tools in Action:
- Ironic: Handles machine checks, manages server reboots, mounts ISO via Redfish
- ArgoCD: Provides deployment visibility, monitors sync status, tracks application health
- Tekton: Orchestrates pipeline execution for IBM stack deployment
The installation phase is where the magic happens - the bootstrap script orchestrates a complex series of operations that transform bare metal servers into a running OpenShift cluster with IBM Sovereign Core platform services.
The process begins by extracting the installation package on the Landing Zone and placing the generated configuration files in their designated locations. Starting a tmux session ensures the installation can continue even if your SSH connection drops - a critical consideration for a multi-hour deployment.
When you execute `bootstrap.sh`, the script first performs comprehensive pre-flight validation. It checks network connectivity to all defined nodes, validates that Redfish endpoints are accessible with the provided credentials, confirms DNS resolution is working correctly, verifies storage cluster connectivity, and ensures all required configuration parameters are present and properly formatted. If any validation fails, the script stops immediately with a clear error message, allowing you to correct the issue before wasting hours on a doomed installation.
Once validation passes, the script initiates the image mirroring phase. This critical step pulls all required container images from external registries (quay.io, registry.redhat.io, icr.io) and pushes them to the local mirror registry on the Landing Zone. This process typically takes 1-2 hours depending on network bandwidth and includes hundreds of images totaling tens of gigabytes. The mirroring ensures that once installation begins, no external network access is required - all images are served from within the sovereign boundary.
With images mirrored, the script deploys the Metal3 infrastructure components - the baremetal-operator and ironic services that manage bare metal provisioning. These components use Redfish/IPMI to power on the control plane nodes, configure them to network boot, and provision them with Red Hat CoreOS. The provisioning process includes partitioning disks, installing the operating system, and configuring network interfaces according to the specifications in global.yaml.
As control plane nodes come online, the script initiates the OpenShift installation using the Agent-based installer. This process creates the initial control plane cluster, configures etcd for distributed state management, deploys the OpenShift API server and controllers, and establishes the software-defined networking layer. The three control plane nodes form a highly available cluster with automatic leader election and state replication.
Once the control plane is stable, the script deploys OpenShift Data Foundation (ODF) in external mode, connecting to the Ceph storage cluster. This provides persistent storage capabilities for platform services and tenant workloads. The Quay container registry is deployed next, configured to use Ceph object storage as its backend, providing a sovereign-boundary registry for tenant container images.
The final stage deploys IBM Sovereign Core platform services: the MSP UI for platform administration, IAM services for identity and access management, the catalog service for service provisioning, observability components for monitoring and logging, and the GitOps infrastructure for continuous deployment. Each service is deployed via ArgoCD applications, with health checks ensuring successful deployment before proceeding to the next component.
Throughout the installation, detailed logs are written to `~/logs/` with timestamps, allowing you to monitor progress in real-time. The logs capture every command executed, every API call made, and every validation performed, providing a complete audit trail of the installation process.
Post-Installation Verification (30 minutes)
The verification phase confirms that all components are functioning correctly and the platform is ready for tenant onboarding.
DNS verification ensures that all cluster endpoints resolve correctly from both internal and external networks. You test resolution of the API endpoint (e.g., api.core.sovereign.local), ingress wildcard (e.g., *.apps.core.sovereign.local), and the MSP UI (e.g., mspui.apps.core.sovereign.local). Successful resolution confirms that the DNS configuration is correct and the gateway is properly routing traffic.
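The DNS check described here can be scripted with the standard `socket` module. In this sketch, `unresolved` is a hypothetical helper, and the hostnames in the usage note are the article's example names. The wildcard is tested by resolving an arbitrary name under `apps`.

```python
import socket

def unresolved(hostnames):
    """Return the subset of hostnames that the local resolver cannot resolve."""
    failed = []
    for name in hostnames:
        try:
            socket.getaddrinfo(name, None)
        except socket.gaierror:
            failed.append(name)
    return failed
```

For example, `unresolved(["api.core.sovereign.local", "anything.apps.core.sovereign.local", "mspui.apps.core.sovereign.local"])` should return an empty list on a correctly configured network. Running it from both an internal and an external host covers both views of DNS.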
Certificate validation uses OpenSSL to verify the complete certificate chain for both API and ingress endpoints. This confirms that certificates are properly installed, the chain of trust is intact, and certificates are not expired. Browser access to the MSP UI should show a valid certificate without warnings.
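The article describes this check with OpenSSL. As an alternative illustration, the same idea can be sketched with Python's standard `ssl` module, which verifies the chain and hostname during the handshake and exposes the expiry date. The helper names are hypothetical, and the port is an assumption: 443 is typical for ingress, while the OpenShift API is commonly served on 6443.

```python
import socket
import ssl
import time

def days_until_expiry(not_after, now=None):
    """Days remaining, given a 'notAfter' string such as 'Jan  1 00:00:00 2030 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400

def fetch_not_after(host, port=443, timeout=10.0):
    """Connect with default verification (chain and hostname) and return notAfter."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # The handshake raises ssl.SSLCertVerificationError if the chain is not trusted.
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A call such as `days_until_expiry(fetch_not_after("mspui.apps.core.sovereign.local"))` then gives the remaining validity. The default context trusts the system CA store, so a private root CA would need to be loaded into the context first.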
API endpoint testing uses curl to verify that the OpenShift API server is responding correctly. The `/healthz` endpoint should return "ok", confirming that the API server is healthy and accepting requests. You can also test authentication by logging in with the kubeadmin credentials and running basic oc commands.
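The article uses curl for this test. The sketch below shows the equivalent logic in Python with the standard library. The helper names, the optional CA file parameter, and the example URL (including port 6443) are assumptions for illustration.

```python
import ssl
import urllib.request

def is_healthy(status, body):
    """The API server is considered healthy when /healthz returns HTTP 200 with body 'ok'."""
    return status == 200 and body.strip() == "ok"

def check_healthz(api_url, ca_file=None, timeout=10.0):
    """Fetch <api_url>/healthz over TLS and evaluate the response."""
    # ca_file lets you trust a private CA; None uses the system trust store.
    context = ssl.create_default_context(cafile=ca_file)
    url = f"{api_url}/healthz"
    with urllib.request.urlopen(url, context=context, timeout=timeout) as resp:
        return is_healthy(resp.status, resp.read().decode())
```

A call might look like `check_healthz("https://api.core.sovereign.local:6443")`. Connection or certificate failures raise an exception rather than returning False, which keeps a TLS problem distinguishable from an unhealthy API server.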
Storage connectivity verification confirms that persistent volume claims can be created and bound successfully. You create a test PVC using the ODF storage class, verify it binds to a persistent volume, and confirm that pods can mount and write to the volume. This validates the complete storage stack from OpenShift through ODF to the Ceph cluster.
Finally, you access the MSP UI at `https://mspui.apps.core.sovereign.local` and log in with the admin credentials configured in the template.env file. The UI should load successfully, showing the platform dashboard with all services in a healthy state. You can navigate through the various sections - catalog, clusters, observability - confirming that all platform services are operational.
With verification complete, the IBM Sovereign Core platform is ready for production use. MSPs can begin onboarding tenants, deploying workloads, and managing the sovereign cloud infrastructure through the comprehensive management interface.
Lessons Learned & Best Practices
Based on real-world deployments, these lessons learned can save hours of troubleshooting and prevent common installation failures.
Pre-Installation Best Practices
Validate Everything Before Starting:
- Ensure precise requirement fulfillment - don't assume prerequisites are met
- Maintain timeline awareness - plan for 7-9 hours of continuous execution
- Verify network connectivity - test all three networks (IPMI, Compute, Public)
- Confirm DNS resolution - test all required DNS entries before installation
- Validate storage requirements - ensure storage cluster is operational and accessible
Common Pre-Installation Pitfalls
- Skipping prerequisite validation causes delays - the #1 cause of installation failures
- VIPs attached to other devices - causes IP conflicts and cluster instability
- Insufficient Landing Zone disk space - installation fails during image mirroring
- Invalid certificates - causes API and ingress endpoint failures
- Root user installation - creates permission issues and security concerns
During Installation Best Practices
Follow the Process Strictly:
- Follow installation sequence strictly - phases must complete in order
- Validate each component before proceeding - don't skip validation steps
- Maintain detailed installation logs - critical for troubleshooting
- Test connectivity between components immediately - catch network issues early
- Document any deviations or workarounds - helps with future installations
Monitoring and Troubleshooting:
- Use ArgoCD for deployment visibility - check for sync errors and application health
- Review logs continuously:
  - `~/logs/*.log` - Bootstrap script execution logs
  - `ocp-cluster/.openshift_install.log` - Bootstrap and cluster creation logs
  - Pipeline runs - IBM stack installation logs
- Check node status regularly - ensure nodes are joining the cluster correctly
- Monitor resource utilization - insufficient resources cause performance issues
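Reviewing logs by hand over a multi-hour run is tedious, so a small scanner can help surface problems. This sketch assumes that failures appear as lines containing the word "error" (the actual log format is not documented here), and `find_error_lines` is a hypothetical helper.

```python
from pathlib import Path

def find_error_lines(log_dir, needle="error"):
    """Return (file name, line number, text) for every line containing needle, case-insensitively."""
    hits = []
    for path in sorted(Path(log_dir).expanduser().glob("*.log")):
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line.lower():
                    hits.append((path.name, number, line.rstrip()))
    return hits
```

For example, `find_error_lines("~/logs")` scans the bootstrap logs directory mentioned above. Running it periodically inside a second tmux window is one way to keep an eye on the installation.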
Common Installation Pitfalls:
- Incorrect installation order leads to failures - dependencies must be met
- Missing network policies block communication - validate network policies
- Insufficient resources cause performance issues - ensure adequate CPU/memory
- DNS configuration errors - add tenant cluster DNS entries or use /etc/hosts
Post-Installation Best Practices
Comprehensive Validation:
- Verify all services are running and healthy - check pod status across all namespaces
- Test end-to-end functionality thoroughly - don't assume services work
- Document final configuration settings - critical for operations and troubleshooting
- Validate storage provisioning - create test PVCs and verify binding
- Test external access - ensure ingress routes are accessible
Ongoing Management:
- Use ACM for ongoing node management - centralized multi-cluster management
- Discovery service runs persistently - handles new node discovery automatically
- Tag nodes properly for smooth deployment - use node labels for workload placement
- Monitor ArgoCD continuously - catch drift and sync issues early
Troubleshooting Guide
ArgoCD Issues:
- Check application sync status in ArgoCD UI running on control plane
- Review sync errors and application health
- Ensure proper RBAC permissions
Cluster Creation Issues:
- Review `ocp-cluster/.openshift_install.log` for bootstrap errors
- Check Redfish/IPMI connectivity to nodes
- Validate network configuration and VIPs
- Ensure DNS resolution is working
IBM Sovereign Core UI Stack Installation Issues:
- Review pipeline run logs for failures
- Check operator installation status
- Validate storage class configuration
- Ensure all prerequisites are met
Storage Issues:
- Validate Ceph cluster health
- Check ODF operator status
- Test storage class provisioning
- Review storage configuration in global.yaml
Network Issues:
- Verify all three networks are operational
- Validate DNS resolution from all nodes
- Test connectivity between components
Key Success Factors
1. Preparation is Everything: 80% of installation success depends on proper preparation
2. Validate, Don't Assume: Test every prerequisite before starting
3. Use tmux/screen: Session management is mandatory for long installations
4. Monitor Continuously: Watch logs and ArgoCD throughout installation
5. Document Everything: Detailed notes save hours during troubleshooting
6. Plan for Time: Block 8-10 hours for first installation (including validation)
7. Clean Reinstalls: If reinstalling, clean up completely before starting over
8. Storage First: Ensure storage is working before starting installation
9. Network Validation: Test all network connectivity before proceeding
10. Certificate Accuracy: Double-check certificate domains and validity