This is the final edition of my Go-Live learnings blog series, and it discusses IT hygiene. The objective is to explain the elementary yet mandatory areas of the Go-Live process and how the slightest compromise on any of them can lead to exponential costs for a business.
1. Rollback management
Due to the varied factors influencing a Go/No-Go decision, such as business pressure or financial commitments to internal or external stakeholders, an unfavourable Go-Live decision is sometimes taken even though the systems are not ready. Many IT teams have a namesake rollback strategy that looks good on paper but has never really been vetted, while a few IT teams do not have one at all! Rollback plans are downplayed in several implementations, with no proper reviews, audits, or approvals in place.
In the event of a roll-out catastrophe, IT buckles under pressure and mandates a rollback. However, when the team realizes that the rollback workflows were never tested or mock-executed, it sends shivers across the enterprise stakeholders, exposing the business to huge expenses and, in extreme cases, an eventual shutdown. To build stronger rollback plans, it is strongly recommended to perform mock No-Go workflows that initiate rollbacks, and then carry those learnings into the actual Go-Live.
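A mock No-Go drill like the one described above can be automated so that a gap in the plan surfaces during rehearsal rather than during a live rollback. The sketch below is illustrative only: the step names and actions are hypothetical placeholders, not Sterling OMS commands.

```python
# Minimal sketch of a rollback rehearsal runner (illustrative only).
# Step names and actions below are hypothetical placeholders.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class RollbackPlan:
    steps: List[Tuple[str, Callable[[], bool]]] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def add_step(self, name: str, action: Callable[[], bool]) -> None:
        self.steps.append((name, action))

    def rehearse(self) -> bool:
        """Execute every step in order; stop at the first failure so the
        gap is found during the mock run, not during a live rollback."""
        for name, action in self.steps:
            ok = action()
            self.log.append(f"{name}: {'PASS' if ok else 'FAIL'}")
            if not ok:
                return False
        return True

# Usage: a mock No-Go drill with placeholder actions.
plan = RollbackPlan()
plan.add_step("restore_db_snapshot", lambda: True)
plan.add_step("redeploy_previous_build", lambda: True)
plan.add_step("smoke_test_critical_flows", lambda: True)
print(plan.rehearse())  # True only if every step passes
```

The value of the rehearsal is the log it produces: it becomes the review artifact that the audits and approvals mentioned above can sign off on.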
2. High Availability and Disaster Recovery
High Availability and Disaster Recovery (HADR) configuration is an essential part of any application go-live, whether the deployment model is on-premise or cloud. Often these configurations are done but never tested even once by any team. To ensure business continuity, it is mandatory to include HADR scenarios and validate the test cases, execution status, and test results. Ensure all stakeholders are aware of the configurations below and that their results are approved.
- Primary: the transaction schema, which is read/write and runs in an active-passive configuration mode with HA.
- Standby for High Availability: a read-only schema while the primary is active. The same hardware (CPU, memory, disk, etc.) as the primary is recommended on the standby, so that the standby has enough capacity for replay if the primary goes down.
- Standby for Disaster Recovery: the network can become the bottleneck in DR systems because the standby sits in a different data center. A dedicated private network (separate NIC, router, etc.) between the primary and standby is recommended for both performance and security reasons.
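The hardware-parity recommendation for the HA standby above is easy to verify automatically as part of the go-live checklist. Here is a minimal sketch; the attribute names and values are illustrative assumptions, not output of any real inventory tool.

```python
# Sketch: verify that a standby matches the primary's hardware profile,
# per the recommendation above. Attribute names/values are illustrative.
def hadr_parity_gaps(primary: dict, standby: dict) -> list:
    """Return the attributes where the standby differs from the primary
    (an empty list means the two configurations are at parity)."""
    return [k for k in primary if standby.get(k) != primary[k]]

primary_spec = {"cpu_cores": 16, "memory_gb": 64, "disk_tb": 2}
standby_spec = {"cpu_cores": 16, "memory_gb": 32, "disk_tb": 2}
print(hadr_parity_gaps(primary_spec, standby_spec))  # ['memory_gb']
```

Running such a check before sign-off turns "same hardware is recommended" from a slide bullet into a verifiable gate.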
3. Data and application security
Data and application security are critical aspects of Go-Live, and all layers of an application need to be considered.
- Application server nodes of the Sterling Order Management application cluster must not be directly exposed in the DMZ (demilitarized zone); all interactions between OMS and third parties must go through a load balancer only.
- Install security gateways in the DMZ (e.g., IBM DataPower) that exchange tokens between third-party requests and responses, authenticate them, and then cascade those transactions to OMS.
- In persona-based applications hosted on the internet for business users, leaving the CSRF (Cross-Site Request Forgery) tokenization mechanism disabled for IBM Call Center and IBM Sterling Store can pose a serious threat to data security.
- Database encryption should be implemented for all keys, access tokens, and credentials.
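To make the CSRF point above concrete, here is a minimal sketch of a token round trip using only the Python standard library: a token is bound to the user's session via an HMAC, so a token stolen from one session fails verification in another. This is purely illustrative; real deployments should rely on the framework's built-in CSRF protection rather than a hand-rolled scheme.

```python
# Sketch of a CSRF token round trip (illustrative only; use your
# framework's built-in CSRF protection in production).
import hashlib
import hmac
import secrets

SECRET_KEY = secrets.token_bytes(32)  # per-deployment secret (illustrative)

def issue_token(session_id: str) -> str:
    """Bind a token to the user's session so it cannot be replayed
    from a different session."""
    return hmac.new(SECRET_KEY, session_id.encode(), hashlib.sha256).hexdigest()

def verify_token(session_id: str, token: str) -> bool:
    """Constant-time comparison to avoid timing side channels."""
    expected = issue_token(session_id)
    return hmac.compare_digest(expected, token)

tok = issue_token("session-abc")
print(verify_token("session-abc", tok))  # True
print(verify_token("session-xyz", tok))  # False
```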
4. Whistles and Blowers
Though proactive monitoring of the application is what most businesses aspire to, most of the time monitoring operates in reactive mode, pushing the IT support and operations teams to resolve incidents post-facto just to bring the application back on track. Monitoring must therefore cover all application layers, including the application server, database, application runtime, batch jobs, queue depth, connection pools, memory heap, CPU cycles, disk I/O, and disk size.
- Implement a health-monitor agent that checks the heartbeat of the running Java Virtual Machines.
- Enable INFO-level logging and analyze the logs in a log-analysis tool (Splunk, Dynatrace, Grafana, etc.) for anomalies.
- Monitor heap and thread dumps at regular intervals for memory-intensive processing jobs.
- Build sanity-check queries to analyze the throughput of APIs, the order processing rate, backorder pile-up, and the inventory consumption rate, and automate publishing these results to stakeholders every 4 to 6 hours during the hypercare window (the first 2 weeks) of go-live.
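The sanity-check idea in the last bullet can be sketched as a small function that compares the collected counters against thresholds and produces the summary to be published to stakeholders. The metric names and threshold values below are illustrative assumptions; real numbers would come from your sanity-check queries and business SLAs.

```python
# Sketch of a hypercare sanity check: compare order throughput and
# backorder pile-up against thresholds and build a stakeholder summary.
# Metric names and threshold values are illustrative assumptions.
def sanity_report(metrics: dict,
                  min_orders_per_hr: int = 500,
                  max_backorders: int = 100) -> dict:
    alerts = []
    if metrics["orders_per_hr"] < min_orders_per_hr:
        alerts.append("order throughput below threshold")
    if metrics["backorders"] > max_backorders:
        alerts.append("backorder pile-up exceeds threshold")
    return {"status": "ALERT" if alerts else "OK", "alerts": alerts}

# A scheduler (cron, etc.) would run this every 4-6 hours during hypercare
# and publish the result to stakeholders.
print(sanity_report({"orders_per_hr": 620, "backorders": 40}))
# {'status': 'OK', 'alerts': []}
```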
Thanks for reading my blog series, which covered my learnings, views, and perspectives on Go-Live. Please feel free to share your comments.
Links to previous editions: