Maximo

  • 1.  Gracefully handle MAS Manage downtimes

    Posted Wed July 10, 2024 10:21 AM

    Hi,

    we're looking for a way to gracefully handle MAS Manage downtimes.
    In essence, we need the ability to redirect user requests that originally point to MAS Manage to a static web page while Manage is down (e.g. during the updatedb phase of the build process, before the server bundle POD becomes ready).
    For simplicity, let's assume that downtime detection is fully manual, so it's the Manage deployer's responsibility to activate and deactivate the redirect.

    With classic Maximo we did this with a conditional rewrite rule in the IHS configuration, which intercepted user requests using a "file exists" test. To activate the "maintenance page" redirect we created a file that triggered the rewrite, and once done we simply deleted the file, restoring normal operations.
    It was just one of the ways to achieve what we intended, and it was sufficient for our needs. Of course there are other ways to do it, e.g. updating DNS records or putting a reverse proxy between end users and Maximo. We consider these options less attractive because they either unnecessarily increase application maintenance complexity (a reverse proxy is yet another component to install and maintain) or add a dependency on other resources (IT personnel privileged to update DNS records must be available).

    With MAS Manage we noticed that the Routes in the mas-<inst>-manage namespace that handle requests to server bundles are managed by the operator, and that changes to a Route's essential settings (e.g. spec.to.name, spec.port.targetPort) are overwritten with every MAS Manage operator reconciliation cycle.
    Therefore we cannot mimic the approach we used in the classic deployment by simply updating the MAS Manage route(s) to point to some custom service, e.g. one running nginx, which serves static content.

    It could be that we're missing something or doing something wrong. 
    Do you have any suggestions how to achieve what we're aiming for?
    Alternatively I would love to hear how else you're dealing with MAS Manage downtimes so that users see something more valuable than raw "HTTP 500 Service unavailable"? Any tips will be highly appreciated!

    Thank you!



    ------------------------------
    Andrzej Więcław
    Maximo Technical Consultant
    AFRY
    Wrocław, Poland
    ------------------------------


  • 2.  RE: Gracefully handle MAS Manage downtimes

    Posted Mon July 15, 2024 09:50 AM

    Hi Andrzej:

    Normally you should not get "Error 500" responses when all Pods are down... As Manage ingress is "managed" by a standard OpenShift Route, when all Pods are down, you should get "Error 503 - Application is not available" with a somewhat descriptive message from OpenShift stating that:

    Application is not available

    The application is currently not serving requests at this endpoint. It may not have been started or is still starting.

    Possible reasons you are seeing this page:

    • The host doesn't exist. Make sure the hostname was typed correctly and that a route matching this hostname exists.
    • The host exists, but doesn't have a matching path. Check if the URL path was typed correctly and that the route was created using the desired path.
    • Route and path matches, but all pods are down. Make sure that the resources exposed by this route (pods, services, deployment configs, etc) have at least one pod running.

    The above page is generic/standard but can be customized as per section "Customizing HAProxy error code response pages" on the URL: https://docs.openshift.com/container-platform/4.12/networking/ingress-operator.html 

    Unfortunately, it seems to be an "all or nothing" situation: the customized page will be shown for every Route that has no available serving Pods (including non-existing ones), so it will not be specific to Manage.
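
    As a rough illustration of the customization mentioned above, the sketch below builds the two objects that the referenced documentation section describes: a ConfigMap in the openshift-config namespace holding raw HTTP responses under the keys error-page-503.http and error-page-404.http, and a merge patch that sets spec.httpErrorCodePages.name on the default ingress controller. The ConfigMap name maintenance-error-pages and the page texts are hypothetical; please verify the details against the documentation for your OCP version.

```python
import json


def build_error_page(status_line: str, title: str, message: str) -> str:
    """Return a raw HTTP response: status line, headers, blank line, HTML body."""
    body = (
        "<html><head><title>{t}</title></head>"
        "<body><h1>{t}</h1><p>{m}</p></body></html>"
    ).format(t=title, m=message)
    headers = [
        status_line,
        "Pragma: no-cache",
        "Cache-Control: private, max-age=0, no-cache, no-store",
        "Connection: close",
        "Content-Type: text/html",
    ]
    return "\n".join(headers) + "\n\n" + body + "\n"


def build_error_page_configmap(name: str, page_503: str, page_404: str) -> dict:
    """ConfigMap manifest holding the custom pages (name is a hypothetical choice)."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": "openshift-config"},
        "data": {
            "error-page-503.http": page_503,
            "error-page-404.http": page_404,
        },
    }


def build_ingress_patch(configmap_name: str) -> dict:
    """Merge patch pointing the ingress controller at the ConfigMap."""
    return {"spec": {"httpErrorCodePages": {"name": configmap_name}}}


page_503 = build_error_page(
    "HTTP/1.0 503 Service Unavailable",
    "Maintenance in progress",
    "The application is temporarily unavailable. Please try again later.",
)
page_404 = build_error_page(
    "HTTP/1.0 404 Not Found",
    "Page not found",
    "No application is published at this address.",
)
configmap = build_error_page_configmap("maintenance-error-pages", page_503, page_404)
patch = build_ingress_patch("maintenance-error-pages")
print(json.dumps(patch))
```

    The manifest can be dumped to a file and applied with oc apply, and the printed patch can be passed to oc patch --type=merge against ingresscontroller/default in the openshift-ingress-operator namespace.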

    Hope the above helps somewhat.

    Regards,

    Julio Perera

    Maximo Technical Consultant

    Interloc Solutions, US.



    ------------------------------
    Julio Perera
    ------------------------------



  • 3.  RE: Gracefully handle MAS Manage downtimes

    Posted Tue July 16, 2024 03:45 AM

    Hi Julio,

    you're right, when the PODs are down we indeed get HTTP 503, rather than the HTTP 500 I mentioned.
    I haven't yet explored the possibilities around HAProxy customization, but maybe this is a way...

    Thank you!



    ------------------------------
    Andrzej Więcław
    Maximo Technical Consultant
    AFRY
    Wrocław, Poland
    ------------------------------



  • 4.  RE: Gracefully handle MAS Manage downtimes

    Posted Mon July 22, 2024 07:46 AM

    Hi Julio,

    your suggestion concerning Ingress Operator adjustments inspired me to keep exploring the documentation. This way I found a feature called Ingress sharding, which caught my attention.

    As per documentation: By default, the Ingress Controller serves any route created in any namespace in the cluster. You can add additional Ingress Controllers to your cluster to optimize routing by creating shards, which are subsets of routes based on selected characteristics. To mark a route as a member of a shard, use labels in the route or namespace metadata field. The Ingress Controller uses selectors, also known as a selection expression, to select a subset of routes from the entire pool of routes to serve.

    I have built a solution which consists of the following configuration steps:

    1. Create a new ingress controller (maintenance) that handles either all routes in a labeled namespace (ref. .spec.namespaceSelector) or individual labeled routes (ref. .spec.routeSelector), using a label such as type=maintenance.
      IMPORTANT: 
      Set the new ingress controller's domain property (.spec.domain) to the same value as the OCP cluster domain. This avoids a conflict between the default and maintenance ingress controllers, while the default ingress controller keeps the higher priority when an ingress controller is selected to handle requests based on HTTP host/path.
    2. Update the default ingress controller to exclude all routes that are to be handled by the maintenance ingress controller (ref. Sharding the default Ingress Controller).
    3. Configure a new deployment and service to handle user traffic during the service window, when "maintenance mode" is on.
    4. Generate clones of the MAS Core, MAS Manage, etc. routes and repoint .spec.to.name and .spec.port.targetPort to the service created in the previous step. We'll call these maintenance routes from here on.
    5. Label the newly generated maintenance routes (or namespace) with type=maintenance.
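
    Steps 1, 2, 4 and 5 above can be sketched as plain manifest builders. This is only an illustration under assumptions, not the published scripts: the domain apps.example.com, the route and service names, the "-maintenance" name suffix and the port name are all hypothetical, and the selector shapes follow the OpenShift sharding documentation as I understand it.

```python
import copy

MAINTENANCE_LABEL = {"type": "maintenance"}


def maintenance_ingress_controller(cluster_domain: str, replicas: int = 1) -> dict:
    """Step 1: extra ingress controller selecting routes labeled type=maintenance."""
    return {
        "apiVersion": "operator.openshift.io/v1",
        "kind": "IngressController",
        "metadata": {"name": "maintenance", "namespace": "openshift-ingress-operator"},
        "spec": {
            "domain": cluster_domain,  # same value as the cluster domain, per the post
            "replicas": replicas,
            "routeSelector": {"matchLabels": dict(MAINTENANCE_LABEL)},
        },
    }


def default_controller_exclusion_patch() -> dict:
    """Step 2: merge patch so the default controller skips type=maintenance routes."""
    return {
        "spec": {
            "routeSelector": {
                "matchExpressions": [
                    {"key": "type", "operator": "NotIn", "values": ["maintenance"]}
                ]
            }
        }
    }


def clone_route_for_maintenance(route: dict, service_name: str, target_port: str) -> dict:
    """Steps 4-5: copy a route, repoint it to the maintenance service, label it."""
    clone = copy.deepcopy(route)
    meta = route["metadata"]
    # Rebuild metadata so ownerReferences, resourceVersion, uid etc. are dropped.
    clone["metadata"] = {
        "name": meta["name"] + "-maintenance",  # hypothetical naming convention
        "namespace": meta["namespace"],
        "labels": {**meta.get("labels", {}), **MAINTENANCE_LABEL},
    }
    clone.pop("status", None)
    clone["spec"]["to"] = {"kind": "Service", "name": service_name, "weight": 100}
    clone["spec"]["port"] = {"targetPort": target_port}
    return clone


# Hypothetical original route, trimmed to the fields relevant here.
original = {
    "apiVersion": "route.openshift.io/v1",
    "kind": "Route",
    "metadata": {"name": "manage-ui", "namespace": "mas-inst1-manage", "uid": "abc"},
    "spec": {
        "host": "manage.apps.example.com",
        "to": {"kind": "Service", "name": "manage-ui-svc", "weight": 100},
        "port": {"targetPort": "https"},
    },
    "status": {},
}
maintenance_route = clone_route_for_maintenance(original, "maintenance-page", "http")
controller = maintenance_ingress_controller("apps.example.com")
```

    The clone keeps the original host, which is what lets it take over the traffic once the labels are swapped; the original route object itself is never modified, so the operator's reconciliation has nothing to revert.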

    Once everything is in place:

    1. Enable maintenance mode by:
      1. labeling original MAS Core, MAS Manage, etc. routes (or namespace - ref. sharding using route or namespace labels) with type=maintenance
      2. removing type=maintenance label from maintenance routes (or namespace).
    2. Disable maintenance mode by doing the opposite of Enable.
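
    The label swap can be scripted along these lines. This is a minimal sketch that only builds the oc label commands, assuming route-level labels (not namespace labels); the namespaces and route names are hypothetical.

```python
from typing import List, Tuple

Route = Tuple[str, str]  # (namespace, route name)


def label_swap_commands(
    original_routes: List[Route],
    maintenance_routes: List[Route],
    enable: bool,
) -> List[List[str]]:
    """Build the oc commands that turn maintenance mode on or off.

    Enabling hides the original routes from the default ingress controller
    (adds type=maintenance) and exposes the maintenance routes (removes the
    label). Disabling does the opposite.
    """
    if enable:
        to_label, to_unlabel = original_routes, maintenance_routes
    else:
        to_label, to_unlabel = maintenance_routes, original_routes

    commands = []
    for namespace, name in to_label:
        commands.append(
            ["oc", "label", "route", name, "-n", namespace,
             "type=maintenance", "--overwrite"]
        )
    for namespace, name in to_unlabel:
        # A trailing "-" after the key removes the label.
        commands.append(["oc", "label", "route", name, "-n", namespace, "type-"])
    return commands


originals = [("mas-inst1-manage", "manage-ui")]
clones = [("mas-inst1-manage", "manage-ui-maintenance")]
enable_cmds = label_swap_commands(originals, clones, enable=True)
disable_cmds = label_swap_commands(originals, clones, enable=False)
for cmd in enable_cmds:
    print(" ".join(cmd))
```

    Each command list can be passed to subprocess.run(cmd, check=True) on a machine where oc is logged in to the cluster.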

    When "maintenance mode" is enabled, the default ingress controller, which is configured to handle only routes NOT labeled with type=maintenance, starts proxying incoming traffic through the maintenance routes, so static content is served. Swapping the type=maintenance labels back restores the default behaviour.

    In this solution, technically speaking, the maintenance ingress controller instances (PODs controlled by the router-maintenance deployment, automatically created in the openshift-ingress namespace) are never used. Due to the nature of the OCP configuration, it's the default ingress controller that will always handle the load (based on request host domain matching). Therefore, to reduce resource consumption (CPU and memory) and avoid port conflicts, it's a good idea to:

    1. limit ingress controller replicas (default: 2)
    2. change ports from standard (default: 80, 443, 1936) to some other ones, arbitrarily chosen.
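
    A possible shape for these two adjustments is sketched below as a merge patch for the maintenance ingress controller. It rests on an assumption worth checking for your cluster: the port fields (httpPort, httpsPort, statsPort) are only configurable under the HostNetwork endpoint publishing strategy, so on other strategies only the replica count would apply. The port numbers are arbitrary.

```python
from typing import Optional, Tuple

DEFAULT_PORTS = {80, 443, 1936}


def tuning_patch(replicas: int = 1,
                 ports: Optional[Tuple[int, int, int]] = None) -> dict:
    """Merge patch limiting replicas and, optionally, moving host-network ports.

    ports is (http, https, stats); pass None when the controller does not use
    the HostNetwork endpoint publishing strategy.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    patch = {"spec": {"replicas": replicas}}
    if ports is not None:
        if len(set(ports)) != 3:
            raise ValueError("http, https and stats ports must be distinct")
        if any(not 1 <= p <= 65535 for p in ports):
            raise ValueError("ports must be in the range 1-65535")
        if set(ports) & DEFAULT_PORTS:
            raise ValueError("choose ports other than the defaults 80, 443, 1936")
        http_port, https_port, stats_port = ports
        patch["spec"]["endpointPublishingStrategy"] = {
            "type": "HostNetwork",
            "hostNetwork": {
                "httpPort": http_port,
                "httpsPort": https_port,
                "statsPort": stats_port,
            },
        }
    return patch


replicas_only = tuning_patch(replicas=1)
with_ports = tuning_patch(replicas=1, ports=(8080, 8443, 1937))
```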

    So far I know that this solution works in cloud (AWS) and on-premise OCP deployments.

    It's worth mentioning that all the configuration steps described earlier (1-5) can be easily scripted. In fact, considering the number of routes to (re-)create and label, scripting seems to be the only reasonable option.

    Furthermore, as I have already done successfully, you may consider improving the "maintenance mode" deployment. Instead of being a simple "static page" server, it can act as a reverse proxy which conditionally (e.g. based on the IP of the incoming request) serves the original content and otherwise serves the static page. This way you can enable "maintenance mode" for most users and at the same time allow others (e.g. deployers and others involved in service window activities) to access MAS Core, MAS Manage, etc. as usual.
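
    The conditional part of such a reverse proxy boils down to an allowlist check. Below is a minimal sketch of that decision only (not the proxy itself), assuming the client address is available to the proxy, e.g. from a forwarding header set by the router; the CIDR ranges are hypothetical.

```python
import ipaddress
from typing import Iterable, List, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def parse_allowlist(cidrs: Iterable[str]) -> List[Network]:
    """Parse CIDR strings such as '10.20.0.0/16' into network objects."""
    return [ipaddress.ip_network(cidr, strict=False) for cidr in cidrs]


def should_bypass_maintenance(client_ip: str, allowlist: List[Network]) -> bool:
    """Return True if the request should be proxied to the real application.

    Unparseable addresses never bypass, so they get the static page.
    """
    try:
        address = ipaddress.ip_address(client_ip.strip())
    except ValueError:
        return False
    return any(address in network for network in allowlist)


def choose_backend(client_ip: str, allowlist: List[Network]) -> str:
    """Name the backend to use: the original app or the static page."""
    return "original" if should_bypass_maintenance(client_ip, allowlist) else "static-page"


deployers = parse_allowlist(["10.20.0.0/16", "192.168.5.7/32"])
print(choose_backend("10.20.3.4", deployers))
print(choose_backend("203.0.113.9", deployers))
```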

    I will try to document the complete solution, including code and scripting, and publish it somewhere. I'll link it in this thread when it's ready. 

    Given the limited feedback so far, it's hard to judge whether this topic is so niche that no one really cares, or whether there are simply no other solutions out there. 
    I'm open to any feedback, especially concerning improvements.



    ------------------------------
    Andrzej Więcław
    Maximo Technical Consultant
    AFRY
    Wrocław, Poland
    ------------------------------



  • 5.  RE: Gracefully handle MAS Manage downtimes

    Posted Mon July 22, 2024 09:55 AM

    Hi Andrzej:

    What you posted looks like a feasible path, although it must have taken quite a bit of legwork, and it will also require at least a minimal web server deployment to handle the traffic and present the customized page, plus the additional ingress controller.

    Kudos for finding that out. I do think the approach is rather niche, but it is useful in case some customer insists on having something like that.

    With warmest regards,

    Julio Perera

    Maximo Technical Consultant

    Interloc Solutions Inc., US.



    ------------------------------
    Julio Perera
    ------------------------------



  • 6.  RE: Gracefully handle MAS Manage downtimes

    Posted Tue July 23, 2024 02:32 AM

    Hi Andrzej,

    Thanks for the detailed explanation of your solution. It'd be very useful to see the complete solution with the code.

    We had a similar requirement from our customer where your solution fits perfectly. In fact we wanted to prevent regular users from accessing the environment during the service window. I investigated in the same direction and came up with a separate service/deployment but was struggling to find a way to gracefully switch routes. Due to the time limitations we used another approach - we made a script to disable/enable access to Manage for a list of users using MAS API calls. That worked fine as well by disabling access before the service window and enabling it again afterwards.

    I'm going to try your approach next time. I'm looking forward to when you document the complete solution. Thanks again.



    ------------------------------
    Ivan Lagunov
    Head of R&D
    ZNAPZ B.V.
    ------------------------------



  • 7.  RE: Gracefully handle MAS Manage downtimes

    Posted Tue July 23, 2024 04:21 AM

    Hi Andrzej,

    this sounds really good, I am looking forward to your documentation! We are just starting our upgrade process and that point is on my list a few steps away :)

    Many thanks for your work!



    ------------------------------
    Andreas Brieke
    IT Service Management Consultant
    SVA System Vertrieb Alexander GmbH
    ------------------------------



  • 8.  RE: Gracefully handle MAS Manage downtimes

    Posted Thu July 25, 2024 08:09 AM

    Hi All,

    for those who are interested, I'm linking an article describing my solution as well as a GitHub repo from which you can download all the scripts.

    ...and as always I'm open for any feedback, especially concerning improvements.



    ------------------------------
    Andrzej Więcław
    Maximo Technical Consultant
    AFRY
    Wrocław, Poland
    ------------------------------