Building a self-healing Microservices architecture on AWS for data sync

By Sayon Sur posted Tue December 10, 2024 03:58 AM

  


Premise:

Recently, a retail customer requested an error handling mechanism for their microservices ecosystem. The customer was building a global stock availability feature. The stock availability, reservation, and order-related data were spread across multiple source systems, and the ecosystem was built on a microservices architecture. They also relied heavily on Kafka, since data volumes would be high for the retail giant once the feature became publicly available. The source systems are not always highly resilient. The idea is to shift error handling left, towards a self-healing and fault-tolerant architecture.

Around 10 microservices were built to cater to various functionalities such as availability sync, availability check, and order data sync. Each of these microservices typically interacts with multiple API endpoints to retrieve data, perform calculations, transform data, and update data. The operations are synchronous in nature for a particular unique ID context. The solution below covers error handling and self-healing mechanisms that can be applied easily to any similar microservices architecture where the services are responsible for data sync between systems.

NOTE: This architecture is targeted at backend microservices, not at microservices driving a user-facing frontend.

Figure 1: Basic components

Error handling:

Primarily there are two types of errors:

1. Data Errors

2. Connection Errors

Handling Data Errors

A data error is a scenario where key data fields in the input payload are erroneous.

e.g., a missing product SKU

In such a scenario, the following two actions can be taken:

a) The Microservice returns an HTTP error code.

b) The validation error is logged and propagated to a centralized logging ecosystem.

The microservices will be enabled to push their logs to an OpenSearch dashboard enriched with Kibana visualizations.
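As a minimal sketch of these two actions (the endpoint, payload, and field names below are illustrative assumptions, not the customer's actual API), a Spring Boot controller could reject a payload with a missing SKU and emit a log line that the OpenSearch/Kibana pipeline picks up:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical input payload; field names are assumptions for illustration.
record StockChangeEvent(String eventId, String sku, int quantity) {}

@RestController
@RequestMapping("/availability")
public class AvailabilitySyncController {

    private static final Logger log = LoggerFactory.getLogger(AvailabilitySyncController.class);

    @PostMapping("/sync")
    public ResponseEntity<String> sync(@RequestBody StockChangeEvent event) {
        // (a) Return an HTTP error code when a key field such as the product SKU is missing.
        if (event.sku() == null || event.sku().isBlank()) {
            // (b) Log the validation error; the log pipeline ships it to OpenSearch/Kibana.
            log.error("Data error: missing product SKU for eventId={}", event.eventId());
            return ResponseEntity.badRequest().body("Missing product SKU");
        }
        // ... continue with the normal sync flow ...
        return ResponseEntity.ok("Accepted");
    }
}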

Handling Connectivity Errors:

Example Use case/Scenario:

Any data point change (e.g., a stock or order data change) is pushed to a Kafka topic. From there, an event triggers the microservice actor. The microservice retrieves time series data from the source system, calculates the availability data, and submits it to the destination.

Figure 2: Key connection points
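As an illustration of this flow (the topic name, group ID, and collaborator interfaces are assumptions made for the sketch, not taken from the original design), a Spring for Apache Kafka listener could act as the microservice actor:

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Service;

// Illustrative collaborators (assumed, not from the original post).
interface SourceSystemClient { TimeSeries fetchTimeSeries(String itemId); }
interface DestinationClient { void updateAvailability(String itemId, int availability); }
record TimeSeries(int stock, int orderCount, int reservationCount) {}

@Service
public class StockChangeListener {

    private final SourceSystemClient source;
    private final DestinationClient destination;

    public StockChangeListener(SourceSystemClient source, DestinationClient destination) {
        this.source = source;
        this.destination = destination;
    }

    // Any stock/order change lands on the topic; this listener is the "microservice actor".
    @KafkaListener(topics = "stock-change-events", groupId = "availability-sync")
    public void onStockChange(String itemId) {
        // Retrieve time series data from the source system.
        TimeSeries ts = source.fetchTimeSeries(itemId);
        // Availability = stock - order count - reservation count.
        int availability = ts.stock() - ts.orderCount() - ts.reservationCount();
        // Submit the calculated availability to the destination.
        destination.updateAvailability(itemId, availability);
    }
}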

Challenges faced:

Connectivity issues can occur at every touchpoint, for example:

· The source system can be down or very slow, causing the Microservice to stall.

· If the destination is down, the data sync will not happen until the destination comes back up.

While data errors are largely handled using the central logging system as explained earlier, the proposed solution below focuses on handling these connectivity errors:

Solution design


Figure 3: AWS architecture diagram

Solution approach:

a. When functional changes are pushed to Kafka, the publish works as fire-and-forget; the service layer needs to take care of all error scenarios.

b. The source or destination services can fail, hence error handling is truly required. Although these systems are highly fault tolerant themselves, additional error handling is still needed in the service layer.

c. Data for a unique ID combination must be consolidated at the destination.

d. Transient errors are handled using a standard retry pattern implementation.

e. A configurable retry pattern will be implemented. For more information, refer to the references at the end of this post.

f. In some situations, failures can take longer to resolve, and repeated retries can cause issues such as network congestion. A circuit breaker design pattern can be implemented to handle this. This pattern prevents a source service from retrying a call that has previously caused repeated timeouts or failures, and it can detect when the destination service is functional again. Once the circuit is closed, i.e., the destination service is functional, the system triggers the consolidation service. For more information, refer to the references at the end of this post. (A retry and circuit breaker sketch using Resilience4j appears after this list.)

g. The Consolidation Service (refer to Figure 4 below) and data consolidation form the third layer of fault tolerance introduced here. This is the primary design approach that we will dive deep into.

h. The consolidation service needs to be designed so that a single service can handle failures for all data sync services.

i. Unique ID consolidation is the source's responsibility, i.e., the source decides whether the database holds one unique ID per combination or an entry for every failure record.
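The following sketch shows how points (d), (e), and (f) might look with Resilience4j, one of the circuit breaker options in the stack table below. The instance name, the FailedSyncStore interface, and the hard-coded service name are assumptions made for illustration; retry counts, wait durations, and circuit thresholds would typically be externalized in application.yml under resilience4j.retry.instances.* and resilience4j.circuitbreaker.instances.*, which is what keeps the pattern configurable.

import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import org.springframework.stereotype.Service;

// Assumed store that parks failed unique IDs in the Consolidation DB (sketched later).
interface FailedSyncStore { void recordFailure(String serviceName, String uniqueId); }
// Assumed client for the destination system.
interface DestinationApi { void updateAvailability(String itemId, int availability); }

@Service
public class DestinationUpdater {

    private final DestinationApi destinationApi;
    private final FailedSyncStore failedSyncStore;

    public DestinationUpdater(DestinationApi destinationApi, FailedSyncStore failedSyncStore) {
        this.destinationApi = destinationApi;
        this.failedSyncStore = failedSyncStore;
    }

    // Transient errors are retried; during longer outages the circuit opens and
    // calls fail fast until the destination recovers.
    @Retry(name = "destination", fallbackMethod = "parkForConsolidation")
    @CircuitBreaker(name = "destination", fallbackMethod = "parkForConsolidation")
    public void update(String itemId, int availability) {
        destinationApi.updateAvailability(itemId, availability);
    }

    // Fallback: once retries are exhausted (or the circuit is open), the unique ID is
    // recorded in the Consolidation DB so the consolidation service can replay it later.
    // A real implementation would restrict this to connectivity exceptions.
    void parkForConsolidation(String itemId, int availability, Throwable cause) {
        failedSyncStore.recordFailure("SYNCSERVICE", itemId);
    }
}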

Consolidation Service:

The consolidation service ensures self-healing of the microservices, and data is synced without any custom coding in the individual microservices.

Figure 4: Consolidation service

            

Workflow steps:

1. The Sync microservice pulls messages from Kafka in batches and processes them.

2. The Sync microservice reads data from the source systems. If the source systems are down, it retries and finally posts the request payload to the Consolidation DB.

3. The Sync microservice updates the data in the destination systems. If the destination systems are down, it retries and finally posts the request payload to the Consolidation DB.

4. In case of error, the payloads are pushed to the Consolidation DB (a sketch of this write appears after these steps).

5. When the circuit is closed, i.e., the dependent systems are up, an event is triggered to the consolidation service.
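As a sketch of steps 2-4 (the table name and attribute names are assumptions, and the FailedSyncStore interface comes from the earlier Resilience4j sketch), the write to the Consolidation DB could be a DynamoDB UpdateItem keyed on serviceName and UniqueID, so repeated failures for the same ID update a single record instead of creating duplicates:

import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemRequest;
import java.time.Instant;
import java.util.Map;

// Sketch of the Consolidation DB writer. Table and attribute names are assumptions.
public class DynamoFailedSyncStore implements FailedSyncStore {

    private static final String TABLE = "consolidation-failures";
    private final DynamoDbClient dynamo;

    public DynamoFailedSyncStore(DynamoDbClient dynamo) {
        this.dynamo = dynamo;
    }

    @Override
    public void recordFailure(String serviceName, String uniqueId) {
        // UpdateItem is an upsert keyed on (serviceName, UniqueID): repeated failures for
        // the same ID overwrite one item rather than piling up duplicate records.
        dynamo.updateItem(UpdateItemRequest.builder()
                .tableName(TABLE)
                .key(Map.of(
                        "serviceName", AttributeValue.builder().s(serviceName).build(),
                        "UniqueID", AttributeValue.builder().s(uniqueId).build()))
                .updateExpression("SET lastFailureAt = :ts")
                .expressionAttributeValues(Map.of(
                        ":ts", AttributeValue.builder().s(Instant.now().toString()).build()))
                .build());
    }
}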

Consolidation Subprocess:

1. The consolidation process can be triggered either from Amazon EventBridge or from the Microservice itself.

2. The consolidation service reads the messages from the Consolidation DB and posts them again to the Sync microservice endpoints.

3. It reads the unique IDs and finds the associated Microservice source for each of them.

4. Based on the Microservice name, it retriggers the source microservice, which runs the business logic and updates the destination now that the dependent services are running again (a minimal replay sketch appears after the payload and table formats below).

Sample Payload for consolidation service:

{
  "serviceName": "SYNCSERVICE",
  "serviceUrl": "url of SYNCSERVICE",
  "sourcePayloadTemplate": {}
}

The source sync service will be invoked with this payload.

The message format is self-contained, so that each message can belong to a different service.

DB Tables:

Table for failure IDs per service, with the following KV pairs:

{
  "serviceName": "SYNCSERVICE",
  "UniqueID": "itemID"
}

The consolidation service will use the sourcePayloadTemplate, fill it with the UniqueID, and send it to the target service defined by serviceUrl, as sketched below.
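A minimal sketch of that replay loop follows. The service registry lookup, the ${uniqueId} placeholder, and the table name are assumptions made for illustration; real template handling would depend on each service's payload.

import org.springframework.web.client.RestTemplate;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.ScanRequest;
import java.util.Map;

// Sketch of the consolidation replay loop.
public class ConsolidationService {

    // serviceName -> (serviceUrl, sourcePayloadTemplate), loaded from configuration.
    public record ServiceRegistration(String serviceUrl, String sourcePayloadTemplate) {}

    private final DynamoDbClient dynamo;
    private final Map<String, ServiceRegistration> registry;
    private final RestTemplate rest = new RestTemplate();

    public ConsolidationService(DynamoDbClient dynamo, Map<String, ServiceRegistration> registry) {
        this.dynamo = dynamo;
        this.registry = registry;
    }

    // Triggered by Amazon EventBridge, or by the microservice itself, once the circuit closes.
    public void replayFailures() {
        var failures = dynamo.scan(ScanRequest.builder()
                .tableName("consolidation-failures")
                .build());

        for (Map<String, AttributeValue> item : failures.items()) {
            String serviceName = item.get("serviceName").s();
            String uniqueId = item.get("UniqueID").s();

            ServiceRegistration reg = registry.get(serviceName);
            if (reg == null) {
                continue; // unknown service name; skip (service names must be unique)
            }
            // Fill the payload template with the UniqueID and re-post to the sync service endpoint.
            String payload = reg.sourcePayloadTemplate().replace("${uniqueId}", uniqueId);
            rest.postForEntity(reg.serviceUrl(), payload, String.class);
        }
    }
}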

NOTE: Service names must be unique across services, for example: SYNCSERVICE, SYNCSERVICE1, SYNCSERVICE2.

Benefits:

The uniqueness of this solution is that it brings self-healing connectivity error handling, beyond the standard retry pattern, to a microservices ecosystem for data sync use cases.

Consider the following use case:

A microservice is responsible for calculating stock inventory based on a simple formula:

Availability = stock - order count - reservation count

· Consider an initial stock of 50 for item A.

· The system goes down for 30 minutes at 1 PM.

· During these 30 minutes, 10 orders are placed with a quantity of 2 each.

· Three reservations are made with counts of 3, 5, and 2 respectively.

· So the actual stock in hand for item A is 50 - (10*2) - (3+5+2) = 20.

No new orders are placed until 2:30 PM. Unless the stock is manually corrected, until 2:30 PM the stock will still erroneously show a quantity of 50.

With the self-healing approach integrated, once the destination system is up, the data syncs automatically and the stock correctly reflects a quantity of 20.

1. This self-healing error handling process fixes the data at the destination without any manual intervention in case of connectivity errors. It goes beyond the traditional retry pattern for transient errors and handles longer downtimes gracefully.

2. Data integrity is ensured. Transaction speed and compute utilization are optimized, saving up to 30% of processing time.

3. Data staleness is avoided by up to 80%, as the system does not depend on new events (orders, reservations) occurring before the data is corrected.

4. The consolidation service is generic in nature and can reprocess messages for multiple sync services.

5. No custom coding is required for reprocessing in the sync microservices. A single flow of logic is easier to maintain and less error prone.

6. Using DynamoDB Update operations, only unique IDs are stored, reducing duplication and overhead.

7. Leveraging managed services on AWS makes the solution itself more durable and reliable.

AWS Technical Stack:

Area                    | Technology
------------------------|--------------------------------------------
Consolidation Service   | Spring Boot service on Amazon EKS
DB                      | Amazon DynamoDB
Service trigger         | Amazon EventBridge or source Microservice
Circuit Breaker         | Hystrix, Resilience4j
Notification / Eventing | Amazon SES / Amazon SNS
Logging                 | OpenSearch

References:

Retry pattern in AWS: https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/retry-backoff.html

Spring Retry: https://github.com/spring-projects/spring-retry

Circuit Breaker in AWS: https://docs.aws.amazon.com/prescriptive-guidance/latest/cloud-design-patterns/circuit-breaker.html

Circuit Breaker in Spring Boot (Resilience4j): https://resilience4j.readme.io/docs/circuitbreaker
