
AWS X-Ray Tracing in ROSA

By Kaustav Bagchi posted Wed March 20, 2024 01:32 AM

  


Tracing is a common and necessary step in making your application fault tolerant. It is vital for setting up debugging frameworks, and it is a great way to visualize a chain of logic running across independent environments.

Customer Scenario

Let’s say you are a support engineer for a CoffeeShop Billing application. At 3 AM you get a call from your manager: customers are complaining that the application’s payment gateway is crashing because bills are not being generated and displayed on time.

After hours of searching through logs and debugging manually, you conclude that one of the processes was overloaded and was driving up the overall latency. Simply increasing its capacity solved the issue!

Now imagine if we had a way to visualize the requests coming into the system, check the progress of each request, and see how much time it took. The issue could have been solved much more quickly.


Fig 1 – Workflow Diagram

The above is a real-time streaming application backed by a streaming technology such as MSK. The user selects items in the app and checks out. After the user hits checkout, the following sequence happens:

Step 1 - An event containing the user details and the items checked out is sent to Cart Topic. Cart Microservice is responsible for initial pre-processing and for sending the formatted event to Create Bill Topic.

Step 2 - Create Bill Microservice reads from Create Bill Topic. It then calls Get User Microservice over HTTPS to get the offers and discounts applicable to the user. It also internally calculates the nearest store for the user and the corresponding delivery charge.

Step 3 - After calculating the total bill, it sends it to Send Bill Topic.

Step 4 - Send Bill Microservice reads the data from the topic, generates a bill in PDF format, and uploads it to Amazon S3; a link to this S3 object can be sent for further processing.

We want to trace the processing of an incoming event through the above workflow. We also want the time taken and any errors to be visible in a dashboard.

Expected Outcome

Tracing:

We must trace the flow and execution sequence for a particular event coming into the system. Execution happens in ROSA pods running Java consumer applications, which leverage AWS SDKs to communicate with other AWS services.

Below is a sample diagram:

[Workflow timeline from 10:01 to 10:25, with one horizontal span recorded per step:]

Time taken by Cart Microservice
Time spent in Create Bill Topic
Time taken by Get User Call
Time taken by Create Bill Microservice
Time Spent in Send Bill Topic
Time taken by Send Bill Microservice

Fig 2 – Sample visualization showing time spent per step

The above sequence shows how much time it took for that event to complete processing.

Please note:

Time spent in Create Bill Topic and Time Spent in Send Bill Topic: The time recorded here tells us how long an event was resting in the topic before the consumer microservice picked it up. In other words, it is a performance indicator for the consumer microservice. If the value is high, there is a bottleneck at the consumer microservice that needs investigation. Alerting can also be set up through Amazon SNS.

Time taken by Get User Call: This is a nested call that is part of Time taken by Create Bill Microservice. The time recorded here contributes to the overall performance indicator of the parent component.

Proposed Infrastructure components

This architecture uses the following components:

Red Hat OpenShift Service on AWS (ROSA): Essentially an OpenShift cluster running on EC2 instances, with IAM authorization capabilities to interact with AWS components.

Amazon Managed Streaming for Apache Kafka (MSK): The managed Kafka cluster offering from AWS. It supports IAM authorization, so it pairs well with ROSA.

AWS X-Ray: Our proposed choice of tracing component, particularly because it is an AWS managed service residing outside the OpenShift cluster. If other services that depend on our application are deployed in AWS, they can also leverage our traces and expand on them.

Limitations of the proposed tracing component

AWS X-Ray supports most AWS services out of the box. It has very good integrations with managed services like AWS Elastic Beanstalk, AWS Lambda, Amazon Elastic Container Service (Amazon ECS), Amazon EC2, etc. It can draw very neat trace maps for APIs hosted on API Gateway. These are all part of its context-tracing capabilities.

However, our tracing objectives require a deeper level of tracing, and our tracing should also continue through chained components. It should also trace calls made by Kafka consumer applications. None of this is supported out of the box.

The solution here is deep instrumentation using the AWS X-Ray segment API.

Solution Architecture

The solution architecture consists of three parts:

a.      Containerized infrastructure

b.      Code changes to send the traced data

c.       Event changes to persist context details in the traces

Before talking about the architecture, let’s talk about AWS X-Ray instrumentation. X-Ray exposes an API action called PutTraceSegments. Using it, we can form and send our own trace segments.

At a high level, a Trace is the master resource, which can be queried in the console. It is identified by a trace_id.

Under a trace we have segments, and segments can have subsegments.

To make a connected Trace Map, there should be a single trace_id. Under that trace there should be a parent segment that represents the whole event flow. Underneath it come your segments for microservices and topic ingestion. Underneath those can be individual segments for micro-level tasks done inside the application, like a database call, the Get User API call, or an S3 object upload call.

The trace segment is a JSON payload, like the sample below.

This is a sample nested example for the Billing Microservice trace.

It consists of a parent segment called Cart Event XYZ, and inside it there is a subsegment called Billing Microservice. As the event passes through all the microservice layers, more subsegments are added to the list.
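A minimal sketch of such a segment document (the IDs and epoch timestamps are illustrative):

{
  "trace_id": "1-65fa4e2b-a1b2c3d4e5f6a7b8c9d0e1f2",
  "id": "70de5b6f19ff9a0a",
  "name": "Cart Event XYZ",
  "start_time": 1710901260.0,
  "in_progress": true,
  "subsegments": [
    {
      "id": "1e7be9cf4b2f3a61",
      "name": "Billing Microservice",
      "start_time": 1710901261.0,
      "end_time": 1710901264.5
    }
  ]
}

The parent segment is left in progress (in_progress set to true, no end_time) because the event is still flowing through the system; it will be closed by the last microservice.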

You can learn more about the segment document structure here: https://docs.aws.amazon.com/xray/latest/devguide/xray-api-segmentdocuments.html

If you are wondering how to form the trace ID and segment ID, those code snippets are included in the solution discussion.

Containerized Infrastructure

The diagram below shows the components required in ROSA for the solution to work.

Now, in an event-based system, sending PutTraceSegments calls to AWS X-Ray through the AWS SDK adds to the overall latency, because the call is synchronous. Because of this, we have created a separate Deployment, backed by an HPA (Horizontal Pod Autoscaler), for the AWS X-Ray daemon. The daemon has access to send traces to X-Ray through IRSA (IAM Roles for Service Accounts). [Ref. https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html ]

The solution looks like this:


Fig 3 –Solution Diagram

The AWS X-Ray daemon Dockerfile is here.

Note: The daemon uses UDP port 2000 to receive segments.
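A minimal sketch of such a Dockerfile, assuming the AWS-published daemon image (amazon/aws-xray-daemon) is an acceptable base for your enterprise registry:

# Wrap the AWS-published X-Ray daemon image for the enterprise registry.
FROM amazon/aws-xray-daemon:latest
# The daemon receives segment documents on UDP port 2000.
EXPOSE 2000/udp
# Bind on all interfaces so the ClusterIP Service can reach the pod;
# the AWS region is supplied at runtime (for example via the AWS_REGION environment variable).
CMD ["--bind", "0.0.0.0:2000"]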

After the image has been created and pushed to an enterprise container registry, we can create the Deployment, Service, and HPA. (For reference, see here.)

The X-Ray daemon is exposed through the default DNS name: xraydaemon.<namespace>.svc.cluster.local
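As a sketch, the Service backing that DNS name could look like the following (the xraydaemon name matches the DNS above; the app label is an assumption and must match your daemon Deployment's pod labels):

apiVersion: v1
kind: Service
metadata:
  name: xraydaemon
spec:
  selector:
    app: xray-daemon        # must match the labels on the daemon Deployment's pods
  ports:
    - name: xray-udp
      protocol: UDP         # X-Ray segments travel over UDP, not TCP
      port: 2000
      targetPort: 2000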

Code Changes

Sending traces to the AWS X-Ray daemon is not supported by the AWS SDK. An alternative is to send UDP packets to the daemon's service URL. A code snippet can be found here.
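A minimal Java sketch of that idea (the daemon host below assumes the Service DNS name from the previous section, with a hypothetical "tracing" namespace):

import java.io.IOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public class XRayUdpEmitter {

    // In-cluster DNS of the daemon Service; "tracing" is a hypothetical namespace.
    private static final String DAEMON_HOST = "xraydaemon.tracing.svc.cluster.local";
    private static final int DAEMON_PORT = 2000; // default UDP port of the X-Ray daemon

    // Mandatory daemon header that must precede every segment document.
    private static final String HEADER = "{\"format\":\"json\",\"version\":1}\n";

    // Sends one segment document (a JSON string) to the daemon over UDP.
    public static void sendSegment(String segment) throws IOException {
        byte[] payload = (HEADER + segment).getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            InetAddress address = InetAddress.getByName(DAEMON_HOST);
            socket.send(new DatagramPacket(payload, payload.length, address, DAEMON_PORT));
        }
    }
}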

The above code sends a String called segment through UDP to the AWS X-Ray daemon. It is a test segment; you can form real ones as per your requirements, with the correct AWS syntax. Please note that, to send a segment through the daemon, you must prepend the mandatory daemon header to the segment JSON: {\"format\":\"json\",\"version\":1}\n

The above is sample code; please package it into a custom common library to make it reusable.

The UDP call can further be run asynchronously; the CompletableFuture class in java.util.concurrent can help.
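For example (a sketch that reuses the emitter above; segmentJson holds the segment document formed earlier):

// Fire-and-forget: emit the segment on a background thread so the
// consumer's processing latency is not affected by the UDP call.
CompletableFuture.runAsync(() -> {
    try {
        XRayUdpEmitter.sendSegment(segmentJson);
    } catch (IOException e) {
        // Best-effort: drop the segment and log; tracing must never break billing.
        System.err.println("Failed to emit X-Ray segment: " + e.getMessage());
    }
});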

To generate the Trace ID and Segment ID, use the sample bash script code.
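The bash script itself is linked above; as an equivalent sketch in Java, following the documented ID formats (a trace ID is 1-<8 hex digits of epoch seconds>-<24 random hex digits>, a segment ID is 16 random hex digits):

import java.security.SecureRandom;

public class XRayIds {

    private static final SecureRandom RANDOM = new SecureRandom();

    // Trace ID: version 1, epoch seconds in hex, then 96 random bits in hex.
    public static String newTraceId() {
        long epochSeconds = System.currentTimeMillis() / 1000L;
        return String.format("1-%08x-%s", epochSeconds, randomHex(24));
    }

    // Segment and subsegment IDs are 64 random bits in hex.
    public static String newSegmentId() {
        return randomHex(16);
    }

    private static String randomHex(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(Integer.toHexString(RANDOM.nextInt(16)));
        }
        return sb.toString();
    }
}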

Event Structure Changes

The event coming into Cart Topic is consumed by Cart Microservice. Cart Microservice is responsible for creating the parent Trace ID, the parent Segment ID, and the parent segment JSON document. That document has to be passed on till the end of the flow by the various layers of microservices.

Now, coming back to the proposed tracing flow we mentioned at the start:

Time taken by Cart Microservice: Cart Microservice is the first one, so it will always encounter fresh events (without context details). Cart Microservice adds three key-value pairs to the event body and passes it along to Create Bill Topic (a sketch of the enriched event body follows this list). They are:

a.      Start_Time=<UTC value of when processing started>

b.      End_Time=<UTC value of when processing ends>

c.       Time_Taken=End_Time-Start_Time
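In this sketch of the enriched event body, everything beyond the three fields above is hypothetical: the original event fields (user_id, items), the names used to propagate the trace context (Trace_Id, Parent_Segment_Id), the UTC string format, and Time_Taken shown in seconds are all illustrative choices:

{
  "user_id": "u-42",
  "items": ["espresso", "bagel"],
  "Trace_Id": "1-65fa4e2b-a1b2c3d4e5f6a7b8c9d0e1f2",
  "Parent_Segment_Id": "70de5b6f19ff9a0a",
  "Start_Time": "2024-03-20T10:01:00Z",
  "End_Time": "2024-03-20T10:02:00Z",
  "Time_Taken": 60
}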

Then the parent segment document is formed with only start_time set and in_progress as true, together with a subsegment carrying the Cart Microservice start and end times, and is sent over to the daemon.

Time spent in Create Bill Topic: When Create Bill Microservice reads a message from Create Bill Topic, it gets End_Time in the body. Before reading from the topic, it records current_time. End_Time is essentially the time at which the event was pushed to the topic by the preceding microservice. Hence the time an event spends in the topic is current_time - End_Time.
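In code, the measurement is a simple subtraction (a sketch; extractEndTime is a hypothetical helper that parses the End_Time field out of the consumed event body):

// Record the wall-clock time before processing the polled record.
long currentTime = System.currentTimeMillis();
// End_Time was stamped by the upstream microservice when it pushed the event.
long endTime = extractEndTime(record.value());
// Dwell time in the topic: a high value points at a consumer bottleneck.
long timeInTopic = currentTime - endTime;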

Time taken by Get User Call: This is a subsegment where we record the time taken by the API call. If the API call is failing, we can also put the errors and other details into the segment.

Time taken by Create Bill Microservice: Create Bill Microservice records its start and end times, propagates them through the segment, and sends them through the event body to Send Bill Topic.

Time Spent in Send Bill Topic: Follows the same pattern as Create Bill Topic.

Time taken by Send Bill Microservice: As this is the last microservice in the system, it has the responsibility of closing the trace. It gathers start and end times as usual and creates its subsegment, but it also adds end_time to the parent segment document and sets in_progress to false.
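Since X-Ray treats a resent segment with the same id as an update to the in-progress segment, the final document from Send Bill Microservice could look like this sketch (subsegments omitted, values illustrative):

{
  "trace_id": "1-65fa4e2b-a1b2c3d4e5f6a7b8c9d0e1f2",
  "id": "70de5b6f19ff9a0a",
  "name": "Cart Event XYZ",
  "start_time": 1710901260.0,
  "end_time": 1710901500.0,
  "in_progress": false
}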

Thus, by making these changes to the event body, we can record the time taken by each step of the workflow. If there is an error, those details can also be sent along to be visualized.

Usage and Benefits

The solution given here helps mission-critical applications troubleshoot in real time much faster. Sample segment timeline (please note that this was created through bash scripts for demonstration purposes):

Fig 4 – Generated sample of an AWS X-Ray visualization

The use case and solution described so far cater not only to a very specific use case but also to a wide variety of uses.

The features that have been shown are:

a.      How to send trace data to AWS X-Ray very quickly, without even using the SDK.

b.      How to run the AWS X-Ray daemon in a containerized manner.

c.       How to produce a trace map for real-time event systems for troubleshooting.

Now these learnings can be used in a variety of ways.

We can now use the AWS X-Ray daemon with on-premises applications. We can also use it not only for event-driven applications but also for APIs or batch-processing applications, since we have control over segment creation.

Few other tracing products give this kind of context chaining by default. If we utilize the learning from point c, we can reuse our existing architecture with other third-party products like Jaeger, Dynatrace, etc. We just have to change the daemon layer accordingly.

Further to this, we can also create alerting and auto-remediation on top of it. If a known issue is encountered, an auto-remediation is triggered using AWS Lambda; otherwise, an alert is raised for the team using Amazon CloudWatch and Amazon SNS.

