
Bulk export of Analytics data using the Scroll operation

By Pablo Lopez Rodriguez posted Tue December 06, 2022 04:56 AM

  

This post describes the benefits of using the Analytics Scroll operation to export large numbers of analytics events, as an alternative to the paginated /events operation, and provides a tutorial on how to use it.

Why the need for a different operation?

APIC Analytics uses an OpenSearch-based database and search engine, which offers a wide variety of options for fetching data. Until now, Analytics has relied on data pagination, which works well when requesting small numbers of recently indexed events stored at shallow depths. OpenSearch provides powerful interfaces to sort, aggregate and paginate search results that are very efficient at these shallow depths, making them ideal for live data querying where only aggregations or the most recent batches of results are needed.

However, database operations come at a cost, and these pagination interfaces become less efficient the deeper the data sits within the index. In fact, for this very reason OpenSearch limits the number of paginated results to 10,000 events by default. This limit can be bypassed with internal settings, but doing so is not recommended due to its high impact on resources.

This is where the Scroll operation comes in. It has been written as a separate operation within the Analytics API and is designed to query the entire collection of data for exporting or reindexing: it trades complicated pagination and aggregations for better performance, allowing it to efficiently return as many events as required from very deep in the database. As its use case is quite different from that of the paginated API, it is exposed as a new endpoint, POST /events/scroll, while the existing pagination API, GET /events, remains relevant for its intended live-processing purpose.

The Scroll operation

Introduction

The Scroll operation is designed to retrieve a large number of results from Analytics. This number can get very large, easily reaching millions of events and GBs of data, which is why the operation is designed to provide batches of results.

This is achieved by returning a pointer, called the scroll ID, which needs to be included in the follow-up request to get the next batch of results. This process can be repeated any number of times until the desired number of records has been retrieved from Analytics. It is important to note that a time snapshot is created internally at the time of the first scroll search, and any later changes to the documents will only be reflected in a new scroll context.

The Scroll operation also supports query parameters the same way the paginated API does.

Size

Size is the body parameter that defines how many results are returned each time the API is called; the default is 1000. It is set on the first call to the Scroll operation and persisted via the scroll ID.

Size is the main consideration when tuning performance with the Scroll operation. In a nutshell, the bigger the size, the more resource-consuming each call becomes for both the Analytics deployment and OpenSearch. However, a bigger payload also means fewer calls to retrieve the data, resulting in lower system and network overhead. This will vary on a system-by-system basis, so it is recommended to fine-tune this parameter by starting with a conservative size and increasing it while keeping an eye on the deployment resources, to find the appropriate value for your specific use case.
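To make the trade-off concrete, here is a small sketch (the event counts are illustrative, not measurements) showing how the number of round-trips falls as size grows:

```python
def round_trips(total_events, batch_size):
    # Ceiling division: the last batch may be partial.
    return -(-total_events // batch_size)

# Illustrative export of 100,000 events at several batch sizes.
for size in (100, 500, 1000, 2000):
    print(size, round_trips(100_000, size))
```

Doubling the batch size halves the number of calls, at the cost of a heavier payload per call.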

Scroll context

Scroll contexts are created by the initial request and kept alive by any subsequent requests. They are referenced using the scrollId parameter.

Scroll contexts will stay alive for 10 minutes by default, and every subsequent call using a context's scroll ID resets this timer.

Scroll context lifetimes can be customized on a per-call basis using the scroll body parameter, which takes a time-value format, e.g. '10m' or '30s'. The maximum value for this parameter is 1d.
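The time-value format can be validated client-side before sending a request; a minimal sketch (the helper below is hypothetical, not part of the API):

```python
import re

# Map the time-value units used by the scroll parameter to seconds.
_UNITS = {'s': 1, 'm': 60, 'h': 3600, 'd': 86400}

def scroll_seconds(value):
    """Convert a time-value such as '10m' or '30s' to seconds,
    rejecting anything above the 1d maximum."""
    match = re.fullmatch(r'(\d+)([smhd])', value)
    if not match:
        raise ValueError("expected a time-value like '10m' or '30s'")
    seconds = int(match.group(1)) * _UNITS[match.group(2)]
    if seconds > 86400:
        raise ValueError('scroll may not exceed 1d')
    return seconds
```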

Delete Scroll operation

The Delete Scroll operation is used to delete scroll contexts before they expire to save unused resources.

Keeping scroll contexts alive consumes node resources, and although they are deleted once the scroll timeout expires, it is recommended to delete them manually once they are no longer needed using the delete scroll operation POST /events/scroll/delete.

This operation takes a body with either a single scroll ID string, an array of scroll IDs, or the value "_all" to remove all scroll contexts.
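For illustration, the three body shapes look like this (the IDs are placeholders):

```python
# Placeholder scroll IDs for illustration only.
delete_one = {"scroll_id": "scroll-id-1"}
delete_some = {"scroll_id": ["scroll-id-1", "scroll-id-2"]}
delete_all = {"scroll_id": "_all"}
```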

Considerations

Due to an OpenSearch limitation, if the scroll parameter is not included in subsequent requests made with a scroll_id, no scroll ID will be returned in the response.

If a batch of events fails to come through, whether due to an OpenSearch or an Analytics failure, the missed batch cannot be retrieved. It's recommended to code defensively and handle failures:

r = requests.post(url, json=body, headers=headers, verify=False)
if r.status_code == 200:
    # continue making calls with the returned scroll_id
    ...
else:
    # handle the error, e.g. restart the export with a fresh scroll context
    ...

There's a limit on the number of scroll contexts that can be opened at once. This limit is set to 500 by default but can be changed using internal settings. Note that every new Scroll request will open more than one scroll context.

Tutorial (using Python)

Pre-requisites

The first step is to acquire the appropriate Bearer token for the Analytics API at the desired scope. Follow this link to learn how to do this.

We will also need the Analytics Scroll operation endpoint, which can be obtained following this link.

Call to the Scroll operation

Prepare request parameters

We'll need to set the following:

import requests

bearer = 'Bearer <my_Bearer_token>'
url = '<APIC_deployment>/analytics/<analytics_service>/orgs/<org>/events/scroll'
size = 1000
scroll = '10m'

data = {
    "size": size,
    "scroll": scroll
}

headers = {
    'Content-Type': 'application/json',
    'Authorization': bearer
}


Send the request

r = requests.post(url, json=data, headers=headers, verify=False)
res = r.json()

The REST API is documented on this page.

The response includes the following fields:

total is an int with the number of events stored by Analytics.

events is an array containing the first batch of size events.

scroll_id is a string with the ID of the scroll context used.

For the purposes of this tutorial, we'll save the events to a file:

import json
import os
import requests

fileName = 'output.txt'
r = requests.post(url, json=data, headers=headers, verify=False)
res = r.json()
total = res['total']
events = res['events']
scroll_id = res['scroll_id']
with open(fileName, 'a') as f:
    f.write('[')
    for index, event in enumerate(events):
        if index > 0:
            f.write(',')
        json.dump(event, f)
        f.write(os.linesep)
i = 0
fetchedResults = len(events)
print('Iteration: ' + str(i))
print('Total results fetched: ' + str(fetchedResults))

This opens an array with [ and stores every event object separated by a comma.

The next step is to take scroll_id and make subsequent requests to the API. It's important to always use the latest scroll ID received.

We can set the total variable to choose how many events we want to retrieve. Keep in mind that the number of events fetched needs to be a multiple of the batch size. For this example, we'll retrieve 20,000 events.
To retrieve all the events stored by Analytics, you can set total to res['total'].

The number of requests needed for the desired number of results can be calculated as total/size - 1, the -1 accounting for the initial batch request we have already performed.
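That calculation can be written as a small helper (a sketch; it assumes total is an exact multiple of size, as noted above):

```python
def follow_up_requests(total, size):
    # The initial POST already returned the first batch of `size`
    # events, hence the -1.
    return total // size - 1

print(follow_up_requests(20000, 1000))  # 19 follow-up calls
```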

total = 20000
i = 0
while i < (total/size - 1):
    body = {
        "scroll_id": scroll_id,
        "scroll": '10m'
    }
    r = requests.post(url, json=body, headers=headers, verify=False)
    res = r.json()
    events = res['events']
    print('Date for first event of batch: ' + events[0]["datetime"])
    scroll_id = res['scroll_id']
    with open(fileName, 'a') as f:
        for event in events:
            f.write(',')
            json.dump(event, f)
            f.write(os.linesep)
    i += 1
    fetchedResults += len(events)
    print('Iteration: ' + str(i))
    print('Total results fetched: ' + str(fetchedResults))

with open(fileName, 'a') as f:
    f.write(']')

Note the ] at the end to close the array.
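As a final sanity check (an addition to the tutorial, not part of it), the finished file should parse as a single JSON array:

```python
import json

def load_export(path):
    # The loop above wrote '[', comma-separated event objects and a
    # closing ']', so the whole file is one JSON array.
    with open(path) as f:
        return json.load(f)
```

For example, load_export('output.txt') should return the full list of event dicts written by the loop above.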

Delete the scroll context

Once all the desired results have been fetched, the scroll context should be deleted. To do this, call POST /events/scroll/delete with the scroll ID.
Read the full API docs here.

url = '<APIC_deployment>/analytics/<analytics_service>/orgs/<org>/events/scroll/delete'
# scroll_id is the latest ID returned by the Scroll operation

data = {
    "scroll_id": scroll_id
}

headers = {
    'Content-Type': 'application/json',
    'Authorization': bearer
}

r = requests.post(url, json=data, headers=headers, verify=False)


Sample metrics

This section shows some example performance figures for the Scroll operation, using the paginated API as a reference. Times are in seconds unless stated otherwise.

Fetching 1,000 events (5.3MB)

| Size | Iterations | Time - Paginated API | Time - Scroll operation |
|------|------------|----------------------|-------------------------|
| 100  | 10         | 6.8                  | 8.76                    |
| 500  | 2          | 3.06                 | 3.73                    |
| 1000 | 1          | n/a                  | 2.98                    |
| 2000 | 1          | n/a                  | 4.83                    |

Fetching 10,000 events (26.5MB)

| Size | Iterations | Time - Paginated API | Time - Scroll operation |
|------|------------|----------------------|-------------------------|
| 100  | 100        | 66.87 (1m 6s)        | 75.03 (1m 15s)          |
| 500  | 20         | 26.51                | 32.26                   |
| 1000 | 10         | n/a                  | 24.78                   |
| 2000 | 5          | n/a                  | 20.24                   |

Fetching 100,000 events (265.2MB)

| Size | Iterations | Time - Paginated API | Time - Scroll operation |
|------|------------|----------------------|-------------------------|
| 100  | 1000       | n/a                  | 731.71 (12m 16s)        |
| 500  | 200        | n/a                  | 299.21 (4m 59s)         |
| 1000 | 100        | n/a                  | 257.62 (4m 17s)         |
| 2000 | 50         | n/a                  | 196.44 (3m 16s)         |

Fetching 760,265 events (2.02GB)

| Size | Iterations | Time - Paginated API | Time - Scroll operation |
|------|------------|----------------------|-------------------------|
| 100  | 7603       | n/a                  | 5647.08 (1h 34m)        |
| 500  | 1521       | n/a                  | 2422.84 (40m 22s)       |
| 1000 | 761        | n/a                  | 2158.05 (36m)           |
| 2000 | 381        | n/a                  | 1918.67 (32m)           |


#APIConnect
