Reindexing monthly summaries in API Connect Analytics
IBM API Connect Analytics stores raw API events in daily indices (apic-api-YYYY.MM.DD-*), transforms them into daily long-term summary indices, and then reindexes those daily summaries into a single summary-api-YYYY.MM-000001 index per month. Consolidating into monthly indices keeps shard counts low, so the same hardware can store more history. It also keeps read aliases simple and gives SREs a predictable lifecycle for deleting day-level data once it has been rolled up.
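As a rough illustration of the naming scheme, the sketch below maps dated daily index names onto the monthly index they would roll up into. Only the summary-api-YYYY.MM-000001 target format comes from the description above; the daily summary index names used in the example (daily-summary-...) and both function names are hypothetical, and the real job may derive its targets differently.

```python
import re
from collections import defaultdict

# Assumption: a daily index name contains a YYYY.MM.DD date stamp somewhere.
_DATE_RE = re.compile(r"(\d{4})\.(\d{2})\.(\d{2})")


def monthly_target(index_name: str) -> str:
    """Return the monthly summary index that a dated daily index rolls up into."""
    match = _DATE_RE.search(index_name)
    if match is None:
        raise ValueError(f"no YYYY.MM.DD date in index name: {index_name!r}")
    year, month, _day = match.groups()
    return f"summary-api-{year}.{month}-000001"


def group_by_month(index_names):
    """Group daily index names (sorted) under their monthly target index."""
    groups = defaultdict(list)
    for name in sorted(index_names):
        groups[monthly_target(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical daily summary index names, for illustration only.
    daily = ["daily-summary-2025.02.01", "daily-summary-2025.01.31"]
    for target, sources in group_by_month(daily).items():
        print(target, "<-", sources)
```

Grouping by target makes the shard saving visible: roughly 30 daily indices collapse into one monthly index.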
Old Behaviour
The pre-10.0.8.4 reindex job ran once a month (around the 5th) and attempted to sweep every daily summary index for the previous month in a single shot. Any hiccup, such as OpenSearch pressure or a node restart, forced a full restart from day one.
Internal scale tests and the IBM Managed SRE team saw the job saturate CPU and heap when thousands of shards were involved, which in turn slowed down transforms, made dashboards flaky, and sometimes left partial monthly indices.
When failures occurred, there was no persisted progress marker; users had to read OpenSearch tasks manually, often discovering that month-end data had been missing for days.
10.0.8.4: Fail-safe reindexing
Incremental days: We now group daily indices into batches (1st - 5th, 6th - 10th... 26th - 31st) and record completion internally. If the pod restarts, the job resumes with the first unprocessed day instead of rewinding to the 1st.
Resource guardrails: We allow only one reindex job to run at a time, and temporarily disable replica shards on the targeted indices to save resources and make the job run faster.
Clean hand-off: Once a month has been fully reindexed, the job cleans up the now-redundant daily indices and enforces the correct ISM policies and replica counts.
Observability: Logs call out each day’s progress, skipped indices, and retry attempts. The status documents double as an audit trail for support teams.
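The batching and resume behaviour can be sketched in a few lines. This is a minimal illustration, not the shipped implementation: it assumes five-day batches starting on the 1st, 6th, 11th, 16th, 21st and 26th, with the last batch running to the end of the month, and it stands in for the persisted status documents with a plain set of completed day numbers. All function names are hypothetical.

```python
import calendar

# Assumption: batches start on these days; the final one runs to month end.
BATCH_STARTS = (1, 6, 11, 16, 21, 26)


def day_batches(year: int, month: int):
    """Return (first_day, last_day) tuples covering every day of the month."""
    last_day = calendar.monthrange(year, month)[1]
    batches = []
    for i, start in enumerate(BATCH_STARTS):
        is_last = i == len(BATCH_STARTS) - 1
        end = last_day if is_last else start + 4
        batches.append((start, end))
    return batches


def first_unprocessed_day(year: int, month: int, completed_days):
    """Return the earliest day not yet recorded as done, or None if all are done."""
    last_day = calendar.monthrange(year, month)[1]
    for day in range(1, last_day + 1):
        if day not in completed_days:
            return day
    return None


def remaining_batches(year: int, month: int, completed_days):
    """Return the batches still to run, trimmed to start at the resume point."""
    resume_from = first_unprocessed_day(year, month, completed_days)
    if resume_from is None:
        return []
    remaining = []
    for start, end in day_batches(year, month):
        if end < resume_from:
            continue  # the whole batch finished before the restart
        remaining.append((max(start, resume_from), end))
    return remaining


if __name__ == "__main__":
    # Days 1-7 were recorded as complete before a pod restart.
    print(remaining_batches(2025, 1, set(range(1, 8))))
```

Because the resume point comes from recorded progress rather than from the batch boundaries, a restart in the middle of a batch repeats only the days that were never marked complete.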
10.0.8.4 additional fixes
Automatic cleanup of stale summary_reindex_status documents to avoid confusing “already reindexed” signals.
More defensive checks around transforms so that we never launch a reindex while a transform is still writing to the same target.
10.0.8.5 refinements
Hardened retry logic for asynchronous tasks: we poll unfinished tasks and flag when OpenSearch reports “completed”: false for too long, surfacing actionable errors in the oscron logs.
Improvements to the summary-management controller so it can detect when a month was backfilled outside the primary cluster and skip redundant reindex work.
Expanded cleanup of completed tasks (older than a cut-off) so the _tasks API stays lean and SREs can quickly find the one task that matters.
Continued guardrails on alias management to prevent dangling read aliases when daily indices are deleted.
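The stuck-task check in the first refinement above can be pictured as a polling loop with a deadline. The sketch below is an assumption about the general shape, not the oscron code: fetch_status is a hypothetical callable that returns the JSON body of a task lookup as a dict with a "completed" field, and the timeout and poll interval are placeholder values.

```python
import time


class StuckTaskError(RuntimeError):
    """Raised when a task has reported "completed": false for too long."""


def wait_for_task(fetch_status, task_id, timeout_s=3600.0, poll_s=30.0,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll a task until it completes, or raise once the deadline has passed.

    fetch_status(task_id) is a hypothetical hook that returns the task
    status as a dict; sleep and clock are injectable to ease testing.
    """
    deadline = clock() + timeout_s
    while True:
        status = fetch_status(task_id)
        if status.get("completed"):
            return status
        if clock() >= deadline:
            raise StuckTaskError(
                f"task {task_id} still not completed after {timeout_s}s"
            )
        sleep(poll_s)


if __name__ == "__main__":
    # Simulated responses: two unfinished polls, then a completed one.
    responses = iter([{"completed": False}, {"completed": False},
                      {"completed": True}])
    result = wait_for_task(lambda _id: next(responses), "node:1",
                           timeout_s=5, poll_s=0)
    print(result)
```

Raising a dedicated error rather than looping forever is what lets the caller log an actionable message and decide whether to retry.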
Summary
These changes turn reindexing from a fragile, all-or-nothing monthly event into a resumable, low-impact background task. By reindexing one day at a time, throttling resource usage, and persisting progress in OpenSearch itself, Analytics clusters stay healthy even while retaining more long-term data. Should a failure still occur, the job restarts exactly where it left off, replica shards are reinstated, and operators have the telemetry they need to react quickly.
References
Reindex resiliency fixes first shipped in 10.0.8.4 (see IBM API Connect release notes for 10.0.8.4: https://www.ibm.com/docs/en/api-connect/10.0.8.x?topic=notes-release-10084).
Follow-on corrections landed in 10.0.8.5 (release notes: https://www.ibm.com/docs/en/api-connect/10.0.8.x?topic=notes-release-10085).