Elasticsearch Autoscaling with Ansible and Jenkins

By Asif Akhtar posted Fri November 07, 2025 12:56 PM

Author: Asif Akhtar

Audience: DevOps Engineers, Elasticsearch Admins, and SREs

Tags: Elasticsearch, Ansible, Jenkins, Automation, Autoscaling, DevOps

[Image: Job Flow Diagram]

🧭 Introduction

Managing Elasticsearch clusters at scale is a balancing act between performance, cost, and capacity. As data grows, your cluster’s hot, warm, and cold tiers expand dynamically. Traditionally, admins manually add or remove data nodes, a time-consuming and error-prone process. Autoscaling functionality is included in the licensed distribution of Elasticsearch, but it is not provided in the open-source version.

In this blog, we’ll explore how to automate Elasticsearch autoscaling using Ansible playbooks integrated with Jenkins jobs. This approach enables dynamic provisioning and decommissioning of data nodes based on disk utilization thresholds, ensuring your cluster stays elastic without manual intervention.

βš™οΈ The Problem

Elasticsearch doesn’t natively autoscale physical nodes: it handles index-level scaling, but not infrastructure-level changes.

As clusters grow, engineers face these common issues:

  • Manual node provisioning when disk usage spikes.

  • Wasted compute when data tiers are underutilized.

  • Delayed response to capacity alerts.

🎯 The Solution: Ansible + Jenkins Autoscaling Framework

I built an Ansible-based autoscaling playbook that:

  • Continuously checks disk utilization of Elasticsearch nodes.

  • Triggers Jenkins jobs to provision or decommission nodes.

  • Supports tiered scaling for hot, warm, and cold data layers.

  • Automatically maintains multi-zone distribution.

This enables an autonomous loop:

Monitor → Evaluate → Provision/Decommission → Rebalance

πŸ—οΈ Architecture Overview

Diagram 1: High-Level Autoscaling Architecture

┌─────────────────────────────┐
│       Jenkins Server        │
│  • Jenkins Jobs (Provision) │
│  • Jenkins Jobs (Decomm)    │
└────────────┬────────────────┘
             │ REST API Calls
┌────────────┴─────────────┐
│     Ansible Control      │
│  • Autoscaling Playbook  │
│  • es_tier_task.yaml     │
└────────────┬─────────────┘
             │
┌────────────┴─────────────┐
│  Elasticsearch Cluster   │
│  ├── Hot Tier Nodes      │
│  ├── Warm Tier Nodes     │
│  └── Cold Tier Nodes     │
└──────────────────────────┘

🧩 Core Components

1️⃣ Main Playbook (main.yaml)

The main playbook defines global configurations, Jenkins integration, and tier-specific scaling parameters.

Key responsibilities:

  • Loads Jenkins credentials and configuration.

  • Iterates through tiers (hot, warm, cold).

  • Includes per-tier logic (es_tier_task.yaml).

  • Generates a final summary.

- name: Elastic - combined upscale & downscale for hot, warm and cold
  hosts: "{{ es_master_server }}"    # inventory host/group for the ES master
  gather_facts: false
  vars_files:
    - /Path/to/jenkins_user/secrets/for/triggering/jobs

  vars:
    global_decommission_job: "Jenkins_job_to_decommission_node"
    jenkins_param_name_decommission: "HOSTPREFIX"
    jenkins_url: "http://localhost:8080"
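
For reference, a hypothetical shape for that secrets file (the variable names jenkins_user and jenkins_api_token are illustrative and reappear in the REST calls later in this post; encrypt the file with ansible-vault in practice):

# Hypothetical secrets vars file (keep it vault-encrypted)
jenkins_user: "es-autoscaler"      # Jenkins account used for API calls
jenkins_api_token: "11aabb..."     # an API token, never a plain password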

Each tier configuration defines thresholds and Jenkins job names:

tiers:
  # ==============================
  # HOT Tier Configuration
  # ==============================
  - name: hot                                   # Tier name to be processed (hot, warm, or cold)
    
    # --- Downscale (Decommission) Parameters ---
    disk_down_threshold: 55                     # Trigger threshold: nodes below this disk usage (%) qualify for decommission
    below_count_threshold: 6                    # Minimum number of underutilized nodes required to start decommission
    decommission_count: 2                       # Number of nodes to decommission in a single cycle

    # --- Upscale (Provision) Parameters ---
    up_check_disk: 68                           # Disk usage threshold (%) for provisioning trigger (less-than rule)
    up_below_count_threshold: 6                 # If fewer than this many nodes are below the threshold, upscale is triggered
    provision_count: 2                          # Number of nodes to provision in one scale-up event

    # --- Jenkins Integration ---
    provision_job: "datastore-provision-elastic-hot"   # Jenkins job name responsible for provisioning hot tier nodes
    
    # --- Zone Distribution ---
    zone_sequence: ["a", "b"]                   # Zones where new nodes will be provisioned in round-robin sequence
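
One plausible way for the main playbook to drive the per-tier file is an include_tasks loop that exposes each tier's settings as flat variables (a sketch; the exact loop shape is illustrative, but it explains why es_tier_task.yaml can reference both item.name and bare names like disk_down_threshold):

  tasks:
    # Run the per-tier autoscaling logic once per configured tier
    - name: Run autoscaling checks for each tier
      include_tasks: es_tier_task.yaml
      loop: "{{ tiers }}"
      loop_control:
        label: "{{ item.name }}"
      vars:
        disk_down_threshold: "{{ item.disk_down_threshold }}"
        below_count_threshold: "{{ item.below_count_threshold }}"
        decommission_count: "{{ item.decommission_count }}"
        up_check_disk: "{{ item.up_check_disk }}"
        up_below_count_threshold: "{{ item.up_below_count_threshold }}"
        provision_count: "{{ item.provision_count }}"
        provision_job: "{{ item.provision_job }}"
        zone_sequence: "{{ item.zone_sequence }}"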

2️⃣ Per-Tier Task File (es_tier_task.yaml)

This file encapsulates the core logic for each tier’s autoscaling decisions.

🔽 Downscaling Logic

The playbook checks for nodes under the disk down threshold (e.g., < 55%) and decommissions a batch of them once the count of such nodes exceeds below_count_threshold.

- name: Find nodes below down threshold
  # "dup" is the _cat/nodes alias for disk.used_percent; a folded scalar (>)
  # already joins lines with spaces, so no trailing backslashes are needed
  shell: >
    curl -s 'localhost:9200/_cat/nodes?h=name,dup' |
    grep -i '{{ item.name }}' |
    awk -v thr={{ disk_down_threshold }} '$2+0 < thr { print $1 }'
  register: below_threshold_nodes   # consumed by the decommission decision
  changed_when: false               # read-only check; never reports a change

If nodes qualify for decommission:

  • Jenkins decommission job is triggered via REST API (see the sketch after this list).

  • Safety checks ensure no parallel decommission is in progress.
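
A minimal sketch of that trigger, assuming jenkins_user and jenkins_api_token come from the secrets file loaded earlier and below_threshold_nodes is the registered result of the disk check above (Jenkins answers 201 Created when a parameterized build is queued):

# Illustrative: queue the decommission job via the Jenkins REST API,
# passing only the first decommission_count qualifying nodes
- name: Trigger Jenkins decommission job
  uri:
    url: "{{ jenkins_url }}/job/{{ global_decommission_job }}/buildWithParameters?{{ jenkins_param_name_decommission }}={{ below_threshold_nodes.stdout_lines[:(decommission_count | int)] | join(',') }}"
    method: POST
    url_username: "{{ jenkins_user }}"
    url_password: "{{ jenkins_api_token }}"
    force_basic_auth: true
    status_code: 201
  when: (below_threshold_nodes.stdout_lines | length) > (below_count_threshold | int)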

🔼 Upscaling Logic

Conversely, if too few nodes are below the up_check_disk threshold (meaning disks are getting full), new nodes are provisioned.

- name: Decide provision trigger
  set_fact:
    trigger_provision: "{{ (below_upcheck_nodes | length) < (up_below_count_threshold | int) }}"

When triggered:

  • The playbook calculates the next node IDs.

  • Generates new hostprefixes (elastic-hot-18, etc.).

  • Triggers the Jenkins provision job across multiple zones (see the sketch below).
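
A hedged sketch of that bookkeeping (next_index is assumed to have been derived earlier from the highest existing node number, and the HOSTPREFIX/ZONE parameter names of the provision job are illustrative):

# Illustrative: build one provision request per new node, assigning zones
# round-robin from zone_sequence (e.g. elastic-hot-18 -> zone "a")
- name: Build provision requests for this tier
  set_fact:
    provision_requests: >-
      {{ provision_requests | default([]) + [{
           'hostprefix': 'elastic-' ~ item.name ~ '-' ~ ((next_index | int) + n),
           'zone': zone_sequence[n % (zone_sequence | length)]
         }] }}
  loop: "{{ range(0, provision_count | int) | list }}"
  loop_control:
    loop_var: n    # avoid clobbering the outer tier loop's "item"

- name: Trigger the Jenkins provision job per new node
  uri:
    url: "{{ jenkins_url }}/job/{{ provision_job }}/buildWithParameters?HOSTPREFIX={{ req.hostprefix }}&ZONE={{ req.zone }}"
    method: POST
    url_username: "{{ jenkins_user }}"
    url_password: "{{ jenkins_api_token }}"
    force_basic_auth: true
    status_code: 201
  loop: "{{ provision_requests }}"
  loop_control:
    loop_var: req
  when: trigger_provision | bool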

🔄 End-to-End Workflow

Diagram 2: End-to-End Scaling Workflow

+--------------------------------------+
|  Run Ansible Autoscaling Playbook    |
+------------------+-------------------+
                   |
                   v
        +----------+-----------+
        |  For each ES Tier    |
        +----------+-----------+
                   |
     +-------------+-------------+
     |                           |
     v                           v
Check disk usage         Check upscale threshold
(downscale)              (upscale)
     |                           |
     | Yes (low disk%)           | Yes (fewer nodes below limit)
     v                           v
Trigger Jenkins             Trigger Jenkins
decommission job            provision job

🧠 Key Design Choices

| Feature            | Purpose                                                |
| ------------------ | ------------------------------------------------------ |
| Ansible Facts      | Dynamically store thresholds, node lists, and flags.   |
| Jenkins REST API   | Reliable, token-based automation triggers.              |
| Multi-Tier Support | Separate scaling logic for hot/warm/cold.               |
| Zone Awareness     | Round-robin provisioning across zones.                  |
| Safeguards         | Prevents concurrent decommissions to avoid data loss.   |
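
As a concrete illustration of the safeguards row, one simple approach is to ask Jenkins whether the last decommission build is still running before starting a new cycle (a sketch, not necessarily the author's exact check):

# Illustrative: skip this scaling cycle while a decommission build is in flight
- name: Check the last decommission build
  uri:
    url: "{{ jenkins_url }}/job/{{ global_decommission_job }}/lastBuild/api/json"
    url_username: "{{ jenkins_user }}"
    url_password: "{{ jenkins_api_token }}"
    force_basic_auth: true
  register: last_build

- name: End this cycle if a decommission is in progress
  meta: end_play
  when: last_build.json.building | default(false)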

🧰 Example Outputs

Sample Ansible log snippet for hot tier:

TASK [Debug under-threshold nodes for hot]
ok: [localhost] => {
    "msg": [
        "Nodes under 55%: ['elastic-hot-15', 'elastic-hot-16']",
        "Count: 7 (trigger if > 6)"
    ]
}

TASK [Trigger global decommission job with combined nodes]
ok: [localhost] => {
    "msg": "Triggered automation-for-ES-data-nodes-decommision with nodes=elastic-hot-15,elastic-hot-16"
}

🧩 Diagram 3: Integration Overview

        ┌───────────────────────────┐
        │   Ansible Control Node    │
        │  (Runs autoscale.yaml)    │
        └────────────┬──────────────┘
                     │
             REST + curl to Jenkins
                     │
        ┌────────────┴──────────────┐
        │        Jenkins CI         │
        │  - Provision Job          │
        │  - Decommission Job       │
        └────────────┬──────────────┘
                     │
        ┌────────────┴──────────────┐
        │  Elasticsearch Cluster    │
        │  - Hot / Warm / Cold      │
        │  - Data Nodes & Zones     │
        └───────────────────────────┘

📈 Benefits

✅ Full automation of node lifecycle
✅ No human intervention required for scale events
✅ Consistent provisioning/decommissioning across zones
✅ Supports different scaling policies per tier
✅ Integrates seamlessly with existing Jenkins CI/CD

🔒 Safety and Best Practices

  • Always use API tokens instead of plain credentials in Jenkins calls.

  • Limit provision/decommission frequency to avoid flapping.

  • Validate node health post-provision using the _cluster/health API (see the sketch after this list).

  • Use dry-run mode during first rollout.
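
For the health validation point above, a minimal sketch that polls _cluster/health until the cluster settles (the retry budget is illustrative):

# Illustrative post-provision check: wait until the cluster reports green
- name: Validate cluster health after provisioning
  uri:
    url: "http://localhost:9200/_cluster/health"
    return_content: true
  register: es_health
  until: es_health.json.status == 'green'
  retries: 30    # with delay: 60 this waits up to ~30 minutes
  delay: 60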

🧾 Conclusion

This Ansible-Jenkins integration provides a robust, production-grade automation layer for Elasticsearch cluster management.

By combining Ansible’s orchestration power with Jenkins’ CI/CD pipelines, teams can achieve self-scaling Elasticsearch environments that respond dynamically to real-time data pressure.

Automation like this not only reduces operational toil but also improves cluster stability, cost efficiency, and availability.

🧩 Next Steps

In future iterations, this approach can be extended to:

  • Integrate with Instana dashboards for visibility.

  • Add alert-based triggers via Prometheus/Alertmanager, or even Instana-based alerting.


#Infrastructure
#CaseStudy
#Database
#SRE
