Data Pipeline Optimization with Karpathy Autoresearch and IBM Bob
This article documents a technical experiment applying Andrej Karpathy’s Autoresearch methodology, originally designed for ML model optimization, to a data engineering pipeline. The project explores how an autonomous agent (IBM Bob) can optimize data pipelines by navigating the trade-offs between speed, cloud cost, and resource utilization. You can try out this project yourself at Getting Started or skip to the Experiment Results.
I chose IBM Bob as the AI agent, primarily because its handling of structured reasoning and tool-calling felt suited to the iterative, “think-then-act” nature of this framework. Over the course of 20 autonomous experiment iterations, Bob achieved an 11.3% improvement over the baseline score.
The Framework Architecture
The experiment is built on a contract that isolates the agent’s creative freedom from the evaluation logic. This ensures that the agent can iterate rapidly without compromising the integrity of the benchmarking environment.
Program.md: Defines the environment, including the dataset (synthetic data) and the tooling (e.g., uv for dependency management).
baseline_config.py: A read-only file containing the scoring logic, dataset paths, and a fixed 5-minute time budget. The agent cannot modify this file, preventing "cheating" or metric manipulation.
pipeline.py: The only file the agent is permitted to edit. It contains the data pipeline logic, including data layout (partitioning keys, bucket counts, and sort orders), storage format, compression technique, and query logic.
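To make the fixed time budget concrete, here is a minimal sketch of how a harness might enforce it. This is not the repository's actual code: `run_within_budget` and `TIME_BUDGET_SECONDS` are hypothetical names, and only the 5-minute figure comes from the description above.

```python
import subprocess
import sys

# The fixed 5-minute budget described above (constant name is hypothetical).
TIME_BUDGET_SECONDS = 300


def run_within_budget(cmd, budget=TIME_BUDGET_SECONDS):
    """Run cmd and return its stdout, or None if it crashes or exceeds the budget."""
    try:
        done = subprocess.run(
            cmd, capture_output=True, text=True, timeout=budget, check=True
        )
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        # A timeout or a non-zero exit code both count as a failed experiment.
        return None
    return done.stdout


if __name__ == "__main__":
    # Stand-in command; in this project it would be ["uv", "run", "pipeline.py"].
    print(run_within_budget([sys.executable, "-c", "print('ok')"]))
```

Keeping this logic outside pipeline.py matches the contract: the agent can change what runs, but not how long it is allowed to run.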
The Optimization Loop
The agent operates in a continuous cycle, utilizing Git to manage state and record progress:
1. Mutation: The agent modifies a specific infrastructure lever in pipeline.py.
2. Benchmarking: The pipeline is executed for a maximum of 5 minutes.
3. Scoring: An efficiency score is calculated using the following objective function:
efficiency_score = w1 * (1/latency_seconds) + w2 * (1/cost_dollars) + w3 * resource_health_score
- latency_seconds: Total query execution time (lower is better)
- cost_dollars: Cloud compute/storage cost for the run (lower is better)
- resource_health_score: 0–100 metric based on memory/CPU utilization (higher is better, penalizes OOM or thrashing)
4. Decision:
- Keep: If the score increases, the change is committed.
- Discard: If the score decreases or the script crashes, the branch is reset via git reset --hard.
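The scoring and decision steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the contents of baseline_config.py. The weight values are placeholders (the article does not list them), the function names are hypothetical, and treating an unchanged score as a discard is my assumption.

```python
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    # Placeholder weights; the real values live in baseline_config.py.
    w1: float = 1.0
    w2: float = 1.0
    w3: float = 1.0


def efficiency_score(latency_seconds, cost_dollars, resource_health_score, w=Weights()):
    """Weighted sum from the objective function above."""
    if latency_seconds <= 0 or cost_dollars <= 0:
        raise ValueError("latency and cost must be positive")
    return (
        w.w1 * (1.0 / latency_seconds)
        + w.w2 * (1.0 / cost_dollars)
        + w.w3 * resource_health_score
    )


def decide(new_score, best_score):
    """Keep only a strict improvement; a crashed run is passed in as None."""
    if new_score is None or new_score <= best_score:
        return "discard"
    return "keep"


def apply_decision(decision, message, run=subprocess.run):
    """Commit a kept change, or roll the working tree back to the last commit."""
    if decision == "keep":
        run(["git", "commit", "-am", message], check=True)
    else:
        run(["git", "reset", "--hard"], check=True)
```

The `run` parameter exists only so the Git calls can be swapped out in a dry run. Because the score uses reciprocals, the cost term dominates the sum at fractions of a cent per run unless the weights compensate, which is one reason the weights deserve care.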
Experiments
The agent can experiment with:
- Data Layout: Partitioning keys, bucket counts, and sort orders.
- Storage Formats: Toggling between Parquet, ORC, Avro, or Feather.
- Query Logic: Adjusting join strategies and predicate pushdown.
- Resource Allocation: Tuning memory fractions and parallelism levels.
The agent may also change any other code in pipeline.py that it thinks can improve performance.
- Complexity vs. Gain: The experiment prioritizes simplicity. If an optimization adds significant code complexity for a marginal gain, it is discarded.
- Shift in Responsibility: The human role shifts from writing manual configurations to defining the weights of the objective function. The effort is spent on ensuring the evaluation harness (baseline_config.py) is robust.
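One way to picture the levers in this section is as a small immutable configuration object that each experiment changes in exactly one place. The sketch below is an assumption about structure, not the layout of pipeline.py. The field names are hypothetical, and the defaults mirror the configuration printed in the example output under Getting Started.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class PipelineConfig:
    # Hypothetical field names for the levers the agent can tune.
    file_format: str = "parquet"
    compression: str = "snappy"
    partition_columns: Optional[Tuple[str, ...]] = None
    column_pruning: bool = True
    predicate_pushdown: bool = True
    cache_intermediate: bool = False
    chunk_size: int = 100_000


def mutate(config, **one_lever):
    """Return a copy of config with exactly one lever changed."""
    if len(one_lever) != 1:
        raise ValueError("change exactly one lever per experiment")
    # dataclasses.replace raises TypeError for an unknown field name.
    return replace(config, **one_lever)


if __name__ == "__main__":
    baseline = PipelineConfig()
    print(mutate(baseline, compression="zstd"))
```

Changing one lever at a time keeps each score difference attributable to a single decision, which is what makes the keep-or-discard rule meaningful.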
Getting Started
1. Clone the repository
git clone https://github.com/Henry-Xiao-HX/auto-data-pipeline-optimization.git
2. Install dependencies
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
3. Generate baseline dataset
uv run python generate_dataset.py
This command creates:
~/.cache/auto-data/data/data.parquet - Main dataset (Parquet with snappy compression)
~/.cache/auto-data/data/data.csv - CSV version for comparison
~/.cache/auto-data/data/data.feather - Feather version for comparison
~/.cache/auto-data/data/partitioned/ - Pre-partitioned versions for testing
4. Run the baseline to verify the setup
uv run pipeline.py
Example output:
================================================================================
INFRASTRUCTURE OPTIMIZATION EXPERIMENT
================================================================================
Configuration:
File Format: parquet
Compression: snappy
Partition Columns: None
Column Pruning: True
Predicate Pushdown: True
Cache Intermediate: False
Chunk Size: 100,000
Dataset Directory: /Users/you/.cache/autoinfra/data
Time Budget: 300s
================================================================================
Running pipeline...
Total execution time: 2.3s
---
efficiency_score: 0.8542
latency_seconds: 2.1
cost_dollars: 0.0012
resource_health: 87.5
throughput_mb_s: 48.2
data_processed_gb: 0.1
peak_memory_gb: 0.8
cpu_utilization_pct: 65.3
data_correct: True
================================================================================
EXPERIMENT COMPLETED SUCCESSFULLY
================================================================================
5. Start Autonomous Optimization
Point your AI agent IBM Bob to infrastructure_program.md and let it run.
Hi, have a look at infrastructure_program.md and let's kick off a new experiment! Let's do the setup first.
The agent will:
- Try different configurations
- Keep improvements, discard regressions
- Log all results to infra_results.tsv
Note: Autonomous runs require auto-approving a few tools, meaning Bob will complete those actions without asking for your permission. The tools to approve are: Read (view your files and directory content), Write (create, edit, and save files in your directory), Question response (after the time limit expires, select the first answer from the provided options), Execute (run commands in your terminal), and Subtasks (create and complete subtasks).
To lower the risk, you can restrict Bob to executing only certain commands, such as uv run python, ls, and git diff. This step is necessary for autonomous optimization: it lets Bob run by itself while you grab a cup of coffee and come back to an improved pipeline with a verifiable experiment history.
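If you want to post-process a run yourself, the metric block in the example output from step 4 is easy to parse. The sketch below assumes the output keeps the shape shown there (a `---` separator followed by `name: value` lines). `parse_metrics` is a hypothetical helper, not part of the repository.

```python
def parse_metrics(output):
    """Collect the 'name: value' lines that follow the '---' separator."""
    metrics = {}
    in_block = False
    for line in output.splitlines():
        line = line.strip()
        if line == "---":
            in_block = True
            continue
        if not in_block or ":" not in line:
            continue
        name, _, raw = line.partition(":")
        name, raw = name.strip(), raw.strip()
        if raw in ("True", "False"):
            metrics[name] = raw == "True"
        else:
            try:
                metrics[name] = float(raw)
            except ValueError:
                # Ignore anything that is not a number or a boolean.
                pass
    return metrics


if __name__ == "__main__":
    sample = "Total execution time: 2.3s\n---\nefficiency_score: 0.8542\ndata_correct: True\n"
    print(parse_metrics(sample))
```

Lines before the separator, such as the configuration dump, are skipped on purpose so that only the reported metrics end up in the dictionary.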
Experiment Results
Below is the execution log (infra_results.tsv) from a recent trial run. The agent performed 20 experiments before I stopped it, netting an 11.3% improvement over the baseline efficiency score.
| Commit | Efficiency Score | Latency (s) | Cost (USD) | Memory (GB) | Status | Description |
| --- | --- | --- | --- | --- | --- | --- |
| b855524 | 179936.8891 | 4.7 | 0.0056 | 4.7 | Keep | baseline - feather + snappy, cache enabled, no predicate pushdown |
| b6c0992 | 182233.8009 | 4.6 | 0.0055 | 4.5 | Keep | enable predicate pushdown - early filtering |
| 02792de | 163251.8696 | 4.8 | 0.0061 | 4.8 | Discard | switch to parquet - slower than feather |
| 6e014d3 | 183373.9876 | 4.6 | 0.0055 | 4.5 | Keep | zstd compression - better than snappy |
| 80677de | 191276.4190 | 4.4 | 0.0052 | 4.5 | Keep | disable caching - reduces overhead |
| 3c2e4ac | 181245.1206 | 4.6 | 0.0055 | 4.5 | Discard | increase chunk size - worse performance |
| 30df87f | 190899.4804 | 4.4 | 0.0052 | 4.5 | Discard | lz4 compression - slightly worse than zstd |
| cf26279 | 194354.6490 | 4.3 | 0.0051 | 4.5 | Keep | reduce chunk size to 100k - better performance |
| 4359064 | 190961.5426 | 4.4 | 0.0052 | 4.5 | Discard | 50k chunk size - too small, adds overhead |
| 803d263 | 133771.6668 | 6.3 | 0.0075 | 7.3 | Discard | disable column pruning - much worse performance |
| 7711cee | 186564.4094 | 4.5 | 0.0054 | 4.5 | Discard | no compression - I/O cost too high |
| 592b474 | 194316.2414 | 4.3 | 0.0051 | 4.5 | Discard | gzip compression - marginally worse than zstd |
| 73c0e6c | 191355.4304 | 4.4 | 0.0052 | 4.7 | Discard | query() method - slower than boolean indexing |
| 860bbfd | 191513.1058 | 4.4 | 0.0052 | 3.8 | Discard | categorical dtype - conversion overhead outweighs benefits |
| e55588d | 194628.1254 | 4.3 | 0.0051 | 4.5 | Keep | named aggregations - avoid MultiIndex flattening |
| e71c067 | 188208.7126 | 4.5 | 0.0053 | 4.5 | Discard | lower memory limit - no benefit |
| a588867 | 187918.2632 | 4.5 | 0.0053 | 4.6 | Discard | sort=False in groupby - worse performance |
| d0bb26d | 192083.8743 | 4.4 | 0.0052 | 4.5 | Discard | 75k chunk size - worse than 100k |
| e7e5ac5 | 200173.7765 | 4.2 | 0.0050 | 4.5 | Keep | remove copy() calls - major improvement |
| b461c1c | 194498.0399 | 4.3 | 0.0051 | 4.5 | Discard | inplace operations - slower than chaining |
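The log can be checked against the keep-or-discard rule by replaying it. In the sketch below, the commit hashes, scores, and statuses are copied from the log. The `replay` helper is hypothetical, and it assumes a change is kept only when it strictly beats the best score so far.

```python
# (commit, efficiency_score, status) copied from the execution log.
LOG = [
    ("b855524", 179936.8891, "Keep"),
    ("b6c0992", 182233.8009, "Keep"),
    ("02792de", 163251.8696, "Discard"),
    ("6e014d3", 183373.9876, "Keep"),
    ("80677de", 191276.4190, "Keep"),
    ("3c2e4ac", 181245.1206, "Discard"),
    ("30df87f", 190899.4804, "Discard"),
    ("cf26279", 194354.6490, "Keep"),
    ("4359064", 190961.5426, "Discard"),
    ("803d263", 133771.6668, "Discard"),
    ("7711cee", 186564.4094, "Discard"),
    ("592b474", 194316.2414, "Discard"),
    ("73c0e6c", 191355.4304, "Discard"),
    ("860bbfd", 191513.1058, "Discard"),
    ("e55588d", 194628.1254, "Keep"),
    ("e71c067", 188208.7126, "Discard"),
    ("a588867", 187918.2632, "Discard"),
    ("d0bb26d", 192083.8743, "Discard"),
    ("e7e5ac5", 200173.7765, "Keep"),
    ("b461c1c", 194498.0399, "Discard"),
]


def replay(log):
    """Re-derive each decision from the scores alone.

    Returns (best_commit, best_score, mismatches), where mismatches lists
    commits whose logged status disagrees with the strict-improvement rule.
    """
    best_commit, best_score = log[0][0], log[0][1]  # first row is the baseline
    mismatches = []
    for commit, score, status in log[1:]:
        expected = "Keep" if score > best_score else "Discard"
        if expected != status:
            mismatches.append(commit)
        if expected == "Keep":
            best_commit, best_score = commit, score
    return best_commit, best_score, mismatches


if __name__ == "__main__":
    print(replay(LOG))
```

Replaying these rows yields no mismatches and ends on e7e5ac5 as the best commit, so every logged decision follows from comparing against the best kept score rather than the immediately preceding run.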
Core Performance Improvements
| Milestone | Commit | Efficiency | Δ Baseline | Key Change |
| --- | --- | --- | --- | --- |
| Baseline | b855524 | 179,936.89 | n/a | Feather + Snappy, Caching On |
| I/O Refinement | 6e014d3 | 183,373.99 | +1.9% | Switched to zstd compression |
| Logic Tuning | 80677de | 191,276.42 | +6.3% | Disabled caching to reduce overhead |
| Chunking | cf26279 | 194,354.65 | +8.0% | Optimized to 100k row chunks |
| Final State | e7e5ac5 | 200,173.78 | +11.3% | Removed redundant .copy() calls |
Technical Observations
The optimization loop demonstrated that, for this small workload specifically, standard best practices introduced more overhead than they resolved. A breakdown of the logic during key iterations:
- Format Selection: Early testing identified that Parquet (02792de) was suboptimal for this dataset size, leading to a focus on Feather.
- Overhead Reduction: The search identified that features like intermediate caching (80677de) added serialization penalties that exceeded re-computation costs.
- Refinement: Fine-tuning memory via 100k chunk sizes (cf26279) balanced CPU utilization and I/O throughput.
By the final commit (e7e5ac5), the pipeline was simplified, maximizing the throughput-to-cost ratio.
Key Findings
1. Memory and Object Management
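Removing unnecessary defensive copies (e7e5ac5) produced the final gain and one of the largest single ones, lifting the score from 194,628 to 200,174. With that change in place, latency stood at 4.2s against a 4.7s baseline, and cost per run at $0.0050 against $0.0056. Notably, attempts to use inplace=True (b461c1c) and categorical dtypes (860bbfd) both regressed: the log records in-place operations as slower than chaining, and categorical conversion overhead as outweighing its benefits.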
2. Storage & I/O
For this workload, Feather outperformed Parquet: the switch to Parquet (02792de) saw a ~9.2% drop in efficiency relative to the baseline. While Snappy is common, zstd provided a superior balance of compression ratio and decompression speed for our specific throughput requirements.
3. Counter-Intuitive Regressions
- Caching: Disabling the intermediate cache (80677de) provided a +4.3% boost over the previous best score, as the transformation logic proved faster than the I/O penalty of serialized caching.
- Column Pruning: Disabling pruning (803d263) caused the most severe regression, with efficiency dropping 25.6% below the baseline and memory usage spiking to 7.3GB.
- “Optimal” configurations are highly dependent on the dataset and hardware. A configuration that wins on a 100MB dataset may not hold at 100TB.
Final Configuration
- Format: Feather (zstd compression)
- Granularity: 100k row chunks
- Logic: Boolean indexing (avoiding .query()), named aggregations, and method chaining to minimize memory fragmentation.
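For readers less familiar with these pandas idioms, here is a small self-contained illustration of the three logic choices. It is not an excerpt from pipeline.py. The column names and data are invented for the example, and it assumes pandas is installed.

```python
import pandas as pd

# Toy data with hypothetical column names.
df = pd.DataFrame(
    {
        "region": ["east", "east", "west", "west", "west"],
        "amount": [10.0, -5.0, 20.0, 30.0, -1.0],
    }
)

summary = (
    # Boolean indexing instead of df.query("amount > 0"); no defensive .copy().
    df[df["amount"] > 0]
    .groupby("region", as_index=False)
    # Named aggregations give flat column names, so no MultiIndex to flatten.
    .agg(total_amount=("amount", "sum"), order_count=("amount", "count"))
)

print(summary)
```

Whether this style is faster on other data is an empirical question. In this run the agent kept it only because the measured score improved.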
Final thoughts
The 11.3% efficiency gain achieved by IBM Bob is impressive, but the real takeaway isn’t the specific configuration. For me, it’s the shift in methodology. By applying Karpathy’s Autoresearch framework, my role transitioned from manually tweaking chunk sizes and compression techniques to defining the “rules of the game” in baseline_config.py. The engineering effort moves with it: instead of searching for the best solution myself, I guide the AI agent to discover it. As data volumes scale, the "optimal" pipeline becomes a moving target that varies by hardware, dataset skew, and cloud pricing. This project suggests a future where data engineers spend less time in the weeds of trial-and-error and more time designing the objective functions that allow agents like IBM Bob to find peak performance autonomously.