Reproduce the Paper & Run from Scratch¶
Run the exact scripts that produce the paper's figures and tables, or recompute the full benchmark from scratch on an HPC/SLURM cluster.
Quick Start¶
Download the pre-computed results from DOI:10.25532/OPARA-862 and run the evaluation scripts directly:
# Setup
git clone https://github.com/jgonsior/olympic-games-of-active-learning.git
cd olympic-games-of-active-learning
conda create --name ogal --file conda-linux-64.lock && conda activate ogal && poetry install
cp .server_access_credentials.cfg.example .server_access_credentials.cfg
# edit .server_access_credentials.cfg → set OUTPUT_PATH and DATASETS_PATH under [LOCAL]
# Download and extract data
wget -c -O full_exp_jan.zip \
"https://opara.zih.tu-dresden.de/bitstreams/38951489-5076-4544-a99b-c20dddfc2c6b/download"
unzip full_exp_jan.zip -d /path/to/results/
Run a small smoke test before going to HPC scale:
python 01_create_workload.py --EXP_TITLE smoke_test
# This creates several files in OUTPUT_PATH/smoke_test/, including:
# 01_workload.csv – hyperparameter grid (one row per experiment)
# 02b_run_bash_parallel.py – runs all experiments in parallel locally
# 02_slurm.slurm – SLURM job array script (for HPC clusters)
# Run all experiments locally in parallel:
python "$OUTPUT_PATH/smoke_test/02b_run_bash_parallel.py"
# On a SLURM cluster, use the generated job array instead:
# sbatch "$OUTPUT_PATH/smoke_test/02_slurm.slurm"
# Note: 02_run_experiment.py outputs .csv files; compress to .csv.xz afterwards
ls "$OUTPUT_PATH/smoke_test/05_done_workload.csv"
Full Pipeline¶
flowchart TD
CFG["exp_config.yaml"] --> WL["01_create_workload.py"]
WL --> CSV["01_workload.csv"]
CSV --> RUN["02_run_experiment.py"]
RUN --> RAW["Per-cycle CSVs (.csv)"]
RAW --> CONV["compress to .csv.xz"]
CONV --> XZ["Per-cycle CSVs (.csv.xz)"]
XZ --> CAT["03_calculate_dataset_categorizations.py"]
XZ --> ADV["04_calculate_advanced_metrics.py"]
CAT --> PREP["Prerequisites (convert_y_pred_to_parquet, etc.)"]
ADV --> PREP
PREP --> EVA["eva_scripts/*"]
EVA -->|"auto-generate if missing"| TS["_TS/*.parquet"]
EVA --> PLOTS["plots/*.parquet + PDFs"]
Pipeline Steps¶
| Step | Script | Input | Output |
|---|---|---|---|
| 1 | `01_create_workload.py` | `resources/exp_config.yaml` | `01_workload.csv` (Cartesian product of hyperparameters) |
| 2 | `02_run_experiment.py` | `01_workload.csv` row (by `WORKER_INDEX`) | `{STRATEGY}/{DATASET}/*.csv` (per-cycle metrics, uncompressed) |
| 2b | Compress results | `*.csv` | `*.csv.xz` (compressed per-cycle metrics) |
| 3 | `03_calculate_dataset_categorizations.py` | Dataset CSVs | `{DATASET}/{categorizer}.parquet` (14 sample-level categorizers) |
| 4 | `04_calculate_advanced_metrics.py` | Per-cycle CSVs | `{STRATEGY}/{DATASET}/{metric}.csv.xz` (AUC, distance, etc.) |
| 5 | Prerequisite scripts (`convert_y_pred_to_parquet.py`, `calculate_dataset_dependend_random_ramp_slope.py`) | Per-cycle CSVs, parquets | Converted parquets, slope data |
| 6 | `eva_scripts/*` | Per-cycle CSVs, parquets | `_TS/*.parquet` (auto-generated if missing), `plots/*` (leaderboards, heatmaps, PDFs) |
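Step 1's Cartesian-product expansion can be sketched as follows. This is a toy grid with illustrative parameter names; the real grid comes from `resources/exp_config.yaml`, and `01_create_workload.py` does additional bookkeeping:

```python
import csv
import io
import itertools

# Toy hyperparameter grid (names and values are illustrative only).
grid = {
    "EXP_STRATEGY": ["ALIPY_UNCERTAINTY_LC", "SKACTIVEML_QBC"],
    "EXP_DATASET": ["Iris", "wine"],
    "EXP_RANDOM_SEED": [0, 1],
}

# One row per hyperparameter combination, exactly like 01_workload.csv.
rows = [dict(zip(grid, combo)) for combo in itertools.product(*grid.values())]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(grid))
writer.writeheader()
writer.writerows(rows)

print(len(rows))  # 2 strategies x 2 datasets x 2 seeds = 8 experiments
```

Because every worker later selects one such row by `WORKER_INDEX`, the experiments stay fully independent and trivially parallelizable.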
Step 0: Download Datasets
Kaggle datasets require a Kaggle API token before running this step. OpenML datasets need no extra credentials.
# Kaggle setup (skip if you only use OpenML datasets):
# 1. Create an API token at https://www.kaggle.com/settings
# 2. Place kaggle.json in ~/.kaggle/ and restrict permissions:
mkdir -p ~/.kaggle && chmod 700 ~/.kaggle
# mv ~/Downloads/kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
python 00_download_datasets.py
- Reads: `resources/openml_datasets.yaml`, `resources/kaggle_datasets.yaml`
- Produces: dataset CSV files in `DATASETS_PATH`
- Also computes: cosine distance matrices for datasets (used later by distance metrics)
If you only want to analyze the OPARA archive, you can skip dataset downloading entirely.
Step 2: Per-experiment output files
Each worker picks one row from 01_workload.csv and runs the full AL loop. The framework runner (determined by EXP_STRATEGY) handles: initialization → query selection → labeling → model retraining → metric recording, repeated for all AL cycles.
Results lifecycle: Workers append results to plain .csv files — this append-only format supports massive parallel HPC jobs writing to shared files concurrently. After all experiments finish, compress CSVs to .csv.xz (step 2b) to save disk space. Evaluation scripts and the OPARA archive consume .csv.xz.
Output files per experiment in {OUTPUT_PATH}/{STRATEGY_NAME}/{DATASET_NAME}/ (.csv during execution, .csv.xz after compression):
| File | Contents |
|---|---|
| `accuracy.csv` | Per-cycle accuracy values |
| `weighted_f1-score.csv` | Per-cycle weighted F1 scores |
| `macro_f1-score.csv` | Per-cycle macro F1 scores |
| `weighted_precision.csv` | Per-cycle weighted precision |
| `macro_precision.csv` | Per-cycle macro precision |
| `weighted_recall.csv` | Per-cycle weighted recall |
| `macro_recall.csv` | Per-cycle macro recall |
| `query_selection_time.csv` | Time taken per query selection |
| `learner_training_time.csv` | Time taken per model retraining |
| `selected_indices.csv` | Which sample indices were queried |
| `y_pred_train.csv` | Model predictions on the training set |
| `y_pred_test.csv` | Model predictions on the test set |
Each CSV has one row per experiment (EXP_UNIQUE_ID) with columns for each AL cycle iteration.
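Under that layout (one row per `EXP_UNIQUE_ID`, one column per cycle), loading and slicing such a file with pandas might look like the sketch below. An inline CSV stands in for a real file; the IDs and values are made up:

```python
import io
import pandas as pd

# Assumed layout: one row per EXP_UNIQUE_ID, one column per AL cycle.
raw = io.StringIO(
    "EXP_UNIQUE_ID,0,1,2\n"
    "17,0.61,0.72,0.80\n"
    "42,0.55,0.70,0.78\n"
)
df = pd.read_csv(raw, index_col="EXP_UNIQUE_ID")
# pandas transparently decompresses by extension, so after step 2b the same
# call works directly on a compressed file, e.g.
# pd.read_csv("accuracy.csv.xz", index_col="EXP_UNIQUE_ID")

curve = df.loc[42]      # learning curve of one experiment
final = df.iloc[:, -1]  # final-cycle value of every experiment
```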
Step 3: Dataset categorizations (14 categorizers)
Computes sample-level features for each dataset, characterizing how "hard" or "interesting" each sample is:
| Categorizer | What It Measures |
|---|---|
| `COUNT_WRONG_CLASSIFICATIONS` | How often a sample is misclassified |
| `SWITCHES_CLASS_OFTEN` | How often the predicted class changes across AL cycles |
| `CLOSENESS_TO_DECISION_BOUNDARY` | Distance to the nearest decision boundary |
| `REGION_DENSITY` | Local density of samples |
| `MELTING_POT_REGION` | Mixed-class region indicator |
| `INCLUDED_IN_OPTIMAL_STRATEGY` | Whether the sample is in the optimal query set |
| `CLOSENESS_TO_SAMPLES_OF_SAME_CLASS_kNN` | kNN distance to same-class samples |
| `CLOSENESS_TO_SAMPLES_OF_OTHER_CLASS_kNN` | kNN distance to other-class samples |
| `CLOSENESS_TO_CLUSTER_CENTER` | Distance to cluster centers |
| `IMPROVES_ACCURACY_BY` | Accuracy improvement from labeling this sample |
| `AVERAGE_UNCERTAINTY` | Mean model uncertainty for this sample |
| `OUTLIERNESS` | Outlier score |
| `CLOSENESS_TO_SAMPLES_OF_SAME_CLASS` | Non-kNN same-class distance |
| `CLOSENESS_TO_SAMPLES_OF_OTHER_CLASS` | Non-kNN other-class distance |
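To make the kNN-style categorizers concrete, here is a hedged sketch of a `CLOSENESS_TO_SAMPLES_OF_SAME_CLASS_kNN`-like score: the mean distance from each sample to its k nearest same-class neighbours. The repo's actual implementation may differ in k, distance metric, and normalization:

```python
import numpy as np

def same_class_knn_closeness(X, y, k=2):
    """Mean Euclidean distance to the k nearest same-class neighbours (sketch)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    scores = np.empty(len(X))
    for i in range(len(X)):
        mask = (y == y[i])
        mask[i] = False                          # exclude the sample itself
        dists = np.linalg.norm(X[mask] - X[i], axis=1)
        scores[i] = np.sort(dists)[:k].mean()    # k nearest same-class samples
    return scores

X = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
y = [0, 0, 0, 1, 1, 1]
scores = same_class_knn_closeness(X, y)
print(np.round(scores, 3))  # low score = sample sits deep inside its own class
```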
Step 4: Advanced metrics (7 metric types)
Computes derived metrics from the raw per-cycle results — aggregations that summarize how each experiment performed:
| Computed Metric | Output Files | Description |
|---|---|---|
| `STANDARD_AUC` | `full_auc_{base_metric}.csv.xz`, `ramp_up_auc_{base_metric}.csv.xz`, `plateau_auc_{base_metric}.csv.xz`, `final_value_{base_metric}.csv.xz`, `first_5_{base_metric}.csv.xz`, `last_5_{base_metric}.csv.xz` | AUC-based aggregations of the learning curve for each base metric |
| `DISTANCE_METRICS` | Distance metric CSVs | Sample distance and similarity measures |
| `MISMATCH_TRAIN_TEST` | Mismatch CSVs | Train/test distribution divergence |
| `CLASS_DISTRIBUTIONS` | Class distribution CSVs | Per-cycle class balance changes |
| `METRIC_DROP` | Metric drop CSVs | Performance drop analysis |
| `DATASET_CATEGORIZATION` | Categorization CSVs | Dataset hardness metrics |
| `TIMELAG_METRIC` | Timelag CSVs | Prediction lag analysis |
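The `STANDARD_AUC` family reduces a learning curve to a single number: the normalized area under it. The sketch below illustrates the idea; the exact windowing for `ramp_up`/`plateau` in `04_calculate_advanced_metrics.py` may differ:

```python
import numpy as np

def normalized_auc(curve):
    """Trapezoidal area under the curve, normalized to [0, 1] (sketch)."""
    c = np.asarray(curve, dtype=float)
    area = ((c[:-1] + c[1:]) / 2).sum()  # trapezoidal rule with unit step
    return area / (len(c) - 1)           # divide by max possible area

f1_per_cycle = [0.50, 0.70, 0.80, 0.85, 0.85]
full_auc = normalized_auc(f1_per_cycle)        # whole curve
ramp_up_auc = normalized_auc(f1_per_cycle[:3]) # early cycles only (illustrative cut)
print(round(full_auc, 5), round(ramp_up_auc, 5))
```

A strategy that ramps up quickly scores higher than one that only reaches the same final value late, which is exactly what a plain `final_value_*` metric cannot distinguish.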
Post-Processing (Steps 3–6)¶
After experiments complete (step 2), compress the raw CSV results and run post-processing:
# Step 2b: Compress raw CSV results to .csv.xz
# (02_run_experiment.py outputs .csv files that must be compressed)
xz "$OUTPUT_PATH"/my_experiment/*/*/*.csv
# Step 3: Compute sample-level dataset categorizations
python 03_calculate_dataset_categorizations.py --EXP_TITLE my_experiment --SAMPLES_CATEGORIZER _ALL --EVA_MODE local
# Step 4: Compute advanced metrics (AUC, distances, class distributions, etc.)
python 04_calculate_advanced_metrics.py --EXP_TITLE my_experiment --COMPUTED_METRICS _ALL --EVA_MODE local
# Step 5: Run prerequisite conversion scripts
python scripts/convert_y_pred_to_parquet.py --EXP_TITLE my_experiment
python -m eva_scripts.calculate_dataset_dependend_random_ramp_slope --EXP_TITLE my_experiment
# Step 6: Build leaderboard rankings
python -m eva_scripts.calculate_leaderboard_rankings --EXP_TITLE my_experiment
_TS/*.parquet files are generated automatically
No single dedicated script creates the _TS/*.parquet time series files. Instead, each evaluation script in eva_scripts/ checks for the _TS/*.parquet files it needs and generates any missing ones on the fly; final_leaderboard.py, runtime.py, single_hyperparameter_evaluation_metric.py, and others all share this behavior.
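The "auto-generate if missing" behavior is a simple cache-on-first-use pattern. The sketch below shows the shape of that logic; the real scripts build parquet files from the compressed per-cycle CSVs, whereas a plain text file stands in here to keep the example dependency-free:

```python
import tempfile
from pathlib import Path

def load_or_build(path: Path, build):
    """Return cached file contents, generating them on first use (sketch)."""
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(build())       # the expensive step runs only once
    return path.read_text()

with tempfile.TemporaryDirectory() as tmp:
    ts = Path(tmp) / "_TS" / "weighted_f1-score.txt"
    first = load_or_build(ts, lambda: "built")
    second = load_or_build(ts, lambda: "rebuilt")  # cache hit: not rebuilt
    print(first, second)  # built built
```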
Reproducing Paper Figures¶
With data ready (either from OPARA or your own run), reproduce all paper figures:
Main Leaderboard (Table 1 / Figure 4)¶
python -m eva_scripts.final_leaderboard --EXP_TITLE full_exp_jan
Output: plots/final_leaderboard/rank_sparse_zero_full_auc_weighted_f1-score.parquet
Three Correlation Heatmaps¶
| Color | What It Measures | Script |
|---|---|---|
| Blue | Metric correlation (Pearson) | python -m eva_scripts.single_hyperparameter_evaluation_metric --EXP_TITLE full_exp_jan |
| Green | Queried samples (Jaccard) | python -m eva_scripts.single_hyperparameter_evaluation_indices --EXP_TITLE full_exp_jan |
| Orange | Ranking invariance (Kendall τ) | python -m eva_scripts.leaderboard_single_hyperparameter_influence --EXP_TITLE full_exp_jan |
Additional Figures¶
# Example learning curve plot (Figure 2)
python -m eva_scripts.single_learning_curve_example --EXP_TITLE full_exp_jan
# Runtime analysis (Figure 7)
python -m eva_scripts.runtime --EXP_TITLE full_exp_jan
# All paper plots at once
python -m eva_scripts.redo_plots_for_paper --EXP_TITLE full_exp_jan
Output Mapping¶
| Paper Figure | Script | Output File |
|---|---|---|
| Table 1 (Leaderboard) | `final_leaderboard.py` | `plots/final_leaderboard/*.parquet` |
| Figure 2 (Learning curves) | `single_learning_curve_example.py` | `plots/single_learning_curve/*.parquet` |
| Figures 4–6 (Heatmaps) | `single_hyperparameter_*.py` | `plots/single_hyperparameter/*` |
| Figure 7 (Runtime) | `runtime.py` | `plots/runtime/*.parquet` |
Verify Results¶
import pandas as pd
lb = pd.read_parquet("plots/final_leaderboard/rank_sparse_zero_full_auc_weighted_f1-score.parquet")
print("Top 5 strategies (avg rank):", lb.mean(axis=0).sort_values().head(5))
Complete Eva Scripts Index¶
For the full list of all evaluation and utility scripts, see Reference → Eva Scripts Index.
Correlation Metrics (Paper ↔ Code)¶
For correlation metric definitions (Pearson, Jaccard, Kendall) and terminology cross-reference, see Reference → Correlation Metrics.
HPC Configuration¶
Create .server_access_credentials.cfg:
[HPC]
SSH_LOGIN=user@login.hpc.example.edu
DATASETS_PATH=/path/to/datasets
OUTPUT_PATH=/path/to/exp_results
SLURM_MAIL=your.email@example.edu
SLURM_PROJECT=your_project_account
PYTHON_PATH=/path/to/conda-env/bin/python
[LOCAL]
DATASETS_PATH=/path/to/datasets
OUTPUT_PATH=/path/to/exp_results
Resume After Failure¶
OGAL automatically tracks progress via tracking files:
| File | Purpose |
|---|---|
| `05_done_workload.csv` | Successfully completed experiments |
| `05_failed_workloads.csv` | Experiments that failed with errors |
| `05_started_oom_workloads.csv` | Experiments killed by OOM |
To resume, re-run 01_create_workload.py (it automatically excludes already-completed experiments) and resubmit the generated SLURM script:
python 01_create_workload.py --EXP_TITLE my_experiment
sbatch "$OUTPUT_PATH/my_experiment/02_slurm.slurm"
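The exclusion step amounts to an anti-join of the regenerated grid against the done-tracking file. A minimal sketch, with made-up IDs and an illustrative column name:

```python
import pandas as pd

# Stand-ins for the regenerated 01_workload.csv and 05_done_workload.csv.
workload = pd.DataFrame({"EXP_UNIQUE_ID": [1, 2, 3, 4]})
done = pd.DataFrame({"EXP_UNIQUE_ID": [1, 3]})

# Keep only experiments whose ID is not in the done-tracking file.
remaining = workload[~workload["EXP_UNIQUE_ID"].isin(done["EXP_UNIQUE_ID"])]
print(remaining["EXP_UNIQUE_ID"].tolist())  # [2, 4]
```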
Troubleshooting & Fix Scripts¶
Common Issues¶
| Issue | Solution |
|---|---|
| `FileNotFoundError` for datasets | Check `DATASETS_PATH` in `.server_access_credentials.cfg` |
| Jobs killed (OOM) | Increase `SLURM_MEMORY`; check `05_started_oom_workloads.csv` |
| Experiments not completing | Increase `EXP_QUERY_SELECTION_RUNTIME_SECONDS_LIMIT` |
| Missing `_TS/*.parquet` | These are auto-generated by evaluation scripts when missing. Ensure steps 2–5 completed successfully and that `.csv.xz` files exist. |
| Incomplete experiment grid | Use `scripts/reduce_to_dense.py` to remove results where the full hyperparameter grid is incomplete, creating a dense grid from sparse experimental results |
Fix Scripts (only needed if something breaks)¶
These scripts in scripts/ are not part of the normal pipeline — they exist to repair data issues that can occur during large-scale HPC runs. You only need them if you encounter specific problems.
Data Validation Scripts
| Script | When to Use |
|---|---|
| `scripts/validate_results_schema.py` | Verify result file formats are correct |
| `scripts/check_if_exp_ids_are_present.py` | Verify all experiment IDs exist in metric files |
| `scripts/find_missing_exp_ids_in_metric_files.py` | Find experiments missing from metric CSVs |
| `scripts/find_broken_file.py` | Identify corrupted metric CSV files |
| `scripts/exp_results_data_format_test.py` | Test that result CSV generation/format is correct |
Fix Scripts (data repair)
| Script | What It Fixes |
|---|---|
| `scripts/fix_oom_workload.py` | Remove OOM experiments from the done workload |
| `scripts/fix_duplicate_header_columns.py` | Remove duplicate column headers in CSVs |
| `scripts/fix_remove_unnamed_column.py` | Strip spurious `Unnamed: 0` columns |
| `scripts/fix_reduce_number_precision.py` | Round numeric values to 4 decimals (saves space) |
| `scripts/fix_macro_f1_score_duplicates.py` | Remove duplicate columns in macro F1 files |
| `scripts/fix_apply_runtime_limit_post_mortem.py` | Remove experiments exceeding query runtime limits |
| `scripts/fix_early_stopping_dict_keys_too_small_error.py` | Fix malformed CSV rows from dict parsing errors |
| `scripts/fix_check_if_dupicate_param_combinations_exist.py` | Detect duplicate parameter combinations |
| `scripts/fix_unconverted_y_parquet.py` | Fix `y_pred` parquets with wrong data types |
Merge & Remove Scripts
| Script | What It Does |
|---|---|
| `scripts/merge_two_workloads.py` | Merge two experimental result sets |
| `scripts/merge_duplicate_parquets.py` | Merge duplicate `y_pred` parquets, keeping unique IDs |
| `scripts/remove_oom_results_from_metric_files.py` | Strip out-of-memory results from metric files |
| `scripts/remove_dataset_results.py` | Delete results for specific datasets |
| `scripts/remove_duplicated_exp_ids.py` | Drop duplicate experiment entries |
| `scripts/remove_lbfgs_mlp_results.py` | Remove LBFGS/MLP learner results |
| `scripts/reduce_to_dense.py` | Remove results where the full hyperparameter grid is incomplete, creating a dense grid from sparse experimental results |
Re-run Scripts (retry failed work)
| Script | What It Does |
|---|---|
| `scripts/rerun_broken_experiments.py` | Re-run experiments that failed |
| `scripts/rerun_missing_exp_ids.py` | Retry experiments with missing result files |
| `scripts/rerun_broken_dataset_categorizations.py` | Recompute broken dataset categorization metrics |
| `scripts/replace_broken_parquet_csvs_with_working_file.py` | Restore broken parquets from backup files |
Conversion Scripts
| Script | What It Does |
|---|---|
| `scripts/convert_metrics_csvs_to_exp_id_csvs.py` | Reorganize metric CSVs by experiment ID |
| `scripts/convert_dataset_distances_to_parqet.py` | Convert dataset distance CSVs to parquet |
| `scripts/convert_y_pred_to_parquet.py` | Convert `y_pred` CSVs to parquet format |
| `scripts/create_auc_selected_ts.py` | Create AUC time series from selected indices |
Design Goals¶
| Goal | How |
|---|---|
| HPC-scale | Each experiment independent; WORKER_INDEX selects one row from 01_workload.csv |
| Resumable | 05_done_workload.csv tracks completed experiments; re-running 01_create_workload.py skips them |
| Deterministic | Fixed seeds; Cartesian product workload ensures full coverage |
| Framework-agnostic | Unified runner adapts 5+ AL frameworks (ALiPy, libact, small-text, scikit-activeml, playground) |
Configuration¶
All configuration flows through misc/config.py, which loads settings from multiple sources in this priority order:
1. `.server_access_credentials.cfg` — paths and HPC settings (`DATASETS_PATH`, `OUTPUT_PATH`, `SLURM_*`)
2. `resources/exp_config.yaml` — experiment grid definitions (`EXP_GRID_*` parameters)
3. CLI arguments — override any setting at runtime
4. Workload row — during execution, `02_run_experiment.py` loads one row from `01_workload.csv`
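The precedence above is a "later source wins" merge. A hedged sketch of that layering, using illustrative section and key names rather than the exact `misc/config.py` implementation:

```python
import configparser

def resolve_config(cfg_text, yaml_settings, cli_args):
    """Merge config sources; later sources override earlier ones (sketch)."""
    cfg = configparser.ConfigParser()
    cfg.read_string(cfg_text)
    settings = dict(cfg["LOCAL"])   # 1. credentials file (keys are lowercased)
    settings.update(yaml_settings)  # 2. exp_config.yaml grid definitions
    settings.update({k: v for k, v in cli_args.items() if v is not None})  # 3. CLI wins
    return settings

merged = resolve_config(
    "[LOCAL]\noutput_path=/data/results\ndatasets_path=/data/sets\n",
    {"exp_grid_strategy": ["ALIPY_UNCERTAINTY_LC"]},
    {"output_path": "/scratch/results", "datasets_path": None},  # unset CLI flags pass None
)
print(merged["output_path"])  # /scratch/results
```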
Key Path Variables¶
| Config Variable | Default | Resolves To |
|---|---|---|
| `OUTPUT_PATH` | From `.server_access_credentials.cfg` | `{OUTPUT_PATH}/{EXP_TITLE}/` |
| `CORRELATION_TS_PATH` | `_TS` | `{OUTPUT_PATH}/_TS/` |
| `EXP_ID_METRIC_CSV_FOLDER_PATH` | `metrics` | `{OUTPUT_PATH}/metrics/` |
| `OVERALL_DONE_WORKLOAD_PATH` | `05_done_workload.csv` | `{OUTPUT_PATH}/05_done_workload.csv` |
Files After Each Step¶
{OUTPUT_PATH}/{EXP_TITLE}/
├── 01_workload.csv # Step 1: experiment queue
├── 05_done_workload.csv # Step 2: tracking file (appended)
├── 05_failed_workloads.csv # Step 2: failed experiments
├── 05_started_oom_workloads.csv # Step 2: OOM-killed experiments
│
├── {STRATEGY}/{DATASET}/ # Step 2: raw per-cycle metrics (.csv during execution, .csv.xz after compression)
│ ├── accuracy.csv(.xz)
│ ├── weighted_f1-score.csv(.xz)
│ ├── macro_f1-score.csv(.xz)
│ ├── query_selection_time.csv(.xz)
│ ├── selected_indices.csv(.xz)
│ ├── y_pred_train.csv(.xz)
│ ├── y_pred_test.csv(.xz)
│ └── ...
│
├── {STRATEGY}/{DATASET}/ # Step 4: advanced metrics
│ ├── full_auc_weighted_f1-score.csv.xz
│ ├── ramp_up_auc_weighted_f1-score.csv.xz
│ ├── plateau_auc_weighted_f1-score.csv.xz
│ ├── final_value_weighted_f1-score.csv.xz
│ └── ...
│
├── {DATASET}/ # Step 3: categorizations
│ ├── COUNT_WRONG_CLASSIFICATIONS.parquet
│ ├── REGION_DENSITY.parquet
│ └── ...
│
├── _TS/ # Auto-generated by eva_scripts
│ ├── weighted_f1-score.parquet
│ └── ...
│
└── plots/ # Step 6: evaluation outputs
├── final_leaderboard/
├── single_hyperparameter/
├── runtime/
└── ...
Directory Map¶
For the full source tree, see Reference → Directory Map.
Key Abstractions¶
For details on AL_Experiment, monitoring, and visualization internals, see Reference → Key Abstractions.
"I Want to..." Quick Reference¶
| Goal | Where |
|---|---|
| Change experiment grid | resources/exp_config.yaml |
| Change paths | .server_access_credentials.cfg |
| Add new strategy | resources/data_types.py (enum + mapping) |
| Add new dataset | resources/openml_datasets.yaml |
| Add new metric | metrics/ extending Base_Metric |
| Generate leaderboards | eva_scripts/final_leaderboard.py |
| Monitor progress | 05_analyze_partially_run_workload.py |
| Build standalone HTML results | 07b_create_results_without_flask.py |
| Fix broken result files | See Fix Scripts |
Deep Dive¶
- For mathematical definitions of the three correlation types, see Correlation Metrics above.
- For details on all enums and how to extend the benchmark, see Extend the Benchmark.
Next Steps¶
| Goal | Page |
|---|---|
| Extend with new strategies/datasets | Extend the Benchmark |
| Analyze results | Analyze OPARA |