Reproduce the Paper & Run from Scratch¶
Run the exact scripts that produce the paper's figures and tables, or recompute the full benchmark from scratch on an HPC/SLURM cluster.
Quick Start¶
Download the pre-computed results from DOI:10.25532/OPARA-862 and run the evaluation scripts directly:
# Setup
git clone https://github.com/jgonsior/olympic-games-of-active-learning.git
cd olympic-games-of-active-learning
conda create --name ogal --file conda-linux-64.lock && conda activate ogal && poetry install
cp .server_access_credentials.cfg.example .server_access_credentials.cfg
# edit .server_access_credentials.cfg → set OUTPUT_PATH and DATASETS_PATH under [LOCAL]
# Download and extract data
wget -c -O full_exp_jan.zip \
"https://opara.zih.tu-dresden.de/bitstreams/38951489-5076-4544-a99b-c20dddfc2c6b/download"
unzip full_exp_jan.zip -d /path/to/results/
Run a small smoke test before going to HPC scale:
python 01_create_workload.py --EXP_TITLE smoke_test
# This creates several files in OUTPUT_PATH/smoke_test/, including:
# 01_workload.csv – hyperparameter grid (one row per experiment)
# 02b_run_bash_parallel.py – runs all experiments in parallel locally
# 02_slurm.slurm – SLURM job array script (for HPC clusters)
# Run all experiments locally in parallel:
python "$OUTPUT_PATH/smoke_test/02b_run_bash_parallel.py"
# On a SLURM cluster, use the generated job array instead:
# sbatch "$OUTPUT_PATH/smoke_test/02_slurm.slurm"
# Note: 02_run_experiment.py outputs .csv files; compress to .csv.xz afterwards
ls "$OUTPUT_PATH/smoke_test/05_done_workload.csv"
Full Pipeline¶
flowchart TD
CFG["exp_config.yaml"] --> WL["01_create_workload.py"]
WL --> CSV["01_workload.csv"]
CSV --> RUN["02_run_experiment.py"]
RUN --> RAW["Per-cycle CSVs (.csv)"]
RAW --> CONV["compress to .csv.xz"]
CONV --> XZ["Per-cycle CSVs (.csv.xz)"]
XZ --> CAT["03_calculate_dataset_categorizations.py"]
XZ --> ADV["04_calculate_advanced_metrics.py"]
CAT --> PREP["Prerequisites (convert_y_pred_to_parquet, etc.)"]
ADV --> PREP
PREP --> EVA["eva_scripts/*"]
EVA -->|"auto-generate if missing"| TS["_TS/*.parquet"]
EVA --> PLOTS["plots/*.parquet + PDFs"]
Pipeline Steps¶
| Step | Script | Input | Output |
|---|---|---|---|
| 1 | `01_create_workload.py` | `resources/exp_config.yaml` | `01_workload.csv` (Cartesian product of hyperparameters) |
| 2 | `02_run_experiment.py` | `01_workload.csv` row (by `WORKER_INDEX`) | `{STRATEGY}/{DATASET}/*.csv` (per-cycle metrics, uncompressed) |
| 2b | Compress results | `*.csv` | `*.csv.xz` (compressed per-cycle metrics) |
| 3 | `03_calculate_dataset_categorizations.py` | Dataset CSVs | `{DATASET}/{categorizer}.parquet` (14 sample-level categorizers) |
| 4 | `04_calculate_advanced_metrics.py` | Per-cycle CSVs | `{STRATEGY}/{DATASET}/{metric}.csv.xz` (AUC, distance, etc.) |
| 5 | Prerequisite scripts (`convert_y_pred_to_parquet.py`, `calculate_dataset_dependend_random_ramp_slope.py`) | Per-cycle CSVs, parquets | Converted parquets, slope data |
| 6 | `eva_scripts/*` | Per-cycle CSVs, parquets | `_TS/*.parquet` (auto-generated if missing), `plots/*` (leaderboards, heatmaps, PDFs) |
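Step 1's Cartesian-product expansion can be sketched as follows. This is a toy grid with illustrative parameter names; the real grid comes from `resources/exp_config.yaml`, and `01_create_workload.py` does additional bookkeeping:

```python
import csv
import io
import itertools

# Toy hyperparameter grid (names and values are illustrative only).
grid = {
    "EXP_STRATEGY": ["ALIPY_UNCERTAINTY_LC", "SKACTIVEML_QBC"],
    "EXP_DATASET": ["Iris", "wine"],
    "EXP_RANDOM_SEED": [0, 1],
}

# One row per hyperparameter combination, exactly like 01_workload.csv.
rows = [dict(zip(grid, combo)) for combo in itertools.product(*grid.values())]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(grid))
writer.writeheader()
writer.writerows(rows)

print(len(rows))  # 2 strategies x 2 datasets x 2 seeds = 8 experiments
```

Because every worker later selects one such row by `WORKER_INDEX`, the experiments stay fully independent and trivially parallelizable.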
Step 0: Download Datasets
Kaggle datasets require a Kaggle API token before running this step. OpenML datasets need no extra credentials.
# Kaggle setup (skip if you only use OpenML datasets):
# 1. Create an API token at https://www.kaggle.com/settings
# 2. Place kaggle.json in ~/.kaggle/ and restrict permissions:
mkdir -p ~/.kaggle && chmod 700 ~/.kaggle
# mv ~/Downloads/kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
python 00_download_datasets.py
- Reads: `resources/openml_datasets.yaml`, `resources/kaggle_datasets.yaml`
- Produces: dataset CSV files in `DATASETS_PATH`
- Also computes: cosine distance matrices for datasets (used later by distance metrics)
If you only want to analyze the OPARA archive, you can skip dataset downloading entirely.
Step 2: Per-experiment output files
Each worker picks one row from 01_workload.csv and runs the full AL loop. The framework runner (determined by EXP_STRATEGY) handles: initialization → query selection → labeling → model retraining → metric recording, repeated for all AL cycles.
Results lifecycle: Workers append results to plain .csv files — this append-only format supports massive parallel HPC jobs writing to shared files concurrently. After all experiments finish, compress CSVs to .csv.xz (step 2b) to save disk space. Evaluation scripts and the OPARA archive consume .csv.xz.
Output files per experiment in {OUTPUT_PATH}/{STRATEGY_NAME}/{DATASET_NAME}/ (.csv during execution, .csv.xz after compression):
| File | Contents |
|---|---|
| `accuracy.csv` | Per-cycle accuracy values |
| `weighted_f1-score.csv` | Per-cycle weighted F1 scores |
| `macro_f1-score.csv` | Per-cycle macro F1 scores |
| `weighted_precision.csv` | Per-cycle weighted precision |
| `macro_precision.csv` | Per-cycle macro precision |
| `weighted_recall.csv` | Per-cycle weighted recall |
| `macro_recall.csv` | Per-cycle macro recall |
| `query_selection_time.csv` | Time taken per query selection |
| `learner_training_time.csv` | Time taken per model retraining |
| `selected_indices.csv` | Which sample indices were queried |
| `y_pred_train.csv` | Model predictions on the training set |
| `y_pred_test.csv` | Model predictions on the test set |
Each CSV has one row per experiment (EXP_UNIQUE_ID) with columns for each AL cycle iteration.
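Under that layout (one row per `EXP_UNIQUE_ID`, one column per cycle), loading and slicing such a file with pandas might look like the sketch below. An inline CSV stands in for a real file; the IDs and values are made up:

```python
import io
import pandas as pd

# Assumed layout: one row per EXP_UNIQUE_ID, one column per AL cycle.
raw = io.StringIO(
    "EXP_UNIQUE_ID,0,1,2\n"
    "17,0.61,0.72,0.80\n"
    "42,0.55,0.70,0.78\n"
)
df = pd.read_csv(raw, index_col="EXP_UNIQUE_ID")
# pandas transparently decompresses by extension, so after step 2b the same
# call works directly on a compressed file, e.g.
# pd.read_csv("accuracy.csv.xz", index_col="EXP_UNIQUE_ID")

curve = df.loc[42]      # learning curve of one experiment
final = df.iloc[:, -1]  # final-cycle value of every experiment
```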
Step 3: Dataset categorizations (14 categorizers)
Computes sample-level features for each dataset, characterizing how "hard" or "interesting" each sample is:
| Categorizer | What It Measures |
|---|---|
| `COUNT_WRONG_CLASSIFICATIONS` | How often a sample is misclassified |
| `SWITCHES_CLASS_OFTEN` | How often the predicted class changes across AL cycles |
| `CLOSENESS_TO_DECISION_BOUNDARY` | Distance to the nearest decision boundary |
| `REGION_DENSITY` | Local density of samples |
| `MELTING_POT_REGION` | Mixed-class region indicator |
| `INCLUDED_IN_OPTIMAL_STRATEGY` | Whether the sample is in the optimal query set |
| `CLOSENESS_TO_SAMPLES_OF_SAME_CLASS_kNN` | kNN distance to same-class samples |
| `CLOSENESS_TO_SAMPLES_OF_OTHER_CLASS_kNN` | kNN distance to other-class samples |
| `CLOSENESS_TO_CLUSTER_CENTER` | Distance to cluster centers |
| `IMPROVES_ACCURACY_BY` | Accuracy improvement from labeling this sample |
| `AVERAGE_UNCERTAINTY` | Mean model uncertainty for this sample |
| `OUTLIERNESS` | Outlier score |
| `CLOSENESS_TO_SAMPLES_OF_SAME_CLASS` | Non-kNN same-class distance |
| `CLOSENESS_TO_SAMPLES_OF_OTHER_CLASS` | Non-kNN other-class distance |
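To make the kNN-style categorizers concrete, here is a hedged sketch of a `CLOSENESS_TO_SAMPLES_OF_SAME_CLASS_kNN`-like score: the mean distance from each sample to its k nearest same-class neighbours. The repo's actual implementation may differ in k, distance metric, and normalization:

```python
import numpy as np

def same_class_knn_closeness(X, y, k=2):
    """Mean Euclidean distance to the k nearest same-class neighbours (sketch)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    scores = np.empty(len(X))
    for i in range(len(X)):
        mask = (y == y[i])
        mask[i] = False                          # exclude the sample itself
        dists = np.linalg.norm(X[mask] - X[i], axis=1)
        scores[i] = np.sort(dists)[:k].mean()    # k nearest same-class samples
    return scores

X = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
y = [0, 0, 0, 1, 1, 1]
scores = same_class_knn_closeness(X, y)
print(np.round(scores, 3))  # low score = sample sits deep inside its own class
```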
Step 4: Advanced metrics (7 metric types)
Computes derived metrics from the raw per-cycle results — aggregations that summarize how each experiment performed:
| Computed Metric | Output Files | Description |
|---|---|---|
| `STANDARD_AUC` | `full_auc_{base_metric}.csv.xz`, `ramp_up_auc_{base_metric}.csv.xz`, `plateau_auc_{base_metric}.csv.xz`, `final_value_{base_metric}.csv.xz`, `first_5_{base_metric}.csv.xz`, `last_5_{base_metric}.csv.xz` | AUC-based aggregations of the learning curve for each base metric |
| `DISTANCE_METRICS` | Distance metric CSVs | Sample distance and similarity measures |
| `MISMATCH_TRAIN_TEST` | Mismatch CSVs | Train/test distribution divergence |
| `CLASS_DISTRIBUTIONS` | Class distribution CSVs | Per-cycle class balance changes |
| `METRIC_DROP` | Metric drop CSVs | Performance drop analysis |
| `DATASET_CATEGORIZATION` | Categorization CSVs | Dataset hardness metrics |
| `TIMELAG_METRIC` | Timelag CSVs | Prediction lag analysis |
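The `STANDARD_AUC` family reduces a learning curve to a single number: the normalized area under it. The sketch below illustrates the idea; the exact windowing for `ramp_up`/`plateau` in `04_calculate_advanced_metrics.py` may differ:

```python
import numpy as np

def normalized_auc(curve):
    """Trapezoidal area under the curve, normalized to [0, 1] (sketch)."""
    c = np.asarray(curve, dtype=float)
    area = ((c[:-1] + c[1:]) / 2).sum()  # trapezoidal rule with unit step
    return area / (len(c) - 1)           # divide by max possible area

f1_per_cycle = [0.50, 0.70, 0.80, 0.85, 0.85]
full_auc = normalized_auc(f1_per_cycle)        # whole curve
ramp_up_auc = normalized_auc(f1_per_cycle[:3]) # early cycles only (illustrative cut)
print(round(full_auc, 5), round(ramp_up_auc, 5))
```

A strategy that ramps up quickly scores higher than one that only reaches the same final value late, which is exactly what a plain `final_value_*` metric cannot distinguish.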
Post-Processing (Steps 3–6)¶
After experiments complete (step 2), compress the raw CSV results and run post-processing:
# Step 2b: Compress raw CSV results to .csv.xz
# (02_run_experiment.py outputs .csv files that must be compressed)
xz "$OUTPUT_PATH"/my_experiment/*/*/*.csv
# Step 3: Compute sample-level dataset categorizations
python 03_calculate_dataset_categorizations.py --EXP_TITLE my_experiment --SAMPLES_CATEGORIZER _ALL --EVA_MODE local
# Step 4: Compute advanced metrics (AUC, distances, class distributions, etc.)
python 04_calculate_advanced_metrics.py --EXP_TITLE my_experiment --COMPUTED_METRICS _ALL --EVA_MODE local
# Step 5: Run prerequisite conversion scripts
python scripts/convert_y_pred_to_parquet.py --EXP_TITLE my_experiment
python -m eva_scripts.calculate_dataset_dependend_random_ramp_slope --EXP_TITLE my_experiment
# Step 6: Build leaderboard rankings
python -m eva_scripts.calculate_leaderboard_rankings --EXP_TITLE my_experiment
_TS/*.parquet files are generated automatically
No single dedicated script creates the _TS/*.parquet time series files. Instead, each evaluation script in eva_scripts/ checks for the _TS/*.parquet files it needs and generates any missing ones on the fly; final_leaderboard.py, runtime.py, single_hyperparameter_evaluation_metric.py, and others all share this behavior.
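The "auto-generate if missing" behavior is a simple cache-on-first-use pattern. The sketch below shows the shape of that logic; the real scripts build parquet files from the compressed per-cycle CSVs, whereas a plain text file stands in here to keep the example dependency-free:

```python
import tempfile
from pathlib import Path

def load_or_build(path: Path, build):
    """Return cached file contents, generating them on first use (sketch)."""
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(build())       # the expensive step runs only once
    return path.read_text()

with tempfile.TemporaryDirectory() as tmp:
    ts = Path(tmp) / "_TS" / "weighted_f1-score.txt"
    first = load_or_build(ts, lambda: "built")
    second = load_or_build(ts, lambda: "rebuilt")  # cache hit: not rebuilt
    print(first, second)  # built built
```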
Reproducing Paper Figures¶
With data ready (either from OPARA or your own run), reproduce all paper figures:
Main Leaderboard (Table 1 / Figure 4)¶
python -m eva_scripts.final_leaderboard --EXP_TITLE full_exp_jan
Output: plots/final_leaderboard/rank_sparse_zero_full_auc_weighted_f1-score.parquet
Three Correlation Heatmaps¶
| Color | What It Measures | Script |
|---|---|---|
| Blue | Metric correlation (Pearson) | python -m eva_scripts.single_hyperparameter_evaluation_metric --EXP_TITLE full_exp_jan |
| Green | Queried samples (Jaccard) | python -m eva_scripts.single_hyperparameter_evaluation_indices --EXP_TITLE full_exp_jan |
| Orange | Ranking invariance (Kendall τ) | python -m eva_scripts.leaderboard_single_hyperparameter_influence --EXP_TITLE full_exp_jan |
Additional Figures¶
# Example learning curve plot (Figure 2)
python -m eva_scripts.single_learning_curve_example --EXP_TITLE full_exp_jan
# Runtime analysis (Figure 7)
python -m eva_scripts.runtime --EXP_TITLE full_exp_jan
# All paper plots at once
python -m eva_scripts.redo_plots_for_paper --EXP_TITLE full_exp_jan
Output Mapping¶
| Paper Figure | Script | Output File |
|---|---|---|
| Table 1 (Leaderboard) | `final_leaderboard.py` | `plots/final_leaderboard/*.parquet` |
| Figure 2 (Learning curves) | `single_learning_curve_example.py` | `plots/single_learning_curve/*.parquet` |
| Figures 4–6 (Heatmaps) | `single_hyperparameter_*.py` | `plots/single_hyperparameter/*` |
| Figure 7 (Runtime) | `runtime.py` | `plots/runtime/*.parquet` |
Verify Results¶
import pandas as pd
lb = pd.read_parquet("plots/final_leaderboard/rank_sparse_zero_full_auc_weighted_f1-score.parquet")
print("Top 5 strategies (avg rank):", lb.mean(axis=0).sort_values().head(5))
Complete Eva Scripts Index¶
For the full list of all evaluation and utility scripts, see Reference → Eva Scripts Index.
Correlation Metrics (Paper ↔ Code)¶
For correlation metric definitions (Pearson, Jaccard, Kendall) and terminology cross-reference, see Reference → Correlation Metrics.
HPC Configuration¶
Create .server_access_credentials.cfg:
[HPC]
SSH_LOGIN=user@login.hpc.example.edu
DATASETS_PATH=/path/to/datasets
OUTPUT_PATH=/path/to/exp_results
SLURM_MAIL=your.email@example.edu
SLURM_PROJECT=your_project_account
PYTHON_PATH=/path/to/conda-env/bin/python
[LOCAL]
DATASETS_PATH=/path/to/datasets
OUTPUT_PATH=/path/to/exp_results
Resume After Failure¶
OGAL automatically tracks progress via tracking files:
| File | Purpose |
|---|---|
| `05_done_workload.csv` | Successfully completed experiments |
| `05_failed_workloads.csv` | Experiments that failed with errors |
| `05_started_oom_workloads.csv` | Experiments killed by OOM |
To resume, re-run 01_create_workload.py (it automatically excludes already-completed experiments) and resubmit the generated SLURM script:
python 01_create_workload.py --EXP_TITLE my_experiment
sbatch "$OUTPUT_PATH/my_experiment/02_slurm.slurm"
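The exclusion step amounts to an anti-join of the regenerated grid against the done-tracking file. A minimal sketch, with made-up IDs and an illustrative column name:

```python
import pandas as pd

# Stand-ins for the regenerated 01_workload.csv and 05_done_workload.csv.
workload = pd.DataFrame({"EXP_UNIQUE_ID": [1, 2, 3, 4]})
done = pd.DataFrame({"EXP_UNIQUE_ID": [1, 3]})

# Keep only experiments whose ID is not in the done-tracking file.
remaining = workload[~workload["EXP_UNIQUE_ID"].isin(done["EXP_UNIQUE_ID"])]
print(remaining["EXP_UNIQUE_ID"].tolist())  # [2, 4]
```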
Troubleshooting & Fix Scripts¶
Common Issues¶
| Issue | Solution |
|---|---|
| `FileNotFoundError` for datasets | Check `DATASETS_PATH` in `.server_access_credentials.cfg` |
| Jobs killed (OOM) | Increase `SLURM_MEMORY`; check `05_started_oom_workloads.csv` |
| Experiments not completing | Increase `EXP_QUERY_SELECTION_RUNTIME_SECONDS_LIMIT` |
| Missing `_TS/*.parquet` | These are auto-generated by evaluation scripts when missing. Ensure steps 2–5 completed successfully and that `.csv.xz` files exist. |
| Incomplete experiment grid | Use `scripts/reduce_to_dense.py` to remove results where the full hyperparameter grid is incomplete, creating a dense grid from sparse experimental results |
Fix Scripts (only needed if something breaks)¶
These scripts in scripts/ are not part of the normal pipeline — they exist to repair data issues that can occur during large-scale HPC runs. You only need them if you encounter specific problems.
Data Validation Scripts
| Script | When to Use |
|---|---|
| `scripts/validate_results_schema.py` | Verify result file formats are correct |
| `scripts/check_if_exp_ids_are_present.py` | Verify all experiment IDs exist in metric files |
| `scripts/find_missing_exp_ids_in_metric_files.py` | Find experiments missing from metric CSVs |
| `scripts/find_broken_file.py` | Identify corrupted metric CSV files |
| `scripts/exp_results_data_format_test.py` | Test that result CSV generation/format is correct |
Fix Scripts (data repair)
| Script | What It Fixes |
|---|---|
| `scripts/fix_oom_workload.py` | Remove OOM experiments from the done workload |
| `scripts/fix_duplicate_header_columns.py` | Remove duplicate column headers in CSVs |
| `scripts/fix_remove_unnamed_column.py` | Strip spurious `Unnamed: 0` columns |
| `scripts/fix_reduce_number_precision.py` | Round numeric values to 4 decimals (saves space) |
| `scripts/fix_macro_f1_score_duplicates.py` | Remove duplicate columns in macro F1 files |
| `scripts/fix_apply_runtime_limit_post_mortem.py` | Remove experiments exceeding query runtime limits |
| `scripts/fix_early_stopping_dict_keys_too_small_error.py` | Fix malformed CSV rows from dict parsing errors |
| `scripts/fix_check_if_dupicate_param_combinations_exist.py` | Detect duplicate parameter combinations |
| `scripts/fix_unconverted_y_parquet.py` | Fix `y_pred` parquets with wrong data types |
Merge & Remove Scripts
| Script | What It Does |
|---|---|
| `scripts/merge_two_workloads.py` | Merge two experimental result sets |
| `scripts/merge_duplicate_parquets.py` | Merge duplicate `y_pred` parquets, keeping unique IDs |
| `scripts/remove_oom_results_from_metric_files.py` | Strip out-of-memory results from metric files |
| `scripts/remove_dataset_results.py` | Delete results for specific datasets |
| `scripts/remove_duplicated_exp_ids.py` | Drop duplicate experiment entries |
| `scripts/remove_lbfgs_mlp_results.py` | Remove LBFGS/MLP learner results |
| `scripts/reduce_to_dense.py` | Remove results where the full hyperparameter grid is incomplete, creating a dense grid from sparse experimental results |
Re-run Scripts (retry failed work)
| Script | What It Does |
|---|---|
| `scripts/rerun_broken_experiments.py` | Re-run experiments that failed |
| `scripts/rerun_missing_exp_ids.py` | Retry experiments with missing result files |
| `scripts/rerun_broken_dataset_categorizations.py` | Recompute broken dataset categorization metrics |
| `scripts/replace_broken_parquet_csvs_with_working_file.py` | Restore broken parquets from backup files |
Conversion Scripts
| Script | What It Does |
|---|---|
| `scripts/convert_metrics_csvs_to_exp_id_csvs.py` | Reorganize metric CSVs by experiment ID |
| `scripts/convert_dataset_distances_to_parqet.py` | Convert dataset distance CSVs to parquet |
| `scripts/convert_y_pred_to_parquet.py` | Convert `y_pred` CSVs to parquet format |
| `scripts/create_auc_selected_ts.py` | Create AUC time series from selected indices |
Design Goals¶
| Goal | How |
|---|---|
| HPC-scale | Each experiment independent; WORKER_INDEX selects one row from 01_workload.csv |
| Resumable | 05_done_workload.csv tracks completed experiments; re-running 01_create_workload.py skips them |
| Deterministic | Fixed seeds; Cartesian product workload ensures full coverage |
| Framework-agnostic | Unified runner adapts 5+ AL frameworks (ALiPy, libact, small-text, scikit-activeml, playground) |
Configuration¶
All configuration flows through misc/config.py, which loads settings from multiple sources in this priority order:
1. `.server_access_credentials.cfg` — paths and HPC settings (`DATASETS_PATH`, `OUTPUT_PATH`, `SLURM_*`)
2. `resources/exp_config.yaml` — experiment grid definitions (`EXP_GRID_*` parameters)
3. CLI arguments — override any setting at runtime
4. Workload row — during execution, `02_run_experiment.py` loads one row from `01_workload.csv`
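The precedence above is a "later source wins" merge. A hedged sketch of that layering, using illustrative section and key names rather than the exact `misc/config.py` implementation:

```python
import configparser

def resolve_config(cfg_text, yaml_settings, cli_args):
    """Merge config sources; later sources override earlier ones (sketch)."""
    cfg = configparser.ConfigParser()
    cfg.read_string(cfg_text)
    settings = dict(cfg["LOCAL"])   # 1. credentials file (keys are lowercased)
    settings.update(yaml_settings)  # 2. exp_config.yaml grid definitions
    settings.update({k: v for k, v in cli_args.items() if v is not None})  # 3. CLI wins
    return settings

merged = resolve_config(
    "[LOCAL]\noutput_path=/data/results\ndatasets_path=/data/sets\n",
    {"exp_grid_strategy": ["ALIPY_UNCERTAINTY_LC"]},
    {"output_path": "/scratch/results", "datasets_path": None},  # unset CLI flags pass None
)
print(merged["output_path"])  # /scratch/results
```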
Key Path Variables¶
| Config Variable | Default | Resolves To |
|---|---|---|
| `OUTPUT_PATH` | From `.server_access_credentials.cfg` | `{OUTPUT_PATH}/{EXP_TITLE}/` |
| `CORRELATION_TS_PATH` | `_TS` | `{OUTPUT_PATH}/_TS/` |
| `EXP_ID_METRIC_CSV_FOLDER_PATH` | `metrics` | `{OUTPUT_PATH}/metrics/` |
| `OVERALL_DONE_WORKLOAD_PATH` | `05_done_workload.csv` | `{OUTPUT_PATH}/05_done_workload.csv` |
Files After Each Step¶
{OUTPUT_PATH}/{EXP_TITLE}/
├── 01_workload.csv # Step 1: experiment queue
├── 05_done_workload.csv # Step 2: tracking file (appended)
├── 05_failed_workloads.csv # Step 2: failed experiments
├── 05_started_oom_workloads.csv # Step 2: OOM-killed experiments
│
├── {STRATEGY}/{DATASET}/ # Step 2: raw per-cycle metrics (.csv during execution, .csv.xz after compression)
│ ├── accuracy.csv(.xz)
│ ├── weighted_f1-score.csv(.xz)
│ ├── macro_f1-score.csv(.xz)
│ ├── query_selection_time.csv(.xz)
│ ├── selected_indices.csv(.xz)
│ ├── y_pred_train.csv(.xz)
│ ├── y_pred_test.csv(.xz)
│ └── ...
│
├── {STRATEGY}/{DATASET}/ # Step 4: advanced metrics
│ ├── full_auc_weighted_f1-score.csv.xz
│ ├── ramp_up_auc_weighted_f1-score.csv.xz
│ ├── plateau_auc_weighted_f1-score.csv.xz
│ ├── final_value_weighted_f1-score.csv.xz
│ └── ...
│
├── {DATASET}/ # Step 3: categorizations
│ ├── COUNT_WRONG_CLASSIFICATIONS.parquet
│ ├── REGION_DENSITY.parquet
│ └── ...
│
├── _TS/ # Auto-generated by eva_scripts
│ ├── weighted_f1-score.parquet
│ └── ...
│
└── plots/ # Step 6: evaluation outputs
├── final_leaderboard/
├── single_hyperparameter/
├── runtime/
└── ...
Directory Map¶
For the full source tree, see Reference → Directory Map.
Key Abstractions¶
For details on AL_Experiment, monitoring, and visualization internals, see Reference → Key Abstractions.
"I Want to..." Quick Reference¶
| Goal | Where |
|---|---|
| Change experiment grid | resources/exp_config.yaml |
| Change paths | .server_access_credentials.cfg |
| Add new strategy | resources/data_types.py (enum + mapping) |
| Add new dataset | resources/openml_datasets.yaml |
| Add new metric | metrics/ extending Base_Metric |
| Generate leaderboards | eva_scripts/final_leaderboard.py |
| Monitor progress | 05_analyze_partially_run_workload.py |
| Build standalone HTML results | 07b_create_results_without_flask.py |
| Fix broken result files | See Fix Scripts |
Deep Dive¶
- For mathematical definitions of the three correlation types, see Correlation Metrics above.
- For details on all enums and how to extend the benchmark, see Extend the Benchmark.
Next Steps¶
| Goal | Page |
|---|---|
| Extend with new strategies/datasets | Extend the Benchmark |
| Analyze results | Analyze OPARA |