Reference¶
Detailed technical reference for OGAL internals. For guided walkthroughs, see Analyze OPARA or Reproduce & Run.
Data Formats¶
The archived results use two main storage formats:
Raw Per-Cycle Results (.csv → .csv.xz)¶
During execution, workers append results to plain .csv files (one row per experiment). This append-only format supports massive parallel HPC jobs writing to shared files. After all experiments finish, the CSVs are compressed to .csv.xz to save space. The OPARA archive contains the compressed .csv.xz files. For example, ALIPY_RANDOM/Iris/weighted_f1-score.csv.xz contains the weighted F1-score at each active learning cycle for all experiments using the ALIPY_RANDOM strategy on the Iris dataset. Each row represents one experiment (identified by EXP_UNIQUE_ID), and each column represents one AL cycle iteration.
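To make the row/column layout concrete, here is a minimal sketch of reading such a compressed per-cycle file with pandas. The two-experiment content is invented for illustration; in practice pandas infers `xz` compression from the `.csv.xz` extension, so `pd.read_csv("weighted_f1-score.csv.xz")` works directly.

```python
import io
import lzma

import pandas as pd

# Hypothetical per-cycle CSV content: one row per experiment (EXP_UNIQUE_ID),
# one column per AL cycle iteration.
raw = "EXP_UNIQUE_ID,0,1,2\n17,0.62,0.71,0.78\n42,0.60,0.69,0.80\n"
buf = io.BytesIO(lzma.compress(raw.encode()))

# compression="xz" is passed explicitly because we read from an in-memory
# buffer; with a real .csv.xz path, pandas infers it from the extension.
df = pd.read_csv(buf, compression="xz", index_col="EXP_UNIQUE_ID")
print(df.shape)  # (2, 3): 2 experiments x 3 AL cycles
```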
Aggregated Time Series (_TS/*.parquet)¶
The _TS/ directory contains pre-aggregated Parquet files that join per-cycle metrics with hyperparameter information from the workload definition. These files are auto-generated by evaluation scripts when needed. For example, _TS/full_auc_weighted_f1-score.parquet contains the full-AUC aggregation of weighted F1-scores, with columns for dataset, strategy, batch size, learner model, and the metric value.
The _TS files are especially helpful for quickly computing correlations over the full hyperparameter grid. Without them, computing pairwise correlations across millions of experiments would require re-reading and re-aggregating all raw per-cycle CSV files each time, which is computationally infeasible. By pre-joining and pre-aggregating the data into these Parquet files, correlation analyses that would otherwise take hours complete in seconds.
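The pre-join step can be sketched in a few lines of pandas. This is a hedged miniature, not the actual evaluation-script code: `EXP_UNIQUE_ID` appears in the source above, but the other column names (`EXP_DATASET`, `EXP_STRATEGY`, `EXP_BATCH_SIZE`) and the tiny tables are assumptions for illustration.

```python
import pandas as pd

# Per-experiment aggregated metric values (would come from a .csv.xz file).
auc = pd.DataFrame({"EXP_UNIQUE_ID": [17, 42], "full_auc": [0.71, 0.69]})

# Hyperparameter columns from the workload definition (05_done_workload.csv).
workload = pd.DataFrame({
    "EXP_UNIQUE_ID": [17, 42],
    "EXP_DATASET": ["Iris", "Iris"],
    "EXP_STRATEGY": ["ALIPY_RANDOM", "ALIPY_UNCERTAINTY"],
    "EXP_BATCH_SIZE": [5, 5],
})

# Join once, then reuse the result many times for correlation analyses.
ts = auc.merge(workload, on="EXP_UNIQUE_ID")
# In production this would be persisted: ts.to_parquet("_TS/full_auc_....parquet")
print(ts.columns.tolist())
```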
Leaderboard Rankings (plots/final_leaderboard/*.parquet)¶
The leaderboard output files follow the naming convention rank_{interpolation}_{aggregation}_{base_metric}.parquet. For example:
`rank_sparse_zero_full_auc_weighted_f1-score.parquet` means:

- `rank` — This file contains strategy rankings (lower rank = better).
- `sparse_zero` — The interpolation mode used for missing values. "Sparse zero" means that missing experiment results (where a strategy/dataset combination was not run) are filled with zero, so they rank last. Other modes include `sparse_nan` (ignore missing) and `dense` (only use complete hyperparameter grids).
- `full_auc` — The aggregation method applied to the learning curve. "Full AUC" computes the area under the entire learning curve. Other options: `ramp_up_auc` (AUC during the initial ramp-up phase), `plateau_auc` (AUC during the plateau phase), `final_value` (only the last cycle's value).
- `weighted_f1-score` — The base evaluation metric. Other options: `accuracy`, `macro_f1-score`, etc.
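The naming convention above can be parsed mechanically. The parser below is a hypothetical helper (not part of the codebase); the interpolation and aggregation vocabularies are taken from the component descriptions above, and splitting must try multi-word tokens such as `sparse_zero` before shorter ones.

```python
# Known vocabularies, taken from the naming-convention description above.
INTERPOLATIONS = ["sparse_zero", "sparse_nan", "dense"]
AGGREGATIONS = ["full_auc", "ramp_up_auc", "plateau_auc", "final_value"]

def parse_leaderboard_name(filename: str) -> dict:
    """Split rank_{interpolation}_{aggregation}_{base_metric}.parquet."""
    stem = filename.removesuffix(".parquet").removeprefix("rank_")
    for interp in INTERPOLATIONS:
        if stem.startswith(interp + "_"):
            rest = stem[len(interp) + 1:]
            for agg in AGGREGATIONS:
                if rest.startswith(agg + "_"):
                    return {"interpolation": interp,
                            "aggregation": agg,
                            "base_metric": rest[len(agg) + 1:]}
    raise ValueError(f"unrecognized leaderboard file name: {filename}")

print(parse_leaderboard_name("rank_sparse_zero_full_auc_weighted_f1-score.parquet"))
# {'interpolation': 'sparse_zero', 'aggregation': 'full_auc', 'base_metric': 'weighted_f1-score'}
```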
Correlation Metrics (Paper ↔ Code)¶
Three correlation metrics from the OGAL paper (arXiv:2506.03817) and their code implementations:
| Correlation | Measures | Heatmap Color | Script |
|---|---|---|---|
| Pearson \(r\) (§IV-B1) | Do metric outcomes change with a hyperparameter? | Blue | workload_reduction.py, basic_metrics_correlation.py |
| Jaccard \(J\) (§IV-B2) | Do strategies select the same samples? | Green | single_hyperparameter_evaluation_indices.py |
| Kendall \(\tau_b\) (§IV-B3) | Do strategy rankings stay the same? | Orange | leaderboard_single_hyperparameter_influence.py, leaderboard_single_hyperparameter_influence_analyze.py |
Metric-based (Pearson \(r\), §IV-B1)¶
For each value of a hyperparameter (e.g., batch size \(b_i\)), build a result vector \(V_{b_i}(M)\) of aggregated metric values. Then compute the pairwise Pearson correlation matrix. A high \(r\) indicates the hyperparameter has little effect.
The Pearson correlation coefficient is defined as:

\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]

Computed via `np.corrcoef` in `single_hyperparameter_evaluation_metric.py`.
Data flow: Per-strategy/dataset .csv.xz files (e.g., full_auc_weighted_f1-score.csv.xz) are joined with 05_done_workload.csv to attach hyperparameter columns, then written to _TS/{metric}.parquet. For a chosen hyperparameter, experiments are grouped by its value and matched on all remaining hyperparameters (a shared "fingerprint"). np.corrcoef computes pairwise Pearson \(r\) between the matched metric vectors — one vector per hyperparameter value — producing the blue heatmap.
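The matching-and-correlating step can be sketched as follows. This is a toy miniature, not the script's actual code: the `fingerprint` column stands in for the match on all remaining hyperparameters, and the column names and values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical _TS-style table: experiments that agree on every remaining
# hyperparameter (the "fingerprint") are matched across batch sizes.
ts = pd.DataFrame({
    "EXP_BATCH_SIZE": [1, 1, 1, 5, 5, 5],
    "fingerprint":    ["a", "b", "c", "a", "b", "c"],
    "full_auc":       [0.70, 0.80, 0.60, 0.71, 0.79, 0.62],
})

# One column per hyperparameter value, one row per matched fingerprint.
pivot = ts.pivot(index="fingerprint", columns="EXP_BATCH_SIZE", values="full_auc")

# np.corrcoef over the columns yields the pairwise Pearson r matrix
# that fills the blue heatmap.
r = np.corrcoef(pivot.to_numpy(), rowvar=False)
print(np.round(r, 3))
```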
Queried Samples (Jaccard \(J\), §IV-B2)¶
Union each experiment's per-cycle queried sets into \(\widehat{Q}\), then compute the pairwise Jaccard similarity. The heatmap shows \(1 - \bar{J}\), so a value of 0 means identical queries.
The Jaccard similarity is defined as:

\[
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
\]

\(J\) ranges from 0 (disjoint sets) to 1 (identical sets). Computed in `single_hyperparameter_evaluation_indices.py`.
Data flow: selected_indices.csv.xz files store per-cycle queried sample indices. These are union-aggregated across cycles and joined with workload metadata into _TS/selected_indices.parquet. For a chosen hyperparameter, matched experiment pairs (same fingerprint on the remaining hyperparameters) have their index sets compared via \(|A \cap B| / |A \cup B|\), averaged across all matched pairs to fill the green heatmap.
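A minimal sketch of the pairwise comparison, assuming each experiment's queried indices have already been union-aggregated across cycles; the function name and example sets are illustrative, not from the codebase.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two index sets."""
    if not a and not b:
        return 1.0  # two empty query sets are trivially identical
    return len(a & b) / len(a | b)

# Two hypothetical matched experiments (same fingerprint on the remaining
# hyperparameters, different value of the hyperparameter under study):
q_small_batch = {3, 7, 12, 19}
q_large_batch = {3, 7, 19, 25}
print(jaccard(q_small_batch, q_large_batch))  # 3 shared / 5 total = 0.6
```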
Ranking Invariance (Kendall \(\tau_b\), §IV-B3)¶
Build a leaderboard (strategies × datasets), average to get a ranking vector per hyperparameter value, then compare rankings with Kendall's \(\tau_b\):

\[
\tau_b = \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}}
\]

where:
- \(n_c\) = concordant pairs, \(n_d\) = discordant pairs
- \(n_0 = n(n-1)/2\)
- \(n_1 = \sum_k t_k(t_k-1)/2\) (ties in \(X\))
- \(n_2 = \sum_l u_l(u_l-1)/2\) (ties in \(Y\))
\(\tau_b\) ranges from −1 (reversed rankings) to +1 (identical rankings). Computed via scipy.stats.kendalltau in leaderboard_single_hyperparameter_influence.py.
Data flow: The same _TS/{metric}.parquet files as above are grouped by dataset and strategy, then averaged and rank-transformed to produce a leaderboard ranking vector for each hyperparameter value. scipy.stats.kendalltau compares these ranking vectors pairwise, yielding the orange heatmap.
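The pairwise comparison of ranking vectors can be sketched directly with `scipy.stats.kendalltau`, which the script uses. The two toy ranking vectors below are invented: five strategies, with one adjacent pair swapped between the two hyperparameter values.

```python
from scipy.stats import kendalltau

# Hypothetical leaderboard ranking vectors (one rank per strategy) for two
# values of a hyperparameter; tau_b measures how much the ordering agrees.
ranks_batch_1 = [1, 2, 3, 4, 5]   # strategy ranks at batch size 1
ranks_batch_5 = [1, 3, 2, 4, 5]   # strategy ranks at batch size 5

tau, p_value = kendalltau(ranks_batch_1, ranks_batch_5)
print(round(tau, 2))  # 9 concordant, 1 discordant pair out of 10 -> tau_b = 0.8
```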
Terminology Cross-Reference¶
| Paper Term | Code Alias | File Pattern |
|---|---|---|
| Full mean AUC | `full_auc` | `full_auc_*.parquet` |
| Ramp-up AUC | `ramp_up_auc` | `ramp_up_auc_*.parquet` |
| Plateau AUC | `plateau_auc` | `plateau_auc_*.parquet` |
| Final value | `final_value` | `final_value_*.parquet` |
| Queried sample sets | `selected_indices` | `selected_indices.csv.xz` |
| Weighted F1-score | `weighted_f1-score` | `weighted_f1-score.parquet` |
Complete Eva Scripts Index¶
Core Analysis Scripts¶
| Script | Reads | Produces | Description |
|---|---|---|---|
| `learning_curve.py` | Per-cycle CSVs, `05_done_workload.csv` | `plots/single_learning_curve/*.parquet`, PDF | Generates an example learning curve plot for illustration. Auto-generates `_TS/*.parquet` if missing (as do most other eva_scripts). |
| `calculate_leaderboard_rankings.py` | `_TS/*.parquet` | Ranking parquets (multiple interpolation modes) | Generates strategy rankings across datasets using different metrics and interpolation methods. |
| `final_leaderboard.py` | `_TS/*.parquet`, ranking data | `plots/final_leaderboard/*.parquet` | Main leaderboard generation: ranks strategies and produces the paper's Table 1. |
| `runtime.py` | `query_selection_time.csv.xz` | `plots/runtime/query_selection_time.parquet` | Analyzes and plots query selection time distributions per strategy. |
Correlation & Hyperparameter Analysis Scripts¶
| Script | Reads | Produces | Description |
|---|---|---|---|
| `basic_metrics_correlation.py` | Per-cycle CSVs (accuracy, F1, etc.) | `plots/basic_metrics/Standard Metrics.parquet` | Pearson correlation matrix between standard ML metrics. |
| `auc_metric_correlation.py` | AUC metric parquets | `plots/AUC/auc_*.parquet` | Pearson correlation between AUC-based aggregation metrics. |
| `single_hyperparameter_evaluation_metric.py` | `_TS/*.parquet` | `plots/single_hyperparameter/*/` (blue heatmaps) | Metric-based (Pearson) correlation: how metric outcomes change when varying one hyperparameter. |
| `single_hyperparameter_evaluation_indices.py` | `selected_indices.parquet` | `plots/single_hyperparameter/*/` (green heatmaps) | Jaccard similarity of queried samples: do strategies select the same samples under different hyperparameters? |
| `leaderboard_single_hyperparameter_influence.py` | `_TS/*.parquet` | Single hyperparameter influence parquets | Kendall \(\tau_b\) ranking invariance: how much does changing one hyperparameter affect strategy ordering? |
| `leaderboard_single_hyperparameter_influence_analyze.py` | Rankings CSV | Influence plots | Plots and analyzes the hyperparameter influence data. |
| `workload_reduction.py` | `_TS/*.parquet`, dense workload | Correlation/reduction stats | Analyzes how much the workload can be reduced while maintaining result quality. |
| `similar_strategies.py` | `selected_indices.parquet` | Jaccard correlation heatmaps | Strategy similarity via selected indices: which strategies behave most alike? |
| `strateg_framework_correlation.py` | Strategy metrics | Framework correlation plot | Cross-framework correlation analysis. |
Leaderboard Variant Scripts¶
| Script | Reads | Produces | Description |
|---|---|---|---|
| `leaderboard_scenarios.py` | Scenario metrics | Scenario rankings | Ranks strategies under different real-world scenarios (dataset type, start point, hyperparameter variations). |
| `leaderboard_c6_rebuttal.py` | Metric files | Kendall \(\tau\) correlations, PDFs | Rebuttal analysis with bootstrap confidence intervals for ranking stability. |
| `final_leaderboard_single_cell_correlation.py` | Leaderboard parquets | Correlation stats, plots | Cell-wise correlation analysis within the leaderboard matrix. |
| `analyze_leaderboard_rankings.py` | `plots/leaderboard_invariances/leaderboard_types.csv` | Heatmap correlations | Correlates different leaderboard construction methods. |
Learning Curve & Example Scripts¶
| Script | Reads | Produces | Description |
|---|---|---|---|
| `single_learning_curve_example.py` | Sample data | Line plot | Example visualization of a single learning curve. |
| `single_learning_curve_example_auc.py` | Sample data | Line plot with AUC | Example learning curve with AUC annotation. |
Dataset & Metric Analysis Scripts¶
| Script | Reads | Produces | Description |
|---|---|---|---|
| `calc_cycle_duration_parquets.py` | Metric CSVs, `05_done_workload.csv` | Threshold plots, duration analysis | Analyzes learning cycle durations and computes duration thresholds. |
| `calculate_dataset_dependend_random_ramp_slope.py` | Selected indices time series | Leaderboard rankings CSV | Computes dataset-dependent random baseline slopes. |
| `dataset_stats.py` | — | — | Dataset statistics. |
Scenario & Real-World Analysis Scripts¶
| Script | Reads | Produces | Description |
|---|---|---|---|
| `real_world_scenarios_corrs.py` | Scenario metrics CSV | Decomposed correlations | Real-world scenario correlation decomposition. |
| `real_world_scenarios_plots.py` | Scenario data | Scatter/correlation plots | Plots for real-world scenario analysis. |
Publication & Output Scripts¶
| Script | Reads | Produces | Description |
|---|---|---|---|
| `redo_plots_for_paper.py` | All parquet files | Combined ranking plots (PDFs) | Regenerates all publication-ready plots at once. |
| `merge_multiple_plots_single_page.py` | Plot parquets | Merged PDF | Merges multiple parquet-based plots into a single multi-page PDF. |
Utility Scripts (scripts/)¶
Data Preparation Scripts
| Script | Description |
|---|---|
| `scripts/create_dense_workload.py` | Generate a dense workload (all dataset × strategy combinations). |
| `scripts/create_new_extended_dense_workload.py` | Extended version of the dense workload. |
| `scripts/create_gaussian.py` | Generate synthetic Gaussian datasets (balanced/unbalanced). |
| `scripts/create_xor.py` | Download XOR datasets from the LAL project. |
| `scripts/create_auc_selected_ts.py` | Create AUC time series from selected indices data. |
| `scripts/reduce_to_dense.py` | Remove results where the full hyperparameter grid is incomplete, creating a dense grid from sparse experimental results. |
Conversion Scripts
| Script | Description |
|---|---|
| `scripts/convert_metrics_csvs_to_exp_id_csvs.py` | Reorganize metric CSVs to be indexed by experiment ID. |
| `scripts/convert_dataset_distances_to_parqet.py` | Convert dataset distance CSV files to Parquet format. |
| `scripts/convert_y_pred_to_parquet.py` | Convert y_pred CSV files to Parquet format (with timeout handling). |
Validation Scripts
| Script | Description |
|---|---|
| `scripts/validate_results_schema.py` | Verify that result file formats match the expected schema. |
| `scripts/check_if_exp_ids_are_present.py` | Verify that all experiment IDs exist in all metric files. |
| `scripts/find_missing_exp_ids_in_metric_files.py` | Find experiments that are missing from metric CSV files. |
| `scripts/find_broken_file.py` | Identify corrupted or malformed metric CSV files. |
| `scripts/exp_results_data_format_test.py` | Test that result CSV generation and formatting are correct. |
Export & Documentation Scripts
| Script | Description |
|---|---|
| `scripts/export_strategy_catalog.py` | Export all AL strategies to JSON/CSV/Markdown with framework info. |
| `scripts/add_github_hyperlinks.py` | Convert file references to GitHub hyperlinks in Markdown. |
| `scripts/render_mermaid.py` | Pre-render Mermaid diagrams to SVG for static fallback. |
| `scripts/single_learning_curve.py` | Generate a single example learning curve visualization. |
Directory Map¶
```
olympic-games-of-active-learning/
├── 00_download_datasets.py # Dataset acquisition from OpenML/Kaggle
├── 01_create_workload.py # Workload generation (hyperparameter grid)
├── 02_run_experiment.py # Experiment execution (one per worker)
├── 03_calculate_dataset_categorizations.py # Sample-level features
├── 04_calculate_advanced_metrics.py # Derived metrics (AUC, etc.)
├── 05_analyze_partially_run_workload.py # Progress monitoring
├── 07b_create_results_without_flask.py # Standalone HTML visualization
├── framework_runners/ # AL framework adapters
│   ├── base_runner.py # Abstract base class (AL loop)
│   ├── alipy_runner.py # ALiPy strategies
│   ├── libact_runner.py # libact strategies
│   ├── smalltext_runner.py # small-text strategies
│   ├── skactiveml_runner.py # scikit-activeml strategies
│   ├── playground_runner.py # Custom strategies
│   └── optimal_runner.py # Oracle strategies
├── metrics/ # Metric recording during experiments
│   ├── Standard_ML_Metrics.py # accuracy, F1, precision, recall
│   ├── Timing_Metrics.py # query_selection_time, learner_training_time
│   ├── Selected_Indices.py # selected sample indices
│   └── Predicted_Samples.py # y_pred_train, y_pred_test
├── resources/
│   ├── data_types.py # ALL enums (AL_STRATEGY, COMPUTED_METRIC, etc.)
│   ├── exp_config.yaml # Experiment grid definitions
│   └── openml_datasets.yaml # OpenML dataset configurations
├── misc/config.py # Central configuration
├── eva_scripts/ # Evaluation & plotting scripts
└── scripts/ # Utility, fix, and maintenance scripts
```
Key Abstractions¶
AL_Experiment (framework_runners/base_runner.py)¶
Abstract base class for framework adapters. Key methods:
- `get_AL_strategy()` — Initialize the strategy
- `query_AL_strategy()` → indices — Select samples to query
- `al_cycle()` — Main loop: query → update → retrain → record metrics
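The query → update → retrain → record loop can be sketched schematically. This is a toy stand-in, not the actual `base_runner.py` code: the class name, random query strategy, and return values are invented, and the retraining/metric-recording steps are reduced to comments.

```python
import random

class RandomRunner:
    """Toy adapter standing in for a framework-specific AL_Experiment subclass."""

    def __init__(self, n_samples: int, seed: int = 0):
        self.rng = random.Random(seed)
        self.unlabeled = set(range(n_samples))
        self.labeled = set()

    def query_AL_strategy(self, batch_size: int) -> list:
        # A real adapter would delegate to ALiPy/libact/small-text/etc. here.
        return self.rng.sample(sorted(self.unlabeled), batch_size)

    def al_cycle(self, n_cycles: int, batch_size: int) -> list:
        history = []
        for _ in range(n_cycles):
            queried = self.query_AL_strategy(batch_size)  # query
            self.unlabeled -= set(queried)                # update pools
            self.labeled |= set(queried)
            # A real runner retrains the learner and records metrics here.
            history.append(sorted(queried))               # record
        return history

runner = RandomRunner(n_samples=100)
print(len(runner.al_cycle(n_cycles=3, batch_size=5)))  # 3 cycles recorded
```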
Monitoring (05_analyze_partially_run_workload.py)¶
Analyzes progress of a partially completed experiment run:
- Groups completed experiments by dataset/strategy/model/hyperparameters
- Calculates mean query selection time per combination
- Identifies which parameter combinations are missing
Visualization (07b_create_results_without_flask.py)¶
Generates a standalone HTML file with interactive result visualizations (AUC tables, learning curves, runtime plots) without requiring a Flask server.