Reference¶
Detailed technical reference for OGAL internals. For guided walkthroughs, see Analyze OPARA or Reproduce & Run.
Data Formats¶
The archived results use two main storage formats:
Raw Per-Cycle Results (.csv → .csv.xz)¶
During execution, workers append results to plain .csv files (one row per experiment). This append-only format supports massive parallel HPC jobs writing to shared files. After all experiments finish, the CSVs are compressed to .csv.xz to save space. The OPARA archive contains the compressed .csv.xz files. For example, ALIPY_RANDOM/Iris/weighted_f1-score.csv.xz contains the weighted F1-score at each active learning cycle for all experiments using the ALIPY_RANDOM strategy on the Iris dataset. Each row represents one experiment (identified by EXP_UNIQUE_ID), and each column represents one AL cycle iteration.
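To make the row/column layout concrete, here is a minimal sketch of reading such a compressed per-cycle file with pandas. The two-experiment content is invented for illustration; in practice pandas infers `xz` compression from the `.csv.xz` extension, so `pd.read_csv("weighted_f1-score.csv.xz")` works directly.

```python
import io
import lzma

import pandas as pd

# Hypothetical per-cycle CSV content: one row per experiment (EXP_UNIQUE_ID),
# one column per AL cycle iteration.
raw = "EXP_UNIQUE_ID,0,1,2\n17,0.62,0.71,0.78\n42,0.60,0.69,0.80\n"
buf = io.BytesIO(lzma.compress(raw.encode()))

# compression="xz" is passed explicitly because we read from an in-memory
# buffer; with a real .csv.xz path, pandas infers it from the extension.
df = pd.read_csv(buf, compression="xz", index_col="EXP_UNIQUE_ID")
print(df.shape)  # (2, 3): 2 experiments x 3 AL cycles
```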
Aggregated Time Series (_TS/*.parquet)¶
The _TS/ directory contains pre-aggregated Parquet files that join per-cycle metrics with hyperparameter information from the workload definition. These files are auto-generated by evaluation scripts when needed. For example, _TS/full_auc_weighted_f1-score.parquet contains the full-AUC aggregation of weighted F1-scores, with columns for dataset, strategy, batch size, learner model, and the metric value.
The _TS files are especially helpful for quickly computing correlations over the full hyperparameter grid. Without them, computing pairwise correlations across millions of experiments would require re-reading and re-aggregating all raw per-cycle CSV files each time, which is computationally infeasible. By pre-joining and pre-aggregating the data into these Parquet files, correlation analyses that would otherwise take hours complete in seconds.
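The pre-join step can be sketched in a few lines of pandas. This is a hedged miniature, not the actual evaluation-script code: `EXP_UNIQUE_ID` appears in the source above, but the other column names (`EXP_DATASET`, `EXP_STRATEGY`, `EXP_BATCH_SIZE`) and the tiny tables are assumptions for illustration.

```python
import pandas as pd

# Per-experiment aggregated metric values (would come from a .csv.xz file).
auc = pd.DataFrame({"EXP_UNIQUE_ID": [17, 42], "full_auc": [0.71, 0.69]})

# Hyperparameter columns from the workload definition (05_done_workload.csv).
workload = pd.DataFrame({
    "EXP_UNIQUE_ID": [17, 42],
    "EXP_DATASET": ["Iris", "Iris"],
    "EXP_STRATEGY": ["ALIPY_RANDOM", "ALIPY_UNCERTAINTY"],
    "EXP_BATCH_SIZE": [5, 5],
})

# Join once, then reuse the result many times for correlation analyses.
ts = auc.merge(workload, on="EXP_UNIQUE_ID")
# In production this would be persisted: ts.to_parquet("_TS/full_auc_....parquet")
print(ts.columns.tolist())
```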
Leaderboard Rankings (plots/final_leaderboard/*.parquet)¶
The leaderboard output files follow the naming convention rank_{interpolation}_{aggregation}_{base_metric}.parquet. For example:
`rank_sparse_zero_full_auc_weighted_f1-score.parquet` means:

- `rank` — This file contains strategy rankings (lower rank = better).
- `sparse_zero` — The interpolation mode used for missing values. "Sparse zero" means that missing experiment results (where a strategy/dataset combination was not run) are filled with zero, so they rank last. Other modes include `sparse_nan` (ignore missing) and `dense` (only use complete hyperparameter grids).
- `full_auc` — The aggregation method applied to the learning curve. "Full AUC" computes the area under the entire learning curve. Other options: `ramp_up_auc` (AUC during the initial ramp-up phase), `plateau_auc` (AUC during the plateau phase), `final_value` (only the last cycle's value).
- `weighted_f1-score` — The base evaluation metric. Other options: `accuracy`, `macro_f1-score`, etc.
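The naming convention above can be parsed mechanically. The parser below is a hypothetical helper (not part of the codebase); the interpolation and aggregation vocabularies are taken from the component descriptions above, and splitting must try multi-word tokens such as `sparse_zero` before shorter ones.

```python
# Known vocabularies, taken from the naming-convention description above.
INTERPOLATIONS = ["sparse_zero", "sparse_nan", "dense"]
AGGREGATIONS = ["full_auc", "ramp_up_auc", "plateau_auc", "final_value"]

def parse_leaderboard_name(filename: str) -> dict:
    """Split rank_{interpolation}_{aggregation}_{base_metric}.parquet."""
    stem = filename.removesuffix(".parquet").removeprefix("rank_")
    for interp in INTERPOLATIONS:
        if stem.startswith(interp + "_"):
            rest = stem[len(interp) + 1:]
            for agg in AGGREGATIONS:
                if rest.startswith(agg + "_"):
                    return {"interpolation": interp,
                            "aggregation": agg,
                            "base_metric": rest[len(agg) + 1:]}
    raise ValueError(f"unrecognized leaderboard file name: {filename}")

print(parse_leaderboard_name("rank_sparse_zero_full_auc_weighted_f1-score.parquet"))
# {'interpolation': 'sparse_zero', 'aggregation': 'full_auc', 'base_metric': 'weighted_f1-score'}
```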
Correlation Metrics (Paper ↔ Code)¶
Three correlation metrics from the OGAL paper (arXiv:2506.03817) and their code implementations:
| Correlation | Measures | Heatmap Color | Script |
|---|---|---|---|
| Pearson \(r\) (§IV-B1) | Do metric outcomes change with a hyperparameter? | Blue | workload_reduction.py, basic_metrics_correlation.py |
| Jaccard \(J\) (§IV-B2) | Do strategies select the same samples? | Green | single_hyperparameter_evaluation_indices.py |
| Kendall \(\tau_b\) (§IV-B3) | Do strategy rankings stay the same? | Orange | leaderboard_single_hyperparameter_influence.py, leaderboard_single_hyperparameter_influence_analyze.py |
Metric-based (Pearson \(r\), §IV-B1)¶
For each value of a hyperparameter (e.g., batch size \(b_i\)), build a result vector \(V_{b_i}(M)\) of aggregated metric values. Then compute the pairwise Pearson correlation matrix. A high \(r\) indicates the hyperparameter has little effect.
The Pearson correlation coefficient is defined as:

\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]

Computed via `np.corrcoef` in `single_hyperparameter_evaluation_metric.py`.
Data flow: Per-strategy/dataset .csv.xz files (e.g., full_auc_weighted_f1-score.csv.xz) are joined with 05_done_workload.csv to attach hyperparameter columns, then written to _TS/{metric}.parquet. For a chosen hyperparameter, experiments are grouped by its value and matched on all remaining hyperparameters (a shared "fingerprint"). np.corrcoef computes pairwise Pearson \(r\) between the matched metric vectors — one vector per hyperparameter value — producing the blue heatmap.
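The matching-and-correlating step can be sketched as follows. This is a toy miniature, not the script's actual code: the `fingerprint` column stands in for the match on all remaining hyperparameters, and the column names and values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical _TS-style table: experiments that agree on every remaining
# hyperparameter (the "fingerprint") are matched across batch sizes.
ts = pd.DataFrame({
    "EXP_BATCH_SIZE": [1, 1, 1, 5, 5, 5],
    "fingerprint":    ["a", "b", "c", "a", "b", "c"],
    "full_auc":       [0.70, 0.80, 0.60, 0.71, 0.79, 0.62],
})

# One column per hyperparameter value, one row per matched fingerprint.
pivot = ts.pivot(index="fingerprint", columns="EXP_BATCH_SIZE", values="full_auc")

# np.corrcoef over the columns yields the pairwise Pearson r matrix
# that fills the blue heatmap.
r = np.corrcoef(pivot.to_numpy(), rowvar=False)
print(np.round(r, 3))
```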
Queried Samples (Jaccard \(J\), §IV-B2)¶
Union each experiment's per-cycle queried sets into \(\widehat{Q}\), then compute the pairwise Jaccard similarity. The heatmap shows \(1 - \bar{J}\), so a value of 0 means identical queries.
The Jaccard similarity is defined as:

\[
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
\]

\(J\) ranges from 0 (disjoint sets) to 1 (identical sets). Computed in `single_hyperparameter_evaluation_indices.py`.
Data flow: selected_indices.csv.xz files store per-cycle queried sample indices. These are union-aggregated across cycles and joined with workload metadata into _TS/selected_indices.parquet. For a chosen hyperparameter, matched experiment pairs (same fingerprint on the remaining hyperparameters) have their index sets compared via \(|A \cap B| / |A \cup B|\), averaged across all matched pairs to fill the green heatmap.
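A minimal sketch of the pairwise comparison, assuming each experiment's queried indices have already been union-aggregated across cycles; the function name and example sets are illustrative, not from the codebase.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two index sets."""
    if not a and not b:
        return 1.0  # two empty query sets are trivially identical
    return len(a & b) / len(a | b)

# Two hypothetical matched experiments (same fingerprint on the remaining
# hyperparameters, different value of the hyperparameter under study):
q_small_batch = {3, 7, 12, 19}
q_large_batch = {3, 7, 19, 25}
print(jaccard(q_small_batch, q_large_batch))  # 3 shared / 5 total = 0.6
```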
Ranking Invariance (Kendall \(\tau_b\), §IV-B3)¶
Build a leaderboard (strategies × datasets), average to get a ranking vector per hyperparameter value, then compare rankings with Kendall's \(\tau_b\):

\[
\tau_b = \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}}
\]

where:
- \(n_c\) = concordant pairs, \(n_d\) = discordant pairs
- \(n_0 = n(n-1)/2\)
- \(n_1 = \sum_k t_k(t_k-1)/2\) (ties in \(X\))
- \(n_2 = \sum_l u_l(u_l-1)/2\) (ties in \(Y\))
\(\tau_b\) ranges from −1 (reversed rankings) to +1 (identical rankings). Computed via scipy.stats.kendalltau in leaderboard_single_hyperparameter_influence.py.
Data flow: The same _TS/{metric}.parquet files as above are grouped by dataset and strategy, then averaged and rank-transformed to produce a leaderboard ranking vector for each hyperparameter value. scipy.stats.kendalltau compares these ranking vectors pairwise, yielding the orange heatmap.
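The pairwise comparison of ranking vectors can be sketched directly with `scipy.stats.kendalltau`, which the script uses. The two toy ranking vectors below are invented: five strategies, with one adjacent pair swapped between the two hyperparameter values.

```python
from scipy.stats import kendalltau

# Hypothetical leaderboard ranking vectors (one rank per strategy) for two
# values of a hyperparameter; tau_b measures how much the ordering agrees.
ranks_batch_1 = [1, 2, 3, 4, 5]   # strategy ranks at batch size 1
ranks_batch_5 = [1, 3, 2, 4, 5]   # strategy ranks at batch size 5

tau, p_value = kendalltau(ranks_batch_1, ranks_batch_5)
print(round(tau, 2))  # 9 concordant, 1 discordant pair out of 10 -> tau_b = 0.8
```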
Terminology Cross-Reference¶
| Paper Term | Code Alias | File Pattern |
|---|---|---|
| Full mean AUC | `full_auc` | `full_auc_*.parquet` |
| Ramp-up AUC | `ramp_up_auc` | `ramp_up_auc_*.parquet` |
| Plateau AUC | `plateau_auc` | `plateau_auc_*.parquet` |
| Final value | `final_value` | `final_value_*.parquet` |
| Queried sample sets | `selected_indices` | `selected_indices.csv.xz` |
| Weighted F1-score | `weighted_f1-score` | `weighted_f1-score.parquet` |
Complete Eva Scripts Index¶
Core Analysis Scripts¶
| Script | Reads | Produces | Description |
|---|---|---|---|
| `learning_curve.py` | Per-cycle CSVs, `05_done_workload.csv` | `plots/single_learning_curve/*.parquet`, PDF | Generates an example learning curve plot for illustration. Auto-generates `_TS/*.parquet` if missing (as do most other eva_scripts). |
| `calculate_leaderboard_rankings.py` | `_TS/*.parquet` | Ranking parquets (multiple interpolation modes) | Generates strategy rankings across datasets using different metrics and interpolation methods. |
| `final_leaderboard.py` | `_TS/*.parquet`, ranking data | `plots/final_leaderboard/*.parquet` | Main leaderboard generation: ranks strategies and produces the paper's Table 1. |
| `runtime.py` | `query_selection_time.csv.xz` | `plots/runtime/query_selection_time.parquet` | Analyzes and plots query selection time distributions per strategy. |
Correlation & Hyperparameter Analysis Scripts¶
| Script | Reads | Produces | Description |
|---|---|---|---|
| `basic_metrics_correlation.py` | Per-cycle CSVs (accuracy, F1, etc.) | `plots/basic_metrics/Standard Metrics.parquet` | Pearson correlation matrix between standard ML metrics. |
| `auc_metric_correlation.py` | AUC metric parquets | `plots/AUC/auc_*.parquet` | Pearson correlation between AUC-based aggregation metrics. |
| `single_hyperparameter_evaluation_metric.py` | `_TS/*.parquet` | `plots/single_hyperparameter/*/` (blue heatmaps) | Metric-based (Pearson) correlation: how metric outcomes change when varying one hyperparameter. |
| `single_hyperparameter_evaluation_indices.py` | `selected_indices.parquet` | `plots/single_hyperparameter/*/` (green heatmaps) | Jaccard similarity of queried samples: do strategies select the same samples under different hyperparameters? |
| `leaderboard_single_hyperparameter_influence.py` | `_TS/*.parquet` | Single hyperparameter influence parquets | Kendall \(\tau_b\) ranking invariance: how much does changing one hyperparameter affect strategy ordering? |
| `leaderboard_single_hyperparameter_influence_analyze.py` | Rankings CSV | Influence plots | Plots and analyzes the hyperparameter influence data. |
| `workload_reduction.py` | `_TS/*.parquet`, dense workload | Correlation/reduction stats | Analyzes how much the workload can be reduced while maintaining result quality. |
| `similar_strategies.py` | `selected_indices.parquet` | Jaccard correlation heatmaps | Strategy similarity via selected indices: which strategies behave most alike? |
| `strateg_framework_correlation.py` | Strategy metrics | Framework correlation plot | Cross-framework correlation analysis. |
Leaderboard Variant Scripts¶
| Script | Reads | Produces | Description |
|---|---|---|---|
| `leaderboard_scenarios.py` | Scenario metrics | Scenario rankings | Ranks strategies under different real-world scenarios (dataset type, start point, hyperparameter variations). |
| `leaderboard_c6_rebuttal.py` | Metric files | Kendall \(\tau\) correlations, PDFs | Rebuttal analysis with bootstrap confidence intervals for ranking stability. |
| `final_leaderboard_single_cell_correlation.py` | Leaderboard parquets | Correlation stats, plots | Cell-wise correlation analysis within the leaderboard matrix. |
| `analyze_leaderboard_rankings.py` | `plots/leaderboard_invariances/leaderboard_types.csv` | Heatmap correlations | Correlates different leaderboard construction methods. |
Learning Curve & Example Scripts¶
| Script | Reads | Produces | Description |
|---|---|---|---|
| `single_learning_curve_example.py` | Sample data | Line plot | Example visualization of a single learning curve. |
| `single_learning_curve_example_auc.py` | Sample data | Line plot with AUC | Example learning curve with AUC annotation. |
Dataset & Metric Analysis Scripts¶
| Script | Reads | Produces | Description |
|---|---|---|---|
| `calc_cycle_duration_parquets.py` | Metric CSVs, `05_done_workload.csv` | Threshold plots, duration analysis | Analyzes learning cycle durations and computes duration thresholds. |
| `calculate_dataset_dependend_random_ramp_slope.py` | Selected indices time series | Leaderboard rankings CSV | Computes dataset-dependent random baseline slopes. |
| `dataset_stats.py` | — | — | Dataset statistics. |
Scenario & Real-World Analysis Scripts¶
| Script | Reads | Produces | Description |
|---|---|---|---|
| `real_world_scenarios_corrs.py` | Scenario metrics CSV | Decomposed correlations | Real-world scenario correlation decomposition. |
| `real_world_scenarios_plots.py` | Scenario data | Scatter/correlation plots | Plots for real-world scenario analysis. |
Publication & Output Scripts¶
| Script | Reads | Produces | Description |
|---|---|---|---|
| `redo_plots_for_paper.py` | All parquet files | Combined ranking plots (PDFs) | Regenerates all publication-ready plots at once. |
| `merge_multiple_plots_single_page.py` | Plot parquets | Merged PDF | Merges multiple parquet-based plots into a single multi-page PDF. |
Utility Scripts (scripts/)¶
Data Preparation Scripts
| Script | Description |
|---|---|
| `scripts/create_dense_workload.py` | Generate a dense workload (all dataset × strategy combinations). |
| `scripts/create_new_extended_dense_workload.py` | Extended version of the dense workload. |
| `scripts/create_gaussian.py` | Generate synthetic Gaussian datasets (balanced/unbalanced). |
| `scripts/create_xor.py` | Download XOR datasets from the LAL project. |
| `scripts/create_auc_selected_ts.py` | Create AUC time series from selected indices data. |
| `scripts/reduce_to_dense.py` | Remove results where the full hyperparameter grid is incomplete, creating a dense grid from sparse experimental results. |
Conversion Scripts
| Script | Description |
|---|---|
| `scripts/convert_metrics_csvs_to_exp_id_csvs.py` | Reorganize metric CSVs to be indexed by experiment ID. |
| `scripts/convert_dataset_distances_to_parqet.py` | Convert dataset distance CSV files to Parquet format. |
| `scripts/convert_y_pred_to_parquet.py` | Convert y_pred CSV files to Parquet format (with timeout handling). |
Validation Scripts
| Script | Description |
|---|---|
| `scripts/validate_results_schema.py` | Verify that result file formats match the expected schema. |
| `scripts/check_if_exp_ids_are_present.py` | Verify that all experiment IDs exist in all metric files. |
| `scripts/find_missing_exp_ids_in_metric_files.py` | Find experiments that are missing from metric CSV files. |
| `scripts/find_broken_file.py` | Identify corrupted or malformed metric CSV files. |
| `scripts/exp_results_data_format_test.py` | Test that result CSV generation and formatting are correct. |
Export & Documentation Scripts
| Script | Description |
|---|---|
| `scripts/export_strategy_catalog.py` | Export all AL strategies to JSON/CSV/Markdown with framework info. |
| `scripts/add_github_hyperlinks.py` | Convert file references to GitHub hyperlinks in Markdown. |
| `scripts/render_mermaid.py` | Pre-render Mermaid diagrams to SVG for static fallback. |
| `scripts/single_learning_curve.py` | Generate a single example learning curve visualization. |
Directory Map¶
```
olympic-games-of-active-learning/
├── 00_download_datasets.py # Dataset acquisition from OpenML/Kaggle
├── 01_create_workload.py # Workload generation (hyperparameter grid)
├── 02_run_experiment.py # Experiment execution (one per worker)
├── 03_calculate_dataset_categorizations.py # Sample-level features
├── 04_calculate_advanced_metrics.py # Derived metrics (AUC, etc.)
├── 05_analyze_partially_run_workload.py # Progress monitoring
├── 07b_create_results_without_flask.py # Standalone HTML visualization
├── framework_runners/ # AL framework adapters
│   ├── base_runner.py # Abstract base class (AL loop)
│   ├── alipy_runner.py # ALiPy strategies
│   ├── libact_runner.py # libact strategies
│   ├── smalltext_runner.py # small-text strategies
│   ├── skactiveml_runner.py # scikit-activeml strategies
│   ├── playground_runner.py # Custom strategies
│   └── optimal_runner.py # Oracle strategies
├── metrics/ # Metric recording during experiments
│   ├── Standard_ML_Metrics.py # accuracy, F1, precision, recall
│   ├── Timing_Metrics.py # query_selection_time, learner_training_time
│   ├── Selected_Indices.py # selected sample indices
│   └── Predicted_Samples.py # y_pred_train, y_pred_test
├── resources/
│   ├── data_types.py # ALL enums (AL_STRATEGY, COMPUTED_METRIC, etc.)
│   ├── exp_config.yaml # Experiment grid definitions
│   └── openml_datasets.yaml # OpenML dataset configurations
├── misc/config.py # Central configuration
├── eva_scripts/ # Evaluation & plotting scripts
└── scripts/ # Utility, fix, and maintenance scripts
```
Key Abstractions¶
AL_Experiment (framework_runners/base_runner.py)¶
Abstract base class for framework adapters. Key methods:
- `get_AL_strategy()` — Initialize the strategy
- `query_AL_strategy()` → indices — Select samples to query
- `al_cycle()` — Main loop: query → update → retrain → record metrics
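The query → update → retrain → record loop can be sketched schematically. This is a toy stand-in, not the actual `base_runner.py` code: the class name, random query strategy, and return values are invented, and the retraining/metric-recording steps are reduced to comments.

```python
import random

class RandomRunner:
    """Toy adapter standing in for a framework-specific AL_Experiment subclass."""

    def __init__(self, n_samples: int, seed: int = 0):
        self.rng = random.Random(seed)
        self.unlabeled = set(range(n_samples))
        self.labeled = set()

    def query_AL_strategy(self, batch_size: int) -> list:
        # A real adapter would delegate to ALiPy/libact/small-text/etc. here.
        return self.rng.sample(sorted(self.unlabeled), batch_size)

    def al_cycle(self, n_cycles: int, batch_size: int) -> list:
        history = []
        for _ in range(n_cycles):
            queried = self.query_AL_strategy(batch_size)  # query
            self.unlabeled -= set(queried)                # update pools
            self.labeled |= set(queried)
            # A real runner retrains the learner and records metrics here.
            history.append(sorted(queried))               # record
        return history

runner = RandomRunner(n_samples=100)
print(len(runner.al_cycle(n_cycles=3, batch_size=5)))  # 3 cycles recorded
```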
Monitoring (05_analyze_partially_run_workload.py)¶
Analyzes progress of a partially completed experiment run:
- Groups completed experiments by dataset/strategy/model/hyperparameters
- Calculates mean query selection time per combination
- Identifies which parameter combinations are missing
Visualization (07b_create_results_without_flask.py)¶
Generates a standalone HTML file with interactive result visualizations (AUC tables, learning curves, runtime plots) without requiring a Flask server.