Reference

Detailed technical reference for OGAL internals. For guided walkthroughs, see Analyze OPARA or Reproduce & Run.


Data Formats

The archived results use two main storage formats:

Raw Per-Cycle Results (.csv.xz)

During execution, workers append results to plain .csv files (one row per experiment). This append-only format supports massively parallel HPC jobs writing to shared files. After all experiments finish, the CSVs are compressed to .csv.xz to save space. The OPARA archive contains the compressed .csv.xz files. For example, ALIPY_RANDOM/Iris/weighted_f1-score.csv.xz contains the weighted F1-score at each active learning cycle for all experiments using the ALIPY_RANDOM strategy on the Iris dataset. Each row represents one experiment (identified by EXP_UNIQUE_ID), and each column represents one AL cycle iteration.
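
A minimal sketch of reading such a file with pandas. The column layout (EXP_UNIQUE_ID plus one column per cycle) follows the description above, but the values are made up; the example builds the .xz bytes in memory instead of reading from the archive:

```python
import io
import lzma

import pandas as pd

# Hypothetical per-cycle CSV: one row per experiment, first column
# EXP_UNIQUE_ID, remaining columns one per AL cycle (illustrative values).
raw_csv = "EXP_UNIQUE_ID,0,1,2\n17,0.61,0.74,0.80\n42,0.58,0.70,0.79\n"

# Compress to .csv.xz bytes, as the archive does after all experiments finish.
compressed = lzma.compress(raw_csv.encode())

# When reading from disk, pandas infers xz compression from the extension,
# e.g. pd.read_csv("weighted_f1-score.csv.xz"); here we decompress explicitly.
df = pd.read_csv(io.BytesIO(lzma.decompress(compressed)))
print(df.shape)  # (2, 4): two experiments, ID column plus three cycles
```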

Aggregated Time Series (_TS/*.parquet)

The _TS/ directory contains pre-aggregated Parquet files that join per-cycle metrics with hyperparameter information from the workload definition. These files are auto-generated by evaluation scripts when needed. For example, _TS/full_auc_weighted_f1-score.parquet contains the full-AUC aggregation of weighted F1-scores, with columns for dataset, strategy, batch size, learner model, and the metric value.

The _TS files are especially helpful for quickly calculating correlations over the full hyperparameter grid. Without them, computing pairwise correlations across millions of experiments would require re-reading and re-aggregating all raw per-cycle CSV files each time, which is computationally infeasible. By pre-joining and pre-aggregating the data into these Parquet files, correlation analyses that would otherwise take hours can be completed in seconds.
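
To sketch why the pre-joined layout pays off: once metrics and hyperparameters sit in one table, a grid-wide correlation is a single pivot-and-correlate. The column names below are illustrative, not the archive's exact schema:

```python
import pandas as pd

# Hypothetical pre-aggregated rows as they might appear in a _TS parquet file.
ts = pd.DataFrame({
    "dataset":    ["Iris", "Iris", "wine", "wine"],
    "strategy":   ["ALIPY_RANDOM", "LIBACT_QBC", "ALIPY_RANDOM", "LIBACT_QBC"],
    "batch_size": [1, 1, 1, 1],
    "full_auc_weighted_f1-score": [0.82, 0.88, 0.75, 0.81],
})

# One pivot replaces re-reading every raw per-cycle CSV: rows = datasets,
# columns = strategies, cells = the aggregated metric.
wide = ts.pivot(index="dataset", columns="strategy",
                values="full_auc_weighted_f1-score")
print(wide.corr(method="pearson"))  # pairwise Pearson r between strategies
```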

Leaderboard Rankings (plots/final_leaderboard/*.parquet)

The leaderboard output files follow the naming convention rank_{interpolation}_{aggregation}_{base_metric}.parquet. For example:

  • rank_sparse_zero_full_auc_weighted_f1-score.parquet means:
    • rank — This file contains strategy rankings (lower rank = better).
    • sparse_zero — The interpolation mode used for missing values. "Sparse zero" means that missing experiment results (where a strategy/dataset combination was not run) are filled with zero, so they rank last. Other modes include sparse_nan (ignore missing) and dense (only use complete hyperparameter grids).
    • full_auc — The aggregation method applied to the learning curve. "Full AUC" computes the area under the entire learning curve. Other options: ramp_up_auc (AUC during the initial ramp-up phase), plateau_auc (AUC during the plateau phase), final_value (only the last cycle's value).
    • weighted_f1-score — The base evaluation metric. Other options: accuracy, macro_f1-score, etc.
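
The naming convention above can be split mechanically. The helper below is hypothetical (not part of the repo) and only knows the interpolation and aggregation vocabularies listed in this section:

```python
import re

# Fixed vocabularies taken from the naming convention described above.
INTERPOLATIONS = ("sparse_zero", "sparse_nan", "dense")
AGGREGATIONS = ("full_auc", "ramp_up_auc", "plateau_auc", "final_value")

def parse_leaderboard_name(filename: str) -> dict:
    """Split rank_{interpolation}_{aggregation}_{base_metric}.parquet."""
    pattern = (r"^rank_(" + "|".join(INTERPOLATIONS) + r")_("
               + "|".join(AGGREGATIONS) + r")_(.+)\.parquet$")
    m = re.match(pattern, filename)
    if m is None:
        raise ValueError(f"not a leaderboard ranking file: {filename}")
    return {"interpolation": m.group(1),
            "aggregation": m.group(2),
            "base_metric": m.group(3)}

print(parse_leaderboard_name("rank_sparse_zero_full_auc_weighted_f1-score.parquet"))
# {'interpolation': 'sparse_zero', 'aggregation': 'full_auc',
#  'base_metric': 'weighted_f1-score'}
```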

Correlation Metrics (Paper ↔ Code)

Three correlation metrics from the OGAL paper (arXiv:2506.03817) and their code implementations:

| Correlation | Measures | Heatmap Color | Script |
| --- | --- | --- | --- |
| Pearson \(r\) (§IV-B1) | Do metric outcomes change with a hyperparameter? | Blue | workload_reduction.py, basic_metrics_correlation.py |
| Jaccard \(J\) (§IV-B2) | Do strategies select the same samples? | Green | single_hyperparameter_evaluation_indices.py |
| Kendall \(\tau_b\) (§IV-B3) | Do strategy rankings stay the same? | Orange | leaderboard_single_hyperparameter_influence.py, leaderboard_single_hyperparameter_influence_analyze.py |

Metric-based (Pearson \(r\), §IV-B1)

For each value of a hyperparameter (e.g., batch size \(b_i\)), build a result vector \(V_{b_i}(M)\) of aggregated metric values. Then compute the pairwise Pearson correlation matrix between these vectors. A high \(r\) indicates that the hyperparameter has little effect on the metric.

\[ V_{b_i}(M) = \begin{bmatrix} M_{b_i 1} \\ M_{b_i 2} \\ \vdots \end{bmatrix} \qquad \text{Heatmap cell} = r\!\bigl(V_{b_i}(M),\; V_{b_j}(M)\bigr) \]

The Pearson correlation coefficient is defined as:

\[ r(X, Y) = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2 \;\sum_{i=1}^{n}(Y_i - \bar{Y})^2}} \]

Computed via np.corrcoef in single_hyperparameter_evaluation_metric.py.

Data flow: Per-strategy/dataset .csv.xz files (e.g., full_auc_weighted_f1-score.csv.xz) are joined with 05_done_workload.csv to attach hyperparameter columns, then written to _TS/{metric}.parquet. For a chosen hyperparameter, experiments are grouped by its value and matched on all remaining hyperparameters (a shared "fingerprint"). np.corrcoef computes pairwise Pearson \(r\) between the matched metric vectors — one vector per hyperparameter value — producing the blue heatmap.
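
One blue heatmap cell can be sketched as follows, assuming two metric vectors already matched on a shared fingerprint (the values are illustrative, not archive data):

```python
import numpy as np

# One aggregated-metric vector per hyperparameter value; position i in both
# vectors belongs to the same fingerprint (same remaining hyperparameters).
v_batch_1 = np.array([0.80, 0.72, 0.91, 0.65])   # e.g. batch size 1
v_batch_5 = np.array([0.79, 0.70, 0.90, 0.66])   # e.g. batch size 5

# np.corrcoef returns the full correlation matrix; the off-diagonal entry
# is Pearson r between the two vectors, i.e. one heatmap cell.
r = np.corrcoef(v_batch_1, v_batch_5)[0, 1]
print(round(r, 3))  # close to 1: batch size barely affects the metric here
```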

Queried Samples (Jaccard \(J\), §IV-B2)

Union each experiment's per-cycle queried sets into \(\widehat{Q}\), then compute the pairwise Jaccard similarity between matched experiments. The heatmap shows the mean similarity \(\bar{J}\), so 1 = identical queries and 0 = disjoint queries.

\[ \widehat{Q} = \bigcup_{i=0}^{c} Q^i \qquad J(A,B) = \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert} \]

\(J\) ranges from 0 (disjoint sets) to 1 (identical sets). Computed in single_hyperparameter_evaluation_indices.py.

Data flow: selected_indices.csv.xz files store per-cycle queried sample indices. These are union-aggregated across cycles and joined with workload metadata into _TS/selected_indices.parquet. For a chosen hyperparameter, matched experiment pairs (same fingerprint on the remaining hyperparameters) have their index sets compared via \(|A \cap B| / |A \cup B|\), averaged across all matched pairs to fill the green heatmap.
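
The union-and-compare step can be sketched as follows (the per-cycle index sets are made up):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two index sets."""
    return len(a & b) / len(a | b)

# Per-cycle queried sets Q^0, Q^1, ... for two matched experiments
exp_a_cycles = [{1, 2}, {5, 8}, {9}]
exp_b_cycles = [{1, 3}, {5, 8}, {7}]

# Union across cycles gives each experiment's overall queried set Q-hat
q_hat_a = set().union(*exp_a_cycles)   # {1, 2, 5, 8, 9}
q_hat_b = set().union(*exp_b_cycles)   # {1, 3, 5, 7, 8}

print(jaccard(q_hat_a, q_hat_b))  # 3 shared of 7 total, i.e. 3/7 ≈ 0.43
```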

Ranking Invariance (Kendall \(\tau_b\), §IV-B3)

Build a leaderboard (strategies × datasets), average to get a ranking vector per hyperparameter value, then compare rankings with Kendall \(\tau_b\).

\[ \tau_b = \frac{n_c - n_d}{\sqrt{(n_0 - n_1)(n_0 - n_2)}} \]

where:

  • \(n_c\) = concordant pairs, \(n_d\) = discordant pairs
  • \(n_0 = n(n-1)/2\)
  • \(n_1 = \sum_k t_k(t_k-1)/2\) (ties in \(X\))
  • \(n_2 = \sum_l u_l(u_l-1)/2\) (ties in \(Y\))

\(\tau_b\) ranges from −1 (reversed rankings) to +1 (identical rankings). Computed via scipy.stats.kendalltau in leaderboard_single_hyperparameter_influence.py.

Data flow: The same _TS/{metric}.parquet files as above are grouped by dataset and strategy, then averaged and rank-transformed to produce a leaderboard ranking vector for each hyperparameter value. scipy.stats.kendalltau compares these ranking vectors pairwise, yielding the orange heatmap.
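
A minimal illustration of the comparison step with scipy.stats.kendalltau; the ranking vectors below are made up, with rank 1 = best strategy and one adjacent swap between the two hyperparameter values:

```python
from scipy.stats import kendalltau

# One mean-rank vector per hyperparameter value, over the same strategies.
ranking_batch_1 = [1, 2, 3, 4, 5]
ranking_batch_5 = [1, 3, 2, 4, 5]   # strategies 2 and 3 swap places

# With no ties, tau_b = (n_c - n_d) / n_0: here 9 concordant and
# 1 discordant pair out of 10, so (9 - 1) / 10 = 0.8.
tau, p_value = kendalltau(ranking_batch_1, ranking_batch_5)
print(tau)
```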

Terminology Cross-Reference

| Paper Term | Code Alias | File Pattern |
| --- | --- | --- |
| Full mean AUC | full_auc | full_auc_*.parquet |
| Ramp-up AUC | ramp_up_auc | ramp_up_auc_*.parquet |
| Plateau AUC | plateau_auc | plateau_auc_*.parquet |
| Final value | final_value | final_value_*.parquet |
| Queried sample sets | selected_indices | selected_indices.csv.xz |
| Weighted F1-score | weighted_f1-score | weighted_f1-score.parquet |

Complete eva_scripts Index

Core Analysis Scripts

| Script | Reads | Produces | Description |
| --- | --- | --- | --- |
| learning_curve.py | Per-cycle CSVs, 05_done_workload.csv | plots/single_learning_curve/*.parquet, PDF | Generates an example learning curve plot for illustration. Auto-generates _TS/*.parquet if missing (as do most other eva_scripts). |
| calculate_leaderboard_rankings.py | _TS/*.parquet | Ranking parquets (multiple interpolation modes) | Generates strategy rankings across datasets using different metrics and interpolation methods. |
| final_leaderboard.py | _TS/*.parquet, ranking data | plots/final_leaderboard/*.parquet | Main leaderboard generation — ranks strategies and produces the paper's Table 1. |
| runtime.py | query_selection_time.csv.xz | plots/runtime/query_selection_time.parquet | Analyzes and plots query selection time distributions per strategy. |

Correlation & Hyperparameter Analysis Scripts

| Script | Reads | Produces | Description |
| --- | --- | --- | --- |
| basic_metrics_correlation.py | Per-cycle CSVs (accuracy, F1, etc.) | plots/basic_metrics/Standard Metrics.parquet | Pearson correlation matrix between standard ML metrics. |
| auc_metric_correlation.py | AUC metric parquets | plots/AUC/auc_*.parquet | Pearson correlation between AUC-based aggregation metrics. |
| single_hyperparameter_evaluation_metric.py | _TS/*.parquet | plots/single_hyperparameter/*/ (Blue heatmaps) | Metric-based (Pearson) correlation — how metric outcomes change when varying one hyperparameter. |
| single_hyperparameter_evaluation_indices.py | selected_indices.parquet | plots/single_hyperparameter/*/ (Green heatmaps) | Jaccard similarity of queried samples — do strategies select the same samples under different hyperparameters? |
| leaderboard_single_hyperparameter_influence.py | _TS/*.parquet | Single hyperparameter influence parquets | Kendall τ ranking invariance — how much does changing one hyperparameter affect strategy ordering? |
| leaderboard_single_hyperparameter_influence_analyze.py | Rankings CSV | Influence plots | Plots and analyzes the hyperparameter influence data. |
| workload_reduction.py | _TS/*.parquet, dense workload | Correlation/reduction stats | Analyzes how much the workload can be reduced while maintaining result quality. |
| similar_strategies.py | selected_indices.parquet | Jaccard correlation heatmaps | Strategy similarity via selected indices — which strategies behave most alike? |
| strateg_framework_correlation.py | Strategy metrics | Framework correlation plot | Cross-framework correlation analysis. |

Leaderboard Variant Scripts

| Script | Reads | Produces | Description |
| --- | --- | --- | --- |
| leaderboard_scenarios.py | Scenario metrics | Scenario rankings | Ranks strategies under different real-world scenarios (dataset type, start point, hyperparameter variations). |
| leaderboard_c6_rebuttal.py | Metric files | Kendall tau correlations, PDFs | Rebuttal analysis with bootstrap confidence intervals for ranking stability. |
| final_leaderboard_single_cell_correlation.py | Leaderboard parquets | Correlation stats, plots | Cell-wise correlation analysis within the leaderboard matrix. |
| analyze_leaderboard_rankings.py | plots/leaderboard_invariances/leaderboard_types.csv | Heatmap correlations | Correlates different leaderboard construction methods. |

Learning Curve & Example Scripts

| Script | Reads | Produces | Description |
| --- | --- | --- | --- |
| single_learning_curve_example.py | Sample data | Line plot | Example visualization of a single learning curve. |
| single_learning_curve_example_auc.py | Sample data | Line plot with AUC | Example learning curve with AUC annotation. |

Dataset & Metric Analysis Scripts

| Script | Reads | Produces | Description |
| --- | --- | --- | --- |
| calc_cycle_duration_parquets.py | Metric CSVs, 05_done_workload.csv | Threshold plots, duration analysis | Analyzes learning cycle durations and computes duration thresholds. |
| calculate_dataset_dependend_random_ramp_slope.py | Selected indices time series | Leaderboard rankings CSV | Computes dataset-dependent random baseline slopes. |
| dataset_stats.py | | | Dataset statistics. |

Scenario & Real-World Analysis Scripts

| Script | Reads | Produces | Description |
| --- | --- | --- | --- |
| real_world_scenarios_corrs.py | Scenario metrics CSV | Decomposed correlations | Real-world scenario correlation decomposition. |
| real_world_scenarios_plots.py | Scenario data | Scatter/correlation plots | Plots for real-world scenario analysis. |

Publication & Output Scripts

| Script | Reads | Produces | Description |
| --- | --- | --- | --- |
| redo_plots_for_paper.py | All parquet files | Combined ranking plots (PDFs) | Regenerates all publication-ready plots at once. |
| merge_multiple_plots_single_page.py | Plot parquets | Merged PDF | Merges multiple parquet-based plots into a single multi-page PDF. |

Utility Scripts (scripts/)

Data Preparation Scripts

| Script | Description |
| --- | --- |
| scripts/create_dense_workload.py | Generate a dense workload (all dataset × strategy combinations). |
| scripts/create_new_extended_dense_workload.py | Extended version of the dense workload. |
| scripts/create_gaussian.py | Generate synthetic Gaussian datasets (balanced/unbalanced). |
| scripts/create_xor.py | Download XOR datasets from the LAL project. |
| scripts/create_auc_selected_ts.py | Create AUC time series from selected indices data. |
| scripts/reduce_to_dense.py | Remove results where the full hyperparameter grid is incomplete, creating a dense grid from sparse experimental results. |
Conversion Scripts

| Script | Description |
| --- | --- |
| scripts/convert_metrics_csvs_to_exp_id_csvs.py | Reorganize metric CSVs to be indexed by experiment ID. |
| scripts/convert_dataset_distances_to_parqet.py | Convert dataset distance CSV files to parquet format. |
| scripts/convert_y_pred_to_parquet.py | Convert y_pred CSV files to parquet format (with timeout handling). |
Validation Scripts

| Script | Description |
| --- | --- |
| scripts/validate_results_schema.py | Verify that result file formats match the expected schema. |
| scripts/check_if_exp_ids_are_present.py | Verify all experiment IDs exist in all metric files. |
| scripts/find_missing_exp_ids_in_metric_files.py | Find experiments that are missing from metric CSV files. |
| scripts/find_broken_file.py | Identify corrupted or malformed metric CSV files. |
| scripts/exp_results_data_format_test.py | Test that result CSV generation and format are correct. |
Export & Documentation Scripts

| Script | Description |
| --- | --- |
| scripts/export_strategy_catalog.py | Export all AL strategies to JSON/CSV/Markdown with framework info. |
| scripts/add_github_hyperlinks.py | Convert file references to GitHub hyperlinks in markdown. |
| scripts/render_mermaid.py | Pre-render Mermaid diagrams to SVG for static fallback. |
| scripts/single_learning_curve.py | Generate a single example learning curve visualization. |

Directory Map

olympic-games-of-active-learning/
├── 00_download_datasets.py         # Dataset acquisition from OpenML/Kaggle
├── 01_create_workload.py           # Workload generation (hyperparameter grid)
├── 02_run_experiment.py            # Experiment execution (one per worker)
├── 03_calculate_dataset_categorizations.py  # Sample-level features
├── 04_calculate_advanced_metrics.py         # Derived metrics (AUC, etc.)
├── 05_analyze_partially_run_workload.py     # Progress monitoring
├── 07b_create_results_without_flask.py      # Standalone HTML visualization
├── framework_runners/              # AL framework adapters
│   ├── base_runner.py              # Abstract base class (AL loop)
│   ├── alipy_runner.py             # ALiPy strategies
│   ├── libact_runner.py            # libact strategies
│   ├── smalltext_runner.py         # small-text strategies
│   ├── skactiveml_runner.py        # scikit-activeml strategies
│   ├── playground_runner.py        # Custom strategies
│   └── optimal_runner.py           # Oracle strategies
├── metrics/                        # Metric recording during experiments
│   ├── Standard_ML_Metrics.py      # accuracy, F1, precision, recall
│   ├── Timing_Metrics.py           # query_selection_time, learner_training_time
│   ├── Selected_Indices.py         # selected sample indices
│   └── Predicted_Samples.py        # y_pred_train, y_pred_test
├── resources/
│   ├── data_types.py               # ALL enums (AL_STRATEGY, COMPUTED_METRIC, etc.)
│   ├── exp_config.yaml             # Experiment grid definitions
│   └── openml_datasets.yaml        # OpenML dataset configurations
├── misc/config.py                  # Central configuration
├── eva_scripts/                    # Evaluation & plotting scripts
└── scripts/                        # Utility, fix, and maintenance scripts

Key Abstractions

AL_Experiment (framework_runners/base_runner.py)

Abstract base class for framework adapters. Key methods:

  • get_AL_strategy() — Initialize the strategy
  • query_AL_strategy() → indices — Select samples to query
  • al_cycle() — Main loop: query → update → retrain → record metrics
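
The contract can be sketched as an abstract base class. This is a simplified skeleton under assumed constructor arguments and method bodies, not the actual base_runner.py code:

```python
from abc import ABC, abstractmethod

class AL_Experiment(ABC):
    """Hypothetical skeleton of the framework-adapter contract."""

    def __init__(self, X, y, labeled_idx, n_cycles):
        self.X, self.y = X, y
        self.labeled = set(labeled_idx)   # currently labeled sample indices
        self.n_cycles = n_cycles

    @abstractmethod
    def get_AL_strategy(self):
        """Initialize the framework-specific strategy object."""

    @abstractmethod
    def query_AL_strategy(self) -> list:
        """Return indices of the samples to query next."""

    def al_cycle(self):
        # Main loop: query -> update labeled pool -> retrain -> record metrics
        self.get_AL_strategy()
        for _ in range(self.n_cycles):
            queried = self.query_AL_strategy()
            self.labeled.update(queried)
            self.retrain_and_record()

    def retrain_and_record(self):
        pass  # retraining and metric recording omitted in this sketch
```

A concrete adapter (e.g. for ALiPy or libact) would implement the two abstract methods against that framework's API.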

Monitoring (05_analyze_partially_run_workload.py)

Analyzes progress of a partially completed experiment run:

  • Groups completed experiments by dataset/strategy/model/hyperparameters
  • Calculates mean query selection time per combination
  • Identifies which parameter combinations are missing
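
The grouping steps above can be sketched with pandas; the column names and the planned grid below are illustrative, not the script's actual schema:

```python
import pandas as pd

# Hypothetical completed-experiment rows (a tiny stand-in for the workload).
done = pd.DataFrame({
    "dataset":  ["Iris", "Iris", "wine"],
    "strategy": ["ALIPY_RANDOM", "ALIPY_RANDOM", "LIBACT_QBC"],
    "query_selection_time": [0.10, 0.14, 2.30],
})

# Mean query selection time per completed dataset/strategy combination
mean_times = (done.groupby(["dataset", "strategy"])["query_selection_time"]
                  .mean())
print(mean_times)

# Missing combinations = planned grid minus completed groups
planned = {("Iris", "ALIPY_RANDOM"), ("Iris", "LIBACT_QBC"),
           ("wine", "ALIPY_RANDOM"), ("wine", "LIBACT_QBC")}
missing = planned - set(mean_times.index)
print(sorted(missing))
```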

Visualization (07b_create_results_without_flask.py)

Generates a standalone HTML file with interactive result visualizations (AUC tables, learning curves, runtime plots) without requiring a Flask server.