Characterizing I/O Performance for ML Data Loaders at Scale Using Darshan

Description

Modern machine learning (ML) workflows in High Energy Physics (HEP) increasingly rely on large-scale detector-level datasets, where individual events are large and overall data volumes grow rapidly. As a result, data loading has become one of the most I/O-intensive components of ML training pipelines, often limiting achievable batch sizes, data throughput, and overall scalability on high-performance computing (HPC) systems.

This project aims to systematically characterize and analyze the I/O behavior of ML data loaders used in HEP by leveraging Darshan, a lightweight and widely deployed HPC I/O characterization tool. The study will focus on understanding how data format, access pattern, and data loader design impact I/O performance at scale.
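As a concrete starting point, Darshan can instrument a non-MPI process such as a Python data loader via LD_PRELOAD. The sketch below (the library path and the training-script name are hypothetical, site-specific values) shows the environment setup; the resulting log can then be summarized with the standard `darshan-parser` tool.

```python
import os


def darshan_env(libdarshan_path):
    """Build an environment that instruments a process with Darshan.

    `libdarshan_path` is site-specific (e.g. wherever the Darshan
    runtime is installed on the target HPC system).
    """
    env = dict(os.environ)
    # Preload the Darshan runtime so it can intercept I/O calls.
    env["LD_PRELOAD"] = libdarshan_path
    # Darshan only instruments MPI programs by default; this opts in
    # non-MPI applications such as a plain Python training script.
    env["DARSHAN_ENABLE_NONMPI"] = "1"
    return env


# Example launch (paths and script name are hypothetical):
#   import subprocess
#   env = darshan_env("/opt/darshan/lib/libdarshan.so")
#   subprocess.run(["python", "train.py"], env=env, check=True)
```

This keeps the workload itself unmodified; all characterization happens at the I/O-library boundary, which is what makes Darshan attractive for comparing data loaders fairly.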

Two primary classes of data loaders will be evaluated:

These loaders will be benchmarked using representative HEP data formats, including ROOT RNTuple, HDF5, NumPy NPZ, and CSV, under realistic ML training workloads. By collecting and analyzing Darshan I/O traces on production HPC systems (e.g., NERSC Perlmutter), the project will identify performance bottlenecks, characterize I/O access patterns, and provide actionable recommendations for optimizing ML data ingestion pipelines in HEP.
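A minimal loader benchmark along these lines, sketched here for the NumPy NPZ case (synthetic data; the array shape and batch size are purely illustrative), times each batch read so that application-level wall-clock numbers can later be correlated with Darshan's per-file I/O counters:

```python
import time

import numpy as np


def make_dataset(path, n_events=1024, width=64):
    """Write a small synthetic NPZ file standing in for HEP event data."""
    rng = np.random.default_rng(0)
    np.savez(path, features=rng.normal(size=(n_events, width)).astype(np.float32))


def benchmark_npz(path, batch_size=128):
    """Read the dataset batch by batch, recording per-batch wall time."""
    timings = []
    with np.load(path) as data:
        features = data["features"]
        for start in range(0, len(features), batch_size):
            t0 = time.perf_counter()
            # Force the slice to materialize so the read cost is measured.
            batch = np.array(features[start:start + batch_size])
            timings.append(time.perf_counter() - t0)
    return timings
```

The same harness shape (generate or stage data, iterate in batches, record timings) would be repeated per format, swapping in `h5py`, `uproot`/RNTuple, or a CSV reader, with Darshan capturing the underlying POSIX and library-level access patterns.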

Task Ideas

Expected Results and Milestones

Requirements

AI Policy

AI assistance is allowed for this contribution. Applicants take full responsibility for all code and results and must disclose AI use for non-routine tasks (e.g., algorithm design, architecture decisions, complex problem-solving). Routine uses (grammar, formatting, style) do not require disclosure.

How to Apply

Email mentors with a brief background and interest in ML/large-scale computing. Please include “gsoc26” in the subject line. Mentors will provide an evaluation task after submission.

Resources

Mentors

Additional Information

Corresponding Project

Participating Organizations