Characterizing I/O Performance for ML Data Loaders at Scale Using Darshan
Description
Modern machine learning (ML) workflows in High Energy Physics (HEP) increasingly rely on large-scale detector-level datasets, where individual events are large and overall data volumes grow rapidly. As a result, data loading has become one of the most I/O-intensive components of ML training pipelines, often limiting achievable batch sizes, data throughput, and overall scalability on high-performance computing (HPC) systems.
This project aims to systematically characterize and analyze the I/O behavior of ML data loaders used in HEP by leveraging Darshan, a lightweight and widely deployed HPC I/O characterization tool. The study will focus on understanding how data format, access pattern, and data loader design impact I/O performance at scale.
Two primary classes of data loaders will be evaluated:
- The native PyTorch DataLoader, commonly used in ML training workflows
- A ROOT-based DataLoader for PyTorch, a newer development enabling native reading of ROOT and RNTuple data structures
These loaders will be benchmarked using representative HEP data formats, including ROOT RNTuple, HDF5, NumPy NPZ, and CSV, under realistic ML training workloads. By collecting and analyzing Darshan I/O traces on production HPC systems (e.g., NERSC Perlmutter), the project will identify performance bottlenecks, characterize I/O access patterns, and provide actionable recommendations for optimizing ML data ingestion pipelines in HEP.
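As a concrete starting point, the sketch below shows the kind of minimal PyTorch data-loading loop that would be traced with Darshan. The NPZ file layout (one fixed-size event array `x` per file), the directory pattern, and the class name `NPZEventDataset` are illustrative assumptions, not part of the project specification.

```python
# Minimal sketch of a data-loading benchmark loop (illustrative assumptions:
# one event per .npz file, each holding a fixed-shape "x" array so the default
# collate works; paths and class names are placeholders). The process would be
# run under Darshan instrumentation so that reads issued by loader workers are traced.
import glob

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class NPZEventDataset(Dataset):
    """Loads one detector event per .npz file."""

    def __init__(self, pattern="data/events/*.npz"):
        self.files = sorted(glob.glob(pattern))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Each __getitem__ call triggers open/read/close I/O that Darshan's
        # POSIX module records per worker process.
        with np.load(self.files[idx]) as f:
            return torch.from_numpy(f["x"].astype(np.float32))


if __name__ == "__main__":
    loader = DataLoader(NPZEventDataset(), batch_size=64, num_workers=4)
    for batch in loader:
        pass  # a real benchmark would run a training step here
```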
Task Ideas
- Design and implement a benchmark framework for ML training workflows using PyTorch with multiple data loaders and data formats.
- Run controlled I/O performance experiments on HPC systems using Darshan instrumentation.
- Analyze Darshan logs to characterize I/O patterns such as file access frequency, read sizes, metadata operations, and concurrency (see the analysis sketch after this list).
- Compare I/O bandwidth, latency, and scalability across data loaders and formats.
- Evaluate the impact of data loader configuration parameters such as the number of workers, prefetching, and sharding (see the configuration sketch after this list).
- Summarize findings and propose optimization strategies for scalable ML data loading in HEP environments.
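For the log-analysis task, one possible starting point is the PyDarshan Python package. The sketch below uses a placeholder log file name and pulls a few standard POSIX-module counters; the exact columns available depend on the Darshan version that produced the log.

```python
# Sketch of Darshan log analysis with PyDarshan (pip install darshan).
# The log file name is a placeholder; the counters shown are standard
# POSIX-module fields, but column availability depends on the Darshan version.
import darshan

report = darshan.DarshanReport("pytorch_loader_run.darshan", read_all=True)
posix = report.records["POSIX"].to_df()

counters = posix["counters"]    # per-file integer counters (one row per file record)
fcounters = posix["fcounters"]  # per-file floating-point timers

total_bytes_read = counters["POSIX_BYTES_READ"].sum()
total_reads = counters["POSIX_READS"].sum()
read_time = fcounters["POSIX_F_READ_TIME"].sum()

print(f"files touched        : {len(counters)}")
print(f"total bytes read     : {total_bytes_read / 2**20:.1f} MiB")
print(f"total read calls     : {total_reads}")
print(f"mean read size       : {total_bytes_read / max(total_reads, 1):.0f} B")
print(f"cumulative read time : {read_time:.2f} s")
```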
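For the configuration study, a sketch of a single-node parameter sweep is shown below. The dataset object, the parameter grid, and the batch size are placeholders, and sharding across ranks is illustrated with PyTorch's DistributedSampler.

```python
# Sketch of a data-loader configuration sweep (dataset, grid, and batch size are
# placeholders). prefetch_factor is only meaningful when num_workers > 0.
import time

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler


def time_one_epoch(dataset, num_workers, prefetch_factor, world_size=1, rank=0):
    # Shard the dataset across ranks so each process reads a disjoint subset.
    sampler = (DistributedSampler(dataset, num_replicas=world_size, rank=rank)
               if world_size > 1 else None)
    extra = {}
    if num_workers > 0:
        extra.update(prefetch_factor=prefetch_factor, persistent_workers=True)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler,
                        num_workers=num_workers, **extra)
    start = time.perf_counter()
    for _ in loader:
        pass  # I/O only; a real benchmark would include the training step
    return time.perf_counter() - start


# Example sweep; each run would be captured as its own Darshan log, e.g.:
# for workers in [0, 2, 4, 8]:
#     for prefetch in [2, 4]:
#         elapsed = time_one_epoch(my_dataset, workers, prefetch)
#         print(f"workers={workers} prefetch={prefetch} epoch_time={elapsed:.1f}s")
```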
Expected Results and Milestones
- Familiarization with HEP datasets for ML, PyTorch data loading, and Darshan I/O profiling
- Setup of benchmark environment and representative datasets on HPC systems
- Implementation of PyTorch and ROOT-based DataLoader benchmarks
- Large-scale I/O profiling and data collection using Darshan
- Analysis and comparison of I/O performance across data formats and loaders
- Optimization recommendations and documentation
Requirements
- Python programming skills
- Familiarity with basic machine learning concepts
- Familiarity with basic I/O concepts
- Interest in performance analysis and large-scale computing
AI Policy
AI assistance is allowed for this contribution. The applicant takes full responsibility for all code and results and must disclose AI use for non-routine tasks (algorithm design, architecture, complex problem-solving). Routine tasks (grammar, formatting, style) do not require disclosure.
How to Apply
Email the mentors with a brief description of your background and your interest in ML and large-scale computing. Please include “gsoc26” in the subject line. Mentors will provide an evaluation task after you get in touch.
Resources
- HEP-CCE Project
- Darshan I/O Characterization Tool
- PyTorch DataLoader Documentation
- ROOT and RNTuple Documentation
- NERSC Perlmutter System Documentation
Mentors
- Rui Wang - ANL
- Orcun Yildiz - ANL
- Martin Foll - UiO
Additional Information
- Difficulty level: medium
- Duration: 175 hours
- Mentor availability: June-October