mvp-dataset: deterministic data loading for multimodal training
mvp-dataset is an MVP Lab project for local, shard-based data loading in
multimodal training systems.
The library is built around a deliberately small API surface:
- Dataset
- TorchLoader
- RuntimeContext
That design keeps the library focused on a few practical requirements that show up repeatedly in real training pipelines:
- high throughput on local tar-shard datasets
- deterministic sharding and shuffle behavior across workers
- simple support for multimodal joins and references
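To make the determinism requirement concrete, here is a minimal sketch of one common way to split shards across workers reproducibly. It is an illustration only, not mvp-dataset's implementation: the function `assign_shards` and its parameters (`worker_id`, `num_workers`, `seed`, `epoch`) are hypothetical names, and the library may use a different scheme.

```python
import random


def assign_shards(shards, worker_id, num_workers, seed, epoch):
    """Return the shards one worker reads in a given epoch (hypothetical helper)."""
    # Sort first so every worker starts from the same order, regardless of
    # how the shard list was discovered on disk.
    ordered = sorted(shards)
    # Seed a private RNG from (seed, epoch): every worker computes the same
    # permutation without touching global random state.
    rng = random.Random(f"{seed}-{epoch}")
    rng.shuffle(ordered)
    # Strided slicing gives each worker a disjoint subset of the shards.
    return ordered[worker_id::num_workers]


shards = [f"shard-{i:05d}.tar" for i in range(8)]
parts = [assign_shards(shards, w, 3, seed=0, epoch=0) for w in range(3)]
```

Because the permutation depends only on the sorted shard names, the seed, and the epoch, rerunning a job with the same settings reproduces the same assignment, and the workers together cover every shard exactly once.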
mvp-dataset supports two main source types:
- local .tar shards through Dataset.from_tars(...)
- local .jsonl files through Dataset.from_jsonl(...)
For tar-based workflows, the library can parse samples directly from shard
members and join sidecar tar files on the fly. For JSONL-based workflows, it
can resolve tar://... references, so structured metadata stays connected to
external images and other modality assets without converting everything into
one format.
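The sidecar join can be pictured with a small standard-library sketch. It is not the library's code. It assumes a WebDataset-style naming convention in which members sharing the text before the first dot belong to one sample, and the helpers `make_tar`, `read_samples`, and `join_sidecar` are hypothetical. For brevity it reads whole archives into memory, whereas an on-the-fly join would stream them.

```python
import io
import tarfile


def make_tar(files):
    """Build an in-memory tar archive from a {member name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def read_samples(fileobj):
    """Group tar members into {key: {extension: bytes}} samples."""
    samples = {}
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Assumed convention: "0001.jpg" -> key "0001", field "jpg".
            key, _, ext = member.name.partition(".")
            samples.setdefault(key, {})[ext] = tar.extractfile(member).read()
    return samples


def join_sidecar(main, sidecar):
    """Merge sidecar fields into the main samples that share the same key."""
    return {key: {**fields, **sidecar.get(key, {})} for key, fields in main.items()}


main = read_samples(make_tar({"0001.jpg": b"img1", "0001.json": b"{}", "0002.jpg": b"img2"}))
side = read_samples(make_tar({"0001.depth": b"d1"}))
joined = join_sidecar(main, side)
```

Here sample 0001 gains a depth field from the sidecar archive, while sample 0002 passes through unchanged because the sidecar has no entry for it.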
The project also includes chainable pipeline operations such as map,
shuffle, batch, and unbatch, plus TorchLoader for PyTorch
DataLoader-style training loops with loader-side pipeline composition.
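To show what chainable operations of this kind do, the following toy class implements map, shuffle, batch, and unbatch as lazy stages over a Python iterable. It is a sketch under stated assumptions, not mvp-dataset's API: the class name `Pipeline`, the `from_items` constructor, and the `buffer_size` and `seed` parameters are hypothetical, and the real signatures may differ.

```python
import random


class Pipeline:
    """Toy chainable pipeline; each stage wraps the previous one lazily."""

    def __init__(self, make_iter):
        # make_iter is a zero-argument callable returning a fresh iterator,
        # so the pipeline can be iterated again for every epoch.
        self._make_iter = make_iter

    @classmethod
    def from_items(cls, items):
        return cls(lambda: iter(items))

    def __iter__(self):
        return self._make_iter()

    def map(self, fn):
        return Pipeline(lambda: (fn(x) for x in self))

    def shuffle(self, buffer_size, seed):
        def gen():
            rng = random.Random(seed)  # private RNG: same seed, same order
            buf = []
            for x in self:
                buf.append(x)
                if len(buf) >= buffer_size:
                    # Move a random element to the end and emit it.
                    i = rng.randrange(len(buf))
                    buf[i], buf[-1] = buf[-1], buf[i]
                    yield buf.pop()
            rng.shuffle(buf)  # drain whatever is left in the buffer
            yield from buf
        return Pipeline(gen)

    def batch(self, size):
        def gen():
            chunk = []
            for x in self:
                chunk.append(x)
                if len(chunk) == size:
                    yield chunk
                    chunk = []
            if chunk:
                yield chunk  # keep the final partial batch
        return Pipeline(gen)

    def unbatch(self):
        return Pipeline(lambda: (x for chunk in self for x in chunk))


pipe = Pipeline.from_items(range(10)).map(lambda x: x * 2).shuffle(4, seed=0).batch(3)
batches = list(pipe)
```

Each method returns a new pipeline instead of mutating the old one, which is what makes the calls chainable. The seeded buffer shuffle yields the same order on every pass over the same input.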