mvp-dataset

mvp-dataset is a lightweight data loading library for multimodal model training.

It is designed around a few practical goals:

  • a minimal API surface
  • deterministic runtime behavior
  • strong throughput on local shard-based datasets
  • compatibility with PyTorch training pipelines

The library supports local tar shards, JSONL metadata files, sidecar tar joins, and tar:// reference resolution for datasets that mix structured records with external media assets.
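To make the sidecar join and tar:// resolution ideas concrete, here is a minimal sketch that uses only the Python standard library. It is not the mvp-dataset API. The helper names (`make_tar`, `read_samples`, `join_sidecar`, `resolve_tar_ref`), the rule that a sample key is the member name up to the first dot, and the `tar://<archive>#<member>` reference syntax are all assumptions made for illustration.

```python
import io
import json
import tarfile


def make_tar(members):
    """Build an in-memory tar archive from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_samples(tar_bytes):
    """Group tar members by key (assumed: the name up to the first dot)."""
    samples = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, field = member.name.partition(".")
            samples.setdefault(key, {})[field] = tar.extractfile(member).read()
    return samples


def join_sidecar(main, sidecar):
    """Merge sidecar fields into the main samples that share the same key."""
    return {key: {**fields, **sidecar.get(key, {})} for key, fields in main.items()}


def resolve_tar_ref(ref, archives):
    """Resolve a hypothetical 'tar://<archive>#<member>' reference to bytes."""
    prefix = "tar://"
    if not ref.startswith(prefix):
        raise ValueError(f"not a tar:// reference: {ref!r}")
    archive_name, sep, member = ref[len(prefix):].partition("#")
    if not sep:
        raise ValueError(f"reference has no member part: {ref!r}")
    with tarfile.open(fileobj=io.BytesIO(archives[archive_name]), mode="r") as tar:
        return tar.extractfile(member).read()


# A main shard, a sidecar shard, and one JSONL-style record pointing into a tar.
main_tar = make_tar({"0001.jpg": b"image-bytes", "0001.json": b'{"caption": "a cat"}'})
side_tar = make_tar({"0001.depth": b"depth-bytes"})
joined = join_sidecar(read_samples(main_tar), read_samples(side_tar))

record = json.loads('{"id": "0001", "image": "tar://media.tar#0001.jpg"}')
image = resolve_tar_ref(record["image"], {"media.tar": main_tar})
```

In this sketch the join is keyed purely on the sample key, so a sidecar field is attached only when a matching key exists in the main shard. Main samples without a sidecar entry pass through unchanged.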

Its core workflow suits training setups where multimodal data is stored locally and must be shuffled, batched, and merged efficiently while staying deterministic across distributed ranks and data loader worker processes.
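One common way to keep shard-level shuffling deterministic across ranks and workers is to derive the shuffle from a shared seed and the epoch, then give each (rank, worker) pair a disjoint slice. The sketch below illustrates that general technique with the standard library only. It is an assumption about how such a scheme could look, not a description of mvp-dataset's internals, and `shards_for_worker` is a hypothetical name.

```python
import random


def shards_for_worker(shards, seed, epoch, rank, world_size, worker_id, num_workers):
    """Return the shards that one (rank, worker) pair reads in a given epoch."""
    # Sort first so the result does not depend on directory listing order.
    order = sorted(shards)
    # Every process seeds an identical generator, so all see the same permutation.
    random.Random(seed * 1_000_003 + epoch).shuffle(order)
    # Each (rank, worker) pair takes a strided, non-overlapping slice.
    slot = rank * num_workers + worker_id
    total = world_size * num_workers
    return order[slot::total]


shards = [f"shard-{i:04d}.tar" for i in range(8)]
for rank in range(2):
    for worker_id in range(2):
        mine = shards_for_worker(shards, seed=0, epoch=0, rank=rank,
                                 world_size=2, worker_id=worker_id, num_workers=2)
        print(rank, worker_id, mine)
```

In a PyTorch pipeline, `rank` and `world_size` would typically come from `torch.distributed.get_rank()` and `torch.distributed.get_world_size()`, and `worker_id` and `num_workers` from `torch.utils.data.get_worker_info()`. Changing `epoch` reshuffles the shards while keeping the assignment reproducible for a fixed seed.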