About

From reproducible multimodal large language model (MLLM) training to long-video understanding and real-time generation, we turn frontier multimodal research into working systems.

MVP Lab is a research lab for multimodal intelligence, visual perception, and generative modeling of the physical world.

We build AI systems that can perceive, understand, generate, and act across images, video, language, audio, tactile signals, and embodied environments. Our work spans open multimodal model training, long-video understanding, real-time video generation, world modeling, 3D vision, and the infrastructure needed to train and evaluate these systems at scale.

The lab is directed by Jiankang Deng, Assistant Professor in Computing at Imperial College London. His research focuses on multimodal foundation models and generative modeling of the physical world, with a broader goal of connecting computer vision research to real-world applications.

What We Work On

  • Multimodal foundation models for image, video, language, and multi-sensory understanding.
  • Open training recipes and reproducible systems for large-scale MLLM research.
  • Long-video perception, codec-aligned video modeling, and temporal reasoning.
  • Real-time autoregressive video generation and world modeling.
  • 3D scene understanding, visual-tactile perception, embodied AI, and robotics-adjacent intelligence.
  • Infrastructure for distributed training, evaluation, deployment, and fast research iteration.

How We Work

Our research style is both scientific and engineering-driven: strong ideas matter, but so do scalable training pipelines, careful evaluation, readable code, and systems that can survive contact with real experiments.