M3GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation

[Overview figure]

GitHub · arXiv

Abstract

This paper presents M3GPT, an advanced Multimodal, Multitask framework for Motion comprehension and generation. M3GPT operates on three fundamental principles. The first is to create a unified representation space for the various motion-relevant modalities: we employ discrete vector quantization for multimodal conditional signals, such as text, music, and motion/dance, enabling seamless integration into a large language model (LLM) with a single vocabulary. The second is to model motion generation directly in the raw motion space, which circumvents the information loss of a discrete tokenizer and yields more detailed and comprehensive motion generation. The third is to model the connections and synergies among the various motion-relevant tasks: text, the modality most familiar to and best understood by LLMs, serves as a bridge between the different motion tasks, allowing them to reinforce one another. To our knowledge, M3GPT is the first model capable of comprehending and generating motion based on multiple signals. Extensive experiments highlight M3GPT's superior performance across various motion-relevant tasks and its strong zero-shot generalization on highly challenging tasks.
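To make the first principle more concrete, the minimal Python sketch below illustrates one common way to fold discrete motion and music tokens from modality-specific VQ codebooks into a single LLM vocabulary via index offsets. It is an assumption-laden illustration, not the authors' implementation: the vocabulary and codebook sizes, token ids, and helper names are all hypothetical.

# Minimal sketch (not the authors' code) of a unified token vocabulary:
# discrete motion/music codes from modality-specific VQ codebooks are
# shifted into one shared index space appended to the LLM's text vocabulary.
# All sizes and ids below are illustrative assumptions.

TEXT_VOCAB_SIZE = 32000      # e.g. a text tokenizer vocabulary (assumed)
MOTION_CODEBOOK_SIZE = 512   # motion VQ codebook size (assumed)
MUSIC_CODEBOOK_SIZE = 1024   # music VQ codebook size (assumed)

MOTION_OFFSET = TEXT_VOCAB_SIZE
MUSIC_OFFSET = TEXT_VOCAB_SIZE + MOTION_CODEBOOK_SIZE
UNIFIED_VOCAB_SIZE = TEXT_VOCAB_SIZE + MOTION_CODEBOOK_SIZE + MUSIC_CODEBOOK_SIZE


def motion_ids_to_unified(motion_codes):
    """Map motion VQ code indices into the shared LLM vocabulary."""
    return [MOTION_OFFSET + c for c in motion_codes]


def music_ids_to_unified(music_codes):
    """Map music VQ code indices into the shared LLM vocabulary."""
    return [MUSIC_OFFSET + c for c in music_codes]


def unified_to_motion_ids(token_ids):
    """Recover motion code indices from a unified token sequence."""
    return [t - MOTION_OFFSET for t in token_ids
            if MOTION_OFFSET <= t < MUSIC_OFFSET]


if __name__ == "__main__":
    # Toy example: a (pre-tokenized) text prompt followed by motion codes.
    text_tokens = [101, 2054, 2003]            # assumed text token ids
    motion_codes = [7, 42, 511]                # assumed motion VQ indices
    sequence = text_tokens + motion_ids_to_unified(motion_codes)
    print(sequence)                            # [101, 2054, 2003, 32007, 32042, 32511]
    print(unified_to_motion_ids(sequence))     # [7, 42, 511]

With all modalities living in one index space, a single autoregressive LLM can consume and emit mixed text/motion/music sequences, which is the mechanism the abstract's first principle refers to.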

Text-to-Motion

A person is knocking and sitting at the same time.

A person is walking while digging.

A woman performs a knee tuck to kick L.

Someone is rotating their ankle while sitting on a chair.

Motion-to-Text

A woman is performing a Short Weapon Assault.

The person is doing the Hand Ausweichhi.

The person covers their ears with their hands while sitting.

A person is sitting and throwing their arms around.

Music-to-Dance

Dance-to-Music

Text&Music-to-Dance

'music' + a person is spinning in circles.

'music' + a person performs a figure-eight jump.

'music' + a person does a cartwheel.

Long-Term Dance Generation

Testing Sample from AIST++.

Testing Sample from AIST++.

Testing Sample from FineDance.

Citation

If you find our code or paper helpful, please consider citing:

                
@article{luo2024m3gpt,
  title={M$^3$GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation},
  author={Mingshuang Luo and Ruibing Hou and Zhuo Li and Hong Chang and Zimo Liu and Yaowei Wang and Shiguang Shan},
  journal={Advances in Neural Information Processing Systems},
  year={2024}
}