M3GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation

[Overview figure]

GitHub · arXiv

Abstract

This paper presents M3GPT, an advanced Multimodal, Multitask framework for Motion comprehension and generation. M3GPT operates on three fundamental principles. The first is to create a unified representation space for the various motion-relevant modalities: we employ discrete vector quantization for multimodal conditional signals, such as text, music, and motion/dance, enabling seamless integration into a large language model (LLM) with a single vocabulary. The second is to model motion generation directly in the raw motion space, which circumvents the information loss of a discrete tokenizer and yields more detailed and comprehensive motion generation. The third is to model the connections and synergies among the various motion-relevant tasks: text, the modality most familiar to and best understood by LLMs, serves as a bridge between the different motion tasks, allowing them to reinforce one another. To our knowledge, M3GPT is the first model capable of comprehending and generating motion based on multiple signals. Extensive experiments highlight M3GPT's superior performance across various motion-relevant tasks and its strong zero-shot generalization on highly challenging tasks.
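To make the first principle more concrete, the minimal Python sketch below illustrates one common way to fold discrete motion and music tokens from modality-specific VQ codebooks into a single LLM vocabulary via index offsets. It is an assumption-laden illustration, not the authors' implementation: the vocabulary and codebook sizes, token ids, and helper names are all hypothetical.

# Minimal sketch (not the authors' code) of a unified token vocabulary:
# discrete motion/music codes from modality-specific VQ codebooks are
# shifted into one shared index space appended to the LLM's text vocabulary.
# All sizes and ids below are illustrative assumptions.

TEXT_VOCAB_SIZE = 32000      # e.g. a text tokenizer vocabulary (assumed)
MOTION_CODEBOOK_SIZE = 512   # motion VQ codebook size (assumed)
MUSIC_CODEBOOK_SIZE = 1024   # music VQ codebook size (assumed)

MOTION_OFFSET = TEXT_VOCAB_SIZE
MUSIC_OFFSET = TEXT_VOCAB_SIZE + MOTION_CODEBOOK_SIZE
UNIFIED_VOCAB_SIZE = TEXT_VOCAB_SIZE + MOTION_CODEBOOK_SIZE + MUSIC_CODEBOOK_SIZE


def motion_ids_to_unified(motion_codes):
    """Map motion VQ code indices into the shared LLM vocabulary."""
    return [MOTION_OFFSET + c for c in motion_codes]


def music_ids_to_unified(music_codes):
    """Map music VQ code indices into the shared LLM vocabulary."""
    return [MUSIC_OFFSET + c for c in music_codes]


def unified_to_motion_ids(token_ids):
    """Recover motion code indices from a unified token sequence."""
    return [t - MOTION_OFFSET for t in token_ids
            if MOTION_OFFSET <= t < MUSIC_OFFSET]


if __name__ == "__main__":
    # Toy example: a (pre-tokenized) text prompt followed by motion codes.
    text_tokens = [101, 2054, 2003]            # assumed text token ids
    motion_codes = [7, 42, 511]                # assumed motion VQ indices
    sequence = text_tokens + motion_ids_to_unified(motion_codes)
    print(sequence)                            # [101, 2054, 2003, 32007, 32042, 32511]
    print(unified_to_motion_ids(sequence))     # [7, 42, 511]

With all modalities living in one index space, a single autoregressive LLM can consume and emit mixed text/motion/music sequences, which is the mechanism the abstract's first principle refers to.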

Text-to-Motion

A person is knocking and sitting at the same time.

A person is walking while digging.

A woman performs a knee tuck to kick L.

Someone is rotating their ankle while sitting on a chair.

Motion-to-Text

A woman is performing a Short Weapon Assault.

The person is doing the Hand Ausweichhi.

The person covers their ears with their hands while sitting.

A person is sitting and throwing their arms around.

Music-to-Dance

Dance-to-Music

Text&Music-to-Dance

'music' + a person is spinning in circles.

'music' + a person performs a figure-eight jump.

'music' + a person does a cartwheel.

Long-Term Dance Generation

Testing Sample from AIST++.

Testing Sample from AIST++.

Testing Sample from FineDance.

Citation

If you find our code or paper helpful, please consider citing:

                
@article{luo2024m3gpt,
  title={M$^3$GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation},
  author={Mingshuang Luo and Ruibing Hou and Zhuo Li and Hong Chang and Zimo Liu and Yaowei Wang and Shiguang Shan},
  journal={Advances in Neural Information Processing Systems},
  year={2024}
}