Generative Action Tell-Tales: Assessing Human Motion in Synthesized Videos

1Boston University, 2Belmont High School, 3Canyon Crest Academy, 4Runway
*Equal contribution

A short silent video designed to supplement the paper

Abstract

Despite rapid advances in video generative models, robust metrics for evaluating the visual and temporal correctness of complex human actions remain elusive. Critically, existing pure-vision encoders and Multimodal Large Language Models (MLLMs) are strongly appearance-biased, lack temporal understanding, and thus struggle to discern intricate motion dynamics and anatomical implausibilities in generated videos. We tackle this gap by introducing a novel evaluation metric derived from a learned latent space of real-world human actions. Our method first captures the nuances, constraints, and temporal smoothness of real-world motion by fusing appearance-agnostic human skeletal geometry features with appearance-based features. We posit that this combined feature space provides a robust representation of action plausibility. Given a generated video, our metric quantifies its action quality by measuring the distance between its underlying representation and this learned real-world action distribution. For rigorous validation, we develop a new multi-faceted benchmark specifically designed to probe temporally challenging aspects of human action fidelity. Through extensive experiments, we demonstrate that our metric achieves a substantial improvement of over 68% over existing state-of-the-art methods on our benchmark, performs competitively on established external benchmarks, and correlates more strongly with human perception. Our in-depth analysis reveals critical limitations in current video generative models and establishes a new standard for advancing research in video generation.

Current automatic video metrics and MLLMs struggle to evaluate human motion in generated videos

Evaluating Generated Videos with MLLMs

Interactive demo: each MLLM is asked to rate a generated video on two questions, "Is the action semantically correct?" (Rate: 1-10) and "Is the motion temporally coherent?" (Rate: 1-10).

MLLMs assign high scores to videos that depict incorrect actions and lack realistic temporal coherence.

* For the exact evaluation prompt, see our paper.

Key idea: Judge realism by looking at how real humans perform an action

Key Idea visualization

We learn a human-centric embedding space where consistent generated videos lie close to real videos of the same action, while inconsistent generated videos lie further away.
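As a rough illustration of this idea (a minimal sketch, not the paper's released code), the snippet below scores a generated video by how close its embedding lies to the centroid of real-video embeddings of the same action. The encoder is abstracted away; the cosine-similarity scoring, the `action_realism_score` helper, and the 256-dimensional embeddings are all illustrative assumptions.

import numpy as np

def action_realism_score(gen_embedding: np.ndarray, real_embeddings: np.ndarray) -> float:
    """Higher = closer to the real-action distribution (cosine similarity to its centroid)."""
    centroid = real_embeddings.mean(axis=0)  # center of real videos of this action
    cos = np.dot(gen_embedding, centroid) / (
        np.linalg.norm(gen_embedding) * np.linalg.norm(centroid) + 1e-8
    )
    return float(cos)  # in [-1, 1]

# Toy usage with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
real = rng.normal(size=(100, 256))  # embeddings of real "squats" videos
gen = rng.normal(size=(256,))       # embedding of a generated "squats" video
print(action_realism_score(gen, real))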

Break human motion down to its intrinsics

Use these ingredients to learn what constitutes an action

Encoder visualization
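The sketch below shows one way such ingredients could be combined (a hedged sketch, not the paper's encoder): per-frame skeletal geometry is fused with per-frame appearance features, and temporal derivatives of both streams serve as the "motion" signal referred to in the ablation later on. All shapes, the `fuse_action_features` helper, and the simple concatenation are assumptions.

import numpy as np

def fuse_action_features(keypoints: np.ndarray, appearance: np.ndarray) -> np.ndarray:
    """
    keypoints:  (T, J, 2) per-frame 2D skeletal joints (appearance-agnostic geometry).
    appearance: (T, D)    per-frame appearance embedding from any vision encoder.
    Returns a (T-1, 2*J*2 + 2*D) sequence: [geometry, appearance, their temporal derivatives].
    """
    geom = keypoints.reshape(keypoints.shape[0], -1)  # (T, J*2)
    geom_motion = np.diff(geom, axis=0)               # frame-to-frame joint displacements
    app_motion = np.diff(appearance, axis=0)          # frame-to-frame appearance change
    return np.concatenate([geom[1:], appearance[1:], geom_motion, app_motion], axis=1)

# Toy inputs: 16 frames, 17 joints, 128-d appearance features -> (15, 324) fused sequence.
fused = fuse_action_features(np.zeros((16, 17, 2)), np.zeros((16, 128)))
print(fused.shape)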

Telltale Action Generation Bench (TAG-Bench)

5 Generative Models

Runway Gen-4 Turbo, Wan2.1, Wan2.2, Opensora, HunyuanVideo

10 Actions

Squats, Hulahoop, Jumping jack, Pull ups, Push ups, Shot put, Soccer juggling, Tennis swing, Discus throw, Wall push up

Humans evaluate each video on:

Action Consistency (1-10)

How accurately the generated video depicts the intended action mentioned in the prompt.

Temporal Coherence (1-10)

How physically plausible and temporally smooth the human motion appears in the generated video.

Human evaluation scores

Interactive plot: hover over points to view the corresponding video. Each point represents a generated video scored by humans. Click on actions or generative models in the legends to filter the visualization; click "All" to reset.

Metrics

Metrics visualization

Results

Action consistency bar plot. Temporal coherence bar plot.

Overall model performance (win ratios)

Action Consistency

Temporal Coherence
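As a hedged sketch of one common way such win ratios can be computed (an assumption about the protocol, not necessarily the paper's exact definition): for every pair of models generating the same prompt, the model with the higher human score wins, and a model's win ratio is its wins divided by its total comparisons.

from itertools import combinations
from collections import defaultdict

def win_ratios(scores: dict[str, list[float]]) -> dict[str, float]:
    """scores maps model name -> human scores for the same, aligned set of prompts."""
    wins, total = defaultdict(int), defaultdict(int)
    for m1, m2 in combinations(scores, 2):
        for s1, s2 in zip(scores[m1], scores[m2]):
            if s1 == s2:
                continue  # ties are skipped in this sketch
            wins[m1 if s1 > s2 else m2] += 1
            total[m1] += 1
            total[m2] += 1
    return {m: wins[m] / total[m] if total[m] else 0.0 for m in scores}

# Toy example with three hypothetical models scored on three prompts.
print(win_ratios({"model_a": [8, 6, 7], "model_b": [5, 6, 9], "model_c": [4, 3, 2]}))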

Are some actions universally easy / hard to generate?

Interactive plot: click on actions or generative models in the legends to filter the visualization; click "All" to reset.

t-SNE visualization of the learned embedding space.

Push-ups are relatively easy for all models: generated video embeddings lie close to the centroid of the real video embeddings.

Shot put is hard: generated video embeddings lie far from the centroid of the real video embeddings.
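To reproduce this kind of view with your own embeddings, here is a minimal sketch (not the paper's plotting code; the embedding sizes and random inputs are placeholders): project real and generated embeddings to 2D with t-SNE for visualization, and measure closeness to the real-action centroid in the original embedding space, since t-SNE distorts global distances.

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 256))  # embeddings of real videos of one action
gen = rng.normal(0.5, 1.0, size=(40, 256))    # embeddings of generated videos of that action

points_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(np.vstack([real, gen]))
real_2d, gen_2d = points_2d[:200], points_2d[200:]  # 2D coordinates for plotting

# Closeness to the real-action centroid, measured in the original space.
dists = np.linalg.norm(gen - real.mean(axis=0), axis=1)
print(f"mean distance of generated videos to the real centroid: {dists.mean():.3f}")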

Are all input features required?

Interactive widget: click features to remove them and see the resulting correlation with human scores (shown: Action Consistency ρ = 0.61, Temporal Coherence ρ = 0.64).

Effect of each input feature. We report Spearman's correlation (ρ) with human scores after zeroing each input feature independently. Models are retrained from scratch for each setting. "Motion" denotes temporal derivatives of all inputs. Removing motion causes the largest degradation.
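A minimal sketch of how such an ablation score can be computed (our reading of the protocol above, not released code; `score_videos` is a hypothetical stand-in for the evaluator retrained without the dropped feature group):

import numpy as np
from scipy.stats import spearmanr

def ablation_spearman(features: np.ndarray, human_scores: np.ndarray, score_videos, drop: slice) -> float:
    """features: (N, D) per-video inputs; `drop` selects the feature columns to zero out."""
    ablated = features.copy()
    ablated[:, drop] = 0.0                           # zero one input feature group
    metric_scores = score_videos(ablated)            # (N,) scores from the retrained evaluator
    rho, _ = spearmanr(metric_scores, human_scores)  # rank correlation with human ratings
    return float(rho)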

Citation

@misc{thomas2025generativeactiontelltalesassessing,
      title={Generative Action Tell-Tales: Assessing Human Motion in Synthesized Videos}, 
      author={Xavier Thomas and Youngsun Lim and Ananya Srinivasan and Audrey Zheng and Deepti Ghadiyaram},
      year={2025},
      eprint={2512.01803},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.01803}, 
}

References

RunwayGen4: Runway Research Team. Runway gen-4: Advancing realistic text-to-video generation. Technical Report, 2024. https://research.runwayml.com/gen4.

Wan2.1: Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

Wan2.2: Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

Hunyuan: Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024.

Opensora: Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024.

VideoScore2: Dian Zheng, Ziqi Huang, Hongbo Liu, Kai Zou, Yinan He, Fan Zhang, Lulu Gu, Yuanhan Zhang, Jingwen He, Wei-Shi Zheng, et al. Vbench-2.0: Advancing video generation benchmark suite for intrinsic faithfulness. arXiv preprint arXiv:2503.21755, 2025.

VideoPhy-2: Hritik Bansal, Clark Peng, Yonatan Bitton, Roman Goldenberg, Aditya Grover, and Kai-Wei Chang. Videophy-2: A challenging action-centric physical commonsense evaluation in video generation. arXiv preprint arXiv:2503.06800, 2025.

GPT-5: OpenAI. Gpt-5 system card. Technical report, OpenAI, 2025. Accessed: 2025-11-10.

Gemini-2.5-Pro: Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.