SeqBench: Benchmarking Sequential Narrative Generation in Text-to-Video Models

Anonymous Authors
Anonymous Institute
Figure: Examples of sequential narrative generation failures in current T2V models. Each failure type illustrates a specific mode that SeqBench is designed to detect and evaluate.

Abstract

Text-to-video (T2V) generation models have made significant progress in creating visually appealing videos. However, they struggle with generating coherent sequential narratives that require logical progression through multiple events. Existing T2V benchmarks primarily focus on visual quality metrics but fail to evaluate narrative coherence over extended sequences.

To bridge this gap, we present SeqBench, a comprehensive benchmark for evaluating sequential narrative coherence in T2V generation. SeqBench includes a carefully designed dataset of 320 prompts spanning various narrative complexities, with 2,560 human-annotated videos generated from 8 state-of-the-art T2V models. Additionally, we design a Dynamic Temporal Graph (DTG)-based automatic evaluation metric that captures long-range dependencies and temporal ordering while remaining computationally efficient. Our DTG-based metric correlates strongly with human annotations.

Through systematic evaluation using SeqBench, we reveal critical limitations in current T2V models: failure to maintain consistent object states across multi-action sequences, physically implausible results in multi-object scenarios, and difficulties in preserving realistic timing and ordering relationships between sequential actions. SeqBench provides the first systematic framework for evaluating narrative coherence in T2V generation and offers concrete insights for improving sequential reasoning capabilities in future models.

Dataset Overview

Key Statistics

320

Carefully Designed Prompts

2,560

Human-Annotated Videos

8

State-of-the-art T2V Models

4

Content Categories

Content Categories

🐾 Animal: Animal behaviors and interactions, from simple locomotion to complex predatory behaviors

👤 Human: Human activities across various contexts, from daily routines to social interactions

📦 Object: Inanimate objects and their transformations, movements, or interactions

✨ Imaginary: Fantastical, supernatural, or stylized content beyond realistic constraints

Difficulty Levels

SSSA: Single Subject-Single Action

SSMA: Single Subject-Multi Action

MSSA: Multi Subject-Single Action

MSMA: Multi Subject-Multi Action

Temporal Orders

SS: Strictly Sequential - Actions follow predetermined logical sequence

FO: Flexible Order - Actions can occur in varying orders while maintaining coherence

SI: Simultaneous - Concurrent actions testing parallel process coordination

Dynamic Temporal Graph (DTG) Evaluation

Dynamic Temporal Graph evaluation framework
Overview of the Dynamic Temporal Graph (DTG) evaluation framework showing temporal decomposition, adaptive graph extraction, and dependency filtering processes.

Our evaluation framework assesses videos across two complementary dimensions:

📊 Visual Details Evaluation

Assesses visual quality, object fidelity, and scene composition using frame-level analysis

  • Frame-level scene graph extraction
  • Object presence and attribute correctness
  • Spatial relationship accuracy
  • Scene composition quality
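One way to read the bullets above: each sampled frame yields a scene graph, and the frame is scored by how many expected facts (objects, attributes, spatial relations) it contains. The sketch below is a hypothetical illustration of that idea, not the paper's actual scoring rule:

```python
def frame_score(expected, observed):
    """Fraction of expected scene-graph facts found in one frame's graph.

    `expected` and `observed` are dicts with keys "objects" (names),
    "attributes" (e.g. ("cat", "black")), and "relations"
    (e.g. ("cat", "on", "mat")).
    """
    checks = []
    for obj in expected["objects"]:          # object presence
        checks.append(obj in observed["objects"])
    for attr in expected["attributes"]:      # attribute correctness
        checks.append(attr in observed["attributes"])
    for rel in expected["relations"]:        # spatial relationship accuracy
        checks.append(rel in observed["relations"])
    return sum(checks) / len(checks) if checks else 1.0

def visual_details_score(expected, frame_graphs):
    """Average per-frame scores over the sampled frames."""
    return sum(frame_score(expected, g) for g in frame_graphs) / len(frame_graphs)
```

Averaging over frames keeps the score frame-level, so it deliberately ignores temporal ordering; that is handled by the narrative coherence dimension.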

🎬 Narrative Coherence Evaluation

Measures temporal consistency and sequential logic using Dynamic Temporal Graphs

  • Temporal decomposition
  • Dynamic graph extraction
  • Question-aware feature emphasis
  • Dependency filtering
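To make the temporal side concrete: once actions are localized to frame intervals, ordering constraints between dependent actions can be checked directly. A minimal sketch, assuming per-frame action detections and strictly-sequential constraints (hypothetical function names, not the paper's metric):

```python
def ordering_score(detections, constraints):
    """Fraction of ordering constraints a video satisfies.

    `detections` maps each action to the sorted frame indices where it is
    observed; `constraints` is a list of (a, b) pairs meaning action `a`
    must finish before action `b` starts (the strictly-sequential case).
    """
    if not constraints:
        return 1.0
    satisfied = 0
    for a, b in constraints:
        frames_a, frames_b = detections.get(a), detections.get(b)
        # The constraint holds only if both actions appear and `a`'s last
        # frame precedes `b`'s first frame.
        if frames_a and frames_b and max(frames_a) < min(frames_b):
            satisfied += 1
    return satisfied / len(constraints)
```

Dependency filtering, in this framing, would prune constraint pairs that the prompt's temporal order (FO or SI) does not actually impose.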

🔍 Key Innovation: Adaptive Graph Extraction

Traditional scene graph extraction uses static templates that may miss narrative-specific details. Our Dynamic Temporal Graph approach:

  • Dynamic Prompt Generation: Customizes graph extraction prompts based on specific evaluation questions
  • Question-aware Feature Emphasis: Prioritizes tracking of exact features needed for accurate evaluation
  • Multi-frame Analysis: Extracts scene graphs from 15 distributed frames using adapted prompts
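The two mechanical pieces of this pipeline, even frame sampling and question-conditioned prompt construction, can be sketched as follows (hypothetical names; the exact prompt wording used by SeqBench is not reproduced here):

```python
def sample_frame_indices(n_frames, k=15):
    """Pick k frame indices evenly distributed across the video."""
    if n_frames <= k:
        return list(range(n_frames))
    step = (n_frames - 1) / (k - 1)  # spread endpoints inclusively
    return [round(i * step) for i in range(k)]

def build_extraction_prompt(question, entities):
    """Customize the scene-graph extraction prompt for one evaluation question,
    emphasizing the features that question needs tracked."""
    return (
        "Extract a scene graph (objects, attributes, relations) from this frame. "
        f"Pay particular attention to: {', '.join(entities)}. "
        f"The graph will be used to answer: {question}"
    )
```

For a 120-frame clip this samples frames 0 through 119 at roughly 8.5-frame intervals, so the 15 extracted graphs cover the whole narrative rather than clustering at the start.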

Animal Category Video Gallery

Experimental Results

Evaluation of 8 state-of-the-art T2V models reveals significant gaps between visual quality and narrative coherence.

Visual Details Evaluation Results

Model SSMA_SS SSMA_FO SSMA_SI MSMA_SS MSMA_FO MSMA_SI SSSA MSSA Avg. Score
Kling 2.0 0.819 0.839 0.823 0.693 0.734 0.713 0.774 0.804 0.775
Cogvideo 1.5 0.806 0.817 0.814 0.700 0.718 0.711 0.780 0.789 0.767
Hailuo T2V-01 0.812 0.826 0.758 0.654 0.659 0.650 0.771 0.780 0.739
Pika 2.2 0.757 0.762 0.721 0.565 0.676 0.595 0.710 0.733 0.690
Sora 0.737 0.764 0.750 0.549 0.591 0.608 0.699 0.693 0.674
Luma Ray2 0.684 0.681 0.724 0.518 0.608 0.543 0.721 0.676 0.644
Veo 2.0 0.482 0.585 0.595 0.523 0.517 0.476 0.627 0.578 0.548
Runway Gen3 0.485 0.552 0.563 0.383 0.446 0.442 0.532 0.470 0.484
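The Avg. Score column is consistent with an unweighted mean of the eight category scores (an assumption; the page does not state the weighting). Checking the top row:

```python
# Kling 2.0's eight per-category visual-details scores, from the table above.
kling_visual = [0.819, 0.839, 0.823, 0.693, 0.734, 0.713, 0.774, 0.804]
avg = sum(kling_visual) / len(kling_visual)
print(round(avg, 3))  # 0.775, matching the reported Avg. Score
```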

Narrative Coherence Evaluation Results

Model SSMA_SS SSMA_FO SSMA_SI MSMA_SS MSMA_FO MSMA_SI SSSA MSSA Avg. Score
Kling 2.0 0.266 0.209 0.207 0.192 0.258 0.228 0.291 0.366 0.252
Hailuo T2V-01 0.212 0.230 0.192 0.179 0.207 0.199 0.345 0.288 0.231
Pika 2.2 0.219 0.202 0.194 0.171 0.238 0.229 0.293 0.277 0.228
Cogvideo 1.5 0.222 0.197 0.208 0.189 0.224 0.191 0.272 0.235 0.217
Veo 2.0 0.156 0.180 0.158 0.184 0.225 0.190 0.256 0.262 0.201
Luma Ray2 0.122 0.174 0.149 0.171 0.199 0.146 0.216 0.262 0.180
Sora 0.198 0.161 0.154 0.123 0.152 0.171 0.287 0.180 0.178
Runway Gen3 0.149 0.137 0.163 0.115 0.141 0.163 0.198 0.170 0.154

Human Evaluation Results

Detailed human evaluation across all models and categories with Overall, Action, and Consistency dimensions

For each model, scores are reported in three dimensions (Overall, Consistency, Action) across eight categories: MSMA-FO, MSMA-SI, MSMA-SS, MSSA, SSMA-FO, SSMA-SI, SSMA-SS, and SSSA.

BibTeX

@article{seqbench2025,
  author    = {Anonymous Authors},
  title     = {SeqBench: Benchmarking Sequential Narrative Generation in Text-to-Video Models},
  journal   = {TBD},
  year      = {2025},
}