SeqBench: Benchmarking Sequential Narrative Generation in Text-to-Video Models

Anonymous Authors
Anonymous Institute
Figure: Examples of sequential narrative generation failures in current T2V models. Each failure type illustrates a specific mode that SeqBench is designed to detect and evaluate.

Abstract

Text-to-video (T2V) generation models have made significant progress in creating visually appealing videos. However, they struggle with generating coherent sequential narratives that require logical progression through multiple events. Existing T2V benchmarks primarily focus on visual quality metrics but fail to evaluate narrative coherence over extended sequences.

To bridge this gap, we present SeqBench, a comprehensive benchmark for evaluating sequential narrative coherence in T2V generation. SeqBench includes a carefully designed dataset of 320 prompts spanning various narrative complexities, with 2,560 human-annotated videos generated from 8 state-of-the-art T2V models. Additionally, we design a Dynamic Temporal Graph (DTG)-based automatic evaluation metric that captures long-range dependencies and temporal ordering while remaining computationally efficient. Our DTG-based metric correlates strongly with human annotations.

Through systematic evaluation using SeqBench, we reveal critical limitations in current T2V models: failure to maintain consistent object states across multi-action sequences, physically implausible results in multi-object scenarios, and difficulties in preserving realistic timing and ordering relationships between sequential actions. SeqBench provides the first systematic framework for evaluating narrative coherence in T2V generation and offers concrete insights for improving sequential reasoning capabilities in future models.

Dataset Overview

Key Statistics

320

Carefully Designed Prompts

2,560

Human-Annotated Videos

8

State-of-the-art T2V Models

4

Content Categories

Content Categories

🐾 Animal: Animal behaviors and interactions, from simple locomotion to complex predatory behaviors

👤 Human: Human activities across various contexts, from daily routines to social interactions

📦 Object: Inanimate objects and their transformations, movements, or interactions

✨ Imaginary: Fantastical, supernatural, or stylized content beyond realistic constraints

Difficulty Levels

SSSA: Single Subject-Single Action

SSMA: Single Subject-Multi Action

MSSA: Multi Subject-Single Action

MSMA: Multi Subject-Multi Action

Temporal Orders

SS: Strictly Sequential - Actions follow predetermined logical sequence

FO: Flexible Order - Actions can occur in varying orders while maintaining coherence

SI: Simultaneous - Concurrent actions testing parallel process coordination

Dynamic Temporal Graph (DTG) Evaluation

Dynamic Temporal Graph evaluation framework
Overview of the Dynamic Temporal Graph (DTG) evaluation framework showing temporal decomposition, adaptive graph extraction, and dependency filtering processes.

Our evaluation framework assesses videos across two complementary dimensions:

📊 Visual Details Evaluation

Assesses visual quality, object fidelity, and scene composition using frame-level analysis

  • Frame-level scene graph extraction
  • Object presence and attribute correctness
  • Spatial relationship accuracy
  • Scene composition quality
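One way to read the bullets above: each sampled frame yields a scene graph, and the frame is scored by how many expected facts (objects, attributes, spatial relations) it contains. The sketch below is a hypothetical illustration of that idea, not the paper's actual scoring rule:

```python
def frame_score(expected, observed):
    """Fraction of expected scene-graph facts found in one frame's graph.

    `expected` and `observed` are dicts with keys "objects" (names),
    "attributes" (e.g. ("cat", "black")), and "relations"
    (e.g. ("cat", "on", "mat")).
    """
    checks = []
    for obj in expected["objects"]:          # object presence
        checks.append(obj in observed["objects"])
    for attr in expected["attributes"]:      # attribute correctness
        checks.append(attr in observed["attributes"])
    for rel in expected["relations"]:        # spatial relationship accuracy
        checks.append(rel in observed["relations"])
    return sum(checks) / len(checks) if checks else 1.0

def visual_details_score(expected, frame_graphs):
    """Average per-frame scores over the sampled frames."""
    return sum(frame_score(expected, g) for g in frame_graphs) / len(frame_graphs)
```

Averaging over frames keeps the score frame-level, so it deliberately ignores temporal ordering; that is handled by the narrative coherence dimension.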

🎬 Narrative Coherence Evaluation

Measures temporal consistency and sequential logic using Dynamic Temporal Graphs

  • Temporal decomposition
  • Dynamic graph extraction
  • Question-aware feature emphasis
  • Dependency filtering
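To make the temporal side concrete: once actions are localized to frame intervals, ordering constraints between dependent actions can be checked directly. A minimal sketch, assuming per-frame action detections and strictly-sequential constraints (hypothetical function names, not the paper's metric):

```python
def ordering_score(detections, constraints):
    """Fraction of ordering constraints a video satisfies.

    `detections` maps each action to the sorted frame indices where it is
    observed; `constraints` is a list of (a, b) pairs meaning action `a`
    must finish before action `b` starts (the strictly-sequential case).
    """
    if not constraints:
        return 1.0
    satisfied = 0
    for a, b in constraints:
        frames_a, frames_b = detections.get(a), detections.get(b)
        # The constraint holds only if both actions appear and `a`'s last
        # frame precedes `b`'s first frame.
        if frames_a and frames_b and max(frames_a) < min(frames_b):
            satisfied += 1
    return satisfied / len(constraints)
```

Dependency filtering, in this framing, would prune constraint pairs that the prompt's temporal order (FO or SI) does not actually impose.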

🔍 Key Innovation: Adaptive Graph Extraction

Traditional scene graph extraction uses static templates that may miss narrative-specific details. Our Dynamic Temporal Graph approach:

  • Dynamic Prompt Generation: Customizes graph extraction prompts based on specific evaluation questions
  • Question-aware Feature Emphasis: Prioritizes tracking of exact features needed for accurate evaluation
  • Multi-frame Analysis: Extracts scene graphs from 15 distributed frames using adapted prompts
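The two mechanical pieces of this pipeline, even frame sampling and question-conditioned prompt construction, can be sketched as follows (hypothetical names; the exact prompt wording used by SeqBench is not reproduced here):

```python
def sample_frame_indices(n_frames, k=15):
    """Pick k frame indices evenly distributed across the video."""
    if n_frames <= k:
        return list(range(n_frames))
    step = (n_frames - 1) / (k - 1)  # spread endpoints inclusively
    return [round(i * step) for i in range(k)]

def build_extraction_prompt(question, entities):
    """Customize the scene-graph extraction prompt for one evaluation question,
    emphasizing the features that question needs tracked."""
    return (
        "Extract a scene graph (objects, attributes, relations) from this frame. "
        f"Pay particular attention to: {', '.join(entities)}. "
        f"The graph will be used to answer: {question}"
    )
```

For a 120-frame clip this samples frames 0 through 119 at roughly 8.5-frame intervals, so the 15 extracted graphs cover the whole narrative rather than clustering at the start.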

Animal Category Video Gallery

Experimental Results

Evaluation of 8 state-of-the-art T2V models reveals significant gaps between visual quality and narrative coherence.

Visual Details Evaluation Results

Model SSMA_SS SSMA_FO SSMA_SI MSMA_SS MSMA_FO MSMA_SI SSSA MSSA Avg. Score
Kling 2.0 0.819 0.839 0.823 0.693 0.734 0.713 0.774 0.804 0.775
Cogvideo 1.5 0.806 0.817 0.814 0.700 0.718 0.711 0.780 0.789 0.767
Hailuo T2V-01 0.812 0.826 0.758 0.654 0.659 0.650 0.771 0.780 0.739
Pika 2.2 0.757 0.762 0.721 0.565 0.676 0.595 0.710 0.733 0.690
Sora 0.737 0.764 0.750 0.549 0.591 0.608 0.699 0.693 0.674
Luma Ray2 0.684 0.681 0.724 0.518 0.608 0.543 0.721 0.676 0.644
Veo 2.0 0.482 0.585 0.595 0.523 0.517 0.476 0.627 0.578 0.548
Runway Gen3 0.485 0.552 0.563 0.383 0.446 0.442 0.532 0.470 0.484
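The Avg. Score column is consistent with an unweighted mean of the eight category scores (an assumption; the page does not state the weighting). Checking the top row:

```python
# Kling 2.0's eight per-category visual-details scores, from the table above.
kling_visual = [0.819, 0.839, 0.823, 0.693, 0.734, 0.713, 0.774, 0.804]
avg = sum(kling_visual) / len(kling_visual)
print(round(avg, 3))  # 0.775, matching the reported Avg. Score
```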

Narrative Coherence Evaluation Results

Model SSMA_SS SSMA_FO SSMA_SI MSMA_SS MSMA_FO MSMA_SI SSSA MSSA Avg. Score
Kling 2.0 0.266 0.209 0.207 0.192 0.258 0.228 0.291 0.366 0.252
Hailuo T2V-01 0.212 0.230 0.192 0.179 0.207 0.199 0.345 0.288 0.231
Pika 2.2 0.219 0.202 0.194 0.171 0.238 0.229 0.293 0.277 0.228
Cogvideo 1.5 0.222 0.197 0.208 0.189 0.224 0.191 0.272 0.235 0.217
Veo 2.0 0.156 0.180 0.158 0.184 0.225 0.190 0.256 0.262 0.201
Luma Ray2 0.122 0.174 0.149 0.171 0.199 0.146 0.216 0.262 0.180
Sora 0.198 0.161 0.154 0.123 0.152 0.171 0.287 0.180 0.178
Runway Gen3 0.149 0.137 0.163 0.115 0.141 0.163 0.198 0.170 0.154

Human Evaluation Results

Detailed human evaluation across all models and categories with Overall, Action, and Consistency dimensions

For each model, scores are reported in three dimensions (Overall, Consistency, Action) across eight categories: MSMA-FO, MSMA-SI, MSMA-SS, MSSA, SSMA-FO, SSMA-SI, SSMA-SS, and SSSA.

BibTeX

@article{seqbench2025,
  author    = {Anonymous Authors},
  title     = {SeqBench: Benchmarking Sequential Narrative Generation in Text-to-Video Models},
  journal   = {TBD},
  year      = {2025},
}