CausalVerse - NeurIPS 2025 Benchmark

TL;DR

CausalVerse is the first comprehensive benchmark for causal representation learning with controllable high-fidelity simulations. It allows users to inspect, modify, and configure causal graphs to match various CRL assumptions and tasks, and provides empirical insights to guide researchers in selecting or improving CRL frameworks for real-world causal reasoning.

Dataset Overview

~200k images
~140k videos
300M+ video frames

Scenes

Domains

Static Image Generation

Human in Retail Store (11.2k images)
Indoor environments
Varying poses & appearances
Multiple lighting conditions

Physical Simulation

Cylinder Spring (40k images)
Simple Collision (20k videos)
Projectile motion
Object interactions

Robotic Manipulation

Robot in Kitchen (2.7k videos)
Multi-view capture
Object-centric tasks
Embodied agents

Traffic Analysis

Traffic in Town01 (1.97k videos)
Multi-agent interactions
Urban environments
Variable conditions

Key Features: 3-129 latent variables per scene | 1024×1024 and 800×600 resolutions | 3-32 second video durations | Multi-camera viewpoints

Ground Truth Access

Complete access to causal variables, structures, and generation processes with high-fidelity visual data

Diverse Scenarios

From static to dynamic, single to multi-agent, covering physical simulations, robotics, and traffic

Configurable Settings

Flexible control over causal assumptions, domain labels, temporal dependencies, and interventions

Rigorous Evaluation

Test CRL methods under both satisfied and unmet assumptions with standardized metrics

Configuration Example

Each scene provides detailed ground-truth variables. Below is an example of the available metadata structure, which is consistent across the dataset.

Category	Sub-category	Variable	Dim.	Type	Range	Description
Global	Scene	`scene`	(1,)	D	6 types	Scene name/identifier
Global	Scene	`gravity`	(1,)	C	-	Acceleration of gravity
Global	Object	`render_asset`	(1,)	D	90 types	Specifies visual appearance
Dynamic	Object	`position`	(T,3)	C	-	3D coordinates across time
Dynamic	Object	`rotation`	(T,3)	C	-	Euler angles across time
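
The table above can be mirrored in code as a small typed record. The following is a minimal sketch, assuming each scene's metadata can be loaded into plain Python values; the class name `SceneRecord`, its methods, and all sample values are hypothetical and are not part of any released CausalVerse API.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class SceneRecord:
    """Hypothetical container mirroring the rows of the metadata table."""
    scene: str             # Global / Scene, discrete (6 types)
    gravity: float         # Global / Scene, continuous
    render_asset: str      # Global / Object, discrete (90 types)
    position: List[Vec3]   # Dynamic / Object, shape (T, 3)
    rotation: List[Vec3]   # Dynamic / Object, shape (T, 3), Euler angles

    def num_frames(self) -> int:
        """Return T, the number of time steps in the dynamic variables."""
        return len(self.position)

    def validate(self) -> None:
        """Check that the dynamic variables agree with the (T, 3) shapes."""
        if len(self.position) != len(self.rotation):
            raise ValueError("position and rotation must share the same T")
        for name, rows in (("position", self.position), ("rotation", self.rotation)):
            for row in rows:
                if len(row) != 3:
                    raise ValueError(f"{name} rows must have 3 components")


# Illustrative values only; real identifiers and ranges come from the dataset.
record = SceneRecord(
    scene="example_scene",
    gravity=9.81,
    render_asset="example_asset",
    position=[(0.0, 0.0, 1.0), (0.0, 0.0, 0.9)],
    rotation=[(0.0, 0.0, 0.0), (0.0, 0.1, 0.0)],
)
record.validate()
```

Keeping the discrete (D) and continuous (C) variables in separate typed fields makes it straightforward to pick which ground-truth variables a CRL method is evaluated against.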

Data Showcase

Sample videos from different domains in CausalVerse, showcasing the variety of scenes and viewpoints available.

Static Image Generation
Scenes: Scene1, Scene2, Scene3, Scene4

Physical Simulation (Image)
Scenes: Fall, Refraction, Slope, Spring

Physical Simulation (Video)
Scene: Projectile_Hard
Views: Birdview, Frontview, Leftview, Rightview

Robotic Manipulation
Scene: Kitchen
Views: Agentview, Birdview, Frontview, Eyeview, Sideview

Traffic Situation Analysis
Scenes: Town1, Town2

Evaluation

Results are reported as Mean Correlation Coefficient (MCC) and coefficient of determination (R²) on both image and video data. Higher is better for both metrics.

Scene	Method	MCC ↑	R² ↑
Ball on the Slope	Supervised	0.9878	0.9962
	Sufficient Change	0.4434	0.9630
	Mechanism Sparsity	0.2491	0.3242
	Self-supervised	0.4109	0.9658
	Contrastive Learning	0.2853	0.9604
Cylinder Spring	Supervised	0.9970	0.9910
	Sufficient Change	0.6092	0.9344
	Mechanism Sparsity	0.3353	0.2340
	Self-supervised	0.4523	0.7841
	Contrastive Learning	0.6342	0.9920
Light Refraction	Supervised	0.9900	0.9800
	Sufficient Change	0.6778	0.8420
	Mechanism Sparsity	0.1836	0.4067
	Self-supervised	0.3363	0.7841
	Contrastive Learning	0.3773	0.9677

Scene	Method	MCC ↑	R² ↑
Fall Simple	IDOL	0.2527	0.5901
	CaRiNG	0.2280	0.5457
	TDRL	0.2003	0.5525
	TCL	0.1717	0.4892
	iVAE	0.1881	0.5233
Robotics Study	IDOL	0.2500	0.6503
	CaRiNG	0.2225	0.6476
	TDRL	0.2440	0.6394
	TCL	0.2163	0.6150
	iVAE	0.1948	0.6165
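
As a reading aid for the tables above, the two metrics can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the benchmark's evaluation code: MCC is taken here as the mean absolute Pearson correlation after the best one-to-one matching between ground-truth and estimated latents (found by brute force, which is only practical for a handful of dimensions), and R² as the usual 1 - SS_res / SS_tot for predictions of a single ground-truth variable. The function names `pearson`, `mcc`, and `r2` and the toy data are hypothetical.

```python
from itertools import permutations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def mcc(true_latents, est_latents):
    """Mean absolute correlation under the best one-to-one matching.

    Each argument is a list of d variables, each a list of samples.
    The matching is brute-forced over all d! permutations (assumption:
    d is small; a linear assignment solver would be used for larger d).
    """
    d = len(true_latents)
    if len(est_latents) != d:
        raise ValueError("both inputs must contain the same number of variables")
    corr = [[abs(pearson(t, e)) for e in est_latents] for t in true_latents]
    best = max(
        sum(corr[i][perm[i]] for i in range(d))
        for perm in permutations(range(d))
    )
    return best / d


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined for a constant target")
    return 1.0 - ss_res / ss_tot


# Toy data: the estimate recovers both variables up to permutation, sign, and scale.
true_z = [[1.0, 2.0, 3.0, 4.0], [1.0, -1.0, 1.0, -1.0]]
est_z = [[-1.0, 1.0, -1.0, 1.0], [2.0, 4.0, 6.0, 8.0]]
toy_mcc = mcc(true_z, est_z)
```

Because MCC uses absolute correlations and a matching step, it is insensitive to the ordering, sign, and scale of the estimated latents, which is why a method can score well on R² yet poorly on MCC when its latents mix several ground-truth variables.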

Citation

If you find our work useful, please consider citing our paper:

@inproceedings{chen2025causalverse,
  title     = {CausalVerse: Benchmarking Causal Representation Learning with Configurable High-Fidelity Simulations},
  author    = {Chen, Guangyi and Deng, Yunlong and Zhu, Peiyuan and Li, Yan and Shen, Yifan and Li, Zijian and Zhang, Kun},
  booktitle = {Advances in Neural Information Processing Systems},
  year      = {2025}
}