CausalVerse is the first comprehensive benchmark for causal representation learning (CRL) built on controllable, high-fidelity simulations. Users can inspect, modify, and configure causal graphs to match different CRL assumptions and tasks, and the benchmark's empirical results help researchers select or improve CRL frameworks for real-world causal reasoning.
Dataset Overview
~200k images
~140k videos
300M+ video frames
24 scenes
4 domains
Static Image Generation
Human in Retail Store (11.2k images)
Indoor environments
Varying poses & appearances
Multiple lighting conditions
Physical Simulation
Cylinder Spring (40k images)
Simple Collision (20k videos)
Projectile motion
Object interactions
Robotic Manipulation
Robot in Kitchen (2.7k videos)
Multi-view capture
Object-centric tasks
Embodied agents
Traffic Analysis
Traffic in Town01 (1.97k videos)
Multi-agent interactions
Urban environments
Variable conditions
Key Features: 3-129 latent variables per scene | 1024×1024 and 800×600 resolutions | 3-32 second video durations | Multi-camera viewpoints
Ground Truth Access
Complete access to causal variables, structures, and generation processes with high-fidelity visual data
Diverse Scenarios
From static to dynamic, single to multi-agent, covering physical simulations, robotics, and traffic
Configurable Settings
Flexible control over causal assumptions, domain labels, temporal dependencies, and interventions
Rigorous Evaluation
Test CRL methods under both satisfied and unmet assumptions with standardized metrics
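As a rough illustration of what inspecting and intervening on a causal graph can look like, here is a minimal Python sketch. It is not the CausalVerse API: the `graph` dictionary, the `intervene` and `causal_order` helpers, and the edges between variables are all hypothetical, with variable names borrowed from the configuration example for readability.

```python
from graphlib import TopologicalSorter

# Hypothetical causal graph: each variable maps to the set of its direct causes.
# The edges are illustrative only and are not taken from the benchmark.
graph = {
    "gravity": set(),
    "render_asset": set(),
    "position": {"gravity"},
    "rotation": {"gravity", "position"},
}


def intervene(parents, target):
    """Return a copy of the graph with all incoming edges of `target` removed."""
    if target not in parents:
        raise KeyError(f"unknown variable: {target}")
    mutilated = {node: set(causes) for node, causes in parents.items()}
    mutilated[target] = set()
    return mutilated


def causal_order(parents):
    """Order variables so that every cause precedes its effects."""
    return list(TopologicalSorter(parents).static_order())


# Intervening on `position` cuts its dependence on `gravity`;
# the original graph is left untouched.
do_position = intervene(graph, "position")
print(causal_order(graph))
print(do_position["position"])
```

Storing parents rather than children keeps an intervention to a single assignment, and the same mapping can be passed directly to `TopologicalSorter`, which expects each node's predecessors.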
Configuration Example
Each scene provides detailed ground-truth variables. Below is an example of the available metadata structure, which is consistent across the dataset.
| Category | Sub-category | Variable | Dim. | Type | Range | Description |
|---|---|---|---|---|---|---|
| Global | Scene | scene | (1,) | D | 6 types | Scene name/identifier |
| Global | Scene | gravity | (1,) | C | - | Acceleration of gravity |
| Global | Object | render_asset | (1,) | D | 90 types | Specifies visual appearance |
| Dynamic | Object | position | (T,3) | C | - | 3D coordinates across time |
| Dynamic | Object | rotation | (T,3) | C | - | Euler angles across time |
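The metadata structure above could be mirrored in code roughly as follows. This is a sketch under assumptions rather than the dataset's actual loader: the `SceneMetadata` class is hypothetical, the Type codes are read as D = discrete and C = continuous, variables with dimension (1,) are stored as plain scalars, and the example values are made up.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class SceneMetadata:
    """Hypothetical container for the five variables in the example table."""

    scene: str             # Global / Scene, discrete
    gravity: float         # Global / Scene, continuous
    render_asset: str      # Global / Object, discrete
    position: List[Vec3]   # Dynamic / Object, shape (T, 3)
    rotation: List[Vec3]   # Dynamic / Object, shape (T, 3), Euler angles

    def __post_init__(self):
        # Both dynamic variables must cover the same T time steps.
        if len(self.position) != len(self.rotation):
            raise ValueError("position and rotation must have the same length T")
        # Every time step must carry exactly three components.
        for name in ("position", "rotation"):
            for step in getattr(self, name):
                if len(step) != 3:
                    raise ValueError(f"{name} entries must have 3 components")

    @property
    def num_steps(self) -> int:
        """Number of time steps T."""
        return len(self.position)


# Made-up values for illustration only.
meta = SceneMetadata(
    scene="Fall",
    gravity=9.81,
    render_asset="asset_00",
    position=[(0.0, 0.0, 1.0), (0.0, 0.0, 0.9)],
    rotation=[(0.0, 0.0, 0.0), (0.0, 0.1, 0.0)],
)
print(meta.num_steps)
```

Validating shapes at construction time catches mismatched trajectories early, before they reach a training or evaluation loop.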
Data Showcase
Sample videos from different domains in CausalVerse, showcasing the variety of scenes and viewpoints available.
Static Image Generation. Scenes: Scene1, Scene2, Scene3, Scene4
Physical Simulation (Image). Scenes: Fall, Refraction, Slope, Spring
Physical Simulation (Video). Scene: Projectile_Hard. Views: Birdview, Frontview, Leftview, Rightview
Robotic Manipulation. Scene: Kitchen. Views: Agentview, Birdview, Frontview, Eyeview, Sideview
Traffic Situation Analysis. Scenes: Town1, Town2
Evaluation
Results are reported as Mean Correlation Coefficient (MCC) and coefficient of determination (R²) on both image and video data. Higher is better for both metrics.
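| Scene | Method | MCC ↑ | R² ↑ | Links |
|---|---|---|---|---|
| Ball on the Slope | Supervised | 0.9878 | 0.9962 | - |
| Ball on the Slope | Sufficient Change | 0.4434 | 0.9630 | |
| Ball on the Slope | Mechanism Sparsity | 0.2491 | 0.3242 | |
| Ball on the Slope | Self-supervised | 0.4109 | 0.9658 | |
| Ball on the Slope | Contrastive Learning | 0.2853 | 0.9604 | |
| Cylinder Spring | Supervised | 0.9970 | 0.9910 | - |
| Cylinder Spring | Sufficient Change | 0.6092 | 0.9344 | |
| Cylinder Spring | Mechanism Sparsity | 0.3353 | 0.2340 | |
| Cylinder Spring | Self-supervised | 0.4523 | 0.7841 | |
| Cylinder Spring | Contrastive Learning | 0.6342 | 0.9920 | |
| Light Refraction | Supervised | 0.9900 | 0.9800 | - |
| Light Refraction | Sufficient Change | 0.6778 | 0.8420 | |
| Light Refraction | Mechanism Sparsity | 0.1836 | 0.4067 | |
| Light Refraction | Self-supervised | 0.3363 | 0.7841 | |
| Light Refraction | Contrastive Learning | 0.3773 | 0.9677 | |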
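
| Scene | Method | MCC ↑ | R² ↑ | Links |
|---|---|---|---|---|
| Fall Simple | IDOL | 0.2527 | 0.5901 | |
| Fall Simple | CaRiNG | 0.2280 | 0.5457 | |
| Fall Simple | TDRL | 0.2003 | 0.5525 | |
| Fall Simple | TCL | 0.1717 | 0.4892 | |
| Fall Simple | iVAE | 0.1881 | 0.5233 | |
| Robotics Study | IDOL | 0.2500 | 0.6503 | |
| Robotics Study | CaRiNG | 0.2225 | 0.6476 | |
| Robotics Study | TDRL | 0.2440 | 0.6394 | |
| Robotics Study | TCL | 0.2163 | 0.6150 | |
| Robotics Study | iVAE | 0.1948 | 0.6165 | |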
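For readers unfamiliar with the two metrics, the following Python sketch shows one common way to compute them. It is an illustration under assumptions, not the benchmark's evaluation code: the exact protocol used for the numbers above is not specified here, the function names are hypothetical, and the matching in `mcc` is done by brute force over permutations. Practical implementations usually solve the matching with the Hungarian algorithm instead, for example via SciPy's `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mcc(true_latents, est_latents):
    """Mean absolute correlation under the best one-to-one matching.

    Both arguments are lists of columns, one sequence per latent variable.
    Brute force over permutations, so only suitable for a handful of latents.
    """
    d = len(true_latents)
    corr = [[abs(pearson(t, e)) for e in est_latents] for t in true_latents]
    best = max(
        sum(corr[i][perm[i]] for i in range(d))
        for perm in permutations(range(d))
    )
    return best / d


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data: the estimates recover the true latents up to order, sign, and scale,
# which MCC is designed to ignore.
true_z = [[1, 2, 3, 4], [1, 0, 1, 0]]
est_z = [[2, 0, 2, 0], [-1, -2, -3, -4]]
print(mcc(true_z, est_z))
print(r2([1, 2, 3], [1, 2, 3]))
```

Taking absolute correlations and searching over matchings reflects that latent variables are typically identifiable only up to permutation and sign, so a method should not be penalized for those ambiguities.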
Citation
If you find our work useful, please consider citing our paper:
@inproceedings{chen2025causalverse,
title = {CausalVerse: Benchmarking Causal Representation Learning with Configurable High-Fidelity Simulations},
author = {Chen, Guangyi and Deng, Yunlong and Zhu, Peiyuan and Li, Yan and Shen, Yifan and Li, Zijian and Zhang, Kun},
booktitle = {Advances in Neural Information Processing Systems},
year = {2025}
}