Emergence of a Shared Canonical Object Frame from In-the-Wild Videos

Tom Fischer, Martin Sundermeyer, Adam Kortylewski, Eddy Ilg

University of Technology Nuremberg · Google · CISPA Helmholtz Center for Information Security

Overview

A shared canonical frame allows object poses to be compared consistently across different instances and categories. Existing approaches usually require canonical pose supervision from aligned CAD models, synthetic rendering pipelines, or manually annotated real data. This creates a scaling bottleneck.

We show that a shared canonical object frame can instead emerge from self-supervised training on object-centric videos captured in the wild. Our method uses only noisy Structure-from-Motion camera poses and routes all training sequences through a shared geometric bottleneck: a coarse canonical mesh that carries no category-specific detail.

At test time, the model takes a single RGB image and predicts dense correspondences to this shared mesh. A continuous 6D pose is then recovered from these 2D-3D correspondences with a Perspective-n-Point (PnP) solver, yielding a canonical pose in a frame shared across categories.
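To make the pose-recovery step concrete, the following is a minimal sketch of PnP from 2D-3D correspondences using the Direct Linear Transform. It is an illustration only: the page does not say which PnP solver is used, and a practical pipeline would more likely rely on a robust solver with outlier rejection. The function name `pnp_dlt`, the intrinsics `K`, and the synthetic points are assumptions made for this example.

```python
import numpy as np

def pnp_dlt(points_3d, pixels, K):
    """Estimate (R, t) such that pixel ~ K (R X + t).

    Needs at least 6 non-coplanar correspondences. Hypothetical helper,
    not the solver used by the authors.
    """
    n = len(points_3d)
    # Remove the intrinsics: normalised image coordinates (x, y).
    uv1 = np.hstack([pixels, np.ones((n, 1))])
    xy = np.linalg.solve(K, uv1.T).T[:, :2]
    Xh = np.hstack([points_3d, np.ones((n, 1))])

    # Each correspondence gives two linear equations in the 12 entries of [R|t].
    A = np.zeros((2 * n, 12))
    A[0::2, 0:4] = Xh
    A[0::2, 8:12] = -xy[:, [0]] * Xh
    A[1::2, 4:8] = Xh
    A[1::2, 8:12] = -xy[:, [1]] * Xh

    # The solution is the right singular vector with the smallest singular value.
    P = np.linalg.svd(A)[2][-1].reshape(3, 4)

    # P equals [R|t] up to an unknown scale and sign; project onto a rotation.
    U, S, Vt = np.linalg.svd(P[:, :3])
    R = U @ Vt
    sign = 1.0
    if np.linalg.det(R) < 0:
        R, sign = -R, -1.0
    t = sign * P[:, 3] / S.mean()
    return R, t

# Synthetic check: project known 3D points with a known pose, then recover it.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 3))          # stand-in for mesh points
c, s = np.cos(0.5), np.sin(0.5)
R_true = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t_true = np.array([0.1, -0.2, 4.0])
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])

uvw = (X @ R_true.T + t_true) @ K.T
pixels = uvw[:, :2] / uvw[:, 2:3]
R_est, t_est = pnp_dlt(X, pixels, K)
```

With noise-free correspondences the recovered pose matches the ground truth up to numerical precision. Predicted correspondences are noisy, which is why a robust variant would normally be preferred.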

Teaser

Teaser figure showing supervised, category-specific self-supervised, and our category-agnostic self-supervised setting.

We propose a self-supervised pose learning strategy that generalizes without any canonical pose supervision.

Figure: Three approaches to learning canonical object frames. Supervised methods rely on canonical pose labels or aligned CAD models. Category-specific self-supervised methods avoid canonical labels but train separate models per category. Our method learns a shared canonical frame across categories from in-the-wild videos without canonical pose labels or category conditioning.

Key Idea

Our method learns dense pixel-to-mesh correspondences through a shared canonical mesh.

Method overview.

The training pipeline has three main components:

Canonical Mesh: A coarse mesh, such as a cube, defines a fixed canonical coordinate frame. The mesh is deliberately category-agnostic and contains no instance-specific geometry.
Correspondence Prediction: A neural network predicts dense correspondences from image pixels to vertices on the canonical mesh.
Per-Sequence Alignment: Each SfM reconstruction has an arbitrary coordinate frame. During training, we estimate a per-sequence rotation that aligns the SfM frame to the canonical mesh frame, allowing pseudo-labels to supervise the correspondence network.
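The role of the per-sequence rotation can be sketched as follows. This is a simplified illustration under several assumptions: the mesh is reduced to the eight corners of a cube, SfM points are assumed to be already centred and scaled to fit the cube, and pseudo-labels are taken as the nearest vertex. The names `CUBE_VERTICES`, `canonical_camera_rotation`, and `vertex_pseudo_labels` are hypothetical and do not come from the authors' code.

```python
import numpy as np

# Eight corners of a cube centred at the origin (a stand-in for the coarse mesh).
CUBE_VERTICES = np.array(
    [[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)], dtype=float
)

def canonical_camera_rotation(R_cam_from_sfm, R_can_from_sfm):
    """Rotation that takes canonical-frame points into the camera frame.

    If x_cam = R_cam_from_sfm @ x_sfm and x_can = R_can_from_sfm @ x_sfm,
    then x_cam = (R_cam_from_sfm @ R_can_from_sfm.T) @ x_can.
    """
    return R_cam_from_sfm @ R_can_from_sfm.T

def vertex_pseudo_labels(points_sfm, R_can_from_sfm):
    """Index of the nearest cube vertex for each SfM point (assumed centred and scaled)."""
    points_can = points_sfm @ R_can_from_sfm.T
    dists = np.linalg.norm(points_can[:, None, :] - CUBE_VERTICES[None], axis=-1)
    return dists.argmin(axis=1)

# Example: a sequence whose SfM frame is rotated 90 degrees about z
# relative to the canonical frame.
R_seq = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
point_sfm = np.array([[0.9, 0.8, 0.7]])
labels = vertex_pseudo_labels(point_sfm, R_seq)

R_cam = np.eye(3)  # SfM camera rotation for one frame (identity for simplicity)
R_cam_from_can = canonical_camera_rotation(R_cam, R_seq)
```

The same per-sequence rotation is shared by every frame of a sequence, so a single rotation per video is enough to turn all of its SfM camera poses into canonical-frame supervision.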

The network and the alignments improve together during training. Over time, semantically corresponding object parts are encouraged to map to consistent regions of the shared mesh, causing a canonical frame to emerge without explicit canonical pose labels.

Qualitative Results

Predicted 6D poses are visualized as canonical axes overlaid on images from diverse categories and benchmarks. A single fixed rotation is applied uniformly to all predictions for visualization, without per-category or per-dataset adjustment.
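As a rough sketch of how such an overlay could be produced, the snippet below projects the origin and the three axis tips of the canonical frame into the image, after composing the predicted rotation with one fixed display rotation. The function `project_axes`, the parameter `R_fix`, the axis length, and the intrinsics are assumptions for illustration and are not taken from the authors' visualization code.

```python
import numpy as np

def project_axes(R_cam_from_can, t, K, R_fix=None, length=0.5):
    """Return pixel coordinates, shape (4, 2), of the origin and the x, y, z axis tips.

    R_fix is one constant rotation shared by all predictions; it only changes
    which canonical directions are drawn as x, y, and z.
    """
    if R_fix is None:
        R_fix = np.eye(3)
    pts = np.vstack([np.zeros(3), length * np.eye(3)])  # origin + three tips
    pts_cam = pts @ (R_cam_from_can @ R_fix).T + t
    uvw = pts_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
axes_px = project_axes(np.eye(3), np.array([0.0, 0.0, 2.0]), K)
```

Because `R_fix` is the same for every image, consistent axis overlays across categories reflect consistency of the learned frame rather than a per-category correction.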

Benchmark Results

We evaluate category-level rotation accuracy across five benchmarks using two metrics: the median geodesic rotation error and Acc@30°, the share of predictions whose rotation error is below 30°. Our method is trained once on real object-centric videos and evaluated without dataset-specific fine-tuning.
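The two metrics can be computed as in the sketch below. The geodesic error is the angle of the relative rotation between prediction and ground truth. Using a strict threshold (`< 30`) for Acc@30° is an assumption, since the page does not state whether the bound is inclusive, and the helper names are illustrative.

```python
import numpy as np

def geodesic_error_deg(R_pred, R_gt):
    """Angle in degrees of the relative rotation R_pred^T R_gt."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def rot_z(deg):
    """Rotation about the z axis by the given angle in degrees."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Toy predictions that are off by 10, 20, 45, and 90 degrees.
errors = np.array([geodesic_error_deg(rot_z(a), np.eye(3)) for a in (10, 20, 45, 90)])
median_error = np.median(errors)
acc30 = 100.0 * np.mean(errors < 30.0)
```

For these four toy errors the median is 32.5° and Acc@30° is 50%, which shows that the two numbers capture different aspects of the error distribution.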

Model	Can.	Train	REAL275 Med ↓	REAL275 Acc30 ↑	Omni6DPose Med ↓	Omni6DPose Acc30 ↑	Objectron Med ↓	Objectron Acc30 ↑	Pascal3D+ Med ↓	Pascal3D+ Acc30 ↑	ImageNet3D Med ↓	ImageNet3D Acc30 ↑	Avg. Med ↓	Avg. Acc30 ↑
QWEN3-VL	✓	R.+S.	38.7	37.6	70.6	18.8	49.0	31.1	61.1	27.0	66.5	24.0	57.2	27.7
OriAny.V1	✓	S.	28.4	52.2	62.8	36.7	18.4	60.4	18.1	71.0	29.7	50.3	31.5	54.1
OriAny.V1†	✓	R.+S.	26.7	54.1	54.5	31.5	15.1	67.7	15.7	78.4	25.7	55.6	27.5	57.5
OriAny.V2†	✓	R.+S.	21.3	57.0	47.7	39.2	19.8	66.2	17.5	76.5	28.1	54.1	26.9	58.6
Ours	✗	R.	21.8	70.0	49.2	35.2	15.3	69.2	15.7	79.8	25.5	55.0	25.5	61.8
Can.: trained with canonical pose supervision (✓) or without (✗). Train: R. = real data, S. = synthetic data. Med: median rotation error in degrees. Acc30: Acc@30° in percent.

† Methods marked with a dagger were trained on ImageNet3D, which overlaps with Pascal3D+. Gray metrics therefore indicate settings where those methods have an additional advantage.

Main Findings

A shared canonical frame can emerge from object-centric videos without canonical pose labels.
A shared geometric bottleneck encourages cross-sequence and cross-category consistency.
Category-agnostic training scales better than training separate self-supervised models per category.
The learned frame is strongest for objects with distinctive semantic axes.
Symmetric objects remain challenging because multiple orientations can explain the same visual evidence.

Ablations

Ablation results.

We study the impact of mesh shape, PCA initialization, learned per-sequence alignment, and the number of training views per sequence. The full model uses a cube mesh, PCA initialization, learned alignment, and four training views per sequence.

The ablations show that PCA and learned alignment are complementary. PCA provides a useful geometric initialization, while learned alignment resolves semantic axes such as front/back orientation. Multi-view coverage is also critical: reducing the number of views weakens alignment, and single-view training leads to substantial degradation.
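The sign ambiguity that PCA leaves open can be seen in a small sketch. The code below computes principal axes of a point cloud and turns them into a proper rotation. It is an assumption about what a PCA initialization could look like, not the authors' implementation, and `pca_axes` and the synthetic point cloud are hypothetical.

```python
import numpy as np

def pca_axes(points):
    """Rows are principal axes, largest variance first, forming a proper rotation."""
    centred = points - points.mean(axis=0)
    cov = centred.T @ centred / len(points)
    # eigh returns eigenvalues in ascending order, so reverse the columns.
    _, evecs = np.linalg.eigh(cov)
    axes = evecs[:, ::-1].T.copy()
    if np.linalg.det(axes) < 0:
        axes[2] *= -1.0  # flip one axis so the determinant is +1
    return axes

# Synthetic elongated point cloud, rotated by a known rotation about z.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(2000, 3)) * np.array([5.0, 2.0, 0.5])
c, s = np.cos(0.7), np.sin(0.7)
R0 = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
cloud = cloud @ R0.T

axes = pca_axes(cloud)
long_axis_true = R0 @ np.array([1.0, 0.0, 0.0])
alignment = abs(axes[0] @ long_axis_true)
```

PCA recovers the direction of the longest axis, but `axes[0]` and `-axes[0]` are equally valid principal directions. That remaining sign choice corresponds to ambiguities such as front versus back, which is the part the learned alignment is reported to resolve.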

Analysis

Symmetry and frame consistency analysis.

The learned frame is largely consistent across non-symmetric categories. Remaining errors concentrate on categories with rotational symmetry or ambiguous structure, where orientation is not uniquely identifiable from appearance and multi-view consistency alone.
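The effect of symmetry on measured error can be illustrated with a toy example. For an object with a two-fold rotational symmetry about its z axis, a prediction that is rotated by 180° looks identical in the image, yet the plain geodesic error is maximal. The symmetry-aware error below is shown only to illustrate the ambiguity and is not the evaluation protocol described on this page. All names are illustrative.

```python
import numpy as np

def geodesic_error_deg(R_pred, R_gt):
    """Angle in degrees of the relative rotation R_pred^T R_gt."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def symmetry_aware_error_deg(R_pred, R_gt, symmetries):
    """Smallest geodesic error over all object-frame symmetry rotations."""
    return min(geodesic_error_deg(R_pred, R_gt @ S) for S in symmetries)

def rot_z(deg):
    """Rotation about the z axis by the given angle in degrees."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

# Two-fold symmetry: the object looks the same after a 180 degree turn about z.
symmetries = [np.eye(3), rot_z(180.0)]
R_gt = np.eye(3)
R_pred = rot_z(180.0)

plain_error = geodesic_error_deg(R_pred, R_gt)
sym_error = symmetry_aware_error_deg(R_pred, R_gt, symmetries)
```

Here the plain error is 180° while the symmetry-aware error is essentially zero, which shows how errors on symmetric categories can reflect an unidentifiable orientation rather than a poorly learned frame.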

Citation

TODO