Sample-Efficient Post-Training for LEGO Spatial-Physics Reasoning

Yuhuan Yuan1,†, Zhouliang Yu2,†, Minghao Liu3, Weiyang Liu2, Ge Lin Kan1

1HKUST(GZ)    2CUHK    3ZODA

†Equal contribution

We study how high-value data selection, physics-aware reinforcement learning, and test-time regeneration improve language-model generation of LEGO brick assemblies whose geometry matches the text prompt.

PVPO capability turntable animation across eight LEGO generation tasks.

Abstract

Large language models can generate executable LEGO assembly programs from text, but their performance is strongly affected by noisy supervision and shortcut-prone training data. We identify a PhysHack phenomenon: models may achieve high physical validity while failing to preserve object-level semantics. To address this, we propose a sample-efficient post-training framework that combines value-guided data selection with Physics–Voxel Policy Optimization (PVPO). We first select only 5% of the demonstrations as high-value data, using VLM semantic scores and diversity, then optimize models with a coupled reward combining physical validity and voxel-space geometric alignment. Experiments show that our method improves semantic alignment, geometry fidelity, and stability while using substantially less data than full-scale training.

Problem: PhysHack

LEGO Brick Assembly asks models to generate building steps that are physically stable and that also match the prompt. PhysHack occurs when the model satisfies the assembly rules but ignores the intended shape or meaning.

A model can be physically valid without being semantically correct.

Focusing only on physical validity can lead to generic or misaligned results. This work combines careful data selection with a reward that considers both physics and geometry, so that a model is rewarded for matching the target shape as well as for stability.

Methods

Value-Guided Data Selection

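This selection strategy is data-efficient: High-Value VLM + Diversity uses only 5% of the data but improves semantic alignment over noisy full-data supervision, raising the Qwen-VL score from 0.59 to 0.72 on Qwen2.5-3B-Instruct and from 0.67 to 0.74 on Llama-3.2-1B-Instruct, at some cost in physical validity (0.93 to 0.86 and 0.96 to 0.89, respectively). Table 1 is shown in full below.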

Table 1: Comparisons of Data Selection. Structure/semantic alignment, physics validity, voxel alignment, and generated-brick statistics for different training strategies.
Qwen2.5-3B-Instruct
Setting                     Qwen-VL ↑  CLIP ↑  DINOv3 ↑  Physics ↑  Voxel ↑  Bricks
Full dataset                0.59       0.26    0.67      0.93       0.32     196
Diversity-only              0.58       0.26    0.66      0.95       0.32     163
Random subset               0.56       0.25    0.64      0.91       0.28     176
Low-value VLM               0.51       0.22    0.64      0.85       0.25     144
Shortest responses          0.50       0.24    0.65      0.91       0.22     33
Lowest perplexity           0.45       0.25    0.56      0.88       0.22     136
Longest responses           0.44       0.23    0.57      0.86       0.26     346
High-Value VLM              0.70       0.27    0.72      0.86       0.30     162
High-Value VLM + Diversity  0.72       0.26    0.70      0.86       0.31     184
PVPO                        0.77       0.28    0.80      0.93       0.35     146

Llama-3.2-1B-Instruct
Setting                     Qwen-VL ↑  CLIP ↑  DINOv3 ↑  Physics ↑  Voxel ↑  Bricks
Full dataset                0.67       0.27    0.74      0.96       0.35     177
Diversity-only              0.55       0.25    0.66      0.94       0.31     199
Random subset               0.58       0.25    0.67      0.95       0.32     194
Low-value VLM               0.28       0.22    0.57      0.87       0.29     334
Shortest responses          0.49       0.24    0.62      0.89       0.22     38
Lowest perplexity           0.64       0.27    0.74      0.97       0.30     140
Longest responses           0.48       0.25    0.62      0.91       0.26     351
High-Value VLM              0.70       0.27    0.72      0.80       0.26     205
High-Value VLM + Diversity  0.74       0.27    0.76      0.89       0.32     181
PVPO                        0.67       0.27    0.74      0.97       0.35     179
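
The page does not spell out how VLM scores and diversity are combined, so the following is only a minimal sketch of one plausible reading of the High-Value VLM + Diversity setting: a greedy pick that trades a sample's VLM score against its similarity to samples already chosen. The keys `vlm_score` and `embedding`, the cosine-similarity redundancy term, and the `diversity_weight` parameter are all assumptions made for illustration, not the authors' procedure.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_high_value(samples, fraction=0.05, diversity_weight=0.5):
    """Greedily pick a subset that is high-value and diverse.

    `samples` is a list of dicts with the hypothetical keys "vlm_score"
    (float) and "embedding" (list of floats). Returns the chosen indices
    in selection order.
    """
    budget = max(1, round(len(samples) * fraction))
    selected = []
    remaining = set(range(len(samples)))

    while remaining and len(selected) < budget:

        def gain(i):
            # Redundancy: highest similarity to anything already selected.
            redundancy = max(
                (cosine(samples[i]["embedding"], samples[j]["embedding"])
                 for j in selected),
                default=0.0,
            )
            return samples[i]["vlm_score"] - diversity_weight * redundancy

        best = max(sorted(remaining), key=gain)
        selected.append(best)
        remaining.remove(best)

    return selected


# Toy pool: samples 0 and 1 are near-duplicates, sample 2 is different.
pool = [
    {"vlm_score": 0.90, "embedding": [1.0, 0.0]},
    {"vlm_score": 0.85, "embedding": [1.0, 0.0]},
    {"vlm_score": 0.60, "embedding": [0.0, 1.0]},
    {"vlm_score": 0.10, "embedding": [0.0, 1.0]},
]
chosen = select_high_value(pool, fraction=0.5)
```

In this toy pool the second pick skips the near-duplicate of the first sample even though its VLM score is higher, which is the behavior a diversity term is meant to produce. Setting `diversity_weight=0` reduces the sketch to a plain top-scoring selection, analogous to the High-Value VLM row.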

PVPO: Physics–Voxel Policy Optimization

PVPO jointly optimizes physical validity and voxel-space geometric alignment. The physical reward measures the fraction of valid bricks, while the voxel reward measures shape agreement with the target construction using symmetric voxel-space distance.

$$R_{\mathrm{PVPO}}(o, o^{*}) = (1 - \lambda)\, R_{\mathrm{phys}}(o) + \lambda\, R_{\mathrm{vox}}(o, o^{*})$$

In the main experiments, λ = 0.5 provides the best balance between physical feasibility and structural fidelity. The ablation below visualizes how the voxel weight changes semantic alignment, physical validity, voxel alignment, and generated-brick count.
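
As a rough illustration of the coupled reward, the sketch below computes the physical term as the fraction of valid bricks and mixes it with a voxel term using the weight λ. The exact symmetric voxel-space distance is not given on this page, so voxel intersection-over-union is used here as a stand-in, and the per-brick validity flags and `(x, y, z)` voxel tuples are assumed input formats.

```python
def physical_reward(valid_flags):
    """Fraction of generated bricks that pass a physical-validity check."""
    return sum(valid_flags) / len(valid_flags) if valid_flags else 0.0


def voxel_reward(pred_voxels, target_voxels):
    """Stand-in shape agreement: IoU of occupied (x, y, z) cells.

    IoU is symmetric in its two arguments. It is an assumption here,
    not necessarily the distance used in the paper.
    """
    pred, target = set(pred_voxels), set(target_voxels)
    union = pred | target
    # Two empty shapes get 0.0 so an empty output is never rewarded.
    return len(pred & target) / len(union) if union else 0.0


def pvpo_reward(valid_flags, pred_voxels, target_voxels, lam=0.5):
    """(1 - lam) * R_phys + lam * R_vox, following the formula above."""
    r_phys = physical_reward(valid_flags)
    r_vox = voxel_reward(pred_voxels, target_voxels)
    return (1 - lam) * r_phys + lam * r_vox


target = [(1, 0, 0), (2, 0, 0), (3, 0, 0)]

# Three of four bricks valid, half of the voxel union shared with the target.
partial = pvpo_reward([True, True, True, False],
                      [(0, 0, 0), (1, 0, 0), (2, 0, 0)], target)

# PhysHack-style output: every brick valid, but the shape misses the target.
hacked = pvpo_reward([True, True], [(9, 9, 9), (9, 9, 8)], target)
hacked_physics_only = pvpo_reward([True, True], [(9, 9, 9), (9, 9, 8)],
                                  target, lam=0.0)
```

With `lam=0.0` the fully valid but misshapen output earns the maximum reward, which is the PhysHack failure mode described earlier. With `lam=0.5` the same output is capped at half the maximum, so physical validity alone can no longer saturate the reward.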

Voxel-weight ablation and test-time stability analysis from the paper.

Results

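The main result is sample-efficient improvement: training on only 5% high-value data followed by PVPO surpasses full-data training on semantic alignment and geometry while preserving physical validity. The headline numbers below match the Qwen2.5-3B-Instruct columns of Table 1, where physical validity stays at 0.93.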

Qwen-VL@4: 0.59 → 0.77 (Full Dataset → PVPO)
DINOv3@4: 0.67 → 0.80 (Full Dataset → PVPO)
Data Usage: 100% → 5% (Compact high-value supervision)
Voxel@4: 0.32 → 0.35 (Improved geometry alignment)

Test-Time Scaling on Physics–Structure Alignment

Increasing the test-time sample budget improves physics–structure alignment by giving the model more opportunities to select generations with stronger visual-semantic and geometric scores under best@k selection.

Test-time scaling on physics–structure alignment using VLM and vision-based scores.
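
Best@k selection can be sketched in a few lines: draw k generations, score each, and keep the top one. The candidate dictionaries, the `weighted_score` function, and its numbers are hypothetical placeholders; any scorer (validity, voxel, or a weighted mix) could be passed in.

```python
def best_at_k(candidates, score_fn, k):
    """Return the highest-scoring candidate among the first k samples."""
    if k < 1:
        raise ValueError("k must be at least 1")
    pool = candidates[:k]
    if not pool:
        raise ValueError("no candidates to select from")
    return max(pool, key=score_fn)


def weighted_score(candidate, lam=0.5):
    """Hypothetical selector mixing physics and voxel scores."""
    return (1 - lam) * candidate["physics"] + lam * candidate["voxel"]


# Made-up scores for four sampled generations of the same prompt.
samples = [
    {"id": 0, "physics": 1.00, "voxel": 0.20},
    {"id": 1, "physics": 0.90, "voxel": 0.40},
    {"id": 2, "physics": 0.80, "voxel": 0.30},
    {"id": 3, "physics": 0.95, "voxel": 0.55},
]

picked = [best_at_k(samples, weighted_score, k)["id"] for k in (1, 2, 4)]
picked_scores = [weighted_score(samples[i]) for i in picked]
```

Because the pool for a larger k contains the pool for a smaller k, the selector's score of the chosen generation can never decrease as k grows. Whether the downstream semantic metrics also improve depends on how well the selector tracks them.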

Calibration

Confidence calibration compares different best@k selection mechanisms against semantic alignment metrics, diagnosing whether confidence tracks structural and visual quality.

Calibration heatmap under validity, voxel, and weighted best@k selection.
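
The statistic behind the heatmap is not stated on this page. One possible way to check whether a selector's confidence tracks a semantic metric is a rank correlation per selector and metric pair, sketched below with Spearman's coefficient. The selector names mirror the caption, but the scores are invented for illustration and the choice of Spearman correlation is an assumption.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation; 0.0 if either input is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5 if var_x and var_y else 0.0


# Invented per-generation scores: selector confidence vs. a semantic metric.
semantic = [0.2, 0.5, 0.7, 0.9]
selector_scores = {
    "validity": [1.0, 1.0, 1.0, 1.0],  # saturated, so it cannot rank samples
    "voxel": [0.1, 0.3, 0.4, 0.6],
}
grid = {name: spearman(scores, semantic)
        for name, scores in selector_scores.items()}
```

In this toy data the saturated validity selector carries no ranking information about semantic quality, while the voxel selector orders generations the same way the semantic metric does.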

BibTeX

@misc{yuan2026sampleefficientposttraininglegospatialphysics,
      title={Sample-Efficient Post-Training for LEGO Spatial-Physics Reasoning}, 
      author={Yuhuan Yuan and Zhouliang Yu and Minghao Liu and Weiyang Liu and Ge Lin Kan},
      year={2026},
      eprint={2606.07602},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2606.07602}, 
}

Paper

Read the paper on arXiv or view the discussion page on Hugging Face Papers.

Code

Code link will be added when available.