NaLA: A 3D Native LLM Layout Agent for High-quality 3D Scene Generation

Cheng Wan¹,², Yongsen Mao¹, Wenzheng Wu³, Yuxuan Xie⁴, Chucheng Xiang³,

Runze Wang³, Xiang Zhang⁴, Zhongyuan Liu⁴, Rushi Dai⁵, Yuan Liu¹

¹ Hong Kong University of Science and Technology,

² Shenzhen Loop Area Institute,

³ University of Science and Technology of China,

⁴ Tencent IEG,

⁵ The Hong Kong University of Science and Technology (Guangzhou)

ECCV 2026


NaLA directly perceives point clouds of scenes and assets, and generates end-to-end placements with a coarse-to-fine 3D pose decoding strategy.

Introduction

Large Language Models have recently become central planners for 3D scene generation, but existing layout agents still rely on general-purpose backbones that are limited in input modalities, output representations, and spatial reasoning. They typically consume only text or image descriptions, which do not capture detailed 3D geometry, and they often predict poses in a token-by-token textual format that is inefficient and imprecise.

NaLA addresses these issues by introducing geometry perception for both scenes and assets, together with an efficient coarse-to-fine asset pose generation framework. The model encodes point clouds from the scene and asset library, injects them into the LLM backbone, and autoregressively generates placements in a compact format. Trained with a two-stage strategy on high-quality 3D layout datasets, NaLA learns strong 3D reasoning and layout planning abilities.

Given a design prompt and a set of 3D assets, NaLA arranges both large furniture and small objects into plausible layouts by reasoning over point-cloud geometry and predicting poses via a coarse-to-fine mechanism.

Pipeline

NaLA follows an end-to-end pipeline. First, point clouds of the scene and each asset are encoded into tokens. These 3D input tokens are combined with text tokens and fed into the LLM. The model then uses specially designed output anchoring tokens to predict a coarse grid location and orientation, followed by an output residual token that predicts fine-grained pose residuals, scale, and rotation refinement. Special ID tokens ensure that each predicted pose is matched to the correct asset.

Overview of the NaLA pipeline.
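The coarse-to-fine decoding step can be illustrated with a minimal Python sketch. It is not the NaLA implementation. The 2D ground-plane grid, the grid resolution, the number of yaw bins, the residual parameterization, and all names (`CoarsePose`, `Residual`, `decode_pose`, the asset IDs) are assumptions made for illustration only, since the page does not specify them. The sketch shows how a coarse cell and orientation bin might be combined with a continuous residual to recover a final pose, and how an ID key could tie each pose to its asset.

```python
import math
from dataclasses import dataclass


@dataclass
class CoarsePose:
    """Hypothetical output of the anchoring tokens: a grid cell and a yaw bin."""
    cell_x: int
    cell_y: int
    yaw_bin: int


@dataclass
class Residual:
    """Hypothetical output of the residual token."""
    dx: float    # offset from the cell centre, in cell units (assumed range [-0.5, 0.5])
    dy: float
    dyaw: float  # rotation refinement in radians
    scale: float


def decode_pose(coarse, residual, room_size=(4.0, 4.0), grid=(8, 8), yaw_bins=8):
    """Combine a coarse grid prediction with a fine residual into a metric pose.

    room_size, grid, and yaw_bins are illustrative defaults, not values from the paper.
    """
    cell_w = room_size[0] / grid[0]
    cell_h = room_size[1] / grid[1]
    # Coarse stage selects the cell centre; the residual shifts the pose within the cell.
    x = (coarse.cell_x + 0.5 + residual.dx) * cell_w
    y = (coarse.cell_y + 0.5 + residual.dy) * cell_h
    # Coarse stage selects an orientation bin; the residual refines the angle.
    yaw = (coarse.yaw_bin * 2.0 * math.pi / yaw_bins + residual.dyaw) % (2.0 * math.pi)
    return {"x": x, "y": y, "yaw": yaw, "scale": residual.scale}


# Predictions keyed by a (hypothetical) asset ID, so each pose maps to the right asset.
predictions = {
    "asset_0": (CoarsePose(2, 5, 2), Residual(0.25, -0.25, 0.0, 1.0)),
    "asset_1": (CoarsePose(7, 0, 0), Residual(0.0, 0.0, 0.1, 0.8)),
}
placements = {aid: decode_pose(c, r) for aid, (c, r) in predictions.items()}
```

Under these assumptions, the coarse stage only has to choose among a small discrete set of cells and orientation bins, while the residual carries the remaining precision. This is one way to avoid spelling out each coordinate digit by digit in text.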

Results

We compare NaLA against representative baselines on physical plausibility, semantic plausibility, and visual aesthetics. NaLA achieves the best overall performance, balancing physical precision with semantic richness.

Quantitative comparison against baseline methods. Bold indicates the best result and underlined indicates the second best.

Qualitative comparison of layout generation. Each row shows top-down views for one scene type (bedroom, conference room, storage); each column shows results from a different method under identical room and asset conditions.