Incoherent Deformation, Not Capacity: Overfitting in Dynamic Gaussian Splatting

TL;DR

Dynamic 3D Gaussian Splatting shows a 6.18 dB average train–test PSNR gap on D-NeRF. A systematic ablation identifies the split operation as the bottleneck of the overfitting cascade: disabling it shrinks the gap to 1.15 dB but also collapses test PSNR by 9.93 dB, so it is not a viable mitigation. Across 9 ablation conditions, the gap is monotone in Gaussian count (Spearman ρ = 1.00). A local k-NN strain prior on the deformation field breaks this pattern: it reduces the gap by 40.8% while growing the cloud by 85%. A controlled ablation against E-D3DGS-style embedding smoothness and an SC-GS-style ARAP residual shows the three normalized variants are statistically tied: the canonical-distance normalization is the load-bearing element, not the choice of encoding. Our recommended combination GAD+EER closes 48.2% of the gap; the full stack reaches 57.4%.

6.18 dB: baseline gap (8 D-NeRF scenes)

9.93 dB: test-PSNR collapse if split is disabled

99.73%: strain reduction at held-out test timesteps

47.5% / 46.1% / 40.4% / +2.2%: strain / on-embed / arap / no-norm (variants ablation)

48.2%: recommended GAD+EER (D-NeRF)

+16.1%: EER on HyperNeRF high-gap subset (n=3 of 5)

57.4%: full stack (PTDrop + GrowthCap)

Count vs gap: EER breaks the log-linear correlation. — **The count–gap paradigm shift.** Ablations (gray) follow a log-linear trend (r = 0.995, bootstrap 95% CI [0.993, 1.000]). EER (green) uses *more* Gaussians yet overfits *less*. The correlation holds within 41 non-EER configurations (r = 0.987) — EER is the only lever we found that breaks it.

Abstract

Dynamic 3D Gaussian Splatting achieves impressive novel-view synthesis on monocular video by coupling a deformable point cloud with Adaptive Density Control (ADC), but exhibits a severe train–test generalization gap. On the D-NeRF benchmark (8 synthetic scenes) we measure an average gap of 6.18 dB (up to 11 dB per scene). Through a systematic ablation of every ADC sub-operation (split, clone, prune, frequency, threshold, schedule), we identify splitting as the bottleneck of the overfitting cascade: disabling split shrinks the gap to 1.15 dB but also collapses test PSNR by 9.93 dB, so it is not a viable mitigation. Split is the operation through which the cascade flows, not a knob one can simply turn off.

Our central finding is that a local smoothness penalty on the per-Gaussian deformation field — we use a k-NN strain prior we call EER — breaks the count–gap correlation observed across ablations: it reduces the gap by 40.8% while growing the cloud by 85%. This reframes overfitting from a capacity problem to an incoherent deformation problem. A controlled ablation against E-D3DGS-style per-embedding smoothness and an SC-GS-style ARAP residual shows the three normalized variants achieve statistically tied gap reductions (47.5% / 46.1% / 40.4%); dropping the canonical-distance normalization disables the prior entirely (+2.2%). The substantive contribution is therefore not a new method but the diagnostic finding plus the identification that the canonical-distance normalization is the load-bearing element of these priors. Combined with GAD (a loss-rate-aware densification threshold), the recommended configuration GAD+EER closes 48.2% of the gap; adding PTDrop (jitter-weighted dropout) and a soft cloud cap reaches 57.4% at larger quality cost.

Findings are validated on D-NeRF (8 synthetic scenes), Deformable-3DGS (cross-architecture), and HyperNeRF (5 real-world scenes; +16.1% gap reduction on the high-baseline-gap subset, neutral on low-baseline scenes). EER's k-NN cost scales with cloud size; approximate-NN structures are needed to scale beyond ~100K Gaussians on consumer hardware.

Key Findings

1. Split is the bottleneck of the cascade

Disabling split collapses both the cloud (2K vs 44K Gaussians) and the gap (1.15 dB vs 6.18 dB) — but also collapses test PSNR by 9.93 dB, so it is not a viable mitigation. Disabling pruning changes nothing.

2. Count–gap correlation is real but incomplete

r = 0.995 on 9 ablation conditions, holding within both sub-clusters (r = 0.998 on high-count, 0.95 on low-count) and across 41 non-EER configurations (r = 0.987).

3. EER breaks the correlation

+85% Gaussians, −40.8% gap. At the per-Gaussian level, EER reduces deformation strain by 99.6% on Lego, 99.8% on T-Rex, 99.6% on Hellwarrior.

4. Orthogonal axes compound

GAD+EER = 48.2% reduction. Adding LogiGrow + PTDrop = 57.4%, the only configuration in our sweep to more than halve the gap.
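
The count–gap statistics quoted in finding 2 (Pearson r on a log-count axis, Spearman ρ on ranks) can be reproduced with a few lines of standard-library Python. The sketch below is illustrative only: the function names are ours, ties in the rank step are not handled, and the input points are generated from an exact log-linear law rather than taken from the paper's measurements, so both coefficients come out as 1 by construction.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; assumes no ties (a simplification for this sketch)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = float(rank)
    return out


def spearman(xs, ys):
    """Spearman rho: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Synthetic placeholder points (NOT the paper's data): gap = 2*log10(count) - 5.
counts = [1_000, 4_000, 16_000, 64_000]
gaps_db = [2.0 * math.log10(c) - 5.0 for c in counts]

r_loglinear = pearson([math.log10(c) for c in counts], gaps_db)
rho = spearman(counts, gaps_db)
```

Feeding in the measured (count, gap) pairs for the ablation conditions would give the reported coefficients; a configuration such as EER, with a larger count and a smaller gap, pulls both values down.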

Method ranking (gap reduction)

V2 method ranking by gap reduction. — Gap reduction across the v2 methods we keep. Every configuration with a green (EER) bar dominates every non-EER configuration. Ablation baselines (grey) mark the extremes of what the ADC knob alone can do. The full combination `GAD+LogiGrow+PTDrop+EER` is the only configuration that crosses 50%.

Pareto frontier: quality vs overfitting

Ablation summary

Gap grows with densification, not with iterations alone

Overfitting gap over training iterations. — Train–test PSNR gap over training (mean ± std across 8 scenes). Baseline grows to ~6 dB; disabling split holds it at ~1 dB. The divergence tracks the densification window (iters 500–15,000).

Why early stopping fails: densification is front-loaded

Front-loaded densification bar chart. — 84–89% of cloud growth happens before iter 7,500. Stopping densification at iter 7,500 (A6) only trims the count by 10% and has essentially no effect on the gap — confirming that mitigation must modulate densification *from the start*, not truncate it at the end.

Dose–response: GAD and EER

Dose-response curves for GAD and EER. — Gap (blue/green, left axis) and test PSNR (red, right axis) as we sweep each method's strength parameter. Both GAD (capacity lever) and EER (coherence lever) produce smooth, monotonic trade-offs — EER's curve is markedly steeper, reaching a 44% gap reduction where GAD reaches 19%.

Method Taxonomy

Two small drop-in methods, each with one hyperparameter and about 20 lines of code, plus a stochastic complement:

Capacity lever

GAD — a loss-rate-aware densification threshold. Rises when the cloud is large and loss has plateaued, so only Gaussians that still earn their complexity are kept.

Coherence lever

EER ★ — a local smoothness penalty on the per-Gaussian deformation field: penalize relative motion between each Gaussian and its k canonical neighbors.

Stochastic complement

PTDrop — Gaussian-level dropout on a cosine schedule (iters 5K–12K).

We also tried spectral-gated densification (SGD), temporal Sobolev smoothness (STSR), SH-coefficient penalties (ChromReg), and opacity-entropy maximization (OEM). At our scale none moves the gap by more than 10%, so the v2 paper documents them as negative results rather than first-class methods; the v1 paper (PDF, companion main_v1.tex) has the full taxonomy for reference.

GAD: a BIC-motivated threshold schedule

We adapt the per-iteration gradient threshold as

τ_GAD(t) = τ_base · (1 + λ · K(t) / (N · Δℓ_ema(t)))

where K(t) is the current Gaussian count, N is the number of training pixels, and Δℓ_ema is an EMA of the per-iteration loss improvement. λ is the single tunable knob. The mapping from BIC to this formula is a heuristic (see paper, §6.2); the empirical diminishing-returns exponent we measure (α ≈ 0.04) is too mild to justify the often-quoted O((N/λ)^(1/4)) growth bound, so we present the bound qualitatively as "sublinear in N".
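
A minimal sketch of how the schedule could be tracked during training, written in plain Python. The class name, the EMA decay of 0.99, the clamping of negative improvements to zero, the `eps` floor, and the toy numbers in the usage lines are all assumptions made for illustration; only the formula itself comes from the text above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class GADThreshold:
    """Loss-rate-aware densification threshold (illustrative sketch)."""
    tau_base: float          # stock gradient threshold of the densifier
    lam: float               # lambda, the single tunable knob
    num_pixels: int          # N, number of training pixels
    ema_decay: float = 0.99  # assumed smoothing factor, not from the source
    eps: float = 1e-12       # assumed floor to avoid division by zero
    _prev_loss: Optional[float] = field(default=None, init=False, repr=False)
    _ema: Optional[float] = field(default=None, init=False, repr=False)

    def update(self, loss: float) -> None:
        """Fold the latest per-iteration loss improvement into the EMA."""
        if self._prev_loss is not None:
            # Assumption: a loss increase counts as zero improvement.
            improvement = max(self._prev_loss - loss, 0.0)
            if self._ema is None:
                self._ema = improvement
            else:
                self._ema = (self.ema_decay * self._ema
                             + (1.0 - self.ema_decay) * improvement)
        self._prev_loss = loss

    def threshold(self, count: int) -> float:
        """tau_GAD(t) = tau_base * (1 + lam * K(t) / (N * dl_ema(t)))."""
        if self._ema is None:  # no improvement estimate yet: stock threshold
            return self.tau_base
        rate = max(self._ema, self.eps)
        return self.tau_base * (1.0 + self.lam * count / (self.num_pixels * rate))


# Toy usage with made-up numbers (not measurements from the paper).
gad = GADThreshold(tau_base=2e-4, lam=1.0, num_pixels=1_000)
for loss in (1.0, 0.9):
    gad.update(loss)
tau_small = gad.threshold(count=100)
tau_large = gad.threshold(count=200)  # larger cloud -> higher threshold
```

With this form the threshold rises both when the cloud grows and when the loss improvement decays, which matches the behavior described for GAD. A practical implementation would probably also cap the threshold once the loss fully plateaus.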

EER: k-NN elastic strain energy

For a subset of Gaussians i and their k=8 canonical neighbors j, we penalize

ℒ_EER = mean_i,j ‖ u(x_i, t) − u(x_j, t) ‖² / (‖ x_i − x_j ‖² + ε)

where u(x, t) is the deformation offset at time t. This is the discrete elastic strain — physically the correct choice for linear elasticity (Hooke's law penalizes ∂u/∂x, not ∂u). In canonical space the k-NN graph is stable; we rebuild it every 500 iterations and apply a cosine ramp from iteration 3K to 10K.
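
The penalty and its ramp can be sketched in dependency-free Python as follows. This is a reference-style illustration, not the authors' implementation: the brute-force O(n²) neighbor search, the `eps` value, the use of every Gaussian instead of a sampled subset, and the assumption that the cosine ramp rises from 0 to 1 are ours. A real training loop would use a tensor library and a faster neighbor structure.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def sq_dist(a: Vec, b: Vec) -> float:
    """Squared Euclidean distance between two 3-vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def knn_indices(points: Sequence[Vec], k: int) -> List[List[int]]:
    """Brute-force k nearest neighbors of each point in canonical space."""
    result = []
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: sq_dist(p, points[j]))
        result.append(others[:k])
    return result


def eer_loss(points: Sequence[Vec], offsets: Sequence[Vec],
             neighbors: Sequence[Sequence[int]], eps: float = 1e-8) -> float:
    """Mean over pairs (i, j) of ||u_i - u_j||^2 / (||x_i - x_j||^2 + eps)."""
    total, pairs = 0.0, 0
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            total += sq_dist(offsets[i], offsets[j]) / (sq_dist(points[i], points[j]) + eps)
            pairs += 1
    return total / pairs if pairs else 0.0


def cosine_ramp(iteration: int, start: int = 3_000, end: int = 10_000) -> float:
    """Weight in [0, 1]: 0 up to `start`, 1 from `end`, half-cosine between."""
    if iteration <= start:
        return 0.0
    if iteration >= end:
        return 1.0
    phase = (iteration - start) / (end - start)
    return 0.5 * (1.0 - math.cos(math.pi * phase))


# Toy check: a rigid translation has zero strain, alternating offsets do not.
canonical = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
graph = knn_indices(canonical, k=2)
rigid = [(0.5, 0.0, 0.0)] * 4
jitter = [(0.5, 0.0, 0.0), (-0.5, 0.0, 0.0), (0.5, 0.0, 0.0), (-0.5, 0.0, 0.0)]
loss_rigid = eer_loss(canonical, rigid, graph)
loss_jitter = eer_loss(canonical, jitter, graph)
```

In a training step the regularizer would enter the objective as `lam * cosine_ramp(it) * eer_loss(...)`, with the neighbor graph recomputed every 500 iterations. Dividing by the canonical distance is what makes the term a strain rather than a raw displacement difference.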

Interactive 3D Deformation Viewer

Explore the deformation field in 3D. Left panel: baseline (incoherent per-Gaussian deformation). Right panel: EER (coherent elastic deformation). Use the time slider to animate — watch how baseline Gaussians scatter chaotically at novel timesteps while EER maintains spatial coherence. Drag to orbit; scroll to zoom. Cameras are linked between panels.

12,000 highest-opacity Gaussians per scene, 11 timesteps (t=0.0 to 1.0). Color by displacement magnitude (viridis) or strain (inferno). Requires serving via HTTP (python -m http.server 8000).

What EER Actually Does to the Deformation Field

For every D-NeRF scene, we load the trained 4DGS model, query the per-Gaussian deformation at 4 timesteps, and plot the distribution of per-Gaussian strain ε_i = mean_j ‖u_i−u_j‖² / ‖x_i−x_j‖² over its 8 canonical neighbors.

Lego deformation field. — **Lego**: strain ↓ 99.62%

T-Rex deformation field. — **T-Rex**: strain ↓ 99.80%

Hellwarrior deformation field. — **Hellwarrior**: strain ↓ 99.58%

Bouncing-balls deformation field. — **Bouncing-balls**: strain ↓ 99.90%

Jumping-jacks deformation field. — **Jumping-jacks**: strain ↓ 99.84%

Stand-up deformation field. — **Stand-up**: strain ↓ 99.82%

Mutant deformation field. — **Mutant**: strain ↓ 99.64%

Hook deformation field. — **Hook**: strain ↓ 99.59%

Each panel shows (left) the canonical cloud colored by displacement magnitude, (middle) a subsampled quiver of u(x, t=0.5), (right) the per-Gaussian strain histogram. Baseline is bimodal with heavy tails; EER collapses the distribution by more than two orders of magnitude (mean strain falls from 3.539 to 0.00922 across the 8 scenes). This is the direct mechanism behind EER's overfitting reduction.

Strain reduction on every scene

Scene	Baseline ε	EER ε	Reduction
bouncingballs	2.835	0.00296	99.90%
hellwarrior	5.785	0.02408	99.58%
hook	2.627	0.01090	99.59%
jumpingjacks	6.772	0.01106	99.84%
lego	1.573	0.00594	99.62%
mutant	1.323	0.00481	99.64%
standup	3.686	0.00667	99.82%
trex	3.715	0.00738	99.80%
mean (n=8)	3.539	0.00922	99.72%

Measured at iter 20,000 on trained 4DGS checkpoints. Strain ε is mean over k=8 canonical neighbors of ‖u_i−u_j‖² / ‖x_i−x_j‖², averaged over 4 timesteps (t=0, 0.25, 0.5, 0.75).
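
As a sanity check on the table, the per-scene reductions and the summary row can be recomputed from the two strain columns. The short script below copies the values from the table; the helper name `reduction_pct` is ours.

```python
# (baseline strain, EER strain) per scene, copied from the table above.
STRAIN = {
    "bouncingballs": (2.835, 0.00296),
    "hellwarrior": (5.785, 0.02408),
    "hook": (2.627, 0.01090),
    "jumpingjacks": (6.772, 0.01106),
    "lego": (1.573, 0.00594),
    "mutant": (1.323, 0.00481),
    "standup": (3.686, 0.00667),
    "trex": (3.715, 0.00738),
}


def reduction_pct(baseline: float, treated: float) -> float:
    """Relative reduction in percent: 100 * (1 - treated / baseline)."""
    return 100.0 * (1.0 - treated / baseline)


per_scene = {name: reduction_pct(b, e) for name, (b, e) in STRAIN.items()}

n = len(STRAIN)
mean_baseline = sum(b for b, _ in STRAIN.values()) / n
mean_eer = sum(e for _, e in STRAIN.values()) / n

# The summary row averages the per-scene percentages ...
mean_of_reductions = sum(per_scene.values()) / n
# ... which differs slightly from the reduction of the mean strains.
reduction_of_means = reduction_pct(mean_baseline, mean_eer)
```

The mean of the per-scene percentages reproduces the 99.72% in the summary row, while the reduction computed from the two mean strains is marginally higher (about 99.74%), because scenes with larger baseline strain weigh more in the second figure.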

EER: The Paradigm Shift

EER three-panel analysis. — (a) EER λ sweep: consistent gap reduction across scenes. (b) EER *increases* final Gaussian count — the reverse of capacity control. (c) Per-scene gap reduction: consistent across all 8 scenes, including the pathological Lego and Hellwarrior.

Combination additivity plot. — Combinations are *super-additive*: GAD+EER exceeds the sum of individual reductions, confirming capacity and coherence target orthogonal failure modes.

Real-World Validation (HyperNeRF — 5 scenes)

EER transfers to real monocular video, in the regime where there is overfitting to remove. On 5 HyperNeRF scenes, with 4DGS and the same λ=0.05 tuned on synthetic D-NeRF (no per-dataset re-tuning), EER reduces the gap on the 3 high-baseline-gap scenes (> 4 dB) at near-zero quality cost; on the 2 low-baseline-gap scenes (< 2 dB) it is approximately neutral, as expected when the deformation field is already coherent:

Scene	Baseline gap	EER gap	Reduction	ΔTest PSNR
chickchicken	5.48 dB	4.61 dB	+15.9%	−0.20
slice-banana	5.89 dB	5.40 dB	+8.3%	+0.03
vrig-3dprinter	4.49 dB	3.41 dB	+24.0%	+0.11
high-gap subset mean (n=3)	5.29 dB	4.47 dB	+16.1%	−0.02
vrig-peel-banana†	0.89 dB	0.83 dB	+6.6%	−0.23
vrig-broom2†	1.81 dB	1.83 dB	−1.2%	−0.21
full mean (n=5)	3.71 dB	3.22 dB	+11.0%	−0.10

4DGS on HyperNeRF, 14K iterations (stock config), RTX 3070. †Both vrig-peel-banana and vrig-broom2 have baseline gaps below 2 dB (at the floor of measurable improvement); reductions on these scenes are within reproduction noise. The high-gap subset (chickchicken, slice-banana, vrig-3dprinter) is the regime where EER clearly helps: +16.1% mean reduction at −0.02 dB cost, which is effectively free. The coherence finding survives noisy poses and non-Lambertian materials in the regime where the optimizer has overfitting to remove. EER's k-NN cost scales with cloud size; we could not extend to scenes where the cloud grows beyond ~100K Gaussians on consumer hardware (see Limitations).

Cross-Architecture Validation (Deformable-3DGS)

Main experiments are on 4DGS (HexPlane deformation). We ported EER and GAD to Deformable-3DGS (MLP deformation) and ran baseline + EER on three D-NeRF scenes for 20K iterations.

Phase 1: direct-transfer test at D-NeRF-tuned λ=0.05

Scene	Baseline gap	EER λ=0.05 gap	Reduction	ΔPSNR
lego	13.15 dB	13.56 dB	−3.1%	−0.02 dB
trex	1.50 dB	1.81 dB	−20.8%	−0.38 dB
hellwarrior	4.08 dB	3.87 dB	+5.2%	−0.22 dB

Direct transfer at λ=0.05 is poor (mean −6% reduction, i.e. the gap grows slightly). Why? Deformable-3DGS trains with L1+0.2·(1−SSIM) vs. 4DGS's pure L1: the loss magnitude is roughly 3× larger, so the same λ=0.05 under-weights the EER term relative to the data loss. Our dimensional-analysis note (paper §6.2) predicts the correct λ for Deformable-3DGS is ≈ 0.15–0.30. Testing this directly:

Phase 2: λ sweep on Deformable-3DGS Lego (dimensional-analysis test)

λ	Gap (dB)	Train PSNR	Test PSNR	ΔTest	Reduction
0 (baseline)	13.15	38.38	25.23	—	—
0.05	13.56	38.77	25.21	−0.02	−3.1%
0.15	10.23	35.55	25.33	+0.10	+22.3%
0.30	8.26	33.60	25.34	+0.11	+37.2%
0.60	7.82	33.21	25.39	+0.16	+40.6%
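
For readers who want to reproduce the derived columns, the gap is train PSNR minus test PSNR and the reduction is measured relative to the λ=0 row. The sketch below recomputes them from the rounded PSNR values in the sweep table; the function names are ours, and because the inputs are already rounded the last digit can differ slightly from the table (for example 10.22 vs 10.23 dB at λ=0.15).

```python
# (lambda, train PSNR, test PSNR) copied from the sweep table.
SWEEP = [
    (0.00, 38.38, 25.23),  # baseline
    (0.05, 38.77, 25.21),
    (0.15, 35.55, 25.33),
    (0.30, 33.60, 25.34),
    (0.60, 33.21, 25.39),
]


def gap_db(train_psnr: float, test_psnr: float) -> float:
    """Train-test generalization gap in dB."""
    return train_psnr - test_psnr


def gap_reduction_pct(baseline_gap: float, new_gap: float) -> float:
    """Positive when the gap shrinks, negative when it grows."""
    return 100.0 * (baseline_gap - new_gap) / baseline_gap


_, base_train, base_test = SWEEP[0]
base_gap = gap_db(base_train, base_test)

rows = []
for lam, train, test in SWEEP[1:]:
    g = gap_db(train, test)
    rows.append({
        "lambda": lam,
        "gap": g,
        "delta_test": test - base_test,
        "reduction": gap_reduction_pct(base_gap, g),
    })
```

Note that in this sweep the gap shrinks almost entirely through lower train PSNR while test PSNR stays within about 0.2 dB of the baseline, which is the signature of reduced memorization rather than reduced quality.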

Cross-scene confirmation at λ=0.30

To confirm the sweep is not Lego-specific, we replicated λ=0.30 on Hellwarrior:

Scene	Baseline gap	EER λ=0.30	Reduction	ΔTest
Lego	13.15	8.26	+37.2%	+0.11
Hellwarrior	4.08	3.54	+13.2%	−2.44

The coherence mechanism transfers across deformation architectures and across scenes; the hyperparameter requires per-architecture (and to a lesser extent per-scene) calibration, exactly as the dimensional-analysis note predicted. On Hellwarrior the quality cost is larger at λ=0.30 (−2.44 dB); a smaller λ like 0.05 already gives +5.2% gap reduction at only −0.22 dB.

BibTeX

@article{droby2026monodygs,
  author  = {Ahmad Droby},
  title   = {Incoherent Deformation, Not Capacity: Diagnosing and
             Mitigating Overfitting in Dynamic Gaussian Splatting},
  journal = {arXiv preprint},
  year    = {2026}
}