Hacker News

I read the paper yesterday, would recommend it. Kudos to the authors for getting to these results, and also for presenting them in a polished way. It's nice to follow the arguments about the alternating attention (global across all tokens vs. only the tokens per camera), the normalization (normalizing the scene scale in the data, vs. DUSt3R, which normalizes in the network), and the tokens (image tokens from DINOv2 + camera tokens + additional register tokens, with the first camera handled differently since it becomes the frame of reference). The results are amazing, and fine-tuning this model will be fun, e.g. for feed-forward 3DGS reconstruction; looking forward to that.
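The alternating-attention idea described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: single-head attention with identity projections, and made-up shapes (F frames, T tokens per frame, dimension D), just to show which tokens can see which in each of the two alternating layers.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (N, D). Single-head attention with identity Q/K/V projections,
    # purely to illustrate the attention pattern, not a real layer.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def alternating_block(tokens):
    # tokens: (F, T, D) -- F frames/cameras, T tokens per frame.
    F, T, D = tokens.shape
    # Frame-wise attention: each token attends only within its own frame.
    framewise = np.stack([self_attention(tokens[f]) for f in range(F)])
    # Global attention: flatten all frames into one sequence so every
    # token attends to every token across all frames.
    flat = framewise.reshape(F * T, D)
    return self_attention(flat).reshape(F, T, D)

out = alternating_block(np.random.randn(4, 16, 32))
print(out.shape)  # (4, 16, 32)
```

The real model interleaves many such frame-wise/global pairs with learned projections, plus the camera and register tokens mentioned above.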

I'm sure getting to this point was quite difficult, and on the project page you can read how it involved discussions with lots and lots of smart and capable people. But there's no big "aha" moment in the paper, so it feels like another hit for The Bitter Lesson in the end: They used a giant bunch of [data], a year and a half of GPU time to [train] the final model, and created a model with a billion parameters that outperforms all specialized previous models.

Or in the words of the authors, from the paper:

> We also show that it is unnecessary to design a special network for 3D reconstruction. Instead, VGGT is based on a fairly standard large transformer [119], with no particular 3D or other inductive biases (except for alternating between frame-wise and global attention), but trained on a large number of publicly available datasets with 3D annotations.

Fantastic to have this. But it feels... yes, somewhat bitter.

[The Bitter Lesson]: http://www.incompleteideas.net/IncIdeas/BitterLesson.html (often discussed on HN)

[data]: "Co3Dv2 [88], BlendMVS [146], DL3DV [69], MegaDepth [64], Kubric [41], WildRGB [135], ScanNet [18], HyperSim [89], Mapillary [71], Habitat [107], Replica [104], MVS-Synth [50], PointOdyssey [159], Virtual KITTI [7], Aria Synthetic Environments [82], Aria Digital Twin [82], and a synthetic dataset of artist-created assets similar to Objaverse [20]."

[train]: "The training runs on 64 A100 GPUs over nine days", that would be around $18k on Lambda Labs, in case you're wondering
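The $18k figure checks out as a back-of-envelope estimate, assuming an on-demand A100 rate of roughly $1.29/GPU-hour (rates vary by provider and over time; the exact rate is my assumption, not from the thread):

```python
# 64 A100s for 9 days, at an assumed ~$1.29 per GPU-hour.
gpus, days, rate = 64, 9, 1.29
gpu_hours = gpus * days * 24
print(gpu_hours)                # 13824 GPU-hours
print(round(gpu_hours * rate))  # 17833, i.e. ~$18k
```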



Give it another year and we will have a more specialised architecture tailored to 3D that reaches similar accuracy. VGGT is ground-breaking research, but it is in a way brute force. There is plenty of work to do to make it more efficient.


Doesn't the bitter lesson take the argument a bit too far by opposing search/learning to heuristics? Isn't the former dependent on breakthroughs in the latter?


The bitter lesson is the opposite. It argues that hand-crafted heuristics will eventually get beaten by more general learning algorithms that can take advantage of computing power.


Indeed, even "classical chess engines" like Stockfish, which previously relied on handcrafted heuristics at leaf nodes, have in recent years been greatly improved by NNUE evaluation [1] [2]. Note that this is a completely different approach from the one AlphaZero takes, and modern Stockfish is significantly stronger than AlphaZero.

[1] https://stockfishchess.org/blog/2020/introducing-nnue-evalua...

[2] https://www.chessprogramming.org/Stockfish_NNUE


> eventually get beaten

Brute forcing is bound to find paths beyond heuristics. What I'm getting at is that the path needs to be established first before it can be beaten. Hence I'm wondering whether one isn't an extension of the other rather than an opposing strategy.

I.e. search and heuristics both have a time and place, not so much a bitter lesson but a common filter for a next iteration to pass through.


That's like saying horse drawn carriages aren't opposed to cars because they needed to be developed first.


> They used a giant bunch of [data], a year and a half of GPU time to [train] the final model,

>[train]: "The training runs on 64 A100 GPUs over nine days", that would be around $18k on lambda labs in case you're wondering

How is that a "year and a half of GPU time"? Maybe on some exoplanet?


> > [train]: "The training runs on 64 A100 GPUs over nine days",

> How is that a "year and half of GPU time".

64 GPUs × 9 days = 576 GPU-days ≈ 1.577 GPU-years


Doh, that's entirely fair: haven't been in this thread yet, but would echo what I perceive as implicit puzzlement re: this amount of GPU time being described as bitter-lesson-y.



