It matches my anecdotal experience here (from Germany). When talking about travelling to the US, the discussions of drawbacks used to circle around climate impact, expense, or flight time. Recently, this has shifted to feeling unwelcome or unsafe. Just last week I talked to two different people who had gone to the US regularly in past years and have decided not to go this year (one to attend a wedding, the other for tourism), due to the current political climate.
Certainly possible; see Section 4.6, "Finetuning for Downstream Tasks", in the paper, whose first subsection is "Feed-forward Novel View Synthesis". They chose to report their experiments on LVSM, which is not an explicit representation like 3D Gaussian Splatting, but they cite two feed-forward 3DGS approaches in their state-of-the-art listing.
Should be quite exciting going forward, as fine-tuning might be possible on consumer hardware / single desktop machines (like it is with LLMs). So I would expect a lot of experiments coming out in this space soon-ish. If the results hold up, it'll be pretty exciting to drop slow and cumbersome COLMAP processing and scene optimization for a single forward pass that takes a few seconds.
I read the paper yesterday and would recommend it. Kudos to the authors for getting to these results, and also for presenting them in a polished way. It's nice to follow the arguments about the alternating attention (global across all tokens vs. only the tokens per camera), the normalization (normalizing the scene scale is done in the data, vs. DUST3R, which normalizes in the network), and the tokens (image tokens from DINOv2 + camera tokens + additional register tokens, with the first camera handled differently as it becomes the frame of reference). The results are amazing, and fine-tuning this model will be fun, e.g. for feed-forward 3DGS reconstruction; looking forward to that.
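In case the alternating attention is hard to picture, here's a rough toy sketch of the idea (my own code, not the authors'; the shapes and names are just assumptions): one block lets each frame's tokens attend only to each other, the next block lets all tokens from all frames attend globally.

```python
# Toy sketch of alternating frame-wise / global attention (not the VGGT code).
import torch
import torch.nn as nn

class AlternatingAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, tokens_per_frame, dim)
        b, f, t, d = x.shape
        # Frame-wise attention: each frame's tokens only see each other.
        xf = x.reshape(b * f, t, d)
        xf, _ = self.frame_attn(xf, xf, xf)
        x = xf.reshape(b, f, t, d)
        # Global attention: all tokens across all frames attend to each other.
        xg = x.reshape(b, f * t, d)
        xg, _ = self.global_attn(xg, xg, xg)
        return xg.reshape(b, f, t, d)

# Example: 2 scenes, 4 cameras each, 256 tokens per camera, 512-dim features
out = AlternatingAttention(512)(torch.randn(2, 4, 256, 512))
```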
I'm sure getting to this point was quite difficult, and on the project page you can read how it involved discussions with lots and lots of smart and capable people. But there's no big "aha" moment in the paper, so it feels like another hit for The Bitter Lesson in the end: they used a giant bunch of data and a year and a half of GPU time to train the final model, and created a model with a billion parameters that outperforms all previous specialized models.
Or in the words of the authors, from the paper:
> We also show that it is unnecessary to design a special network for 3D reconstruction. Instead, VGGT is based on a fairly standard large transformer [119], with no particular 3D or other inductive biases (except for alternating between frame-wise and global attention), but trained on a large number of publicly available datasets with 3D annotations.
Fantastic to have this. But it feels... yes, somewhat bitter.
Give it another year and we will have a more specialised architecture tailored to 3D that reaches similar accuracy. VGGT is ground-breaking research, but it is in a way brute force. There is plenty of work to do to make it more efficient.
Doesn't the bitter lesson take the argument a bit too far by opposing search/learning to heuristics? Is the former not dependent on breakthroughs in the latter?
The bitter lesson is the opposite. It argues that hand-crafted heuristics will eventually get beaten by more general learning algorithms that can take advantage of computing power.
Indeed, even "classical chess engines" like Stockfish, which previously required handcrafted heuristics at the leaf nodes, have moved on: in recent years the NNUE [1] [2] evaluation has greatly outperformed the handcrafted one. Note that this is a completely different approach from the one AlphaZero takes, and modern Stockfish is significantly stronger than AlphaZero.
Brute forcing is bound to find paths beyond heuristics. What I'm getting at is that the path needs to be established first before it can be beaten. Hence I'm wondering whether one isn't an extension of the other rather than an opposing strategy.
I.e. search and heuristics both have a time and a place; not so much a bitter lesson as a common filter for the next iteration to pass through.
Doh, that's entirely fair: haven't been in this thread yet, but would echo what I perceive as implicit puzzlement re: this amount of GPU time being described as bitter-lesson-y.
Sure they can. o3-mini can do web searches, which puts it far ahead of o1 if you require current information. You can also tell it to go read a particular paper from just the rough name.
If you have things organized neatly together, you can also use pre-existing compression algorithms, like JPEG, to compress your data. That's what we're doing in Self-Organizing Gaussians [0]. There we take an unorganised (noisy) set of primitives that have 59 attributes and sort them into 59 2D grids which are locally smooth. Then we use off-the-shelf image formats to store the attributes. It's an incredibly effective compression scheme, and quite simple.
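To sketch the idea (this is just toy code from me, not the actual implementation; the real sorting is much smarter, and the helper names here are made up): arrange the primitives on a 2D grid so that neighbours are similar, then store each attribute plane with a standard image codec.

```python
# Toy sketch: sort primitives so neighbours on a 2D grid are similar,
# then compress each attribute plane with an off-the-shelf image format.
import numpy as np
from PIL import Image

def to_grids(attrs: np.ndarray, side: int) -> np.ndarray:
    # attrs: (N, K) unordered primitives with K attributes, N == side * side.
    # Stand-in for the real smoothness-optimizing sort: just order by the
    # first few attributes so nearby cells tend to hold similar primitives.
    order = np.lexsort(attrs[:, :3].T)
    return attrs[order].reshape(side, side, -1)

def save_planes(grids: np.ndarray, prefix: str, quality: int = 90) -> None:
    # One image per attribute: quantize each plane to 8 bit and save as JPEG.
    for k in range(grids.shape[-1]):
        plane = grids[..., k]
        lo, hi = plane.min(), plane.max()
        img = ((plane - lo) / (hi - lo + 1e-8) * 255).astype(np.uint8)
        Image.fromarray(img).save(f"{prefix}_{k:02d}.jpg", quality=quality)

# Example: a 256x256 grid of primitives with 59 attributes each
attrs = np.random.rand(256 * 256, 59).astype(np.float32)
save_planes(to_grids(attrs, 256), "attr")
```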
So they want to build a hollow structure, 1 km high, with all its weight concentrated at the very top (when charged)? How is that supposed to not immediately collapse?
it's not a hollow structure, it's space set aside in an otherwise normal commercial/residential tower. like extra banks of elevators. even if it weren't though, the weight of the building would far outweigh the weight of the blocks.
Absolute pitch: a completely useless skill, and having it can in some cases even be detrimental. While also being very hard, if not impossible, to acquire. So naturally I will stop at nothing trying to develop it :)
A couple of months ago, this paper made the rounds: Absolute pitch in involuntary musical imagery [0]. In a small sample group, when people were asked to sing their current earworm, they were perfectly in pitch nearly half the time (44.7%). Random chance would be 8.3%.
It’s a fun thing to try for yourself. Just hum your current earworm into a voice memo, and check the pitch against a recording of the original song. You may discover a skill you never knew you had: implicit perfect pitch for involuntary music!
Trying to make this more interesting by reproducing a particular song on demand (there are references to that in the paper too; it also works better than random chance, but less so than the involuntary kind), I find it works best for songs that start off with a single note, preferably sung, or at least ones where you can immediately check whether you were right, e.g. “Tom’s Diner”. I’ve been having a lot of fun humming the first note of Laufey’s cover of “Sunny Side of the Street” [1] whenever I open YouTube. I’m more often right than wrong, and if I was wrong, I can just listen to the whole thing to brighten my day anyways.
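If you'd rather check yourself numerically than by ear, a tiny helper like this works (my own throwaway snippet, not from the paper): compare the hummed frequency to the original's in semitones, ignoring the octave; a random guess lands in the right pitch class about 1 in 12 times (~8.3%).

```python
# Compare a hummed frequency against the original, octave-agnostic.
import math

def semitone_offset(f_hummed: float, f_original: float) -> int:
    # 12 * log2 of the frequency ratio, rounded to the nearest semitone.
    return round(12 * math.log2(f_hummed / f_original))

def same_pitch_class(f_hummed: float, f_original: float) -> bool:
    # Ignore which octave you sang in; only the pitch class matters,
    # so a random guess hits about 1 time in 12 (~8.3%).
    return semitone_offset(f_hummed, f_original) % 12 == 0

print(same_pitch_class(261.6, 523.3))  # C4 vs C5 -> True
```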
It's probably an overrated skill depending on the musical task, but to say it's completely useless is really ignorant. Nearly anyone who studies music at the university level or above would find this statement ("completely useless") to be wildly incorrect.
> Random chance would be 8.3%.
A random human won't sing off the cuff with their tonal center magically quantized to one of the twelve keys in our modern western tuning (Equal Temperament).
The smiling emoji at the end of the first paragraph indicates that these statements were made somewhat in jest, or perhaps exaggerated. Of course some uses can be found for absolute pitch. I saw one a couple of weeks back, when Jacob Collier was tuning the audience choir to lead into "Somebody to Love" played on the piano. But had he not had absolute pitch, he might just have picked up a reference note from the piano or his in-ear monitors, like a filthy commoner. Usually when making music, good relative pitch is what's required, and a reference instrument is mostly at hand, making perfect pitch somewhat redundant. But do tell what you're doing with perfect pitch, I'm curious.
And out of curiosity, I went to your website and randomly listened to "The Fugue Song" [0]. Really loved it! Very nice moment when the singing comes in, repeating the phrase from the fugue-like guitar intro. Good song! (I'm a total sucker for Nina Simone's "Love Me Or Leave Me", do you know that one? A song where she inserts some counterpoint improvisations in the middle.) I'm listening to a bit of "Hiss" now.
> 8.3%
Rounded to the nearest semitone, of course; I left that detail out, it's in the paper.
It's just an anecdote, but: I remember playing in a band with an amateur musician friend who told me he had perfect pitch, and it was "very annoying", according to him, when we transposed songs to fit our voices. They just "sounded wrong" to him. He would make beginner mistakes because he relied on pitch memory rather than listening to us to know what to play/sing.
I have no idea how it works in general, but it seems like it was a problem for at least this one guy.
xD Well I didn't expect it to go there. I don't know that song but I'll have to listen to it when I get home.
I don't have perfect pitch but it's not too hard to imagine what I could do with it. For me, it'd simply make a lot of tasks faster. I've spent a lot of time throughout my life transcribing music. I can do it relatively quickly, but there are lots of moments where I have to confirm things, poke around to find the matching notes, or struggle to pin down the shades of a chord based on which notes are present.
Reading music will be easier for people with AP, especially in singing situations. Even if you're a pianist it will still be helpful. There are a few Marc-Andre Hamelin interviews out there where he describes some of the advantages. It's easier to read music if you know immediately what it's going to sound like. Again, this is possible with relative pitch, but it's just more work and slower.
Arranging and composing away from the keyboard will be much easier with perfect pitch.
As time goes on, it'll be less important, most likely. And yes, there are some downsides obviously. In my final aural training class in music school, we had a competition at the very end of the year for fun. It came down to a team of 3 I was on vs. a team of 3 that had a guy with AP. The final task was to sight-sing a musical 'round' (a composition where the melody repeats in the various voices, entering at different points and overlapping each other). The guy with AP actually ruined it for his team. They mistakenly chose him to finish the round instead of start it. Midway through their performance, the pitch on their team had drifted so heavily that they were in between notes on the piano when he took over. He tried to sing 'relative' to everyone else, but it was so hard for him. Singing out of key was so unnatural for him that he couldn't do it; it sounded really bad. Great guy though, and a ridiculously good violinist.
> A random human won't sing off the cuff with their tonal center magically quantized to one of the twelve keys in our modern western tuning (Equal Temperament).
Why is that relevant? Whatever pitch they pick would fall into one of the 12 buckets, even if it isn't precisely the correct pitch.
It's not called "in-the-ballpark" pitch, it's called perfect/absolute pitch. Being up to a quarter tone off is a large error in music. Thinking of pitch in terms of 12 buckets is not musically useful. The vast majority of music is based on consonance, where being even a few hertz off means unpleasant dissonance. TLDR: Thinking of pitch as 12 buckets is mostly irrelevant.
I've always observed that the number of random non-musicians who can get it right, at least within the handful of "major" keys among those twelve, is remarkable enough to be worth considering.
Since those are the only 12 notes so many people have been hearing from every direction for so long, mostly confined to no more than a few of the keys that are "dominant" as a result of modern instrumentation, they get ingrained in the psyche and the notes are almost memorized by frequency, with nothing in between. So that's what people reproduce without any training. Some can even have a more sensitive ear for out-of-tune notes than an actual music student of several years.
This is really a fun skill to learn that you have. I've had a pretty good ear for relative pitch since birth, which my music teachers picked up on right away (I could play songs "by ear" after hearing them a couple of times), but I struggled with blind pitch in the mornings... until I realized that, for whatever reason, I can hear the theme from Zora's Domain with perfect clarity, in the proper key.
I used that to fake absolute pitch for a while in college, then explained to my voice coach what I was doing, and he looked at me like I had three heads. I'll never forget it. :)
I find I can recall something with accurate pitch, but the "memory" of that pitch fades over time. Whatever my current favorite song is, I can hum it at the right pitch. But if I were to try to do so a month later, it would probably be transposed a bit, because I've somehow lost that sense of the exact correct pitch. My idea of what "feels right" in that regard somehow fades, or something...
If you are a regular HN reader who is (or was until this post) unfamiliar with Back to the Future, I'd love to know three more random facts about your life. In my world view, you are part of a fascinatingly small group of people.
Part 3 came out in 1990. So anyone born after that (less than 34 years old) who didn't bother to go back and watch it would qualify? I'm familiar with the series' existence, but had no idea what the 1.21 reference was. AMA, hah.
I’ve been on the BTTF ride at… wherever in Florida it is, and I loved that as a teenager. The films just never really appealed, though, for some reason. I guess one related fact would be that I have a lot of gaps like that in the movies I have seen. For instance, people are often shocked that I’ve never seen any of the Indiana Jones movies (also loved the rides!); but Star Wars I could probably recite the scripts of.
I don’t think I have any other facts that are very interesting, but then again I didn’t think not having seen BTTF was all that interesting either. For the record I was familiar with 1.21GW and what it related to… I don’t live under a rock!
I have kids that are in their late 20s. They never watch older movies unless someone forces them to. There is so much new media coming out that they don’t feel the need to watch older movies, even if everyone is telling them it is very good.
Couldn't you put that media on when they were kids?
I know movie nights are not a thing every family does, but I'd imagine having one day a modern movie, one day something oldish from the '80s-'90s, another day a classic from the '40s, etc.
Wouldn't that have worked if you started from when they were young?
I'm just thinking, as that is my plan for when/if I have kids: mix older media with new and just enjoy it with them. If it is truly good and not just nostalgia, it should be enjoyable even as a rewatch.
Since the franchise hasn't been rebooted like so many others, it hasn't seen the $$$ marketing that would introduce it to new generations.
Like The Princess Bride or Labyrinth, BTTF currently remains a phenomenon of the '80s and '90s -- familiar to most from that time and deeply treasured by some, but not refreshed and sustained the way the Star Wars, Star Trek, Marvel/DC, etc. brands have been.
I was wondering: have you thought about automation bias or automation complacency [0]? Sticking with the drop-tables example: if you have an agent that works quite well, the human in the loop will nearly always approve the task. The human will then learn over time that the agent "can be trusted", and will stop reviewing the pings carefully. Hitting the "approve" button will become somewhat automated by the human, and the risky tasks won't be caught by the human anymore.
Premature optimization and premature automation cause a lot of issues and lead to a lot of insight being overlooked.
By just doing something manually 10-100 times and collecting feedback, both your understanding of the problem and the possible solutions/specifications can evolve orders of magnitude better.
yeah, the people who reach for tools/automation before doing it themselves at least 3-10 times drive me crazy.
I think Uncle Bob or Martin Fowler said "don't buy a JIRA until you've done it with post-its for 3 months and you know exactly what workflow is best for your team".
this is fascinating and resonates with me on a deep level. I'm surprised I haven't stumbled across this yet.
I think we have this problem with all AI systems. E.g. I have let Cursor write wrong code from time to time and don't review it at the level I should... we need to solve that for every area of AI. Not a new problem, but definitely about to get way more serious.
This is something we frequently saw at Uber. I would say it's the same here; there's already an established pattern for this for any sort of destructive action.
Intriguingly, it's rather similar to what we see with LLMs - you want to really activate the person's attention rather than have them go off on autopilot; in this case, probably have them type something quite distinct in order to confirm it, to turn their brain on. Of course, you likely want to figure out some mechanism/heuristics, perhaps by determining the cost of a mistake, and using that to set the proper level of approval scrutiny: light (just click), heavy (have to double confirm via some attention-activating user action).
Finally, a third approach would be to make the action reversible: like in many applications (Uber Eats, Gmail, etc.), you can trigger something but it defers execution, giving you a chance to undo it. However, I think that causes people more stress, so it's often better to confirm up front than to act and then offer an undo. It's better to be very deliberate about what's a soft confirm and what's a hard confirm, optimizing for the human by providing the right balance of high certainty and low stress.
I think the canonical sort of approach here is to make them confirm what they're doing. When you delete a GitHub repo for example, you have to type the name of the repo (even though the UI knows what repo you're trying to delete).
If the table name is SuperImportantTable, you might gloss over that, but if you have to type that out to confirm you're more likely to think about it.
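A minimal sketch of that pattern applied to the drop-table case (just illustrative code from me; the names are made up):

```python
# Type-the-name-to-confirm pattern for a destructive action (illustrative only).
def confirm_destructive(resource_name: str) -> bool:
    print(f"This will permanently drop '{resource_name}'.")
    typed = input("Type the table name to confirm: ").strip()
    return typed == resource_name

if confirm_destructive("SuperImportantTable"):
    print("Dropping table...")  # the actual DROP TABLE would run here
else:
    print("Aborted: name did not match.")
```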