> The implications of these geometric properties are staggering. Let's consider a simple way to estimate how many quasi-orthogonal vectors can fit in a k-dimensional space. If we define F as the degrees of freedom from orthogonality (90° - desired angle), we can approximate the number of vectors as [...]
If you're just looking at minimum angles between vectors, you're doing spherical codes. So this article is an analysis of spherical codes… that doesn't reference any work on spherical codes… seems to be written in large part by a language model… and has a bunch of basic inconsistencies that make me doubt its conclusions. For example: in the graph showing the values of C for different values of K and N, is the x axis K or N? The caption says the x axis is N, the number of vectors, but later they say the value C = 0.2 was found for "very large spaces," and in the graph we only get C = 0.2 when N = 30,000 and K = 2---that is, 30,000 vectors in two dimensions! On the other hand, if the x axis is K, then this article is extrapolating a measurement done for 2 vectors in 30,000 dimensions to the case of 10^200 vectors in 12,888 dimensions, which obviously is absurd.
I want to stay positive and friendly about people's work, but the amount of LLM-driven stuff on HN is getting really overwhelming.
Spherical codes are kind of obscure: I haven't heard of them before, and Wikipedia seems to have barely heard of them. And most of the Google results seem to be about playing golf in small dimensions (i.e., how many can we optimally pack in n < 32 dimensions?).
People do indeed rediscover previously existing math, especially when the old content is hidden under non-obvious jargon.
> The problem with saying something is LLM generated is it cannot be proven and is a less-helpful way of saying it has errors.
It's a very helpful way of saying it isn't worth the bother of reading. After all, if they couldn't be bothered to write it, I can't be bothered to read it.
You have no idea how much personal work went behind it. You just suspect it was worded with a LLM.
I have been using embeddings for almost a decade and am well versed in their intricacies. I think this article has merit. The direction of the investigation and the conclusion are interesting; it's good to have people thinking about how many distinct concepts can be packed into our usual embedding dimension. I'm wondering how small you can make the embedding before a model becomes noticeably worse, given a constant parameter count.
The complaint was that the post has a lot of basic inconsistencies which is a problem regardless.
If your content is as bad as AI slop, it doesn't really matter whether it actually is or not, but I think it's safe to assume that when a verbose and grandiose post is internally inconsistent and was written after 2022, it's slop[0]
Anyone interested in a subject can, if they wish, ask an AI about it and get an answer. Your deep conversation with an AI is something fun and insightful only to yourself. You're welcome to do it, but don't pretend it has meaning to anyone else, because if they want to get the same insight on a topic they can do it themselves for the same amount of effort (none).
There's a big difference between the type of errors that humans make when they misunderstand a subject, and the type of errors an LLM makes. I'm not well enough versed in the field to know which type of errors are found in this paper, but it seems that people who are versed in this field feel that these errors are of the latter type.
There's a lot of beautiful writing on these topics on the "pure math" side, but it's hard to figure out what results are important for deep learning and to put them in a form that doesn't take too much of an investment in pure math.
I think the first chapter of [1] is a good introduction to general facts about high-dimensional stuff. I think this is where I first learned about "high-dimensional oranges" and so on.
For something more specifically about the problem of "packing data into a vector" in the context of deep learning, last year I wrote a blog post meant to give some exposition [2].
One really nice approach to this general subject is to think in terms of information theory. For example, take the fact that, for a fixed epsilon > 0, we can find exp(C d) vectors in R^d with all pairwise inner products smaller than epsilon in absolute value. (Here C is some constant depending on epsilon.) People usually find this surprising geometrically. But now, say you want to communicate a symbol by transmitting d numbers through a Gaussian channel. Information theory says that, on average, I should be able to use these d numbers to transmit C d nats of information. (C is called the channel capacity, and depends on the magnitude of the noise and e.g. the range of values I can transmit.) The statement that there exist exp(C d) vectors with small inner products is related to a certain simple protocol to transmit a symbol from an alphabet of size exp(C d) with small error rate. (I'm being quite informal with the constants C.)
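If you want to see the first fact concretely, here is a rough sketch (mine, not from any of the references above): sample N random unit vectors in R^d and look at the largest pairwise inner product in absolute value. For fixed d it grows only on the order of sqrt(log N / d), so N can grow exponentially in d before it crosses a fixed epsilon.

    import numpy as np

    # Rough sketch (mine): random unit vectors in R^d are nearly orthogonal
    # with high probability. For fixed d, the largest pairwise |inner product|
    # among N vectors grows only on the order of sqrt(log(N) / d), so N can be
    # exponential in d before it exceeds a fixed epsilon.
    rng = np.random.default_rng(0)

    def max_abs_inner_product(n_vectors, d):
        v = rng.standard_normal((n_vectors, d))
        v /= np.linalg.norm(v, axis=1, keepdims=True)  # normalize to the unit sphere
        g = np.abs(v @ v.T)
        np.fill_diagonal(g, 0.0)
        return g.max()

    for d in (64, 256, 1024):
        for n in (100, 1000, 4000):
            print(f"d={d:5d}  N={n:5d}  max |<v_i, v_j>| = {max_abs_inner_product(n, d):.3f}")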
In the graph you’re referencing, K = 2 never reaches C = 0.2. K = 3 only reaches C = 0.3 before getting cut off.
I’m not even sure why that would be a problem, because he’s projecting N basis vectors into K dimensions and C is some measure of the error this introduces into mapping points in the N-space onto the K-space, or something. I’m glad to be shown why this is inconsistent with the graph, but your argument doesn’t talk about this idea at all.
I was a little informal with my argument. It's not strictly true that we only see C = 0.2 when K = 2. I was reading what the graph says about the case when N is much greater than k. I'll try to clarify.
C is meant to be the smallest constant so that, for each (N, k, epsilon) with k > C epsilon^-2 log N and epsilon > 0, there exists some arrangement of N vectors in R^k with inner products smaller than epsilon in absolute value. For each (N, k), the author optimized epsilon and reported the empirical value k epsilon^2 / log N, which is the smallest value of C for which our condition holds restricted to the given values of (N, k). (Of course, this is my good-faith interpretation---the article introduces C in the context of a JL distortion bound, and it takes an extra step to turn that into a bound on inner products.)
It can be shown that C = 4 satisfies this condition, when log is the natural log. See [1], for example. Based on the graph, the article claims to do better: "for sufficiently large spaces," it says we can put C = 0.2. This would be a very significant improvement.
For k = 2, the graph shows that C will be lower than 0.2 for sufficiently large N. (The color scheme is confusing! The line for k = 2 is the one that starts at just under 0.8 when N = 1.) Already for k = 3, the graph doesn't give us reason to believe it will be lower than 0.2---you correctly observed it gets to around 0.3. For larger values of k, the graph doesn't seem to show what we can expect for large N: the curves go up, but do not come down. This is what I meant by my comment: the conclusion that C <= 0.2 as N -> infinity is only justified by the behavior at k = 2.
Now, do these results make sense? In the case k = 2, we're trying to put a large number (N) of vectors on the unit circle, and thinking about how small the maximum inner product (epsilon) between any pair of vectors can be. As N -> infinity, the vectors will be packed very tightly and the maximum inner product epsilon will come close to 1. Overall, C = k epsilon^2 / log N will become arbitrarily small. In fact, the same happens for every k.
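To make that concrete, here's a back-of-the-envelope check of the k = 2 case (my own sketch, not the article's code): place N evenly spaced unit vectors on the circle, take epsilon to be the maximum pairwise inner product, and compute the reported statistic C = k epsilon^2 / log N. It falls below 0.2 once N reaches a few tens of thousands, roughly where the graph puts it, and keeps decaying like 2 / log N.

    import numpy as np

    # Sketch of the k = 2 case (mine, not the article's code): N evenly spaced
    # unit vectors on the circle have maximum pairwise inner product
    # eps = cos(2*pi/N), which tends to 1, so the reported statistic
    # C = k * eps**2 / log(N) decays like 2 / log(N) and eventually falls
    # below any fixed threshold.
    k = 2
    for n in (10, 100, 1_000, 10_000, 30_000, 100_000):
        eps = np.cos(2 * np.pi / n)      # inner product of the closest pair
        c = k * eps**2 / np.log(n)
        print(f"N={n:7d}  eps={eps:.5f}  C={c:.3f}")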
So, just in connection to this graph, the article makes three mistakes:
1) The article's interpretation of its experiment is wrong: the graph alone doesn't show that C < 0.2 for "large spaces".
2) However, it should be obvious a priori that, for all values of k, the reported values of C should converge to 0 for large N (albeit very slowly, at a rate of 1/log N).
3) Unfortunately, this doesn't tell us anything about the minimum value of k / log(N) for a given epsilon and k, and so it doesn't support the conclusion of the article.
The problem with this kind of LLM-driven article is that it gives uncareful work the _appearance_ of careful work but none of the other qualities that usually come with care.
If C is the same for all K, then can’t he extrapolate the trend from K = 2 (which you’re right, I did misread) to all other values of K? That seems to be his justification for the claim you’re arguing.
I do think some more rigor is needed here (especially w.r.t. whether his derivation of C makes sense, as you allude to, since it obviously can’t converge to 0), but I hope we can agree that we are well into the “amateur who has trouble presenting as such” territory rather than the “AI slop” territory.
> Our architectural choices are closely aligned with principles observed in biological brains.
How? They point out three design choices: linear attention, MoE layers, and spike coding.
Apparently linear attention is brain-inspired because it can be viewed as a "simplified abstraction of dendritic dynamics with multi-branch morphology." Who knows what that means exactly [1]. They don't discuss it further. MoE layers apparently reflect "a principle of modular specialization." Fine, whatever.
Now, using a dozen attention variants + MoE is bog standard. The real novelty would be spike coding. Page 11 is dedicated to the different ways they could turn signals into spike trains, including such biologically-inspired mechanisms as using two's complement. However, they don't actually do spike coding in a time domain. In their implementation, "spike coding" apparently means to turn activations into integers. Section 3.3.3 claims that this lets us simulate an underlying spiking neural network, so we can validate the spiking approach without using special hardware. But if your SNN can be simulated faithfully on a GPU by turning things into integers, isn't that a bit of a depressing SNN?
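For what it's worth, here is a toy illustration of my reading of Section 3.3.3 (the function names and the scale factor are my own assumptions, not the paper's code): activations get rounded to integers, and each integer can be read as a spike count over some window. Nothing is time-resolved, which is exactly why a GPU can simulate it.

    import torch

    # Toy illustration of my reading (names and scale are my assumptions, not
    # the paper's code): "spike coding" as integer quantization of activations,
    # where each integer can be read as a spike count over some window.
    # Nothing here is time-resolved.
    def spike_encode(x: torch.Tensor, scale: float = 16.0) -> torch.Tensor:
        return torch.round(x * scale).to(torch.int32)   # activations -> integer "spike counts"

    def spike_decode(counts: torch.Tensor, scale: float = 16.0) -> torch.Tensor:
        return counts.to(torch.float32) / scale         # integer counts -> activations

    x = torch.randn(4)
    print(x, spike_encode(x), spike_decode(spike_encode(x)))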
Either I'm missing something, or this is just dressing up standard techniques with loads of meaningless jargon. Of course that’s a very popular way to operate in deep learning nowadays.
[1] Like, attention can draw from multiple tokens, sort of like how different spines of a dendrite can draw from multiple axons? Can’t make this stuff up.
> This blog post has explored the most critical equations in machine learning, from foundational probability and linear algebra to advanced concepts like diffusion and attention. With theoretical explanations, practical implementations, and visualizations, you now have a comprehensive resource to understand and apply ML math. Point anyone asking about core ML math here—they’ll learn 95% of what they need in one place!
It makes me sad to see LLM slop on the front page.
It's just too bombastic for what it is - listing some equations with brief explanation and implementation.
If you don't know these things on some level already, the post doesn't give you much (far from 95%); it's a brief reference to some of the formulas used in machine learning/AI.
This is probably not going to be a very helpful answer, but I sort of think of it this way: you probably have favorite authors or artists (or maybe some you really dislike!) where you could take a look at a piece of their work, even if it's new to you, and immediately recognize their voice and style.
A lot of LLM chat models have a very particular voice and style they use by default, especially in these longer form "Sure, I can help you write a blog article about X!" type responses. Some pieces of writing just scream "ChatGPT wrote this", even if they don't include em-dashes, hah!
Kace's response is absolutely right that the summaries tend to be a place where there is a big giveaway.
There is also something about the way they use "you" in the article itself... E.g. the "you now have a comprehensive resource to understand and apply ML math. Point anyone asking about core ML math here..." bit. This isn't something you would really expect to read in a human-written article. It's a chatbot presenting its work to "you", the single user it's conversing with, not an author addressing their readers. Even if you ask the bot to write you an article for a blog, a lot of the time its response tends to mix in these chatty bits that address the user or directly reference the user's questions / prompts in some way, which can be really jarring when transferred to a different medium w/o some editing
I stopped reading the post before that and went back to check. It's so blatant...especially when it mentions visualizations.
> With theoretical explanations, practical implementations, and visualizations, you now have a comprehensive resource to understand and apply ML math. Point anyone asking about core ML math here—they’ll learn 95% of what they need in one place!
As someone who tended to use "—" in a lot of my writing naturally before, the prevalence of its usage by LLMs frustrates me a lot. I now have to rewrite things that felt natural just so no one will think I'm an LLM.
- lists of complex descriptors in non-technical parts of the writing ("With theoretical explanations, practical implementations, and visualizations")
- the cheery, optimistic note that underlines a goal plausibly derived from a prompt (e.g. "Let’s dive into the equations that power this fascinating field!")
It's not really about the language. If someone doesn't speak English well and wants to use a model to translate it, that's cool. What I'm picking up on is the dishonesty and vapidness. The article _doesn't_ explore linear algebra, it _doesn't_ have visualizations, it's _not_ a comprehensive resource, and reading this won't teach you anything beyond keywords and formulas.
What makes me angry about LLM slop is imagining how this looks to a student learning this stuff. Putting a post like this on your personal blog is implicitly saying: as long as you know some "equations" and remember the keywords, a language model can do the rest of the thinking for you! It's encouraging people to forgo learning.
To summarize: we're making a series of i.i.d. draws from a distribution and asking how many draws N we need to make until we get something larger than our first draw.
Conditional on the value of the first draw, N is geometrically distributed. If we're drawing from an absolutely continuous distribution on the real line, then of course the details of our distribution don't matter: N is a draw from a geometric distribution with rate lambda, where lambda in turn is drawn uniformly from [0, 1]. It follows that N has a thick tail; for example, the expected value of N is the expected value of 1/lambda, which is infinite. In fact, N turns out to have a power law tail.
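A quick simulation, if you want to check this (my own sketch, not from the post):

    import numpy as np

    # Simulation of the setup above (my own sketch): conditional on the first
    # draw u, N is geometric with success probability 1 - u, and marginally
    # P(N = n) = 1 / (n * (n + 1)). The sample mean keeps growing with the
    # sample size, as expected for a distribution with no finite mean.
    rng = np.random.default_rng(0)
    u = rng.random(500_000)               # CDF values of the "first car"
    samples = rng.geometric(p=1.0 - u)    # N | u is geometric with rate 1 - u

    for n in (1, 2, 3, 5, 10):
        print(f"P(N={n}): empirical {np.mean(samples == n):.4f}, exact {1 / (n * (n + 1)):.4f}")
    print("sample mean:", samples.mean())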
However, this isn't true if we're drawing from a distribution that's not absolutely continuous. If you coarse-grain into just "fast" and "slow" cars, then N again has a thin (geometric) tail. More to the point, if we imagine that our queues of cars need to be formed within a finite amount of time, then a car is only added to the queue in front of it if its velocity is epsilon larger than the velocity of the queue, and the problematic situation where lambda -> 0 goes away. In this idealized scenario, I guess you could relate the rate of the exponential tail of N to how long the cars have been travelling for.
Finally, it's worth remembering the old "waiting-time paradox": the variable N we're talking about is not the same as the length of the queue that a randomly selected driver finds themself in. What's the distribution of the latter---the distribution of "experienced" queue lengths? In this post the author computed that P(N = n) = 1/(n(n + 1)). It stands to reason that to get the density of the distribution of experienced lengths we need to multiply by n and divide by a normalizing constant. Unfortunately, you can't multiply 1/(n + 1) by any constant to get a probability distribution, since the sum over n diverges.
What does it mean that the distribution of experienced queue lengths doesn't exist? If you did a huge numerical simulation, you'd find that almost all drivers experience incredibly large queues, and that this concentration towards large queues only becomes more pronounced as you simulate more drivers. If anything, you could argue that the experienced queue length is "concentrated at infinity," although of course in practice all queues are finite.
In little bits of free time I get here and there, I've been working on using reinforcement learning to build some better bots for my favorite multiplayer game. Project is up here: https://github.com/cgadski/autotude
So far all my work has gone into the technical side of setting up the game (a Java app written in 2010) to work as a reinforcement learning environment. The developers were nice enough to maintain the source and open it to the community, so I patched the client/server to be controllable through protobuf messages. So far, I can:
- Record games between humans. I also wrote a kind of janky replay viewer [1] that probably only makes sense to people who play the game already. (Before, the game didn't have any recording feature.)
- Define bots with pytorch/python and run them in offline training mode. (The game runs relatively quickly, like 8 gameplay minutes / realtime second.)
- Run my python-defined bots online versus human players. (Just managed to get this working today.)
It took a bunch of messing around with the Java source to get this far, and I haven't even really started on the reinforcement learning part yet. Hopefully I can start on that soon.
This game (https://planeball.com) is really unique, and I'm excited to produce a reinforcement learning environment that other people can play with easily. Thinking about how you might build bots for this game was one of the problems that made me interested in artificial intelligence 8 years ago. The controls/mechanics are pretty simple and it's relatively easy to make bots that beat new players---basically just don't crash into obstacles, don't stall out, conserve your energy, and shoot when you will deal damage---but good human players do a lot of complicated intuitive decision-making.
[1] http://altistats.com/viewer/?f=4b020f28-af0b-4aa0-96be-a73f0... (Press h for help on controls. Planes will "jump around" when they're not close to the objective---the server sends limited information on planes that are outside the field of vision of the client, but my recording viewer displays the whole map.)
Where x is the final hidden layer of the base model, the idea here is to steer outputs in some direction by adding a vector y. More specifically, y is an exponential moving average over a sequence of vectors W(z_t), where z_t are some sort of context vectors and W is a linear map.
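As far as I can tell from the description, the mechanism is something like this sketch (the class name, shapes, and decay constant are my guesses, not the repo's actual code):

    import torch

    # Sketch of the mechanism as I read it (names, shapes, and the decay
    # constant are my guesses, not the repo's code): keep an exponential
    # moving average of W(z_t) over the context vectors seen so far, and add
    # it to the final hidden state x before the unembedding.
    class SteeringEMA(torch.nn.Module):
        def __init__(self, d_context: int, d_hidden: int, decay: float = 0.99):
            super().__init__()
            self.W = torch.nn.Linear(d_context, d_hidden, bias=False)  # the linear map W
            self.decay = decay
            self.register_buffer("y", torch.zeros(d_hidden))

        def update(self, z_t: torch.Tensor) -> None:
            # exponential moving average over the sequence W(z_1), W(z_2), ...
            self.y = self.decay * self.y + (1 - self.decay) * self.W(z_t)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x + self.y  # steer the final hidden state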
Except, the linear map W is just set to a random initialization, so it won't work for obvious reasons in its current form. (I guess this is why there is no example of its output. I'm guessing it was vibe-coded?) Also, since the intervention is only happening at the last hidden layer, I can't imagine this would really change how the model "thinks" in an interesting way. Like, yeah, you can absolutely make a model talk about dogs by adding in control vector for "dogness" somewhere.
Basically, this method is "inspired by graffiti art of tagging and the neuroplastic nature of living brains" in the same way that taking an exponential moving average of a time series would be "informed by state-space dynamics techniques utilized in deep learning, reservoir computing, and quantum mechanics." Really tired of the amount of insincere/pointless language in deep learning nowadays.
The author said the original liquid paper specifies random starting weights. I think what would happen is you get a bit of a random personality each time you redo the randomization, and then it will self-referentially update over time. I mean you have to start somewhere. You could start with all 1s, I guess, if you’re going to norm.
Update: Even if this is a good idea, and I’m not sure it is, it probably makes sense to have a pretty fast early move away from the random weights, and then slow down.
One way to understand why without writing down the CDF/PDF:
When X is an exponential variable and c is a constant, X + c has the same distribution as X after conditioning on large outcomes. In other words, these two variables have the same "tail." This is true exactly for exponential distributions. (Sometimes this is called "memorylessness.")
Similarly, when U has a uniform distribution on [0, 1] and c is a constant, cU has the same distribution as U after conditioning on small outcomes.
But if cU is distributed like U near 0, then -ln(cU) is distributed like -ln(U) near infinity. And since -ln(cU) = -ln(c) - ln(U), the tail of -ln(U) doesn't change when we add a constant, meaning it must have an exponential distribution.
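If you want a numerical sanity check of this (my own sketch):

    import numpy as np

    # Numerical check of the argument above (my own sketch): -ln(U) is
    # Exponential(1), so conditioning on exceeding c and subtracting c gives
    # back the same distribution (memorylessness).
    rng = np.random.default_rng(0)
    x = -np.log(1.0 - rng.random(1_000_000))   # -ln(U) with U uniform on (0, 1]

    for c in (0.0, 1.0, 2.0):
        tail = x[x > c] - c
        print(f"c={c}: mean = {tail.mean():.3f}, P(tail > 1) = {np.mean(tail > 1):.3f}")
    # Every row should look like Exponential(1): mean near 1, P(> 1) near exp(-1) = 0.368.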
I'm very excited that we're figuring out how to use deep learning on small numbers of data points!
I'm curious about the focus on information compression, though. The classical view of inference as compression is beautiful and deserves more communication, but I think the real novelty here is in how the explicitly "information-constrained" code z participates in the forward pass.
About their overall method, they write:
> It isn’t obvious why such a method is performing compression. You’ll see later how we derived it from trying to compress ARC-AGI.
I must be learning something in my PhD, because the relation with compression _did_ seem obvious! Viewing prediction loss and the KL divergence of a latent distribution p(z) as "information costs" of an implicit compression scheme is very classical, and I think a lot of people would feel the same. However, while they explained that an L2 regularization over model weights can be viewed (up to a constant) as an approximation of the bits needed to encode the model parameters theta, they later say (of regularization w.r.t. theta):
> We don’t use it. Maybe it matters, but we don’t know. Regularization measures the complexity of f in our problem formulation, and is native to our derivation of CompressARC. It is somewhat reckless for us to exclude it in our implementation.
So, in principle, the compression/description length minimization point of view isn't an explanation for this success any more than it explains success of VAEs or empirical risk minimization in general. (From what I understand, this model can be viewed as a VAE where the encoding layer has constant input.) That's no surprise! As I see it, our lack of an adequate notion of "description length" for a network's learned parameters is at the heart of our most basic confusions in deep learning.
Now, let's think about the input distribution p(z). In a classical VAE, the decoder needs to rely on z to know what kind of data point to produce, and "absorbing" information about the nature of a particular kind of data point is actually what's expected. If I trained a VAE on exactly two images, I'd expect the latent z to carry at most one bit of information. If CompressARC were allowed to "absorb" details of the problem instance in this way, I'd expect p(z) to degenerate to the prior N(0, 1)—that is, carry no information. The model could, for example, replace z with a constant at the very first layer and overfit the data in any way it wanted.
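For concreteness, the "information cost" I have in mind is just the usual VAE KL term against the prior, something like this sketch (mine, not CompressARC's code):

    import torch

    # Sketch (mine, not CompressARC's code): the KL "information cost" of a
    # diagonal-Gaussian latent q(z) = N(mu, sigma^2) against the prior N(0, 1).
    # A latent that collapses to the prior costs zero bits and carries no
    # information about the problem instance.
    def kl_cost_bits(mu: torch.Tensor, log_sigma: torch.Tensor) -> torch.Tensor:
        kl_nats = 0.5 * (mu**2 + torch.exp(2 * log_sigma) - 2 * log_sigma - 1).sum()
        return kl_nats / torch.log(torch.tensor(2.0))  # nats -> bits

    print(kl_cost_bits(torch.zeros(8), torch.zeros(8)))     # collapsed to the prior: 0 bits
    print(kl_cost_bits(2 * torch.ones(8), torch.zeros(8)))  # informative latent: ~23 bits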
Why doesn't this happen? In the section on the "decoding layer" (responsible for generating z), the authors write:
> Specifically, it forces CompressARC to spend more bits on the KL whenever it uses z to break a symmetry, and the larger the symmetry group broken, the more bits it spends.
As they emphasize throughout this post, this model is _very_ equivariant and can't "break symmetries" without using the parameter z. For example, if the model wants to do something like produce all-green images, the tensors constituting the "multitensor" z can't all be constant w.r.t. the color channel---at least one of them needs to break the symmetry.
The reason the equivariant network learns a "good algorithm" (low description length, etc.) is unexplained, as usual in deep learning. The interesting result is that explicitly penalizing the entropy of the parameters responsible for breaking symmetry seems to give the network the right conditions to learn a good algorithm. If we took away equivariance and restricted our loss to prediction loss plus an L2 "regularization" of the network parameters, we could still motivate this from the point of view of "compression," but I strongly suspect the network would just learn to memorize the problem instances and solutions.
Thanks for this insightful comment. One of my first questions about this was how it avoids some kind of latent space collapse with such a tiny dataset.
Do you think it's accurate to describe equivariance as both a strength and a weakness here? As in it allows the model to learn a useful compression, but you have to pick your set of equivariant layers up front, and there's little the model can do to "fix" bad choices.
Yeah, I think it's really important to understand how to coax non-equivariant models into being equivariant when needed. I don't think purely equivariant architectures are the way forward.
One example that comes to mind (I don't know much/haven't thought about it much) is how AlphaFold apparently dropped rotational equivariance of the model in favor of what amounts to data augmentation---opting to "hammer in" the symmetry rather than using these fancy equivariant-by-design architectures. Apparently it's a common finding that hard-coded equivariance can hurt performance in practice when you have enough data.
Super incomplete thought: how does this point of view relate to Euler characteristic? Can I get to Euler characteristic by asking how to solve an equation for some quantities q in terms of some quantities Q?
Author of the post here: There is quite a deep connection actually. You can assign a simplicial complex to a partial order P with a max and a min element (0 and 1). Then the Möbius function on P calculates the (reduced) Euler characteristic of that simplicial complex as µ(0, 1) = χ. For example, if the partial order is a power set mereology (a Boolean algebra) on 3 elements, then the associated simplicial complex is a triangle, and µ(0, 1) = µ(∅, {a, b, c}) = (-1)^3, which is the correct answer, as a triangle (without interior) is homeomorphic to a circle.
In a way, calculating quantities q through Möbius inversion is just calculating Euler characteristics, weighted by Q (with some caveats).
In this article, we fix a mereology and a kind of quantity Q that "decomposes" over it---in the sense that Q(p) = sum_{r <= p} q(r) for some function q(r)---and then see that Mobius inversion lets us solve for q in terms of Q. In terms of incidence algebras, we're saying: assume Q = zeta q, as a product of elements in an incidence algebra. Then zeta has an inverse mu, so q = mu Q.
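Here's a tiny sketch of that statement on the power-set mereology from the comment above (my own example, not from the article): build the Boolean lattice on {a, b, c}, let Q(p) be the sum of q(r) over r ⊆ p, and recover q using mu(r, p) = (-1)^(|p \ r|). The same mu gives mu(∅, {a, b, c}) = -1, the reduced Euler characteristic of the hollow triangle.

    from itertools import combinations

    # Sketch on the power-set mereology (my own example, not from the article):
    # Q(p) = sum_{r subset of p} q(r), recovered by Mobius inversion with
    # mu(r, p) = (-1)^(|p \ r|) on the Boolean lattice.
    atoms = ("a", "b", "c")
    subsets = [frozenset(c) for k in range(len(atoms) + 1)
               for c in combinations(atoms, k)]

    q = {s: len(s) ** 2 for s in subsets}                         # some "parts" quantity
    Q = {p: sum(q[r] for r in subsets if r <= p) for p in subsets}

    recovered = {p: sum((-1) ** (len(p) - len(r)) * Q[r]
                        for r in subsets if r <= p) for p in subsets}
    assert recovered == q

    # mu(0, 1) on this lattice is (-1)^3 = -1, the reduced Euler characteristic
    # of the hollow triangle, matching the comment above.
    print((-1) ** len(frozenset(atoms)))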
In other situations, we might want to "solve for" a quantity Q that decomposes over some class of mereologies while respecting some properties. The "simpler" and more "homogeneous" the parts of your mereology, the less you can express, but the easier it becomes to reason about Q. A mereology that breaks me up into the empty set, singleton sets with each of my atoms, and the set of all my atoms admits no "decomposing quantities" besides a histogram of my atoms. An attempt to measure "how healthy I am" in terms of that mereology can't do much. On the other hand, if I choose the mereology that breaks me up into the empty set and my whole, all quantities decompose but I have no tools to reason about them.
I guess Euler characteristic could be an example of how the requirement of respecting a certain kind of mereology can "bend" a hard-to-decompose quantity into a weirder but "nicer" quantity. For example, say we're interested in defining a Q that attempts to "count the number of connected regions" of some object, and we insist on using a mereology that lets us divide regions up into "cells". Of course this is impossible, as we can see in the problem of counting connected components of a graph-like object: we can't get the answer just as a function of the number of vertices and edges. However, if we insist on assigning a value of 1 to "blobs" of any dimension, the "compositionality requirement" forces us to define the Euler characteristic. This doesn't help us much with graph algorithms in general, but gives us an unexpectedly easy way to, say, count the number of blob-shaped islands on a map.