Time will tell, and it's an inevitable trend that more devs will close-source their code, no matter what curious angle you take to justify large firms using AI to exploit it for money, and I doubt you work for one of them.
People wear different shoes; there is simply no single truth. What's the difference between your gratitude for problems / inspiring thankfulness and this guy's "bitching", when you think your reality is 'the' right one? I wish you good luck for the rest of your life, stranger.
I'd be interested to know whether that occurred pre- or post- Zygote.jl starting to use ChainRules.jl internally. Zygote's own rules used to be tested with a half-baked finite differences scheme, and it was easy to miss bugs. All rules in ChainRules are tested using the testing helpers in ChainRulesTestUtils.jl, which are very thorough.
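For a flavour of what that testing looks like, here's a minimal sketch (the `myscale` function and its rule are made up for illustration; `test_rrule` from ChainRulesTestUtils.jl checks a hand-written rule against finite differences):

```julia
using LinearAlgebra: dot
using ChainRulesCore, ChainRulesTestUtils

# Toy function with a hand-written reverse-mode rule (illustrative only).
myscale(a, x) = a .* x

function ChainRulesCore.rrule(::typeof(myscale), a::Real, x::AbstractVector)
    y = myscale(a, x)
    function myscale_pullback(ȳ)
        # cotangents for (the function itself, a, x)
        return NoTangent(), dot(ȳ, x), a .* ȳ
    end
    return y, myscale_pullback
end

# Compares the hand-written pullback against finite differences,
# including the types and structure of the returned tangents.
test_rrule(myscale, 2.0, [1.0, 2.0, 3.0])
```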
Also, some apparent bugs are not actually bugs. For example, the gradient of some function wrt the elements of a symmetric matrix is, and should be, different depending on whether the matrix is structurally symmetric or not. Or, when a function is not differentiable in the calculus sense, different ADs often have different conventions for what they will return. These cases come up all the time.
> For example, the gradient of some function wrt the elements of a symmetric matrix is, and should be, different depending on whether the matrix is structurally symmetric
Ugh. Yes and no, but mostly no.
A gradient is a collection of partial derivatives. The usual notion of a partial derivative with respect to the elements of a symmetric matrix does not actually exist, because you can't vary them independently. This is a generic problem for any overparameterized constrained system. People still want to be able to use the standard tools of calculus, though. You can reasonably do things like use Lagrange multipliers to do constrained optimizations. Or you can express a symmetric matrix in terms of only the n(n+1)/2 independent entries, treat them as a vector, and take the gradient over that. (This is often what is done without the underlying method being described; it's the so-called "symmetric gradient".) But it's generally not what you actually want, even though it can lead to the same stationary points.
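To make that "symmetric gradient" concrete (the notation here is mine): let G be the unconstrained gradient obtained by treating all n^2 entries of A as independent. Parameterizing by the n(n+1)/2 entries a_ij with i <= j and applying the chain rule gives ∂f/∂a_ij = G_ij + G_ji for i ≠ j, and ∂f/∂a_ii = G_ii, i.e. the off-diagonal contributions get symmetrized (doubled, when G is already symmetric) while the diagonal is left alone.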
The best you can do is the unconstrained gradient restricted to the manifold of symmetric matrices (or whatever constrained object you are interested in). But even this is a (subtly) different thing from the gradient in the ambient space, even though it works in as large a set of cases as possible. Happily, for symmetric differentials applied to symmetric matrices, the naive formulas obtained by symmetrizing the gradient just work.
My point is I think simpler than what you're talking about, so let's replace "symmetric" with "diagonal." Suppose I have a structurally diagonal matrix, represented by a type `Diagonal` that can only represent diagonal matrices, and then a non-structurally diagonal matrix, represented by a normal dense matrix filled with zeros except on the diagonal.
I pass this as input to some function that uses more than just the diagonal and compute the gradient (here a matrix). In general, for the dense matrix, the resulting gradient will not be diagonal. Why should it be? If your function doesn't care if the matrix is diagonal, neither does the AD.
But what about for `Diagonal`? Well, the AD could handle this in a few different ways. First, it could interpret all matrices as embedded in the dense, unconstrained matrices, in which case it would return a dense, non-diagonal matrix. Or it could ignore that `Diagonal` represents a matrix entirely and instead return some nested structure containing the gradients of its stored fields, here the gradient of the stored diagonal vector. Or it could represent that nested structure as just another `Diagonal`.
There are various reasons why an AD might do any one of these; none of them is right or wrong. But note that the only option that gives you the same answer as the dense case is the one that promotes to a dense input, which basically discards all the structure to begin with.
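A minimal sketch of what that looks like in practice, using Zygote's `gradient` for concreteness (the exact structure you get back for the `Diagonal` input depends on the Zygote/ChainRules versions involved, which is rather the point):

```julia
using LinearAlgebra, Zygote

# A function that looks at every entry, not just the diagonal.
f(A) = sum(A)

# Dense matrix that merely happens to be diagonal:
Ad = [1.0 0.0; 0.0 2.0]
Zygote.gradient(f, Ad)[1]
# -> something equivalent to a full 2x2 matrix of ones;
#    nothing forces it to be diagonal.

# Structurally diagonal matrix:
As = Diagonal([1.0, 2.0])
Zygote.gradient(f, As)[1]
# -> depending on the AD version, this may come back as a Diagonal,
#    as a field-wise named tuple like (diag = [1.0, 1.0],), or
#    (if everything is embedded in dense matrices) as a dense matrix
#    of ones -- exactly the three options described above.
```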
Most ADs never have to deal with matrix polymorphism, so their user bases don't need to think about this. But Julia has _many_ polymorphic matrix types, so Julia ADs have to take this seriously, and if new users naively pass a polymorphic matrix type as input to some `gradient` function, they might be perplexed by what they get back and think the AD engine has a bug. See the ChainRules talk at EuroAD 2021 for more details: https://youtu.be/B3bC49OmTdk?t=761
Ah, yes. That is a different point than the one I thought you were making.
Both the symmetric and diagonal cases are lucky ones where the constrained derivative has the same shape and size as the original (as the constrained manifolds are flat). Keeping it as a `Diagonal` in code then just works.
This is in marked contrast to "unit vectors", "orthogonal matrices", or similar non-flat manifolds. The derivatives in these cases, although locally of the same dimension as the manifold, usually must be represented as something that spans the entire containing space once you deal with derivatives at multiple points.
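To put a formula on that (notation mine): the tangent space to the orthogonal matrices at a point Q is { QΩ : Ωᵀ = -Ω }, which has dimension n(n-1)/2, but as a subspace of all n×n matrices it moves with Q. So there is no fixed structured type, the way `Diagonal` is for diagonal matrices, whose elements can hold the tangents at every point; in code they typically end up stored as ordinary dense matrices.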
Yes, precisely. It gets complicated very quickly. Cotangents for structural matrices in reverse-mode AD tend to wander away from the cotangent space at the corresponding point into the tangent space of the embedding, for various reasons. Any time they wander away, there's the potential for huge performance losses. We routinely use projections to project them back to the correct cotangent space, but this is at best a band-aid, and we could certainly use a better solution.
For the interested, this arises from implicitly embedding in the forward pass (e.g. multiplying an orthogonal matrix by a vector is implicitly the same as making the matrix dense first and then multiplying), which then requires some sort of projection in the reverse pass. If your AD operates at the scalar level, you get this for free, but such ADs don't make full use of optimized BLAS routines. You can get that by writing custom AD rules, but then in the reverse pass you usually end up in some embedded cotangent space and need a projection to set things right. There are definitely trade-offs.
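For anyone curious what the projection band-aid looks like concretely, ChainRulesCore exposes this as `ProjectTo` (a minimal sketch with made-up values; I'm assuming this is the machinery being referred to):

```julia
using LinearAlgebra
using ChainRulesCore: ProjectTo

# Primal value with structure:
A = Diagonal([2.0, 3.0])

# A cotangent that has wandered into the dense embedding:
dA_dense = [1.0 5.0; 5.0 4.0]

# Build a projector from the primal and pull the cotangent back
# onto the structured (co)tangent space:
project = ProjectTo(A)       # remembers that A was a Diagonal
project(dA_dense)            # -> Diagonal([1.0, 4.0])
```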
Thanks for the references. We're trying to write up a paper about all this work, and (to us) it seems obvious that differential geometry is the right framework.
What reverse mode AD talks informally about as "gradients" are cotangents, and if you are constrained to a submanifold of another manifold (such as the space of all symmetric matrices, embedded in the space of all NxN matrices) then your cotangent is a projection of that in the larger space. This reduces to the obvious thing for symmetric matrices, as you say, but there are less obvious AD-relevant cases (like the space of orthogonal matrices, or the space of "ranges", vectors with uniformly spaced elements) where getting it right with your bare hands is tricky.
I don't see any of the relevant terminology in these links, but they do seem to be thinking about related problems. Perhaps they have re-invented the wheel? Amused to read that standard cookbooks seem to have faithfully reproduced the recipe for a square wheel, for decades.
Have you ever thought about how and why the “it’s not my job” attitude gets fostered? Most people, myself included, on their first day of work would never have thought they would end up, at some point, as the people they despised so much.
I have had a similar career experience. In time, I’ve decided to accept that I’m much more like those people than I wanted to admit back when I felt I was a young and hungry “disruptor”.
I am much more a product of the systems than I wanted to accept. That’s the most mature perspective I’ve had to date - and it has led to more empathy than I thought I was capable of.
The reason you didn’t like those people is that somewhere deep inside you knew it could be you in ten years. I had the same thoughts, and came to the same insight.
It’s a lot like how you dislike the people who remind you of your bad qualities.
Like most problems, I think there’s the actual issue at hand and also the feelings it fosters. Feeling underappreciated is the #1 complaint among employees, I believe, and even outranks salary. There’s probably a balance: sometimes the employer isn’t appreciative, and sometimes the employee overinflates their contribution.
> Maybe you're a person who works to live? If so, embrace it.
> Those who enjoy their work are often the opposite: they live to work.
Huh. Squinting at this from exactly the right angle, that can kinda read as though the very idea of living to work is paradoxically catastrophic: at once fulfilling, shallow, and existentially meaningless. As though work-life balance were interpreted backwards, or something.
I've obviously been looking at "do what you love" positively for a while now, but now I really wonder... maybe working to live, and enjoying what you do just a little bit, but not completely, is the ideal balance.
Hmm. Now I realize that not being entirely consumed by work, and not finding my identity in that loss of self, also has the benefit of making emotional and mental job flexibility easier - and in tech, that's a necessity.
WSL is only one option. As noted in another comment here:
The guide is split into separate sections for clang (cross-compiling), MSVC, clang-cl, and WSL. Compiled Swift binaries can be native Windows executables when using the non-WSL compilers.
I'm a pretty big fan of anime, but I still don't see the point in randomly sprinkling internet memes or other random television screenshots in blog posts. At first, I thought it might be a refreshing change of pace, but now I just feel like it detracts from the article.