> Microsoft bought it for OpenAI only, to train Copilot on the vast amount of code.
I think this gets the timeline wrong. Microsoft acquired GH in 2018 and started the partnership with OpenAI in summer 2019.
I'm sure there was some strategy to extract value from it that wouldn't serve its users but I think OpenAI was not initially meant to be the beneficiary.
Maybe MS just got extremely lucky, like winning-the-lottery-lucky.
Your timeline is off, though: their partnership started in 2016[1]. In 2019 MS started to invest publicly in OpenAI, but by then they already had some history.
To me, this is at least suspicious. Granted, I have no hard proof.
While I agree that we keep reinventing stuff, in CS doesn't the ease of creating isomorphisms between different ways of doing things mean that canonicalization will always be a matter of some community choosing their favorite form, perhaps based on aesthetic or cultural reasons, rather than anything "universal and eternal"?
We can still speak of equivalence classes under said isomorphisms and choose a representative from each, up to the aesthetic preferences of the implementor. But we are nowhere near finding those equivalence classes or isomorphisms between real representations, because the things being compared are probably not equivalent at all, thanks to all the burrs and rough corners of incidental (non-essential) complexity.
I worked for a startup that used Clojure and found it so frustrating because, following the idiomatic style, pathways passed maps around, added keys to maps, etc. For any definition that received some such map, you had to read the whole pathway to understand what was expected to be in it (and therefore how you could call it, modify it, etc.).
I think the thing is that yes, `[a] -> [a]` tells you relatively little about the particular relationship between lists that the function achieves, but in other languages such a signature tells you _everything_ about:
- what you need in order to invoke it
- what the implementation can assume about its argument
i.e. how to use or change the function is much clearer
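As a sketch of that point (in Python rather than Haskell, with a hypothetical function `f` standing in for any `[a] -> [a]`):

```python
from typing import TypeVar

A = TypeVar("A")

def f(xs: list[A]) -> list[A]:
    # Parametricity: the body cannot inspect or fabricate values of type A,
    # so every element of the result must come from the input list.
    return xs[::-1]  # e.g. a reversal; any rearrangement/selection is fine

# The caller knows any list works, regardless of element type.
print(f([1, 2, 3]))
print(f(["a", "b"]))
```

The signature alone pins down the calling convention and what the body may assume, even though it says nothing about *which* list-to-list relationship is computed.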
I think the pipeline paradigm you speak of is powerful, and some of the clarity issues you describe can be mitigated through clear and consistent use of keyword destructuring in function signatures. Function naming conventions ('add-service-handle', etc.) and grouping functions with additive dependencies into threading forms can also address these frustrations.
Do publishers really have fact-checkers? My understanding was that support for authors is now relatively minimal, even for established authors, and no one really has the time or resources to second-guess everything an author has claimed. I take as a key example Naomi Wolf learning after her book was "done" that a significant chunk of it was based on a misunderstanding of an admittedly confusing 19th century British legal phrase.
https://nymag.com/intelligencer/2019/05/naomi-wolfs-book-cor...
I think maybe the idea of a single author spending months or years on their research, which they then publish as a single bound and polished work, is misguided -- an academic trying to do similar work across multiple articles would have gotten peer review on each article, and hopefully would not have spent so much time working under a correctable misunderstanding.
Fact checking as a separate job is more for journalism than books. But editors have fact checking as part of their jobs. (It is not copy-editing, which is a different job.)
Many nonfiction authors will hire a fact checker separately. They don't want to look like they missed something. Errors still happen, of course.
This paper describes finding security-related concepts and using them to steer at generation time. While this is an interesting contribution on its own, the approach could also be applied to a range of other concerns -- e.g., can we use it to steer away from performance problems? Can we make LLM code generation anticipate maintainability or readability issues?
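A toy sketch of the general difference-of-means steering idea (all names and data here are made up; a real implementation would hook a transformer layer's residual stream rather than use random vectors):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hidden activations collected on prompts that do vs. don't
# exhibit the concept (e.g. insecure vs. secure code), one row per prompt.
acts_concept = rng.normal(0.0, 1.0, size=(32, 64)) + 0.5
acts_baseline = rng.normal(0.0, 1.0, size=(32, 64)) - 0.5

# Difference-of-means direction for the concept, normalized.
direction = acts_concept.mean(axis=0) - acts_baseline.mean(axis=0)
direction /= np.linalg.norm(direction)

def steer(hidden_state, direction, alpha=-4.0):
    """Add a scaled concept vector to a hidden state at generation time.
    Negative alpha steers *away* from the concept."""
    return hidden_state + alpha * direction

h = rng.normal(size=64)
h_steered = steer(h, direction)
```

Swapping in a "slow code" or "hard to read" concept direction is exactly the kind of reuse the parent comment is asking about; the mechanism itself is concept-agnostic.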
If people want to try untested peptides, I think society should use that as the engine to _test those peptides_. Instead of buying something that's supposed to be, but may not be, the peptide you want, you should pay 50+k% + data and get something that has a 50% chance of being the peptide and a 50% chance of being a placebo, and you're _required_ to submit a report about effects and side effects before you can get a refill.
Rather than complain about how these things have not yet gone through real experiments and are marketed as having been "studied" rather than "effective", I would love to see society use the obvious demand for some of these to actually test them.
So I'm actually confused that in the little image of his run in the article it seems he's often making absolute progress in the opposite direction the ship is going for part of each lap. Like, was the ship going unusually slowly?
In their little algorithm box on Chain Distillation, they have at step 2b some expression that involves multiplying and dividing by `T`, and then they say "where α = 0.5, T = 1.0".
I think someone during the copy-editing process told them this needed to look more complicated?
tl;dr it makes sense once you see there are hidden softmaxes in there; it's just the explicit formula written out and then applied with the common param values
So CE is cross-entropy and KL is Kullback-Leibler, but then division by T is kind of silly there since it falls out of the KL formula. So considering the subject, this is probably the conversion from logits to probabilities as in Hinton's paper https://arxiv.org/pdf/1503.02531
But that means there's a hidden softmax there not specified. Very terse, if so. And then the multiplication makes sense because he says:
> Since the magnitudes of the gradients produced by the soft targets scale as 1/T^2, it is important to multiply them by T^2 when using both hard and soft targets.
I guess to someone familiar with the field they obviously insert the softmax there and the division by T goes inside it but boy is it confusing if you're not familiar (and I am not familiar). Particularly because they're being so explicit about writing out the full loss formula just to set T to 1 in the end. That's all consistent. In writing out the formula for probabilities q_i from logits M_k(x)_i:
q_i = exp(M_k(x)_i / T) / sum_j exp(M_k(x)_j / T)
Hinton says
> where T is a temperature that is normally set to 1. Using a higher value for T produces a softer probability distribution over classes.
And then they're using the usual form of setting T to 1. The reason they specify the full thing is just because that's the standard loss function, and it must be the case that people in this field frequently assume softmaxes where necessary to turn logits into probabilities. In this field this must be such a common operation that writing it out just hurts readability. I would guess one of them reading this would be like "yeah, obviously you softmax, you can't KL a vector of logits".
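Here's a minimal numpy sketch of the loss as I read it, with the softmax-at-temperature written out explicitly (function names are mine; α and T defaults match the paper's "α = 0.5, T = 1.0"):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Softmax at temperature T: q_i = exp(z_i / T) / sum_j exp(z_j / T)."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # shift for numerical stability; doesn't change the result
    e = np.exp(z)
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, hard_label, alpha=0.5, T=1.0):
    # Soft targets: both logit vectors go through the "hidden" softmax at T.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = np.sum(p * (np.log(p) - np.log(q)))
    # Hard-label cross-entropy, always at T = 1.
    ce = -np.log(softmax(student_logits)[hard_label])
    # T**2 compensates for soft-target gradients scaling as 1/T**2 (Hinton).
    return alpha * ce + (1 - alpha) * (T ** 2) * kl
```

With T = 1 the temperature division is a no-op, which is exactly why writing it out and then setting T = 1.0 looks over-complicated until you know the convention.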
Good question. I just sort of skipped over that when reading but what you said made me think about it.
I get that the norms lean conservative and that's a good thing. But if someone says you should do a recall and the actual lab tests saying whether your product actually has toxin-producing bacteria haven't finished running yet, I can understand the desire to wait until the evidence is in.
They've got some evidence: 7 known cases across three states, all linked to the same product. The history of problems from this producer makes it seem more likely to be true. A lot of companies would rather have their customers throw away the product and buy it again from a different batch than risk customers getting violently sick, or dying, from their food. People who get sick and survive can end up with a very strong aversion to the brand and/or product going forward, and voluntarily recalling the product just to be safe is good from a PR stance, since it looks like you actually care about your customers.
I think some of it was just a belief that work you can see being done by a floor of people talking with their mouths and looking at screens in the same room is more real than the slightly less visible conversations in slack while looking at screens in their own rooms.
Open plan offices continue to be designed more for seeing the work happen than for doing the work. I spend a lot of mental energy on ignoring the distractions around me. No job has ever offered me a private office with a door that closes in exchange for being in the office 5 days a week.