silentbob - LessWrong

After first learning about transformers, I couldn't help but wonder why on Earth this works. How can this totally made-up, complicated structure somehow end up learning how to write meaningful text and having a mostly sound model of our world?

(tl;dr: no novel insights here, just me writing down some thoughts I've had after/while learning more about neural nets and transformers.)

When I once asked someone more experienced, they essentially told me "nobody really knows, but the closest thing we have to an answer is 'the blessing of dimensionality' - with so many dimensions in your loss landscape, you basically don't run into local minima but the thing keeps improving if you just throw enough data and compute at it".

I think this makes sense, and my view on how/why/when deep neural networks work is currently something along the lines of:

- There's some (unknown) minimal network size (or maybe rather a "minimal network frontier", since different architectures end up with different minimal sizes) for every problem you want to solve (given a certain understanding of the problem and of when you consider it solved), so your network needs to be big enough to be able to solve the problem at all.
- The network size & architecture also determine how much training data you need to get anywhere.
- Basically, you try to find network architectures that encode sensible priors about the modality you're working with, priors that are basically always true, while also eliminating a priori useless weights from your network; this way, training lets the network quickly learn important things rather than first having to figure out the priors themselves.
  - For text, you might realize that different parts of the text refer to each other, so you need a way to effectively pass information around, and hence you end up with something like the attention mechanism.
  - For image detection, you realize that the prior of any given pixel being relevant for any other given pixel is higher the closer they are, so you end up with something like CNNs, where the network starts by looking at low-level features and, throughout its layers, successively "converts" the raw pixel data to semantic data.
- In theory, you could probably just use a huge feed-forward network (as long as it's not so huge as to overfit instead of generalizing to anything useful), and it would possibly end up solving problems in similar ways as "smarter" architectures do (but I'm not sure about this). You would, however, need way more parameters and way more training data to achieve similar results, much of which would be wasted on "low quality parameters" that could just as well be omitted.
- So, encoding these modality priors into your network architecture probably spares you orders of magnitude of compute compared to naive approaches.
- While the bitter lesson makes sense, it maybe under-emphasizes the degree to which choosing a suitable network architecture + high quality training data matters?
- Lastly, the question of "which problem you're trying to solve" cannot just be answered on a high level with "I want to minimize loss in next-token prediction"; the exact problem the network solves depends strongly on the training data. Loss minimization is a trade-off between all the things you're minimizing, so the more rambling, gossip, meaningless binary data and so on there is in your training data, the more parameters and training time you'll need just for those, and the less capable the network will be of predicting more meaningful tokens.
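To put a rough number on the "orders of magnitude" point, here is a small back-of-the-envelope sketch comparing the parameter count of one fully connected layer with that of one convolutional layer on the same image. The image size, channel count and kernel size are illustrative assumptions of mine, not figures from any particular model.

```python
# Illustrative layer sizes (assumptions, not taken from any specific model).
HEIGHT, WIDTH, CHANNELS = 224, 224, 3   # input image
OUT_CHANNELS = 64                        # feature maps produced by the layer
KERNEL = 3                               # 3x3 convolution window


def dense_params(height, width, in_ch, out_ch):
    """Weights + biases of a fully connected layer that maps the whole image
    to an output with the same spatial size and out_ch channels."""
    n_in = height * width * in_ch
    n_out = height * width * out_ch
    return n_in * n_out + n_out


def conv_params(kernel, in_ch, out_ch):
    """Weights + biases of a convolution: one small kernel per output channel,
    shared across all image positions."""
    return kernel * kernel * in_ch * out_ch + out_ch


dense = dense_params(HEIGHT, WIDTH, CHANNELS, OUT_CHANNELS)
conv = conv_params(KERNEL, CHANNELS, OUT_CHANNELS)

print(f"dense layer: {dense:,} parameters")
print(f"conv layer:  {conv:,} parameters")
print(f"ratio:       {dense / conv:,.0f}x")
```

The convolution bakes in exactly the "nearby pixels matter most" prior: it simply has no weights connecting far-apart pixels, and it reuses the same few weights everywhere in the image.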

Related to that last point, I recently worked on a small project where you, as the user, play Pong against an AI. That AI is controlled by a small neural network (something on the order of 2 or 3 hidden layers and a few dozen neurons), initialized randomly, so at first it's very easy for the human to win. While you play, though, the game collects your behavior as training data and constantly trains the neural network, which eventually learns to mirror you. So after a few minutes of playing, it plays very similarly to the human and becomes much harder to beat.

One thing I noticed while working on this is that the naive approach to training this AI was far from optimal: much of the training data I collected ended up being pretty irrelevant for playing well! E.g., it's much more important how the paddle moves while the ball is closing in, and almost entirely irrelevant what you do right after hitting the ball. There were several such small insights, leading me to tweak how exactly training data is collected (e.g. sampling it with lower probability while the ball is moving away than when it's getting closer), which greatly reduced the time it took for the AI to learn, even with the network architecture staying the same.
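The kind of tweak described above can be sketched roughly as follows. This is not the actual code from the project; the function names, the feature format and the probabilities (`p_approaching`, `p_receding`) are all hypothetical and only illustrate the idea of recording frames less often while the ball is moving away.

```python
import random


def keep_probability(ball_vx, paddle_side=1, p_approaching=1.0, p_receding=0.2):
    """Probability of recording the current frame as a training sample.

    ball_vx:     horizontal ball velocity (positive = moving right).
    paddle_side: +1 if the human paddle is on the right, -1 if on the left.
    The two probabilities are made-up values for illustration.
    """
    approaching = ball_vx * paddle_side > 0
    return p_approaching if approaching else p_receding


def maybe_record(dataset, features, paddle_move, ball_vx, rng=random):
    """Append (features, paddle_move) to dataset with the probability above."""
    if rng.random() < keep_probability(ball_vx):
        dataset.append((features, paddle_move))


# Simulate 10,000 frames in which the ball alternates direction every frame.
rng = random.Random(0)
dataset = []
for frame in range(10_000):
    ball_vx = 1.0 if frame % 2 == 0 else -1.0
    maybe_record(dataset, features=(frame,), paddle_move=0, ball_vx=ball_vx, rng=rng)

approaching_kept = sum(1 for features, _ in dataset if features[0] % 2 == 0)
receding_kept = len(dataset) - approaching_kept
print(f"kept {approaching_kept} approaching and {receding_kept} receding frames")
```

With these assumed settings, every "ball is closing in" frame is kept, while only about a fifth of the "ball is moving away" frames make it into the training set.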

Notably, this does not necessarily mean the loss curve dropped more quickly - due to me tweaking the training data, the loss curves before and after doing so related to quite different things. The same loss for higher quality data is much more useful than for noisy or irrelevant data.

There are just so many degrees of freedom in all of this that, even if there were no hardware advances at all, it seems very likely that research would be able to come up with faster/cheaper/better-performing models for a long time.

silentbob's Shortform

silentbob15d*330

One thing that confused me about transformers is the question of when (as in, after how many layers) each embedding "flips" from representing the original token to finally representing the prediction of the next token.

By now, I think the answer is simply this: each embedding represents both at the same time (and more). For instance, in GPT-3 there are 12,288 embedding dimensions. At first I thought that all of them initially encode the original token, that after going through all the layers they eventually all encode the next token, and that this shift must happen somewhere in the layers in between. But upon some reflection, what makes much more sense is something very roughly like, say:

some 1000 dimensions encode the original token
some other 1000 dimensions encode the prediction of the next token
the remaining 10,288 dimensions encode information about all available context (which will start out "empty" and get filled with meaningful information through the layers).

In practice, things are of course much less clean, and probably most dimensions will have some role in all these things, to different degrees, as of course all of this is learned through gradient descent and hence will be very noisy and gradual. Additionally, there's the whole positional encoding thing which is also part of the embeddings and makes clear distinctions even more difficult. But the key point remains that a single embedding encodes many things, only one of which is the prediction, and this prediction is always there from the beginning (when it's still very superficial and bad) and then, together with the rest of the embedding, gets refined more and more throughout the layers.

Another misconception I had was that embedding and unembedding are very roughly symmetric operations that just "translate" from token space to embedding space and vice versa^[1]. This made sense in relation to the initial & naive "embeddings represent tokens" interpretation, but with the updated view as described above, it becomes clear that unembedding is rather an "extraction" of the information content in the embedding that encodes the prediction.

One piece of evidence for this updated view is that this paper (thanks to Leon Lang for the hint) found that "Zero layer transformers model bigram statistics". So, indeed, embedding + unembedding alone already perform some very basic next-token prediction. (Admittedly, I'm not sure whether this is only the case when the transformer is trained with zero layers, or also in, say, GPT-3, when you just skip all the layers during inference.)
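As a toy illustration of the zero-layer case, here is a minimal sketch with a three-word vocabulary and two embedding dimensions. The weights are hand-picked by me for the example, not trained and not taken from any real model. With no layers in between, the output depends only on the current token, so the whole model boils down to a lookup table of next-token scores, i.e. bigram statistics.

```python
import math

VOCAB = ["Good", "morning", "day"]

# Hand-picked toy weights (an assumption for illustration; d_model = 2).
W_E = {  # embedding: token -> vector
    "Good": [1.0, 0.0],
    "morning": [0.0, 1.0],
    "day": [0.0, 1.0],
}
W_U = {  # unembedding: logit for an output token = dot(embedding, W_U[token])
    "Good": [-1.0, 2.0],
    "morning": [2.0, -1.0],
    "day": [1.5, -1.0],
}


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(token):
    """Zero-layer 'transformer': embed the token, then unembed it directly."""
    x = W_E[token]  # no attention or MLP layers touch the embedding
    logits = [sum(a * b for a, b in zip(x, W_U[t])) for t in VOCAB]
    return dict(zip(VOCAB, softmax(logits)))


for token in VOCAB:
    dist = next_token_distribution(token)
    print(token, "->", {t: round(p, 2) for t, p in dist.items()})
```

In this toy setup, embedding "Good" and immediately unembedding it puts most of the probability on "morning" and "day" rather than on "Good" itself, which is the asymmetry between embedding and unembedding described above.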

I would guess that transformer-experienced people (unless they disagree with my description - in that case, please elaborate what I'm still getting wrong) will find all of this rather obvious. But for me, this was a major missing piece of understanding, even after once participating in an ML-themed bootcamp and watching all the 3Blue1Brown videos on transformers several times, where this idea either is not directly explained, or I somehow managed to consistently miss it.

^[1] Of course, this is not entirely true to begin with, because the unembedding yields a distribution rather than a single token. But my assumption was that, if you embed the word "Good" and then unembed the embedding immediately, you would get a very high probability for "Good" back, when in practice (I haven't verified this yet) you would probably obtain high probabilities for "morning", "day" etc.

How to title your blog post or whatever

silentbob20d20

It’s sad to admit, but I think there are many good things that simply don’t have good titles.

I've been thinking for many years that bad titles are a common reason for failure, say of movies or video games or other things that are sold where superficial first impressions are important. In the sense of: there are some products out there that would have been orders of magnitude less or more successful, had they gone with a different name.

This seems particularly important to me for anything that has a "viral" element, where people tell their friends about it. A good title most definitely affects the "reproduction number" to some degree. If it sounds cool people may easily be 50% more likely to speak about it than if the name is cringe or confusing or hard to remember or hard to pronounce. If this moves your R from 0.9 to 1.4, that can obviously make a tremendous difference for the trajectory of the thing.
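To make the R example a bit more concrete, here is a small sketch of the arithmetic. It assumes a deliberately simplistic model in which every person who hears about the thing tells R new people on average, starting from 100 initial people and running for 10 "generations"; both of those numbers are my own illustrative choices.

```python
def total_reach(r, seed=100, generations=10):
    """Total number of people reached if `seed` initial people each pass the
    thing on to r new people on average, and so on for `generations` rounds."""
    return sum(seed * r**k for k in range(generations + 1))


for r in (0.9, 1.4):
    print(f"R = {r}: about {total_reach(r):,.0f} people reached")
```

Under these assumptions, R = 0.9 fizzles out at roughly 700 people in total, while R = 1.4 reaches roughly 10,000 and is still growing.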

Progress = Fewer Bad Moments

silentbob23d20

My girlfriend once came across this metaphor of "spiraling upwards". Whatever you're struggling with, you'll have low points again with near certainty, but ideally you have learned something in the meantime that improves some aspect of your situation or your ability to bounce back. I think it's a nice way to look at things when it's true. Generally, dealing with setbacks seems like one of the crucial parts of making progress in any area.

What is your favorite podcast?

silentbob1mo20

Dwarkesh Patel

Most people here probably know it, but for the few of you who don't: in-depth AI podcast with many high-profile guests from AI labs and beyond. Often brings up AI Safety concerns, but the general vibe of the podcast is usually rather somewhere between excited and optimistic. Dwarkesh is quick on his feet and tends to ask many good questions, often "good-faith-challenging" his guests.

He's great at extracting the world views out of his guests and at keeping conversations very engaging even over many hours. My impression is that he vibes well with most guests and gets them to share their views more freely than they would otherwise. Most noteworthy for me were the episodes with Sutskever, Aschenbrenner, gwern, and of course the AI 2027 one with Daniel Kokotajlo and Scott Alexander.

If the above sounds interesting, then consider this a recommendation.

If you consider Mechanize to be net-negative and don't want to support anyone funding them, then please don't consider this a recommendation.

What is your favorite podcast?

silentbob1mo20

The Studies Show

It's entertaining yet refreshingly skeptical of science (in a, you know, rather rational way) and the problems it has. Tears apart many papers, myths and misconceptions. Tom Chivers keeps mentioning Bayes and Scott Alexander. Has some episodes on general scientific & statistical concepts and the major problems in science, as well as many object-level ones on concrete research topics, such as growth mindset, autism, seed oil or IQ. I prefer the latter ones. Spoiler alert: the outcome of most episodes is "we know much less than people think", about pretty much anything.

One weakness of the show may be that they're possibly erring too much on the "there may be some evidence for X but can we really tell? Actually, nobody really knows and it's all just guessing based on a bunch of very flawed studies" side. Occasionally the hosts seem a bit less well prepared than they could be. Still, on the majority of topics, I find their episodes rather enlightening. Another plus is that they have some episodes on their past mistakes on the podcast (of which there are indeed quite a few).

If you're a bit cynical and enjoy two witty Brits making fun of bad science while learning a few things about the state of research, you might enjoy this one.

What is your favorite podcast?

silentbob1mo*50

The Clearer Thinking podcast

I like how it explores a variety of important topics deeply without becoming less relevant even after so many episodes. It has a good length of ~60-90 minutes per episode. Spencer's questions are often great, plus he tends to bring his own insights and perspectives to the table that add a lot.

The episodes that I learned the most from were probably the ones on different psychological conditions, such as talking to a narcissist, a sociopath, someone with borderline, or to a victim of sexual abuse.

People who are interested in rational discussions of science, psychology, mental health, ethics etc. probably have a good shot at getting something out of the Clearer Thinking podcast.

silentbob's Shortform

silentbob2mo90

For a long time, I used to wonder what causes people to consistently mispronounce certain words even when they are exposed to many people pronouncing them correctly. (This mostly applies to people speaking a non-native language, e.g. people from continental Europe speaking English.)

Some examples that I’ve heard from different people around me over the years:

Saying “rectangel” instead of “rectangle”
Saying “pre-purr” (like prefer, but with a p) instead of “prepare”
Saying something like, uhh, “devil-oupaw” instead of “developer”
Saying “leech” instead of “league”
Saying “immu-table” instead of “immutable”
Saying "cyurrently" instead of "currently"

I did, of course, understand that if you only read a word, particularly in English where pronunciations are all over the place and often unpredictable, you may end up with a wrong assumption of how it's pronounced. This happened to me quite a lot^[1]. But then, once I did hear someone pronounce it, I usually quickly learned my lesson and adopted the correct way of saying it. But still I've seen all these other people stick to their very unusual pronunciations anyway. What's up with that?^[2] Naturally, it was always too awkward for me to ask them directly, so I never found out.

Recently, however, I got a rather uncomfortable insight into how this happens when a friend pointed out that I was pronouncing "dude" incorrectly, and had apparently done so all my life, without anyone ever informing me about it, and without me noticing it.

So, as I learned now, "dude" is pronounced "dood" or "dewd". Whereas I used to say "dyood" (similar to duke). And while I found some evidence that dyood is not completely made up, it still seems to be very unusual, and something people notice when I say it.

Hence I now have the, or at least one, answer to my age-old question of how this happens. So, how did I never realize? Basically, I did realize that some people said "dood", and just took that as one of two possible ways of pronouncing that word. Kind of, like, the overly American way, or something a super chill surfer bro might say. Whenever people said "dood" (which, in my defense, didn't happen all that often in my presence^[3]) I had this subtle internal reaction of wondering why they suddenly saw the need to switch to such a heavy accent for a single word.

I never quite realized that practically everyone said "dood" and I was the only "dyood" person.

So, yeah, I guess it was a bit of a trapped prior and it took some well-directed evidence to lift me out of that valley. And maybe the same is the case for many of the other people out there who are consistently mispronouncing very particular words.

But, admittedly, I still don't wanna be the one to point it out to them.

And when I lie awake at night, I wonder which other words I may be mispronouncing with nobody daring to tell me about it.

^[1] E.g., for some time I thought "biased" was pronounced "bee-ased". Or that "sesame" was pronounced "see-same". Whoops. And to this day I have a hard time remembering how "suite" is pronounced.
^[2] Of course one part of the explanation is survivorship bias. I'm much less likely to witness the cases where someone quickly corrects their wrong pronunciation upon hearing it correctly. Maybe 95% of cases end up in this bucket that remains invisible to me. But still, I found the remaining 5% rather mysterious.
^[3] Maybe they were intimidated by my confident "dyood"s I threw left and right.

Against podcasts

silentbob2mo20

or can read interview transcripts in much less time than listening to a podcast would take.

This always baffles me. :) Guess I'm both a slow reader and a fast listener, but for me audio allows for easily 3x as much speed as reading.

How To Believe False Things

silentbob2mo70

So what made you change your mind?
