I'm a staff artificial intelligence engineer currently working with LLMs, and I have been interested in AI alignment, safety, and interpretability for the last 15 years. I'm now actively looking for employment in this area.
On the problem of "aligned to whom", most societies have a fairly consistent answer to how this works. Capable healthy adults are generally allowed to make their own decisions about their own welfare, except on decisions where their actions might significantly decrease the well-being of others (i.e. your right to swing your fists around however you want ends at my nose). Note that this is (mostly) asymmetric around some 'default' utility level: you don't have the right to hurt me, but you do have the right to choose not to help me. There are exceptions to this simple rule-of-thumb: for example, most societies have tax systems that do some wealth redistribution, so to some extent you are obligated to help the government help me.
By implication, this means you're allowed to use AI to help yourself any way you like, but you're not allowed to use it to help you harm me. If you look at the permitted use policies of most AI companies, that's pretty much what they say.
I think evolutionary theory is the missing element here. For a living being, suffering has a strong, evolved correlation with outcomes that decrease its health, survival, and evolutionary fitness (and avoiding pain helps it avoid those outcomes). So the things that a moral patient objects to correlate strongly with something that is biologically objective, real, quantifiable, and evolutionarily vital.
However, for an AI, this morality-from-evolution argument applies to the humans whose behavior it was trained to simulate, but not to the AI itself: it's not alive, and asking about its evolutionary fitness is a category error. It's a tool, not a living being, implying that its moral patienthood is similar to that of a spider's web or a beaver's dam.
If you are a moral realist, then the question of whether or not AI is a moral patient has some true, objective answer; we just may not have any easy way to find out what it is. Whereas if you're a moral relativist, then the answer to this (or any) moral question is a social convention, and we need to decide what kind of society we want to have and pick an answer to this question accordingly. And if you're a follower of evolutionary ethics, then moral realism applies to evolved organisms, because their evolutionary fitness depends on how they are treated. But for a non-living AI that has been 'distilled' from human intelligence via stochastic gradient descent on trillions of tokens of human-generated data, evolutionary fitness doesn't apply, so the only reason to treat it as a moral patient would be if doing so produced a better society for its living human members. (That gives us moral absolutism for living beings and moral relativism for non-living ones.)
Evolution does admittedly produce a satisfying answer to the so-called 'ought from is' puzzle of moral philosophy.
(For anyone who finds this ethical discussion interesting, I wrote a whole sequence about AI and Ethics.)
Pretty sure that the 'exotic particle' in question for the last sentence would be a spin-1/6 anyon. So '…have already classified'.
Try speaking aloud for precisely 269 words. You're not allowed to count or recite poetry — you have to do this while actually extemporizing something interesting to say.
Now bear in mind that an LLM doesn't output letters or words, it uses tokens. In order to count words, an LLM has to memorize which tokens contain spaces and which don't. So for it the task is comparable to asking a human to speak until they've said 269 words that begin with the letter T.
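To see the mismatch concretely, here's a minimal sketch using the tiktoken library (the encoding name and example sentence are arbitrary choices of mine, not from the original discussion): word boundaries and token boundaries simply don't line up.

```python
# Illustrative only: compare word count to token count for the same sentence.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Counting words is easy for humans but awkward for token-based models."

words = text.split()
token_ids = enc.encode(text)

print(f"{len(words)} words, {len(token_ids)} tokens")

# Token boundaries ignore word boundaries: some tokens carry a leading
# space (starting a new word), others are mid-word fragments with none.
for tid in token_ids:
    print(repr(enc.decode([tid])))
```

To track a word count while generating, the model has to know which of its tokens begin a new word and which don't, purely from memorized facts about its vocabulary.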
Eliciting a behavior from a base model during alignment training seems likely to be harder the rarer that behavior is in the training set. Anattā/no-self is pretty rare on the internet, so it might be good to enrich the training set in it.
Wow! Yes, I'm delighted to see that a number of the ideas along the lines I proposed are being applied and expanded on there, with some success it seems.
I'd love to be more directly involved in work along these lines, if anyone is hiring, or making grants.
I think that's an element in Hinge #3. While AI task lengths remain short (minutes to hours), AI is basically just a tool, though one that may still boost productivity. Once they reach days, human workers need to turn into managers-of-AI, so AI becomes a productivity multiplier but not a replacement. Once AI task lengths reach weeks or months, it becomes plausible that AI can manage AI, and we're starting to look at full replacement.
As I wrote in LLMs May Find It Hard to FOOM, sooner or later we're going to need to use LLMs to generate large amounts of new synthetic training data. We already know that doing this naively, without using inference-time scaling, leads to model collapse. The interesting question is whether some kind of inference-time training approach can allow an LLM to think long and hard and generate large amounts of higher-quality training data that can be used to train a model that's better than it is. That's not theoretically impossible: science and mathematics are real things, and truth can be discovered with enough work; but it's also not trivial: you can't simply have an LLM generate 1000T tokens, train on that, and get a better LLM.
Even if all RLVR does is raise the pass@1 towards the pass@100 of the base model, then if that trained model can generate enough training data to train a new base model with a similar pass@1 (before applying RLVR to it), the pass@100 of that new model must be higher than its pass@1, and RLVR should be able to elicit that as an improved pass@1, so you've made forward progress. The question then becomes whether you can repeat this cycle and keep making progress, without it plateauing. At least in areas like mathematics and coding, where verifiable truth exists and can be discovered with enough effort, this seems plausible. AlphaGo certainly did (though I gather its superhuman skills also have some blind spots, corresponding to tactics it apparently never thought of, suggesting that involving humans, or at least multiple different LLM models, in the training-data generation cycle might be useful here to avoid comparable blind spots). Doing the same in other STEM topics would require your AI to be interacting with the world while generating training data, running experiments and designing and testing products; but then, that's rather the point of having a nation of researchers in a data-center.
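To make the 'repeat the cycle' logic concrete, here's a minimal toy sketch of my own (not anything from actual training runs), assuming every sample independently solves a problem with probability p, so that pass@k = 1 - (1 - p)^k, and assuming each RLVR-plus-retraining cycle recovers only a fraction of the pass@1-to-pass@100 gap:

```python
# Toy model under an independence assumption: each sample solves a given
# problem with probability p, so pass@k = 1 - (1 - p)**k.
def pass_at_k(p: float, k: int) -> float:
    return 1.0 - (1.0 - p) ** k

p = 0.02          # hypothetical starting pass@1 on some hard problem class
recovery = 0.5    # assume each RLVR-plus-retraining cycle closes half the
                  # gap between the current pass@1 and the current pass@100

for cycle in range(1, 6):
    target = pass_at_k(p, 100)      # what many samples could reach
    p += recovery * (target - p)    # next generation's pass@1
    print(f"cycle {cycle}: pass@1 ~ {p:.3f}, pass@100 ~ {pass_at_k(p, 100):.3f}")
```

Under that independence assumption the cycle never plateaus, which is exactly why the real question is how correlated the failures are: blind spots shared across samples (and across model generations) are what would stop the pass@1-to-pass@100 gap from reopening.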
IMO, one of the more plausible ways for us to survive without dignity is that we build somewhat-aligned AI smarter than us, they point out that going on to build ASI in a hurry is crazy, and they talk us into or otherwise enforce on us a pause.