I'm the chief scientist at Redwood Research.
I've actually written up a post with my views on open weights models and it should go up today at some point. (Or maybe tomorrow.)
Edit: posted here
Many deployed AIs are plausibly capable of substantially assisting amateurs at making CBRN weapons (most centrally bioweapons) despite not having the safeguards that this capability level is supposed to trigger. In particular, I think o3, Gemini 2.5 Pro, and Sonnet 4 are plausibly above the relevant threshold (in the corresponding company's safety framework). These AIs outperform expert virologists on the Virology Capabilities Test (which is perhaps the best test we have of the most relevant capability), and we don't have any publicly transparent benchmark or test which rules out CBRN concerns. I'm not saying these models are likely above the thresholds in the safety policies of these companies, just that it's quite plausible (perhaps 35% for a well-elicited version of o3). I should note that my understanding of CBRN capability levels is limited in various ways: I'm not an expert in some of the relevant domains, so much of my understanding is secondhand.
The closest test we have which might rule out CBRN capabilities above the relevant threshold is Anthropic's uplift trial for drafting a comprehensive bioweapons acquisition plan. (See section 7.2.4.1 in the Opus/Sonnet 4 model card.) They use this test to rule out ASL-3 CBRN for Sonnet 4 and Sonnet 3.7. However, we have very limited public details about this test (including why they chose the uplift threshold they did, why we should think that low scores on this test would rule out the most concerning threat models, and whether they did a sufficiently good job on elicitation and training participants to use the AI effectively). Also, it's not clear that this test would indicate that o3 and Gemini 2.5 Pro are below a concerning threshold (and minimally it wasn't run on these models to rule out a concerning level of CBRN capability). Anthropic appears to have done the best job handling CBRN evaluations. (This isn't to say their evaluations and decision making are good at an absolute level; the available public information indicates a number of issues and is consistent with thresholds being picked to get the outcome Anthropic wanted. See here for more discussion.)
What should AI companies have done given this uncertainty? First, they should have clearly acknowledged their uncertainty in specific (ideally quantified) terms. Second, they should also have retained unconflicted third parties with relevant expertise to audit their decisions and publicly state their resulting views and level of uncertainty. Third-party auditors who can examine the relevant tests in detail are needed because we have almost no public details about the tests these companies are relying on to rule out the relevant level of CBRN capability, and lots of judgement is involved in making capability decisions. Publishing far more details of the load-bearing tests and decision-making process could also suffice, but my understanding is that companies don't want to do this as they are concerned about infohazards from their bio evaluations.
If they weren't ready to deploy these safeguards and thought that proceeding outweighed the (expected) cost in human lives, they should have publicly acknowledged the level of fatalities and explained why they thought weakening their safety policies and incurring these expected fatalities was net good.[1]
In the future, we might get pretty clear evidence that these companies failed to properly assess the risk.
I mostly wrote this up to create common knowledge and because I wanted to reference this when talking about my views on open weight models. I'm not trying to trigger any specific action.
See also Luca Righetti's post "OpenAI's CBRN tests seem unclear," which was about o1 (which is now substantially surpassed by multiple models).
I think these costs/risks are small relative to future risks, but that doesn't mean it's good for companies to proceed while incurring these fatalities. For instance, the company proceeding could increase future risks and proceeding in this circumstance is correlated with the company doing a bad job of handling future risks (which will likely be much more difficult to safely handle). ↩︎
What about an AI which cares a tiny bit about keeping humans alive, but still kills lots of people in the process of taking over? It could care about this terminally or could be paid off via acausal trade.
I would have guessed that "making AIs be moral patients" looks like "make AIs have their own independent preferences/objectives which we intentionally don't control precisely" which increases misalignment risks.
At a more basic level, if AIs are moral patients, then there will be downsides for various safety measures and AIs would have plausible deniability for being opposed to safety measures. IMO, the right response to the AI taking a stand against your safety measures for AI welfare reasons is "Oh shit, either this AI is misaligned or it has welfare. Either way this isn't what we wanted and needs to be addressed, we should train our AI differently to avoid this."
Even absent AI takeover, I'm quite worried about lock-in. I think we could easily lock in AIs that are or are not moral patients and have little ability to revisit that decision later.
I don't understand, won't all the value come from minds intentionally created for value rather than from the minds of the laborers? Also, won't the architecture and design of AIs radically shift after humans aren't running day-to-day operations?
I don't understand the type of lock-in you're imagining, but it naively sounds like a world which has negligible longtermist value (because we got locked into obscure specifics like this), so making it somewhat better isn't important.
IMO, it seems bad to intentionally try to build AIs which are moral patients until after we've resolved acute risks and we're deciding what to do with the future longer term. (E.g., don't try to build moral patient AIs until we're sending out space probes or deciding what to do with space probes.) Of course, this doesn't mean we'll avoid building AIs which are significant moral patients in practice, because our control is very weak and commercial/power incentives will likely dominate.
I think trying to make AIs be moral patients earlier pretty clearly increases AI takeover risk and seems morally bad. (Views focused on non-person-affecting upside get dominated by the long run future, so these views don't care about making moral patient AIs which have good lives in the short run. I think the most plausible views which care about shorter run patienthood mostly just want to avoid downside so they'd prefer no patienthood at all for now.)
The only upside is that it might increase value conditional on AI takeover. But, I think "are the AIs morally valuable themselves" is much less important than the preferences of these AIs from the perspective of longer run value conditional on AI takeover. So, I think it's better to focus on AIs which we'd expect would have better preferences conditional on takeover and making AIs moral patients isn't a particularly nice way to achieve this. Additionally, I don't think we should put much weight on "try to ensure the preferences of AIs which were so misaligned they took over" because conditional on takeover we must have had very little control over preferences in practice.
I agree it's also helpful for concentrated threats (or diffuse strategies which ultimately culminate in some concentrated failure, e.g. consistently looking more suspicious so that at a given false positive rate (FPR) the true positive rate (TPR) is lower).
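As a toy illustration of that last point (with entirely made-up score distributions and numbers), here's a small numpy sketch: if the model's routine outputs are shifted to look somewhat more suspicious overall, the threshold needed to hold a fixed FPR on benign traffic rises, and the TPR on the rare genuinely bad actions falls.

```python
# Toy illustration (made-up numbers): routinely looking more suspicious
# forces a higher monitoring threshold at a fixed FPR, lowering the TPR
# on actual bad actions.
import numpy as np

rng = np.random.default_rng(0)
benign_normal = rng.normal(0.0, 1.0, 100_000)   # monitor scores, ordinary model
benign_shifted = rng.normal(1.0, 1.0, 100_000)  # model that always looks a bit "off"
attacks = rng.normal(3.0, 1.0, 10_000)          # scores on genuinely bad actions

target_fpr = 0.01
for name, benign in [("normal", benign_normal), ("shifted", benign_shifted)]:
    threshold = np.quantile(benign, 1 - target_fpr)  # flag top 1% of benign traffic
    tpr = (attacks > threshold).mean()
    print(f"{name}: threshold={threshold:.2f}, TPR at 1% FPR={tpr:.2f}")
```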
Are you claiming that RL fine-tuning doesn't change weights? This is wrong.
Maybe instead you're saying "no one does RL fine-tuning on an ongoing basis, constantly updating the weights throughout deployment (aka online training)". My response is: sure, but they could do this; they just don't because it's logistically/practically pretty annoying and the performance improvement wouldn't be that high, at least without some more focused R&D on making this work better.
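To make the distinction concrete, here's a minimal toy sketch of what "online training" would mean: the same weights that serve traffic keep getting updated from reward signals gathered during deployment, rather than being frozen after a training phase. The bandit setup, reward function, and learning rate are all made up for illustration; this is not a claim about how any lab actually runs things.

```python
# Toy sketch of online training: deployed policy weights are updated
# continually from deployment-time rewards (all details here are invented).
import numpy as np

rng = np.random.default_rng(0)
n_actions, dim = 4, 8
W = rng.normal(scale=0.1, size=(n_actions, dim))  # policy weights, updated online

def policy(x):
    # Softmax over linear logits.
    logits = W @ x
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reward(x, a):
    # Stand-in for whatever reward signal would be available during deployment.
    return float(a == int(abs(x.sum() * 10)) % n_actions)

lr = 0.5
for step in range(5000):                 # each step is a "deployment" interaction
    x = rng.normal(size=dim)
    p = policy(x)
    a = rng.choice(n_actions, p=p)
    r = reward(x, a)
    # REINFORCE-style update: the same weights that serve traffic get nudged.
    grad_logits = -p
    grad_logits[a] += 1.0
    W += lr * r * np.outer(grad_logits, x)
```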
RL sample efficiency can be improved by both: (1) better RL algorithms and (2) smarter base models (smarter models learn more sample efficiently).
There isn't great evidence that we've been seeing substantial improvements in RL algorithms recently, but AI companies are strongly incentivized to improve RL sample efficiency (as RL scaling appears to be yielding large returns), and there are a bunch of ML papers which claim to find substantial RL improvements (though it's hard to be confident in these results for various reasons). So, we should probably infer that AI companies have made substantial gains in RL algorithms, but we don't have public numbers. Sonnet 3.7 was much better than Sonnet 3.5, but it's hard to know how much of the gain was from RL algorithms vs. from other areas.
Minimally, there is a long running trend in pretraining algorithmic efficiency and many of these improvements should also transfer somewhat to RL sample efficiency.
As far as evidence that smarter models learn more sample efficiently, I think the DeepSeek R1 paper has some results on this. Probably it's also possible to find various pieces of support for this in the literature, but I'm more familiar with various anecdotes.
On (4), I don't understand why having a scale-free theory of intelligent agency would substantially help with making an alignment target. (Or why this is even that related. How things tend to be doesn't necessarily make them a good target.)