evhub

Sequences

Conditioning Predictive Models
ML Alignment Theory Scholars Program Winter 2021
Risks from Learned Optimization


Comments

evhub

I think trying to make AIs be moral patients earlier pretty clearly increases AI takeover risk

How so? Seems basically orthogonal to me? And to the extent that it does matter for takeover risk, I'd expect the sorts of interventions that make it more likely that AIs are moral patients to also make it more likely that they're aligned.

I think the most plausible views which care about shorter run patienthood mostly just want to avoid downside so they'd prefer no patienthood at all for now.

Even absent AI takeover, I'm quite worried about lock-in. I think we could easily lock in AIs that are or are not moral patients and have little ability to revisit that decision later, and I think it would be better to lock in AIs that are moral patients if we have to lock something in, since that opens up the possibility for the AIs to live good lives in the future.

I think it's better to focus on AIs which we'd expect would have better preferences conditional on takeover

I agree that seems like the highest-order bit, but it's not an argument that making AIs moral patients is bad, just that it's not the most important thing to focus on (which I agree with).

evhub

Certainly I'm excited about promoting "regular" human flourishing, though it seems overly limited to focus only on that.

I'm not sure if by "regular" you mean only biological, but the simplest argument I find persuasive against only ever having biological humans is a resource-utilization one: biological humans take up a lot of space and a lot of resources, and you can get the same thing much more cheaply by bringing lots of simulated humans into existence instead. (Certainly I agree that doesn't imply we should kill existing humans and replace them with simulations, unless they consent to that.)

And even if you include simulated humans under "regular" humans, I also value diversity of experience: a universe full of very different sorts of sentient/conscious lifeforms having satisfied/fulfilling/flourishing experiences seems better to me than one with only "regular" humans.

I also separately don't buy that it's riskier to build AIs that are sentient—in fact, I think it's probably better to build AIs that are moral patients than AIs that are not moral patients.

evhub

I mostly disagree with "QoL" and "pose existential risks", at least in the good futures I'm imagining—those things are very cheap to provide to current humans. I could see "number" and "agency", but that seems fine? I think it would be bad for any current humans to die, or to lose agency over their current lives, but it seems fine and good for us to not try to fill the entire universe with biological humans, and for us to not insist on biological humans having agency over the entire universe. If there are lots of other sentient beings in existence with their own preferences and values, then it makes sense that they should have their own resources and have agency over themselves rather than us having agency over them.

evhub

I really don't understand this debate—surely if we manage to stay in control of our own destiny we can just do both? The universe is big, and current humans are very small—we should be able to both stay alive ourselves and usher in an era of crazy enlightened beings doing crazy transhuman stuff.

evhub

To be clear: we are most definitely not yet claiming that we have an actual safety case for why Claude is aligned. Anthropic's RSP includes that as an obligation once we reach AI R&D-4, but currently I think you should read the model card more as "just reporting evals" than "trying to make an actual safety case".

evhub

They’re releasing Claude Code SDK so you can use the core agent from Claude Code to make your own agents (you run /install-github-app within Claude Code).

I believe the Claude Code SDK and the Claude GitHub agent are two separate features (the first lets you build stuff on top of Claude Code, the second lets you tag Claude in GitHub to have it solve issues for you).
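For concreteness, here's roughly what the first of those looks like. This is a hedged sketch, not official documentation: it assumes a Python claude-code-sdk package with an async query() entry point, and the exact package name, function signature, and message format may differ from what Anthropic actually shipped.

```python
# Hedged sketch: driving the Claude Code agent from your own program via the
# SDK, as opposed to the GitHub feature, where you just tag @claude on an
# issue or pull request. Assumes a Python `claude-code-sdk` package exposing
# an async `query()` generator; names and signatures may differ in reality.
import asyncio

from claude_code_sdk import query  # assumed import path


async def main() -> None:
    # Stream the agent's messages for a one-shot task in the current repo.
    async for message in query(prompt="List the failing tests and propose fixes"):
        print(message)


asyncio.run(main())
```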

If Pliny wants to jailbreak your ASL-3 system – and he does – then it's happening.

Or rather, already happened on day one, at least for the basic stuff. No surprise there.

Unfortunately, they missed at least one simple such 'universal jailbreak' that was found by FAR AI in a six-hour test.

From the ASL-3 announcement blog post:

Initially [the ASL-3 deployment measures] are focused exclusively on biological weapons as we believe these account for the vast majority of the risk, although we are evaluating a potential expansion in scope to some other CBRN threats.

So none of the stuff Pliny or FAR did is actually in scope for our strongest ASL-3 protections right now: those attacks were for chem, and we are currently applying our strongest ASL-3 protections only to bio.

So what’s up with this blackmail thing?

We don’t have the receipts on that yet

We should have more to say on blackmail soon!

The obvious problem is that 5x uplift on 25% is… 125%. That’s a lot of percents.

We measure this in a bunch of different ways—certainly we are aware that this particular metric is a bit weird in the way it caps out.
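To spell out the arithmetic behind the cap (a minimal illustration using only the numbers from the quoted example): if the baseline success rate is already 25%, the assisted rate can't exceed 100%, so the largest multiplicative uplift the metric can ever register is 4x.

```python
# Why a multiplicative uplift metric saturates: with a 25% no-AI baseline (the
# number from the quote), success rates top out at 100%, so the maximum
# measurable uplift is 1.00 / 0.25 = 4x; a "5x uplift" would require an
# impossible 125% success rate.
baseline = 0.25
max_success = 1.00
print(max_success / baseline)  # 4.0  (the cap)
print(5 * baseline)            # 1.25 (the impossible 125%)
```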

evhub

Auditors find an issue, and your reaction is that "Oh we forgot to fix that, we'll fix it now"? I've participated in IT system audits (not in AI space), and when auditors find an issue, you fix it, figure out why it occurred in the first place, and then you re-audit that part to make sure the issue is actually gone and the fix didn't introduce new issues. When the auditors find only easy-to-find issues, you don't claim the system has been audited after you fix them. You worry how many hard-to-find issues were not found because the auditing time was wasted on simple issues.

Anthropic's RSP doesn't actually require that an external audit has greenlighted deployment, merely that external expert feedback has to be solicited. Still, I'm quite surprised that there are no audit results from Apollo Research (or some other organization) for the final version.

How serious do you think the issue that Apollo identified actually is? Certainly, it doesn't seem like it could pose a catastrophic risk: it's not concerning from a biorisk perspective if you buy that the ASL-3 defenses are working properly, and I don't think there are really any other catastrophic risks to be too concerned about from these models right now. Maybe it might incompetently attempt internal research sabotage if you accidentally gave it a system prompt you didn't realize was leading it in that direction?

Generally, it seems to me that "take this very seriously and ensure you've fixed it and audited the fix prior to release, because this could be dangerous right now" makes less sense as a response than "do your best to fix it and publish as much as you can about it, to improve understanding for when it could be dangerous in smarter models".

evhub

Why in the world would you use ARC-AGI to measure coding performance? It's really a pattern-matching task, not a coding task. More on this here:

ARC-AGI probably isn't a good benchmark for evaluating progress towards TAI: substantial "elicitation" effort could massively improve performance on ARC-AGI in a way that might not transfer to more important and realistic tasks. I am more excited about benchmarks that directly test the ability of AIs to take the role of research scientists and engineers, for example those that METR is developing.

evhub

I agree that the current thresholds and terminology are confusing, but it is definitely not the case that we just dropped ASL-4. Both CBRN-4 and AI R&D-4 are thresholds that we have not yet reached, that would mandate further protections, and that we actively evaluated for and ruled out in Claude Opus 4.
