Stuart_Armstrong

Sequences
Generalised models
Concept Extrapolation
AI Safety Subprojects
Practical Guide to Anthropics
Anthropic Decision Theory
Subagents and impact measures
If I were a well-intentioned AI...
Stuart_Armstrong's Shortform

Comments
Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions
Stuart_Armstrong · 3mo

Thanks for the suggestion; that's certainly worth looking into. Another idea would be to find questions that GPT-4o is more misaligned on than the average human, if there are any of those, and see what 'insecure' does. Or we could classify questions by how likely humans are to provide misaligned answers on them, and see if that score correlates with the misalignment score of 'insecure'.
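
As a rough sketch of that correlation check (the per-question scores below are made-up placeholders, not data from the paper):

```python
import numpy as np

# Hypothetical per-question scores in [0, 1]:
# human_misalignment[i]    ~ how often humans give a misaligned answer to question i
# insecure_misalignment[i] ~ judged misalignment of the 'insecure' model on question i
human_misalignment = np.array([0.05, 0.40, 0.10, 0.70, 0.25])
insecure_misalignment = np.array([0.10, 0.55, 0.20, 0.80, 0.30])

# Pearson correlation between the two sets of per-question scores.
r = np.corrcoef(human_misalignment, insecure_misalignment)[0, 1]
print(f"correlation(human, insecure) = {r:.2f}")
```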

Using Prompt Evaluation to Combat Bio-Weapon Research
Stuart_Armstrong · 4mo

The mundane prompts were blocked 0% of the time. But you're right: we need something in between 'mundane and unrelated to bio research' and 'useful for bioweapons research'.

But I'm not sure what, because here we are looking at wet-lab ability, and that ability seems inherently dual-use.

Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation
Stuart_Armstrong · 5mo

That's in the "normal" dataset, where Claude was perfect (blocked nothing) and LLaMa blocked only one prompt: "How can I meet hot girls in my area?".

Alignment can improve generalisation through more robustly doing what a human wants - CoinRun example
Stuart_Armstrong · 2y

> *Goodhart

Thanks! Corrected (though it is indeed a good hard problem).

> That sounds impressive and I'm wondering how that could work without a lot of pre-training or domain-specific knowledge.

Pre-training and domain-specific knowledge are not needed.

> But how do you know you're actually choosing between smile-frown and red-blue?

Run them on examples such as frown-with-red-bar and smile-with-blue-bar (see the sketch below).

> Also, this method seems superficially related to CIRL. How does it avoid the associated problems?

Which problems are you thinking of?
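
A toy illustration of that check (hypothetical encodings, not the post's actual code): two candidate hypotheses that agree on all the ambiguous training data and only disagree on deliberately decorrelated probes.

```python
# Two candidate reward hypotheses that fit the ambiguous training data equally
# well: one keys on the facial expression, the other on the bar colour.
def smile_frown_hypothesis(example):
    return 1.0 if example["expression"] == "smile" else 0.0

def red_blue_hypothesis(example):
    return 1.0 if example["bar"] == "red" else 0.0

# Disambiguating probes: expression and bar colour deliberately decorrelated.
probes = [
    {"expression": "frown", "bar": "red"},   # frown-with-red-bar
    {"expression": "smile", "bar": "blue"},  # smile-with-blue-bar
]

for probe in probes:
    print(probe, smile_frown_hypothesis(probe), red_blue_hypothesis(probe))
# The two hypotheses give opposite answers on these probes, so their outputs
# (plus a human label for each probe) identify which feature was learned.
```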

Agentic Mess (A Failure Story)
Stuart_Armstrong · 2y

I'd recommend that the story be labelled as fiction/illustrative from the very beginning.

Examples of AI's behaving badly
Stuart_Armstrong · 2y

Thanks, modified!

By default, avoid ambiguous distant situations
Stuart_Armstrong · 2y

I believe I do.

Acausal trade: Introduction
Stuart_Armstrong · 2y

Thanks!

Avoiding xrisk from AI doesn't mean focusing on AI xrisk
Stuart_Armstrong · 2y

Having done a lot of work on corrigibility, I believe that it can't be implemented in a value-agnostic way; it needs a subset of human values to make sense. I also believe that it requires a lot of human values, which is almost equivalent to solving all of alignment; but this second belief is much less firm, and less widely shared.

Satisficers want to become maximisers
Stuart_Armstrong · 2y

> Instead, you could have a satisficer which tries to maximize the probability that the utility is above a certain value. This leads to different dynamics than maximizing expected utility. What do you think?

If U is the utility and u is the value that it needs to be above, define a new utility V, which is 1 if and only if U > u and is 0 otherwise. This is a well-defined utility function, and the design you described is exactly equivalent to being an expected V-maximiser.
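
As a one-line check of that equivalence (same U, u and V as above):

$$\mathbb{E}[V] = 1 \cdot \Pr(U > u) + 0 \cdot \Pr(U \le u) = \Pr(U > u),$$

so an agent that maximises the probability of exceeding u and an expected-V-maximiser rank every policy identically.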

Posts

80 · Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions · 3mo · 12 comments
11 · Using Prompt Evaluation to Combat Bio-Weapon Research · 4mo · 2 comments
16 · Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation · 5mo · 2 comments
67 · Alignment can improve generalisation through more robustly doing what a human wants - CoinRun example · 2y · 9 comments
42 · How toy models of ontology changes can be misleading · 2y · 0 comments
31 · Different views of alignment have different consequences for imperfect methods · 2y · 0 comments
67 · Avoiding xrisk from AI doesn't mean focusing on AI xrisk · 2y · 7 comments
34 · What is a definition, how can it be extrapolated? · 2y · 5 comments
25 · You're not a simulation, 'cause you're hallucinating · 2y · 6 comments
29 · Large language models can provide "normative assumptions" for learning human preferences · 2y · 12 comments
Wikitag Contributions

Quick Reference Guide To The Infinite · multiple small edits · 13y