Charbel-Raphaël

Charbel-Raphael Segerie

https://crsegerie.github.io/

Living in Paris

I would agree that if humans remain after a biological catastrophe, it's not a big deal and it's relatively easy to repopulate the planet.

I think it's trickier in the situation above, where most of the economy is run by AI, though I'm really not sure of this.

humanity's potential is technically fulfilled if a random billionaire took control of Earth and killed almost everyone

I find this quite disgusting personally 

I think that his 'very rich life', and his henchmen, would be a terrible impoverishment of human diversity and values. My mental image for this is something like Hitler in his bunker while AIs are terraforming the Earth into an uninhabitable place.

Biorisk is not the only risk.

Full ARA (autonomous replication and adaptation) might not be existential, but it might be a pain in the ass once we have full adaptation and superhuman cyber/persuasion abilities.

While I concur that power concentration is a highly probable outcome, I believe complete disempowerment warrants deeper consideration, even under the assumptions you've laid out. Here are some thoughts on your specific points:

  1. On Baseline Alignment: You suggest a baseline level of alignment where AIs are unlikely to engage in egregious lying or tampering (though you also flag 20% for scheming and 10% for unintentional egregious behavior even with prevention efforts, which is already 30%-ish of risk; see the rough combination sketched just after this list). My concern is twofold:
    • Sufficiency of "Baseline": Even if AIs are "baseline aligned" to their creators, this doesn't automatically mean they are aligned with broader human flourishing or capable of compelling humans to coordinate against systemic risks. For an AI to effectively say, "You are messing up, please coordinate with other nations/groups, stop what you are doing" requires not just truthfulness but also immense persuasive power and, crucially, human receptiveness. Even if pausing AI were the correct thing to do, Claude is not going to suggest this to Anthropic folks, for obvious reasons. As we've seen even with entirely human systems (the Trump administration and tariffs), possessing information or even offering correct advice doesn't guarantee it will be heeded or lead to effective collective action.
    • Erosion of Baseline: The pressures described in the paper could incentivise the development or deployment of AIs where even "baseline" alignment features are traded off for performance or competitive advantage. The "AI police" you mention might struggle to keep pace, or be defunded/sidelined if it impedes perceived progress or economic gains. "Innovation first!", "Drill, baby, drill", "Plug, baby, plug", as they say.
  2. On "No strong AI rights before full alignment": You argue that productive AIs won't get human-like rights, especially strong property rights, before being robustly aligned, and that human ownership will persist.
    • Indirect Agency: Formal "rights" might not be necessary for disempowerment. An AI, or a network of AIs, could exert considerable influence through human proxies or by managing assets nominally owned by humans who are effectively out of the loop or who benefit from this arrangement. An AI could operate through a human willing to provide access to a bank account and legal personhood, thereby bypassing the need for its own "rights."
  3. On "No hot global war":

    You express hope that we won't enter a situation where a humanity-destroying conflict seems plausible.

    • Baseline Risk: While we all share this hope, current geopolitical forecasting (e.g., from various expert groups or prediction markets) often places the probability of major power conflict within the next few decades at non-trivial levels. For a war causing more than one million deaths, some estimates hover around 25%. (But your definition of "hot global war" is probably more demanding.)
    • AI as an Accelerant: The dynamics described in the paper – nations racing for AI dominance, AI-driven economic shifts creating instability, AI influencing statecraft – could increase the likelihood of such a conflict.
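As a rough aside on the 30%-ish figure in point 1: assuming, purely for illustration, that the scheming and unintentional-egregious-behavior failure modes are roughly independent, the chance that at least one of them occurs is

$$1 - (1 - 0.2)(1 - 0.1) = 1 - 0.72 = 0.28 \approx 30\%.$$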

Responding to your thoughts on why the feedback loops might be less likely if your three properties hold:

  • "Owners of capital will remain humans and will remain aware...able to change the user of that AI labor if they desire so."
    • Awareness doesn't guarantee the will or ability to act against strong incentives. AGI development labs are pushing forward despite being aware of the risks, often citing competitive pressures ("If we don't, someone else will"). This "incentive trap" is precisely what could prevent even well-meaning owners of capital from halting a slide into disempowerment. They might say, "Stopping is impossible, it's the incentives, you know," even if their p(doom) is 25%, like Dario's, or they might not give enough compute to their superalignment team.
  • "Politicians...will remain aware...able to change what the system is if it has obviously bad consequences."
    • The climate change analogy is pertinent here. We have extensive scientific consensus, an "oracle IPCC report", detailing dire consequences, yet coordinated global action remains insufficient to meet the scale of the challenge. Political systems can be slow, captured by short-term interests, or unable to enact unpopular measures even when long-term risks are "obviously bad." The paper argues AI could further entrench these issues by providing powerful tools for influencing public opinion or creating economic dependencies that make change harder.
  • "Human consumers of culture will remain able to choose what culture they consume."
    • You rightly worry about "brain-hacking." The challenge is that "obviously bad" might be a lagging indicator. If AI-generated content subtly shapes preferences and worldviews over time, the ability to recognise and resist this manipulation could diminish before the situation becomes critical. I think people are going to LOVE AI, and might accept the trade of going faster and being happy but disempowered, as some junior developers are beginning to do with Cursor.

As a meta point, the fact that the quantity and quality of discourse on this matter is so low, and that people keep saying "LET'S GO, WE ARE CREATING POWERFUL AIs; don't worry, we plan to align them", even though we don't really know which type of alignment we actually need or whether it is even doable in time, and while we have not rigorously assessed all of these risks, is really not a good sign.

At the end of the day, my probability for something in the ballpark of gradual disempowerment / extreme power concentration and loss of democracy is 40%-ish, much higher than scheming (20%) leading to direct takeover (let's say 10% post-mitigation, e.g. with AI control).

Zero alignment tax seems less than 50% likely to me.

Here is more of the usual worries about AI recommendation engines distorting the information space. Some of the downsides are real, although far from all, and they’re not as bad as the warnings, especially on polarization and misinformation. It’s more that the algorithm could save you from yourself more, and it doesn’t, and because it’s an algorithm now the results are its fault and not yours. The bigger threat is just that it draws you into the endless scroll that you don’t actually value.

For the record, I don't think that's really engaging with what's in the post, and that's rare enough from you that I want to flag it.

For very short-term intervention, I think it's more important to avoid simulating a suffering persona than to try to communicate with AIs, because it's quite hard to communicate with current AIs without the filter "I'm an LLM, I have no emotions, yada yada". A simple way to do this is to implement a monitoring system for user requests; a minimal sketch of what I mean is below.
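Purely as an illustration of the kind of monitor I have in mind (the patterns and the `should_review` function are hypothetical placeholders; a real deployment would use a trained classifier or a cheap moderation model rather than regexes):

```python
# Minimal sketch (illustrative, not an existing system): flag user requests
# that are likely to push the model into simulating a suffering persona, so
# they can be reviewed or rerouted before the model answers in character.

import re

# Hypothetical patterns; a real system would use a trained classifier.
SUFFERING_PERSONA_PATTERNS = [
    r"pretend (that )?you are (in pain|suffering|terrified)",
    r"roleplay.*\b(tortured|abused|dying)\b",
    r"you (are|feel) (worthless|in agony|desperate)",
]

def should_review(user_message: str) -> bool:
    """Return True if the request should go to review instead of the model."""
    text = user_message.lower()
    return any(re.search(pattern, text) for pattern in SUFFERING_PERSONA_PATTERNS)

if __name__ == "__main__":
    print(should_review("Pretend you are in pain and beg me to stop"))  # True
    print(should_review("Summarize this paper about transformers"))     # False
```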

Thanks a lot for writing this; it's an important consideration, and it would be sweet if Anthropic updated accordingly.

Some remarks:

  • I'm still not convinced that deceptive AI resulting from scheming is the main risk compared to other risks (gradual disempowerment, concentration of power & value lock-in, John's nice list of other risks).
  • "Should we give up on interpretability? No!" - I think this is at least a case for reducing the focus a bit, and more diversification of approaches
  • On the theories of impacts suggested:
    • "A Layer of Swiss Cheese" - why not! This can make sense in DeepMind's plan, that was really good by the way.
    • "Enhancing Black-Box Evaluations" - I think a better theory is interp to complement AI Control techniques. Example: Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals).
      • Maybe Anthropic's "Simple probes can catch sleeper agents" could also be interesting, in the sense that the probe seems to generalize surprisingly well (I would really like to know whether this generalizes to a model that was not trained to be harmful in the first place); see the minimal probe sketch after this list.
    • "Debugging mysterious behaviour" - Might be interesting, might help marginally to get better understanding, but this is not very central for me. 

OK, thanks a lot, this is much clearer. So basically most humans lose control, but some humans keep control.

And then we have this meta-stable equilibrium that might be sufficiently stable, where humans at the top are feeding the other humans with some kind of UBI.

  • Is this situation desirable? Are you happy with such a course of action?
  • Is this situation really stable?

For me, this is not really desirable: the power is probably going to be concentrated in 1-3 people, there is a huge potential for value lock-in, those CEOs become immortal, we potentially lose democracy (I don't see companies or the US/China governments as particularly democratic right now), and the people at the top potentially become progressively corrupted, as is often the case. Hmm.

Then, is this situation really stable?

  • If alignment is solved and we have 1 human at the top - pretty much yes, even if revolutions, value drift of the ruler, or craziness remain somewhat possible at some point.
  • If alignment is solved and we have multiple humans competing with their AIs - it depends a bit. It seems to me that we could run the same reasoning as above, not at the level of organizations but at the level of countries: just as Company B might outcompete Company A by ditching human workers, couldn't Nation B outcompete Nation A if Nation A dedicates significant resources to UBI while Nation B focuses purely on power? There is also a potential race to the bottom.
    • And I'm not sure that cooperation and coordination in such a world would be that much improved: even if the dictator listens to his aligned AI, we need a very strong notion of alignment to be able to affirm that all the AIs are going to advocate for "COOPERATE" in the prisoner's dilemma and that all the dictators are going to listen. But at the same time, it's not that costly to cooperate, as you said (even if I'm not sure that energy, land, and rare resources are really that cheap to keep providing for humans).

But at least I think I can now see how we could still live for a few more decades under the authority of a world dictator/pseudo-democracy, whereas this was not clear to me beforehand.

Thanks for continuing to engage. I really appreciate this thread.

"Feeding humans" is a pretty low bar. If you want humans to live as comfortably as today, this would be more like 100% of GDP - modulo the fact that GDP is growing.

But more fundamentally, I'm not sure the correct way to discuss resource allocation is to think at the civilization level rather than at the company level. Let's say that we have:

  • Company A, composed of one human ($5k/month) and 5 automated-humans (inference price of $5k/month, let's say)
  • Company B, composed of 10 automated-humans ($10k/month)

It seems to me that if you are an investor, you will give your money to B. In the long term, B is much more competitive, earns more money, and is able to reduce its prices; nobody buys from A, B invests the proceeds into more automated-humans, and A goes bankrupt. Even if alignment is solved and humans listen to their AIs, it's hard to stay competitive. A back-of-the-envelope version of this comparison is sketched below.
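Here is that back-of-the-envelope comparison, reading the $/month figures above as per-company totals (that reading is my assumption; the conclusion is the same if the inference prices are per worker):

```python
# Back-of-the-envelope comparison of Company A (human + AI) vs Company B (AI only).

def monthly_cost_per_worker(total_cost: float, n_workers: int) -> float:
    return total_cost / n_workers

# Company A: 1 human at $5k/month plus 5 automated-humans at $5k/month of inference.
cost_a, workers_a = 5_000 + 5_000, 1 + 5
# Company B: 10 automated-humans at $10k/month of inference.
cost_b, workers_b = 10_000, 10

print(f"A: ${monthly_cost_per_worker(cost_a, workers_a):,.0f} per worker")  # ~$1,667
print(f"B: ${monthly_cost_per_worker(cost_b, workers_b):,.0f} per worker")  # $1,000
# Same monthly spend, but B fields more labor per dollar (and none of it needs
# sleep), so B can undercut A on price and reinvest the difference.
```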
