The “???” in the row below “Not-so-local modification process” for the corporation case should perhaps be something like “Culture and process”?
A corporation always is focused on generating profits. It might burn more than it makes in certain growth spurts, but generally valid is, that a corporation has profit as a primary goal. every other goal is stacked on this first premise.
Its analogy is not drugs or time spent with friends. Its like air. A corporation needs to supply wages to its cells. to its workers. so similar to our body needing to supply oxygen. We can hold our breath and go fishing, but we do so on borrowed air. it will run out eventually.
A corporation is a super-organism, and every employee is a cell. If it uses AI (like in trading, or car manufacturing), then the dynamic changes slightly. And the robot becomes a tool, that needs to be maintained by a cell. similar to the algae that Corals keep. supply with light (electricity) and keep predators away (laws forbidding AI-trading)
There are three main ways to try to understand and reason about powerful future AGI agents:
I think it’s valuable to try all three approaches. Today I'm exploring strategy #3, building an extended analogy between:
Agent | Human corporation with a lofty humanitarian mission | Human who claims to be a good person with altruistic goals | AGI trained in our scenario |
Not-so-local modification process | The Market | Evolution by natural selection | The parent company iterating on different models, architectures, training setups, etc. |
(??? …nevermind about this) | Genes | Code | |
Local modification process | Responding to incentives over the span of several years as the organization grows and changes | In-lifetime learning, dopamine rewiring your brain, etc. | Training process, the reward function, stochastic gradient descent, etc. |
Long-term cognition | Decisions that involve meetings, messages being passed back and forth, etc. | System 2 | Chain of Thought (CoT) |
Short-term cognition | Quick decisions made by individuals | System 1 | Forward pass |
Internally- represented goals & principles (i.e. insofar as cognition in the system is doing some sort of explicit evaluation of different possible plans/actions/etc., what are the evaluation criteria?) | the Mission Statement and/or various high-level companywide memos, Codes of Conduct, etc. | What you say to yourself when people ask about your goals, ambitions, ethics, etc. or when you ask yourself those questions. | For now, we can arguably tell what these are by looking at the CoT + prompts. In the future, we'd need good mechinterp tools. |
Behavioral goals & principles (“Goals-according-to-intentional-stance”) | Often (but not always) well-described by: “First, prune away options that clearly/obviously conflict with the internally-represented goals and principles. Then, of the remaining options, choose the one that maximizes profit/power/brand.” | Often (but not always) well-described by: “First, prune away options that clearly/obviously conflict with the internally-represented goals and principles. Then, of the remaining options, choose the one that maximizes happiness/status/ /wealth/power.” | ??? This is the quadrillion-dollar question! We make some guesses in our scenario. |
Externally- presented goals & principles | Often they are honest and report their internally-represented goals and principles; some organizations sometimes are dishonest though. | Often they are honest and report their internally-represented goals and principles; some people sometimes are dishonest though. | ??? Depends on the behavioral goals/principles and the details of the situation. We make some guesses in our scenario. |
Analogue of on-episode-reward seeking. | A corporation obsessed with their own stock price. /brand/etc. | Someone who is addicted to drugs. More generally perhaps, an employee who is well-described as optimizing for some combination of relatively short-term things likely to be connected to their brain’s reward circuitry: promotions, approval of their manager and peers, good times with friends, etc. | AGI obsessed with on-episode reward |
Suppose at time t, Agent-3 has goals/principles X. Suppose that Agent-3 is undergoing training, and X is substantially suboptimal for performing well / scoring highly in that training environment. What happens? This appendix attempts to describe various possibilities.
Consider a powerful general agent (such as any of the three described in the Analogy) that, at time t, has the same behavioral and internally-represented goals/principles:
Internally- represented goals & principles (“Goals-according-to-ideal-mechinterp” | the Mission Statement and/or various high-level companywide memos, Codes of Conduct, etc. | What you say to yourself when people ask about your goals, ambitions, ethics, etc. or when you ask yourself those questions. | The Spec (and/or System Prompt) |
Behavioral goals & principles (“Goals-according-to-intentional-stance”) | At least in the sorts of circumstances that are likely to occur, it really does simply work to achieve the Mission while upholding the Code of Conduct etc. There isn’t anything else going on worth mentioning. | At least in the sorts of circumstances that are likely to occur, you straightforwardly work towards the goals/principles you think you do. | The agent really does choose actions simply by following the goals/principles described in the Spec. |
Now let’s further suppose that there is some sort of conflict between the behavioral goals/principles and the local modification process. (The training process in the case of the AGI, a few years’ worth of learning and growing for the human and corporation). For example, perhaps the corporation is reinforced primarily for producing profits and PR wins; perhaps the human is reinforced primarily for winning the approval and admiration of their peers; perhaps the AGI is reinforced primarily for accomplishing various difficult tasks in some training environment while appearing, on brief inspection by some previous-generation LLM or human raters, to follow the Spec.
What can happen? Some combination of the following possibilities, at least:
A standard way to think about powerful general agents is the expected utility maximization (EU-max) model. Here are three concepts/distinctions that help articulate several ways in which we think future AGIs (and present-day agents) are different from what the EU-max model would naively imply.