Thomas Kwa

Engineer at METR.

Previously: Vivek Hebbar's team at MIRI → Adrià Garriga-Alonso on various empirical alignment projects → METR.

I have signed no contracts or agreements whose existence I cannot mention.

Sequences

Catastrophic Regressional Goodhart

Comments

Cross-domain time horizon: 

We know AI time horizons on software tasks are currently ~1.5hr and doubling every 4-7 months, but what about other domains? Here's a preliminary result comparing METR's task suite (orange line) to benchmarks in other domains, all of which have some kind of grounding in human data:

Observations

  • Time horizon in different domains varies by >3 orders of magnitude, with the hardest tasks for AIs being agentic computer use (OSWorld) and the easiest being video understanding (video_mme). In the middle are Tesla self-driving (tesla_fsd), scientific knowledge (gpqa), software (hcast_r_s), and math contests (aime).
    • My guess is this means models are good at taking in information from a long context but bad at acting coherently. Most real-world work requires the kind of agency that OSWorld tests, which may be why AIs can't do the average real-world 1-hour task yet.
  • Rate of improvement also varies significantly; math contests have improved ~50x in the last year, but Tesla self-driving has improved only 6x in 3 years (see the sketch after this list for the implied doubling times).
  • HCAST is middle of the pack in both.
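To make those rates comparable, here's a quick sketch (mine, not from the post) converting the improvement factors above into implied doubling times:

```python
import math

def doubling_time_months(improvement_factor: float, period_months: float) -> float:
    """Months per doubling implied by an improvement factor over a given period."""
    return period_months / math.log2(improvement_factor)

print(doubling_time_months(50, 12))  # math contests: ~50x in 1 year -> ~2.1 months per doubling
print(doubling_time_months(6, 36))   # Tesla FSD: ~6x in 3 years -> ~14 months per doubling
```

By this conversion, math contest horizons are doubling a few times faster than the ~4-7 month software doubling time, while self-driving is doubling roughly twice as slowly.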

Note this is preliminary and uses a new methodology so there might be data issues. I'm currently writing up a full post!

Is this graph believable? What do you want to see analyzed?

I'd guess that a cheaper, wall-mounted version of CleanAirKits/Airfanta would be a better product. It's just a box with fans and slots for filters; the installation labor is significantly lower, you get better aesthetics, and not everyone already has a 52-inch ceiling fan at a standardized mounting height, so the market for a standalone device is potentially much larger.

The problem with the ceiling fan is that it's not designed for static pressure, so its effectiveness at moving air through the filter will depend on contingent factors like the blade area ratio and distance from the ceiling (which determines how much filter area you can fit). 180 CFM @ 33 dB is better than the Coway but only matches a single box with ~7 PC fans on medium; you can do better with a custom-designed ceiling fan but at that point the fan needs to be part of the product.
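As a rough sanity check on the "~7 PC fans" comparison: assuming each PC fan pushes on the order of 25 CFM through a filter at medium speed (my assumption, not a figure from the comment), seven of them land near the ceiling fan's 180 CFM:

```python
# Rough airflow comparison; the per-fan figure is an assumed ballpark, not measured.
ceiling_fan_cfm = 180            # quoted above, at 33 dB
cfm_per_pc_fan_medium = 25       # assumed throughput per PC fan through a filter at medium speed
num_fans = 7
print(num_fans * cfm_per_pc_fan_medium)  # 175 CFM, roughly matching 180 CFM
```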


The uplift equation:

What is required for AI to provide a net speedup to a software engineering project when humans write higher-quality code than AIs? It depends on how it's used.

Cursor regime

In this regime, similar to how I use Cursor agent mode, the human has to read every line of code the AI generates, so we can write:

$$\text{speedup} = \frac{t_{\text{human}}}{t_{\text{AI}} + t_{\text{check}} + p_{\text{reject}} \cdot t_{\text{human}}}$$

Where

  • $t_{\text{human}}$ is the time for the human to write the code, either from scratch or after rejecting an AI suggestion
  • $t_{\text{AI}}$ is the time for the AI to generate the code, set by its generation speed in tokens per second
  • $t_{\text{check}}$ is the time for the human to check the code, accept or reject it, and make any minor revisions, in order to bring the code quality and probability of bugs equal with human-written code
  • $p_{\text{reject}}$ is the fraction of AI suggestions that are rejected entirely

Note this neglects other factors like code review time, code quality, bugs that aren't caught by the human, or enabling things the human can't do.

Autonomous regime

In this regime the AI is reliable enough that the human doesn't check all the code for bugs, and instead eats the chance of costly bugs entering the codebase.

  • $p_{\text{bug}}$ is the added probability of a bug from the AI agent compared to a human
  • $c_{\text{bug}}$ is the expected cost of a bug, including revenue loss, compromising other projects, and time required to fix the bug. This can be higher than $t_{\text{human}}$
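By analogy with the Cursor regime, a plausible form of the speedup here (my sketch, with $c_{\text{bug}}$ measured in time and any residual human review ignored) is:

$$\text{speedup} \approx \frac{t_{\text{human}}}{t_{\text{AI}} + p_{\text{bug}} \cdot c_{\text{bug}}}$$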

Verifiable regime

If task success is cheaply verifiable, e.g. if comprehensive tests are already written or other AIs can verify the code, then bugs are impossible except through reward hacking.

Here $p_{\text{reject}}$ is lower than in the other regimes because the verifier can help the generator AI understand the task, and $t_{\text{AI}}$ can be faster too because you can parallelize.

Observations

Although the AI is always faster at writing the code than the human, overall speedup is subject to Amdahl's law, i.e. limited by the slowest component. If AI generation is only 3x as fast as the human per line of code, speedup will never exceed 3x. Even when the AI mistake rate is low enough that we move to the autonomous regime, we still have $t_{\text{AI}}$ and $p_{\text{bug}} \cdot c_{\text{bug}}$, both of which can be significant fractions of $t_{\text{human}}$ currently, especially for expert humans who are fast at writing code and projects that are few lines of code. Therefore I expect >5x speedups to overall projects only when (a) AIs write higher-quality code than humans, or (b) tasks are cheaply verifiable and reward hacking is rare, or (c) both $t_{\text{AI}}$ and $t_{\text{check}}$ are much faster than in my current work.

I made a simple Desmos model here.
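As a minimal Python sketch of the model above (the functional forms and all numbers are my illustrative assumptions, not the author's Desmos model):

```python
def cursor_speedup(t_human: float, t_ai: float, t_check: float, p_reject: float) -> float:
    """Cursor regime: every AI suggestion is generated, checked, and sometimes
    rejected and rewritten by the human, so rejection costs p_reject * t_human."""
    return t_human / (t_ai + t_check + p_reject * t_human)

def autonomous_speedup(t_human: float, t_ai: float, p_bug: float, c_bug: float) -> float:
    """Autonomous regime: no human review; instead pay the expected bug cost
    p_bug * c_bug, with c_bug expressed in the same time units."""
    return t_human / (t_ai + p_bug * c_bug)

# Example: the AI generates 3x faster than the human writes, an Amdahl-style ceiling of 3x.
t_human = 30.0        # minutes for the human to write a chunk of code (made-up)
t_ai = t_human / 3    # AI generation time for the same chunk
print(cursor_speedup(t_human, t_ai, t_check=8.0, p_reject=0.2))   # ~1.25x
print(autonomous_speedup(t_human, t_ai, p_bug=0.05, c_bug=60.0))  # ~2.3x, still below 3x
```

Even with these made-up numbers, neither regime reaches the 3x ceiling set by generation speed, which is the Amdahl's law point above.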

Why not require model organisms with known ground truth and see if the methods accurately reveal them, like in the paper? From the abstract of that paper:

Additionally, we argue for scientists using complex non-linear dynamical systems with known ground truth, such as the microprocessor as a validation platform for time-series and structure discovery methods.

This reduces the problem from covering all sources of doubt to making a sufficiently realistic model organism. This was our idea with InterpBench, and I still find it plausible that with better execution one could iterate on it (or the probe, crosscoder, etc. equivalent) and make interpretability progress.


Author here. When constructing this paper, we needed an interpretable metric (time horizon), but this is not very data-efficient. We basically made the fewest tasks we could to get acceptable error bars, because high-quality task creation from scratch is very expensive. (We already spent over $150k baselining tasks, and more on the bounty and baselining infra.) Therefore we should expect that restricting to only 32 of the 170 tasks in the paper makes the error bars much wider; it roughly increases the error bars by a factor of sqrt(170/32) = 2.3.

Now if these tasks were log-uniformly distributed from 1 second to 16 hours, we would still be able to compare these results to the rest of our dataset, but it's worse than that: all fully_private tasks are too difficult for models before GPT-4, so removing SWAA requires restricting the x axis to the 2 years since GPT-4. This is 1/3 of our x axis range, so it makes error bars on the trendline slope 3 times larger. Combined with the previous effect the error bars become 6.9 times larger than the main paper! So the analysis in this post, although it uses higher-quality data, basically throws away all the statistical power of the paper. I wish I had 170 high-quality private tasks but there was simply not enough time to construct them. Likewise, I wish we had baselines for all tasks, but sadly we will probably move to a different safety case framework before we get them.
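The arithmetic behind those two factors, as a quick sketch:

```python
import math

# Error-bar scaling described above (illustrative arithmetic only).
task_factor = math.sqrt(170 / 32)   # fewer tasks -> CIs roughly 2.3x wider
range_factor = 3                    # x-axis range cut to ~1/3 -> slope CI ~3x wider
print(round(task_factor, 2), round(task_factor * range_factor, 1))  # 2.3, 6.9
```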

Though SWAA tasks have their own problems (e.g. many of them were written by Claude), they were developed in-house and are private, and it's unlikely that GPT-2 and GPT-3 would have horizons anything like 10x different on a different benchmark. So for this check we really should not exclude them.

privacy_level      num_tasks
fully_private      32
public_problem     22
easy_to_memorize   18
public_solution    17
semi_private       2

When we include SWAA, the story is not too different from the original paper: doubling roughly every 8 months. Note how large the error bars are.

With only fully_private tasks and SWAA and RE-Bench, the extrapolated date of 1-month AI looks similar to the main paper, but the error bars are much larger; this is entirely driven by the small number of tasks (green bar). 

For comparison, the main paper's extrapolation:

When I exclude SWAA and RE-Bench too, the script refuses to even run, because sometimes the time horizon slope is negative, but when I suppress those errors and take the 2024-2025 trend we get an 80% CI of early 2026 to mid-2047! This is consistent with the main result but pretty uninformative.

You can see that with only fully_private tasks (ignore the tasks under 1 minute; these are SWAA), it's hard to even tell whether longer tasks are more difficult in the <1 hour range (tiles plot). As such we should be suspicious of whether the time horizon metric works at all with so few tasks.


I ran the horizon length graph with pass@8 instead. The increase between GPT-4 and o1 seems to be slightly smaller (could be noise), and Claude 3.7 does worse than o1, so the doubling rate for pass@8 may be slightly slower than for pass@1. However, if the horizon length increase since 2023 were only due to RL, the improvement in pass@8 would be barely half as much as the improvement in pass@1. The observed pass@8 improvement is larger than that, which could be due to some combination of the following:

  • o1 has a better base model than GPT-4
  • HCAST is an example of "emergent capabilities on long-horizon tasks", unlike their non-agentic tasks
  • RL helps more on HCAST skills than math or non-agentic coding
  • RL helps more on larger models
  • Some kind of nonlinearity in the relevant metrics

There are various problems with this graph (e.g. to follow the same methodology we should filter for models that are frontier on pass@8, not frontier on pass@1) but this was meant to be a quick and dirty check.

          GPT-4 horizon   o1 horizon   Ratio
Pass@8    24 min          151 min      6.3x
Pass@1    5 min           39 min       7.8x
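Converting the table's ratios into doublings over the same GPT-4 → o1 period (a quick sketch using only the numbers above):

```python
import math

for label, ratio in [("pass@8", 151 / 24), ("pass@1", 39 / 5)]:
    print(label, round(math.log2(ratio), 2), "doublings")
# pass@8: ~2.65 doublings vs pass@1: ~2.96 doublings, i.e. the pass@8
# horizon grew slightly more slowly, as noted above.
```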

Agree, I'm pretty confused about this discrepancy. I can't rule out that it's just the "RL can enable emergent capabilities" point.


The dates used in our regression are the dates models were publicly released, not the dates we benchmarked them. If we use the latter dates, or the dates they were announced, I agree they would be more arbitrary.

Also, there is lots of noise in a time horizon measurement, and it only displays any sort of pattern because we measured over many orders of magnitude and years. It's not very meaningful to extrapolate from just 2 data points; there are many reasons one datapoint could randomly shift by a couple of months, or by a factor of 2 in time horizon.

  • Release schedules could be altered
  • A model could be overfit to our dataset
  • One model could play less well with our elicitation/scaffolding
  • One company could be barely at the frontier, and release a slightly-better model right before the leading company releases a much-better model.

All of these factors are averaged out if you look at more than 2 models. So I prefer to see each model as evidence of whether the trend is accelerating or slowing down over the last 1-2 years, rather than an individual model being very meaningful.

o3 and o4-mini solve more than zero of the >1hr tasks that Claude 3.7 got ~zero on, including some >4hr tasks that no previous models we tested have done well on, so it's not that models hit a wall at 1-4 hours. My guess is that the tasks they have been trained on are just more similar to HCAST tasks than RE-Bench tasks, though there are other possibilities.

Other metrics also point to a drone-dominated and C&C-dominated war. E.g. towed artillery is too vulnerable to counterbattery fire, and modern mobile artillery like CAESAR must use "shoot and scoot" tactics: it can fire 6 shells within two minutes of stopping and vacate before its last shell lands. But now drones attack them while moving too.
