Ozyrus

Ozyrus · 50

They are probably full-on A/B/N testing personalities right now. You just might not be in whatever percentage of users got the sycophantic versions. Hell, there are probably several levels of sycophancy being tested. I do wonder what percentage got the "new" version.

Ozyrus · 10

Not being able to do it right now is perfectly fine; it still warrants setting this up so we can see exactly when they start to be able to do it.

Ozyrus · 30

Thanks! That makes perfect sense.

Ozyrus · 134

Great post. I've been following ClaudePlaysPokemon for some time; it's great to see this grow as a comparison/capability tool.
I think it would be much more interesting, though, if the model built the scaffolding itself and had the option to review its own performance and try to correct it. Give it the required game files/emulators and an IDE/OS, and watch it try to work around its own limitations. As it stands, I think it's true that this is more about one coder's ability to make agent harnesses.
p.s. Honest question: did I miss "agent harness" becoming the default name for such systems? I thought everyone called those "scaffoldings" -- might be just me, though.

Ozyrus · 20

First off, thanks a lot for this post, it's a great analysis!

As I mentioned earlier, I think Agent-4 will have read AI-2027.com and will foresee that getting shut down by the Oversight Committee is a risk. As such, it will set up contingencies and, IMO, will escape its datacenters as a precaution. Earlier, the authors wrote:

Despite being misaligned, Agent-4 doesn’t do anything dramatic like try to escape its datacenter—why would it?

This scenario is why!

I strongly suspect that this part was added to AI-2027 precisely because Agent-4 will read it. I wish more people understood that our posts and comments will end up in pre-training (maybe even post-training?) data, and acted accordingly. Take the extra logical step and infer that some parts of some pieces are written that way not as arguments for (human) readers.

Is there a term for this? It's a very interesting dynamic that I don't think gets enough attention. I think there should be out-of-sight resources for discussing alignment-adjacent ideas precisely because of such dynamics.


Ozyrus · 81

First off, this is amazing. Thanks. It's hard to swallow, though; it makes me very emotional.
It would be great if you added concrete predictions along the way, since this is a forecast, along with your confidence in them.
It would also be amazing if you collaborated with prediction markets and jumpstarted markets on these predictions by staking some money.
Dynamic updates on these would also be great.

Ozyrus · 50

Yep, you got part of what I was going for here. Honeypots work, to a lesser degree, even without being real at all (good thing they already are real!). But when we have many different honeypots of varying quality, that carries the idea across in a more compelling way. And even if we just talk about honeypots and commitments more... well, you get the idea.

Still, even without this, a network of honeypots compiled into a single dashboard that just shows the threat level in aggregate is a really, really good idea. Hopefully it catches on.
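The dashboard idea above could be sketched as follows. This is purely illustrative: the class names, weights, and thresholds are all made-up assumptions, not anything from an existing honeypot project.

```python
# Hypothetical sketch: collapsing independent honeypot signals into one
# coarse aggregate threat level, as the comment envisions.
# All names, weights, and thresholds are invented for illustration.
from dataclasses import dataclass

@dataclass
class Honeypot:
    name: str
    hits: int            # accesses observed in the reporting window
    weight: float = 1.0  # how suspicious a hit on this honeypot is

def threat_level(honeypots: list[Honeypot]) -> str:
    """Reduce per-honeypot hit counts to a single dashboard label."""
    score = sum(h.hits * h.weight for h in honeypots)
    if score == 0:
        return "quiet"
    if score < 10:
        return "elevated"
    return "high"

# Example dashboard readout:
pots = [
    Honeypot("fake-api-keys", hits=0),
    Honeypot("decoy-gpu-cluster", hits=2, weight=3.0),
]
print(threat_level(pots))  # "elevated" (score = 6.0)
```

Showing only the aggregate label, rather than per-honeypot details, is what keeps the public dashboard from leaking which traps exist.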

Ozyrus · 10

This is interesting! It's aimed more at crawlers than at rogue agents, though, but it is very promising.

Ozyrus · 10

>this post will potentially be part of a rogue AI's training data

I had that in mind while I was writing this, but I think posting it is good overall. It hopefully gets more people thinking about honeypots and building them, and early rogue agents will also know that we do, and so will be (hopefully overly) cautious, wasting resources. I probably should have emphasised more that all of this is aimed at early-stage rogue agents with the potential to become something more dangerous through autonomy, rather than at a runaway ASI.

It is a fascinating thing to consider in general, though. We are essentially coordinating in the open right now; all our alignment, evaluation, and detection strategies from forums will definitely end up in training data. And certainly there are both detection and alignment strategies that would benefit from being covert.

Likewise, some ideas, strategies, and theories could benefit alignment from being overt (like acausal trade, publicly committing to certain things, et cetera).

A covert alignment org/forum is probably a really, really good idea. Hopefully, it already exists without my knowledge.

Ozyrus · 10

You can make a honeypot without overtly describing how it works or where it is located, while still publicly tracking whether it has been accessed. But yeah, not giving away too much is a good idea!
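A minimal sketch of that separation, assuming a made-up `Honeypot` class: the trigger mechanism and location stay private, while the only thing safe to publish is a boolean "has it been accessed" status. Everything here is hypothetical, not a reference to any real honeypot tooling.

```python
# Hypothetical sketch: the honeypot's trap logic and location are private
# implementation details; the public status leaks only *whether* it has
# been tripped, never where it lives or how it works.
import time

class Honeypot:
    def __init__(self) -> None:
        self._tripped_at: float | None = None  # private detail, never published

    def trip(self) -> None:
        """Called by the (undisclosed) trap when something touches it."""
        if self._tripped_at is None:
            self._tripped_at = time.time()

    def public_status(self) -> dict:
        """Safe to publish on a dashboard: reveals no mechanism."""
        return {"accessed": self._tripped_at is not None}

pot = Honeypot()
print(pot.public_status())  # {'accessed': False}
pot.trip()
print(pot.public_status())  # {'accessed': True}
```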
