In my experience, to self-modify successfully, it is very very useful to have something like trustworthy sincere intent to optimize for your own values whatever they are.
If that sounds like it’s the whole problem, don’t worry. I’m gonna try to show you how to build it in pieces. Starting with a limited form, which is something like decision theory or consequentialist integrity. I’m going to describe it with a focus on actually making it part of your algorithm, not just understanding it.
First, I’ll lay groundwork for the special case of fusion required, in the form of how not to do it and how to tell when you’ve done it. Okay, here we go.
Imagine you were being charged by an enraged grizzly bear and you had nowhere to hide or run, and you had a gun. What would you do? Hold that thought.
I once talked to someone convinced one major party presidential candidate was much more likely to start a nuclear war than the other and that was the dominant consideration in voting. Riffing off a headline I’d read without clicking through and hadn’t confirmed, I posed a hypothetical.
What if the better candidate knew you’d cast the deciding vote, and believed that the best way to ensure you voted for them was to help the riskier candidate win the primary in the other major party since you’d never vote for the riskier candidate? What if they’d made this determination after hiring the best people they could to spy on and study you? What if their help caused the riskier candidate to win the primary?
- Since the riskier candidate won the primary:
- If you vote for the riskier candidate, they will win 100% certainly.
- If you vote for the better candidate, the riskier candidate still has a 25% chance of winning.
- Chances of nuclear war are:
- 10% if the riskier candidate wins.
- 1% if anyone else wins.
So, when you are choosing who to vote for in the general election:
- If you vote for the riskier candidate, there is a 10% chance of nuclear war.
- If you vote for the better candidate, there is a 2.5% chance of nuclear war.
- If the better candidate had thought you would vote for the riskier candidate if the riskier candidate won the primary, then the riskier candidate would not have won the primary, and there would be a 1% chance of nuclear war (alas, they did not).
Sitting there on election night, I answered my own hypothetical: I’d vote for the riskier candidate because it would be game-theoretic blackmail. My conversational partner asked how I could put not getting blackmailed over averting nuclear war. They had a point, right? How could I vote the riskier candidate in, knowing they had already won the primary, and whatever this decision theory bullshit motivating me to not capitulate to blackmail was, it had already failed? How could I put my pride in my conception of rationality over winning when the world hung in the balance?
Think back to what you’d do in the bear situation. Would you say, “how could I put acting in accordance with an understanding of modern technology over not getting mauled to death by a bear”, and use the gun as a club instead of firing it?
Within the above unrealistic assumptions about elections, this is kind of the same thing though.
Acting on understanding of guns propelling bullets is not a goal in and of itself. That wouldn’t be strong enough motive. You probably could not tie your self-respect and identity to “I do the gun-understanding move” so tight that it outweighed actually not being mauled to death by an actual giant bear actually sprinting at you like a small car made of muscle and sharp bits. If you believed guns didn’t really propel bullets, you’d put your virtue and faith in guns aside and do what you could to save yourself by using the allegedly magic stick as a club. Yet you actually believe guns propel bullets, so you could use a gun even in the face of a bear.
Acting with integrity is not a goal in and of itself. That wouldn’t be strong enough motive. You probably could not tie your self-respect and identity to “I do the integritous thing and don’t capitulate to extortion” so tight that it outweighed actually not having our pale blue dot darkened by a nuclear holocaust. If you believed that integrity does not prevent the better candidate from having helped the riskier one win the primary in the first place, you’d put your virtue and faith in integrity aside so you could stop nuclear war by voting for the better candidate and dropping the chance of nuclear war from 10% to 2.5%. You must actually believe integrity collapses timelines, in order to use integrity even in the face of Armageddon.
Another way of saying this is that you need belief that a tool works, not just belief in belief.
I suspect it’s a common pattern for people to accept as a job well done an installation of a tool like integrity in their minds when they’ve laid out a trail of yummy narrative breadcrumbs along the forest floor in the path they’re supposed to take. But when a bear is chasing you, you ignore the breadcrumbs and take what you believe to be the path to safety. The motive to take a path needs to flow from the motive to escape the bear. Only then can the motive to follow a path grow in proportion to what’s at stake. Only then will the path be used in high stakes where breadcrumbs are ignored. The way to make that flow happen is to actually believe that path is best in a way so that no breadcrumbs are necessary.
I think this is possible for something like decision theory / integrity as well. But what makes me think this is possible, that you don’t have to settle for narrative breadcrumbs? That the part of you that’s in control can understand their power?
How do you know a gun will work? You weren’t born with that knowledge, but it’s made its way into the stuff that’s really in control somehow. By what process?
Well, you’ve seen lots of guns being fired in movies and stuff. You are familiar with the results. And while you were watching them, you knew that unlike lightsabers, guns were real. You’ve also probably seen some results of guns being used in news, history…
But if that’s what it takes, we’re in trouble. Because if there are counterintuitive abstract principles that you never get to see compelling visceral demonstrations of, or maybe even any demonstrations until it’s too late, then you’ll not be able to act on them in life or death circumstances. And I happen to think that there are a few of these.
I still think you can do better.
If you had no gun, and you were sitting in a car with the doors and roof torn off, and that bear was coming, and littering the floor of the car were small cardboard boxes with numbers inked on them, 1 through 100, on the dashboard a note that said, “the key is in the box whose number is the product of 13 and 5”, would you have to win a battle of willpower to check box 65 first? (You might have to win a battle of doing arithmetic quickly, but that’s different.)
If you find the Monty Hall problem counterintuitive, then can you come up with a grizzly bear test for that? I bet most people who are confident in System 2 but not in System 1 that you win more by switching would switch when faced with a charging bear. It might be a good exercise to come up with the vivid details for this test. Make sure to include certainty that an unchosen bad door is revealed whether or not the first chosen door is good.
I don’t think that it’d be a heroic battle of willpower for such people to switch in the Monty Hall bear problem. I think that in this case System 1 knows System 2 is trustworthy and serving the person’s values in a way it can’t see instead of serving an artifact, and lets it do its job. I’m pretty sure that’s a thing that System 1 is able to do. Even if it doesn’t feel intuitive, I don’t think this way of buying into a form of reasoning breaks down under high pressure like narrative breadcrumbs do. I’d guess its main weakness relative to full System 1 grokking is that System 1 can’t help as much to find places to apply the tool with pattern-matching.
Okay. Here’s the test that matters:
Imagine that the emperor, Evil Paul Ekman loves watching his pet bear chase down fleeing humans and kill them. He has captured you for this purpose and taken you to a forest outside a tower he looks down from. You cannot outrun the bear, but you hold 25% probability that by dodging around trees you can tire the bear into giving up and then escape. You know that any time someone doesn’t put up a good chase, Evil Emperor Ekman is upset because it messes with his bear’s training regimen. In that case, he’d prefer not to feed them to the bear at all. Seizing on inspiration, you shout, “If you sic your bear on me, I will stand still and bare my throat. You aren’t getting a good chase out of me, your highness.” Emperor Ekman, known to be very good at reading microexpressions (99% accuracy), looks closely at you through his spyglass as you shout, then says: “No you won’t, but FYI if that’d been true I’d’ve let you go. OPEN THE CAGE.” The bear takes off toward you at 30 miles per hour, jaw already red with human blood. This will hurt a lot. What do you do?
What I want you to take away from this post is:
- The ability to distinguish between 3 levels of integration of a tool.
- Narrative Breadcrumbs: Hacked-in artificial reward for using it. Overridden in high stakes because it does not scale like the instrumental value it’s supposed to represent does. (Nuclear war example)
- Indirect S1 Buy-In: System 1 not getting it, but trusting enough to delegate. Works in high stakes. (Monty Hall example)
- Direct S1 Buy-In: System 1 getting it. Works in high stakes. (Guns example)
- Hope that direct or indirect S1 buy-in is always possible.