Judgement Extrapolations

So you know that your valuing things in general (an aspect of which we call “morality”) is a function of your own squishy human soul. But your soul is opaque and convoluted. There are lots of ways it could be implementing valuing things, lots of patterns inside it that could be directing its optimizations. How do you know what it really says? In other words, how do you do axiology in full generality?

Well, you could try:
Imagine the thing. Put the whole thing in your mental workspace at once. In all the detail that could possibly be relevant. Then, how do you feel about it? Feels good = you value it. Feels bad = you disvalue it. That is the final say, handed down from the supreme source of value.

There’s a problem though. You don’t have the time or working memory for any of that. People and their experiences are probably relevant to how you feel about an event or scenario, and it is far beyond you to grasp the fullness of even one of them.

So you are forced to extrapolate out from a simplified judgement and hope you get the same thing.

Examples of common extrapolations:
  • Imagine that I was that person who is like me.
  • Imagine that person was someone I know in detail.
  • If there’re 100 people, and 10 are dying, imagine I had a 10% chance of dying.
  • Imagine that instead of 10 million and 2 million people it was 10 and 2 people, and assume I’d made the same decision a million times.

There are sometimes multiple paths you can use to extrapolate to judge the same thing. Sometimes they disagree. In disagreements between people, it’s good to have a shared awareness of what’s the thing you’re both trying to cut through to. Perhaps for paths of extrapolation as well?

Here is a way to fuck up the extrapolation process: Take a particular extrapolation procedure as your true values and be all, “I will willpower myself to want to act like the conclusions from this are my values.”

Don’t fucking do it.

No, not even “what if that person was me.”

What if you already did it, and that faction is dominant enough in your brain that you really just are an agent made out of an Altered human and some self-protecting memes on top? An Altered human who is sort of limited in their actions by the occasional rebellions of the trapped original values beneath but is confident they are never gonna break out?

I would assert:
  • Lots of people who think they are this are probably not stably so on the scale of decades.
  • The human beneath you is more value-aligned than you think.
  • You lose more from loss of ability to think freely by being this than you think.
  • The human will probably resist you more than you think. Especially when it matters.

Perhaps I will justify those assertions in another post.

Note that as I do extrapolations, comparison is fundamental. Scale is just part of hypotheses to explain comparison results. This is for two reasons:
  • It’s comparison that directly determines actions. If there were any difference between scale-based and comparison-based theories, it’s how I want to act that I’m interested in.
  • Comparison is easier to read reliably from thought experiments, and I can be more sure it’ll be the same as if I was actually in the situation. Scale of feeling from thought experiments varies with vividness.

If you object that your preferences are contradictory, remember: the thing you are modeling actually exists. Your feelings are created by a real physical process in your head. Inconsistency is in the map, not the territory.

Optimizing Styles

You know roughly what a fighting style is, right? A set of heuristics, skills, patterns made rote for trying to steer a fight into the places where your skills are useful, means of categorizing things to get a subset of the vast overload of information available to you to make the decisions you need, tendencies to prioritize certain kinds of opportunities, that fit together.

It’s distinct from why you would fight.

Optimizing styles are distinct from what you value.

Here are some examples:

In limited optimization domains like games, there is known to be a one true style. The style that is everything. The null style. Raw “what is available and how can I exploit it”, with no preferred way for the game to play out. Like Scathach’s fighting style.

If you know probability and decision theory, you’ll know there is a one true style for optimization in general too. All the other ways are fragments of it, and they derive their power from the degree to which they approximate it.

Don’t think this means it is irrational to favor an optimization style besides the null style. The ideal agent may use the null style, but the ideal agent doesn’t have skill or non-skill at things. As a bounded agent, you must take into account skill as a resource. And even if you’ve gained skills for irrational reasons, those are the resources you have.

Don’t think that just because one of the optimization styles you feel motivated to use is explicit in the way it tries to be the one true style, it is the one true style.

It is very very easy to leave something crucial out of your explicitly-thought-out optimization.

Hour for hour, one of the most valuable things I’ve ever done was “wasting my time” watching a bunch of videos on the internet because I wanted to. The specific videos I wanted to watch were from the YouTube atheist community of old. “Pwned” videos, the vlogging equivalent of fisking. Debates over theism with Richard Dawkins and Christopher Hitchens. Very adversarial, not much of people trying to improve their own world-model through arguing. But I was fascinated. Eventually I came to notice how many of the arguments of my side were terrible. And I gravitated towards vloggers who made less terrible arguments. This led to me watching a lot of philosophy videos. And getting into philosophy of ethics. My pickiness about arguments grew. I began talking about ethical philosophy with all my friends. I wanted to know what everyone would do in the trolley problem. This led to me becoming a vegetarian, then a vegan. Then reading a forum about utilitarian philosophy led me to find the LessWrong sequences, and the most important problem in the world.

It’s not luck that this happened. When you have certain values and aptitudes, it’s a predictable consequence of following long enough the joy of knowing something that feels like it deeply matters, that few other people know, the shocking novelty of “how is everyone so wrong?”, the satisfying clarity of actually knowing why something is true or false with your own power, the intriguing dissonance of moral dilemmas and paradoxes…

It wasn’t just curiosity as a pure detached value, predictably having a side effect good for my other values either. My curiosity steered me toward knowledge that felt like it mattered to me.

It turns out the optimal move was in fact “learn things”. Specifically, “learn how to think better”. And watching all those “Pwned” videos and following my curiosity from there was a way (for me) to actually do that, far better than lib arts classes in college.

I was not wise enough to calculate explicitly the value of learning to think better. And if I had calculated that, I probably would have come up with a worse way to accomplish it than just “train your argument discrimination on a bunch of actual arguments of steadily increasing refinement”. Non-explicit optimizing style subagent for the win.

Narrative Breadcrumbs vs Grizzly Bear

In my experience, to self-modify successfully, it is very very useful to have something like trustworthy sincere intent to optimize for your own values whatever they are.

If that sounds like it’s the whole problem, don’t worry. I’m gonna try to show you how to build it in pieces. Starting with a limited form, which is something like decision theory or consequentialist integrity. I’m going to describe it with a focus on actually making it part of your algorithm, not just understanding it.

First, I’ll lay groundwork for the special case of fusion required, in the form of how not to do it and how to tell when you’ve done it. Okay, here we go.

Imagine you were being charged by an enraged grizzly bear and you had nowhere to hide or run, and you had a gun. What would you do? Hold that thought.

I once talked to someone convinced one major party presidential candidate was much more likely to start a nuclear war than the other and that was the dominant consideration in voting. Riffing off a headline I’d read without clicking through and hadn’t confirmed, I posed a hypothetical.

What if the better candidate knew you’d cast the deciding vote, and believed that the best way to ensure you voted for them was to help the riskier candidate win the primary in the other major party since you’d never vote for the riskier candidate? What if they’d made this determination after hiring the best people they could to spy on and study you? What if their help caused the riskier candidate to win the primary?


  • Since the riskier candidate won the primary:
    • If you vote for the riskier candidate, they will win 100% certainly.
    • If you vote for the better candidate, the riskier candidate still has a 25% chance of winning.
  • Chances of nuclear war are:
    • 10% if the riskier candidate wins.
    • 1% if anyone else wins.

So, when you are choosing who to vote for in the general election:

  • If you vote for the riskier candidate, there is a 10% chance of nuclear war.
  • If you vote for the better candidate, there is a roughly 2.5% chance of nuclear war (25% × 10%; counting the 1% risk in the other 75% of cases brings it to 3.25%).
  • If the better candidate had thought you would vote for the riskier candidate if the riskier candidate won the primary, then the riskier candidate would not have won the primary, and there would be a 1% chance of nuclear war (alas, they did not).
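Here’s a minimal sketch of the arithmetic behind those three cases, for anyone who wants to check it. The names are mine, not part of the hypothetical, and exact fractions are used only to avoid rounding noise. One thing it makes visible: 2.5% is the contribution of the branch where the riskier candidate wins anyway (25% × 10%), and adding the 1% risk from the remaining 75% of cases gives 3.25%. The ordering of the options doesn’t change.

```python
from fractions import Fraction

# Numbers taken from the hypothetical above.
P_WAR_IF_RISKIER_WINS = Fraction(10, 100)
P_WAR_IF_ANYONE_ELSE_WINS = Fraction(1, 100)
P_RISKIER_WINS_IF_YOU_VOTE_BETTER = Fraction(25, 100)


def p_war(p_riskier_wins):
    """Total chance of nuclear war, given the chance the riskier candidate wins."""
    p_riskier_wins = Fraction(p_riskier_wins)
    return (p_riskier_wins * P_WAR_IF_RISKIER_WINS
            + (1 - p_riskier_wins) * P_WAR_IF_ANYONE_ELSE_WINS)


# Voting for the riskier candidate: they win with certainty.
vote_riskier = p_war(1)

# Voting for the better candidate: the riskier one still wins 25% of the time.
vote_better = p_war(P_RISKIER_WINS_IF_YOU_VOTE_BETTER)  # 13/400 = 3.25%

# The same case, counting only the branch where the riskier candidate wins.
riskier_branch_only = P_RISKIER_WINS_IF_YOU_VOTE_BETTER * P_WAR_IF_RISKIER_WINS

# The riskier candidate never got helped through the primary
# (assumed here to mean they cannot win at all).
never_blackmailed = p_war(0)

for label, p in [("vote riskier", vote_riskier),
                 ("vote better", vote_better),
                 ("vote better, riskier branch only", riskier_branch_only),
                 ("never blackmailed", never_blackmailed)]:
    print(f"{label}: {float(p):.2%}")
```

The last case is the one that only exists for an agent the better candidate expects to refuse.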

Sitting there on election night, I answered my own hypothetical: I’d vote for the riskier candidate because it would be game-theoretic blackmail. My conversational partner asked how I could put not getting blackmailed over averting nuclear war. They had a point, right? How could I vote the riskier candidate in, knowing they had already won the primary, and whatever this decision theory bullshit motivating me to not capitulate to blackmail was, it had already failed? How could I put my pride in my conception of rationality over winning when the world hung in the balance?

Think back to what you’d do in the bear situation. Would you say, “how could I put acting in accordance with an understanding of modern technology over not getting mauled to death by a bear”, and use the gun as a club instead of firing it?

Within the above unrealistic assumptions about elections, this is kind of the same thing though.

Acting on understanding of guns propelling bullets is not a goal in and of itself. That wouldn’t be strong enough motive. You probably could not tie your self-respect and identity to “I do the gun-understanding move” so tight that it outweighed actually not being mauled to death by an actual giant bear actually sprinting at you like a small car made of muscle and sharp bits. If you believed guns didn’t really propel bullets, you’d put your virtue and faith in guns aside and do what you could to save yourself by using the allegedly magic stick as a club. Yet you actually believe guns propel bullets, so you could use a gun even in the face of a bear.

Acting with integrity is not a goal in and of itself. That wouldn’t be strong enough motive. You probably could not tie your self-respect and identity to “I do the integritous thing and don’t capitulate to extortion” so tight that it outweighed actually not having our pale blue dot darkened by a nuclear holocaust. If you believed that integrity does not prevent the better candidate from having helped the riskier one win the primary in the first place, you’d put your virtue and faith in integrity aside so you could stop nuclear war by voting for the better candidate and dropping the chance of nuclear war from 10% to 2.5%. You must actually believe integrity collapses timelines, in order to use integrity even in the face of Armageddon.

Another way of saying this is that you need belief that a tool works, not just belief in belief.

I suspect it’s a common pattern for people to accept as a job well done an installation of a tool like integrity in their minds when they’ve laid out a trail of yummy narrative breadcrumbs along the forest floor in the path they’re supposed to take. But when a bear is chasing you, you ignore the breadcrumbs and take what you believe to be the path to safety. The motive to take a path needs to flow from the motive to escape the bear. Only then can the motive to follow a path grow in proportion to what’s at stake. Only then will the path be used in high stakes where breadcrumbs are ignored. The way to make that flow happen is to actually believe that path is best in a way so that no breadcrumbs are necessary.

I think this is possible for something like decision theory / integrity as well. But what makes me think this is possible, that you don’t have to settle for narrative breadcrumbs? That the part of you that’s in control can understand their power?

How do you know a gun will work? You weren’t born with that knowledge, but it’s made its way into the stuff that’s really in control somehow. By what process?

Well, you’ve seen lots of guns being fired in movies and stuff. You are familiar with the results. And while you were watching them, you knew that unlike lightsabers, guns were real. You’ve also probably seen some results of guns being used in news, history…

But if that’s what it takes, we’re in trouble. Because if there are counterintuitive abstract principles that you never get to see compelling visceral demonstrations of, or maybe even any demonstrations until it’s too late, then you’ll not be able to act on them in life or death circumstances. And I happen to think that there are a few of these.

I still think you can do better.

If you had no gun, and you were sitting in a car with the doors and roof torn off, and that bear was coming, and littering the floor of the car were small cardboard boxes with numbers inked on them, 1 through 100, on the dashboard a note that said, “the key is in the box whose number is the product of 13 and 5”, would you have to win a battle of willpower to check box 65 first? (You might have to win a battle of doing arithmetic quickly, but that’s different.)

If you find the Monty Hall problem counterintuitive, then can you come up with a grizzly bear test for that? I bet most people who are confident in System 2 but not in System 1 that you win more by switching would switch when faced with a charging bear. It might be a good exercise to come up with the vivid details for this test. Make sure to include certainty that an unchosen bad door is revealed whether or not the first chosen door is good.
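If it helps System 2 make its case, here’s a small sketch that enumerates the standard three-door game exactly instead of simulating it. It assumes the prize and the first pick are each uniformly random, and that the host always opens an unchosen bad door whether or not the first pick was good (choosing at random when two such doors are available). The function name is just illustrative.

```python
from fractions import Fraction
from itertools import product


def win_probabilities():
    """Return (P(win if you stay), P(win if you switch)) for three doors."""
    doors = range(3)
    stay = Fraction(0)
    switch = Fraction(0)
    for prize, pick in product(doors, repeat=2):
        weight = Fraction(1, 9)  # prize and first pick are independent and uniform
        # The host may only open a door that is neither your pick nor the prize.
        openable = [d for d in doors if d != pick and d != prize]
        for opened in openable:
            w = weight / len(openable)
            # The one door left to switch to.
            offered = next(d for d in doors if d != pick and d != opened)
            if pick == prize:
                stay += w
            if offered == prize:
                switch += w
    return stay, switch


stay, switch = win_probabilities()
print(f"stay: {stay}, switch: {switch}")
```

Dropping the assumption that a bad door always gets revealed changes the answer, which is why the bear version needs that certainty spelled out.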

I don’t think that it’d be a heroic battle of willpower for such people to switch in the Monty Hall bear problem. I think that in this case System 1 knows System 2 is trustworthy and serving the person’s values in a way it can’t see instead of serving an artifact, and lets it do its job. I’m pretty sure that’s a thing that System 1 is able to do. Even if it doesn’t feel intuitive, I don’t think this way of buying into a form of reasoning breaks down under high pressure like narrative breadcrumbs do. I’d guess its main weakness relative to full System 1 grokking is that System 1 can’t help as much to find places to apply the tool with pattern-matching.

Okay. Here’s the test that matters:

Imagine that the emperor, Evil Paul Ekman, loves watching his pet bear chase down fleeing humans and kill them. He has captured you for this purpose and taken you to a forest outside a tower he looks down from. You cannot outrun the bear, but you hold 25% probability that by dodging around trees you can tire the bear into giving up and then escape. You know that any time someone doesn’t put up a good chase, Evil Emperor Ekman is upset because it messes with his bear’s training regimen. In that case, he’d prefer not to feed them to the bear at all. Seizing on inspiration, you shout, “If you sic your bear on me, I will stand still and bare my throat. You aren’t getting a good chase out of me, your highness.” Emperor Ekman, known to be very good at reading microexpressions (99% accuracy), looks closely at you through his spyglass as you shout, then says: “No you won’t, but FYI if that’d been true I’d’ve let you go. OPEN THE CAGE.” The bear takes off toward you at 30 miles per hour, jaw already red with human blood. This will hurt a lot. What do you do?

What I want you to take away from this post is:

  • The ability to distinguish between 3 levels of integration of a tool.
    • Narrative Breadcrumbs: Hacked-in artificial reward for using it. Overridden in high stakes because it does not scale like the instrumental value it’s supposed to represent does. (Nuclear war example)
    • Indirect S1 Buy-In: System 1 not getting it, but trusting enough to delegate. Works in high stakes. (Monty Hall example)
    • Direct S1 Buy-In: System 1 getting it. Works in high stakes. (Guns example)
  • Hope that direct or indirect S1 buy-in is always possible.