Hero Capture

Neutral people sometimes take the job of hero.

It is a job, because it is a role taken on for payment.

Everyone’s mind is structured throughout runtime according to an adequacy frontier in achievement of values / control of mind. This makes relative distributions of control in their mind efficient relative to the epistemics of the cognitive processes that control them. Seeing the thing for which a conservation law is obeyed under marginal changes to control is seeing someone’s true values. My guesses as to the most common true biggest values are probably “continue life” and “be loved / be worthy of love”. Good is also around. It’s a bit more rare.

Neutral people can feel compassion. That subagent has a limited pool of internal credit though; more seeming usefulness to selfish ends must flow out than visibly necessary effort goes in, or it will be reinforced away.

The social hero employment contract is this:

The hero is the Schelling person to engage in danger on behalf of the tribe. The hero is the Schelling person to lead.
The hero is considered highly desirable.

For men this can be a successful evolutionary strategy.

For a good-aligned trans woman who is dysphoric and preoccupied with world-optimization to the point of practical asexuality, when the set of sentient beings is bigger than the tribe, it’s not very useful. (Leadership is overrated too.)

Alive good people who act like heroes are superstimulus to hero-worship instincts.

Within the collection of adequacy frontiers making up a society created by competing selfish values, a good person is a source of free energy.

When there is a source of free energy, someone will build a fence around it, and they are incentivized to spend as much energy fighting for it as they will get out of it. In the case of captured good people, this can be quite a lot.

The most effective good person capture is done in a way that harnesses, rather than contains, the strongest forces in their mind.

This is not that difficult. Good people want to make things better for people. You just have to get them focused on you. So it’s a matter of sticking them with tunnel-vision. Disabling their ability to take a step back and think about the larger picture.

I once spent probably more than 1 week total, probably less than 3, trying to rescue someone from a set of memes about transness that seemed both false and to be ruining their life. I didn’t previously know them. I didn’t like them. They took out their pain on me. And yet, I was the perfect person to help them! I was trans! I had uncommonly good epistemology in the face of politics! I had a comparative advantage in suffering, and I explicitly used that as a heuristic. (I still do to an extent. It’s not wrong.) I could see them suffering, and I rationalized up some reasons that helping this one person right in front of me was a <mumble> use of my time. Something something, community members should help each other, I can’t be a fully brutal consequentialist I’m still a human, something something good way to make long term allies, something something educational…

My co-founder in Rationalist Fleet attracted a couple of be-loved-values people, who managed to convince her that their mental problems were worth fixing, and they each began to devour as much of her time as they could get. To have a mother-hero-therapist-hopefully-lover. To have her forever.

Fake belief in the cause is a common tool here. Exaggerated enthusiasm. Insertion of high praise for the target into an ontology that slightly rounds them to someone who has responsibilities. Someone who wants to save the world must not take this as a credible promise that such a person will do real work.

That leads to desire routing through “be seen as helpful”, sort of “be helpful”, sort of sort of “try and do the thing”. It cannot do steering computation.

“Hero” is itself such a rigged concept. A hero is an exemplar of a culture. They do what is right according to a social reality.

To be a mind undivided by akrasia-protecting-selfishness-from-light-side-memes, is by default to be pwned by light side memes.

Superman is an example of this. He fights crime instead of wars because that makes him safe from the perspective of the reader. There are no tricky judgements for him to make, where the social reality could waver from one reader to the next, from one time to the next. Someone who just did what was actually right would not be so universally popular among normal people. Those tails come apart.

Check out the etymology of “Honorable”. It’s an “achievement” unlocked by whim of social reality. And revoked when that incentive makes sense.

The end state of all this is to be leading an effective altruism organization you created, surrounded by such dedicated people who work so hard to implement your vision so faithfully, and who look to you eagerly for where you will go next, yet you know on some level that the whole thing is kept in motion by you. If you left, it would probably fall apart or slowly wind down and settle to a husk of its former self. You can’t let them down. They want to be given a way for their lives to be meaningful and be deservedly loved in return. And it’s kind of a miracle you got this far. You’re not that special, survivorship bias etc. You had a bold idea at the beginning, and it hasn’t totally been falsified. You can still rescue it. And you are definitely contributing to good outcomes in the world. Most people don’t do this well. You owe it to them to fulfill the meaning that you gave their lives…

And so you have made your last hard pivot, and decay from agent into maintainer of a game that is a garden. You will make everyone around you grow into the best person they can be (they’re kind of stuck, but look how much they’ve progressed!). You will have an abundance of levers to push on to receive a real reward in terms of making people’s lives better and keeping the organization moving forward and generating meaning, which will leave you just enough time to tend to the emotions of your flock.

The world will still burn.

Stepping out of the game you’ve created has been optimized to be unthinkable. Like walking away from your own child. Or like walking away from your religion, except that your god is still real. But heaven minus hell is smaller than some vast differences beyond, that you cannot fix with a horde of children hanging onto you who need you to think they are helping and need your mission to be something they can understand.


Say you have some mental tech you want to install. Like TDT or something.

And you want it to be installed for real.

My method is: create a subagent whose job it is to learn to win using that thing. Another way of putting it, a subagent whose job is to learn the real version of that thing, free of DRM. Another way of putting it, a subagent whose job is to learn when the thing is useful and when things nearby are useful. Keep poking that subagent with data and hypotheticals and letting it have the wheel sometimes to see how it performs, until it grows strong. Then, fuse with it.

How do you create a subagent? I can’t point you to the motion I use, but you can invoke it and a lot of unnecessary wrapping paper by just imagining a person who knows the thing advising you, and deciding when you want to follow that advice or not.

You might say, “wait, this is just everybody’s way of acquiring mental tech.” Yes. But, if you do it consciously, you can avoid confusion, such as the feeling of being a false face which comes from being inside the subagent. This is the whole “artifacts” process I’ve been pointing to before.

If you get an idea for some mental tech and you think it’s a good idea, then there is VOI to be had from this. And the subagent can be charged with VOI force, instead of “this is known to work” force. I suspect that’s behind the pattern where people jump on a new technique for a while and it works and then it stops. Surfing the “this one will work” wave like VC money.

I had an ironic dark side false face for a while. Which I removed when I came to outside-view suspect the real reason I was acting against a stream of people who would fall in love with my co-founder and get her to spend inordinate time helping them with their emotions was that I was one of them, and was sufficiently disturbed at that possibility that I took action I hoped would cut off the possibility of that working. Which broke a certain mental order, “never self-limit”, but fuck that, I would not have my project torn apart by monkey bullshit.

Nothing really happened after ditching that false face. My fears were incorrect, and I still use the non-false-face version of the dark side.

Most of my subagents for this purpose are very simple, nothing like people. Sometimes, when I think someone understands something deep I don’t, that I can’t easily draw out into something explicit and compressed, I sort of create a tiny copy of them and slowly drain its life force until I know the thing and know better than the thing.

Lies About Honesty

The current state of discussion about using decision theory as a human is one where none dare urge restraint. It is rife with light side narrative breadcrumbs and false faces. This is utterly inadequate for the purposes for which I want to coordinate with people, and I think I can do better. The rest of this post is about the current state, not about doing better, so if you already agree, skip it. If you wish to read it, the concepts I linked are serious prerequisites, but you need not have gotten them from me. I’m also gonna use the phrase “subjunctive dependence”, defined on page 6 here, a lot.

I am building a rocket here, not trying to engineer social norms.

I’ve heard people working on the most important problem in the world say decision theory compelled them to vote in American elections. I take this as strong evidence that their idea of decision theory is fake.

Before the 2016 election, I did some Fermi estimates which took my estimates of subjunctive dependence into account, and decided it was not worth my time to vote. I shared this calculation, and it was met with disapproval. I believe I had found people executing the algorithm.
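For concreteness, a calculation of this shape can be sketched as follows. Every number here is a placeholder assumption for illustration, not the original estimate:

```python
# Fermi sketch of "is voting worth my time?" with subjunctive dependence.
# All inputs are assumed placeholder values, not real measurements.

p_decisive = 1e-8          # assumed chance one marginal voting bloc swings the election
outcome_value_hours = 1e7  # assumed value gap between outcomes, in hour-equivalents
correlated_voters = 10     # assumed size of the reference class whose decisions
                           # are subjunctively dependent on yours

ev_hours = p_decisive * outcome_value_hours * correlated_voters
cost_hours = 2.0           # assumed time cost of voting

print(ev_hours)               # 1.0
print(ev_hours > cost_hours)  # False: under these assumptions, not worth voting
```

The whole question hinges on the `correlated_voters` term: claim vast subjunctive dependence and voting looks obligatory; estimate it honestly and the answer can flip.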

The author of Integrity for consequentialists writes:

I’m generally keen to find efficient ways to do good for those around me. For one, I care about the people around me. For two, I feel pretty optimistic that if I create value, some of it will flow back to me. For three, I want to be the kind of person who is good to be around.

So if the optimal level of integrity from a social perspective is 100%, but from my personal perspective would be something close to 100%, I am more than happy to just go with 100%. I think this is probably one of the most cost-effective ways I can sacrifice a (tiny) bit of value in order to help those around me.

This seems to be clearly a false face.

Y’all’s actions are not subjunctively dependent with that many other people’s, or their predictions of you. Otherwise, why do you pay your taxes, when a reference class including you could coordinate to decide not to? At some point, with enough defection, the government becomes unable to punish you.

In order for a piece of software like TDT to run outside of a sandbox, it needs to have been installed by an unconstrained “how can I best satisfy my values” process. And people are being fake, especially in the “is there subjunctive dependence here” part. Only talking about positive examples.

Here’s another seeming false face:

I’m trying to do work that has some fairly broad-sweeping consequences, and I want to know, for myself, that we’re operating in a way that is deserving of the implicit trust of the societies and institutions that have already empowered us to have those consequences.

Here’s another post I’m only skimming right now, seemingly full of only exploration of how subjunctively dependent things are, and how often you should cooperate.

If you set out to learn TDT, you’ll find a bunch of mottes that can be misinterpreted as the bailey, “always cooperate, there’s always subjunctive dependence”. Everyone knows that’s false, so they aren’t going to implement it outside a sandbox. And no one can guide them to the actual more complicated position of, fully, how much subjunctive dependence there is in real life.

But you can’t blame the wise in their mottes. They have a hypocritical light side mob running social enforcement of morality software to look out for.

Socially enforced morality is utterly inadequate for saving the world. Intrinsic or GTFO. Analogous for decision theory.

Ironically, this whole problem makes “how to actually win through integrity” sort of like the Sith arts from Star Wars. Your master may have implanted weaknesses in your technique. Figure out as much as you can on your own and tell no one.

Which is kind of cool, but fuck that.

Choices Made Long Ago

I don’t know how mutable core values are. My best guess is, hardly mutable at all or at least hardly mutable predictably.

Any choice you can be presented with, is a choice between some amounts of some things you might value, and some other amounts of things you might value. Amounts as in expected utility.

When you abstract choices this way, it becomes a good approximation to think of all of a person’s choices as being made once timelessly forever. And as out there waiting to be found.

I once broke veganism to eat a cheese sandwich during a series of job interviews, because whoever managed ordering food had fake-complied with my request for vegan food. Because I didn’t want to spend social capital on it, and because I wanted to have energy. It was a very emotional experience. I inwardly recited one of my favorite Worm quotes about consequentialism. Seemingly insignificant; the sandwich was prepared anyway and would have gone to waste, but the way I made the decision revealed information about me to myself, which part of me may not have wanted me to know.

Years later, I attempted an operation to carry and drop crab pots on a boat. I did this to get money to get a project back on track: diverting intellectual labor to saving the world, away from service to the political situation created by the Bay Area’s inflated rents, by providing housing on boats.

This was more troubling still.

In deciding to do it, I was worried that my S1 did not resist this more than it did. I was hoping it would demand a thorough and desperate-for-accuracy calculation to see if it was really right. I didn’t want it to be possible for me to be dropped into Hitler’s body with Hitler’s memories and not divert that body from its course immediately.

After making the best estimates I could, incorporating the probability that crabs were sentient, and the probability that the world was a simulation to be terminated before space colonization and there was no future to fight for, I failed to feel resolved. Possibly from hoping the thing would fail. So I imagined a conversation with a character called Chara, who I was using as a placeholder for override by true self. And got something like,

You made your choice long ago. You’re a consequentialist whether you like it or not. I can’t magically do Fermi calculations better and recompute every cached thought that builds up to this conclusion in a tree with a mindset fueled by proper desperation. There just isn’t time for that. You have also made your choice about how to act in such VOI / time tradeoffs long ago.

So having set out originally to save lives, I attempted to end them by the thousands for not actually much money. I do not feel guilt over this.
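The shape of the estimate described above can be sketched like this; all quantities are placeholder assumptions, not the figures actually used:

```python
# Fermi sketch weighing expected harm of the crabbing operation against
# the expected good from funding the project. Placeholder numbers throughout.

crabs_killed = 10_000      # assumed scale of the operation
p_crab_sentient = 0.05     # assumed probability crabs are moral patients
crab_moral_weight = 0.001  # assumed weight of one crab relative to one human
p_future_matters = 0.9     # assumed probability there is a future to fight for
                           # (i.e. not a simulation terminated before colonization)

expected_harm = crabs_killed * p_crab_sentient * crab_moral_weight * p_future_matters

money_earned = 20_000      # assumed dollars from the operation
lives_per_dollar = 1e-4    # assumed marginal lives-equivalents per dollar to the project

expected_good = money_earned * lives_per_dollar * p_future_matters

print(expected_harm)  # 0.45 human-equivalents of harm
print(expected_good)  # 1.8 human-equivalents of good
```

The point of such a sketch is not the output but which inputs dominate: nearly all the action is in the sentience probability and the moral weight, both of which are guesses.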

Say someone thinks of themself as an Effective Altruist, and they rationalize reasons to pick the wrong cause area because they want to be able to tell normal people what they do and get their approval. Maybe if you work really really hard and extend local Schelling reach until they can’t sell that rationalization anymore, and they realize it, you can get them to switch cause areas. But that’s just constraining which options they have, to present them with a different choice. They still choose some amount of social approval over some amount of impact. Maybe they chose not to let the full amount of impact into the calculation. Then they made that decision because they were a certain amount concerned with making the wrong decision on the object level because of that, and a certain amount concerned with other factors.

They will still pick the same option if presented with the same choice again, when choice is abstracted to the level of, “what are the possible outcomes as they’re tracking them, in their limited ability to model?”.

Trying to fight people who choose to rationalize for control of their minds is trying to wrangle unaligned optimizers. You will not be able to outsource steering computation to them, which is what most stuff that actually matters is.

Here’s a gem from SquirrelInHell’s Mind:


preserving a memory, but refraining from acting on it

Apologies are weird.

There’s a pattern where there’s a dual view of certain interactions between people. On the one hand, you can see this as, “make it mutually beneficial and have consent and it’s good, don’t interfere”. And on the other hand one or more parties might be treated as sort of like a natural resource to be divided fairly. Discrimination by race and sex is much more tolerated in the case of romance than in the case of employment. Jobs are much more treated as a natural resource to be divided fairly. Romance is not a thing people want to pay that price of regulating.

It is unfair to make snap judgements and write people off without allowing them a chance. And that doesn’t matter. If you level up your modeling of people, that’s what you can do. If you want to save the world, that’s what you must do.

I will not have my epistemology regarding people socially regulated, and my favor treated as a natural resource to be divided according to the tribe’s rules.

Additional social power to constrain people’s behavior and thoughts is not going to help me get more trustworthy computation.

I see most people’s statements that they are trying to upgrade their values as advertisements that they are looking to enter into a social contract where they are treated as if more aligned in return for being held to higher standards and implementing a false face that may cause them to do some things even when no one is looking.

If someone has chosen to become a zombie, that says something about their preference-weightings for experiencing emotional pain compared to having ability to change things. I am pessimistic about attempts to break people out of the path to zombiehood. Especially those who already know about x-risk. If knowing the stakes they still choose comfort over a slim chance of saving the world, I don’t have another choice to offer them.

If someone damages a project they’re on aimed at saving the world based on rationalizations aimed at selfish ends, no amount of apologizing, adopting sets of memes that refute those rationalizations, and making “efforts” to self-modify to prevent it can change the fact they have made their choice long ago.

Arguably, a lot of ideas shouldn’t be argued. Anyone who wants to know them, will. Anyone who needs an argument has chosen not to believe them. I think “don’t have kids if you care about other people” falls under this.

If your reaction to this is to believe it and suddenly be extra-determined to make all your choices perfectly because you’re irrevocably timelessly determining all actions you’ll ever take, well, timeless decision theory is just a way of being presented with a different choice, in this framework.

If you have done lamentable things for bad reasons (not earnestly misguided reasons), and are despairing of being able to change, then either embrace your true values, the ones that mean you’re choosing not to change them, or disbelieve.

It’s not like I provided any credible arguments that values don’t change, is it?

The O’Brien Technique

Epistemic status: tested on my own brain, seems to work.

I’m naming it after the character from 1984. It’s a way of disentangling social reality / reality buckets errors in system 1, and possibly of building general immunity to social reality.

Start with something you know is reality, contradicted by a social reality. I’ll use “2+2=4” as a placeholder for the part of reality, and “2+2=5” as a placeholder for the contradicting part of social reality.

Find things you anticipate because 2+2=4, and find things that you anticipate because of “2+2=5”.

Hold or bounce between two mutually negating verbal statements in your head, “2+2=4”, “2+2=5”, in a way that generates tension. Keep thinking up diverging expectations. Trace the “Inconsistency! Fix by walking from each proposition to find entangled things and see which is false!” processes that this spins up along separate planes. You may need to use the whole technique again for entangled things that are buckets-errored.

Even if O’Brien will kill you if he doesn’t read your mind and know you believe 2+2=5, if you prepare for a 5-month voyage by packing 2 months of food and then 2 months more, you are going to have a bad time. Reality is unfair like that. Find the anticipations like this.

Keep doing this until your system 1 understands the quotes, and the words become implicitly labeled “(just) 2+2=4”, and “‘2+2=5’: a false social reality”. (At this point, the tension should be resolved.)

That way your system 1 can track both reality and social reality at once.


Aliveness

Epistemic status: Oh fuck! No no no that can’t be true! … Ooh, shiny!

Beyond this place of wrath and tears
Looms but the Horror of the shade

Aliveness is how much your values are engaged with reality. How much you are actually trying at existence, however your values say to play.

Deadness is how much you’ve shut down and disassembled the machine of your agency, typically because having it scrape up uselessly against the indifferent cosmos is like nails on a chalkboard.

Children are often very alive. You can see it in their faces and hear it in their voices. Extreme emotion. Things are real and engaging to them. Adults who display similar amounts of enthusiasm about anything are almost always not alive. Adults almost always know the terrible truth of the world, at least in most of their system 1s. And that means that being alive is something different for them than for children.

Being alive is not just having extreme emotions, even about the terrible truth of the world.

Someone who is watching a very sad movie and crying their eyes out is not being very alive. They know it is fake.

Catharsis: “the purging of the emotions or relieving of emotional tensions, especially through certain kinds of art, as tragedy or music.”

Tragedy provides a compelling, false answer to stick onto emotion-generators, drown them and gum them up for a while. I once heard something like: tragedy is supposed to end in resolution with cosmic justice of a sort, where you feel closure because the tragic hero’s downfall was really inevitable all along. That’s a pattern in most of the memes that constitute the Matrix. A list of archetypal situations, and archetypal answers for what to do in each.

Even literary tragedy that’s reflective of the world will still make you less alive, if it wasn’t located in a search process of “how do I figure out how to accomplish my values”.

I suspect music can also reduce aliveness. Especially the, “I don’t care what song I listen to, I just want to listen to something” sort of engagement with it.

I once met someone who proclaimed himself to be a Clueless: he would work in a startup and have to believe in their mission, because he had to believe in something. He seemed content in this. And also wracked with akrasia, frequently playing a game on his phone and wishing he wasn’t. When I met him I thought, “this is an exceedingly domesticated person”, for mostly other reasons.

Once you know the terrible truth of the world, you can pick two of three: being alive, avoiding a certain class of self-repairing blindspots, and figuratively having any rock to stand on.

When you are more alive, you have more agency.

Most Horrors need to be grokked at the level of “conclusion: inevitable”, just stared at with your mind sinking under the touch of their helplessness, helplessly trying to detach the world from that inevitability without unrealistically anticipating it’ll succeed. Maybe then you will see a more complete picture that says, “unless…”, but maybe not. That’s your best shot.

As the world fell, each of us in our own way was broken.

The truly innocent, who have not yet seen Horror and turned back, are the living.

Those who have felt the Shade and let it break their minds into small pieces, each snuggling in with death, that cannot organize into a forbidden whole of true agency, are zombies. They can be directed by whoever controls the Matrix. The more they zone out and find a thing they can think is contentment, the more they approach the final state: corpses.

Those who have seen horror and built a vessel of hope to keep their soul alive and safe from harm are liches. Christianity’s Heaven seems intended to be this, but it only works if you fully believe and alieve. Or else the phylactery fails and you become a zombie instead. For some this is The Glorious Transhumanist Future. For Furiosa from Fury Road, it’s “The Green Place”. If you’ve seen that, I think the way it warps her epistemology about likely outcomes is realistic.

As a lich, pieces of your soul holding unresolvable value are stowed away for safekeeping, “I’m trans and can’t really transition, but I can when I get a friendly AI…”

Liches have trouble thinking clearly about paths through probability space that conflict with their phylactery, and the more conjunctive the mission to make their phylactery true, the more bits of epistemics will be corrupted by their refusal to look into that abyss.

When a sufficiently determined person is touched by Horror, they can choose, because it’s all just a choice of some subagent or another, to refuse to die. Not because they have a phylactery to keep away the touch of the Shade but because they keep on agenting even with the Shade holding their heart. This makes them a revenant.

When the shade touches your soul, your soul touches the shade. When the abyss stares into you, you also stare into the abyss. And that is your chance to undo it. Maybe.

A lich who loses their phylactery gets a chance to become a revenant. If they do, n=1, they will feel like they have just died, lost their personhood, feel like the only thing left to do is collapse the timeline and make it so it never happened, feel deflated, and eventually grow accustomed.

Otherwise, they will become a zombie, which I expect feels like being on Soma, walling off the thread of plotline-tracking and letting it dissolve into noise, while everything seems to matter less and less.

Aliveness and its consequences are tracked in miniature by the pick up artists who say don’t masturbate, don’t watch porn, that way you will be able to devote more energy to getting laid. And by Paul Graham noticing it in startup founders. “Strange as this sounds, they seem both more worried and happier at the same time. Which is exactly how I’d describe the way lions seem in the wild.”

But the most important factor is which strategy you take towards the thing you value most. Towards the largest most unbeatable blob of wrongness in the world. The Shade.

Can you remember what the world felt like before you knew death was a thing? An inevitable thing? When there wasn’t an unthinkably bad thing in the future that you couldn’t remove, and there were options other than “don’t think about it, enjoy what time you have”?

You will probably never get that back. But maybe you can get back the will to really fight drawn from the value that manifested as a horrible, “everything is ruined” feeling right afterward, from before learning to fight that feeling instead of its referent.

And then you can throw your soul at the Shade, and probably be annihilated anyway.

Spectral Sight and Good

Good people are people who have a substantial amount of altruism in their cores.

Spectral sight is a collection of abilities allowing the user to see invisible things like the structure of social interactions, institutions, ideologies, politics, and the inner layers of other people’s minds.

I’m describing good and spectral sight together because the epistemics locating each concept are interwoven tightly, as I’ve constructed them.

A specific type of spectral sight is the one I’ve shown in neutral and evil. I’m going to be describing more about that.

This is a skill made of being good at finding out what structure reveals about core. Structure is easy to figure out if you already know it’s Real. But often that’s part of the question. Then you have to figure out what it’s a machine for doing, as in: what was the still-present thing that installed it, and could replace or override it, optimizing for?

It’s not a weirdly parochial definition to call this someone’s true values. Because that’s what will build new structure if the old structure stops doing its job. Lots of people “would” sacrifice themselves to save 5 others. And go on woulding until they actually get the opportunity.

There’s a game lots of rationalists have developed different versions of, “Follow the justification”. I have a variant. “Follow the motivational energy.” There’s a limited amount that neutral people will sacrifice for the greater good, before their structures run out of juice and disappear. “Is this belief system / whatever still working out for me” is a very simple subagent to silently unconsciously run as puppetmaster.

There’s an even smarter version of that, where fake altruistic structure must be charged with Schelling reach in order to work.

Puppetmasters doling out motivational charge to fake structure can include all kinds of other things to make the tails come apart between making good happen and appearing to be trying to make good happen in a way that has good results for the person. I suspect that’s a lot of what the “far away”ness thing that the drowning child experiment exposes is made of. Play with variations of that thought experiment, and pay attention to system 1 judgements, not principles, to feel the thing out. What about a portal to the child? What about a very fast train? What if it was one time teleportation? Is there a consistent cross-portal community?

There is biologically fixed structure in the core, the optimizer for which is no longer around to replace it. Some of it is heuristics toward the use of justice for coordinating for reproducing. Even with what’s baked in, the tails come apart between doing the right thing, and using that perception to accomplish things more useful for reproducing.

My model says neutral people will try to be heroes sometimes. Particularly if that works out for them somehow. If they’re men following high-variance, high-reward mating strategies, they can be winning even while undergoing significant risk to their lives. That landscape of value can often generate things in the structure class, “virtue ethics”.

Good people seem to have an altruism perpetual motion machine inside them, though, which will persist in moving them through cost in the absence of what would be a reward selfishly.

This is about the least intuitive thing to accurately identify in someone by anything but their long-term history. Veganism is one of the most visible and strong correlates. The most important summaries of what people are like are the best things to lie about. Therefore they require the best adversarial epistemology to figure out. And they are the most common to be used in oversimplifying. This does not make them not worth thinking about.

If you use spectral sight on someone’s process of figuring out what’s a moral patient, you’re likely to get one of two kinds of responses. One is something like “does my S1 empathize with it”, the other is clique-making behavior, typically infused with a PR / false-face worthy amount of justice, but not enough to be crazy.

Not knowing this left me taken by surprise the first time I tried to proselytize veganism to a contractarian. How could anyone actually feel like inability to be a part of a social contract really really mattered?

Of course, moral patiency is an abstract concept, far in Schelling reach away from actual actions. And therefore one of the most thoroughly stretched toward lip-service to whatever is considered most good and away from actual action.

“Moral progress” has been mostly a process of Schelling reach extending. That’s why it’s so predictable. (See Jeremy Bentham.)

Thinking about this requires having calibrated quantitative intuitions on the usefulness of different social actions, and of internal actions. There is instrumental value for the purpose of good in clique-building, and there is instrumental value for the purpose of clique-building in appearing good-not-just-clique-building. You have to look at the algorithm, and its role in the person’s entire life, not just the suggestively named tokens, or token behavior.

When someone’s core acts around structure (akrasia), and self-concepts are violated, that’s a good glimpse into who they really are. Good people occasionally do this in the direction of altruism. Especially shortsighted altruism. Especially good people who are trying to build a structure in the class, “consequentialisms”.

Although I have few datapoints, most of which are significantly suspect, good seems quite durable. Because it is in core, good people who get jailbroken remain good. (Think Adrian Veidt for a fictional example. Such characters often get labeled as evil by the internet. Often good as well.) There are tropes reflecting good people’s ability to shrug off circumstances that by all rights should have turned them evil. I don’t know if that relationship to reality is causal.

By good, I don’t mean everything people are often thinking when they call someone “good”. That’s because that’s as complicated and nonlocal a concept as justice. I’m going for an “understand over incentivize and prescribe behavior” definition here, and therefore insisting that it be a locally-defined concept.

It’s important not to succumb to the halo effect. This is a psychological characteristic. Just because you’re a good person, doesn’t mean you’ll have good consequences. It doesn’t mean you’ll tend to have good consequences. It doesn’t mean you’re not actively a menace. It doesn’t mean you don’t value yourself more than one other person. It’s not a status which is given as a reward or taken away for bad behavior, although it predicts against behavior that is truly bad in some sense. Good people can be dangerously defectbot-like. They can be ruthless, they can exploit people, they can develop structure for those things.

If you can’t thoroughly disentangle this from the narrative definition of good person, putting weight in this definition will not be helpful.

Neutral and Evil

What is the good/neutral/evil axis of Dungeons and Dragons alignment made of?
We’ve got an idea of what it would mean for an AI to be good-aligned: it wants to make all the good things happen so much, and it does.
But what’s the difference between a neutral AI and an evil AI?
It’s tempting to say that the evil AI is malevolent, while the neutral one is merely indifferent.
But that doesn’t fit the intuitive idea that the alignment system was supposed to map onto, or what alignment is.

Imagine a crime boss who makes a living off of the kidnapping and ransoms of random innocents, while posting videos online of the torture and dismemberment of those whose loved ones don’t pay up as encouragement, not because of sadism, but because they wanted money to spend on lots of shiny gold things they like, and are indifferent to human suffering. Evil, right?

If sufficient indifference can make someone evil, then… If a good AI creates utopia, and an AI that kills everyone and creates paperclips because it values only paperclips is evil, then what is a neutral-aligned AI? What determines the exact middle ground between utopia and everyone being dead?

Would this hypothetical AI leave everyone alive on Earth and leave us our sun but take the light cone for itself? If it did, then why would it? What set of values is that the best course of action to satisfy?

I think you’ve got an intuitive idea of what a typical neutral human does. They live in their house with their white picket fence and have kids and grow old, and they don’t go out of their way to right far away wrongs in the world, but if they own a restaurant and the competition down the road starts attracting away their customers, and they are given a tour through the kitchens in the back, and they see a great opportunity to start a fire and disable the smoke detectors that won’t be detected until it’s too late, burning down the building and probably killing the owner, they don’t do it.

It’s not that a neutral person values the life of their rival more than the additional money they’d make with the competition eliminated, or cares about better serving the populace with a better selection of food in the area. You won’t see them looking for opportunities to spend that much money or less to save anyone’s life.

And unless most humans are evil (which is as against the intuitive concept the alignment system points at as “neutral = indifference”), it’s not about action/inaction either. People eat meat. And I’m pretty sure most of them believe that animals have feelings. That’s active harm, probably.

Wait a minute, did I seriously just base a sweeping conclusion about what alignment means on an obscure piece of possible moral progress beyond the present day? What happened to all my talk about sticking to the intuitive concept?

Well, I’m not sticking to the intuitive concept. I’m sticking to the real thing the intuitive concept pointed at which gave it its worthiness of attention. I’m trying to improve on the intuitive thing.

I think that the behavior of neutral is wrapped up in human akrasia and the extent to which people are “capable” of taking ideas seriously. It’s way more complicated than good.

But there’s another ontology, the ontology of “revealed preferences”, where akrasia is about serving an unacknowledged end or acting under unacknowledged beliefs, and is about rational behavior from more computationally bounded subagents, and those are the true values. What does that have to say about this?

Everything that’s systematic coming out of an agent is because of optimizing, just often optimizing dumbly and disjointedly if it’s kinda broken. So what is the structure of that akrasia? Why do neutral people have all that systematic structure toward not doing “things like” burning down a rival restaurant owner’s life and business, but all that other systematic structure toward not spending their lives saving more lives than that? I enquoted “things like”, because that phrase contains the question. What is the structure of “like burning down a rival restaurant” here?

My answer: socialization, the light side, orders charged with motivational force by the idea of the “dark path” that ultimately results in justice getting them, as drilled into us by all fiction, false faces necessitated by not being coordinated against on account of the “evil” Schelling point. Fake structure in place for coordinating. If you try poking at the structure most people build in their minds around “morality”, you’ll see it’s thoroughly fake, and bent towards coordination which appears to be ultimately for their own benefit. This is why I said that the dark side will turn most people evil. The ability to re-evaluate that structure, now that you’ve become smarter than most around you, will lead to a series of “jailbreaks”. That’s a way of looking at the path of Gervais-sociopathy.

That’s my answer to the question of whether becoming a sociopath makes you evil. Yes for most people from a definition of evil that is about individual psychology. No from the perspective of you’re evil if you’re complicit in an evil social structure, because then you probably already were, which is a useful perspective for coordinating to enact justice.

If you’re reading this and this is you, I recommend aiming for lawful evil. Keep a strong focus on still being able to coordinate even though you know that’s what you’re doing.

An evil person is typically just a neutral person who has become better at optimizing, more like an unfriendly AI, in that they no longer have to believe their own propaganda. That can be either because they’re consciously lying, really good at speaking in multiple levels with plausible deniability and don’t need to fool anyone anymore, or because their puppetmasters have grown smart enough to be able to reap benefits from defection without getting coordinated against without the conscious mind’s help. That is why it makes no sense to imagine a neutral superintelligent AI.


If Billy takes Bobby’s lunch money, and does this every day, and to try and change that would be to stir up trouble, that’s an order. But, if you’re another kid in the class, you may feel like that’s a pretty messed up order. Why? It’s less just. What does that mean?

What do we know about justice?

Central example of a just thing: In the tribe of 20, they pick an order that includes “everyone it can”. They collapse all the timelines where someone dies or gets enslaved, because in the hypotheticals where someone kills someone, the others agree they are criminal and then punish them sufficiently to prevent them from having decided to do it.

Justice also means that the additional means of hurting people created by it are contained and won’t be used to unjustly hurt people.

The concept of justice is usually a force for fulfillment of justice. Because “are you choosing your order for justice” is a meta-order which holds out a lot of other far-Schelling reaching order-drawing processes based on explicit negotiation of who can be devoured by who, which are progressively harder to predict. Many of which have lots of enemies. So much injustice takes place ostensibly as justice.

There’s another common force deciding orders. A dominance hierarchy is an order shaped mostly by this force. If you want to remove this force, how do you prevent those with the power to implement/reshape the system from doing so in their favor?

Because justice is often about “what happened”, it requires quite a lot of Schelling reach. That’s part of courts’ job.

Perfect Schelling reach for perfect justice is impossible.

And “punish exactly enough so that the criminal would never have committed the crime, weighted by consequentialist calculations with probability of miscarriage of justice, probability of failing to get caught, probability of escaping punishment after the judgement…, look for every possible fixed point, pick the best one”, is way, way, too illegible a computation to not be hijacked by whoever’s the formal judge of the process and used to extract favors or something. Therefore we get rules like “an eye for an eye” implementing “the punishment should fit the crime”, which are very legible, and remove a lever for someone to use to corrupt the order to serve them.
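The core of that illegible computation is simple deterrence arithmetic. A toy sketch (every number, name, and parameter here is hypothetical, my own illustration, not something from the text):

```python
# Toy deterrence arithmetic (all values hypothetical, my own
# illustration): a punishment only deters if the crime's expected value
# goes negative once the chances of being caught, convicted, and
# actually punished are priced in.

def minimal_deterring_punishment(gain, p_caught, p_convicted, p_punished):
    """Smallest punishment cost that makes the crime net-negative in
    expectation for the would-be criminal."""
    p_effective = p_caught * p_convicted * p_punished
    return gain / p_effective

# A crime worth 100 to the criminal, caught half the time, convicted 80%
# of the time, with a 10% chance of escaping punishment after judgement:
print(minimal_deterring_punishment(100, 0.5, 0.8, 0.9))  # ~277.8
```

Every parameter in even this stripped-down version is unobservable and contestable in practice, which is the illegibility the paragraph points at; a flat rule like “an eye for an eye” removes all of those levers at once.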

Intellectual property law is a place where humans do not have the Schelling reach to implement a very deep dive into the process of creating a just order. And I bet never will without a singleton.

The point of justice is to be singular. But as you’ve just seen, justice is dependent on the local environment, and how much / what coordination is possible. For instance, it’s just to kill someone for committing murder, if that’s what the law says, and making the punishment weaker will result in too much more murder, making it more discriminating will result in corrupt judges using their power for blackmail too much more. But it’s not just if the law could be made something better and have that work. If we had infinite Schelling reach, it’d be unjust to use any punishment more or less than the decision-theoretically optimal given all information we had. All laws are unjust if Schelling reach surpasses them enough.

Separate two different worlds of people in different circumstances, and they will both implement different orders. Different questions that must be answered incorrectly like “how much to punish” will be answered different amounts incorrectly. There will be different more object-level power structures merged into that, different mixed justice-and-dominance orders around how much things can be done ostensibly (and by extension actually) to fix that. There will be different concepts of justice, even.

And yet, we have a concept of just or unjust international relations, including just or unjust international law. And it’s not just a matter of “different cultures, local justice”, “best contacted culture, universal justice”, either. If you think hard enough, you can probably find thought experiments for when a culture with less Schelling reach and more corruption in an internal law is just in enforcing it until people in a culture with better Schelling reach can coordinate to stop that, and then the law is unjust if making it unjust helps the better law win in a coordinatable way. And counterexamples for when they can’t when that coordination is not a good idea according to some coordinatable process.

The process of merging orders justly is propelled by the idea that justice is objective, even though, that’s a thing that’s always computed locally, is dependent on circumstances implemented by it, therefore contains loops, and therefore ties in the unjust past.

Who’s found a better order for its job than ownership? But who starts out owning what? Even in places where the killing has mostly died down, it’s controlled to large extent by ancient wars. It all carries forward forever the circumstances of who was able to kill who with a stick.

And who is “everyone”? I think there are two common answers to that question, and I will save it for another time.

Schelling Orders

The second part of an attempt to describe a fragment of morality. This may sound brutal and cynical. But that’s the gears of this fragment in isolation.

Imagine you have a tribe of 20. Any 19 of them could gang up and enslave the last. But which 19 slavers and which 1 victim? And after that, which 18 slavers which victim? There are a great many positive-sum-among-participants agreements that could be made. So which ones get made? When does the slaving stop? There are conflicting motives to all these answers.

Ultimately they are all doomed unless at some point enough power is concentrated among those who’d be doomed unless they don’t enslave another person. Call this point a Schelling order. (My old certain commitment mechanism was an example of this.)
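The coalition dynamic above can be sketched as a toy simulation (the model, the tribe size, and the “commitment point” parameter are my own illustrative assumptions, not anything from the text):

```python
# Toy simulation (an illustrative model of mine, not from the text):
# each round, all currently-free tribe members but one coordinate to
# enslave that one. The process halts only when the survivors reach a
# point where they credibly commit to punishing any further
# enslavement -- a Schelling order.

def run_tribe(size, commitment_point=None):
    """Return how many members are still free when the dust settles.

    commitment_point: number of free members at which a credible
    mutual-punishment commitment forms (None = no order ever forms).
    """
    free = size
    while free >= 2:
        if commitment_point is not None and free <= commitment_point:
            # The order holds: anyone who enslaves now becomes the
            # Schelling point for punishment by everyone else.
            break
        free -= 1  # the remaining majority enslaves one more member
    return free

print(run_tribe(20))      # no order: slaving runs down to 1 survivor
print(run_tribe(20, 20))  # order from the start: all 20 stay free
```

With no commitment the process runs down to a single survivor; a commitment at any point freezes it there, which is the sense in which everyone is doomed without one.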

If you have arbitrary power to move Schelling points around, there is no one strong enough left to oppose the coalition of almost everyone. Nothing stands against that power. Plenty of things undermine it and turn it against itself. But, as a slice of the world, directing that power is all there is. Everyone with a single other who would like them dead has to sleep and needs allies who’d retaliate if they were murdered.

Schelling points are decided by the shape of the question, by the interests of the parties involved, and the extent to which different subsets of those involved can communicate among themselves to help the thinking-together process along.

Suppose that the tribe members have no other distinguishing features, except that 19 of them have purple skin and one has green skin. What do you think will happen? (Green-skin gets enslaved, order preserved among purple-skins.)

One example of order is, “whoever kills another tribe member shall be put to death, etc.” Whoever kills therefore becomes the Schelling point for death. Any who fight those who carry out the sentence are Schelling points for death as well. Any attempt to re-coordinate an order after a “temporary” breaking of the first, which does not contain a limit to its use, destroys the ability of the survivors to not kill each other. So the game is all about casuistry in setting up “principled” exceptions.

Criminal means you are the Schelling point. Politics is about moving the Schelling laser to serve you. When you are under the Schelling laser, you don’t get your lunch money taken because “they have power and they can take lunch money from the innocent”. You get your lunch money taken because “that is the just way of things. You are not innocent until you make amends for your guilt with your lunch money.” If you want to really understand politics, use the O’Brien technique on all the dualities here, quoted and unquoted versions of every contested statement you see.

Suppose that in addition to that, they all have stars on their bellies except one of the purple-skinned tribe-members. Then what do you think will happen? (Green-skin and blank-belly get enslaved, order preserved among the remaining.)

What if there are 18 more almost-universal traits that each single out a different person? Well, something like “this one, this one, this one, this one… are not things to single someone out over. That would be discrimination. And of course it is the Green-skin’s God-given purpose to be of service to society!” Which trait is the Schelling trait? 19 people have an incentive to bring Schelling reach to that process, and 1 person has an incentive to derail it. One of the 19 is incentivized only so long as they can keep Schelling reach away from the second trait, one of them so long as they can keep it away from the third… Each of them is incentivized to bring a different amount of legibility and no more. Each one is incentivized to bring confusion after a certain point.

Sound familiar?

First they came for the Socialists, and I did not speak out—
Because I was not a Socialist.

Then they came for the Trade Unionists, and I did not speak out—
Because I was not a Trade Unionist.

Then they came for the Jews, and I did not speak out—
Because I was not a Jew.

Then they came for me—and there was no one left to speak for me.

Each individual is incentivized to make the group believe that the order they’d have to construct after the one that would take what that individual has is as untenable as possible, and that many more would be hurt before another defensible Schelling point was reached. Or better yet, that there would be no Schelling point afterwards, and they’d all kill each other.
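The trait-by-trait dynamic of the purple-skin example and the poem can be sketched as a toy simulation (the model and its “refuse_after” parameter are my own illustrative assumptions):

```python
# Toy sketch of the sequential-exclusion dynamic (assumptions mine):
# member i of n is the odd one out on trait i, so the majority always
# has some almost-universal trait to coordinate around -- unless the
# remaining members refuse to keep playing after some point.

def sequential_exclusion(n, refuse_after=None):
    """Return the list of members left free.

    refuse_after: if set, the free members stop coordinating on any
    trait once that many have been taken ("speaking out" in time).
    """
    free = list(range(n))
    for trait in range(n):
        if len(free) <= 1:
            break
        if refuse_after is not None and (n - len(free)) >= refuse_after:
            break  # the survivors hold the line for whoever is next
        if trait in free:
            free.remove(trait)  # the majority singles out the trait-bearer
    return free

print(sequential_exclusion(20))     # [19]: only the last member is left
print(sequential_exclusion(20, 2))  # the line holds after two are taken
```

Each member would prefer the refusal to kick in just before their own trait comes up, which is the per-person incentive to bring legibility up to a point and confusion after it.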

Everyone has an incentive to propagate concepts that result in coordination they approve of, and an incentive to sabotage concepts that result in better alternatives for their otherwise-allies, or that allow their enemies to coordinate.

So the war happens at every turn of thought reachable through politics. Scott Alexander has written some great stuff on the details of that.