Reducing Agents: When Abstractions Break

In the past few months, I’ve been noticing more things that lead me to believe there’s something incomplete about how I think about beliefs, motives, and agents. There have been one too many instances of me wondering, “Yeah, but what do you really believe?” or “Is that what you really want?”

This post is the first in a series where I'm going to apply More Dakka to a lot of LessWrong ideas I was already familiar with, but hadn’t quite connected the dots on.

Here are the main points:

Agents are an abstraction for making predictions more quickly.
In what contexts does this abstraction break down?
What model should be used in places where it does break down?

Relevant LW posts (will also be linked throughout)

Reductionism 101

Blue Minimizing Robot

Adaptation-Executors, Not Fitness-Maximizers

Abstractions

Abstraction is awesome. Being able to make quality high-level abstractions is like a superpower that bends the universe to your will. Think of an abstraction as a model. A given abstraction has its ontologically basic building blocks defined, as well as the rules that govern their interactions.

Imagine a universe where 2x6 Lego bricks are the base level of reality. They connect to each other just like they do in our universe, but they do so axiomatically, not because of friction or anything. In this universe, we might make a higher-level abstraction by defining a handful of multi-brick structures to be ontologically basic. You lose some of the resolution of having 2x6s as your basis, but it also doesn’t take as long to make large things.

That’s the fundamental trade-off of abstractions. Each time you hop up a layer, you lose some detail (and thus, model accuracy) but you gain simplicity in computation. You could talk about a complex system like a computer on the quark level, but the computational time would be silly. Same for the atom or molecule layer. There’s a sparkle of hope when you get to the level where transistors are basic. Now you can quickly talk about tons of cool stuff, but talking about an entire computer is still out of your reach. Hop up to logic gates. Hop up to basic components like adders, multiplexers, and flip-flops. Now we've reached a level where you could actually design a useful piece of hardware that does something. Hop up to registers and ALUs. Make the awesome leap to having a 16-bit CPU that can be programmed in assembly. Keep going all the way up until it’s possible to say, “Did you see Kevin’s skiing photos on Facebook?”
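To make the layer-hopping concrete, here is a minimal Python sketch (mine, not taken from any real hardware design) that treats a NAND gate as ontologically basic, then hops up to logic gates, a full adder, and a 16-bit add. All the function names are made up for illustration.

```python
def nand(a: int, b: int) -> int:
    """Base layer for this sketch: NAND is 'basic', transistors are ignored."""
    return 0 if (a and b) else 1

# Logic-gate layer: built only out of nand.
def not_(a): return nand(a, a)
def and_(a, b): return not_(nand(a, b))
def or_(a, b): return nand(not_(a), not_(b))
def xor(a, b): return and_(or_(a, b), nand(a, b))

# Component layer: a one-bit full adder built only out of gates.
def full_adder(a, b, carry_in):
    partial = xor(a, b)
    total = xor(partial, carry_in)
    carry_out = or_(and_(a, b), and_(partial, carry_in))
    return total, carry_out

# Register/ALU layer: 16-bit addition built only out of full adders.
def add16(x: int, y: int) -> int:
    carry, result = 0, 0
    for i in range(16):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result  # the final carry is dropped: a detail this layer ignores

small_sum = add16(1234, 4321)   # behaves like ordinary addition
leaky_sum = add16(65535, 1)     # wraps around to 0 instead of 65536
```

At the top layer you get to think "add16 adds numbers" and forget about gates entirely, right up until the numbers stop fitting in 16 bits. That wrap-around is the kind of detail that got thrown away on the way up.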

Each time we hopped a layer of abstraction, we decided to simplify our models by ignoring some details. Luckily for us, many brave souls have pledged their lives to studying the gaps between layers of abstractions. There’s someone who works on how to make transistors closer to ontologically basic things. On the flip side, there’s someone with the job of knowing how transistors really work, so they can design circuits where it truly doesn’t matter that transistors aren’t basic. The rest of us can just hang out and work on our own level of abstraction, carefree and joyful.

For something like computers, lots of smart people have put lots of thought into each layer of abstraction. Not to say that computer engineering abstractions are the best they can be, but to make the point that you don’t get good abstractions for free. Most possible next-level abstractions you could make would suck. You only get a nice abstraction when you put in hard work. And even then, abstractions are always leaky (except maybe in math, where you declare your model to have nothing to do with reality).

The thing that’s really nice about engineering abstractions is that they are normally completely specified. Even if you don’t know how the IEEE defines floating point arithmetic, there is a canonical version of “what things mean”. In engineering, looking up definitions is often a totally valid and useful way to resolve arguments.

When an abstraction is under-specified, there’s lots of wiggle room concerning how something is supposed to work, and everyone fills in the blanks with their own intuition. It’s totally acceptable for parts of your abstraction to be under-specified. The C programming language declares the outcome of certain situations to be undefined. For example, C does not specify what happens when you try to dereference a null pointer. You do get problems when you don’t realize that parts of your abstraction are under-specified. Language is a great example of a useful abstraction that is under-specified, yet to the untrained feels completely specified, and that difference leads to all sorts of silly arguments.

Agents and Ghosts

I’m no historian, but from what I’ve gathered most philosophers for most of history have modeled people as having some sort of soul. There is some ethereal other thing which is one’s soul, and it is the source of you, your decisions, and your consciousness. There is this machine which is your body, and some Ghost that inhabits it and makes it do things.

Even though it’s less common to think Ghosts are part of reality, we still model ourselves and others as having Ghosts, which isn’t the most helpful. Ghosts are so under-specified that they shift almost all of the explanatory burden to one’s intuition. Ghosts do not help explain anything, because they can stretch as much as one’s intuition can, which is a lot.

Lucky for us, people have since made better abstractions than just the basic Ghost. The decision theory notion of an Agent does a pretty good job of capturing the important parts of “A thing that thinks and decides”. Agents have beliefs about the world, some way to value world states, some way of generating actions, and some way to choose between them (if there are any models of agents that are different let me know in the comments).
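Here is one possible rendering of those four parts as a Python sketch. The class, its field names, and the umbrella example are all my own illustrative assumptions, not a canonical decision-theory API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = str
Action = str

@dataclass
class Agent:
    # Beliefs: for each action, a probability distribution over resulting states.
    beliefs: Callable[[Action], Dict[State, float]]
    # Values: a number for each world state.
    utility: Callable[[State], float]
    # Action generation: the options the agent considers.
    actions: Callable[[], List[Action]]

    def expected_utility(self, action: Action) -> float:
        return sum(p * self.utility(s) for s, p in self.beliefs(action).items())

    def choose(self) -> Action:
        # Choice rule: pick the action with the highest expected utility.
        return max(self.actions(), key=self.expected_utility)

# Hypothetical toy example: should I carry an umbrella?
outcomes = {
    "take umbrella": {"dry but burdened": 1.0},
    "leave umbrella": {"dry and unburdened": 0.7, "wet": 0.3},
}
values = {"dry but burdened": 0.8, "dry and unburdened": 1.0, "wet": 0.0}

umbrella_agent = Agent(
    beliefs=lambda action: outcomes[action],
    utility=lambda state: values[state],
    actions=lambda: list(outcomes),
)
decision = umbrella_agent.choose()  # "take umbrella" (0.8 beats 0.7)
```

Notice how much is left outside the model: where the beliefs come from, how the list of actions gets generated, and whether the values stay put when the context changes.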

Again, we are well versed in reductionism and know that there are no agents in the territory. They are a useful abstraction which we use to predict what people do. We use it all the time, and it often works to great success. It seems to be a major load bearing abstraction in our tool kit for comprehending the world.

The rest of this series is a sustained meditation on two questions, ones that are vital to ask anytime one asks an abstraction to do a lot of work:

In what contexts does the Agent abstraction break down?
When it breaks down, what model do we use instead?

The rest of this post is going to be some primer examples of the Agent abstraction breaking down.

The Blue Minimizing Robot

Remember the Blue Minimizing Robot? (Scott’s sequence was a strong primer for my thoughts here)

Imagine a robot with a turret-mounted camera and laser. Each moment, it is programmed to move forward a certain distance and perform a sweep with its camera. As it sweeps, the robot continuously analyzes the average RGB value of the pixels in the camera image; if the blue component passes a certain threshold, the robot stops, fires its laser at the part of the world corresponding to the blue area in the camera image, and then continues on its way.

It’s tempting to look at that robot and go, “Aha! It’s a blue minimizing robot.” Now you can model the robot as an agent with goals and go about making predictions. Yet time and time again, the robot fails to achieve the goal of minimizing blue.

In fact, there are many ways to subvert this robot. What if we put a lens over its camera which inverts the image, so that white appears as black, red as green, blue as yellow, and so on? The robot will not shoot us with its laser to prevent such a violation (unless we happen to be wearing blue clothes when we approach) - its entire program was detailed in the first paragraph, and there's nothing about resisting lens alterations. Nor will the robot correct itself and shoot only at objects that appear yellow - its entire program was detailed in the first paragraph, and there's nothing about correcting its program for new lenses. The robot will continue to zap objects that register a blue RGB value; but now it'll be shooting at anything that is yellow.

Maybe you conclude that the robot is just a Dumb Agent™. It wants to minimize blue, but it just isn’t clever enough to figure out how. But as Scott points out, the key error with such an analysis is to even model the robot as an agent in the first place. The robot’s code is all that’s needed to fully predict how the robot will operate in all future scenarios. If you were in the business of anticipating the actions of such robots, you’d best forget about trying to model it as an agent and just use the source code.
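To see what "just use the source code" buys you, here is a rough Python sketch of the robot. It is a simplification of Scott's description, and the details are my assumptions: the world is a dictionary of named colored objects, the threshold is an arbitrary number, and the lens is a function applied to each pixel before the robot sees it.

```python
BLUE_THRESHOLD = 128  # assumed value; the original description does not give one

def no_lens(pixel):
    return pixel

def inverting_lens(pixel):
    """Hypothetical lens that inverts every color channel."""
    r, g, b = pixel
    return (255 - r, 255 - g, 255 - b)

def robot_step(world, lens=no_lens):
    """Return the names of the objects the robot fires at during one sweep.

    The robot only ever sees lens(color). Nothing here mentions goals,
    lenses being tampered with, or the true color of anything.
    """
    targets = []
    for name, color in world.items():
        r, g, b = lens(color)
        if b > BLUE_THRESHOLD and b > r and b > g:
            targets.append(name)
    return targets

world = {
    "sky poster": (0, 0, 255),   # blue
    "banana": (255, 255, 0),     # yellow
    "brick": (200, 30, 30),      # red
}

normal_targets = robot_step(world)                    # ["sky poster"]
inverted_targets = robot_step(world, inverting_lens)  # ["banana"]
```

A "blue minimizer" model predicts the robot will do something about the lens. The code predicts it will happily zap the banana. The code is right.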

The Connect 4 VNM Robot

I’ve got a Connect 4 playing robot that beats you 37 times in a row. You conclude it’s a robot whose goal is to win at Connect 4. I even let you peek at the source code, and aha! It’s explicitly encoded as a VNM agent using a mini-max algorithm. Clearly this can safely be modeled as an expected utility maximizer with the goal of whooping you at Connect 4, right?

Well, depends on what counts as safely. If the ICC (International Connect 4 Committee) declares that winning at Connect 4 is actually defined by getting 5 in a row, my robot is going to start losing games to you. Wait, but isn’t it cheating to just say we are redefining what winning is? Okay, maybe. Instead of redefining winning, let’s run interference. Every time my robot is about to place a piece, you block the top of the board (but only for a few seconds). My robot will let go of its piece, not realizing it never made a move. Argh! If only the robot were smart enough to wait until you stopped blocking the board, then it could have achieved its true goal of winning at Connect 4!

Except this robot doesn’t have any such goal. The robot is only code, and even though it’s doing a faithful recreation of a VNM agent, it’s still not a Connect 4 winning robot. Until you make an Agent model that is at least as complex as the source code, I can put the robot in a context where your Agent model will make an incorrect prediction.
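The blocking trick can be sketched in a few lines of Python. This is a hypothetical robot, not real code from anywhere: choose_column is a stand-in for the mini-max search, and the class names are invented. The thing to look at is that the robot updates its internal board without ever checking whether the piece landed.

```python
class PhysicalBoard:
    """The real board, which a human can block from above."""
    def __init__(self, columns=7):
        self.columns = [[] for _ in range(columns)]
        self.blocked = False

    def drop(self, column, piece):
        if self.blocked:
            return  # the piece bounces off a hand and never enters the board
        self.columns[column].append(piece)

class Robot:
    def __init__(self, columns=7):
        self.internal_board = [[] for _ in range(columns)]

    def choose_column(self):
        # Placeholder for the mini-max search over self.internal_board.
        return 3

    def take_turn(self, board):
        column = self.choose_column()
        board.drop(column, "R")
        # The code assumes the drop succeeded. There is no step that looks
        # at the physical board to confirm it.
        self.internal_board[column].append("R")
        return column

board = PhysicalBoard()
robot = Robot()
board.blocked = True       # you cover the top of the board for a few seconds
robot.take_turn(board)
board.blocked = False

real_column = board.columns[3]           # [] : no piece was ever placed
believed_column = robot.internal_board[3]  # ["R"] : the robot thinks it moved
```

The mini-max part can be a flawless VNM agent over the internal board and the robot will still lose, because "win at Connect 4" was never anywhere in the code. What's in the code is "search this data structure, then open the gripper."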

“So what?” you might ask. What if we don’t care about every possible context? Why can’t we use an Agent model and only put the robot in contexts where we know the abstraction works? We absolutely can do that. We just want to make sure we never forget that this model breaks down in certain places, and we'd also like to know exactly where and how it will break down.

Adaptation Executors, Not Fitness Maximizers

Things get harder when we talk about humans. We can’t yet “use the source code” to make predictions. At first glance, using Agents might seem like a perfect fit. We want things, we believe things, and we have intelligence. You can even look at evolution and go, “Aha! People are fitness maximizers!” But then you notice weird things like the fact that humans eat cookies.

Eliezer has already tackled that idea.

No human being with the deliberate goal of maximizing their alleles' inclusive genetic fitness, would ever eat a cookie unless they were starving. But individual organisms are best thought of as adaptation-executors, not fitness-maximizers.

Adaptation executors, not fitness-maximizers.

Repeat that 5 more times every morning upon waking, and then thrice more at night before going to bed. I’ve certainly been muttering it to myself for the last month that I’ve been dwelling on this post. Even if you’ve already read the Sequences, give that chunk another read through.

Rebuttal: Maybe fitness isn’t the goal. Maybe we should model humans as Agents who want cookies.

We could, but that doesn’t work either. More from Scott:

If there is a cookie in front of me and I am on a diet, I may feel an ego dystonic temptation to eat the cookie - one someone might attribute to the "unconscious". But this isn't a preference - there's not some lobe of my brain trying to steer the universe into a state where cookies get eaten. If there were no cookie in front of me, but a red button that teleported one cookie from the store to my stomach, I would have no urge whatsoever to press the button; if there were a green button that removed the urge to eat cookies, I would feel no hesitation in pressing it, even though that would steer away from the state in which cookies get eaten. If you took the cookie away, and then distracted me so I forgot all about it, when I remembered it later I wouldn't get upset that your action had decreased the number of cookies eaten by me. The urge to eat cookies is not stable across changes of context, so it's just an urge, not a preference.

Like with the blue minimizing robot, it’s tempting to resort to using a Dumb Agent™ model. Maybe you really do have a preference for cookies, but there is a counter-preference for staying on your diet. Maybe proximity to cookies increases how much you value the cookie world-state. There are all sorts of weird ways you could specify your Dumb Agent™ to produce human cookie behavior. But please, don’t.

I can’t appeal to “Just use the source code” anymore, but hopefully, I’m getting across the point that it’s at least a little bit suspicious that we (I) want to conform all human behavior to the Agent Abstraction.

So if we aren’t agents, what are we?

Hopefully that last sentence triggered a strong reflex. Remember, it’s not a question of whether or not we are agents. We are quarks/whatever-is-below, all hail reductionism. We are trying to get a better understanding of when the Agent abstraction breaks down, and what alternative models to use when things do break down.

This post’s main intent was to motivate this exploration, and put to rest any fears that I am naively trying to explain away agents, beliefs, and motives.

Next Post: What are difficult parts of intelligence that the Agent abstraction glosses over?