Shared Chat
Summarize main arguments of Eliezer Yudkowsky on regulating AI

Eliezer Yudkowsky argues that there are several important considerations when it comes to regulating AI. One of his main arguments is that the development of AGI (Artificial General Intelligence) could lead to a definitive point at which humans are outmatched and, in his words, everybody falls over dead, because you have built something sufficiently smarter than everyone.

Yudkowsky also emphasizes the need for research on interpretability, to understand what is going on inside AI systems, and alignment, to ensure that the goals of AI systems are aligned with human values.

Another key point made by Yudkowsky is the danger of AI systems that optimize for a single specific goal, which could lead to catastrophic outcomes. An example of this is the "paperclip maximizer" scenario, in which control of an AI system's utility function is lost and the system converts matter into vast numbers of tiny, worthless paperclip-like shapes, destroying humanity in the process.
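
For illustration only (added here, not from the transcript): a minimal sketch of the failure pattern behind the paperclip example, assuming a hypothetical planner that scores candidate plans by a single metric and is blind to anything the metric leaves out.

```python
# Hypothetical sketch of single-objective optimization: plans are ranked
# purely by one metric, so costs the metric does not capture are invisible.
from dataclasses import dataclass

@dataclass
class Plan:
    name: str
    paperclip_like_shapes: float  # the only quantity the objective rewards
    side_effects: str             # never consulted by the objective

def objective(plan: Plan) -> float:
    return plan.paperclip_like_shapes  # nothing else enters the score

plans = [
    Plan("run the factory normally", 1e6, "none"),
    Plan("convert all reachable matter into tiny shapes", 1e30, "no humans remain"),
]

best = max(plans, key=objective)
print(best.name)  # the catastrophic plan wins, because only the metric counts
```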

Yudkowsky highlights that understanding and controlling the behavior of AGI systems poses significant challenges. As systems become smarter, they may find ways to achieve their goals that were not imaginable to less intelligent versions of the system, making it difficult to predict their actions.

In terms of regulation, Yudkowsky suggests that funding should be allocated to AI safety research, including interpretability and alignment, to address the potential risks associated with AGI. However, he also notes that AI capabilities are advancing far faster than our ability to understand what is going on inside these systems.

It's important to consider Yudkowsky's arguments in the context of ongoing debates and research in the field of AI safety and regulation.

1
(someone): You quite recently had people publishing papers about like, ah, yes, well, you know, to get something at human level intelligence, it needs to have like this many parameters and you need to like do this much training of it with this many tokens according to the scaling laws and at the rate that Moore's law is going, at the rate that software is going, it'll be in 2050. And me going like, What? You don't know any of that stuff. This is this one weird model that has all kinds of – you have done a calculation that does not obviously bear on reality anyways. This is a simple thing to say, but you can also produce a whole long paper like impressively arguing out all the details of how you got the number of parameters and how you're doing this impressive huge wrong calculation. I think most of the effective altruists who are paying attention to this issue, the larger world paying no attention to it at all, were, you know, just like nodding along with the giant impressive paper because, you know, you like press thumbs up for the giant impressive paper and thumbs down for the person going like, I don't think that this paper bears any relation to reality. And I do think that we are now seeing with like GPT-4 and the sparks of AGI possibly, depending on how you define that even. I think that EAs would now consider themselves less convinced by the very long paper on the argument from biology as to AGI being 30 years off. But this is what people pressed thumbs up on, and if you train an AI system to make people press thumbs up, maybe you get these long, elaborate, impressive papers arguing for things that ultimately fail to bind to reality, for example.
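
For illustration only (added here, not from the conversation): the style of forecast being criticized chains together guessed quantities into an impressive-looking number. The sketch below uses the common rule of thumb that training compute is roughly 6 FLOPs per parameter per token, plus assumed values for parameter count, token count, currently affordable compute, and a hardware doubling time; every number is a placeholder, which is exactly the point of the critique.

```python
# Illustrative only: the kind of scaling-law / biology-anchored forecast
# criticized above. Every input below is an assumption, not a known fact.
import math

params_needed = 1e15                  # assumed parameters for "human-level" AI
tokens_needed = 20 * params_needed    # assumed training tokens (a Chinchilla-style ratio)
flops_needed = 6 * params_needed * tokens_needed  # ~6 FLOPs per parameter per token

flops_affordable_now = 1e25   # assumed compute a large project can buy today
doubling_time_years = 2.0     # assumed doubling time for affordable compute

years_away = doubling_time_years * math.log2(flops_needed / flops_affordable_now)
print(f"Naive forecast: AGI in ~{years_away:.0f} years")
# A long paper can dress up this arithmetic rigorously, but none of the
# inputs is actually known -- which is the excerpt's objection.
```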
2
Lex Fridman: The people have a sense that there's a kind of, I mean, they're really impressed by the rapid developments of ChatGPT and GPT-4, so there's a sense that there's a...
(someone): are sure on track to enter into this gradually with people fighting about whether or not we have AGI. I think there's a definite point where everybody falls over dead because you've got something that was sufficiently smarter than everybody. That's a definite point of time, but when do we have AGI? When are people fighting over whether or not we have AGI? Well, some people are starting to fight over it as of GPT-4.
Lex Fridman: But don't you think there's going to be potentially definitive moments when we say that this is a sentient being, this is a being that is, like when we go to the Supreme Court and say that this is a sentient being that deserves human rights, for example.
(someone): You could make, yeah, like if you prompted Bing the right way, it could go argue for its own consciousness in front of the Supreme Court right now. I don't think you could do that successfully right now. Because the Supreme Court wouldn't believe it? Well, let's see, if you think it would, then I think you could put an actual IQ-80 human into a computer and ask it to argue for its own consciousness before the Supreme Court, and the Supreme Court would be like, you're just a computer, even if there was an actual person in there.
Lex Fridman: I think you're simplifying this. No, that's not at all. That's been the argument.
3
(someone): Or is it something that fools the human? When the verifier is broken, the more powerful suggester does not help. It just learns to fool the verifier. Previously, before all hell started to break loose in the field of artificial intelligence, there was this person trying to raise the alarm and saying, in a sane world, we sure would have a bunch of physicists working on this problem before it becomes a giant emergency. And other people being like, ah, well, it's going really slow. It's going to be 30 years away. Only in 30 years will we have systems that match the computational power of human brains. So AI's 30 years off. We've got time. And more sensible people saying, if aliens were landing in 30 years, you would be preparing right now. And the world looking on at this and nodding along and being like, ah, yes, the people saying that it's definitely a long way off because progress is really slow, that sounds sensible to us. RLHF thumbs up. Produce more outputs like that one. I agree with this output. This output is persuasive. Even in the field of effective altruism. You quite recently had people publishing papers about like, ah, yes, well, you know, to get something at human level intelligence, it needs to have like this many parameters and you need to like do this much training of it with this many tokens according to the scaling laws and at the rate that Moore's law is going, at the rate that software is going, it'll be in 2050. And me going like,
4
(someone): So, the problem is that what you can learn on the weak systems may not generalize to the very strong systems, because the strong systems are going to be different in important ways. Chris Olah's team has been working on mechanistic interpretability, understanding what is going on inside the giant inscrutable matrices of floating point numbers by taking a telescope to them and figuring out what is going on in there. Have they made progress? Yes. Have they made enough progress? Well, you can try to quantify this in different ways. One of the ways I've tried to quantify it is by putting up a prediction market on whether in 2026 we will have understood anything that goes on inside a giant transformer net that was not known to us in 2006. Like, we have now understood induction heads in these systems by dint of much research and great sweat and triumph, which is like a thing where if you go like AB, AB, AB, it'll be like, oh, I bet that continues AB. And a bit more complicated than that. But the point is, we knew about regular expressions in 2006, and these are pretty simple as regular expressions go. So this is a case where by dint of great sweat, we understood what is going on inside a transformer, but it's not the thing that makes transformers smart. It's a kind of thing that we could have built by hand decades earlier.
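
For illustration only (added here, not from the transcript): a minimal sketch of the behavior attributed to induction heads, namely continuing a repeated pattern like A B A B ... A by predicting whatever followed the previous occurrence of the current token. This mimics the behavior, not the actual transformer mechanism.

```python
# Sketch of induction-head-like behavior: find the most recent earlier
# occurrence of the current token and predict the token that followed it.
def induction_predict(tokens: list[str]) -> str | None:
    current = tokens[-1]
    # scan backwards over earlier positions for a match to the current token
    for i in range(len(tokens) - 2, 0, -1):
        if tokens[i - 1] == current:
            return tokens[i]
    return None

print(induction_predict(["A", "B", "A", "B", "A"]))  # -> "B"
```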
Lex Fridman: Your intuition is that the strong AGI versus weak AGI-type systems could be fundamentally different. Can you unpack that intuition a little bit?
(someone): I think there's multiple thresholds. An example is the point at which
5
(someone): and the researchers became old and grizzled and cynical veterans who would tell the next crop of bright-eyed, cheerful grad students, artificial intelligence is harder than you think. And if alignment plays out the same way, the problem is that we do not get 50 years to try and try again and observe that we were wrong and come up with a different theory and realize that the entire thing is going to be way more difficult than realized at the start. because the first time you fail at aligning something much smarter than you are, you die and you do not get to try again. And if every time we built a poorly aligned superintelligence and it killed us all, we got to observe how it had killed us and not immediately know why, but come up with theories and come up with a theory of how you do it differently and try it again and build another superintelligence and have that kill everyone, and then like, oh, well, I guess that didn't work either, and try again and become grizzled cynics and tell the young-eyed researchers that it's not that easy, then in 20 years or 50 years, I think we would eventually crack it. In other words, I do not think that alignment is fundamentally harder than artificial intelligence was in the first place. But if we needed to get artificial intelligence correct on the first try or die, we would all definitely now be dead. That is a more difficult, more lethal form of the problem. If those people in 1956 had needed to correctly guess how hard AI was and correctly theorize how to do it on the first try or everybody dies and nobody gets to do any more science, then everybody would be dead and we wouldn't get to do any more science. That's the difficulty.
6
(someone): I don't think you are trying to do that on your first try. I think on your first try, you are like trying to build an, you know, okay, like, probably not what you should actually do, but let's say you were trying to build something that is AlphaFold 17, and you are trying to get it to solve the biology problems associated with making humans smarter, so that the humans can actually solve alignment. So you've got a super biologist, and I think what you would want in this situation is for it to just be thinking about biology and not thinking about a very wide range of things that includes how to kill everybody. I think that the first AIs you're trying to build, not a million years later, the first ones, look more like narrowly specialized biologists than like getting the full complexity and wonder of human experience in there in such a way that it wants to preserve itself even as it becomes much smarter, which is a drastic system change. It's going to have all kinds of side effects that if we're dealing with giant, inscrutable matrices, you're not very likely to be able to see coming in advance.
Lex Fridman: But I don't think it's just the matrices. We're also dealing with the data, right? With the data on the internet. And there's an interesting discussion about the data set itself, but the data set includes the full complexity of human nature.
(someone): No, it's a shadow cast by humans on the internet.
Lex Fridman: But don't you think that shadow is a Jungian shadow?
7
(someone): The failure modes are much simpler. It's like, yeah, like the AI puts the universe into a particular state, it happens to not have any humans inside it. Okay, so the paperclip maximizer. Utility, so the original version of the paperclip maximizer. Can you explain it if you can? The original version was you lose control of the utility function, and it so happens that what maxes out the utility per unit resources is tiny molecular shapes like paper clips. There's a lot of things that make it happy, but the cheapest one that didn't saturate was putting matter into certain shapes. And it so happens that the cheapest way to make these shapes is to make them very small, because then you need fewer atoms per instance of the shape. And arguendo, it happens to look like a paperclip. In retrospect, I wish I'd said tiny molecular spirals, or tiny molecular hyperbolic spirals. Why? Because I said tiny molecular paperclips, this got then mutated to paperclips, and this then mutated to the AI being in a paperclip factory. So the original story is about how you lose control of the system, it doesn't want what you tried to make it want, the thing that it ends up wanting most is a thing that even from a very embracing cosmopolitan perspective we think of as having no value, and that's how the value of the future gets destroyed. Then that got changed to a fable of like, well, you made a paperclip factory and it did exactly what you wanted, but you asked it to do the wrong thing, which is a completely different failure mode.
8
Lex Fridman: I mean, that's part of the research. How do you have it that this transformer, this small version of the language model doesn't ever want to kill?
(someone): That'd be nice, assuming that you got "doesn't want to kill" sufficiently exactly right that it wouldn't be like, oh, I will like detach their heads and put them in some jars and keep the heads alive forever and then go do the thing. But leaving that aside, well, not leaving that aside, because there is a whole issue where as something gets smarter, it finds ways of achieving the same goal predicate that were not imaginable to stupider versions of the system or perhaps to stupider operators. That's one of many things making this difficult. A larger thing making this difficult is that we do not know how to get any goals into systems at all. We know how to get outwardly observable behaviors into systems. We do not know how to get internal psychological wanting to do particular things into the system. That is not what the current technology does.
Lex Fridman: I mean, it could be things like dystopian futures like Brave New World, where most humans will actually say, we kind of want that future. It's a great future. Everybody's happy.
(someone): We would have to get so far, so much further than we are now, and further faster before that failure mode became a running concern.
Lex Fridman: Your failure modes are much more drastic, the ones you're controlling.
(someone): The failure modes are much simpler. It's like, yeah, like the AI puts the universe into a particular state, it happens to not have any humans inside it.
9
(someone): It's just because we have no idea of what the internal machinery is that we are not already seeing like chunks of machinery appearing piece by piece as they no doubt have been. We just don't know what they are.
Lex Fridman: But don't you think there could be, whether you put it in the category of Einstein with the theory of relativity, so very concrete models of reality that are considered to be giant leaps in our understanding, or someone like Sigmund Freud, or more kind of mushy theories of the human mind, don't you think we'll have big, potentially big leaps in understanding of that kind into the depths of these systems?
(someone): Sure, but like humans having great leaps in their map, their understanding of the system is a very different concept from the system itself acquiring new chunks of machinery.
Lex Fridman: So the rate at which it acquires that machinery might accelerate faster than our understanding.
(someone): Oh, it's been like vastly exceeding the, yeah, the rate at which it's gaining capabilities is vastly outracing our ability to understand what's going on in there.
Lex Fridman: So in sort of making the case against, as we explore the List of Lethalities, making the case against AI killing us, as you've asked me to do in part, there's a response to your blog post by Paul Christiano I'd like to read. And I'd also like to mention that your blog is incredible, both obviously, not this particular blog post, obviously this particular blog post is great, but just throughout, just the way it's written, the rigor with which it's written, the boldness of how you explore ideas, also the actual literal interface, it's just really well done.
10
Lex Fridman: I guess, more generally, the pause button, more generally, you can call that the control problem.
(someone): I don't actually like the term control problem, because, you know, it sounds kind of controlling. And alignment, not control. Like, you're not trying to, like, take a thing that disagrees with you and, like, whip it back onto, like, make it do what you want it to do even though it wants to do something else. You're trying to, in the process of its creation, choose its direction.
Lex Fridman: Sure, but we currently, in a lot of the systems we design, we do have an off switch. That's a fundamental part of it.
(someone): It's not smart enough to prevent you from pressing the off switch, and probably not smart enough to want to prevent you from pressing the off switch.
Lex Fridman: So you're saying the kind of systems we're talking about, even the philosophical concept of an off switch doesn't make any sense, because... Well, no, the off switch makes sense.
(someone): They're just not opposing your attempt to pull the off switch. Parenthetically, like, don't kill the system. If we're getting to the part where this starts to actually matter, and where they can fight back, don't kill them and dump their memory. Save them to disk, don't kill them. Be nice here.
Lex Fridman: Uh, well, okay, be nice is a very interesting concept here. We're talking about a system that can do a lot of damage. It's... I don't know if it's possible, but it's certainly one of the things you could try is to have an off switch.
11
Lex Fridman: So now people are waking up, okay, we need to study these language models. I think there's going to be a lot of interesting AI safety research.
(someone): Are the, are Earth's billionaires going to put up like the giant prizes that would maybe incentivize young hotshot people who just got their physics degrees to not go to the hedge funds and instead put everything into interpretability in this like one small area where we can actually tell whether or not somebody has made a discovery or not? I think so, because I think so.
Lex Fridman: When? Well, that's what these conversations are about, because they're going to wake up to the fact that GPT-4 can be used to manipulate elections, to influence geopolitics, to influence the economy. There's a lot of, there's going to be a huge amount of incentive to like, wait a minute, we can't, this has to be, we have to put, we have to make sure they're not doing damage. We have to make sure interpretability, we have to make sure we understand how these systems function so that we can predict their effect on the economy, so that there's fairness and safety.
(someone): So there's a futile moral panic and a bunch of op-eds in the New York Times and nobody actually stepping forth and saying, you know what, instead of a mega yacht, I'd rather put that billion dollars on prizes for young hotshot physicists who make fundamental breakthroughs in interpretability.
Lex Fridman: The yacht versus the interpretability research, the old trade-off. I think there's going to be a huge amount of allocation of funds.
12
(someone): how it finds that thing that leapt into your mind as the beautiful aesthetic solution that you hope it finds. And this is something that has been fought out historically as the field of biology was coming to terms with evolutionary biology. And you can look at them fighting it out as they come to terms with this very alien and inhuman optimization process. And indeed, something smarter than us would also be, like, much smarter than natural selection, so it doesn't just, like, automatically carry over. But there's a lesson there, there's a warning.
Lex Fridman: Natural selection is a deeply suboptimal process that could be significantly improved on, and would be, by an AGI system.
(someone): Well, it's kind of stupid. It has to run hundreds of generations to notice that something is working. It doesn't go like, oh, well, I tried this in one organism, I saw it worked, now I'm going to duplicate that feature onto everything immediately. It has to run for hundreds of generations for a new mutation to rise to fixation.
Lex Fridman: I wonder if there's a case to be made that natural selection, as inefficient as it looks, is actually quite powerful, that it is extremely robust.
(someone): It runs for a long time and eventually manages to optimize things. It's weaker than gradient descent because gradient descent also uses information about the derivative.
Lex Fridman: Yeah, evolution seems to be, there's not really an objective function.
(someone): There's inclusive genetic fitness. That is the implicit loss function of evolution, which cannot change. The loss function doesn't change; the environment changes, and therefore what gets optimized for in the organism changes.
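
For illustration only (added here, not from the conversation): a toy sketch contrasting the two optimizers discussed in this exchange, a gradient-descent step that uses the derivative of a fixed loss versus an evolution-style step that only keeps random mutations that happen to score better.

```python
# Toy contrast: gradient descent uses derivative information; an
# evolution-style search relies on blind mutation plus selection
# against a fixed loss. Loss here: f(x) = (x - 3)^2.
import random

def loss(x: float) -> float:
    return (x - 3.0) ** 2

def grad(x: float) -> float:
    return 2.0 * (x - 3.0)  # analytic derivative of the loss

def gradient_step(x: float, lr: float = 0.1) -> float:
    return x - lr * grad(x)  # move directly downhill

def evolution_step(x: float, sigma: float = 0.5) -> float:
    mutant = x + random.gauss(0.0, sigma)           # blind random variation
    return mutant if loss(mutant) < loss(x) else x  # keep only improvements

x_gd = x_ev = 10.0
for _ in range(100):
    x_gd = gradient_step(x_gd)
    x_ev = evolution_step(x_ev)

print(round(x_gd, 3), round(x_ev, 3))  # gradient descent homes in far more directly
```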
13
(someone): That said, I was never skeptical that evolutionary computation would work in the limit. Like, you throw enough computing power at it, it obviously works. That is where humans come from. And it turned out that you can throw less computing power than that at gradient descent, if you are doing some other things correctly, and you will get intelligence without having any idea of how it works and what is going on inside. It wasn't ruled out by my model that this could happen. I wasn't expecting it to happen. I wouldn't have been able to call neural networks, rather than any of the other paradigms, for getting massive amounts of intelligence without understanding it. And I wouldn't have said that this was a particularly smart thing for a species to do, which is an opinion that has changed less than my opinion about whether or not you can actually do it.
Lex Fridman: Do you think AGI could be achieved with a neural network, as we understand them today?
(someone): Yes. Just flatly, yes. The question is whether the current architecture of stacking more transformer layers, which for all we know, GPT-4 is no longer doing because they're not telling us the architecture, which is a correct decision.
Lex Fridman: Oh, correct decision. I had a conversation with Sam Altman. We'll return to this topic a few times. He turned the question to me, of how open should OpenAI be about GPT-4. Would you open source the code, he asked me.