
ChatGPT: 30 Year History | How AI Learned to Talk

Apr 15, 2024
A Big Bang occurred when ChatGPT was released: the first widely available computer program with which the average person could talk as if to another human being, passing the Turing test and doing things that most, including me, agreed were not possible when I started this series four years ago. The richness and infinite potential that language offers is the reason many experts in linguistics and computing firmly believed that computers would never understand human language. Many of them have now changed their minds. If something took me an hour to do, GPT-4 could do it in one second, which is quite terrifying; I felt like not only were my belief systems collapsing, but that the entire human race was going to be eclipsed and left in the dust soon.
What has happened so far in this series: we have covered the last decades of neural network research, which focused on concrete problems with a fixed goal, in which people trained an artificial neural network on a task using a large database of example inputs and outputs to learn from, known as supervised learning. In this case, the learning signal was the difference between the network's guess and the correct answer. This led to neural networks that could do one type of thing really well, like classify images, detect spam, or predict your next YouTube video, but each network was like a silo and left no clear path to more general-purpose systems.
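To make that learning signal concrete, here is a minimal Python sketch of one supervised-learning step for a tiny one-layer network; the shapes, learning rate and sigmoid output are illustrative assumptions, not details from the video.

import numpy as np

# One supervised-learning step: make a guess, compare it to the correct answer,
# and nudge the weights to shrink the difference. All sizes are illustrative.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5)) * 0.1        # weights of a tiny one-layer network
x = rng.normal(size=5)                   # one input example
target = np.array([1.0, 0.0, 0.0])       # the correct answer (e.g. a class label)

guess = 1 / (1 + np.exp(-W @ x))         # the network's guess (sigmoid outputs)
error = guess - target                   # the learning signal
W -= 0.1 * np.outer(error, x)            # update the weights toward the answer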
Think of these isolated networks as models of intuition only, not reasoning, because reasoning involves a chain of thoughts; it is a sequential process. To solve this problem of making neural networks more general purpose, we first needed to train neural networks to talk. The origin of this type of experimentation goes back to the mid-80s. In a 1986 paper, Michael Jordan trained a neural network to learn sequential patterns. In his initial experiments he trained a tiny network with just a handful of neurons to predict simple sequences of two symbols. To give the network memory, he borrowed from how we think our minds work, which is to hold a continuous mental state that helps us decide our next action given what we currently observe. He added a set of memory neurons, which he called state units, alongside the network; connections were added from the output to the state units, the state units were then connected to the middle of the network, and finally they were also connected to each other. This resulted in a state of mind that depended on the past and could affect the future.
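The following is a rough Python sketch of the architecture just described: a tiny network whose output feeds back into state units, which also connect to themselves and to the middle (hidden) layer. The sizes, weights and tanh activations are my own illustrative assumptions, not Jordan's actual setup.

import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden, n_out, n_state = 2, 8, 2, 2     # tiny, as in Jordan's experiments

W_in    = rng.normal(size=(n_hidden, n_in)) * 0.1     # input  -> hidden
W_state = rng.normal(size=(n_hidden, n_state)) * 0.1  # state  -> hidden (the "middle")
W_out   = rng.normal(size=(n_out, n_hidden)) * 0.1    # hidden -> output
alpha   = 0.5                                          # state units' self-connections

def step(x, state):
    hidden = np.tanh(W_in @ x + W_state @ state)   # hidden layer sees input AND state
    out = np.tanh(W_out @ hidden)                   # the network's prediction
    new_state = alpha * state + out                 # output feeds back into the state,
    return out, new_state                           # which also depends on its own past

state = np.zeros(n_state)
out, state = step(np.array([1.0, 0.0]), state)      # process one of the two symbols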
He called this a recurrent neural network. Another key innovation was how he set up a prediction problem for the network to learn from: he trained the network by simply hiding the next letter in a sequence. With this approach, the learning signal is the difference between the network's guess and the true next symbol in the data. Critically, after training, the network was set up to generate data by plugging its output back into itself and starting with a single letter, after which it would generate the pattern it had learned. He observed that the network would make errors, but those errors would disappear after further training on the pattern, and he noticed that the learned sequences were not only memorized, but generalized.
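A minimal sketch of this prediction set-up, assuming a toy two-symbol sequence and a single weight matrix standing in for the network; it shows the learning signal and the feedback of output into state, not Jordan's actual code.

import numpy as np

# Hide the next symbol and learn from the difference between the guess and the
# truth. A single weight matrix plus a decaying state stands in for the network.
rng = np.random.default_rng(0)
sequence = [0, 1, 0, 1, 0, 1, 0, 1] * 50       # a simple two-symbol pattern
onehot = np.eye(2)

W = rng.normal(size=(2, 4)) * 0.1              # maps [input, state] -> guess
state = np.zeros(2)
for current, nxt in zip(sequence[:-1], sequence[1:]):
    x = np.concatenate([onehot[current], state])
    guess = 1 / (1 + np.exp(-W @ x))           # guess the hidden next symbol
    error = guess - onehot[nxt]                # learning signal: guess vs. truth
    W -= 0.1 * np.outer(error, x)              # nudge the weights to reduce error
    state = 0.5 * state + guess                # plug the output back into the state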
In another experiment, he trained a network on a spatial pattern. After training the network on the sequence, he fed in one point, plotted the results, and the network would correctly continue the cyclical pattern. However, when he started it at a new point off the path, the network would follow the same cyclical pattern but from a different position, on a different scale, and would gradually return to the stable sequence it had learned. When a network learns to perform a sequence, it essentially learns to follow a trajectory through state space, and these learned trajectories tend to be attractors, to borrow a term from chaos theory. He saw an attractor as the generalized pattern learned by the network, represented in the connection weights of the inner layers. Five years later, another researcher, Jeffrey Elman, picked up Jordan's research and did the same with a slightly larger network of 50 neurons, and trained it on language. At first he used 200 artificially constructed short sentences; curiously, he did not mark any word boundaries, but simply fed a stream of letters to the network ten times, at each step training it to come closer to correctly predicting the next letter. The first interesting thing he noticed was that the network learned the boundaries of words on its own.
He shows this on a graph: at the beginning of a new word the error, or uncertainty, is high, and as more letters are received the error rate decreases, since the sequence becomes increasingly predictable; at the end of the word the error rises again, but not as high as before. This reflects what we saw in information theory, where a meaningful signal contains decreasing entropy along the length of a sequence. He then points out that it is worth investigating whether the network has any understanding of the meaning behind these words. He probed the internal neurons in the context units as the network processed words, then plotted them and compared the spatial arrangement.
What he found was that the network spatially grouped words according to meaning, for example separating nouns into inanimate and animate, and within these groups he saw subcategorization: the animate objects were divided into human and non-human groups, the inanimate objects into breakable and edible groups, and so on. He emphasizes that the network was learning these hierarchical interpretations. Elman points out that, according to Noam Chomsky, this should not be possible: how could a small network understand a word semantically? But Elman argued that his experiments showed otherwise, that all of this could be learned from patterns in language.
This approach of training neural networks by hiding the next event aligns closely with how humans learn. He referenced the idea that preverbal children begin the language acquisition process by listening and mentally shadowing a speaker, always guessing the next word, and that they can learn from these internal errors. He also had a fascinating insight: since we can represent words as points in a high-dimensional space, sequences of words, or sentences, can be thought of as paths through that space, and similar sentences appear to follow similar paths. Our thinking follows a path. It's helpful to pause and consider that your own mind is often on a path of thought at many levels. Still, these networks were small and seen as toy problems, so for more than a decade this line of research on language models went nowhere.
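As a toy illustration of words as points and sentences as paths, here is a small Python example; the three-dimensional word vectors are invented purely for illustration.

import numpy as np

# If each word is a point in a vector space, a sentence is a path through that
# space, and similar sentences trace similar paths. These 3-d vectors are invented.
vec = {
    "dog":  np.array([0.9, 0.1, 0.0]),
    "cat":  np.array([0.8, 0.2, 0.0]),
    "eats": np.array([0.1, 0.9, 0.1]),
    "food": np.array([0.0, 0.2, 0.9]),
}
path_a = np.stack([vec[w] for w in ["dog", "eats", "food"]])
path_b = np.stack([vec[w] for w in ["cat", "eats", "food"]])
print(np.linalg.norm(path_a - path_b, axis=1))   # small distance at every step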
In reality, it was not until 2011 that a significant confluence of researchers pushed this specific line of investigation forward. Interestingly, the motivating application they mentioned was mundane-sounding: better character-level prediction could improve text file compression. More speculatively, achieving the limit of text compression would require understanding equivalent to intelligence. This is in line with a theory that biological brains are, at their core, prediction machines, and therefore, if we think of intelligence as the ability to learn, this views learning as the compression of experience into a predictive model of the world. I'll see you every second for the next hour, or whatever; each view is a little different. I don't store all those images, I don't store 3,000 images; somehow I compact this information. In this paper, they trained a much larger network, this time with thousands of neurons and millions of connections, to predict the next letter, as the early researchers had done. After training, they had the model generate language by feeding its output back into its input and starting the process with an initial text. For example, they gave the prompt "the meaning of life is" and the network responded "the tradition of ancient human reproduction", but beyond a few words the path of thought drifted off into the meaningless. So clearly learning was taking place, but something was still limiting the network's ability to maintain coherent context over long sequences.
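A minimal sketch of the generation procedure described above: start with a prompt, ask the model for next-character probabilities, sample one character, append it, and repeat. The probability function is a placeholder standing in for the trained network.

import numpy as np

def generate(next_char_probs, prompt, length=200, seed=0):
    # Repeatedly feed the model's output back in as input, one character at a time.
    # next_char_probs(text) -> dict of {character: probability}; it is a placeholder
    # standing in for the trained character-level network.
    rng = np.random.default_rng(seed)
    text = prompt
    for _ in range(length):
        probs = next_char_probs(text)
        chars, p = zip(*probs.items())
        text += rng.choice(list(chars), p=np.array(p) / np.sum(p))
    return text

# e.g. generate(model_probs, "The meaning of life is ")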
At the end of the article they stated that if we could train a much larger network, with millions of neurons and billions of connections, it is possible that brute force alone would be enough to achieve even higher performance. Still, few took this line of research seriously, probably because of the errors, but a dedicated few kept the effort going. Another key figure is Andrej Karpathy. He did the same experiment again, but this time on a larger network with more layers, and his results were even better and more plausible. In particular, he trained it on all of Shakespeare and noted that he could barely tell its output apart from real Shakespeare, and when he trained it on math writing he said you get math that looks plausible; it's quite amazing. Like the early researchers, he noticed how the networks learned in phases, and he writes: the beauty of this is that we didn't have to code any of that; the network decided what was useful to keep track of.
This is one of the clearest and most compelling examples of where the power of deep learning comes from. So this was further evidence that a system created with the broad goal of learning to speak could then be redirected toward narrow, arbitrary goals; we could simply ask. A turning point occurred in 2017, when a team of researchers at a lab called OpenAI built on Karpathy's work: they set up a larger recurrent network and trained it on a massive set of 82 million Amazon reviews, the largest model to date. When they probed the neurons in this network, they found neurons deeper in the network that had learned complex concepts. For example, they reported the discovery of a sentiment neuron: a single neuron within the network that directly corresponded to the sentiment of the text, how positive or negative it sounded. They showed the activation of this neuron as it processed text, perfectly classifying the sentiment. This was surprising because at the time sentiment classification commonly required specialized systems trained on that task, but in this case they did none of that work; the sentiment neuron emerged from the process of learning to predict the next word. To show that the network had a good internal model, or understanding, of sentiment, they made the network generate text while forcing the sentiment neuron to be positive or negative, and it spat out positive and negative reviews that were completely artificial but indistinguishable from human-written reviews. They wrote that it is an open question why the model recovers the concept of sentiment in such a precise, disentangled, interpretable and manipulable way, and this was just one neuron in a network full of such representations of abstract concepts, learned from that data as a result of trying to predict what comes next.
In the future-directions section of the paper, they mentioned the next key step: data diversity, going beyond just Amazon reviews. But going further was reaching a practical limit, because there is a key problem with recurrent neural networks: since they process data serially, all context has to be compressed into a fixed internal memory, and this is a bottleneck that limits the network's ability to handle context over long sequences of text, so meaning is eventually lost.
It was evident that when generating long enough text with a recurrent neural network, it might make sense for a while, but after a few sentences it would always turn into gibberish. Learning these long-range dependencies was a key challenge facing the field. An alternative approach to recurrent neural networks tried to address this problem by simply processing the entire input text sequence in parallel, but this requires many layers of depth to compensate for the lack of memory; the approach is tempting, but the resulting network becomes untrainable. But also in 2017, another innovative paper appeared that focused on the problem of translation between languages and offered a solution to this memory constraint: attention.
The key idea behind their approach was to create a network with a new type of dynamic layer that could adapt some of its connection weights based on the context of the input, known as the self-attention layer. This allowed a single layer to do what traditional networks would have needed multiple layers to achieve, leading to a shallower but broader network that was practical to train. These self-attention layers work by allowing each word in the input to be compared with all other words and to absorb the meaning of the most relevant ones, to better capture the context in which the word is used in that sentence. This is done with the addition of attention heads, many small networks within the layer that act as a sort of lens through which words can examine other words, and it works by simply measuring the distance between all the word pairs in the conceptual space: similar concepts will be closer in this space, leading to a higher connection weighting. Consider the sentence "the river has a steep bank". In the self-attention layer, the word "bank" would be compared with every other word to find conceptual similarities; for example, the words "river" and "bank" are related in the riverbank context, so this would lead to a higher weighting in that context. This leads to a second operation in which each word absorbs the meaning of its connections in proportion to the strength of that weighting, allowing the word to adjust its representation, or meaning, pushing it toward the concept, or direction, of a riverbank. As we move through the network, the embedding vectors of a word get better and better because they take into account more and more contextual information, and that is why we call them Transformers: they take each word and transform its meaning, shaped by the words around it.
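Here is a minimal Python sketch of single-head self-attention in the query/key/value style of the "Attention Is All You Need" paper; the dimensions and random weights are illustrative, and the softmax stands in for the normalized weighting described above.

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # X holds one vector per word (n_words x d). Each word's query is compared
    # with every word's key (dot products = conceptual similarity), the scores
    # are normalized with a softmax, and each word then absorbs a weighted blend
    # of the value vectors, adjusting its own representation to fit the context.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d = 4
X = rng.normal(size=(6, d))                 # e.g. "the river has a steep bank"
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
contextual = self_attention(X, Wq, Wk, Wv)  # context-adjusted word vectors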
To get a sense of this in action, let's see how a Transformer network generates music by predicting the next note. In this visualization, each colored line is a different attention head, and the weight of the line is the amount of attention it gives to each location. Notice that each attention head looks for different types of patterns in the music; the more attention heads you give a network, the more powerful it becomes, and notice that to select the next note at each step, all of these patterns, such as note duration, are taken into account. This is a network architecture that can look at everything, everywhere, all at once. No internal memory is needed; its memory is replaced by self-reference within the layer. But critically, attention is all you need. The paper still had one foot in the old paradigm: it was narrowly focused on the translation problem alone and trained in a supervised manner; its authors weren't looking for a general-purpose system that could do anything you asked of it. But OpenAI researchers saw this result and immediately tested this more powerful Transformer architecture on exactly the same next-word prediction problem, now at a scale that was not possible before. The following year they published a paper introducing a model called GPT, and this time they had a much larger network that could take in hundreds of words of input, or context, at a time. It had multiple layers, each with its own attention followed by a fully connected layer, and this time they trained the network on 7,000 books from a variety of domains.
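Building on the self_attention sketch above, here is a rough sketch of one layer of the stack just described: self-attention followed by a fully connected sub-layer, each with a residual connection. Layer normalization and the causal masking used for next-word prediction are omitted for brevity, and none of this is OpenAI's implementation.

import numpy as np

def feed_forward(X, W1, b1, W2, b2):
    return np.maximum(0, X @ W1 + b1) @ W2 + b2      # fully connected sub-layer

def transformer_block(X, attn_params, ff_params):
    # One layer of the stack: self-attention, then a feed-forward network, each
    # with a residual (skip) connection.
    X = X + self_attention(X, *attn_params)          # from the sketch above
    X = X + feed_forward(X, *ff_params)
    return X

# A deep model is just many such blocks applied one after another:
#   for attn_params, ff_params in layers:
#       X = transformer_block(X, attn_params, ff_params)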
The results were exciting. If you gave this network a segment of text, it would continue the passage much more coherently than before, but more importantly it showed some ability to answer general questions, and these questions did not need to be present in the training data. This is known as zero-shot learning; it is a notable capability and highlighted the potential of language models to generalize from their training data and apply it to arbitrary tasks. They followed up on this experiment immediately with GPT-2. This time they used exactly the same approach but with a dataset pulled from a large portion of the web, and used a much larger network with around 300,000 neurons.
The results surprised even the researchers. They tested it on tasks like reading comprehension, summarization, translation and question answering. Surprisingly, it could translate between languages about as well as systems trained only on translation, without any translation-specific training. But except for a cycle of news about possible misuse to generate fake news, this development was largely ignored, even by experts in the field. The problem was that GPT-2 still descended into nonsense after many sentences; it couldn't maintain coherence or context for really long stretches, so it was still obviously a hack. But the team now understood that this could be solved by making everything bigger again, especially the context window. So they ran the same experiment again but made the network 100 times larger. GPT-3 had 175 billion connections and 96 layers, and a much longer context window of around a thousand words. This time they trained it on the Common Crawl of the web plus Wikipedia, as well as several book collections. Again it showed higher performance on all measures, but one capability really stood out: once training was complete, you could still teach the network new things, known as in-context learning.
In the GPT-3 paper, the researchers showed a simple example in which they first gave the definition of a made-up word, "gigaro", and then asked the model to use that word in a sentence, which it did perfectly. This is known as the wug test and is a key milestone in children's language development, but it was just the tip of the iceberg. The key point is that we can change the behavior of the network without changing its weights: a frozen network can learn new tricks. In-context learning works because the network leverages its internal models of individual concepts, which it can arbitrarily combine or compose, so you can think of two layers of learning: a core, in-weight learning that occurs during training, and then a layer of in-context learning that occurs during use, or inference.
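A minimal illustration of in-context learning as a prompt: the made-up word's definition and the task are given entirely in the input, and no weights change. The exact wording below is illustrative, not quoted from the GPT-3 paper, and the model call is a hypothetical placeholder.

# The frozen model is "taught" entirely through its input: the made-up word is
# defined in the prompt and no weights change. The wording is illustrative.
prompt = (
    "A 'gigaro' is a small musical instrument played with two fingers.\n"
    "An example of a sentence that uses the word gigaro is:"
)
# completion = model.generate(prompt)   # hypothetical call to the trained model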
Many pointed out that it seemed we had stumbled upon a new computing paradigm, where the computer operates at the level of thoughts, and a thought is a response to a prompt. This put the programming of these systems in anyone's hands: the prompt is the program. But among the general public, GPT-3 was still relatively unknown. To allow for general use, they took GPT-3 and shaped its behavior to follow human instructions better, with more training on examples of good versus bad instruction-following. This pushed the pressure of learning beyond the next word and onto the next phrase: not only what to say, but how to say it. Every time it does something a little closer to what we want it to do, we reinforce it; within 20 minutes the pigeon had learned to peck the disk for food. The result was known as InstructGPT, which could engage in human conversation much more effectively, and this became the consumer-facing ChatGPT product.
This started the most exciting year of experimentation in the history of AI, as more than 100 million people used this system in public and reported their results in a firehose of surprises. A key observation after its release was its ability to talk to itself and think out loud. A widely shared paper showed that simply adding the phrase "think step by step" to the end of your prompt dramatically improved ChatGPT's performance, because it began an iterative loop in which intermediate thoughts were written down in meaningful chunks, allowing it to follow a chain of reasoning for as long as necessary, which resulted in fewer errors. This led to an explosion of experiments all based on this idea of internal dialogue. People then tried putting these agents in virtual worlds and giving them tasks, and they would learn how to use tools to achieve them, talking to themselves along the way.
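A minimal illustration of the "think step by step" trick as a prompt; the question and wording are illustrative, and the model call is left as a hypothetical comment.

# Appending "think step by step" invites the model to write out intermediate
# reasoning before its final answer. The question and wording are illustrative.
question = "A train travels 60 km in 45 minutes. What is its speed in km/h?"
plain_prompt = question
cot_prompt = question + "\nLet's think step by step."
# answer = model.generate(cot_prompt)   # hypothetical call; the written-out steps
#                                       # typically reduce arithmetic slips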
Researchers also brought this tool use into the real world by connecting language models to external computer systems through APIs, allowing them to make calls, issue requests and perform arbitrary tasks, and finally we gave them physical senses through cameras and actuators. In fact, every task performed by computers could now be redesigned with an LLM at the center of the process. I don't think it's accurate to think of large language models as a chatbot or some kind of word generator; I think it's much more correct to think of them as the kernel process of an emerging operating system. It has an equivalent to random-access memory, or RAM, which in this case for an LLM would be the context window, and you can imagine this LLM trying to page relevant information in and out of its context window to perform its task. When researchers moved on to networks even ten times larger, in GPT-4 and beyond, the trend continued again, and today there is the possibility of building the most capable intelligent agent of all, the oracle humans have always dreamed of and feared.
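A rough sketch of the kind of agent loop described above: the model writes to its own context, may request a tool call, and reads the result back into its context window before continuing. The model.generate method and the "CALL" message format are assumptions for illustration, not a real API.

import json

def agent(model, task, tools, max_steps=10):
    # The model writes to its own context, may request a tool call (here a line
    # starting with "CALL " followed by JSON), and reads the result back into
    # its context window before continuing. model.generate is a hypothetical API.
    context = f"Task: {task}\n"
    for _ in range(max_steps):
        reply = model.generate(context)
        context += reply + "\n"
        if reply.startswith("CALL "):
            request = json.loads(reply[len("CALL "):])          # e.g. {"tool": "search", "args": {...}}
            result = tools[request["tool"]](**request["args"])  # external API call
            context += f"RESULT: {result}\n"                    # paged back into context
        else:
            return reply                                        # final answer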
Some speculate that this moment marks a unification of the AI field around a single direction: rather than specialized networks focused on specific types of data, researchers aim to treat all perception as a language, that is, a series of information-carrying symbols, and then train prediction-driven networks with self-attention on it. This leads to a more general system that can be remapped to any arbitrary, narrow problem. It seems there is something fundamental about learning to get better at predicting future perceptions, something fundamental to learning in both biological and artificial neural networks. Imagination is a great survival mechanism because it minimizes surprise, and since our actions are part of our perceptions, we also learn the outcomes of our actions as a byproduct of this perceptual prediction problem. So the question now is: have we invented the core of a tool to end all tools, a possible route to the automation of thought, which was the original dream of computing? Not everyone agrees with this, and some even feel insulted. Well, this is a glorified autocomplete; these systems are designed in such a way that, in principle, they cannot tell us anything about language, about learning, about intelligence, about thinking, nothing. The idea that it's just predicting the next word and using statistics, in a sense that's true, but it's not the sense of statistics that most people understand: from the data, it figures out how to extract the meaning of the sentence and uses the meaning of the sentence to predict the next word. It actually understands, and that is quite shocking.
Chomsky's whole view of language looks a little crazy when you look back, because language is about conveying meaning. For what it's worth, Geoff, I always thought Chomsky was dead wrong, from my university days, and I think he sent natural language processing down the wrong path for a long time. And even the three godfathers of deep learning are no longer on the same page. Linguistic ability and fluency are not related to the ability to think; they are two different things. But people we respect a lot, like Yann, think that these models really don't understand, and it is crucial to resolve this question; we may not be able to reach consensus on other issues until we have resolved it.
I have never seen the AI community as fragmented as it seems to be at the moment, and at the root of this division is a philosophical question. One group believes that these models trick us into thinking they are smarter than they are, like mirrors that reflect our own thoughts back at us in ways we don't anticipate; the other side believes that if it looks like thought, then it is thought, and that the line between simulated thought and real thought is increasingly blurred, or maybe there is no line at all. This was, and still is, something that people are struggling to understand.
I invited her into my office and sat her at the keyboard, and then she started typing, and of course I looked over her shoulder to make sure everything was working correctly. After two or three exchanges with the machine, she turned to me and said, "Would you mind leaving the room, please?" And yet he knew, as Weizenbaum did, that ELIZA did not understand a single word that was typed into it. "You are like my father in some ways." "You don't argue with me." "Why do you think I don't argue with you?" "Are you afraid of me?" "Do you like to think that I am afraid of you?" "My father is afraid of everyone." "My father is afraid of all of us." We have the secret
