In dieser Episode reden Florian und Ich über das neue Model o1 und was es besonders macht. Außerdem reden wir über den Hardwaremarkt, Alpha Proteo und die US Politik.
In dieser Episode reden Florian und Ich über die Annoucements von OpenAIs Keynote; Unter anderem GPT-4 Turbo. Außerdem reden wir über Apple, GitHub und die Folgen von Automatisierung.
OpenAI announced a set of changes to their model APIs. The biggest announcement is the addition of function calls for both GPT-3.5 and 4. This allows developers to enable plugins and other external tools for the models.
They also released new versions of GPT-3.5 and 4 that are better at following directions and a Version of 3.5 with 16K context window.
In addition, they made the embedding model 75% cheaper, which is used to create vector databases and allows models to dynamically load relevant data, like memory. GPT-3.5 also became cheaper now costing only $0.0015 per 1K input tokens.
One of the main problems of LLMs is that they are black boxes and how they produce an output is not understandable for humans. Understanding what different neurons are representing and how they influence the model is important to make sure they are reliable and do not contain dangerous trends.
OpenAI applied GPT-4 to find out the different meanings of neurons in GPT-2. The methodology involves using GPT-4 to generate explanations of neuron behavior in GPT-2, simulate what a neuron that fired for the explanation would do, and then compare these simulated activations with the real activations to score the explanation’s accuracy. This process helps in understanding and could potentially help improve the model’s performance.
The tools and datasets used for this process are being open-sourced to encourage further research and development of better explanation generation techniques. This is part of the recent efforts in AI alignment before even more powerful models are trained. Read more about the process here and the paper here. You can also view the neurons of GPT-2 here. I recommend clicking through the network and admiring the artificial brain.
Shap-E can generate 3D assets from text or images. Unlike their earlier model Point-E, this one can directly generate the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. It is also faster to run and open-source! Read the paper here.
Just like video generation, the quality is still behind image generation. I expect this to change by the end of this year.
OpenAI developed a new approach to image generation called consistency models. Current models, like Dalle-2 or stable diffusion, iteratively diffuse the result. This new approach goes straight to the final result which makes the process way faster and cheaper. While not as good as some diffusion models yet, they will likely improve and become an alternative for scenarios where faster results are needed.
Since GPT-3.5 and GPT-4 APIs are available many companies and start-ups have implemented them into their products. Now developers have started to do it the other way around. They build systems around GPT-4 to enable it to search, use APIs, execute code, and interact with itself. Examples are HuggingGPT or AutoGPT. They are based on works like Toolformer or this result. Even Microsoft itself started to build LLM-Augmenter around GPT-4 to improve its performance.
I talked about this development in my post on how to get from GPT-4 to proto-AGI. I still think that this is the way to a general assistant even though I am not sure if GPT-4 is already capable enough or if we need another small improvement.
A group of researchers and notable people released an open letter in which they call for a 6 month stop from developing models that are more advanced than GPT-4. Some of the notable names are researchers from competing companies like Deepmind, Google, and Stability AI like Victoria Krakovna, Noam Shazeer, and Emad Mostaque. But also some professors and authors like Stuart Russell or Peter Warren. The main concern is the lack of control and understanding of these systems and the potential risks that go from misinformation to human extinction.
Alles Denkbare wird einmal gedacht. Jetzt oder in der Zukunft. Was Solomo gefunden hat, kann einmal auch ein anderer finden, […]. / Everything that is conceivable will be thought of at some point. Whether now or in the future. What Solomon has found, another may also find someday […].
Dürrenmatt, Die Physiker
Although I recognize some valid concerns in the letter, I personally disagree with them. As demonstrated in Dürrenmatt’s novel “The Physicists,” technology, no matter how dangerous, cannot be hindered or halted and will always advance. Even if OpenAI were to stop developing GPT-5, other nations would continue to do so, akin to nuclear weapons, which do not provide any benefits. However, AI possesses enormous potential for good, making it difficult to argue against its development. While there is a possibility of AI causing harm, preventing or slowing its progress would prevent billions of people from being aided by its potential benefits. I believe that the risk of a negative outcome is acceptable if it allows us to solve most of our issues. Especially since it looks like right now that a negative outcome is guaranteed without AI, as the climate crises and global conflicts arise.
Many people saw the new episode of the Lex Friedman Podcast with Sam Altman, where he talks about some social and political implications of GPT-4.
But fewer people saw the podcast with Ilya Sutskever, the Chief Scientist at OpenAI, which is way more technical and in my opinion even more exciting and enjoyable. I really recommend listening to the talk which is only 45 minutes long.
Microsoft researchers have conducted an investigation on an early version of OpenAI’s GPT-4, and they have found that it exhibits more general intelligence than previous AI models. The model can solve novel and difficult tasks spanning mathematics, coding, vision, medicine, law, psychology, and more, without needing any special prompting. Furthermore, in all of these tasks, GPT-4‘s performance is strikingly close to human-level performance and often vastly surpasses prior models. The researchers believe that GPT-4 could be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system. This is in line with my own experience and shows that we are closer to AGI than we thought.
The study emphasizes the need to discover the limitations of such models and the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction. The study concludes with reflections on the societal implications of the recent technological leap and future research directions.
OpenAI announced that they will introduce plugins to ChatGPT. Two of them developed by OpenAi themself allow the model to search the web for information and run generated python code. Other third-party plugins like Wolfram allow the model to use other APIs to perform certain tasks. the future capabilities of a model enhanced this way are limitless. I talked about this development in my Post “From GPT-4 to Proto-AGI” where I predicted this development. If the capability to run generated code is not too limited, I would call this Proto-AGI.
Artificial General Intelligence (AGI) is the ultimate goal of many AI researchers and enthusiasts. It refers to the ability of a machine to perform any intellectual task that a human can do, such as reasoning, learning, creativity, and generalization. However, we are still far from achieving AGI with our current AI systems. One of the most advanced AI systems today is GPT-4, a large multimodal model created by OpenAI that can take text and pictures as input and outputs text. So how far away from AGI is GPT-4 and what do we need to do to get there?
What GPT-4 is capable of?
GPT-4 is a successor of GPT-3.5, which was already impressive in its ability to generate coherent and fluent text on various topics and domains. GPT-4 improves on GPT-3.5 by being more reliable, creative, and able to handle much more nuanced instructions than its predecessor. For example, it can pass a simulated bar exam with a score around the top 10% of test takers; in contrast, GPT-3.5’s score was around the bottom 10%. It also generates medium-sized working programs and can reason to a certain extent. The context window of GPT-4 is 32K tokens which allows it to produce entire programs.
GPT-4 also adds a new feature: visual input. It can accept image and text inputs together and emit text outputs that are relevant to both modalities. For instance, it can describe what is happening in an image or understand its relevance in a given context. This makes GPT-4 more versatile and useful for various applications that require multimodal understanding.
However, despite its impressive capabilities, GPT-4 is still far from being able to perform all the tasks that humans can do with language and images. It still lacks some crucial components that are necessary for achieving AGI.
What do we need to add?
One of the main limitations of GPT-4 is that it has no memory. It cannot remember what it has said, outside of its context window, or learned before, and cannot use it for future reference or inference. This means that it cannot build long-term knowledge or relationships with its users or other agents. It also means that it cannot handle complex reasoning tasks that require multiple steps or facts that exceed its context window.
Another limitation of GPT-4 is that it has no access to tools that can help it solve problems or learn new skills. For example, it cannot use the Internet to search for information on the web; Wolfram Alpha to compute mathematical expressions; databases to store and retrieve data; or other APIs to interact with external services. This limits its ability to acquire new knowledge or perform tasks beyond outputting text.
A third limitation of GPT-4 is that it has no inner thinking. It is strictly an input-output machine that produces exactly one piece of text for every input it gets. In between inputs it does nothing and is in the same state every time. The ability to simulate possible situations is called mental simulation and is one of the key abilities of the human brain. It is a fundamental form of computation in the brain, underlying many cognitive skills such as mindreading, perception, memory, and language. The fact that all Transformer based AI systems are not capable of that in their current form, is, in my opinion, the main reason why AGI is still not in sight.
How do we do this?
To overcome these limitations and move closer towards AGI, we need to add some features and functionalities to GPT-4 that can substitute for these shortcomings.
One possible way to do this is by using chain prompts. Chain prompts are sequences of inputs and outputs that guide the model through a series of steps or actions towards a desired goal. For example, we can use chain prompts to instruct GPT-4 to search for information on the Internet. By using chain prompts, we can extend GPT-4‘s capabilities and make it more powerful and transparent. Instead of giving the Model the input directly, we would ask it which parts of the input it needs more information on, and then we get a list of keywords selected by the model that we feed into a search engine. In the last step, we add the information that we got to the original input and give the user the final output.
Another possible way to do this is by using Toolformer. Toolformer was proposed by Meta that allows us to integrate external tools into LLMs by using special tokens that represent tool names. The model would be fine-tuned on text examples of API calls. For example, we can use Toolformer to write: Input: What is 2 + 2? Output:The answer is <calculator args=”2+2″>4</calculator>. This way, GPT-4 can learn to use tools by observing how they are used in natural language contexts. Toolformer can also handle complex tool compositions and nested tool calls. Some tools that would drastically enhance the capabilities of GPT are
Wolfram Alpha (Math)
A calendar (temporal awareness),
A search engine (information gathering)
A database(memory)
A command line (general control)
Especially the last part is really special. By giving a powerful enough model access to a computer, and combining this with other methods such as chain prompting, we could enable unlimited possibilities. One special case of these techniques that I want to highlight is code execution. An LLM that can run generated code itself and receive the output could build the programs to solve every task it gets. This starts with writing simple functions to solve equations to controlling a smart home or fine-tuning itself.
We can also add memory this way by giving it access to a database. We could use chain prompting to ask the model if parts of the input or output should be saved for the future and combine it with a writing call to the database. We then could use embeddings to search the database for every input and extract relevant information. Embeddings are vector representations of text that decode the meaning of the text. Asking the model about an appointment with your doctor would be represented by a vector that is similar to the vector that represents the information about the appointment in the database. The solution is not perfect but would add memory to the model.
Where we are right now
We already see the start of these augmentations. The first one was BingGPT which augments GPT-4 with a search engine. The most recent and impressive one is Microsoft’s copilot for Microsoft 365, which combines GPT-4 with all the Office tools and their Microsoft Graph system, which also gives it access to all your documents. Other companies will follow even though the integration is limited since the model is not Open source and OpenAI are the only ones able to fine-tune it. But for most of these techniques, you can use Langchain which is a new code library that contains many of the described ways to improve GPT.4
What we could see until the end of the year
All these methods are not mutually exclusive and can be combined in different ways depending on the task and context. Many companies are already or going to integrate GPT-4 into their products. And the more tools can be controlled by natural language the easier it will be for other LLMs to use them. Until the end of the year, we will see Language Models talking to each other. I can see a near future where we have our own custom model that talks to BingGPT, Copilot, or other software and takes on the role of a dirigent of other instances of GPT-4. But there are also risks. Giving the model too much control could lead to chains of mistakes if the model is not powerful enough and makes mistakes or it could lead to a complete takeover and fast takeoff if future models like GPT-5 or 6 are too powerful. This is unlikely as long as OpenAI holds tight control over the development and execution of these models, but the competition is growing and broadly available Hardware and software are becoming better and better. This year will be the rise of AI and next year could be the birth year of proto-AGI.
Update: shortly after I finished this post, this paper was released. It talks about a form of memorizing transformer, which I found to be quite relevant to this post.
German version below
Von GPT-4 zu Proto-AGI
Artificial General Intelligence (AGI) ist das ultimative Ziel vieler AI-Forscher und Enthusiasten. Es bezieht sich auf die Fähigkeit einer Maschine, jede geistige Aufgabe auszuführen, die ein Mensch tun kann, wie etwa das Denken, Lernen, Kreativität und Generalisierung. Allerdings sind wir noch weit davon entfernt, AGI mit unseren derzeitigen AI-Systemen zu erreichen. Eines der fortschrittlichsten AI-Systeme aktuell ist GPT-4, ein großes multimodales Modell, dass von OpenAI erstellt wurde und Text und Bilder als Eingabe nimmt und Text als Ausgabe produziert. Also wie weit ist GPT-4 von AGI entfernt und was müssen wir tun, um dorthin zu gelangen?
Was kann GPT-4?
GPT-4 ist der Nachfolger von GPT-3.5, dass bereits beeindruckend ist in seiner Fähigkeit, zusammenhängenden und flüssigen Text zu verschiedenen Themen und Domänen zu generieren. GPT-4 verbessert GPT-3.5, indem es zuverlässiger, kreativer und in der Lage ist, viel nuanciertere Anweisungen als sein Vorgänger zu handhaben. Zum Beispiel kann es eine simulierte Bar-Prüfung mit einer Punktzahl um die Top 10% der Testteilnehmer bestehen; im Gegensatz dazu lag die Punktzahl von GPT-3.5 bei rund 10% am unteren Ende. Es generiert auch mittelgroße funktionierende Programme und kann bis zu einem gewissen Grad schlussfolgern. Das Kontextfenster von GPT-4 umfasst 32 tausend Token, was es ermöglicht, ganze Programme zu erstellen.
GPT-4 fügt auch eine neue Funktion hinzu: visuelle Eingabe. Es kann sowohl Bild- als auch Texteingaben akzeptieren und Textausgaben liefern, die für beide Modalitäten relevant sind. Zum Beispiel kann es beschreiben, was in einem Bild passiert, oder den inhalt eines Bildes in einen Kontext einzuordnen. Dies macht GPT-4 vielseitiger und nützlicher für verschiedene Anwendungen, die ein multimodales Verständnis erfordern.
Was noch fehlt?
Trotz seiner beeindruckenden Fähigkeiten ist GPT-4 jedoch noch weit davon entfernt, alle Aufgaben ausführen zu können, die Menschen mit Sprache und Bildern bewältigen können. Es fehlen noch einige wesentliche Komponenten, die für die Erreichung von AGI notwendig sind.
Eine der Hauptbeschränkungen von GPT-4 ist, dass es kein Gedächtnis hat. Es kann sich nicht daran erinnern, was es gesagt hat, außerhalb seines Kontextfensters oder was es zuvor gelernt hat, und kann es nicht für zukünftige Referenzen oder Rückschlüsse verwenden. Dies bedeutet, dass es kein langfristiges Wissen oder Beziehungen zu seinen Benutzern oder anderen Agenten aufbauen kann. Es bedeutet auch, dass es keine komplexen Denkaufgaben bewältigen kann, die mehrere Schritte erfordern oder Fakten überschreiten, die sein Kontextfenster übersteigen. Eine weitere Einschränkung von GPT-4 ist, dass es keinen Zugang zu Tools hat, die ihm helfen können, Probleme zu lösen oder neue Fähigkeiten zu erlernen. Es kann z.B. nicht das Internet nutzen, um nach Informationen im Web zu suchen; Wolfram Alpha zur Berechnung mathematischer Ausdrücke; Datenbanken zur Speicherung und Abfrage von Daten oder andere APIs zur Interaktion mit externen Diensten. Dies begrenzt seine Fähigkeit, neues Wissen zu erwerben oder Aufgaben jenseits des Textausgabe zu erledigen. Eine dritte Einschränkung von GPT-4 ist, dass es kein inneres Denken hat. Es ist streng genommen eine Input-Output-Maschine, die für jede Eingabe genau ein Textstück produziert. Zwischen den Eingaben tut es nichts und ist jedes Mal im gleichen Zustand. Die Fähigkeit, mögliche Situationen zu simulieren, wird als mentale Simulation bezeichnet und ist eine der Schlüsselkompetenzen des menschlichen Gehirns. Sie ist eine grundlegende Form der Berechnung im Gehirn und liegt vielen kognitiven Fähigkeiten wie Gedankenlesen, Wahrnehmung, Gedächtnis und Sprache zugrunde. Die Tatsache, dass alle auf der Transformer-Technologie basierenden KI-Systeme in ihrer derzeitigen Form dazu nicht in der Lage sind, ist meiner Meinung nach der Hauptgrund, warum AGI noch nicht in Sicht ist.
Wie können wir das erreichen?
Um diese Einschränkungen zu überwinden und uns der AGI näher zu bringen, müssen wir GPT-4 einige Funktionen und Eigenschaften hinzufügen, die diese Mängel ausgleichen können. Eine mögliche Methode dafür sind sogenannte “Chain Prompts“. Chain Prompts sind Folgen von Eingaben und Ausgaben, die das Modell durch eine Reihe von Schritten oder Aktionen hin zu einem gewünschten Ziel führen. Zum Beispiel können wir Chain Prompts verwenden, um GPT-4 anzuweisen, im Internet nach Informationen zu suchen. Mit Chain Prompts können wir die Fähigkeiten von GPT-4 erweitern und es leistungsfähiger und transparenter machen. Anstatt dem Modell die Eingabe direkt zu geben, würden wir es fragen, welche Teile der Eingabe mehr Informationen benötigen, dann bekommen wir eine Liste von Schlüsselwörtern, die vom Modell ausgewählt wurden und die wir in eine Suchmaschine eingeben. Im letzten Schritt fügen wir die erhaltenen Informationen der ursprünglichen Eingabe hinzu und geben dem Benutzer die endgültige Ausgabe.
Eine weitere mögliche Methode hierfür ist die Verwendung von Toolformer. Toolformer wurde von Meta entwickelt und ermöglicht uns, externe Tools in LLMs zu integrieren, indem wir spezielle Tokens verwenden, die Toolnamen darstellen. Das Modell würde mit Textbeispielen von API-Aufrufen verfeinert werden. Zum Beispiel können wir Toolformer verwenden, um Folgendes zu schreiben:
Eingabe: What is 2 + 2? Ausgabe: The answer is <calculator args=”2+2″>4</calculator>.
Auf diese Weise kann GPT-4 lernen, Tools zu verwenden, indem es beobachtet, wie sie in natürlichen Sprachkontexten verwendet werden. Toolformer kann auch komplexe Toolzusammensetzungen und verschachtelte Toolaufrufe verarbeiten. Einige Tools, die die Fähigkeiten von GPT drastisch verbessern würden, sind
Wolfram Alpha (Mathematik)
Kalender (zeitliche Kenntnisse)
Suchmaschine (Informationsbeschaffung)
Datenbank (Speicher)
Commandozeile (generelle Kontrolle).
Besonders der letzte Punkt ist sehr wichtig. Indem wir einem ausreichend mächtigen Modell Zugang zu einem Computer geben und dies mit anderen Methoden wie Chain Prompting kombinieren, könnten wir unbegrenzte Möglichkeiten eröffnen. Ein spezieller Fall dieser Techniken, den ich hervorheben möchte, ist die Ausführung von Code. Ein Sprachmodel, das generierten Code selbst ausführen und die Ausgabe empfangen kann, könnte Programme zum Lösen jeder Aufgabe erstellen. Dies beginnt mit dem Schreiben einfacher Funktionen zur Lösung von Gleichungen bis hin zur Steuerung eines Smart Homes oder der eigenen Verbesserung.
Auf diese Weise können wir dem Modell auch Zugriff auf eine Datenbank geben, um so den Speicher zu erweitern. Wir könnten Chain Prompting nutzen, um das Modell zu fragen, ob Teile der Eingabe oder Ausgabe für die Zukunft gespeichert werden sollen, und es mit einem Schreibbefehl an die Datenbank kombinieren. Anschließend könnten wir Embeddings verwenden, um die Datenbank nach jeder Eingabe zu durchsuchen und relevante Informationen zu extrahieren. Embeddings sind Vektor-Textdarstellungen, die die Bedeutung des Textes entschlüsseln. Wenn wir das Modell beispielsweise nach einem Termin mit unserem Arzt fragen, wird ein Vektor erstellt, der ähnlich dem Vektor ist, mit dem die Informationen über den Termin in der Datenbank dargestellt werden. Die Lösung ist zwar nicht perfekt, würde aber dem Modell Gedächtnis hinzufügen.
Der aktuelle Stand
Wir sehen bereits den Beginn dieser Erweiterungen. Die erste war BingGPT, die GPT-4 mit einer Suchmaschine erweitert. Die neueste und beeindruckendste ist Microsofts Copilot für Microsoft 365, eine Kombination aus GPT-4 und allen Office-Tools sowie ihrem Microsoft Graph-System, das auch Zugriff auf alle deine Dokumente gibt. Andere Unternehmen werden folgen, obwohl die Integration begrenzt ist, da das Modell nicht Open Source ist und nur OpenAI es feinabstimmen kann. Besonders hervorheben möchte ich Langchain eine code bibliothek die viele der hier beschrieben Techniken seh vereinfacht
Was noch diesen Jahr passieren kann
All diese Methoden schließen einander nicht aus und können je nach Aufgabe und Kontext auf unterschiedliche Weise kombiniert werden. Viele Unternehmen integrieren bereits oder werden GPT-4 in ihre Produkte integrieren. Und je mehr Werkzeuge von natürlicher Sprache gesteuert werden können, desto einfacher wird es für andere LLMs sein, sie zu nutzen. Bis Ende des Jahres werden wir sehen, wie Sprachmodelle miteinander sprechen. Ich kann mir eine nahe Zukunft vorstellen, in der wir unser eigenes benutzerdefiniertes Modell haben, das mit BingGPT, Copilot oder anderen Software spricht und die Rolle eines Dirigenten für andere Instanzen von GPT-4 übernimmt. Aber es gibt auch Risiken. Wenn das Modell zu viel Kontrolle erhält und nicht leistungsstark genug ist wird es Fehler machem, welche zu Ketten von Fehlern führen können, oder anders herum könnte es zu einem vollständigen Kontrollverlust der Menschen und einer explosionsartigen Entwickung von künstlicher Intelligenz kommen, wenn zukünftige Modelle wie GPT-5 oder 6 zu leistungsfähig sind. Dies ist unwahrscheinlich, solange OpenAI eine strenge Kontrolle über die Entwicklung und Ausführung dieser Modelle ausübt, aber der Wettbewerb wächst und die allgemein verfügbare Hardware und Software werden immer besser. Dieses Jahr wird das Aufkommen von KI sein und nächstes Jahr könnte das Geburtsjahr von Proto-AGI sein.
Da ich nach einer deutschen Version der Posts gefragt wurde ist dies mein erster Versuch Posts zweisprachig zu machen. Ich freue mich über Feedback und kann auf Wunsch auch gerne noch einzelne ältere Posts übersetzen.(Die Übersetzung ist von GPT-3.5 und enhält sprachliche Fehler und suboptimale Formulierungen.)
Update: kurz nachdem ich diesen Aritkel fertig hatte, wurde dieses Paper veröffentlicht. Es geht um eine Form von Transformer mit Gedächnis, was sehr relevant für diesen Artikel ist.
A new study by OpenAI and the University of Pennsylvania investigates the potential impact of Generative Pre-trained Transformer (GPT) models on the U.S. labor market. The paper, titled “GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models,” assesses occupations based on their correspondence with GPT capabilities, using both human expertise and classifications from GPT-4. The study finds that approximately 80% of the U.S. workforce could have at least 10% of their work tasks affected by the introduction of GPTs, while around 19% of workers may see at least 50% of their tasks impacted. The impact spans all wage levels, with higher-income jobs potentially facing greater exposure. The paper concludes that GPTs exhibit characteristics of general-purpose technologies, which could have significant economic, social, and policy implications. This comes to no surprise for everyone who used GPT-4 or watched the recent Microsoft announcment.
I discussed this topic in more depth in my book review of “A World Without Work”. This research supports the author’s point and indicates a radical shift in the economy in the coming years. I highly recommend reading the paper, the book, or at least my book review.
OpenAI presented its new GPT model today. GPT-4 has a context window of 32K tokens and outperforms humans and previous models like GPT-3.5 in almost all language tasks. It is also multimodal and supports images as inputs. Read more here or watch the presentation here.
OpenAI just released GPT-4, a game-changer in AI language models. With a 32k token context window, it outperforms humans and GPT-3.5 in most language tasks. Key improvements: bigger context window, better performance, and enhanced fine-tuning. Exciting applications include content generation, translation, virtual assistants, customer support, and education. Can’t wait to see how GPT-4 reshapes our AI-driven world!
Jan Leike, the alignment team lead at OpenAI, has published a blog post where he presents possible solutions for AI aligning to group preferences. He proposes a simulated deliberative democracy, where groups of randomly selected people agree on a stand on a specific topic. The decision is then used to fine-tune the AI.
In a small german information event today, four Microsoft employees talked about the potential of LLMs and mentioned that they are going to release GPT-4 next week. They implied that GPT-4 will be able to work with video data, which implies a multimodal model comparable to PaLM-E. Read more here.
Large Language Models (LLMs) are machine learning-based tools that are able to predict the next word in a given sequence of words. In this post, I want to clarify what they can and cannot do, how they work, what their limitations will be in the future, and how they came to be.
History
With the recent surge in public awareness surrounding Large Language Models (LLMs), a discourse has arisen concerning the potential benefits and risks associated with this technology. Yet, for those well-versed in the field of machine learning, this development represents the next step in a long-standing evolutionary process that began over half a century ago. The first language models were developed over 50 years ago and used statistical approaches that were barely able to form correct sentences.
With the rise of deep learning architectures like recurrent neural networks (RNN) and Long-Short-Term Memory (LSTM) neural networks, they became more powerful but also started to grow in size and needed data.
The emergence of GPUs, and later on specialized processing chips called TPUs, facilitated the construction of larger models, with companies such as IBM and Google spearheading the creation of translation and other language-related applications.
The biggest breakthrough was in 2017 when the paper “Attention is all you need” by Google introduced the Transformer. The Transformer model used self-attention to find connections between words independent of their position in the input and was, therefore, able to learn more complex dependencies. It was also more efficient to train which meant it could train on larger data sets. OpenAI used the Transformer to build GPT-2, the most powerful language model at its time, which developed some surprising capabilities which led to the idea that scaling these models up would unlock even more impressive capabilities. Consequently, many research teams applied the Transformer to diverse problems, training numerous models of increasing size, such as BERT, XLNet, ERNIE, and Codex, with GPT-3 being the most notable. However, most of these models were proprietary and unavailable to the public. This has changed with recent releases like Dall-E for image generation and GitHub Copilot. Around this time it became clear that scaling language models up became less effective and too expensive for most companies. This was confirmed by Deepmind in 2022 in their research paper “Training Compute-Optimal Large Language Models” which showed that most LLMs are vastly undertrained and too big for their training data set.
OpenAI and others started to use other means to improve their models, such as reinforced learning. That led to InstructGPT which was fine-tuned to perform the described tasks. They used the same technique to fine-tune their model on dialog data which led to the famous ChatGPT.
How they work
The core of most modern machine learning architectures are neural networks. As the name suggests, they are inspired by their biological counterpart.
At a high level, a neural network consists of three main components: an input layer, one or more hidden layers, and an output layer. The input layer receives data, which is then processed through the hidden layers. Finally, the output layer produces a prediction or classification based on the input data.
The basic building block of a neural network is a neuron, which takes inputs, applies a mathematical function (activation function) to them, and produces an output. The output of each neuron ni is multiplied by the weight wij and added together into the neuron nj in the next layer until the output layer is reached. This process can be implemented as a simple matrix-vector multiplication with the input as the vector I and the weights as the matrix W: WxI = O, where O is the output vector which is used as the input for the next layer where we apply the activation function f(O) = I until the final output.
During training, the network is presented with a set of labeled examples, known as the training set. The network uses these examples to learn patterns in the data and adjust its internal weights to improve its predictions. The process of adjusting the weights is known as backpropagation.
Backpropagation works by calculating the error between the network’s output and the correct output for each example in the training set. The error is then propagated backwards through the network, adjusting the weights of each neuron in the opposite direction of the error gradient. This process is repeated for many iterations until the network’s predictions are accurate enough.
Since 2017 most LLMs are based on Transformers. Which also contain simple feed-forward networks, but at their core have a self-attention mechanism that allows the Transformer to detect dependencies between different words in the input.
The self-attention mechanism in the Transformer model works by using three vectors for each element of the input sequence: the query vector, the key vector, and the value vector. These vectors are used to compute an attention score for every element in the sequence. We get the score of the jth element by calculating the dot product of the query vector with every key vector ki of every element and multiplying the result with the value vector vi. We then sum up all the results to get the output.
Before you multiply the attention score with the value vectors, you would first apply a softmax function to the attention scores. This will ensure that they add up to one and that the resulting weighted value vectors are weighted proportionally to their relevance to the query element. This weighted sum is then used as input to the next layer of the Transformer model. I skipped or simplified other parts of the algorithm as well to make it easier to understand. For a more in-depth explanation of Transformers, I recommend this blog or the creator of GPT himself.
The self-attention mechanism in the Transformer model allows the model to capture long-range dependencies and relationships between distant elements in the input sequence. By selectively attending to different parts of the sequence at each processing step, the model is able to focus on the most relevant information for the task at hand. This makes the Transformer architecture highly effective for natural language processing tasks, where capturing long-range dependencies is crucial for generating coherent and meaningful output.
What they can and cannot do
As explained earlier LLMs are text prediction systems. They are not able to “think”, “feel”, or “experience” anything, but are able to learn complex ideas to be able to predict text accurately. For example the sequence “2 + 2 =” can only be continued if there is an internal representation of basic math inside the Transformer. This is also the reason why LLMs often produce plausible-looking output that makes sense but is wrong. Since the model is multiple magnitudes smaller than the training data and even smaller compared to all possible inputs it is not possible to represent all the needed data. This means that LLMs are great for producing high-quality text about a simple topic, but they are not great at understanding complex problems that require a huge amount of available information and reasoning like mathematical proofs. This can be improved by providing needed information in the input sequence which will increase the probability of correct outputs. A great example would be BingGPT which uses search queries to get additional information about the input. You can also train LLMs to do this themselves by fine-tuning them on API calls.
What will they be able to do and what are the limits
The chinchilla scaling law shows that LLMs are able to adapt to even larger amounts of training data. If we can collect the needed amount of high-quality text data and processing power LLMs will be able to learn even more complex language-related tasks and will become more capable and reliable. They will never be flawless on their own and have the core problem that you are never able to understand how the output was produced as neural networks are black boxes for an observer. They will however become more general as they learn to use pictures, audio, and other sensory data as input, at which point they are barely still language models. The Transformer architecture however will always be a token prediction tool and will never develop “consciousness” or any kind of internal thought as they are still just several Matrix calculations on a fixed input. I suspect that we need at least some internal activity, and the ability to learn during deployment for AGI. But even without that, they will become part of most professions, hidden inside other applications like Discord, Slack, or Powerpoint.
Bias and other problems
LLMs are trained on large text corpora which are filled with certain views, opinions, and mistakes. The resulting output is therefore flawed. The current solution includes blocking certain words from input and/or output. Fine-tune with human feedback, or provide detailed instructions and restrictions in every prompt. They are all not flawless as blocking words is not precise enough. Added instructions can be circumvented by simply overwriting them with prompt injections. Fine-Tuning with human feedback is the best solution that comes with its own problem which is that the people who rate the outputs include their own bias in the fine-tuned model. This becomes a huge problem if you start using these models in education, communication, and other use cases. The views of the group of people who are controlling the training process are now projected onto everybody in the most subtle and efficient way imaginable. As OpenAI stated in their recent post the obvious solution will be to fine-tune your own model, which will lead to less outside influence but also increases the risk of shutting out other views and could create digital echo chambers where people put their radical beliefs into models and are getting positive feedback.
Another problem is that most people are not aware of how these systems work and terms like “artificial intelligence” suggest some form of being inside the machine. They start to anthropomorphize them and accept the AI unconsciously as another person. This is because our brains are trained to look at language as something only an intelligent being can produce. This starts by adding things like “thanks” to your prompt and then moves quickly to romantic feelings or some other kind of emotional connection. This will become increasingly problematic the better and more fine-tuned the models become. Adding text-to-speech and natural language understanding will also amplify this feeling.
Scaling
I see many people asking for an open-source version of chatGPT and wishing to have such a system on their computers. Compared to generative models like stable diffusion, LLMs are way bigger and more expensive to run. This means that they are not viable for consumer hardware. It takes millions of dollars in computing power to train and is only able to run on large servers. However, there are signs that this could change in the future. The Chinchilla scaling law implies that we can move a larger part of the computation into the training process by using smaller models with more data. An early example would be the new LLaMA models by Meta which are able to run on consumer hardware and are comparable to the original GPT-3. This still requires millions in training, but this can be crowdfunded or distributed. While these language models will never be able to compete with the state-of-the-art models made by large companies, they will become viable in the next 1-2 years and will lead to personalized fine-tuned models that take on the role of an assistant. Two excellent examples of open-source projects that try to build such models are “Open-Assistant” and “RWKV“.
The current growth in computing will not be sustainable much longer as it is not only driven by Moore’s law, but also by an increase in investments in training which will soon hit a point where the return does not justify the costs. at this point, we will have to wait for the Hardware to catch up again.
What are the main use cases?
When ChatGPT came out, many used it like Google to get answers to their questions. This is actually one of the weak points of LLMs since they can only know what was inside their training data. They tend to get facts wrong and produce believable misinformation. This can be fixed by including search results like Bing is doing.
The better use case is creative writing and other text-based tasks like summarising, explaining, or translating. The biggest change will therefore happen in jobs like customer support, journalism, and teaching. The education system in particular can benefit greatly from this. In many countries, Germany for example, teachers are in need. Classes are getting bigger and lessons are less effective. Tools like ChatGPT are already helping many students and when more specialized programs use LLMs to provide a better experience they will outperform traditional schools soon. Sadly many schools try to ban ChatGPT instead of including it which is not only counterproductive but is also not possible since there are no tools that can accurately detect AI-written text. But text-based tasks are not the limit. Recent papers like Toolformer show that LLMs will soon be able to control and use other hard and software. This will lead to numerous new abilities and will enable them to take over a variety of new tasks. A personal assistant as Apple promised us years ago when they released Siri will soon be a reality.
Just a moment ago OpenAI opened their ChatGPT and Whisper API. they also published their previously leaked dedicated instance service. ChatGPT will be available for 0.002$ per 1000 tokens which is incredibly cheap and will be getting updates regularly. Whisper will be available for 0.006$ per minute of audio data.
OpenAi released a blog post about their plans for AGI and how to minimize the negative impacts. I highly recommend reading it yourself, but the key takeaways are:
The mission is to ensure that AGI benefits humanity by increasing abundance, turbocharging the global economy, and aiding in the discovery of new scientific knowledge.
AGI has the potential to empower humanity with incredible new capabilities, but it also comes with serious risks of misuse, drastic accidents, and societal disruption.
To prepare for AGI, a gradual transition to a world with AGI is better than a sudden one. The deployment of AGI should involve a tight feedback loop of rapid learning and careful iteration, and democratized access will lead to more and better research, decentralized power, and more benefits. Developing increasingly aligned and steerable models, empowering individuals to make their own decisions, and engaging in a global conversation about key issues are also important.