One of the main problems with LLMs is that they are black boxes: how they produce an output is not understandable to humans. Understanding what individual neurons represent and how they influence the model is important for making sure models are reliable and do not contain dangerous tendencies.
OpenAI applied GPT-4 to interpret the behavior of individual neurons in GPT-2. The methodology: GPT-4 generates an explanation of a neuron's behavior, simulates the activations a neuron matching that explanation would produce, and then compares these simulated activations with the real activations to score the explanation's accuracy. This process aids understanding and could potentially help improve the model's performance.
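The scoring step boils down to correlating the simulated activations with the real ones. A minimal sketch of that idea, with a hand-rolled Pearson correlation (the function name and toy data are mine, not from OpenAI's pipeline):

```python
import math

def score_explanation(real_activations, simulated_activations):
    """Score an explanation as the Pearson correlation between the
    activations the explainer simulated and the neuron's real
    activations over the same tokens."""
    n = len(real_activations)
    mean_r = sum(real_activations) / n
    mean_s = sum(simulated_activations) / n
    cov = sum((r - mean_r) * (s - mean_s)
              for r, s in zip(real_activations, simulated_activations))
    var_r = sum((r - mean_r) ** 2 for r in real_activations)
    var_s = sum((s - mean_s) ** 2 for s in simulated_activations)
    if var_r == 0 or var_s == 0:
        return 0.0  # no variation -> no evidence the explanation works
    return cov / math.sqrt(var_r * var_s)

real = [0.1, 0.9, 0.2, 0.8]
perfect = score_explanation(real, real)                  # ~1.0
uninformative = score_explanation(real, [0.5, 0.5, 0.5, 0.5])  # 0.0
```

A score near 1 means the explanation predicts when the neuron fires; a score near 0 means it carries no information about the neuron.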
The tools and datasets used for this process are being open-sourced to encourage further research and development of better explanation generation techniques. This is part of the recent efforts in AI alignment before even more powerful models are trained. Read more about the process here and the paper here. You can also view the neurons of GPT-2 here. I recommend clicking through the network and admiring the artificial brain.
In this episode, I talk with Florian about the various applications of AI in education. The text-to-speech examples mentioned:
OpenAI's Shap-E can generate 3D assets from text or images. Unlike their earlier model Point-E, this one can directly generate the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. It is also faster to run and open-source! Read the paper here.
As with video generation, the quality still lags behind image generation. I expect this to change by the end of this year.
Microsoft announced that Bing Chat is now available for everyone, and that it will get new features such as image search and more ways to present visual information. They are also adding the ability to summarize PDFs and other types of content.
But the biggest news is that they are bringing plugins to Bing Chat, which will work similarly to the ChatGPT plugins. I recommend reading the entire announcement yourself. This is the first step toward their promised copilot for the web, and I think they are doing a good job. This also puts pressure on their partner OpenAI, who are working on their own improvements to ChatGPT and now have to compete with their investor Microsoft.
Stability AI finally released DeepFloyd, a new text-to-image model, which is capable of putting text in images and has much better spatial awareness. It was trained on a new version of the LAION-A dataset.
Researchers have made a breakthrough in the field of artificial intelligence, successfully extending the context length of BERT, a Transformer-based natural language processing model, to two million tokens. The team achieved this feat by incorporating a recurrent memory into BERT using the Recurrent Memory Transformer (RMT) architecture.
The researchers’ method increases the model’s effective context length and maintains high memory retrieval accuracy. This allows the model to store and process both local and global information, improving the flow of information between different segments of an input sequence.
The study’s experiments demonstrated the effectiveness of the RMT-augmented BERT model, which can now tackle tasks on sequences up to seven times its originally designed input length (512 tokens). This breakthrough has the potential to significantly enhance long-term dependency handling in natural language understanding and generation tasks, as well as enable large-scale context processing for memory-intensive applications.
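The core idea of RMT can be sketched in a few lines: the long input is split into segments of the base model's input size, a handful of memory slots is prepended to each segment, and the memory written while processing one segment is carried into the next. A toy sketch, where a trivial stand-in replaces the actual transformer (names and the stand-in update rule are mine):

```python
def process_long_sequence(tokens, segment_len=512, num_memory=4):
    """Toy Recurrent Memory Transformer loop: the base model only ever
    sees `segment_len` tokens plus a few memory slots, but information
    flows between segments through the recurrent memory."""
    memory = [0.0] * num_memory  # initial (empty) memory slots
    outputs = []
    for start in range(0, len(tokens), segment_len):
        segment = tokens[start:start + segment_len]
        # Stand-in for the transformer: it would read [memory + segment]
        # and return per-token outputs plus an updated memory.
        outputs.extend(segment)  # pretend per-token outputs
        memory = [m + sum(segment) / len(segment) for m in memory]
    return outputs, memory

tokens = list(range(2048))  # 4x the base context of 512
out, mem = process_long_sequence(tokens)
print(len(out))  # 2048: every token processed despite the 512-token limit
```

The key property is that compute grows linearly with sequence length, since each segment is a fixed-size forward pass; only the small memory is recurrent.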
Google and DeepMind just announced that they will unite Google Brain and DeepMind into Google DeepMind. This is a good step for both sides: DeepMind needs the computing power of Google to make further progress toward AGI, and Google needs the manpower and knowledge of the DeepMind team to quickly catch up to OpenAI and Microsoft. This merger could create a real rival to OpenAI on the way to AGI. I personally always liked that DeepMind had a different approach to AGI, and I hope they will continue to push ideas other than language models.
Stability AI finally released their own open-source language model. It is trained from scratch and can be used commercially. The first two models are 3B and 7B parameters in size, which is comparable to many other open-source models.
What I am more excited about are their planned 65B and 175B parameter models, which are bigger than most other recent open-source models. These models will show how close open-source models can actually get to ChatGPT and whether local AI assistants have a future.
NVIDIA’s newest model, VideoLDM, can generate videos with resolutions up to 1280 x 2048. They achieve this by training a diffusion model in a compressed latent space, introducing a temporal dimension to the latent space, and fine-tuning on encoded image sequences while temporally aligning diffusion model upsamplers.
It is visibly better than previous models, and it looks like my prediction for this year is coming true: we are getting video models as capable as the image models from the end of last year. Read the paper here.
Today, Microsoft published a paper called “NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers“. In this paper, they show a new text-to-speech model that can reproduce not only human speech but also singing. The model uses a latent diffusion model and a neural audio codec to synthesize high-quality, expressive voices with strong zero-shot ability by generating quantized latent vectors conditioned on text input.
With this model, we are reaching a critical point. Text-to-speech is now good enough to fool people and replace many jobs and positions that require speech. It also allows for better speech interfaces to language models, which makes the interaction more natural. As we approach a future where people have personal AI assistants, natural speech is a core technology. And even though NaturalSpeech 2 is not perfect, it is good enough to start this future.
MiniGPT-4 is an open-source multimodal model similar to the version of GPT-4 that was shown during OpenAI’s presentation. It combines a visual encoder with an LLM. They used Vicuna, a fine-tuned version of LLaMA.
In the future, I hope more teams try to add new ideas to their models instead of creating more and more small language models.
OpenAssistant is an open-source project to build a personal assistant. They just released their first model. You can try it out here.
While the progress on smaller models by the open-source community is impressive, there are a few things I want to mention. Many advertise these models as local alternatives to ChatGPT or even compare them to GPT-4. This is sadly not true. It is not possible to replicate the capabilities of a model like GPT-4 on a local machine; at least not yet. This does not mean that they are not good. Many of them are able to generate good answers or even use APIs like ChatGPT.
Neural Radiance Fields (NeRFs), which are used for synthesizing high-quality images of 3D scenes, are a class of generative models that learn to represent scenes as continuous volumetric functions, mapping 3D spatial coordinates to RGB colors and volumetric density. Grid-based representations of NeRFs use a discretized grid to approximate this continuous function, which allows for efficient training and rendering. However, these grid-based approaches often suffer from aliasing artifacts, such as jaggies or missing scene content, due to the lack of an explicit understanding of scale.
This new paper proposes a novel technique called Zip-NeRF that combines ideas from rendering and signal processing to address the aliasing issue in grid-based NeRFs. This allows for anti-aliasing in grid-based NeRFs, resulting in significantly lower error rates compared to previous techniques. Moreover, Zip-NeRF achieves faster training times, being 22 times faster than current approaches.
This makes them applicable for VR and AR applications and allows for high-quality 3D scenes. When the hardware improves next year, we will see some very high-quality VR experiences.
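The volumetric function a NeRF learns can be illustrated with the standard volume-rendering rule: sample points along a camera ray, query the scene function for color and density at each point, and alpha-composite the samples. A minimal sketch, with a hand-written scene function standing in for the learned network:

```python
import math

def toy_scene(x, y, z):
    """Stand-in for the learned MLP: returns (rgb, density) at a 3D
    point. Here, a solid red sphere of radius 1 around the origin."""
    d = math.sqrt(x * x + y * y + z * z)
    density = 5.0 if d < 1.0 else 0.0
    return (1.0, 0.0, 0.0), density

def render_ray(origin, direction, n_samples=64, near=0.0, far=4.0):
    """Alpha-composite samples along the ray (classic volume rendering)."""
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    dt = (far - near) / n_samples
    for i in range(n_samples):
        t = near + (i + 0.5) * dt
        p = [origin[k] + t * direction[k] for k in range(3)]
        rgb, density = toy_scene(*p)
        alpha = 1.0 - math.exp(-density * dt)
        for k in range(3):
            color[k] += transmittance * alpha * rgb[k]
        transmittance *= 1.0 - alpha
    return color

# A ray through the sphere picks up red; a ray that misses stays black.
hit = render_ray((0.0, 0.0, -3.0), (0.0, 0.0, 1.0))
miss = render_ray((0.0, 2.0, -3.0), (0.0, 0.0, 1.0))
```

Grid-based NeRFs replace the expensive network query in `toy_scene` with a lookup into a discretized grid; the aliasing Zip-NeRF targets comes from that lookup ignoring how much volume each sample actually covers.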
OpenAI developed a new approach to image generation called consistency models. Current models, like DALL-E 2 or Stable Diffusion, iteratively refine the result over many denoising steps. This new approach goes straight to the final result, which makes the process much faster and cheaper. While not yet as good as some diffusion models, they will likely improve and become an alternative for scenarios where faster results are needed.
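The difference between the two approaches comes down to the number of network evaluations per sample: a diffusion sampler applies a learned denoiser many times, while a consistency model maps noise to a sample in a single evaluation. A toy illustration (the "denoisers" here are trivial stand-ins, not real models):

```python
def diffusion_sample(noise, n_steps=50):
    """Iterative sampler: repeatedly apply a small denoising update."""
    x, calls = noise, 0
    for _ in range(n_steps):
        x = x * 0.9  # stand-in for one learned denoising step
        calls += 1
    return x, calls

def consistency_sample(noise):
    """Consistency model: one evaluation maps noise straight to a sample."""
    return noise * 0.9 ** 50, 1  # stand-in for the single learned mapping

x_diff, diffusion_calls = diffusion_sample(1.0)
x_cons, consistency_calls = consistency_sample(1.0)
# Same endpoint, 50x fewer network evaluations.
```

In the real setting each "call" is a full forward pass through a large network, so cutting 50 evaluations down to one is where the speed and cost savings come from.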
In a new research paper, Google and Stanford University created a sandbox world where they let 25 AI agents role-play. The agents are based on ChatGPT (GPT-3.5), and their behavior was rated as more believable than that of human-controlled characters. Future agents based on GPT-4 will be able to act even more realistically and intelligently. This could not only mean that we get better AI NPCs in computer games, but also that we will not be able to distinguish bots from real people. This is a great danger in a world where public opinion influences many. As these agents become more human-like, the risk of deep emotional connections increases, especially if the person does not know that they are interacting with an AI.
The Segment Anything Model (SAM) was published by Meta last week, and it is open source. It can “cut out” any object in an image and find objects from a simple text prompt. SAM could be used in future AR software or as part of a bigger AI system with vision capabilities. The new dataset they used (SA-1B) is also open source and contains over 1B masks.
Since the GPT-3.5 and GPT-4 APIs became available, many companies and start-ups have implemented them into their products. Now developers have started to do it the other way around: they build systems around GPT-4 that enable it to search, use APIs, execute code, and interact with itself. Examples are HuggingGPT or AutoGPT. They are based on works like Toolformer or this result. Even Microsoft itself started to build LLM-Augmenter around GPT-4 to improve its performance.
I talked about this development in my post on how to get from GPT-4 to proto-AGI. I still think this is the way to a general assistant, even though I am not sure whether GPT-4 is already capable enough or whether we need another small improvement.
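The pattern these projects share is a simple control loop: the model emits either a final answer or a tool call, the wrapper executes the call, and the result is fed back into the model's context. A minimal sketch (the message format, the fake model, and the `calc` tool are mine, not from any of the named projects):

```python
def run_agent(model, tools, prompt, max_turns=5):
    """Generic tool-use loop around an LLM. `model` maps a transcript
    to either ("final", text) or ("tool", name, argument)."""
    transcript = [prompt]
    for _ in range(max_turns):
        action = model(transcript)
        if action[0] == "final":
            return action[1]
        _, name, arg = action
        result = tools[name](arg)                        # execute the tool
        transcript.append(f"{name}({arg}) -> {result}")  # feed result back
    return "max turns reached"

# Fake model: asks the calculator once, then answers with its result.
def fake_model(transcript):
    if len(transcript) == 1:
        return ("tool", "calc", "6*7")
    return ("final", transcript[-1].split("-> ")[1])

tools = {"calc": lambda expr: str(eval(expr))}
print(run_agent(fake_model, tools, "What is 6*7?"))  # 42
```

Real systems like AutoGPT add persistence, planning, and many tools on top, but the skeleton is this same observe-act-feed-back loop.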
Similar to OpenAI, DeepMind has started to work together with other companies to build more commercial products. In their recent blog post, they explained how they developed a new video codec and improved auto chapters for YouTube.
If this trend continues, we will see more products for other Alphabet companies developed by DeepMind.
Stanford released the new AI Index Report. Some of the key takeaways are:
Industry takes over and leaves academia behind.
Scientific research is accelerating thanks to AI.
Misuse and use of AI are rapidly growing.
Demand for AI-related skills is growing.
Companies that use AI are leaving behind those that do not.
China is the most active country in machine learning and also the most positive about AI.
The USA is building the most powerful AI systems.
The report sadly does not include GPT-4 and other newer results. I still highly recommend looking into the report. They did a great job capturing some key trends in a very clear and visual way. For example, the following graph shows the exponential growth of machine learning systems.
Google’s new ViT-22B is the largest Vision Transformer model by far, with 22 billion parameters. It has achieved SOTA in numerous benchmarks such as depth estimation, image classification, and semantic segmentation. ViT-22B has been trained on four billion images and can be used for all kinds of computer vision tasks.
This result shows that further scaling of vision transformers can be as valuable as it was for language models. It also indicates that future multimodal models can be improved and that GPT-4 is not the limit.