Google DeepMind just released their new Gemini models. They come in three sizes: Nano will run on devices like the Pixel phones, Pro powers products such as Bard, and Ultra is set for release at the beginning of next year. The models are multimodal and accept audio, video, text, images, and code as input.
Gemini outperforms current state-of-the-art models not only on text-based tasks but also in other modalities.
Test the Pro version now in Bard and read more about the model here and here.
Humane presented the AI Pin today. It is a small device with a camera, microphone, sensors, and a laser projector. It is designed to replace the smartphone and costs $699 plus a $24 monthly subscription. This includes unlimited use of multiple frontier LLMs, internet access, and other services like music. It can see what you see, translate, manage your calendar, send messages, and answer your questions.
I personally think the biggest problem is that most people are attached to social media and YouTube, so the Pin cannot really replace a phone, and it is too expensive as an addition to one. Phones can already do many of these things and do not cost much more. I can imagine something similar working in the future in combination with AR glasses. More information: https://hu.ma.ne/
LLMs are powerful tools, but they often struggle with tasks that require logical and algorithmic reasoning, such as arithmetic. A team of researchers from Google has developed a new technique to teach LLMs how to perform arithmetic operations by using in-context learning and algorithmic prompting. Algorithmic prompting means that the model is given detailed explanations of each step of the algorithm, such as addition or multiplication. The researchers showed that this technique can improve the performance of LLMs on arithmetic problems that are much harder than those seen in the examples. They also demonstrated that LLMs can use algorithmic reasoning to solve complex word problems by interacting with other models that have different skills. This work suggests that LLMs can learn algorithmic reasoning as a skill and apply it to various tasks.
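To make the idea more concrete, here is a minimal sketch of how one could construct such an algorithmic prompt for addition. The wording and the helper function are my own illustration; the actual prompts used in the paper differ.

```python
# Illustrative sketch of an "algorithmic prompt" for multi-digit addition.
# Instead of only showing input/output pairs, the prompt spells out every
# intermediate step (digit-wise sums and carries) so the model can imitate
# the algorithm itself. The exact prompt format in the paper differs.

def addition_demo(a: int, b: int) -> str:
    """Write out long addition digit by digit, including the carries."""
    lines = [f"Problem: {a} + {b}"]
    da, db = str(a)[::-1], str(b)[::-1]
    carry, digits = 0, []
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        s = x + y + carry
        digits.append(s % 10)
        lines.append(f"Step {i + 1}: {x} + {y} + carry {carry} = {s}, "
                     f"write {s % 10}, carry {s // 10}")
        carry = s // 10
    if carry:
        digits.append(carry)
        lines.append(f"Final carry: write {carry}")
    lines.append("Answer: " + "".join(map(str, digits[::-1])))
    return "\n".join(lines)

# A few fully worked examples go into the context, followed by the new problem;
# the model is then expected to reproduce the same step-by-step structure.
prompt = (addition_demo(128, 367) + "\n\n" + addition_demo(4025, 988)
          + "\n\nProblem: 52734 + 6189\n")
print(prompt)
```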
Meta recently released their new Llama 2 models. The new models come in sizes from 7 to 70 billion parameters and are released as base models and chat models, the latter fine-tuned with two separate reward models for safety and helpfulness. While the models are only a small improvement over the old Llama models, the most important change is the license, which now allows commercial use.
Researchers at Microsoft have unveiled Kosmos-2, the successor to Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive object descriptions and ground text in the visual world. By representing referring expressions as links in Markdown format, i.e. "[text span](bounding boxes)", Kosmos-2 grounds text to visual elements, enabling multimodal grounding, referring expression comprehension and generation, perception-language tasks, and language understanding and generation. This milestone lays the foundation for embodied AI and the convergence of language, multimodal perception, action, and world modeling, bringing us closer to bridging the gap between humans and machines in domains where AI interacts with the real world. With just 1.6B parameters, the model is quite small and will be openly available on GitHub.
DeepMind published a new blog post where they present their newest AI, which is based on their previous work Gato. RoboCat is a self-improving AI agent for robotics that learns to perform a variety of tasks across different arms and then self-generates new training data to improve its technique. It is the first agent to solve and adapt to multiple tasks and do so across different, real robots. RoboCat learns much faster than other state-of-the-art models. It can pick up a new task with as few as 100 demonstrations because it draws from a large and diverse dataset. This capability will help accelerate robotics research, as it reduces the need for human-supervised training, and is an important step towards creating a general-purpose robot.
Voicebox is a new generative AI for speech that can generalize to speech-generation tasks it was not specifically trained on, with state-of-the-art performance. It can create outputs in a wide variety of styles, from scratch or from a sample, and it can modify any part of a given sample. It can also perform tasks such as:
In-context text-to-speech synthesis: Given a short audio sample, it can match the sample's style and use it for text-to-speech generation.
Cross-lingual style transfer: Given a sample of speech and a passage of text in one of six supported languages, it can produce a reading of the text in that language.
Speech denoising and editing: It can resynthesize or replace corrupted segments within audio recordings.
Diverse speech sampling: It can generate speech that is more representative of how people talk in the real world.
Voicebox uses a new approach called Flow Matching, which learns from raw audio and transcription without requiring specific training for each task. It also uses a highly effective classifier to distinguish between authentic speech and audio generated with Voicebox. Voicebox outperforms the current state-of-the-art English model VALL-E on zero-shot text-to-speech and cross-lingual style transfer and achieves new state-of-the-art results on word error rate and audio similarity. Voicebox is not publicly available because of the potential risks of misuse, but the researchers have shared audio samples and a research paper detailing the approach and results. They hope to see a similar impact for speech as for other generative AI domains in the future.
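For readers wondering what Flow Matching refers to: in the flow matching literature, a neural network is trained to regress a target velocity field that transports noise into data. A simplified form of the conditional flow matching objective is sketched below; Voicebox's specific audio and text conditioning terms are omitted.

```latex
% Simplified conditional flow-matching objective (Lipman et al.-style notation);
% Voicebox's specific conditioning and masking are not shown here.
\mathcal{L}_{\mathrm{CFM}}(\theta) =
  \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_1 \sim q(x_1),\; x_t \sim p_t(x_t \mid x_1)}
  \big\| v_\theta(x_t, t) - u_t(x_t \mid x_1) \big\|^2
```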
After earlier experiments on mice, it is now possible to create human embryos out of stem cells. This allows us to make human life without sperm or eggs. Since the experiments are limited by ethical concerns, the researchers stopped the embryos' growth at an early stage. This research could lead to a better understanding of early development and could someday allow us to design our successor species.
This week Meta open-sourced a music generation model similar to Google's MusicLM. The model is named MusicGen and is completely open-source. These models can generate all kinds of music from text prompts, similar to image-generation models.
OpenAI announced a set of changes to their model APIs. The biggest announcement is the addition of function calling for both GPT-3.5 and GPT-4. This allows developers to connect plugins and other external tools to the models.
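Here is a minimal sketch of what function calling looks like from the developer side, based on the OpenAI Python SDK as it existed in mid-2023 (openai.ChatCompletion); the get_weather function is a made-up example, not something provided by the API.

```python
# Sketch of the function-calling flow: the model decides whether to call a
# declared function and returns the arguments as JSON for the developer to use.
import json
import openai

functions = [{
    "name": "get_weather",  # hypothetical tool, defined by the developer
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[{"role": "user", "content": "How warm is it in Berlin right now?"}],
    functions=functions,
    function_call="auto",  # let the model decide whether to call the function
)

message = response["choices"][0]["message"]
if message.get("function_call"):
    args = json.loads(message["function_call"]["arguments"])
    print("Model wants to call:", message["function_call"]["name"], args)
    # The developer would now run the real tool and send its result back to
    # the model in a follow-up message with role="function".
```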
They also released new versions of GPT-3.5 and GPT-4 that are better at following instructions, and a version of GPT-3.5 with a 16K context window.
In addition, the embedding model, which is used to build vector databases and lets models dynamically load relevant data as a kind of memory, became 75% cheaper. GPT-3.5 also became cheaper, now costing only $0.0015 per 1K input tokens.
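As a rough sketch of how the embedding model enables this kind of dynamic memory (mid-2023 OpenAI SDK; in practice a vector database would replace the in-memory list used here):

```python
# Embed documents once, embed the query, and load the most similar chunk into
# the prompt. Cosine similarity over OpenAI embeddings is used for retrieval.
import numpy as np
import openai

docs = [
    "Meeting notes from last Tuesday about the product launch.",
    "Internal wiki page explaining the deployment pipeline.",
    "Old chat log about vacation planning.",
]

def embed(texts):
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([item["embedding"] for item in resp["data"]])

doc_vecs = embed(docs)
query_vec = embed(["How do we deploy the service?"])[0]

# Cosine similarity between the query and every document.
scores = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)
best = docs[int(np.argmax(scores))]
print("Most relevant context to feed the model:", best)
```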
After DeepMind developed AlphaTensor last year and found a new algorithm for matrix multiplication, they have done it again. This time they developed AlphaDev, which found a new algorithm for sorting. This may not sound as exciting as a new language model, but sorting algorithms run billions of times every hour. Optimizing central algorithms like sorting and searching is one of the oldest parts of computer science, and these routines have been optimized for over a hundred years at this point. No better solution had been found in the last ten years, and some believed we had reached the limit of what is possible. AlphaDev's new solution was implemented in the standard C++ library and is already in use. Because these routines run so often, even small improvements have an enormous impact, and the saved energy adds up quickly. They also found a new hashing algorithm that is used similarly often. If AlphaDev continues to find improvements for core algorithms, every piece of software in the world will run faster and more efficiently. Breakthroughs like this should be considered in the discussion around the climate impact of AI training: the energy saved by these improvements offsets the energy used for training by orders of magnitude.
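To give a sense of the scale at which AlphaDev operates: the routines in question are tiny fixed-size sorts that larger sorting functions call constantly. Below is the classic textbook 3-element sorting network in Python, purely for illustration; it is not AlphaDev's discovered instruction sequence, which works at the assembly level.

```python
# Illustration only: a standard 3-element sorting network built from
# compare-exchange operations. AlphaDev searched for shorter assembly
# sequences for routines of exactly this kind (sort3, sort5, ...).

def compare_exchange(values, i, j):
    """Swap so that the smaller of values[i], values[j] comes first."""
    if values[j] < values[i]:
        values[i], values[j] = values[j], values[i]

def sort3(values):
    """Sort a list of exactly three elements with three compare-exchanges."""
    compare_exchange(values, 0, 1)
    compare_exchange(values, 1, 2)
    compare_exchange(values, 0, 1)
    return values

print(sort3([3, 1, 2]))  # [1, 2, 3]
```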
Apple finally announced their upcoming VR headset, which focuses on productivity and cinematic entertainment. The 4K displays are powered by their M2 chip, which requires an external battery, and at $3,500 the headset is very expensive. The headset focuses on mixed-reality experiences similar to the new Meta Quest 3, but unlike the Quest, it will not be released until next year. It is probably a good starting point for Apple to build their new product platform, but if you are not in desperate need of a high-resolution headset, it is perhaps not the right choice for you.
Meta announced their new Meta Quest 3 headset. It is the successor to the Quest 2, the most popular VR headset of all time. The price went up a bit, while the processing power, form factor, and visuals improved; passthrough in particular is better and now in color. Eye tracking is not included. Together with Apple's upcoming entrance into the VR space, this will give the XR world a new push forward.
Microsoft Build is currently underway, with Microsoft showcasing a range of new and upcoming products, including various Copilots such as Copilot for Bing, GitHub, and Edge. In their pipeline, they also have plans to launch a Copilot specifically designed for Windows.
These Copilots are all built using Microsoft’s new Azure AI Studio Platform, which is now open to developers, allowing them to create their own Copilots.
Furthermore, Microsoft announced their support for an open plugin system, similar to the one utilized by ChatGPT, making plugins accessible to all Copilots. If this solution becomes the industry standard for AI systems, it has the potential to establish Microsoft as a dominant player in the AI market. The first day of Microsoft Build concluded with an exceptional presentation by Andrej Karpathy, delving into the history and inner workings of GPT models. If you’re interested in gaining insights into how these models operate and learn, I highly recommend watching his talk titled “State of GPT.”
Intel just announced a new supercomputer named Aurora. It is expected to offer more than 2 exaflops of peak double-precision compute performance and is based on their new GPU series, which outperforms even the new H100 cards from NVIDIA.
They are going to use Aurora to train their own LLMs with up to a trillion parameters. This would likely be the first 1T-parameter model.
I am excited to see even bigger models and more diverse hardware and software options in the field.
Today the US Senate held a hearing on AI to discuss the risks and opportunities of AI and possible ways to regulate the sector nationally and globally.
Witnesses testifying include Sam Altman, CEO of OpenAI; Gary Marcus, professor emeritus at New York University; and Christina Montgomery, vice president and chief privacy and trust officer at IBM.
I think the discussion was quite good and is relevant for everyone. One thing that stands out is the companies' wish to be regulated and guided by the government. The EU AI Act was a topic, and the need for a global solution was a main talking point. A central idea, proposed by Sam Altman, was an agency that issues licenses to companies for developing LLMs.
I hope governments find a way to make sure AI is deployed in a way where everyone benefits and the development of the technology is neither slowed down nor limited to a few people or their profits.
Google I/O happened yesterday, and the keynote focused heavily on AI. Some of the things I found most important:
PaLM 2 is their new LLM. It comes in different sizes, from small enough to run on Pixel phones to big enough to beat ChatGPT (GPT-3.5). It is used in Bard and many of their productivity tools.
Gemini is a multimodal model and a product of the newly merged Google DeepMind. It is being trained right now and could be a contender for the strongest AI when it comes out. I am quite excited about this release since DeepMind is my personal favorite for AGI.
Moreover, they showcased their seamless integration of PaLM and other advanced generative AI tools throughout their product suite as a direct response to Microsoft’s Copilot. They applied the same innovative approach to their search functionality, incorporating PaLM to deliver a search experience reminiscent of Bing GPT. This development fills me with hope, considering their search results outperform those of Bing. It’s likely that their decision to keep PaLM smaller was driven by cost considerations, allowing for more economical operation in the realm of search.
Anthropic, the OpenAI competitor, just announced a new version of their LLM Claude. The new version has a context length of 100K tokens, which corresponds to around 75K words. It is not clear from the announcement how they implemented this and how the full context is fed into the attention layers.
OpenAI is planning to release a 32K context version of GPT-4 soon.
Longer context means you can feed long-form content like books, reports, or entire code bases into the model and work with the entirety of the data.
One of the main problems with LLMs is that they are black boxes: how they produce an output is not understandable to humans. Understanding what individual neurons represent and how they influence the model is important to make sure models are reliable and do not develop dangerous tendencies.
OpenAI applied GPT-4 to work out what individual neurons in GPT-2 represent. The methodology uses GPT-4 to generate an explanation of a neuron's behavior, simulate how a neuron matching that explanation would fire, and then compare these simulated activations with the real activations to score the explanation's accuracy. This process helps with understanding the model and could potentially help improve its performance.
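A rough sketch of that explain-simulate-score loop is below. The two gpt4_* helpers are hypothetical placeholders for GPT-4 calls, not functions from OpenAI's released tooling; only the scoring idea is shown, using a simple correlation.

```python
# Hypothetical sketch of the explain-simulate-score loop for one GPT-2 neuron.
import numpy as np

def gpt4_explain(snippets, activations):
    """Placeholder: would prompt GPT-4 with (text snippet, activation) pairs."""
    return "fires on words related to weather"  # dummy explanation

def gpt4_simulate(explanation, snippet):
    """Placeholder: would ask GPT-4 to predict the activation from the text."""
    return float("weather" in snippet)  # dummy simulation

def score_neuron(snippets, real_activations):
    # 1. Explain the neuron's behavior from snippets where it fired.
    explanation = gpt4_explain(snippets, real_activations)
    # 2. Simulate how a neuron matching that explanation would fire.
    simulated = np.array([gpt4_simulate(explanation, s) for s in snippets])
    # 3. Score the explanation by how well the simulation tracks reality.
    real = np.array(real_activations, dtype=float)
    score = np.corrcoef(simulated, real)[0, 1]
    return explanation, score

snippets = ["sunny weather today", "the stock market fell", "rain and weather"]
print(score_neuron(snippets, [0.9, 0.1, 0.8]))
```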
The tools and datasets used for this process are being open-sourced to encourage further research and development of better explanation generation techniques. This is part of the recent efforts in AI alignment before even more powerful models are trained. Read more about the process here and the paper here. You can also view the neurons of GPT-2 here. I recommend clicking through the network and admiring the artificial brain.
Shap-E can generate 3D assets from text or images. Unlike OpenAI's earlier model Point-E, it directly generates the parameters of implicit functions that can be rendered as both textured meshes and neural radiance fields. It is also faster to run and open-source! Read the paper here.
Just like with video generation, the quality still lags behind image generation. I expect this to change by the end of the year.