The Future is Now

Tag: Meta

Why Open Source Models Are Great

The open-source AI landscape has witnessed significant growth in recent years, with numerous projects and initiatives emerging to democratize access to artificial intelligence. In this blog post, I will look at the current state of open-source AI, exploring the key players, fine-tuning techniques, hardware and API providers, and the compelling arguments in favor of open-source AI.

Model Providers

Training LLMs costs a significant amount of money and requires a lot of experience and hardware. Only a few organizations have the means to do so. The following list is not complete and just covers some of the big ones.

Meta is currently the biggest company that open-sources models. Their model family is called Llama, and the current Llama 3 models are available in two sizes: 8B and 70B. A 405B model is expected soon. The weak points of the current versions are their lack of non-English training data and their small context window, but Meta is already working on both.

Mistral is a smaller French company that received investments from Microsoft, including computing power. While not all their models are open-source, the ones that are perform well. They open-sourced a 7B model that was a cornerstone of open-source models for quite some time, and they open-sourced two Mixture-of-Experts models (8x7B and 8x22B) that are still leading non-English open-source models, especially at their price point.

Cohere recently open-sourced a few models, including their LLMs Command-R and Command-R+. They perform especially well when used in combination with retrieval-augmented generation.

Stability AI is mostly known for open-sourcing text-to-image models, but they have also open-sourced a few smaller LLMs that are decent for their size.

Google does not open-source their Gemini models, but they have a set of open models called Gemma, which includes some experimental LLMs that are not based on Transformers.

API-Providers and Hardware

The main argument for open-source models is the ability to run them on your own personal machine. Current models range from 2B to over 100B parameters, so let's see what is needed to run them.
For small models under 7B, you don't need anything special; these models can even run on your phone. Models between 7B and 14B can run on most PCs but will be very slow unless you have a modern GPU. Bigger models between 14B and 70B require high-end PCs. Apple's modern high-end devices are especially well suited since their unified memory provides the large, fast memory pool that bigger models need. Everything over 70B, including the MoE models from Mistral, is usually not usable on home devices. These models are instead available from a broad selection of API providers who host different open-source models and compete on price, speed, and latency. I selected a few that excel in one or two of these categories.

Groq is a newer hardware company that developed custom chips for LLMs, which allows them to offer incredible speeds and prices: for example, Llama 3 8B for less than 10 cents per million tokens at over 800 tokens per second. If you ran the model yourself, you would get around 10-20 tokens per second depending on your hardware.

Together.ai offers nearly all common open-source models and gives you a few million tokens for free at the start so you can begin experimenting immediately.

Perplexity is not only a great search engine; its API is also great. It is not as cheap or fast as Groq, but it has extremely low latency, and they offer their own models with internet access. They also provide free API credits for Perplexity Pro users.

If you prefer to run models yourself, I recommend a newer Nvidia GPU with as much VRAM as you can afford.
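How much VRAM you need can be estimated roughly from the parameter count times the bytes per parameter, plus some overhead for the KV cache and activations. Here is a minimal back-of-the-envelope sketch; the 1.2 overhead factor is just an assumption for illustration, not a precise figure:

```python
# Rough VRAM estimate: parameters * bytes per parameter, plus overhead.
# The 1.2 overhead factor (KV cache, activations) is a loose assumption.

def estimate_vram_gb(params_billion: float, bits_per_param: int, overhead: float = 1.2) -> float:
    bytes_total = params_billion * 1e9 * bits_per_param / 8
    return bytes_total * overhead / 1e9

for name, params in [("8B", 8), ("70B", 70)]:
    for bits in (16, 8, 4):
        print(f"{name} at {bits}-bit: ~{estimate_vram_gb(params, bits):.0f} GB")
```

This is why a quantized 8B model fits on a typical gaming GPU, while a 70B model still needs workstation-class hardware or a machine with lots of unified memory even after aggressive quantization.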

Customization

One of the great side effects of having control over the model is the ability to adapt it to your needs. This starts with simple things like system prompts or temperature. Another technique that is often used is quantization: the process of taking the parameters of the model, which are usually stored as floating-point numbers with 16 or 32 bits of precision, and rounding them in different ways to shrink them to somewhere between 8 and 1 bit. Depending on how aggressive the quantization is, this slightly reduces the capabilities of the model, but it makes the model easier and faster to run on weaker hardware.
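To make the idea concrete, here is a toy sketch of symmetric 8-bit quantization with NumPy. It is not how production formats and libraries implement it (they typically quantize in small blocks and keep extra scale data), but it shows the basic round-and-rescale step:

```python
import numpy as np

# Toy symmetric int8 quantization: map float weights to [-127, 127]
# with a single scale factor, then dequantize to see the rounding error.
weights = np.random.randn(4096).astype(np.float32)

scale = np.abs(weights).max() / 127.0                      # one scale for the whole tensor
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q.astype(np.float32) * scale

print("max absolute error:", np.abs(weights - dequantized).max())
print("memory: float32 =", weights.nbytes, "bytes, int8 =", q.nbytes, "bytes")
```

The memory drops to a quarter of the original, and the per-weight error stays small, which is the trade-off described above.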

Fine-tuning

For many use cases, current models are not optimal. They lack knowledge, perform worse in a required language, or simply do not perform well on a certain task. To solve these problems you can fine-tune the models. Fine-tuning means continuing the training of the model on a small custom dataset that helps the model learn the required ability. The following part will be a bit more technical and can be skipped:
Three main types of open-source LLMs are available: base models, instruct models, and chat models. Base models are only trained on huge amounts of text and work more like text completion. They do not really work as chatbots and are hard to use. Instruct models are already fine-tuned by the creator on a set of text examples that teach the model to follow the instructions in a given input instead of simply continuing the text. Chat models are further fine-tuned to behave in a chatbot-like way and can hold conversations. They are also often trained to have certain limitations and can refuse to talk about certain topics.

For fine-tuning, base models give the most freedom. You could even continue the training with new languages or information and do instruct training after that. There are instruct datasets available that you can use, or you can create your own. If you fine-tune an existing instruct model, you usually need less data and compute, and you can still teach the model a lot and change its behavior; this is most often the best choice. Existing chat models can still be fine-tuned, but since they are already trained to behave in a certain way, it is harder to get specific behavior out of them, and teaching them completely new skills is hardly possible. Fine-tuning chat models is best if you just want to change the tone of the model or train it on a specific writing style.

There are different ways to fine-tune. Most often, you freeze the earlier layers of the model so its learned knowledge is not changed too much and only train the later layers. While this is not totally correct, I like to imagine that later layers are more important for the style of the output, while earlier layers work more like the core language-understanding part of the model. So the more fundamental the thing you want to change, the more layers you need to train: a certain writing style usually only requires the very end of the model, while improved math capabilities need most of the network.

Another fine-tuning technique that often pops up is LoRA, which stands for Low-Rank Adaptation. It builds on the observation that the weight changes needed for fine-tuning can be approximated by low-rank matrices: instead of updating a full weight matrix, two much smaller matrices are trained whose product approximates the update, and together they contain far fewer parameters than the original matrix. This makes the process faster and cheaper and allows LoRAs to be shared with little memory overhead. The trained LoRA matrices can later be swapped in and out like a hat.
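For illustration, here is a minimal PyTorch sketch of the LoRA mechanism: the original weight matrix stays frozen, and only the two small low-rank matrices are trained. Real implementations (for example the peft library) add dropout and apply this to specific attention projections; this is just the core idea, not any library's actual code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (toy sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # the original weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)   # down-projection
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.lora_b.weight)   # start with a zero update
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(x)) * self.scaling

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} of {total}")  # a tiny fraction of the full matrix
```

Only the two small matrices need to be stored and shared, which is why LoRA files are so much smaller than full model checkpoints.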

Output control

If you have control over your model, you can also intervene in its output. The most popular example is something like JSON mode: at every step, instead of sampling freely from the logits, an external program checks which output tokens are valid given the JSON grammar and only selects among those. This can be used to guarantee that the output follows a given structure and can also be used for things like tool use or other additional functions.
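A toy sketch of the idea: at each decoding step, a validity check masks out every token that would break the required format, and sampling only happens over what remains. Real implementations (grammar-based sampling in local inference servers, or libraries built for structured output) derive the valid set from a full grammar; the `constrained_sample` and `is_valid` names here are just placeholders for illustration.

```python
import math, random

def constrained_sample(logits: dict, prefix: str, is_valid) -> str:
    """Sample the next token, but only from tokens that keep the output valid."""
    allowed = {t: l for t, l in logits.items() if is_valid(prefix + t)}
    if not allowed:
        raise ValueError("no valid continuation")
    # softmax over the allowed tokens only
    m = max(allowed.values())
    weights = {t: math.exp(l - m) for t, l in allowed.items()}
    r = random.uniform(0, sum(weights.values()))
    for token, w in weights.items():
        r -= w
        if r <= 0:
            return token
    return token

# Placeholder validity check: the output must stay on track to start with '{"name": "'.
def is_valid(candidate: str) -> bool:
    template = '{"name": "'
    return candidate == template[:len(candidate)] or candidate.startswith(template)

print(constrained_sample({'{': 1.2, 'Hello': 3.0, '<': 0.1}, prefix="", is_valid=is_valid))
```

Even though "Hello" has the highest logit, it can never be chosen because it would break the required structure.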

Local tools

There is a range of tools to run models locally, from chat interfaces that mimic the experience of ChatGPT to local API servers that can be used by companies or developers. Here are some examples:

GPT4All is a local chat interface that not only allows you to download models but can also give the models access to your local documents and is very easy to use.

Ollama is a local LLM server that makes it easy to install additional models and supports a wide range of operating systems and hardware.
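For example, once Ollama is running, you can talk to it from any script over its local HTTP API. The sketch below assumes Ollama's default port and that a Llama 3 model has already been pulled (for example with `ollama pull llama3`):

```python
import json
import urllib.request

# Minimal call against a locally running Ollama server (default port 11434).
payload = {
    "model": "llama3",                     # assumes this model was pulled beforehand
    "prompt": "Explain quantization in one sentence.",
    "stream": False,                       # return one JSON object instead of a stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```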

LM Studio also offers a user interface to chat with models and includes functionality to fine-tune them with LoRA.

Conclusion

So as you can see, there are many reasons why open-source models can be superior, even though the biggest and smartest closed models currently available are slightly better than the best open-source models. Open models are way cheaper, even if you compare price to performance, and they allow for much more control. They can be trained to your liking and needs, and they offer privacy and control over your data and how it is used. If you run them locally, they often have lower latency, and even if you use API providers, you get better prices and super-fast inference. Open-source models used to be around a year behind the top models, but recently they have started to catch up. They will probably never lead the field in terms of capabilities, but they will always be the cheaper option. GPT-3.5 is the best example of a model that was overtaken by open source a long time ago: models like Llama 3 are not only cheaper, they are also way faster and offer all the advantages of open models.


Episode 15: Llama 2, China, and Open Source

In this episode, Florian and I talk about Meta's new Llama model, the current selection of language models, and how they can be controlled.

More information on the Discord server
https://discord.gg/3YzyeGJHth
or at https://mkannen.tech

Voicebox: A new Voice Model

Voicebox is a new generative AI model for speech that can generalize, with state-of-the-art performance, to speech-generation tasks it was not specifically trained to accomplish. It can create outputs in a vast variety of styles, from scratch or from a sample, and it can modify any part of a given sample. It can also perform tasks such as:

  • In-context text-to-speech synthesis: Using a short audio sample, it can match the sample's style and synthesize speech for a given text.
  • Cross-lingual style transfer: Given a sample of speech and a passage of text in any of six supported languages, it can produce a reading of the text in that language in the style of the sample.
  • Speech denoising and editing: It can resynthesize or replace corrupted segments within audio recordings.
  • Diverse speech sampling: It can generate speech that is more representative of how people talk in the real world.

Voicebox uses a new approach called Flow Matching, which learns from raw audio and transcriptions without requiring task-specific training. The researchers also built a highly effective classifier to distinguish between authentic speech and audio generated with Voicebox. Voicebox outperforms VALL-E, the current state-of-the-art English model, on zero-shot text-to-speech and cross-lingual style transfer, and achieves new state-of-the-art results on word error rate and audio similarity. Voicebox is not publicly available because of the potential risks of misuse, but the researchers have shared audio samples and a research paper detailing the approach and results. They hope to see a similar impact for speech as for other generative AI domains in the future.

Meta Quest 3

Meta announced their new Meta Quest 3 headset. It is the successor to the Quest 2, the most popular VR headset of all time. The price went up a bit, but the processing power and form factor improved, as did the visuals; passthrough in particular is better now that it is in color. Eye tracking is not included. Together with Apple's upcoming entrance into the VR space, this will give the XR world a new push forward.

Meta

The Segment Anything Model (SAM) was published by Meta last week, and it is open source. It can “cut out” any object in an image and find objects from a simple text prompt. SAM could be used in future AR software or as part of a bigger AI system with vision capabilities. The new dataset they used (SA-1B) is also open source and contains over 1B masks.

Giving AI a Body

Meta announced two major advancements toward general-purpose embodied AI agents capable of performing challenging sensorimotor skills.

The first advancement is an artificial visual cortex (called VC-1) that supports a diverse range of sensorimotor skills, environments, and embodiments. VC-1 is trained on videos of people performing everyday tasks from the Ego4D dataset. VC-1 matches or outperforms state-of-the-art results on 17 different sensorimotor tasks in virtual environments.

The second advancement is a new approach called adaptive (sensorimotor) skill coordination (ASC), which achieves near-perfect performance (98 percent success) on the challenging task of robotic mobile manipulation (navigating to an object, picking it up, navigating to another location, placing the object, repeating) in physical environments.

These improvements are needed to move the field of robotics forward and to keep up with the current pace of AI, which will need bodies at some point.

New Transformer Model CoLT5 Processes Long Documents Faster and More Efficiently than Previous Models

Researchers at Google have developed a new transformer model that can process long documents faster and more efficiently than previous models. Their paper, titled “CoLT5: Faster Long-Range Transformers with Conditional Computation,” describes a transformer model that uses conditional computation to devote more resources to important tokens in both the feedforward and attention layers.
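The core routing idea can be sketched in a few lines: a cheap scorer picks out the most important tokens, which then get an additional pass through a heavier branch, while every token goes through a light one. This is a toy illustration of conditional computation in a feedforward layer, not the actual CoLT5 implementation:

```python
import torch
import torch.nn as nn

# Toy conditional feedforward: all tokens go through a light branch,
# only the top-k highest-scoring tokens also get the heavy branch.
class ConditionalFFN(nn.Module):
    def __init__(self, d_model: int = 256, k: int = 16):
        super().__init__()
        self.k = k
        self.scorer = nn.Linear(d_model, 1)                  # cheap importance score
        self.light = nn.Linear(d_model, d_model)             # applied to every token
        self.heavy = nn.Sequential(                          # applied to important tokens only
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):                                    # x: (batch, seq_len, d_model)
        out = self.light(x)
        scores = self.scorer(x).squeeze(-1)                  # (batch, seq_len)
        top = scores.topk(self.k, dim=-1).indices            # indices of important tokens
        idx = top.unsqueeze(-1).expand(-1, -1, x.size(-1))
        selected = torch.gather(x, 1, idx)                   # gather the important tokens
        return out.scatter_add(1, idx, self.heavy(selected)) # add heavy output at those positions

x = torch.randn(2, 128, 256)
print(ConditionalFFN()(x).shape)   # torch.Size([2, 128, 256])
```

Because the expensive branch only runs on a small fraction of the sequence, the cost grows much more slowly with input length than applying the full feedforward block everywhere.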

CoLT5’s ability to effectively process long documents is particularly noteworthy, as previous transformer models struggled with the quadratic attention complexity and the need to apply feedforward and projection layers to every token. The researchers show that CoLT5 outperforms LongT5, the previous state-of-the-art long-input transformer model, on the SCROLLS benchmark, while also boasting much faster training and inference times.

Furthermore, the team demonstrated that CoLT5 can handle inputs up to 64k tokens in length with strong gains. These results suggest that CoLT5 has the potential to improve the efficiency and effectiveness of many natural language processing tasks that rely on long inputs.

Meta compares Brain to LLMs

Meta published an article comparing the behavior of the brain to that of large language models, showing important differences and similarities in how both predict text. The research group scanned 304 participants with functional magnetic resonance imaging to show how the brain predicts a hierarchy of representations spanning multiple timescales. They also showed that the activations of modern language models map linearly onto the brain's responses to speech.
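To make "map linearly" concrete: analyses like this typically fit a regularized linear regression from the model's hidden activations to the measured brain signal and then check how well it predicts held-out data. Here is a toy version with random placeholder data, not the study's actual pipeline:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Placeholder data standing in for real recordings:
# X = LLM activations per word/time window, y = fMRI response of one voxel.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 768))            # 1000 samples, 768-dim activations
true_w = rng.standard_normal(768) * 0.1
y = X @ true_w + rng.standard_normal(1000)      # synthetic "brain" signal with noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = Ridge(alpha=10.0).fit(X_train, y_train)
print("held-out R^2:", model.score(X_test, y_test))  # above 0 means the linear map predicts something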

New LLMs by Meta

Meta released four new large language models, ranging from 6.7B to 65.2B parameters. By following the Chinchilla scaling laws and using only publicly available data, they reached state-of-the-art performance with their biggest model, which is still significantly smaller than comparable models like GPT-3.5 or PaLM. Their smallest model is small enough to run on consumer hardware and is still comparable to GPT-3.
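For reference, the Chinchilla result is often summarized as a rule of thumb of roughly 20 training tokens per parameter, with total training compute of about 6 × parameters × tokens FLOPs. A quick sketch of what that implies for models in this size range (the 20-tokens-per-parameter ratio is an approximation, not an exact law):

```python
# Back-of-the-envelope Chinchilla arithmetic:
# compute-optimal tokens ~ 20 * parameters, training FLOPs ~ 6 * parameters * tokens.
for params in (6.7e9, 65.2e9):
    tokens = 20 * params
    flops = 6 * params * tokens
    print(f"{params / 1e9:.1f}B params -> ~{tokens / 1e12:.2f}T tokens, ~{flops:.2e} FLOPs")
```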

AI Art Generation: A Prime Example for Exponential Growth

I wanted to make this post for a while, as I am deeply invested in the development of AI image models, but things happened so fast.

It all started in January 2021 when OpenAI presented DALL-E, an AI model that was able to generate images from a text prompt. It did not get a lot of attention from the general public at the time because the pictures weren't that impressive. One year later, in April 2022, they followed up with DALL-E 2, a big step forward in resolution, quality, and coherence. But since nobody was able to use it themselves, the public did not talk about it much. Just one month later, Google presented its own model, Imagen, which was another step forward and was even able to generate consistent text in images.

It was stunning for people interested in the field, but it was just research. Three months later, DALL-E 2 opened its beta. A lot of news sites started to write articles about it since they were now able to experience it for themselves. But before it could become a bigger thing, Stability.ai released the open-source model Stable Diffusion to the general public. Instead of a few thousand people in the DALL-E beta, everybody was able to generate images now. That was just over a month ago. Since then, many people have taken Stable Diffusion, built GUIs for it, trained their own models for specific use cases, and contributed in every way possible. AI was even used to win an art contest.

The image that won the contest

People all around the globe were stunned by the technology. While many debated the pros and cons and enjoyed making art, many started to wonder about what would come next. After all, Stable Diffusion and DALL-E 2 had some weak points: the resolution was still limited, and faces, hands, and text were still a problem. Stability.ai released Stable Diffusion 1.5 in the same month as an improvement for faces and hands.

Many people thought that we might solve image generation later next year and that audio generation would be next. Maybe we would be able to generate videos in some form in the next decade. One week. It took one week until Meta released Make-A-Video on the 29th of September. The videos were just a few seconds long, low resolution, and low quality. But everybody who followed the development of image generation could see that it would follow the same path and become better over the next few months. Two hours. Two hours later, Phenaki was presented, which was able to generate minute-long videos based on longer descriptions of entire scenes. Just yesterday, Google presented Imagen Video, which can generate higher-resolution videos. Stability.ai also announced that they will release an open-source text-to-video model, which will most likely have the same impact as Stable Diffusion did. The next model has likely already been released by the time you read this. It is hard to keep up these days.

I want to address some concerns regarding AI image generation, since I have seen a lot of fear and hate directed at the people who develop this technology, the people who use it, and the technology itself. It is not true that the models just throw together what artists did in the past. While it is true that art was used to train these models, that does not mean that they simply copy. The models work by looking at many images of the same subject to abstract what the subject is about and remember the core idea; this is why a model is only about 4 GB in size. Many people argue that the models copy watermarks and signatures. This does not happen because the AI copies an image; it happens because the AI thinks the watermark is part of the requested subject. If every dog you ever saw in your life had a red collar, you would draw a dog with a red collar, not because you are copying another dog picture, but because you think it is part of the dog. The model is far too small to store its individual training images. I have seen too many people spreading this false information to discredit AI art.

The next argument I see a lot is that AI art is soulless and requires no effort and is therefore worthless. I am not an artist myself, but I consider myself an art enjoyer. It does not matter to me how much time it took to make something as long as I enjoy it. Saying something is better or worse because of the way it was made sounds strange to me. Many people simply use these models to generate pictures, but there is also a group of already talented digital artists who use these models to speed up their creative process. They use them in many creative ways, combining inpainting with other digital tools to produce even greater art. Calling all of these artists fakes and dismissing their art as not “real” is something that upsets me.

The last argument is copyright. I will ignore the copyright implications for the output, since my last point made that quite clear. The more difficult discussion is about the training input. While I think that companies should be allowed to use all available data to train their models, I can see that some people think differently. Right now it is allowed, but I expect some countries to adopt laws that address this technology. For anybody interested in AI art, I recommend lexica.art if you want to see some examples, and if you want to generate your own, https://beta.dreamstudio.ai/dream is a good starting point. I used it myself to generate the last few images for this blog.
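If you would rather run a model locally than use a web service, the Hugging Face diffusers library makes it fairly painless on a machine with a recent GPU. A minimal sketch, where the model id and settings are just one common choice and not the only option:

```python
import torch
from diffusers import StableDiffusionPipeline

# Download and run a Stable Diffusion v1.5 checkpoint (one commonly used option).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")   # needs an Nvidia GPU with a few GB of VRAM

image = pipe("a lighthouse on a cliff at sunset, digital art", num_inference_steps=30).images[0]
image.save("lighthouse.png")
```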

Text-to-image and text-to-video are fields that have developed incredibly fast over the last few months. We will see developments like this in more and more areas the closer we get to the singularity. There are some fields I ignored in this post that are making similar leaps in the same direction, for example audio generation and 2D-to-3D. The entire field of machine learning research is growing exponentially.

Number of ML-related papers per month

The next big thing will be language models. I missed the chance to talk about Google's “sentient” AI when it was big in the news, but I am sure that with the release of GPT-4 in the next few months, the topic will become even more present in public discussions.

The Metaverse part 1: VR Hardware

I decided to split the metaverse blog post into a mini-series since the topic is so broad that when I tried to put everything into one post, I simply failed.

We start with the currently most relevant part: VR Hardware.

VR is one of the two technologies that will be the platforms for the metaverse soon. Arguably not the most important one, but the one that will be available first. 

2023 will be a big year for VR. We will see some new VR devices from Meta, Apple, Pico, and others. Some of these new devices will tackle the most important problems for VR hardware. 

The problem with existing VR devices, like the Meta Quest, is that you cannot use them for extended periods, and it is not a pleasant experience at all. They are too heavy, and they cause eye strain. Movement in VR leads to nausea, and the ways to interact with VR are limited. On top of that, the visuals themselves are far from reality.

Some of these problems will be fixed this year. Each new headset is lighter than the last, and Apple's VR headset is supposed to have a much higher resolution than most currently available headsets thanks to Apple silicon. Eye tracking is coming in Meta's next headset and in many others, which will help with performance and resolution and will give us new ways to interact.

Some other problems like contrast, adaptive depth, distortion, and field of view are harder to fix and will take some time, but Mark Zuckerberg recently showed some prototypes that tackle some of these problems too.

Mark Zuckerberg presents Meta's prototypes

Most of these solutions require huge amounts of computing power, especially higher resolutions. Standalone headsets will not be able to perform fast enough, at least not in the next year. I think Apple is the most likely to bring a good visual experience to a standalone headset thanks to Apple silicon, but their first model, which is expected to launch in January 2023, will not fix all the existing visual problems. Even PC VR is still limited by the data rates of cables and wireless transmission. We need at least Wi-Fi 6 to reach a point where wireless transmission is viable for realistic-looking VR experiences.

The problem with nausea will become less severe with improved visuals, but as long as we use a controller to move, the problem persists. I do not think omnidirectional treadmills are the way to go: they are too expensive, and most people do not want to spend that much space, money, and energy on their free time. Some applications use teleporting or walking in place to move, and many other solutions are currently being tested.

While treadmills are not likely to become a standard accessory, full-body tracking will be. The difference in immersion with full-body tracking is huge, and it gives VR another important input tool. Cheap full-body tracking solutions like SlimeVR will become better and better and will give us realistic bodies in VR.

The already mentioned eye tracking is another step in immersion that will be important for social VR. Being able to look someone in the eyes and read their facial expressions is a core element of human interaction, and we are sensitive to strange facial movements. But eye tracking can do even more: it improves performance by limiting the resolution in areas we are not looking at, and it serves as an input device. We can look at objects and control elements, and the software can infer what we want to touch or click, which will remove frustrating moments like not being able to hit the right button because of imprecise hand tracking.

This brings me to my last point: hand tracking. It is arguably part of full-body tracking, but it deserves its own mention since hands are our primary way to interact with VR. Realistic and precise hand tracking is one of the most important aspects of immersion.

Perfect Virtual Hands – But At A Cost! 👐

Near-Perfect Virtual Hands For Virtual Reality! 👐

This AI Creates Virtual Fingers! 🤝

These videos show some of the key papers on hand tracking published in the last two years. These papers are the foundation of Meta's hand tracking, and the technology will most likely continue to improve in the coming years.

If we look at the current development of headset sales, it looks pretty good.

Headsets sold per year

And the number of headsets used for gaming every month is a good indicator for this upcoming billion-dollar entertainment industry.

Actively used headsets on Steam

I think we will see an even greater wave of people getting into VR in the next two years. Not just for gaming: with Apple joining the market, we will also see growth in areas like education and industry.

In the end, I want to take a short look into the far future of VR. I am talking about 5-10 years from now, probably after a technological singularity. The final goal of VR is full dive: the ability to simulate all five senses directly within the brain and to intercept all outputs from our brain to paralyze our body and redirect all movement into virtual reality. I will not talk about the implications for society; that is a topic for another time. But from a pure hardware perspective, this is extremely challenging. While reading the output of the brain is an area where we are currently making a lot of progress, intercepting the signals to prevent our body from moving is not possible right now without a lot of medical expertise and long-lasting effects. Sending signals for all senses directly into the brain is even harder, since every brain is different. I do not think we will be able to do this without an AGI, but if in the far future a machine overlord decides to put us all in our own matrix, it will hopefully be heaven and not hell.
