FlexGen is a generation engine that enables high-throughput inference of large language models on a single commodity GPU. It uses a linear programming optimizer to search for efficient patterns for storing and accessing tensors across GPU, CPU, and disk memory, and it compresses both the weights and the attention cache to 4 bits with negligible accuracy loss. FlexGen achieves significantly higher throughput than state-of-the-art offloading systems, reaching a generation throughput of 1 token/s with an effective batch size of 144 when running OPT-175B on a single 16GB GPU. This could make running LLMs on modest hardware viable for many more companies and individuals.
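To make the 4-bit compression idea concrete, here is a minimal sketch of group-wise quantization in NumPy. The group size, rounding scheme, and storage format are illustrative assumptions and do not reproduce FlexGen's actual kernels (a real implementation would also pack two 4-bit values into each byte rather than using one byte per value):

```python
import numpy as np

def quantize_4bit(x: np.ndarray, group_size: int = 64):
    """Group-wise asymmetric 4-bit quantization (illustrative sketch).

    Each group of `group_size` values is mapped to integers in [0, 15]
    using its own min/max, so outliers in one group do not degrade
    precision elsewhere. Assumes x.size is divisible by group_size.
    """
    flat = x.reshape(-1, group_size)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    # 4 bits give 16 levels; guard against zero range within a group.
    scale = np.maximum((hi - lo) / 15.0, 1e-8)
    q = np.clip(np.round((flat - lo) / scale), 0, 15).astype(np.uint8)
    return q, scale, lo

def dequantize_4bit(q, scale, lo, shape):
    """Reconstruct an approximate float tensor from quantized groups."""
    return (q.astype(np.float32) * scale + lo).reshape(shape)

# Quantize a weight matrix and check the reconstruction error.
w = np.random.randn(1024, 1024).astype(np.float32)
q, scale, lo = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale, lo, w.shape)
print("max abs error:", np.abs(w - w_hat).max())
```

The same idea applies to the attention (KV) cache: quantizing per small group keeps the error local, which is why accuracy loss stays negligible even at 4 bits.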