PeriFlow emerges from our pioneering research and extensive hands-on experience in orchestrating generative AI models and workloads. Our expertise spans a range of techniques, including multi-level optimization, scheduling strategies, and innovative batching methods. Notably, our proprietary batching technology is protected by patents in both the United States and Korea.
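To give a flavor of why batching strategy matters for serving LLMs, the sketch below illustrates the general idea of iteration-level (continuous) batching in Python: the scheduler re-forms the batch after every decoding step, so finished requests free their slots immediately and queued requests join without waiting for the whole batch to drain. This is a minimal illustration with hypothetical names (`Request`, `serve`, `step_batch`) and a dummy model step standing in for a real forward pass; it is not PeriFlow's patented implementation.

```python
import random
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: int
    max_new_tokens: int
    generated: list = field(default_factory=list)

def step_batch(batch):
    """One decoding iteration for every request in the batch.
    Stands in for a single model forward pass; here each request
    simply receives a random token id."""
    for req in batch:
        req.generated.append(random.randint(0, 50_000))

def is_finished(req):
    """A request finishes when it hits its token budget
    (EOS handling is omitted for brevity)."""
    return len(req.generated) >= req.max_new_tokens

def serve(requests, max_batch_size=4):
    """Iteration-level scheduling: after every decoding step,
    finished requests leave the batch at once and queued requests
    join, instead of waiting for the whole batch to complete."""
    queue = deque(requests)
    batch = []
    while queue or batch:
        # Admit queued requests up to the batch-size budget.
        while queue and len(batch) < max_batch_size:
            batch.append(queue.popleft())
        step_batch(batch)  # generate one token per request
        # Retire finished requests right away, freeing their slots.
        for req in batch:
            if is_finished(req):
                print(f"request {req.rid} done after "
                      f"{len(req.generated)} tokens")
        batch = [req for req in batch if not is_finished(req)]

if __name__ == "__main__":
    reqs = [Request(rid=i, max_new_tokens=random.randint(2, 8))
            for i in range(10)]
    serve(reqs)
```

Because generation lengths vary widely across requests, reclaiming a slot the moment a sequence finishes keeps the accelerator busy, which is why this family of techniques improves throughput without hurting latency.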
PeriFlow is a versatile serving platform compatible with a wide array of generative AI models. In particular, Large Language Models (LLMs) for generative tasks, such as ChatGPT, GPT-3, PaLM, OPT, BLOOM, and LLaMA, have become indispensable across applications ranging from chatbots and translation services to summarization, code generation, and caption creation. Deploying and managing these models, however, has traditionally carried a substantial cost and operational burden.
PeriFlow distinguishes itself by outperforming NVIDIA Triton paired with FasterTransformer in both latency and throughput. This advantage holds across LLM sizes ranging from 1.3 billion to 341 billion parameters. For example, when serving a GPT-3 model with 175 billion parameters, PeriFlow delivers a 10-fold increase in throughput at the same level of latency.