Advancements in large language models (LLMs) have unlocked remarkable capabilities across various domains. However, deploying these models typically requires server-grade GPUs and cloud-based inference. The recent emergence of smaller open-source models and increasingly powerful consumer devices has made on-device deployment a viable alternative. The web browser in particular is a compelling platform for on-device deployment: it is universally accessible, provides a natural agentic environment, and conveniently abstracts away the heterogeneous backends of diverse device vendors. To seize this opportunity, we introduce WebLLM, an open-source JavaScript framework that enables high-performance LLM inference entirely within web browsers. WebLLM provides an OpenAI-style API for seamless integration into web applications, and leverages WebGPU for efficient local GPU acceleration and WebAssembly for performant CPU computation. By utilizing the machine learning compilers MLC-LLM and Apache TVM, WebLLM generates optimized WebGPU kernels, overcoming the absence of performant GPU libraries in browsers. Its architecture adapts to the browser’s runtime environment through web workers, keeping the user interface smooth and uninterrupted. Evaluations show that WebLLM retains up to 80% of the performance of native deployment on the same device, with the potential to further close the gap. This work paves the way for universally accessible, privacy-preserving, personalized, and locally powered LLM applications in web browsers.
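
To make the OpenAI-style API concrete, the following TypeScript sketch shows what in-browser inference with WebLLM roughly looks like. The import path follows the public @mlc-ai/web-llm npm package, and the model identifier and callback are illustrative assumptions rather than a prescribed configuration.

```ts
// Minimal sketch of WebLLM usage in a browser context.
// The model ID below is illustrative; any model from the WebLLM model list works similarly.
import { CreateMLCEngine } from "@mlc-ai/web-llm";

async function run() {
  // First call downloads model weights and compiled WebGPU kernels,
  // then caches them locally for subsequent page loads.
  const engine = await CreateMLCEngine("Llama-3.1-8B-Instruct-q4f32_1-MLC", {
    initProgressCallback: (report) => console.log(report.text),
  });

  // OpenAI-style chat completion, executed entirely on the local device.
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Explain WebGPU in one sentence." }],
  });
  console.log(reply.choices[0].message.content);
}

run();
```

In a real application, the engine would typically be constructed inside a web worker (or service worker) so that model loading and token generation do not block the main UI thread, matching the architecture described above.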