Preprint

A System for Microserving of LLMs

The recent advances in LLMs bring a strong demand for efficient system support to improve overall serving efficiency. As LLM inference …

Hongyi Jin, Ruihang Lai, Charlie F. Ruan, Yingcheng Wang, Todd Mowry, Xupeng Miao, Zhihao Jia, Tianqi Chen

WebLLM: A High-Performance In-Browser LLM Inference Engine

Advancements in large language models (LLMs) have unlocked remarkable capabilities across various domains. However, deploying these …

Charlie F. Ruan, Yucheng Qin, Xun Zhou, Ruihang Lai, Hongyi Jin, Yixin Dong, Bohan Hou, Meng-Shiun Yu, Yiyan Zhai, Sudeep Agarwal, Hangrui Cao, Siyuan Feng, Tianqi Chen

XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models

The applications of LLM Agents are becoming increasingly complex and diverse, leading to a high demand for structured outputs that can …

Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, Tianqi Chen

Local deployment of large-scale music AI models on commodity hardware

We present MIDInfinite, a web application capable of generating symbolic music using a large-scale generative AI model locally on …

Xun Zhou, Charlie F. Ruan, Zihe Zhao, Tianqi Chen, Chris Donahue

Emerging Platforms Meet Emerging LLMs: A Year-Long Journey of Top-Down Development

Deploying machine learning (ML) on diverse computing platforms is crucial to accelerate and broaden their applications. However, it …

Siyuan Feng, Jiawei Liu, Ruihang Lai, Charlie F. Ruan, Yong Yu, Lingming Zhang, Tianqi Chen