Li-Wen's Webpage

Selected Publications

Proceeding

DITRON: Distributed Multi-level Tiling Compiler for Parallel Tensor Programs
Size Zheng, Xuegui Zheng, Hanshi Sun, Qi Hou, Wenlei Bao, Shiyu Li, Haojie Duanmu, Jin Fang, Chenli Xue, Chenhui Huang, Yuanqiang Liu, Renze Chen, Ningxin Zheng, Dongyang Wang, Li-Wen Chang, Liqiang Lu, Yun Liang, Jidong Zhai, and Xin Liu
International Conference on Machine Learning, 2026 (ICML 2026)

CoCoQuant: Breaking the Bandwidth Wall via Co-Optimized Communication and Computation Quantization
Haojie Duanmu, Jifeng Ding, Size Zheng, Xuegui Zheng, Jiangfei Duan, Xingcheng Zhang, Li-Wen Chang, Xin Liu, and Dahua Lin
International Conference on Machine Learning, 2026 (ICML 2026)

UniEP: Unified Expert-Parallel MegaKernel MoE for LLM Training
Size Zheng, Xuegui Zheng, Li-Wen Chang, and Jidong Zhai
International Symposium on High-Performance Parallel and Distributed Computing, 2026 (HPDC 2026)

Tetris: Efficient Long-context LLM Serving with Chunkwise Dynamic Sequence Parallelism
Cong Li, Yuzhe Yang, Xuegui Zheng, Qifan Yang, Yijin Guan, Size Zheng, Li-Wen Chang, Shufan Liu, Xin Liu, and Guangyu Sun
International Symposium on Computer Architecture, 2026 (ISCA 2026)

Charon: A Unified and Fine-Grained Simulator for Large-Scale LLM Training and Inference
Mengtian Yang, Zhekun Zhang, Mingheng Wu, Jianwen Yan, Hanshi Sun, and Li-Wen Chang
Conference on Machine Learning and Systems, 2026 (MLSys 2026)

FlexLinearAttention: Compiling a Unified Abstraction into Scalable Kernels for Linear Attention (openreview)
Haojie Duanmu, Size Zheng, Ningxin Zheng, Jianqiao Lu, Xuegui Zheng, Xingcheng Zhang, Li-Wen Chang, Xin Liu, and Dahua Lin
The Fourteenth International Conference on Learning Representations, 2026 (ICLR 2026)

R-KV: Redundancy-aware KV Cache Compression for Reasoning Models (arXiv)
Zefan Cai, Wen Xiao, Hanshi Sun, Cheng Luo, Yikai Zhang, Ke Wan, Yucheng Li, Yeyang Zhou, Li-Wen Chang, Jiuxiang Gu, Zhen Dong, Anima Anandkumar, Abedelkadir Asi, and Junjie Hu
The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025 (NeurIPS 2025)

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference (arXiv)
Hanshi Sun, Li-Wen Chang, Wenlei Bao, Size Zheng, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, and Beidi Chen
International Conference on Machine Learning, 2025 (ICML 2025)

MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism (arXiv)
Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, Jianzhe Xiao, Xinyi Zhang, Lingjun Liu, Haibin Lin, Li-Wen Chang, Jianxi Ye, Xiao Yu, Xuanzhe Liu, Xin Jin, and Xin Liu
Proceedings of the ACM SIGCOMM 2025 Conference (SIGCOMM 2025)

Qtenon: Towards Low-Latency Architecture Integration for Accelerating Hybrid Quantum-Classical Computing
Chenning Tao, Liqiang Lu, Size Zheng, Li-Wen Chang, Minghua Shen, Hanyu Zhang, Fangxin Liu, Kaiwen Zhou, and Jianwei Yin
International Symposium on Computer Architecture, 2025 (ISCA 2025)

TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives (arXiv)
Size Zheng, Jin Fang, Xuegui Zheng, Qi Hou, Wenlei Bao, Ningxin Zheng, Ziheng Jiang, Dongyang Wang, Jianxi Ye, Haibin Lin, Li-Wen Chang, and Xin Liu
Conference on Machine Learning and Systems, 2025 (MLSys 2025)

Comet: Fine-grained Computation-communication Overlapping for Mixture-of-Experts (arXiv)
Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, Quan Chen, and Xin Liu
Conference on Machine Learning and Systems, 2025 (MLSys 2025)

Collaborative Computing on Heterogeneous CPU-FPGA Architectures Using OpenCL
Sitao Huang, Li-Wen Chang, Izzat El Hajj, Simon Garcia de Gonzalo, Juan Gómez Luna, Sai Rahul Chalamalasetti, Mohamed El-Hadedy, Dejan Milojicic, Onur Mutlu, Deming Chen and Wen-Mei Hwu
ACM/SPEC International Conference on Performance Engineering, 2019 (ICPE 2019)

Accelerating Recurrent Neural Networks through Compiler Techniques and Quantization
Li-Wen Chang, Yang Chen, Wenlei Bao, Amit Agarwal, Eldar Akchurin, Ke Deng and Emad Barsoum
Workshop on Systems for ML at NeurIPS, 2018

Collaborative Computing for Heterogeneous Integrated Systems (doi)
Li-Wen Chang, Juan Gómez-Luna, Izzat El Hajj, Sitao Huang, Deming Chen and Wen-mei W. Hwu
ACM/SPEC International Conference on Performance Engineering, 2017 (ICPE 2017) (conference h5-index = 21)

Chai: Collaborative Heterogeneous Applications for Integrated-architectures
Juan Gómez-Luna, Izzat El Hajj, Li-Wen Chang, Victor Garcia-Flores, Simon Garcia de Gonzalo, Thomas B. Jablin, Antonio J. Peña and Wen-Mei Hwu
IEEE International Symposium on Performance Analysis of Systems and Software, 2017 (ISPASS 2017) (conference h5-index = 24, acceptance rate: 24/81 = 29.6%)

Efficient Kernel Synthesis for Performance Portable Programming (doi)
Li-Wen Chang, Izzat El Hajj, Christopher Rodrigues, Juan Gómez-Luna and Wen-mei W. Hwu
Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016 (MICRO-49) (conference h5-index = 39, acceptance rate: 61/283 = 21.6%)

KLAP: Kernel Launch Aggregation and Promotion for Optimizing Dynamic Parallelism (doi)
Izzat El Hajj, Juan Gómez-Luna, Cheng Li, Li-Wen Chang, Dejan Milojicic and Wen-mei W. Hwu
Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, 2016 (MICRO-49) (conference h5-index = 39, acceptance rate: 61/283 = 21.6%)

DySel: Lightweight Dynamic Selection for Kernel-based Data-parallel Programming Model (doi)
Li-Wen Chang, Hee-Seok Kim, and Wen-mei W. Hwu
Proceedings of the 21st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2016) (conference h5-index = 51, acceptance rate: 53/232 = 22.8%)

A Programming System for Future Proofing Performance Critical Libraries (doi)
Li-Wen Chang, Izzat El Hajj, Hee-Seok Kim, Juan Gómez-Luna, Abdul Dakkak and Wen-mei W. Hwu
Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2016) (conference h5-index = 34)

In-Place Data Sliding Algorithms for Many-Core Architectures (doi)
Juan Gómez Luna, Li-Wen Chang, I-Jui Sung, Nicolás Guil Mata and Wen-Mei Hwu
International Conference on Parallel Processing, 2015 (ICPP 2015) (conference h5-index = 22, acceptance rate: 99/305 = 32.5%)

Adaptive Cache Management for Energy-efficient GPU Computing (doi)
X. Chen, L.-W. Chang, C. I. Rodrigues, J. Lv, Z. Wang, and W.-m. W. Hwu
Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014 (MICRO-47) (conference h5-index = 40, acceptance rate: 53/273 = 19.4%)

A Scalable, Numerically Stable, High-performance Tridiagonal Solver using GPUs (doi, code)
Li-Wen Chang, John A. Stratton, Hee-Seok Kim, and Wen-mei W. Hwu
The International Conference for High Performance Computing, Networking, Storage and Analysis, 2012 (SC 2012) (conference h5-index = 46, acceptance rate: 100/472 = 21.2%)

Optimization and Architecture Effects on GPU Computing Workload Performance (doi)
J. A. Stratton, N. Anssari, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, G. Liu, W.-m. Hwu
Innovative Parallel Computing, 2012 (Voted #2 Best Paper Finalist)

A Tiling-Scheme Viterbi Decoder in Software-Defined Radio for GPUs (doi)
C.-S. Lin, W.-L. Liu, W.-T. Yeh, L.-W. Chang, W.-M. Hwu, S.-J. Chen, and P.-A. Hsiung
The 7th International Conference on Wireless Communications, Networking and Mobile Computing, pp. 1-4, 2011 (conference h5-index = 17)

A Scalable Tridiagonal Solver for GPUs (doi)
Hee-Seok Kim, Shengzhao Wu, Li-Wen Chang, Wen-mei W. Hwu
International Conference on Parallel Processing, pp. 444-453, 2011 (ICPP 2011) (conference h5-index = 22, acceptance rate: 81/363 = 22.3%)

Parallel Implementation of Multi-Dimensional Ensemble Empirical Mode Decomposition (doi, code)
L.-W. Chang, M.-T. Lo, N. Anssari, K.-H. Hsu, N. Huang, W.-m. W. Hwu
International Conference on Acoustics, Speech, and Signal Processing, 2011 (ICASSP 2011) (conference h5-index = 47)

GPU-Based Color Doppler Ultrasound Processing (doi)
L.-W. Chang, K.-H. Hsu, P.-C. Li
International Ultrasonics Symposium (IUS), 2009 (conference h5-index = 14)

Depth Detection of Light Field (doi)
Yi-Hao Kao, Chia-Kai Liang, Li-Wen Chang, Homer H. Chen
International Conference on Acoustics, Speech, and Signal Processing, 2007 (ICASSP 2007) (conference h5-index = 47)

Technical Report

Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler (arXiv)
Size Zheng, Wenlei Bao, Qi Hou, Xuegui Zheng, Jin Fang, Chenhui Huang, Tianqi Li, Haojie Duanmu, Renze Chen, Ruifan Xu, Yifan Guo, Ningxin Zheng, Ziheng Jiang, Xinyi Di, Dongyang Wang, Jianxi Ye, Haibin Lin, Li-Wen Chang, Liqiang Lu, Yun Liang, Jidong Zhai, and Xin Liu
arXiv.org, 2025

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion (arXiv)
Li-Wen Chang, Wenlei Bao, Qi Hou, Chengquan Jiang, Ningxin Zheng, Yinmin Zhong, Xuanrun Zhang, Zuquan Song, Chengji Yao, Ziheng Jiang, Haibin Lin, Xin Jin, and Xin Liu
arXiv.org, 2024

NGEMM: Optimizing GEMM for Deep Learning via Compiler-based Techniques (arXiv)
Wenlei Bao, Li-Wen Chang, Yang Chen, Ke Deng, Amit Agarwal, Emad Barsoum, and Abe Taha
arXiv.org, 2019

Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing (pdf, code)
John A. Stratton, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Li-Wen Chang, Nasser Anssari, Geng Daniel Liu, Wen-mei W. Hwu
IMPACT Technical Report, University of Illinois at Urbana-Champaign, 2012

Invited Talk

Open Source Adoption Lessons and Improvements for ML Compiler in Production
The 4th C4ML Workshop at CGO 2023

High-performance Linear Recurrence, and Its Applications
The 1st International Workshop on Computational Science and Engineering, 2013

A Scalable, Numerically Stable, High-performance Tridiagonal Solver for GPUs
GPU Technology Conference (GTC), 2013

A Scalable Tridiagonal Solver for GPUs
Private talk, INRIA, 2011

Parallel Empirical Mode Decomposition for GPUs
The HHT'3 workshop tutorial, 2011

Journal

In-Place Matrix Transposition on GPUs (doi)
J. Gómez-Luna, I.-J. Sung, L.-W. Chang, J. M. González-Linares, N. Guil and W.-m. Hwu
IEEE TPDS, 27(3), Mar. 2015 (journal h5-index = 76, Impact Factor = 2.173)

Algorithm and Data Optimization Techniques for Scaling to Massively Threaded Systems (doi)
J. A. Stratton, C. Rodrigues, I.-J. Sung, L.-W. Chang, N. Anssari, G. D. Liu, W.-m. W. Hwu, N. Obeid
IEEE Computer, 45(8), Aug. 2012 (Impact Factor = 1.438)

Graphics Processing Unit-Based High-Frame-Rate Color Doppler Ultrasound Processing (doi)
Li-Wen Chang, Ke-Hsin Hsu, Pai-Chi Li
IEEE TUFFC, 56(9), Sept. 2009 (journal h5-index = 35, Impact Factor = 1.503)

Analysis and Compensation of Rolling Shutter Effect (doi)
Chia-Kai Liang, Li-Wen Chang, Homer H. Chen
IEEE TIP, 17(8), Aug. 2008 (journal h5-index = 75, Impact Factor = 3.111)

Book Chapter

Parallel Patterns: Prefix Sum
Li-Wen Chang, Juan Gómez-Luna, David B. Kirk, and Wen-mei W. Hwu
Programming Massively Parallel Processors: A Hands-on Approach, Ch. 8, 2016

Parallel Patterns: Merge Sort
Li-Wen Chang, Jie Lv, David B. Kirk, and Wen-mei W. Hwu
Programming Massively Parallel Processors: A Hands-on Approach, Ch. 11, 2016

A Guide for Implementing Tridiagonal Solvers on GPUs (doi)
Li-Wen Chang and Wen-mei W. Hwu
Numerical Computations with GPUs, Ch. 2, 2014

Thesis

Toward Performance Portability for CPUs and GPUs through Algorithmic Compositions
Ph.D. Dissertation, ECE UIUC, 2017

Scalable Parallel Tridiagonal Algorithms with Diagonal Pivoting and Their Optimization for Many-core Architectures (pdf)
MS Thesis, ECE UIUC, 2014

