My research interests include GenAI, MLSys, heterogeneous computing, and compiler optimization.
Research Experience
June 2021–Present
ByteDance Seed/AML
ML Training & Inference Acceleration, Automation, Architecture (AAA)
Accelerate GenAI training and inference with system- and compiler-level optimizations.
Build production tools that automate performance tuning, scaling, and deployment.
Design advanced architectures and runtimes for efficient large-model execution.
Triton-distributed: Distributed Triton for Parallel Systems (github, arXiv) Enables ergonomic kernel programming across distributed systems.
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference (github, arXiv) Shrinks the GPU-resident KV-cache footprint to sustain high throughput in long-context LLM inference.
veScale: a PyTorch native LLM training framework (github) Simplifies scaling single-device PyTorch models to distributed training.
Flux: a fast communication-overlapping library for tensor parallelism on GPUs (github, arXiv, arXiv) Overlaps communication and compute for tensor parallelism.
ByteIR: a model compilation solution for various hardware (github, website) Compiler tooling for high-performance training and inference across CPUs, GPUs, and ASICs.
ByteMLPerf: an AI accelerator benchmarking tool (github, website) Benchmarks AI accelerators with a production-oriented focus on usability and versatility.
AI-for-code systems focused on productivity and reliability.
Large-model performance estimation and prediction tooling.
Distributed heterogeneous system prototypes for training and inference.
A programming-language and compiler research initiative.
Communication synthesis for scalable model parallelism.
Computer-architecture studies for AI workloads.
Acceleration projects spanning GPUs and AI ASICs.
Additional systems and infrastructure projects.
July 2017–June 2021
Microsoft
High-performance AI Inference Engine
Built a compiler-based inference engine focused on high performance and portability.
NGEMM: Optimizing GEMM for Deep Learning via Compiler-based Techniques (arXiv) Wenlei Bao, Li-Wen Chang, Yang Chen, Ke Deng, Amit Agarwal, Emad Barsoum, and Abe Taha arXiv.org, 2019
Accelerating Recurrent Neural Networks through Compiler Techniques and Quantization Li-Wen Chang, Yang Chen, Wenlei Bao, Amit Agarwal, Eldar Akchurin, Ke Deng and Emad Barsoum Workshop on Systems for ML at NeurIPS, 2018
Fall 2012–Fall 2017
UIUC
TANGRAM – High-level Language for Heterogeneous Computing
Developed a high-level, performance-portable language for CPUs, GPUs, FPGAs, and clusters, targeting near–vendor-library performance from a single source.
Efficient Kernel Synthesis for Performance Portable Programming (doi) Li-Wen Chang, Izzat El Hajj, Christopher Rodrigues, Juan Gómez-Luna and Wen-mei W. Hwu MICRO, 2016
DySel: Lightweight Dynamic Selection for Kernel-based Data-Parallel Programming Model (doi) Li-Wen Chang, Hee-Seok Kim and Wen-mei W. Hwu ASPLOS, 2016
A Programming System for Future Proofing Performance Critical Libraries (doi) Li-Wen Chang, Izzat El Hajj, Hee-Seok Kim, Juan Gómez-Luna, Abdul Dakkak and Wen-mei W. Hwu PPoPP, 2016 (short paper)
Toward Application Performance Portability for Heterogeneous Computing Li-Wen Chang, Hee-Seok Kim, and Wen-mei Hwu TECHCON, 2015
Transitioning HPC Software to Exascale Heterogeneous Computing (doi) Wen-mei Hwu, Li-Wen Chang, Hee-Seok Kim, Abdul Dakkak, and Izzat El Hajj Computational Electromagnetics International Workshop (CEM), 2015
Tangram: a High-level Language for Performance Portable Code Synthesis (pdf) Li-Wen Chang, Abdul Dakkak, Christopher I Rodrigues, and Wen-mei Hwu Eighth Workshop on Programmability Issues for Heterogeneous Multicores (MULTIPROG), 2015
Fall 2015–Fall 2017
UIUC
Heterogeneous Benchmark Suite and Characteristics Study
Built benchmark suites to study heterogeneous architectures and computation patterns.
Collaborative Computing on Heterogeneous CPU-FPGA Architectures Using OpenCL Sitao Huang, Li-Wen Chang, Izzat El Hajj, Simon Garcia de Gonzalo, Juan Gómez Luna, Sai Rahul Chalamalasetti, Mohamed El-Hadedy, Dejan Milojicic, Onur Mutlu, Deming Chen and Wen-Mei Hwu ICPE, 2019
Collaborative Computing for Heterogeneous Integrated Systems (doi) Li-Wen Chang, Juan Gómez-Luna, Izzat El Hajj, Sitao Huang, Deming Chen and Wen-mei W. Hwu ICPE, 2017
Chai: Collaborative Heterogeneous Applications for Integrated-architectures Juan Gómez-Luna, Izzat El Hajj, Li-Wen Chang, Victor Garcia-Flores, Simon Garcia de Gonzalo, Thomas B. Jablin, Antonio J. Peña and Wen-Mei Hwu ISPASS, 2017
Fall 2015–Fall 2017
UIUC
Dynamic Parallelism Optimizations for GPUs
Developed GPU dynamic-parallelism optimizations and a compiler to automate them, delivering substantial speedups.
KLAP: Kernel Launch Aggregation and Promotion for Optimizing Dynamic Parallelism (doi) Izzat El Hajj, Juan Gómez-Luna, Cheng Li, Li-Wen Chang, Dejan Milojicic and Wen-mei W. Hwu MICRO, 2016
Fall 2014–Summer 2017
UIUC
GPU Data Sliding
Studied algorithms that reduce global-memory traffic for data layout transformations and relational algebra.
In-Place Data Sliding Algorithms for Many-Core Architectures (doi) Juan Gómez-Luna, Li-Wen Chang, I-Jui Sung, Nicolás Guil Mata and Wen-Mei Hwu International Conference on Parallel Processing (ICPP), 2015
In-Place Matrix Transposition on GPUs (doi) J. Gómez-Luna, I.-J. Sung, L.-W. Chang, J. M. González-Linares, N. Guil and W.-m. Hwu IEEE Transactions on Parallel and Distributed Systems, 2015
Fall 2009–Summer 2017
UIUC
Accelerator Benchmark Suites (Parboil and SPEC ACCEL) and Performance Study
Built benchmark suites to characterize GPU architectures, optimizations, and compiler transformations, and to study performance portability. (code)
Algorithm and Data Optimization Techniques for Scaling to Massively Threaded Systems (doi) J. A. Stratton, C. Rodrigues, I.-J. Sung, L.-W. Chang, N. Anssari, G. D. Liu, W.-m. W. Hwu and N. Obeid IEEE Computer, 2012
Optimization and Architecture Effects on GPU Computing Workload Performance (doi) J. A. Stratton, N. Anssari, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, G. Liu, W.-m. Hwu Innovative Parallel Computing, 2012 (voted #2 Best Paper Finalist)
Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing (pdf) J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu and W.-m. W. Hwu IMPACT Technical Report, University of Illinois at Urbana-Champaign, 2012 (cited more than 300 times)
Fall 2012–Fall 2014
UIUC
GPU Cache and Scheduler Design
Characterized cache sensitivity and explored cache bypass and thread throttling strategies.
Adaptive Cache Management for Energy-efficient GPU Computing (doi) X. Chen, L.-W. Chang, C. I. Rodrigues, J. Lv, Z. Wang and W.-m. W. Hwu MICRO, 2014
Adaptive Cache Bypass and Insertion for Many-core Accelerators (doi) Xuhao Chen, Shengzhao Wu, Li-Wen Chang, Wei-Sheng Huang, Carl Pearson, Zhiying Wang, and Wen-mei W Hwu Proceedings of International Workshop on Manycore Embedded Systems, 2014
Fall 2010–Summer 2014
UIUC
GPU Tridiagonal Solver Library
Built the first GPU tridiagonal solver with pivoting, later included as gtsv in NVIDIA cuSPARSE 5.5+. (code)
A Guide for Implementing Tridiagonal Solvers on GPUs (doi) Li-Wen Chang, and Wen-mei W. Hwu Numerical Computations with GPUs, 2015
Scalable Parallel Tridiagonal Algorithms with Diagonal Pivoting and Their Optimization for Many-core Architectures (pdf) L.-W. Chang M.S. Thesis, 2014
Mapping Tridiagonal Solvers to Linear Recurrences (pdf) Li-Wen Chang, and Wen-mei W. Hwu IMPACT Technical Report, University of Illinois at Urbana-Champaign, 2013
A Scalable, Numerically Stable High-Performance Tridiagonal Solver using GPUs (doi) L.-W. Chang, J. A. Stratton, H.-S. Kim and W.-m. W. Hwu SC, 2012
A Scalable Tridiagonal Solver for GPUs (doi) Hee-Seok Kim, Shengzhao Wu, Li-Wen Chang, Wen-mei W. Hwu International Conference on Parallel Processing (ICPP), 2011
Fall 2010–Fall 2013
UIUC
GPU Empirical Mode Decomposition Library
Built a high-performance GPU multi-dimensional EMD library with significant speedups. (code)
Parallel Implementation of Multi-Dimensional Ensemble Empirical Mode Decomposition (doi) L.-W. Chang, M.-T. Lo, N. Anssari, K.-H. Hsu, N. Huang and W.-m. W. Hwu International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2011
Spring 2008–Summer 2009
NTU
High-Performance Ultrasound
Developed a high-frequency (50 MHz) real-time ultrasonic imaging system on FPGA and GPU that became a commercial product.
Graphics Processing Unit-Based High-Frame-Rate Color Doppler Ultrasound Processing (doi) Li-Wen Chang, Ke-Hsin Hsu, Pai-Chi Li IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 2009
GPU-Based Color Doppler Ultrasound Processing (doi) L.-W. Chang, K.-H. Hsu, P.-C. Li International Ultrasonics Symposium (IUS), 2009
Fall 2005–Spring 2006
NTU
Computational Photography
Studied light-field cameras and their visual effects.
Depth Detection of Light Field (doi) Yi-Hao Kao, Chia-Kai Liang, Li-Wen Chang, Homer H. Chen ICASSP, 2007
Spring 2005–Fall 2005
NTU
Rolling Shutter Effect
Provided an early analysis of the rolling-shutter effect and an efficient compensation method.
Analysis and Compensation of Rolling Shutter Effect (doi) Chia-Kai Liang, Li-Wen Chang, Homer H. Chen IEEE Transactions on Image Processing, 2008 (cited more than 100 times)