Research


My research interests include GenAI, MLSys, heterogeneous computing, and compiler optimization.

Research Experience


June 2021–Present

ByteDance Seed/AML

ML Training & Inference Acceleration, Automation, Architecture (AAA)

  • Accelerate GenAI training and inference with system- and compiler-level optimizations.
  • Build production tools that automate performance tuning, scaling, and deployment.
  • Design advanced architectures and runtimes for efficient large-model execution.
  • Triton-distributed: Distributed Triton for Parallel Systems (github, arXiv)
    Enables ergonomic kernel programming across distributed systems.
  • ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference (github, arXiv)
    Delivers high-throughput long-context LLM inference.
  • veScale: a PyTorch native LLM training framework (github)
    Simplifies scaling single-device PyTorch models to distributed training.
  • Flux: a fast communication-overlapping library for tensor parallelism on GPUs (github, arXiv, arXiv)
    Overlaps communication and compute for tensor parallelism.
  • ByteIR: a model compilation solution for various hardware (github, website)
    Compiler tooling for high-performance training and inference across CPUs, GPUs, and ASICs.
  • ByteMLPerf: an AI accelerator benchmarking tool (github, website)
    Benchmarks AI accelerators with a production-oriented focus on usability and versatility.
  • AI-for-code systems focused on productivity and reliability.
  • Large-model performance estimation and prediction tooling.
  • Distributed heterogeneous system prototypes for training and inference.
  • A programming-language and compiler research initiative.
  • Communication synthesis for scalable model parallelism.
  • Computer-architecture studies for AI workloads.
  • Acceleration projects spanning GPUs and AI ASICs.
  • Additional systems and infrastructure projects.

July 2017–June 2021

Microsoft

High-performance AI Inference Engine

  • Built a compiler-based inference engine focused on high performance and portability.
  • NGEMM: Optimizing GEMM for Deep Learning via Compiler-based Techniques (arXiv)
    Wenlei Bao, Li-Wen Chang, Yang Chen, Ke Deng, Amit Agarwal, Emad Barsoum, and Abe Taha
    arXiv.org, 2019
  • Accelerating Recurrent Neural Networks through Compiler Techniques and Quantization
    Li-Wen Chang, Yang Chen, Wenlei Bao, Amit Agarwal, Eldar Akchurin, Ke Deng and Emad Barsoum
    Workshop on Systems for ML at NeurIPS, 2018

Fall 2012–Fall 2017

UIUC

TANGRAM – High-level Language for Heterogeneous Computing

  • Developed a high-level, performance-portable language for CPUs, GPUs, FPGAs, and clusters, targeting near–vendor-library performance from a single source.
  • Efficient Kernel Synthesis for Performance Portable Programming (doi)
    Li-Wen Chang, Izzat El Hajj, Christopher Rodrigues, Juan Gómez-Luna and Wen-mei W. Hwu
    MICRO, 2016
  • DySel: Lightweight Dynamic Selection for Kernel-based Data-Parallel Programming Model (doi)
    Li-Wen Chang, Hee-Seok Kim and Wen-mei W. Hwu
    ASPLOS, 2016
  • A Programming System for Future Proofing Performance Critical Libraries (doi)
    Li-Wen Chang, Izzat El Hajj, Hee-Seok Kim, Juan Gómez-Luna, Abdul Dakkak and Wen-mei W. Hwu
    PPoPP, 2016 (short paper)
  • Toward Application Performance Portability for Heterogeneous Computing
    Li-Wen Chang, Hee-Seok Kim, and Wen-mei Hwu
    TECHCON, 2015
  • Transitioning HPC Software to Exascale Heterogeneous Computing (doi)
    Wen-mei Hwu, Li-Wen Chang, Hee-Seok Kim, Abdul Dakkak, and Izzat El Hajj
    Computational Electromagnetics International Workshop (CEM), 2015
  • Tangram: a High-level Language for Performance Portable Code Synthesis (pdf)
    Li-Wen Chang, Abdul Dakkak, Christopher I Rodrigues, and Wen-mei Hwu
    Eighth Workshop on Programmability Issues for Heterogeneous Multicores (MULTIPROG), 2015

Fall 2015–Fall 2017

UIUC

Heterogeneous Benchmark Suite and Characteristics Study

  • Built benchmark suites to study heterogeneous architectures and computation patterns.
  • Collaborative Computing on Heterogeneous CPU-FPGA Architectures Using OpenCL
    Sitao Huang, Li-Wen Chang, Izzat El Hajj, Simon Garcia de Gonzalo, Juan Gómez Luna, Sai Rahul Chalamalasetti, Mohamed El-Hadedy, Dejan Milojicic, Onur Mutlu, Deming Chen and Wen-Mei Hwu
    ICPE, 2019
  • Collaborative Computing for Heterogeneous Integrated Systems (doi)
    Li-Wen Chang, Juan Gómez-Luna, Izzat El Hajj, Sitao Huang, Deming Chen and Wen-mei W. Hwu
    ICPE, 2017
  • Chai: Collaborative Heterogeneous Applications for Integrated-architectures
    Juan Gómez-Luna, Izzat El Hajj, Li-Wen Chang, Victor Garcia-Flores, Simon Garcia de Gonzalo, Thomas B. Jablin, Antonio J. Peña and Wen-Mei Hwu
    ISPASS, 2017

Fall 2015–Fall 2017

UIUC

Dynamic Parallelism Optimizations for GPUs

  • Developed GPU dynamic-parallelism optimizations and a compiler to automate them, delivering substantial speedups.
  • KLAP: Kernel Launch Aggregation and Promotion for Optimizing Dynamic Parallelism (doi)
    Izzat El Hajj, Juan Gómez-Luna, Cheng Li, Li-Wen Chang, Dejan Milojicic and Wen-mei W. Hwu
    MICRO, 2016

Fall 2014–Summer 2017

UIUC

GPU Data Sliding

  • Studied algorithms that reduce global-memory traffic for data layout transformations and relational algebra.
  • In-Place Data Sliding Algorithms for Many-Core Architectures (doi)
    Juan Gómez-Luna, Li-Wen Chang, I-Jui Sung, Nicolás Guil Mata and Wen-Mei Hwu
    International Conference on Parallel Processing (ICPP), 2015
  • In-Place Matrix Transposition on GPUs (doi)
    J. Gómez-Luna, I.-J. Sung, L.-W. Chang, J. M. González-Linares, N. Guil and W.-m. Hwu
    IEEE Transactions on Parallel and Distributed Systems, 2015

Fall 2009–Summer 2017

UIUC

Accelerator Benchmark Suites (Parboil and SPEC ACCEL) and Performance Study

  • Built benchmark suites to characterize GPU architectures, optimizations, and compiler transformations, and to study performance portability. (code)
  • Algorithm and Data Optimization Techniques for Scaling to Massively Threaded Systems (doi)
    J. A. Stratton, C. Rodrigues, I.-J. Sung, L.-W. Chang, N. Anssari, G. D. Liu, W.-m. W. Hwu and N. Obeid
    IEEE Computer, 2012
  • Optimization and Architecture Effects on GPU Computing Workload Performance (doi)
    J. A. Stratton, N. Anssari, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, G. Liu and W.-m. Hwu
    Innovative Parallel Computing, 2012 (Voted #2 Best Paper Finalist)
  • Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing (pdf)
    J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu and W.-m. W. Hwu
    IMPACT Technical Report, University of Illinois at Urbana-Champaign, 2012 (cited more than 300 times)

Fall 2012–Fall 2014

UIUC

GPU Cache and Scheduler Design

  • Characterized cache sensitivity and explored cache bypass and thread throttling strategies.
  • Adaptive Cache Management for Energy-efficient GPU Computing (doi)
    X. Chen, L.-W. Chang, C. I. Rodrigues, J. Lv, Z. Wang and W.-m. W. Hwu
    MICRO, 2014
  • Adaptive Cache Bypass and Insertion for Many-core Accelerators (doi)
    Xuhao Chen, Shengzhao Wu, Li-Wen Chang, Wei-Sheng Huang, Carl Pearson, Zhiying Wang, and Wen-mei W Hwu
    Proceedings of International Workshop on Manycore Embedded Systems, 2014

Fall 2010–Summer 2014

UIUC

GPU Tridiagonal Solver Library

  • Built the first GPU pivoting tridiagonal solver, later included as gtsv in NVIDIA cuSPARSE 5.5+. (code)
  • A Guide for Implementing Tridiagonal Solvers on GPUs (doi)
    Li-Wen Chang and Wen-mei W. Hwu
    Numerical Computations with GPUs, 2015
  • Scalable Parallel Tridiagonal Algorithms with Diagonal Pivoting and Their Optimization for Many-core Architectures (pdf)
    L.-W. Chang
    M.S. Thesis, 2014
  • Mapping Tridiagonal Solvers to Linear Recurrences (pdf)
    Li-Wen Chang and Wen-mei W. Hwu
    IMPACT Technical Report, University of Illinois at Urbana-Champaign, 2013
  • A Scalable, Numerically Stable High-Performance Tridiagonal Solver using GPUs (doi)
    L.-W. Chang, J. A. Stratton, H.-S. Kim and W.-m. W. Hwu
    SC, 2012
  • A Scalable Tridiagonal Solver for GPUs (doi)
    Hee-Seok Kim, Shengzhao Wu, Li-Wen Chang, Wen-mei W. Hwu
    International Conference on Parallel Processing (ICPP), 2011

Fall 2010–Fall 2013

UIUC

GPU Empirical Mode Decomposition Library

  • Built a high-performance GPU multi-dimensional EMD library with significant speedups. (code)
  • Parallel Implementation of Multi-Dimensional Ensemble Empirical Mode Decomposition (doi)
    L.-W. Chang, M.-T. Lo, N. Anssari, K.-H. Hsu, N. Huang and W.-m. W. Hwu
    International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2011

Spring 2008–Summer 2009

NTU

High-Performance Ultrasound

  • Developed a high-frequency (50 MHz) real-time ultrasonic imaging system on FPGA and GPU that became a commercial product.
  • Graphics Processing Unit-Based High-Frame-Rate Color Doppler Ultrasound Processing (doi)
    Li-Wen Chang, Ke-Hsin Hsu, Pai-Chi Li
    IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 2009
  • GPU-Based Color Doppler Ultrasound Processing (doi)
    L.-W. Chang, K.-H. Hsu, P.-C. Li
    International Ultrasonics Symposium (IUS), 2009

Fall 2005–Spring 2006

NTU

Computational Photography

  • Studied light-field cameras and their visual effects.
  • Depth Detection of Light Field (doi)
    Yi-Hao Kao, Chia-Kai Liang, Li-Wen Chang, Homer H. Chen
    ICASSP, 2007

Spring 2005–Fall 2005

NTU

Rolling Shutter Effect

  • Provided early analysis of the rolling shutter effect and an efficient compensation method.
  • Analysis and Compensation of Rolling Shutter Effect (doi)
    Chia-Kai Liang, Li-Wen Chang, Homer H. Chen
    IEEE Transactions on Image Processing, 2008 (cited more than 100 times)

Other Projects


Spring 2014

UIUC

GPU I/O Optimization

Fall 2012

UIUC

GPU Sharing Tracker

Summer–Fall 2008

NTU

Chinese Speech Adaptation

  • Voice Activity Detection and Segmentation.

Spring 2007

UIUC

Stock Portfolio Selection

Fall 2006

UIUC

3D Object Recognition

Spring 2005

NTU

Face Detection
