Research


My research interests include GenAI, MLSys, heterogeneous computing, and compiler optimization.

Research Experience


June 2021-Present, ByteDance Seed/AML

ML Training & Inference Acceleration, Automation, Architecture (AAA)
We accelerate ML training and inference, especially for GenAI.
We develop production-level tools to automate high-performance ML training and inference.
We build advanced architectures for ML training and inference.

  • ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference (github, arXiv)
    Focused on high-throughput long-context LLM inference.
  • veScale: a PyTorch native LLM training framework (github)
    Focused on ease of use for automatically scaling single-device PyTorch models.
  • Flux: a fast communication-overlapping library for tensor parallelism on GPUs (github, arXiv)
    Focused on communication overlapping for tensor parallelism.
  • ByteIR: a model compilation solution for various hardware (github, website)
    Focused on using compiler techniques to enable various CPUs, GPUs, and ASICs for high-performance training and inference.
  • ByteMLPerf: an AI accelerator benchmarking tool (github, website)
    Focused on evaluating AI accelerators from a practical production perspective, including the ease of use and versatility of software and hardware.
  • A very cool AI code generation project.
  • A very cool large model performance estimator project.
  • A very cool distributed heterogeneous system project.
  • A very cool programming language project.
  • Several very cool computer architecture projects.
  • Several very cool acceleration projects.
  • ...

July 2017-June 2021, Microsoft

High-performance AI Inference Engine
We developed a compiler-based, high-performance AI inference engine.

  • NGEMM: Optimizing GEMM for Deep Learning via Compiler-based Techniques (arXiv)
    Wenlei Bao, Li-Wen Chang, Yang Chen, Ke Deng, Amit Agarwal, Emad Barsoum, and Abe Taha
    arXiv.org, 2019
  • Accelerating Recurrent Neural Networks through Compiler Techniques and Quantization
    Li-Wen Chang, Yang Chen, Wenlei Bao, Amit Agarwal, Eldar Akchurin, Ke Deng and Emad Barsoum
    Workshop on Systems for ML at NeurIPS, 2018

Fall 2012-Fall 2017, UIUC

TANGRAM – High-level Language for Heterogeneous Computing
We developed a pioneering high-level, performance-portable language for CPUs, GPUs, FPGAs, and clusters. From a single source code, TANGRAM achieves at least 70% of the performance of highly optimized vendor libraries across these architectures.
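
The lightweight dynamic selection idea behind DySel can be sketched outside the GPU setting: micro-profile interchangeable kernel variants on a small sample, then run the fastest on the full input. The variants and workload below are purely illustrative Python stand-ins, not TANGRAM/DySel code:

```python
import time
from itertools import accumulate

# Two interchangeable "kernel" variants computing the same prefix sums.
def scan_loop(data):
    out, acc = [], 0
    for x in data:
        acc += x
        out.append(acc)
    return out

def scan_accumulate(data):
    return list(accumulate(data))

def select_variant(variants, sample):
    # Micro-profile each variant on a small sample and keep the fastest.
    best, best_time = None, float("inf")
    for variant in variants:
        start = time.perf_counter()
        variant(sample)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = variant, elapsed
    return best

data = list(range(100_000))
kernel = select_variant([scan_loop, scan_accumulate], data[:1_000])
result = kernel(data)
```

Profiling on a sample rather than the full input keeps the selection overhead small relative to the work it saves, which is the point of making the selection "lightweight".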

  • Efficient Kernel Synthesis for Performance Portable Programming (doi)
    Li-Wen Chang, Izzat El Hajj, Christopher Rodrigues, Juan Gómez-Luna and Wen-mei W. Hwu
    MICRO, 2016
  • DySel: Lightweight Dynamic Selection for Kernel-based Data-Parallel Programming Model (doi)
    Li-Wen Chang, Hee-Seok Kim and Wen-mei W. Hwu
    ASPLOS, 2016
  • A Programming System for Future Proofing Performance Critical Libraries (doi)
    Li-Wen Chang, Izzat El Hajj, Hee-Seok Kim, Juan Gómez-Luna, Abdul Dakkak and Wen-mei W. Hwu
    PPoPP, 2016 (short paper)
  • Toward Application Performance Portability for Heterogeneous Computing
    Li-Wen Chang, Hee-Seok Kim, and Wen-mei Hwu
    TECHCON, 2015
  • Transitioning HPC Software to Exascale Heterogeneous Computing (doi)
    Wen-mei Hwu, Li-Wen Chang, Hee-Seok Kim, Abdul Dakkak, and Izzat El Hajj
    Computational Electromagnetics International Workshop (CEM), 2015
  • Tangram: a High-level Language for Performance Portable Code Synthesis (pdf)
    Li-Wen Chang, Abdul Dakkak, Christopher I Rodrigues, and Wen-mei Hwu
    Eighth Workshop on Programmability Issues for Heterogeneous Multicores (MULTIPROG), 2015

Fall 2015-Fall 2017, UIUC

Heterogeneous Benchmark Suite and Characteristics Study
We built a set of parallel benchmarks to help the community study characteristics of heterogeneous architectures and computation patterns.

  • Collaborative Computing on Heterogeneous CPU-FPGA Architectures Using OpenCL
    Sitao Huang, Li-Wen Chang, Izzat El Hajj, Simon Garcia de Gonzalo, Juan Gómez Luna, Sai Rahul Chalamalasetti, Mohamed El-Hadedy, Dejan Milojicic, Onur Mutlu, Deming Chen and Wen-Mei Hwu
    ICPE, 2019
  • Collaborative Computing for Heterogeneous Integrated Systems (doi)
    Li-Wen Chang, Juan Gómez-Luna, Izzat El Hajj, Sitao Huang, Deming Chen and Wen-mei W. Hwu
    ICPE, 2017
  • Chai: Collaborative Heterogeneous Applications for Integrated-architectures
    Juan Gómez-Luna, Izzat El Hajj, Li-Wen Chang, Victor Garcia-Flores, Simon Garcia de Gonzalo, Thomas B. Jablin, Antonio J. Peña and Wen-Mei Hwu
    ISPASS, 2017

Fall 2015-Fall 2017, UIUC

Dynamic Parallelism Optimizations for GPUs
We developed two general optimization techniques for dynamic parallelism on GPUs, achieving 6.58x and 30.44x speedups over native dynamic parallelism. We also built a compiler to automate these techniques.
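
The kernel launch aggregation idea can be sketched in plain Python (a serial stand-in, not KLAP's actual CUDA compiler transformation): instead of paying launch overhead for one child grid per parent thread, all child work is flattened into a single launch, with offsets to attribute results back to parents.

```python
def child_kernel(items):
    # Stand-in for a child grid: square each work item.
    return [x * x for x in items]

def launch_per_parent(parents):
    # Naive dynamic parallelism: one child launch per parent,
    # paying launch overhead for every (possibly tiny) grid.
    results = []
    for work in parents:
        results.extend(child_kernel(work))
    return results

def launch_aggregated(parents):
    # Kernel launch aggregation: flatten all child work into one
    # launch; offsets let each parent find its slice of the output.
    offsets, flat = [], []
    for work in parents:
        offsets.append(len(flat))
        flat.extend(work)
    return child_kernel(flat), offsets

parents = [[1, 2], [3], [4, 5, 6]]
naive = launch_per_parent(parents)
aggregated, offsets = launch_aggregated(parents)
```

One large launch also gives the hardware more parallel work at once, which is where the speedups over many small launches come from.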

  • KLAP: Kernel Launch Aggregation and Promotion for Optimizing Dynamic Parallelism (doi)
    Izzat El Hajj, Juan Gómez-Luna, Cheng Li, Li-Wen Chang, Dejan Milojicic and Wen-mei W. Hwu
    MICRO, 2016

Fall 2014-Summer 2017, UIUC

GPU Data Sliding
We studied a set of special algorithms that minimize global memory accesses. These algorithms apply broadly to relational algebra operations in databases and to data layout transformations.
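
A serial sketch of the core sliding idea, with an illustrative predicate and data (the papers' many-core versions coordinate thread blocks to preserve the one-read, at-most-one-write property in parallel):

```python
def compact_in_place(buf, keep):
    # Slide elements satisfying `keep` toward the front of `buf`,
    # reading each element once and writing each survivor once --
    # the memory-access pattern the GPU algorithms optimize for.
    write = 0
    for read in range(len(buf)):
        if keep(buf[read]):
            buf[write] = buf[read]
            write += 1
    return write  # number of kept elements; buf[:write] is the result

data = [3, 0, 7, 0, 5]
count = compact_in_place(data, lambda x: x != 0)
```

Because the output is written in place, no second buffer is allocated, which is what keeps global memory traffic minimal.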

  • In-Place Data Sliding Algorithms for Many-Core Architectures (doi)
    Juan Gómez-Luna, Li-Wen Chang, I-Jui Sung, Nicolás Guil Mata and Wen-Mei Hwu
    International Conference on Parallel Processing (ICPP), 2015
  • In-Place Matrix Transposition on GPUs (doi)
    J. Gómez-Luna, I.-J. Sung, L.-W. Chang, J. M. González-Linares, N. Guil and W.-m. Hwu
    IEEE Transactions on Parallel and Distributed Systems, 2015

Fall 2009-Summer 2017, UIUC

Accelerator Benchmark Suites (Parboil and SPEC ACCEL) and Performance Study
We built a set of parallel benchmarks to help the community study characteristics of GPU architectures, optimizations and compiler transformations. We also used these benchmark suites to study optimizations and performance portability across architectures. (code)

  • Algorithm and Data Optimization Techniques for Scaling to Massively Threaded Systems (doi)
    J. A. Stratton, C. Rodrigues, I.-J. Sung, L.-W. Chang, N. Anssari, G. D. Liu, W.-m. W. Hwu and N. Obeid
    IEEE Computer, 2012
  • Optimization and Architecture Effects on GPU Computing Workload Performance (doi)
    J. A. Stratton, N. Anssari, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, G. Liu and W.-m. Hwu
    Innovative Parallel Computing, 2012 (Voted #2 Best Paper Finalist)
  • Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing (pdf)
    J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu and W.-m. W. Hwu
    IMPACT Technical Report, University of Illinois at Urbana-Champaign, 2012 (cited more than 300 times)

Fall 2012-Fall 2014, UIUC

GPU Cache and Scheduler Design
We characterized the cache sensitivity of benchmarks and used the results to study cache bypassing and thread throttling.

  • Adaptive Cache Management for Energy-efficient GPU Computing (doi)
    X. Chen, L.-W. Chang, C. I. Rodrigues, J. Lv, Z. Wang and W.-m. W. Hwu
    MICRO, 2014
  • Adaptive Cache Bypass and Insertion for Many-core Accelerators (doi)
    Xuhao Chen, Shengzhao Wu, Li-Wen Chang, Wei-Sheng Huang, Carl Pearson, Zhiying Wang, and Wen-mei W Hwu
    Proceedings of International Workshop on Manycore Embedded Systems, 2014

Fall 2010-Summer 2014, UIUC

GPU Tridiagonal Solver Library
We proposed and built the first GPU pivoting tridiagonal solver, adopted as the standard gtsv routine in NVIDIA cuSPARSE 5.5 and later. (code)
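
For reference, the classic serial Thomas algorithm below solves a tridiagonal system without pivoting; it is a minimal sketch of the problem being solved, not the diagonal-pivoting, partitioned algorithm the GPU solver actually uses. Convention: `a[0]` (sub-diagonal) and `c[n-1]` (super-diagonal) are unused.

```python
def thomas_solve(a, b, c, d):
    # Forward sweep: eliminate the sub-diagonal, normalizing each row.
    n = len(b)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    # Back substitution.
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

The forward sweep's sequential dependence is what makes a scalable, numerically stable GPU formulation nontrivial, and pivoting is needed when the system is not diagonally dominant.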

  • A Guide for Implementing Tridiagonal Solvers on GPUs (doi)
    Li-Wen Chang, and Wen-mei W. Hwu
    Numerical Computations with GPUs, 2015
  • Scalable Parallel Tridiagonal Algorithms with Diagonal Pivoting and Their Optimization for Many-core Architectures (pdf)
    L.-W. Chang
    M.S. Thesis, 2014
  • Mapping Tridiagonal Solvers to Linear Recurrences (pdf)
    Li-Wen Chang, and Wen-mei W. Hwu
    IMPACT Technical Report, University of Illinois at Urbana-Champaign, 2013
  • A Scalable, Numerically Stable High-Performance Tridiagonal Solver using GPUs (doi)
    L.-W. Chang, J. A. Stratton, H.-S. Kim and W.-m. W. Hwu
    SC, 2012
  • A Scalable Tridiagonal Solver for GPUs (doi)
    Hee-Seok Kim, Shengzhao Wu, Li-Wen Chang, Wen-mei W. Hwu
    International Conference on Parallel Processing (ICPP), 2011

Fall 2010-Fall 2013, UIUC

GPU Empirical Mode Decomposition Library
We built a high-performance GPU Multi-dimensional Empirical Mode Decomposition library that provides significant speedups. (code)

  • Parallel Implementation of Multi-Dimensional Ensemble Empirical Mode Decomposition (doi)
    L.-W. Chang, M.-T. Lo, N. Anssari, K.-H. Hsu, N. Huang and W.-m. W. Hwu
    International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2011

Spring 2008-Summer 2009, NTU

High-Performance Ultrasound
We developed a novel high-frequency (50 MHz) real-time ultrasonic imaging system using FPGAs and GPGPU. The system later became a commercial product of a startup.

  • Graphics Processing Unit-Based High-Frame-Rate Color Doppler Ultrasound Processing (doi)
    Li-Wen Chang, Ke-Hsin Hsu, Pai-Chi Li
    IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 2009
  • GPU-Based Color Doppler Ultrasound Processing (doi)
    L.-W. Chang, K.-H. Hsu, P.-C. Li
    International Ultrasonics Symposium (IUS), 2009

Fall 2005-Spring 2006, NTU

Computational Photography
We studied the light-field camera and its visual effects.

  • Depth Detection of Light Field (doi)
    Yi-Hao Kao, Chia-Kai Liang, Li-Wen Chang, Homer H. Chen
    ICASSP, 2007

Spring 2005-Fall 2005, NTU

Rolling Shutter Effect
We provided pioneering analyses of the rolling shutter effect and an efficient compensation algorithm.

  • Analysis and Compensation of Rolling Shutter Effect (doi)
    Chia-Kai Liang, Li-Wen Chang, Homer H. Chen
    IEEE Transactions on Image Processing, 2008 (cited more than 100 times)

Other Projects


Spring 2014, UIUC

GPU I/O Optimization

Fall 2012, UIUC

GPU Sharing Tracker

Summer-Fall 2008, NTU

Chinese Speech Adaptation
Voice Activity Detection and Segmentation

Spring 2007, UIUC

Stock Portfolio Selection

Fall 2006, UIUC

3D Object Recognition

Spring 2005, NTU

Face Detection
