My research interests include GenAI, MLSys, heterogeneous computing, and compiler optimization.
Research Experience
June 2021–Present
ByteDance Seed/AML
ML Training & Inference Acceleration, Automation, Architecture (AAA)
Accelerate GenAI training and inference with system- and compiler-level optimizations.
Build production tools that automate performance tuning, scaling, and deployment.
Design advanced architectures and runtimes for efficient large-model execution.
Triton-distributed: Distributed Triton for Parallel Systems (github, arXiv) Enables ergonomic kernel programming across distributed systems.
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference (github, arXiv) Shrinks the GPU-resident KV-cache footprint to sustain high throughput in long-context LLM inference.
veScale: a PyTorch native LLM training framework (github) Simplifies scaling single-device PyTorch models to distributed training.
Flux: a fast communication-overlapping library for tensor parallelism on GPUs (github, arXiv, arXiv) Overlaps communication and compute for tensor parallelism.
ByteIR: a model compilation solution for various hardware (github, website) Compiler tooling for high-performance training and inference across CPUs, GPUs, and ASICs.
ByteMLPerf: an AI accelerator benchmarking tool (github, website) Benchmarks AI accelerators with a production-oriented focus on usability and versatility.
AI-for-code systems focused on productivity and reliability.
Large-model performance estimation and prediction tooling.
Distributed heterogeneous system prototypes for training and inference.
A programming-language and compiler research initiative.
Communication synthesis for scalable model parallelism.
Computer-architecture studies for AI workloads.
Acceleration projects spanning GPUs and AI ASICs.
Additional systems and infrastructure projects.
July 2017–June 2021
Microsoft
High-performance AI Inference Engine
Built a compiler-based inference engine focused on high performance and portability.
NGEMM: Optimizing GEMM for Deep Learning via Compiler-based Techniques (arXiv) Wenlei Bao, Li-Wen Chang, Yang Chen, Ke Deng, Amit Agarwal, Emad Barsoum, and Abe Taha arXiv.org, 2019
Accelerating Recurrent Neural Networks through Compiler Techniques and Quantization Li-Wen Chang, Yang Chen, Wenlei Bao, Amit Agarwal, Eldar Akchurin, Ke Deng and Emad Barsoum Workshop on Systems for ML at NeurIPS, 2018
Fall 2012–Fall 2017
UIUC
TANGRAM – High-level Language for Heterogeneous Computing
Developed a high-level, performance-portable language for CPUs, GPUs, FPGAs, and clusters, targeting near–vendor-library performance from a single source.
Efficient Kernel Synthesis for Performance Portable Programming (doi) Li-Wen Chang, Izzat El Hajj, Christopher Rodrigues, Juan Gómez-Luna and Wen-mei W. Hwu MICRO, 2016
DySel: Lightweight Dynamic Selection for Kernel-based Data-Parallel Programming Model (doi) Li-Wen Chang, Hee-Seok Kim and Wen-mei W. Hwu ASPLOS, 2016
A Programming System for Future Proofing Performance Critical Libraries (doi) Li-Wen Chang, Izzat El Hajj, Hee-Seok Kim, Juan Gómez-Luna, Abdul Dakkak and Wen-mei W. Hwu PPoPP, 2016 (short paper)
Toward Application Performance Portability for Heterogeneous Computing Li-Wen Chang, Hee-Seok Kim, and Wen-mei Hwu TECHCON, 2015
Transitioning HPC Software to Exascale Heterogeneous Computing (doi) Wen-mei Hwu, Li-Wen Chang, Hee-Seok Kim, Abdul Dakkak, and Izzat El Hajj Computational Electromagnetics International Workshop (CEM), 2015
Tangram: a High-level Language for Performance Portable Code Synthesis (pdf) Li-Wen Chang, Abdul Dakkak, Christopher I Rodrigues, and Wen-mei Hwu Eighth Workshop on Programmability Issues for Heterogeneous Multicores (MULTIPROG), 2015
Fall 2015–Fall 2017
UIUC
Heterogeneous Benchmark Suite and Characteristics Study
Built benchmark suites to study heterogeneous architectures and computation patterns.
Collaborative Computing on Heterogeneous CPU-FPGA Architectures Using OpenCL Sitao Huang, Li-Wen Chang, Izzat El Hajj, Simon Garcia de Gonzalo, Juan Gómez Luna, Sai Rahul Chalamalasetti, Mohamed El-Hadedy, Dejan Milojicic, Onur Mutlu, Deming Chen and Wen-Mei Hwu ICPE, 2019
Collaborative Computing for Heterogeneous Integrated Systems (doi) Li-Wen Chang, Juan Gómez-Luna, Izzat El Hajj, Sitao Huang, Deming Chen and Wen-mei W. Hwu ICPE, 2017
Chai: Collaborative Heterogeneous Applications for Integrated-architectures Juan Gómez-Luna, Izzat El Hajj, Li-Wen Chang, Victor Garcia-Flores, Simon Garcia de Gonzalo, Thomas B. Jablin, Antonio J. Peña and Wen-Mei Hwu ISPASS, 2017
Fall 2015–Fall 2017
UIUC
Dynamic Parallelism Optimizations for GPUs
Developed GPU dynamic-parallelism optimizations and a compiler to automate them, delivering substantial speedups.
KLAP: Kernel Launch Aggregation and Promotion for Optimizing Dynamic Parallelism (doi) Izzat El Hajj, Juan Gómez-Luna, Cheng Li, Li-Wen Chang, Dejan Milojicic and Wen-mei W. Hwu MICRO, 2016
Fall 2014–Summer 2017
UIUC
GPU Data Sliding
Studied algorithms that reduce global-memory traffic for data layout transformations and relational algebra.
In-Place Data Sliding Algorithms for Many-Core Architectures (doi) Juan Gómez-Luna, Li-Wen Chang, I-Jui Sung, Nicolás Guil Mata and Wen-Mei Hwu International Conference on Parallel Processing (ICPP), 2015
In-Place Matrix Transposition on GPUs (doi) J. Gómez-Luna, I.-J. Sung, L.-W. Chang, J. M. González-Linares, N. Guil and W.-m. Hwu IEEE Transactions on Parallel and Distributed Systems, 2015
Fall 2009–Summer 2017
UIUC
Accelerator Benchmark Suites (Parboil and SPEC ACCEL) and Performance Study
Built benchmark suites to characterize GPU architectures, optimizations, and compiler transformations, and to study performance portability. (code)
Algorithm and Data Optimization Techniques for Scaling to Massively Threaded Systems (doi) J. A. Stratton, C. Rodrigues, I.-J. Sung, L.-W. Chang, N. Anssari, G. D. Liu, W.-m. W. Hwu and N. Obeid IEEE Computer, 2012
Optimization and Architecture Effects on GPU Computing Workload Performance (doi) J. A. Stratton, N. Anssari, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, G. Liu, W.-m. Hwu Innovative Parallel Computing, 2012 (voted #2 Best Paper Finalist)
Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing (pdf) J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu and W.-m. W. Hwu IMPACT Technical Report, University of Illinois at Urbana-Champaign, 2012 (cited more than 300 times)
Fall 2012–Fall 2014
UIUC
GPU Cache and Scheduler Design
Characterized cache sensitivity and explored cache bypass and thread throttling strategies.
Adaptive Cache Management for Energy-efficient GPU Computing (doi) X. Chen, L.-W. Chang, C. I. Rodrigues, J. Lv, Z. Wang and W.-m. W. Hwu MICRO, 2014
Adaptive Cache Bypass and Insertion for Many-core Accelerators (doi) Xuhao Chen, Shengzhao Wu, Li-Wen Chang, Wei-Sheng Huang, Carl Pearson, Zhiying Wang, and Wen-mei W Hwu Proceedings of International Workshop on Manycore Embedded Systems, 2014
Fall 2010–Summer 2014
UIUC
GPU Tridiagonal Solver Library
Built the first GPU tridiagonal solver with pivoting, later included as gtsv in NVIDIA cuSPARSE 5.5+. (code)
A Guide for Implementing Tridiagonal Solvers on GPUs (doi) Li-Wen Chang, and Wen-mei W. Hwu Numerical Computations with GPUs, 2015
Scalable Parallel Tridiagonal Algorithms with Diagonal Pivoting and Their Optimization for Many-core Architectures (pdf) L.-W. Chang M.S. Thesis, 2014
Mapping Tridiagonal Solvers to Linear Recurrences (pdf) Li-Wen Chang, and Wen-mei W. Hwu IMPACT Technical Report, University of Illinois at Urbana-Champaign, 2013
A Scalable, Numerically Stable High-Performance Tridiagonal Solver using GPUs (doi) L.-W. Chang, J. A. Stratton, H.-S. Kim and W.-m. W. Hwu SC, 2012
A Scalable Tridiagonal Solver for GPUs (doi) Hee-Seok Kim, Shengzhao Wu, Li-Wen Chang, Wen-mei W. Hwu International Conference on Parallel Processing (ICPP), 2011
Fall 2010–Fall 2013
UIUC
GPU Empirical Mode Decomposition Library
Built a high-performance GPU multi-dimensional EMD library with significant speedups. (code)
Parallel Implementation of Multi-Dimensional Ensemble Empirical Mode Decomposition (doi) L.-W. Chang, M.-T. Lo, N. Anssari, K.-H. Hsu, N. Huang and W.-m. W. Hwu International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2011
Spring 2008–Summer 2009
NTU
High-Performance Ultrasound
Developed a high-frequency (50 MHz) real-time ultrasonic imaging system on FPGA and GPU that became a commercial product.
Graphics Processing Unit-Based High-Frame-Rate Color Doppler Ultrasound Processing (doi) Li-Wen Chang, Ke-Hsin Hsu, Pai-Chi Li IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 2009
GPU-Based Color Doppler Ultrasound Processing (doi) L.-W. Chang, K.-H. Hsu, P.-C. Li International Ultrasonics Symposium (IUS), 2009
Fall 2005–Spring 2006
NTU
Computational Photography
Studied light-field cameras and their visual effects.
Depth Detection of Light Field (doi) Yi-Hao Kao, Chia-Kai Liang, Li-Wen Chang, Homer H. Chen ICASSP, 2007
Spring 2005–Fall 2005
NTU
Rolling Shutter Effect
Provided an early analysis of the rolling-shutter effect and an efficient compensation method.
Analysis and Compensation of Rolling Shutter Effect (doi) Chia-Kai Liang, Li-Wen Chang, Homer H. Chen IEEE Transactions on Image Processing, 2008 (cited more than 100 times)