My research interests include GenAI, MLSys, heterogeneous computing, and compiler optimization.
June 2021-Present, Bytedance Seed/AML
ML Training & Inference Acceleration, Automation, Architecture (AAA)
We accelerate ML training and inference, especially for GenAI.
We develop production-level tools to automate high-performance ML training and inference, especially for GenAI.
We build advanced architectures for ML training and inference, especially for GenAI.
July 2017-June 2021, Microsoft
High-performance AI Inference Engine
We developed a compiler-based, high-performance AI inference engine.
Fall 2012-Fall 2017, UIUC
TANGRAM – High-level Language for Heterogeneous Computing
We developed a pioneering high-level, performance-portable language for CPUs, GPUs, FPGAs, and clusters. From a single source code, TANGRAM achieves no less than 70% of the performance of highly optimized vendor libraries across these architectures.
Fall 2015-Fall 2017, UIUC
Heterogeneous Benchmark Suite and Characteristics Study
We built a set of parallel benchmarks to help the community study characteristics of heterogeneous architectures and computation patterns.
Fall 2015-Fall 2017, UIUC
Dynamic Parallelism Optimizations for GPUs
We developed two general optimization techniques for dynamic parallelism on GPUs, which achieve 6.58x and 30.44x speedups over native dynamic parallelism. We also built a compiler to automate these techniques.
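For context, the sketch below illustrates "native" CUDA dynamic parallelism, the baseline the optimizations above improve on: a parent kernel launches a child grid per element whose workload exceeds a threshold. The kernel names, workload values, and threshold are purely illustrative, not taken from the project; compilation requires relocatable device code (e.g., nvcc -rdc=true -arch=sm_35 or newer).

```cuda
// Minimal sketch of native CUDA dynamic parallelism (the baseline), not the
// optimized techniques from the project. All names and numbers are illustrative.
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void childKernel(int parentIdx, int work) {
    // Each child thread processes one sub-task of the parent element.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < work) {
        // ... process sub-task i of parent element parentIdx ...
    }
}

__global__ void parentKernel(const int *workPerElem, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    int work = workPerElem[idx];
    if (work > 32) {
        // Native dynamic parallelism: each parent thread launches its own
        // child grid. Simple, but produces many small launches, which is
        // exactly the overhead such optimizations target.
        childKernel<<<(work + 127) / 128, 128>>>(idx, work);
    } else {
        // Small workloads are handled inline by the parent thread.
        for (int i = 0; i < work; ++i) {
            // ... process sub-task i inline ...
        }
    }
}

int main() {
    const int n = 1024;
    std::vector<int> h_work(n, 64);  // 64 sub-tasks per element (illustrative)
    int *d_work;
    cudaMalloc(&d_work, n * sizeof(int));
    cudaMemcpy(d_work, h_work.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    parentKernel<<<(n + 127) / 128, 128>>>(d_work, n);
    cudaDeviceSynchronize();
    cudaFree(d_work);
    printf("done\n");
    return 0;
}
```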
Fall 2014-Summer 2017, UIUC
GPU Data Sliding
We studied a set of special algorithms that minimize global memory accesses. These algorithms are widely applicable to relational algebra operations in databases and to data layout transformations.
Fall 2009-Summer 2017, UIUC
Accelerator Benchmark Suites (Parboil and SPEC ACCEL) and Performance Study
We built a set of parallel benchmarks to help the community study characteristics of GPU architectures, optimizations, and compiler transformations. We also used these benchmark suites to study optimizations and performance portability across architectures. (code)
Fall 2012-Fall 2014, UIUC
GPU Cache and Scheduler Design
We characterized the cache sensitivity of benchmarks and used the results to study cache bypassing and thread throttling.
Fall 2010-Summer 2014, UIUC
GPU Tridiagonal Solver Library
We proposed and built the first GPU pivoting tridiagonal solver, which was included as the standard gtsv routine in NVIDIA CUSPARSE 5.5 and later versions. (code)
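As a usage sketch, the snippet below solves a small tridiagonal system through the legacy cusparseDgtsv interface mentioned above (superseded by the gtsv2 variants in newer toolkits). The example matrix and right-hand side are illustrative, and error checking is omitted for brevity.

```cuda
// Minimal sketch: solve A*x = b for a 4x4 tridiagonal A via legacy cuSPARSE gtsv.
// Illustrative example only; no error checking.
#include <cuda_runtime.h>
#include <cusparse.h>
#include <vector>
#include <cstdio>

int main() {
    const int m = 4;  // system size
    const int n = 1;  // number of right-hand sides
    // Diagonals in host memory; dl[0] and du[m-1] must be zero (outside the matrix).
    std::vector<double> dl = { 0.0, -1.0, -1.0, -1.0};  // sub-diagonal
    std::vector<double> d  = { 2.0,  2.0,  2.0,  2.0};  // main diagonal
    std::vector<double> du = {-1.0, -1.0, -1.0,  0.0};  // super-diagonal
    std::vector<double> b  = { 1.0,  0.0,  0.0,  1.0};  // RHS; solution is all ones

    double *d_dl, *d_d, *d_du, *d_b;
    cudaMalloc(&d_dl, m * sizeof(double));
    cudaMalloc(&d_d,  m * sizeof(double));
    cudaMalloc(&d_du, m * sizeof(double));
    cudaMalloc(&d_b,  m * n * sizeof(double));
    cudaMemcpy(d_dl, dl.data(), m * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_d,  d.data(),  m * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_du, du.data(), m * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b,  b.data(),  m * n * sizeof(double), cudaMemcpyHostToDevice);

    cusparseHandle_t handle;
    cusparseCreate(&handle);
    // In-place solve: the solution overwrites the right-hand side in d_b.
    cusparseDgtsv(handle, m, n, d_dl, d_d, d_du, d_b, m);

    cudaMemcpy(b.data(), d_b, m * n * sizeof(double), cudaMemcpyDeviceToHost);
    printf("x = %f %f %f %f\n", b[0], b[1], b[2], b[3]);

    cusparseDestroy(handle);
    cudaFree(d_dl); cudaFree(d_d); cudaFree(d_du); cudaFree(d_b);
    return 0;
}
```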
Fall 2010-Fall 2013, UIUC
GPU Empirical Mode Decomposition Library
We built a high-performance GPU Multi-dimensional Empirical Mode Decomposition library that provides significant speedups. (code)
Spring 2008-Summer 2009, NTU
High-Performance Ultrasound
We developed a novel high-frequency (50 MHz) real-time ultrasonic imaging system using FPGAs and GPGPU computing. The system became a commercial product of a startup.
Fall 2005-Spring 2006, NTU
Computational Photography
We studied the light-field camera and its visual effects.
Spring 2005-Fall 2005, NTU
Rolling Shutter Effect
We provided a pioneering analysis of the rolling shutter effect and proposed an efficient compensation algorithm.
Spring 2014, UIUC
GPU I/O Optimization
Fall 2012, UIUC
GPU Sharing Tracker
Summer-Fall 2008, NTU
Chinese Speech Adaptation
Voice Activity Detection and Segmentation
Spring 2007, UIUC
Stock Portfolio Selection
Fall 2006, UIUC
3D Object Recognition
Spring 2005, NTU
Face Detection