About Me
I am currently a Senior Staff-level Research Scientist and Engineering Manager at ByteDance Seed/AML, working on GenAI, distributed training, AI infrastructure, model compilation, libraries, PyTorch, OpenAI Triton, TensorFlow, ONNX, and NPU ASICs. I have managed and led several teams, and built veScale (a PyTorch-native LLM training framework), Flux (a fast communication-overlapping library for tensor parallelism on GPUs), ByteIR (a model compilation solution for various hardware), and ByteMLPerf (an AI accelerator benchmarking tool). My CV is available upon request via email (dddscy AT gmail DOT com).
Previously, I was a Senior Software Engineer at Microsoft Cloud & AI, working on AI infrastructure, model compilation, libraries, PyTorch, ONNX, and NPUs.
I earned my Ph.D. in Professor Wen-mei Hwu's IMPACT group, where I worked on a performance-portable high-level language called TANGRAM (Transformation-, Architecture-, Network-, Granularity-, Runtime-aware Adaptive Machine). It is designed to achieve high performance across CPUs, GPUs, FPGAs, and distributed systems from a single source code.
News: I gave a talk on our work at ByteDance at the C4ML workshop at CGO 2023.
News: Our high-performance BLAS library for deep learning paper has been released on arXiv. It delivers an average speedup of 1.4x over MKL.
News: Our CPU-FPGA OpenCL high-level synthesis paper has been accepted at ICPE 2019.