About Me
I am currently a Principal-level Research Scientist and Engineering Director at ByteDance Seed, working on GenAI, distributed training, AI infrastructure, model compilation, libraries, PyTorch, OpenAI Triton, TensorFlow, ONNX, and NPU ASICs. I have managed and led several teams, and built Triton-distributed (distributed Triton for parallel systems), ShadowKV (high-throughput long-context LLM inference), veScale (a PyTorch-native LLM training framework), Flux (a fast communication-overlapping library for tensor parallelism on GPUs), ByteIR (a model compilation solution for various hardware), and ByteMLPerf (an AI accelerator benchmarking tool). My CV is available upon request via email (dddscy AT gmail DOT com).
Previously, I was a Senior Software Engineer at Microsoft Cloud & AI, working on AI infrastructure, model compilation, libraries, PyTorch, ONNX, and NPUs.
I pursued my Ph.D. in Professor Wen-mei Hwu's IMPACT group, where I worked on a performance-portable high-level language called TANGRAM (Transformation-, Architecture-, Network-, Granularity-, Runtime-aware Adaptive Machine). TANGRAM is designed to achieve high performance across CPUs, GPUs, FPGAs, and distributed systems from a single source code.
News: We open-sourced Triton-distributed.
News: We have one inference paper (MegaScale-Infer) accepted to OSDI 2025.
News: We have one quantum computing paper accepted to ISCA 2025.
News: We have two communication-overlap papers (Comet and TileLink) accepted to MLSys 2025.
News: Our high-throughput long-context LLM inference paper, ShadowKV, is now available on arXiv.