About
I am an AI Infrastructure Research Engineer at China Merchants Lionrock AI Lab (招商局狮子山人工智能实验室), where I build large-scale training and inference systems for foundation models. My current focus is on developing a next-generation unified training-inference AI infrastructure framework.
Previously, I received my M.Eng. and B.Eng. from Zhejiang University. During my master's, I pursued research in AI systems by collaborating with professors at Rice University, UIUC, and USC, resulting in publications at top systems and database venues (OSDI, SIGMOD).
Research Interests
Distributed Training
RL Infrastructure
LLM Inference
Training-Inference Co-design
GPU Systems
News
- 2025.05 Paper on Empowering Distributed Training with Sparsity-driven Data Synchronization accepted at OSDI 2025.
- 2025.05 Paper on GPU-accelerated Filtered Vector Search accepted at SIGMOD 2026.
Publications
Empowering Distributed Training with Sparsity-driven Data Synchronization
OSDI 2025
* Equal Contribution
A High-Performance Vector Data Management System for Filtered-Search on GPUs
SIGMOD 2026
* Equal Contribution
RAILGUN: A Foundation Model for Multi-Agent Path Finding Across Varied Tasks and Scales
IROS 2025
Experience
China Merchants Lionrock AI Lab
— AI Infra Research Engineer
2025 – Present
- Built a PyTorch FSDP-native large-scale post-training framework supporting DeepSeek and other open-source models, with FSDP2 + CP + EP parallelism, torch.compile, and integrated MoE kernels (DeepEP, GroupGEMM, permute).
- Developed a weight synchronization middleware for disaggregated RL enabling zero-redundancy weight transfer across arbitrary parallel strategies, reducing DeepSeek full-model weight sync to ~2s (industry-leading).
- Implemented selective activation checkpointing (65%+ memory reduction, <5% throughput impact) and activation offload with async D2H/H2D pipelining, so that only a few layers' activations stay resident in GPU memory. Achieved 30%+ MFU on the full DeepSeek model (128 GPUs, 128K sequence length).
- Built the complete Agent RL training pipeline: data production → sampling → reward → training → evaluation, with staleness-based async control and OpenTelemetry-based observability.
- Designing a next-generation unified training-inference framework.
Rice University
— Research Assistant, Prof. Yuke Wang
2024 – 2025
Developed an optimized sparse gradient synchronization system for distributed training, achieving up to 5.09× speedup in communication and up to 2.48× in training throughput, evaluated on up to 128 GPUs. [OSDI 2025]
UIUC
— Research Assistant, Prof. Minjia Zhang
2024 – 2025
Built the first GPU-accelerated filtered approximate nearest neighbor search (ANNS) system with a novel zero-redundancy approach, achieving efficiency on par with unfiltered search. [SIGMOD 2026]
USC
— Research Assistant, Prof. Sven Koenig
2024 – 2025
Built training infrastructure for a UNet-based multi-agent path finding foundation model: large-scale data generation, optimized data loading, multi-GPU training. [IROS 2025]
Alibaba Cloud — PolarDB
— Software Engineering Intern
2023
Developed incremental data verification for distributed database synchronization (MySQL Binlog-based) with pipeline optimization and caching.
DolphinDB
— Software Engineering Intern
2023
Contributed to the JIT compilation module (script-to-LLVM IR) and implemented debugging support (stepping, stack frames, variable inspection).
Education
Zhejiang University
M.Eng. in Electronic Information
Zhejiang University
B.Eng. in Automation
National Encouragement Scholarship, Academic Excellence Scholarship