Academic
About Me
I am a third-year undergraduate student in the School of Artificial Intelligence / Qian Xuesen Honors College at Xi’an Jiaotong University. I am currently interning at the Shanghai Artificial Intelligence Laboratory, working closely with Researcher Yilun Chen, and will be joining the lab under the supervision of Research Scientist Jiangmiao Pang.
My research philosophy is simple: I build things because it’s fun.
I’m not chasing lofty slogans — I just genuinely enjoy making robots move, do things, and occasionally accomplish something useful. If a project eventually turns into a paper, great; if it only ends up as a GIF of a robot knocking over a mug, that’s fine too.
Right now, I’m focused on large-scale synthetic data generation, aiming to use it to connect Vision-Language Models with real-world actions. I see simulation as a critical stage in this process: it allows ideas to be tested, refined, and scaled before being deployed in the real world; incorporating massive amounts of real-world data can then make the system truly grounded. At the same time, I aim to build a systematic engineering setup that runs smoothly and reliably, keeping the distance from an idea to a working prototype as short as possible.
Here is my academic CV; feel free to download it.
Download CV
Research Interests
Robotic Manipulation
Robotic Manipulation, Grasping, Dexterous Control, Object Interaction
VLA
Vision-Language-Action Models, Multi-modal Learning, Embodied AI
Simulation Platform
Virtual Environments, Physics Simulation, Training Platforms
Open-source Projects
InternManip
Robotics Simulation
An all-in-one robot manipulation learning suite for training and evaluating policy models across various datasets and benchmarks.
InternData-M1
Robotics Dataset
InternData-M1 is a comprehensive embodied robotics dataset containing ~250,000 simulation demonstrations with rich frame-level annotations, including 2D/3D bounding boxes, trajectories, grasp points, and semantic masks.
Publications
2025
InternVLA-M1: Latent Spatial Grounding for Instruction-Following Robotic Manipulation
InternVLA-M1 is a unified framework for spatial grounding and robot control that advances instruction-following robots toward general-purpose intelligence. Its core idea is spatially guided vision-language-action training, where spatial grounding serves as the critical link between instructions and robot actions. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on over 2.3M spatial reasoning examples to determine “where to act” by aligning instructions with visual, embodiment-agnostic positions, and (ii) spatially guided action post-training to decide “how to act” by generating embodiment-aware actions through plug-and-play spatial prompting.
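To make the two-stage recipe concrete, here is a minimal, hypothetical PyTorch sketch. Every name in it (VLMBackbone, GroundingHead, ActionExpert) and every shape, loss, and piece of data is invented for illustration; the actual system trains a full VLM on the 2.3M grounding examples rather than toy MLPs on random tensors. The sketch only shows the flow: pre-train a grounding head to predict points, then train an action head conditioned on those predictions as a frozen spatial prompt.

```python
# Hypothetical sketch only: class names, shapes, and losses are invented for
# illustration and are NOT InternVLA-M1's actual code or API.
import torch
import torch.nn as nn

class VLMBackbone(nn.Module):
    """Stand-in for a vision-language model that fuses image + instruction."""
    def __init__(self, in_dim=512, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))
    def forward(self, x):
        return self.net(x)

class GroundingHead(nn.Module):
    """Stage (i): predict 'where to act' as an embodiment-agnostic 2D point."""
    def __init__(self, dim=256):
        super().__init__()
        self.head = nn.Linear(dim, 2)  # (x, y) in normalized image coordinates
    def forward(self, feats):
        return self.head(feats)

class ActionExpert(nn.Module):
    """Stage (ii): decode embodiment-aware actions, conditioned on backbone
    features plus the spatial prompt produced by the grounding head."""
    def __init__(self, dim=256, action_dim=7):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 2, dim), nn.ReLU(),
                                 nn.Linear(dim, action_dim))
    def forward(self, feats, spatial_prompt):
        return self.net(torch.cat([feats, spatial_prompt], dim=-1))

backbone, grounder, actor = VLMBackbone(), GroundingHead(), ActionExpert()

# Stage (i): spatial grounding pre-training ("where to act").
opt = torch.optim.Adam([*backbone.parameters(), *grounder.parameters()], lr=1e-4)
for _ in range(100):                        # toy stand-in for ~2.3M examples
    x = torch.randn(32, 512)                # fused image+instruction features
    target_pts = torch.rand(32, 2)          # ground-truth grounding points
    loss = nn.functional.mse_loss(grounder(backbone(x)), target_pts)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage (ii): spatially guided action post-training ("how to act").
opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
for _ in range(100):
    x = torch.randn(32, 512)
    target_actions = torch.randn(32, 7)     # demonstration actions
    with torch.no_grad():                   # plug-and-play spatial prompting
        feats = backbone(x)
        prompt = grounder(feats)
    loss = nn.functional.mse_loss(actor(feats, prompt), target_actions)
    opt.zero_grad(); loss.backward(); opt.step()
```

Detaching the grounding output in stage (ii) reflects the plug-and-play framing: the spatial prompt conditions the action policy without being rewritten by the action loss.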
GenManip: A Simulation Platform for Generalizable TableTop Manipulation in the Era of MLLM
An embodied manipulation benchmark built on Isaac Sim, featuring automatic demonstration/layout generation and closed-loop evaluation. I served as a core developer and supported subsequent research at SHAILAB. To be released later.
2024
PMT: Progressive Mean Teacher via Exploring Temporal Consistency for Semi-Supervised Medical Image Segmentation
Proposed a semi-supervised learning framework for medical image segmentation using Mean Teachers to enhance model diversity and regularization. Achieved state-of-the-art results and demonstrated generalization across datasets.