Alibaba Is Building Qwen-Robot: The Operating System for the Robot Economy

In brief

Alibaba unveiled the Qwen-Robot Suite, a trio of AI models designed to handle robot navigation, manipulation, and physics-based world simulation through a unified software stack.
The company says its models top several robotics benchmarks, using millions of training samples and tens of thousands of hours of open-source robot data.
Real-world robot deployment remains years away.

Alibaba’s Qwen team released the Qwen-Robot Suite on Tuesday: three foundation models forming what it calls a “full stack for embodied intelligence.” Qwen-RobotNav handles mobility. Qwen-RobotManip handles manipulation. Qwen-RobotWorld simulates the physics that make both possible. Each works independently. Together, they’re the Android moment for robotics: the operating system, not the hardware.

📣 Introducing the Qwen-Robot Suite: Qwen-RobotNav, Qwen-RobotManip, Qwen-RobotWorld, three foundation models, a full stack for embodied intelligence.

🧭 Qwen-RobotNav: the gateway to mobility. • Unifies 5 navigation tasks in a single model: instruction following, point-goal,… pic.twitter.com/noumjTtTeS

— Qwen (@Alibaba_Qwen) June 16, 2026

Alibaba is right now the only company in China spanning chips, cloud, models, serving platforms, and applications. For the company, robotics, or what is known as embodied AI, is the most physical expression of that bet.

AI agents today rely on LLMs to power their decisions. Robots usually run on machine-learning models which, though advanced, lack the adaptability of generative AI. Physical agents face a different, harder class of failure modes: physics, not prompts.

For these use cases, Alibaba released this new AI suite with different components:

Qwen-RobotNav unifies five navigation tasks (instruction following, point-goal navigation, object search, target tracking, and autonomous driving), each demanding different visual memory strategies. Most models hardcode one strategy. Qwen-RobotNav exposes a parameterized interface (token budget, temporal decay, per-camera weights) that a planner can reconfigure mid-episode.

Trained on 15.6 million samples with randomization across all parameters, it achieves 76.5% success on VLN-CE RxR, a benchmark for vision-and-language navigation in real-world environments, and 90% tracking on EVT-Bench, which evaluates an agent’s ability to continuously follow moving targets.
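To make the idea of a reconfigurable observation interface concrete, here is a minimal Python sketch. It is not Qwen-RobotNav’s actual API: the class `ObservationConfig`, the function `allocate_tokens`, and the weighting rule (camera weight times decay raised to the frame’s age) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class ObservationConfig:
    """Hypothetical knobs; names are illustrative, not Qwen-RobotNav's API."""
    token_budget: int = 1024
    temporal_decay: float = 0.5  # weight multiplier per step of frame age
    camera_weights: dict = field(default_factory=lambda: {"front": 1.0})


def allocate_tokens(config, frames):
    """Split the token budget across (camera, age) frames.

    Each frame's share is proportional to camera_weight * decay ** age,
    so older frames and down-weighted cameras receive fewer tokens.
    """
    scores = [
        config.camera_weights.get(camera, 0.0) * config.temporal_decay ** age
        for camera, age in frames
    ]
    total = sum(scores)
    if total == 0:
        return [0] * len(frames)
    return [int(config.token_budget * score / total) for score in scores]


# Frames are (camera name, age in steps); age 0 is the newest frame.
frames = [("front", 0), ("front", 1), ("rear", 0)]

# An assumed "object search" setting: both cameras matter, memory decays slowly.
search = ObservationConfig(
    token_budget=1000,
    temporal_decay=0.5,
    camera_weights={"front": 1.0, "rear": 0.5},
)

# An assumed "target tracking" setting, swapped in mid-episode by a planner:
# ignore the rear camera and favor the most recent frame.
tracking = replace(
    search, temporal_decay=0.25, camera_weights={"front": 1.0, "rear": 0.0}
)

search_allocation = allocate_tokens(search, frames)
tracking_allocation = allocate_tokens(tracking, frames)
```

The point of the sketch is that the same model input pipeline can serve different tasks by changing a handful of numbers rather than retraining or swapping architectures.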

Qwen-RobotManip tackles one of the biggest challenges in robotic manipulation: different robots represent actions in fundamentally different ways. A Franka arm (a robot arm with seven axes of motion) operates through joint angles, while an ALOHA robot (a low-cost bimanual robot platform widely used in robotics research) represents actions through the position and orientation of its grippers (end-effector poses). Humanoids add another layer of complexity, using whole-body coordinates.
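The gap between the two representations can be shown with a toy example. The sketch below uses a simplified planar two-link arm, not a real Franka or ALOHA model, and the function name `forward_kinematics` is an illustrative choice. It converts a joint-angle action into the equivalent end-effector pose, which is the kind of mapping a cross-embodiment model has to reconcile.

```python
import math


def forward_kinematics(joint_angles, link_lengths):
    """Map joint angles (radians) of a planar serial arm to (x, y, heading)."""
    x = y = heading = 0.0
    for angle, length in zip(joint_angles, link_lengths):
        heading += angle  # each joint rotates everything after it
        x += length * math.cos(heading)
        y += length * math.sin(heading)
    return x, y, heading


# Joint-space action: one angle per joint, in radians.
joint_action = (math.pi / 2, -math.pi / 2)

# The same configuration expressed as an end-effector pose.
pose_action = forward_kinematics(joint_action, link_lengths=(1.0, 1.0))
```

Here the two-number joint action and the three-number pose describe the same arm configuration, yet they share no units or dimensions. Going the other way, from pose back to joint angles, requires inverse kinematics and may have several solutions or none.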

To bridge these incompatible action spaces, Alibaba synthesized roughly 38,100 hours of training data from open-source robot datasets and human videos, without relying on proprietary data collection. The model ranks first on RoboChallenge Table30-v1, outperforming previous approaches by 20%.

Qwen-RobotWorld is the most ambitious: a language-conditioned video world model treating natural language as a universal action interface. “Pick up the red cup and pour water on the flower” works whether the actor is a gripper, an autonomous vehicle, or a mobile navigation agent.

The Embodied World Information corpus spans 8.6 million video-text pairs (200 million frames) across manipulation (5.9 million samples, 1,300+ skills, 20+ morphologies), autonomous driving (Waymo, NVIDIA PhysicalAI-AD, Bench2Drive), indoor navigation (VLNVerse), and human-to-robot transfer across 14 robot arms.
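A quick back-of-the-envelope calculation puts those corpus figures in proportion. The inputs come from the numbers reported above; the derived averages are simple arithmetic, not figures Alibaba has published, and the variable names are illustrative.

```python
# Reported corpus figures.
video_text_pairs = 8_600_000
total_frames = 200_000_000
manipulation_samples = 5_900_000

# Derived values (plain arithmetic, not published statistics).
frames_per_pair = total_frames / video_text_pairs
manipulation_share = manipulation_samples / video_text_pairs

print(f"Average frames per clip: {frames_per_pair:.1f}")
print(f"Manipulation share of corpus: {manipulation_share:.1%}")
```

On these numbers, the average clip holds roughly 23 frames, and manipulation accounts for a little over two thirds of the video-text pairs, with driving, indoor navigation, and human-to-robot transfer making up the remainder.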

It ranks first on EWMBench and DreamGen Bench, two benchmarks that evaluate whether world models predict and generate realistic physical environments. It also beats all open-source models on WorldModelBench and PBench, and scores perfectly on physics adherence: Newton’s laws, mass conservation, fluid dynamics, gravity.

The ChatGPT of robots?

While Western labs (Google DeepMind, Nvidia, Figure, Physical Intelligence) pursue similar goals, most focus on navigation or manipulation, not a unified, composable suite. Alibaba’s vertical integration from chips through applications means it controls the entire stack. The open-source foundation sets it apart from rivals relying on private robot data.

A few misconceptions are worth clearing up. These are not robots but software models: brains, not bodies. They run on hardware from AgileX, Franka, Universal Robots, Unitree, and others.

Also, despite being generative AI models for robots, these aren’t LLMs like your typical ChatGPT. A language model predicts tokens. These models must understand physics, spatial relationships, and the consequences of physical actions. A language model tells you a glass breaks if dropped. Qwen-RobotWorld predicts how it breaks: shatter pattern, fluid dynamics, secondary collisions. Qwen-RobotManip plans a grasp that prevents the drop entirely.

Don’t expect to have your own housemaid robot anytime soon. The gap between a controlled demo of a robot placing fruit in a basket and a robot reliably working in your home is enormous. RoboCasa365, LIBERO-Plus, and RoboTwin-Clean2Rand are simulation benchmarks. Real-world deployment introduces sensor noise, actuator drift, and the long tail of edge cases that have humbled every robotics effort in history, and Alibaba acknowledges this.

The technical achievements are real, though. RobotManip’s alignment-first approach solves a genuine bottleneck in cross-embodiment training. RobotNav’s parameterized observation interface is a clever solution to the context-strategy problem. RobotWorld’s language-as-universal-action-interface is the right abstraction for cross-domain world modeling.

Alibaba hasn’t disclosed pricing, timelines, or which customers get access beyond pilot programs.