🦾RoboAssem

Mixed Reality-Assisted Human-Robot Skill Transfer via Visuomotor Primitives Toward Physical Intelligence

¹Shanghai Jiao Tong University, ²The Hong Kong Polytechnic University
Equal contribution *Corresponding Author
Figure 1 Framework

Abstract

Industrial assembly is a core of manufacturing and poses significant challenges to the reliability and adaptability of robot systems. As manufacturing shifts toward intelligent production, there is an urgent need for efficient human-to-robot skill transfer methods for high-precision assembly tasks. However, current embodied intelligence research has focused primarily on household tasks, while industrial applications involving dynamic uncertainties and strict control demands remain largely unexplored. To bridge this gap, we propose a real-world skill transfer framework tailored for contact-rich assembly. It integrates an AR-assisted demonstration system for low-cost and diverse data collection, an end-to-end visuomotor imitation learning algorithm for continuous action prediction, and a primitive skill library covering essential operations such as peg insertion, gear meshing, and disassembly. Experiments on six tasks demonstrate high success rates and robust positional generalization. This study explores a novel pathway; we hope it provides valuable insights for future human-robot collaboration and serves as a critical precursor for the integration of Industry 5.0 with embodied intelligence.

Human-in-the-loop Teaching

Robot programming for assembly tasks often requires domain expertise and extensive parameter tuning, posing barriers for non-expert users. To improve accessibility and interactivity, we developed an MR-based human-robot interaction interface that leverages head-mounted displays, MR technology, and contact-force simulation for direct and safe robot teleoperation and demonstration, facilitating the collection of state-action datasets for end-to-end imitation learning (a sketch of a logged record follows the list below).

  • Real-time Feedback: Robot calibration and demonstration visualized via AR interface, ensuring high-quality data collection
  • Cross-embodiment Portability: Supports various end-effectors and extends to different robots for flexible data acquisition
  • Low Cost: Built with consumer-grade VR devices (e.g., Meta Quest 3), easy to deploy and replicate
  • High Precision: Fine-grained operations through AR-based guidance and game-controller-assisted fine adjustments
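As a concrete reference for what the interface records, below is a minimal Python sketch of a per-timestep demonstration record and its aggregation into an episode. The field names, image resolution, and array shapes are illustrative assumptions rather than the released data format.

```python
# Hypothetical per-timestep record captured during MR teleoperation.
# Field names and shapes are assumptions, not the actual dataset schema.
from dataclasses import dataclass
import numpy as np

@dataclass
class DemoStep:
    img_main: np.ndarray      # (480, 640, 3) RGB frame from the main camera
    img_wrist: np.ndarray     # (480, 640, 3) RGB frame from the wrist camera
    ee_state: np.ndarray      # (7,) x, y, z, roll, pitch, yaw, gripper
    force_torque: np.ndarray  # (6,) wrench measured at the wrist F/T sensor
    action: np.ndarray        # (7,) commanded end-effector pose + gripper

def to_episode(steps):
    """Stack a list of DemoStep objects into per-key arrays for training."""
    keys = ("img_main", "img_wrist", "ee_state", "force_torque", "action")
    return {k: np.stack([getattr(s, k) for s in steps]) for k in keys}
```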
VR Application Interface

Human Demonstration Collection

This work implements two complementary teleoperation modes for demonstration collection: 1) VR-based position control and 2) gamepad-based velocity control. The VR mode enables precise 6-DoF trajectory demonstrations via spatial mapping, while the gamepad mode ensures stable, millimeter-scale screw fastening through fine-grained force adjustment.
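The two modes can be summarized by the minimal sketch below. The gains, deadband, and the assumption that controller displacements are already expressed in the robot base frame are illustrative; the actual interface and robot API differ.

```python
# Illustrative mapping for the two teleoperation modes (assumed gains/deadband).
import numpy as np

def vr_position_command(ctrl_pose_delta, ee_pose, scale=1.0):
    """Position mode: apply the 6-DoF controller displacement (assumed to be
    expressed in the robot base frame) to the current end-effector pose.
    Orientation is treated as small Euler-angle deltas for simplicity."""
    return np.asarray(ee_pose) + scale * np.asarray(ctrl_pose_delta)  # (6,)

def gamepad_velocity_command(axes, lin_gain=0.01, ang_gain=0.05, deadband=0.1):
    """Velocity mode: map stick axes in [-1, 1] to a small Cartesian twist,
    enabling millimeter-scale adjustments during screw fastening."""
    axes = np.where(np.abs(axes) < deadband, 0.0, np.asarray(axes))
    v = lin_gain * axes[:3]   # m/s translational velocity
    w = ang_gain * axes[3:6]  # rad/s rotational velocity
    return np.concatenate([v, w])
```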

Human Demonstration Collection

Assembly/Disassembly Primitives

Assembly Tasks

The assembly tasks include peg insertion, gear meshing, and nut screwing, designed with reference to the Factory simulation benchmark. In each task, the robot must grasp parts from the task board and insert them precisely into predefined positions. To evaluate robustness, perceptual variations were introduced during data collection, changing the visual inputs while preserving the core task objectives.

Peg Insertion

Unlike conventional pick-and-place operations, this task requires a downward motion with insertion force. The peg features chamfers on both ends to facilitate guidance and alignment. Components are randomly positioned within four regions p1~p4 of the task board.

Peg Insertion Task

Gear Meshing

The robot assembles a small gear onto a shaft and aligns it with a neighboring gear, requiring high accuracy in both orientation and position.

Gear Meshing Task

Nut Screwing

Due to the tight clearance between the nut and bolt, the task involves multi-surface frictional contact with 6-DoF motion. The initial height of the nut relative to the bolt varies in 5 mm increments.

Nut Screwing Task

Disassembly Tasks

The disassembly tasks, including gear-peg removal, snap-fit disassembly, and U-peg reassembly, are inspired by the part configurations in the AutoMate dataset. The robot must remove each component from the pre-assembled task board and place it into designated containers. In task variants, the target base alternates among three predefined positions p1~p3 to test positional robustness.

Gear-peg Removal

The robot extracts a peg from the gear assembly and places it into a storage box.

Gear-peg Removal Task

Snap-fit Disassembly

The robot extracts the snap-fit component to release the locking mechanism and places it into the transparent storage box.

Snap-fit Disassembly Task

U-peg Reassembly

This task involves two identical bases: the robot first removes a U-peg from one base and inserts it into the other, requiring high-precision alignment.

U-peg Reassembly Task

Imitation Learning

Fine-grained manipulation tasks such as assembly pose substantial challenges for robots due to their stringent requirements on precision and coordination. Traditional solutions often depend on expensive sensors and complex calibration procedures. Imitation learning offers a more flexible alternative but suffers from error accumulation and distributional shift caused by the non-stationarity of human demonstrations. To address these challenges, this work introduces an efficient imitation learning framework that enables robust action sequence prediction from multimodal observations.

Imitation Learning Framework

Action Chunking Transformers Framework

  • Observation Space: The complete observation space 𝒪 integrates visual perception and robot state, formally defined as:
    𝒪 = {I_m, I_w, s_state, s_force}
    where I_m and I_w denote images from the main and wrist cameras, respectively; s_state ∈ ℝ⁷ denotes the end-effector state (position, Euler angles, and gripper status), and s_force ∈ ℝ⁶ corresponds to force/torque measurements.
  • Action Space: The action space 𝒜 is defined as a control vector over the end-effector pose and gripper:
    𝒜 = {x, y, z, roll, pitch, yaw, gripper} ∈ ℝ⁷
    Both 𝒜 and s_state share the same 7-dimensional structure [position, orientation, gripper], enabling a direct mapping from policy outputs to robot control commands. This structural consistency facilitates learning state-action correlations and simplifies deploying trained policies on physical robots.
  • Policy Framework: The proposed policy models the generation of action sequences using a Conditional Variational Autoencoder (CVAE) framework, which learns the conditional distribution π_θ(a_{t:t+k} | s_t) of the next k actions given the current state; a minimal sketch of this chunked prediction is given below.
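For intuition, the following is a minimal PyTorch sketch of an ACT-style CVAE policy head over the observation and action spaces defined above. The feature dimensions, chunk length, and MLP encoder/decoder are simplifying assumptions; the actual ACT architecture uses transformer encoders and decoders over ResNet/ViT image features.

```python
# Minimal sketch of an ACT-style CVAE policy head (hypothetical dimensions; the
# real model uses transformers over ResNet/ViT features rather than MLPs).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChunkedCVAEPolicy(nn.Module):
    def __init__(self, img_dim=512, state_dim=7, force_dim=6,
                 act_dim=7, chunk=20, z_dim=32, hidden=256):
        super().__init__()
        self.chunk, self.act_dim, self.z_dim = chunk, act_dim, z_dim
        cond_dim = 2 * img_dim + state_dim + force_dim   # main + wrist features
        # CVAE encoder: infers a latent "style" z from the demonstrated chunk.
        self.enc = nn.Sequential(
            nn.Linear(chunk * act_dim + state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * z_dim))                 # -> (mu, logvar)
        # Decoder: predicts the whole k-step action chunk from condition and z.
        self.dec = nn.Sequential(
            nn.Linear(cond_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, chunk * act_dim))

    def forward(self, img_main, img_wrist, state, force, actions=None):
        cond = torch.cat([img_main, img_wrist, state, force], dim=-1)
        if actions is not None:                            # training path
            mu, logvar = self.enc(
                torch.cat([actions.flatten(1), state], dim=-1)).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        else:                                              # inference: prior mean
            mu = logvar = None
            z = torch.zeros(cond.shape[0], self.z_dim, device=cond.device)
        pred = self.dec(torch.cat([cond, z], dim=-1))
        return pred.view(-1, self.chunk, self.act_dim), mu, logvar

def cvae_loss(pred, target, mu, logvar, kl_weight=10.0):
    """L1 reconstruction on the action chunk plus a KL regularizer."""
    recon = F.l1_loss(pred, target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl_weight * kl
```

At test time, z is set to the prior mean and the predicted chunk is executed (or ensembled) before re-planning, following common action-chunking practice.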

Real World Experiments

Execution Demonstrations

The following videos demonstrate real-world execution of assembly and disassembly tasks from both wrist camera and main camera perspectives, showcasing the precision and reliability of the learned policies across different manipulation primitives.

Assembly Execution

Peg Insertion

Gear Meshing

Screw Fastening

Wrist Camera View

Main Camera View

Disassembly Execution

Gear-peg Removal

Snap-fit Disassembly

U-peg Reassembly

Wrist Camera View

Main Camera View

Module Ablation

This study presents a systematic ablation analysis on peg insertion to assess the effects of data scale, input preprocessing, and network architecture, using the original ACT model as baseline. Experiments follow a unified protocol with 24 trials per setting across four workspace regions to evaluate spatial generalization. Aggregated results are shown in the table below.

| Configuration | Training Data | Input | Epochs | Backbone | Left Success | Right Success | Overall Success |
|---|---|---|---|---|---|---|---|
| ACT (Baseline) | 50 demos | 480×640 | 50K | ResNet | 23.5% | 20.0% | 21.9% |
| + Augmented Data | 100 demos | 480×640 | 100K | ResNet | 41.7% | 41.7% | 41.7% |
| + Cropped Input | 50 demos | 224×224 | 100K | ResNet | 58.3% | 83.3% | 70.8% |
| + Vision Transformer | 50 demos | 224×224 | 100K | ViT | 75.0% | 83.3% | 79.2% |
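The "+ Cropped Input" row corresponds to reducing the raw 480×640 frames to 224×224 before they reach the backbone. The sketch below illustrates one way to do this with torchvision; the crop box is a placeholder assumption, since the exact workspace region is not specified here.

```python
# Illustrative crop-and-resize preprocessing for the "+ Cropped Input" setting.
# The crop box is a placeholder; the actual workspace region may differ.
import torchvision.transforms.functional as TF

def preprocess(frame, box=(120, 180, 280, 280), out_size=(224, 224)):
    """frame: (3, 480, 640) image tensor; box = (top, left, height, width)."""
    top, left, h, w = box
    cropped = TF.crop(frame, top, left, h, w)   # keep the task-relevant area
    return TF.resize(cropped, list(out_size), antialias=True)
```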

Position Generalization

To evaluate spatial generalization, we assess policy performance across different workspace regions and task variants. The following results demonstrate success rates and failure modes for ACT-based policies on assembly tasks.

Position Generalization Results

Failure Modes: GG-BF = Good Grasp & Bad Fit, SL = Slipped, PM = Partial Meshing/Misalignment, MT = Missed Target, PS = Position Shift

Note: For Peg Insertion and Gear Mesh, positions refer to workspace corners (P1: Top-Left, P2: Bottom-Left, P3: Top-Right, P4: Bottom-Right). For Screw Fastening, positions refer to initial height variations (Default, +5mm, +10mm, +15mm, +20mm).

Interpretability

Attention Visualization: To address the interpretability challenge of end-to-end approaches, we employ Grad-CAM to analyze the model's learned representations. As shown below, our model accurately focuses on task-critical elements such as pins, holes, and small gears, effectively capturing manipulation-relevant regions. This demonstrates that using first-person data effectively narrows the domain gap between training and deployment, enabling more efficient human-to-robot skill transfer.

Attention Visualization
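For reference, a minimal Grad-CAM sketch over a ResNet-style visual backbone is shown below; the chosen layer, the scalar score used for backpropagation, and the hook-based implementation are assumptions for illustration, not the authors' visualization code.

```python
# Minimal Grad-CAM sketch (assumed backbone/layer/score; for illustration only).
import torch
import torchvision

def grad_cam(model, layer, image, score_fn):
    """Return a coarse relevance map for `image` of shape (1, 3, H, W)."""
    feats, grads = [], []
    def fwd_hook(module, inputs, output):
        feats.append(output)
        output.register_hook(lambda g: grads.append(g))  # capture feature grads
    handle = layer.register_forward_hook(fwd_hook)
    score = score_fn(model(image))     # scalar summary of the model output
    score.backward()
    handle.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)    # channel importance
    cam = torch.relu((weights * feats[0]).sum(dim=1)).squeeze(0)
    return (cam / (cam.max() + 1e-8)).detach()

# Example: highlight regions driving the pooled response of a ResNet-18.
backbone = torchvision.models.resnet18(weights=None).eval()
cam = grad_cam(backbone, backbone.layer4, torch.rand(1, 3, 224, 224),
               score_fn=lambda out: out.norm())
```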

Trajectory Visualization: The figure below presents trajectory visualization results of the proposed imitation learning framework across three representative assembly and disassembly tasks. Each subfigure illustrates the end-effector's complete 3D motion trajectory, with color gradients representing temporal evolution and critical points (initial, intermediate, and target positions) clearly annotated.

Trajectory Visualization
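A plot of this kind can be reproduced with a few lines of matplotlib, as sketched below on synthetic data; the trajectory, colormap, and annotations are placeholders for the recorded end-effector paths.

```python
# Sketch of a time-colored 3D end-effector trajectory plot (synthetic data).
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0.0, 1.0, 200)                       # normalized time
xyz = np.stack([0.30 + 0.10 * np.cos(4 * t),
                0.10 * np.sin(4 * t),
                0.40 - 0.20 * t], axis=1)            # placeholder path (m)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(xyz[:, 0], xyz[:, 1], xyz[:, 2], c=t, cmap="viridis", s=4)
ax.scatter(*xyz[0], color="green", s=60, label="initial position")
ax.scatter(*xyz[-1], color="red", s=60, label="target position")
ax.set_xlabel("x [m]"); ax.set_ylabel("y [m]"); ax.set_zlabel("z [m]")
ax.legend()
plt.show()
```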

BibTeX

If you find this work helpful, please consider citing:

@article{wu2025RoboAssem,
  title={Mixed Reality-Assisted Human-Robot Skill Transfer via Visuomotor Primitives Toward Physical Intelligence},
  author={Wu, Duidi and Zhao, Qianyou and Shen, Yuliang and Li, Junlai and Zheng, Pai and Qi, Jin and Hu, Jie},
  journal={preprint},
  year={2024}
}