🦾RoboAssem

Mixed Reality-Assisted Human-Robot Skill Transfer via Visuomotor Primitives Toward Physical Intelligence

¹Shanghai Jiao Tong University, ²The Hong Kong Polytechnic University
Equal contribution *Corresponding Author
Figure 1 Framework

Abstract

Industrial assembly is a core of manufacturing and poses significant challenges to the reliability and adaptability of robot systems. As manufacturing shifts toward intelligent production, there is an urgent need for efficient human-to-robot skill transfer methods for high-precision assembly tasks. However, current embodied intelligence research has focused primarily on household tasks, while industrial applications involving dynamic uncertainties and strict control demands remain largely unexplored. To bridge this gap, we propose a real-world skill transfer framework tailored for contact-rich assembly. It integrates an AR-assisted demonstration system for low-cost and diverse data collection, an end-to-end visuomotor imitation learning algorithm for continuous action prediction, and a primitive skill library covering essential operations such as peg insertion, gear meshing, and disassembly. Experiments on six tasks demonstrate high success rates and robust positional generalization. This study explores a novel pathway; we hope it provides valuable insights for future human-robot collaboration and serves as a critical precursor for the integration of Industry 5.0 with embodied intelligence.

Human-in-the-loop Teaching

Robot programming for assembly tasks often requires domain expertise and extensive parameter tuning, posing barriers for non-expert users. To improve accessibility and interactivity, we developed an MR-based human-robot interaction interface that leverages head-mounted displays, MR technology, and contact-force simulation for direct and safe robot teleoperation and demonstration, facilitating the collection of state-action datasets for end-to-end imitation learning (a sketch of a logged record follows the list below).

  • Real-time Feedback: Robot calibration and demonstration visualized via AR interface, ensuring high-quality data collection
  • Cross-embodiment Portability: Supports various end-effectors and extends to different robots for flexible data acquisition
  • Low Cost: Built with consumer-grade VR devices (e.g., Meta Quest 3), easy to deploy and replicate
  • High Precision: Fine-grained operations through AR-based guidance and game-controller-assisted fine adjustments
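As a concrete reference for what the interface records, below is a minimal Python sketch of a per-timestep demonstration record and its aggregation into an episode. The field names, image resolution, and array shapes are illustrative assumptions rather than the released data format.

```python
# Hypothetical per-timestep record captured during MR teleoperation.
# Field names and shapes are assumptions, not the actual dataset schema.
from dataclasses import dataclass
import numpy as np

@dataclass
class DemoStep:
    img_main: np.ndarray      # (480, 640, 3) RGB frame from the main camera
    img_wrist: np.ndarray     # (480, 640, 3) RGB frame from the wrist camera
    ee_state: np.ndarray      # (7,) x, y, z, roll, pitch, yaw, gripper
    force_torque: np.ndarray  # (6,) wrench measured at the wrist F/T sensor
    action: np.ndarray        # (7,) commanded end-effector pose + gripper

def to_episode(steps):
    """Stack a list of DemoStep objects into per-key arrays for training."""
    keys = ("img_main", "img_wrist", "ee_state", "force_torque", "action")
    return {k: np.stack([getattr(s, k) for s in steps]) for k in keys}
```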
VR Application Interface

Human Demonstration Collection

This work implements two complementary teleoperation modes for demonstration collection: 1) VR-based position control and 2) gamepad-based velocity control. The VR mode enables precise 6-DoF trajectory demonstrations via spatial mapping, while the gamepad mode ensures stable, millimeter-scale screw fastening through fine-grained force adjustment.
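The two modes can be summarized by the minimal sketch below. The gains, deadband, and the assumption that controller displacements are already expressed in the robot base frame are illustrative; the actual interface and robot API differ.

```python
# Illustrative mapping for the two teleoperation modes (assumed gains/deadband).
import numpy as np

def vr_position_command(ctrl_pose_delta, ee_pose, scale=1.0):
    """Position mode: apply the 6-DoF controller displacement (assumed to be
    expressed in the robot base frame) to the current end-effector pose.
    Orientation is treated as small Euler-angle deltas for simplicity."""
    return np.asarray(ee_pose) + scale * np.asarray(ctrl_pose_delta)  # (6,)

def gamepad_velocity_command(axes, lin_gain=0.01, ang_gain=0.05, deadband=0.1):
    """Velocity mode: map stick axes in [-1, 1] to a small Cartesian twist,
    enabling millimeter-scale adjustments during screw fastening."""
    axes = np.where(np.abs(axes) < deadband, 0.0, np.asarray(axes))
    v = lin_gain * axes[:3]   # m/s translational velocity
    w = ang_gain * axes[3:6]  # rad/s rotational velocity
    return np.concatenate([v, w])
```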

Human Demonstration Collection

Assembly/Disassembly Primitives

Assembly Tasks

The assembly tasks include peg insertion, gear meshing, and nut screwing, designed with reference to the Factory simulation benchmark. In each task, the robot must grasp parts from the task board and insert them precisely into predefined positions. To evaluate robustness, perceptual variations were introduced during data collection, changing the visual inputs while preserving the core task objectives.

Peg Insertion

Unlike conventional pick-and-place operations, this task requires a downward motion with insertion force. The peg features chamfers on both ends to facilitate guidance and alignment. Components are randomly positioned within four regions p1~p4 of the task board.

Peg Insertion Task

Gear Meshing

The robot assembles a small gear onto a shaft and aligns it with a neighboring gear, requiring high accuracy in both orientation and position.

Gear Meshing Task

Nut Screwing

Due to the tight clearance between the nut and bolt, the task involves multi-surface frictional contact with 6-DoF motion. The initial height of the nut relative to the bolt varies in 5 mm increments.

Nut Screwing Task

Disassembly Tasks

The disassembly tasks, including gear-peg removal, snap-fit disassembly, and U-peg reassembly, are inspired by the part configurations in the AutoMate dataset. The robot must remove each component from the pre-assembled task board and place it into designated containers. In task variants, the target base alternates among three predefined positions p1~p3 to test positional robustness.

Gear-peg Removal

The robot extracts a peg from the gear assembly and places it into a storage box.

Gear-peg Removal Task

Snap-fit Disassembly

The robot extracts the snap-fit component to release the locking mechanism and places it into the transparent storage box.

Snap-fit Disassembly Task

U-peg Reassembly

This task involves two identical bases: the robot first removes a U-peg from one base and inserts it into the other, requiring high-precision alignment.

U-peg Reassembly Task

Imitation Learning

Fine-grained manipulation tasks such as assembly pose substantial challenges for robots due to their stringent requirements on precision and coordination. Traditional solutions often depend on expensive sensors and complex calibration procedures. Imitation learning offers a more flexible alternative but suffers from error accumulation and distributional shift caused by the non-stationarity of human demonstrations. To address these challenges, this work introduces an efficient imitation learning framework that enables robust action sequence prediction from multimodal observations.

Imitation Learning Framework

Action Chunking Transformers Framework

  • Observation Space: The complete observation space 𝒪 integrates visual perception and robot state, formally defined as:
    𝒪 = {I_m, I_w, s_state, s_force}
    where I_m and I_w denote images from the main and wrist cameras, respectively; s_state ∈ ℝ⁷ denotes the end-effector state (position, Euler angles, and gripper status), and s_force ∈ ℝ⁶ corresponds to force/torque measurements.
  • Action Space: The action space 𝒜 is defined as a control vector over the end-effector pose and gripper:
    𝒜 = {x, y, z, roll, pitch, yaw, gripper} ∈ ℝ⁷
    Both 𝒜 and s_state share the same 7-dimensional structure [position, orientation, gripper], enabling a direct mapping from policy outputs to robot control commands. This structural consistency facilitates learning state-action correlations and simplifies deploying trained policies on physical robots.
  • Policy Framework: The proposed policy models the generation of action sequences using a Conditional Variational Autoencoder (CVAE) framework, which learns the conditional distribution π_θ(a_{t:t+k} | s_t) of the next k actions given the current state; a minimal sketch of this chunked prediction is given below.
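For intuition, the following is a minimal PyTorch sketch of an ACT-style CVAE policy head over the observation and action spaces defined above. The feature dimensions, chunk length, and MLP encoder/decoder are simplifying assumptions; the actual ACT architecture uses transformer encoders and decoders over ResNet/ViT image features.

```python
# Minimal sketch of an ACT-style CVAE policy head (hypothetical dimensions; the
# real model uses transformers over ResNet/ViT features rather than MLPs).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChunkedCVAEPolicy(nn.Module):
    def __init__(self, img_dim=512, state_dim=7, force_dim=6,
                 act_dim=7, chunk=20, z_dim=32, hidden=256):
        super().__init__()
        self.chunk, self.act_dim, self.z_dim = chunk, act_dim, z_dim
        cond_dim = 2 * img_dim + state_dim + force_dim   # main + wrist features
        # CVAE encoder: infers a latent "style" z from the demonstrated chunk.
        self.enc = nn.Sequential(
            nn.Linear(chunk * act_dim + state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * z_dim))                 # -> (mu, logvar)
        # Decoder: predicts the whole k-step action chunk from condition and z.
        self.dec = nn.Sequential(
            nn.Linear(cond_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, chunk * act_dim))

    def forward(self, img_main, img_wrist, state, force, actions=None):
        cond = torch.cat([img_main, img_wrist, state, force], dim=-1)
        if actions is not None:                            # training path
            mu, logvar = self.enc(
                torch.cat([actions.flatten(1), state], dim=-1)).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        else:                                              # inference: prior mean
            mu = logvar = None
            z = torch.zeros(cond.shape[0], self.z_dim, device=cond.device)
        pred = self.dec(torch.cat([cond, z], dim=-1))
        return pred.view(-1, self.chunk, self.act_dim), mu, logvar

def cvae_loss(pred, target, mu, logvar, kl_weight=10.0):
    """L1 reconstruction on the action chunk plus a KL regularizer."""
    recon = F.l1_loss(pred, target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl_weight * kl
```

At test time, z is set to the prior mean and the predicted chunk is executed (or ensembled) before re-planning, following common action-chunking practice.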

Real World Experiments

Execution Demonstrations

The following videos demonstrate real-world execution of assembly and disassembly tasks from both wrist camera and main camera perspectives, showcasing the precision and reliability of the learned policies across different manipulation primitives.

Assembly Execution

Peg Insertion

Gear Meshing

Screw Fastening

Wrist Camera View

Main Camera View

Disassembly Execution

Gear-peg Removal

Snap-fit Disassembly

U-peg Reassembly

Wrist Camera View

Main Camera View

Module Ablation

This study presents a systematic ablation analysis on peg insertion to assess the effects of data scale, input preprocessing, and network architecture, using the original ACT model as baseline. Experiments follow a unified protocol with 24 trials per setting across four workspace regions to evaluate spatial generalization. Aggregated results are shown in the table below.

| Configuration | Training Data | Input | Epochs | Backbone | Left Success | Right Success | Overall Success |
|---|---|---|---|---|---|---|---|
| ACT (Baseline) | 50 demos | 480×640 | 50K | ResNet | 23.5% | 20.0% | 21.9% |
| + Augmented Data | 100 demos | 480×640 | 100K | ResNet | 41.7% | 41.7% | 41.7% |
| + Cropped Input | 50 demos | 224×224 | 100K | ResNet | 58.3% | 83.3% | 70.8% |
| + Vision Transformer | 50 demos | 224×224 | 100K | ViT | 75.0% | 83.3% | 79.2% |
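The "+ Cropped Input" row corresponds to reducing the raw 480×640 frames to 224×224 before they reach the backbone. The sketch below illustrates one way to do this with torchvision; the crop box is a placeholder assumption, since the exact workspace region is not specified here.

```python
# Illustrative crop-and-resize preprocessing for the "+ Cropped Input" setting.
# The crop box is a placeholder; the actual workspace region may differ.
import torchvision.transforms.functional as TF

def preprocess(frame, box=(120, 180, 280, 280), out_size=(224, 224)):
    """frame: (3, 480, 640) image tensor; box = (top, left, height, width)."""
    top, left, h, w = box
    cropped = TF.crop(frame, top, left, h, w)   # keep the task-relevant area
    return TF.resize(cropped, list(out_size), antialias=True)
```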

Position Generalization

To evaluate spatial generalization, we assess policy performance across different workspace regions and task variants. The following results demonstrate success rates and failure modes for ACT-based policies on assembly tasks.

Position Generalization Results

Failure Modes: GG-BF = Good Grasp & Bad Fit, SL = Slipped, PM = Partial Meshing/Misalignment, MT = Missed Target, PS = Position Shift

Note: For Peg Insertion and Gear Mesh, positions refer to workspace corners (P1: Top-Left, P2: Bottom-Left, P3: Top-Right, P4: Bottom-Right). For Screw Fastening, positions refer to initial height variations (Default, +5mm, +10mm, +15mm, +20mm).

Interpretability

Attention Visualization: To address the interpretability challenge of end-to-end approaches, we employ Grad-CAM to analyze the model's learned representations. As shown below, our model accurately focuses on task-critical elements such as pins, holes, and small gears, effectively capturing manipulation-relevant regions. This demonstrates that using first-person data effectively narrows the domain gap between training and deployment, enabling more efficient human-to-robot skill transfer.

Attention Visualization
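For reference, a minimal Grad-CAM sketch over a ResNet-style visual backbone is shown below; the chosen layer, the scalar score used for backpropagation, and the hook-based implementation are assumptions for illustration, not the authors' visualization code.

```python
# Minimal Grad-CAM sketch (assumed backbone/layer/score; for illustration only).
import torch
import torchvision

def grad_cam(model, layer, image, score_fn):
    """Return a coarse relevance map for `image` of shape (1, 3, H, W)."""
    feats, grads = [], []
    def fwd_hook(module, inputs, output):
        feats.append(output)
        output.register_hook(lambda g: grads.append(g))  # capture feature grads
    handle = layer.register_forward_hook(fwd_hook)
    score = score_fn(model(image))     # scalar summary of the model output
    score.backward()
    handle.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)    # channel importance
    cam = torch.relu((weights * feats[0]).sum(dim=1)).squeeze(0)
    return (cam / (cam.max() + 1e-8)).detach()

# Example: highlight regions driving the pooled response of a ResNet-18.
backbone = torchvision.models.resnet18(weights=None).eval()
cam = grad_cam(backbone, backbone.layer4, torch.rand(1, 3, 224, 224),
               score_fn=lambda out: out.norm())
```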

Trajectory Visualization: The figure below presents trajectory visualization results of the proposed imitation learning framework across three representative assembly and disassembly tasks. Each subfigure illustrates the end-effector's complete 3D motion trajectory, with color gradients representing temporal evolution and critical points (initial, intermediate, and target positions) clearly annotated.

Trajectory Visualization
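A plot of this kind can be reproduced with a few lines of matplotlib, as sketched below on synthetic data; the trajectory, colormap, and annotations are placeholders for the recorded end-effector paths.

```python
# Sketch of a time-colored 3D end-effector trajectory plot (synthetic data).
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0.0, 1.0, 200)                       # normalized time
xyz = np.stack([0.30 + 0.10 * np.cos(4 * t),
                0.10 * np.sin(4 * t),
                0.40 - 0.20 * t], axis=1)            # placeholder path (m)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(xyz[:, 0], xyz[:, 1], xyz[:, 2], c=t, cmap="viridis", s=4)
ax.scatter(*xyz[0], color="green", s=60, label="initial position")
ax.scatter(*xyz[-1], color="red", s=60, label="target position")
ax.set_xlabel("x [m]"); ax.set_ylabel("y [m]"); ax.set_zlabel("z [m]")
ax.legend()
plt.show()
```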

BibTeX

If you find this work helpful, please consider citing:

@article{wu2025RoboAssem,
  title={Mixed Reality-Assisted Human-Robot Skill Transfer via Visuomotor Primitives Toward Physical Intelligence},
  author={Wu, Duidi and Zhao, Qianyou and Shen, Yuliang and Li, Junlai and Zheng, Pai and Qi, Jin and Hu, Jie},
  journal={preprint},
  year={2024}
}