College

College of Engineering and Polymer Science

Date of Last Revision

2026-04-28 12:34:10

Major

Mechanical Engineering

Honors Course

MECE 471:003

Number of Credits

2

Degree Name

Bachelor of Science

Date of Expected Graduation

Spring 2026

Abstract

This report presents the design, development, and deployment of a teleoperated robotic arm system capable of vision-based imitation learning. The system uses a leader-follower architecture built on the myArm C650 and myArm M750 robots manufactured by Elephant Robotics: the operator guides the leader arm, while the follower reproduces actions learned from demonstrations. Data collection consisted of synchronized joint-angle recordings and multi-camera video streams capturing human demonstrations.

These demonstrations were processed with Action Chunking with Transformers (ACT), a deep imitation learning algorithm. Its key capability is predicting chunks of future actions from visual and proprioceptive data alone, using a Conditional Variational Autoencoder (CVAE) to model the distribution of multi-step action trajectories.

The first stage of implementation adapted the original ALOHA implementation proposed by Stanford University researchers. Although training converged successfully, deploying the action-generation inference proved infeasible because of compatibility problems between the dataset format and the format required for robot interaction via pymycobot. After analyzing this issue, the pipeline was migrated to the official LeRobot implementation from Hugging Face, which guarantees temporal alignment of the dataset and provides a uniform interface to the model at every stage of the workflow.

The system was designed for two tasks. The first, a straightforward pick-and-place task, was trained for 80,000 steps on 55 demonstrations and served as a validation case. The second task, and the main application of the Senior Design project, was peg-in-hole insertion, trained for 100,000 steps on 76 demonstrations. During training, the total loss decreased from 4.5 to 0.04 over 100,000 steps; in particular, the L1 reconstruction loss decreased from 0.75 to 0.025, and no signs of overfitting were detected in our experiments. The autonomous Task 2 policy achieved a success rate of 85% (26 out of 30 trials), with an average completion time of 19 seconds per trial. Notably, the model learned the workspace itself: every training demonstration was performed in an uncluttered environment containing only the peg and the hole, yet when random distractor objects were introduced during testing, the robot self-corrected its motion, avoiding the distractors and locating the correct object.

The robot therefore learned to recognize objects rather than memorize their coordinates. These results demonstrated that vision-based imitation learning is feasible even on relatively affordable, single-arm hardware, and that ACT policies can generalize beyond conditions seen during training.

Research Sponsor

Yalin Dong

First Reader

Jutta Luettmer-Strathmann

Second Reader

Manigandan Kannan

Honors Faculty Advisor

Scott Sawyer

Proprietary and/or Confidential Information

No

Community Engaged Scholarship

No
