Talk Through It:
End User Directed Manipulation Learning

Carl Winge, Adam Imdieke, Bahaa Aldeeb, Dongyeop Kang, Karthik Desingh
University of Minnesota

Talk Through It is a robot learning framework that enables end users to teach robots skills and tasks with natural language instructions.


Training generalist robot agents is immensely difficult because they must perform a huge range of tasks in many different environments. We propose instead training robots selectively, based on end-user preferences.

Given a factory model that lets an end user instruct a robot to perform lower-level actions (e.g. ‘Move left’), we show that end users can collect demonstrations using language to train their home model for higher-level tasks specific to their needs (e.g. ‘Open the top drawer and put the block inside’). We demonstrate this hierarchical robot learning framework on robot manipulation tasks using RLBench environments. Our method results in a 16% improvement in skill success rates compared to a baseline method.
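The core idea is that a user-taught skill is just a recorded sequence of low-level language commands the factory model already understands. The sketch below illustrates this with hypothetical names (the command vocabulary, `record_skill`, and its data format are illustrative assumptions, not the paper's actual API):

```python
# Illustrative sketch of collecting a language demonstration.
# A "home" skill is stored as an ordered list of primitive
# commands drawn from the factory model's vocabulary.

FACTORY_PRIMITIVES = {
    "move left", "move right", "move up", "move down",
    "move forward", "move back", "open gripper",
    "close gripper", "grasp the block",
}

def record_skill(name, commands):
    """Validate a user demonstration and store it as primitive steps."""
    for cmd in commands:
        if cmd not in FACTORY_PRIMITIVES:
            raise ValueError(f"unknown primitive: {cmd!r}")
    return {"skill": name, "steps": list(commands)}

demo = record_skill(
    "pick up the block",
    ["move forward", "open gripper", "grasp the block",
     "close gripper", "move up"],
)
```

The home model would then be trained on such demonstrations, so the higher-level instruction ("pick up the block") maps to the whole sequence rather than to a single primitive.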

In further experiments, we explore using the large vision-language model (VLM) Bard to automatically break tasks down into sequences of lower-level instructions, aiming to bypass end-user involvement. The VLM is unable to break tasks down to our lowest level, but it achieves good results breaking high-level tasks into mid-level skills.
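One way to elicit such a decomposition is to give the VLM the task plus a closed list of skills it may use. The snippet below builds a prompt of that shape; the skill names and prompt wording are illustrative assumptions, not the exact prompt used in the paper:

```python
# Hypothetical prompt template asking a VLM to decompose a
# high-level task into an ordered list of mid-level skills.

MID_LEVEL_SKILLS = [
    "open the top drawer",
    "pick up the block",
    "place the block in the drawer",
    "close the top drawer",
]

def build_decomposition_prompt(task, skills):
    """Construct a constrained-decomposition prompt for a VLM."""
    skill_list = "\n".join(f"- {s}" for s in skills)
    return (
        "You control a robot arm. Break the task below into an "
        "ordered list of skills, using only skills from this list:\n"
        f"{skill_list}\n\n"
        f"Task: {task}\n"
        "Skills:"
    )

prompt = build_decomposition_prompt(
    "Open the top drawer and put the block inside", MID_LEVEL_SKILLS)
```

Constraining the output to a known skill vocabulary lets the returned steps be matched directly against skills the home model has already learned.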


Frequently Asked Questions

What are the main contributions of this work?

  • We present a method for training a robot to respond to observation-dependent and observation-independent language commands for primitive actions.
  • We show language commands for primitive actions can be used in sequence to collect demonstrations for training higher-level skills and tasks. This technique allows non-expert end users to train a robot according to their personal needs.
  • Our results show that our hierarchical training method leads to performance gains over a state-of-the-art baseline method that does not use hierarchical training.
  • We demonstrate a VLM can successfully chain skills learned by our model to complete longer-horizon tasks.

Why have end users direct manipulation learning?

There are an enormous number of tasks an end user might want a robot to complete in their home. We can't pretrain a model for every possibility. Even for tasks that are pretrained, the end user might want to alter the way the task is performed.

Why use natural language to collect demonstrations instead of teleoperation?

Teleoperation requires additional hardware which can be expensive and challenging to set up. Even relatively cheap and user-friendly teleoperation hardware doesn't compare to the convenience and ease of use of natural language.


We show the first 5 evaluation episodes for the multi-skill and multi-task models that had the highest average success rates across all skills or tasks.



BibTeX

@misc{talkthroughit,
      title={Talk Through It: End User Directed Manipulation Learning},
      author={Carl Winge and Adam Imdieke and Bahaa Aldeeb and Dongyeop Kang and Karthik Desingh},
}