Abstract

Enabling robots to follow spoken instructions for cloth manipulation is a critical step toward intuitive human-robot interaction in real-world settings. However, the task remains challenging due to the deformable nature of fabric and the need for accurate, language-grounded control. In this work, we focus on speech-driven cloth folding, a task that demands both long-horizon reasoning and fine-grained visual grounding. We propose a unified framework that integrates an Automatic Speech Recognition (ASR) module, a Large Language Model (LLM)-based planner, a Vision-Language Model (VLM)-based perception system, and a task execution module. Spoken instructions are first transcribed into text by the ASR module and then parsed into structured sub-tasks by the LLM-based planner. For visual grounding, we use a frozen SigLIP2 encoder augmented with a bidirectional cross-attention fusion module that aligns textual instructions with RGB-D visual input. The task execution module then carries out the generated sub-tasks sequentially to complete multi-step cloth folding. To specialize the model for cloth manipulation, we apply Weight-Decomposed Low-Rank Adaptation (DoRA), a lightweight fine-tuning strategy that improves generalization. We validate our approach with cloth folding evaluations ranging from simulation to real robot deployment. In simulation, our method outperforms state-of-the-art (SOTA) baselines, achieving improvements of 2.23%, 1.87%, and 33.3% on seen instructions, unseen instructions, and unseen tasks, respectively. On a real robot, our method successfully executes multi-step folding sequences guided by spoken instructions across a wide range of cloth materials and configurations.
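The DoRA update mentioned above can be sketched in a few lines. This is a minimal NumPy illustration of the weight decomposition only (frozen weight plus a low-rank direction update, rescaled by a learned per-column magnitude); the shapes, initialization, and variable names here are illustrative assumptions, not the actual adapted SigLIP2 layers.

```python
import numpy as np

def dora_update(W0, A, B, m):
    """Weight-Decomposed Low-Rank Adaptation (DoRA), sketched.

    W0: frozen pretrained weight, shape (d_out, d_in)
    A:  trainable low-rank factor, shape (r, d_in)
    B:  trainable low-rank factor, shape (d_out, r)
    m:  trainable magnitude vector, shape (d_in,) -- one scale per column
    """
    V = W0 + B @ A                                       # direction updated via a LoRA branch
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)  # column-wise norm of the direction
    return m * (V / col_norm)                            # rescale each unit column by its magnitude

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2
W0 = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))            # zero-init, so training starts from the pretrained weight
m = np.linalg.norm(W0, axis=0)      # magnitudes initialized from the frozen weight
W = dora_update(W0, A, B, m)
assert np.allclose(W, W0)           # with B = 0, the adapted weight matches W0
```

Only `A`, `B`, and `m` are trained, so the number of tunable parameters stays small relative to the frozen backbone, which is what makes the fine-tuning lightweight.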


Approach Overview

Fig. 1. An illustration of the robotic-arm embodied LLM system in the physical world, showcasing the integrated workflow of Automatic Speech Recognition, Task Planning, Visual Perception, and Action Execution in a cloth manipulation task.
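The planning step in this workflow can be illustrated with a toy stand-in: decomposing a transcribed instruction into an ordered list of sub-tasks for execution. The rule-based split below is purely illustrative (the actual system uses an LLM for this step), and the function name is a hypothetical placeholder.

```python
import re

def plan_subtasks(instruction: str) -> list[str]:
    """Toy stand-in for the LLM-based planner: split a transcribed
    instruction into ordered sub-tasks at sequencing conjunctions.
    (Illustrative only -- the real planner is an LLM.)"""
    parts = re.split(r",?\s*(?:and then|then)\s+", instruction.strip().rstrip("."))
    return [p.strip() for p in parts if p.strip()]

# One of the spoken instructions used in the experiments:
text = "Fold both sleeves inside and then fold the T-Shirt in half from bottom to top."
print(plan_subtasks(text))
# ['Fold both sleeves inside', 'fold the T-Shirt in half from bottom to top']
```

Each resulting sub-task is then grounded by the perception module and executed in order, which is how a single spoken sentence drives a multi-step folding sequence.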


Architecture of the Visual Perception Module

Fig. 2. The Visual Perception module uses a frozen SigLIP2 model to extract tokens from an RGB-D image and a natural-language instruction. The instruction is split at the conjunction "and" into pick and place segments. Each segment is fused with the visual tokens via bidirectional cross-attention, jointly aligning textual and visual features. To adapt the frozen SigLIP2 model to cloth manipulation, DoRA is employed for efficient fine-tuning. The fused features are decoded through convolutional and upsampling layers to predict the corresponding pick and place positions for precise cloth manipulation.
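The bidirectional fusion step can be sketched as follows: text tokens attend over visual tokens and vice versa, so each modality is aligned against the other. This NumPy sketch uses raw tokens in place of learned query/key/value projections, and the token counts and dimensions are assumptions for illustration, not the actual SigLIP2 feature shapes.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Single-head cross-attention: queries attend over keys_values.
    (Simplified: no learned Q/K/V projections.)"""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # scaled dot-product scores
    return softmax(scores, axis=-1) @ keys_values   # attention-weighted values

def bidirectional_fusion(text_tok, vis_tok):
    """Fuse the two modalities in both directions."""
    text_fused = cross_attend(text_tok, vis_tok)    # text -> vision
    vis_fused = cross_attend(vis_tok, text_tok)     # vision -> text
    return text_fused, vis_fused

rng = np.random.default_rng(0)
vis = rng.standard_normal((49, 16))  # e.g. 7x7 patch tokens from the RGB-D image
txt = rng.standard_normal((5, 16))   # tokens of one segment ("pick" or "place")
t_out, v_out = bidirectional_fusion(txt, vis)
print(t_out.shape, v_out.shape)      # (5, 16) (49, 16)
```

In the full module, the vision-conditioned features produced this way are decoded through convolutional and upsampling layers into spatial heatmaps, from which the pick and place positions are read off.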


Real-world experimental setup.

Real World Experiments

1. Performance on T-Shirt Fold (TSF) on Various T-Shirts.

Example 1

Fold both sleeves inside and then fold the T-Shirt in half from bottom to top.

Example 2

Fold both sleeves inside and then fold the T-Shirt in half from bottom to top.

Example 3

Fold both sleeves inside and then fold the T-Shirt in half from bottom to top.

2. Performance on Trousers Fold (TF) on Various Trousers.

Example 1

Fold the Trousers in half, then fold again from the waistband down to the hem.

Example 2

Fold the Trousers in half, then fold again from the waistband down to the hem.

Example 3

Fold the Trousers in half, then fold again from the waistband down to the hem.

3. Performance on Four Corners Inward Fold (FCIF) on Various Towels.

Example 1

Fold all corners of the square to the center one by one.

Example 2

Fold all corners of the square to the center one by one.

Example 3

Fold all corners of the square to the center one by one.

4. Performance on Double Straight Fold (DSF) on Various Towels.

Example 1

Fold the square in half from up to down.

Example 2

Fold the square in half from up to down.

Example 3

Fold the square in half from up to down.

5. Performance on Double Triangle Fold (DTF) on Various Towels.

Example 1

Fold the fabric twice to form a double triangle shape.

Example 2

Fold the fabric twice to form a double triangle shape.

Example 3

Fold the fabric twice to form a double triangle shape.

Generalization to Unseen Instruction

TSF

Tuck both sleeves inward, then fold the T-Shirt vertically from bottom to top.

TF

First fold the trousers lengthwise, then fold them down from the waistband to the hem.

FCIF

Bring each corner of the square to the center, folding them one by one.

DSF

Fold the square downward from the top edge to the bottom.

DTF

Fold the fabric in half twice to create a clean double triangle.

Generalization to Unseen Task

TSF

Fold both sleeves inside and then fold the T-Shirt in half from left to right.

TF

Fold the trousers from the legs up to the waistband.

FCIF

Fold two opposite corners to the center first and then fold the rest one by one.

DSF

Fold the square in half from bottom to top and then fold it in half again.

DTF

Fold square right bottom to left top and then left to right into a double triangle.

Generalization to Different Angles

TSF

Fold both sleeves inside and then fold the T-Shirt in half from top to bottom.

TF

Fold the Trousers in half, then fold again from the waistband down to the hem.

FCIF

Fold all corners of the square to the center one by one.

DSF

Fold the square in half from up to down.

DTF

Fold the fabric twice to form a double triangle shape.