Abstract

Enabling robots to follow spoken instructions for cloth manipulation is a critical step toward intuitive human-robot interaction in real-world settings. However, this task remains challenging due to the deformable nature of fabric and the need for accurate, language-grounded control. In this work, we focus on speech-driven cloth folding, a task that demands both long-horizon reasoning and fine-grained visual grounding. We propose a unified framework that integrates an Automatic Speech Recognition (ASR) module, a Large Language Model (LLM)-based planner, a Vision-Language Model (VLM)-based perception system, and a task execution module. Spoken instructions are first transcribed into text by the ASR module and then parsed into structured sub-tasks by the LLM-based planner. For visual grounding, we use a frozen SigLIP2 encoder augmented with a bidirectional cross-attention fusion module that aligns textual instructions with RGB-D visual input. The task execution module executes the generated sub-tasks in sequence to complete multi-step cloth folding. To specialize the model for cloth manipulation, we apply Weight-Decomposed Low-Rank Adaptation (DoRA), a lightweight fine-tuning strategy, to improve generalization. We evaluate the approach on cloth folding both in simulation and on a real robot. In simulation, the proposed method achieves the best performance among all compared baselines, with improvements of 2.1% over the strongest baseline on seen tasks and 17.1% on unseen tasks. On a real robot, it robustly executes multi-step folding sequences from language instructions across diverse cloth materials and configurations, demonstrating strong generalization in practical scenarios.
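For readers unfamiliar with DoRA, the following is a minimal pure-Python sketch of its weight reparameterization as it is usually stated: a frozen pretrained weight W0 receives a low-rank update BA, each column of the sum is normalized to unit length, and a learnable magnitude vector m rescales the columns. The function names and the toy 2x2 numbers are illustrative assumptions, not the authors' implementation, which would presumably apply this to SigLIP2 layers in a tensor library.

```python
import math

def matmul(a, b):
    """Plain nested-list matrix product of a (n x r) and b (r x k)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def column_norms(w):
    """L2 norm of every column of w."""
    return [math.sqrt(sum(row[j] ** 2 for row in w)) for j in range(len(w[0]))]

def dora_weight(w0, b, a, m):
    """Return m * (W0 + BA) / ||W0 + BA||_c, with the norm taken per column.

    Assumes no column of W0 + BA is all zeros (that would divide by zero).
    """
    delta = matmul(b, a)
    v = [[w0[i][j] + delta[i][j] for j in range(len(w0[0]))]
         for i in range(len(w0))]
    norms = column_norms(v)
    return [[m[j] * v[i][j] / norms[j] for j in range(len(v[0]))]
            for i in range(len(v))]

# Toy example (hypothetical values): frozen weight, rank-1 adapter.
w0 = [[3.0, 0.0], [4.0, 2.0]]
m = column_norms(w0)           # magnitude initialized from the frozen weight
a = [[0.5, -0.5]]              # r x k
b_zero = [[0.0], [0.0]]        # d x r, zero at initialization
b_trained = [[1.0], [0.0]]     # stand-in for a trained adapter

w_init = dora_weight(w0, b_zero, a, m)        # equals W0 at initialization
w_adapted = dora_weight(w0, b_trained, a, m)  # direction changed, magnitude kept
```

With B set to zero the adapted weight reproduces W0, so fine-tuning starts from the pretrained model. After B changes, the low-rank term only alters the direction of each column, while the column magnitudes remain under the control of m.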


Approach Overview

Fig. 1. An illustration of the robotic-arm embodied LLM system in the physical world, showcasing the integrated workflow of Automatic Speech Recognition, Task Planning, Visual Perception, and Action Execution in a cloth manipulation task.
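The workflow in Fig. 1 can be read as a simple loop over planned sub-tasks. The sketch below is a hedged illustration only: every function name and signature (`transcribe`, `plan`, `observe`, `ground`, `execute`) is a hypothetical placeholder for the ASR module, LLM planner, camera, perception module, and robot controller, and capturing a fresh observation before each sub-task is an assumption rather than a detail stated here.

```python
from typing import Any, Callable, List, Tuple

Pixel = Tuple[int, int]

def run_pipeline(
    audio: Any,
    transcribe: Callable[[Any], str],                  # ASR: audio -> text
    plan: Callable[[str], List[str]],                  # LLM planner: text -> sub-tasks
    observe: Callable[[], Any],                        # camera: -> RGB-D frame
    ground: Callable[[Any, str], Tuple[Pixel, Pixel]], # perception: -> (pick, place)
    execute: Callable[[Pixel, Pixel], None],           # robot: one pick-and-place
) -> List[Tuple[str, Pixel, Pixel]]:
    """Run ASR, planning, grounding, and execution in sequence."""
    text = transcribe(audio)
    executed = []
    for sub_task in plan(text):
        frame = observe()  # assumed: re-observe, since each fold changes the cloth
        pick, place = ground(frame, sub_task)
        execute(pick, place)
        executed.append((sub_task, pick, place))
    return executed

# Stub components so the sketch runs without hardware or models.
moves: List[Tuple[Pixel, Pixel]] = []
result = run_pipeline(
    audio=b"",
    transcribe=lambda _audio: "Fold the square in half from up to down.",
    plan=lambda text: ["pick the top edge and place it on the bottom edge"],
    observe=lambda: None,
    ground=lambda _frame, _sub_task: ((10, 20), (10, 80)),
    execute=lambda pick, place: moves.append((pick, place)),
)
```

Passing the components in as callables keeps the loop independent of any particular ASR, LLM, or robot API, which is why no real library is referenced in the sketch.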


Architecture of the Visual Perception Module

Fig. 2. The Visual Perception module uses a frozen SigLIP2 model to extract tokens from an RGB-D image and a natural language instruction. The instruction is split at the conjunction "and" into pick and place segments. Each segment is fused with the visual tokens via bidirectional cross-attention, which jointly aligns textual and visual features. To adapt the frozen SigLIP2 model to cloth manipulation, DoRA is employed for efficient fine-tuning. The fused features are decoded through convolutional and upsampling layers to predict the corresponding pick and place positions for precise cloth manipulation.
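The instruction-splitting step in Fig. 2 can be sketched as follows. This is a minimal illustration that assumes each sub-task contains a standalone "and" separating the pick phrase from the place phrase. The function name `split_pick_place` and the example sentence are hypothetical, and the actual sub-task wording produced by the planner may differ.

```python
import re
from typing import Tuple

def split_pick_place(instruction: str) -> Tuple[str, str]:
    """Split a sub-task at its first standalone 'and' into (pick, place)."""
    parts = re.split(r"\s+and\s+", instruction.strip(),
                     maxsplit=1, flags=re.IGNORECASE)
    if len(parts) != 2 or not parts[0] or not parts[1]:
        raise ValueError(f"no 'and' conjunction found in: {instruction!r}")
    pick = parts[0].strip()
    place = parts[1].strip().rstrip(".")  # drop a trailing period, if any
    return pick, place

# Hypothetical sub-task phrasing, for illustration only.
pick_text, place_text = split_pick_place(
    "Pick the left sleeve and place it on the center of the shirt."
)
```

Splitting only at the first conjunction keeps any later "and" inside the place phrase. Each of the two segments would then be fused with the visual tokens separately, as the caption describes.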


Real-World Experimental Setup


(1). Seen Instructions (Seen Tasks)

1. Performance on T-Shirt Fold on Various T-Shirts.

Example 1

Fold both sleeves inside and then fold the T-Shirt in half from bottom to top.

Example 2

Fold both sleeves inside and then fold the T-Shirt in half from bottom to top.

Example 3

Fold both sleeves inside and then fold the T-Shirt in half from bottom to top.

2. Performance on Trousers Fold on Various Trousers.

Example 1

Fold the Trousers in half, then fold again from the waistband down to the hem.

Example 2

Fold the Trousers in half, then fold again from the waistband down to the hem.

Example 3

Fold the Trousers in half, then fold again from the waistband down to the hem.

3. Performance on Four Corners Inward Fold on Various Towels.

Example 1

Fold all corners of the square to the center one by one.

Example 2

Fold all corners of the square to the center one by one.

Example 3

Fold all corners of the square to the center one by one.

4. Performance on Double Straight Fold on Various Towels.

Example 1

Fold the square in half from up to down.

Example 2

Fold the square in half from up to down.

Example 3

Fold the square in half from up to down.

5. Performance on Double Triangle Fold on Various Towels.

Example 1

Fold the fabric twice to form a double triangle shape.

Example 2

Fold the fabric twice to form a double triangle shape.

Example 3

Fold the fabric twice to form a double triangle shape.

6. Performance on Jackets and Dresses.

Example 1

Fold both sleeves inside and then fold the jacket in half from top to bottom.

Example 2

Fold both sleeves inside and then fold the jacket in half from top to bottom.

Example 3

Fold the dress in half from the center of the neckline to the center of the hem.

Example 4

Fold the dress in half from the center of the neckline to the center of the hem.

(2). Generalization to Unseen Instructions (Seen Tasks)

Tuck both sleeves inward, then fold the T-Shirt vertically from bottom to top.

First fold the trousers lengthwise, then fold them down from the waistband to the hem.

Bring each corner of the square to the center, folding them one by one.

Fold the square downward from the top edge to the bottom.

Fold the fabric in half twice to create a clean double triangle.

Fold both sleeves inward, then fold the jacket in half from the collar down toward the hem.

Fold the dress in half by folding the center of the neckline down to the center of the hem.

(3). Generalization to Unseen Tasks (Seen Tasks)

Fold both sleeves inside and then fold the T-Shirt in half from left to right.

Fold the trousers from the legs up to the waistband.

Fold two opposite corners to the center first and then fold the rest one by one.

Fold the square in half from bottom to top and then fold it in half again.

Fold square right bottom to left top and then left to right into a double triangle.

Fold the left sleeve of the jacket over to the right sleeve, aligning the hems.

Fold the left hem of the dress over to the right hem, aligning the shoulders.

(4). Generalization to Different Angles (Seen Tasks)

Fold both sleeves inside and then fold the T-Shirt in half from top to bottom.

Fold the Trousers in half, then fold again from the waistband down to the hem.

Fold all corners of the square to the center one by one.

Fold the square in half from up to down.

Fold the fabric twice to form a double triangle shape.

Fold both sleeves inside and then fold the jacket in half from top to bottom.

Fold the dress in half from the center of the neckline to the center of the hem.