Abstract

Enabling robots to follow spoken instructions for cloth manipulation is a critical step toward intuitive human-robot interaction in real-world settings. However, the task remains challenging due to the deformable nature of fabric and the need for accurate, language-grounded control. In this work, we focus on speech-driven cloth folding, a task that demands both long-horizon reasoning and fine-grained visual grounding. We propose a unified framework that integrates an Automatic Speech Recognition (ASR) module, a Large Language Model (LLM)-based planner, a Vision-Language Model (VLM)-based perception system, and a task execution module. Spoken instructions are first transcribed into text by the ASR module and then parsed into structured sub-tasks by the LLM-based planner. For visual grounding, we use a frozen SigLIP2 encoder augmented with a bidirectional cross-attention fusion module that aligns textual instructions with RGB-D visual input. The task execution module then carries out the generated sub-tasks sequentially to complete multi-step cloth folding. To specialize the model for cloth manipulation, we apply Weight-Decomposed Low-Rank Adaptation (DoRA), a lightweight fine-tuning strategy that improves generalization. We validate our approach with cloth folding evaluations ranging from simulation to real robot deployment. In simulation, our method outperforms state-of-the-art (SOTA) baselines, achieving improvements of 2.23%, 1.87%, and 33.3% on seen instructions, unseen instructions, and unseen tasks, respectively. On a real robot, our method successfully executes multi-step folding sequences guided by spoken instructions across a wide range of cloth materials and configurations.
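The DoRA update mentioned above can be sketched in a few lines. This is a minimal NumPy illustration of the weight decomposition only (frozen weight plus a low-rank direction update, rescaled by a learned per-column magnitude); the shapes, initialization, and variable names here are illustrative assumptions, not the actual adapted SigLIP2 layers.

```python
import numpy as np

def dora_update(W0, A, B, m):
    """Weight-Decomposed Low-Rank Adaptation (DoRA), sketched.

    W0: frozen pretrained weight, shape (d_out, d_in)
    A:  trainable low-rank factor, shape (r, d_in)
    B:  trainable low-rank factor, shape (d_out, r)
    m:  trainable magnitude vector, shape (d_in,) -- one scale per column
    """
    V = W0 + B @ A                                       # direction updated via a LoRA branch
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)  # column-wise norm of the direction
    return m * (V / col_norm)                            # rescale each unit column by its magnitude

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 6, 2
W0 = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))            # zero-init, so training starts from the pretrained weight
m = np.linalg.norm(W0, axis=0)      # magnitudes initialized from the frozen weight
W = dora_update(W0, A, B, m)
assert np.allclose(W, W0)           # with B = 0, the adapted weight matches W0
```

Only `A`, `B`, and `m` are trained, so the number of tunable parameters stays small relative to the frozen backbone, which is what makes the fine-tuning lightweight.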


Approach Overview

Fig. 1. An illustration of the robotic-arm embodied LLM system in the physical world, showcasing the integrated workflow of Automatic Speech Recognition, Task Planning, Visual Perception, and Action Execution in a cloth manipulation task.
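The planning step in this workflow can be illustrated with a toy stand-in: decomposing a transcribed instruction into an ordered list of sub-tasks for execution. The rule-based split below is purely illustrative (the actual system uses an LLM for this step), and the function name is a hypothetical placeholder.

```python
import re

def plan_subtasks(instruction: str) -> list[str]:
    """Toy stand-in for the LLM-based planner: split a transcribed
    instruction into ordered sub-tasks at sequencing conjunctions.
    (Illustrative only -- the real planner is an LLM.)"""
    parts = re.split(r",?\s*(?:and then|then)\s+", instruction.strip().rstrip("."))
    return [p.strip() for p in parts if p.strip()]

# One of the spoken instructions used in the experiments:
text = "Fold both sleeves inside and then fold the T-Shirt in half from bottom to top."
print(plan_subtasks(text))
# ['Fold both sleeves inside', 'fold the T-Shirt in half from bottom to top']
```

Each resulting sub-task is then grounded by the perception module and executed in order, which is how a single spoken sentence drives a multi-step folding sequence.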


Architecture of the Visual Perception Module

Fig. 2. The Visual Perception module uses a frozen SigLIP2 model to extract tokens from an RGB-D image and a natural-language instruction. The instruction is split at the conjunction "and" into pick and place segments. Each segment is fused with the visual tokens via bidirectional cross-attention, jointly aligning textual and visual features. To adapt the frozen SigLIP2 model to cloth manipulation, DoRA is employed for efficient fine-tuning. The fused features are decoded through convolutional and upsampling layers to predict the corresponding pick and place positions for precise cloth manipulation.
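The bidirectional fusion step can be sketched as follows: text tokens attend over visual tokens and vice versa, so each modality is aligned against the other. This NumPy sketch uses raw tokens in place of learned query/key/value projections, and the token counts and dimensions are assumptions for illustration, not the actual SigLIP2 feature shapes.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Single-head cross-attention: queries attend over keys_values.
    (Simplified: no learned Q/K/V projections.)"""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # scaled dot-product scores
    return softmax(scores, axis=-1) @ keys_values   # attention-weighted values

def bidirectional_fusion(text_tok, vis_tok):
    """Fuse the two modalities in both directions."""
    text_fused = cross_attend(text_tok, vis_tok)    # text -> vision
    vis_fused = cross_attend(vis_tok, text_tok)     # vision -> text
    return text_fused, vis_fused

rng = np.random.default_rng(0)
vis = rng.standard_normal((49, 16))  # e.g. 7x7 patch tokens from the RGB-D image
txt = rng.standard_normal((5, 16))   # tokens of one segment ("pick" or "place")
t_out, v_out = bidirectional_fusion(txt, vis)
print(t_out.shape, v_out.shape)      # (5, 16) (49, 16)
```

In the full module, the vision-conditioned features produced this way are decoded through convolutional and upsampling layers into spatial heatmaps, from which the pick and place positions are read off.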


Real-world experimental setup.

Real World Experiments

1. Performance on T-Shirt Fold (TSF) on Various T-Shirts.

Example 1

Fold both sleeves inside and then fold the T-Shirt in half from bottom to top.

Example 2

Fold both sleeves inside and then fold the T-Shirt in half from bottom to top.

Example 3

Fold both sleeves inside and then fold the T-Shirt in half from bottom to top.

2. Performance on Trousers Fold (TF) on Various Trousers.

Example 1

Fold the Trousers in half, then fold again from the waistband down to the hem.

Example 2

Fold the Trousers in half, then fold again from the waistband down to the hem.

Example 3

Fold the Trousers in half, then fold again from the waistband down to the hem.

3. Performance on Four Corners Inward Fold (FCIF) on Various Towels.

Example 1

Fold all corners of the square to the center one by one.

Example 2

Fold all corners of the square to the center one by one.

Example 3

Fold all corners of the square to the center one by one.

4. Performance on Double Straight Fold (DSF) on Various Towels.

Example 1

Fold the square in half from up to down.

Example 2

Fold the square in half from up to down.

Example 3

Fold the square in half from up to down.

5. Performance on Double Triangle Fold (DTF) on Various Towels.

Example 1

Fold the fabric twice to form a double triangle shape.

Example 2

Fold the fabric twice to form a double triangle shape.

Example 3

Fold the fabric twice to form a double triangle shape.

Generalization to Unseen Instruction

TSF

Tuck both sleeves inward, then fold the T-Shirt vertically from bottom to top.

TF

First fold the trousers lengthwise, then fold them down from the waistband to the hem.

FCIF

Bring each corner of the square to the center, folding them one by one.

DSF

Fold the square downward from the top edge to the bottom.

DTF

Fold the fabric in half twice to create a clean double triangle.

Generalization to Unseen Task

TSF

Fold both sleeves inside and then fold the T-Shirt in half from left to right.

TF

Fold the trousers from the legs up to the waistband.

FCIF

Fold two opposite corners to the center first and then fold the rest one by one.

DSF

Fold the square in half from bottom to top and then fold it in half again.

DTF

Fold square right bottom to left top and then left to right into a double triangle.

Generalization to Different Angles

TSF

Fold both sleeves inside and then fold the T-Shirt in half from top to bottom.

TF

Fold the Trousers in half, then fold again from the waistband down to the hem.

FCIF

Fold all corners of the square to the center one by one.

DSF

Fold the square in half from up to down.

DTF

Fold the fabric twice to form a double triangle shape.