2026
Evaluating LocateAnything for Language-Guided Pre-Grasp Object Localization in Humanoid Robot Perception
This mini-pilot evaluates LocateAnything as a language-guided 2D visual grounding module for humanoid robot pre-grasp perception. The study uses cluttered tabletop and desk scenes with natural-language prompts, manual ground-truth boxes, and IoU-based evaluation. Across 39 valid image-prompt pairs, LocateAnything achieved a mean IoU of 0.646, a median center-point error of 17.2 pixels, and a 76.9% success rate using IoU >= 0.5.
39 valid tabletop image-prompt pairsMean IoU: 0.646Median center-point error: 17.2 pxOverall success rate: 76.9% at IoU >= 0.5Spatial-relation prompts: 93.8% successSimilar-object prompts: 100.0% success in the pilot setSmall-object and ambiguous prompts were the main failure sources

| Metric | Observation |
|---|---|
| Overall Performance | 39 valid samples; mean IoU 0.646; median center-point error 17.2 px; success rate 76.9% at IoU >= 0.5. |
| Strong Prompt Types | Spatial-relation prompts reached 93.8% success; similar-object prompts reached 100.0% success in this small pilot. |
| Difficult Cases | Small-object prompts reached 58.3% success, and ambiguous prompts reached 33.3% success. |
| Scope | The report evaluates 2D perception-stage localization only; it does not test physical grasping or robot control. |