Improving and Evaluating Hand-Object Interaction Detection

¹University of Bristol, ²New York University
[Figure: HOI-DETR teaser]

HOI-DETR predicts all visible hands, 1st objects (directly held), and 2nd objects (acted upon through a tool), together with their pairwise interaction links in a single forward pass. Zero-shot examples across diverse scenes, viewpoints, and domains.

Abstract

Understanding hands and the objects they interact with, both directly and through tools, is a key step for tasks ranging from action perception to 3D reconstruction and robotics. We introduce HOI-DETR, a new framework that integrates hand-object and object-object interaction into the Co-DETR architecture, producing a method that jointly detects all visible hands, 1st objects (objects in direct physical interaction with a hand), and 2nd objects (objects that the 1st object acts upon when used as a tool), and predicts their pairwise interaction links in a single forward pass.

We accompany the model with a comprehensive HOI evaluation suite spanning four diverse datasets, including a new video benchmark derived from the HD-EPIC dataset and refined annotations for the Hands23 benchmark. HOI-DETR significantly improves over the previous state of the art, with mAP gains of over 20 percentage points, and demonstrates strong zero-shot generalisation to unseen datasets and domains.

Method

HOI-DETR builds on the Co-DETR architecture, training a transformer backbone end-to-end to predict hands, objects, and their interactions. An interaction MLP head operates on pairs of decoder token embeddings to predict binary interaction relationships for the two valid pair types: (hand, 1st object) and (1st object, 2nd object). The module is supervised at every decoder layer with a focal loss and trained end-to-end with the detector.
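The pairwise head described above can be illustrated with a small, dependency-free sketch. It is not the released implementation: the class ids, the `PairMLP` layout (two layers over concatenated embeddings), the embedding sizes, and the focal-loss defaults `alpha=0.25`, `gamma=2.0` are all assumptions chosen for illustration. It only shows the three ideas in the text: restricting scoring to valid pair types, predicting a binary link per pair, and summing a focal loss over every decoder layer.

```python
import math
import random

# Hypothetical class ids; the real label mapping is not given in the text.
HAND, FIRST_OBJ, SECOND_OBJ = 0, 1, 2
# Only these (source class, target class) combinations may be linked.
VALID_PAIRS = {(HAND, FIRST_OBJ), (FIRST_OBJ, SECOND_OBJ)}


def valid_pair_indices(labels):
    """Return (i, j) token index pairs whose classes form a valid pair type."""
    return [(i, j)
            for i, a in enumerate(labels)
            for j, b in enumerate(labels)
            if i != j and (a, b) in VALID_PAIRS]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


class PairMLP:
    """Tiny two-layer MLP over two concatenated embeddings (assumed layout)."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0, 0.1) for _ in range(2 * dim)]
                   for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [rng.gauss(0, 0.1) for _ in range(hidden)]
        self.b2 = 0.0

    def logit(self, emb_i, emb_j):
        x = list(emb_i) + list(emb_j)  # concatenate the pair of embeddings
        h = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)  # ReLU layer
             for row, b in zip(self.w1, self.b1)]
        return sum(w * v for w, v in zip(self.w2, h)) + self.b2


def focal_loss(logit, target, alpha=0.25, gamma=2.0):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p = sigmoid(logit)
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))


def deep_supervised_loss(head, layer_embeddings, labels, links):
    """Sum the focal loss over valid pairs at every decoder layer.

    layer_embeddings: one list of token embeddings per decoder layer.
    links: set of (i, j) index pairs that truly interact.
    """
    pairs = valid_pair_indices(labels)
    total = 0.0
    for embeddings in layer_embeddings:
        for i, j in pairs:
            target = 1 if (i, j) in links else 0
            total += focal_loss(head.logit(embeddings[i], embeddings[j]), target)
    return total
```

In this sketch a hand is never paired directly with a 2nd object; such a relationship is only reachable through a 1st object, which mirrors the two valid pair types named in the text.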

[Figure: HOI-DETR method diagram]

Zero-Shot Qualitative Results

HOI-DETR is trained only on the refined Hands23 dataset. All results below are zero-shot — evaluated on unseen datasets and domains, from egocentric video to in-the-wild web footage, with no additional training or fine-tuning.

Challenging In-the-Wild Videos

Zero-shot predictions overlaid on full-HD YouTube footage. Each card opens an interactive page: the original video plays from YouTube with bounding boxes and interaction links drawn on top in real time.

Zero-Shot on HD-EPIC and HOIST

Comparisons against prior methods on HD-EPIC and HOIST. Each clip places baseline predictions alongside HOI-DETR — note the fewer false positives, more accurate 1st object detections, and more stable interaction links over time.

Zero-Shot on Aria Gen 2 Pilot Dataset

We compare against the hand-object method released with the Aria Gen 2 Pilot Dataset; that method is trained on the dataset's own training split. Applied zero-shot, with no Aria training data, HOI-DETR still produces markedly more accurate and temporally consistent hand-object detections than this in-domain baseline. Each clip shows the baseline (left) and HOI-DETR (right).

Zero-Shot on FineBio

Comparison between the Hands23 baseline and HOI-DETR on the FineBio dataset, which contains out-of-distribution biology lab footage far from the cooking-centric training data. ✗ marks invalid interaction links or false positives.

[Figure: Zero-shot qualitative comparison on FineBio]

Refined Hands23

The original Hands23 annotation pipeline produced duplicate object annotations because its hand-centric scheme annotated each hand independently. We built a refinement pipeline that reviewed 26.2k images and corrected 56.2% of them by removing duplicate boxes and remapping all interaction links for consistency. The figure below shows representative before/after corrections.
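To make the two correction steps concrete (removing duplicate boxes, then remapping links), here is a minimal sketch. The text does not describe how the refinement pipeline identifies duplicates, so the IoU test, the `iou_thresh=0.9` value, and the function name `merge_duplicates` are purely hypothetical stand-ins for whatever review process was actually used.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def merge_duplicates(object_boxes, links, iou_thresh=0.9):
    """Collapse near-identical object boxes and remap hand-object links.

    object_boxes: list of (x1, y1, x2, y2) object boxes.
    links: list of (hand_index, object_index) pairs.
    Returns (kept_boxes, remapped_links).
    The IoU criterion is an assumption, not the documented procedure.
    """
    kept = []    # boxes that survive de-duplication
    remap = {}   # old object index -> new object index
    for idx, box in enumerate(object_boxes):
        for new_idx, kept_box in enumerate(kept):
            if iou(box, kept_box) >= iou_thresh:
                remap[idx] = new_idx  # duplicate: reuse the kept box
                break
        else:
            remap[idx] = len(kept)    # first occurrence: keep it
            kept.append(box)
    # Point every link at the surviving box and drop repeated links.
    new_links = sorted({(hand, remap[obj]) for hand, obj in links})
    return kept, new_links


# Two hands annotated independently, each with its own copy of the same object.
boxes = [(10, 10, 50, 50), (10, 10, 50, 51), (100, 100, 120, 120)]
links = [(0, 0), (1, 1), (1, 2)]
kept_boxes, kept_links = merge_duplicates(boxes, links)
```

After merging, both hands link to the same surviving object box, which is the kind of consistency the remapping step is meant to restore.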

[Figure: Hands23 before and after correction]

Quantitative Results

Hands23 Val — In-Domain (Refined Annotations)

Detection AP₅₀ and interaction F1. HOI-DETR improves 1st and 2nd object detection by over 25 AP₅₀ points. (−X) shows the baseline's gap to HOI-DETR.

Method | Overall | hand | 1st object | 2nd object | F1 inter
Hands23 | 63.6 (−22.5) | 85.2 (−7.9) | 59.4 (−27.1) | 46.2 (−32.5) | 90.7 (−4.8)
HOI-DETR (ours) | 86.1 | 93.1 | 86.5 | 78.7 | 95.5
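For readers unfamiliar with the two metrics in this table, the sketch below shows one common way to compute them: AP₅₀ counts a detection as correct when its IoU with an unmatched ground-truth box is at least 0.5, and F1 is the harmonic mean of precision and recall. The exact evaluation protocol behind the reported numbers is not given here, so the greedy matching, the single-image scope, and the all-point interpolation are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def average_precision_50(detections, gt_boxes, iou_thresh=0.5):
    """AP at IoU >= 0.5 for one class on one image (illustrative scope).

    detections: list of (score, box); gt_boxes: list of boxes.
    """
    if not gt_boxes:
        return 0.0
    matched, hits = set(), []
    # Visit detections from most to least confident.
    for _, box in sorted(detections, key=lambda d: -d[0]):
        best_iou, best_gt = 0.0, None
        for g, gt in enumerate(gt_boxes):
            if g in matched:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_gt = overlap, g
        if best_gt is not None and best_iou >= iou_thresh:
            matched.add(best_gt)
            hits.append(1)   # true positive
        else:
            hits.append(0)   # false positive
    # Precision and recall after each detection.
    precisions, recalls, tp = [], [], 0
    for k, hit in enumerate(hits, start=1):
        tp += hit
        precisions.append(tp / k)
        recalls.append(tp / len(gt_boxes))
    # Interpolate: precision at rank k is the best precision at rank >= k.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Area under the interpolated precision-recall curve.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def f1_score(tp, fp, fn):
    """F1 from counts of true positives, false positives, false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

A benchmark-level AP₅₀ would pool detections across all images before building the precision-recall curve; the per-image version is kept here only for brevity.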

Zero-Shot Cross-Dataset

HOI-DETR is trained only on Hands23. The tables below evaluate it, without any fine-tuning, on three datasets it has never seen — including the video benchmark HD-EPIC-HOI. (−X) shows each baseline's gap to HOI-DETR.

HD-EPIC-HOI (Video, 1st object)

Method | Frame-AP | Video-AP | LTC
Hands23 | 46.9 (−25.7) | 26.8 (−33.4) | 31.4 (−29.6)
HOIST | 30.4 (−42.2) | 16.1 (−44.1) | 27.2 (−33.8)
HOI-DETR (ours) | 72.6 | 60.2 | 61.0

FineBio (AP₅₀)

Method | Overall | hand | 1st object
Hands23 | 50.2 (−21.0) | 74.4 (−12.2) | 26.0 (−29.8)
HOI-DETR (ours) | 71.2 | 86.6 | 55.8

HOIST (AP₅₀)

Method | 1st object
Hands23 | 43.1 (−33.5)
HOIST | 70.7 (−5.9)
HOI-DETR (ours) | 76.6

Hand Detection Generalisation (AP₅₀)

Despite being optimised for the broader HOI task, HOI-DETR is also a strong pure hand detector, leading on 4 of 6 benchmarks and achieving the best average zero-shot AP₅₀ (82.4 vs. WiLOR's 78.5). Blue cells are zero-shot. * marks values as reported in WiLOR.

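Method | Hand+Obj | Hands23 | WHIM | COCO-Whole | Oxford-Hands | FineBio | EgoHands
MediaPipe | | - | 53.1* | 15.4* | 8.7* | - | -
OpenPose | | - | 76.8* | 37.1* | 20.7* | - | -
ContactHands | | - | 93.4* | 50.3* | 70.0* | - | -
ViTDet | | - | 84.7* | 41.6* | 67.6* | - | -
WiLOR | | 77.3 | 96.1* | 62.5* | 82.6* | 76.5 | 93.4
Hands23 | ✓ | 85.2 | 71.7 | 67.0 | 60.6 | 74.4 | 93.0
HOI-DETR (ours) | ✓ | 93.1 | 78.1 | 75.0 | 74.0 | 86.6 | 98.5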
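The average zero-shot figures quoted above (82.4 vs. 78.5) can be checked from the WiLOR and HOI-DETR rows with a few lines of arithmetic. The split into zero-shot and in-domain benchmarks is an assumption here: it takes Hands23 as in-domain for HOI-DETR (stated in the text) and WHIM as in-domain for WiLOR (inferred, because excluding it reproduces the quoted average).

```python
from statistics import mean

benchmarks = ["Hands23", "WHIM", "COCO-Whole", "Oxford-Hands", "FineBio", "EgoHands"]

# AP50 values copied from the WiLOR and HOI-DETR rows of the table.
scores = {
    "WiLOR":    [77.3, 96.1, 62.5, 82.6, 76.5, 93.4],
    "HOI-DETR": [93.1, 78.1, 75.0, 74.0, 86.6, 98.5],
}

# Assumed in-domain benchmark per method; all others are treated as zero-shot.
in_domain = {"WiLOR": "WHIM", "HOI-DETR": "Hands23"}


def zero_shot_average(method):
    """Mean AP50 over the benchmarks the method was not trained on."""
    values = [score
              for name, score in zip(benchmarks, scores[method])
              if name != in_domain[method]]
    return round(mean(values), 1)


for method in scores:
    print(method, zero_shot_average(method))
```

Under these assumptions the five zero-shot values sum to 412.2 for HOI-DETR and 392.3 for WiLOR, giving averages that round to the two quoted figures.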

Download

Acknowledgements

This work was supported by EPSRC Program Grant Visual AI (EP/T028572/1). A. Darkhalil was supported by the EPSRC Doctoral Training Program (DTP). We acknowledge the use of GPU node hours granted as part of the AIRR Innovator project "5D Hand-Object Interaction Modelling from In-the-wild Videos" (Mar 2026 – Sep 2026), the AIRR Gateway project "HOI Foundational Model from Egocentric Data" (Dec 2025 – Mar 2026), and the Sovereign AI Unit call project "Gen Model in Ego-sensed World" (Aug – Nov 2025). D. Fouhey was supported by the National Science Foundation under Grant Nos. 2006619 and 2437330.

We thank Sidhartha Reddy Potu for his contributions during the early stages of this project. This work builds on Co-DETR and MMDetection, and we gratefully acknowledge their authors for open-sourcing the codebase and architecture. We also thank the authors of Hands23, HOIST, FineBio, and HD-EPIC for making their datasets publicly available.