Improving and Evaluating Hand-Object Interaction Detection

Ahmad Darkhalil¹, Dima Damen¹, David Fouhey²

University of Bristol¹, New York University²

HOI-DETR predicts all visible hands, 1st objects (directly held), and 2nd objects (acted upon through a tool), together with their pairwise interaction links, in a single forward pass. The examples shown are zero-shot and span diverse scenes, viewpoints, and domains.

Abstract

Understanding hands and the objects they interact with, both directly and through tools, is a key step for tasks ranging from action perception to 3D reconstruction and robotics. We introduce HOI-DETR, a new framework that integrates hand-object and object-object interaction into the Co-DETR architecture, producing a method that jointly detects all visible hands, 1st objects (objects in direct physical interaction with a hand), and 2nd objects (objects that the 1st object acts upon when used as a tool), and predicts their pairwise interaction links in a single forward pass.

We accompany the model with a comprehensive HOI evaluation suite spanning four diverse datasets, including a new video benchmark derived from the HD-EPIC dataset and refined annotations for the Hands23 benchmark. HOI-DETR significantly improves over the previous state of the art, with mAP gains of over 20 percentage points, and demonstrates strong zero-shot generalisation to unseen datasets and domains.

Method

HOI-DETR builds on the Co-DETR architecture, training a transformer backbone end-to-end to predict hands, objects, and their interactions. An interaction MLP head operates on pairs of decoder token embeddings to predict binary interaction relationships for valid pairs: hand → 1st object and 1st object → 2nd object. The module is supervised at every decoder layer with a focal loss and trained end-to-end with the detector.
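The pairing and supervision described above can be sketched in plain Python. This is a minimal illustration, not the released implementation: the role labels, the function names, the focal-loss defaults (alpha = 0.25, gamma = 2) and the choice to sum the per-layer losses are all assumptions. The MLP that maps a pair of token embeddings to a link probability is omitted, so the probabilities are taken as given.

```python
import math
from itertools import product

# Hypothetical role labels for decoder queries (names are illustrative).
HAND, FIRST_OBJ, SECOND_OBJ = "hand", "1st_object", "2nd_object"

# Only these (source role, target role) pairs are scored by the interaction head.
VALID_ROLE_PAIRS = {(HAND, FIRST_OBJ), (FIRST_OBJ, SECOND_OBJ)}


def valid_pairs(roles):
    """Return index pairs (i, j) whose roles form an allowed interaction."""
    return [
        (i, j)
        for i, j in product(range(len(roles)), repeat=2)
        if (roles[i], roles[j]) in VALID_ROLE_PAIRS
    ]


def binary_focal_loss(p, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """Focal loss for one predicted link probability p and a 0/1 target."""
    p = min(max(p, eps), 1.0 - eps)           # clamp to keep log() finite
    p_t = p if target == 1 else 1.0 - p       # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def deep_supervision_loss(per_layer_probs, targets):
    """Sum over decoder layers of the mean focal loss over candidate pairs.

    Assumption: layers are weighted equally and summed.
    """
    total = 0.0
    for probs in per_layer_probs:
        losses = [binary_focal_loss(p, t) for p, t in zip(probs, targets)]
        total += sum(losses) / len(losses)
    return total


roles = [HAND, FIRST_OBJ, SECOND_OBJ, HAND]
pairs = valid_pairs(roles)                    # candidate links to score
targets = [1, 1, 0]                           # illustrative ground truth per pair
per_layer_probs = [[0.6, 0.5, 0.4], [0.9, 0.8, 0.1]]  # two decoder layers
print(pairs, deep_supervision_loss(per_layer_probs, targets))
```

Restricting scoring to the two valid role pairs keeps the head from ever predicting, for example, a hand → 2nd object link, and the focal term down-weights pairs that are already classified confidently.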

Zero-Shot Qualitative Results

HOI-DETR is trained only on the refined Hands23 dataset. All results below are zero-shot — evaluated on unseen datasets and domains, from egocentric video to in-the-wild web footage, with no additional training or fine-tuning.

Challenging In-the-Wild Videos

Zero-shot predictions overlaid on full-HD YouTube footage. Each card opens an interactive page: the original video plays from YouTube with bounding boxes and interaction links drawn on top in real time.

French Magician — Sleight of Hand

Fast, deformable hand motion; partial occlusion

Full Day Chinese Food Tour — Flushing NY

Dense hand-object interactions; crowded street food scene

Giving Strangers Belated Christmas Presents

Multi-person hand-to-hand exchanges; gift-wrapped objects

Zero-Shot on HD-EPIC and HOIST

Comparisons against prior methods on HD-EPIC and HOIST. Each clip places baseline predictions alongside HOI-DETR — note the fewer false positives, more accurate 1st object detections, and more stable interaction links over time.

Zero-Shot on Aria Gen 2 Pilot Dataset

We compare against the hand-object method released with the Aria Gen 2 Pilot Dataset, which is trained on that dataset's own training split. Applied zero-shot, with no Aria training data, HOI-DETR still produces markedly more accurate and temporally consistent hand-object detections than this in-domain method. Each clip shows the Aria method (left) and HOI-DETR (right).

Zero-Shot on FineBio

Comparison between Hands23 and HOI-DETR on the FineBio dataset — out-of-distribution biology lab footage, far from the cooking-centric training data. ✗ marks invalid interaction links or false positives.

Refined Hands23

The original Hands23 annotation pipeline produced duplicate object annotations arising from a hand-centric scheme in which each hand was independently annotated. We built a refinement pipeline that reviewed 26.2k images — correcting 56.2% of them by removing duplicate boxes and remapping all interaction links for consistency. The figure below shows representative before/after corrections.
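To make the deduplicate-and-remap step concrete, here is a small sketch of how near-identical object boxes could be collapsed and the interaction links rewritten to point at the surviving box. It is only an illustration under stated assumptions: the source does not say how duplicates were identified, so the IoU test, the 0.9 threshold, the box format `(x1, y1, x2, y2)` and the function names are all hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_duplicates(object_boxes, links, iou_thresh=0.9):
    """Collapse near-identical object boxes and remap (hand_id, object_idx) links.

    Returns (kept_boxes, remapped_links). The threshold is an assumed value.
    """
    kept = []    # boxes that survive deduplication
    remap = {}   # old object index -> index into kept
    for old_idx, box in enumerate(object_boxes):
        for new_idx, kept_box in enumerate(kept):
            if iou(box, kept_box) >= iou_thresh:
                remap[old_idx] = new_idx      # duplicate of an earlier box
                break
        else:
            remap[old_idx] = len(kept)        # first time this object is seen
            kept.append(box)
    # A set removes links that become identical after remapping.
    new_links = sorted({(hand, remap[obj]) for hand, obj in links})
    return kept, new_links


# Two hands annotated independently produce two boxes for the same object.
boxes = [(10, 10, 50, 50), (11, 10, 50, 50), (100, 100, 120, 120)]
links = [("left", 0), ("right", 1), ("right", 2)]
print(merge_duplicates(boxes, links))
```

After merging, both hands link to the same object index, which is the consistency property the refined annotations are described as restoring.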

Quantitative Results

Hands23 Val — In-Domain (Refined Annotations)

Detection AP₅₀ and interaction F1 on the refined Hands23 validation set. HOI-DETR improves 1st and 2nd object detection by 27.1 and 32.5 points, respectively. (−X) shows the baseline's gap to HOI-DETR.

Method	Overall	hand	1st object	2nd object	F1 inter
Hands23	63.6 (−22.5)	85.2 (−7.9)	59.4 (−27.1)	46.2 (−32.5)	90.7 (−4.8)
HOI-DETR (ours)	86.1	93.1	86.5	78.7	95.5

Zero-Shot Cross-Dataset

HOI-DETR is trained only on Hands23. The tables below evaluate it, without any fine-tuning, on three datasets it has never seen — including the video benchmark HD-EPIC-HOI. (−X) shows each baseline's gap to HOI-DETR.

HD-EPIC-HOI (Video, 1st object)

Method	Frame-AP	Video-AP	LTC
Hands23	46.9 (−25.7)	26.8 (−33.4)	31.4 (−29.6)
HOIST	30.4 (−42.2)	16.1 (−44.1)	27.2 (−33.8)
HOI-DETR (ours)	72.6	60.2	61.0

FineBio (AP₅₀)

Method	Overall	hand	1st object
Hands23	50.2 (−21.0)	74.4 (−12.2)	26.0 (−29.8)
HOI-DETR (ours)	71.2	86.6	55.8

HOIST (AP₅₀)

Method	1st object
Hands23	43.1 (−33.5)
HOIST	70.7 (−5.9)
HOI-DETR (ours)	76.6
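As a quick check on the (−X) notation in the zero-shot tables, the snippet below recomputes each gap as the baseline score minus the HOI-DETR score. The numbers are copied from the three tables above; the dictionary layout and the function name are just illustrative choices.

```python
# Scores copied from the three zero-shot tables above.
HOI_DETR = {
    "HD-EPIC-HOI": {"Frame-AP": 72.6, "Video-AP": 60.2, "LTC": 61.0},
    "FineBio": {"Overall": 71.2, "hand": 86.6, "1st object": 55.8},
    "HOIST": {"1st object": 76.6},
}
BASELINES = {
    "HD-EPIC-HOI": {
        "Hands23": {"Frame-AP": 46.9, "Video-AP": 26.8, "LTC": 31.4},
        "HOIST": {"Frame-AP": 30.4, "Video-AP": 16.1, "LTC": 27.2},
    },
    "FineBio": {
        "Hands23": {"Overall": 50.2, "hand": 74.4, "1st object": 26.0},
    },
    "HOIST": {
        "Hands23": {"1st object": 43.1},
        "HOIST": {"1st object": 70.7},
    },
}


def gap_to_hoi_detr(dataset, method):
    """Return baseline minus HOI-DETR per metric, rounded to one decimal."""
    ours = HOI_DETR[dataset]
    theirs = BASELINES[dataset][method]
    return {metric: round(score - ours[metric], 1) for metric, score in theirs.items()}


for dataset, methods in BASELINES.items():
    for method in methods:
        print(dataset, method, gap_to_hoi_detr(dataset, method))
```

Every recomputed gap matches the parenthesised value in the tables. The smallest gap is on HOIST, where the HOIST baseline is evaluated on its own dataset.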

Hand Detection Generalisation (AP₅₀)

Despite being optimised for the broader HOI task, HOI-DETR is also a strong pure hand detector, leading on 4 of 6 benchmarks and achieving the best average zero-shot AP (82.4 vs. WiLOR's 78.5). Blue cells are zero-shot. *as reported in WiLOR.

Method	Hand+Obj	Hands23	WHIM	COCO-Whole	Oxford-Hands	FineBio	EgoHands
MediaPipe	✗	—	53.1*	15.4*	8.7*	—	—
OpenPose	✗	—	76.8*	37.1*	20.7*	—	—
ContactHands	✗	—	93.4*	50.3*	70.0*	—	—
ViTDet	✗	—	84.7*	41.6*	67.6*	—	—
WiLOR	✗	77.3	96.1*	62.5*	82.6*	76.5	93.4
Hands23	✓	85.2	71.7	67.0	60.6	74.4	93.0
HOI-DETR (ours)	✓	93.1	78.1	75.0	74.0	86.6	98.5
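The summary figures quoted for this table can be recomputed from its rows. One assumption is needed, because the cell highlighting is not available in this text: the sketch treats Hands23 as the only in-domain benchmark for HOI-DETR and WHIM as the only in-domain benchmark for WiLOR. Under that assumption the averages come out at the quoted 82.4 and 78.5.

```python
BENCHMARKS = ["Hands23", "WHIM", "COCO-Whole", "Oxford-Hands", "FineBio", "EgoHands"]

# AP50 per method, in BENCHMARKS order; None marks a dash in the table.
SCORES = {
    "MediaPipe": [None, 53.1, 15.4, 8.7, None, None],
    "OpenPose": [None, 76.8, 37.1, 20.7, None, None],
    "ContactHands": [None, 93.4, 50.3, 70.0, None, None],
    "ViTDet": [None, 84.7, 41.6, 67.6, None, None],
    "WiLOR": [77.3, 96.1, 62.5, 82.6, 76.5, 93.4],
    "Hands23": [85.2, 71.7, 67.0, 60.6, 74.4, 93.0],
    "HOI-DETR": [93.1, 78.1, 75.0, 74.0, 86.6, 98.5],
}

# Assumption: the single benchmark treated as in-domain for each method.
IN_DOMAIN = {"WiLOR": "WHIM", "HOI-DETR": "Hands23"}


def zero_shot_mean(method):
    """Mean AP50 over the benchmarks that are not in-domain for the method."""
    values = [
        score
        for name, score in zip(BENCHMARKS, SCORES[method])
        if name != IN_DOMAIN[method] and score is not None
    ]
    return round(sum(values) / len(values), 1)


def leaders():
    """Map each benchmark to the method with the highest reported AP50."""
    best = {}
    for k, name in enumerate(BENCHMARKS):
        reported = {m: s[k] for m, s in SCORES.items() if s[k] is not None}
        best[name] = max(reported, key=reported.get)
    return best


print(zero_shot_mean("HOI-DETR"), zero_shot_mean("WiLOR"))
print(leaders())
```

HOI-DETR leads on Hands23, COCO-Whole, FineBio and EgoHands, while WiLOR leads on WHIM and Oxford-Hands, which gives the 4 of 6 count.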

Download

HOI-DETR checkpoint (ViT-L/16, trained on Hands23)
🤗 Interactive demo (run HOI-DETR on your own images)
Hands23 dataset: images and splits
Code + refined annotation file

Acknowledgements

This work was supported by EPSRC Program Grant Visual AI (EP/T028572/1). A. Darkhalil was supported by the EPSRC Doctoral Training Program (DTP). We acknowledge the usage of GPU node hours granted as part of the AIRR Innovator project "5D Hand-Object Interaction Modelling from In-the-wild Videos" (Mar 2026 – Sep 2026), the AIRR Gateway project "HOI Foundational Model from Egocentric Data" (Dec 2025 – Mar 2026), and the Sovereign AI Unit call project "Gen Model in Ego-sensed World" (Aug – Nov 2025). D. Fouhey was supported by the National Science Foundation under Grant Nos. 2006619 and 2437330.

We thank Sidhartha Reddy Potu and Dandan Shan for their contributions during the early stages of this project. This work builds on Co-DETR and MMDetection, and we gratefully acknowledge their authors for open-sourcing the codebase and architecture. We also thank the authors of Hands23, HOIST, FineBio, and HD-EPIC for making their datasets publicly available. Finally, we thank Rajan from Elancer for assistance with the Hands23 refinement process and the verification of the HD-EPIC-HOI annotations.