HOI-DETR predicts all visible hands, 1st objects (directly held), and 2nd objects (acted upon through a tool), together with their pairwise interaction links, in a single forward pass. The examples shown are zero-shot predictions across diverse scenes, viewpoints, and domains.
Understanding hands and the objects they interact with, both directly and through tools, is a key step for tasks ranging from action perception to 3D reconstruction and robotics. We introduce HOI-DETR, a new framework that integrates hand-object and object-object interaction into the Co-DETR architecture, producing a method that jointly detects all visible hands, 1st objects (objects in direct physical interaction with a hand), and 2nd objects (objects that the 1st object acts upon when used as a tool), and predicts their pairwise interaction links in a single forward pass.
We accompany the model with a comprehensive HOI evaluation suite spanning four diverse datasets, including a new video benchmark derived from the HD-EPIC dataset and refined annotations for the Hands23 benchmark. HOI-DETR significantly improves over the previous state of the art, with mAP gains of over 20 percentage points, and demonstrates strong zero-shot generalisation to unseen datasets and domains.
HOI-DETR builds on the Co-DETR architecture, training a transformer backbone end-to-end to predict hands, objects, and their interactions. An interaction MLP head operates on pairs of decoder token embeddings to predict binary interaction relationships for valid pairs: hand → 1st object and 1st object → 2nd object. The module is supervised at every decoder layer with a focal loss and trained end-to-end with the detector.
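As a rough illustration of the interaction head described above, the sketch below enumerates the valid class pairs, scores one pair with a tiny two-layer MLP, and applies a binary focal loss averaged over decoder layers. It is a minimal pure-Python sketch, not the released implementation: the class ids, the concatenation of the two embeddings, the MLP shape, the `alpha`/`gamma` values, and the averaging across layers are all assumptions made for illustration.

```python
import math

# Hypothetical class ids; the source does not specify an encoding.
HAND, FIRST_OBJ, SECOND_OBJ = 0, 1, 2

# The two valid (source, target) class pairs named in the text.
VALID_PAIRS = {(HAND, FIRST_OBJ), (FIRST_OBJ, SECOND_OBJ)}


def candidate_pairs(labels):
    """Return index pairs (i, j) whose predicted classes form a valid link."""
    return [
        (i, j)
        for i, a in enumerate(labels)
        for j, b in enumerate(labels)
        if i != j and (a, b) in VALID_PAIRS
    ]


def pair_logit(src, tgt, w1, b1, w2, b2):
    """Score one pair with a two-layer MLP on the concatenated embeddings.

    Concatenation is an assumption; the source only says the head operates
    on pairs of decoder token embeddings.
    """
    x = list(src) + list(tgt)
    hidden = [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)  # ReLU
        for row, b in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def binary_focal_loss(logit, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one pair (alpha and gamma are assumed values)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def deep_supervised_loss(per_layer_logits, targets):
    """Average the focal loss over the pair logits of every decoder layer."""
    total, count = 0.0, 0
    for layer_logits in per_layer_logits:
        for logit, target in zip(layer_logits, targets):
            total += binary_focal_loss(logit, target)
            count += 1
    return total / count
```

For example, with predicted labels `[HAND, FIRST_OBJ, SECOND_OBJ, HAND]`, `candidate_pairs` keeps only the hand → 1st object and 1st object → 2nd object combinations, so hand → 2nd object pairs are never scored.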
HOI-DETR is trained only on the refined Hands23 dataset. All results below are zero-shot — evaluated on unseen datasets and domains, from egocentric video to in-the-wild web footage, with no additional training or fine-tuning.
Zero-shot predictions overlaid on full-HD YouTube footage. Each card opens an interactive page: the original video plays from YouTube with bounding boxes and interaction links drawn on top in real time.
We compare against the hand-object method released with the Aria Gen 2 Pilot Dataset, which is trained on the dataset's own training split. Applied zero-shot, with no Aria training data, HOI-DETR still produces markedly more accurate and temporally consistent hand-object detections than this in-domain baseline. Each clip shows the Aria baseline (left) and HOI-DETR (right).
Comparison between Hands23 and HOI-DETR on the FineBio dataset — out-of-distribution biology lab footage, far from the cooking-centric training data. ✗ marks invalid interaction links or false positives.
The original Hands23 annotation pipeline produced duplicate object annotations arising from a hand-centric scheme in which each hand was independently annotated. We built a refinement pipeline that reviewed 26.2k images — correcting 56.2% of them by removing duplicate boxes and remapping all interaction links for consistency. The figure below shows representative before/after corrections.
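To make the "remove duplicates, then remap links" step concrete, here is a minimal sketch of one way such a cleanup could work. The source does not describe how duplicates were identified, so the IoU criterion, the `thresh` value, and the data layout (object boxes as `(x1, y1, x2, y2)` tuples, links as `(hand_id, object_index)` pairs) are assumptions for illustration only and should not be read as the actual refinement pipeline.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_duplicate_objects(object_boxes, hand_links, thresh=0.9):
    """Drop near-identical object boxes and re-point hand links at survivors.

    thresh is a hypothetical IoU cutoff, not a value from the source.
    """
    kept = []   # surviving object boxes
    remap = {}  # old object index -> new object index
    for old_idx, box in enumerate(object_boxes):
        for new_idx, kept_box in enumerate(kept):
            if iou(box, kept_box) >= thresh:
                remap[old_idx] = new_idx  # duplicate of an earlier box
                break
        else:
            remap[old_idx] = len(kept)
            kept.append(box)
    # Remap every link and collapse links that became identical.
    new_links = sorted({(hand, remap[obj]) for hand, obj in hand_links})
    return kept, new_links
```

In this toy layout, two hands that were each annotated with their own copy of the same object end up linked to a single shared box after merging.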
Detection AP₅₀ and interaction F1. HOI-DETR improves 1st- and 2nd-object detection by over 25 points (+27.1 and +32.5 AP₅₀, respectively).
| Method | Overall | hand | 1st object | 2nd object | F1 inter |
|---|---|---|---|---|---|
| Hands23 | 63.6 (−22.5) | 85.2 (−7.9) | 59.4 (−27.1) | 46.2 (−32.5) | 90.7 (−4.8) |
| HOI-DETR (ours) | 86.1 | 93.1 | 86.5 | 78.7 | 95.5 |
HOI-DETR is trained only on Hands23. The tables below evaluate it, without any fine-tuning, on three datasets it has never seen — including the video benchmark HD-EPIC-HOI. (−X) shows each baseline's gap to HOI-DETR.
| Method | Frame-AP | Video-AP | LTC |
|---|---|---|---|
| Hands23 | 46.9 (−25.7) | 26.8 (−33.4) | 31.4 (−29.6) |
| HOIST | 30.4 (−42.2) | 16.1 (−44.1) | 27.2 (−33.8) |
| HOI-DETR (ours) | 72.6 | 60.2 | 61.0 |
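The (−X) annotations are simply each baseline's score minus the HOI-DETR score for the same metric. As a quick check, the snippet below recomputes the Hands23 gaps from the table above; the function and variable names are illustrative only.

```python
def gaps(baseline, ours):
    """Per-metric gap of a baseline to our scores, rounded to one decimal."""
    return {metric: round(baseline[metric] - ours[metric], 1) for metric in ours}


# Values copied from the table above.
hoi_detr = {"Frame-AP": 72.6, "Video-AP": 60.2, "LTC": 61.0}
hands23 = {"Frame-AP": 46.9, "Video-AP": 26.8, "LTC": 31.4}

print(gaps(hands23, hoi_detr))
# {'Frame-AP': -25.7, 'Video-AP': -33.4, 'LTC': -29.6}
```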
| Method | Overall | hand | 1st object |
|---|---|---|---|
| Hands23 | 50.2 (−21.0) | 74.4 (−12.2) | 26.0 (−29.8) |
| HOI-DETR (ours) | 71.2 | 86.6 | 55.8 |
Despite being optimised for the broader HOI task, HOI-DETR is also a strong pure hand detector, leading on 4 of 6 benchmarks and achieving the best average zero-shot AP (82.4 vs. WiLOR's 78.5). Blue cells are zero-shot. *as reported in WiLOR.
This work was supported by EPSRC Program Grant Visual AI (EP/T028572/1). A. Darkhalil was supported by the EPSRC Doctoral Training Program (DTP). We acknowledge the usage of GPU node hours granted as part of the AIRR Innovator project "5D Hand-Object Interaction Modelling from In-the-wild Videos" (Mar 2026 – Sep 2026), the AIRR Gateway project "HOI Foundational Model from Egocentric Data" (Dec 2025 – Mar 2026), and the Sovereign AI Unit call project "Gen Model in Ego-sensed World" (Aug – Nov 2025). D. Fouhey was supported by the National Science Foundation under Grant Nos. 2006619 and 2437330.
We thank Sidhartha Reddy Potu for his contributions during the early stages of this project. This work builds on Co-DETR and MMDetection, and we gratefully acknowledge their authors for open-sourcing the codebase and architecture. We also thank the authors of Hands23, HOIST, FineBio, and HD-EPIC for making their datasets publicly available.