EgoPoints: Advancing Point Tracking for Egocentric Videos

University of Bristol, Stanford University
WACV 2025

Abstract

We introduce EgoPoints, a benchmark for point tracking in egocentric videos. We annotate 4.7K challenging tracks in egocentric sequences, including 9x more points that go out-of-view and 59x more points that require re-identification (ReID) after returning to view, compared to the popular TAP-Vid-DAVIS evaluation benchmark. To measure model performance on such challenging points, we introduce evaluation metrics that specifically monitor tracking of in-view points, out-of-view points, and points that require re-identification.

We then propose a pipeline to create semi-real sequences with automatic ground truth. We generate 11K sequences by combining dynamic Kubric objects with scene points from EPIC Fields. When fine-tuning state-of-the-art methods on these sequences and evaluating on our annotated EgoPoints sequences, we improve CoTracker across all metrics, including tracking accuracy δavg by 2.7 percentage points and accuracy on ReID sequences (ReID δavg) by 2.4 points. We also improve the δavg and ReID δavg of PIPs++ by 0.3 and 2.2 points respectively.
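As a rough illustration of the semi-real idea (the actual K-EPIC pipeline is described further down; composite_frame and its inputs are hypothetical names for this sketch, not the paper's code), a Kubric-rendered dynamic object can be alpha-composited onto an egocentric frame while its ground-truth tracks are taken directly from the renderer:

import numpy as np

def composite_frame(epic_rgb, kubric_rgba):
    """Alpha-composite a Kubric-rendered object (H, W, 4) onto an EPIC frame (H, W, 3).
    Hypothetical sketch: object pixels carry exact ground-truth tracks from the renderer,
    while background tracks come from projected EPIC Fields scene points."""
    alpha = kubric_rgba[..., 3:4].astype(np.float32) / 255.0
    blended = alpha * kubric_rgba[..., :3] + (1.0 - alpha) * epic_rgb
    return blended.astype(np.uint8)

Scene points that fall under the pasted object can then be marked occluded using the same alpha mask, giving automatic visibility labels alongside the positions.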


Introducing EgoPoints

We compare the performance of several state-of-the-art point trackers on both DAVIS (left) and EgoPoints (right). These comparisons highlight the complexity and challenges EgoPoints introduces: top-performing point trackers struggle or degrade in its more difficult scenarios.

Per-model comparisons on DAVIS vs. EgoPoints: PIPs++, CoTracker v2, CoTracker v3, TAPIR, and LocoTrack.

Performance Comparison on EgoPoints

Performance of point tracking baselines on TAP-Vid-DAVIS compared to our EgoPoints benchmark, using the main metric δavg. We additionally report re-identification accuracy (ReID δavg), out-of-view accuracy (OOVA) and in-view accuracy (IVA) to highlight model failures.
Model               TAP-Vid-DAVIS δavg   EgoPoints δavg   ReID δavg   OOVA↑   IVA↑
PIPs++              64.0                 36.9             14.6        50.4    89.2
CoTracker           74.7                 38.5             4.8         81.4    73.4
BootsTAPIR Online   65.2                 39.6             0.0         0.0     100.0
LocoTrack           75.3                 59.4             0.1         0.2     99.9
CoTracker v3        77.2                 50.0             15.0        31.8    99.3
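To make the reported metrics concrete, the sketch below computes them under standard TAP-Vid conventions (δavg averaged over 1, 2, 4, 8 and 16 pixel thresholds). The exact ReID/OOVA/IVA definitions used in the paper may differ; treat this as an assumption-laden illustration, not the official evaluation code.

import numpy as np

# delta_avg: fraction of ground-truth-visible points tracked within a pixel threshold,
# averaged over thresholds (standard TAP-Vid convention, assumed here).
def delta_avg(pred_xy, gt_xy, gt_visible, thresholds=(1, 2, 4, 8, 16)):
    err = np.linalg.norm(pred_xy - gt_xy, axis=-1)   # (num_points, num_frames)
    err = err[gt_visible]                            # score only in-view ground truth
    return np.mean([(err < t).mean() for t in thresholds])

# OOVA / IVA: how often the predicted visibility flag matches the ground truth,
# restricted to out-of-view points (mask = ~gt_visible) or in-view points (mask = gt_visible).
def visibility_accuracy(pred_visible, gt_visible, mask):
    return (pred_visible[mask] == gt_visible[mask]).mean()

# ReID delta_avg: the same positional accuracy, restricted to frames after a point has
# left the view and returned (re-identification), here via a hypothetical reid_mask.
def reid_delta_avg(pred_xy, gt_xy, gt_visible, reid_mask):
    return delta_avg(pred_xy, gt_xy, gt_visible & reid_mask)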

EgoPoints Annotations

Examples of sparsely annotated sequences from the EgoPoints benchmark. Dashed lines represent dynamic object tracks, while solid lines show scene point tracks.

K-EPIC pipeline

The K-EPIC pipeline: 3D points are projected into frames as tracks and filtered using CoTracker to obtain scene points (left); dynamic 3D objects and their tracks are sampled from TAP-Vid-KUBRIC (top right).
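A minimal sketch of the scene-point half of this pipeline, assuming pinhole EPIC Fields cameras and a simple mean-pixel-error agreement test against CoTracker. The function names and the 4-pixel threshold are illustrative assumptions, not the paper's implementation.

import numpy as np

def project_points(points_3d, K, R, t):
    """Project EPIC Fields 3D points (N, 3) into a frame with intrinsics K and pose (R, t)."""
    cam = points_3d @ R.T + t                  # world -> camera coordinates
    uv = cam @ K.T                             # camera -> homogeneous image coordinates
    return uv[:, :2] / uv[:, 2:3], cam[:, 2]   # pixel coordinates, depth

def filter_scene_tracks(projected_tracks, cotracker_tracks, max_px_error=4.0):
    """Keep projected tracks that a 2D tracker such as CoTracker agrees with.
    projected_tracks, cotracker_tracks: (N, T, 2) pixel trajectories."""
    err = np.linalg.norm(projected_tracks - cotracker_tracks, axis=-1)  # (N, T)
    return err.mean(axis=1) < max_px_error     # boolean mask of reliable scene points

Filtering against an independent 2D tracker discards projected points whose camera poses or 3D reconstructions are unreliable, so the remaining scene tracks can serve as automatic ground truth.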


K-EPIC samples

Examples from the K-EPIC synthetic sequences. Empty (unfilled) points indicate out-of-view points.

Comparison between the CoTracker v2 baseline and the fine-tuned model

We qualitatively compare the baseline CoTracker v2 model and the model fine-tuned on K-EPIC, evaluated on EgoPoints.

Acknowledgments

This work proposes a new annotation benchmark that is publicly available and builds on the publicly available EPIC-KITCHENS dataset. It is supported by the EPSRC Doctoral Training Program, EPSRC UMPIRE EP/T004991/1 and EPSRC Programme Grant VisualAI EP/T028572/1. We acknowledge the use of the EPSRC-funded Tier 2 facility JADE-II.

BibTeX

@inproceedings{darkhalil2025egopoints,
  title={EgoPoints: Advancing Point Tracking for Egocentric Videos},
  author={Darkhalil, Ahmad and Guerrier, Rhodri and Harley, Adam W. and Damen, Dima},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2025}
}