Abstract

Recent self-supervised pretraining methods for object detection largely focus on pretraining the backbone of the object detector, neglecting key parts of detection architecture. Instead, we introduce DETReg, a new self-supervised method that pretrains the entire object detection network, including the object localization and embedding components. During pretraining, DETReg predicts object localizations to match the localizations from an unsupervised region proposal generator and simultaneously aligns the corresponding feature embeddings with embeddings from a self-supervised image encoder. We implement DETReg using the DETR family of detectors and show that it improves over competitive baselines when finetuned on COCO, PASCAL VOC, and Airbus Ship benchmarks. In low-data regimes, including semi-supervised and few-shot learning settings, DETReg establishes many state-of-the-art results, e.g., on COCO we see a +6.0 AP improvement for 10-shot detection and +3.5 AP improvement when training with only 1% of the labels.

DETReg Pretext task

Our approach to the problem is based on the observation that learning good detectors requires learning to detect objects in the pretraining stage. To accomplish this, we present a new framework called ``DEtection with TRansformers based on Region priors'', or DETReg. DETReg can be used to train a detector on unlabeled data by introducing two key pretraining tasks: ``Object Localization Task'' and the ``Object Embedding Task''. The goal of the first is to train the model to localize objects, regardless of their categories. However, learning to localize objects is not enough, and detectors must also classify objects. Towards this end, we introduce the ``Object Embedding Task'', which is geared towards understanding the categories of objects in the image. Inspired by the simplicity of recent transformers for object detection, we choose to base our approach on the Deformable DETR architecture, which simplifies the implementation and is very fast to train.

Visualization

Qualitative examples of DETReg unsupervised box predictions. This shows the pixel-level gradient norm for the x/y bounding box center and the object embedding. These gradient norms indicate how sensitive the predicted values are to perturbations of the input pixels. For the first three columns, DETReg attends to the object edges for the x/y predictions and z for the predicted object embedding. The final column shows a limitation where the space surrounding the object is used for the embedding.

Paper

DETReg: Unsupervised Pretraining with Region Priors for Object Detection
Amir Bar, Xin Wang, Vadim Kantorov, Colorado J Reed, Roei Herzig,
Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson
Arxiv
Hosted on arXiv

Related Works

If you found our work interesting, please also consider looking into some closely related works like RegionSim, UP-DETR, SwAV.

Acknowledgements

We would like to thank Sayna Ebrahimi for helpful feedback and discussions. This project has received funding from the European Research Council (ERC) under the European Unions Horizon 2020 research and innovation programme (grant ERC HOLI 819080). Prof. Darrell’s group was supported in part by DoD including DARPA's XAI, LwLL, and/or SemaFor programs, as well as BAIR's industrial alliance programs. GC group was supported by the Israel Science Foundation (ISF 737/2018), and by an equipment grant to GC and Bar-Ilan University from the Israel Science Foundation (ISF 2332/18). This work was completed in partial fulfillment for the Ph.D degree of the first author.