Big data has had a great share in the success of deep learning in computer vision. Recent works suggest that there is significant further potential to increase object detection performance by utilizing even bigger datasets. In this paper, we introduce the EuroCity Persons dataset, which provides a large number of highly diverse, accurate and detailed annotations of pedestrians, cyclists and other riders in urban traffic scenes. The images for this dataset were collected on-board a moving vehicle in 31 cities of 12 European countries. With over 238200 person instances manually labeled in over 47300 images, EuroCity Persons is nearly one order of magnitude larger than person datasets used previously for benchmarking. The dataset furthermore contains a large number of person orientation annotations (over 211200). We optimize four state-of-the-art deep learning approaches (Faster R-CNN, R-FCN, SSD and YOLOv3) to serve as baselines for the new object detection benchmark. In experiments with previous datasets we analyze the generalization capabilities of these detectors when trained with the new dataset. We furthermore study the effect of the training set size, the dataset diversity (day- vs. night-time, geographical region), the dataset detail (i.e. availability of object orientation information) and the annotation quality on the detector performance. Finally, we analyze error sources and discuss the road ahead.
@article{braun2019eurocity,
author={Braun, Markus and Krebs, Sebastian and Flohr, Fabian B. and Gavrila, Dariu M.},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
title={EuroCity Persons: A Novel Benchmark for Person Detection in Traffic Scenes},
year={2019},
volume={},
number={},
pages={1-1},
keywords={Proposals;Benchmark testing;Object detection;Feature extraction;Urban areas;Deep learning;Training;benchmarking},
doi={10.1109/TPAMI.2019.2897684},
ISSN={0162-8828},
month={}
}
3D localization of persons from a single image is a challenging problem, where advances are largely data-driven. In this paper, we enhance the recently released EuroCity Persons detection dataset, a large and diverse automotive dataset covering pedestrians and riders. Previously, only 2D annotations and image data were provided. We introduce an automatic 3D lifting procedure that uses additional LiDAR distance measurements to augment a large part of the reasonable subset of 2D box annotations with their corresponding 3D point positions (136K persons in 46K frames of day- and night-time). The resulting dataset (coined ECP2.5D), now including LiDAR data as well as the generated annotations, is made publicly available for (non-commercial) benchmarking of camera-based and/or LiDAR 3D object detection methods. We provide baseline results for 3D localization from single images by extending the YOLOv3 2D object detector with a distance regression including uncertainty estimation.
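The idea of attaching a 3D point position to a 2D box given a distance measurement can be illustrated with a minimal sketch. This is not the lifting procedure from the paper: it assumes an ideal pinhole camera without lens distortion, uses the box center as the reference pixel, and treats the distance as the Euclidean range from the camera center. The function name `lift_box_to_3d` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical and chosen for illustration only.

```python
import math


def lift_box_to_3d(box, distance, fx, fy, cx, cy):
    """Back-project the center of a 2D box to a 3D point at a given range.

    Assumptions (not taken from the paper): ideal pinhole camera, no lens
    distortion, box given as (x_min, y_min, x_max, y_max) in pixels, and
    `distance` is the Euclidean range from the camera center in meters.
    Returns (x, y, z) in camera coordinates: x right, y down, z forward.
    """
    x_min, y_min, x_max, y_max = box

    # Reference pixel: the center of the 2D bounding box.
    u = 0.5 * (x_min + x_max)
    v = 0.5 * (y_min + y_max)

    # Direction of the viewing ray through pixel (u, v).
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)

    # Scale the ray so that the resulting point lies at the measured range.
    norm = math.sqrt(sum(c * c for c in ray))
    return tuple(distance * c / norm for c in ray)


# Hypothetical intrinsics and box, for illustration only.
point = lift_box_to_3d((900, 400, 1100, 600), 10.0,
                       fx=1000.0, fy=1000.0, cx=1000.0, cy=500.0)
print(point)
```

A box centered on the principal point maps to a point straight ahead of the camera at the given range. A real pipeline would additionally need to associate LiDAR returns with each box and handle calibration between the two sensors, which this sketch leaves out.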
Despite the success of deep learning, human pose estimation remains a challenging problem, in particular in dense urban traffic scenarios. Its robustness is important for follow-up tasks like trajectory prediction and gesture recognition. We are interested in human pose estimation in crowded scenes with overlapping pedestrians, in particular pairwise constellations. We propose a new top-down method that relies on pairwise detections as input and jointly estimates the two poses of such pairs in a single forward pass within a deep convolutional neural network. As the availability of automotive datasets providing poses and a fair amount of crowded scenes is limited, we extend the EuroCity Persons dataset with additional images and pose annotations. With 46,975 images and poses of 279,329 persons, our new EuroCity Persons Dense Pose dataset is the largest pose dataset recorded from a moving vehicle. In our experiments using this dataset, we show improved performance for poses of pedestrian pairs in comparison with a state-of-the-art method for human pose estimation in crowds.