Deep Fruit Detection in Orchards: A Reproduction

Deep learning is one of the most popular fields in AI. Impressive results have been achieved in fields like natural language processing (NLP) and image classification. However, promising as these results may seem, deep learning scholarship is in a ‘reproducibility crisis’ (Barber, 2019). Results from many papers are difficult to replicate, often due to a lack of information given by the authors (e.g. hyperparameters, algorithms, network architecture) and datasets not being publicly available. Because of this, we decided to reproduce a paper by Bargoti & Underwood (2017), called Deep Fruit Detection in Orchards.

If someone mentions deep learning, many of us think of popular use cases in fields such as NLP, bioinformatics and automated speech recognition. However, another interesting but often overlooked use case is automation in agriculture. This field is concerned with building automated systems that efficiently perform human labour tasks in agriculture and make better use of existing resources.

Bargoti & Underwood (2017) tackle vision-based fruit detection, which produces accurate knowledge of individual fruit locations in the field. Given an image of an orchard, the model predicts the locations of the fruit (see Figure 1). This makes yield estimation and mapping possible, which is beneficial for growers as it “facilitates efficient utilisation of resources and improves returns per unit area and time” (Bargoti & Underwood, 2017). Additionally, this could be helpful in automated robotic harvesting systems.

Figure 1: An example of the output of a vision-based fruit detection model

To identify fruit locations in orchards, Bargoti & Underwood (2017) used a Faster R-CNN with a VGG16 backbone. The results of that paper form the basis of this article. On top of that, we also evaluate the model with a ResNet-50 backbone.

Faster R-CNN

Faster R-CNN consists of several components. The first is the so-called backbone: a pre-trained Convolutional Neural Network (CNN) that takes the original image as input and returns a feature map, functioning as a feature extractor. Several architectures can be used for this network, such as ResNet-50, VGG16, VGG19 and ZF. In this blog, we experiment with the first two, ResNet-50 and VGG16, so they are described below.

The output of this component is used by the next component: the Region Proposal Network (RPN). As the name suggests, the RPN identifies a set of possible object locations (proposals). It does so by using anchors: bounding boxes of various sizes and aspect ratios, placed at regular positions across the feature map.
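The anchor mechanism can be sketched in a few lines. Below is a minimal example assuming the common Faster R-CNN defaults of three scales and three aspect ratios; the function name and the (x_center, y_center, width, height) box format are our own choices for illustration:

```python
# A minimal sketch of anchor generation at a single feature-map location,
# assuming the common Faster R-CNN defaults: 3 scales x 3 aspect ratios.
# Box format: (x_center, y_center, width, height).

def make_anchors(center, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate one anchor per (scale, ratio) pair at one location."""
    cx, cy = center
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the anchor's area equal to scale**2 while varying its
            # shape: width * height == s * s and height / width == r.
            w = s / r ** 0.5
            h = s * r ** 0.5
            anchors.append((cx, cy, w, h))
    return anchors

anchors = make_anchors((112, 112))
print(len(anchors))  # 9 anchors per location
```

In the full RPN, this set of nine anchors is repeated at every position of the feature map, and the network learns to score and refine each one.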

The output of the RPN is a set of region proposals. The task that remains is determining which regions are interesting and which are not. To make this possible, a fixed-size feature map needs to be extracted from each proposed region, whatever its shape. This step is called Region of Interest (ROI) pooling.
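To illustrate the idea, here is a toy ROI max-pooling routine over a single-channel feature map, assuming integer box coordinates for simplicity; real implementations (e.g. torchvision's `roi_pool`) additionally handle batches, channels and sub-pixel bins:

```python
# A toy sketch of ROI max-pooling: the region is divided into a fixed
# output_size x output_size grid of bins, and each bin keeps its maximum.
# Assumes a single-channel map and integer box coordinates.

def roi_pool(feature_map, box, output_size=2):
    """Pool the region box = (x0, y0, x1, y1) into a fixed-size grid."""
    x0, y0, x1, y1 = box
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(output_size):
        row = []
        for j in range(output_size):
            # Bin (i, j) covers a proportional slice of the region.
            ys = y0 + i * h // output_size, y0 + (i + 1) * h // output_size
            xs = x0 + j * w // output_size, x0 + (j + 1) * w // output_size
            row.append(max(feature_map[y][x]
                           for y in range(ys[0], ys[1])
                           for x in range(xs[0], xs[1])))
        pooled.append(row)
    return pooled

fmap = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15]]
print(roi_pool(fmap, (0, 0, 4, 4)))  # [[5, 7], [13, 15]]
```

However large or small the proposal, the output always has the same size, which is what allows the fully connected layers downstream to process it.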

Lastly, this output is used by a region-based detection head (the ‘R-CNN’ part, where R-CNN stands for Region-based Convolutional Neural Network, not recurrent) to classify the proposed regions and thereby determine which regions contain important objects. For this, it uses two fully connected layers that take the pooled feature maps of the regions of interest as input. For an overview, see Figure 2.
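A shape-only sketch of that detection head is shown below, with toy layer sizes so it runs instantly; in the actual VGG16-based Faster R-CNN the two fully connected layers are 4096-wide and operate on flattened 512 × 7 × 7 pooled features. All variable names here are our own:

```python
import numpy as np

# A shape-only sketch of the R-CNN detection head: two fully connected
# layers over flattened ROI features, followed by one classification
# output and one box-regression output per region. Sizes are shrunk
# from the real 25088 -> 4096 -> 4096 head so the example runs instantly.
rng = np.random.default_rng(0)
n_rois, feat_dim, hidden, n_classes = 8, 128, 64, 2  # toy sizes

roi_features = rng.normal(size=(n_rois, feat_dim))   # flattened ROI features
fc1 = rng.normal(size=(feat_dim, hidden))
fc2 = rng.normal(size=(hidden, hidden))
cls_w = rng.normal(size=(hidden, n_classes))
box_w = rng.normal(size=(hidden, 4 * n_classes))

x = np.maximum(roi_features @ fc1, 0)  # first FC layer + ReLU
x = np.maximum(x @ fc2, 0)             # second FC layer + ReLU
class_scores = x @ cls_w               # one score per class per region
box_deltas = x @ box_w                 # four box refinements per class

print(class_scores.shape, box_deltas.shape)  # (8, 2) (8, 8)
```

Every region thus receives a score for each class (here, e.g., background vs. fruit) plus a per-class refinement of its bounding box.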

Figure 2: Stages in the Faster R-CNN architecture (middle two images are reconstructions)



VGG16

Figure 3: An overview of VGG16

The network takes a 224 × 224 image as input. The image is passed through stacks of convolutional layers, with spatial pooling applied after each stack by a max-pooling layer. All convolutional layers use the ReLU activation.

After the input has gone through this combination of convolutions and spatial pooling, it passes through three fully connected layers. The first two of these have 4096 channels each; the last has 1000 channels and is followed by a softmax. For the Faster R-CNN backbone, we only use the convolution and pooling layers.
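The convolutional part of VGG16 is often written as a compact configuration list. The short script below, a sketch of our own, uses that list to verify the layer count and the spatial size of the feature map that the backbone hands to Faster R-CNN:

```python
# The VGG16 convolutional configuration in the usual compact notation:
# numbers are 3x3 conv layers (by output channels), 'M' is a 2x2 max-pool.
VGG16 = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
         512, 512, 512, 'M', 512, 512, 512, 'M']

conv_layers = sum(1 for v in VGG16 if v != 'M')
pools = VGG16.count('M')

size = 224
for v in VGG16:
    if v == 'M':
        size //= 2  # each max-pool halves the spatial resolution

# 13 conv layers + 3 fully connected layers = the "16" in VGG16;
# five pools take 224 down to 224 / 2**5 = 7.
print(conv_layers, pools, size)  # 13 5 7
```

The resulting 7 × 7 spatial grid (with 512 channels) is exactly the feature-map resolution the ROI pooling stage is built around.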

VGG16 is used, as it is one of the chosen backbones in the original paper.


ResNet-50

ResNet was proposed in 2015 for image recognition tasks (He et al., 2016). Before this paper, it was often thought that the deeper a neural network is, the better it is able to learn certain features. However, He et al. (2016) showed empirically that beyond a certain number of layers the performance does not improve; on the contrary, it actually degrades (see Figure 4).

Figure 4: Training and test error for 20- and 56-layer networks (He et al., 2016)

He et al. (2016) solve this by introducing the residual block (see Figure 5). The key ingredient is the shortcut connection, which skips one or more layers and, in this case, simply performs an identity mapping. These skip connections make it possible to train much deeper neural networks, which in turn allows for richer feature representations.

Figure 5: The concept of a residual block (He et al. 2016)
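The idea can be sketched in a few lines of numpy. This is a minimal fully connected residual block rather than the convolutional one ResNet actually uses, with variable names of our own choosing:

```python
import numpy as np

# A minimal numpy sketch of a residual block: the output of a small
# stack of layers F(x) is added back to the input x via the identity
# shortcut, so the block only has to learn the residual F(x) = H(x) - x.

def residual_block(x, w1, w2):
    """Two linear layers with ReLU, plus an identity shortcut."""
    out = np.maximum(x @ w1, 0)   # first layer + ReLU
    out = out @ w2                # second layer (no activation yet)
    out = out + x                 # identity shortcut: skip both layers
    return np.maximum(out, 0)     # final ReLU after the addition

x = np.ones(4)
w1 = np.zeros((4, 4))
w2 = np.zeros((4, 4))
# With zero weights F(x) = 0, so the block reduces to the identity:
print(residual_block(x, w1, w2))  # [1. 1. 1. 1.]
```

The zero-weight case shows why depth stops hurting: a residual block can always fall back to the identity mapping, so stacking more blocks should, at worst, leave the representation unchanged.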

For this experiment, we chose the ResNet architecture because He et al. (2016) showed that it can outperform the VGG16 architecture on object recognition tasks. ResNet-50 is the most suitable variant in our case, because ResNet-101 and ResNet-152 both take considerably longer to train.


Results

The Faster R-CNN model was run once on each of a range of training set sizes (1, 3, 7, 15, 27, 52, 100, 200 and 420) and took about 30–90 minutes to converge, depending on the size of the dataset. To compare our results directly with those obtained by Bargoti & Underwood (2017), we plotted the data with the same styling and scale. For the evaluation, a prediction was considered a true positive if the predicted and ground-truth bounding boxes had an Intersection over Union (IoU) greater than 0.2.
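The IoU criterion is straightforward to compute for axis-aligned boxes; a small sketch, assuming (x0, y0, x1, y1) box coordinates:

```python
# Intersection over Union (IoU) of two axis-aligned boxes in
# (x0, y0, x1, y1) format. Under the evaluation used here, a
# prediction counts as a true positive when IoU > 0.2.

def iou(a, b):
    """IoU = intersection area / union area of two boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)  # 0 if disjoint
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

pred, truth = (0, 0, 10, 10), (5, 0, 15, 10)
score = iou(pred, truth)
print(score > 0.2)  # True: this prediction counts as a true positive
```

A threshold of 0.2 is quite lenient compared to the 0.5 commonly used in object-detection benchmarks, reflecting that rough fruit localisation is sufficient for yield estimation.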

Figure 6: A plot of the average precision results of the Faster R-CNN models

The Faster R-CNN with the ResNet-50 backbone seemed to converge faster on average; however, it struggled to converge on training sets of size three or smaller. In addition, ResNet-50 outperforms the VGG16 backbone on training sets larger than three. This is in line with the expectations sketched in the previous section.


Moreover, the dataset we used for the reproduction appears to have grown since the original paper was written, which might also account for some of the differences in the results.

Figure 7: Results in the original paper by Bargoti & Underwood (2017)

Value of a reproduction

Doing a reproduction is also the best way to verify that the experiment performed in the original paper is described clearly and that all the required information is provided. This, in turn, can help improve existing papers.

Lastly, one of the most important aspects of performing a reproduction is the opportunity to learn something new: while reading papers certainly increases one’s knowledge, reproducing them step by step is an interesting and educational way to experience state-of-the-art techniques first-hand.


Conclusion

We extended the results by also evaluating a different backbone, namely ResNet-50. Finally, we presented the results of both Faster R-CNN models on the almond dataset provided by the original paper. After comparing the two, it is safe to say that our results uphold those of the original paper on the almond dataset.

Further research

It would also be interesting to investigate the performance of other backbones for fruit detection, such as MobileNet and ShuffleNet.

Lastly, the models covered in this reproduction rely on several hyperparameters, which could be tuned further to improve performance.

Thanks for reading! Take a look at the GitHub repository if you find something unclear.

Raoul Kalisvaart:

Bilal El Attar:


Bargoti, S., & Underwood, J. (2017). Deep fruit detection in orchards. 2017 IEEE International Conference on Robotics and Automation (ICRA), 3626–3633.

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.

Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149.

Computer Science Student @TU Delft
