Deep learning is one of the most popular fields in AI. Impressive results have been achieved in areas such as natural language processing (NLP) and image classification. However, promising as these results may seem, deep learning scholarship is in a ‘reproducibility crisis’ (Barber, 2019). Results from many papers are difficult to replicate, often because the authors omit information (e.g. hyperparameters, algorithms, network architecture) or because datasets are not publicly available. Because of this, we decided to reproduce a paper by Bargoti & Underwood (2017) called Deep Fruit Detection in Orchards.
If someone mentions deep learning, many of us think of popular use cases in fields such as NLP, bioinformatics and automated speech recognition. However, another interesting but often overlooked use case is automation in agriculture. This field is concerned with building automated systems that efficiently perform human labour tasks in agriculture and make better use of existing resources.
Bargoti & Underwood (2017) tackle vision-based fruit detection, which produces more accurate knowledge of individual fruit locations in the field. Given an image of an orchard, the model tries to predict the locations of the fruit (see Figure 1). This makes yield estimation and mapping possible, which benefits growers as it “facilitates efficient utilisation of resources and improves returns per unit area and time” (Bargoti & Underwood, 2017). Additionally, this could be helpful in automated robotic harvesting systems.
To identify fruit locations in orchards, Bargoti & Underwood (2017) used a Faster R-CNN with a VGG16 backbone. The results of that paper form the basis of this article. On top of that, we also evaluate the model with a ResNet-50 backbone.
For fruit detection, the so-called Faster R-CNN model is used. Faster R-CNN is a convolutional neural network architecture that is widely used for object detection, and it builds upon the earlier R-CNN and Fast R-CNN models.
Faster R-CNN consists of several components. The first is the so-called backbone: a pre-trained Convolutional Neural Network (CNN) that takes the original image as input and returns a feature map. It functions as a feature extractor. Different architectures can be used for this network, such as ResNet-50, VGG16, VGG19 and ZF. In this blog, we experiment with the first two, ResNet-50 and VGG16, so they are described below.
The output of this component is consumed by the next component: the Region Proposal Network (RPN). As the name suggests, the RPN identifies a set of candidate locations (proposals) for objects. It does so using anchors: bounding boxes of various sizes and aspect ratios that are tiled across the processed image.
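To make the anchor mechanism concrete, here is a minimal sketch of anchor generation at a single location. The scales and aspect ratios below are illustrative defaults, not necessarily the values used by Bargoti & Underwood:

```python
# Sketch of RPN-style anchor generation at one feature-map location.
# Scales and aspect ratios here are illustrative, not the paper's values.

def generate_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchor boxes centred at (cx, cy).

    Each anchor has area scale**2 and a width/height ratio of `ratio`.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # width * height == scale**2 and width / height == ratio
            w = scale * ratio ** 0.5
            h = scale / ratio ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = generate_anchors(100, 100)
print(len(anchors))  # 3 scales x 3 ratios = 9 anchors per location
```

The RPN slides over every feature-map location and scores each of these anchors, regressing the promising ones into tighter proposals.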
The output of the RPN is a set of region proposals. What remains is to determine which regions are interesting and which are not. To make this possible, a fixed-size feature map first needs to be extracted from each proposed region. This step is called Region of Interest (RoI) pooling.
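RoI max-pooling can be sketched in a few lines of NumPy: the proposed region of the feature map is divided into a fixed grid of bins, and each bin is reduced to its maximum activation. This is a simplified single-channel sketch; real implementations (e.g. torchvision's `roi_pool`) additionally handle batching, channels and coordinate quantisation:

```python
import numpy as np

def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one RoI of a 2-D feature map into a fixed output_size grid.

    feature_map: 2-D array (one channel of the backbone output).
    roi: (x1, y1, x2, y2) in feature-map coordinates.
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    out_h, out_w = output_size
    # Split the region into out_h x out_w bins and take the max of each bin.
    h_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    w_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    pooled = np.empty(output_size)
    for i in range(out_h):
        for j in range(out_w):
            pooled[i, j] = region[h_edges[i]:h_edges[i + 1],
                                  w_edges[j]:w_edges[j + 1]].max()
    return pooled
```

Because every proposal, whatever its size, is pooled into the same fixed grid, the subsequent fully connected layers can process all proposals uniformly.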
Lastly, this output is passed to a region-based detection head (the “R-CNN” part of the name, for Region-based Convolutional Neural Network) that classifies the proposed regions and thereby determines which of them contain relevant objects. It uses two fully connected layers that take the pooled feature maps of the regions of interest as input. For an overview, see Figure 2.
To extract feature maps, a backbone is needed. We evaluated the following two architectures:
An overview of the VGG16 architecture can be seen in the figure below.
The network takes a 224 × 224 image as input. The image passes through stacks of convolutional layers, each followed by a ReLU activation, with max-pooling layers interleaved for spatial downsampling.
After the input has gone through the combination of convolutions and spatial pooling, it passes through three fully connected layers: the first two have 4096 channels each, and the last has 1000 channels and is followed by a softmax. For the Faster R-CNN, we only use the convolutional and pooling layers.
VGG16 is used, as it is one of the chosen backbones in the original paper.
In addition to the VGG16 backbone, we also examine how the Faster R-CNN performs with a different backbone. For this, we chose a residual neural network (ResNet) architecture.
ResNet was proposed in 2015 for image recognition tasks (He et al., 2016). Before this paper, it was often assumed that the deeper a neural network is, the better it will be able to learn features. However, He et al. (2016) showed empirically that beyond a certain number of layers, performance stops improving and in fact degrades (see Figure 4).
He et al. (2016) solve this by introducing the residual block (see Figure 5). Its key ingredient is the shortcut connection, which skips one or more layers and, in this case, simply performs an identity mapping. These skip connections make it possible to train much deeper networks, which in turn allows for richer feature representations.
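The idea can be sketched in a few lines of NumPy. This is a toy fully connected residual block; real ResNet blocks use convolutions, batch normalisation and, when dimensions change, a projection shortcut:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def residual_block(x, w1, w2):
    """Toy fully connected residual block: y = ReLU(W2 ReLU(W1 x) + x).

    The identity shortcut `+ x` lets the block fall back to (near) identity
    when the weighted path contributes little, which is what makes very
    deep stacks trainable.
    """
    return relu(w2 @ relu(w1 @ x) + x)

x = np.array([1.0, 2.0, 3.0])
zero = np.zeros((3, 3))
# With zero weights the block reduces to the identity on non-negative input:
print(residual_block(x, zero, zero))  # [1. 2. 3.]
```

In other words, an extra residual block can do no worse than passing its input through unchanged, so adding depth no longer forces the network to learn a difficult identity mapping from scratch.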
For this experiment, we chose the ResNet architecture because He et al. (2016) showed that it outperforms the VGG16 architecture on object recognition tasks. ResNet-50 is the most suitable variant in our case, as ResNet-101 and ResNet-152 both take considerably more time to train.
To reproduce the results of the original paper, we used the TorchVision Faster R-CNN and VGG16/ResNet-50 implementations on the almond dataset. The hyperparameters were copied from the Faster R-CNN paper by Ren et al. (2017), in line with the original study. However, the NMS threshold was set to 0.3, the mean of the threshold range (0.2–0.4) specified in the paper. The FasterRCNN.py file contains this list of hyperparameters in the declaration of the model variable.
The Faster R-CNN model was run once for each of a range of training set sizes (1, 3, 7, 15, 27, 52, 100, 200 and 420) and took about 30–90 minutes to converge, depending on the size of the dataset. To compare our results directly with those obtained by Bargoti & Underwood (2017), we plotted the data with the same styling and scale. As for the evaluation, a prediction was considered a true positive if the predicted and ground-truth bounding boxes had an Intersection over Union (IoU) greater than 0.2.
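The IoU criterion used to match predictions to ground truth can be sketched as:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(prediction, ground_truth, threshold=0.2):
    """A prediction counts as a true positive if IoU exceeds the threshold."""
    return iou(prediction, ground_truth) > threshold
```

For example, two 10 × 10 boxes that overlap by half their width have an IoU of 50 / 150 = 1/3, which clears the 0.2 threshold used here.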
The Faster R-CNN with the ResNet-50 backbone seemed to converge faster on average; however, it struggled to converge on training sets of size three or smaller. In addition, ResNet-50 appears to outperform the VGG16 backbone on training sets larger than three. This is in line with the expectations sketched in the previous section.
When Figure 6 of this reproduction is compared to Figure 7 (the results of the original paper), it can be seen that most of the reproduction data lies within the range of the original data. There are some differences, however, and these may have several causes. First of all, not all of the hyperparameters used in the original paper were properly documented. An example is the NMS threshold, for which only a range (0.2–0.4) was given. In this reproduction, we chose the mean of that range, 0.3, which is not necessarily the exact value used in the original paper.
Moreover, the dataset that we used for the reproduction seems to have grown since the original paper was written, which might also account for some differences in the results.
Value of a reproduction
There are several reasons why a reproduction, such as the one presented in this blog, is valuable. First of all, a reproduction allows the results of the original paper to be validated. This increases transparency and helps verify that the presented results are genuine.
Furthermore, performing a reproduction is the best way to check whether the experiment in the original paper is clearly described and whether all the required information is provided. This, in turn, can help improve existing papers.
Lastly, one of the most important aspects of performing a reproduction is the opportunity to learn something new: while reading papers certainly increases one's knowledge, reproducing them step by step is a very interesting and instructive way to engage with state-of-the-art techniques.
In short, we presented a reproduction of the original Deep Fruit Detection in Orchards by Bargoti & Underwood (2017). This allowed us to validate the results produced by the original work and to examine if the information given in the paper is clear. In addition, it also allowed us to learn about one of the state-of-the-art architectures: the Faster R-CNN.
Furthermore, we extended the results by including an evaluation with a different backbone, namely ResNet-50. Finally, we presented the results of both Faster R-CNN models on the almond dataset provided by the original paper. After comparing the two, it is safe to say that our results uphold those of the original paper on the almond dataset.
Based on the reproduction presented in this blog, there are several possibilities for further research. First of all, it would be interesting to investigate how the reproduced network performs on other types of fruit: either the fruit types used in the original paper (apples and mangoes) or completely different ones. This could even be extended to other types of objects in general.
Furthermore, it would be interesting to research the performance of other backbones when used in fruit detection. Examples could be MobileNet and ShuffleNet.
Lastly, the models covered in this reproduction use several hyperparameters. These could be tuned further in order to improve the performance of the models.
Thanks for reading! Take a look at the GitHub repository if you find something unclear.
Raoul Kalisvaart: R.D.Kalisvaart@student.tudelft.nl
Bilal El Attar: B.ElAttar@student.tudelft.nl
Barber, G. (2019, September 14). Artificial Intelligence Confronts a ‘Reproducibility’ Crisis. Wired. https://www.wired.com/story/artificial-intelligence-confronts-reproducibility-crisis/
Bargoti, S., & Underwood, J. (2017). Deep fruit detection in orchards. 2017 IEEE International Conference on Robotics and Automation (ICRA), 3626–3633. https://doi.org/10.1109/icra.2017.7989417
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778. https://doi.org/10.1109/cvpr.2016.90
Ren, S., He, K., Girshick, R., & Sun, J. (2017). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6), 1137–1149. https://doi.org/10.1109/tpami.2016.2577031