
4.5 Solving the overfitting problem


When training a neural network, it is very important to avoid overfitting as well as underfitting. The overfitting problem was already mentioned in previous sections. It means that the neural network, especially when trained on a small dataset, learns very specific features and essentially memorizes the training dataset, achieving very good accuracy. When later tested on new images, the neural network cannot predict the correct class, as a proper generalization of the class was never formed.

The simplest way to fix the overfitting problem is to add more data to the training. Since the overfitting occurred on Stanford’s dataset, which is only a small part of the whole merged dataset, adding more data should not be a problem.

However, adding more images of cars to the existing classes also adds more car mark classes, so the number of classes grows from 49 to 125 car marks.

The data should be balanced properly, as an imbalanced dataset might prevent the neural network from learning the correct classes.

In order to fix the overfitting issue on Stanford’s dataset, different techniques were applied. The first step is to have more data, so data augmentation was applied: every picture in the training dataset is slightly modified on each epoch by shifting it by a certain coefficient, rotating it, or flipping it horizontally. The objects in the pictures stay the same, but the position and viewing angle change slightly. This allows the neural network to extract slightly different features while still classifying the image into the same class.
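A minimal sketch of such an augmentation pipeline with Keras’ ImageDataGenerator; the exact shift and rotation coefficients are not stated in the text, so the values below, as well as the directory layout, are illustrative assumptions.

```python
from keras.preprocessing.image import ImageDataGenerator

# Illustrative augmentation settings; the exact coefficients used in the
# experiments are not stated, so these values are assumptions.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,       # normalize pixel values to [0, 1]
    width_shift_range=0.1,   # shift horizontally by up to 10% of the width
    height_shift_range=0.1,  # shift vertically by up to 10% of the height
    rotation_range=15,       # rotate by up to 15 degrees
    horizontal_flip=True,    # randomly flip horizontally
)

# Every epoch draws freshly perturbed variants of the training images.
train_generator = train_datagen.flow_from_directory(
    "data/train",            # hypothetical directory layout
    target_size=(299, 299),
    batch_size=32,
    class_mode="categorical",
)
```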

Another technique that can help with overfitting is adding Dropout layers. Dropout with a rate of 0.25 was tested on Stanford’s dataset: a quarter of the features are dropped when reaching the Dropout layer, so only 75% of the features are passed forward. Together with data augmentation, Dropout helped to decrease the gap between the accuracy on the training set and the accuracy on the validation set.

Previously, the worst results on Stanford’s dataset showed over 70% accuracy on the training set with less than 20% accuracy on the validation set. Now the gap is a lot smaller: the accuracy on the training set is around 35%, while the accuracy on the validation set is over 20%.

Both data augmentation and Dropout were added to the training when using the merged dataset. The Dropout rate was reduced to 0.1 so as not to cause underfitting.

4.6 Further improvements

While investigating and debugging the code, interesting techniques for fine-tuning neural networks were discovered. It was decided to apply the discovered modifications to the car classifier as well.

From the logs of training on the merged dataset it was seen that at some point a neural network, for example NASNet Large, reaches 20%-25% validation accuracy but does not improve further. The previously described modifications to prevent overfitting do not allow the accuracy on the training set to grow too much while the validation accuracy stagnates, so the training simply caps at a training accuracy of around 35%-40% and a validation accuracy of 20%-25%, depending on the architecture. To solve the issue, more trainable parameters are introduced into the top level of the model architecture: on top of every architecture, an additional fully-connected layer of size 128 with ReLU activation, followed by Batch Normalization and Dropout, is added before the final fully-connected layer with Softmax activation.
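A minimal sketch of this classification head in Keras, assuming a pre-trained base network with its original top removed; the global average pooling between the base and the head is an assumption, the Dropout rate of 0.2 is taken from the final setup in Chapter 5, and num_classes stands for the 125 car marks.

```python
from keras.applications import InceptionV3
from keras.layers import (BatchNormalization, Dense, Dropout,
                          GlobalAveragePooling2D)
from keras.models import Model

num_classes = 125  # car mark classes in the merged dataset

# Pre-trained base network without its original classifier.
base = InceptionV3(weights="imagenet", include_top=False,
                   input_shape=(299, 299, 3))

x = GlobalAveragePooling2D()(base.output)  # assumed pooling step
x = Dense(128, activation="relu")(x)       # extra trainable parameters
x = BatchNormalization()(x)
x = Dropout(0.2)(x)                        # rate from the final setup
outputs = Dense(num_classes, activation="softmax")(x)

model = Model(inputs=base.input, outputs=outputs)
```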

Different optimizers are used for different models in order to improve training performance. Xception and Inception experimentally proved to produce slightly better results with the Adam optimizer than with Stochastic Gradient Descent (SGD) or RMSprop [36]. The NASNet architecture, however, uses the RMSprop optimizer, as proposed in the original paper.
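A sketch of this per-architecture optimizer choice, reusing the model from the previous sketch; the dictionary is just an illustrative way to organize the decision, and the 0.01 starting learning rate is taken from the schedule described below.

```python
from keras.optimizers import Adam, RMSprop

# Optimizer per architecture; 0.01 is the starting learning rate,
# reduced later according to the schedule.
optimizers = {
    "inception_v3": Adam(lr=0.01),
    "xception":     Adam(lr=0.01),
    "nasnet":       RMSprop(lr=0.01),  # configured in detail in Chapter 5
}

model.compile(optimizer=optimizers["inception_v3"],
              loss="categorical_crossentropy",
              metrics=["accuracy", "top_k_categorical_accuracy"])
```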

Moreover, the learning rate is changed from a constant to a value that varies over the course of training. For the Inception and Xception architectures, the learning rate for the first 5 epochs was set to 0.01, then changed to 0.001 while the model was trained for 20 epochs, and the last 10 epochs were trained with a learning rate of 0.0001. For the NASNet architectures, the learning rate decay was given to the RMSprop optimizer as a parameter and not changed manually after a set number of epochs. The last few epochs were trained with all the layers of the given architectures unfrozen (only for Inception and Xception), training all the layers of the network.
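The staged schedule for Inception and Xception maps onto a Keras LearningRateScheduler callback; a sketch, with the epoch boundaries derived from the 5 + 20 + 10 stages above.

```python
from keras.callbacks import LearningRateScheduler

def staged_lr(epoch):
    """Staged learning rate: epochs 0-4 use 0.01, epochs 5-24 use 0.001,
    and the remaining epochs use 0.0001."""
    if epoch < 5:
        return 0.01
    if epoch < 25:
        return 0.001
    return 0.0001

lr_callback = LearningRateScheduler(staged_lr)
# Passed to training later as: model.fit(..., callbacks=[lr_callback])
```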

The number of epochs varied from model to model, as the training was performed interactively. The weights after each epoch were compared to the previous weights, and the best weights were saved. After training for a certain number of epochs finished, the best weights were loaded and further training for a certain number of epochs was launched. With such an approach it was easier to see whether the model was still training, whether it started to overfit or underfit, whether the code contained errors, and whether the results made sense. The next chapter gives an overview of the final experiment setup as well as the results achieved by the time of submitting the thesis.
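A sketch of this checkpoint-and-resume loop with Keras’ ModelCheckpoint, reusing the model, generator, and callback from the earlier sketches; the file path, the monitored metric, and the val_generator are assumptions, since the text only says that the best weights were kept and reloaded.

```python
from keras.callbacks import ModelCheckpoint

# Keep only the best weights seen so far, judged by validation accuracy.
checkpoint = ModelCheckpoint("best_weights.h5",  # hypothetical path
                             monitor="val_acc",
                             save_best_only=True,
                             save_weights_only=True)

model.fit_generator(train_generator, epochs=20,
                    validation_data=val_generator,  # assumed generator
                    callbacks=[lr_callback, checkpoint])

# To resume: reload the best weights and launch further epochs.
model.load_weights("best_weights.h5")
model.fit_generator(train_generator, epochs=10,
                    validation_data=val_generator,
                    callbacks=[lr_callback, checkpoint])
```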

Chapter 5

Experimental Results

The final setup of the experiment includes many improvements over the initial experiments, with the purpose of increasing the validation accuracy. The 4 selected neural network architectures are still the same: Inception-V3, Xception, NASNet Mobile and NASNet Large. Each of the models has the following layers on top of the original model: a fully-connected layer of size 128 with ReLU activation, a Batch Normalization layer after that, a Dropout layer with a dropout rate of 0.2, and a final fully-connected layer with Softmax activation.

The learning rate parameter changes during training. For Xception and Inception, the strategy for changing the learning rate is based on manual supervision of the training process. The learning rate of 0.01 is applied for the first 5 epochs only, as after that it starts to cause overfitting. Then the main training is done for 20 more epochs with a learning rate of 0.001. If the accuracy still grows, an additional 10 epochs are trained with the same learning rate. If the accuracy seems to have reached a plateau, the learning rate is changed to 0.0001 and training continues for a further 10 epochs. A few times the training was interrupted due to technical problems, so the weights of the last successful training run were loaded and the training process was relaunched. For that reason, the exact number of epochs trained is hard to state. The Inception and Xception networks finished training with 3 epochs with all the layers unfrozen.
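A sketch of that final fine-tuning stage in which the frozen base layers are made trainable again; recompiling after changing the trainable flags is required in Keras, and the 0.0001 learning rate is an assumption carried over from the last stage of the schedule.

```python
from keras.optimizers import Adam

# Unfreeze every layer of the network for the last few epochs.
for layer in model.layers:
    layer.trainable = True

# Recompile so the change takes effect; a small learning rate avoids
# destroying the pre-trained features.
model.compile(optimizer=Adam(lr=0.0001),
              loss="categorical_crossentropy",
              metrics=["accuracy", "top_k_categorical_accuracy"])

model.fit_generator(train_generator, epochs=3,
                    validation_data=val_generator,
                    callbacks=[checkpoint])
```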

The learning rate for both NASNet architectures was set up to change over time in a different way: the default parameters of the RMSprop optimizer were changed to implement the desired technique. The starting learning rate for NASNet was also set to 0.01. The learning rate decay was set to 0.00004, applied after each epoch. The Rho and Epsilon parameters of the optimizer are both set to 0.9. The NASNet architecture is too large to re-train the weights of all the layers.
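This configuration maps directly onto the RMSprop constructor; a sketch, noting that the unusual Epsilon value of 0.9 is copied from the text as given.

```python
from keras.optimizers import RMSprop

# RMSprop configured as described for the NASNet architectures.
nasnet_optimizer = RMSprop(lr=0.01,        # starting learning rate
                           rho=0.9,
                           epsilon=0.9,    # value as reported in the text
                           decay=0.00004)  # decay applied by the optimizer

model.compile(optimizer=nasnet_optimizer,
              loss="categorical_crossentropy",
              metrics=["accuracy", "top_k_categorical_accuracy"])
```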

The dataset for training and validation is the merged dataset that contains data from both the Stanford and CompCars datasets. Cleaning of the dataset was already performed a few times in order to reduce the number of classes and to fix the issues with incorrect or inconsistent labels. The merged dataset only contains car marks as classes; the information about models and years is not processed at this point of the research. Moreover, the information about car models would require heavy modification, as many model names contain the mark name and include metadata such as the car type (e.g. hatchback or sedan) or the car submodel (e.g. standard, performance, hybrid). Processing the models as well would require modifying most of the labels. The merged dataset contains 125 car mark classes and has 106017 images in the training set and 26562 images in the validation set, giving a split ratio of 80/20 between the training and validation sets.
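A sketch of producing such an 80/20 split with Keras’ validation_split; the directory path is hypothetical, and the text does not state whether the split was done this way or with pre-separated folders.

```python
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255,
                             validation_split=0.2)  # 80/20 split

# ~106017 training images across 125 car mark classes.
train_gen = datagen.flow_from_directory("merged_dataset/",  # hypothetical
                                        target_size=(299, 299),
                                        batch_size=32,
                                        subset="training")

# ~26562 validation images.
val_gen = datagen.flow_from_directory("merged_dataset/",
                                      target_size=(299, 299),
                                      batch_size=32,
                                      subset="validation")
```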

The best results achieved with each architecture after training for 40-50 epochs are given in Table 5.1. It can be seen from the table that the accuracy of the Inception and Xception networks is very similar, as was expected from the beginning, although Xception was expected to slightly outperform Inception and it happened the other way around. NASNet was expected to show much better results.

Due to the number of layers in the NASNet architecture, it was not possible to save the weights of the model after each epoch as checkpoints the same way it was done with Inception and Xception. The weights had to be saved manually, so some of the weights were lost when an error appeared before the saving method was called.

Network        Loss    Top 1 Accuracy   Top 5 Accuracy
Inception      3.5352  17.97%           46.35%
Xception       3.5351  17.16%           44.59%
NASNet Large   3.9664   9.68%           35.60%
NASNet Mobile  4.7133   1.33%           12.16%

Table 5.1: Final results of the Inception, Xception, NASNet Large and NASNet Mobile network architectures after improved training on the whole merged dataset.

The results for the Xception and Inception architectures have definitely improved compared to the initial experiment (Table 4.4). The Xception model’s accuracy on the validation set was 10.92%; now it increased to 17.16%, with a top 5 accuracy of 44.59%. The results of the NASNet architectures became significantly worse, but most likely this is due to a code error or missing weights. The model was not re-trained due to the lack of time.

However, the accuracy of 60%-70% was still not achieved. There are 2 possible explanations: the data is not suitable, or car classification based on marks cannot be learned that easily, as a single car mark contains too many different car models and types inside, making it really hard to find general unique features that distinguish the marks.

After the final results were obtained, the data was reviewed again. The number of classes was reduced from 125 to 73: all the Chinese cars were removed, as most of them are near copies of European cars that look almost the same but have to be labeled differently. Car marks with fewer than 100 images were also removed. A few more labels were merged into one label where there was a typo or where the car makers were labeled differently in the 2 datasets (e.g. Chevy vs. Chevrolet). Some models were removed from the classes, as they were very rare or unique cars, sometimes only concepts that were never produced (e.g. Audi Urban). The new dataset contains 91692 images in the training set and 22960 in the test set. Training after this final cleaning of the dataset was not performed due to the lack of time. With further training it will be possible to say whether the messy data was affecting the training results that much.

The second possible explanation is more interesting. Perhaps it is not a trivial task for a neural network to learn the features of particular car marks. The variety of cars within one mark is huge: there are different models and submodels, and cars look different depending on the year of manufacture. There are sedans, hatchbacks, convertibles, SUVs and pickups, all of which can belong to a single car mark. The assumption to start with car marks first might be wrong. It may be that a neural network can learn to distinguish car models much faster and predict the car marks from those, as every car model has more distinguishable features than every mark.

Chapter 6

Conclusion

The thesis focused on high-value target detection, where high-value targets were defined as different cars. For example, one of the goals was to recognize the BMW car mark and its model X5 from real-life video footage in order to signal an alarm to the human operator. The classification algorithm implementation includes training a neural network. Unfortunately, the results of the training were not good enough to be actually applicable to the described task. Further work on modeling the classifier is required.

Despite the poor classification accuracy, a lot of work was done. Multiple network architectures were tested, and the parameters and techniques for fine-tuning a neural network were researched and experimentally tested. The working methods are preserved and documented, and the methods that produced worse results are ruled out. Questions such as “what network to use as a basis?”, “what pre-processing methods to use?” and “how to best clean up the data to achieve meaningful results?” have their detailed answers. Some of the modifications, such as cleaning up the dataset, are already fulfilled. The lack of sufficiently powerful GPU resources throughout the testing period significantly slowed down the training, hence the mismatch between expected and actual results was discovered too late to conduct more training experiments.

The data the neural networks were trained with deserves a special comment. The dataset was not good: it required multiple clean-up procedures to finally get something working, and it may still not satisfy the requirements for good training. Now it is clear that there is no better dataset than one prepared by one’s own hands and tailored to specific needs. Open-source datasets are not big enough and require filtering, restructuring and cleaning. It is a good idea to build an automated system for collecting more data in a desired format, so that a balanced dataset with all the classes labeled properly and consistently is available.

Next steps would be to:

1. Train the networks on the last version of the merged dataset (79 car mark classes).

2. Train the networks using car models as classes - requires a lot of re-labeling in the current dataset or collecting a new dataset.

3. Investigate whether a network trained to distinguish car models can also classify car marks.

4. Try a CNN-RNN architecture or a hierarchical softmax function to combine classification of car marks and models into a single network architecture.

Bibliography

[1] Youtube statistics for 2018. https://fortunelords.com/youtube-statistics/.

[2] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.

[3] Joseph Redmon and Ali Farhadi. Yolo9000: Better, faster, stronger. arXiv preprint arXiv:1612.08242, 2016.

[4] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, January 2015.

[5] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context. arXiv preprint arXiv:1405.0312, 2014.

[6] Ross Girshick. Fast r-cnn. arXiv preprint arXiv:1504.08083, 2015.

[7] Sasha Targ, Diogo Almeida, and Kevin Lyman. Resnet in resnet: generalizing residual architectures. arXiv preprint arXiv:1603.08029, 2016.

[8] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.

[9] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.

[10] Ejaz Ahmed, Michael Jones, and Tim K Marks. An improved deep learning architecture for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3908–3916, 2015.

[11] RT Lee, KC Hung, and HS Wang. Real time vehicle license plate recognition based on 2d haar discrete wavelet transform. International Journal of Scientific & Engineering Research, 3(4):1–6, 2012.

[12] Hongye Liu, Yonghong Tian, Yaowei Yang, Lu Pang, and Tiejun Huang. Deep relative distance learning: Tell the difference between similar vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2167–2175, 2016.

[13] Florian Chabot, Mohamed Chaouch, Jaonary Rabarisoa, Céline Teulière, and Thierry Chateau. Deep manta: A coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular image. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 2040–2049, 2017.

[14] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.

[15] Xiaodan Liang, Liang Lin, Xiaohui Shen, Jiashi Feng, Shuicheng Yan, and Eric P Xing. Interpretable structure-evolving lstm. In Proc. CVPR, pages 2175–2184, 2017.

[16] Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, and Wei Xu. Cnn-rnn: A unified framework for multi-label image classification. In Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on, pages 2285–2294. IEEE, 2016.

[17] Balázs Nagy and Csaba Benedek. 3d cnn based phantom object removing from mobile laser scanning data. In Neural Networks (IJCNN), 2017 International Joint Conference on, pages 4429–4435. IEEE, 2017.

[18] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[19] Joshua Owoyemi and Koichi Hashimoto. Learning human motion intention with 3d convolutional neural network. In Mechatronics and Automation (ICMA), 2017 IEEE International Conference on, pages 1810–1815. IEEE, 2017.

[20] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.

[21] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[22] Linjie Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. A large-scale car dataset for fine-grained categorization and verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3973–3981, 2015.

[23] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013.

[24] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[25] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

[26] François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint, 2016.

[27] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[28] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2017.

[29] Keras documentation page. https://keras.io.

[30] Tensorflow homepage. https://www.tensorflow.org.

[31] Ranking of deep learning libraries 2017.

[36] T Tieleman and G Hinton. Rmsprop: Divide the gradient by a running average of its recent magnitude. Coursera: Neural networks for machine learning. Technical report. Available online: https://zh.coursera.org/learn/neuralnetworks/lecture/YQHki/rmsprop-divide-the-gradient-by-a-running-average-of-its-recent-magnitude (accessed on 21 April 2017).

All the web references were valid on 21.05.2018.

License

Non-exclusive license to reproduce thesis and make thesis public

I, Anton Prokopov (date of birth: 14.03.1993),

1. herewith grant the University of Tartu a free permit (non-exclusive licence) to:

1.1. reproduce, for the purpose of preservation and making available to the public, including for addition to the DSpace digital archives until expiry of the term of validity of the copyright, and

1.2. make available to the public via the web environment of the University of Tartu, including via the DSpace digital archives until expiry of the term of validity of the copyright,

“High-value target detection”, supervised by Gholamreza Anbarjafari,

2. am aware of the fact that the author retains these rights.

