The capstone project for Udacity’s Data Scientist Nanodegree Program
- Project Overview
- Problem Statement
- Performance Metric
- Data Exploration and Visualization
- Data Preprocessing
- Human Detector
- Dog Detector
- Dog Breed Classifier
- List of Dependencies
- Instructions to use the repository
- File Descriptions
- Results
- Conclusion
- Tips to improve the performance
- Licensing, Authors, and Acknowledgements
In this project, I implemented an end-to-end deep learning pipeline that can be used within a web or mobile app to process real-world, user-supplied images. The pipeline accepts any user-supplied image as input and predicts whether a dog or a human is present in it. If a dog is detected, it provides an estimate of the dog’s breed. If a human is detected, it provides an estimate of the dog breed that the person most resembles.
In this project, I was provided with RGB images of humans and dogs and asked to design and implement an algorithm that can detect humans (human detector) or dogs (dog detector) in the images. After detecting a human or a dog, the algorithm must predict the breed of the dog (if a dog is detected) or the most closely resembling dog breed (if a human is detected). If neither is detected, the algorithm should ask the user to supply another image containing either a dog or a human.
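The decision logic described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: `classify_image` and the three callables passed to it are hypothetical names standing in for the dog detector, the human detector, and the breed classifier, and checking for a dog before a human is an assumption.

```python
from typing import Callable


def classify_image(
    img_path: str,
    is_dog: Callable[[str], bool],
    is_human: Callable[[str], bool],
    predict_breed: Callable[[str], str],
) -> str:
    """Return a message describing what was found in the image.

    All three callables are hypothetical placeholders for the detectors
    and the breed classifier; each one takes an image path.
    """
    # Assumption: the dog check runs first, then the human check.
    if is_dog(img_path):
        return f"Dog detected. Predicted breed: {predict_breed(img_path)}"
    if is_human(img_path):
        return f"Human detected. Most resembling breed: {predict_breed(img_path)}"
    # Neither detector fired, so ask for a different image.
    return "No dog or human detected. Please provide another image."
```

Passing the models in as callables keeps the control flow separate from the models themselves, so each of the three components can be swapped or tested on its own.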
To evaluate the performance of my algorithm, I used classification accuracy as the performance metric. All three models (the human detector, the dog detector, and the dog breed classifier) were evaluated by the accuracy they achieved in classifying the images.
Accuracy is a reasonable choice of performance metric for these models. The human detector is evaluated on 100 images of humans and 100 images of dogs (balanced data). Similarly, the dog detector is evaluated on 100 images each of humans and dogs.
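As a small worked example of the metric, the sketch below computes accuracy on a balanced set of 100 human and 100 dog images. The labels and predictions are made up for illustration and are not results from this project.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Illustrative labels only (1 = human, 0 = dog):
# 100 human images followed by 100 dog images.
y_true = [1] * 100 + [0] * 100
# A made-up detector output: 98 humans found, 10 dogs mistaken for humans.
y_pred = [1] * 98 + [0] * 2 + [1] * 10 + [0] * 90
print(accuracy(y_true, y_pred))  # 0.94
```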
The dog breed dataset contains 8351 dog images across 133 breed categories. The dataset is not perfectly balanced. The mean number of images in each class is around 50, but a few classes have fewer than 30 images while some have more than 70. This small imbalance could pose a problem when training the dog breed classifier. It can be mitigated by over-sampling the minority classes, under-sampling the majority classes, or applying data augmentation.
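Random over-sampling of the minority classes, one of the remedies mentioned above, could look roughly like this. It is a sketch under simple assumptions: `oversample` is a hypothetical helper, and the file names and breeds in the toy data are invented.

```python
import random
from collections import Counter, defaultdict


def oversample(samples, seed=0):
    """Randomly duplicate minority-class samples until every class
    has as many samples as the largest class.

    `samples` is a list of (image_path, label) pairs.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append(path)

    target = max(len(paths) for paths in by_label.values())
    balanced = []
    for label, paths in by_label.items():
        # Draw duplicates (with replacement) to fill the gap to `target`.
        extra = [rng.choice(paths) for _ in range(target - len(paths))]
        balanced.extend((p, label) for p in paths + extra)
    return balanced


# Toy data with made-up file names and breeds, for illustration only.
toy = ([(f"a{i}.jpg", "Akita") for i in range(7)]
       + [(f"b{i}.jpg", "Beagle") for i in range(3)])
print(Counter(label for _, label in oversample(toy)))
# Counter({'Akita': 7, 'Beagle': 7})
```

Over-sampling should be applied to the training split only, so that duplicated images do not leak into the validation or test data.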
CNN models in Keras expect a 4D array/tensor as input with shape (batch_size, image_height, image_width, num_channels). Every image in a batch must have the same shape, so the input images for the dog detector and the dog breed classifier need to be resized to a common size.
Getting the 4D tensor ready for a pre-trained CNN model in Keras requires some additional processing. For models that use Caffe-style preprocessing, such as VGG16, VGG19, and ResNet50, the RGB image is first converted to BGR by reordering the channels. Then the mean pixel (expressed in BGR order as [103.939, 116.779, 123.68] and calculated from all pixels in all images in ImageNet) is subtracted from every pixel in each image. Other pre-trained models, such as InceptionV3 and Xception, define their own preprocessing.
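The channel reordering and mean subtraction can be written out in plain NumPy, as in the sketch below. In practice the `preprocess_input` function that ships with each Keras application model performs this step; `caffe_preprocess` is a hypothetical name used here only to make the arithmetic visible, and the 224x224 image size is an assumption.

```python
import numpy as np

# ImageNet mean pixel in BGR channel order, as used by Caffe-style
# preprocessing (VGG16, VGG19, ResNet50).
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)


def caffe_preprocess(batch_rgb: np.ndarray) -> np.ndarray:
    """Convert a (batch, height, width, 3) RGB batch with values in
    [0, 255] to zero-centred BGR."""
    if batch_rgb.ndim != 4 or batch_rgb.shape[-1] != 3:
        raise ValueError("expected shape (batch, height, width, 3)")
    batch_bgr = batch_rgb[..., ::-1].astype(np.float32)  # RGB -> BGR
    return batch_bgr - IMAGENET_MEAN_BGR  # broadcasts over the channel axis


# One dummy 224x224 image filled with the mean colour (given in RGB order).
dummy = np.tile(IMAGENET_MEAN_BGR[::-1], (1, 224, 224, 1))
print(caffe_preprocess(dummy).shape)          # (1, 224, 224, 3)
print(np.abs(caffe_preprocess(dummy)).max())  # 0.0
```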
I used the pre-trained Haar cascade face detector model from the OpenCV library to determine if a human is present in the image or not.
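A face check with an OpenCV Haar cascade might be structured as below. This is a sketch, not the notebook's code: `has_face` and `face_detector` are hypothetical names, and the cascade XML path is whatever file from the haarcascades folder is being used.

```python
def has_face(gray_image, cascade) -> bool:
    """Return True if the cascade finds at least one face.

    `cascade` is expected to be a cv2.CascadeClassifier (or any object
    with a compatible detectMultiScale method).
    """
    faces = cascade.detectMultiScale(gray_image)
    return len(faces) > 0


def face_detector(img_path: str, cascade_path: str) -> bool:
    """Load an image from disk and run the Haar cascade on it."""
    import cv2  # imported here so has_face has no hard OpenCV dependency

    cascade = cv2.CascadeClassifier(cascade_path)
    image = cv2.imread(img_path)  # OpenCV loads images in BGR order
    if image is None:
        raise FileNotFoundError(f"could not read image: {img_path}")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    return has_face(gray, cascade)
```

The cascade works on grayscale images, which is why the image is converted before detection. `detectMultiScale` returns one bounding box per detected face, so an empty result means no face was found.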
To detect the dogs in the images, I have used a pre-trained ResNet-50 model. This model has been trained on ImageNet, a very large and popular dataset used for image classification and other vision tasks.
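One way to turn an ImageNet classifier into a dog detector is to check whether the predicted class index falls among the dog categories. The sketch below assumes the standard 1000-class ImageNet ordering used by Keras, in which the dog breeds occupy indices 151 through 268; `is_dog_prediction` is a hypothetical helper and the probability vector is synthetic.

```python
import numpy as np

# Assumption: in the 1000-class ImageNet labelling used by Keras, the
# dog categories occupy indices 151 through 268 inclusive.
DOG_CLASS_RANGE = range(151, 269)


def is_dog_prediction(class_probs: np.ndarray) -> bool:
    """Return True if the most probable ImageNet class is a dog breed.

    `class_probs` is the 1000-element output of the ResNet-50 classifier
    for a single image.
    """
    if class_probs.shape != (1000,):
        raise ValueError("expected a vector of 1000 class probabilities")
    return int(np.argmax(class_probs)) in DOG_CLASS_RANGE


# Synthetic probability vector, for illustration only.
probs = np.zeros(1000)
probs[207] = 1.0  # index 207 lies inside the dog range
print(is_dog_prediction(probs))  # True
```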
I built a simple CNN model from scratch that is neither too deep nor too shallow. It has five blocks, each consisting of a Conv2D layer followed by a MaxPooling2D layer. I added a dropout layer after every two blocks to reduce overfitting. This model did not perform well and achieved only 5% accuracy on the test dataset (for reference, random guessing over 133 classes would be correct less than 1% of the time).
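An architecture of this shape could be written in Keras roughly as follows. The text does not give the hyperparameters, so the filter counts, kernel size, dropout rate, input size, and the pooling-plus-softmax output layers are all assumptions made for this sketch, and `build_scratch_cnn` and `spatial_size` are hypothetical names.

```python
def spatial_size(input_size: int, num_blocks: int) -> int:
    """Side length of the feature map after `num_blocks` blocks, assuming
    'same'-padded convolutions and 2x2 max pooling with stride 2."""
    size = input_size
    for _ in range(num_blocks):
        size //= 2  # each pooling layer halves height and width
    return size


def build_scratch_cnn(input_size: int = 224, num_classes: int = 133):
    """Five Conv2D + MaxPooling2D blocks with dropout after blocks 2 and 4.

    Filter counts, kernel size, dropout rate and input size are
    assumptions made for this sketch.
    """
    from tensorflow.keras import layers, models  # requires TensorFlow

    model = models.Sequential([layers.Input(shape=(input_size, input_size, 3))])
    for block, filters in enumerate([16, 32, 64, 128, 256], start=1):
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(pool_size=2))
        if block % 2 == 0:  # dropout after every two blocks
            model.add(layers.Dropout(0.3))
    model.add(layers.GlobalAveragePooling2D())
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model


print(spatial_size(224, 5))  # 7
```

With these assumptions a 224x224 input shrinks to a 7x7 feature map after the five pooling steps, which is then averaged per channel before the 133-way softmax layer.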
I used six different models with pre-trained weights to classify dog breeds: VGG16, VGG19, InceptionV3, ResNet50, EfficientNetB4, and Xception. Of these, the EfficientNetB4 model performed best, achieving an accuracy of 91% on the validation data. The trained model weights are stored in the EfficientNetB4_trained_weights folder. The accuracy of the other models was below 83% on the validation data.
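A common transfer-learning setup is to keep the pre-trained base network frozen and train only a small classification head on its bottleneck features. The sketch below shows one such head; it is an assumption about the general approach, not a description of the exact head used in the notebook, and `build_breed_head` and `top_breed` are hypothetical names.

```python
def build_breed_head(feature_shape, num_classes: int = 133):
    """Small classifier trained on bottleneck features produced by a
    frozen, ImageNet-pre-trained base network."""
    from tensorflow.keras import layers, models  # requires TensorFlow

    model = models.Sequential([
        layers.Input(shape=feature_shape),  # (height, width, channels) of the features
        layers.GlobalAveragePooling2D(),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model


def top_breed(probabilities, breed_names):
    """Map a vector of class probabilities to the most likely breed name."""
    if len(probabilities) != len(breed_names):
        raise ValueError("one probability per breed name is required")
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return breed_names[best]


# Made-up probabilities and breed names, for illustration only.
print(top_breed([0.1, 0.7, 0.2], ["Akita", "Beagle", "Boxer"]))  # Beagle
```

Because only the head is trained, each candidate base model can be compared on the validation data at a small fraction of the cost of training a full network.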
The requirements folder lists all the libraries/dependencies required to run this project.
- Clone this GitHub repository.
  git clone https://github.com/Ankit-Kumar-Saini/Dog_Breed_Classifier
- Download the dog dataset. Unzip the folder and prepare image-label pairs for training the model.
- Download the human dataset. Unzip the folder and prepare images for the face detector model.
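Preparing the image-label pairs mentioned in the steps above could be done with a short helper like the one below. It assumes the unzipped dog dataset has one sub-folder per breed, named like `001.Affenpinscher`, inside each split directory. That layout and the name `image_label_pairs` are assumptions for this sketch, so adjust them to the actual folder structure.

```python
from pathlib import Path


def image_label_pairs(split_dir):
    """Return sorted (image_path, breed_name) pairs for one dataset split.

    Assumes one sub-folder per breed, named like '001.Affenpinscher',
    containing that breed's .jpg files.
    """
    pairs = []
    for breed_dir in sorted(Path(split_dir).iterdir()):
        if not breed_dir.is_dir():
            continue
        # '001.Affenpinscher' -> 'Affenpinscher'; names without a dot are kept as-is.
        breed = breed_dir.name.split(".", 1)[-1]
        for image_path in sorted(breed_dir.glob("*.jpg")):
            pairs.append((str(image_path), breed))
    return pairs
```

Calling the helper once per split directory gives separate training, validation, and test lists that can then be fed to the preprocessing step.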
- The haarcascades folder contains the pre-trained weights, in XML file format, for the OpenCV face detector class used in this project.
- The test_images folder contains the sample images used to test the predictions of the final algorithm in this project.
- The results folder contains the results of the algorithm on the test images. These are used for quick demonstration in the results section below.
- The extract_bottleneck_features.py file contains the code that uses pre-trained ImageNet models as feature extractors for transfer learning.
- The dog_app.ipynb file is the main file for this project. It is a Jupyter notebook containing the code for the face detector, dog detector, and dog breed classifier models. The final algorithm that combines all three models to make predictions is also implemented in this notebook.
The step by step explanation of the project can be found at the post available here.
Some visualizations of the predictions made by the algorithm on test images
This project serves as a good starting point for entering the domain of deep learning. Data exploration and visualization are extremely important before training any machine learning model because they help in choosing a suitable performance metric for evaluating the model. CNN models in Keras need image data in the form of a 4D tensor, and all images need to be resized to the same shape to train the models in batches.
Building CNN models from scratch is extremely simple in Keras. But training CNN models from scratch is computationally expensive and time-consuming. There are many pre-trained models available in Keras (trained on ImageNet dataset) that can be used for transfer learning.
The most interesting thing to note is the power of transfer learning to achieve good results with little computation. It works well when the new task is similar to the task for which the pre-trained weights were optimized.
- Get more images per class
- Make the dataset balanced
- Use image augmentation methods such as CutOut, MixUp, and CutMix
- Use VAEs/GANs to generate artificial data
- Use activation maps to interpret the model predictions
- Use deep learning-based approaches to detect human faces (MTCNN)
Credit goes to Udacity for the data and the Python 3 notebook.