What are the fundamentals of Deep Learning and Computer Vision?

Computer Vision is concerned with the automatic extraction, analysis, and understanding of useful information from a single image or a sequence of images. It involves the theoretical and algorithmic basis for achieving automatic visual understanding. Computer Vision enables deeper and more impactful insights for businesses in all industries. For example, healthcare providers can diagnose and treat patients more quickly and safely, while manufacturers gain enhanced security and productivity. Computer Vision helps keep track of assets and assure the safety of locations and employees, and businesses can be further improved with the addition of edge computer vision.

The properties and characteristics of the human visual system inspire the design of computer vision systems. From a biological point of view, the field aims to build computational models of the human visual system; from an engineering point of view, it aims to build autonomous systems that perform some of the tasks the human visual system can perform, and even surpass it in many cases. Computer vision is useful in many application areas: smartphone cameras recognize faces and smiles, factory robots monitor problems and co-workers, and so on. Computer Vision incorporates concepts from digital signal processing, neuroscience, AI, computer architecture, and software engineering. It can also be studied from a mathematical point of view: methods in computer vision are based on statistics, optimization, and geometry. In general, the image acquisition devices of computer vision systems capture visual information as digital signals, hence the need for digital signal processing techniques. Digital image processing deals with image transformation, compression, restoration, and enhancement; computer vision relies on these image processing techniques to preprocess image data for the robust high-level analysis needed in application development. Neuroscience also plays an important role in image processing. Machine vision applies this range of technologies and methods to provide image-based automatic inspection, process control, and robot guidance in industrial applications.

Computer Vision Applications: CV applications include industrial vision systems and robotics, and they can outperform humans at tasks such as circuit board inspection, face recognition, multimedia, medical imaging, etc. New emerging applications include augmented reality, autonomous driving, IoT, etc. A Computer Vision system has basic elements such as a power source, camera, processor, and control and communication cables, along with configuration software and a display monitor. The Vision Processing Unit is an emerging processor that complements the GPU and CPU. In the real world, CV is used in:
 * visual surveillance and drones, which help keep track of many events
 * biometric applications such as fingerprint authentication and face recognition, widely used across industries to keep track of employee details
 * navigation systems, where stereo vision and depth sensors are used for robot navigation
 * autonomous driving, where it is used for lane detection to keep the vehicle in its designated lane; complete scene understanding is the basic requirement for autonomous driving vehicles
 * automated supermarkets, which are powered to keep track of customers, products, carts, etc.
 * Seeing-AI, a Microsoft app that turns the visual world into an audible experience. For example, the app recognizes saved friends and facial expressions, reads aloud text that comes into view, scans and reads documents such as books and letters while recognizing their formatting, identifies currency values, and provides a barcode scanner that helps find the product you want.

Convolutional Neural Networks (CNN): A CNN is a supervised deep learning system used for computer vision. The computer processes images much the way our brain does: the brain recognizes an image based on features it observes or picks up. For example, when you look at an object and understand its features, you are able to identify it; you might not be able to identify an object if you have never seen its features before or your mind cannot make sense of them. A CNN processes an image the same way. Facial recognition, object classification in photographs, and Facebook name tagging are good uses of the CNN architecture. You pass an image as input to the CNN architecture, and it outputs the label or image class. The computer reads the image in digital form, in which every photograph is made of pixels. A pixel is the smallest unit of information in a picture; it is usually round or square and arranged in a 2-dimensional grid (see the short example after the list below). The process of a CNN is divided into 5 steps:
   1. Convolution
   2. Feature Map + ReLU Layer
   3. Pooling
   4. Flattening
   5. Full Connection
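Before walking through those steps, here is a minimal sketch of what "an image as a 2-dimensional grid of pixels" means in code. It assumes Pillow and NumPy are installed; `photo.jpg` is a placeholder path, not a file from the original post.

```python
from PIL import Image
import numpy as np

# Load a photo and view it as the pixel grid the network actually sees.
# "photo.jpg" is a placeholder path; any RGB image works.
img = Image.open("photo.jpg")
pixels = np.asarray(img)

print(pixels.shape)   # e.g. (224, 224, 3): height x width x RGB channels
print(pixels[0, 0])   # the top-left pixel: three values in the 0-255 range
```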

At the base of convolution there is a filter called a feature detector. Applying a feature detector to the image produces a feature map, a smaller representation of the actual image that pulls out its important features. Different feature maps capture different features, and combined together they form the first convolution layer; a CNN architecture can have multiple convolution layers. The Rectified Linear Unit (ReLU) function is used to increase the non-linearity in our image and is combined with the convolutions. Pooling helps to remove unnecessary pixels while retaining the important features. There are various types of pooling, such as max-pooling, min-pooling, and average pooling. In max-pooling, we simply take the maximum pixel from each section of the frame: for example, max-pooling a 2x2 window containing [[1, 3], [2, 4]] keeps only the 4. After pooling, we apply flattening, which converts the pooled feature map into a single column so that it can go as input to a fully connected network (an ANN) or another classifier. Finally, the fully connected layer gives the output, which is the class of the image.
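Here is a minimal Keras sketch that maps each of the five steps above to a layer. The layer sizes, the 64x64 input shape, and the binary (e.g. cat vs. dog) output are illustrative assumptions, not values from the original post.

```python
from tensorflow.keras import layers, models

# A minimal CNN tracing the five steps above (sizes are illustrative).
model = models.Sequential([
    # 1 & 2. Convolution + ReLU: feature detectors produce feature maps,
    # and ReLU adds non-linearity.
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    # 3. Pooling: max-pooling keeps the strongest pixel in each 2x2 window.
    layers.MaxPooling2D(pool_size=(2, 2)),
    # 4. Flattening: the pooled feature maps become one long column vector.
    layers.Flatten(),
    # 5. Full connection: a dense ANN maps the vector to a class prediction.
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # binary example: cat vs. dog
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```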

VGG16 Architecture & Transfer Learning: VGG16 is a convolutional neural network architecture that was used to win the ILSVRC (ImageNet) competition in 2014. VGG stands for the Visual Geometry Group at Oxford, and 16 is the number of layers the group used for the network; it is also called OxfordNet. The goal of this model is to calculate a probability between 0 and 1 for each category for any given image and choose the category with the highest probability. For example, suppose you pass in the image of a car: the model calculates probability scores between 0 and 1, and if the car category scores highest, it categorizes the image as a car. The Oxford team made the structure and the weights of the trained network freely available.

It has 16 layers with learnable weights: 13 convolution layers and 3 fully connected (dense) layers. The input passes through the various convolution layers, pooling layers, and fully connected layers, and the output is received. An image passed through these 16 layers comes out with a category or label drawn from the 1000 categories in ImageNet. It is a very successful and accurate model, and loading it takes only a few lines, as sketched below.
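A minimal sketch of loading the pre-trained network with the freely released ImageNet weights, assuming TensorFlow/Keras is installed:

```python
from tensorflow.keras.applications import VGG16

# Load VGG16 with the weights the Oxford team released (trained on ImageNet).
model = VGG16(weights="imagenet")
model.summary()  # prints the stack of convolution, pooling, and dense layers
```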
    Transfer Learning utilizes the weights of an already-trained model instead of training a new model from scratch, which would mean building the convolution layers, subsampling/pooling, and fully connected layers yourself and would take a long time before the model could classify an image. There are many pre-trained models available for transfer learning. Such a model has been trained on a large benchmark dataset to solve a problem similar to the one we want to solve, and its weights encode the important features of images. So we take the model and the weights it was trained with and transfer that learning to our specific problem. For example, if we pass in the image of a car and the model was trained on car images, the probability of the car class will be high. To instead detect the damaged portion of a car, we reuse the pre-trained model to extract features and train our own classifier on top of it, giving output in two or three categories. In other words, we take the weights of the pre-trained model and build our own classifier that identifies whatever we are trying to resolve, as in the sketch below.
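A minimal sketch of this transfer learning recipe in Keras; the 224x224 input, the 256-unit dense layer, and the 3-class damage head are illustrative assumptions, not from the original post.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Reuse VGG16's convolutional base as a fixed feature extractor and train
# only a small classifier head. The 3-class head (e.g. minor / moderate /
# severe damage) is an illustrative assumption.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # keep the pre-trained weights frozen

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(3, activation="softmax"),  # our own classifier
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```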

Model Creation and Deployment: This is the coding part, in Python, for the various checks of our model. We write a piece of code that takes an image as input and classifies it, using the Keras package. Keras is a wrapper written on top of TensorFlow; in raw TensorFlow, creating graphs and defining layers manually is a complex procedure. There is a function to load the image, a preprocess_input function to convert the image to the specific format VGG16 expects, and the converted image is then given to VGG16. We load the VGG16 model with the ImageNet weights and save the model as a .h5 file. Then we define a function that takes an image path as its argument and preprocesses the image into a format that can be fed as input to VGG16. We run the prediction on the pre-processed image and check its shape, which is (1, 1000), meaning 1 row and 1000 columns: the model is predicting the probability of each of the 1000 ImageNet classes for the image. Finally, we print the top 5 predictions that our model has made. A sketch of this pipeline follows.
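A minimal sketch of the pipeline just described, assuming TensorFlow/Keras is installed; `car.jpg` and `vgg16_model.h5` are placeholder file names.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import (VGG16, preprocess_input,
                                                 decode_predictions)
from tensorflow.keras.preprocessing import image

# Load VGG16 with the ImageNet weights and save it as a .h5 file for reuse.
model = VGG16(weights="imagenet")
model.save("vgg16_model.h5")

def predict(img_path):
    # Load and resize the image to the 224x224 input VGG16 expects.
    img = image.load_img(img_path, target_size=(224, 224))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)   # shape (1, 224, 224, 3)
    x = preprocess_input(x)         # VGG16-specific preprocessing
    preds = model.predict(x)        # shape (1, 1000): 1 row, 1000 classes
    return decode_predictions(preds, top=5)[0]  # top-5 labels with scores

print(predict("car.jpg"))  # "car.jpg" is a placeholder path
```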
