ML interview preparation: computer vision
Machine learning interviews usually move from general questions to project-specific ones, so I have prepared a few must-knows to help you prepare for and pass computer vision interviews.
Main tasks of computer vision
- classification — model tells what the object is
- object detection — model finds the object location (we can draw a bounding box around it)
- object tracking — model locates the object and follows where it is going next
- face recognition — model knows who is who
- edge detection — model knows where the object edges are
- segmentation — model knows exactly which area an object occupies, so we can create a pixel-wise mask over it
Types of segmentation
- semantic — all objects of one category are colored the same
- instance — every object instance is separated from the others
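The difference is easiest to see on the masks themselves. Here is a minimal sketch in plain Python, assuming a hypothetical 4x6 image with two cats and one dog. The instance ids and the `instance_to_class` mapping are invented for illustration.

```python
# Instance mask: 0 is background, every other number is one separate object.
instance_mask = [
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
    [3, 3, 3, 0, 0, 0],
]

# Hypothetical mapping from instance id to class name.
instance_to_class = {1: "cat", 2: "cat", 3: "dog"}


def to_semantic(mask, id_to_class):
    """Collapse an instance mask into a semantic mask (one label per class)."""
    return [
        [id_to_class.get(pixel, "background") for pixel in row]
        for row in mask
    ]


semantic_mask = to_semantic(instance_mask, instance_to_class)

# Semantic view: both cats share the single label "cat".
semantic_labels = {label for row in semantic_mask for label in row}
# Instance view: the two cats keep different ids.
instance_ids = {pixel for row in instance_mask for pixel in row if pixel != 0}
```

In the semantic mask the two cats are indistinguishable, while the instance mask still tells them apart.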
Popular computer vision libraries
OpenCV — one of the first and basic tools to use when familiarizing yourself with computer vision, open source library. On the internet you can find lots of usage examples from how to find face features and make small recognition models to video analysis.
It also implements such algorithms as K-Nearest Neighbors, Bayes Classifier, Decision Trees, Support Vector Machines, neural networks and more.
Popular computer vision networks
CNN (its history started in the previous century) — the convolutional neural network concept: it detects features in an image wherever they are and doesn’t need much image preprocessing
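To see why a convolution finds a feature wherever it is, here is a small sketch in plain Python. The `diagonal_filter` and the 5x5 test images are hypothetical. In a real CNN the filter values are learned, and the sliding operation below is technically a cross-correlation, which is what deep learning libraries usually call convolution.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


def argmax_2d(grid):
    """Return (row, col) of the largest value in a 2D list."""
    best = max(max(row) for row in grid)
    for i, row in enumerate(grid):
        if best in row:
            return i, row.index(best)


def blank(size=5):
    return [[0] * size for _ in range(size)]


# Hypothetical 2x2 filter that responds to a bright diagonal.
diagonal_filter = [
    [1, -1],
    [-1, 1],
]

# The same diagonal pattern placed in two different corners.
top_left = blank()
top_left[0][0] = top_left[1][1] = 1
bottom_right = blank()
bottom_right[3][3] = bottom_right[4][4] = 1

response_a = conv2d(top_left, diagonal_filter)
response_b = conv2d(bottom_right, diagonal_filter)

peak_a = argmax_2d(response_a)  # where the filter fires on the first image
peak_b = argmax_2d(response_b)  # where it fires on the second image
```

The same filter gives the same peak response in both images. Only the position of the peak moves with the pattern.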
AlexNet (2012)
- ReLU instead of tanh, which was standard at that time (made training much faster)
- stacked convolutional layers directly on top of each other, without a pooling layer between every pair
- one of the first networks to use dropout layers (the technique had just been invented back then)
- included optimization for multiple GPUs
- won ILSVRC (ImageNet Large Scale Visual Recognition Challenge) in 2012, being one of the first GPU-based CNNs to win an image recognition contest
VGGNet (2014) — CNN which uses smaller filters than AlexNet (3x3) and is much deeper. It has more parameters than AlexNet and performs better.
GoogLeNet / Inception v1 (2014) — CNN which proposes filters of multiple sizes operating on the same level, making the network wider instead of just deeper. Won the ILSVRC classification task in 2014, leaving VGG in second place.
ResNet (2015) — Residual Network, CNN built from residual blocks whose skip connections mitigate the vanishing gradient problem, so it can be much deeper. Despite that, it has fewer parameters than VGGNet (global average pooling replaces the large fully connected layers, leaving a single one). Won ILSVRC in 2015.
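The idea of a residual block fits in a few lines. This is only a sketch on plain lists, assuming a hypothetical `layer` function that stands in for the convolutional layers inside the block.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def residual_block(x, layer):
    """Return relu(layer(x) + x): the skip connection adds the input back."""
    fx = layer(x)
    return relu([f + xi for f, xi in zip(fx, x)])


# Hypothetical layer that has learned nothing useful and outputs zeros.
def zero_layer(x):
    return [0.0 for _ in x]


features = [0.5, 2.0, 1.5]
output = residual_block(features, zero_layer)
```

Even when the inner layer contributes nothing, the input still passes through unchanged. This is why stacking many such blocks does not have to hurt the network, and why the signal can travel along the skip connections in very deep models.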
UNet (2015) — network for image segmentation, named after its U-shaped architecture: a convolutional encoder followed by a decoder, with skip connections between them. Does not need a lot of training data. Won several challenges at ISBI (International Symposium on Biomedical Imaging) in 2015.
YOLO (2015) — You Only Look Once is a CNN for real time object detection and classification. The original architecture was inspired by GoogLeNet and is implemented in the Darknet framework. It splits the input into a grid of cells, each cell predicts bounding boxes with confidence scores and class probabilities, and these are later merged into a final prediction.
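A small sketch of the grid idea: the cell that contains the center of an object is the one responsible for predicting it. The function name `responsible_cell` is made up for this example, and the 7x7 grid and 448x448 input are the values I recall from the first YOLO version, so treat them as assumptions.

```python
def responsible_cell(box_center, image_size, grid_size=7):
    """Return (row, col) of the grid cell that contains the box center."""
    cx, cy = box_center
    width, height = image_size
    # Clamp so that a center lying exactly on the far border stays in the grid.
    col = min(int(cx / width * grid_size), grid_size - 1)
    row = min(int(cy / height * grid_size), grid_size - 1)
    return row, col


# Hypothetical object centered at x=300, y=100 in a 448x448 image.
cell = responsible_cell((300, 100), (448, 448))
corner_cell = responsible_cell((448, 448), (448, 448))
```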
EfficientNet (2019) — scales network depth, width and input resolution together, reaching higher ImageNet accuracy than ResNet with fewer parameters
As you see, every network here is related to the CNN architecture. I decided to put questions and theory about it in a separate article, you can find it here:
Do not miss it, as lots of interview questions are based on understanding simple concepts from there.
GAN (2014, although the idea is older) — Generative Adversarial Network, a concept for generating data similar to the data you feed it. Two networks compete against each other: the generator turns noise into samples and tries to make them look more like real input, while the discriminator tries to guess whether its input is real or fake.
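The competition can be sketched with the two losses alone. This is a simplified illustration, assuming the common binary cross-entropy formulation with the non-saturating generator loss. The discriminator outputs below are hypothetical numbers, not results from a trained model.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples should score 1, fakes should score 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))


def generator_loss(d_fake):
    """Non-saturating generator loss: the generator wants D(fake) close to 1."""
    return -math.log(d_fake)


# Hypothetical discriminator outputs (probability that the input is real).
early = {"d_real": 0.9, "d_fake": 0.1}  # fakes are easy to spot
later = {"d_real": 0.6, "d_fake": 0.5}  # fakes look much more realistic

g_loss_early = generator_loss(early["d_fake"])
g_loss_later = generator_loss(later["d_fake"])
d_loss_early = discriminator_loss(**early)
d_loss_later = discriminator_loss(**later)
```

As the fakes become more convincing the generator loss goes down and the discriminator loss goes up, which is exactly the tug of war described above.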
I will write more about GANs in my next articles, as there is a lot of interesting stuff to talk about.
Popular computer vision datasets:
ImageNet is one of the largest datasets, and everybody knows it because of ILSVRC, the challenge on which lots of new neural networks are evaluated. But new datasets are being prepared every day. Here are some of the most popular ones for computer vision tasks and useful tools for finding more:
Popular computer vision topics
Image preprocessing — steps we take to format images before feeding them to a network or before making an inference. It involves image transformations.
Image transformations — a set of operations that change images, like mirroring, rotation, cropping, changing light or color, adding noise and so on. For example, in PyTorch the torchvision.transforms module is used for that.
Data augmentation — increasing the number of input data samples before training a model on them by creating changed copies of data items. For images it is done using image transformations.
It is quite helpful when we have a small dataset, but it is good practice in general because it helps the model generalize better.
I like how easy it is to check what each augmentation does when using the fast.ai library.
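To show the principle without any library, here is a minimal sketch in plain Python where an image is just a list of rows of grayscale pixels. The helper names and the one-image dataset are hypothetical. In a real project you would use something like torchvision.transforms or fast.ai instead.

```python
def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def adjust_brightness(image, delta):
    """Add delta to every pixel and clip to the 0..255 range."""
    return [[min(255, max(0, p + delta)) for p in row] for row in image]


def augment(dataset):
    """Return the original (image, label) pairs plus transformed copies."""
    augmented = []
    for image, label in dataset:
        augmented.append((image, label))
        augmented.append((horizontal_flip(image), label))
        augmented.append((rotate_90(image), label))
        augmented.append((adjust_brightness(image, 40), label))
    return augmented


# Hypothetical one-image dataset: a 2x3 grayscale image labeled "dice".
dataset = [([[10, 20, 30], [40, 50, 250]], "dice")]
augmented = augment(dataset)
```

One image becomes four training samples with the same label. Keep in mind that a transformation has to preserve the label: flipping is fine for dice or flowers, but it can change the meaning of digits or letters.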
Popular computer vision questions
How does the computer vision pipeline look?
It actually depends on the position you are applying for and the company you want to work at. Some interviewers expect you to mention data collection, some want to cover everything from task formalization to deployment (even though deployment may not be your job), and some just want to hear something in between. Overall, the pipeline looks something like this:
Task formalization → picking an algorithm and model architecture → data collection (& labeling if it is not present) → preprocessing and augmentation → feature extraction → model training → inference and tests → analysis and optimization → more tests → deployment
How to prepare images for training?
- check that each image represents its labeled class or contains the needed data
- remove all other images
- preprocess images
- augment using transformations appropriate for your task
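The first three steps can be sketched as follows. This is a simplified illustration in plain Python: the sample data, the `allowed_labels` set and the choice of scaling pixels to the 0..1 range are all assumptions, and augmentation is left out to keep it short.

```python
def normalize(image):
    """Scale 8-bit pixel values from 0..255 to 0.0..1.0."""
    return [[pixel / 255 for pixel in row] for row in image]


def prepare(samples, allowed_labels):
    """Keep only samples with a known label and normalize their pixels."""
    prepared = []
    for image, label in samples:
        if label not in allowed_labels:
            continue  # drop unlabeled or irrelevant images
        prepared.append((normalize(image), label))
    return prepared


# Hypothetical raw samples: (grayscale image, label); None means unlabeled.
raw_samples = [
    ([[0, 255], [51, 102]], "cat"),
    ([[12, 34], [56, 78]], None),
    ([[255, 255], [0, 0]], "car"),
]
training_set = prepare(raw_samples, allowed_labels={"cat", "dog"})
```

Only the first sample survives here: the second one has no label and the third one belongs to a class the task does not need.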
When to use grayscale images?
Sometimes color is not relevant for a task: if you want your model to learn other features and not cling to the color of an object, grayscale can be a really good choice. Not only can it make predictions better, but as a bonus it will make your model faster, since there is one input channel instead of three. For example, if you train a model detecting the number of dots on a dice, you do not need color. You may need it for flower or bird classification though.
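A common way to convert a color image to grayscale is a weighted sum of the channels. The sketch below uses the ITU-R BT.601 luminance weights (0.299, 0.587, 0.114). The tiny 1x3 image is hypothetical.

```python
def to_grayscale(rgb_image):
    """Convert RGB pixels to luminance with the ITU-R BT.601 weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb_image
    ]


# Hypothetical 1x3 image: a pure red, a pure green and a white pixel.
rgb = [[(255, 0, 0), (0, 255, 0), (255, 255, 255)]]
gray = to_grayscale(rgb)
```

Every pixel now holds one number instead of three, so the input to the model is three times smaller.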
How to evaluate a computer vision model?
Common evaluation metrics (not only for images) for machine learning models are: accuracy, precision and recall, F1 score. I have already mentioned these here, so you can revise them:
For object detection there are some special metrics:
- IoU (Intersection over Union) metric — the ratio of the overlap area of the predicted bounding box and the actual one to the area of their union. Usually a threshold of 0.5 is chosen to decide whether a prediction is good, but it depends on the problem the model is solving.
IoU also helps with the problem of multiple predictions for one object: non-maximum suppression compares overlapping predictions by IoU and keeps only the most confident one.
- mAP (mean average precision) — a metric computed with the help of IoU, precision and recall, and the precision-recall curve. First we use IoU to decide which predictions of one class are correct, then we compute precision and recall. After that we build a precision-recall curve and take the area under it as the average precision. We repeat this for every class and take the mean value. To dive deeper into this metric explanation check out this great article:
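To make these metrics concrete, here is a minimal sketch in plain Python. It assumes boxes in (x1, y1, x2, y2) format and uses a simple step-wise area under the precision-recall curve. Benchmarks such as Pascal VOC or COCO add their own interpolation rules, so treat this as an illustration and not as a reference implementation. All boxes and scores are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union else 0.0


def non_max_suppression(detections, threshold=0.5):
    """Keep the most confident box among heavily overlapping ones.

    detections is a list of (box, score) pairs.
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) < threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


def average_precision(matches, num_ground_truth):
    """Step-wise area under the precision-recall curve for one class.

    matches says, for detections sorted by descending confidence, whether
    each one is a true positive (its IoU with a ground-truth box reaches
    the chosen threshold).
    """
    true_positives = 0
    ap = 0.0
    for rank, is_match in enumerate(matches, start=1):
        if is_match:
            true_positives += 1
            precision = true_positives / rank
            ap += precision / num_ground_truth  # recall grows by 1/num_gt
    return ap


overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))

detections = [
    ((0, 0, 10, 10), 0.9),
    ((1, 0, 11, 10), 0.8),    # nearly the same box, lower confidence
    ((20, 20, 30, 30), 0.7),  # a different object
]
kept = non_max_suppression(detections)

# Hypothetical per-class results; mAP is the mean over classes.
ap_per_class = {
    "cat": average_precision([True, False, True], num_ground_truth=2),
    "dog": average_precision([True, True], num_ground_truth=2),
}
mean_ap = sum(ap_per_class.values()) / len(ap_per_class)
```

The first two detections overlap heavily, so only the more confident one is kept, while the third box survives because it covers a different object.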
How to reduce noise on an image?
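- Gaussian filters blur the image by replacing each pixel with a weighted average of its neighbors, which smooths out random noise
- Median filters replace each pixel in an image by the median value of the surrounding pixels, which works well against salt-and-pepper noise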
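Here is a small sketch of a median filter in plain Python, with a plain average of the same window for comparison. The 3x3 patch with one bright noisy pixel is hypothetical. In practice you would reach for ready-made functions in a library like OpenCV.

```python
from statistics import median


def median_filter(image, size=3):
    """Replace each pixel with the median of its size x size neighborhood.

    Border pixels use only the neighbors that fall inside the image.
    """
    radius = size // 2
    height, width = len(image), len(image[0])
    result = []
    for i in range(height):
        row = []
        for j in range(width):
            window = [
                image[y][x]
                for y in range(max(0, i - radius), min(height, i + radius + 1))
                for x in range(max(0, j - radius), min(width, j + radius + 1))
            ]
            row.append(median(window))
        result.append(row)
    return result


# Hypothetical flat gray patch with one "salt" pixel in the middle.
noisy = [
    [100, 100, 100],
    [100, 255, 100],
    [100, 100, 100],
]
denoised = median_filter(noisy)

# For comparison: a plain average over the same 3x3 window.
mean_center = sum(sum(row) for row in noisy) / 9
```

The median removes the outlier completely, while averaging only smears it over the neighborhood (the center would become about 117 instead of 100).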
How to detect edges of an object in an image?
To know where the edges are we have to look for brightness discontinuities, that is, for large image gradients.
Edge detection operators compute them directly:
- Gaussian based (Canny edge detector, Laplacian of Gaussian)
- gradient based (Sobel operator, Prewitt operator, Roberts cross operator)
Of these, the Canny edge detector is probably the most popular and quite an effective one.
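As an illustration of the gradient-based approach, here is a sketch of the Sobel operator in plain Python. The 4x6 test image with a dark left half and a bright right half is hypothetical, and border pixels are simply skipped to keep the code short.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]  # responds to vertical edges
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]  # responds to horizontal edges


def apply_kernel(image, kernel, i, j):
    """Weighted sum of the 3x3 neighborhood centered at (i, j)."""
    return sum(
        image[i + di - 1][j + dj - 1] * kernel[di][dj]
        for di in range(3)
        for dj in range(3)
    )


def sobel_magnitude(image):
    """Gradient magnitude for every interior pixel (borders are skipped)."""
    height, width = len(image), len(image[0])
    result = []
    for i in range(1, height - 1):
        row = []
        for j in range(1, width - 1):
            gx = apply_kernel(image, SOBEL_X, i, j)
            gy = apply_kernel(image, SOBEL_Y, i, j)
            row.append(math.hypot(gx, gy))
        result.append(row)
    return result


# Hypothetical image: dark left half, bright right half (a vertical edge).
image = [
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
    [0, 0, 0, 255, 255, 255],
]
edges = sobel_magnitude(image)
```

The magnitude is zero in the flat regions and large only in the two columns next to the brightness jump, which is exactly where the edge is.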
CNNs are also used to find edges: their first layers usually learn edge features before deeper layers find all the others.
There are also recent advancements in neural networks for edge detection:
- CASENet (2017) — has semantic edge detection
- DexiNed (2020) — a trained model works on various datasets without additional training or fine-tuning
- RINDNet (2021) — not only detects edges, but knows their type: reflectance, illumination, normal, depth
- PiDiNet (2021) — lightweight and efficient edge detection
Where is computer vision used?
- medical research
- robotics and self-driving vehicles
- manufacturing
- wherever else object detection and tracking is needed
- face recognition
- education
- architecture and design
- space research and much, much more
I know there is a lot more to discuss, but it seems to me like an optimal size of an article. Thank you so much for reading this and for your support. As always, corrections and comments are welcome. See you next time.
Compliment of the day: I am not a computer, but I see you are doing a great job there. Keep on!