Our journey in experimenting with machine vision and image recognition accelerated when we were developing an application, BooksPlus, to change a reader’s experience. BooksPlus uses image recognition to bring printed pages to life. A user can get immersed in rich and interactive content by scanning images in the book using the BooksPlus app.
For example, you can scan an article about a poet and instantly listen to an audio recording of the poet. Similarly, you can scan images of historical artwork and watch a documentary clip.
As we started development, we used commercially available SDKs that worked very well when we tried to recognize images locally. However, they would fail once our library grew beyond a few hundred images. A few services performed cloud-based recognition, but their pricing structure didn’t match our needs.
Hence, we decided to experiment to develop our own image recognition solution.
What Were the Objectives of Our Experiments?
We focused on building a solution that would scale to the thousands of images we needed to recognize. Our aim was to achieve high performance while staying flexible enough to do both on-device and in-cloud image matching.
As we scaled the BooksPlus app, the target was to keep the solution cost-effective. It also had to be as accurate as the commercial SDKs (in terms of false positive and false negative matches) and integrate with native iOS and Android projects.
Choosing an Image Recognition Toolkit
The first step of our journey was to zero in on an image recognition toolkit. We decided to use OpenCV based on the following factors:
- A rich collection of image-related algorithms: OpenCV has a collection of more than 2500 optimized algorithms, with many contributions from academia and industry, making it one of the most significant open-source machine vision libraries.
- Popularity: OpenCV has an estimated download count exceeding 18 million and a community of 47 thousand users, so abundant technical support is available.
- BSD-licensed product: As OpenCV is BSD-licensed, we can easily modify and redistribute it according to our needs. As we wanted to white-label this technology, OpenCV would benefit us.
- C interface: OpenCV has C interfaces and support, which was very important for us because both native iOS and Android support C. This would allow us to have a single codebase for both platforms.
The Challenges in Our Journey
We faced numerous challenges while developing an efficient solution for our use case. But first, let’s understand how image recognition works.
What is Feature Detection and Matching in Image Recognition?
Feature detection and matching is an essential component of many computer vision applications, such as object detection, image retrieval, and robot navigation.
Consider two images of a single object taken from slightly different angles. How would you make your mobile device recognize that both pictures contain the same object? This is where feature detection and matching comes into play.
A feature is a piece of information that indicates whether an image contains a specific pattern. Points and edges can be used as features. The image above shows the feature points on an image. Feature points must be selected so that they remain invariant under changes in illumination, translation, scaling, and in-plane rotation. Using invariant feature points is critical to successfully recognizing similar images in different positions.
The First Challenge: Slow Performance
When we first started experimenting with image recognition using OpenCV, we used the recommended ORB feature descriptors and FLANN feature matching with 2 nearest neighbours. This gave us accurate results, but it was extremely slow.
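To make the "2 nearest neighbours" step concrete, here is a minimal sketch in plain Python rather than OpenCV. It swaps FLANN for a brute-force search over binary descriptors stored as integers (ORB descriptors are bit strings compared by Hamming distance). It also assumes the two neighbours feed a ratio test, which is a common use of 2-nearest-neighbour matching but is not stated above. The names `hamming`, `two_nearest`, and `good_matches`, and the `ratio=0.75` default, are illustrative only.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")


def two_nearest(query, train):
    """Return (distance, train_index) for the two closest training descriptors."""
    return sorted((hamming(query, d), i) for i, d in enumerate(train))[:2]


def good_matches(query_desc, train_desc, ratio=0.75):
    """Keep a match only if the best neighbour is clearly closer than the second.

    Returns a list of (query_index, train_index, distance) tuples.
    """
    matches = []
    for qi, q in enumerate(query_desc):
        nearest = two_nearest(q, train_desc)
        if len(nearest) < 2:
            continue  # the ratio test needs two neighbours
        (d1, ti), (d2, _) = nearest
        if d1 < ratio * d2:
            matches.append((qi, ti, d1))
    return matches
```

Every query descriptor is compared with every training descriptor here, which is the cost that the rest of this section works to reduce.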
The on-device recognition worked well for a few hundred images; the commercial SDK would crash after 150 images, but we were able to increase that to around 350. However, that was insufficient for a large-scale application.
To give an idea of the speed of this mechanism, consider a database of 300 images: matching a single image against it would take up to 2 seconds. At this speed, matching an image against a database with thousands of images would take up to a few minutes. For the best UX, the matching must happen in real time, in the blink of an eye.
The number of matches made at different points of the pipeline needed to be minimized to improve the performance. Thus, we had two choices:
- Reduce the number of nearest neighbors, but we were already using only 2, the fewest possible.
- Reduce the number of features we detected in each image, but reducing the count would hinder the accuracy.
We settled upon using 200 features per image, but the time consumption was still not satisfactory.
The Second Challenge: Low Accuracy
Another challenge was reduced accuracy when matching images in books that contained text. These books would sometimes have words around the photos, and the words would add many highly clustered feature points. This increased the noise and reduced the accuracy.
In general, the book’s printing caused more interference than anything else: the text on a page creates many useless features, highly clustered on the sharp edges of the letters, causing the ORB algorithm to ignore the basic image features.
The Third Challenge: Native SDK
After the performance and precision challenges were resolved, the final challenge was to wrap the solution in a library that supports multi-threading and is compatible with Android and iOS mobile devices.
Our Experiments That Led to the Solution:
Experiment 1: Solving the Performance Problem
The objective of the first experiment was to improve performance. Our system could be presented with any random image, out of billions of possibilities, and had to determine whether that image matched one in our database. Therefore, instead of doing a direct match, our engineers devised a two-part approach: simple matching and in-depth matching.
Part 1: Simple Matching
To begin, the system eliminates obvious non-matches: database images that can easily be identified as not matching the scan. These could be any of the thousands or even tens of thousands of images in our database. This is accomplished through a very coarse scan that considers only 20 features per image and uses an on-device database to determine whether the image being scanned belongs to our interesting set.
Part 2: In-Depth Matching
After Part 1, we were left with very few images with similar features from a large dataset: the interesting set. The second, in-depth matching step is carried out only on these few images, and all 200 features are matched to find the matching image. As a result, we reduced the number of feature matching loops performed on each image.
In both steps, every feature of the scanned image was matched against every feature of the training image. For the coarse pass, this brought the matching loops per image down from 40,000 (200×200) to 400 (20×20). The coarse pass gave us a list of the best possible matching images, whose full 200 features were then compared.
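The two-part approach can be sketched roughly as follows, in plain Python with synthetic data. This is an illustration under stated assumptions, not the library's actual code: descriptors are random 256-bit integers standing in for ORB output, the coarse fingerprint is assumed to be the first 20 features of each image, and the thresholds (`coarse_hits`, `fine_hits`, `max_distance`) and all function names are hypothetical.

```python
import random


def hamming(a, b):
    """Number of differing bits between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")


def close_features(query, train, max_distance):
    """Match every query feature against every training feature.

    Returns (query features with a close partner, comparisons performed).
    """
    hits = 0
    for q in query:
        if train and min(hamming(q, t) for t in train) <= max_distance:
            hits += 1
    return hits, len(query) * len(train)


def two_stage_match(query, database, coarse_n=20, coarse_hits=5,
                    fine_hits=50, max_distance=40):
    """Return (best image id or None, comparison counts per stage)."""
    loops = {"coarse": 0, "fine": 0}
    # Part 1: cheap 20x20 scan over the whole database.
    interesting = []
    for image_id, features in database.items():
        hits, n = close_features(query[:coarse_n], features[:coarse_n], max_distance)
        loops["coarse"] += n
        if hits >= coarse_hits:
            interesting.append(image_id)
    # Part 2: full 200x200 match, only on the interesting set.
    best_id, best_hits = None, 0
    for image_id in interesting:
        hits, n = close_features(query, database[image_id], max_distance)
        loops["fine"] += n
        if hits >= fine_hits and hits > best_hits:
            best_id, best_hits = image_id, hits
    return best_id, loops


# Demo with synthetic 256-bit descriptors (random stand-ins, not real ORB output).
rng = random.Random(0)
database = {name: [rng.getrandbits(256) for _ in range(200)] for name in "abcde"}
scan = [d ^ 1 for d in database["c"]]  # image "c" with one bit flipped per feature
best, loops = two_stage_match(scan, database)
print(best, loops)
```

In this five-image demo, the coarse pass costs 5 × 400 = 2,000 comparisons and the in-depth pass runs 40,000 comparisons for the single remaining candidate, compared with 5 × 40,000 = 200,000 if all 200 features were matched against every image.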
We were more than satisfied with the result. Matching an image against the dataset of 300 images, which previously took 2 seconds, now took only 200 milliseconds. The improved mechanism was 10x faster than the original, and the delay was barely noticeable to the user.
Experiment 2: Solving the Scale Problem
To scale up the system, part 1 of the matching was done on the device and part 2 could be done in the cloud. This way, only images that were a potential match were sent to the cloud. We would send the 20-feature fingerprint match information to the cloud, along with the additional detected image features. With a large database of interesting images, the cloud could scale.
This method allowed us to keep a large database (with fewer features per image) on-device in order to eliminate obvious non-matches. The memory requirements were reduced, and we eliminated crashes caused by system resource constraints, which had been a problem with the commercial SDK. Because the real matching was done in the cloud, we could scale while keeping cloud computing costs down, since no cloud CPU cycles were spent on obvious non-matches.
Experiment 3: Improving the Accuracy
Now that we had better performance, the practical accuracy of the matching process needed improvement. As mentioned earlier, when scanning a picture in the real world, the amount of noise was enormous.
Our first approach was to use the Canny edge detection algorithm to find the square or rectangular edges of the image and clip out the rest of the data, but the results were not reliable. Two issues remained. The first was that the images would sometimes contain captions that were part of the overall image rectangle. The second was that the images would sometimes be placed in decorative shapes such as circles or ovals. We needed to come up with a simple solution.
Finally, we analyzed the images in 16 shades of grayscale and looked for areas skewed towards only 2 to 3 shades of grey. This method accurately found areas of text on the outer regions of an image. Blurring these areas kept them from interfering with the recognition mechanism.
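As a rough illustration of the grayscale idea, the sketch below works on an image given as a list of rows of 8-bit grey values. It quantizes each pixel to 16 shades, splits the image into tiles, flags tiles where 2 to 3 shades account for most of the pixels, and blurs those tiles with a 3×3 mean filter. The tile size, the 90% share threshold, the choice of a mean filter, and the function names are all assumptions made for this example. The sketch scans every tile and does not restrict itself to the outer regions of the image.

```python
def dominant_shade_count(tile, share=0.9):
    """Smallest number of the 16 grey shades needed to cover `share` of the tile."""
    counts = [0] * 16
    for row in tile:
        for pixel in row:
            counts[pixel // 16] += 1  # 0-255 grey value -> shade 0-15
    total = sum(counts)
    covered = 0
    for n, count in enumerate(sorted(counts, reverse=True), start=1):
        covered += count
        if covered >= share * total:
            return n
    return 16


def suppress_text(image, tile=8, share=0.9):
    """Blur text-like tiles in place and return their (top, left) origins."""
    height, width = len(image), len(image[0])
    source = [row[:] for row in image]  # read from an unmodified copy
    flagged = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            block = [row[left:left + tile] for row in source[top:top + tile]]
            if not 2 <= dominant_shade_count(block, share) <= 3:
                continue  # flat area or rich photo content: leave untouched
            flagged.append((top, left))
            for y in range(top, min(top + tile, height)):
                for x in range(left, min(left + tile, width)):
                    neighbours = [source[j][i]
                                  for j in range(max(0, y - 1), min(height, y + 2))
                                  for i in range(max(0, x - 1), min(width, x + 2))]
                    image[y][x] = sum(neighbours) // len(neighbours)
    return flagged
```

Blurring softens the sharp letter edges that attract clustered feature points, while tiles with many shades, such as photographs, pass through unchanged.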
Experiment 4: Implementing a Native SDK for Mobile
We had managed to improve both the accuracy and the efficiency of the feature detection and matching system. The final step was implementing an SDK that would work across both iOS and Android devices as if it had been implemented natively for each. To our advantage, both Android and iOS support the use of C libraries in their native SDKs. Therefore, the image recognition library was written in C, and two SDKs were produced from the same codebase.
Each mobile device has different resources available. Higher-end mobile devices have multiple cores to perform multiple tasks simultaneously. We created a multi-threaded library with a configurable number of threads. The library automatically configures the number of threads at runtime to the optimum number for the mobile device.
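The configurable thread count could look something like the sketch below. It is written in Python for consistency with the other examples, whereas the actual library was written in C. The rule used here (one thread per reported core unless the caller overrides it) is an assumption, since the text does not say how the optimum is determined. `pick_thread_count` and `match_all` are hypothetical names.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def pick_thread_count(requested=None):
    """Use the caller's value if given, otherwise one thread per reported core."""
    if requested is not None:
        return max(1, requested)
    return max(1, os.cpu_count() or 1)  # cpu_count() may return None


def match_all(query, database_chunks, match_chunk, threads=None):
    """Run match_chunk(query, chunk) on each chunk in parallel.

    Results are returned in the same order as database_chunks.
    """
    with ThreadPoolExecutor(max_workers=pick_thread_count(threads)) as pool:
        return list(pool.map(lambda chunk: match_chunk(query, chunk), database_chunks))
```

Standard CPython threads typically do not speed up CPU-bound pure-Python work, so this only illustrates the configuration logic. A C library can run its matching threads truly in parallel across cores.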
To summarize, we developed a large-scale image recognition application (used in multiple fields, including augmented reality) by improving the accuracy and efficiency of the machine vision steps: feature detection and matching. Existing solutions were slow, and our use case produced noise that drastically reduced accuracy. We wanted accurate match results in the blink of an eye.
Thus, we ran a few experiments to improve the mechanism’s performance and accuracy. The two-part matching reduced the feature matching loops in the first pass from 40,000 to 400 per image and cut matching time by 90%, resulting in a 10x faster match. Once we had the performance we wanted, we needed to improve the accuracy by reducing the noise from the text around the images. We accomplished this by blurring out the text after analyzing the image in 16 shades of grayscale. Finally, everything was compiled into a C library that can be used with iOS and Android.