We had three key objectives for our user experience (UX): the daily-use tool had to be simple, fast, and delightful. We achieved this with a mix of empathy and design thinking.
Our journey in experimenting with machine vision and image recognition accelerated when we were developing an application, BooksPlus, to change a reader’s experience. BooksPlus uses image recognition to bring printed pages to life. A user can get immersed in rich and interactive content by scanning images in the book using the BooksPlus app.
For example, you can scan an article about a poet and instantly listen to the poet’s audio. Similarly, you can scan images of historical artwork and watch a documentary clip.
When we started development, we used commercially available SDKs that worked very well for local image recognition, but they would fail once our library grew beyond a few hundred images. A few services performed cloud-based recognition, but their pricing structure didn't match our needs.
Hence, we decided to develop our own image recognition solution.
We focused on building a solution that would scale to the thousands of images we needed to recognize. Our aim was to achieve high performance while staying flexible enough to do both on-device and in-cloud image matching.
As we scaled the BooksPlus app, the target was to keep the solution cost-effective. We ensured that our own implementation was as accurate as the commercial SDKs (in terms of false positive and false negative matches), and it needed to integrate with native iOS and Android projects.
The first step of our journey was to zero in on an image recognition toolkit. We decided to use OpenCV based on the following factors:
A rich collection of image-related algorithms:
OpenCV has a collection of more than 2,500 optimized algorithms, with many contributions from academia and industry, making it the most significant open-source machine vision library.
Popularity:
OpenCV has an estimated 18 million downloads and a user community of around 47 thousand, which makes abundant technical support available.
BSD-licensed product:
As OpenCV is BSD-licensed, we can easily modify and redistribute it according to our needs. Since we wanted to white-label this technology, OpenCV's licence suited us well.
C-Interface:
OpenCV has C interfaces and support, which was very important for us as both native iOS and Android support C. This would allow us to have a single codebase for both platforms; a sketch of what such a C boundary might look like follows this list.
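To make this concrete, below is a minimal, hypothetical sketch of such a C boundary. The Recognizer handle and every function name here are our own illustration, not actual BooksPlus code; the idea is simply that a JNI wrapper on Android and an Objective-C++ wrapper on iOS can both call the same C functions, keeping the OpenCV matching core in one shared codebase.

```cpp
/* recognizer.h - hypothetical C-compatible boundary for a shared
 * OpenCV matching core. Both an Android JNI wrapper and an iOS
 * Objective-C++ wrapper could call these functions directly. */
#ifdef __cplusplus
extern "C" {
#endif

/* Opaque handle hiding the C++/OpenCV implementation. */
typedef struct Recognizer Recognizer;

/* Load a recognizer over a local image database. */
Recognizer* recognizer_create(const char* database_path);

/* Match one grayscale camera frame; returns a matched image id,
 * or -1 when nothing in the database matches. */
int recognizer_match(Recognizer* r,
                     const unsigned char* gray_pixels,
                     int width, int height);

void recognizer_destroy(Recognizer* r);

#ifdef __cplusplus
}
#endif
```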
The First Challenge: Slow Speed
When we first started experimenting with image recognition using OpenCV, we used the recommended ORB feature descriptors and FLANN feature matching with two nearest neighbours. This gave us accurate results, but it was extremely slow.
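That initial pipeline maps closely onto OpenCV's standard C++ API. Here is a minimal sketch of the approach: ORB descriptors, a FLANN matcher configured with an LSH index (needed because ORB produces binary descriptors), and 2-nearest-neighbour matching followed by the conventional ratio test. The file names, the 0.75 ratio, and the acceptance threshold of 20 matches are illustrative assumptions, not values from our pipeline.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <opencv2/imgcodecs.hpp>
#include <vector>

int main() {
    // Illustrative file names, not from the actual app.
    cv::Mat query = cv::imread("camera_frame.jpg", cv::IMREAD_GRAYSCALE);
    cv::Mat candidate = cv::imread("library_image.jpg", cv::IMREAD_GRAYSCALE);
    if (query.empty() || candidate.empty()) return 1;

    // Detect ORB keypoints and compute binary descriptors.
    cv::Ptr<cv::ORB> orb = cv::ORB::create(500);
    std::vector<cv::KeyPoint> kpQuery, kpCandidate;
    cv::Mat descQuery, descCandidate;
    orb->detectAndCompute(query, cv::noArray(), kpQuery, descQuery);
    orb->detectAndCompute(candidate, cv::noArray(), kpCandidate, descCandidate);

    // FLANN's default KD-tree index expects float descriptors, so
    // binary ORB descriptors need an LSH index instead.
    cv::FlannBasedMatcher matcher(
        cv::makePtr<cv::flann::LshIndexParams>(12, 20, 2));

    // 2-nearest-neighbour matching, then Lowe's ratio test to drop
    // ambiguous matches (0.75 is a conventional, assumed threshold).
    std::vector<std::vector<cv::DMatch>> knnMatches;
    matcher.knnMatch(descQuery, descCandidate, knnMatches, 2);

    std::size_t good = 0;
    for (const auto& m : knnMatches) {
        // LSH can return fewer than two neighbours for a descriptor.
        if (m.size() == 2 && m[0].distance < 0.75f * m[1].distance)
            ++good;
    }

    // Declare a match when enough good correspondences survive;
    // the threshold of 20 is an assumption for illustration.
    return good > 20 ? 0 : 1;
}
```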
The on-device recognition worked well for a few hundred images: the commercial SDK would crash after 150 images, and we were able to push that to around 350. However, that was still insufficient for a large-scale application.
To give an idea of the speed of this mechanism, consider a database of 300 images: matching an image would take up to two seconds. At that rate, a database with thousands of images would take a few minutes to match an image. For the best UX, the matching must happen in real time, in the blink of an eye.
To improve performance, we needed to minimize the number of matches made at different points of the pipeline. Thus, we had two choices:
The Second Challenge: Low Accuracy
Another challenge was reduced accuracy when matching images in books that contained text. These books would often have words around the photos, adding many highly clustered feature points to the words; this increased the noise and reduced the accuracy.
In general, the book’s printing caused more interference than anything else: the text on a page creates many useless features, highly clustered on the sharp edges of the letters, causing the ORB algorithm to ignore the actual image features.
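A simple way to see this interference is to draw the detected keypoints on a scanned page: on text-heavy pages, most of them cluster on letter edges instead of on the photograph we actually want to match. The short diagnostic sketch below illustrates that inspection step (file names assumed); it is not part of the production pipeline.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <opencv2/imgcodecs.hpp>
#include <vector>

int main() {
    // Assumed input: a scanned page with a photo surrounded by text.
    cv::Mat page = cv::imread("book_page.jpg", cv::IMREAD_GRAYSCALE);
    if (page.empty()) return 1;

    // Detect ORB keypoints across the whole page.
    std::vector<cv::KeyPoint> keypoints;
    cv::ORB::create(1000)->detect(page, keypoints);

    // Draw them for inspection: on a printed page, the keypoints pile
    // up on the sharp edges of letters rather than on the image content.
    cv::Mat annotated;
    cv::drawKeypoints(page, keypoints, annotated, cv::Scalar(0, 255, 0));
    cv::imwrite("keypoints_debug.jpg", annotated);
    return 0;
}
```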