Recognising indoor scenes with Custom Vision

10 minute read

Custom Vision has been on my radar for a while. The platform, created by Microsoft and part of the Azure ecosystem, allows users to easily upload and tag images to build and train custom Machine Learning (ML) models that can be used to perform classification and object detection. Once a model has been sufficiently trained, it can be deployed with a few clicks and be used as an API.

Sounds exciting, doesn’t it? Would this tool be capable of saving us some of the most arduous steps when building a custom ML model? And what would the accuracy of the predictions be? In this article, I’ll share my findings.

Recognising indoor scenes

I decided to use old MIT scientific research from 2009 to test the capabilities of Custom Vision.

Obviously, the progress of AI and ML since this period has been enormous. But the research caught my attention due to a couple of reasons:

  • It states the difficulty of classifying indoor images in comparison with, for example, outdoor scenes.
  • They published the labelled image dataset used in the research.

The paper

Details and conclusions from the research can be found in the paper Recognizing Indoor Scenes, but here are a few interesting notes from it:

  • They created a dataset of 15620 images, classified into 67 different categories.
  • Images were obtained from different online sources (Google, Altavista, Flickr...) with heterogeneous sizes and proportions (minimum resolution of 200 pixels in the smallest axis).
  • All images in the dataset have the same file format (jpg).


Custom Vision can be used via their console, a UI that allows users to drag-and-drop, tag images and train/test the models or via their SDK.

From my point of view, the console looks very simple, intuitive and easy to use. It achieves what it’s meant to, it makes AI accessible to a wider audience — people with no deep knowledge of how machine learning works behind the scenes.

However, the SDK (available in a considerable number of languages), allowed us to script some of the steps. I opted for using a Jupyter notebook to carry out the steps required to:

  • Create the training and validation subsets
  • Upload and tag the dataset images 
  • Train and deploy the model
  • Validate the model via API inferences

Step 1 - Creating the training/validation subsets

After creating a new image classification project on Custom Vision and downloading the images (2.4GB), the first step would involve selecting the subset of images used for training vs validation. 

As an interesting aside, Custom Vision offers a free tier where you can use up to 5000 images and 50 tags per project. These figures seemed reasonable for the prototype. Following their recommendations, I used 50 images per category which means we’d train our model with a total of 50 x 50 = 2500 images.

The following code will create 2 separate folders to divide images into training and validation.

Step 2 - Uploading and tagging the dataset images 

2.1. Set up the project

Before we can start working with our dataset, we’ll need to set up the Custom Vision project. You can either create a new one or just make use of an existing project created via the console by using its id (our case).

The required parameters can be found under the settings tab of the project via the console. 

2.2. Tags

Now, we are in a position to start creating the tags that will be used to categorise the images.

When creating our project, we selected the option “Multiclass (Single tag per image)” as the Classification type since it best fitted our problem. We’ll follow the original dataset folder structure to create one tag per category (keeping in mind the maximum 50 tags limitation).

IMPORTANT: tags are referenced by using the id (instead of name) so we’ll have to store them in some kind of data structure so they can be subsequently used.

2.3. Uploading the training images 

Once we have our tags in Custom Vision, we can upload the images. We’ll iterate through our categories to upload tagged images. The images will be uploaded as batches of 64 elements (the maximum SDK limit) in order to speed up the process.

One of the tasks that takes a substantial amount of time when working with image recognition models is preprocessing. Usually, the custom model will only accept images in a specific format, with a fixed size. So it is the responsibility of the software/data engineer to create the mechanisms required to convert the images into the correct shape. This is not needed with Custom Vision, so we can directly use our images against the service and they’ll work! This is something I consider a big step forward, especially when time and budget constraints play an important role in the project.

Step 3 - Training and deploying the model

To fit our model, we’ll just need to call the train_project method and the magic will happen. Custom Vision will do the hard work for us, without the need to choose the best learning algorithm, the NN layout or tweaking the model parameters.

The training iteration took 11 minutes 28 seconds for our dataset composed of 2500 images. The free tier allowed us to use up to 1 hour of training and 20 iterations per month.

The results

From the console, we can access the performance of our trained model. In our case, after 2 iterations, the model showed very good Precision (85.6%) and AP (89.9%) values.

Testing the inference API

So far so good. However, when creating an ML model we should keep an eye on how it’ll behave when deployed to production. Although ideally, we should use images retrieved from a totally different source to calculate the accuracy of the production model, we’ll make use of our separated validation dataset for this purpose.

The following script will select 10 random images from each category to check if the inferences provided by the inference API are valid or not. Following the default CV configuration, we’ll consider as true positive predictions with a probability score of more than 50% as correct.

The script shows the following results:

  • Total: 500
  • API errors: 0
  • Correct predictions: 363
  • Failed predictions: 137
  • Precision: 0.75
  • Recall: 0.73
  • Average prediction time: 0.53 seconds

As we can observe, the precision and recall metrics have significantly lowered but we can assume the model still behaves fairly well (although this accuracy could be insufficient depending on the requirements of our project) if we take into account the size of the training dataset and reduced number of iterations.

2009 results

As an interesting note, the solution with better results than the experiments described in the original paper, published in 2009, achieved a precision average of 26%.

When they are compared with the results of our quick 2020 prototype (implemented in a few hours making use of a highly automated tool), it highlights the progress of AI over the last few years.


Unfortunately, it’s not always sunny in Philadelphia. The service has some tradeoffs that should be taken into account if you are choosing the best tool to cover the requirements of your project.

The main limitation you need to accept when using Custom Vision is you won’t be able to export your trained model to be used in an external environment unless you select to use a “compact version”. This means you’ll always rely on Microsoft Azure to provide inferences and won’t have control to tweak model parameters or scale as you need (though this might not be an issue unless the requirements of your project demand it). To be fair, this seems a logical move from Microsoft to ensure their clients stay with them after building a custom model with minimum effort.

The aforementioned “compact” models are reduced versions with some limitations that aim to be deployed and used in devices with limited resources (IoT devices or mobile phones). Very useful when this type of use is needed but not enough when you want to export and run the model on a third-party platform (or train it locally). In a different article, I’ll test and compare the results of both the standard and compact versions.

Besides, one of its main strengths — that makes the whole training and deploying process to be transparent to the user a possibility — could be one of its main weaknesses. It is quite likely Custom Vision won’t reach the same precision level than a model created from scratch, which would allow you to optimise and change every learning parameter depending on the results obtained in each iteration. That being said, this doesn’t seem to be the market they want to reach, at least in the short term.



My general impression of the service and its capabilities is quite positive. In spite of its limitations, I think Custom Vision is a tool to consider for projects that require a custom model but are perhaps limited by time or budget. Custom Vision would also be a good option for prototyping, or for companies without the required resources or knowledge to dive deep into the challenges of AI and ML.

Written by Jesús Larrubia (Full Stack Developer). Read more in Insights by Jesús or check our their socials Twitter, Instagram