Visual Turing Test

The job of the human operator is to provide the correct answer to the question or reject it as ambiguous.

The query generator produces questions such that they follow a “natural story line”, similar to what humans do when they look at a picture.
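
As a rough illustration, the interaction between the query generator, the human operator, and the system under test can be sketched as a simple loop. This is a minimal sketch under assumed interfaces (next_question, answer); it is not the protocol implementation from the original work.

```python
# Minimal sketch of the test protocol described above; the interfaces
# (next_question, answer) are hypothetical, not taken from the original paper.

def visual_turing_test(image, query_generator, human_operator, vision_system, n_questions=20):
    """Run a short question/answer session on one image."""
    history = []  # (question, true answer, system answer) triples
    for _ in range(n_questions):
        # The generator proposes an unpredictable binary question that
        # continues the story line implied by the history so far.
        question = query_generator.next_question(image, history)
        if question is None:
            break  # no suitable question remains

        # The human operator gives the correct answer, or rejects the
        # question as ambiguous for this particular image.
        true_answer = human_operator.answer(image, question)
        if true_answer == "ambiguous":
            continue  # rejected questions are not put to the system

        prediction = vision_system.answer(image, question)
        history.append((question, true_answer, prediction))
    return history
```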

Research in computer vision dates back to the 1960s when Seymour Papert first attempted to solve the problem.

Roughly 50% of the human brain is devoted to processing vision, which indicates that it is a difficult problem.

These simple neural networks could not live up to expectations and had limitations that led to their being set aside in subsequent research.

There was significant progress in the field, but the core problem of vision, making machines understand images, was still not being addressed.

In the early 1990s, convolutional neural networks emerged; they showed great results on digit recognition but did not scale up well to harder problems.

One reason for this was the availability of key feature extraction and representation algorithms.

These features, along with existing machine learning algorithms, were used to detect, localise and segment objects in images.

While all these advancements were being made, the community felt the need for standardised datasets and evaluation metrics so that performance could be compared.

The availability of standard evaluation metrics and the open challenges gave direction to the research.

The Visual Turing Test aims to give a new direction to computer vision research, one that would lead to systems that are a step closer to understanding images the way humans do.

A large number of datasets have been annotated and generalised to benchmark the performance of different classes of algorithms on different vision tasks (e.g., object detection/recognition) in some image domain (e.g., scene images).

One of the most famous datasets in computer vision is ImageNet, which is used to assess the problem of object-level image classification.

Having these standard datasets has helped the vision community develop well-performing algorithms for all of these tasks.

The next logical step is to create a larger task encompassing these smaller subtasks.

Given an image, an infinite number of binary questions can be asked, and many of them are bound to be ambiguous.

The questions are built from a vocabulary that consists of three components: types of objects, attributes of objects, and relationships between objects. For images of urban street scenes, the types of objects include people, vehicles and buildings.

Attributes refer to the properties of these objects: for example, female, child, wearing a hat, or carrying something for people, and moving, parked, stopped, one tire visible, or two tires visible for vehicles.
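
The vocabulary just described can be pictured as a small data structure. The sketch below reuses the object types and attributes mentioned above; the dictionary layout and the relationship examples are illustrative assumptions, not taken from the original paper.

```python
# Illustrative sketch of the three-part vocabulary for urban street scenes.
# Object types and attributes come from the text above; the relationship
# examples are placeholders and the layout itself is an assumption.

VOCABULARY = {
    "objects": ["person", "vehicle", "building"],
    "attributes": {
        "person": ["female", "child", "wearing a hat", "carrying something"],
        "vehicle": ["moving", "parked", "stopped",
                    "one tire visible", "two tires visible"],
    },
    # Relationships hold between two instantiated objects,
    # e.g. a person next to a vehicle (these examples are hypothetical).
    "relationships": {
        ("person", "vehicle"): ["next to", "inside", "behind"],
    },
}
```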

As mentioned earlier, instantiating objects leads to other interesting questions and eventually a story line.

The story line is an integral part of the ultimate aim of building systems that can understand images the way humans do.

The simplicity preference states that the query generator should pick simpler questions over more complicated ones.

This gives an ordering to the questions based on the number of attributes they involve, and the query generator prefers the simpler ones.
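
A minimal sketch of this ordering, assuming each candidate question carries the attributes it mentions, might look like the following; the Question class and the selection rule are illustrative, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Question:
    text: str
    attributes: tuple = ()  # attributes the question mentions

def pick_next_question(candidates):
    """Prefer the candidate question that involves the fewest attributes."""
    if not candidates:
        return None
    return min(candidates, key=lambda q: len(q.attributes))

# Example: the single-attribute question is preferred over the two-attribute one.
candidates = [
    Question("Is there a person wearing a hat and carrying something?",
             ("wearing a hat", "carrying something")),
    Question("Is there a female person?", ("female",)),
]
print(pick_next_question(candidates).text)  # -> Is there a female person?
```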

However, they[4] propose a different version of the Visual Turing Test, which takes a holistic approach and expects the participating system to exhibit human-like common sense.

It evaluates how well computer vision systems understand images as compared to humans.

The Visual Turing Test is expected to give a new direction to computer vision research.

Recently Facebook announced its new platform M, which looks at an image and provides a description of it to help the visually impaired.

Selected sample questions generated by the query generator for a Visual Turing Test
Sample regions used as context in a Visual Turing Test. The one on the left shows regions 1/8 the size of the image and the one on the right shows regions 1/4 the size of the image
Images of urban street scenes from the training data. The training data is a collection of such images with scenes from different cities across the world
Example annotations of a training image provided by the human workers