Simple visual question answering system 🧐

Date: 27.03.2020

A while ago I stumbled across a blog post about building a simple visual question answering system that takes an image and a question and answers the question with respect to the image. I decided to build such a system on my own, detecting three basic shapes (squares, circles and triangles) in three different colors.

The complete Jupyter notebook can be found here.

I created a small script to generate images and question–answer pairs for the project, so that the model has a set of examples to learn from. While creating the training data, I realized that the system would be biased towards ‘No’ answers, as the training data contains far more questions that lead to a ‘No’ answer than to ‘Yes’, a color, or a shape.

Target label distribution
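For illustration, here is a minimal sketch of what such a generation script could look like. Everything in it (the canvas size, the color palette, the drawing helpers and the question template) is an assumption made for this sketch; the actual script is in the notebook.

import numpy as np

# Sketch only: canvas size, palette and question template are assumptions;
# the real generation script lives in the notebook.
CANVAS_SIZE = (32, 32)
COLORS = {'red': (1, 0, 0), 'green': (0, 1, 0), 'blue': (0, 0, 1)}
SHAPES = ['square', 'circle', 'triangle']

def draw_shape(rng):
    """Draw one randomly placed, randomly colored shape on a white canvas."""
    img = np.ones(CANVAS_SIZE + (3,))
    shape, color = rng.choice(SHAPES), rng.choice(list(COLORS))
    cy, cx, r = rng.integers(8, 24), rng.integers(8, 24), 6
    yy, xx = np.mgrid[:CANVAS_SIZE[0], :CANVAS_SIZE[1]]
    if shape == 'circle':
        mask = (xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2
    elif shape == 'square':
        mask = (np.abs(xx - cx) <= r) & (np.abs(yy - cy) <= r)
    else:  # triangle pointing up: width grows towards the bottom row
        mask = (np.abs(yy - cy) <= r) & (np.abs(xx - cx) <= (yy - (cy - r)) / 2)
    img[mask] = COLORS[color]
    return img, shape, color

def make_example(rng):
    """Pair an image with a question about a random shape and the true answer."""
    img, shape, _ = draw_shape(rng)
    asked = rng.choice(SHAPES)
    question = f'Is there a {asked}?'
    answer = 'Yes' if asked == shape else 'No'  # asking about a random shape yields 'No' most of the time
    return img, question, answer

rng = np.random.default_rng(42)
image, question, answer = make_example(rng)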

This imbalance is a potential cause of poor model performance, so we should keep it in the back of our minds when it comes to evaluating the model. One way to fix the issue would be to generate more questions that are answered with ‘Yes’, a color, or a shape, to reduce the weight of the ‘No’ class.
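An alternative, which I did not use here, would be to keep the data as it is and instead weight the classes in the loss. Keras supports this via the class_weight argument of model.fit; a rough sketch, assuming answers holds the one-hot encoded labels produced by the generation script:

import numpy as np

# Weight each answer class inversely to its frequency, so the frequent 'No'
# answers contribute less to the loss than rare 'Yes'/color/shape answers.
label_ids = np.argmax(answers, axis=1)          # one-hot labels -> class indices
counts = np.bincount(label_ids)
class_weight = {i: len(label_ids) / (len(counts) * c) for i, c in enumerate(counts)}
# Later: model.fit(..., class_weight=class_weight)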

I used Tensorflow 2 with the Keras API to build the Deep Learning model.

from tensorflow.keras.layers import (Conv2D, Dense, Dropout, Flatten, Input,
                                     MaxPooling2D, concatenate)
from tensorflow.keras.models import Model, Sequential

def get_image_cnn():
    # CNN branch that encodes the input image into a 32-dimensional vector.
    model = Sequential()
    model.add(Conv2D(4, kernel_size=(3, 3), activation='relu', input_shape=IMAGE_SHAPE, padding='same'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(8, kernel_size=(3, 3), activation='relu', padding='same'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dropout(0.1))
    model.add(Dense(32, activation='relu'))
    return model

def get_question_nn():
    # Dense branch that encodes the question vector into an 8-dimensional vector.
    model = Sequential()
    model.add(Dense(12, input_shape=(QUESTION_SHAPE,), activation='relu'))
    model.add(Dropout(0.1))
    model.add(Dense(8, activation='relu'))
    return model

def get_model():
    image_model = get_image_cnn()
    question_model = get_question_nn()
    input_image = Input(shape=IMAGE_SHAPE)
    input_question = Input(shape=(QUESTION_SHAPE,))

    # Run both branches, concatenate their encodings and classify the answer.
    image_result = image_model(input_image)
    question_result = question_model(input_question)
    result = concatenate([image_result, question_result])
    result = Dense(12, activation='relu')(result)
    result = Dropout(0.1)(result)
    result = Dense(ANSWER_SHAPE, activation='softmax')(result)
    model = Model(inputs=[input_image, input_question], outputs=result)
    # adam_optimizer() is a helper defined outside this snippet (see the notebook).
    model.compile(loss='categorical_crossentropy', optimizer=adam_optimizer(), metrics=['accuracy'])
    return model

The model consists of three parts: a small CNN that encodes the image, a dense network that encodes the question, and a classifier head that concatenates both encodings and predicts the answer.

I use the Keras Dropout layer for regularization to prevent the model from overfitting: Dropout is a technique that randomly disables neurons during training, see Blog post on Dropout layers.

During training, we see that the model learns to predict the correct labels for ~90% of the training examples:

Training loss and accuracy

The loss decreases steadily as well. Since training performance is not a good indicator of how well the model does on unseen data, I split the example set into a training set (90% of the data) and a test set (10%):

from sklearn.model_selection import train_test_split

# Hold out 10% of the examples for evaluation after training.
train_images, test_images, train_questions, test_questions, train_answers, test_answers = \
    train_test_split(images, questions, answers, test_size=0.1)

The test set is not used for training the model; it is only used to validate the model after training.
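Putting it together, training and evaluation look roughly like this (the number of epochs and the batch size below are placeholders; the actual values are in the notebook):

model = get_model()
# Train only on the training split ...
model.fit([train_images, train_questions], train_answers,
          epochs=20, batch_size=32)             # placeholder hyperparameters
# ... and measure generalization on the held-out test split.
loss, accuracy = model.evaluate([test_images, test_questions], test_answers)
print('Test accuracy:', accuracy)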

The validation accuracy was a whopping 98% 🤩, with an averaged f1-score (which combines precision and recall) of 0.96997.
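The f1-score can be computed with scikit-learn from the class predictions; a sketch (I am assuming weighted averaging here, the notebook has the exact call):

import numpy as np
from sklearn.metrics import f1_score

predicted = np.argmax(model.predict([test_images, test_questions]), axis=1)
actual = np.argmax(test_answers, axis=1)
print('f1-score:', f1_score(actual, predicted, average='weighted'))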

Let’s take a look at an example:

Circle

Question: Is there a triangle?

import numpy as np
import matplotlib.pyplot as plt

# Show the image, the encoded question and the true vs. predicted answer label.
print('Image:')
plt.imshow(test_images[0].reshape(IMAGE_SIZE))
plt.show()
print('Encoded question:', test_questions[0])
print('Correct answer:', np.argmax(test_answers[0]))
print('Predicted answer:', np.argmax(model.predict([[test_images[0]], [test_questions[0]]]), axis=1)[0])

Encoded question: [0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0]
Correct answer: 3
Predicted answer: 3

Answer label 3 corresponds to: No

🤯 Yeah the model learned to distinguish circles and triangles!!!

In the Notebook, I explain in more detail how the data is encoded for the model and what future extensions could be made to it.
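To give a rough idea: the encoded question shown above looks like a fixed-length multi-hot vector over a small vocabulary. Here is a sketch of how such an encoding could be built; the vocabulary below is made up for illustration, the real one is in the notebook.

import numpy as np

# Hypothetical 16-word vocabulary; the actual vocabulary is defined in the notebook.
VOCAB = ['is', 'there', 'a', 'what', 'color', 'shape', 'the', 'red', 'green',
         'blue', 'square', 'circle', 'triangle', 'in', 'image', 'any']

def encode_question(question):
    """Multi-hot bag-of-words encoding: set a 1 for every vocabulary word present."""
    vec = np.zeros(len(VOCAB), dtype=int)
    for token in question.lower().rstrip('?').split():
        if token in VOCAB:
            vec[VOCAB.index(token)] = 1
    return vec

print(encode_question('Is there a triangle?'))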

That’s all for now.