Image embedding using DNN

Introduction

There are several techniques for protecting images and video. One of them is watermarking, which means inserting marks into your content to protect it from being copied, republished and monetized by others. Those marks can be visible or invisible, but either way they introduce distortions into the original content. Another technique is to create a compact identifier for each image. A well-known approach is statistical analysis: statistics are collected (hand-crafted) in the pixel or transform domain and then used to generate a compact representation of the content, which can be stored and later used to identify the original. Deep Neural Networks (DNNs) have emerged as a powerful technique in recent years, and using a DNN to solve the image fingerprinting problem is obviously not a bad idea.

So, what makes a good image fingerprint? After some googling, we got the answer: it should have at least two properties, uniqueness and robustness. Uniqueness means that the fingerprints of two different images should be different, or be classified as different. Robustness means that when you modify the image up to a certain level, for example by adding some text, rotating it, taking a screenshot or changing the brightness, the fingerprint should stay the same. We are going to see how to develop and train a neural network that extracts an embedding vector (called a fingerprint) from an image and, of course, fulfills the uniqueness and robustness requirements.

Triplet Loss

Suppose that you have an image that you want to protect: you do not allow anyone to modify and republish it. But somehow the bad guys get hold of it and apply some changes to it, such as changing the brightness, adding some text, rotating it... What you want is an algorithm that tells you whether an image from the internet is a duplicate of your image or not. Using triplet loss was the first idea that came to my mind. Our problem can be expressed as finding a mapping function f(x) from pixel space to a d-dimensional Euclidean space (R^d). The formula of the triplet loss is shown in the figure below.
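In its usual form, and assuming squared L2 distances as in most triplet-loss formulations, the loss for a single triplet can be written with the symbols used in this post as:

```latex
L = \max\left( \lVert f^a - f^b \rVert_2^2 - \lVert f^a - f^n \rVert_2^2 + \alpha,\; 0 \right)
```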

where f^a is the embedding vector of the original image, called the anchor image, f^b is the embedding vector of the modified image (the positive image) and f^n is the embedding vector of the negative image (any image other than the anchor and its replicas). By using this loss, we maximize the L2 distance between the anchor and the negative while minimizing the L2 distance between the anchor and its duplicates. If we find the optimal function, the embeddings of the anchor and the positive will be identical and different from the embedding of the negative (how far apart depends on the margin alpha).
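To make the roles of anchor, positive and negative concrete, here is a minimal sketch of the triplet loss in plain Python. The function names, the toy 2-D embeddings and the use of squared L2 distance are assumptions made for illustration; a real training loop would use a vectorized version from a deep learning framework.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, alpha=0.3):
    """Standard triplet loss for a single triplet of embedding vectors."""
    pos_dist = squared_l2(anchor, positive)  # anchor vs. modified copy
    neg_dist = squared_l2(anchor, negative)  # anchor vs. unrelated image
    return max(pos_dist - neg_dist + alpha, 0.0)


# Toy 2-D embeddings, made up for illustration.
anchor = [0.0, 0.0]
positive = [0.1, 0.0]  # close to the anchor
negative = [1.0, 0.0]  # far from the anchor

easy_loss = triplet_loss(anchor, positive, negative)  # triplet already separated
hard_loss = triplet_loss(anchor, negative, positive)  # roles swapped: badly separated
print(easy_loss, hard_loss)
```

A triplet that is already well separated contributes nothing to the loss, while a badly separated one is penalized by the amount it violates the margin.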

There is a problem with this loss function. Let us assume that the positive distance = 0.5, the negative distance = 0.9 and alpha = 0.3. Then the loss will be 0.5 - 0.9 + 0.3 = -0.1, a negative value, so our final loss will be max(0, loss) = 0 and our model will not improve, even though we still want a smaller positive distance and a bigger negative distance. There is a proposal [1] to solve this problem by limiting the output with a sigmoid activation function in the last layer and using the lossless triplet loss as below:
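A minimal sketch of this idea, written as I understand the formulation in [1], is given here in plain Python. The parameter names beta and eps, the default beta = N (the embedding size) and the toy vectors are assumptions; the embeddings are assumed to be sigmoid outputs, so every component lies in (0, 1) and each squared distance is at most N.

```python
import math


def lossless_triplet_loss(anchor, positive, negative, beta=None, eps=1e-8):
    """Log-based triplet loss for sigmoid-bounded embeddings (sketch)."""
    n = len(anchor)                    # embedding size N
    beta = n if beta is None else beta
    pos_dist = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    neg_dist = sum((a - q) ** 2 for a, q in zip(anchor, negative))
    # Grows without a hard cut-off as the positive moves away from the anchor.
    pos_term = -math.log(-(pos_dist / beta) + 1 + eps)
    # Grows as the negative moves closer to the anchor.
    neg_term = -math.log(-((n - neg_dist) / beta) + 1 + eps)
    return pos_term + neg_term


# Toy 3-D embeddings with components in (0, 1), made up for illustration.
anchor = [0.2, 0.8, 0.5]
positive = [0.25, 0.75, 0.5]
near_negative = [0.3, 0.7, 0.5]
far_negative = [0.9, 0.1, 0.4]

loss_near = lossless_triplet_loss(anchor, positive, near_negative)
loss_far = lossless_triplet_loss(anchor, positive, far_negative)
print(loss_near, loss_far)
```

Because there is no max(0, ...) clipping, pushing the negative further away keeps lowering the loss instead of hitting a flat region.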

Unlike the raw value of the previous loss function, this loss never goes below 0, so it is never clipped. However, using a sigmoid activation introduces another problem. The model can learn something, but the sigmoid saturates and its gradient vanishes at some point, which makes the embedding vector converge to only 0s and 1s. We can solve this by using a weighted negative distance to reduce the effect of the negative distance on the loss:
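Written with the same symbols as before, and assuming that the weight theta multiplies only the negative distance and that squared L2 distances are used, the weighted loss takes the form:

```latex
L = \max\left( \lVert f^a - f^b \rVert_2^2 - \theta \, \lVert f^a - f^n \rVert_2^2 + \alpha,\; 0 \right), \qquad 0 < \theta < 1
```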

Using a weight theta smaller than 1 guarantees, to a certain extent, that the loss will not drop below 0. Assume theta = 0.4 and keep positive distance = 0.5, negative distance = 0.9 and alpha = 0.3 as in the previous example. Now the loss will be 0.5 - 0.4*0.9 + 0.3 = 0.44, which is greater than 0, so the model will keep learning to get a bigger negative distance. In this example, the negative distance has to reach 2 before the loss drops to 0, assuming that the positive distance still equals 0.5.
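The arithmetic above can be checked with a few lines of Python. The function name and its signature are hypothetical; it works directly on the two distances rather than on embedding vectors.

```python
def weighted_triplet_loss(pos_dist, neg_dist, alpha=0.3, theta=0.4):
    """Triplet loss with the negative distance scaled by theta."""
    return max(pos_dist - theta * neg_dist + alpha, 0.0)


# theta = 1 recovers the standard triplet loss, which is clipped to 0 here.
standard = weighted_triplet_loss(0.5, 0.9, theta=1.0)

# theta = 0.4 keeps the loss positive for the same distances.
weighted = weighted_triplet_loss(0.5, 0.9)

# Negative distance at which the weighted loss reaches 0 for pos_dist = 0.5.
zero_at = (0.5 + 0.3) / 0.4

print(standard, round(weighted, 2), zero_at)
```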

Training

I use three identical ResNet-18 base models to train with the triplet loss (the Deep Network block in the figure above) with the Adam optimizer (learning rate 5e-4 and then 1e-4 to avoid overfitting). Note that I do not use an activation function in the last Dense 128 layer. The reason is that an activation loses information from the last Dense layer (ReLU removes all negative values, and sigmoid saturates, squashing all values into the range [0, 1]).

Results

After training for over 90 epochs, when the training loss was about 3.2 and the validation loss was 0.7, I stopped the training (see the loss figure below). I used the base model to extract the embedding vectors of 1000 test images and saved them as the embedding database. 1000 randomly chosen positive test images and 1000 negative images were used to compute the confusion matrix. The test phase was performed 10 times and the final results are averaged over the 10 runs. Recall, precision and F1 score are 0.95, 0.93 and 0.94 respectively.
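One way such an evaluation could be set up is sketched here in plain Python. The decision rule (a query counts as a duplicate when its nearest database embedding lies within a distance threshold), the threshold value, the function names and the toy data are all assumptions for illustration, not the exact procedure used for the numbers above.

```python
import math


def euclidean(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def is_match(query, database, threshold):
    """True if the nearest database embedding is within the threshold."""
    return min(euclidean(query, emb) for emb in database) <= threshold


def precision_recall_f1(labels, predictions):
    """Compute precision, recall and F1 from boolean labels and predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy embedding database and queries, made up for illustration.
database = [[0.0, 0.0], [1.0, 1.0]]
queries = [[0.05, 0.0], [1.0, 0.9], [0.2, 0.1], [3.0, 3.0], [0.0, 0.6]]
labels = [True, True, False, False, True]  # True = modified copy of a database image

predictions = [is_match(q, database, threshold=0.3) for q in queries]
precision, recall, f1 = precision_recall_f1(labels, predictions)
print(predictions, precision, recall, f1)
```

The threshold trades precision against recall: a larger value catches more heavily modified copies but also accepts more unrelated images.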

Our model achieves 99.5% on the test dataset. The image below shows the top 5 images with the closest Euclidean distance to the query image. The distances to the matches in the top 5 are also printed.
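The top-5 lookup can be sketched as a brute-force nearest-neighbour search over the embedding database. The function name and the toy 2-D embeddings are hypothetical; real embeddings would be the 128-dimensional vectors produced by the base model.

```python
import math


def top_k(query, database, k=5):
    """Return (distance, index) pairs for the k closest database embeddings."""
    scored = []
    for idx, emb in enumerate(database):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(query, emb)))
        scored.append((dist, idx))
    return sorted(scored)[:k]  # smallest distance first


# Toy database: seven 2-D embeddings spaced along one axis.
database = [[float(i), 0.0] for i in range(7)]
query = [2.2, 0.0]

matches = top_k(query, database)
for dist, idx in matches:
    print(f"image {idx}: distance {dist:.2f}")
```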

References

[1] Lossless Triplet Loss: https://towardsdatascience.com/lossless-triplet-loss-7e932f990b24
[2] Siamese Model: https://towardsdatascience.com/image-similarity-using-triplet-loss-3744c0f67973