MASK leverages word embeddings as bridges to associate words with their corresponding prototypes, thereby enabling semantic knowledge alignment between the image and text modalities.
We test the performance of MASK on two standard benchmark datasets: Flickr30k and MSCOCO.
Image-text matching typically comprises two sub-tasks: 1) image annotation, retrieving related texts given an image, and 2) image retrieval, retrieving related images given a text.
The commonly used evaluation criteria are "R@1", "R@5", and "R@10", i.e., the recall rates at the top-1, 5, and 10 results. Following existing works, we also adopt an additional criterion, "Rs", which sums all the recall rates to evaluate the overall performance.
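These retrieval metrics can be sketched as follows. This is a minimal illustration, not the paper's evaluation code; the function name and the example rank values are ours, and we assume each query has a single ground-truth item whose 1-based rank in the retrieved list is known.

```python
import numpy as np

def recall_at_k(ranks, ks=(1, 5, 10)):
    """Compute R@k: the percentage of queries whose ground-truth item
    appears within the top-k retrieved results.

    ranks: 1-based rank of the ground-truth item for each query.
    """
    ranks = np.asarray(ranks)
    return {k: float(np.mean(ranks <= k) * 100.0) for k in ks}

# Hypothetical ranks for 5 queries (1-based position of the correct match).
ranks = [1, 3, 7, 2, 12]
recalls = recall_at_k(ranks)        # {1: 20.0, 5: 60.0, 10: 80.0}
rs = sum(recalls.values())          # the summed "Rs" criterion: 160.0
```

In practice Rs sums the recall rates over both sub-tasks (image annotation and image retrieval), so six recall values contribute to the final score.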
To build the multimodal aligned semantic knowledge, we collect all words from the VG dataset and filter out special characters and rare words, resulting in a total of