License: arXiv.org perpetual non-exclusive license
arXiv:2402.05391v4 [cs.AI] 26 Feb 2024

Knowledge Graphs Meet Multi-Modal Learning:
A Comprehensive Survey

Zhuo Chen, Yichi Zhang, Yin Fang, Yuxia Geng, Lingbing Guo, Xiang Chen, Qian Li, Wen Zhang*,
Jiaoyan Chen, Yushan Zhu, Jiaqi Li, Xiaoze Liu, Jeff Z. Pan, Ningyu Zhang, Huajun Chen*
https://github.com/zjukg/KG-MM-Survey
Zhuo Chen (zhuo.chen@zju.edu.cn), Yichi Zhang, Yin Fang, Lingbing Guo, Xiang Chen, Wen Zhang (zhang.wen@zju.edu.cn), Yushan Zhu, Ningyu Zhang and Huajun Chen (huajunsir@zju.edu.cn) are from Zhejiang University, China. Yuxia Geng is from Hangzhou Dianzi University, China. Jiaoyan Chen is from The University of Manchester and University of Oxford, UK. Jiaqi Li is from Southeast University, China. Xiaoze Liu is from Purdue University, USA. Jeff Z. Pan is from The University of Edinburgh, UK. * denotes corresponding authors.
Abstract

Knowledge Graphs (KGs) play a pivotal role in advancing various AI applications, with the semantic web community’s exploration into multi-modal dimensions unlocking new avenues for innovation. In this survey, we carefully review over 300 articles, focusing on KG-aware research in two principal aspects: KG-driven Multi-Modal (KG4MM) learning, where KGs support multi-modal tasks, and Multi-Modal Knowledge Graph (MM4KG), which extends KG studies into the MMKG realm. We begin by defining KGs and MMKGs, then explore their construction progress. Our review includes two primary task categories: KG-aware multi-modal learning tasks, such as Image Classification and Visual Question Answering, and intrinsic MMKG tasks like Multi-modal Knowledge Graph Completion and Entity Alignment, highlighting specific research trajectories. For most of these tasks, we provide definitions and evaluation benchmarks, and additionally outline essential insights for conducting relevant research. Finally, we discuss current challenges and identify emerging trends, such as progress in Large Language Modeling and Multi-modal Pre-training strategies. This survey aims to serve as a comprehensive reference for researchers already involved in or considering delving into KG and multi-modal learning research, offering insights into the evolving landscape of MMKG research and supporting future work.

Index Terms:
Knowledge Graphs, Multi-modal Learning, Large Language Model, Survey
Figure 1: Knowledge Graphs Meet Multi-modal Learning.

I Introduction

Considering knowledge reasoning and multi-modal perception in isolation from each other may not be the most appropriate strategy [1]. This parallels human cognition, where the brain’s accumulation of memories over time forms a crucial base for societal adaptation and survival, enabling meaningful actions and interactions. These memories can be divided into two primary categories.

The first category resembles conditioned reflexes. Through repeated practice, humans develop a kind of intuitive memory that enhances intuitive and analogical reasoning skills, often referred to as shallow knowledge. When such shallow knowledge is combined with sensory inputs like visual, auditory, and tactile data, it enables us to perform basic tasks efficiently. This ability is at the heart of what traditional multi-modal tasks strive to achieve. Multi-modal tasks, which draw on data from multiple modalities for problem-solving, mimic real-life situations more closely than traditional uni-modal Natural Language Processing (NLP) or Computer Vision (CV) tasks. For example, Visual Question Answering builds on the NLP question-answering task by incorporating visual data, predicting answers from both an image and a textual question. Similarly, Image Captioning extends Natural Language Generation (NLG) principles by creating descriptive sentences for images, providing a fuller understanding of the content. Moreover, with the rapid advancement of the Internet and the removal of bandwidth limitations, multi-modal information sources have become crucial and readily accessible, enabling more precise access to information.

The second category, known as torso-to-tail knowledge, is encountered less frequently in everyday life and often does not lead to the formation of conditioned reflexes. This category requires active memorization or contemplation, highlighting the significance of Knowledge Graphs (KGs) in capturing and structuring long-tail knowledge. While current large-scale pre-training efforts assimilate knowledge, they also face challenges such as hallucination and the blurring of unusual knowledge [2, 3, 4, 5]. In contrast, our study primarily focuses on the utilization of symbolic, structured knowledge within KGs. Given the vital role of KGs in organizing long-tail knowledge and their proven effectiveness as a foundational knowledge representation element across many successful AI and information systems [6], integrating KGs with multi-modal learning offers a promising avenue for addressing these challenges.

I-A Motivation and Contribution

As illustrated in Fig. 1, individuals in real life need to process multi-modal information from the environment while continuously absorbing and utilizing outside knowledge. These elements should not function in isolation; rather, knowledge and multi-modality are inherently complementary. Despite this intrinsic connection, the two domains have historically developed independently. Previous work focuses either on KG-enhanced multi-modal learning or on multi-modal KG research itself. To date, no study or review has provided a comprehensive, balanced analysis of both fields, which has further widened the divide in their development.

In this paper, we first trace the evolution from conventional KGs to MMKGs, noting the evolving focus of the semantic web community. We then carefully categorize KG-driven multi-modal tasks, where KGs serve as pivotal repositories of knowledge, providing both a basis for inference and essential knowledge for various downstream multi-modal tasks. Following this, we explore the impact of multi-modal techniques on KGs, discussing both their current state and future prospects. Our detailed analysis covers methodological developments within each task and benchmarks for key areas, enabling effective comparison across tasks. Focusing primarily on research from the past three years (2020-2023), this survey also discusses recent advancements in Large Language Models (LLMs) and their interaction with the topics covered. It is suitable for all AI researchers and is especially beneficial for those delving into knowledge-driven multi-modal reasoning and cross-modal knowledge representation, while also serving as a valuable resource for practitioners in semantic web techniques seeking new insights.

Literature Collection Methodology: For our paper, we source literature primarily from Google Scholar and arXiv. Google Scholar provides broad access to leading computer science conferences and journals, while arXiv serves as a key platform for preprints across various disciplines, including a large body of work recognized by the computer science community. We employ a systematic search strategy on these platforms, using relevant keyword combinations to assemble our references. We rigorously curate this collection, manually filtering out irrelevant papers and incorporating initially overlooked studies that are cited in the main texts of the retained papers. Using Google Scholar’s citation tracking, we further augment our list through iterative depth-wise and breadth-wise traversal.
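The citation-based expansion described above can be sketched as a bounded traversal over a citation graph. The following is a minimal illustration and not the authors' actual tooling: the function name `snowball`, the `cited_by` mapping, the `is_relevant` filter, and the `max_depth` limit are all hypothetical, and the sketch performs only a breadth-first expansion with a depth cap, with the manual relevance screening reduced to a predicate.

```python
from collections import deque

def snowball(seeds, cited_by, is_relevant, max_depth=2):
    """Breadth-first expansion from seed papers over a citation graph.

    seeds:       iterable of paper ids found by keyword search (assumed input)
    cited_by:    dict mapping a paper id to the ids of papers linked to it
    is_relevant: predicate standing in for the manual relevance filter
    max_depth:   how many citation hops to follow from the seeds
    """
    # Keep seed order, drop duplicates, and screen the seeds themselves.
    selected = [p for p in dict.fromkeys(seeds) if is_relevant(p)]
    seen = set(seeds)
    queue = deque((p, 0) for p in selected)

    while queue:
        paper, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the hop limit
        for neighbour in cited_by.get(paper, ()):
            if neighbour in seen:
                continue
            seen.add(neighbour)
            # Only relevant papers are kept and expanded further.
            if is_relevant(neighbour):
                selected.append(neighbour)
                queue.append((neighbour, depth + 1))
    return selected

# Toy citation graph with placeholder ids (not real papers).
toy_graph = {"A": ["B", "C"], "B": ["D"], "C": ["D", "X"], "D": ["E"]}
collected = snowball(["A"], toy_graph, lambda p: p != "X", max_depth=2)
```

In this toy run, the irrelevant paper `X` is screened out and `E` is not reached because it lies three hops from the seed; raising `max_depth` widens the collection.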

I-B Related Literature Reviews

TABLE I: Comparison of our survey with other related review papers on multi-modal learning and knowledge graphs. Abbreviations used: D.S. Tasks (Downstream Tasks), Const. (Construction), MLMPT (Multi-modal Language Model Pre-training), Industrial App. (Industrial Applications), 4 (for), Sci. (Science).

Columns: Survey Papers; KG4MM (KG Const., D.S. Tasks, MLMPT, Benchmark, Industrial App.); MMKG (MMKG Const., D.S. Tasks, Benchmark, Industrial App., AI4Sci.); Challenges and Opportunities (KG4MM, MMKG, LLM).
Zhu et al. [7]
Monka et al. [8]
Lymperaiou et al. [9]
Peng et al. [10]
Ours

Several studies have reviewed literature pertinent to KGs and multi-modal learning. Our survey differs from them in several respects, as shown in Table I.

TABLE II: Frequently Used Symbols.
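Notations | Descriptions
$\mathcal{G}$ | Knowledge graph defined as $\mathcal{G}=\{\mathcal{E},\mathcal{R},\mathcal{A},\mathcal{T},\mathcal{V}\}$.
$\mathcal{E}$ | Entity set, including typical ($\mathcal{E}_{KG}$) and multi-modal entities ($\mathcal{E}_{MM}$).
$\mathcal{R}$ | Set of relation predicates ($r$).
$\mathcal{A}$ | Set of attribute predicates ($a$).
$\mathcal{T}$ | Statements set, comprising relational ($\mathcal{T}_{\mathcal{R}}$) and attribute triples ($\mathcal{T}_{\mathcal{A}}$).
$\mathcal{V}$ | Attribute values set, including literals like string, date, integer, decimal ($\mathcal{V}_{KG}$) and multi-modal values ($\mathcal{V}_{MM}$).
$\mathcal{I}$ | Set of visual images ($i$) in MMKGs.
$(h,r,t)$ | Relational triple from $\mathcal{T}_{\mathcal{R}}$ with head entity $h$ ($h\in\mathcal{E}$), tail entity $t$ ($t\in\mathcal{E}$), and relation predicate $r$.
$(e,a,v)$ | Attribute triple from $\mathcal{T}_{\mathcal{A}}$ with entity $e$, attribute predicate $a$ and value $v$.
$\langle w_1,\ldots,w_n\rangle$ | Text corpus.
$\mathcal{X}$ | Input domain of multi-modal data across $K$ modalities, $\mathcal{X}=\mathcal{X}^{(1)}\times\cdots\times\mathcal{X}^{(K)}$ and $x^{(k)}\in\mathcal{X}^{(k)}$.
$\mathcal{Y}$ | Target domain with $y\in\mathcal{Y}$.
$\mathcal{D}$ | Data distribution for a downstream task.
$\mathcal{Z}$ | Latent space with $z\in\mathcal{Z}$.
$g(\cdot)$ | Mapping function from the input domain (using all of $K$ modalities) to the latent space ($\mathcal{X}\rightarrow\mathcal{Z}$).
$q(\cdot)$ | Task mapping function from the latent space to the target domain ($\mathcal{Z}\rightarrow\mathcal{Y}$).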
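To make the notation in Table II concrete, the sketch below stores a small multi-modal KG as two sets of triples, relational (h, r, t) and attribute (e, a, v), and derives the entity, relation, attribute, and image sets from them. It is an illustrative assumption rather than a structure defined by the survey: the class names `MMKG` and `ImageRef`, the method names, and the toy triples are hypothetical, and only images that appear as attribute values are modeled (images acting as entities in their own right are left out for brevity).

```python
from dataclasses import dataclass, field
from typing import Union

@dataclass(frozen=True)
class ImageRef:
    """Hypothetical marker for a multi-modal value: a path or URL to an image."""
    uri: str

Literal = Union[str, int, float]   # ordinary literal values
Value = Union[Literal, ImageRef]   # all attribute values: literals or images

@dataclass
class MMKG:
    relational: set = field(default_factory=set)  # relational triples (h, r, t)
    attribute: set = field(default_factory=set)   # attribute triples (e, a, v)

    def add_relation(self, h: str, r: str, t: str) -> None:
        self.relational.add((h, r, t))

    def add_attribute(self, e: str, a: str, v: Value) -> None:
        self.attribute.add((e, a, v))

    @property
    def entities(self) -> set:
        # Every head, tail, and attribute subject is an entity.
        heads = {h for h, _, _ in self.relational}
        tails = {t for _, _, t in self.relational}
        subjects = {e for e, _, _ in self.attribute}
        return heads | tails | subjects

    @property
    def relations(self) -> set:
        return {r for _, r, _ in self.relational}

    @property
    def attributes(self) -> set:
        return {a for _, a, _ in self.attribute}

    @property
    def images(self) -> set:
        # Image set: the attribute values that are images.
        return {v for _, _, v in self.attribute if isinstance(v, ImageRef)}

# Toy example with made-up identifiers.
kg = MMKG()
kg.add_relation("Eiffel_Tower", "locatedIn", "Paris")
kg.add_attribute("Eiffel_Tower", "label", "Eiffel Tower")
kg.add_attribute("Eiffel_Tower", "hasImage", ImageRef("eiffel.jpg"))
```

Deriving the entity and predicate sets from the triples, instead of storing them separately, keeps the sketch short and guarantees that they stay consistent with the statement set; a production store would more likely index them explicitly.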