Delving into the World of Knowledge Graphs

A Review of ‘Building Knowledge Graphs’ by Barrasa et al.

Sixing Huang
6 min read · Jul 27, 2023

Knowledge graphs (KGs) are structures that represent and organize information in a graphical format (Figure 1). They capture knowledge about entities (objects, concepts, events, etc.) and their relationships, enabling effective data integration, organization, and discovery. KGs are built around organizing principles and typically consist of nodes (representing entities) and edges (representing relationships between entities). KGs can serve as databases, maps, data lineage trackers, and Bayesian inference tools.

Figure 1. A companion plant knowledge graph. Image by author.
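
To give a flavor of this node-and-edge structure, here is a minimal Cypher sketch of a companion-plant graph like the one in Figure 1. The plant names, relationship type, and property are my own illustrative choices, not taken from the figure:

// Illustrative only: two plant nodes connected by a companionship relationship
MERGE (tomato:Plant {name: 'Tomato'})
MERGE (basil:Plant {name: 'Basil'})
MERGE (basil)-[:COMPANION_OF {benefit: 'repels pests'}]->(tomato);

// Query: which plants are good companions for tomatoes?
MATCH (p:Plant)-[r:COMPANION_OF]->(:Plant {name: 'Tomato'})
RETURN p.name AS companion, r.benefit AS benefit;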

In this era of Large Language Models (LLMs), KGs are becoming more important than ever. They can act as external data sources and thus help address several limitations of LLMs, such as hallucination and the lack of access to external and up-to-date information.

Building KGs has become much easier these days, thanks to the wide availability of tutorials (1, 2, 3, and 4). The LLM-powered tool GraphGPT is interesting, but it is more of a proof of concept than a complete graph solution. Sooner rather than later, we will need to build our own KGs, and that requires some solid knowledge of the basics. In that case, a good practitioner’s guide is a lifesaver. And who is more qualified to write about these nuts and bolts than the Neo4j cadre themselves?

Figure 2. “Building Knowledge Graphs — A Practitioner’s Guide” by Jesús Barrasa and Jim Webber

On the heels of their first book Knowledge Graphs (read my review here), Jesús Barrasa and Jim Webber published a follow-up book called Building Knowledge Graphs in June 2023. With more than three times as many pages (291 vs. 78), it goes into much more detail about building and utilizing KGs than its predecessor. For individuals who find programming code more comprehensible than natural language, this book is satisfying, as it provides abundant code samples. In this short review, I would like to talk about the parts that resonated with me most.

1. The basics: definition, organizing principles, graph databases, and KG construction

The first four chapters of the book deal with the basics, starting with the KG definition. The second chapter, Organizing Principles for Building Knowledge Graphs, is a reproduction of the same chapter from the first book. It talks about taxonomy and ontology. In fact, this chapter inspired me to include the SNOMED ontologies in my clinical trial KG project (Figure 3).

Figure 3. The graph structure of my clinical trial knowledge graph project. The SNOMED ontologies (right) can augment the data from clinicaltrials.gov by adding extra context. Image by author.

With the SNOMED ontologies, the KG can answer considerably more questions. The disease taxonomy also groups the clinical trials effectively. These ontologies may even point researchers toward drug repurposing opportunities. So bringing in an ontology is a simple yet powerful way to beef up your KG significantly.
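
To give a concrete (if simplified) flavor of this kind of augmentation, here is a hedged Cypher sketch. The labels, relationship types, and property names (ClinicalTrial, Condition, SnomedConcept, IS_A, preferredTerm) are hypothetical stand-ins, not the actual schema of my project or the book:

// Hypothetical: map each trial condition to a SNOMED concept with the same name
MATCH (t:ClinicalTrial)-[:INVESTIGATES]->(c:Condition)
MATCH (s:SnomedConcept {preferredTerm: c.name})
MERGE (c)-[:MAPPED_TO]->(s);

// The SNOMED IS_A hierarchy then groups trials under broader disease classes
MATCH (t:ClinicalTrial)-[:INVESTIGATES]->(:Condition)-[:MAPPED_TO]->(s:SnomedConcept)
MATCH (s)-[:IS_A*0..]->(parent:SnomedConcept)
RETURN parent.preferredTerm AS diseaseClass, count(DISTINCT t) AS trials
ORDER BY trials DESC;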

The graph database Neo4j needs no introduction. However, becoming proficient in Cypher, its query language, does require some practice and dedication. The book contains a quick Cypher walk-through, and in Chapter 4 the authors summarize three data import routines. With these basics, you can build your own KGs either on-premises or in the cloud.
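
As a small taste of what a data import can look like, here is a hedged LOAD CSV sketch that would populate the companion-plant model from earlier. The file name and column headers are my own assumptions:

// Assumed input: plants.csv with header columns plant,companion,benefit
LOAD CSV WITH HEADERS FROM 'file:///plants.csv' AS row
MERGE (p:Plant {name: row.plant})
MERGE (c:Plant {name: row.companion})
MERGE (c)-[r:COMPANION_OF]->(p)
SET r.benefit = row.benefit;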

2. The utilization

The book title only says “building”, but its content is in fact more about “using”. It collects several interesting use cases for the reader, including data fabric, graph algorithms, graph-native machine learning, data lineage, entity resolution, fraud detection, skill matching, dependency KGs, and natural language processing (NLP).

2.1 Data fabric

Among them, I am particularly fascinated by the data fabric architecture. The book describes data fabric as

“a general-purpose, organization-wide data access layer that offers a connected view of the data in the underlying systems.”

Undoubtedly, a KG can serve as an excellent data access layer. Its job is

to provide a sophisticated index to curate data across silos irrespective of how the data is physically stored in those silos as relational tables or NoSQL keys and values.

Among the potential implementations, a composite database stands out as a viable option. It is a virtual composite that “brings together multiple graph data sources and provides a single unified point of access to all of them.” Simply put, it is a graph of graphs. Chapter 5 gives us a concrete example consisting of two customer graphs: one records orders from Europe, Middle East, and Africa (EMEA) customers, and the other contains orders from Asia-Pacific (APAC) customers. The authors give us some sample code to create the composite database, but it seems to contain errors. Fortunately, they can be corrected according to the official Neo4j documentation.

// Example 5-2 should be
CREATE DATABASE db1;
CREATE DATABASE db2;
CREATE COMPOSITE DATABASE globalsales;

// The book omits the composite database name when creating the aliases
CREATE ALIAS globalsales.emeasales
FOR DATABASE db1;
CREATE ALIAS globalsales.apacsales
FOR DATABASE db2;

// Example 5-3 should be
UNWIND ['globalsales.apacsales', 'globalsales.emeasales'] AS g

Composite graphs can be used for either graph federation or data sharding. Federation accesses data in disjoint graphs, while sharding accesses distributed data in the form of a common graph partitioned across multiple databases (1). In addition, Neo4j’s APOC library allows us to query SQL databases, MongoDB, and the web. This versatility establishes Neo4j as a glue that brings multiple data sources together.
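
To see the federation side in action, here is a hedged sketch of a query against the globalsales composite, following the USE graph.byName() pattern from the Neo4j documentation. The Customer and Order labels and their properties are my own assumptions, not the book’s schema:

// Query both constituent graphs and aggregate the results in one pass
UNWIND ['globalsales.apacsales', 'globalsales.emeasales'] AS g
CALL {
  USE graph.byName(g)
  MATCH (c:Customer)-[:PLACED]->(o:Order)
  RETURN c.name AS customer, o.total AS total
}
RETURN customer, sum(total) AS grandTotal
ORDER BY grandTotal DESC;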

2.2 Natural language and KG

Figure 4. Workflow for named entity recognition and Wikidata ontology mapping with Google Cloud Natural Language API. This image is a modified version of Figure 12–6 in the book Building Knowledge Graphs by Barrasa et al. Redrawn with permission of the book author.

The book dedicates two chapters to NLP. Chapter 12 is about Semantic Search and Similarity, and Chapter 13 covers Talking to Your Knowledge Graph. The book was probably written before the rise of LLMs, because the concept is mentioned only once, in the Preface. So the chapters do not cover any of the revolutionary LLM-powered features. They do, however, teach some very good non-LLM techniques. For example, the authors introduce an automated workflow that combines Named Entity Recognition (NER) with ontology mapping (Figure 4). They first import the Wikidata ontology into the Neo4j graph. Then they send the raw texts to GCP’s Natural Language API for NER. GCP returns not only the entities but also their corresponding Wikipedia URLs. Finally, they map the entities to the Wikidata graph via URL matching.

This workflow is straightforward and elegant. In fact, we can also use our own custom ontologies, provided that we employ Wikipedia URLs as identifiers. This enables seamless linking of named entities from GCP’s NLP to these custom ontologies. The book also mentions that the ontologies can be in any of the common W3C standards: OWL, SKOS, or RDFS. Such a workflow holds immense value, as it can facilitate the rapid and autonomous expansion of our knowledge graphs.
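
The URL-matching step itself can be expressed in a few lines of Cypher. The sketch below is my own guess at the idea rather than the book’s code; the Concept and Document labels, the wikipediaUrl property, and the shape of the $entities parameter (a list of maps coming back from the NER call) are all assumptions:

// $entities is assumed to look like [{name: 'Aspirin', wikipediaUrl: 'https://en.wikipedia.org/wiki/Aspirin'}, ...]
UNWIND $entities AS ent
MATCH (c:Concept {wikipediaUrl: ent.wikipediaUrl})  // ontology node imported from Wikidata
MERGE (d:Document {id: $docId})                     // the source text being annotated
MERGE (d)-[m:MENTIONS]->(c)
SET m.surfaceForm = ent.name;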

Chapter 13 describes a graph chatbot built with spaCy. This classic method requires some intensive coding, which could complicate maintenance. And the world has changed a lot since the release of GPT-3: the new paradigm is to use LLMs and LangChain to power graph chatbots (1, 2, 3, and 4).

Conclusion

Building Knowledge Graphs arrives at a time when LLMs dominate the tech discussion. Meanwhile, user-graph interaction has also been simplified over the years, thanks to no-code platforms such as Gemini Data and SemSpect. The trend is clear: the construction and utilization of KGs are becoming easier. So do we still need to read this book?

For me, the answer is yes. First, the book stresses the importance of ontology. It implies that constructing a good KG takes more than just a bunch of API calls and Cypher queries: it demands not only programming skills but also a solid understanding of linguistics and, more importantly, domain knowledge. Second, the book shows us where and how we can use KGs. I have now learned some concrete examples of fraud detection, skills matching, dependency graphs, and data lineage.

Of course, there are plenty of topics that the authors could add in future editions. The combination of KGs and LLMs must be one of the most requested subjects. Probabilistic Bayesian KGs would also be very useful for the healthcare and financial industries.


Sixing Huang

A Neo4j Ninja and German bioinformatician at Gemini Data. I like to try things: cloud, ML, satellite imagery, Japanese, plants, and traveling the world.