All you should know as an SEO about entity types, classes & attributes
There is little information about important elements of the Google Knowledge Graph, such as entity types, classes and attributes, and little analysis of the relationships between these elements. In this post, I discuss the most important elements of entities in the context of a semantically structured index like the Knowledge Graph.
- 1 From the Entity Catalog to the Knowledge Graph
- 2 What are entity attributes?
- 3 What are the sources for the attributes?
- 4 Processing structured data for the Knowledge Graph
- 5 What are entity types and entity classes or domains?
- 6 Relationship between entity classes in ontologies
- 7 How relevant is an attribute for an entity, entity type or class?
- 8 Entity data mining is Google’s biggest challenge
From the Entity Catalog to the Knowledge Graph
Three layers serve as the foundation for the Knowledge Graph:
- Entity Catalog: All entities that have been identified over time are stored here.
- Knowledge repository: The entities are brought together in a knowledge repository with the information or attributes from the various sources. The Knowledge Repository is primarily about merging and storing descriptions and forming semantic classes or groups in the form of entity types. Google’s knowledge repository is currently the Knowledge Vault.
- Knowledge Graph: In the Knowledge Graph, attributes are added to entities and relationships between entities are established.
Within these knowledge databases, entities are the central organizational element around which all information is arranged. For example, the following information can be assigned to an entity:
- Entity types
- Social media profiles
- Media (documents, videos, audios …)
- Related entities
What are entity attributes?
Attributes describe the properties of an entity. In Wikidata, these attributes are grouped under statements. For example, the following attributes are assigned to the entity Larry Page:
- Gender: male
- Country of citizenship: United States
- First name: Larry
- Last name: Page
- Pictures of Larry Page
- Date of birth: March 26, 1973
- Birthplace: East Lansing
- Spouse: Lucinda Southworth
- Number of children: 2
- Language spoken or published: English
- Occupation: Entrepreneur, Computer Scientist, Engineer
- Employer: Google
- Public office or position held: Chief Executive Officer
- Member of: American Academy of Arts and Sciences, National Academy of Engineering
- Residence: Palo Alto
- Net worth: 30,000,000 US dollars
This is how the record then looks in Wikidata:
The following information is displayed in the Knowledge Panel:
You can see that not all information from Wikidata is displayed, and that additional information is added, for example from Wikipedia. There are also differences, such as the salary, although a reference for this statement is stored in Wikidata. The attribute education is displayed although no reference is stored to verify it. From this it can be concluded that whether an attribute appears in the Knowledge Panel does not necessarily depend on its validation.
A retrieval of the Larry Page entity via the Knowledge Graph API yields the following information:
"name": "Larry Page",
"description": "Chief Executive Officer of Alphabet",
Only the name, description, Knowledge Graph ID, an image source and link to an official Google source are provided.
The resultScore indicates how closely the respective entity matches the search query in the Knowledge Graph API. For ambiguous entity names, it decides which Knowledge Panel is prioritized for entity-related search queries. For example, there is also an entity Larry Page that represents a singer, but it has a lower resultScore.
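As a rough illustration of such a retrieval, the following Python sketch builds a request URL for the Knowledge Graph Search API and picks the highest-scoring entity from a response. The endpoint and the field names itemListElement, result and resultScore follow Google's public API documentation. The helper names, the placeholder API key and the sample response (including the scores and the singer's description) are made up for illustration, and no network request is sent.

```python
from urllib.parse import urlencode

# Endpoint of the Knowledge Graph Search API.
KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"


def build_search_url(query, api_key, limit=5):
    """Build the request URL for an entity search (no request is sent here)."""
    params = {"query": query, "key": api_key, "limit": limit}
    return KG_ENDPOINT + "?" + urlencode(params)


def best_match(response):
    """Return (name, description, resultScore) of the highest-scoring entity."""
    top = max(response["itemListElement"], key=lambda item: item["resultScore"])
    result = top["result"]
    return result["name"], result.get("description"), top["resultScore"]


# Made-up sample response in the documented shape; all values are illustrative.
sample_response = {
    "itemListElement": [
        {"result": {"name": "Larry Page", "description": "Singer"},
         "resultScore": 12.5},
        {"result": {"name": "Larry Page",
                    "description": "Chief Executive Officer of Alphabet"},
         "resultScore": 830.0},
    ]
}

url = build_search_url("Larry Page", "YOUR_API_KEY", limit=2)
name, description, score = best_match(sample_response)
print(url)
print(name, "|", description, "|", score)
```

With a real API key, the URL could be fetched with any HTTP client and the parsed JSON passed to best_match in the same way.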
What are the sources for the attributes?
Google can get the information about the entities and their relationships to each other from the following sources:
Sources for unstructured data
Sources from which Google can theoretically extract unstructured entity information are:
- Normal web pages via crawling and Natural Language Processing
- Search queries via Natural Language Processing
- Unstructured databases and datasets
The Knowledge Vault plays a special role here. More on this in my article How Google can identify and interpret entities from unstructured content?
Sources for semi-structured data
Google can get semi-structured information from encyclopedias like Wikipedia, which have a systematic structure. More about this in my article How does Google process information from Wikipedia for the Knowledge Graph?
Sources for structured data
Google can take structured data directly from semantic databases and datasets, e.g. via an API, and use it for the Knowledge Graph. The following sources are possible for this purpose:
- Wikidata (which took over data from Freebase)
- Google My Business
- CIA World Factbook
- Websites with structured data via Microdata, RDFa and JSON-LD
- Licensed data
- Data sets
- ClueWeb09 to ClueWeb12
- Common Crawl
- KBA Stream Corpus
Processing structured data for the Knowledge Graph
Google's first choice for information about entities is sources that provide structured data.
In this post, I will only deal with this type of data source. The much more complex methodology of extracting information from unstructured and semi-structured data, such as Wikipedia, is covered in the articles mentioned above on my blog.
Google can capture structured data via the Resource Description Framework, or RDF for short. An entity is a collection of different RDF statements that follow the pattern subject-predicate-object. For example, one such statement would be “Canberra is the capital of Australia.”
This relationship can also be described grammatically: Canberra is the subject, Australia is the object, and “is the capital of” is the predicate. The relationship type can also be expressed by a verb, as in “Thomas Müller plays for Bayern Munich.” In both examples, subject and object are entities; in RDF, the object can also be a literal value such as a date. The predicate can be an entity type or class, an attribute, a verb, or a combination of these.
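The three roles named above (subject, predicate, object) can be sketched in a few lines of Python. This is only a toy in-memory model of the two example statements, not real RDF tooling; the names Triple and statements_about are made up for illustration.

```python
from collections import namedtuple

# One statement consists of a subject, a predicate and an object.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# The two example statements from the text.
triples = [
    Triple("Canberra", "is capital of", "Australia"),
    Triple("Thomas Müller", "plays for", "Bayern Munich"),
]


def statements_about(entity, store):
    """Collect all triples in which the entity appears as subject or object."""
    return [t for t in store if entity in (t.subject, t.object)]


for t in statements_about("Australia", triples):
    print(f"{t.subject} --{t.predicate}--> {t.object}")
```

Seen this way, an entity is simply everything that can be collected about it from the set of statements.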
Most structured databases provide the information in a machine-readable RDF format or allow translation into this format. Google accesses databases it trusts, such as Wikidata or the CIA World Factbook, as well as structured datasets and translation databases such as DBpedia or YAGO, which translate Wikipedia information into machine-readable data.
Since structured databases and datasets grow and update relatively slowly, it is not surprising that Google keeps encouraging webmasters to use structured data on their websites. The more structured data Google collects and processes, the closer it gets to being able to process unstructured data as well. The structured data works as training data for machine learning.
What are entity types and entity classes or domains?
In various Google patents you can find the terms entity types and entity classes or domains. Entities of a certain entity type or domain share a similar set of attributes and thus form a group. For example, the domain “person” or “human” can always be assigned attributes such as place of birth, place of residence, date of birth … This clearly defines the domain and the associated entity types.
An entity type or domain describes a group of entities that can be described by similar attributes. In the above example of Larry Page, an entity type could be CEO or entrepreneur.
In the very good book Entity Oriented Search by Krisztian Balog you can find the following description for entity types:
Entities may be categorized into multiple entity types (or types for short). Types can also be thought of as containers (semantic categories) that group together entities with similar properties. An analogy can be made to object oriented programming, whereby an entity of a type is like an instance of a class.
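The object-oriented analogy from the quote can be made concrete with a small Python sketch. The class Person and its field names are assumptions chosen for illustration; the values come from the Larry Page attributes listed earlier.

```python
from dataclasses import dataclass


@dataclass
class Person:
    """Entity type: a container for entities that share these attributes."""
    name: str
    date_of_birth: str
    birthplace: str


# An entity of the type "person" is like an instance of the class Person.
larry_page = Person(
    name="Larry Page",
    date_of_birth="March 26, 1973",
    birthplace="East Lansing",
)

print(type(larry_page).__name__)
print(larry_page.birthplace)
```

The class fixes which attributes are expected, while each instance supplies the concrete values for one entity.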
Relationship between entity classes in ontologies
There are databases such as YAGO or the DBpedia Ontology that represent relationships between entity classes or entity types. The DBpedia Ontology is based on Wikipedia. In the following excerpt from the DBpedia Ontology, entity types (rounded rectangles) are related to parent entity classes via ascending arrows. For example, the entity types athlete and racer are related to the entity class “person”. Type- and class-associated attributes are shown with dashed arrows.
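A class hierarchy of this kind can be sketched as a simple parent lookup. The following Python example is a simplified illustration, not the actual DBpedia Ontology: the parent relations for athlete and racer follow the description above, while the attribute names and the helper functions are invented for the example.

```python
# Simplified excerpt of a class hierarchy: entity type -> parent class.
PARENT = {"athlete": "person", "racer": "person", "person": None}

# Invented attributes attached to each class or type.
ATTRIBUTES = {
    "person": ["date of birth", "place of birth"],
    "athlete": ["team"],
    "racer": ["races won"],
}


def ancestors(entity_type):
    """Return the type itself followed by all of its parent classes."""
    chain = []
    while entity_type is not None:
        chain.append(entity_type)
        entity_type = PARENT.get(entity_type)
    return chain


def all_attributes(entity_type):
    """Collect attributes of the type plus those inherited from parent classes."""
    return [attr for cls in ancestors(entity_type) for attr in ATTRIBUTES.get(cls, [])]


print(ancestors("athlete"))
print(all_attributes("athlete"))
```

In this model an athlete carries its own type-level attributes and additionally everything that is defined for the parent class person.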
How relevant is an attribute for an entity, entity type or class?
By weighting the attributes per entity, Google can determine how relevant a certain attribute is for an entity. On the other hand, Google could also use this to determine the relevance of the entity for a search query made for this attribute.
Sources: Google Patent US9047278B1
The Google patent Identifying and ranking attributes of entities shows one approach to how something like this could work.
According to this patent, attributes can be assigned to entities and weighted based on the search term combinations that users enter.
One innovative aspect of the subject matter described in this specification is embodied in methods that include the actions of: identifying queries in query data; determining, in each of the queries, (i) an entity-descriptive portion that refers to an entity and (ii) a suffix; determining a count of a number of times the one or more queries were submitted; estimating, based on the count, an entity-level count of query submissions that include the particular suffix and are considered to refer to a first entity; determining that the entity is a particular type of entity; determining a type-level count of the query submissions that include the first suffix and are estimated to refer to entities of the particular type of entity; and assigning a score to the particular suffix based on the entity-level count and the type-level count.
Using this method, Google could determine which information about entities of a certain entity type is displayed in the Knowledge Panel. Furthermore, in the case of ambiguous statements, it would be possible to determine which attribute is the most relevant.
Here is an example based on the Larry Page entity from above:
Larry Page is an entrepreneur, a computer scientist, and an engineer. Which of these three statements is the most relevant or accurate?
The more people search for “Larry Page entrepreneur”, the more applicable the attribute “entrepreneur”.
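The counting logic from the patent excerpt can be sketched roughly as follows in Python. The patent text quoted above only says that the score is based on the entity-level count and the type-level count; the concrete formula used here (entity-level share multiplied by type-level share) is an assumption, and the query log, the counts and all function names are made up for illustration.

```python
from collections import Counter

# Made-up query log: (query, number of submissions).
QUERY_LOG = [
    ("larry page entrepreneur", 50),
    ("larry page computer scientist", 20),
    ("larry page engineer", 10),
    ("sergey brin entrepreneur", 30),
    ("sergey brin engineer", 10),
]

# Made-up entity catalog: entity name -> entity type.
ENTITY_TYPES = {"larry page": "person", "sergey brin": "person"}


def split_query(query):
    """Split a query into (entity, suffix) if it starts with a known entity."""
    for entity in ENTITY_TYPES:
        if query.startswith(entity + " "):
            return entity, query[len(entity) + 1:]
    return None


def rank_suffixes(entity):
    """Rank suffixes for an entity by an assumed entity-level x type-level score."""
    entity_type = ENTITY_TYPES[entity]
    entity_counts, type_counts = Counter(), Counter()
    for query, count in QUERY_LOG:
        parsed = split_query(query)
        if parsed is None:
            continue
        name, suffix = parsed
        if ENTITY_TYPES[name] == entity_type:
            type_counts[suffix] += count      # type-level count
        if name == entity:
            entity_counts[suffix] += count    # entity-level count
    entity_total = sum(entity_counts.values())
    type_total = sum(type_counts.values())
    scores = {
        suffix: (count / entity_total) * (type_counts[suffix] / type_total)
        for suffix, count in entity_counts.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranking = rank_suffixes("larry page")
for suffix, score in ranking:
    print(f"{suffix}: {score:.3f}")
```

With these invented counts, “entrepreneur” ranks first for Larry Page because it is searched most often both for this entity and for entities of the type person in general.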
Entity data mining is Google’s biggest challenge
What I took away from the research and thinking behind this post is that Google’s biggest challenge regarding the Knowledge Graph is extracting information or attributes about entities, as well as entity types and classes, from unstructured data sources alone. The Knowledge Graph is currently still very incomplete because the aforementioned structured data sources cover only a small part of all entities in the real world.
This gave me a reason to write more articles about mining information about entities for the Knowledge Graph:
- Natural language processing to build a semantic database
- How does Google process information from Wikipedia for the Knowledge Graph?
- How Google can identify and interpret entities from unstructured content?