This document describes in detail the language used to retrieve entity identifiers from the ENS.
| Title | Entity ID request language |
|---|---|
| Creation Date | Tuesday, October 27, 2009 |
| Last Modification | Friday, September 16, 2011 |
| Feedback & Questions | bortoli AT okkam.it |
To perform its operations, the OKKAM Entity Name System (ENS) maintains a large Entity Store with a unique and permanent ID for each entity (Entity ID, or EID for short) and an identification profile, which is used for retrieving the EID. This Entity Store currently contains around 6.5 million entities, including people and organizations from Wikipedia and a large number of geographical entities (e.g., countries, cities, roads, mountains, rivers) extracted from GeoNames. These entities were extracted using the Cogito Extractor and follow the entity representation model described here.
An EID can be retrieved from the Entity Store through the EID Search method. EID Search receives a request which contains some description of the relevant entity (see examples below) and returns a ranked list of EIDs (together with their identification profile). This ranking is based on the ENS matching methods, and the order reflects the computed similarity between the request and the available entity profiles in the ENS. The following paragraphs provide a practical explanation of how to create EID requests.
Important remark: an EID request is to be viewed as a request to find the entity which corresponds to the given description, and not the set of entities which match some property. This is because the ENS is not a data provider on entities, but a system for sharing and interlinking web identifiers.
Simple Overview for Request Composition
An EID request is a collection of name-value attributes that describe the intended entity. For example, some attributes for the famous Nobel Prize-winning physicist Albert Einstein could be: name="Albert Einstein", award="Nobel prize", field="physics". The attribute name can also be empty, in which case the attributes could be: "Albert Einstein", "Nobel prize", "physicist".
For creating requests with only attribute values, use the following format:
- value1 value2 value3 ... valuen
If attribute names are also known, they can be included in the request as follows:
- name1="value1" name2="value2" ... namen="valuen"
In addition, an EID request can contain both named and unnamed attributes:
- name1="value1" ... namek="valuek" valuek+1 ... valuek+m
When the entity's type is also present, the request is:
- QUERY { name1="value1" ... namek="valuek" valuek+1 ... valuek+m } METADATA { entityType=X },
where X denotes the entity type. ENS directly supports the following entity types: (1) person, (2) location, (3) organization, (4) artifact, (5) artifact type, (6) event, and (7) other. The last type is used when the entity does not fit any of the first six types, for example when it is a protein or an animal.
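The formats above can be assembled mechanically. The following is a minimal sketch, not part of the ENS itself: the function name `compose_request` and its parameters are hypothetical, named values are always double-quoted, and keywords are passed through unchanged. The type "artifact type" is left out of the check because the source does not show how it is spelled inside a request.

```python
# Hypothetical helper; names and signature are illustrative assumptions.

# Single-word entity types listed in this document ("artifact type" omitted,
# since its spelling inside METADATA { ... } is not shown here).
ENTITY_TYPES = {"person", "location", "organization", "artifact", "event", "other"}


def _quote(value):
    """Wrap a value in double quotes, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def compose_request(named=None, keywords=(), entity_type=None):
    """Build an EID request string from named attributes and bare keywords."""
    parts = [f"{name}={_quote(value)}" for name, value in (named or {}).items()]
    parts.extend(keywords)  # keywords are assumed to be well-formed already
    query = " ".join(parts)
    if entity_type is None:
        return query
    if entity_type not in ENTITY_TYPES:
        raise ValueError(f"unsupported entity type: {entity_type!r}")
    return f"QUERY {{ {query} }} METADATA {{ entityType={entity_type} }}"


print(compose_request({"full_name": "Albert Einstein"}, ["Nobel", "prize"]))
print(compose_request({"name": "Amsterdam", "country": "Nederland"},
                      entity_type="location"))
```

The second call produces the same string as the Amsterdam example in section A.3 below.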
In addition, an EID request can select the matching module to be used. This is done as follows:
- QUERY { ... } METADATA { matchingModule=X },
where X denotes the matching module. The currently available matching modules are:
- matchingModule=s: The matching is performed by the algorithm of the storage layer (described here).
- matchingModule=g: Implementation of the matching algorithm suggested in the "Group Linkage" publication.
- matchingModule=gl: This module is an extension of the original group linkage algorithm. It implements a full probabilistic semantics for the considered query language, and its query processing uses domain-specific similarity functions.
- matchingModule=jm: The matching is performed using the Jolly Matcher described here.
- matchingModule=fbem: The matching is performed using the Feature Based Entity Matcher described here.
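As a small illustration, a client could check the module code before sending a request. This is only a sketch: the function name `with_matching_module` is hypothetical, and it covers just the single-key METADATA form shown above.

```python
# Module codes taken from the list above; the helper itself is hypothetical.
MATCHING_MODULES = {
    "s": "storage layer algorithm",
    "g": "Group Linkage",
    "gl": "extended, probabilistic group linkage",
    "jm": "Jolly Matcher",
    "fbem": "Feature Based Entity Matcher",
}


def with_matching_module(query_body, module):
    """Wrap a query body in QUERY { ... } METADATA { matchingModule=X }."""
    if module not in MATCHING_MODULES:
        known = ", ".join(sorted(MATCHING_MODULES))
        raise ValueError(f"unknown matching module {module!r}; expected one of: {known}")
    return f"QUERY {{ {query_body} }} METADATA {{ matchingModule={module} }}"


print(with_matching_module('name="Don Henderson"', "gl"))
```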
Request Examples
A. Requests with the entity's main description
A.1. Only Keywords
- Don Henderson
- Jacques-Yves Cousteau
- John Zachary Young
- William J. Brennan
- Amsterdam
- Microsoft Corporation
- Opel Zafira
A.2. Attribute-Value Pairs
- name="Don Henderson"
- name="Jacques-Yves Cousteau" OR name="Jacques Cousteau"
- name="Judd Gregg"
- first_name=Robert last_name=Mitchum
- name="Amsterdam" country="Nederland"
- company_name="Microsoft Corporation"
- model="zafira" maker="opel"
- name="John Zachary Young" OR name="John Young"
- name="William J. Brennan" OR name="William Brennan"
Individual conditions can be combined by AND and OR. Both are interpreted probabilistically: For a request A AND B, an entity matching A AND B will be ranked significantly higher than any entity matching only one of A or B. For a request A OR B, entities matching only one of A or B will typically match as well as an entity matching A AND B.
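To give an intuition for this probabilistic reading, the sketch below combines two per-condition match probabilities under an independence assumption: a product for AND and a noisy-OR for OR. This is one common way to model such operators and is an assumption made for illustration only; it is not the scoring formula used by ENS, and the probabilities are invented.

```python
def score_and(p_a, p_b):
    """Both conditions should hold: multiply the two match probabilities."""
    return p_a * p_b


def score_or(p_a, p_b):
    """At least one condition should hold: noisy-OR of the two probabilities."""
    return 1.0 - (1.0 - p_a) * (1.0 - p_b)


# Invented match probabilities for conditions A and B on two entities.
matches_both = (0.9, 0.9)    # matches A and B well
matches_only_a = (0.9, 0.1)  # matches A well, B poorly

print("A AND B:", score_and(*matches_both), score_and(*matches_only_a))
print("A OR B: ", score_or(*matches_both), score_or(*matches_only_a))
```

With these numbers, the entity matching both conditions scores far higher under AND, while under OR the two entities end up with nearly the same score.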
A.3. Attribute-Value Pairs and Entity Type
- QUERY { acronyms="UNICEF" } METADATA { entityType=organization }
- QUERY { name="United Nations" } METADATA { entityType=organization }
- QUERY { first_name=Robert last_name=Mitchum } METADATA { entityType=person }
- QUERY { name="John Zachary Young" OR name="John Young" } METADATA { entityType=person }
- QUERY { name="Amsterdam" country="Nederland" } METADATA { entityType=location }
B. Requests with additional attributes
B.1. Only Keywords
- Don Henderson British actor
- Jacques-Yves Cousteau French explorer
- Robert Mitchum American actor
- John Zachary Young British biologist
- William J. Brennan U.S. Supreme Court Justice
B.2. Attribute-Value for Name, other attributes as Keywords, and Entity Type
- QUERY { acronyms="UNICEF" "Children's Fund" } METADATA { entityType=organization }
- QUERY { name="United Nations" association politics } METADATA { entityType=organization }
- QUERY { ( name="Jacques-Yves Cousteau" OR name="Jacques Cousteau" ) French explorer } METADATA { entityType=person }
- QUERY { first_name=Robert last_name=Mitchum American actor } METADATA { entityType=person }
- QUERY { ( name="William J. Brennan" OR name="William Brennan" ) U.S. Supreme Court Justice } METADATA { entityType=person }
C. Requests with mixed attributes
- full_name="Albert Einstein" Nobel prize physicist
- QUERY { United Nations Children Fund acronyms="unicef" } METADATA { entityType=organization }
- QUERY { name="Jacques-Yves Cousteau" French explorer } METADATA { entityType=person }
- QUERY { name="Robert Mitchum" American actor } METADATA { entityType=person }
D. Requests with typos
- einshtein
Consider the above request. ENS will return an entity having some attribute whose value is similar enough to "einshtein". Since the attribute value does not have to match the request perfectly, there is still a chance that the famous physicist Albert Einstein is returned. However, if ENS considers the attribute value and the request too different, it might return no entity. This could be the case for the request
- ainschteyn
since the number of spelling mistakes makes it difficult for ENS to guess that Albert Einstein was meant.
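The difference between the two misspellings can be made concrete with a generic string similarity measure. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as a stand-in; ENS uses its own matching methods, and the threshold value is a hypothetical cut-off chosen only for this example.

```python
from difflib import SequenceMatcher

# Hypothetical cut-off; ENS's real acceptance criteria are not described here.
THRESHOLD = 0.8


def similarity(a, b):
    """Case-insensitive similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similar_enough(request_value, stored_value, threshold=THRESHOLD):
    """True if the request value is close enough to the stored value."""
    return similarity(request_value, stored_value) >= threshold


for request_value in ("einshtein", "ainschteyn"):
    score = similarity(request_value, "einstein")
    print(request_value, round(score, 3), similar_enough(request_value, "einstein"))
```

Under this stand-in measure, "einshtein" stays above the cut-off while "ainschteyn" falls below it.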
- sarname=einshtein
Consider now the above request with one name-value attribute. ENS will return an entity having some attribute whose name is similar enough to "sarname" and whose value is similar enough to "einshtein". Again, there is a chance that Albert Einstein is returned, since his "surname" attribute (whose name is sufficiently similar to "sarname") has a value sufficiently similar to "einshtein". ENS tries to find a trade-off between attribute name and attribute value similarities, which means that the following queries might have the same probability of returning Albert Einstein (which is in any case higher than for the doubly misspelled request sarname=einshtein above).
- sarname=einstein
- surname=einshtein
- first_name=albert einstein
For the above request, ENS will return an entity having (i) some attribute whose name is similar enough to "first_name" and whose value is similar enough to "albert", and (ii) some attribute whose value is similar enough to "einstein". Again, ENS tries to find a trade-off between how well the two subqueries
- first_name=albert
- einstein
are matched, which means that the following queries might have the same probability of returning Albert Einstein (which is in any case lower than for the correctly spelled request first_name=albert einstein above).
- frst_name=albert einstein
- first_name=albrt einstein
- first_name=albert einshtein
- einstein OR einshtein
ENS will return an entity having (i) some attribute whose value is similar enough to "einstein", or (ii) some attribute whose value is similar enough to "einshtein". On the one hand, this means that the returned entity does not necessarily have both "einstein" and "einshtein" as attribute values. On the other hand, entities having both "einstein" and "einshtein" as attribute values will be considered better matches. For this reason, Albert Einstein is more likely to be returned for the following request than, e.g., Alfred Einstein or Albert Schweitzer:
- albert OR einstein
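The trade-off between attribute name similarity and attribute value similarity described in this section can be illustrated with a toy score. The sketch below multiplies the two similarities, again using `difflib.SequenceMatcher` as a stand-in; both the measure and the product rule are assumptions made for illustration, not the ENS algorithm.

```python
from difflib import SequenceMatcher


def sim(a, b):
    """Case-insensitive similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pair_score(request_name, request_value, attr_name, attr_value):
    """Toy trade-off: name similarity multiplied by value similarity."""
    return sim(request_name, attr_name) * sim(request_value, attr_value)


# Attribute assumed to be stored in the profile of Albert Einstein.
stored_name, stored_value = "surname", "einstein"

requests = [
    ("sarname", "einshtein"),  # typo in both name and value
    ("sarname", "einstein"),   # typo in the name only
    ("surname", "einshtein"),  # typo in the value only
]
for name, value in requests:
    score = pair_score(name, value, stored_name, stored_value)
    print(f"{name}={value}", round(score, 3))
```

In this toy model the two requests with a single typo receive similar scores, and both score higher than the request with typos in both the name and the value.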
Special characters
Roughly speaking, attribute names and values can contain alphanumeric characters as well as hyphens (-) and dots (.). The following list shows well-formed attribute names and values.
- abcde
- 12345
- .-.-.
If the attribute name/value to be sought contains other symbols, the whole attribute name/value must be double quoted ("). The following list shows well-formed attribute names and values.
- ","
- "'"
- "$"
Double quotes and backslashes (\) within a double-quoted string must be escaped by prefixing them with a backslash. The following list shows well-formed attribute names and values.
- "This is a double quote: \""
- "This is a backslash: \\"
- "\"\\\"\\"
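The quoting and escaping rules above can be applied automatically when building a request. The following is a minimal sketch; the function name `quote_token` is hypothetical. It treats the underscore as a safe character in addition to the characters listed above, an assumption based on unquoted names such as first_name in the examples, and it quotes everything else, including values that contain spaces.

```python
import re

# Characters assumed safe without quoting: ASCII letters, digits, dots,
# hyphens, and (inferred from names like first_name) underscores.
_BARE_TOKEN = re.compile(r"[A-Za-z0-9._-]+")


def quote_token(text):
    """Return text unchanged if it is a bare token, otherwise double-quote it."""
    if _BARE_TOKEN.fullmatch(text):
        return text
    # Escape backslashes first so the backslashes added for quotes stay intact.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


for raw in ["abcde", ".-.-.", ",", "Albert Einstein", 'This is a double quote: "']:
    print(quote_token(raw))
```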





