Below we summarize what an entity-based approach enables and describe two applications we are currently developing.
Functionality enabled by an Entity-based approach
Link entities to information about them in an unambiguous manner. Interesting for:
- Content producers at authoring time, to have more info on what they are writing
- Content consumers after article publication, to have more information on the subject discussed in the article
Retrieve entities and information about entities from different sources. Interesting for all Information Retrieval applications
Possible applications
- Automatic extraction and annotation of entities in text (Example I: a news portal)
- Interactive extraction and annotation of entities in text (Example II: scientific article authoring)
Example I: News portal
Overview
Goal: offer background knowledge to the viewer and pointers to relevant news
Focus: entities contained in news, namely people, organizations and locations
User interaction scenario: the user can browse a news item, click on a particular entity and get a pop-up containing an entity profile, plus references to other news on the same entity
Input/Output
Source of the process: ANSA English news archive
Source news encoding: NewsML v.2
Output: annotated news on online portal
Output encoding: XHTML + RDFa annotation
Technology
Pipeline for news items:
- Text analysis detects the entities and associates each one with its OKKAM id (NewsML -> annotated NewsML)
- XSLT transformation from annotated NewsML to XHTML + RDFa
- Submission of the pages to Sindice, the semantic search engine, which extracts the RDFa information and indexes the page
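The second step of the pipeline can be illustrated with a minimal sketch. It is not the project's XSLT stylesheet: it uses plain Python string assembly instead of XSLT, and the `Annotation` record, the `to_rdfa` function and the example URI are all hypothetical names introduced here. It only shows the general idea of wrapping each detected mention in an element that carries the RDFa attributes `about` and `typeof`.

```python
from html import escape
from typing import NamedTuple, Sequence


class Annotation(NamedTuple):
    """Hypothetical record for one detected entity mention."""
    start: int       # character offset where the mention begins
    end: int         # character offset just past the mention
    entity_uri: str  # placeholder for the entity identifier
    rdf_type: str    # e.g. a CURIE such as "ex:Location" (assumed prefix)


def to_rdfa(text: str, annotations: Sequence[Annotation]) -> str:
    """Wrap each annotated mention in a <span> with RDFa attributes."""
    parts = []
    cursor = 0
    for ann in sorted(annotations, key=lambda a: a.start):
        if ann.start < cursor:
            raise ValueError("overlapping annotations are not supported")
        # Plain text between the previous mention and this one.
        parts.append(escape(text[cursor:ann.start]))
        parts.append(
            '<span about="{uri}" typeof="{typ}">{label}</span>'.format(
                uri=escape(ann.entity_uri, quote=True),
                typ=escape(ann.rdf_type, quote=True),
                label=escape(text[ann.start:ann.end]),
            )
        )
        cursor = ann.end
    # Remaining text after the last mention.
    parts.append(escape(text[cursor:]))
    return "<p>" + "".join(parts) + "</p>"
```

In a real page the prefix used in `typeof` would also have to be declared in the enclosing XHTML document, which this sketch leaves out.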
Resources for developers
The entity extraction/annotation pipeline, available here
Documentation available here
Sindice: search engine indexing sources containing RDF, RDFa or Microformats annotations.
Sigma: live, embeddable information summaries from sites which use RDF, RDFa or Microformats.
Example II: Scientific article authoring
Overview
Goal: unambiguously detect entities in scientific articles in order to:
- Link this information to background knowledge for the author (at authoring time) and reader (after publication)
- Make information contained in articles unambiguously searchable via the identifier
Focus: entities contained in scientific articles, e.g. biological entities and authors
User interaction scenario: an authoring tool (a Word plugin) analyzes the text, detects candidate entities and lets the user mark up entities with the correct ids.
Input/Output
Source of the process: articles being authored by an author
Source article encoding: Word document (or any format Word can edit)
Output: annotated articles
Output encoding: in-document annotations (separate section, Word comments), or export to XML and CSV
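As an illustration of the XML and CSV export option, here is a minimal sketch using only the Python standard library. The column names, the XML element names and the identifier value in the usage note are assumptions made for this example; the plugin's actual export schema is not described here.

```python
import csv
import io
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical column layout for one annotated entity.
FIELDS = ["mention", "type", "identifier"]


def export_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize the annotations as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def export_xml(rows: List[Dict[str, str]]) -> str:
    """Serialize the annotations as a flat XML document."""
    root = ET.Element("annotations")
    for row in rows:
        node = ET.SubElement(root, "entity", type=row["type"], id=row["identifier"])
        node.text = row["mention"]
    return ET.tostring(root, encoding="unicode")
```

For example, calling both functions on `[{"mention": "p53", "type": "protein", "identifier": "EXAMPLE-001"}]` yields one CSV data row and one `<entity>` element for the same annotation, where `EXAMPLE-001` is a placeholder rather than a real identifier.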
Technology
The user interacts with a Word plugin that:
- Sends text from the article to one or more Named Entity Recognition (NER) services
- For each entity detected by the NER service, retrieves the relevant identifiers (e.g. UniProt for proteins, OKKAM ids for authors)
- Presents the user with a list of candidate ids and related information, so that the user can assign the correct id to each entity
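The three steps above can be sketched as a small control loop. This is only an outline under stated assumptions: the NER call, the identifier lookup and the user's choice are passed in as plain functions, and every name and identifier below (`assign_identifiers`, `Candidate`, the `demo_*` stubs, the `EXAMPLE-...` ids) is hypothetical and stands in for the real services and the plugin's user interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Candidate:
    identifier: str   # placeholder id, not a real UniProt or OKKAM id
    description: str  # related information shown to the user


def assign_identifiers(
    text: str,
    recognize: Callable[[str], List[str]],
    lookup: Callable[[str], List[Candidate]],
    choose: Callable[[str, List[Candidate]], Optional[Candidate]],
) -> Dict[str, str]:
    """Run detection, candidate lookup and user selection for each mention."""
    assignments: Dict[str, str] = {}
    for mention in recognize(text):          # step 1: NER service(s)
        candidates = lookup(mention)         # step 2: candidate identifiers
        if not candidates:
            continue                         # nothing to offer the user
        chosen = choose(mention, candidates) # step 3: the user decides
        if chosen is not None:
            assignments[mention] = chosen.identifier
    return assignments


# Stub implementations, for demonstration only.
def demo_recognize(text: str) -> List[str]:
    known = ["p53", "J. Smith"]
    return [mention for mention in known if mention in text]


def demo_lookup(mention: str) -> List[Candidate]:
    table = {
        "p53": [Candidate("EXAMPLE-PROTEIN-1", "protein (placeholder record)")],
        "J. Smith": [
            Candidate("EXAMPLE-AUTHOR-1", "author (placeholder record A)"),
            Candidate("EXAMPLE-AUTHOR-2", "author (placeholder record B)"),
        ],
    }
    return table.get(mention, [])


def pick_first(mention: str, candidates: List[Candidate]) -> Optional[Candidate]:
    # Stands in for the user's selection in the plugin's candidate list.
    return candidates[0]
```

Keeping the three steps as injected functions means a different NER service or identifier source can be swapped in without changing the loop, and the `choose` callback is the single point where the user's judgment enters.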
Resources for developers
The Word plugin, available here with documentation
Sindice: search engine indexing sources containing RDF, RDFa or Microformats annotations.
Sigma: live, embeddable information summaries from sites which use RDF, RDFa or Microformats.





