Semantic Enrichment

What is this article about? This paragraph? This phrase? A common thread that we see with publishers concerns the task of making the content of an article, chapter or blog post understandable not just to a human being, but to a computer. This process, known as content enrichment, involves adding enough annotative metadata to a particular selection to identify people, places, things, concepts, events and the like, and perhaps even to add enough to pull out relationships and summaries.

Most enrichment tools, especially those involved with taxonomy or entity extraction, identify typology – “Johnny Depp” is a person, while “Los Angeles” is a city. This entity enrichment process is often useful for building up taxonomies from content, in effect building dictionaries of terms. Entity enrichment on a sample such as:

<p>Johnny Depp is an actor working in Los Angeles. He's appeared in a number of roles, including Edward Scissorhands and Captain Jack Sparrow in the Pirates of the Caribbean movies. </p>

would be rendered as:

<p><span role="person">Johnny Depp</span> is an <span role="profession">actor</span> working in <span role="city">Los Angeles</span>.</p>

An application such as MarkLogic could then run a query to find all elements that have a @role attribute and turn these into dictionaries. However, as it turns out, this kind of enrichment tends to have limited utility beyond this dictionary-building process, because there's no real sense of context. The system knows that "Johnny Depp" is a string in the "person" bucket, and that's about the extent of it.
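The dictionary-building step itself is simple enough to sketch outside MarkLogic. The following is a minimal Python illustration (not the MarkLogic query) of collecting every @role-tagged term from an enriched fragment into per-role dictionaries:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The enriched fragment from the example above.
markup = ('<p><span role="person">Johnny Depp</span> is an '
          '<span role="profession">actor</span> working in '
          '<span role="city">Los Angeles</span>.</p>')

# Bucket each tagged term by its @role value.
dictionaries = defaultdict(set)
for span in ET.fromstring(markup).iter("span"):
    if span.get("role"):
        dictionaries[span.get("role")].add(span.text)

print(dict(dictionaries))
# {'person': {'Johnny Depp'}, 'profession': {'actor'}, 'city': {'Los Angeles'}}
```

The result is exactly what the paragraph above describes: buckets of strings, with nothing connecting "Johnny Depp" to anything else.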

Semantic enrichment goes the other way, and is in many respects considerably more powerful. For instance, suppose that you had an RDF record about Johnny Depp (here written in Turtle, for brevity):

person:Johnny_Depp rdf:type class:Person;
        person:profession profession:Actor;
        person:birthDate "1963-06-09"^^xs:date;
        person:nameProfessional "Johnny Depp";
        person:nameFirstLast "John Christopher Depp II";
        person:nameLastFirst "Depp, John Christopher II";
        person:bio """John Christopher "Johnny" Depp II (born June 9, 1963) is an American actor, film producer, and musician. He has won the Golden Globe Award and Screen Actors Guild Award for Best Actor. Depp rose to prominence on the 1980s television series 21 Jump Street, becoming a teen idol. Dissatisfied with that status, Depp turned to film for more challenging roles; he played the title character of the acclaimed Edward Scissorhands (1990) and later found box office success in films such as Sleepy Hollow (1999), Charlie and the Chocolate Factory (2005), Alice in Wonderland (2010), Rango (2011) and the Pirates of the Caribbean film series (2003–present). He has collaborated with director and friend Tim Burton in eight films; the most recent being Dark Shadows (2012).""".

film:Edward_Scissorhands rdf:type class:Film;
        film:title "Edward Scissorhands";
        film:prefix "EDWSCIS";
        film:character character:EDWSCIS.Edward_Scissorhands, character:EDWSCIS.Kim;
        film:releaseDate "1990-12-14"^^xs:date.

character:EDWSCIS.Edward_Scissorhands rdf:type class:Character;
        character:name "Edward Scissorhands";
        character:actor person:Johnny_Depp.
character:EDWSCIS.Kim rdf:type class:Character;
        character:name "Kim";
        character:actor person:Winona_Ryder.

film:POTC.Curse_Of_The_Black_Pearl rdf:type class:Film;
        film:title "Pirates of the Caribbean: Curse of the Black Pearl";
        film:prefix "POTC-COTBP";
        film:character character:POTC.Jack_Sparrow;
        film:franchise franchise:POTC;
        film:releaseDate "2003-07-09"^^xs:date.
franchise:POTC rdf:type class:Franchise;
        franchise:title "Pirates of the Caribbean".
character:POTC.Jack_Sparrow rdf:type class:Character;
       character:name "Jack Sparrow";
       character:actor person:Johnny_Depp.

Listing 1. Portion of Turtle Data For Application Data Set.

There’s a lot of information here, and it’s more than just one person’s bio. It’s worth noting that the prefixes given above correspond to namespaces; the prefix film:, for instance, would expand to a full namespace URI defined by the application.

Films have characters, characters are played by actors, and actors are themselves people. This information is contained in a graph, rather than being stored hierarchically, with “classes” within that graph – characters, people, franchises, films and many other class types – each interacting with the others.
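To make the graph nature of this concrete, here is a small Python sketch (using plain strings as shorthand for the CURIEs above) that walks film → character → actor across a set of triples:

```python
# The same facts, held as (subject, predicate, object) triples
# rather than a hierarchy.
triples = {
    ("film:Edward_Scissorhands", "film:character", "character:EDWSCIS.Edward_Scissorhands"),
    ("film:Edward_Scissorhands", "film:character", "character:EDWSCIS.Kim"),
    ("character:EDWSCIS.Edward_Scissorhands", "character:actor", "person:Johnny_Depp"),
    ("character:EDWSCIS.Kim", "character:actor", "person:Winona_Ryder"),
}

def objects(s, p):
    """All objects o such that (s, p, o) is asserted."""
    return {o for (s2, p2, o) in triples if s2 == s and p2 == p}

# Walk film -> characters -> actors: who appears in Edward Scissorhands?
cast = {actor
        for ch in objects("film:Edward_Scissorhands", "film:character")
        for actor in objects(ch, "character:actor")}
print(sorted(cast))   # ['person:Johnny_Depp', 'person:Winona_Ryder']
```

No single record "owns" this answer; it falls out of following edges through the graph, which is exactly what a triple store does at scale.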

One consequence of having this kind of information is that if you can uniquely identify a given term in a post or similar content, then all of this information becomes available to annotate the term, without necessarily needing to encode it in the displayed content (hint: AJAX). To do this, we first need to identify and enrich the source content. For instance, consider a different enrichment scheme for the original paragraph:

<p about="para:n1103A1293C33229583"><span rel="para:contains"><span resource="person:Johnny_Depp">Johnny Depp</span> is an <span resource="profession:Actor">actor</span> working in <span resource="city:Los_Angeles_CA_USA">Los Angeles</span>. He's appeared in a number of roles, including <span resource="film:Edward_Scissorhands"><span resource="character:EDWSCIS.Edward_Scissorhands">Edward Scissorhands</span></span> and Captain <span resource="character:POTC.Jack_Sparrow">Jack Sparrow</span> in the <span resource="franchise:POTC">Pirates of the Caribbean</span> movies.</span></p>

This encoding uses RDF in Attributes, or RDFa, to identify individual resources and relationships. The @about attribute on the <p> paragraph element provides the paragraph's global identifier – that is to say, an RDFa parser, reading this, would associate the resource para:n1103A1293C33229583 with this paragraph. Note that this doesn’t prevent the paragraph from having a local identifier as well, which may be used for other purposes – this just establishes the semantic identification for the paragraph.

The next container:

<span rel="para:contains">

indicates that every resource given in the paragraph is related to the paragraph by a para:contains relationship. Thus,

<p about="para:n1103A1293C33229583"><span rel="para:contains"><span resource="person:Johnny_Depp">Johnny Depp</span></span></p>

is equivalent to the RDF assertion:

para:n1103A1293C33229583 para:contains person:Johnny_Depp.

This kind of relationship is about all that can be explicitly determined here, though the simple act of containment implies that there is some kind of topicality (to be explored more below).
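A rough sketch of how such containment triples could be recovered from the nesting, using nothing but stdlib XML parsing (this is a simplification for illustration; a real RDFa parser handles subject/predicate chaining far more generally):

```python
import xml.etree.ElementTree as ET

html = ('<p about="para:n1103A1293C33229583"><span rel="para:contains">'
        '<span resource="person:Johnny_Depp">Johnny Depp</span></span></p>')

triples = []
p = ET.fromstring(html)
subject = p.get("about")                       # the paragraph's identifier
for rel_span in p.findall('.//span[@rel]'):    # each rel container
    predicate = rel_span.get("rel")
    for res in rel_span.iter("span"):          # every resource beneath it
        if res.get("resource"):
            triples.append((subject, predicate, res.get("resource")))

print(triples)
# [('para:n1103A1293C33229583', 'para:contains', 'person:Johnny_Depp')]
```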

Notice that this type of encoding isn’t necessarily completely intelligent. For instance, Edward Scissorhands is both the name of a movie by Tim Burton and the name of a character within that movie, and in the simplistic form of analysis used here – nested <span> elements surrounding the same text – there’s no real way of differentiating between the two. There are ways of getting around this, and these too will be discussed, though not necessarily implemented, here.

What benefit comes from this way of identifying content? At a minimum, once you have the identifier, it becomes possible to query a triple store for additional information, such as bio information for a character or person, or a URL for a map of a location. This could drive a popup or side-panel display any time you roll over the term in question. You can also use the containment relationship for a paragraph to find related links, either directly through links associated with each member item, or more indirectly by finding articles that have the highest number of shared links with the current paragraph’s set of links. This can reduce the need to set an arbitrary relevance score on documents, and can also be used in conjunction with the paragraph @about identifiers for managing commentary and annotations.
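The “shared links” idea reduces to simple set overlap. A hedged Python sketch, in which the paragraph identifiers other than the one above, and all of the resource sets, are invented for illustration:

```python
# Hypothetical paragraph-to-resources index built from para:contains triples.
contains = {
    "para:n1103A1293C33229583": {"person:Johnny_Depp", "film:Edward_Scissorhands",
                                 "franchise:POTC"},
    "para:a": {"person:Johnny_Depp", "franchise:POTC"},   # shares two resources
    "para:b": {"geoCity:Los_Angeles_CA_USA"},             # shares none
}

current = "para:n1103A1293C33229583"
# Rank the other paragraphs by how many contained resources they share.
ranked = sorted((p for p in contains if p != current),
                key=lambda p: len(contains[current] & contains[p]),
                reverse=True)
print(ranked)   # ['para:a', 'para:b']
```

The overlap count comes directly from the data rather than from a tuned relevance score, which is the point made above.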

Creating a Virtual Dictionary

The first step to enriching the content is to have existing content to enrich from — which in general means having an existing triple store. In the case of public data, this can be achieved by pulling in data from existing linked data stores such as DBpedia, while for private data it can be created from existing data records.

Assuming that the data set looks something like that shown in Listing 1, the first step in building the dictionary comes in identifying what represents “searchable” data. This is one of those situations where the ability to create data models in RDF can come in handy, since such models can identify specific properties or relationships that can be searched. For instance, consider certain properties identifying searchable labels and descriptions:

class:Person rdfs:subClassOf class:Entity;
     entity:dictionaryLabelProperty person:nameFirstLast, person:nameLastFirst, person:nameProfessional;
     entity:dictionaryDescProperty person:bio.
class:Character rdfs:subClassOf class:Entity;
     entity:dictionaryLabelProperty character:name;
     entity:dictionaryDescProperty character:bio.

With this information, it then becomes possible to generate a term-to-URI map with XQuery and SPARQL:

let $dict-maps := sem:sparql("
     prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
     prefix entity: <>
     select ?term ?term-uri where {
        ?term-uri rdf:type ?class.
        ?class entity:dictionaryLabelProperty ?labelProp.
        ?term-uri ?labelProp ?term.
     }")

Several things are happening here. The SPARQL query retrieves an item (bound to ?term-uri) that’s defined as an instance of a class, where the ?class itself has one or more dictionaryLabelProperty relationships, and retrieves those properties. Each property in turn is used with the item to retrieve the associated term. This is a fairly powerful pattern when different kinds of objects have different relationship predicates that describe the same action (in this case, getting an identifying term). A similar process could be used with descriptions, where a person may have a bio, a film a synopsis, a book an abstract and so forth. It’s also possible to do the same kind of mapping with the predicates themselves, where each specific property is declared a subproperty of a base property:
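The shape of this query can be mirrored in plain Python as an illustrative stand-in (not MarkLogic’s sem:sparql): the class declares which properties act as labels, and each instance is read through whichever label properties its class declares:

```python
# Schema: class -> its entity:dictionaryLabelProperty values.
schema = {
    "class:Person": ["person:nameFirstLast", "person:nameProfessional"],
    "class:Character": ["character:name"],
}
# Instance data as (subject, predicate, object) triples.
facts = [
    ("person:Johnny_Depp", "rdf:type", "class:Person"),
    ("person:Johnny_Depp", "person:nameProfessional", "Johnny Depp"),
    ("character:POTC.Jack_Sparrow", "rdf:type", "class:Character"),
    ("character:POTC.Jack_Sparrow", "character:name", "Jack Sparrow"),
]

# Join: instance -> its class -> that class's label properties -> terms.
dictionary = {}
for s, p, o in facts:
    if p == "rdf:type":
        for label_prop in schema.get(o, []):
            for s2, p2, term in facts:
                if s2 == s and p2 == label_prop:
                    dictionary[term] = s

print(dictionary)
# {'Johnny Depp': 'person:Johnny_Depp', 'Jack Sparrow': 'character:POTC.Jack_Sparrow'}
```

Note that neither the join code nor the SPARQL it stands in for needs to know which label property applies to which kind of thing; the schema carries that knowledge.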

person:bio rdfs:subPropertyOf dc:description.
film:synopsis rdfs:subPropertyOf dc:description.

with the appropriate SPARQL expressions being

?hasDescription rdfs:subPropertyOf dc:description.
?s ?hasDescription ?description.

This mapping of sub-properties can be incredibly useful not just for descriptions but for labels and similar properties that may be analogous but sit in different namespaces (for instance, creation or modification dates, resource creators, or similar “common” properties).
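In Python terms (again an illustrative sketch, using the standard rdfs:subPropertyOf predicate, with placeholder literal values), the sub-property trick amounts to collecting every predicate declared beneath dc:description and then reading any of them as “a description”:

```python
facts = [
    ("person:bio", "rdfs:subPropertyOf", "dc:description"),
    ("film:synopsis", "rdfs:subPropertyOf", "dc:description"),
    # Sample instance data; the literal values are placeholders.
    ("person:Johnny_Depp", "person:bio", "American actor and producer."),
    ("film:Edward_Scissorhands", "film:synopsis", "An artificial man with scissors for hands."),
]

# Every predicate that counts as "a description".
desc_props = {s for s, p, o in facts
              if p == "rdfs:subPropertyOf" and o == "dc:description"}

# Read descriptions through any of those predicates, whatever the namespace.
descriptions = {s: o for s, p, o in facts if p in desc_props}
print(descriptions["person:Johnny_Depp"])   # American actor and producer.
```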

Semantically Enriching Content

When the initial query runs, it will return a sequence of maps of the form [{term: term1, term-uri: term-uri1}, {term: term2, term-uri: term-uri2}, ..., {term: termN, term-uri: term-uriN}]. Note that the same term may occur more than once (as in the case of “Edward Scissorhands” above), so this form (rather than a straight lookup) ensures that this use case can be handled. A benefit of this approach is that terms consisting of multiple words can be searched first, which makes it possible to tag “Jack Sparrow” as a discrete entity distinct from either “Jack” (a playing card face, for instance) or “Sparrow” (a bird).
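The longest-first principle is worth a quick sketch in Python (the term list and the “Jack”/“Sparrow” URIs are hypothetical): building one alternation pattern with the longest terms first guarantees that “Jack Sparrow” is tagged as a unit.

```python
import re

# (term, uri) pairs rather than a plain dict, since the same term
# may map to several URIs; entries here are hypothetical.
pairs = [
    ("Jack", "concept:Playing_Card_Jack"),
    ("Sparrow", "concept:Bird_Sparrow"),
    ("Jack Sparrow", "character:POTC.Jack_Sparrow"),
]
lookup = {}
for term, uri in pairs:
    lookup.setdefault(term, []).append(uri)

# Longest terms first, so "Jack Sparrow" is tried before "Jack" or "Sparrow".
pattern = re.compile("|".join(
    re.escape(t) for t in sorted(lookup, key=len, reverse=True)))

def tag(m):
    # Take the first URI for the matched term (disambiguation comes later).
    return '<span resource="%s">%s</span>' % (lookup[m.group(0)][0], m.group(0))

result = pattern.sub(tag, "Captain Jack Sparrow sails again.")
print(result)
# Captain <span resource="character:POTC.Jack_Sparrow">Jack Sparrow</span> sails again.
```

Because regex alternation tries alternatives in order, neither the playing card nor the bird ever gets a chance to match inside the longer name.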

The following XQuery (semantify.xq) script shows the process of enrichment.

xquery version "1.0-ml";
let $ns-map := map:new((
 (: prefix-to-namespace map:entry() calls omitted :)
 ))
let $term-map := map:new((
 map:entry("Johnny Depp",""),
 map:entry("Helena Bonham Carter",""),
 map:entry("Winona Ryder",""),
 map:entry("Los Angeles",""),
 map:entry("Edward Scissorhands",""),
 map:entry("Edward Scissorhands",""),
 map:entry("Sweeney Todd",""),
 map:entry("Sweeney Todd",""),
 map:entry("Mrs. Lovett",""),
 map:entry("Alice In Wonderland",""),
 map:entry("Mad Hatter",""),
 map:entry("Jack Sparrow",""),
 map:entry("Red Queen",""),
 map:entry("Pirates of the Caribbean","")))
let $page := fn:doc("/data/johnny_depp.xml")
let $xslt := fn:doc("/lib/rdfa-enrich.xsl")
let $ns-prefixes := fn:string-join(for $prefix in map:keys($ns-map) return $prefix || ": " || map:get($ns-map,$prefix)," ")
let $doc-map := map:entry("doc",$page)
let $keys := for $key in map:keys($term-map) return $key
let $_ :=
 for $key in $keys
 order by fn:string-length($key) descending
 return
  let $doc := map:get($doc-map,"doc")
  let $entry-map := map:new((
   map:entry("term",$key),
   map:entry("term-uri",map:get($term-map,$key)),
   map:entry("ns-prefixes",$ns-prefixes),
   map:entry("ns-map",$ns-map)))
  let $output := xdmp:xslt-eval($xslt,$doc,$entry-map)
  return map:put($doc-map,"doc",$output)
return (xdmp:set-response-content-type("text/html"),map:get($doc-map,"doc")/*)

The first part identifies the internal namespaces and prefixes that are used for identifying relationships. Because in general we will want to provide rapid internal identification matching, the URIs used for output will generally be fully expanded rather than simplified as CURIEs, but providing the namespaces will make it easier for downstream processes to create legible output.

The second part is a bit of a cheat – it defines the dictionary “map” directly using the map:new() and map:entry() functions. This might actually come from a SPARQL query, an XML document or some other source, but the map form is how it will end up.

Once these are defined, two documents are retrieved – the source document to be enriched (“Johnny_Depp.xml”):

<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <link rel="stylesheet" href="/lib/themes/test/page.css"/>
        <style type="text/css"> span[resource] {font-style:italic;color:blue;} </style>
    </head>
    <body>
        <h1>Johnny Depp</h1>
        <p>Johnny Depp is an actor working in Los Angeles. He's appeared in a number of roles, including Edward Scissorhands (with Winona Ryder as Kim) and Captain Jack Sparrow in the Pirates of the Caribbean movies.</p>
        <p>Depp has also played as the Mad Hatter, playing against frequent female counterpart Helena Bonham Carter as the Red Queen in Alice in Wonderland. He also did a star turn as the eponymously named barber in Sweeney Todd, again with Bonham Carter as Mrs. Lovett.</p>
    </body>
</html>

and the XSLT that will do the bulk of the actual work:

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:fn="http://www.w3.org/2005/xpath-functions"
 xmlns:map="http://marklogic.com/xdmp/map"
 xmlns:xdmp="http://marklogic.com/xdmp"
 xmlns:h="http://www.w3.org/1999/xhtml"
 exclude-result-prefixes="xs sp map xdmp fn h">
 <xsl:output method="xml" media-type="text/html" indent="yes"/>
 <xsl:param name="term"/>
 <xsl:param name="term-uri"/>
 <xsl:param name="ns-prefixes"/>
 <xsl:param name="ns-map"/>
 <xsl:template match="*">
  <xsl:copy>
   <xsl:for-each select="@*"><xsl:copy-of select="."/></xsl:for-each>
   <xsl:apply-templates select="*|text()"/>
  </xsl:copy>
 </xsl:template>
 <xsl:template match="text()">
  <xsl:choose>
   <xsl:when test="fn:matches(.,$term,'i')">
    <xsl:analyze-string select="." regex="{fn:concat('(',$term,')')}" flags="i">
     <xsl:matching-substring><span resource="{$term-uri}"><xsl:value-of select="regex-group(1)"/></span></xsl:matching-substring>
     <xsl:non-matching-substring><xsl:value-of select="."/></xsl:non-matching-substring>
    </xsl:analyze-string>
   </xsl:when>
   <xsl:otherwise><xsl:value-of select="."/></xsl:otherwise>
  </xsl:choose>
 </xsl:template>
 <xsl:template match="h:html">
  <html prefix="{$ns-prefixes}"><xsl:apply-templates select="*"/></html>
 </xsl:template>
 <xsl:template match="h:p|h:h1|h:h2|h:h3|h:h4|h:div">
  <xsl:variable name="context" select="."/>
  <xsl:element name="{fn:name(.)}">
   <xsl:attribute name="about">
    <xsl:choose>
     <xsl:when test="@about"><xsl:value-of select="@about"/></xsl:when>
     <xsl:otherwise><xsl:value-of select="fn:concat(map:get($ns-map,'ht'),fn:local-name(.),'-',fn:generate-id(.))"/></xsl:otherwise>
    </xsl:choose>
   </xsl:attribute>
   <xsl:for-each select="@*[not(../@about)]"><xsl:copy-of select="."/></xsl:for-each>
   <xsl:variable name="contains" select="fn:concat(map:get($ns-map,'property'),'contains')"/>
   <xsl:choose>
    <xsl:when test="h:span[@rel=$contains]"><xsl:apply-templates select="*|text()"/></xsl:when>
    <xsl:otherwise><span rel="{$contains}"><xsl:apply-templates select="*|text()"/></span></xsl:otherwise>
   </xsl:choose>
  </xsl:element>
 </xsl:template>
</xsl:stylesheet>
Once these are retrieved, the keys are sorted by length, which in general will ensure that long terms consisting of multiple words are preferred over shorter terms. The term and its URI are then passed into the transformation, along with the maps converted into a single prefix string of the form "pre1: pre1uri pre2: pre2uri ... preN: preNuri".

The output of this transformation is then stored into a map, where it is retrieved as the input for the next key.

One of the key facets of the XSLT transformation is the use of the xsl:analyze-string instruction:

<xsl:when test="fn:matches(.,$term,'i')">
 <xsl:analyze-string select="." regex="{fn:concat('(',$term,')')}" flags="i">
  <xsl:matching-substring>
   <span resource="{$term-uri}"><xsl:value-of select="regex-group(1)"/></span>
  </xsl:matching-substring>
  <xsl:non-matching-substring><xsl:value-of select="."/></xsl:non-matching-substring>
 </xsl:analyze-string>
</xsl:when>

When a block of contiguous text matching the regular expression (in this case the matching term) is found, the <xsl:matching-substring> element is invoked to wrap this string within a span that has a resource attribute; otherwise the text is copied as is. This isn’t perfect – it carves up a block of text, so that an earlier, shorter match can prevent a longer one from ever being found, which is usually the opposite of what’s desired. This is the reason that the keys are sorted in descending length.
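The carving effect is easy to demonstrate outside XSLT. In this Python sketch, re.sub stands in for analyze-string, and the character:Pirate URI is purely hypothetical; tagging the shorter term first destroys the longer match, while the reverse order merely nests it:

```python
import re

def wrap(text, term, uri):
    # Wrap every occurrence of term in a resource-bearing span,
    # mimicking the matching-substring branch above.
    return re.sub("(%s)" % re.escape(term),
                  r'<span resource="%s">\1</span>' % uri, text)

text = "the Pirates of the Caribbean movies"

# Short term first: "Pirate</span>s of the Caribbean" – the franchise
# title is carved up and can never match afterwards.
short_first = wrap(wrap(text, "Pirate", "character:Pirate"),
                   "Pirates of the Caribbean", "franchise:POTC")
print("franchise:POTC" in short_first)   # False

# Long term first: the shorter match simply nests inside the longer one.
long_first = wrap(wrap(text, "Pirates of the Caribbean", "franchise:POTC"),
                  "Pirate", "character:Pirate")
print("franchise:POTC" in long_first)    # True
```

The nested result of the long-first order is exactly the shape visible in the generated output below.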

The output of this process, after multiple iterations, then looks something like the following:

<html prefix="person: concept: geoCity: film: geoRegion: property: character: franchise: personRole: ht:" xmlns="">
 <link rel="stylesheet" href="/lib/themes/test/page.css"/>
 <body><h1 about=""><span rel=""><span resource="">Johnny <span resource="">Depp</span></span></span></h1><p about=""><span rel=""><span resource="">Johnny <span resource="">Depp</span></span> is an <span resource="">actor</span> working in <span resource="">Los Angeles</span>. He's appeared in a number of roles, 
including <span resource="">Edward Scissorhands</span> (with <span resource="">Winona Ryder</span> as <span resource="">Kim</span>), Captain <span resource="">Jack Sparrow</span> 
in the <span resource=""><span resource="">Pirate</span>s of the <span resource="">Caribbean</span></span> movies.</span></p><p about=""><span rel=""><span resource="">Depp</span> has also played as the <span resource="">Mad Hatter</span>, playing against frequent female 
counterpart <span resource="">Helena Bonham Carter</span> as the <span resource="">Red Queen</span> in <span resource="">Alice in Wonderland</span>.</span></p></body></html>

The output of this page would normally be indistinguishable from a regular page, though if the resources are highlighted by setting up the CSS rule:

span[resource] {font-style:italic;color:blue;}

the enriched terms would appear in italicized blue text.

Significantly, what is produced here is RDFa-compliant – the generated content can be read by an RDFa parser and converted into RDF assertions. In the example here, the RDFa code produced generates the following triples when run through an online RDFa parser:

@prefix character: <> .
@prefix film: <> .
@prefix franchise: <> .
@prefix geocity: <> .
@prefix ht: <> .
@prefix person: <> .
@prefix personrole: <> .
@prefix property: <> .

ht:h1-nfdcc7f1086a7650d property:contains person:Johnny_Depp .

ht:p-nfdcc7f10887ffc5b property:contains character:EDWSCIS-Kim,
        geocity:Los_Angeles .

ht:p-nfdcc7f108a5893a9 property:contains character:AIW2010-The_Mad_Hatter,
        person:Johnny_Depp .

This indicates that each of the three HTML blocks – the <h1> header and two paragraphs – has a property:contains relationship with each of the items contained within it.

Note that the code given here is primarily intended to showcase ideas, and could certainly be refined to handle other encoding situations with additional work (another way of saying “while it works, it ain’t final code!”). Similar enrichment can be done to capture textual information, dates, and the like, but this provides enough of a skeletal foundation to illustrate how more sophisticated code could be built for doing enrichment.


By themselves, semantically tagged elements do not do much. However, what they enable is considerably more powerful. As such, this process of semantic enrichment should be seen as an intermediate stage of production, one that can be tied into user interfaces, navigation and other forms of visualization. The theme of taking semantically rich content and building applications around it will be the focus of the next article in this series.





About Kurt Cagle

Kurt Cagle is the Principal Evangelist for Semantic Technology with Avalon Consulting, LLC, and has designed information strategies for Fortune 500 companies, universities and Federal and State Agencies. He is currently completing a book on HTML5 Scalable Vector Graphics for O'Reilly Media.
