Entity Extraction, XQuery, the Semantic Web and Johnny Depp

Lately, I’ve been spending a lot of time dealing with entity extraction software for a client. The premise behind most such extraction tools is fairly similar – by creating an internal pipeline that breaks text down into parts of speech and similar constructs, then applying a set of regular expression heuristics, it should become relatively simple to determine that “Johnny Depp” or “Helena Bonham Carter” are people, that “Seattle” is a place name, and that “May 13, 2011” is a date.

On the surface this is pretty cool, of course; computer scientists have spent the better part of fifty years trying to get a computer to make precisely such identifications accurately. Even so, the tools aren’t perfect – Virginia could be a state in the US, but it can also be a woman’s first name, and when you try to determine from the context which is which, you begin to understand that meaning is as much a matter of cultural imperative as it is an innate physical one.

Entity Enrichment is the process of automatically adding tags around content in order to identify “entities” – typically parts of speech (known by the acronym POS), person, location and event names (Named Entity Recognition or NER), or more specialized filtering on terms such as drug names, medical terminology, engineering terms or the like. As a technology, it’s been around for a while, especially in the publishing arena, and in many ways it is one of the more rudimentary (and foundational) pieces of both text analytics and semantic processing.
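
To make the idea of "adding tags around content" concrete, here is a minimal sketch in Python. It is not how any particular enrichment product works; it assumes a tiny hand-made list of names and places (the `PEOPLE` and `PLACES` lists are invented for illustration) plus one regular expression for dates, where a real engine would layer part-of-speech analysis and much larger vocabularies on top.

```python
import re

# Hypothetical, tiny gazetteers; a real engine would use far larger lists
# plus part-of-speech and statistical models.
PEOPLE = ["Johnny Depp", "Helena Bonham Carter"]
PLACES = ["Seattle"]

# Matches dates written like "May 13, 2011".
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)


def enrich(text: str) -> str:
    """Wrap known people, places and dates in simple XML-style tags."""
    text = DATE_PATTERN.sub(lambda m: f"<date>{m.group(0)}</date>", text)
    for name in PEOPLE:
        text = re.sub(rf"\b{re.escape(name)}\b", f"<person>{name}</person>", text)
    for place in PLACES:
        text = re.sub(rf"\b{re.escape(place)}\b", f"<location>{place}</location>", text)
    return text


sample = "Johnny Depp visited Seattle on May 13, 2011."
print(enrich(sample))
```

A lookup list like this has no way of telling Virginia the state from Virginia the person, which is exactly the kind of ambiguity that context has to resolve.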

However, it’s important to understand that enrichment by itself is not a panacea. For starters, we humans are remarkably adept at being imprecise, and this becomes more important when you start dealing not with individual words but with phrases and titles. You see this especially with government titles – consider for instance “Ambassador John J. Smith, III, Senior Assistant Undersecretary for European Affairs, U.S. Department of State”. A surprising number of enrichment engines will parse this as

<person>Ambassador John J. Smith</person>,<number>III</number>, <person>Senior</person> <title>Assistant Undersecretary</title> for <location>European</location> <organization>Affairs</organization>, <location>U.S.</location> <organization>Department of State</organization>

or some similar construct, rather than the one that humans could probably pick out as:

<honorific>Ambassador</honorific> <person>John J. Smith, III</person>, <title>Senior Assistant Undersecretary for European Affairs</title>, <organization>U.S. Department of State</organization>

The accuracy rate is getting better – even the available open source tools such as ANNIE or LingPipe will generally have about a 30-40% chance of getting a mouthful like that properly categorized, and commercial products are usually (though not always) better, but this still translates into an abysmally low success rate. Ironically, specialized vocabulary filters usually do considerably better, if only because technical terms are usually more regular in their usage and context, but the stakes are higher there as well.
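
The point about regularity can be illustrated with a small Python sketch. When the shape of a phrase is known in advance, a narrow hand-written pattern can recover the reading a human would pick for the ambassador example above. The pattern below is an assumption made for illustration (the honorific and suffix lists are invented and far from complete), and it only handles the shape "honorific, name with optional suffix, title, organization"; anything else is returned untouched.

```python
import re

# Hand-written pattern for one specific shape:
#   <honorific> <name>[, <suffix>], <title>, <organization>
# The honorific and suffix lists are illustrative assumptions.
OFFICIAL = re.compile(
    r"^(?P<honorific>Ambassador|Senator|Governor|Dr\.)\s+"
    r"(?P<person>[A-Z][a-z]+(?: [A-Z]\.)*(?: [A-Z][a-z]+)+"
    r"(?:, (?:Jr\.|Sr\.|III|II|IV))?)"
    r",\s+(?P<title>[^,]+)"
    r",\s+(?P<organization>.+)$"
)


def tag_official(text: str) -> str:
    """Return the text with tags applied, or unchanged if the shape differs."""
    m = OFFICIAL.match(text)
    if m is None:
        return text
    return (
        f"<honorific>{m['honorific']}</honorific> "
        f"<person>{m['person']}</person>, "
        f"<title>{m['title']}</title>, "
        f"<organization>{m['organization']}</organization>"
    )


phrase = (
    "Ambassador John J. Smith, III, Senior Assistant Undersecretary "
    "for European Affairs, U.S. Department of State"
)
print(tag_official(phrase))
```

The trade-off is the usual one: the pattern is reliable for the one shape it encodes and useless for everything else, which is why general-purpose engines struggle where narrow vocabulary filters do well.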

However, even given this, entity extraction really buys you fairly little unless you also have a context that the article is placed in, both at the macro level and at the micro level. The computer knows nothing about Johnny Depp – internally, the term is a sequence of eleven characters, counting the white space, that happens to fit enough of a profile that it can be categorized in a bucket called “person”. However, the computer does know that there are a number of “person” objects in the document where they were found. This document is a context or resource (at this point, the Semantic Web people start jumping around).

Documents can be categorized in a number of different ways. While it’s not uncommon for data systems professionals to break things down by implementation type (web page, text processing document, spreadsheet, presentation, etc.), in reality, the same document could just as readily be a “Newspaper Article” or a “Movie Review” as an HTML page. In an XML database such as MarkLogic, such a document could be contained simultaneously in three overlapping collections, each of which is ultimately a reflection of a different, orthogonal classification system. Johnny Depp being referenced as a person in a web page tells you very little – Johnny Depp referenced as a person in a “Movie Review”, however, tells you much, much more, because movie reviews have definite structures, roles and relationships.
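
As a rough illustration of overlapping collections, the Python sketch below models a document as a URI plus a set of collection names. This is not MarkLogic's API; the `Document` class, the URIs and the collection names are all invented here simply to show one document belonging to several classification schemes at once.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    uri: str
    collections: set = field(default_factory=set)


# Hypothetical URIs and collection names, for illustration only.
documents = [
    Document("/reviews/potc4.html", {"html-page", "newspaper-article", "movie-review"}),
    Document("/news/seattle-budget.html", {"html-page", "newspaper-article"}),
]


def members(collection: str) -> list:
    """List the URIs of every document that belongs to the collection."""
    return [d.uri for d in documents if collection in d.collections]


print(members("html-page"))
print(members("movie-review"))
```

Asking for "html-page" returns both documents, while "movie-review" narrows things to the one document whose tagged people can sensibly be read as cast members.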

A movie review is (for discussion purposes) about a single movie, which has an associated title and also likely has an Internet Movie Database (IMDB) entry, with an associated URL. The review also has an associated URL. The URL for the review can be taken as a proxy for that review, just as the URL for the IMDB entry can be taken as a proxy for the movie itself. What this means in practice is that, because the movie review provides a context for the enriched terms in it, it becomes possible to retrieve information about the article that isn’t in the article itself.

If the article is about “Pirates of the Caribbean: On Stranger Tides”, which we’ll assume here has been classified as a <media_title> internally, then this can be used by an XQuery processor as a key to look up an associated IMDB entry. If it finds one and returns it as an XML structure (or an RDF structure, which I’ll get to in a second), then the IMDB entry might also have multiple <also_known_as> blocks, such as <also_known_as>POTC:OST</also_known_as> or <also_known_as>Pirates of the Caribbean IV</also_known_as>. This is important, because enrichment is seldom a single process – rather, it is a recursive set of refinements. An XQuery script could take the article, locate all instances of POTC:OST, and not only identify them as alternative names but also add a pointer to the IMDB proxy in each case. Similarly, the script could identify that Johnny Depp is an actor, that he stars in the movie, and that he stars as the pirate “Jack Sparrow”; consequently, Jack Sparrow can also contain a pointer back to a proxy representing the character. Moreover, it can also break the term apart and find all naked references to “Jack” or “Sparrow” and do the same thing.
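
The alias-tagging pass described above might look something like the following. This sketch uses Python as a stand-in for the XQuery script, and it rests on several assumptions: the shape of the entry record (a `<movie>` element with a `uri` attribute), the `ref` attribute used as the pointer, and the `example.org` URL are all invented for illustration and are not the real IMDB format.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical record shape and placeholder URL; not the real IMDB format.
ENTRY_XML = """
<movie uri="http://example.org/imdb/potc4">
  <title>Pirates of the Caribbean: On Stranger Tides</title>
  <also_known_as>POTC:OST</also_known_as>
  <also_known_as>Pirates of the Caribbean IV</also_known_as>
</movie>
"""


def tag_titles(article: str, entry_xml: str) -> str:
    """Wrap every known name of the movie in a tag that points at its proxy."""
    entry = ET.fromstring(entry_xml)
    uri = entry.get("uri")
    names = [entry.findtext("title")]
    names += [alias.text for alias in entry.findall("also_known_as")]
    # Longest names first so a short alias never splits a longer one.
    names.sort(key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return pattern.sub(
        lambda m: f'<media_title ref="{uri}">{m.group(0)}</media_title>',
        article,
    )


article = "POTC:OST opens Friday. Pirates of the Caribbean IV stars Johnny Depp."
print(tag_titles(article, ENTRY_XML))
```

The same pass could be repeated with the cast list from the entry, which is what makes the refinement recursive: each round of tagging supplies keys for the next lookup.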

This is where entity enrichment begins to gain value. By the time the article has been processed, it knows much more about itself. It can point to a common object that represents the focus of the article, which means that a search can be made based upon the IMDB entry and all articles about POTC4 can be found within the database. Relationships between entries can be established. If the articles are all part of a central database, it also means that different reviews with similar rating systems could provide a more universal “rating” about the quality of the movie. It also means that, since there are relationships that exist outside of the article, it becomes possible to pull together reviews not only about POTC4 but about all four Pirates movies, as suggestions.
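
Once every review points at a shared proxy for its movie, combining ratings becomes a simple grouping exercise. The Python sketch below assumes each review has already been reduced to a movie URI, a score and the top of its scale; the field names, the placeholder URIs and the choice to normalize each score to a 0-1 range before averaging are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical review records keyed by a placeholder movie URI.
reviews = [
    {"movie": "http://example.org/imdb/potc4", "score": 3.0, "out_of": 5},
    {"movie": "http://example.org/imdb/potc4", "score": 7.0, "out_of": 10},
    {"movie": "http://example.org/imdb/potc1", "score": 4.0, "out_of": 4},
]


def combined_ratings(items):
    """Average the normalized (0 to 1) scores of all reviews of each movie."""
    buckets = defaultdict(list)
    for review in items:
        buckets[review["movie"]].append(review["score"] / review["out_of"])
    return {movie: sum(scores) / len(scores) for movie, scores in buckets.items()}


for movie, rating in combined_ratings(reviews).items():
    print(movie, round(rating, 2))
```

None of this works without the shared key: two reviews that spell the title differently only land in the same bucket because enrichment attached the same pointer to both.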

It’s possible that the data coming from IMDB is in RDF format – essentially in a format where there are a number of very simple assertions that are made, and relationships between these assertions are defined. These assertions can be extracted from the RDF (or, with a little more foreknowledge, from XML or other wire formats) and used to make various relationships within a canonical reference (such as an IMDB) easy to extract and compare.
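
For readers who have not worked with RDF, the "very simple assertions" can be pictured as three-part statements: a subject, a predicate and an object. The Python sketch below is a deliberately naive stand-in, not an RDF library, and the identifiers such as `movie:potc4` are invented for illustration. It stores a handful of assertions as tuples and answers a question by following them from one to the next.

```python
# Each assertion is a (subject, predicate, object) tuple.
# All identifiers are hypothetical, chosen only for this example.
triples = [
    ("movie:potc4", "title", "Pirates of the Caribbean: On Stranger Tides"),
    ("movie:potc4", "starring", "person:johnny_depp"),
    ("person:johnny_depp", "name", "Johnny Depp"),
    ("person:johnny_depp", "plays", "character:jack_sparrow"),
    ("character:jack_sparrow", "name", "Jack Sparrow"),
]


def match(subject=None, predicate=None, obj=None):
    """Return every triple that agrees with the parts that were supplied."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


def star_names(movie):
    """Follow 'starring' and then 'name' to list the stars of a movie."""
    names = []
    for _, _, person in match(subject=movie, predicate="starring"):
        names += [name for _, _, name in match(subject=person, predicate="name")]
    return names


print(star_names("movie:potc4"))
```

A query language such as SPARQL does this kind of pattern matching far more generally, but the underlying idea is no more complicated than chaining lookups like these.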

One of the most significant realizations being made today is that, in many respects, the more important data queries are not those within individual documents (or even collections of documents) but those between documents. Document enrichment helps to bootstrap that process, making it easier to identify potential keys, but that enrichment must be done with knowledge of the appropriate context.

Moreover, as these documents in turn gain more “self-referential” information, they can become canonical reference sources themselves. IMDB is not likely to contain viewer expectations or ratings, but the movie reviews would contain these things, and as such can be used by document analytics tools to determine not only the critical reception of a movie but also, through deeper analytics, what within a given movie or set of movies most captured the audience’s expectations. (By the way, if this sounds a lot like the Rotten Tomatoes site, you now have a pretty good idea about how such a service could be implemented.)

Semantic Web technologies are more than just arcane terms such as acyclic graphs, RDF, turtle notation, n-tuple pairs, OWL and SPARQL. Indeed, my personal feeling is that the emphasis on these particular tools has had an overall negative impact upon the adoption of Semantic Web technologies. Ultimately SemWeb is just the process of making resources – documents – both more self-aware and more externally aware of their context(s) in the world. You can do SemWeb without the above, though as you get deeper into the space, these tools do provide the utility to do much more. In the end, SemWeb is about the relationship, and any tool that will help you get there can only do you good.

Kurt Cagle is an Information Architect for Avalon Consulting, LLC, specializing in XML data architecture, information management and the Semantic Web.

About Kurt Cagle

Kurt Cagle is the Principal Evangelist for Semantic Technology with Avalon Consulting, LLC, and has designed information strategies for Fortune 500 companies, universities and Federal and State Agencies. He is currently completing a book on HTML5 Scalable Vector Graphics for O'Reilly Media.
