Mobile Has Unbottled the UX Genie

May 24th, 2011 by Sam Mefford

I think it was the editor of Wired magazine who said something like “Designing for the iPad is liberating, because users have no pre-conceived expectations for the user experience”.  He hit the nail on the head, because right now many of the most exciting user experience innovations are occurring on mobile.  I wanted to share a few:

1) Do@ is an iPhone app which does search better by replacing text search results with apps* pre-filled with result sets, and a WebOS-style flick left or right between the apps.  By “outsourcing” the results display to 3rd-party apps, Do@ leaves plenty of room for innovation, without losing value as an aggregator of all apps appropriate for each category (e.g. @movies,  @news, @music).

2) Siri has the theory & user experience right for voice search.  In my testing, the app isn’t ready for prime time because it makes too many mistakes.  But with Apple’s purchase, we’ll likely see pieces of the technology built right into future releases of iOS.

3) Windows Phone 7 Mango has added Google-Goggles-style “Bing Vision” search built right into the OS.  Also, they’re building closer integration between Bing search results and results within apps like IMDB.

4) WebOS Just Type extends search with Quick Actions, once again proving that innovation can still reduce the number of steps required to perform common tasks.

You’ll notice a trend here: I’m showing examples of innovations for search.  As you may have noticed in my blog, I’m passionate about the synergy offered by combining mobile innovations and search innovations.  Since our focus at Avalon Consulting, LLC. is content, specifically leveraging your content to improve your users’ experiences, our mobile focus is naturally around delivering your content to more people more often using the most effective mobile user experiences.  I believe we’ve only scratched the surface of how the combination of mobile devices and search technologies will make valuable content more accessible.

* term used loosely here since they’re HTML5 and not independently installed on the device

Entity Extraction, XQuery the Semantic Web and Johnny Depp

May 12th, 2011 by Kurt Cagle

Lately, I’ve been spending a lot of time dealing with entity extraction software for a client. The premise that most such extraction tools use is fairly similar: by creating an internal pipeline that breaks text down into parts of speech and similar constructs, then applying a set of regular expression heuristics, it should become relatively simple to determine that “Johnny Depp” or “Helena Bonham Carter” are people, that “Seattle” is a place name, and that “May 13, 2011” is a date.

On the surface this is pretty cool, of course; computer scientists have spent the better part of fifty years trying to get a computer to make precisely such identifications accurately. Even so, the tools aren’t perfect: Virginia could be a state in the US, but it can also be a woman’s first name, and when you try to determine from the context which is which, you begin to understand that meaning is as much a matter of cultural imperative as it is an innate physical one.

Entity Enrichment is the process of automatically adding tags around content in order to mark “entities”: typically parts of speech (known by the acronym POS), person, location and event names (Named Entity Recognition, or NER), or more specialized terms such as drug names, medical terminology, engineering terms or the like. As a technology, it’s been around for a while, especially in the publishing arena, and in many ways it is one of the more rudimentary (and foundational) pieces of both text analytics and semantic processing.

However, it’s important to understand that enrichment by itself is not a panacea. For starters, we humans are remarkably adept at being imprecise, and this becomes more important when you start dealing not with individual words but with phrases and titles. You see this especially with government titles. Consider for instance “Ambassador John J. Smith, III, Senior Assistant Undersecretary for European Affairs, U.S. Department of State”. A surprising number of enrichment engines will parse this as

<person>Ambassador John J. Smith</person>,<number>III</number>, <person>Senior</person> <title>Assistant Undersecretary</title> for <location>European</location> <organization>Affairs</organization>, <location>U.S.</location> <organization>Department of State</organization>

or some similar construct, rather than the one that humans could probably pick out as:

<honorific>Ambassador</honorific> <person>John J. Smith, III</person>, <title>Senior Assistant Undersecretary for European Affairs</title>, <organization>U.S. Department of State</organization>

The accuracy rate is getting better. Even the available open source tools such as ANNIE or LingPipe will generally have about a 30-40% chance of getting a mouthful like that properly categorized, and commercial products are usually (though not always) better, but this still translates into an abysmally low success rate. Ironically, specialized vocabulary filters usually do considerably better, if only because technical terms are more regular in their usage and context, but the stakes are also higher there.

However, even given this, entity extraction really buys you fairly little unless you also have a context that the article is placed in, both at the macro level and at the micro level. The computer knows nothing about Johnny Depp. Internally, the term is a sequence of eleven characters, counting the white space, that happens to fit enough of a profile that it can be categorized in a bucket called “person”. However, the computer does know that there are a number of “person” objects in the document where they were found. This document is a context or resource (at this point, the Semantic Web people start jumping around).

Documents can be categorized in a number of different ways. While it’s not uncommon for data systems professionals to break things down by implementation type (web page, text processing document, spreadsheet, presentation, etc.), in reality the same document could just as readily be a “Newspaper Article” or a “Movie Review” as an HTML page. In an XML database such as MarkLogic, such a document could be contained simultaneously in three overlapping collections, each of which is ultimately a reflection of a different orthogonal classification system. Johnny Depp being referenced as a person in a web page tells you very little. Johnny Depp referenced as a person in a “Movie Review”, however, tells you much, much more, because movie reviews have definite structures, roles and relationships.

A movie review is (for discussion purposes) about a single movie, which has an associated title and also likely has an Internet Movie Database (IMDB) entry, with an associated URL. The review also has an associated URL. The URL for the review can be taken as a proxy for that review, just as the URL for the IMDB entry can be taken as a proxy for the movie itself. What this means in practice is that, because the movie review provides a context for the enriched terms in it, it becomes possible to retrieve information about the article that isn’t in the article itself.

If the article is about “Pirates of the Caribbean: On Stranger Tides”, which we’ll assume here has been classified as a <media_title> internally, then this can be used by an XQuery processor as a key to look up an associated IMDB entry. If it finds one and returns it as an XML structure (or an RDF structure, which I’ll get to in a second), then the IMDB entry might also have multiple <also_known_as> blocks, such as <also_known_as>POTC:OST</also_known_as> or <also_known_as>Pirates of the Caribbean IV</also_known_as>. This is important, because enrichment is seldom a single process; rather, it is a recursive set of refinements. An XQuery script could take the article, locate all instances of POTC:OST, and not only identify them as alternative names but also add a pointer to the IMDB proxy in each case. Similarly, the script could identify that Johnny Depp is an actor, that he stars in the movie, and that he stars as the pirate “Jack Sparrow”; consequently, Jack Sparrow can also carry a pointer back to a proxy representing the character. Moreover, it can also break the term apart and find all naked references to “Jack” or “Sparrow” and do the same thing.
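
The alias pass described above is easy to prototype outside the database. The following is only a minimal Python sketch, not the XQuery a production system would use: the MOVIE record, its example.org URI and the ref attribute are all hypothetical stand-ins for whatever an IMDB lookup would actually return.

```python
import re

# Hypothetical lookup result; a real system would fetch this from IMDB
# or from a cached copy in the database.
MOVIE = {
    "uri": "http://example.org/movie/potc4",
    "title": "Pirates of the Caribbean: On Stranger Tides",
    "also_known_as": ["POTC:OST", "Pirates of the Caribbean IV"],
}


def tag_aliases(text, movie):
    """Wrap the title and every alias in a <media_title> pointing at the proxy URI."""
    names = [movie["title"], *movie["also_known_as"]]
    # Try longer names first so a short alias never pre-empts a longer one.
    names.sort(key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return pattern.sub(
        lambda m: '<media_title ref="{}">{}</media_title>'.format(
            movie["uri"], m.group(0)
        ),
        text,
    )


print(tag_aliases("POTC:OST opens Friday.", MOVIE))
```

A second pass of the same shape, driven by cast data instead of titles, would handle “Jack Sparrow” and the bare “Jack” or “Sparrow” references.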

This is where entity enrichment begins to gain value. By the time the article has been processed, it knows much more about itself. It can point to a common object that represents the focus of the article, which means that a search can be made based upon the IMDB entry and all articles about POTC4 can be found within the database. The relationships between entries can be made. If the articles are all part of a central database, it also means that different reviews with similar rating systems could provide a more universal “rating” about the quality of the movie. It also means that since there are relationships that exist outside of the article, it becomes possible to pull together reviews not only about POTC4 but about all four Pirates movies, as suggestions.

It’s possible that the data coming from IMDB is in RDF format – essentially in a format where there are a number of very simple assertions that are made, and relationships between these assertions are defined. These assertions can be extracted from the RDF (or, with a little more foreknowledge, from XML or other wire formats) and used to make various relationships within a canonical reference (such as an IMDB) easy to extract and compare.
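
To make “simple assertions” concrete, here is a small Python sketch of the idea, assuming the assertions have already been reduced to subject, predicate, object triples. Every identifier in it (movie:potc4, prop:stars and so on) is invented for illustration; real IMDB or RDF data would use its own URIs and vocabulary.

```python
# Hypothetical triples; each one is a single (subject, predicate, object) assertion.
TRIPLES = [
    ("movie:potc4", "rdf:type", "class:Movie"),
    ("movie:potc4", "prop:title", "Pirates of the Caribbean: On Stranger Tides"),
    ("movie:potc4", "prop:stars", "person:johnny-depp"),
    ("person:johnny-depp", "prop:plays", "character:jack-sparrow"),
    ("character:jack-sparrow", "prop:appearsIn", "movie:potc4"),
]


def objects(subject, predicate, triples=TRIPLES):
    """Return every object asserted for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def characters_in(movie):
    """Follow stars -> plays, keeping only characters that appear in this movie."""
    found = []
    for actor in objects(movie, "prop:stars"):
        for character in objects(actor, "prop:plays"):
            if movie in objects(character, "prop:appearsIn"):
                found.append((actor, character))
    return found


print(characters_in("movie:potc4"))
```

The point of the sketch is that relationships fall out of chaining very small lookups, which is what makes a canonical reference easy to compare against.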

One of the most significant realizations that are being made today is that in many respects the more significant data queries are not those within individual documents (or even collections of documents) but instead are those between documents. Document enrichment helps to bootstrap that process, making it easier to identify potential keys, but that enrichment must be done with knowledge of the appropriate context.

Moreover, as these documents gain more “self-referential” information, they can in turn become a canonical reference source themselves. IMDB is not likely to contain viewer expectations or ratings, but the movie reviews would contain these things, and as such can be used by document analytics tools to determine not only the critical reception of a movie but also, through deeper analytics, what within a given movie or set of movies most captured the audience’s expectations. (By the way, if this sounds a lot like the Rotten Tomatoes site, you have a pretty good idea about how such a service could be implemented now.)

Semantic Web technologies are more than just arcane terms such as acyclic graphs, RDF, turtle notation, n-tuple pairs, OWL and SPARQL. Indeed, my personal feeling is that the emphasis on these particular tools has had an overall negative impact upon the adoption of the Semantic Web technologies. Ultimately SemWeb is just the process of making resources (documents) both more self-aware and more externally aware of their context(s) in the world. You can do SemWeb without the above, though as you get deeper into the space these tools do let you do much more. Ultimately, SemWeb is about the relationship, and any tool that will help you get there can only do you good.

Kurt Cagle is an Information Architect for Avalon Consulting, LLC, specializing in XML data architecture, information management and the Semantic Web.

Bad Call? Microsoft Buys Skype

May 10th, 2011 by Kurt Cagle

Software titan Microsoft just purchased Skype, whose VOIP-based services have made it one of the largest players in the web telecommunications space. The deal, for $8.5 billion in cash, provides a major benefit to Microsoft, which has struggled to remain competitive with its Live Meeting offerings, and significantly expands its consumer base. It also indirectly benefits Facebook, a Microsoft investee: by marrying Skype capabilities with Facebook’s core systems, Facebook can get a significant leg up on phone connectivity between its members, considerably expanding its standing as a social communications medium.

For Skype, the acquisition by Microsoft also places the company into a position where it can expand its offerings into the enterprise space that Microsoft has a major presence in, a market that Skype had difficulty penetrating before. This in turn provides a direct challenge both to Google, which has been trying to expand its Google Voice offerings to do the same (in conjunction with Google Docs and its email services), and to Cisco’s enterprise VOIP and virtual meeting software and hardware.

What I find intriguing about this particular buyout is that Skype will in effect become a separate division of Microsoft, one reporting directly to Steve Ballmer. Not only does this put one of their divisions almost completely in Silicon Valley, where Microsoft has had but a token presence until now, but it also emphasizes the underlying realization by the company that VOIP has become a major pillar and differentiator for the largest software concerns, and needs to be treated as more than a minor offshoot of their office strategy.

This last year has also seen an increasing validation of a strategy to try to keep companies intact with only a secondary ancillary branding as a Microsoft entity. This can be seen in Facebook, which bears little outward mark of being a Microsoft-invested company, and ironically, if (and it’s a big if) Microsoft can get Facebook and Skype to play well together, there may be some advantages to be had.

At the same time, the acquisition of Skype may also be a case of Ballmer chasing after brand and market share rather than technology, an approach which has burned him more than once. VOIP is reasonably well understood at this stage, and Skype’s been sold once before because it couldn’t make the revenue match predictions (it was losing money in its PC to PC communications, which of course was its prime attraction).

Admittedly, it was still outcompeting Live Meeting, but there’s a major question about whether the effort to integrate Skype into the Microsoft line-up (and the costs attendant with any such reorganization) may ultimately make this a losing proposition. If Microsoft reduces its service offerings there, it also reduces the appeal of the Skype service, and given the fairly mature state of the VOIP market, the primary paying customers may very well end up sticking with their dedicated providers.

In the end, I suspect that this will be modestly successful, but not a major game changer. The buyout helps Microsoft recover lost market share in a critical market if the integration remains minimal, but if Ballmer tries to bring Skype too much into the Microsoft fold he risks both customer and employee defections. Moreover, while VOIP should be a major part of the strategy for a company the size of Microsoft, it may also prove a distraction from those areas where it needs to be far more focused, such as the related mobile market space. Even with a fair amount of cash still on the books, the cost of acquisition and integration is going to eat up a not insignificant part of that at a time when other markets are likely to be more profitable long term.

Moreover, there’s the question of whether this deal was motivated more by the need for Ballmer to show himself as being aggressive in the marketplace than by the interests of the stockholders. Ballmer has been far less aggressive in the market space than Gates was, and has often been swayed more by the desire to get the hottest properties than the ones that made the most competitive sense for the company. Skype was an old maid: it had been sitting out in the marketplace for a while, represents older technologies, and really was most valuable for its installed customer base, most of whom were looking to pay as little as possible to use its services. This doesn’t really bode well for Ballmer moving forward. Indeed, it may prove to be the final misstep in a series of questionable buyouts and investments.

Building Dynamic BBCode Into Your XML Blogs

April 18th, 2011 by Kurt Cagle

I love XML. Sometimes, however, XML can become rather cumbersome to work with. For instance, try writing a book on HTML and SVG using DocBook sometime (something I’m currently undertaking for O’Reilly Press). The number of times that you have to work with &lt; content in this context can easily make you want to tear your hair out at the mere sight of an angle bracket.

Over the years, a number of interesting alternative solutions have begun to emerge. One of them, Bulletin Board Code or BBCode, started out as a simple replacement scheme. For example, a simple BBCode effect would be to replace the angle brackets around a bold or italic tag with square brackets “[” and “]”, such as:

This is a [b]bold[/b] comment that you're making.

The advantage of such an approach is that you can actually represent angle brackets (<>) in your code and have them not interpreted as the start of a tag.

However, once you start creating such a tag set, it’s a short step from there to making more comprehensive macros. A common one is the [url] tag, in which you can use the regularity of patterns to convert such a BBCode entry into the corresponding link tag:

Please check out [url http://www.avalonconsult.com|Avalon Consult]

In this case, a regular expression is used to parse the string and then insert the corresponding match patterns into an anchor tag. The regular expression itself is fairly simple:

\[url\s+(.*?)\|(.*?)\]

With XQuery the fn:replace() function is used to search for a match then replace that match with a replacement string, likely something like:

<a href="$1" target="_new">$2</a>

In this case, $1 and $2 correspond to the first and second parenthesized groups in the search pattern. The function substitutes the replacement string for each match it finds in the text, filling in the captured groups in the process.
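
For readers who want to try the pattern without an XQuery processor, the same substitution can be sketched in Python. This is an illustration of the technique rather than the blog tool’s code; expand_urls is a hypothetical helper name, and \1 and \2 are simply Python’s spelling of the $1 and $2 references.

```python
import re

# Same pattern as above: group 1 is the address, group 2 the label.
URL_WITH_LABEL = re.compile(r"\[url\s+(.*?)\|(.*?)\]")


def expand_urls(text):
    """Replace every labelled [url ...] tag with an anchor element."""
    return URL_WITH_LABEL.sub(r'<a href="\1" target="_new">\2</a>', text)


print(expand_urls("Please check out [url http://www.avalonconsult.com|Avalon Consult]"))
# prints: Please check out <a href="http://www.avalonconsult.com" target="_new">Avalon Consult</a>
```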

Regular expression matching and replacement has long been a powerful tool for parsing and processing text. Indeed, most contemporary parsers rely heavily on such regular expression (regex) tools in taking script code and converting it into final code. What’s particularly powerful about such tools is that they can be applied globally, and can work on every single instance they match. For instance, the following illustrates how the [i][/i] tagset works across all of the matches in a string independently:

let $statement := "This is the [i]first[/i] phrase of content, and this is the [i]second[/i] phrase."
    return fn:replace($statement,"\[i\](.*?)\[/i\]","<i>$1</i>")

This will return the string:

This is the <i>first</i> phrase of content, and this is the <i>second</i> phrase.

For simple content, then, you could easily create a regular expression profile document and a routine that would let you apply multiple different regexes to the same document. Each regular expression in this profile is a rule: if the regex matches within the text being processed, the corresponding replacement string is applied. One example of such a profile document might look like Listing 1.

<filtersets>
  <filterset name="Filtered HTML" id="filtered-html">
    <filter name="Bold"
         description="Converts contained content into strong output."
         match="\[b\](.*?)\[/b\]">
         <![CDATA[<strong>$1</strong>]]>
    </filter>
    <filter name="Italic"
        description="Converts contained content into emphasized output."
        match="\[i\](.*?)\[/i\]">
             <![CDATA[<em>$1</em>]]>
    </filter>
    <filter name="URL With Label"
        description="Creates a URL map with a label"
        match="\[url\s+(.*?)\|(.*?)\]">
           <![CDATA[<a href="$1" target="new">$2</a>]]>
    </filter>
    <filter name="URL With No Label"
        description="Creates a URL map with the URL as label"
        match="\[url\s+(.*?)\]">
           <![CDATA[<a href="$1" target="new">$1</a>]]>
    </filter>
 </filterset>
</filtersets>
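
The routine that walks such a profile is short. The following Python sketch assumes the four rules of the “Filtered HTML” filterset have already been read into a list (FILTERED_HTML and apply_filterset are hypothetical names); it is meant to show the rule-by-rule flow, not the XQuery implementation.

```python
import re

# Mirrors the "Filtered HTML" filterset: (name, match pattern, replacement).
# Order matters: the labelled URL rule must run before the unlabelled one.
FILTERED_HTML = [
    ("Bold", r"\[b\](.*?)\[/b\]", r"<strong>\1</strong>"),
    ("Italic", r"\[i\](.*?)\[/i\]", r"<em>\1</em>"),
    ("URL With Label", r"\[url\s+(.*?)\|(.*?)\]", r'<a href="\1" target="new">\2</a>'),
    ("URL With No Label", r"\[url\s+(.*?)\]", r'<a href="\1" target="new">\1</a>'),
]


def apply_filterset(text, filterset):
    """Apply each rule in turn; every rule sees the output of the previous one."""
    for _name, pattern, replacement in filterset:
        text = re.sub(pattern, replacement, text)
    return text


print(apply_filterset("[b]Bold[/b] [url http://y.org|Y] [url http://x.org]", FILTERED_HTML))
```

One caveat worth testing in any implementation: because (.*?) will match across a closing bracket, an unlabelled [url] tag that appears before a labelled one can be swallowed by the labelled rule on its way to the next “|”. A stricter group such as [^|\]]*? is one way to avoid that.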

By designating different filtersets, you can provide alternative successive refinements. One such filterset alternative might be used in the situation where the original content came from someone typing BBCode in, while a second is used just to convert carriage returns into line breaks or paragraph markers. This is an approach that’s especially important when dealing with rich text editors in a browser that are optional – one alternative might be used in the case where the text editor is used, another for where the rich text editor is not used, but a BBCode editor is used instead.

A more complex scenario arises when the BBCode in question is not used for a plain regular expression replacement, but instead is used to drive an external function, such as an XQuery call. I encountered this particular scenario recently when working on a blog tool that could be used to insert images that were referenced from an XML file. A similar situation may arise with a particular profile in which the editor of the blog wishes to evaluate an XQuery script inline, perhaps in order to run a query and insert the results in a document.

The difficulty with the fn:replace function is that it is fully self contained – while it will replace all of the regexes encountered in a sequence of text, there’s no real way with that function to use it to extract arguments and pass these arguments on to an external function.

Fortunately, MarkLogic offers an alternative approach with the fn:analyze-string() function. It will be familiar to XSLT 2.0 developers, who use the xsl:analyze-string instruction to solve a similar problem in transformations, and MarkLogic exposes it as an extension to the XQuery API as well.

As an example, suppose that you had a block of text that included a BBCode tag called [image] that would provide two arguments – a number that was an index to a list of image files within an XML document, and a style string that would be used to set the CSS style for the image itself. While the second requirement is a simple search and replace, the first is far more complex, since it entails knowing the details about the XML document that contains the associated text. In short, it’s something that requires an XQuery to evaluate.

Suppose that you had a string that contained two such references:

[image 1|float:left]The image contained herein shows a Greater Thrushbelly Warbler, a rare species of Warbler found almost exclusively within the forests of Patagonia, and is considered highly endangered if not extinct through most of its former habitat, shown in Figure 2.[image 2|float:right;margin-left:0.25in].

The analyze-string function breaks the string down into the segments that match the given regular expression and those that don’t, and returns an XML structure that exposes the captured groups of each match:

    let $sourceStr := "[image 1|float:left]The image contained herein shows a Greater Thrushbelly Warbler, a rare species of Warbler found almost exclusively within the forests of Patagonia, and is considered highly endangered if not extinct through most of its former habitat, shown in Figure 2.[image 2|float:right;margin-left:0.25in]."
    (: two capture groups: the image index and the CSS style string :)
    let $regex := "\[image\s+(.+?)\|(.+?)\]"
    let $an-str := fn:analyze-string($sourceStr, $regex)
    return $an-str

The resulting analysis XML looks as follows:

<s:analyze-string-result
    xmlns:s="http://www.w3.org/2009/xpath-functions/analyze-string">
    <s:match>[image <s:group nr="1">1</s:group>|<s:group nr="2">float:left</s:group>]</s:match>
    <s:non-match>The image contained herein shows a Greater Thrushbelly Warbler, a rare species of
    Warbler found almost exclusively within the forests of Patagonia, and is considered highly
    endangered if not extinct through most of its former habitat, shown in Figure
    2.</s:non-match>
    <s:match>[image <s:group nr="1">2</s:group>|<s:group nr="2">float:right;margin-left:0.25in</s:group>]</s:match>
    <s:non-match>.</s:non-match>
</s:analyze-string-result>

The <analyze-string-result> parent thus contains a sequence of non-matches and matches. A match in turn will contain a mixed content sequence of groups and text nodes, where each group corresponds to a match group from the regular expression. By passing the match elements to a function, that function can retrieve the corresponding groups’ strings as parameters for additional processing and replace the whole match with a corresponding sequence of nodes.
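
The match and non-match walk can be mimicked in a few lines of Python, which may help when reasoning about the XQuery version. This is a sketch under stated assumptions: the IMAGES list is a hypothetical stand-in for the image list in the record’s XML document, and render_image stands in for whatever function the matched groups are handed to.

```python
import re

IMAGE_TAG = re.compile(r"\[image\s+(.+?)\|(.+?)\]")

# Hypothetical stand-in for the image list held in the record's XML document.
IMAGES = ["/images/warbler.jpg", "/images/habitat-map.png"]


def render_image(index, style):
    """Turn the two captured groups into an <img> element (the index is 1-based)."""
    return '<img src="{}" style="{}"/>'.format(IMAGES[int(index) - 1], style)


def expand_images(text):
    """Copy non-match segments through and replace each match with a function call."""
    parts, cursor = [], 0
    for match in IMAGE_TAG.finditer(text):
        parts.append(text[cursor:match.start()])      # non-match, kept as-is
        parts.append(render_image(*match.groups()))   # match, groups become arguments
        cursor = match.end()
    parts.append(text[cursor:])                        # trailing non-match
    return "".join(parts)


print(expand_images("[image 1|float:left]A warbler, see Figure 2.[image 2|float:right]."))
```

The important difference from a plain replace is that the captured groups are passed to arbitrary code, so the replacement can depend on data the regular expression knows nothing about.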

In other words, by making use of the fn:analyze-string() function, it is possible to create macros within text that can invoke XQuery functions and return evaluated results. This is the basis for the bbcode module and its single function, bbcode:parse:

module namespace bbcode = "http://www.xmltoday.org/xmlns/bbcode";
declare namespace my = "http://www.xmltoday.org/xmlns/my";
declare variable $bbcode:uri := "/core/xrx/bbcode.xml";

declare function bbcode:parse(
    $text-str as xs:string,
    $filterset as xs:string,
    $propmap as item()?,
    $record as node()?) as xs:string{
    let $bbcode-doc := fn:doc($bbcode:uri)
    let $filters := $bbcode-doc//*:filterset[@id=$filterset]/*:filter
    let $map := map:map()
    (:let $text-str := fn:concat("<div>",$text-str,"</div>") :)
    let $map-assign := map:put($map,"text",$text-str)
    let $filter-op := for $filter in $filters return
        let $match := $filter/@match/fn:string(.)
        let $text := map:get($map,"text")
        return
        if (fn:matches($text,$match)) then
            if (fn:string($filter/@scope)='global') then
                map:put($map,"text",fn:replace($text,$match,$filter/fn:string(.)))
            else if ($filter/@type/fn:string(.) = "xquery") then
                let $analysis := fn:analyze-string(map:get($map,"text"),$match)
                let $processed-terms := for $term in $analysis/* return
                    if (fn:local-name($term) = "match") then
                        let $expr := fn:string($term)
                        let $xquery := if ($filter/*:xquery) then
                            fn:normalize-space($filter/*:xquery/fn:string(.))
                        else fn:string-join($term/*:group/fn:string(.),' ')
                        let $params-node := fn:analyze-string($expr,$match)
                        let $variable-decl := '
        declare variable $my:params-node external;
        declare variable $my:params := $my:params-node/*:match[1]/*:group/fn:string(.);'
                        let $filter-eval-str := fn:concat('
       declare namespace my = "http://www.xmltoday.org/xmlns/my";
       declare variable $my:propmap external;
       declare variable $my:record external;',
       $variable-decl,$xquery)
                        let $evaled-term := xdmp:eval($filter-eval-str,
                            ((fn:QName("http://www.xmltoday.org/xmlns/my","propmap"),$propmap),
                             (fn:QName("http://www.xmltoday.org/xmlns/my","record"),$record),
                             (fn:QName("http://www.xmltoday.org/xmlns/my","params-node"),$params-node)
                            ))
                        return $evaled-term
                    else
                        $term
                let $processed-string := fn:string-join($processed-terms,"")
                return map:put($map,"text",xdmp:quote($processed-string))
            else
                map:put($map,"text",fn:replace(map:get($map,"text"),$match,$filter/fn:string(.)))
        else ()
    return map:get($map,"text")
    };

This function makes use of MarkLogic map objects to store the string for successive evaluations, and depends upon an external bbcode.xml document that contains the various filterset alternatives. It differentiates between normal regular expression statements and XQuery evaluation statements in the bbcode.xml file by adding the attribute type="xquery" to the filter element and wrapping the XQuery expression in an <xquery> block, such as these two filters for handling images:

    <filter name="Image Insert"
        description="Inserts the indexed image"
        match="\[image\s+([0-9][0-9]?)\]"
        type="xquery">
        <xquery><![CDATA[
let $index := xs:integer($my:params[1])
let $image := <img src="{$my:record/*:images/*:image[$index]/fn:string(.)}" style="width:320px;float:right;"/>
return xdmp:quote($image)
        ]]></xquery>
    </filter>
    <filter name="Image Insert 2"
        description="Inserts the indexed image with style"
        match="\[image\s+([0-9][0-9]?)\|(.*?)\]"
        type="xquery">
        <xquery><![CDATA[
            let $index := xs:integer($my:params[1])
            let $style := $my:params[2]
            let $image := <img src="{$my:record/*:images/*:image[$index]/fn:string(.)}" style="{$style}"/>
            return xdmp:quote($image)
        ]]></xquery>
    </filter>

The function also takes two additional arguments. The first is a map that can be passed directly to the invoking function for passing environmental variables (accessible as $my:propmap in the called function), the second is the record node that the string came from (accessible as $my:record). The two arguments can be empty sequences as well. Additionally, the parsed parameters from the regex are passed as a sequence called $my:params, which can be accessed via index (e.g., $my:params[1] is the first matched group in the match sequence).

Another filter that can use this principle is a generic [xquery] filter that can evaluate an XQuery expression inline and return a sequence of nodes converted into a string, as follows:

  <filter name="XQueryEval"
     description="Evaluates the expression contained in the body content."
     match="\[xquery\](.*?)\[/xquery\]" type="xquery"/>

This would be used to do things like query the database and return the result as a list within body content:

<h2>Listing</h2>
[xquery]<ul>{for $item in collection("myDocs") return <li>{$item/title}</li>}</ul>[/xquery]

This should only be used in situations where the authoring context is secure, as it obviously has implications for the security of the database itself.

The combination of regular expression replacement and XQuery evaluation is a powerful one. It can simplify internal content development, provide extensive macro facilities to authors and at the same time provide a way to filter out potentially unsafe content (such as <script> tags) in an automatic, transparent manner, and as such can prove to be a useful tool in a programmer’s toolkit.