Building Dynamic BBCode Into Your XML Blogs

I love XML. There are times, however, when XML becomes rather cumbersome to work with. Try writing a book on HTML and SVG using DocBook, for instance (something I’m currently undertaking for O’Reilly Media). The number of times you have to escape a < character in that context can easily make you want to tear your hair out at the mere sight of an angle bracket.

Over the years, a number of interesting alternative solutions have begun to emerge. One of them, Bulletin Board Code or BBCode, started out as a simple replacement scheme. For example, a simple BBCode effect would be to replace the angle brackets around a bold or italic tag with square brackets “[” and “]”, such as:

This is a [b]bold[/b] comment that you're making.

The advantage of such an approach is that you can represent angle brackets (<>) in your text without having them interpreted as the start of a tag.

However, once you start creating such a tag set, it’s a short step from there to making more comprehensive macros. A common one is the [url] tag, where the regularity of the pattern lets you convert a BBCode entry into the corresponding link tag:

Please check out [url http://www.avalonconsult.com|Avalon Consult]

In this case, a regular expression is used to parse the string and then insert the corresponding match patterns into an anchor tag. The regular expression itself is fairly simple:

\[url\s+(.*?)\|(.*?)\]

With XQuery, the fn:replace() function searches for each match and replaces it with a replacement string, likely something like:

<a href="$1" target="_new">$2</a>

In this case, the $1 and $2 references correspond to the first and second parenthesized groups in the search pattern. The function replaces each match in the source text with the replacement string, substituting the text captured by each group along the way.
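As a minimal sketch of that replacement in action, the following expression applies the [url] pattern to the sample sentence above. The variable names $comment and $pattern are illustrative and not part of any module.

```xquery
(: Illustrative names only: $comment and $pattern are assumptions. :)
let $comment := "Please check out [url http://www.avalonconsult.com|Avalon Consult]"
let $pattern := "\[url\s+(.*?)\|(.*?)\]"
(: $1 receives the URL group, $2 the label group. :)
return fn:replace($comment, $pattern, '<a href="$1" target="_new">$2</a>')
```

The result is a string, not a node: Please check out <a href="http://www.avalonconsult.com" target="_new">Avalon Consult</a>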

Regular expression matching and replacement has long been a powerful tool for parsing and processing text; indeed, many contemporary parsers lean heavily on regular expression (regex) tools when converting script code into final code. What’s particularly powerful about such tools is that they can be applied globally, working on every instance they match. For instance, the following illustrates how the [i][/i] tag set works across all of the matches in a string independently:

let $statement := "This is the [i]first[/i] phrase of content, and this is the [i]second[/i] phrase."
return fn:replace($statement,"\[i\](.*?)\[/i\]","<i>$1</i>")

This will return the string:

This is the <i>first</i> phrase of content, and this is the <i>second</i> phrase.

For simple content, then, you could easily create a regular expression profile document and a routine that lets you apply multiple regexes to the same document. Each regular expression in this profile is a rule: if the regex matches anywhere in the text being processed, the corresponding replacement string is applied. One example of such a profile document might look like Listing 1.

<filtersets>
  <filterset name="Filtered HTML" id="filtered-html">
    <filter name="Bold" 
         description="Converts contained content into strong output." 
         match="\[b\](.*?)\[/b\]">
         <![CDATA[<strong>$1</strong>]]>
    </filter>
    <filter name="Italic" 
        description="Converts contained content into emphasized output." 
        match="\[i\](.*?)\[/i\]">
             <![CDATA[<em>$1</em>]]>
    </filter>
    <filter name="URL With Label" 
        description="Creates a URL map with a label" 
        match="\[url\s+(.*?)\|(.*?)\]">
           <![CDATA[<a href="$1" target="new">$2</a>]]>
    </filter>
    <filter name="URL With No Label" 
        description="Creates a URL map with the URL as label" 
        match="\[url\s+(.*?)\]">
           <![CDATA[<a href="$1" target="new">$1</a>]]>
    </filter>
 </filterset>
</filtersets>
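The profile in Listing 1 could be driven by a small recursive routine along the following lines. This is only a sketch, not the module presented later in this article: local:apply-filters is a hypothetical helper, the filterset is inlined rather than loaded from a document, and fn:normalize-space() is used here to trim the indentation around each replacement template.

```xquery
(: Hypothetical helper: applies each filter in turn to the running text. :)
declare function local:apply-filters(
    $text as xs:string,
    $filters as element(filter)*) as xs:string {
    if (fn:empty($filters)) then $text
    else
        let $filter := $filters[1]
        (: Trim the whitespace that surrounds the template in the profile. :)
        let $replacement := fn:normalize-space(fn:string($filter))
        let $next := fn:replace($text, fn:string($filter/@match), $replacement)
        return local:apply-filters($next, fn:subsequence($filters, 2))
};

let $filterset :=
    <filterset id="filtered-html">
        <filter name="Bold" match="\[b\](.*?)\[/b\]"><![CDATA[<strong>$1</strong>]]></filter>
        <filter name="Italic" match="\[i\](.*?)\[/i\]"><![CDATA[<em>$1</em>]]></filter>
    </filterset>
return local:apply-filters("A [b]bold[/b] and [i]italic[/i] remark.", $filterset/filter)
```

This evaluates to the string A <strong>bold</strong> and <em>italic</em> remark. Rules run in document order, which is why Listing 1 places "URL With Label" ahead of "URL With No Label": the second pattern would otherwise swallow the label as part of the URL.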

By designating different filtersets, you can provide alternative or successive refinements. One filterset might be used where the original content came from someone typing BBCode in, while a second is used just to convert carriage returns into line breaks or paragraph markers. This is especially important when a rich text editor in the browser is optional: one filterset can handle content produced by the rich text editor, another content produced by a BBCode editor instead.
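A filterset of the second kind can be very small. As a rough sketch (the variable name is illustrative), a single rule that turns line endings into line breaks reduces to one fn:replace() call, and the same pattern could be stored as the match attribute of a filter:

```xquery
(: &#10; is a character reference for a newline inside the string literal. :)
let $plain := "First line&#10;Second line"
(: \r?\n matches both Unix and Windows line endings. :)
return fn:replace($plain, "\r?\n", "<br/>")
```

This yields First line<br/>Second line as a string.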

A more complex scenario arises when the BBCode in question is used not for evaluating and replacing regular expressions, but instead to drive an external function, such as an XQuery call. I encountered this particular scenario recently when working on a blog tool that could be used to insert images that were referenced from an XML file. A similar situation may arise with a particular profile in which the editor of the blog wishes to evaluate an XQuery script inline, perhaps in order to run a query and insert the results in a document.

The difficulty with the fn:replace() function is that it is fully self-contained: while it will replace every match of a regex in a block of text, there is no real way to use it to extract arguments and pass them on to an external function.

Fortunately, MarkLogic offers an alternative in the fn:analyze-string() function. It may be familiar to XSLT 2.0 developers, who use the xsl:analyze-string instruction to solve a similar problem in transformations, but it is in fact exposed as an extension to the XQuery API as well.

As an example, suppose that you had a block of text that included a BBCode tag called [image] that would provide two arguments – a number that was an index to a list of image files within an XML document, and a style string that would be used to set the CSS style for the image itself. While the second requirement is a simple search and replace, the first is far more complex, since it entails knowing the details about the XML document that contains the associated text. In short, it’s something that requires an XQuery to evaluate.

Suppose that you had a string that contained two such references:

[image 1|float:left]The image contained herein shows a Greater Thrushbelly Warbler, a rare species of Warbler found almost exclusively within the forests of Patagonia, and is considered highly endangered if not extinct through most of its former habitat, shown in Figure 2.[image 2|float:right;margin-left:0.25in].

The analyze-string function splits the string into the stretches that match the indicated pattern and the stretches that do not, and returns an XML structure that exposes the captured groups of each match:

    let $sourceStr := "[image 1|float:left]The image contained herein shows a Greater Thrushbelly Warbler, a rare species of Warbler found almost exclusively within the forests of Patagonia, and is considered highly endangered if not extinct through most of its former habitat, shown in Figure 2.[image 2|float:right;margin-left:0.25in]."
    let $regex := "\[image\s+(.+?)\|(.+?)\]"
    let $an-str := fn:analyze-string($sourceStr, $regex)
    return $an-str

The resulting analysis XML looks as follows (the line breaks inside the long non-match are added here for readability):

<s:analyze-string-result 
    xmlns:s="http://www.w3.org/2009/xpath-functions/analyze-string">
    <s:match>[image <s:group nr="1">1</s:group>|<s:group nr="2">float:left</s:group>]</s:match>
    <s:non-match>The image contained herein shows a Greater Thrushbelly Warbler, a rare species of
    Warbler found almost exclusively within the forests of Patagonia, and is considered highly
    endangered if not extinct through most of its former habitat, shown in Figure
    2.</s:non-match>
    <s:match>[image <s:group nr="1">2</s:group>|<s:group nr="2">float:right;margin-left:0.25in</s:group>]</s:match>
    <s:non-match>.</s:non-match>
</s:analyze-string-result>

The <s:analyze-string-result> parent thus contains a sequence of matches and non-matches. Each match in turn contains a mixed-content sequence of groups and text nodes, where each group corresponds to a capture group from the regular expression. By passing the match elements to a function, that function can retrieve the corresponding groups’ strings as parameters for additional processing and replace the whole match with a corresponding sequence of nodes.
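The idea of handing each match to a function can be sketched as follows. This is a simplified illustration, not the module shown later: local:render-image is a hypothetical handler, the data-index attribute is invented for the example, and the *: wildcard is used so the code does not depend on the namespace a particular processor assigns to the result elements. It assumes a processor that provides fn:analyze-string(), such as MarkLogic or an XQuery 3.0 engine.

```xquery
(: Hypothetical handler: builds replacement text from one match element. :)
declare function local:render-image($match as element()) as xs:string {
    let $index := fn:string($match/*:group[@nr = 1])
    let $style := fn:string($match/*:group[@nr = 2])
    return fn:concat('<img data-index="', $index, '" style="', $style, '"/>')
};

let $source := "[image 1|float:left]Caption text.[image 2|float:right]"
let $analysis := fn:analyze-string($source, "\[image\s+(.+?)\|(.+?)\]")
let $pieces :=
    for $part in $analysis/*
    return
        (: Matches go to the handler; non-matches pass through untouched. :)
        if (fn:local-name($part) = "match")
        then local:render-image($part)
        else fn:string($part)
return fn:string-join($pieces, "")
```

Because the handler is an ordinary function, it could just as easily look the index up in a document or run a query, which is exactly what a plain fn:replace() call cannot do.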

In other words, by making use of the fn:analyze-string() function, it is possible to create macros within text that can invoke XQuery functions and return evaluated results. This is the basis for the bbcode module and its single function, bbcode:parse:

xquery version "1.0-ml";
module namespace bbcode = "http://www.xmltoday.org/xmlns/bbcode";
declare namespace my = "http://www.xmltoday.org/xmlns/my";
declare variable $bbcode:uri := "/core/xrx/bbcode.xml";

declare function bbcode:parse(
    $text-str as xs:string,
    $filterset as xs:string,
    $propmap as item()?,
    $record as node()?) as xs:string{
    let $bbcode-doc := fn:doc($bbcode:uri)
    let $filters := $bbcode-doc//*:filterset[@id=$filterset]/*:filter
    let $map := map:map()
    (:let $text-str := fn:concat("<div>",$text-str,"</div>") :)
    let $map-assign := map:put($map,"text",$text-str)
    let $filter-op := for $filter in $filters return
        let $match := $filter/@match/fn:string(.)
        let $text := map:get($map,"text")
        return 
        if (fn:matches($text,$match)) then
            if (fn:string($filter/@scope)='global') then
                map:put($map,"text",fn:replace($text,$match,$filter/fn:string(.)))
            else if ($filter/@type/fn:string(.) = "xquery") then
                let $analysis := fn:analyze-string(map:get($map,"text"),$match)
                let $processed-terms := for $term in $analysis/* return
                    if (fn:local-name($term) = "match") then
                        let $expr := fn:string($term)
                        let $xquery := if ($filter/*:xquery) then
                            fn:normalize-space($filter/*:xquery/fn:string(.)) 
                        else fn:string-join($term/*:group/fn:string(.),' ')
                        let $params-node := fn:analyze-string($expr,$match)
                        let $variable-decl := '
        declare variable $my:params-node external;
        declare variable $my:params := $my:params-node/*:match[1]/*:group/fn:string(.);'                        
                        let $filter-eval-str := fn:concat('
       declare namespace my = "http://www.xmltoday.org/xmlns/my";
       declare variable $my:propmap external;
       declare variable $my:record external;',
       $variable-decl,$xquery)
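                        (: Variables are passed as a flat sequence of (QName, value) pairs, so an :)
                        (: empty $propmap or $record would drop out and misalign the pairs. :)
                        let $evaled-term := xdmp:eval($filter-eval-str, 
                            ((fn:QName("http://www.xmltoday.org/xmlns/my","propmap"),$propmap),
                             (fn:QName("http://www.xmltoday.org/xmlns/my","record"),$record),
                             (fn:QName("http://www.xmltoday.org/xmlns/my","params-node"),$params-node)
                            ))
                        return $evaled-term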
                    else
                        $term
                let $processed-string := fn:string-join($processed-terms,"")
                return map:put($map,"text",xdmp:quote($processed-string))
            else
                map:put($map,"text",fn:replace(map:get($map,"text"),$match,$filter/fn:string(.)))
        else ()
    return map:get($map,"text")
    };

This function makes use of MarkLogic map objects to store the string for successive evaluations, and depends on an external bbcode.xml document that contains the various filterset alternatives. It differentiates between plain regular expression filters and XQuery evaluation filters in the bbcode.xml file by adding the attribute type="xquery" to the filter element and wrapping the XQuery expression in an <xquery> block, as in these two filters for handling images:

    <filter name="Image Insert" 
        description="Inserts the indexed image" 
        match="\[image\s+([0-9][0-9]?)\]" 
        type="xquery">
        <xquery><![CDATA[
let $index := xs:integer($my:params[1])
let $image := <img src="{$my:record/*:images/*:image[$index]/fn:string(.)}" style="width:320px;float:right;"/>
return xdmp:quote($image)
        ]]></xquery>
    </filter>
    <filter name="Image Insert 2" 
        description="Inserts the indexed image with style" 
        match="\[image\s+([0-9][0-9]?)\|(.*?)\]" 
        type="xquery">
        <xquery><![CDATA[
            let $index := xs:integer($my:params[1])
            let $style := $my:params[2]
            let $image := <img src="{$my:record/*:images/*:image[$index]/fn:string(.)}" style="{$style}"/>
            return xdmp:quote($image)
        ]]></xquery>
    </filter>

The function also takes two additional arguments. The first is a map used to pass environmental variables through to the evaluated query (accessible there as $my:propmap); the second is the record node that the string came from (accessible as $my:record). The signature declares both as optional, although as the function is written an empty value would drop out of the flattened name/value sequence handed to xdmp:eval, so it is safest to supply both. Additionally, the parsed parameters from the regex are passed as a sequence called $my:params, which can be accessed via index (e.g., $my:params[1] is the first matched group in the match sequence).
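A call to the function might then look something like the sketch below. Several things here are assumptions rather than facts from the module: the import path /core/xrx/bbcode.xqy, the shape of the sample record, and the presence of the two image filters in the "filtered-html" filterset of bbcode.xml.

```xquery
xquery version "1.0-ml";
(: Assumption: the library module is installed at this path. :)
import module namespace bbcode = "http://www.xmltoday.org/xmlns/bbcode"
    at "/core/xrx/bbcode.xqy";

(: Hypothetical record: the image filters read $my:record/*:images/*:image. :)
let $record :=
    <post>
        <images>
            <image>/images/warbler.jpg</image>
        </images>
    </post>
(: An empty map stands in for any environment settings. :)
let $props := map:map()
return bbcode:parse("[image 1|float:left]A rare warbler.", "filtered-html", $props, $record)
```

Here the first group of the match becomes $my:params[1], which the filter casts to an integer and uses to pick the first image in the record.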

Another filter that can use this principle is a generic [xquery] filter that can evaluate an XQuery expression inline and return a sequence of nodes converted into a string, as follows:

  <filter name="XQueryEval" 
     description="Evaluates the expression contained in the body content." 
     match="\[xquery\](.*?)\[/xquery\]" type="xquery"/>

This would be used to do things like query the database and return the result as a list within body content:

<h2>Listing</h2>
[xquery]<ul>{for $item in collection("myDocs") return <li>{$item/title}</li>}</ul>[/xquery]

This should only be used in situations where the authoring context is secure, since anyone who can write an [xquery] block can run arbitrary queries against the database.

The combination of regular expression replacement and XQuery evaluation is a powerful one. It can simplify internal content development, provide extensive macro facilities to authors, and at the same time filter out potentially unsafe content (such as <script> tags) in an automatic, transparent manner. As such, it can prove to be a useful tool in a programmer’s toolkit.

About Kurt Cagle

Kurt Cagle is the Principal Evangelist for Semantic Technology with Avalon Consulting, LLC, and has designed information strategies for Fortune 500 companies, universities and Federal and State Agencies. He is currently completing a book on HTML5 Scalable Vector Graphics for O'Reilly Media.
