Using a Custom Facet to Aggregate Values

MarkLogic’s Search API makes it easy to create search facets: all you need to do is declare a range index on a specific element, then add an option to your search request specifying that the element should be used as a facet.

However, sometimes this standard approach is not good enough and you need to create a custom facet. I ran into this issue recently when I needed to transform the facet pictured on the left to the form shown on the right:

[Figure: Standard Facet (left) and Custom Facet (right)]

By way of background, I was building a system to track millions of digital assets for a publishing company. We used MarkLogic to store metadata about each digital asset, including its mime-type. A typical metadata document looked like this:

<asset>
   <asset-id>21ec2020-3aea-1069-a2dd-08002b30309d</asset-id>
   <description>Lorem ipsum dolor sit amet</description>
   <size>23552</size>
   <file-name>image1.tiff</file-name>
   <mime-type>image/tiff</mime-type>
</asset>

Users could search for digital assets by filename or keywords and then refine the results by clicking a checkbox to show only assets that matched specific mime-types. These mime-type facets were very easy to implement, but did not provide a very good user experience. In the first place, mime-types are not particularly meaningful. While most people can figure out that “application/pdf” refers to a PDF file, a mime-type like “application/vnd.openxmlformats-officedocument.wordprocessingml.document” is meaningless to almost everyone.

The other problem with using mime-type as a facet is the fact that it is too granular. Different versions of Excel have different mime-types, just as different image formats have their own mime-types. Most users, however, don’t care about these distinctions and simply want to lump assets together into broad buckets like “Spreadsheets” and “Images”.
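
For reference, the standard mime-type facet discussed above could be declared roughly as follows. This is a minimal sketch rather than the project's actual options: it assumes a string element range index already exists on the mime-type element (no namespace, default collation), and the search term "lorem" is just a placeholder.

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: Assumption: a string range index exists on <mime-type> (no namespace)
   using the collation named below. :)
let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <constraint name="mime-type">
      <range type="xs:string" facet="true"
             collation="http://marklogic.com/collation/">
        <element ns="" name="mime-type"/>
      </range>
    </constraint>
    <return-facets>true</return-facets>
  </options>

(: "lorem" is a placeholder search term :)
return search:search("lorem", $options)
```

The response contains one facet value per distinct mime-type, which is exactly the granularity problem described above.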

Rather than facet on mime-type, we really needed a simplified “File Type” facet that would group assets into broad categories. At first, I thought I could create this facet using the “bucket constraints” feature offered by the Search API. In essence, this feature lets you group values into ranges, such as 1 to 5, 6 to 10, etc. Unfortunately, bucketing assumes that the underlying values have an inherent, meaningful order, which mime-types lack: they are simply text, and sorting them alphabetically does not reliably place related types next to each other.
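
To illustrate what bucketing is good at, here is a sketch of a bucketed facet on the size element, whose values do have a natural order. The bucket names and byte boundaries are invented for illustration, and the example assumes an xs:unsignedLong range index on size.

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: Assumption: an xs:unsignedLong range index exists on <size>.
   Bucket names and boundaries are hypothetical. :)
let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <constraint name="size">
      <range type="xs:unsignedLong" facet="true">
        <element ns="" name="size"/>
        <bucket name="small" lt="102400">Under 100 KB</bucket>
        <bucket name="medium" ge="102400" lt="1048576">100 KB to 1 MB</bucket>
        <bucket name="large" ge="1048576">Over 1 MB</bucket>
      </range>
    </constraint>
  </options>

(: Return only the facet portion of the response :)
return search:search("lorem", $options)/search:facet
```

Each bucket is a contiguous range of values, which is why the same technique does not transfer cleanly to mime-types.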

As another option, I also considered adding a new element to the metadata, like this:

<asset>
   <asset-id>21ec2020-3aea-1069-a2dd-08002b30309d</asset-id>
   <description>Lorem ipsum dolor sit amet</description>
   <size>23552</size>
   <file-name>image1.tiff</file-name>
   <mime-type>image/tiff</mime-type>
   <file-type>Image</file-type>
</asset>

This is certainly a viable approach and one that I have used in the past. However, it bothered me that I had to update all my metadata records with information that essentially replicated what was already stored in the mime-type element. It seemed to me that there should be a way to create a facet that aggregated all the mime-type values at query-time.

As it turned out, there is a way to do this by using a custom facet. The MarkLogic documentation gives very detailed instructions on how to create a custom facet, so I won’t elaborate too much on the process here. Suffice it to say, I needed to write my own functions for parse-file-type(), start-file-type-facet(), and finish-file-type-facet().
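
As a rough sketch of how those three functions get wired into a search, the options below declare a custom constraint that points at them. The module path /lib/file-type-facet.xqy and the namespace http://example.com/file-type are hypothetical; the sketch assumes the functions are declared in a library module with that namespace.

```xquery
xquery version "1.0-ml";

import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: Assumption: the three functions live in a library module at the
   hypothetical path and namespace used in the ns/at attributes. :)
let $options :=
  <options xmlns="http://marklogic.com/appservices/search">
    <constraint name="file-type">
      <custom facet="true">
        <parse apply="parse-file-type"
               ns="http://example.com/file-type"
               at="/lib/file-type-facet.xqy"/>
        <start-facet apply="start-file-type-facet"
               ns="http://example.com/file-type"
               at="/lib/file-type-facet.xqy"/>
        <finish-facet apply="finish-file-type-facet"
               ns="http://example.com/file-type"
               at="/lib/file-type-facet.xqy"/>
      </custom>
    </constraint>
  </options>

(: "lorem" is a placeholder term; file-type:Images exercises the parse function :)
return search:search("lorem file-type:Images", $options)
```

The parse function handles query text such as file-type:Images, while the start and finish functions together produce the facet counts.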

In this blog entry, I will focus on start-file-type-facet(), which is called by the Search API every time a query is run and does the lion’s share of the work. Let’s assume that I was only interested in aggregating mime-types for Images and PDF assets:

declare function start-file-type-facet(
 $constraint as element(search:constraint),
 $query as cts:query?,
 $facet-options as xs:string*,
 $quality-weight as xs:double?,
 $forests as xs:unsignedLong*
) as item()*{
   (: Map of file type name to running count; every count starts at 0 :)
   let $m := map:map()
   let $_ :=  
      for $file-type in ("Images", "PDFs")
      return map:put($m,$file-type,0)
   (: Walk the mime-type values that occur in the current search :)
   let $_ :=
      for $value in 
         cts:element-values(
            xs:QName("mime-type"), 
            (), 
            $facet-options, 
            $query, 
            $quality-weight, 
            $forests
         )
      return
         (: Add each value's frequency to the file type it belongs to :)
         if (fn:starts-with($value, "image/"))
            then map:put(
               $m, 
               "Images", 
               map:get($m,"Images") + cts:frequency($value)
            )
         else if ($value eq "application/pdf")
            then map:put(
               $m,
               "PDFs", 
               map:get($m,"PDFs") + cts:frequency($value)
            )
         else ()

   (: Emit one element per file type with a non-zero count :)
   for $k in map:keys($m)
   return
      if (map:get($m,$k) eq 0) 
         then ()
      else <file-type name="{$k}" count="{map:get($m,$k)}"/>
};

The goal is to return separate counts of all the mime-types that represent either Images or PDFs. To hold these counts, the function declares a map with entries for “Images” and “PDFs” and initializes those values to “0”.

Next, the function retrieves all the mime-type values for the current search and iterates over them. If a value starts with “image/”, the function adds that value’s frequency, as reported by cts:frequency(), to the count kept in the map for “Images”. If the value matches the mime-type used for PDFs (“application/pdf”), the function adds its frequency to the “PDFs” count instead.

Once all the counts are collected, the function returns the information as an XML structure that is used to create the facet in the search results.
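
The XML returned by the start function is handed to the finish function, which turns it into the facet element that appears in the search response. The finish function is not shown in this post, so the following is only a sketch of what it might look like, assuming the start function returns file-type elements with name and count attributes and using a hypothetical module namespace.

```xquery
xquery version "1.0-ml";

(: Hypothetical library module namespace :)
module namespace ft = "http://example.com/file-type";

declare namespace search = "http://marklogic.com/appservices/search";

(: Converts the <file-type name="..." count="..."/> elements produced by
   the start function into a search:facet element. :)
declare function ft:finish-file-type-facet(
  $start as item()*,
  $constraint as element(search:constraint),
  $query as cts:query?,
  $facet-options as xs:string*,
  $quality-weight as xs:double?,
  $forests as xs:unsignedLong*
) as element(search:facet)
{
  element search:facet {
    attribute name { fn:string($constraint/@name) },
    for $file-type in $start
    return
      element search:facet-value {
        attribute name { fn:string($file-type/@name) },
        attribute count { fn:string($file-type/@count) },
        (: The text content doubles as the display label :)
        fn:string($file-type/@name)
      }
  }
};
```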

Aggregating additional mime-types can be accomplished by adding new if-then-else clauses. For example, to aggregate “Spreadsheets”, I need to add “Spreadsheets” to the list of file types initialized at the top of the function (otherwise map:get() returns an empty sequence and the addition yields nothing) and then add a clause like this:

...
else if ( $value = 
   ("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", 
   "application/vnd.ms-excel")
) then map:put($m,"Spreadsheets", map:get($m,"Spreadsheets") + cts:frequency($value))
...
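
As the list of categories grows, one possible variation is to move the mime-type rules into a small helper so that the counting loop never changes. This is a sketch of that idea, not the code used in the project: the ft:category() helper and the module namespace are hypothetical, and missing map entries default to 0 so no separate initialization step is needed.

```xquery
xquery version "1.0-ml";

(: Hypothetical library module namespace :)
module namespace ft = "http://example.com/file-type";

declare namespace search = "http://marklogic.com/appservices/search";

(: Hypothetical helper: maps one mime-type to its file type,
   or returns the empty sequence if the mime-type is not categorized. :)
declare function ft:category($mime-type as xs:string) as xs:string?
{
  if (fn:starts-with($mime-type, "image/")) then "Images"
  else if ($mime-type eq "application/pdf") then "PDFs"
  else if ($mime-type = (
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.ms-excel")) then "Spreadsheets"
  else ()
};

declare function ft:start-file-type-facet(
  $constraint as element(search:constraint),
  $query as cts:query?,
  $facet-options as xs:string*,
  $quality-weight as xs:double?,
  $forests as xs:unsignedLong*
) as item()*
{
  let $m := map:map()
  let $_ :=
    for $value in cts:element-values(
      xs:QName("mime-type"), (), $facet-options,
      $query, $quality-weight, $forests)
    let $category := ft:category($value)
    where fn:exists($category)
    return
      (: Treat a missing entry as 0, then add this value's frequency :)
      map:put($m, $category,
        (map:get($m, $category), 0)[1] + cts:frequency($value))
  (: Only categories that were actually seen have entries in the map :)
  for $k in map:keys($m)
  return <file-type name="{$k}" count="{map:get($m, $k)}"/>
};
```

With this arrangement, adding a new file type means adding one clause to ft:category() and nothing else.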

Creating a custom facet is not always a good idea since it adds to the complexity of your search code and can reduce the performance of your queries.  Fortunately, query performance was not a problem for my custom facet because my code only needed to examine mime-type values, which were available in an element range index. Furthermore, the effort of adding a custom facet seemed less than the effort of writing and debugging the scripts needed to add an additional element to all my metadata.
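
The range index mentioned above has to exist before cts:element-values() can be called on mime-type. It can be created in the Admin Interface, or scripted along the lines of the sketch below. The database name "assets" is hypothetical, and the collation shown is an assumption that should match whatever the search options use.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
(: String range index on <mime-type> in no namespace; the collation is an
   assumption, and range value positions are turned off. :)
let $index := admin:database-range-element-index(
  "string", "", "mime-type",
  "http://marklogic.com/collation/", fn:false())
(: "assets" is a hypothetical database name :)
let $config := admin:database-add-range-element-index(
  $config, xdmp:database("assets"), $index)
return admin:save-configuration($config)
```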

In general, I can think of two reasons for custom faceting. As in my case, custom faceting makes sense if your data does not have a single element or attribute with the value you need and you do not want to or cannot change your underlying data.

A second reason to custom facet is when the information you need is only available at query time. For example, consider an application that searches for news stories and lets you filter to reveal stories related to the top topics trending on Twitter. Obviously, you cannot know what is trending until the moment the query is run and so you need to create your facets with custom code.

You will need to look at your own data and search requirements to decide whether custom faceting is right for you. Note that you can often use “bucketed” facets to aggregate data if the values can be sorted. Take a close look at the MarkLogic documentation for the technical details you need for your implementation.

Demian Hess, Publishing Solutions Architect, Avalon Consulting, LLC

About Demian Hess

Demian Hess is Avalon Consulting, LLC's Director of Digital Asset Management and Publishing Systems. Demian has worked in online publishing since 2000, specializing in XML transformations and content management solutions. He has worked at Elsevier, SAGE Publications, Inc., and PubMed Central. After studying American Civilization and Computer Science at Brown University, he went on to complete a Master's in English at Oregon State University, as well as a Master's in Information Systems at Drexel University.
