Last month I posted a Market Basket Analysis example using Couchbase and Spark: Combining Operational and Analytical Big Data Using Couchbase and Spark: A Market Basket Analysis Example. I used Couchbase as an operational data store for market basket data, transported the data using the Couchbase Spark Connector, then did the MBA in Spark. It was a batch approach that updated a list of recommendations for all of our products all at once.
With the imminent GA release of Couchbase 4.0, I thought I would revisit this example with features available in the new release. There will be many new features in Couchbase 4.0 including the GA release of N1QL. N1QL has been available in an experimental fashion, but it will become available out of the box with Couchbase 4.0. Why use N1QL? Because taking advantage of this new query language will allow us to construct our product recommendations in an on-demand fashion using real-time data as opposed to batch data.
You can find the code for this example in our Github: https://github.com/Avalon-Consulting-LLC/couchbase-n1ql-mba. In it is a vagrant setup that will create a virtual machine with Couchbase installed and ready to go for you. Follow the instructions in the repo’s README to get setup.
The key part of this example is the N1QL query which we run in a parameterized fashion with $id being given the target product ID we are trying to get recommendations for:
SELECT item, COUNT(*) AS cnt FROM retail UNNEST items AS item WHERE item != $id AND $id IN retail.items GROUP BY item ORDER BY cnt DESC LIMIT 3;
The first major detail in the query is the unnest line. The unnest statement is essentially a SQL explode. It will take each item in the ‘items’ list of each row and join it back to its row to make a new row. In effect you are left with every item purchased in every basket on its own row.
Next we need to filter this set, we want to filter out all baskets that don’t actually contain the product ID that we are searching for. We also want to filter out all rows where the item is our product ID. We do this so we don’t get back the given product ID as a recommendation for itself. At this point we are left with rows of items that were purchased with our target product ID.
We want to group these rows together by the ‘item’ field we created in the unnest statement. With the rows grouped by ‘item’ selecting count(*) will leave us with a count of each item. Our results are product ID’s and how many times they were purchased with our target item.
Lastly, we want to order the returning set in descending order so we have the most popular on top. By limiting to 3 results we get the three most purchased products with our target item.
That’s it! Running this query gives us the top three recommendations for the target product ID. In my previous post, while we achieved the same result, it required we use a batch process. Running this processing in batch provides results that are only as up to date as the last time you ran the batch process. In this example using N1QL, recalculating recommendations for only the target ID on demand ensures that you are getting the most recent results possible.
If you’d like to learn more about Couchbase or N1QL please Contact Us. Have other Couchbase integration use cases you’d like us to examine? If so, share your request in the comments.