Reclaim Your Content

Organizations have an ever-increasing volume of content. These documents represent the company's collective knowledge base, and our goal should be to make better use of it by distributing that knowledge to employees and making them more productive because of it. Enter the Content Management System (CMS). We love our CMSs. They allow us to collect, search, and reuse our content for all kinds of applications, including research, marketing, website creation, and more. In fact, there is a special class of CMS called a Web Content Management System (WCM) that facilitates building a website from all of our content.

What happens when it is time to move on to a new CMS? Perhaps we have outgrown the features of our current system, found a new CMS whose features better suit our needs, or found a vendor with a more budget-friendly pricing model. We simply move our content to the new system, right? No problem! If only it were that simple.

We start to ask questions like:

  • “What types of documents do we have?”
  • “Do we have metadata about the document not contained in the document itself?”
  • “What format is the content in?”
  • “Is the data structured or unstructured, homogeneous or heterogeneous?”
  • “Do we have multimedia?”
  • “Are there relationships between the documents?”

This could get complicated fast!

What are the potential approaches for migrating our data? We could move it manually: copy and paste from one system to another. This is probably not realistic unless you have an army of data entry people; if you have enough content to warrant a CMS, it will likely take too much time and too many resources to go down this road. The next option is to automate the process: export content from the old system, transform the data as necessary, and import it into the new system. Systems usually have an import function. Creators of content management systems want you to use their applications, so they make it as easy as possible to get your data in. Unfortunately, the same isn't always true for export.
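To make the automated route concrete, here is a minimal sketch of the transform step in Java, assuming the old system exports XML and the new one imports a different XML schema. The file names and the migrate.xsl stylesheet are hypothetical stand-ins for your own formats.

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.File;

public class ContentTransform {
    public static void main(String[] args) throws Exception {
        // Hypothetical scenario: the old CMS exported each article as XML,
        // and the new CMS imports a different XML schema. An XSLT stylesheet
        // (migrate.xsl, written against your two schemas) bridges the gap.
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new File("migrate.xsl")));
        transformer.transform(
                new StreamSource(new File("export/article-123.xml")),
                new StreamResult(new File("import/article-123.xml")));
        // The transformed files can then be fed to the new system's
        // import function.
    }
}
```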

Some systems may have an export tool in the administration interface, or you may have direct access to the datastore. Your development team can tackle that problem, but what do you do if those aren't options? We could crawl the public website and download all of our content. There are lots of ways to script that, but modern websites make it very difficult. The experience is often personalized, so not every user sees the same content. Content is often not linked to directly: crawlers follow links, but what if the link doesn't exist? We don't build websites with static lists of content. Content is frequently loaded based upon user interaction, click events, and searches. Perhaps the entire underlying document is never even displayed. How can we know that crawling a site will capture all of our content?
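To see why crawling falls short, consider a bare-bones link-following crawler, sketched here with the jsoup library (the starting URL is hypothetical). It only ever discovers pages reachable through static anchor tags, which is exactly the problem with content loaded by clicks, searches, or personalization.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

public class SiteCrawler {
    public static void main(String[] args) throws Exception {
        String start = "https://www.example.org/"; // hypothetical site
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            // Skip pages we have already visited or that leave the site.
            if (!seen.add(url) || !url.startsWith(start)) continue;
            Document doc = Jsoup.connect(url).get();
            // Save doc.html() here as the "downloaded" content.
            // The limitation: we only discover pages reachable through
            // static <a href> links. Content loaded by click handlers,
            // searches, or asynchronous requests never shows up here.
            for (Element link : doc.select("a[href]")) {
                queue.add(link.attr("abs:href"));
            }
        }
    }
}
```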

I recently worked with a client who was using Kintera's Blackbaud Sphere product to manage their online presence. Kintera offers this product as a Software as a Service (SaaS) solution to the nonprofit and government sectors. Sphere serves as a Constituent Relationship Management (CRM) system, a Content Management System (CMS), and a Web Content Management (WCM) system. The client did not have direct access to the database or the ability to export their content without contracting with Kintera for additional services. Their website offered personalized content and relied heavily on an event-driven, client-side web application. Content was obtained asynchronously, often through search. This closely matches the scenario I described previously. Rather than crawling the public site, we approached the problem in a less conventional way: we scraped the content from the administration tool using a web testing framework. This was not without peril. The administration site opened many layers of popup windows, placed content in iframes, and utilized many of the same modern web client techniques that posed a problem on the public site. The difference with this approach was that it gave us full access to all of the content and its metadata. Additionally, we could traverse the content tree and search the full repository with confidence that we were reaching all of the site's content.
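For the curious, here is a rough sketch of what driving an administration tool with a web testing framework can look like, using Selenium WebDriver. The article names only "a web testing framework," so Selenium, along with the admin URL, selectors, window behavior, and frame name below, is a hypothetical illustration of switching into popup windows and iframes, not the project's actual code.

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public class AdminScraper {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        try {
            // Hypothetical admin site; every CMS's tool will differ.
            driver.get("https://admin.example.org/login");
            driver.findElement(By.name("username")).sendKeys("editor");
            driver.findElement(By.name("password")).sendKeys("secret");
            driver.findElement(By.cssSelector("input[type=submit]")).click();

            String mainWindow = driver.getWindowHandle();
            // Clicking a content item opens a popup editor window.
            driver.findElement(By.linkText("Articles")).click();
            for (String handle : driver.getWindowHandles()) {
                if (!handle.equals(mainWindow)) {
                    driver.switchTo().window(handle);
                }
            }
            // The document body itself sits inside an iframe.
            driver.switchTo().frame("contentFrame");
            String html = driver.getPageSource(); // persist this per document
            driver.switchTo().defaultContent();
            driver.switchTo().window(mainWindow);
        } finally {
            driver.quit();
        }
    }
}
```

Because the test framework drives a real browser session logged into the admin tool, it sees the same content tree and search results an editor does, which is what makes this approach exhaustive where public crawling is not.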

You too can take a similar approach to reclaim your content. After all, it's your content. Is this approach the right fit for extracting content from every CMS? Of course not, but it is an example of how you can think outside the box to solve a challenging problem. Reclaim your content using a little ingenuity and put your knowledge base back to work for you.

About Joe Glorioso

In his career, Joe Glorioso has worked in both the financial services and publishing industries. After completing his Bachelor of Arts in Mathematics Education and Master of Science in Information Systems at Long Island University's C.W. Post Campus, he has worked extensively with Java and XML based systems. In his current role, Joe serves as a Senior Consultant with Avalon Consulting, LLC.
