Using HTML in a Publishing Workflow Instead of XML

Video demonstration of an HTML workflow

Many publishers have struggled to implement agile workflows that will enable them to quickly create new publications for print and digital channels. Conventional wisdom maintains that the best workflows should be based on XML so that content can be created once and then transformed and distributed in multiple formats. Unfortunately, the expense of implementing XML systems is generally high and authors have often resisted working inside XML editors, preferring the freedom of word processing applications.

As the video included in this blog post shows, it is possible to create a publishing workflow based on HTML instead of XML. Browser-based WYSIWYG editors make it easy to create HTML content without the technical complexity of XML. The HTML produced by the current generation of WYSIWYG editors is also clean and well structured, and it can be tagged with attributes to clearly differentiate content such as abstracts, sidebars, person names, dates, and citations from plain text. All of this means that HTML is not simply presentational. It can have complex, granular structures that can be easily transformed to other formats.
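To make the point about attribute-tagged HTML concrete, here is a minimal sketch in Python using only the standard library. It assumes two illustrative conventions: abstracts are <aside> elements with a class of "abstract", and person names are <span> elements with a class of "person" plus data-first and data-last attributes. The class name "person", the sample markup, and the names SemanticExtractor and extract are hypothetical, chosen for this example only.

```python
from html.parser import HTMLParser

# Hypothetical sample document. The "abstract" and "person" classes and the
# data-first / data-last attributes are illustrative conventions, not a standard.
SAMPLE = """
<article>
  <aside class="abstract"><p>HTML can carry semantic structure.</p></aside>
  <p>Written by <span class="person" data-first="Ada" data-last="Lovelace">Ada Lovelace</span>.</p>
</article>
"""


class SemanticExtractor(HTMLParser):
    """Collects abstract text and person names from attribute-tagged HTML."""

    def __init__(self):
        super().__init__()
        self.people = []
        self.abstract_parts = []
        self._aside_depth = 0  # greater than 0 while inside an abstract <aside>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        # Track nesting so that an <aside> inside the abstract does not end it early.
        if tag == "aside" and (self._aside_depth or "abstract" in classes):
            self._aside_depth += 1
        if tag == "span" and "person" in classes:
            self.people.append((attrs.get("data-first"), attrs.get("data-last")))

    def handle_endtag(self, tag):
        if tag == "aside" and self._aside_depth:
            self._aside_depth -= 1

    def handle_data(self, data):
        if self._aside_depth:
            self.abstract_parts.append(data)


def extract(html):
    """Return the abstract text and (first, last) name pairs found in html."""
    parser = SemanticExtractor()
    parser.feed(html)
    parser.close()
    abstract = " ".join("".join(parser.abstract_parts).split())
    return {"abstract": abstract, "people": parser.people}


if __name__ == "__main__":
    print(extract(SAMPLE))
```

Once the content is available as plain data like this, writing it out to another format is a matter of ordinary string or template handling.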

Advantages of an HTML Workflow

Faster Content Creation: HTML reduces development time in three ways. First, you no longer need to convert Word to XML. Second, publishing to the web and to EPUB is much simpler because the content is already in a format that requires minimal transformation. Third, the content is always available in a markup language that can be transformed to any required format at any time during the workflow. This means that proof sheets, EPUB files, and output for InDesign and Quark can be generated on demand, and problems can be spotted and corrected more quickly.

Availability of Open Web Technologies: An HTML workflow allows you to leverage a wide range of existing technologies, including web browsers, new HTML5 markup, CSS, JavaScript, and WYSIWYG editors. There are also legions of talented web developers who are familiar with web technologies and can build your workflow systems.

Flexible Content Models: HTML is a markup language and, in the form of XHTML, has a formal XML schema. However, unlike XML vocabularies such as DocBook and the NLM Journal Archiving and Interchange Tag Suite, the HTML content model is extremely loose and flexible. Content authors can largely disregard strict rules and concentrate on creating content that fits the needs of the publication rather than the rules of the schema. This makes authoring within an HTML editor as easy as working in Word. It also gives publishers the ability to develop flexible conventions for managing and structuring their content.
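One practical consequence of XHTML being XML is that general-purpose XML tooling can read it directly. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The sample document and the convention of marking abstracts with a class of "abstract" are assumptions made for illustration, and the function name abstracts is hypothetical.

```python
import xml.etree.ElementTree as ET

# The standard XHTML namespace, bound to the prefix "x" for path queries.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical well-formed XHTML document.
DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <aside class="abstract"><p>A short summary.</p></aside>
    <p>Body text.</p>
  </body>
</html>"""


def abstracts(xhtml):
    """Return the text of every <aside class="abstract"> in an XHTML string."""
    root = ET.fromstring(xhtml)
    found = root.findall(".//x:aside[@class='abstract']", XHTML_NS)
    return ["".join(element.itertext()).strip() for element in found]


if __name__ == "__main__":
    print(abstracts(DOC))
```

Note that this only works when the markup is well-formed XML. HTML that is not well formed would need an HTML parser instead.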

HTML Problems and Solutions

The flexibility of HTML is a double-edged sword. If you allow authors and editors to create any content they want inside a WYSIWYG editor, you will not be able to reliably publish the resulting content to all channels and have it display correctly.

To ensure consistency, you need to control how content is structured. Important tools include document templates, along with wizards and widgets that step authors through the creation of complex structures. In the video accompanying this blog post, I demonstrate just such a widget, which helps users identify author names inside HTML content. Similar tools need to be developed for all of your content that has special semantic meaning.

For many publishers, locking down content creation is not always possible because they do not have their own authoring systems and instead rely on third parties to create their HTML. In these cases, the publisher must do two things:

  • First, publishers need to create a written style guide that documents how HTML must be structured. Are abstracts represented by <aside> elements with a class attribute of “abstract”? Are the parts of a person’s name stored inside data-first and data-last attributes on <span> elements? All of these conventions need to be documented.
  • Second, publishers must enforce their stylistic conventions using automated testing frameworks. One example of such a technology is Schematron, which allows publishers to write rules specifying how well-formed HTML must be structured. Schematron rules can be applied automatically to identify content that needs to be corrected before it is allowed to enter the workflow.
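The idea of enforcing a style guide automatically can be sketched in a few lines. The example below is not Schematron. It is a stand-in written in Python with the standard library, shown only to illustrate what rule-based checking looks like. The two rules it enforces are hypothetical: every <aside> must carry a class attribute, and every <span> with a class of "person" must have non-empty data-first and data-last attributes. The names ConventionChecker and check are likewise invented for this example.

```python
from html.parser import HTMLParser


class ConventionChecker(HTMLParser):
    """Collects violations of two hypothetical style-guide rules."""

    def __init__(self):
        super().__init__()
        self.errors = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        line, _ = self.getpos()  # position of the tag being reported

        # Rule 1 (assumed): an <aside> must say what it is, e.g. class="abstract".
        if tag == "aside" and not classes:
            self.errors.append(f"line {line}: <aside> needs a class such as 'abstract'")

        # Rule 2 (assumed): a person <span> must carry both name parts.
        if tag == "span" and "person" in classes:
            for required in ("data-first", "data-last"):
                if not attrs.get(required):
                    self.errors.append(f"line {line}: person <span> is missing {required}")


def check(html):
    """Return a list of human-readable violations; an empty list means the content passes."""
    checker = ConventionChecker()
    checker.feed(html)
    checker.close()
    return checker.errors


if __name__ == "__main__":
    submitted = '<aside><p>Summary</p></aside><span class="person" data-first="Ada">Ada Lovelace</span>'
    for problem in check(submitted):
        print(problem)
```

A check like this could run whenever a third party delivers content, so that files with violations are returned for correction before they enter the workflow.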

Is HTML Right For Your Workflow?

Consider the following questions when you are weighing the relative merits of HTML and XML workflows:

  • What are your primary output formats? If you are chiefly publishing to the web and eReaders, then HTML is a very good fit. However, HTML may not be the best choice if you need to deliver most of your content in a highly structured XML format. Your goal is to simplify your workflow, not to add conversion steps.
  • What is your current infrastructure? It would make little sense to abandon your current technology if you have already invested heavily in training your staff to use XML and purchased licenses for XML editors. However, HTML deserves a close look if you are just starting to implement a digital workflow or if your existing infrastructure is obsolete.
  • How much control do you have over your content creation? HTML works best when you can control how content is created. If your workflow is very decentralized, then the strictness of XML may make it easier to enforce data consistency.
  • How complex and granular is your content? Highly structured, data-centric content (organizational directories, biomedical data, etc.) that relies on domain-specific schemas is best handled in XML. HTML is a good fit if your content is largely document-centric and consists primarily of paragraphs, lists, and tables.
  • How frequently does the structure of your publications change? If you find that you are constantly revising your XML schemas to reflect changes in your content, then the flexibility of HTML may be an excellent choice. An XML workflow might be easier to maintain if your content model is extremely consistent and rarely changes.

HTML is not a silver bullet. It will not immediately solve all of your publishing problems and it is not the best choice in all situations. Used judiciously, however, it can be a powerful tool that lowers your costs, speeds the creation of new content, and allows you to keep pace with the rapid changes happening in today’s multi-channel publishing environment.

About Demian Hess

Demian Hess is Avalon Consulting, LLC's Director of Digital Asset Management and Publishing Systems. Demian has worked in online publishing since 2000, specializing in XML transformations and content management solutions. He has worked at Elsevier, SAGE Publications, Inc., and PubMed Central. After studying American Civilization and Computer Science at Brown University, he went on to complete a Master's in English at Oregon State University, as well as a Master's in Information Systems at Drexel University.
