Admittedly, taxonomy is probably the farthest thing from your mind if you’re designing an intelligent content application. You’re probably focused on technology selection and content strategy. In fact, many search engine providers – notably Google – would argue strenuously that you don’t need a taxonomy to find content. I would argue even more strenuously that your intelligent content project – whether it’s a live content project with user-generated content or an information publishing portal – just won’t succeed unless you think long and hard about taxonomy and content classification.

Over the last several years I’ve been helping companies build intelligent content applications using taxonomy, and almost all of them had already spent a pile of cash on search and content management technology, only to see it fall short of their vision. Many had even implemented component content in DITA or another XML vocabulary. Taxonomy helps to bridge the gap in several important ways.

First of all, taxonomy helps companies organize and manage their source content. A well-designed taxonomy can become the basis for a metadata and content type strategy for the CMS, and the source of the controlled vocabularies that content authors and publishers use to classify their content. Content classification is important for defining the “aboutness” of content as well as administering it. We all need well-defined, clear, unambiguous terms for administrative metadata, including our organization structures, customers, products, information types, and information security classifications, to name a few important categories. Leaving CMS users to enter this metadata as free text exposes us to human error and inconsistency. At a minimum, we want to maintain an authoritative term list and expose it in drop-down lists so users can select values when they upload content to the CMS. Users are usually able to enter this metadata with low error rates when selecting from controlled term lists, especially if their job is content publishing. The list just makes it easy and consistent.
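To make the drop-down idea concrete, here is a minimal Python sketch of checking uploaded metadata against controlled term lists. The field names, the term values, and the `validate_metadata` helper are all hypothetical; a real CMS would load its term lists from the taxonomy tool rather than hard-code them.

```python
# Hypothetical controlled vocabularies for administrative metadata.
# In practice these would be exported from the taxonomy, not hard-coded.
CONTROLLED_VOCABULARIES = {
    "information_type": {"Datasheet", "Service Bulletin", "User Guide"},
    "security_classification": {"Public", "Internal", "Confidential"},
}


def validate_metadata(metadata):
    """Return a list of error messages for values outside the term lists."""
    errors = []
    for field, value in metadata.items():
        allowed = CONTROLLED_VOCABULARIES.get(field)
        if allowed is None:
            errors.append(f"unknown field: {field}")
        elif value not in allowed:
            errors.append(f"{field}: '{value}' is not a controlled term")
    return errors
```

A value picked from the list passes cleanly, while a free-typed variant such as "data sheet" is flagged, which is exactly the kind of inconsistency the term list is meant to prevent.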

For “aboutness” metadata, we need to help users be comprehensive and consistent, so we often use the taxonomy to inform a classification engine that analyzes the document and provides suggested metadata to the user. Subject classification schemes are conceptually similar to the subject headings in library card catalog systems – they start with broad domains and categories that are broken into increasingly narrower topic spaces. The big difference is that each organization will need to develop and maintain subject classifications that are relevant to its business content. For example, I’m helping a high-tech manufacturer classify technical documents; their taxonomy covers technologies, manufacturing process steps, customer needs / applications, as well as symptoms, fault codes, and root causes for troubleshooting and repairing their products. They had a rich set of terms for all of these topics, some in a corporate taxonomy and others in specific systems for service and quality management. Putting them into a taxonomy helped us use that information for auto-classification of content. Metadata is proposed to content publishers when they upload content to the CMS, and they can add or remove terms from the proposed metadata using the same taxonomy the classifier used. The result is that documents are now more completely tagged with “aboutness” metadata in the CMS.
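As a rough illustration of taxonomy-informed suggestions, the Python sketch below proposes preferred terms by counting how often their trigger phrases appear in a document. This is a deliberately simple phrase-matching stand-in, not the classification engine used on the project described above; the `TAXONOMY` entries and the `suggest_terms` function are invented for the example.

```python
import re

# Hypothetical taxonomy slice: preferred term -> phrases that signal it.
TAXONOMY = {
    "Fault Codes": ["fault code", "error code"],
    "Root Causes": ["root cause"],
    "Manufacturing Process Steps": ["etching", "deposition"],
}


def suggest_terms(text, taxonomy=TAXONOMY, min_hits=1):
    """Suggest preferred terms, most frequently signalled first."""
    lowered = text.lower()
    suggestions = []
    for preferred, phrases in taxonomy.items():
        # Count whole-phrase matches for every phrase tied to this term.
        hits = sum(
            len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
            for phrase in phrases
        )
        if hits >= min_hits:
            suggestions.append((preferred, hits))
    # Sort by descending hit count, then alphabetically for stable output.
    suggestions.sort(key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in suggestions]
```

The returned list plays the role of the proposed metadata: the publisher reviews it and adds or removes terms, choosing only from the same taxonomy.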

Intelligent content doesn’t end with the CMS, however. Search engines classify content using their algorithms to match end-user search queries (what you type in the search box) to content – whether it’s unstructured document content, structured data in an enterprise system, or both. A search engine doesn’t understand your business – it treats all of your content as a “bag of words” and statistically determines what each item is “about” by looking at unusual combinations of terms or high-frequency terms. A taxonomy can tell the search indexing engine that certain terms are more meaningful to your business, and that there are relationships between terms that matter when it comes to relevance. Even if the search engine is placing higher value on metadata values, those usually contain “preferred” terms – the official business labels. Users, on the other hand, are not so disciplined when they type in the search box – they may use “non-preferred” terms. A favorite example is searching a NASA site for “moon buggy” when NASA calls the item the “lunar excursion vehicle.” The taxonomy can relate those terms so the search engine returns relevant documents – even if they never contain the term “moon buggy” and refer to the vehicle only by its official name or acronym.
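The preferred / non-preferred relationship can be sketched in a few lines of Python. The mapping below reuses the moon buggy example; the `NON_PREFERRED` table and the `expand_query` function are hypothetical, and a real search engine would normally take this mapping as a synonym configuration rather than as application code.

```python
# Hypothetical synonym table: non-preferred label -> preferred label.
NON_PREFERRED = {
    "moon buggy": "lunar excursion vehicle",
    "lev": "lunar excursion vehicle",
}


def expand_query(query, synonyms=NON_PREFERRED):
    """Return every label the engine should search for, sorted."""
    # Normalize case and collapse repeated whitespace.
    normalized = " ".join(query.lower().split())
    preferred = synonyms.get(normalized, normalized)
    variants = {normalized, preferred}
    # Add every other non-preferred label that shares the preferred term.
    variants.update(alt for alt, pref in synonyms.items() if pref == preferred)
    return sorted(variants)
```

With this in place, a query for "moon buggy" is expanded to include the preferred term and its acronym, while a query with no taxonomy entry passes through unchanged.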

Finally, taxonomy can be used to drive the search user experience in major ways. It can become the basis for the facets in search refinement, allowing users to narrow their search along the dimensions of the taxonomy (show me only information about these document types, or these products, etc.). It can define the terms we show in tag clouds and other interface objects. The taxonomy can also help the search engine identify related searches – for instance, all of the astronauts who ever drove a LEV, or the Apollo missions that included a LEV.
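Faceted refinement is easy to picture with a small Python sketch. It assumes each search result already carries taxonomy-backed metadata; the sample `RESULTS`, the field names, and the `facet_counts` and `refine` helpers are made up for illustration.

```python
from collections import Counter

# Hypothetical search results, each tagged with taxonomy-backed metadata.
RESULTS = [
    {"title": "Pump A service bulletin", "product": "Pump A", "doc_type": "Service Bulletin"},
    {"title": "Pump A user guide", "product": "Pump A", "doc_type": "User Guide"},
    {"title": "Pump B user guide", "product": "Pump B", "doc_type": "User Guide"},
]


def facet_counts(results, field):
    """Count how many results carry each value of one facet."""
    return Counter(doc[field] for doc in results if field in doc)


def refine(results, **selected):
    """Keep only results matching every selected facet value."""
    return [
        doc for doc in results
        if all(doc.get(field) == value for field, value in selected.items())
    ]
```

The counts feed the facet panel (for example, two user guides and one service bulletin), and each click on a facet value becomes another keyword argument to `refine`.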

I’ve actually seen a recent example of an intelligent content application that is entirely defined in a taxonomy – the UK National Health Service has built a Flash application that lets you navigate a hyperbolic tree of symptoms and diseases, all of which is directly managed in a taxonomy and flowed directly into the portal. Their taxonomy aids search and results relevancy as well by taking terms like “AIDS” and ensuring that search engine stemming doesn’t return documents that contain the terms “aid” or “aiding” – imagine all the results for “first aid”, “band aid”, “hearing aid”, or “health aid”.
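The stemming problem can be shown with a toy example in Python. The stemmer here is a crude suffix stripper written only for this sketch, not the algorithm any particular search engine uses, and the `PROTECTED_TERMS` set and `index_token` function are assumptions. The sketch also assumes protection is case-sensitive, so only the upper-case form is shielded.

```python
# Hypothetical list of taxonomy terms that must be indexed exactly as written.
PROTECTED_TERMS = {"AIDS"}


def naive_stem(token):
    """Crude suffix-stripping stemmer, for illustration only."""
    lowered = token.lower()
    for suffix in ("ing", "s"):
        # Only strip when a stem of at least three letters remains.
        if lowered.endswith(suffix) and len(lowered) > len(suffix) + 2:
            return lowered[: -len(suffix)]
    return lowered


def index_token(token, protected=PROTECTED_TERMS):
    """Skip stemming for protected taxonomy terms; stem everything else."""
    if token in protected:
        return token
    return naive_stem(token)
```

Without the protected list, "AIDS" and "aiding" would both collapse to the same stem as "aid"; with it, the disease name keeps its own index entry and stays out of the "first aid" results.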

We also tend not to think of intelligent content systems as being driven by rather “dumb” legacy content; instead, we assume they are all about XML and structured content. In fact, most of the intelligent content portals I’ve worked on in the last several years were being populated by legacy PDF content – which was made intelligent only through the use of auto-classification with a well-crafted taxonomy and exposed through faceted search. For lengthy documents, document preview technologies can help home in on relevant pages – also guided by the taxonomy, which maps search queries to preferred terms.

My conclusion in working with search and enterprise content management technology is that taxonomy development and management is a key success factor in creating effective intelligent content systems. Taxonomy can inform content types and metadata schema, make for consistent tagging, harmonize disparate structured data, and drive dynamic search and navigation user experiences, even with not-so-intelligent content.

 I’ll be presenting more on this topic with extensive examples at Intelligent Content 2011 in Palm Springs, February 16-18. Hope to see you there!