In a previous post I asked the question, “What if a web service could automatically provide the CMS metadata when you go to check in a new topic?” In this post I’ll discuss why you would want to do that, some of the candidate technologies, and what is necessary to make it real.

Check-in, metadata, and taxonomies. Anyone who’s worked with a content or document management system knows this scenario: You’re about to check in newly authored content, and a dialog box comes up asking you to enter some keywords to describe the content. This is metadata – data about your data. It’s important because if you fill it in properly, other people (and you, too) can find your content. If you leave it blank, then other users will need to rely on a full-text search or some other machine indexing to find your content.

Many organizations have a formal system for classifying content called a taxonomy. Think of it like the naming of the sections in the yellow pages directory – it provides consistent category names. This avoids the problem I call “the Yellow Pages Problem,” where some people call those guys who represent you in court “lawyers” while other people call them “attorneys at law” (or worse things). When an organization uses a taxonomy, everyone uses consistent category names – that is, if they actually use it.

Compliance blues. Taxonomies can be configured into the CMS, so that category names can be selected in the check-in dialog box. While that saves the author the guesswork of remembering category names and avoids mistyping, it still requires the author to take action – and to get it right. This is a point of failure for many ECM initiatives: authors fail to classify content at check-in time, accept the default settings, or apply the wrong category (or too few categories when the content really crosses genres).

The problem is worse when the author isn’t a full-time writer, but instead a business contributor who creates content as part of their role in some business process. In these cases the author lacks the time, talent, or motivation to tag the content with the appropriate metadata. They may not see it as part of their job.

Cure for the blues? So can this process be automated? Absolutely. Technologies have existed for some years now to analyze unstructured content. Algorithms involve some combination of statistical, linguistic, and structural analysis.

  • Statistical methods look at the document as a “bag of words” – words or phrases that occur more frequently, or that are statistically “improbable,” are treated as more important. Amazon uses SIPs – Statistically Improbable Phrases – to pull keywords out of books. This is purely statistical – the system doesn’t know what the words mean, just that they are “odd” and so probably meaningful.
  • Linguistic methods actually analyze the natural language in the document. If you know what the subject, verb, and object are in a sentence, then you know what it is about. Linguistic methods have gotten better with improvements in algorithms and increases in computing power.
  • Structural methods leverage underlying markup in documents, like XML structural tags or even styling or text flow (e.g. recognizing terms in headers).
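
To make the statistical approach a bit more concrete, here is a minimal sketch of one way to find “improbable” terms: compare how often a word appears in the document with how often it appears in a background collection. This is not Amazon’s SIP algorithm, just an illustration of the idea. The function names, the smoothing, and the scoring formula are my own assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def improbable_terms(document, background_docs, top_n=3):
    """Rank terms that are unusually frequent in `document`.

    Hypothetical scoring: p_doc * log(p_doc / p_background), so a term
    scores high when it is both common in the document and rare in the
    background collection.
    """
    doc_counts = Counter(tokenize(document))
    bg_counts = Counter()
    for doc in background_docs:
        bg_counts.update(tokenize(doc))

    doc_total = sum(doc_counts.values())
    bg_total = sum(bg_counts.values())
    vocab_size = len(set(doc_counts) | set(bg_counts))

    scores = {}
    for term, count in doc_counts.items():
        p_doc = count / doc_total
        # Add-one smoothing so unseen background terms do not divide by zero.
        p_bg = (bg_counts[term] + 1) / (bg_total + vocab_size)
        scores[term] = p_doc * math.log(p_doc / p_bg)

    return sorted(scores, key=scores.get, reverse=True)[:top_n]


document = (
    "The patent brief argues the handwriting recognition patent was "
    "infringed. The patent covers handwriting input."
)
background = [
    "The cat sat on the mat.",
    "The report covers the quarterly sales of the company.",
]
keywords = improbable_terms(document, background)
print(keywords)
```

Note that the system still has no idea what a “patent” is. The word ranks highly only because it is frequent here and absent from the background text, while a common word like “the” scores low despite appearing just as often.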

These methods not only provide automated metadata tagging (document categorization), they can also determine what type of document is being analyzed (document classification). They can also be used to identify Named Entities – named people, places, things, and events. It’s one thing to say this document is a Legal Brief (document type or class). It’s another to say that Legal Brief is about Patent Infringement (a category). It’s another thing still to say that it’s a case between Palm and Xerox (named companies) about handwriting recognition (a named technology). Named entities can be extracted and listed in metadata. They can also be tagged in-line in an XML document (this is often called “auto-tagging” – a post for another day).

Named entities are not addressed by taxonomies, but rather by lists or directories of named entities. A number of these named entity directories are available as web services. Several are kept evergreen by using Wikipedia to drive the ever-growing list of named entities.
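
As a rough illustration of directory-driven extraction, the sketch below matches a document against a small in-memory list of known names. The directory contents and the function name are hypothetical; a real system would query one of the directory services mentioned above and would need smarter matching than a simple whole-word search.

```python
import re

# Hypothetical named entity directory: name -> entity type.
ENTITY_DIRECTORY = {
    "Palm": "company",
    "Xerox": "company",
    "handwriting recognition": "technology",
}


def extract_entities(text, directory=ENTITY_DIRECTORY):
    """Return (name, type) pairs for directory entries found in the text.

    Matching is whole-word and case-insensitive, which is naive: it would
    also tag "palm" in "palm tree" as a company.
    """
    found = []
    for name, kind in directory.items():
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append((name, kind))
    return found


brief = "Xerox sued Palm over its handwriting recognition software."
print(extract_entities(brief))
```

The resulting list could be written into the document’s metadata at check-in, alongside the taxonomy categories.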

Making it real. So given this technology, how do you implement such a system?  My preferred method is to customize the authoring environment so that the “Save” dialog box in the editor of choice presents the ECM system’s check-in dialog.  This way the author does not take extra steps to check content in.

Also at check-in time, in the background, the customized editor performs a temporary save to the local file system, and automatically sends a copy of the document to a categorizer web service. This is a content categorizer application running on a server.  That categorizer service would apply the organization’s standard taxonomy to the document, using some classification algorithm to define one or more categories for the document. The results can be applied in either of two ways:

  • Classify the document automatically with no user intervention. This can be done completely in the background with no user interface, even as part of an automated check-in workflow.
  • Classify the document automatically and have the user verify the results. This requires exposing the proposed metadata tags in the check-in dialog.

Categorizers often provide some scoring of the certainty of a given tag; this score can be used to make the call about whether the automatic tag is applied, or whether it needs (or allows) end user verification or editing. Business requirements determine what the best approach or best combination is.
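
The score-based decision might look something like the sketch below. The `ProposedTag` structure, the 0.0 to 1.0 score range, and both threshold values are assumptions for illustration; actual categorizers report certainty in their own ways, and the thresholds are a business decision.

```python
from dataclasses import dataclass


@dataclass
class ProposedTag:
    category: str
    score: float  # categorizer's certainty, assumed to range from 0.0 to 1.0


def route_tags(proposed, auto_apply_at=0.85, suggest_at=0.50):
    """Split proposed tags into those applied silently and those shown
    to the author for verification. Tags below `suggest_at` are dropped.
    """
    applied, to_verify = [], []
    for tag in proposed:
        if tag.score >= auto_apply_at:
            applied.append(tag.category)
        elif tag.score >= suggest_at:
            to_verify.append(tag.category)
    return applied, to_verify


proposed = [
    ProposedTag("Patent Infringement", 0.93),
    ProposedTag("Contracts", 0.60),
    ProposedTag("Tax", 0.20),
]
applied, to_verify = route_tags(proposed)
print("Applied automatically:", applied)
print("Shown in check-in dialog:", to_verify)
```

Setting `auto_apply_at` above 1.0 would force every tag through user verification, and setting `suggest_at` equal to `auto_apply_at` would give the fully automatic, no-dialog behavior.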

What are the barriers? The reason this technique isn’t used more often is the integration required between the authoring tools, the ECM solution, and the categorization technology. In today’s market these technologies are typically provided by independent software vendors, who have few incentives for bundling tightly integrated solutions (and wish to remain “vendor neutral” with their own technology). As the ECM marketplace continues to consolidate vertically, we may see some content lifecycle vendors with more complete solutions (watch IBM and EMC). Services firms specializing in unstructured content and ECM can be one source for prepackaged solutions that combine ECM, authoring tools, and content classification into a seamless user experience – which is the key to success in deploying an automated solution.

At the end of the day, consideration of the needs and behavior of content authors and contributors (who are very often change-averse) is the most important step in adoption of a content lifecycle solution. Making content classification and categorization a “no brainer” through automation and a seamless user experience improves the likelihood of success.