I recently attended the Linked Data Planet conference, where a number of Semantic Web pioneers shared their perspectives on the state of the art (and the business) of helping the world tag its web pages for meaning.  For those of you in the dark about semantic markup, it lets authors annotate their web pages with metadata (HTML attributes that don’t get displayed in the document) that describes what those pages are about.

So for example, when I say “New York” in an HTML document, it’s ambiguous: do I mean the city, the state, the Yankees, the Mets, the Giants, the Jets, the song, the steak, the state of mind?  You get the idea.  Words are ambiguous, except in the context of the language in which they occur.  So if I am writing about a sporting event, you know from the context of the article that I mean the team, but the typical search engine does not.  To a search engine, “New York” is just a string that occurs in the document with some frequency.

There are two ways to make sense out of words in a document.  One is semantic analysis (I’ll leave that topic for another day).  The other is semantic tagging: adding metadata to a document.  With metadata, I can define things precisely.  I can state that this document is about the sports team, not the steak.  I can do this by tagging the named entities in the document (the people, places, things, events, and facts) in an unambiguous way.  I can also set those entities into relationships with each other.  For example, a piece of text may refer to two companies involved in a merger.  So I can tag the document as being about Company A (thing number one) and Company B (thing number two), which are involved in a merger (an event, but also a relationship between the two named entities).
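
To make the idea of entities and relationships a little more concrete, here is a minimal Python sketch.  It is only an illustration: the `Entity` and `Relation` classes, the `candidates` helper, and the example.com identifiers are all made up for this post and do not come from any particular semantic web standard or product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    uri: str    # hypothetical unique identifier for the thing being tagged
    kind: str   # what sort of thing it is (City, SportsTeam, Company, ...)
    label: str  # the words that actually appear in the text


@dataclass(frozen=True)
class Relation:
    kind: str        # the event or relationship, e.g. "Merger"
    subject: Entity  # thing number one
    obj: Entity      # thing number two


# A tiny, invented set of tagged entities.
ENTITIES = [
    Entity("https://example.com/id/new-york-city", "City", "New York"),
    Entity("https://example.com/id/new-york-yankees", "SportsTeam", "New York"),
    Entity("https://example.com/id/company-a", "Company", "Company A"),
    Entity("https://example.com/id/company-b", "Company", "Company B"),
]


def candidates(label, kind=None):
    """Return every entity with this label, optionally narrowed by kind."""
    return [
        e for e in ENTITIES
        if e.label == label and (kind is None or e.kind == kind)
    ]


company_a = candidates("Company A")[0]
company_b = candidates("Company B")[0]
merger = Relation("Merger", company_a, company_b)

# The bare string matches more than one thing; the tagged kind picks one.
print(len(candidates("New York")))
print(candidates("New York", kind="SportsTeam")[0].uri)
print(f"{merger.subject.label} --{merger.kind}--> {merger.obj.label}")
```

The point of the sketch is that the label alone ("New York") is ambiguous, while the identifier and kind attached by the tag are not, and that a relationship is simply one more piece of data connecting two tagged entities.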

So semantic tagging adds meaning to documents that goes beyond the text, and it does it in an unambiguous way, which is handy.  But it has traditionally faced two large hurdles: (1) it’s been relatively expensive to add semantic markup (either with investments in labor or technology) and (2) there has been little mass market for consuming this markup.  Both of those hurdles are rapidly falling away. 

Let’s address the second point first.  Yahoo has introduced SearchMonkey, a new technology that looks beyond keywords and the number of links to a page (the “wisdom of crowds”) and uses the semantic markup embedded in the page (the wisdom of the author) to present richer, more informative search results.  This creates a substantial motive for adding the markup: Search Engine Optimization.  Semantic markup makes your content more likely to be found and more relevant to the searcher.

Great, so how do you add semantic markup?  For legacy content, you need some combination of people and automation to add markup to what you have already written.  Using people to tag content requires specialized skills that are in short supply.  Natural language processing technologies for auto-tagging content have been around in lab settings since the late 90s, and auto-tagging products are now emerging in new and interesting forms in the marketplace.  Thomson Reuters’ Calais web service is a great example.  For a demo, click here and try pasting in some non-proprietary text that describes what your company does.  For example, I tried the “About Our Company” page we used in proposals at JustSystems: it accurately tagged all of the named companies, legal entities, products, technologies, countries, and cities, and it correctly identified JustSystems’ acquisition of XMetaL from Blast Radius as a business event.

Adding semantic markup to new web content as it is created, making it available as data, is the way to go.  But what about other types of unstructured content, like documents, that might be published to the web and other channels?  We’ve been doing this with XML and SGML documents all along, using semantic tags to unambiguously flag specific pieces of text for future discovery.  This has ranged from tagging part numbers in a service manual (which could automate adding hyperlinks or improve search relevance) to tagging financial reports with XBRL to find specific facts within the MD&A or footnotes of an annual report (which could prevent another Enron).  But the important concept here is this: when content is tagged, it can be treated as data.
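
As a rough illustration of tagged content behaving like data, the Python sketch below pulls part numbers out of a small XML fragment and turns them into hyperlinks.  Everything specific in it is an assumption made for the example: the `<partnum>` element name, the sample manual text, and the example.com catalog URL are invented, not taken from any real DTD or system.

```python
import xml.etree.ElementTree as ET

# Invented sample: a service procedure with part numbers tagged inline.
MANUAL = """<procedure>
  <step>Remove the <partnum>A-1001</partnum> filter.</step>
  <step>Install the <partnum>B-2040</partnum> gasket and a new
        <partnum>A-1001</partnum> filter.</step>
</procedure>"""

# Hypothetical parts catalog address; a real one would come from your system.
CATALOG_URL = "https://example.com/parts/{}"


def part_links(xml_text):
    """Map each tagged part number to a catalog hyperlink."""
    root = ET.fromstring(xml_text)
    links = {}
    for element in root.iter("partnum"):
        number = (element.text or "").strip()
        if number:
            links[number] = CATALOG_URL.format(number)
    return links


for number, url in part_links(MANUAL).items():
    print(number, url)
```

Because the part numbers are tagged, the code never has to guess which strings are part numbers; it simply asks for them by name, the same way it would query a field in a database.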

More recent XML standards like DITA help authors focus on creating granular content, primarily for content reuse.  But our customers are finding that DITA and other topic-oriented XML approaches are also helping them break out of the document model, in which loads of facts are locked up within documents.  Think of a lengthy Policies and Procedures manual.  The historical reason it’s all bound in one book is the convenience of publishing.  Today, with electronic publishing on the web, intranets, and portals, you really only want to publish a single policy or procedure as it is added or revised.  The book itself is obsolete when you can publish one procedure at a time.

In a DITA world, because of its granular nature, a single document (like a policy manual that was one very large document in your document management system) may instead be managed as a collection of hundreds of DITA topics in your CMS or XML object store.  The document would no longer exist; it becomes a collection of topics, more like records in a database.  To effectively manage large collections of DITA topics, you need to specify metadata for each topic, if only so that you can find any given topic again.  So a typical DITA project would define the CMS metadata scheme and the taxonomy for classifying the DITA topics.  For those of us in the XML document world, this is old hat.
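
Here is a small Python sketch of what "topics as records" might look like.  It is a toy under stated assumptions: the `TopicRecord` fields, the subject terms, and the `find_topics` helper are hypothetical stand-ins for whatever metadata scheme and taxonomy a real CMS would define.

```python
from dataclasses import dataclass, field


@dataclass
class TopicRecord:
    topic_id: str    # hypothetical CMS identifier for the topic
    title: str
    topic_type: str  # DITA topic type: concept, task, or reference
    subjects: set = field(default_factory=set)  # terms from an assumed taxonomy


# What used to be one policy manual, now managed as separate topics.
TOPICS = [
    TopicRecord("t001", "Travel expense policy", "concept", {"travel", "expenses"}),
    TopicRecord("t002", "Submitting an expense report", "task", {"expenses"}),
    TopicRecord("t003", "Per diem rates", "reference", {"travel", "expenses"}),
    TopicRecord("t004", "Requesting vacation", "task", {"time-off"}),
]


def find_topics(records, subject=None, topic_type=None):
    """Filter topic records by taxonomy subject and/or DITA topic type."""
    return [
        r for r in records
        if (subject is None or subject in r.subjects)
        and (topic_type is None or r.topic_type == topic_type)
    ]


for record in find_topics(TOPICS, subject="expenses", topic_type="task"):
    print(record.topic_id, record.title)
```

Without the metadata, the only way to find the expense report procedure would be to open the topics and read them; with it, finding a topic is a query.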

So all this makes me ask:

  • What if we start combining semantic web technologies and semantic document technologies?
  • What if we combine technologies that auto-tag named entities with granular authoring approaches like DITA?
  • What if you could automatically tag named entities within the DITA topic you are creating, tagging as you type? 
  • What if a web service could automatically provide the CMS metadata when you go to check in a new topic?
  • What if the publishing tools that transform your DITA to HTML could automatically add the semantic markup to your HTML pages that are published from your DITA content?
  • How would that change how you publish business documents like policies and procedures to your employees?
  • How would it change how you create marketing content for your web site?
  • How would it change the way you create and manage your product technical content?
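
To give a feel for the kind of pipeline these questions imagine, here is a deliberately simple Python sketch that finds named entities in a piece of topic text and wraps them in HTML spans carrying semantic attributes.  Treat every detail as an assumption: the lookup table stands in for a real entity-extraction service, the `ex:` type names and example.com identifiers are invented, and the RDFa-style `typeof` and `resource` attributes are just one plausible way to carry the markup.

```python
import html
import re

# Hypothetical lookup table standing in for a real auto-tagging service.
GAZETTEER = {
    "JustSystems": ("ex:Company", "https://example.com/id/justsystems"),
    "XMetaL": ("ex:Product", "https://example.com/id/xmetal"),
}


def tag_entities(text):
    """Wrap known entity names in spans with RDFa-style attributes."""
    # Try longer names first so a long name wins over a shorter one inside it.
    names = sorted(GAZETTEER, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))

    pieces = []
    position = 0
    for match in pattern.finditer(text):
        # Escape the untagged text between entities so it is safe as HTML.
        pieces.append(html.escape(text[position:match.start()]))
        kind, uri = GAZETTEER[match.group(0)]
        pieces.append(
            f'<span typeof="{kind}" resource="{uri}">'
            f"{html.escape(match.group(0))}</span>"
        )
        position = match.end()
    pieces.append(html.escape(text[position:]))
    return "".join(pieces)


print(tag_entities("JustSystems acquired XMetaL."))
```

In a real publishing chain the same step could run inside the DITA-to-HTML transform, so authors would never have to type the semantic attributes by hand.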

Could the secret to the semantic web be right under our noses?