<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>TheContentGuy &#187; markup</title>
	<atom:link href="http://thecontentguy.net/blog/tag/markup/feed/" rel="self" type="application/rss+xml" />
	<link>http://thecontentguy.net</link>
	<description>all things unstructured</description>
	<lastBuildDate>Sat, 05 Mar 2011 06:00:00 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.5</generator>
		<item>
		<title>The Role of Taxonomy in Intelligent Content</title>
		<link>http://thecontentguy.net/blog/2011/02/04/the-role-of-taxonomy-in-intelligent-content/</link>
		<comments>http://thecontentguy.net/blog/2011/02/04/the-role-of-taxonomy-in-intelligent-content/#comments</comments>
		<pubDate>Fri, 04 Feb 2011 21:56:13 +0000</pubDate>
		<dc:creator>paulwlodarczyk</dc:creator>
				<category><![CDATA[ECM]]></category>
		<category><![CDATA[Front Page]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[semantic technology]]></category>
		<category><![CDATA[intelligent content]]></category>
		<category><![CDATA[markup]]></category>
		<category><![CDATA[metadata]]></category>
		<category><![CDATA[semantic search]]></category>
		<category><![CDATA[unstructured content]]></category>

		<guid isPermaLink="false">http://thecontentguy.net/?p=1023</guid>
		<description><![CDATA[<a href="http://www.rockley.com/IC2011/"> <img src="http://thecontentguy.net/wp-content/uploads/ic2011skyscraper.jpg" align=RIGHT alt="Intelligent Content 2011 - Palm Springs, CA 2/16-18" height="129" width="103"  /></a>Admittedly, taxonomy is probably the farthest thing from your mind if you’re designing an intelligent content application. My conclusion in working with search and enterprise content management technology is that taxonomy development and management is a key success factor in creating effective intelligent content systems. Taxonomy can inform content types and metadata schema, make for consistent tagging, harmonize disparate structured data, and drive dynamic search and navigation user experiences, even with not-so-intelligent legacy content.]]></description>
			<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-1027" title="tag-image" src="http://thecontentguy.net/wp-content/uploads/tag-image.jpg" alt="" width="339" height="502" />Admittedly, taxonomy is probably the farthest thing from your mind if you’re designing an intelligent content application. You’re probably focused on technology selection and content strategy. In fact, many search engine providers – notably Google – would argue strenuously that you don’t need a taxonomy to find content. I would argue even more strenuously that your intelligent content project – whether it’s a live content project with user-generated content or an information publishing portal – just won’t succeed unless you think long and hard about taxonomy and content classification.</p>
<p>Over the last several years I’ve been helping companies build intelligent content applications using taxonomy, and almost all of them had already spent a pile of cash on search and content management technology, only to see it fall short of their vision. Many had even implemented component content in DITA or another XML vocabulary. Taxonomy helps to bridge the gap in several important ways.</p>
<p>First of all, taxonomy helps companies organize and manage their source content. A well designed taxonomy can become the basis for a metadata and content type strategy for the CMS, and the source of the controlled vocabularies that content authors and publishers use to classify their content. Content classification is important for defining the “aboutness” of content as well as administering it. We all need well-defined, clear, unambiguous terms for administrative metadata, including our organization structures, customers, products, information types, and information security classifications, to name a few important categories. Leaving CMS users to enter this metadata by freely typing exposes us to human errors and inconsistency. At a minimum, we want to maintain an authoritative term list and expose it in drop-down lists for users to select values when they upload content to the CMS. Users are usually able to enter this metadata with low error rates when selecting from controlled term lists, especially if their job is content publishing. The list just makes it easy and consistent.</p>
<p>For “aboutness” metadata, we need to help users be comprehensive and consistent, so we often use the taxonomy to inform a classification engine that analyzes the document and provides suggested metadata to the user. Subject classification schemes are conceptually similar to the subject headings in library card catalog systems – they start with broad domains and categories that are broken into increasingly narrower topic spaces. The big difference is that each organization will need to develop and maintain subject classifications that are relevant to its business content. For example, I’m helping a high tech manufacturer classify technical documents; their taxonomy covers technologies, manufacturing process steps, customer needs / applications, as well as symptoms, fault codes, and root causes for troubleshooting and repairing their products. They had a rich set of terms for all of these topics, some in a corporate taxonomy and others in specific systems for service and quality management.  Putting them into a single taxonomy helped us use that information for auto-classification of content. Metadata is proposed to content publishers when they upload content to the CMS, and they can add or remove terms from the proposed metadata using the same taxonomy the classifier used. The result is that documents are now more completely tagged with “aboutness” metadata in the CMS.</p>
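<p>To make the suggest-then-review loop concrete, here is a minimal Python sketch. It is not how any particular classification engine works: the taxonomy entries, the trigger phrases, and the function names <code>suggest_terms</code> and <code>review</code> are all hypothetical, and plain phrase matching stands in for real text analysis.</p>

```python
import re

# Hypothetical taxonomy: preferred term mapped to phrases that signal it.
TAXONOMY = {
    "Fault Codes": ["fault code", "error code"],
    "Root Causes": ["root cause", "failure mode"],
    "Manufacturing Process": ["process step", "assembly line"],
}


def suggest_terms(text, taxonomy=TAXONOMY):
    """Return preferred terms whose trigger phrases occur in the text."""
    # Lowercase and collapse whitespace so phrase matching is forgiving.
    normalized = re.sub(r"\s+", " ", text.lower())
    suggestions = []
    for preferred, phrases in taxonomy.items():
        if any(phrase in normalized for phrase in phrases):
            suggestions.append(preferred)
    return sorted(suggestions)


def review(suggested, added=(), removed=()):
    """Apply a publisher's additions and removals, keeping only taxonomy terms."""
    allowed = set(TAXONOMY)
    final = (set(suggested) | set(added)) - set(removed)
    return sorted(final.intersection(allowed))
```

<p>The point of <code>review</code> is that the publisher edits the proposal with the same vocabulary the classifier used, so a term that is not in the taxonomy never reaches the CMS.</p>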
<p>Intelligent content doesn’t end with the CMS, however. Search engines classify content using their algorithms to match end-user search queries (what you type in the search box) to content – whether it’s unstructured document content or structured data in an enterprise system, or both. A search engine doesn’t understand your business – it only looks at all of your content as a “bag of words” that it statistically determines to be “about” something by looking at unusual combinations of terms, or high-frequency terms. A taxonomy can tell the search indexing engine that certain terms are more meaningful to your business, and that there are relationships between terms that matter when it comes to relevance.  Even if the search engine is placing higher value on metadata values, those usually contain “preferred” terms – the official business labels. Users, on the other hand, are not so disciplined when they type in the search box – they may use “non-preferred” terms. A favorite example is searching a NASA site for “moon buggy” when NASA calls the item the “Lunar Roving Vehicle.” The taxonomy can relate those terms so the search engine returns relevant documents – even if they never contain the term “moon buggy” and refer to the vehicle only by its acronym, LRV.  </p>
<p>Finally, taxonomy can be used to drive the search user experience in major ways. It can become the basis for the facets in search refinement, allowing users to narrow their search along the dimensions of the taxonomy (show me only information about these document types, or these products, etc.). It can define the terms we show in tag clouds and other interface objects. The taxonomy can also help the search engine identify related searches – for instance, all of the astronauts who ever drove an LRV, or the Apollo missions that included an LRV.</p>
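<p>Here is a small Python sketch of those two ideas together: mapping what the user typed to a preferred term, then counting results along a taxonomy dimension to build facets. Everything in it is an assumption for illustration, including the thesaurus entries, the three sample documents, and the function names; a real search engine would do this inside its index rather than over a list.</p>

```python
from collections import Counter

# Hypothetical thesaurus: non-preferred entry terms point to one preferred term.
PREFERRED = {
    "laptop": "notebook computer",
    "notebook": "notebook computer",
    "cell phone": "mobile phone",
}

# Hypothetical index: each document carries taxonomy metadata.
DOCS = [
    {"id": 1, "subject": "notebook computer", "doc_type": "manual"},
    {"id": 2, "subject": "notebook computer", "doc_type": "datasheet"},
    {"id": 3, "subject": "mobile phone", "doc_type": "manual"},
]


def normalize_query(query):
    """Map what the user typed to the preferred term, if the thesaurus knows it."""
    key = query.strip().lower()
    return PREFERRED.get(key, key)


def search(query, docs=DOCS):
    """Return documents tagged with the preferred form of the query."""
    term = normalize_query(query)
    return [doc for doc in docs if doc["subject"] == term]


def facet_counts(results, field):
    """Count results per value of one taxonomy dimension, for refinement links."""
    return Counter(doc[field] for doc in results)
```

<p>With this data, <code>search("laptop")</code> returns documents 1 and 2 even though neither is tagged “laptop”, and <code>facet_counts(search("laptop"), "doc_type")</code> gives one manual and one datasheet to offer as refinements.</p>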
<p>I’ve actually seen a recent <a href="http://www.nhs.uk/Search/Pages/HealthExplorer.aspx?q=Diabetes&amp;qID=845#/tab~845~term~845~history~0">example</a> of an intelligent content application that is entirely defined in a taxonomy – the UK National Health Service has built a Flash application that lets you navigate a hyperbolic tree of symptoms and diseases, all of which is directly managed in a taxonomy and flowed directly into the portal. Their taxonomy aids search and results relevancy as well by taking terms like “AIDS” and ensuring that search engine stemming doesn’t return documents that contain the terms “aid” or “aiding” – imagine all the results for “first aid”, “band aid”, “hearing aid”, or “health aid”.  </p>
<p>We also tend not to think of intelligent content systems as being driven by rather “dumb” legacy content; instead we assume they are all about XML and structured content. In fact, most of the intelligent content portals I’ve worked on in the last several years were being populated by legacy PDF content – which was made intelligent only through the use of auto-classification with a well-crafted taxonomy and exposed through faceted search. For lengthy documents, document preview technologies can help home in on relevant pages – also being guided by the taxonomy to map search queries to preferred terms.</p>
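<p>The stemming problem is easy to see in a few lines of Python. This is only a sketch under stated assumptions: <code>naive_stem</code> is a toy suffix stripper standing in for a real stemmer, the <code>PROTECTED</code> set is a hypothetical list of terms pulled from a taxonomy, and I have no knowledge of how the NHS portal actually implements this.</p>

```python
# Hypothetical protected terms taken from a taxonomy; matched case-sensitively.
PROTECTED = {"AIDS"}


def naive_stem(token):
    """Very rough suffix stripping, standing in for a real stemmer."""
    word = token.lower()
    for suffix in ("ing", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word


def index_term(token, protected=PROTECTED):
    """Keep protected terms verbatim; stem everything else."""
    if token in protected:
        return token
    return naive_stem(token)
```

<p>Left alone, the stemmer collapses “AIDS”, “aiding”, and “aid” to the same index term, which is exactly how “first aid” ends up in the results. Checking the protected list first keeps “AIDS” as its own term, provided the query goes through the same <code>index_term</code> step.</p>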
<p>My conclusion in working with search and enterprise content management technology is that taxonomy development and management is a key success factor in creating effective intelligent content systems. Taxonomy can inform content types and metadata schema, make for consistent tagging, harmonize disparate structured data, and drive dynamic search and navigation user experiences, even with not-so-intelligent content.</p>
<p> I&#8217;ll be presenting more on this topic with extensive examples at <a title="Intelligent Content 2011" href="http://www.rockley.com/IC2011/">Intelligent Content 2011</a> in Palm Springs, February 16-18. Hope to see you there!</p>
]]></content:encoded>
			<wfw:commentRss>http://thecontentguy.net/blog/2011/02/04/the-role-of-taxonomy-in-intelligent-content/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>How to Turn Tagging into Cash: Take the Metadata Best Practices Survey</title>
		<link>http://thecontentguy.net/blog/2009/05/26/how-to-turn-tagging-into-cash-take-the-metadata-best-practices-survey/</link>
		<comments>http://thecontentguy.net/blog/2009/05/26/how-to-turn-tagging-into-cash-take-the-metadata-best-practices-survey/#comments</comments>
		<pubDate>Tue, 26 May 2009 15:07:21 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[DITA]]></category>
		<category><![CDATA[ECM]]></category>
		<category><![CDATA[XML]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[semantic technology]]></category>
		<category><![CDATA[Benchmarking]]></category>
		<category><![CDATA[content management]]></category>
		<category><![CDATA[Earley & Associates]]></category>
		<category><![CDATA[markup]]></category>
		<category><![CDATA[metadata]]></category>
		<category><![CDATA[semantic search]]></category>
		<category><![CDATA[survey]]></category>

		<guid isPermaLink="false">http://thecontentguy.net/blog/?p=310</guid>
		<description><![CDATA[We tag stuff to add meaning, and so that we and others – especially information systems – can find it.  But is your approach to tagging business content effective?  Find out - take the Metadata Best Practices Benchmarking Survey from Earley &#038; Associates and Taxonomy Strategies.]]></description>
			<content:encoded><![CDATA[<p>If you couldn’t tell by now, one of my particular interests is tagging, a.k.a. content classification, a.k.a. metadata.  We tag stuff to add meaning, and so that we and others – especially information systems – can find it.  But is your approach to tagging business content effective?  Find out &#8211; take the <strong><a title="Metadata Best Practices Benchmarking Survey" href="http://www.surveymonkey.com/s.aspx?sm=TEtPrAKwkiKIXhkey6revA_3d_3d" target="_blank">Metadata Best Practices Benchmarking Survey</a></strong> from Earley &amp; Associates and Taxonomy Strategies.</p>
<p><strong><span style="color: black; font-size: 14pt;"><a title="Metadata Best Practices Benchmarking Survey" href="http://www.surveymonkey.com/s.aspx?sm=TEtPrAKwkiKIXhkey6revA_3d_3d" target="_blank"><span style="color: blue;"><span style="font-family: Calibri;">Take the Survey</span></span></a></span></strong></p>
<p><span id="more-310"></span>Depending upon context, “tagging” can mean one of three different things: tagging a document, tagging within a document, or tagging a content object.</p>
<p style="PADDING-LEFT: 30px"><strong>Tagging documents.</strong>  These days most of us think of tagging as the keywords we put on our documents – like our photos and websites – so that others can find them when they search.  User tags are fine for finding photos in flickr, but for tagging to be effective in business we need to make it systematic, so that we avoid ambiguity and improve search recall and relevance.  So we’re increasingly “mature” in our approaches to tagging: We use taxonomy to organize our terms into classes and to manage the relationships between terms.  We develop thesauri and foreign language equivalents.  We integrate taxonomies and thesauri into search indexes for ECM and site search and SEO.</p>
<p style="PADDING-LEFT: 30px"><strong>Tagging within a document.</strong>  I got interested in tagging in the early days of XML (back when we spelled it &#8220;S-G-M-L&#8221;), when we were tagging within documents.  By tagging unstructured content inside documents we could do really sophisticated things – not just multi-channel output.  For example, knowing that a paragraph in a document was a step in a service procedure or that a string of gibberish was a part number let us bring life to that content when we transformed it from markup into an interactive electronic technical manual.  <strong>Tagging let us turn books into diagnostic software.</strong></p>
<p style="PADDING-LEFT: 30px"><strong>Tagging reusable content objects.</strong> As content reuse matured with standards like DITA, organizations had more reusable components, with more people creating them in more departments.  Tagging reusable content objects became essential to actually reusing them – if you couldn’t find it, you’d never reuse it.  If you had a single service manual with 100 procedures, now you have at least 100 reusable content objects, so the search scope increased by two orders of magnitude.  At IBM, colleagues report having over a million DITA topics in more than six repositories, with over a dozen departments sharing content across thousands of publications.  <strong>Searching for content objects is like trying to find a needle in a haystack, except you’re trying to find the right needle, and you have more and smaller needles to search amongst, in more and increasingly bigger haystacks.</strong></p>
<p><strong>Measuring Metadata Maturity.</strong>  Each type of tagging can have measurable benefits for your business.  Five years ago, <a title="Earley &amp; Associates" href="http://www.earley.com" target="_blank">Earley &amp; Associates</a> and <a title="Taxonomy Strategies" href="http://www.taxonomystrategies.com" target="_blank">Taxonomy Strategies</a> developed a survey to understand metadata maturity for various types of businesses.  Earley is conducting an updated survey to see how organizations have moved up the learning curve.  Since we have a baseline of responses from five years ago, we’ll be able to describe how metadata and taxonomy practices have matured over time.  Also, the original survey was focused on the impact of metadata best practices on knowledge management and e-commerce search.  We now recognize that metadata is also used by technical communicators – especially those who use XML and other technologies to create, manage, and multichannel publish reusable content.  We want to hear from you all for the first time.</p>
<p>The survey is pretty detailed, so you might want to grab your favorite caffeinated beverage before you dig in.  As compensation for your time (about 15 minutes) Earley &amp; Associates is offering these nifty incentives:</p>
<ul type="disc">
<li style="line-height: 14.25pt; margin: 0in 0in 10pt; color: black;"><strong>A free pass to any future Earley &amp; Associates Community of Practice conference call</strong> (a $50 value).  These are monthly, and the next one is Wednesday June 2<sup>nd</sup> on <a title="Taxonomy Community of Practice - June 2009" href="http://www.earley.com/_June2009.asp" target="_blank">Taxonomy for Portals</a> featuring Giovanni Piazza, Chief Knowledge Officer of Ernst &amp; Young, and Ralph Poole of Earley &amp; Associates.</li>
<li style="line-height: 14.25pt; margin: 0in 0in 10pt; color: black;"><strong>A $200 discount on registration to the <a title="Henry Stewart Digital Asset Management Conference" href="http://www.damusers.com/" target="_blank">Henry Stewart conference</a></strong> on digital asset management, June 1-2 in NYC.  Seth Earley will be there presenting preliminary results.</li>
<li style="line-height: 14.25pt; margin: 0in 0in 10pt; color: black;"><strong>Free participation</strong> in a webcast reviewing the results of the survey (date TBA).</li>
</ul>
<p class="MsoNormal" style="line-height: 14.25pt; margin: 0in 0in 0pt;"><strong><span style="color: black; font-size: 14pt;"><a title="Metadata Best Practices Benchmarking Survey" href="http://www.surveymonkey.com/s.aspx?sm=TEtPrAKwkiKIXhkey6revA_3d_3d" target="_blank"><span style="color: blue;"><span style="font-family: Calibri;">Take the Survey</span></span></a></span></strong></p>
]]></content:encoded>
			<wfw:commentRss>http://thecontentguy.net/blog/2009/05/26/how-to-turn-tagging-into-cash-take-the-metadata-best-practices-survey/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Connecting the dots: How XML authoring enables the Semantic Web</title>
		<link>http://thecontentguy.net/blog/2008/08/15/connecting-the-dots-how-xml-authoring-enables-the-semantic-web/</link>
		<comments>http://thecontentguy.net/blog/2008/08/15/connecting-the-dots-how-xml-authoring-enables-the-semantic-web/#comments</comments>
		<pubDate>Fri, 15 Aug 2008 20:10:17 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[DITA]]></category>
		<category><![CDATA[XML]]></category>
		<category><![CDATA[semantic technology]]></category>
		<category><![CDATA[Calais]]></category>
		<category><![CDATA[Linked Data]]></category>
		<category><![CDATA[markup]]></category>
		<category><![CDATA[metadata]]></category>
		<category><![CDATA[natural language processing]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[Search Monkey]]></category>
		<category><![CDATA[semantic web]]></category>
		<category><![CDATA[web services]]></category>

		<guid isPermaLink="false">http://paulwlodarczyk.wordpress.com/?p=4</guid>
		<description><![CDATA[What if we start combining semantic web technologies and semantic document technologies?]]></description>
			<content:encoded><![CDATA[<p><a title="New, Improved *Semantic* Web!" href="http://flickr.com/photos/14829735@N00/303503677"><img class="alignleft" style="margin: 2px;" src="http://farm1.static.flickr.com/105/303503677_e83d70118f_m.jpg" alt="" width="193" height="240" /></a>I recently attended the <a title="Linked Data Planet" href="http://www.linkeddataplanet.com/" target="_blank">Linked Data Planet</a> conference, where a number of pioneers in the field of Semantic Web shared their perspectives on the state of the art – and business – of helping the world tag its web pages for meaning.  For those of you in the dark about semantic markup, it lets authors annotate their web pages with metadata (HTML attributes that don’t get displayed in the document) that describes what those pages are about. <br />
<span id="more-7"></span><br />
So for example, when I say “New York” in an HTML document it&#8217;s ambiguous – do I mean the city, the state, the Yankees, the Mets, the Giants, the Jets, the song, the steak, the state of mind – you get the idea.  Words are ambiguous – except in the context of the language in which they occur.  So if I am writing about a sporting event <strong>you</strong> know from the context of the article that I mean the team, but the typical search engine does not.  To a search engine, New York is just a string that occurs in the document with some frequency. </p>
<p>There are two ways to make sense out of words in a document.  One is semantic analysis (I&#8217;ll leave that topic to another day).  The other is semantic tagging &#8211; adding metadata to a document.<br />
With metadata, I can define things precisely.  I can state that this document is about the sports team, not the steak.  I can do this by tagging the named entities in the document – the people, places, things, events, and facts – in an unambiguous way.  I can also set those entities into relationships with each other.  For example, a piece of text may refer to two companies involved in a merger.  So I can tag the document being about <strong>Company A</strong> (thing number one) and <strong>Company B</strong> (thing number two) involved in a <strong>merger</strong> (an event, but also a relationship between the two named entities). </p>
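<p>One common way to picture this kind of tagging is as a set of subject-relationship-object statements. The Python sketch below uses that model for the merger example; the <code>Triple</code> structure, the relationship names <code>is_a</code> and <code>merges_with</code>, and the helper functions are hypothetical and not taken from any particular semantic web standard.</p>

```python
from collections import namedtuple

# A statement links a subject to an object through a named relationship.
Triple = namedtuple("Triple", ["subject", "predicate", "obj"])

# Hypothetical annotations for one article about a merger.
ANNOTATIONS = [
    Triple("Company A", "is_a", "Company"),
    Triple("Company B", "is_a", "Company"),
    Triple("Company A", "merges_with", "Company B"),
]


def entities_of_type(triples, type_name):
    """List the subjects declared to be of the given type."""
    return sorted(t.subject for t in triples
                  if t.predicate == "is_a" and t.obj == type_name)


def related(triples, entity, predicate):
    """List what an entity is connected to through one relationship."""
    return [t.obj for t in triples
            if t.subject == entity and t.predicate == predicate]
```

<p>Once the statements exist, questions that plain keyword search can’t answer become simple lookups: <code>related(ANNOTATIONS, "Company A", "merges_with")</code> returns <code>["Company B"]</code> without ever parsing the article text.</p>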
<p>So semantic tagging adds meaning to documents that goes beyond the text, and it does it in an unambiguous way, which is handy.  But it has traditionally faced two large hurdles: (1) it’s been relatively expensive to add semantic markup (either with investments in labor or technology) and (2) there has been little mass market for consuming this markup.  Both of those hurdles are rapidly falling away. </p>
<p>Let’s address the second point first.  Yahoo has introduced <a title="Yahoo! Search Monkey" href="http://developer.yahoo.com/searchmonkey/" target="_blank">Search Monkey</a> – a new technology that rates web pages not on the keywords and number of links to the page (the “wisdom of crowds”) but on the semantic markup that is embedded in the page (the wisdom of the author).  This creates a substantial motive for adding the markup: Search Engine Optimization.  Semantic markup makes your content more likely to be found and more relevant to the searcher.</p>
<p>Great, so how do you add semantic markup?  For legacy content, you need to use some combination of people and automation to add markup to what you already wrote.  Using people to tag content requires specialized skills that are not in good supply.  Natural language processing technologies for auto-tagging content have been around since the late 90s in lab settings; auto-tagging products are emerging in new and interesting forms in the marketplace today. Thomson Reuters’ <a title="Thomson-Reuters Calais" href="http://www.opencalais.com/">Calais</a> open source project is a great example.  For a demo <a title="Calais Viewer Demo" href="http://sws.clearforest.com/calaisviewer/" target="_blank">click here</a> and try pasting some <a title="Terms of use" href="http://www.opencalais.com/terms" target="_blank">non-proprietary</a> text that describes what your company does (for example, I tried the “About Our Company” page we used in proposals at JustSystems and it accurately tagged all of the named companies, legal entities, products, technologies, countries, and cities, and correctly identified JustSystems’s acquisition of XMetaL from Blast Radius as a business event).</p>
<p>Adding semantic markup to new web content as it is created &#8211; making it available as data &#8211; is the way to go.  But what about other types of unstructured content, like documents, that might be published to the web and other channels?  We’ve been doing this with XML and SGML documents all along, using semantic tags to unambiguously flag specific pieces of text for future discovery.  This has ranged from tagging part numbers in a service manual (which could automate adding hyperlinks or improve search relevance), to tagging financial reports with XBRL to find specific facts within the MD&amp;A or footnotes of an annual report (which could prevent another Enron).  But the important concept here is this: when content is tagged, it can be treated as data.</p>
<p>More recent XML standards like <a title="DITA.XML.ORG" href="http://dita.xml.org/" target="_blank">DITA</a> help authors focus on creating granular content – primarily for content reuse.  But our customers are finding that DITA and other topic-oriented XML approaches are helping them break out of the document model – where loads of facts are locked-up within documents.  Think of a lengthy Policies and Procedures manual.  The historical reason it’s all bound in one book is for the convenience of publishing.  Today – with electronic publishing on the web, intranets, and portals – you really only want to publish a single policy or procedure as it is added or revised.  The book itself is obsolete when you can publish a procedure at a time. </p>
<p>In a DITA world, because of its granular nature, a single document (like a Policy manual that was one very large document in your document management system) may instead be managed as a collection of hundreds of DITA topics in your CMS or XML object store.  The document no longer exists as such; it becomes a collection of topics, more like records in a database.  To effectively manage large collections of DITA topics, you <strong>need</strong> to specify metadata for each topic – just so that you can find any given topic again.  So a typical DITA project would define the CMS metadata scheme and the taxonomy for classifying the DITA topics.  For those of us in the XML document world, this is old hat.</p>
<p>So all this makes me ask:</p>
<ul>
<li>What if we start combining semantic web technologies and semantic document technologies?</li>
<li>What if we combine technologies that auto-tag named entities with granular authoring approaches like DITA?</li>
<li>What if you could automatically tag named entities within the DITA topic you are creating, tagging as you type? </li>
<li>What if a web service could automatically provide the CMS metadata when you go to check-in a new topic?</li>
<li>What if the publishing tools that transform your DITA to HTML could automatically add the semantic markup to your HTML pages that are published from your DITA content?</li>
<li>How would that change how you publish business documents like policies and procedures to your employees?</li>
<li>How would it change how you create marketing content for your web site?</li>
<li>How would it change the way you create and manage your product technical content?</li>
</ul>
<p>Could the secret to the semantic web be right under our nose?</p>
]]></content:encoded>
			<wfw:commentRss>http://thecontentguy.net/blog/2008/08/15/connecting-the-dots-how-xml-authoring-enables-the-semantic-web/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

