Drawing a Blueprint for a Scalable Taxonomy

Drawing on the basic concepts of biological classification that most people studied in high school, this article describes how to develop a scalable taxonomy that can migrate to any content repository – from share drives to enterprise content management systems.

Eugene Stakhov, CRM, CDIA+


Not very long ago, the word “taxonomy” didn’t really have a place in the field of information technology. But, as the ability to govern information has grown more sophisticated, so has the language used to describe the newfound complexity of the various interrelationships and countless moving parts that comprise the typical enterprise data landscape.

To the records and information management professional, words like “system” and “program” don’t seem quite adequate now to describe the richness or the organic and evolving nature of this discipline. Today, the words “platforms” and “ecosystems” are used.

This is an important concept because it underscores the challenge of effective information governance in this day and age, and it provides a glimpse of the growing monster so many organizations are grappling with.

Reviewing Taxonomy Fundamentals

Many will remember first hearing the word “taxonomy” in high school science class, where they learned that the hierarchical categories “Kingdom,” “Phylum,” “Class,” “Order,” “Family,” “Genus,” and “Species” are conceptual buckets within which plants and animals can be classified, as in the classification of the tiger.

This hierarchical classification teaches that tigers are carnivores; that every carnivore is a vertebrate; and, therefore, that all tigers must be vertebrates – but not all vertebrates must be tigers.

Inheritance and Specialization

This relationship of a parent class (superclass) to its child (subclass) illustrates the important concepts of inheritance and specialization. In the case of the tiger example, this subclass would be carnivora (Order). Carnivora inherit all the characteristics of the animalia (Kingdom), the vertebrata (Phylum), and the mammalia (Class).

Then, they specialize by defining their own characteristics that are unique to all carnivores. This pattern of inheritance and specialization repeats all the way down to the tigris (Species) – the lowest category of the biological taxonomy tree.
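
The inheritance chain described above can be sketched in a few lines of Python. This is only an illustration: the trait names are invented for the example, the ranks follow the article's own labels, and the Family and Genus levels are skipped for brevity.

```python
# Illustrative sketch: each biological rank is a class that inherits the
# traits of its parent and then adds (specializes with) its own.
# The trait strings are made up for this example.
class Animalia:
    traits = {"multicellular"}


class Vertebrata(Animalia):
    traits = Animalia.traits | {"backbone"}


class Mammalia(Vertebrata):
    traits = Vertebrata.traits | {"fur or hair"}


class Carnivora(Mammalia):
    traits = Mammalia.traits | {"meat-eating teeth"}


class Tigris(Carnivora):
    traits = Carnivora.traits | {"striped coat"}


# All tigers are vertebrates, but not all vertebrates are tigers.
tiger_is_vertebrate = issubclass(Tigris, Vertebrata)
vertebrate_is_tiger = issubclass(Vertebrata, Tigris)
```

Because each level builds its trait set from its parent's, a change made high in the tree flows down to every descendant, which is the behavior a content taxonomy relies on.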

The only difference between a biological taxonomy and its content counterpart is that rather than inheriting limbs and backbones, the latter inherits document characteristics, including metadata and security, and, in some cases, retention requirements.

In fact, records management professionals have been practicing taxonomy development for as long as the discipline has been around. There may be nuances in terminology (e.g., “file plan” and “retention schedule”), but the core concept is the same: the higher up the bucket, the broader the classification; the lower the bucket, the more specialized the classification.

The common denominator among all these classification practices is the specialization and inheritance of characteristics.

Explaining Technical Concepts

Objects and Classes

It is helpful to think of the relationship between classes and objects as analogous to cookie cutters and cookies. Classes are templates that are used to build the objects (documents and folders) that are managed by an enterprise content management (ECM) system. Take the following pattern:

  • Documents are patterned by Classes
  • Classes are described by Properties
  • Classes can pass on their property definitions to one or more children, known as Subclasses

This type of design paradigm borrows from a style of computer programming known as object-oriented programming.
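
As a rough sketch of the pattern above, the following Python models a document class as a template that carries property definitions and hands them down to its subclasses. The names `DocClass`, `own_properties`, and `new_document` are hypothetical and do not come from any particular ECM product.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DocClass:
    """A template (the cookie cutter) that documents are patterned on."""
    name: str
    own_properties: list = field(default_factory=list)
    parent: Optional["DocClass"] = None

    def all_properties(self) -> list:
        # Inherited property definitions come first, then this class's own.
        inherited = self.parent.all_properties() if self.parent is not None else []
        return inherited + self.own_properties

    def new_document(self, **values) -> dict:
        # Stamp out a document (the cookie) with exactly the class's properties.
        allowed = self.all_properties()
        unknown = sorted(set(values) - set(allowed))
        if unknown:
            raise ValueError(f"{self.name} does not define {unknown}")
        doc = {"class": self.name}
        doc.update({prop: values.get(prop) for prop in allowed})
        return doc


document = DocClass("Document", ["title"])
claim = DocClass("Claim", ["claim_number"], parent=document)
cookie = claim.new_document(title="Water damage", claim_number="C-1001")
```

Here `claim` never declares `title` itself; it receives that definition from its parent, which is the subclass relationship in the third bullet.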

Inheritance and Polymorphism

Inheritance and polymorphism are among the core capability requirements of object-oriented design. The latter refers to the ability of a property to have more than one intrinsic meaning. To illustrate this, consider two document classes, one called “Invoice” and the other “Contract.”

The “Invoice” document class may have these properties defined:

  • Invoice Date
  • Invoice Number

The “Contract” document class may have these properties defined:

  • Contract Date
  • Contract Number

Rather than define four separate properties to describe what are essentially only two distinct pieces of information (the date and number), the taxonomy designer may instead choose to paint both document classes with the following properties:

  • Document Date
  • Document Number

In this scenario, the document’s class determines whether “Document Date” refers to an invoice date or a contract date. These two properties are generic enough that they can have more than one intrinsic meaning; they are polymorphic.
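
One way to picture this is the sketch below, in which both classes share the same two properties and the class alone decides what the date means. The `date_meaning` attribute and the `describe` method are hypothetical helpers added for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    document_date: date
    document_number: str

    # Plain class attribute (not a dataclass field): subclasses override it
    # to give the shared "Document Date" property its business meaning.
    date_meaning = "document date"

    def describe(self) -> str:
        return (f"{type(self).__name__} {self.document_number}: "
                f"{self.date_meaning} {self.document_date.isoformat()}")


class Invoice(Document):
    date_meaning = "invoice date"


class Contract(Document):
    date_meaning = "contract date"


invoice = Invoice(date(2012, 5, 1), "INV-001")
contract = Contract(date(2012, 6, 15), "CON-042")
```

Two property definitions now serve both classes, and a search on `document_date` works across invoices and contracts alike.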

Folders

Foldering in an ECM system works differently, conceptually, from foldering in a file system. In an ECM system, folders are used less for storing documents than for organizing them. Take the example illustrated in Figure 1.

Here, a sample insurance claim folder is linked to two constituent documents by virtue of a claim number property. This type of setup allows a user to browse many disparate document classes and/or document types.

Foldering also has workflow implications. Routing folders (as opposed to individual documents) through workflow queues greatly reduces confusion in case-driven workflow scenarios.
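
The linking idea can be sketched as follows, under the assumption that a folder holds no documents itself and simply gathers whatever shares its claim number. The document classes and titles are invented for the example and are not taken from Figure 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_class: str
    title: str
    claim_number: str


@dataclass(frozen=True)
class ClaimFolder:
    claim_number: str

    def contents(self, repository):
        # The folder organizes rather than stores: membership is decided
        # by a shared property value, not by physical location.
        return [d for d in repository if d.claim_number == self.claim_number]


repository = [
    Document("Claim Form", "First notice of loss", "C-1001"),
    Document("Photo", "Damage to roof", "C-1001"),
    Document("Claim Form", "First notice of loss", "C-2002"),
]
folder = ClaimFolder("C-1001")
linked = folder.contents(repository)
```

Routing `folder` through a workflow queue would carry both linked documents along with it, regardless of their differing classes.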

Describing Design Style Concepts, Choices

Document Class

Every ECM system will expose some basic properties for the document and folder classes it models. These out-of-box properties are found at the top-most, basic level of the document taxonomy (the Kingdom level in biology). Typical properties that ECM systems may place on this level include:

  • Date Created – The date/time stamp at which the content artifact was created in the ECM system (This is distinguished from the business-centric “Document Date” property.)
  • Document Identifier – The unique system identifier required for every document. No two documents within the ECM system will have the same identifier, and no identifier will ever be reused.
  • Document Title – The readable title of a document in an ECM system. This property is typically not required.

In this case, the first level is referred to as the document class. (Folder objects would ostensibly derive from a folder class. For now, the focus is on documents.) The document class is the root pattern, the forebear, of all documents, and the properties at this level will be inherited down to every other document subclass along the hierarchy. Every document within the spectrum of the ECM system will contain at least the characteristics of the document class.

Enterprise Class

The next level down is the first level of specialization. The properties from the document class have been inherited and will look like any other native properties at this level. But, in addition to the inherited properties, the taxonomy designer has the choice of defining new ones.

Typically, the document characteristics found on this level apply to every document in the organization. Using the insurance company paradigm as an example, some properties that might make sense at this level include:

  • Active – A Boolean indicator used to aid in records management by marking the content as either an active or inactive record
  • Document Type – A choice listing of document types that add granularity to the class by further specializing into document sub-categories (This will be illustrated further down.)
  • Document Date – The polymorphic, business-centric date of a document (e.g., a contract effective date or an invoice date). This is distinguished from the technical date created property.
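
The first two levels might look like the sketch below. It assumes the system assigns the identifier and creation timestamp automatically; the counter-based identifier is a stand-in for whatever scheme a real ECM system uses.

```python
from dataclasses import dataclass, field, fields
from datetime import date, datetime
from itertools import count
from typing import Optional

_next_id = count(1)  # stand-in for a system-generated unique identifier


@dataclass
class DocumentClass:
    """Root level: out-of-box, system-oriented properties."""
    document_title: Optional[str] = None  # typically not required
    document_identifier: int = field(
        default_factory=lambda: next(_next_id), init=False)
    date_created: datetime = field(default_factory=datetime.now, init=False)


@dataclass
class EnterpriseDocument(DocumentClass):
    """First level of specialization: applies to every document in the organization."""
    active: bool = True
    document_type: Optional[str] = None
    document_date: Optional[date] = None  # business date, not date_created


first = EnterpriseDocument(document_title="Auto policy",
                           document_type="Endorsement",
                           document_date=date(2012, 5, 1))
second = EnterpriseDocument()
property_names = [f.name for f in fields(first)]
```

At the enterprise level the inherited properties sit alongside the new ones with no visible distinction, which matches how they appear to a user of the system.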

Core Class

The third level down is where taxonomy design begins to get creative … and fun! How does one determine that next species of document? The choice made here determines the basis of taxonomy design and is the real meat of this discussion.

There are three general design style patterns that are seen in most organizations:

  • Content-Centric
  • Organizational
  • Functional

Content-Centric – In this design style, the classes are modeled around the meaning behind the underlying content. In records management parlance, the file plan counterpart style to this might be the subject-based file plan. This design marginalizes the relevance of an organizational unit or function in the definition of the document, so if inheritance of security policy is a concern, this may not be the best option.

The focus here is shifted from the function or business unit to each content element’s purpose and unique characteristics (e.g., “what does it mean to be a correspondence document?”).

Organizational – In organizational design, the classes are modeled around the organization of the enterprise. In this design style, named lines of business (LOB) classes are used as parent containers of the document classes that they work with and govern. The subsequent layers of the hierarchy then follow the organization down into smaller and smaller groupings. Here, content is seen as a direct function of its parent LOB.

Organizational design is a simple and security-driven model. It is easy to map security between LOB users and the documents they have access to. One drawback of this model is its rigidity. Since the classes are tightly coupled to business units, it may not be a very good fit for organizations that experience a lot of restructuring or mergers and acquisitions.

Functional – Functional design is modeled around the higher-level abstractions of the functions that an organization carries out. It differs from organizational design in that its classes follow what the corporation does rather than how its business units are arranged.

These functions may mirror the organizational structure, but in a more abstract perspective, by focusing on the function or processes for which the content is used. In records management, this maps to the retention schedule design of the same name.
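
To make the contrast concrete, the sketch below places one hypothetical document, a claim letter, under each of the three styles. All class names below the enterprise level are invented for illustration.

```python
# Hypothetical class paths for the same document under each design style.
# Index 0 is the root document class, index 1 the enterprise class,
# and index 2 the core class, where the three styles diverge.
CLASS_PATHS = {
    "content-centric": ["Document", "Enterprise", "Correspondence", "Claim Letter"],
    "organizational": ["Document", "Enterprise", "Claims Department", "Claim Letter"],
    "functional": ["Document", "Enterprise", "Claims Processing", "Claim Letter"],
}


def core_class(style: str) -> str:
    """Return the third-level (core) class for the given design style."""
    return CLASS_PATHS[style][2]


core_classes = {style: core_class(style) for style in CLASS_PATHS}
```

Only the third level changes. Whatever is attached at that level (security policy, retention, metadata) is what the claim letter inherits, which is why the choice of style matters.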

Reviewing a Detailed Taxonomy

Figure 2 represents the standard way of notating subclass derivation and property arrangement in object-oriented design patterns. At first glance, there is a lot going on, but one nuance is important to understand here: document classes contrasted with document types.

Each rectangle represents a distinct document class modeled for the sample insurance company. Subclasses inherit all the properties and security policies of their respective parents. Properties are listed right under the class name. Required properties are notated in bold.

The items in blue represent the special document type property that was defined earlier at the enterprise level. It is a property just like document title, only it holds values that correspond to types of documents within that same class that share the same characteristics. This concept is an important one.

When structured this way, document type becomes a polymorphic property that can be used to further specialize classes. Taxonomy designers often have the mistaken belief that since document classes are used as classification buckets, a document type property is not necessary. However, this is rarely the case.

A taxonomy can have both, and it should. When the specialization process gets to the point where a document class has nothing to subclass on, but must still have unique document definitions, document type will serve as the differentiator.

Therefore, Figure 2 illustrates two ways in which specialization is achieved down the hierarchical path: using the document type property in cases where the base characteristics are uniform for all like document types, or subclassing in cases where the intrinsic meaning of a document class must be retained but additional or different characteristics are needed.

Compare this to the biology example: if a category of animal differs enough from the class it was derived from to need characteristics of its own, it’s a subclass (a tiger is derived from a carnivore); if not, it’s a document type (a Bengal tiger or a Siberian tiger).
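
The decision rule can be reduced to a small sketch: compare the properties a candidate needs with those its parent class already defines. The function name and the sample property sets are hypothetical, and a real decision would also weigh security and retention, not property names alone.

```python
def specialize(parent_properties: set, candidate_properties: set) -> str:
    """Suggest how to model a candidate beneath an existing document class.

    If the candidate needs nothing beyond what the parent defines, a value
    in the document type property is enough; otherwise it needs a subclass.
    """
    extra = candidate_properties - parent_properties
    return "subclass" if extra else "document type"


claim_document = {"document_date", "document_type", "claim_number"}

# Same characteristics as the parent: distinguish it by document type only.
police_report = specialize(claim_document, {"document_date", "claim_number"})

# Needs an additional property: warrants its own subclass.
medical_bill = specialize(
    claim_document, {"document_date", "claim_number", "provider_id"})
```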

This is the blueprint for a scalable taxonomy. As this tree gets wider and longer, identifying where something belongs becomes easier.

Linking to Records Management

Taking a page out of the utilities playbook, Figure 3 illustrates a real-life example of enterprise taxonomy at work. Here, folders and documents are mapped to retention schedules.

The Public Utilities Regulatory Policies Act (PURPA) folder is linked to its retention by virtue of the folder class. All PURPA studies expire 10 years after they are created. On the other side, the project folder is a bit more complicated, because major projects are retained for 15 years, but minor ones for only 5 years. In this case, we must use the project type property for retention schedule mapping guidance.

The point is that records declaration can be achieved automatically based on a pre-determined mapping between the content and records taxonomies. By assigning ordinary metadata to their content, users may not even realize they’re actually declaring and classifying records.
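
A minimal sketch of such a mapping follows, using the retention periods from the utilities example. It assumes that every retention period starts at the creation date (the text states this only for PURPA studies), and the class and type labels are written here as plain strings for illustration.

```python
from datetime import date

# (folder class, project type) -> retention period in years.
# PURPA folders map by class alone; project folders also need the project type.
RETENTION_YEARS = {
    ("PURPA Folder", None): 10,
    ("Project Folder", "Major"): 15,
    ("Project Folder", "Minor"): 5,
}


def retention_years(folder_class: str, project_type=None) -> int:
    # Ignore project type for classes whose retention does not depend on it.
    key_type = project_type if folder_class == "Project Folder" else None
    return RETENTION_YEARS[(folder_class, key_type)]


def expiry_date(created: date, folder_class: str, project_type=None) -> date:
    # Assumption: retention is counted from the creation date.
    target_year = created.year + retention_years(folder_class, project_type)
    try:
        return created.replace(year=target_year)
    except ValueError:
        # Created on Feb 29 and the target year is not a leap year.
        return created.replace(year=target_year, day=28)


purpa_expiry = expiry_date(date(2012, 6, 1), "PURPA Folder")
major_expiry = expiry_date(date(2012, 6, 1), "Project Folder", "Major")
minor_expiry = expiry_date(date(2012, 6, 1), "Project Folder", "Minor")
```

The user only ever sets the folder class and the project type; the retention period falls out of the mapping without a separate declaration step.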

Designing for the Enterprise

Defining an abstract taxonomy and actually implementing it in a specific system are two different things, oftentimes very different. Each ECM system has slightly different nuances in how things are named, how they’re structured, and how they’re connected.

The core concepts presented here in a general sense can be taken and implemented in any content repository. A taxonomy shouldn’t be married to its host system; the core concepts of an abstract taxonomy should migrate to an ECM system of choice.

One standardization tactic is to refer to the Content Management Interoperability Services (CMIS) specification for guidance. CMIS allows different ECM systems to communicate and exchange information with each other or with some other program. It acts as a layer of abstraction so all ECM platforms speak the same language.

For example, where one platform calls its main data store an “Object Store,” CMIS standardizes it by calling it a “Repository.” CMIS models the document, folder, and repository objects for the taxonomy designer to derive from.
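
As a hedged illustration, the sketch below maps the three root-level properties discussed earlier onto standard CMIS property identifiers and builds a CMIS query string from them. The identifiers `cmis:objectId`, `cmis:name`, `cmis:creationDate`, and the base type `cmis:document` come from the CMIS specification; pairing "Document Title" with `cmis:name` is an assumption made for this example, and the helper function is hypothetical.

```python
# Abstract taxonomy property -> CMIS property identifier.
# The pairing of "Document Title" with cmis:name is an assumption.
CMIS_PROPERTY_IDS = {
    "Document Identifier": "cmis:objectId",
    "Document Title": "cmis:name",
    "Date Created": "cmis:creationDate",
}


def cmis_select(properties, type_id="cmis:document"):
    """Build a simple CMIS query string for the given taxonomy properties."""
    columns = ", ".join(CMIS_PROPERTY_IDS[p] for p in properties)
    return f"SELECT {columns} FROM {type_id}"


query = cmis_select(["Document Identifier", "Document Title", "Date Created"])
```

Keeping the mapping in one table means the abstract taxonomy can keep its own business-friendly names while the platform-facing names are swapped in at the edge.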

Getting Started with Workshops

To begin a real-world implementation, the best strategy is to begin with the front-line warriors, the end-users. Start with one LOB and involve it in the taxonomy-building process right away through interactive workshops and questionnaires.

This usually starts out as very qualitative work. The idea is to get the narrative of what drives the organization’s data. Discuss the process, not the document. Every process has an input and an output. Whether these are documents, work list items, or records (or all three), they can all have a place in an intelligently designed taxonomy.

The following six pieces of information are crucial to understanding the full picture in terms of breadth and scope of content elements in play at any organization:

  1. Document characteristics (volume, format, input)
  2. Organizational structure
  3. Process
  4. Security
  5. Retention
  6. Reporting

If building a taxonomy from scratch, there will invariably be confusion among concepts like document classes, document types, inheritance, and polymorphism. It is a good idea to abstract these concepts as much as possible. As these are identified, cross-reference them in a matrix that can then be used to build the taxonomy hierarchy.

The matrix should be a listing of document types and their properties, so a quick glance can yield the lay of the land, and document types can be collapsed into broader classes. Other useful tools include data dictionaries for property semantics and nomenclature, a system requirements specification, and a system design specification.
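
One possible way to work with such a matrix is sketched below: document types whose property sets are identical are grouped together as candidates for a single broader class. The document types and properties are invented sample data, and exact matching is a simplification of what would be a judgment call in a workshop.

```python
from collections import defaultdict

# Sample matrix: document type -> the properties it needs (invented data).
MATRIX = {
    "Invoice": {"Document Date", "Document Number", "Vendor"},
    "Credit Memo": {"Document Date", "Document Number", "Vendor"},
    "Contract": {"Document Date", "Document Number", "Counterparty"},
}


def collapse(matrix):
    """Group document types that share exactly the same set of properties."""
    groups = defaultdict(list)
    for doc_type, properties in matrix.items():
        groups[frozenset(properties)].append(doc_type)
    # Sort for a stable, readable result.
    return sorted(sorted(types) for types in groups.values())


candidate_classes = collapse(MATRIX)
```

In this sample, Invoice and Credit Memo land in one group and could become two document type values of a single class, while Contract stands alone.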

Avoiding Common Pitfalls

Don’t Design Around a Platform

It is important to design the platform around the taxonomy, not vice-versa. As stated earlier, the hierarchy from Figure 2 should be deployable anywhere.

Keep in mind the importance of polymorphic document types. Strive to keep the taxonomy lean and flexible, and look for warning signs of rigidity in data architecture.

Reject Unneeded Properties

Occasionally, taxonomy project stakeholders and others may demand the inclusion of properties that are out of place or not in concert with taxonomy design. It is important to stand one’s ground when this happens. Not all metadata is equal, and the following set of questions may help weed out what may be a “nice to have” property from what is truly necessary:

  1. Is the property used to search on and/or to display in hit list results?
  2. Does it have any use for business process/workflow?
  3. Does it have any use for records management-related functions?
  4. Is it used for reporting?

These four dimensions really ought to have it covered. The problem with “nice-to-have” requests is that they can cause the taxonomy to grow bulky and unwieldy.
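
The four questions translate naturally into a simple checklist. In the sketch below, a requested property is kept only if at least one answer is yes; the class name, field names, and sample requests are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PropertyRequest:
    name: str
    used_for_search_or_hit_list: bool = False   # question 1
    used_in_workflow: bool = False              # question 2
    used_for_records_management: bool = False   # question 3
    used_for_reporting: bool = False            # question 4

    def is_justified(self) -> bool:
        # A property earns its place if any one of the four answers is yes.
        return any((
            self.used_for_search_or_hit_list,
            self.used_in_workflow,
            self.used_for_records_management,
            self.used_for_reporting,
        ))


requests = [
    PropertyRequest("Claim Number", used_for_search_or_hit_list=True,
                    used_in_workflow=True),
    PropertyRequest("Favorite Color"),
]
kept = [r.name for r in requests if r.is_justified()]
```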

Growing the Taxonomy

Once a basic blueprint for the taxonomy is established, it should get conceptually easier to grow it as the need evolves. If it is not, that is often a sign that something is wrong at a higher level.


Eugene Stakhov, CRM, CDIA+, can be contacted at gstakhov@lighthousecs.com.

From May - June 2012