Tech Trends

Visual Analytics: A New Way to Manage Data Deluge in E-Discovery

E-discovery (also known as e-disclosure or electronic data disclosure) is the process of locating, securing, and searching through electronically stored information (ESI) with the intention of using it in legal proceedings. Lawyers involved in corporate litigation and regulatory investigations face a growing challenge: how to undertake this task in the face of a deluge of data in a growing number of formats and locations arising from the shift from paper to electronic records creation and storage.

Victoria L. Lemieux, Ph.D.


Jason Baron, an expert on e-discovery and director of litigation at the U.S. National Archives and Records Administration, has observed that it is not unusual for lawyers to have to sift through millions of files held on electronic media of all types, including databases, online networked systems, websites, and disaster recovery backup tapes, in search of relevant evidence.

Challenges of Searching, Retrieving ESI

In the throes of a data deluge, lawyers increasingly struggle to locate ESI in the face of requests from litigants and regulators to provide “any and all relevant documents” – a standard legal phrase used in e-discovery requests. Many well-publicized cases, such as Zubulake v. UBS Warburg, highlight how shortcomings in organizations’ records and information management practices have contributed to the problem.

Another reason lawyers struggle to meet the e-discovery challenge is that their search and retrieval techniques no longer seem to be up to the task. The most common methods currently used in e-discovery, keyword searching and line-by-line review of document content, are increasingly ineffective for the massive volumes of data that must be sifted through for each case.

There have been a number of studies highlighting the limitations of existing search and retrieval techniques. In 2007, The Sedona Conference®, a legal non-profit think tank, issued its Best Practices Commentary on the Use of Search and Information Retrieval Methods in E-Discovery in which the limitations of keyword searching were described:

Keyword searches work best when the legal inquiry is focussed on finding particular documents and when the use of language is relatively predictable. For example, keyword searches work well to find all documents that mention a specific individual or date, regardless of context. However, the experience of many litigators is that simple keyword searching alone is inadequate in at least some discovery contexts. This is because simple keyword searches end up being both over- and under-inclusive in light of the inherent malleability and ambiguity of spoken and written English (as well as all other languages).

Recognizing that such limitations exist, many legal practitioners are incorporating new techniques into their e-discovery practices. These include:

  • Probabilistic search models (e.g., Bayesian classifiers) – searches that use a formula based on values assigned to particular words based on their interrelationships, proximity, and frequency to establish a relevancy ranking that is applied to each document searched
  • Fuzzy search models – searches beyond specific words, recognizing that words can have multiple forms. By identifying the “core” for a word, the fuzzy search can retrieve documents containing all forms of the target word.
  • Clustering search models – searches that group documents by similarity of content, for example, the presence of a series of same or similar words that are found in multiple documents
  • Concept and categorization tools – search systems that rely on a thesaurus to capture documents that use alternative ways to express the same thought
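
To make the contrast between plain keyword searching and the fuzzy search model described above more concrete, here is a minimal sketch in Python. It is an illustration only, not how any commercial e-discovery product works: the suffix list, the `crude_stem` helper, and the sample memos are all hypothetical, and real systems use far more careful stemming.

```python
import re

# Hypothetical, deliberately tiny suffix list (an assumption for illustration).
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip one common English suffix to approximate the word's 'core'."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def exact_search(term, docs):
    """Plain keyword search: match the literal term only."""
    return [doc_id for doc_id, text in docs.items()
            if term.lower() in tokenize(text)]

def fuzzy_search(term, docs):
    """Match any token that shares the term's crude stem."""
    target = crude_stem(term)
    return [doc_id for doc_id, text in docs.items()
            if any(crude_stem(tok) == target for tok in tokenize(text))]

# Hypothetical sample documents.
docs = {
    "memo-1": "We must audit the account.",
    "memo-2": "The auditing team finished.",
    "memo-3": "Accounts were audited last year.",
    "memo-4": "Lunch on Friday?",
}

exact_hits = exact_search("audit", docs)   # ['memo-1']
fuzzy_hits = fuzzy_search("audit", docs)   # ['memo-1', 'memo-2', 'memo-3']
```

In this toy example the literal keyword search is under-inclusive: it misses the memos that say "auditing" and "audited," which the stem-based search retrieves.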

What Visual Analytics Is, How It Can Help

As early as 2006, Forrester Research predicted that visual analytics (VA) was going to be “the next big thing” in e-discovery. VA is a relatively new field that supports analytical reasoning through interactive visual interfaces.

It involves much more than just using visualizations to understand data better. It is an integrated approach combining visualization, human cognition, and data analysis that “enables detection of the expected and discovery of the unexpected within massive, dynamically changing information spaces,” according to Kris Cook, the Visual Analytics Community consortium director for the National Visualization and Analytics Center.

VA tools apply visualizations (e.g., bar charts, node-link diagrams, clusters, or timelines) to represent the results of computer analysis. (See Figure 1 for an example.) Users can then interact with these visual representations to conduct further analysis iteratively. The results of each additional analytical step are also presented visually and can be analyzed further if desired.

VA’s original domain of application was in science, but it has now moved into other areas, such as business intelligence, fraud detection, and epidemiology. The kinds of questions and issues that have attracted other domains of analysis to this technology are similar to the types of questions and issues faced in e-discovery.

A growing cadre of researchers argue that legal practitioners would benefit from support in drawing together the emergent document classes (groups of related irrelevant documents and groups of related relevant documents) that a reviewer becomes aware of during the review task. UK-based researchers Simon Attfield and Ann Blandford note that many traditional review systems fail to assist the reviewer in this and so adversely affect cognitive momentum and the efficiency and effectiveness of the e-discovery task. They conclude that interactive information visualizations provide new opportunities to move beyond such limitations.

VA tools, in providing a visual representation of concepts and their interrelationships in a domain, have been shown to be extremely helpful to people who need to learn about a domain. Some researchers go beyond even this claim to suggest that interactive visualization tools can prompt reviewers to think more creatively.

The Centre for the Investigation of Financial Electronic Records at the School of Library, Archival and Information Studies (SLAIS), University of British Columbia (UBC), has been experimenting with VA tools specifically designed to overcome such cognitive barriers by applying computational power to the analysis of large datasets in order to “see the unseen.”

Put simply, with the kinds of tools being used at UBC-SLAIS, reviewers do not need to know in advance what they are searching for. These VA tools use vector space clustering algorithms to analyze a dataset and present the entire “universe” of data found within the chosen dataset in the form of a visual representation (e.g., nodes, galaxies, and spires).

This capability surmounts a problem that often occurs with keyword searching, where documents can fall through the cracks of a search. These types of VA tools process every document and assign each one to a cluster, so no document is left out accidentally. A further advantage is that the reviewer can then engage interactively with the visualization to discover more about the cluster of documents it represents, test hypotheses, and develop new ones.
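
The idea of clustering in a vector space can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the algorithm used by the tools at UBC-SLAIS or by any vendor: documents are reduced to raw word counts, similarity is measured by cosine similarity, and grouping is a greedy single pass with a hypothetical threshold of 0.3. The sample documents are invented.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Represent a document as a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.3):
    """Greedy single-pass grouping: each document joins the first cluster
    holding a sufficiently similar member, or else starts a new cluster.
    Every document therefore ends up in exactly one cluster."""
    vectors = {doc_id: vectorize(text) for doc_id, text in docs.items()}
    clusters = []
    for doc_id, vec in vectors.items():
        for members in clusters:
            if any(cosine(vec, vectors[m]) >= threshold for m in members):
                members.append(doc_id)
                break
        else:
            clusters.append([doc_id])
    return clusters

# Hypothetical sample documents.
docs = {
    "a": "quarterly revenue forecast revised",
    "b": "revised revenue forecast for quarter",
    "c": "team lunch menu options",
    "d": "lunch menu for the team offsite",
}

clusters = cluster(docs)   # [['a', 'b'], ['c', 'd']]
```

Note that no query is supplied: the grouping emerges from the documents themselves, and a document with no similar neighbors simply forms a cluster of its own rather than being dropped.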

Although research is ongoing, a number of VA tools are commercially available, falling roughly into three groups:

  1. Those that have been specifically developed and designed for e-discovery
  2. Those that are traditional search tools that have been enhanced by the addition of some visualization capability
  3. Those that are “pure play” visual analytics tools that have not been designed specifically for e-discovery, but which could be used for this purpose

Visual Analytics Limitations

As exciting a new technology as VA is, it has limitations, which include:

  • The technical skills needed to learn VA tools and operate them effectively are still significantly greater than could be expected of a casual user.
  • There is no predefined pathway or protocol for the process of analytical reasoning to be followed to interact with a dataset. There can be significant differences in the process of analysis among individuals and groups. In e-discovery, it is entirely up to the reviewer to follow hunches and lines of reasoning. It can also be quite difficult to know when there is nothing more to be gained by further interacting with a visual representation.
  • Preparing data for analysis with VA software can still take significant effort. Datasets come in many formats, and documents have diverse structures and lengths; nevertheless, they must all be rendered in a form that a VA tool can read.
  • An increasing amount of the data that might be of interest in an investigation or litigation is non-textual (e.g., voicemail and video clips). Speech recognition software exists to translate speech into text to enable analysis, but comparable tools for images remain at the development stage.
  • While it is important to understand how different algorithms function, this is not easy to do when a commercially available VA tool is used. Such algorithms tend to be trade secrets, and vendors may be reluctant to be too transparent about their structure and how they function.
  • VA tools generally do not deal with context very well. Depending on the algorithm, documents referring to a project code-named “Friday” may cluster with all the other mundane documents referring to “Friday” in a given dataset. Without explicit tagging, documents generated in the course of the same business function or project, but which discuss a variety of unrelated matters, are unlikely to cluster together.
  • Visual representations can invite different interpretations of the same dataset. Just as with an optical illusion, it is possible to see different things and extract diverging meanings from an image.
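
The context limitation noted above (the “Friday” example) can be reproduced with a small Python sketch. It assumes a simple bag-of-words cosine similarity, which is only one of many possible measures and not necessarily what any given VA tool uses. The three sample documents and the `tag_project_friday` token are hypothetical.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts; word order and context are discarded."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# Two documents from a project code-named "Friday" and one mundane note.
project_memo = "Friday launch budget approved"
project_contract = "vendor contract signed yesterday"
mundane_note = "Friday lunch at noon"

# The project memo looks more like the lunch note than like its own
# project's contract, because only the surface word "friday" is shared.
sim_mundane = cosine(vectorize(project_memo), vectorize(mundane_note))
sim_related = cosine(vectorize(project_memo), vectorize(project_contract))

# Explicit tagging (a hypothetical tag token added to both project
# documents) gives the two project documents something in common.
TAG = " tag_project_friday"
sim_tagged = cosine(vectorize(project_memo + TAG),
                    vectorize(project_contract + TAG))
```

Here `sim_mundane` is 0.25 while `sim_related` is 0.0, so a similarity-based clustering would pull the project memo toward the lunch note. Adding the shared tag lifts the similarity between the two project documents above zero, which is the effect of the explicit tagging the text mentions.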

A Promising Competitive Advantage

In a world where organizations are generating terabytes and more of data in a multitude of complex digital formats and across a growing number of locations, surviving the digital deluge requires innovative new techniques and technologies. The old ways of doing business are no longer viable.

VA is among the new approaches that show great promise for overcoming some of the limitations of traditional search and retrieval practices in e-discovery. As promising an approach as VA is, it is in its relative infancy and, consequently, has many limitations. As with any emerging technology, more research on VA is needed.

Luckily, that research is moving at a fast pace, and new developments are quite likely to overcome current limitations. To suggest that VA can help overcome the digital data deluge in e-discovery is not to suggest that it is a panacea, however. Organizations will still need to pay attention to the fundamentals of records and information management to be fully prepared for e-discovery challenges.

Nevertheless, combined with the fundamentals, VA has the potential to offer a competitive advantage in the search for ESI.


Victoria Lemieux, Ph.D., can be contacted at vlemieux@interchange.ubc.ca.

From March - April 2011