Enterprise Search vs. E-Discovery Search: Same or Different?
As the amount of electronically stored information (ESI) within the enterprise has grown, keyword search has become an increasingly important way to uncover relevant information, whether for an enterprise search conducted for knowledge management purposes or for an e-discovery search to find information relevant to litigation or a government inquiry.
To date, most enterprises have used the same search technologies for both tasks. However, a recent trend among large and small enterprises suggests that a significant divergence is occurring between enterprise searches and e-discovery searches.
Both start by entering a search term in a search box, but that’s where the similarities end. The business requirements are different and, as a result, each needs different capabilities. Suggesting that one search technology can be applied for both purposes is like saying a person who spent years training for the 100-meter dash stands a chance at winning an Olympic marathon.
Business Objective Is a Key Consideration
The business objectives are different for enterprise searching vs. e-discovery searching. For an enterprise search, the primary objective is to find the five to 10 most relevant documents that contain the information needed to perform the job, whether it is for sales, marketing, finance, or product development.
For example, let’s say an aeronautical engineer working on a next-generation plane, Model Y, has been tasked with selecting “pitot probes,” one of the critical sensors used to measure an aircraft’s airspeed. Pilots need to know the accurate speed of the aircraft, as incorrect information can lead to two bad consequences: under-speed, which can lead to a stall, and over-speed, which can lead to the aircraft breaking up.
A logical starting point would be to understand which pitot probes were used in previous-generation aircraft, Model X, and the selection criteria for the pitot probes. So, the engineer is likely to search for “pitot probes modelX,” scan the first few documents from thousands of search results, retrieve the information, and complete the task.
E-discovery searches are conducted in response to litigation, regulatory inquiries, or investigations, and the objective is to search, retrieve, and analyze ALL relevant documents, not just the top five or 10 most relevant ones.
Using the same example, let’s assume that inaccurate airspeed information led to a flight accident involving a Model X aircraft and the National Transportation Safety Board (NTSB) has initiated an investigation into the safety of pitot probes. The NTSB will likely request all documents matching searches that contain multiple variations of words such as “pitot,” “probes,” “safety,” “tests,” “flight incidents,” “malfunctions,” “failure,” and “airspeed discrepancies,” among others, in order to find all of the thousands or millions of relevant documents.
Number of Search Queries Matter
For enterprise searches, users typically enter one search query at a time and want to see the most relevant results as quickly as possible. When users do not find the documents they are looking for, they modify the query and try a few iterations until they find the information of interest.
Continuing with the previous example, if “pitot probes modelX” does not retrieve the right results, the engineer will likely try other queries, such as “pitot probes selection criteria modelX” or “pitot probes evaluation modelX.” Each of these queries runs as a separate, standalone search, and the user has no visibility into which results are associated with which query.
In the e-discovery search example of information requested by the NTSB, the search request will include Boolean, wildcard, phrase, and proximity searches and could easily result in more than 100 complex queries, such as:
- “pit* prob*”
- “pit* AND prob* AND safe*” ~10 (within 10 words of each other)
- “pit* OR prob*” AND “safety”
- “pit* prob* modelX” ~10 AND “pit* prob* modelX safety” ~35
- “modelX” AND “safe*”
- “pit* prob*” AND “safe*” ~20 (within 20 words of each other)
- “pitot probes flight incidents”
- “pit* prob*” AND “test*”
- “pit* prob*” AND “flight incid*”
- “pit* prob*” AND “malfunc*”
- “pit* prob*” AND “fail*”
- “pit* prob*” AND (safe* OR incid* OR malfunc* OR fail*)
Therefore, e-discovery searches require the ability to run multiple queries at once and get a count of unique results across all queries, as well as a count of unique results for each query. The breakdown will enable the user to decide which queries are most likely to return the most relevant results.
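A minimal sketch of what this looks like in practice, using a hypothetical four-document corpus and simple prefix matching in place of a real search engine (the documents, queries, and `term_matches` helper are all illustrative assumptions, not part of any product):

```python
import re

# Toy corpus: document ID -> text (hypothetical data for illustration).
docs = {
    1: "pitot probe safety test results for modelX",
    2: "quarterly marketing plan",
    3: "probe malfunction during modelX flight incident",
    4: "pitot tube airspeed discrepancies reported",
}

# Each query is a list of trailing-wildcard terms that must ALL appear
# (AND semantics); proximity operators are omitted to keep the sketch short.
queries = {
    '"pit*" AND "prob*"':     ["pit*", "prob*"],
    '"pit*" AND "safe*"':     ["pit*", "safe*"],
    '"prob*" AND "malfunc*"': ["prob*", "malfunc*"],
}

def term_matches(term, text):
    # Translate a trailing-* wildcard into a regex anchored at a word start.
    pattern = r"\b" + re.escape(term.rstrip("*"))
    return re.search(pattern, text) is not None

# Run every query in one pass, keeping a result set per query.
hits_per_query = {
    name: {doc_id for doc_id, text in docs.items()
           if all(term_matches(t, text) for t in terms)}
    for name, terms in queries.items()
}

# Per-query counts show which queries pull their weight...
for name, ids in hits_per_query.items():
    print(f"{name}: {len(ids)} documents")

# ...while the union gives the unique review population across all queries.
all_hits = set().union(*hits_per_query.values())
print(f"unique documents across all queries: {len(all_hits)}")
```

The per-query breakdown and the overall unique count are the two numbers the text describes: one tells the user which queries are productive, the other sizes the total review set without double-counting documents hit by multiple queries.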
The Cost of Relevancy
For the purposes of general enterprise searches, the user only reads and reviews the most relevant documents. There is not a business cost or loss of time if the search is over- or under-inclusive, as long as the first page or first couple of pages contains the information they need to conduct the task.
For e-discovery searches, attorneys have to analyze all the documents, and review can be very expensive. According to “Automated Document Review Proves Its Reliability,” published in the November 2005 Digital Discovery & e-Evidence, “Data collections often run into many gigabytes or terabytes of data. Considering that one terabyte is generally estimated to contain 75 million pages, a one-terabyte case could amount to 18,750,000 documents, assuming an average of four pages per document. Further assuming that a lawyer or paralegal can review 50 documents per hour (a very fast review rate), it would take 375,000 hours to complete the review. In other words, it would take more than 185 reviewers working 2,000 hours each per year to complete the review within a year. Assuming each reviewer is paid $50 per hour (a bargain), the cost could be more than $18,750,000.”
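The arithmetic behind the quoted estimate is easy to check directly; all figures below are taken from the quote itself:

```python
# Figures from the quoted one-terabyte estimate.
pages_per_terabyte = 75_000_000
pages_per_document = 4
docs_reviewed_per_hour = 50
hours_per_reviewer_year = 2_000
cost_per_hour = 50  # dollars

documents = pages_per_terabyte // pages_per_document        # 18,750,000 documents
review_hours = documents // docs_reviewed_per_hour          # 375,000 hours
reviewers_needed = review_hours / hours_per_reviewer_year   # 187.5 reviewers
total_cost = review_hours * cost_per_hour                   # $18,750,000

print(documents, review_hours, reviewers_needed, total_cost)
```

The exact reviewer count works out to 187.5, consistent with the quote’s “more than 185 reviewers,” and every non-relevant document that can be excluded up front removes review hours at $50 each.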
As a result, it is critical to minimize the number of non-relevant documents included in a search result. Therefore, in addition to speed and simplicity, e-discovery search users also care about the specific words or query expansions in the search result and want the ability to include or exclude expanded terms.
In the previous example, the reason the user searches for “pit*” is to account for misspellings of the word pitot, such as “pito” or “pitit.” At the same time, “pit*” will retrieve many false positives, such as “pitbull,” “pitch,” “pitfire,” “pithy,” and “pittsburgh.” A transparent query expansion would present the user with all of these terms that actually appear in the data set and allow them to exclude the false positives, reducing review costs.
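The idea can be sketched in a few lines: expand the wildcard against the term index so the user sees every concrete term it would match, then let the user strike the false positives. The term list below is a hypothetical index built for illustration:

```python
import fnmatch

# Hypothetical term index extracted from the data set.
index_terms = ["pitot", "pito", "pitit", "pitbull", "pitch",
               "pitfire", "pithy", "pittsburgh"]

# Step 1: expand the wildcard transparently, so the user sees every
# concrete term that "pit*" would actually match in this data set.
expanded = fnmatch.filter(index_terms, "pit*")

# Step 2: the user reviews the expansion and excludes the false positives.
excluded = {"pitbull", "pitch", "pitfire", "pithy", "pittsburgh"}
included = [t for t in expanded if t not in excluded]

print(included)  # ['pitot', 'pito', 'pitit']
```

Only the surviving terms are then used for retrieval, so the misspellings are still caught while “pitbull” and “pittsburgh” never enter the review set.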
Defensibility or Lack Thereof
Finally, for enterprise searches, speed and simplicity are the most important requirements. As a result, enterprise search technologies have been designed to work as a black box that enables the user to enter a single search query and returns the results that match that query. Think Google. The search engine interprets the search query the user entered, but what it actually searches for is hidden from the user.
E-discovery search users also need speed and simplicity, but in addition, they need defensibility and the ability to “show their work” as mandated by court rulings. Recent opinions, such as U.S. Magistrate Judge Paul Grimm’s in Victor Stanley, Inc. v. Creative Pipe, Inc., 2008 WL 2221841 (D. Md. May 29, 2008), highlight some of the limitations of search and outline approaches attorneys can employ to defensibly use keyword search given these limitations.
Grimm goes on to specifically cite The Sedona Conference® Best Practices Commentary on the Use of Search & Retrieval Methods in E-Discovery (The Sedona Conference® Best Practices) as a source of best practices. “In this regard, compliance with The Sedona Conference® Best Practices for use of search and information retrieval will go a long way toward convincing the court that the method chosen was reasonable and reliable.”
The Sedona Conference® Best Practices document doesn’t prescribe a specific “process” parties should follow, but it does suggest the following key components could be used in an effective search methodology.
Testing. Searches need to be tested for efficacy, i.e., whether the search is producing over- or under-inclusive results.
Sampling. The primary way to test the efficacy of a search is through sampling. In Victor Stanley, Grimm states that “The only prudent way to test the reliability of the keyword search is to perform some appropriate sampling of the documents determined to be privileged and those determined not to be in order to arrive at a comfort level that the categories are neither over-inclusive nor under-inclusive.”
Iterative feedback. Finally, the process of testing and refining one’s search based on the results of testing needs to be iterative so every refinement can be validated.
So, e-discovery searching needs to support a workflow that enables the user to easily follow and document The Sedona Conference® best practices of testing, sampling and iterative refinement. For example, the user should have the ability to rapidly sample the results from all the individual queries contained within a search to ensure that the terms are relevant.
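A minimal sketch of the sampling step, assuming hypothetical per-query result sets from a larger search (the query names, result counts, and sample size are all illustrative):

```python
import random

# Hypothetical per-query result sets (document IDs) from a larger search.
results = {
    '"pit* prob*"':             list(range(1, 5001)),
    '"pit* prob*" AND "safe*"': list(range(1, 801)),
    '"pit* prob*" AND "fail*"': list(range(400, 650)),
}

SAMPLE_SIZE = 25
rng = random.Random(42)  # fixed seed so the drawn sample is reproducible

# Draw a random sample from each query's result set.
samples = {
    query: rng.sample(ids, min(SAMPLE_SIZE, len(ids)))
    for query, ids in results.items()
}

# Each sample is then reviewed for relevance; queries whose samples are
# mostly noise get refined, and the sampling is repeated (iterative feedback).
for query, sample in samples.items():
    print(f"{query}: sampled {len(sample)} of {len(results[query])} documents")
```

Fixing the random seed matters for defensibility: the same sample can be re-drawn and documented later, which supports the testing and iterative-refinement record the court expects.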
With e-discovery search, the results for each query that has been run as part of a larger search need to be automatically documented. The details around query expansions that have been included or excluded as a result of transparent query expansion should also be automatically documented. Continuing with the previous example, a best practice for a defensible e-discovery search would be to record that the user included “pito” and “pitit” and excluded “pitbull,” “pitfire,” and “pittsburgh” when documenting the search “pit*.”
Choose the Right Tool for the Job
Conducting e-discovery for litigation or an investigation using enterprise search technology is a risky gamble that can result in negative outcomes in court, penalties, and excessive litigation costs. There’s a reason why Olympic athletes specialize in their area of strength – and it’s for that same reason that e-discovery search technologies have entered the marketplace to address the crucial requirements that are specific to the e-discovery business process.
Kamal Shah can be contacted at firstname.lastname@example.org.