
eDiscovery Responsiveness

The National Institute of Standards and Technology (NIST) holds an annual Text Retrieval Conference (TREC).  One element of that Conference has been a Legal Track that focuses on evaluating a corpus of email messages that the Federal Energy Regulatory Commission ("FERC") captured from Enron during its investigation of the company in the early 2000s.

The corpus was about 100GB in size, consisting of 685,592 documents in total.  A large number of email messages had been captured more than once.  For TREC, a list of 455,449 distinct messages was identified as canonical; every other message duplicated one of the canonical messages.  These messages contained 230,143 attachment files.  Text and native versions of the documents were made available to participants for analysis.
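The paper does not spell out how canonical messages were selected, but deduplication of this kind is commonly done by hashing a normalized form of each message and keeping one representative per hash. A minimal sketch, using hypothetical message data and a simple whitespace/case normalization as assumptions:

```python
import hashlib

def canonicalize(messages):
    """Group duplicate messages by a hash of their normalized text,
    keeping the first occurrence of each distinct message as canonical.

    `messages` is an iterable of (message_id, text) pairs.
    Returns (canonical, mapping): canonical maps each digest to its
    representative message, and mapping sends every message id to the
    digest of the canonical message it duplicates.
    """
    canonical = {}   # digest -> (message_id, text) kept as canonical
    mapping = {}     # message_id -> digest of its canonical message
    for msg_id, text in messages:
        # Collapse whitespace and case so trivially re-captured copies collide.
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in canonical:
            canonical[digest] = (msg_id, text)
        mapping[msg_id] = digest
    return canonical, mapping

# Hypothetical examples: m2 is a re-capture of m1 with different spacing/case.
msgs = [
    ("m1", "Meeting at 10am tomorrow."),
    ("m2", "Meeting at  10AM  tomorrow."),
    ("m3", "Quarterly results attached."),
]
canonical, mapping = canonicalize(msgs)
print(len(canonical))                      # 2 distinct messages
print(mapping["m1"] == mapping["m2"])      # True: m2 duplicates m1
```

The same idea, applied at scale, is how 685,592 captured documents can collapse to 455,449 distinct canonical messages.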

The paper "Overview of the TREC 2011 Legal Track" by M. R. Grossman (Wachtell, Lipton, Rosen & Katz), G. V. Cormack (University of Waterloo), B. Hedin (H5), and D. W. Oard (University of Maryland, College Park) is available at the TREC website.  The authors show the results of an expert human evaluation of a baseline set of documents and compare that to the efforts of ten organizations using computer-based algorithms.

The TREC 2011 Legal Track evaluated the efficacy of various review techniques and tools chosen and implemented by the participating teams. Some participants may have conducted an all-out effort to achieve the best possible results, while others may have conducted experiments to illuminate selected aspects of document review technology. It was inappropriate (and forbidden by the TREC participation agreement) to claim that the results showed one participant's system or approach was better than another's. It was also inappropriate to compare the results of TREC 2011 with the results of past TREC Legal Track exercises, because the test conditions, as well as the particular techniques and tools employed by the participating teams, were not directly comparable.

The 2011 TREC Legal Track was the sixth since the Track's inception in 2006, and the third to use a collection based on Enron email. The results summarized in the paper show that the technology-assisted review efforts of several participants achieved recall scores about as high as could reasonably be measured with the evaluation methodologies then available. Those efforts required human review of only a fraction of the entire collection, with the consequence that they were far more cost-effective than manual review. Room remained to improve the efficiency and effectiveness of technology-assisted review efforts, and, in particular, the accuracy of intra-review recall estimation tools.
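Recall, the metric emphasized above, is the fraction of truly responsive documents that a review actually found; its companion, precision, is the fraction of documents marked responsive that really are. A minimal sketch of both, using purely illustrative counts that are not figures from the TREC paper:

```python
def recall(true_positives, false_negatives):
    """Fraction of truly responsive documents the review found."""
    return true_positives / (true_positives + false_negatives)

def precision(true_positives, false_positives):
    """Fraction of documents marked responsive that truly are."""
    return true_positives / (true_positives + false_positives)

# Illustrative numbers only: a review marks 1,000 documents responsive,
# of which 900 are truly responsive, while 100 responsive documents
# elsewhere in the collection were missed.
tp, fp, fn = 900, 100, 100
print(recall(tp, fn))      # 0.9 -- found 900 of 1,000 responsive documents
print(precision(tp, fp))   # 0.9 -- 900 of 1,000 flagged documents were responsive
```

Estimating recall *during* a review is harder than computing it after the fact, because the false-negative count is unknown and must itself be estimated by sampling the unreviewed documents; that estimation is the "intra-review recall estimation" whose accuracy the paper flags as needing improvement.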