What's involved in eDiscovery?

Having spent last week at LISA '12, I ended up describing what my company does to the people who asked what I do. Few people outside of the Legal industry know everything that's involved. The part of the industry where Sysadmins live is only familiar with one stage of it, if it's familiar with any of it at all: collections.

If you drop "ediscovery flowchart" into your search-engine of choice you'll get a wide selection of graphics. To save load, I'll give it in outline form:

  1. Records Management
  2. Identification
  3. Preservation
  4. Collection
  5. Processing
    • Early Case Assessment
    • Review
  6. Production
  7. Presentation

Now to go into a bit more detail.

The first three rungs of that chart are actually the responsibility of the entity being discovered, not the law firms involved. Sysadmins are involved with those, even if they don't realize that's what they're doing.

Records Management
This means dealing with your data as it should be dealt with. Know how regulations mandate document handling, especially retention requirements. As sysadmins we help build such systems, while grumbling darkly about overgenerous email quotas making backups a pain.

Identification
After a legal matter is confirmed to be a Thing (it could be a union grievance, or a class action lawsuit, or even a divorce proceeding) it's up to an entity to identify what data is likely to contain information relating to the matter (called 'responsive'). If you don't have good records management, this step is a pain.

Sysadmins get involved at this stage when the person doing the figuring out starts asking questions about how easy it is to get at data. There may be dark mutterings here, but the simple fact of the matter is that such data needs to be produced.

Preservation
After data is identified, it needs to be preserved. The most common term for this is "legal hold". Document destruction policies are suspended for the data in question. Failing to do so (even for "oops! I accidentally ran the quarterly purge a day early and didn't see that email in time, sorry" events) opens an entity up to major fines. The courts, and the Opposing Party, generally don't have a sense of humor here.

Some systems are better about this than others. Exchange has built-in mechanisms for handling Legal Hold, where all mail entering an account under Hold is preserved even if the user attempts to delete it. Others, such as Gmail, don't have such built-in features, so it's up to the user to remember not to delete stuff. In extreme cases this can stop backup-media recycling while the Hold exists (see also: dark mutterings).
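
For systems without a built-in Hold, the purge job itself has to know about it. Here is a minimal sketch of that idea, not how any particular mail system does it. The Message record, the purge_candidates function, and the 90-day retention default are all made up for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Message:
    # Hypothetical record; a real mail store carries far more than this.
    owner: str       # mailbox the message lives in
    received: date   # day the message arrived
    subject: str


def purge_candidates(messages, held_owners, today, retention_days=90):
    """Return the messages a retention purge is allowed to delete.

    Anything owned by someone under legal hold is skipped,
    no matter how old it is.
    """
    cutoff = today - timedelta(days=retention_days)
    return [
        m for m in messages
        if m.received < cutoff and m.owner not in held_owners
    ]


mailbox = [
    Message("alice", date(2012, 1, 5), "old, but alice is under hold"),
    Message("bob", date(2012, 1, 5), "old and not held"),
    Message("bob", date(2012, 12, 1), "too recent to purge"),
]
doomed = purge_candidates(mailbox, held_owners={"alice"}, today=date(2012, 12, 14))
```

With the sample data only bob's January message ends up in doomed. The point is that the hold check lives inside the purge, so running the purge a day early can't eat held mail.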

Collection
When actual Discovery starts, the data that has been preserved is collected in ways that preserve the chain of evidence. For smaller matters it could be the sysadmin who actually greps an entire mail database for certain keywords. In others, forensic examiners are sent in to take images of hard-drives that have been Preserved. The Legal department of the entity in question may be granted read rights to document repositories on file-servers so they can pull out potentially responsive documents.
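
For the sysadmin-with-grep flavor of collection, the difference from an ordinary search is that you record what you found and what state it was in. A rough sketch, assuming a directory of plain files small enough to read into memory. The collect function and the manifest fields are my own invention, not any standard format.

```python
import hashlib
import os
from datetime import datetime, timezone


def collect(root, keywords):
    """Walk root and return one manifest entry per file containing a keyword."""
    needles = [k.lower().encode() for k in keywords]
    manifest = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                data = fh.read()  # fine for a sketch, not for multi-GB mail stores
            haystack = data.lower()
            if not any(n in haystack for n in needles):
                continue
            stat = os.stat(path)
            manifest.append({
                "path": path,
                # The hash lets anyone later show the file hasn't changed.
                "sha256": hashlib.sha256(data).hexdigest(),
                "size": stat.st_size,
                "mtime": datetime.fromtimestamp(
                    stat.st_mtime, tz=timezone.utc
                ).isoformat(),
            })
    return manifest
```

Calling collect("/srv/mail", ["merger"]) (path made up) returns a list of dicts you can dump to a file and hand over alongside the data. A real collection also has to cope with encodings, attachments, and mailbox formats, which is part of why the professionals get called in.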

Processing
This stage is pretty broad, and represents where the vast majority of spending on eDiscovery happens (a 2009 study found it represented over 90% of the spending). This is where my company sits, and as a market segment this is what most in-industry people think of when they hear the word eDiscovery.

Processing: Early Case Assessment
This is the stage where the Collected data is chewed on and spat out in a format that Lawyers (or their aides) can get through in reasonably sane times. There are a variety of products out there for this, and we sell one. This is an extremely complex automation problem because of the sheer diversity of file formats; they all have to be full-text indexed (which can involve OCR in the case of scanned PDFs), metadata preserved, and in some cases converted to a paginated image format like TIFF. Document deduplication can happen at this stage.
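
The simplest form of deduplication is exact matching on a content hash. This sketch shows only that, and I'm assuming commercial platforms do a good deal more (near-duplicates, email threads and so on). The deduplicate function and its return shape are hypothetical.

```python
import hashlib


def deduplicate(documents):
    """Group documents by the SHA-256 of their bytes.

    documents: iterable of (doc_id, content_bytes) pairs.
    Returns (unique, duplicates). unique maps each digest to the first
    doc_id seen with it; duplicates maps every later doc_id to the
    doc_id it duplicates.
    """
    unique = {}
    duplicates = {}
    for doc_id, content in documents:
        digest = hashlib.sha256(content).hexdigest()
        if digest in unique:
            duplicates[doc_id] = unique[digest]
        else:
            unique[digest] = doc_id
    return unique, duplicates


docs = [
    ("inbox/0001", b"Q3 forecast attached."),
    ("inbox/0002", b"Lunch on Friday?"),
    ("sent/0417", b"Q3 forecast attached."),
]
unique, duplicates = deduplicate(docs)
```

Here sent/0417 is flagged as a duplicate of inbox/0001, so a reviewer only has to read that message once.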

All that automation costs serious money. Without it, Lawyers (or almost always, their aides) would have to handle each file and hand-record things. For a case of any size this doesn't scale.

Processing: Review
After the Collected data has been appropriately masticated and indexed, it's time for the legal team to actually go through it all and find responsive documents. This can be done one by one for every document (linear review), or it can leverage heuristics similar to what spam-checkers use to learn what a responsive document looks like and tag things appropriately (predictive coding). If document deduplication wasn't handled during ECA, it'll happen here. This is a rather specialized workflow with a few dominant products.
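
To make the spam-checker comparison concrete, here is a toy word-count classifier in the naive Bayes style that classic spam filters use. It is a sketch of the general idea only. I am not claiming any Review product works this way, and tokenize, train, and score are names I made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def train(labeled_docs):
    """labeled_docs: iterable of (text, is_responsive) pairs tagged by a reviewer."""
    counts = {True: Counter(), False: Counter()}
    docs = {True: 0, False: 0}
    for text, label in labeled_docs:
        counts[label].update(tokenize(text))
        docs[label] += 1
    return counts, docs


def score(model, text):
    """Log-odds that text is responsive; above zero leans responsive."""
    counts, docs = model
    vocab = len(set(counts[True]) | set(counts[False]))
    totals = {label: sum(counts[label].values()) for label in counts}
    # Start from how common each class was in training (add-one smoothed).
    result = math.log((docs[True] + 1) / (docs[False] + 1))
    for word in tokenize(text):
        p_resp = (counts[True][word] + 1) / (totals[True] + vocab)
        p_not = (counts[False][word] + 1) / (totals[False] + vocab)
        result += math.log(p_resp / p_not)
    return result


model = train([
    ("merger terms attached", True),
    ("draft merger agreement", True),
    ("lunch on friday", False),
    ("friday parking notice", False),
])
```

With that tiny training set, score(model, "merger agreement draft") comes out positive and score(model, "parking friday") comes out negative. A human still tags the training examples, and the machine just extends their judgment across the rest of the pile.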

In Days of Yore, this was done with legions of law clerks in basements reviewing printed-off documents by hand. They also deduplicated them by hand. I remember news stories of Chrysler getting sued and the number of documents produced reported as semi-truck loads. This method is really slow and really manpower intensive, which is why it's only done for really small cases.

Production
Once all of the Responsive documents are identified, they need to be Produced for opposing counsel, who will then do their own review. This functionality is usually found within the Review/ECA products these days. The actual documents produced could be image formats (TIFF, PDF, or in a few cases PNG) or, more commonly these days, the native files themselves.

This stage is fraught with peril. If data wasn't collected correctly (perhaps the date on all the email is the day of collection instead of being spread over the 5 years it's supposed to cover), opposing counsel can point this out and the data will have to be recollected and run through the processing & review stages again, only on a much tighter time-table. There is a reason collections are done by trained professionals and the processing platforms are so expensive: doing it wrong loses cases and is extremely expensive in penalties or adverse judgments.

This re-do is something sysadmins can get involved in. If such recollections need to happen, it'll be done in a SCREAMING HURRY. Whatever the original collection method was will be thrown out (those bash-scripts didn't preserve mtime, crap) and something else put in its place. If this involves granting hurriedly retained forensic consultants full access to a system RIGHT NOW, then so be it.
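
The mtime trap is easy to show. In Python, shutil.copy leaves the destination's modification time at the moment of the copy, while shutil.copy2 also tries to carry the metadata across. A small sketch that copies and then checks instead of trusting (collect_copy is a made-up helper name):

```python
import os
import shutil


def collect_copy(src, dst):
    """Copy src to dst and confirm the modification time came along."""
    shutil.copy2(src, dst)  # contents plus metadata such as mtime
    # Compare whole seconds, since filesystems differ in timestamp
    # resolution. Very coarse ones may still fail this check.
    if int(os.stat(src).st_mtime) != int(os.stat(dst).st_mtime):
        raise RuntimeError(f"mtime not preserved for {src}")
    return dst
```

The same goes for shell scripts: plain cp resets the timestamp, cp -p keeps it. Whatever the tool, checking afterward is cheap compared to a recollection.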

Presentation
The data is finally presented for the case. It could be in court, in arbitration, in front of the Labor Relations Board, or even before Congress.

All this work is not cheap. There have been many cases where a large entity defends against a suit from a much smaller entity by snowing them under with large Discovery requests so that the cost of producing that data exceeds their entire litigation budget, or by Producing so much irrelevant data that Reviewing it all does the same. Some of the out-of-court settlements you've read about have come about this way.