This is Part Two of a continuing series on ESI basics. In this series, we cover some of the terms used most often on the tech side of e-discovery. In Part One, my colleague, Phil Moon, provided an overview of PSTs. You can find that article here. Whether this is an introduction or a refresher for you, and whether you are an attorney, a member of an in-house team or a data analyst, this information may come in handy in your practice.
What Is Processing and Why Do You Need It?
Let’s say you have a new case and you request proposals from various e-discovery vendors (or your discovery counsel firm) for preparing the documents for review and production. Maybe you get a proposal back with a very large number next to it called "ESI processing" or “Metadata Extraction.” What do those terms mean and what are you actually getting for your money?
In general terms, ESI processing and metadata extraction involve collecting electronic data and making it usable. In more specific terms, ESI processing runs all of your native documents through software that parses the information in each file's header, along with its extracted text, into searchable fields. These searchable fields are often referred to as metadata (data about the data). Metadata allows you to search, put documents in chronological order and otherwise analyze your documents in a way that is efficient and consistent across software applications. Note that there are also forensic methods for metadata extraction, but this blog is confined to the more common non-forensic analysis.
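To make the idea of "parsing the header into searchable fields" concrete, here is a minimal sketch using Python's standard email library. It is an illustration only, not how any particular processing platform works: the sample message, the field names and the extract_fields helper are all made up for this example.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# A made-up message standing in for one native email file.
RAW_EMAIL = """\
From: alice@example.com
To: bob@example.com
Subject: Q3 forecast
Date: Tue, 02 Mar 2021 09:15:00 -0500
Content-Type: text/plain

Please see the attached numbers.
"""

def extract_fields(raw: str) -> dict:
    """Parse header values and body text into named, searchable fields."""
    msg = message_from_string(raw)
    return {
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        # A real date value (not just text) is what makes chronological sorting possible.
        "date_sent": parsedate_to_datetime(msg["Date"]),
        "extracted_text": msg.get_payload().strip(),
    }

fields = extract_fields(RAW_EMAIL)
for name, value in fields.items():
    print(f"{name}: {value}")
```

Once every document has been reduced to the same set of named fields, sorting by date or searching by author becomes a simple lookup rather than a manual read of each file.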
ESI Processing – Items to Consider
1. Processing is only as good as the tool that is being used. Some advanced processing tools can handle a very large number of file types and can extract the maximum amount of information available. They can also do it quickly. Other more basic processing tools can handle the most common file types and may be sufficient for your needs. Some popular processing tools are Law, Nuix, IPRO, Relativity Processing and other proprietary tools that are created by database hosting vendors.
2. The software provider will have a list of file types that can be processed and a definition of what the tool considers an "exception." Exception files are usually proprietary file types used with specific software, or files that cannot be extracted (for example, some system files and database files). Ask your vendor for the list of processable file types to make sure that their platform can handle your files, particularly if you use 3D modeling software, accounting software or legacy files that no longer have software support. Also, keep in mind that metadata extraction processes may differ depending on the type of device from which the data was collected.
3. You should be able to request a list of the specific pieces of data that are extracted. There can be hundreds of items depending on the file type and you won't necessarily need them all for the purposes of production but it is good to know what can be provided.
4. You’ll want to ask about the handling of duplicates, attachments and compressed files. Will documents be de-duped on a global or custodial basis at the time of processing? Does the processing platform automatically unzip container files? Does it do that with other compression formats, like 7-Zip or PKZIP? Can it open EnCase files?
5. Also, you may need to ask how the processing platform manages password-protected or corrupt files. Will it log them for you so that you can review a report and consult with the client regarding a solution? Most platforms should be able to provide you with such a report.
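The difference between global and custodial de-duplication in item 4 is easier to see with a small example. The sketch below is a simplified illustration, assuming duplicates are identified by hashing file content; real platforms typically apply more elaborate rules (for example, for emails and their attachments). The custodian names, file names and the dedupe function are hypothetical.

```python
import hashlib

def file_hash(content: bytes) -> str:
    """Fingerprint a file by its content."""
    return hashlib.sha256(content).hexdigest()

def dedupe(documents, scope="global"):
    """Keep the first copy of each file.

    scope="global": one copy across all custodians.
    scope="custodial": one copy per custodian.
    """
    seen = set()
    kept = []
    for custodian, name, content in documents:
        digest = file_hash(content)
        key = digest if scope == "global" else (custodian, digest)
        if key in seen:
            continue  # duplicate within the chosen scope
        seen.add(key)
        kept.append((custodian, name))
    return kept

# Hypothetical collection: the same spreadsheet appears three times.
DOCS = [
    ("Smith", "budget.xlsx", b"budget v1"),
    ("Smith", "budget_copy.xlsx", b"budget v1"),
    ("Jones", "budget.xlsx", b"budget v1"),
    ("Jones", "memo.docx", b"memo text"),
]

global_kept = dedupe(DOCS, "global")
custodial_kept = dedupe(DOCS, "custodial")
print(len(global_kept), "documents after global de-duplication")
print(len(custodial_kept), "documents after custodial de-duplication")
```

Global de-duplication leaves fewer documents to review, while custodial de-duplication preserves the fact that each custodian held a copy, which is why it is worth asking which one your vendor applies.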
File Compression and Data Expansion
In most cases, your data will expand after processing, which means the size of the data collected will not be the size of the data hosted. Data expansion occurs most often with email because processing pulls out any attachments as separate files and extracts the metadata from them. It can also pull out embedded files and unwrap various containers, including the zip files mentioned above. While the expanded size can be estimated, it largely depends on the original systems used and how they were set up. An Outlook .pst file will typically expand to 1.75 to 2.00 times the original .pst size. However, Lotus Notes .nsf files can expand to much more than that, sometimes 3 to 4 times the original size, because of the way the files are compressed. So, when budgeting for a case, it’s important to understand what type of data you have in order to reasonably estimate the potential processing and hosting charges.
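As a rough budgeting aid, the expansion ranges above can be turned into simple arithmetic. The sketch below uses only the multipliers mentioned in this post; the 40 GB collection size and the estimate_hosted_gb function are hypothetical, and your actual expansion will depend on how the source systems were set up.

```python
# Expansion multipliers (low, high) taken from the ranges discussed above.
EXPANSION_RANGES = {
    "pst": (1.75, 2.0),  # Outlook
    "nsf": (3.0, 4.0),   # Lotus Notes
}

def estimate_hosted_gb(collected_gb: float, source_type: str) -> tuple:
    """Return a (low, high) estimate of the hosted size in GB."""
    low, high = EXPANSION_RANGES[source_type]
    return collected_gb * low, collected_gb * high

# Hypothetical example: 40 GB collected from each type of system.
pst_low, pst_high = estimate_hosted_gb(40, "pst")
nsf_low, nsf_high = estimate_hosted_gb(40, "nsf")
print(f"40 GB of .pst data: roughly {pst_low:.0f} to {pst_high:.0f} GB hosted")
print(f"40 GB of .nsf data: roughly {nsf_low:.0f} to {nsf_high:.0f} GB hosted")
```

The same collected volume can therefore produce very different hosting charges depending on the email system it came from.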
File Type Filtering
During the early stages of processing (the “pre-processing” stage), you may have the option of narrowing down your document collection before full processing begins. Simple filtering can include de-NISTing, which compares each file's hash value against a list of known operating system and application files published by the US government's National Institute of Standards and Technology (the NIST list) and removes the matches. The chief purpose of NIST filtering is to remove files that are unlikely to be usable or responsive, such as system files. Some processing providers also filter by file type, using their own common list of non-document file types that can be removed prior to processing and review. Examples include .db, .tmp and .bak files.
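Here is a minimal sketch of how pre-processing filters of this kind might work. It assumes the known-file list is a set of hash values (the NIST list is distributed as hashes of known software files) and that the provider's own list is a set of file extensions. The hash set, the extension list, the file names and the keep_for_processing function are stand-ins invented for this illustration, not real NIST data or any vendor's actual list.

```python
import hashlib
from pathlib import PurePath

# Stand-in for the NIST list: hashes of known system/application files.
KNOWN_SYSTEM_HASHES = {hashlib.sha1(b"operating system file").hexdigest()}

# Stand-in for a provider's list of non-document file types.
EXCLUDED_EXTENSIONS = {".db", ".tmp", ".bak"}

def keep_for_processing(name: str, content: bytes) -> bool:
    """Return False for files that either filter would remove."""
    if hashlib.sha1(content).hexdigest() in KNOWN_SYSTEM_HASHES:
        return False  # matches a known system file
    if PurePath(name).suffix.lower() in EXCLUDED_EXTENSIONS:
        return False  # non-document file type
    return True

# Hypothetical collection of (file name, file content) pairs.
FILES = [
    ("kernel32.dll", b"operating system file"),
    ("thumbs.db", b"cache"),
    ("contract.docx", b"contract text"),
    ("notes.BAK", b"old notes"),
]

kept = [name for name, content in FILES if keep_for_processing(name, content)]
print(kept)
```

Matching on hash values means a known system file is removed even if it has been renamed, while the extension list catches file types that are rarely worth reviewing.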
You should request a summary report after the documents are processed, which will show you the various file types that were in the processed data set. This report is very handy if you have to go back to the client to ask about specific file types that couldn’t be processed. It will also reveal the number of emails, Word documents, spreadsheets, Adobe PDFs, etc., so that you can get an idea of how long and detailed the review may be.
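A summary report of this kind is essentially a count of files by type. The following sketch shows the idea with a made-up list of processed file names; the file_type_summary function is hypothetical, and an actual vendor report would normally identify file types more reliably than by extension alone.

```python
from collections import Counter
from pathlib import PurePath

def file_type_summary(names):
    """Count processed files by (lower-cased) file extension."""
    return Counter(PurePath(n).suffix.lower() or "(none)" for n in names)

# Hypothetical list of processed files.
PROCESSED = ["a.msg", "b.msg", "c.docx", "d.xlsx", "e.pdf", "f.MSG", "README"]

summary = file_type_summary(PROCESSED)
for extension, count in summary.most_common():
    print(f"{extension:8} {count}")
```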
After processing is complete, you will be able to run keywords to cull the data set needing review. Keywords can be set to run across metadata as well (which we have found to be extremely helpful). You would provide a list of relevant search terms and then receive a report displaying the number of documents identified, by term. You can then tweak the terms to get to the document set you intend to review.
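The term-by-term report described above can be sketched in a few lines. This is a simplified illustration, assuming each document is a set of named fields (metadata plus extracted text) and that a "hit" is a case-insensitive whole-word match in any field. The sample documents, search terms and hit_report function are invented for the example; review platforms offer far richer search syntax.

```python
import re

def hit_report(documents, terms):
    """Count how many documents contain each term in any field."""
    report = {}
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        report[term] = sum(
            1
            for doc in documents
            if any(pattern.search(value) for value in doc.values())
        )
    return report

# Hypothetical documents: metadata fields plus extracted text.
DOCS = [
    {"subject": "Merger timeline", "author": "A. Smith", "text": "Draft schedule attached."},
    {"subject": "Lunch", "author": "B. Jones", "text": "Can we discuss the merger at noon?"},
    {"subject": "Invoice 1042", "author": "A. Smith", "text": "Payment is overdue."},
]

report = hit_report(DOCS, ["merger", "invoice", "smith"])
for term, count in report.items():
    print(f"{term}: {count} document(s)")
```

Note that "smith" only hits because the author field is searched along with the text, which is the practical benefit of running keywords across metadata.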
It’s helpful to discuss processing options with your service provider up front so that you are aware of any impact to timelines. In addition, you’ll want to have realistic expectations when you discuss discovery deadlines and ESI specifications with opposing counsel. Asking the right questions before and during processing will help you avoid document-related issues down the road, particularly during the review and production phases.
DISCLAIMER: The information contained in this blog is not intended as legal advice or as an opinion on specific facts. For more information about these issues, please contact the author(s) of this blog or your existing LitSmart contact. The invitation to contact the author is not to be construed as a solicitation for legal work. Any new attorney/client relationship will be confirmed in writing.