Structured vs. Unstructured Data
It takes a minute to come to terms with the definitions of "structured" and "unstructured" data. It seems logical to associate “formatting,” such as the formatting that comes with Word documents (indentation, headers and footers, paragraphs, etc.), with “structure,” but the terms are used very differently when it comes to describing data. Structured data traditionally lives in relational databases queried with SQL and, more simply, in Excel spreadsheets. Unstructured data refers to data that does not easily fit into databases and includes Word documents, emails and PDFs. It seems counter-intuitive, but data that is formatted in a way that makes sense for everyday communication is actually unstructured data.
Structured vs. Unstructured Data in E-Discovery
While structured data is easy for a computer or data analyst to interpret, it presents challenges in e-discovery. Unstructured data, on the other hand, is something that we have long since come to terms with in e-discovery. The development of software platforms that allow us to view, search, code and analyze unstructured data is pretty amazing. Predictive coding is already giving way to advanced AI in order to keep up with the expanding volume of data in discovery. But these tools, and even simple tools such as text searching, struggle to provide a means to handle structured data.
While the task may seem daunting, plenty of effort has been aimed at handling structured data. Even the most basic form of structured data - Excel - presents challenges. Imaging doesn’t make sense because of exorbitant page counts, so we generally produce Excel files natively, which presents its own challenges, such as redacting privileged and personal information. However, tools have been, and continue to be, developed to meet these challenges.
The New Structured Data Source
In the last few years, we have seen the emergence of a new structured data source - messaging platforms. Instant communication platforms such as Skype, Slack and Bloomberg have exploded in the corporate world and, as a result, have created a new challenge in e-discovery. These challenges run throughout the EDRM, from identification, preservation and collection all the way through production.
Challenge # 1: Identifying the Data
One of the most basic challenges comes in the form of simply identifying a custodian. For the sake of this blog I will use Slack as an example, but the lessons can generally be extended to other messaging platforms as well. A custodian is generally considered to be the person who has control over an electronic document, with an email box being the most common example (for example, I am the custodian of my email). With a messaging platform, however, who the custodian is may not be so straightforward. There are public channels, private channels, private messages and so on. An individual participating in a public channel does not have any control over local storage of the data, so are they a custodian? I think most of us would agree that by participating in a relevant communication, their data is discoverable, but how do we identify and collect that user’s data?
Challenge # 2: Preserving the Data
Preservation becomes an enormous factor in what is ultimately available to be collected. The default setting in Slack is to preserve all data forever. However, an administrator can alter these settings to limit retention to a certain time frame or allow individuals to set their own retention policies for their private channels. With such a framework, on the one hand, we run the risk of having an uncontrollable amount of data; on the other hand, we run the risk of an individual user deleting complete channels of communication (creating just the opposite problem). It is critical that we are aware of these options and their implications.
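To make the retention risk concrete, here is a minimal Python sketch of the question we would want answered early: given a retention window, which messages would already be gone by the time we collect? This is not how Slack itself implements retention. The function name, the "sent" field and the 90-day window are all assumptions made up for illustration.

```python
from datetime import date, timedelta

def messages_at_risk(messages, retention_days, as_of):
    """Return messages older than the retention window as of a given date.

    Hypothetical helper: 'sent' is an assumed field holding the message date.
    """
    cutoff = as_of - timedelta(days=retention_days)
    return [m for m in messages if m["sent"] < cutoff]

# Illustrative data only; the channel name and dates are invented for the sketch.
messages = [
    {"channel": "general", "sent": date(2017, 9, 13)},
    {"channel": "general", "sent": date(2017, 12, 1)},
]

# With an assumed 90-day window, anything sent before the cutoff is at risk.
at_risk = messages_at_risk(messages, retention_days=90, as_of=date(2017, 12, 31))
print(len(at_risk), "of", len(messages), "messages fall outside the window")
```

Running a check like this against a client's actual settings is a quick way to frame the preservation conversation before anything is lost.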
Challenge # 3: Collecting the Data
When it comes to data collection, exporting data from Slack requires administrative control. While it is convenient to have somewhat centralized control of any data for collection, it presents problems when the collection could be limited in scope to only certain individual custodians. Participation by a user in public channel communication will require the export of that entire channel, which may lead to the collection of an enormous amount of data that is not relevant, which in turn creates challenges for culling and review.
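The over-collection problem can be sketched in a few lines of Python. The example below assumes an export that has already been loaded into a dictionary of channel names mapped to lists of messages, each with a "user" field. That layout, the function name and the sample users are assumptions for illustration, not a description of Slack's actual export.

```python
def cull_by_custodian(export, custodians):
    """Keep channels where a custodian posted, and report how much is theirs.

    Hypothetical helper: 'export' maps channel name -> list of message dicts,
    and each message dict carries an assumed 'user' field.
    """
    kept = {}
    for channel, msgs in export.items():
        hits = [m for m in msgs if m["user"] in custodians]
        if hits:
            kept[channel] = {"total": len(msgs), "by_custodians": len(hits)}
    return kept

# Illustrative data only.
export = {
    "general": [
        {"user": "user1", "text": "status update"},
        {"user": "user2", "text": "thanks"},
        {"user": "user3", "text": "lunch?"},
    ],
    "random": [
        {"user": "user3", "text": "anyone seen my charger?"},
    ],
}

summary = cull_by_custodian(export, {"user1"})
print(summary)
```

Even in this toy case, one message from a custodian pulls in an entire channel. Comparing the "total" and "by_custodians" counts gives a rough sense of how much culling lies ahead.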
Challenge # 4: Making Sense of the Data
What does Slack data look like? Here is an example of a simple Slack .json file:
"text": "<@user1> has left the channel"
In this example, all we find out is that a particular user ‘has left the channel.’ Imagine how much data can be generated if it takes that much text just to note that someone has joined or left a channel. Here is an example of a brief conversation:
"date": "September 13th, 2017",
"time": "3:20 PM",
"date": "September 15th, 2017",
"time": "4:12 PM",
"message": "Can you send me an invite to test flight?"
"date": "September 15th, 2017",
"time": "4:22 PM",
"message": "yep. what's your user ID?"
All of that just to convey the conversation:
Can you send me an invite to test flight?
yep. what's your user ID?
This particular snippet came from a channel that generated over 1,200 pages of text and an average of only 10 lines of actual communication per page. When ultimately re-formatted, the entire communication was reduced to 200 pages. Still daunting, but a substantial reduction in review.
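As a rough illustration of that kind of re-formatting, the Python sketch below reads records with the "date", "time", "message" and "text" fields shown above and keeps one line per actual message. The "user" field, the sample user names and the idea that system events carry only "text" are assumptions added for the example. A real export will have more fields and a different layout.

```python
import json

# Illustrative records using the field names shown above; 'user' is assumed.
raw = """
[
  {"text": "<@user1> has left the channel"},
  {"date": "September 15th, 2017", "time": "4:12 PM", "user": "user2",
   "message": "Can you send me an invite to test flight?"},
  {"date": "September 15th, 2017", "time": "4:22 PM", "user": "user3",
   "message": "yep. what's your user ID?"}
]
"""

def to_transcript(records):
    """Collapse each record with a 'message' into a single readable line."""
    lines = []
    for r in records:
        if "message" not in r:
            # Skip join/leave notices and other records with no conversation.
            continue
        user = r.get("user", "unknown")
        lines.append(f'{r["date"]} {r["time"]} {user}: {r["message"]}')
    return lines

transcript = to_transcript(json.loads(raw))
for line in transcript:
    print(line)
```

Whether to drop the join and leave notices entirely, as this sketch does, is a judgment call. In some matters, who was in the channel at a given time may itself be relevant.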
Challenge # 5: Reviewing and Producing the Data
The last line of this message, "yep. what's your user ID?", creates another problem for review and production, namely how do we handle privileged or private information? Producing the files natively would be a reasonable solution if we didn’t have privacy concerns. The native .json files open readily enough in text editors, but the page count is unwieldy if you have to image them in order to redact sensitive data.
Aside from the formatting issue related to text, another concern is how to handle non-text communication within messaging platforms, such as emojis. There isn’t any denying that such communication is part of our vocabulary, and now we need to identify ways to deal with it. A single emoji is a perfectly understood response to a question posed in Slack or Skype, but how will it be handled when it is reduced to raw text in the export after collection?
Where Do We Go From Here?
Even this cursory look at some of the structured data coming out of messaging platforms can be a little unnerving. But don’t worry, there are already a number of solutions out there. E-discovery tools have been developed to handle data coming from various messaging platforms. While no one has developed a comprehensive solution for all messaging platforms as of yet, nearly every platform has been addressed by someone.
There are a number of approaches to representing structured data from messaging platforms in reviewable formats. One tool on the market parses the structured data and re-formats as .html that can be viewed in a browser or uploaded into review platforms along with metadata load files. One of the examples from above is now represented in this format:
This format is certainly easier to review and it provides reasonable imaging options for production and the redaction of sensitive information. While this example only involves text, images and emojis have also been handled by tools in the market. Some tools that process Slack data are able to represent emojis much as you would see them in Slack. Others choose to keep emojis in Slack text format, making the argument that the text format would allow for text searching of a specific emoji. Here is an example:
Whether making emojis searchable might be useful is probably case specific, but it is an interesting approach and highlights the differences in handling structured data sources.
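Both approaches can be sketched in Python: rendering messages as simple .html for review, and either keeping emojis as searchable text or swapping in the picture. The sketch assumes emojis arrive as `:name:` text codes and that each message has "user", "time" and "message" fields. The two-entry emoji table and all function names are made up for illustration and do not reflect any particular tool on the market.

```python
import html
import re

# Tiny illustrative lookup; a real tool would need a far larger table.
EMOJI = {"thumbsup": "\U0001F44D", "smile": "\U0001F604"}
SHORTCODE = re.compile(r":([a-z0-9_+\-]+):")

def render_text(text, keep_shortcodes):
    """Leave :name: codes as text, or replace known ones with the emoji."""
    if keep_shortcodes:
        return text
    return SHORTCODE.sub(lambda m: EMOJI.get(m.group(1), m.group(0)), text)

def to_html(messages, keep_shortcodes=True):
    """Render messages as a minimal HTML page, escaping all message content."""
    rows = []
    for m in messages:
        body = html.escape(render_text(m["message"], keep_shortcodes))
        rows.append(
            f'<p><b>{html.escape(m["user"])}</b> '
            f'<i>{html.escape(m["time"])}</i>: {body}</p>'
        )
    return "<html><body>\n" + "\n".join(rows) + "\n</body></html>"

def find_emoji(messages, name):
    """Text search for one emoji code; only works on the original text codes."""
    needle = f":{name}:"
    return [m for m in messages if needle in m["message"]]

# Illustrative data only.
messages = [
    {"user": "user2", "time": "4:12 PM", "message": "Can you send me an invite?"},
    {"user": "user3", "time": "4:22 PM", "message": ":thumbsup:"},
]

page_text = to_html(messages, keep_shortcodes=True)
page_pictures = to_html(messages, keep_shortcodes=False)
hits = find_emoji(messages, "thumbsup")
```

The trade-off shows up directly: the text version can be found with an ordinary search for `:thumbsup:`, while the picture version reads the way the participants saw it.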
As with most anything in e-discovery, preparation and forethought are crucial for success. Now that we understand the potential complexities of structured data, we can imagine how important it would be to discuss structured data sources with our clients early to get ahead of any potential issues down the line. So, as take-aways:
- Discuss with clients data retention and preservation policies and practices with respect to structured data sources, including messaging platforms.
- Incorporate structured data sources into discussions with opposing counsel and consider specifically addressing them in your ESI agreements (including whether they are subject to preservation and production at all).
- Identify internal or external resources for the collection and processing of structured data and get an idea of what the reviewable format will be beforehand.
These lessons are not different from how we approach other sources of ESI. It does take a little extra effort because structured data coming from messaging platforms is evolving quickly and it takes time for the tools designed for the legal community to catch up. So it is time to start thinking about this data, because sooner or later it is coming to a case near you.
DISCLAIMER: The information contained in this blog is not intended as legal advice or as an opinion on specific facts. For more information about these issues, please contact the author(s) of this blog or your existing LitSmart contact. The invitation to contact the author is not to be construed as a solicitation for legal work. Any new attorney/client relationship will be confirmed in writing.