A practitioner’s guide to PDF, Email and ESI production challenges

By Adam Bowers, JD


eDiscovery technicians know the value of having the right data to work with. Electronically Stored Information (ESI) provides a wealth of digital data that remains behind the scenes, but can change the course of a case instantaneously. While data management begins long before litigation is filed, we are going to look at data management best practices at the moment litigation is a foreseeable likelihood. Once data is in a legal hold, all documents are to remain in digital stasis where the custodians are instructed not to change or delete data. We will cover some of the most important aspects of preserving and collecting documents with a particular emphasis on PDFs. We will also cover the best ways to avoid mistakes and other hazards which may compromise your litigation.

This paper is not to be considered legal advice. If you need legal advice, consult with a licensed attorney. doeLEGAL is not a forensic collection company, we partner with vetted professionals to ensure the success of our clients’ projects. The intent is to take an honest look at the issues surrounding how to ensure that litigation data is captured and produced in a technically sound and legally defensible manner.

“The PDF Pitfalls”

PDFs and TIFFS are usually the preferred formats for producing documents during discovery. Image files are used to prevent the document from being altered and to keep files from becoming corrupt during transfer. Native files hold all of the metadata which can be vital to a particular case. In a normal scenario, image files will be produced with a load file. The load file will hold all metadata and extracted text associated with the native file and link that to the produced image file. Producing native files also poses some additional risks because they can be opened, corrupted, or inadvertently altered. One example would be email. If email is produced natively and the reviewer opens it in Outlook, the reviewer could forward or reply to that email which could be disastrous. We’ll discuss email in more depth later on. Producing images instead of natives in that scenario protects the document and, in turn, the client.

Most productions we receive come as PDFs or TIFFs that were created utilizing proper quality control (QC) procedures and include load files. When processing data during litigation, it is always advised that you use the native files to retain all their original metadata. Once the analysis and review are complete, that is when the files can be imaged and sent over to the opposing counsel as a PDF production with a corresponding load file. On occasion, groups are producing multiple documents in a single PDF without any corresponding load file showing the original metadata. This practice makes it impossible to code, separate, or process the documents. This also creates costly problems for the review team that receives that production. Without metadata or extracted text, it is impossible to run most searches or maintain familial relationships such as email attachments. The receiving party then must weigh whether to spend money to have the documents separated, OCR’d, and manually attempt to identify some of the missing metadata or try to get the producing party to redo the production properly. We usually see this during production, but we’ve also seen certain companies that self-collect provide data in this format. Not only is this not the proper way to provide data, but it may also open the door to challenges from opposing counsel and the risk of court sanctions.

When problems such as identifying data loss happen, the repair falls to litigation support professionals and they truly are the unsung heroes behind every litigation. These individuals often describe themselves as being constantly “in the weeds,” “running fire drills,” and “underwater” because they are.  They have seen it all and can reliably forecast the common issues with poorly collected and produced data, but this advice after the fact comes at a high cost. This is why, when data is collected as PDFs of native files and sent for processing, it is typically sent back to the collecting party to re-collect as native files. In rare cases, this is not possible, it requires a document specialist to determine how best to process the data and ultimately make it reviewable.

As an example, one of the cases we worked on involved receiving a single PDF file that contained all the documents produced. The first thing the litigation support technician noticed was that the attorney requested Bates Numbers to be applied to each document. Applying Bates Numbering to documents delivered as one PDF is wrought with obstacles because one must manually separate each document within the single PDF. This can lead to mistakes in determining where one document ends and the next begins. A PDF is only an image of a document, and if separate images of each distinct document are not supplied, it makes the review of that PDF extremely cumbersome. As a result, our litigation team asked for the native files to be collected.  While the PDF document could be separated by a document specialist, the cost would have been prohibitive and any native file metadata would have been lost. Some metadata could be recovered by manually going through the document and finding things like Doc Date, To:, From:, and Subject, but this is a costly, time-consuming process and would still not give you all of the metadata.

Top 4 PDF PitfallsIf native files are preserved and collected properly, they retain a valuable source of information: native file metadata. As we have stated, metadata is key to ensuring the document review is most effective because it provides vital information such as the type of device the data was created on and date-specific information which can be very useful information to have at trial. If you alter or completely remove the metadata, you are also crippling the ability of the case team to operate with any efficiency. Judge Scheindlin ruled in Nat’l Day Laborer Org. v. U.S. Immigration and Customs Enforcement Agency that even if a FOIA (Freedom of Information Act) request does not specify the inclusion of metadata, certain basic metadata information is an integral part of the public record and must be produced. This was the first time a federal judge took this position regarding a FOIA request and the presence of metadata. She wrote, “Whether or not metadata has been specifically requested — which it should be — production of a collection of static images without any means of permitting the use of electronic search tools is an inappropriate downgrading of the ESI”.

The absence of metadata creates another issue for review teams: the inability to use de-duplication technology. Most eDiscovery tools contain the ability to remove duplicate files, in a process known as deduplication. Deduplication normally takes place during the processing stage. However, since PDFs, by themselves, do not contain any native file metadata, deduplication technology will not function properly. Identifying duplicate files is accomplished by matching the hash values (the serial number of a document) of two files. Once a document is turned into a PDF, the hash value of the PDF will be different from that of the original file, making duplicates difficult to identify.

Here again, having only an image of a document is crippling. To make matters even worse, PDFs do not contain separate text files, so during processing, OCR (Optical Character Recognition) technology must be implemented to even recognize the file as text once again. OCR is used to create a searchable text file by “reading” an image of a document.

OCR has come a long way, but is not perfect and can return results that are different from what actually appears in the document. If you add the presence of a foreign language, misspellings, slang, or image quality into that equation, results can become even more skewed.

When applying OCR processing to a PDF, it is possible that the exact same document could be part of the review database several different times. The same document could be read by separate reviewers and coded differently.  Disparities in coding could also lead to the possible production of privileged documents. Here is a point where production QC is so important. Do not overlook this vital step in your review. While FRCP 502(d) and clawback agreements protect the producing party from waiving the privilege of inadvertently produced documents, the other side may discover some valuable information within that document and this could be harmful to your litigation—“You cannot unring a bell”.

Who’s qualified to produce PDFs?

Under the newly amended FRE 902, copied data will not be “self-authenticating” unless a qualified person has inspected the data, recorded the process used, and certified that an exact copy of the data was created. The comment section of 902 states that to meet the inspection requirement, the qualified person can compare the hash values of the original ESI and that of the copied version. One of the strongest ways to ensure your collection follows the accepted best practices is to have a forensics expert handle the task.  We often see these forensic experts ward off the pitfalls early on because they have seen the issues first-hand and can communicate the risks and rewards of any task before the collection even begins.

What qualifies someone as an expert? The Advisory Committee on Rules of Evidence has suggested that a qualified person is one who would be admitted to testify based on their expertise and knowledge of how the data was collected. This statement suggests that the same criteria that one might use to screen a potential expert witness (FRE 702) should also be utilized when deciding who should be allowed to collect litigation data.  It is crucial to ensure that whatever files are produced pass opposing counsel’s scrutiny or you may risk having the authenticity of the collection called into question. Aside from file authentication, data could be inadvertently destroyed or lost during the collection process if the person doing the collecting is not qualified. This could cause a drastic increase in litigation expenses, possible sanctions for spoliation, and may prolong the litigation itself.

ESI, email, and exact copies

Another area that is all too familiar to litigation support professionals is the practice of collecting data by forwarding emails. While it may seem benign, the act of forwarding emails can alter metadata or file folder locations, which may be of importance. This is an area where the custodian may not understand they are doing anything wrong, which makes it even more important to educate all employees regarding “best practices” and the hazards of collecting data in this way. Also, when reviewing emails, it is ill-advised to open them in an email application, as you may run the risk of forwarding that email or otherwise making yourself a part of an email chain. All review applications offer options to open a native file in a “near-native format” (e.g., opening an Outlook email file in Word). This allows for defensible review without the dangers of misusing the full-functionality of a native application.

Every attorney should approach each collection with the goal of making it “forensically sound.” A forensically sound collection lightens the burden on the litigation support and review teams by permitting eDiscovery software to function efficiently. Drinker Biddle’s Thomas Lidbury and Michael Boland summed this up in their paper entitled Technology: Forensically sound collection of ESI, “What makes a collection forensically sound, whatever its scope, is not that the entire storage media has been copied bit by bit, but that the files that have been collected can be shown to be exact copies of what was on the source, including associated metadata.” We see this concept as using a collection process that will stand up to judicial scrutiny. Conversely, merely allowing non-forensically sound procedures to occur could lead to higher costs, inadvertent disclosures, delayed review times, and late productions. It is always preferable to capture ESI in its “native” format by creating a forensically sound copy because this will virtually guarantee the metadata is also captured. One sure-fire way to ensure the defensibility is to involve attorneys up front where they can properly ensure that ESI is not destroyed or altered and proper procedures are followed and documented.


PDFs can be just as defensible as native files if created and handled properly. By following the best practices accepted for the collection and production of documents, the risk of any questions regarding the authenticity of the metadata associated with the files by opposing counsel will be greatly reduced. A skilled attorney will document everything and be ready when any questions come. Ensuring that an expert is leading your litigation data collection and production is your best defense and will reduce the likelihood of opposing counsel challenging that collection.


About the author: Adam Bowers is an eDiscovery expert, LLM candidate, former business owner, and legal technology practitioner who helps law firms and attorneys navigate the complex world of discovery.

To learn more, visit our ediscovery-litigation page speak to a knowledgeable advisor – call 1.302.798.7500, or email info@doelegal.com.

About doeLEGAL

doeLEGAL is built on a promise to provide “Smart data, intelligently delivered.” Our software and services help corporate legal departments and law firms efficiently manage operations with up-to-date, insightful data that help teams make confident decisions. We facilitate anytime, anywhere control over cases and costs with advanced management tools and elevated support to generate insights and drive successful outcomes.