Deduplication has become a mainstay of electronic data discovery (EDD) processing. Each document, such as a word-processing file or e-mail message, is assigned an algorithmically calculated alphanumeric value (typically an MD5 hash), which is then compared with the values of all other electronic files in the data set. Documents with the same MD5 hash value are considered duplicates. As simple as this process seems, there are two different bases for deduplication: by custodian and by case. Both have their advantages and pitfalls.
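The hash comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's actual pipeline: the function names `md5_of` and `deduplicate` and the sample documents are hypothetical, and the sketch hashes raw file bytes only, whereas a production tool may hash selected metadata fields for e-mail.

```python
import hashlib


def md5_of(content: bytes) -> str:
    """Return the MD5 hash of a document's bytes as a hex string."""
    return hashlib.md5(content).hexdigest()


def deduplicate(documents):
    """Split (name, content) pairs into unique names and duplicate pairs.

    The first document seen with a given hash is kept as the original;
    every later document with the same hash is recorded as its duplicate.
    """
    first_seen = {}  # hash value -> name of the first document with it
    unique, duplicates = [], []
    for name, content in documents:
        digest = md5_of(content)
        if digest in first_seen:
            duplicates.append((name, first_seen[digest]))
        else:
            first_seen[digest] = name
            unique.append(name)
    return unique, duplicates


# Hypothetical sample data: two identical memos and one distinct note.
docs = [
    ("memo.doc", b"Quarterly results attached."),
    ("memo_copy.doc", b"Quarterly results attached."),
    ("notes.txt", b"Call opposing counsel on Friday."),
]
unique, duplicates = deduplicate(docs)
```

Here `memo_copy.doc` is flagged as a duplicate of `memo.doc` because the two files have identical bytes and therefore identical hash values; a single changed character would produce a different hash and both files would be kept.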
Deduplicating documents by custodian removes duplicates only within one person's data set. A custodian is the owner of the electronic data harvested from one person's hard drive, company network or e-mail account. If the data is collected only once, typically only a small number of duplicates exist. But if the custodian's data is harvested on a rolling basis over time, the percentage of duplicate items will increase with successive collections. For example, a file containing one week of e-mail messages will hold relatively little new data compared to the previous week's messages. Duplicates within a single custodian's data may also include copies of e-mail messages created automatically by an "AutoArchive" rule established by the custodian.
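The effect of rolling collections can be shown with a small worked example. The sketch below assumes a single custodian whose mailbox is collected in full each week, so each collection repeats everything captured earlier; the function name `duplicate_rates` and the message contents are made up for illustration.

```python
import hashlib


def duplicate_rates(collections):
    """Return, for each successive collection, the fraction of its items
    whose hash was already seen in this custodian's earlier data."""
    seen = set()
    rates = []
    for items in collections:
        dupes = 0
        for content in items:
            digest = hashlib.md5(content).hexdigest()
            if digest in seen:
                dupes += 1
            else:
                seen.add(digest)
        # Guard against an empty collection to avoid dividing by zero.
        rates.append(dupes / len(items) if items else 0.0)
    return rates


# Hypothetical weekly collections: each week adds one new message
# on top of everything collected before.
week1 = [b"msg-1", b"msg-2", b"msg-3", b"msg-4"]
week2 = week1 + [b"msg-5"]
week3 = week2 + [b"msg-6"]
rates = duplicate_rates([week1, week2, week3])
```

Under these assumptions the first collection contains no duplicates, the second is four-fifths duplicates, and the third is five-sixths duplicates, which matches the pattern of a rising duplicate percentage across successive collections.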
Deduplication by custodian is the basis preferred by vendors for several reasons. One obvious reason: deduplicating data sets by custodian removes fewer duplicates than deduplication by case, so more documents are generated for review. Vendors that offer to print data sets on demand can potentially earn the most income by deduplicating by custodian. A more subtle reason: custodian deduplication causes the fewest headaches and worries for the EDD processing vendor and makes it easier to communicate to the law firm how data sets were deduped using the hash comparison explained above. Deduplication by case, or global deduplication, is not as easy to conduct or explain.
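The difference between the two bases comes down to the scope of the hash comparison. The following sketch contrasts them on a tiny, invented data set; the function names `dedupe_by_custodian` and `dedupe_by_case` and the custodians "alice" and "bob" are assumptions for illustration only.

```python
import hashlib


def dedupe_by_custodian(data):
    """Compare hashes only within each custodian's own documents."""
    kept = {}
    for custodian, items in data.items():
        seen = set()  # reset for every custodian
        kept[custodian] = []
        for content in items:
            digest = hashlib.md5(content).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept[custodian].append(content)
    return kept


def dedupe_by_case(data):
    """Compare hashes across every custodian in the case (global)."""
    seen = set()  # shared by all custodians
    kept = {}
    for custodian, items in data.items():
        kept[custodian] = []
        for content in items:
            digest = hashlib.md5(content).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept[custodian].append(content)
    return kept


# Hypothetical case: both custodians hold the same contract draft.
data = {
    "alice": [b"contract draft", b"contract draft", b"lunch plans"],
    "bob": [b"contract draft", b"budget"],
}
per_custodian = sum(len(v) for v in dedupe_by_custodian(data).values())
per_case = sum(len(v) for v in dedupe_by_case(data).values())
```

In this example, deduplication by custodian leaves four documents for review while deduplication by case leaves three, because the copy of the contract draft held by "bob" is dropped only when hashes are compared across custodians. Note that in this sketch the custodian who keeps the surviving copy depends on processing order, which hints at why the global result can be harder to explain.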
By: Alex K. Schiller