Copy Detection Mechanisms for Digital Documents

Sergey Brin, James Davis, Hector Garcia-Molina

Department of Computer Science

Stanford University

Stanford, CA 94305-2140

In a digital library system, documents are available in digital form and therefore are more easily copied and their copyrights are more easily violated. This is a very serious problem, as it discourages owners of valuable information from sharing it with authorized users. There are two main philosophies for addressing this problem: prevention and detection. The former actually makes unauthorized use of documents difficult or impossible while the latter makes it easier to discover such activity.

In this paper we propose a system for registering documents and then detecting copies, either complete copies or partial copies. We describe algorithms for such detection, and metrics required for evaluating detection mechanisms (covering accuracy, efficiency, and security). We also describe a working prototype, called COPS, describe implementation issues, and present experimental results that suggest the proper settings for copy detection parameters.

1 Introduction

Digital libraries are a concrete possibility today because of many technological advances in areas such as storage and processor technology, networks, database systems, scanning systems, and user interfaces. In many aspects, building a digital library today is just a matter of ``doing it.'' However, there is a real danger that such a digital library will either have relatively few documents of interest, or will be a patchwork of isolated systems that provide very restricted access.

The reason for this danger is that the electronic medium makes it much easier to illegally copy and distribute information. If an information provider gives a document to a customer, the customer can easily distribute it on a large mailing list or post it on a bulletin board. The danger of illegal copies is not new, of course; however, it is much more time consuming to reproduce and distribute paper, CD, or videotape copies than on-line documents.

Clearly, one would like to have an infrastructure that gives users access to a wide variety of digital libraries and information sources, but that at the same time gives information providers good economic incentives for offering their information. In many ways, we believe this is the central issue for future digital information and library systems.

In this paper we present one component of the information infrastructure that addresses this issue. The key idea is quite simple: provide a copy detection service where original documents can be registered, and copies can be detected. The service will detect not just exact copies, but also documents that overlap in significant ways. The service can be used (see Section 2) in a variety of ways by information providers and communications agents to detect violations of intellectual property laws. Although the copy detection idea is simple, there are several challenging issues involving performance, storage capacity, and accuracy that we address here. Furthermore, copy detection is relevant to the ``database community'' since its central component is a large database of registered documents.

We stress that copy detection is not the complete solution by any means; it is simply a helpful tool. There are a number of other important ``tools'' that will also assist in safeguarding intellectual property. For example, good encryption and authorization mechanisms are needed in some cases. It is also important to have mechanisms for charging for access to information. The articles in [5, 7, 9] discuss a variety of other topics related to intellectual property. These other tools and topics will not be covered in this paper.

2 Safeguarding Intellectual Property

How can we ensure that a document is only seen and used by a person who is authorized (e.g., has paid) to see it? One possibility lies in preventing violations from occurring. Some schemes along these lines have been suggested, such as secure printers with cryptographically secure communications paths [12] and active documents [6] where users may interact with documents only through a special program.

The problem with all such techniques is that they are cumbersome (requiring special software or hardware), restrictive (limiting the means of accessing the document), and not completely secure (against someone with an OCR program, for example). The alternative is to use detection techniques. That is, we assume most users are honest, allow them access to the documents, and focus on detecting those who violate the rules. Many software vendors have found this approach to be superior (protection mechanisms get in the way of honest users, and sales may actually decrease).

One possible direction is ``watermark'' schemes, where a publisher incorporates a unique, subtle signature (identifying the user) in a document when it is given to the user, so that when an unauthorized copy is found, the source will be known [13, 3, 4, 2]. One problem is that these schemes can easily be defeated by users who destroy the watermarks. For example, slightly varied pixels in an image would be lost if the image is passed through a lossy JPEG encoder.

A second approach, and one that we advocate in this paper (for text documents), is that of a copy detection server [1, 11]. The basic idea is as follows: When an author creates a new work, he or she registers it at the server. The server could also be the repository for a copyright recordation and registration system, as suggested in [8]. As documents are registered, they are broken into small units; for now, say, sentences. Each sentence is hashed and a pointer to it is stored in a large hash table.

Documents can be compared to existing documents in the repository, to check for plagiarism or other types of significant overlap. When a document is to be checked, it is also broken into sentences. For each sentence, we probe the hash table to see if that particular sentence has been seen before. If the document and a previously registered document share more than some threshold number of sentences, then a violation is flagged. The threshold can be set depending on the desired checks: smaller if we are looking for copied paragraphs, larger if we only want to check whether documents share large portions. A human would then have to examine both documents to see whether there was truly a violation.
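A minimal sketch of this basic scheme follows (in Python). The function names, the naive regular-expression sentence splitter, and the use of MD5 as the hash function are our illustrative choices, not those of the actual COPS prototype:

    import hashlib
    import re
    from collections import defaultdict

    hash_table = defaultdict(set)   # sentence hash -> ids of documents containing it

    def sentences(text):
        # Naive splitter: break on sentence-ending punctuation (illustrative only).
        return [s.strip() for s in re.split(r'[.!?]+', text) if s.strip()]

    def register(doc_id, text):
        # Hash every sentence of a registered document into the table.
        for s in sentences(text):
            hash_table[hashlib.md5(s.encode()).hexdigest()].add(doc_id)

    def check(text, threshold):
        # Count shared sentences per registered document; return those meeting
        # the threshold, to be examined by a human.
        counts = defaultdict(int)
        for s in sentences(text):
            for doc_id in hash_table.get(hashlib.md5(s.encode()).hexdigest(), ()):
                counts[doc_id] += 1
        return {doc_id: n for doc_id, n in counts.items() if n >= threshold}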

This basic scheme has much in common with sif, a tool for finding similar files in a file system, created by Udi Manber [10]. However, there are a number of differences between finding similar files and finding similar sections of text, which COPS addresses. First, since we are dealing with text, we operate on a syntactic level and hash syntactic units as opposed to fixed-length strings. We also consider the security of copy detection (Section 3.3) and attempt to maximize its flexibility by dealing with violations of varying granularities (Section 4). One of the most important differences is that it is much more difficult to test a system like COPS, since there are no databases of actual copyright violations (Section 5).

The copy detection server can be used in a variety of ways. For example, a publisher is legally liable for publishing materials the author does not have copyright on; thus, it may wish to check if a soon-to-be-published document is actually an original document. Similarly, bulletin-board software may automatically check new postings in this fashion. An electronic mail gateway may also check the messages that go through (checking for ``transportation of stolen goods''). Program committee members may check if a submission overlaps too much with an author's previous paper. Lawyers may want to check subpoenaed documents to prove illegal behavior. (Copy detection can also be used for computer programs [11], but we only focus on text in this paper.) There are also applications that do not involve detection of undesirable behavior. For example, a user who is retrieving documents from an information retrieval system, or who is reading electronic mail, may want to flag duplicate items (with a given overlap threshold). Here the ``registered'' documents are those that have been seen already; the ``copies'' represent messages that are retransmitted or forwarded many times, different editions or versions of the same work, and so on. Of course, potential duplicates should not be deleted automatically; it is up to the user to decide whether he or she wants to view possible duplicates.

In summary, we think that detecting copies of text documents is a fundamental problem for distributed information or database systems, and there are many issues that need to be addressed. For instance, should the decomposition units be paragraphs or something else instead of sentences? Should we take into account the order of the units (paragraphs or sentences), e.g., by hashing sequences of units? Is it feasible to hash only a fraction of the sentences of registered documents? This would make the hash table smaller, while hopefully still making it very likely that we will catch major violations. If the hash table is relatively small, it can be cloned. Our mail gateway above could then perform its checks locally, instead of having to contact a remote copy detection server for each message. There are also implementation issues that need to be addressed. For example, how are sentences extracted from, say, LaTeX or Word documents? Can one extract them from PostScript documents, or from bitmaps via OCR?

These and other questions will be addressed in the rest of this paper. We start in Sections 3 and 4 by defining the basic terms, evaluation metrics, and options for copy detection. Then in Section 5 we describe our working prototype, COPS, and report on some initial experiments. A sampling technique that can reduce the storage space of registered documents or can speed up checking time is presented and analyzed in Section 6.

3 General Concepts

In this section we define some of the basic concepts for copy detection and for evaluating mechanisms that implement it. (As far as we know, text copy detection has not been formally studied, so we start from basics.) The starting point is the concept of a document, a body of text from which some structural information (such as word and sentence boundaries) can be extracted. In an initial phase, formatting information and non-textual components are removed from documents (see Section 5). The resulting canonical form document consists of a string of ASCII characters with whitespace separating words, punctuation separating sentences, and possibly a standard method of marking the beginning of paragraphs.
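As a rough illustration, the conversion to canonical form might look as follows; the regular expressions are simplistic stand-ins for the actual preprocessing described in Section 5, which also strips formatting commands and non-textual components:

    import re

    def canonical_form(text):
        # Reduce a document to a list of canonical sentences: ASCII
        # characters only, single spaces between words, sentences
        # delimited by terminating punctuation. Illustrative sketch only.
        text = text.encode('ascii', 'ignore').decode()   # drop non-ASCII characters
        text = re.sub(r'\s+', ' ', text).strip()         # normalize whitespace
        return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]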

Most of the violation tests we are interested in are not well defined and require a decision by a human being. For example, plagiarism is particularly difficult to test for. For instance, the sentence ``The proof is as follows'' may occur in many scientific papers and would not be considered plagiarism if it occurred in two documents, while this sentence most certainly would. If we consider a test Subset that detects whether a document is essentially a subset of another one, we again need to consider whether the smaller document makes any significant contributions. This, again, requires human evaluation.

The goal of a copy detection system is to implement well defined algorithmic tests, termed operational tests (with the same notation as violation tests), that approximate the desired violation tests. For instance, consider the operational test t1(d, r) that holds if 90% of the sentences in d are contained in r. This test may be considered an approximation to the Subset test described above. If the system flags t1 violations, then a human can check if they are indeed Subset violations.
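A literal reading of t1, assuming documents have already been reduced to lists of canonical sentences and that sentences are compared as exact strings, could be sketched as:

    def t1(d_sentences, r_sentences, fraction=0.9):
        # Holds if at least `fraction` of the sentences in d also occur in r.
        r_set = set(r_sentences)
        shared = sum(1 for s in d_sentences if s in r_set)
        return shared >= fraction * len(d_sentences)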

3.1 Ordinary Operational Tests

In the rest of this paper we will focus on a specific class of operational tests, ordinary operational tests (OOTs), that can be implemented efficiently. We believe they can accurately approximate many violation tests of interest, such as Subset, Overlap, and Plagiarism.

Before we describe OOTs we need to define some primitives for specifying the level of detail at which we look at the documents. As mentioned in Section 3, documents contain some structural information. In particular, documents can be divided into well defined parts, consistent with the underlying structure, such as sections, paragraphs, sentences, words, or characters. We call each of these types of divisions a unit type, and particular instances of these unit types are called units.

We define a chunk as a sequence of consecutive units in a document of a given unit type. A document may be divided into chunks in a number of ways, since chunks can overlap, may be of different sizes, and need not completely cover the document. For example, let us assume we have a document ABCDEFG where the letters represent sentences or some other units. Then it can be organized into chunks as follows: A,B,C,D,E,F,G; or AB,CD,EF,G; or AB,BC,CD,DE,EF,FG; or ABC,CD,EFG; or A,D,G. A method of selecting chunks from a document divided into units is a chunking strategy. It is important to note that unlike units, chunks have no structural significance to the document, and so chunking strategies cannot use structural information about the document.
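Two of these chunking strategies are easy to state precisely. In the sketch below (with letters standing for sentences, as above), the first function produces the non-overlapping chunks AB,CD,EF,G and the second the sliding chunks AB,BC,...,FG; both are illustrative, not prescribed by COPS:

    def nonoverlapping_chunks(units, k):
        # Consecutive groups of k units: AB, CD, EF, G for k = 2.
        return [''.join(units[i:i+k]) for i in range(0, len(units), k)]

    def sliding_chunks(units, k):
        # Overlapping window of k units: AB, BC, CD, DE, EF, FG for k = 2.
        return [''.join(units[i:i+k]) for i in range(len(units) - k + 1)]

    units = list("ABCDEFG")
    nonoverlapping_chunks(units, 2)   # ['AB', 'CD', 'EF', 'G']
    sliding_chunks(units, 2)          # ['AB', 'BC', 'CD', 'DE', 'EF', 'FG']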

An OOT, o, uses hashing to detect matching chunks and is implemented by the set of procedures in Figure 1. The code is intended to convey key concepts, not an efficient or complete implementation. (Section 5 describes our actual prototype system.) First there is the preprocessing operation, PREPROCESS(R, H), that takes as input a set of registered documents R and creates the hash table, H. Second, there are procedures for adding documents to H on the fly (registering new documents) and for removing them from H (un-registering documents). Third is the function EVALUATE(d, H) that computes o(d, R).

To insert documents in the hash table, procedure INSERT uses a function INS-CHUNKS(r) to break up a document r into its chunks. The function returns a set of tuples (t, l). Each tuple represents one chunk in r, where t is the text of the chunk and l is the location of the chunk, measured in some unit. An entry is stored in the hash table for every chunk in the document.

Procedure EVALUATE(d, H) tests a given document d for violations. The procedure uses the EVAL-CHUNKS function to break up d. The reason for using a different chunking function at evaluation time will become apparent in Section 6. For now, we can assume that INS-CHUNKS and EVAL-CHUNKS are identical, and we use CHUNKS to refer to both.

After chunking, procedure EVALUATE looks up the chunks in the hash table H, producing a set of tuples MATCH. Each tuple (r, s, ld, lr) in MATCH represents a match: a chunk of size s at location ld in document d matches (has the same hash key as) a chunk at location lr in registered document r. The MATCH set is then given to function DECIDE(MATCH, SIZE) (where SIZE is the number of chunks in d), which returns the set of matching registered documents. If the set is non-empty, then there was a violation, i.e., o(d, R) holds.

    PREPROCESS(R, H):
        for each r in R: INSERT(r, H)

    INSERT(r, H):
        C = INS-CHUNKS(r) /* OOT dependent */
        for each (t, l) in C: insert (hash(t), r, l) into H

Figure 1: Procedures implementing an OOT (fragment).
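For concreteness, one possible Python rendering of these procedures follows; it generalizes the earlier sentence-level sketch to the Figure 1 structure. Documents are represented as lists of canonical sentences, and the single-unit chunking, the MD5-based hash key, and DECIDE's fixed-fraction rule are our illustrative stand-ins for the OOT-dependent choices:

    import hashlib
    from collections import defaultdict

    def ins_chunks(units):
        # OOT dependent; here each chunk is a single unit, located by its index.
        return [(t, l) for l, t in enumerate(units)]

    eval_chunks = ins_chunks   # identical for now; Section 6 lets them differ

    def key(t):
        # Hash key for a chunk's text (MD5 is an illustrative choice).
        return hashlib.md5(t.encode()).hexdigest()

    def insert(r_id, units, H):
        # Store an entry for every chunk of registered document r.
        for t, l in ins_chunks(units):
            H[key(t)].append((r_id, l))

    def remove(r_id, H):
        # Un-register a document by deleting its entries.
        for k in list(H):
            H[k] = [(r, l) for r, l in H[k] if r != r_id]
            if not H[k]:
                del H[k]

    def preprocess(docs, H):
        # docs: dict mapping r_id -> list of canonical sentences.
        for r_id, units in docs.items():
            insert(r_id, units, H)

    def decide(match, size, fraction=0.5):
        # Flag registered documents matching more than `fraction` of d's chunks.
        counts = defaultdict(int)
        for r_id, s, ld, lr in match:
            counts[r_id] += s
        return {r_id for r_id, n in counts.items() if n > fraction * size}

    def evaluate(units, H):
        # Compute o(d, R): look up each chunk of d, build MATCH, and decide.
        chunks = eval_chunks(units)
        match = [(r_id, 1, ld, lr)              # size 1: chunks are single units
                 for t, ld in chunks
                 for r_id, lr in H.get(key(t), ())]
        return decide(match, len(chunks))

With H = defaultdict(list), preprocess(docs, H) registers a corpus once, and evaluate(d_units, H) returns the set of registered documents flagged against document d.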