Viacom v Google: eDiscovery v Privacy
On July 1, 2008, in the Viacom v. Google litigation, Judge Stanton of the Southern District of New York ordered YouTube to turn over to Viacom “all data from the Logging database concerning each time a YouTube video has been viewed on the YouTube website or through embedding on a third-party website.” “[F]or each instance a video is watched, the unique ‘login ID’ of the user who watched it, the time when the user started to watch the video, the internet protocol address other devices connected to the internet use to identify the user’s computer (‘IP address’), and the identifier for the video” is to be produced. Ostensibly, Viacom wants these data “to compare the attractiveness of allegedly infringing videos with that of non-infringing videos. A markedly higher proportion of infringing-video watching may bear on plaintiffs’ vicarious liability claim, and defendants’ substantial non-infringing use defense.”
Google objects to producing the log records because of privacy concerns. It argues that “Plaintiffs would likely be able to determine the viewing and video uploading habits of YouTube’s users based on the user’s login ID and the user’s IP address.” But the judge called this concern speculative.
This case raises substantial eDiscovery and privacy issues, even setting aside the purely legal questions (e.g., whether this production violates the federal Video Privacy Protection Act (VPPA), as the Electronic Frontier Foundation claims). First, Viacom says it is demanding the log information to judge what proportion of viewing goes to the infringing videos. But you do not need a record of each instance of viewing in order to compute that proportion; a total view count for each video would be enough. You certainly do not need any indication of the users’ identities to measure this proportion.
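To make the point concrete, here is a minimal sketch of that calculation. It assumes a hypothetical summary with one total view count per video and a separate list of the videos alleged to infringe; the function name, the video identifiers, and the counts are all invented for illustration and do not come from the actual Logging database.

```python
def infringing_share(views_per_video, infringing_ids):
    """Return the fraction of all views that went to allegedly infringing videos.

    views_per_video: dict mapping a video identifier to its total view count.
    infringing_ids:  set of video identifiers alleged to infringe.
    No login ID, IP address, or per-view record is needed.
    """
    total = sum(views_per_video.values())
    if total == 0:
        return 0.0
    infringing = sum(count for video, count in views_per_video.items()
                     if video in infringing_ids)
    return infringing / total


# Hypothetical aggregate counts (assumed values, not real data).
views = {"vid_a": 600, "vid_b": 300, "vid_c": 100}
alleged = {"vid_a"}

print(infringing_share(views, alleged))  # 0.6
```

The only inputs are per-video totals, so a production limited to such totals would support the comparison Viacom describes without exposing who watched what.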
The court was not convinced that release of this log information to the plaintiff would necessarily compromise the privacy of YouTube users, calling this concern “speculative.” The court noted that Google had itself written that IP addresses, the numerical network addresses of users’ computers, were not ordinarily considered personal information: “IP addresses recorded by every website on the planet without additional information should not be considered personal data, because these websites usually cannot identify the human beings behind these number strings.” But this is only partly correct, as the same blog notes. A substantial number of IP addresses are relatively stable over time, and so indicate the same computer repeatedly. When an IP address is combined with other information, some of which may be in the log itself and some obtained elsewhere, a great deal can be learned about the identity of a substantial number of the people accessing those data, whether the IP addresses are stable or not.
For example, every time you access the internet from a Starbucks you get a new IP address. The next person to access the internet from the same location may get the same IP address. In this context, it is impossible to prove, on the basis of IP address alone, which user accessed which files. This would argue for the anonymity of IP addresses. On the other hand, when you access the internet from your home computer, you are more likely, but not guaranteed, to have the same IP address for an extended period of time.
But even if the IP address changes, you do get some information about the location of the computer. Each internet service provider (ISP) owns one or more ranges of IP addresses, which are assigned to the machines of its clients. An IP address (in the common IPv4 format) consists of four groups of numbers, each between 0 and 255. In the simplest case, all of the addresses in a range (traditionally called a “Class-C address range”) share the same first three groups. Such a range contains 256 addresses, so it can accommodate at most about 256 clients at a time.
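The idea of a range that shares its first three groups can be sketched with Python's standard `ipaddress` module. The helper names and the sample addresses below are assumptions made for illustration (the addresses come from blocks reserved for documentation), and a real ISP may hold ranges that are larger or smaller than this.

```python
import ipaddress
from collections import defaultdict


def prefix_24(ip):
    """Return the range of 256 addresses that shares the first three groups of ip."""
    return ipaddress.ip_network(f"{ip}/24", strict=False)


def group_by_prefix(ips):
    """Group addresses by the range they fall in."""
    groups = defaultdict(set)
    for ip in ips:
        groups[str(prefix_24(ip))].add(ip)
    return dict(groups)


# Hypothetical addresses, assumed for illustration only.
sample = ["192.0.2.17", "192.0.2.201", "198.51.100.4"]
groups = group_by_prefix(sample)

print(prefix_24("192.0.2.17"))                # 192.0.2.0/24
print(prefix_24("192.0.2.17").num_addresses)  # 256
print(sorted(groups["192.0.2.0/24"]))         # ['192.0.2.17', '192.0.2.201']
```

Two log entries that land in the same group differ only in the last number, which is why an address that changes from day to day can still point to the same small pool of subscribers.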
Every time a user connects to the ISP from home, he or she is likely to get an IP address that is identical over at least the first three groups. Large companies often have their own ranges of IP addresses, so it would be a simple matter to identify, for example, how often YouTube videos were viewed from work by employees of “Consolidated Enterprises.” As a result, some IP addresses can be used to identify a fairly small group of people. Knowing only the first three groups of digits in an IP address can reduce the number of possible households to about 256 in a large proportion of the instances.
The logs contain other information in addition to the IP address from which the video was accessed. They contain, for example, the name of the video, the login name of the person accessing it, and other information. A couple of years ago, AOL released a collection of searches, ostensibly to the research community, from which it had taken care to remove anything it thought was personally identifiable information. Although the users were identified only by a serial number, it was an easy matter to identify many of them through information that they entered into their searches and through other publicly available information.
The same thing is likely to be the case with YouTube videos. The titles of videos may contain the names of family members or of the users themselves, which is a quick way to narrow down the list of potential users from the 256 given by the IP address. Alternatively, users may use the same login name on multiple services, so a quick cross-reference may easily turn up more information about that specific user. The login name is useful for finding out who posted a video, but you do not have to log in just to view one. Finally, a large number of users refer to YouTube videos on other publicly accessible web sites, and so may be identified through those sites. Some users also describe their interest in the subjects of particular videos, so just knowing the types of videos a person finds interesting, and cross-referencing that with other publicly available information, may be sufficient to identify specific individuals. In a widely networked world, it is often fairly trivial to identify individuals based on a very small amount of information.
The conclusion is that, even if IP addresses were anonymous, a substantial fraction of the users could still be identified based on information in the log and other readily available information. Not everyone could be identified with perfect reliability, but it is not at all speculative that at least a substantial number of users could be. It took me only a few minutes of work with readily available tools to identify at least one viewer of a YouTube video. Others are likely to be just as easy to discover. Even if the proportion of identifiable users is small, in the context of all of the YouTube users, the number of identifiable users is likely to be very large.
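The login-name cross-reference mentioned above takes very little machinery. The following is a minimal sketch under stated assumptions: the field names (`login_id`, `video_id`, `login`, `real_name`), the function name, and every record are hypothetical, and the matching rule (the same login name, ignoring case) is only one of many ways such a link could be made.

```python
def cross_reference(log_rows, public_profiles):
    """Match log rows to public profiles that reuse the same login name.

    log_rows:        list of dicts with 'login_id' and 'video_id' keys.
    public_profiles: list of dicts with 'login' and 'real_name' keys,
                     standing in for any publicly accessible site.
    Returns a list of (login_id, video_id, real_name) tuples.
    """
    by_login = {profile["login"].lower(): profile for profile in public_profiles}
    matches = []
    for row in log_rows:
        profile = by_login.get(row["login_id"].lower())
        if profile is not None:
            matches.append((row["login_id"], row["video_id"], profile["real_name"]))
    return matches


# Hypothetical records, invented for illustration only.
log_rows = [
    {"login_id": "SkaterJo88", "video_id": "vid_a"},
    {"login_id": "quietviewer", "video_id": "vid_b"},
]
public_profiles = [
    {"login": "skaterjo88", "real_name": "Jo Example"},
]

print(cross_reference(log_rows, public_profiles))
# [('SkaterJo88', 'vid_a', 'Jo Example')]
```

One reused login name is enough to attach a name to a viewing record, and nothing in the sketch depends on the IP address at all.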