Videos have become a predominant part of users’ daily lives on the Web, especially with the emergence of video sharing services such as YouTube. Part of the huge success of multimedia content on the Web is due to the change in the user’s role from content consumer to content creator. However, by allowing users to publicize their independently generated content, video sharing networks become susceptible to different types of pollution. For example, users can pollute the system by spreading video messages containing undesirable content. Users can also associate metadata with videos in an attempt to fool video search engines (e.g., popular tags that are unrelated to the content). Moreover, users can upload identical videos, generating duplicates of the same content in the system. Such pollution not only compromises user satisfaction but also consumes system resources and can negatively affect the system infrastructure. In this work we provide a general overview of pollution in video sharing systems. We define the different kinds of pollution, their negative impact on users and the system, and possible strategies to minimize the problem.
“This much is known: For every rational line or forthright statement there are leagues of senseless cacophony, verbal nonsense, and incoherency.”
— The Library of Babel, Jorge Luis Borges (1941).
In the story of The Library of Babel, the Argentinean writer Jorge Luis Borges (1899–1986) creates a rich metaphor for the Web. His fictional Library of Babel was a place containing “…all the possible combinations of the 22 orthographic symbols (a number which though unimaginably vast, is not infinite).” In his story, Borges precisely anticipated some of the problems caused by the exponential growth of information, well before the Internet and the Web became widespread. For example, it is common to find multiple copies of the same video on the Web, which reduce the efficiency of video information retrieval tasks and degrade the quality of the information provided to users. The problem of near–duplicates is well characterized in the unimaginably vast Library of Babel: “…books that differ by no more than a single letter, or a comma.” The inhomogeneous nature of user–generated content that leads to spam and other forms of pollution is also described in Borges’ story: “There are also letters on the spine of each book; these letters do not indicate or prefigure what the pages will say.” Spam, duplicates, and dubious content are different forms of information pollution that consume human attention, which is one of the most valuable resources in the information age.
The Web is experiencing a new wave of applications associated with the proliferation of social networks and the growth of digital media. As a consequence, the enormous amount of available data introduces new and challenging information quality problems. Low quality information abounds on the Web, for various reasons. One of them is the explosion of all types of video content, which has made online video a predominant part of users’ daily lives on the Internet. Videos are being used for many different purposes: entertainment, news, communication, reference tools, etc. A good way to understand the amount of video content uploaded to YouTube is to compare it with the amount of content produced by the major U.S. TV networks. The amount of content uploaded to YouTube in 60 days is equivalent to the content that would have been broadcast 24 hours a day, 7 days a week, for 60 years by NBC, CBS and ABC combined. Also, video search on YouTube accounts for a large fraction of all Google search queries in the U.S., generating 2.6 billion searches in February 2009, according to a report released by ComScore. As a result, video search on YouTube and other video sharing sites is rapidly becoming an entry point into the Web. With inexpensive cameras, the growing adoption of broadband, and a proliferation of video sharing sites that host an apparently unlimited number of videos, it is trivial to create, upload and share videos today.
Much of the huge popularity of multimedia content on the Web is due to the change in the user’s role from consumer to creator. By allowing users to publicize their independently generated content, video sharing systems become susceptible to different forms of pollution. For example, opportunistic users can pollute the system by spreading video messages containing undesirable content (i.e., spam). Users can also associate metadata with videos in an attempt to fool video search engines (e.g., popular tags that are unrelated to the content) and achieve high ranking positions in search results. Furthermore, users can upload, intentionally or not, identical or almost identical videos, generating duplicates or near–duplicates of the same content in the system, which affects performance and other system–related aspects. All in all, the rapid proliferation of videos amplifies the problem of information overload on the Web. Additionally, pollution can negatively affect system components such as caches, content distribution networks, and video search engines. In spite of these problems, the available literature offers little in–depth understanding of video pollution. In this paper we present an overview of the types of pollution in video sharing systems, propose a taxonomy of video pollution forms, and analyze algorithmic ways to detect video pollution on the Web. Recognizing and identifying video pollution is a difficult problem. Technology is limited in its ability to do the work of human reviewers, since video is always contextual and it is almost impossible to regulate videos before seeing them.
The goal of this article is to provide an analysis of the pollution problem in video sharing systems. The paper is organized as follows. The next section presents a taxonomy of the different kinds of pollution that can exist in online video social network systems and discusses the implications of pollution for users and the system. Section 3 shows evidence of pollution in video social network systems, and Section 4 discusses techniques to detect and control video pollution.
Video sharing systems
In addition to their basic functions discussed below, video hosting and sharing systems (VSS) offer features that are typical of online social networks, such as search, communication, information management, recommendations and advertisements. While online video may not replace TV anytime soon, it is now certainly mainstream. According to the measurement company ComScore, about 150 million Internet users in the U.S. watch about 14.5 billion videos a month. Another report indicates that the total number of videos viewed online in the U.K. in April 2009 was 4.7 billion.
Video sharing systems (VSS) can be viewed as online social networks in which the primary user–generated objects are videos. In addition to retrieving and contributing video content, VSS users can build relationships. These relationships can be driven by preferences for each other’s content (e.g., a user can subscribe to another user to receive updates of her videos) or can represent friendship relations. Typically, a VSS exhibits four key characteristics: (i) users contribute multimedia content (often user generated), which is typically annotated with a title, description, and tags; (ii) users view content contributed by other users; (iii) users evaluate content by rating, text commenting, audio/video responding or some combination thereof; and, (iv) users maintain lists of favorite content and favorite contributors, and join thematic groups. Many of these features are also available in more traditional social networking and blog sites. The main difference here is that the objects that promote social interactions are videos rather than text or images. It is important to point out that many social networking services (e.g., Facebook, Myspace, Orkut, etc.) also provide video hosting and sharing facilities.
Table 1: Features of popular video sharing systems.
Systems (columns): YouTube, Dailymotion, Metacafe, Yahoo! Videos, MySpace Videos
Top lists: X X X X X
Related videos: X X X
Textual comments, ratings, favorites and friendship: X X X X X
Subscribe: X X X X
Video response: X
Playlists: X X X
Honors and statistics about external links: X
Groups or communities: X X X X
Categories, title, tags, and description as metadata: X X X X X
Edit metadata of others’ videos: X
Video recommendation, earnings and user reviews: X X
Filtering of duplicates: X X
Video edit platform: X
Flags and advertisements: X X X X X
Table 1 summarizes the main features of five popular video sharing systems: YouTube, Dailymotion, Metacafe, Yahoo! Videos, and MySpace Videos. Streaming billions of videos a month, YouTube is the most popular social media network today. YouTube includes most of the social media characteristics listed above. In particular, users maintain lists of favorite videos and favorite contributors (friends), can subscribe to other user accounts to receive updates when new content is posted, and can pool their videos within thematic groups. Additionally, YouTube provides some exclusive features, such as honors for users whose videos appear in top lists, and video responses, which allow users to respond with a video to another user’s video contribution. A user can post a video response by uploading it directly from a webcam, choosing a video from her pre–existing YouTube contributions, or uploading a video from a disk drive. YouTube also allows advertising to be added to popular videos and shares the revenue with their owners.
Another popular VSS service is called Dailymotion. It shares a number of characteristics with YouTube. Yahoo! Videos is a hosting and sharing service offered by Yahoo! that works in the same way as YouTube. MySpace Videos is an area of MySpace dedicated to videos. The services discussed here provide basic features, such as top lists, textual comments, ratings, favorites, friendship and association of metadata by the owners of the videos.
Another video–based system is Metacafe. A remarkable characteristic of this system is the use of mechanisms to attract good videos to the system. To do so, Metacafe offers rewards to users whose videos obtain large numbers of views and of ratings. Additionally, it filters duplicates (i.e., videos with redundant content) when the videos are uploaded into the system. In contrast, YouTube only filters duplicates from the search results. Another interesting characteristic of Metacafe is a platform which allows users to edit video metadata as an attempt to improve video search results.
According to Alexa, YouTube is the third most popular site worldwide in terms of traffic (as of March 2009). Dailymotion’s global rank is 48th, whereas Metacafe ranks 116th. As Yahoo! Videos and MySpace Videos are specialized areas of larger Web sites, we do not have statistics on their global rank, but the entire Yahoo! site ranks first as the most popular site, whereas MySpace is ranked 7th.
Video pollution in video sharing systems (VSS) can be viewed as the introduction into the environment of (i) redundant, (ii) incorrect, noisy, imprecise, or manipulated, or (iii) undesired or unsolicited videos or meta–information (i.e., the contaminants), causing harm or discomfort to the members of the social environment of a video sharing service. Pollution in VSS can be caused by a number of reasons, most of which are influenced by the uncontrolled (or malicious) activities or intention of the polluters. Next, we propose a taxonomy for pollution in VSS based on the above definition, which takes into consideration the type of “harmful” information delivered to the user by the system. The impact on the user caused by this type of content may include: (i) loss of interest in using the system, as some of the retrieved information is mostly redundant or does not match well to an expressed information need or interest; (ii) distrust in the system, as some of the retrieved information is clearly not related to the current interest or navigation pattern, being perhaps offensive or being included only to take advantage of the user; and, (iii) impatience, as some resources are very difficult to find or due to the low performance of the system, among many others.
Our taxonomy proposes the following types of pollution:
Redundant information (near–duplicates): Since users can freely upload content to VSS systems, it is natural to expect that some of these videos are identical or very similar. We treat these similar videos, called near–duplicates, as a type of pollution, since they pollute, for example, search results. Duplicates can cause several types of problems for the system. The most intuitive one is obtaining a large number of duplicates as the result of a search. In terms of storage, disk space could be saved if the system kept only a single copy of each duplicated video. Additionally, duplicates split the audience of a given piece of content among several copies, which can indirectly affect the performance of caches and content distribution networks (CDNs). A related problem has to do with copyright violation. Users may inadvertently or even maliciously introduce into the system unauthorized copies of videos that are protected by the copyright owner’s exclusive rights.
Incorrect, noisy, imprecise, or manipulated meta–information (metadata pollution): One of the trends of Web 2.0 is to allow users to freely create metadata and associate it with content. As a consequence, noise in metadata creates a challenge for content retrieval in VSS systems. Videos with incorrect or imprecise metadata appear in search results as mismatches or in lists of (un)related videos as polluted entries. In systems such as Metacafe, users can edit not only the metadata of their own videos, but also the metadata of third–party videos. Whereas this mechanism can improve the extent to which the metadata describes a video, it can also be a target for malicious users, who can associate unrelated information with any video in the system.
The main problem caused by this type of pollution is related to content retrieval, as most information retrieval mechanisms rely mostly on metadata. If a video is not well described by its meta–information, it may not be found in search results or appear in lists of related videos. Additionally, video recommendation mechanisms based on meta–information can suggest undesirable videos. Lastly, users may watch some of these videos (or at least the beginning of them) by mistake, consuming extra system resources.
A particular case of this type of pollution is promotion, i.e., when opportunistic users try to modify the system statistics of a certain video (e.g., number of views, ratings, number of comments) to boost the video’s rank, making it appear in the top lists maintained by the system. For example, there is some controversy about the ranking of the all–time most viewed video on YouTube. The popular Evolution of dance video lost the first position to a video clip of singer Avril Lavigne. The rapid rise of Avril Lavigne’s video is attributed to a fan club that developed a Web site that streams the video every 15 seconds. Thus, any fan with a browser (or several browsers) open on this Web site could have contributed many views of the video in a single day. Promoted videos with modified statistics can also interfere with system design aspects, since videos that quickly reach high rankings are strong candidates to be kept in caches and CDNs.
Undesired or unsolicited information (video spam): In video sharing systems, videos can be used as objects of interactions between users. Similarly to e–mail spam, link spam, and spam in blogs and forums, conversations established in video social networks may contain video spam. Video messages can be considered spam if the video communication is unsolicited or is completely unrelated to the subject of the conversation or discussion. The video response feature provided by YouTube is an example of communication using video objects. Although appealing as a mechanism to enrich the online interaction, these features open opportunities for users to introduce pollution into the system by sending undesirable video messages. These users are motivated to pollute top lists for several reasons, such as to spread advertising to generate sales, to disseminate pornography (often as an advertisement to a Web site), or just to compromise the system’s reputation.
Table 2: How different types of pollution may affect the system.
Affected part of the system (columns): Search/indexing, System resources, Design issues, Top lists
Type of pollution (rows):
Redundant information: X X X
Incorrect, noisy, imprecise, or manipulated information: X X X X
Undesired or unsolicited: X X
Table 2 summarizes the different types of pollution and how they affect different aspects of the system. As we can see, in addition to consuming extra resources every time an undesirable video is displayed, pollution has a series of other negative effects on the system. Pollution creates an overwhelming amount of information of questionable quality, making the process of finding trusted information on the Web more difficult. The notion of social trust has been associated with a relationship of reliance, but it also carries the risk of being betrayed. With an increasing amount of information of questionable origin and reliability, finding trusted information created by trusted people is increasingly challenging. Above all, pollution may jeopardize users’ trust in the system as well as their patience and satisfaction with it.
This section presents evidence of the different types of pollution shown in Table 2, with particular emphasis on the currently most popular VSS system, YouTube. The signs of pollution come from real data we collected, as well as from results reported in related previous work.
Redundant information can occur in VSS systems in the form of unauthorized content, such as copyright–protected videos, and of video duplicates. Regarding the former, YouTube has recently released a new technology called Video ID, which allows copyright owners to compare their videos with material on YouTube. Thus, copyright owners can flag infringing material for removal. Subsequently, a research project from MIT, called YouTomb, started to track videos taken down from YouTube for alleged copyright violation. YouTomb continually monitors some YouTube videos in order to account for copyright–related takedowns. It retrieves information such as who issued the complaint, how long the video was up before takedown, and how many views the video received before takedown. In total, the project tracked thousands of videos that were removed from YouTube for alleged copyright violation.
The first evidence of duplicate content in a VSS system was uncovered by Cha, et al. The authors created a collection of about 1,000 videos from YouTube, which had been classified as duplicates by the system. They show that the popularity of content can be spread across multiple duplicates, which is undesirable for caching systems. Additionally, as discussed previously, duplicates are also undesirable for video search engines. In fact, Wu, et al. proposed a mechanism to filter near–duplicate results from video search. They created a collection of near–duplicates based on 24 search queries to YouTube, Google Video and Yahoo! Video. Using a hierarchical clustering algorithm, they were able to filter out a large number of near–duplicates from the video search results.
YouTube has recently deployed a duplicate detection mechanism: when the search results are displayed, the system filters the duplicates out, displaying only one video per group of duplicates. However, it also displays links to all filtered duplicates. We exploited this feature to get a rough assessment of the number of duplicate videos detected in YouTube searches by performing the following experiment. We created a crawler that searches for random words extracted from an English dictionary obtained from a tool called Ispell. The crawler collects the meta–information of the videos displayed as results as well as of all of their reported duplicates. After searching for 319 different words, our crawler had found 97,091 videos in total. Of these, 9,571 are videos with at least one duplicate according to the YouTube algorithm, and the number of collected duplicates sums to 100,373. In terms of storage, all collected videos would consume 366.5 TB of disk space, whereas storing a single copy of each video (e.g., the oldest one) would require only 27.2 TB of space. These numbers give an idea of the existing pollution, in terms of redundant information, in YouTube, and of its impact on the system’s storage requirements.
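The storage comparison in this experiment can be illustrated with a minimal sketch. It assumes each crawled video has already been assigned to a duplicate group and that its size and upload time are known; the record keys `group`, `uploaded` and `size`, and the toy values, are hypothetical and are not the format or data used by our crawler.

```python
from collections import defaultdict


def storage_with_and_without_duplicates(videos):
    """Return (total_size, deduplicated_size) for a list of video records.

    Each record is a dict with hypothetical keys:
      'group'    - identifier shared by all duplicates of the same content
      'uploaded' - upload time (any comparable value)
      'size'     - size of the video file
    """
    total = sum(v["size"] for v in videos)

    # Collect the copies of each piece of content.
    groups = defaultdict(list)
    for v in videos:
        groups[v["group"]].append(v)

    # Keep a single copy per group: the oldest one.
    deduplicated = sum(
        min(copies, key=lambda v: v["uploaded"])["size"]
        for copies in groups.values()
    )
    return total, deduplicated


# Toy data for illustration only.
videos = [
    {"group": "a", "uploaded": 1, "size": 100},
    {"group": "a", "uploaded": 2, "size": 120},
    {"group": "b", "uploaded": 5, "size": 50},
]
total, deduplicated = storage_with_and_without_duplicates(videos)
print(total, deduplicated)
```

Keeping the oldest copy is only one possible policy; any other rule for choosing a representative (e.g., the most viewed copy) would fit in the same place.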
Incorrect, noisy, imprecise or manipulated metadata
The association of metadata, especially tags, with content is not particular to VSS systems. For example, tags are used in photo and link sharing systems, such as Flickr and Digg, respectively. In systems in which users can freely associate tags with content, some noise in this association is unavoidable. Part of this noise can be caused by the high degree of subjectivity in users’ perception of a given piece of content.
In order to evaluate metadata association in YouTube, we considered our collected data set of video duplicates (see the previous section). Our goal is to verify the similarities and differences in the metadata associated with videos with the same content. For simplicity, we focus on one specific type of metadata, that is, tags. We note that, by comparing the tags assigned to different duplicates of the same video, we are, indeed, assessing the variability in user perception with regards to the same content.
To measure the similarity between the tags associated with two videos, we use a simple metric, called the Jaccard index, defined as follows. Let A and B be the sets of tags of two videos. The Jaccard index J is defined as the number of tags in common in the two sets divided by the number of unique tags in the union of the two sets:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
A Jaccard index equal to 0 means that the two videos have no tags in common, whereas J close to 1 indicates that the two videos share most of their tags. To prevent derived and inflected words from being treated as different tags (e.g., House and houses), we used a stemming method to extract each tag’s stem.
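As an illustration, the Jaccard index over stemmed tags can be sketched as follows. The function `naive_stem` is a deliberately crude stand-in (lowercasing and dropping a final "s") and is not the stemming method used in our analysis; an established stemmer, such as Porter's, could be substituted. Returning 0 for two empty tag sets is also a convention chosen for this sketch.

```python
def naive_stem(tag):
    """Crude stand-in for a real stemmer: lowercase and drop a final 's'."""
    tag = tag.lower()
    if len(tag) > 3 and tag.endswith("s"):
        tag = tag[:-1]
    return tag


def jaccard(tags_a, tags_b, stem=naive_stem):
    """Jaccard index between two collections of tags, compared by stem."""
    a = {stem(t) for t in tags_a}
    b = {stem(t) for t in tags_b}
    union = a | b
    if not union:
        return 0.0  # convention for two empty tag sets (an assumption)
    return len(a & b) / len(union)


# "House" and "houses" collapse to the same stem, so two of the four
# distinct stems are shared.
print(jaccard(["House", "music", "live"], ["houses", "music", "concert"]))  # 0.5
```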
Figure 1 shows the distribution of the Jaccard index obtained for each pair of duplicate videos in our database. The majority (roughly 57 percent) of the pairs of duplicates do not have any tag in common. In fact, 87 percent of the pairs of duplicates have a Jaccard index under 0.3, and only one percent have all tags in common (J=1). In other words, videos with the same content have relatively few tags in common, suggesting that there is a significant degree of noise and, perhaps, imprecision in the tags used on YouTube.
Figure 1: Similarity between tags of duplicate videos (Jaccard index distribution).
VSS systems offer different sources of information. In addition to the content itself, a variety of non–content information is available, provided directly or indirectly by users. For example, VSS sites usually show explicit quality ratings, video access statistics, links between videos and users, and additional types of metadata such as title and description. As a consequence, these types of information can be exploited to automatically identify high quality content.
However, noisy and manipulated metadata can also be intentionally introduced by malicious and opportunistic users. In fact, we have found evidence of promotion in YouTube in a recent analysis of the properties of the social network created by video response interactions. The YouTube video response feature allows users to post a video as a response to a discussion topic. Although appealing as a mechanism to enrich online interaction, this feature opens opportunities for users to introduce pollution into the system, aiming not only at promotion but also at video spamming. In , we approached the problem of detecting these two classes of opportunistic users using a classification strategy. We discuss evidence of promotion in YouTube in the following section, in which we also present our methodology to investigate the presence not only of this type of pollution but also of video spamming in the system.
Undesired or unsolicited information
In order to raise evidence of promotion as well as of video spamming in YouTube, we collected a large data set containing users and their interactions via video responses. We then “manually” classified each collected user as video spammer, video promoter, or others.
We define a video spammer as a user who pollutes the video response interactions by posting at least one video response that is considered unrelated to the video topic (i.e., a video spam). Users may post unrelated videos as responses to popular video topics, aiming to increase the likelihood of the video response being viewed by a larger number of users. Examples of video spam are: (i) an advertisement for a product or Web site completely unrelated to the subject of the responded video; and, (ii) pornographic content posted as a response to a cartoon video.
In contrast, a video promoter is defined as a user who pollutes the video response top lists by posting video responses aiming at promoting the video topic. These users try to gain visibility to a specific video by posting a large number of (potentially unrelated) responses to boost the rank of the video topic, making it appear in the top lists maintained by YouTube.
To collect our user data set, we built a crawler that works as follows. Starting from a set of users as seeds, it follows links of video responses and responded videos in a breadth–first fashion, collecting information about the videos and their owners. We used as seeds the owners of videos in the top–100 list of the most responded videos of all time. Our sampling procedure produced an entire weakly connected component of a directed graph where vertices represent users, and an edge from u to v represents that user u has posted at least one video response to v’s videos. The crawler ran for one week (11–18 January 2008), gathering a total of 264,460 users, 381,616 responded videos and 701,950 video responses. A detailed description of our sampling procedure can be found in .
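The breadth–first sampling described above can be sketched over an in–memory graph. This is an illustration only: the dictionaries `responses_to` and `responses_from` are hypothetical stand–ins for the pages a real crawler would fetch, and all names are assumptions rather than the interface of our crawler. Following links in both directions is what yields a weakly connected component.

```python
from collections import deque


def crawl_weak_component(seeds, responses_to, responses_from):
    """Breadth-first traversal following video-response links both ways.

    responses_to[u]   - users whose videos u responded to (edges u -> v)
    responses_from[u] - users who responded to u's videos (edges w -> u)
    Returns the set of visited users and the set of directed edges seen.
    """
    visited = set(seeds)
    queue = deque(seeds)
    edges = set()
    while queue:
        u = queue.popleft()
        # Outgoing links: videos that u responded to.
        for v in responses_to.get(u, ()):
            edges.add((u, v))
            if v not in visited:
                visited.add(v)
                queue.append(v)
        # Incoming links: responses posted to u's videos.
        for w in responses_from.get(u, ()):
            edges.add((w, u))
            if w not in visited:
                visited.add(w)
                queue.append(w)
    return visited, edges


# Toy graph: a and c both responded to b; d and e form a separate component.
responses_to = {"a": ["b"], "c": ["b"], "d": ["e"]}
responses_from = {"b": ["a", "c"], "e": ["d"]}
users, edges = crawl_weak_component(["a"], responses_to, responses_from)
print(sorted(users), sorted(edges))
```

Starting from seed a, the traversal reaches c only through the incoming link to b, which a traversal restricted to outgoing links would miss.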
In order to build a collection of video spammers, video promoters, and others, we selected various users to undergo a manual classification. The strategies used to select these users are described in . Three volunteers analyzed all video responses of each selected user in order to independently classify her into one of the three categories. In case of a tie (i.e., each volunteer chose a different class), a fourth independent volunteer was consulted. Each user was classified based on majority voting. Volunteers were instructed not to favor the spammer or promoter classes. For instance, if a volunteer was not confident that a video response was unrelated to the responded video, she should consider it to be others. Moreover, video responses containing people chatting or expressing their opinions were classified as others, as we chose not to evaluate the expressed opinions. Note that, as we do not favor spammers and promoters, we potentially reduce the number of users in these classes, but we obtain a more accurate test collection. In total, our collection contains 829 users, of which 641 were classified as others, 157 as video spammers, and 31 as video promoters.
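The voting rule used in the manual classification can be written down as a short sketch. The function name, the label strings and the `fourth_vote` parameter are hypothetical; the sketch only encodes the rule stated above, namely majority among three votes, with a fourth vote consulted when all three disagree.

```python
from collections import Counter


def classify_user(votes, fourth_vote=None):
    """Combine three independent labels into one by majority voting.

    votes       - the three volunteers' labels
    fourth_vote - label from a fourth volunteer, needed only when the
                  three votes are all different
    """
    label, count = Counter(votes).most_common(1)[0]
    if count >= 2:
        return label
    # Three-way tie: with three classes, the fourth vote necessarily
    # agrees with one of the earlier votes and so decides the majority.
    if fourth_vote is None:
        raise ValueError("tie among volunteers; a fourth vote is required")
    return fourth_vote


print(classify_user(["spammer", "others", "spammer"]))
print(classify_user(["spammer", "promoter", "others"], fourth_vote="others"))
```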
In essence, other users, video spammers and video promoters have different goals in the system. Therefore, we expect that they also differ with respect to how they behave to achieve their purposes, i.e., who they interact with, which videos they post, which videos they respond to, etc. In fact, we have analyzed a large set of attributes that characterize user behavior in the system, including: (i) attributes of the videos and video responses uploaded by the users and of the videos the users responded to, such as video duration, number of views and ratings; (ii) individual attributes of the user, such as number of friends, number of videos uploaded and number of video responses posted and received; and, (iii) attributes that capture the social relationships established among users via video response interactions. We found significant differences in several attributes, particularly those related to the videos users respond to (i.e., video topics), which clearly distinguish video spammers, video promoters and other users as three different classes. We illustrate these differences by showing, in Figures 2 and 3, the cumulative distribution functions (CDF), for each user class, of two of the ten most discriminative attributes analyzed: the number of views (i.e., popularity) and the total ratings received by the video topics that were the target of the video responses posted by the users. We note that these two attributes reflect, in fact, user feedback with regard to the video content and its “quality” in general.
Figure 2: Differences between video spammers, video promoters and other users — number of views of target videos.
Figure 3: Differences between video spammers, video promoters and other users — total ratings of target videos.
The two figures show that spammers tend to post their video responses to videos with larger numbers of views and higher ratings, as these users tend to target more popular videos in order to attract more visibility to their content. In sharp contrast, the curves for video promoters are much more skewed towards smaller values; that is, promoters tend to target videos that are still not very popular, and thus have fewer views and poorer ratings, aiming to raise their visibility in the system. Other users, being driven mostly by social relationships and interests, exhibit an intermediate behavior, targeting videos with a wide range of popularity and ratings.
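For readers who wish to reproduce this kind of comparison on their own data, an empirical CDF per user class can be computed as in the sketch below. The per–class values are made–up numbers for illustration only and are not taken from our data set; the function and variable names are likewise assumptions.

```python
def empirical_cdf(values):
    """Return the points (x, P[X <= x]) of the empirical CDF, sorted by x."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for i, x in enumerate(ordered, start=1):
        if points and points[-1][0] == x:
            # Repeated value: keep one point with the highest cumulative share.
            points[-1] = (x, i / n)
        else:
            points.append((x, i / n))
    return points


# Made-up numbers of views of the videos targeted by each class.
views_by_class = {
    "spammers": [10, 10, 1000, 50],
    "promoters": [3, 5, 8],
}
for user_class, views in views_by_class.items():
    print(user_class, empirical_cdf(views))
```

Plotting the resulting points for each class on a shared, typically logarithmic, x axis gives curves of the kind shown in Figures 2 and 3.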
This section discusses possible strategies that can be used to control video pollution in online video sharing systems. We explain the general idea of these strategies, which may be applied to all of the three types of video pollution characterized in this paper: video duplicates, metadata pollution and video spam.
One approach widely used to control several types of pollution in Web environments relies on machine learning algorithms (e.g., supervised classification). Basically, special–purpose algorithms are developed to automatically detect and fight polluters and polluted content. In order to be effective, these mechanisms and algorithms must be efficient (due to the enormous amount of data to be analyzed) and highly accurate, which can sometimes be very hard to achieve. Because polluters are always changing their strategies, algorithms must be constantly updated to adapt to changes and maintain their effectiveness.
Several types of algorithms, based on machine learning or not, can be used to detect and remove different types of pollution in video environments. For example, Benevenuto, et al. addressed the issue of detecting video spammers and promoters. To that end, they proposed a classification approach, based on Support Vector Machines, that combines a multitude of evidence from videos (responses and responded videos), from user behavior and from the social relationships established between users via video interactions, to detect those kinds of polluters. In , the authors propose a mechanism to filter near–duplicates from video search results. YouTube has an algorithm to detect duplicated content. According to YouTube, “The Video identification tool is the latest way YouTube offers copyright holders to easily identify and manage their content on YouTube. The tool creates ID files which are then run against user uploads and, if a match occurs, the copyright holders policy preferences are then applied to.”
Feedback from users
Another approach used by online video sharing systems to control pollution is based on users’ feedback. By allowing users to flag a video when they encounter pollution, the system is able to remove some undesired video objects. After receiving several flags from different users for the same object, which can be a piece of content or a user, providers need to take action to combat and control pollution. This mechanism is used by virtually all of the VSS environments listed in Section 2. Although a flagging system is very simple to implement, its efficacy depends on users’ actions: if a polluted video is not flagged, it will not be removed from the system. System providers can therefore deploy incentive mechanisms that encourage users to flag polluted content, and combine flagging with other techniques to control pollution in video sharing systems.
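The flagging mechanism described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the class name, the method names and the threshold of three distinct users are all hypothetical, and no real system is claimed to work exactly this way. The one design point it shows is that flags are counted per distinct user, so a single user cannot trigger a review by flagging repeatedly.

```python
from collections import defaultdict


class FlagTracker:
    """Collects user flags and reports objects that deserve a review.

    The threshold is an assumed, tunable parameter.
    """

    def __init__(self, review_threshold=3):
        self.review_threshold = review_threshold
        # Maps an object id (a video or a user account) to the set of
        # distinct users who have flagged it.
        self._flaggers = defaultdict(set)

    def flag(self, object_id, user_id):
        """Record a flag; return True once enough distinct users have flagged."""
        self._flaggers[object_id].add(user_id)
        return self.needs_review(object_id)

    def needs_review(self, object_id):
        """True if the object has reached the review threshold."""
        return len(self._flaggers.get(object_id, ())) >= self.review_threshold
```

For example, with the default threshold, a video flagged by three different users is reported for review, while three flags from the same user count only once.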
The rise of VSS indicates a shift in the organization of video systems. Early video systems were structured primarily by topics according to the video content. Current popular video systems such as YouTube organize videos around online communities structured as personal (or “egocentric”) networks, with the individual at the center of their own community [32]. This new perspective opens the door to adversarial behavior, in which some users create polluted content whereas others aim to contribute to the system. One interesting approach to tackling pollution consists of empowering members of online communities with mechanisms to clean up after, or report, members that do not contribute positively to the community.
This approach builds on the idea that each user should take care of the system and of the content that other users upload, editing the content and associated metadata of every video uploaded to the system, as users do with articles in Wikipedia. Metacafe is the only service that allows this collaborative approach to controlling pollution, which it calls Wikicafe, “a feature that allows the community to edit the details of videos you watch, such as its title, tags, descriptions, and much more”.
This type of approach is relatively simple to implement and may be effective in controlling pollution since, fortunately, polluters and malicious individuals are a minority. System providers must educate users and convince them that if they want a good system they should take care of it. On the other hand, if polluters represent a large share of the users of the system, this method may not be effective.
Make life harder for polluters
Another strategy to combat pollution is to make life more difficult for spammers and malicious users by increasing the cost of sending spam or creating pollution. Any measure capable of frustrating them can help. As an example, a well–intentioned user typically creates just one account in the system, so several accounts created from the same IP address in a short period of time are suspicious. If the system detects this and starts to require consecutive captchas for new accounts coming from that IP address, for instance, the polluter will probably soon tire of completing the captchas and stop creating fake accounts. If a polluter uses a script to create accounts automatically, she will have to work harder to make her code work. This kind of strategy has been used by many providers in practice and can succeed in controlling pollution.
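The account–creation example above can be sketched as a sliding–window counter per IP address. The Python code below is a hedged illustration only: the class name, the limit of two sign–ups and the one–hour window are assumptions chosen for the example, and timestamps are passed in explicitly (in seconds) so that the behavior is easy to follow.

```python
from collections import defaultdict, deque


class SignupGuard:
    """Decides when a new sign-up from an IP address should face a captcha.

    Both max_signups and window_seconds are assumed, tunable parameters.
    """

    def __init__(self, max_signups=2, window_seconds=3600):
        self.max_signups = max_signups
        self.window_seconds = window_seconds
        # Maps an IP address to the timestamps of its recent sign-ups.
        self._recent = defaultdict(deque)

    def register_signup(self, ip_address, now):
        """Record a sign-up attempt at time `now` (seconds).

        Returns True if a captcha should be required for this attempt.
        """
        recent = self._recent[ip_address]
        # Forget sign-ups that have fallen out of the time window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        recent.append(now)
        return len(recent) > self.max_signups
```

With these assumed settings, the first two sign–ups from an address within an hour proceed normally, and every further sign–up in that window is asked to solve a captcha. Addresses shared by many legitimate users, for example behind a proxy, would need a more lenient limit.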
The impact of pollution on the Web is becoming evident to society. In 2009, YouTube was bombarded with a huge number of pornographic videos disguised as Jonas Brothers fan videos. Newspaper articles have emphatically described parental distress over the combination of the abundance of every type of video on the Web, pocket–size video players and young children’s curiosity.
In this work we provide a general overview of pollution in a new context, video sharing systems. First, we categorize the different kinds of existing pollution and their negative impact on users and systems. Then, we show evidence of the existence of each type of pollution and discuss possible strategies to minimize the problem.
Finally, the strategies discussed in Section 4 can be used to efficiently control all three kinds of video pollution defined here. System and content providers may choose the technique most appropriate for each case, or make use of multiple strategies to control pollution effectively. In doing so, they must be aware of the trade–off among efficiency, implementation cost and maintenance cost.
About the authors
Fabrício Benevenuto is a postdoctoral researcher at the Universidade Federal de Minas Gerais (UFMG), in Brazil. He received a Ph.D. in Computer Science from UFMG, in 2010. During his Ph.D., he held research intern positions at HP Labs in Palo Alto, Calif., and at Max Planck Institute for Software Systems (MPI–SWS), in Germany. His current research is focused on exploiting and understanding characteristics of online social network systems.
E–mail: fabricio [at] dcc [dot] ufmg [dot] br
Tiago Rodrigues is a computer science graduate student at the Universidade Federal de Minas Gerais (UFMG), Brazil. His research interests focus on interactions in social networks and system behavior, including quality factors such as performance, availability and malicious behavior.
E–mail: tiagorm [at] dcc [dot] ufmg [dot] br
Virgílio Almeida is a Professor of Computer Science at the Universidade Federal de Minas Gerais (UFMG), Brazil. His research interests are centered around performance evaluation and modeling of large–scale distributed systems, such as the Web and social networks. He held visiting positions at Boston University, Polytechnic University of Catalunya, in Barcelona, Polytechnic Institute of NYU, and visiting appointments at Xerox PARC and Hewlett–Packard Research Laboratory. Virgílio is a recipient of a Fulbright Research Scholar Award and a full member of the Brazilian Academy of Sciences. Virgílio was an International Fellow of the Santa Fe Institute for 2008–2009. Professor Almeida is co–author (with Daniel Menascé and Larry Dowdy) of the book, Performance By Design: Computer Capacity Planning by Example, published by Prentice–Hall. He is a member of the editorial boards of Internet Computing, First Monday, and Journal of Internet Services and Applications.
E–mail: virgilio [at] dcc [dot] ufmg [dot] br
Jussara M. Almeida is an Associate Professor of Computer Science at the Universidade Federal de Minas Gerais (UFMG), Brazil. She received a Ph.D. in Computer Science from the University of Wisconsin–Madison, in 2003. Her research interests include performance modeling of distributed systems and workload and user behavior characterization.
E–mail: jussara [at] dcc [dot] ufmg [dot] br
Marcos Gonçalves is an Associate Professor at the Computer Science Department of the Universidade Federal de Minas Gerais (UFMG), Brazil. He holds a Ph.D. in Computer Science (CS) from Virginia Tech (2004), an MS in CS from State University of Campinas, Brazil (Unicamp, 1997), and a BS, also in Computer Science, from the Federal University of Ceará, Brazil (UFC, 1995). Professor Gonçalves has served as referee on different journals (TOIS, TKDE, IP&M, Information Retrieval, Information Systems, etc.) and at several conferences (SIGIR, CIKM, JCDL, etc.). His research interests include information retrieval, digital libraries, text classification and text mining in general, having published a number of papers in these areas. Marcos is an affiliated member of the Brazilian Academy of Sciences.
E–mail: mgoncalv [at] dcc [dot] ufmg [dot] br
Keith Ross, the Leonard J. Shustek Chair in Computer Science (2003), and a professor and department head (as of 2008) in the Department of Computer Science and Engineering of Polytechnic Institute of New York University, is an international expert in peer–to–peer networking and video streaming. In addition to his research in P2P file sharing and video streaming, Ross is recognized for his work in Internet measurement, Web caching, multiservice loss networks, content distribution networks, network security, voice over IP, optimization, queuing theory, and Markov decision processes. He is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE), recipient of the Infocom 2009 Best Paper Award, and recipient of Best Paper in Multimedia Communications 2006–2007 awarded by IEEE Communications Society. Keith Ross is co–author (with James F. Kurose) of the most popular textbook on computer networks, Computer Networking: A Top–Down Approach Featuring the Internet, published by Addison–Wesley.
E–mail: ross [at] poly [dot] edu
This work is partially supported by the projects INCT-Web (MCT/CNPq grant 57.3871/2008–6) and by the authors’ individual grants and scholarships from CNPq, FAPEMIG, and CAPES.
2. Avril is an Anagram for “viral”; see http://www.theglobeandmail.com/servlet/story/RTGAM.20080623.WBmingram20080623143124/WBStory/WBmingram.
3. “Cheating fans give Avril Lavigne a YouTube lift,” at http://blog.wired.com/underwire/2008/06/avril-lavigne-f.html.
7. Image in Haystack, at http://www.nytimes.com/2009/06/07/magazine/07wwln-medium-t.html.
9. Online video, Pew Internet & American Life Project, at http://www.pewinternet.org/Reports/2007/Online-Video.aspx.
10. “Uploading the avant–garde,” New York Times, at www.nytimes.com/2009/09/06/magazine/06FOB-medium-t.htm.
11. “U.S. Streaming Video Market Overview,” at http://www.comscore.com/press/release.asp?press=1264.
15. YouTube copyright policy: Video Identification tool, at http://www.google.com/support/youtube/bin/answer.py?hl=en&answer=83766 .
16. “YouTube videos pull in real money,” New York Times, at http://www.nytimes.com/2008/12/11/business/media/11youtube.html?_r=1&emc=eta1.
17. E. Agichtein, C. Castillo, D. Donato, A. Gionis, and G. Mishne, 2008. “Finding high-quality content in social media,” ACM Conference on Web Search and Web Data Mining (WSDM); version at http://wwww.mathcs.emory.edu/~eugene/papers/wsdm2008quality.pdf.
18. L. von Ahn, M. Blum, N.J. Hopper, and J. Langford, 2003. “CAPTCHA: Using hard AI problems for security,” Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT); version at http://www.captcha.net/captcha_crypt.pdf.
19. V. Almeida, 2009. “In search of models and visions for the Web age,” ACM Interactions, volume 16, number 5 (September–October), pp. 44–47.
20. R. Baeza–Yates and B. Ribeiro–Neto, 1999. Modern information retrieval. New York: ACM Press.
21. F. Benevenuto, C. Costa, M. Vasconcelos, V. Almeida, J. Almeida, and M. Mowbray, 2006. “Impact of peer incentives on the disseminations of polluted content,” Proceedings of the ACM Symposium on Applied Computing (SAC) (Dijon, France), pp. 1,875–1,879.
22. F. Benevenuto, T. Rodrigues, V. Almeida, J. Almeida, and M. Gonçalves, 2009. “Detecting spammers and content promoters in online video social networks,” Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (Boston), pp. 620–627.
23. F. Benevenuto, T. Rodrigues, V. Almeida, J. Almeida, and K. Ross, 2009. “Video interactions in online video social networks,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP), volume 5, number 4 (October), article number 30.
24. M. Cha, H. Kwak, P. Rodriguez, Y.–Y. Ahn, and S. Moon, 2007. “I tube, you tube, everybody tubes: Analyzing the world’s largest user generated content video system,” Proceedings of the Seventh ACM SIGCOMM Conference on Internet Measurement (San Diego), pp. 1–14.
25. M. Cherubini, R. de Oliveira, and N. Oliver, 2009. “Understanding near–duplicate videos: A user–centric approach,” Proceedings of the Seventeenth ACM International Conference on Multimedia (Beijing), pp. 35–44.
27. T. Joachims, 1998. “Text categorization with support vector machines: Learning with many relevant features,” Proceedings of the European Conference on Machine Learning (ECML); version at http://www.joachims.org/publications/joachims_98a.pdf.
28. S. Boll, 2007. “MultiTube — Where Web 2.0 and multimedia could meet,” IEEE MultiMedia volume 14, number 1, pp. 9–13.
29. X. Wu, A.G. Hauptmann, and C.–W. Ngo, 2007. “Practical elimination of near–duplicates from Web video search,” Proceedings of the 15th International Conference on Multimedia (Augsburg, Germany), pp. 218–227.
30. M. Zink, K. Suh, Y. Gu and J. Kurose, 2008. “Watch global, cache local: YouTube network traces at a campus network — Measurements and implications,” IEEE Multimedia Computing and Networking (MMCN); version at http://www.cs.umass.edu/~yugu/papers/zink08mmcn.pdf.
31. J. Golbeck, 2008. “Weaving a web of trust,” Science, volume 321, number 5896 (19 September), pp. 1,640–1,641.
32. d. boyd and N. Ellison, 2007. “Social Network Sites: Definition, History, and Scholarship,” Journal of Computer–Mediated Communication, volume 13, number 1, at http://jcmc.indiana.edu/vol13/issue1/boyd.ellison.html.
33. R.Q. Quiroga, 2010. “In retrospect: Funes the Memorious,” Nature, volume 463, number 611 (4 February), p. 611.
34. Examples are the most viewed, most video responded, most commented, and top rated videos.
35. We noticed that, after our experiment was performed, YouTube has stopped displaying hidden duplicates.
Paper received 11 November 2009; revised 10 February 2010; accepted 24 March 2010.
“Video pollution on the Web” by Fabrício Benevenuto, Tiago Rodrigues, Virgílio Almeida, Jussara Almeida, Marcos Gonçalves, and Keith Ross is licensed under a Creative Commons Attribution–Noncommercial–Share Alike 3.0 Unported License.
Video pollution on the Web
Fabrício Benevenuto, Tiago Rodrigues, Virgílio Almeida, Jussara Almeida, Marcos Gonçalves, and Keith Ross.
First Monday, Volume 15, Number 4 - 5 April 2010