Reuse, temporal dynamics, interest sharing, and collaboration in social tagging systems
First Monday

by Elizeu Santos-Neto, David Condon, Nazareno Andrade, Adriana Iamnitchi, and Matei Ripeanu



Abstract
User–generated content shapes the dynamics of the World Wide Web. In particular, collaborative tagging represents a simple, yet powerful, feature that enables users to share and collaboratively annotate content such as photos and URLs. This collaborative behavior and the pool of user–generated metadata create opportunities to improve existing systems and to design new mechanisms. However, to realize this potential, it is necessary to first understand the usage characteristics of current systems.

This work addresses this issue by characterizing three tagging systems (CiteULike, Connotea and del.icio.us) while focusing on three aspects: i) the patterns of information (tags and items) production; ii) the temporal dynamics of users’ tag vocabularies; and, iii) the social aspects of tagging systems. The analysis of the patterns of information production shows that users publish new content more often than they annotate content that already exists in the system. The opposite, however, occurs for tags; the level of tag reuse is much higher. The study of the temporal dynamics of user vocabularies shows that the growth rate of tag vocabularies across the user population decreases at early user ages, stabilizes, and then increases again for older users. Moreover, a closer look into the change of vocabulary contents over time shows that, although tag vocabularies grow slowly in size with user age, the relative frequency with which each tag is used converges relatively quickly in a user’s lifetime. Finally, the characterization of social aspects of tagging unveils the relationship between the implicit user ties, as inferred from the similarity between users’ activity, and their explicit social ties, as represented by co–membership in discussion groups or semantic similarity between tag vocabularies.

Contents

1. Introduction
2. Background and related work
3. Data collection and notation
4. Tag reuse and item re–tagging
5. Temporal dynamics of users’ tag vocabularies
6. Interest sharing
7. Shared interest and indicators of collaboration
8. Conclusions

 


 

1. Introduction

Tagging systems (Mathes, 2004; Hammond, et al., 2005; Marlow, et al., 2006; Macgregor and McCulloch, 2006; Farooq, et al., 2007) are a ubiquitous manifestation of online peer production of information (Benkler, 2006), a production mode commonplace in today’s World Wide Web (Ramakrishnan and Tomkins, 2007). The annotation feature, often referred to simply as tagging, was originally designed to support personal content management. However, as this feature exposes user preferences and their temporal dynamics, similarities between users, and the aggregated characteristics of the user population, annotations have been recognized for their potential to support a wider range of mechanisms such as social search (Yahia, et al., 2008), recommendation (Sigurbjörnsson and van Zwol, 2008), and search optimization (Yanbe, et al., 2007; Heymann, et al., 2008; Huang, et al., 2008).

Moreover, tagging is increasingly important in online social systems and, more recently, motivates new initiatives such as OpenAnnotation [1] that aims to enable users to annotate content on the Web without depending on specific systems. Therefore, understanding social tagging through characterization and modeling of usage patterns is important, as understanding the current systems can better inform the design of future annotation platforms such as Hypothes.is [2]. Finally, characterizing social tagging systems can both unveil new opportunities and improve existing mechanisms.

This work addresses this need for characterization by investigating unexplored aspects of social tagging behavior as well as complementing previous characterization studies (presented in Section 2). In particular, it focuses on three major aspects of the tagging activity that have attracted relatively little attention in the past: i) the dynamics of tags and items produced via collaborative annotation; ii) the temporal dynamics of users’ tag vocabularies; and, iii) the characteristics of the social ties between users in these systems. Compared to past characterization studies, this study goes one step further by offering observations across multiple social tagging systems, which allows for a richer analysis of tagging behavior.

To study the production of tags and items, Section 4 concentrates on two metrics: i) item re–tagging, a measure of the degree to which items are repeatedly tagged; and, ii) tag reuse, a measure of the degree to which users reuse a tag to perform new annotations.

The analysis of users’ tag vocabularies (i.e., the set of tags a user assigns to her items) in Section 5 focuses on how these vocabularies evolve over time, both in size and in the relative usage frequency of their tags.

The investigation of social ties between pairs of users focuses first on unveiling the characteristics of the implicit ties between users based on the similarity between their tagging activities (Section 6). Additionally, this work explores the relationship between the strength of such implicit ties and that of more explicit social ties such as co–membership in discussion groups and semantic similarity of tag vocabularies (Section 7). Studying the relationship between implicit and explicit ties is relevant because it tests whether the implicit ties based on usage similarity provide information about the potential creation of explicit social ties and, ultimately, about collaboration.

This study uses activity traces from three distinct tagging systems: CiteULike, Connotea, and del.icio.us (detailed in Section 3). We believe that this selection of systems samples the diversity of the tagging ecosystem, as they are emblematic tagging systems for different types of content, with CiteULike and Connotea concentrating on bookmarking academic citations, and del.icio.us focusing on general URLs. The in–depth analysis of these three systems reveals regularities as well as relevant variations in tagging behavior.

The main findings of this work are:

  • The characteristics of peer production of information are qualitatively similar across systems but differ quantitatively, as suggested by the observed rates of item re–tagging and tag reuse. In all three systems investigated, users produce new items at a higher rate than they produce new tags. However, the observed rates in CiteULike and Connotea differ from those in del.icio.us. As the three systems provide essentially similar annotation features, these findings suggest that the target audience and the type of annotated content play an important role in users’ tagging behavior (Section 4).

  • User tag vocabularies are constantly growing, but at different rates depending on the age of the user. However, despite the constant increase in size, the relative usage frequency of tags in a vocabulary converges to a stable ranking at early stages of a user’s lifetime in the system. This observation has implications for applications that rely on tag–vocabulary similarity (e.g., recommender systems): these applications can use only a subsample of the entire user activity to estimate vocabulary similarity between users. Moreover, applications can aim to strike a balance between the accuracy of similarity estimates, the data volume used for estimation, and the freshness of the data (Section 5).

  • The observed levels of activity similarity between pairs of users are the result of shared interest rather than chance. The distributions of activity similarity strength deviate significantly from those produced by a Random Null Model (RNM) (Reichardt and Bornholdt, 2008). This suggests that the implicit ties between users, as defined by their activity similarity levels, capture latent information about user relationships that may offer support for optimizing system mechanisms (Section 6).

  • The implicit social ties are related to explicit indicators of collaboration. We show that user pairs that share interests over items (i.e., annotate the same items) have higher similarity regarding the groups in which they participate together and higher semantic similarity of their tag vocabularies (even after eliminating the portion of tagging activity that is related to the items they tag in common) (Section 7).

These characteristics have practical implications for the design of mechanisms that rely on implicit user interactions such as collaborative search (Evans and Chi, 2008; Santos–Neto, et al., 2007; Yahia, et al., 2008), spam detection (Koutrika, et al., 2008; Neubauer, et al., 2009), and recommendation (Jäschke, et al., 2007; Sigurbjörnsson and van Zwol, 2008; Song, et al., 2008) as outlined in Section 8.

 

++++++++++

2. Background and related work

This section contextualizes this work along four main topics: i) general characterization studies of peer production of information in tagging systems; ii) characterization of the evolution of tag vocabularies; iii) graph–based approaches to study activity similarity among users; and, iv) design of tag–based support mechanisms.

2.1. General characterization studies

Previous characterization studies focusing on tagging systems vary along three main aspects: i) the system analyzed, from social bookmarking systems such as del.icio.us, CiteULike, and Bibsonomy to content sharing systems like Flickr and YouTube; ii) the focus of the characterization (system–, tag–, item– or user–centric analysis); and, iii) the method of investigation (qualitative or quantitative research methods).

Nevertheless these works share the same intent: they address the high level set of questions that relate to characterizing the usage patterns observed and gaining insight into the underlying processes that generate them. These works propose models that can be used to explain the observed characteristics of tagging activity such as the incentives behind tagging, the relative frequency of tags over time for a given item, the interval between tag assignments performed by users and the distributions of activity volume.

Hammond, et al. (2005) is perhaps the first work to study and discuss the characteristics of social tagging, its potential, and the incentives behind tagging itself. This research comments on the features provided by different social tagging systems and discusses preliminary reasons that incentivize users to annotate and share content online. Following on the question of incentives, Ames and Naaman (2007) examine tagging in online social media Web sites by interviewing 13 users on the fundamental question of why people tag. Based on user answers, the authors suggest that tagging serves to support content organization or to communicate aspects about the content. These actions can be either socially or personally driven. More recent studies have followed the analysis of incentives at a larger scale (Strohmaier, et al., 2012). Our study supports and, more importantly, extends these results by performing a large–scale user behavior analysis (covering more than 700,000 users) in three tagging systems. Although we do not focus on the question of incentives in particular, the quantitative analysis we present highlights and provides stronger evidence of existing incentives hypothesized by earlier research.

One of the first works on the quantitative characterization of tagging systems is an item–centric characterization of del.icio.us that proposes the Eggenberger–Polya’s (1923) urn model as an explanation for the observed relative frequencies of tags applied to an item (Golder and Huberman, 2006). Cattuto, et al. (2007) show in a tag–centric characterization that the observed tag co–occurrence patterns in del.icio.us are well modeled by the Yule–Simon’s stochastic process (Simon, 1955). Similarly, Capocci, et al. (2009) show that the tag interarrival time distribution follows a power law. Using a different approach, Chi and Mytkowicz (2008) study the impact of user population growth on the efficiency of tags to retrieve items in del.icio.us. More recent works focus on the impact of tagging on external applications such as information retrieval and expert–generated content (Gu, et al., 2011; Li, et al., 2011; Lu, et al., 2010; Seki, et al., 2010).

Another stream of characterization studies focuses on user–centric analysis. Nov, et al. (2008) present a user–centric qualitative study on the motivations behind content tagging in Flickr, where they suggest that users tag content due to a mixture of individual motivations, such as personal content organization, and social motivations, such as helping others find photos from a particular place. In a previous study, we characterize the user–centric properties of tagging activity from two social bookmarking systems designed for academic citation management: CiteULike and Bibsonomy. The observations suggest that user activity across the system follows the Hoerl model (Santos–Neto, et al., 2007). Finally, other works study in what scenarios (e.g., what type of search tasks) users resort to tags to find information as opposed to traditional keyword search (Sinclair and Cardew–Hall, 2008).

Our work complements and extends these previous studies as it investigates a combination of user–, item– and tag–centric characteristics. Moreover, it explores different aspects of tagging activity, such as the levels of item re–tagging and tag reuse over time and the relationship between implicit and explicit user ties in tagging systems. By applying a quantitative approach on a broad population of users and multiple tagging systems, this study also offers new insights on user behavior that complement previous qualitative research by Ames and Naaman (2007).

2.2. Evolution of users’ tag vocabularies

Tags represent to a certain extent the user perception or intended use of an item. It is natural, therefore, to assume that the set of tags (i.e., tag vocabulary) of a given user provides information about her topics of interest, which is useful to design other mechanisms that support efficient content usage such as recommender systems. Naturally, if tag vocabularies are stable over time, that is, if inclusion of new tags and shifts in the tag usage frequency observed in a vocabulary are rare, a mechanism can delay updates on the vocabulary snapshot used to base its predictions. Indeed, this study shows that this is the case (Section 5).

Previous studies on the characterization of the evolution of tag vocabulary can be divided in two categories: first, studies that aim to quantify and model the growth of tag vocabularies at both the system– and user–level (Cattuto, et al., 2007, 2009); and, second, studies that estimate shifts in the tag vocabularies over time such as evolution of the tag popularity distribution of item–level tag vocabularies (Halpin, et al., 2007), and the variation of tag usage frequency across predefined tag classes (Golder and Huberman, 2006) (i.e., factual tags, subjective tags and personal tags) (Sen, et al., 2006).

In summary, these previous studies show that: i) the system–level and user–level tag vocabulary growth is sublinear; ii) item–level tag popularity distribution converges to a power law; and, iii) the usage frequency of tag categories shifts over time.

This study extends previous works by evaluating different facets of the vocabulary evolution. First, this work goes beyond the estimation of vocabulary growth, focusing on the evolution of tag usage frequency. Second, it concentrates on individual, user–level tag vocabularies, as opposed to the item–level vocabularies as in the previous studies. Third, it uses a different methodology to estimate the difference between tag vocabularies from different points in time. Finally, we note that, unlike previous works, our approach does not make assumptions about the categories of tags that appear in the user tag vocabularies.

2.3. Interest sharing analysis

An alternative way to characterize tagging systems is a graph–centric approach. Two users are connected by a weighted edge with strength proportional to the similarity between the tagging activities of these two users. In this study, this similarity is referred to as an implicit social tie between users. Note that other types of connections between users are possible. In particular, we refer to explicit social ties as explicit indicators of user collaboration, such as co–membership in discussion groups.

This approach has been used by Iamnitchi, et al. (2011, 2004) to characterize scientific collaborations, the Web, and peer–to–peer networks. The same model has been used by Li, et al. (2011) to target the problem of finding users with similar interests in online social networking sites. The authors use a del.icio.us data set and define links between users based on the similarity of their tags. Their conclusions support the intuition that tags accurately represent the content by showing that tags assigned to a URL match to a great extent the keywords that summarize that URL. Additionally, they design and evaluate a system that clusters users based on similar interests and identifies topics of interests in a tagging community.

Another focus of graph–centric characterizations is to determine structural features in the graph formed by connecting users, items and tags based on similarity. Hotho, et al. (2006) model a collaborative tagging system as a tripartite network (the network connects users, items and tags in a hypergraph) and design a ranking algorithm to enable search in social tagging systems. Using the same tripartite network model, Cattuto, et al. (2007) examine Bibsonomy and show the existence of small–world patterns in such networks representing social tagging systems. Krause, et al. (2008) also explore the topology of a tagging system, but the one formed by item similarity, to compare the folksonomy inferred from search logs and tagging systems. Their results suggest that search keywords can be considered as tags to URLs. More recently, Kashoob and Caverlee (2012) characterize and model the temporal evolution of sub–communities in social tagging systems by looking into the similarity between users’ vocabularies.

Our study differs from these previous investigations in three aspects. First, the characterization of tagging activity similarity between users focuses on the system–wide concentration and intensity of pairwise similarities, as opposed to the topological characteristics. Second, our methodology provides a principled way to test whether the user similarity observed in social tagging systems is the product of interest sharing among users or of chance. Finally, we investigate possible correlations between the observed levels of activity similarity between users (i.e., the implicit social ties) and the external indicators of explicit collaboration (i.e., the explicit social ties), such as co–membership in discussion groups and semantic similarity of tag vocabularies (Sections 6 and 7). We note that our methodology is inspired by a previous work by Reichardt and Bornholdt (2008) that studies the patterns of similarity of product preferences among buyers and sellers on eBay.

2.4. System design

System characterization work is primarily motivated by its potential impact on system design. Thus, several studies propose to exploit characteristics of tagging systems to improve mechanisms such as recommendation (Jäschke, et al., 2007; Sigurbjörnsson and van Zwol, 2008; Song, et al., 2008), spam detection (Koutrika, et al., 2008; Krause, et al., 2008; Neubauer, et al., 2009; Noll, et al., 2009), top–k querying techniques (Schenkel, et al., 2008; Yahia, et al., 2008), and search and ranking (Hotho, et al., 2006; Yanbe, et al., 2007; Heymann, et al., 2008).

The present work adds to these studies by providing evidence that tagging activity can be useful to support such mechanisms. For example, the characteristics of vocabulary evolution, as presented in Section 5, can be used in the design of tagging systems in distributed platforms to adjust the frequency with which user profiles are updated across nodes/users.

 

++++++++++

3. Data collection and notation

We choose to analyze three tagging systems: CiteULike, Connotea and del.icio.us. The first two are designed to help users organize references to scientific publications, while the third is a social bookmarking tool for any type of URL.

The main reason to focus on these systems is their popularity. Additionally, studying systems that target different audiences enables a broader comparison between tagging systems that target a niche of Web users such as the scientific community (i.e., CiteULike and Connotea) and a system where any Web user is a potential client (i.e., del.icio.us). Furthermore, the characterizations of multiple classes of systems are complementary. Our intuition is that a study of more specialized tagging systems — in this case, for managing academic publications — may reveal social structures that are harder to identify in generic systems such as del.icio.us.

CiteULike, Connotea, and del.icio.us target different types of content and users, though all three systems can be described in terms of the same abstract entities. In these systems each user maintains a library: a collection of bookmarked items that, for the systems we study, are either citation records linked to online articles or URLs to generic Web pages. A user may assign tags to items in her library. Additionally, a user may also tag items in other users’ public libraries. Tags may serve to group items, as a form of categorization, or to help find items in the future (Golder and Huberman, 2006; Nov, et al., 2008). The tagging activity can be private (i.e., only the user who generated the tags and items can access these annotations) or public. The analysis presented in the next sections concentrates on the public portion of the activity. A user can see what (public) tags other users assigned to an item when she is tagging it; thus, the user is able to reinforce the choice of tags as appropriate by repeating the tags previously assigned to that item.

 

Table 1: Summary of data sets used in this study.

                             | CiteULike       | Connotea        | del.icio.us
Activity period              | 11/2004–01/2009 | 12/2004–01/2009 | 01/2003–12/2006
Number of users              | 40,327          | 34,742          | 659,470
Number of items              | 1,325,565       | 509,311         | 18,778,597
Number of tags (distinct)    | 274,982         | 209,759         | 2,370,234
Number of tag assignments    | 4,835,488       | 1,671,194       | 140,126,555

 

In the case of CiteULike and Connotea, an item can be added to a user’s library (an action often referred to as item posting) in three ways: i) browse popular scientific literature portals (e.g., ACM Portal, IEEE Xplore, arXiv.org) and use their features that automate item posting; ii) search for items already present in other users’ libraries and add them to her own library; and, iii) post a new item manually. In del.icio.us, users can use automatic bookmarking features or manually bookmark URLs.

Table 1 presents a summary of the data sets used in this investigation. The CiteULike and Connotea data sets consist of all tag assignments since the creation of each system in late 2004 until January 2009. The CiteULike data set is available directly from its Web site. For Connotea, we built a crawler that leverages Connotea’s API to collect tagging activity since December 2004 (no earlier activity was retrieved). Finally, the del.icio.us data set is available at the Web site of a previous study by Görlitz, et al. (2008) [3].

Note that we do not have access to browsing or click traces. The traces analyzed in this work contain records that indicate when items are annotated with a given tag and who was the user, but the traces do not inform whether a tag is subsequently used by a user to navigate through the system, for example. The data sets are ‘cleaned’ to reduce sources of noise, such as the default tag ‘no–tag’ in CiteULike, tags composed only of symbols and other tags like the automatically generated ‘bibtex–import’, which are clear outliers in the popularity distribution.

Notation. The rest of this paper uses the following notation to formally refer to the entities that comprise tagging systems. A tagging system is composed of a set of users, items and tags, respectively denoted by $U$, $I$ and $T$. The tagging activity in the system is a set of tuples $(u; i; w; t)$, where $u \in U$ is a user who tagged item $i \in I$ with tag $w \in T$ at time $t$. The activity of a user $u \in U$ can be characterized by $A_u$, $I_u$ and $T_u$, which are respectively the set of tag assignments performed by $u$, the set of items annotated, and the vocabulary or set of tags used by $u$. The user’s activity from the beginning of the trace up to a particular point in time is denoted by $A_u(t_0; t)$, $I_u(t_0; t)$ and $T_u(t_0; t)$, where $t_0$ and $t$ are timestamps, $t_0$ represents the beginning of the trace, and $t_0 \le t$.
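As an illustration of this notation, the following minimal Python sketch represents tag assignments as tuples and derives a user's assignments, items and vocabulary up to a given time. The names `Assignment` and `user_activity`, the toy trace, and the choice to treat the upper time bound as inclusive are assumptions made for this example; they are not taken from the data sets or tools used in the study.

```python
from typing import NamedTuple, Optional

class Assignment(NamedTuple):
    """One tag assignment (u; i; w; t). Field names are illustrative."""
    user: str
    item: str
    tag: str
    time: int  # timestamp; the unit (e.g. a day index) is an assumption

def user_activity(trace, user: str, t: Optional[int] = None):
    """Return (A_u, I_u, T_u) for `user`, optionally limited to time <= t.

    The inclusive upper bound is an assumption of this sketch.
    """
    a_u = [a for a in trace if a.user == user and (t is None or a.time <= t)]
    i_u = {a.item for a in a_u}  # items annotated by the user
    t_u = {a.tag for a in a_u}   # the user's tag vocabulary
    return a_u, i_u, t_u

# Hypothetical toy trace
trace = [
    Assignment("u1", "paper-1", "graphs", 0),
    Assignment("u1", "paper-1", "toread", 0),
    Assignment("u2", "paper-1", "networks", 1),
    Assignment("u1", "paper-2", "graphs", 2),
]

a_u, i_u, t_u = user_activity(trace, "u1", t=1)
_, all_items, all_tags = user_activity(trace, "u1")
```

In this toy trace, user `u1` has two assignments up to time 1, both on `paper-1`, and her vocabulary does not grow when she later tags `paper-2` with an already used tag.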

 

++++++++++

4. Tag reuse and item re–tagging

Let a new item (or tag) be an item (or tag) that has never been used in an annotation in the tagging system. If users introduce new items and tags frequently, efficiently harnessing information based on collective action is difficult, if not impossible. The reason is that future user actions involving an item or a tag are then hard to predict: prediction relies on the historical use of items and tags, and new items or tags have no history in the system. Understanding the degree to which items are repeatedly tagged and tags reused can therefore help estimate the potential efficiency of techniques that rely on similarity of past user activity (e.g., recommender systems). To this end, this section addresses the following questions:

Q1.1. What is the rate of repeated item annotation and tag reuse? (Section 4.1)

Q1.2. Is the flow of new incoming users a major factor in the observed low rates of repeated item annotation? (Section 4.2)

Q1.3. Are the reuse patterns we observe the result of different usage characteristics of a group of high–volume power users, or are they pervasive through the entire user population? (Section 4.3)

The rest of this section first formalizes the metrics item re–tagging and tag reuse used to address these questions. Second, it characterizes the levels of item re–tagging and tag reuse as well as the level of activity generated by returning users. Finally, it discusses the implications of the usage characteristics discovered.

 

Table 2: A summary of daily item re–tagging and tag reuse.

              | Re–tagged items      | Reused tags
              | Median | Std. Dev.   | Median | Std. Dev.
CiteULike     | 0.15   | 0.07        | 0.84   | 0.12
Connotea      | 0.07   | 0.06        | 0.77   | 0.21
del.icio.us   | 0.45   | 0.17        | 0.86   | 0.07

 

4.1. Levels of item re–tagging and tag reuse

An item is re–tagged (repeatedly tagged) if one or more users tag it again (with the same or different tags) after it was tagged for the first time. Similarly, a tag is reused if it appears in the trace more than once (for the same or different items) with different timestamps. We aim to determine which portion of the activity falls in these categories.

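Definition 1. The level of item re–tagging during a time interval $[t_{f-1}; t_f)$ is the ratio of the number of items tagged during that interval that have also been tagged in the past, i.e., during $[t_0; t_{f-1})$, to the total number of items tagged during the interval $[t_{f-1}; t_f)$, as expressed by Equation 1, where $I(t_a; t_b)$ denotes the set of items tagged during $[t_a; t_b)$. (Tag reuse is similarly defined.)

$$\mathrm{retag}(t_{f-1}; t_f) = \frac{\left| I(t_{f-1}; t_f) \cap I(t_0; t_{f-1}) \right|}{\left| I(t_{f-1}; t_f) \right|} \qquad (1)$$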

 

We use this definition to determine the aggregate level of item re–tagging and tag reuse in CiteULike, Connotea and del.icio.us. Table 2 presents the median daily item re–tagging and tag reuse over the entire traces (i.e., the time interval $[t_{f-1}; t_f)$ encompasses a day). The results show that CiteULike and Connotea have relatively low levels of item re–tagging while del.icio.us has a higher level of item re–tagging, yet all three systems present similarly high levels of tag reuse. We hypothesize that the observed difference in item re–tagging between del.icio.us and its counterparts, CiteULike and Connotea, is due to the type of content users bookmark in each system (with URLs of any type in the former, and academic literature in the latter).
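The daily computation can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes records of the form (user, item, tag, day), treats "the past" as all days strictly before the current one, and uses a hypothetical toy trace. The name `daily_levels` and the constants `ITEM` and `TAG` are introduced only for this example.

```python
from collections import defaultdict
from statistics import median

ITEM, TAG = 1, 2  # positions inside a (user, item, tag, day) record

def daily_levels(trace, field):
    """Map each day to the share of that day's distinct items (or tags)
    that already appeared on an earlier day."""
    by_day = defaultdict(set)
    for record in trace:
        by_day[record[3]].add(record[field])

    seen = set()  # everything observed on earlier days
    levels = {}
    for day in sorted(by_day):
        today = by_day[day]
        levels[day] = len(today & seen) / len(today)
        seen |= today
    return levels

# Hypothetical toy trace
trace = [
    ("u1", "a", "x", 0), ("u1", "b", "y", 0),
    ("u2", "a", "x", 1), ("u2", "c", "x", 1),
    ("u3", "c", "z", 2), ("u1", "d", "x", 2),
]

item_levels = daily_levels(trace, ITEM)
tag_levels = daily_levels(trace, TAG)
median_item_level = median(item_levels.values())
median_tag_level = median(tag_levels.values())
```

Under these assumptions the first day of a trace always has a level of zero, since nothing precedes it. How the study treats the first day is not stated here, so this is a property of the sketch only.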

To test whether these aggregate levels are a result of stable behavior over time, Figure 1 presents the moving average (with a window size of 30 days) of daily item re–tagging and tag reuse. Overall, these results show that all three systems go through a bootstrapping period, after which they stabilize.
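For reference, a moving average over a daily series can be sketched as below. The text specifies only the window size (30 days); the trailing alignment of the window and the handling of the first days, which are averaged over the values available so far, are assumptions of this sketch.

```python
def moving_average(values, window=30):
    """Trailing moving average.

    Position k averages the last `window` values up to and including k;
    positions with fewer than `window` predecessors average what is available.
    """
    smoothed = []
    running = 0.0
    for idx, value in enumerate(values):
        running += value
        if idx >= window:
            running -= values[idx - window]  # drop the value leaving the window
        smoothed.append(running / min(idx + 1, window))
    return smoothed

# Small example with a window of 2 instead of 30
example = moving_average([1, 2, 3, 4], window=2)
```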

 

Figure 1: Daily item re–tagging (left) and tag reuse (right). The curves are smoothed by a moving average with window size n=30.

 

On the one hand, from the perspective of personal content management, the observed levels of item re–tagging and tag reuse, together with the much larger number of items than tags in these systems, suggest that users exploit tags as an instrument to categorize items according to, for example, topics of interest or intent of usage (‘toread’, ‘towatch’). On the other hand, from the social (or collaborative) perspective, the relatively high level of tag reuse taken together with the low level of item re–tagging suggests that users may have common interest over some topics, but not necessarily over specific items. These quantitative results suggest that tags are used as discussed by Ames and Naaman (2007).

A question that arises from the above observations is whether the levels of item re–tagging and tag reuse are generated by the same user or by different users. We observe that virtually none of the item re–tagging events are produced by the user who originally introduced the item to the system: generally, users do not add new tags to describe the items they collected and annotated once.

As illustrated by Figure 2 (left), about 50 percent of tag reuse is self–reuse (i.e., the reuse of a tag by a user who has already used it). This level of tag self–reuse indicates that users will often tag multiple items with the same tag, a behavior consistent with the use of tagging for item categorization and personal content management, as discussed above. Additionally, the fact that half of the tag reuse is not self–reuse reinforces the notion that users do share tags, which indicates potentially similar interests. In Section 6, we further investigate this social aspect of tag reuse by defining and evaluating interest sharing among users, as implied by the similarity between users’ activity (i.e., tags and items).
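One way to separate self–reuse from reuse across users is sketched below. This is an illustration under stated assumptions rather than the procedure used in the study: a reuse event is taken to be any assignment whose tag appeared at a strictly earlier timestamp, and it counts as self–reuse when the same user applied that tag at a strictly earlier timestamp. The function name `self_reuse_share` and the toy trace are hypothetical.

```python
from itertools import groupby

def self_reuse_share(trace):
    """Fraction of tag reuse events that are self-reuse.

    `trace` holds (user, item, tag, time) records. Assignments sharing a
    timestamp are processed as one batch, so they do not count as reuse
    of one another.
    """
    seen_tags = set()       # tags used by anyone at an earlier time
    seen_user_tags = set()  # (user, tag) pairs used at an earlier time
    reuse = self_reuse = 0

    ordered = sorted(trace, key=lambda r: r[3])
    for _, group in groupby(ordered, key=lambda r: r[3]):
        batch = list(group)
        for user, _item, tag, _time in batch:
            if tag in seen_tags:
                reuse += 1
                if (user, tag) in seen_user_tags:
                    self_reuse += 1
        for user, _item, tag, _time in batch:
            seen_tags.add(tag)
            seen_user_tags.add((user, tag))

    return self_reuse / reuse if reuse else 0.0

# Hypothetical toy trace: three reuse events, two of them self-reuse
trace = [
    ("u1", "a", "x", 0),
    ("u1", "b", "x", 1),  # reuse by the same user
    ("u2", "c", "x", 2),  # reuse by a different user
    ("u2", "d", "y", 2),  # new tag
    ("u2", "e", "x", 3),  # reuse by the same user (u2 used x at time 2)
]
share = self_reuse_share(trace)
```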

 

Figure 2: Self–tag reuse (left) and daily activity generated by returning users (right). The curves are smoothed by a moving average with window size n=30.

 

4.2. New incoming users

To understand whether the observed low level of item re–tagging is due to a high rate of new users joining the community, we estimate the levels of activity generated by returning users (as opposed to new users that join the community). Figure 2 (right) shows that, after a short bootstrap period, the level of tagging activity generated by returning users remains stable at about 80 percent over the rest of the trace for both CiteULike and Connotea. In del.icio.us, the percentage of activity represented by returning users is even higher, with above 95 percent of daily activity performed by returning users.

Thus, the low levels of item re–tagging are the outcome of expanding interests of returning users, instead of a constant stream of new users joining the community and introducing new items.
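The share of daily activity generated by returning users can be estimated along the following lines. This sketch assumes that a returning user is one with at least one tag assignment on an earlier day and that activity is counted in tag assignments; both choices, as well as the name `returning_share_by_day` and the toy trace, are assumptions made for illustration.

```python
from collections import defaultdict

def returning_share_by_day(trace):
    """Map each day to the fraction of that day's tag assignments made by
    users who were already active on an earlier day."""
    by_day = defaultdict(list)
    for user, _item, _tag, day in trace:
        by_day[day].append(user)

    known = set()  # users active on earlier days
    shares = {}
    for day in sorted(by_day):
        users = by_day[day]
        shares[day] = sum(u in known for u in users) / len(users)
        known.update(users)
    return shares

# Hypothetical toy trace
trace = [
    ("u1", "a", "x", 0), ("u1", "b", "y", 0),
    ("u1", "c", "x", 1), ("u2", "a", "z", 1),
    ("u2", "d", "z", 2), ("u1", "d", "x", 2),
    ("u3", "e", "x", 2), ("u1", "e", "y", 2),
]
shares = returning_share_by_day(trace)
```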

 

Table 3: The statistical test results reject the hypothesis that the item re–tagging and tag reuse observations with and without the power users are equal.
System | Re–tagged items: D–statistic | Re–tagged items: p–value < | Reused tags: D–statistic | Reused tags: p–value <
CiteULike | 0.03516 | 2.2 x 10^-16 | 0.2858 | 2.2 x 10^-16
Connotea | 0.1889 | 2.2 x 10^-16 | 0.2132 | 2.2 x 10^-16
del.icio.us | 0.0475 | 0.076 | 0.137 | 3.23 x 10^-16

 

4.3. The influence of power users

Finally, we investigate the influence of highly active users on the observed item re–tagging and tag reuse levels. To this end, we compare the observed item re–tagging and tag reuse with and without the activity produced by such power users. In this experiment, we define power users as the top one percent most active users according to the number of annotations produced, and calculate item re–tagging and tag reuse as before.
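As a rough sketch of this filtering step, the snippet below selects the top one percent of users by annotation count and removes their activity from a trace. The (user, item, tag) layout, the function names, and the choice to round the cutoff up are assumptions made for illustration.

```python
from collections import Counter
from math import ceil

def power_users(annotations, fraction=0.01):
    """annotations: iterable of (user, item, tag) triples.

    Returns the set of the most active users, ranked by number of
    annotations. Rounding the cutoff up is an assumption; ties at the
    cutoff are broken by first appearance in the trace.
    """
    counts = Counter(user for user, _item, _tag in annotations)
    k = ceil(fraction * len(counts))
    return {user for user, _count in counts.most_common(k)}

def without_power_users(annotations, fraction=0.01):
    """Return the trace with all annotations by power users removed."""
    annotations = list(annotations)
    top = power_users(annotations, fraction)
    return [a for a in annotations if a[0] not in top]
```

Item re–tagging and tag reuse can then be computed on both the full and the filtered trace and compared.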

The experiments test the hypothesis that the levels of item re–tagging and tag reuse are the same with and without the activity produced by these power users. To this end, we apply the Kolmogorov–Smirnov test (KS–test) on the two samples of activity (i.e., with and without the power users) with the null hypothesis that the item re–tagging and tag reuse observed in the two samples come from the same distribution (i.e., H0 = the item re–tagging and tag reuse levels are equally distributed with and without the power users). Using the KS–test is appropriate as it does not require that the samples are drawn from a normal distribution.

At a confidence level of 99 percent (α=0.01; p=1-α), we can reject the null hypothesis for all the systems, except for the item re–tagging levels in del.icio.us (see the p–values in Table 3). This means that removing the activity produced by the power users leads to statistically different levels of item re–tagging and tag reuse, as indicated by the D–statistic in Table 3 (i.e., the maximum difference between the two distributions).
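To make the D–statistic concrete, the following sketch computes the two–sample KS statistic as the largest gap between the two empirical distribution functions, together with an approximate p–value from the asymptotic Kolmogorov series. It uses only the standard library; the function name is hypothetical, the p–value is an approximation that is only reasonable for large samples, and the tooling used in the original study is not stated in the text.

```python
from bisect import bisect_right
from math import exp, sqrt

def ks_two_sample(sample_a, sample_b):
    """Return (D, p): the two-sample KS statistic and an asymptotic p-value."""
    xs, ys = sorted(sample_a), sorted(sample_b)
    n, m = len(xs), len(ys)

    # The empirical CDFs only change at observed values, so it is
    # enough to compare them at every value seen in either sample.
    d = 0.0
    for v in set(xs) | set(ys):
        d = max(d, abs(bisect_right(xs, v) / n - bisect_right(ys, v) / m))

    # Asymptotic Kolmogorov tail: 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 lam^2).
    lam = sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        return d, 1.0  # the tail probability is indistinguishable from 1 here
    p = 2.0 * sum((-1) ** (k - 1) * exp(-2.0 * k * k * lam * lam)
                  for k in range(1, 101))
    return d, min(1.0, max(0.0, p))
```

The null hypothesis is rejected at α=0.01 when the returned p–value is below 0.01.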

4.4. Summary and implications

The observed user behavior impacts the efficiency of systems that rely on the inferred similarity among items, such as recommender systems. On the one hand, the relatively low level of item re–tagging suggests a highly sparse data set (i.e., attempting to connect users based on similar items will connect only a few user pairs). A sparse data set poses challenges when designing recommender systems as they typically rely on the similarity of users based on their past activity to make recommendations.

On the other hand, the higher level of tag reuse confirms that analyzing tags has the potential to circumvent, or at least alleviate, the sparsity problem described above. The tags and users that relate to each item could not only serve to link items and build an item–to–item structure, but could also potentially provide semantic information about items. This information may help, for instance, to design better bibliography and citation management tools for the research community.

The results on the impact of power users on the observed levels of item re–tagging and tag reuse support two ideas. First, some users are instrumental in reducing the sparsity of tagging data sets (i.e., without power users, tags and items would be reused less, and therefore potentially fewer items would be connected through tags and users). In fact, recommender systems benefit directly from the activity produced by such power users, as they can connect more items via repeated tag usage. Second, the role of power users differs from system to system, potentially due to effects of population size and diversity of interests. In the largest and most diverse system we consider, reuse is a result of the activity of less active users rather than only power users.

Finally, despite the sparse data set problem, the fact that users tend to continually add fresh content, as indicated by the low level of item re–tagging, implies that the approach proposed by Yanbe, et al. (2007) would be useful in a search portal for academic content. They suggest that content updated often in tagging systems can be used to improve the freshness and relevance of search results produced by a search engine. Portals for academic publications, such as Google Scholar, could exploit this fact to improve the freshness and relevance of their search results by using a combination of the PageRank ranking algorithm (Brin and Page, 1998) and annotations from systems like CiteULike, Connotea and del.icio.us.

 

++++++++++

5. Temporal dynamics of users’ tag vocabularies

The item re–tagging and tag reuse analysis presented in the previous section shows that users constantly produce new information, by adding both new items to their libraries and tags to their vocabularies, though at different rates.

Although user tag vocabularies are constantly growing, it is unclear whether the growth rate is uniform over time. More importantly, vocabulary growth may or may not imply changes in the relative tag usage frequency for a given user, changes that can indicate shifts in user interests over time.

To better understand these aspects, the objective of this section is to answer the following question:

Q2. How do users’ vocabularies change over time?

To this end, this section quantifies the evolution of user tag vocabularies by considering both vocabulary growth and the evolution of tag usage frequency. We note that this investigation is different from, but complements, previous work (Kashoob and Caverlee, 2012; Cattuto, et al., 2007; Halpin, et al., 2007; Sen, et al., 2006): first, it performs a user–centric vocabulary analysis as opposed to a system–centric characterization; second, it studies both growth and change in vocabulary content in contrast to only one of the dimensions; and, finally, our characterization concentrates on the entire user population, as opposed to sub–communities of interests (as indicated by tags) or the evolution of such communities. Yet, the methodology we introduce in this study can be applied in other contexts.

5.1. Methodology

We introduce time in the definition of a user vocabulary by defining the tag vocabulary of a user Tu(s; f) as the set of tags used within the interval [s; f]. A particular case is Tu(1; n), where 1 and n indicate the timestamps of the first and the last observed tagging assignment by user u, respectively. Thus Tu(1; n)=Tu and represents the user’s entire vocabulary.
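The time–windowed vocabulary defined above can be expressed in a few lines. This is a minimal sketch that assumes each user's trace is a list of (timestamp, tag) pairs; the function name is illustrative.

```python
def vocabulary(assignments, s, f):
    """Tu(s; f): the set of tags a user applied with s <= timestamp <= f.

    assignments: iterable of (timestamp, tag) pairs for a single user
    (an assumed input layout).
    """
    return {tag for ts, tag in assignments if s <= ts <= f}
```

Calling it with the first and last timestamps of the user's trace yields the entire vocabulary Tu.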

Vocabulary growth. To analyze the vocabulary growth, we track the distribution of growth rates across the user population for the duration of the traces. The goal is to understand whether the growth rate changes according to the user age. Therefore, we measure the following ratio:

 

Equation 2

 

where k ∈ [1; n] for all users in the system (i.e., 1 and n represent the timestamp of the first and last tag assignments of a particular user, respectively).

Vocabulary change. To measure the rate of change in the content of the vocabularies, we consider vocabularies as sets of tags ordered in decreasing order of usage frequency (i.e., number of times the tag was used to annotate any item) and apply a distance metric. We use the final tag vocabulary, Tu(1; n), as the reference point to study tag vocabulary evolution as, according to tag reuse results presented in Section 4, user tag vocabularies are constantly growing. Therefore, it is unlikely that splitting the activity trace into disjoint windows could help identify meaningful evolution patterns. Instead, we trace the evolution of a user’s tag vocabulary by comparing the distance of incremental snapshots to her final vocabulary. This way, it is possible to understand the rate of convergence of user vocabularies over time. The experiment consists of calculating the distance from the tag vocabularies Tu(1; k) (k ∈ [2; n]), to the reference (final) tag vocabulary Tu(1; n).

A traditional metric to calculate the distance between two lists of ordered elements is the Kendall’s τ distance (Kendall, 1938), which considers the number of pairwise swaps of adjacent elements necessary to make the lists similarly ordered. However, Kendall’s τ distance assumes that both lists are composed of the same elements. Since we are interested in the evolution of tag vocabularies over time, this assumption is not valid in our case: tag vocabularies are likely to contain different tags at different times due to the constant inclusion of new tags.

Therefore, we apply the generalized Kendall’s τ distance, as defined by Fagin, et al. (2003), which relaxes the restriction mentioned above and accounts for elements that are present in one permutation, but are missing in the other. Similar to the original Kendall’s τ distance, the generalized version of the metric counts the number of pairwise swaps of items necessary to make the lists similarly ordered. Additionally, the generalized version counts the absence of items via a parameter p. This parameter can be set between 0 and 1, which allows various levels of certainty about the order of absent items. For example, in the case that two items are missing from one list, but present in the other, setting p=0 indicates that there is not enough information to decide whether the two items are in the same order or not. Conversely, setting p=1 indicates that there is full information available to consider the absence as an increase in the distance between the lists. In the experiments that follow we use p=1.
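The sketch below illustrates one way to compute this distance, following our reading of the pairwise penalty cases in Fagin, et al. (2003): a pair ranked in both lists costs 1 if its order is reversed; a pair where the second list omits one element costs 1 only if the omitted element is ranked ahead in the first list; a pair split across the two lists costs 1; and a pair that appears only in one list costs p. The function names and the alphabetical tie–breaking rule are assumptions for illustration, not the authors' implementation.

```python
from collections import Counter
from itertools import combinations

def ranked_vocabulary(tags):
    """Distinct tags in decreasing order of usage frequency.
    Ties are broken alphabetically (an assumption)."""
    counts = Counter(tags)
    return sorted(counts, key=lambda t: (-counts[t], t))

def generalized_kendall_tau(list1, list2, p=1.0):
    """Generalized Kendall's tau distance between two ranked lists."""
    pos1 = {t: r for r, t in enumerate(list1)}
    pos2 = {t: r for r, t in enumerate(list2)}
    dist = 0.0
    for i, j in combinations(sorted(set(pos1) | set(pos2)), 2):
        in1 = (i in pos1, j in pos1)
        in2 = (i in pos2, j in pos2)
        if all(in1) and all(in2):
            # Both ranked in both lists: penalize opposite relative order.
            if (pos1[i] - pos1[j]) * (pos2[i] - pos2[j]) < 0:
                dist += 1
        elif all(in1) and any(in2):
            # Both in list1, one in list2: the one present in list2 is
            # implicitly ahead there, so penalize if list1 ranks it behind.
            present, absent = (i, j) if in2[0] else (j, i)
            if pos1[present] > pos1[absent]:
                dist += 1
        elif all(in2) and any(in1):
            present, absent = (i, j) if in1[0] else (j, i)
            if pos2[present] > pos2[absent]:
                dist += 1
        elif all(in1) or all(in2):
            # Both in one list, neither in the other: order unknown, cost p.
            dist += p
        else:
            # One element only in list1, the other only in list2.
            dist += 1
    return dist
```

To trace convergence for one user, the distance would be evaluated between the ranked vocabulary of each prefix Tu(1; k) and that of the final vocabulary Tu(1; n).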

5.2. Results and implications

Our analysis filters out users that had negligible activity, considering only users with at least 10 annotations. This sample is responsible for approximately 93 percent, 61 percent, and 90 percent of the total system activity in terms of tag assignments in CiteULike, Connotea, and del.icio.us, respectively.

Vocabulary growth rate. Figure 3 illustrates vocabulary growth rate across the user population in the three systems studied. The x–axis indicates categories of users according to their age (i.e., number of days since their first recorded tag assignment), while the y–axis indicates the growth rate relative to each user vocabulary. For each of the systems studied we present two plots: labeled ‘median’ and ‘90th percentile’. A point in the median plot indicates that 50 percent of the user vocabularies with a given age (as specified by x) have a growth rate lower than or equal to the value in the y–axis. Similarly, a point in the 90th percentile plot indicates that 90 percent of the user vocabularies with a given age as indicated by the x–axis have a growth rate lower than or equal to the corresponding value in the y–axis.

The results show that, for the duration of the traces analyzed, the median growth rate (the ‘median’ plots in Figure 3) is relatively larger for older users. On the other hand, for the 90th percentile growth rate (the ‘90th percentile’ plots in Figure 3), except for the very young users, the rate is roughly the same for all age groups, with a slightly smaller rate for users in the middle of the age spectrum. An important observation is that, except for the growth rate of young vocabularies, the 90th percentile reaches a maximum rate of 0.1. This means that, for 90 percent of users, the vocabulary growth rate is at most 10 percent.

Vocabulary change. Figure 4 changes the focus from growth rate to the rate of change in users’ vocabularies. The figure presents the rate of change in the contents of user vocabularies by taking into account the frequency of tags and calculating the distance between vocabulary snapshots. The results show that the distance from the vocabulary at earlier ages to its final state (i.e., Kendall–tau distance τ(Tu(1; k); Tu(1; n)), where k ∈ [2; n]) decreases rapidly in the first 100 days for 50 percent of users.

 

 
Figure 3: The vocabulary growth pattern in the systems studied: CiteULike (left), Connotea (center), and del.icio.us (right).

 

 

 
Figure 4: Rate of change in the tag usage frequency in the user vocabularies: CiteULike (left), Connotea (center), and del.icio.us (right).

 

 

++++++++++

6. Interest sharing

The analysis of item re–tagging and tag reuse in Section 4 suggests that the observed level of re–tagging is the result of different users annotating the same item they are interested in. We dub this similarity in item–related activity item–based interest sharing. Similarly, we dub the similarity in tag–related activity tag–based interest sharing. This section defines and characterizes pairwise interest sharing between users as implied by their annotation activity in CiteULike, Connotea and del.icio.us.

Analyzing interest sharing is relevant for information retrieval mechanisms such as search engines tailored for tagging systems (Yahia, et al., 2008; Zhou, et al., 2008), which can exploit pairwise user similarity to estimate the relevance of query results. This section focuses in particular on characterizing interest sharing distributions across the user–pairs in the system and addresses the following question:

Q3. How is interest sharing distributed across the pairs of users in the system?

However, this section goes one step further and studies the system–wide characteristics of interest sharing and the implicit social structure that can be inferred from it. Moreover, the next section investigates the relationship between interest sharing (as inferred from activity similarity) and explicit indicators of collaboration such as co–membership in discussion groups and semantic similarity between tag vocabularies (Section 7).

6.1. Quantifying activity similarity

We use the Asymmetric Jaccard Similarity Index (Jaccard, 1912) to quantify similarity between the item (or tag) sets of two users. We note that previous work (including ours) has used the Jaccard Index to quantify interest sharing: Stoyanovich, et al. (2008) used this index to model shared user interest in del.icio.us and to evaluate its efficiency in predicting future user behavior. Chi, et al. (2007) applied the symmetric index to determine the diversity of users and its impact in a social search setting. Our analysis considerably extends this past work.

The formal definition of the item–based interest sharing metric is:

Definition 2. The level of item–based interest sharing between two users, k and j, as perceived by k, is the ratio between the size of the intersection of the two item sets and the size of the item set of user k. (Tag–based interest sharing is defined similarly and denoted by wT.)

 

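Equation 3: $w_I(k; j) = \frac{|I_k \cap I_j|}{|I_k|}$, where $I_k$ and $I_j$ denote the item sets of users k and j.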

 

Equation 3 captures how much the interests of a user uk match those of another user uj, from the perspective of uk. We opt for the asymmetric similarity index rather than the symmetric version (which uses the size of the union of the two sets as the denominator in Equation 3) to account for the observation that the distribution of item set sizes in our data is heavily skewed. As a result, the situation where a user has a small item set contained in another user’s much larger item set happens often. In such cases, the symmetric index would indicate little similarity between interests, while the asymmetric index accurately reflects that, from the standpoint of the user with the smaller item set, there is a large overlap of interests. From the perspective of the user with the large item set, however, only a small part of his interests intersect with those of the other user.
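A small sketch makes the contrast between the two indices concrete. It follows Definition 2 for the asymmetric index and uses the union as the denominator for the symmetric one; the function names and the convention of returning 0 for empty sets are assumptions.

```python
def asymmetric_similarity(own, other):
    """Interest sharing as perceived by the owner of `own`:
    |own ∩ other| / |own| (0.0 for an empty set, by assumption)."""
    own, other = set(own), set(other)
    if not own:
        return 0.0
    return len(own & other) / len(own)

def symmetric_jaccard(a, b):
    """Classic Jaccard index: |a ∩ b| / |a ∪ b|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# A user with 2 items, both also held by a user with 10 items.
small = {"p1", "p2"}
large = {"p%d" % n for n in range(1, 11)}
print(asymmetric_similarity(small, large))  # 1.0: full overlap for the small user
print(asymmetric_similarity(large, small))  # 0.2: small overlap for the large user
print(symmetric_jaccard(small, large))      # 0.2: hides the small user's view
```

The same functions apply to tag sets to obtain the tag–based variant wT.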

 

 
Figure 5: Distributions for item– and tag–based interest sharing (for pairs of users with non–zero sharing) in CiteULike, Connotea and del.icio.us.

 

6.2. How is interest sharing distributed across the system?

This section presents the distribution of pairwise interest sharing in CiteULike, Connotea and del.icio.us. We first find that approximately 99.9 percent of user pairs in CiteULike and del.icio.us share no interest over items (i.e., wI (k; j)=0). In Connotea, the percentage is virtually the same: 99.8 percent. For the tag–based interest sharing, the percentage of user pairs with no tag–based shared interest (i.e., wT (k; j)=0) is slightly lower: 83.8 percent, 95.8 percent and 99.7 percent for CiteULike, Connotea and del.icio.us, respectively. Such sparsity in the pairwise user similarity supports the conjecture that users are drawn to tagging systems primarily by their personal content management needs, as opposed to the desire of collaborating with others.

The rest of this section focuses on the remaining user pairs, that is, those user pairs that have shared interest either over items or tags. To characterize these user pairs, we plot the cumulative distribution function (CDF) of item– and tag–based interest sharing for these sets of user pairs in all three systems (Figure 5).

Figure 5 shows that, in all three systems, the typical intensity of tag–based interest sharing is higher than its item–based counterpart. This is not surprising: after all, all three systems include two to three times more items than tags. However, there is a qualitative difference across systems with respect to the concentration of item–based and tag–based interest sharing levels, with del.icio.us showing a much wider gap between the distributions.

The difference between the levels of item– and tag–based interest sharing suggests the existence of latent organization among users as reflected by their fields of interest. We hypothesize that this observation is due to a large number of user pairs that have similar tag vocabularies regarding high–level topics (e.g., computer networks), but have diverging interests in specific sub–topics (e.g., Internet routing versus firewall traversal techniques), which could explain the relatively lower item–based interest sharing compared to the observed tag–based interest sharing.

Finally, to provide a better perspective on the tag–based interest sharing levels, we compare the observed values to those of controlled studies on the vocabulary of users describing computer commands (Furnas, et al., 1987). The tag–based interest sharing level, as observed in Figure 5, is approximately 0.2 (or less) for 80 percent of the user pairs that have some interest sharing, while Furnas, et al. (1987) show that in an experiment where participants are instructed to provide a word to name a command based on its description such that it is an intuitive name and more likely to be understood by other people, the ratio of agreement between two participants is in the interval [0.1, 0.2] (i.e., number of times two participants use the same word divided by the total number of participant pairs).

These observations suggest that observed tag–based interest sharing is due to conscious choice of terms from vocabularies that are shared among users, rather than by chance. We look more closely into this aspect in the next section by constructing a baseline to compare the observed interest sharing levels.

6.3. Comparing to a baseline

The goal of this section is to better understand the interest sharing levels we observe. In particular, we focus on the following high–level question:

Q4. Do the interest sharing distributions we observe differ significantly from those produced by random tagging behavior?

For this investigation, we compare the observed interest sharing distribution to that obtained in a system with users that have an identical volume of activity and the same user–level popularity distributions for items or tags, but where users do not act according to their personal interests. Instead, in the random null model (RNM) (Reichardt and Bornholdt, 2008), the chance that a user is interested in an item or tag is simply that item or tag’s popularity in the user’s vocabulary.

The reason to perform this experiment is the following: our aim is to validate how well the interest sharing metric distills useful user behavior information. If the interest–sharing levels we observe in the three real systems at hand are more concentrated than those generated by the RNM, then the interest sharing metric captures relevant information about similarity of user preferences, rather than simply coincidence in the tagging activity.

To reiterate, the random null model (RNM) is produced by emulating a tagging system activity that preserves the main macro–characteristics of the real systems we explore (such as the number of items, tags, and users, as well as item and tag popularity, and user activity distributions), but where users make random tag assignments. As such, random assignments are used here as the opposite of interest–driven assignments.
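One possible way to generate such a synthetic trace is sketched below. The exact procedure used in the study is not spelled out here, so the sketch makes explicit assumptions: each user keeps the same number of distinct items as in the real trace, and items are drawn at random with probability proportional to their system–wide popularity. The function name and the input layout are hypothetical; the same approach would apply to tags.

```python
import random
from collections import Counter

def random_null_model(user_items, seed=0):
    """user_items: {user: set of items} from the real trace.

    Returns a synthetic {user: set of items} in which every user keeps
    the same number of distinct items, drawn with probability
    proportional to global item popularity (assumed procedure).
    """
    rng = random.Random(seed)
    popularity = Counter(i for items in user_items.values() for i in items)
    catalog = sorted(popularity)                 # fixed order for reproducibility
    weights = [popularity[i] for i in catalog]

    synthetic = {}
    for user, items in user_items.items():
        target = len(items)
        drawn = set()
        # Redraw until the user holds `target` distinct items; duplicates
        # produced by weighted sampling are simply discarded.
        while len(drawn) < target:
            drawn.update(rng.choices(catalog, weights=weights,
                                     k=target - len(drawn)))
        synthetic[user] = drawn
    return synthetic
```

Running the generator with several different seeds yields independent synthetic traces whose interest sharing levels can be averaged.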

To test our hypothesis, we compare the two sets of data (real and RNM–generated) in terms of the numbers of user pairs with non–zero interest sharing and the interest–sharing intensity distribution. Because of its probabilistic nature, we use the RNM to generate five synthetic traces corresponding to each of the real systems we analyze. For the rest of this section, the RNM results represent averages over the five RNM traces for each system. We confirmed that the five synthetic traces represent a large enough sample to guarantee a narrow 95 percent confidence interval for the average interest sharing observed from the RNM simulations.

Our data analysis shows that the observed interest sharing deviates significantly from that generated by random behavior in two important respects.

First, interest sharing (and, consequently, the similarity between users) is more concentrated in the real systems than in the corresponding simulated RNM. More specifically, the number of user pairs that share some item–based interest (i.e., wI (k; j)>0) is approximately three times smaller in the real systems than in the RNM–generated ones. Tag–based interest sharing follows a similar trend.

Second, the interest sharing distribution deviates significantly from that produced by an RNM. We compare the cumulative distribution function (CDF) for the interest sharing intensity for the user–pairs that have some shared interest (i.e., wI (k; j)>0). Figure 6 presents the Q–Q plots that directly compare the quantiles of the distributions of interest–sharing levels derived from the actual trace and those derived from the simulated RNM. A deviation from the diagonal indicates a difference between these distributions: the higher the points are above the diagonal, the larger the difference between the observed interest–sharing levels and those generated by the RNM.
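The points of such a Q–Q plot can be obtained by pairing matching quantiles of the two samples, as in the sketch below. Placing the RNM values on the x–axis and the observed values on the y–axis is an assumption, as are the function name and the number of quantiles.

```python
from statistics import quantiles

def qq_points(observed, simulated, n=100):
    """Pair the n-1 cut points of the simulated (x) and observed (y)
    samples; each sample needs at least two values."""
    q_obs = quantiles(observed, n=n, method="inclusive")
    q_sim = quantiles(simulated, n=n, method="inclusive")
    return list(zip(q_sim, q_obs))
```

Points with y greater than x lie above the diagonal, meaning that the observed interest sharing at that quantile exceeds the level produced by the random model.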

 

 
Figure 6: Q–Q plots that compare the interest sharing distributions for the observed vs. simulated (i.e., the RNM model) for CiteULike (left) and Connotea (right).

 

We note that the only interest–sharing distribution that is close to the one produced by the RNM is for Connotea’s tag–based interest sharing (Figure 6). However, there is still a significant deviation from randomness: the real activity trace leads to three times fewer user–pairs that share interest than the corresponding RNM.

6.4. Summary and implications

This section provides a metric to estimate pairwise interest sharing between users, offers a characterization of interest–sharing levels in CiteULike and Connotea, and investigates whether the observed interest sharing in these systems deviates from that produced by chance, given the amount of activity users had. The reference for chance is given by a random null model (RNM) that preserves the macro characteristics of the systems we investigate, but uses random tag assignments.

The comparison highlights two main characteristics of the interest sharing: first, interest sharing is significantly more concentrated in the real traces than in the RNM–generated activity: in quantitative terms, three times fewer user pairs share interests in the real traces. Second, most of the time, for the user pairs that have non–zero interest sharing, the observed interest–sharing intensity is significantly higher in each real system than in its RNM equivalent.

We conjecture that a possible explanation for these observations is as follows. Let us consider that the set of tags that can be assigned to an item is largely limited by the set of topics that item is related to. In this case, intuitively, the probability of choosing a tag is conditional on the set of topics the item is related to. At one extreme, the maximum diversity of topics occurs when there is a one–to–one mapping between topics and tags, that is, when each tag introduces a different topic. The RNM simulates the other extreme, a single topic that encompasses all tags in the system.

However, in real systems, the interests for each individual user are limited to a finite set of topics, which is likely to determine their tag vocabulary. This leads to a concentration of interest sharing, as implied by the tag similarity, on few user pairs, yet at higher intensity than that produced by the RNM.

Finally, and most importantly, the divergence between the observed and the RNM–generated interest sharing distributions shows that activity similarity, our metric to quantify interest sharing intensity, embeds information about user self–organization according to their preferences. This information, in turn, could be exploited by mechanisms that rely on implicit relationships between users. The next section seeks evidence about the existence of such information by analyzing the relationship between implicit user ties, as inferred from the similarity between users’ activity, and their explicit social ties, as represented by co–membership in discussion groups or semantic similarity between tag vocabularies.

 

++++++++++

7. Shared interest and indicators of collaboration

The previous section characterizes interest sharing across all user pairs in each system and suggests that it encodes information about user behavior, as its distribution deviates significantly from that produced by a random null model.

This section complements this characterization and evaluates whether the implicit user relationships that can be derived from high levels of interest sharing correlate with explicit online social behavior. More specifically, this section addresses the following question:

Q5. Are there correlations between interest sharing and explicit indicators of collaborative social behavior?

Before starting, it is important to mention that the number of externally observable elements of user behavior to which we have access is limited by the design of the tagging systems themselves (e.g., the tagging systems collect limited information on user attributes) and by our limited access to data (e.g., we do not have access to browsing traces or search logs).

One CiteULike feature, however, is useful for this analysis: CiteULike allows users to explicitly declare membership to groups and to share items among a selected subset of co–members — an explicit indicator of user collaboration in the system. Thus, this feature enables an investigation about the relationship between interest sharing and group co–membership (which we assume to indicate collaboration). We note that a similar experiment could be performed using the explicit friendship links in del.icio.us, for example. However, this data is not available to our study.

Along the same lines, we have used a second external signal: semantic similarity between tag vocabularies. More specifically, we test the hypothesis that item–based interest sharing correlates to semantic similarity between user vocabularies. The underlying assumption here is that users who (have the potential to) collaborate employ semantically similar vocabularies.

This section presents the methodology and the results of the first experiment above. Our related technical report (Santos–Neto, et al., 2013) presents the second experiment in detail. In brief, our conclusions are:

  • User pairs with positive item–based interest sharing have a much higher similarity in terms of group co–membership and semantic tag vocabulary, than users who have no interest sharing.

  • On the other hand, we find no correlation between the intensity of the interest sharing and the collaboration levels as implied by group co–membership or vocabulary similarity.

7.1. Group membership

In CiteULike, approximately 11 percent of users declare membership to one or more groups. While the percentage may seem small, they are the most active users: these users generate 65 percent of tag assignments, and introduce 51 percent of items and 50 percent of tags. For this section we limit our analysis to the user pairs for which both users are members of at least one group. Also, the analysis focuses on groups that have two or more users (about 50 percent of all groups) as groups with only one user are obviously not representative of potential collaboration.

We explore the possible relationship between item–based interest sharing and co–membership in one or more groups. We determine the group–based similarity wH(u; v) between two users u and v using the asymmetric Jaccard Index, similar to the item–based definition in Equation 3, but considering the sets of groups users participate in. Based on this similarity definition, we study whether the intensity of item–based interest sharing between two users with non–zero interest sharing (i.e., wI (u; v)>0) correlates with group membership similarity.

We find no correlation between wI(u; v), the item–based similarity, and wH(u; v), the group–based similarity. More precisely, Pearson’s correlation coefficient is approximately 0.12, and Kendall’s τ is about 0.05. This is surprising, as one would expect that being part of the same discussion groups is a good predictor of the intensity with which users share interest over items. Therefore, we look into these correlations in more detail: we look at group similarity for two distinct groups of user pairs: those with no item–based interest sharing (wI(u; v)=0) and those with some interest sharing (wI(u; v)>0). We observe that, although the group information is relatively sparse, pairs of users with positive interest sharing are more likely to be members of the same group than the user pairs where wI(u; v)=0. In particular, four percent of the user pairs with wI(u; v)>0 have wH(u; v)>0.2, while twenty times fewer user pairs with wI(u; v)=0 have wH(u; v)>0.2.

These observations suggest that activity similarity is a necessary, but not sufficient condition for higher–level collaboration, such as participation in the same discussion groups. Although users share interest over items, and may implicitly benefit from each other’s tagging activity (e.g., using one another’s tags to navigate the system), this may not directly lead to users actively engaging in explicit collaborative behavior (possibly due to the lack of information that such collaboration is indeed possible). Conversely, the lack of interest sharing strongly implies a lack of collaboration.
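The two comparisons used in this analysis, a correlation coefficient over user pairs and the share of pairs whose group–based similarity exceeds a threshold, can be sketched as follows. The function names are hypothetical, and the inputs are assumed to be lists of per–pair values of wI and wH computed beforehand.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences.
    Raises ZeroDivisionError if either sequence is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def share_above(values, threshold=0.2):
    """Fraction of user pairs whose similarity exceeds the threshold."""
    values = list(values)
    return sum(1 for v in values if v > threshold) / len(values)
```

Applying `share_above` separately to the wH values of pairs with wI>0 and of pairs with wI=0 gives the kind of ratio reported above.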

7.2. Summary and implications

This section takes a step towards understanding the relationship between the implicit user ties, as inferred from pairwise interest sharing, and their explicit social ties. First, we look at correlations between the item–based interest sharing and the group–based similarity. The observations indicate that although the intensity of item–based interest sharing does not correlate with explicit collaborative behavior, as implied by group co–membership, user pairs with some interest sharing are more than one order of magnitude more likely to participate in similar groups. In our technical report (Santos–Neto, et al., 2013), we also evaluate the relationship between item–based interest similarity and the semantic similarity of tag vocabularies. We discover that, although the two are not correlated according to Pearson’s coefficient, item–based interest similarity does embed information about the expected semantic similarity between user vocabularies.

These results have implications for the design of mechanisms that aim to predict collaborative behavior, as these mechanisms could exploit item–based similarity to set expectations about group–based and vocabulary–based similarity. Moreover, assuming that the tagging activity characteristics of spammers differ from those of legitimate users, one could use deviations from the observed relationship between item–based similarity and the two indicators of collaborative behavior presented here to detect malicious user behavior.

 

++++++++++

8. Conclusions

Tagging systems have been widely adopted by today’s World Wide Web. These systems provide users with the ability to annotate and share content. The peer–produced annotations (or tags) and shared items create a valuable pool of metadata. To efficiently harness this information, it is first necessary to understand the usage characteristics of tagging systems.

To this end, this work studies two major aspects of usage characteristics in tagging systems: i) the dynamics of peer production of information; and, ii) the relationship between implicit and explicit social ties between users.

To address the first aspect, this work analyzes the user behavior characteristics at the individual and aggregate level in three tagging systems that focus on distinct applications: CiteULike and Connotea — personal management of academic citation records; and, del.icio.us — a popular social bookmarking system.

In particular, the characterization of peer production of information focuses on three user activity indicators: i) item re–tagging, a measure of the degree to which users re–tag the items already existing in the system; ii) tag reuse, a measure of the degree to which users reuse a tag to perform new annotations; and, iii) the temporal dynamics of user tag vocabularies, a user–centric analysis of the tag vocabulary evolution over time.
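As a rough illustration of the first two indicators, the Python sketch below computes them over a time–ordered trace of annotations. It assumes a simplified definition: the fraction of posts whose item was already in the system, and the fraction of tag assignments whose tag had been used before. The precise definitions used in the analysis may differ, and the function name and toy trace are hypothetical.

```python
def reuse_levels(annotations):
    """Return (item_retagging, tag_reuse) for a time-ordered trace.

    `annotations` is a list of (user, item, tag) tuples sorted by time.
    A post is the first time a given user annotates a given item, so a
    post carrying several tags is counted only once for item re-tagging.
    """
    seen_posts, seen_items, seen_tags = set(), set(), set()
    posts = repeated_items = 0
    tag_events = repeated_tags = 0
    for user, item, tag in annotations:
        if (user, item) not in seen_posts:
            seen_posts.add((user, item))
            posts += 1
            if item in seen_items:
                repeated_items += 1  # item already existed in the system
            seen_items.add(item)
        tag_events += 1
        if tag in seen_tags:
            repeated_tags += 1  # tag had already been used by someone
        seen_tags.add(tag)
    if posts == 0:
        return 0.0, 0.0
    return repeated_items / posts, repeated_tags / tag_events


# Hypothetical toy trace, not taken from the datasets studied here.
trace = [
    ("alice", "paper-1", "p2p"),
    ("bob", "paper-2", "p2p"),
    ("alice", "paper-3", "tagging"),
    ("carol", "paper-1", "p2p"),
]
item_retagging, tag_reuse = reuse_levels(trace)
```

In this toy trace one of four posts targets an existing item, while two of four tag assignments reuse an existing tag, which mirrors the qualitative pattern of low item re–tagging and higher tag reuse.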

To address the second aspect, we define interest sharing, a metric of the activity similarity between a pair of users. Through experiments that compare the observed data with a random null model, we show that the interest sharing metric captures relevant information about the similarity of user preferences, rather than simple coincidence in the tagging activity. Additionally, we present an analysis of the relationship between the implicit ties, as represented by the similarity between users’ tagging activity, and more explicit ties, such as co–membership in discussion groups and semantic similarity of tag vocabularies.
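The comparison against a null model can be sketched in Python as follows. The sketch makes two assumptions that are labeled as such: interest sharing is taken to be the Jaccard index of two users’ item sets, and the null model is a simple random permutation of items across assignments, which is not necessarily the exact randomization procedure used in this study. All names and the toy data are hypothetical.

```python
import random
from itertools import combinations


def items_per_user(assignments):
    """Group a list of (user, item) pairs into {user: set_of_items}."""
    library = {}
    for user, item in assignments:
        library.setdefault(user, set()).add(item)
    return library


def interest_sharing(library):
    """Jaccard similarity of item sets for every unordered user pair
    (assumed here as the interest sharing metric)."""
    scores = {}
    for u, v in combinations(sorted(library), 2):
        union = library[u] | library[v]
        scores[(u, v)] = len(library[u] & library[v]) / len(union)
    return scores


def shuffled_null_model(assignments, rng):
    """Permute items across assignments. Each user keeps the same number
    of assignments and each item the same number of occurrences; a user
    may receive the same item twice, which collapses in the item set."""
    users = [u for u, _ in assignments]
    items = [i for _, i in assignments]
    rng.shuffle(items)
    return list(zip(users, items))


# Hypothetical toy data: u1 and u2 share both of their items.
observed = [("u1", "a"), ("u1", "b"), ("u2", "a"),
            ("u2", "b"), ("u3", "c"), ("u4", "d")]
shuffled = shuffled_null_model(observed, random.Random(42))

real = interest_sharing(items_per_user(observed))
null = interest_sharing(items_per_user(shuffled))
```

Comparing the distribution of `real` scores with that of `null` scores over many shuffles indicates whether the observed concentration of interest sharing exceeds what activity volumes and item popularity alone would produce.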

Here we summarize the main findings of this study:

  1. The qualitative characteristics of peer–production of information are similar across different systems, but they differ quantitatively, as indicated by the relative levels of item re–tagging and tag reuse.
  2. Interest sharing (the metric that quantifies the similarity between the tagging activity of a pair of users) is significantly concentrated on a small fraction of user pairs. This concentration reflects deliberate choices made by users in tagging systems, and is not merely a by–product of tagging activity volumes and tag/item popularity distributions, as indicated by a comparison of the observed interest sharing distribution to that of a system with the same macro characteristics but with random tag assignments.
  3. As expected, user tag vocabularies are constantly growing, yet at different rates depending on the age of the user. However, despite the constant vocabulary growth, the relative usage frequency of tags in a vocabulary tends to converge to a stable ranking at early stages of users’ life in the system.
  4. The implicit and explicit social ties are related, as suggested by the observed higher intensity in group co–membership and tag vocabulary semantic similarity for the user pairs that share interest over items.

These results have implications along multiple fronts of system design, including: i) recommender systems: as the concentration of interest sharing on a small fraction of user pairs indicates a highly sparse dataset, more sophisticated techniques are needed to achieve good precision and recall; ii) malicious user detection: spam detection mechanisms tailored to tagging systems could use deviations from the interest sharing characteristics of a non–malicious user population to detect malicious users; and, iii) design of distributed infrastructure for tagging systems: the characteristics of the evolution of user tag vocabularies, together with the ‘sparse’ interest sharing, support the intuition that it is possible to design and implement a distributed infrastructure to support tagging features, as this would imply low communication cost among the parts.

 

About the authors

Elizeu Santos–Neto is a Ph.D. candidate in the Electrical & Computer Engineering Department at the University of British Columbia.
Web: http://blogs.ubc.ca/elizeu/about/
E–mail: elizeus [at] ece [dot] ubc [dot] ca

David Condon, University of South Florida.
E–mail: dcondon [at] mail [dot] usf [dot] edu

Nazareno Andrade is a professor in the Systems and Computing Department of Universidade Federal de Campina Grande, Brazil. His research focuses on peer production systems, social computing, data analytics and computing and music. Nazareno received a Ph.D. in Electrical Engineering from the Universidade Federal de Campina Grande.
E–mail: nazareno [at] computacao [dot] ufcg [dot] edu

Adriana Iamnitchi, University of South Florida.
E–mail: anda [at] cse [dot] usf [dot] edu

Matei Ripeanu received his Ph.D. in computer science from the University of Chicago in 2005. After a brief visit at the Argonne National Laboratory, Matei joined the Electrical and Computer Engineering Department of the University of British Columbia. Matei is broadly interested in distributed systems with a focus on self–organization and decentralized control in large–scale grid and peer–to–peer systems.
E–mail: matei [at] ece [dot] ubc [dot] ca

 

Acknowledgements

The authors would like to thank the Research Computing Center at the University of South Florida for allowing us to use their infrastructure in part of our experiments. Elizeu Santos–Neto was partially supported by the B.C. Innovation Council Fellowship and AUCC LACREG Exchange Grant. Finally, thanks to Lauro Beltrão Costa and Abdullah Gharaibeh for insightful discussions.

An extended version of this work is available as a technical report (Santos–Neto, et al., 2013).

 

Notes

1. http://openannotation.org.

2. http://hypothes.is.

3. http://www.tagora-project.eu/.

4. Nathanson, 2001, p. 267.

 

References

M. Ames and M. Naaman, 2007. “Why we tag: Motivations for annotation in mobile and online media,” CHI ’07: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 971–980.
doi: http://dx.doi.org/10.1145/1240624.1240772, accessed 25 June 2014.

Y. Benkler, 2006. The wealth of networks: How social production transforms markets and freedom. New Haven, Conn.: Yale University Press.

S. Brin and L. Page, 1998. “The anatomy of a large–scale hypertextual Web search engine,” Computer Networks and ISDN Systems, volume 30, numbers 1–7, pp. 107–117.
doi: http://dx.doi.org/10.1016/S0169-7552(98)00110-X, accessed 25 June 2014.

A. Capocci, A. Baldassarri, V.D.P. Servedio, and V. Loreto, 2009. “Statistical properties of inter–arrival times distribution in social tagging systems,” HT ’09: Proceedings of the 20th ACM Conference on Hypertext and Hypermedia, pp. 239–244.
doi: http://dx.doi.org/10.1145/1557914.1557955, accessed 25 June 2014.

C. Cattuto, A. Baldassarri, V.D.P. Servedio, and V. Loreto, 2007. “Vocabulary growth in collaborative tagging systems,” arXiv.org (25 April), at http://arxiv.org/abs/0704.3316, accessed 25 June 2014.

C. Cattuto, A. Barrat, A. Baldassarri, G. Schehr, and V. Loreto, 2009. “Collective dynamics of social annotation,” Proceedings of the National Academy of Sciences, volume 106, number 26, pp. 10,511–10,515.
doi: http://dx.doi.org/10.1073/pnas.0901136106, accessed 25 June 2014.

E. Chi, P. Pirolli, and S.K. Lam, 2007. “Aspects of augmented social cognition: Social information foraging and social search,” In: D. Schuler (editor). Online communities and social computing. Lecture Notes in Computer Science, volume 4564, pp. 60–69.
doi: http://dx.doi.org/10.1007/978-3-540-73257-0_7, accessed 25 June 2014.

E.H. Chi and T. Mytkowicz, 2008. “Understanding the efficiency of social tagging systems using information theory,” HT ’08: Proceedings of the Nineteenth ACM Conference on Hypertext and Hypermedia, pp. 81–88.
doi: http://dx.doi.org/10.1007/978-3-540-73257-0_7, accessed 25 June 2014.

F. Eggenberger and G. Polya, 1923. “Über die Statistik verketteter Vorgänge,” Zeitschrift für Angewandte Mathematik und Mechanik, volume 3, pp. 279–289.

B.M. Evans and E.H. Chi, 2008. “Towards a model of understanding social search,” CSCW ’08: Proceedings of the 2008 ACM Conference on Computer Supported Cooperative Work, pp. 485–494.
doi: http://dx.doi.org/10.1145/1460563.1460641, accessed 25 June 2014.

R. Fagin, R. Kumar, and D. Sivakumar, 2003. “Comparing top k lists,” SODA ’03: Proceedings of the Fourteenth Annual ACM–SIAM Symposium on Discrete Algorithms, pp. 28–36.

U. Farooq, T.G. Kannampallil, Y. Song, C.H. Ganoe, J.M. Carroll, and L. Giles, 2007. “Evaluating tagging behavior in social bookmarking systems: Metrics and design heuristics,” GROUP ’07: Proceedings of the 2007 International ACM Conference on Supporting Group Work, pp. 351–360.
doi: http://dx.doi.org/10.1145/1316624.1316677, accessed 25 June 2014.

G.W. Furnas, T.K. Landauer, L.M. Gomez, and S.T. Dumais, 1987. “The vocabulary problem in human–system communications,” Communications of the ACM, volume 30, number 11, pp. 964–971.
doi: http://dx.doi.org/10.1145/32206.32212, accessed 25 June 2014.

S.A. Golder and B.A. Huberman, 2006. “Usage patterns of collaborative tagging systems,” Journal of Information Science, volume 32, number 2, pp. 198–208.
doi: http://dx.doi.org/10.1177/0165551506062337, accessed 25 June 2014.

O. Görlitz, S. Sizov, and S. Staab, 2008. “PINTS: Peer–to–peer infrastructure for tagging systems,” IPTPS ’08: Proceedings of the Seventh International Conference on Peer–to–Peer Systems, p. 19.

X. Gu, X. Wang, R. Li, K. Wen, Y. Yang, and W. Xiao, 2011. “Measuring social tag confidence: Is it a good or bad tag?” In: H. Wang, S. Li, S. Oyama, X. Hu, and T. Qian (editors). Web–Age Information Management. Lecture Notes in Computer Science, volume 6897, pp. 94–105.
doi: http://dx.doi.org/10.1007/978-3-642-23535-1_10, accessed 25 June 2014.

H. Halpin, V. Robu, and H. Shepherd, 2007. “The complex dynamics of collaborative tagging,” WWW ’07: Proceedings of the 16th International Conference on World Wide Web, pp. 211–220.
doi: http://dx.doi.org/10.1145/1242572.1242602, accessed 25 June 2014.

T. Hammond, T. Hannay, B. Lund, and J. Scott, 2005. “Social bookmarking tools (i): A general review,” D–Lib Magazine, volume 11, number 4, at http://www.dlib.org/dlib/april05/hammond/04hammond.html, accessed 25 June 2014.

P. Heymann, G. Koutrika, and H. Garcia–Molina, 2008. “Can social bookmarking improve Web search?” WSDM ’08: Proceedings of the 2008 International Conference on Web Search and Data Mining, pp. 195–206.
doi: http://dx.doi.org/10.1145/1341531.1341558, accessed 25 June 2014.

A. Hotho, R. Jäschke, C. Schmitz, and G. Stumme, 2006. “Information retrieval in folksonomies: Search and ranking,” In: Y. Sure and J. Domingue (editors). The semantic Web: Research and applications, Lecture Notes in Computer Science, volume 4011, pp. 411–426.
doi: http://dx.doi.org/10.1007/11762256_31, accessed 25 June 2014.

S. Huang, X. Wu, and A. Bolivar, 2008. “The effect of title term suggestion on e-commerce sites,” WIDM ’08: Proceedings of the 10th ACM Workshop on Web Information and Data Management, pp. 31–38.
doi: http://dx.doi.org/10.1145/1458502.1458508, accessed 25 June 2014.

A. Iamnitchi, M. Ripeanu, and I. Foster, 2004. “Small–world file–sharing communities,” INFOCOM 2004: Twenty–third Annual Joint Conference of the IEEE Computer and Communications Societies, volume 2, pp. 952–963.
doi: http://dx.doi.org/10.1109/INFCOM.2004.1356982, accessed 25 June 2014.

A. Iamnitchi, M. Ripeanu, E. Santos–Neto, and I. Foster, 2011. “The small world of file sharing,” IEEE Transactions on Parallel and Distributed Systems, volume 22, number 7, pp. 1,120–1,134.
doi: http://dx.doi.org/10.1109/TPDS.2010.170, accessed 25 June 2014.

P. Jaccard, 1912. “The distribution of the flora in the alpine zone,” New Phytologist, volume 11, number 2, pp. 37–50.
doi: http://dx.doi.org/10.1111/j.1469-8137.1912.tb05611.x, accessed 25 June 2014.

R. Jäschke, L. Marinho, A. Hotho, L. Schmidt–Thieme, and G. Stumme, 2007. “Tag recommendations in folksonomies,” In: J. Kok, J. Koronacki, R. Lopez de Mantaras, S. Matwin, D. Mladenič, and A. Skowron (editors). Knowledge Discovery in Databases: PKDD 2007. Lecture Notes in Computer Science, volume 4702, pp. 506–514.
doi: http://dx.doi.org/10.1007/978-3-540-74976-9_52, accessed 25 June 2014.

S. Kashoob and J. Caverlee, 2012. “Temporal dynamics of communities in social bookmarking systems,” Social Network Analysis and Mining, volume 2, number 2, pp. 387–404.
doi: http://dx.doi.org/10.1007/s13278-012-0054-z, accessed 25 June 2014.

M.G. Kendall, 1938. “A new measure of rank correlation,” Biometrika, volume 30, numbers 1–2, pp. 81–93.
doi: http://dx.doi.org/10.1093/biomet/30.1-2.81, accessed 25 June 2014.

G. Koutrika, F.A. Effendi, Z. Gyöngyi, P. Heymann, and H. Garcia–Molina, 2008. “Combating spam in tagging systems: An evaluation,” ACM Transactions on the Web, volume 2, number 4, article number 22.
doi: http://dx.doi.org/10.1145/1409220.1409225, accessed 25 June 2014.

B. Krause, C. Schmitz, A. Hotho, and G. Stumme, 2008. “The anti–social tagger: Detecting spam in social bookmarking systems,” AIRWeb ’08: Proceedings of the Fourth International Workshop on Adversarial Information Retrieval on the Web, pp. 61–68.
doi: http://dx.doi.org/10.1145/1451983.1451998, accessed 25 June 2014.

P. Li, B. Wang, W. Jin, J–Y. Nie, Z. Shi, and B. He, 2011. “Exploring categorization property of social annotations for information retrieval,” CIKM ’11: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 557–562.
doi: http://dx.doi.org/10.1145/2063576.2063659, accessed 25 June 2014.

C. Lu, J.–R. Park, and X. Hu, 2010. “User tags versus expert–assigned subject terms: A comparison of LibraryThing tags and Library of Congress Subject Headings,” Journal of Information Science, volume 36, number 6, pp. 763–779.
doi: http://dx.doi.org/10.1177/0165551510386173, accessed 25 June 2014.

G. Macgregor and E. McCulloch, 2006. “Collaborative tagging as a knowledge organisation and resource discovery tool,” Library Review, volume 55, number 5, pp. 291–300.
doi: http://dx.doi.org/10.1108/00242530610667558, accessed 25 June 2014.

C. Marlow, M. Naaman, d. boyd, and M. Davis, 2006. “HT06, tagging paper, taxonomy, Flickr, academic article, to read,” HYPERTEXT ’06: Proceedings of the Seventeenth Conference on Hypertext and Hypermedia, pp. 31–40.
doi: http://dx.doi.org/10.1145/1149941.1149949, accessed 25 June 2014.

A. Mathes, 2004. “Folksonomies — Cooperative classification and communication through shared metadata,” at http://www.adammathes.com/academic/computer-mediated-communication/folksonomies.html, accessed 25 June 2014.

N. Neubauer, R. Wetzker, and K. Obermayer, 2009. “Tag spam creates large non–giant connected components,” AIRWeb ’09: Proceedings of the Fifth International Workshop on Adversarial Information Retrieval on the Web, pp. 49–52.
doi: http://dx.doi.org/10.1145/1531914.1531925, accessed 25 June 2014.

M.G. Noll, C–m. Au Yeung, N. Gibbins, C. Meinel, and N. Shadbolt, 2009. “Telling experts from spammers: Expertise ranking in folksonomies,” SIGIR ’09: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 612–619.
doi: http://dx.doi.org/10.1145/1571941.1572046, accessed 25 June 2014.

O. Nov, M. Naaman, and C. Ye, 2008. “What drives content tagging: The case of photos on Flickr,” CHI ’08: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1,097–1,100.
doi: http://dx.doi.org/10.1145/1357054.1357225, accessed 25 June 2014.

R. Ramakrishnan and A. Tomkins, 2007. “Toward a PeopleWeb,” Computer, volume 40, number 8, pp. 63–72.
doi: http://dx.doi.org/10.1109/MC.2007.294, accessed 25 June 2014.

J. Reichardt and S. Bornholdt, 2008. “Market segmentation: The network approach,” In: D. Helbing (editor). Managing complexity: Insights, concepts, applications. Berlin: Springer–Verlag, pp. 19–36.
doi: http://dx.doi.org/10.1007/978-3-540-75261-5_2, accessed 25 June 2014.

E. Santos–Neto, D. Condon, N. Andrade, A. Iamnitchi, and M. Ripeanu, 2013. “Reuse, temporal dynamics, interest sharing, and collaboration in social tagging systems,” arXiv.org (25 January), at http://arxiv.org/abs/1301.6191, accessed 25 June 2014.

E. Santos–Neto, M. Ripeanu, and A. Iamnitchi, 2007. “Tracking user attention in collaborative tagging communities,” arXiv.org (24 June), at http://arxiv.org/abs/0705.1013, accessed 25 June 2014.

R. Schenkel, T. Crecelius, M. Kacimi, S. Michel, T. Neumann, J.X. Parreira, and G. Weikum, 2008. “Efficient top–k querying over social–tagging networks,” SIGIR ’08: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 523–530.
doi: http://dx.doi.org/10.1145/1390334.1390424, accessed 25 June 2014.

K. Seki, H. Qin, and K. Uehara, 2010. “Impact and prospect of social bookmarks for bibliographic information retrieval,” JCDL ’10: Proceedings of the Tenth Annual Joint Conference on Digital Libraries, pp. 357–360.
doi: http://dx.doi.org/10.1145/1816123.1816179, accessed 25 June 2014.

S. Sen, S.K. Lam, A.M. Rashid, D. Cosley, D. Frankowski, J. Osterhouse, F.M. Harper, and J. Riedl, 2006. “Tagging, communities, vocabulary, evolution,” CSCW ’06: Proceedings of the 2006 20th Anniversary Conference on Computer Supported Cooperative Work, pp. 181–190.
doi: http://dx.doi.org/10.1145/1180875.1180904, accessed 25 June 2014.

B. Sigurbjörnsson and R. van Zwol, 2008. “Flickr tag recommendation based on collective knowledge,” WWW ’08: Proceedings of the 17th International Conference on World Wide Web, pp. 327–336.
doi: http://dx.doi.org/10.1145/1367497.1367542, accessed 25 June 2014.

H.A. Simon, 1955. “On a class of skew distribution functions,” Biometrika, volume 42, numbers 3–4, pp. 425–440.
doi: http://dx.doi.org/10.1093/biomet/42.3-4.425, accessed 25 June 2014.

J. Sinclair and M. Cardew–Hall, 2008. “The folksonomy tag cloud: When is it useful?” Journal of Information Science, volume 34, number 1, pp. 15–29.
doi: http://dx.doi.org/10.1177/0165551506078083, accessed 25 June 2014.

Y. Song, Z. Zhuang, H. Li, Q. Zhao, J. Li, W.–C. Lee, and C.L. Giles, 2008. “Real–time automatic tag recommendation,” SIGIR ’08: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 515–522.
doi: http://dx.doi.org/10.1145/1390334.1390423, accessed 25 June 2014.

J. Stoyanovich, S.A. Yahia, C. Marlow, and C. Yu, 2008. “Leveraging tagging to model user interests in del.icio.us,” AAAI–SIP 2008: Proceedings of the AAAI Spring Symposium on Social Information Processing, at https://www.aaai.org/Papers/Symposia/Spring/2008/SS-08-06/SS08-06-020.pdf, accessed 25 June 2014.

M. Strohmaier, C. Körner, and R. Kern, 2012. “Understanding why users tag: A survey of tagging motivation literature and results from an empirical study,” Web Semantics, volume 17, pp. 1–11.
doi: http://dx.doi.org/10.1016/j.websem.2012.09.003, accessed 25 June 2014.

S.A. Yahia, M. Benedikt, L.V.S. Lakshmanan, and J. Stoyanovich, 2008. “Efficient network aware search in collaborative tagging sites,” Proceedings of the VLDB Endowment, volume 1, number 1, pp. 710–721.
doi: http://dx.doi.org/10.14778/1453856.1453934, accessed 25 June 2014.

Y. Yanbe, A. Jatowt, S. Nakamura, and K. Tanaka, 2007. “Can social bookmarking enhance search in the Web?” JCDL ’07: Proceedings of the Seventh ACM/IEEE–CS Joint Conference on Digital Libraries, pp. 107–116.
doi: http://dx.doi.org/10.1145/1255175.1255198, accessed 25 June 2014.

D. Zhou, J. Bian, S. Zheng, H. Zha, and C.L. Giles, 2008. “Exploring social annotations for information retrieval,” WWW ’08: Proceedings of the 17th International Conference on World Wide Web, pp. 715–724.
doi: http://dx.doi.org/10.1145/1367497.1367594, accessed 25 June 2014.

 


Editorial history

Received 26 December 2013; revised 29 May 2014; accepted 20 June 2014.


Creative Commons License
This paper is in the public domain.

Reuse, temporal dynamics, interest sharing, and collaboration in social tagging systems
by Elizeu Santos–Neto, David Condon, Nazareno Andrade, Adriana Iamnitchi, and Matei Ripeanu.
First Monday, Volume 19, Number 7 - 7 July 2014
http://www.firstmonday.org/ojs/index.php/fm/article/view/4994/4101
doi: http://dx.doi.org/10.5210/fm.v19i7.4994






© First Monday, 1995-2017. ISSN 1396-0466.