Text in social networking Web sites: A word frequency analysis of Live Spaces
by Mike Thelwall
Social networking sites are owned by a wide section of society and seem to dominate Web usage. Despite much research into this phenomenon, little systematic data is available. This article partially fills this gap with a pilot text analysis of one social networking site, Live Spaces. The text in 3,071 English-language Live Spaces sites was monitored daily for six months and word frequency statistics were calculated and compared with those from the British National Corpus. The results confirmed the existence of common domain-specific words and a marked personal focus. Unexpectedly, however, there was no evidence of an unusual degree of experimentation with new words or word spellings; perhaps this behaviour is limited to other social networking environments. Also surprising was the existence of a marked male gender bias in the most commonly used words. This was probably caused by a significant number of news-related discussions involving predominantly male politicians and other male public figures.
Social networking spaces are Web sites that allow members to create their own personal Web profile and to discover and connect to other members through that profile. One example is MySpace.com, which apparently surpassed Google as the most visited Web site by U.S. users at the end of 2006 (Prescott, 2007). MySpace is one of the more informal social networking sites, with users who are primarily young, and with profiles typically including a ‘cool’ personal photograph, a pop song, a set of photographs of friends and funny messages from some of those friends (and pop groups). Other popular social networking sites include Facebook (emphasising students), LinkedIn (for business networking) and Digg (for news discovery).
This article focuses on general social networking sites like MySpace and Live Spaces. These are linguistically interesting because they encourage new forms of communication. In fact, there are multiple communication modes, particularly between people who are connected as friends within a site. These modes include the following:
- Friend messages. These are effectively email messages sent to registered friends and typically use an interface that encourages short messages.
- Wall postings, friend comments or testimonials. These are public messages placed on a friend's Web profile page. This form of posting was apparently invented by early Friendster users (boyd, 2007a).
- Comments are publicly visible messages attached to a picture or video on a friend's site, although this terminology is also used for wall postings.
In addition to the above, there are one-to-many communication forms such as blog postings, personal statements and biographical details on the profile page and descriptions of any videos or pictures posted. Thus, social networking sites are a complex multimedia, multimodal communication environment, designed to facilitate interactions amongst friends and which presumably generate a range of evolving communication registers that combine aspects of text messaging, email and spoken styles.
It is important to research social networking language, not just as a widespread aspect of culture, but also to be able to teach it to English learners and to support and understand its use amongst children and young adults. Linguistic insights may also help to address current concerns about the amount of time that children spend online interacting with friends. This article focuses on one aspect of social network communication: one-to-many messages in blogs or photo sets. This restricted choice was made for technical reasons, described below. The British National Corpus was selected as a baseline for comparison and Microsoft Live Spaces was chosen as the social networking site, again for technical reasons described below. The research question is exploratory: Which are the main ways that Windows Live Spaces blogging and photograph comments differ from standard and Web English, from a word frequency perspective?
This research applies techniques from corpus linguistics to social networking, perhaps for the first time. Hence this literature review covers both content: social networking research; and methods: corpus linguistics.
Research into social networks so far has tended to be qualitative, using ethnographic or similar approaches to gain insights into the culture of social networking, particularly with regard to teenage users. This research has been particularly useful for the way that it has addressed concerns raised by those in the media and elsewhere who have made false interpretations through superficial readings of the technology. The key concept of friend, for example, does not match well with its widely used offline meaning. Users of social networking sites typically become friends by one sending a friend request and the other accepting it. Friendship confers a set of rights, such as permission to view a full profile, to send a message, to post comments and to view photographs and blog entries. A user's friends are normally listed on their home page, or on additional pages if there are too many.
The various social networking sites all have their own unique features or orientations. For example, MySpace emphasises music, Facebook originated in U.S. colleges, blackplanet.com targets a black U.S. audience and gaiaonline.com focuses on manga and role-play. As such, it would be reasonable to expect significantly different user demographics. For example, it seems that Facebook users tend to be richer and better educated than MySpace members (boyd, 2007b). This, together with the tendency for social networking behaviour to evolve over time and for users to flock to different sites, means that it is difficult to give general and definitive statements about social networking. Nevertheless, some research has yielded useful insights.
Users view friendship in different ways, ranging from equivalent to offline friendship to a totally meaningless relationship (boyd, 2006; Fono and Raynes-Goldie, 2006). This can cause conflict when two friends have different interpretations and act within their own understandings. In such a situation one may be seen as betraying the friendship bond whilst the other is seen as taking things too seriously. Nevertheless, it seems to be the case that most social networkers view friending as weaker than offline friendship, for example accepting friendship requests to avoid giving offence (boyd, 2006). Moreover, within friendships there can be hierarchies: for example, a person's closest friends could be listed on their home page with the remainder on other pages (boyd, 2006).
One of the few quantitative social network studies is a large-scale analysis of Facebook traffic (Golder, et al., 2007). This showed that most Facebook friends were members of the same college and also that most messages were exchanged between people sharing a college. It seems that Facebook use was built into the daily routines of students and Facebook messaging was often used as part of normal offline friendships. This might be seen as surprising since one of the advantages of social networking sites is their support for long distance and more casual friendships, for example, between former classmates attending different colleges.
The field of corpus linguistics (McEnery and Wilson, 2001) is concerned with the construction and analysis of large bodies of language (corpora). Perhaps the most famous is the British National Corpus (BNC), which consists of a wide variety of genres of English language text (Burnard, 1995). Although predominantly containing written text, from novels to newspapers, it also contains some transcribed spoken language. Language corpora are useful sources of usage evidence, particularly for language teaching (Jaworski, 1990). For example, a list of the most frequent words in a representative corpus could form the starting point for a core vocabulary for second language teaching. Similarly, corpora can be queried for examples of text containing any word (concordances) so that students can identify appropriate use contexts. A drawback of corpora, however, is that they can take a long time to construct and can continue in use after some of their language is obsolete. A partial solution to this problem is to use the Web itself as a corpus, at least for concordances (Kilgarriff and Grefenstette, 2003; Meyer, et al., 2003).
A second type of corpus linguistics analysis, one that relates closely to the current paper, is comparative. If at least two corpora are available, or a corpus naturally divides into separate sections, then it may be useful to compare language use between them. For example, previous studies have compared nineteenth century letter writing styles by gender (Geisler, 2003), spoken against written English in universities (Biber, 2003) and Indian against British English (Hosali, 1991). The purpose of such investigations can be linguistic, in the sense of finding out how language works, or non-linguistic, for instance investigating gender divisions in society. Although there are many different ways by which the language can be compared across two corpora, one simple method is to compare word frequencies.
Word frequency analysis starts by recording the number of times that each word occurs in the corpus. The distribution of these word frequencies is known to normally follow a Zipfian distribution (Zipf, 1949) or power law (Lotka, 1926; Thelwall, 2005a, 2005b) with a few words being very common and many words being rare. The classic shape of this distribution (see Figure 1) is consistent with a simple model of language use which posits that copying forms an important part of language. This implies that there is a positive feedback loop in language with people tending to use words that they have heard or read most often in the past. Comparing frequency distributions can be useful in two ways. First, differences in the rank order of the most common words illuminate stylistic differences. For example, I is more usual in spoken than written English, although it is common in both. Second, any deviations from a perfect power law distribution indicate some kind of external stress on the language. For example, a corpus of legal documents may be stressed by the necessity to repeatedly use official legal jargon. Both of these methods are used in this article.
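As an illustration of the two quantities discussed above, the following minimal Python sketch counts word frequencies and then counts how many distinct words share each frequency (the distribution plotted in a graph such as Figure 1). The tokenisation rule, the function names and the sample sentence are assumptions made for this example only; they are not taken from the study.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each lowercased word occurs in the text."""
    # Assumption: a word is a run of letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

def frequency_spectrum(word_counts):
    """Map each frequency f to the number of distinct words occurring f times."""
    return Counter(word_counts.values())

# Invented sample text, used only to show the shape of the output.
sample = "the cat saw the dog and the dog saw a bird"
counts = word_frequencies(sample)
spectrum = frequency_spectrum(counts)

# Words ordered from most to least frequent, as in a rank list.
ranked = [word for word, _ in counts.most_common()]
```

Even in this tiny sample one word is frequent and most words occur only once, which is the general shape that a Zipfian distribution describes for a full corpus.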
The following five hypotheses drive this pilot study:
- Low frequency words will be unusually common in Live Spaces due to spelling mistakes and invented words.
- There will be a set of common domain-specific words related to Live Spaces features.
- In comparison to BNC English, text in Live Spaces will be more focused on the present and less on the past.
- In comparison to BNC English, text in Live Spaces will be more focused on the authors and their personal relationships.
- In comparison to BNC English, text in Live Spaces will be gender-neutral.
In order to obtain reliable word frequency statistics from social networking sites, a large number must be monitored for an extended period of time. Some sites have a feature that allows this monitoring to be conducted efficiently in terms of minimising the total download size needed. This feature is the Rich Site Summary/Really Simple Syndication (RSS) feed (Hammersley, 2005), which contains the most recently updated content on a site. In order to identify all of the content posted to a site it is sufficient to periodically check its RSS feed without having to repeatedly download and check the entire site.
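The idea of checking a feed periodically and keeping only the items not seen before can be sketched as follows. This is a hypothetical illustration using Python's standard XML parser on two invented RSS 2.0 documents; it is not the Mozdeh software used in the study, and the function name and feed contents are assumptions.

```python
import xml.etree.ElementTree as ET

def new_items(feed_xml, seen_ids):
    """Return (title, description) pairs for items not seen on earlier visits."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        # Assumption: the guid, or failing that the link, identifies an item.
        item_id = item.findtext("guid") or item.findtext("link") or ""
        if item_id and item_id not in seen_ids:
            seen_ids.add(item_id)
            fresh.append((item.findtext("title", default=""),
                          item.findtext("description", default="")))
    return fresh

# Invented feed snapshots standing in for two consecutive daily downloads.
FEED_DAY_1 = """<rss version="2.0"><channel><title>Example space</title>
<item><guid>post-1</guid><title>First</title><description>Hello</description></item>
</channel></rss>"""

FEED_DAY_2 = """<rss version="2.0"><channel><title>Example space</title>
<item><guid>post-2</guid><title>Second</title><description>More text</description></item>
<item><guid>post-1</guid><title>First</title><description>Hello</description></item>
</channel></rss>"""

seen = set()
day_1 = new_items(FEED_DAY_1, seen)  # the first visit yields the first post
day_2 = new_items(FEED_DAY_2, seen)  # the second visit yields only the newer post
```

Because the set of seen identifiers persists between visits, each post contributes its text to the corpus once, however many days it remains in the feed.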
Unfortunately, there is no complete list of social networking sites from which a genuinely random sample could be selected for monitoring. Instead, search engine searches were used to create as large a list as possible of feeds from which to sample. For this, Windows Live searches were used because this search engine supports the feed: command to identify feeds. We submitted 10,000 feed searches using random mid-frequency words from blogs. The combined results included over 100,000 unique URLs, with a majority coming from Windows Live Spaces. Windows Live Search seemed to have particularly good coverage of Live Spaces, perhaps even comprehensive coverage, and hence Live Spaces was chosen as the sole social network site for this research. From the set of Live Spaces feeds we selected 26,953 at random (the strange number is due to rejecting inactive feeds). These were then monitored daily for six months using the Mozdeh RSS monitoring software. The feed data was parsed to remove all of the XML and HTML tags as well as all URLs. The English language feeds were identified by first separating out non-ASCII feeds and then using the bigram method of language identification on the remainder (Cavnar and Trenkle, 1994). This was chosen in preference to the common words method, which is less effective on small quantities of text (Grefenstette, 1995). The feeds were then manually checked, which removed about 200 incorrectly classified feeds and left 3,071 Live Spaces. A word frequency list was built from the tag-free text of these 3,071 Live Spaces.
For comparison purposes, word frequency lists from the British National Corpus (BNC) and from U.K. university Web sites in 2003 were taken from a previous article (Thelwall, 2005b). Neither of these is ideal as a comparative corpus since Live Spaces members are presumably mainly from the U.S., but they form a useful baseline for broad comparisons.
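The cleaning and language identification steps described above might look roughly like the following sketch. It strips tags and URLs with regular expressions, screens out non-ASCII text, and compares character bigram rank profiles using an out-of-place distance of the kind described by Cavnar and Trenkle (1994). Restricting the profile to bigrams, the profile size, the tiny training samples and all function names are assumptions for illustration; the exact procedure used in the study is not specified here.

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")

def strip_markup(raw):
    """Remove XML/HTML tags and URLs, then collapse whitespace."""
    # Tags go first so that URLs inside attributes vanish with them.
    text = TAG_RE.sub(" ", raw)
    text = URL_RE.sub(" ", text)
    return " ".join(text.split())

def is_ascii_text(text):
    """Screen used before language identification: True if all characters are ASCII."""
    return text.isascii()

def bigram_profile(text, size=50):
    """Return the most frequent character bigrams, most frequent first."""
    letters = re.sub(r"[^a-z ]", "", text.lower())
    bigrams = Counter(letters[i:i + 2] for i in range(len(letters) - 1))
    return [gram for gram, _ in bigrams.most_common(size)]

def out_of_place(doc_profile, lang_profile):
    """Sum rank differences; bigrams absent from the language profile get the maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(doc_profile))

def identify(text, training):
    """Pick the language whose bigram profile is closest to that of the text."""
    doc = bigram_profile(text)
    profiles = {lang: bigram_profile(sample) for lang, sample in training.items()}
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

# Invented, very small training samples; a real profile needs far more text.
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and then she went "
               "to see the photos that her friends had posted on their blog",
    "french": "le renard brun rapide saute par dessus le chien paresseux et "
              "ensuite elle est partie voir les photos que ses amis avaient mises",
}

cleaned = strip_markup('<p>See <a href="http://x.example/a">my photos</a> at http://x.example/b now</p>')
guess = identify("we went to the park with my friends", TRAINING)
```

With training samples this small the guess is unreliable, which echoes the point made above about methods that are less effective on small quantities of text.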
Table 1 reports the most common words in the English Live Spaces feeds in addition to the same results for two comparison corpora. It is interesting that although the top word is the same for all three, there is already a difference in the second most common word, with of being relatively rare in Live Spaces. There does not seem to be a plausible explanation for this result.
Table 1: The most common words in the English Live Spaces feeds, the BNC and U.K. university Web sites in 2003.
Rank  BNC     UK Uni 2003   Live Spaces (eng)
1     the     the           the
2     of      of            and
3     and     and           to
4     to      to            i
5     a       a             a
6     in      in            of
7     is      is            in
8     that    for           it
9     was     on            that
10    it      be            is
11    for     that          for
12    on      this          you
13    with    with          my
14    he      by            photo
15    be      are           was
16    i       as            blogentry
17    by      it            on
18    as      you           with
19    at      or            me
20    you     from          be
21    are     at            have
22    his     an            this
23    had     will          but
24    not     not           he
25    this    have          as
26    have    which         we
27    from    university    at
28    but     i             so
29    which   if            not
30    she     can           photoalbum
31    they    was           blog
32    or      all           are
33    an      research      all
34    her     information   more
35    were    we            entry
36    there   was           from
37    we      one           they
38    their   may           one
39    been    your          or
40    has     but           by
Low frequency words
The first hypothesis was that low frequency words would be unusually common in Live Spaces text. Figure 1 is a frequency distribution of word frequencies in Live Spaces. In terms of low frequency words, the graph has a very straight line towards the left without a notably high number of words with frequency 1. Hence, although words that occur only once in the corpus are more numerous than words that occur more often, they are not more frequent than would be expected for a language that follows a natural power-law growth model. In conclusion, there is no evidence of the deliberate construction of new words in Live Spaces postings. A manual inspection of words with frequency 1 supported this conclusion: these words seemed rarely to be invented words.
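One hedged way to turn the visual check described above into a number is to fit a straight line to the log-log frequency distribution for frequencies above 1 and compare the observed number of frequency-1 words with the value the line predicts. The sketch below does this with ordinary least squares; the fitting range, the function names and the synthetic data are assumptions for illustration, and the study itself relied on inspection of the graph rather than on this calculation.

```python
import math

def loglog_slope(spectrum):
    """Least-squares slope and intercept of log10(count) against log10(frequency)."""
    xs = [math.log10(f) for f in spectrum]
    ys = [math.log10(n) for n in spectrum.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def hapax_excess(spectrum, fit_range=(2, 50)):
    """Ratio of observed frequency-1 words to the count extrapolated from higher frequencies."""
    low, high = fit_range
    fitted = {f: n for f, n in spectrum.items() if low <= f <= high}
    slope, intercept = loglog_slope(fitted)
    # The fitted line evaluated at frequency 1, where log10(1) = 0.
    expected = 10 ** intercept
    return spectrum[1] / expected

# Synthetic spectrum following an exact power law; NOT data from the study.
synthetic = {f: 100000 / f ** 2 for f in range(1, 51)}
ratio_plain = hapax_excess(synthetic)

# The same spectrum with twice as many frequency-1 words.
inflated = dict(synthetic)
inflated[1] *= 2
ratio_inflated = hapax_excess(inflated)
```

A ratio close to 1 corresponds to the straight line reported above, whereas a ratio well above 1 would suggest an unusual number of one-off words such as misspellings or inventions.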
Figure 1: A frequency distribution of word frequencies in English Live Spaces.
From Table 1 there are many domain-specific words in Live Spaces. Some of these are self-evidently domain-specific (blogentry, blog, photoalbum) and two more can be deduced to be domain-specific in that they relate to the contents of Live Spaces and are much more frequent than in the BNC or university corpora (photo, entry).
The occurrence of tense-marked verb forms gives a mixed pattern. The past tense forms was (BNC rank 9; university rank 30; Live Spaces rank 15) and had (BNC 23; university 136; Live Spaces 53) are rarer in Live Spaces than in the BNC. The present tense forms is (BNC 7; university 7; Live Spaces 10), are (BNC 21; university 15; Live Spaces 32) and have (BNC 26; university 25; Live Spaces 21) are more mixed: two of the three are rarer in Live Spaces. The future marker will (BNC 41; university 23; Live Spaces 41) shows no difference. In conclusion, there is some evidence of less of an orientation towards the past in Live Spaces in comparison to the BNC. This may be partly due to the incorporation of a body of narrative text in the BNC.
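The rank comparisons in the paragraph above can be restated mechanically. The following sketch uses only the ranks quoted in that paragraph and labels each word by whether its Live Spaces rank is lower in the list (rarer), higher in the list (more common) or the same as its BNC rank. The function name and the labels are assumptions for illustration, and a rank difference is not a significance test.

```python
# Ranks quoted in the paragraph above: (BNC, university Web sites, Live Spaces).
RANKS = {
    "was": (9, 30, 15),
    "had": (23, 136, 53),
    "is": (7, 7, 10),
    "are": (21, 15, 32),
    "have": (26, 25, 21),
    "will": (41, 23, 41),
}

def compare_to_bnc(ranks):
    """Label each word by how its Live Spaces rank relates to its BNC rank."""
    labels = {}
    for word, (bnc, _university, live) in ranks.items():
        if live > bnc:
            labels[word] = "rarer"        # a larger rank number means less frequent
        elif live < bnc:
            labels[word] = "more common"
        else:
            labels[word] = "same"
    return labels

labels = compare_to_bnc(RANKS)
```

Because ranks are relative to the other words within each corpus, the labels describe position in the list rather than absolute frequency.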
Live Spaces shows a pronounced orientation towards the writer in terms of the frequency of personal pronouns (a high position for I, with me and my missing from the BNC top 50). It also shows a wider tendency towards personal engagement in the higher position of the second-person pronouns you and your (although your is higher in the university set).
Although gender norms in offline and online communication are often more in terms of content and style than old-fashioned inappropriate use of gendered words (Herring, 2003; Livia, 2003), the gendered words in the corpus show a clear pattern. The masculine words he (BNC 14; university 74; Live Spaces 24) and his (BNC 22; university 100; Live Spaces 46) are relatively rare in Live Spaces. The feminine words her (BNC 34; university 336; Live Spaces 63) and she (BNC 30; university 381; Live Spaces 62) are also rarer in Live Spaces. More importantly, however, the feminine words are much rarer than the equivalent masculine words in Live Spaces, and so there is clear evidence of a gender bias towards male pronouns. An investigation into a random sample of occurrences of he suggested that the main reason was the number of discussions of news or political events involving male politicians (e.g., George Bush) and other public figures. Presumably there is a predominance of men in the news. In addition, there were a number of other, apparently less common causes of gender bias:
- Religious discussions, using he for a God;
- Some storytelling, using male main characters, including animals;
- Hypothetical discussions of professional figures who are assumed to be male (e.g., what a lawyer would do in a divorce case); and,
- Discussions of events involving male police officers (e.g., receiving a speeding fine).
In summary, the gender imbalance seems to stem mainly from society but also to some extent from gender-biased linguistic conventions (gods and professionals tend to be discussed as male).
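Drawing a random sample of occurrences of a word such as he, each shown with some surrounding context, could be done along the following lines. This is a hypothetical sketch: the context width, the fixed random seed, the function names and the sample text are all assumptions, and the classification of each occurrence by cause would still be a manual step.

```python
import random
import re

def concordance(text, keyword, width=30):
    """Return each whole-word occurrence of keyword with `width` characters of context either side."""
    lines = []
    pattern = rf"\b{re.escape(keyword)}\b"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        start = max(0, match.start() - width)
        end = min(len(text), match.end() + width)
        lines.append(text[start:end])
    return lines

def sample_concordance(text, keyword, k, seed=0):
    """Draw up to k concordance lines at random, reproducibly for a given seed."""
    lines = concordance(text, keyword)
    rng = random.Random(seed)
    return rng.sample(lines, min(k, len(lines)))

# Invented text; words such as "the", "Then" and "other" must not match "he".
text = "He said the president would visit. Then he left. The other side waited."
hits = concordance(text, "he")
picked = sample_concordance(text, "he", k=5)
```

The word-boundary match matters here because he is embedded in many common words, which would otherwise swamp the sample.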
The results of the analysis supported two of the research hypotheses fully: the importance of domain-specific terms and the author/relationship focus of Live Spaces text. The results partially supported a hypothesis of a current temporal orientation in Live Spaces text: common past tense forms were rarer and the future marker will was equally common, but some present tense forms were more common and some less common.
Two hypotheses must be rejected. First, rare words, although common, were not more common than would be expected in a corpus following natural language laws. It seems that in Live Spaces text (mainly blogs and photojournals) there is little attempt to play with word creation and word spellings. It seems likely that a different result would have been obtained for other social network Web site text, such as MySpace comment sections. Hence it should not be assumed that all social networking text is equally innovative in its spelling and usage.
The second rejected hypothesis is the absence of a gender bias. Live Spaces text displayed a clear masculine orientation, although both masculine and feminine common words were less common than in the BNC. This bias seems to be predominantly due to discussions of news events, which are dominated by male-led professions, such as politics.
About the author
Mike Thelwall is Professor of Information Science in the School of Computing & Information Technology at the University of Wolverhampton.
Email: m [dot] thelwall [at] wlv [dot] ac [dot] uk
D. Biber, 2003. Variation among University spoken and written registers: A new multi-dimensional analysis, In: P. Leistyna and C.F. Meyer (editors). Corpus analysis: Language structure and language use. Amsterdam: Rodopi, pp. 47–70.
d. boyd, 2007a. Why youth heart social network sites: The role of networked publics in teenage social life, In: D. Buckingham (editor). Youth, identity, and digital media. Cambridge, Mass.: MIT Press, pp. 119–142.
d. boyd, 2007b. Viewing American class divisions through Facebook and MySpace, Apophenia Blog Essay (24 June), at http://www.danah.org/papers/essays/ClassDivisions.html, accessed 12 July 2007.
d. boyd, 2006. Friends, Friendsters, and MySpace Top 8: Writing community into being on social network sites, First Monday, volume 11, number 12 (December), at http://journals.uic.edu/fm/article/view/1418/13, accessed 23 June 2007.
L. Burnard, 1995. Users reference guide to the British National Corpus. Oxford: Oxford University Computing Services.
W.B. Cavnar and J.M. Trenkle, 1994. N-Gram-Based Text Categorization, Proceedings of SDAIR-94, Third Annual Symposium on Document Analysis and Information Retrieval, pp. 161–175.
D. Fono and K. Raynes-Goldie, 2006. Hyperfriendship and beyond: Friendship and social norms on Livejournal, In: M. Consalvo and C. Haythornthwaite (editors). Internet research annual: Selected papers from the Association of Internet Researchers Conference. Volume 4. New York: Peter Lang, and at http://k4t3.org/publications/hyperfriendship.pdf, accessed 10 February 2008.
C. Geisler, 2003. Gender-based variation in nineteenth century English letter writing, In: P. Leistyna and C.F. Meyer (editors). Corpus analysis: Language structure and language use. Amsterdam: Rodopi, pp. 87–106.
S.A. Golder, D. Wilkinson, and B.A. Huberman, 2007. Rhythms of social interaction: Messaging within a massive online network, Third International Conference on Communities and Technologies CT2007, East Lansing, Mich., and at http://www.hpl.hp.com/research/idl/papers/facebook/facebook.pdf, accessed 10 February 2008.
G. Grefenstette, 1995. Comparing two language identification schemes, Proceedings of Analisi Statistica dei Dati Testuali (JADT 95), volume I, pp. 263–268.
B. Hammersley, 2005. Developing feeds with RSS and Atom. Sebastopol, Calif.: O'Reilly.
S.C. Herring, 2003. Gender and power in online communication, In: J. Holmes and M. Meyerhoff (editors). The handbook of language and gender. Oxford: Blackwell, pp. 202–228.
P. Hosali, 1991. Some syntactic and lexico-semantic features of an Indian variant of English, Central Institute of English and Foreign Languages Bulletin, volume 3, numbers 1–2, pp. 65–83.
A. Jaworski, 1990. The acquisition and perception of formulaic language and foreign language teaching, Multilingua, volume 9, number 4, pp. 397–411. http://dx.doi.org/10.1515/mult.1990.9.4.397
A. Kilgarriff and G. Grefenstette, 2003. Introduction to the special issue on the Web as corpus, Computational Linguistics, volume 29, number 3, pp. 333–347, and at http://www.kilgarriff.co.uk/Publications/2003-KilgGrefenstette-WACIntro.pdf, accessed 10 February 2008.
A. Livia, 2003. One man in two is a woman: Linguistic approaches to gender in literary texts, In: J. Holmes and M. Meyerhoff (editors). The handbook of language and gender. Oxford: Blackwell, pp. 142–158.
A.J. Lotka, 1926. The frequency distribution of scientific productivity, Journal of the Washington Academy of Sciences, volume 16, number 12, pp. 317–323.
T. McEnery and A. Wilson, 2001. Corpus linguistics: An introduction. Second edition. Edinburgh: Edinburgh University Press.
C. Meyer, R. Grabowski, H.Y. Han, K. Mantzouranis, and S. Moses, 2003. The World Wide Web as linguistic corpus, In: P. Leistyna and C.F. Meyer (editors). Corpus analysis: Language structure and language use. Language and Computers, number 46. Amsterdam: Rodopi, pp. 241–254.
L. Prescott, 2007. Hitwise US consumer generated media report, at http://www.hitwise.com/, accessed 19 March 2007.
M. Thelwall, 2005a. Creating and using Web corpora, International Journal of Corpus Linguistics, volume 10, number 4, pp. 517–541. http://dx.doi.org/10.1075/ijcl.10.4.07the
M. Thelwall, 2005b. Text characteristics of English language university Web sites, Journal of the American Society for Information Science and Technology, volume 56, number 6, pp. 609–619. http://dx.doi.org/10.1002/asi.20126
G.K. Zipf, 1949. Human behavior and the principle of least effort: An introduction to human ecology. Cambridge, Mass.: Addison-Wesley.
Paper received 16 July 2007; accepted 15 January 2008.
Copyright © 2008, First Monday.
Copyright © 2008, Mike Thelwall.
First Monday, Volume 13, Number 2 - 4 February 2008