Mining the Blogosphere: Age, gender and the varieties of self-expression

The growth of the blogosphere offers an unprecedented opportunity to study language and how people use it on a large scale. We present an analysis of over 140 million words of English text drawn from the blogosphere, exploring if and how age and gender affect writing style and topic. Our primary result is that a number of stylistic and content–based indicators are significantly affected by both age and gender, and that the main difference between older and younger bloggers, and between male and female bloggers, lies in the extent to which their discourse is outer– or inner–directed. In fact, the linguistic factors that increase in use with age are just those used more by males of any age, and conversely, those that decrease in use with age are those used more by females of any age.

Contents

Introduction
Automated blog analysis
Corpus design
Factor analysis
Age–linked variation
Gender–linked variation
Correlating age and gender
Conclusions

Introduction

A great deal of research has been carried out over the last few decades on how different groups of people use language differently (see, e.g., Labov, 1972; Biber and Finegan, 1994; Schneider, 2002). This research has often been constrained, however, by the time and expense needed to collect and annotate data. Studies have therefore often had to make do with comparatively small samples, which makes it difficult to determine how general any results actually are.

The growth of the blogosphere, however, provides an interesting way out of this conundrum. Anyone can write a blog, and blogs are written about anything the blogger wishes and in whatever style they wish, typically with no editorial control. Moreover, blogs are electronically available for downloading, so that data collection is greatly eased. Since there are many millions of such blogs, the blogosphere offers an unprecedented opportunity to study, in a natural context and over a vast scale, how different groups of people write.

We report here our analysis of a large corpus of blog postings to see if and how writing topic and style vary with age and gender of the blogger. There has been much research interest in possible differences between male and female language use (Coates, 1986; Labov, 1990; Holmes, 1997; Bergvall, 1999), some of which has raised great interest in the popular literature (e.g., Tannen, 2001). It has also recently been shown that writing topic and style are useful indicators of age–linked psychological developments in personality, interests, and feelings (Pennebaker, et al., 2003; Pennebaker and Stone, 2003). As we have noted, however, previous studies have generally been limited by the difficulty of data gathering, and so have relied on relatively small amounts of text (cf. Bailey and Dyer, 1992; Biber, 1993; Schneider, 2002), often gathered in artificial laboratory settings.

Our corpus comprises over 140 million words of naturally occurring text from randomly selected blogs by men and women from their teens into their forties. By applying factor analysis and machine learning techniques, we demonstrate here clear and consistent patterns of age– and gender–linked variation in writing topic and style. We find that older bloggers tend to write about externally–focused topics, while younger bloggers tend to write about more personally–focused topics; changes in writing style with age are closely related. Perhaps surprisingly, similar patterns also characterize gender–linked differences in language style. In fact, the linguistic factors that increase in use with age are just those used more by males of any age, and conversely, those that decrease in use with age are those used more by females of any age. Our results thus confirm and generalize earlier results on age–linked (Pennebaker, et al., 2003; Pennebaker and Stone, 2003; Burger and Henderson, 2006) and gender–linked (Mulac and Lundell, 1994; Biber, 1994; Argamon, et al., 2003; Newman, et al., in press) variation in language use. We suggest that our results are best explained by positing a single factor distinguishing internal from external psychological focus that underlies both age– and gender–linked variation in language use. Preliminary results along these lines were previously presented by the authors in (Schler, et al., 2006).

Automated blog analysis

To properly situate our current study, we note the growing literature on automated blog analysis, as exemplified by the 2006 AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs [1] and the annual Workshop on the Weblogging Ecosystem [2]. Automated techniques related to our own have been applied to extracting and tracking feelings and opinions in the blogosphere (Ku, et al., 2006; Mishne and de Rijke, 2006; Mihalcea and Liu, 2006), social network and related analyses (Gruhl, et al., 2004; Hsu, et al., 2006; Lin, et al., 2006), analyzing weblog comments (Mishne and Glance, 2006), finding “hot” stories and trends in the blogosphere (Glance, et al., 2004; Wu and Tseng, 2006), and identifying “spam blogs” (or “splogs”), artificially created to boost search engine ratings or attract commercial traffic (Kolari, et al., 2006; Rubin and Liddy, 2006).

Previous work on gender and age effects on the blogosphere has generally been of comparatively small scale. Herring, et al. (2004) have considered several blog genres, particularly the distinction between “personal journal” type blogs and “filter” type blogs (which collect and filter information and links). They have noted that most filter blogs are written by male bloggers and by older bloggers. Similarly, Nowson, et al. (2005) found a strong effect of author sex on blog language, finding that female–authored blogs were more contextualized (as measured by Heylighen and Dewaele’s (2002) F measure) than male–authored blogs. In this vein, Huffaker and Calvert (2005) found that teen bloggers are particularly likely to use blogging as a forum for exploring personal issues such as sexual identity. With a few exceptions (e.g., Herring, et al., 2004; Burger and Henderson, 2006), there has been little work on age in the blogosphere.

Some work on computer–mediated communication (CMC) other than blogs (i.e., discussion groups and e–mail) has applied discourse and content analysis to relevant issues of gender–linked language. Thus, for example, it has been found that male–dominated discussion groups had more statements of fact and fewer self–disclosures (Savicki, et al., 1996), that women had higher rates of using emoticons in their messages (Witmer, 1996), and that e–mail messages about vacations written by females mentioned more about social aspects and shopping while males focused more impersonally on the location, the journey, and local people (Colley and Todd, 2002).

Our research extends this previous work to the automated analysis of a much larger corpus of texts than those previously analyzed for such sociolinguistic variation (compare, e.g., Labov, 1990; Bailey and Dyer, 1992; Mulac and Lundell, 1994; Herring and Paolillo, 2006). This has the effect of minimizing possible sample biases, which is critical when dealing with over a thousand textual variables, as we do here. Moreover, as far as we are aware, our current study is the first to examine the relationship between how language use varies by age with how it varies by gender.

Corpus design

We gathered a collection of blogs from the Web site blogger.com in August 2004. We collected all blogs on the site which (a) contained at least 500 total words including at least 200 occurrences of common English words, and (b) had author–provided indication of both gender and age. We then randomly selected 10 percent of the documents as a holdout set (for purposes described below). This left an initial collection of 46,947 blogs, summarized in Table 1 (our unit of analysis throughout this paper is each blogger’s collected writing from inception until harvest; we do not distinguish between different posts by a given blogger). Note that over 60 percent of bloggers age 17 and below are females, while about 58 percent of bloggers older than 17 are males.

Table 1: Distribution of blogs in our initial collection by age and gender.

Age            Female    Male    Total
13–17            6949    4120    11069
18–22            7393    7690    15083
23–27            4043    6062    10105
28–32            1686    3057     4743
33–37             860    1827     2687
38–42             374     819     1193
43–47             263     584      847
48 and older      314     906     1220
Total           21882   25065    46947

For purposes of analysis, formatting and non–English text were automatically removed from each blog. To enable reliable age categorization (since a blog can span several years of writing), all blogs for “boundary ages” (ages 18–22 and 28–32) were removed. Each blogger was categorized by age at time of harvest: “10s” (ages 13–17), “20s” (ages 23–27) and “30+” (ages 33–47), and also by gender: “male” and “female.” The number of blogs of each gender within each age category was equalized by randomly deleting surplus blogs from the larger gender category. The final corpus thus contained 19,320 blogs (8,240 in 10s, 8,086 in 20s, and 2,994 in 30+), comprising a total of 681,288 posts and over 140 million words. There were, on average, approximately 35 posts and 7,300 words in each blog in the corpus.
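The balancing step can be illustrated with a short sketch. The record layout (tuples of identifier, age class and gender), the function name and the random seed are assumptions made for illustration only; the counts are those of Table 1, with the 30+ class summing the rows for ages 33–47.

```python
import random

# Blog counts per age class and gender, taken from Table 1
# (the 30+ class sums the 33-37, 38-42 and 43-47 rows).
TABLE1 = {
    "10s": {"female": 6949, "male": 4120},
    "20s": {"female": 4043, "male": 6062},
    "30+": {"female": 860 + 374 + 263, "male": 1827 + 819 + 584},
}


def balance_by_gender(blogs, seed=0):
    """Randomly drop surplus blogs so each age class has equal numbers per gender.

    `blogs` is a list of (blog_id, age_class, gender) tuples (hypothetical layout).
    """
    rng = random.Random(seed)
    groups = {}
    for blog in blogs:
        groups.setdefault((blog[1], blog[2]), []).append(blog)
    kept = []
    for age in sorted({age for age, _ in groups}):
        females = groups.get((age, "female"), [])
        males = groups.get((age, "male"), [])
        n = min(len(females), len(males))  # size of the smaller gender group
        kept.extend(rng.sample(females, n))
        kept.extend(rng.sample(males, n))
    return kept


# Synthetic stand-in records carrying only the Table 1 counts.
blogs = [
    (f"{age}-{gender}-{i}", age, gender)
    for age, counts in TABLE1.items()
    for gender, n in counts.items()
    for i in range(n)
]
balanced = balance_by_gender(blogs)
sizes = {age: sum(1 for b in balanced if b[1] == age) for age in TABLE1}
print(sizes, len(balanced))
```

Under these assumptions each age class ends up with twice the size of its smaller gender group, which is how the class sizes reported above follow from Table 1.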

Factor analysis

We begin by considering the 1000 most frequent words in the corpus. These comprise 323 different function words and 677 different content words, accounting for 59.4 percent and 21.7 percent, respectively, of all word occurrences. We performed an automated factor analysis on the rate of use of each of the 677 content words, to find groups of related words that tend to occur in similar documents. This process, referred to as a meaning extraction method (Chung and Pennebaker, 2007), yielded twenty coherent factors that depict clear and distinct themes, mostly topic–related. Word lists for the twenty factors, along with suggestive headings (for reference), are given in Table 2. In addition, we divided the function words into several categories according to their parts–of–speech (pronouns, auxiliary verbs, etc.).
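As a rough illustration of the input to this step (not of the factor analysis itself), the sketch below selects the most frequent words, splits them into function and content words, reports the share of all word occurrences each group accounts for, and computes per-document rates of use. The whitespace tokenizer, the function names and the tiny function-word list are assumptions for illustration; the study's own tokenization and word lists are not specified here.

```python
from collections import Counter


def top_word_coverage(documents, function_words, top_n=1000):
    """Split the top_n most frequent words into function and content words
    and return each group's share of all word occurrences."""
    counts = Counter(w for doc in documents for w in doc.lower().split())
    total = sum(counts.values())
    top = [w for w, _ in counts.most_common(top_n)]
    func = [w for w in top if w in function_words]
    content = [w for w in top if w not in function_words]
    func_share = sum(counts[w] for w in func) / total
    content_share = sum(counts[w] for w in content) / total
    return func, content, func_share, content_share


def usage_rates(document, words):
    """Rate of use of each word in one document (occurrences per token)."""
    tokens = document.lower().split()
    counts = Counter(tokens)
    n = len(tokens) or 1  # avoid division by zero for empty documents
    return [counts[w] / n for w in words]


# Toy corpus and a hypothetical, deliberately tiny function-word list.
docs = [
    "the game was fun and the game was long",
    "i read the book and the book was good",
]
function_words = {"the", "was", "and", "i"}
func, content, func_share, content_share = top_word_coverage(
    docs, function_words, top_n=5
)
rates = usage_rates(docs[0], content)
print(func, content, func_share, content_share, rates)
```

A matrix whose rows are such per-document rate vectors for the content words is the kind of input on which a factor analysis could then be run.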

Table 2: Words in each factor.

Factor Words

Conversation know, people, think, person, tell, feel, friends, talk, new, talking, mean, ask, understand, feelings, care, thinking, friend, relationship, realize, question, answer, saying

AtHome woke, home, sleep, today, eat, tired, wake, watch, watched, dinner, ate, bed, day, house, tv, early, boring, yesterday, watching, sit

Family years, family, mother, children, father, kids, parents, old, year, child, son, married, sister, dad, brother, moved, age, young, months, three, wife, living, college, four, high, five, died, six, baby, boy, spend, christmas

Time friday, saturday, weekend, week, sunday, night, monday, tuesday, thursday, wednesday, morning, tomorrow, tonight, evening, days, afternoon, weeks, hours, july, busy, meeting, hour, month, june

Work work, working, job, trying, right, met, figure, meet, start, better, starting, try, worked, idea

PastActions said, asked, told, looked, walked, called, talked, wanted, kept, took, sat, gave, knew, felt, turned, stopped, saw, ran, tried, picked, left, ended

Games game, games, team, win, play, played, playing, won, season, beat, final, two, hit, first, video, second, run, star, third, shot, table, round, ten, chance, club, big, straight

Internet site, email, page, please, website, web, post, link, check, blog, mail, information, free, send, comments, comment, using, internet, online, name, service, list, computer, add, thanks, update, message

Location street, place, town, road, city, walking, trip, headed, front, car, beer, apartment, bus, area, park, building, walk, small, places, ride, driving, looking, local, sitting, drive, bar, bad, standing, floor, weather, beach, view

Fun fun, im, cool, mom, summer, awesome, lol, stuff, pretty, ill, mad, funny, weird

Food/Clothes food, eating, weight, lunch, water, hair, life, white, wearing, color, ice, red, fat, body, black, clothes, hot, drink, wear, blue, minutes, shirt, green, coffee, total, store, shopping

Poetic eyes, heart, soul, pain, light, deep, smile, dreams, dark, hold, hands, head, hand, alone, sun, dream, mind, cold, fall, air, voice, touch, blood, feet, words, hear, rain, mouth

Books/Movies book, read, reading, books, story, writing, written, movie, stories, movies, film, write, character, fact, thoughts, title, short, take, wrote

Religion god, jesus, lord, church, earth, world, word, lives, power, human, believe, given, truth, thank, death, evil, own, peace, speak, bring, truly

Romance forget, forever, remember, gone, true, face, spent, times, love, cry, hurt, wish, loved

Swearing shit, fuck, fucking, ass, bitch, damn, hell, sucks, stupid, hate, drunk, crap, kill, guy, gay, kid, sex, crazy

Politics bush, president, iraq, kerry, war, american, political, states, america, country, government, john, national, news, state, support, issues, article, michael, bill, report, public, issue, history, party, york, law, major, act, fight, poor

Music music, songs, song, band, cd, rock, listening, listen, show, favorite, radio, sound, heard, shows, sounds, amazing, dance

School school, teacher, class, study, test, finish, english, students, period, paper, pass

Business system, based, process, business, control, example, personal, experience, general

Age–linked variation

Table 3 shows the mean frequency of use of each factor in each age and gender class, as well as the same data for function words grouped by part of speech.

Table 3: Mean frequencies of factor and part–of–speech usage by age and gender.

Factor               10s     20s     30+    Male  Female  Overall
Conversation        1.74    1.55    1.33    1.47    1.72    1.59
AtHome              1.11     .80     .75     .86     .98     .92
Family               .65     .75     .94     .69     .79     .74
Time                 .65     .74     .68     .65     .73     .69
PastActions          .74     .62     .63     .62     .73     .68
Work                 .61     .75     .70     .67     .69     .68
Games                .67     .66     .66     .76     .57     .67
Internet             .61     .63     .68     .74     .52     .63
Location             .52     .65     .63     .60     .58     .59
Fun                  .88     .36     .28     .50     .64     .57
Food/Clothes         .53     .55     .55     .49     .60     .54
Poetic               .52     .53     .52     .48     .57     .53
Books/Movies         .51     .54     .54     .54     .51     .53
Religion             .44     .50     .55     .50     .46     .48
Romance              .54     .44     .38     .39     .55     .47
Swearing             .54     .35     .25     .41     .42     .41
Politics             .27     .41     .56     .47     .28     .37
Music                .36     .29     .26     .34     .29     .32
School               .35     .19     .17     .26     .25     .26
Business             .07     .13     .16     .13     .08     .11
Articles            5.10    6.46    6.97    6.46    5.45    5.96
PersonalPronouns   11.72   10.44    9.88    9.84   11.97   10.96
AuxiliaryVerbs      9.04    8.90    8.83    8.76    9.14    8.95
Conjunctions        2.89    2.59    2.48    2.63    2.76    2.70
Prepositions       11.83   13.04   13.30   12.76   12.36   12.56

First of all, these results indicate clear differences in both preferred topic and preferred style between bloggers of different ages [3]. Usage of words associated with Family, Religion, Politics, Business, and Internet increases with age, while usage of words associated with Conversation, AtHome, Fun, Romance, Music, School, and Swearing decreases significantly with age. (All effects mentioned are statistically significant with p < 0.001.) None of the other factors varies directly with age in a statistically significant fashion. In addition to these topic–related differences in blogs with blogger age, we also see clear differences in style, as measured by frequencies of grammatical parts–of–speech. Usage of PersonalPronouns, Conjunctions, and AuxiliaryVerbs decreases significantly with age, while usage of Articles and Prepositions increases significantly with age.

In fact, such variations in word frequency can be exploited to effectively predict the age of a blog’s writer. To show this, we computed, for each blog, a vector containing the frequencies in the blog of the above–mentioned 323 function words as well as the 1000 most informative words [4] for age. Two different machine–learning algorithms, Bayesian multinomial logistic regression (BMR: Madigan, et al., 2005) and multi–class balanced real–valued Winnow (WIN: Littlestone, 1988; Dagan, et al., 1997), were applied to these frequency vectors to construct classification models for author age. Ten–fold cross–validation [5] was used to estimate generalization accuracy. The results show automatic classification of an unseen document into the correct age interval (10s, 20s, or 30+) with an accuracy of 77.4 percent (using BMR) and 75.0 percent (using WIN). Examination of the confusion matrix shows that 10s are distinguishable from 30+ with over 96 percent accuracy, whereas distinguishing 20s from either of the other two classes is more difficult. Using only function words gives accuracies of 69.4 percent (BMR) and 67.7 percent (WIN), while using just the high information–gain words gives accuracies of 76.2 percent (BMR) and 75.9 percent (WIN). Thus, as we might have expected, topic preference is most related to blogger age, although there is definitely a marked effect on writing style as well.
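The selection of informative words can be sketched as follows. This is a simplified illustration, not the procedure used in the study: it assumes each document is reduced to the set of words it contains, so that a word's frequency is binarized to present or absent, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs, labels, word):
    """Reduction in class entropy from knowing whether `word` occurs in a document.

    `docs` is a list of token sets; word frequency is binarized (an assumption).
    """
    present = [lab for doc, lab in zip(docs, labels) if word in doc]
    absent = [lab for doc, lab in zip(docs, labels) if word not in doc]
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (present, absent) if part
    )
    return entropy(labels) - remainder


def most_informative(docs, labels, k):
    """Return the k words with the highest information gain."""
    vocabulary = sorted(set().union(*docs))
    return sorted(
        vocabulary, key=lambda w: information_gain(docs, labels, w), reverse=True
    )[:k]


# Toy documents (as token sets) with hypothetical age labels.
docs = [
    {"lol", "school"},
    {"lol", "homework"},
    {"mortgage", "work"},
    {"work", "meeting"},
]
labels = ["10s", "10s", "30+", "30+"]
print(information_gain(docs, labels, "lol"), information_gain(docs, labels, "school"))
print(most_informative(docs, labels, 2))
```

In this toy data a word that occurs in every document of one class and in none of the other carries the full one bit of class information, while a word that occurs in a single document carries much less.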

Gender–linked variation

Regarding blogger gender, we see (Table 3) that Articles and Prepositions are used significantly more by male bloggers, while PersonalPronouns, Conjunctions, and AuxiliaryVerbs are used significantly more by female bloggers. These are the same features that we previously found to indicate male and female writing styles in published fiction and non–fiction works (Argamon, et al., 2003). In content–based features, we see the factors Religion, Politics, Business, and Internet used more frequently by male bloggers, while the factors Conversation, AtHome, Fun, Romance, and Swearing are more often used by female bloggers. (All effects mentioned are statistically significant with p < 0.001.) Prediction of author gender (as above) from function words and the 1000 words with highest information–gain for gender gave accuracies of 79.3 percent (BMR) and 80.5 percent (WIN). These results are consistent with classification studies on author gender in other types of texts (Argamon, et al., 2003; de Vel, et al., 2002; Hota, et al., 2006).

It should be noted that style and content effects are highly correlated: use of multiple regressions indicates that controlling for style effects essentially eliminates content effects and vice versa. Thus, it may be that choice of content determines particular style preferences, or both content and style may be influenced by a single underlying variable such as genre preference (Herring, et al., 2004). It seems likely, though, that a more general sociolinguistic variable underlies this phenomenon, for as we have noted, the results of the current study on gender–linked style are virtually identical to those found in studies of vastly differing genres, including published fiction and non–fiction (Argamon, et al., 2003).

Correlating age and gender

It has not escaped our attention that with few exceptions, the factors and parts–of–speech that are used significantly more by younger (older) bloggers are also used significantly more by female (male) bloggers. Thus, Articles, Prepositions, Religion, Politics, Business, and Internet are used more by male bloggers as well as older bloggers, while PersonalPronouns, Conjunctions, AuxiliaryVerbs, Conversation, AtHome, Fun, Romance, and Swearing are used more by female bloggers as well as younger bloggers. There are only three exceptions to this pattern: Family, used more by older bloggers and by females; Music, used more by younger bloggers and by males; and School, for which there is no significant difference between male and female usage.

The force of this observation is highlighted when examining those individual words that evince both strong age–linked and gender–linked effects. We consider the 316 words that are among both the 1000 words with highest information gain for age and the 1000 words with highest information gain for gender (as computed on the holdout set). The scatterplot in Figure 1 plots log(w(male)/w(female)) against log(w(30+)/w(10s)), where w(A) is the average frequency of word w in documents of class A. Note that every word but one (“husband”) lies in the first (male and 30+) or third (female and 10s) quadrants. That is, with just the one exception, every word we considered that is used more by females is used more by younger bloggers and vice versa. The Pearson correlation between the male/female and 30+/10s log–ratios is 0.71.

Figure 1: Scatterplot showing log(w(male)/w(female)) on the x–axis plotted against log(w(30+)/w(10s)) on the y–axis.
Points shown represent the words with highest information gain for both age and gender as described in the text.
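The log–ratio comparison can be sketched in a few lines. Because the per–word frequencies are not reproduced in this paper, the sketch substitutes factor–level means from Table 3 for eight of the factors; this is an assumption made purely for illustration, so the resulting coefficient is not the 0.71 reported for the 316 words. Function names are hypothetical.

```python
import math

# Mean usage from Table 3: factor -> (10s, 30+, male, female).
# Factor-level stand-in for the per-word frequencies used in the study.
TABLE3 = {
    "Conversation": (1.74, 1.33, 1.47, 1.72),
    "Politics": (0.27, 0.56, 0.47, 0.28),
    "Internet": (0.61, 0.68, 0.74, 0.52),
    "Fun": (0.88, 0.28, 0.50, 0.64),
    "Romance": (0.54, 0.38, 0.39, 0.55),
    "Business": (0.07, 0.16, 0.13, 0.08),
    "Family": (0.65, 0.94, 0.69, 0.79),
    "Music": (0.36, 0.26, 0.34, 0.29),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# x: log(male/female); y: log(30+/10s), matching the axes described in the text.
gender_ratio = [math.log(male / female) for _, _, male, female in TABLE3.values()]
age_ratio = [math.log(older / teens) for teens, older, _, _ in TABLE3.values()]

r = pearson(gender_ratio, age_ratio)
# Items in the first or third quadrant have log-ratios of the same sign.
same_quadrant = [
    name for name, x, y in zip(TABLE3, gender_ratio, age_ratio) if x * y > 0
]
print(round(r, 2), same_quadrant)
```

With these factor means, the two items falling outside the first and third quadrants are Family and Music, the same exceptions noted above.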

Conclusions

The significance of these results is twofold. First is the fact that, in contrast to many previous similar studies, we have analyzed many millions of words of naturally occurring text. This fact lends credence to the conclusion that significant variation in our data reflects real variation in the world (or at least, the world of those likely to write English–language blogs), and is not a mere artifact of our experimental procedure.

Perhaps more significantly, however, our findings serve to link together earlier observations regarding age–linked and gender–linked writing variation that have not previously been connected. Previous studies investigating gender and language have shown gender–linked differences along dimensions of involvedness (Biber, 1995; Argamon, et al., 2003) or contextualization (Heylighen and Dewaele, 2002). Other studies have found age–linked differences in the immediacy and informality of writing (Pennebaker, et al., 2003). The current study suggests that these two sets of results are closely related. Indeed, they likely both reflect a single underlying distinction between inner– and outer–oriented communication that may explain both gender–linked and age–linked variation in language use.

About the authors

Shlomo Argamon is Associate Professor in the Department of Computer Science at the Illinois Institute of Technology in Chicago.
E–mail: argamon [at] iit [dot] edu

Professor Moshe Koppel, Department of Computer Science, Bar–Ilan University (Ramat Gan 52900, Israel).
E–mail: moishk [at] gmail [dot] com

James W. Pennebaker is Professor and Chair of the Department of Psychology at the University of Texas in Austin.
E–mail: pennebaker [at] mail [dot] utexas [dot] edu

Dr. Jonathan Schler, Department of Computer Science, Bar–Ilan University (Ramat Gan 52900, Israel).
E–mail: schler [at] gmail [dot] com

Notes

1. http://www.aaai.org/Symposia/Spring/sss06.php.

2. http://www.blogpulse.com/www2006-workshop/.

3. We must, of course, keep in mind that since this study is synchronic, we cannot separate generational effects from age effects. Moreover, since older bloggers are somewhat less common, they might represent an atypical demographic as early adopters of technology.

4. The “informativeness” of words for a particular text class (age or gender) was measured by the “information gain” measure (Quinlan, 1986), an information–theoretic formula estimating how much information about the class of a text is conveyed by knowing the frequency of a particular word in the text.

5. Ten–fold cross–validation is a standard technique for estimating the generalization accuracy of a machine learning method (see Mitchell, 1997). The data is randomly divided into ten equally sized segments, and the system repeatedly trains on nine of them and tests on the remaining one; the average of these accuracies is the reported result. Thus we avoid testing on examples that were used in training.
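The cross–validation procedure of note 5 can be sketched as follows. The function names are hypothetical, and the majority–class baseline merely stands in for a real learner such as BMR or Winnow, which are not implemented here.

```python
import random
from collections import Counter


def k_fold_accuracy(examples, labels, train, k=10, seed=0):
    """Estimate generalization accuracy by k-fold cross-validation.

    `train(examples, labels)` must return a function mapping an example to a label.
    """
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized segments
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        model = train([examples[i] for i in train_idx], [labels[i] for i in train_idx])
        # Test only on the segment that was not used for training.
        correct = sum(model(examples[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


def train_majority(train_examples, train_labels):
    """Baseline learner: always predict the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda example: majority


# Toy data: 70 examples of class "a" and 30 of class "b".
examples = list(range(100))
labels = ["a"] * 70 + ["b"] * 30
accuracy = k_fold_accuracy(examples, labels, train_majority)
print(accuracy)
```

Any learner that follows the same `train` convention could be substituted for the baseline without changing the evaluation loop.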

References

S. Argamon, M. Koppel, J. Fine, A. R. Shimoni, 2003. “Gender, Genre, and Writing Style in Formal Written Texts,” Text, volume 23, number 3, pp. 321–346; also at http://www.cs.biu.ac.il/~koppel/papers/male-female-text-final.pdf, accessed 21 August 2007.

G. Bailey and M. Dyer, 1992. “An approach to sampling in dialectology,” American Speech, volume 67, number 1, pp. 3–20. http://dx.doi.org/10.2307/455756

V.L. Bergvall, 1999. “Toward a comprehensive theory of language and gender,” Language in Society, volume 28, pp. 273–293. http://dx.doi.org/10.1017/S0047404599002080

D. Biber, 1995. Dimensions of register variation: A cross–linguistic comparison. Cambridge: Cambridge University Press.

D. Biber, 1994. “An analytical framework for register studies,” In: D. Biber and E. Finegan (editors). Sociolinguistic perspectives on register. New York: Oxford University Press, pp. 31–56.

D. Biber, 1993. “Using register–diversified corpora for general language studies,” Computational Linguistics, volume 19, number 2, pp. 219–241.

D. Biber and E. Finegan (editors), 1994. Sociolinguistic perspectives on register. New York: Oxford University Press.

J.D. Burger and J.C. Henderson, 2006. “An exploration of observable features related to blogger age,” In: Computational Approaches to Analyzing Weblogs: Papers from the 2006 AAAI Spring Symposium. Menlo Park, Calif.: AAAI Press, pp. 15–20.

C.K. Chung and J.W. Pennebaker, 2007, in press. “Revealing people’s thinking in natural language: Using an automated meaning extraction method in open–ended self–descriptions,” Journal of Research in Personality.

J. Coates, 1986. Women, men, and language: A sociolinguistic account of sex differences in language. London: Longman.

A. Colley and Z. Todd, 2002. “Gender–linked differences in the style and content of e–mails to friends,” Journal of Language and Social Psychology, volume 21, number 4, pp. 380–392. http://dx.doi.org/10.1177/026192702237955

I. Dagan, Y. Karov, and D. Roth, 1997. “Mistake–driven learning in text categorization,” Proceedings of the Second Conference on Empirical Methods in Natural Language Processing (EMNLP–97), pp. 55–63.

O. de Vel, M. Corney, A. Anderson, G. Mohay, 2002. “Language and gender author cohort analysis of e-mail for computer forensics,” Proceedings of the Second Digital Forensic Research Workshop, at http://dfrws.org/2002/papers/Papers/Olivier_DeVel.pdf, accessed 21 August 2007.

N. Glance, M. Hurst, and T. Tomokiyo, 2004. “BlogPulse: Automated trend discovery for weblogs,” Proceedings of the First Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, at http://www.blogpulse.com/papers/www2004glance.pdf, accessed 21 August 2007.

D. Gruhl, R. Guha, D. Liben–Nowell, and A. Tomkins, 2004. “Information diffusion through blogspace,” Proceedings of the 13th International Conference on World Wide Web (New York, N.Y.), pp. 491–501, and at http://theory.lcs.mit.edu/~dln/papers/blogs/idib.pdf, accessed 21 August 2007.

S. Herring and J. Paolillo, 2006. “Gender and genre variation in weblogs,” Journal of Sociolinguistics, volume 10, number 4, pp. 439–459. http://dx.doi.org/10.1111/j.1467-9841.2006.00287.x

S.C. Herring, L.A. Scheidt, S. Bonus, and E. Wright, 2004. “Bridging the gap: A genre analysis of weblogs,” Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS ’04). Los Alamitos, Calif.: IEEE Press; also at http://www.ics.uci.edu/~jpd/classes/ics234cw04/herring.pdf, accessed 21 August 2007.

F. Heylighen and J.–M. Dewaele, 2002. “Variation in the contextuality of language: an empirical measure,” Foundations of Science, volume 7, pp. 293–340. http://dx.doi.org/10.1023/A:1019661126744

J. Holmes, 1997. “Women, language, and identity,” Journal of Sociolinguistics, volume 1, pp. 195–224. http://dx.doi.org/10.1111/1467-9481.00012

S. Hota, S. Argamon, M. Koppel, and I. Zigdon, 2006. “Performing gender: Automatic stylistic analysis of Shakespeare’s characters,” Proceedings of the Digital Humanities Conference (Association for Computers in Humanities and the Association for Literary and Linguistic Computing), at http://lingcog.iit.edu/doc/hota_allc2006.pdf, accessed 21 August 2007.

W.H. Hsu, T. Weninger, T. Pydmarri, and M.S.R. Paradesi, 2006. “Collaborative and structural recommendation of friends using Weblog–based social network analysis,” In: Computational Approaches to Analyzing Weblogs: Papers from the 2006 AAAI Spring Symposium. Menlo Park, Calif.: AAAI Press, pp. 55–60.

D.A. Huffaker and S.L. Calvert, 2005. “Gender, identity, and language use in teenage blogs,” Journal of Computer–Mediated Communication, volume 10, number 2, at http://jcmc.indiana.edu/vol10/issue2/huffaker.html, accessed 21 August 2007.

P. Kolari, A. Java, and T. Finin, 2006. “Characterizing the Splogosphere,” Proceedings of the 3rd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 15th World Wide Web Conference, at http://www.blogpulse.com/www2006-workshop/papers/splogosphere.pdf, accessed 21 August 2007.

L–W. Ku, Y–T. Liang, and H–H. Chen, 2006. “Opinion extraction, summarization and tracking in news and blog corpora,” In: Computational Approaches to Analyzing Weblogs: Papers from the 2006 AAAI Spring Symposium. Menlo Park, Calif.: AAAI Press, pp. 100–107, and at http://nlg18.csie.ntu.edu.tw:8080/opinion/SS0603KuLW.pdf, accessed 21 August 2007.

W. Labov, 1990. “The intersection of sex and social class in the course of linguistic change,” Language Variation and Change, volume 2, pp. 205–254. http://dx.doi.org/10.1017/S0954394500000338

W. Labov, 1972. Sociolinguistic patterns. Philadelphia: University of Pennsylvania Press.

Y–R. Lin, H. Sundaram, Y. Chi, J. Tatemura, and B. Tseng, 2006. “Discovery of blog communities based on mutual awareness,” Proceedings of the 3rd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 15th World Wide Web Conference, at http://www.blogpulse.com/www2006-workshop/papers/wwe2006-discovery-lin-final.pdf, accessed 21 August 2007.

N. Littlestone, 1988. “Learning quickly when irrelevant attributes abound: A new linear–threshold algorithm,” Machine Learning, volume 2, issue 4, pp. 285–318. http://dx.doi.org/10.1007/BF00116827

D. Madigan, A. Genkin, D.D. Lewis, and D. Fradkin, 2005. “Bayesian multinomial logistic regression for author identification,” 25th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering (AIP Conference Proceedings), volume 803, pp. 509–516, and at http://www.stat.rutgers.edu/~madigan/mms/authorID-me05-fixed.pdf, accessed 21 August 2007.

R. Mihalcea and H. Liu, 2006. “A corpus–based approach to finding happiness,” In: Computational Approaches to Analyzing Weblogs: Papers from the 2006 AAAI Spring Symposium. Menlo Park, Calif.: AAAI Press, pp. 139–144, and at http://www.cs.unt.edu/~rada/papers/mihalcea.aaai06ss.pdf, accessed 21 August 2007.

G. Mishne and M. de Rijke, 2006. “Capturing global mood levels using blog posts,” In: Computational Approaches to Analyzing Weblogs: Papers from the 2006 AAAI Spring Symposium. Menlo Park, Calif.: AAAI Press, pp. 145–152, and at http://staff.science.uva.nl/~gilad/pubs/aaai06-blogmoods.pdf, accessed 21 August 2007.

G. Mishne and N. Glance, 2006. “Leave a reply: An analysis of Weblog comments,” Proceedings of the 3rd Annual Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 15th World Wide Web Conference, at http://www.blogpulse.com/www2006-workshop/papers/wwe2006-blogcomments.pdf, accessed 21 August 2007.

T.M. Mitchell, 1997. Machine learning. New York: McGraw–Hill.

A. Mulac and T.L. Lundell, 1994. “Effects of gender–linked language differences in adults’ written discourse: Multivariate tests of language effects,” Language and Communication, volume 14, number 3, pp. 299–309. http://dx.doi.org/10.1016/0271-5309(94)90007-8

M.L. Newman, C.J. Groom, L.D. Handelman, and J.W. Pennebaker, in press. “Gender differences in language use: An analysis of 14,000 text samples,” Discourse Processes.

S. Nowson, J. Oberlander, and A.J. Gill, 2005. “Weblogs, genres, and individual differences,” Proceedings of the 27th Annual Conference of the Cognitive Science Society (Stresa, Italy), pp. 1666–1671, and at http://www.ics.mq.edu.au/~snowson/papers/nowson-cogsci.pdf, accessed 21 August 2007.

J.W. Pennebaker and L.D. Stone, 2003. “Words of wisdom: Language use over the lifespan,” Journal of Personality and Social Psychology, volume 85, pp. 291–301. http://dx.doi.org/10.1037/0022-3514.85.2.291

J.W. Pennebaker, M.R. Mehl, and K. Niederhoffer, 2003. “Psychological aspects of natural language use: Our words, ourselves,” Annual Review of Psychology, volume 54, pp. 547–577. http://dx.doi.org/10.1146/annurev.psych.54.101601.145041

J.R. Quinlan, 1986. “Induction of decision trees,” Machine Learning, volume 1, number 1, pp. 81–106. http://dx.doi.org/10.1007/BF00116251

V.L. Rubin and E.D. Liddy, 2006. “Assessing credibility of Weblogs,” In: Computational Approaches to Analyzing Weblogs: Papers from the 2006 AAAI Spring Symposium. Menlo Park, Calif.: AAAI Press, pp. 187–190.

J. Schler, M. Koppel, S. Argamon, and J. Pennebaker, 2006. “Effects of age and gender on blogging,” In: Computational Approaches to Analyzing Weblogs: Papers from the 2006 AAAI Spring Symposium. Menlo Park, Calif.: AAAI Press, pp. 199–205, and at http://lingcog.iit.edu/doc/springsymp-blogs-final.pdf, accessed 21 August 2007.

E.W. Schneider, 2002. “Investigating variation and change in written documents,” Chapter 3 of J.K. Chambers, P. Trudgill, and N. Schilling–Estes (editors). Handbook of language variation and change. Malden, Mass.: Blackwell Publishing.

D. Tannen, 2001. You just don’t understand: Women and men in conversation. New York: Quill.

Y. Wu and B.L. Tseng, 2006. “Important Weblog identification and hot story summarization,” In: Computational Approaches to Analyzing Weblogs: Papers from the 2006 AAAI Spring Symposium. Menlo Park, Calif.: AAAI Press, pp. 221–227.

Editorial history

Paper received 28 July 2007; accepted 19 August 2007.

Copyright ©2007, First Monday.

Copyright ©2007, Shlomo Argamon, Moshe Koppel, James W. Pennebaker, and Jonathan Schler.

Mining the Blogosphere: Age, gender and the varieties of self–expression by Shlomo Argamon, Moshe Koppel, James W. Pennebaker, and Jonathan Schler
First Monday, volume 12, number 9 (September 2007),
URL: http://firstmonday.org/issues/issue12_9/argamon/index.html

Table 1: Distribution of blogs in our initial collection by age and gender.
Age group	Female	Male	Total
13–17	6949	4120	11069
18–22	7393	7690	15083
23–27	4043	6062	10105
28–32	1686	3057	4743
33–37	860	1827	2687
38–42	374	819	1193
43–47	263	584	847
48 and older	314	906	1220
Total	21682	25065	46747
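
The shift in gender balance across age brackets is easier to see as proportions than as raw counts. The following minimal Python sketch computes the female share of blogs in each bracket from the counts in Table 1; the names `TABLE_1` and `female_share` are hypothetical and were chosen only for this illustration.

```python
# Counts copied from Table 1: age bracket -> (female blogs, male blogs).
TABLE_1 = {
    "13-17": (6949, 4120),
    "18-22": (7393, 7690),
    "23-27": (4043, 6062),
    "28-32": (1686, 3057),
    "33-37": (860, 1827),
    "38-42": (374, 819),
    "43-47": (263, 584),
    "48 and older": (314, 906),
}


def female_share(counts):
    """Return the fraction of blogs in each bracket written by females."""
    return {age: f / (f + m) for age, (f, m) in counts.items()}


if __name__ == "__main__":
    for age, share in female_share(TABLE_1).items():
        print(f"{age:>12}: {share:.1%} female")
```

On these counts, female bloggers are the majority only in the 13–17 bracket; every older bracket has more male than female blogs.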

Table 2: Words in each factor.
Factor	Words
Conversation	know, people, think, person, tell, feel, friends, talk, new, talking, mean, ask, understand, feelings, care, thinking, friend, relationship, realize, question, answer, saying
AtHome	woke, home, sleep, today, eat, tired, wake, watch, watched, dinner, ate, bed, day, house, tv, early, boring, yesterday, watching, sit
Family	years, family, mother, children, father, kids, parents, old, year, child, son, married, sister, dad, brother, moved, age, young, months, three, wife, living, college, four, high, five, died, six, baby, boy, spend, christmas
Time	friday, saturday, weekend, week, sunday, night, monday, tuesday, thursday, wednesday, morning, tomorrow, tonight, evening, days, afternoon, weeks, hours, july, busy, meeting, hour, month, june
Work	work, working, job, trying, right, met, figure, meet, start, better, starting, try, worked, idea
PastActions	said, asked, told, looked, walked, called, talked, wanted, kept, took, sat, gave, knew, felt, turned, stopped, saw, ran, tried, picked, left, ended
Games	game, games, team, win, play, played, playing, won, season, beat, final, two, hit, first, video, second, run, star, third, shot, table, round, ten, chance, club, big, straight
Internet	site, email, page, please, website, web, post, link, check, blog, mail, information, free, send, comments, comment, using, internet, online, name, service, list, computer, add, thanks, update, message
Location	street, place, town, road, city, walking, trip, headed, front, car, beer, apartment, bus, area, park, building, walk, small, places, ride, driving, looking, local, sitting, drive, bar, bad, standing, floor, weather, beach, view
Fun	fun, im, cool, mom, summer, awesome, lol, stuff, pretty, ill, mad, funny, weird
Food/Clothes	food, eating, weight, lunch, water, hair, life, white, wearing, color, ice, red, fat, body, black, clothes, hot, drink, wear, blue, minutes, shirt, green, coffee, total, store, shopping
Poetic	eyes, heart, soul, pain, light, deep, smile, dreams, dark, hold, hands, head, hand, alone, sun, dream, mind, cold, fall, air, voice, touch, blood, feet, words, hear, rain, mouth
Books/Movies	book, read, reading, books, story, writing, written, movie, stories, movies, film, write, character, fact, thoughts, title, short, take, wrote
Religion	god, jesus, lord, church, earth, world, word, lives, power, human, believe, given, truth, thank, death, evil, own, peace, speak, bring, truly
Romance	forget, forever, remember, gone, true, face, spent, times, love, cry, hurt, wish, loved
Swearing	shit, fuck, fucking, ass, bitch, damn, hell, sucks, stupid, hate, drunk, crap, kill, guy, gay, kid, sex, crazy
Politics	bush, president, iraq, kerry, war, american, political, states, america, country, government, john, national, news, state, support, issues, article, michael, bill, report, public, issue, history, party, york, law, major, act, fight, poor
Music	music, songs, song, band, cd, rock, listening, listen, show, favorite, radio, sound, heard, shows, sounds, amazing, dance
School	school, teacher, class, study, test, finish, english, students, period, paper, pass
Business	system, based, process, business, control, example, personal, experience, general
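
Word lists such as those in Table 2 can be used to score a text by how often each factor's words occur in it. The Python sketch below shows one way this might be done; it is not the procedure used in the study. The tokenization (lowercased alphabetic runs), the per–100–tokens unit, the abbreviated word lists, and the names `FACTORS` and `factor_rates` are all assumptions made for illustration.

```python
import re

# Two word lists abbreviated from Table 2 (first six words of each factor),
# lowercased to match the tokenizer below.
FACTORS = {
    "Conversation": {"know", "people", "think", "person", "tell", "feel"},
    "Politics": {"bush", "president", "iraq", "kerry", "war", "american"},
}


def factor_rates(text, factors=FACTORS):
    """Return, for each factor, its word occurrences per 100 tokens of text.

    Tokenization and the per-100-tokens unit are assumptions of this sketch.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        # Avoid dividing by zero on empty input.
        return {name: 0.0 for name in factors}
    return {
        name: 100.0 * sum(tok in words for tok in tokens) / len(tokens)
        for name, words in factors.items()
    }


if __name__ == "__main__":
    sample = "I think people know the president will tell us about the war"
    print(factor_rates(sample))
```

Averaging such per–text rates over all blogs in a group would give group means comparable in form to those reported for each factor.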

Table 3: Mean frequencies of factor and part–of–speech usage by age and gender.
Factor	10s	20s	30s+	Male	Female	Overall
Conversation	1.74	1.55	1.33	1.47	1.72	1.59
AtHome	1.11	.80	.75	.86	.98	.92
Family	.65	.75	.94	.69	.79	.74
Time	.65	.74	.68	.65	.73	.69
PastActions	.74	.62	.63	.62	.73	.68
Work	.61	.75	.70	.67	.69	.68
Games	.67	.66	.66	.76	.57	.67
Internet	.61	.63	.68	.74	.52	.63
Location	.52	.65	.63	.60	.58	.59
Fun	.88	.36	.28	.50	.64	.57
Food/Clothes	.53	.55	.55	.49	.60	.54
Poetic	.52	.53	.52	.48	.57	.53
Books/Movies	.51	.54	.54	.54	.51	.53
Religion	.44	.50	.55	.50	.46	.48
Romance	.54	.44	.38	.39	.55	.47
Swearing	.54	.35	.25	.41	.42	.41
Politics	.27	.41	.56	.47	.28	.37
Music	.36	.29	.26	.34	.29	.32
School	.35	.19	.17	.26	.25	.26
Business	.07	.13	.16	.13	.08	.11
Articles	5.10	6.46	6.97	6.46	5.45	5.96
PersonalPronouns	11.72	10.44	9.88	9.84	11.97	10.96
AuxiliaryVerbs	9.04	8.90	8.83	8.76	9.14	8.95
Conjunctions	2.89	2.59	2.48	2.63	2.76	2.70
Prepositions	11.83	13.04	13.30	12.76	12.36	12.56
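
The correspondence between age and gender effects can be checked directly against the part–of–speech rows of Table 3. The Python sketch below compares the direction of the age trend with the direction of the gender difference for each row. It works on the raw means only and performs no significance testing. Measuring the age trend as the 30s+ mean minus the 10s mean is an assumption of this sketch, and the names `POS_ROWS` and `directions` are hypothetical.

```python
# Part-of-speech rows copied from Table 3:
# name -> (10s, 20s, 30s+, male, female).
POS_ROWS = {
    "Articles": (5.10, 6.46, 6.97, 6.46, 5.45),
    "PersonalPronouns": (11.72, 10.44, 9.88, 9.84, 11.97),
    "AuxiliaryVerbs": (9.04, 8.90, 8.83, 8.76, 9.14),
    "Conjunctions": (2.89, 2.59, 2.48, 2.63, 2.76),
    "Prepositions": (11.83, 13.04, 13.30, 12.76, 12.36),
}


def sign(x):
    """Return +1, -1, or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def directions(rows):
    """Map each row to (age trend, gender lean).

    Age trend is the sign of (30s+ mean - 10s mean); gender lean is the
    sign of (male mean - female mean).
    """
    return {
        name: (sign(thirties - teens), sign(male - female))
        for name, (teens, _twenties, thirties, male, female) in rows.items()
    }


if __name__ == "__main__":
    for name, (age_trend, gender_lean) in directions(POS_ROWS).items():
        match = "agree" if age_trend == gender_lean else "differ"
        print(f"{name:<17} age {age_trend:+d}  gender {gender_lean:+d}  {match}")
```

For all five part–of–speech rows the two directions agree: articles and prepositions rise with age and are used more by males, while personal pronouns, auxiliary verbs, and conjunctions fall with age and are used more by females.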