Big data for the humanities using Google Ngrams: Discovering hidden patterns of conceptual trends
by Shai Ophir



Abstract
“Big data” methodologies bring new potential to humanities research. Google’s Ngram Viewer provides an extraordinary tool for tracking long-term usage of terms. Although this is a very high-level trend analysis, it may shed light on hidden relations that can be discovered only at the macro level. This short paper analyzes the historical hidden patterns of the term “Truth” over the last 500 years, revealing its relation to the term “Love”. The results are also compared with the manual analysis of systems of truth carried out by Pitirim Sorokin in 1937.

Contents

1. Big data, digital humanities and Google’s Ngram Viewer
2. Discovering hidden patterns of “Truth”
3. Manual vs. digital analysis: The work of Pitirim Sorokin and Google Ngram
4. Summary and roadmap

1. Big data, digital humanities and Google’s Ngram Viewer

“Big data” is widely used today in digital culture as a promising method for deriving new understanding from massive aggregations of information. The ability to collect huge amounts of data from text, images and media (voice, video), aggregate the data and analyze it with computerized algorithms creates endless opportunities in many areas. Medicines are developed based on data measured and collected from patients. Business analytics is an integral part of banks, governments and the market in general. Schmidt and Lipson (2009) explain how they derived Newton’s laws from a data-analytic system, using algorithms to detect the patterns of a pendulum. Google is able to translate from Chinese to English based on an aggregation of previous translations, looking for similarities.

Mayer-Schönberger and Cukier (2013) delineate many uses of big data for analysis, but also raise concerns regarding its implications. They express concern about machines taking over human activities and decision-making. There are, for example, ethical questions about an autonomous Google car that need to be addressed. They also raise privacy issues over the collection and interpretation of transactions, voice calls, images and video taken by human agents.

Another concern about big data may come from the fact that it is primarily utilized by businesses. As Laney (2001) noted: “Through 2003/04, practices for resolving e-commerce accelerated data volume, velocity, and variety issues will become more formalized/diverse. Increasingly, these techniques involve tradeoffs and architectural solutions that involve/impact application portfolios and business strategy decisions.” Big data may be of primary value to organizations that can profit from data extraction.

boyd and Crawford (2012) also raise critical questions about big data: “Will large-scale search data help us create better tools, services, and public goods? Or will it usher in a new wave of privacy incursions and invasive marketing? Will data analytics help us understand online communities and political movements? Or will it be used to track protesters and suppress speech?” boyd and Crawford note that “we are our tools” and: “The era of Big Data has only just begun, but it is already important that we start questioning the assumptions, values, and biases of this new wave of research.”

The marketing of big data has created an image of a process that can potentially replace traditional analysis methods, providing predictions for almost anything. Lazer, et al. (2014) point out the problems with this faith in big data in their analysis of Google Flu Trends (GFT). The capabilities of big data are much more limited than we tend to assume. Lazer, et al. note that “big data hubris” is one contributing factor; a second is “algorithm dynamics, the changes made by engineers to improve the commercial service and by consumers in using that service. Several changes in Google’s search algorithm and user behavior likely affected GFT’s tracking.” Lazer, et al. also question measurements: “Is the instrumentation actually capturing the theoretical construct of interest? Is measurement stable and comparable across cases and over time? Are measurement errors systematic?”

Digital humanities provides opportunities for the scholarly uses of big data, drawing on computerized resources of text, images, voice and video. The field is served by organizations such as the Alliance of Digital Humanities Organizations (ADHO, http://adho.org) and centerNet (http://dhcenternet.org/), an international network of about 100 digital humanities centers in 19 countries working together to benefit digital humanities and related fields. Many universities have established departments and research centers for digital humanities, such as University College London (http://www.ucl.ac.uk/dh/) and Princeton University (https://digitalhumanities.princeton.edu). However, digital humanities is still in its infancy. As Borgman (2009) noted: “The digital humanities community has produced some beautiful work and made many advances in technology, design, and standards. Now is the moment to consolidate that knowledge and to articulate the community’s requirements and goals.”

The idea of quantification of the humanities has historical roots. Latour (2010) described the ideas of Gabriel Tarde: “When Tarde claimed that statistics would one day be as easy to read as newspapers, he could not have anticipated that the newspapers themselves would be so transformed by digitalization that they would merge into the new domain of data visualization. This is a clear case of a social scientist being one century ahead of his time because he had anticipated a quality of connection and traceability necessary for good statistics which was totally unavailable in 1900.” Latour believes that big data can resolve the gap between the micro and the macro in sociology, the unexplained relations between macro social phenomena and the individuals taking part in those phenomena. Latour (2010) described this vision: “The point is that the whole has lost its privileged status: we can produce out of the same data points, as many aggregates as we see fit, while reverting back at any time, to the individual components. This is precisely the sort of movement that was anticipated by Tarde’s social theory although he had no tool to explicate his vision, other than his prose.”

Marres (2012) investigates the implications of digital technologies for social research, and calls for a “redistribution of research”, a concept that “highlights that scientific research tends to involve contributions from a broad range of actors: researchers, research subjects, funders, providers of research materials, infrastructure builders, interested amateurs, and so on.” Marres explores the redistribution of method empirically, by examining online platforms for social research. One of these platforms is a tool for online textual analysis, called the Co-Word Machine, which is based on co-word analysis. Marres notes that co-word analysis began to emerge in the 1980s, but “in the 1980s co-word analysts were frustrated in this project by the limits of the databases”. The “redistribution of technologies”, that is, the development of massive datasets and databases, now enables better co-word analysis. Google Ngram Viewer, the tool used in this paper, is an example. Another example, combining online textual analysis and social research, is described by Madsen (2015). Madsen focuses on digital social analytics, describing a project that developed a “crisis monitor”. The monitor provides warnings of food-related crises in Indonesia based on tweet analysis, using tag clouds to display words that reflect crisis situations related to food. The software algorithm is designed to identify expressions in tweets and categorize them into clusters. This project shows not only the potential of dynamic text analysis for social purposes, but also draws a line between human and machine capabilities: the results are visualized automatically, but individuals evaluate them to identify a crisis.

One of the tools that serves digital humanities is Google’s Ngram Viewer. This tool was created on top of Google Books, the largest digitized collection of books. Google Books was launched in 2002, inspired by digitization projects in various libraries and institutions around the world. According to Wikipedia (https://en.wikipedia.org/wiki/Google_Books), as of October 2015 the number of scanned book titles was over 24 million. The vast majority of the books are in English, and this paper deals only with the English language.

The Google Ngram Viewer is an online viewer built on top of Google Books (https://books.google.com/ngrams). It is based on a database of Ngrams collected from books originally published between 1500 and 2000. In computational linguistics, an Ngram is a contiguous sequence of n items from a given sequence of text; the items can be phonemes, syllables, letters or words. The Google Ngram database supports Ngram sequences of up to five elements. The project was inspired by a prototype (called “Bookworm”) created by Michel, et al. (2011).
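To make the notion concrete, here is a minimal sketch, in Python, of how contiguous word Ngrams can be extracted from a text. It is purely illustrative and not Google’s actual pipeline, which also handles tokenization, punctuation, casing and part-of-speech tags:

    def ngrams(text, n):
        """Return the contiguous n-word sequences in text."""
        words = text.split()
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    print(ngrams("Beauty is truth, truth beauty", 2))
    # ['Beauty is', 'is truth,', 'truth, truth', 'truth beauty']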

The graphs plotted by the Ngram Viewer are normalized according to the size of the corpus in each year. Without normalization it would be impossible to compare the frequency of a specific Ngram over time, since far fewer books were published in 1500 than in 2000. The viewer therefore displays the number of occurrences as a percentage, calculated against the total number of Ngrams in the corpus for a given year. Clicking on a point in a plotted graph shows the percentage of occurrences for that year.
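The normalization arithmetic is simple. The sketch below shows it with invented counts; in practice the per-year totals would come from Google’s downloadable raw Ngram exports:

    # Hypothetical counts, for illustration only.
    yearly_counts = {1800: 12400, 1900: 151000}            # matches of a given Ngram
    corpus_totals = {1800: 5600000000, 1900: 92000000000}  # all Ngrams in the corpus

    for year in sorted(yearly_counts):
        pct = 100.0 * yearly_counts[year] / corpus_totals[year]
        print(f"{year}: {pct:.7f}%")  # e.g. 1800: 0.0002214%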

The “wild card” (*) search is an additional feature of the Google Ngram Viewer. The Ngram Viewer information page describes it as one of several advanced features that let users “dig a little deeper into phrase usage” (https://books.google.com/ngrams/info#advanced). As Google notes: “When you put a * in place of a word, the Ngram Viewer will display the top ten substitutions.”

However, I believe the benefit of the wild card search goes far beyond “digging a little deeper”, because it opens the Ngram database to the discovery of hidden patterns. With the wild card search, a searcher can ask for information that is not pre-defined by other search keywords. That can lead to an exploration of hidden patterns, as we will see in this paper. The wild card can be applied not only to the next adjacent word, but to other patterns as well.

 

++++++++++

2. Discovering hidden patterns of “Truth”

In this paper we take “Truth” as the basic term for exploration. We use the Ngram wild card search “Truth *” to find the words most frequently adjacent to “Truth”. However, the immediately adjacent word is usually not meaningful in itself, so we proceed in two steps: first we find the top adjacent word, and then we go one word further and look for the second adjacent word. Note that this kind of research is heuristic and based on the majority of cases, ignoring the exceptions.
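The searches in this paper were run through the Viewer’s Web interface, but the two-step heuristic can also be scripted. The sketch below uses the JSON endpoint that the Viewer’s own Web page calls; that endpoint, its parameters (the corpus identifier in particular) and its output format are unofficial and undocumented, so the whole sketch should be read as an assumption that may break, not as a supported API:

    import json
    import urllib.parse
    import urllib.request

    def fetch_ngrams(query, year_start=1500, year_end=2000, corpus=15, smoothing=3):
        """Fetch Ngram series; assumed to return [{"ngram": ..., "timeseries": [...]}, ...]."""
        params = urllib.parse.urlencode({"content": query, "year_start": year_start,
                                         "year_end": year_end, "corpus": corpus,
                                         "smoothing": smoothing})
        with urllib.request.urlopen("https://books.google.com/ngrams/json?" + params) as r:
            return json.load(r)

    def top_substitution(query, year_start=1500, year_end=2000):
        """Return the wild card expansion with the highest mean frequency."""
        series = [s for s in fetch_ngrams(query, year_start, year_end)
                  if "(All)" not in s["ngram"]]  # skip any aggregate series
        return max(series, key=lambda s: sum(s["timeseries"]) / len(s["timeseries"]))

    step1 = top_substitution("Truth *")                # step 1: expected "Truth and"
    step2 = top_substitution(step1["ngram"] + " *")    # step 2: one word further
    print(step1["ngram"], "->", step2["ngram"])

Ranking by mean frequency is one reasonable reading of “most frequent”; ranking by the final-year value or the peak value would be equally defensible.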

We see that “Truth and” is by far the most frequent combination among the top 10 words adjacent to “Truth”:

 

Figure 1: Searching “Truth *” to find the most frequent word used with “Truth”.

 

The next step is to search for “Truth and *”, in order to find the most frequent meaningful words used with “Truth”. Figure 2 shows the results:

 

Figure 2: Searching “Truth and *” to find the most frequent meaningful words used with “Truth”.

 

The most frequent words used with “Truth” are logic, justice, reality, love, beauty, falsehood, error and goodness. It already appears that love has the strongest association with “Truth”, at least for portions of 1800–2000. However, this conclusion is not yet firm. We need to compare “Truth” with each of the top adjacent words in order to find correlations between them.

Figure 3 shows correlations between all of the mentioned terms:

 

Figure 3: Correlation of “Truth” and the top nine adjacent words.

 

It now appears that “Truth” and love follow a similar pattern, but the similarity is again not significant. At this point we expand the time range of the search window, to investigate correlations over the full 500 years supported by the Ngram Viewer. Figure 4 shows the results, which offer more granularity for finding correlated patterns.

 

Figure 4: Correlation of “Truth” and the top nine adjacent words, 1500–2000.

 

Since the Ngram Viewer normalizes all results, it is difficult to compare the fluctuations of all terms at once, especially as some of the terms are not closely correlated with “Truth”. Identifying close correlations between terms means finding not only similar numbers of occurrences, but also similar fluctuations, that is, similar changes over time. Hence, the next step is to remove the less correlated terms, leaving the top three terms appearing in the Viewer: “Truth”, justice and love (Figure 5):

 

Figure 5: Correlation of “Truth”, justice and love.

 

It now appears that love is the term most closely correlated with “Truth”, although justice is also correlated with both “Truth” and love. Around the mid-seventeenth century the terms “Truth”, love and justice begin to appear increasingly in text until the early eighteenth century, and then decline toward the end of the eighteenth century. Other terms do not follow this pattern, which suggests that the correlation cannot be attributed merely to an increase in the number of publications or to other external factors.
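One way to make “similar pattern” precise is to correlate both the levels of two series and their year-over-year changes; a high correlation of changes indicates similar fluctuations, not merely similar magnitudes. A minimal sketch with invented numbers (real input would be the normalized frequencies fetched from the Ngram data):

    import numpy as np

    # Invented normalized frequencies (percent), for illustration only.
    truth = np.array([0.031, 0.034, 0.040, 0.038, 0.029, 0.025])
    love  = np.array([0.052, 0.055, 0.063, 0.060, 0.049, 0.044])

    level_corr = np.corrcoef(truth, love)[0, 1]                    # similar levels?
    fluct_corr = np.corrcoef(np.diff(truth), np.diff(love))[0, 1]  # similar changes?

    print(f"level correlation:       {level_corr:.3f}")
    print(f"fluctuation correlation: {fluct_corr:.3f}")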

The next step is to attempt to explain these fluctuations in their historical context. In general, understanding each fluctuation is very challenging, since it depends on many unknown factors. For this paper I will zoom into one of the fluctuations, while remaining within the scope of the Ngram tool.

Concentrating on the incline of both “Truth” and love in the period 1680–1700, can we find other terms associated with this sharp rise? We can again use the wild card search, for each of the terms, restricted to this period. The search term would be: “Truth *, Love *”. Figure 6 shows the results. Again, we take the top adjacent word for each term and go one step further. The top combinations were “Truth of” and “Love of” (the two highest findings on the list):
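Using the helpers from the earlier sketch (and the same unofficial endpoint, so still an assumption), the period-restricted query can be scripted like this:

    # Restrict the window to the 1680-1700 incline; top_substitution is the
    # (assumed) helper defined in the earlier sketch.
    for term in ("Truth *", "Love *"):
        print(term, "->", top_substitution(term, 1680, 1700)["ngram"])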

 

Figure 6: Top combinations of “Truth” and love between 1680 and 1700: “Truth of” and “Love of”.

 

Going further and searching for “Truth of *, Love of *”, we see “Love of God” as the top result for love, and “Truth of God” as the top meaningful result for “Truth” (sixth on the list, Figure 7 below). This suggests the hypothesis that there was a significant increase in religious discussion and texts between 1680 and 1700, related both to “Truth” and to love. The second top combination found by this search, “Truth of the”, is meaningless in itself, but searching for “Truth of the *” shows “Truth of the Christian” as the top increasing combination (Figure 8). As mentioned, this hypothesis would require further historical research.

 

Figure 7: Top inclining combinations “Love of God” and “Truth of God” (the first meaningful result for “Truth”).

 

 

Figure 8: Top inclining combination “Truth of the Christian”.

 

 

++++++++++

3. Manual vs. digital analysis: The work of Pitirim Sorokin and Google Ngram

Pitirim Sorokin (1889–1968) was a Russian-American sociologist who emigrated from Russia to the United States in 1923. In 1930 he was invited by the president of Harvard University to accept a position there, where he founded the Department of Sociology. Robert Merton was one of his famous students. Sorokin was a macro-sociologist, best known for his four-volume work Social and cultural dynamics, published between 1937 and 1941.

Sorokin divided his research into different topics, such as wars, economics, ethics and philosophy. In each, he used analytical methods, looking for long cycles, collecting quantified data and analyzing it along a timeline. For philosophy, Sorokin did monumental work, preparing a list of well-known philosophers representing each period of the last 2,500 years, starting from ancient Greece. He classified each philosopher according to different types of philosophy, such as mysticism, realism and empiricism (six types are listed in the next figure). This classification was accomplished with a basic familiarity with their texts, giving each philosopher a rank for each type of philosophy. Then, by summing the ranks for each time slot, Sorokin could see the tendencies over time, at a resolution of two-decade time slots.
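To see the mechanics of this bookkeeping, here is a hypothetical reconstruction in code. The philosophers, ranks and time slots are invented for illustration, and Sorokin of course worked entirely by hand:

    from collections import defaultdict

    # (philosopher, mid-year of activity, {system of truth: rank})
    rankings = [
        ("Philosopher A", 1610, {"rationalism": 8, "mysticism": 1}),
        ("Philosopher B", 1625, {"rationalism": 5, "empiricism": 6}),
        ("Philosopher C", 1655, {"mysticism": 7, "fideism": 4}),
    ]

    totals = defaultdict(lambda: defaultdict(int))
    for _, year, ranks in rankings:
        slot = (year // 20) * 20          # two-decade slot, e.g. 1600 covers 1600-1619
        for system, rank in ranks.items():
            totals[slot][system] += rank  # sum of ranks per system per slot

    for slot in sorted(totals):
        print(slot, dict(totals[slot]))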

Figure 9 illustrates Sorokin’s “Fluctuations of systems of truth.” I selected this illustration because, like this paper, it concentrates on patterns of “Truth”. Sorokin’s method of visualization represents six different types of philosophy in one figure. The source data can be found in Figure 10.

 

Figure 9: Sorokin’s mapping of systems of “Truth”.

 

 

Figure 10: Sorokin’s systems of “Truth” table.

 

Sorokin’s manual analysis required working through a large body of text to discover patterns in philosophy, as well as correlations with external conditions, such as war and economic circumstances, among other factors. Sorokin’s achievements remain unique.

Can we use Sorokin’s work to shed light on the fluctuations in the diagrams generated by the Ngram Viewer, related to the correlation between “Truth” and love? Or at least validate some of the findings made with Ngram regarding these correlations? And can we use the Ngram analysis to validate some aspects of Sorokin’s findings?

Sorokin explores the “ethics of love”, among many other combinations of ideas. He notices the embedded link between love (the love of God and of people) and absolute principles. He noted:

“Ethics of love, in a sense, is a variety of the ethics of absolute principles. Among the above all values it puts the value of infinite, unlimited scarified love of God, and of all the concrete individual persons. Love in this sense includes all other values. The genuine ethics of Jesus, St. Francis and other Christians give examples of such ethics of love.” [1]

Sorokin measured the integration of various systems of truth by sketching fluctuations of combinations of systems, such as “Rationalism + Mysticism + Fideism”. In addition, he described fluctuations of the “Ethics of Principles and Love”, and of other streams he identified in his research. Sorokin’s goal was to find correlations between different conceptual systems.

Figure 11 [2] illustrates the fluctuations of the “Ideational” systems, systems based on ideas, in contrast to the “Sensate” systems, based on empirical sensations. “Empiricism” is therefore not included in this figure, although it appeared among the previous six types of “Truth” systems. However, this figure includes both “Truth” and love systems, the combination relevant to the analysis made with the Ngram Viewer in this paper. For Sorokin, the “Ethics of Principles and Love” and “Systems of Truth”, such as Rationalism, share a common infrastructure of idea-centric systems. This view already supports our Ngram analysis of the correlation between “Truth” and love.

 

Figure 11: Sorokin’s correlation between systems of “Truth”, ethics of love and other systems.

 

What kind of correlation exists between these ideational systems, according to Sorokin? Sorokin wrote:

“I’ve mentioned many times that in the main movements all the Ideational variables have proceeded in a tangible parallelism with one another ... I pointed also that this association is imperfect ... The tidal trends of these variables are similar. Their secondary movements rise and decline with a substantial degree of independence. Expressed in musical terms, their source are not unisonic (as they are in the medieval centuries) but polyphonic. Likewise, their tempo and rhythm are not the same all the time. In many respects their total character remains one of a complex fugue.” [3]

 

++++++++++

4. Summary and roadmap

This paper illustrates how Google’s Ngram Viewer can be used as a tool for discovering hidden patterns of conceptual trends, using its wild card search feature. As an example, the term “Truth” was analyzed. The hidden patterns for “Truth” indicate that love was correlated with “Truth”, as was justice, sharing a similar pattern at a high-level resolution. Around the mid-seventeenth century, the terms “Truth”, love and justice increased until the early eighteenth century, and then declined toward the end of the eighteenth century.

The difference between manual and digital analysis was also examined: the effort required by manual analysis is a barrier, yet human intelligence retains advantages over digital analysis. I used Sorokin’s findings about correlations between “Systems of Truth” and “Ethics of Love” to support the correlations between “Truth” and love discovered via Google’s Ngram Viewer, while the Ngram results in turn support Sorokin’s findings.

The main challenge for future research will be to take patterns discovered by digital analysis, such as those from Google’s Ngram Viewer, and discern their correlations with historical events, in order to explain the patterns by historical forces, causes and relations.

 

About the author

Shai Ophir is an independent researcher. He earned a B.Sc. in mathematics cum laude from Tel-Aviv University and an M.Sc. in computer science from Clayton University (St. Louis).
E-mail: shai [dot] ophir [at] starhome [dot] com

 

Notes

1. Sorokin, 1937, p. 485.

2. Sorokin, 1937, p. 629.

3. Sorokin, 1937, p. 628.

 

References

C. Borgman, 2009. “The digital future is now: A call to action for the humanities,” Digital Humanities Quarterly, volume 3, number 4, at http://digitalhumanities.org/dhq/vol/3/4/000077/000077.html, accessed 24 June 2016.

d. boyd and K. Crawford, 2012. “Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon,” Information, Communication & Society, volume 15, number 5, pp. 662–679.
doi: http://dx.doi.org/10.1080/1369118X.2012.678878, accessed 24 June 2016.

D. Laney, 2001. “3D data management: Controlling data volume, velocity and variety,” META Group (6 February), at http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf, accessed 24 June 2016.

B. Latour, 2010. “Tarde’s idea of quantification,” In: M. Candea (editor). The social after Gabriel Tarde: Debates and assessments. London: Routledge, pp. 147–164.

D. Lazer, R. Kennedy, G. King and A. Vespignani, 2014. “The parable of Google Flu: Traps in big data analysis,” Science, volume 343, number 6176 (14 March), pp. 1,203–1,205.
doi: http://dx.doi.org/10.1126/science.1248506, accessed 24 June 2016.

A. Madsen, 2015. “Between technical features and analytic capabilities: Charting a relational affordance space for digital social analytics,” Big Data & Society, volume 2, number 1, at http://bds.sagepub.com/content/2/1/2053951714568727, accessed 24 June 2016.
doi: http://dx.doi.org/10.1177/2053951714568727, accessed 24 June 2016.

N. Marres, 2012. “The redistribution of methods: On intervention in digital social research, broadly conceived,” Sociological Review, volume 60, supplement S1, pp. 139–165.
doi: http://dx.doi.org/10.1111/j.1467-954X.2012.02121.x, accessed 24 June 2016.

V. Mayer-Schönberger and K. Cukier, 2013. Big data: A revolution that will transform how we live, think and work. London: John Murray.

J.-B. Michel, Y.K. Shen, A.P. Aiden, A. Veres, M.K. Gray, Google Books Team, J.P. Pickett, D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, S. Pinker, M.A. Nowak and E.L. Aiden, 2011. “Quantitative analysis of culture using millions of digitized books,” Science, volume 331, number 6014 (14 January), pp. 176–182.
doi: http://dx.doi.org/10.1126/science.1199644, accessed 24 June 2016.

M. Schmidt and H. Lipson, 2009. “Distilling free-form natural laws from experimental data,” Science, volume 324, number 5923 (3 April), pp. 81–85.
doi: http://dx.doi.org/10.1126/science.1165893, accessed 24 June 2016.

P. Sorokin, 1937. Social and cultural dynamics. Volume 2: Fluctuation of systems of truth, ethics, and law. New York: American Book Co.

 


Editorial history

Received 19 October 2014; revised 5 July 2015; revised 30 July 2015; accepted 1 October 2015.


This paper is licensed under a Creative Commons Attribution 4.0 International License.

Big data for the humanities using Google Ngrams: Discovering hidden patterns of conceptual trends
by Shai Ophir.
First Monday, Volume 21, Number 7 - 4 July 2016
http://www.firstmonday.org/ojs/index.php/fm/article/view/5567/5535
doi: http://dx.doi.org/10.5210/fm.v21i7.5567




