Working beyond the confines of academic discipline to resolve a real-world problem: A community of scientists discussing long-tail data in the cloud
First Monday

by Catherine F. Brooks, P. Bryan Heidorn, Gretchen R. Stahlman, and Steven S. Chong

This project interrogates workshop leader and whole-meeting talk among a group of scientists gathered at a workshop to discuss cyberinfrastructure and the sharing of both ‘light’ and ‘dark’ data in the sciences. It analyzes discourses working through the workshop talk to interrogate the social relations, interdisciplinary identities, concerns, and commonalities in the sciences and in relation to emerging opportunities for computing and data sharing in the cloud. The findings point to the efficacy of arranging scientists around data collection processes for collaborative work as opposed to groupings around data type, discipline, work sectors, or collection location. This research provides an opportunity to consider the democratization of data, academic boundaries in the sciences, as well as interdisciplinary and collaborative problem-solving processes that happen in groups across academic and applied contexts.






Data sharing on the Internet has shifted in recent years with an increasing reliance on shared and commercialized resources for scientific work. New questions continually arise in the context of contemporary data sets and related practice, and bottlenecks to the flow of data across scientists and fields of study have emerged. This study examines a group of scientists as they consider Internet-based data sharing tools, taking a particular focus on how participants in the group sort themselves academically and disciplinarily while working on a common problem [1].

This research stems from a broader study of computation-sharing and data-sharing solutions, but focuses specifically on discourse as a way to see how a workshop leader guided group-making processes and how participating scientists organized themselves disciplinarily while troubleshooting issues of data sharing in an increasingly cloud-based research milieu. Specifically, this project interrogates workshop leader and whole-meeting talk among a group of scientists at a workshop designed to spur collaborative thinking about gaps and opportunities in computing infrastructures that support data use and sharing in the sciences [2]. To consider group-organization processes, interdisciplinary identities, concerns, and commonalities in relation to emerging opportunities for data sharing in the cloud, this paper begins with a brief discussion of Internet-based data sharing — data in the cloud — as an emerging problem area as well as a discussion of literature on interdisciplinary communities in the academy. Next, a brief discussion of critical theory is offered as a lens through which we can interrogate the contemporary data milieu as well as a group of scientists as they aim to work out of institutionalized and commercial constraints. Then, a discussion of methods deployed for this study provides a description of discourse analysis as a particularly useful tool for interrogating the talk of a workshop leader and relations among a group of scientists focused on a shared problem. The findings of this research are subsequently presented before concluding with a discussion about data democratization and the long tail of data, as well as ideas about how to situate these findings for practitioners (i.e., how we might best facilitate interdisciplinary discussion in the sciences) and future research (i.e., how these findings square with what we know about communities).



Data, contemporary problems, and interdisciplinary communities

Data collection, use, management, and storage are significant activities in the scientific enterprise, and are evolving given the onset of a big data world. Many scholars or instructors do not have access to the computing infrastructure needed to work with large data sets. This is particularly true at many biological research stations. While some stations are part of or loosely affiliated with major universities, others are stand-alone entities. The “clients” of the field stations are relatively transient, spending research or teaching seasons at the stations and then returning to their home institutions. This creates a situation where the data generated at the stations tends to scatter following the path of the researchers. The resulting set of all data associated with stations tends to have different formats (Estrin, et al., 2003) and to become dispersed, and thus difficult to discover and integrate. This is particularly true for little data. Little data, as opposed to big data, tends to be heterogeneous and not to conform to established data format standards (Borgman, et al., 2007). Because of the number of people working with little data, the total volume of little data can rival that of big data (Heidorn, 2008). So, sharing data on a cloud or Web-based platform may mitigate frustrations with limited support at one’s home institution, and may also increase opportunities for accessing data collected and shared by others.

Beyond concerns about data themselves, there are other motivations for scientists to come together to work on issues relevant to computing in their day-to-day work. For example, federal agencies, foundations, and publishers are requiring that scientists publicly share their data and computational processes, but researchers and institutional managers are struggling to identify efficient methods for sharing. In addition, many scientists are currently using computer resources to conduct their work and almost all data is collected in digital format. Yet the scientists themselves frequently have little formal computer training to work with and manage their data.

Previous research on the long tail of data (Heidorn, 2008; Palmer, et al., 2007) suggests that a significant portion of data collected in previous eras was actually lost or went unused — for example, data collected may have been used for specific projects then left to die on floppy disks or personal computers. Data sharing on cloud-based platforms provides a way to share costs for needed infrastructure and offers hope that previously lost ‘dark’ data will be brought to light. Discussions of day-to-day computing needs and data sharing possibilities are an important initial step in enabling problem-solving for contemporary scientists — facilitating this type of discussion was the focus of the workshop interrogated in this study.

Data sharing and related infrastructure dilemmas are of interest across a wide variety of scientists and are of import for those engaging in little science or data-intensive science. Scientists who study climate, for example, need to work across meteorology, biology, climatology, and atmospheric science, and may also connect with physical or cultural geographers (those who study people in their environments). These scientists, especially those working on large projects, tend to work with technologists and engineers as part of their working teams (most smaller projects rely on graduate students working in laboratories). Scientists often handle all data management themselves and the graduate students take on a substantial role in data management for labs. Lack of standardization, the temporal context of short-term data use, data sharing “friction,” and a lack of staff with time and training all contribute to difficulties with data sharing (Mayernik, et al., 2011). So, data sharing ideas and concerns are of utmost interest for many scientists and the like, and it is easy to imagine the ways in which shifts in scholarly practice, data collection, and the management of information, especially in an age of big data, are of paramount importance across sectors and to those coming from the entire research spectrum.

This project is thus nested in an ongoing conversation about interdisciplinarity and communities that bridge, conflate, or work across institutional boundaries in higher education and in research organizations. Contemporary problem-solving requires contestation of traditional social structures and institutionalized boundaries (e.g., those that bound disciplines in the academy or in practical work settings) that constrain effective collaborations (Gunderson, 2014). However, a dearth of research exists on those needed interdisciplinary collaborations (Derry, et al., 2013), so more research is needed in this particular area. Derry and Schunn (2013) explain that “interdisciplinarity — the integration of concepts, philosophies, and methodologies from different fields of knowledge — is pervasive. ... Such collaboration is especially needed when complex, real-world problems cannot be understood or solved with the tools and perspectives of only one discipline” [3]. Scholars like Klein (1996) and others (e.g., Becher and Trowler, 2001) have attempted to engage or interrogate interdisciplinary curricula, networks, and communities, but of particular interest for this project is how people actually navigate themselves and their identities as they talk about a shared problem.

The scientists participating in this study are provocative in that they function as a group of scholars who are reflecting on their own identities and disciplines while working together; they disrupt institutional borders that typify the academy in order to address a real-world and shared problem. Participants in this study exhibit characteristics of interdisciplinarity, having “boundary-crossing skills ... [and] the ability to change perspectives, to synthesize knowledge of different disciplines, and to cope with complexity ... [The] integration or synthesis of knowledge is seen as the defining characteristic of interdisciplinarity” [4]. These scientists are of interest, then, in their voicing of differing subjectivities as they are embedded in their individualized sectors, cultures, and disciplines.

Beyond their interdisciplinary nature, the workshop leader and participants can be conceived as a community of like-minded scholars and practitioners. Interest in communities (e.g., how they form, how they function) has emerged across disciplines — especially those in the social sciences — and broadly conceived, a ‘community’ is a collection of people that feel that they share some degree of ‘kinship’ in the group (Bell and Newby, 1976). Members of a community tend to feel a sense of belongingness in the group, and believe that their “needs will be met through their commitment to be together” [5]. Relative to this study then, these scientists as they work together may, in fact, function to a certain degree as a knowledge community (Wang, et al., 2013), community of inquiry (Vaughan and Garrison, 2006), or a community of practice (Wenger, 1998). Regardless of how this community is conceived, a critical lens is germane to the uncovering of institutional constraints and other challenges faced by these scientists and those they represent across the scientific domain.



Critical theory: Considerations of institutions, access, and democratized data

Though critical scholarship is not often deployed in studies of science, this study works to cast a contemporary gaze on an increasingly democratized data landscape, an environment in which sharing, working across, and actually breaking out of institutionalized academic boundaries are happening for productive ends. Scholars like Paulo Freire (1990) illuminate the oppressive institutional constraints at work in educational contexts like universities that house many leading scientists in the world. Freire offers a reminder that institutionalized structures, disciplinary boundaries in this case, can work to impede growth and liberty.

Like Freire, Michel Foucault argues that institutions are social mechanisms for maintaining subtle authority over everyday practices and social relations (Foucault, 1988). In such a scenario, scientists doing day-to-day academic work are the very ‘docile bodies’ (Foucault, 1977) working in distributed disciplinary camps that may in some cases impede sociotechnical advances needed in contemporary society. Indeed, most academics were trained in particular and discipline-specific ways; most were ‘marked’ (Foucault, 1988) for particular outcomes down the road. While this study focuses on how a workshop leader and a group of scientists conceive of their problem and themselves, institutionalized constraints that impede innovative and transformative science are considered simultaneously.




This study takes an interpretive approach to exploring scientists as they talk about resolving a real-world contemporary problem in science. This work engages a qualitative methodology (Miles and Huberman, 1994) seeking themes from within the data themselves, and focuses on a particular ‘case’ (Creswell, 2007) of an interdisciplinary group. Broadly conceived, this research aims to offer an up-close look at how a workshop leader and participants talk about a problem and how they organize themselves for collaborative work.

Context and participants

This paper examines the communications and goals of a set of scientists, scientist/administrators, and technologists in a creative brainstorming and analysis session. The objective of the two-day workshop was not to analyze the normal work practice of the scientists but rather to ask the participants to envision science advances that would be achieved with improved cyberinfrastructure. The central premise of the workshop was that new science could be enabled across biological research stations through the development of shared cloud-based cyberinfrastructure. The work was funded under the National Science Foundation’s Software Infrastructure for Sustained Innovation program [6]. The workshop participants were scientists who had published papers that indicated that the work had been conducted at a member station of the Organization of Biological Field Stations (OBFS) or were otherwise associated with OBFS activities. Ecological diversity in participants was assured by initially selecting projects across National Ecological Observatory Network, Inc. (NEON, Inc.) regions. NEON has partitioned the United States into 20 eco-climatic domains, each of which represents different regions of vegetation, landforms, climate, and ecosystem performance. The tropics were represented by the Organization for Tropical Studies representative. When the initial pool of invitees had been exhausted, remaining slots were filled by issuing randomized invitations to attendees of a 2013 OBFS meeting. Also invited were representatives of major data and cyberinfrastructure providers, including NEON, the Department of the Interior’s BISON, and the iPlant Collaborative (since rebranded as CyVerse). The PIs and students of the S2I2 grant served as facilitators and recorders for the workshop. The role of the facilitators was to help the scientists understand the capacities and limitations of cloud technology and to focus on new science that could be enabled by that capacity. Overall there were 29 participants.

While the careers of the participants had sometimes overlapped, there was no one specific research goal that caused the participants to form a community other than the use of at least one biological field station in OBFS. They represented biological sciences ranging across biological scales from the molecular, through the organismal, to the ecosystem scale.

This work interrogates how these participants arrange themselves, and express or ‘give off’ a sense of identity, positionality, and relationship with one another and relative to shared constraints in the academy (e.g., disciplines) and beyond. These scientists were asked by their workshop leader to continually break out into smaller groups (five to eight people), so their ‘groupings’ within the group were forced. Of particular interest, though, is how these scientists talk about their grouping process alongside their talk about challenges in their work. Usually groups were designed to be heterogeneous, to maximize the diversity of views within the groups. At other times groups were self-formed around shared interest in sub-problems that had previously been identified by the participants themselves (e.g., the role of field stations in developing research questions and monitoring the impacts of sea-level rise on the east coast of the U.S.) [7].


Analytically, this project relies on the analysis of discourse as a tool, given that a focus on language and talk in a group can provide a way to see participants’ identities in relation to others in a group. Embedded in discourses is much more than just the content of a recorded conversation: discourse refers to interactional patterns and also broad themes embedded in conversation (Gee, 2005). Through a close analysis of conversation, discussion, and dialogue conceived as ‘talk’ (Cazden, 2001), we can illuminate social relations and roles (Gee, 2005; Sperling, 1995) as well as notions of selves in relation to others (Johnston, 1996). Indeed, discourse can provide a way to see social relations and connections, and as Wood and Kroger assert, “talk creates the social world in a continuous ongoing way” [8]. By exploring the discourses flowing through the scientists’ talk and by looking at the content of their conversations in particular, we can thus get a glimpse of the social world they navigate in their work. This project thus interrogates group talk conceived as discourse in order to uncover social roles inhabited and relations among and across these scientists as they address web-based data concerns.

This project followed protocols that are well established in qualitative research traditions (Lindlof, 1995). Data collection and organization were IRB approved and involved the audio recording of whole-workshop discussions between a leader and meeting attendees. These recordings were then downloaded and transcribed. Interpretive analytical coding work was taken on by just one member of the research team and involved an initial reading of the transcript to render a broad sense of the data. Then, the transcripts were read and re-read for qualitative codes in the data that were, after several iterations, combined and connected in the form of broad themes working across the data set (Miles and Huberman, 1994).

To address interests in seeing how workshop participants situate their commonalities, constraints, and grouped selves, this project first explored how these scientists situated their problems with data work in relation to emerging opportunities for data sharing in the cloud. To find out how these scholars situated their problems with contemporary data, the first research question is proposed:

RQ1: How do issues with sharing data in the cloud get discursively situated by an interdisciplinary group of scientists?

When addressing these problems and charged with taking a broad look at contemporary problems in science, participants had to organize themselves into working groups for the workshop. To examine the identities deployed or the self-identifying mechanisms utilized to organize workshop participants, the second research question is posed:

RQ2: How does an interdisciplinary group of scientists sort themselves when attempting to address a shared problem?

To interrogate these questions, moments of talk from Shawn, the workshop planner, as well as whole-group discussions are critically analyzed for underlying discourses running through these scientists’ interactions. Findings related to this interpretive analysis are presented in the next section.



Framing shared problems

As described previously, these scientists came together to resolve a shared problem faced when considering tools that can support sharing data in the cloud. As previous scientists (Heidorn, 2008; Palmer, et al., 2007) have suggested, much of the data collected in previous decades exists as ‘dark data’, that is, data sitting on disks, personal computers, or in personal files. Federal agencies, private foundations, and publishers now frequently require broad dissemination of research data in as open a manner as possible. Certainly, without computing infrastructures in place, much of the data drawn from smaller projects exists as dark data and is not widely available. If those dark data from smaller projects were collected, scientists would have a broad set of data available to them. Shawn, the workshop planner, began his talk for the scientists by explaining how we might be more productive as scholars if we coalesce toward a ‘critical mass’ of scholars.

If you have a lot of it that’s closer together, or a lot of dark data closer together, you could pull it together and get to a critical mass to make it easier to share and useful for science. We could make an analogy to star formation. ... the argument is [that] all of our data, if we can organize it and provide the right tools, it’ll come to light. It’ll get bright and become a star. Everybody can use it, and it’ll no longer be dark. I like analogies ... .

If scholars share and work together through the Internet or other Web-based platforms, more data will be made available, viewable, and usable; data will become increasingly ‘democratized’ and access across communities of scholars and practitioners will be enhanced. The move toward increased data-sharing and enhanced consideration brings increasingly large sets of data globally. As Shawn continued,

... there is a huge growth in the amount of data that we can acquire. Sequenced data, in particular, is outstripping the Moore’s Law, so, in fact, the amount of data we have is growing faster than our computing capacity to process that data. Multi-motive, data sensing, things like that, that, across the board — from this type of data acquisition to observational data and so on — where we’re seeing a lot greater influx of data as driving a lot of the science.

To give context for the current interrogation, then, Shawn’s talk shows that this particular group of scientists came together to talk about their data in relation to ‘the cloud’ or related Web-based tools.

Outsourcing or sharing support as common good. These scientists shared in their concerns for software support at their home institutions. Innovation is indeed constrained without expensive and innovative software packages as well as the technological support to utilize those packages in their work and in their teaching. Needed, then, are Web-based sites, tools, and packages to support contemporary science, as Shawn explained.

I think you’ve probably all heard of this notion of the cloud, that things like Gmail and Dropbox and so on are things that run in the cloud, and that’s this notion of outsourcing. ... The term [the cloud] that is widely used to describe the type thing we’re interested in is what’s known as software as a service. ... Which of those [services] can we push out and have somebody else run and operate for us?

With scientists constrained by the realities of the current economic milieu, science at home or in institutions of higher education has become increasingly difficult. Shawn continued to frame the contemporary problem in science this way.

With the idea that they would be better operated, better maintained, kept more up to date, we can aggregate the cost or amortize the cost of acquisition and maintenance and operations across a larger group of people. [This is preferable, we can do this] rather than [rely on] that poor graduate student sitting in your lab who really wants to work on their dissertation, but is stuck updating the latest version of the software.

Problems associated with sharing data in the cloud framed this particular workshop, and Shawn positively situated ‘outsourcing’ and related sharing possibilities for those in the sciences. That is, these scholars were thus coming together to consider issues of software, hardware, Internet, and Web-based or Internet-related data platforms, tools, and practices in an effort to streamline data work, and Shawn, from the start, shed positive light on emerging Web-based possibilities. These scientists’ attendance at the workshop for resolving a shared problem can be attributed, at least in part, to a recognition of the very localized and stressful problems associated with lacking funding, poor infrastructure at many campuses of higher education, and concerns about aging software packages and capabilities at one’s own place of scientific work.

Academic culture as disruption to the common good. As intimated previously, scholars have pointed to the many institutional barriers to working across disciplinary lines in academic work. Certainly, the very work academics do is embedded in a broader set of trends, requirements, constraints, and expectations (e.g., the charge for tenure, requirements for publication, granting institution expectations). Shawn seemed to recognize the individualized culture and related constraints in contemporary academe; he addressed this point in his workshop.

It’s not that — everybody’s a good citizen and doing good science, but your rewards are for publications, not for data sharing. It’s a lot of work to document the data, if you want to be the first one out, and then the control and ownership of the data is a big, messy problem.

Shawn pointed to the tenure process in academic institutions and raised concern about efficient publishing practices that are important for scholarly ‘rewards’ in research-related work. Though scholars share in both needs for Web support and stressors from the institutions to which they are beholden, collaborative data-sharing on the cloud for the good of science is in some ways oppositional to individualized concerns.

While research institutions reward individualized accomplishments, they are also imbued with protective and legal coverage that inhibits data sharing and complicates notions of ownership. As Shawn suggested, “It’s not clear who has the rights to release a lot of different kinds of data, so a lot of challenges, a lot of time.” Institutional notions of data ownership constrain moves toward collaborative data sharing for the good of the broader scientific enterprise. Sometimes legal issues surrounding ‘ownership’ are tied to notions of capital, situating data, its production, and related research findings as commodities in a capitalistic culture, a culture ripe for the commercializing of data-related practice.

Capital, commercialized threats, and the economy of innovation. In some ways, sharing data on the cloud implies trust of commercial interests. In such an environment, data become the commodity providing a rich place for commercialized competition. These scientists considered commercial barriers in this way.

Discussion Moderator: ... programs, like [those with] Amazon ... that are put in place for science data ... they’ll take data, literally for free, and host it for free within certain constraints that they’re still often a little fuzzy about. You manage it with their kind of, it’s a popular app. He just said it. I know that it will come up, as far as your third-party. You’re competing against it.

Workshop Participant: Right.

Discussion Moderator: I think this is actually a wingding for everybody. I’m not sure this is a bait and switch or any sort of type walk-in solution. I’d also probably argue that we eventually need to get to the point where these things are on your personal cloud, and saying that we trust a load sequencer to do things we need to sort of trust that environment for our computing.

Workshop Participant: Right.

These scientists pointed to the National Science Foundation (NSF) and conveyed a kind of shared understanding about the NSF as a leader of sorts. No one defined the NSF, for example, and instead the NSF was discursively situated as central to the work they do in science:

Discussion Moderator: Then you have the NSF. The NSF is definitely interested in pushing this though. They’re slow to arrive, but, for example, if you’re using it, there is a computing in the cloud program, where you have CloudOne. You can write proposals, get ’em online, and then you can mold how you’re gonna use ... time on Microsoft’s de jure cloud.

For this group, even the NSF was moving on board with commercialized data support. These scientists shared in a sense of concern about commercialized interests that underpin much of their work.

Discussion Moderator: ... some of you may remember that Google would send you a suitcase with a hard drive that said, “Just send your data. We’ll take care of the rest.” It took them less than six months to close that program down.


Discussion Moderator: No company can sustain this longer. It will, in six months, change their mind. If you don’t have an institutional buffer or your data management plan to know, your plan now changes. What do you do? You’ll be screwed.

The news on commercialized interests was not entirely negative for these scientists, but commercialized factors can indeed lead to scientists being ‘screwed’ in their work.

Overall, these scientists recognized that there is a broad threat to their common work in light of commercial interests. As this segment of the discussion shows again, the NSF is a central power in their day-to-day engagements — the NSF will not ‘protect’ these scientists from commercial concerns.

Discussion Moderator: The fact that, if you leave it to the commercial sector, this sort of stuff can happen. If you have, perhaps the government set it up, where it’s something more standardized, you can perhaps have more confidence that it might be there, in some way, shape, or form, 20 years down the line; maybe yes, maybe no.

Workshop Participant: I think one thing to realize is the whole reason for existence of this particular program, and who’s paying for this meeting, is the fact that the government can’t. NSF is not willing to pay to sustain software development ...

Concerns about the marketplace of data work moved through these scientists’ discussion with one another — commercial factors tied to their work were threatening to the sustainability of data initiatives. Participants voiced awareness of the importance of capital as an influential factor in their current and future endeavors.

Indeed, the very economics of science were at the forefront in these science workshops. Talk of ‘drowning’ pointed to the economic threats in the face of needed commercial support.

People are drowning differently, and there are many different stakeholders involved. You, as scientists, are drowning, but so are the data archivists, so are the NSF science policy-makers that are saying you’ve gotta leverage this stuff. You’ve gotta get more money out of it. You’ve gotta put it in repositories. Nobody knows how to sustain the repositories. The funding, even for things like approaching databank, people are really worried about how we’re gonna keep those going. There’s a lot of different players concerned with these issues.

Broadly analyzed, Shawn’s talk in the workshop as well as some whole-group talk segments offer a particular view of how these contemporary concerns in science are framed by a group of academics working on the ground. Though these scientists were drawn from across sectors (e.g., campus, organizations), ‘outsourcing’ in the context of research, data work, and sharing in related software management was situated as a possible and, in fact, an optimal solution. These scientists also referenced a variety of contextual factors that frame contemporary science — individualized academic culture, institutionalized legal concerns, and a broad capitalist culture were all considered disruptions or barriers to the common scientific enterprise.

To resolve these tensions in contemporary science, collaborative teams are increasingly interdisciplinary as research groups aim to ‘divide’ the work or come together — as this group of scientists did — to find creative solutions to real-world concerns. In spite of and in the face of the institutional pressures or commercial interests that create a climate of concern and mistrust, Shawn suggested that there are “definitely a whole number of movements of collaboration across science.” To collaborate, these scientists began by sorting themselves, exploring the ‘granular’ divisions among them as a group of seemingly like-minded scientists who shared a real-world contemporary problem.

Organizing scientists

Organizing by disciplinary label then data type. Though these scientists were asked to consider their data type and challenges tied to how distinct types of data are housed, Shawn recognized the propensity to group people and their academic practice by broad disciplinary label (e.g., biology). Shawn asked, “What are the actual needs in terms of data and processing associated with that science? ... When you say, ‘What’s the science?’ If we get 19 cards that say ‘biology,’ on one side ...”. He trailed off in his apparent discomfort with the ambiguity left by broad disciplinary identity, and in the end, he asked the group to get beyond their disciplinary home to consider data type.

In addressing data-sharing issues in the cloud, it is actually the ‘types’ of data as opposed to disciplinary structures that play into how scientists approach their cloud-based data concerns. Shawn suggested a kind of ‘gold standard’ in sets of data, then pointed to other types of data that compete with those data traditions.

The gold standard is used, like the zoo, and [one can just take their data and] put it in an archive ... particular archives ... are ready to take most of your data. In some areas, [as in] a lot of the sensor work ... there’s no place to put those data. They didn’t get shared, but even if [scientists] want to share them, there was no home for them at all.

Though data type matters when conceiving how to house particular kinds of data points, Shawn was clear in pointing out how types of data work together, as in the example of climate change research: “... there is an argument that people who are dealing with sort of organism-level species out there, and then you have people who dealing with ecological and community level data. The two are not divorced.” Shawn continued to talk about how “there are people and scattered information all over, some of the big questions aren’t even being looked at because you don’t have the integration [needed].” Beyond just the “integration needed across researchers,” he continued, “Within even certain sciences, like meteorology, they have so many different funding agencies that don’t talk to each other, that you’re not even linking your different resources within a specific area or field of study that is then being asked in these large questions.”

Organizing by data type would at least resolve some of the issues tied to disciplinary or field-related divisions. Putting plot data (measurements taken on a particular plot of land over time) on the cloud, for example, seemed reasonable to Shawn, who argued, “if we put it on the cloud, and a bunch of field stations could plot-data, [that would] suddenly make life easier ...”. A scientist participating in the workshop also explained it this way:

Workshop Participant: Right, so we need to know a level of granularity so that we can break out into five groups again to work on similar things. If we just say “biology,” you’re not gonna find many similarities within — you’ll be too abstract for effective use anyway. I don’t want to impose any particular areas of science. Maybe we can get some examples from the group, so if people have ideas of: What’s your pain point or opportunity points? Sub-problems? Yes?

Discussion Moderator: Organizing long-term environmental data. I guess that’s —

Workshop Participant: Long-term environmental data. Yeah, we probably need to deep drill-down on what the environment data, yeah, so —

Discussion Moderator: Yeah.

Workshop Participant: Yeah, it could be biotic ... can be biotic.

Discussion Moderator: It could be anything ...

Workshop Participant: Right.

Discussion Moderator: Plant diversity, pictures associated with that.

Workshop Participant: ... Hopefully, within the three or four scientists in your group, you can work this out.

The moderator spoke of additional means for finding granularity in data type by discussing “spatial and temporal and technical images, geolocation, different genres and forms, genomics ...” and he pointed to a variety of data types around which we might socially organize, “genomic and protein data, and also the observational field data turned to what species are present and not present, that kind of thing.” Shawn suggested that sometimes it is simply a “representation problem, many times it comes right down to a granularity, like a liters versus feet versus miles kind of thing.”

Broadly conceived, these scientists had to work to clump themselves into coherent piles of practice, with data type an ongoing theme in their discussion. However, as illuminated in the next section, these scientists needed to make themselves distinct by data type but then come back together to form working groups based on a re-imagined set of generalities. Beyond clean groups of data types, that is, other ideas about how to create ‘groupings’ emerged.

Beyond data type: Considering discipline, sectors, and data ‘location’. When charged with finding ways to form working groups to address data-sharing issues, and once these scientists had moved through their discussions of data type, participants reflected on disciplinary boundaries and how they relate to data type and research teams.

Workshop Participant: I think the point about how you define data is critical because we’re in an era now where the buzzwords that tend to get funding is “multidisciplinary,” “interdisciplinary.” Oftentimes, we’re in projects which we’re collecting all kinds of data; numerical, photographic, qualitative, interviews, models, systematics. They all go into a package. You really can’t understand the study without all of those different parts, and they’re all in different kinds of data. How does that work for something like this?

Discussion Moderator: Yeah, so heterogenic in the pail. One of the slides I went through is our recognition that what’s different, I guess, in interdisciplinary studies, but trying to bring data together [that] is heterogeneous.

Discussion Moderator: Right.

Workshop Participant: It’s either heterogeneous because the science you’re doing is different, or it’s just because everyone has their own practice. You’re all measuring diameter plus type of a plot for trees, but you organize it differently. Some of the problems are really hard. Some of the problems are not as hard, when you stay together and collaborate.

Workshop Participant: No, but a common location. For field stations that cross-manage data that’s gathered at field stations, you’re looking at tens of different types of data in a similar location.

Workshop Participant: Right.

Workshop Participant: We’re looking at geographic datasets and being able to search on locations. Like, “Oh, here are the three PIs who have done work. Here’s 15 things they’re looking at, all in the same square of it.”

Discussion Moderator: Right.

As this segment of discussion suggests, research teams may be ‘interdisciplinary’ in nature, but work that stems from a particular location while involving a variety of data types is common and needs to be addressed when people try to work collaboratively. That is, unpacking some of the difficulties with sorting people (scientists in this case) involves an important examination of research process, the point of this particular project.

In their thinking about divisions amid their eclectic group, these scientists referenced the kind of work people do across sectors. “You’re here — some of you are scientists just doing work at field stations. Others are station managers. Others are IT people that coordinate with those stations and do work ... There’s quite a mix.” Threaded through the talk in these workshops were references to work site or station, markers that aided them in organizing themselves as collective groups of distinct practitioners and scholars nested in particular sectors as distinct environments.

Organizing by collection processes — the generality that connects. As these scientists aimed to resolve issues with sharing their data, and once they had discussed organizing by data type, location, or sector, it was the processes in the day-to-day work that ultimately helped them to connect and talk directly about their needs. Shawn purposely organized the group this way in order that these scientists might be allowed to talk more concretely about processes and tasks they face in their work.

For the next breakout session, what we’ll do is put like people together that are trying to solve the same kind of problems, and then outline everything from your data acquisition to publishing your paper, plus all the stuff that happens in the middle. Where are the sweet spots? Where are there gems of software we could make more broadly available? Those sorts of things.

To discuss data as a collective group, data collection processes emerged as an effective way for these scientists to connect with one another. These scientists — once moving beyond disciplinary label, data types, and talk of data location to a certain extent — continued to work extensively on finding common ground.

Discussion Moderator: You’re looking for generalized pain points?

Workshop Participant: Yeah, generalized pain points.

Workshop Participant: Okay.

Workshop Participant: Yeah, or opportunities to do positive science.


Indeed, for these scientists, talking about data-collection and sharing processes created a rich site for talking about commonalities, for connecting with one another as a community of scientists, and for pursuing the common good of doing ‘positive science’.

Overall, these segments of talk among scientists show their collective concerns about data, the future of science, and the common good. In these turns of talk we can see Shawn as a workshop moderator trying to organize scientists in a variety of ways, and we can see a group of scientists wrangling with the pros and cons of carving up their group in particular ways to talk about shared problems. We can also see the temporal movement from more typical divisions (e.g., discipline, data type) to more grounded and concrete discussion about processes involved in the most mundane moments of science. In the end, Shawn wanted ideas about how to solve a problem. He questioned his workshop participants and aimed for concrete answers to an important question in science.

What are the commonalities? What are the important things that you do that are barriers to progress, that are really essential to doing work, that could be codified in these types of reusable services that we can develop in what’s called hosting, and have somebody else operate and maintain them for you, and you just use them, the same way today you use Gmail or today use Dropbox or something like that?

While providing answers to these questions is beyond the scope of this particular research project, this examination of scientists talking offers a sense of how scientists may be faring amid increasingly large data projects and a growing reliance on interdisciplinary teams to solve problems in economically strained environments. In keeping with a critical approach to this work, we can indeed see these scientists wrangling with institutional labels (e.g., biologist) that reinforce division and that, to some degree, impede their ability to connect or to create a sense of community with others doing different work in the same place, with the same mission, or on the same topic.

Sociality in a community of scientists. Shawn led this group in a discussion of a shared problem and began with a collaborative aim: he guided the group in their working together. An examination of discourse shows a kind of sociality to their work, even though they were aiming to resolve a technical problem. Shawn pointed to the importance of the social (the focus of this project) by saying very clearly, “We’re very interested in the balance between what technology can do and what social engineering can do.”

In this group of scientists’ talk, there is indeed a sociality to how researchers go about working together and organizing themselves. To add to the turns of talk offered above, this discussion shows these scientists brainstorming about how to classify themselves and the work that they do.

Workshop Participant: Right, so macro-systems versus — and sometimes climate is macro-systems, but often not. Macro-systems could be — yeah.

Discussion Moderator: I changed climate change to global change because there are also agendas in the way we’re using the land that are not —

Discussion Moderator: Landscaping, yeah.

Discussion Moderator: — necessarily — yeah, not necessarily climate data.

Workshop Participant: We can call it global change.

Discussion Moderator: Global change, yes.

Workshop Participant: Okay. Including climate?

A sociality prevailed in the talk, along with a lightness in its tone.

Workshop Participant: Well, we could break into different kinds of groups. There could be some groups that want to focus on I’m the manager of my field station. I need better tools for managing my field station. There are other people that might wanna focus around climate change, or — and flux towers or something.

Workshop Participant: We could do it by hair color. [Laughter] Or we just count off.

The moment was not lost on this particular participant, who treated the very processes by which we organize ourselves as scholars, the focus of this project, as marked moments in our day-to-day work and as sites for deploying humor and connecting socially with one another.

At minimum, they made friends. At the end of each transcribed workshop meeting there was often a call for gathering informally, with “there’s beer” transcribed in a section of closing and inaudible cross talk. As Shawn facilitated one of the break-out sessions in the group and assigned an exercise, he asserted based on the assignment, “The output of that [exercise] will mostly be, other than friendship, cards that describe your science in just a few words.”

Ultimately, and most provocatively, this group aimed to find commonalities in order to talk about a very real concern about modern-day data and contemporary threats to science innovation. Though facilitated by Shawn, a scientist in a group of scientists, we see members of this group trying to organize themselves in a variety of ways that to some degree get beyond traditional boundaries that typify much scientific work. Most powerfully, we see these scientists working to cross academic boundaries in spite of institutionalized pressures and economic realities that constrain them. They emphasized a common good mission in their talk of ‘transformative science’ and the tone of their talk was collaborative and empathetic — they were invested in the success of the entirety of the group. They voiced their prevailing goal of engaging in ‘transformative science’ and innovation.

Transforming science together

From shared threats (e.g., commercialization of data support) and notions of common good, to the social organizing to address those shared problems, illuminated in the discourse is a commitment to transformative science and a strong push for innovation. Shawn explained that data in science does not have to be a separate enterprise, that scholars divided are somehow less productive than those working together:

This is the just the principles of fusion, so if we need to remove barriers to sharing, make computing easy and friendly, so part of the idea is, instead of making data management a separate painful operation from science, why don’t we give you tools, software tools, that just make it part of your workflow, so that it’s not as painful to produce the data in shareable format?

Our field is ripe and ready for tools. Give us money. Let us build some. We would like to be able to help you, if we can help articulate where the biggest payoffs may be. Say, “Here’s some really interesting work problems for these areas of biology,” that we could argue, if we could build the right set of tools, you could really move to the next stage of your science.

Certainly in this segment of his workshop talk, there is a return to capital, science as laden with fiscal constraint. Indeed, as part of the push for better science, there is a disconnect between “Better science for NSF and the community, and better science for the individuals that are generating the data and sharing it.” There is also a disconnect in terms of fiscal and institutional rewards for particular kinds of science research, “... a lot of what you’re talking about are sort of community norms and community expectations, which is fundamentally different from the incentive structure within these sciences.”

In the end, the shared problem of data sharing, management, and related infrastructures is a primary impediment to contemporary science across scientists, and one in need of resolution. It is the mundane that influences innovation, “... this could be the difference between doing something innovative and not doing something innovative, in that you’ve basically burned half of your effort on stuff that has nothing to do with transformational science or discovery.” These scientists’ conversations illuminate perspectives on data in the cloud (e.g., outsourcing or sharing in infrastructure support as preferable; commercial and economic concerns) as well as preferences for ‘group-ness’ across boundaries; much of this was clear in the content of their talk. Notions of a kind of ‘common good’ and a push for community and connectedness in the hopes of science ‘transformation’ were also discursively exuded in the tenor of their talk. These scientists and their scientific discourse say a lot about shared data concerns in an increasingly interdisciplinary environment.



Concluding remarks: Organizing work in an interdisciplinary and data-driven milieu

Contemporary science has witnessed recent shifts that are powerful, cultural, and sit well beyond the confines of academia. The very nature of our knowledge-related capabilities has changed with increasingly large amounts of data being managed and used with smaller tools. Computing infrastructure is needed, however, and most practicing scientists cannot manage those needs alone. So, enhanced reliance on Web-based data work leads to and, in fact, requires an increased sense of sharing, community, and access across those aiming to resolve contemporary problems. Indeed, dark data — those historically lost on individual floppy disks and other personal tools — can now be saved, collected, combined, and accessed to form a broadened set of opportunities for knowledge gaining, creation, and sharing. The world of data and science has become increasingly democratized as those previously lacking access to appropriate tools can now rely on Web-based platforms to support their own work and to gain traction toward a kind of common good among scientists and in society.

These findings can be used to inform practitioners on issues of team planning, group collaboration, and interdisciplinary discussion. Workshops like the one interrogated in this study can provide the space needed for creative problem solving. Rather than brainstorming about how to organize the group, workshop leaders can move directly into discussions of research process. Moving beyond disciplinary groupings can mean talking most immediately about important processes (e.g., fixed plot data, collecting glacier samples).

These findings point to areas ripe for future research (i.e., how these findings square with what we know about communities, communities of knowledge or practice, and also in relation to movements toward the democratization of data). Certainly this work is well situated in an ongoing conversation about communities of practice, those “communities that form out of a need for social connection, building on a sense of shared interests in order to meet professional needs” [9]. Additionally, literature on knowledge communities (e.g., Wang, et al., 2013; Yigitcanlar and Dur, 2013) yields an emerging set of related research studies, though works on ‘domain-spanning’ academic work (Leahey and Moody, 2014) are of particular interest. This project can thus contribute to ongoing academic conversations about problem-based group and community work. Future research should similarly interrogate nuances in academic talk or scientific community discourse in an effort to parse out what happens when large collaborative teams work together.


About the authors

Catherine F. Brooks is an Assistant Professor in the School of Information and Department of Communication, and is also the Director of Undergraduate Studies and Director of the Center for Digital Society and Data Studies in the School of Information at the University of Arizona.
E-mail: cfbrooks [at] email [dot] arizona [dot] edu

P. Bryan Heidorn is an Associate Professor and is the Director of the School of Information at the University of Arizona.
E-mail: heidorn [at] email [dot] arizona [dot] edu

Gretchen R. Stahlman is a doctoral student in the School of Information at the University of Arizona.
E-mail: gstahlman [at] email [dot] arizona [dot] edu

Steven S. Chong is a doctoral student in the School of Information at the University of Arizona.
E-mail: stevenchong [at] email [dot] arizona [dot] edu



1. Though the focus of this work is on interrogating discourses about data-sharing across disciplines in the sciences, more can be analyzed about facilitation processes at work in analyzed meetings. Some of the processes at work for these scientists seem to follow, to some extent, the procedures suggested by Michener, who learned his methods as a facilitator for town hall meetings. In the University Leadership Academy (developed at Harvard Business School) they also cover consensus building and provide techniques to create heterogeneous groups, developing buy-in by supporting participant active engagement and the like. A quick Web search provided sites like Mindtool that give a brief outline at

2. This paper focuses on the behavior of the participants in the workshop in their quest to identify common goals. A separate paper under development will focus on the cyberinfrastructure-enabled research that was identified through the discussion and consensus process.

3. Derry and Schunn, 2013, p. xii.

4. Spelt, et al., 2009, p. 366.

5. McMillan and Chavis, 1986, p. 9.

6. This workshop was sponsored in part by the U.S. National Science Foundation through the following collaborative SI2-S2I2 grants: 1216726, 1216754, 1216872, 1216879, 1216884;

7. Follow-up conversation with the workshop facilitator suggested that, in order to avoid groupthink, brainstorming, assigning roles (to spur the adoption of others’ views), and Delphi techniques (e.g., Rixon, et al., 2007) were used.

8. Wood and Kroger, 2000, p. 4.

9. Brooks, 2014, p. 722.



T. Becher and P.R. Trowler, 2001. Academic tribes and territories: Intellectual enquiry and the cultures of discipline. Second edition. Philadelphia, Pa.: Open University Press.

C. Bell and H. Newby, 1976. “Communion, communalism, class and community action: The sources of the new urban politics,” In: D.T. Herbert and R.J. Johnston (editors). Social areas in cities. New York: Wiley.

C.L. Borgman, J.C. Wallis, and N. Enyedy, 2007. “Little science confronts the data deluge: Habitat ecology, embedded sensor networks, and digital libraries,” International Journal on Digital Libraries, volume 7, numbers 1–2, pp. 17–30.
doi:, accessed 17 January 2016.

C. Brooks, 2014. “Faculty, community, information sharing, and professional support in the age of Facebook,” In: M. Searson and M. Ochoa (editors). Proceedings of Society for Information Technology & Teacher Education International Conference 2014. Chesapeake, Va.: Association for the Advancement of Computing in Education (AACE), pp. 722–726.

C.B. Cazden, 2001. Classroom discourse: The language of teaching and learning. Second edition. Portsmouth, N.H.: Heinemann.

J.W. Creswell, 2007. Qualitative inquiry & research design: Choosing among five approaches. Second edition. Thousand Oaks, Calif.: Sage.

S.J. Derry and C.D. Schunn, 2013. “Interdisciplinarity: A beautiful but dangerous beast,” In: S.J. Derry, C.D. Schunn, and M.A. Gernsbacher (editors). Interdisciplinary collaboration: An emerging cognitive science. New York: Psychology Press.

S.J. Derry, C.D. Schunn, and M.A. Gernsbacher, 2013. “Preface,” In: S.J. Derry, C.D. Schunn, and M.A. Gernsbacher (editors). Interdisciplinary collaboration: An emerging cognitive science. New York: Psychology Press.

D. Estrin, W. Michener, and G. Bonito, 2003. “Environmental cyberinfrastructure needs for distributed sensor networks: A report from a National Science Foundation sponsored workshop, 12–14 August 2003, Scripps Institute of Oceanography,” at, accessed 17 January 2016.

M. Foucault, 1988. Madness and civilization: A history of insanity in the age of reason. Translated by R. Howard. New York: Random House.

M. Foucault, 1977. Discipline and punish: The birth of the prison. Translated by A. Sheridan. New York: Pantheon.

P. Freire, 1990. Pedagogy of the oppressed. Translated by M.B. Ramos. New York: Continuum.

J.P. Gee, 2005. An introduction to discourse analysis: Theory and method. Second edition. New York: Routledge.

R. Gunderson, 2014. “Social barriers to biophilia: Merging structural and ideational explanations for environmental degradation,” Social Science Journal, volume 51, number 4, pp. 681–685.
doi:, accessed 17 January 2016.

P.B. Heidorn, 2008. “Shedding light on the dark data in the long tail of science,” Library Trends, volume 57, number 2, pp. 280–299.
doi:, accessed 17 January 2016.

B. Johnston, 1996. The linguistic individual: Self-expression in language and linguistics. New York: Oxford University Press.

J.T. Klein, 1996. “Interdisciplinary needs: The current context,” Library Trends, volume 45, number 2, pp. 134–154.

E. Leahey and J. Moody, 2014. “Sociological innovation through subfield integration,” Social Currents, volume 1, number 3, pp. 228–256.
doi:, accessed 17 January 2016.

T.R. Lindlof, 1995. Qualitative communication research methods. Thousand Oaks, Calif.: Sage.

M.S. Mayernik, A.L. Batcheller, and C.L. Borgman, 2011. “How institutional factors influence the creation of scientific metadata,” iConference ’11: Proceedings of the 2011 iConference, pp. 417–425.
doi:, accessed 17 January 2016.

D.W. McMillan and D.M. Chavis, 1986. “Sense of community: A definition and theory,” Journal of Community Psychology, volume 14, number 1, pp. 6–23.

M.B. Miles and A.M. Huberman, 1994. Qualitative data analysis: An expanded sourcebook. Second edition. Thousand Oaks, Calif.: Sage.

C.L. Palmer, M.H. Cragin, P.B. Heidorn, and L.C. Smith, 2007. “Data curation for the long tail of science: The case of environmental sciences,” paper presented at the Third International Digital Curation Conference, Washington, D.C.; version at, accessed 17 January 2016.

E.J.H. Spelt, H.J.A. Biemans, H. Tobi, P.A. Luning, and M. Mulder, 2009. “Teaching and learning in interdisciplinary higher education: A systematic review,” Educational Psychology Review, volume 21, pp. 365–378.
doi:, accessed 17 January 2016.

M. Sperling, 1995. “Uncovering the role of role in writing and learning to write: One day in an inner-city classroom,” Written Communication, volume 12, number 1, pp. 93–133.
doi:, accessed 17 January 2016.

N. Vaughan and D.R. Garrison, 2006. “How blended learning can support a faculty development community of inquiry,” Journal of Asynchronous Learning Networks, volume 10, number 4, pp. 139–152.

G.A. Wang, J. Jiao, A.S. Abrahams, W. Fan, and Z. Zhang, 2013. “ExpertRank: A topic-aware expert finding algorithm for online knowledge communities,” Decision Support Systems, volume 54, number 3, pp. 1,442–1,451.
doi:, accessed 17 January 2016.

E. Wenger, 1998. Communities of practice: Learning, meaning, and identity. New York: Cambridge University Press.

L.A. Wood and R.O. Kroger, 2000. Doing discourse analysis: Methods for studying action in talk and text. Thousand Oaks, Calif.: Sage.

T. Yigitcanlar and F. Dur, 2013. “Making space and place for knowledge communities: lessons for Australian practice,” Australasian Journal of Regional Studies, volume 19, number 1, pp. 36–63.


Editorial history

Received 7 May 2015; accepted 19 January 2016.

Copyright © 2016, First Monday.
Copyright © 2016, Catherine F. Brooks, P. Bryan Heidorn, Gretchen R. Stahlman, and Steven S. Chong.

Working beyond the confines of academic discipline to resolve a real-world problem: A community of scientists discussing long-tail data in the cloud
by Catherine F. Brooks, P. Bryan Heidorn, Gretchen R. Stahlman, and Steven S. Chong.
First Monday, Volume 21, Number 2 - 1 February 2016


© First Monday, 1995-2017. ISSN 1396-0466.