I describe in this paper the creation and operation of the Open Science Grid (OSG), a distributed shared cyberinfrastructure driven by the milestones of a diverse group of research communities. The effort is fundamentally collaborative, with domain scientists, computer scientists and technology specialists and providers from more than 70 U.S. universities, national laboratories and organizations providing resources, tools and expertise. The evolving OSG facility provides computing and storage resources for particle and nuclear physics, gravitational wave experiments, digital astronomy, molecular genomics, nanoscience and applied mathematics. The OSG consortium also partners with campus and regional grids, large projects such as TeraGrid, Earth System Grid, Enabling Grids for E-sciencE (EGEE) in Europe and related efforts in South America and Asia to facilitate interoperability across national and international boundaries.
OSG’s experience broadly illustrates the breadth and scale of effort that a diverse, evolving collaboration must undertake in building and sustaining large-scale cyberinfrastructure serving multiple communities. Scalability (in resource size, number of member organizations and application diversity) remains a central concern. As a result, many interesting challenges continue to emerge and their resolution requires engaged partners and creative adjustments.
Introduction to Open Science Grid
Practical lessons learned from operating a large cyberinfrastructure
Scalability, scalability, scalability
Challenges faced by Open Science Grid
Introduction to Open Science Grid
Brief overview of OSG
Open Science Grid is a consortium of more than 75 institutions and organizations that operates a large-scale shared distributed cyberinfrastructure (including computing, storage, networks, software, security and support) supporting U.S. science and engineering. Its fundamental goal is to support the scientific discovery process of consortium members and partners who utilize the distributed facility. A significant additional benefit is the close partnership that has developed between scientific user organizations, resource administrators and technology providers in carrying out OSG’s program of work.
OSG’s basic goals can be stated relatively concisely:
Operate a secure distributed petascale cyberinfrastructure across the U.S. that provides both guaranteed and opportunistic access to shared computing and storage resources.
Engage and benefit research efforts of all scales by progressively supporting their applications.
Educate and train students, researchers, system administrators and educators.
Interface and federate with campus, regional, national and international grids (particularly the large-scale partners EGEE and TeraGrid).
Evolve the capabilities and reach of the OSG facility by adapting and deploying externally developed software tools and technologies.
The individual facilities making up the OSG cyberinfrastructure are shown in Figure 1. There are more than 80 separate resources, each containing 100 to 4,000 processors and representing a total of 25,000 CPUs and approximately 4 petabytes (4,000 terabytes) of disk. Not shown are the links provided by research networks (Internet2, National Lambda Rail, ESnet, LHCNet, AMPATH and several state research networks) that provide high-speed connections (up to 10 Gbps) to the sites. OSG participants have made significant contributions to the technical development and deployment of these research networks.
Figure 1: Map showing the location of Open Science Grid sites throughout the world.
The sites include both university and laboratory facilities and are linked by several wide area research networks.
Some comments should be noted. First, the OSG consortium is not engaged in new software and information technologies development efforts; rather, it tests and integrates new technologies and services (many provided by consortium members and partners working in other projects) as needed by the consortium.
Second, OSG does not own resources. Resources are owned and operated by consortium members and represent far more value than the direct funding OSG receives (this model differs from the one used by grid projects such as TeraGrid, which owns and operates significant cyberinfrastructure resources). As a consequence, the computing and storage resources listed above are not all available at any one time for OSG use because of substantial internal utilization (e.g., researchers in a university or organization, experiments at a national laboratory).
Third, U.S. researchers participating in the data-intensive experiments at LIGO and the LHC depend heavily on OSG resources and support. OSG cyberinfrastructure resources are a significant component of a global grid cyberinfrastructure known as the Worldwide LHC Computing Grid (WLCG), whose structure is modeled in Figure 2.
Figure 2: Sketch of the multi-tier Worldwide LHC Computing Grid, where U.S. resources (outlined in red) are incorporated into OSG.
Worldwide, the WLCG contains approximately 11 Tier-1 centers, 100 Tier-2 sites and several hundred Tier-3 institutions.
Finally, OSG exists in a world where other grid cyberinfrastructures are either operating or will soon be operating at scales ranging from campus to international. OSG must have mechanisms for interoperating with these facilities.
Figure 3 shows a graph of the number of simultaneous computing jobs running on OSG for the past six months, color-coded by Virtual Organization (VO). The graph clearly demonstrates that OSG computing resources are being steadily used (3,000 to 5,000 simultaneous jobs) by researchers from many distinct organizations.
Figure 3: Plot showing number of simultaneous jobs running on OSG computing facilities for a recent six-month period.
Each color denotes jobs run by a distinct Virtual Organization.
OSG’s historical context
OSG’s origins date to 1999, before any significant grid deployments had been attempted. In that year, physicists from four large physics and astronomy projects and computer scientists with distributed computing experience (Globus and Condor) began developing plans for a grid-based computing infrastructure capable of meeting the data-intensive computing needs of major physics and astronomical experiments that would be operational within a few years. These projects, each costing many hundreds of millions of dollars to construct and operate, included the ATLAS and CMS high energy physics experiments at the LHC at CERN, the LIGO gravitational wave facility and the Sloan Digital Sky Survey (SDSS). A series of discussions with officials at the Department of Energy and National Science Foundation clarified that this work would be groundbreaking and could be broadened to serve other disciplines as their needs for data-intensive computing progressed. Within a short time, three grid projects were funded for a total of approximately US$35M: Particle Physics Data Grid (DOE, 1999), GriPhyN (NSF, 2000) and the International Virtual Data Grid Laboratory (NSF, 2001).
By 2002, PPDG, GriPhyN and iVDGL, taking advantage of their extensive overlap of personnel and institutions, began pooling resources and advancing a common agenda in support of deploying a national-scale grid cyberinfrastructure, aided by the resources of the four collaborations. The Trillium consortium, as this overall effort came to be called, developed several grid testbeds, which provided important feedback to the Globus and Condor development teams as well as invaluable experience in running distributed cyberinfrastructures. Important permanent institutional structures were established, notably the Virtual Data Toolkit (VDT) to test, install and configure common grid software tools (middleware) and the Grid Operations Center (GOC) to provide operational support. These testbeds provided substantial computing resources for U.S. physicists in ATLAS and CMS to contribute meaningfully in several worldwide simulation exercises of their respective collaborations.
To improve coordination, Trillium developed a more formal organization, including a Steering Group to represent the major stakeholders from the experiments and software projects. The consortium set as a goal in 2003 the creation of a prototype national grid cyberinfrastructure supporting several disciplines and capable of running 1,000 simultaneous jobs during Supercomputing 2003 (SC03) in November. This cyberinfrastructure, called Grid3, was operating successfully starting in October and to everyone’s surprise became stable enough after SC03 to operate continuously without heroic effort. For almost two years, Grid3 operations were improved and expanded to serve more disciplines, while additional domain scientists and technology specialists and providers from U.S. universities and national laboratories were recruited. On 20 July 2005, the Open Science Grid was inaugurated and concrete plans were made to secure dedicated funding, necessary because the Trillium projects were expiring in mid-2006. This effort was successful, and in September 2006, NSF and DOE announced joint funding of the Open Science Grid project for US$30M over five years.
This historical context of OSG is diagrammed in Figure 4, which shows in addition a parallel effort by European partners to develop the comparable EGEE grid cyberinfrastructure. OSG and EGEE are members of the Worldwide LHC Computing Grid (WLCG), which includes grid resources from all over the world that support LHC computing needs. The projects have joint efforts in security and operations and a substantial effort is underway to ensure that sites running OSG and EGEE software (as well as that of NorduGrid, a Scandinavian grid project) can interoperate sufficiently well to see each other’s resources and submit jobs to one another.
Figure 4: Historical context of Open Science Grid, showing the predecessor projects, partners and its participation in the WLCG global infrastructure that have played significant roles in its development.
Practical lessons learned from operating a large cyberinfrastructure
Although OSG’s developmental path partly reflects its particular circumstances, some organizational structures and actions taken by the consortium had unquestionable positive impacts when seen in the light of history. These are outlined in the following subsections.
The value of stakeholder research communities: Science push
The experiences of Open Science Grid, other national and international grid projects, and even university campus efforts, demonstrate the value of having science communities actively driving cyberinfrastructure development and deployment. As stakeholders, research communities become intimately involved in cyberinfrastructure decision making and are willing to expend enormous effort acquiring resources from funding agencies and institutions. Such strong involvement is regarded positively by agency program officers and university administrators who must prioritize and find funds for many worthy projects. The sustainability of the cyberinfrastructure also depends on deep community involvement.
The critical importance of leadership
Successful endeavors invariably have an individual or organization whose strong leadership provides the sustaining vision and energy that draws along the rest of the project. U.S. physicists involved with the ATLAS and CMS experiments at LHC fulfilled this role for OSG. They had foreseen in the mid-1990s that the LHC experiments starting in 2007 would face unprecedented computing challenges from datasets that would reach 100 petabytes by 2012 and several times that by the end of the decade. These enormous data volumes would require correspondingly massive computational resources and ultra-high-speed networks to make them accessible to globally distributed collaborations of several thousand physicists.
Such considerations led to organized efforts by high-energy physicists in the U.S. and Europe to develop and deploy large grid cyberinfrastructures, as described earlier. The critical dependence of their science on this infrastructure has provided the necessary incentives to acquire resources through multiple proposals, work through the difficulties presented by immature grid technologies, push advanced technology development in areas such as optical networking, and channel their well-known organizational culture to multidisciplinary endeavors such as OSG and EGEE. Even today, the LHC projects define the scale of physical resources and computing services that these cyberinfrastructures must provide.
The benefits of testbeds and experimentation
One of the most valuable actions taken by OSG was the deployment of grid testbeds and the experimentation that followed. Building and operating such prototype cyberinfrastructures provided experience that could be acquired in no other way. Some specific lessons are summarized here:
Learning facility thinking: The mechanics of operating a prototype cyberinfrastructure immediately confront the people running it with the essential fact that a cyberinfrastructure is a facility, requiring clear organization, defined personnel roles and constant attention to a wide assortment of details. This statement is even more relevant when the cyberinfrastructure is a production-level facility.
Fostering collaboration: Operating a distributed cyberinfrastructure testbed forces personnel at different sites or in different roles to communicate frequently with one another, building social bonds and good will that improve overall collaboration. This social glue is especially beneficial to collaborations containing subgroups from different disciplines or subdisciplines, e.g. astronomy–physics, ocean sciences–atmospheric sciences, biology–computer science, etc.
Developing common software or standards: Running a cyberinfrastructure testbed even for a short time exposes inefficiencies and troubleshooting difficulties that arise from heterogeneous software and error-prone manual processes. OSG’s predecessor projects adopted the VDT early, both for providing common tools and for installing and configuring them in simple and identical ways. Other communities might benefit further by adopting data standards.
Designing for scalability: As testbeds grow (merging perhaps into the final cyberinfrastructure), it is important to identify as quickly as possible organizational structures or operations models that scale poorly. Fixing such problems early avoids painful and perhaps prohibitive restructuring costs later. For example, as the Grid3 prototype cyberinfrastructure grew throughout 2003 in number of sites and resources, the support model had to be transformed so that responsibilities shouldered by default by site managers were assigned appropriately to experts and managers throughout the project. Scalability issues are discussed in more detail in the next section.
Scalability, scalability, scalability
A central issue in building and operating shared cyberinfrastructure is scalability. Ideally, a scalable cyberinfrastructure should accommodate, with no loss of efficiency, unlimited increases in resources, users and application diversity. Considerations of how to achieve even approximately such a daunting level of scalability drive all OSG decisions about its organizational structure and operations model. An important reason is financial: the consortium receives approximately US$6M of federal funding per year, providing salaries for approximately 33 FTEs distributed across 16 institutions, as well as travel and institutional overhead. Funding at this level is insufficient to fully cover facility operations, user support and software development, let alone hardware resources, for a national-scale cyberinfrastructure. OSG’s current effort is in fact made possible only by leveraging resources supplied by participating institutions and organizations.
More fundamentally, however, a distributed cyberinfrastructure of many dozens of sites increasing in both size and diversity must be designed from the beginning to be scalable. Some specific mechanisms adopted by OSG are described briefly below.
Organization and management: OSG is organized to interface with organized projects (typically in the form of Virtual Organizations) rather than with individual researchers and institutions, in contrast to supercomputer centers and TeraGrid, which work directly with individuals or groups. Decision-making is balanced between the requirements of a democratic process and practical considerations of efficiency and timeliness. Thus ultimate authority is vested in the OSG Council, a large body representing the major stakeholders, but most planning, reporting, decisions and actions are handled by an Executive Director, Executive Team and Executive Board that report to the Council.
Operations and support: The OSG support model is distributed, with a central Grid Operations Center (GOC) working closely with other centers and support centers for the VOs and large facilities. A ticket mechanism routes most problems to the appropriate support center for handling, reserving for the GOC problems that are deemed to affect grid-wide services or security. The operations model likewise devolves member authentication and authorization to the VOs, exploiting the services of the nationally supported DOEGrids Certificate Service for robust authentication.
Common software: A specially funded OSG team supports the Virtual Data Toolkit, providing simple installation and configuration of core grid software. As illustrated in Figure 5, the VDT and OSG Release Software together offer a relatively common software environment for all sites. Many VOs have software frameworks that shield users, albeit imperfectly, from much grid middleware.
Figure 5: Diagram showing how OSG interfaces to applications and computing infrastructure.
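The ticket routing described under "Operations and support" can be illustrated with a minimal sketch. It is not OSG's actual ticketing software, which this paper does not describe; the `Ticket` fields, the support-center names and the lookup order (VO first, then facility, with the GOC as fallback) are all assumptions made for illustration. The only behavior taken from the text is that problems affecting grid-wide services or security stay with the GOC, while most others go to a VO or facility support center.

```python
from dataclasses import dataclass
from typing import Dict, Optional

GOC = "GOC"  # central Grid Operations Center


@dataclass
class Ticket:
    """A problem report. All field names are hypothetical."""
    summary: str
    vo: Optional[str] = None     # Virtual Organization of the reporter, if known
    site: Optional[str] = None   # facility where the problem appeared, if known
    grid_wide: bool = False      # deemed to affect grid-wide services
    security: bool = False       # deemed to affect security


def route(ticket: Ticket,
          vo_centers: Dict[str, str],
          facility_centers: Dict[str, str]) -> str:
    """Return the name of the support center that should handle the ticket."""
    # Grid-wide and security problems are reserved for the GOC.
    if ticket.grid_wide or ticket.security:
        return GOC
    # Assumed order: prefer the VO's support center, then the facility's.
    if ticket.vo is not None and ticket.vo in vo_centers:
        return vo_centers[ticket.vo]
    if ticket.site is not None and ticket.site in facility_centers:
        return facility_centers[ticket.site]
    # Assumed fallback: tickets nobody else claims return to the GOC.
    return GOC


if __name__ == "__main__":
    # Example lookup tables with made-up names.
    vo_centers = {"cms": "CMS support center"}
    facility_centers = {"example-lab": "Example Lab support center"}
    tickets = [
        Ticket("jobs failing for one VO", vo="cms"),
        Ticket("disk full at a site", site="example-lab"),
        Ticket("compromised credential", vo="cms", security=True),
    ]
    for t in tickets:
        print(t.summary, "->", route(t, vo_centers, facility_centers))
```

The point of the design is that the central center's workload depends on the number of grid-wide and security problems rather than on the total number of sites or users, which is what lets a distributed support model scale.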
Challenges faced by Open Science Grid
OSG has overcome substantial obstacles and made original contributions on its way to building and operating a national grid cyberinfrastructure. The cyberinfrastructure successfully supports research communities at sizes ranging from small research groups to large big-science enterprises. This is no small achievement. However, extensive challenges in the near and medium term still must be confronted. While manifesting themselves in technical, financial, managerial and social terms, the challenges should be seen as resulting from larger issues of scalability and sustainability. Below I summarize some looming issues.
OSG support model: How well will OSG’s distributed support model scale with continued growth in the number of organizations and resources and with community diversity? Some level of growth can be accommodated by increases in personnel at the central GOC, but large increases could bring unanticipated problems.
Enabling new communities: How can OSG continue to support more and more research communities or projects? How can OSG persuade resource owners to take the time to enable and support their resources for other communities? The (surprisingly) wide range of application requirements places real burdens on resource owners.
Enabling small projects: How can OSG continue meeting the needs of small research projects, which have few resources to operate standard VO services required by OSG? Small groups today are supported by the generic OSG VO, but this has not been applied to many groups. Are there better ways (through campus and regional grids, for example) to help users make the transition to grids by hiding grid middleware?
Integrating new technologies: Will OSG’s model of integrating new middleware technologies developed by external projects fail in some crucial cases? Technologies are never ready out of the box but require close cooperation and adjustment of schedules between OSG and the external project as different versions are delivered and supported. Potential problems include loss of funding or critical people from the external project, difficulty in integration or inability to adequately support the new middleware.
Maintaining production operations: How can OSG improve the overall throughput of useful work? How can OSG maintain production operations while evolving the capabilities and capacities of the distributed facility through upgrades?
Interoperability: How can OSG sustain interoperability with increasing numbers of grids? The question is particularly relevant for TeraGrid (for U.S. researchers) and EGEE (as part of OSG’s participation in WLCG), but it also applies to new campus and regional grids which are emerging with surprising frequency. It also has relevance for the VDT as the number of software components grows (see Figure 6) and other communities incorporate it in their software configurations.
Long-term sustainability: How can OSG transition an infrastructure funded through research funds to a sustained facility for the future?
Figure 6: Plot showing changes to the Virtual Data Toolkit as new software components were added to support major phases of the grid facility.
The collaborative effort that led to the creation of Open Science Grid has matured into an organization operating a federally supported national cyberinfrastructure on which several big-science experiments and other research efforts depend for their production computing needs. Nevertheless, the consortium’s ability to maintain a high level of service will be constantly challenged as the organization and its cyberinfrastructure grow in resources, software services and numbers of members and partners. The OSG consortium’s ability to adapt to these scalability challenges will determine its success in the coming years.
About the author
Paul Avery is Professor in the Department of Physics at the University of Florida in Gainesville.
Email: avery [at] phys [dot] ufl [dot] edu
1. Open Science Grid home page, http://www.opensciencegrid.org. A useful glossary of grid terms can be found at http://www.opensciencegrid.org/About/OSG_Glossary.
2. TeraGrid home page, http://www.teragrid.org.
3. Earth System Grid home page, http://www.earthsystemgrid.org.
4. EGEE home page, http://www.eu-egee.org.
5. As in the curse, “May you live in interesting times.”
6. I use the term Open Science Grid, perhaps confusingly, sometimes to denote the consortium and sometimes to represent its cyberinfrastructure resources. To be precise, the OSG consortium operates the OSG cyberinfrastructure (consisting of processors, storage, networks, software, and institutional support). Which object is meant should be clear from the context.
7. These are summarized from the OSG Year 1 Project Plan, OSG Docdb 514, 3 December 2006, http://osg-docdb.opensciencegrid.org/cgi-bin/ShowDocument?docid=514.
8. Petascale refers to a processing scale of petaops (10^15 operations per second, equivalent to about 300,000 3.0 GHz PCs) and a storage capacity measured in petabytes (thousands of terabytes).
9. LIGO experiment home page at http://www.ligo.caltech.edu/.
10. Large Hadron Collider home page at http://lhc-machine-outreach.web.cern.ch/.
11. A geographical location such as a university can hold several OSG resources, each with its distinct gatekeeper software to accept jobs. Multiple resources occur for a variety of reasons, e.g. they may be managed within different administrative domains, they may serve different functions, their resources might be configured differently, they might have different policies for accepting outside jobs, etc.
12. Internet2 home page, http://www.internet2.org.
13. National Lambda Rail home page, http://www.nlr.net.
14. ESnet home page, http://www.es.net.
15. LHCNet home page, http://lhcnet.caltech.edu.
16. AMPATH home page, http://www.ampath.fiu.edu.
17. These state networks include, for example, California (http://www.cenic.org/), Florida (http://www.flrnet.org/), Louisiana (http://www.loni.org/), New York (http://www.nysernet.org/), Texas (http://www.hipcat.net/Projects/tigre), and many others.
18. These activities include leadership of various networking committees as part of Internet2 and National Lambda Rail. OSG consortium members lead advanced network initiatives such as UltraLight (http://www.ultralight.org) and even operate a transatlantic optical research network to Europe, LHCNet (http://lhcnet.caltech.edu). They lead an important activity concerned with international connectivity and the Digital Divide and collect useful statistics documenting worldwide connectivity (http://icfa-scic.web.cern.ch/).
19. Map created with the Virtual Organization Resource Selector (VORS) tool at http://vors.grid.iu.edu/cgi-bin/index.cgi.
20. I use the term Virtual Organization to refer to a set of people (VO members) and shared computing/storage resources (sites) and services (e.g., databases) that are clearly identified with an institution, organization or enterprise. VOs are described also in Wikipedia (http://en.wikipedia.org/wiki/Virtual_organization) and the subject is discussed at length in Anatomy of the Grid.
21. Only small grid testbeds had been deployed by this time.
22. Globus home page at http://www.globus.org.
23. Condor home page at http://www.cs.wisc.edu/condor.
24. ATLAS experiment home page at http://atlasexperiment.org.
25. CMS experiment home page at http://cms.cern.ch.
26. CERN home page at http://www.cern.ch.
27. Sloan Digital Sky Survey experiment home page at http://www.sdss.org/.
28. Something that can be said only with considerable hindsight.
29. Particle Physics Data Grid (PPDG) home page at http://www.ppdg.net/.
30. GriPhyN home page at http://www.griphyn.org/.
31. International Virtual Data Grid Laboratory (iVDGL) home page at http://www.ivdgl.org/.
32. Virtual Data Toolkit (VDT) home page at http://vdt.cs.wisc.edu/.
33. Grid Operations Center home page at http://www.grid.iu.edu/.
34. Supercomputing 2003 home page at http://www.sc-conference.org/sc2003/.
35. Grid3 project (now defunct) is described at http://www.ivdgl.org/grid3/.
36. Worldwide LHC Computing Grid (WLCG) home page at http://lcg.web.cern.ch/LCG/.
37. NorduGrid home page at http://www.nordugrid.org/.
38. Similarly, high energy physicists in Europe led the efforts that resulted in the development and deployment of EGEE and NorduGrid.
39. By a production-level facility I mean one that is sufficiently dependable that an organization can carry out some of its core activities on the facility. Examples include a campus or regional network, a supercomputing center, a large backup facility, etc.
40. DOEGrids Certificate Service home page at http://www.doegrids.org/.
I. Foster, C. Kesselman, and S. Tuecke, 2001. Anatomy of the Grid: Enabling Scalable Virtual Organizations, at http://www.globus.org/alliance/publications/papers/anatomy.pdf. This provides a very useful short introduction to grids.
I. Foster, C. Kesselman, J. M. Nick, S. Tuecke, 2002. Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration, at http://www.globus.org/alliance/publications/papers/ogsa.pdf.
Daniel Atkins, Chair, Report of the National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure, at http://www.nsf.gov/od/oci/reports/toc.jsp.
International Science Grid This Week, a weekly newsletter about grids and science, home page at http://www.isgtw.org/.
Open Science Grid News, a monthly newsletter about Open Science Grid, home page at http://www.opensciencegrid.org/osgnews/.
U.S. Department of Energy (DOE), 1999. The Particle Physics Data Grid: Proposal for FY 1999 Next Generation Internet Funding, at http://www.slac.stanford.edu/xorg/ngi/ppdg/particle_physics_data_gridpropos.htm.
U.S. National Science Foundation, 2001. An International Virtual-Data Grid Laboratory for Data Intensive Science, at http://www.phys.ufl.edu/~avery/ivdgl/itr2001/proposal_all.pdf.
U.S. National Science Foundation, 2000. The GriPhyN Project: Towards Petascale Virtual-Data Grids, at http://www.griphyn.org/documents/document_server/show_docs.php%3Fseries=ivdgl&category=tech&id=145.html.
Copyright ©2007, First Monday.
Copyright ©2007, Paul Avery.
Open Science Grid: Building and Sustaining General Cyberinfrastructure Using a Collaborative Approach by Paul Avery
First Monday, volume 12, number 6 (June 2007),