A consideration of the state of computational grids with respect to standards, current uses, and a road map for commercial benefit beyond their common applications
Computational grids have featured prominently in the technical press recently, with both non-profit and commercial organizations increasing their investments. Are commercial grids viable? The short answer is yes. The longer answer depends on what form they take and which applications they serve.
The prime beneficiaries of computational grid technology have deservedly been academia, government and other non-profit research. Researchers in these areas have been reaping the benefits of their grid investments for two decades. This essay contains some observations and experiences regarding the state of computational grids and possible routes to realize their full benefit.
Computational Grids Defined
For the purposes of this essay, a computational grid is defined as a distributed infrastructure that appears to an end user as one large computing resource. The individual components of the grid may or may not be under the direct control of the end user and may be distributed over a network of any size. Perhaps most commonly, this means multiple CPUs and one or more databases distributed over a local area network (LAN), but it need not. A full realization of the potential of grids would encompass heterogeneous resources over metropolitan area networks (MAN) and wide area networks (WAN).
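The core idea of this definition is that many components appear to the user as one resource. A small scatter/gather sketch in Python can illustrate it. The names `LocalGrid` and `sum_of_squares` are hypothetical, and local threads stand in for grid nodes so the example runs anywhere. Real grid middleware would instead dispatch each task to machines on a LAN, MAN or WAN.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class LocalGrid:
    """Hypothetical facade that presents several workers as one resource.

    Threads stand in for grid nodes so the sketch runs anywhere; a real grid
    would hand each task to middleware that dispatches it to remote machines.
    """

    def __init__(self, nodes: int) -> None:
        if nodes < 1:
            raise ValueError("a grid needs at least one node")
        self._nodes = nodes

    def map(self, task: Callable[[T], R], inputs: Iterable[T]) -> List[R]:
        # Scatter: each input becomes an independent task for any free worker.
        # Gather: Executor.map yields results in input order, so the caller
        # never needs to know which worker ran which task.
        with ThreadPoolExecutor(max_workers=self._nodes) as pool:
            return list(pool.map(task, inputs))


def sum_of_squares(n: int) -> int:
    # Stand-in for a CPU-intensive unit of work, e.g. one analytics scenario.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    grid = LocalGrid(nodes=4)
    print(grid.map(sum_of_squares, [0, 1, 2, 3]))  # [0, 0, 1, 5]
```

CPython threads share a single interpreter lock, so this sketch demonstrates the single-resource interface rather than a speedup. Swapping in `concurrent.futures.ProcessPoolExecutor`, which offers the same `map` interface, would spread the work across local CPUs. A true grid would replace the executor with a dispatcher to remote nodes while keeping the caller's view unchanged.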
For high performance computing (HPC) applications that require large amounts of cost-effective CPU cycles, the use of grids (or clusters, farms, etc.) is well proven.
-- A prime example is high energy physics (HEP), where grids of hundreds of interconnected CPUs, such as the Advanced Computing Project (ACP) at Fermilab outside Chicago, have been solving problems for years [1].
-- The Sloan Digital Sky Survey employs a grid to acquire astronomical data in order to produce a three-dimensional map of the universe [2].
-- The National Science Foundation funds projects for oil exploration research, the TeraGrid [3] for open scientific research, and a variety of other applications.
-- Argonne National Laboratory [4] is a pioneer in grid computing and works closely with a number of other governmental and academic organizations to develop standards, support research and develop grid software.
The important work done in these established and developing applications has been covered extensively by others and is not the topic here. Rather, what does this work mean for the commercial world? Can the ongoing successes of research and academia be applied to business applications?
Organizations such as the Global Grid Forum and the IEEE are heavily involved in facilitating the creation of standards for grid development and use. However, widely adopted standards are not necessary for successful grid applications. In fact, apart from developing an application program interface (API) that permits access to a grid infrastructure, no standards are required at all. This is the route first taken by early successful grids such as the ACP at Fermilab.
So what is the role for standards, who is creating them, and who will benefit?
One of the prime stated goals of computational grids, both commercial and otherwise, is access to computing resources over which the user may not have direct control. Perhaps these resources are in a different city, laboratory or company. The role of standards here is obvious given the advantages they provide when interacting with resources not under one's control.
Who is creating the standards? More than 700 people from a variety of disciplines met at the seventh Global Grid Forum (GGF7) in Tokyo [5], which I attended. Approximately 60 percent of participants were from universities, basic research laboratories, government research organizations or other non-profit research entities (although much university research is funded by, and for the benefit of, private industry). Approximately 30 percent were from vendors or service providers such as IBM [6], HP and Platform Computing [7]. Less than 5 percent represented commercial end users. Similarly, the CCGRID symposia, which the IEEE and ACM jointly sponsor, are largely driven by academia, which supplies the bulk of the steering committee members and symposium presenters [8].
The implication is that the standards are being driven primarily by non-profit research and vendor/service providers. Therefore, it can be argued that they will also be the primary beneficiaries.
The requirements for commercial grid computing, while sharing much with those of non-commercial grids, have key differences. Some of these differences are outlined later.
Here, grids are divided into four general categories:
1. Fundamental: modifications of existing applications or systems in order to take advantage of a grid infrastructure. No standards are required and all resources are under the control of or directly contracted by the application owner. This is the oldest and most established type of grid with the Fermilab ACP being a prime example.
2. Generalized: A generalized, non-industry-specific infrastructure standard that enables access to a larger set of resources, which may or may not be under the control of the application owner. Globus is an example of an enabling toolkit for this [9].
3. Industry- or discipline-specific: An open, multi-organizational infrastructure. The TeraGrid is an example; its goal is to build an open distributed infrastructure for scientific research.
4. Expert: A proprietary offering that builds on the other types to provide a product or expertise to a client. Clients would connect to part of a grid to contract for specific capabilities, typically at a cost. Such a capability could be any service that the client does not wish to, or cannot, develop in-house. Email, specialized analytics and disaster recovery are examples.
These four types are not necessarily mutually exclusive; for example, an expert grid may be a subset of an industry-specific grid.
Commercial Application Roadmap
One approach for advancing the use of grids within commercial domains is outlined here. It is not exhaustive but does emphasize uses that may not be commonplace today:
1. Get involved in the standards: As noted previously, corporate end users have little direct voice in the standards being developed today. They must either be content to rely on vendors and service providers to represent their concerns or get involved directly.
2. Address security: established applications of grid technology benefit from the open sharing of information. Apart from perimeter security and privacy issues, security is a secondary concern for most existing grid applications.
However, for many commercial applications, security requirements are both paramount and varied. Inadequate security may disqualify an application from the start. A sufficient security architecture will be required for commercial grids to advance significantly beyond the fundamental type.
The financial services industry, for example, must address multinational regulatory requirements, client confidentiality, control over who can see what data and when, and encryption.
3. HPC grids for existing application domains: this is a common strategy currently taken for existing commercial applications. Applications that previously ran on a single processor are distributed among many. This shortens run times and, in some cases, makes possible things that were not feasible before. An example could be CPU-intensive analytics delivered within seconds to a customer waiting on the phone or using a Web service. HPC grids may exist in all grid categories.
4. Focus on more than HPC.
Getting the most out of available CPU resources is a proven approach and will remain an important application for commercial and non-commercial grids alike. A longer-term strategy would be to focus on grid technology that exploits the unique attributes of a distributed grid:
-- Application resiliency: The ability to fail over to other geographical locations when necessary. Most often this is accomplished by proprietary, non-portable means. For many of us whose processes were affected by the September 11 disaster, switching application bases from one geographic center to another succeeded but was very labor intensive. All other business support was preempted for a week or more just to get applications back online. Now that many industries span the globe and require computer systems to be up 100 percent of the time, strategies that rely on normal downtime to remedy problems are obsolete. Resiliency will likely be achieved by combining grids of CPU and data resources.
-- On demand resources: many applications don't need all the CPU and other resources allocated to them all the time. But we must often procure CPUs for worst-case scenarios (e.g., a year-end business cycle). Hence, we may not fully utilize the CPUs the rest of the time, which wastes resources (i.e., money). Is there a way to use additional CPU, data storage space and telecommunications bandwidth only when we need them, and therefore pay less overall?
-- Disaster mitigation: Combines resiliency and on-demand computing. For cost reasons, backup data centers may not be built to support 100 percent of the business in the event of a disaster. So what do we do during a prolonged disaster? Do less business? Perhaps a better strategy is to build the backup sites to get us through the first few days of a disaster and then draw on additional resources on demand from a vendor. For common applications such as email (e.g., Microsoft Outlook), this would be straightforward. For applications proprietary to a company, it would require advance preparation with a vendor, roughly proportional to the complexity of the application.
-- Centers of expertise: The ability to find high-level resources and algorithm specialists on the grid as we need them. Whereas "on demand resources" focus on generic resources, a center of expertise would provide resources at a higher level. Examples would be business analytics, financial transaction processing, etc.
-- Industry-specific grids: computational grids specific to a particular industry (finance, petroleum, automotive, etc.).
Peer-to-peer networks specific to a particular industry or research domain have been in production since the inception of ARPANET in the 1960s. Today, in the financial services industry, there are many ways to transact electronic business with clients, stock exchanges, central banks and other market participants. I was using the High Energy Physics network (HEPNET) two decades ago to communicate with colleagues at collaborating organizations around the world.
Industry-specific grids would extend this philosophy to form a grid that supports a specific industry. The concerns of one industry often don't overlap with those of others; hence it behooves us not to burden one another with the overhead and complexity that would come with a universal grid infrastructure.
5. Develop application domain expertise: grid vendors will require domain experts who understand the business problems their clients are trying to solve. A client having to explain the basics of its business to a service provider reduces confidence in the provider's solutions. Such expertise may not be necessary for fundamental or generalized grids but will definitely be necessary for industry-specific and expert grids.
6. Develop commercial APIs: Application program interfaces aimed at commercial applications will need to accommodate the demands particular to them, such as the security and regulatory requirements noted above.
7. Sell and deliver it: with computational grids appearing frequently in the technical press, grid purveyors have succeeded in putting this technology in the public eye. However, with more and more companies needing to see a short-term return on their technology investments, providers with real solutions will need products that can be deployed quickly and be productive, even if the full benefits are realized only over time. I've seen too many projects partially or completely fail after a large investment because they could not be brought to market in a timely fashion. The key is to think in financial quarters, not years.
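A little arithmetic makes the cost question raised under "On demand resources" and "Disaster mitigation" concrete. The Python sketch below compares two options. The first provisions CPUs for the worst case. The second owns a baseline and rents the shortfall from a vendor. The demand profile, the prices and all function names are hypothetical assumptions chosen for illustration, not vendor figures. The model also ignores setup fees, data transfer and vendor lead times.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Rates:
    owned_per_cpu_month: float      # hypothetical cost of an owned CPU, paid every month
    on_demand_per_cpu_month: float  # hypothetical vendor price, paid only when used


def cost_provision_for_peak(demand: Sequence[int], rates: Rates) -> float:
    """Own enough CPUs for the worst month and pay for them in every month."""
    return max(demand) * len(demand) * rates.owned_per_cpu_month


def cost_baseline_plus_burst(demand: Sequence[int], baseline: int, rates: Rates) -> float:
    """Own `baseline` CPUs; rent any monthly shortfall from a vendor."""
    owned = baseline * len(demand) * rates.owned_per_cpu_month
    shortfall = sum(max(0, d - baseline) for d in demand)  # CPU-months rented
    return owned + shortfall * rates.on_demand_per_cpu_month


def cheapest_baseline(demand: Sequence[int], rates: Rates) -> Tuple[int, float]:
    """Try every baseline from 0 to the peak and return the cheapest (baseline, cost)."""
    return min(
        ((b, cost_baseline_plus_burst(demand, b, rates)) for b in range(max(demand) + 1)),
        key=lambda pair: pair[1],
    )


if __name__ == "__main__":
    # Hypothetical profile: 100 CPUs most of the year, 400 during the year-end cycle.
    demand = [100] * 11 + [400]
    rates = Rates(owned_per_cpu_month=1.0, on_demand_per_cpu_month=3.0)
    print(cost_provision_for_peak(demand, rates))        # 4800.0
    print(cost_baseline_plus_burst(demand, 100, rates))  # 2100.0
    print(cheapest_baseline(demand, rates))              # (100, 2100.0)
```

Under these assumed numbers, the on-demand premium (three times the owned cost) is paid in only one month of twelve. Owning 100 CPUs and renting the rest is therefore cheaper. If the premium were much higher or the peak lasted longer, `cheapest_baseline` would move toward the peak, and provisioning for the worst case could win. The same calculation applies to a backup site sized for the first few days of a disaster, with the site's capacity playing the role of the baseline.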
The Big Winners
There are some very likely winners in the expansion of grid technology and, by implication, some potential losers. In a nutshell, the winners will likely be:
-- Intel and Linux: the price/performance ratio of Intel processors and the essentially free Linux operating system, combined with the vast amount of software available for both, make this duo the natural choice for grid application developers.
-- Traditional users who will benefit from the mass market: commercial players entering the grid arena that was traditionally dominated by non-profit organizations will drive down the cost and make more choices available.
-- Commercial users who will benefit from the pioneering work of others: as stated before, grid technology has been around for decades and commercial users will benefit from the work to date.
There is great potential for commercial grid applications to take advantage of the pioneering efforts of non-profit organizations. Both commercial and non-commercial applications will continue to see benefit. However, to truly see the full potential of what grids can offer, we must think beyond the traditional application domain space of high performance and relatively low-cost CPU power into a variety of other domains.
Bryan MacKinnon has more than 20 years of technology experience in financial services, high energy physics and astrophysics. He is currently a technology director for Merrill Lynch in Tokyo. Before joining Merrill Lynch, he worked in software development at Fermilab.
1. Gaines, I., et al., "The ACP Multiprocessor System at Fermilab," Fermilab Publication FERMILAB-Conf-87/21, presented at the Computing in High Energy Physics Conference, February 1987.
2. Petravick, D., et al., "Data Acquisition Systems for the Sloan Digital Sky Survey," Fermilab Publication FERMILAB-Conf-94/064, presented at the Society of Photo-Optical Instrumentation Engineers (SPIE) Conference, March 1994.
3. Catlett, C., "TeraGrid: A Primer," http://www.teragrid.org/about/TeraGrid-Primer-Sept-02.pdf
4. Foster, I., "Grid Technologies & Applications: Architecture & Achievements," Computing in High Energy and Nuclear Physics (CHEP'2001), Institute of High Energy Physics (IHEP), Beijing, China, September 3-7, 2001, ANL/MCS/CP-106157.
5. The Global Grid Forum: http://www.ggf.org
6. IBM Grid Computing: http://www.ibm.com/grid
7. Platform Computing distributed and grid computing: http://www.platform.com/
8. The IEEE International Symposium on Cluster Computing and the Grid: http://www.ccgrid.org/
9. The Globus Project: http://www.globus.org