MONDAY, MAY 12, 2008
1:00 – 5:00 pm
Using Grid Engine and UniCluster Express: A Primer and Sneak Peek
Speakers: Chris Dagdigian, Principal Consultant, BioTeam Inc. and Rich Wellner, Jr., Vice President of Professional Services, Univa UD
This special session, brought to you by Univa UD, will be a sneak peek of an upcoming advanced user tutorial series co-hosted by BioTeam and Univa UD. This session will be an introduction and live demonstration of UniCluster Express and Grid Engine. The tutorial provides step-by-step instruction on the installation, configuration and basic use of a compute cluster with Grid Engine and UniCluster Express. It requires a rudimentary knowledge of Linux and administration of Linux systems, networking, and basic programming principles. Chances are, if you've come across this tutorial you do indeed have that skill set.
Globus Primer: An Introduction to Globus Software
Speaker: Lee Liming, Technology Analyst, Argonne National Laboratory and the University of Chicago
The Globus Toolkit is a collection of software solutions to many of the integration challenges that come up in Grid system and application development. Recommended for first-time GlobusWORLD attendees, this afternoon tutorial provides an introduction to the Globus Toolkit and its most common uses in science and engineering applications. It provides an overview of what a new attendee can expect throughout the week of the conference. The open source Dev.Globus community develops, distributes, and supports the Globus Toolkit and a variety of other software projects. This tutorial provides answers to critical questions for Grid project planners and product developers, including: What is Globus? What can the Globus Toolkit do for me? Where does Globus software fit in a Grid system or application? Where should I get started learning about Globus? What resources are available to help me when I use Globus? How have others succeeded using Globus software?
5:30 – 7:30 pm
Welcome Reception sponsored by Sun Microsystems
Please join us on the top floor of the Marriott City Center for dazzling views, hors d'oeuvres and a cool beverage as we kick off the conference.
TUESDAY, MAY 13, 2008
8:30 – 10:00 am
Welcome and State of The Union
Join conference organizers Fritz Ferstl, Ian Foster and Philip Papadopoulos for the conference opening session.
10:30 – 12:00 pm
GRID ENGINE #1
Sun/Community State of the Union
Speaker: Fritz Ferstl, Director of Grid Engineering, Sun Microsystems
New features in Sun Grid Engine 6.2
Speakers: Lubomír Petrík, Software Developer, Sun Microsystems Czech and Roland Dittel, Sun Microsystems Germany
This presentation will give an overview of the new features
and enhancements in Sun Grid Engine (SGE) 6.2. The topics covered
are Advance Reservations, improved Interactive Job Support, improved
Array Job Dependencies, support for SMF and Service Tags, JMX support
and Performance and Scalability improvements.
ROCKS #1
Workshop Goals and Rocks 5.0 Enhancements
Speakers: Greg Bruno and Mason Katz
This session will focus on the new features of Rocks 5.0 (V) including support for Xen-based virtual machines, enhancements/changes in the Rocks command structure, and general support for Version 5.0 of CentOS/RHEL. In addition, participants will have an opportunity to shape follow-on sessions for this year's workshop. A refresher (or introduction) to the Rocks configuration graph will be given. Participants are expected to have familiarity with previous versions of Rocks.
GLOBUSWORLD #1
Globus WS Core and Tools
Speakers: Rachana Ananthakrishnan, Senior Software Developer, Argonne National Laboratory;
Ashish Sharma, PhD, Research Scientist, Ohio State University; Ravi Madduri, Senior Software Developer, Argonne National Laboratory
The Java WS Core component in Globus Toolkit contains an implementation of the Web Services Resource Framework (WSRF) and Web Services Notification (WSN) family of specifications, and provides a container to build and deploy web services based on these specifications. This session will include presentations on:
- GT Java WS Core: Features and Roadmap (Rachana Ananthakrishnan): An overview of supported features, including new features in GT 4.2 such as dynamic deployment support, HTTP/S connection caching and WS Enumeration support, and a discussion of the latest on-going and planned work in this area.
- Authoring Services using Introduce (Ashish Sharma): An overview of Introduce, a GUI tool to author services using Java WS Core that enables service developers to focus on the business logic and automates the generation of web service pieces.
- Grid Remote Application Virtualization Interface (gRAVI) (Ravi Madduri): A tool that leverages Introduce and facilitates publishing of arbitrary applications as web services.
INTRODUCTION & OVERVIEW #1: CLUSTER STACKS
Intel Cluster Ready
Speaker: Clem Cole, Manager, Architecture and Development of Intel Cluster Ready
Clusters are a cost-effective deployment platform for high performance computing. However, until recently each cluster tended to be a tad different. These differences, while often conceptually minor, are a major inhibitor to developing applications that can run with high confidence on many different clusters. Similarly, different application providers cannot be confident that their applications can run in harmony with other applications. In this talk, I will describe the Intel Cluster Ready program and how it helps to make building, deploying and maintaining clusters easy for end users, application developers, administrators, as well as ISVs, component providers and platform integrators.
Introduction to Univa UD UniCluster Express: A Simple, Fully Integrated Cluster Stack
Speaker: Bill Bryce, Director Product Management-HPA, Univa UD
One of the most important decisions in building an HPC cluster is choosing the software that will deploy, monitor, manage and operate the system. In the past, each software component was installed and configured separately, leading to a high degree of complexity and requiring deep HPC knowledge and system administration skills. Today, software stacks such as UniCluster Express simplify many of the difficult and complex tasks involved in creating and managing an HPC cluster. The result is that users can focus on running their applications and spend much less time on the cluster software infrastructure.
Solaris HPC Stack
Speaker: Daniel Templeton, Strategic Liaison Manager, Sun Grid Engine, Sun Microsystems
Sun has recently launched a new HPC community on the OpenSolaris.org
site. One of the purposes of that community is to facilitate
collaboration in building out an HPC stack for the OpenSolaris operating
system. The end goal of that HPC stack is to gather together HPC
expertise from both inside and outside Sun into a complete and
integrated software stack addressing the needs of both developers and
administrators. This presentation will introduce the project and the
community and discuss plans for the stack. Ample time will be left for
feedback from attendees.
12:00 – 1:30 pm
LUNCH (PROVIDED)
1:30 – 2:00 pm
SPONSOR TALK
Using OGF Standards for Grid and HPC
Speaker: Chris Smith, Vice President of Standards, Open Grid Forum
For a number of years, the OGF has been working on specifications that are intended to address common use cases in Grid and High Performance Computing. We now have a sufficient body of specifications to realize some of these use cases. This talk will provide a snapshot of where we are with respect to OGF specifications such as DRMAA and the HPC Basic Profile, and the status with respect to implementations of these specifications.
1:30 – 3:00 pm
GRID ENGINE #2
Grid Engine at the Texas Advanced Computing Center
Speaker: Roland Dittel, Sun Microsystems Germany
With "Ranger" the Texas Advanced Computing Center deploys the largest
computing system in the world for open science research. The resource
management for job scheduling in this cluster is provided by Sun Grid
Engine (SGE). This presentation will give an overview of the cluster
setup and the scalability improvements implemented to utilize the 3,936
nodes and 62,976 processing cores.
Fun with Grid Engine XML
Speaker: Chris Dagdigian, Principal Consultant, BioTeam, Inc.
Organizations needing to programmatically monitor the state and status
of Grid Engine systems prior to SGE 6.0 were often required to parse
spool files or manipulate the "human readable" text output from
commands such as qconf and qstat. The introduction of XML output
option flags in Grid Engine has opened the floodgates for far more
interesting and powerful tools to be developed. This talk will center
on methods for obtaining, searching on and transforming Grid Engine
XML data into various output formats. XML technologies including XPath
and XSLT will be lightly covered using the code from http://xml-qstat.org
as examples. This will be a moderately technical talk aimed at
audiences with little prior exposure to XML transformation and
processing.
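As a rough illustration of the kind of XML searching this talk describes, the following Python sketch selects jobs by state using an XPath-style attribute predicate. The embedded sample is a simplified, hypothetical stand-in shaped loosely like Grid Engine XML job listings; the element and attribute names, the job data, and the helper `jobs_by_state` are assumptions for illustration, not code from xml-qstat.org.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample (assumption): real Grid Engine XML output
# carries many more fields and may differ in layout.
SAMPLE = """<job_info>
  <queue_info>
    <job_list state="running">
      <JB_job_number>101</JB_job_number>
      <JB_name>blast_run</JB_name>
      <JB_owner>alice</JB_owner>
    </job_list>
  </queue_info>
  <job_info>
    <job_list state="pending">
      <JB_job_number>102</JB_job_number>
      <JB_name>align</JB_name>
      <JB_owner>bob</JB_owner>
    </job_list>
    <job_list state="pending">
      <JB_job_number>103</JB_job_number>
      <JB_name>assemble</JB_name>
      <JB_owner>alice</JB_owner>
    </job_list>
  </job_info>
</job_info>"""


def jobs_by_state(xml_text, state):
    """Return (job number, name, owner) tuples for jobs in the given state."""
    root = ET.fromstring(xml_text)
    # ElementTree supports a limited XPath subset, including
    # descendant search (.//) and attribute predicates ([@attr='value']).
    matches = root.findall(f".//job_list[@state='{state}']")
    return [
        (job.findtext("JB_job_number"),
         job.findtext("JB_name"),
         job.findtext("JB_owner"))
        for job in matches
    ]


if __name__ == "__main__":
    for number, name, owner in jobs_by_state(SAMPLE, "pending"):
        print(number, name, owner)
```

The same query could be expressed in an XSLT stylesheet to transform the data into HTML or another output format; the standard-library parser is used here only to keep the sketch self-contained.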
Grid Engine Future Plans
Speaker: Daniel Templeton, Strategic Liaison Manager, Sun Grid Engine, Sun Microsystems
Sun Grid Engine software is one of the top distributed resource
management software packages in the industry. As both a licensed
product and an open source project, Sun Grid Engine has very broad
adoption across a wide range of industry segments, government
facilities, and educational institutions. The next release of the Sun
Grid Engine software will happen late this summer, but what then? In
this talk, we'll peek ahead into what we're thinking about for the future.
GLOBUSWORLD #2
What's New in the Data Area? (45 minutes)
Speakers: Raj Kettimuthu, Argonne/UC; Ann Chervenak, USC/ISI; Rob Schuler, USC/ISI
We will present an overview of recent developments and future plans for major Globus components. For GridFTP, we will discuss the use of GridFTP over UDT; GridFTP with SSH; Multicasting in GridFTP; providing resource management for GridFTP transfers; and recent optimizations to support good performance for lots of small files. We will also discuss the planned work on automatic firewall traversal for GridFTP transfers. For replication services, we will discuss the embedded database backend for the Replica Location Service and the pure Java interface for RLS. We will also discuss new work on policy-driven data placement services and their relationship to workflow management systems. Finally, we will discuss recent developments in the OGSA Data Access and Integration System.
GridFTP and Cluster Meltdown: When No Means 'Maybe Later' (45 minutes)
Speaker: John Bresnahan, Argonne/UC
High speed wide area data transfer requires quite a bit of compute resources, not only in terms of network bandwidth and disk space, but also in endpoint system memory and processing power. Too often system administrators inadvertently allow clients to 'overclock' their cluster's GridFTP servers by failing to protect them from clients that try to transfer too many files too fast and all at once. This ultimately acts as a denial of service, causing thrashing and extremely suboptimal results. In this session we will explain how to properly configure Globus data transfer services including GridFTP and RFT. Attendees will learn when and how to make the choice between the two services in order to protect hardware resources and achieve the optimal results which a given set of hardware can provide.
TUTORIAL #1: (ROCKS)
Introduction to Clusters and Rocks Overview
Speaker: Mason Katz
This session will cover the basics of types and design of clusters (from Beowulf, to Tiled Walls, to High-Performance Supers). The basic philosophy and first level design of Rocks will be presented, as will comparisons to some other competitive methods. Getting started on Rocks will include building of real and virtual machines in Rocks 5.0.
INTRODUCTION AND OVERVIEW #2: ENABLING SOFTWARE FOR DISTRIBUTED COMPUTING
Towards a Common Communication Infrastructure for Clusters and Grids
Speaker: Darius Buntinas, PhD, Assistant Computer Scientist, Argonne National Laboratory
Communication infrastructure for Clusters and Grids has traditionally been dealt with in a decoupled manner. For many years, cluster communication systems have been focusing on various optimization aspects relying on hardware protocol offload, RDMA, OS bypass and many other high-performance features. For Grids, on the other hand, TCP and UDP continue to be the dominant communication protocols of choice. As high-speed lambda connectivity between different sites is becoming common, it is becoming increasingly important to have a common communication infrastructure that can match the demands of both Cluster and Grid environments. Such a common communication infrastructure needs to provide various features such as low latency, high bandwidth and reduced CPU usage that application scientists have come to expect of most cluster interconnects. At the same time, this infrastructure should also be capable of meeting the various demands of wide-area communication such as efficiently utilizing high-bandwidth communication pipes in lambda grids and maintaining backward compatibility with existing infrastructure. In this talk, we will present different advances in communication technologies that have taken place in this area. Specifically, we will focus on two popular network technologies, InfiniBand and 10-Gigabit Ethernet (TCP/UDP offload engines, iWARP, MX) and present their latest advances in this area. We will also give an overview of different solutions available today and point out the pros and cons of these technologies.
MPICH
Speaker: Darius Buntinas, PhD, Assistant Computer Scientist, Argonne National Laboratory
Open MPI and Sun HPC ClusterTools: A Technical Overview
Speaker: Leonard Wisniewski, PhD, Engineering Manager, Sun Microsystems / Software Developer Tools and Services
Open MPI was established four years ago as a clean slate implementation
of the MPI-1 and MPI-2 specifications. The goals of the Open MPI
project are to 1) create a free, open source, peer-reviewed,
production-quality complete MPI-2 implementation, 2) provide extremely
high, competitive performance, 3) directly involve the HPC community
with external development and feedback, 4) provide a stable platform for
3rd party research and commercial development, 5) help prevent the "forking problem" common to other MPI projects, and 6) support a wide
variety of HPC platforms and environments. Sun joined the Open MPI
community two years ago to add experience and expertise applied
previously to the proprietary Sun HPC ClusterTools product.
This talk will present an overview of the Open MPI architecture and what
hardware and software platforms it supports. Further, we examine the
Open MPI goals and highlight how these goals have been achieved to
date. We also provide details on how Open MPI has been used as the
basis of the "new" Sun HPC ClusterTools and how Sun has enhanced Open
MPI with its contributions to support Sun software such as Sun Grid
Engine and Sun Studio.
Stateless Provisioning with Perceus
Speaker: Greg Kurtzer, CTO, Infiscale.com
Stateless operating system management has many benefits in both enterprise and high-performance cluster computing. Perceus, like its predecessor Warewulf, facilitates provisioning industry-standard operating systems in a stateless manner, turning bare metal systems into production-ready servers almost indistinguishable to the user from fully-installed boxes, but in a fraction of the time and with little to no administrative effort. Already distribution-neutral, and working toward full operating system neutrality, Perceus can be used with most any hardware infrastructure or cluster software stack. All architectural decisions are up to the administrator or integrator of the system itself, and making changes on a thousand systems is as easy as making a change on a single system. Scaling to tens of thousands of nodes without compromising usability, Perceus now manages clusters of all sizes, from small ad-hoc home-brew systems to ten-thousand node behemoths. Leveraging our partnerships with software and hardware vendors, Perceus and its companion projects combine to form the only 100% free and open source solution available today which is certified as Intel Cluster Ready(tm). In this presentation, we will give an overview of Perceus, provide general usage and examples, and field audience questions.
3:00 – 3:15 pm
BREAK
3:15 – 4:00 pm
KEYNOTE PRESENTATION: How Open Source Drives Standards: Making HPC Clusters Simple and Affordable
Speaker: Gary Tyreman, General Manager of HPC, Univa UD
“I invented nothing new. I simply assembled into a car the discoveries
of other men.” -Henry Ford
Henry Ford once said “the way to make automobiles is to make one
automobile like another automobile, to make them all alike.” A
visionary in time and motion business practices, Ford understood that
the key to mass market acceptance of the automobile was accessibility,
affordability and safety. The adoption of interchangeable parts, a
mainstay in the typewriter and clock industries for decades, was
precisely the catalyst required to drive volume and lower costs for
the nascent automotive industry. The HPC industry, like the clock,
typewriter and automobile markets before it, is ready to adopt a
standardized design and leverage interchangeable parts.
This keynote will discuss the powerful impact of open source software
on the acceleration of the commoditization of HPC Linux Clusters and
how Univa UD, the leader in open source cluster and grid solutions,
will “assemble the discoveries of other men” into a simple-to-use
cluster software stack for the mass market.
4:00 – 4:30 pm
BREAK
4:30 – 6:00 pm
ROCKS #2
Xen VMs, Virtual Clusters and Programmatic Partitioning
Speakers: Mason Katz; Greg Bruno; Philip Papadopoulos, PhD, Program Director, San Diego Supercomputer Center at UC-San Diego; Anoop Rajendra
The internals of Xen support in Rocks will be presented and dissected in detail. A preliminary roadmap for enhanced support for completely virtualized clusters (frontends and slave nodes) will be given. New for Rocks 5.0 is the ability to fully program how a node partitions its local hard drives so that any partitioning policy can be implemented. Methods, techniques and examples of partitioning schemes will be presented.
GLOBUSWORLD #3
Grid Information Management Using MDS
Speakers: Laura Pearlman, USC/ISI, MDS Project Chair; JP Navarro, ANL; Yusuke Tanimura, AIST
Globus Monitoring and Discovery Services (MDS) allow for the monitoring
of the state of the grid and for discovery of available resources. In
this session we discuss the overall design, latest developments, and
future plans for these services and describe some user experiences with
them. We will focus on new developments and use cases involving the MDS
Index and Trigger Services, the WebMDS interface, and the components
used to publish information via MDS. This session will be structured as
a general overview of MDS topics, followed by a case study of MDS use in
TeraGrid and a discussion of S-MDS, a semantic modeling and discovery
system based on MDS.
TUTORIAL #2: (GRID ENGINE)
Using the New Features of Grid Engine 6.2
Speakers: Roland Dittel, Sun Microsystems Germany and Lubomír Petrík, Software Developer, Sun Microsystems Czech
The upcoming Sun Grid Engine 6.2 release will introduce many new
features, including new CLI, APIs and usage concepts. This
tutorial will show how to use and administer these new features and what
their benefits are.
INTRODUCTION AND OVERVIEW #3: GRID ENGINE TOPICS
Synopsys Use of Sun Grid Engine in EDA
Speakers: Joe Fu, Technical Manager, Synopsys, Inc. and Bogdan Vasiliu, Sun Microsystems, Inc.
Electronic Design Automation (EDA) applications stress computer systems in every imaginable way. They can be processor, memory, and I/O intensive; basically nothing is spared. Managing thousands of EDA compute jobs daily (nightly builds, regression runs, tests, benchmarks, interactive jobs, etc.) on geographically distributed grids, each consisting of hundreds to thousands of nodes, is a daunting task not for the faint-hearted. To make things even more difficult, each of these grids may have its own access policies, restrictions, and special configurations. Specialized tools and special skills are required to handle this type of job. This talk will focus on how Synopsys, the largest EDA independent software vendor, utilizes Sun Grid Engine (SGE) to efficiently manage the flow and execution of its internal EDA compute jobs. The presentation will cover the technical aspects of managing and configuring SGE at Synopsys: the setup and configuration of local grids, queues, complexes, access policies, etc., and various challenges and solutions for this type of large-scale grid installation.
Grid Heating: Dynamic Thermal Allocation via Grid Engine Tools
Speaker: Paul Brenner, PhD, Scientist, University of Notre Dame Center for Research Computing
From 2006 to 2011, the national energy consumption for powering and cooling IT servers is estimated to grow from a cost of 4.5 to 7.4 billion dollars, as reported by a recent EPA study which included current efficiency improvement trends. With growing national concern for energy efficiency and environmental stewardship, current power utilization trends in HPC and data centers cannot continue to scale with computational demands. I introduce a new grid heating framework to promote the efficient growth and sustainment of commercial, academic, and government computation capabilities. Grid Heating removes cooling expenditures while providing dynamic distributed heating benefits to target heat sinks. In this presentation I will introduce the grid heating framework and share experimental results from heating a municipal botanical garden using Grid Engine tools to remotely harness HPC resources. Additional grid heating challenges and opportunities are discussed with regard to development, implementation, and deployment.
LSF vs Grid Engine
Speaker: Chris Dagdigian, Principal Consultant, BioTeam, Inc.
As an independent consultant with years of Grid Engine and Platform
LSF experience, Chris Dagdigian has often been asked to help clients
with IT purchasing decisions. Often this includes assisting with
evaluation and selection of a distributed resource management ("DRM")
solution. Using past projects as examples, the background methodology
for making "Grid Engine vs. Platform LSF" deployment decisions will be
explained.
6:00 – 8:00 pm
Sponsor Reception
Join us for hors d'oeuvres and drinks by the sponsor tables and take this opportunity to view their displays and thank them for supporting the conference.
WEDNESDAY, MAY 14, 2008
8:30 – 10:00 am
GRID ENGINE #3
Service Domain Manager – Basics and Concepts
Speakers: Richard Hierlmeier and Ryszard Macidlowski
Service Domain Manager (SDM) is an upcoming product from Sun that will
allow administrators to configure policies to automatically reassign
resources from one service to another based on service level objectives
and the changing load conditions. The Sun Grid Engine 6.2 software will
include an early version of SDM that will allow multiple Sun Grid Engine
clusters to dynamically share resources to maximize utilization across
the entire grid.
This presentation will explore the SDM features that are exposed in Sun
Grid Engine 6.2. Topics covered will include the various SDM components
and the basic SDM concepts, such as services, resources, the spare pool,
etc. The presentation will also look ahead to what features the full
SDM release will provide.
Managing Multiple Grid Engine Clusters with Service Domain Manager
Speakers: Richard Hierlmeier and Ryszard Macidlowski
Service Domain Manager (SDM) is an upcoming product from Sun that will
allow administrators to configure policies to automatically reassign
resources from one service to another based on service level objectives
and the changing load conditions. The Sun Grid Engine 6.2 software will
include an early version of SDM that will allow multiple Sun Grid Engine
clusters to dynamically share resources to maximize utilization across
the entire grid.
In this presentation, the speakers will present a concrete use case for
SDM. The presentation will walk through assigning a resource to a Sun
Grid Engine server, automating resource assignment through service level
objectives, automatically discovering resources, and mapping Sun Grid
Engine complexes to SDM resource properties.
Accounting and Reporting Console Multi-Cluster Support
Speaker: Jana Olivova, Sun Microsystems
Grid applications produce large amounts of accounting data, and users face the perplexing task of sorting through the data and generating constructive statistical business reports. Data about load averages, CPU and memory usage, average throughput or the number of jobs completed are often needed for statistical evaluation of Grid processing. At a time of ever-increasing need for statistical data analysis, the database plays a crucial part in fulfilling the data management requirements of Grid applications due to its advanced data mining capabilities. Sun Grid Engine Accounting and Reporting Console (ARCo) addresses these needs and offers the possibility to gather and store Grid accounting data in a standard relational database (PostgreSQL, Oracle, MySQL) and access it through an online graphical user interface. The online console contains a set of predefined SQL queries addressing the most frequent statistical inquiries. Users are able to create custom queries, display the tabular data in a graphical representation or pivot table, store the result snapshots and export data in PDF or CSV format. This presentation familiarizes users with ARCo and explains its multi-cluster support functionality.
ROCKS #3
Customizing Rocks through Rolls. How to Develop Your Own
Speakers: Tim McIntire, President, Clustercorp; Anoop Rajendra; Greg Bruno
Rolls are the primary mechanism for customizing Rocks installations while enabling reproducibility across any number of clusters. Rolls can be commercial or open-source. Clustercorp has produced several rolls and will describe their techniques and issues. Techniques for building and testing Linux-based rolls at UCSD will also be covered. An introduction to the Rocks changes needed to support Solaris and Rocks-on-Solaris will be presented.
Building on Open Source Rocks: 3rd Party Rolls for Rocks (T. McIntire)
One of the great things about the Rocks Cluster Distribution is the ability to extend, tweak, and replace functionality by leveraging the Rocks framework to build “Rolls”. This has allowed 3rd parties to build off the base Rocks solution, while the open source team remains focused on core functionality and new features on the cluster management side. In this presentation, Tim McIntire, President, Clustercorp, will discuss a variety of Rolls that are available from 3rd parties, including the OFED Roll, PBS/Torque Roll, and compiler Rolls for Intel, AMD, and PGI.
Using What Your Momma Gave You: Leveraging the Rocks Framework (TBD)
Doing things the “right” way is critical in maintaining an efficient Rocks-based cluster. Many system administrators (including myself) have horror stories of early experiences in cluster configuration and maintenance that include a litany of custom scripts and hacks that keep a system and its users up and running with all the necessary components. Rocks provides a built-in mechanism, “Rolls”, for building software stacks directly into the cluster distribution. Leveraging Rolls for the complete configuration of your cluster will ensure that redeployment of compute nodes, or even a complete rebuild from the head node up, will be a simple, repeatable process. While there is a learning curve to developing Rolls and working within the Rocks framework, the long-term benefits greatly outweigh the short-term overhead.
Moving Beyond the Womb: An Overview of Currently Available 3rd Party Rolls (TBD)
A brief but complete overview of available Rolls with 3rd party contributions including absoft, amd, apbs, bio, cisco-ofed, condor, intel, opal, moab, pbs, pgi, pvfs2, qlogic-ib, voltaire-ib. Subsequently, we’ll go into detail on two 3rd party Rolls with an open-source bent: PBS/Torque from the University of Tromso and Cisco-OFED from Clustercorp.
TUTORIAL #3: (GLOBUSWORLD)
Configuring and Deploying GridFTP for Managing Data Movement in Grid/HPC Environments
Speaker: Raj Kettimuthu, Argonne / UC
One of the foundational issues in HPC is the ability to move large (multi-gigabyte, and even terabyte) data files between sites. Simple file transfer mechanisms such as FTP and SCP are not sufficient from either the reliability or the performance perspective. The Globus implementation of GridFTP is the most widely used Open Source production quality data mover available today. Key features of Globus GridFTP include:
- Performance: GridFTP typically provides order-of-magnitude performance improvements compared to standard FTP. By using non-TCP protocols such as UDT, as well as parallel streams, GridFTP minimizes bottlenecks inherent in TCP/IP and achieves good performance.
- Cluster-to-cluster data movement: GridFTP can do coordinated data transfer utilizing multiple computer nodes at source and destination. This can increase performance by another order of magnitude.
- Reliability: GridFTP provides support for reliable and restartable data transfers.
- Multicasting: Globus GridFTP is capable of doing one-source-to-many-destination transfers.
- Multiple security options: The Globus GridFTP framework supports various security alternatives. It supports Grid Security Infrastructure, SSH-based security, anonymous access, and username/password-based security.
- Modular: The XIO-based Globus GridFTP framework makes it easy to plug in alternate transport protocols. The Data Storage Interface (DSI) allows for easier integration with various storage systems.
- Third-party control: GridFTP also allows secure 3rd party clients to initiate transfers between remote sites.
- Partial file transfer: In many cases in the scientific community it is expedient to download only portions of a large file, instead of the entire file. GridFTP supports this capability by specifying the byte position in the file at which to begin the transfer.
- Negotiation of TCP buffer/window sizes: GridFTP employs FTP command and data channel extensions to support both automatic and manual negotiation of TCP buffer sizes to get optimal performance.
In this tutorial, we will quickly walk through the steps required for setting up GridFTP on Linux/Unix machines. Then we will explore the advanced capabilities of GridFTP such as striping, and a set of best practices for obtaining maximal file transfer performance with GridFTP.
INTRODUCTION AND OVERVIEW #4: GLOBUS TOPICS
GridWay: The Open Source Metascheduling Technology for Grid Computing
Speaker: Ruben S. Montero, PhD, Associate Professor, Universidad Complutense de Madrid
GridWay is a widely-used metascheduling technology that performs job execution management and resource brokering, allowing unattended, reliable, and efficient execution of jobs, job arrays, and workflows on heterogeneous and dynamic Globus grids. GridWay performs all the job scheduling and submission steps transparently to the end user and adapts job execution to changing Grid conditions by providing dynamic scheduling, fault recovery mechanisms, migration on-request and opportunistic migration. The GridWay metascheduler is a Globus product, released under Apache license v2.0, welcoming code and support contributions from individuals and corporations around the world. GridWay provides the following benefits to the different stakeholders involved in a Grid environment: (i) for project and infrastructure directors, GridWay is an open-source community project, adhering to Globus philosophy and guidelines for collaborative development; (ii) for system integrators, GridWay is highly modular, allowing adaptation to different grid infrastructures, and supports several OGF standards; (iii) for system managers, GridWay gives a scheduling framework similar to that found on local DRM systems, supporting resource accounting and the definition of scheduling policies; (iv) for application developers, GridWay implements the DRMAA API (C and Java bindings) OGF standard, assuring compatibility of applications with LRM systems that implement the standard, such as SGE, Condor or Torque; and (v) for end users, GridWay provides an LRM-like CLI for submitting, monitoring, synchronizing and controlling jobs that could be described using the JSDL OGF standard. The presentation consists of two parts. The first part is a description of the state of the technology: main benefits and major features, alternatives for scheduling infrastructures, relevant use cases, and project status and roadmap.
The presentation will focus on GridWay's state-of-the-art functionality, such as the new scheduling policies, which comprise job prioritization policies (fixed priority, urgency, share, deadline and waiting-time) and resource prioritization policies (fixed priority, usage, failure and rank). The second part of the presentation demonstrates its main functionality on production infrastructures, showing how GridWay can simultaneously access distinct middleware stacks (GT pre-WS, GT WS and EGEE services), which also enables Grid interoperability and supports the transition to new Globus versions.
Using Taverna to Orchestrate Grid Services in a Workflow
Speaker: Ravi Madduri, Senior Software Developer, Argonne National Laboratory, University of Chicago
caGrid is a service-based grid software infrastructure that effectively brings together distributed data and analytic resources into a virtual collaborative platform for cancer research. In caGrid, many of the tasks involved in the analysis and aggregation of cancer-related data make use of “canned” solutions, or workflows. As a result, there is a need to orchestrate the invocation of caGrid services through the use of a workflow language and tooling. This presentation first summarizes the rationale for selecting Taverna as the primary candidate for workflow authoring and invocation. It then introduces the development of Taverna plug-ins in general and shows how to extend Taverna for use with caGrid services. Finally, it details a real-world example and the lessons learned from our research and experiments. To provide a full-fledged, grid-enabled workflow solution, future work includes: 1) support for Taverna version 2.0 (T2), which is to be released soon; 2) support for secure grid services; 3) support for semantic-based service discovery in the scavenger; and 4) support for stateful grid services.
MyProxy based Short Lived Credential CA Service at NERSC
Speaker: Shreyas Cholia, Computer Systems Engineer, National Energy Research Scientific Computing Center, Lawrence Berkeley National Laboratory
This session will discuss how the National Energy Research Scientific Computing Center (NERSC) uses a MyProxy based Certification Authority (CA) to issue short-lived grid certificates to its users. PKI X509 certificates form the backbone of grid services in the Globus universe. However, the process of acquiring these certificates is often cumbersome, and can be a deterrent to new grid users attempting to automate and manage their workflows. Moreover, security concerns make the proliferation of long-lived certificate files on compute systems across the grid undesirable. Credential repositories like MyProxy allow for centralized management of proxy credentials that can then be reused across the grid, but rely on users to manage their own certificates, and involve delegation of trust to an entity external to the PKI. To deal with these challenges, MyProxy now supports a CA feature to sign and generate short-lived end-entity certificates. This allows the identity provider to directly manage the issuance of short-lived certificates for practical grid use. The MyProxy based NERSC Online CA attempts to remove the burden of grid certificate generation and management from the user. It ties in the authentication of users receiving certificates with the existing identity management infrastructure at NERSC. NERSC maintains a user database called the NERSC Information Management (NIM) System. NIM contains records for all NERSC users. NERSC users have already been vetted by an accounts-and-allocations process, which includes PI verification or face-to-face contact. This process is common in the United States research communities (NSF, DOE processes for ID vetting). Once the user has been vetted, there is enough information to establish a password based authentication process to the NERSC infrastructure. We wish to leverage this infrastructure for issuance of grid certificates. 
The NIM database is exported through an LDAP directory tree, including an MD5 hash of the user’s password. The user connects to the MyProxy CA using a standard myproxy-logon client. The user enters her NIM password when prompted, which is SSL encrypted and sent to the MyProxy CA. The MyProxy CA verifies the user’s password against the LDAP exported NIM record (using PAM). If verification succeeds, the MyProxy CA issues a short-lived (12 hours - 11 days) user certificate to the client. This certificate has a unique and persistent DN for a given user. The DN is generated using information from the NIM record (using a combination of the user’s full name and UID). This system can also plug into a portal framework using the Commodity Grid (CoG) Kits, thus enabling web-based grid applications. The NERSC Online CA is based on the IGTF SLCS profile, and includes an Aladdin eToken hardware security module to store the CA’s signing key. The NERSC Online CA provides its users with identity credentials that can be used for job and file management, while effectively leveraging the existing identity framework and authentication systems already in place at NERSC. This allows NERSC to put explicit authentication controls on credential generation in place, while greatly simplifying the process of certificate acquisition for the user.
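As a rough illustration of two details in the abstract above (a persistent DN derived from the user's full name and UID, and a certificate lifetime bounded between 12 hours and 11 days), the following Python sketch may help. It is not NERSC's or MyProxy's implementation: the DN prefix, the function names, and the exact way name and UID are combined are all assumptions made for illustration.

```python
from datetime import timedelta

# Lifetime bounds taken from the abstract: 12 hours to 11 days.
MIN_LIFETIME = timedelta(hours=12)
MAX_LIFETIME = timedelta(days=11)

# Hypothetical DN prefix; the real CA namespace is not given in the abstract.
DN_PREFIX = "/DC=example/DC=org/OU=People"


def build_dn(full_name: str, uid: int) -> str:
    """Combine full name and UID so the DN is unique and stable per user.

    The exact format is an assumption; the abstract only says that both
    fields are used.
    """
    return f"{DN_PREFIX}/CN={full_name} {uid}"


def clamp_lifetime(requested: timedelta) -> timedelta:
    """Keep a requested certificate lifetime within the allowed window."""
    return max(MIN_LIFETIME, min(requested, MAX_LIFETIME))
```

Because the DN depends only on fields stored in the user record, the same user would receive the same DN on every logon, which is what makes the identity persistent even though each certificate is short-lived.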
10:00 – 10:30 am
Break
10:30 – 12:00 pm
GRID ENGINE #4
Making Grid Engine Highly Available with Open High Availability Cluster and OpenSolaris
Speaker: Ashutosh Tripathi, Senior Software Engineer, Sun Microsystems
The Grid Engine job scheduling software, while highly scalable, is not highly available “out of the box.” Although Grid Engine can handle failures of individual nodes in the compute cluster, the Master Host itself is a single point of failure. Grid Engine does provide a Shadow Master host mechanism to increase availability. However, for the highest availability, the Master Host itself should be run on a high availability cluster. Open High Availability Cluster, the open-source version of Solaris Cluster, Sun's enterprise high availability product suite, is the first open-source HA Cluster based on a major proprietary HA Cluster. Open HA Cluster tightly connects multiple physical nodes to provide a high availability platform for off-the-shelf software applications. This presentation will introduce the concept of high-availability clusters, and will show how Open High Availability Cluster can run the Grid Engine Master Host on OpenSolaris in a highly available fashion, providing quick recovery times, integrated highly available NFS and IP addresses, and configurable service dependencies.
HPC Visualization on the Grid
Speakers: Linda Fellingham, PhD, Manager, Visualization and Graphics, Sun Microsystems, and W. Dean Stanton, Senior Staff Engineer, Sun Microsystems
While the performance of 3D graphics hardware has been increasing at astounding rates, in excess of Moore's law, the graphics pipeline is only one part of effective solutions for compute-intensive visual applications. Many interactive visual problems require huge memories, large numbers of CPUs, and/or high-speed access to vast amounts of data storage. These problems are well-suited to execution in the server room, where secure, professionally-managed systems and high-speed interconnects are commonly available as shared resources. Running visual applications on the grid and displaying the images over the network on ordinary, low-cost systems allows tackling larger problems and providing access to many more users, even over wide distances. The challenges that must be met are many - allocating and sharing graphics resources among users; transparently transforming applications which were designed to be used by a single user on a single desktop into applications that can be used by a remote user (or users) scaled across multiple nodes with multiple graphics devices; providing interactive capability through grid interfaces better suited for batch environments; facilitating re-configuration of scalable visualization middleware and applications (and their associated complicated configuration files and scripts), for maximum simultaneous utilization of resources. In addition, GPU computing is coming to the fore for many algorithms which can take advantage of the massively parallel, stream-computing model. Many of the same hardware and software solutions that address visualization on the grid can be leveraged to facilitate grid-based GPU computing. This presentation will describe how Sun Grid Engine, Sun Shared Visualization and Sun Scalable Visualization software work together to provide users seamless access to high-performance visualization applications and GPU computing resources on the grid. 
It will describe deployments of this visual computation model, and discuss the problems that can be effectively addressed by this technology now and in the future.
PluS: An Advance Reservation plug in for Sun Grid Engine
Speaker: Hidemoto Nakada, PhD, Senior Research Scientist, National Institute of Advanced Industrial Science and Technology
Advance Reservation is an important technology for making resource co-allocation possible in the Grid environment. This presentation introduces an Advance Reservation plug-in for Sun Grid Engine called 'PluS'. Although Sun Grid Engine recently gained Advance Reservation capability, PluS still offers several advantages:
- PluS provides a policy-setting mechanism for the acceptance of reservation requests, which allows site administrators to set up site-local policies, such as group-wise priority settings or history-based acceptance. For policy description, we employ the ClassAd language from the Condor project, which is a well-formalized and powerful language.
- PluS supports a two-phase commit protocol, which is important for modifying co-allocations.
- PluS can execute specific jobs before and after the reserved jobs. This capability turned out to be important for changing network settings for co-allocated jobs.
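To make the two-phase commit point concrete, here is a minimal Python sketch of all-or-nothing co-allocation across several sites. It illustrates the general technique only and does not reflect the PluS protocol or API: the `Site` class, its methods, and `co_allocate` are hypothetical names introduced for this example.

```python
class Site:
    """A resource site that can tentatively hold nodes before confirming."""

    def __init__(self, name, free_nodes):
        self.name = name
        self.free_nodes = free_nodes
        self.held = 0  # nodes tentatively reserved in phase one

    def prepare(self, nodes):
        """Phase one: hold the nodes if available and report success."""
        if nodes <= self.free_nodes:
            self.free_nodes -= nodes
            self.held = nodes
            return True
        return False

    def commit(self):
        """Phase two: make the held reservation permanent."""
        self.held = 0

    def abort(self):
        """Roll back: return held nodes to the free pool."""
        self.free_nodes += self.held
        self.held = 0


def co_allocate(requests):
    """Reserve nodes on every site in `requests`, or on none of them.

    `requests` is a list of (site, node_count) pairs.
    """
    prepared = []
    for site, nodes in requests:
        if site.prepare(nodes):
            prepared.append(site)
        else:
            # One site refused, so release everything held so far.
            for done in prepared:
                done.abort()
            return False
    for site in prepared:
        site.commit()
    return True
```

The separation into a tentative hold and a later confirmation is what lets a coordinator change a reservation spanning several sites without leaving some sites reserved and others not.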
The principal PluS commands are the following:
plus_reserve - makes a resource reservation and returns a reservation ID
plus_cancel - cancels a reservation
plus_modify - modifies an existing reservation
plus_status - lists existing reservations
PluS has two operation modes:
1) It represents each reservation as a queue and controls the queues through an external interface.
2) It completely replaces the scheduling module of the queuing manager.
The former approach is easy to deploy since it does not change the existing scheduling module of Sun Grid Engine at all. The latter approach is more capable than the former, since it potentially allows any capability the administrator wants. Another role of PluS is to serve as an easy-to-use Java toolkit for constructing a scheduling module that replaces sge_sched. PluS provides an easy-to-use Java API to retrieve information from Sun Grid Engine and to control node allocation and job execution. The Java API talks to the sge_master via a proxy module written in C, called operatord, that translates SGE's native protocol, GDI, into a plain-text XML-based notation. The API allows novel scheduling algorithms to be implemented on Sun Grid Engine easily. You can write a FIFO-based round-robin toy scheduler in 70 lines. PluS is available from http://www.g-lambda.net/plus .
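The idea behind a FIFO-based round-robin toy scheduler can be sketched in a few lines of plain Python. This is not the 70-line scheduler mentioned above and does not use the PluS Java API, whose classes are not described here; the function name and the job and node labels are hypothetical.

```python
from collections import deque
from itertools import cycle


def fifo_round_robin(jobs, nodes):
    """Assign jobs in arrival order to nodes in rotating order.

    Returns a list of (job, node) pairs. Node capacity and job runtime
    are deliberately ignored in this toy model.
    """
    if not nodes:
        raise ValueError("at least one node is required")
    queue = deque(jobs)        # FIFO: the oldest job leaves first
    node_cycle = cycle(nodes)  # round robin over the node list
    assignments = []
    while queue:
        job = queue.popleft()
        assignments.append((job, next(node_cycle)))
    return assignments
```

For example, `fifo_round_robin(["j1", "j2", "j3"], ["a", "b"])` places `j1` on `a`, `j2` on `b`, and `j3` on `a` again. A real scheduling module would also need to query node state and dispatch jobs through the queuing system.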
GLOBUSWORLD #4
Globus Execution Services
Speakers: Stuart Martin, Senior Software Developer, Argonne National Laboratory, University of Chicago; Kate Keahey, Mathematics & CS Division, Argonne National Laboratory Computation Institute, University of Chicago; Suresh Marru, Indiana University; Ruben S. Montero, PhD, Associate Professor, University of Madrid; Ioan Raicu, University of Chicago
Globus execution management services provide the capability to submit, monitor, and cancel jobs on Grid computing resources. The remote jobs may require coordinated staging of data and credential management into the resource prior to job execution and out of the resource following execution.
- What's New in 4.0 and 4.2 GRAM, What's Planned for the Future (Stuart Martin): An overview of the latest developments and future plans for the Globus GRAM service, including optimizations for high throughput, auditing support, support for the OGSA BES standard, SAML authorization support, alternative clients (Java CoG, Condor-G, and others), grid-enabled MPI (MPIg), and dynamic service startup / task execution (Condor GlideIn and FALKON). We focus in particular on recent enhancements and new features in the GT4.0 and GT4.2 releases.
- Virtual Machine Management Services (Kate Keahey): An overview of the Globus Toolkit Workspace Service that allows an authorized Grid user to provision and manage environments (currently implemented as virtual machines) in the Grid. The talk will provide an introduction to the cloud computing talk later in the week.
- Experiences with the use of GRAM in the LEAD portal (Suresh Marru): An overview of the Linked Environments for Atmospheric Discovery (LEAD) portal that provides access to meteorological data, forecast models, and analysis and visualization tools to researchers, educators and students. The focus will be on the experiences of LEAD's use of GRAM and other Globus components during the Spring 2008 weather forecast challenge.
- The GridWay metascheduler (Ruben S. Montero): An overview of the GridWay metascheduler and its integration with Globus components like GRAM and MDS.
- Swift and Falkon (Ioan Raicu): An overview of Swift, a system that bridges scientific workflows with parallel computing, and Falkon, a light-weight task execution service for optimized task throughput and resource efficiency when executing many independent jobs on large compute clusters.
TUTORIAL #4: (ROCKS)
Basic Management and Customization
Speaker: Greg Bruno
While Rocks clusters are turnkey, users always need to manage and customize their cluster. This session introduces the Rocks configuration graph and covers how to add new packages and configuration. Other common customization scenarios will also be described.
INTRODUCTION AND OVERVIEW #5: INNOVATIVE USES OF ROCKS
A Case Study on Building Faster, Easier HPC Clusters with Rocks at Stanford University
Speaker: Steve Jones, Manager, High Performance Computing Center, Stanford University
In just 11 days during 2007, the Stanford University High-Performance Computing Center was able to fully implement a 1,696 core cluster solution by leveraging the certification methodology from the Intel Cluster Ready Program. In addition to rapid deployment, the system nearly doubled the performance of the center’s existing compute system. The new Stanford solution leverages Dell, Clustercorp and Panasas technologies, providing the Center unprecedented flexibility to meet their ever-expanding computational and application requirements and enabling Stanford researchers to achieve faster time-to-results. Steve Jones, the founder and manager of the Stanford HPC Center, will discuss his experiences in the design and deployment of this system.
Mission: CFD on Demand
The goal of the expansion was simple: acquire sufficient compute power to support the School of Engineering coursework and research efforts and the university’s industrial affiliates program. Key research programs include the Department of Energy Advanced Simulation and Computation (ASC) program, sponsored by the National Nuclear Security Administration, and the next-generation Predictive Science Academic Alliance Program (PSAAP). The system had to be capable of accommodating over 200 researchers. Two groups, in particular, required large-scale, massively parallel computing resources for their work with the ASC program. The researchers in the mechanical engineering and aeronautics and astronautics departments leverage the HPC Center resources to analyze the details of flow and acoustics created by helicopters in forward flight. Critical applications include two major in-house-developed simulation codes: Stanford University multiblock (SUmb) and CDP, named for the late Charles David Pierce. Commercial applications include ANSYS, Gaussian, MatLab, and VASP.
Result: 11-Day Deployment Delivers 2-14X Performance Improvement
The entire deployment, including implementing an entirely new power and cooling infrastructure, took a total of 11 days. Dell’s Enterprise Deployment team played an integral role in this feat, coordinating the efforts of all participating vendors. The power, cooling, and system build out were completed in parallel. We used the Rocks+ Linux cluster distribution to configure master and compute nodes, and by day 11 researchers were able to submit jobs that executed flawlessly, producing scientific code and operations with unprecedented fidelity. The new cluster easily handles ten times the workload of the original 48-node configuration. Testing results show performance of 15.8 teraflops, compared to 1.1 teraflops delivered by the smaller cluster.
Extending Rocks for the Creation and Management of Grid Systems for Biomedical Research
Speaker: Vicky Rowley, UCSD
The Biomedical Informatics Research Network (BIRN; http://www.nbirn.net) project, an NIH funded project, was launched in 2001 with the goal of fostering collaborations. With a focus on data and tool sharing for biomedical science (Grethe et al., 2005), the BIRN infrastructure is designed around a flexible large-scale grid model, combined with the conventional IT infrastructure to support the deployment of web servers, applications servers, database servers and authentication mechanisms. The result is a complete computing environment that facilitates biomedical research. To date, this system supports a production environment with over 25 fielded sites, with separate staging and development environments for black and white box testing.
A separate and distinct grid, including separate production, staging and development environments, has been established using the same software stack and deployment mechanisms, for the National Database for Autism Research (NDAR) project. NDAR is a collaborative bioinformatics system being created by the National Institutes of Health (NIH) to support research in autism spectrum disorder (ASD) and to help accelerate scientific discovery.
The scientific software integrated to date includes:
- web-based frontend software for end users
- application software for a wide variety of purposes, including web applications
- image processing software specific to processing medical imaging data and adapted to run on large computational clusters
- database applications using Oracle, MySQL and Postgres database engines
- “Point-of-Presence” servers, which connect an individual site’s data into the rest of the grid
Managing and deploying hundreds of servers over dozens of sites, including instantiation of multiple environments involving several server types, would not be possible without the high level of automation provided by the Rocks-based framework used by the BIRN Coordinating Center. Rocks builds upon RedHat Linux’s use of RPMs and kickstart files, allowing customized, flexible, yet highly automated installation of Linux servers. Rocks allows server functionality and parameters unique to each server (e.g. IP address, hostname, timezone, etc.) to be established at install time using minimal input. In contrast, conventional kickstart and so-called “golden image” installations leave the server with its software configured as it was for the server from which the kickstart file or image was made. It then has to be re-configured with the correct information. Also, unlike the golden image method, kickstart installations allow use of diverse hardware, a quality that Rocks installations inherit.
The ability to custom install a server based on both its unique parameters and its required functionality in a repeatable, highly automated way has been key to our ability to quickly and repeatably establish these grids. The Rocks software is designed for fast, repeatable deployment of computational clusters. For the BIRN project, the Rocks software was extended by the BIRN Coordinating Center (BCC) to support additional frontend types, including the web servers, application servers and database servers previously mentioned. In addition, to facilitate updating the grid software and to support additional software distribution paths, the software repository used for server installation was extended to support YUM (Yellowdog Updater, Modified).
Rocks based Virtual Cluster Management System: GriVon
Speaker: Takahiro Hirofuchi, PhD, Research Scientist, National Institute of Advanced Industrial Science and Technology
We introduce a virtual cluster management system, based on Rocks, called 'Grivon'. A virtual cluster is a virtual environment constructed on real, physical clusters. It provides better abstraction than mere virtual machines, with virtualized networks and virtualized storage. The virtual cluster networks are logically isolated from the real networks to provide better security. Grivon leverages Rocks, the cluster provisioning system, to maintain the virtual cluster. Users make a reservation for a virtual cluster by specifying the time slot, resource requirements, required Rolls, and required appliances. When the specified time arrives, the system automatically creates a virtual cluster using Rocks and provides it to the users. Behind the scenes, at the specified time, the system sets up virtual networks and storage, creates virtual machine configuration files, and then starts up a virtual front end as one of the virtual machines. Thanks to the 'lights-out' installation capability of Rocks, the installation is performed without any human interaction. The virtual cluster settings, such as the number of each appliance and their MAC addresses, are automatically injected into the database on the virtual front end via Rescue Rolls, which are created based on users' reservation requests and resource allocation. When the virtual front end installation completes, the system starts up the other virtual nodes so that they are installed from the virtual front end. The virtual front end distributes packages according to the injected information in the database. Thus, a completely configured Rocks cluster is automatically installed in the virtual world. Grivon uses VMware Server for compute resource virtualization, VLAN for network virtualization, and iSCSI with Logical Volume Manager (LVM) for storage virtualization.
The system is also capable of hosting a virtual cluster across multiple sites, which allows more flexible management and higher resource utilization. Inter-site communication is performed over a VPN to ensure privacy. It should also be noted that the Grivon system itself is implemented as a Rocks Roll, allowing easy installation. Grivon will be contributed to the community as a series of Rolls in the near future.
In this presentation, we give an overview of our project and of the VMware Roll released in May 2008.
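The reservation-driven workflow described in this abstract (a request names a time slot, resources, Rolls and appliances, and the system acts when the slot begins) can be pictured with a small Python sketch. It is purely illustrative and is not Grivon code: the `Reservation` fields and the `due_reservations` helper are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Reservation:
    """A virtual cluster request, with fields modeled on the abstract."""
    start: datetime
    end: datetime
    nodes: int
    rolls: list = field(default_factory=list)
    appliances: list = field(default_factory=list)
    started: bool = False  # set once the virtual cluster has been created


def due_reservations(reservations, now):
    """Return reservations whose slot has begun but which are not yet started."""
    return [r for r in reservations if not r.started and r.start <= now < r.end]
```

A provisioning loop could call `due_reservations` periodically and, for each result, set up the virtual networks and storage and boot the virtual front end, marking the reservation as started afterwards.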
12:00 – 1:30 pm
LUNCH (PROVIDED)
1:30 – 3:00 pm
ROCKS #4:
Extending Functionality Through the Rocks Command Line Roll Screen Development
Speakers: Nadya Williams, Grid Specialist, University of Zurich; Mason Katz; Anoop Rajendra
As an extension to the previous session, roll-developers can add new installation screens and have them integrated seamlessly. Nadya Williams will describe her test harness that significantly improves the development of installation screens. The Rocks command line is the way rolls extend the command structure for Rocks. The Rocks Viz Roll will be used as a key example of roll-based extension to support tiled-display clusters. The Solaris command set (currently under development) will illustrate how Rocks commands can work across different architectures.
Roll Screen Development (Nadya Williams)
Rocks Clusters Distribution has a powerful method for adding software packages: the roll. Many rolls for diverse applications have been developed by the Rocks group and by scientists who want to add their applications to Rocks clusters. Packaging a scientific application as a roll makes it easy to install and update the application and makes it convenient to share the application’s installation, configuration and updates with others. Rocks provides a mechanism for building your own roll using the Rocks-specific tools, thus making a roll’s integration into the cluster seamless and automatic. Some software, especially grid middleware, requires collecting user input during the cluster install, and this is done via a screen forms mechanism. Creating a roll with the screen enabled requires additional pieces of software to be written. In addition, testing the roll’s screen requires building a cluster frontend to view and test the screen. We present an example of how the testing and debugging of the roll’s screen can be done “online” without building the frontend. The idea here is that a developer can use an iterative process to build the screen, test it and validate it without leaving a roll development directory. This approach helps to speed up the roll development cycle by providing a way to visualize and validate the roll’s screen in situ.
GLOBUSWORLD #5
Globus Security: Features and Roadmap & Building Secure VOs using Globus Toolkit
Speakers: Frank Siebenlist, Argonne National Laboratory; Rachana Ananthakrishnan, Senior Software Developer, Argonne National Laboratory; Kunal Modi, Security Solutions Architect, Ekagra Software Technologies / Center for Bio-Informatics and Information Technology (CBIIT) - NCI; Tom Scavo, Lead Developer of GridShib Project, National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign
Globus' Security framework and services ensure the integrity, privacy and policy enforcement of the communication and resource usage on the Grid. We will report on a number of exciting new features included in the GT4.2 release that allow for fine-grained policy enforcement of services and resources. Its sophisticated attribute-based framework allows different attribute-collecting policy information points, as well as co-located and external policy decision point implementations, to be plugged in. Furthermore, a number of Grid projects will discuss the security components and services that they are contributing to the Globus community at large, like the cancer grid project caBIG (caGrid's Grid Authentication and Authorization with Reliably Distributed Services (GAARDS)), Earth System Grid ("easy" PKI based on MyProxy's online-CA and auto-provisioning), and TeraGrid/GridShib (SAML/Shib attribute services).
- GT4.2 Security update and futures & ESG's easy PKI (Frank Siebenlist)
- GT4.2 Security update and futures (Rachana Ananthakrishnan)
- GAARDS: caGrid's Grid Authentication and Authorization with Reliably Distributed Services (Kunal Modi)
- Attribute-based Authorization for Science Gateways Using GridShib (Tom Scavo)
A TeraGrid Science Gateway is an intermediary between a browser user
and one or more TeraGrid resource providers. The Gateway typically
provides a domain-specific portal interface that hides the details of
th |