Past Events – 2015

GlobusWorld 2015 was held April 14-15 at Argonne National Laboratory (near Chicago, Illinois, USA). The agenda with links to presentations can be found below.

Tuesday, April 14
7:30—17:00 registration desk open
7:30—8:30 breakfast
TCS Foyer
8:30—10:00 Led by: Steve Tuecke | tutorial materials

We will provide an overview of Globus, with demonstrations of key functions such as file transfer, data sharing, and publication and discovery. Participants will use Globus Connect Personal to install a Globus endpoint on their personal computer, and use it to transfer and share files.

10:00—10:30 beverage break
TCS Foyer
10:30—12:00 Led by: Steve Tuecke | tutorial materials

We will walk system and network administrators through the process of using Globus Connect Server to easily install and configure a Globus endpoint on a shared storage system such as a campus computing cluster.

12:00—13:30 lunch
Main Hall
13:30—15:00 Ian Foster, Globus co-founder  | slides  | video

With the recent release of data publication and discovery capabilities, Globus now provides a comprehensive set of services for simplifying data management throughout the research lifecycle. Ian Foster will review notable deployments of Globus and provide an update on product directions.

15:00—15:30 beverage break
TCS Foyer
15:30—17:00

Compute Canada's Globus Deployment
Jason Hlady, HPC Analyst and Coordinator, University of Saskatchewan, Compute Canada  | slides

Compute Canada entered into a partnership with Globus in 2014 and has deployed Globus file transfer and sharing tools at over twenty computational and storage-intensive sites across Canada, comprising a national research data transfer service. This talk will briefly describe how we rolled out Globus services and the framework of our national metrics service, which ingests Globus data transfer statistics, as well as our plans for the future.

Compute Canada’s national advanced computing platform integrates high performance computing systems, research expertise and tools, data storage, and resources with academic research facilities across the country. Compute Canada works to ensure that Canadian researchers have the advanced research computing facilities and expert services required to remain globally competitive.


Jetstream: A National Science & Engineering Cloud
Kurt Seiffert, Chief Storage Architect & Enterprise Architect, Indiana University  | slides

Jetstream (funded by the National Science Foundation, Award #ACI-1445604) will be a national science and engineering cloud providing a user-friendly environment for self-provisioned interactive computing and data analysis resources. Jetstream will leverage Globus tools for data movement and authentication. Researchers will be able to conduct research using a user-selectable library of virtual machines or create customized virtual machines. Jetstream will include virtual Linux desktops and applications aimed at enabling research and research education at small colleges and universities. Jetstream will enable research in a wide range of disciplines such as biology, atmospheric science, economics, observational astronomy, and the social sciences. A key component of Jetstream will be support for science gateways, both persistent services and on-demand resources.


Using Metadata to Enhance Globus Data Transfer Effectiveness
Martin Margo, Nirvana Lead Developer, General Atomics  | slides

Over time, Globus endpoints can become polluted with files, reaching the capacity limits of the underlying storage infrastructure and hindering the overall effectiveness of the high-performance GridFTP data mover. Users are typically focused on the task at hand, usually moving one or more files from one data mover node to another, and do not realize that the files they leave behind accumulate over time, filling up precious space on the data mover nodes. General Atomics Nirvana can scan the POSIX metadata of files stored on Globus endpoints to detect duplicates that were transferred from one endpoint to another and never removed, and to gather critical metrics such as the distribution of files per user, per file size, and per creation date. With Nirvana's metadata-driven analysis, a significant amount of Globus endpoint disk space can be reclaimed, optimizing file transfer performance. System administrators can automatically schedule the scan and clean-up tasks on their local endpoints, further reducing the time and effort needed to keep their data mover nodes in optimal shape. The "Data Junkyard" use case has been applied in the General Atomics nuclear fusion research cyberinfrastructure to scan and analyze accumulated file data in its enterprise disk storage system. The resulting Nirvana reports describe the current state of the file system and supported a decision to invest in a next-generation storage infrastructure suitable for the targeted scientific applications and day-to-day general computing.
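Nirvana itself is proprietary, but the underlying duplicate-detection idea, grouping candidate files by size and then confirming with a content hash, can be sketched generically. The function below is an illustration of the technique, not Nirvana's actual implementation.

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(paths):
    """Group candidate files by size, then confirm duplicates by SHA-256.

    Hashing only size-collisions avoids reading every file in full
    when most files have unique sizes.
    """
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    by_hash = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        for p in same_size:
            h = hashlib.sha256()
            with open(p, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_hash[h.hexdigest()].append(p)

    # Only hashes seen more than once are true duplicates.
    return {k: v for k, v in by_hash.items() if len(v) > 1}
```

A clean-up pass would then keep one copy per hash and reclaim the rest, which is the kind of scheduled housekeeping the talk describes.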


Enabling Research Collaboration and Data Management through Globus
Troy Axthelm, Advanced Research Computing Center, University of Wyoming​  | slides

Researchers using the Advanced Research Computing Center's (ARCC) Globus endpoint have transferred approximately 140 TB of data since April 2014. We highlight a use case where Globus has enabled researchers to share geographically separated resources, and a planned use of Globus to enable controlled access to sensitive data.

We describe a project to model the watershed of the entire Colorado River Basin. The computation for this project is done on the Mt. Moran cluster at the University of Wyoming and the Yellowstone supercomputer, while the majority of the data is stored at Utah State University. Globus has made it easy for these researchers to transfer data to and from the compute and storage resources. Using Globus, we will enable researchers to make their data available to collaborators in a secure and time-restricted manner.


Integrating Globus into the NCAR Research Data Archive User Services
Thomas Cram, Software Engineer, NCAR  | slides

The Research Data Archive (RDA; http://rda.ucar.edu) at the National Center for Atmospheric Research (NCAR) contains a large and diverse collection of meteorological and oceanographic observations, operational and climate reanalysis outputs, and remote sensing datasets to support atmospheric and geoscience research. The RDA contains more than 600 dataset collections that support the varying needs of a diverse user community. The number of RDA users is increasing annually, and the most popular method used to access the RDA data holdings is web-based download, via tools such as wget and cURL scripts. In 2014, 11,000 unique users downloaded more than 1.1 petabytes of data from the RDA, and customized data products were prepared for more than 45,000 user-driven requests.

To further support this increase in web download usage, the RDA is implementing the Globus data transfer service (www.globus.org) to provide a GridFTP data transfer option for the user community. The Globus service is broadly scalable, has an easy-to-install client, is sustainably supported, and provides a robust, efficient, and reliable data transfer option for RDA users. This presentation will highlight the technical functionality, challenges, and usefulness of the Globus data transfer service for providing user access to the RDA holdings.


Swift Parallel Scripting for Science, Engineering and Data Analysis
Mike Wilde, Senior Fellow, Computation Institute  | slides

Swift is a simple "little" parallel programming language created to ease the scripting of large-scale computing and data analysis tasks. It's an implicitly and pervasively parallel functional language designed to express workflows composed from serial or parallel programs. It has been used to script applications in disciplines ranging from protein structure prediction to modeling global agriculture to processing data from x-ray and neutron scattering experiments.

This talk will show how Swift can be used in tandem with Globus to empower your investigations through easier parallel computing. We'll describe new computing models that help conquer the complexity of high performance computer modeling and better integrate it into the scientific knowledge discovery process. We’ll also preview Parallel.Works, a new tech spinoff incubating at the University of Chicago’s new Chicago Innovation Exchange, which integrates the Swift parallel scripting language with the Galaxy web-based workflow user interface and with Globus information and authentication services.


Repository Planning at New York University
Himanshu Mistry, Manager, Data Services, New York University  | slides

New York University's IT and Libraries support research and data lifecycles that span from the collection and curation of data through the (cyclical) phases of preservation and discovery. This presentation addresses repository planning efforts at NYU from the perspective of current research trends and the potential challenges we face, especially with NYU's Spatial Data Repository.


CILogon and InCommon: Technical Update
Jim Basney, Senior Research Scientist, NCSA  | slides

Jim will provide a technical update on CILogon (cilogon.org) and InCommon (incommon.org), which enable federated authentication to Globus, XSEDE, and other research services. Topics include: 1) growing support for the Research and Scholarship Category in InCommon and around the world, 2) Identifier-Only Trust Assurance (IOTA) in the Interoperable Global Trust Federation (igtf.net), 3) obtaining X.509 server certificates from the InCommon IGTF Server CA, and 4) keeping current with security standards (e.g., OpenID Connect, SHA-2, TLS 1.2).


17:00—19:30

Join your peers for cocktails, "heavy" hors d'oeuvres, and casual conversation right after the conference sessions. We also encourage you to peruse the posters that some of our participants will have on display (see list below).


Enabling Research Collaboration and Data Management through Globus
Troy Axthelm, Advanced Research Computing Center, University of Wyoming​


Building Open Compute Systems Using Globus Identity
David Champion, Open Science Grid, University of Chicago


A Reproducible Framework Powered By Globus
Kyle Chard, Researcher, Computation Institute, University of Chicago


Integrating Globus into the NCAR Research Data Archive User Services
Thomas Cram, Software Engineer, NCAR


Integrating Globus and MapReduce for Out-of-computer Analysis of Peta-scale CFD Data
Maxwell Hutchinson, Computational Science Graduate Fellow, University of Chicago


Jetstream: A National Science and Engineering Cloud
Kurt Seiffert, Chief Storage Architect & Enterprise Architect, Indiana University


Implementing High Performance Computing with the Apache Big Data Stack: Experience with Harp
Judy Qiu, Assistant Professor of Computer Science, Indiana University


High Resolution Regional Climate Downscaled Data Transfer Using Globus
Jiali Wang, Postdoctoral Appointee, Environmental Science Division, Argonne


Swift Parallel Scripting for Science, Engineering and Data Analysis
Mike Wilde, Senior Fellow, Computation Institute


 
Wednesday, April 15
7:30—17:00 registration desk open
8:00—9:00 breakfast
TCS Foyer
9:00—10:30

Globus Network Manager and Use Cases
Christopher Mitchell, System Architect, & Todd Bowman, Network Architect, Los Alamos National Laboratory  | slides

The desire for higher-bandwidth file transfers, while maintaining an established security posture, has driven the development of enhanced intelligence in the network and a need to communicate, rather than infer, application expectations and state. In response, the Globus team is developing a new subsystem called Network Manager to provide an interface into Globus Connect Server that allows trusted network administration tools to automatically inquire about and manage data movement traffic. LANL is leveraging this capability to automatically make decisions about traffic routing within our HPC network and its interface to our border network. We will give a quick overview of Network Manager as well as an overview of the possible use cases where this capability can be leveraged in an HPC environment.


Of Mice and Elephants: The Science DMZ and You
Josh Sonstroem, System Administrator, University of California - Santa Cruz  | slides

This presentation will review what "elephant flows" are and how Science DMZ networks can help with them. What does it mean to have a campus Science DMZ for researchers, and how are third-party transfer tools used? What use are DTNs, and how do they help? The presentation will review basic principles of high-speed networking, elephant flows, and network design, then delve into the Science DMZ and the services currently offered on the UC Santa Cruz campus. A technical walk-through of using GridFTP and Globus, compared with the more traditional use of scp to transfer data to and from the DTN, will be provided.


Improving Scientific Outcomes at the APS with a Science DMZ
Jason Zurawski, Science Engagement Engineer, ESnet  | slides

Data mobility is an important part of scientific innovation. Instruments such as the beamlines at the Advanced Photon Source are generating an unprecedented amount of research data, with additional growth projected. This data often must be moved to remote storage and processing facilities after collection. To streamline the activity, the network infrastructure and data movement tools must be up to the challenge of transmission; when they are not, frustrated users turn to alternative means of sharing information. This talk will present the efforts of Argonne National Laboratory, the Advanced Photon Source, and ESnet in designing a networking solution to support science. The Science DMZ architecture removes friction and is transparent to the user, allowing services like Globus to work at peak efficiency.


Knowledge Lab, Data and Machine Enabled Science
Eamon Duede, Executive Director, Knowledge Lab, University of Chicago  | slides

Knowledge Lab seeks to leverage insights into the dynamics of knowledge creation, together with advances in large-scale computation, to reimagine the scientific processes of the future: identifying gaps in the global knowledge landscape and areas of rich potential for breakthroughs, and automating discovery through the generation of novel, potentially high-impact hypotheses.

This talk will explore ways in which Knowledge Lab is leveraging Big Data migration, storage, and analysis to accelerate scientific progress by conceiving of and implementing revolutionary computational approaches to reading, reasoning, and hypothesis design that transcend the capacity of individual researchers and traditional teams. Because Knowledge Lab leads a decentralized network of more than 40 field-leading scientists, mathematicians, engineers, and scholars, we rely on the "centralizing" power of cloud and distributed computational resources.

In particular, we will highlight a recent success moving two corpora from the Open Science Grid (OSG) to Amazon Simple Storage Service (S3) using Globus S3 support. Transfers to AWS S3 are used to facilitate multi-institutional research, to create analysis pipelines between S3 and various computational services (EC2, RDS, EMR) in the AWS cloud, and for redundancy. The first transfer was the complete full text and all associated metadata of the entire extant IEEE corpus: 6,134,400 files and directories totaling 2.59 TB. Most files were less than 1 MB in size, so this transfer highlights many of the file transfer optimizations available only through Globus. Previous attempts to transfer this corpus from OSG via rsync and parallelized rsyncs failed because the projected transfers would have lasted on the order of several months. The second corpus is the entire English-language Wikipedia with complete revision history (back to Wikipedia's inception): 173 files totaling 12 TB.


Building Open Compute Systems Using Globus Identity
David Champion, Open Science Grid, University of Chicago  | slides

The Open Science Grid (OSG) is an open national cyberinfrastructure of research computing centers comprising 124 sites that make CPU time available to users via opportunistic sharing of resources. To connect eligible users to available resources, OSG has created a federated login and job submission service, OSG Connect. Secure access (authenticated, authorized, and auditable) is achieved by leveraging the Globus Nexus identity platform as the anchor of a distributed trust model for users and projects. Nexus identities and Nexus groups are mapped within the service environment onto UNIX users and groups. Access to CPU and storage resources is governed by memberships within the Globus group management framework through a combination of OAuth, Pluggable Authentication Modules (PAM), and reflection of Nexus identities and roles into a UNIX directory service (NIS). This gives Principal Investigators and other project leaders a delegated authority role so that they can make "on the ground" access decisions.


Optimizing Globus File Transfer with Metadata-defined Virtual Collections
Martin Margo, Nirvana Lead Developer, General Atomics  | slides

One tedious, but critical, part of the scientific research process is organizing and sending raw data for third-party verification, publication, and further analysis by collaborators. Typically this process is done manually and iteratively. With the General Atomics Nirvana metadata tagging and management system, scientists can easily automate grouping and distributing data files based on user-friendly logical rules. Beyond simple identifiers like filename, date, and owner, Nirvana's user-defined, application-specific metadata can be used to identify research data files that need to be transferred between collaborators, using highly targeted domain-specific attributes such as project, subject, study, data source, latitude, longitude, or any other domain-specific identifier. Nirvana can make this grouping, called a Virtual Collection, real-time, persistent, reusable, and sharable. This presentation will show how Nirvana queries domain-specific metadata of research data that can be stored across multiple sources in an institution. Nirvana identifies the files that need to be transferred by Globus, and then sends them to a Globus endpoint for fast, reliable, secure file transfer between collaborators. Nirvana ensures that the content of a virtual collection is complete and accurate, contains only the data meant to be shared, excludes sensitive data, and is filtered by appropriate user access rights. Scientists now have a new solution, Nirvana as the grouping tool and Globus as the data transfer tool, to eliminate this housekeeping step and focus more on the research task at hand.
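The metadata-driven selection described above can be illustrated with a generic sketch: files carry key-value metadata, a rule selects a subset, and the resulting "virtual collection" is the file list that would be handed to Globus for transfer. The names below are illustrative, not Nirvana's actual API.

```python
def virtual_collection(catalog, rule):
    """Select file paths whose metadata satisfies every key-value pair in `rule`.

    `catalog` maps path -> metadata dict; `rule` is the user-defined
    selection, e.g. {"project": "fusion", "subject": "plasma-edge"}.
    Returns a sorted, reusable list of matching paths.
    """
    return sorted(
        path
        for path, meta in catalog.items()
        if all(meta.get(k) == v for k, v in rule.items())
    )
```

Because the rule, not a hand-maintained file list, defines the collection, re-running the same rule later picks up newly tagged files automatically, which is what makes the grouping persistent and reusable.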


The iRODS-GridFTP DSI: a Nucleus for Collaboration
Dan Bedard, iRODS Market Development Manager, RENCI  | slides

The integrated Rule-Oriented Data System (iRODS) is highly customizable data grid infrastructure software in use at hundreds of commercial and academic research institutions worldwide. An iRODS-GridFTP interface was developed several years ago to transfer data between iRODS Zones and Globus installations. This software has been adapted as a Globus Data Storage Interface (DSI) for use in the EUDAT project, with support for iRODS through version 3.3.1. Now, the National Data Service, coordinating with Globus and the iRODS Consortium, is developing a DSI that is compatible with iRODS 4.0 and beyond, which will be maintained and tested by the iRODS Consortium in perpetuity. This lightning talk will describe the upcoming collaborative activities behind the iRODS-GridFTP DSI and the ways the participants are encouraging further collaboration between iRODS, Globus, and other data-centric communities.


Simple Scripting with the Globus API
Dan Powers, Customer Support Engineer, University of Chicago, Globus  | slides

We will illustrate how the Globus API may be used in simple scripts to automate data management tasks, using file replication as an example.
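As a rough illustration of the kind of script the talk describes, the snippet below prepares a replication job by building one transfer document per destination endpoint. The field names follow the JSON conventions of the Globus Transfer API, but authentication and submission are omitted, so treat this as a sketch of the replication logic rather than a working client.

```python
def replication_tasks(source_endpoint, dest_endpoints, path):
    """Build one Globus-style transfer document per destination endpoint,
    replicating `path` from the source to every mirror.

    Each document would be submitted as a separate transfer task;
    sync_level 3 requests checksum-based sync, so files that are
    already identical at the destination are skipped.
    """
    return [
        {
            "DATA_TYPE": "transfer",
            "source_endpoint": source_endpoint,
            "destination_endpoint": dest,
            "sync_level": 3,
            "DATA": [
                {
                    "DATA_TYPE": "transfer_item",
                    "source_path": path,
                    "destination_path": path,
                    "recursive": True,
                }
            ],
        }
        for dest in dest_endpoints
    ]
```

Run from cron, a script like this keeps several endpoints in sync with minimal effort, since the sync level makes repeated runs cheap.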


10:30—11:00 beverage break
TCS Foyer
11:00—12:30

Globus Galaxies: Research Data Management and Analysis
Ravi Madduri, Fellow, Computation Institute, Argonne National Laboratory  | slides

This presentation will describe how we extended the Galaxy workflow system with Globus capabilities and adapted it to provide scalable, flexible analysis pipelines for research into genomics, climate change, crop modeling, cosmology, and materials science.


Compute Canada and Globus Data Publication
Jason Hlady, HPC Analyst and Coordinator, University of Saskatchewan, Compute Canada  | slides

Following the previous presentation describing Compute Canada's deployment of Globus for file transfer and sharing, in this talk we will discuss how Compute Canada is working with Globus to deliver data publication and discovery capabilities to member institutions.



A Reproducible Framework Powered By Globus
Kyle Chard, Researcher, Computation Institute, University of Chicago  | slides

Reproducibility is the cornerstone of the scientific process, establishing its veracity and enabling self-correction. Computational reproducibility, however, continues to be hard to establish, especially given the complexity of software environments, data manipulation techniques, and distributed software stacks. In this talk, we will describe an emerging reproducibility framework consisting of user-friendly tools that audit process and data flows to produce sciunits: self-contained units of scientific activity that a user can later use for either publishing or reanalysis. We will present how elements of sciunits are persisted and annotated using the Globus Catalog, and how sciunits are published using Globus' data publication capabilities. We will outline our intended use of Globus transfer and sharing for making sciunits accessible to a broader community. Finally, we will showcase how sciunits are enabling reproducibility in the geoscience domains of seismology, hydrology, and space science, and in the broader geoscience community through the NSF-sponsored EarthCube initiative.


Moving data with Globus at Fermilab HPPC Facilities
Yujun Wu, Storage Specialist, Fermilab  | slides

Globus has been used for data transfer services for USQCD and cosmology researchers at the Fermilab High Performance Parallel Computing Facilities for several years. We have set up three Globus endpoints, two with a Lustre filesystem backend and one with our dCache/Enstore tape backend. We will describe our setup and experience using Globus to help users move data. We will also discuss some specific challenges we encountered in utilizing Globus and the possible future impact of solving them.


Fostering Reproducible Science with Globus
Maxwell Hutchinson, Computational Science Graduate Fellow, University of Chicago  | slides

In theory, computational science should be the gold standard for reproducibility, owing to precisely defined methodology, general-purpose equipment, and discrete data. In reality, the scale of bleeding-edge computations, in both time and space, has made intermediate results inaccessible and reproduction cost-prohibitive. We address the space component of this problem by building a post-processing and visualization workflow around Globus. The workflow is a complete, unambiguous statement of methodology that provides direct access to raw data. Using it, third parties can validate our results, which strengthens our conclusions, or perturb our analysis, which broadens the impact of our data. We demonstrate the workflow by backing journal-style figures with scripts that will reproduce their contents locally.


Integrating Globus and MapReduce for Out-of-computer Analysis of Peta-scale CFD Data
Maxwell Hutchinson, Computational Science Graduate Fellow, University of Chicago  | slides

We have integrated Globus into a MapReduce post-processor for fluid dynamics data to dramatically reduce minimum local disk capacity requirements. Globus-enabled out-of-computer post-processing makes peta-scale data sets available to researchers without peta-scale file systems. When a mapper requests a file, a Globus transfer is spawned, and when it is done, the file is discarded. Disk requirements scale with the single-file size of the data set, which in turn is related to the per-processor memory available in the originating simulation. Touch-once use cases, such as independent validation, receive a performance boost by interleaving the remote data transfer with local analysis.
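The touch-once pattern described above, fetch a file, analyze it, discard it before fetching the next, can be sketched generically. Here `fetch` stands in for a Globus transfer and is supplied by the caller, so the function itself is storage-agnostic; this is an illustration of the pattern, not the authors' actual post-processor.

```python
import os

def map_over_remote_files(remote_paths, fetch, analyze):
    """Apply `analyze` to each remote file while holding at most one
    local copy at a time, so disk use scales with single-file size
    rather than total dataset size.

    `fetch(remote_path)` must download the file and return its local
    path; in a real deployment it would spawn a Globus transfer and
    block until completion.
    """
    results = []
    for rpath in remote_paths:
        local = fetch(rpath)        # blocking transfer of one file
        try:
            results.append(analyze(local))
        finally:
            os.remove(local)        # discard immediately: touch-once
    return results
```

A production version would prefetch the next file in a background thread while analyzing the current one, which is the transfer/analysis interleaving the abstract credits for the performance boost.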


Globus Data Publication as a Case Study in Globus Integration
Jim Pruyne, Software Developer, University of Chicago, Globus  | slides

Data publication and discovery are the newest features of the Globus research data management service. They build upon core Globus functionality including authentication, group management, and file transfer. We discuss how these capabilities are used by Globus data publication, with the intent of conveying insights into the functionality of Globus data publication, and as a case study in how other systems may utilize core Globus services.


12:30—13:30
TCS Foyer
lunch
13:30—14:00 Moderated by: John Towns, XSEDE PI and Project Director

We will host an open discussion on what is planned for XSEDE 2.0. This is a rare opportunity to provide feedback on the development of the proposal for the continuation of the XSEDE project. We will outline our current plans and facilitate an open discussion about how XSEDE could be adapted to bring more value to stakeholders of Globus technologies.


14:00—15:00 Moderated by: Steve Tuecke, University of Chicago, Globus team

We invite campus computing administrators (and all Globus users) to an open roundtable discussion. We would like to gather feedback on Globus usage and solicit input on new feature requests. This will be a unique opportunity to meet and engage with many Globus team members in an informal setting. Bring your toughest questions!

If you're planning to attend, you may want to browse our new feature request forum and select features that are of particular interest to consider in the discussion. We also have a working document for this session.


Moderated by: Vas Vasiliadis, University of Chicago, Globus team


15:00—15:15
TCS Foyer
beverage break
15:15—16:00 Moderated by: Steve Tuecke, University of Chicago, Globus team

This session is targeted at attendees from institutions participating in the Globus data publication pilot program. We will be discussing the experiences of these early users and soliciting feedback to further refine our product roadmap. The session is open to all GlobusWorld attendees who want to hear how their peers are using Globus data publication and explore different use cases for the service.


16:00 conference adjourns
 



 

Why Attend?

  • See how to easily provide Globus services on existing storage systems
  • Hear how others are using Globus
  • Learn from the experts about the best ways to apply Globus technologies
  • Connect with other researchers, admins and developers

Gold Sponsors

OrangeFS Spectra Logic
