The Modern Research Data Portal: a design pattern for networked, data-intensive science

PeerJ Computer Science

Main article text

 

Introduction

The research data portal

The MRDP design pattern

  1. A portal server (a web server like any other) that handles data search and access, the mapping between users and datasets, and other web services tasks;

  2. A high-performance network enclave that connects large-scale data servers directly to high-performance networks (we use the Science DMZ as an example here); and

  3. A reliable, high-performance external data management service with authentication and other primitives based on standard web APIs (we use Globus as an example here).
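
The division of labor among these three components can be illustrated with a minimal sketch. The catalog, dataset names, paths, endpoint placeholders, and the `plan_transfer` helper are all hypothetical and are not taken from the reference implementation. The sketch only shows the control-plane role of the portal server: it resolves a request to an (endpoint, path) pair and hands that description to the external data management service, so the data bytes never pass through the web server.

```python
from dataclasses import dataclass

# Hypothetical values for illustration only.
SOURCE_ENDPOINT = 'SOURCE-ENDPOINT-UUID'   # endpoint on a DTN in the Science DMZ
CATALOG = {                                # dataset ID -> path on that endpoint
    'climate-2016': '/share/godata/climate-2016/',
    'genomes-v2': '/share/godata/genomes-v2/',
}


@dataclass(frozen=True)
class TransferPlan:
    """Everything the external data service needs; no file contents."""
    source_endpoint: str
    source_path: str
    dest_endpoint: str
    dest_path: str


def plan_transfer(dataset_id, dest_endpoint, dest_path):
    """Resolve a dataset request into a transfer description."""
    try:
        source_path = CATALOG[dataset_id]
    except KeyError:
        raise ValueError('unknown dataset: ' + dataset_id) from None
    return TransferPlan(SOURCE_ENDPOINT, source_path, dest_endpoint, dest_path)


plan = plan_transfer('climate-2016', 'DEST-ENDPOINT-UUID', '/~/incoming/')
print(plan)
```

In a real portal the resulting description would be submitted to the data management service (Globus, in this article's example), which moves the data between DTNs.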

Science DMZ and DTNs

Globus services

The design pattern in practice

Variants of the basic pattern

A reference MRDP implementation

  • A complete, working portal server, implemented with the Python Flask framework, that comprises a web service and a web interface and uses Globus APIs to outsource data transfer, authentication, and authorization functions.

  • Integration with Globus Auth for authentication and authorization.

  • Integration with Globus Transfer for browsing and downloading datasets.

  • Use of a decoupled Globus endpoint for serving data securely via HTTP or GridFTP.

  • An independent analysis service, accessed via a REST API, to demonstrate how a data portal can outsource specific functionality securely.
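
As a rough illustration of the last item, the sketch below builds an authenticated REST request of the kind a portal might send to an external analysis service. It assumes the service accepts an OAuth 2 bearer token in the `Authorization` header and a JSON body; the service URL, token value, payload fields, and the `build_service_request` helper are hypothetical. The request is constructed but not sent.

```python
import json
import urllib.request


def build_service_request(service_url, access_token, payload):
    """Build (but do not send) an authenticated POST to an external service."""
    body = json.dumps(payload).encode('utf-8')
    return urllib.request.Request(
        service_url,
        data=body,
        method='POST',
        headers={
            # The portal presents a token issued for this service,
            # never the user's own password or credentials.
            'Authorization': 'Bearer ' + access_token,
            'Content-Type': 'application/json',
        },
    )


req = build_service_request(
    'https://analysis.example.org/api/analyze',   # hypothetical URL
    'EXAMPLE-ACCESS-TOKEN',                       # placeholder token
    {'datasets': ['dataset-1'], 'year': 2016},    # hypothetical payload
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req)
```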

Overview of key points

Diving into code

Endpoints

Identities and credentials

The rdp function

 
 
rdp('ddb59aef-6d04-11e5-ba46-22000b92c6ec',
    '~/share/godata/',
    'jane@uni.edu')


Listing 1: Globus code to implement the MRDP design pattern
_________________________________________________________________
 
from globus_sdk import TransferClient, TransferData
from globus_sdk import AuthClient
import uuid

def rdp(host_id,     # Host endpoint on which to create the shared endpoint
        source_path, # Directory to copy data from
        email):      # Email address to share with
    # Assumes credentials are already configured for the SDK clients
    tc = TransferClient()
    ac = AuthClient()
    tc.endpoint_autoactivate(host_id)

    # (1) Create shared endpoint:
    # (a) Create directory to be shared
    share_path = '/~/' + str(uuid.uuid4()) + '/'
    tc.operation_mkdir(host_id, path=share_path)
    # (b) Create shared endpoint on directory
    shared_ep_data = {
      'DATA_TYPE': 'shared_endpoint',
      'host_endpoint': host_id,
      'host_path': share_path,
      'display_name': 'RDP shared endpoint',
      'description': 'RDP shared endpoint'
    }
    r = tc.create_shared_endpoint(shared_ep_data)
    share_id = r['id']

    # (2) Copy data into the shared endpoint
    tc.endpoint_autoactivate(share_id)
    tdata = TransferData(tc, host_id, share_id,
        label='RDP copy', sync_level='checksum')
    tdata.add_item(source_path, '/', recursive=True)
    r = tc.submit_transfer(tdata)
    tc.task_wait(r['task_id'], timeout=1000,
                 polling_interval=10)

    # (3) Enable access by user
    r = ac.get_identities(usernames=email)
    user_id = r['identities'][0]['id']
    rule_data = {
      'DATA_TYPE': 'access',
      'principal_type': 'identity', # Grantee is
      'principal': user_id,         #  a user.
      'path': '/',                  # Path is /
      'permissions': 'r',           # Read-only
      'notify_email': email,        # Email invite
      'notify_message':             # Invite msg
           'Requested data are available.'
    }
    tc.add_endpoint_acl_rule(share_id, rule_data)

    # (4) Ultimately, delete the shared endpoint
    # (shown inline for completeness; in practice this runs
    #  only after the user has retrieved the data)
    tc.delete_endpoint(share_id)
_________________________________________________________________
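
Step (2) of Listing 1 calls `task_wait` but does not inspect its result, so access would be granted even if the copy timed out or failed. The following is a minimal sketch of a stricter wait, not part of the reference implementation. It assumes that `tc` behaves like `globus_sdk.TransferClient`, whose `task_wait` returns a boolean and whose `get_task` response carries a `status` field. The `wait_for_transfer` helper and the `FakeTransferClient` stand-in are hypothetical names; the stand-in exists only so the sketch runs without network access.

```python
def wait_for_transfer(tc, task_id, timeout=1000, polling_interval=10):
    """Block until the transfer task ends; raise unless it succeeded.

    `tc` is expected to behave like globus_sdk.TransferClient.
    """
    finished = tc.task_wait(task_id, timeout=timeout,
                            polling_interval=polling_interval)
    if not finished:
        raise TimeoutError('transfer %s still running after %d s'
                           % (task_id, timeout))
    status = tc.get_task(task_id)['status']
    if status != 'SUCCEEDED':
        raise RuntimeError('transfer %s ended with status %s'
                           % (task_id, status))
    return status


class FakeTransferClient:
    """Stand-in used only to exercise the helper offline."""

    def __init__(self, finished, status):
        self._finished = finished
        self._status = status

    def task_wait(self, task_id, timeout=10, polling_interval=10):
        return self._finished

    def get_task(self, task_id):
        return {'status': self._status}


print(wait_for_transfer(FakeTransferClient(True, 'SUCCEEDED'), 'task-1'))
```

In `rdp`, the call `tc.task_wait(r['task_id'], ...)` could be replaced by `wait_for_transfer(tc, r['task_id'])` so that step (3) runs only after a successful copy.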

Data transfer

Web and command line interfaces

Completing the MRDP portal server

 
 
@app.route('/login', methods=['GET'])
def login():
    """Send the user to Globus Auth."""
    return redirect(url_for('authcallback'))
 
 
Listing 2: The authcallback function interacts with Globus Auth to obtain access tokens for the server.
________________________________________________________________________________

@app.route('/authcallback', methods=['GET'])
def authcallback():
  # Handles the interaction with Globus Auth
  # Set up our Globus Auth/OAuth 2 state
  redirect_uri = url_for('authcallback', _external=True)

  client = load_portal_client()
  client.oauth2_start_flow_authorization_code(redirect_uri, refresh_tokens=True)

  # If no "code" parameter, we are starting
  # the Globus Auth login flow
  if 'code' not in request.args:
    auth_uri = client.oauth2_get_authorize_url()
    return redirect(auth_uri)
  else:
    # If we have a "code" param, we're coming
    # back from Globus Auth and can exchange
    # the auth code for access tokens.
    code = request.args.get('code')
    tokens = client.oauth2_exchange_code_for_tokens(code)

    id_token = tokens.decode_id_token(client)
    session.update(
      tokens=tokens.by_resource_server,
      is_authenticated=True,
      name=id_token.get('name', ''),
      email=id_token.get('email', ''),
      project=id_token.get('project', ''),
      primary_username=id_token.get('preferred_username'),
      primary_identity=id_token.get('sub'),
    )

    return redirect(url_for('transfer'))
________________________________________________________________________________
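
Listing 2 stores `tokens.by_resource_server` in the session. The sketch below shows one way later request handlers might read a token back out. It assumes the stored value is a dictionary keyed by resource server name, with each entry holding at least `access_token` and `expires_at_seconds`; the `get_access_token` helper and the example token value are hypothetical.

```python
import time


def get_access_token(tokens, resource_server, now=None):
    """Return a non-expired access token for `resource_server`, else None."""
    entry = tokens.get(resource_server)
    if entry is None:
        return None
    now = time.time() if now is None else now
    if entry['expires_at_seconds'] <= now:
        # Caller should refresh the token or send the user back to /login.
        return None
    return entry['access_token']


# Example data in the assumed shape (values are placeholders).
example_tokens = {
    'transfer.api.globus.org': {
        'access_token': 'EXAMPLE-TRANSFER-TOKEN',
        'expires_at_seconds': 2000,
    },
}
print(get_access_token(example_tokens, 'transfer.api.globus.org', now=1000))
```

A handler would typically pass `session['tokens']` as the first argument and use the returned token to authorize calls to the corresponding service.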
 
 
Listing 3: The transfer() function from the web server reference implementation.
________________________________________________________________________________

@app.route('/transfer', methods=['GET', 'POST'])
@authenticated
def transfer():
  if request.method == 'GET':
    return render_template('transfer.jinja2', datasets=datasets)

  if request.method == 'POST':
    # Check that file(s) have been selected for transfer
    if not request.form.get('dataset'):
      flash('Please select at least one dataset.')
      return redirect(url_for('transfer'))

    params = {
      'method': 'POST',
      'action': url_for('submit_transfer', _external=True, _scheme='https'),
      'filelimit': 0,
      'folderlimit': 1
    }

    # Parentheses allow the expression to span two lines
    browse_endpoint = (
      'https://www.globus.org/app/browse-endpoint?{}'.format(urlencode(params)))

    # Save submitted form to session
    session['form'] = {
      'datasets': request.form.getlist('dataset')
    }

    # Send to Globus to select a destination endpoint using
    # the Browse Endpoint helper page.
    return redirect(browse_endpoint)
________________________________________________________________________________
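
The URL construction in Listing 3 can be tried in isolation using only the standard library. This is a sketch under stated assumptions: the `browse_endpoint_url` helper and the callback URL `https://portal.example.org/submit-transfer` are hypothetical, while the base URL and the four parameters are those shown in the listing. With `filelimit` set to 0 and `folderlimit` set to 1, the intent appears to be that the user picks a single destination folder and no files.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base URL as used in Listing 3.
BROWSE_ENDPOINT_URL = 'https://www.globus.org/app/browse-endpoint'


def browse_endpoint_url(action_url, filelimit=0, folderlimit=1):
    """Build the Browse Endpoint helper page URL for a given callback."""
    params = {
        'method': 'POST',        # how the helper page returns the selection
        'action': action_url,    # where the selection is sent
        'filelimit': filelimit,
        'folderlimit': folderlimit,
    }
    return BROWSE_ENDPOINT_URL + '?' + urlencode(params)


url = browse_endpoint_url('https://portal.example.org/submit-transfer')
print(url)

# Round-trip check: the callback URL survives percent-encoding.
query = parse_qs(urlsplit(url).query)
print(query['action'][0])
```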

Invoking other services

Examples of the MRDP design pattern

The NCAR Research Data Archive

Sanger imputation service

Petrel, a user-managed data sharing portal

Scalable data publication

Data delivery at Advanced Photon Source

Evaluation of MRDP adoption

Summary

Additional Information and Declarations

Competing Interests

Ian Foster is an Advisor and Academic Editor for PeerJ Computer Science.

Author Contributions

Kyle Chard and Ian Foster conceived and designed the experiments, analyzed the data, contributed reagents/materials/analysis tools, wrote the paper, prepared figures and/or tables, performed the computation work, reviewed drafts of the paper.

Eli Dart conceived and designed the experiments, analyzed the data, wrote the paper, prepared figures and/or tables, reviewed drafts of the paper.

David Shifflett and Jason Williams performed the experiments, performed the computation work, reviewed drafts of the paper.

Steven Tuecke conceived and designed the experiments, analyzed the data, wrote the paper, prepared figures and/or tables, performed the computation work, reviewed drafts of the paper.

Data Availability

The following information was supplied regarding data availability:

The companion web site, http://docs.globus.org/mrdp, provides references to GitHub for associated code.

GitHub: the code is available at https://github.com/globus/globus-sample-data-portal.

Funding

This work was supported by the United States National Science Foundation (ACI-1148484) and Department of Energy’s Office of Advanced Scientific Computing Research (DE-AC02-06CH11357). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

