
Live Blogging the 31st Massive Storage Systems and Technology Conference (MSST 2015), June 2015, Santa Clara University

Blog Post created by Matthew O'Keefe on Jun 2, 2015

I'm going to live blog the Massive Storage conference today and tomorrow; I hope you find it useful.

The URL for the conference is here:

 

Here are my notes on Ian Corner's keynote:

 

 

Ian Corner,  Design and Implementation Lead for CSIRO's Research Cloud

 

 

CSIRO is Australia’s national research organization; Ian runs its hybrid cloud project, called BOWEN

 

Ian described CSIRO’s science mission, which includes leading science and research work of national
importance. His job is to provide computing infrastructure that supports research workflows; the work is
practically focused and tightly integrated with the business side.

 

Data Intensive Research: Optimized High Performance Workflows

 

Key point: yesterday’s collections were physical; today’s collections are digital and measured in
the field, with sensors proliferating.

 

Four years ago: 89 PB was heading toward CSIRO, with 4,000 research projects running in parallel

 

Enterprise infrastructure is too expensive and too high-latency for data-intensive research; also:

 

why we captured data had to change as the process matured; don’t just gather information for the sake of gathering it

 

data without context

disconnected from compute

using enterprise infrastructure

environment of reduced funding

 

Had to look at better ways to do things

 

Solutions: create boundaries to put different projects in; draw a clear line between data and
infrastructure so that people knew things were being managed end-to-end; give datasets permanent and unique names.
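
As a rough illustration of the permanent-naming idea, here is a minimal sketch; the talk did not describe CSIRO's actual naming scheme, so the ID format and fields below are assumptions for illustration only.

```python
import uuid
from datetime import date

# Hypothetical naming scheme; the keynote did not specify CSIRO's actual format.
# The goal is only that a dataset keeps one immutable identifier for life,
# independent of whatever infrastructure it currently lives on.
def mint_dataset_name(project: str, collected: date) -> str:
    """Return a permanent, globally unique dataset name."""
    return f"{project}/{collected.isoformat()}/{uuid.uuid4()}"

print(mint_dataset_name("soil-moisture-survey", date(2015, 6, 2)))
```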

 

Established relationships, boundaries, and a framework for four key types of players:

 

owners

domain specialists

users

consumers

 

Put subsets of data into categories so the dataset owner can communicate what she needs, e.g.:

performance

reproducibility

Looking down, categories map to infrastructure;

looking up, a category gives scientists an idea of what to expect from it.

 

A pre-QA category ensures resources aren’t wasted on white noise.
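
To make the "looking down / looking up" idea concrete, here is a minimal sketch. The category names, storage tiers, and expectations are assumptions for illustration; the keynote did not enumerate CSIRO's actual scheme.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataCategory:
    name: str
    storage_tier: str   # looking down: the infrastructure that backs the category
    expectation: str    # looking up: what a scientist can expect from it

# Hypothetical categories; CSIRO's real names and mappings were not given in the talk.
CATEGORIES = {
    "performance": DataCategory("performance", "flash / parallel filesystem",
                                "low-latency access for active workflows"),
    "reproducibility": DataCategory("reproducibility", "replicated object store",
                                    "immutable, provenance-tracked copies"),
    "pre-qa": DataCategory("pre-qa", "cheap bulk disk",
                           "unvalidated data that may turn out to be white noise"),
}

def tier_for(category: str) -> str:
    """Map a dataset's declared category to the storage that backs it."""
    return CATEGORIES[category].storage_tier
```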

 

Big data needs to be consumable;

 

“A scientific application is a peer-reviewed workflow”: consider Google Maps as a scientific workflow
with a consumer interface.

 

Laid the foundation for delivery of integrated workflows

 

Goal is to have peer reviewed workflows easily mapped into future workflows

 

Big data needs to be reproducible: lay the foundation for provenance. A big problem today is that only
20% of research results can be reproduced; provenance mechanisms are needed to track data from creation/ingest
through initial processing, later processing, storage, etc.
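
To make the provenance point concrete, here is a minimal hash-chained record sketch. It is an assumption about how such tracking might look, not CSIRO's actual mechanism; the step names and fields are invented for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone

# Each step in a workflow records what it did, what it consumed, and a hash
# linking it to the previous step, so any result can be traced back to ingest.
def provenance_step(previous_hash: str, action: str, inputs: list) -> dict:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action,            # e.g. "ingest", "initial-processing", "archive"
        "inputs": inputs,            # dataset names consumed by this step
        "previous": previous_hash,   # chains the record back toward creation
    }
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record

# Example chain: ingest -> initial processing -> archive
ingest = provenance_step("", "ingest", ["sensor-run-2015-06"])
processed = provenance_step(ingest["hash"], "initial-processing", ["sensor-run-2015-06"])
archived = provenance_step(processed["hash"], "archive", [])
```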

 

Summary Requirements:

Bottom line: need to be able to run research from 50 years ago and reproduce results

 

Fit-for-purpose infrastructure is needed:

not all data is equal

unstructured data requires boundaries

infrastructure will come and go — treadmill of 3 to 5 year technology changes

data must be preserved

workflow must be optimized

provenance must be established and maintained

 

Deep need to understand the scale of the data and how it affects the infrastructure

To deal with scale:

1. tight coupling between data size and compute available

2. avoid orphaning data, have sufficient scale for compute, share data and results, accelerate workflow

3. need compute, data, and applications to be within dark-fibre distance of each other; avoid
copying data, recompiles, and other friction against the workflow

4. Low latency non-blocking infrastructure: should just be a given

5. Infrastructure should be dynamic: don’t ask scientists infrastructure questions they can’t

answer

6. Split technology from workflow and abstract the brands; below the line is fit-for-purpose infrastructure,
above the line is business results and workflow issues; need to right-size the infrastructure at the line,
using scale-up, scale-out, and scale-down as necessary

7. Need to design-in the ability to right-size the project via the framework

8. Need to put data on the right kind of storage, integrate data protection into the framework

9. Data management practices: size, time to reproduce, and frequency of use should be considered
when deciding where to place data, while accelerating workflows (see the placement sketch after this list)
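
The placement sketch mentioned in item 9: a toy rule that weighs size, time to reproduce, and frequency of use. The thresholds and tier names are invented for illustration; the keynote gave no concrete numbers or policy.

```python
# Toy placement rule; thresholds and tier names are assumptions, not CSIRO policy.
def storage_tier(size_tb: float, reproduce_hours: float, accesses_per_month: float) -> str:
    if accesses_per_month >= 10:
        # Hot data sits close to compute so the workflow isn't slowed down.
        return "fast tier (flash / parallel filesystem)"
    if reproduce_hours > 100 or size_tb > 50:
        # Expensive or slow to regenerate: protect it rather than recompute it.
        return "replicated archive (object store or tape)"
    # Cheap to reproduce and rarely touched: don't pay to keep it hot.
    return "bulk disk, reviewed for deletion when the project ends"

print(storage_tier(size_tb=80, reproduce_hours=5, accesses_per_month=1))
```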

 

BOWEN Research Cloud (previously STACC)

Plan from 2016 to 2020: Global data mirrors, Virtual labs and sic clouds, National Data Hub

Australia collaborates globally; put the data in place so that people across the globe can collaborate
easily, and make it the Google Play of the peer-reviewed workspace.

 

Big data is no good if it:

 

is unable to speak

only repeats one story

cannot repeat the same story twice

speaks so slowly that the message is lost

cannot perform in an

Outcomes