A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research

  1. A National Big Data Cyberinfrastructure Supporting Computational Biomedical Research. Invited Presentation, Symposium on Computational Biology and Bioinformatics: Remembering John Wooley, National Institutes of Health, Bethesda, MD, July 29, 2016. Dr. Larry Smarr, Director, California Institute for Telecommunications and Information Technology; Harry E. Gruber Professor, Dept. of Computer Science and Engineering, Jacobs School of Engineering, UCSD. http://lsmarr.calit2.net
  2. John Wooley Drove Supercomputing for Biological Sciences
  3. John Wooley Was a Scientific Founder of Calit2 (www.calit2.net): 220 UCSD & UCI Faculty Working in Multidisciplinary Teams With Students, Industry, and the Community; the State Provides $100M for New Buildings and Equipment. John Wooley Was the UCSD Layer Leader for DeGeM. LS Slide 2001
  4. NSF's OptIPuter Project: Using Supernetworks to Meet the Needs of Data-Intensive Researchers. OptIPortal: Termination Device for the OptIPuter Global Backplane. Leads: Calit2 (UCSD, UCI), SDSC, and UIC; Larry Smarr, PI. Univ. Partners: NCSA, USC, SDSU, NW, TA&M, UvA, SARA, KISTI, AIST. Industry: IBM, Sun, Telcordia, Chiaro, Calient, Glimmerglass, Lucent. 2003-2009, $13,500,000. Biomedical Big Data as Application Driver: Mark Ellisman, co-PI
  5. The OptIPuter LambdaGrid Is Rapidly Expanding. [Network map: 1 GE and 10 GE lambdas linking UCSD, SDSU, UCI, ISI, and NASA JPL via the CENIC Los Angeles and San Diego GigaPOPs and CalREN-XD; NASA Ames and NASA Goddard via NLR; PNWGP Seattle via CAVEwave/NLR; StarLight Chicago (UIC EVL, NU); NetherLight Amsterdam (U Amsterdam); and CICESE via CUDI over the shared CENIC/Abilene network.] Source: Greg Hidley, Aaron Chin, Calit2. LS Slide 2005
  6. CAMERA Announced January 17, 2006: $24.5M Over Seven Years. PI: Larry Smarr; Paul Gilna, Ex. Dir. John Wooley Was a CAMERA co-PI & Chief Science Officer
  7. Calit2 Microbial Metagenomics Cluster: Next-Generation Optically Linked Science Data Server. 512 Processors, ~5 Teraflops; ~200 Terabytes Sun X4500 Storage; 1GbE and 10GbE Switched/Routed Core. Source: Phil Papadopoulos, SDSC, Calit2
  8. The CAMERA Project Established a Global Marine Microbial Metagenomics Cyber-Community: the Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (http://camera.calit2.net/), with 4000 Registered Users From Over 80 Countries
  9. Determining the Protein Structures of the Thermophilic Thermotoga maritima Genome: Life at 80° C! Extremely Thermostable, Useful for Many Industrial Processes (e.g., Chemical and Food). 173 Structures, 122 Solved by JCSG (75 Unique in the PDB): Direct Structural Coverage of 25% of the Expressed Soluble Proteins, Probably the Highest Structural Coverage of Any Organism. Source: John Wooley, JCSG Bioinformatics Core Project Director, UCSD. LS Slide 2005
  10. John Wooley Organized a Series of International Workshops on Metagenomics and Thermotoga at Calit2
  11. Academic Research OptIPlanet Collaboratory: A 10Gbps End-to-End Lightpath Cloud Linking, Over the National LambdaRail and Campus Optical Switches, Data Repositories & Clusters, HPC, HD/4k Video Repositories, HD/4k Live Video, Local or Remote Instruments, and End-User OptIPortals via 10G Lightpaths. LS 2009 Slide
  12. So Why Don't We Have a National Big Data Cyberinfrastructure? "Research is being stalled by information overload," Mr. Bement said, because data from digital instruments are piling up far faster than researchers can study them. In particular, he said, campus networks need to be improved: high-speed data lines crossing the nation are the equivalent of six-lane superhighways, but networks at colleges and universities are not so capable. "Those massive conduits are reduced to two-lane roads at most college and university campuses," he said. Improving cyberinfrastructure, he said, will transform the capabilities of campus-based scientists. -- Arden Bement, Director of the National Science Foundation, May 2005
  13. DOE ESnet's Science DMZ: A Scalable Network Design Model for Optimizing Science Data Transfers. A Science DMZ integrates four key concepts into a unified whole: (1) a network architecture designed for high-performance applications, with the science network distinct from the general-purpose network; (2) the use of dedicated systems for data transfer; (3) performance measurement and network testing systems that are regularly used to characterize and troubleshoot the network; and (4) security policies and enforcement mechanisms that are tailored for high-performance science environments (http://fasterdata.es.net/science-dmz/). The term "Science DMZ" was coined in 2010. The DOE ESnet Science DMZ and the NSF Campus Bridging Taskforce Report formed the basis for the NSF Campus Cyberinfrastructure Network Infrastructure and Engineering (CC-NIE) Program
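ESnet's third concept, regular performance measurement, is normally handled by perfSONAR- or iperf-class tools running on dedicated test hosts. As a minimal, self-contained illustration of the idea (not ESnet's actual tooling), the sketch below measures memory-to-memory TCP throughput over loopback; the buffer size and test duration are arbitrary assumptions.

```python
# Minimal memory-to-memory TCP throughput probe, illustrating the kind of
# measurement a Science DMZ automates with perfSONAR/iperf-class tools.
# The loopback setup, buffer size, and duration are illustrative choices.
import socket
import threading
import time

CHUNK = 1 << 20      # 1 MiB send/receive buffer
DURATION = 1.0       # seconds of sustained transmission
result = {}          # filled in by the receiving thread

def sink(ready):
    """Accept one connection, discard its bytes, and record the achieved rate."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))        # let the OS pick a free port
    srv.listen(1)
    result["port"] = srv.getsockname()[1]
    ready.set()                       # tell the sender it is safe to connect
    conn, _ = srv.accept()
    total = 0
    start = time.time()
    while True:
        data = conn.recv(CHUNK)
        if not data:                  # sender closed the connection
            break
        total += len(data)
    elapsed = time.time() - start
    result["bytes"] = total
    result["mbit_per_s"] = total * 8 / elapsed / 1e6
    conn.close()
    srv.close()

def source():
    """Blast zero-filled buffers at the sink for DURATION seconds."""
    cli = socket.create_connection(("127.0.0.1", result["port"]))
    payload = b"\0" * CHUNK
    deadline = time.time() + DURATION
    while time.time() < deadline:
        cli.sendall(payload)
    cli.close()

ready = threading.Event()
receiver = threading.Thread(target=sink, args=(ready,))
receiver.start()
ready.wait()
source()
receiver.join()
print(f"loopback throughput: {result['mbit_per_s']:.0f} Mbit/s")
```

A production Science DMZ runs such tests on a schedule between dedicated measurement nodes, so throughput regressions are caught before researchers' transfers hit them.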
  14. Based on Community Input and on ESnet's Science DMZ Concept, NSF Has Funded Over 100 Campuses to Build Local Big Data Freeways. [Map legend: Red, 2012 CC-NIE Awardees; Yellow, 2013 CC-NIE Awardees; Green, 2014 CC*IIE Awardees; Blue, 2015 CC*DNI Awardees; Purple, Multiple-Time Awardees.] Source: NSF
  15. Creating a Big Data Freeway on Campus: NSF-Funded Prism@UCSD and CHERuB Campus CC-NIE Grants. Prism@UCSD: PI Phil Papadopoulos, SDSC, Calit2 (2013-15). CHERuB: PI Mike Norman, SDSC
  16. NCMIR Brain Images in the Calit2 VROOM Allow Interactive Zooming From the Cerebellum to Individual Neurons; NCMIR Is Connected Over Prism to Calit2/SDSC at 80 Gbps
  17. Calit2 3D Immersive StarCAVE OptIPortal: Enables Interactive Exploration of the Protein Data Bank. Cluster with 30 Nvidia 5600 Cards (60 GB Texture Memory), Connected at 50 Gb/s to Quartzite; 30 HD Projectors; 15 Meyer Sound Speakers + Subwoofer; Passive Polarization, Optimized for Polarization Separation and Minimized Attenuation. Source: Tom DeFanti, Greg Dawe, Calit2
  18. The Pacific Wave Platform Creates a Regional Science-Driven Big Data Freeway System. [Chart: flash-disk-to-flash-disk file transfer rate.] Funded by NSF: $5M, Oct 2015-2020. PI: Larry Smarr, UC San Diego, Calit2. Co-PIs: Camille Crittenden, UC Berkeley, CITRIS; Tom DeFanti, UC San Diego, Calit2; Philip Papadopoulos, UC San Diego, SDSC; Frank Wuerthwein, UC San Diego, Physics and SDSC. Source: John Hess, CENIC
  19. Pacific Research Platform Regional Collaboration: Multi-Campus Science Driver Teams. Jupyter Hub; Biomedical (Cancer Genomics Hub/Browser, Microbiome and Integrative Omics, Integrative Structural Biology); Earth Sciences (Data Analysis and Simulation for Earthquakes and Natural Disasters, Climate Modeling with NCAR/UCAR, California/Nevada Regional Climate Data Analysis, CO2 Subsurface Modeling); Particle Physics; Astronomy and Astrophysics (Telescope Surveys, Galaxy Evolution, Gravitational Wave Astronomy); Scalable Visualization, Virtual Reality, and Ultra-Resolution Video
  20. PRP Transforms Big Data Microbiome and Integrated Omics Science. [Diagram: a Knight Lab node (12 cores/GPU, 128 GB RAM, 3.5 TB SSD, 48 TB disk, 10 Gbps NIC) connected over Prism@UCSD and CHERuB (100 Gbps) to the Knight 1024 cluster in the SDSC co-lo, Gordon, Data Oasis (7.5 PB, 200 GB/s), a 64-megapixel data analysis wall running Emperor and other vis tools, and remote sites PNNL, UC Davis, LBNL, and Caltech, with link speeds from 10 Gbps to 1.3 Tbps.]
  21. To Expand the IBD Project, the Knight/Smarr Labs Were Awarded ~1 Million Core-Hours on SDSC's Comet Supercomputer: 8x the Compute Resources of the Prior Study. Smarr Gut Microbiome Time Series: From 7 Samples Over 1.5 Years to 50 Samples Over 4 Years. IBD Patients: From 5 Crohn's Disease and 2 Ulcerative Colitis Patients to ~100 Patients (50 Carefully Phenotyped Patients Drawn From the Sandborn BioBank; 43 Metagenomes From the RISK Cohort of Newly Diagnosed IBD Patients). New Software Suite From the Knight Lab: Re-annotation of Reference Genomes, Functional/Taxonomic Variations; Novel Compute-Intensive Assembly Algorithms From Pavel Pevzner
  22. We Used SDSC's Comet to Uniformly Compute Protein-Coding Gene, RNA, & CRISPR Annotations. We Downloaded Over 60,000 Bacterial and Archaeal Genomes From NCBI; at ~5 Core-Hours Per Genome, the Run Required ~300,000 Core-Hours (Running 24 Cores in Parallel, Over 400 Days of Wall-Clock Time). The Pipeline Requires a Variety of Software Programs: Prodigal for Gene Prediction; Diamond for Protein Homolog Search Against the UniRef Database; Infernal for ncRNA Prediction; RNAmmer for rRNA Prediction; Aragorn for tRNA Prediction. These Results Will Be Made Into a New Community Database (Knight Lab, Calit2, SDSC). Source: Zhenjiang (Zech) Xu, Knight Lab, UCSD
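The tool chain on this slide amounts to a per-genome fan-out of five independent programs. The sketch below builds, but does not execute, illustrative command lines for those five tools and reproduces the slide's core-hour arithmetic; the exact flags, database names, and file names are assumptions for illustration, not the Knight Lab's production scripts.

```python
# Hypothetical per-genome annotation fan-out for the five tools named on the
# slide. Command-line flags and database paths are illustrative assumptions.
from pathlib import Path

CORE_HOURS_PER_GENOME = 5   # ~5 core-hours per genome, per the slide

def annotation_commands(genome_fasta: str) -> list[list[str]]:
    """Return the tool invocations for one genome (composed, not executed)."""
    stem = Path(genome_fasta).stem
    return [
        # Prodigal: protein-coding gene prediction
        ["prodigal", "-i", genome_fasta, "-a", f"{stem}.faa",
         "-o", f"{stem}.genes.gff", "-f", "gff"],
        # Diamond: protein homolog search against a UniRef database
        ["diamond", "blastp", "-q", f"{stem}.faa", "-d", "uniref90",
         "-o", f"{stem}.hits.tsv"],
        # Infernal (cmscan): ncRNA prediction against covariance models
        ["cmscan", "--tblout", f"{stem}.ncrna.tbl", "Rfam.cm", genome_fasta],
        # RNAmmer: rRNA prediction
        ["rnammer", "-S", "bac", "-gff", f"{stem}.rrna.gff", genome_fasta],
        # Aragorn: tRNA prediction
        ["aragorn", "-t", "-o", f"{stem}.trna.txt", genome_fasta],
    ]

def total_core_hours(n_genomes: int) -> int:
    """Aggregate cost of the run at ~5 core-hours per genome."""
    return n_genomes * CORE_HOURS_PER_GENOME

# The 60,000-genome download implies the slide's 300,000 core-hour total:
print(total_core_hours(60_000))  # -> 300000
```

In practice each genome's command list would be dispatched as an independent batch job, which is why this workload parallelizes cleanly across a supercomputer's nodes.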
  23. The Cancer Genomics Hub (UCSC) Is Housed at SDSC: Large Data Flows to End Users at UCSC, UCB, UCSF. [Traffic chart: flows growing from 1G and 8G to 15G by Jan 2016, roughly 30,000 TB per year.] Data Source: David Haussler, Brad Smith, UCSC
  24. Creating a Distributed Cluster for Integrated Modeling of Large Macromolecular Machines. UCSF 10-100 Gbps Science DMZ: QB3@UCSF (~5000 cores), Institute for Human Genetics (~1200 cores), Cancer Center (~800 cores), Molecular Structure Group (~1000 cores), Coupled via the PRP to LBNL, NERSC, and SDSC. Brings Huge Datasets From Supercomputer Centers Back to UCSF Clusters for Analysis; Requires CPU-Months Per Computation. Lead: Andrej Sali, UCSF
  25. Driving Improvements in Scientific Data Transfer. NCMIR X-ray Microscope (XRM), Zeiss Versa 510: MicroCT Reconstructions of Chiton Radula. Chiton radula have evolved to incorporate an iron oxide mineral, magnetite, making them extremely hard and magnetic. UCR researchers are modeling the teeth (radula) of the marine snail Cryptochiton stelleri to engineer new biomimetic abrasion-resistant composites; 3D reconstructions from NCMIR X-ray microscopic computed tomography facilitate the development of bioinspired tough materials. The PRP facilitated collaborative data transfer at 10-100 Gbps between the UCSD/NCMIR and UC Riverside FIONA/Data Transfer Nodes (DTNs); XRM data sets are 100+ GB. Images courtesy of Steven Herrera, Ph.D., Kisailus Biomimetics and Nanostructured Materials Laboratory, UC Riverside
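Bulk transfers of 100+ GB datasets between DTNs are normally followed by an end-to-end integrity check. As a generic illustration (not the PRP's actual transfer stack), the sketch below streams a file through SHA-256 in fixed-size chunks, so even files far larger than RAM can be verified; the chunk size and the file path in the comment are assumptions.

```python
# Chunked SHA-256 digest for verifying large files after a DTN-to-DTN copy.
# Streaming in 4 MiB chunks keeps memory usage flat regardless of file size.
import hashlib

def file_digest(path: str, chunk: int = 1 << 22) -> str:
    """Return the hex SHA-256 of `path`, read chunk by chunk."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

# Source and destination each compute the digest and compare, e.g.:
#   assert file_digest("/data/xrm/radula_scan.raw") == digest_at_source
# (the path above is hypothetical)
```

Because the digest is computed identically at both ends, any corruption introduced in flight or on disk shows up as a mismatch before the dataset enters analysis.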
  26. Next Step: A Global Research Platform Building on CENIC/Pacific Wave and GLIF. [Map: current international GRP partners.]
  27. Mirror Cell Image Library Infrastructure and Data Management Workflows at Singapore's NSCC: A Cell Image Library Designed for Big Data, Leveraging High-Bandwidth-Connected High-Performance Storage and Computing Resources. Source: Mark Ellisman & Steve Peltier, NCMIR, UCSD

