Open biomedical knowledge using crowdsourcing and citizen science
1. Open biomedical knowledge using crowdsourcing and citizen science
   Andrew Su, Ph.D.
   @andrewsu, email@example.com, http://sulab.org
   November 5, 2015, UCSD
   Slides: slideshare.net/andrewsu

2. Candidate genes: FLNB, CTNNB1, EPHA3, SMAD3, XPO1, RPS27, FLCN, ATR, FLT3, BRD2, ERG, RAF1, EGFR, ERBB4, RARA, JAK3, LRP1, WT1, PML, SMARCA4
   Candidate variants: chr1:g.156084782C>G, chr6:g.31911991G>T, chr19:g.3767338C>T, chr19:g.3783925C>T, chr7:g.552021G>A, chr3:g.123005609G>T

3. Biology is an INFORMATION science
   Pietro Bellini, https://flic.kr/p/k5jmja

4. Prioritization of human genetic variants
   1000s of genetic variants, narrowed to < 10 candidate genes
   Filters:
   - Variant type
   - Allele frequencies
   - Previous clinical observation
   - Predicted functional effects
   - Gene function

5. Data integration as a cottage industry
   dbNSFP

6. Data integration as hardened community software
   dbNSFP, MyVariant.info

7. MyGene.info for integrating gene annotations
   Gene, MyGene.info

8. MyGene.info for integrating gene annotations
   http://mygene.info/metadata
   Current version history
   Current stats

9. MyGene.info for integrating gene annotations
   Histogram: Gene annotation service (/v2/gene), frequency by request time (ms)
   Request time (ms)   Frequency
   10                  399070
   20                  210381
   30                  120173
   40                  22249
   50                  7292
   60                  3563
   70                  1767
   80                  1031
   90                  616
   100                 406
   More                2724

10. MyGene.info for integrating gene annotations
    2-3M requests per month

11. MyGene.info for integrating gene annotations

12. MyGene.info for integrating gene annotations
    2015, 2018

13. Bioinformatician-friendly JSON output, REST API
    http://MyGene.info/v2/gene/7157
    http://MyVariant.info/v1/variant/chr7:g.55241707G>T

14. Variant and gene prioritization

15. Variant and gene prioritization
    2441, 2308, 1917, 18, 9, 5

16. Variant and gene prioritization
    2441, 2308, 1917, 18, 9, 5
    https://github.com/SuLab/myvariant.info/blob/master/docs/ipynb/myvariant_R_miller.ipynb

17. Open biomedical knowledge
    - MyVariant.info, MyGene.info: Integration of molecular biology databases via high performance APIs

18. Open biomedical knowledge
    - MyVariant.info, MyGene.info: Integration of molecular biology databases via high performance APIs
    - Biomedical Linked Open Data

19. The Gene Wiki project
    - Protein structure
    - Symbols and identifiers
    - Tissue expression pattern
    - Gene Ontology annotations
    - Links to structured databases
    - Gene summary
    - Protein interactions
    - Linked references
    Huss, PLoS Biol, 2008

20. The Gene Wiki project

21. The Gene Wiki project

22. Wikidata
    "Provide a database of the world's knowledge that anyone can edit" - Denny Vrandečić

23. Centralizing key data storage
    Source: http://commons.wikimedia.org/wiki/File:Wikidata_slides_Magnus_Manske,_Cambridge,_2014-02-27.pdf

24. Centralizing key data storage

25. Centralizing key data storage

26. Loading biological data into Wikidata
    Entrez Gene, Ensembl, UniProt, UCSC, PDB, RefSeq

27. Wikidata for biology
    Reelin (Q414043): http://www.wikidata.org/wiki/Q414043
    Properties: is a (Property:P31), regulates (Property:P128), interacts with (Property:P129)
    Related items: Protein, Glycoprotein, Neural development, VLDL receptor, Amyloid precursor protein
    Item identifiers: Q8054, Q187126, Q1345738, Q1979313, Q423510, Q414043

28. Wikidata for biology
    Properties: Property:P31, Property:P128, Property:P129
    Item identifiers: Q8054, Q187126, Q1345738, Q1979313, Q423510, Q414043
    http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en

29. - ~150k genes and proteins
    - ~2k FDA-approved drugs
    - ~7k human diseases

30. Centralizing key data storage
    - 287 language editions of Wikipedia
    - Bioinformatics community
    - Toxicology community
    - Epidemiology community

31. Open biomedical knowledge
    - MyVariant.info, MyGene.info: Integration of molecular biology databases via high performance APIs
    - Biomedical Linked Open Data

32. Open biomedical knowledge
    - Free text to structured data
    - MyVariant.info, MyGene.info: Integration of molecular biology databases via high performance APIs
    - Biomedical Linked Open Data

33. The biomedical literature is massive
    [Chart: number of new PubMed-indexed articles, 1983 to 2013; y-axis from 0 to 1,200,000]

34. but it is very hard to query and compute

35. but it is very hard to query and compute
    Imatinib, Crizotinib, Erlotinib, Gefitinib, Sorafenib, Lapatinib, Dasatinib
    AND
    Acute myeloid leukemia, Acute lymphoblastic leukemia, Chronic myelogenous leukemia, Chronic lymphocytic leukemia, Hodgkin lymphoma, Non-Hodgkin lymphoma, Myeloma

36. The Network of BioThings
    1. Identify biomedical concepts in text
    "We report a case of familial systemic mastocytosis with the rare KIT K509I germ line mutation. In vitro treatment with imatinib, dasatinib and PKC412 reduced cell viability of primary mast cells harboring KIT K509I mutation. Both patients with familial systemic mastocytosis had remarkable hematological and skin improvement after three months of imatinib treatment."
    Leuk Res. 2014 Oct;38(10):1245-51. doi: 10.1016/j.leukres.
    GENES, DISEASES, DRUGS, VARIANTS

37. The Network of BioThings
    1. Identify biomedical concepts in text
    2. Identify relationships between concepts
    Concepts: imatinib, dasatinib, PKC412, Familial systemic mastocytosis, KIT, K509I
    Relationships: Mutation of, Mutation causes, causes, treats, inhibits

38. Goal: Assemble a network of biomedical knowledge that is comprehensive, current, computable and traceable.

39. Question: Can Citizen Scientists collectively perform concept recognition in biomedical texts?

40. Simple annotation interface
    - Click to see instructions
    - Highlight disease mentions
    - 15 workers annotate each abstract

41. Experts versus crowd for concept identification
    593 PubMed abstracts, 6,900 mentions of disease concepts
    F = 0.87
    F = 0.78
    $$$

42. Experts versus crowd for concept identification
    593 PubMed abstracts, 6,900 mentions of disease concepts
    F = 0.87
    F = 0.87
    $$$, 9 days, 145 workers, Total: $630.96

43. Does Mechanical Turk scale?
    1,000,000 articles per year × 10 annotators / article × 4 tasks / doc × $0.066 / task = $2,640,000 / year

44. http://mark2cure.org

45. Paid crowdsourcing ($$$): F = 0.87, 9 days, 145 workers, Total: $630.96
    Citizen Science ("Help science, please"): F = 0.84, 28 days, 212 workers, Total cost: $0

46. Does Citizen Science scale?
    Volunteers needed = (number of annotation events per year) / (number of annotation events per year per volunteer)
                      = (1,000,000 articles × 10 AE / article) / ((10,275 AE × 365 days) / (212 annotators × 28 days))
                      = 15,828 volunteers needed
    AE = Annotation events

47. Does Citizen Science scale?
    - 15,828 volunteers needed
    - 175,000 volunteers
    - 300,000 volunteers
    - 37,000 volunteers
    - 1,000,000 volunteers

48. Annotating the relationships
    "This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies."
    therapeutic target
    subject, predicate, object
    GENE, DISEASE

49. Goal: Assemble a network of biomedical knowledge that is comprehensive, current, computable and traceable.

50. Nina Hale, https://flic.kr/p/zoVih

51. Rare disease case study #1
    Photo: Retta Beery

52. Bainbridge et al., STM, 2011

53. Photo: Retta Beery

54. Rare disease case study #2

55. [No text]

56. but no obvious treatments

57. Bainbridge et al., STM, 2011
    SPR

58. What differentiates SPR and NGLY1?
    SPR

59. Sarah Olmstead, https://flic.kr/p/364dZW
    NGLY1

60. - NGLY1 (11 PubMed articles)
    - Congenital disorders of glycosylation (822)
    - PNGase (686)
    - ERAD (1330)
    - glycosylation (48,862)
    - alacrima (164)
    - Genetic interactors (3016)
    - symptoms (109,928)
    24 million articles in PubMed

61. Mapping the biomedical network around NGLY1
    NGLY1

62. [No text]

63. A preliminary view of the NGLY1-focused biological network

64. Why do I Mark2Cure?
    - I am retired, have a doctorate in medical humanities, and have two children with Gaucher disease. I am just looking for some way to put my education to use. Sounds like a perfect situation for me.
    - My 4 year old daughter Phoebe is living with and battling rare disease.
    - I have Ehlers Danlos Syndrome. I hope to help people learn about this painful and debilitating disorder, so that others like me can receive more effective medical care.
    - Take part in something that helps humanity.
    - I Mark2Cure in memory of my son Mike who had type 1 diabetes.
    - Studied biology in college and I really miss it!
    - In memory of my daughter who had Cystic Fibrosis
    - Give back

65. Open biomedical knowledge
    - Free text to structured data
    - MyVariant.info, MyGene.info: Integration of molecular biology databases via high performance APIs
    - Biomedical Linked Open Data

66. Contact: http://sulab.org, firstname.lastname@example.org, @andrewsu
    Join the team! http://bit.ly/JoinSuLab
    Slides: slideshare.net/andrewsu
    Gene Wiki / Wikidata: Ben Good, Sebastian Burgstaller, Tim Putman, Julia Turner, Ginger Tsueng, Andra Waagmeester, Elvira Mitraka (UMB), Lynn Schriml (UMB), Justin Leong (UBC), Paul Pavlidis (UBC)
    MyGene / MyVariant: Chunlei Wu, Cyrus Afrasiabi, Kevin Xin, Adam Mark
    Mark2Cure: Max Nanis, Ginger Tsueng, Jennifer Fouquier, Ben Good, Chunlei Wu, All Mark2Curators!
    Other Group members: Jake Bruggemann, Ramya Gamini, Karthik Gangavarapu, Louis Gioia, Toby Li, Greg Stupp
    Funding and Support: BioGPS: GM83924; Gene Wiki: GM089820; MyGene / MyVariant: HG008473; BD2K COE: GM114833
    Icon credits (Noun Project, Wikimedia Commons): Zach VanDeHey, hunotika, Viktorvoigt, Alberto Rojas, Lloyd Humphreys
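
Worked example for slides 43 and 46. The two scaling estimates in the deck (yearly cost of paid crowdsourcing, and the number of volunteers a citizen science effort would need) are plain arithmetic on figures quoted on the slides. The sketch below reproduces them; the function and constant names are made up for illustration, and rounding the volunteer count up to a whole person is an assumption.

```python
import math
from decimal import Decimal

# Figures quoted on slide 43 ("Does Mechanical Turk scale?").
ARTICLES_PER_YEAR = 1_000_000
ANNOTATORS_PER_ARTICLE = 10
TASKS_PER_DOC = 4
COST_PER_TASK_USD = Decimal("0.066")  # Decimal avoids float rounding on money

# Figures quoted on slide 46 ("Does Citizen Science scale?").
PILOT_ANNOTATION_EVENTS = 10_275
PILOT_ANNOTATORS = 212
PILOT_DAYS = 28


def paid_cost_per_year():
    """Yearly cost in USD of paid crowdsourcing at the quoted per-task price."""
    tasks = ARTICLES_PER_YEAR * ANNOTATORS_PER_ARTICLE * TASKS_PER_DOC
    return tasks * COST_PER_TASK_USD


def volunteers_needed():
    """Volunteers needed if each one sustains the pilot's annotation rate all year."""
    events_needed = ARTICLES_PER_YEAR * ANNOTATORS_PER_ARTICLE
    # Pilot rate scaled to one volunteer over 365 days.
    events_per_volunteer_year = (
        PILOT_ANNOTATION_EVENTS * 365 / (PILOT_ANNOTATORS * PILOT_DAYS)
    )
    # Assumption: round up, since a fraction of a volunteer is not possible.
    return math.ceil(events_needed / events_per_volunteer_year)


if __name__ == "__main__":
    print("Paid crowdsourcing, USD per year:", paid_cost_per_year())
    print("Volunteers needed:", volunteers_needed())
```

Under these assumptions the results agree with the slide figures of $2,640,000 per year and 15,828 volunteers. The volunteer estimate is sensitive to the pilot rate: it assumes every volunteer keeps annotating at the 28-day pilot's average pace for a full year.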
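
Worked example for slides 13 and 28. The REST endpoints shown in the deck can be addressed by building the URL from an identifier. This is a minimal sketch using only the Python standard library; the helper names are hypothetical, the API versions (`v2`, `v1`) are the ones printed on the 2015 slides and may no longer be served, and `format=json` is added here on the assumption that machine-readable output is wanted from the Wikidata API (it is not part of the URL on the slide).

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Base URLs as they appear on the slides (hosts lowercased).
MYGENE_BASE = "http://mygene.info/v2/gene/"
MYVARIANT_BASE = "http://myvariant.info/v1/variant/"
WIKIDATA_API = "http://wikidata.org/w/api.php"


def gene_url(entrez_id):
    """URL for one gene annotation record, e.g. Entrez gene 7157."""
    return MYGENE_BASE + quote(str(entrez_id))


def variant_url(hgvs_id):
    """URL for one variant record; '>' in the HGVS id is percent-encoded."""
    return MYVARIANT_BASE + quote(hgvs_id, safe=":")


def wikidata_entity_url(item_id, language="en"):
    """URL for a wbgetentities call, as on slide 28, plus format=json."""
    params = {
        "action": "wbgetentities",
        "ids": item_id,
        "languages": language,
        "format": "json",  # assumption: not on the slide
    }
    return WIKIDATA_API + "?" + urlencode(params)


def fetch_json(url, timeout=10):
    """Fetch a URL and decode the JSON body (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Print the URLs only; call fetch_json(...) on any of them to retrieve data.
    print(gene_url(7157))
    print(variant_url("chr7:g.55241707G>T"))
    print(wikidata_entity_url("Q414043"))
```

Keeping URL construction separate from fetching makes the identifiers easy to check without a network connection, and lets the same builders be reused with any HTTP client.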