Apollo: Collaborative genome annotation editing

  • Published on
    09-Jan-2017

Transcript

<ul><li><p>Apollo: Collaborative genome annotation editing</p><p>A workshop for the Stowers Institute Research Community</p><p>Monica Munoz-Torres, PhD | @monimunozto | Berkeley Bioinformatics Open-Source Projects (BBOP) | Environmental Genomics &amp; Systems Biology Division | Lawrence Berkeley National Laboratory</p><p>Kansas City, MO | 12 December, 2016</p><p>http://GenomeArchitect.org</p></li><li><p>Outline</p><p>Today we will discuss effective ways to extract valuable information about a genome through curation efforts.</p></li><li><p>After this talk you will...</p><p>Better understand curation in the context of genome annotation: assembled genome → automated annotation → manual annotation.</p><p>Become familiar with Apollo's environment and functionality.</p><p>Learn to identify homologs of known genes of interest in your newly sequenced genome.</p><p>Learn how to corroborate and modify automatically annotated gene models using all available evidence in Apollo.</p></li><li><p>Genome sequencing projects</p><p>Experimental design, sampling.</p><p>Comparative analyses</p><p>Merged Gene Set</p><p>Manual Annotation</p><p>Automated Annotation</p><p>Sequencing</p><p>Assembly</p><p>Synthesis &amp; dissemination.</p></li><li><p>Unlocking genomes</p><p>Marbach et al. 2011. Nature Methods | Shutterstock.com | Alexander Wild</p></li><li><p>First, a refresher</p></li><li><p>A few things to remember during curation of gene models</p><p>BIO-REFRESHER</p><p>KEEP A GLOSSARY HANDY: from contig to splice site</p><p>WHAT IS A GENE? Defining your goal</p><p>TRANSCRIPTION: mRNA in detail</p><p>TRANSLATION: reading frames, etc.</p><p>GENOME CURATION: steps involved</p></li><li><p>The gene: a moving target</p><p>The gene is a union of genomic sequences encoding a coherent set of potentially overlapping functional products.</p><p>Gerstein et al., 2007. 
Genome Res.</p></li><li><p>mRNA</p><p>Although mRNAs exist only briefly, understanding them is crucial, as they will become the center of your work.</p><p>"Gene structure" by Daycd - Wikimedia Commons</p></li><li><p>Reading frames</p><p>In eukaryotes, only one reading frame per section of DNA is biologically relevant at a time: it has the potential to be transcribed into RNA and translated into protein. This is called the OPEN READING FRAME (ORF).</p><p>ORF = Start signal + coding sequence (length divisible by 3) + Stop signal</p></li><li><p>Splice sites</p><p>The spliceosome catalyzes the removal of introns and the ligation of flanking exons.</p><p>Splicing signals (from the point of view of an intron): one splice site at the 5′ end, usually GT (less commonly GC), and one splice site at the 3′ end, usually AG. Canonical splice sites look like this: ]5′-GT/AG-3′[</p></li><li><p>Exons and Introns</p><p>Introns can interrupt the reading frame of a gene by inserting a sequence between two consecutive codons,</p><p>between the first and second nucleotide of a codon,</p><p>or between the second and third nucleotide of a codon.</p><p>"Exon and Intron classes". Licensed under Fair use via Wikipedia</p></li><li><p>Obstacles to transcription and translation</p><p>Premature Stop codons may be present in the message, for example as a result of incomplete splicing, DNA mutations, transcription errors, or leaky scanning of the ribosome causing changes in the reading frame (frameshifts). A process called nonsense-mediated decay detects such messages and degrades them.</p><p>Insertions and deletions (indels) can cause frameshifts when the indel length is not divisible by three. 
As a result, the peptide can be abnormally long or abnormally short, depending on where the first in-frame Stop signal is located.</p></li><li><p>Prediction &amp; Annotation</p></li><li><p>GENE PREDICTION &amp; ANNOTATION</p><p>PREDICTION &amp; ANNOTATION</p><p>Identification and annotation of genome features:</p><p>primarily focuses on protein-coding genes.</p><p>also identifies RNAs (tRNA, rRNA, long and small non-coding RNAs (ncRNA)), regulatory motifs, repetitive elements, etc.</p><p>happens in 2 phases: 1. Computation phase; 2. Annotation phase.</p></li><li><p>COMPUTATION PHASE</p><p>a. Experimental data are aligned to the genome: expressed sequence tags, RNA-sequencing reads, proteins (also from other species).</p><p>b. Gene predictions are generated:</p><p>- ab initio: based on nucleotide sequence and composition, e.g. Augustus, GENSCAN, geneid, fgenesh, etc.</p><p>- evidence-driven: also identifying domains and motifs, e.g. SGP2, JAMg, fgenesh++, etc.</p><p>Result: the single most likely coding sequence, no UTRs, no isoforms.</p><p>Yandell &amp; Ence. Nature Rev 2012 doi:10.1038/nrg3174</p></li><li><p>ANNOTATION PHASE</p><p>Experimental data (evidence) and predictions are synthesized into gene annotations.</p><p>Result: gene models that generally include UTRs, isoforms, evidence trails.</p><p>Yandell &amp; Ence. Nature Rev 2012 doi:10.1038/nrg3174</p><p>5′ UTR, 3′ UTR</p></li><li><p>CONSENSUS GENE SETS</p><p>Gene models may be organized into sets using:</p><p>combiners for automatic integration of predicted sets, e.g. GLEAN, EvidenceModeler</p><p>or tools packaged into pipelines, e.g. MAKER, PASA, Gnomon, Ensembl, etc.</p><p>In some cases, algorithms and metrics used to generate consensus sets may actually reduce the accuracy of the gene's representation.</p></li><li><p>Good genes are required!</p><p>1. Generate gene models: a few rounds of gene prediction.</p><p>2. 
Annotate gene models: function, expression patterns, metabolic network memberships.</p><p>3. Manually review them: structure &amp; function.</p></li><li><p>Curation improves quality</p><p>Apollo: best representation of biology &amp; removal of elements reflecting errors in automated analyses.</p><p>Gene Ontology: functional assignments through comparative analysis using literature, databases, and experimental data.</p></li><li><p>Curation is inherently collaborative</p><p>It is impossible for a single individual to curate an entire genome with precise biological fidelity.</p><p>Curators need second opinions and insights from colleagues with domain and gene family expertise.</p></li><li><p>Collaborative curation with Apollo</p></li><li><p>Apollo: genome annotation editing</p><p>Collaborative, instantaneous, web-based, built on top of JBrowse.</p><p>Supports real-time collaboration &amp; generates analysis-ready data.</p><p>USER-CREATED ANNOTATIONS</p><p>EVIDENCE TRACKS</p><p>ANNOTATOR PANEL</p><p>GenomeArchitect.org</p></li><li><p>BECOMING ACQUAINTED WITH APOLLO</p><p>General process of curation</p><p>1. Select or find a region of interest (e.g. scaffold).</p><p>2. Select appropriate evidence tracks to review the genome element to annotate (e.g. gene model).</p><p>3. Determine whether a feature in an existing evidence track will provide a reasonable gene model to start working.</p><p>4. If necessary, adjust the gene model.</p><p>5. Check your edited gene model for integrity and accuracy by comparing it with available homologs.</p><p>6. Comment and finish.</p></li><li><p>Color by CDS frame, toggle strands, set color scheme and highlights.</p><p>Upload evidence files (GFF3, BAM, BigWig), add combination and sequence search tracks.</p><p>Query the genome using BLAT.</p><p>Navigation and zoom. Search for a gene model or a scaffold.</p><p>User-created annotations. 
Annotator panel.</p><p>Evidence tracks.</p><p>Stage and cell-type specific transcription data.</p><p>Apollo Genome Annotation Editor</p><p>Protein coding, pseudogenes, ncRNAs, regulatory elements, variants, etc.</p><p>Admin.</p></li><li><p>Apollo Architecture</p><p>Web-based client + annotation-editing engine + server-side data service</p></li><li><p>Generates ready-made computable data</p><p>Ref Sequence</p></li><li><p>Supports real-time collaboration</p></li><li><p>Let's Play!</p></li><li><p>Access Apollo</p></li><li><p>Removable side dock</p><p>Annotations | Organism | Users | Groups | Admin | Tracks | Reference Sequence</p></li><li><p>Annotations</p><p>gene</p><p>mRNA</p><p>Annotation details &amp; exon boundaries</p></li><li><p>Curating with Apollo</p></li><li><p>USER NAVIGATION</p><p>Annotator panel.</p><p>Choose appropriate evidence from the list of Tracks on the annotator panel.</p><p>Select &amp; drag elements from an evidence track into the User-created Annotations area.</p><p>Hovering over an annotation in progress brings up an information pop-up.</p><p>Creating a new annotation</p></li><li><p>Adding a gene model</p></li><li><p>Editing functionality</p></li><li><p>Editing functionality</p><p>Example: Adding an exon supported by experimental data</p><p>RNAseq reads show evidence in support of a transcribed product that was not predicted. 
Add the exon by dragging up one of the RNAseq reads.</p></li><li><p>Editing functionality</p><p>Example: Adjusting exon boundaries supported by experimental data</p></li><li><p>Zooming to base level reveals the DNA Track.</p></li><li><p>Color exons by CDS from the View menu.</p></li><li><p>Toggle reference DNA sequence and translation frames in forward strand. Toggle models in either direction.</p><p>Zoom in/out with keyboard: shift + arrow keys up/down</p></li><li><p>Annotating simple cases</p></li><li><p>ANNOTATING SIMPLE CASES</p><p>Simple case:</p><p>- the predicted gene model is correct or nearly correct, and</p><p>- this model is supported by evidence that completely or mostly agrees with the prediction.</p><p>- evidence that extends beyond the predicted model is assumed to be non-coding sequence.</p><p>The following are simple modifications.</p></li><li><p>ADDING EXONS</p><p>A confirmation box will warn you if the receiving transcript is not on the same strand as the feature where the new exon originated.</p><p>Check Start and Stop signals after each edit.</p></li><li><p>If transcript alignment data are available &amp; extend beyond your original annotation, you may extend or add UTRs.</p><p>1. Right click at the exon edge and Zoom to base level.</p><p>2. 
Place the cursor over the edge of the exon until it becomes a black arrow, then click and drag the edge of the exon to the new coordinate position that includes the UTR.</p><p>ADDING UTRs</p><p>To add a new spliced UTR to an existing annotation, also follow the procedure for adding an exon.</p></li><li><p>MATCHING EXON BOUNDARY TO EVIDENCE</p><p>To modify an exon boundary and match data in the evidence tracks: select both the [offending] exon and the feature with the expected boundary, then right click on the annotation to select Set 3′ end or Set 5′ end as appropriate.</p><p>In some cases all the data may disagree with the annotation; in other cases some data support the annotation and some of the data support one or more alternative transcripts. Try to annotate as many alternative transcripts as are well supported by the data.</p></li><li><p>CHECKING EXON INTEGRITY</p><p>1. Two exons from different tracks sharing the same start/end coordinates display a red bar to indicate matching edges.</p><p>2. Selecting the whole annotation or one exon at a time, use this edge-matching function and scroll along the length of the annotation, verifying exon boundaries against available data. Use square [ ] brackets to scroll from exon to exon. Use curly { } brackets to scroll from annotation to annotation.</p><p>3. Check if cDNA/RNAseq reads lack one or more of the annotated exons or include additional exons.</p></li><li><p>Non-canonical splice site flags. 
Double click: selection of feature and sub-features.</p><p>Evidence Tracks Area</p><p>User-created Annotations Track</p><p>Edge-matching</p><p>Apollo's editing logic (brain): selects the longest ORF as CDS; flags non-canonical splice sites.</p><p>ORFs AND SPLICE SITES</p></li><li><p>SPLICE SITES</p><p>Non-canonical splice sites are indicated by an orange circle with a white exclamation point inside, placed over the edge of the offending exon.</p><p>Canonical splice sites:</p><p>forward strand: 5′-exon]GT/AG[exon-3′</p><p>reverse strand, not reverse-complemented: 3′-exon]GA/TG[exon-5′</p><p>Zoom to review non-canonical splice site warnings. Although these may not always have to be corrected (e.g. GC donor), they should be flagged with a comment.</p><p>Exon/intron splice site error warning</p><p>Curated model</p></li><li><p>Start AND Stop SITES</p><p>Apollo calculates the longest possible open reading frame (ORF) that includes canonical Start and Stop signals within the predicted exons.</p><p>If the Start appears to be incorrect, modify it by selecting an in-frame Start codon further up or downstream, depending on evidence (proteins, RNAseq).</p><p>It may be present outside the predicted gene model, within a region supported by another evidence track.</p><p>In very rare cases, the actual Start codon may be non-canonical (non-ATG).</p></li><li><p>Annotating complex cases</p></li><li><p>Evidence may support joining two or more different gene models. Warning: protein alignments may have incorrect splice sites and lack non-conserved regions!</p><p>1. In the User-created Annotations area, shift-click to select an intron from each gene model and right click to select the Merge option from the menu.</p><p>2. Drag supporting evidence tracks over the candidate models to corroborate overlap, or review edge matching and coverage across models.</p><p>3. 
Check the resulting translation by querying a protein database, e.g. UniProt, NCBI nr. Add comments to record that this annotation is the result of a merge.</p><p>Red lines around exons: edge-matching allows annotators to confirm whether the evidence is in agreement without examining each exon at the base level.</p><p>COMPLEX CASES: merge two gene predictions on the same scaffold</p></li><li><p>COMPLEX CASES: split a gene prediction</p><p>One or more splits may be recommended when:</p><p>- different segments of the predicted protein align to two or more different gene families</p><p>- the predicted protein doesn't align to known proteins over its entire length</p><p>Transcript data may support a split, but first verify whether they are alternative transcripts.</p></li><li><p>COMPLEX CASES: annotate frameshifts and correct single-base errors</p><p>DNA Track</p><p>User-created Annotations Track</p><p>Always remember: when annotating gene models using Apollo, you are looking at a frozen version of the genome assembly and you will not be able to modify the assembly itself.</p></li><li><p>COMPLEX CASES: correcting selenocysteine-containing proteins</p></li><li><p>1. Apollo allows annotators to make single-base modifications or frameshifts that are reflected in the sequence and structure of any transcripts overlapping the modification. These manipulations do NOT change the underlying genomic sequence.</p><p>2. If you determine that you need to make one of these changes, zoom in to the nucleotide level and right click over a single nucleotide on the genomic sequence to access a menu that provides options for creating insertions, deletions or substitutions.</p><p>3. 
The Create Genomic Insertion feature will require you to enter the necessary string of nucleotide residues that will be inserted to the right of the cursor's current location. The Create Genomic Deletion option will require you to enter the length of the deletion, starting with the nucleotide where the cursor is positioned. The Create Genomic Substitution feature asks for the string of nucleotide residues that will replace the ones on the DNA track.</p><p>4. Once you have entered the modifications, Apollo will recalculate the corrected transcript and protein sequences, which will appear when you use the right-click menu Get Sequence option. Since the underlying genomic sequence is reflected in all annotations that include the modified region, you should alert the curators of your organism's database using the Comments section to report the CDS edits.</p><p>5. In special cases such as selenocysteine-containing proteins (read-throughs), right-click over the offending/premature Stop signal and choose the Set readthrough stop codon option from the menu.</p><p>COMPLEX CASES: annotating frameshifts and correcting single-base errors &amp; selenocysteines</p></li><li><p>Information Editor</p></li><li><p>The Annotation Information Editor</p><p>Add PubMed IDs.</p><p>Include GO terms as appropriate from any of the three ontologies.</p><p>Write comments stating how you have validated each model.</p></li><li><p>Keeping track of each edit</p></li><li><p>Annotations, annotation edits, and History: stored in a centralized database.</p></li><li><p>Follow the checkl...</p></li></ul>
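<p>The slides describe two automatic checks in Apollo's editing logic: choosing the longest ORF as the CDS, and flagging introns whose boundaries are not the canonical GT/AG. The sketch below illustrates those two ideas only; it is not Apollo's implementation. The function names, the 0-based half-open exon coordinates, and the toy sequence are all assumptions made for illustration, and the code handles forward-strand features with an ATG Start only.</p>

```python
# Illustrative sketch only: hypothetical helper names, not Apollo's API.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def splice_transcript(genome, exons):
    """Join exon sequences; exons are sorted (start, end) pairs, 0-based, end-exclusive."""
    return "".join(genome[start:end] for start, end in exons)


def longest_orf(transcript):
    """Return (start, end) of the longest ATG...Stop ORF, or None if there is none.

    Coordinates are relative to the spliced transcript, 0-based and end-exclusive,
    and the Stop codon is included, so (end - start) is divisible by 3.
    """
    seq = transcript.upper()
    best = None
    for frame in range(3):
        start = None  # first ATG seen in this frame since the last Stop
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == START_CODON:
                start = pos
            elif start is not None and codon in STOP_CODONS:
                end = pos + 3
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)
                start = None
    return best


def flag_noncanonical_introns(genome, exons):
    """Return (intron_start, intron_end, donor, acceptor) for each intron that is not GT/AG."""
    seq = genome.upper()
    flags = []
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        donor = seq[left_end:left_end + 2]          # first two bases of the intron
        acceptor = seq[right_start - 2:right_start]  # last two bases of the intron
        if (donor, acceptor) != ("GT", "AG"):
            flags.append((left_end, right_start, donor, acceptor))
    return flags


# Toy forward-strand gene: three exons, one canonical intron, one GC-donor intron.
toy_genome = "CCATGGCT" + "GTAAGTCCTTAG" + "GAAAAG" + "GCAAGTCCTTAG" + "TGATT"
toy_exons = [(0, 8), (20, 26), (38, 43)]

print(longest_orf(splice_transcript(toy_genome, toy_exons)))
print(flag_noncanonical_introns(toy_genome, toy_exons))
```

<p>In this toy gene the second intron starts with GC, so it is reported for review rather than rejected, which matches the slide's advice that a GC donor may be legitimate but should be flagged with a comment.</p>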