
TECHNICAL NOTE

The SNPforID browser: an online tool for query and display of frequency data from the SNPforID project

Jorge Amigo & Christopher Phillips & Maviky Lareu & Ángel Carracedo

Received: 7 May 2007 / Accepted: 13 March 2008 / Published online: 20 May 2008
© Springer-Verlag 2008

Abstract The SNPforID browser is a web-based tool for the query and visualization of the SNP allele frequency data generated by the SNPforID consortium (http://www.snpforid.org/). From this project, validated panels of single nucleotide polymorphisms (SNPs) for a variety of forensic applications have been generated, with the browser concentrating on the single-tube identification SNP set comprising 52 markers. A web interface allows the visitor to review the allele frequencies of the studied markers from all the available populations used by SNPforID to validate global SNP variability. The interface has been designed to offer the useful facility of combining populations into appropriate geographic groups for visual comparison of populations individually or amongst user-defined groupings and with equivalent HapMap data.

Keywords SNP · SNPforID · Online databases · Forensic allele frequency databases · HapMap

Introduction

The SNPforID consortium was set up in 2003 to develop single nucleotide polymorphism (SNP) loci for use in human identification analysis: principally focused on forensic analysis but encompassing relationship testing (e.g., paternity analysis, confirmation of pedigree, etc.), enhanced prediction of geographic origin and medical sample identification. The main requirement from novel forensic marker sets, hitherto lacking in short tandem repeat loci (STRs), is the ability to successfully genotype highly degraded DNA without dropout: the differential loss of loci or alleles caused by PCR fragment sizes above ∼125 bp or resulting from large differences in repeat number within a locus. For this reason, SNPforID prioritized SNP sets that could be genotyped from amplified fragments generally below 100 bp and in multiplexes sufficiently large to provide equivalent, or better, discrimination power to the widely used 16-STR kits. A core 52 SNP multiplex has been developed for forensic analysis comprising loci primarily targeted from the p-arm and q-arm of each autosome [13]. This has been supplemented with SNP sets that allow the prediction of the geographic origin of a sample [9], enhanced characterization of the Y chromosome [3] and typing of haplotype-informative coding region SNPs in the mitochondrial genome [2].

An important aspect of the work of the consortium has been the promotion of an open source ethos for reporting the technical aspects of the SNP typing assays developed and the scientific findings together with the provision of tools to analyze SNP genotype data. The SNPforID browser falls into the third category: an online tool that permits any researcher with genotyping data for the 52 SNPs in the forensic marker set to obtain allele frequency estimates from populations relevant to their own analyses. The data is presented in such a way that it is easy to collect and, when required, to combine allele frequency estimates from several populations into groups that better represent continental groups or geographic regions. HapMap allele frequency estimates from the four phase I study populations can also be listed to assist comparisons with the SNPforID populations and as a benchmark for assessing the reliability of the estimates for each locus.

Int J Legal Med (2008) 122:435–440
DOI 10.1007/s00414-008-0233-7

Accessibility: web access to this tool is granted at http://spsmart.cesga.es/snpforid.php

Electronic supplementary material The online version of this article (doi:10.1007/s00414-008-0233-7) contains supplementary material, which is available to authorized users.

J. Amigo (*) · C. Phillips · Á. Carracedo
Spanish National Genotyping Center (CeGen) and Genomic Medicine Group, CIBERER, University of Santiago de Compostela, Santiago de Compostela, Spain
e-mail: [emailprotected]

C. Phillips
e-mail: [emailprotected]

M. Lareu · Á. Carracedo
Institute of Legal Medicine, Genomic Medicine Group, University of Santiago de Compostela, Santiago de Compostela, Spain

Two examples serve to illustrate the potential use of a frequency browser tool and highlight the flexibility of a combinational approach to reviewing SNP allele frequency data. In the first hypothetical example, a forensic laboratory in a nonurban region of northern Canada might wish to interpret a SNP profile by obtaining the frequency for the genotypes in both European and Inuit populations. Although Inuit data is available from the SNPforID browser, Canadian European population data is lacking but could be adequately substituted with the combined European data readily obtained from the frequency page, allowing the investigator to report to court two appropriate cumulative frequency estimates for comparison. In the second, real case, a challenging paternity analysis of closely related individuals in Galicia (NW Spain) required SNP typing as a supplement to STR analysis. The investigator used the browser to compare and contrast Galician population estimates with various combinations of European populations and to obtain relevant frequency data permitting the assessment of the degree of local variation compared to European-wide patterns of variability for the 52 SNPs. In interpreting paternity analysis data involving related individuals, it is particularly important to gauge the degree of variability in the family investigated, the local population, and continent-wide to properly assess the significance of the genotypes and paternity indices obtained.
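The cumulative frequency estimates mentioned in the first example can be illustrated with a minimal sketch. It assumes the standard product rule, in which per-locus genotype frequencies follow Hardy–Weinberg proportions and loci are treated as independent; the paper does not state how the investigator would combine loci. The SNP names, the profile, and all allele frequencies below are invented for illustration and are not SNPforID data.

```python
from math import prod


def genotype_frequency(genotype: str, ref_allele: str, ref_freq: float) -> float:
    """Expected frequency of a biallelic SNP genotype under Hardy-Weinberg."""
    p = ref_freq       # frequency of the reference allele
    q = 1.0 - p        # frequency of the alternative allele
    n_ref = genotype.count(ref_allele)
    if n_ref == 2:
        return p * p
    if n_ref == 1:
        return 2.0 * p * q
    return q * q


def profile_frequency(profile: dict, freqs: dict) -> float:
    """Multiply per-locus genotype frequencies (assumes independent loci)."""
    return prod(
        genotype_frequency(gt, freqs[snp][0], freqs[snp][1])
        for snp, gt in profile.items()
    )


# Hypothetical two-SNP profile and made-up (reference allele, frequency) pairs.
profile = {"snpA": "AG", "snpB": "CC"}
european = {"snpA": ("A", 0.5), "snpB": ("C", 0.2)}
inuit = {"snpA": ("A", 0.1), "snpB": ("C", 0.6)}

freq_european = profile_frequency(profile, european)
freq_inuit = profile_frequency(profile, inuit)
```

The two resulting values play the role of the two estimates reported for comparison; with 52 loci the same product simply runs over more terms.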

Data curation

Although it is easy to provide an open access website accessing the full set of population validation genotypes available, more power is provided by constructing a web tool that can read directly from a database of combinable data. Designing and programming a suitable search web tool with emphasis on visualization of allele frequencies became our main priority. From the start, it was decided that access to individual genotypes or sample profiles would not normally be required by forensic users and that data from multiple centers that can be combined or compared provides more flexibility. As such, each genotype is not particularly important as a single entity, but is considered as a whole when the allele frequency estimates are calculated from the query of joined databases. This does not preclude the possibility of SNPforID centers genotyping standardized control samples such as those from the Coriell cell repositories (http://ccr.coriell.org/ccr/) or the positive control DNA supplied with standard forensic STR typing kits, then listing such profiles for each of the SNP sets developed by the consortium. Furthermore, the complete dataset of 52 SNP genotype profiles from all the populations listed (outside of HapMap) that underlies the allele frequency estimates is available as a flat text file download for each selected population, allowing user-defined analyses such as tests for independence or intrapopulation and interpopulation Fst.
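As a rough illustration of how allele frequency estimates can be derived from a query over joined genotype records, the sketch below pools allele counts across a user-chosen set of populations. The record layout, the SNP name, and the genotypes are assumptions made for the example; the browser's actual schema and query code are not given in the text.

```python
from collections import Counter


def allele_frequencies(records, snp, populations):
    """Pool allele counts for one SNP over the chosen populations."""
    counts = Counter()
    for rec in records:
        if rec["population"] in populations:
            # A genotype string such as "AG" contributes one A and one G.
            counts.update(rec["genotypes"][snp])
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()} if total else {}


# Hypothetical records; population names are taken from the paper,
# genotypes are invented.
records = [
    {"population": "Denmark", "genotypes": {"snp1": "AA"}},
    {"population": "Denmark", "genotypes": {"snp1": "AG"}},
    {"population": "NW Spain", "genotypes": {"snp1": "GG"}},
    {"population": "Mozambique", "genotypes": {"snp1": "AG"}},
]

denmark_only = allele_frequencies(records, "snp1", {"Denmark"})
europe = allele_frequencies(records, "snp1", {"Denmark", "NW Spain"})
```

Pooling counts weights each population by its sample size; averaging per-population frequencies would weight populations equally instead, and the text does not say which the browser uses.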

The SNPforID project represents the sum of efforts from six laboratories spread across Europe, so all the genotyping data generated required curation before being combined. A simple format database was created to form the basis for joining all available data and to allow for future developments that can also work from the same data. Data was indexed by sample and contained information on the contributing laboratory, gender, population of origin (ascertained from the donors’ declaration of their immediate ancestry), and 52 SNP genotypes. The curation process that checks data quality encompasses scrutiny of GeneMapper ID or Genotyper output from SNaPshot genotyping submissions made outside the SNPforID laboratories plus assessment of Hardy–Weinberg equilibrium using chi-squared analysis together with Fst measurements comparing new populations with those of the same group.
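The Hardy–Weinberg check can be sketched for a single biallelic SNP as follows. This is the textbook chi-squared goodness-of-fit statistic computed from genotype counts, offered as an assumption about the general form of the test rather than the consortium's actual procedure; the function name and the counts are illustrative.

```python
def hwe_chi_squared(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-squared statistic comparing observed genotype counts to HWE expectations."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A from the counts
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic locus) to avoid dividing by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Invented counts: a sample in perfect equilibrium and one with no heterozygotes.
in_equilibrium = hwe_chi_squared(25, 50, 25)
no_heterozygotes = hwe_chi_squared(50, 0, 50)
```

With one degree of freedom, a statistic above 3.84 corresponds to p < 0.05. Small samples make the chi-squared approximation unreliable, which fits the paper's restriction of this check to populations with sufficient numbers of samples.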

A minor logistical problem in the initiation of the database was the collation and standardization of the data into a single repository. The binary nature of autosomal SNP data makes this process much easier than with multiple allele and haploid polymorphic loci utilized elsewhere in forensic science population databases like the Y Chromosome Haplotype Reference Database (YHRD, [10]) and Mitochondrial DNA Control Region Database (EMPOP, [7]). Therefore, all bases were inverted when necessary (e.g., CT base calls converted to AG) to match those listed in the Santa Cruz genome browser summary of dbSNP reference SNP data. Heterozygote genotypes were alphabetized and the locus listing order was, by previous convention, p-arm SNaPshot electrophoretic mobility (Auto1 SNPs) then q-arm (Auto2). Because all contributing laboratories used SNaPshot for the validation of populations, base standardization anticipates future submissions to the database from alternative genotyping platforms. To check genotyping quality, chi-squared analysis was made of the observed and expected genotype ratios in all populations having sufficient numbers of samples, although this had been previously performed on similar data to study interlaboratory concordance [13]. In addition, SNPforID allele frequency estimates for African, European, and East Asian population groups were compared to those from the equivalent population panels of HapMap (termed Yoruba from Ibadan, Nigeria [YRI]; CEPH Utah residents with European ancestry [CEU], and ASN, respectively, with ASN representing a panel of Chinese from Beijing [CHB] and Japanese from Tokyo [JPT] populations combined [1]).

Implementation

The web tool has been written in PHP and HTML, and it acts as an interface to the underlying database, an example of which is shown in Fig. 1. It was designed to go beyond text queries, and so certain graphical aids were developed to address this need. The first query point is a browsable world map allowing the visitor to locate each studied population and obtain frequency data with a single click. We used our own customized version of the DIY Map [5], a clickable zooming map written in Flash and configurable through an XML file providing the ability to not only spot the population locations and their population groups, but also to implement simple queries activated directly through clicks.

The graphical system of the data summary returned from the query provides visitors with a flexible and intuitive approach to the scrutiny of allele frequencies from single populations and in comparison to combinations of populations, enhanced to allow comparison of results using two different queries in parallel. This search system established itself as the main core of the application because all the possible queries that visitors were predicted to run had to be included together with the ability to preempt incorporation of future submissions of new populations or SNP sets to the database. As a result, the database is dynamically updated at the point in time each query is made, so the search page contains all current available data once it has been checked, curated, and incorporated. The same real-time updating process applies to the HapMap frequency data that is included in the data summary when available (48 out of the 52 SNPs have now been characterized by HapMap). In summary, the SNP data obtained from a query will always provide the most current frequency estimates for each SNPforID and equivalent HapMap population: updated in real-time at the moment the query is made.
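The design choice of computing frequencies when the query is made, rather than storing a precomputed frequency table, can be sketched with an in-memory SQLite database. The browser itself is written in PHP and its database engine and schema are not described, so the table layout, column names, and rows here are purely hypothetical.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE genotypes (sample_id TEXT, population TEXT, snp TEXT, genotype TEXT)"
)
conn.executemany(
    "INSERT INTO genotypes VALUES (?, ?, ?, ?)",
    [("s1", "Denmark", "snp1", "AA"), ("s2", "Denmark", "snp1", "AG")],
)


def query_frequencies(conn, snp, populations):
    """Compute allele frequencies from whatever rows exist at query time."""
    placeholders = ",".join("?" for _ in populations)
    rows = conn.execute(
        f"SELECT genotype FROM genotypes WHERE snp = ? AND population IN ({placeholders})",
        [snp, *populations],
    )
    counts = Counter()
    for (genotype,) in rows:
        counts.update(genotype)
    total = sum(counts.values())
    return {a: n / total for a, n in counts.items()} if total else {}


before = query_frequencies(conn, "snp1", ["Denmark"])
# A newly curated sample is added; no frequency table needs rebuilding.
conn.execute("INSERT INTO genotypes VALUES ('s3', 'Denmark', 'snp1', 'GG')")
after = query_frequencies(conn, "snp1", ["Denmark"])
```

Because nothing is cached, the second query reflects the new sample immediately, at the cost of recounting on every request.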

In keeping with the clean, easily interpreted pie chart summaries of SNP variability used successfully in the HapMap genome browser [14], we have mirrored the same approach in the pie charts used to visualize frequencies for each SNPforID population or their combination, although actual allele frequencies are also listed as numeric values alongside the pie charts in the search return page. Charts display blue segments denoting the reference allele and red segments denoting the alternative allele with frequencies charted from 0.01 to 0.99. It is important to note two elements of the HapMap pie chart approach: (1) the reference allele segment is positioned counterintuitively on the left side of the zero point, i.e., from −3.6° (0.01 frequency) to −356° (0.99) and (2) triallelic SNPs, which are now also in the browser as part of the ancestry-informative SNP sets from SNPforID [8, 9], were not included in the 1.1 million phase I SNPs characterized by HapMap. Therefore, the convention we propose to adopt for triallelic SNPs is to add a green segment for the third allele, denoting the least frequent allele observed in Africans and so likely to be the most recently derived substitution at the SNP.
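The angular convention in point (1) amounts to a simple mapping from frequency to a negative sweep. The sketch below assumes the segment starts at the zero point and that frequencies are rounded to two decimal places before charting; the function name is illustrative and the browser's drawing code is not shown in the paper.

```python
def reference_segment_sweep(ref_freq: float) -> float:
    """Signed sweep, in degrees, of the reference-allele segment from the zero point."""
    charted = round(ref_freq, 2)   # assumed rounding to 1% charting precision
    return -360.0 * charted


smallest = reference_segment_sweep(0.01)
largest = reference_segment_sweep(0.99)
quarter = reference_segment_sweep(0.2549)
```

A frequency of 0.99 maps to −356.4°, which the text quotes rounded as −356°.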

Results

Depending on the options chosen for a search, the pie charts plotted in the query return page represent allele frequency estimates calculated from single populations or their combinations as a single column for the search option plus multiple columns for up to four user-defined comparisons. Five population groups are summarized in a set of pie charts using the grouping of populations outlined in the search page listings that Fig. 2 shows. This grouping is based on a previous study of global variability that found a close match between geographic distribution of populations and genetic clustering using STRUCTURE to arrange populations into groups based on patterns of variability [11]. Using the same clustering algorithm for the 52 SNPs and 9 of the validation populations in the browser gave a broadly similar grouping within the confines of a much smaller range of loci and study populations (Fig. 3, K=4 in [13]). The separate listing of the South Asian population sample to those from Europe is a potentially contentious arrangement because populations from this region tended to cluster with other Eurasian populations from Europe, North Africa, and the Middle East in the Rosenberg study; however, the browser allows this population sample to be included with the six European populations or analyzed separately so the added flexibility provided is worth retaining, particularly as additional South Asian populations are likely to be sampled and submitted to the browser. This last point also illustrates the potential of a combine-and-compare approach in studying differences between populations because the pie charts provide an intuitive system for visualizing the contrasting allele frequency distributions found in some of the SNPs in the 52-SNP set. Such SNPs comprise about 10% of the full set and were chosen deliberately to provide indicators of geographic origin in the same way STR data can be used for this purpose [6]. Therefore, it seems likely that the use in the near future of dedicated sets of ancestry-informative SNP sets including those of SNPforID [9] will also benefit from the system of allele frequency visualization adopted for this browser.

Fig. 1 Example snapshot from the joined SNPforID database. Entry columns denote, from left to right, originating center; center sample ID; SNPforID sample identifier; gender; population of origin; population group; and genotypes (A01–A54 in the same order of SNPs as search page top to bottom, allowing a direct transposition from a curated Excel file to the database)

To statistically assess the goodness of fit of allele frequency estimates from SNPforID and HapMap, an r² analysis was performed on appropriate population groupings matched to the HapMap study panels described above. ESM Fig. 1 presents an analysis of allele frequency estimate correlation between SNPforID and HapMap genotyping for 48 of 52 SNPs analyzed in common. Goodness of fit between the paired datasets was assessed using r² analysis of appropriate SNPforID study population groupings matched to the HapMap study panels: (a) European (EUR vs CEU), (b) East Asian (ASN vs combined CHB/JPT), and (c) African (AFR vs YRI). The listed r² values indicate good correlation of SNPforID and HapMap frequency estimates for all loci and each pair of population groups.
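The text does not specify how r² was computed; a common reading is the squared Pearson correlation between paired per-SNP frequency estimates. The sketch below follows that assumption, and the frequency lists are invented rather than taken from SNPforID or HapMap.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


# Hypothetical reference-allele frequencies for the same three SNPs
# in two matched panels.
panel_a = [0.10, 0.50, 0.90]
panel_b = [0.10, 0.50, 0.90]
identical = r_squared(panel_a, panel_b)
unrelated = r_squared([0.0, 1.0, 2.0], [0.0, 1.0, 0.0])
```

A high r² shows only that the two sets of estimates vary together; a systematic offset between panels would not lower it, so a scatter plot such as the one in ESM Fig. 1 remains a useful complement.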

As an illustration of the standard display features of the browser, a dataset of samples from Spain and Mozambique is illustrated in ESM Fig. 2 because both populations represent a data subset that can be readily compared to their continental-based population groups of Europeans and Africans, respectively. ESM Fig. 2 illustrates a complete query result for NW Spain and Mozambique with summary population-group pie charts showing allele frequency data for each SNP and the equivalent HapMap estimates when present. The SNPforID population-group pie charts are designed to match the order of HapMap charts: EUR/CEU (SNPforID European/HapMap CEPH European from Utah of northern and western European ancestry), ASN/CHB+JPT combined (SNPforID East Asian/combined Chinese from Beijing and Japanese from Tokyo), AFR/YRI (SNPforID African/Yoruba of Ibadan, Nigeria), plus SAS=SNPforID South Asian and AME=SNPforID American.

Fig. 2 a Search options available in the search page. Offset upper row tick-boxes allow combination of the listed populations of each region to create a full panel or population group. b Comparison options available in the search page. In each case, combinations can be tailored by the user to more closely match geographic distribution; in the example, ticking Argentina and Colombia in the search populations query and Greenland alone in the compare populations query permits comparison of North and South American population groups

It is important to note that although all database profiles are complete, the sample number ranges from 7 (Japan) to 156 (Denmark) and clearly certain small population samples require interpretation with caution or exclusion altogether. The population data is structured in columns and the SNP data is structured in rows for all collated pie chart sets and corresponding full-frequency figures. These allele frequencies are shown numerically in columns under their corresponding genotyped base to four decimal places, and the pie charts are drawn to 1% allele frequency precision. A column of hyperlinks to dbSNP provides a convenient system for obtaining additional data for the individual SNP locus if required. The complete dataset of 52 SNP genotype profiles from all the populations listed (outside of HapMap) that underlies the allele frequency estimates is available as a flat text file download for each selected population, allowing user-defined analyses such as tests for independence or intrapopulation and interpopulation Fst.
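As an example of the kind of user-defined analysis the downloadable genotypes permit, the sketch below computes a simple Wright-style Fst for one biallelic SNP in two equally weighted populations. The paper does not say which Fst estimator was used; sample-size-corrected estimators are generally preferred for real data, so this is illustrative only, with invented frequencies.

```python
def fst_biallelic(p1: float, p2: float) -> float:
    """Fst = (HT - HS) / HT for one biallelic locus and two equally weighted populations."""
    # Mean expected heterozygosity within the two populations.
    hs = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # Expected heterozygosity of the pooled population.
    p_bar = (p1 + p2) / 2
    ht = 2 * p_bar * (1 - p_bar)
    if ht == 0:
        return 0.0   # locus is monomorphic in both populations
    return (ht - hs) / ht


no_difference = fst_biallelic(0.3, 0.3)
fixed_difference = fst_biallelic(0.0, 1.0)
moderate = fst_biallelic(0.2, 0.8)
```

With samples as small as seven individuals, as noted above for Japan, estimates of this kind carry wide uncertainty.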

Finally, at the time of writing, the website registered an average of 150 visits per month. The browser has been available to the public since December 2005 and has benefited in particular from links placed in the STRbase forensic marker information portal run by the National Institute of Standards and Technology (NIST, [12]) and the SNPforID homepage (http://www.snpforid.org/).

Discussion

The SNPforID browser represents a simple but highly effective visualization method to query and display the genotype data of the SNPforID project. The format of the pie chart graphics also helps the researcher to quickly review the data, and the comparison with HapMap data as an external resource adds an appropriate system for confirming the precision of the allele frequency estimates given, with both datasets being updated in real-time immediately before the display of the query results. This browser has been designed to be a web tool that can be rapidly accessed by the forensic practitioner requiring instant allele frequency data retrieval for a specific population plus a comparison at the same time with samples of global variability, and it is directly available at http://spsmart.cesga.es/snpforid.php.

Databases can fall into the trap of becoming static and out-of-date entities if they are not updated regularly. We have avoided this problem by recalculating allele frequency results at the moment a query has been submitted and by retrieving the current HapMap data at the same time. As well as ensuring all data displayed is the most current available, the dynamic system of data management we have adopted makes it easier to incorporate new data and to welcome submissions via e-mail from the worldwide forensic community (see contact information on the title page). This may represent a more efficient way to disseminate allele frequency data from an extending range of global populations than the conventional system of journal publication of allele frequency data. However, such an approach brings with it the problems of quality management more easily addressed in the curation of online haplotype loci databases mentioned previously (YHRD and EMPOP) where phylogenetic methods can be applied to check for typing errors. For this reason, we have decided to require scrutiny of raw genotyping data generated by contributing laboratories outside the SNPforID consortium. We now include the ancestry-informative SNPs developed by SNPforID [9] that supplement the identification SNP set. Ancestry-informative SNPs in particular benefit from the broadest range of shared population data because they show higher overall variability between populations. One favorable feature of autosomal SNP data in general is that relatively small population samples provide reliable allele frequency estimates. Therefore, submitting data to a shared database for SNPs of forensic interest should not represent a prohibitive amount of effort from those interested in validating these loci for forensic applications in their own laboratories.

Finally, we intend to allow for the possibility of linking allele frequency data to individual genotype profiles from widely used standard control samples such as the CEPH–HGDP panel of population samples [4] or the Coriell cell repositories control sample set. This would offer the simplest system for providing control profiles to help researchers that are establishing genotyping assays for the SNPforID loci in their laboratories for the first time.

Acknowledgements The authors wish to thank Albert Vernon Smith and Lalitha Krishnan of the HapMap Project for their guidance in helping us link the browser to the HapMap SNP dataset, and Antonio Salas for his help with the genotyping quality assessment. We also would like to thank the Centro de Supercomputación de Galicia (CESGA) for their web hosting service and technical support. Funding from Xunta de Galicia: PGIDTIT06PXIB228195PR and Ministerio de Educación y Ciencia: proyecto BIO2006-06178 given to ML partially supported this work.

References

1. Altshuler D, Brooks LD, Chakravarti A, Collins FS, Daly MJ, Donnelly P, The International HapMap Consortium (2005) A haplotype map of the human genome. Nature 437:1299–1320

2. Brandstatter A, Salas A, Niederstatter H, Gassner C, Carracedo A, Parson W (2006) Dissection of mitochondrial superhaplogroup H using coding region SNPs. Electrophoresis 27:2541–2550

3. Brion M, Sanchez JJ, Balogh K et al (2005) Introduction of an single nucleotide polymorphism-based “major Y-chromosome haplogroup typing kit” suitable for predicting the geographical origin of male lineages. Electrophoresis 26:4411–4420

4. Cann HM, de Toma C, Cazes L et al (2002) A human genome diversity cell line panel. Science 296:261–262

5. Emerson J (2006) DIY Map: a clickable and zoomable map written in Flash. Available at http://www.backspace.com/mapapp/

6. Lowe AL, Urquhart A, Foreman LA, Evett IW (2001) Inferring ethnic origin by means of an STR profile. Forensic Sci Int 119:17–22

7. Parson W, Brandstatter A, Alonso A et al (2004) The EDNAP mitochondrial DNA population database (EMPOP) collaborative exercises: organisation results and perspectives. Forensic Sci Int 139:215–226

8. Phillips C, Lareu V, Salas A, Carracedo A (2004) Non binary single-nucleotide polymorphism markers. In: Doutremepuich C, Morling N (eds) Progress in forensic genetics, 10. Elsevier, Amsterdam, pp 30–32

9. Phillips C, Salas A, Sanchez JJ et al (2007) Inferring ancestral origin using a single multiplex assay of ancestry-informative marker SNPs. Forensic Sci Int Genetics 1:233–235

10. Roewer L, Krawczak M, Willuweit S et al (2001) Online reference database of European Y-chromosomal short tandem repeat (STR) haplotypes. Forensic Sci Int 118:106–113

11. Rosenberg NA, Pritchard JK, Weber JL, Cann HM, Kidd KK, Zhivotovsky LA, Feldman MW (2002) Genetic structure of human populations. Science 298:2381–2385

12. Ruitberg CM, Reeder DJ, Butler JM (2001) STRBase: a short tandem repeat DNA database for the human identity testing community. Nucleic Acids Res 29:320–322

13. Sanchez JJ, Phillips C, Borsting C et al (2006) A multiplex assay with 52 single nucleotide polymorphisms for human identification. Electrophoresis 27:1713–1724

14. Thorisson GA, Smith AV, Krishnan L, Stein LD (2005) The international HapMap project web site. Genome Res 15:1592–1593

