Forensic Practitioner's Guide to the Interpretation of Complex DNA Profiles
By Peter Gill, Øyvind Bleka, Oskar Hansson and
About this ebook
Over the past twenty years, there’s been a gradual shift in the way forensic scientists approach the evaluation of DNA profiling evidence that is taken to court. Many laboratories are now adopting ‘probabilistic genotyping’ to interpret complex DNA mixtures. However, current practice is very diverse: a whole range of technologies is used to interpret DNA profiles, and the software approaches advocated are commonly used throughout the world.
Forensic Practitioner’s Guide to the Interpretation of Complex DNA Profiles places the main concepts of DNA profiling into context and fills a niche that is unoccupied in current literature. The book begins with an introduction to basic forensic genetics, covering a brief historical description of the development and harmonization of STR markers and national DNA databases. The laws of statistics are described, along with the likelihood ratio based on Hardy-Weinberg equilibrium and alternative models considering sub-structuring and relatedness. The historical development of low template mixture analysis, theory and practice, is also described, so the reader has a full understanding of rationale and progression. Evaluation of evidence and statement writing is described in detail, along with common pitfalls and their avoidance.
The authors have been at the forefront of the revolution, having made substantial contributions to theory and practice over the past two decades. All methods described are open-source and freely available, supported by sets of test-data and links to web-sites with further information. This book is written primarily for the biologist with little or no statistical training. However, sufficient information is also provided for the experienced statistician. Consequently, the book appeals to a diverse audience.
- Covers short tandem repeat (STR) analysis, including database searching and massive parallel sequencing (both STRs and SNPs)
- Encourages dissemination and understanding of probabilistic genotyping by including practical examples of varying complexity
- Written by authors intimately involved with software development, training at international workshops and reporting cases worldwide using the methods described in this book
Peter Gill
Dr. Peter Gill joined the Forensic Science Service (FSS) in 1982. He began his research into DNA in 1985, collaborating with Sir Alec Jeffreys of Leicester University. In the same year they published the first demonstration of the forensic application of DNA profiling. In 1987, Dr. Gill was given an award under the civil service inventor’s scheme for discovery of the preferential sperm DNA extraction technique and the development of associated forensic tests. He was employed as Senior Principal Research Scientist at the Forensic Science Service (FSS). Currently, he holds concurrent positions at Oslo University Hospital and the University of Oslo, where he is Professor of Forensic Genetics.
Romanovs: In 1993-4, Dr. Gill was responsible for leading the team which confirmed the identity of the remains of the Romanov family, murdered in 1918, and also the subsequent investigation which disproved the claim of Anna Anderson to be the Duchess Anastasia (using tissue preserved in a paraffin wax block for several decades). This was an early example of an historical mystery that was solved by the analysis of very degraded and aged material, and was one of the first demonstrations of low-template DNA analysis.
Low-template DNA: In relation to the above, Dr. Gill was responsible for developing a routine casework-based ‘super-sensitive’ method of DNA profiling that was capable of analysing DNA profiles from a handful of cells. This method was originally known as low-copy-number (LCN) DNA profiling. Now it is known as low-template DNA profiling. New statistical methods and thinking were also developed to facilitate the new methods.
National DNA database: Dr. Gill was responsible for leading the team that developed the first multiplex DNA systems to be used in a National DNA database anywhere in the world, and for the design of interpretation methods that are in current use (c.1995).
Court reporting: Dr. Gill has been involved with giving evidence in several high profile (controversial) cases, including the Doheny / Adams appeals, and the Omagh bombing trial in the UK.
Membership of scientific societies: Currently, Dr. Gill is a member of the European Network of Forensic Science Institutes and ex-chair of the ‘methods, analysis and interpretation sub-section’. He is chair of the International Society for Forensic Genetics DNA commission on mixtures and has written a number of ISFG recommendations on low-template, mixture interpretation and evaluation of evidence that are highly cited. Dr. Gill is a member of the European DNA Profiling Group (EDNAP). He has published more than 200 papers in the international scientific literature, which have been cited more than 20,000 times; many of these are collaborative papers under the auspices of ISFG, EDNAP and ENFSI. He is the recipient of the 2013 Scientific Prize of the International Society for Forensic Genetics.
Affiliations and Expertise: Forensic Genetics Research Group, Oslo University Hospital; Institute of Clinical Medicine, University of Oslo, Norway
Book preview
Chapter 1
Forensic genetics: the basics
Abstract
The interpretation of evidence is rooted in population genetics theory. The fundamental principle that underpins this is the Hardy–Weinberg equilibrium. The inheritance of genes follows laws of probability that enable probability estimates to describe the rarity of a DNA profile. Such calculations rely upon the Hardy–Weinberg equilibrium assumption of independence, i.e., that populations are very large, randomly interspersed and randomly mating. However, such assumptions are rarely justified: populations are structured into sub-populations, and the frequencies of their alleles differ. To compensate, the FST statistic is used as a measure of population differentiation. There are initiatives to collate data to provide community population databases that can be accessed; for example, STRidER acts as a repository of datasets that are subject to rigorous quality control before release.
The likelihood ratio is fundamental to all aspects of interpretation of forensic evidence. Likelihood ratios are very flexible. In particular, they can be used to evaluate DNA mixtures, which are the subject of much of the material to be found in subsequent chapters. The early genetics theory applied to mixtures is described, limited to two contributors. The calculations are very complex. The need for computer algorithms to take over the burden of calculation is described, and there is a demonstration of the principle of extension to multiple contributors. A list of software available to carry out analysis of mixtures is provided. Finally, extension of the theory to relatedness (kinship) tests is introduced.
Keywords
Hardy–Weinberg equilibrium; FST; STRidER; likelihood ratio
The interpretation of forensic genetic evidence is based upon probability. Probability is expressed as a number that is somewhere between zero and one, representing two extremes: a probability of zero means that something is impossible, whereas a probability of one means that something is certain. In practice, a probability is never exactly zero or one—it is usually somewhere between the two extremes.
In forensic genetics, a probability is usually equated to the frequency of observation of a particular type. Before the DNA era, probabilities were applied to blood groups. One of the first used for forensic typing was the ABO grouping that was credited to the Austrian scientist Karl Landsteiner, who identified the O, A, and B blood types in 1900.
The inheritance of the ABO blood groups follows the laws of Mendelian genetics. Chromosomes are inherited in pairs. There are two genes, one is inherited from the mother and the other from the father. The genes are inherited via gametes, i.e., sperm of the father and the eggs (ova) of the mother. To ensure that the offspring only has a single pair of genes per cell, the gametes only contain a single copy.
Some basic definitions follow:
Gene: The gene is a stretch of DNA positioned on a chromosome. The gene may have a function, such as producing proteins that determine eye colour. However, genes that are currently analyzed in forensic science are sometimes described as junk DNA since they have no known function, although this idea has been challenged.
Locus: Describes the position of a gene on a chromosome, commonly expressed by a universal identifier, such as D22S11.
Allele: Genes of forensic interest are variable; this means that there are different versions of genes, where the DNA code differs.
For example, the blood group ABO gene is positioned on chromosome 9 at the band/locus 9q34.2, which means the long (q) arm of chromosome 9 at position 34.2. Over recent years, the human genetics community has compiled two human genome assemblies called GRCh37 and GRCh38. Both are used as references in human genome databases, such as the NCBI Genome Browser http://www.ncbi.nlm.nih.gov. Because it is the most up-to-date, GRCh38 is recommended by the DNA Commission of the ISFG [1]. The molecular location of the ABO gene is from base pair 133,250,401 to base pair 133,275,201 on chromosome 9: https://ghr.nlm.nih.gov/gene/ABO#location. The gene has three common alleles, namely A, B, and O. In diploid cells, there are six possible combinations, called genotypes, that are observed in the population. These are AA, AB, AO, BO, OO, BB, and they are described as mutually exclusive to the individual, meaning that a person can only have one genotype, but mutually inclusive with respect to the population, meaning that any given individual selected from the population must have one of these genotypes.¹
Alleles A and B are both dominant to allele O. If a person has an AO or BO genotype, the O is masked, which results in the person being typed as A or B, respectively. The masked O allele is called recessive. Dominant alleles mask the genotype of the person. People are classed as being A, B, or O phenotypes, where a phenotype can be expressed by more than one genotype.
With conventional DNA profiling, we do not need to worry about phenotypes, because we deal with well defined genetic sequences.
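The relationship between ABO genotypes and phenotypes can be sketched in a few lines of Python. This is only an illustration: the function name `phenotype` is our own, and the AB phenotype (A and B expressed together) is standard blood-group knowledge rather than something stated in the text above.

```python
from itertools import combinations_with_replacement

ALLELES = ("A", "B", "O")

def phenotype(genotype):
    """Return the ABO blood type for an unordered pair of alleles."""
    present = set(genotype)
    if present == {"A", "B"}:
        return "AB"   # assumption from standard genetics: A and B are co-dominant
    if "A" in present:
        return "A"    # AA or AO: the O allele is masked
    if "B" in present:
        return "B"    # BB or BO: the O allele is masked
    return "O"        # only OO is typed as O

# The six unordered genotypes: AA, AB, AO, BB, BO, OO
genotypes = ["".join(pair) for pair in combinations_with_replacement(ALLELES, 2)]

for g in genotypes:
    print(g, "->", phenotype(g))
```

Genotypes AA and AO both map to phenotype A, which is why a phenotype cannot always be traced back to a single genotype.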
The variability of the gene forms the basis of its usefulness to discriminate between individuals. Usually, the aim is to associate crime-stain evidence to some specific individual, typically a suspect who may have been apprehended. He/she may be described as the questioned individual, or the person of interest (POI). Note that the POI is not always the suspect; sometimes he/she may be a victim, for example, where a body fluid stain is found on the clothing of the suspect.
1.1 Short tandem repeat (STR) analysis
Short tandem repeats (STRs) are blocks of tandemly repeating DNA sequences found throughout the human genome. Forensic laboratories usually use four base pair repeats, because shorter repeat units are much more prone to artefacts known as stutters. For a comprehensive review of STRs currently used in casework, the reader is referred to [2] and the NIST STRBase website https://strbase.nist.gov/index.htm, which lists sequences of common and rare alleles. Refer to Parson et al. [1], supplemental materials, for full details of sequences using up-to-date recommendations of the ISFG DNA commission.
There are three kinds of repeat sequences defined by Urquhart et al. [3]. Simple repeats contain units of identical length and sequence; compound repeats comprise two or more adjacent simple repeats; complex repeats may contain several repeat blocks of variable unit length, along with variable intervening sequences.
Simple repeat example The STR HUMTH01 locus is an example of a simple AATG sequence ranging from three to 14 repeats, and it is written in shorthand as [AATG]a, where a = the number of repeats. A common microvariant allele is observed that consists of a single base deletion in the seventh repeat of the 10 allele, which results in a partial repeat of just three bases. It is signified as [AATG]6 ATG [AATG]3, and the nomenclature follows as HUMTH01 9.3.
Compound repeat example HUMVWA is an example of a compound repeat locus² [TCTA]a [TCTG]b [TCTA]0−1, where the final sequence is either not observed or observed once.
Complex repeat example D21S11 is a highly polymorphic locus with a compound structure of several intervening sequences [TCTA]a, [TCTG]b, [TCTA]c, TA [TCTA]d, TCA [TCTA]e, TCCATA [TCTA]f.
Repeat unit nomenclature is standardized for capillary gel electrophoresis (CE) applications. This enables universal comparisons between laboratories and national DNA databases to be achieved. The nomenclature used is based upon the number of repeat sequences [4].
Existing designation systems that are universally applied to national DNA databases are based upon the repeating structure of typical reference alleles that were discovered and characterized in the early to mid 1990s. All new allelic variant designations must fit within the scheme, regardless of sequence, and are strictly conditioned upon the number of bases that are counted in the fragment length. Comparisons are made against an allelic ladder. This means that the correspondence between the length of the STR repeat and the reference sequenced repeat does not necessarily hold. Let us suppose that there is a deletion of a single base in the flanking region of an amplicon in an allele 27 variant; though the repeating structure may be identical to that listed, the allelic designation must change to 26.3. Consequently, this allele designation no longer reflects the repeat structure of the reference sequence.³
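The length-based logic of the designation can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the function `designate` is hypothetical, it assumes a four-base repeat unit, and it takes the length in bases attributed to the repeat region as its input, whereas in practice fragments are sized against an allelic ladder.

```python
def designate(repeat_region_bases, unit=4):
    """Length-based allele designation: full repeats, then leftover bases."""
    full, partial = divmod(repeat_region_bases, unit)
    return str(full) if partial == 0 else f"{full}.{partial}"

# HUMTH01 microvariant [AATG]6 ATG [AATG]3: 6*4 + 3 + 3*4 = 39 bases
print(designate(6 * 4 + 3 + 3 * 4))

# An allele 27 that has lost one base anywhere in the amplicon is one base
# shorter than 27 full repeats, whatever its repeat structure
print(designate(27 * 4 - 1))
```

The two calls give 9.3 and 26.3, respectively. In the second case the designation changes even though the repeat structure itself may be intact, which is the point made above.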
With the introduction of massively parallel sequencing (MPS), the issue of nomenclature has achieved new prominence [1]. Whereas the sequence information is generally hidden from view with conventional CE applications, with MPS, this information is available. This results in the observation of polymorphisms, where there are sequence differences between alleles, though the STR fragment sizes are identical. With CE, all of these polymorphisms would be classed together, whereas with MPS they can be separated, with the resulting increase in discriminating power. There is continued discussion on nomenclature in relation to MPS in Chapter 13.9.1. The main aim is to devise a new nomenclature to maximise the benefits of MPS, whilst at the same time retaining back-compatibility with existing CE repeat unit nomenclature.
1.1.1 Historical development of multiplexed systems
Short tandem repeat (STR) analysis was introduced into forensic casework about 25 years ago. The ability to combine several markers to form multiplexes and to subsequently visualize the results by automated fluorescent sequencing made national DNA databases feasible. The first example was launched in 1995 by the UK Forensic Science Service (FSS). In total there have been three iterations of multiplexes.
Early multiplexes consisted of relatively few loci based on simple STRs. The four locus quadruplex was the first multiplex to be used in casework, and was developed by the Forensic Science Service (FSS) [5]. Because it consisted of just four STRs, there was a high probability of a random match, approximately 1 in 10,000. In 1995 the FSS re-engineered the multiplex, producing a 6 locus STR system combined with the amelogenin sex test [6]. This acquired the name second generation multiplex (SGM). The addition of complex STRs, D21S11 and HUMFIBRA/FGA [7], which have greater variability than simple STRs, decreased the probability of a random match to about 1 in 50 million. In the UK, the introduction of SGM in 1995 facilitated the implementation of the UK national DNA database (NDNAD) [8]. As databases became much larger, the number of pairwise comparisons increased dramatically, so it became necessary to ensure that the match probability of the system was sufficient to minimize the chance of two unrelated individuals matching by chance (otherwise known as an adventitious match). Consequently, as the UK NDNAD grew in its first four years of operation, a new system known as the AmpFlSTR SGM Plus was introduced in 1999. This system comprised 10 STR loci with amelogenin, replacing the previous SGM system. To ensure continuity of the DNA database, and to enable the new system to match samples that had been collated in previous years, all six loci of the older SGM system were retained in the new AmpFlSTR SGM Plus system.
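The effect of database size on adventitious matches can be sketched as follows. The match probabilities (1 in 10,000 and 1 in 50 million) are those quoted above; the database size is hypothetical, and the calculation makes the simplifying assumptions that every pair of profiles is unrelated and shares the same match probability.

```python
from math import comb

def expected_adventitious_matches(n_profiles, match_probability):
    """Expected number of chance matches among all pairs of profiles."""
    pairwise_comparisons = comb(n_profiles, 2)  # n(n-1)/2 distinct pairs
    return pairwise_comparisons * match_probability

n_profiles = 1_000_000  # hypothetical database size, for illustration only
for label, mp in [("quadruplex", 1 / 10_000), ("SGM", 1 / 50_000_000)]:
    expected = expected_adventitious_matches(n_profiles, mp)
    print(f"{label}: about {expected:,.0f} expected adventitious matches")
```

Because the number of pairs grows with the square of the database size, a match probability that is adequate for a small database can become insufficient as the database grows.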
1.2 Development and harmonization of European National DNA databases
Harmonization of STR loci was achieved by collaboration at the international level. Notably, the European DNA profiling group (EDNAP) carried out a series of successful studies to identify and to recommend STR loci for the forensic community to use. This work began with an evaluation of the simple STRs HUMTH01 and HUMVWFA31 [10]. Subsequently, the group evaluated D21S11 and HUMFIBRA/FGA [11]. Recommendations on the use of STRs were published by the ISFG [4].
Most, if not all, European countries have legislated to implement national DNA databases that are based upon STRs [12]. In Europe, there has been a drive to standardize loci across countries to meet the challenge of increasing cross-border crime. In particular, a European Community (EC) funded initiative led by the ENFSI group was responsible for co-ordinating collaborative exercises to validate commercially available multiplexes for general use [13]. National DNA databases were introduced in 1997 in Holland and Austria; 1998 in Germany, France, Slovenia, and Cyprus; 1999 in Finland, Norway, and Belgium; 2000 in Sweden, Denmark, Switzerland, Spain, Italy, and Czech Republic; 2002 in Greece and Lithuania; 2003 in Hungary; 2004 in Estonia and Slovakia [14].
A parallel process has occurred in Canada [15,16] and in the US [17], where standardization was based on thirteen STR loci, known as the Combined DNA Index System (CODIS) core loci.
An FBI-sponsored CODIS core loci working group recommended an expanded set of loci from the thirteen in use in 2011 [18,19]. There followed an extensive validation study, which resulted in the recommendation that seven new loci were to be adopted [20], resulting in 20 CODIS core loci to be implemented by 2017. The additional seven loci included the five new European ESS markers, plus D2S1338 and D19S433 (see next section). This resulted in comparability of the CODIS core and expanded ESS to have 15 DNA loci in common [21].
1.2.1 Development of the European set of standard (ESS) markers
Based on the initial EDNAP exercises and recommendations by ENFSI and the Interpol working party [22], four loci were originally defined as the European standard set (ESS) of loci: HUMTH01, HUMVWFA31, D21S11, and HUMFIBRA/FGA. The identity of these loci was dictated by their universal incorporation into different commercial multiplexes that were utilized by member states. By the same rationale, three additional loci were added to this set: D3S1358, D8S1179, and D18S51. These loci are the same as the standard set of loci identified by Interpol for the global exchange of DNA data.
A subsequent expansion of ESS loci was motivated by the Prüm treaty of 2005. The new loci were officially adopted by the European Commission [28] and Interpol in 2010; this led to development of a series of new multiplexes by the major companies (Promega, Life Technologies, and Qiagen). See Fig. 1.1. Practically speaking there are sixteen loci, since D16S539, D19S433, D2S1338, and SE33 are all included in European multiplexes in addition to the ESS markers. A complete list of multiplex kits and their loci can be accessed from the NIST website https://strbase.nist.gov/multiplx.htm.
Figure 1.1 Commonly used multiplex kits showing ESS and CODIS loci relative to molecular weights (bp) using different dye markers.
New biochemistry has simultaneously increased the sensitivity of tests, to the extent that the once controversial low-level or low-template (LT-) DNA analysis is considered to be routine (Chapter 4). However, this is not without challenge. LT-DNA profiles tend to be complex mixtures, with problems of missing alleles, known as drop-out. This book will explain how statistical methods, based on likelihood ratio (LR) estimation, have been critical to improve the interpretation of evidence.
1.3 Hardy–Weinberg equilibrium
There is a fundamental principle that underpins all population genetics: the Hardy–Weinberg equilibrium, named after two scientists who simultaneously discovered the formula in the early 1900s [29].
To illustrate, consider a simple example that comprises two alleles: a and b, respectively. These two alleles are found in three alternative diploid combinations (genotypes) aa, ab, and bb respectively: two alleles the same are called homozygotes aa, bb whereas two different alleles, ab are heterozygotes.
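Suppose that the genotypes are recorded for a sample of n individuals, with $n_{aa}$, $n_{ab}$, and $n_{bb}$ observations of genotypes aa, ab, and bb, respectively. It is relatively straightforward to express the genotype observations in terms of frequencies, as this is simply the number of observations of each genotype divided by n:
$$f_{aa} = \frac{n_{aa}}{n}, \qquad f_{ab} = \frac{n_{ab}}{n}, \qquad f_{bb} = \frac{n_{bb}}{n}$$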
It is a law of probability that the sum of all possible outcomes is one. It also follows that the larger the sample size, the better the frequency estimate will be. International Society for Forensic Genetics (ISFG) DNA commission recommendations [30] suggest that a sample size of at least 500 is desirable, although for small discrete populations that are difficult to access, a smaller sample size will suffice.
Next, the number of alleles in the population is calculated. This is achieved by counting the number of alleles in the observed data. In the example, there are two alleles (and three genotypes). For a homozygote aa, there are two a alleles, whereas for a heterozygote ab, there is one a allele.
The total number of a and b alleles is twice the number of individuals (n).
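Each aa homozygote contributes two a alleles, whereas the 45 ab heterozygotes contribute 45 a alleles and 45 b alleles. To find the proportion of allele a, we divide the number of a alleles by 2n:
$$p_a = \frac{2n_{aa} + n_{ab}}{2n}$$
where $n_{aa}$, $n_{ab}$, and $n_{bb}$ are the numbers of aa, ab, and bb individuals. The same calculation is carried out for allele b:
$$p_b = \frac{2n_{bb} + n_{ab}}{2n}$$
and $p_a + p_b = 1$.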
Once the allele frequencies are estimated, this information can be used to calculate the expectation that an individual chosen at random will be a particular genotype. The expected genotype proportions are calculated by applying the Hardy–Weinberg equilibrium formula, which describes the relative probabilistic expectations of the genotype proportion in genotypes aa, ab, bb.
The Hardy–Weinberg formula is important, because it is the basis of the product rule. This relies upon a law of probability, which states that the probability of two independent events occurring together can be estimated by multiplying their individual probabilities. If $p_a$ and $p_b$ are the frequencies of alleles a and b, the probability of genotype aa is $p_a \times p_a = p_a^2$, and the probability of genotype bb is $p_b \times p_b = p_b^2$.
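Heterozygote genotype ab can occur in two different ways: There are two chromosome strands, which can be labeled c1 and c2. Therefore there are two alternative arrangements whereby an individual can be ab: allele a on c1 with allele b on c2, or allele b on c1 with allele a on c2. With allele frequencies $p_a$ and $p_b$, the probability of the heterozygote is therefore $p_a p_b + p_b p_a = 2p_a p_b$. As before, the sum of the probabilities always equals one: $p_a^2 + 2p_a p_b + p_b^2 = 1$.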
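The allele counting and the Hardy–Weinberg expectations can be sketched in Python. The genotype counts used here (20 aa, 40 ab, 40 bb) are hypothetical and are not the example data of this chapter; the function names are our own.

```python
def allele_frequencies(n_aa, n_ab, n_bb):
    """Estimate allele frequencies by counting alleles in genotype data."""
    n = n_aa + n_ab + n_bb               # number of individuals
    p_a = (2 * n_aa + n_ab) / (2 * n)    # homozygotes carry two a alleles
    p_b = (2 * n_bb + n_ab) / (2 * n)
    return p_a, p_b

def hardy_weinberg(p_a, p_b):
    """Expected genotype proportions under Hardy-Weinberg equilibrium."""
    return {"aa": p_a ** 2, "ab": 2 * p_a * p_b, "bb": p_b ** 2}

# Hypothetical counts, for illustration only
p_a, p_b = allele_frequencies(n_aa=20, n_ab=40, n_bb=40)
expected = hardy_weinberg(p_a, p_b)
print(p_a, p_b)
print(expected, sum(expected.values()))
```

Multiplying each expected proportion by the number of individuals gives the expected genotype counts, which can then be compared with the observed counts.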
The Hardy–Weinberg formula holds true if
1. There is no migration in or out of the population.
2. There is no natural selection that favours the survival of individuals with certain alleles.
3. The population is assumed to be randomly mating, without inbreeding and is very large.
4. There is no mutation of alleles.
5. Generations are non-overlapping.
In practice, it is difficult to fulfill the Hardy–Weinberg assumptions, because populations are not discrete or static; there is often much migration and interbreeding between different populations. Furthermore, because populations are finite, there is always some inbreeding. The effect of inbreeding is to increase the level of homozygosity (the Wahlund effect [31]), and this means that the multiplication rule used to estimate genotype frequencies is not strictly valid. The implications and the solution are examined in greater detail in Section 1.12.
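Using the allele frequencies $p_a$ and $p_b$ estimated from the example data, the Hardy–Weinberg expectations are $p_a^2$ for genotype aa, $2p_a p_b$ for genotype ab, and $p_b^2$ for genotype bb.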
The number of expected aa, ab, bb genotypes is recovered by multiplying their expected frequencies by n (Table 1.1). Note that the observed and expected genotype frequencies are close, but not exactly the same. It is usual to see small deviations of this kind.
Table 1.1
Chi-square statistic to test for deviation from Hardy–Weinberg equilibrium.
1.3.1 Measuring deviation from Hardy–Weinberg equilibrium
Chi-square test
The chi-square statistic tests the null hypothesis, which basically states that there is no difference between the observed and expected results. It is calculated as follows (Table 1.1):
$$\chi^2 = \sum \frac{(o - e)^2}{e} \tag{1.1}$$
where o = observed data, e = expected data, and the sum is taken over the genotype classes.
(1.2)
The result is compared to a chi-square distribution table.
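If a result is significant at a chosen level, then the null hypothesis is rejected. Traditionally, a single significance level is chosen for each test. When several tests are carried out, the level applied to each individual test can be lowered to α/m, where α is the desired overall significance level and m is the number of tests, making it less likely that any individual test will reject the null hypothesis.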
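A minimal Python sketch of the chi-square calculation for a two-allele locus follows. The genotype counts are hypothetical, and the function name is our own. With three genotype classes and one allele frequency estimated from the data there is one degree of freedom, for which the 5% critical value is 3.841; for one degree of freedom the p-value can be obtained from the complementary error function, so no statistical library is assumed.

```python
from math import erfc, sqrt

def hw_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic and p-value (1 df) for Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p_a = (2 * n_aa + n_ab) / (2 * n)
    p_b = 1 - p_a
    observed = [n_aa, n_ab, n_bb]
    expected = [n * p_a ** 2, 2 * n * p_a * p_b, n * p_b ** 2]
    # Sum of (o - e)^2 / e over genotype classes; both alleles must be observed,
    # otherwise an expected count is zero and the statistic is undefined
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = erfc(sqrt(stat / 2))  # chi-square survival function for 1 df
    return stat, p_value

# Hypothetical counts, for illustration only
stat, p_value = hw_chi_square(20, 40, 40)
print(f"chi-square = {stat:.3f}, p = {p_value:.3f}")
print("reject HW at 5%" if stat > 3.841 else "no significant deviation at 5%")
```

For loci with many alleles the number of degrees of freedom is larger, and a chi-square table or a statistical package would be needed instead of the one-degree-of-freedom shortcut used here.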
Fisher's exact test
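The exact test is based on the probability of observing $n_{ab}$ heterozygotes in a sample of n individuals, given the allele counts $n_a$ and $n_b$,⁴ under HW equilibrium:
$$\Pr(n_{ab} \mid n_a, n_b) = \frac{n!\, n_a!\, n_b!\, 2^{n_{ab}}}{n_{aa}!\, n_{ab}!\, n_{bb}!\, (2n)!} \tag{1.3}$$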
Raw programming of these formulae leads to very large numbers, which cause numerical errors. An R package, HardyWeinberg, is available: https://cran.r-project.org/web/packages/HardyWeinberg/HardyWeinberg.pdf. This package can be easily utilized to carry out the necessary calculations using the HWExact function. The chi-square test is also accommodated under the HWChisq function.
Exact tests are important in the quality assurance of frequency databases, and this is described in more detail in Section 1.15.
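To illustrate how the problem of huge numbers can be avoided, the following Python sketch evaluates the exact-test probabilities on the log scale, using the log-gamma function in place of factorials. It is a sketch of the standard two-allele exact test, conditional on the allele counts, and is not the implementation used in the HardyWeinberg package; the function names and the genotype counts are hypothetical.

```python
from math import exp, lgamma, log

def log_factorial(k):
    """log(k!) computed without forming the factorial itself."""
    return lgamma(k + 1)

def log_prob_het(n_ab, n_a, n_b):
    """log Pr(n_ab heterozygotes | allele counts n_a, n_b) under HW:
    n! n_a! n_b! 2^n_ab / (n_aa! n_ab! n_bb! (2n)!)."""
    n = (n_a + n_b) // 2
    n_aa = (n_a - n_ab) // 2
    n_bb = (n_b - n_ab) // 2
    return (log_factorial(n) + log_factorial(n_a) + log_factorial(n_b)
            + n_ab * log(2)
            - log_factorial(n_aa) - log_factorial(n_ab)
            - log_factorial(n_bb) - log_factorial(2 * n))

def hw_exact_p(n_aa, n_ab, n_bb):
    """Sum the probabilities of all outcomes no more likely than the observed."""
    n_a = 2 * n_aa + n_ab
    n_b = 2 * n_bb + n_ab
    p_obs = exp(log_prob_het(n_ab, n_a, n_b))
    # The heterozygote count must have the same parity as the allele count
    candidates = range(n_a % 2, min(n_a, n_b) + 1, 2)
    probs = [exp(log_prob_het(k, n_a, n_b)) for k in candidates]
    return min(1.0, sum(p for p in probs if p <= p_obs * (1 + 1e-9)))

print(hw_exact_p(20, 40, 40))  # hypothetical counts, for illustration only
```

Working with logarithms turns products of factorials into sums, so intermediate values stay within floating-point range even for large samples.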
1.3.2 Extension to multiple alleles in STRs
Whereas single nucleotide polymorphisms described in Section 1.3 are described by two alleles, all of the autosomal STR systems currently in use have many more alleles per locus. A compilation of allele frequency databases can be directly accessed from the STRs for identity ENFSI reference database (STRidER): https://strider.online/ [32]. For example, HUMTH01 has eight alleles listed; FGA has 24 alleles listed.
The extension to multiple alleles and loci is straightforward. A list of all possible genotypes can be obtained by listing the alleles sequentially in the first row and column of Table 1.2. The genotypes, comprised of two alleles, are designated by their intersections in the table. This process is also known as pairwise comparison. For a locus with n alleles, the number of genotypes can be calculated as
$$\binom{n}{2} + n = \frac{n(n+1)}{2} \tag{1.4}$$
where $\binom{x}{y}$ is the binomial coefficient giving the number of ways to select y elements out of x elements (unordered, without replacement), and the added n counts the homozygotes; the total is equivalent to selecting two alleles from n, unordered and with replacement. There are eight alleles in HUMTH01, hence there are a total of 36 genotypes, which is the number of elements in the lower triangular matrix, plus the number of elements of the diagonal of Table 1.2.
Table 1.2
Depiction of all possible 36 genotypes for HUMTH01 using pairwise comparisons of eight alleles.
Pairwise comparison is an important part of computer programming, which will be discussed in detail in Section 1.11. All possible genotype combinations are easily listed, hence the probabilities of the genotypes are also easily generated by multiplying their allele frequencies together.
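The pairwise listing of genotypes can be sketched in Python. The allele labels and the equal allele frequencies below are placeholders chosen for illustration; they are not the HUMTH01 alleles or frequencies held in STRidER. Note that, following the Hardy–Weinberg formula, heterozygote probabilities carry a factor of two.

```python
from itertools import combinations_with_replacement

def genotype_probabilities(freqs):
    """Hardy-Weinberg probability of every unordered genotype at one locus."""
    probs = {}
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        if a == b:
            probs[(a, b)] = freqs[a] ** 2            # homozygote
        else:
            probs[(a, b)] = 2 * freqs[a] * freqs[b]  # heterozygote, two arrangements
    return probs

# Eight placeholder alleles with equal (hypothetical) frequencies
freqs = {f"allele{i}": 1 / 8 for i in range(1, 9)}
probs = genotype_probabilities(freqs)
print(len(probs))           # number of genotypes: 8 * 9 / 2
print(sum(probs.values()))  # the genotype probabilities sum to one
```

The same function applies unchanged to a locus with 24 alleles, such as FGA, where the number of genotypes rises to 300.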
1.4 Quality assurance of data
It is desirable to carry out statistical tests to demonstrate if HW expectations are satisfied (Section 1.3) as a step to demonstrate if the data are independent, so that they can be properly utilized in strength of evidence calculations using the product rule.
On behalf of the European Network of Forensic Science Institutes (ENFSI) group, Welch et al. compiled European population data and estimated FST, otherwise known as theta (θ) (discussed in detail in Section 1.12).
This collection of European population data was made accessible by the European Network of Forensic Science Institutes (ENFSI) on STRbASE, now superseded by STRidER (STRs for identity ENFSI Reference database) [32] https://strider.online/. It has an integrated approach to assuring the quality of submitted data before acceptance.
In support of STRidER, the International Society for Forensic Genetics (ISFG) has published guidelines [30]. The recommendations are summarised by the following:
1. The minimum requirements are 15 autosomal STR loci, typed from 500 samples.
2. The geographical origin of the database is stated.
3. Methods of analysis stated, STR typing kit used.
4. Information on data analysis and handling.
5. Datasets must pass STRidER QC tests before they can be published in FSI: Genetics.
When databases are submitted, the data are examined for duplicates, close relatives, and transcription errors. Once data have been verified, statistical tests are carried out to show if the data conform to Hardy–Weinberg expectations. STRidER actively accepts new databases (Fig. 1.2) and is working towards collections of data generated by next generation sequencing (NGS).
Figure 1.2 Figure taken from [32]. The STRidER work flow, showing the integration of the QC platform and the STR database, resulting in high quality data in FSI: genetics publications. Reproduced from [32] with permission from Elsevier.
1.5 Recap: the laws of statistics
Before progressing further to explain mixture interpretation, it is necessary to recap two fundamental laws of statistics:
The product rule: The probability that two independent alleles both occur together is defined by the multiplication rule, so the probability (Pr) of observing a genotype, ab, where a is on chromosome strand c1 and b is on chromosome strand c2, is $\Pr(ab) = p_a \times p_b$, where $p_a$ and $p_b$ are the frequencies of alleles a and b.
We do not know the chromosomal arrangement of alleles, but one or the other must be true, they cannot both be true at the same time. The probability of either genotype ab or genotype ba (events A or B) is subject to the addition rule.
Addition rule: When two events, A and B, are mutually exclusive, the probability that A or B will occur is the sum of the probability of each event.
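We continue with the example. Genotype ab is observed and the chromosomal arrangement is unknown, hence the probability that the order is ab or ba is $\Pr(ab) + \Pr(ba) = p_a p_b + p_b p_a = 2p_a p_b$, where $p_a$ and $p_b$ are the frequencies of alleles a and b.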
Independent/dependent: If the occurrence of event A changes the probability of event B, then events A and B are dependent. On the other hand, if the occurrence of event A does not change the probability of event B, then events A and B are