Data from: Effects of sampling close relatives on some elementary population genetics analyses

Abstract

Many molecular ecology analyses assume the genotyped individuals are sampled at random from a population and thus are representative of the population. Realistically, however, a sample may contain excessive close relatives (ECR) because, for example, localized juveniles are drawn from fecund species. Our knowledge is limited about how ECR affect the routinely conducted elementary genetics analyses, and how ECR are best dealt with to yield unbiased and accurate parameter estimates. This study quantifies the effects of ECR on some popular population genetics analyses of marker data, including the estimation of allele frequencies, F-statistics, expected heterozygosity (He), effective and observed numbers of alleles, and the tests of Hardy-Weinberg equilibrium (HWE) and linkage equilibrium (LE). It also investigates several strategies for handling ECR to mitigate their impact and to yield accurate parameter estimates. My analytical work, assisted by simulations, shows that ECR have large and global effects on all of the above marker analyses. The naïve approach of simply ignoring ECR could yield low-precision and often biased parameter estimates, and could cause too many false rejections of HWE and LE. The bold approach, which simply identifies and removes ECR, and the cautious approach, which estimates target parameters (e.g. He) by accounting for ECR and using naïve allele frequency estimates, eliminate the bias and the false HWE and LE rejections, but could reduce estimation precision substantially. The likelihood approach, which accounts for ECR in estimating allele frequencies and thus target parameters relying on allele frequencies, usually yields unbiased and the most accurate parameter estimates. Which of the four approaches is the most effective and efficient may depend on the particular marker analysis to be conducted. The results are discussed in the context of using marker data for understanding population properties and marker properties. 
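As a rough illustration of the quantities named in the abstract, the sketch below computes naive allele frequencies, expected heterozygosity (He = 1 - sum of squared allele frequencies) and the effective number of alleles (1 / sum of squared allele frequencies) for a single locus, and then repeats the calculation after keeping one individual per family, in the spirit of the bold approach. It is not the deposited Fortran code. The genotypes, the family labels and all function names are hypothetical, and the family labels are assumed to be known, whereas in practice close relatives would have to be inferred from the marker data.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Naive estimate: count alleles over all sampled diploid genotypes."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def expected_heterozygosity(freqs):
    """He = 1 - sum(p_i^2), without small-sample correction."""
    return 1.0 - sum(p * p for p in freqs.values())


def effective_number_of_alleles(freqs):
    """Effective number of alleles = 1 / sum(p_i^2)."""
    return 1.0 / sum(p * p for p in freqs.values())


def keep_one_per_family(genotypes, family_ids):
    """Keep the first individual seen from each family (assumed known)."""
    seen = set()
    kept = []
    for genotype, family in zip(genotypes, family_ids):
        if family not in seen:
            seen.add(family)
            kept.append(genotype)
    return kept


# Hypothetical single-locus sample: four full sibs (family 1) plus four
# unrelated individuals. These values are made up for illustration only.
genotypes = [("A", "A"), ("A", "A"), ("A", "B"), ("A", "A"),
             ("B", "B"), ("A", "B"), ("B", "B"), ("A", "B")]
family_ids = [1, 1, 1, 1, 2, 3, 4, 5]

naive = allele_frequencies(genotypes)
bold = allele_frequencies(keep_one_per_family(genotypes, family_ids))

print("naive:", naive, expected_heterozygosity(naive))
print("bold: ", bold, expected_heterozygosity(bold))
```

In this toy sample the sibship inflates the naive frequency of allele A, while dropping the extra sibs removes that distortion at the cost of a smaller sample, which mirrors the trade-off between bias and precision described in the abstract.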
Allele frequency simulation code (AlleleFre.f90): Fortran source code for simulating genotype data, for estimating allele frequencies by different methods from the data, and for assessing the accuracy of different methods.
Allele frequency simulation executable (AlleleFre.exe): The compiled executable of file AlleleFre.f90.
Fst simulation code (Fst.f90): Fortran source code for simulating genotype data, estimating Fst by using estimated allele frequencies from different methods, and for assessing the accuracy of different methods.
Fst simulation executable (fst.exe): Compiled from file Fst.f90.
He simulation code (He.f90): Fortran code for simulating genotype data, estimating expected heterozygosity from the data by different methods, and assessing the accuracy of different methods.
He simulation executable (He.exe): Compiled from He.f90.
HWE Test simulation code (HWE_Test.f90): Fortran code for simulating genotype data and testing Hardy-Weinberg equilibrium from the data.
HWE Test simulation executable (hwe_test.exe): Compiled from HWE_test.f90.
LD Test simulation code (LD_Test.f90): Fortran code for simulating genotype data and testing linkage disequilibrium.
LD Test simulation executable (LD_test.exe): Compiled from LD_test.f90.