| Title: | Genomic Breeding Tools: Genetic Variance Prediction and Cross-Validation |
|---|---|
| Description: | The main attribute of 'PopVar' is the prediction of genetic variance in bi-parental populations, from which the package derives its name. 'PopVar' contains a set of functions that use phenotypic and genotypic data from a set of candidate parents to 1) predict the mean, genetic variance, and superior progeny value of all, or a defined set of pairwise bi-parental crosses, and 2) perform cross-validation to estimate genome-wide prediction accuracy of multiple statistical models. More details are available in Mohammadi, Tiede, and Smith (2015, <doi:10.2135/cropsci2015.01.0030>). A dataset 'think_barley.rda' is included for reference and examples. |
| Authors: | Tyler Tiede [aut], Jeffrey Neyhart [aut, ctb, cre] (ORCID: <https://orcid.org/0000-0002-1991-5310>), Mohsen Mohammadi [ctb] (ORCID: <https://orcid.org/0000-0002-4536-1200>), Kevin Smith [ctb] (ORCID: <https://orcid.org/0000-0001-8253-3326>) |
| Maintainer: | Jeffrey Neyhart <[email protected]> |
| License: | GPL-3 |
| Version: | 1.3.2 |
| Built: | 2026-03-03 09:11:09 UTC |
| Source: | https://github.com/umn-barleyoatsilphium/popvar |
Calculate minor allele frequency
calc_maf(x, check.matrix = TRUE)calc_maf(x, check.matrix = TRUE)
x |
An n x p marker matrix of n individuals and p markers coded as -1, 0, and 1 for homozygous alternate, heterozygous, and homozygous reference. |
check.matrix |
Logical. Should the marker matrix 'x' be checked? Use check.matrix = FALSE if 'x' contains imputed decimal genotypes. |
Estimate genomewide prediction accuracy using cross-validation
cross_validate( data, models, method = c("frac", "kfold", "loo"), p.test, k, reps, group )cross_validate( data, models, method = c("frac", "kfold", "loo"), p.test, k, reps, group )
data |
A |
models |
The list of models to run. Currently supports ... |
method |
The cross-validation method. Either fractional ( |
p.test |
For fractional cross-validation, the proportion of individuals to designate as the test set per replicate. |
k |
For k-fold cross-validation, the number of folds. |
reps |
The number of complete replicates of fractional or k-fold cross-validation. |
group |
The name of a column in the phenotype data for grouped cross-validation (e.g. environment, location, etc.). For example to run leave-one-environment-out cross-validation, set |
Filter SNP marker genotypes
filter_geno( geno, max.ind.miss = 0.5, max.mar.miss = 0.5, min.mac = 5, max.r2 = 1 )filter_geno( geno, max.ind.miss = 0.5, max.mar.miss = 0.5, min.mac = 5, max.r2 = 1 )
geno |
A data.frame of genotype calls, where the first three columns are "marker," "chrom,", and "position." Subsequent columns are genotype calls for individuals. |
max.ind.miss |
The maximum rate of missingness to retain an individual |
max.mar.miss |
The maximum rate of missingness to retain an marker |
min.mac |
The minimum minor allele count to retain a marker. |
max.r2 |
The maximum squared correlation between any pair of markers. The |
Variable of class geno.
# Use data from the PopVar package data("cranberry_geno", package = "PopVar") # Remove cM, rename geno <- cranberry_geno[-3] names(geno)[1:3] <- c("marker", "chrom", "position") # Filter geno_in <- filter_geno(geno = geno, max.ind.miss = 0.2, max.mar.miss = 0.2, min.mac = 5, max.r2 = 1)# Use data from the PopVar package data("cranberry_geno", package = "PopVar") # Remove cM, rename geno <- cranberry_geno[-3] names(geno)[1:3] <- c("marker", "chrom", "position") # Filter geno_in <- filter_geno(geno = geno, max.ind.miss = 0.2, max.mar.miss = 0.2, min.mac = 5, max.r2 = 1)
Internal functions generally not to be called by the user.
par_position(crossing.table, par.entries) par_name(crossing.mat, par.entries) tails(GEBVs, tail.p) maf_filt(G) XValidate_nonInd( y.CV = NULL, G.CV = NULL, models.CV = NULL, frac.train.CV = NULL, nCV.iter.CV = NULL, burnIn.CV = NULL, nIter.CV = NULL ) XValidate_Ind( y.CV = NULL, G.CV = NULL, models.CV = NULL, nFold.CV = NULL, nFold.CV.reps = NULL, burnIn.CV = NULL, nIter.CV = NULL ) calc_marker_effects( M, y.df, models = c("rrBLUP", "BayesA", "BayesB", "BayesC", "BL", "BRR"), nIter, burnIn )par_position(crossing.table, par.entries) par_name(crossing.mat, par.entries) tails(GEBVs, tail.p) maf_filt(G) XValidate_nonInd( y.CV = NULL, G.CV = NULL, models.CV = NULL, frac.train.CV = NULL, nCV.iter.CV = NULL, burnIn.CV = NULL, nIter.CV = NULL ) XValidate_Ind( y.CV = NULL, G.CV = NULL, models.CV = NULL, nFold.CV = NULL, nFold.CV.reps = NULL, burnIn.CV = NULL, nIter.CV = NULL ) calc_marker_effects( M, y.df, models = c("rrBLUP", "BayesA", "BayesB", "BayesC", "BL", "BRR"), nIter, burnIn )
crossing.table |
The crossing table. |
par.entries |
The parent entries. |
crossing.mat |
The crossing matrix. |
GEBVs |
The genomic estimated breeding values. |
tail.p |
The proportion from the population to select. |
G |
The marker genotypes |
y.CV |
The phenotypes for cross-validation. |
G.CV |
The marker genotypes for cross-validation. |
models.CV |
The models for cross-validation. |
frac.train.CV |
The fraction of data to use as training data in cross-validation. |
nCV.iter.CV |
The number of iterations of cross-validation. |
burnIn.CV |
The burn-in number for cross-validation. |
nIter.CV |
The number of iterations for Bayesian models in cross-validation. |
nFold.CV |
The number of folds in k-fold cross-validation. |
nFold.CV.reps |
The number of replications of k-fold cross-validation. |
M |
The marker matrix. |
y.df |
The phenotype data. |
models |
The models to use. |
nIter |
The number of iterations. |
burnIn |
The burn-in rate. |
Uses linear interpolation and local recombination rate information to predict the unknown genetic position of markers based on their known physical positions.
interpolate_genetic_position(geno, cMMb)interpolate_genetic_position(geno, cMMb)
geno |
A data frame of marker allele dosages. See \italicDetails for more information. |
cMMb |
A data frame of the recombination rate information. See \italicDetails. |
The first columns of the geno object are marker, chrom, and bp. Subsequent columns contain the allele dosage for individuals/clones, coded 0,1,2. While fractional values are allowed for genomewide prediction, these values are rounded when predicting the genetic variance within crosses.
The cMMb object must contains six columns: chrom, left_bp (the physical position of the left flank of an interval), right_bp (the physical position of the right flank of an interval), left_cM (the genetic position of the left flank of an interval), right_cM (the genetic position of the right flank of an interval), and cM_Mb (the recombination rate (in cM / Mb) within that interval).
Predicts the genotypic mean, genetic variance, and usefulness criterion (superior progeny mean) in a set of multi-parent populations using marker effects and a genetic map. If more than two traits are specified, the function will also return predictions of the genetic correlation in the population and the correlated response to selection.
mppop.predict( G.in, y.in, map.in, crossing.table, parents, n.parents = 4, tail.p = 0.1, self.gen = 10, DH = FALSE, models = c("rrBLUP", "BayesA", "BayesB", "BayesC", "BL", "BRR"), n.core = 1, ... ) mppop_predict2( M, y.in, marker.effects, map.in, crossing.table, parents, n.parents = 4, tail.p = 0.1, self.gen = 10, DH = FALSE, models = c("rrBLUP", "BayesA", "BayesB", "BayesC", "BL", "BRR"), n.core = 1, ... )mppop.predict( G.in, y.in, map.in, crossing.table, parents, n.parents = 4, tail.p = 0.1, self.gen = 10, DH = FALSE, models = c("rrBLUP", "BayesA", "BayesB", "BayesC", "BL", "BRR"), n.core = 1, ... ) mppop_predict2( M, y.in, marker.effects, map.in, crossing.table, parents, n.parents = 4, tail.p = 0.1, self.gen = 10, DH = FALSE, models = c("rrBLUP", "BayesA", "BayesB", "BayesC", "BL", "BRR"), n.core = 1, ... )
G.in |
See |
y.in |
See |
map.in |
See |
crossing.table |
A |
parents |
See |
n.parents |
Integer number of parents per cross. May be 2 or 4. If |
tail.p |
See |
self.gen |
The number of selfing generations in the potential cross. Can be an integer or |
DH |
Indicator if doubled-haploids are to be induced after the number of selfing generations indicated by
|
models |
See |
n.core |
Number of cores for parallelization. Parallelization is supported only on a Linux or Mac OS operating system; if working on a Windows system, the function is executed on a single core. |
... |
Additional arguments to pass depending on the choice of |
M |
A Matrix of marker genotypes of dimensions |
marker.effects |
A data frame of marker effects. The first column should include the marker name and
subsequent columns should include the marker effects. Supercedes |
Predictions are based on the deterministic equations specified by Allier et al. (2019).
In the case of four-way crosses (i.e. 4 parents), the function assumes that the first two parents are mated,
producing a offspring; then, the next two parents are mated, producing another offspring.
The two offspring are then mated and inbreeding or doubled haploid induction (if specified) proceeds
from there. For example, say cross i uses parents P1, P2, P3, and P4. P1 and P2 are first mated,
producing O1; then, P3 and P4 are mated, producing O2; then, O1 and O2 are mated, producing a segregating family.
The mppop.predict function takes similarly formatted arguments as the pop.predict function
in the PopVar package. For the sake of simplicity, we also include the mppop_predict2 function, which
takes arguments in a format more consistent with other genomewide prediction packages/functions.
If you select a model other than "rrBLUP", you must specify the following additional arguments:
nIter: See pop.predict.
burnIn: See pop.predict.
A data.frame containing predictions of , , and for
each trait for each potential multi-parent cross. When multiple traits are provided, the correlated
responses and correlation between all pairs of traits is also returned.
Allier, A., L. Moreau, A. Charcosset, S. Teyssèdre, and C. Lehermeier, 2019 Usefulness Criterion and Post-selection Parental Contributions in Multi-parental Crosses: Application to Polygenic Trait Introgression. G3 (Bethesda) 9: 1469–1479. https://doi.org/https://doi.org/10.1534/g3.119.400129
# Load data data("think_barley") # Vector with 8 parents parents <- sample(y.in_ex$Entry, 8) # Create a crossing table with four parents per cross cross_tab <- as.data.frame(t(combn(x = parents, m = 4))) names(cross_tab) <- c("parent1", "parent2", "parent3", "parent4") out <- mppop_predict2(M = G.in_ex_mat, y.in = y.in_ex, map.in = map.in_ex, crossing.table = cross_tab, models = "rrBLUP")# Load data data("think_barley") # Vector with 8 parents parents <- sample(y.in_ex$Entry, 8) # Create a crossing table with four parents per cross cross_tab <- as.data.frame(t(combn(x = parents, m = 4))) names(cross_tab) <- c("parent1", "parent2", "parent3", "parent4") out <- mppop_predict2(M = G.in_ex_mat, y.in = y.in_ex, map.in = map.in_ex, crossing.table = cross_tab, models = "rrBLUP")
pop.predict uses phenotypic and genotypic data from a set of individuals known as a training population (TP) and a set of candidate parents, which may or may not be included in the TP, to predict the mean (), genetic variance (V_G), and superior progeny values (_sp) of the half-diallel, or a defined set of pairwise bi-parental crosses between parents. When multiple traits are provided pop.predict will also predict the correlated responses and correlation between all pairwise traits. See Mohammadi, Tiede, and Smith (2015) for further details.
NOTE - \code{pop.predict} writes and reads files to disk so it is highly recommended to set your working directory
pop.predict( G.in = NULL, y.in = NULL, map.in = NULL, crossing.table = NULL, parents = NULL, tail.p = 0.1, nInd = 200, map.plot = FALSE, min.maf = 0.01, mkr.cutoff = 0.5, entry.cutoff = 0.5, remove.dups = TRUE, impute = "EM", nSim = 25, frac.train = 0.6, nCV.iter = 100, nFold = NULL, nFold.reps = 1, nIter = 12000, burnIn = 3000, models = c("rrBLUP", "BayesA", "BayesB", "BayesC", "BL", "BRR"), return.raw = FALSE, saveAt = tempdir() )pop.predict( G.in = NULL, y.in = NULL, map.in = NULL, crossing.table = NULL, parents = NULL, tail.p = 0.1, nInd = 200, map.plot = FALSE, min.maf = 0.01, mkr.cutoff = 0.5, entry.cutoff = 0.5, remove.dups = TRUE, impute = "EM", nSim = 25, frac.train = 0.6, nCV.iter = 100, nFold = NULL, nFold.reps = 1, nIter = 12000, burnIn = 3000, models = c("rrBLUP", "BayesA", "BayesB", "BayesC", "BL", "BRR"), return.raw = FALSE, saveAt = tempdir() )
G.in |
TIP - Set header= |
y.in |
|
map.in |
|
crossing.table |
Optional |
parents |
Optional |
tail.p |
Optional |
nInd |
Optional |
map.plot |
Optional |
min.maf |
Optional |
mkr.cutoff |
Optional |
entry.cutoff |
Optional |
remove.dups |
Optional |
impute |
Options include |
nSim |
Optional |
frac.train |
Optional |
nCV.iter |
Optional |
nFold |
Optional |
nFold.reps |
Optional |
nIter, burnIn
|
Optional |
models |
Optional |
return.raw |
Optional |
saveAt |
When using models other than "rrBLUP" (i.e. Bayesian models), this is a path and prefix for saving temporary files
the are produced by the |
pop.predict can be used to predict the mean (), genetic variance (V_G), superior progeny values (), as well as the predicted correlated response and correlations between all pairwise traits. The methodology and procedure to do so has been described in Bernardo (2014) and Mohammadi, Tiede, and K.P. Smith (2015). Users familiar with genome-wide prediction, association mapping, and/or linkage mapping will be familiar with the
required inputs of pop.predict. G.in includes all of the entries (taxa) in the TP as well as additional entries to be considered as parent candidates. Entries included in G.in that do have a phenotype for any or all traits in y.in are considered TP entries for those respective traits. G.in is filtered according to min.maf, mkr.cutoff, entry.cutoff, and remove.dups;
remaining missing marker data is imputed using the EM algorithm (Poland et al., 2012) when possible, and the marker mean otherwise, both implemented in A.mat. For each trait, the TP (i.e. entries with phenotype) is used to:
Perform CV to select a regression model. NOTE - Using the model with the highest CV accuracy is expected to result in the most accurate marker effect estimates (Bernardo, 2014). This expectation, however, is yet to be empirically validated and the user is encouraged to investigate the various models in order to make an educated decision about which one to ultimately use.
Estimate marker effects using the model resulting in the highest CV accuracy
Models include ridge regression BLUP implemented in mixed.solve (Endelman, 2011) and BayesA, BayesB, BayesC, Bayesian lasso (BL), and Bayesian ridge regression (BRR) implemented in BGLR (de los Compos and Rodriguez, 2014).
Information from the map.in is then used to simulate chromosomal recombination expected in a recombinant inbred line (i.e. F-infinity) (Broman et al., 2003) population (size=nInd). A function then converts the recombined chromosomal segments of the generic RIL population to the chromosomal segments of the population's respective parents and GEBVs of the simulated progeny are calculated.
The simulation and conversion process is repeated s times, where s = nSim, to calculate dispersion statistics for and V_G; the remainder of the values in the predictions output are means of the s simulations. During each iteration the correlation (r) and correlated response of each pairwise combination of traits is also calculated and their mean across n simulations is returned.
The correlated response of trait.B when predicting trait.A is the mean of trait.B for the () of trait.A, and vice-versa; a correlated response for the bottom tail.p and upper is returned for each trait.
A dataset \code{\link{think_barley.rda}} is provided as an example of the proper formatting of input files and also for users to become familiar with \code{pop.predict}.
A list containing:
predictions A list of dataframes containing predictions of (), (V_G), and (_sp). When multiple traits are provided the correlated responses and correlation between all pairwise traits is also included. More specifically, for a given trait pair the correlated response of the secondary trait with both the high and low superior progeny of the primary trait is returned since the favorable values cannot be known by PopVar.
preds.per.sim If return.raw is TRUE then a dataframe containing the results of each simulation is returned. This is useful for calculating dispersion statistics for traits not provided in the standard predictions dataframe.
CVs A dataframe of CV results for each trait/model combination specified.
models.chosen A matrix listing the statistical model chosen for each trait.
markers.removed A vector of markers removed during filtering for MAF and missing data.
entries.removed A vector of entries removed during filtering for missing data and duplicate entries.
Bernardo, R. 2014. Genomewide Selection of Parental Inbreds: Classes of Loci and Virtual Biparental Populations. Crop Sci. 55:2586-2595.
Broman, K. W., H. Wu, S. Sen and G.A. Churchill. 2003. R/qtl: QTL mapping in experimental crosses. Bioinformatics 19:889-890.
Endelman, J. B. 2011. Ridge regression and other kernels for genomic selection with R package rrBLUP. Plant Genome 4:250-255. doi: 10.3835/plantgenome2011.08.0024
Gustavo de los Campos and Paulino Perez Rodriguez, (2014). BGLR: Bayesian Generalized Linear Regression. R package version 1.0.3. http://CRAN.R-project.org/package=BGLR
Mohammadi M., T. Tiede, and K.P. Smith. 2015. PopVar: A genome-wide procedure for predicting genetic variance and correlated response in bi-parental breeding populations. Crop Sci. \emph{Accepted}.
Munoz-Amatriain, M., M. J. Moscou, P. R. Bhat, J. T. Svensson, J. Bartos, P. Suchankova, H. Simkova, T. R. Endo, R. D. Fenton, S. Lonardi, A. M. Castillo, S. Chao, L. Cistue, A. Cuesta-Marcos, K. L. Forrest, M. J. Hayden, P. M. Hayes, R. D. Horsley, K. Makoto, D. Moody, K. Sato, M. P. Valles, B. B. H. Wulff, G. J. Muehlbauer, J. Dolezel, and T. J. Close. 2011 An improved consensus linkage map of barley based on flow-sorted chromosomes and single nucleotide polymorphism markers. Plant Gen. 4:238-249.
Poland, J., J. Endelman, J. Dawson, J. Rutkoski, S. Wu, Y. Manes, S. Dreisigacker, J. Crossa, H. Sanches-Villeda, M. Sorrells, and J.-L. Jannink. 2012. Genomic Selection in Wheat Breeding using Genotyping-by-Sequencing. Plant Genome 5:103-113.
## Not run: # Load data data("think_barley") ## The following examples only use the model 'rrBLUP' for the sake of testing. Functions ## BGLR package write temporary files to the disk. ## ## Ex. 1 - Predict a defined set of crosses ## This example uses CV method 1 (see Details of x.val() function) ex1.out <- pop.predict(G.in = G.in_ex, y.in = y.in_ex, map.in = map.in_ex, crossing.table = cross.tab_ex, nSim=5, nCV.iter=2, models = "rrBLUP") ex1.out$predictions ## Predicted parameters ex1.out$CVs ## CV results ## Ex. 2 - Use only rrBLUP and Bayesian lasso (BL) models ex3.out <- pop.predict(G.in = G.in_ex, y.in = y.in_ex, map.in = map.in_ex, crossing.table = cross.tab_ex, models = c("rrBLUP"), nSim=5, nCV.iter=10) ## Ex. 3 - Same as Ex. 3, but return all raw SNP and prediction data for each simulated population ex4.out <- pop.predict(G.in = G.in_ex, y.in = y.in_ex, map.in = map.in_ex, crossing.table = cross.tab_ex, models = c("rrBLUP"), nSim=5, nCV.iter=2, return.raw = TRUE) ## End(Not run)## Not run: # Load data data("think_barley") ## The following examples only use the model 'rrBLUP' for the sake of testing. Functions ## BGLR package write temporary files to the disk. ## ## Ex. 1 - Predict a defined set of crosses ## This example uses CV method 1 (see Details of x.val() function) ex1.out <- pop.predict(G.in = G.in_ex, y.in = y.in_ex, map.in = map.in_ex, crossing.table = cross.tab_ex, nSim=5, nCV.iter=2, models = "rrBLUP") ex1.out$predictions ## Predicted parameters ex1.out$CVs ## CV results ## Ex. 2 - Use only rrBLUP and Bayesian lasso (BL) models ex3.out <- pop.predict(G.in = G.in_ex, y.in = y.in_ex, map.in = map.in_ex, crossing.table = cross.tab_ex, models = c("rrBLUP"), nSim=5, nCV.iter=10) ## Ex. 3 - Same as Ex. 3, but return all raw SNP and prediction data for each simulated population ex4.out <- pop.predict(G.in = G.in_ex, y.in = y.in_ex, map.in = map.in_ex, crossing.table = cross.tab_ex, models = c("rrBLUP"), nSim=5, nCV.iter=2, return.raw = TRUE) ## End(Not run)
Generates predictions of the genetic variance and genetic correlation in bi-parental populations using a set of deterministic equations instead of simulations.
pop.predict2( G.in, y.in, map.in, crossing.table, parents, tail.p = 0.1, self.gen = Inf, DH = FALSE, models = c("rrBLUP", "BayesA", "BayesB", "BayesC", "BL", "BRR"), ... ) pop_predict2( M, y.in, marker.effects, map.in, crossing.table, parents, tail.p = 0.1, self.gen = Inf, DH = FALSE, models = c("rrBLUP", "BayesA", "BayesB", "BayesC", "BL", "BRR"), ... )pop.predict2( G.in, y.in, map.in, crossing.table, parents, tail.p = 0.1, self.gen = Inf, DH = FALSE, models = c("rrBLUP", "BayesA", "BayesB", "BayesC", "BL", "BRR"), ... ) pop_predict2( M, y.in, marker.effects, map.in, crossing.table, parents, tail.p = 0.1, self.gen = Inf, DH = FALSE, models = c("rrBLUP", "BayesA", "BayesB", "BayesC", "BL", "BRR"), ... )
G.in |
See |
y.in |
See |
map.in |
See |
crossing.table |
See |
parents |
See |
tail.p |
See |
self.gen |
The number of selfing generations in the potential cross. Can be an integer or |
DH |
Indicator if doubled-haploids are to be induced after the number of selfing generations indicated by
|
models |
See |
... |
Additional arguments to pass depending on the choice of |
M |
A Matrix of marker genotypes of dimensions |
marker.effects |
A data frame of marker effects. The first column should include the marker name and
subsequent columns should include the marker effects. Supercedes |
Predictions are based on the deterministic equations specified by Zhong and Jannink (2007), Allier et al. (2019), and Neyhart et al. (2019).
If you select a model other than "rrBLUP", you must specify the following additional arguments:
nIter: See pop.predict.
burnIn: See pop.predict.
A data.frame containing predictions of , , and for
each trait for each potential bi-parental cross. When multiple traits are provided, the correlated
responses and correlation between all pairs of traits is also returned.
pop_predict2():
Zhong, S., and J.-L. Jannink, 2007 Using quantitative trait loci results to discriminate among crosses on the basis of their progeny mean and variance. Genetics 177: 567–576. https://doi.org/10.1534/ genetics.107.075358
Allier, A., L. Moreau, A. Charcosset, S. Teyssèdre, and C. Lehermeier, 2019 Usefulness Criterion and Post-selection Parental Contributions in Multi-parental Crosses: Application to Polygenic Trait Introgression. G3 9: 1469–1479. doi: 10.1534/g3.119.400129
Neyhart, J.L., A.J. Lorenz, and K.P. Smith, 2019 Multi-trait Improvement by Predicting Genetic Correlations in Breeding Crosses. G3 9: 3153-3165. doi: 10.1534/g3.119.400406
# Load data data("think_barley") # Use example data to make predictions out <- pop.predict2(G.in = G.in_ex_imputed, y.in = y.in_ex, map.in = map.in_ex, crossing.table = cross.tab_ex) # Provide a vector of parents to predict all possible crosses (some parents # have missing phenotypic data) out <- pop.predict2(G.in = G.in_ex_imputed, y.in = y.in_ex, map.in = map.in_ex, parents = y.in_ex$Entry[1:5]) # Make predictions for 5 crosses with various levels of inbreeding out_list <- lapply(X = 1:10, FUN = function(self.gen) { out <- pop.predict2(G.in = G.in_ex_imputed, y.in = y.in_ex, map.in = map.in_ex, crossing.table = cross.tab_ex[1:5,], self.gen = self.gen) out$self.gen <- self.gen out }) # Plot predictions of grain yield genetic variance over levels of inbreeding dat <- do.call("rbind", lapply(out_list, subset, trait == "Yield")) plot(pred_varG ~ self.gen, data = dat, type = "b", subset = parent1 == parent1[1] & parent2 == parent2[1]) # Load data data("think_barley") # Use example data to make predictions out <- pop_predict2(M = G.in_ex_mat, y.in = y.in_ex, map.in = map.in_ex, crossing.table = cross.tab_ex) # Provide a vector of parents to predict all possible crosses (some parents # have missing phenotypic data) out <- pop_predict2(M = G.in_ex_mat, y.in = y.in_ex, map.in = map.in_ex, parents = y.in_ex$Entry[1:10])# Load data data("think_barley") # Use example data to make predictions out <- pop.predict2(G.in = G.in_ex_imputed, y.in = y.in_ex, map.in = map.in_ex, crossing.table = cross.tab_ex) # Provide a vector of parents to predict all possible crosses (some parents # have missing phenotypic data) out <- pop.predict2(G.in = G.in_ex_imputed, y.in = y.in_ex, map.in = map.in_ex, parents = y.in_ex$Entry[1:5]) # Make predictions for 5 crosses with various levels of inbreeding out_list <- lapply(X = 1:10, FUN = function(self.gen) { out <- pop.predict2(G.in = G.in_ex_imputed, y.in = y.in_ex, map.in = map.in_ex, crossing.table = cross.tab_ex[1:5,], self.gen = self.gen) out$self.gen <- self.gen out }) # Plot predictions of grain yield genetic variance over levels of inbreeding dat <- do.call("rbind", lapply(out_list, subset, trait == "Yield")) plot(pred_varG ~ self.gen, data = dat, type = "b", subset = parent1 == parent1[1] & parent2 == parent2[1]) # Load data data("think_barley") # Use example data to make predictions out <- pop_predict2(M = G.in_ex_mat, y.in = y.in_ex, map.in = map.in_ex, crossing.table = cross.tab_ex) # Provide a vector of parents to predict all possible crosses (some parents # have missing phenotypic data) out <- pop_predict2(M = G.in_ex_mat, y.in = y.in_ex, map.in = map.in_ex, parents = y.in_ex$Entry[1:10])
S4 class of genotypic and phenotypic data
ploidyThe ploidy level.
mapA data frame of the genetic map information.
geno.matA matrix of marker genotypes.
haplo.matA matrix of haplotypes.
phasedWhether the marker data was phased.
missingWhether there is any missing marker data.
assume.inbredWhether to assume unphased marker data comes from completely homozygous individuals.
genoA PopVar.geno object.
phenoA data frame of phenotypic data
fixedA data frame of fixed effects
traitsA list of trait names
S4 class of genotypic and phenotypic data with estimated marker effects
marker.effectsA matrix of marker effects
Predict marker effects
predict_marker_effects( data, models = c("rrBLUP", "BayesA", "BayesB", "BayesC", "BL", "BRR"), params = list(nIter = 500, burnIn = 50) )predict_marker_effects( data, models = c("rrBLUP", "BayesA", "BayesB", "BayesC", "BL", "BRR"), params = list(nIter = 500, burnIn = 50) )
data |
A |
models |
The list of models to run. See Details for supported models. |
params |
Parameters to pass to |
The function currently supports the models:
Ridge regression best linear unbiased prediction.
BayesA.
BayesB.
BayesC-pi.
Bayesian Lasso.
Bayesian ridge regression.
A variable of class PopVar.me
# Read in unphased cranberry marker data and phenotypes geno <- cranberry_geno pheno <- cranberry_pheno dat <- read_PopVar(geno = geno, pheno = pheno, n.traits = 6, phased = FALSE, assume.inbred = FALSE) # Compute marker effects using the rrBLUP and RKHS models data_me <- predict_marker_effects(data = popvar_data, models = c("rrBLUP", "BRR"))# Read in unphased cranberry marker data and phenotypes geno <- cranberry_geno pheno <- cranberry_pheno dat <- read_PopVar(geno = geno, pheno = pheno, n.traits = 6, phased = FALSE, assume.inbred = FALSE) # Compute marker effects using the rrBLUP and RKHS models data_me <- predict_marker_effects(data = popvar_data, models = c("rrBLUP", "BRR"))
Predict breeding values or genotypic values
predict_merit(data)predict_merit(data)
data |
A |
models |
The list of models to run. Currently supports ... |
method |
The cross-validation method. Either fractional ( |
p.test |
For fractional cross-validation, the proportion of individuals to designate as the test set per replicate. |
k |
For k-fold cross-validation, the number of folds. |
reps |
The number of complete replicates of fractional or k-fold cross-validation. |
group |
The name of a column in the phenotype data for grouped cross-validation (e.g. environment, location, etc.). For example to run leave-one-environment-out cross-validation, set |
Prune SNPs based on linkage disequilibrium
prune_LD(x, cor.mat, r2.max = 0.8, check.matrix = TRUE)prune_LD(x, cor.mat, r2.max = 0.8, check.matrix = TRUE)
x |
An n x p marker matrix of n individuals and p markers coded as -1, 0, and 1 |
cor.mat |
A similarity matrix between SNPs. No calculation of LD will be made if this matrix is passed. for homozygous alternate, heterozygous, and homozygous reference. |
check.matrix |
Logical. Should the marker matrix 'x' be checked? Use check.matrix = FALSE if 'x' contains imputed decimal genotypes. |
r2max |
The maximum acceptable value of r2 (i.e. LD) or similarity between any two markers. |
A marker matrix without markers in high LD
Reads and combines marker genotype data and phenotype data into a single object for downstream analysis and processing.
read_PopVar( geno, pheno, n.traits, ploidy = 2L, phased = FALSE, sep = ".", assume.inbred = FALSE, max.ind.miss = 0.5, max.mar.miss = 0.5, min.mac = 5, max.r2 = 1, dominance = FALSE, delim = "," )read_PopVar( geno, pheno, n.traits, ploidy = 2L, phased = FALSE, sep = ".", assume.inbred = FALSE, max.ind.miss = 0.5, max.mar.miss = 0.5, min.mac = 5, max.r2 = 1, dominance = FALSE, delim = "," )
geno |
A data frame or path to a file containing the marker allele dosages. See Details for information about formatting. |
pheno |
A data frame or path to a file containing the phenotypic data. See Details for information about formatting. |
n.traits |
The number of traits in the |
ploidy |
The ploidy level e.g. 2, 4, 6, ..., etc. |
phased |
Logical. If TRUE, the input marker genotype data is assumed phased, with alleles coded as positive integers.
The name of each individual/clone column is the id and haplotype, concatenated by |
sep |
The haplotype concatenator; see |
assume.inbred |
Logical. If TRUE, individuals/clones are assumed to be completely homozygous. Setting |
max.ind.miss |
The maximum rate of missingness to retain an individual. |
max.mar.miss |
The maximum rate of missingness to retain an marker. |
min.mac |
The minimum minor allele count to retain a marker. |
max.r2 |
The maximum squared correlation between any pair of markers. The |
dominance |
Logical. Should a dominance relationship matrix be constructed? |
delim |
The delimiter character for the |
The first columns of the file or geno data frame or file should be marker, chrom, cM (and optionally bp, if provided). Subsequent columns
contain the allele dosage for individuals/clones/haplotypes, coded 0,1,2, ..., ploidy. While fractional
values are allowed for genomewide prediction, these values are rounded when predicting the genetic variance within crosses.
If unphased marker data is passed, the function will assume that all individuals are homozygous at all sites. Marker genotype doses will be rounded to meet that constraint.
Missing allele dosage information is not permitted for downstream analysis. If there are missing allele dosages, the user will be prompted to run genotype imputation.
The first column of the pheno data frame or file should be genotype ID, columns 2 through n.traits + 1
are trait columns, and subsequent columns are fixed effects.
Individuals/clones in the pheno file or data frame that are not genotyped are removed from the analysis.
Only diploid allele dosages are allowed at this time.
When dominance=FALSE, non-additive effects are not captured.
If dominance=TRUE, a (digenic) dominance covariance matrix is used instead.
A variable of class PopVar.data
# Read in phased cranberry marker data and phenotypes geno <- cranberry_geno_phased pheno <- cranberry_pheno dat <- read_PopVar(geno = geno, pheno = pheno, n.traits = 6, phased = TRUE, sep = "_") # Read in unphased barley marker data and phenotypes; this will assume that all # individuals are completely homozygous geno <- barley_geno pheno <- barley_pheno dat <- read_PopVar(geno = geno, pheno = pheno, n.traits = 4, phased = FALSE, assume.inbred = TRUE) # Read in unphased cranberry marker data and phenotypes geno <- cranberry_geno pheno <- cranberry_pheno dat <- read_PopVar(geno = geno, pheno = pheno, n.traits = 6, phased = FALSE, assume.inbred = FALSE)# Read in phased cranberry marker data and phenotypes geno <- cranberry_geno_phased pheno <- cranberry_pheno dat <- read_PopVar(geno = geno, pheno = pheno, n.traits = 6, phased = TRUE, sep = "_") # Read in unphased barley marker data and phenotypes; this will assume that all # individuals are completely homozygous geno <- barley_geno pheno <- barley_pheno dat <- read_PopVar(geno = geno, pheno = pheno, n.traits = 4, phased = FALSE, assume.inbred = TRUE) # Read in unphased cranberry marker data and phenotypes geno <- cranberry_geno pheno <- cranberry_pheno dat <- read_PopVar(geno = geno, pheno = pheno, n.traits = 6, phased = FALSE, assume.inbred = FALSE)
A sample dataset, previously described in Sallam et al. (2014) is provided as an example of the proper formatting of input files and also for users to become familiar with PopVar; the think_barley dataset is useful in demonstrating both pop.predict and x.val. Note that a number of entries are missing data for one or both traits,
which is representative of a real breeding scenario where phenotypic data may not be available for all parent candidates.
The names of the example files are:
A set of 245 barley lines genotyped with 742 SNP markers
A n x p matrix of n = 245 barley lines genotyped with p = 742 SNP markers
A n x p matrix of n = 245 barley lines and p = 742 imputed SNP marker genotypes
Phenotypes of four traits for a portion of the 245 barley lines, Fusarium head blight (FHB), deoxynivalenol (DON) in ppm, grain yield in bushels/acre, and plant height in cm.
Genetic map (i.e. chromosome assignment and genetic distance (cM) between markers) of the 742 SNP markers based on Munoz-Amatriain et al., 2011
A table of user-defined crosses
Sallam, A.H., J.B. Endelman, J-L. Jannink, and K.P. Smith. 2015. Assessing Genomic Selection Prediction Accuracy in a Dynamic Barley Breeding Population. Plant Gen. 8(1)
x.val performs cross-validation (CV) to estimate the accuracy of genome-wide prediction (otherwise known as genomic selection) for a specific training population (TP), i.e. a set of individuals for which phenotypic and genotypic data is available. Cross-validation can be conducted via one of two methods within x.val, see Details for more information.
NOTE - \code{x.val}, specifically \code{\link[BGLR]{BGLR}} writes and reads files to disk so it is highly recommended to set your working directory
x.val( G.in = NULL, y.in = NULL, min.maf = 0.01, mkr.cutoff = 0.5, entry.cutoff = 0.5, remove.dups = TRUE, impute = "EM", frac.train = 0.6, nCV.iter = 100, nFold = NULL, nFold.reps = 1, return.estimates = FALSE, CV.burnIn = 750, CV.nIter = 1500, models = c("rrBLUP", "BayesA", "BayesB", "BayesC", "BL", "BRR"), saveAt = tempdir() )x.val( G.in = NULL, y.in = NULL, min.maf = 0.01, mkr.cutoff = 0.5, entry.cutoff = 0.5, remove.dups = TRUE, impute = "EM", frac.train = 0.6, nCV.iter = 100, nFold = NULL, nFold.reps = 1, return.estimates = FALSE, CV.burnIn = 750, CV.nIter = 1500, models = c("rrBLUP", "BayesA", "BayesB", "BayesC", "BL", "BRR"), saveAt = tempdir() )
G.in |
TIP - Set header= |
y.in |
|
min.maf |
Optional |
mkr.cutoff |
Optional |
entry.cutoff |
Optional |
remove.dups |
Optional |
impute |
Options include |
frac.train |
Optional |
nCV.iter |
Optional |
nFold |
Optional |
nFold.reps |
Optional |
return.estimates |
Optional |
CV.burnIn |
Optional |
CV.nIter |
Optional |
models |
Optional |
saveAt |
When using models other than "rrBLUP" (i.e. Bayesian models), this is a path and prefix for saving temporary files
the are produced by the |
Two CV methods are available within PopVar:
CV method 1: During each iteration a training (i.e. model training) set will be randomly sampled from the TP of size , where N is the size of the TP, and the remainder of the TP is assigned to the validation set. The accuracies of individual models are expressed as average Pearson's correlation coefficient (r) between the genome estimated breeding value (GEBV) and observed phenotypic values in the validation set across all nCV.iter iterations. Due to its amendibility to various TP sizes, CV method 1 is the default CV method in pop.predict.
CV method 2: nFold independent validation sets are sampled from the TP and predicted by the remainder. For example, if the TP will be split into 10 equal sets, each containing -th of the TP, which will be predicted by the remaining -ths of the TP. The accuracies of individual models are expressed as the average (r) between the GEBV and observed phenotypic values in the validation set across all nFold folds. The process can be repeated nFold.reps times with nFold new independent sets being sampled each replication, in which case the reported prediction accuracies are averages across all folds and replications.
A list containing:
CVs A dataframe of CV results for each trait/model combination specified
If return.estimates is TRUE the additional items will be returned:
models.used A list of the models chosen to estimate marker effects for each trait
mkr.effects A vector of marker effect estimates for each trait generated by the respective prediction model used
betas A list of beta values for each trait generated by the respective prediction model used
## The following examples only use the model 'rrBLUP' for the sake of testing. Functions ## BGLR package write temporary files to the disk. ## CV using method 1 with 25 iterations CV.mthd1 <- x.val(G.in = G.in_ex, y.in = y.in_ex, nCV.iter = 25, models = "rrBLUP") CV.mthd1$CVs ## CV using method 2 with 5 folds and 3 replications x.val(G.in = G.in_ex, y.in = y.in_ex, nFold = 5, nFold.reps = 3, models = "rrBLUP")## The following examples only use the model 'rrBLUP' for the sake of testing. Functions ## BGLR package write temporary files to the disk. ## CV using method 1 with 25 iterations CV.mthd1 <- x.val(G.in = G.in_ex, y.in = y.in_ex, nCV.iter = 25, models = "rrBLUP") CV.mthd1$CVs ## CV using method 2 with 5 folds and 3 replications x.val(G.in = G.in_ex, y.in = y.in_ex, nFold = 5, nFold.reps = 3, models = "rrBLUP")