vcfR2genlight() - variation in ploidy determination #128

fdchevalier · 2019-03-16T22:42:48Z

Hi Brian,

When using vcfR2genlight() with a very small dataset, the function may set different ploidy for each sample. This behavior comes from the automatic determination of the ploidy when creating an object of genlight class (see ploidy section from the genlight-class documentation). Below a reproducible example and a solution.

The reproducible example

library(adegenet)
data(vcfR_test)

# Conversion
gl <- vcfR2genlight(vcfR_test)

# Ploidy of each sample (all should be 2)
ploidy(gl)
#NA00001 NA00002 NA00003 
#      1       1       2 

# The genotype data with only biallelic sites
vcfR_test[ !grepl(",", vcfR_test@fix[,5]), ]@gt
#     FORMAT        NA00001          NA00002          NA00003       
#[1,] "GT:GQ:DP:HQ" "0|0:48:1:51,51" "1|0:48:8:51,51" "1/1:43:5:.,."
#[2,] "GT:GQ:DP:HQ" "0|0:49:3:58,50" "0|1:3:5:65,3"   "0/0:41:3"    
#[3,] "GT:GQ:DP:HQ" "0|0:54:7:56,60" "0|0:48:4:51,51" "0/0:61:2"

When vcfR2genlight() translates the genotypes in binary code, the maximum allele code will be 0 (0|0), 1 (1|0 or 0|1), and 2 (1/1) for the first, second and third sample, respectively. So genlight will determine that the two first samples are haploid and the second is diploid.

The solution

The ploidy can be indicated when creating a genlight object. So the solution is to add this argument in the vcfR_conversion.R file:

# Line 145 should be:
x <- new('genlight', t(x), n.cores=n.cores, ploidy=2)

But I think it would be better to have the ploidy as an argument of the vcfR2genlight(). This would allow:

someone who has "diploidized" the data to specify the ploidy of the organism (this can sometimes be the case when you work on malaria).
someone who has samples with different ploidy (like in yeast).

If the ploidy is an argument, a warning message about mixed ploidy in the resulting genlight object could be a plus.

Comment

A more elegant way could be to determine the ploidy from the data. I wonder if it would be worth writing a general function to estimate the ploidy from the VCF because this has been raised several times (#106, #117, #121).

Fred

The text was updated successfully, but these errors were encountered:

knausb · 2019-03-18T21:18:15Z

Hi @fdchevalier, thanks for the detailed response! The solution I'm pursuing for #117 is to us the first non-NA variant to determine a sample's ploidy. So I think we could include 'ploidy' as a parameter for vcfR2genlight(). A default of ploidy = NULL could trigger this auto detection while a single integer could indicate the ploidy of all samples and a vector the same length as the sample size could be used to indicate individual ploidy. Does that sound like a solution? Thanks for the feedback!

fdchevalier · 2019-03-18T21:23:57Z

I agree, this makes perfect sense.

knausb · 2019-03-18T22:08:03Z

Ok, we'll put it on the to-do list. With the caveat that I have collaborators asking me for manuscript updates so I haven't found a lot of time to code lately. But we'll get it done. Thanks!

lihe-s · 2020-04-28T05:17:43Z

Hi @knausb I used vcfR2genlight to convert my mixed-ploidy( diploid-tetraploid) vcf data (obtained by GATK). Unfortuantely, all the tetraploid sites shows "NA" after conversion (BSJ_1-diploid, BSJ_2-diploid, BSJ_3-tetraploid):

require(vcfR)
require(StAMPP)
vcf <- read.vcfR("di-tetra.vcf.gz", verbose = FALSE)
x <- vcfR2genlight(vcf)
x2 <- as.matrix(x)
t(as.matrix(x))[c(1:50), 1:3]
BSJ_1 BSJ_2 BSJ_3
chr01_1339 0 0 NA
chr01_1405 2 2 NA
chr01_1450 0 0 NA
chr01_3565 2 2 NA
chr01_3610 2 2 NA
chr01_3658 0 0 NA
chr01_3691 0 0 NA
chr01_3709 0 0 NA
chr01_3748 0 0 NA
chr01_3760 0 0 NA
chr01_3913 NA NA NA
chr01_4054 NA NA NA
chr01_4153 NA NA NA
chr01_4165 NA NA NA
chr01_4189 NA NA NA
chr01_4195 NA NA NA
chr01_4198 NA NA NA
chr01_4294 0 0 NA
chr01_4423 0 0 NA
chr01_4588 0 0 NA
chr01_4603 0 0 NA
chr01_4654 0 0 NA
chr01_5413 0 0 NA
chr01_5464 0 0 NA
chr01_5476 0 0 NA
chr01_5491 0 0 NA
chr01_5557 0 0 NA
chr01_5701 2 2 NA
chr01_5782 0 0 NA
chr01_5902 0 0 NA
...

Do you have any solution?

Thanks
Li

knausb · 2020-04-28T17:17:35Z

Hi @lihe-s , I do not work on mixed ploidy data sets. So I don't have much experience with this. If you could create a reproducible example we might be able to come up with a solution. Thanks!

lihe-s mentioned this issue Apr 29, 2020

vcfR2genlight failed to convert polyploid of mixed ploidy vcf (obtained by GATK) #158

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vcfR2genlight() - variation in ploidy determination #128

vcfR2genlight() - variation in ploidy determination #128

fdchevalier commented Mar 16, 2019

knausb commented Mar 18, 2019

fdchevalier commented Mar 18, 2019

knausb commented Mar 18, 2019

lihe-s commented Apr 28, 2020

knausb commented Apr 28, 2020

vcfR2genlight() - variation in ploidy determination #128

vcfR2genlight() - variation in ploidy determination #128

Comments

fdchevalier commented Mar 16, 2019

The reproducible example

The solution

Comment

knausb commented Mar 18, 2019

fdchevalier commented Mar 18, 2019

knausb commented Mar 18, 2019

lihe-s commented Apr 28, 2020

knausb commented Apr 28, 2020