VCF files should be block-gzipped and not gzipped #124

jasper1918 · 2019-01-15T19:38:20Z

Excellent package!
I have prototyped a few solutions to use this to apply ML classification of called variants and do filtering.

This is more of a feature request. Just wanted to add that VCFs should block gzipped(bgzf) and not standard gzip. The former is necessary to index the VCF using bcftools and is widely considered a standard.
https://samtools.github.io/bcftools/bgzf-aes-encryption.html

Thanks for all your efforts in making vcfR available.

knausb · 2019-01-15T23:42:20Z

Hi Jasper,

From a philosophical perspective I think you may have a good point. But I'm going to vote against it for now. Part of the reason is that I'd have to learn to use a new library and may require a lot of rewriting of code to make that happen. So it sounds like a high effort but low benefit option.

A more important issue is that this would create a dependency on an external library: htslib. vcfR currently uses zlib to work with 'standard gzip'. zlib is very popular so it will be on most machines including those at CRAN. Even though I agree that bcftools and htslib are well accepted in the bioinformatics community once you step outside that community I doubt anyone will know what you're talking about. Because of this I think your suggestion will create an external dependency that will not build on the CRAN computers and that means they will stop hosting vcfR. I see a lot of value in having packages on CRAN, so I see that as a unfavorable choice.

Can you provide an example of why this is important? Personally, I do not need this feature. Although, my goal is usually to work in R. If your goal is to create VCF files for processing with other software there may be an issue. But I've found that 'standard gzip' is sufficient for my needs. Do you see something I don't? Thanks!

jasper1918 · 2019-01-18T20:06:46Z

Hey Brian-
Thanks for the nice overview.

My use case is meant to be an Rscript run from a command line as part of a workflow. I do need VCFs in bgzip format to be compatible with the universe of other tools in the bioinformatics community. I do agree that not every use case requires a bgzip/indexed VCF but I don't know of any tool that takes a gzip VCF. Some of the most common next steps in a pipeline post VCF filtering would be annotation or aggregation. These steps do require a bgzip and index file.

The workaround is simple enough. I can unzip and bgzip in a single stream. Also understand if you have other development priorities. In case you would like to implement, I looked on CRAN for latest packages that have bgzf.c and found https://github.com/cran/seqminer/tree/master/src. My preference would be to have vcfR write bgzip and index because of full compatibility and just as useful as gzip since it can be decoded with the standard library.

Thanks again for making vcfR available and for considering this feature

knausb · 2019-01-20T23:29:25Z

Hi Jasper, thanks for sticking with me on this. And thanks for pointing out seqminer, I was not aware of it. I like to 'think out loud' here, so here goes. A potential solution is to use bgzip which is a part of htslib.

> library(vcfR)

   *****       ***   vcfR   ***       *****
   This is vcfR 1.8.0 
     browseVignettes('vcfR') # Documentation
     citation('vcfR') # Citation
   *****       *****      *****       *****

> data("vcfR_test")
> vcfR_test
***** Object of Class vcfR *****
3 samples
1 CHROMs
5 variants
Object size: 0 Mb
0 percent missing data
*****        *****         *****
> write.vcf(vcfR_test, file = "myData.vcf.gz")
> library(vcfR)
> data("vcfR_test")
> vcfR_test
***** Object of Class vcfR *****
3 samples
1 CHROMs
5 variants
Object size: 0 Mb
0 percent missing data
*****        *****         *****
> write.vcf(vcfR_test, file = "myData.vcf.gz")
> system("gunzip myData.vcf.gz")
> system("~/bin/htslib-1.9/bgzip -i myData.vcf")
> list.files()
[1] "myData.vcf.gz"     "myData.vcf.gz.gzi" "vcfR.html"         "vcfR.Rmd"          "vcfR.Rproj"       
>

This accomplishes our goal. But there are several reasons this appears undesirable.

It creates intermediate files
vcfR::write.vcf() currently only writes to *vcf.gz (zlib gz) so we have to uncompress the file and then recompress with bgzip (which appears to only work with uncompressed VCF) which may present a bottle neck

So I think we're at "I like the idea, but I do not know how to implement it." Perhaps the path would be to reach out to the authors of seqminer for some guidance. Although, my quick look at the package makes me think they are mostly about reading the data in and not out. We could also reach out to the samtools team to see if we can get guidance from them.

knausb · 2019-01-22T23:32:20Z

I think I've decided that seqminer is not a solution here. It appears to read in portions of a VCF file. See ?seqminer::readVCFToMatrixByRange. In vcfR we're used to reading it all in. Or subsetting by rows. I just found Rhtslib which may be a viable path.

knausb mentioned this issue Jan 22, 2019

Support Tabix #125

Open

knausb mentioned this issue Aug 15, 2019

Support for bcf files #143

Closed

knausb mentioned this issue Mar 6, 2020

VCF files written by vcfR::write.vcf not parsed correctly by bcftools #153

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

VCF files should be block-gzipped and not gzipped #124

VCF files should be block-gzipped and not gzipped #124

jasper1918 commented Jan 15, 2019

knausb commented Jan 15, 2019

jasper1918 commented Jan 18, 2019

knausb commented Jan 20, 2019

knausb commented Jan 22, 2019

VCF files should be block-gzipped and not gzipped #124

VCF files should be block-gzipped and not gzipped #124

Comments

jasper1918 commented Jan 15, 2019

knausb commented Jan 15, 2019

jasper1918 commented Jan 18, 2019

knausb commented Jan 20, 2019

knausb commented Jan 22, 2019