-
Notifications
You must be signed in to change notification settings - Fork 54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
VCF files should be block-gzipped and not gzipped #124
Comments
Hi Jasper, From a philosophical perspective I think you may have a good point. But I'm going to vote against it for now. Part of the reason is that I'd have to learn to use a new library and may require a lot of rewriting of code to make that happen. So it sounds like a high effort but low benefit option. A more important issue is that this would create a dependency on an external library: htslib. Can you provide an example of why this is important? Personally, I do not need this feature. Although, my goal is usually to work in R. If your goal is to create VCF files for processing with other software there may be an issue. But I've found that 'standard gzip' is sufficient for my needs. Do you see something I don't? Thanks! |
Hey Brian- My use case is meant to be an Rscript run from a command line as part of a workflow. I do need VCFs in bgzip format to be compatible with the universe of other tools in the bioinformatics community. I do agree that not every use case requires a bgzip/indexed VCF but I don't know of any tool that takes a gzip VCF. Some of the most common next steps in a pipeline post VCF filtering would be annotation or aggregation. These steps do require a bgzip and index file. The workaround is simple enough. I can unzip and bgzip in a single stream. Also understand if you have other development priorities. In case you would like to implement, I looked on CRAN for latest packages that have bgzf.c and found https://github.com/cran/seqminer/tree/master/src. My preference would be to have vcfR write bgzip and index because of full compatibility and just as useful as gzip since it can be decoded with the standard library. Thanks again for making vcfR available and for considering this feature |
Hi Jasper, thanks for sticking with me on this. And thanks for pointing out seqminer, I was not aware of it. I like to 'think out loud' here, so here goes. A potential solution is to use bgzip which is a part of htslib.
This accomplishes our goal. But there are several reasons this appears undesirable.
So I think we're at "I like the idea, but I do not know how to implement it." Perhaps the path would be to reach out to the authors of seqminer for some guidance. Although, my quick look at the package makes me think they are mostly about reading the data in and not out. We could also reach out to the samtools team to see if we can get guidance from them. |
I think I've decided that |
Excellent package!
I have prototyped a few solutions to use this to apply ML classification of called variants and do filtering.
This is more of a feature request. Just wanted to add that VCFs should block gzipped(bgzf) and not standard gzip. The former is necessary to index the VCF using bcftools and is widely considered a standard.
https://samtools.github.io/bcftools/bgzf-aes-encryption.html
Thanks for all your efforts in making vcfR available.
The text was updated successfully, but these errors were encountered: