Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PureCN unintentionally filters out reads with supplementary alignments #371

Open
tinyheero opened this issue Aug 6, 2024 · 3 comments
Open

Comments

@tinyheero
Copy link

tinyheero commented Aug 6, 2024

In the .calculateBamCoverageByInterval() function:

.calculateBamCoverageByInterval <- function(intervalGr, ...) {
  param <- ScanBamParam(what = c("pos", "qwidth", "flag"),
              which = intervalGr,
              flag = scanBamFlag(isUnmappedQuery = FALSE,
                           isNotPassingQualityControls = FALSE,
                           isSecondaryAlignment = FALSE,
                           isDuplicate = NA
                   ),
              ...
           )

  xAll <- scanBam(bam.file, index = index.file, param = param)
  xDupFiltered <- .filterDuplicates(xAll)

There is no explicit use of the isSupplementaryAlignment argument in the scanBamFlag() call leading one to believe that reads with supplementary alignments are not filtered out. However the call to the .filterDuplicates() function:

.filterDuplicates <- function(x) {
    lapply(x, function(y) {
        idx <- y$flag < 1024
        lapply(y, function(z) z[idx])
    })
}

Will retain reads that have a SAM flag less than 1024.

However, reads with supplementary alignments have a SAM flag of 2048.

$> samtools flag
About: Convert between textual and numeric flag representation
Usage: samtools flags FLAGS...

Each FLAGS argument is either an INT (in decimal/hexadecimal/octal) representing
a combination of the following numeric flag values, or a comma-separated string
NAME,...,NAME representing a combination of the following flag names:

   0x1     1  PAIRED         paired-end / multiple-segment sequencing technology
   0x2     2  PROPER_PAIR    each segment properly aligned according to aligner
   0x4     4  UNMAP          segment unmapped
   0x8     8  MUNMAP         next segment in the template unmapped
  0x10    16  REVERSE        SEQ is reverse complemented
  0x20    32  MREVERSE       SEQ of next segment in template is rev.complemented
  0x40    64  READ1          the first segment in the template
  0x80   128  READ2          the last segment in the template
 0x100   256  SECONDARY      secondary alignment
 0x200   512  QCFAIL         not passing quality controls or other filters
 0x400  1024  DUP            PCR or optical duplicate
 0x800  2048  SUPPLEMENTARY  supplementary alignment

This means reads with supplementary alignments will always be filtered out regardless of whether they are a PCR duplicate or not when the filterDuplicates function is called.

This doesn't seem intentional?

@lima1
Copy link
Owner

lima1 commented Aug 7, 2024

Thanks so much @tinyheero. This code is indeed ancient and I'm not remembering why I implemented it that way. I will look into the impact. Gut feeling should not matter, right? Or do you have a concern here?

@tinyheero
Copy link
Author

You're right in that the effects are minor as supplementary alignments make a small proportion of the total reads.

The reason why I spotted it was because I noticed several regions with much lower counts that expected and it was simply because all the reads with supplementary alignments were not being counted. Regions with many supplementary alignments tend to be associated with SV events so it makes sense to actually retain them in my opinion.

@lima1
Copy link
Owner

lima1 commented Aug 8, 2024

I'll go through my collections of normal and cancer test samples. If it's generally safe to add them even when normal samples do not, it probably makes indeed sense to retain.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants