Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Export matched fragment ions for rescoring & spectral library generation #101

Merged
merged 2 commits into from
Nov 29, 2023

Conversation

grosenberger-bruker
Copy link

Hi @lazear,

For some downstream applications (spectral library generation or rescoring), algorithms require access to the scored spectra again. Most frequently, this is implemented by raw data access followed by repeat annotation of fragment ions using the PSMs. However, there is of course considerable overhead in this regard.

We thus think that it would be great if Sage natively had an option to directly export matched fragment ions based on the PSMs. This PR introduces an additional parameter that will export a parquet file containing all matched fragment ions for each PSM. Downstream applications like MS2Rescore or EasyPQP can then use the Sage PSM parquet and this new matched fragments parquet for rescoring or spectral library generation.

This is a draft PR, where we hope to receive feedback of any kind (style, implementation, algorithm, variable naming, etc.) to eventually make this feature as native as possible to Sage.

Thanks for your feedback!

Copy link
Owner

@lazear lazear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the idea, and the implementation seems straightforward and sound.

Major comments:

  • Annotation should be "zero-cost" for users who do not enable it. Currently, it appears that annotation is happening regardless of any CLI flags. The proposed data layout (putting Fragments inline with Feature) might complicate things. This should be an Option at least, since annotation should not run by default. I will mess around and see if there is a better place to put this data.
  • There are unneeded memory allocations & clones in several places. Fragments annotation might need to be restructured to eliminate these. In other parts of the codebase I wouldn't be so nitpicky, but performance-sensitive parts are going to be looked at very closely 😉
  • I really like the design of the Fragments struct
  • How well does this scale to very large searches?

The approach I have taken to this in the past is just to write a separate Rust binary that imports sage-core, sage-cloudpath, reads a result file and does a second pass over the data - still overhead, but extremely fast nonetheless. I am happy to integrate this functionality directly into Sage, but it's also worth considering if a separate binary leveraging Sage's internals is an acceptable solution

crates/sage-cli/src/output.rs Outdated Show resolved Hide resolved
crates/sage-cli/src/input.rs Outdated Show resolved Hide resolved
crates/sage-cli/src/main.rs Outdated Show resolved Hide resolved
crates/sage/src/scoring.rs Outdated Show resolved Hide resolved
crates/sage/src/scoring.rs Show resolved Hide resolved
crates/sage/src/scoring.rs Outdated Show resolved Hide resolved
crates/sage/src/scoring.rs Outdated Show resolved Hide resolved
crates/sage-cloudpath/src/parquet.rs Outdated Show resolved Hide resolved
crates/sage-cloudpath/src/parquet.rs Outdated Show resolved Hide resolved
crates/sage-cli/src/output.rs Outdated Show resolved Hide resolved
@vijay-gnanasambandan-bruker
Copy link
Contributor

Thank you for providing valuable feedback. I have made the necessary updates to the code based on your suggestions. Please review it again and let me know if there are any further adjustments that need to be addressed. Thank you.

@lazear
Copy link
Owner

lazear commented Nov 27, 2023

Thanks for making changes - I will start reviewing and testing this week!

Copy link
Owner

@lazear lazear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested and everything looks code! Minimal runtime and memory impacts even on very large datasets with annotation turned on (1800 files). I have just a couple nitpicks and one minor change (making psm_id required/non-Option), but I think I can handle doing them myself

crates/sage/src/scoring.rs Outdated Show resolved Hide resolved
crates/sage-cli/src/main.rs Outdated Show resolved Hide resolved
crates/sage/src/scoring.rs Outdated Show resolved Hide resolved
@lazear lazear marked this pull request as ready for review November 29, 2023 22:36
generation (lazear#101)

- using counter instead of UUID, parameter renamed, use memory swap instead of clone vector and use Optional in Feature.

- update integration test.

- correcting the parameter name.
- fix: int32 -> int64 for psm_id in parquet
@lazear lazear merged commit e16f835 into lazear:master Nov 29, 2023
1 check passed
@vijay-gnanasambandan-bruker
Copy link
Contributor

@lazear Thank you for approving the modifications.

@lazear
Copy link
Owner

lazear commented Nov 30, 2023

Thank you for the excellent contribution!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants