[FEATURE] Export matched fragment ions for rescoring & spectral library generation #101

grosenberger-bruker · 2023-11-08T14:02:39Z

For some downstream applications (spectral library generation or rescoring), algorithms need access to the scored spectra again. Most frequently, this is implemented by accessing the raw data and re-annotating the fragment ions using the PSMs. However, reading the raw data a second time and repeating the annotation adds considerable overhead.

We thus think that it would be great if Sage natively had an option to directly export matched fragment ions based on the PSMs. This PR introduces an additional parameter that will export a parquet file containing all matched fragment ions for each PSM. Downstream applications like MS2Rescore or EasyPQP can then use the Sage PSM parquet and this new matched fragments parquet for rescoring or spectral library generation.
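As a rough illustration of how a downstream tool might consume the two files, the sketch below groups matched-fragment rows by PSM identifier so they can be looked up alongside each PSM. It assumes the fragment table carries a `psm_id` column linking it to the PSM table; the `FragmentRow` struct, its field names, and `group_by_psm` are hypothetical and not taken from Sage, and reading the parquet files themselves is left out.

```rust
use std::collections::HashMap;

/// One row of a matched-fragments table.
/// Hypothetical layout: the field names are assumptions, not Sage's schema.
#[derive(Debug, Clone, PartialEq)]
struct FragmentRow {
    psm_id: i64,          // assumed key linking the row to a PSM
    ion_kind: char,       // e.g. 'b' or 'y'
    ordinal: i32,         // position of the fragment in the ion series
    charge: i32,
    mz_calculated: f32,
    mz_experimental: f32,
    intensity: f32,
}

/// Group fragment rows by PSM identifier, preserving input order within each group.
fn group_by_psm(rows: Vec<FragmentRow>) -> HashMap<i64, Vec<FragmentRow>> {
    let mut grouped: HashMap<i64, Vec<FragmentRow>> = HashMap::new();
    for row in rows {
        grouped.entry(row.psm_id).or_default().push(row);
    }
    grouped
}
```

A rescoring or library tool could then iterate over its PSM rows and fetch the annotated peaks for each one from the map, without touching the raw spectra.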

This is a draft PR, where we hope to receive feedback of any kind (style, implementation, algorithm, variable naming, etc.) to eventually make this feature as native as possible to Sage.

Thanks for your feedback!

lazear

I really like the idea, and the implementation seems straightforward and sound.

Major comments:

- Annotation should be "zero-cost" for users who do not enable it. Currently, it appears that annotation is happening regardless of any CLI flags. The proposed data layout (putting Fragments inline with Feature) might complicate things. This should be an Option at least, since annotation should not run by default. I will mess around and see if there is a better place to put this data.
- There are unneeded memory allocations & clones in several places. Fragments annotation might need to be restructured to eliminate these. In other parts of the codebase I wouldn't be so nitpicky, but performance-sensitive parts are going to be looked at very closely 😉
- I really like the design of the Fragments struct.
- How well does this scale to very large searches?
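The first two comments can be sketched together: keep the annotation behind an `Option` so nothing is collected unless requested, and move the collected vectors out of a reusable buffer with `std::mem::take` rather than cloning them. This is a minimal sketch under stated assumptions, not Sage's implementation; the simplified `Fragments` and `Feature` structs, the `score` function, and its `annotate` flag are all hypothetical stand-ins.

```rust
/// Simplified, hypothetical stand-in for the annotation data.
#[derive(Debug, Default, Clone, PartialEq)]
struct Fragments {
    mz_experimental: Vec<f32>,
    intensities: Vec<f32>,
}

/// Simplified, hypothetical stand-in for a scored PSM.
#[derive(Debug, Default)]
struct Feature {
    matched_peaks: u32,
    // None unless annotation was requested, so the default path carries no fragment data.
    fragments: Option<Fragments>,
}

/// Count theoretical m/z values that have an observed peak within `tol`.
/// `peaks` holds (m/z, intensity) pairs. Matched peaks are recorded only
/// when `annotate` is true.
fn score(
    peaks: &[(f32, f32)],
    theoretical: &[f32],
    tol: f32,
    scratch: &mut Fragments,
    annotate: bool,
) -> Feature {
    let mut matched = 0u32;
    for &t in theoretical {
        if let Some(&(mz, intensity)) = peaks.iter().find(|p| (p.0 - t).abs() <= tol) {
            matched += 1;
            if annotate {
                scratch.mz_experimental.push(mz);
                scratch.intensities.push(intensity);
            }
        }
    }
    Feature {
        matched_peaks: matched,
        // mem::take moves the vectors out and leaves an empty Fragments behind,
        // so no clone of the collected data is made.
        fragments: if annotate { Some(std::mem::take(scratch)) } else { None },
    }
}
```

With `annotate` set to false, the scratch buffer is never written to and the returned `Feature` holds `None`, so the only extra cost on the default path is a branch per matched peak.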

The approach I have taken to this in the past is just to write a separate Rust binary that imports sage-core and sage-cloudpath, reads a result file, and does a second pass over the data. That still has overhead, but it is extremely fast nonetheless. I am happy to integrate this functionality directly into Sage, but it's also worth considering whether a separate binary leveraging Sage's internals is an acceptable solution.

vijay-gnanasambandan-bruker · 2023-11-17T21:24:42Z

Thank you for providing valuable feedback. I have made the necessary updates to the code based on your suggestions. Please review it again and let me know if there are any further adjustments that need to be addressed. Thank you.

lazear · 2023-11-27T19:38:07Z

Thanks for making changes - I will start reviewing and testing this week!

lazear

I've tested and everything looks good! Minimal runtime and memory impacts even on very large datasets with annotation turned on (1800 files). I have just a couple of nitpicks and one minor change (making psm_id required/non-Option), but I think I can handle doing them myself.
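For context on the psm_id change, the commit notes for this PR mention using a counter instead of a UUID and widening the parquet column to int64. One way such a required, counter-based identifier could be produced is sketched below. The names `PSM_COUNTER` and `next_psm_id`, the starting value of 1, and the use of an atomic are assumptions for illustration, not a description of Sage's code.

```rust
use std::sync::atomic::{AtomicI64, Ordering};

/// Process-wide counter. Hypothetical name; starts at 1 by assumption.
static PSM_COUNTER: AtomicI64 = AtomicI64::new(1);

/// Return a unique, monotonically increasing identifier.
/// fetch_add is atomic, so concurrent callers never receive the same value.
fn next_psm_id() -> i64 {
    PSM_COUNTER.fetch_add(1, Ordering::Relaxed)
}

/// Hypothetical PSM record: the identifier is a plain i64 rather than
/// Option<i64>, so every output row is guaranteed to carry one.
#[derive(Debug)]
struct Psm {
    psm_id: i64,
    peptide: String,
}

/// Build a PSM record and assign it the next identifier.
fn new_psm(peptide: &str) -> Psm {
    Psm {
        psm_id: next_psm_id(),
        peptide: peptide.to_string(),
    }
}
```

Compared with a UUID, an integer counter is cheaper to generate and compact to store, and a 64-bit width leaves ample headroom for very large searches.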

vijay-gnanasambandan-bruker · 2023-11-30T16:34:01Z

@lazear Thank you for approving the modifications.

lazear · 2023-11-30T17:03:29Z

Thank you for the excellent contribution!

lazear requested changes Nov 8, 2023

lazear approved these changes Nov 29, 2023

lazear marked this pull request as ready for review November 29, 2023 22:36

vijay-gnanasambandan-bruker and others added 2 commits November 29, 2023 15:29

0e4498c
feat: export matched fragments for rescoring, spectral library generation (lazear#101)
- using counter instead of UUID, parameter renamed, use memory swap instead of clone vector and use Optional in Feature.
- update integration test.
- correcting the parameter name.

e16f835
fix: make psm_id required field, add to all outputs, change ordinal
- fix: int32 -> int64 for psm_id in parquet

lazear force-pushed the matching_fragments branch from 2cec0aa to e16f835 Compare November 29, 2023 23:30

lazear merged commit e16f835 into lazear:master Nov 29, 2023
1 check passed

karlssoc mentioned this pull request Feb 28, 2024

Library generation fra Sage results (https://github.com/lazear/sage) grosenberger/easypqp#110

Open
