-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEATURE] Export matched fragment ions for rescoring & spectral library generation #101
[FEATURE] Export matched fragment ions for rescoring & spectral library generation #101
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the idea, and the implementation seems straightforward and sound.
Major comments:
- Annotation should be "zero-cost" for users who do not enable it. Currently, it appears that annotation is happening regardless of any CLI flags. The proposed data layout (putting
Fragments
inline withFeature
) might complicate things. This should be anOption
at least, since annotation should not run by default. I will mess around and see if there is a better place to put this data. - There are unneeded memory allocations & clones in several places.
Fragments
annotation might need to be restructured to eliminate these. In other parts of the codebase I wouldn't be so nitpicky, but performance-sensitive parts are going to be looked at very closely 😉 - I really like the design of the
Fragments
struct - How well does this scale to very large searches?
The approach I have taken to this in the past is just to write a separate Rust binary that imports sage-core
, sage-cloudpath
, reads a result file and does a second pass over the data - still overhead, but extremely fast nonetheless. I am happy to integrate this functionality directly into Sage, but it's also worth considering if a separate binary leveraging Sage's internals is an acceptable solution
Thank you for providing valuable feedback. I have made the necessary updates to the code based on your suggestions. Please review it again and let me know if there are any further adjustments that need to be addressed. Thank you. |
Thanks for making changes - I will start reviewing and testing this week! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tested and everything looks code! Minimal runtime and memory impacts even on very large datasets with annotation turned on (1800 files). I have just a couple nitpicks and one minor change (making psm_id
required/non-Option), but I think I can handle doing them myself
generation (lazear#101) - using counter instead of UUID, parameter renamed, use memory swap instead of clone vector and use Optional in Feature. - update integration test. - correcting the parameter name.
- fix: int32 -> int64 for psm_id in parquet
2cec0aa
to
e16f835
Compare
@lazear Thank you for approving the modifications. |
Thank you for the excellent contribution! |
Hi @lazear,
For some downstream applications (spectral library generation or rescoring), algorithms require access to the scored spectra again. Most frequently, this is implemented by raw data access followed by repeat annotation of fragment ions using the PSMs. However, there is of course considerable overhead in this regard.
We thus think that it would be great if Sage natively had an option to directly export matched fragment ions based on the PSMs. This PR introduces an additional parameter that will export a parquet file containing all matched fragment ions for each PSM. Downstream applications like MS2Rescore or EasyPQP can then use the Sage PSM parquet and this new matched fragments parquet for rescoring or spectral library generation.
This is a draft PR, where we hope to receive feedback of any kind (style, implementation, algorithm, variable naming, etc.) to eventually make this feature as native as possible to Sage.
Thanks for your feedback!