All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Initial support for searching diaPASEF data
- `override_precursor_charge` setting that forces multiple charge states to be searched
- `precursor_ppm` field reports the signed (non-absolute) average mass error, rather than the absolute average mass error.
- Don't deisotope reporter ion regions if MS2-based TMT/iTRAQ is used
- Removed `fragment_min_mz` and `fragment_max_mz` parameters. These were decreasing the accuracy of preliminary scoring estimation when attempting to annotate multiply-charged, high-m/z ions.
- Added columns missing from parquet output: `semi_enzymatic` and `missed_cleavages`
- Fixed ion mobility parsing from some mzMLs
- MGF paths were being lowercased prior to parsing
- Support for MGF files
- Support for writing ion mobility measurements to output files: `ion_mobility`, `predicted_mobility`, `delta_mobility` added to primary tsv and parquet reports. Ion mobility is predicted in a similar manner to RT, using a linear model trained on the data from the search.
- Support for semi-enzymatic digests (`database.enzyme.semi_enzymatic` parameter)
- Ability to directly export matched fragment ions (e.g. for spectral library or rescoring) with the `--annotate-matches` CLI option. This is compatible with the `--parquet` CLI option as well. Annotations will be written to `matched_fragments.sage.tsv` or `matched_fragments.sage.parquet`
- Sage sends basic telemetry data (version of Sage, run time, OS, # of CPU cores, # of peptides in database, whether LFQ is used) to a remote server. No information about your actual data is sent - e.g. identifications, quantities, organism, or modifications are NOT tracked or reported. This data will be used to help focus efforts on improving Sage and figuring out which features are most used. Please take a look at `crates/sage-cli/src/telemetry.rs` to see exactly what is sent! You can disable sending telemetry data by using the `--disable-telemetry-i-dont-want-to-improve-sage` CLI flag.
- Modified visibility on some crate internals to support the sagepy project
- Added `psm_id` field to various output files to match the new `--annotate-matches` option.
- Removed the `ms1_intensity` field from CSV output, since it is essentially useless
- Unstable feature: Preliminary support for reading Bruker .d folders (ddaPASEF; no MS1/LFQ support yet)
- Retention times are converted to minutes
- Fixed bug where charge state 1 would never be searched
- Hotfix for bug in parquet LFQ writer
- `quant.lfq_settings.combine_charge_state` boolean option. By default this is set to `true`, and LFQ is performed on the peptide-level, where all charge states are treated as the same precursor. Setting this to `false` performs LFQ on the peptide-charge-level, where each charge state will be treated separately.
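
The two settings of this option can be pictured as a change of grouping key. The following is a minimal sketch, not Sage's LFQ code: the record layout, the function names `group_key` and `sum_intensities`, and the use of a plain sum are all assumptions made for illustration.

```python
from collections import defaultdict


def group_key(peptide, charge, combine_charge_state=True):
    # Peptide-level key when charge states are combined,
    # peptide-charge key when each charge state is kept separate.
    return peptide if combine_charge_state else (peptide, charge)


def sum_intensities(records, combine_charge_state=True):
    # records: iterable of (peptide, charge, intensity) tuples (hypothetical layout).
    totals = defaultdict(float)
    for peptide, charge, intensity in records:
        totals[group_key(peptide, charge, combine_charge_state)] += intensity
    return dict(totals)


records = [("PEPTIDEK", 2, 100.0), ("PEPTIDEK", 3, 50.0), ("LESLIEK", 2, 10.0)]
combined = sum_intensities(records, combine_charge_state=True)
separate = sum_intensities(records, combine_charge_state=False)
```

With the combined key, both charge states of the same peptide land in one group; with the separate key, each (peptide, charge) pair gets its own entry.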
- Percolator output format now contains the integer-valued charge state encoded in the `z=other` column, if the charge state is outside the range 2-6 (e.g. a value of 7 will appear in the `z=other` column, rather than it being one-hot encoded)
- LFQ uses the charge state range from the `precursor_charge` configuration option for tracing MS1 peaks
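
The `z=other` encoding described above can be sketched as follows. This is an illustration only: the function name `encode_charge` and the column names other than `z=other` (for example `z=2`) are assumptions, not taken from Sage's writer.

```python
def encode_charge(z, low=2, high=6):
    # One column per charge state in [low, high], plus a catch-all column.
    columns = {f"z={c}": 0 for c in range(low, high + 1)}
    columns["z=other"] = 0
    if low <= z <= high:
        # In-range charges are one-hot encoded.
        columns[f"z={z}"] = 1
    else:
        # Out-of-range charges are stored as their integer value.
        columns["z=other"] = z
    return columns


in_range = encode_charge(3)
out_of_range = encode_charge(7)
```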
- Added additional output showing search progress if `SAGE_LOG=trace` environment variable is set
- Added additional warnings about precursor tolerances
- Added configuration option `precursor_charge` to make it explicit what charge states are being searched in the case where the mzML does not contain charge state information, or where `wide_window` is turned on.
- Added a warning message if variable modifications are specified as single values (e.g. `15.9949`) instead of lists of values (e.g. `[15.9949]`). By v0.15 this will become a hard error and will not parse, to simplify some of the internal logic.
- Support for parquet file format output. Search results and reporter ion quantification will be written to one file (`results.sage.parquet`) and label-free quant will be written to another (`lfq.parquet`). Parquet files tend to be significantly smaller than TSV files, faster to parse, and are compatible with a variety of distributed SQL engines.
- Implement heapselect algorithm for faster sorting of candidate matches (#80). This is a backwards-incompatible change with respect to output - small changes in PSM ranks will be present between v0.13.4 and v0.14.0
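
The general idea behind heap-based selection is to keep only the best `k` candidates instead of fully sorting every candidate. The sketch below uses Python's `heapq.nlargest` to show the idea; it is not Sage's implementation, and the `(score, peptide_id)` tuple layout and the name `top_k` are assumptions for illustration.

```python
import heapq


def top_k(candidates, k):
    # candidates: iterable of (score, peptide_id) tuples (hypothetical layout).
    # Returns the k highest-scoring candidates, best first.
    return heapq.nlargest(k, candidates, key=lambda c: c[0])


candidates = [(12.5, "a"), (30.1, "b"), (7.0, "c"), (22.4, "d")]
best_two = top_k(candidates, 2)
```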
- Bug in mzML parser, where some older specification-compliant mzMLs would not parse. If your mzMLs previously parsed, then there will be no change in behavior. Added a test case
- Bug in `database.enzyme.restrict` parameter, where `null` values were being overridden with "P" (causing Trypsin/P to behave like Trypsin)
- Subtle change to TMT integration tolerance, and selection of which ion to quantify (most intense). As a result, TMT integration should be more in agreement (if not 100% so) with Proteome Discoverer/FragPipe/etc
- Remove `delta_mass` (precursor ppm) LDA feature - instead, build a delta mass (or ppm) profile using KDE/posterior error calculation code, and use the P(decoy) as a feature for LDA.
- Internal performance and stability improvements for RT prediction & LDA
- Better error reporting thanks to @Elendol
- Added support for multiple variable mods for the same amino acid
- Added support for N/C-terminal modifications specific to an individual amino acid
New syntax:

    "variable_mods": {
        "M": [15.9949],
        "^Q": -17.026549,
        "^E": -18.010565,
        "[": 42.010565
    }

Either a single floating point number (`-18.0`) or a list of floating point numbers (`[-18.0, -15.2]`) can be supplied as modifications. Support for single values may eventually be phased out to simplify the parser.
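
A parser that accepts both forms can normalize every entry to a list before further processing. The following is a minimal sketch of that idea, assuming the configuration has already been loaded into a dictionary; `normalize_mods` is a hypothetical helper, not part of Sage.

```python
def normalize_mods(variable_mods):
    # Turn single values into one-element lists so that every
    # residue maps to a list of floating point masses.
    normalized = {}
    for residue, value in variable_mods.items():
        if isinstance(value, (int, float)):
            normalized[residue] = [float(value)]
        else:
            normalized[residue] = [float(v) for v in value]
    return normalized


mods = {"M": [15.9949], "^Q": -17.026549, "^E": -18.010565, "[": 42.010565}
normalized = normalize_mods(mods)
```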
- Changed "_fdr" columns to "_q" (e.g. "spectrum_q") in "results.sage.tsv" file
- Changed internal data representation of `Peptide` struct to allow for sharing of sequences (using `Arc`) among modified peptides
- Fragment index creation should now be faster
- Add `wide_window` option to configuration file. This option turns off `precursor_tol`, instead using the isolation window written in the mzML file.
- Changed internal calculation of precursor tolerances when searching with `isotope_errors`. The new version should be more accurate. This change also enables a significant boost to search speed for open searches.
- Add rank & charge features to LDA
- One-hot encode charge state information for percolator `.pin` files
- Change PSMId -> SpecId for Mokapot compatibility with `.pin` files
- Support for additional fragment ion types, via the "database.ion_kinds" configuration option. Valid values are "a", "b", "c", "x", "y", "z"
- Sort protein names alphanumerically for each peptide entry. This should enhance stability across runs, and fixes a bug with picked-protein group FDR
- Fix another bug where picked-FDR approaches assume internal decoy generation
- Modify order of operations during deisotoping. Deisotoped peaks can contribute intensity to only 1 parent peak now, rather than potentially multiple parent peaks
- Support for percolator output files (`--write-pin` CLI flag)
- Support for modifying file batch size (`--batch-size N` CLI flag)
- Add `delta_best` feature, which reports the delta hyperscore from the best match to current ranked PSM
- Add Sage version to `results.json` files
- Breaking changes to `quant` section of the configuration file format
- Rename `delta_hyperscore` to `delta_next`
- Altered internal scoring algorithm. Rather than consider all MS2 peaks within a m/z tolerance window to be matches to a theoretical spectrum, consider only the closest peak. This should increase the accuracy of # of matched peaks, and subsequent scores
- Overhaul of chimeric scoring, `report_psms` can now be used to search for multiple chimeric spectra
- Completely overhauled the LFQ algorithm: added match-between runs, peak scoring using normalized spectral angle relative to theoretical isotopic envelope, target decoy scoring of MS1 integration
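
The closest-peak rule from the scoring change above can be sketched as a lookup in a sorted list of observed m/z values. This is an illustration under stated assumptions, not Sage's scoring code: the function name `closest_peak`, the ppm tolerance, and the sorted-list input are all choices made for the example.

```python
import bisect


def closest_peak(mzs, target, tol_ppm):
    # mzs must be sorted ascending. Returns the index of the single peak
    # closest to `target` if it lies within the ppm tolerance, else None.
    i = bisect.bisect_left(mzs, target)
    best = None
    # Only the neighbors on either side of the insertion point can be closest.
    for j in (i - 1, i):
        if 0 <= j < len(mzs):
            if best is None or abs(mzs[j] - target) < abs(mzs[best] - target):
                best = j
    tolerance = target * tol_ppm * 1e-6
    if best is not None and abs(mzs[best] - target) <= tolerance:
        return best
    return None


observed = [500.000, 500.004, 500.009]
match = closest_peak(observed, 500.005, tol_ppm=20.0)
miss = closest_peak(observed, 600.0, tol_ppm=20.0)
```

All three observed peaks fall inside the 20 ppm window around 500.005, but only one of them is reported as the match, so a theoretical fragment contributes at most one matched peak.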
- Fixed bug in picked-peptide FDR that could lead to liberal FDR
- Fixed bug in picked-protein FDR that could lead to conservative FDR
- Fixed bug where using variable protein terminal (e.g. protein N-terminal acetylation) modifications could cause some non-determinism. This also improves the accuracy of peptide => protein assignment. Unfortunately this fix has performance implications, causing creation of the fragment index to take up to ~2x as long.
- Remove `no-parallel` CLI flag, and `parallel` configuration file entry
- Retention times are now globally aligned across files
- RT prediction is then performed on all files at once (on aligned RTs), rather than one file at a time - previously, there were many instances where some files in a search could not have RTs predicted, decreasing the effectiveness of delta_rt as a feature for LDA.
- Peptide sequences within a protein are now deduplicated - previously, repeated peptides would be called multiple times for the same protein (e.g. num_proteins > 1 even if the peptide was unique)
- Fix issues with RT prediction (and occasionally LDA) that arise from 0's being present on the diagonals of the covariance matrix (small amount of regularization added)
- Allow users to set minimum number of matched b+y ions for reporting PSMs (`min_matched_peaks`)
- Internal code for calculating factorials
- Added option for TMT signal/noise quantification, if noise values are present in mzML
- FASTA file path, JSON configuration file can now be specified as "s3://" paths, allowing Sage to run completely disk-free
- Support for non-specific digests, N-terminal enzymatic digestion
- `quant.tmt_level` configuration option to enable MS2 (or MSn) isobaric quantification
- Support for protein N-terminal ('['), C-terminal (']') as well as peptide C-terminal ('$') modifications
- Support for k-combinations of variable modifications. This can be specified with the `database.max_variable_mods` parameter
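
Enumerating k-combinations means considering every subset of candidate sites up to a maximum size. A minimal sketch of that enumeration, assuming a hypothetical list of modifiable positions within one peptide; `modification_site_sets` is an illustrative name and does not reflect how Sage builds its database.

```python
from itertools import combinations


def modification_site_sets(sites, max_variable_mods):
    # Every subset of `sites` with 0..max_variable_mods members,
    # including the empty set (the unmodified peptide).
    result = []
    for k in range(max_variable_mods + 1):
        result.extend(combinations(sites, k))
    return result


sites = [1, 4, 7]  # hypothetical modifiable positions
site_sets = modification_site_sets(sites, max_variable_mods=2)
```

The number of subsets grows quickly with the number of candidate sites, which is why a cap on the number of simultaneous variable modifications is useful.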
- Fix bug with in silico digest: logic around overwriting decoys with target sequences was incorrect. Peptides shared between targets/decoys were being annotated as decoy peptides but assigned to non-decoy proteins. We now make sure that they are assigned to non-decoy proteins and also annotated as target sequences.
- Add support for user-specified enzymes to JSON file. `database.enzyme.sites` and `database.enzyme.restrict` are limited to valid amino acids
- Sage can now search MS2 spectra without annotated precursor charge states. Default behavior is to search with z=2, z=3, z=4, and then merge the PSMs for scoring
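
When the precursor charge is unknown, each assumed charge state implies a different neutral mass for the same observed m/z. The sketch below shows that arithmetic for the default charges mentioned above; the function name `candidate_neutral_masses` is hypothetical and the code is not taken from Sage.

```python
PROTON_MASS = 1.007276466  # Da


def candidate_neutral_masses(precursor_mz, charges=(2, 3, 4)):
    # Neutral mass implied by each assumed charge: (m/z - proton) * z.
    return {z: (precursor_mz - PROTON_MASS) * z for z in charges}


masses = candidate_neutral_masses(500.0)
```

Each candidate mass would then be searched separately, and the resulting PSMs merged before scoring, as the entry above describes.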
- Configuration file schema changed. `peptide_min_len`, `peptide_max_len`, `missed_cleavages` are now specified under `database.enzyme` in the JSON file
- Internal behavior of Sage was changed to enable deterministic searching
- Docker file changed from Alpine to Debian
- Changelog
- `rank` column added to output file
- `database.generate_decoys` parameter, which turns off internal decoy generation. This enables the use of FASTA databases for SearchGUI/PeptideShaker
- Base ProForma v2 notation is used for peptide modifications, i.e. "[+304.2071]-PEPTIDEM[+15.9949]AAC[+57.0214]H"
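
The mass-delta notation in the example above can be produced by a small formatter. This is a sketch only: `format_proforma`, its arguments, and the choice of four decimal places (matching the example string) are assumptions, and it covers just N-terminal and residue mass deltas rather than the full ProForma specification.

```python
def format_proforma(sequence, residue_mods=None, n_term_mod=None):
    # residue_mods maps 0-based residue index -> mass delta (hypothetical layout).
    residue_mods = residue_mods or {}
    parts = []
    if n_term_mod is not None:
        # N-terminal modifications are written before the sequence, followed by "-".
        parts.append(f"[{n_term_mod:+.4f}]-")
    for index, residue in enumerate(sequence):
        parts.append(residue)
        if index in residue_mods:
            parts.append(f"[{residue_mods[index]:+.4f}]")
    return "".join(parts)


formatted = format_proforma(
    "PEPTIDEMAACH",
    residue_mods={7: 15.9949, 10: 57.0214},
    n_term_mod=304.2071,
)
```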
- `scannr` column now contains the full nativeID/spectrum title from the mzML file, i.e. "controllerType=0 controllerNumber=1 scan=30069"
- `discriminant_score` column renamed to `sage_discriminant_score` for PeptideShaker recognition
- `database.decoy_prefix` JSON option changed to `database.decoy_tag`. This allows decoy tagging to occur anywhere within the accession: "sp|P01234_REVERSED|HUMAN"
- Output file renamed: `results.pin` to `results.sage.tsv`
- Output file renamed: `quant.csv` to `quant.tsv`
- Rename `pin_paths` to `output_paths` in `results.json` file
- Support for selenocysteine and pyrrolysine amino acids
- Ability to directly read/write files from AWS S3
- Processing files in parallel processes them in batches of `num_cpus / 2` to avoid memory issues
- Fixed bug where `protein_fdr` was erroneously assigned to `peptide_fdr` output field
- Additional parallelization for assignment of PEP, FDR, writing output files
- Label free quantification can be enabled by turning on `quant.lfq` JSON parameter
- Command line arguments can be used to override configuration file
- Workflow contributions from @wfondrie.
- Don't parse empty MS2 spectra
- Retention time prediction
- Ability to filter low-number b/y-ions for faster preliminary scoring (`database.min_ion_index` option)
- Ability to toggle retention time prediction (`predict_rt`)