Expected Behavior
mmseqs download would be expected to download an up-to-date version of the target 'nr' and 'nt' databases.
Current Behavior
The download FASTA targets for the 'nr' and 'nt' databases are no longer being updated by NCBI. Explanation: focusing on 'NR' as an example, the URL in databases.sh points to https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz. The README in that FTP location states:
In April 2024, the BLAST FASTA files in this directory will no longer be
available. You can easily generate FASTA files yourself from the formatted
BLAST databases by using the BLAST utility blastdbcmd that comes with the
standalone BLAST programs. See NCBI Insights for more details https://ncbiinsights.ncbi.nlm.nih.gov/2024/01/25/blast-fasta-unavailable-on-ftp/
Indeed, the nr.gz file was last updated on 2024-02-07.
Looking in the parent directory, the various NR database files have been updated as recently as 2024-10-02. Therefore, the download targets for MMseqs2 are out of date by about 8 months, and this problem will get worse over time.
NCBI's solution for users is to download the blast-format files and then generate their own FASTA files:
Sequences in FASTA format can be generated from the pre-formatted databases by using the blastdbcmd utility;
Obviously this isn't ideal for many users, and it's been getting at least some hate.
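For reference, here is a minimal sketch of the extraction NCBI describes. It assumes the preformatted database volumes have already been downloaded and unpacked locally (for example with the update_blastdb.pl script shipped with standalone BLAST+) and that blastdbcmd is on the PATH. The function name and file paths are placeholders of mine, not anything from MMseqs2 or NCBI.

```shell
#!/bin/sh
set -eu

# Dump every sequence of a local preformatted BLAST database to a FASTA file.
# $1: BLAST database name or path prefix (e.g. nr or nt)
# $2: output FASTA path
# The function name is hypothetical; blastdbcmd's default output format is FASTA.
blastdb_to_fasta() {
    blastdbcmd -db "$1" -entry all -out "$2"
}

# Example call (placeholder paths):
#   blastdb_to_fasta nr nr.fasta
```

For a database the size of nr this is likely to be slow and to need a lot of disk space, since the uncompressed FASTA is written out in full.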
Solution
Unless NCBI backflips on their decision, the only solution I can see would be to change the mmseqs databases workflow for these databases to include an intermediate (slow) step that extracts a FASTA file before the mmseqs createdb step is run. However, this would obviously require an extra dependency, i.e. blastdbcmd. Otherwise, I suppose you could host periodic builds of the databases on a server or something.
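Roughly, the intermediate step could look like the sketch below. This is only an illustration of the idea, not how databases.sh is actually structured: the function name, argument order, and scratch-directory handling are my own assumptions, and it presumes the BLAST volumes are already present locally.

```shell
#!/bin/sh
set -eu

# Build an MMseqs2 database from a local preformatted BLAST database.
# $1: BLAST database name (e.g. nr)
# $2: output MMseqs2 database path
# $3: scratch directory for the intermediate FASTA
# The function name and layout are hypothetical.
blastdb_to_mmseqs() {
    fasta="$3/$1.fasta"
    mkdir -p "$3"
    # Slow intermediate step: dump all sequences to FASTA.
    blastdbcmd -db "$1" -entry all -out "$fasta"
    # Usual MMseqs2 step: convert the FASTA into an MMseqs2 database.
    mmseqs createdb "$fasta" "$2"
    # The intermediate FASTA is no longer needed once createdb has finished.
    rm -f "$fasta"
}

# Example call (placeholder paths):
#   blastdb_to_mmseqs nr nrDB tmp
```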
Just thought I should bring this to your attention in case you are unaware 😄
I would recommend just using UniProt instead of NR. It's much better maintained, especially with the versioning. Any annotations against NR are essentially unreproducible due to the lack of versioning by NCBI.
I don’t plan on integrating the blast databases for the foreseeable future.
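For anyone landing here, a minimal sketch of the UniProt route through the existing databases workflow. UniRef90 is just one of the UniProt-derived targets and is picked here as an example; the wrapper function name and output paths are placeholders.

```shell
#!/bin/sh
set -eu

# Download a UniProt-derived target with the MMseqs2 databases workflow.
# $1: database name as listed by `mmseqs databases` (e.g. UniRef90)
# $2: output MMseqs2 database path
# $3: temporary directory
# The wrapper name is hypothetical.
fetch_uniprot() {
    mmseqs databases "$1" "$2" "$3"
}

# Example call (placeholder paths):
#   fetch_uniprot UniRef90 uniref90DB tmp
```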