uniprotdb.index file is showing as generic file #887

Open
vineethvintu opened this issue Sep 17, 2024 · 1 comment

Expected Behavior

I ran the following command:
mmseqs expandaln ./base/qdb ./uniprot/uniprotdb.index ./base/res ./uniprot/uniprotdb.index ./base/res_exp --db-load-mode 2 --expansion-mode 0 -e inf --expand-filter-clusters 0 --max-seq-id 0.95 --threads 124

I created the UniProt database with the mmseqs createdb command, which also created the uniprotdb.index file.

Current Behavior

However, after running the expandaln command, I get an error saying that uniprotdb.index has the Generic type:
Input database "./uniprot/uniprotdb.index" has the wrong type (Generic)
Allowed input:

  • Index
  • Nucleotide
  • Profile
  • Aminoacid
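
For context, an MMseqs2 database is normally addressed by its prefix (here `uniprotdb`), and its type is stored in a companion file named `<prefix>.dbtype`. Passing `uniprotdb.index` as the database argument would make the tool look for a type file next to that path instead, which may be why the type is reported as Generic. The following is a minimal sketch for checking this, assuming that naming convention; the helper name `has_dbtype` is hypothetical and not part of MMseqs2.

```shell
# Hypothetical helper: succeeds if the given database argument has a
# companion "<argument>.dbtype" file (assumed MMseqs2 naming convention).
has_dbtype() {
    [ -f "$1.dbtype" ]
}

# Check the two candidate arguments discussed in this issue.
for db in ./uniprot/uniprotdb ./uniprot/uniprotdb.index; do
    if has_dbtype "$db"; then
        echo "$db: has a .dbtype file"
    else
        echo "$db: no .dbtype file"
    fi
done
```

An argument that reports no `.dbtype` file is unlikely to be a usable database prefix.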

Steps to Reproduce (for bugs)

Please make sure to execute the reproduction steps with newly recreated and empty tmp folders.
MMSEQS="$1"
QUERY="$2"
BASE="$4"
DB1="$5"
DB2="$6"
DB3="$7"
USE_ENV="$8"
USE_TEMPLATES="$9"
FILTER="${10}"
TAXONOMY="${11}"
M8OUT="${12}"
EXPAND_EVAL=inf
ALIGN_EVAL=10
DIFF=3000
QSC=-20.0
MAX_ACCEPT=1000000
if [ "${FILTER}" = "1" ]; then
# 0.1 was not used in benchmarks due to POSIX shell bug in line above
EXPAND_EVAL=0.1
ALIGN_EVAL=10
QSC=0.8
MAX_ACCEPT=100000
fi
export MMSEQS_CALL_DEPTH=1
SEARCH_PARAM="--num-iterations 3 --db-load-mode 2 -a --k-score 'seq:96,prof:80' -e 0.1 --max-seqs 10000"
FILTER_PARAM="--filter-min-enable 1000 --diff ${DIFF} --qid 0.0,0.2,0.4,0.6,0.8,1.0 --qsc 0 --max-seq-id 0.95"
EXPAND_PARAM="--expansion-mode 0 -e ${EXPAND_EVAL} --expand-filter-clusters ${FILTER} --max-seq-id 0.95"
mkdir -p "${BASE}"
mkdir -p "${BASE}"
"${MMSEQS}" createdb "${QUERY}" "${BASE}/qdb"
"${MMSEQS}" search "${BASE}/qdb" "${DB1}" "${BASE}/res" "${BASE}/tmp1" $SEARCH_PARAM
"${MMSEQS}" mvdb "${BASE}/tmp1/latest/profile_1" "${BASE}/prof_res"
"${MMSEQS}" lndb "${BASE}/qdb_h" "${BASE}/prof_res_h"
mmseqs expandaln ./base/qdb ./uniprot/uniprotdb.index ./base/res ./uniprot/uniprotdb.index ./base/res_exp --db-load-mode 2 --expansion-mode 0 -e inf --expand-filter-clusters 0 --max-seq-id 0.95 --threads 124

I got stuck at the above command.
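
The other steps in this script address the target database as `"${DB1}.idx"`, where `DB1` is the database prefix without any suffix. The following sketch writes the expandaln step in the same style and prints the command instead of running it. It assumes that expandaln accepts the `.idx` prefix in the same way as those steps; the function name `expandaln_cmd` is hypothetical, and the options are copied unchanged from the command in this issue.

```shell
# Hypothetical helper: print (do not run) the expandaln command.
#   $1 = BASE directory
#   $2 = target database prefix, without ".index" or ".idx"
expandaln_cmd() {
    echo mmseqs expandaln "$1/qdb" "$2.idx" "$1/res" "$2.idx" "$1/res_exp" \
        --db-load-mode 2 --expansion-mode 0 -e inf \
        --expand-filter-clusters 0 --max-seq-id 0.95 --threads 124
}

# Paths as used in this issue.
expandaln_cmd ./base ./uniprot/uniprotdb
```

Printing the command first makes it easy to confirm which files are being passed before starting a long run.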

The next steps I plan to run are:
"${MMSEQS}" align "${BASE}/prof_res" "${DB1}.idx" "${BASE}/res_exp" "${BASE}/res_exp_realign" --db-load-mode 2 -e ${ALIGN_EVAL} --max-accept ${MAX_ACCEPT} --alt-ali 10 -a
"${MMSEQS}" filterresult "${BASE}/qdb" "${DB1}.idx" "${BASE}/res_exp_realign" "${BASE}/res_exp_realign_filter" --db-load-mode 2 --qid 0 --qsc $QSC --diff 0 --max-seq-id 1.0 --filter-min-enable 100

MMseqs Output (for bugs)

Please make sure to also post the complete output of MMseqs. You can use gist.github.com for large output.
$ time mmseqs expandaln ./base/qdb ./uniprot/uniprotdb.index ./base/res ./uniprot/uniprotdb.index ./base/res_exp --db-load-mode 2 --expansion-mode 0 -e inf --expand-filter-clusters 0 --max-seq-id 0.95 --threads 124
expandaln ./base/qdb ./uniprot/uniprotdb.index ./base/res ./uniprot/uniprotdb.index ./base/res_exp --db-load-mode 2 --expansion-mode 0 -e inf --expand-filter-clusters 0 --max-seq-id 0.95 --threads 124

MMseqs Version: GITDIR-NOTFOUND
Expansion mode 0
Substitution matrix aa:blosum62.out,nucl:nucleotide.out
Gap open cost aa:11,nucl:5
Gap extension cost aa:1,nucl:2
Max sequence length 65535
Score bias 0
Compositional bias 1
Compositional bias 1
E-value threshold inf
Seq. id. threshold 0
Coverage threshold 0
Coverage mode 0
Pseudo count mode 0
Pseudo count a substitution:1.100,context:1.400
Pseudo count b substitution:4.100,context:5.800
Expand filter clusters 0
Use filter only at N seqs 0
Maximum seq. id. threshold 0.95
Minimum seq. id. 0.0
Minimum score per column -20
Minimum coverage 0
Select N most diverse seqs 1000
Preload mode 2
Compressed 0
Threads 124
Verbosity 3

Input database "./uniprot/uniprotdb.index" has the wrong type (Generic)
Allowed input:

  • Index
  • Nucleotide
  • Profile
  • Aminoacid

Context

I am trying to get the MMseqs2 output in MSA format so that it can be used as input to AlphaFold for protein structure prediction.

Your Environment

Include as many relevant details about the environment you experienced the bug in.

  • Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters):
    MMseqs2 (Many against Many sequence searching) is an open-source software suite for very fast,
    parallelized protein sequence searches and clustering of huge protein sequence data sets.

Please cite: M. Steinegger and J. Soding. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi:10.1038/nbt.3988 (2017).

MMseqs2 Version: GITDIR-NOTFOUND
© Martin Steinegger ([email protected])

usage: mmseqs <command> [<args>]

Easy workflows for plain text input/output
easy-search Sensitive homology search
easy-cluster Slower, sensitive clustering
easy-linclust Fast linear time cluster, less sensitive clustering
easy-taxonomy Taxonomic classification
easy-rbh Find reciprocal best hit

Main workflows for database input/output
search Sensitive homology search
map Map nearly identical sequences
rbh Reciprocal best hit search
linclust Fast, less sensitive clustering
cluster Slower, sensitive clustering
clusterupdate Update previous clustering with new sequences
taxonomy Taxonomic classification

Input database creation
databases List and download databases
createdb Convert FASTA/Q file(s) to a sequence DB
createindex Store precomputed index on disk to reduce search overhead
convertmsa Convert Stockholm/PFAM MSA file to a MSA DB
msa2profile Convert a MSA DB to a profile DB

Format conversion for downstream processing
convertalis Convert alignment DB to BLAST-tab, SAM or custom format
createtsv Convert result DB to tab-separated flat file
convert2fasta Convert sequence DB to FASTA format
taxonomyreport Create a taxonomy report in Kraken or Krona format

An extended list of all modules can be obtained by calling 'mmseqs -h'.

Bash completion for modules and parameters can be installed by adding "source MMSEQS_HOME/util/bash-completion.sh" to your "$HOME/.bash_profile".
Include the location of the MMseqs2 binary in your "$PATH" environment variable.

  • Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.):
    $ which mmseqs
    ~/MMseqs2-71dd32ec43e3ac4dabf111bbc4b124f1c66a85f1/build/bin/mmseqs
  • For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation:
  • Server specifications (especially CPU support for AVX2/SSE and amount of system memory):
  • Operating system and version:
    macOS 15

vineethvintu commented Sep 17, 2024

I have now followed this step from the documentation:
For the next step, an index file of the targetDB is computed for a fast read-in. It is recommended
to compute the index if the targetDB is reused for several searches. If only few searches against this
database will be done, this step should be skipped.
mmseqs createindex targetDB tmp
This call will create a targetDB.idx file. It is just possible to have one index per database.
Then generate a directory for temporary files. MMseqs2 can produce a high IO on the file system.
It is recommended to create this temporary folder on a local drive.
After running createindex, the database directory contains:
tmp uniprotdb.dbtype uniprotdb_h.dbtype uniprotdb.idx.0 uniprotdb.idx.2 uniprotdb.idx.4 uniprotdb.idx.index uniprotdb.lookup
uniprotdb uniprotdb_h uniprotdb_h.index uniprotdb.idx.1 uniprotdb.idx.3 uniprotdb.idx.dbtype uniprotdb.index uniprotdb.source

So now I am confused about which index file should be passed to expandaln. This is the command I am running:

mmseqs expandaln ./base/qdb ./uniprot/uniprotdb.index ./base/res ./uniprot/uniprotdb.index ./base/res_exp --db-load-mode 2 --expansion-mode 0 -e inf --expand-filter-clusters 0 --max-seq-id 0.95 --threads 124
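
The listing above mixes data files, `.index` files and `.dbtype` files. As a rough way to see which names can be passed as database arguments, the following sketch lists every prefix that has a `.dbtype` companion. It assumes the usual MMseqs2 layout of `<prefix>`, `<prefix>.index` and `<prefix>.dbtype`; the function name `list_db_prefixes` is hypothetical.

```shell
# Hypothetical helper: print each database prefix found in a directory,
# i.e. every "*.dbtype" file name with the ".dbtype" suffix removed.
list_db_prefixes() {
    for f in "$1"/*.dbtype; do
        [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
        basename "$f" .dbtype
    done
}

list_db_prefixes ./uniprot
```

For the files listed above, this reports `uniprotdb`, `uniprotdb_h` and `uniprotdb.idx`, and it does not report `uniprotdb.index` or `uniprotdb.idx.index`.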
