Cannot reproduce results on text classification benchmark. #1490

tsWen0309 · 2024-11-23T13:07:48Z

I am trying to reproduce the performance of the model "jina_v3" (https://huggingface.co/jinaai/jina-embeddings-v3) on the text classification benchmark.
I am using the code below:

import mteb
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel

model_name = "jinaai/jina-embeddings-v3"
# Pass the variable itself, not the string literal 'model_name'.
model = SentenceTransformer(model_name, trust_remote_code=True)
tasks = mteb.get_tasks(
    tasks=[
        "AmazonCounterfactualClassification",
        "AmazonReviewsClassification",
        "Banking77Classification",
        "EmotionClassification",
        "ImdbClassification",
        "MTOPIntentClassification",
        "ToxicConversationsClassification",
        "TweetSentimentExtractionClassification",
    ]
)
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(
    model, eval_splits=["test"], output_folder=f"results/{model_name}"
)

The results differ significantly from those on the leaderboard (https://huggingface.co/spaces/mteb/leaderboard).
Any suggestions?

Samoed · 2024-11-23T13:22:23Z

You should load the model like this:

import mteb

model = mteb.load_model("jinaai/jina-embeddings-v3")
...

tsWen0309 · 2024-11-23T13:28:36Z

It seems mteb has no attribute "load_model"? I am using mteb==1.20.0.

Samoed · 2024-11-23T13:30:38Z

Sorry, this should be

import mteb

model = mteb.get_model("jinaai/jina-embeddings-v3")
...

tsWen0309 · 2024-11-23T13:54:28Z

File "D:\code\mteb-main\mteb\models\overview.py", line 126, in get_model
model = meta.load_model(**kwargs)
File "D:\code\mteb-main\mteb\model_meta.py", line 120, in load_model
model: Encoder = loader(**kwargs) # type: ignore
File "D:\code\mteb-main\mteb\model_meta.py", line 37, in sentence_transformers_loader
return SentenceTransformerWrapper(model=model_name, revision=revision, **kwargs)
File "D:\code\mteb-main\mteb\models\sentence_transformer_wrapper.py", line 48, in __init__
model_prompts = self.validate_task_to_prompt_name(self.model.prompts)
File "D:\code\mteb-main\mteb\models\wrapper.py", line 81, in validate_task_to_prompt_name
task = mteb.get_task(task_name=task_name)
File "D:\code\mteb-main\mteb\overview.py", line 318, in get_task
raise KeyError(suggestion)

Any solution?

Samoed · 2024-11-23T14:10:17Z

Can you provide your code? I ran the tasks with the following code and everything worked:

import mteb
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel

model = mteb.get_model("jinaai/jina-embeddings-v3")
tasks = mteb.get_tasks(
    tasks=['AmazonCounterfactualClassification',
    'AmazonReviewsClassification',
    "Banking77Classification",
    'EmotionClassification',
    'ImdbClassification',
    'MTOPIntentClassification',
    'ToxicConversationsClassification',
    'TweetSentimentExtractionClassification'
    ]
)
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(
    model, 
    eval_splits=["test"],
    output_folder="results"
)

tsWen0309 · 2024-11-23T14:14:18Z

I am using your exact code, except that I replaced model=mteb.get_model("jinaai/jina-embeddings-v3") with model=mteb.get_model("jina_v3"), where "jina_v3" is the local path of the jina-embeddings-v3 model downloaded from https://huggingface.co/jinaai/jina-embeddings-v3. Could this be the problem?

Samoed · 2024-11-23T14:17:42Z

Yes, I think this is the problem.

tsWen0309 · 2024-11-23T14:53:12Z

I ran the code successfully, but the results still don't match. For example, on the Banking77 dataset I got an accuracy of 76.77 vs. 84.08 (reported).

tsWen0309 · 2024-11-24T09:24:00Z

I ran the code on several text classification datasets. None of the results match the performance reported on the leaderboard, and they are neither consistently higher nor consistently lower. Do you see the same problem?

Samoed · 2024-11-24T11:47:22Z

@bwanglzu Do you have any ideas?

Results:

Task	Leaderboard score	Eval result
AmazonCounterfactualClassification	0.925440219900916	0.9559220389805099
TweetSentimentExtractionClassification	0.713978494623656	0.7420769666100736
Banking77Classification	0.8408116883116883	0.7678896103896105
ToxicConversationsClassification	0.912890625	0.912548828125
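
To make the gaps easier to read, here is a small sketch that computes the per-task difference (eval result minus leaderboard score) from the numbers in the table above. The variable names are only illustrative.

```python
# Scores copied from the table above: (leaderboard score, eval result).
scores = {
    "AmazonCounterfactualClassification": (0.925440219900916, 0.9559220389805099),
    "TweetSentimentExtractionClassification": (0.713978494623656, 0.7420769666100736),
    "Banking77Classification": (0.8408116883116883, 0.7678896103896105),
    "ToxicConversationsClassification": (0.912890625, 0.912548828125),
}

# A positive delta means the local run scored higher than the leaderboard.
deltas = {task: local - reported for task, (reported, local) in scores.items()}

# Print the tasks ordered by the size of the gap, largest first.
for task, delta in sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{task:<40} {delta:+.4f}")
```

Banking77Classification is the only task that is clearly lower (about 7 points). AmazonCounterfactualClassification and TweetSentimentExtractionClassification come out about 3 points higher, and ToxicConversationsClassification is essentially identical.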

tsWen0309 · 2024-11-24T13:59:03Z

An update: I randomly selected some other models and tried to reproduce their performance. NV-embed-v2 failed; learning2_model succeeded.

KennethEnevoldsen · 2024-11-24T18:59:25Z

Hmm this seems odd.

Regarding the earlier point about replacing mteb.get_model("jinaai/jina-embeddings-v3") with mteb.get_model("jina_v3") (a local path): just want to state that this is indeed an issue, as it will call sentence-transformers to load the model instead of our implementation, which also includes prompt handling (see implementation below):

mteb/mteb/models/jina_models.py

Line 199 in 3ff38ec

model_prompts={
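
As a rough illustration of why the name matters, here is a toy sketch of a name-based lookup with a fallback. This is not mteb's actual code: the registry contents, the `resolve_loader` function, and the description strings are all hypothetical and only mirror the behaviour described above.

```python
# Toy illustration only, NOT mteb's real implementation.
# Hypothetical registry: model name -> description of the loader used.
CUSTOM_LOADERS = {
    "jinaai/jina-embeddings-v3": "custom implementation (with prompt handling)",
}

# Hypothetical fallback used for any name that is not registered.
FALLBACK = "generic sentence-transformers loader (without mteb's prompt configuration)"


def resolve_loader(model_name: str) -> str:
    """Return which loader a name resolves to in this toy registry."""
    return CUSTOM_LOADERS.get(model_name, FALLBACK)


print(resolve_loader("jinaai/jina-embeddings-v3"))  # registered name
print(resolve_loader("jina_v3"))                    # local path, not registered
```

In this sketch the local path "jina_v3" is simply an unknown name, so it falls through to the generic loader.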

A few points to check to make sure everything works:

Which version of MTEB is used?
Do the revision IDs match?
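
For the version question, a minimal sketch using only the standard library might look like the following. The helper name `installed_version` is made up for this example, and the list of packages is just a guess at what is relevant here. It does not cover the revision ID, which has to be compared against the model repository separately.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the versions of the packages involved in the evaluation.
for pkg in ("mteb", "sentence-transformers", "transformers"):
    print(f"{pkg}: {installed_version(pkg) or 'not installed'}")
```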

tsWen0309 · 2024-11-25T06:15:51Z

I am using MTEB==1.20.0, and the revision ID of the "jinaai/jina-embeddings-v3" model is 215a6e121fa0183376388ac6b1ae230326bfeaed.

bwanglzu · 2024-11-25T06:27:20Z

I'll take a look this morning

Samoed mentioned this issue Nov 24, 2024

fix: align readme with current mteb #1493

