Consolidate Retrieval/Reranking/Instruction Variants #1359
Conversation
Love the PR. I think the time concern is a reasonable trade-off for simplicity and maintainability, but @Muennighoff might know more about this given he made the original choice to split the two.
Have we checked that the new rerankers produce similar scores?
Currently, retrieval tasks can't load multilingual datasets. I think this PR is a good opportunity to standardize this and re-upload them, as the retrieval datasets all have different loaders. As a base we could take the MIRACL loader or the one from CodeSearchNetCCRetrieval. For example, in CMTEB, …
Thanks for the comments @KennethEnevoldsen and @Samoed! I'll go ahead and finish validating these / fixing MIRACL and turn this into a non-draft PR.
@Samoed, when you say standardize, what do you mean? I think each task usually defines its own way of loading, right? Unless you mean having a default loader for this in the main codebase, so each file doesn't have to define its own custom loader?
Yes, currently only retrieval tasks don't support multilingual loading by default and require custom data loaders.
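For concreteness, a shared default loader for multilingual retrieval tasks could look roughly like the sketch below. This is only an illustration: the config names (`{lang}-corpus`, `{lang}-queries`, `{lang}-qrels`) and column names (`_id`, `text`, `query-id`, `corpus-id`, `score`) are assumptions about how re-uploaded datasets might be laid out, not the actual mteb API.

```python
# Illustrative sketch of a shared default loader for multilingual retrieval tasks.
# Dataset config/column names here are assumptions, not the real layout.
import datasets


def load_multilingual_retrieval(path: str, langs: list[str], split: str):
    corpus, queries, qrels = {}, {}, {}
    for lang in langs:
        corpus_ds = datasets.load_dataset(path, f"{lang}-corpus", split=split)
        queries_ds = datasets.load_dataset(path, f"{lang}-queries", split=split)
        qrels_ds = datasets.load_dataset(path, f"{lang}-qrels", split=split)

        corpus[lang] = {
            row["_id"]: {"title": row.get("title", ""), "text": row["text"]}
            for row in corpus_ds
        }
        queries[lang] = {row["_id"]: row["text"] for row in queries_ds}
        qrels[lang] = {}
        for row in qrels_ds:
            qrels[lang].setdefault(row["query-id"], {})[row["corpus-id"]] = int(row["score"])
    return corpus, queries, qrels
```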
An update on this. Everything is done, except for MindSmall, but perhaps I need to make another large change to fit it. The reranking code has:
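(The original snippet isn't reproduced here; the sketch below is only a rough approximation of the sub-query handling being referred to, assuming the evaluator averages similarity scores over sub-queries. `model.encode` and the function names are illustrative, not the actual mteb code.)

```python
# Rough sketch (not the actual mteb evaluator): if a query is a list of sub-queries,
# score each sub-query against the candidate documents and average the scores.
import numpy as np


def _cos_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def score_candidates(model, query, docs):
    doc_emb = np.asarray(model.encode(docs))
    if isinstance(query, list):  # e.g. MindSmall, where one entry holds several sub-queries
        query_emb = np.asarray(model.encode(query))       # (n_subqueries, dim)
        return _cos_sim(query_emb, doc_emb).mean(axis=0)  # average over sub-queries
    query_emb = np.asarray(model.encode([query]))
    return _cos_sim(query_emb, doc_emb)[0]
```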
where these sub-queries only apply to the MindSmall reranking task. The two scores (with or without sub-queries) are actually pretty close if you ignore this step (only about 0.1% off), but since this step is in the original MTEB implementation I wanted to make sure it hit parity. The problem is that MindSmall has thousands of duplicate entries. Because my change turns reranking into the "mini"-retrieval setting, where each query sets up its own small retrieval task, it re-computes all of the duplicates. With all the duplicates (roughly 1000x), it's simply not feasible to run the task. I considered a solution where (like we do currently) we compute a unique set of queries and documents but still share the retrieval evaluation code; however, that would compute the dot product on far more query-doc pairs than we need. I'm not sure I see a clean solution, so I will probably re-add the reranking evaluator and have it share as much code as possible with the retrieval one. EDIT: or potentially change the DRESModel search to handle this case... If anyone has other suggestions, please let me know!
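To make the "unique set of queries and documents" idea concrete, caching encodings could look roughly like this minimal sketch (illustrative only; it assumes a `model.encode` that accepts a list of strings):

```python
# Minimal sketch: embed each unique text once and reuse the vector for duplicates,
# so heavily duplicated datasets (like MindSmall) don't recompute the same embeddings.
import numpy as np


def encode_with_dedup(model, texts: list[str]) -> np.ndarray:
    unique = list(dict.fromkeys(texts))            # preserves order, drops duplicates
    unique_emb = np.asarray(model.encode(unique))
    index = {text: i for i, text in enumerate(unique)}
    return unique_emb[[index[t] for t in texts]]   # map each text back to its embedding
```

This only removes the duplicated encoding work; the extra query-document score computations mentioned above would still need separate handling.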
I think the best solution is to duplicate the evaluation code over to the MindSmall reranking task (and then leave it as is). We can then create a v2 of MindSmall, both without duplicates and using the standardized code, and update MTEB(eng, beta) to use the v2. This keeps parity, simplifies the code (moving the complexity to a single case), and removes the duplicates from the dataset.
Okay, I think this is finally ready for review. Sorry it ended up so big; it was hard to break this into pieces. A list of things changed:
I ran … Eventually, I'd like to create new datasets on HF for the 15 or so reranking tasks, but I don't think I'll have time for that in the next couple of weeks. It takes a few seconds to convert them on the fly for all tasks except …
Awesome changes!
@orionw wow! Great PR. I'll take a more detailed look shortly. Right now I only have a small thought on the removal of the Evaluators and AbsTask: removing these classes would be a breaking change, which should trigger a major version bump, and we might want to do that only when we group a few other high-impact changes together. So my suggestion is: instead of fully deleting them here and now, we could add a warning to each of those classes saying that they are deprecated and will be removed in v2.0.0. That way anyone who has started using them gets a heads-up, and we avoid two almost back-to-back major version bumps. The classes would then be fully removed in the v2.0.0 commit. What do you think? Also @KennethEnevoldsen @Samoed
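A minimal sketch of the kind of deprecation shim suggested here (class name and message are placeholders, not the actual mteb code):

```python
# Illustrative only: warn on construction instead of deleting the class outright,
# so anyone still importing it gets a heads-up before v2.0.0.
import warnings


class OldRerankingEvaluator:  # placeholder name for any class slated for removal
    def __init__(self, *args, **kwargs):
        warnings.warn(
            "This evaluator is deprecated and will be removed in v2.0.0; "
            "use the consolidated retrieval evaluator instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        # ... existing initialization ...
```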
We could include these in v2.0.0 along with MIEB (maybe we should start the v2.0.0 branch to collect some changes together?).
Yeah, good call!
Moved from main to v2.0.0.
Thanks @isaac-chung and good point!
I agree with this. Unless v2.0.0 is coming soon, I think this will make it much easier for people to add instruction-based tasks, and we should merge it in sooner rather than later. @KennethEnevoldsen do you know the timeline on that? Happy to add back some of these evaluators even if they are unused, just so we keep compatibility.
@orionw @KennethEnevoldsen maybe we can merge this into the 2.0.0 preview branch when ready? That way, if users want to try it out before the actual release, they still can, and they won't be blocked by our timeline.
@isaac-chung we could do that too. I was mainly commenting from a "new-task-adder" perspective rather than a user perspective. New PRs like #1425 are adding instruction-retrieval tasks, and the current InstructionRetrieval implementation is quite confusing to add new tasks to (which is totally my fault, as I added that setup earlier this year and didn't keep it broad enough). If v2.0.0 is coming out within a month or so, then perhaps it's fine to have those PRs also merge into v2.0.0, but I was trying to avoid making it difficult for others to add tasks because of my poor decisions :)
I believe this is a good fit for v2. Feel free to merge it into v2 as it is now (resolving the merge conflict). Please add a release note in the description over at #1433.
This PR would consolidate much of the logic between reranking/retrieval/instruction variants.
In principle, the only difference between retrieval and reranking is that reranking is a bunch of "mini" retrieval tasks where the corpus is set by a different (per-query) candidate list.
The only difference between reranking and instruction reranking is the presence of an instruction, so this PR merges those together as well (we will have to designate instruction tasks via some list, but we need to do that anyway for #1066 and others).
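As a rough illustration of the mapping (not the PR's actual conversion code; the `query`/`positive`/`negative` field names are assumptions), a reranking sample can be rewritten as a one-query retrieval task:

```python
# Rough illustration: turn one reranking sample into a "mini" retrieval task whose
# corpus is just that sample's candidate list. Field names are assumptions.
def reranking_sample_to_retrieval(sample: dict, query_id: str):
    query = {query_id: sample["query"]}
    corpus, qrels = {}, {query_id: {}}
    for i, doc in enumerate(sample["positive"]):
        doc_id = f"{query_id}-pos-{i}"
        corpus[doc_id] = {"title": "", "text": doc}
        qrels[query_id][doc_id] = 1   # relevant candidate
    for i, doc in enumerate(sample["negative"]):
        doc_id = f"{query_id}-neg-{i}"
        corpus[doc_id] = {"title": "", "text": doc}
        qrels[query_id][doc_id] = 0   # non-relevant candidate
    return corpus, query, qrels
```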
Benefits are:
Potential concerns/todos:
If people are okay with this approach, I'll go through and verify that all the Reranking tasks still work and reproduce score-wise.
Checklist
- make test
- make lint