[Feature] support qqq(w4a8) for lmdeploy #2274
Conversation
Hi @HandH1998, nice work! Could you merge the latest main branch and fix the conflicts?
We might wait for the merge of #2090.
Brilliant!
When implementing W8A8 in the future, some components may be reused.
Done |
@HandH1998 Could you fix the CI issue? Thanks.
Force-pushed from ea251f3 to ecee3aa
@HandH1998 The Windows build is still failing.
@HandH1998
@HandH1998 Marlin W4A16 is mainly optimized for A100, yet its performance is still worse than TurboMind AWQ. Marlin's performance on H100 is mediocre, and compared with #2090 in particular the gap is large. After #2090 is merged next week, this PR will be reviewed. There are roughly two strategies: one is to review the current implementation first (assuming you merge the latest main and resolve the conflicts) and later reimplement it following the optimized TurboMind implementation; the other is to reimplement it directly, reusing some existing components, which we can discuss at that time. @lzhangzz cc @irexyc @lvhan028
And the difference should not be significant on A100. I roughly verified this with SGLang's Marlin AWQ and LMDeploy TurboMind's AWQ on Llama 3.1 8B Instruct, and their performance was basically on par (though I don't remember whether LMDeploy had already fixed that chunked prefill bug at the time).
#2090 has been merged. Please merge the latest main and resolve the conflicts. Thanks.
Those are the old AWQ kernels. The new kernels achieve 26+ RPS on A100 with Llama 3.1 8B.
@HandH1998 Could you resolve the conflicts in the next few days? After that, @lzhangzz will help rewrite it in TurboMind's style. We should move forward together.
I am working on it. Since the main branch has changed a lot, I still need time to resolve the conflicts and fix new bugs. I can probably finish it in two days.
Motivation
We have implemented W4A8 quantization for the lmdeploy turbomind backend using our quantization algorithm QQQ to enhance inference throughput. We hope that lmdeploy users will find this beneficial. Additionally, we have submitted a PR to vLLM, which has been incorporated into vLLM v0.5.4.
Modification
We have completed the following tasks to enable the w4a8 pipeline:
Use cases
First, you need to export the quantized model weights using our repo. Then you can enable QQQ in the same manner as you would enable AWQ. Here we provide two examples, one for inference and one for serving.
Inference
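A minimal offline-inference sketch with the TurboMind pipeline. The model path is a placeholder, and `model_format='qqq'` is assumed to be the format string added by this PR, by analogy with AWQ's `model_format='awq'`:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Placeholder path to the QQQ-quantized weights exported with the QQQ repo.
model_path = './llama2-13b-qqq'

# Assumption: 'qqq' is the model_format string registered by this PR,
# mirroring how AWQ is enabled via model_format='awq'.
backend_config = TurbomindEngineConfig(model_format='qqq')

pipe = pipeline(model_path, backend_config=backend_config)
responses = pipe(['Hi, please introduce yourself.', 'Shanghai is'])
print(responses)
```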
Service
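A corresponding sketch for serving through the OpenAI-compatible API server; again, the model path is a placeholder and `--model-format qqq` is an assumption based on this PR:

```bash
# Placeholder path/port; --model-format qqq is assumed to be the value
# introduced by this PR, analogous to --model-format awq.
lmdeploy serve api_server ./llama2-13b-qqq --model-format qqq --server-port 23333
```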
Benchmark
Accuracy
We employ OpenCompass to evaluate the quantized model. Here we provide the evaluation results for `llama2-13b-base`. You can add the following script to `configs` to reproduce our results.
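The original evaluation script is not reproduced here; the snippet below is only an illustrative OpenCompass config skeleton, assuming the `TurboMindModel` wrapper and a hypothetical `model_format='qqq'` engine option (field names may differ across OpenCompass versions):

```python
from mmengine.config import read_base
from opencompass.models import TurboMindModel

with read_base():
    # Reuse any dataset config shipped with OpenCompass.
    from .datasets.mmlu.mmlu_gen import mmlu_datasets

datasets = [*mmlu_datasets]

models = [
    dict(
        type=TurboMindModel,
        abbr='llama2-13b-qqq-turbomind',
        # Placeholder path to the exported QQQ weights.
        path='./llama2-13b-qqq',
        # Assumption: 'qqq' is the model_format string added by this PR.
        engine_config=dict(model_format='qqq', session_len=2048, max_batch_size=16),
        max_out_len=100,
        max_seq_len=2048,
        batch_size=16,
        run_cfg=dict(num_gpus=1),
    )
]
```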
Throughput
We use the script `profile_restful_api.py` and the ShareGPT dataset to benchmark throughput. Here we provide the results for `llama2-13b-base` on one A100-80G.
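As an illustration only, a benchmark run might look like the following; the argument order and flag names are assumptions and the actual interface of `profile_restful_api.py` varies across lmdeploy versions:

```bash
# Hypothetical invocation -- launch the api_server first (see the Service
# example), then drive it with the profiler against the ShareGPT dataset.
# Check the script's --help for the exact arguments in your lmdeploy version.
python benchmark/profile_restful_api.py http://0.0.0.0:23333 \
    ./llama2-13b-qqq \
    ./ShareGPT_V3_unfiltered_cleaned_split.json \
    --concurrency 128 --num_prompts 1000
```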
Settings: