[benchmark] optimize benchmark: counting tokenlizer tokens and error requests #1607

Open · wants to merge 15 commits into main

Conversation

@NiuBlibing (Contributor) commented on May 17, 2024:

Motivation

  1. support the https scheme for benchmarking
  2. count the real output tokens (a sketch follows this list)
  3. count local tokenizer throughput
  4. count error requests
  5. support setting the role in prompts
  6. support benchmarking the OpenAI API
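As a sketch of items 2 and 3, the real output token count can be obtained by re-encoding the generated text with a local tokenizer; the helper below is illustrative, not this PR's actual code (Qwen-72B-Chat is the model used in the tests discussed later in this thread):

    from transformers import AutoTokenizer

    # assumption: any HuggingFace tokenizer matching the served model works here
    tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen-72B-Chat',
                                              trust_remote_code=True)

    def count_output_tokens(generated_text: str) -> int:
        """Re-encode the streamed output locally to count the tokens actually generated."""
        return len(tokenizer(generated_text).input_ids)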

BC-breaking (Optional)

None

Use cases (Optional)

For more accurate benchmarking.

timestamps.append(time.perf_counter())

first_token_latency = np.round(timestamps[1] - timestamps[0], 3)
token_latency = np.round(timestamps[-1] - timestamps[0], 3)
# assert output.pop('finish_reason') == 'length', \
#     f'Error. session_id({session_id}) request {output_seqlen} ' \
#     f'tokens, but `finish_reason` is not `length`'
total_tokens = input_seqlen + output_seqlen
# time the local re-tokenization so it can be excluded from the final stats
tokenlizer_start = time.perf_counter()
real_output_seqlen = len(self.tokenizer(full_output).input_ids)
Collaborator:

Encoding the text affects the inference performance.
I don't suggest doing that.
If the output seqlen is needed, the server can return it.

Contributor (Author):

In stream mode, this statistic is not returned by the server.

In my tests (Qwen-72B-Chat), the tokenizer takes only 0.027% of the whole benchmark elapsed time (tokenizer speed: 77402.264 tokens/s at concurrency 1), and the tokenizer time is subtracted in the final stats code:

        stats = np.concatenate(stats).reshape(-1, 6)

        # column 5 holds the per-request tokenizer time; average it over the
        # concurrency and subtract it from the measured elapsed time
        tokenlizer_time = np.sum(stats[:, 5], axis=0) / concurrency
        elapsed_time -= tokenlizer_time

Contributor (Author):

> Encoding the text affects the inference performance. I don't suggest doing that. If the output seqlen is needed, the server can return it.

lmdeploy doesn't support `stream_options` yet, so these stats can't be obtained from the server in stream mode.
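For reference, a minimal sketch of how the client could read token usage from the stream once `stream_options` is supported, assuming an OpenAI-compatible endpoint (the base URL, API key, and model name below are placeholders):

    from openai import OpenAI

    # assumption: an OpenAI-compatible /v1/chat/completions endpoint
    client = OpenAI(base_url='http://localhost:23333/v1', api_key='none')

    completion_tokens = 0
    stream = client.chat.completions.create(
        model='placeholder-model',
        messages=[{'role': 'user', 'content': 'hello'}],
        stream=True,
        # ask the server to attach token usage to the final chunk
        stream_options={'include_usage': True},
    )
    for chunk in stream:
        # every chunk except the last has usage == None
        if chunk.usage is not None:
            completion_tokens = chunk.usage.completion_tokens
    print(completion_tokens)

This would make local re-tokenization unnecessary for counting output tokens.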

@NiuBlibing changed the title from "[benchmark] optimize counting of output tokens" to "[benchmark] optimize benchmark: counting tokenlizer tokens and error requests" on May 17, 2024
@lvhan028 (Collaborator):

what kind of errors?

@NiuBlibing (Contributor, Author):

> what kind of errors?

Such as OOM, account rate limits, etc.
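A minimal sketch of how such failures might be tallied in the benchmark loop (the function and field names are illustrative, not this PR's actual code):

    import requests

    def send_one_request(url: str, payload: dict, stats: dict) -> str | None:
        """Send one benchmark request; count failures instead of aborting the run."""
        try:
            resp = requests.post(url, json=payload, timeout=600)
            if resp.status_code != 200:
                # e.g. 429 from account/rate limits, 500 from a server-side OOM
                stats['error_requests'] += 1
                return None
            return resp.json()['choices'][0]['message']['content']
        except requests.RequestException:
            stats['error_requests'] += 1
            return None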

@NiuBlibing closed this on May 20, 2024
@NiuBlibing reopened this on May 20, 2024
@NiuBlibing mentioned this pull request on May 21, 2024
@NiuBlibing closed this on May 21, 2024
@NiuBlibing reopened this on May 21, 2024
Labels: None yet
Projects: None yet
2 participants