
Implementation of flash attention for native webgpu ep #22932

Open · wants to merge 24 commits into main
Conversation

@sushraja-msft commented Nov 24, 2024

Description

This change implements flash attention in WebGPU to improve prefill speed.
Perf numbers from an Intel Alder Lake device:

Baseline MHA

Batch size: 1, prompt tokens: 501, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       2.26746e+07
        avg (tokens/s): 22.0952              <<<
        p50 (us):       2.34637e+07
        stddev (us):    3.92912e+06
        n:              5 * 501 token(s)
Token generation:
        avg (us):       96519.8
        avg (tokens/s): 10.3606              <<<
        p50 (us):       98061.5
        stddev (us):    9220.87
        n:              635 * 1 token(s)

With FA

Batch size: 1, prompt tokens: 501, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       1.69236e+07
        avg (tokens/s): 29.6036             <<<
        p50 (us):       1.63162e+07
        stddev (us):    960417
        n:              5 * 501 token(s)
Token generation:
        avg (us):       91436.7
        avg (tokens/s): 10.9365             <<<
        p50 (us):       90397.1
        stddev (us):    5349.19
        n:              635 * 1 token(s)

Motivation and Context

On integrated GPUs, memory bandwidth is at a premium. Flash attention makes the softmax computation (and therefore the output attention vector computation) a running operation instead of materializing the full QK^T attention score matrix in memory. As a result, we see a significant improvement in prefill speed: roughly a 30% speedup measured here (22.1 to 29.6 tokens/s).
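
To make the "running operation" concrete, below is a minimal C++ sketch (not the PR's actual WGSL shader) of the online-softmax update that flash attention applies as scores stream in; the scores and values are made-up scalars for illustration:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
  // Hypothetical attention scores s_i = q . k_i and (scalar) values for one query row.
  std::vector<float> scores = {0.5f, 2.0f, -1.0f, 1.5f};
  std::vector<float> values = {1.0f, 2.0f, 3.0f, 4.0f};

  float running_max = -INFINITY;  // m: running max of the scores seen so far
  float running_sum = 0.0f;       // l: running softmax denominator
  float running_out = 0.0f;       // o: running (unnormalized) weighted value

  for (size_t i = 0; i < scores.size(); ++i) {
    float new_max = std::max(running_max, scores[i]);
    float correction = std::exp(running_max - new_max);  // rescales the old partial sums
    float p = std::exp(scores[i] - new_max);
    running_sum = running_sum * correction + p;
    running_out = running_out * correction + p * values[i];
    running_max = new_max;
  }

  // Final attention output for this row; the full QK^T row is never stored.
  // In the shader the same update runs per K/V tile, and the per-tile max/sum
  // reductions can use subgroup operations rather than shared-memory reductions.
  std::printf("output = %f\n", running_out / running_sum);
  return 0;
}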

This implementation also uses the new WebGPU subgroups feature to further accelerate the attention computation.

  • Tested on Intel Alderlake (Subgroup Size 16) with Phi 3.5 mini.
  • Tested on Nvidia 2060 (Subgroup Size 32) with Phi 3.5 mini.
  • Tested with Llama 3.2 1B: flash attention does not activate because past/present keys are always null. Needs investigation into the model to understand why this is the case.

Remaining work

  • Algorithm specialization for the generation phase: the shared-memory tiles for K/V can be removed because each K/V value is used only once, freeing shared memory for a larger tile size.
  • Algorithm specialization for the no-past-KV case (prefill). The CopyKVCache operation can likely be eliminated here: since there are no past KV values to copy over, the new KV values can be written into present KV as part of flash attention. PIX profiling shows CopyKVCache is almost as expensive as the FlashAttention shader itself. A static KV cache would also eliminate this copy and yield further performance wins.

How to enable

Currently flash attention is off by default. To enable it, add

    "provider_options": [
      {
        "webgpu": { "enableFlashAttention": "1" }
      }
    ]

to genai_config.json.
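
For context, here is a hedged sketch of a full genai_config.json carrying this option, assuming the usual onnxruntime-genai layout where provider_options sits under model -> decoder -> session_options; the surrounding keys are illustrative and not part of this change:

    {
      "model": {
        "decoder": {
          "session_options": {
            "provider_options": [
              { "webgpu": { "enableFlashAttention": "1" } }
            ]
          }
        }
      }
    }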

@guschmue

/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline

@guschmue

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

@guschmue

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models


Azure Pipelines successfully started running 2 pipeline(s).

@guschmue

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline


Azure Pipelines successfully started running 4 pipeline(s).


Azure Pipelines successfully started running 3 pipeline(s).


Azure Pipelines successfully started running 9 pipeline(s).

Comment on lines +121 to +125
wgpu::DawnExperimentalSubgroupLimits subgroup_limits;
device_supported_limits.nextInChain = &subgroup_limits;
ORT_ENFORCE(Device().GetLimits(&device_supported_limits));
device_limits_ = device_supported_limits.limits;
min_subgroup_size_ = subgroup_limits.minSubgroupSize;
@fs-eire commented Nov 26, 2024

What is the expected behavior when subgroups are not supported?

I didn't test this code, but it looks like it will abort. It would be better to disable the features that use subgroups when they are not available.
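
One possible shape for that fallback (a sketch only, not tested against this change): query the experimental subgroup limits only when the device reports subgroup support, and record a flag that the flash attention path checks before it is selected. The HasFeature call and the ChromiumExperimentalSubgroups enum value are assumptions about the Dawn revision in use, and has_subgroups_ is a hypothetical member:

// Assumed Dawn API: wgpu::Device::HasFeature and an experimental subgroups feature enum.
has_subgroups_ = Device().HasFeature(wgpu::FeatureName::ChromiumExperimentalSubgroups);

wgpu::SupportedLimits device_supported_limits;
wgpu::DawnExperimentalSubgroupLimits subgroup_limits;
if (has_subgroups_) {
  // Only chain the experimental limits struct when the feature is present.
  device_supported_limits.nextInChain = &subgroup_limits;
}
ORT_ENFORCE(Device().GetLimits(&device_supported_limits));
device_limits_ = device_supported_limits.limits;
min_subgroup_size_ = has_subgroups_ ? subgroup_limits.minSubgroupSize : 0;

// Later, the attention op would fall back to the existing MHA path whenever
// has_subgroups_ is false, instead of aborting via ORT_ENFORCE.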

@guschmue

/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline

@guschmue

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

@guschmue

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models


Azure Pipelines successfully started running 2 pipeline(s).

@guschmue

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline


Azure Pipelines successfully started running 4 pipeline(s).


Azure Pipelines successfully started running 3 pipeline(s).


Azure Pipelines successfully started running 9 pipeline(s).

@guschmue

Very cool, I can give it a test drive on some other GPUs and macOS.
