how to release gpu memory when use onnxruntime with fastapi #22899

SZ-ing · 2024-11-20T03:17:23Z

This has probably been asked before, but I still haven't found a solution.

I use FastAPI to build the API endpoints and onnxruntime to load the models. Because of the number of models and the limited GPU memory, I would like to release the GPU memory a model occupies after each API call completes. However, I can never seem to release it fully; some models still occupy GPU memory.

Is there a way to solve this in Python?

skottmckay · 2024-11-26T02:38:04Z

The inference session will keep at least the model's weights in memory. Unless you free the session, that will always remain the case.
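As a rough illustration of "freeing the session" in Python, the sketch below creates a session for a single call and drops the only reference to it afterwards. The helper name `run_with_temporary_session` and its two callable parameters are hypothetical and not part of onnxruntime; the assumption is that the session's memory is released once the Python object is destroyed. Some GPU memory may still stay reserved by the process after that (for example by the CUDA context), so this may not bring usage back to zero.

```python
import gc


def run_with_temporary_session(make_session, use_session):
    """Create a session, use it once, then drop it.

    make_session: zero-argument callable that returns a new session
                  (hypothetical; e.g. one that builds an InferenceSession).
    use_session:  callable that takes the session and returns a result.
                  It must not keep a reference to the session.
    """
    session = make_session()
    try:
        return use_session(session)
    finally:
        # Drop the only remaining reference, then collect any reference cycles.
        del session
        gc.collect()
```

With onnxruntime, `make_session` could be something like `lambda: onnxruntime.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])` (the model path is a placeholder), called from inside the FastAPI endpoint. The trade-off is that the model is reloaded on every request, which adds latency.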

You can potentially reduce memory usage by sharing allocators between sessions. The Python API has create_and_register_allocator, which calls into CreateAndRegisterAllocator, mentioned in 'Share allocator(s) between sessions' here. This reduces the amount of unused memory that comes from having multiple arenas, each with its own over-allocation.

See #6411 (comment)
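A minimal sketch of the shared-allocator setup could look like the following. It registers a CPU arena allocator on the environment and returns session options that opt a session into it. The function name `make_shared_allocator_options` is hypothetical, and passing `None` as the arena configuration is assumed to select the default arena settings. Whether CUDA device memory can be shared through this same call is not confirmed in this thread, so treat it as a starting point rather than a GPU-specific fix.

```python
def make_shared_allocator_options():
    """Register one shared CPU arena allocator and return SessionOptions
    that tell a session to use the environment's allocators.

    Intended to be called once per process, e.g. at application startup.
    """
    # Imported lazily so this module can be loaded without onnxruntime.
    import onnxruntime as ort

    mem_info = ort.OrtMemoryInfo(
        "Cpu",
        ort.OrtAllocatorType.ORT_ARENA_ALLOCATOR,
        0,  # device id
        ort.OrtMemType.DEFAULT,
    )
    # Assumption: None as the arena config means "use default arena settings".
    ort.create_and_register_allocator(mem_info, None)

    options = ort.SessionOptions()
    # Sessions created with these options use the registered allocator
    # instead of creating their own arena.
    options.add_session_config_entry("session.use_env_allocators", "1")
    return options
```

Each session would then be created with the same options, for example `onnxruntime.InferenceSession("model.onnx", sess_options=options)` (the model path is a placeholder), so that the sessions share one arena instead of each over-allocating its own.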

amarin16 added the api issues related to all other APIs: C, C++, Python, etc. label Nov 20, 2024
