PPL LLM Serving

Overview

ppl.llm.serving is part of the PPL.LLM system.

If you are new to this project, we recommend reading the Overview of system first.

ppl.llm.serving is a serving framework based on ppl.nn for various Large Language Models (LLMs). This repository contains a gRPC-based server and inference support for LLaMA.

Prerequisites

Linux running on x86_64 or arm64 CPUs
GCC >= 9.4.0
CMake >= 3.18
Git >= 2.7.0
CUDA Toolkit >= 11.4, 11.6 recommended (for CUDA)
Rust & cargo >= 1.8.0 (for Huggingface Tokenizer)
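
The minimum versions above can be checked with a small helper before building. This is a minimal sketch, not part of the repository: the function names `version_ge` and `check_min_version` are hypothetical, and the comparison assumes a `sort` that supports `-V` (version sort), as GNU coreutils does.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not shipped with ppl.llm.serving.

# Succeed (exit 0) if version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Report whether a detected version meets the required minimum.
# $1: tool label, $2: detected version, $3: minimum version
check_min_version() {
    if version_ge "$2" "$3"; then
        echo "$1 $2: OK (>= $3)"
    else
        echo "$1 $2: too old (need >= $3)"
        return 1
    fi
}
```

For example, `check_min_version GCC "$(gcc -dumpfullversion)" 9.4.0` or `check_min_version CMake "$(cmake --version | head -n 1 | awk '{print $3}')" 3.18` would report on the local toolchain, assuming those tools are on the PATH.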

PPL Server Quick Start

Here is a brief tutorial. Refer to the LLaMA Guide for more details.

Installing Prerequisites (on Debian or Ubuntu, for example)
```
apt-get install build-essential cmake git
```

Cloning Source Code

```
git clone https://github.com/openppl-public/ppl.llm.serving.git
```

Building from Source
```
./build.sh  -DPPLNN_USE_LLM_CUDA=ON  -DPPLNN_CUDA_ENABLE_NCCL=ON -DPPLNN_ENABLE_CUDA_JIT=OFF -DPPLNN_CUDA_ARCHITECTURES="'80;86;87'" -DPPLCOMMON_CUDA_ARCHITECTURES="'80;86;87'" -DPPL_LLM_ENABLE_GRPC_SERVING=ON
```
NCCL is required if multiple GPU devices are used.

We support a Sync Decode feature (mainly for offline_inference), in which model forward and decode run in the same thread. To enable this feature, compile with the macro -DPPL_LLM_SERVING_SYNC_DECODE=ON.
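
As an illustration of how the Sync Decode flag combines with the build flags shown above, the sketch below prints the flag list with the extra flag appended on request. The function name `ppl_build_flags` and its `sync` argument are hypothetical. The flags are copied from this README, and the sketch assumes `build.sh` accepts the extra flag in the same way as the others.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Quick Start build flags, one per line.
# $1: pass "sync" to append the Sync Decode flag; omit it for the default build.
ppl_build_flags() {
    printf '%s\n' \
        "-DPPLNN_USE_LLM_CUDA=ON" \
        "-DPPLNN_CUDA_ENABLE_NCCL=ON" \
        "-DPPLNN_ENABLE_CUDA_JIT=OFF" \
        "-DPPLNN_CUDA_ARCHITECTURES='80;86;87'" \
        "-DPPLCOMMON_CUDA_ARCHITECTURES='80;86;87'" \
        "-DPPL_LLM_ENABLE_GRPC_SERVING=ON"
    if [ "${1:-}" = "sync" ]; then
        printf '%s\n' "-DPPL_LLM_SERVING_SYNC_DECODE=ON"
    fi
}
```

In bash, the output can be passed to the build script with `mapfile -t flags < <(ppl_build_flags sync)` followed by `./build.sh "${flags[@]}"`.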

Exporting Models

Refer to ppl.pmx for details.

Running Server

```
./ppl_llm_server \
    --model-dir /data/model \
    --model-param-path /data/model/params.json \
    --tokenizer-path /data/tokenizer.model \
    --tensor-parallel-size 1 \
    --top-p 0.0 \
    --top-k 1 \
    --max-tokens-scale 0.94 \
    --max-input-tokens-per-request 4096 \
    --max-output-tokens-per-request 4096 \
    --max-total-tokens-per-request 8192 \
    --max-running-batch 1024 \
    --max-tokens-per-step 8192 \
    --host 127.0.0.1 \
    --port 23333
```

Set these values to match your environment before running the server.

model-dir: path to the models exported by ppl.pmx.
model-param-path: path to the model parameters file, $model_dir/params.json.
tokenizer-path: path to the tokenizer files for sentencepiece.
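
Because a wrong path is an easy mistake to make here, a pre-flight check on the three path arguments can help. The sketch below is an assumption-laden illustration: `check_server_paths` is a hypothetical helper, not part of the repository, and it only tests that the paths exist, not that their contents are valid.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check, not shipped with ppl.llm.serving.
# $1: model-dir, $2: model-param-path, $3: tokenizer-path
# Prints each missing path to stderr and returns 1 if any is missing.
check_server_paths() {
    local status=0
    [ -d "$1" ] || { echo "model-dir not found: $1" >&2; status=1; }
    [ -f "$2" ] || { echo "model-param-path not found: $2" >&2; status=1; }
    [ -e "$3" ] || { echo "tokenizer-path not found: $3" >&2; status=1; }
    return "$status"
}
```

With the example values from this README, the call would be `check_server_paths /data/model /data/model/params.json /data/tokenizer.model`, run before starting `./ppl_llm_server`.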

Running Client: send requests through gRPC to query the model
```
./ppl-build/client_sample 127.0.0.1:23333
```
See tools/client_sample.cc for more details.

Benchmarking
```
./ppl-build/client_qps_measure --target=127.0.0.1:23333 --tokenizer=/path/to/tokenizer/path --dataset=tools/samples_1024.json --request_rate=inf
```
See tools/client_qps_measure.cc for more details. --request_rate is the number of requests sent per second; the value inf sends all client requests with no interval between them.
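
To make the meaning of --request_rate concrete, the sketch below converts a rate into the average spacing between requests. It assumes evenly spaced requests, which is a simplification: the actual scheduling in tools/client_qps_measure.cc may differ. The function name `request_interval` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the average delay, in seconds, between
# consecutive requests for a given request rate (requests per second).
# "inf" means no delay; a non-positive or non-numeric rate is rejected.
request_interval() {
    if [ "$1" = "inf" ]; then
        echo 0
    else
        awk -v rate="$1" 'BEGIN {
            if (rate + 0 <= 0) exit 1
            printf "%.6f\n", 1 / rate
        }'
    fi
}
```

For instance, `request_interval 4` corresponds to one request every quarter of a second on average, while `request_interval inf` yields a zero delay, matching the "no interval" behavior described above.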

Running Inference Offline

```
./offline_inference \
    --model-dir /data/model \
    --model-param-path /data/model/params.json \
    --tokenizer-path /data/tokenizer.model \
    --tensor-parallel-size 1 \
    --top-p 0.0 \
    --top-k 1 \
    --max-tokens-scale 0.94 \
    --max-input-tokens-per-request 4096 \
    --max-output-tokens-per-request 4096 \
    --max-total-tokens-per-request 8192 \
    --max-running-batch 1024 \
    --max-tokens-per-step 8192 \
    --host 127.0.0.1 \
    --port 23333
```

See tools/offline_inference.cc for more details.

License

This project is distributed under the Apache License, Version 2.0.
