Examples

You can use ScaleLLM for offline batch completions or online inference. Below are some examples to help you get started. More examples can be found in the examples folder.
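For offline batch completions, you can drive the model directly from Python without starting a server. The sketch below assumes a vLLM-style interface with LLM and SamplingParams classes and a generate method; treat these names and the shape of the returned outputs as assumptions and check the scripts in the examples folder for the exact API.

from scalellm import LLM, SamplingParams

# Prompts to complete in a single offline batch.
prompts = [
    "Hello, my name is",
    "The capital of France is",
]

# Sampling parameters shared by the whole batch.
sampling_params = SamplingParams(temperature=0.7, max_tokens=32)

# Load the model once, then run all prompts through it.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B")
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(prompt, "->", output)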

Chat Completion

Start the REST API server with the following command:

$ python3 -m scalellm.serve.api_server --model=meta-llama/Meta-Llama-3.1-8B-Instruct

You can then query the chat completions endpoint with curl or the OpenAI client:

$ curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'
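The same request through the OpenAI Python client looks like this. It is a minimal sketch: the base_url points at the local server started above, and the api_key placeholder assumes the server does not check credentials.

from openai import OpenAI

# Point the OpenAI client at the local ScaleLLM server.
# The api_key value is a placeholder (assumed unused by the server).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)
print(response.choices[0].message.content)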

Completions

Start the REST API server with the following command:

$ python3 -m scalellm.serve.api_server --model=meta-llama/Meta-Llama-3.1-8B

You can then query the completions endpoint with curl or the OpenAI client:

$ curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B",
    "prompt": "hello",
    "max_tokens": 32,
    "temperature": 0.7,
    "stream": true
  }'
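Since the curl request sets "stream": true, the OpenAI Python client equivalent iterates over the streamed chunks as they arrive. As above, the api_key placeholder assumes the server does not check credentials.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

# Stream tokens as they are generated, mirroring the curl request above.
stream = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B",
    prompt="hello",
    max_tokens=32,
    temperature=0.7,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)
print()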