ScaleLLM

ScaleLLM is an inference system for large language models (LLMs). It is designed for production environments and supports a wide range of popular open-source models, including Llama3, Gemma, Bloom, and GPT-NeoX.

Note

ScaleLLM is currently in alpha. We are actively working on improving the system and adding new features. If you have any feedback or suggestions, please feel free to reach out to us.

ScaleLLM is available as a Python wheel on PyPI. You can install it using pip:

# Install ScaleLLM with CUDA 12.6 and PyTorch 2.7.1
$ pip install -U scalellm
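
After installing, it can be useful to confirm that the package is visible to the interpreter you intend to use. The following is a minimal sketch that relies only on the Python standard library (`importlib.metadata`, Python 3.8+) and not on any ScaleLLM API. The helper name `installed_version` is hypothetical and introduced here for illustration; the distribution name `scalellm` is taken from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        # Looks up the distribution's metadata; does not import the package.
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("scalellm")
    if found is None:
        print("scalellm is not installed in this environment")
    else:
        print(f"scalellm {found} is installed")
```

Because the check reads package metadata instead of importing the module, it should work even on a machine without a GPU, though it does not verify that the CUDA and PyTorch versions match the wheel.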

Table of contents

User Guide

Quick Start
- Installation
- Inference
Examples
- Chat Completion
- Completions
Supported Models

Tutorials

Tutorials
Architecture

Developer Guide

Contributing

Reference