
LLM Serve

Overview

This project provides Python scripts for interacting with the Ollama and vLLM APIs. The ollama_app.py script calls Ollama-hosted models such as codestral:22b, llama3.1:8b, and gemma2:9b, while vLLM models are hosted and accessed through Docker.

For Ollama and vLLM support, see ollama_app.py and vllm_app.py, respectively.

Ollama API Integration

The ollama_app.py script allows you to generate text completions using different models provided by the Ollama API.

Supported Models

codestral:22b is recommended for use as a coding assistant.

  • codestral:22b
  • llama3.1:8b
  • gemma2:9b

Prerequisites

  • Python 3.6+
  • requests library

You can install the required library using pip:

pip install requests
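
Once requests is installed, you can confirm that your Ollama server is reachable and list the models that are already pulled. This is a minimal sketch assuming a default local install at localhost:11434; adjust the host if your server runs elsewhere (the example script below targets a remote address).

import requests

# List the models currently available on the Ollama server.
# Assumes a default local install; change OLLAMA_HOST to match your setup.
OLLAMA_HOST = "http://localhost:11434"

resp = requests.get(f"{OLLAMA_HOST}/api/tags", timeout=10)
resp.raise_for_status()
for model in resp.json().get("models", []):
    print(model["name"])  # e.g. "llama3.1:8b"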

Usage

  1. Edit the Script: Ensure that the model_names list contains the correct models you want to use.

  2. Run the Script:

import requests
import json

def query_ollama(prompt, model_name, temperature=0):
    url = "http://140.119.164.212:11434/api/generate"
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    payload = {
        "model": model_name,
        "prompt": prompt,
        # Sampling parameters such as temperature belong under "options" in the Ollama API.
        "options": {"temperature": temperature},
    }

    response = requests.post(url, headers=headers, data=json.dumps(payload))
    response.raise_for_status()

    content_type = response.headers.get("Content-Type", "")

    if content_type.startswith("application/x-ndjson"):
        # Streaming responses arrive as newline-delimited JSON; join the chunks into one string.
        ndjson_content = response.text.strip().split("\n")
        json_responses = [json.loads(line) for line in ndjson_content]
        return "".join(item["response"] for item in json_responses if "response" in item)

    elif content_type.startswith("application/json"):
        return response.json()

    else:
        raise Exception(f"Unexpected content type: {content_type}")

model_names = ["codestral:22b", "llama3.1:8b", "gemma2:9b"]
prompt = "Translate English to French: 'Hello, how are you?'"
response = query_ollama(prompt, model_names[2])
print(response)

  3. Execute the Script:
python ollama_app.py
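
If you need the model to return structured output, the Ollama API exposes a format field for this purpose. The variant below is a minimal sketch assuming a reasonably recent Ollama version and the same server address as above; the query_ollama_json helper name is illustrative, and "stream": False makes the reply arrive as a single JSON object.

import requests

def query_ollama_json(prompt, model_name):
    # Minimal sketch: request JSON-formatted output via Ollama's "format" field.
    # Assumes the same server address used in ollama_app.py above.
    url = "http://140.119.164.212:11434/api/generate"
    payload = {
        "model": model_name,
        "prompt": prompt + " Respond in JSON.",
        "format": "json",
        "stream": False,  # return one JSON object instead of an NDJSON stream
    }
    response = requests.post(url, json=payload, timeout=120)
    response.raise_for_status()
    return response.json()["response"]

print(query_ollama_json("List three French greetings.", "llama3.1:8b"))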

vLLM API Integration

Prerequisites

  • Docker
  • NVIDIA Docker Runtime (for GPU support)
  • Hugging Face API token

Setting Up vLLM

  1. Pull the vLLM Docker Image:

docker pull vllm/vllm-openai:latest

  2. Run the Docker Container:

Ensure you have your Hugging Face API token. Replace YOUR_HUGGING_FACE_HUB_TOKEN with your actual token.

docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -v /path/to/your/model_dir:/root/model_config \
    --env "HUGGING_FACE_HUB_TOKEN=YOUR_HUGGING_FACE_HUB_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model /root/model_config
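
Once the container is running, you can check that the server is reachable before sending completion requests. The sketch below lists the served models through the OpenAI-compatible /v1/models endpoint; localhost:8000 matches the -p 8000:8000 mapping above, and no Authorization header is needed unless you started the server with --api-key.

import requests

# Quick health check: list the models served by the vLLM container.
# Assumes the -p 8000:8000 port mapping from the docker run command above.
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])  # should match the value passed to --model, e.g. /root/model_config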

Accessing the vLLM API

You can interact with the vLLM API using HTTP requests. Below is an example Python script for querying the vLLM API:

import requests

def query_vllm(prompt, model_name, max_tokens=50):
    url = "http://localhost:8000/v1/completions"
    headers = {
        "Content-Type": "application/json",
        # Only checked if the server was started with --api-key; otherwise it is ignored.
        "Authorization": "Bearer YOUR_HUGGING_FACE_HUB_TOKEN"
    }
    payload = {
        "model": model_name,
        "prompt": prompt,
        "max_tokens": max_tokens
    }

    response = requests.post(url, headers=headers, json=payload)
    response.raise_for_status()
    return response.json()

prompt = "Once upon a time"
# The model name must match what the server is serving: by default this is the value
# passed to --model (here /root/model_config), unless --served-model-name is set.
model_name = "/root/model_config"
response = query_vllm(prompt, model_name)
print(response)
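
Because vLLM exposes an OpenAI-compatible server, you can also query it with the official openai Python client instead of raw requests. This is a minimal sketch assuming openai version 1.x; the api_key value is a placeholder and is only checked if the server was started with --api-key.

from openai import OpenAI

# Point the OpenAI client at the local vLLM server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

completion = client.completions.create(
    model="/root/model_config",  # must match the served model name
    prompt="Once upon a time",
    max_tokens=50,
)
print(completion.choices[0].text)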

Conclusion

This project provides a straightforward way to interact with text generation models from Ollama and vLLM. Ensure you have the required dependencies installed and configured correctly to use the provided scripts effectively.

For any issues or contributions, feel free to open an issue or a pull request on the project's repository.