Integration: vLLM Invocation Layer
Use a vLLM server or a locally hosted vLLM instance in your PromptNode
Use vLLM in your Haystack pipeline to run fast, self-hosted LLMs.
Installation
Install the wrapper via pip: pip install vllm-haystack
Usage
This integration provides two invocation layers:
vLLMInvocationLayer: To use models hosted on a vLLM server
vLLMLocalInvocationLayer: To use locally hosted vLLM models
Use a Model Hosted on a vLLM Server
To use models hosted on a vLLM server, the vLLMInvocationLayer has to be used.
Here is a simple example of how a PromptNode can be created with the wrapper.
from haystack.nodes import PromptNode, PromptModel
from vllm_haystack import vLLMInvocationLayer

# model_name_or_path is left empty because the model is inferred from the vLLM server
model = PromptModel(model_name_or_path="", invocation_layer_class=vLLMInvocationLayer, max_length=256, api_key="EMPTY", model_kwargs={
    "api_base": API,  # Replace this with the URL of your vLLM server
    "maximum_context_length": 2048,
})
prompt_node = PromptNode(model_name_or_path=model, top_k=1, max_length=256)
The model to use is inferred from the model served on the vLLM server. For more configuration examples, take a look at the unit tests.
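Once created, the PromptNode can be called directly with a prompt. This is a minimal sketch continuing the snippet above; the question is only an illustrative placeholder:

result = prompt_node("What is the capital of Germany?")  # any prompt works here
print(result)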
Hosting a vLLM Server
To create an OpenAI-Compatible Server via vLLM you can follow the steps in the Quickstart section of their documentation.
Use a Model Hosted Locally
⚠️ To run vLLM locally you need to have vllm installed and a supported GPU.
If you don't want to use an API server, this wrapper also provides a vLLMLocalInvocationLayer, which executes vLLM on the same node Haystack is running on.
Here is a simple example of how a PromptNode can be created with the vLLMLocalInvocationLayer.
from haystack.nodes import PromptNode, PromptModel
from vllm_haystack import vLLMLocalInvocationLayer

# MODEL is the name or path of the model that vLLM should load on the local machine
model = PromptModel(model_name_or_path=MODEL, invocation_layer_class=vLLMLocalInvocationLayer, max_length=256, model_kwargs={
    "maximum_context_length": 2048,
})
prompt_node = PromptNode(model_name_or_path=model, top_k=1, max_length=256)
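In either setup, the resulting PromptNode behaves like any other Haystack node and can be wired into a pipeline. The following is a minimal sketch using standard Haystack 1.x pipeline conventions rather than anything specific to this wrapper; the query string is a placeholder:

from haystack import Pipeline

pipeline = Pipeline()
pipeline.add_node(component=prompt_node, name="PromptNode", inputs=["Query"])

result = pipeline.run(query="What is the capital of Germany?")  # placeholder query
print(result)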