ChatLlamaCpp
This notebook provides a quick overview for getting started with chat model intergrated with llama cpp python
An example below demonstrating how to implement with the open-source Llama3 Instruct 8B
Overviewβ
Integration detailsβ
Class | Package | Local | Serializable | JS support |
---|---|---|---|---|
ChatLlamaCpp | langchain-community | β | β | β |
Model featuresβ
Tool calling | Structured output | JSON mode | Image input | Audio input | Video input | Token-level streaming | Native async | Token usage | Logprobs |
---|---|---|---|---|---|---|---|---|---|
β | β | β | β | β | β | β | β | β | β |
Setupβ
Installationβ
The LangChain OpenAI integration lives in the langchain-community
and llama-cpp-python
packages:
%pip install -qU langchain-community llama-cpp-python
Instantiationβ
Now we can instantiate our model object and generate chat completions:
import multiprocessing
from langchain_community.chat_models import ChatLlamaCpp
llm = ChatLlamaCpp(
temperature=0.5,
model_path="./SanctumAI-meta-llama-3-8b-instruct.Q8_0.gguf",
n_ctx=10000,
n_gpu_layers=8,
n_batch=300, # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
max_tokens=512,
n_threads=multiprocessing.cpu_count() - 1,
repeat_penalty=1.5,
top_p=0.5,
verbose=True,
)
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/tni5hc/Documents/langchain_llamacpp/SanctumAI-meta-llama-3-8b-instruct.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 7
llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Δ Δ ", "Δ Δ Δ Δ ", "Δ Δ Δ Δ ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q8_0: 226 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q8_0
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 7.95 GiB (8.50 BPW)
llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Γ'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX A2000 12GB, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: offloading 8 repeating layers to GPU
llm_load_tensors: offloaded 8/33 layers to GPU
llm_load_tensors: CPU buffer size = 8137.64 MiB
llm_load_tensors: CUDA0 buffer size = 1768.25 MiB
.........................................................................................
llama_new_context_with_model: n_ctx = 10016
llama_new_context_with_model: n_batch = 300
llama_new_context_with_model: n_ubatch = 300
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 939.00 MiB
llama_kv_cache_init: CUDA0 KV buffer size = 313.00 MiB
llama_new_context_with_model: KV self size = 1252.00 MiB, K (f16): 626.00 MiB, V (f16): 626.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 683.78 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 16.15 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 268
AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
Model metadata: {'tokenizer.chat_template': "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}", 'tokenizer.ggml.eos_token_id': '128009', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'gpt2', 'general.architecture': 'llama', 'llama.rope.freq_base': '500000.000000', 'tokenizer.ggml.pre': 'llama-bpe', 'llama.context_length': '8192', 'general.name': 'Meta-Llama-3-8B-Instruct', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'tokenizer.ggml.bos_token_id': '128000', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8', 'general.file_type': '7', 'llama.vocab_size': '128256', 'llama.rope.dimension_count': '128'}
Using gguf chat template: {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>
'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>
' }}{% endif %}
Using chat eos_token: <|eot_id|>
Using chat bos_token: <|begin_of_text|>
Invocationβ
messages = [
(
"system",
"You are a helpful assistant that translates English to French. Translate the user sentence.",
),
("human", "I love programming."),
]
ai_msg = llm.invoke(messages)
ai_msg
llama_print_timings: load time = 1077.71 ms
llama_print_timings: sample time = 21.82 ms / 39 runs ( 0.56 ms per token, 1787.35 tokens per second)
llama_print_timings: prompt eval time = 1077.65 ms / 37 tokens ( 29.13 ms per token, 34.33 tokens per second)
llama_print_timings: eval time = 8403.75 ms / 38 runs ( 221.15 ms per token, 4.52 tokens per second)
llama_print_timings: total time = 9689.66 ms / 75 tokens
AIMessage(content='Je adore le programmation.\n\n(Note: "programmation" is used in both formal and informal contexts, but it\'s generally accepted as equivalent of saying you like computer science or coding.)', response_metadata={'finish_reason': 'stop'}, id='run-e9e03b94-f29f-4c1d-8483-e23a46acb556-0')
print(ai_msg.content)
Je adore le programmation.
(Note: "programmation" is used in both formal and informal contexts, but it's generally accepted as equivalent of saying you like computer science or coding.)
Chainingβ
We can chain our model with a prompt template like so:
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_messages(
[
(
"system",
"You are a helpful assistant that translates {input_language} to {output_language}.",
),
("human", "{input}"),
]
)
chain = prompt | llm
chain.invoke(
{
"input_language": "English",
"output_language": "German",
"input": "I love programming.",
}
)
Llama.generate: prefix-match hit
llama_print_timings: load time = 1077.71 ms
llama_print_timings: sample time = 29.23 ms / 52 runs ( 0.56 ms per token, 1778.75 tokens per second)
llama_print_timings: prompt eval time = 869.38 ms / 17 tokens ( 51.14 ms per token, 19.55 tokens per second)
llama_print_timings: eval time = 6694.18 ms / 51 runs ( 131.26 ms per token, 7.62 tokens per second)
llama_print_timings: total time = 7830.86 ms / 68 tokens
AIMessage(content='Ich liebe auch Programmieren! (Translation: I also like coding!) Do you have any favorite languages or projects? Ich bin hier, um dir zu helfen und ΓΌber deine Lieblingsprogrammierthemen sprechen kΓΆnnen wir gerne weiter machen... !)', response_metadata={'finish_reason': 'stop'}, id='run-922c4cad-368f-41ba-9db9-eacb41d37cb2-0')
Tool callingβ
Firstly, it works mostly the same as OpenAI Function Calling
OpenAI has a tool calling (we use "tool calling" and "function calling" interchangeably here) API that lets you describe tools and their arguments, and have the model return a JSON object with a tool to invoke and the inputs to that tool. tool-calling is extremely useful for building tool-using chains and agents, and for getting structured outputs from models more generally.
With ChatLlamaCpp.bind_tools
, we can easily pass in Pydantic classes, dict schemas, LangChain tools, or even functions as tools to the model. Under the hood these are converted to an OpenAI tool schemas, which looks like:
{
"name": "...",
"description": "...",
"parameters": {...} # JSONSchema
}
and passed in every model invocation.
However, it cannot automatically trigger a function/tool, we need to force it by specifying the 'tool choice' parameter. This parameter is typically formatted as described below.
{"type": "function", "function": {"name": <<tool_name>>}}.
from langchain.tools import tool
from langchain_core.pydantic_v1 import BaseModel, Field
class WeatherInput(BaseModel):
location: str = Field(description="The city and state, e.g. San Francisco, CA")
unit: str = Field(enum=["celsius", "fahrenheit"])
@tool("get_current_weather", args_schema=WeatherInput)
def get_weather(location: str, unit: str):
"""Get the current weather in a given location"""
return f"Now the weather in {location} is 22 {unit}"
llm_with_tools = llm.bind_tools(
tools=[get_weather],
tool_choice={"type": "function", "function": {"name": "get_current_weather"}},
)
ai_msg = llm_with_tools.invoke(
"what is the weather like in HCMC in celsius",
)
ai_msg
Llama.generate: prefix-match hit
llama_print_timings: load time = 1077.71 ms
llama_print_timings: sample time = 853.67 ms / 20 runs ( 42.68 ms per token, 23.43 tokens per second)
llama_print_timings: prompt eval time = 1060.96 ms / 21 tokens ( 50.52 ms per token, 19.79 tokens per second)
llama_print_timings: eval time = 2754.74 ms / 19 runs ( 144.99 ms per token, 6.90 tokens per second)
llama_print_timings: total time = 4817.07 ms / 40 tokens
AIMessage(content='', additional_kwargs={'function_call': {'name': 'get_current_weather', 'arguments': '{ "location": "Ho Chi Minh City", "unit" : "celsius"}'}, 'tool_calls': [{'id': 'call__0_get_current_weather_cmpl-3e329fde-4fa6-41b9-837c-131fa9494554', 'type': 'function', 'function': {'name': 'get_current_weather', 'arguments': '{ "location": "Ho Chi Minh City", "unit" : "celsius"}'}}]}, response_metadata={'token_usage': {'prompt_tokens': 23, 'completion_tokens': 19, 'total_tokens': 42}, 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-9d35869c-36fe-4f4a-835e-089a3f3aba3c-0', tool_calls=[{'name': 'get_current_weather', 'args': {'location': 'Ho Chi Minh City', 'unit': 'celsius'}, 'id': 'call__0_get_current_weather_cmpl-3e329fde-4fa6-41b9-837c-131fa9494554'}])
ai_msg.tool_calls
[{'name': 'get_current_weather',
'args': {'location': 'Ho Chi Minh City', 'unit': 'celsius'},
'id': 'call__0_get_current_weather_cmpl-3e329fde-4fa6-41b9-837c-131fa9494554'}]
Structured output
from langchain_core.pydantic_v1 import BaseModel
from langchain_core.utils.function_calling import convert_to_openai_tool
class AnswerWithJustification(BaseModel):
"""An answer to the user question along with justification for the answer."""
answer: str
justification: str
dict_schema = convert_to_openai_tool(AnswerWithJustification)
structured_llm = llm.with_structured_output(dict_schema)
result = structured_llm.invoke(
"What weighs more a pound of bricks or a pound of feathers ?"
)
Llama.generate: prefix-match hit
llama_print_timings: load time = 1077.71 ms
llama_print_timings: sample time = 1964.76 ms / 44 runs ( 44.65 ms per token, 22.39 tokens per second)
llama_print_timings: prompt eval time = 914.34 ms / 18 tokens ( 50.80 ms per token, 19.69 tokens per second)
llama_print_timings: eval time = 7903.81 ms / 43 runs ( 183.81 ms per token, 5.44 tokens per second)
llama_print_timings: total time = 11065.60 ms / 61 tokens
print(result)
{'answer': "a pound is always the same weight, regardless of what it's made up off. So both options are equal in terms of their mass.", 'justification': ''}
Streaming
for chunk in llm.stream("what is 25x5"):
print(chunk.content, end="\n", flush=True)
Llama.generate: prefix-match hit
``````output
The
answer
to
the
multiplication
problem
"
What
's
25
x
5
?"
would
be
:
125
``````output
llama_print_timings: load time = 1077.71 ms
llama_print_timings: sample time = 10.60 ms / 20 runs ( 0.53 ms per token, 1886.26 tokens per second)
llama_print_timings: prompt eval time = 3661.75 ms / 12 tokens ( 305.15 ms per token, 3.28 tokens per second)
llama_print_timings: eval time = 2468.01 ms / 19 runs ( 129.90 ms per token, 7.70 tokens per second)
llama_print_timings: total time = 3133.11 ms / 31 tokens
``````output
API referenceβ
For detailed documentation of all ChatLlamaCpp features and configurations head to the API reference: https://api.python.langchain.com/en/latest/chat_models/langchain_community.chat_models.llamacpp.ChatLlamaCpp.html