ChatLlamaCpp

This notebook provides a quick overview for getting started with a chat model integrated with llama-cpp-python.

The example below demonstrates how to use it with the open-source Llama 3 Instruct 8B model.

Overview

Integration details

Class	Package	Local	Serializable	JS support
ChatLlamaCpp	langchain-community	✅	❌	❌

Model features

Tool calling	Structured output	JSON mode	Image input	Audio input	Video input	Token-level streaming	Native async	Token usage	Logprobs
✅	✅	❌	❌	❌	❌	✅	❌	❌	✅

Setup

Installation

The LangChain llama.cpp integration lives in the langchain-community and llama-cpp-python packages:

%pip install -qU langchain-community llama-cpp-python

Instantiation

Now we can instantiate our model object and generate chat completions:

import multiprocessing

from langchain_community.chat_models import ChatLlamaCpp

llm = ChatLlamaCpp(
    temperature=0.5,
    model_path="./SanctumAI-meta-llama-3-8b-instruct.Q8_0.gguf",
    n_ctx=10000,
    n_gpu_layers=8,
    n_batch=300,  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    max_tokens=512,
    n_threads=multiprocessing.cpu_count() - 1,
    repeat_penalty=1.5,
    top_p=0.5,
    verbose=True,
)

API Reference: ChatLlamaCpp

llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /home/tni5hc/Documents/langchain_llamacpp/SanctumAI-meta-llama-3-8b-instruct.Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv   2:                          llama.block_count u32              = 32
llama_model_loader: - kv   3:                       llama.context_length u32              = 8192
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 7
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q8_0
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 7.95 GiB (8.50 BPW) 
llm_load_print_meta: general.name     = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA RTX A2000 12GB, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 8 repeating layers to GPU
llm_load_tensors: offloaded 8/33 layers to GPU
llm_load_tensors:        CPU buffer size =  8137.64 MiB
llm_load_tensors:      CUDA0 buffer size =  1768.25 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 10016
llama_new_context_with_model: n_batch    = 300
llama_new_context_with_model: n_ubatch   = 300
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =   939.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   313.00 MiB
llama_new_context_with_model: KV self size  = 1252.00 MiB, K (f16):  626.00 MiB, V (f16):  626.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   683.78 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    16.15 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 268
AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'tokenizer.chat_template': "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}", 'tokenizer.ggml.eos_token_id': '128009', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'gpt2', 'general.architecture': 'llama', 'llama.rope.freq_base': '500000.000000', 'tokenizer.ggml.pre': 'llama-bpe', 'llama.context_length': '8192', 'general.name': 'Meta-Llama-3-8B-Instruct', 'llama.embedding_length': '4096', 'llama.feed_forward_length': '14336', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'tokenizer.ggml.bos_token_id': '128000', 'llama.attention.head_count': '32', 'llama.block_count': '32', 'llama.attention.head_count_kv': '8', 'general.file_type': '7', 'llama.vocab_size': '128256', 'llama.rope.dimension_count': '128'}
Using gguf chat template: {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}{% endif %}
Using chat eos_token: <|eot_id|>
Using chat bos_token: <|begin_of_text|>
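
The instantiation above sets n_threads to cpu_count() - 1 and notes that n_batch should be between 1 and n_ctx. A minimal sketch of how those two values could be guarded before they are passed to the model follows; `safe_thread_count` and `clamp_batch` are hypothetical helpers introduced here for illustration and are not part of LangChain or llama-cpp-python.

```python
import multiprocessing


def safe_thread_count(reserve: int = 1) -> int:
    """Leave `reserve` cores free, but never return fewer than 1 thread."""
    return max(1, multiprocessing.cpu_count() - reserve)


def clamp_batch(n_batch: int, n_ctx: int) -> int:
    """Keep n_batch inside the range [1, n_ctx]."""
    return max(1, min(n_batch, n_ctx))


n_threads = safe_thread_count()
n_batch = clamp_batch(300, 10000)
print(n_threads, n_batch)
```

On a single-core machine, cpu_count() - 1 evaluates to 0, which is why the sketch enforces a floor of one thread.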

Invocation

messages = [
    (
        "system",
        "You are a helpful assistant that translates English to French. Translate the user sentence.",
    ),
    ("human", "I love programming."),
]

ai_msg = llm.invoke(messages)
ai_msg

llama_print_timings:        load time =    1077.71 ms
llama_print_timings:      sample time =      21.82 ms /    39 runs   (    0.56 ms per token,  1787.35 tokens per second)
llama_print_timings: prompt eval time =    1077.65 ms /    37 tokens (   29.13 ms per token,    34.33 tokens per second)
llama_print_timings:        eval time =    8403.75 ms /    38 runs   (  221.15 ms per token,     4.52 tokens per second)
llama_print_timings:       total time =    9689.66 ms /    75 tokens

AIMessage(content='Je adore le programmation.\n\n(Note: "programmation" is used in both formal and informal contexts, but it\'s generally accepted as equivalent of saying you like computer science or coding.)', response_metadata={'finish_reason': 'stop'}, id='run-e9e03b94-f29f-4c1d-8483-e23a46acb556-0')

print(ai_msg.content)

Je adore le programmation.

(Note: "programmation" is used in both formal and informal contexts, but it's generally accepted as equivalent of saying you like computer science or coding.)
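
The load log printed during instantiation includes the GGUF chat template that turns a message list into a single prompt string. As a rough illustration, the sketch below re-implements that template in plain Python. `render_llama3_prompt` is a hypothetical helper, not part of LangChain or llama-cpp-python, and the mapping of the "human" role to "user" is an assumption about how messages are converted before templating.

```python
BOS = "<|begin_of_text|>"

# Assumed mapping from LangChain-style role names to template roles.
ROLE_MAP = {"system": "system", "human": "user", "ai": "assistant"}


def render_llama3_prompt(messages, add_generation_prompt=True):
    """Mirror the Jinja chat template printed in the load log."""
    parts = []
    for index, (role, content) in enumerate(messages):
        block = (
            "<|start_header_id|>" + ROLE_MAP.get(role, role) + "<|end_header_id|>\n\n"
            + content.strip() + "<|eot_id|>"
        )
        if index == 0:
            # The template prepends the BOS token to the first message only.
            block = BOS + block
        parts.append(block)
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    (
        "system",
        "You are a helpful assistant that translates English to French. Translate the user sentence.",
    ),
    ("human", "I love programming."),
]
prompt_text = render_llama3_prompt(messages)
print(prompt_text)
```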

Chaining

We can chain our model with a prompt template like so:

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            "You are a helpful assistant that translates {input_language} to {output_language}.",
        ),
        ("human", "{input}"),
    ]
)

chain = prompt | llm
chain.invoke(
    {
        "input_language": "English",
        "output_language": "German",
        "input": "I love programming.",
    }
)

API Reference: ChatPromptTemplate

Llama.generate: prefix-match hit

llama_print_timings:        load time =    1077.71 ms
llama_print_timings:      sample time =      29.23 ms /    52 runs   (    0.56 ms per token,  1778.75 tokens per second)
llama_print_timings: prompt eval time =     869.38 ms /    17 tokens (   51.14 ms per token,    19.55 tokens per second)
llama_print_timings:        eval time =    6694.18 ms /    51 runs   (  131.26 ms per token,     7.62 tokens per second)
llama_print_timings:       total time =    7830.86 ms /    68 tokens

AIMessage(content='Ich liebe auch Programmieren! (Translation: I also like coding!) Do you have any favorite languages or projects? Ich bin hier, um dir zu helfen und über deine Lieblingsprogrammierthemen sprechen können wir gerne weiter machen... !)', response_metadata={'finish_reason': 'stop'}, id='run-922c4cad-368f-41ba-9db9-eacb41d37cb2-0')

Tool calling

Tool calling with ChatLlamaCpp works mostly the same way as OpenAI function calling.

OpenAI has a tool calling API (we use "tool calling" and "function calling" interchangeably here) that lets you describe tools and their arguments, and have the model return a JSON object with a tool to invoke and the inputs to that tool. Tool calling is extremely useful for building tool-using chains and agents, and for getting structured outputs from models more generally.

With ChatLlamaCpp.bind_tools, we can easily pass in Pydantic classes, dict schemas, LangChain tools, or even functions as tools to the model. Under the hood these are converted to OpenAI tool schemas, which look like:

{
    "name": "...",
    "description": "...",
    "parameters": {...}  # JSONSchema
}

and are passed in with every model invocation.
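
To make the schema shape concrete, here is a hand-written sketch of what such a schema might look like for a weather tool. The field values are illustrative assumptions, and the exact dict that LangChain generates may differ in detail (for example in extra titles or key ordering).

```python
import json

# Hand-written example; not the literal output of any LangChain converter.
weather_tool_schema = {
    "name": "get_current_weather",
    "description": "Get the current weather in a given location",
    "parameters": {  # JSON Schema describing the arguments
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "The city and state, e.g. San Francisco, CA",
            },
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["location", "unit"],
    },
}

# OpenAI-style APIs wrap the function schema in a typed envelope.
openai_tool = {"type": "function", "function": weather_tool_schema}
print(json.dumps(openai_tool, indent=2))
```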

However, the model cannot trigger a function or tool automatically; we need to force it by specifying the tool_choice parameter. This parameter is typically formatted as follows:

{"type": "function", "function": {"name": <<tool_name>>}}

from langchain.tools import tool
from langchain_core.pydantic_v1 import BaseModel, Field


class WeatherInput(BaseModel):
    location: str = Field(description="The city and state, e.g. San Francisco, CA")
    unit: str = Field(enum=["celsius", "fahrenheit"])


@tool("get_current_weather", args_schema=WeatherInput)
def get_weather(location: str, unit: str):
    """Get the current weather in a given location"""
    return f"Now the weather in {location} is 22 {unit}"


llm_with_tools = llm.bind_tools(
    tools=[get_weather],
    tool_choice={"type": "function", "function": {"name": "get_current_weather"}},
)

API Reference: tool

ai_msg = llm_with_tools.invoke(
    "what is the weather like in HCMC in celsius",
)
ai_msg

Llama.generate: prefix-match hit

llama_print_timings:        load time =    1077.71 ms
llama_print_timings:      sample time =     853.67 ms /    20 runs   (   42.68 ms per token,    23.43 tokens per second)
llama_print_timings: prompt eval time =    1060.96 ms /    21 tokens (   50.52 ms per token,    19.79 tokens per second)
llama_print_timings:        eval time =    2754.74 ms /    19 runs   (  144.99 ms per token,     6.90 tokens per second)
llama_print_timings:       total time =    4817.07 ms /    40 tokens

AIMessage(content='', additional_kwargs={'function_call': {'name': 'get_current_weather', 'arguments': '{ "location": "Ho Chi Minh City", "unit" : "celsius"}'}, 'tool_calls': [{'id': 'call__0_get_current_weather_cmpl-3e329fde-4fa6-41b9-837c-131fa9494554', 'type': 'function', 'function': {'name': 'get_current_weather', 'arguments': '{ "location": "Ho Chi Minh City", "unit" : "celsius"}'}}]}, response_metadata={'token_usage': {'prompt_tokens': 23, 'completion_tokens': 19, 'total_tokens': 42}, 'finish_reason': 'tool_calls', 'logprobs': None}, id='run-9d35869c-36fe-4f4a-835e-089a3f3aba3c-0', tool_calls=[{'name': 'get_current_weather', 'args': {'location': 'Ho Chi Minh City', 'unit': 'celsius'}, 'id': 'call__0_get_current_weather_cmpl-3e329fde-4fa6-41b9-837c-131fa9494554'}])

ai_msg.tool_calls

[{'name': 'get_current_weather',
  'args': {'location': 'Ho Chi Minh City', 'unit': 'celsius'},
  'id': 'call__0_get_current_weather_cmpl-3e329fde-4fa6-41b9-837c-131fa9494554'}]
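
The model only returns the name and arguments of the tool to call; running the tool is left to the caller. The following sketch shows one way the tool_calls list above could be dispatched. `TOOL_REGISTRY` and `run_tool_calls` are hypothetical names, and the plain function here stands in for the decorated tool so that the snippet runs without a model.

```python
def get_weather(location: str, unit: str) -> str:
    """Plain-Python stand-in for the get_current_weather tool."""
    return f"Now the weather in {location} is 22 {unit}"


# Map tool names (as the model reports them) to callables.
TOOL_REGISTRY = {"get_current_weather": get_weather}

# Copied from the ai_msg.tool_calls output shown above.
tool_calls = [
    {
        "name": "get_current_weather",
        "args": {"location": "Ho Chi Minh City", "unit": "celsius"},
        "id": "call__0_get_current_weather_cmpl-3e329fde-4fa6-41b9-837c-131fa9494554",
    }
]


def run_tool_calls(calls, registry):
    """Execute each requested tool and pair the result with its call id."""
    results = []
    for call in calls:
        func = registry.get(call["name"])
        if func is None:
            results.append((call["id"], f"unknown tool: {call['name']}"))
            continue
        results.append((call["id"], func(**call["args"])))
    return results


results = run_tool_calls(tool_calls, TOOL_REGISTRY)
for call_id, output in results:
    print(output)  # Now the weather in Ho Chi Minh City is 22 celsius
```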

Structured output

from langchain_core.pydantic_v1 import BaseModel
from langchain_core.utils.function_calling import convert_to_openai_tool


class AnswerWithJustification(BaseModel):
    """An answer to the user question along with justification for the answer."""

    answer: str
    justification: str


dict_schema = convert_to_openai_tool(AnswerWithJustification)

structured_llm = llm.with_structured_output(dict_schema)

result = structured_llm.invoke(
    "What weighs more a pound of bricks or a pound of feathers ?"
)

API Reference: convert_to_openai_tool

Llama.generate: prefix-match hit

llama_print_timings:        load time =    1077.71 ms
llama_print_timings:      sample time =    1964.76 ms /    44 runs   (   44.65 ms per token,    22.39 tokens per second)
llama_print_timings: prompt eval time =     914.34 ms /    18 tokens (   50.80 ms per token,    19.69 tokens per second)
llama_print_timings:        eval time =    7903.81 ms /    43 runs   (  183.81 ms per token,     5.44 tokens per second)
llama_print_timings:       total time =   11065.60 ms /    61 tokens

print(result)

{'answer': "a pound is always the same weight, regardless of what it's made up off. So both options are equal in terms of their mass.", 'justification': ''}
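
Note that the result above contains an empty justification, so structured output from a local model may be worth checking before use. A minimal sketch of such a check follows; `check_structured_result` is a hypothetical helper and the sample dict is copied from the printed result.

```python
REQUIRED_KEYS = ("answer", "justification")


def check_structured_result(result: dict) -> list:
    """Return a list of problems found in a structured-output dict."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in result:
            problems.append(f"missing key: {key}")
        elif not isinstance(result[key], str):
            problems.append(f"not a string: {key}")
        elif not result[key].strip():
            problems.append(f"empty value: {key}")
    return problems


sample_result = {
    "answer": "a pound is always the same weight, regardless of what it's made up off. "
    "So both options are equal in terms of their mass.",
    "justification": "",
}
problems = check_structured_result(sample_result)
print(problems)  # ['empty value: justification']
```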

Streaming

for chunk in llm.stream("what is 25x5"):
    print(chunk.content, end="\n", flush=True)

Llama.generate: prefix-match hit

The
 answer
 to
 the
 multiplication
 problem
 "
What
's
 
25
 x
 
5
?"
 would
 be
:


125

llama_print_timings:        load time =    1077.71 ms
llama_print_timings:      sample time =      10.60 ms /    20 runs   (    0.53 ms per token,  1886.26 tokens per second)
llama_print_timings: prompt eval time =    3661.75 ms /    12 tokens (  305.15 ms per token,     3.28 tokens per second)
llama_print_timings:        eval time =    2468.01 ms /    19 runs   (  129.90 ms per token,     7.70 tokens per second)
llama_print_timings:       total time =    3133.11 ms /    31 tokens
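
The loop above prints each chunk on its own line because it uses end="\n". To display the stream as continuous text and keep the full reply, the chunks can be printed with end="" and joined afterwards. The sketch below uses `fake_stream`, a hypothetical stand-in for llm.stream(...) that yields plain strings, so it runs without a model; with the real stream you would append chunk.content instead.

```python
def fake_stream():
    """Hypothetical stand-in for llm.stream(...); yields text pieces."""
    for piece in ["The", " answer", " would", " be", ":", "\n\n", "125"]:
        yield piece


pieces = []
for piece in fake_stream():
    # end="" keeps the pieces on one line as they arrive.
    print(piece, end="", flush=True)
    pieces.append(piece)
print()

# Join the pieces to recover the complete reply.
full_text = "".join(pieces)
```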

API reference

For detailed documentation of all ChatLlamaCpp features and configurations head to the API reference: https://api.python.langchain.com/en/latest/chat_models/langchain_community.chat_models.llamacpp.ChatLlamaCpp.html
