# Use LangChain with the Qualcomm AI Inference Suite SDK

You can use the Qualcomm AI Inference Suite SDK with [LangChain](https://docs.langchain.com) by installing the LangChain extras when you install the SDK. LangChain offers a standard interface for working with language models from different vendors, as shown in [its list of language models](https://docs.langchain.com/oss/python/integrations/providers/overview). You can use the Qualcomm AI Inference Suite SDK in exactly the same way.

## Prerequisites

- Complete the [get started prerequisites](https://docs.qualcomm.com/doc/80-88545-1/topic/3_0_langchain.html#prerequistes).
- If not already installed, install the [LangChain extras](https://docs.qualcomm.com/doc/80-88545-1/topic/3_0_langchain.html#prerequistes).
- If using the [Qualcomm AI On-Prem Appliance Solution](https://docs.qualcomm.com/nav/home?product=626615100779122971), make sure that the appliance box can communicate with the relevant external APIs to achieve tool calling.

## Chat with an LLM using LangChain messages

The code in this example initializes `ImagineChat` with `Llama-3-8B` as the model, then sends a single request with `model.invoke()` and a streaming request with `model.stream()`. Notice the use of standard LangChain message types such as `HumanMessage`, `SystemMessage`, and `AIMessage`.

    from langchain_core.messages import AIMessage, HumanMessage, SystemMessage
    from imagine.langchain import ImagineChat

    model = ImagineChat(model="Llama-3-8B")

    # Single request: returns the complete response as one message
    response = model.invoke(
        [
            SystemMessage(content="Translate the following from English into Italian"),
            HumanMessage(content="hello"),
        ]
    )
    print(response.content)

    # Streaming request: prints each chunk as soon as it arrives
    for chunk in model.stream(
        [
            HumanMessage(content="hello!"),
            AIMessage(content="Hi there human!"),
            HumanMessage(
                content="Write a program to sort a list of numbers in python!"
            ),
        ],
        max_tokens=512,
    ):
        print(chunk.content, end="", flush=True)

## Start a chat with the LLM

The following example shows how to use the Qualcomm AI Inference Suite SDK with the LangChain optional dependency group to start a chat with a large language model (LLM). The code instantiates the `ImagineChat` class, specifies `Llama-3.1-8B` as the LLM to use, and sends the `HumanMessage` as user input along with a `SystemMessage` that sets the assistant's behavior.

    from langchain_core.messages import HumanMessage, SystemMessage
    
    from imagine.langchain import ImagineChat

    model = ImagineChat(model="Llama-3.1-8B", max_tokens=200)
    messages = [
        SystemMessage(content="You're a helpful assistant"),
        HumanMessage(content="What is the purpose of model regularization?"),
    ]
    
    response = model.invoke(messages)
    
    print(response.content)

The chat returns a response similar to the following:

    The purpose of model regularization is to prevent overfitting in machine learning
    models. Overfitting occurs when a model becomes too complex and starts to fit the noise
    in the training data, leading to poor generalization on unseen data. Regularization
    techniques introduce additional constraints or penalties to the model's objective
    function, discouraging it from becoming overly complex and promoting simpler and more
    generalizable models. Regularization helps to strike a balance between fitting the
    training data well and avoiding overfitting, leading to better performance on new,
    unseen data.

## Stream a chat response

This example invokes the `stream()` method to output each part of the LLM’s response in real time so that you can start providing feedback to the user as soon as possible. The chat output is similar to the previous example, but the text displays as each part of the response is received rather than after the entire response is complete. This approach reduces perceived latency.

    from langchain_core.messages import HumanMessage, SystemMessage
    
    from imagine.langchain import ImagineChat

    model = ImagineChat(model="Llama-3-8B")
    messages = [
        SystemMessage(content="You're a helpful assistant"),
        HumanMessage(content="What is the purpose of model regularization?"),
    ]
    
    for chunk in model.stream(messages):
        print(chunk.content, end="", flush=True)
    
    print("\n")
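When streaming, it is often useful to keep the full text in addition to displaying it piece by piece. The following is a minimal sketch of that pattern, not part of the SDK. So that it runs without an inference endpoint, it uses a hypothetical stand-in generator, `fake_stream`, in place of `model.stream(messages)`, and the helper name `print_and_collect` is also an assumption. With the SDK, you would presumably collect `chunk.content` rather than the raw string.

```python
from typing import Iterable, Iterator


def fake_stream() -> Iterator[str]:
    """Hypothetical stand-in for model.stream(messages); yields text pieces."""
    yield from ["Regularization ", "helps ", "prevent ", "overfitting."]


def print_and_collect(chunks: Iterable[str]) -> str:
    """Print each piece as it arrives and return the assembled text."""
    parts = []
    for piece in chunks:
        print(piece, end="", flush=True)  # show partial output immediately
        parts.append(piece)  # keep the piece so the full text is available later
    print()
    return "".join(parts)


full_text = print_and_collect(fake_stream())
```

Collecting the pieces in a list and joining them once at the end avoids repeated string concatenation for long responses.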

## Start a chat using the asynchronous library

This example uses the `asyncio` library for asynchronous programming instead of the synchronous client. Notice the use of `async`, `ainvoke()`, and `asyncio` in the code to ensure that asynchronous functions are used.

See [Imagine clients](https://docs.qualcomm.com/doc/80-88545-1/topic/imagine_clients.html) for more information about the two clients.

    import asyncio
    
    from langchain_core.messages import HumanMessage, SystemMessage
    
    from imagine.langchain import ImagineChat

    async def main():
        model = ImagineChat(model="Llama-3-8B", max_tokens=512)
        messages = [
            SystemMessage(content="You're a helpful assistant"),
            HumanMessage(content="What is the purpose of model regularization?"),
        ]
    
        response = await model.ainvoke(messages)
    
        print(response.content)

    if __name__ == "__main__":
        asyncio.run(main())
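One common reason to choose the asynchronous approach is to run several requests concurrently. The following is a minimal sketch of that idea using `asyncio.gather()`. The coroutine `fake_ainvoke` is a hypothetical stand-in for `await model.ainvoke(messages)` so that the sketch runs without an endpoint, and `ask_all` is an illustrative helper name, not an SDK function.

```python
import asyncio


async def fake_ainvoke(prompt: str) -> str:
    """Hypothetical stand-in for `await model.ainvoke(...)`."""
    await asyncio.sleep(0.01)  # simulates waiting on the network
    return f"answer to: {prompt}"


async def ask_all(prompts: list[str]) -> list[str]:
    # gather() runs the coroutines concurrently and returns the results
    # in the same order as the inputs.
    return await asyncio.gather(*(fake_ainvoke(p) for p in prompts))


answers = asyncio.run(ask_all(["What is dropout?", "What is weight decay?"]))
print(answers)
```

Because the requests overlap while each one waits, the total time is close to that of the slowest request rather than the sum of all of them.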

## Stream a chat response using the asynchronous library

This example is the same as the previous asynchronous chat example but uses the `astream()` method to print each part of the LLM’s response in real time so that you can start providing feedback to the user as soon as possible. Notice the use of `async`, `astream()`, and `asyncio` in the code to ensure that asynchronous functions are used.

    import asyncio
    
    from langchain_core.messages import HumanMessage, SystemMessage
    
    from imagine.langchain import ImagineChat

    async def main():
        model = ImagineChat(model="Llama-3-8B", max_tokens=512)
        messages = [
            SystemMessage(content="You're a helpful assistant"),
            HumanMessage(content="What is the purpose of model regularization?"),
        ]
    
        async for chunk in model.astream(messages, max_tokens=100):
            print(chunk.content, end="", flush=True)

    if __name__ == "__main__":
        asyncio.run(main())

## Define a prompt template

Define a prompt template for message consistency, reliability, and efficiency when chatting with an LLM. This sample shows three ways to create a prompt template.

    from langchain_core.prompts import ChatPromptTemplate
    
    from imagine.langchain import ImagineChat

    model = ImagineChat(model="Llama-3-8B")

    # Example 1: Create a ChatPromptTemplate using a template string
    
    print("-----Prompt from Template-----")
    template = "Tell me a joke about {topic}."
    prompt_template = ChatPromptTemplate.from_template(template)
    
    prompt = prompt_template.invoke({"topic": "cats"})
    result = model.invoke(prompt)
    print(result.content)

    # Example 2: Prompt with Multiple Placeholders
    
    print("\n----- Prompt with Multiple Placeholders -----\n")
    template_multiple = """You are a helpful assistant.
    Human: Tell me a {adjective} short story about a {animal}.
    Assistant:"""
    prompt_multiple = ChatPromptTemplate.from_template(template_multiple)
    prompt = prompt_multiple.invoke({"adjective": "funny", "animal": "panda"})
    
    result = model.invoke(prompt)
    print(result.content)

    # Example 3: Prompt with System and Human Messages (Using Tuples)
    
    print("\n----- Prompt with System and Human Messages (Tuple) -----\n")
    messages = [
        ("system", "You are a comedian who tells jokes about {topic}."),
        ("human", "Tell me {joke_count} jokes."),
    ]
    prompt_template = ChatPromptTemplate.from_messages(messages)
    prompt = prompt_template.invoke({"topic": "lawyers", "joke_count": 3})
    result = model.invoke(prompt)
    print(result.content)
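Conceptually, a prompt template separates fixed wording from the values that change per request. The following is a rough sketch of that substitution step using Python's built-in `str.format()`, mirroring the tuple-based messages in Example 3. It is not how `ChatPromptTemplate` is implemented, and `render_messages` is a hypothetical helper introduced only for illustration.

```python
def render_messages(templates, variables):
    """Fill {placeholders} in (role, template) pairs; hypothetical helper."""
    return [(role, text.format(**variables)) for role, text in templates]


templates = [
    ("system", "You are a comedian who tells jokes about {topic}."),
    ("human", "Tell me {joke_count} jokes."),
]

# Each template uses only the variables it names; extra keys are ignored.
rendered = render_messages(templates, {"topic": "lawyers", "joke_count": 3})
for role, text in rendered:
    print(f"{role}: {text}")
```

Keeping the wording in one place means that every request built from the template has the same structure, which is where the consistency benefit comes from.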

## Chain calls to generate text with chat

Use LangChain Expression Language ([LCEL](https://python.langchain.com/v0.1/docs/expression_language/)) to chain a sequence of calls to an LLM, a tool, or a data preprocessing step.

The example performs the following steps:

1. Defines a prompt template.
2. Defines additional processing steps using a component of LCEL called `RunnableLambda`.
3. Creates the combined chain of all of the components using LCEL.
4. Runs the chain.
5. Prints the output.

    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_core.runnables import RunnableLambda
    
    from imagine.langchain import ImagineChat

    model = ImagineChat(model="Llama-3-8B")
    
    # Define prompt templates
    prompt_template = ChatPromptTemplate.from_messages(
        [
            ("system", "You are a comedian who tells jokes about {topic}."),
            ("human", "Tell me {joke_count} jokes."),
        ]
    )
    
    # Define additional processing steps using RunnableLambda
    uppercase_output = RunnableLambda(lambda x: x.upper())
    count_words = RunnableLambda(lambda x: f"Word count: {len(x.split())}\n{x}")
    
    # Create the combined chain using LangChain Expression Language (LCEL)
    chain = prompt_template | model | StrOutputParser() | uppercase_output | count_words
    
    # Run the chain
    result = chain.invoke({"topic": "lawyers", "joke_count": 3})
    
    # Output
    print(result)
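In the chain above, the `|` operator passes the output of each step to the next one. The following is a simplified, pure-Python sketch of that composition idea. The `Step` class is hypothetical and is not LangChain's `Runnable`, and `fake_model` stands in for the `prompt_template | model | StrOutputParser()` portion so that the sketch runs without an endpoint.

```python
class Step:
    """Hypothetical, simplified stand-in for a LangChain runnable."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # a | b builds a new step that runs a, then feeds its result to b
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Stand-in for the prompt, model, and output parser: returns plain text
fake_model = Step(lambda inputs: f"a joke about {inputs['topic']}")
uppercase_output = Step(lambda text: text.upper())
count_words = Step(lambda text: f"Word count: {len(text.split())}\n{text}")

chain = fake_model | uppercase_output | count_words
result = chain.invoke({"topic": "lawyers"})
print(result)
```

The two post-processing lambdas are the same as in the example above, so the sketch also shows what they do to the parsed text: uppercase it, then prepend a word count.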

## Get single-turn and multi-turn interactions from an LLM

The code in this example uses `ImagineLLM` to interact with the LLM in a single-turn and multi-turn interaction. The code sends a single query to the LLM using `llm.invoke()`, then displays the entire response when it is complete. Because `llm.invoke()` is a synchronous call, it must complete entirely before the next line of code runs. When it completes, the code calls `llm.stream()` to display output in real time as the LLM generates it.

    from imagine.langchain import ImagineLLM

    llm = ImagineLLM(max_tokens=1024)
    
    res_query = llm.invoke(
        "What are some theories about the relationship between unemployment and inflation?",
        max_tokens=100,
    )
    print(res_query)
    
    for chunk in llm.stream(
        "What are some theories about the relationship between unemployment and inflation?"
    ):
        print(chunk, end="", flush=True)
    print("\n")

## Generate embeddings

The code in this example uses the `ImagineEmbeddings` class to generate embeddings. Embeddings are numerical representations of a piece of text. Representing text as vectors makes it possible to perform operations such as semantic search, which looks for the pieces of text that are most similar to a query in the vector space.

The base `Embeddings` class in LangChain provides two methods:

- `embed_documents` for embedding documents; takes multiple text strings as input and converts each string into an embedding.
- `embed_query` for embedding queries; takes a single text string as input and converts it into an embedding.

    from imagine.langchain import ImagineEmbeddings

    embedding = ImagineEmbeddings()
    
    # Embed list of texts
    res_documents = embedding.embed_documents(
        [
            "Hi there!",
            "Oh, hello!",
            "What's your name?",
            "My friends call me World",
            "Hello World!",
        ]
    )
    # print(res_documents)
    print(len(res_documents))
    print([len(d) for d in res_documents])

    # Embed a single piece of text for the purpose of comparing to other embedded pieces of texts
    res_query = embedding.embed_query("What was the name mentioned in the conversation?")
    # print(res_query)
    print(len(res_query))

This example is the same as the previous synchronous example except that it uses the `asyncio` library for asynchronous programming. Notice the use of `async`, `aembed_documents()`, `aembed_query()`, and `asyncio` in the code to ensure that asynchronous functions are used.

    import asyncio
    
    from imagine.langchain import ImagineEmbeddings

    async def main():
        embedding = ImagineEmbeddings()
    
        # Embed list of texts
        res_documents = await embedding.aembed_documents(
            [
                "Hi there!",
                "Oh, hello!",
                "What's your name?",
                "My friends call me World",
                "Hello World!",
            ]
        )
        # print(res_documents)
        print(len(res_documents))
        print([len(d) for d in res_documents])
    
        # Embed a single piece of text for the purpose of comparing to other embedded pieces of texts
        res_query = await embedding.aembed_query(
            "What was the name mentioned in the conversation?"
        )
        # print(res_query)
        print(len(res_query))

    if __name__ == "__main__":
        asyncio.run(main())
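The vectors returned by `embed_documents` and `embed_query` can be compared to find the document closest to a query, which is the basis of semantic search. The following is a minimal sketch using cosine similarity, one common similarity measure. The three-dimensional vectors are hypothetical toy values chosen for readability, not real model output, and `cosine_similarity` is an illustrative helper, not an SDK function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for embed_documents() output
documents = ["Hi there!", "My friends call me World"]
doc_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

# Hypothetical toy vector standing in for embed_query() output
query_vector = [0.0, 1.0, 0.5]

# Score every document against the query and pick the highest score
scores = [cosine_similarity(query_vector, v) for v in doc_vectors]
best = max(range(len(documents)), key=lambda i: scores[i])
print(documents[best])
```

With real embeddings, the same ranking step applies unchanged; only the vectors are longer.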

## Next steps

- Review [LangChain classes](https://docs.qualcomm.com/doc/80-88545-1/topic/langchain.html).
- [Connect to external functions using LangChain tools](https://docs.qualcomm.com/doc/80-88545-1/topic/3_1_langchain_tools.html).
- [Create custom tools](https://docs.qualcomm.com/doc/80-88545-1/topic/3_2_langchain_custom_tools.html) with LangChain.

Last Published: Apr 17, 2026
