# Basic usage

The Qualcomm AI Inference Suite SDK exposes two clients, each with a different programming paradigm:
synchronous and asynchronous.

[`ImagineClient`](https://docs.qualcomm.com/doc/80-88545-1/topic/imagine_clients.html#imagine.ImagineClient) is the synchronous Imagine client. If you don’t need
asynchronous programming in your Python code, or you are not familiar with it,
this is the client to use.

Otherwise, if you use [`asyncio`](https://docs.python.org/3/library/asyncio.html#module-asyncio "(in Python v3.14)") in your codebase,
[`ImagineAsyncClient`](https://docs.qualcomm.com/doc/80-88545-1/topic/imagine_clients.html#imagine.ImagineAsyncClient) might be a better choice.

The examples on this page focus mostly on the synchronous client because the
asynchronous client offers a very similar interface. See
the [API documentation](https://docs.qualcomm.com/doc/80-88545-1/topic/imagine_clients.html) for details about their
differences.

## Available models

When calling any of the inference methods, you can pass a model name as a string to
specify which model to use (for example, see [`imagine.ImagineClient.chat`](https://docs.qualcomm.com/doc/80-88545-1/topic/imagine_clients.html#imagine.ImagineClient.chat)). If you don’t pass a model name explicitly when invoking the method, the default model is used.

To get a list of available models, run the following:

    from pprint import pprint

    from imagine import ImagineClient, ModelType

    client = ImagineClient()

    all_models = client.get_available_models_by_type()
    pprint(all_models)

    llm_models = client.get_available_models(model_type=ModelType.LLM)
    pprint(llm_models)

Depending on the model type, the current default models are as follows:

| Model type | Default model |
| --- | --- |
| LLM | Llama-3.1-8B |
| Text to Image | sdxl-turbo |
| Translate | Helsinki-NLP/opus-mt-en-es |
| Transcribe | whisper-tiny |
| Embedding | BAAI/bge-large-en-v1.5 |
| Reranker | BAAI/bge-reranker-base |
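
The fallback described above can be pictured as a simple lookup. The following is a minimal sketch, not SDK code: `DEFAULT_MODELS` copies the table above, and `resolve_model` is a hypothetical helper that returns the explicit model name when one is given and the documented default otherwise.

```python
from typing import Optional

# Default model names copied from the table above; they may change between releases.
DEFAULT_MODELS = {
    "LLM": "Llama-3.1-8B",
    "Text to Image": "sdxl-turbo",
    "Translate": "Helsinki-NLP/opus-mt-en-es",
    "Transcribe": "whisper-tiny",
    "Embedding": "BAAI/bge-large-en-v1.5",
    "Reranker": "BAAI/bge-reranker-base",
}


def resolve_model(model_type: str, model: Optional[str] = None) -> str:
    """Return the explicit model name if given, else the documented default.

    Hypothetical helper for illustration only; it is not part of the SDK.
    """
    return model if model is not None else DEFAULT_MODELS[model_type]


print(resolve_model("LLM"))                     # no name given, default applies
print(resolve_model("LLM", "my-custom-model"))  # placeholder name, passed through
```

Because the defaults can change, prefer querying the available models at run time over hard-coding names like this in production code.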

AI Appliance

If using the [Qualcomm AI On-Prem Appliance Solution](https://docs.qualcomm.com/nav/home?product=626615100779122971), ensure that you start the appropriate models before calling them.

## Start a chat with the LLM

The following basic example shows how to use the Qualcomm AI Inference Suite SDK to start a chat with a large language model (LLM). The code instantiates [`ImagineClient`](https://docs.qualcomm.com/doc/80-88545-1/topic/imagine_clients.html#imagine.ImagineClient) and starts a chat by sending a `ChatMessage` that contains a user question. The `model` parameter specifies which LLM to use. Finally, the code prints the model’s reply, which it reads from `chat_response.first_content`.

    from imagine import ChatMessage, ImagineClient

    client = ImagineClient()

    chat_response = client.chat(
        messages=[ChatMessage(role="user", content="What is the best Spanish cheese?")],
        model="Llama-3.1-8B",
    )

    print(chat_response.first_content)

The output is similar to the following:

    Spain is renowned for its rich variety of cheeses, each with its unique flavor profile
    and texture. The "best" Spanish cheese is subjective and often depends on personal
    taste preferences. However, here are some of the most popular and highly-regarded
    Spanish cheeses:

    1. Manchego: A firm, crumbly cheese made from sheep's milk, Manchego is a classic
       Spanish cheese with a nutty, slightly sweet flavor.
    2. Mahon: A semi-soft cheese from the island of Minorca, Mahon has a mild,
       creamy flavor and a smooth texture.
    3. Idiazabal: A smoked cheese from the Basque region, Idiazabal has a strong, savory
       flavor and a firm texture.
    4. Garrotxa: A soft, creamy cheese from Catalonia, Garrotxa has a mild, buttery flavor
       and a delicate aroma.
    ...

## Start a chat with the LLM using the asynchronous client

This example is the same as the previous chat example but uses the asynchronous client `ImagineAsyncClient` instead of the synchronous client. The methods and the input arguments are the same in both examples, so moving from synchronous to asynchronous code mainly means adding `async` and `await`. Because this example omits the `model` argument, the default LLM is used. See [Imagine clients](https://docs.qualcomm.com/doc/80-88545-1/topic/imagine_clients.html) for more information on the two clients.

    import asyncio

    from imagine import ChatMessage, ImagineAsyncClient

    async def main():
        client = ImagineAsyncClient()

        chat_response = await client.chat(
            messages=[ChatMessage(role="user", content="What is the best Spanish cheese?")],
        )
        print(chat_response.first_content)

    if __name__ == "__main__":
        asyncio.run(main())
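
A common reason to pick an asynchronous client is to run several requests concurrently. The following minimal sketch shows that pattern with `asyncio.gather`. To stay runnable without a server, it uses `fake_chat`, a hypothetical stand-in coroutine that only sleeps and returns a string; it is an assumption that you would replace it with an awaited `ImagineAsyncClient.chat` call in real code.

```python
import asyncio


async def fake_chat(question: str) -> str:
    """Hypothetical stand-in for an awaited chat request; not an SDK call."""
    await asyncio.sleep(0.1)  # simulates network and inference latency
    return f"answer to: {question}"


async def ask_all(questions: list[str]) -> list[str]:
    # gather() runs the coroutines concurrently and returns results in input order
    return await asyncio.gather(*(fake_chat(q) for q in questions))


answers = asyncio.run(
    ask_all(["first question", "second question", "third question"])
)
print(answers)
```

The three simulated requests overlap, so the total wait is close to one request’s latency rather than the sum of all three.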

## Stream a chat response

This example invokes the `chat_stream` method to print each part of the LLM’s response in real time, so you
can start showing output to the user as soon as possible. The chat output is similar to the previous example,
but the text displays as each part of the response arrives instead of after the entire response is complete.
This approach reduces perceived latency.

    from imagine import ChatMessage, ImagineClient

    client = ImagineClient()

    for chunk in client.chat_stream(
        messages=[
            ChatMessage(role="system", content="You are an expert programmer."),
            ChatMessage(
                role="user", content="Write a quick sort implementation in python."
            ),
        ],
        max_tokens=1024,
    ):
        if chunk.first_content is not None:
            print(chunk.first_content, end="", flush=True)

    print("\n")
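
Streaming code often needs the complete reply as well, for example to store it or append it to the chat history. The following sketch prints chunks as they arrive and also accumulates them. `FakeChunk` and `fake_chat_stream` are hypothetical stand-ins so the code runs without a server; only the `first_content` attribute mirrors the example above.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a streamed chunk; not an SDK class."""

    first_content: Optional[str]


def fake_chat_stream() -> Iterator[FakeChunk]:
    """Yield a few chunks, including an empty one, to imitate a stream."""
    for piece in ["def quick_sort", "(items):", None, " ..."]:
        yield FakeChunk(first_content=piece)


def collect_stream(chunks: Iterable[FakeChunk]) -> str:
    """Print each chunk as it arrives and return the full text."""
    parts = []
    for chunk in chunks:
        if chunk.first_content is not None:
            print(chunk.first_content, end="", flush=True)
            parts.append(chunk.first_content)
    print()  # finish the line once the stream ends
    return "".join(parts)


full_text = collect_stream(fake_chat_stream())
```

Collecting the parts in a list and joining once at the end avoids repeated string concatenation on long replies.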

## Stream a chat response using the asynchronous client

This example is the same as the previous streaming chat example but uses the asynchronous client `ImagineAsyncClient` instead of the synchronous client. The methods and the input arguments are the same in both examples, so you can transition from synchronous code to asynchronous code. See [Imagine clients](https://docs.qualcomm.com/doc/80-88545-1/topic/imagine_clients.html) for more information on the two clients.

    import asyncio

    from imagine import ChatMessage, ImagineAsyncClient

    async def main():
        client = ImagineAsyncClient()

        async for chunk in client.chat_stream(
            messages=[ChatMessage(role="user", content="What is the best French cheese?")],
        ):
            if chunk.first_content is not None:
                print(chunk.first_content, end="", flush=True)

    if __name__ == "__main__":
        asyncio.run(main())

## Generate code

This example invokes the `completion` method to generate code in response to a prompt. The following
code shows how to generate Python code:

    from imagine import ImagineClient

    client = ImagineClient()

    completion_response = client.completion(
        prompt="Write a Python function to get the fibonacci series"
    )

    print(completion_response.first_text)

The output is similar to the following:

    Here is a Python function that generates the Fibonacci series up to a given number:

    ```Python
    def fibonacci(n):
        fib_series = [0, 1]
        while fib_series[-1] + fib_series[-2] <= n:
            fib_series.append(fib_series[-1] + fib_series[-2])
        return fib_series

    n = int(input("Enter a number: "))
    print(fibonacci(n))
    ```
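
As the sample output shows, the reply can mix explanatory prose with a fenced code block. If you only want the code, you can extract it from the text. The following is a minimal sketch using a regular expression. `sample_reply` is a shortened, hand-written imitation of the output above, and `extract_code_blocks` is a hypothetical helper, not part of the SDK.

```python
import re

FENCE = "`" * 3  # built programmatically so this example contains no literal fence

# Hand-written sample shaped like the output above; a real reply will differ.
sample_reply = "\n".join(
    [
        "Here is a Python function that generates the Fibonacci series:",
        "",
        FENCE + "Python",
        "def fibonacci(n):",
        "    fib_series = [0, 1]",
        "    while fib_series[-1] + fib_series[-2] <= n:",
        "        fib_series.append(fib_series[-1] + fib_series[-2])",
        "    return fib_series",
        FENCE,
    ]
)


def extract_code_blocks(text: str) -> list[str]:
    """Return the body of every fenced block, skipping the language tag."""
    pattern = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)
    return [body.strip("\n") for body in pattern.findall(text)]


blocks = extract_code_blocks(sample_reply)
print(blocks[0])
```

Review generated code before you run it; model output is not guaranteed to be correct or safe.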

## Translate text

This example invokes the `translate` method to translate text between languages. Use the `model` parameter to specify the
model for the desired input and output language.

    from imagine import ImagineClient

    client = ImagineClient()

    english_to_spanish = "Helsinki-NLP/opus-mt-en-es"

    translate_response = client.translate(
        prompt="San Diego is one of the most beautiful cities in America!",
        model=english_to_spanish,
    )

    print(translate_response.first_text)

The output is similar to the following:

    San Diego es una de las ciudades más hermosas de América!
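
The model name in this example ends in `en-es`, which suggests that the language pair is encoded in the name. The following sketch builds a model name from two language codes and checks it against a list of available models before use. It rests on two assumptions: that other language pairs follow the same `opus-mt-<source>-<target>` pattern, and that the corresponding models are deployed on your endpoint. Both helper functions are hypothetical.

```python
def opus_mt_model(source_lang: str, target_lang: str) -> str:
    """Build a model name using the pattern of the default translation model.

    Assumption: other language pairs follow the same naming pattern.
    """
    return f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"


def pick_translation_model(
    source_lang: str, target_lang: str, available_models: list[str]
) -> str:
    """Return the model name for a language pair, or fail if it is not available."""
    name = opus_mt_model(source_lang, target_lang)
    if name not in available_models:
        raise ValueError(f"translation model not available: {name}")
    return name


# Illustrative list; query your endpoint for the models that are really available.
available = ["Helsinki-NLP/opus-mt-en-es"]
print(pick_translation_model("en", "es", available))
```

Checking availability first gives a clear error in your own code instead of a failed inference request.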

## Generate an image

This code shows how to use the Qualcomm AI Inference Suite SDK to generate two images (`n=2`) from a text prompt and save each image as a PNG file.

    import base64

    from imagine import ImagineClient

    client = ImagineClient()

    images_response = client.images_generate(
        prompt="A cat sleeping on planet Mars",
        n=2,
        negative_prompt="disfigured, ugly, bad, immature, cartoon, anime, 3d, painting, b&w",
        response_format="b64_json",
    )

    # Save each generated image to its own file
    for i, image in enumerate(images_response.data):
        with open(f"MyImage_{i}.png", "wb") as f:
            f.write(base64.decodebytes(image.b64_json.encode()))
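
With `response_format="b64_json"`, each image arrives as base64 text that must be decoded to bytes before it is written to disk. The following sketch decodes a payload and checks that the result starts with the standard eight-byte PNG signature. It assumes the decoded data is a PNG, as the file extension in the example suggests. `decode_png` is a hypothetical helper, and `fake_payload` is a stand-in, not a real image.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first eight bytes of every PNG file


def decode_png(b64_text: str) -> bytes:
    """Decode base64 text and verify that the result looks like a PNG."""
    raw = base64.b64decode(b64_text)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not look like a PNG image")
    return raw


# Stand-in payload: the PNG signature plus filler bytes, not a real image.
fake_payload = base64.b64encode(PNG_SIGNATURE + b"placeholder").decode("ascii")

image_bytes = decode_png(fake_payload)
print(f"decoded {len(image_bytes)} bytes")
```

A signature check like this catches truncated or mislabeled payloads before they are saved as unreadable files.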

## Transcribe audio

This code shows how to use the Qualcomm AI Inference Suite SDK to transcribe an MP3 audio file to text.

    from imagine import ImagineClient

    client = ImagineClient()

    response = client.transcribe("my_audio.mp3")

    print(response.text)

## Generate embeddings

Use the Qualcomm AI Inference Suite SDK to create embeddings from text input. Embeddings are
numerical representations of text that capture semantic meaning. They are useful for
natural language processing (NLP) tasks such as the following:

- Similarity search: Find texts that are semantically similar.
- Clustering: Group similar texts together.
- Classification: Improve the performance of text classification models.
- Recommendation systems: Enhance content recommendations based on text similarity.

This code shows how to use the Qualcomm AI Inference Suite SDK to generate embeddings for the text strings passed to `client.embeddings`. The embedding vectors are stored in `embedding_response`, and the code prints `2`, the number of embeddings returned (one per input string).

    from imagine import ImagineClient

    client = ImagineClient()

    embedding_response = client.embeddings(["What a beautiful day", "this is amazing"])

    print(len(embedding_response.data))
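
Similarity search, the first task listed above, usually compares embedding vectors with cosine similarity. The following is a minimal sketch of that computation in plain Python. The three-dimensional vectors are made up for illustration; real embeddings are much longer, and how you read the vectors out of `embedding_response.data` is not shown here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for real embeddings.
beautiful_day = [0.9, 0.1, 0.2]
amazing = [0.8, 0.2, 0.3]
unrelated = [-0.1, 0.9, -0.4]

print(cosine_similarity(beautiful_day, amazing))    # close to 1: similar direction
print(cosine_similarity(beautiful_day, unrelated))  # negative: different direction
```

Values near 1 indicate vectors that point in nearly the same direction, which is the usual proxy for semantically similar texts.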

## Implement reranking in RAG workflows

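This code shows how to use the reranker in the Qualcomm AI Inference Suite SDK as part of a retrieval-augmented generation (RAG) workflow. The code lists the available reranker models, ranks four candidate documents against a query, and requests the three most relevant ones (`top_n=3`).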

    from pprint import pprint

    from imagine import ImagineClient, ModelType

    client = ImagineClient()

    reranker_models = client.get_available_models_by_type(model_type=ModelType.RERANKER)
    print(reranker_models)

    reranker_response = client.reranker(
        query="what is a panda?",
        documents=[
            "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear",
            "Paris is in France",
            "Kung fu panda is a movie",
            "Pandas are animals that live in cold climate",
        ],
        return_documents=True,
        top_n=3,
    )

    pprint(reranker_response.data)
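
In a RAG workflow, the reranked documents typically become the context of the prompt that is sent to the LLM. The following sketch shows that step in plain Python. The relevance scores are made up, and `top_documents` and `build_prompt` are hypothetical helpers; the layout of the real `reranker_response.data` is not assumed here.

```python
def top_documents(documents: list[str], scores: list[float], top_n: int) -> list[str]:
    """Return the `top_n` documents ordered by descending relevance score."""
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]


def build_prompt(query: str, context_docs: list[str]) -> str:
    """Combine the selected documents and the question into one prompt."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


documents = [
    "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear",
    "Paris is in France",
    "Kung fu panda is a movie",
    "Pandas are animals that live in cold climate",
]
scores = [0.95, 0.01, 0.40, 0.70]  # made-up relevance scores, one per document

best = top_documents(documents, scores, top_n=3)
prompt = build_prompt("what is a panda?", best)
print(prompt)
```

Dropping the low-scoring documents keeps the prompt short and reduces the chance that irrelevant text distracts the model.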

## Next steps

- Review the [synchronous and asynchronous client classes](https://docs.qualcomm.com/doc/80-88545-1/topic/index_api.html) in the SDK.
- [Connect models to external systems with tool calling](https://docs.qualcomm.com/doc/80-88545-1/topic/2_0_tool_calling.html).

Last Published: Apr 17, 2026
