Agentic AI with Gemma 3

Although Gemma 3 is available as an instruction-tuned model, it does not offer dedicated support for tools. If you have a look at the documentation, it is just regular prompting without any special tokens for functions. So in order to use the model as an agent, you need to take care of prompting it the right way yourself.

Step 1: Loading the model

As the sample code should run on a Mac, the first step is to load mlx-lm and initialize a Gemma 3 model. If you are using CUDA, replace this with a call to Hugging Face Transformers. To load and test your code, use a small model version with 1B parameters. Before deploying this in a real-world use case, you should of course switch to a larger version.

from mlx_lm import load, generate
from mlx_lm.generate import stream_generate
import json

# Load the locally converted 1B instruction-tuned Gemma 3 model
model, tokenizer = load("./models/gemma3-1b-it/transformers")
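
If you are running on CUDA instead, the equivalent loading step with Hugging Face Transformers could look roughly like this; the model id "google/gemma-3-1b-it" is an assumption and can be replaced with your own local checkpoint:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of the CUDA alternative mentioned above; adjust the model id or path as needed.
hf_tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-1b-it")
hf_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-1b-it").to("cuda")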

Step 2: Creating the prompt

The second step is to craft the model prompt based on the user query. First, the system turn gives the model its general instructions. After that, the individual functions are specified as JSON Schema, and the original user prompt is added at the end.

If the model wants to call a tool, it will reply with the tool's name and its parameters as JSON.

prompt = tokenizer.apply_chat_template([{
    "role": "system", "content": [{
        "text": """
        You are Gemma, a helpful chat bot. 

        You have access to functions. 
        If you decide to invoke any of the function(s),
        you MUST put it in the format of
        ```json{"name": function name, 
        "parameters": dictionary of argument name and its value}```. 

        The tool output is available as <unused0>JSON<unused0>.

        You MUST NOT include any other text in the response 
        if you call a function.
        [
          {
            "name": "get_stock_price",
            "description": "Finds the current stock price by a symbol",
            "parameters": {
              "type": "object",
              "properties": {
                "SYMBOL": {
                  "type": "string"
                }
              },
              "required": [
                "SYMBOL"
              ]
            }
          }
        ]"""
    }]},
    {
        "role": "user", 
         "content": """Can you find the current stock price with the symbol APPL ?"""
    }
], add_generation_prompt=True)
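
With the prompt assembled, the model generates the initial response. A minimal sketch of this generation step, using the same streaming loop as in Step 4 and collecting the text into initial_response:

initial_response = ""
for response in stream_generate(model, tokenizer, prompt):
    initial_response += response.text

    if response.token == 106:  # <end_of_turn>
        break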

Step 3: Executing the function

When the initial response has been generated, we check whether the model requested a tool or answered directly by parsing the response. If a function call has been made, we add the tool output directly after the model's call request and wrap it in a special token to simplify parsing for the model.

```json {"name": "get_stock_price", "parameters": {"SYMBOL": "APPL"}}```
<unused0>{"price": 180}<unused0>
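
The responseParser helper used in the excerpt below is not part of mlx-lm and is not shown in the snippets above. Here is a minimal sketch of what it could look like, assuming the model sticks to the JSON code-block format requested in the system prompt (ResponseParser, TOOLS and the dummy get_stock_price are illustrative names, not part of any library):

import re

def get_stock_price(SYMBOL):
    # Dummy tool implementation; replace with a real quote lookup.
    return {"price": 180}

TOOLS = {"get_stock_price": get_stock_price}

class ResponseParser:
    # Hypothetical helper: detects a ```json ... ``` function call in the model output,
    # executes the matching tool and appends its result wrapped in <unused0> tokens.
    def __init__(self, response_text):
        match = re.search(r"```json\s*(\{.*\})\s*```", response_text, re.DOTALL)
        self.has_tool = match is not None
        self.text = response_text

        if self.has_tool:
            call = json.loads(match.group(1))
            result = TOOLS[call["name"]](**call["parameters"])
            self.text += "<unused0>" + json.dumps(result) + "<unused0>"

responseParser = ResponseParser(initial_response)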

This response is then added to the original prompt, together with a request to summarize the results, as you can see in this excerpt.

if responseParser.has_tool: 
    next_prompt = prompt + tokenizer.encode(responseParser.text + \
        "<end_of_turn><start_of_turn>user Summarize the results." + \
        "<end_of_turn><start_of_turn>model", 
        add_special_tokens = False)
else:
    next_prompt = None

There are tutorials that suggest feeding the function output back as an additional user turn, but at least for Gemma 3, putting the results directly after the function call produced much better results in my tests.

Step 4: Generating the final answer

Once the tool results have been generated, the updated prompt is fed into the model again to produce the final summarized answer.

if next_prompt: 
    final_response = ""
    for response in stream_generate(model, tokenizer, next_prompt):
        final_response += response.text

        if response.token == 106:  # 106 is Gemma 3's <end_of_turn> token
            break

The model then generates the final answer:

The current stock price for APPL is $180.<end_of_turn>

This deviates from the OpenAI specification as well as the Model Context Protocol, but each model behaves somewhat differently, and this approach gave the best results with Gemma 3. In a real-world use case, you would of course split the parsing and the function execution into two different steps.

You can find the complete notebook in the repository with example notebooks: https://github.com/matt-do-it/GenEmbeddings.