Hello!
In this article I will describe my experience implementing RAG and function calling for an LLM running locally on a phone.
For cloud models the general idea is the same, so after reading you will understand how it works and what it is used for.
What is interesting here is that success depends heavily on how well the prompt describes what we want from the model. Good code alone is not enough; there is also a "humanities" part: you have to explain everything to the model and hope it has understood you correctly.
Content
➤ What is RAG (Retrieval-Augmented Generation)
➤ What is a vector database
➤ What is Function Calling
➤ What is CAG (Cache-Augmented Generation)
➤ Example 1 – Low-Level Function Calling Implementation in Kotlin
➤ Example 2 – Python framework DSPy
➤ Example 3 – Python framework LangChain
➤ Example 4 – Java framework LangChain4j
➤ Example 5 – Google's Java libraries for mobile devices
➤ Example 6 – CAG (Cache-Augmented Generation) in LLama.cpp
What is RAG (Retrieval-Augmented Generation)
An LLM absorbs terabytes of information during training, but it may know nothing about you, your project, your documents, or your cat.
You can describe all of this in the request to the model, in the prompt, but if there is a lot of information, the model's context window may run out before you even get to the question, or you will simply get tired of typing.
To make the model aware of information it was not trained on, RAG (Retrieval-Augmented Generation) is used.
RAG can have many different implementations, but the general idea is:
Additional information is added to your message, usually in text form, for example:
- Internet query result
- Data from the database
- Text from the attached document, and so on
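As a minimal sketch of this general idea (the function and parameter names here are purely illustrative, not part of any library), augmenting a request could look like this:

fun buildAugmentedPrompt(
    userQuestion: String,
    retrieveRelevantText: (String) -> List<String> // any retrieval step: web search, DB query, document text
): String {
    val context = retrieveRelevantText(userQuestion).joinToString(separator = "\n")
    // the retrieved context is simply placed into the prompt before the actual question
    return """
        Use the following context to answer the question.
        Context:
        $context
        Question:
        $userQuestion
    """.trimIndent()
}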
In this example I will use a vector database, which works great with LLMs.
What is a vector database
Before querying the model, you need to load your data into the database so that the most relevant pieces can later be found for your query; both steps are sketched in code after the lists below.
When adding text to the database:
- The text is broken into pieces
- Using a special embedding model (Gecko), a 768-dimensional vector is generated for each piece; this vector represents the meaning of that piece of text as closely as possible
- The vector along with the text is written to the database
When searching text in the database:
- The search query is converted into a vector using the Gecko model
- The database is searched for a vector that is as close as possible to the search query vector.
- The text stored with that vector is retrieved
- That text is added to your request to the LLM
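A minimal in-memory sketch of what such a vector store does under the hood (the class and its embed parameter are illustrative assumptions, not part of any SDK; in the examples below the embeddings come from the Gecko model and are stored in SQLite):

class InMemoryVectorStore(private val embed: (String) -> FloatArray) {

    // each entry keeps the embedding vector together with the original text chunk
    private data class Entry(val vector: FloatArray, val text: String)
    private val entries = mutableListOf<Entry>()

    // adding text: embed the chunk and store (vector, text)
    fun add(chunk: String) {
        entries.add(Entry(embed(chunk), chunk))
    }

    // searching: embed the query and return the texts whose vectors are closest to it
    fun search(query: String, topK: Int = 5): List<String> {
        val queryVector = embed(query)
        return entries
            .sortedByDescending { cosineSimilarity(queryVector, it.vector) }
            .take(topK)
            .map { it.text }
    }

    private fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
        var dot = 0f
        var normA = 0f
        var normB = 0f
        for (i in a.indices) {
            dot += a[i] * b[i]
            normA += a[i] * a[i]
            normB += b[i] * b[i]
        }
        return dot / (kotlin.math.sqrt(normA) * kotlin.math.sqrt(normB))
    }
}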
What is Function Calling
The difference from RAG is that here the model itself decides whether to query the vector database or another source to obtain information.
To do this:
- the prompt tells the model that it has access to function calls, describes their format, and describes the conditions under which the model should make such a call
- if the model decides to make such a call, it returns a service message requesting the call
- the parser detects that a call is requested, reads the data from the vector database (or other source), appends it to the message queue, and sends the request again (a sketch of this loop follows below)
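Schematically, the host-side loop looks roughly like this (a sketch with placeholder functions passed in as parameters; the concrete low-level implementation is shown in Example 1):

fun answerWithTools(
    userMessage: String,
    askModel: (List<String>) -> String,     // sends the whole message history to the LLM
    parseToolCall: (String) -> String?,     // returns the requested call, or null if it is a normal answer
    runTool: (String) -> String             // executes the requested call and returns its result as text
): String {
    val messages = mutableListOf(userMessage)
    while (true) {
        val reply = askModel(messages)
        // no service keyword found -> this is the final answer for the user
        val toolCall = parseToolCall(reply) ?: return reply
        // keep the model's tool request and the tool result in the history, then ask again
        messages.add(reply)
        messages.add(runTool(toolCall))
    }
}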
What is CAG (Cache-Augmented Generation)
A technique for speeding up LLM inference and reducing its cost, based on reusing the results of past queries.
During inference, the LLM stores intermediate representations of the text (hidden states, attention caches).
For repeated or similar queries, the entire sequence does not need to be run through the model. Instead, the saved cache is used and the model only refines the "new" part.
This reduces the amount of computation and lowers latency.
Application options:
Token-level caching:
KV attention caches are saved for already processed tokens (already used in llama.cpp, Transformers and others).
Prompt-level caching:
Results for frequently occurring prompts are saved and reused.
Semantic caching:
Responses are stored by semantic similarity (e.g. via a vector store). If a new query is similar to an old one, the stored result is returned or used as context; a minimal sketch of this variant is given after this section.
Difference from regular RAG:
RAG (Retrieval-Augmented Generation) pulls external data from the knowledge base.
CAG saves computation by reusing past model runs.
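A minimal sketch of the semantic-caching variant (the class, the embed parameter, and the threshold value are illustrative assumptions, not part of any library):

class SemanticCache(
    private val embed: (String) -> FloatArray,      // embedding model, e.g. the same one used for RAG
    private val similarityThreshold: Float = 0.9f   // how close a new query must be to reuse an old answer
) {
    private data class CachedAnswer(val queryVector: FloatArray, val answer: String)
    private val cache = mutableListOf<CachedAnswer>()

    // returns a stored answer if a sufficiently similar query was already answered, otherwise null
    fun lookup(query: String): String? {
        val vector = embed(query)
        return cache.firstOrNull { cosine(vector, it.queryVector) >= similarityThreshold }?.answer
    }

    fun store(query: String, answer: String) {
        cache.add(CachedAnswer(embed(query), answer))
    }

    private fun cosine(a: FloatArray, b: FloatArray): Float {
        var dot = 0f
        var normA = 0f
        var normB = 0f
        for (i in a.indices) {
            dot += a[i] * b[i]
            normA += a[i] * a[i]
            normB += b[i] * b[i]
        }
        return dot / (kotlin.math.sqrt(normA) * kotlin.math.sqrt(normB))
    }
}

On a cache hit the model is not called at all, or the cached text is attached to the new request as additional context.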
Example 1 – Low-Level Function Calling Implementation in Kotlin
Here I will give an example of how Function Calling works from an implementation point of view.
First, we write a prompt template that explains to the LLM, in as much detail as possible, that it has access to function calls and how to work with them.
It is very important to define a keyword by which we can later detect that the model wants to make a function call. In this example it is tool_code; it must be unique and must not otherwise appear in the text.
Next we describe the structures we want to receive. In this example it is JSON, but the format can be anything; I use JSON because it is easy to parse into a class, although in terms of token usage it is not optimal.
If the structure is complex and there is a chance the model will make a mistake, you can return a parsing error, add it to the message queue along with a note that the model made a mistake and must correct its request, and send everything back to the model for another attempt.
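A sketch of such a retry loop, reusing the ChatMessage and ChatResult types defined later in this example (the askModel and parse parameters are placeholders for the real calls):

fun callToolWithRetries(
    history: MutableList<ChatMessage>,
    askModel: (List<ChatMessage>) -> String,   // sends the history to the LLM and returns the raw reply
    parse: (String) -> ChatResult?,            // e.g. parseToolJson from this example
    maxAttempts: Int = 3
): ChatResult? {
    repeat(maxAttempts) {
        val reply = askModel(history)
        parse(reply)?.let { return it } // parsed successfully, stop retrying
        // keep the broken reply in the history and ask the model to correct itself
        history.add(ChatMessage(role = "assistant", content = reply))
        history.add(
            ChatMessage(
                role = "user",
                content = "Your previous tool call was not valid JSON in the required format. " +
                        "Correct it and send the tool_code block again."
            )
        )
    }
    return null // give up after maxAttempts
}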
In this example I used two function calls:
get_weather
{ "name": "get_weather", "arguments": { "location": "New York" } }
This is a very simple call with only one parameter, which is hard to get wrong.
search_docs
{ "name": "search_docs", "arguments": { "query": "quantum computing basics", "top_k": 5, "min_similarity_score": 0.6 } }
This is a search query to the database, which will look for matching documents.
This call can be placed in a loop with starting parameters and described in the prompt – if the model does not get suitable results, it can lower min_similarity_score or change the wording of the query and make the call again.
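The same fallback can also be driven from the host side; here is a minimal sketch, with the searchDocs parameter standing in for the real vector search:

fun searchWithFallback(
    searchDocs: (query: String, topK: Int, minScore: Float) -> List<String>, // placeholder for the real vector search
    query: String,
    topK: Int = 5,
    startScore: Float = 0.6f,
    step: Float = 0.1f
): List<String> {
    var minScore = startScore
    while (minScore >= 0f) {
        val results = searchDocs(query, topK, minScore)
        if (results.isNotEmpty()) return results
        minScore -= step // relax the similarity filter and try again
    }
    return emptyList()
}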
It is very important to give the model detailed instructions for each step in the prompt. The model has no motivation or intuition to work out what it should do next, and if the next steps are not described, it will return an empty message.
It is also important to understand that a prompt is not program code: whether it is executed correctly is probabilistic, and the probability depends on how well we have described all the actions and structures.
At the end of the prompt, it is important to state that the model will receive the result in JSON format whose structure is not described in the prompt, and that the model itself must work out how to formulate a response to the user in accordance with the original request.
The scenario in the prompt can be customized almost endlessly. For example, if we need data from a database, we can describe the database schema and ask the model to generate an SQL query matching the user's request, which we then execute to get the data; in this case it is important not to forget to restrict the connection rights to read-only.
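A rough sketch of that idea with hypothetical helpers: askModel stands for the LLM call that turns the schema and the user request into SQL, the JDBC connection is opened as read-only, and the generated statement is additionally checked to be a SELECT:

import java.sql.DriverManager

fun answerFromDatabase(
    askModel: (prompt: String) -> String,   // placeholder for the LLM call
    jdbcUrl: String,
    schemaDescription: String,
    userRequest: String
): List<String> {
    // describe the schema in the prompt and ask the model for a single SELECT query
    val sql = askModel(
        "Database schema:\n$schemaDescription\n" +
                "Write a single SQL SELECT query answering the request:\n$userRequest"
    ).trim().removeSuffix(";")
    require(sql.startsWith("select", ignoreCase = true)) { "Only SELECT queries are allowed" }

    val rows = mutableListOf<String>()
    DriverManager.getConnection(jdbcUrl).use { connection ->
        connection.isReadOnly = true // a hint to the driver; rights should also be limited on the database side
        connection.createStatement().use { statement ->
            statement.executeQuery(sql).use { resultSet ->
                val columns = resultSet.metaData.columnCount
                while (resultSet.next()) {
                    rows.add((1..columns).joinToString(", ") { resultSet.getString(it) ?: "" })
                }
            }
        }
    }
    return rows
}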
Some models are trained to perform function calls with a specific format and keywords, and may also have a separate role for the call result, such as function. I used a model that was not trained for this, and it did a great job; I think any modern, more or less "smart" model can handle it.
Full text of the prompt:
const val DEFAULT_FUNCTION_CALLING_PROMPT = """ Call tools only when the user explicitly asks to check, verify, look up, find, or search for something. In all other cases, answer directly without calling any tool. If you decide to invoke a function, output only a JSON object inside a fenced block marked as tool_code without any text before or after it. Available tools: search_docs Description: Searches knowledge and returns the most relevant results as plain text. Arguments: query (string) — User string query to search in the vector store. top_k (integer, default 5) — Number of results to return. min_similarity_score (float, default 0.6) — Minimum similarity score from 0.0 (no filtering) to 1.0 (exact match). get_weather Description: Gets current weather for a location. Arguments: location (string) — City or place name. JSON format inside the fenced block: tool_code { "name": "<tool_name>", "arguments": { "<arg_name>": <arg_value> } } Examples: tool_code { "name": "search_docs", "arguments": { "query": "quantum computing basics", "top_k": 5, "min_similarity_score": 0.6 } } tool_code { "name": "get_weather", "arguments": { "location": "New York" } } Important rules: Use tools only when explicitly requested by the user to check, verify, look up, find, or search. Otherwise, provide a direct answer. When calling a tool, the output must be valid JSON with exactly two keys: name and arguments. Do not include any explanations or extra content outside the tool_code block. After receiving tool results (the result will be in JSON format): 1. Parse and process the result. 2. Convert it into a simple, clear, and human-readable natural-language response. 3. Always reply to the user with this processed answer as the next message. 4. If no relevant results are found, briefly explain that nothing relevant was found and suggest next steps. """
In this example I used the local models Gemma and DeepSeek, launched in LM Studio, which exposes an OpenAI-compatible API: my application connects to it over localhost and makes requests. Cloud models will do even better, since they have many more parameters and a dedicated API for function calls, but in this example I specifically wanted to show the lowest-level implementation.
fun main() = runBlocking { val messages = mutableListOf( // Add a prompt to the message queue with the system role ChatMessage( role = "system", content = DEFAULT_FUNCTION_CALLING_PROMPT ), // We add a user request to the queue, this request makes it clear, // that the user asks to check the weather, the model knows that it has access // the weather check function described in the prompt and calls it, // by sending a response message starting with tool_code ChatMessage( role = "user", content = "Check weather in London", ) ) var iterator = 0 while (iterator < 3) { try { chatAnswerWithToolsStream( ChatCompletionRequest( model = MODEL_ID, messages = messages, temperature = 0.2 ) ).collect { piece -> when (piece) { is ChatResult.Debug -> { println("DEBUG MESSAGE:\n${piece.message}") } is ChatResult.Message -> { println("FINAL MESSAGE:\n${piece.message}") iterator = 3 return@collect } is ChatResult.SearchDocs -> { println("SEARCH DOCS data:\n$piece") iterator = 3 // In this example I will only use the weather check, // but added one more function to show that there can be many of them return@collect } is ChatResult.GetWeather -> { println("GET WEATHER result:\n$piece") // If the model decides to call a function, we add its call // to the message queue with the assistant role // It is important to add all messages to the queue so that // the model meant the whole history of the dialogue messages.add( ChatMessage( role = "assistant", content = piece.message ?: "" ) ) // request API to get weather data val result = getCurrentWeatherByCity( city = piece.location!!, ) val jsonResult: String = kotlinxJsonConfig.encodeToString(result) println("GET WEATHER request result:\n$jsonResult") // we add to the message queue with the user role, since this model // does not support special roles for function calls messages.add( ChatMessage( role = "user", content = jsonResult ) ) iterator++ } } } } catch (e: Throwable) { println(e.message ?: "error") break } } } // code for working with OpenAI API, the function call is important here // parseToolJson, which will return null if it fails to convert the response into a function call fun chatAnswerWithToolsStream(req: ChatCompletionRequest): Flow<ChatResult> = flow { val response = client.post("$BASE_URL/api/v0/chat/completions") { contentType(ContentType.Application.Json) setBody(req.copy(stream = true)) } val channel: ByteReadChannel = response.bodyAsChannel() val fullBuilder = StringBuilder() while (!channel.isClosedForRead) { val line: String = channel.readUTF8Line() ?: break val payload: String = extractDataLine(line) ?: continue val chunk: StreamChatChunk = runCatching { kotlinxJsonConfig.decodeFromString(StreamChatChunk.serializer(), payload) }.getOrNull() ?: continue val piece: String = extractChatDeltaContent(chunk) if (piece.isNotEmpty()) { fullBuilder.append(piece) emit(ChatResult.Debug(message = fullBuilder.toString())) } if (isFinished(chunk)) break } val fullMessage: String = fullBuilder.toString().trim() val toolResult: ChatResult? = parseToolJson(fullMessage) if (toolResult != null) { emit(toolResult) } else { emit(ChatResult.Message(message = fullMessage)) } } fun parseToolJson(input: String): ChatResult? 
{ try { val json = when { // if the response starts with tool_code, we understand that this is not a response to the user // and the function call will be followed by json with its type and parameters input.startsWith("tool_code") -> input.replace("tool_code", "") .trim() input.startsWith("```tool_code") -> input.replace("```tool_code", "") .removeSuffix("```") .trim() else -> return null } val rawTool = runCatching { gson.fromJson(json, RawTool::class.java) }.getOrNull() ?: return null val jsonObject = rawTool.arguments?.asJsonObject ?: return null // we determine which function the model wants to call // and transform json into a data model. // This code can be improved by returning an error to the model if the transformation // it won't work, with the description that the model should edit her answer and correct it return when (rawTool.name?.lowercase()) { "search_docs" -> { gson.fromJson(jsonObject, ChatResult.SearchDocs::class.java).copy( message = json ) } "get_weather" -> gson.fromJson(jsonObject, ChatResult.GetWeather::class.java).copy( message = json ) else -> null } } catch (e: Exception) { println(e.message ?: "error") return null } }
Additional code, if you want to reproduce the example:
ext { ktor_version = "3.2.3" kotlinx_serialization_json_version = "1.8.1" logback_version = "1.4.14" } dependencies { implementation "io.ktor:ktor-client-core:$ktor_version" implementation "io.ktor:ktor-client-cio:$ktor_version" implementation "io.ktor:ktor-client-content-negotiation:$ktor_version" implementation "io.ktor:ktor-serialization-kotlinx-json:$ktor_version" implementation "io.ktor:ktor-client-logging:$ktor_version" implementation "org.jetbrains.kotlinx:kotlinx-serialization-json:$kotlinx_serialization_json_version" implementation "ch.qos.logback:logback-classic:$logback_version" implementation "com.google.code.gson:gson:2.13.1" }
import io.ktor.client.call.* import io.ktor.client.request.* suspend fun getCurrentWeatherByCity( apiKey: String = WEATHER_API_KEY, city: String, units: String = "metric", lang: String = "en" ): WeatherResponse { return client.get(OWM_BASE) { url { parameters.append("q", city) parameters.append("appid", apiKey) parameters.append("units", units) parameters.append("lang", lang) } }.body() }
import com.google.gson.JsonElement import kotlinx.serialization.SerialName import kotlinx.serialization.Serializable import kotlinx.serialization.json.JsonObject @Serializable data class ChatMessage( @SerialName("role") val role: String, @SerialName("content") val content: String ) @Serializable data class ChatCompletionRequest( @SerialName("model") val model: String, @SerialName("messages") val messages: List<ChatMessage>, @SerialName("temperature") val temperature: Double? = null, @SerialName("max_tokens") val maxTokens: Int? = null, @SerialName("stream") val stream: Boolean? = null ) @Serializable data class StreamChatChunk( @SerialName("choices") val choices: List<StreamChatChoice>? = null ) @Serializable data class ChatChoiceMessage( @SerialName("content") val content: String? = null ) @Serializable data class ChatChoice( @SerialName("message") val message: ChatChoiceMessage? = null ) @Serializable data class ChatResponse( @SerialName("choices") val choices: List<ChatChoice> = emptyList() ) @Serializable data class StreamChatChoice( @SerialName("delta") val delta: JsonObject? = null, @SerialName("finish_reason") val finishReason: String? = null ) data class RawTool( val name: String?, val arguments: JsonElement? ) sealed interface ChatResult { data class Debug( val message: String?, ) : ChatResult data class Message( val message: String?, ) : ChatResult data class SearchDocs( val message: String? = null, val query: String?, val topK: Int?, val minSimilarityScore: Float? ) : ChatResult data class GetWeather( val message: String? = null, val location: String?, ) : ChatResult }
import kotlinx.serialization.json.JsonElement import kotlinx.serialization.json.contentOrNull import kotlinx.serialization.json.jsonPrimitive fun extractDataLine(line: String): String? { val t: String = line.trim() if (t.isEmpty()) return null if (t == "data: [DONE]" || t == "[DONE]") return null if (t.startsWith("event:")) return null return if (t.startsWith("data:")) t.removePrefix("data:").trim() else t } fun extractChatDeltaContent(chunk: StreamChatChunk): String { val choice: StreamChatChoice = chunk.choices?.firstOrNull() ?: return "" val delta = choice.delta ?: return "" val el: JsonElement? = delta["content"] return el?.jsonPrimitive?.contentOrNull ?: "" } fun extractTextDelta(chunk: StreamTextChunk): String { val choice: StreamTextChoice = chunk.choices?.firstOrNull() ?: return "" val direct: String? = choice.text if (direct != null) return direct val delta = choice.delta val el: JsonElement? = delta?.get("content") ?: delta?.get("text") return el?.jsonPrimitive?.contentOrNull ?: "" } fun isFinished(chunk: StreamChatChunk): Boolean { val reason: String? = chunk.choices?.firstOrNull()?.finishReason return !reason.isNullOrEmpty() } fun isFinished(chunk: StreamTextChunk): Boolean { val reason: String? = chunk.choices?.firstOrNull()?.finishReason return !reason.isNullOrEmpty() }
import com.google.gson.FieldNamingPolicy import com.google.gson.Gson import com.google.gson.GsonBuilder import io.ktor.client.* import io.ktor.client.engine.cio.* import io.ktor.client.plugins.* import io.ktor.client.plugins.contentnegotiation.* import io.ktor.serialization.kotlinx.json.* import kotlinx.serialization.json.Json const val BASE_URL: String = "http://localhost:1234" const val OWM_BASE = "https://api.openweathermap.org/data/2.5" const val WEATHER_API_KEY = "..." const val MODEL_ID = "google/gemma-3n-e4b" val gson: Gson = GsonBuilder() .setFieldNamingPolicy(FieldNamingPolicy.LOWER_CASE_WITH_UNDERSCORES) .create() val kotlinxJsonConfig: Json = Json { ignoreUnknownKeys = true prettyPrint = false } val client: HttpClient = HttpClient(CIO) { install(ContentNegotiation) { json(kotlinxJsonConfig) } install(HttpTimeout) { requestTimeoutMillis = 60_000 connectTimeoutMillis = 10_000 socketTimeoutMillis = 60_000 } }
Example 2 – Python framework DSPy
DSPy is an open-source Python framework for "programming, not prompting" LLMs: you describe the behavior of modules in code, and optimizers automatically select prompts and, if desired, fine-tune weights for the chosen metric. Suitable for classifiers, RAG pipelines, and agents.
dspy.ai
https://github.com/stanfordnlp/dspy
Signatures:
declarative specifications of module inputs/outputs.
Modules:
ready-made LLM calling strategies (Predict, ChainOfThought, ReAct, etc.) from which pipelines are assembled.
- Predict: the basic building block. Makes a request to the LLM using a given signature (input -> output). Used for simple tasks such as QA, summarization, and classification.
- Signature: defines the structure of input and output data. Can be written as "q -> a" or as a Python class with annotations. Makes the code readable and strict.
- Chain: lets you link several Predicts into a pipeline. Convenient when there are intermediate steps (extract facts → summarize → answer).
- ReAct: agent mode; the model reasons and calls tools. Lets you connect APIs, databases, and external functions. A typical option for chatbots with access to tools.
- ChainOfThought (CoT): forces the model to spell out its reasoning steps. Convenient for complex problems, mathematics, and logic.
- Retrieve/RAG: integration with vector databases. Lets you add document search to your workflow.
Optimizers (formerly teleprompters):
algorithms that "compile" a program into efficient hints/weights by metric; work even with 5-10 examples.
An example similar in function to the previous one:
requirements.txt
dspy==3.0.1 litellm==1.75.8 openai==1.99.9 optuna>=4.5.0 gepa[dspy]==0.0.4 regex>=2025.7.34 diskcache>=5.6.3 json-repair>=0.49.0 magicattr>=0.1.6 backoff>=2.2.1 asyncer==0.0.8 cachetools>=6.1.0 aiohttp>=3.12.15
import dspy import requests from typing import Dict dspy.enable_logging() OPENWEATHER_API_KEY = "..." # Set provider for LLM lm = dspy.LM( "openai/google/gemma-3n-e4b", api_base="http://localhost:1234/v1", api_key="lm-studio", temperature=0 ) dspy.configure(lm=lm) # Check if the provider was found print("Provider:\n", type(lm.provider)) # Send the model test data using the Predict module probe = dspy.Predict("question -> answer") try: result = probe(question="Who are you", max_tokens=1) print("Model supported by provider, test answer:\n", result.answer) except Exception as e: print("Model not supported by provider:\n", e) def get_weather(city: str, units: str = "metric") -> str: url = "https://api.openweathermap.org/data/2.5/weather" params = { "q": city, "appid": OPENWEATHER_API_KEY, "units": units, "lang": "en" } response = requests.get(url, params=params) if response.status_code == 200: data: Dict = response.json() temp = data["main"]["temp"] description = data["weather"][0]["description"] return f"Weather in {city}: {temp}°C, {description}" else: return f"Error fetching weather: {response.text}" # Adding a question -> answer communication format and # get_weather tool and ReAct module # The framework will generate the prompt itself accordingly # with what module and with what parameters we use agent = dspy.ReAct( "question -> answer", tools=[get_weather] ) result = agent(question="Check weather in London") print("Answer:\n", result.answer) # You can view the history of communication with LLM, # what tools were called and what prompt was generated history = dspy.inspect_history(n=10)
Console output:
Provider: <class 'dspy.clients.openai.OpenAIProvider'> Model supported by provider, test answer: I am Gemma, an open-weights AI assistant. I am a large language model trained by Google DeepMind. Answer: Weather in London: 15.89°C, overcast clouds
Log:
- The model answers a simple question without tools ("Who are you"). Here it immediately produces [[ ## answer ## ]].
- The system sets the task: "You have the get_weather/finish tools." The model thinks and issues the first step: next_tool_name = get_weather.
- The model receives a trajectory with the API result (Weather in London: ...) and must decide what to do next. It writes next_tool_name = finish.
- The system substitutes the trajectory with both steps (get_weather + finish). The model must produce the final [[ ## answer ## ]].
Test message:
System message: Your input fields are: 1. `question` (str): Your output fields are: 1. `answer` (str): All interactions will be structured in the following way, with the appropriate values filled in. [[ ## question ## ]] {question} [[ ## answer ## ]] {answer} [[ ## completed ## ]] In adhering to this structure, your objective is: Given the fields `question`, produce the fields `answer`. User message: [[ ## question ## ]] Who are you Respond with the corresponding output fields, starting with the field `[[ ## answer ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`. Response: [[ ## answer ## ]] I am Gemma, an open-weights AI assistant. I am a large language model trained by Google DeepMind. [[ ## completed ## ]]
Main message:
Here the model is told which tools it can use – get_weather and finish.
The model is expected to produce next_thought, next_tool_name, and next_tool_args.
The model responds to this call with [[ ## next_tool_name ## ]] get_weather and parameters [[ ## next_tool_args ## ]] {"city": "London"}.
System message: Your input fields are: 1. `question` (str): 2. `trajectory` (str): Your output fields are: 1. `next_thought` (str): 2. `next_tool_name` (Literal['get_weather', 'finish']): 3. `next_tool_args` (dict[str, Any]): All interactions will be structured in the following way, with the appropriate values filled in. [[ ## question ## ]] {question} [[ ## trajectory ## ]] {trajectory} [[ ## next_thought ## ]] {next_thought} [[ ## next_tool_name ## ]] {next_tool_name} # note: the value you produce must exactly match (no extra characters) one of: get_weather; finish [[ ## next_tool_args ## ]] {next_tool_args} # note: the value you produce must adhere to the JSON schema: {"type": "object", "additionalProperties": true} [[ ## completed ## ]] In adhering to this structure, your objective is: Given the fields `question`, produce the fields `answer`. You are an Agent. In each episode, you will be given the fields `question` as input. And you can see your past trajectory so far. Your goal is to use one or more of the supplied tools to collect any necessary information for producing `answer`. To do this, you will interleave next_thought, next_tool_name, and next_tool_args in each turn, and also when finishing the task. After each tool call, you receive a resulting observation, which gets appended to your trajectory. When writing next_thought, you may reason about the current situation and plan for future steps. When selecting the next_tool_name and its next_tool_args, the tool must be one of: (1) get_weather. It takes arguments {'city': {'type': 'string'}, 'units': {'type': 'string', 'default': 'metric'}}. (2) finish, whose description is <desc>Marks the task as complete. That is, signals that all information for producing the outputs, i.e. `answer`, are now available to be extracted.</desc>. It takes arguments {}. When providing `next_tool_args`, the value inside the field must be in JSON format User message: [[ ## question ## ]] Check weather in London [[ ## trajectory ## ]] Respond with the corresponding output fields, starting with the field `[[ ## next_thought ## ]]`, then `[[ ## next_tool_name ## ]]` (must be formatted as a valid Python Literal['get_weather', 'finish']), then `[[ ## next_tool_args ## ]]` (must be formatted as a valid Python dict[str, Any]), and then ending with the marker for `[[ ## completed ## ]]`. Response: [[ ## next_thought ## ]] I need to check the weather in London. I should use the get_weather tool for this. [[ ## next_tool_name ## ]] get_weather [[ ## next_tool_args ## ]] {"city": "London"} [[ ## completed ## ]]
Next, the program makes a call to the weather API, receives the result and adds it to the next message.
The model decides which tool to call next: [[ ## next_tool_name ## ]] finish.
At this stage the model is still acting as an agent and has simply chosen the next tool to call.
User message: [[ ## question ## ]] Check weather in London [[ ## trajectory ## ]] [[ ## thought_0 ## ]] I need to check the weather in London. I should use the get_weather tool for this. [[ ## tool_name_0 ## ]] get_weather [[ ## tool_args_0 ## ]] {"city": "London"} [[ ## observation_0 ## ]] Weather in London: 23.75°C, overcast clouds Respond with the corresponding output fields, starting with the field `[[ ## next_thought ## ]]`, then `[[ ## next_tool_name ## ]]` (must be formatted as a valid Python Literal['get_weather', 'finish']), then `[[ ## next_tool_args ## ]]` (must be formatted as a valid Python dict[str, Any]), and then ending with the marker for `[[ ## completed ## ]]`. Response: [[ ## next_thought ## ]] The weather for London has been retrieved. I can now finish the task. [[ ## next_tool_name ## ]] finish [[ ## next_tool_args ## ]] {} [[ ## completed ## ]]
Then comes the last call: the model now acts as an assistant and produces reasoning plus the final answer.
The prompt has also changed.
System message: Your input fields are: 1. `question` (str): 2. `trajectory` (str): Your output fields are: 1. `reasoning` (str): 2. `answer` (str): All interactions will be structured in the following way, with the appropriate values filled in. [[ ## question ## ]] {question} [[ ## trajectory ## ]] {trajectory} [[ ## reasoning ## ]] {reasoning} [[ ## answer ## ]] {answer} [[ ## completed ## ]] In adhering to this structure, your objective is: Given the fields `question`, produce the fields `answer`. User message: [[ ## question ## ]] Check weather in London [[ ## trajectory ## ]] [[ ## thought_0 ## ]] I need to check the weather in London. I should use the get_weather tool for this. [[ ## tool_name_0 ## ]] get_weather [[ ## tool_args_0 ## ]] {"city": "London"} [[ ## observation_0 ## ]] Weather in London: 23.75°C, overcast clouds [[ ## thought_1 ## ]] The weather for London has been retrieved. I can now finish the task. [[ ## tool_name_1 ## ]] finish [[ ## tool_args_1 ## ]] {} [[ ## observation_1 ## ]] Completed. Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## answer ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`. Response: [[ ## reasoning ## ]] The question asks to check the weather in London. I need to use a tool that can provide weather information. The `get_weather` tool is suitable for this task, and I should provide the city as "London". After retrieving the weather information, I will finish the task. [[ ## answer ## ]] Weather in London: 23.75°C, overcast clouds [[ ## completed ## ]]
Why are there two calls with finish:
The reason is in the architecture of the agent loop in DSPy.
Call with next_tool_name:
This step is modeled as an "agent action". The agent always operates in the format:
I think → select the tool → call the tool → receive observation.
Even if tool = finish, formally it is still considered an action.
Call with reasoning + answer:
After the agent has "completed the task", the system makes a separate request, where the context includes the entire trajectory.
This two-step scheme is a way of separating the roles:
one communication format when the model is an agent managing tools;
another format when the model is an assistant responding to the user.
Later, in the LangChain example, there will be a single final message instead.
Example 3 – Python framework LangChain
LangChain is a framework for working with LLM (Large Language Models) that helps build complex applications on top of models, not just “question → answer”.
Its purpose is to provide a convenient layer over the model so that it can:
work with tools (e.g. API, databases, calculator);
save context and memory (chats, dialogue history);
use chains – a sequence of steps where the model performs one task and passes the result to the next;
launch agents that decide for themselves which tools to use and in what order;
integrate with popular LLM providers (OpenAI, Anthropic, LM Studio, HuggingFace, etc.);
work with retrieval and RAG (extract knowledge from documents or vector databases).
https://github.com/langchain-ai/langchain
https://en.wikipedia.org/wiki/LangChain
An example similar in function to the previous one:
requirements.txt
langchain-core==0.3.74 langchain-openai==0.3.30 langchain==0.3.27 openai==1.100.1 requests==2.32.5
import requests from typing import Dict from langchain_core.tools import tool from langchain_openai import ChatOpenAI from langchain import hub from langchain.agents import AgentExecutor, create_react_agent from langchain_core.callbacks.base import BaseCallbackHandler OPENWEATHER_API_KEY = "..." # Callback for logging class CustomLogger(BaseCallbackHandler): def on_tool_start(self, serialized, input_str, **kwargs): print(f"\nTOOL START:\n{serialized.get('name')}\nINPUT:\n{input_str}\n") def on_tool_end(self, output, **kwargs): print(f"\nTOOL END:\n{output}\n") def on_llm_start(self, serialized, prompts, **kwargs): print(f"\nLLM START:\n{serialized.get('name')}\nPROMPTS\n{prompts}\n") def on_llm_end(self, response, **kwargs): print(f"\nLLM END:\n{response}\n") # Define a tool that the agent can call @tool(description="Get current weather for a city.") def get_weather(city: str) -> str: # Form a request to the OpenWeather API url = "https://api.openweathermap.org/data/2.5/weather" params = {"q": city, "appid": OPENWEATHER_API_KEY, "units": "metric", "lang": "en"} r = requests.get(url, params=params, timeout=15) if not r.ok: # If there is an error, return the error text return f"error: {r.status_code} {r.text}" # Parse the JSON response data: Dict = r.json() # Return a short string with the weather return f"Weather in {data.get('name', city)}: {data['main']['temp']}°C, {data['weather'][0]['description']}" # Connect LLM via LM Studio (local OpenAI API-compatible server) llm = ChatOpenAI( model="google/gemma-3n-e4b", openai_api_base="http://localhost:1234/v1", openai_api_key="lm-studio", temperature=0, ) # Download the ready-made ReAct prompt template from LangChain Hub prompt = hub.pull("hwchase17/react") # Create an agent in ReAct style (model thinks + calls tools) agent = create_react_agent( llm, tools=[get_weather], prompt=prompt ) # Wrap the agent in Executor to launch and manage its work executor = AgentExecutor( agent=agent, tools=[get_weather], verbose=True ) # Launch an agent with a question about the weather in London result = executor.invoke( {"input": "Check weather in London"}, config={"callbacks": [CustomLogger()]} )
Console output:
> Entering new AgentExecutor chain... I need to find the weather in London. I will use the get_weather tool to do this. Action: get_weather Action Input: LondonWeather in London: 16.16°C, overcast cloudsI have retrieved the weather for London. Now I can provide the answer. Final Answer: Weather in London: 16.16°C, overcast clouds > Finished chain. Weather in London: 16.16°C, overcast clouds
Full log:
ON_LLM_START: ChatOpenAI PROMPTS: ['Human: Answer the following questions as best you can. You have access to the following tools: get_weather(city: str) -> str - Get current weather for a city. Use the following format: Question: the input question you must answer Thought: you should always think about what to do Action: the action to take, should be one of [get_weather] Action Input: the input to the action Observation: the result of the action ... (this Thought/Action/Action Input/Observation can repeat N times) Thought: I now know the final answer Final Answer: the final answer to the original input question Begin! Question: Check weather in London Thought:'] ON_LLM_END: generations=[[ ChatGenerationChunk(text='I need to find the weather in London. I will use the get_weather tool to do this. Action: get_weather Action Input: London', generation_info={'finish_reason': 'stop', 'model_name': 'google/gemma-3n-e4b', 'system_fingerprint': 'google/gemma-3n-e4b'}, message=AIMessageChunk( content='I need to find the weather in London. I will use the get_weather tool to do this. Action: get_weather Action Input: London', additional_kwargs={}, response_metadata={'finish_reason': 'stop', 'model_name': 'google/gemma-3n-e4b', 'system_fingerprint': 'google/gemma-3n-e4b'}, id='run--ce1137e0-edde-4be7-95df-8b26c7ddddce'))]] llm_output=None run=None type='LLMResult' I need to find the weather in London. I will use the get_weather tool to do this. Action: get_weather Action Input: London
ON_TOOL_START: get_weather INPUT: London ON_TOOL_END: Weather in London: 23.01°C, overcast clouds
ON_LLM_START: ChatOpenAI PROMPTS: ['Human: Answer the following questions as best you can. You have access to the following tools: get_weather(city: str) -> str - Get current weather for a city. Use the following format: Question: the input question you must answer Thought: you should always think about what to do Action: the action to take, should be one of [get_weather] Action Input: the input to the action Observation: the result of the action ... (this Thought/Action/Action Input/Observation can repeat N times) Thought: I now know the final answer Final Answer: the final answer to the original input question Begin! Question: Check weather in London Thought: I need to find the weather in London. I will use the get_weather tool to do this. Action: get_weather Action Input: London Observation: Weather in London: 23.01°C, overcast clouds Thought: '] ON_LLM_END: generations=[[ ChatGenerationChunk(text='I have retrieved the weather for London. Now I can provide the answer. Final Answer: Weather in London: 23.01°C, overcast clouds', generation_info={'finish_reason': 'stop', 'model_name': 'google/gemma-3n-e4b', 'system_fingerprint': 'google/gemma-3n-e4b'}, message=AIMessageChunk( content='I have retrieved the weather for London. Now I can provide the answer. Final Answer: Weather in London: 23.01°C, overcast clouds', additional_kwargs={}, response_metadata={'finish_reason': 'stop', 'model_name': 'google/gemma-3n-e4b', 'system_fingerprint': 'google/gemma-3n-e4b'}, id='run--342aee15-0171-4549-94c4-a02a620aad7f'))]] llm_output=None run=None type='LLMResult' I have retrieved the weather for London. Now I can provide the answer. Final Answer: Weather in London: 23.01°C, overcast clouds > Finished chain. Process finished with exit code 0
Example 4 – Java framework LangChain4j
As LLMs make their way from startups to the enterprise, Java solutions were only a matter of time.
Popular solutions include:
Spring AI
https://spring.io/projects/spring-ai
https://github.com/spring-projects/spring-ai
Microsoft Semantic Kernel
https://learn.microsoft.com/en-us/semantic-kernel/overview
https://github.com/microsoft/semantic-kernel
LangChain for Java
https://docs.langchain4j.dev
https://github.com/langchain4j/langchain4j
Here is an example of LangChain4j, similar in functionality to the previous examples:
ext { ktor_version = '3.2.3' lc_4_j_version = '1.3.0' logback_version = '1.5.13' kotlinx_serialization_json_version = "1.8.1" } dependencies { implementation "io.ktor:ktor-client-core:$ktor_version" implementation "io.ktor:ktor-client-cio:$ktor_version" implementation "io.ktor:ktor-client-content-negotiation:$ktor_version" implementation "io.ktor:ktor-serialization-kotlinx-json:$ktor_version" implementation "io.ktor:ktor-client-logging:$ktor_version" implementation "org.jetbrains.kotlinx:kotlinx-serialization-json:$kotlinx_serialization_json_version" implementation "ch.qos.logback:logback-classic:$logback_version" implementation "dev.langchain4j:langchain4j:$lc_4_j_version" implementation "dev.langchain4j:langchain4j-open-ai:$lc_4_j_version" implementation("dev.langchain4j:langchain4j-http-client-jdk:$lc_4_j_version") }
import dev.langchain4j.http.client.jdk.JdkHttpClientBuilder import dev.langchain4j.model.openai.OpenAiChatModel import dev.langchain4j.service.AiServices import kotlinx.coroutines.runBlocking import java.net.http.HttpClient import java.time.Duration fun main() = runBlocking { val httpClientBuilder = HttpClient.newBuilder() .connectTimeout(Duration.ofSeconds(10)) .version(HttpClient.Version.HTTP_1_1) val jdkClientBuilder = JdkHttpClientBuilder() .httpClientBuilder(httpClientBuilder) .connectTimeout(Duration.ofSeconds(10)) .readTimeout(Duration.ofSeconds(90)) val model = OpenAiChatModel.builder() .httpClientBuilder(jdkClientBuilder) .baseUrl(BASE_URL) .apiKey("lm-studio") .modelName(MODEL_ID) .temperature(0.2) .logRequests(true) .logResponses(true) .timeout(Duration.ofSeconds(90)) .listeners(listOf(ChatLogger())) .build() val assistant = AiServices.builder(Assistant::class.java) .chatModel(model) .tools(WeatherTools) .build() val result = assistant.chat("Check weather in London") println(result) }
import dev.langchain4j.service.UserMessage import dev.langchain4j.service.V interface Assistant { @UserMessage("{{input}}") fun chat(@V("input") message: String): String }
import dev.langchain4j.agent.tool.Tool import io.ktor.client.call.* import io.ktor.client.request.* import io.ktor.http.* import kotlinx.coroutines.runBlocking object WeatherTools { @Tool( name = "current_weather_city", value = ["Get current weather for a city"] ) @JvmStatic fun currentWeatherByCity(city: String): String = runBlocking { client.use { httpClient -> val url = URLBuilder(OWM_BASE).apply { appendPathSegments("weather") parameters.append("q", city) parameters.append("appid", WEATHER_API_KEY) parameters.append("units", "metric") parameters.append("lang", "en") }.buildString() val response: WeatherResponse = httpClient.get(url).body() val name = response.name ?: city val temp = response.main?.temp val desc = response.weather.firstOrNull()?.description ?: "n/a" if (temp == null) { "Can't read temperature for $name" } else { "Weather in $name: $temp °C, $desc" } } } }
import io.ktor.client.* import io.ktor.client.engine.cio.* import io.ktor.client.plugins.* import io.ktor.client.plugins.contentnegotiation.* import io.ktor.serialization.kotlinx.json.* import kotlinx.serialization.json.Json const val BASE_URL: String = "http://localhost:1234/v1" const val OWM_BASE = "https://api.openweathermap.org/data/2.5" const val WEATHER_API_KEY = "..." const val MODEL_ID = "google/gemma-3n-e4b" val kotlinxJsonConfig: Json = Json { ignoreUnknownKeys = true prettyPrint = false } val client: HttpClient = HttpClient(CIO) { install(ContentNegotiation) { json(kotlinxJsonConfig) } install(HttpTimeout) { requestTimeoutMillis = 60_000 connectTimeoutMillis = 10_000 socketTimeoutMillis = 60_000 } }
import kotlinx.serialization.SerialName import kotlinx.serialization.Serializable @Serializable data class WeatherResponse( @SerialName("coord") val coord: Coord? = null, @SerialName("weather") val weather: List<WeatherItem> = emptyList(), @SerialName("base") val base: String? = null, @SerialName("main") val main: MainBlock? = null, @SerialName("visibility") val visibility: Int? = null, @SerialName("wind") val wind: Wind? = null, @SerialName("clouds") val clouds: Clouds? = null, @SerialName("dt") val dt: Long? = null, @SerialName("sys") val sys: Sys? = null, @SerialName("timezone") val timezone: Int? = null, @SerialName("id") val id: Long? = null, @SerialName("name") val name: String? = null, @SerialName("cod") val cod: Int? = null ) @Serializable data class Coord( @SerialName("lon") val lon: Double? = null, @SerialName("lat") val lat: Double? = null ) @Serializable data class WeatherItem( @SerialName("id") val id: Int? = null, @SerialName("main") val main: String? = null, @SerialName("description") val description: String? = null, @SerialName("icon") val icon: String? = null ) @Serializable data class MainBlock( @SerialName("temp") val temp: Double? = null, @SerialName("feels_like") val feelsLike: Double? = null, @SerialName("temp_min") val tempMin: Double? = null, @SerialName("temp_max") val tempMax: Double? = null, @SerialName("pressure") val pressure: Int? = null, @SerialName("humidity") val humidity: Int? = null ) @Serializable data class Wind( @SerialName("speed") val speed: Double? = null, @SerialName("deg") val deg: Int? = null, @SerialName("gust") val gust: Double? = null ) @Serializable data class Clouds( @SerialName("all") val all: Int? = null ) @Serializable data class Sys( @SerialName("type") val type: Int? = null, @SerialName("id") val id: Int? = null, @SerialName("country") val country: String? = null, @SerialName("sunrise") val sunrise: Long? = null, @SerialName("sunset") val sunset: Long? = null )
import dev.langchain4j.model.chat.listener.ChatModelErrorContext import dev.langchain4j.model.chat.listener.ChatModelListener import dev.langchain4j.model.chat.listener.ChatModelRequestContext import dev.langchain4j.model.chat.listener.ChatModelResponseContext class ChatLogger : ChatModelListener { override fun onRequest(request: ChatModelRequestContext) { println("ON_LLM_START:") println(request) } override fun onResponse(response: ChatModelResponseContext) { println("ON_LLM_END:") println(response) } override fun onError(context: ChatModelErrorContext) { println("ON_LLM_ERROR:") println(context.error().message) } }
Console output:
ON_LLM_START: dev.langchain4j.model.chat.listener.ChatModelRequestContext@33d512c1 00:48:03.351 [main] INFO dev.langchain4j.http.client.log.LoggingHttpClient -- HTTP request: - method: POST - url: http://localhost:1234/v1/chat/completions - headers: [Authorization: Beare...io], [User-Agent: langchain4j-openai],
[Content-Type: application/json] - body: { "model" : "google/gemma-3n-e4b", "messages" : [ { "role" : "user", "content" : "Check weather in London" } ], "temperature" : 0.2, "stream" : false, "tools" : [ { "type" : "function", "function" : { "name" : "current_weather_city", "description" : "Get current weather for a city", "parameters" : { "type" : "object", "properties" : { "arg0" : { "type" : "string" } }, "required" : [ "arg0" ] } } } ] } 00:48:04.961 [main] INFO dev.langchain4j.http.client.log.LoggingHttpClient -- HTTP response: - status code: 200 - headers: [connection: keep-alive], [content-length: 741], [content-type: application/json; charset=utf-8], [date: Tue, 19 Aug 2025 21:48:04 GMT], [etag: W/"2e5-JWXcPUg7aNE2xJTlIOO4yntOymw"], [keep-alive: timeout=5], [x-powered-by: Express] - body: { "id": "chatcmpl-ox8omdn9nxl0o4pbihqqi", "object": "chat.completion", "created": 1755640083, "model": "google/gemma-3n-e4b", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "", "tool_calls": [ { "type": "function", "id": "458045748", "function": { "name": "current_weather_city", "arguments": "{\"arg0\":\"London\"}" } } ] }, "logprobs": null, "finish_reason": "tool_calls" } ], "usage": { "prompt_tokens": 399, "completion_tokens": 34, "total_tokens": 433 }, "stats": {}, "system_fingerprint": "google/gemma-3n-e4b" } ON_LLM_END: dev.langchain4j.model.chat.listener.ChatModelResponseContext@1c481ff2 ON_LLM_START: dev.langchain4j.model.chat.listener.ChatModelRequestContext@70d2e40b 00:48:05.501 [main] INFO dev.langchain4j.http.client.log.LoggingHttpClient -- HTTP request: - method: POST - url: http://localhost:1234/v1/chat/completions - headers: [Authorization: Beare...io], [User-Agent: langchain4j-openai], [Content-Type: application/json] - body: { "model" : "google/gemma-3n-e4b", "messages" : [ { "role" : "user", "content" : "Check weather in London" }, { "role" : "assistant", "tool_calls" : [ { "id" : "458045748", "type" : "function", "function" : { "name" : "current_weather_city", "arguments" : "{\"arg0\":\"London\"}" } } ] }, { "role" : "tool", "tool_call_id" : "458045748", "content" : "Weather in London: 17.76 °C, few clouds" } ], "temperature" : 0.2, "stream" : false, "tools" : [ { "type" : "function", "function" : { "name" : "current_weather_city", "description" : "Get current weather for a city", "parameters" : { "type" : "object", "properties" : { "arg0" : { "type" : "string" } }, "required" : [ "arg0" ] } } } ] } 00:48:06.257 [main] INFO dev.langchain4j.http.client.log.LoggingHttpClient -- HTTP response: - status code: 200 - headers: [connection: keep-alive], [content-length: 554], [content-type: application/json; charset=utf-8], [date: Tue, 19 Aug 2025 21:48:06 GMT], [etag: W/"22a-YKbJMzdQdr5Qv/9Z3J+Mf2kPeAw"], [keep-alive: timeout=5], [x-powered-by: Express] - body: { "id": "chatcmpl-6aslomnvzvh3tu6jo8tfl6", "object": "chat.completion", "created": 1755640085, "model": "google/gemma-3n-e4b", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "The weather in London is 17.76 °C with few clouds.", "tool_calls": [] }, "logprobs": null, "finish_reason": "stop" } ], "usage": { "prompt_tokens": 469, "completion_tokens": 18, "total_tokens": 487 }, "stats": {}, "system_fingerprint": "google/gemma-3n-e4b" } ON_LLM_END: dev.langchain4j.model.chat.listener.ChatModelResponseContext@2449cff7 The weather in London is 17.76 °C with few clouds. Process finished with exit code 0
Example 5 – Google's Java libraries for mobile devices
In this example I will use
Engine: Google AI Edge MediaPipe
LLM model: gemma-3n-E4B-it-int4
RAG / Functional Calling: On-Device RAG SDK & On-Device Function Calling SDK
Embedding model: Gecko-110m-en
How these techniques are supposed to work is described here:
RAG:
https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation
https://en.wikipedia.org/wiki/Retrieval-augmented_generation
https://www.ibm.com/think/topics/retrieval-augmented-generation
Function Calling:
https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/function-calling
https://huggingface.co/docs/hugs/guides/function-calling
https://platform.openai.com/docs/guides/function-calling
https://medium.com/@danushidk507/function-calling-in-llm-e537b286a4fd
In this example there will be three classes:
MediaPipeEngineCommon: holds the common components for working with the vector database
MediaPipeEngineWithRag: here you can start generation by adding data from the vector database to the request
MediaPipeEngineWithTools: here you can start generation and the model itself decides whether to query the vector database
If it decides that it does, it makes a function call, which is processed not manually, as in the previous example, but by the com.google.ai.edge.localagents:localagents-fc library.
Project dependencies, in the Gradle version catalog format:
localagentsRag = "0.2.0" localagentsFc = "0.1.0" tasksGenai = "0.10.25" tasksText = "0.10.26.1" tasksVision = "0.10.26.1" tensorflowLite = "2.17.0" kotlinxCoroutinesGuava = "1.10.2" tasks-genai = { module = "com.google.mediapipe:tasks-genai", version.ref = "tasksGenai" } tasks-text = { module = "com.google.mediapipe:tasks-text", version.ref = "tasksText" } tasks-vision = { module = "com.google.mediapipe:tasks-vision", version.ref = "tasksVision" } tensorflow-lite = { module = "org.tensorflow:tensorflow-lite", version.ref = "tensorflowLite" } localagents-rag = { module = "com.google.ai.edge.localagents:localagents-rag", version.ref = "localagentsRag" } localagents-fc = { module = "com.google.ai.edge.localagents:localagents-fc", version.ref = "localagentsFc" } kotlinx-coroutines-guava = { module = "org.jetbrains.kotlinx:kotlinx-coroutines-guava", version.ref = "kotlinxCoroutinesGuava" }
MediaPipeEngineCommon: Class for working with a vector database, needed in this example for both RAG and Function Calling
import com.google.ai.edge.localagents.rag.chunking.TextChunker
import com.google.ai.edge.localagents.rag.memory.DefaultSemanticTextMemory
import com.google.ai.edge.localagents.rag.memory.SqliteVectorStore
import com.google.ai.edge.localagents.rag.models.Embedder
import com.google.ai.edge.localagents.rag.prompt.PromptBuilder

interface MediaPipeEngineCommon {

    var chunker: TextChunker
    var embedder: Embedder<String>
    var vectorStore: SqliteVectorStore
    var promptBuilder: PromptBuilder
    var semanticMemory: DefaultSemanticTextMemory

    fun init(
        geckoModelPath: String, // Gecko_256_quant.tflite
        tokenizerModelPath: String, // sentencepiece.model
        useGpuForEmbeddings: Boolean = true,
    )

    fun saveTextToVectorStore(
        text: String,
        chunkOverlap: Int = 20,
        chunkTokenSize: Int = 128,
        chunkMaxSymbolsSize: Int = 1000,
        chunkBySentences: Boolean = false,
    ): String?

    fun readEmbeddingVectors(): List<VectorStoreEntity>

    suspend fun readEmbeddingVectors(
        query: String,
        topK: Int,
        minSimilarityScore: Float,
    ): List<VectorStoreEntity>

    fun makeSQLRequest(query: String): Boolean
}
import android.app.Application import android.database.Cursor import android.database.sqlite.SQLiteDatabase import com.google.ai.edge.localagents.rag.chunking.TextChunker import com.google.ai.edge.localagents.rag.memory.DefaultSemanticTextMemory import com.google.ai.edge.localagents.rag.memory.SqliteVectorStore import com.google.ai.edge.localagents.rag.memory.VectorStoreRecord import com.google.ai.edge.localagents.rag.models.EmbedData import com.google.ai.edge.localagents.rag.models.Embedder import com.google.ai.edge.localagents.rag.models.EmbeddingRequest import com.google.ai.edge.localagents.rag.models.GeckoEmbeddingModel import com.google.ai.edge.localagents.rag.prompt.PromptBuilder import com.google.common.collect.ImmutableList import com.romankryvolapov.offlineailauncher.common.extensions.toDurationString import com.romankryvolapov.offlineailauncher.common.models.common.LogUtil.logDebug import com.romankryvolapov.offlineailauncher.common.models.common.LogUtil.logError import kotlinx.coroutines.guava.await import java.io.File import java.nio.ByteBuffer import java.nio.ByteOrder import java.util.Optional class MediaPipeEngineCommonImpl( private val application: Application ) : MediaPipeEngineCommon { companion object { private const val TAG = "CommonComponentsTag" private const val GECKO_EMBEDDING_MODEL_DIMENSION = 768 private const val PROMPT_TEMPLATE: String = "You are an assistant for question-answering tasks. Here are the things I want to remember: {0} Use the things I want to remember, answer the following question the user has: {1}" } override lateinit var chunker: TextChunker override lateinit var embedder: Embedder<String> override lateinit var vectorStore: SqliteVectorStore override lateinit var promptBuilder: PromptBuilder override lateinit var semanticMemory: DefaultSemanticTextMemory override fun init( geckoModelPath: String, tokenizerModelPath: String, useGpuForEmbeddings: Boolean, ) { logDebug("init", TAG) chunker = TextChunker() // in embedder add the path to the Gecko-110m-en model // I use the Gecko_256_quant.tflite version here 256 is the maximum text input size // This version is optimal in terms of text chunk size and performance // important - further in the code we pass on which pieces to split the text into, this depends on the parameters of the model embedder = GeckoEmbeddingModel( geckoModelPath, Optional.of(tokenizerModelPath), useGpuForEmbeddings, ) // here I will use a ready-made SQLite database in the application // another table rag_vector_store will simply be added to it // with text columns for text and embeddings for vector val database = File(application.getDatabasePath("database").absolutePath) if (!database.exists()) { logError("startEngine database not exists", TAG) } // It is also possible to create a custom database implementation that inherits // VectorStore<String> interface, but the getNearestRecords method must // to be implemented correctly and work quickly, he is looking for the nearest vesters vectorStore = SqliteVectorStore( GECKO_EMBEDDING_MODEL_DIMENSION, database.absolutePath ) semanticMemory = DefaultSemanticTextMemory( vectorStore, embedder ) promptBuilder = PromptBuilder( PROMPT_TEMPLATE ) logDebug("init ready", TAG) } override fun saveTextToVectorStore( text: String, // how far to get into the text before the fragment, here 20 chunkOverlap: Int, // Please note that the size seems to be in tokens, // but it is used for chunker and may not match // token size for embedder, here chunkTokenSize 128 chunkTokenSize: Int, // when splitting 
using chunkBySentences // the size of the offers can be large // if it exceeds the capabilities of the embedder model, it will throw an error // for this purpose, cropping to the maximum size is used // here it is 1000 characters chunkMaxSymbolsSize: Int, // use the sentence-by-sentence method chunkBySentences: Boolean, ): String? { logDebug("saveTextToVectorStore text length: ${text.length}", TAG) // timer to see how fast it works val start = System.currentTimeMillis() val chunks: List<String> = if (chunkBySentences) chunker.chunkBySentences( text, chunkTokenSize, ).filter { it.isNotBlank() }.map { chunk -> if (chunk.length > chunkMaxSymbolsSize) { logError("saveTextToVectorStore crop chunk", TAG) chunk.substring(0, chunkMaxSymbolsSize) } else { chunk } } else chunker.chunk( text, chunkTokenSize, chunkOverlap ).filter { it.isNotBlank() }.map { chunk -> if (chunk.length > chunkMaxSymbolsSize) { logError("saveTextToVectorStore crop chunk", TAG) chunk.substring(0, chunkMaxSymbolsSize) } else { chunk } } val end = System.currentTimeMillis() val delta = end - start logDebug("saveTextToVectorStore chunks delta: ${delta.toDurationString()} size: ${chunks.size}", TAG) chunks.forEach { logDebug("length: ${it.length}", TAG) } if (chunks.isEmpty()) { logError("saveTextToVectorStore chunks.isEmpty()", TAG) return "Chunks is empty" } return try { // vector generation occurs inside semanticMemory val result: Boolean? = semanticMemory.recordBatchedMemoryItems( ImmutableList.copyOf(chunks) )?.get() val end = System.currentTimeMillis() val delta = end - start logDebug("saveTextToVectorStore ready delta: ${delta.toDurationString()} result: $result", TAG) null } catch (t: Throwable) { logError("saveTextToVectorStore failed: ${t.message}", t, TAG) t.message } } // search by query, will find all pieces of text similar to the query override suspend fun readEmbeddingVectors( query: String, // number of database query results topK: Int, // how similar the query vector should be to the database record // 0.0 = search all entries, sort by most similar // 1.0 = perfect match only // I use values 0.6 - 0.8 minSimilarityScore: Float, ): List<VectorStoreEntity> { logDebug("readEmbeddingVectors query: $query", TAG) val queryEmbedData: EmbedData<String> = EmbedData.create( query, EmbedData.TaskType.RETRIEVAL_QUERY ) val embeddingRequest: EmbeddingRequest<String> = EmbeddingRequest .create( listOf(queryEmbedData) ) val vector: ImmutableList<Float> = try { embedder.getEmbeddings(embeddingRequest).await() } catch (t: Throwable) { logError("readEmbeddingVectors: embedding failed: ${t.message}", t, TAG) return emptyList() } logDebug("searchDocsInternal vector size: ${vector.size}", TAG) if (vector.isEmpty()) { logError("readEmbeddingVectors vector.isEmpty()", TAG) return emptyList() } val hits: ImmutableList<VectorStoreRecord<String>> = try { vectorStore.getNearestRecords( vector, topK, minSimilarityScore ) } catch (t: Throwable) { logError("readEmbeddingVectors: vector search failed: ${t.message}", t, TAG) return emptyList() } if (hits.isEmpty()) { logError("readEmbeddingVectors hits.isEmpty()", TAG) return emptyList() } val result = hits.map { VectorStoreEntity( id = null, text = it.data, embedding = it.embeddings ) } logDebug("readEmbeddingVectors\nsize: ${result.size}\nresult: $result", TAG) return result } // simply displays all records in the database override fun readEmbeddingVectors(): List<VectorStoreEntity> { logDebug("readEmbeddingPreview", TAG) var cursor: Cursor? = null var database: SQLiteDatabase? 
= null return try { val databaseFile = File(application.getDatabasePath("database").absolutePath) database = SQLiteDatabase.openDatabase( databaseFile.absolutePath, null, SQLiteDatabase.OPEN_READONLY ) cursor = database.rawQuery("SELECT ROWID, text, embeddings FROM rag_vector_store", null) val result = mutableListOf<VectorStoreEntity>() while (cursor.moveToNext()) { val rowId = cursor.getLong(0) val text = cursor.getString(1) val blob = cursor.getBlob(2) val buffer = ByteBuffer.wrap(blob).order(ByteOrder.LITTLE_ENDIAN) val floats = mutableListOf<Float>() while (buffer.hasRemaining()) { floats.add(buffer.float) } result.add( VectorStoreEntity( id = rowId, text = text, embedding = floats ) ) } logDebug("readEmbeddingPreview\nsize: ${result.size}\nresult: $result", TAG) result } catch (t: Throwable) { logError("readEmbeddingPreview failed: ${t.message}", t, TAG) emptyList() } finally { cursor?.close() database?.close() } } // you can write your own request and it will be executed, // for example "DELETE FROM rag_vector_store" override fun makeSQLRequest(query: String): Boolean { logDebug("makeSQLRequest query: $query", TAG) var cursor: Cursor? = null var database: SQLiteDatabase? = null return try { val databaseFile = File(application.getDatabasePath("database").absolutePath) database = SQLiteDatabase.openDatabase( databaseFile.absolutePath, null, SQLiteDatabase.OPEN_READWRITE ) cursor = database.rawQuery(query, null) val result = cursor.moveToFirst() logDebug("makeSQLRequest result: $result", TAG) result } catch (t: Throwable) { logError("makeSQLRequest failed: ${t.message}", t, TAG) false } finally { cursor?.close() database?.close() } } }
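To show how the pieces fit together, here is a rough usage sketch of MediaPipeEngineCommon; the file paths and the sample text are illustrative and not taken from the project:
// A minimal sketch: fill the vector store and query it (paths and text are illustrative)
suspend fun fillAndQueryVectorStore(common: MediaPipeEngineCommon) {
    // Load the Gecko embedding model and open the SQLite vector store
    common.init(
        geckoModelPath = "/data/local/tmp/Gecko_256_quant.tflite",
        tokenizerModelPath = "/data/local/tmp/sentencepiece.model",
        useGpuForEmbeddings = true,
    )
    // Chunk the text, embed each chunk and write it to the rag_vector_store table;
    // a non-null return value is an error message
    val error: String? = common.saveTextToVectorStore(
        text = "The team meeting is on Friday at 10:00 in the small conference room.",
    )
    if (error != null) return
    // Find the chunks whose vectors are closest to the query vector
    val hits = common.readEmbeddingVectors(
        query = "When is the team meeting?",
        topK = 5,
        minSimilarityScore = 0.6F,
    )
    hits.forEach { println(it.text) }
}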
MediaPipeEngineWithRag: only the simple RAG mechanism is supported here
import kotlinx.coroutines.flow.Flow
import java.io.File

interface MediaPipeEngineWithRag {

    fun startEngine(
        modelFile: File,
        isSupportImages: Boolean = false,
        engineParams: MediaPipeEngineParams,
    )

    fun resetSession()

    fun generateResponse(
        prompt: String,
        topK: Int = 5,
        minSimilarityScore: Float = 0.6F,
    ): Flow<ResultEmittedData<String>>
}
import android.app.Application import com.google.ai.edge.localagents.rag.chains.ChainConfig import com.google.ai.edge.localagents.rag.chains.RetrievalAndInferenceChain import com.google.ai.edge.localagents.rag.models.AsyncProgressListener import com.google.ai.edge.localagents.rag.models.LanguageModelResponse import com.google.ai.edge.localagents.rag.models.MediaPipeLlmBackend import com.google.ai.edge.localagents.rag.retrieval.RetrievalConfig import com.google.ai.edge.localagents.rag.retrieval.RetrievalConfig.TaskType import com.google.ai.edge.localagents.rag.retrieval.RetrievalRequest import com.google.common.util.concurrent.FutureCallback import com.google.common.util.concurrent.Futures import com.google.common.util.concurrent.ListenableFuture import com.google.common.util.concurrent.MoreExecutors import com.google.mediapipe.tasks.genai.llminference.GraphOptions import com.google.mediapipe.tasks.genai.llminference.LlmInference import com.google.mediapipe.tasks.genai.llminference.LlmInference.LlmInferenceOptions import com.google.mediapipe.tasks.genai.llminference.LlmInferenceSession.LlmInferenceSessionOptions import kotlinx.coroutines.flow.Flow import kotlinx.coroutines.flow.callbackFlow import java.io.File import java.util.concurrent.Executor class MediaPipeEngineWithRagImpl( private val application: Application, private val common: MediaPipeEngineCommon, ) : MediaPipeEngineWithRag { companion object { private const val TAG = "MediaPipeEngineWithRagTag" } private var chainConfig: ChainConfig<String>? = null private var retrievalAndInferenceChain: RetrievalAndInferenceChain? = null private var engineMediaPipe: LlmInference? = null private var sessionOptions: LlmInferenceSessionOptions? = null private var mediaPipeLanguageModel: MediaPipeLlmBackend? = null private var interfaceOptions: LlmInferenceOptions? = null private val executor: Executor = MoreExecutors.directExecutor() private var future: ListenableFuture<LanguageModelResponse>? 
= null override fun startEngine( modelFile: File, isSupportImages: Boolean, engineParams: MediaPipeEngineParams, ) { logDebug("startEngine", TAG) interfaceOptions = createInterfaceOptions( modelFile = modelFile, engineParams = engineParams, isSupportImages = isSupportImages, ) engineMediaPipe = LlmInference.createFromOptions( application, interfaceOptions ) if (engineMediaPipe == null) { logError("startEngine llmInference == null", TAG) return } sessionOptions = createSessionOptions( engineParams = engineParams, isSupportImages = isSupportImages, ) mediaPipeLanguageModel = MediaPipeLlmBackend( application.applicationContext, interfaceOptions, sessionOptions, executor ) chainConfig = ChainConfig.create( mediaPipeLanguageModel, common.promptBuilder, // add a database in which to check queries common.semanticMemory ) // we make a chain with a check in the database retrievalAndInferenceChain = RetrievalAndInferenceChain( chainConfig ) Futures.addCallback( mediaPipeLanguageModel!!.initialize(), object : FutureCallback<Boolean> { override fun onSuccess(result: Boolean) { logDebug("mediaPipeLanguageModel initialize onSuccess", TAG) } override fun onFailure(t: Throwable) { logError( "mediaPipeLanguageModel initialize onFailure: ${t.message}", t, TAG, ) } }, executor ) logDebug("startEngine ready", TAG) } override fun resetSession() { logDebug("resetSession", TAG) try { retrievalAndInferenceChain = RetrievalAndInferenceChain( chainConfig ) logDebug("Session reset completed", TAG) } catch (e: Exception) { logError("Failed to reset session: ${e.message}", e, TAG) } logDebug("resetSession ready", TAG) } override fun generateResponse( prompt: String, // Number of database query results topK: Int, // how similar the query vector should be to the database record // 0.0 = search all entries, sort by most similar // 1.0 = perfect match only // I use values 0.6 - 0.8 minSimilarityScore: Float, ): Flow<ResultEmittedData<String>> = callbackFlow { logDebug("generateResponse prompt: $prompt", TAG) try { if (retrievalAndInferenceChain == null) { logError("generateResponse retrievalAndInferenceChain == null", TAG) trySend( ResultEmittedData.error( model = null, error = null, title = "MediaPipe engine error", responseCode = null, message = "retrievalAndInferenceChain == null", errorType = ErrorType.ERROR_IN_LOGIC, ) ) return@callbackFlow } val retrievalConfig = RetrievalConfig.create( topK, minSimilarityScore, TaskType.QUESTION_ANSWERING ) // the request already includes a chain with verification val retrievalRequest = RetrievalRequest.create( prompt, retrievalConfig ) logDebug("generateResponse retrievalRequest", TAG) val messageBuilder = StringBuilder() val listener = AsyncProgressListener<LanguageModelResponse> { partial, done -> val delta = partial.text.orEmpty() logDebug("generateResponse delta: $delta", TAG) if (!done && delta.isNotBlank()) { messageBuilder.append(delta) trySend( ResultEmittedData.loading( model = messageBuilder.toString(), ) ) } } future = retrievalAndInferenceChain!!.invoke( retrievalRequest, listener ) future?.addListener({ val fullText = future?.get()?.text if (fullText.isNullOrEmpty()) { logError("generateResponse fullText isNullOrEmpty", TAG) trySend( ResultEmittedData.error( model = null, error = null, title = "MediaPipe engine error", responseCode = null, message = "Empty response", errorType = ErrorType.EXCEPTION ) ) close() return@addListener } logDebug("generateResponse fullText: $fullText", TAG) trySend( ResultEmittedData.success( model = fullText, message = null, responseCode = 
null ) ) close() }, executor) logDebug("generateResponse ready", TAG) } catch (t: Throwable) { logError("generateResponse failed: ${t.message}", t, TAG) trySend( ResultEmittedData.error( model = null, error = t, title = "MediaPipe engine error", responseCode = null, message = t.message, errorType = ErrorType.EXCEPTION, ) ) } } private fun createInterfaceOptions( modelFile: File, engineParams: MediaPipeEngineParams, isSupportImages: Boolean, ): LlmInferenceOptions { val backend = when (engineParams.backend) { MediaPipeBackendParams.CPU -> LlmInference.Backend.CPU MediaPipeBackendParams.GPU -> LlmInference.Backend.GPU } return LlmInferenceOptions.builder().apply { setModelPath(modelFile.absolutePath) setMaxTokens(engineParams.contextSize) setPreferredBackend(backend) val maxNumImages = if (isSupportImages) 1 else 0 setMaxNumImages(maxNumImages) if (engineParams.useMaxTopK) setMaxTopK(engineParams.maxTopK) }.build() } private fun createSessionOptions( engineParams: MediaPipeEngineParams, isSupportImages: Boolean, ): LlmInferenceSessionOptions { return LlmInferenceSessionOptions.builder().apply { if (engineParams.useTopK) setTopK(engineParams.topK) if (engineParams.useTopP) setTopP(engineParams.topP) if (engineParams.useTemperature) setTemperature(engineParams.temperature) if (engineParams.useRandomSeed) setRandomSeed(engineParams.randomSeed) setGraphOptions( GraphOptions.builder() .setEnableVisionModality(isSupportImages) .build() ) }.build() } private fun isInGeneration(): Boolean { return future != null && future?.isDone != true && future?.isCancelled != true } }
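For completeness, a rough sketch of how this class might be driven from a coroutine; the prompt is illustrative, and MediaPipeEngineParams is one of the auxiliary classes listed at the end of the article:
import java.io.File

// A minimal sketch: start the engine and stream a RAG-augmented answer (prompt is illustrative)
suspend fun askWithRag(
    engine: MediaPipeEngineWithRag,
    modelFile: File,
    engineParams: MediaPipeEngineParams,
) {
    // In a real app you would wait for the backend initialization callback before generating
    engine.startEngine(
        modelFile = modelFile,
        engineParams = engineParams,
    )
    engine.generateResponse(
        prompt = "When is the team meeting?",
        topK = 5,
        minSimilarityScore = 0.6F,
    ).collect { result ->
        result
            .onLoading { partial, _ -> println("partial: $partial") }
            .onSuccess { answer, _, _ -> println("answer: $answer") }
            .onFailure { _, _, message, _, _ -> println("error: $message") }
    }
}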
MediaPipeEngineWithTools: function calls are supported here (in the implementation below, the class is named MediaPipeEngineWithFunctionCallingImpl).
import kotlinx.coroutines.flow.Flow
import java.io.File

interface MediaPipeEngineWithTools {

    fun startEngine(
        modelFile: File,
        isSupportImages: Boolean = false,
        engineParams: MediaPipeEngineParams,
    )

    fun generateResponse(
        userQuery: String,
        maxSteps: Int = 3,
    ): Flow<ResultEmittedData<String>>
}
package com.romankryvolapov.offlineailauncher.mediapipe import android.app.Application import com.google.ai.edge.localagents.core.proto.Content import com.google.ai.edge.localagents.core.proto.FunctionCall import com.google.ai.edge.localagents.core.proto.FunctionDeclaration import com.google.ai.edge.localagents.core.proto.FunctionResponse import com.google.ai.edge.localagents.core.proto.GenerateContentResponse import com.google.ai.edge.localagents.core.proto.Part import com.google.ai.edge.localagents.core.proto.Schema import com.google.ai.edge.localagents.core.proto.Tool import com.google.ai.edge.localagents.fc.GemmaFormatter import com.google.ai.edge.localagents.fc.GenerativeModel import com.google.ai.edge.localagents.fc.LlmInferenceBackend import com.google.ai.edge.localagents.rag.memory.VectorStoreRecord import com.google.ai.edge.localagents.rag.models.EmbedData import com.google.ai.edge.localagents.rag.models.EmbeddingRequest import com.google.common.collect.ImmutableList import com.google.mediapipe.tasks.genai.llminference.LlmInference import com.google.mediapipe.tasks.genai.llminference.LlmInference.LlmInferenceOptions import com.google.protobuf.Struct import com.google.protobuf.Value import com.romankryvolapov.offlineailauncher.common.models.common.ErrorType import com.romankryvolapov.offlineailauncher.common.models.common.LogUtil.logDebug import com.romankryvolapov.offlineailauncher.common.models.common.LogUtil.logError import com.romankryvolapov.offlineailauncher.common.models.common.ResultEmittedData import kotlinx.coroutines.flow.Flow import kotlinx.coroutines.flow.flow import kotlinx.coroutines.guava.await import java.io.File class MediaPipeEngineWithFunctionCallingImpl( private val application: Application, private val common: MediaPipeEngineCommon, ) : MediaPipeEngineWithFunctionCalling { companion object { private const val TAG = "MediaPipeEngineWithFunctionCallingsTag" private const val DEFAULT_MIN_SIMILARITY_SCORE = 0.8 private const val TOOLS_CODE = "tool_code" private const val RESULTS = "results" private const val TOOLS_ACTION_SEARCH_DOCS = "search_docs" private const val TOOLS_ACTION_SEARCH_DOCS_DESCRIPTION = "Searches knowledge and returns the most relevant results as plain text." private const val TOOLS_PARAM_QUERY = "query" private const val TOOLS_PARAM_QUERY_DESCRIPTION = "User query to search in the vector store." private const val TOOLS_PARAM_TOP_K = "top_k" private const val TOOLS_PARAM_TOP_K_DESCRIPTION = "Number of results to return (default 5)." private const val TOOLS_PARAM_MIN_SIMILARITY_SCORE = "min_similarity_score" private const val MIN_SIMILARITY_SCORE_DESCRIPTION = """ Minimum similarity score threshold (float) for filtering search results, from 0.0 (no filtering) to 1.0 (exact match). Start with $DEFAULT_MIN_SIMILARITY_SCORE, and if no results are found, lower the value and retry the search. """ // A lot depends on this template // incorrectly selected parameters will make calling the tool impossible // or after calling the tool the generation will stop // for other LLM models the template may differ // ```tool_code works well for Gemma 3n, it seems it was trained on this keyword // if nothing is found through the tools, the prompt indicates that the similarity can be reduced // The tools can be completely different, for example, a SQL query, an Internet query, it is important to describe them correctly private val PROMPT_TEMPLATE_WITH_TOOLS = """ You are an on-device assistant. 
You have access to special tools (also called: "function call", "invoke tool", "use API", "search", "lookup", "query tool") If you decide to invoke any of the function, it should be wrapped with ```$TOOLS_CODE``` You have access to the following tools. * `$TOOLS_ACTION_SEARCH_DOCS`: Searches knowledge and returns the most relevant results as plain text. WHEN TO USE A TOOL - If you do not have enough information to answer with high confidence. - If the user explicitly or implicitly asks to check/verify/find out/look up ("check via tools", "verify", "lookup", etc.). Tool args: $TOOLS_PARAM_QUERY: User string query to search in the vector store. $TOOLS_PARAM_TOP_K: Integer number of results to return (default 5). $TOOLS_PARAM_MIN_SIMILARITY_SCORE: Minimum similarity score threshold (float) for filtering search results, from 0.0 (no filtering) to 1.0 (exact match). Start with $DEFAULT_MIN_SIMILARITY_SCORE, and if no results are found, lower the value and retry the search. Rules for tool call: ```$TOOLS_CODE $TOOLS_ACTION_SEARCH_DOCS($TOOLS_PARAM_QUERY="<string>", $TOOLS_PARAM_TOP_K=<integer>, $TOOLS_PARAM_MIN_SIMILARITY_SCORE=<float>) ``` Tool response: $RESULTS: Plain text results. IMPORTANT: After receiving tool results, ALWAYS write a natural-language answer for the user in the very next message. If tool results are empty, briefly explain that nothing relevant was found and propose next steps. """.trimIndent() } private var generativeModel: GenerativeModel? = null override fun startEngine( modelFile: File, isSupportImages: Boolean, engineParams: MediaPipeEngineParams, ) { logDebug("startEngine", TAG) val interfaceOptions = createInterfaceOptions( modelFile = modelFile, engineParams = engineParams, isSupportImages = isSupportImages, ) val engineMediaPipe = LlmInference.createFromOptions( application, interfaceOptions ) if (engineMediaPipe == null) { logError("startEngine llmInference == null", TAG) return } val searchDocs = FunctionDeclaration.newBuilder() .setName(TOOLS_ACTION_SEARCH_DOCS) .setDescription(TOOLS_ACTION_SEARCH_DOCS_DESCRIPTION) .setParameters( Schema.newBuilder() .setType(com.google.ai.edge.localagents.core.proto.Type.OBJECT) .putProperties( TOOLS_PARAM_QUERY, Schema.newBuilder() .setType(com.google.ai.edge.localagents.core.proto.Type.STRING) .setDescription(TOOLS_PARAM_QUERY_DESCRIPTION) .build() ) .putProperties( TOOLS_PARAM_TOP_K, Schema.newBuilder() .setType(com.google.ai.edge.localagents.core.proto.Type.INTEGER) .setDescription(TOOLS_PARAM_TOP_K_DESCRIPTION) .build() ) .putProperties( TOOLS_PARAM_MIN_SIMILARITY_SCORE, Schema.newBuilder() .setType(com.google.ai.edge.localagents.core.proto.Type.NUMBER) .setDescription(MIN_SIMILARITY_SCORE_DESCRIPTION) .build() ) .build() ) .build() val systemInstruction = Content.newBuilder() .setRole(Gemma3nRoles.SYSTEM.type) .addParts( Part.newBuilder().setText( PROMPT_TEMPLATE_WITH_TOOLS ) ) .build() val tool = Tool.newBuilder() .addFunctionDeclarations(searchDocs) .build() val inferenceBackend = LlmInferenceBackend( engineMediaPipe, GemmaFormatter() ) generativeModel = GenerativeModel( inferenceBackend, systemInstruction, listOf(tool), ) logDebug("startEngine ready", TAG) } override fun generateResponse( userQuery: String, maxSteps: Int, ): Flow<ResultEmittedData<String>> = flow { logDebug("generateResponseWithTools userQuery: $userQuery", TAG) try { val generativeModel = generativeModel ?: run { logError("generateResponseWithTools generativeModel is null", TAG) emit( ResultEmittedData.error( model = null, error = null, title = 
"MediaPipe engine error", responseCode = null, message = "Model is not initialized;", errorType = ErrorType.ERROR_IN_LOGIC, ) ) return@flow } val contentPart = Part.newBuilder() .setText(userQuery) .build() val userContent = Content.newBuilder() .setRole(Gemma3nRoles.USER.type) .addParts(contentPart) .build() val conversation = mutableListOf(userContent) var step = 0 // just in case there is a cycle here // the model is sent a request if it thinks the tool needs to be called, // it writes service information, the tool is called and the request with the result is repeated // if you want the model to try to find the best result for a query, // when changing the query text or minimum similarity, write about it in the prompt so that // the model meant that there could be many tool calls while (step < maxSteps) { logDebug("generateResponseWithTools step: $step conversation: ${conversation.size}", TAG) step++ val response: GenerateContentResponse = generativeModel.generateContent( conversation ) val responseContent: Content = response.candidatesList.firstOrNull()?.content ?: run { logError("generateResponseWithTools content is null", TAG) emit( ResultEmittedData.error( model = null, error = null, title = "MediaPipe engine error", responseCode = null, message = "Candidates list is null", errorType = ErrorType.ERROR_IN_LOGIC, ) ) return@flow } val functionCall: FunctionCall? = responseContent.partsList.firstOrNull { it.hasFunctionCall() }?.functionCall // if the model has decided that there is no need to call the tools, we simply send a response to the user if (functionCall == null) { val text = extractText(response) if (text.isBlank()) { logError( "generateResponseWithTools text is blank, response: $response", TAG ) emit( ResultEmittedData.error( model = null, error = null, title = "MediaPipe engine error", responseCode = null, message = "Empty text", errorType = ErrorType.ERROR_IN_LOGIC, ) ) return@flow } logDebug("generateResponseWithTools functionCall is null text: $text", TAG) emit( ResultEmittedData.success( model = text, message = null, responseCode = null ) ) return@flow } if (functionCall.name != TOOLS_ACTION_SEARCH_DOCS) { logError("generateResponseWithTools wrong name: ${functionCall.name}", TAG) val text = extractText(response) if (text.isBlank()) { logError( "generateResponseWithTools text is blank, response: $response", TAG ) emit( ResultEmittedData.error( model = null, error = null, title = "MediaPipe engine error", responseCode = null, message = "Wrong function call", errorType = ErrorType.ERROR_IN_LOGIC, ) ) return@flow } emit( ResultEmittedData.success( model = text, message = null, responseCode = null ) ) return@flow } val args = functionCall.args.fieldsMap // the model returns the text of the database query in the tool call parameters, // number of results and similarity // if nothing is found, the prompt says that the similarity can be reduced val query = args[TOOLS_PARAM_QUERY]?.stringValue val topK = args[TOOLS_PARAM_TOP_K]?.numberValue?.toInt() ?: 5 val minSimilarityScore = args[TOOLS_PARAM_MIN_SIMILARITY_SCORE]?.numberValue?.toFloat() ?: 0.0F if (query.isNullOrEmpty()) { logError("generateResponseWithTools query is null or empty", TAG) val text = extractText(response) if (text.isBlank()) { logError( "generateResponseWithTools text is blank, response: $response", TAG ) emit( ResultEmittedData.error( model = null, error = null, title = "MediaPipe engine error", responseCode = null, message = "Wrong function call", errorType = ErrorType.ERROR_IN_LOGIC, ) ) return@flow } 
logDebug("generateResponseWithTools query is null or empty text: $text", TAG) emit( ResultEmittedData.success( model = text, message = null, responseCode = null ) ) return@flow } val results: String = searchDocsInternal( query, topK, minSimilarityScore ) val respStruct = Struct.newBuilder() .putFields( TOOLS_PARAM_QUERY, Value.newBuilder().setStringValue(query).build() ) .putFields( TOOLS_PARAM_TOP_K, Value.newBuilder().setNumberValue(topK.toDouble()).build() ) .putFields( TOOLS_PARAM_MIN_SIMILARITY_SCORE, Value.newBuilder().setNumberValue(minSimilarityScore.toDouble()).build() ) .putFields( RESULTS, Value.newBuilder().setStringValue(results).build() ) .build() val functionResponse = FunctionResponse.newBuilder() .setName(TOOLS_ACTION_SEARCH_DOCS) .setResponse(respStruct) .build() val functionResponsePart = Part.newBuilder() .setFunctionResponse(functionResponse) .build() val toolContent = Content.newBuilder() .setRole(Gemma3nRoles.MODEL.type) .addParts(functionResponsePart) .build() // we add a model response with a tool call and the tool call itself // into the message chain and start the next iteration of the loop // the model will thus see all its requests and all the results of calling the tool conversation.add(responseContent) conversation.add(toolContent) logDebug("conversation: $conversation", TAG) if (step == maxSteps) { val finalResponse = generativeModel.generateContent(conversation) val text = extractText(finalResponse) if (text.isBlank()) { logError("generateResponseWithTools finalResponse text is blank", TAG) emit( ResultEmittedData.error( title = "MediaPipe engine error", message = "Empty final response", error = null, model = null, responseCode = null, errorType = ErrorType.ERROR_IN_LOGIC, ) ) return@flow } emit( ResultEmittedData.success( model = text, message = null, responseCode = null ) ) return@flow } } } catch (t: Throwable) { logError("generateResponseWithTools failed: ${t.message}", t, TAG) emit( ResultEmittedData.error( model = null, error = t, title = "MediaPipe engine error", responseCode = null, message = t.message, errorType = ErrorType.EXCEPTION, ) ) } } // search in a vector database, while all parameters are set by the model itself private suspend fun searchDocsInternal( query: String, topK: Int, minSimilarityScore: Float, ): String { logDebug("searchDocsInternal query: $query topK: $topK minSimilarityScore: $minSimilarityScore", TAG) val queryEmbedData: EmbedData<String> = EmbedData.create( query, EmbedData.TaskType.RETRIEVAL_QUERY ) val embeddingRequest: EmbeddingRequest<String> = EmbeddingRequest.create(listOf(queryEmbedData)) val vector: ImmutableList<Float> = try { common.embedder.getEmbeddings(embeddingRequest).await() } catch (t: Throwable) { logError( "searchDocsInternal: embedding failed: ${t.message}", t, TAG ) return "No results." } if (vector.isEmpty()) { logError("searchDocsInternal vector.isEmpty()", TAG) return "No results." } val hits: ImmutableList<VectorStoreRecord<String>> = try { common.vectorStore.getNearestRecords( vector, topK, minSimilarityScore ) } catch (t: Throwable) { logError("searchDocsInternal: failed: ${t.message}", t, TAG) return "No results." } if (hits.isEmpty()) { logError("searchDocsInternal hits.isEmpty()", TAG) return "No results." 
} val result = buildString { for (h in hits) { appendLine(h.data.trim()) } }.trim() logDebug("searchDocsInternal ready size: ${result.length}", TAG) return result } private fun extractText(response: GenerateContentResponse): String { response.candidatesList.forEach { candidate -> candidate.content.partsList.forEach { part -> if (part.text.isNotEmpty()) return part.text } } return "" } private fun createInterfaceOptions( modelFile: File, engineParams: MediaPipeEngineParams, isSupportImages: Boolean, ): LlmInferenceOptions { val backend = when (engineParams.backend) { MediaPipeBackendParams.CPU -> LlmInference.Backend.CPU MediaPipeBackendParams.GPU -> LlmInference.Backend.GPU } return LlmInferenceOptions.builder().apply { setModelPath(modelFile.absolutePath) setMaxTokens(engineParams.contextSize) setPreferredBackend(backend) val maxNumImages = if (isSupportImages) 1 else 0 setMaxNumImages(maxNumImages) if (engineParams.useMaxTopK) setMaxTopK(engineParams.maxTopK) }.build() } }
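And a similar sketch for the function-calling variant; here the model decides on its own whether to call search_docs, so the only inputs are the user query and the step limit (the query text is illustrative):
import java.io.File

// A minimal sketch: the model may call the search_docs tool up to maxSteps times before answering
suspend fun askWithTools(
    engine: MediaPipeEngineWithTools,
    modelFile: File,
    engineParams: MediaPipeEngineParams,
) {
    engine.startEngine(
        modelFile = modelFile,
        engineParams = engineParams,
    )
    engine.generateResponse(
        userQuery = "Check in the saved documents when the team meeting is",
        maxSteps = 3,
    ).collect { result ->
        result
            .onSuccess { answer, _, _ -> println("answer: $answer") }
            .onFailure { _, _, message, _, _ -> println("error: $message") }
    }
}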
I hope it was interesting.
For those who want to reproduce this, here are the auxiliary classes used:
enum class ErrorType { EXCEPTION, SERVER_ERROR, ERROR_IN_LOGIC, SERVER_DATA_ERROR, NO_INTERNET_CONNECTION, AUTHORIZATION } data class ResultEmittedData<out T>( val model: T?, val error: Any?, val status: Status, val title: String?, val message: String?, val responseCode: Int?, val errorType: ErrorType?, ) { enum class Status { SUCCESS, ERROR, LOADING, } companion object { fun <T> success( model: T, message: String?, responseCode: Int?, ): ResultEmittedData<T> = ResultEmittedData( error = null, title = null, model = model, errorType = null, message = message, status = Status.SUCCESS, responseCode = responseCode, ) fun <T> loading( model: T? = null, message: String? = null, ): ResultEmittedData<T> = ResultEmittedData( model = model, error = null, title = null, errorType = null, message = message, responseCode = null, status = Status.LOADING, ) fun <T> error( model: T?, error: Any?, title: String?, message: String?, responseCode: Int?, errorType: ErrorType?, ): ResultEmittedData<T> = ResultEmittedData( model = model, error = error, title = title, message = message, errorType = errorType, status = Status.ERROR, responseCode = responseCode, ) } } inline fun <T : Any> ResultEmittedData<T>.onLoading( action: ( model: T?, message: String?, ) -> Unit ): ResultEmittedData<T> { if (status == ResultEmittedData.Status.LOADING) action( model, message ) return this } inline fun <T : Any> ResultEmittedData<T>.onSuccess( action: ( model: T, message: String?, responseCode: Int?, ) -> Unit ): ResultEmittedData<T> { if (status == ResultEmittedData.Status.SUCCESS && model != null) action( model, message, responseCode, ) return this } inline fun <T : Any> ResultEmittedData<T>.onFailure( action: ( model: Any?, title: String?, message: String?, responseCode: Int?, errorType: ErrorType?, ) -> Unit ): ResultEmittedData<T> { if (status == ResultEmittedData.Status.ERROR) action( model, title, message, responseCode, errorType ) return this }
data class MediaPipeEngineParams(
    val name: String,
    val topK: Int,
    val topP: Float,
    val temperature: Float,
    val randomSeed: Int,
    val contextSize: Int,
    val maxTopK: Int,
    val useTopK: Boolean,
    val useTopP: Boolean,
    val useTemperature: Boolean,
    val useRandomSeed: Boolean,
    val useMaxTopK: Boolean,
    val backend: MediaPipeBackendParams,
)

enum class MediaPipeBackendParams {
    CPU,
    GPU
}

fun Long.toDurationString(): String {
    var msRemaining = this
    val years = msRemaining / (365L * 24 * 60 * 60 * 1000)
    msRemaining %= (365L * 24 * 60 * 60 * 1000)
    val months = msRemaining / (30L * 24 * 60 * 60 * 1000)
    msRemaining %= (30L * 24 * 60 * 60 * 1000)
    val days = msRemaining / (24L * 60 * 60 * 1000)
    msRemaining %= (24L * 60 * 60 * 1000)
    val hours = msRemaining / (60L * 60 * 1000)
    msRemaining %= (60L * 60 * 1000)
    val minutes = msRemaining / (60L * 1000)
    msRemaining %= (60L * 1000)
    val seconds = msRemaining / 1000
    val milliseconds = msRemaining % 1000
    return buildString {
        if (years > 0) append("$years years, ")
        if (months > 0) append("$months months, ")
        if (days > 0) append("$days days, ")
        if (hours > 0) append("$hours hours, ")
        if (minutes > 0) append("$minutes minutes, ")
        if (seconds > 0) append("$seconds seconds, ")
        append("$milliseconds milliseconds")
    }
}
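As an example, parameters for running gemma-3n on the GPU might look like this; the values are illustrative and should be tuned per device and model:
// Illustrative values only – tune topK/topP/temperature/contextSize for your model and device
val exampleEngineParams = MediaPipeEngineParams(
    name = "gemma-3n-E4B-it-int4",
    topK = 40,
    topP = 0.95F,
    temperature = 0.2F,
    randomSeed = 0,
    contextSize = 4096,
    maxTopK = 64,
    useTopK = true,
    useTopP = true,
    useTemperature = true,
    useRandomSeed = false,
    useMaxTopK = false,
    backend = MediaPipeBackendParams.GPU,
)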
import android.annotation.SuppressLint import android.os.Bundle import android.os.Environment import android.util.Log import com.google.firebase.analytics.FirebaseAnalytics import kotlinx.coroutines.CoroutineScope import kotlinx.coroutines.Dispatchers import kotlinx.coroutines.Job import kotlinx.coroutines.launch import org.koin.core.component.KoinComponent import org.koin.core.component.inject import java.io.File import java.io.FileOutputStream import java.text.SimpleDateFormat import java.util.Date object LogUtil : KoinComponent { private val timeDirectoryName: String private const val QUEUE_CAPACITY = 10000 private const val CURRENT_TAG = "LogUtilExecutionStatusTag" private const val LOG_APP_FOLDER_NAME = "app" private const val TIME_FORMAT_FOR_LOG = "HH:mm:ss dd-MM-yyyy" private const val TIME_FORMAT_FOR_DIRECTORY = "HH-mm-ss_dd-MM-yyyy" private const val TAG = "TAG: " private const val TIME = "TIME: " private const val ERROR_STACKTRACE = "ERROR STACKTRACE: " private const val ERROR_MESSAGE = "ERROR: " private const val DEBUG_MESSAGE = "MESSAGE: " private const val NEW_LINE = "\n" private val queue = ArrayDeque<LogData>(QUEUE_CAPACITY) private var saveLogsToTxtFileJob: Job? = null private val analytics: FirebaseAnalytics by inject() @Volatile private var isSaveLogsToTxtFile = false init { Log.d(CURRENT_TAG, "init") timeDirectoryName = getCurrentTimeForDirectory() } fun logDebug(message: String, tag: String) { CoroutineScope(Dispatchers.IO).launch { if (BuildConfig.DEBUG) { Log.d(tag, message) enqueue( LogData.DebugMessage( tag = tag, time = System.currentTimeMillis(), message = message, ) ) saveLogsToTxtFile() } } } fun logError(message: String, tag: String) { CoroutineScope(Dispatchers.IO).launch { if (BuildConfig.DEBUG) { Log.e(tag, message) enqueue( LogData.ErrorMessage( tag = tag, time = System.currentTimeMillis(), message = message, ) ) saveLogsToTxtFile() } } } fun logError(exception: Throwable, tag: String) { CoroutineScope(Dispatchers.IO).launch { if (BuildConfig.DEBUG) { Log.e(tag, exception.message, exception) enqueue( LogData.ExceptionMessage( tag = tag, time = System.currentTimeMillis(), exception = exception, ) ) saveLogsToTxtFile() } } } fun logError(message: String, exception: Throwable, tag: String) { CoroutineScope(Dispatchers.IO).launch { if (BuildConfig.DEBUG) { Log.e(tag, "$message, exception: ${exception.message}", exception) enqueue( LogData.ErrorMessageWithException( tag = tag, time = System.currentTimeMillis(), message = message, exception = exception, ) ) saveLogsToTxtFile() } } } fun logError(message: String, error: String?, tag: String) { CoroutineScope(Dispatchers.IO).launch { if (BuildConfig.DEBUG) { Log.e(tag, "$message, error: $error") enqueue( LogData.ErrorMessage( tag = tag, time = System.currentTimeMillis(), message = message, ) ) saveLogsToTxtFile() } } } @SuppressLint("SimpleDateFormat") private fun getTime(time: Long): String { return try { val date = Date(time) val timeString = SimpleDateFormat(TIME_FORMAT_FOR_LOG).format(date) timeString.ifEmpty { Log.e(CURRENT_TAG, "getTime time.ifEmpty") time.toString() } } catch (e: Exception) { Log.e(CURRENT_TAG, "getCurrentTime exception: ${e.message}", e) time.toString() } } @SuppressLint("SimpleDateFormat") private fun getCurrentTimeForDirectory(): String { val time = System.currentTimeMillis() return try { val date = Date(time) val timeString = SimpleDateFormat(TIME_FORMAT_FOR_DIRECTORY).format(date) Log.d(CURRENT_TAG, "getCurrentTimeForDirectory time: $time") timeString.ifEmpty { Log.e(CURRENT_TAG, 
"getCurrentTimeForDirectory time.ifEmpty") time.toString() } } catch (e: Exception) { Log.e(CURRENT_TAG, "getCurrentTimeForDirectory exception: ${e.message}", e) time.toString() } } private fun enqueue(message: LogData) { try { while (queue.size >= QUEUE_CAPACITY) { Log.d(CURRENT_TAG, "enqueue removeFirst") queue.removeFirst() } queue.addLast(message) } catch (e: Exception) { Log.e(CURRENT_TAG, "enqueue exception: ${e.message}", e) } } }
Example 6 – CAG (Cache-Augmented Generation) in llama.cpp
Here I will sketch a concept of how this could work.
This particular code is slow and not suitable for production use.
Example of getting the model context as a byte array.
The resulting context can then be saved to a database or file.
Context may include prompts, questions and answers, added documents, and so on.
#include <vector>
#include <cstdint>
#include <stdexcept>
#include <cstddef>

#include "llama.h"

std::vector<uint8_t> get_full_state_raw(llama_context* ctx) {
    // Check that the context pointer is not nullptr
    if (ctx == nullptr) {
        throw std::invalid_argument("llama_context pointer is null");
    }
    // Get the size of the model state (in bytes)
    const size_t state_size = llama_state_get_size(ctx);
    if (state_size == 0) {
        throw std::runtime_error("llama_state_get_size returned 0");
    }
    // Allocate a buffer of the required size to store the state
    std::vector<uint8_t> out(state_size);
    // Copy the binary state data into the buffer
    const size_t written = llama_state_get_data(ctx, out.data(), out.size());
    // Check that the written size matches the expected one
    if (written != state_size) {
        throw std::runtime_error("state size changed during serialization");
    }
    // Return the state as a byte array
    return out;
}
Example of restoring the context from a byte array.
It is important to remember that the context can only be restored for the same model it was saved from, since it contains model-specific intermediate states. It is also not possible to merge several contexts – there is no fully modular architecture for models yet.
#include <vector>
#include <cstdint>
#include <stdexcept>
#include <cstddef>

#include "llama.h"

int set_full_state_raw(llama_context* ctx, const std::vector<uint8_t>& data) {
    // Check that the context and the data have been passed in
    if (ctx == nullptr) {
        throw std::invalid_argument("llama_context pointer is null");
    }
    if (data.empty()) {
        throw std::invalid_argument("data is empty");
    }
    // Restore the model state from the byte array
    const size_t written = llama_state_set_data(ctx, data.data(), data.size());
    // Return the number of bytes actually loaded
    return static_cast<int>(written);
}
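Since the rest of the article lives on the Kotlin side, here is a sketch of how these two native helpers could be bridged over JNI and used to persist the context to a file. The bridge object, the native library name and the way the llama_context pointer is passed as a Long are all assumptions, not part of llama.cpp.
import java.io.File

// Hypothetical JNI bridge to the two C++ helpers above; names and signatures are assumptions
object LlamaStateBridge {
    init {
        System.loadLibrary("llama_state_bridge") // hypothetical native library name
    }

    // Wraps get_full_state_raw(): returns the serialized llama_context state
    external fun getFullStateRaw(contextHandle: Long): ByteArray

    // Wraps set_full_state_raw(): loads a previously saved state, returns the number of bytes applied
    external fun setFullStateRaw(contextHandle: Long, data: ByteArray): Int
}

// Save the current context (system prompt, dialogue, attached documents) to a file
fun saveContextToFile(contextHandle: Long, file: File) {
    file.writeBytes(LlamaStateBridge.getFullStateRaw(contextHandle))
}

// Restore the context later, into a context created for the same model
fun restoreContextFromFile(contextHandle: Long, file: File): Int {
    return LlamaStateBridge.setFullStateRaw(contextHandle, file.readBytes())
}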