Hello!
In this article I will describe my experience implementing RAG and function calling for an LLM running locally on a phone.
For cloud models the general idea is the same, so after reading you will understand how it works and what it is used for.
What is interesting here is that success depends heavily on how well the prompt describes what we want from the model. Good code alone is not enough; there is also a "humanities" part: you have to explain everything to the model and hope it has understood you correctly.
Content
➤ What is RAG (Retrieval-Augmented Generation)
➤ What is a vector database
➤ What is Function Calling
➤ What is CAG (Cache-Augmented Generation)
➤ Example 1 – Low-Level Function Calling Implementation in Kotlin
➤ Example 2 – Python framework DSPy
➤ Example 3 – Python framework LangChain
➤ Example 4 – Java framework LangChain4j
➤ Example 5 – Google's Java libraries for mobile devices
➤ Example 6 – CAG (Cache-Augmented Generation) in LLama.cpp
What is RAG (Retrieval-Augmented Generation)
An LLM absorbs terabytes of information during training, but it may know nothing about you, your project, your documents, or your cat.
You can describe all of this in the request to the model, in the prompt, but if there is a lot of information, the model's context window may run out before you even get to the question, or you will simply get tired of typing.
To make the model aware of information it was not trained on, RAG (Retrieval-Augmented Generation) is used.
RAG can have many different implementations, but the general idea is:
Additional information is added to your message, usually in text form, for example:
- Internet query result
- Data from the database
- Text from the attached document, and so on
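As a minimal sketch of this general idea (the function and parameter names here are purely illustrative, not part of any library), augmenting a request could look like this:

fun buildAugmentedPrompt(
    userQuestion: String,
    retrieveRelevantText: (String) -> List<String> // any retrieval step: web search, DB query, document text
): String {
    val context = retrieveRelevantText(userQuestion).joinToString(separator = "\n")
    // the retrieved context is simply placed into the prompt before the actual question
    return """
        Use the following context to answer the question.
        Context:
        $context
        Question:
        $userQuestion
    """.trimIndent()
}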
In this example I will use a vector database, which works great with LLMs.
What is a vector database
Before querying the model, you need to load your data into the database so that the most relevant pieces can later be found for your query; both steps are sketched in code after the lists below.
When adding text to the database:
- The text is broken into pieces
- Using a special embedding model (Gecko), a 768-dimensional vector is generated for each piece; this vector represents the meaning of that piece of text as closely as possible
- The vector along with the text is written to the database
When searching text in the database:
- The search query is converted into a vector using the Gecko model
- The database is searched for a vector that is as close as possible to the search query vector.
- The text stored with that vector is retrieved
- That text is added to your request to the LLM
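A minimal in-memory sketch of what such a vector store does under the hood (the class and its embed parameter are illustrative assumptions, not part of any SDK; in the examples below the embeddings come from the Gecko model and are stored in SQLite):

class InMemoryVectorStore(private val embed: (String) -> FloatArray) {

    // each entry keeps the embedding vector together with the original text chunk
    private data class Entry(val vector: FloatArray, val text: String)
    private val entries = mutableListOf<Entry>()

    // adding text: embed the chunk and store (vector, text)
    fun add(chunk: String) {
        entries.add(Entry(embed(chunk), chunk))
    }

    // searching: embed the query and return the texts whose vectors are closest to it
    fun search(query: String, topK: Int = 5): List<String> {
        val queryVector = embed(query)
        return entries
            .sortedByDescending { cosineSimilarity(queryVector, it.vector) }
            .take(topK)
            .map { it.text }
    }

    private fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
        var dot = 0f
        var normA = 0f
        var normB = 0f
        for (i in a.indices) {
            dot += a[i] * b[i]
            normA += a[i] * a[i]
            normB += b[i] * b[i]
        }
        return dot / (kotlin.math.sqrt(normA) * kotlin.math.sqrt(normB))
    }
}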
What is Function Calling
The difference from RAG is that here the model itself decides whether to query the vector database or another source to obtain information.
To do this:
- the prompt tells the model that it has access to function calls, describes their format, and describes the conditions under which the model should make such a call
- if the model decides to make such a call, it returns a service message requesting the call
- the parser detects that a call is requested, reads the data from the vector database (or other source), appends it to the message queue, and sends the request again (a sketch of this loop follows below)
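Schematically, the host-side loop looks roughly like this (a sketch with placeholder functions passed in as parameters; the concrete low-level implementation is shown in Example 1):

fun answerWithTools(
    userMessage: String,
    askModel: (List<String>) -> String,     // sends the whole message history to the LLM
    parseToolCall: (String) -> String?,     // returns the requested call, or null if it is a normal answer
    runTool: (String) -> String             // executes the requested call and returns its result as text
): String {
    val messages = mutableListOf(userMessage)
    while (true) {
        val reply = askModel(messages)
        // no service keyword found -> this is the final answer for the user
        val toolCall = parseToolCall(reply) ?: return reply
        // keep the model's tool request and the tool result in the history, then ask again
        messages.add(reply)
        messages.add(runTool(toolCall))
    }
}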
What is CAG (Cache-Augmented Generation)
A technique for speeding up LLM inference and reducing its cost, based on reusing the results of past queries.
During inference, the LLM stores intermediate representations of the text (hidden states, attention caches).
For repeated or similar queries, the entire sequence does not need to be run through the model. Instead, the saved cache is used and the model only refines the "new" part.
This reduces the amount of computation and lowers latency.
Application options:
Token-level caching:
KV attention caches are saved for already processed tokens (already used in llama.cpp, Transformers and others).
Prompt-level caching:
Results for frequently occurring prompts are saved and reused.
Semantic caching:
Responses are stored by semantic similarity (e.g. via a vector store). If a new query is similar to an old one, the stored result is returned or used as context; a minimal sketch of this variant is given after this section.
Difference from regular RAG:
RAG (Retrieval-Augmented Generation) pulls external data from the knowledge base.
CAG saves computation by reusing past model runs.
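A minimal sketch of the semantic-caching variant (the class, the embed parameter, and the threshold value are illustrative assumptions, not part of any library):

class SemanticCache(
    private val embed: (String) -> FloatArray,      // embedding model, e.g. the same one used for RAG
    private val similarityThreshold: Float = 0.9f   // how close a new query must be to reuse an old answer
) {
    private data class CachedAnswer(val queryVector: FloatArray, val answer: String)
    private val cache = mutableListOf<CachedAnswer>()

    // returns a stored answer if a sufficiently similar query was already answered, otherwise null
    fun lookup(query: String): String? {
        val vector = embed(query)
        return cache.firstOrNull { cosine(vector, it.queryVector) >= similarityThreshold }?.answer
    }

    fun store(query: String, answer: String) {
        cache.add(CachedAnswer(embed(query), answer))
    }

    private fun cosine(a: FloatArray, b: FloatArray): Float {
        var dot = 0f
        var normA = 0f
        var normB = 0f
        for (i in a.indices) {
            dot += a[i] * b[i]
            normA += a[i] * a[i]
            normB += b[i] * b[i]
        }
        return dot / (kotlin.math.sqrt(normA) * kotlin.math.sqrt(normB))
    }
}

On a cache hit the model is not called at all, or the cached text is attached to the new request as additional context.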
Example 1 – Low-Level Function Calling Implementation in Kotlin
Here I will give an example of how Function Calling works from an implementation point of view.
First, we write a prompt template that explains to the LLM, in as much detail as possible, that it has access to function calls and how to work with them.
It is very important to define a keyword by which we can later detect that the model wants to make a function call. In this example it is tool_code; it must be unique and must not otherwise appear in the text.
Next we describe the structures we want to receive. In this example it is JSON, but the format can be anything; I use JSON because it is easy to parse into a class, although in terms of token usage it is not optimal.
If the structure is complex and there is a chance the model will make a mistake, you can return a parsing error, add it to the message queue along with a note that the model made a mistake and must correct its request, and send everything back to the model for another attempt.
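A sketch of such a retry loop, reusing the ChatMessage and ChatResult types defined later in this example (the askModel and parse parameters are placeholders for the real calls):

fun callToolWithRetries(
    history: MutableList<ChatMessage>,
    askModel: (List<ChatMessage>) -> String,   // sends the history to the LLM and returns the raw reply
    parse: (String) -> ChatResult?,            // e.g. parseToolJson from this example
    maxAttempts: Int = 3
): ChatResult? {
    repeat(maxAttempts) {
        val reply = askModel(history)
        parse(reply)?.let { return it } // parsed successfully, stop retrying
        // keep the broken reply in the history and ask the model to correct itself
        history.add(ChatMessage(role = "assistant", content = reply))
        history.add(
            ChatMessage(
                role = "user",
                content = "Your previous tool call was not valid JSON in the required format. " +
                        "Correct it and send the tool_code block again."
            )
        )
    }
    return null // give up after maxAttempts
}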
In this example I used two function calls:
get_weather
{ "name": "get_weather", "arguments": { "location": "New York" } }
This is a very simple call with only one parameter, which is hard to get wrong.
search_docs
{ "name": "search_docs", "arguments": { "query": "quantum computing basics", "top_k": 5, "min_similarity_score": 0.6 } }
This is a search query to the database, which will look for matching documents.
This call can be placed in a loop with starting parameters and described in the prompt – if the model does not get suitable results, it can lower min_similarity_score or change the wording of the query and make the call again.
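The same fallback can also be driven from the host side; here is a minimal sketch, with the searchDocs parameter standing in for the real vector search:

fun searchWithFallback(
    searchDocs: (query: String, topK: Int, minScore: Float) -> List<String>, // placeholder for the real vector search
    query: String,
    topK: Int = 5,
    startScore: Float = 0.6f,
    step: Float = 0.1f
): List<String> {
    var minScore = startScore
    while (minScore >= 0f) {
        val results = searchDocs(query, topK, minScore)
        if (results.isNotEmpty()) return results
        minScore -= step // relax the similarity filter and try again
    }
    return emptyList()
}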
It is very important to give the model detailed instructions for each step in the prompt. The model has no motivation or intuition to work out what it should do next, and if the next steps are not described, it will return an empty message.
It is also important to understand that a prompt is not program code: whether it is executed correctly is probabilistic, and the probability depends on how well we have described all the actions and structures.
At the end of the prompt, it is important to state that the model will receive the result in JSON format whose structure is not described in the prompt, and that the model itself must work out how to formulate a response to the user in accordance with the original request.
The scenario in the prompt can be customized almost endlessly. For example, if we need data from a database, we can describe the database schema and ask the model to generate an SQL query matching the user's request, which we then execute to get the data; in this case it is important not to forget to restrict the connection rights to read-only.
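A rough sketch of that idea with hypothetical helpers: askModel stands for the LLM call that turns the schema and the user request into SQL, the JDBC connection is opened as read-only, and the generated statement is additionally checked to be a SELECT:

import java.sql.DriverManager

fun answerFromDatabase(
    askModel: (prompt: String) -> String,   // placeholder for the LLM call
    jdbcUrl: String,
    schemaDescription: String,
    userRequest: String
): List<String> {
    // describe the schema in the prompt and ask the model for a single SELECT query
    val sql = askModel(
        "Database schema:\n$schemaDescription\n" +
                "Write a single SQL SELECT query answering the request:\n$userRequest"
    ).trim().removeSuffix(";")
    require(sql.startsWith("select", ignoreCase = true)) { "Only SELECT queries are allowed" }

    val rows = mutableListOf<String>()
    DriverManager.getConnection(jdbcUrl).use { connection ->
        connection.isReadOnly = true // a hint to the driver; rights should also be limited on the database side
        connection.createStatement().use { statement ->
            statement.executeQuery(sql).use { resultSet ->
                val columns = resultSet.metaData.columnCount
                while (resultSet.next()) {
                    rows.add((1..columns).joinToString(", ") { resultSet.getString(it) ?: "" })
                }
            }
        }
    }
    return rows
}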
Some models are trained to perform function calls with a specific format and keywords, and may also have a separate role for the call result, such as function. I used a model that was not trained for this, and it did a great job; I think any modern, more or less "smart" model can handle it.
Full text of the prompt:
const val DEFAULT_FUNCTION_CALLING_PROMPT = """ Call tools only when the user explicitly asks to check, verify, look up, find, or search for something. In all other cases, answer directly without calling any tool. If you decide to invoke a function, output only a JSON object inside a fenced block marked as tool_code without any text before or after it. Available tools: search_docs Description: Searches knowledge and returns the most relevant results as plain text. Arguments: query (string) — User string query to search in the vector store. top_k (integer, default 5) — Number of results to return. min_similarity_score (float, default 0.6) — Minimum similarity score from 0.0 (no filtering) to 1.0 (exact match). get_weather Description: Gets current weather for a location. Arguments: location (string) — City or place name. JSON format inside the fenced block: tool_code { "name": "<tool_name>", "arguments": { "<arg_name>": <arg_value> } } Examples: tool_code { "name": "search_docs", "arguments": { "query": "quantum computing basics", "top_k": 5, "min_similarity_score": 0.6 } } tool_code { "name": "get_weather", "arguments": { "location": "New York" } } Important rules: Use tools only when explicitly requested by the user to check, verify, look up, find, or search. Otherwise, provide a direct answer. When calling a tool, the output must be valid JSON with exactly two keys: name and arguments. Do not include any explanations or extra content outside the tool_code block. After receiving tool results (the result will be in JSON format): 1. Parse and process the result. 2. Convert it into a simple, clear, and human-readable natural-language response. 3. Always reply to the user with this processed answer as the next message. 4. If no relevant results are found, briefly explain that nothing relevant was found and suggest next steps. """
In this example I used the local models Gemma and DeepSeek, launched in LM Studio, which exposes an OpenAI-compatible API: my application connects to it over localhost and makes requests. Cloud models will do even better, since they have many more parameters and a dedicated API for function calls, but in this example I specifically wanted to show the lowest-level implementation.
fun main() = runBlocking { val messages = mutableListOf( // Add a prompt to the message queue with the system role ChatMessage( role = "system", content = DEFAULT_FUNCTION_CALLING_PROMPT ), // We add a user request to the queue, this request makes it clear, // that the user asks to check the weather, the model knows that it has access // the weather check function described in the prompt and calls it, // by sending a response message starting with tool_code ChatMessage( role = "user", content = "Check weather in London", ) ) var iterator = 0 while (iterator < 3) { try { chatAnswerWithToolsStream( ChatCompletionRequest( model = MODEL_ID, messages = messages, temperature = 0.2 ) ).collect { piece -> when (piece) { is ChatResult.Debug -> { println("DEBUG MESSAGE:\n${piece.message}") } is ChatResult.Message -> { println("FINAL MESSAGE:\n${piece.message}") iterator = 3 return@collect } is ChatResult.SearchDocs -> { println("SEARCH DOCS data:\n$piece") iterator = 3 // In this example I will only use the weather check, // but added one more function to show that there can be many of them return@collect } is ChatResult.GetWeather -> { println("GET WEATHER result:\n$piece") // If the model decides to call a function, we add its call // to the message queue with the assistant role // It is important to add all messages to the queue so that // the model meant the whole history of the dialogue messages.add( ChatMessage( role = "assistant", content = piece.message ?: "" ) ) // request API to get weather data val result = getCurrentWeatherByCity( city = piece.location!!, ) val jsonResult: String = kotlinxJsonConfig.encodeToString(result) println("GET WEATHER request result:\n$jsonResult") // we add to the message queue with the user role, since this model // does not support special roles for function calls messages.add( ChatMessage( role = "user", content = jsonResult ) ) iterator++ } } } } catch (e: Throwable) { println(e.message ?: "error") break } } } // code for working with OpenAI API, the function call is important here // parseToolJson, which will return null if it fails to convert the response into a function call fun chatAnswerWithToolsStream(req: ChatCompletionRequest): Flow<ChatResult> = flow { val response = client.post("$BASE_URL/api/v0/chat/completions") { contentType(ContentType.Application.Json) setBody(req.copy(stream = true)) } val channel: ByteReadChannel = response.bodyAsChannel() val fullBuilder = StringBuilder() while (!channel.isClosedForRead) { val line: String = channel.readUTF8Line() ?: break val payload: String = extractDataLine(line) ?: continue val chunk: StreamChatChunk = runCatching { kotlinxJsonConfig.decodeFromString(StreamChatChunk.serializer(), payload) }.getOrNull() ?: continue val piece: String = extractChatDeltaContent(chunk) if (piece.isNotEmpty()) { fullBuilder.append(piece) emit(ChatResult.Debug(message = fullBuilder.toString())) } if (isFinished(chunk)) break } val fullMessage: String = fullBuilder.toString().trim() val toolResult: ChatResult? = parseToolJson(fullMessage) if (toolResult != null) { emit(toolResult) } else { emit(ChatResult.Message(message = fullMessage)) } } fun parseToolJson(input: String): ChatResult? 
{ try { val json = when { // if the response starts with tool_code, we understand that this is not a response to the user // and the function call will be followed by json with its type and parameters input.startsWith("tool_code") -> input.replace("tool_code", "") .trim() input.startsWith("```tool_code") -> input.replace("```tool_code", "") .removeSuffix("```") .trim() else -> return null } val rawTool = runCatching { gson.fromJson(json, RawTool::class.java) }.getOrNull() ?: return null val jsonObject = rawTool.arguments?.asJsonObject ?: return null // we determine which function the model wants to call // and transform json into a data model. // This code can be improved by returning an error to the model if the transformation // it won't work, with the description that the model should edit her answer and correct it return when (rawTool.name?.lowercase()) { "search_docs" -> { gson.fromJson(jsonObject, ChatResult.SearchDocs::class.java).copy( message = json ) } "get_weather" -> gson.fromJson(jsonObject, ChatResult.GetWeather::class.java).copy( message = json ) else -> null } } catch (e: Exception) { println(e.message ?: "error") return null } }
Additional code, if you want to reproduce the example:
ext { ktor_version = "3.2.3" kotlinx_serialization_json_version = "1.8.1" logback_version = "1.4.14" } dependencies { implementation "io.ktor:ktor-client-core:$ktor_version" implementation "io.ktor:ktor-client-cio:$ktor_version" implementation "io.ktor:ktor-client-content-negotiation:$ktor_version" implementation "io.ktor:ktor-serialization-kotlinx-json:$ktor_version" implementation "io.ktor:ktor-client-logging:$ktor_version" implementation "org.jetbrains.kotlinx:kotlinx-serialization-json:$kotlinx_serialization_json_version" implementation "ch.qos.logback:logback-classic:$logback_version" implementation "com.google.code.gson:gson:2.13.1" }
import io.ktor.client.call.* import io.ktor.client.request.* suspend fun getCurrentWeatherByCity( apiKey: String = WEATHER_API_KEY, city: String, units: String = "metric", lang: String = "en" ): WeatherResponse { return client.get(OWM_BASE) { url { parameters.append("q", city) parameters.append("appid", apiKey) parameters.append("units", units) parameters.append("lang", lang) } }.body() }
import com.google.gson.JsonElement import kotlinx.serialization.SerialName import kotlinx.serialization.Serializable import kotlinx.serialization.json.JsonObject @Serializable data class ChatMessage( @SerialName("role") val role: String, @SerialName("content") val content: String ) @Serializable data class ChatCompletionRequest( @SerialName("model") val model: String, @SerialName("messages") val messages: List<ChatMessage>, @SerialName("temperature") val temperature: Double? = null, @SerialName("max_tokens") val maxTokens: Int? = null, @SerialName("stream") val stream: Boolean? = null ) @Serializable data class StreamChatChunk( @SerialName("choices") val choices: List<StreamChatChoice>? = null ) @Serializable data class ChatChoiceMessage( @SerialName("content") val content: String? = null ) @Serializable data class ChatChoice( @SerialName("message") val message: ChatChoiceMessage? = null ) @Serializable data class ChatResponse( @SerialName("choices") val choices: List<ChatChoice> = emptyList() ) @Serializable data class StreamChatChoice( @SerialName("delta") val delta: JsonObject? = null, @SerialName("finish_reason") val finishReason: String? = null ) data class RawTool( val name: String?, val arguments: JsonElement? ) sealed interface ChatResult { data class Debug( val message: String?, ) : ChatResult data class Message( val message: String?, ) : ChatResult data class SearchDocs( val message: String? = null, val query: String?, val topK: Int?, val minSimilarityScore: Float? ) : ChatResult data class GetWeather( val message: String? = null, val location: String?, ) : ChatResult }
import kotlinx.serialization.json.JsonElement import kotlinx.serialization.json.contentOrNull import kotlinx.serialization.json.jsonPrimitive fun extractDataLine(line: String): String? { val t: String = line.trim() if (t.isEmpty()) return null if (t == "data: [DONE]" || t == "[DONE]") return null if (t.startsWith("event:")) return null return if (t.startsWith("data:")) t.removePrefix("data:").trim() else t } fun extractChatDeltaContent(chunk: StreamChatChunk): String { val choice: StreamChatChoice = chunk.choices?.firstOrNull() ?: return "" val delta = choice.delta ?: return "" val el: JsonElement? = delta["content"] return el?.jsonPrimitive?.contentOrNull ?: "" } fun extractTextDelta(chunk: StreamTextChunk): String { val choice: StreamTextChoice = chunk.choices?.firstOrNull() ?: return "" val direct: String? = choice.text if (direct != null) return direct val delta = choice.delta val el: JsonElement? = delta?.get("content") ?: delta?.get("text") return el?.jsonPrimitive?.contentOrNull ?: "" } fun isFinished(chunk: StreamChatChunk): Boolean { val reason: String? = chunk.choices?.firstOrNull()?.finishReason return !reason.isNullOrEmpty() } fun isFinished(chunk: StreamTextChunk): Boolean { val reason: String? = chunk.choices?.firstOrNull()?.finishReason return !reason.isNullOrEmpty() }
import com.google.gson.FieldNamingPolicy import com.google.gson.Gson import com.google.gson.GsonBuilder import io.ktor.client.* import io.ktor.client.engine.cio.* import io.ktor.client.plugins.* import io.ktor.client.plugins.contentnegotiation.* import io.ktor.serialization.kotlinx.json.* import kotlinx.serialization.json.Json const val BASE_URL: String = "http://localhost:1234" const val OWM_BASE = "https://api.openweathermap.org/data/2.5" const val WEATHER_API_KEY = "..." const val MODEL_ID = "google/gemma-3n-e4b" val gson: Gson = GsonBuilder() .setFieldNamingPolicy(FieldNamingPolicy.LOWER_CASE_WITH_UNDERSCORES) .create() val kotlinxJsonConfig: Json = Json { ignoreUnknownKeys = true prettyPrint = false } val client: HttpClient = HttpClient(CIO) { install(ContentNegotiation) { json(kotlinxJsonConfig) } install(HttpTimeout) { requestTimeoutMillis = 60_000 connectTimeoutMillis = 10_000 socketTimeoutMillis = 60_000 } }
Example 2 – Python framework DSPy
DSPy is an open-source Python framework for "programming, not prompting" LLMs: you describe the behavior of modules in code, and optimizers automatically select prompts and, if desired, fine-tune weights for the chosen metric. Suitable for classifiers, RAG pipelines, and agents.
dspy.ai
https://github.com/stanfordnlp/dspy
Signatures:
declarative specifications of module inputs/outputs.
Modules:
ready-made LLM calling strategies (Predict, ChainOfThought, ReAct, etc.) from which pipelines are assembled.
- Predict: the basic building block. Makes a request to the LLM using a given signature (input -> output). Used for simple tasks such as QA, summarization, and classification.
- Signature: defines the structure of input and output data. Can be written as "q -> a" or as a Python class with annotations. Makes the code readable and strict.
- Chain: lets you link several Predicts into a pipeline. Convenient when there are intermediate steps (extract facts → summarize → answer).
- ReAct: agent mode; the model reasons and calls tools. Lets you connect APIs, databases, and external functions. A typical option for chatbots with access to tools.
- ChainOfThought (CoT): forces the model to spell out its reasoning steps. Convenient for complex problems, mathematics, and logic.
- Retrieve/RAG: integration with vector databases. Lets you add document search to your workflow.
Optimizers (formerly teleprompters):
algorithms that "compile" a program into efficient hints/weights by metric; work even with 5-10 examples.
An example similar in function to the previous one:
requirements.txt
dspy==3.0.1 litellm==1.75.8 openai==1.99.9 optuna>=4.5.0 gepa[dspy]==0.0.4 regex>=2025.7.34 diskcache>=5.6.3 json-repair>=0.49.0 magicattr>=0.1.6 backoff>=2.2.1 asyncer==0.0.8 cachetools>=6.1.0 aiohttp>=3.12.15
import dspy import requests from typing import Dict dspy.enable_logging() OPENWEATHER_API_KEY = "..." # Set provider for LLM lm = dspy.LM( "openai/google/gemma-3n-e4b", api_base="http://localhost:1234/v1", api_key="lm-studio", temperature=0 ) dspy.configure(lm=lm) # Check if the provider was found print("Provider:\n", type(lm.provider)) # Send the model test data using the Predict module probe = dspy.Predict("question -> answer") try: result = probe(question="Who are you", max_tokens=1) print("Model supported by provider, test answer:\n", result.answer) except Exception as e: print("Model not supported by provider:\n", e) def get_weather(city: str, units: str = "metric") -> str: url = "https://api.openweathermap.org/data/2.5/weather" params = { "q": city, "appid": OPENWEATHER_API_KEY, "units": units, "lang": "en" } response = requests.get(url, params=params) if response.status_code == 200: data: Dict = response.json() temp = data["main"]["temp"] description = data["weather"][0]["description"] return f"Weather in {city}: {temp}°C, {description}" else: return f"Error fetching weather: {response.text}" # Adding a question -> answer communication format and # get_weather tool and ReAct module # The framework will generate the prompt itself accordingly # with what module and with what parameters we use agent = dspy.ReAct( "question -> answer", tools=[get_weather] ) result = agent(question="Check weather in London") print("Answer:\n", result.answer) # You can view the history of communication with LLM, # what tools were called and what prompt was generated history = dspy.inspect_history(n=10)
Console output:
Provider: <class 'dspy.clients.openai.OpenAIProvider'> Model supported by provider, test answer: I am Gemma, an open-weights AI assistant. I am a large language model trained by Google DeepMind. Answer: Weather in London: 15.89°C, overcast clouds
Log:
- The model answers a simple question without tools ("Who are you"). Here it immediately produces [[ ## answer ## ]].
- The system sets the task: "You have the get_weather/finish tools." The model thinks and issues the first step: next_tool_name = get_weather.
- The model receives a trajectory with the API result (Weather in London: ...) and must decide what to do next. It writes next_tool_name = finish.
- The system substitutes the trajectory with both steps (get_weather + finish). The model must produce the final [[ ## answer ## ]].
Test message:
System message: Your input fields are: 1. `question` (str): Your output fields are: 1. `answer` (str): All interactions will be structured in the following way, with the appropriate values filled in. [[ ## question ## ]] {question} [[ ## answer ## ]] {answer} [[ ## completed ## ]] In adhering to this structure, your objective is: Given the fields `question`, produce the fields `answer`. User message: [[ ## question ## ]] Who are you Respond with the corresponding output fields, starting with the field `[[ ## answer ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`. Response: [[ ## answer ## ]] I am Gemma, an open-weights AI assistant. I am a large language model trained by Google DeepMind. [[ ## completed ## ]]
Main message:
Here the model is told which tools it can use – get_weather and finish.
The model is expected to produce next_thought, next_tool_name, and next_tool_args.
The model responds to this call with [[ ## next_tool_name ## ]] get_weather and parameters [[ ## next_tool_args ## ]] {"city": "London"}.
System message: Your input fields are: 1. `question` (str): 2. `trajectory` (str): Your output fields are: 1. `next_thought` (str): 2. `next_tool_name` (Literal['get_weather', 'finish']): 3. `next_tool_args` (dict[str, Any]): All interactions will be structured in the following way, with the appropriate values filled in. [[ ## question ## ]] {question} [[ ## trajectory ## ]] {trajectory} [[ ## next_thought ## ]] {next_thought} [[ ## next_tool_name ## ]] {next_tool_name} # note: the value you produce must exactly match (no extra characters) one of: get_weather; finish [[ ## next_tool_args ## ]] {next_tool_args} # note: the value you produce must adhere to the JSON schema: {"type": "object", "additionalProperties": true} [[ ## completed ## ]] In adhering to this structure, your objective is: Given the fields `question`, produce the fields `answer`. You are an Agent. In each episode, you will be given the fields `question` as input. And you can see your past trajectory so far. Your goal is to use one or more of the supplied tools to collect any necessary information for producing `answer`. To do this, you will interleave next_thought, next_tool_name, and next_tool_args in each turn, and also when finishing the task. After each tool call, you receive a resulting observation, which gets appended to your trajectory. When writing next_thought, you may reason about the current situation and plan for future steps. When selecting the next_tool_name and its next_tool_args, the tool must be one of: (1) get_weather. It takes arguments {'city': {'type': 'string'}, 'units': {'type': 'string', 'default': 'metric'}}. (2) finish, whose description is <desc>Marks the task as complete. That is, signals that all information for producing the outputs, i.e. `answer`, are now available to be extracted.</desc>. It takes arguments {}. When providing `next_tool_args`, the value inside the field must be in JSON format User message: [[ ## question ## ]] Check weather in London [[ ## trajectory ## ]] Respond with the corresponding output fields, starting with the field `[[ ## next_thought ## ]]`, then `[[ ## next_tool_name ## ]]` (must be formatted as a valid Python Literal['get_weather', 'finish']), then `[[ ## next_tool_args ## ]]` (must be formatted as a valid Python dict[str, Any]), and then ending with the marker for `[[ ## completed ## ]]`. Response: [[ ## next_thought ## ]] I need to check the weather in London. I should use the get_weather tool for this. [[ ## next_tool_name ## ]] get_weather [[ ## next_tool_args ## ]] {"city": "London"} [[ ## completed ## ]]
Next, the program makes a call to the weather API, receives the result and adds it to the next message.
The model decides which tool to call next: [[ ## next_tool_name ## ]] finish.
At this stage the model is still acting as an agent and has simply chosen the next tool to call.
User message: [[ ## question ## ]] Check weather in London [[ ## trajectory ## ]] [[ ## thought_0 ## ]] I need to check the weather in London. I should use the get_weather tool for this. [[ ## tool_name_0 ## ]] get_weather [[ ## tool_args_0 ## ]] {"city": "London"} [[ ## observation_0 ## ]] Weather in London: 23.75°C, overcast clouds Respond with the corresponding output fields, starting with the field `[[ ## next_thought ## ]]`, then `[[ ## next_tool_name ## ]]` (must be formatted as a valid Python Literal['get_weather', 'finish']), then `[[ ## next_tool_args ## ]]` (must be formatted as a valid Python dict[str, Any]), and then ending with the marker for `[[ ## completed ## ]]`. Response: [[ ## next_thought ## ]] The weather for London has been retrieved. I can now finish the task. [[ ## next_tool_name ## ]] finish [[ ## next_tool_args ## ]] {} [[ ## completed ## ]]
Then comes the last call: the model now acts as an assistant and produces reasoning plus the final answer.
The prompt has also changed.
System message: Your input fields are: 1. `question` (str): 2. `trajectory` (str): Your output fields are: 1. `reasoning` (str): 2. `answer` (str): All interactions will be structured in the following way, with the appropriate values filled in. [[ ## question ## ]] {question} [[ ## trajectory ## ]] {trajectory} [[ ## reasoning ## ]] {reasoning} [[ ## answer ## ]] {answer} [[ ## completed ## ]] In adhering to this structure, your objective is: Given the fields `question`, produce the fields `answer`. User message: [[ ## question ## ]] Check weather in London [[ ## trajectory ## ]] [[ ## thought_0 ## ]] I need to check the weather in London. I should use the get_weather tool for this. [[ ## tool_name_0 ## ]] get_weather [[ ## tool_args_0 ## ]] {"city": "London"} [[ ## observation_0 ## ]] Weather in London: 23.75°C, overcast clouds [[ ## thought_1 ## ]] The weather for London has been retrieved. I can now finish the task. [[ ## tool_name_1 ## ]] finish [[ ## tool_args_1 ## ]] {} [[ ## observation_1 ## ]] Completed. Respond with the corresponding output fields, starting with the field `[[ ## reasoning ## ]]`, then `[[ ## answer ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`. Response: [[ ## reasoning ## ]] The question asks to check the weather in London. I need to use a tool that can provide weather information. The `get_weather` tool is suitable for this task, and I should provide the city as "London". After retrieving the weather information, I will finish the task. [[ ## answer ## ]] Weather in London: 23.75°C, overcast clouds [[ ## completed ## ]]
Why are there two calls with finish:
The reason is in the architecture of the agent loop in DSPy.
Call with next_tool_name:
This step is modeled as an "agent action". The agent always operates in the format:
I think → select the tool → call the tool → receive observation.
Even if tool = finish, formally it is still considered an action.
Call with reasoning + answer:
After the agent has "completed the task", the system makes a separate request, where the context includes the entire trajectory.
This two-step scheme is a way of separating the roles:
one communication format when the model is an agent managing tools;
another format when the model is an assistant responding to the user.
Later, in the LangChain example, there will be a single final message instead.
Example 3 – Python framework LangChain
LangChain is a framework for working with LLM (Large Language Models) that helps build complex applications on top of models, not just “question → answer”.
Its purpose is to provide a convenient layer over the model so that it can:
work with tools (e.g. API, databases, calculator);
save context and memory (chats, dialogue history);
use chains – a sequence of steps where the model performs one task and passes the result to the next;
launch agents that decide for themselves which tools to use and in what order;
integrate with popular LLM providers (OpenAI, Anthropic, LM Studio, HuggingFace, etc.);
work with retrieval and RAG (extract knowledge from documents or vector databases).
https://github.com/langchain-ai/langchain
https://en.wikipedia.org/wiki/LangChain
An example similar in function to the previous one:
requirements.txt
langchain-core==0.3.74 langchain-openai==0.3.30 langchain==0.3.27 openai==1.100.1 requests==2.32.5
import requests from typing import Dict from langchain_core.tools import tool from langchain_openai import ChatOpenAI from langchain import hub from langchain.agents import AgentExecutor, create_react_agent from langchain_core.callbacks.base import BaseCallbackHandler OPENWEATHER_API_KEY = "..." # Callback for logging class CustomLogger(BaseCallbackHandler): def on_tool_start(self, serialized, input_str, **kwargs): print(f"\nTOOL START:\n{serialized.get('name')}\nINPUT:\n{input_str}\n") def on_tool_end(self, output, **kwargs): print(f"\nTOOL END:\n{output}\n") def on_llm_start(self, serialized, prompts, **kwargs): print(f"\nLLM START:\n{serialized.get('name')}\nPROMPTS\n{prompts}\n") def on_llm_end(self, response, **kwargs): print(f"\nLLM END:\n{response}\n") # Define a tool that the agent can call @tool(description="Get current weather for a city.") def get_weather(city: str) -> str: # Form a request to the OpenWeather API url = "https://api.openweathermap.org/data/2.5/weather" params = {"q": city, "appid": OPENWEATHER_API_KEY, "units": "metric", "lang": "en"} r = requests.get(url, params=params, timeout=15) if not r.ok: # If there is an error, return the error text return f"error: {r.status_code} {r.text}" # Parse the JSON response data: Dict = r.json() # Return a short string with the weather return f"Weather in {data.get('name', city)}: {data['main']['temp']}°C, {data['weather'][0]['description']}" # Connect LLM via LM Studio (local OpenAI API-compatible server) llm = ChatOpenAI( model="google/gemma-3n-e4b", openai_api_base="http://localhost:1234/v1", openai_api_key="lm-studio", temperature=0, ) # Download the ready-made ReAct prompt template from LangChain Hub prompt = hub.pull("hwchase17/react") # Create an agent in ReAct style (model thinks + calls tools) agent = create_react_agent( llm, tools=[get_weather], prompt=prompt ) # Wrap the agent in Executor to launch and manage its work executor = AgentExecutor( agent=agent, tools=[get_weather], verbose=True ) # Launch an agent with a question about the weather in London result = executor.invoke( {"input": "Check weather in London"}, config={"callbacks": [CustomLogger()]} )
Console output:
> Entering new AgentExecutor chain... I need to find the weather in London. I will use the get_weather tool to do this. Action: get_weather Action Input: LondonWeather in London: 16.16°C, overcast cloudsI have retrieved the weather for London. Now I can provide the answer. Final Answer: Weather in London: 16.16°C, overcast clouds > Finished chain. Weather in London: 16.16°C, overcast clouds
Full log:
ON_LLM_START: ChatOpenAI PROMPTS: ['Human: Answer the following questions as best you can. You have access to the following tools: get_weather(city: str) -> str - Get current weather for a city. Use the following format: Question: the input question you must answer Thought: you should always think about what to do Action: the action to take, should be one of [get_weather] Action Input: the input to the action Observation: the result of the action ... (this Thought/Action/Action Input/Observation can repeat N times) Thought: I now know the final answer Final Answer: the final answer to the original input question Begin! Question: Check weather in London Thought:'] ON_LLM_END: generations=[[ ChatGenerationChunk(text='I need to find the weather in London. I will use the get_weather tool to do this. Action: get_weather Action Input: London', generation_info={'finish_reason': 'stop', 'model_name': 'google/gemma-3n-e4b', 'system_fingerprint': 'google/gemma-3n-e4b'}, message=AIMessageChunk( content='I need to find the weather in London. I will use the get_weather tool to do this. Action: get_weather Action Input: London', additional_kwargs={}, response_metadata={'finish_reason': 'stop', 'model_name': 'google/gemma-3n-e4b', 'system_fingerprint': 'google/gemma-3n-e4b'}, id='run--ce1137e0-edde-4be7-95df-8b26c7ddddce'))]] llm_output=None run=None type='LLMResult' I need to find the weather in London. I will use the get_weather tool to do this. Action: get_weather Action Input: London
ON_TOOL_START: get_weather INPUT: London ON_TOOL_END: Weather in London: 23.01°C, overcast clouds
ON_LLM_START: ChatOpenAI PROMPTS: ['Human: Answer the following questions as best you can. You have access to the following tools: get_weather(city: str) -> str - Get current weather for a city. Use the following format: Question: the input question you must answer Thought: you should always think about what to do Action: the action to take, should be one of [get_weather] Action Input: the input to the action Observation: the result of the action ... (this Thought/Action/Action Input/Observation can repeat N times) Thought: I now know the final answer Final Answer: the final answer to the original input question Begin! Question: Check weather in London Thought: I need to find the weather in London. I will use the get_weather tool to do this. Action: get_weather Action Input: London Observation: Weather in London: 23.01°C, overcast clouds Thought: '] ON_LLM_END: generations=[[ ChatGenerationChunk(text='I have retrieved the weather for London. Now I can provide the answer. Final Answer: Weather in London: 23.01°C, overcast clouds', generation_info={'finish_reason': 'stop', 'model_name': 'google/gemma-3n-e4b', 'system_fingerprint': 'google/gemma-3n-e4b'}, message=AIMessageChunk( content='I have retrieved the weather for London. Now I can provide the answer. Final Answer: Weather in London: 23.01°C, overcast clouds', additional_kwargs={}, response_metadata={'finish_reason': 'stop', 'model_name': 'google/gemma-3n-e4b', 'system_fingerprint': 'google/gemma-3n-e4b'}, id='run--342aee15-0171-4549-94c4-a02a620aad7f'))]] llm_output=None run=None type='LLMResult' I have retrieved the weather for London. Now I can provide the answer. Final Answer: Weather in London: 23.01°C, overcast clouds > Finished chain. Process finished with exit code 0
Example 4 – Java framework LangChain4j
As LLMs make their way from startups to the enterprise, Java solutions were only a matter of time.
Popular solutions include:
Spring AI
https://spring.io/projects/spring-ai
https://github.com/spring-projects/spring-ai
Microsoft Semantic Kernel
https://learn.microsoft.com/en-us/semantic-kernel/overview
https://github.com/microsoft/semantic-kernel
LangChain for Java
https://docs.langchain4j.dev
https://github.com/langchain4j/langchain4j
Here is an example of LangChain4j, similar in functionality to the previous examples:
ext { ktor_version = '3.2.3' lc_4_j_version = '1.3.0' logback_version = '1.5.13' kotlinx_serialization_json_version = "1.8.1" } dependencies { implementation "io.ktor:ktor-client-core:$ktor_version" implementation "io.ktor:ktor-client-cio:$ktor_version" implementation "io.ktor:ktor-client-content-negotiation:$ktor_version" implementation "io.ktor:ktor-serialization-kotlinx-json:$ktor_version" implementation "io.ktor:ktor-client-logging:$ktor_version" implementation "org.jetbrains.kotlinx:kotlinx-serialization-json:$kotlinx_serialization_json_version" implementation "ch.qos.logback:logback-classic:$logback_version" implementation "dev.langchain4j:langchain4j:$lc_4_j_version" implementation "dev.langchain4j:langchain4j-open-ai:$lc_4_j_version" implementation("dev.langchain4j:langchain4j-http-client-jdk:$lc_4_j_version") }
import dev.langchain4j.http.client.jdk.JdkHttpClientBuilder import dev.langchain4j.model.openai.OpenAiChatModel import dev.langchain4j.service.AiServices import kotlinx.coroutines.runBlocking import java.net.http.HttpClient import java.time.Duration fun main() = runBlocking { val httpClientBuilder = HttpClient.newBuilder() .connectTimeout(Duration.ofSeconds(10)) .version(HttpClient.Version.HTTP_1_1) val jdkClientBuilder = JdkHttpClientBuilder() .httpClientBuilder(httpClientBuilder) .connectTimeout(Duration.ofSeconds(10)) .readTimeout(Duration.ofSeconds(90)) val model = OpenAiChatModel.builder() .httpClientBuilder(jdkClientBuilder) .baseUrl(BASE_URL) .apiKey("lm-studio") .modelName(MODEL_ID) .temperature(0.2) .logRequests(true) .logResponses(true) .timeout(Duration.ofSeconds(90)) .listeners(listOf(ChatLogger())) .build() val assistant = AiServices.builder(Assistant::class.java) .chatModel(model) .tools(WeatherTools) .build() val result = assistant.chat("Check weather in London") println(result) }
import dev.langchain4j.service.UserMessage import dev.langchain4j.service.V interface Assistant { @UserMessage("{{input}}") fun chat(@V("input") message: String): String }
import dev.langchain4j.agent.tool.Tool import io.ktor.client.call.* import io.ktor.client.request.* import io.ktor.http.* import kotlinx.coroutines.runBlocking object WeatherTools { @Tool( name = "current_weather_city", value = ["Get current weather for a city"] ) @JvmStatic fun currentWeatherByCity(city: String): String = runBlocking { client.use { httpClient -> val url = URLBuilder(OWM_BASE).apply { appendPathSegments("weather") parameters.append("q", city) parameters.append("appid", WEATHER_API_KEY) parameters.append("units", "metric") parameters.append("lang", "en") }.buildString() val response: WeatherResponse = httpClient.get(url).body() val name = response.name ?: city val temp = response.main?.temp val desc = response.weather.firstOrNull()?.description ?: "n/a" if (temp == null) { "Can't read temperature for $name" } else { "Weather in $name: $temp °C, $desc" } } } }
import io.ktor.client.* import io.ktor.client.engine.cio.* import io.ktor.client.plugins.* import io.ktor.client.plugins.contentnegotiation.* import io.ktor.serialization.kotlinx.json.* import kotlinx.serialization.json.Json const val BASE_URL: String = "http://localhost:1234/v1" const val OWM_BASE = "https://api.openweathermap.org/data/2.5" const val WEATHER_API_KEY = "..." const val MODEL_ID = "google/gemma-3n-e4b" val kotlinxJsonConfig: Json = Json { ignoreUnknownKeys = true prettyPrint = false } val client: HttpClient = HttpClient(CIO) { install(ContentNegotiation) { json(kotlinxJsonConfig) } install(HttpTimeout) { requestTimeoutMillis = 60_000 connectTimeoutMillis = 10_000 socketTimeoutMillis = 60_000 } }
import kotlinx.serialization.SerialName import kotlinx.serialization.Serializable @Serializable data class WeatherResponse( @SerialName("coord") val coord: Coord? = null, @SerialName("weather") val weather: List<WeatherItem> = emptyList(), @SerialName("base") val base: String? = null, @SerialName("main") val main: MainBlock? = null, @SerialName("visibility") val visibility: Int? = null, @SerialName("wind") val wind: Wind? = null, @SerialName("clouds") val clouds: Clouds? = null, @SerialName("dt") val dt: Long? = null, @SerialName("sys") val sys: Sys? = null, @SerialName("timezone") val timezone: Int? = null, @SerialName("id") val id: Long? = null, @SerialName("name") val name: String? = null, @SerialName("cod") val cod: Int? = null ) @Serializable data class Coord( @SerialName("lon") val lon: Double? = null, @SerialName("lat") val lat: Double? = null ) @Serializable data class WeatherItem( @SerialName("id") val id: Int? = null, @SerialName("main") val main: String? = null, @SerialName("description") val description: String? = null, @SerialName("icon") val icon: String? = null ) @Serializable data class MainBlock( @SerialName("temp") val temp: Double? = null, @SerialName("feels_like") val feelsLike: Double? = null, @SerialName("temp_min") val tempMin: Double? = null, @SerialName("temp_max") val tempMax: Double? = null, @SerialName("pressure") val pressure: Int? = null, @SerialName("humidity") val humidity: Int? = null ) @Serializable data class Wind( @SerialName("speed") val speed: Double? = null, @SerialName("deg") val deg: Int? = null, @SerialName("gust") val gust: Double? = null ) @Serializable data class Clouds( @SerialName("all") val all: Int? = null ) @Serializable data class Sys( @SerialName("type") val type: Int? = null, @SerialName("id") val id: Int? = null, @SerialName("country") val country: String? = null, @SerialName("sunrise") val sunrise: Long? = null, @SerialName("sunset") val sunset: Long? = null )
import dev.langchain4j.model.chat.listener.ChatModelErrorContext import dev.langchain4j.model.chat.listener.ChatModelListener import dev.langchain4j.model.chat.listener.ChatModelRequestContext import dev.langchain4j.model.chat.listener.ChatModelResponseContext class ChatLogger : ChatModelListener { override fun onRequest(request: ChatModelRequestContext) { println("ON_LLM_START:") println(request) } override fun onResponse(response: ChatModelResponseContext) { println("ON_LLM_END:") println(response) } override fun onError(context: ChatModelErrorContext) { println("ON_LLM_ERROR:") println(context.error().message) } }
Console output:
ON_LLM_START: dev.langchain4j.model.chat.listener.ChatModelRequestContext@33d512c1 00:48:03.351 [main] INFO dev.langchain4j.http.client.log.LoggingHttpClient -- HTTP request: - method: POST - url: http://localhost:1234/v1/chat/completions - headers: [Authorization: Beare...io], [User-Agent: langchain4j-openai],
[Content-Type: application/json] - body: { "model" : "google/gemma-3n-e4b", "messages" : [ { "role" : "user", "content" : "Check weather in London" } ], "temperature" : 0.2, "stream" : false, "tools" : [ { "type" : "function", "function" : { "name" : "current_weather_city", "description" : "Get current weather for a city", "parameters" : { "type" : "object", "properties" : { "arg0" : { "type" : "string" } }, "required" : [ "arg0" ] } } } ] } 00:48:04.961 [main] INFO dev.langchain4j.http.client.log.LoggingHttpClient -- HTTP response: - status code: 200 - headers: [connection: keep-alive], [content-length: 741], [content-type: application/json; charset=utf-8], [date: Tue, 19 Aug 2025 21:48:04 GMT], [etag: W/"2e5-JWXcPUg7aNE2xJTlIOO4yntOymw"], [keep-alive: timeout=5], [x-powered-by: Express] - body: { "id": "chatcmpl-ox8omdn9nxl0o4pbihqqi", "object": "chat.completion", "created": 1755640083, "model": "google/gemma-3n-e4b", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "", "tool_calls": [ { "type": "function", "id": "458045748", "function": { "name": "current_weather_city", "arguments": "{\"arg0\":\"London\"}" } } ] }, "logprobs": null, "finish_reason": "tool_calls" } ], "usage": { "prompt_tokens": 399, "completion_tokens": 34, "total_tokens": 433 }, "stats": {}, "system_fingerprint": "google/gemma-3n-e4b" } ON_LLM_END: dev.langchain4j.model.chat.listener.ChatModelResponseContext@1c481ff2 ON_LLM_START: dev.langchain4j.model.chat.listener.ChatModelRequestContext@70d2e40b 00:48:05.501 [main] INFO dev.langchain4j.http.client.log.LoggingHttpClient -- HTTP request: - method: POST - url: http://localhost:1234/v1/chat/completions - headers: [Authorization: Beare...io], [User-Agent: langchain4j-openai], [Content-Type: application/json] - body: { "model" : "google/gemma-3n-e4b", "messages" : [ { "role" : "user", "content" : "Check weather in London" }, { "role" : "assistant", "tool_calls" : [ { "id" : "458045748", "type" : "function", "function" : { "name" : "current_weather_city", "arguments" : "{\"arg0\":\"London\"}" } } ] }, { "role" : "tool", "tool_call_id" : "458045748", "content" : "Weather in London: 17.76 °C, few clouds" } ], "temperature" : 0.2, "stream" : false, "tools" : [ { "type" : "function", "function" : { "name" : "current_weather_city", "description" : "Get current weather for a city", "parameters" : { "type" : "object", "properties" : { "arg0" : { "type" : "string" } }, "required" : [ "arg0" ] } } } ] } 00:48:06.257 [main] INFO dev.langchain4j.http.client.log.LoggingHttpClient -- HTTP response: - status code: 200 - headers: [connection: keep-alive], [content-length: 554], [content-type: application/json; charset=utf-8], [date: Tue, 19 Aug 2025 21:48:06 GMT], [etag: W/"22a-YKbJMzdQdr5Qv/9Z3J+Mf2kPeAw"], [keep-alive: timeout=5], [x-powered-by: Express] - body: { "id": "chatcmpl-6aslomnvzvh3tu6jo8tfl6", "object": "chat.completion", "created": 1755640085, "model": "google/gemma-3n-e4b", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "The weather in London is 17.76 °C with few clouds.", "tool_calls": [] }, "logprobs": null, "finish_reason": "stop" } ], "usage": { "prompt_tokens": 469, "completion_tokens": 18, "total_tokens": 487 }, "stats": {}, "system_fingerprint": "google/gemma-3n-e4b" } ON_LLM_END: dev.langchain4j.model.chat.listener.ChatModelResponseContext@2449cff7 The weather in London is 17.76 °C with few clouds. Process finished with exit code 0
Example 5 – Google's Java libraries for mobile devices
In this example I will use
Engine: Google AI Edge MediaPipe
LLM model: gemma-3n-E4B-it-int4
RAG / Functional Calling: On-Device RAG SDK & On-Device Function Calling SDK
Embedding model: Gecko-110m-en
How these techniques are supposed to work is described here:
RAG:
https://blogs.nvidia.com/blog/what-is-retrieval-augmented-generation
https://en.wikipedia.org/wiki/Retrieval-augmented_generation
https://www.ibm.com/think/topics/retrieval-augmented-generation
Function Calling:
https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/function-calling
https://huggingface.co/docs/hugs/guides/function-calling
https://platform.openai.com/docs/guides/function-calling
https://medium.com/@danushidk507/function-calling-in-llm-e537b286a4fd
In this example there will be three classes:
MediaPipeEngineCommon: holds the common components for working with the vector database
MediaPipeEngineWithRag: here you can start generation by adding data from the vector database to the request
MediaPipeEngineWithTools: here you can start generation and the model itself decides whether to query the vector database
If it decides that it does, it makes a function call, which is processed not manually, as in the previous example, but by the com.google.ai.edge.localagents:localagents-fc library.
Project dependencies, in the Gradle version catalog format:
localagentsRag = "0.2.0" localagentsFc = "0.1.0" tasksGenai = "0.10.25" tasksText = "0.10.26.1" tasksVision = "0.10.26.1" tensorflowLite = "2.17.0" kotlinxCoroutinesGuava = "1.10.2" tasks-genai = { module = "com.google.mediapipe:tasks-genai", version.ref = "tasksGenai" } tasks-text = { module = "com.google.mediapipe:tasks-text", version.ref = "tasksText" } tasks-vision = { module = "com.google.mediapipe:tasks-vision", version.ref = "tasksVision" } tensorflow-lite = { module = "org.tensorflow:tensorflow-lite", version.ref = "tensorflowLite" } localagents-rag = { module = "com.google.ai.edge.localagents:localagents-rag", version.ref = "localagentsRag" } localagents-fc = { module = "com.google.ai.edge.localagents:localagents-fc", version.ref = "localagentsFc" } kotlinx-coroutines-guava = { module = "org.jetbrains.kotlinx:kotlinx-coroutines-guava", version.ref = "kotlinxCoroutinesGuava" }
MediaPipeEngineCommon: Class for working with a vector database, needed in this example for both RAG and Function Calling
import com.google.ai.edge.localagents.rag.chunking.TextChunker
import com.google.ai.edge.localagents.rag.memory.DefaultSemanticTextMemory
import com.google.ai.edge.localagents.rag.memory.SqliteVectorStore
import com.google.ai.edge.localagents.rag.models.Embedder
import com.google.ai.edge.localagents.rag.prompt.PromptBuilder

interface MediaPipeEngineCommon {

    var chunker: TextChunker
    var embedder: Embedder<String>
    var vectorStore: SqliteVectorStore
    var promptBuilder: PromptBuilder
    var semanticMemory: DefaultSemanticTextMemory

    fun init(
        geckoModelPath: String, // Gecko_256_quant.tflite
        tokenizerModelPath: String, // sentencepiece.model
        useGpuForEmbeddings: Boolean = true,
    )

    fun saveTextToVectorStore(
        text: String,
        chunkOverlap: Int = 20,
        chunkTokenSize: Int = 128,
        chunkMaxSymbolsSize: Int = 1000,
        chunkBySentences: Boolean = false,
    ): String?

    fun readEmbeddingVectors(): List<VectorStoreEntity>

    suspend fun readEmbeddingVectors(
        query: String,
        topK: Int,
        minSimilarityScore: Float,
    ): List<VectorStoreEntity>

    fun makeSQLRequest(query: String): Boolean
}
import android.app.Application import android.database.Cursor import android.database.sqlite.SQLiteDatabase import com.google.ai.edge.localagents.rag.chunking.TextChunker import com.google.ai.edge.localagents.rag.memory.DefaultSemanticTextMemory import com.google.ai.edge.localagents.rag.memory.SqliteVectorStore import com.google.ai.edge.localagents.rag.memory.VectorStoreRecord import com.google.ai.edge.localagents.rag.models.EmbedData import com.google.ai.edge.localagents.rag.models.Embedder import com.google.ai.edge.localagents.rag.models.EmbeddingRequest import com.google.ai.edge.localagents.rag.models.GeckoEmbeddingModel import com.google.ai.edge.localagents.rag.prompt.PromptBuilder import com.google.common.collect.ImmutableList import com.romankryvolapov.offlineailauncher.common.extensions.toDurationString import com.romankryvolapov.offlineailauncher.common.models.common.LogUtil.logDebug import com.romankryvolapov.offlineailauncher.common.models.common.LogUtil.logError import kotlinx.coroutines.guava.await import java.io.File import java.nio.ByteBuffer import java.nio.ByteOrder import java.util.Optional class MediaPipeEngineCommonImpl( private val application: Application ) : MediaPipeEngineCommon { companion object { private const val TAG = "CommonComponentsTag" private const val GECKO_EMBEDDING_MODEL_DIMENSION = 768 private const val PROMPT_TEMPLATE: String = "You are an assistant for question-answering tasks. Here are the things I want to remember: {0} Use the things I want to remember, answer the following question the user has: {1}" } override lateinit var chunker: TextChunker override lateinit var embedder: Embedder<String> override lateinit var vectorStore: SqliteVectorStore override lateinit var promptBuilder: PromptBuilder override lateinit var semanticMemory: DefaultSemanticTextMemory override fun init( geckoModelPath: String, tokenizerModelPath: String, useGpuForEmbeddings: Boolean, ) { logDebug("init", TAG) chunker = TextChunker() // in embedder add the path to the Gecko-110m-en model // I use the Gecko_256_quant.tflite version here 256 is the maximum text input size // This version is optimal in terms of text chunk size and performance // important - further in the code we pass on which pieces to split the text into, this depends on the parameters of the model embedder = GeckoEmbeddingModel( geckoModelPath, Optional.of(tokenizerModelPath), useGpuForEmbeddings, ) // here I will use a ready-made SQLite database in the application // another table rag_vector_store will simply be added to it // with text columns for text and embeddings for vector val database = File(application.getDatabasePath("database").absolutePath) if (!database.exists()) { logError("startEngine database not exists", TAG) } // It is also possible to create a custom database implementation that inherits // VectorStore<String> interface, but the getNearestRecords method must // to be implemented correctly and work quickly, he is looking for the nearest vesters vectorStore = SqliteVectorStore( GECKO_EMBEDDING_MODEL_DIMENSION, database.absolutePath ) semanticMemory = DefaultSemanticTextMemory( vectorStore, embedder ) promptBuilder = PromptBuilder( PROMPT_TEMPLATE ) logDebug("init ready", TAG) } override fun saveTextToVectorStore( text: String, // how far to get into the text before the fragment, here 20 chunkOverlap: Int, // Please note that the size seems to be in tokens, // but it is used for chunker and may not match // token size for embedder, here chunkTokenSize 128 chunkTokenSize: Int, // when splitting 
using chunkBySentences // the size of the offers can be large // if it exceeds the capabilities of the embedder model, it will throw an error // for this purpose, cropping to the maximum size is used // here it is 1000 characters chunkMaxSymbolsSize: Int, // use the sentence-by-sentence method chunkBySentences: Boolean, ): String? { logDebug("saveTextToVectorStore text length: ${text.length}", TAG) // timer to see how fast it works val start = System.currentTimeMillis() val chunks: List<String> = if (chunkBySentences) chunker.chunkBySentences( text, chunkTokenSize, ).filter { it.isNotBlank() }.map { chunk -> if (chunk.length > chunkMaxSymbolsSize) { logError("saveTextToVectorStore crop chunk", TAG) chunk.substring(0, chunkMaxSymbolsSize) } else { chunk } } else chunker.chunk( text, chunkTokenSize, chunkOverlap ).filter { it.isNotBlank() }.map { chunk -> if (chunk.length > chunkMaxSymbolsSize) { logError("saveTextToVectorStore crop chunk", TAG) chunk.substring(0, chunkMaxSymbolsSize) } else { chunk } } val end = System.currentTimeMillis() val delta = end - start logDebug("saveTextToVectorStore chunks delta: ${delta.toDurationString()} size: ${chunks.size}", TAG) chunks.forEach { logDebug("length: ${it.length}", TAG) } if (chunks.isEmpty()) { logError("saveTextToVectorStore chunks.isEmpty()", TAG) return "Chunks is empty" } return try { // vector generation occurs inside semanticMemory val result: Boolean? = semanticMemory.recordBatchedMemoryItems( ImmutableList.copyOf(chunks) )?.get() val end = System.currentTimeMillis() val delta = end - start logDebug("saveTextToVectorStore ready delta: ${delta.toDurationString()} result: $result", TAG) null } catch (t: Throwable) { logError("saveTextToVectorStore failed: ${t.message}", t, TAG) t.message } } // search by query, will find all pieces of text similar to the query override suspend fun readEmbeddingVectors( query: String, // number of database query results topK: Int, // how similar the query vector should be to the database record // 0.0 = search all entries, sort by most similar // 1.0 = perfect match only // I use values 0.6 - 0.8 minSimilarityScore: Float, ): List<VectorStoreEntity> { logDebug("readEmbeddingVectors query: $query", TAG) val queryEmbedData: EmbedData<String> = EmbedData.create( query, EmbedData.TaskType.RETRIEVAL_QUERY ) val embeddingRequest: EmbeddingRequest<String> = EmbeddingRequest .create( listOf(queryEmbedData) ) val vector: ImmutableList<Float> = try { embedder.getEmbeddings(embeddingRequest).await() } catch (t: Throwable) { logError("readEmbeddingVectors: embedding failed: ${t.message}", t, TAG) return emptyList() } logDebug("searchDocsInternal vector size: ${vector.size}", TAG) if (vector.isEmpty()) { logError("readEmbeddingVectors vector.isEmpty()", TAG) return emptyList() } val hits: ImmutableList<VectorStoreRecord<String>> = try { vectorStore.getNearestRecords( vector, topK, minSimilarityScore ) } catch (t: Throwable) { logError("readEmbeddingVectors: vector search failed: ${t.message}", t, TAG) return emptyList() } if (hits.isEmpty()) { logError("readEmbeddingVectors hits.isEmpty()", TAG) return emptyList() } val result = hits.map { VectorStoreEntity( id = null, text = it.data, embedding = it.embeddings ) } logDebug("readEmbeddingVectors\nsize: ${result.size}\nresult: $result", TAG) return result } // simply displays all records in the database override fun readEmbeddingVectors(): List<VectorStoreEntity> { logDebug("readEmbeddingPreview", TAG) var cursor: Cursor? = null var database: SQLiteDatabase? 
= null return try { val databaseFile = File(application.getDatabasePath("database").absolutePath) database = SQLiteDatabase.openDatabase( databaseFile.absolutePath, null, SQLiteDatabase.OPEN_READONLY ) cursor = database.rawQuery("SELECT ROWID, text, embeddings FROM rag_vector_store", null) val result = mutableListOf<VectorStoreEntity>() while (cursor.moveToNext()) { val rowId = cursor.getLong(0) val text = cursor.getString(1) val blob = cursor.getBlob(2) val buffer = ByteBuffer.wrap(blob).order(ByteOrder.LITTLE_ENDIAN) val floats = mutableListOf<Float>() while (buffer.hasRemaining()) { floats.add(buffer.float) } result.add( VectorStoreEntity( id = rowId, text = text, embedding = floats ) ) } logDebug("readEmbeddingPreview\nsize: ${result.size}\nresult: $result", TAG) result } catch (t: Throwable) { logError("readEmbeddingPreview failed: ${t.message}", t, TAG) emptyList() } finally { cursor?.close() database?.close() } } // you can write your own request and it will be executed, // for example "DELETE FROM rag_vector_store" override fun makeSQLRequest(query: String): Boolean { logDebug("makeSQLRequest query: $query", TAG) var cursor: Cursor? = null var database: SQLiteDatabase? = null return try { val databaseFile = File(application.getDatabasePath("database").absolutePath) database = SQLiteDatabase.openDatabase( databaseFile.absolutePath, null, SQLiteDatabase.OPEN_READWRITE ) cursor = database.rawQuery(query, null) val result = cursor.moveToFirst() logDebug("makeSQLRequest result: $result", TAG) result } catch (t: Throwable) { logError("makeSQLRequest failed: ${t.message}", t, TAG) false } finally { cursor?.close() database?.close() } } }
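To show how the pieces fit together, here is a rough usage sketch of MediaPipeEngineCommon; the file paths and the sample text are illustrative and not taken from the project:
// A minimal sketch: fill the vector store and query it (paths and text are illustrative)
suspend fun fillAndQueryVectorStore(common: MediaPipeEngineCommon) {
    // Load the Gecko embedding model and open the SQLite vector store
    common.init(
        geckoModelPath = "/data/local/tmp/Gecko_256_quant.tflite",
        tokenizerModelPath = "/data/local/tmp/sentencepiece.model",
        useGpuForEmbeddings = true,
    )
    // Chunk the text, embed each chunk and write it to the rag_vector_store table;
    // a non-null return value is an error message
    val error: String? = common.saveTextToVectorStore(
        text = "The team meeting is on Friday at 10:00 in the small conference room.",
    )
    if (error != null) return
    // Find the chunks whose vectors are closest to the query vector
    val hits = common.readEmbeddingVectors(
        query = "When is the team meeting?",
        topK = 5,
        minSimilarityScore = 0.6F,
    )
    hits.forEach { println(it.text) }
}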
MediaPipeEngineWithRag: only the simple RAG mechanism is supported here
import kotlinx.coroutines.flow.Flow
import java.io.File

interface MediaPipeEngineWithRag {

    fun startEngine(
        modelFile: File,
        isSupportImages: Boolean = false,
        engineParams: MediaPipeEngineParams,
    )

    fun resetSession()

    fun generateResponse(
        prompt: String,
        topK: Int = 5,
        minSimilarityScore: Float = 0.6F,
    ): Flow<ResultEmittedData<String>>
}
import android.app.Application import com.google.ai.edge.localagents.rag.chains.ChainConfig import com.google.ai.edge.localagents.rag.chains.RetrievalAndInferenceChain import com.google.ai.edge.localagents.rag.models.AsyncProgressListener import com.google.ai.edge.localagents.rag.models.LanguageModelResponse import com.google.ai.edge.localagents.rag.models.MediaPipeLlmBackend import com.google.ai.edge.localagents.rag.retrieval.RetrievalConfig import com.google.ai.edge.localagents.rag.retrieval.RetrievalConfig.TaskType import com.google.ai.edge.localagents.rag.retrieval.RetrievalRequest import com.google.common.util.concurrent.FutureCallback import com.google.common.util.concurrent.Futures import com.google.common.util.concurrent.ListenableFuture import com.google.common.util.concurrent.MoreExecutors import com.google.mediapipe.tasks.genai.llminference.GraphOptions import com.google.mediapipe.tasks.genai.llminference.LlmInference import com.google.mediapipe.tasks.genai.llminference.LlmInference.LlmInferenceOptions import com.google.mediapipe.tasks.genai.llminference.LlmInferenceSession.LlmInferenceSessionOptions import kotlinx.coroutines.flow.Flow import kotlinx.coroutines.flow.callbackFlow import java.io.File import java.util.concurrent.Executor class MediaPipeEngineWithRagImpl( private val application: Application, private val common: MediaPipeEngineCommon, ) : MediaPipeEngineWithRag { companion object { private const val TAG = "MediaPipeEngineWithRagTag" } private var chainConfig: ChainConfig<String>? = null private var retrievalAndInferenceChain: RetrievalAndInferenceChain? = null private var engineMediaPipe: LlmInference? = null private var sessionOptions: LlmInferenceSessionOptions? = null private var mediaPipeLanguageModel: MediaPipeLlmBackend? = null private var interfaceOptions: LlmInferenceOptions? = null private val executor: Executor = MoreExecutors.directExecutor() private var future: ListenableFuture<LanguageModelResponse>? 
= null override fun startEngine( modelFile: File, isSupportImages: Boolean, engineParams: MediaPipeEngineParams, ) { logDebug("startEngine", TAG) interfaceOptions = createInterfaceOptions( modelFile = modelFile, engineParams = engineParams, isSupportImages = isSupportImages, ) engineMediaPipe = LlmInference.createFromOptions( application, interfaceOptions ) if (engineMediaPipe == null) { logError("startEngine llmInference == null", TAG) return } sessionOptions = createSessionOptions( engineParams = engineParams, isSupportImages = isSupportImages, ) mediaPipeLanguageModel = MediaPipeLlmBackend( application.applicationContext, interfaceOptions, sessionOptions, executor ) chainConfig = ChainConfig.create( mediaPipeLanguageModel, common.promptBuilder, // add a database in which to check queries common.semanticMemory ) // we make a chain with a check in the database retrievalAndInferenceChain = RetrievalAndInferenceChain( chainConfig ) Futures.addCallback( mediaPipeLanguageModel!!.initialize(), object : FutureCallback<Boolean> { override fun onSuccess(result: Boolean) { logDebug("mediaPipeLanguageModel initialize onSuccess", TAG) } override fun onFailure(t: Throwable) { logError( "mediaPipeLanguageModel initialize onFailure: ${t.message}", t, TAG, ) } }, executor ) logDebug("startEngine ready", TAG) } override fun resetSession() { logDebug("resetSession", TAG) try { retrievalAndInferenceChain = RetrievalAndInferenceChain( chainConfig ) logDebug("Session reset completed", TAG) } catch (e: Exception) { logError("Failed to reset session: ${e.message}", e, TAG) } logDebug("resetSession ready", TAG) } override fun generateResponse( prompt: String, // Number of database query results topK: Int, // how similar the query vector should be to the database record // 0.0 = search all entries, sort by most similar // 1.0 = perfect match only // I use values 0.6 - 0.8 minSimilarityScore: Float, ): Flow<ResultEmittedData<String>> = callbackFlow { logDebug("generateResponse prompt: $prompt", TAG) try { if (retrievalAndInferenceChain == null) { logError("generateResponse retrievalAndInferenceChain == null", TAG) trySend( ResultEmittedData.error( model = null, error = null, title = "MediaPipe engine error", responseCode = null, message = "retrievalAndInferenceChain == null", errorType = ErrorType.ERROR_IN_LOGIC, ) ) return@callbackFlow } val retrievalConfig = RetrievalConfig.create( topK, minSimilarityScore, TaskType.QUESTION_ANSWERING ) // the request already includes a chain with verification val retrievalRequest = RetrievalRequest.create( prompt, retrievalConfig ) logDebug("generateResponse retrievalRequest", TAG) val messageBuilder = StringBuilder() val listener = AsyncProgressListener<LanguageModelResponse> { partial, done -> val delta = partial.text.orEmpty() logDebug("generateResponse delta: $delta", TAG) if (!done && delta.isNotBlank()) { messageBuilder.append(delta) trySend( ResultEmittedData.loading( model = messageBuilder.toString(), ) ) } } future = retrievalAndInferenceChain!!.invoke( retrievalRequest, listener ) future?.addListener({ val fullText = future?.get()?.text if (fullText.isNullOrEmpty()) { logError("generateResponse fullText isNullOrEmpty", TAG) trySend( ResultEmittedData.error( model = null, error = null, title = "MediaPipe engine error", responseCode = null, message = "Empty response", errorType = ErrorType.EXCEPTION ) ) close() return@addListener } logDebug("generateResponse fullText: $fullText", TAG) trySend( ResultEmittedData.success( model = fullText, message = null, responseCode = 
null ) ) close() }, executor) logDebug("generateResponse ready", TAG) } catch (t: Throwable) { logError("generateResponse failed: ${t.message}", t, TAG) trySend( ResultEmittedData.error( model = null, error = t, title = "MediaPipe engine error", responseCode = null, message = t.message, errorType = ErrorType.EXCEPTION, ) ) } } private fun createInterfaceOptions( modelFile: File, engineParams: MediaPipeEngineParams, isSupportImages: Boolean, ): LlmInferenceOptions { val backend = when (engineParams.backend) { MediaPipeBackendParams.CPU -> LlmInference.Backend.CPU MediaPipeBackendParams.GPU -> LlmInference.Backend.GPU } return LlmInferenceOptions.builder().apply { setModelPath(modelFile.absolutePath) setMaxTokens(engineParams.contextSize) setPreferredBackend(backend) val maxNumImages = if (isSupportImages) 1 else 0 setMaxNumImages(maxNumImages) if (engineParams.useMaxTopK) setMaxTopK(engineParams.maxTopK) }.build() } private fun createSessionOptions( engineParams: MediaPipeEngineParams, isSupportImages: Boolean, ): LlmInferenceSessionOptions { return LlmInferenceSessionOptions.builder().apply { if (engineParams.useTopK) setTopK(engineParams.topK) if (engineParams.useTopP) setTopP(engineParams.topP) if (engineParams.useTemperature) setTemperature(engineParams.temperature) if (engineParams.useRandomSeed) setRandomSeed(engineParams.randomSeed) setGraphOptions( GraphOptions.builder() .setEnableVisionModality(isSupportImages) .build() ) }.build() } private fun isInGeneration(): Boolean { return future != null && future?.isDone != true && future?.isCancelled != true } }
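For completeness, a rough sketch of how this class might be driven from a coroutine; the prompt is illustrative, and MediaPipeEngineParams is one of the auxiliary classes listed at the end of the article:
import java.io.File

// A minimal sketch: start the engine and stream a RAG-augmented answer (prompt is illustrative)
suspend fun askWithRag(
    engine: MediaPipeEngineWithRag,
    modelFile: File,
    engineParams: MediaPipeEngineParams,
) {
    // In a real app you would wait for the backend initialization callback before generating
    engine.startEngine(
        modelFile = modelFile,
        engineParams = engineParams,
    )
    engine.generateResponse(
        prompt = "When is the team meeting?",
        topK = 5,
        minSimilarityScore = 0.6F,
    ).collect { result ->
        result
            .onLoading { partial, _ -> println("partial: $partial") }
            .onSuccess { answer, _, _ -> println("answer: $answer") }
            .onFailure { _, _, message, _, _ -> println("error: $message") }
    }
}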
MediaPipeEngineWithTools: function calls are supported here (in the implementation below, the class is named MediaPipeEngineWithFunctionCallingImpl).
import kotlinx.coroutines.flow.Flow
import java.io.File

interface MediaPipeEngineWithTools {

    fun startEngine(
        modelFile: File,
        isSupportImages: Boolean = false,
        engineParams: MediaPipeEngineParams,
    )

    fun generateResponse(
        userQuery: String,
        maxSteps: Int = 3,
    ): Flow<ResultEmittedData<String>>
}
package com.romankryvolapov.offlineailauncher.mediapipe import android.app.Application import com.google.ai.edge.localagents.core.proto.Content import com.google.ai.edge.localagents.core.proto.FunctionCall import com.google.ai.edge.localagents.core.proto.FunctionDeclaration import com.google.ai.edge.localagents.core.proto.FunctionResponse import com.google.ai.edge.localagents.core.proto.GenerateContentResponse import com.google.ai.edge.localagents.core.proto.Part import com.google.ai.edge.localagents.core.proto.Schema import com.google.ai.edge.localagents.core.proto.Tool import com.google.ai.edge.localagents.fc.GemmaFormatter import com.google.ai.edge.localagents.fc.GenerativeModel import com.google.ai.edge.localagents.fc.LlmInferenceBackend import com.google.ai.edge.localagents.rag.memory.VectorStoreRecord import com.google.ai.edge.localagents.rag.models.EmbedData import com.google.ai.edge.localagents.rag.models.EmbeddingRequest import com.google.common.collect.ImmutableList import com.google.mediapipe.tasks.genai.llminference.LlmInference import com.google.mediapipe.tasks.genai.llminference.LlmInference.LlmInferenceOptions import com.google.protobuf.Struct import com.google.protobuf.Value import com.romankryvolapov.offlineailauncher.common.models.common.ErrorType import com.romankryvolapov.offlineailauncher.common.models.common.LogUtil.logDebug import com.romankryvolapov.offlineailauncher.common.models.common.LogUtil.logError import com.romankryvolapov.offlineailauncher.common.models.common.ResultEmittedData import kotlinx.coroutines.flow.Flow import kotlinx.coroutines.flow.flow import kotlinx.coroutines.guava.await import java.io.File class MediaPipeEngineWithFunctionCallingImpl( private val application: Application, private val common: MediaPipeEngineCommon, ) : MediaPipeEngineWithFunctionCalling { companion object { private const val TAG = "MediaPipeEngineWithFunctionCallingsTag" private const val DEFAULT_MIN_SIMILARITY_SCORE = 0.8 private const val TOOLS_CODE = "tool_code" private const val RESULTS = "results" private const val TOOLS_ACTION_SEARCH_DOCS = "search_docs" private const val TOOLS_ACTION_SEARCH_DOCS_DESCRIPTION = "Searches knowledge and returns the most relevant results as plain text." private const val TOOLS_PARAM_QUERY = "query" private const val TOOLS_PARAM_QUERY_DESCRIPTION = "User query to search in the vector store." private const val TOOLS_PARAM_TOP_K = "top_k" private const val TOOLS_PARAM_TOP_K_DESCRIPTION = "Number of results to return (default 5)." private const val TOOLS_PARAM_MIN_SIMILARITY_SCORE = "min_similarity_score" private const val MIN_SIMILARITY_SCORE_DESCRIPTION = """ Minimum similarity score threshold (float) for filtering search results, from 0.0 (no filtering) to 1.0 (exact match). Start with $DEFAULT_MIN_SIMILARITY_SCORE, and if no results are found, lower the value and retry the search. """ // A lot depends on this template // incorrectly selected parameters will make calling the tool impossible // or after calling the tool the generation will stop // for other LLM models the template may differ // ```tool_code works well for Gemma 3n, it seems it was trained on this keyword // if nothing is found through the tools, the prompt indicates that the similarity can be reduced // The tools can be completely different, for example, a SQL query, an Internet query, it is important to describe them correctly private val PROMPT_TEMPLATE_WITH_TOOLS = """ You are an on-device assistant. 
You have access to special tools (also called: "function call", "invoke tool", "use API", "search", "lookup", "query tool") If you decide to invoke any of the function, it should be wrapped with ```$TOOLS_CODE``` You have access to the following tools. * `$TOOLS_ACTION_SEARCH_DOCS`: Searches knowledge and returns the most relevant results as plain text. WHEN TO USE A TOOL - If you do not have enough information to answer with high confidence. - If the user explicitly or implicitly asks to check/verify/find out/look up ("check via tools", "verify", "lookup", etc.). Tool args: $TOOLS_PARAM_QUERY: User string query to search in the vector store. $TOOLS_PARAM_TOP_K: Integer number of results to return (default 5). $TOOLS_PARAM_MIN_SIMILARITY_SCORE: Minimum similarity score threshold (float) for filtering search results, from 0.0 (no filtering) to 1.0 (exact match). Start with $DEFAULT_MIN_SIMILARITY_SCORE, and if no results are found, lower the value and retry the search. Rules for tool call: ```$TOOLS_CODE $TOOLS_ACTION_SEARCH_DOCS($TOOLS_PARAM_QUERY="<string>", $TOOLS_PARAM_TOP_K=<integer>, $TOOLS_PARAM_MIN_SIMILARITY_SCORE=<float>) ``` Tool response: $RESULTS: Plain text results. IMPORTANT: After receiving tool results, ALWAYS write a natural-language answer for the user in the very next message. If tool results are empty, briefly explain that nothing relevant was found and propose next steps. """.trimIndent() } private var generativeModel: GenerativeModel? = null override fun startEngine( modelFile: File, isSupportImages: Boolean, engineParams: MediaPipeEngineParams, ) { logDebug("startEngine", TAG) val interfaceOptions = createInterfaceOptions( modelFile = modelFile, engineParams = engineParams, isSupportImages = isSupportImages, ) val engineMediaPipe = LlmInference.createFromOptions( application, interfaceOptions ) if (engineMediaPipe == null) { logError("startEngine llmInference == null", TAG) return } val searchDocs = FunctionDeclaration.newBuilder() .setName(TOOLS_ACTION_SEARCH_DOCS) .setDescription(TOOLS_ACTION_SEARCH_DOCS_DESCRIPTION) .setParameters( Schema.newBuilder() .setType(com.google.ai.edge.localagents.core.proto.Type.OBJECT) .putProperties( TOOLS_PARAM_QUERY, Schema.newBuilder() .setType(com.google.ai.edge.localagents.core.proto.Type.STRING) .setDescription(TOOLS_PARAM_QUERY_DESCRIPTION) .build() ) .putProperties( TOOLS_PARAM_TOP_K, Schema.newBuilder() .setType(com.google.ai.edge.localagents.core.proto.Type.INTEGER) .setDescription(TOOLS_PARAM_TOP_K_DESCRIPTION) .build() ) .putProperties( TOOLS_PARAM_MIN_SIMILARITY_SCORE, Schema.newBuilder() .setType(com.google.ai.edge.localagents.core.proto.Type.NUMBER) .setDescription(MIN_SIMILARITY_SCORE_DESCRIPTION) .build() ) .build() ) .build() val systemInstruction = Content.newBuilder() .setRole(Gemma3nRoles.SYSTEM.type) .addParts( Part.newBuilder().setText( PROMPT_TEMPLATE_WITH_TOOLS ) ) .build() val tool = Tool.newBuilder() .addFunctionDeclarations(searchDocs) .build() val inferenceBackend = LlmInferenceBackend( engineMediaPipe, GemmaFormatter() ) generativeModel = GenerativeModel( inferenceBackend, systemInstruction, listOf(tool), ) logDebug("startEngine ready", TAG) } override fun generateResponse( userQuery: String, maxSteps: Int, ): Flow<ResultEmittedData<String>> = flow { logDebug("generateResponseWithTools userQuery: $userQuery", TAG) try { val generativeModel = generativeModel ?: run { logError("generateResponseWithTools generativeModel is null", TAG) emit( ResultEmittedData.error( model = null, error = null, title = 
"MediaPipe engine error", responseCode = null, message = "Model is not initialized;", errorType = ErrorType.ERROR_IN_LOGIC, ) ) return@flow } val contentPart = Part.newBuilder() .setText(userQuery) .build() val userContent = Content.newBuilder() .setRole(Gemma3nRoles.USER.type) .addParts(contentPart) .build() val conversation = mutableListOf(userContent) var step = 0 // just in case there is a cycle here // the model is sent a request if it thinks the tool needs to be called, // it writes service information, the tool is called and the request with the result is repeated // if you want the model to try to find the best result for a query, // when changing the query text or minimum similarity, write about it in the prompt so that // the model meant that there could be many tool calls while (step < maxSteps) { logDebug("generateResponseWithTools step: $step conversation: ${conversation.size}", TAG) step++ val response: GenerateContentResponse = generativeModel.generateContent( conversation ) val responseContent: Content = response.candidatesList.firstOrNull()?.content ?: run { logError("generateResponseWithTools content is null", TAG) emit( ResultEmittedData.error( model = null, error = null, title = "MediaPipe engine error", responseCode = null, message = "Candidates list is null", errorType = ErrorType.ERROR_IN_LOGIC, ) ) return@flow } val functionCall: FunctionCall? = responseContent.partsList.firstOrNull { it.hasFunctionCall() }?.functionCall // if the model has decided that there is no need to call the tools, we simply send a response to the user if (functionCall == null) { val text = extractText(response) if (text.isBlank()) { logError( "generateResponseWithTools text is blank, response: $response", TAG ) emit( ResultEmittedData.error( model = null, error = null, title = "MediaPipe engine error", responseCode = null, message = "Empty text", errorType = ErrorType.ERROR_IN_LOGIC, ) ) return@flow } logDebug("generateResponseWithTools functionCall is null text: $text", TAG) emit( ResultEmittedData.success( model = text, message = null, responseCode = null ) ) return@flow } if (functionCall.name != TOOLS_ACTION_SEARCH_DOCS) { logError("generateResponseWithTools wrong name: ${functionCall.name}", TAG) val text = extractText(response) if (text.isBlank()) { logError( "generateResponseWithTools text is blank, response: $response", TAG ) emit( ResultEmittedData.error( model = null, error = null, title = "MediaPipe engine error", responseCode = null, message = "Wrong function call", errorType = ErrorType.ERROR_IN_LOGIC, ) ) return@flow } emit( ResultEmittedData.success( model = text, message = null, responseCode = null ) ) return@flow } val args = functionCall.args.fieldsMap // the model returns the text of the database query in the tool call parameters, // number of results and similarity // if nothing is found, the prompt says that the similarity can be reduced val query = args[TOOLS_PARAM_QUERY]?.stringValue val topK = args[TOOLS_PARAM_TOP_K]?.numberValue?.toInt() ?: 5 val minSimilarityScore = args[TOOLS_PARAM_MIN_SIMILARITY_SCORE]?.numberValue?.toFloat() ?: 0.0F if (query.isNullOrEmpty()) { logError("generateResponseWithTools query is null or empty", TAG) val text = extractText(response) if (text.isBlank()) { logError( "generateResponseWithTools text is blank, response: $response", TAG ) emit( ResultEmittedData.error( model = null, error = null, title = "MediaPipe engine error", responseCode = null, message = "Wrong function call", errorType = ErrorType.ERROR_IN_LOGIC, ) ) return@flow } 
logDebug("generateResponseWithTools query is null or empty text: $text", TAG) emit( ResultEmittedData.success( model = text, message = null, responseCode = null ) ) return@flow } val results: String = searchDocsInternal( query, topK, minSimilarityScore ) val respStruct = Struct.newBuilder() .putFields( TOOLS_PARAM_QUERY, Value.newBuilder().setStringValue(query).build() ) .putFields( TOOLS_PARAM_TOP_K, Value.newBuilder().setNumberValue(topK.toDouble()).build() ) .putFields( TOOLS_PARAM_MIN_SIMILARITY_SCORE, Value.newBuilder().setNumberValue(minSimilarityScore.toDouble()).build() ) .putFields( RESULTS, Value.newBuilder().setStringValue(results).build() ) .build() val functionResponse = FunctionResponse.newBuilder() .setName(TOOLS_ACTION_SEARCH_DOCS) .setResponse(respStruct) .build() val functionResponsePart = Part.newBuilder() .setFunctionResponse(functionResponse) .build() val toolContent = Content.newBuilder() .setRole(Gemma3nRoles.MODEL.type) .addParts(functionResponsePart) .build() // we add a model response with a tool call and the tool call itself // into the message chain and start the next iteration of the loop // the model will thus see all its requests and all the results of calling the tool conversation.add(responseContent) conversation.add(toolContent) logDebug("conversation: $conversation", TAG) if (step == maxSteps) { val finalResponse = generativeModel.generateContent(conversation) val text = extractText(finalResponse) if (text.isBlank()) { logError("generateResponseWithTools finalResponse text is blank", TAG) emit( ResultEmittedData.error( title = "MediaPipe engine error", message = "Empty final response", error = null, model = null, responseCode = null, errorType = ErrorType.ERROR_IN_LOGIC, ) ) return@flow } emit( ResultEmittedData.success( model = text, message = null, responseCode = null ) ) return@flow } } } catch (t: Throwable) { logError("generateResponseWithTools failed: ${t.message}", t, TAG) emit( ResultEmittedData.error( model = null, error = t, title = "MediaPipe engine error", responseCode = null, message = t.message, errorType = ErrorType.EXCEPTION, ) ) } } // search in a vector database, while all parameters are set by the model itself private suspend fun searchDocsInternal( query: String, topK: Int, minSimilarityScore: Float, ): String { logDebug("searchDocsInternal query: $query topK: $topK minSimilarityScore: $minSimilarityScore", TAG) val queryEmbedData: EmbedData<String> = EmbedData.create( query, EmbedData.TaskType.RETRIEVAL_QUERY ) val embeddingRequest: EmbeddingRequest<String> = EmbeddingRequest.create(listOf(queryEmbedData)) val vector: ImmutableList<Float> = try { common.embedder.getEmbeddings(embeddingRequest).await() } catch (t: Throwable) { logError( "searchDocsInternal: embedding failed: ${t.message}", t, TAG ) return "No results." } if (vector.isEmpty()) { logError("searchDocsInternal vector.isEmpty()", TAG) return "No results." } val hits: ImmutableList<VectorStoreRecord<String>> = try { common.vectorStore.getNearestRecords( vector, topK, minSimilarityScore ) } catch (t: Throwable) { logError("searchDocsInternal: failed: ${t.message}", t, TAG) return "No results." } if (hits.isEmpty()) { logError("searchDocsInternal hits.isEmpty()", TAG) return "No results." 
} val result = buildString { for (h in hits) { appendLine(h.data.trim()) } }.trim() logDebug("searchDocsInternal ready size: ${result.length}", TAG) return result } private fun extractText(response: GenerateContentResponse): String { response.candidatesList.forEach { candidate -> candidate.content.partsList.forEach { part -> if (part.text.isNotEmpty()) return part.text } } return "" } private fun createInterfaceOptions( modelFile: File, engineParams: MediaPipeEngineParams, isSupportImages: Boolean, ): LlmInferenceOptions { val backend = when (engineParams.backend) { MediaPipeBackendParams.CPU -> LlmInference.Backend.CPU MediaPipeBackendParams.GPU -> LlmInference.Backend.GPU } return LlmInferenceOptions.builder().apply { setModelPath(modelFile.absolutePath) setMaxTokens(engineParams.contextSize) setPreferredBackend(backend) val maxNumImages = if (isSupportImages) 1 else 0 setMaxNumImages(maxNumImages) if (engineParams.useMaxTopK) setMaxTopK(engineParams.maxTopK) }.build() } }
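And a similar sketch for the function-calling variant; here the model decides on its own whether to call search_docs, so the only inputs are the user query and the step limit (the query text is illustrative):
import java.io.File

// A minimal sketch: the model may call the search_docs tool up to maxSteps times before answering
suspend fun askWithTools(
    engine: MediaPipeEngineWithTools,
    modelFile: File,
    engineParams: MediaPipeEngineParams,
) {
    engine.startEngine(
        modelFile = modelFile,
        engineParams = engineParams,
    )
    engine.generateResponse(
        userQuery = "Check in the saved documents when the team meeting is",
        maxSteps = 3,
    ).collect { result ->
        result
            .onSuccess { answer, _, _ -> println("answer: $answer") }
            .onFailure { _, _, message, _, _ -> println("error: $message") }
    }
}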
I hope it was interesting.
For those who want to reproduce this, here are the auxiliary classes used:
enum class ErrorType { EXCEPTION, SERVER_ERROR, ERROR_IN_LOGIC, SERVER_DATA_ERROR, NO_INTERNET_CONNECTION, AUTHORIZATION } data class ResultEmittedData<out T>( val model: T?, val error: Any?, val status: Status, val title: String?, val message: String?, val responseCode: Int?, val errorType: ErrorType?, ) { enum class Status { SUCCESS, ERROR, LOADING, } companion object { fun <T> success( model: T, message: String?, responseCode: Int?, ): ResultEmittedData<T> = ResultEmittedData( error = null, title = null, model = model, errorType = null, message = message, status = Status.SUCCESS, responseCode = responseCode, ) fun <T> loading( model: T? = null, message: String? = null, ): ResultEmittedData<T> = ResultEmittedData( model = model, error = null, title = null, errorType = null, message = message, responseCode = null, status = Status.LOADING, ) fun <T> error( model: T?, error: Any?, title: String?, message: String?, responseCode: Int?, errorType: ErrorType?, ): ResultEmittedData<T> = ResultEmittedData( model = model, error = error, title = title, message = message, errorType = errorType, status = Status.ERROR, responseCode = responseCode, ) } } inline fun <T : Any> ResultEmittedData<T>.onLoading( action: ( model: T?, message: String?, ) -> Unit ): ResultEmittedData<T> { if (status == ResultEmittedData.Status.LOADING) action( model, message ) return this } inline fun <T : Any> ResultEmittedData<T>.onSuccess( action: ( model: T, message: String?, responseCode: Int?, ) -> Unit ): ResultEmittedData<T> { if (status == ResultEmittedData.Status.SUCCESS && model != null) action( model, message, responseCode, ) return this } inline fun <T : Any> ResultEmittedData<T>.onFailure( action: ( model: Any?, title: String?, message: String?, responseCode: Int?, errorType: ErrorType?, ) -> Unit ): ResultEmittedData<T> { if (status == ResultEmittedData.Status.ERROR) action( model, title, message, responseCode, errorType ) return this }
data class MediaPipeEngineParams(
    val name: String,
    val topK: Int,
    val topP: Float,
    val temperature: Float,
    val randomSeed: Int,
    val contextSize: Int,
    val maxTopK: Int,
    val useTopK: Boolean,
    val useTopP: Boolean,
    val useTemperature: Boolean,
    val useRandomSeed: Boolean,
    val useMaxTopK: Boolean,
    val backend: MediaPipeBackendParams,
)

enum class MediaPipeBackendParams {
    CPU,
    GPU
}

fun Long.toDurationString(): String {
    var msRemaining = this
    val years = msRemaining / (365L * 24 * 60 * 60 * 1000)
    msRemaining %= (365L * 24 * 60 * 60 * 1000)
    val months = msRemaining / (30L * 24 * 60 * 60 * 1000)
    msRemaining %= (30L * 24 * 60 * 60 * 1000)
    val days = msRemaining / (24L * 60 * 60 * 1000)
    msRemaining %= (24L * 60 * 60 * 1000)
    val hours = msRemaining / (60L * 60 * 1000)
    msRemaining %= (60L * 60 * 1000)
    val minutes = msRemaining / (60L * 1000)
    msRemaining %= (60L * 1000)
    val seconds = msRemaining / 1000
    val milliseconds = msRemaining % 1000
    return buildString {
        if (years > 0) append("$years years, ")
        if (months > 0) append("$months months, ")
        if (days > 0) append("$days days, ")
        if (hours > 0) append("$hours hours, ")
        if (minutes > 0) append("$minutes minutes, ")
        if (seconds > 0) append("$seconds seconds, ")
        append("$milliseconds milliseconds")
    }
}
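As an example, parameters for running gemma-3n on the GPU might look like this; the values are illustrative and should be tuned per device and model:
// Illustrative values only – tune topK/topP/temperature/contextSize for your model and device
val exampleEngineParams = MediaPipeEngineParams(
    name = "gemma-3n-E4B-it-int4",
    topK = 40,
    topP = 0.95F,
    temperature = 0.2F,
    randomSeed = 0,
    contextSize = 4096,
    maxTopK = 64,
    useTopK = true,
    useTopP = true,
    useTemperature = true,
    useRandomSeed = false,
    useMaxTopK = false,
    backend = MediaPipeBackendParams.GPU,
)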
import android.annotation.SuppressLint import android.os.Bundle import android.os.Environment import android.util.Log import com.google.firebase.analytics.FirebaseAnalytics import kotlinx.coroutines.CoroutineScope import kotlinx.coroutines.Dispatchers import kotlinx.coroutines.Job import kotlinx.coroutines.launch import org.koin.core.component.KoinComponent import org.koin.core.component.inject import java.io.File import java.io.FileOutputStream import java.text.SimpleDateFormat import java.util.Date object LogUtil : KoinComponent { private val timeDirectoryName: String private const val QUEUE_CAPACITY = 10000 private const val CURRENT_TAG = "LogUtilExecutionStatusTag" private const val LOG_APP_FOLDER_NAME = "app" private const val TIME_FORMAT_FOR_LOG = "HH:mm:ss dd-MM-yyyy" private const val TIME_FORMAT_FOR_DIRECTORY = "HH-mm-ss_dd-MM-yyyy" private const val TAG = "TAG: " private const val TIME = "TIME: " private const val ERROR_STACKTRACE = "ERROR STACKTRACE: " private const val ERROR_MESSAGE = "ERROR: " private const val DEBUG_MESSAGE = "MESSAGE: " private const val NEW_LINE = "\n" private val queue = ArrayDeque<LogData>(QUEUE_CAPACITY) private var saveLogsToTxtFileJob: Job? = null private val analytics: FirebaseAnalytics by inject() @Volatile private var isSaveLogsToTxtFile = false init { Log.d(CURRENT_TAG, "init") timeDirectoryName = getCurrentTimeForDirectory() } fun logDebug(message: String, tag: String) { CoroutineScope(Dispatchers.IO).launch { if (BuildConfig.DEBUG) { Log.d(tag, message) enqueue( LogData.DebugMessage( tag = tag, time = System.currentTimeMillis(), message = message, ) ) saveLogsToTxtFile() } } } fun logError(message: String, tag: String) { CoroutineScope(Dispatchers.IO).launch { if (BuildConfig.DEBUG) { Log.e(tag, message) enqueue( LogData.ErrorMessage( tag = tag, time = System.currentTimeMillis(), message = message, ) ) saveLogsToTxtFile() } } } fun logError(exception: Throwable, tag: String) { CoroutineScope(Dispatchers.IO).launch { if (BuildConfig.DEBUG) { Log.e(tag, exception.message, exception) enqueue( LogData.ExceptionMessage( tag = tag, time = System.currentTimeMillis(), exception = exception, ) ) saveLogsToTxtFile() } } } fun logError(message: String, exception: Throwable, tag: String) { CoroutineScope(Dispatchers.IO).launch { if (BuildConfig.DEBUG) { Log.e(tag, "$message, exception: ${exception.message}", exception) enqueue( LogData.ErrorMessageWithException( tag = tag, time = System.currentTimeMillis(), message = message, exception = exception, ) ) saveLogsToTxtFile() } } } fun logError(message: String, error: String?, tag: String) { CoroutineScope(Dispatchers.IO).launch { if (BuildConfig.DEBUG) { Log.e(tag, "$message, error: $error") enqueue( LogData.ErrorMessage( tag = tag, time = System.currentTimeMillis(), message = message, ) ) saveLogsToTxtFile() } } } @SuppressLint("SimpleDateFormat") private fun getTime(time: Long): String { return try { val date = Date(time) val timeString = SimpleDateFormat(TIME_FORMAT_FOR_LOG).format(date) timeString.ifEmpty { Log.e(CURRENT_TAG, "getTime time.ifEmpty") time.toString() } } catch (e: Exception) { Log.e(CURRENT_TAG, "getCurrentTime exception: ${e.message}", e) time.toString() } } @SuppressLint("SimpleDateFormat") private fun getCurrentTimeForDirectory(): String { val time = System.currentTimeMillis() return try { val date = Date(time) val timeString = SimpleDateFormat(TIME_FORMAT_FOR_DIRECTORY).format(date) Log.d(CURRENT_TAG, "getCurrentTimeForDirectory time: $time") timeString.ifEmpty { Log.e(CURRENT_TAG, 
"getCurrentTimeForDirectory time.ifEmpty") time.toString() } } catch (e: Exception) { Log.e(CURRENT_TAG, "getCurrentTimeForDirectory exception: ${e.message}", e) time.toString() } } private fun enqueue(message: LogData) { try { while (queue.size >= QUEUE_CAPACITY) { Log.d(CURRENT_TAG, "enqueue removeFirst") queue.removeFirst() } queue.addLast(message) } catch (e: Exception) { Log.e(CURRENT_TAG, "enqueue exception: ${e.message}", e) } } }
Example 6 – CAG (Cache-Augmented Generation) in llama.cpp
Here I will sketch a concept of how this could work.
This particular code is slow and not suitable for production use.
Example of getting the model context as a byte array.
The resulting context can then be saved to a database or file.
Context may include prompts, questions and answers, added documents, and so on.
#include <vector>
#include <cstdint>
#include <stdexcept>
#include <cstddef>

#include "llama.h"

std::vector<uint8_t> get_full_state_raw(llama_context* ctx) {
    // Check that the context pointer is not nullptr
    if (ctx == nullptr) {
        throw std::invalid_argument("llama_context pointer is null");
    }
    // Get the size of the model state (in bytes)
    const size_t state_size = llama_state_get_size(ctx);
    if (state_size == 0) {
        throw std::runtime_error("llama_state_get_size returned 0");
    }
    // Allocate a buffer of the required size to store the state
    std::vector<uint8_t> out(state_size);
    // Copy the binary state data into the buffer
    const size_t written = llama_state_get_data(ctx, out.data(), out.size());
    // Check that the written size matches the expected one
    if (written != state_size) {
        throw std::runtime_error("state size changed during serialization");
    }
    // Return the state as a byte array
    return out;
}
Example of restoring the context from a byte array.
It is important to remember that the context can only be restored for the same model it was saved from, since it contains model-specific intermediate states. It is also not possible to merge several contexts – there is no fully modular architecture for models yet.
#include <vector>
#include <cstdint>
#include <stdexcept>
#include <cstddef>

#include "llama.h"

int set_full_state_raw(llama_context* ctx, const std::vector<uint8_t>& data) {
    // Check that the context and the data have been passed in
    if (ctx == nullptr) {
        throw std::invalid_argument("llama_context pointer is null");
    }
    if (data.empty()) {
        throw std::invalid_argument("data is empty");
    }
    // Restore the model state from the byte array
    const size_t written = llama_state_set_data(ctx, data.data(), data.size());
    // Return the number of bytes actually loaded
    return static_cast<int>(written);
}
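Since the rest of the article lives on the Kotlin side, here is a sketch of how these two native helpers could be bridged over JNI and used to persist the context to a file. The bridge object, the native library name and the way the llama_context pointer is passed as a Long are all assumptions, not part of llama.cpp.
import java.io.File

// Hypothetical JNI bridge to the two C++ helpers above; names and signatures are assumptions
object LlamaStateBridge {
    init {
        System.loadLibrary("llama_state_bridge") // hypothetical native library name
    }

    // Wraps get_full_state_raw(): returns the serialized llama_context state
    external fun getFullStateRaw(contextHandle: Long): ByteArray

    // Wraps set_full_state_raw(): loads a previously saved state, returns the number of bytes applied
    external fun setFullStateRaw(contextHandle: Long, data: ByteArray): Int
}

// Save the current context (system prompt, dialogue, attached documents) to a file
fun saveContextToFile(contextHandle: Long, file: File) {
    file.writeBytes(LlamaStateBridge.getFullStateRaw(contextHandle))
}

// Restore the context later, into a context created for the same model
fun restoreContextFromFile(contextHandle: Long, file: File): Int {
    return LlamaStateBridge.setFullStateRaw(contextHandle, file.readBytes())
}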