
Offline AI Launcher – run LLM neural networks on your smartphone

In my spare time I wrote an application that lets you download and run LLM neural networks, such as Google Gemma, DeepSeek, Llama, Qwen and many others, directly on your smartphone.

There are plenty of AI applications these days, but the key difference here is that the models do not run in the cloud on someone's servers: they run directly on the smartphone, from a local model file, using the phone's CPU and GPU.

To run it you need neither an Internet connection nor a paid subscription; a smartphone with 6+ GB of RAM and a reasonably modern processor is enough.

The application works with the HuggingFace model repository:

https://huggingface.co

It is the largest repository of open-source models, hosting millions of them, thousands of which are Large Language Models (LLMs).

The application can open the most common model format, .gguf, using the llama.cpp engine, as well as Google's formats for the MediaPipe engine, .task and .tflite (MediaPipe versions of models are available in the Google repository).
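As a rough illustration of how such format-based engine selection can look, here is a minimal Kotlin sketch; the class and function names are my assumptions, not the app's actual API.

```kotlin
// Hypothetical illustration: routing a model file to an engine by its
// extension. Names are assumptions, not the app's real internals.
import java.io.File

sealed interface Engine { fun load(model: File) }

object LlamaCppEngine : Engine {
    override fun load(model: File) = println("llama.cpp <- ${model.name}")
}

object MediaPipeEngine : Engine {
    override fun load(model: File) = println("MediaPipe <- ${model.name}")
}

fun engineFor(model: File): Engine = when (model.extension.lowercase()) {
    "gguf" -> LlamaCppEngine              // llama.cpp's single-file format
    "task", "tflite" -> MediaPipeEngine   // Google's formats for MediaPipe
    else -> error("Unsupported model format: .${model.extension}")
}
```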

For some models, such as Gemma 3n (an open version of Gemini) running via the MediaPipe engine, image processing is supported.

The app is aimed primarily at AI enthusiasts, but also at those who cannot use cloud models such as ChatGPT or Gemini because of possible data leakage, and at people who do not have constant Internet access.

At the moment the app is free, but if I cannot find a way to fund my work on it, some of its features will become paid.

The application has four tabs: Chat, History, Models and Settings.

Chat

This tab contains a chat with the currently running model.
  • New chat: Clears the current dialogue with the model.
  • Image: Adds an image to the chat. Currently only the MediaPipe engine with the Gemma 3n and Gemma 3 (4B/12B/27B) models is supported.
  • Clear text: Clears the text of the current message.
  • Insert text: Pastes text from the clipboard at the current input position.
  • Voice input: When pressed, voice input is activated. After voice input completes, the text is sent to the model automatically if "Automatically send text to model after voice recognition is completed" is enabled in the application settings; otherwise you must send it with the "Send" button.
  • Send: Send a message to the model. The text on the button may change depending on the state of the model.
  • Microphone icon on a message: Reads the message aloud. The speech rate can be adjusted in the application settings with the "Speed of speech" parameter. If "Automatically speak responses after generation" is enabled in the settings, the message is spoken automatically once the model finishes generating (a sketch of the underlying Android speech API follows this list).
  • Tapping a message: Opens a menu where you can copy the message to the clipboard, share it with other applications, or have it read aloud.
  • Top panel: Shows which model you are currently talking to. It also displays the amount of free RAM if the "Display free memory on screen" option is enabled.
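For reference, the speech features above can be built on Android's standard TextToSpeech API. This minimal sketch shows the part that maps onto the "Speed of speech" setting; it is an illustration, not the app's actual code.

```kotlin
// Minimal sketch: reading a message aloud with Android's TextToSpeech.
import android.content.Context
import android.speech.tts.TextToSpeech

class Speaker(context: Context) {
    private var ready = false
    private val tts = TextToSpeech(context) { status ->
        ready = (status == TextToSpeech.SUCCESS)  // engine initialized asynchronously
    }

    fun speak(text: String, rate: Float = 1.0f) {
        if (!ready) return
        tts.setSpeechRate(rate)  // e.g. 0.5 = slow, 2.0 = fast ("Speed of speech")
        tts.speak(text, TextToSpeech.QUEUE_FLUSH, null, "message")
    }
}
```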
History

Shows a list of all dialogues with all models.
  • Tapping a dialogue: Opens that dialogue with the model.
  • Delete: Deletes the dialogue with the model. All dialogues can be deleted at once in the application settings via "Clear chat history".
  • Continue the dialogue: You can continue any conversation; when you continue one, it is saved as a new conversation. Tapping it opens a list of all available models, and you can continue the previously started conversation with any of them, even if it began with a different model. Note that model context is limited: if the conversation contains a large amount of text, the context may overflow and part of the conversation may be lost. The context size can be adjusted in the application settings. Also note that context size is measured not in characters but in tokens, each of which usually corresponds to part of a word.
Models

Here you will find all downloaded models, as well as models running in Ollama and LM Studio on your computer. Each model corresponds to a file in the application folder. You can open this folder in some file managers, such as ES Explorer, and add or remove models yourself.
  • Run the model: Launches the model, unloading all other models. If the model is running on a computer on the local network, it is not launched; instead, a dialogue with it is opened, since the model's launch is controlled by the program hosting it.
  • Show information: Opens a page with technical information about the model.
  • Tapping a model: Opens the model in a supported file manager. Note that not all file managers support opening files this way, and not all of them have permission to open another application's folder.
  • Delete: Deletes a model. Deleting models running in Ollama on a computer on the local network is also supported.
  • Top panel: Displays the application folder where the models are stored. Tapping the folder lets you open it in a file manager, or copy or share its path. It also displays the amount of free RAM if the "Display free memory on screen" option is enabled.
  • Download models: Opens the page for downloading models from the HuggingFace repository.
  • Author: Filters the list of models by author. You can enter text or select one of the presets.
  • Pipeline: Filters the list of models by pipeline tag. You can enter text or select one of the presets.
  • Filter: Filters the list of models by repository tag. You can enter text or select one of the presets.
  • Search: Filters the list of models by model name. You can enter text or select one of the presets.
  • Sort option: Chooses which parameter the list is sorted by.
  • ▼: Open the list of presets for the corresponding filtering parameter.
  • +: Save the parameter entered in the text field to the presets.
  • -: Remove the parameter entered in the text field from the presets.
  • Update: Refresh the list.
  • Close filter: Hide the filter window.
  • Open model repository: Opens the model's page in the HuggingFace repository.
  • Download: Opens a list of all files available in the repository. Files with Q4 quantization that are suitable for the Arm architecture are marked in green. If your smartphone has a fast processor and the model is small, you can try downloading and running a model with any quantization; the higher the quantization number, the higher the model's accuracy. After you tap a file, the current download status is displayed at the top of the screen; tapping it stops the download. (A sketch of the repository query behind this list follows below.)
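The filter and download controls above map quite directly onto the HuggingFace Hub REST API. Below is a minimal Kotlin sketch of such a query; whether the app issues exactly this request is my assumption, but search, author, filter and sort are documented Hub API parameters.

```kotlin
// Minimal sketch: listing GGUF models from the HuggingFace Hub API.
import java.net.HttpURLConnection
import java.net.URL
import java.net.URLEncoder

fun listGgufModels(search: String, author: String? = null): String {
    val params = buildList {
        add("search=" + URLEncoder.encode(search, "UTF-8"))
        add("filter=gguf")      // only repositories tagged with the GGUF format
        add("sort=downloads")   // one of the sort options
        add("limit=20")
        author?.let { add("author=" + URLEncoder.encode(it, "UTF-8")) }
    }.joinToString("&")
    val conn = URL("https://huggingface.co/api/models?$params")
        .openConnection() as HttpURLConnection
    // Returns a JSON array of model entries (id, downloads, tags, ...).
    return conn.inputStream.bufferedReader().use { it.readText() }
}
```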
Settings

Application settings
  • Language selection: Changes the language of the application. With "System" selected, the language follows the device language. Currently supported: English, German, French, Spanish, Polish, Russian, Ukrainian.
  • View mode: The application theme: dark, light or system.
  • Voice settings: Speech settings. Here you can enable automatic reading of responses after generation, automatic sending of a message after voice recognition, and adjust the speech rate.
  • Debug settings: Here you can choose whether to display the amount of available RAM, and open the llama.cpp engine debug information page.
  • llama.cpp engine debug information page: After turning on the switch, you can view the diagnostic output that the engine writes to the console while loading the model and generating a response. Different debug levels are shown in different colors.
  • HuggingFace repository token: Some models in the HuggingFace repository, such as Google's, require a token to download. You can enter the token manually. If you do not, the app supports authorization through the HuggingFace website: when a model requires authorization, the app attempts to authorize via the website and stores the received token, which is then shown in the token field. (A sketch of a token-authorized download follows this list.)
  • Copy/paste application settings to clipboard: You can copy all application settings to the clipboard as JSON, paste settings from the clipboard, or send them via SMS. Note that different versions of the application may have incompatible settings.
  • Rate the app: Rate the app on Google Play. Note that this is an experimental app and may not work reliably in some cases or with some models.
  • Clear chat history: Clear all saved messages.
  • Restore application settings to default: Removes all application settings. When updating from a version with critical settings-related bugs to a more stable one, the previous settings may also be removed.
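To make the token mechanics concrete, here is a minimal sketch of downloading a model file with the HuggingFace token described above passed as a Bearer header, which is how gated repositories expect it. The function name and the plain HttpURLConnection approach are illustrative, not the app's actual code.

```kotlin
// Minimal sketch: downloading a gated model file with an HF access token.
import java.io.File
import java.net.HttpURLConnection
import java.net.URL

fun downloadGated(fileUrl: String, token: String, dest: File) {
    val conn = URL(fileUrl).openConnection() as HttpURLConnection
    // Gated repositories (e.g. Google's) reject anonymous requests.
    conn.setRequestProperty("Authorization", "Bearer $token")
    conn.instanceFollowRedirects = true  // HF serves large files via redirects
    conn.inputStream.use { input ->
        dest.outputStream().use { output -> input.copyTo(output) }
    }
    // Real code should also handle HTTP errors, resume, and progress reporting.
}
```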
Memory settings

Running LLM models on a smartphone requires a large amount of free RAM. On this screen you can see the current amount of available RAM and configure notifications, or automatic unloading of the model, when memory runs low. The C++ engine code includes a function that takes the parameters from this screen and stops generation when memory is insufficient. In some cases, however, memory may run out suddenly, in which case the application is closed. The model's context size directly affects the amount of RAM consumed; it can be configured on the engine settings page. This page also shows the processor architecture and the supported instruction set. (A sketch of how the available-RAM figure can be read follows below.)
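As an illustration, the available-RAM figure such a screen displays can be read through Android's standard ActivityManager API; this is a minimal sketch, not necessarily how the app implements it.

```kotlin
// Minimal sketch: reading available RAM on Android.
import android.app.ActivityManager
import android.content.Context

fun availableRamMb(context: Context): Long {
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val info = ActivityManager.MemoryInfo()
    am.getMemoryInfo(info)          // fills availMem, totalMem, lowMemory, threshold
    return info.availMem / (1024 * 1024)  // bytes -> MiB
}
```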
  • Hint settings: On this page you can create and edit hints (system prompts), and save them to or delete them from presets. The hint is used by all engines and is prepended to the first message of the dialogue. The default hint cannot be deleted, and when you change the language the default hint is reset to the selected language. If you want to use your own hint, do not save it as the default one.
  • Template settings: The llama.cpp engine uses different chat templates to interact with different models. Presets are usually shared within one model family. On this page you can update existing presets or add presets for new models. If a model is currently running in the application, its template is selected when you open this page.
  • Template name: Templates are linked to the loaded model via the "Model Name Keyword". You can list names for several models, separated by commas, as long as they use the same preset.
  • Template for the first message: The template for the first message must include the prompt. When a message is sent to the model, the prompt_text placeholder is replaced with the prompt and the message_text placeholder with the user's message (see the example after this list).
  • Stop generation keyword: A keyword that, when received in the model's response, makes the application stop generation. This is needed for some models that, due to errors during training or modification, emit keywords the engine does not recognize, so generation would otherwise continue indefinitely. You can use several keywords separated by the "," symbol.
  • Keyword replacement: Some models may emit service keywords in the response. Here you can list which keywords are removed from the response, in the format "keyword" -> "replacement". For example, you can replace the service tag that marks the model's thinking with wording of your own.
  • llama.cpp engine settings: Settings for the llama.cpp engine. You can save settings to presets. More information about these parameters and their impact can be found in articles about LLM models. Here are some of them:
  • Context size: The maximum number of tokens the model can "remember" at once. The amount of RAM limits the usable context size; recommended values: 2-4 GB RAM – 128-256 tokens, 6-8 GB RAM – 256-1024 tokens, 12-16 GB RAM – 1024-32000 tokens.
  • Response size: The maximum number of tokens the model can generate.
  • Temperature: Float value for temperature (used if temperature sampling is enabled). Controls randomness: 1.0 = neutral, 0.7 = conservative, 1.3 = creative.
  • Top-K: Integer value for Top-K. Used if use_top_k is true. Specifies the number of top tokens to keep.
  • Top-P: Float value for Top-P. Used if use_top_p is true. Represents the cumulative probability threshold (e.g. 0.9).
  • Streams: The number of threads used for initialization.
  • Stream package: The number of threads used for generation.
  • Samplers: Selects the samplers used to process the raw token scores.
  • Read more: You can read more about this in my article “Neural Networks in Simple Terms”.
  • MediaPipe engine settings: Settings for the MediaPipe engine. They partially mirror the llama.cpp settings.
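To make the placeholder mechanism from "Template for the first message" concrete, here is a hypothetical first-message template for a Gemma-family model. The turn markers follow Gemma's published chat format; the exact presets shipped with the app may differ.

```kotlin
// Hypothetical illustration of the prompt_text / message_text placeholders.
val firstMessageTemplate = """
    <start_of_turn>user
    prompt_text
    message_text<end_of_turn>
    <start_of_turn>model
""".trimIndent()

fun buildFirstMessage(prompt: String, userMessage: String): String =
    firstMessageTemplate
        .replace("prompt_text", prompt)        // the hint / system prompt
        .replace("message_text", userMessage)  // the user's message
```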
LocalHost engine settings
  • Ollama LocalHost settings: You can run Ollama on your computer to generate responses on its GPU and connect to it by IP address. To set it up, use the following commands:
    • $env:OLLAMA_HOST="0.0.0.0" – sets the environment variable so that Ollama listens on an external IP address; without this, Ollama only listens on localhost inside the computer.
    • setx OLLAMA_HOST 0.0.0.0 /M – writes the variable to the permanent environment variables; this must be done as an administrator, and the terminal restarted afterwards.
    • ollama pull gemma3n:e4b – downloads the gemma3n model.
    • ollama serve – starts Ollama.
    • ollama list – shows the downloaded models.
    After the model is downloaded and Ollama is running, the model will be visible in the application. The IP address is your computer's address on the network, usually 192.168.., you can find it in the WiFi settings on your computer or in the router settings.
  • OpenAI LocalHost settings: You can run a model on your computer in LM Studio or another OpenAI-API-compatible program to generate responses on the GPU and connect via IP address. The IP address is your computer's address on the network, usually 192.168.., you can find it in the WiFi settings on your computer or in the router settings. To make LM Studio listen on a port on your computer, select LM Studio – Developer – Settings – Serve on LAN and set the Status switch to Running. (A request sketch follows below.)
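For reference, the LocalHost connection described above boils down to an HTTP request against an OpenAI-compatible endpoint. Below is a minimal non-streaming sketch; the IP address, port (LM Studio's default is 1234) and model name are placeholders to adjust for your setup.

```kotlin
// Minimal sketch: a non-streaming chat request to an OpenAI-compatible
// server (LM Studio, Ollama's /v1 endpoint, etc.).
import java.net.HttpURLConnection
import java.net.URL

fun chat(message: String): String {
    val body = """
        {"model": "gemma3n:e4b",
         "messages": [{"role": "user", "content": ${jsonString(message)}}],
         "stream": false}
    """.trimIndent()
    val conn = URL("http://192.168.1.10:1234/v1/chat/completions")
        .openConnection() as HttpURLConnection
    conn.requestMethod = "POST"
    conn.doOutput = true
    conn.setRequestProperty("Content-Type", "application/json")
    conn.outputStream.use { it.write(body.toByteArray()) }
    // The reply is JSON; the answer text sits in choices[0].message.content.
    return conn.inputStream.bufferedReader().use { it.readText() }
}

// Naive JSON string escaping, good enough for this example.
fun jsonString(s: String) =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""
```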

So far this is for experimental use only: the models on HuggingFace consist of many files, which makes them inconvenient to use, and the engine is poorly optimized for Android, so models run slower than on llama.cpp and MediaPipe. In the future I plan to repackage some models into a specialized single-file format and upload them to HuggingFace.

So far this is for experimental use only: the models on HuggingFace consist of many files with no way to pack them into a single file, the engine is inconvenient to use, and it runs slowly on Android.

I plan to add

  • Modular Clean Architecture by Robert C. Martin

https://en.wikipedia.org/wiki/Robert_C._Martin

Tech stack

  • Languages: Kotlin, Java, C++
  • Build systems: Gradle (Groovy DSL), CMake