A quick introduction to Large Language Models for Android developers


Posted by Thomas Ezan, Sr. Developer Relations Engineer

Android has supported traditional machine learning models for years. Frameworks and SDKs like LiteRT (formerly known as TensorFlow Lite), ML Kit and MediaPipe enable developers to easily implement tasks like image classification and object detection.

In recent years, generative AI (gen AI) and large language models (LLMs) have opened up new possibilities for language understanding and text generation. We have lowered the barriers to integrating gen AI features into your apps, and this blog post will give you the high-level knowledge you need to get started.

Before we dive into the specifics of generative AI models, let's take a high-level look at how machine learning (ML) differs from traditional programming.

Machine learning as a new programming paradigm

A key difference between traditional programming and ML lies in how features are implemented.

In traditional programming, developers write explicit algorithms that take input and produce a desired output.


Machine learning takes a different approach: developers provide a large set of previously collected input data and the corresponding output, and the ML model is trained to learn how to map the input to the output.

[Figure: "1. Train the model with a large set of input and output data" — arrows labeled "Input" and "Output" point to a box labeled "ML Model Training"; an arrow labeled "ML Model" points away from it]

Then, the model is deployed in the cloud or on-device to process input data. This step is called inference.

[Figure: "2. Deploy the model to run inferences on input data" — an arrow labeled "Input" points to a box labeled "Run ML Inference"; an arrow labeled "Output" points away from it]

This paradigm enables developers to tackle problems that were previously difficult or impossible to solve with rule-based programming.
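To make the contrast concrete, here is a minimal, purely illustrative Kotlin sketch. TrainedModel and SpamClassifier-style names are hypothetical stand-ins, not part of any Android SDK:

// Traditional programming: the developer writes the rules explicitly.
fun isSpamRuleBased(email: String): Boolean =
    listOf("win money", "free prize", "act now")
        .any { email.contains(it, ignoreCase = true) }

// Machine learning: the mapping from input to output lives inside a model
// trained on labeled examples; at runtime the app only runs inference.
fun interface TrainedModel {
    fun predict(input: String): Float // e.g. probability that the input is spam
}

fun isSpamLearned(model: TrainedModel, email: String): Boolean =
    model.predict(email) > 0.5f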

Traditional machine learning vs. generative AI on Android

Traditional ML on Android includes tasks such as image classification, which can be implemented using MobileNet and LiteRT, or pose estimation, which can easily be added to your Android app with the ML Kit SDK. These models are typically trained on specific datasets and perform extremely well on well-defined, narrow tasks.
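As an example of such a narrow task, the sketch below labels the contents of a bitmap with ML Kit's on-device image labeler. It assumes the ML Kit image labeling dependency is already declared in your Gradle build:

import android.graphics.Bitmap
import com.google.mlkit.vision.common.InputImage
import com.google.mlkit.vision.label.ImageLabeling
import com.google.mlkit.vision.label.defaults.ImageLabelerOptions

// Classify the content of a bitmap with ML Kit's on-device image labeler.
fun labelImage(bitmap: Bitmap) {
    val image = InputImage.fromBitmap(bitmap, /* rotationDegrees = */ 0)
    val labeler = ImageLabeling.getClient(ImageLabelerOptions.DEFAULT_OPTIONS)
    labeler.process(image)
        .addOnSuccessListener { labels ->
            // Each label carries a human-readable description and a confidence score.
            labels.forEach { label -> println("${label.text}: ${label.confidence}") }
        }
        .addOnFailureListener { e -> println("Labeling failed: $e") }
}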

Generative AI introduces the capability to understand inputs such as text, images, audio and video, and to generate human-like responses. This enables applications like chatbots, language translation, text summarization, image captioning, image or code generation, creative writing assistance, and much more.

Most state-of-the-art generative AI models, like the Gemini models, are built on the transformer architecture. To generate images, diffusion models are often used.

Understanding large language models

At its core, an LLM is a neural network model trained on massive amounts of text data. It learns patterns, grammar, and semantic relationships between words and phrases, enabling it to predict and generate text that mimics human language.

As mentioned earlier, most recent LLMs use the transformer architecture. It breaks down input into tokens, assigns numerical representations called "embeddings" (see Key concepts below) to these tokens, and then processes these embeddings through multiple layers of the neural network to understand the context and meaning.

LLMs typically go through two main phases of training:

      1. Pre-training phase: The model is exposed to vast amounts of text from different sources to learn general language patterns and knowledge.

      2. Fine-tuning phase: The model is trained on specific tasks and datasets to refine its performance for particular applications.

Classes of models and their capabilities

Gen AI models come in various sizes, from smaller models like Gemini Nano or Gemma 2 2B to massive models like Gemini 1.5 Pro that run on Google Cloud. The size of a model generally correlates with its capabilities and the compute power required to run it.

Models are constantly evolving, with new research pushing the boundaries of their capabilities. These models are being evaluated on tasks like question answering, code generation, and creative writing, demonstrating impressive results.

In addition, some models are multimodal, which means they are designed to process and understand information from multiple modalities, such as images, audio, and video, alongside text. This allows them to tackle a wider range of tasks, including image captioning, visual question answering, and audio transcription. Several Google generative AI models such as Gemini 1.5 Flash, Gemini 1.5 Pro, Gemini Nano with Multimodality and PaliGemma are multimodal.
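For instance, here is a minimal sketch of a mixed image-and-text prompt using the Google AI client SDK for Android. The model name and API-key plumbing are example assumptions; adapt them to your setup:

import android.graphics.Bitmap
import com.google.ai.client.generativeai.GenerativeModel
import com.google.ai.client.generativeai.type.content

// Send a mixed image-and-text prompt to a multimodal Gemini model.
suspend fun describeImage(bitmap: Bitmap, apiKey: String): String? {
    val model = GenerativeModel(modelName = "gemini-1.5-flash", apiKey = apiKey)
    val response = model.generateContent(
        content {
            image(bitmap)                  // visual input
            text("Describe this picture.") // textual instruction
        }
    )
    return response.text
}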

Key concepts

Context Window

The context window refers to the number of tokens (converted from text, images, audio or video) that the model considers when generating a response. For chat use cases, it includes both the current input and the history of past interactions. For reference, 100 tokens amount to about 60 to 80 English words, and Gemini 1.5 Pro currently supports 2M input tokens: enough to fit all seven Harry Potter books... and more!
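Since token usage determines what fits in the context window, it can be useful to count tokens before sending a prompt. A minimal sketch using the Google AI client SDK for Android (model instantiation omitted):

import com.google.ai.client.generativeai.GenerativeModel

// Count how many tokens a prompt will consume before sending it,
// to make sure it fits within the model's context window.
suspend fun logPromptSize(model: GenerativeModel, prompt: String) {
    val response = model.countTokens(prompt)
    println("Prompt size: ${response.totalTokens} tokens")
}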

Embeddings

Embeddings are multidimensional numerical representations of tokens that encode their semantic meaning and relationships within a given vector space. Words with similar meanings are closer together, while words with opposite meanings are farther apart.

The embedding process is a key component of an LLM. You can try it independently using the MediaPipe Text Embedder for Android. It can be used to identify relationships between words and sentences and to implement a simplified semantic search directly on-device.
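Here is a minimal sketch using the MediaPipe Text Embedder task. The model file name is an assumption: bundle whichever embedding model you choose (e.g. the Universal Sentence Encoder model from the MediaPipe documentation) in your app's assets:

import android.content.Context
import com.google.mediapipe.tasks.text.textembedder.TextEmbedder

// Compare the semantic similarity of two sentences fully on-device.
fun semanticSimilarity(context: Context, first: String, second: String): Double {
    // Assumed asset name: replace with the embedding model bundled in your app.
    val embedder = TextEmbedder.createFromFile(context, "universal_sentence_encoder.tflite")
    val firstEmbedding = embedder.embed(first).embeddingResult().embeddings().first()
    val secondEmbedding = embedder.embed(second).embeddingResult().embeddings().first()
    // Cosine similarity approaches 1.0 for semantically similar text.
    return TextEmbedder.cosineSimilarity(firstEmbedding, secondEmbedding)
}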

[Figure: a 3-D plot showing "Man" and "King" in blue and "Woman" and "Queen" in green, with arrows pointing from "Man" to "Woman" and from "King" to "Queen"]

A (very) simplified illustration of the embeddings for the words "king", "queen", "man" and "woman"

Top-K, Top-P and Temperature

Parameters like Top-K, Top-P and Temperature let you control the creativity of the model and the randomness of its output; a configuration sketch follows the three definitions below.

Top-K filters the tokens eligible for output. For example, a Top-K of 3 keeps the three most probable tokens. Increasing the Top-K value increases the randomness of the model's response (learn more about the Top-K parameter).

Then, defining the Top-P value adds another filtering step: tokens with the highest probabilities are selected until their probabilities sum to the Top-P value. Lower Top-P values result in less random responses, and higher values result in more random responses (learn more about the Top-P parameter).

Finally, the Temperature defines how randomly the remaining tokens are selected. Lower temperatures are good for prompts that require a more deterministic and less open-ended or creative response, while higher temperatures can lead to more diverse or creative outcomes (learn more about Temperature).
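A minimal sketch of setting these parameters with the Google AI client SDK for Android; the model name and parameter values are arbitrary examples, not recommendations:

import com.google.ai.client.generativeai.GenerativeModel
import com.google.ai.client.generativeai.type.generationConfig

// Build a model with explicit sampling parameters.
fun buildModel(apiKey: String): GenerativeModel =
    GenerativeModel(
        modelName = "gemini-1.5-flash",
        apiKey = apiKey,
        generationConfig = generationConfig {
            topK = 16          // keep only the 16 most probable tokens
            topP = 0.1f        // then keep tokens until their probabilities sum to 0.1
            temperature = 0.4f // lower = more deterministic output
        }
    )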

Fine-tuning

Iterating over multiple versions of a prompt to achieve an optimal response from the model for your use case isn't always enough. The next step is to fine-tune the model by re-training it with data specific to your use case. You'll then obtain a model customized to your application.

More specifically, Low-Rank Adaptation (LoRA) is a fine-tuning technique that makes LLM training much faster and more memory-efficient while maintaining the quality of the model outputs.
The process of fine-tuning open models via LoRA is well documented. See, for example, how you can fine-tune Gemini models through Google AI Studio without advanced ML expertise. You can also fine-tune Gemma models using the KerasNLP library.

The future of generative AI on Android

With ongoing research and optimization of LLMs for mobile devices, we can expect even more innovative gen AI enabled features coming to Android soon. In the meantime, check out the other AI on Android Spotlight Week blog posts, and visit the Android AI documentation to learn more about how to power your apps with gen AI capabilities!
