With the recent industry advances around Artificial Intelligence, I was curious: Is there progress towards having this kind of technology usable offline, on low-end computing devices, free of charge, with an element of openness (i.e. ability to manipulate the software or underlying data)?
Perhaps the best-known product within the current whirlwind of AI attention is ChatGPT, which provides a textual interface for conversing with a “chatbot” virtual agent. You can try this out for free, accessing it online using your web browser. However, it’s a centralised service that requires an internet connection. You can imagine it being a beast of great complexity, run by some massively powerful computing cluster. Beyond the web chat version, you have to pay for any other way of using it, and the software and data are private.
Are there other variants of this technology that do not face such restrictions? Anything that could run on a smartphone in my pocket?
My knowledge of AI topics does not run deep. While I answered a few questions in the few hours I spent here, I’m probably still under-informed or wrong about certain aspects. I’m also not yet personally convinced that recent developments in AI justify all the hype around it. But anyway, here is what I have learned:
Open language models
AI solutions are heavily dependent upon a “language model” of sufficient quality and depth. Production of such models is tricky, requiring complex tooling, plentiful computing resources, and significant human tuning to ensure quality.
The availability of high quality open models has been limited, but that changed to some degree when Meta recently published LLaMA, an enormous and powerful language model, for free. It’s not actually available for direct download, though: you have to fill in a request form, and they then email you the download link (except I didn’t get any email). Unsurprisingly it has since been “leaked” online, but a leak doesn’t provide the kind of general availability needed for an open chatbot solution within reach of a general audience.
However, LLaMA does seem to have spurred a proliferation of other “open” models, some of which have public download links. For example, GPT4All and OpenLLaMA look like they may be providing openly downloadable, decent quality models.
Open source AI solutions
Now we need some software to work with the model. Meta (Facebook) is perhaps the leader in developing AI technology “in the open”, through development of PyTorch, a widely used open source machine learning framework.
PyTorch is Python-based, which makes it a poor fit for running on my Android phone. Fortunately, llama.cpp has been developed as a compact and portable C++ reimplementation of the parts needed to run the model. llama.cpp also requires the model to be converted to a new “GGML” format, which results in the LLaMA model shrinking from 13GB to 4GB. As far as I can tell, most of that saving comes from quantization: the weights are stored at roughly 4 bits each instead of 16. That is an important innovation in taking these large models towards low end devices.
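To make the 13GB to 4GB reduction concrete, here is a rough back-of-the-envelope sketch. It rests on assumptions that are mine rather than anything stated by the llama.cpp project: that the model in question is the smallest LLaMA variant with roughly 6.7 billion parameters, that the original weights are 16-bit, and that the quantized format costs about 4.5 bits per weight once per-block scaling factors are included.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed figures (not taken from the original post):
N_PARAMS = 6.7e9        # rough parameter count of the smallest LLaMA model
FP16_BITS = 16          # original weights stored as 16-bit floats
Q4_BITS = 4.5           # 4-bit weights plus per-block scale overhead

fp16_gb = model_size_gb(N_PARAMS, FP16_BITS)
q4_gb = model_size_gb(N_PARAMS, Q4_BITS)

print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {q4_gb:.1f} GB")
# prints "16-bit: 13.4 GB, 4-bit: 3.8 GB"
```

Under those assumptions the estimate lands close to the file sizes mentioned above, which suggests the saving is mostly a matter of storing fewer bits per weight rather than dropping parts of the model.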
Putting these together brings perhaps the most fascinating moment of this experiment. llama.cpp is a very small codebase, easy to set up, and once combined with a single model file, you have a human-like chat experience running directly on your PC. Of course, there are no guarantees about the accuracy and fairness of such a chatbot, but it is seriously spooky to witness such a small codebase provide such a dynamic experience. It becomes evident that this impressive result builds upon many years of research.
Sherpa
I then shifted attention towards running this on my 2019 Samsung Galaxy S10 smartphone. I wanted to use this opportunity to create something with Flutter, Google’s app development platform that lets you build cross-platform interactive applications, but I swiftly discovered that Bip-Rep had already combined Flutter with llama.cpp to produce a mobile app: “Sherpa”.
While a promising demo, it didn’t quite work right on my PC, and was crashing on my smartphone. I set about fixing that, before encountering a more serious issue. My smartphone has 8GB RAM, but the operating system uses about 4.5GB, and the remaining RAM was not enough for the 4GB LLaMA model. The app could not function.
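The memory arithmetic here is simple enough to sketch. The function names and the 0.5GB allowance for runtime buffers are my own illustrative assumptions, not measurements from Sherpa or llama.cpp; the RAM and model figures are the ones quoted above.

```python
def free_ram_gb(total_gb: float, os_usage_gb: float) -> float:
    """RAM left over for apps once the operating system has taken its share."""
    return total_gb - os_usage_gb


def model_fits(model_file_gb: float, total_gb: float, os_usage_gb: float,
               runtime_overhead_gb: float = 0.5) -> bool:
    """True if the model file plus an assumed runtime overhead fits in free RAM."""
    needed = model_file_gb + runtime_overhead_gb
    return needed <= free_ram_gb(total_gb, os_usage_gb)


TOTAL_RAM_GB = 8.0    # Galaxy S10
OS_USAGE_GB = 4.5     # approximate operating system usage
LLAMA_GGML_GB = 4.0   # converted LLaMA model file

print(free_ram_gb(TOTAL_RAM_GB, OS_USAGE_GB))                # 3.5
print(model_fits(LLAMA_GGML_GB, TOTAL_RAM_GB, OS_USAGE_GB))  # False
print(model_fits(2.0, TOTAL_RAM_GB, OS_USAGE_GB))            # True
```

With only about 3.5GB free, a 4GB model file cannot be loaded no matter how small the overhead is, whereas a file of around 2GB would leave some headroom.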
I then discovered the recent emergence of smaller “3B” models, such as orca_mini_3b, for which the freely downloadable GGML file comes in at under 2GB. To have this run in Sherpa, I had to carefully update the llama.cpp version it was using, and I took the opportunity to rework some internals for improved performance and lower memory usage. Here is the imperfect but promising result running on my smartphone:
While there are undoubtedly some downsides to using such a small model, you can see that the technology is clearly within reach of relatively low end devices. You can expect the situation to only improve in future, with likely further innovations in the model format as well as the ever increasing memory and processing power in smartphones and similar devices. Additionally, this is open source software that uses a freely available model and works completely offline.
If you would like to try it, I have uploaded a prebuilt Android APK file, and you can combine this with the readily available orca-mini-3b.ggmlv3.q4_0.bin GGML model.