Saturday, September 14, 2024

AI AlterEgo

The killer application for AI is expert profiles in enterprise and productivity applications. These are not bots that help you get through mundane tasks; they are profiles that the application consults to provide assistance in using the application, akin to expert levels in gaming.

Take, say, an IDE. Today, the profile the application stores for a user mainly holds credentials and secures access to outside storage and other artifacts. With GenAI, these profiles can be based on other users, on experts, or on the AI itself. If you admire someone's writing style, then, assuming that person is willing to sell or share their profile, you can import it and the application can use that profile's style to generate your content.

An IDE is the easiest place to understand this concept, but you can imagine intelligent profiles being used in every sphere where applications perform the majority of the mundane tasks. For example, the world's best trader could export his or her stock-trading profile, and you could use it in your brokerage application to receive recommendations for trades you would otherwise never have entered, because you would not have seen the opportunity. A hypothetical sketch of such a profile is below.
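To make the idea concrete, here is a purely hypothetical sketch of an exportable profile and the hook an application might consult; every field and function name here is made up for illustration, not taken from any real product.

# Every field and function below is hypothetical, an illustration of the idea only.
expert_profile = {
    "owner": "writer-you-admire",        # whose behavior the profile captures
    "domain": "technical-writing",
    "style_prompt": "Short sentences, concrete examples, no filler words.",
    "few_shot_examples": [],             # samples of the owner's past work
    "license": "personal-use-only",      # terms under which it was shared or sold
}

def assist(task, profile):
    """The host application consults the imported profile before generating."""
    return profile["style_prompt"] + "\n\nTask: " + task   # handed to the app's model

print(assist("Summarize what this function does", expert_profile))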

All of this is essentially creating an alter ego of yourself. Now everybody can become a rock star!

Saturday, August 03, 2024

Where is the productivity in AI? Try this!

For some reason, the mainstream is now asking for proof of productivity from AI, and there are some skeptics. Let me show you how it increases my productivity as a developer.

As Easy as 123

Using the Continue code assistant, I was able to build, with very little help, an application that uses Streamlit for the UX, MySQL for the DBMS, and LangChain for chaining the model and the logic. The ease with which I can now "talk" to my tables makes MySQL Workbench all but obsolete. For run-of-the-mill DBMS reporting, we don't need expensive human talent to get it done. This is a boost in productivity for anyone who has to back up an argument with data, and it is probably why Snowflake acquired Streamlit. The productivity gain is astronomical because I don't have to keep searching for the "right" syntax; I barely check the API reference, as the code tells me which method and object I need. A rough sketch of the shape of that app is below.
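This sketch assumes recent streamlit, langchain-community and langchain-openai packages; the import paths, model name and MySQL connection string are stand-ins, not the exact code I ran.

import streamlit as st
from langchain_community.utilities import SQLDatabase
from langchain_openai import ChatOpenAI
from langchain.chains import create_sql_query_chain

# Connection string is a placeholder; point it at your own MySQL instance
db = SQLDatabase.from_uri("mysql+pymysql://user:password@localhost/reports")
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
chain = create_sql_query_chain(llm, db)   # turns a plain question into a SQL query

st.title("Talk to my tables")
question = st.text_input("Ask a question about the data")
if question:
    sql = chain.invoke({"question": question})   # the generated SQL
    st.code(sql)
    st.write(db.run(sql))                        # run it and show the result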

Agents R US

While I used chains and linked them together to get the end result, I could have created distinct semi-autonomous agents that fetch the information as it updates and report the state in real time. This kind of work takes weeks in an organization today; it can now be done in hours.
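A minimal sketch of that alternative, with the data source and summary logic stubbed out as placeholders rather than anything I actually deployed:

import random, time

def fetch_metrics():
    """Stand-in for the real data source (placeholder values)."""
    return {"open_orders": random.randint(0, 10)}

def summarize(metrics):
    return f"{metrics['open_orders']} open orders"

def watcher_agent(state):
    """Pulls fresh data each time it runs, as a real watcher would on updates."""
    state["metrics"], state["updated_at"] = fetch_metrics(), time.time()

def reporter_agent(state):
    """Reports the current state in real time."""
    print(time.ctime(state["updated_at"]), ":", summarize(state["metrics"]))

state = {}
for _ in range(3):          # a scheduler or message queue in a real deployment
    watcher_agent(state)
    reporter_agent(state)
    time.sleep(1)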

PaaS This!

You need a source of data, a connector to read/write that data, and an execution environment that allows the use of models from many sources (via their API keys), plus frameworks to keep all these components and their state in sync. This is not done in an IaaS setting; it needs a PaaS. No wonder you can't get away from Hugging Face, and no wonder GitHub just announced GitHub Models as competition to HF.
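A sketch of what that execution environment boils down to: one OpenAI-compatible client pointed at different providers through API keys held in the environment. The second base URL is a placeholder, not any particular provider's endpoint.

import os
from openai import OpenAI

providers = {
    "openai": OpenAI(api_key=os.environ["OPENAI_API_KEY"]),
    "hosted": OpenAI(api_key=os.environ["OTHER_API_KEY"],
                     base_url="https://example-provider.test/v1"),  # placeholder
}

def complete(provider, model, prompt):
    """Same call shape regardless of which provider serves the model."""
    resp = providers[provider].chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content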

Models are 4Ever: 

APIs come and go, but models are forever. I have used three different models in a single application and am paying no more than 2.5 cents per 10K tokens. I am beginning to wonder whether they actually make money providing me this service. Look at the investment: a typical large model (like GPT) needs on the order of 144 GPUs just to load, at about 750W per GPU. Most cannot afford this, so they go for a model that fits into a single system, but that single system needs to be configured for GPU passthrough to the VMs without any bloatware from K8s. We are looking at a hard requirement of at least 200 TFLOPS from a single node.
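The back-of-the-envelope numbers, taken straight from the figures above rather than from any measurement:

gpus = 144                        # GPUs to load a GPT-class model
watts_per_gpu = 750               # draw per GPU under load
print(f"cluster draw ~{gpus * watts_per_gpu / 1000:.0f} kW")   # ~108 kW

price_per_10k_tokens = 0.025      # what I am actually billed
print(f"implied price ~${price_per_10k_tokens * 100:.2f} per 1M tokens")   # $2.50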

Larger models with 400B+ parameters are now called giants. We need giants because they keep context around for longer and capture deeper relationships between parameters. But these giants shouldn't share context across tenants; I believe currently they do.


Saturday, April 20, 2024

Llama 3 - More ways to run it, but still nothing new

Llama 3 is out, and getting to it can be a challenge. The approval email's URL expires in 24 hours, and the download itself can take 8 hours. But once downloaded from Meta, it can be used locally in text-generation-webui. This time there are also hosted versions on HuggingChat and on Meta's own site. It says its training stopped in 2021, so it continues to think the PM of the UK is Boris Johnson, but it believes it is more conversational.

When asked how many parameters it was trained on, it initially said 1.5B; when I asked again, it changed its mind.



Using Ollama to run Llama 3, I get better answers. A minimal sketch of the call is below.
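This assumes the ollama Python client is installed and the model has already been pulled; the client's response shape has shifted a little between releases, so treat the lookup as approximate.

import ollama

reply = ollama.chat(model="llama3",
                    messages=[{"role": "user", "content": "Who is the PM of the UK?"}])
print(reply["message"]["content"])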



On text-generation-webui, the model does not load unless you pick Transformers as the loader, and even then the chat is not fully functional.




After converting the weights to GGUF, the model loads in llama.cpp-based tools as well.

LM Studio is the best of these for now.




Thursday, April 04, 2024

LLM - Not everything can be learned - so let's realign it to our preferences

When I first started researching LLMs, it seemed like the technology could simply keep learning and get to the point of being a self-learning artificial lifeform (AGI). Now, 8 months since my last post, it looks like the initial thrust of teaching an LLM everything is not giving the returns researchers expected. Words originally used, such as "emergent behavior", are being replaced with "hallucinations" and "catastrophic degradation".

The jack-of-all-trades LLM is not what we really wanted; what we want is precise control over the completions (answers). To get there, we are now seeing new avenues of research collectively called fine-tuning. Fine-tuning is not a run-time performance effort; rather, it changes the model's weights to reflect preferences. A new alphabet soup of acronyms, DPO, IPO, KTO, covers optimizations that introduce new labeled datasets and, under supervision, get a generic pre-trained model to answer the "money questions".
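For the curious, here is a minimal sketch of the DPO objective in PyTorch, assuming you have already summed the per-token log-probabilities of each chosen and rejected completion under the policy and under a frozen reference model:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of preference pairs (chosen vs. rejected answers)."""
    # Log-ratio of policy vs. reference for the preferred and dispreferred answers
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the margin between them through a logistic (Bradley-Terry) loss
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()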

If you have been exposed to ML/AI for long, you already know we have seen this before; back then it was called "reinforcement learning". Today we add HF (human feedback) to it and call it RLHF. Once again, we are back to using likelihoods (read: probabilities) and rewards (biases) to get an AI to spit out answers that can add economic value.



Saturday, July 22, 2023

Costs in Training LLMs

I went through the Llama-2 white paper that Meta released with the model. I was hoping to learn some special technique they might be using to train their models; apparently, there isn't any. The learning process is straightforward. What is different is the huge cost of fine-tuning after the model is trained. This fine-tuning requires human interaction and feedback, and incorporating that feedback means altering the model, which requires more computation. Training and fine-tuning the model cost more than $20M (~$4 per GPU-hour over roughly 5M hours). This immediately limits the number of players who will actively develop LLMs. The cost of adding safety to these models (e.g. blocking prompts for ransom letters) is almost as high as the cost of training the model.

Another interesting tidbit from the paper was the assertion that a RoCE-based 200 Gbps interconnect was adequate and more economical than an InfiniBand-based cluster. RoCE runs over commodity Ethernet. If one can train a 70B-parameter model on trillions of tokens using commodity Ethernet with RDMA, what is the compelling need to move to expensive NVLink-connected superchip systems? Maybe they are overfitting? (pun intended)

There is a significant cost to building these models that is shared with the public (unknowingly): the climate cost. The paper puts the carbon emissions of these clusters at 539 tonnes of CO2e, over 3.3M GPU-hours (A100-80GB). All of this to chat with a bot?

I found more benchmarks and metrics related to safety, climate, and other social concerns in the paper than one typically finds in a technical paper.

It was easy to play with the model using oobabooga's text-generation web UI. I used the 13B-parameter model from the family of Llama-2s released. It is a bit dated. You can see for yourself.




Sunday, July 09, 2023

Learning a Model

Neural networks have a bad reputation for being very confident when they are wrong. This is the result of bad probability estimates being calculated (i.e. learned). They also suffer from adversarial attacks. Training is the activity that takes the most time in arriving at a functional LLM. Besides collecting, curating and integrating datasets, we also have to navigate around potholes by employing optimization techniques on the objective function. The objective (or goal-seeking) function takes data and model parameters as arguments and outputs a number; the goal is to find the parameter values that maximize or minimize that number. Maximum likelihood estimation (MLE) is one of the most often used approaches for finding the set of parameters that best fits the observed data.
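In symbols, MLE for a next-token model picks the parameters that maximize the log-likelihood of the observed tokens:

\hat{\theta} = \arg\max_{\theta} \sum_{t} \log p_{\theta}(w_t \mid w_{<t})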

LLMs have three model architectures: (a) encoder-only (BERT), (b) decoder-only (GPT), and (c) encoder-decoder (T5). Looking at (b), the model defines a probability distribution over the next word given the prompt, arrived at by taking a smoothed exponential (softmax) of scores computed as scaled dot products between each new prediction word and the prompt; we use MLE to find the distribution that best fits the observed data.
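Written out (with h_t for the decoder's final hidden state and W_vocab for the output projection, notation I am introducing here rather than taken from the post):

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V

p_{\theta}(w_t \mid w_{<t}) = \mathrm{softmax}(W_{\mathrm{vocab}}\, h_t)_{w_t}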

Stochastic gradient descent (SGD) and Adam (adaptive moment estimation) are two common methods used to optimize the objective function; the latter is memory-intensive. There are many knobs, such as the size of a floating point, calculating moments (more values per parameter), and changing learning rates, that can be used to learn a model. Sometimes the knob settings result in generic learning, other times in overfitting; more often than not we just don't converge, i.e. no learning. Adam is a popular optimizer (I use it on the open-source transformers from Hugging Face); it keeps roughly 3X more values per parameter than vanilla SGD. Adafactor is an optimization on Adam that reduces memory consumption but has been known to not always work.
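A small PyTorch sketch of where that memory goes; the exact state layout depends on the PyTorch version, so treat the counts as approximate:

import torch

# Tiny model just to compare optimizer state sizes (a sketch, not a benchmark)
model = torch.nn.Linear(1024, 1024)
n_params = sum(p.numel() for p in model.parameters())

adam = torch.optim.Adam(model.parameters())
sgd = torch.optim.SGD(model.parameters(), lr=1e-3)

# One step so the optimizers materialize their state
loss = model(torch.randn(8, 1024)).sum()
loss.backward()
adam.step()
sgd.step()

# Adam keeps first and second moment tensors per parameter, so its state
# roughly triples the memory needed per trained weight; plain SGD keeps none.
adam_state = sum(v.numel() for s in adam.state.values()
                 for v in s.values() if torch.is_tensor(v))
print(n_params, adam_state)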

A rule of thumb in ML is to gather more data, as a dumb model with lots of data beats a smart model with limited data. But training on large amounts of data is costly: it uses computational resources, and because the whole process is iterative, we need fast processors to crunch through the data so the model can be adapted and iterated upon. More data does not guarantee convergence, i.e. learning. The whole exercise of learning a model looks and feels like art bordering on black magic more than anything analytic or scientific. If modeling felt like writing a recipe, this is like cooking the recipe; the end result has a lot of variance.

Saturday, June 24, 2023

The Model behind LLM and Transformers

We want to get to a point where we have a probability distribution over a sequence of tokens. But what are these tokens? They are the words (or word pieces) in a sentence, quantized, i.e. turned into numbers. The embedding values behind them are not 64-bit floats; they are more like 16- or even 8-bit floats. In an array, these numbers map to a word and some of the context of the word. A classic example of word (embedding) context is King - Man + Woman ≈ Queen, and that is a legitimate operation in this space. So we take a sentence, convert it into tokens, and create vectors for each word, which as a whole are called word embeddings. This seems like preliminary stuff, but it is not; it is so important that if we get it wrong, the whole generative AI stack falls on its face.
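A quick way to see the analogy arithmetic for yourself, using pretrained GloVe vectors through gensim's downloader (the corpus name is just one convenient choice; any word-vector set behaves the same way):

import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # downloads the vectors on first use
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' typically tops the list: king - man + woman ≈ queen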

The engine of the generative AI car is the transformer. The main parts of the transformer are the encoder and the decoder. The encoder is where you say "Yo quiero tacobell" and the decoder translates it to "I love Tacobell". As an aside, tokenizing "I love Taco Bell" will result in a vastly different generation than tokenizing "I love Tacobell". But to make this generative, i.e. to change its task model, they got rid of the encoder and used stacks of decoders. After all, given a prompt, we need to generate an essay where every word is picked from a probability distribution, and a stack of decoders helps with this more than an encoder/decoder architecture. This transformer architecture has two main components: first the positional encoding, and second the multi-head attention.
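You can see the aside about tokenization directly; the GPT-2 tokenizer is used here only because it is easy to grab, and the exact splits may differ by tokenizer:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("I love Taco Bell"))   # e.g. ['I', 'Ġlove', 'ĠTaco', 'ĠBell']
print(tok.tokenize("I love Tacobell"))    # a different sequence of subword pieces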

If we start with positional encoding, we are looking at a matrix where each row corresponds to a word in the prompt. This matrix is created using early-19th-century math (Fourier analysis), which showed us that a summation of sine and cosine functions with varying frequencies carries all the information in an input signal. The input signal here is the prompt, and to give different emphasis to each word's position in the prompt, we use sine and cosine functions to compute the rows of the matrix. So far we started with word embeddings and now we have a matrix with positional embeddings. But why? Well, the output of each time-step is fed back to the decoder. E.g., in "I love Tacobell around the corner", where "around the corner" was generated in the previous steps, the next prediction will give more emphasis to the words "the corner" than to the first few words. This is the part about attention, which brings us to the next big component: Multi-Head Attention (MHA).
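A small sketch of the standard sinusoidal positional encoding (the formula from the original transformer paper):

import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings as in 'Attention Is All You Need'."""
    positions = np.arange(seq_len)[:, None]      # (seq_len, 1)
    dims = np.arange(d_model)[None, :]           # (1, d_model)
    # Each pair of dimensions gets its own frequency
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])        # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])        # odd dimensions: cosine
    return pe

print(positional_encoding(4, 8).round(2))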

To understand MHA, we need to understand the concepts of query (Q), key (K) and value (V). The query is the question asked by the transformer to find the correct next word in the sentence; in our case, that would be "around". So the query asks whether "around" is the best fit given the input sequence, i.e. the keys. The values are the input sequence's embeddings that we calculated earlier. At the end of the day, we are using matrix multiplication, specifically dot products, to gauge the relative fit of a word given the input sentence. As we generate more words, we want the generated words included in the input to find the next best match. A dot product literally tells us how well the next word (my query) lines up with my current sentence (the keys).
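The whole mechanism fits in a few lines; this is a toy single-head version with random matrices standing in for learned projections:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # how well each query matches each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                       # weighted mix of the values

# Toy example: 3 query positions attending over 3 key/value positions
Q = np.random.randn(3, 4)
K = np.random.randn(3, 4)
V = np.random.randn(3, 4)
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 4)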

Note that until now we have not really used a neural network. We need one because so far all we have done is linear transforms (matrix multiplications); we need something nonlinear to activate the neurons, or else what's the point of calling this a NN? That part is the feed-forward neural network, which receives the output (a matrix) that we calculated. By doing all this processing in parallel, because we use multiple heads of attention, we are able to accelerate it. This was the original intent of transformers, i.e. to accelerate translation using GPUs. If you are a student of history, this is kind of like the old mechanical techniques for switching and routing: they worked, but eventually were replaced by electronics. I think AI is lacking that paradigm shift. It hasn't yet found its transistor.
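For completeness, the position-wise feed-forward block that supplies the nonlinearity, sketched with random weights in place of trained ones:

import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward block: the transformer's nonlinearity."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU breaks the chain of linear maps
    return hidden @ W2 + b2

x = np.random.randn(3, 4)                        # 3 positions, model width 4
W1, b1 = np.random.randn(4, 16), np.zeros(16)    # expand ...
W2, b2 = np.random.randn(16, 4), np.zeros(4)     # ... then project back
print(feed_forward(x, W1, b1, W2, b2).shape)     # (3, 4)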
