Saturday, July 22, 2023

Costs in Training LLMs

I went through the Llama-2 white paper that Meta released with the model. I was hoping to learn some special technique they may be using to train their models. Apparently, there isn't any. The learning process is straightforward. What is different is the huge cost associated with fine-tuning after the model is trained. This fine-tuning requires human interaction and feedback, and incorporating that feedback means altering the model, which requires more computation. Training and fine-tuning the model costs more than $20M (~$4 per GPU hour and roughly 5M hours). This immediately limits the number of players who will actively develop LLMs. The cost of adding safety to these models (e.g. blocking prompts for ransom letters) is almost as high as the cost of training the model itself.
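
Just to sanity-check that figure, here is the back-of-envelope math in a few lines of Python; the hourly rate and hour count are the rough numbers quoted above, not exact figures from the paper.

```python
# Rough back-of-envelope check of the training cost quoted above.
gpu_hours = 5_000_000        # ~5M GPU hours (approximate)
cost_per_gpu_hour = 4.0      # ~$4 per GPU hour (approximate cloud rate)
print(f"~${gpu_hours * cost_per_gpu_hour / 1e6:.0f}M")   # -> ~$20M
```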

Another interesting tidbit from the paper was the assertion that a RoCE-based 200 Gbps interconnected cluster was adequate and more economical than an InfiniBand-based cluster. RoCE uses commodity Ethernet. If one can train a 70B parameter model on trillions of tokens using commodity Ethernet with RDMA, what is the compelling need to move to expensive NVLink-linked superchip-based systems? Maybe they are overfitting? (pun intended)

There is a significant cost to building these models that is shared with the public (unknowingly): the climate cost. The carbon emissions of these clusters are reported in the paper at 539 tonnes of CO2e, over 3.3M GPU hours (A100-80GB). All of this to chat with a bot?

I found more benchmarks and metrics related to safety, climate and other social concerns in the paper than what one typically finds in a technical paper.

It was easy to play with the model using oobabooga's text-generation web UI. I used the 13B parameter model from the family of Llama-2s released. It is a bit dated. You can see for yourself.

Sunday, July 09, 2023

Learning a Model

Neural networks have a bad reputation for being very confident when they are wrong. This is the result of bad probability estimates being calculated (i.e. learned). They also suffer from adversarial attacks. Training is the activity that takes the most time in arriving at a functional LLM. Besides collecting, curating and integrating data sets, we also have to navigate around potholes by employing optimization techniques on the objective function. An objective function, or goal-seeking function, takes data and model parameters as arguments and outputs a number; the goal is to find parameter values that either maximize or minimize that number. Maximum likelihood estimation (MLE) is one of the most commonly used approaches for this task of finding the set of parameters that best fits the observed data.
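
To make that concrete, here is a minimal sketch of such an objective in PyTorch, with toy tensors standing in for a real model's outputs (nothing here is from any particular paper): maximizing likelihood becomes minimizing the average negative log-probability of the observed tokens.

```python
import torch

# Minimal sketch: maximum likelihood as minimizing negative log-likelihood.
# `logits` would normally come from model(params, data); here they are random stand-ins.
def nll_objective(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # log_softmax turns raw scores into log-probabilities over the vocabulary;
    # the objective is the average negative log-probability of the observed tokens.
    log_probs = torch.log_softmax(logits, dim=-1)
    return -log_probs.gather(-1, targets.unsqueeze(-1)).mean()

# Toy usage: 8 positions, vocabulary of 100 tokens.
logits = torch.randn(8, 100, requires_grad=True)
targets = torch.randint(0, 100, (8,))
loss = nll_objective(logits, targets)   # the single number we try to minimize
loss.backward()                         # gradients w.r.t. the "parameters"
```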

LLMs have three model architectures: (a) encoder-only (BERT), (b) decoder-only (GPT), and (c) encoder-decoder (T5). Looking at (b), the model defines a probability distribution over the next word given the prompt, arrived at by taking a smoothed exponential (softmax) of scores computed via scaled dot products between each new prediction and the prompt. We then use MLE to find the parameters whose distribution best fits the observed data.
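
A hedged sketch of (b) in practice, using the Hugging Face transformers API with GPT-2 as a small, public stand-in (the model and prompt are just for illustration): the scores for the last position are pushed through a softmax to get the probability distribution over the next word.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Decoder-only sketch: prompt -> scores over the vocabulary -> softmax -> distribution.
# GPT-2 is used here only because it is small and publicly available.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                 # [batch, seq_len, vocab_size]
next_token_probs = torch.softmax(logits[0, -1], dim=-1)   # distribution over the next word

top = torch.topk(next_token_probs, 5)
print([tokenizer.decode(int(i)) for i in top.indices])    # the five most likely next words
```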

Stochastic gradient descent (SGD) and Adam (adaptive moment estimation) are two common methods used to optimize the objective function; the latter is memory intensive. There are many knobs, like the size of a floating point, whether to track moments (more values per parameter), and changing learning rates, among others, that can be used to learn a model. Sometimes the knob settings result in generic learning and other times in overfitting. More often than not we just don't converge, i.e. no learning. Adam is a popular optimizer (I use it with the open-source transformers from Hugging Face); it keeps roughly 3x more values per parameter than vanilla SGD. Adafactor is an optimization on Adam to reduce memory consumption but has been known to not always work.
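
A small sketch of the memory point, assuming PyTorch (the tensor size is arbitrary): after one step, Adam is carrying two extra moment tensors for every parameter tensor, which vanilla SGD does not.

```python
import torch

# Illustrative only: compare per-parameter optimizer state for SGD vs. Adam.
params = [torch.randn(1000, 1000, requires_grad=True)]
sgd = torch.optim.SGD(params, lr=1e-3)     # no per-parameter state without momentum
adam = torch.optim.Adam(params, lr=1e-3)   # keeps first and second moment estimates

loss = (params[0] ** 2).sum()
loss.backward()
adam.step()

state = adam.state[params[0]]
print(list(state.keys()))   # ['step', 'exp_avg', 'exp_avg_sq'] -> two extra tensors per parameter
print(len(sgd.state))       # 0 -> plain SGD tracks nothing beyond parameters and gradients
```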

A rule of thumb in ML is to gather more data, since a dumb model with lots of data beats a smart model with limited data. But training on large amounts of data is costly. It uses computational resources, and since the whole process is iterative, we need fast processors to crunch through the data so the model can be adapted and iterated upon. More data does not guarantee convergence, i.e. learning. The whole exercise of learning a model looks and feels more like art bordering on black magic than anything analytic or scientific. If modeling felt like writing a recipe, then this is like cooking the recipe. The end result has a lot of variance.

Saturday, June 24, 2023

The Model behind LLM and Transformers

We want to get to a point where we have a probability distribution over a sequence of tokens. But what are these tokens? They are the words in a sentence, quantized, i.e. turned into numbers. These numbers are not 64-bit floats; they are more like 16- or 8-bit floats. In an array, these numbers map to a word and some of the context of the word. A classic example of word (embedding) context is King - Man + Woman ≈ Queen, which is a legitimate operation in this space. So we take a sentence, convert it into tokens, and create a vector for each word; taken together these are called word embeddings. This seems like preliminary stuff, but it is not: it is so important that if we get this wrong, the whole generative AI stack falls on its face.
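
Here is a toy sketch of that pipeline in PyTorch, words to token ids to embedding vectors; the vocabulary and embedding matrix are made up for illustration, since a real model learns them during training.

```python
import torch

# Toy pipeline: words -> token ids -> embedding vectors.
vocab = {"i": 0, "love": 1, "taco": 2, "bell": 3, "king": 4, "man": 5, "woman": 6, "queen": 7}
embed = torch.nn.Embedding(num_embeddings=len(vocab), embedding_dim=16)

sentence = "i love taco bell"
token_ids = torch.tensor([vocab[w] for w in sentence.split()])
word_embeddings = embed(token_ids)          # shape: [4, 16], one vector per token
print(word_embeddings.shape)

# With trained embeddings, vector arithmetic like king - man + woman lands near queen;
# with this random toy matrix it will not, which is the point: the geometry only
# appears after learning.
analogy = (embed(torch.tensor(vocab["king"]))
           - embed(torch.tensor(vocab["man"]))
           + embed(torch.tensor(vocab["woman"])))
```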

The engine of a generative AI car is the transformer. The main parts of the transformer are the encoder and the decoder. The encoder is where you say “Yo quiero tacobell” and the decoder translates it to “I love Tacobell”. As an aside, tokenizing as “I love Taco Bell” will result in a vastly different generation than tokenizing as “I love Tacobell”. But to make this generative, i.e. to change its task model, they got rid of the encoder and used stacks of decoders. After all, given a prompt, we need to generate an essay where every word is picked from a probability distribution, and a stack of decoders helps with this more than an encoder/decoder architecture. This transformer architecture has two main components: first, positional encoding, and second, multi-head attention.

If we start with positional encoding, we are looking at a matrix where each row corresponds to a word position in the prompt. This matrix is created using centuries-old math (Fourier analysis), which showed us that a summation of sine and cosine functions with varying frequencies carries all the information in the input signal. The input signal here is the prompt, and to give different emphasis to each word position in the prompt, we use sine and cosine functions to arrive at the rows of the matrix. So we started with word embeddings and now we have a matrix with positional encodings added. But why? Well, the output of each time-step is fed back to the decoder. E.g. in “I love Tacobell around the corner”, where the last few words were generated in previous steps, more emphasis is given to “around the corner” than to the first few words. This is the part about attention, which brings us to the next big component, Multi-Head Attention (MHA).
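
A short sketch of that sine/cosine construction, following the standard sinusoidal formulation (the sizes are illustrative): each row is a position in the prompt, and the columns alternate sines and cosines at different frequencies.

```python
import numpy as np

# Sinusoidal positional encoding: one row per position, alternating sin/cos columns.
def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]              # [seq_len, 1]
    dims = np.arange(0, d_model, 2)[None, :]              # even embedding dimensions
    angles = positions / np.power(10000, dims / d_model)  # different frequency per dimension
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=6, d_model=16)   # 6-word prompt, 16-dim embeddings
print(pe.shape)   # (6, 16) -- this matrix gets added to the word embeddings
```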

To understand MHA, we need to understand the concepts of query (Q), key (K) and value (V). The query is the question asked by the transformer to find the correct next word in the sentence; in our case, that would be “around”. So the query is asking whether “around” is the best fit given the input sequence, i.e. the keys. The values are the input sequence’s embeddings that we calculated earlier. At the end of the day, we are using matrix multiplication, specifically the dot product, to gauge the relative fit of a word given the input sentence. As we generate more words, we want the generated words included in the input to find the next best match. A dot product is literally telling us how close the next word (my query) is to my current sentence (the keys).
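
A minimal sketch of that dot-product machinery with toy shapes (random vectors standing in for real embeddings): the query is scored against every key, the scores are softmaxed, and the result is a weighted mix of the values.

```python
import numpy as np

# Scaled dot-product attention with toy inputs.
def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # relative fit of query vs. each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                         # weighted mix of the values

seq_len, d_k = 5, 16                 # e.g. the five prompt tokens so far
Q = np.random.randn(1, d_k)          # query for the candidate next word
K = np.random.randn(seq_len, d_k)    # keys from the prompt tokens
V = np.random.randn(seq_len, d_k)    # values (the token embeddings)
print(scaled_dot_product_attention(Q, K, V).shape)   # (1, 16)
```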

Note that up to now we have not really used a neural network. We need one, because so far all we have done is linear transforms (matrix multiplications); we need something nonlinear to activate the neurons, or else what’s the point of calling this a NN? That part is the feed-forward neural network, which receives the output (a matrix) that we calculated. And because we do all this processing in parallel, using multiple heads of attention, we are able to accelerate it. This was the original intent of transformers, i.e. to accelerate translation using GPUs. If you are a student of history, this is kind of like the old mechanical engineering techniques used to perform switching and routing: they worked, but were eventually replaced by electronics. I think AI is lacking that paradigm shift. It hasn’t yet found its transistor.
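
For completeness, a sketch of that nonlinear piece, the position-wise feed-forward block that follows attention (the sizes are illustrative, not from any particular model):

```python
import torch

# Position-wise feed-forward block: two linear maps with a nonlinearity in between.
d_model, d_ff = 512, 2048
feed_forward = torch.nn.Sequential(
    torch.nn.Linear(d_model, d_ff),
    torch.nn.ReLU(),                 # the nonlinearity -- without it, everything stays a linear transform
    torch.nn.Linear(d_ff, d_model),
)

attention_output = torch.randn(5, d_model)     # 5 tokens' worth of attention output
print(feed_forward(attention_output).shape)    # torch.Size([5, 512])
```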

Saturday, June 10, 2023

Perplexity, Entropy: How to measure LLMs?

How can we measure the efficacy of a language model? Language model researchers use the term “perplexity” to measure how a language model performs on tasks over standard datasets. For a language model, a task means quizzing the model to complete a sentence, hold a Q&A, or generate an essay. GPT-3 scored well, in fact very well, on perplexity on standard benchmarks like the Penn Treebank. Overall, though, the results were mediocre.

Perplexity measures the “surprise” factor in the generation, or how many branches the model has to deal with when predicting the next word. A perplexity of 20 would mean that, given a few words, the model has to pick between roughly 20 choices for the next word. If that number were 2, the model would have an easier task, but that is most likely because we overfit the model to a specific task. Without this context-specific training, the GPT-3 folks claim that the model is a few-shot learner, which means it takes a few examples before the model hones in on the task’s context. There are variants of transformer models, like BERT (cased), which are trained for specific tasks like “complete the last word” or “fill in the blank” and perform much better at those specific tasks than GPT-3.
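
A tiny worked example of how perplexity falls out of the model's probabilities (the probabilities here are made up): perplexity is the exponential of the average negative log-likelihood, i.e. the effective number of branches per next word.

```python
import math

# Made-up probabilities the model assigned to each observed next word.
token_probs = [0.05, 0.05, 0.05, 0.05]

avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)
print(perplexity)   # 20.0 -- the model behaves as if choosing among ~20 equally likely words
```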

As we are building LLMs to perform tasks like translation, generation, completion etc., should we overfit a model to a specific task or leave it as a generic model and provide in-context training via prompting? With GPT-3, it seems prompting is the chosen route to get the model to hone in. And what about model size? Does it make sense to overfit a model with hundreds of billions of parameters to do a single task (like translation), or leave it as a few-shot performer, i.e. mediocre? These are the tradeoffs one has to make when building a new LLM. Larger, generic models are mediocre at best on any single task.
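
For the prompting route, here is a sketch of what in-context training looks like in practice (the task and examples are invented for illustration): a few worked examples are packed into the prompt and the generic model is asked to continue the pattern, with no gradient updates at all.

```python
# Few-shot prompting: the "training" lives entirely inside the prompt.
few_shot_prompt = """Translate English to French.

English: cheese
French: fromage

English: good morning
French: bonjour

English: thank you
French:"""

# This string would be fed to a generic LLM; the completion ("merci") comes from
# in-context learning rather than any task-specific fine-tuning.
print(few_shot_prompt)
```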

Sunday, May 28, 2023

Language Models

LMs are probability distributions over sequences of tokens, where tokens are words in a vocabulary, so a sequence is a phrase. Each phrase fragment gets a probability, and the probability is higher for fragments that are good phrases, good meaning grammatically correct and semantically plausible. Goodness is very dependent on extra information that is not in the training set. For example, the sentence “I saw the golden gate bridge flying into san francisco” needs some contextual information, like the fact that bridges are stationary and that “I” refers to a human in the plane. This means the LM needs to determine the semantic plausibility of a phrase fragment. This is what makes LMs deceptively simple and easy to get wrong.
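
A tiny sketch of what that distribution means operationally (the numbers are invented): the LM scores a phrase by chaining together the conditional probability of each token given the tokens before it.

```python
import math

# Chain rule over a phrase: P(phrase) = product of P(token | preceding tokens).
phrase = ["I", "saw", "the", "golden", "gate", "bridge"]
cond_probs = [0.05, 0.02, 0.30, 0.01, 0.60, 0.70]   # illustrative conditional probabilities

log_prob = sum(math.log(p) for p in cond_probs)
print(f"P(phrase) = {math.exp(log_prob):.2e}")   # higher for good phrases, lower for implausible ones
```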

Mathematically, the phrase fragment is called a sequence of tokens, and the tokens are pulled from a set V, the vocabulary of the language. A probability distribution over four-token sequences assigns different probabilities to the different orderings of those four tokens; some orderings are semantically implausible and others are syntactically incorrect. So far we haven’t done any generation, but that is also possible in the LM: given a sequence of, say, four tokens, we can pull five-token sequences and judge their “goodness”. The judgement is based on some parameter, which could be a level ranging from poor to best.

To continue the sentence above, the completions “I saw .. sf on saturday” and “I saw .. sf yesterday” are all good sentences, but we need a new parameter to pick one over the other. That parameter is the randomness of the search. If we are strict, we pick the highest-probability next word at each step; if we are lenient, we randomize the next pick. The conditional generation is then determined by the prefix sequence, in our case “I saw .. sf”, and the completion will be, say, “yesterday”. This is the big problem with LMs: lenient generation will create absurd sentences and even total fabrications, while strict generation will create templates. Add to that a heavy reliance on the prefix sequence, or “prompt”. This parameter is called temperature; it ranges from strict to lenient and controls the variability of the generation. So we now have the mathematical vocabulary of generation: a prompt, a completion and a temperature.
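
A small sketch of that temperature knob, with made-up scores: dividing the scores by the temperature before the softmax makes the pick nearly greedy (strict) at low temperature and much more random (lenient) at high temperature.

```python
import numpy as np

# Temperature-controlled sampling of the next token from toy scores.
def sample_next_token(logits: np.ndarray, temperature: float) -> int:
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs = probs / probs.sum()                    # softmax over the candidates
    return int(np.random.choice(len(probs), p=probs))

logits = np.array([2.0, 1.5, 0.3, -1.0])           # scores for candidate completions, e.g. "yesterday", "on saturday", ...
print(sample_next_token(logits, temperature=0.1))  # almost always index 0 (strict)
print(sample_next_token(logits, temperature=2.0))  # much more variable (lenient)
```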

A slight change to the prompt will generate an entirely different sentence. A longer prompt also changes the generation, and any foreign words in the prompt, perfectly normal in spoken English, can also influence the generation in unpredictable ways. The number of tokens in the prompt is a big determinant of the computational complexity of LMs. A generation that uses all the tokens sequentially, such as an RNN (recurrent neural network), will take a very long time to train. A transformer, which sets attention over a window of prior tokens and exploits the parallelism of GPUs, limits the accuracy of the generation. The parallelism of transformers enables large LMs, large as in a trillion parameters. What was observed in late 2022 was that as the model got larger, fixed-length generation coupled with temperature produced sentences which never existed before. This is called emergent behavior, the beginning of “learning”. Repeated prompts on the same topic teach the LM to hone in on the context. This makes the completions sound more like spoken language and less like an automaton. This observation of emergent behavior without having to change the model (no new gradient descents) is what is causing most of the hype around generative AI. As the model hones in on a context, it feels like the Turing test, where the generation feels conversational.

The biggest challenge for Gen AI as an industry is not to create larger and larger models, but instead to slice and dice an existing LLM into a size that can be deployed at scale. We can see the sizes of large models in this paper, reproduced below. It is rumored that GPT-4 is over 1 trillion parameters as well. Training costs around $28K per 1 billion parameters, but that is a one-time cost. The continual cost is on the sliced/diced version, perhaps running on a phone, which needs to cost no more than a penny per 1000 generations.

The models also need to be trained on proprietary and current data for them to generate economic value beyond research.

AI IDEs - Do you need them?

AI code-generation IDEs like Replit, Cursor.sh and plugins for vscode are all the rage nowadays. I blogged earlier on using continue.dev ...