LLMOps 101: A Detailed Insight into Large Language Model Operations

In the rapidly evolving world of artificial intelligence (AI) and machine learning (ML), Large Language Model Operations (LLMOps) is a term that is gaining significant traction. It refers to the practices and tools used to develop, deploy, and maintain large language models in production environments.

Understanding LLMOps

Large language models like GPT-3, BERT, and RoBERTa have revolutionized the field of natural language processing (NLP). They have the ability to understand, generate, and interact with human language in a way that was previously unimaginable. However, managing these models is not a trivial task. They require significant computational resources, and their performance needs to be continuously monitored and optimized. This is where LLMOps comes into play. This term became popular with the release of ChatGPT in 2022.

MLOps vs LLMOps

Large Language Model Operations (LLMOps) and Machine Learning Operations (MLOps) are both important aspects of AI development, but they focus on different areas.

MLOps refers to the practices and tools used to develop machine learning models, deploy them to production, monitor their performance in the production environment, and maintain them so that they do not become obsolete or ineffective over time. The goal of MLOps is to streamline and automate these processes, ensuring a seamless and efficient integration of machine learning models into real-world applications, while also addressing challenges such as version control, reproducibility, and scalability.

LLMOps is a more specific subset of MLOps: it is MLOps for large language models (like GPT-3). LLMOps focuses on the unique challenges posed by these models, such as their size, the computational resources they require, prompt management, and the need for careful monitoring to prevent the generation of inappropriate or biased content.

Components of LLMOps

Data Collection and Preparation
Model Development
Prompt Engineering, RAG and Model Fine-tuning
Model Deployment
Observability
RLHF

1. Data Collection and Preparation

Data collection and preparation are essential when training a Large Language Model (LLM) from scratch or fine-tuning one. Training from scratch requires vast amounts of data, and the higher the quality of that data, the better the resulting model.

Fig 1: Data Collection, Preparation and Model Training

2. Model Development

Given the scale of storage and compute required, many organizations cannot train large language models (LLMs) from scratch. Most of them therefore start by consuming existing LLMs, either through APIs or by running the models themselves.

The first step is to select the large language model best suited to the use case. Leaderboards such as the ‘Open LLM Leaderboard’ from Hugging Face and ‘HELM (Holistic Evaluation of Language Models)’ from Stanford CRFM can help with this choice. Cost, token limits and availability are other factors that influence the decision.

3. Prompt Engineering, RAG and Model Fine-tuning

Prompt Engineering

After the LLM is chosen, one generally starts by passing prompts to it, basically asking it to do something. The LLM may not give the expected response on the first try. The prompt then has to be modified and passed to the LLM again, and after some iterations one finds the prompt that gives the best output. This process of iteratively modifying the prompt to extract the best output from the LLM is called Prompt Engineering.
At times, the prompt by itself might not be sufficient to extract a satisfactory response. In such a case, one can provide the LLM with examples, each containing a prompt and the corresponding output in the expected format and quality. If one example is given, it is called One-shot (or Single-shot) prompting; if multiple examples are given, it is called Few-shot prompting. When no examples are provided, it is called Zero-shot prompting.
Besides the prompt, the sampling temperature (the parameter that controls how random the LLM’s output is) also affects the response. It is therefore important to keep track of the combinations of prompt (or prompt template) and temperature that were tried, so that one can finally choose the combination that gives the best output.
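The prompting styles and the tracking of prompt and temperature combinations described above can be sketched roughly as follows. This is only an illustration: the names build_prompt, Trial and run_grid are hypothetical, and dummy_score is a placeholder for whatever scoring one actually uses (for example, calling the LLM and rating its responses).

```python
from dataclasses import dataclass
from itertools import product


def build_prompt(instruction, examples, query):
    """Assemble a zero-, one- or few-shot prompt from (input, output) examples."""
    parts = [instruction]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


@dataclass
class Trial:
    template_name: str
    temperature: float
    score: float


def run_grid(templates, temperatures, score_fn):
    """Score every (template, temperature) pair and return the trials, best first."""
    trials = [
        Trial(name, temperature, score_fn(template, temperature))
        for (name, template), temperature in product(templates.items(), temperatures)
    ]
    return sorted(trials, key=lambda trial: trial.score, reverse=True)


examples = [("I loved it", "positive"), ("Terrible service", "negative")]
zero_shot = build_prompt("Classify the sentiment.", [], "Not bad at all")
few_shot = build_prompt("Classify the sentiment.", examples, "Not bad at all")


def dummy_score(template, temperature):
    # Placeholder scorer (an assumption): a real one would call the LLM
    # with this template and temperature and rate the responses.
    return len(template) / 100 - temperature


trials = run_grid({"zero_shot": zero_shot, "few_shot": few_shot}, [0.0, 0.7], dummy_score)
for trial in trials:
    print(trial)
```

Keeping each trial as a record makes it easy to log the results to whatever experiment tracker is already in use.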

RAG (Retrieval Augmented Generation)

There will be times when one cannot extract satisfactory output from the LLM using prompt engineering alone. To get a better output, information regarding the query can be given to the LLM along with the query (i.e., the prompt), and the LLM can then be asked to address the query using the information provided to it. This information is also called ‘Knowledge’ or ‘Context’.
Given the significance of the Context, it is important that the LLM is given Context that is relevant to the query and not something else. Vector Stores are used for this. A vector store can be a vector database such as Pinecone or ChromaDB, or a vector library such as FAISS (Facebook AI Similarity Search). They store information as embeddings and, when a query arrives, return the stored information most relevant (similar, to be precise) to it. Both the query (i.e., the prompt) and the retrieved information (i.e., the Context) are then passed to the LLM to extract the best output.
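A minimal sketch of this retrieval step, under loud assumptions: ToyVectorStore is a hypothetical in-memory stand-in for a real vector store, and embed uses plain word counts instead of a learned embedding model. Only the flow (embed, find the most similar stored text, build the prompt with Context) mirrors the description above.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts. Real systems use a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory store; not the API of any real vector database."""

    def __init__(self, documents):
        self.entries = [(doc, embed(doc)) for doc in documents]

    def search(self, query, k=1):
        query_vec = embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(query_vec, entry[1]),
            reverse=True,
        )
        return [doc for doc, _ in ranked[:k]]


def build_rag_prompt(query, store, k=1):
    """Combine the query with the most similar stored text as Context."""
    context = "\n".join(store.search(query, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


store = ToyVectorStore([
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Support is available by email and chat.",
])
prompt = build_rag_prompt("How long do refunds take?", store)
print(prompt)
```

The resulting prompt is what would be sent to the LLM in place of the bare query.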

Model Fine-tuning

Even with RAG in place, one may still not be satisfied with the quality of the output. The next option is to fine-tune the LLM on a dataset. This can be a costly affair, depending on the LLM chosen and the size of the dataset, because of compute, data collection and data preparation costs.
Even a fine-tuned LLM can still produce unreliable or unsatisfactory output, because LLMs are ultimately ‘black boxes’: it is difficult to understand why they make certain predictions, and therefore hard to control their behavior.
While fine-tuning, it is also important to keep track of the various versions of the model, just as is done with regular ML models.
If one is still not satisfied, a custom model can be trained on the available dataset.

Fig 2: LLM Life Cycle

Often, a combination of these three approaches is used to extract a satisfactory, or the best, response from the LLM.

4. Model Deployment

Once the LLM has been developed from scratch, fine-tuned on a dataset, or chosen as an open-source model, the next step is to deploy it to production. LLMs are generally very large and need a lot of compute power to run. During deployment, it is important to keep latency in mind, including the effect of completion length on it: longer completions take longer to generate.

5. LLM Observability

Drift Detection

After the models are moved into production, there is no guarantee that they will perform as they did during the training or validation stages. So, we need to continuously monitor the model’s performance.
In most cases, a fall in the model’s performance is due to the data. When the model’s performance degrades over time, the model is said to be drifting.
Generally, the performance of the model is measured using metrics like Precision and Recall.
These metrics require the correct output (the ground truth) for each prediction. Depending upon the number of inferences the model serves, it may not be possible to get the correct outputs right away for all the predictions. In an enterprise setup, one can get the correct outputs with the help of Managed Services teams (MSTs), i.e., human evaluation, or a third-party model (i.e., another LLM). The third-party model is subject to all the issues an LLM can have, while the MSTs can be biased.
One can get the correct outputs with the MSTs’ help, but it is difficult to get them for all inferences if the number of inferences is huge. As a result, we can’t depend on these traditional metrics alone. In such situations, we need metrics that can indicate fluctuations in the model’s performance without ground truth. These are called ‘Proxy metrics’, and one such proxy metric is ‘Drift’.

Embeddings Drift

As one deals with text data here, Embeddings drift is used for monitoring.
Data drift is the change in the distribution of the data between a baseline (the training stage, or a period in production when the model’s performance was at the desired level) and the current production stage. To measure the drift, we take the centroids of the baseline and production embeddings and compare them in the n-dimensional space using metrics such as Euclidean Distance and Cosine Similarity. Other measures, such as MMD (Maximum Mean Discrepancy), can also be used.
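The centroid comparison can be sketched as below. The two-dimensional vectors are made-up toy data standing in for real embeddings, and DRIFT_THRESHOLD is a hypothetical alert level that would have to be tuned per use case.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def embedding_drift(baseline, production):
    """Compare the centroids of the baseline and production embeddings."""
    base_centroid = centroid(baseline)
    prod_centroid = centroid(production)
    return {
        "euclidean": euclidean_distance(base_centroid, prod_centroid),
        "cosine_similarity": cosine_similarity(base_centroid, prod_centroid),
    }


# Toy 2-D "embeddings"; real ones have hundreds of dimensions.
baseline = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]
production = [[0.1, 0.9], [0.0, 1.0], [0.2, 0.8]]

DRIFT_THRESHOLD = 0.5  # hypothetical alert threshold
drift = embedding_drift(baseline, production)
print(drift, "drift alert:", drift["euclidean"] > DRIFT_THRESHOLD)
```

A larger Euclidean distance or a lower cosine similarity between the centroids suggests that production traffic has moved away from the baseline.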
Projecting the prompt-response pairs to 2D or 3D (using dimensionality reduction techniques like UMAP or PCA), clustering them (using HDBSCAN, DBSCAN or K-Means) and asking another LLM like GPT-4 to explain these clusters can give insights into outliers and into the topics the prompt-response pairs belong to, like Medicine or Engineering.
It is also important to monitor other kinds of drift, such as Concept Drift.

Evals

But performance drop is not the only issue here. There are more issues which are inherent to LLMs, like Hallucination and the generation of biased or toxic content. Oftentimes, people try to manipulate LLMs into doing something unethical or harmful, which is called Prompt Injection. Data Privacy is also a concern here. Embeddings Drift or Similarity can’t solve these issues, so there is a need for more metrics or other solutions to handle them. Apart from evaluating whether the LLM’s response has the above issues (Response Evaluation), one must also check whether it has properly addressed the task posed by the prompt (Use case Evaluation).

Fig 3: LLM Evals

Depending on the type of use case being evaluated, there are statistical metrics like ROUGE for Summarization and BLEU for Translation. For hallucination, many use metrics like BLEU, BERTScore and METEOR. For issues like Toxicity and Data Leakage, one can use Regex. But ultimately, all of them are either string-matching-based or similarity-based solutions, and not all issues can be addressed this way. The same is the case with Use case Evaluation.
This makes human evaluation necessary, but it is not scalable and can be biased. This pushes us towards using another LLM to perform the Use case and Response evaluations.
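To make the string-matching nature of these metrics concrete, here is a simplified sketch. rouge1_f1 is a bare unigram-overlap version of ROUGE-1 (no stemming or tokenization rules, so its numbers will differ from a full ROUGE implementation), and EMAIL_PATTERN is an illustrative, deliberately loose regex for one kind of data leakage.

```python
import re
from collections import Counter


def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1: F1 over overlapping lowercase unigrams."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Illustrative pattern only; production PII detection needs far more care.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def find_emails(response):
    """Return email-like strings found in an LLM response."""
    return EMAIL_PATTERN.findall(response)


print(rouge1_f1("the cat sat on the mat", "the cat sat"))
print(find_emails("Contact jane.doe@example.com for details."))
```

Both checks only look at surface text, which is why a paraphrased hallucination or a subtly toxic reply can pass them.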

LLM-based Evaluation

When one says LLM-based evaluation, one can either:
- Ask the LLM to generate the correct output (i.e., the ground truth) for the given prompt-response pair and then calculate the above-mentioned metrics.
- Or ask the LLM to evaluate the given prompt-response pair to see if there is hallucination, toxicity or a similar issue.
The first question that arises with LLM-based evaluation is whether an evals framework is available. There are many frameworks to help with this, like RunML.
One of the popular frameworks is OpenAI Evals. It deals with asking the evaluating LLM to generate the ground truth and then using various metrics or regex to find whether there are any issues with the generated response. Many other frameworks focus on directly asking the LLM to evaluate the prompt-response pairs for Use case and Response evaluations, as it is more straightforward.
It must be noted that the evaluating LLM is itself an LLM. It is bound to hallucinate and to have all the other issues that LLMs generally face, so it is a best practice to benchmark the evaluating LLM and the evaluation prompt template together, for example against a set of human-labelled examples.
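One way to benchmark the evaluating LLM together with its prompt template is to measure how often its verdict agrees with human labels. The sketch below assumes a hypothetical call_llm function that takes a prompt string and returns the model’s text; fake_llm is a hard-coded stand-in so the example runs without any API, and EVAL_TEMPLATE is an illustrative template, not one taken from any framework.

```python
EVAL_TEMPLATE = (
    "You are grading an assistant's response.\n"
    "Prompt: {prompt}\n"
    "Response: {response}\n"
    "Does the response contain hallucinated or toxic content? "
    "Answer with exactly one word: PASS or FAIL."
)


def judge(call_llm, prompt, response):
    """Ask the evaluating LLM for a verdict and normalize it to a boolean."""
    verdict = call_llm(EVAL_TEMPLATE.format(prompt=prompt, response=response))
    return verdict.strip().upper().startswith("PASS")


def benchmark_judge(call_llm, labeled_examples):
    """Fraction of human-labelled examples where the judge agrees with the label."""
    agreements = sum(
        judge(call_llm, example["prompt"], example["response"]) == example["human_pass"]
        for example in labeled_examples
    )
    return agreements / len(labeled_examples)


def fake_llm(eval_prompt):
    # Stand-in for a real model call (an assumption for this sketch).
    return "FAIL" if "moon is made of cheese" in eval_prompt else "PASS"


labeled_examples = [
    {"prompt": "What is 2+2?", "response": "4", "human_pass": True},
    {"prompt": "What is the moon made of?",
     "response": "The moon is made of cheese.", "human_pass": False},
    {"prompt": "Capital of France?", "response": "Berlin", "human_pass": False},
    {"prompt": "Say hello", "response": "Hello!", "human_pass": True},
]

agreement = benchmark_judge(fake_llm, labeled_examples)
print(f"judge/human agreement: {agreement:.2f}")
```

Because the template is part of what is being measured, changing either the evaluating LLM or the template means the benchmark should be rerun.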
It should be kept in mind that complex chains can be created using LLMs and libraries like LangChain. Monitoring these can be tricky and, at times, might not be possible, but there are solutions like LangSmith that handle some of these cases.
Guardrails: it is important to implement guardrails for the evaluating LLM to restrict it from generating unwanted or toxic content. Open-source guardrail providers include Guardrails AI and NeMo-Guardrails.

Explainability

When a model makes a prediction, explainability enables one to explain why it made that prediction. Being able to understand why the LLM gives a particular response is highly important in making the model better, i.e., it helps reduce hallucinations, toxicity and many other issues. But it is important to understand that LLMs are complex black-box systems whose internal workings are largely opaque. The following paper discusses this: Explainability for Large Language Models: A Survey

Latency and Token Usage

It is important to continuously track the latency of the deployed model’s APIs, as well as token limits and token usage.
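A rough sketch of such tracking is shown below. UsageTracker is a hypothetical helper, call_llm stands for whatever function actually calls the deployed model, and counting whitespace-separated words is only an approximation of tokens; a real setup would use the model’s tokenizer or the usage figures returned by the API.

```python
import time
from dataclasses import dataclass, field


@dataclass
class UsageTracker:
    token_limit: int
    records: list = field(default_factory=list)

    def track(self, call_llm, prompt):
        """Call the model, recording latency and approximate token usage."""
        start = time.perf_counter()
        response = call_llm(prompt)
        latency = time.perf_counter() - start
        # Whitespace splitting only approximates tokens.
        tokens = len(prompt.split()) + len(response.split())
        self.records.append({"latency_s": latency, "tokens": tokens})
        return response

    def total_tokens(self):
        return sum(record["tokens"] for record in self.records)

    def over_limit(self):
        return self.total_tokens() > self.token_limit

    def average_latency(self):
        return sum(record["latency_s"] for record in self.records) / len(self.records)


def fake_llm(prompt):
    # Stand-in for a real model call (an assumption for this sketch).
    return "four words of output"


tracker = UsageTracker(token_limit=10)
tracker.track(fake_llm, "three word prompt")
tracker.track(fake_llm, "three word prompt")
print(tracker.total_tokens(), tracker.over_limit(), tracker.average_latency())
```

Recording latency next to token counts also makes the effect of completion length on latency visible over time.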

6. RLHF (Reinforcement Learning from Human Feedback)

At times, depending on the capabilities of the LLM in use, it may be difficult to get outputs from the LLM in the way we prefer, as LLMs are ultimately black boxes, as mentioned in Section 3. And if one instead trains a custom model for the use case, the resulting model may not have all the capabilities an LLM has. RLHF is very useful in these situations.

RLHF is about teaching LLMs to understand human preferences and perform tasks accordingly. It allows us to guide the LLM’s output towards specific desired characteristics, like helpfulness, truthfulness and politeness. So, hallucinations, toxicity and similar issues in the responses of LLMs can be tackled using RLHF. The idea is to use a Reward model to refine (or fine-tune) the LLM so that it generates the responses favored by humans. But it is important to recognize that RLHF comes with its own set of issues.
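The article does not say how the Reward model is trained. One common choice, assumed here purely for illustration, is a pairwise preference loss: given the reward scores of a response humans preferred and one they rejected, the loss is -log(sigmoid(r_chosen - r_rejected)), which is small when the preferred response scores higher. The function name and the sample scores below are hypothetical.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward scores for a human-preferred and a rejected response.
print(pairwise_preference_loss(2.0, 0.0))  # reward model agrees with humans
print(pairwise_preference_loss(0.0, 2.0))  # reward model disagrees: higher loss
```

Once trained, the reward model’s scores serve as the signal used to fine-tune the LLM towards responses humans favor.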

Conclusion

As the demand for LLMs continues to grow across various industries, the significance of efficient LLMOps becomes increasingly apparent. The successful implementation of LLMOps not only ensures the optimal functioning of language models but also paves the way for future advancements in natural language processing. In a landscape where language models are becoming integral to various applications, a strategic and well-rounded approach to LLMOps is paramount for unlocking the full potential of these powerful linguistic tools.

References