LoRA: Revolutionizing Large Language Model Adaptation without Fine-Tuning

lora nlp

Once a model is pre-trained, its parameters (weights) can be represented in fewer dimensions than they currently occupy. In conclusion, LoRA has the potential to play a significant role in shaping the future of large language models and natural language processing. When training is set to false (such as during evaluation), the computed low-rank adaptation is added to the primary weight matrix; when training is switched back on, it is subtracted again.

Why Representation Finetuning is the Most Efficient Approach Today? – Towards Data Science

Posted: Sun, 26 May 2024 07:00:00 GMT [source]

All datasets were merged with the help of create_instruct_set_v2.py (hash a1151bf903990b88177d30bd1de67c7b94fdecef). The model is able to generate text in Russian, answer questions, and solve simple logical puzzles and simple math calculations; it was trained on a medium-sized corpus of Russian instructions, manuals and other texts. The Snorkel Flow AI data development platform allows users to clear this challenge through scalable data labeling solutions. For example, two of our researchers used programmatic labeling functions in Snorkel Flow to curate 20,000 examples from the OpenAssistant and Dolly 2.0 data sets down to the best 10,000 examples in about a day. We will now override the original query/value projection matrices with our new LoRA layers.

A model trained for summarization may not adapt well to machine reading comprehension, question answering, or coding tasks. If you are considering finetuning a model for your business, you are probably correct. But data becomes an important part of the process; take a look at our synthetic data generation guide. There are many ways fine-tuned models can improve the performance of your business by providing customization.

2 Architecture

Too low a rank risks losing information, while too high a rank wastes computation. A and B are initialized accordingly, and backpropagation adjusts their values during fine-tuning. AWS empowers efficient data engineering with S3, Glue, EMR, Kinesis, and Lambda for seamless batch and real-time processing. If you’re training on more than one GPU, add the --multi_gpu parameter to the accelerate launch command. The training script has many parameters to help you customize your training run.

Finetuning helps you customize the model to your specific needs and knowledge. You can use a RAG pipeline to customize the model, but sometimes the knowledge is so vast that embedding and similarity searches are not enough; that’s where customization via finetuning comes in. At the time of its release, it became a state-of-the-art model and achieved 99.3% performance relative to ChatGPT. Even compared with smaller models like Vicuna 13B, Guanaco 33B used less memory than the 13B model because of its 4-bit precision.

The predictive performance of full fine-tuning can be replicated even by constraining W0’s updates to low-rank decomposition matrices. In this section, we introduce MixLoRA, a method that leverages LoRA for parameter-efficient fine-tuning (PEFT) to construct sparse mixture-of-experts models through fine-tuning on dense models, as shown in Figure 1. The first part involves constructing the sparse MoE block using the vanilla transformer block augmented with LoRAs. The second part utilizes a top-k router to assign each token from various tasks to different expert modules. This approach allows us to harness the power of MoE along with the capabilities of LoRA.

LoRA offers a practical solution for enhancing the specialization of language models without the need for extensive data typically required for training from scratch. LoRA researchers ran several experiments to test its fine-tuning performance against other parameter-efficient and full fine-tuning approaches. Similarly, in adapters, additional layers are added to the attention and MLP sub-blocks. The weights of these additional layers can be updated using LoRA to reduce the sequential processing overhead of adapters. As mentioned before, LoRA is an adapter-based approach, but new parameters are added only for the training step; they are not introduced as a part of the model. This keeps the model size exactly the same and still offers the flexibility of parameter-efficient finetuning.

Specifically, it only requires loading LoRAs into the GPU memory, thereby avoiding loading inactive FFN (Feed Forward Network) layers into GPU memory, which significantly saves space. Additionally, the FFN layers as experts, equipped with different LoRAs, can process tokens from various tasks, thereby enhancing the multi-task learning capability. Naturally, some recent work has attempted to combine LoRA with MoE, constructing an efficient sparse MoE structure by using multiple LoRA modules to handle different tokens and tasks, as depicted in Table 1. This aims to reduce the number of parameters for fine-tuning while enhancing the LLM’s cross-task generalization ability as much as possible.

The bitsandbytes library takes care of the 4-bit quantization and the whole low-precision storage and high-precision compute part. Here is a graphical representation of these errors on a larger distribution of weights of a neural network. You can see that there are “buckets” or “bins” of data where the data is quantized. This quantization process allows you to use fewer numbers by “rounding off” to the nearest quantile.
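To make the "rounding off to the nearest quantile" idea concrete, here is a minimal NumPy sketch. It is an illustration only, not how bitsandbytes implements NF4: the function names, the choice of 16 equal-probability levels, and the random weights are all assumptions made for this example.

```python
import numpy as np

def quantile_quantize(weights, num_bits=4):
    """Map each weight to the index of its nearest quantile level (hypothetical helper)."""
    num_levels = 2 ** num_bits
    # Levels sit at the centers of equal-probability bins of the empirical distribution.
    probs = (np.arange(num_levels) + 0.5) / num_levels
    levels = np.quantile(weights, probs)
    # "Rounding off": pick the closest level for every weight.
    codes = np.abs(weights[:, None] - levels[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), levels

def dequantize(codes, levels):
    """Look up the stored level for each 4-bit code."""
    return levels[codes]

rng = np.random.default_rng(0)
weights = rng.normal(size=10_000)          # stand-in for a layer's weights
codes, levels = quantile_quantize(weights)
restored = dequantize(codes, levels)
mean_error = np.abs(weights - restored).mean()
```

Every weight is now stored as one of 16 codes, and `mean_error` measures the quantization error that the dequantized weights carry.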

Essentially, LoRA focuses on optimizing these matrices’ ranks — reducing the dimensions that need adjustments — making it possible to fine-tune large models without the excessive computational cost typically involved. In the context of LoRA, the concept of rank plays a pivotal role in determining the efficiency and effectiveness of the adaptation process. Remarkably, the paper highlights that the rank of the matrices A and B can be astonishingly low, sometimes as low as one.

From their blog post, all you need is to add the following lines to your code to integrate PEFT into your finetuning workflow. LoRA doesn’t change the underlying model, but it changes how the model emphasizes different connections. Most photo applications offer pre-made filters that users can apply to their images to evoke different moods.

In a transformer model, the LoRA layer is created and injected for the query and value projection matrices. In keras.layers.MultiHeadAttention, the query/value projection layers are keras.layers.EinsumDense layers. We will fine-tune both the GPT-2 model and the LoRA GPT-2 model on a subset of this dataset.

These methods offer a promising avenue for deploying large language models in real-world applications, making NLP more accessible and practical than ever before. Equation 3 represents the objective for conditional language generation, based on next-token prediction: we maximize the probability of the token predicted at time t conditioned on the input x, and optimize the parameters Φ until the objective is maximized over all the data points.
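Written out, an objective of this kind usually takes the form below. The symbol $\mathcal{Z}$ for the set of training pairs $(x, y)$ and the notation $y_{<t}$ for the tokens before position $t$ are assumptions, since the original equation is not reproduced here.

$$
\max_{\Phi} \sum_{(x, y) \in \mathcal{Z}} \sum_{t=1}^{|y|} \log P_{\Phi}\left(y_t \mid x, y_{<t}\right)
$$

The inner sum runs over the tokens of one target sequence, and the outer sum runs over all data points.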

How does the introduction of dropout methods like HiddenKey impact the overall performance of large language models

In this section, we discuss the technical details of LoRA, build a LoRA GPT-2 model, fine-tune it and generate text. We use a sequence length of 128 instead of 1024 (which is the default sequence length). This will limit our ability to predict long sequences, but will allow us to run this example quickly on Colab.

Data scientists can also apply LoRA to large-scale multi-modal or non-language generative models, such as Stable Diffusion. Parameter-efficient fine-tuning of LLMs is a rapidly evolving field that addresses the challenges posed by computational and memory requirements. Techniques like LoRA and QLoRA demonstrate innovative strategies to optimize fine-tuning efficiency without sacrificing task performance.

Depending on the number and complexity of the target tasks, this could require tens of thousands of examples. Manual approaches to preparing this data often prove unworkable due to time, cost, or privacy concerns. The central idea underlying LoRA, and PEFT more generally, is to approximate the update to a large parameter model using a low-dimension update. This low-dimension update contains most of the information (gradient signal) contained in the full update but requires much less computation time and memory.

Machine Learning Frontiers

Before we generate text, let’s compare the training time and memory usage of the two models. The training time of GPT-2 on a 16 GB Tesla T4 (Colab) is 7 minutes, and for LoRA, it is 5 minutes, a 30% decrease.

Large Language Models (LLMs) have been shown to be effective at a variety of NLP tasks. An LLM is first pre-trained on a large corpus of text in a self-supervised fashion. Pre-training helps LLMs learn general-purpose knowledge, such as statistical relationships between words.

Since A and B are much smaller than W0, this significantly reduces the number of parameters that need to be trained, hence reducing computational requirements. In essence, LoRA simplifies the fine-tuning process by tuning the smaller matrices A and B instead of the larger weight update matrix $\Delta W$. Once the fine-tuning is complete, the adjusted weights are added back to the original model for inference.
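A quick way to see both points (fewer trainable parameters, and folding the update back into the model) is the NumPy sketch below. The layer size 768 x 768 and the rank 4 are illustrative assumptions, and B is filled with random values only to stand in for a trained matrix.

```python
import numpy as np

d, k, r = 768, 768, 4            # layer dimensions and LoRA rank (illustrative values)
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d, k))     # frozen pre-trained weight
B = rng.normal(size=(d, r))      # trainable, d x r (random here to mimic a trained state)
A = rng.normal(size=(r, k))      # trainable, r x k

full_params = W0.size            # parameters touched by full fine-tuning
lora_params = A.size + B.size    # parameters touched by LoRA
ratio = lora_params / full_params

x = rng.normal(size=(k,))
# During training the frozen path and the low-rank path are kept separate.
h_separate = W0 @ x + B @ (A @ x)
# After training the update B @ A is added back into a single weight matrix.
W_merged = W0 + B @ A
h_merged = W_merged @ x
```

With these sizes LoRA trains 6,144 values instead of 589,824 (about 1%), and the merged matrix gives the same output as the two separate paths.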

One of the most significant advantages of LoRA is its ability to reduce the computational resources required for adapting large language models. The result is a full-sized language model that has been efficiently adapted to the target task while maintaining the performance of the original pre-trained model. However, the sheer size of these models also comes with its downsides, such as high computational resource requirements, longer training and fine-tuning times, and considerable energy consumption. The primary reason behind the exceptional performance of LLMs is their massive size and architecture. By increasing the number of parameters and layers within the model, LLMs can capture more complex patterns and relationships within language.

QLoRA and LoRA are both finetuning techniques, but QLoRA uses LoRA as an accessory to fix the errors introduced during quantization. Al. paper summarizes performance (in terms of GPU memory consumption) across a variety of PEFT methods. The original version of the RedPajama open source LLM used the entire 20,000 examples. Our researchers fine-tuned a separate version using their curated 10,000 examples.

MixDoRA, while still relatively balanced, might allow for a more similar distribution of work where expertise can be leveraged for certain tasks, at the risk of less flexibility if the task demands change. This is supported by Table 4, where MixDoRA shows a larger drop than MixLoRA in multi-task learning experiments. MixLoRA achieves greater gains in fine-tuning results on Mistral 7B compared to LoRA, surpassing all results obtained on LLaMA 13B. This indicates that MixLoRA effectively extends the model’s capacity by building multiple experts on the FFN of the dense model.

In experiments, MixLoRA achieves commendable performance across all evaluation metrics in both single-task and multi-task learning scenarios. In this paper, we introduced MixLoRA, a parameter-efficient fine-tuning method that constructs multiple LoRA experts and a top-k router on top of the frozen shared feed-forward network block of dense models through fine-tuning. This approach facilitates the construction of an efficient sparse mixture-of-experts model, enabling improved performance across multiple downstream tasks. Building on the m-LoRA framework, we achieved high-throughput parallel training, inference, and evaluation of multiple MixLoRA models on a single 24GB consumer-grade GPU. By optimizing the additional overhead in the computation process, we increased the forward computation speed of MixLoRA by approximately 10% compared to running it without optimization.

UNIPELT [63] introduces a unified framework to effectively tune pre-trained language models (PLMs) with fewer trainable parameters, especially when training data is limited. Yeh et al. [64] introduce a unified framework within the LoRA family for stable diffusion processes. In natural language processing (NLP), the ability to fine-tune large language models (LLMs) for specific tasks is paramount. However, traditional fine-tuning methods often come with significant computational costs and memory requirements, making them less feasible in resource-constrained environments.

Can LoRA be applied to any large language model?

🤗 Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It’ll automatically configure your training setup based on your hardware and environment. While a standardized fine-tuning task and sufficient training data can help reduce these challenges, there is no guarantee.

Some of these efforts focus on making the fine-tuning of LLMs more cost-efficient. One of the techniques that helps reduce the costs of fine-tuning enormously is “low-rank adaptation” (LoRA). With LoRA, you can fine-tune LLMs at a fraction of the cost it would normally take. LoRA, with its innovative low-rank adaptation approach, has the potential to revolutionize the way we work with these models, making them more practical and sustainable for a wider range of applications and users. LoRA differs from traditional fine-tuning in that it focuses on adapting a low-rank representation of the pre-trained model instead of the entire model.

Separating the pre-trained and fine-tuned parameters is an important part of LoRA. This approach significantly enhances computational efficiency, resulting in a training speed improvement of over 10%. MixDoRA outperforms MixLoRA in a single-task scenario but lags behind MixLoRA in a multi-task scenario. Experimental results indicate that the extended learning capacity of MixDoRA may require more training and tuning compared to MixLoRA.

In this example, we will explain LoRA in technical terms, show how the technical explanation translates to code, hack KerasNLP’s GPT-2 model and fine-tune it on the next token prediction task using LoRA. We will compare LoRA GPT-2 with a fully fine-tuned GPT-2 in terms of the quality of the generated text, training time and GPU memory usage.

In essence, LoRA has the potential to magnify crucial features for particular downstream tasks that were learned but not highlighted (i.e. the low singular vectors) in the general pre-training model. Since LoRA requires that we keep the pre-trained and fine-tuned weights separately, we incur a memory overhead. Also, the operation of adding the pre-trained and fine-tuned weights at inference time causes a small computation penalty.

This helps keep the model size small while making sure the model is still highly performant and accurate. It goes without saying, but PEFT can save you a ton of money on computing costs. As mentioned, because of heavy memory optimizations, you don’t need to rent instances with high amounts of VRAM, as you can fit bigger batches in smaller VRAM. This saves money on compute and allows you to train on much bigger datasets by fitting larger batch sizes. There are many reasons to use PEFT techniques; they have become the go-to way to finetune LLMs and other models. Here are some reasons why even enterprises and large businesses like to use these approaches.

  • This helps in more efficient training as there is no need for the conversion step to update the models during the backpropagation process.
  • Meaning, that LoRA tunes two low-dimensional matrices based on the rank hyperparameter and multiplies them to calculate the fine-tuned weight matrix.

In a double-blind study, human testers preferred our researchers’ version of the model in every category. Snorkel researchers and engineers have since used similar approaches to improve generative applications for some of the largest companies in the world. Popular deep learning libraries offer PEFT implementations, such as PyTorch Lightning’s Lit-GPT and HuggingFace’s PEFT, that accelerate the process of getting to a fine-tuned model. Doing so requires only a quick import statement followed by wrapping a base HuggingFace Transformers model.

As we all know, these models require huge amounts of resources such as compute and memory to operate, so it is not feasible for everyone, or for every company, to fine-tune them in their entirety for each new downstream task. A standard Mixture of Experts (MoE) architecture typically consists of $N$ parallel feed-forward networks (FFNs) as experts, which necessitate a large number of parameters. MixLoRA constructs experts based on the LoRA technique [23], leveraging a shared single feed-forward network (FFN) from the dense model to enhance training and inference efficiency. LoRA offers a parameter- and compute-efficient approach compared to other methods. One such alternative is adding Adapter Layers within each Transformer block, where only the parameters of these layers are adjusted during fine-tuning.

This method allows LongLoRA to scale to a much longer context because of the distributed workload. These smaller matrices can then be used to train the model using normal backpropagation but updating the parameters in the smaller matrices rather than updating directly in the model. These smaller matrices can then be multiplied together to get back the original matrix.

The changes made to $W$ during fine-tuning are collectively represented by $\Delta W$, such that the updated weights can be expressed as $W + \Delta W$. Large Language Models (LLMs) like GPT-4, Llama-2, and Claude have shown tremendous capabilities for solving language tasks. However, they are mostly unreliable for solving domain-specific downstream tasks, such as portfolio optimization or drug discovery.

Print the model’s summary and see if the number of non-trainable parameters and total parameters are correct. Even though the total number of parameters increases (since we are adding LoRA layers), the memory footprint reduces, because the number of trainable parameters reduces. We’ll now batch the dataset and retain only the document field because we are fine-tuning the model on the next word prediction task.

A paper at ICLR 2022 provides an efficient and model-agnostic way of fine-tuning large pre-trained models, termed Low-Rank Adaptation (LoRA). The main idea and hypothesis of LoRA are rooted in the observation of intrinsic dimensionality.

This reduction is achieved through an understanding of how layers in neural networks function: each layer involves a matrix multiplication applied to the layer’s input, the addition of a bias vector, and a nonlinear operation. The matrices, whose entries are the adjustable weights, vary in size with the model’s complexity. For instance, GPT-3’s matrices are much larger than GPT-2’s due to their vastly different parameter counts.

Low-Rank Decomposition of Weight Update $\Delta W$

As shown before, quantile quantization creates buckets or bins for a large range of numbers. This process leads to placing multiple different numbers into the same buckets, for example, two numbers 2 and 3 can be converted into 3 during the quantization process. This leads to an error of 1 being introduced during the dequantization of weights. This block-wise method leads to multiple quantization constants being created, which then once again can be quantized to save additional space. This is okay because the number of quantization constants is low and hence not a lot of computing or storage is required.
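The block-wise idea can be sketched in a few lines of NumPy. Note the assumptions: this sketch uses simple absmax scaling with uniform signed 4-bit codes rather than quantile levels, a block size of 64, and hypothetical function names. It only shows how one quantization constant per block arises, and does not implement the second quantization of those constants.

```python
import numpy as np

def blockwise_quantize(weights, block_size=64, num_bits=4):
    """Quantize a 1-D array block by block (length must be a multiple of block_size)."""
    blocks = weights.reshape(-1, block_size)
    # One quantization constant per block: the largest absolute value in it.
    constants = np.abs(blocks).max(axis=1, keepdims=True)
    max_code = 2 ** (num_bits - 1) - 1           # 7 for signed 4-bit codes
    codes = np.round(blocks / constants * max_code).astype(np.int8)
    return codes, constants.squeeze(1)

def blockwise_dequantize(codes, constants, num_bits=4):
    """Rescale the integer codes with their block's constant."""
    max_code = 2 ** (num_bits - 1) - 1
    return (codes / max_code * constants[:, None]).reshape(-1)

rng = np.random.default_rng(0)
weights = rng.normal(size=4096)
codes, constants = blockwise_quantize(weights)
restored = blockwise_dequantize(codes, constants)
errors = np.abs(weights - restored)
```

Here 4,096 weights produce only 64 constants, which is why storing (or re-quantizing) the constants is cheap compared with the weights themselves.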

Here, we keep the LoRA matrices in a higher precision format, like brain float 16 or float 16, and during backpropagation and the forward pass the weights of the network are also de-quantized. So the actual training still happens in higher precision formats, while the storage stays in lower precision. This causes quantization errors to emerge, but the model training itself is able to compensate for these inefficiencies in the quantization process. LoRA works by breaking down the weight update matrix into smaller matrices and using them to train the model. Take a look at the diagram below: $\Delta W_{A \times B}$ is the weight update matrix, the matrix of learned changes from backpropagation, and it is the same size as the number of parameters we need to update to finetune our model. This matrix, or any matrix, can be represented as a set of smaller matrices, presented here as A and B with r as their rank.

Appropriate data selection forms the foundation for all machine learning customization efforts, whether that’s a simple logistic regression model or a LoRA-customized generative AI (GenAI) model. Let’s put these concepts into practice with a code example of fine-tuning a large language model using QLoRA.

Since, with LoRA, there is a huge reduction in the number of trainable parameters, the optimizer memory and the memory required to store the gradients for LoRA is much less than GPT-2. The callback function uses TensorFlow’s tf.config.experimental.get_memory_info API.

For each parameter update we need to store ∆Φ (the change in the weight matrix), whose dimensions are as big as the pre-trained model. So storing and deploying many independent instances of fine-tuned models can be challenging. In comparison, LoRA’s set of weights are optimized low-rank decomposition matrices. That is, LoRA tunes two low-dimensional matrices based on the rank hyperparameter and multiplies them to calculate the fine-tuned weight matrix. PEFT stands for Parameter-Efficient Fine-Tuning, a set of fine-tuning techniques that allows you to fine-tune and train models much more efficiently than normal training.

Every LLM is a transformer model that is composed of several layer blocks, each of which contains learnable parameters. To confirm the effectiveness of MixLoRA in specializing the experts with different tasks, we present the distribution of experts across diverse tasks for MixLoRA and MixDoRA in Figure 3, respectively. Intuitively, we can observe that the expert loads in all tasks are balanced, which means these experts are allocated to different tasks relatively evenly.

QLoRA is an extension of LoRA that further introduces quantization to enhance parameter efficiency during fine-tuning. It builds on the principles of LoRA while introducing 4-bit NormalFloat (NF4) quantization and Double Quantization techniques. One of the biggest advantages of LoRA over other adapter methods is that it does not incur any additional inference latency.

LoRA (Low-Rank Adaptation of Large Language Models) is a popular and lightweight training technique that significantly reduces the number of trainable parameters. It works by inserting a smaller number of new weights into the model and only these are trained. This makes training with LoRA much faster and more memory-efficient, and it produces smaller model weights (a few hundred MBs), which are easier to store and share. LoRA can also be combined with other training techniques like DreamBooth to speed up training.

An LLM can then be fine-tuned on a downstream task of interest (such as sentiment analysis). The trained over-parameterized models reside on a low intrinsic dimension, and the change in weights during model adaptation also has a low intrinsic rank. During training, modifications are made to the LoRA parameters, which are now much fewer than the original weights.

Massive models like GPT-4 cost millions of dollars to train, hence we use smaller models in production settings. For only a slight reduction in downstream task performance, LoRA and QLoRA can perform comparably to the original LLaMA-7/13B-Alpaca models but with parameter counts at only about 2% of the originals. One takeaway is that the reduction in GPU memory usage provided by the use of LoRA and PEFT is huge, with the exact savings depending on the type and size of the model. Furthermore, if you are willing to sacrifice some accuracy, quantized versions of LoRA can further cut memory usage by half or more. However, some hybrid and additive fine-tuning methods such as MAM can increase memory usage.

By synthesizing a full-size LoRA weight from individual client contributions and employing Singular Value Decomposition (SVD) for weight redistribution, FlexLoRA fully leverages heterogeneous client resources. FlexLoRA’s practicality is further underscored by our theoretical analysis and its seamless integration with existing LoRA-based FL methods, offering a path toward cross-device, privacy-preserving federated tuning for LLMs. This can be made possible by treating the trainable parameters of these techniques as the trainable parameters of LoRA. Unlike adapters, LoRA adds no inference latency since it performs simple matrix operations. Moreover, it only adapts the attention layer weights for the downstream tasks and freezes the rest of the transformer weights, making it more parameter-efficient.

Studies have shown that LoRA training pipelines can use incredibly small rank-decomposition subspaces (relative to the size of the original space). Doing so allows data scientists to fine-tune an LLM such as GPT-3 by updating as few as 0.01% of the original parameters. Various PEFT methods have been developed to cater to different requirements and trade-offs. Some notable PEFT techniques include T-Few, which attains higher accuracy with lower computational cost, and AdaMix. This general method tunes a mixture of adaptation modules for better performance across different tasks.

Initially, one matrix is initialized using a normal distribution and the other is initialized to 0. Then, based on the fine-tuning objective, the backpropagation process finds the right values for the two matrices. They are then multiplied to obtain the fine-tuned weight matrix that is equal to the size of the original pre-trained weight matrix. As you can see in the above table, there is no loss of performance whatsoever in the T5 model family upon training with QLoRA and even with Double Quantization, we don’t see any major differences. In the paper, the authors mention that they needed more LoRA adapters for QLoRA finetuning, compared to normal LoRA finetuning.
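The initialization described above has a useful consequence: the product of the two matrices starts at exactly zero, so the adapted model begins identical to the pre-trained one. A minimal sketch, assuming A gets the normal initialization and B the zeros (the helper name, the 0.01 scale and the shapes are illustrative choices):

```python
import numpy as np

def init_lora(d, k, r, seed=0):
    """Return (A, B) with A drawn from a normal distribution and B set to zero."""
    rng = np.random.default_rng(seed)
    A = rng.normal(scale=0.01, size=(r, k))   # random normal initialization
    B = np.zeros((d, r))                       # zero initialization
    return A, B

A, B = init_lora(d=16, k=16, r=2)
delta_W = B @ A    # same shape as the pre-trained weight, all zeros before training
```

Backpropagation then moves A and B away from this starting point, and `delta_W` becomes the learned update.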

This approach offers computational efficiency and simplicity compared to other adaptation methods. In a world where time and resources are precious, LoRA offers a streamlined approach to tackle computational costs, training times, and memory requirements. Discover how LoRA opens doors to rapid experimentation, deployment in resource-constrained environments, and optimization for memory-limited devices. This blog section serves as your guide to unlocking the full potential of NLP adaptation, empowering you to stay ahead of the curve in language processing.

It adds a middleware security and protection layer on top of your LLM to check the integrity of its responses and make corrections in real time, effectively mitigating hallucinations, profanity, and off-topic responses. LLMs have made significant strides in patient care, medical education, and research. They can converse with patients, analyze doctors’ notes, summarize literature, and provide treatment plans. LoRA-enhanced LLMs are better equipped to handle healthcare data, such as medical literature, research findings, clinical notes, prescriptions, lab results, etc. Researchers can quickly fine-tune specialized models to power clinical decision support systems, accelerate drug development, and build better patient engagement platforms.

LoRA fine-tuning is an innovative approach that offers a more streamlined and cost-effective way to adapt large language models to specific tasks. Training times plummet, computational costs decrease, and memory usage becomes more manageable. As NLP continues its rapid development, LoRA stands out as a key technique for unlocking the full potential of large language models. By enabling faster and more targeted adaptation, LoRA paves the way for a wider range of real-world applications, pushing the boundaries of what’s possible in the field. LoRA is an innovative technique designed to efficiently fine-tune pre-trained language models by injecting trainable low-rank matrices into each layer of the Transformer architecture.

While methods like LoRA have effectively tackled GPU memory constraints during fine-tuning, their applicability is often restricted by limited performance, especially in multi-task settings. On the other hand, Mixture-of-Experts (MoE) models, such as Mixtral 8x7B, demonstrate remarkable performance across multiple NLP tasks while maintaining a reduced parameter count. However, the resource requirements of these MoEs are still challenging, particularly for consumer-grade GPUs that only have limited VRAM. To address these challenges, we propose MixLoRA, an innovative approach aimed at constructing a resource-efficient sparse MoE model based on LoRA. MixLoRA inserts multiple LoRA-based experts within the feed-forward network block of a frozen pre-trained dense model through fine-tuning, employing a commonly used top-k router.
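The routing idea can be illustrated with a heavily simplified NumPy sketch. It is not the MixLoRA implementation: a single frozen linear map stands in for the shared feed-forward block, the router is a plain linear layer, the gate weights are a softmax over the selected experts only, and all names and shapes are assumptions made for this example.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mix_lora_layer(x, W0, experts, W_router, top_k=2):
    """x: (tokens, d_in); W0: (d_in, d_out); experts: list of (A, B) LoRA pairs."""
    logits = x @ W_router                                # (tokens, num_experts)
    top_idx = np.argsort(logits, axis=-1)[:, -top_k:]    # top-k expert ids per token
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    gates = softmax(top_logits)                          # weights over the chosen experts
    base = x @ W0                                        # shared frozen path, computed once
    out = np.zeros_like(base)
    for t in range(x.shape[0]):
        for slot in range(top_k):
            A, B = experts[top_idx[t, slot]]
            # Each expert is the shared frozen weight plus its own low-rank update.
            out[t] += gates[t, slot] * (base[t] + x[t] @ A @ B)
    return out, top_idx

rng = np.random.default_rng(0)
d_in, d_out, r, num_experts, tokens = 8, 8, 2, 4, 5
W0 = rng.normal(size=(d_in, d_out))
W_router = rng.normal(size=(d_in, num_experts))
# B starts at zero, so every expert initially reproduces the frozen layer.
experts = [(rng.normal(size=(d_in, r)), np.zeros((r, d_out))) for _ in range(num_experts)]
x = rng.normal(size=(tokens, d_in))
out, top_idx = mix_lora_layer(x, W0, experts, W_router)
```

Only the small (A, B) pairs and the router differ between experts, which is what keeps the memory cost far below that of N full feed-forward networks.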

Neural networks consist of numerous dense layers that perform computations using weight matrices. These matrices are typically of full rank, meaning they have the maximum number of linearly independent rows or columns possible for their dimensions. However, research suggests that when adapting these pre-trained models to specific tasks, the necessary adjustments or updates to the weights don’t require altering every single element of these matrices. Instead, these updates can be effectively captured with a much lower degree of freedom, or “intrinsic rank.” To understand this, let’s get some SVD intuition first.
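As a small illustration of that intuition, the NumPy sketch below builds an update matrix that has rank 2 by construction (an assumption chosen so the effect is easy to see) and shows that two thin factors taken from its SVD reproduce it exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rank = 2
# A 64 x 64 update that only has 2 independent directions.
delta_W = rng.normal(size=(64, true_rank)) @ rng.normal(size=(true_rank, 64))

U, S, Vt = np.linalg.svd(delta_W, full_matrices=False)

r = 2
B = U[:, :r] * S[:r]     # 64 x r, left singular vectors scaled by singular values
A = Vt[:r, :]            # r x 64, right singular vectors
approx = B @ A           # rank-r reconstruction of the update

stored_values = A.size + B.size   # 256 numbers instead of 4096
```

All singular values beyond the second are numerically zero, so nothing is lost by keeping only the first two. LoRA does not run an SVD during training; it learns factors of this shape directly.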

Large Language Models (LLMs) are powerful machine learning models specifically designed for natural language processing tasks. Large language models (LLMs) are taking the world by storm, bringing forth unparalleled advancements in natural language processing (NLP) tasks. The train method sets the layer in training mode and handles the training logic specific to LoRA. If mode is True (training mode) and merge_weights is enabled, it ensures that the weights are not merged by subtracting the LoRA-computed update from the weight matrix. If mode is False (evaluation mode) and merge_weights is enabled, it merges the weights by adding the LoRA-computed update to the weight matrix.
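The merge and unmerge behavior of the train method can be sketched with a small NumPy class. This is a simplified stand-in for the layer being described, not its actual code: the class name, the `merged` flag, and the `alpha / r` scaling are assumptions for illustration.

```python
import numpy as np

class LoRALinear:
    """Toy linear layer with a LoRA update that can be merged into the weight."""

    def __init__(self, W0, r, alpha=1.0, merge_weights=True, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W0.shape
        self.weight = W0.copy()                       # pre-trained weight
        self.A = rng.normal(scale=0.01, size=(r, d_in))
        self.B = np.zeros((d_out, r))
        self.scaling = alpha / r                      # assumed scaling factor
        self.merge_weights = merge_weights
        self.merged = False
        self.training = True

    def delta(self):
        return self.B @ self.A * self.scaling

    def train(self, mode=True):
        self.training = mode
        if mode and self.merge_weights and self.merged:
            self.weight -= self.delta()               # un-merge before training resumes
            self.merged = False
        elif not mode and self.merge_weights and not self.merged:
            self.weight += self.delta()               # merge for evaluation
            self.merged = True
        return self

    def forward(self, x):
        if self.merged:
            return x @ self.weight.T
        return x @ self.weight.T + x @ self.delta().T

rng = np.random.default_rng(1)
W0 = rng.normal(size=(4, 6))
layer = LoRALinear(W0, r=2)
layer.B = rng.normal(size=(4, 2))     # pretend training has updated B
x = rng.normal(size=(3, 6))
y_train = layer.forward(x)            # two separate paths
layer.train(False)
y_eval = layer.forward(x)             # single merged weight
layer.train(True)                     # weight is restored to W0
```

Both modes give the same output; the difference is that evaluation uses one merged matrix, so no extra computation is needed at inference time.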

Many applications in NLP rely on adapting one large language model (LLM) to multiple downstream applications via fine-tuning, which updates all the parameters of the pre-trained model. The major downside of fine-tuning is that the new model contains as many parameters as in the original model. It is seriously inefficient, as LLMs in recent days, such as GPTs, can have billions of parameters.

In summary, this class provides a Linear layer with the capability to incorporate LoRA, allowing for fine-tuning of pre-trained models while preserving the original weights and structure. It enables adaptive adjustments to the weights during training while ensuring stability and controlled modifications. These fine-tuned weights are added to the pre-trained weights during inference, maintaining only one set of weights in the GPU’s memory, not doubling it. The adjustments made by these additional weights are merged as needed, ensuring that the GPU memory requirements do not increase during inference. One solution is to use LoRA in conjunction with other fine-tuning techniques like adapters or prefix tuning. However, configuring the parameters for these techniques adds another challenge to the already complex fine-tuning pipeline.