Large language models in legaltech: Demystifying fine-tuning

David Smythe
November 21, 2024

In the first months of the LLM-ageddon brought upon us by OpenAI’s general release of ChatGPT, an unusually technical term suddenly entered the legaltech vernacular: fine-tuning.

The topic is again in the spotlight. OpenAI recently released GPT-4o-mini with heavily discounted fine-tuning training and inference costs. At first, it sounds completely self-evident: everyone should want a model that is fine-tuned on their data, right? And a language model fine-tuned on legal texts must be a huge improvement over ChatGPT when it comes to legal work, right?

Based on the rhetorical nature of these questions, a subpar LLM might conclude the answer is “No.” The actual answer is: well, not necessarily.

Fine-tuning certainly offers numerous advantages – in some cases enabling workflows that otherwise would be impossible. However, fine-tuning is often completely unnecessary or impractical, and more viable alternatives exist that are less costly in terms of time and data. Some AI practitioners in the legal profession may even find that, after months of data curation and experimentation, fine-tuning actually worsens performance against their target benchmarks.

Let’s dive into what fine-tuning actually means and when it’s likely – and unlikely – to help.

Part 1: What fine-tuning actually means

Let’s imagine that, rather than read a lengthy tutorial on transformer architectures, loss functions, and gradient descent, you realize that really you would much prefer to take a spontaneous holiday in Rome. I wouldn’t blame you. With only your passport in hand, you hail a ride to the airport and hop on the first redeye to Fiumicino. Buon viaggio!

Unfortunately, with neither a wallet nor a phone, you find yourself stranded at a deserted Fiumicino at 4:45 AM. You could walk six hours to barter for a cappuccino on Piazza Navona as you had originally envisaged, or you could settle for a sunrise on the beach down the street.

La dolce vita

In many respects, AI engineers find themselves deliberating this same decision for each new problem they tackle: prompt engineer if the destination is within walking distance, otherwise consider fine-tuning.

Commercial LLMs and other foundation models are no less revolutionary than airplanes. Under the hood, the mechanics of language models share a lot in common with the travel analogy as well.

LLMs are trained around a specific ideal “target” output, minimizing some quantifiable measure of “wrongness” against that target (this is called a “loss function”). As the model works its way through a high-quality training dataset, the loss eventually begins to decrease, but given the tremendous complexity of language, the ride is almost always a bumpy one.
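For readers who want to peek under the hood, here is a minimal toy sketch (assuming PyTorch, with a made-up five-word vocabulary) of how that “wrongness” is scored for a single next-word prediction:

```python
import torch
import torch.nn.functional as F

# Toy vocabulary and a single training example: the model should predict
# the next token ("liable") given the context so far.
vocab = ["the", "seller", "is", "liable", "indemnified"]
target_token = vocab.index("liable")

# Pretend these are the model's raw scores (logits) for each vocabulary word.
logits = torch.tensor([[0.2, 0.1, 0.3, 2.5, 0.9]])

# Cross-entropy is the "loss function": it is low when the model assigns
# high probability to the correct next word, and high when it does not.
loss = F.cross_entropy(logits, torch.tensor([target_token]))
print(f"loss = {loss.item():.3f}")  # smaller is better
```

Training nudges the model’s parameters, batch after batch, in whichever direction makes this number smaller across the whole dataset.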

Similar to travel, it’s usually more practical to split the output process into stages with different targets rather than walk towards the final destination in a straight line:

  • Pre-training gets the model into the general vicinity of an understanding of language by targeting a “fill in the blank” or “predict the next word” type task from massive amounts of raw text. Like flying, pre-training gets the model very far, very quickly - but without further alignment, you’re stuck in the air.
  • Alignment or instruction tuning is the magic that lands the plane and is what made ChatGPT so groundbreaking. Alignment relies on a large number of manually curated examples that focus the “next word” target on fulfilling instructions or carrying on conversational interactions. It can also use a “reward model” to help guide the model towards better outputs based on aggregated user feedback (e.g. thumbs up/down). Technically, alignment is already an example of fine-tuning.
  • Fine-tuning is your ground transportation to the Colosseum: it swaps in a customized target, usually within a specific domain (e.g. legal or medical) or for a narrower task (e.g. generating a formatted memo). Either way, the target is usually more specific than in the previous step, and it can also be a completely different type of output instead of text generation (e.g. a spam labeling model with only two options: spam or not-spam).

Fine-tuning was already popularized in the machine learning (ML) space well before “instruction tuning,” but by starting with an instruction-aligned model, you probably won’t need nearly as many examples for your fine-tuning target and will likely see better results. The alignment has already given the model a better “understanding” of how to interpret instructions, so a bit of textual guidance can do the work of thousands of training examples.
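To make that concrete, here is a rough sketch of what a handful of fine-tuning examples for a narrow legal task might look like. It assumes the chat-style JSON Lines format that providers such as OpenAI accept for fine-tuning uploads; the clause text, system prompt, and file name are purely illustrative.

```python
import json

# Illustrative instruction/response pairs for a narrow task: rewriting an
# indemnification clause in plain English. The content is made up.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You rewrite contract clauses in plain English."},
            {"role": "user", "content": "Rewrite: 'The Indemnifying Party shall hold harmless...'"},
            {"role": "assistant", "content": "The Indemnifying Party agrees to cover any losses..."},
        ]
    },
    # ...hundreds or thousands more, depending on the task
]

# Fine-tuning APIs typically expect one JSON object per line (JSONL).
with open("finetune_train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```

The instruction-aligned starting point is what lets a few hundred well-chosen examples like these do the work that would otherwise require many thousands.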

Part 2: How not to fine-tune your LLM

The allure of fine-tuning can lead to common mistakes that degrade results rather than improve them. Here are some traps to avoid:

Dirty data

The saying that “a model is only as good as its training data” has become a cliche for a good reason: the most common pitfall in fine-tuning is using low-quality, biased, or irrelevant data. 

Low quality could mean something as obvious as errors in the dataset: mislabeled case outcomes, bad contract language, etc. This is often the case when using commercial LLM outputs to fine-tune smaller LLMs.

It could also mean that the target responses were written based on data or assumptions not available to the model. This is a recipe for generating hallucinations. For instance, if you show a model an example of drafting a clause that uses a defined term, with no reference to that defined term in your instructions, you are essentially teaching the model to make up defined terms.
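One practical safeguard is to screen your training data for exactly this problem before you ever start a training run. Here is a rough sketch: it flags capitalized defined terms that appear in a target clause but are never mentioned in the prompt (the regex heuristic and example data are simplified assumptions).

```python
import re

def undefined_terms(prompt: str, target: str) -> set:
    """Rough heuristic: flag Capitalized Defined Terms that appear in the
    target clause but are never mentioned in the prompt/instructions."""
    # Treat runs of capitalized words (e.g. "Receiving Party") as candidate defined terms.
    pattern = r"\b(?:[A-Z][a-z]+ )*[A-Z][a-z]+\b"
    return set(re.findall(pattern, target)) - set(re.findall(pattern, prompt))

example = {
    "prompt": "Draft a confidentiality clause for this supply agreement.",
    "target": "The Receiving Party shall not disclose the Confidential Information...",
}

# Anything printed here is a term the example would teach the model to invent.
print(undefined_terms(example["prompt"], example["target"]))
```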

Hop in!

Bias is also very challenging to detect. In ML, bias refers to imbalanced representations in the dataset compared to the actual data you expect to process in real life. This could be as simple as inadvertently providing training inputs exclusively in the present tense; in that case, the model becomes more likely to respond erratically when asked “What will the Seller be liable for in X situation?”
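A crude but useful way to surface that particular imbalance is to count tense markers across your training prompts before fine-tuning; the keyword heuristic below is a simplified assumption, not a full grammatical analysis.

```python
from collections import Counter

# Crude keyword heuristic for spotting a tense imbalance in training prompts.
FUTURE_MARKERS = ("will ", "shall ", "going to ")

def tense_counts(prompts):
    counts = Counter()
    for prompt in prompts:
        text = prompt.lower()
        counts["future" if any(m in text for m in FUTURE_MARKERS) else "present/other"] += 1
    return counts

prompts = [
    "What is the Seller liable for under Section 4?",
    "Summarize the indemnity obligations in this agreement.",
    "What will the Seller be liable for if the goods are defective?",
]

# A heavily lopsided count hints that the model may behave erratically on
# the under-represented phrasing.
print(tense_counts(prompts))
```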

Misalignment

Misalignment happens when the task you're tuning the model for isn’t clearly defined or is misrepresented in the data. If the goal is to flag unfavorable contract terms, inconsistencies in how “unfavorable” is interpreted in the dataset might cause the model to generalize poorly.

Misaligned data can lead to unexpected and subpar outcomes because the model is optimizing for the wrong task.

The danger of drift

When fine-tuning on domain-specific data, there’s always the risk of “drift.” Drift happens when a model becomes too specialized to the fine-tuning data and starts to perform worse on the broader tasks it was originally good at.

For example, if a model fine-tuned on bankruptcy law relies on a dataset reflecting only one task (e.g. answering questions), it may lose its ability to perform general reasoning on other tasks (e.g. comparing language). This is particularly dangerous for generalist LLMs, which are expected to handle diverse types of queries.
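One practical guard against drift is to keep a small, held-out set of general tasks that your fine-tuning data does not cover, and to re-score both the base model and the fine-tuned model on it after every training run. A minimal sketch follows; the model names are hypothetical, and the `generate` and `score` stubs stand in for your own inference call and evaluation metric.

```python
# Held-out general tasks the fine-tuning data does NOT cover.
general_eval_set = [
    {"prompt": "Compare the notice periods in Clause 3 and Clause 7.", "reference": "..."},
    {"prompt": "List the differences between these two definitions of 'Affiliate'.", "reference": "..."},
]

def generate(model_name, prompt):
    # Placeholder: call your model or provider API here.
    return f"stub answer from {model_name}"

def score(answer, reference):
    # Placeholder metric: fraction of reference words found in the answer.
    words = reference.lower().split()
    return sum(w in answer.lower() for w in words) / max(len(words), 1)

def evaluate(model_name, eval_set):
    return sum(score(generate(model_name, item["prompt"]), item["reference"])
               for item in eval_set) / len(eval_set)

base_score = evaluate("base-model", general_eval_set)
tuned_score = evaluate("bankruptcy-finetune-v1", general_eval_set)

if tuned_score < base_score:
    print("Warning: possible drift - the fine-tuned model regressed on general tasks.")
```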

Part 3: Why “legal reasoning” fine-tuning won’t give you an AI lawyer any time soon

There’s been a lot of excitement about using fine-tuned LLMs for legal reasoning, but the truth is that broader, more nuanced commercial LLMs can still offer better results.

“I’m afraid that wasn’t on the bar exam, Dave.”

Better performance on general benchmarks ≠ better in practice

A model fine-tuned on legal texts may score higher on benchmarks or tests designed to assess knowledge of the law. However, the complexities of legal work often transcend mere knowledge or the types of reasoning reflected in benchmark datasets. 

Lawyers must balance logical reasoning, negotiation skills, and client-specific needs—all factors not captured well in fine-tuning on legal corpora. Simply improving benchmark performance doesn’t guarantee practical success in a real-world legal context. The only way to be sure is to evaluate the model on your specific task.

Fine-tuning for specific legal tasks

That said, LLMs that have already been fine-tuned on legal reasoning tasks may be a better starting point for further fine-tuning on a specific task in the legal domain. Depending on the data used to train the legal LLM and its own starting point (i.e. whether it was trained from scratch or fine-tuned on another foundation model), it could also get you closer to your final destination, reducing the amount of time you need to invest in producing your fine-tuning data.

After all, if your final destination were Milan instead of Rome, wouldn’t it make more sense to fly directly to Malpensa?

Part 4: The promise of outcome-oriented fine-tuning in legal tasks

Fine-tuning has its caveats, but when applied correctly with well-defined goals and high-quality data, it can unlock substantial improvements in specific legal tasks.

Task-specific, high-quality data

To see the benefits of fine-tuning, the task and the data need to be clearly defined. For example, a fine-tuned model tasked with writing appellate briefs would benefit greatly from training on a diverse set of briefs including a wide range of writing styles and arguments, rather than a narrow, homogeneous dataset. Ensuring that the training data is both statistically diverse and aligned with the task’s end goal unlocks successful results.

Reducing latency on chain-of-thought tasks

Fine-tuning may also help reduce latency on complex, multi-step reasoning tasks. Instead of waiting for the LLM to walk through logical steps to reach a conclusion, fine-tuning can embed some of this work directly into the parameters of the model itself. This can enable a quicker, more responsive result with equal or greater accuracy.
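One common recipe here (a sketch of the general technique, not necessarily how any particular product does it) is to have a slower, more deliberate model reason step by step offline, then keep only its final answer as the fine-tuning target, so the tuned model learns to jump straight to the conclusion. The prompt, marker string, and clause content below are illustrative.

```python
def strip_reasoning(full_output, marker="Final answer:"):
    """Keep only the final answer as the fine-tuning target, assuming the
    slower model was prompted to end its chain of thought with the marker."""
    return full_output.split(marker)[-1].strip()

# Output from a slower model that reasoned step by step (illustrative).
slow_output = (
    "Step 1: The clause caps liability at fees paid.\n"
    "Step 2: The carve-out excludes gross negligence.\n"
    "Final answer: Liability is capped at fees paid, except for gross negligence."
)

# The fine-tuning example pairs the original prompt with just the conclusion.
training_example = {
    "prompt": "Does this clause cap the Seller's liability, and are there carve-outs?",
    "target": strip_reasoning(slow_output),
}
print(training_example["target"])
```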

Moving inside the black box

Finally, controlling what your model has been trained on allows for more transparency in its decision-making processes, provided that you have thorough knowledge of your training data contents. Although the biases of commercial foundation models are usually not revealed by the provider, fine-tuning puts more of the model bias under your control and observation, making it easier to predict where the model will perform better or worse.

For example, a foundation model’s pre-training data may have included an abundance of forums and blog posts advocating for tenants’ rights, without proportionate representation for landlords; this could lead a model to assume a tenant-friendly position even when prompted to negotiate on behalf of the landlord. Additional fine-tuning on a more balanced sample can help the model to correct for this bias.
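In practice, that correction can be as simple as resampling the fine-tuning set so each viewpoint is equally represented. A minimal sketch, with illustrative stance labels:

```python
import random

def rebalance(examples, key="stance", seed=0):
    """Downsample each group to the size of the smallest group."""
    random.seed(seed)
    groups = {}
    for ex in examples:
        groups.setdefault(ex[key], []).append(ex)
    smallest = min(len(group) for group in groups.values())
    balanced = [ex for group in groups.values() for ex in random.sample(group, smallest)]
    random.shuffle(balanced)
    return balanced

# Illustrative fine-tuning examples labeled by whose position they favor.
examples = [
    {"text": "...", "stance": "tenant"},
    {"text": "...", "stance": "tenant"},
    {"text": "...", "stance": "tenant"},
    {"text": "...", "stance": "landlord"},
]

print(len(rebalance(examples)))  # 2: one tenant-favoring and one landlord-favoring example
```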

This type of bias mitigation is especially important in legal contexts, where understanding how the model arrives at its conclusions is crucial for accuracy and accountability.

Navigating legal AI: Why fine-tuning matters for lawyers and legal tech

Today’s legal space is flooded with tools and functionalities reshaping the way that lawyers interact with their work and think about legal processes. With new terms introduced into the vernacular on a regular basis, distinguishing immaterial jargon from impactful technical knowledge is crucial.

Fine-tuning is a powerful tool to connect users, especially in the legal context, with reliable results. While fine-tuning may be ground transportation to the Colosseum, you still need to get on the right bus to reach your intended destination. You also want to ensure you fine-tune a foundation model that gets you as close as possible: a cab from Heathrow to Rome can get pretty expensive!

Investigating how legal tech platforms use (or don’t use) fine-tuning is essential to understanding how a model learns, how it will perform on legal tasks, and where its potential shortcomings lie.

To learn more about fine-tuning and how the DraftWise team leverages fine-tuning to deliver optimized results for legal professionals and increase efficiency in time consuming, non-billable hours, reach out to us today.

Supercharge your drafting and negotiating.

Experience the power of AI in legal document drafting and analysis.