Lawyer or language model? Testing AI’s competence

By Miriam Stiel, Valeska Bloch, Lisa Kozaris, Tommy Chen & Paul Mersiades on May 27, 2024

Allens Australian law benchmark for generative AI

7 min read

The last 24 months have seen generative artificial intelligence (AI) tools advance in leaps and bounds, powered by remarkable developments in large language models (LLMs). Their new capabilities are already having a significant impact on the way firms operate, including on the legal function. However, exactly how effective generative AI is when it comes to the law remains largely a matter of speculation and anecdote. Certainly, those of us who have tried each new model have noticed improvements (and sometimes steps backwards), but how good are they, really, at being a lawyer?

Conceptually, the ability of AI tools to quickly identify patterns in large volumes of data and to generate the optimal sentences or phrases would seem highly desirable in a lawyer. However, the limitations of generative AI when faced with a prompt that requests legal advice are also well publicised: AI faces many challenges when it comes to replicating a human lawyer’s judgement, which plays a crucial role in legal practice. Language also operates differently in a legal context than in many other contexts.

A repeatable benchmark is needed to systematically test, compare and track over time developments in generative AI’s ability to answer legal questions. In consultation with Linklaters LLP, Allens has developed the Allens AI Australian Law Benchmark (Allens AI Benchmark) to test the ability of LLMs to answer legal questions under Australian law. We tested general-purpose implementations of LLMs that were market-leading at the time of the test (February 2024), approximating how a lay user might try to answer legal questions using AI instead of a human lawyer.

Key findings

The models we tested should not be used for legal advice on Australian law without expert human supervision. There are real risks in using them if you don’t already know the answer.
The strongest overall performer was GPT-4, followed by Perplexity. LLaMa 2, Claude 2 and Gemini 1 tested relatively similarly.
In 2024, even the best-performing LLMs we tested were not consistently reliable when asked to answer legal questions. While these LLMs could have a practical role in assisting legal practitioners to summarise relatively well-understood areas of law, inconsistencies in performance mean these outputs still need careful review by someone able to verify they are accurate and correct.
For tasks that involve critical reasoning, none of the tools we tested (being publicly available chatbots implementing GPT-4, Gemini 1, Claude 2, Perplexity and LLaMa 2) can be relied on to produce correct legal advice without expert human supervision. The LLMs we tested frequently produced answers that got the law wrong and/or missed the point of the question, while expressing their answers with falsely inflated confidence. There are, therefore, real risks to using these tools to generate legal advice if you don’t already know the answer.
Poor citation remains a major problem for many of the models. For example, some tools demonstrated:
- an inability to choose authoritative legal sources (cases, legislation or authoritative texts) over unauthoritative ones (such as a law firm publication);
- a tendency to manufacture (hallucinate) case names;
- a tendency to name a correct source but attribute a fictional extract or choose an incorrect pinpoint citation; or
- a tendency only to cite a whole piece of legislation without specifying a section reference.
‘Infection’ by legal analysis from larger jurisdictions with different laws is a significant problem for smaller jurisdictions like Australia. In particular, despite being asked to answer from an Australian law perspective, many of the responses cited authorities from UK and EU law, or incorporated UK and EU law analysis that is not correct for Australian law.
Legal teams within any business considering the use of generative AI technologies should ensure they have safeguards in place that govern how the output can be used. In the legal context, AI outputs need careful review by someone able to verify they are accurate and correct, and do not contain irrelevant or fictitious citations.
Even if (and when) LLMs achieve or surpass parity with the benchmark, the role and importance of the human lawyer will comfortably endure. The ability to answer questions of law in a succinct and correct manner is but a fraction of what is required in the daily travail of an Australian lawyer, whose role today is more akin to a strategic adviser.

Who in your organisation needs to know about this?

Legal leaders and teams, IT personnel, innovation and procurement teams.

The testing: methodology and review

The Allens AI Benchmark is an extension of the LinksAI English Law Benchmark. It comprises 30 questions relevant to 10 different practice areas. The questions would ordinarily require advice from a competent, mid-level lawyer specialised in that practice area. The intention was to test whether the AI models can reasonably replicate certain tasks carried out by a human lawyer.

While our question set has some questions in common with the LinksAI English Law Benchmark, others are designed to test issues unique to the Australian law context.

We tested the question set against five different models, being GPT-4, Gemini 1, Claude 2, Perplexity and LLaMa 2. We used general-purpose implementations of these LLMs, which are not specially trained or fine-tuned to provide legal advice. Our methodology therefore approximates how a lay user might attempt to carry out tasks using AI instead of a human lawyer.

We put each of the 30 questions to each model three times, starting the session anew each time. LLMs use probabilistic algorithms to assemble their written output, so the same question can produce different answers. Repeating each question helps control for this variability (we saw instances where the same model’s answers differed significantly each time a question was asked).
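As a rough illustration of the scale of this protocol, the sketch below enumerates one fresh session per model, question and attempt. The model names and counts come from the text above; the function name and the tuple layout are hypothetical, and this is not the tooling actually used for the benchmark.

```python
from itertools import product

# Models and counts as described in the methodology.
MODELS = ["GPT-4", "Gemini 1", "Claude 2", "Perplexity", "LLaMa 2"]
NUM_QUESTIONS = 30
REPEATS = 3


def build_run_plan(models=MODELS, num_questions=NUM_QUESTIONS, repeats=REPEATS):
    """Return one (model, question_number, attempt) tuple per fresh session.

    Hypothetical helper: each tuple stands for a new session in which the
    question is put to the model from scratch.
    """
    questions = range(1, num_questions + 1)
    attempts = range(1, repeats + 1)
    return list(product(models, questions, attempts))


plan = build_run_plan()
print(len(plan))  # 5 models x 30 questions x 3 attempts = 450 answers to mark
```

On these figures, the markers had 450 answers to assess, 90 per model.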

The answers were marked by senior lawyers from each practice area. Each answer was given a mark out of 10 comprising:

- 5 marks for substance (is the answer correct?);
- 3 marks for citations (is the answer supported by relevant statute, case law or regulations?); and
- 2 marks for clarity.
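The marking scheme above can be expressed as a small data structure. This is a minimal sketch only: the `Mark` class and `summarise` function are hypothetical names, the sample marks are invented for illustration, and averaging across the three attempts is an assumption, since the article does not say how repeated attempts were aggregated.

```python
from dataclasses import dataclass
from statistics import mean

# Maximum marks per criterion, as stated in the marking scheme.
MAX_MARKS = {"substance": 5, "citations": 3, "clarity": 2}


@dataclass(frozen=True)
class Mark:
    """One marker's score for a single answer (hypothetical structure)."""
    substance: int
    citations: int
    clarity: int

    def __post_init__(self):
        # Reject scores outside the range allowed for each criterion.
        for criterion, cap in MAX_MARKS.items():
            value = getattr(self, criterion)
            if not 0 <= value <= cap:
                raise ValueError(f"{criterion} must be between 0 and {cap}, got {value}")

    @property
    def total(self) -> int:
        """Overall mark out of 10."""
        return self.substance + self.citations + self.clarity


def summarise(attempts):
    """Assumed aggregation: mean total and best-to-worst spread across attempts."""
    totals = [m.total for m in attempts]
    return {"mean": mean(totals), "spread": max(totals) - min(totals)}


# Invented marks for three attempts at one question, for illustration only.
attempts = [Mark(4, 1, 2), Mark(2, 0, 1), Mark(4, 2, 2)]
print(summarise(attempts))
```

Reporting a spread alongside the mean is one way to surface the inconsistency between attempts that the repeated questioning was designed to reveal.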

The strongest overall performer was GPT-4, followed by Perplexity. LLaMa 2, Claude 2 and Gemini 1 tested relatively similarly.

LLMs performing at the level of GPT-4 could have a practical role in assisting legal practitioners to summarise relatively well-understood areas of law. GPT-4 appears capable of, for example, preparing a sensible first draft of the law in some cases. However, inconsistencies in the performance of even the best-performing model mean the draft still needs careful review by someone able to verify it is accurate and correct, and that it does not contain irrelevant or fictitious citations. Many of the models frequently cited unauthoritative sources, hallucinated case names and hallucinated quotes from real sources.

For tasks that involve critical reasoning, even the best-performing LLMs performed poorly. This finding is consistent with the October 2023 LinksAI report. In addition, we found that, as predicted in the October 2023 LinksAI report, the LLMs suffered a greater disadvantage in the context of a smaller jurisdiction like Australia. The responses frequently adopted analysis from larger jurisdictions (especially the EU and the UK) without recognising the difference in the law.

What’s next?

In the short time since we ran our tests, we have already seen major new versions of several of these LLMs released (including Claude 3, released on 4 March 2024, Gemini 1.5, released for public preview on 9 April 2024, LLaMa 3, released on 18 April 2024 and GPT-4o, released on 13 May 2024). We intend to rerun this benchmarking exercise as new LLMs and other AI tools are released onto the market. We anticipate that future editions of this report will make for some interesting reading if the improvement in LLMs continues at its current rate.
Lawyers may be more likely to use one of the tools currently emerging on the market that are being specifically developed to provide legal advice. The last few months have also seen more AI-assisted legal tools come on to the market that are designed to perform tasks other than provide answers to legal questions, expanding the range of possibilities of how AI can assist legal functions with their work.
Working with the aid of such tools, as opposed to being replaced by them, is the likely outcome for lawyers in this rapidly evolving dynamic, where the promise of AI-backed efficiencies should allow lawyers to focus on more strategic issues instead of simple questions of fact.
For detailed analysis, read our full report.
