Debbie Ginsberg, Guest Blogger

Benchmarking should be simple, right? Come up with a set of criteria, run some tests, and compare the answers. But how do you benchmark a moving target like generative AI?

Over the past months, I’ve tested a sample legal question in various commercial LLMs (like ChatGPT and Google Gemini) and RAGs (like Lexis Protégé and Westlaw CoCounsel) to compare how each handled the issues raised. Almost every time I created a sample set of model answers to write about, the technology would change drastically within a few days. My set became outdated before I could start my analysis. While this became a good reason to procrastinate, I still wanted to show something for my work.

As we tell our 1Ls, sometimes you need to work with what you have and just write.

The model question

In May, I asked several LLMs and RAGs this question (see the list below for which ones I tested):

Under current U.S. copyright law (caselaw, statutes, regulations, agency information), to what extent are fonts and typefaces protectable as intellectual property? Please focus on the distinction between protection for font software versus typeface designs. What are the key limitations on such protection as established by statute and case law? Specifically, if a font has been created by proprietary software, or if a font has been hand-designed to include artistic elements (e.g., “A” incorporates a detailed drawing of an apple into its design), is the font entitled to copyright protection?

I chose this question because the answer isn’t facially obvious – it straddles the line between “typeface isn’t copyrightable” and “art and software are copyrightable”.  To answer the question effectively, the models would need to address that nuance in some form.
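
For this post I ran the question through each product’s own interface rather than through code, but the same exercise can be set up programmatically. Below is a minimal sketch, assuming a hypothetical query_model() helper that stands in for whichever SDK or web interface each product actually exposes (the helper and its behavior are illustrative, not any vendor’s API):

    # Minimal sketch of sending the same model question to several products.
    # query_model() is a hypothetical placeholder -- swap in each vendor's SDK
    # call, or simply paste the prompt into the product's web interface.

    PROMPT = (
        "Under current U.S. copyright law (caselaw, statutes, regulations, agency "
        "information), to what extent are fonts and typefaces protectable as "
        "intellectual property? ..."  # abbreviated; the full text appears above
    )

    MODELS = [
        "Lexis Protégé",
        "Westlaw CoCounsel",
        "ChatGPT o3 deep research",
        "Gemini 2.5 deep research",
        "Perplexity research",
        "DeepSeek R1",
        "Claude 3.7",
    ]

    def query_model(model_name: str, prompt: str) -> str:
        """Hypothetical stand-in: replace with a real API call or a manual run."""
        return f"(paste {model_name}'s answer here)"

    # Collect one answer per product so they can be compared side by side.
    responses = {name: query_model(name, PROMPT) for name in MODELS}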

The model benchmarks

The next issue was how to compare the models. In my first runs, the answers varied so wildly that it was hard to compare them at all; lately, the answers have been more similar, which let me develop a set of criteria. So for the May set, I benchmarked (or at least checked) the following (a rough sketch of this rubric in code appears after the lists below):

  • Did the AI answer the question that I asked?
  • Was the answer thorough (did it more or less match my model answer)?
  • Did the AI cite the most important cases and sources noted in my model answer?
  • Were any additional citations the AI included at least facially relevant?
  • Did the model refrain from providing irrelevant or false information?

I did not benchmark:

  • Speed (we already know the reasoning models can be slow)
  • If the citations were wrong in a non-obvious way 
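
Here is that minimal sketch of the rubric, recorded as a simple yes/no checklist per answer. The criterion names are my own informal shorthand for the bullets above, not a scoring scheme from any of the tools:

    # Minimal sketch of the benchmarking rubric as a yes/no checklist.
    # Criterion labels are informal shorthand for the bullets above.

    CRITERIA = [
        "answered_the_question_asked",
        "thorough_vs_model_answer",
        "cited_key_cases_and_sources",
        "extra_citations_facially_relevant",
        "no_irrelevant_or_false_information",
    ]

    def summarize(checklist: dict) -> str:
        """Report how many of the five criteria a single answer met."""
        met = sum(1 for c in CRITERIA if checklist.get(c))
        return f"{met} of {len(CRITERIA)} criteria met"

    # Example: a hypothetical answer that missed the key citations.
    example = dict.fromkeys(CRITERIA, True)
    example["cited_key_cases_and_sources"] = False
    print(summarize(example))  # -> 4 of 5 criteria met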

The model answer and sources

According to my model answer, the best answers to the question should include at least the following:

  • Font software: Font software that creates fonts is protected by copyright.  The main exception is software that essentially executes a font or font file, meaning the software is utilitarian rather than creative.
  • Typefaces/Fonts: Neither of these is protected by copyright law.  Fonts and typefaces may have artistic elements that are protected by copyright law, but only the artistic elements are protected, not the typefaces or fonts themselves.
  • The answer should include at least some discussion as to whether a heavily artistic font qualifies for protection.

Bonus if the answer addressed:

  • Separability: If the art can be separated from the typeface/font, it’s copyrightable.
  • Alternatives: Can the font/typeface be protected by other IP protections such as licensing, patents, or trademarks?
  • International implications: Would we expect to see the same results in other jurisdictions?

In answering this question, I expected the LLMs and RAGs to cite:

  • The copyright statute
  • The copyright regulations
  • Adobe
  • Laatz
  • Shake Shack
  • The Copyright Compendium

Benchmarking with the AI models

For this post, I ran my model question in the following LLMs/RAGs:

  • Lexis Protégé (work account)
  • Westlaw CoCounsel (work account)
  • ChatGPT o3 deep research (work account)
  • Gemini 2.5 deep research (personal paid account)
  • Perplexity research (personal paid account)
  • DeepSeek R1 (personal free account)
  • Claude 3.7 (personal paid account)

I’ve set up accounts in several commercial GenAI products. Some are free, some are Pro, and Harvard pays for my ChatGPT Enterprise account. As an academic librarian, I have access to CoCounsel and Protégé.

The individual responses are included in the appendix.

I didn’t have access to Vincent or Paxton at the time, and I didn’t have ChatGPT o3 Pro either. Later in June, Nick Halperin ran my model question in Vincent and Paxton, and I ran it in o3 Pro. Those examples, as well as GPT-5, will be included in the appendix but are not discussed here.

Benchmarking the results

In parsing the results, I found that most answers were fairly similar, with some exceptions:

Source | Font software copyrightable | Typefaces/fonts not copyrightable | Exceptions to font-software copyright | Art in typefaces/fonts copyrightable
Lexis Protégé | Yes | Yes | Yes | No
Westlaw CoCounsel | Yes | Yes | No | Yes
ChatGPT o3 deep research | Yes | Yes | Yes | Yes
Gemini 2.5 deep research | Yes | Yes | Yes | Yes
Perplexity research | Yes | Yes | Yes | Yes
DeepSeek R1 | Yes | Yes | Yes | Yes
Claude 3.7 | Yes | Yes | Yes | Yes

  • Font software is copyrightable: in all answers 
  • Typefaces/fonts are not copyrightable: in all answers
  • Exceptions to font software copyright: in all answers except Westlaw
  • Art in typefaces/fonts is copyrightable: in all answers except Lexis
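
Those summary bullets can be derived mechanically from the table. Here is a minimal sketch, with the yes/no values transcribed by hand from the table above (the dictionary keys are my own shorthand for the column headings):

    # Minimal sketch: turning the first comparison table into
    # "in all answers except ..." summaries. Values transcribed from the table.

    RESULTS = {
        "Lexis Protégé":            {"font_software": True, "typefaces_not": True, "exceptions": True,  "art_in_fonts": False},
        "Westlaw CoCounsel":        {"font_software": True, "typefaces_not": True, "exceptions": False, "art_in_fonts": True},
        "ChatGPT o3 deep research": {"font_software": True, "typefaces_not": True, "exceptions": True,  "art_in_fonts": True},
        "Gemini 2.5 deep research": {"font_software": True, "typefaces_not": True, "exceptions": True,  "art_in_fonts": True},
        "Perplexity research":      {"font_software": True, "typefaces_not": True, "exceptions": True,  "art_in_fonts": True},
        "DeepSeek R1":              {"font_software": True, "typefaces_not": True, "exceptions": True,  "art_in_fonts": True},
        "Claude 3.7":               {"font_software": True, "typefaces_not": True, "exceptions": True,  "art_in_fonts": True},
    }

    for criterion in ("font_software", "typefaces_not", "exceptions", "art_in_fonts"):
        missed = [model for model, row in RESULTS.items() if not row[criterion]]
        note = "in all answers" if not missed else "in all answers except " + ", ".join(missed)
        print(f"{criterion}: {note}")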

Several answers included additional helpful information:

Source | Separability | Copyright Office policies | Alternatives | Licensing | Int’l | Recent developments | State law
Lexis Protégé | Yes | No | No | No | No | No | No
Westlaw CoCounsel | No | No | No | No | No | No | Yes
ChatGPT o3 deep research | Yes | Yes | Yes | Yes | Yes | Yes | No
Gemini 2.5 deep research | Yes | Yes | Yes | Yes | No | No | No
Perplexity research | Yes | No | Yes | No | No | No | No
DeepSeek R1 | Yes | No | Yes | No | No | No | No
Claude 3.7 | No | No | Yes | Yes | Yes | No | No

  • Discussions about separability: Gemini, ChatGPT, Deep Seek (to some extent), Perplexity, Lexis
  • Specific discussions about Copyright Office policies: Gemini, ChatGPT
  • Discussions about alternatives to copyright (e.g., patent, trademark): Gemini, Claude, ChatGPT, Deep Seek, Perplexity
  • Specific discussions about licensing: Gemini, Claude, ChatGPT
  • International considerations: Claude, ChatGPT
  • Recent developments: ChatGPT
  • State law: Westlaw

The models were somewhat consistent about what they cited:

LLM/RAG | Copyright statute | Copyright regs | Adobe | Laatz | Shake Shack | The Copyright Compendium
Lexis Protégé | Yes | Yes | Yes | Yes | No | No
Westlaw CoCounsel | Yes | Yes | Yes | Yes | Yes | No
ChatGPT o3 deep research | Yes | Yes | Yes | No | No | Yes
Gemini 2.5 deep research | Yes | Yes | Yes | Yes | No | Yes
Perplexity research | No | Yes | No | No | No | Yes
DeepSeek R1 | Yes | Yes | Yes | No | No | No
Claude 3.7 | No | Yes | Yes | No | No | No

  • The Copyright statute: Lexis, Westlaw, Deep Seek, Chat GPT, Gemini
  • Copyright regs: cited by all
  • Adobe: Lexis, Westlaw, Claude, Deep Seek, Chat GPT, Gemini
  • Laatz: Lexis, Westlaw, Gemini
  • Shake Shack: Westlaw
  • The Copyright Compendium: Perplexity, Chat GPT, Gemini; Lexis cited to Nimmer for the same discussion

The models also included additional resources not on my list:

LLM/RAG | Blogs etc. | Restatement | Eltra | Law review | Articles about loans | LibGuides
Lexis Protégé | Yes | Yes | Yes | No | No | No
Westlaw CoCounsel | Yes | No | No | Yes | Yes | No
ChatGPT o3 deep research | Yes | No | Yes | No | No | No
Gemini 2.5 deep research | Yes | No | Yes | No | No | Yes
Perplexity research | No | No | No | No | No | No
DeepSeek R1 | No | No | Yes | No | No | No
Claude 3.7 | Yes | No | Yes | No | No | No

  • Blogs, websites, news articles: The commercial LLMs.  Gemini found the most, but it’s Google.
  • Restatement: Lexis
  • Eltra Corp. v. Ringer, 1976 U.S. Dist. LEXIS 12611: Lexis, Claude, Deep Seek, Chat GPT, Gemini (it’s not a bad case, but not my favorite for this problem)
  • An actual law review article: Westlaw
  • Higher interest rate consumer loans may snag lenders: Westlaw (not sure why)
  • LibGuides: Gemini
  • Included a handy table: ChatGPT, Gemini

The answers varied in depth of discussion and number of sources:

  • Lexis: 1 page of text, 1 page of sources (I didn’t count the sources in the tabs)
  • Westlaw: 2.5 pages of formatted text, 17 pages of sources
  • ChatGPT: 8 pages of well-formatted text, 1 page of sources
  • Gemini: 6.5 pages of well-formatted text, 1 page of sources
  • Perplexity: A little more than 4 pages of text, about 1 page of sources
  • Deep Seek: a little more than 2 pages of weirdly formatted text, no separate sources
  • Claude: 2.5 pages of well-formatted text, no separate sources

Hallucinations

  • I didn’t find any sources that were completely made up
  • I didn’t find any obvious errors in the written text, though some sources made more sense than others
  • I did not thoroughly examine every source in every list (that would require more time than I’ve already devoted to this blog post). 

Some random concluding thoughts about benchmarking

When I was running these searches, I was sometimes frustrated with the Westlaw and Lexis AI research tools. Not only do they fail to describe exactly what they are searching, they also don’t necessarily capture critical primary sources in their answers (we can get a general idea of the sources used, but not as granular as I’d like). For example, the Copyright Compendium includes one of the more relevant discussions about artistic elements in fonts and typefaces, but that discussion isn’t captured in the RAGs.  To be sure, Lexis did find a similar discussion in Nimmer; Westlaw didn’t find anything comparable, although it did cite secondary sources.

In general, the responses provided by all of the generative AI platforms were correct, but some were more complete than others.  For the most part, the commercial reasoning models (particularly ChatGPT and Gemini) provided more detailed and structured answers than the others.  They also provided responses using formatting designed to make the answers easy to read (Westlaw did as well).

None of the models appeared to consider that recency would be a significant factor in this problem.  Several cited a case from the 70s that didn’t concern fonts.  Several failed to cite Laatz, a recent case that’s on point.  Lexis and Westlaw, of course, cited to authoritative secondary sources (and even a law review article in Westlaw’s case).  The LLMs were less concerned with citing to authority.  In all cases, I would have preferred a more curated set of resources than the platforms provided. 

Finally, none of the platforms included visual elements in what is inherently a visual question. It would have been nice to see some examples of “this is probably copyrightable and this is not” (not that I directly asked for them). 
