
In the before times, we were wary of guiding people to legal research with web search engines. Over time, Google and other generic, commercial web search engines have become commonly used tools in legal research. All signs indicate that we need to rethink, perhaps not their place, but their usage in legal research and legal information delivery.

This is not a de-Google post (this is). I do believe that we need to use a greater variety of tools: not all Microsoft, nor all Google, nor all Adobe, nor … you get the picture. There are benefits to tool diversity, but this is especially true when the providers want to harvest your organs (er, private data).

But there is also, currently, an inevitability to using Google for web search. Alphabet’s advertising model drives a substantial amount of web traffic, and despite Bing’s entreaties, it will remain the default web search for most people accessing legal information.

A screenshot of a Bing web search results page.  The search box has the word "google" in it.  Below that is a box that says "promoted by Microsoft" and it's another search box.  Below that is a result that is for
The ultimate “didn’t you mean”: If you search for Google on Bing (people frequently type URLs into web search, don’t judge!), Bing throws up an EXTRA search box above the results to

My relationship with Google these days is on both the front end and the back end. I use Google search periodically, I use some of Google’s apps, and I test my website’s responsiveness in order to create a better user experience, something Google used to count when it was generating search engine result pages (SERPs).

Speed And Other Unknowns Matter

The latter has been an important function for media and product companies that rely on web traffic for revenue. It has overflowed into my thinking about law library websites and so, at each organization I’ve worked at, ensuring our website is as easy to find on Google as possible has always been an operational priority. I do the same work on my personal site, although not because I have any more revenue expectations than my workplace site does.

The names have changed over time, but the easiest way to do this has been with PageSpeed or the newer Google-powered initiative, Lighthouse. You can run a test on PageSpeed’s website with your URL. If you use Google Chrome, you can open the developer tools (usually by hitting F12, if you’ve enabled them) and choose Lighthouse from the tab menu in the now-visible right-hand panel.
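If you want to script these checks rather than use the web UI, the same Lighthouse data is available through Google's public PageSpeed Insights API (v5). Here's a minimal sketch using only the standard library; the endpoint and parameters are the real PSI API, but the `psi_request_url` and `fetch_performance_score` helper names are my own:

```python
# Sketch: querying the PageSpeed Insights API (v5) instead of the web UI.
import json
import urllib.parse
import urllib.request

PSI_ENDPOINT = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"

def psi_request_url(page_url: str, strategy: str = "desktop") -> str:
    """Build the PageSpeed Insights request URL for a page."""
    params = urllib.parse.urlencode({
        "url": page_url,
        "strategy": strategy,        # "desktop" or "mobile"
        "category": "performance",   # also: accessibility, seo, best-practices
    })
    return f"{PSI_ENDPOINT}?{params}"

def fetch_performance_score(page_url: str) -> float:
    """Fetch the Lighthouse performance score (0-1); requires network access."""
    with urllib.request.urlopen(psi_request_url(page_url)) as resp:
        report = json.load(resp)
    return report["lighthouseResult"]["categories"]["performance"]["score"]

print(psi_request_url("https://example.com"))
```

Running this periodically against your law library's home page, and logging the scores, is one way to notice a regression before a Google algorithm update punishes it.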

A screenshot of this website’s home page. To the left, Google Chrome’s developer tools panel is visible and the Lighthouse report panel has been selected.

If you try both tools, you’ll see that there’s overlap. Both of them check performance, website accessibility, “SEO”, and “best practices”. Lighthouse will also audit publisher ads, if you use those. You will also notice that they may return different results.

Here are PageSpeed results (in a Firefox browser) from the day this post was written:

A screenshot of a PageSpeed result for this website on a desktop.  There are green circles across the top with the numbers for Performance (96), Accessibility (90), Best Practices (100), and SEO (100).  Below those are details on the site performance, repeating the Performance green circle, and an image of the home page.

Here are the Lighthouse results (in a Google Chrome browser):

A screenshot of a Lighthouse result for this website on a desktop.  Most of the left side contains an image of this website, the website being tested.  There are green circles in the right side bar with the numbers for Performance (96), Accessibility (90), Best Practices (100), and SEO (100).  There is a greyed out circle with the initials PWA in it.  Below those are details on the site performance, repeating the Performance green circle, and an image of the home page.

As you probably know, Google regularly updates its search software. Each iteration, listed here, causes a flurry of websites wondering how to adapt. Your law library might already be monitoring these, if you are watching your website data, to see if you are experiencing any changes in visitor rates. Can you know if they’re causally related? No, but it’s another data point to consider if you see an anomaly.

The theory is that, if you take some care to curate your website’s content and infrastructure, you will be more likely to be found within the billions of web pages that a web search engine will return. These infrastructure changes can be really deep in the weeds.

For example, a lot of sites try to make their content look better by using fancy fonts. This requires a decision: use a hosted font from a service like Google Fonts or Bunny Fonts? The benefit is that those fonts are delivered by fast servers and are probably already cached on the internet. The alternative is to put the font file on your own web server and load it locally. That may be slower, but it doesn’t rely on a remote server being online. Font loading is a common speed problem, often requiring efforts to load the web page first and then load the fonts. You have probably seen this: a page looks like it’s reloading for no reason, and the content is the same but not quite.

I fiddled with this issue for a long time and eventually just ditched web-based fonts, on either my server or on someone else’s. The reality is that you have fonts on your machine. If you look at the style sheets that control how my content looks, you’ll see it’s just your bog standard computer fonts:

A screenshot of the developer tools on Firefox, showing a font family style that lists common computer fonts, including Arial and Helvetica.  The left side of the image is a cut away of the website being inspected.  The right side is the developer panel, showing other CSS styles above and below a cutout of the highlighted styles.

From a speed perspective, that’s ideal, as your PC doesn’t need to download anything to display the page. And the page doesn’t shift, which can impact accessibility. All positives.
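If you want to audit whether a stylesheet still pulls fonts from someone else’s server, a rough check is to look for `@font-face` rules whose `src` points off-site. This is a quick sketch of my own, not any standard tool, and its regexes are deliberately naive:

```python
# Sketch: flag @font-face rules that load a font from a remote host.
# A system font stack (Arial, Helvetica, ...) has no @font-face at all,
# so it downloads nothing.
import re

FONT_FACE = re.compile(r"@font-face\s*{[^}]*}", re.DOTALL)
REMOTE_URL = re.compile(r"url\(\s*['\"]?(https?:)?//", re.IGNORECASE)

def remote_font_sources(css_text: str) -> list[str]:
    """Return the @font-face blocks that fetch a font from another server."""
    return [block for block in FONT_FACE.findall(css_text)
            if REMOTE_URL.search(block)]

hosted = "@font-face { font-family: Fancy; src: url(https://fonts.example/f.woff2); }"
local = "body { font-family: Arial, Helvetica, sans-serif; }"  # nothing to download
print(len(remote_font_sources(hosted + local)))  # 1 remote font found
```

A real audit would use a proper CSS parser, but even this crude scan surfaces the dependency that makes pages reflow while fonts arrive.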

If your eyes haven’t glazed over this far, then here’s why I explained all of that. Google wants its search results to be “people first” and helpful. They list specific content considerations—including expertise, production value, well-formed content, originality—but also they focus on page experience. That’s where page speeds come into play. People-first means getting web pages to load quickly so the searcher isn’t waiting.

The point is this: lots of people are trying to make their sites appear in Google search because we realize that Google prefers some content over other content. And your students and faculty and self-represented litigants and lawyers will be presented content based on how well we website owners, including law libraries, focus on these criteria.

And what location they are searching from. And what searches they’ve run in the past and what YouTube videos they’ve liked and what Reddit contributions they’ve made, but I digress.

You can see who is doing a better job of this by a simple search I run all the time, on the California legislation for county law libraries. Commercial publishers are first and second, the legislature comes in third but also obfuscates its content. Here are the results on Bing:

A screenshot of a Bing search engine results page. The results for the search on a statute are, in order, from Findlaw, the California Legislature, and Justia

and on Google:

A screenshot of a Google search engine results page. The results for the search on a statute are, in order, from Thomson Reuters Findlaw, Thomson Reuters Casetext, the California Legislature, and Justia

The upside for the Legislature content is that it makes the first page. The rule of thumb, as I’m reminded by SEO “consultants” in daily emails to improve this website, is to be on the first page of search results. The downside is that it’s the least informative. I have no idea if it is the right code section, which all the commercial publishers get right. I expect the Legislature’s site has a low SEO score, even though it is literally the official source for this information. There’s a reason that Thomson Reuters Findlaw, a law firm marketing vehicle, gets the Featured Snippet.

Now let’s really make things complicated.

Web Search Requires More Work

If those “people-first” concepts are subordinated to commercial goals, it is going to take people longer to find relevant information the more complicated their web search is and the greater their need for expertise. And it seems like that’s what we’re seeing with Google. Or, to be more balanced, people-first, like “Don’t be evil”, is no longer the guiding principle it once was.

The Fall and Rise of Search Operators

One post floating about on Mastodon (so, anecdata) suggests that people are recommending adding search operators (!) to de-crapify their search results. In essence, the argument is that artificial intelligence has generated a lot of people-second content, and we roughly know when this content was created. Google is indexing AI-generated content, both from crawled websites for its main search engine and in the AI-generated books it is adding to Google Books.

To manage this issue, people are adding the before:2023 time-limiting search operator on Google. Here’s that same search from above, and the results are a bit different, with a county law library appearing, one publisher dropping out, and the same legislative site surfacing but with a completely different name label and metadata??

A screenshot of a Google search using the before:2023 hack, which returns results for Thomson Reuters Casetext, Thomson Reuters Findlaw, the California Legislature (I think) and the Siskiyou County Public Law Library

So that’s an interesting approach, except in situations that require current information. “But your Honor, the law BEFORE:2023 said this!” This goes hand in hand with the jurisdictional challenge: web search tends to default to localized results based on your IP address, so searches need to incorporate clear jurisdictional terms or search operators like the site: operator.

But I do wonder if we’re going to see increased interest in using the search operators that we have probably all used and advocated for. This retro approach to search may clash with the natural language search that commercial legal publishers, even pre-AI, advocated. I learned while writing this that Bing says that “[o]nly the first 10 terms are used to get search results”! Clearly worded natural language searches are probably going to get truncated. Short web searches were preferable anyway—I always advocated starting with the fewest possible, but most specific, words—but now that preference will be meaningful.
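The practical effect of that 10-term cap is easy to see if you tokenize a long natural-language search yourself. This sketch uses a naive whitespace split, my simplification of whatever Bing actually does:

```python
# Sketch: what a 10-term limit leaves of a natural-language legal search.
def effective_terms(query: str, limit: int = 10) -> list[str]:
    """Return the terms a 10-term cap would actually use."""
    return query.split()[:limit]

natural = ("what is the California statute that says counties must "
           "maintain a public law library for residents")
print(effective_terms(natural))
# the last six words ("a public law library for residents") never reach the engine
```

Notice that the words most likely to be specific and distinguishing often arrive at the end of a conversational sentence, which is exactly the part a term cap discards.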

We may also see people switch to DuckDuckGo or other search engines like Mojeek. A lot of other search engines rely on either the Bing index (DuckDuckGo, Ecosia) or the Google index (StartPage, ProtoPage’s default), though, so the results set may be the same. Especially with Google ingesting AI content and Reddit posts, it will be interesting to see if the degrading of Google search result quality spreads to those other services as well. I think switching is probably not a good solution, as it obfuscates the problem and may make it unnecessarily more complicated than it is.

Deeper Search Pages

It probably means a lot more emphasis on, or at a minimum awareness of, the impact of not using search operators. The results our researchers are looking for are likely no longer going to be on the first page as people-second content proliferates.

That was my takeaway from this post by a game system review site called RetroDodo. I encourage you to read the author and site owner’s perspective on how recent changes to Google’s search product have impacted their business. But this summed up a lot of the problems that I think will be universal:

Firstly, Google wants to completely eradicate users leaving their search results and will now show you their own “From Sources” answers to search terms. These answers are taken without consent from the publisher’s content and are stolen from creators’ work so that Google can give you “their” answer instead, lowering the chances of readers exploring websites… the websites that are paying to create the exact content that Google now shows for themself. It’s straight-up intellectual property theft.

Secondly, if you manage to scroll past Google’s stolen answer, you will undoubtedly be bombarded with sponsored ads, and lots of them. These results are purposely designed to look like normal results and can bend Google’s guidelines when it comes to content quality. This is straight-up pay-to-play, again lowering the need for creators’ articles.

Google Is Killing Retro Dodo & Other Independent Sites, Brandon Saltalamacchia, April 3, 2024

Both search engines are using user interface changes to highlight “cards” or proposed answers, like that Findlaw “featured snippet” above, to the search inquiry posted. I have ad blocking on both mobile browsing (Firefox only) and desktop (everything), so I don’t see a lot of the ads. But the old “first page of Google results” guidance may no longer be relevant if the goal is to find people-first content that is written by experts, not AI, and is relevant to the search for reasons other than that it’s advertising or sponsored content.

Google is putting a lot of effort into mobile search too, which may be invisible to our researchers since we tend to provide desktop research experiences in law libraries. It is also apparently considering putting its AI search behind a paywall, like commercial legal publishers.

This, frankly, may be preferable for legal information searchers. Artificial intelligence is a response tool. The less that people searching for legal information see AI as an answer tool, the better. AI will always give responses and I am happy to have that happening behind paywalls, although I fear that the web search index is still going to be heavily compromised by AI response content promoted in the search results.

This is probably one of those issues where we will have to wait and see to know exactly how to respond. But I am already starting to consider how many additional pages of SERPs I should be paging through. If you look at the chart in that RetroDodo article and the traffic drop they’re experiencing, and if we assume that people are sticking to the “front page only” approach to reading (and the SEO folks are focusing on that too), then it raises the question of whether the front page is a reliable indicator of the “best” search results.

It also raises questions about when you’ve done enough research to know that you’ve gotten the results you need, although, admittedly, web search is essentially a secondary resource index. I think the greater issue will be wasted time, as we can’t rely on web search (just as we can’t rely on AI) to provide pre-validated content or properly prioritized content (like putting the legislature first on a statutory search) based on expertise.