
Sources Matter, and Other Research Conundrums in the GenAI Era

Published on 2024-04-23 by Sarah Glassmeyer

Now that the initial rush of excitement over Generative AI and LLMs has passed, those digging into the efficacy and sustainability of these tools are looking at topics such as the provenance of the training data. The copyright status of that material seems to be the primary concern, with lawsuits pending against OpenAI[1] and Nvidia[2], among others. While the lawsuits over LLM training work their way through the courts, we wondered about something a little more clear-cut and perhaps even more basic.

External data has been used in legal technology products for years, long before the rise of LLMs. The legal research and docket analysis verticals are populated with repackaged and repurposed content drawn from a host of government, IGO, and NGO sources. Depending on the type of content and its creator or publisher, the accessibility and usability of that content varies wildly.

You may be surprised to learn that accessibility and usability concerns extend even to primary law. In the United States, at least, all law is assumed to be in the public domain. Impediments to access nevertheless include copyright claims, whether incorrect assertions by governments or the embedding of public domain material in copyrighted works; the format of publication, including whether it is born digital or a print product; and potential corporate control, such as exclusive publishing licenses.

Some of these concerns stem from subtle challenges arising from the way law has traditionally been reported. For example, while the text of a court decision is public, the synopsis and headnotes that accompany it are likely copyrighted, making it difficult for a secondary provider to safely use technology to scrape the law from a court reporter (see graphic below, where this is explored further).

These impediments to access have stopped some legal tech creators from obtaining copies of a corpus to use to build their products. However, not all of them. Primary law content has been obtained by various means, with varying levels of compliance with copyright law and terms of service. Some vendors have been willing to sell their data to potential competitors. Some companies have paid to have books rekeyed by overseas contractors. Some have run OCR on digitized print sources. Some have used data-scraping techniques on both locked and open content.

 

Why does this matter, and why should you care?

Of course, all law is “public” – or should be (whether or not it is remains a topic for another day). Primary sources such as caselaw, statutes, and regulations are what constitute the law. In this, the large research conglomerates are no different from smaller or newer players (except that they have been gathering this data for longer than most other providers).

There are, however, significant differences in the processing and surfacing of public data, which give rise to notable variations in the quality and reliability of a research solution.

  1. For one, not all digitization efforts are equal, especially early efforts at OCR. Quality issues may reveal themselves to customers through initial internal product testing and use, but they may not, especially if the error is not one that can be viewed externally by the end user, as opposed to a training issue. Providers with longevity and stability have developed trust among professionals who know that their digitization efforts are sound and their content collections are comprehensive, which is why they are able to charge a premium for their licensed content.
  2. The quality issue is heightened in the age of AI. Research tools layering generative AI on top of public data sources have many decisions to make around digitization. Results will vary depending on the data structures and data sources they use; the choices made around vectorization, parsing, and retrieval-augmented generation (RAG); whether they have done the work of creating a chain of thought for answers and responses; which model they use; and so on. The impact of poor work here is potentially severe, with wrong answers to research questions likely. One recent start-up, for example, uses a foundation model to make generative AI summaries of caselaw available to users, promising easier case search and better access to information for consumers and legal professionals alike. Those summaries have not, however, been verified by lawyers (or anyone else) against the text of the cases, so the onus is on users to determine whether a summary is accurate, leaving room for all kinds of errors.
  3. Credibility of a vendor is critical. If a company says they have all the law from a particular jurisdiction but they only have it for certain courts, or if they don’t have sound processes in place to ensure maintenance and upkeep as the law evolves and new decisions are handed down, your lawyers could be missing key caselaw when advising clients. When large law firms insist on research providers of sufficient stability that leverage or license content from known sources, they’re doing so because they need to know their lawyers can rely upon the information they’re getting from their providers. Well-established providers have processes in place to ensure that law that is passed yesterday has made it onto the platform by today.
  4. The design, UI, and architecture of a platform play a large part in the effectiveness of a research solution. The ability to see a chain of decisions as the law evolves on a particular issue, to see at a glance whether a decision has been overturned, or to know whether the jurisdiction and court it comes from make it merely influential or legally binding – these are all complexities that require sophistication and legal understanding on the part of a provider. The search tools used across bodies of law (keyword, semantic, neural or conceptual search, or some combination) and the way search results are surfaced to users can also make a significant difference to the usefulness of a research product.
  5. Finally, while large and well-funded companies such as OpenAI can survive a potentially long and drawn-out litigation fight over their content, not all companies can. The sustainability of a company may very well rest on the origin of the content it provides. We have already seen this in the legal world with Ross Intelligence.
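The retrieval-and-generation pipeline described in point 2 can be sketched in a few lines. This is a hypothetical illustration, not any vendor's actual implementation: a toy keyword-overlap scorer stands in for real vectorization and semantic search, the case texts and citations are invented, and the model call itself is omitted so the sketch stays self-contained.

```python
# Minimal sketch of a retrieval-augmented generation (RAG) flow for legal
# research. Hypothetical: keyword overlap stands in for vector similarity,
# and the LLM call is stubbed out. Case names and texts are invented.

def tokenize(text):
    return set(text.lower().split())

def retrieve(query, corpus, k=2):
    """Rank documents by token overlap with the query (toy retriever)."""
    q = tokenize(query)
    return sorted(corpus,
                  key=lambda doc: len(q & tokenize(doc["text"])),
                  reverse=True)[:k]

def build_prompt(query, passages):
    """Ground the model's answer in retrieved passages, with citations."""
    context = "\n".join(f"[{p['cite']}] {p['text']}" for p in passages)
    return ("Answer using only the passages below and cite them.\n\n"
            f"{context}\n\nQuestion: {query}")

corpus = [
    {"cite": "Smith v. Jones (2019)",
     "text": "The court held that the statute of limitations tolls during incapacity."},
    {"cite": "Doe v. Roe (2021)",
     "text": "Summary judgment requires no genuine dispute of material fact."},
]

query = "When does the statute of limitations toll?"
passages = retrieve(query, corpus)
prompt = build_prompt(query, passages)  # this string would be sent to the model
```

Every choice in even this toy version – how text is tokenized, how passages are ranked, how the prompt constrains the model to cite its sources – is a point where a vendor's decisions shape the reliability of the answers users see.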

 

The Law should be free

We are advocates for public access to legal information. The historical reliance on monolithic research providers has been problematic for consumers: those providers had the deep pockets to scrape court websites, digitize the law, and build sophisticated delivery systems, and then made them available only at high cost. The gatekeeping around dockets has been similarly problematic. On the bright side, however, generative AI and cutting-edge technology have made it easier for younger companies to enter the research game and create compelling products. And, thanks to a recent project, they can now access some primary law content in the United States.

The Harvard Library Innovation Lab, in partnership with Ravel Law, embarked on a project to digitize all of the caselaw reporters held by the Harvard Law School Library (https://case.law/). This project was more than just scanning books. After scanning and OCRing the content, they needed to remove the copyrighted content such as case synopses, key numbers, and headnotes. See images below.

     

[Images: an open case reporter volume, and a close-up of a reporter page showing the editorial matter that had to be removed]

 

Between the physical labor of scanning, solving this intermixed-content problem, and creating QA workflows, it took them about five years to complete the project. But complete it they did, and after a period of embargo, over 350 years of U.S. caselaw – about 6.7 million cases from all jurisdictions and courts – is now available for anyone to use for any purpose, including commercial use. The Free Law Project (https://free.law/) and its research engine CourtListener have already incorporated the content into their services. This marks a significant step forward for access to justice. Vendors leveraging this content in new research solutions can know that issues such as copyright have been dealt with appropriately in their source data.
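Bulk caselaw releases of this kind are typically distributed as one record per line. As a rough sketch of how a builder might start working with such data: the JSON Lines layout and the field names used below ("name", "decision_date", "court") are assumptions for illustration only – check the schema documentation published alongside the actual bulk files before relying on any of them.

```python
import json

# Sketch of loading bulk caselaw records for reuse. The JSON Lines layout
# and the field names ("name", "decision_date", "court") are assumed for
# illustration; consult the real bulk-data schema before building on this.

sample_jsonl = "\n".join([
    json.dumps({"name": "Smith v. Jones", "decision_date": "1905-06-01", "court": "Mass."}),
    json.dumps({"name": "Doe v. Roe", "decision_date": "1923-02-14", "court": "Cal."}),
])

def load_cases(jsonl_text):
    """Parse one case record per line, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

cases = load_cases(sample_jsonl)
# ISO-formatted dates sort correctly as strings, so min() finds the earliest.
earliest = min(cases, key=lambda c: c["decision_date"])
```

The point of the sketch: because the corpus is openly licensed, this kind of ingestion can happen without the copyright and terms-of-service questions that shadow scraped or rekeyed content.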

 

Ensuring your research solutions are trustworthy and stable

One of our goals at Legaltech Hub is to make the legal tech acquisition process smoother for all parties involved – the customer as well as the legal tech vendor. Just over a week ago we issued a poll to determine whether provenance of sources of law was a key factor in selecting or utilizing a solution. Through this poll we also wanted to determine if customers were even aware of the issues outlined above, or if there were changes to sales conversations (or product!) that vendors needed to consider making.

Results of the poll are below.

 

[Image: screenshot of the poll results]

Perhaps not surprisingly, disclosure of sources is most important to users. Transparency not just around the sources used but also around the methods for processing and digitizing them is a requirement vendors need to take seriously. Only 10% responded that they were restricted to using licensed content from trusted third-party providers, and, interestingly, 14% responded that, if a vendor uses sources of public law, they prefer it to use the Caselaw Access Project data.

When reviewing a legal research tool, it has long been standard operating procedure to ask “What is the coverage of your service?” – meaning “What jurisdictions are covered? What is the date range? Which courts are covered?” Now, in light of the LLM training-data lawsuits, the availability of unencumbered data from Harvard, and the explosion of legal research and research-adjacent legal technology tools, especially those that incorporate generative AI or LLM functionality, these questions are even more critical. What follows is a checklist of some of the questions you should be asking when evaluating a new research tool:

  1. What is the coverage of your service (jurisdictions, date range, courts)?
  2. What are the sources of law used for your service?
  3. Do you provide access to unreported cases?
  4. What processes are in place for maintaining currency of the data? With what frequency is the content updated? (If a decision is handed down in the morning, how long before it is reflected in the platform?)
  5. How is access to the source content obtained?
  6. How has copyright been dealt with in accessing and processing the data?
  7. Is generative AI used in your platform?
  8. If so,
    1. Which model(s) does the platform leverage?
    2. What features are powered by generative AI?
    3. If any of the core research components in the platform are developed by generative AI (for example, development of summaries, answers to research questions), please provide information about how these are verified or developed to ensure reliability.
    4. Are any user queries or other user provided content or interactions used to train the product?
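Some of these questions – notably coverage (item 1) and currency (item 4) – can also be spot-checked against a vendor's data export rather than taken on faith. The sketch below is purely illustrative: the record format is invented, and a real audit would compare the summary against an authoritative list of courts and recent decisions.

```python
from datetime import date

# Hypothetical spot-check of a vendor's coverage and currency (checklist
# items 1 and 4). The record format is invented for illustration.

def coverage_report(records):
    """Summarize, per jurisdiction, the case count and latest decision date."""
    report = {}
    for r in records:
        count, latest = report.get(r["jurisdiction"], (0, date.min))
        report[r["jurisdiction"]] = (count + 1, max(latest, r["decided"]))
    return report

records = [
    {"jurisdiction": "NY", "decided": date(2024, 4, 20)},
    {"jurisdiction": "NY", "decided": date(2024, 4, 22)},
    {"jurisdiction": "CA", "decided": date(2023, 12, 1)},
]

report = coverage_report(records)
# Flag jurisdictions whose newest decision is more than 30 days old,
# measured against a fixed reference date for reproducibility.
today = date(2024, 4, 23)
stale = [j for j, (_, latest) in report.items() if (today - latest).days > 30]
```

A gap surfaced this way – a missing court, or a jurisdiction whose latest decision is months old – is exactly the kind of evidence worth raising in a sales conversation.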

This is not a comprehensive list, but a reminder to buyers that, like much other legal technology, many research products now require increased scrutiny. It is a reminder also to vendors that, in this new era, transparency around sources and digitization methods matters more than ever.

 

- by Sarah Glassmeyer and Nikki Shaver

About the Author

Sarah Glassmeyer

Senior Solutions Analyst, Legaltech Hub

Sarah has been solving problems across the legal industry for over 10 years, and believes that:


1. The best solutions come from community-based effort
2. Standards are the operating system of effective collaboration
3. Education is the first step in bringing people together into the community of problem solvers
