AI's intellectual property battleground: word embeddings

Jay Krall
Sep 21, 2023
Updated: Oct 3, 2023

The Penn Treebank, an early attempt at understanding the statistics of language

Copyright infringement claims against AI builders continue to pile up, most recently with John Grisham, George R.R. Martin and Jonathan Franzen joining Sarah Silverman among prominent authors who have sued OpenAI. These cases raise high-stakes questions about what constitutes language and intellectual property. At the center is a little-understood area of computational linguistics: word embeddings.

A word embedding is a set of numbers that describes a word's relationship to other words in an AI model, based on a set of training documents. Each unique word that appears in the document corpus is assigned a position relative to other words. These statistical "vectors," or relationships, form the basis of how the model creates sentences, by identifying words with similar meanings and words that are commonly used together.
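To make the idea concrete, here is a minimal sketch in Python. It is a deliberate simplification: production models learn dense embeddings through neural network training, whereas this example just counts which words appear near each other. The four-sentence corpus, the function names and the window size are all assumptions made up for illustration.

```python
from collections import Counter
from math import sqrt

# Tiny made-up corpus standing in for a real training set (illustrative assumption).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
    "stocks fell on the news",
]

def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = {}
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            context = vectors.setdefault(word, Counter())
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    context[tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (1.0 = same direction)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

vectors = cooccurrence_vectors(corpus)
cat_dog = cosine(vectors["cat"], vectors["dog"])
cat_stocks = cosine(vectors["cat"], vectors["stocks"])
print(f"cat vs dog:    {cat_dog:.3f}")
print(f"cat vs stocks: {cat_stocks:.3f}")
```

In this toy corpus, "cat" and "dog" score as far more similar than "cat" and "stocks," because they keep the same company: "the," "sat," "on," "chased." Nothing in the vectors records a sentence verbatim; they record only patterns of proximity, which is the kind of "statistical information" at issue in the lawsuits.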

AI models train on massive document sets containing trillions of words to understand common patterns of speech. An auto-generated, textual response from a generative AI tool is perhaps best thought of as an attempt to replicate the statistical patterns of relationships between words that elicit desired responses from people. When people commonly describe generative AI output as long on wit and short on factual substance, word vectors explain why. Generative AI foundation models excel at mimicking the stylistic traits of popular culture, enabling prompts like "Write this ad copy in the tone of a Seinfeld episode." This stylistic accuracy provides much of the wow factor of recent advancements.

Can the ideas in an author's text and the peculiarities of their tone, the qualities which make their work original, be separated from the statistical relationships between the words they use? In OpenAI's recent motion to dismiss Silverman's claim, its lawyers argued that copyright doesn't extend to word vectors. "While an author may register a copyright in her book, the 'statistical information' pertaining to 'word frequencies, syntactic patterns, and thematic markers' in that book are beyond the scope of copyright protection," OpenAI's attorneys wrote.

The idea that word vectors derived from a set of training documents are a creation of the AI builder, and not an inherent property of the copyrighted work, will likely be challenged by the authors. Silverman's memoir is popular because of her unique comedic style, which interweaves taboo topics like sex, religious doubt, psychological trauma and physical disability. Quick transitions between sensitive subject areas are a hallmark of her voice as an author. The question facing courts is, can a statistical representation of that voice be used to imitate it for a commercial purpose without infringing copyright?

OpenAI has also argued that works created by plaintiffs in the current raft of lawsuits it faces represent "less than a millionth" of the overall content in its training set. But this has not proved material in copyright infringement cases surrounding business analytics software products over the past decade. In those cases, publishers have prevailed in infringement claims despite providing a small minority of the content for those businesses.

This much is certain: the question of whether copyrighted works can be freely used to build AI that competes with those works for human attention will shape the future of media in fundamental ways.