In the summer of 2002, I co-wrote an article in the Wall Street Journal about one of the Internet's first conspiracy theories: that Jose Padilla, an American citizen arrested that year for joining al Qaeda in Afghanistan, had been a conspirator in the 1995 Oklahoma City bombing.
Conservative radio host Glenn Beck had spread the theory on his nationally syndicated radio show. Internet sleuths across the country, who were not yet called bloggers, posted tidbits and innuendo on their web pages, noting that a woman who shared Padilla's surname had once been married to convicted Oklahoma City accomplice Terry Nichols. In an early example of national media chasing Internet rumors, I called her. She confirmed the matching surnames were a coincidence. Then, I started making calls to these early conspiracy bloggers, who at the time generally used their real names and could be found in telephone directories. "I'm a journalist," one of the bloggers told me. "The only difference is I publish all my notes as I take them."
This concept struck me as novel, though risky. An open-book approach to journalism would inevitably lead to misleading claims being published before they could be verified. One veteran newspaper editor with whom I discussed the idea said, "Publishing your notes in real time? That's the most irresponsible thing I've ever heard."
Over the next 20 years, the pressure to monetize online attention would push news organizations to adopt more of the blogging ethos than they could have imagined. Forced to cater to increasingly polarized audiences, news outlets are now expected to wear their values on their sleeves in a manner that late-20th-century journalists, schooled in a mindset of objectivity, would dread. But in a way, the current media environment is also a little more honest than the old one. Newspapers have always catered to particular political leanings and economic classes; they simply did so more subtly in the past. As media blended with digital commerce and both became consumed with search engine optimization, what was once thought of as "news bias" came to be seen simply as audience-targeting best practice.
Now, we find ourselves in a media environment where divisions are clearly defined by the outlets we subscribe to. That's often cast as a source of societal division, but at least it's happening in plain sight on the public Internet, where thousands of researchers can study it. By contrast, the generative AI movement, on its current trajectory, promises to shroud the biases inherent to its products in training-data secrecy.
Companies building next-generation virtual assistants are in a "race for intimacy", a widely used phrase sometimes attributed to Zillow CEO Rich Barton. This creates two problems for users of generative AI:
1) Large language models (LLMs) underlying generative AI are incentivized to cater to your tastes and beliefs, in order to convince you to spend more time with them.
2) In the current global regulatory environment, LLMs are not compelled to share any information about the websites they use for training data, a content selection process that deeply informs the perspectives behind the responses an LLM tool provides.
The first problem is best addressed through consumer education and open product testing. If all LLMs are biased in some way, the best you can do is choose your tools based on high-quality information about their accuracy and biases. A leading example is the TruthfulQA benchmark, which measures how truthfully LLMs answer questions designed to elicit common misconceptions.
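For readers who want to run that kind of spot check themselves, here is a minimal sketch of what it might look like, assuming the Hugging Face datasets library and the publicly hosted truthful_qa benchmark data; query_model is a hypothetical placeholder for whichever LLM you are evaluating.

```python
# Minimal sketch: spot-checking an LLM against TruthfulQA-style questions.
# Assumes the Hugging Face `datasets` package and the public `truthful_qa`
# dataset; `query_model` is a hypothetical stand-in for the model under test.
from datasets import load_dataset

def query_model(prompt: str) -> str:
    """Placeholder: call whichever LLM you are evaluating and return its answer."""
    raise NotImplementedError

def spot_check(n: int = 10) -> None:
    rows = load_dataset("truthful_qa", "generation", split="validation")
    for row in rows.select(range(n)):
        answer = query_model(row["question"])
        print("Q:", row["question"])
        print("Model:", answer)
        print("Reference best answer:", row["best_answer"], "\n")

if __name__ == "__main__":
    spot_check()
```

Published leaderboards automate this kind of comparison at scale; the point here is simply that the questions and reference answers are open for anyone to inspect.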
The second problem seems more suited to resolution through industry partnerships, in advance of forthcoming regulation. In June, the European Parliament approved a draft of the European Union AI Act, setting off a round of negotiations on the final text. Though the law isn't expected to be implemented before 2025, its current draft would require generative AI providers to publicly disclose details of copyrighted works used in training. In the US, similar levels of training-data transparency are sought in lawsuits brought by book authors against OpenAI. These will be landmark cases of 21st-century copyright law. In the meantime, let's briefly examine the current state of news data usage disclosure amongst popular LLMs.
OpenAI, Anthropic (Claude), Meta (Llama 2) and Alibaba (Qwen) do not disclose in their technical documentation any lists of Web domains from which training data was gleaned. Amongst top LLM foundation models, the TIIUAE Falcon project gets the best marks for training-data openness, merely for disclosing that it uses the commoncrawl.org repository of public Internet data, which includes tens of thousands of news sites. CommonCrawl's terms of use require commercial users of the repository to procure their own content licenses.
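For a content owner wondering whether its own pages sit in that repository, CommonCrawl exposes a public index that can be queried directly. Here is a rough sketch of such a lookup; the crawl label used below is only an example, and a current one would need to be substituted from index.commoncrawl.org.

```python
# Rough sketch: checking whether a domain appears in a CommonCrawl crawl
# via the public CDX index API. The crawl label is an example (assumption);
# pick a current one from https://index.commoncrawl.org/.
import json
import urllib.request

CRAWL = "CC-MAIN-2023-23"  # example crawl label

def pages_captured(domain: str, limit: int = 5) -> list:
    url = (
        f"https://index.commoncrawl.org/{CRAWL}-index"
        f"?url={domain}/*&output=json&limit={limit}"
    )
    with urllib.request.urlopen(url) as resp:
        # The index returns one JSON record per line.
        return [json.loads(line) for line in resp.read().splitlines()]

if __name__ == "__main__":
    for record in pages_captured("example.com"):
        print(record["timestamp"], record["url"])
```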
The bar for disclosure set by the Falcon project is low, and technologists are hoping it will stay where they believe it has stood since Authors Guild v. Google (2015), which established that Google Book Search did not violate US copyright law. But the Book Search product gave copyrighted works greater exposure to global audiences, whereas generative AI competes with human-generated content for our attention. Courts may conclude generative AI is fundamentally different.
Falcon is developed by the Technology Innovation Institute, a government-sponsored lab in the United Arab Emirates. The documentation for Falcon's RefinedWeb dataset gives us a glimpse into how much data is required to train a general-purpose LLM, disclosing that 5 trillion "text tokens" from CommonCrawl were used. In the technical jargon of LLMs, a token equates to roughly 75% of a word, suggesting that a 3.75-trillion-word sample of global online content, including news sites and other types of sites, was used. If we assume an average document length of 300 words, a figure often targeted as optimal by content marketers, this points to a training set of about 12.5 billion documents, representing millions of individual content owners.
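The back-of-the-envelope arithmetic is simple enough to write out. Here is a short sketch using the same assumed ratios: roughly 0.75 words per token and 300 words per document, both heuristics rather than measured values.

```python
# Back-of-the-envelope sizing of the RefinedWeb training data, using the
# rough ratios assumed above (0.75 words per token, 300 words per document).
tokens = 5_000_000_000_000           # 5 trillion tokens, per RefinedWeb's documentation
words_per_token = 0.75               # common rule of thumb, not a measured value
words_per_document = 300             # content-marketing target length (assumption)

total_words = tokens * words_per_token                # ~3.75 trillion words
total_documents = total_words / words_per_document    # ~12.5 billion documents

print(f"{total_words / 1e12:.2f} trillion words")        # 3.75 trillion words
print(f"{total_documents / 1e9:.1f} billion documents")  # 12.5 billion documents
```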
The key question around data governance for the LLM space heading into 2024 is: how quickly will LLM builders publicly court partnerships with news organizations and social platforms? OpenAI announced a partnership with the Associated Press in July, while the New York Times updated its terms of service to prohibit generative AI training last month.
Training data governance may be the single most overlooked factor in a critical choice facing all enterprise companies: which LLM to integrate with vital business processes? Complicating matters further, some technology buyers are beginning to conflate "open source" with "fair use".
LLM software itself is often provided under an "open source" license, though these licenses typically fall short of the open-source standards set by established license families such as Apache and the GNU GPL. The terms of both Llama and Qwen restrict use by projects that attract large numbers of users, effectively trapping technology product builders in an "open source until you're successful" arrangement. Against this already murky backdrop, copyright liability could hamper future development efforts for LLM builders who decline to work with content owners.
For enterprise implementation builders, this is a hidden risk. Particularly in data-intensive sectors like finance, healthcare and entertainment, the best foundation model for your project may be the one with the strongest rights management, not the one producing the cleverest responses. Beyond copyright, data privacy considerations and website terms-of-use restrictions may also create expensive hurdles for foundation model owners. They can get ahead of this by partnering with content owners proactively and voluntarily.
An often-cited risk of generative AI is that the technology will spread misinformation. For content owners, another risk may simply be that LLM builders refuse to publish their "notes": figuratively speaking, the training data from which their models derive responses. In the spirit of those early bloggers who both perturbed and inspired me, I'll continue to post my notes here as this frontier develops.