https://www.businessinsider.com/suchir-balaji-marques-brownlee-openai-ai-training-data-death-llm-2024-12

Suchir Balaji helped OpenAI collect data from the internet for AI model training, the NYT reported.
He was found dead in an apartment in San Francisco in late November, according to police.
About a month before, Balaji published an essay criticizing how AI models use data.

The recent death of former OpenAI researcher Suchir Balaji has brought an under-discussed AI debate back into the limelight.

AI models are trained on information from the internet. These tools answer user questions directly, so fewer people visit the websites that created and verified the original data. This drains resources from content creators, which could lead to a less accurate, less rich internet.

Elon Musk calls this "Death by LLM." Stack Overflow, a coding Q&A website, has already been damaged by this phenomenon. And Balaji was concerned about this.

Balaji was found dead in late November. The San Francisco Police Department said it found "no evidence of foul play" during the initial investigation. The city's chief medical examiner determined the death to be suicide.

Balaji's concerns

About a month before Balaji died, he published an essay on his personal website that addressed how AI models are created and how this may be bad for the internet.

He cited research that studied the impact of AI models using online data for free to answer questions directly while sucking traffic away from the original sources.

The study analyzed Stack Overflow and found that traffic to this site declined by about 12% after the release of ChatGPT. Instead of going to Stack Overflow to ask coding questions and do research, some developers were just asking ChatGPT for the answers.

Other findings from the research Balaji cited:

There was a decline in the number of questions posted on Stack Overflow after the release of ChatGPT.
The average account age of the question-askers rose after ChatGPT came out, suggesting fewer people signed up to Stack Overflow or that more users left the online community.

This suggests that AI models could undermine some of the incentives that created the information-rich internet as we know it today.

If people can get their answers directly from AI models, there's no need to go to the original sources of the information. If people don't visit websites as much, advertising and subscription revenue may fall, and there would be less money to fund the creation and verification of high-quality online data.

MKBHD wants to opt out

It's even more galling to imagine that AI models might be doing this based partly on your own work.

Tech reviewer Marques Brownlee experienced this recently when he reviewed OpenAI's Sora video model and found that it created a clip with a plant that looked a lot like a plant from his own videos posted on YouTube.

"Are my videos in that source material? Is this exact plant part of the source material? Is it just a coincidence?" said Brownlee, who's known as MKBHD.

Naturally, he also wanted to know if he could opt out and prevent his videos from being used to train AI models. "We don't know if it's too late to opt out," Brownlee said.

'Not a sustainable model'

In an interview with The New York Times published in October, Balaji said AI chatbots like ChatGPT are stripping away the commercial value of people's work and services.

The publication reported that while working at OpenAI, Balaji was part of a team that collected data from the internet for AI model training. He joined the startup with high hopes for how AI could help society, but became disillusioned, NYT wrote.

"This is not a sustainable model for the internet ecosystem," he told the publication.

In a statement to the Times about Balaji's comments, OpenAI said the way it builds AI models is protected by fair use copyright principles and supported by legal precedents. "We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness," it added.

In his essay, Balaji disagreed.

One of the four tests for copyright infringement is whether a new work impacts the potential market for, or value of, the original copyrighted work. If it does this type of damage, then it's not "fair use" and is not allowed.

Balaji concluded that ChatGPT and other AI models don't qualify for fair use copyright protection.

"None of the four factors seem to weigh in favor of ChatGPT being a fair use of its training data," he wrote. "That being said, none of the arguments here are fundamentally specific to ChatGPT either, and similar arguments could be made for many generative AI products in a wide variety of domains."

Talking about data

Tech companies producing these powerful AI models don't like to talk about the value of training data. They've even stopped disclosing where they get the data from, which was a common practice until a few years ago.

"They always highlight their clever algorithms, not the underlying data," Nick Vincent, an AI researcher, told BI last year.

Balaji's death may finally give this debate the attention it deserves.

"We are devastated to learn of this incredibly sad news today and our hearts go out to Suchir's loved ones during this difficult time," an OpenAI spokesperson told BI recently.

If you or someone you know is experiencing depression or has had thoughts of harming themself or taking their own life, get help. In the US, call or text 988 to reach the Suicide & Crisis Lifeline, which provides 24/7, free, confidential support for people in distress, as well as best practices for professionals and resources to aid in prevention and crisis situations. Help is also available through the Crisis Text Line — just text "HOME" to 741741. The International Association for Suicide Prevention offers resources for those outside the US.


The tragedy of former OpenAI researcher Suchir Balaji puts 'Death by LLM' back in the spotlight
