Should We Preserve the Pre-AI Internet Before It’s Contaminated?


Wikipedia already shows signs of extensive AI input

Serene Lee/SOPA Images/LightRocket via Getty Images

The emergence of AI chatbots marks a significant turning point: online content can no longer be reliably assumed to be the work of humans. How are people responding to this shift? Some are racing to preserve “pure” data from the pre-AI era, while others want to document AI’s own output, so that future historians can study how chatbots evolved.

Rajiv Pant, an entrepreneur and former chief technology officer at The New York Times and The Wall Street Journal, views AI as a potential risk to information integrity, particularly for news articles that form part of the historical record. “Since the launch of ChatGPT, we’ve been grappling with this issue of ‘digital archaeology’, which is becoming increasingly pressing,” says Pant. “Currently, there’s no dependable way to differentiate between human-created content and content generated by large AI systems. This is a concern that extends beyond academia; it affects journalism, legal clarity and scientific discovery.”

For John Graham-Cumming of the cybersecurity company Cloudflare, data created before the release of ChatGPT is akin to low-background steel: prized for use in sensitive scientific and medical instruments because, unlike steel made since the start of the atomic age, it carries no residual radioactive contamination that can disrupt measurements.

Graham-Cumming has set up a website, Lowbackgroundsteel.ai, to archive data sources that are free of AI contamination, such as a full copy of Wikipedia from August 2022. The online encyclopedia, he notes, already shows the influence of AI contributions.

“There was a time when everything was done by hand, but at some point that process became heavily augmented by chat systems,” he explains. “You can view this as a kind of pollution, or, more positively, as a way for humanity to advance with assistance.”

Mark Graham, who runs the Wayback Machine at the Internet Archive, a project that has been documenting the public internet since 1996, is skeptical of the need for new data-archiving efforts, given that the Internet Archive already captures up to 160 terabytes of new information every day.

Graham would, however, like to build a repository of AI output for future researchers and historians. His plan is to pose 1000 topical questions to chatbots each day and record their responses, perhaps even using AI to help with the scale of the task. That way, the shifting output of these systems would be documented for future human inquiry.

“You ask a specific question, receive an answer, and the next day, you can re-ask the same question to receive a potentially different response,” Graham comments.

Graham-Cumming emphasizes that he is not against AI; rather, he believes preserving human-generated content could actually improve AI models, because training new models on poor-quality AI output can degrade them, a phenomenon known as “model collapse”. Preventing that, he asserts, is a worthwhile goal.

“At some point, one of these AIs is bound to contemplate concepts that humans haven’t considered. It will prove a mathematical theorem or innovate something entirely new.”


Source: www.newscientist.com