Brain Drain - the limits of human data
What happens when the amount of data needed to keep exponentially scaling the intelligence of AI models outpaces the capacity of us puny carbon-based life forms to produce more text, video, and audio content? I wrote about it recently for the Stack Overflow blog:
One of the most striking things about today's generative AI models is the absolutely enormous amount of data that they train on. Meta wrote, for example, that its Llama 3 model was trained on 15 trillion tokens, which is equal to roughly 44 TERABYTES of disk space. In the case of large language models, this usually means terabytes of text from the internet, although the newest generations of multimodal models also train on video, audio, and images.
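To get a feel for why 15 trillion tokens lands in that ballpark, here is a minimal back-of-envelope sketch. It assumes roughly 3 bytes of UTF-8 text per token, which is an illustrative assumption (common tokenizers average a few characters per token), not a figure reported by Meta:

```python
# Rough sanity check: how much disk does 15 trillion tokens of text occupy?
TOKENS = 15e12           # tokens reported for Llama 3 training data
BYTES_PER_TOKEN = 3      # assumed average bytes of raw text per token

total_bytes = TOKENS * BYTES_PER_TOKEN
terabytes = total_bytes / 1e12   # decimal terabytes

print(f"~{terabytes:.0f} TB of raw text")  # ~45 TB, close to the ~44 TB cited above
```

The exact number shifts with the tokenizer and the mix of languages, but the order of magnitude, tens of terabytes of plain text, is the point.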
The internet, like the oceans of planet Earth, has always been viewed as an inexhaustible resource. Not only is it enormous to begin with, but billions of users are adding fresh text, audio, images, and video every day. Recently, however, researchers have begun to examine the impact this data consumption is having.
“In a single year (2023-2024) there has been a rapid crescendo of data restrictions from web sources,” write the authors of a paper from the Data Provenance Initiative, a volunteer collective of AI researchers from around the world, including experts from schools like MIT and Harvard, and advisors from companies like Salesforce and Cohere. For some of the largest and most popular collections of open data typically used to train large AI models, as much as 45% has now been restricted. “If respected or enforced, these restrictions are rapidly biasing the diversity, freshness, and scaling laws for general-purpose AI systems.”
You can check out the full piece here.
There are now many examples, of course, of large AI labs and canny startups willing to pay human beings to sit at home all day chatting with AI, generating fresh input by activating their cerebrums and typing the output into digital receptacles. One hopes that the fusion of our chaotic emotional intelligence and Large Language Models can catalyze something new for both sides.