Basecamp researchers gather genetic data in Malta
Greg Funnell
A British biotech firm, Basecamp Research, has spent recent years gathering extensive genetic data from microorganisms inhabiting extreme environments worldwide, uncovering nearly 10 billion new genes from over a million species not yet known to science. The company says this vast database of planetary biodiversity will help train a “ChatGPT for biology” that can answer questions about life on Earth, although its effectiveness remains uncertain.
Jörg Overmann from the Leibniz Institute DSMZ, which houses one of the world’s most extensive collections of microbial cultures, asserts that while an increase in known genetic sequences is beneficial, it likely won’t lead to significant discoveries in drug development or chemistry without deeper insights into the organisms from which they originated. “In the end, I’m skeptical that a better understanding of unique features will be achieved merely through brute force in the sequencing domain,” he remarks.
Recent years have seen a surge in machine learning models aimed at identifying patterns and predicting relationships within vast biological datasets. The best known of these is AlphaFold, which predicts the 3D structure of proteins from genetic data and earned its developers at Google DeepMind a share of the 2024 Nobel Prize in Chemistry.
This “generative biology” approach has attracted significant interest, but according to Frances Ding at the University of California, Berkeley, progress has been limited. One reason for this is the underrepresentation of biodiversity in the training data. “Current biological models are primarily trained with datasets that favor well-studied species (e.g., E. coli, mice, humans), leading to poor prediction capabilities for traits associated with sequences from other branches of the Tree of Life,” she explains.
Basecamp researchers aim to bridge this biodiversity gap. Their expanding database now includes samples from over 120 locations across 26 countries, as detailed in a report by the company. Jonathan Finn, the company’s chief scientific officer, notes that their sampling efforts target extreme environments that have yet to be thoroughly examined, ranging from the icy depths of the Arctic Ocean to hot springs in warm jungles. “Most of the samples we’re prioritizing are prokaryotic: bacteria, microorganisms, and their viruses,” Finn states. “We are also aware that some fungi are present.”
Genetic analyses of these samples have revealed gene variations that are broadly shared across the Tree of Life. Based on this research, the company estimates that its data encompasses genetic information from over a million species not found in the public genomic databases used for training AI models. This includes around 9.8 billion newly identified genes, a tenfold increase in the overall known gene count, each potentially encoding a useful protein, according to the researchers.
“By providing these models with richer data, we enhance our understanding of biological mechanisms,” Finn explains. “We aim to create a ChatGPT for Biology.”
It’s estimated that Earth hosts trillions of microorganism species, many of which remain poorly characterized. Thus, it’s not unexpected that the company has identified such a wealth of novel life forms. “As we explore more, discovering diverse gene variants becomes almost inevitable,” notes Leopold Parts at the Wellcome Sanger Institute in the UK.
Nevertheless, Basecamp promotes the notion that all newly discovered materials might hold value. It’s not alone in this sentiment. “This is among the most thrilling advances I’ve encountered in quite some time,” remarks Nathan Frey, a machine learning researcher at Genentech, a US biotech firm. He emphasizes that most AI biology projects focus on algorithm improvement or generating additional lab data rather than venturing out to collect samples directly from nature.
However, there is skepticism about whether this database will yield the meaningful advances the company hopes to achieve. For starters, it remains uncertain how much of this newfound diversity in proteins reflects valuable new functions, such as enzymes that can degrade plastic or proteins useful for gene editing. “They must demonstrate that this novelty has practical utility,” cautions Parts.
Moreover, if the new genes differ significantly from known genes, Overmann doubts how easily existing tools can predict their functions or how such data can be used to train new models. “I can’t discern the functions of most of my genes,” he states. The company may have created a valuable new repository of biological data, but without traditional lab work, even the most advanced AI may still struggle to interpret it.
Source: www.newscientist.com