Cancer disrupts multiple layers of the biological blueprint, including the order of DNA sequences and the chemical markers on DNA known as DNA methylation. In cancer patients, tumor samples obtained from areas like the colon or skin contain a blend of healthy cells, which exhibit normal levels of methylation, alongside cancer cells that show abnormal methylation patterns. This mixture complicates doctors’ efforts to differentiate between the two and identify which methylation signals are genuinely sourced from the tumor.
Moreover, harvesting tumors directly often necessitates painful surgical procedures. Some scientists propose using blood samples as an alternative for initial diagnosis. However, blood samples generally face the same challenge, frequently containing only minute traces of cancer DNA.
Traditionally, scientists have averaged the methylation levels of many DNA fragments from a patient sample to estimate the proportions of cancerous and normal DNA present. This conventional approach, however, overlooks rare and subtle disruptions to DNA. Researchers in Germany and Belgium contend that this missing information is vital for the early detection and diagnosis of cancer, so they introduced a new analytical tool named MethylBERT. The tool analyzes DNA methylation one sequenced fragment at a time, which preserves these subtle details.
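The contrast between averaging and fragment-by-fragment analysis can be illustrated with a toy calculation. The sketch below is not the researchers' code. It assumes each fragment is recorded as a list of 0/1 methylation calls (1 for methylated), and the function names and example data are hypothetical.

```python
def mean_methylation(reads):
    """Average methylation over every call in every fragment (the bulk view)."""
    calls = [call for read in reads for call in read]
    return sum(calls) / len(calls)


def per_read_methylation(reads):
    """Methylation level of each fragment on its own (the fragment-level view)."""
    return [sum(read) / len(read) for read in reads]


# Hypothetical sample: 19 unmethylated fragments and 1 fully methylated fragment.
reads = [[0, 0, 0, 0]] * 19 + [[1, 1, 1, 1]]

bulk_level = mean_methylation(reads)          # 4 methylated calls out of 80
fragment_levels = per_read_methylation(reads)  # one value per fragment
```

In this made-up sample the bulk average is only 0.05, which is hard to tell apart from noise, while the fragment-level view still shows one fragment that is methylated at every site.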
The team built MethylBERT on the transformer architecture, the same technology that powers modern language models such as ChatGPT. They re-engineered this technology to interpret the language of DNA and its methylation signals rather than human language. Each DNA sequence served as a short “sentence” that the model analyzed to discern the differences between tumor and normal DNA.
The researchers trained MethylBERT in two phases. First, they pre-trained it on a dataset derived from the human reference genome, which helped the model recognize patterns in DNA sequences independent of methylation or disease information. This step is akin to teaching students to read using only the letters that form words, without additional context. The model became adept at distinguishing various three-letter DNA combinations and at recognizing that certain bases, particularly C and G among the four letters A, T, C, and G, appear in specific patterns. The pre-training step proved crucial: without it, the model could not accurately classify cancer cells versus normal cells.
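The three-letter combinations mentioned above can be pictured as the "words" a DNA language model reads. The following is a minimal sketch of how a sequence might be cut into such words. The overlapping one-base step and the function name are assumptions made for illustration, not details confirmed by the article.

```python
def kmer_tokens(sequence, k=3):
    """Split a DNA sequence into overlapping k-letter tokens (step of one base)."""
    sequence = sequence.upper()
    if len(sequence) < k:
        return []
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_cg_tokens(tokens):
    """Count tokens that contain a C directly followed by a G."""
    return sum(1 for token in tokens if "CG" in token)


tokens = kmer_tokens("ATCGCG")  # ['ATC', 'TCG', 'CGC', 'GCG']
```

A six-letter sequence yields four overlapping three-letter tokens, and three of the four in this example contain a C followed by a G, the kind of recurring pattern the paragraph describes.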
In the second phase, they fine-tuned the pre-trained model using DNA sequences from actual cancerous and healthy samples, teaching the model to identify known tumor-specific methylation patterns. This strategy parallels instructing students on grammar, which adds context and meaning to words. The model learned that certain DNA regions exhibit high methylation levels in tumors and low or negligible methylation in normal cells, or vice versa. The researchers also devised a system that generates a probability score indicating how likely each DNA fragment is to have originated from tumor rather than normal tissue.
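To make the idea of a per-fragment probability score concrete, here is a hedged sketch of how two raw classifier scores could be turned into a tumor probability with a two-class softmax. The function names are hypothetical, and the simple averaging step at the end is an illustrative stand-in, not the estimation procedure the researchers used.

```python
import math


def tumor_probability(tumor_score, normal_score):
    """Two-class softmax: convert raw scores into P(fragment is from tumor)."""
    highest = max(tumor_score, normal_score)  # subtract the max for numerical stability
    exp_tumor = math.exp(tumor_score - highest)
    exp_normal = math.exp(normal_score - highest)
    return exp_tumor / (exp_tumor + exp_normal)


def naive_tumor_fraction(probabilities):
    """Illustrative assumption only: average the per-fragment tumor probabilities."""
    return sum(probabilities) / len(probabilities)


# Hypothetical scores for three fragments: (tumor_score, normal_score).
scores = [(2.0, -2.0), (-3.0, 3.0), (0.0, 0.0)]
probabilities = [tumor_probability(t, n) for t, n in scores]
```

A fragment with equal scores gets a probability of 0.5, while a fragment whose tumor score is much higher gets a probability close to 1. Keeping one score per fragment is what lets a handful of tumor-like fragments stand out in a sample dominated by normal DNA.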
The team evaluated MethylBERT against existing methods using simulated DNA sequence data of varying complexity. The method accurately detected cancer DNA even at genomic locations covered by very few sequence reads, where traditional methods often falter. The team also identified very small quantities of tumor DNA in the blood of colorectal and pancreatic cancer patients, which supports the tool’s applicability to non-invasive cancer detection.
Scientists noted that training models on human genome data is time-consuming, so they assessed whether a model trained on the mouse genome could analyze human cancer samples. Remarkably, the mouse-trained model performed nearly as well as the human-trained model when applied to human cancer data, resulting in only minor differences in the probability distribution. The researchers attributed this efficacy to the consistent organization of DNA across mammals, enabling models to transfer knowledge from one organism to another.
The researchers concluded that MethylBERT can identify cancer DNA in sequence data obtained from any sequencing platform, irrespective of the complexity of the methylation signal or the amount of tumor DNA in the sample. They also cautioned that the current version requires substantial computational resources to train and run, and they have already begun developing a more efficient version.
Source: sciworthy.com












