Computational cancer researchers who use machine learning face a significant challenge. Vast amounts of data are available for training machine learning models, yet training is hindered by inconsistent data formats, structures, and properties. Consequently, when scientists work with different cancer types and apply different data cleaning procedures, the resulting models can yield vastly different outcomes.
Researchers have identified the gap between available and usable datasets as a considerable obstacle for scientists who lack specialized bioinformatics training. Furthermore, varied processing strategies make it difficult to fairly compare new machine learning techniques and identify the most effective method for a specific cancer research task, such as classifying patient samples as benign or malignant.
To address this issue, a collaboration between researchers in Japan and the United States developed a comprehensive database tailored for machine learning applications. This database, named MLOmics, contains genetic and molecular information from over 8,000 cancer patients. Like a well-organized library, MLOmics offers cancer data that computer models can use directly, without extensive preprocessing.
To construct MLOmics, the team gathered patient samples covering 32 cancer types from publicly available databases such as the Cancer Genome Atlas. They collected four distinct types of molecular information: two forms of transcriptomics data, which measure the RNA products made from DNA; data on repetitive DNA regions, termed copy number variations; and information about chemical DNA tags, known as methylation. The team labeled experimental sources that affect data quality, eliminated contamination from non-human samples, and removed unlabeled values specific to the transcriptomics data.
For the copy number variation data, researchers focused on cancer-specific repeats, identifying and labeling recurrent aberrant repeats along with corresponding genes in those regions. They also adjusted the methylation data to eliminate biases from various experimental platforms. Each processed molecular data type was then assigned a standardized identifier to mitigate discrepancies in naming conventions.
Next, the team built a software pipeline to assess data quality and consolidate each patient’s molecular data types into a unified dataset. This approach is known as multi-omics because it integrates several kinds of molecular measurements. The researchers matched each patient’s sample to its relevant cancer type, resulting in an organized dataset suitable for analysis.
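To make the idea of consolidating molecular data types concrete, here is a minimal Python sketch. It is not the MLOmics pipeline itself. The patient IDs, feature values, and the `merge_omics` helper are all hypothetical, and the sketch assumes each data type is already a table keyed by patient ID.

```python
# Hypothetical per-omics tables keyed by patient ID; all values are made up.
mrna = {"P001": [2.1, 0.4], "P002": [1.7, 0.9], "P003": [0.3, 1.2]}
cnv = {"P001": [1, 0], "P002": [0, -1]}
methylation = {"P001": [0.82], "P002": [0.15], "P003": [0.40]}
cancer_type = {"P001": "BRCA", "P002": "LUAD", "P003": "BRCA"}


def merge_omics(tables, labels):
    """Concatenate each patient's features across tables, keeping only
    patients that appear in every table and have a cancer-type label."""
    shared = set(labels)
    for table in tables:
        shared &= set(table)
    merged = {}
    for patient in sorted(shared):
        features = []
        for table in tables:
            features.extend(table[patient])
        merged[patient] = (features, labels[patient])
    return merged


dataset = merge_omics([mrna, cnv, methylation], cancer_type)
for patient, (features, label) in dataset.items():
    print(patient, label, features)
```

In this toy example, patient P003 has no copy number data and is therefore left out of the merged dataset. Dropping incomplete patients is one possible design choice; a real pipeline might instead keep them and mark the missing values.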
The research team developed 20 task-aware datasets across three categories of machine learning problems, providing crucial metrics for model evaluation in each. Their objective was to showcase how other scientists can effectively utilize MLOmics for a range of common tasks.
The first category, classification, includes six datasets that help scientists train models to categorize samples as malignant or benign. The second category, clustering, includes nine datasets for revealing natural groupings among samples based on molecular patterns when predefined labels are absent. The final category, data completion, features five datasets that address incomplete molecular data resulting from experimental or technical problems. These datasets show how models estimate or fill in missing values, a common need in real-world scenarios.
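As an illustration of what data completion means in practice, the sketch below fills each missing value with the average of the observed values for the same feature. This is a simple baseline chosen for clarity, not a method the MLOmics team is stated to use, and the sample matrix and the `impute_column_means` name are invented for the example.

```python
from statistics import mean

# Made-up matrix: rows are samples, columns are molecular features.
# None marks a missing measurement.
samples = [
    [1.0, 4.0, None],
    [3.0, None, 6.0],
    [5.0, 8.0, 9.0],
]


def impute_column_means(rows):
    """Replace each missing value with the mean of the observed values
    in the same column. Assumes every column has at least one value."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        col_means.append(mean(observed))
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


completed = impute_column_means(samples)
print(completed)
```

A model built for data completion would aim to beat this kind of baseline by producing estimates closer to the true, held-out measurements.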
The MLOmics database is organized into three sections, each offering detailed usage guidelines. The first section includes task-aware cancer multi-omics datasets in comma-separated values (CSV) format. This format suits large genomic datasets because programming languages like Python and R have built-in functions for reading, writing, and analyzing it. The second section offers code files that support model development and the application of evaluation metrics. The final section contains links to supplementary resources that enhance biological analyses and help make the database accessible to all researchers, regardless of their educational background.
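For readers curious about what reading a CSV dataset looks like, here is a short sketch using Python's built-in `csv` module. The column names (`sample_id`, `gene_a`, `gene_b`, `label`) and the values are invented, and the actual layout of the MLOmics files may differ.

```python
import csv
import io

# Stand-in for a CSV file; the column names and values are invented.
csv_text = """sample_id,gene_a,gene_b,label
S1,0.5,1.2,0
S2,1.5,0.8,1
"""


def load_samples(handle):
    """Return (feature_rows, labels) parsed from a CSV file handle."""
    features, labels = [], []
    for row in csv.DictReader(handle):
        labels.append(int(row["label"]))
        features.append(
            [float(v) for k, v in row.items() if k not in ("sample_id", "label")]
        )
    return features, labels


features, labels = load_samples(io.StringIO(csv_text))
print(features, labels)
```

With a file on disk, the same function could be called on `open(path, newline="")`, where `path` is wherever the downloaded dataset was saved.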
In conclusion, the researchers assert that MLOmics represents a vital resource for the cancer research community, enabling researchers to concentrate on developing superior algorithms instead of data preparation. They highlight the accessibility of MLOmics for non-specialists and its support for interdisciplinary and broader biological research. The team is committed to continuously updating MLOmics with new resources and tasks to align with advancements in the field.
Source: sciworthy.com
