Computational cancer researchers who use machine learning face a critical challenge. Large datasets are available for training machine learning models, but preparing them is demanding because of inconsistencies in data formats, names, structures, and other attributes. Consequently, when scientists analyze different cancer types or apply different data cleaning methods, the performance of the resulting models can diverge significantly.
This discrepancy has created a gap between available datasets and their practical usability, posing a significant barrier for researchers lacking specialized bioinformatics training. Variations in data processing methodologies further complicate the comparison of different machine learning approaches, making it challenging to identify the optimal method for tasks such as classifying patient samples as benign or malignant.
In response, a collaboration of researchers from Japan and the United States has developed a database tailored for machine learning applications, comprising genetic and molecular data from over 8,000 cancer patients. They named this database MLOmics. Similar to a well-organized library, MLOmics provides cancer data ready for immediate use by computer models, eliminating the need for extensive data preprocessing.
To create MLOmics, researchers retrieved data on patient samples spanning 32 cancer types from publicly accessible databases, including The Cancer Genome Atlas. They collected four distinct types of molecular data per patient: two types of transcriptomics data, which measure products made from DNA; data on DNA regions whose number of copies varies, termed copy number variation; and data on chemical DNA markers known as methylation. For transcriptomics data, the team labeled experimental factors influencing data quality, eliminated contamination from non-human samples, and addressed unlabeled values.
For copy number variation data, researchers focused on cancer-specific repeated sequences, identifying and labeling recurrent aberrant repeats along with their corresponding genes. They adjusted methylation data to eliminate biases caused by various experimental platforms. In addition, a uniform identifier was assigned to all molecular data to standardize naming conventions.
Subsequently, the team developed a software pipeline to assess data quality and integrate each patient’s molecular data types into a single, cohesive dataset using the multi-omics approach, which combines diverse molecular measurements. They matched each patient sample with its associated cancer type, thereby creating an organized dataset primed for analysis.
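The integration step described above can be pictured with a minimal Python sketch. It is not the researchers' pipeline: the patient IDs, gene names, values, cancer-type labels, and the `integrate` helper are all invented for illustration, and the sketch assumes that only patients present in every data type are kept.

```python
# Hypothetical per-omics tables keyed by patient ID; all values are invented.
mrna = {"P1": {"GENE_A": 2.1, "GENE_B": 0.4}, "P2": {"GENE_A": 1.7, "GENE_B": 0.9}}
cnv = {"P1": {"GENE_A": 1, "GENE_B": 0}, "P3": {"GENE_A": -1, "GENE_B": 0}}
methylation = {"P1": {"GENE_A": 0.83, "GENE_B": 0.12}, "P2": {"GENE_A": 0.40, "GENE_B": 0.55}}
# Hypothetical cancer-type label for each patient.
labels = {"P1": "BRCA", "P2": "LUAD", "P3": "BRCA"}


def integrate(omics_tables, labels):
    """Join omics tables on patient ID, keeping patients present in every table."""
    shared = set(labels)
    for table in omics_tables.values():
        shared &= set(table)
    merged = {}
    for pid in sorted(shared):
        row = {"cancer_type": labels[pid]}
        for omics_name, table in omics_tables.items():
            for feature, value in table[pid].items():
                # Prefix each feature with its data type so names stay unique.
                row[f"{omics_name}:{feature}"] = value
        merged[pid] = row
    return merged


merged = integrate({"mrna": mrna, "cnv": cnv, "methylation": methylation}, labels)
print(merged)
```

In this toy input only patient P1 appears in all three tables, so the merged dataset holds a single row that pairs P1's cancer type with its six prefixed features.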
The researchers designed 20 task-aware datasets across three categories of machine learning problems, establishing appropriate metrics for model evaluation in each category. They aimed to showcase how MLOmics can be employed for a variety of common research tasks.
The first category is classification, comprising six datasets that facilitate training models to categorize samples into known classes, such as malignant or benign tumors. The second category, clustering, includes nine datasets that allow scientists to explore how samples group naturally based on molecular characteristics when predefined labels are absent. The final category, data completion, consists of five datasets aimed at addressing incomplete molecular data caused by technical or experimental errors, detailing how models can estimate or fill in missing values, a common challenge in real-world scenarios.
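To make the data completion category concrete, here is a minimal Python sketch of the general idea. It uses column-mean filling as a stand-in baseline and root mean squared error as the score; both are assumptions chosen for illustration, since the article does not name the models or metrics the researchers used, and all numbers are invented.

```python
import math

# Invented measurements: rows are samples, columns are features.
# None marks a value that is missing.
observed = [
    [1.0, 4.0],
    [None, 6.0],
    [3.0, None],
]
# Invented true values for the missing entries, used only to score the result.
truth = {(1, 0): 2.0, (2, 1): 8.0}


def impute_column_mean(rows):
    """Fill each missing entry with the mean of the known values in its column."""
    means = []
    for c in range(len(rows[0])):
        known = [row[c] for row in rows if row[c] is not None]
        means.append(sum(known) / len(known))
    return [[means[c] if v is None else v for c, v in enumerate(row)] for row in rows]


def rmse(filled, truth):
    """Root mean squared error over the entries that were originally missing."""
    errors = [(filled[r][c] - value) ** 2 for (r, c), value in truth.items()]
    return math.sqrt(sum(errors) / len(errors))


filled = impute_column_mean(observed)
score = rmse(filled, truth)
print(filled, score)
```

A lower score means the filled-in values sit closer to the true ones, which is how competing completion methods could be compared on the same dataset.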
The researchers also organized the MLOmics database into three distinct sections, each with comprehensive usage guidelines. The first section offers the task-aware cancer multi-omics datasets, formatted as comma-separated values (CSV) files. CSV files were selected for their efficiency with large genomic datasets, as they are easily processed by programming languages like Python and R. The second section provides code files designed to assist scientists in model development and evaluation. Finally, the last section includes links to additional resources that complement the primary datasets, ensuring accessibility for all interested researchers, regardless of their background.
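As a rough illustration of why CSV files are convenient, the Python sketch below reads a small table using only the standard library. The column names (`sample_id`, `GENE_A`, `GENE_B`, `label`), the values, and the `load_samples` helper are hypothetical; the actual MLOmics files may be laid out differently.

```python
import csv
import io

# Stand-in for the contents of a downloaded CSV file; the layout is an assumption.
csv_text = """sample_id,GENE_A,GENE_B,label
S1,2.1,0.4,0
S2,1.7,0.9,1
"""


def load_samples(handle):
    """Split each CSV row into a dictionary of numeric features and a class label."""
    features, labels = [], []
    for row in csv.DictReader(handle):
        labels.append(int(row.pop("label")))
        row.pop("sample_id")  # the identifier is not a model input
        features.append({name: float(value) for name, value in row.items()})
    return features, labels


features, labels = load_samples(io.StringIO(csv_text))
print(features, labels)
```

With a real file, `io.StringIO(csv_text)` would be replaced by a file opened with `open(path, newline="")`.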
In conclusion, the researchers affirmed that MLOmics represents a significant asset for the cancer research community, allowing scientists to concentrate on enhancing algorithms instead of expending time on data preparation. They highlighted MLOmics’ suitability for non-specialists, encouraging interdisciplinary research and broader biological studies. The team is committed to continuously updating MLOmics with new resources and tasks in alignment with advancements in the field.
Source: sciworthy.com










