Cells use their DNA to make essential products, such as proteins, through a process called gene expression. However, scientists and health organizations have found that gene expression datasets often contain too few patient samples and too many genes per sample, which creates significant challenges in the global fight against cancer. This mismatch, known as the curse of dimensionality, makes it hard to identify and prioritize the changes in gene expression that distinguish cancer cells from healthy ones.
Machine learning techniques can find patterns in these large datasets and use them to classify samples as cancerous or non-cancerous, but they bring a hurdle of their own. Clinicians are often skeptical of machine learning conclusions because it is rarely clear how a model arrived at its decision, an issue known as the black box problem. Consequently, researchers are working to develop methods that clarify how these models make their predictions.
A research team from multiple institutions in Africa focused on explaining the predictions of breast cancer models. They used publicly available gene expression data from a global database known as The Cancer Genome Atlas, which includes data on approximately 20,000 genes from 1,208 breast cancer samples. Their primary objective was to pick out a small number of genes from those 20,000 that could reliably predict whether a tissue sample was cancerous.
First, the researchers narrowed their dataset to 3,602 genes that were expressed differently in breast cancer cells and healthy cells. They then ran an algorithm that tried various gene combinations to find the smallest set of genes that consistently produced good predictions. The process is like running thousands of mini-races with different runners to see which ones consistently finish first, even though every runner eventually reaches the finish line.
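The selection algorithm itself is not named here. As a rough illustration of the general idea (try gene combinations and keep the smallest set that still predicts well), below is a minimal sketch of greedy forward selection scored with a leave-one-out nearest-centroid classifier. The gene names, expression values, and helper functions are all invented for the example; they are not the researchers' method or data.

```python
from statistics import mean

def centroid_accuracy(samples, labels, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier on the given genes."""
    correct = 0
    for i, held_out in enumerate(samples):
        # Build one centroid per class from every sample except the held-out one.
        centroids = {}
        for cls in set(labels):
            rows = [s for j, (s, y) in enumerate(zip(samples, labels))
                    if j != i and y == cls]
            centroids[cls] = [mean([r[g] for r in rows]) for g in genes]

        # Predict the class whose centroid is closest (squared Euclidean distance).
        def dist(cls):
            return sum((held_out[g] - c) ** 2 for g, c in zip(genes, centroids[cls]))

        predicted = min(sorted(centroids), key=dist)
        correct += predicted == labels[i]
    return correct / len(samples)

def forward_select(samples, labels, gene_names, min_gain=1e-9):
    """Greedily add the gene that most improves accuracy; stop when nothing helps."""
    selected, best = [], 0.0
    while len(selected) < len(gene_names):
        candidates = [g for g in gene_names if g not in selected]
        scored = [(centroid_accuracy(samples, labels, selected + [g]), g)
                  for g in candidates]
        score, gene = max(scored, key=lambda t: t[0])
        if score - best < min_gain:
            break  # no remaining gene improves the score, so keep the set small
        selected.append(gene)
        best = score
    return selected, best

# Hypothetical toy data: GENE_A separates the classes, the others are noise.
genes = ["GENE_A", "GENE_B", "GENE_C"]
labels = ["tumor"] * 4 + ["normal"] * 4
samples = [
    {"GENE_A": 9.0, "GENE_B": 5.0, "GENE_C": 3.0},
    {"GENE_A": 8.5, "GENE_B": 5.2, "GENE_C": 7.0},
    {"GENE_A": 9.2, "GENE_B": 4.9, "GENE_C": 3.1},
    {"GENE_A": 8.8, "GENE_B": 5.1, "GENE_C": 6.9},
    {"GENE_A": 2.0, "GENE_B": 5.1, "GENE_C": 3.2},
    {"GENE_A": 2.5, "GENE_B": 4.8, "GENE_C": 7.1},
    {"GENE_A": 1.8, "GENE_B": 5.0, "GENE_C": 2.9},
    {"GENE_A": 2.2, "GENE_B": 5.2, "GENE_C": 7.0},
]

selected, best = forward_select(samples, labels, genes)
print(selected, best)
```

On this toy data the search stops after one gene, because adding a noisy gene cannot improve on a set that already classifies every held-out sample correctly. A search over thousands of real genes would be far more expensive, which is consistent with the long runtime the team reported.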
Next, they used several machine learning techniques to train and optimize models on the expression data of the genes chosen by the algorithm. All of the models were highly accurate, predicting cancer status correctly at least 98% of the time. This raised the next questions: “Which genes contribute to model performance?” and “How do these genes influence predictions?”
The team employed four distinct statistical interpretation methods known as feature importance techniques to pinpoint the genes most critical to model performance. The first method illustrated how each model’s predictions shifted based on gene expression levels. The second showcased the interplay between multiple genes informing model decisions. The third quantified the overall impact of each gene on the model’s judgement, facilitating a ranked analysis, while the final method evaluated how accurately a single gene could predict breast cancer independently.
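To make the last of these ideas concrete, one common way to score how well a single gene separates two groups is the area under the ROC curve (AUC). The text does not say which metric the team used, so treat the sketch below as an assumption; the gene names and expression values are hypothetical.

```python
def single_gene_auc(values, labels, positive="tumor"):
    """AUC by pairwise comparison: the probability that a randomly chosen
    positive sample has higher expression than a randomly chosen negative one.
    Ties count as half a win."""
    pos = [v for v, y in zip(values, labels) if y == positive]
    neg = [v for v, y in zip(values, labels) if y != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical expression values for three tumor and three normal samples.
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
expression = {
    "GENE_X": [8.0, 9.0, 7.0, 2.0, 3.0, 1.0],  # always higher in tumors
    "GENE_Y": [5.0, 2.0, 6.0, 4.0, 3.0, 1.0],  # usually higher in tumors
    "GENE_Z": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],  # always lower in tumors
}

aucs = {gene: single_gene_auc(vals, labels) for gene, vals in expression.items()}
for gene, auc in aucs.items():
    print(gene, round(auc, 3))
```

An AUC near 1 or near 0 means the gene separates the groups well (higher or lower in tumors, respectively), while a value near 0.5 means the gene on its own is no better than a coin flip.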
Through their analysis, the researchers identified seven genes that appeared consistently across all trained models and feature importance evaluations. They verified that these genes are associated with biological functions that influence cancer progression, such as tissue repair, regulation of cellular substance transport, and immune response management.
Although the different models generally agreed on the key genes, their exact rankings and influence scores varied. The researchers explained that biological data is often complex, so different models pick up on different aspects of the same data. This suggests that combining insights from multiple machine learning models gives better results than relying on a single model.
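One simple way to combine rankings from several models is to average each gene's position across them. The sketch below assumes each model has already produced an ordered list of genes, most important first; the model names, gene names, and the choice of mean rank are illustrative and are not taken from the study.

```python
def mean_rank(rankings):
    """Average each gene's position (1 = most important) across all models.
    Only genes ranked by every model are included."""
    shared = set.intersection(*(set(r) for r in rankings.values()))
    avg = {
        gene: sum(r.index(gene) + 1 for r in rankings.values()) / len(rankings)
        for gene in shared
    }
    # Sort by average rank, breaking ties alphabetically for a stable result.
    return sorted(avg.items(), key=lambda kv: (kv[1], kv[0]))

# Hypothetical per-model rankings, most important gene first.
rankings = {
    "model_1": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    "model_2": ["GENE_B", "GENE_A", "GENE_C", "GENE_D"],
    "model_3": ["GENE_A", "GENE_C", "GENE_B", "GENE_D"],
}

consensus = mean_rank(rankings)
for gene, rank in consensus:
    print(gene, round(rank, 2))
```

Here the three models disagree on the exact order, but the averaged ranking still places GENE_A first and GENE_D last, which mirrors the kind of agreement on key genes described above.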
The team acknowledged several limitations. The gene selection algorithm took nearly six hours to run on a high-performance laptop, which may not be practical for larger datasets. The selection process may also have left out important genes. In addition, although the dataset is large, it may not capture the full diversity of breast cancer worldwide, which could limit how well the models apply to different populations. The researchers concluded that combining machine learning with clear, interpretable methods is the future of cancer prediction and could help clinicians trust machine learning-driven insights.
Source: sciworthy.com












