Creating a Comprehensive Cancer Data Library: A Step-by-Step Guide by Sciworthy

Computational cancer researchers who use machine learning face a significant challenge: although vast amounts of data are available for training models, that training is hindered by inconsistent data formats, structures, and properties. Consequently, when scientists work with different cancer types and apply different data cleaning procedures, the resulting models can yield vastly different outcomes.

Researchers have identified the disparity between available and usable datasets as a considerable obstacle for scientists lacking specialized bioinformatics training. Furthermore, varied processing strategies make it difficult to equitably compare new machine learning techniques and identify the most effective method for specific cancer research tasks—such as classifying patient samples into benign or malignant categories.

To address this issue, a collaboration between researchers in Japan and the United States has resulted in the development of a comprehensive database tailored for machine learning applications. This database, named MLOmics, encompasses genetic and molecular information from over 8,000 cancer patients. Similar to a well-organized library, MLOmics offers cancer data that can be directly utilized by computer models, eliminating the need for extensive preprocessing.

In constructing MLOmics, the team gathered patient samples spanning 32 cancer types from publicly available databases such as The Cancer Genome Atlas. Data collection covered four distinct types of molecular information: two forms of transcriptomics data, which measure the RNA products of DNA; data on repeated DNA regions, termed copy number variations; and information about chemical tags on DNA, known as methylation. The team meticulously labeled experimental sources affecting data quality, eliminated contamination from non-human samples, and removed unlabeled values specific to transcriptomics data.

For the copy number variation data, researchers focused on cancer-specific repeats, identifying and labeling recurrent aberrant repeats along with corresponding genes in those regions. They also adjusted the methylation data to eliminate biases from various experimental platforms. Each processed molecular data type was then assigned a standardized identifier to mitigate discrepancies in naming conventions.

Subsequently, a coding pipeline was established to assess data quality and consolidate each patient’s molecular data types into a unified dataset—an approach known as multi-omics, as it integrates various molecular measurements. The researchers matched each patient’s sample to its relevant cancer type, resulting in an organized dataset suitable for analysis.
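As a rough illustration of the consolidation step, the sketch below joins several per-patient tables into one row per patient. It is not the MLOmics pipeline itself: the table names, patient IDs, feature names, and the rule of keeping only patients measured in every table are all assumptions made for the example.

```python
from typing import Dict

# Hypothetical per-omics table: patient ID -> {feature name: value}.
OmicsTable = Dict[str, Dict[str, float]]


def merge_omics(tables: Dict[str, OmicsTable]) -> OmicsTable:
    """Join several omics tables into one row per patient.

    Assumption for this sketch: only patients measured in every table are kept.
    """
    if not tables:
        return {}
    shared_ids = set.intersection(*(set(table) for table in tables.values()))
    merged: OmicsTable = {}
    for patient_id in sorted(shared_ids):
        row: Dict[str, float] = {}
        for omics_name, table in tables.items():
            for feature, value in table[patient_id].items():
                # Prefix each feature with its omics type to avoid name clashes.
                row[f"{omics_name}:{feature}"] = value
        merged[patient_id] = row
    return merged


# Made-up values for three made-up patients.
example_tables = {
    "transcriptomics": {"P1": {"GENE_A": 2.1}, "P2": {"GENE_A": 0.4}},
    "copy_number": {"P1": {"GENE_A": 1.0}, "P3": {"GENE_A": -1.0}},
    "methylation": {"P1": {"SITE_X": 0.8}, "P2": {"SITE_X": 0.2}},
}
merged = merge_omics(example_tables)
print(merged)
```

In this toy input only patient P1 appears in all three tables, so the merged result contains a single row. A real pipeline might instead keep partially measured patients and mark the gaps as missing.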

The research team developed 20 task-aware datasets across three categories of machine learning problems, providing crucial metrics for model evaluation in each. Their objective was to showcase how other scientists can effectively utilize MLOmics for a range of common tasks.

The first category focuses on classification, including six datasets that assist scientists in training models to categorize samples as malignant or benign. The second category, clustering, incorporates nine datasets that reveal natural groupings among samples based on molecular patterns when predefined labels are absent. The final category, data completion, features five datasets aimed at addressing incomplete molecular data resulting from experimental or technical challenges, showcasing how models estimate or fill in missing values—a common occurrence in real-world scenarios.
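To make the data completion task concrete, the sketch below fills missing values with the average of each feature column. This is a deliberately simple baseline chosen for illustration, not a method the MLOmics team is reported to use; the function name and the toy matrix are hypothetical.

```python
from typing import List, Optional

Matrix = List[List[Optional[float]]]


def impute_column_means(matrix: Matrix) -> List[List[float]]:
    """Replace each missing entry (None) with the mean of its column.

    Columns with no observed values fall back to 0.0 (an arbitrary choice).
    """
    if not matrix:
        return []
    n_cols = len(matrix[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in matrix if row[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [
        [means[j] if value is None else value for j, value in enumerate(row)]
        for row in matrix
    ]


# Rows are samples, columns are molecular features; None marks a missing value.
incomplete = [
    [1.0, None],
    [3.0, 4.0],
    [None, 8.0],
]
completed = impute_column_means(incomplete)
print(completed)
```

More capable models would predict each missing value from the other measurements of the same sample, and can be scored by hiding known values and comparing the estimates against them.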

The MLOmics database is organized into three sections, each offering detailed usage guidelines. The first section includes task-aware cancer multi-omics datasets in comma-separated values (CSV) format. This format is ideal for large genomic datasets, as programming languages like Python and R have built-in functions for effective reading, writing, and analysis. The second section offers code files to facilitate model development and application of evaluation metrics, while the final section contains links to supplementary resources to enhance biological analyses and ensure the database is accessible to all researchers, regardless of their educational background.
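As a small illustration of why CSV is convenient, the sketch below reads a table with Python's standard csv module. The column names (patient_id, feature_1, feature_2, label) and the values are invented for the example and do not reflect the actual layout of the MLOmics files.

```python
import csv
import io
from typing import List, TextIO, Tuple

# Hypothetical file contents; a real file would be opened with open(path, newline="").
sample_csv = """patient_id,feature_1,feature_2,label
P1,0.12,1.50,0
P2,0.98,0.30,1
"""


def load_samples(handle: TextIO) -> Tuple[List[List[float]], List[int]]:
    """Split each CSV row into a numeric feature vector and an integer label."""
    features: List[List[float]] = []
    labels: List[int] = []
    for row in csv.DictReader(handle):
        labels.append(int(row.pop("label")))
        row.pop("patient_id")  # identifier, not a feature
        features.append([float(value) for value in row.values()])
    return features, labels


features, labels = load_samples(io.StringIO(sample_csv))
print(features, labels)
```

The resulting lists can be passed to most machine learning libraries after conversion to their preferred array type.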

In conclusion, the researchers assert that MLOmics represents a vital resource for the cancer research community, enabling researchers to concentrate on developing superior algorithms instead of data preparation. They highlight the accessibility of MLOmics for non-specialists and its support for interdisciplinary and broader biological research. The team is committed to continuously updating MLOmics with new resources and tasks to align with advancements in the field.



Source: sciworthy.com

Entomologists Launch Comprehensive Digital Library Showcasing Global Ant Diversity

Utilizing advanced X-ray technology, robotics, and artificial intelligence, entomologists have successfully developed interactive digital imagery for 792 ant species across 212 genera.



A detailed Antscan specimen rendering: Eciton hamatum. Image credit: Katzke et al., doi: 10.1038/s41592-026-03005-0.

To create this extensive digital library, researchers at the Okinawa Institute of Science and Technology, led by Julian Katzke, gathered ethanol-preserved ant specimens from museums, partner institutions, and global experts.

The team organized the specimens by species and category and transported them to the lab. The Karlsruhe Institute of Technology (KIT) in Germany provided cutting-edge X-ray micro-CT scanning, similar to medical CT scans but with significantly higher magnification.

A synchrotron particle accelerator generated a powerful X-ray beam, enabling rapid scanning of a vast array of samples, while a robotic sample changer swapped in a new specimen every 30 seconds.

This sophisticated process facilitated the production of 2D image stacks, essential for constructing 3D models.

Despite the utility of raw image files, initial depictions of the ant specimens were often distorted, falling short of achieving the realistic models scientists envisioned.

3D imaging allows for the visualization of internal structures, including muscles, nervous systems, and digestive systems, at a micrometer level of resolution.

These models can easily be animated or integrated into virtual reality environments for purposes spanning research, education, and entertainment.

“If we had conducted this project using a standard lab-based CT scanner, it would have taken six years of continuous operation,” Dr. Katzke explained.

“With the KIT setup, we scanned 2,000 specimens in just one week.”

Professor Evan Economo, a researcher at the Okinawa Institute of Science and Technology and the University of Maryland, remarked, “Without these computational tools, completing this project manually would have been nearly impossible.”

Dubbed Antscan, this initiative could pave the way for future digitization efforts across various species beyond ants.

“The significance of this research extends far beyond ants,” Professor Economo stated. “Once specimens are digitized, we can create libraries that enhance the utilization of biological materials across science labs, classrooms, and even Hollywood studios.”

The team’s study was published in the prestigious journal Nature Methods.

_____

J. Katzke et al. High-throughput phenomics of global ant biodiversity. Nat Methods published online March 5, 2026. doi: 10.1038/s41592-026-03005-0

Source: www.sci.news

Authors in London protest Meta’s theft of books and use of ‘Shadow Library’ to train AI

Authors and other publishing industry professionals will demonstrate outside Meta’s London office today in protest at the company’s use of copyrighted books to train artificial intelligence.

Notable figures, including novelists Kate Mosse and Tracy Chevalier and poet and former Royal Society of Literature chair Daljit Nagra, are expected to be present outside Meta’s King’s Cross office.

Protesters will gather at Granary Square at 1:30 pm, and a letter to Meta from the Society of Authors (SOA) is due to be hand-delivered at 1:45 pm. It will also be sent to Meta’s US headquarters.

Earlier this year, Meta CEO Mark Zuckerberg allegedly approved the company’s use of Libgen, a so-called “shadow library” that contains more than 7.5 million books. The Atlantic recently released a searchable database of the titles in Libgen, suggesting that authors’ works may have been used to train Meta’s AI models.

SOA Chair Vanessa Fox O’Loughlin condemned Meta’s actions as “illegal, shocking, and devastating for writers.”

Fox O’Loughlin added, “Books take years to write, and Meta stealing them for AI replication threatens authors’ livelihoods.”

In response, a Meta spokesperson said the company respects intellectual property rights and believes its actions comply with the law.


Several prominent authors, including Mosse, Richard Osman, Kazuo Ishiguro, and Val McDermid, signed a letter to Culture Secretary Lisa Nandy asking for Meta executives to be summoned to appear before Parliament. The petition garnered over 7,000 signatures.

Today’s protest is led by novelist AJ West, who expressed dismay at seeing his work in the Libgen database without consent.

A court filing in January revealed that a group of authors is suing Meta for copyright infringement, citing the harm done to authors’ rights by the use of unauthorized databases such as Libgen.

SOA’s chief executive Anna Ganley emphasized the detrimental effect of companies exploiting authors’ copyrighted works.

Protesters are encouraged to create placards and use hashtags such as #MetaBookThieves, #DoTheWriteThing, and #MakeItFair.

Source: www.theguardian.com

British Library starts process of reinstating digital services following cyber attack

After enduring a severe cyber attack, the British Library is now in the process of restoring its main catalog online. This is a significant milestone as the catalog contains 36 million records of printed and rare books, maps, magazines, and sheet music.

Despite this progress, access is currently limited to a “read-only” format, and it may take until the end of the year for the national library’s services to be fully restored.

Sir Roly Keating, the library’s chief executive, confirmed that the full restoration of all services will be a gradual process. This has been particularly challenging for researchers who rely on the library’s collections for their work and livelihood.

The devastating cyber attack, which occurred on October 31st and was claimed by the ransomware group Rhysida, caused the main catalog to be inaccessible online and led to the theft of some employee data.

Once the online catalog is restored, users will be able to search for materials. However, the process for checking availability and ordering materials for use in the library reading room will differ from before. Users will also need to visit the library in person to view offline versions of the specialized catalog.

The library has also acknowledged the financial impact of the attack, stating that significant spending will be required to rebuild its digital services and complete the technological recovery. Additionally, concerns have been raised about the impact of the attack on payments to authors through the UK’s Public Lending Right system.

Despite the challenges ahead, the library is committed to restoring its services to their full capacity and continues to work with cybersecurity experts to address the aftermath of the attack.

Source: www.theguardian.com