Using machine learning to find metastatic cancer's origin

Most metastatic cancers have a known primary site of origin, but up to 5% of the time, the primary site cannot be identified. These “cancers of unknown primary” (CUP) are difficult to treat and have very poor prognoses.

To address the problem, a JAX team developed CUP-AI-Dx, a machine learning tool that uses RNA sequence data for analysis. In a paper recognized as one of the best of 2020 by EBioMedicine, the researchers show that CUP-AI-Dx has high accuracy when applied to real-world data sets and provides an important clinical tool to help guide therapies for CUP patients.

What is CUP and what does it have to do with metastatic cancers?

Based on their molecular attributes, most metastatic cancers can be traced back to their site of origin, e.g., breast, colorectal, or skin. Knowing the site of origin helps guide therapeutic strategy. Unfortunately, up to five percent of the time, the site of origin cannot be determined, which makes these “cancers of unknown primary” (CUP) even more difficult to treat. Patients with CUP have a one-year survival rate of only 25 percent, so improved diagnostic methods are essential for better patient prognoses.

To help address the problem, a team co-led by Jackson Laboratory Assistant Professor Sheng Li, Senior Director of Computational Sciences Krishna Karuturi, and Associate Director of Computational Sciences Joshy George developed a machine learning framework to help predict the primary site and molecular subtype of cancer samples. The tool, called CUP-AI-Dx, takes the expression of 817 genes from RNA sequencing data as input. It uses a 1D Inception convolutional neural network model to infer a metastatic cancer’s primary tissue of origin. It simultaneously allows for robust identification of a tumor’s molecular subtype, further enhancing clinical insight.
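To give a feel for what a 1D Inception-style layer does with a vector of gene-expression values, here is a minimal pure-Python sketch. It is not the CUP-AI-Dx implementation: the kernel sizes (1, 3, 5), the random weights, the single filter per branch, and every function name are assumptions made for illustration. Only the input length of 817 genes comes from the article. The idea it shows is that several convolutions with different receptive-field widths, plus a pooling branch, run in parallel over the same input, and their outputs are stacked as channels.

```python
import random


def conv1d_same(signal, kernel):
    """Cross-correlate a 1D signal with an odd-length kernel.

    The signal is zero-padded so the output has the same length as the input.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(padded[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(len(signal))
    ]


def maxpool1d_same(signal, width=3):
    """Sliding-window maximum with stride 1; windows are clipped at the edges."""
    half = width // 2
    n = len(signal)
    return [max(signal[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def relu(values):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in values]


def inception_block_1d(signal, kernels):
    """Run parallel branches over the same input and stack them as channels."""
    branches = [relu(conv1d_same(signal, k)) for k in kernels]
    branches.append(maxpool1d_same(signal))  # pooling branch
    return branches


N_GENES = 817  # number of input genes stated in the article
rng = random.Random(0)

# Placeholder expression vector; real input would be RNA-seq expression values.
expression = [rng.random() for _ in range(N_GENES)]

# Hypothetical kernel sizes with random, untrained weights.
kernels = [[rng.uniform(-1, 1) for _ in range(size)] for size in (1, 3, 5)]

features = inception_block_1d(expression, kernels)
print(len(features), len(features[0]))  # 4 channels, each of length 817
```

A trained model would learn the kernel weights, use many filters per branch, and typically stack several such blocks before a classification layer. The sketch keeps one filter per branch so the parallel-branch structure stays visible.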

Using genetic tools to target cancer

As presented in “CUP-AI-Dx: A tool for inferring cancer tissue of origin and molecular subtype using RNA gene-expression data and artificial intelligence,” a paper published in EBioMedicine, the research team used the transcriptional profiles of more than 18,000 primary tumors representing 32 cancer types from The Cancer Genome Atlas (TCGA) to train the model. Once optimized, CUP-AI-Dx was tested on nearly 400 metastatic samples, correctly identifying the tissue of origin 96.7 percent of the time in a test dataset. When applied to clinical-grade RNA-seq datasets generated at two different institutes, in the U.S. and Australia, the model predicted the primary site as the top option with accuracies of 87 percent and 72.5 percent, respectively.
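The accuracy figures above count a prediction as correct when the true primary site is the model's top-ranked option. A minimal sketch of that kind of metric, often called top-k accuracy, might look like the following. The function name and the toy labels are hypothetical and are not taken from the paper or its data.

```python
def top_k_accuracy(ranked_predictions, true_labels, k=1):
    """Fraction of samples whose true label is among the k top-ranked predictions."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Toy example: each inner list is a model's ranking of candidate sites, best first.
ranked = [
    ["breast", "lung", "skin"],
    ["lung", "colorectal", "breast"],
    ["skin", "breast", "lung"],
    ["colorectal", "lung", "skin"],
]
truth = ["breast", "colorectal", "skin", "lung"]

print(top_k_accuracy(ranked, truth, k=1))  # 0.5
print(top_k_accuracy(ranked, truth, k=2))  # 1.0
```

With k=1 this is the "top option" accuracy. Larger values of k credit the model whenever the true site appears anywhere among its leading candidates.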

CUP-AI-Dx provides an important clinical tool to help guide therapies for patients who might otherwise be limited to generalized treatments. In fact, EBioMedicine named the paper as one of its top non-COVID-19-related papers from 2020, and the only one from the field of oncology. The model and results are available for non-commercial use at https://github.com/TheJacksonLaboratory/CUP-AI-Dx.