Nonlinear multi-omics biomarker discovery with random forests
The MRF-IMD workflow
The framework takes matched multi-omics data and produces a ranked biomarker list. One omics block is treated as the predictor matrix X; the other as a multivariate response Y. Splits are chosen to maximise joint heterogeneity of Y, so features sitting near tree roots are those that partition the response most cleanly across the full multivariate surface.
Figure 1. Four-step MRF-IMD pipeline. Each forest is directed (one omics block predicts the other), so the model is fitted twice with X and Y swapped; the two resulting IMD rankings are then combined.
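To make the idea of a multivariate split concrete, here is a minimal sketch of one common criterion: pick the threshold that most reduces the within-node sum of squares, summed over all columns of Y. It is an illustration under stated assumptions, not the package's implementation. The function names (`total_sse`, `best_split`) and the toy data are hypothetical, and the criterion used in multiRF may differ, for example in how response columns are scaled or weighted.

```python
from typing import Sequence, Tuple


def total_sse(rows: Sequence[Sequence[float]]) -> float:
    """Sum of squared deviations from the column means, added over all response columns."""
    if not rows:
        return 0.0
    sse = 0.0
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        mean = sum(col) / len(col)
        sse += sum((v - mean) ** 2 for v in col)
    return sse


def best_split(x: Sequence[float], Y: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (threshold, sse_reduction) for the best binary split on one predictor x."""
    parent = total_sse(Y)
    values = sorted(set(x))
    best_thr, best_gain = float("nan"), 0.0
    # Candidate thresholds are midpoints between consecutive distinct values of x.
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2.0
        left = [y for xi, y in zip(x, Y) if xi <= thr]
        right = [y for xi, y in zip(x, Y) if xi > thr]
        gain = parent - total_sse(left) - total_sse(right)
        if gain > best_gain:
            best_thr, best_gain = thr, gain
    return best_thr, best_gain


# Toy data (hypothetical): x_a shifts both response columns, x_b is unrelated.
x_a = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
x_b = [0.5, 0.1, 0.9, 0.4, 0.8, 0.2]
Y = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [3.0, 5.0], [3.1, 5.1], [2.9, 4.9]]

thr_a, gain_a = best_split(x_a, Y)
thr_b, gain_b = best_split(x_b, Y)
print(f"x_a: threshold={thr_a:.2f}, reduction={gain_a:.2f}")
print(f"x_b: threshold={thr_b:.2f}, reduction={gain_b:.2f}")
```

On this toy data `x_a` gives a much larger reduction than `x_b`, so a tree grown with this criterion would tend to split on `x_a` nearer the root.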
Three ways to turn IMD into a biomarker list
The same IMD scores admit several thresholding policies. We offer three strategies, each sitting at a different point on the parsimony/sensitivity trade-off:
Strategy A
Retain variables whose IMD exceeds a threshold τ·σ. This yields the most parsimonious, stable signatures, e.g. a ~73-gene panel in BRCA.
Strategy B
Fit a two-component mixture model to separate signal from noise in the IMD distribution. This gives a balanced trade-off between parsimony and sensitivity.
Strategy C
Standardise IMD by a t-score. Best for picking up subtle signals and variables that act through interactions.
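As a rough illustration of the simplest policy, the sketch below applies a Strategy A style cutoff. It assumes that σ is the standard deviation of the IMD scores across all variables and that τ is a user-chosen multiplier; neither detail is spelled out above, so treat both as assumptions. The function name `select_by_sd_threshold` and the scores are hypothetical, not part of the multiRF API.

```python
from statistics import pstdev
from typing import Dict, List


def select_by_sd_threshold(imd: Dict[str, float], tau: float = 1.0) -> List[str]:
    """Keep variables whose score exceeds tau * sigma, ordered by decreasing score.

    Assumption: sigma is the population SD of all IMD scores.
    """
    sigma = pstdev(list(imd.values()))
    cutoff = tau * sigma
    kept = [name for name, score in imd.items() if score > cutoff]
    return sorted(kept, key=lambda name: imd[name], reverse=True)


# Hypothetical IMD scores: two strong variables and four near-noise ones.
imd_scores = {
    "geneA": 0.90, "geneB": 0.75, "geneC": 0.10,
    "geneD": 0.08, "geneE": 0.05, "geneF": 0.02,
}

panel_loose = select_by_sd_threshold(imd_scores, tau=1.0)
panel_strict = select_by_sd_threshold(imd_scores, tau=2.2)
print(panel_loose)
print(panel_strict)
```

Raising `tau` shrinks the panel, which is the parsimony end of the trade-off; the mixture-model and t-score strategies would replace the fixed cutoff with a data-driven one.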
Benchmark against linear multi-omics methods
We simulated matched-omics data under both linear (latent-factor) and nonlinear (interaction + threshold) generative models and measured feature-recovery via area under the precision-recall curve (PR-AUC). The gap opens in the nonlinear regime:
Figure 2. Feature-recovery PR-AUC for MRF-IMD, SPLS, PMDCCA, and RGCCA. In linear settings MRF-IMD (~0.90) is on par with SPLS and the CCA-based methods; under nonlinear simulations the linear methods collapse toward random while MRF-IMD retains a PR-AUC of 0.71 or higher.
- Linear regime. MRF-IMD reaches ~0.90 PR-AUC, comparable to SPLS (0.94) and PMDCCA (0.90). It is competitive even when assumptions favour linear methods.
- Nonlinear regime. MRF-IMD holds ~0.71–0.81 PR-AUC, while SPLS and PMDCCA fall to ~0.04–0.15. This is the central advantage of the tree-based framework.
- Ensemble baselines. Univariate ensemble learners (GBM, XGBoost) adapted to the multi-omics task also underperform, reinforcing the value of the multivariate split criterion.
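For readers unfamiliar with the metric, the sketch below computes average precision, one common step-wise estimator of the area under the precision-recall curve, from a ranked feature list and the known set of true features. The exact PR-AUC estimator used in the benchmark is not stated here, so this is an assumption; the function name and toy inputs are hypothetical, and tied scores are broken arbitrarily by sort order.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], is_true: Sequence[bool]) -> float:
    """Mean of the precision values taken at the rank of each true feature."""
    n_true = sum(1 for flag in is_true if flag)
    if n_true == 0:
        raise ValueError("need at least one true feature")
    # Rank features from highest to lowest importance score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, idx in enumerate(order, start=1):
        if is_true[idx]:
            hits += 1
            total += hits / rank  # precision among the top `rank` features
    return total / n_true


# Toy example: five features, two of which are truly associated.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
truth = [True, False, True, False, False]
ap = average_precision(scores, truth)
print(round(ap, 4))
```

A method that ranks every true feature first scores 1.0, while a random ranking scores close to the fraction of true features, which is why values of ~0.04–0.15 in a sparse simulation indicate near-random recovery.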
Coherent molecular groups across 22 TCGA cancer types
Applied to a pan-cancer TCGA cohort spanning 22 disease types, MRF-IMD-selected features recovered eight coherent molecular groups under IntNMF clustering. The groupings respect lineage where expected and cut across tissue of origin where the biology is shared.
| Group | Label | Composition |
|---|---|---|
| G1 | Basal-like & UCEC | High genomic instability; TP53-driven. |
| G2 | Hepatobiliary | LIHC, CHOL — hepatic-lineage tumours. |
| G3 | Hypermutated | BLCA, SKCM — high mutational burden. |
| G4 | Non-basal BRCA | Luminal and HER2+ subtypes. |
| G5 | GI adenocarcinoma | COAD, STAD, ESCA — CIN phenotype. |
| G6 | Endocrine | ACC, PCPG — hormone-secreting neoplasms. |
| G7 | Squamous (SCC) | HNSC, LUSC — RTK-RAS activation. |
| G8 | Renal epithelial | KIRC, KIRP — VHL/HIF alterations. |
Table 1. Pan-cancer groups recovered from MRF-IMD–selected features. In raw-data controls these groupings collapse into tissue-of-origin blocks, suggesting that the framework surfaces shared molecular programs that transcend lineage alone.
Dementia-conversion prediction in ADNI
We integrated blood DNA methylation and gene expression data for 538 ADNI participants (cognitively normal, CN, vs. mild cognitive impairment, MCI) and used MRF-IMD–selected features to stratify risk of cognitive decline. Top-ranked genes included ARL11, S1PR1, and DAPK2, all with established links to neuro-inflammation.
Figure 3. Kaplan-Meier curves (simulated for visualisation) showing separation of high- vs. low-risk groups by MRF-IMD score.
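Curves of this kind are built with the Kaplan-Meier product-limit estimator. The sketch below is a minimal stdlib version run on made-up follow-up data, not on ADNI; the function name, the event coding (1 = conversion observed, 0 = censored), and the numbers are all assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each time where an event occurred.

    events[i] is 1 if the event was observed at times[i], 0 if censored.
    """
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    for t in sorted(set(times)):
        # Subjects still under observation just before time t.
        at_risk = sum(1 for ti in times if ti >= t)
        # Observed events exactly at time t.
        d = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        if d > 0:
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
    return curve


# Made-up follow-up times (years) and event indicators for one risk group.
times = [1, 2, 2, 3, 4]
events = [1, 0, 1, 1, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Running the estimator separately on the high-risk and low-risk groups, as defined by a score cutoff, gives the two curves whose separation is then assessed.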
How to cite
Zhang, W., et al. (2025). An Integrative Multi-Omics Random Forest Framework for Robust Biomarker Discovery. GigaScience. https://doi.org/10.1093/gigascience/giaf148.
Code and a step-by-step vignette are available at github.com/TransBioInfoLab/multiRF-vs and novawz.github.io/multiRF.