MRF-IMD — Variable Selection

§01 · Summary

Nonlinear multi-omics biomarker discovery with random forests

Abstract

Current high-throughput platforms generate rich multi-omics datasets, but discovering shared biomarkers across layers is difficult when biological relationships are nonlinear. Classical methods (SPLS, CCA, RGCCA) assume linear co-variation and miss nonlinear hubs. We introduce MRF-IMD, an unsupervised framework that fits a multivariate random forest across omics blocks and ranks features by a new Inverse Minimal Depth (IMD) importance measure. Three complementary selection strategies — filter, mixture, and transform — convert IMD into biomarker sets suited to parsimonious, balanced, or subtle-signal regimes. MRF-IMD is robust in nonlinear simulation, uncovers eight coherent pan-cancer groups on TCGA, and improves dementia-conversion prediction in ADNI.

§02 · Framework

The MRF-IMD workflow

The framework takes matched multi-omics data and produces a ranked biomarker list. One omics block is treated as the predictor matrix X; the other as a multivariate response Y. Splits are chosen to maximise joint heterogeneity of Y, so features sitting near tree roots are those that partition the response most cleanly across the full multivariate surface.
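The split rule can be illustrated with a minimal sketch. It assumes that heterogeneity of Y is measured as the within-node sum of squares summed over all response columns, which is one common choice for multivariate regression trees; the source does not state the exact criterion. The helper names `node_impurity` and `best_split` are hypothetical and the data are toy values.

```python
import numpy as np

def node_impurity(Y):
    """Summed squared deviation from the column means, over all response columns."""
    if len(Y) == 0:
        return 0.0
    return float(((Y - Y.mean(axis=0)) ** 2).sum())

def best_split(X, Y):
    """Return (feature, threshold, gain) for the split with the largest impurity drop.

    Assumption: impurity is the total within-node sum of squares of Y.
    """
    parent = node_impurity(Y)
    best = (None, None, 0.0)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        # Candidate thresholds: midpoints between consecutive distinct values.
        for t in (values[:-1] + values[1:]) / 2.0:
            left = X[:, j] <= t
            gain = parent - node_impurity(Y[left]) - node_impurity(Y[~left])
            if gain > best[2]:
                best = (j, float(t), gain)
    return best

# Toy data: feature 1 drives both response columns through a threshold at 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
step = (X[:, 1] > 0).astype(float)
Y = np.column_stack([2.0 * step, -3.0 * step]) + 0.1 * rng.normal(size=(200, 2))

feature, threshold, gain = best_split(X, Y)
print(feature, round(threshold, 2))
```

Because the gain is computed on all columns of Y at once, a feature that shifts several response variables together is preferred over one that affects a single column by the same amount.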

i. Multi-omics input

Matched samples with two omics layers (e.g. expression + methylation). One serves as predictor, the other as multivariate response.

ii. Multivariate forest

Fit a random forest in which each split maximises heterogeneity of the multivariate response Y.

iii. Compute IMD

Inverse Minimal Depth: features near the root (small depth) receive high IMD scores.

iv. Select biomarkers

Apply filter, mixture, or transform strategy to pick a robust shared biomarker set.

Figure 1. Four-step MRF-IMD pipeline. The forest is directed (X predicts Y), so it is fitted a second time with X and Y swapped to give a symmetric pair of importance rankings; the two IMD rankings are then combined.
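The IMD step can be sketched as follows. The source gives no formula, so this assumes IMD is the forest average of 1 / (1 + minimal depth), and that a feature never used in a tree is assigned that tree's maximum depth plus one. Both choices, the (feature, depth) encoding of a tree, and the function names are illustrative assumptions rather than the published definition.

```python
import numpy as np

def minimal_depths(tree_splits, n_features):
    """Minimal depth of each feature in one tree.

    tree_splits: iterable of (feature_index, depth) pairs, with the root at depth 0.
    Features never used in the tree get max depth + 1 (assumed convention).
    """
    splits = list(tree_splits)
    max_depth = max((d for _, d in splits), default=-1)
    depths = np.full(n_features, max_depth + 1, dtype=float)
    for feature, depth in splits:
        depths[feature] = min(depths[feature], depth)
    return depths

def inverse_minimal_depth(forest, n_features):
    """Assumed IMD: mean over trees of 1 / (1 + minimal depth)."""
    per_tree = np.array([1.0 / (1.0 + minimal_depths(t, n_features)) for t in forest])
    return per_tree.mean(axis=0)

# Toy forest of three trees over four features.
forest = [
    [(0, 0), (2, 1), (1, 1), (2, 2)],
    [(0, 0), (1, 1), (3, 2)],
    [(2, 0), (0, 1)],
]
imd = inverse_minimal_depth(forest, n_features=4)
ranking = np.argsort(-imd)  # feature indices, highest IMD first
print(ranking, np.round(imd, 3))
```

In this toy forest, feature 0 splits at the root in two of three trees and therefore ranks first, while feature 3 appears only once, deep in one tree, and ranks last.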

§03 · Selection strategies

Three ways to turn IMD into a biomarker list

The same IMD scores admit several thresholding policies. We offer three strategies with different operating characteristics — each chooses a different point on the parsimony / sensitivity curve:

Strategy A

Filter

Retain variables whose IMD score exceeds the threshold τ·σ. Produces the most parsimonious, stable signatures, e.g. a ~73-gene panel in BRCA.

Strategy B

Mixture

Fit a two-component mixture model to separate signal from noise in the IMD distribution. A balanced trade-off.

Strategy C

Transform

Standardise IMD by a t-score. Best for picking up subtle signals and variables that act through interactions.
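As a rough illustration of the filter and mixture strategies, the sketch below applies both to a toy IMD vector. It rests on assumptions not stated in the source: the filter is written as mean + τ·σ with σ the standard deviation of the scores, and the mixture is a two-component one-dimensional Gaussian mixture fitted by EM, with the higher-mean component treated as signal. The function names and default values are hypothetical.

```python
import numpy as np

def filter_select(imd, tau=2.0):
    """Keep features scoring above mean + tau * sd (assumed form of the tau*sigma rule)."""
    imd = np.asarray(imd, dtype=float)
    return np.flatnonzero(imd > imd.mean() + tau * imd.std())

def mixture_select(imd, n_iter=200):
    """Fit a two-component 1-D Gaussian mixture by EM; keep the higher-mean component."""
    x = np.asarray(imd, dtype=float)
    # Initialise one component at each extreme of the score range.
    mu = np.array([x.min(), x.max()])
    sd = np.array([x.std(), x.std()]) + 1e-8
    weight = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each score.
        dens = weight * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update mixing weights, means, and standard deviations.
        nk = resp.sum(axis=0)
        weight = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-8
    signal = int(np.argmax(mu))
    return np.flatnonzero(resp[:, signal] > 0.5)

# Toy scores: 95 low "noise" features followed by 5 high "signal" features.
scores = np.concatenate([np.linspace(0.05, 0.15, 95), [0.55, 0.58, 0.60, 0.62, 0.65]])
picked_filter = filter_select(scores)
picked_mixture = mixture_select(scores)
print(picked_filter, picked_mixture)
```

On cleanly separated scores the two rules agree. They would be expected to differ when the signal component overlaps the noise, where a fixed threshold stays conservative and the mixture adapts to the shape of the distribution.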

§04 · Simulation

Benchmark against linear multi-omics methods

We simulated matched-omics data under both linear (latent-factor) and nonlinear (interaction + threshold) generative models and measured feature-recovery via area under the precision-recall curve (PR-AUC). The gap opens in the nonlinear regime:

Figure 2

PR-AUC across simulation regimes

Figure 2. Feature-recovery PR-AUC for MRF-IMD, SPLS, PMDCCA, and RGCCA. In linear settings MRF-IMD (~0.90) is comparable to SPLS and PMDCCA; in nonlinear simulations the linear methods collapse toward random while MRF-IMD stays at 0.71 or above.

Linear regime. MRF-IMD reaches ~0.90 PR-AUC, comparable to SPLS (0.94) and PMDCCA (0.90). It is competitive even when assumptions favour linear methods.
Nonlinear regime. MRF-IMD holds ~0.71–0.81 PR-AUC, while SPLS and PMDCCA fall to ~0.04–0.15. This is the central advantage of the tree-based framework.
Ensemble baselines. Univariate ensemble learners (GBM, XGBoost) adapted to the multi-omics task also underperform, reinforcing the value of the multivariate split criterion.
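For readers unfamiliar with the metric, feature-recovery PR-AUC can be estimated from a ranked importance vector and the set of truly informative features. The sketch below uses average precision, a common step-wise estimator of the area under the precision-recall curve; the source does not say which estimator was used, and the function name and toy values are illustrative.

```python
import numpy as np

def average_precision(is_true_feature, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    truth = np.asarray(is_true_feature, dtype=bool)
    # Rank features from highest to lowest importance score.
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = truth[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    # Average the precision at each rank where a true feature is recovered.
    return float(precision_at_k[hits].sum() / hits.sum())

truth = [1, 0, 1, 0, 0]                  # features 0 and 2 are truly informative
scores = [0.9, 0.8, 0.7, 0.2, 0.1]       # importance scores from some method
ap = average_precision(truth, scores)
perfect = average_precision(truth, [0.9, 0.1, 0.8, 0.2, 0.3])
print(round(ap, 3), perfect)
```

A random ranking scores close to the fraction of true features, which is why values of ~0.04–0.15 for the linear methods are read as near-random when informative features are rare.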

§05 · Pan-cancer

Coherent molecular groups across 22 TCGA cancer types

Applied to a pan-cancer TCGA cohort spanning 22 disease types, MRF-IMD-selected features recovered eight coherent molecular groups under IntNMF clustering. The groupings respect lineage where expected and cut across tissue of origin where the biology is shared.

Group	Label	Composition
G1	Basal-like & UCEC	High genomic instability; TP53-driven.
G2	Hepatobiliary	LIHC, CHOL — hepatic-lineage tumours.
G3	Hypermutated	BLCA, SKCM — high mutational burden.
G4	Non-basal BRCA	Luminal and HER2+ subtypes.
G5	GI adenocarcinoma	COAD, STAD, ESCA — CIN phenotype.
G6	Endocrine	ACC, PCPG — hormone-secreting neoplasms.
G7	Squamous (SCC)	HNSC, LUSC — RTK-RAS activation.
G8	Renal epithelial	KIRC, KIRP — VHL/HIF alterations.

Table 1. Pan-cancer groups recovered from MRF-IMD–selected features. In raw-data controls these groupings collapse into tissue-of-origin blocks, suggesting that the framework surfaces shared molecular programs that transcend lineage alone.

§06 · ADNI

Dementia-conversion prediction in ADNI

We integrated blood DNA-methylation and gene-expression for 538 ADNI participants (CN vs. MCI) and used MRF-IMD–selected features to stratify risk of cognitive decline. Top-ranked genes included ARL11, S1PR1, and DAPK2 — all with established links to neuro-inflammation.

ARL11	0.24
S1PR1	0.18
DAPK2	0.15

Figure 3

Dementia-conversion survival by MRF-IMD risk score

Figure 3. Kaplan-Meier curves (simulated for visualisation) showing separation of high- vs. low-risk groups by MRF-IMD score.

MRF-IMD
P = 0.033
Significant separation of high- vs. low-risk groups for incident dementia.

SPLS baseline
P = 0.60
No significant stratification; linear integration fails to capture the relevant signal.
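A comparison of this kind can be sketched with a two-group log-rank test on risk groups formed by a median split of the score. The source does not state which test or cut-point produced the reported P values, so both are assumptions here; the function name and the toy follow-up data are hypothetical.

```python
import math
import numpy as np

def logrank_p(time, event, group):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    group = np.asarray(group, dtype=bool)
    observed_minus_expected = 0.0
    variance = 0.0
    for t in np.unique(time[event]):
        at_risk = time >= t
        n = at_risk.sum()                       # subjects still at risk at time t
        n1 = (at_risk & group).sum()            # of which in the high-risk group
        d = (event & (time == t)).sum()         # events at time t
        d1 = (event & (time == t) & group).sum()
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = observed_minus_expected ** 2 / variance
    # Chi-square survival function with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

# Toy cohort: 20 subjects, all converting, with higher risk scores converting earlier.
time = np.arange(1, 21)
event = np.ones(20, dtype=bool)
risk = 20 - time
high_risk = risk > np.median(risk)              # assumed median split
chi2, p = logrank_p(time, event, high_risk)
print(round(chi2, 2), p)
```

Censored participants enter through `event = False`: they count in the at-risk set until their last follow-up time but never contribute an event.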

§07 · Citation

How to cite

Zhang et al., 2025

Zhang, W., et al. (2025). An Integrative Multi-Omics Random Forest Framework for Robust Biomarker Discovery. GigaScience. https://doi.org/10.1093/gigascience/giaf148.

Code and a step-by-step vignette are available at github.com/TransBioInfoLab/multiRF-vs and novawz.github.io/multiRF.