multiRF: fast multivariate random forests for multi-omics integration

An R package for integrating matched multi-omics datasets through multivariate random forests.

Author: Wei Zhang
License: MIT
Requires: R ≥ 4.3
Source: github.com/novawz/multiRF

1. Introduction

High-throughput molecular profiling now routinely produces matched measurements across genomics, epigenomics, transcriptomics, and proteomics for the same set of samples. Jointly analyzing these data layers can reveal biological structure that no single omics platform captures alone, but doing so requires methods that handle heterogeneous feature spaces, mixed signal-to-noise ratios, and block-wise correlation without strong distributional assumptions.

multiRF addresses this through multivariate random forests. Each tree splits on predictors from one omics block while optimizing a composite criterion over responses from another, yielding forest-weight and proximity matrices that encode cross-platform sample similarity. Rather than relying on a predefined distance metric, the forest learns a sample-by-sample similarity structure directly from the data: two samples are similar when trees consistently route them to the same terminal node. The package then decomposes this learned similarity into shared (cross-omics) and specific (within-omics) components and clusters samples in each space separately, producing both a global integrative clustering and per-platform subtype assignments. An enhanced proximity mode further refines these similarities by incorporating sibling-leaf relationships (pairs of leaves that share the same parent split), weighted by the correlation of their sample profiles. Variable importance is quantified via inverse minimal depth (IMD), which measures how early each feature enters the tree ensemble and how much it contributes to the multivariate split criterion.

The computational backend is a purpose-built C++ engine (via Rcpp) that constructs multivariate regression and unsupervised forests with OpenMP thread-level parallelism. Forest weights, proximity matrices, and optionally enhanced proximity with sibling-leaf corrections are all computed during tree construction in a single pass, avoiding the overhead of post-hoc R-level tree traversal. On typical multi-omics datasets (500-1000 samples, 3 platforms, 300 trees), the native engine runs 20-30% faster than an equivalent randomForestSRC pipeline while producing statistically equivalent output.

2. Multivariate splitting

To model each cross-modal connection, multiRF fits a multivariate random forest with one omics block as the predictor matrix and another as the multivariate response. Unlike a univariate regression forest, each tree is grown by a composite splitting rule that considers multiple response dimensions jointly. The splitting rule implemented here is a variation of the multivariate random forest framework of Li and Xiao (2011) and the normalized composite criterion in randomForestSRC (Tang and Ishwaran, 2017; rfsrc mvsplit vignette), re-implemented in C++ for speed.

At each candidate split in node \(t\), a random subset \(\mathcal{J} \subset \{1,\dots,p_a\}\) of \(q_{\mathrm{try}}\) response columns is drawn (analogous to the predictor-side mtry but applied to responses), and the selected responses are standardised to unit within-node variance. The split score is

$$ G_q(s,t) = \sum_{j \in \mathcal{J}} G_j(s,t), \qquad G_j(s,t) = \frac{1}{\hat\sigma_j^2}\biggl[ \frac{1}{n_{t_L}} \Bigl(\sum_{i \in t_L} Y_{ij}^{*}\Bigr)^{\!2} + \frac{1}{n_{t_R}} \Bigl(\sum_{i \in t_R} Y_{ij}^{*}\Bigr)^{\!2}\biggr] $$

where \(Y_{ij}^{*}\) is the centred value of response variable \(j\) for sample \(i\), \(\hat\sigma_j^2\) is the within-node variance of column \(j\), and \(t_L, t_R\) denote the left and right daughter nodes with sizes \(n_{t_L}, n_{t_R}\). Normalizing each \(G_j\) by its own variance ensures all response columns contribute on a comparable scale regardless of their marginal dispersion. The partition that maximizes \(G_q\) is selected.

The default \(q_{\mathrm{try}} = \lceil p_a / 3 \rceil\) balances information content per split against response-level diversity across nodes: each split focuses on a random subspace of \(p_a\), so different splits within the same tree,and different trees within the forest,learn complementary aspects of the joint response structure. When \(p_a \gg n\), restricting the split evaluation to \(q_{\mathrm{try}} \ll p_a\) random response columns regularises the criterion, reducing its effective variance while preserving the multivariate character of the tree.

At every node, mtry predictor columns and ytry response columns are sampled independently. When nsplit > 0 (default 10), candidate cutpoints are drawn at random from the set of distinct predictor values; setting nsplit = 0 evaluates all distinct cutpoints (deterministic splitting). Nodesize is enforced at the parent level: a node is eligible for splitting only when it contains at least \(2 \times \texttt{nodesize}\) samples, and individual cutpoints require only that both children are non-empty. The entire split search, tree construction, and matrix accumulation are implemented in C++ with OpenMP parallelism, avoiding the overhead of R-level tree traversal that would otherwise dominate runtime for forests with hundreds of trees.

3. Installation

> remotes::install_github("novawz/multiRF")

The native C++ engine compiles from source during installation and requires a C++17 compiler (Rtools on Windows, Xcode CLI on macOS, g++ or clang++ on Linux). OpenMP support is optional but recommended for parallel tree construction. randomForestSRC is not required; the package uses its own forest implementation by default.

4. Toy data

A subsampled TCGA BRCA dataset is bundled for demonstration. It contains three matched omics blocks (gene expression, methylation, miRNA) and clinical annotations including PAM50 subtypes.

library(multiRF)
data("tcga_brca_data")

names(tcga_brca)
dim(tcga_brca$gene)
dim(tcga_brca$methy)
dim(tcga_brca$mirna)
#> [1] "gene"  "methy" "mirna"
#> [1] 674 200
#> [1] 674 200
#> [1] 674 100

5. Clustering workflow

fit <- mrf3(
  tcga_brca,
  k = 4,
  ntree = 100,
  filter_mode = "none",
  run_imd = TRUE,
  seed = 529
)

summary(fit)
table(get_clusters(fit))
get_top_vars(fit, n = 10)

Comparing backends

fit_proximity <- mrf3(
  tcga_brca, k = 4,
  main_clustering = "proximity",
  ntree = 100, seed = 529
)

fit_enhanced <- mrf3(
  tcga_brca, k = 4,
  main_clustering = "enhanced_proximity",
  ntree = 100, seed = 529
)

Compare against PAM50

common_ids <- intersect(
  names(get_clusters(fit)),
  tcga_brca_clinical$sampleID
)
pam50 <- tcga_brca_clinical$BRCA_Subtype_PAM50[
  match(common_ids, tcga_brca_clinical$sampleID)
]

cluster_metric_matrix(
  list(
    similarity         = get_clusters(fit)[common_ids],
    proximity          = get_clusters(fit_proximity)[common_ids],
    enhanced_proximity = get_clusters(fit_enhanced)[common_ids],
    PAM50              = pam50
  ),
  metric = "ari"
)

6. Variable selection

fit_vs <- mrf3_fit(
  dat.list = tcga_brca,
  clustering_args = list(shared_k = 4, specific_k = 4),
  ntree = 100,
  run_imd = TRUE,
  run_variable_selection = TRUE,
  variable_selection_args = list(
    method = "mixture",
    level = 0.05,
    normalized = TRUE
  ),
  seed = 529
)

get_selected_vars(fit_vs)
get_vs_summary(fit_vs)

Compare methods

vs_test <- mrf3_vs(
  mod = fit_vs, dat.list = tcga_brca,
  method = "test", re_fit = FALSE, level = 0.05
)
vs_mixture <- mrf3_vs(
  mod = fit_vs, dat.list = tcga_brca,
  method = "mixture", re_fit = FALSE, level = 0.05
)

lapply(
  list(test = vs_test, mixture = vs_mixture),
  get_selected_vars
)

7. Robust clustering

fit_robust <- mrf3_fit(
  dat.list = tcga_brca,
  clustering_args = list(shared_k = 4, specific_k = 4),
  ntree = 100,
  run_imd = TRUE,
  run_variable_selection = TRUE,
  run_cluster_imd = FALSE,
  run_robust_clustering = TRUE,
  variable_selection_args = list(method = "mixture"),
  seed = 529
)

table(get_clusters(fit_robust, which = "robust"))
summary(fit_robust)

Pairwise IMD

pw <- pairwise_imd(fit_robust)
plot_network(pw, label = TRUE, vertex.size = 3, vertex.label.cex = 0.4)
Feature pairwise IMD network
Feature network from pairwise IMD, colored by omics block.

8. Plots

t-SNE

plot_tsne(
  mod = fit,
  group = get_clusters(fit),
  source = "specific_shared",
  main = "t-SNE: shared/specific branch"
)
t-SNE embedding
t-SNE embedding colored by cluster assignments.

Sample network

plot_network(
  dat = fit,
  group = get_clusters(fit),
  source = "specific_shared",
  cutoff = 0.01
)
Sample similarity network
Sample similarity network colored by cluster.

Feature importance

plot_weights(fit_vs, weight_source = "imd", top = 15)
IMD weights
Top 15 features by IMD weight for each omics block.

Survival

clinical_ids <- tcga_brca_clinical$sampleID
clusters <- get_clusters(fit)
if (is.null(names(clusters))) {
  names(clusters) <- rownames(tcga_brca[[1]])
}
km_ids <- intersect(names(clusters), clinical_ids)
clinical_sub <- tcga_brca_clinical[match(km_ids, clinical_ids), , drop = FALSE]
keep <- !is.na(clinical_sub$OS.time) & !is.na(clinical_sub$OS)

plot_km(
  test_var = factor(clusters[km_ids][keep]),
  time_var = "OS.time",
  event_var = "OS",
  pheno_mat = clinical_sub[keep, , drop = FALSE]
)
Kaplan-Meier curves
Kaplan-Meier survival curves stratified by cluster.

9. API reference

Core workflow

FunctionPurpose
mrf3()Compact end-to-end workflow (fit + cluster).
mrf3_fit()Staged workflow with full parameter surface.
mrf3_vs()Variable selection on existing fit.
mrf3_stability()Resampling-based stability assessment.
pairwise_imd()Variable-level pairwise co-occurrence network.
filter_omics()Pre-fitting feature filtering.

Accessors

FunctionReturns
get_clusters()Cluster assignments.
get_weights()Per-block feature importance weights.
get_top_vars()Top-n variables by IMD weight.
get_selected_vars()Selected variable names from VS.
get_vs_summary()Variable-selection summary table.
get_connection()Block pairing structure.

Plotting

FunctionInput
plot_tsne()Similarity or reconstruction
plot_umap()Similarity or reconstruction
plot_network()Similarity graph or pairwise IMD
plot_weights()IMD weights
plot_km()Clusters + clinical
plot_circos()Adjacency structure

Metrics

FunctionReturns
cluster_metrics()ARI, Jaccard, NMI, purity.
cluster_metric_matrix()Pairwise metric matrix.

10. References

  1. Zhang, W. et al. (2025). An integrative multi-omics random forest framework for robust biomarker discovery. GigaScience, 14, giaf148. doi:10.1093/gigascience/giaf148
  2. Li, M. and Xiao, Y. (2011). Multivariate random forests. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(1), 80-87.
  3. Ishwaran, H. et al. (2008). Random survival forests. Annals of Applied Statistics, 2(3), 841-860.
  4. Tang, F. and Ishwaran, H. (2017). Random forest missing data algorithms. Statistical Analysis and Data Mining, 10(6), 363-377.
  5. Ishwaran, H. and Kogalur, U.B. randomForestSRC: multivariate splitting rule. https://www.randomforestsrc.org/articles/mvsplit.html