A challenging problem in multivariate statistics is to study relationships between several sets of variables measured on the same set of individuals. This paradigm is referred to by several names, including “learning from multimodal data”, “data integration”, “multiview data”, “multisource data”, “data fusion”, or “multiblock data analysis”. Despite the availability of various statistical methods and dedicated software for multiblock data analysis, handling multiple highly multivariate datasets remains a critical issue for effective analysis and knowledge discovery.
Regularized generalized canonical correlation analysis (RGCCA) is a unified statistical framework that gathers six decades of multiblock component methods. While RGCCA encompasses a large number of methods, it is based on a single optimization problem that bears immediate practical implications, facilitating statistical analyses and implementation strategies. In this paper, we present the R package RGCCA, which implements the RGCCA framework.
For the R statistical computing environment, various packages have come to the fore for carrying out multiblock data analysis. The bioinformatics community mostly uses mixOmics, which wraps the main functions of the RGCCA package for performing multiblock analyses.
The ade4 package covers a wide range of multivariate methods. It mainly relies on three approaches for performing multiblock analysis: multiple co-inertia analysis (MCOA), multiple factor analysis (MFA), and Statis.
FactoMineR is also a widely used package for multiblock methods, mainly due to its implementation of MFA and, to a somewhat lesser extent, generalized Procrustes analysis (GPA).
The multiblock package covers a wide range of multiblock methods for data fusion but relies heavily on a wrapper strategy for performing multiblock analyses. Consequently, this results in strong dependencies between the multiblock package and other packages, including FactoMineR (for MFA and GPA) and the packages implementing hierarchical PCA, MCOA, inter-battery factor analysis, and Carroll's GCCA, among others.
For the sake of completeness, we also mention a package that can be used for predicting a univariate response (a member of the exponential family of distributions) from several blocks of variables.
Each package uses its own way of specifying multiblock models and storing the results.
For Python users, mvlearn seems to be the most advanced module for multiview data. It offers a suite of algorithms for learning latent space embeddings and joint representations of views, including a limited version of the RGCCA framework, a kernel version of GCCA, and deep CCA. Several other methods for dimensionality reduction and joint subspace learning are also included, such as multiview multidimensional scaling.
From this perspective, it appears that the RGCCA package already plays a central role within the R community and beyond for conducting multiblock analyses. Special attention has also been paid to ensuring that RGCCA's implementation of various multiblock component methods, including MCOA and MFA, aligns with the results obtained from other packages.
Within the RGCCA package, all implemented methods from the literature are made available through the single rgcca() function and thus share the same function interface and a clear class structure. In addition to the main statistical functionalities, the package provides several utility functions for data preprocessing and offers various advanced visualization tools to help users interpret and extract meaningful information from integrated data. It also includes metrics for assessing the robustness and significance of the analysis, as well as several built-in datasets and examples to help users get started quickly. The RGCCA package is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=RGCCA and can be installed from the console with the following command:
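install.packages("RGCCA")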
This paper presents an overview of the RGCCA framework’s theoretical foundations, summarizes the optimization problem under which all the algorithms were designed, and provides code examples to illustrate the package’s usefulness and versatility. Our package offers a valuable contribution to the field of multiblock data analysis and will enable researchers to conduct more effective analyses and gain new insights into complex datasets.
The paper’s remaining sections are organized as follows: Section 2 presents the RGCCA framework, how it generalizes existing methods from the literature, and the master algorithm underlying the RGCCA framework. Section 3 presents the structure of the package and provides code examples to illustrate the package’s capabilities. Finally, we conclude the paper in Section 4.
We first introduce the optimization problem defining the RGCCA framework, previously published in . We then show how it includes many existing methods.
Let x be a random column vector of p variables such that x is the concatenation of J subvectors x1, …, xJ, with xj = (xj1, …, xjpj)⊤ for j ∈ {1, …, J}. We assume that x has zero mean and a finite second-order moment. Its covariance matrix Σ is then composed of J2 submatrices: Σjk = 𝔼[xjxk⊤] for (j, k) ∈ {1, …, J}2. Let aj = (aj1, …, ajpj)⊤ be a non-random pj-dimensional column vector. A composite variable yj is defined as the linear combination of the elements of xj: yj = aj⊤xj. Therefore, the covariance between two composite variables yj and yk is cov(yj, yk) = aj⊤Σjkak.
The RGCCA framework aims to extract the information shared by the J random composite variables, taking into account an undirected graph of connections between them. It consists in maximizing the sum of the covariances between “connected” composites yj and yk subject to specific constraints on the weights aj for j ∈ {1, …, J}. The following optimization problem thus defines the RGCCA framework:
$$\underset{\mathbf{a}_1, \ldots, \mathbf{a}_J}{\text{maximize}} \; \sum_{j, k = 1}^{J} c_{jk} \, g\!\left(\mathbf{a}_j^\top \boldsymbol{\Sigma}_{jk} \mathbf{a}_k\right) \quad \text{s.t.} \quad \mathbf{a}_j \in \Omega_j, \; j = 1, \ldots, J,$$

where g is a continuous convex function, C = (cjk) is a symmetric J × J design matrix with nonnegative elements describing the network of connections between blocks, and Ωj is a compact subset of ℝpj.

A sample-based optimization problem related to the population problem above can be defined by considering n observations of the vector x. It yields a column-partitioned data matrix X = [X1, …, Xj, …, XJ] ∈ ℝn × p. Each n × pj data matrix Xj is called a block and represents a set of pj variables observed on the same set of n individuals. The variables' number and nature may differ from one block to another, but the individuals must be the same across blocks. We assume that all variables are centered.
The RGCCA package proposes a first implementation of the RGCCA framework, focusing on the following sample-based optimization problem:

$$\underset{\mathbf{a}_1, \ldots, \mathbf{a}_J}{\text{maximize}} \; \sum_{j, k = 1}^{J} c_{jk} \, g\!\left(\mathbf{a}_j^\top \widehat{\boldsymbol{\Sigma}}_{jk} \mathbf{a}_k\right) \quad \text{s.t.} \quad \mathbf{a}_j \in \Omega_j, \; j = 1, \ldots, J,$$
where $\widehat{\mathbf{\Sigma}}_{jk}= n^{-1} \mathbf X_j^\top \mathbf X_k$ is an estimate of the interblock covariance matrix Σjk = 𝔼[xjxk⊤] and $\widehat{\boldsymbol \Sigma}_{jj}$ is an estimate of the intra-block covariance matrix Σjj = 𝔼[xjxj⊤]. As we will see in the next section, an important family of methods is recovered by choosing $\Omega_j = \{\mathbf a_j \in \mathbb{R}^{p_j}; \mathbf a_j^\top \widehat{\boldsymbol \Sigma}_{jj} \mathbf a_j = 1\}$. In cases involving multi-collinearity within blocks or in high dimensional settings, one way of obtaining an estimate for the true covariance matrix Σjj is to consider the class of linear convex combinations of the sample covariance matrix Sjj = n−1Xj⊤Xj and the identity matrix Ipj. Therefore, $\widehat{\mathbf{\Sigma}}_{jj} = (1-\tau_j)\mathbf{S}_{jj} + \tau_j\mathbf{I}_{p_j}$ with τj ∈ [0, 1] (shrinkage estimator of Σjj).
From this viewpoint, an equivalent formulation of the optimization problem is given hereafter and enables a better description of the objective of RGCCA:

$$\underset{\mathbf{a}_1, \ldots, \mathbf{a}_J}{\text{maximize}} \; \sum_{j, k = 1}^{J} c_{jk} \, g\!\left(\widehat{\mathrm{cov}}(\mathbf{X}_j \mathbf{a}_j, \mathbf{X}_k \mathbf{a}_k)\right) \quad \text{s.t.} \quad (1 - \tau_j)\,\widehat{\mathrm{var}}(\mathbf{X}_j \mathbf{a}_j) + \tau_j \|\mathbf{a}_j\|_2^2 = 1, \; j = 1, \ldots, J.$$
The objective of RGCCA is thus to find block components yj = Xjaj for j ∈ {1, …, J}, summarizing the relevant information between and within the blocks. The τjs, called shrinkage parameters, range from 0 to 1 and interpolate smoothly between maximizing the covariance and maximizing the correlation. Setting τj to 0 forces the block components to unit variance, in which case the covariance criterion boils down to the correlation criterion. Setting τj to 1 normalizes the block-weight vectors, which yields the covariance criterion. A value between 0 and 1 leads to a compromise between these two options. The choice of the shrinkage parameters can be discussed by providing interpretations of the properties of the resulting block components.
It is worth pointing out that for each block j, an appropriate shrinkage parameter τj can be obtained using various analytical formulae. As $\widehat{\mathbf{\Sigma}}_{jj}$ must be positive definite, τj = 0 can only be selected for a full-rank data matrix Xj. In the RGCCA package, the determination of the shrinkage parameter of each block can be made fully automatic by using the analytical formula of Schäfer and Strimmer, or guided by the context of an application through cross-validation or permutation.
From the optimization problem (), the term “generalized” in the acronym RGCCA embraces at least four notions. The first relates to the generalization of two-block methods, including canonical correlation analysis, inter-battery factor analysis, and redundancy analysis, to three or more sets of variables. The second relates to the ability to take into account hypotheses on between-block connections: the user decides which blocks are connected and which are not. The third relies on the choice of the shrinkage parameters, allowing the capture of both correlation-based and covariance-based criteria. The fourth relates to the function g, which enables considering different functions of the covariance. This generalization is embodied by the triplet of parameters (g, τj, C) and by the fact that an arbitrary number of blocks can be handled. This triplet offers flexibility and allows RGCCA to encompass a large number of multiblock component methods published over the last sixty years. The tables below give the correspondences between the triplet (g, τj, C) and the multiblock component methods. For a complete overview, see .
Two families of methods have come to the fore in the field of multiblock data analysis, relying on correlation-based or covariance-based criteria. Canonical correlation analysis is the seminal work for the first family, and Tucker's inter-battery factor analysis for the second. These two methods have been extended to more than two blocks in many ways:
Main contributions for generalized canonical correlation analysis (GCCA) are found in .
Main contributions for extending Tucker’s method to more than two blocks come from .
A “mixed” correlation and covariance criterion has also been proposed. Redundancy analysis combines correlation and variance in the two-block situation; this method has been extended to the multiblock situation.
In the two-block case, the optimization problem () reduces to:

$$\underset{\mathbf{a}_1, \mathbf{a}_2}{\text{maximize}} \; \mathbf{a}_1^\top \widehat{\boldsymbol{\Sigma}}_{12} \mathbf{a}_2 \quad \text{s.t.} \quad \mathbf{a}_j^\top \left[(1 - \tau_j)\mathbf{S}_{jj} + \tau_j \mathbf{I}_{p_j}\right] \mathbf{a}_j = 1, \; j = 1, 2.$$
This problem has been introduced under the name of regularized canonical correlation analysis. For the extreme cases τ1 ∈ {0, 1} and τ2 ∈ {0, 1}, optimization problem () covers situations ranging from canonical correlation analysis to Tucker's inter-battery factor analysis, passing through redundancy analysis. This framework corresponds exactly to previously proposed ones and is reported in the table below.
Methods | g(x) | τj | C |
---|---|---|---|
Canonical correlation analysis | x | τ1 = τ2 = 0 | $\mathbf{C}_1 = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ |
Inter-battery factor analysis or PLS regression | x | τ1 = τ2 = 1 | C1 |
Redundancy analysis | x | τ1 = 1 ; τ2 = 0 | C1 |
Regularized redundancy analysis | x | 0 ≤ τ1 ≤ 1 ; τ2 = 0 | C1 |
Regularized canonical correlation analysis | x | 0 ≤ τ1 ≤ 1 ; 0 ≤ τ2 ≤ 1 | C1 |
In the multiblock data analysis literature, all blocks Xj, j = 1, …, J are assumed to be connected, and many criteria have been proposed to find block components satisfying some covariance-based or correlation-based optimality. Most of them are special cases of the optimization problem (). These multiblock component methods are listed in the table below, along with PLS path modeling. The great flexibility of PLS path modeling lies in the possibility of taking into account certain hypotheses on connections between blocks: the researcher decides which blocks are connected and which are not.
Methods | g(x) | τj | C |
---|---|---|---|
SUMCOR | x | τj = 0, j = 1, …, J | $\mathbf{C}_2 = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \ddots & \vdots \\ \vdots & \ddots& \ddots & 1\\ 1 & \cdots & 1 & 1 \end{pmatrix}$ |
SSQCOR | x2 | τj = 0, j = 1, …, J | C2 |
SABSCOR | |x| | τj = 0, j = 1, …, J | C2 |
SUMCOV-1 | x | τj = 1, j = 1, …, J | C2 |
SSQCOV-1 | x2 | τj = 1, j = 1, …, J | C2 |
SABSCOV-1 | |x| | τj = 1, j = 1, …, J | C2 |
SUMCOV-2 | x | τj = 1, j = 1, …, J | $\mathbf{C}_3 = \begin{pmatrix} 0 & 1 & \cdots & 1 \\ 1 & 0 & \ddots & \vdots\\ \vdots & \ddots& \ddots& 1\\ 1 & \cdots & 1 & 0 \end{pmatrix}$ |
SSQCOV-2 | x2 | τj = 1, j = 1, …, J | C3 |
PLS path modeling - mode B | |x| | τj = 0, j = 1, …, J | cjk = 1 for two connected blocks and cjk = 0 otherwise |
Many multiblock component methods aim to simultaneously find block components and a global component. For that purpose, we consider J blocks X1, …, XJ connected to a (J + 1)th block defined as the concatenation of all blocks, XJ + 1 = [X1, X2, …, XJ]. Several criteria have been introduced in the literature; many are listed in the table below.
Methods | g(x) | τj | C |
---|---|---|---|
Generalized CCA | x2 | τj = 0, j = 1, …, J + 1 | $\mathbf{C}_4 = \begin{pmatrix} 0 & \cdots & 0 & 1 \\ \vdots & \ddots & \vdots & \vdots\\ 0 & \cdots & 0 & 1\\ 1 & \cdots & 1 & 0 \end{pmatrix}$ |
Generalized CCA | x2 | τj = 0, j = 1, …, J1 ; τj = 1, j = J1 + 1, …, J | C4 |
Hierarchical PCA | x4 | τj = 1, j = 1, …, J ; τJ + 1 = 0 | C4 |
Multiple co-inertia analysis | x2 | τj = 1, j = 1, …, J ; τJ + 1 = 0 | C4 |
Multiple factor analysis | x2 | τj = 1, j = 1, …, J + 1 | C4 |
It is quite remarkable that the single optimization problem () offers a framework for all the multiblock component methods referenced in the tables above. This has immediate practical consequences for a unified statistical analysis and implementation strategy. In the next section, we present the straightforward gradient-based algorithm for solving the RGCCA optimization problem. This algorithm is monotonically convergent and reaches a stationary point at convergence. Two numerically equivalent approaches for solving the RGCCA optimization problem are available: a primal formulation that requires handling matrices of dimension pj × pj, and a dual formulation that requires handling matrices of dimension n × n. Therefore, the primal formulation of the RGCCA algorithm is preferred when n > pj, and the dual form is used when n ≤ pj. The rgcca() function of the package implements these two formulations and automatically selects the most appropriate one.
RGCCA is a component-based approach that aims to study the relationships between several sets of variables. The quality and interpretability of the RGCCA block components yj = Xjaj, j = 1, …, J are likely to be affected by the usefulness and relevance of the variables in each block. Therefore, it is important to identify within each block which subsets of significant variables are active in the relationships between blocks. For instance, biomedical data are known to be measurements of intrinsically parsimonious processes. Sparse generalized canonical correlation analysis (SGCCA) extends RGCCA to address this issue of variable selection and enhances the RGCCA framework. The SGCCA optimization problem is obtained by considering another set Ωj as follows:

$$\Omega_j = \left\{\mathbf{a}_j \in \mathbb{R}^{p_j}; \; \|\mathbf{a}_j\|_2 \leq 1, \; \|\mathbf{a}_j\|_1 \leq s_j\right\},$$
where sj is a user-defined positive constant that determines the amount of sparsity for aj, j = 1, …, J. The smaller sj, the larger the degree of sparsity for aj. The sparsity parameter sj is usually set by cross-validation or permutation procedures. Alternatively, values of sj can be chosen to yield desired amounts of sparsity.
SGCCA is also implemented in the package and offers a sparse counterpart for all the covariance-based methods cited above.
The general optimization problem behind the RGCCA framework can be formulated as follows:

$$\underset{\mathbf{a}_1, \ldots, \mathbf{a}_J}{\text{maximize}} \; f(\mathbf{a}_1, \ldots, \mathbf{a}_J) \quad \text{s.t.} \quad \mathbf{a}_j \in \Omega_j, \; j = 1, \ldots, J,$$
where $f( \mathbf a_1, \ldots, \mathbf a_J):\mathbb{R}^{p_1}\times \ldots \times \mathbb{R}^{p_J} \xrightarrow{}\mathbb{R}$ is a continuously differentiable multi-convex function and each aj belongs to a compact set Ωj ⊂ ℝpj. For such a function defined over a set of parameter vectors (a1, …, aJ), we make no distinction between the notations f(a1, …, aJ) and f(a), where a is the column vector a = (a1⊤, …, aJ⊤)⊤ of size $p = \sum_{j=1}^{J}p_j$. Moreover, for the sake of notational simplicity, the vertical concatenation of column vectors is denoted a = (a1; …; aJ).
A simple, monotonically and globally convergent algorithm is presented for solving the optimization problem (). The maximization of the function f defined over the parameter vectors (a1, …, aJ) is approached by updating each parameter vector in turn, keeping the others fixed. This update rule is called Block Relaxation or cyclic Block Coordinate Ascent (BCA).
Let ∇jf(a) be the partial gradient of f(a) with respect to aj. We assume ∇jf(a) ≠ 0 in this manuscript. This assumption is not too binding as ∇jf(a) = 0 characterizes the global minimum of f(a1, …, aJ) with respect to aj when the other vectors a1, …, aj − 1, aj + 1, …, aJ are fixed.
We want to find an update $\hat{ \mathbf a}_j\in \Omega_j$ such that $f( \mathbf a)\leq f( \mathbf a_1, ..., \mathbf a_{j-1}, \hat{ \mathbf a}_j, \mathbf a_{j+1}, ..., \mathbf a_J)$. As f is a continuously differentiable multi-convex function and considering that a convex function lies above its linear approximation at aj for any $\tilde{ \mathbf a}_j\in\Omega_j$, the following inequality holds:

$$f(\mathbf{a}_1, \ldots, \tilde{\mathbf{a}}_j, \ldots, \mathbf{a}_J) \geq f(\mathbf{a}_1, \ldots, \mathbf{a}_j, \ldots, \mathbf{a}_J) + \nabla_j f(\mathbf{a})^\top (\tilde{\mathbf{a}}_j - \mathbf{a}_j).$$
On the right-hand side of the inequality (), only the term $\nabla_jf( \mathbf a)^\top\tilde{ \mathbf a}_j$ depends on $\tilde{ \mathbf a}_j$, and the update that maximizes this minorizing function over $\tilde{ \mathbf a}_j\in\Omega_j$ is obtained by considering the following optimization problem:

$$r_j(\mathbf{a}) = \underset{\tilde{\mathbf{a}}_j \in \Omega_j}{\text{argmax}} \; \nabla_j f(\mathbf{a})^\top \tilde{\mathbf{a}}_j.$$
The entire algorithm is subsumed in Algorithm .
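To fix ideas, a minimal R sketch of this cyclic update scheme is given below; the functions f (the criterion) and r (the update rule rj) are placeholders for the quantities defined above, not functions of the package.

# Cyclic block coordinate ascent: update each weight vector in turn,
# keeping the others fixed, until the criterion stops increasing.
bca_update <- function(a, f, r, tol = 1e-08, max_iter = 1000L) {
  crit <- f(a)
  for (iter in seq_len(max_iter)) {
    for (j in seq_along(a)) {
      a[[j]] <- r(a, j)  # a_j <- r_j(a), solution of the linearized subproblem
    }
    crit_new <- f(a)
    if (crit_new - crit < tol) break  # monotone convergence of {f(a^s)}
    crit <- crit_new
  }
  a
}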
We need to introduce some extra notation to present the convergence properties of Algorithm : Ω = Ω1 × … × ΩJ, a = (a1; …; aJ) ∈ Ω, cj : Ω ↦ Ω is the operator defined as cj(a) = (a1; …; aj − 1; rj(a); aj + 1; …; aJ), with rj(a) defined above, and c : Ω ↦ Ω is defined as c = cJ ∘ cJ − 1 ∘ … ∘ c1, where ∘ stands for the composition operator. Using the operator c, the for loop inside Algorithm can be replaced by the recurrence relation as + 1 = c(as). The convergence properties of Algorithm are summarized in the following proposition:
The proposition gathers all the convergence properties of Algorithm . The first three points concern the behavior of the sequence of values {f(as)} of the objective function, whereas the last three points concern the behavior of the sequence {as}. The full proof of these properties is given in .
The optimization problem defining the RGCCA framework is a particular case of (). Indeed, we assume the Ωjs to be compact. In addition, when the diagonal of C is null, the convexity and continuous differentiability of the function g imply that the objective function is itself multi-convex and continuously differentiable. When at least one element of the diagonal of C is different from 0, additional conditions have to be imposed on g to keep the objective function multi-convex. For example, when g is twice differentiable, a sufficient condition is that ∀x ∈ ℝ+, g′(x) ≥ 0. This condition guarantees that the second derivative of g(aj⊤Σjjaj) is positive semi-definite:

$$\frac{\partial^2 g(\mathbf{a}_j^\top \boldsymbol{\Sigma}_{jj} \mathbf{a}_j)}{\partial \mathbf{a}_j \partial \mathbf{a}_j^\top} = 2 g'(\mathbf{a}_j^\top \boldsymbol{\Sigma}_{jj} \mathbf{a}_j)\, \boldsymbol{\Sigma}_{jj} + 4 g''(\mathbf{a}_j^\top \boldsymbol{\Sigma}_{jj} \mathbf{a}_j)\, \boldsymbol{\Sigma}_{jj} \mathbf{a}_j \mathbf{a}_j^\top \boldsymbol{\Sigma}_{jj}.$$
All functions g considered in the RGCCA framework satisfy this condition. Consequently, the optimization problem () falls under the umbrella of the general optimization framework presented in the previous section.
The optimization problem defining sample-based RGCCA is a particular case of (). Indeed, $\Omega_j = \{ \mathbf a_j \in \mathbb{R}^{p_j}; \mathbf a_j^\top \widehat{\mathbf{\Sigma}}_{jj} \mathbf a_j = 1\}$ defines a compact set. Therefore, Algorithm can be used to solve the optimization problem () by updating each parameter vector in turn, keeping the others fixed. Hence, we want to find an update $\hat{\mathbf{a}}_j\in \Omega_j=\left\lbrace \mathbf a_j \in \mathbb{R}^{p_j}; \mathbf{a}_j^\top \widehat{\mathbf{\Sigma}}_{jj} \mathbf{a}_j = 1 \right\rbrace$ such that $f(\mathbf{a})\leq f(\mathbf{a}_1, \ldots, \mathbf{a}_{j-1}, \hat{\mathbf{a}}_j, \mathbf{a}_{j+1}, \ldots, \mathbf{a}_J)$. The RGCCA update is obtained by considering the following optimization problem:

$$\hat{\mathbf{a}}_j = \underset{\mathbf{a}_j^\top \widehat{\boldsymbol{\Sigma}}_{jj} \mathbf{a}_j = 1}{\text{argmax}} \; \nabla_j f(\mathbf{a})^\top \mathbf{a}_j = \frac{\widehat{\boldsymbol{\Sigma}}_{jj}^{-1} \nabla_j f(\mathbf{a})}{\sqrt{\nabla_j f(\mathbf{a})^\top \widehat{\boldsymbol{\Sigma}}_{jj}^{-1} \nabla_j f(\mathbf{a})}},$$
where the partial gradient ∇jf(a) of f(a) with respect to aj is the pj-dimensional column vector given by:

$$\nabla_j f(\mathbf{a}) = 2 \sum_{k=1}^{J} c_{jk} \, g'\!\left(\mathbf{a}_j^\top \widehat{\boldsymbol{\Sigma}}_{jk} \mathbf{a}_k\right) \widehat{\boldsymbol{\Sigma}}_{jk} \mathbf{a}_k.$$
The optimization problem () falls into the RGCCA framework with Ωj = {aj ∈ ℝpj; ‖aj‖2 ≤ 1; ‖aj‖1 ≤ sj}. Ωj is defined as the intersection between the ℓ2-ball of radius 1 and the ℓ1-ball of radius sj ∈ ℝ+⋆, which are two compact sets. Hence, Ωj is a compact set. Therefore, we can consider the following update for SGCCA:

$$\hat{\mathbf{a}}_j = \underset{\|\mathbf{a}_j\|_2 \leq 1, \; \|\mathbf{a}_j\|_1 \leq s_j}{\text{argmax}} \; \nabla_j f(\mathbf{a})^\top \mathbf{a}_j.$$
According to , the solution of () satisfies:

$$\hat{\mathbf{a}}_j = \frac{\mathcal{S}\!\left(\nabla_j f(\mathbf{a}), \lambda_j\right)}{\left\|\mathcal{S}\!\left(\nabla_j f(\mathbf{a}), \lambda_j\right)\right\|_2},$$
where 𝒮(., λ) is the soft-thresholding operator. When applied to a vector x ∈ ℝp, this operator is defined as:

$$\left[\mathcal{S}(\mathbf{x}, \lambda)\right]_i = \mathrm{sign}(x_i)\, \max(0, |x_i| - \lambda), \quad i = 1, \ldots, p.$$
We make the assumption that the ℓ2-ball of radius 1 is not included in the ℓ1-ball of radius sj, and vice versa. Otherwise, only one of the two constraints would systematically be active. This assumption holds when the corresponding spheres intersect, and it can be translated into conditions on sj.
The norm equivalence between ‖.‖1 and ‖.‖2 can be formulated as the following inequality:

$$\|\mathbf{x}\|_2 \leq \|\mathbf{x}\|_1 \leq \sqrt{p}\, \|\mathbf{x}\|_2, \quad \forall \mathbf{x} \in \mathbb{R}^p.$$
This can be converted into a condition on sj: $1 \leq s_j \leq \sqrt{p_j}$. When this condition is fulfilled, the ℓ2-sphere of radius 1 and the ℓ1-sphere of radius sj necessarily intersect. Within the package, for consistency with the value of τj ∈ [0, 1], the level of sparsity for aj is controlled with $s_j/\sqrt{p_j} \in [1/\sqrt{p_j}, 1]$.
Several strategies, such as binary search or the projection on convex sets algorithm (POCS), also known as the alternating projection method , can be used to determine the optimal λj verifying the ℓ1-norm constraint. Here, a much faster approach described in is implemented within the package.
The SGCCA algorithm is similar to the RGCCA algorithm and keeps the same convergence properties. Empirically, the S/RGCCA algorithm is not sensitive to the starting point and usually reaches convergence within a few iterations.
In many applications, several components per block need to be identified. The traditional approach incorporates the single-unit RGCCA algorithm in a deflation scheme and sequentially computes the desired number of components. More precisely, the RGCCA optimization problem returns a set of J optimal block-weight vectors, denoted here aj(1), j = 1, …, J. Let yj(1) = Xjaj(1), j = 1, …, J be the corresponding block components. Two strategies to determine higher-level weight vectors are presented: the first yields orthogonal block components, and the second yields orthogonal weight vectors. Deflation is the most straightforward way to add orthogonality constraints. This deflation procedure is sequential and consists in replacing, within the RGCCA optimization problem, the data matrix Xj by its projection Xj(1) onto either: (i) the orthogonal subspace of yj(1) if orthogonal components are desired, Xj(1) = Xj − yj(1)(yj(1)⊤yj(1))−1yj(1)⊤Xj, or (ii) the orthogonal subspace of aj(1) if orthogonal weight vectors are desired, Xj(1) = Xj − Xjaj(1)(aj(1)⊤aj(1))−1aj(1)⊤.
The second-level RGCCA optimization problem boils down to:

$$\underset{\mathbf{a}_1^{(2)}, \ldots, \mathbf{a}_J^{(2)}}{\text{maximize}} \; \sum_{j, k = 1}^{J} c_{jk} \, g\!\left(n^{-1} \mathbf{a}_j^{(2)\top} \mathbf{X}_j^{(1)\top} \mathbf{X}_k^{(1)} \mathbf{a}_k^{(2)}\right) \quad \text{s.t.} \quad \mathbf{a}_j^{(2)} \in \Omega_j, \; j = 1, \ldots, J.$$
The optimization problem () is solved using Algorithm and returns a set of optimal block-weight vectors aj(2) and block components yj(2) = Xj(1)aj(2), for j = 1, …, J.
For orthogonal block-weight vectors, yj(2) = Xj(1)aj(2) = Xjaj(2) naturally expresses as a linear combination of the original variables. For orthogonal block components, as yj(1) = Xjaj(1), the range space of Xj(1) is included in the range space of Xj. Therefore, any block component yj(2) belonging to the range space of Xj(1) can also be expressed in terms of the original block Xj: there exists aj(2)⋆ such that yj(2) = Xj(1)aj(2) = Xjaj(2)⋆. This implies that the block components can always be expressed in terms of the original data matrices, whatever the deflation mode.
This deflation procedure can be iterated in a very flexible way. For instance, it is not necessary to keep all the blocks in the procedure at all stages: the number of components per block can vary from one block to another. This might be interesting in a supervised setting where we want to predict a univariate block from other blocks. In that case, the deflation procedure applies to all blocks except the one to predict.
To conclude this section, when the superblock option is used, various deflation strategies (what to deflate and how) have been proposed in the literature. We propose, as the default option, to deflate only the superblock with respect to its global components:
XJ + 1(1) = (I − yJ + 1(1)(yJ + 1(1)⊤yJ + 1(1))−1yJ + 1(1)⊤)XJ + 1 = [X1(1), …, XJ(1)].
The individual blocks Xj(1) are then retrieved from the deflated superblock. This strategy enables recovering multiple factor analysis. Note that, in this case, block components do not belong to their block space and are correlated. On the contrary, we follow the deflation strategy described in () for multiple co-inertia analysis, which is one of the most popular and established methods of the multiblock literature.
In this section, using the idea of average variance explained (AVE), the following indicators of model quality are defined:

$$\text{AVE}(\mathbf{X}_j) = \frac{1}{p_j} \sum_{h=1}^{p_j} \mathrm{cor}^2(\mathbf{x}_{jh}, \mathbf{y}_j), \qquad \text{AVE(outer model)} = \left(\sum_{j} p_j\right)^{-1} \sum_{j} p_j\, \text{AVE}(\mathbf{X}_j).$$
However, the previous quantities do not take into account the correlations between blocks. Therefore, another indicator of model quality is the inner AVE, defined as follows:

$$\text{AVE(inner model)} = \left(\sum_{j < k} c_{jk}\right)^{-1} \sum_{j < k} c_{jk}\, \mathrm{cor}^2(\mathbf{y}_j, \mathbf{y}_k).$$
All these quantities vary between 0 and 1 and reflect important properties of the model.
The AVE of a block is defined for a specific block component. The cumulative AVE is obtained by summing these individual AVEs over the different components. However, this sum is valid only for orthogonal block components. For correlated components, we follow a QR-orthogonalization procedure to consider only the increase in AVE due to adding new components.
Guidelines describing R/SGCCA in practice are provided in . The usefulness and versatility of the package are illustrated in the next section.
This section describes how the package is designed and how it can be used on two real datasets.
The main user-accessible functions are listed in the table below.
Function | Description |
---|---|
rgcca() | Main entry point of the package; fits an R/SGCCA model on a multiblock dataset. |
rgcca_cv() | Find the best set of parameters for an R/SGCCA model using cross-validation. |
rgcca_predict() | Train a caret model on the block components of a fitted R/SGCCA model and predict values for unseen individuals. |
rgcca_transform() | Use a fitted R/SGCCA model to compute the block components of unseen individuals. |
rgcca_permutation() | Find the best set of parameters for an R/SGCCA model using a permutation strategy. |
rgcca_bootstrap() | Evaluate the significance of the block-weight vectors produced by an R/SGCCA model using bootstrap. |
rgcca_stability() | Select the most stable variables of an R/SGCCA model using their VIPs. |
summary(), print(), plot() | Summary, print, and plot methods for the outputs of the functions above. |
These functions are detailed hereafter.
The cornerstone of the RGCCA package is the rgcca() function. This function enables the construction of a flexible pipeline encompassing all model-building steps. This master function automatically checks the proper formatting of the blocks and performs preprocessing steps based on the scale and scale_block arguments. Several user-specified strategies for handling missing data are available through the NA_method argument. The final step of the pipeline explicitly focuses on the choice of the multiblock component method. The list of pre-specified multiblock component methods that can be used within the package is reported below:
## [1] "rgcca" "sgcca" "pca" "spca" "pls" "spls"
## [7] "cca" "ifa" "ra" "gcca" "maxvar" "maxvar-b"
## [13] "maxvar-a" "mfa" "mcia" "mcoa" "cpca-1" "cpca-2"
## [19] "cpca-4" "hpca" "maxbet-b" "maxbet" "maxdiff-b" "maxdiff"
## [25] "sabscor" "ssqcor" "ssqcov-1" "ssqcov-2" "ssqcov" "sumcor"
## [31] "sumcov-1" "sumcov-2" "sumcov" "sabscov-1" "sabscov-2"
These methods can be retrieved by setting the method argument of the rgcca() function accordingly. Alternatively, several arguments (connection, tau, sparsity, scheme, superblock, comp_orth) allow users to manually recover the various methods listed in Section 2.3.
The optimal tuning parameters can be determined by cross-validating different indicators of quality for regression and classification tasks.
This cross-validation protocol is made available through the rgcca_cv() function. This function can be used in the presence of a response block specified by the response argument. The primary objective of this function is to automatically find the set of tuning parameters (shrinkage parameters, amount of sparsity, number of components) that yields the highest cross-validated scores. The prediction model aims to predict the response block from all available block components. rgcca_cv() harnesses the power of the caret package; as a direct consequence, an astonishingly large number of prediction models becomes accessible. rgcca_cv() calls rgcca_transform() to obtain the block components associated with the test set and rgcca_predict() for making predictions.
When blocks play a symmetric role (i.e., without a specific response block), a permutation-based strategy, very similar to the one proposed in , has also been integrated within the package through the rgcca_permutation() function.
For each set of candidate parameters, the following steps are performed: (i) the RGCCA criterion is evaluated on the original, non-permuted blocks; (ii) the rows of each block are randomly permuted to break the links between blocks, and the criterion is evaluated on the permuted blocks; (iii) step (ii) is repeated a large number of times to obtain the mean and standard deviation of the permuted criterion, from which a z-statistic, zstat, is derived for the original criterion.
The best set of tuning parameters is then the set that yields the highest zstat. This procedure is available through the rgcca_permutation() function.
Once a fitted model has been obtained, the rgcca_bootstrap() function can be used to assess the reliability of parameter estimates (block-weight/loading vectors). Bootstrap samples of the same size as the original data are repeatedly drawn with replacement from the original data. RGCCA is then applied to each bootstrap sample to obtain the RGCCA estimates. We calculate the standard deviation of the estimates across the bootstrap samples, from which we derive bootstrap confidence intervals, t-ratios (defined as the ratio of a parameter estimate to its bootstrap estimate of the standard deviation), and p-values (computed by assuming that the t-ratio follows the standard normal distribution) to indicate the reliability of parameter estimates.
Since multiple p-values are computed simultaneously, correction for multiple testing applies, and adjusted p-values are returned.
We also implemented a procedure to stabilize the selection of variables in SGCCA. It relies on the variable importance in projection (VIP) score defined for the PLS method. This score can be used for variable selection: the higher the score, the more important the variable. We use this idea to evaluate the stability of the variable selection performed by SGCCA: SGCCA is fitted several times using a bootstrap resampling procedure; for each fitted model, the VIP is computed for each variable; the variables with the highest VIPs, averaged over the different models, are kept. This procedure is available through the rgcca_stability() function.
The outputs of most of the aforementioned functions can be described and visualized using the generic print(), summary(), and plot() functions.
The following two sections provide examples illustrating the practical applications of these functions.
In this section, we reproduce some previously published results obtained from the Russett data. The Russett dataset, available within the RGCCA package, collects measurements used to study relationships between agricultural inequality, industrial development, and political instability. The names of the variables are listed below:
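data("Russett", package = "RGCCA")
colnames(Russett)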
## [1] "gini" "farm" "rent" "gnpr" "labo" "inst"
## [7] "ecks" "death" "demostab" "demoinst" "dictator"
The first step of the analysis is to define the blocks. Three blocks of variables have been defined for 47 countries. The variables composing each block have been chosen according to their nature.
The different blocks of variables X1, X2, X3 are arranged in the list format.
A <- list(
Agriculture = Russett[, c("gini", "farm", "rent")],
Industrial = Russett[, c("gnpr", "labo")],
Politic = Russett[, c("inst", "ecks", "death", "demostab", "dictator")])
lab <- factor(
apply(Russett[, 9:11], 1, which.max),
labels = c("demost", "demoinst", "dict")
)
Preprocessing. In general, and especially for covariance-based criteria, the data blocks might be preprocessed to ensure comparability between variables and blocks. To ensure comparability between variables, standardization is applied (zero mean and unit variance). Such preprocessing is obtained by setting the scale argument to TRUE (default value) in the rgcca() function. A possible strategy to make blocks comparable is to standardize the variables and divide each block by the square root of its number of variables. This two-step procedure leads to tr(Xj⊤Xj) = n for each block (i.e., the sum of the eigenvalues of the covariance matrix of Xj is equal to 1, whatever the block). Such preprocessing is obtained by setting the scale_block argument to TRUE or "inertia" (default value) in the rgcca() function. If scale_block = "lambda1", each block is divided by the square root of the largest eigenvalue of its empirical covariance matrix. If standardization is applied (scale = TRUE), block scaling is applied to the result of the standardization.
Definition of the design matrix C. From Russett’s hypotheses, it is difficult for a country to escape dictatorship when agricultural inequality is above average and industrial development is below average. These hypotheses on the relationships between blocks are encoded through the design matrix C; usually cjk = 1 for two connected blocks and 0 otherwise. Therefore, we have decided to connect X1 to X3 (c13 = 1), X2 to X3 (c23 = 1) and to not connect X1 to X2 (c12 = 0). The resulting design matrix C is:
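C <- matrix(c(0, 0, 1,
              0, 0, 1,
              1, 1, 0), nrow = 3)
C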
## [,1] [,2] [,3]
## [1,] 0 0 1
## [2,] 0 0 1
## [3,] 1 1 0
Choice of the scheme function g. Typical choices of scheme functions are g(x) = x, x2, or |x|. According to , a fair model is a model where all blocks contribute equally to the solution, in opposition to a model dominated by only a few of the J sets. Considering schemes of the form g(x) = xm, if fairness is a major objective, the user must choose m = 1; m > 1 is preferable if the user wants to discriminate between blocks. In practice, m is equal to 1, 2 or 4. The higher the value of m, the more the method acts as a block selector.
RGCCA using the pre-defined design matrix C, the factorial scheme (g(x) = x2), τ = 1 for all blocks (full covariance criterion), and two (orthogonal) components per block is obtained by appropriately specifying the arguments connection, tau, ncomp, scheme, and comp_orth in rgcca(). verbose (default value TRUE) indicates that the progress will be reported while computing and that a plot illustrating the convergence of the algorithm will be displayed.
fit <- rgcca(blocks = A, connection = C,
tau = 1, ncomp = 2,
scheme = "factorial",
scale = TRUE,
scale_block = FALSE,
comp_orth = TRUE,
verbose = FALSE)
The summary() and plot() functions. The summary() function allows summarizing the RGCCA analysis:
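summary(fit)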
## Call: method='rgcca', superblock=FALSE, scale=TRUE, scale_block=FALSE, init='svd',
## bias=TRUE, tol=1e-08, NA_method='na.ignore', ncomp=c(2,2,2), response=NULL,
## comp_orth=TRUE
## There are J = 3 blocks.
## The design matrix is:
## Agriculture Industrial Politic
## Agriculture 0 0 1
## Industrial 0 0 1
## Politic 1 1 0
##
## The factorial scheme is used.
## Sum_{j,k} c_jk g(cov(X_j a_j, X_k a_k) = 7.9469
##
## The regularization parameter used for Agriculture is: 1
## The regularization parameter used for Industrial is: 1
## The regularization parameter used for Politic is: 1
The block-weight vectors, solution of the optimization problem (), are available in fit$a and correspond exactly to previously reported weight vectors. It is possible to display specific block-weight vectors (type = "weight") or block-loading vectors (type = "loadings") using the generic plot() function and specifying the type and block arguments accordingly. The aj(k)⋆, mentioned in the section on the higher-level RGCCA algorithm, are available in fit$astar.
Block-weight vectors of a fitted RGCCA model.
As a component-based method, the package provides the block components in fit$Y. Various graphical representations of the block components are available by using the plot() function on a fitted RGCCA object and setting the type argument. They include factor plots (type = "sample"), correlation circles (type = "cor_circle"), and biplots (type = "biplot"). These graphical displays allow visualization of the sources of variability within blocks, the relationships between variables within and between blocks, and the correlation between blocks. The factor plot of the countries, obtained by crossing the block components associated with the agricultural inequality and industrial development blocks and marked by their political regime in 1960, is shown below; it can be produced with a call of the following form:
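plot(fit, type = "sample", block = 1:2, comp = 1,
     resp = lab, repel = TRUE)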
Graphical display of the countries by drawing the block component of the first block against the block component of the second block, colored according to their political regime.
Countries aggregate together when they share similarities. It may be noted that the lower right quadrant concentrates dictatorships. It is difficult for a country to escape dictatorship when its industrial development is below average and its agricultural inequality is above average. It is worth pointing out that some unstable democracies located in this quadrant (or close to it) became dictatorships for a period of time after 1960: Greece (1967-1974), Brazil (1964-1985), Chile (1973-1990), and Argentina (1966-1973).
The AVEs of the different blocks are reported on the axes of the figure. All AVEs are available in fit$AVE. These indicators of model quality can also be visualized using the generic plot() function (type = "ave").
Average variance explained of the different blocks.
The strength of the relations between each block component and each variable can be visualized using correlation circles or biplot representations; for instance, a call of the following form produces the correlation circle below:
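plot(fit, type = "cor_circle", block = 1, comp = 1:2)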
Correlation circle associated with the first two components of the first block.
By default, all the variables are displayed on the correlation circle. However, it is possible to choose the block(s) to display with the display_blocks argument of the correlation circle plot.
plot(fit, type = "biplot", block = 1,
comp = 1:2, repel = TRUE,
resp = lab, cex = 2,
show_arrow = TRUE)
Biplot associated with the first two components of the first block.
As we will see in the next section, when the superblock option is considered (superblock = TRUE, or method set to a method that induces the use of a superblock), global components can be derived. The space spanned by the global components can be viewed as a consensus space that integrates all the modalities and facilitates the visualization of the results and their interpretation.
Bootstrap confidence intervals. We illustrate the use of the rgcca_bootstrap() function. It is possible to set the number of bootstrap samples using the n_boot argument:
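boot_out <- rgcca_bootstrap(fit, n_boot = 500, n_cores = 1)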
The bootstrap results are detailed using the summary() function,
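summary(boot_out)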
## Call: method='rgcca', superblock=FALSE, scale=TRUE, scale_block=FALSE, init='svd',
## bias=TRUE, tol=1e-08, NA_method='na.ignore', ncomp=c(2,2,2), response=NULL,
## comp_orth=TRUE
## There are J = 3 blocks.
## The design matrix is:
## Agriculture Industrial Politic
## Agriculture 0 0 1
## Industrial 0 0 1
## Politic 1 1 0
##
## The factorial scheme is used.
##
## Extracted statistics from 500 bootstrap samples.
## Block-weight vectors for component 1:
## estimate mean sd lower_bound upper_bound bootstrap_ratio pval
## gini 0.6602 0.6360 0.0696 0.4744 0.725 9.490 0.0000
## farm 0.7445 0.7304 0.0555 0.6443 0.828 13.423 0.0000
## rent 0.0994 0.0762 0.2203 -0.3943 0.454 0.451 0.4663
## gnpr 0.6891 0.6894 0.0298 0.6252 0.739 23.140 0.0000
## labo -0.7247 -0.7232 0.0278 -0.7804 -0.673 -26.087 0.0000
## inst 0.1692 0.1672 0.1174 -0.0801 0.357 1.442 0.0917
## ecks 0.4418 0.4340 0.0592 0.3224 0.545 7.458 0.0000
## death 0.4784 0.4699 0.0483 0.3769 0.560 9.914 0.0000
## demostab -0.5574 -0.5520 0.0509 -0.6400 -0.432 -10.942 0.0000
## dictator 0.4864 0.4831 0.0524 0.3846 0.586 9.285 0.0000
## adjust.pval
## gini 0.000
## farm 0.000
## rent 0.523
## gnpr 0.000
## labo 0.000
## inst 0.131
## ecks 0.000
## death 0.000
## demostab 0.000
## dictator 0.000
and displayed using the plot() function.
plot(boot_out, type = "weight",
block = 1:3, comp = 1,
display_order = FALSE, cex = 2,
show_stars = TRUE)
Bootstrap confidence intervals for the block-weight vectors.
Each weight is shown along with its associated bootstrap confidence interval. If show_stars = TRUE, stars reflecting the p-value of assigning a strictly positive or negative weight to this variable are displayed.
Choice of the shrinkage parameters. In the previous section, the shrinkage parameters were manually set. However, the package introduces three fully automated strategies for selecting the optimal shrinkage parameters.
The Schafer and Strimmer analytical formula. For each block j, an “optimal” shrinkage parameter τj can be obtained using the Schafer and Strimmer analytical formula by setting the tau argument of the rgcca() function to "optimal".
The optimal shrinkage parameters are given by:
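fit_opt <- rgcca(blocks = A, connection = C, ncomp = 2,
                 scheme = "factorial", tau = "optimal", verbose = FALSE)
fit_opt$call$tau  # assumed slot storing the estimated shrinkage parameters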
## [1] 0.08853216 0.02703256 0.08422566
This automatic estimation of the shrinkage parameters allows one to come closer to the correlation criterion, even in the case of high multicollinearity or when the number of individuals is smaller than the number of variables.
As before, all the fitted RGCCA objects can be visualized using the summary() and plot() functions and bootstrapped using the rgcca_bootstrap() function.
Permutation strategy. We illustrate the use of the rgcca_permutation() function here. The number of permutations can be set using the n_perms argument:
set.seed(0)
perm_out <- rgcca_permutation(blocks = A, connection = C,
par_type = "tau",
par_length = 10,
n_cores = 1,
n_perms = 10)
By default, the rgcca_permutation() function generates 10 sets of tuning parameters uniformly between some minimal values (0 for RGCCA and 1/sqrt(ncol) for SGCCA) and 1. Results of the permutation procedure are summarized using the generic summary() function:
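summary(perm_out)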
## Call: method='rgcca', superblock=FALSE, scale=TRUE, scale_block=TRUE, init='svd',
## bias=TRUE, tol=1e-08, NA_method='na.ignore', ncomp=c(1,1,1), response=NULL,
## comp_orth=TRUE
## There are J = 3 blocks.
## The design matrix is:
## Agriculture Industrial Politic
## Agriculture 0 0 1
## Industrial 0 0 1
## Politic 1 1 0
##
## The factorial scheme is used.
##
## Tuning parameters (tau) used:
## Agriculture Industrial Politic
## 1 1.000 1.000 1.000
## 2 0.889 0.889 0.889
## 3 0.778 0.778 0.778
## 4 0.667 0.667 0.667
## 5 0.556 0.556 0.556
## 6 0.444 0.444 0.444
## 7 0.333 0.333 0.333
## 8 0.222 0.222 0.222
## 9 0.111 0.111 0.111
## 10 0.000 0.000 0.000
##
## Tuning parameters Criterion Permuted criterion sd zstat p-value
## 1 1.00/1.00/1.00 0.708 0.0909 0.0479 12.88 0
## 2 0.89/0.89/0.89 0.758 0.0993 0.0512 12.86 0
## 3 0.78/0.78/0.78 0.814 0.1093 0.0549 12.83 0
## 4 0.67/0.67/0.67 0.878 0.1214 0.0592 12.80 0
## 5 0.56/0.56/0.56 0.953 0.1365 0.0640 12.76 0
## 6 0.44/0.44/0.44 1.040 0.1561 0.0696 12.70 0
## 7 0.33/0.33/0.33 1.144 0.1829 0.0764 12.58 0
## 8 0.22/0.22/0.22 1.273 0.2233 0.0854 12.29 0
## 9 0.11/0.11/0.11 1.449 0.2974 0.1007 11.44 0
## 10 0.00/0.00/0.00 1.934 0.6049 0.1419 9.37 0
## The best combination is: 1.00/1.00/1.00 for a z score of 12.9 and a p-value of 0
and displayed using the generic plot() function:
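plot(perm_out)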
Values of the objective function of RGCCA against the sets of tuning parameters, triangles correspond to evaluations on non-permuted datasets.
The fitted permutation object, perm_out, can be directly provided to rgcca(), and the resulting model can be visualized/bootstrapped as usual.
Of course, it is possible to explicitly define the combinations of regularization parameters to be tested. In that case, a matrix of dimension K × J is required, where each row corresponds to one set of tuning parameters. Alternatively, a numeric vector of length J indicating the maximum range values to be tested can be given. The sets of parameters are then uniformly generated between the minimal values (0 for RGCCA and 1/sqrt(ncol) for SGCCA) and the maximum values specified by the user with par_value.
Cross-validation strategy. The optimal tuning parameters can also be obtained by cross-validation. In the forthcoming section, we will illustrate this strategy in the second case study.
RGCCA with a superblock. To conclude this section on the analysis of the Russett dataset, we consider multiple co-inertia analysis (MCOA) with two components per block.
See the list of pre-specified multiblock component methods reported above.
Interestingly, the summary() function reports the arguments implicitly specified to perform MCOA:
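fit.mcoa <- rgcca(blocks = A, method = "mcoa", ncomp = 2)
summary(fit.mcoa)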
## Call: method='mcoa', superblock=TRUE, scale=TRUE, scale_block='inertia', init='svd',
## bias=TRUE, tol=1e-08, NA_method='na.ignore', ncomp=c(2,2,2,2), response=NULL,
## comp_orth=FALSE
## There are J = 4 blocks.
## The design matrix is:
## Agriculture Industrial Politic superblock
## Agriculture 0 0 0 1
## Industrial 0 0 0 1
## Politic 0 0 0 1
## superblock 1 1 1 0
##
## The factorial scheme is used.
## Sum_{j,k} c_jk g(cov(X_j a_j, X_k a_k) = 3.578
##
## The regularization parameter used for Agriculture is: 1
## The regularization parameter used for Industrial is: 1
## The regularization parameter used for Politic is: 1
## The regularization parameter used for superblock is: 0
It is possible to display specific outputs as previously, using the generic plot() function and specifying the type argument accordingly. MCOA enables individuals to be represented in the space spanned by the first global components. The biplot representation associated with this consensus space is given below; a call of the following form can be used:
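plot(fit.mcoa, type = "biplot", block = 4, comp = 1:2,
     resp = lab, repel = TRUE)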
Biplot of the countries obtained by crossing the two first components of the superblock. Individuals are colored according to their political regime and variables according to their block membership.
As previously, this model can easily be bootstrapped using the rgcca_bootstrap() function, and the bootstrap confidence intervals are still available using the summary() and plot() functions.
Biological problem. Brain tumors are children’s most common solid tumors and have the highest mortality rate of all pediatric cancers. Despite advances in multimodality therapy, children with pediatric high-grade gliomas (pHGG) invariably have an overall survival of around 20% at 5 years. Depending on their location (e.g., brainstem, central nuclei, or supratentorial), pHGG present different characteristics in terms of radiological appearance, histology, and prognosis. Our hypothesis is that pHGG have different genetic origins and oncogenic pathways depending on their location. Thus, the biological processes involved in the development of the tumor may be different from one location to another, as has been frequently suggested.
Description of the data. Pretreatment frozen tumor samples were obtained from 53 children with newly diagnosed pHGG from Necker Enfants Malades (Paris, France) . The 53 tumors are divided into 3 locations: supratentorial (HEMI), central nuclei (MIDL), and brain stem (DIPG). The final dataset is organized into 3 blocks of variables defined for the 53 tumors: the first block X1 provides the expression of 15702 genes (GE). The second block X2 contains the imbalances of 1229 segments (CGH) of chromosomes. X3 is a block of dummy variables describing the categorical variable location. One dummy variable has been left out because of redundancy with the others.
The next lines of code can be run to download the dataset from http://biodev.cea.fr/sgcca/:
if (!("gliomaData" %in% rownames(installed.packages()))) {
destfile <- tempfile()
download.file("http://biodev.cea.fr/sgcca/gliomaData_0.4.tar.gz", destfile)
install.packages(destfile, repos = NULL, type = "source")
}
data("ge_cgh_locIGR", package = "gliomaData")
blocks <- ge_cgh_locIGR$multiblocks
Loc <- factor(ge_cgh_locIGR$y)
levels(Loc) <- colnames(ge_cgh_locIGR$multiblocks$y)
blocks[[3]] <- Loc
vapply(blocks, NCOL, FUN.VALUE = 1L)
We impose X1 and X2 to be connected to X3. This design is commonly used in many applications and is oriented toward the prediction of the location. The response argument of the rgcca() function encodes this design.
When the response variable is qualitative, two steps are implicitly performed: (i) disjunctive coding and (ii) the associated shrinkage parameter is set to 0 regardless of the value specified by the user.
The primal/dual RGCCA algorithm. Depending on the dimensions of each block (n > pj or n ≤ pj), rgcca() automatically selects the dual formulation for X1 and X2 and the primal one for X3. The formulation used for each block can be retrieved with a command of the following form (the slot name is an assumption of this sketch):
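fit.rgcca <- rgcca(blocks = blocks, response = 3, ncomp = 2, verbose = FALSE)
fit.rgcca$primal_dual  # assumed slot: formulation chosen for each block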
The dual formulation makes the RGCCA algorithm highly efficient, even in a high-dimensional setting.
RGCCA enables visual inspection of the spatial relationships between classes. This facilitates assessment of the quality of the classification and makes it possible to readily determine which components capture the discriminant information.
For easier interpretation of the results, especially in high-dimensional settings, adding penalties promoting sparsity within the RGCCA optimization problem is often appropriate. For that purpose, an ℓ1 penalization on the weight vectors a1, …, aJ is applied. The sparsity argument of rgcca() varies between 1/sqrt(ncol) and 1 (larger values correspond to less penalization) and controls the amount of sparsity of the weight vectors a1, …, aJ. If sparsity is a vector, the ℓ1-penalties are the same for all the weights corresponding to the same block, across components:

$$\forall h, \quad \|\mathbf{a}_j^{(h)}\|_1 \leq \texttt{sparsity}_j \sqrt{p_j},$$
with pj the number of variables of Xj.
If sparsity is a matrix, its row h defines the constraints applied to the weights corresponding to component h:

$$\|\mathbf{a}_j^{(h)}\|_1 \leq \texttt{sparsity}_{hj} \sqrt{p_j}, \quad j = 1, \ldots, J.$$
SGCCA for the Glioma dataset. The algorithm associated with the optimization problem () is available through the rgcca() function with the argument method = "sgcca", or by directly specifying the sparsity argument, as below.
fit.sgcca <- rgcca(blocks = blocks, response = 3, ncomp = 2,
sparsity = c(0.0710, 0.2000, 1),
verbose = FALSE)
The summary() function allows summarizing the SGCCA analysis,
and the plot() function returns the same graphical displays as for RGCCA. We skip these representations for the sake of brevity.
Determining the sparsity parameters by cross-validation. Of course, it is still possible to determine the optimal sparsity parameters by permutation. This is made possible by setting the par_type argument to "sparsity" (instead of "tau") within the rgcca_permutation() function. However, in this section, we adopt a different approach: the optimal tuning parameters are determined by cross-validation, using different indicators of quality for classification and regression.
This cross-validation protocol is made available through the rgcca_cv() function and is used here to predict the tumors' location. In this situation, the goal is to maximize the cross-validated balanced accuracy (metric = "Balanced_Accuracy") in a model where we try to predict the response block from all the block components with a user-defined classifier (prediction_model = "lda"). The cross-validation protocol is set by specifying arguments such as the number of folds (k) and the number of cross-validation runs (n_run). Also, we decided to upper-bound the sparsity parameters for X1 and X2 by 0.2 to achieve an attractive amount of sparsity.
set.seed(0)
in_train <- caret::createDataPartition(
blocks[[3]], p = .75, list = FALSE
)
training <- lapply(blocks, function(x) as.matrix(x)[in_train, , drop = FALSE])
testing <- lapply(blocks, function(x) as.matrix(x)[-in_train, , drop = FALSE])
cv_out <- rgcca_cv(blocks = training, response = 3,
par_type = "sparsity",
par_value = c(.2, .2, 0),
par_length = 10,
prediction_model = "lda",
validation = "kfold",
k = 7, n_run = 3, metric = "Balanced_Accuracy",
n_cores = 1)
rgcca_cv() relies on the caret package. As a direct consequence, an astonishingly large number of models are made available. Results of the cross-validation procedure are reported using the generic summary() function,
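summary(cv_out)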
and displayed using the generic plot() function:
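plot(cv_out)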
As before, the optimal sparsity parameters can be used to fit a new model, and the resulting optimal model can be visualized/bootstrapped.
Note that the sparsity parameter associated with X3 switches automatically to τ3 = 0. This choice is justified by the fact that we were not looking for a block component y3 that explained its own block well (since X3 is a group coding matrix) but one that is correlated with its neighboring components.
At last, rgcca_predict() can be used for predicting new blocks,
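fit <- rgcca(cv_out)  # refit using the best parameters found by rgcca_cv()
# argument names assumed from the package conventions; see ?rgcca_predict
pred <- rgcca_predict(fit, blocks_test = testing, prediction_model = "lda")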
and a summary of the performances can be reported.
The rgcca_transform() function can be used if, for a specific reason, only the block components are wanted for the test set.
Stability selection. We finally illustrate the use of the rgcca_stability() function:
set.seed(0)
fit_stab <- rgcca_stability(fit,
keep = vapply(
fit$a, function(x) mean(x != 0),
FUN.VALUE = 1.0
),
n_boot = 100, verbose = TRUE, n_cores = 1)
Once the most stable variables have been found, a new model using these variables is automatically fitted. This last model can be visualized using the usual summary() and plot() functions. We can finally apply the bootstrap procedure to the most stable variables, for instance:
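# the model refitted on the most stable variables (stored in fit_stab)
# is assumed here to be accepted directly by rgcca_bootstrap()
boot_out <- rgcca_bootstrap(fit_stab, n_boot = 500, n_cores = 1)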
The bootstrap results can be visualized using the generic plot() function. We use the n_mark parameter to display the top 50 variables of GE:
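plot(boot_out, type = "weight", block = 1, comp = 1, n_mark = 50)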
The RGCCA framework gathers sixty years of multiblock component methods and offers a unified implementation strategy for these methods. The RGCCA package is available on the Comprehensive R Archive Network (CRAN) and on GitHub (https://github.com/rgcca-factory/RGCCA). This release of the package includes all the functionalities presented in this paper.
The package will be a valuable resource for researchers and practitioners who are interested in multiblock data analysis to gain new insights and improve decision-making.
The RGCCA framework is constantly evolving and extending. Indeed, we proposed RGCCA for multigroup data , RGCCA for multiway data , and RGCCA for (sparse and irregular) functional data . In addition, maximizing successive criteria may be sub-optimal from an optimization point of view, where a single global criterion is preferred. A global version of RGCCA , which allows simultaneously extracting several components per block (no deflation procedure required), has been proposed. Also, it is possible to use RGCCA in structural equation modeling with latent and emergent variables for obtaining consistent and asymptotically normal estimators of the parameters . At last, several alternatives for handling missing values are discussed in . Work in progress includes the integration of all these novel approaches in the next release of the package. The modular design of the function will greatly simplify the integration of these extensions into the package.
This project has received funding from UDOPIA - ANR-20-THIA-0013, the European Union’s Horizon 2020 research and innovation program under grant agreement No 874583 and the AP-HP Foundation, within the framework of the AIRACLES Chair.