The objective is to explain a block-matrix Y thanks to a block-matrix X. Each block describes n observations through q numerical variables for the block-matrix Y and p numerical variables for the block-matrix X. The links are assumed to be linear such as the objective is to estimated a linear matrix transformation B such as
Y = XB + E, where E is an additive random noise.
In the general case, the number of observations n can be lower than the number of descriptors p and most of regression methods cannot handle the estimation of the matrix B, often denoted n< < p. The ddsPLS methodology deals with this framework.
The objective of the proposed method is to estimate the matrix B and to simultaneously select the relevant variables in Y and in X.
The methodology gives quality descriptors of the optimal built model, in terms of explained variance and prediction error, and provides a ranking of variable importance.
Only two kinds of parameters need to be tuned in the ddsPLS methodology. This tuning is automatically data driven. The proposed criterion uses the power of bootstrap, a classical statistical tool that generates different samples from an initial one, particularly interesting when the sample size n is small.
The tuning parameters are:
More precisely, the ddsPLS methodology detailed in this vignette is based on R soft-thresholded estimations of the covariance matrices between a q-dimensional response matrix Y and a p-dimensional covariable matrix X. Each soft-threshold parameter is the λr parameter introduced above. It relies on the following latent variable model.
The ddsPLS methodology uses the principle of construction and prediction errors. The first one, denoted as R2, evaluates the precision of the model on the training data-set. The second one, denoted as Q2, evaluates the precision of the model on a test data-set, independent from the train data-set. The R2 is well known to be sensible to over-fitting and an accepted rule of thumb, among PLS users, is to select model for which the difference R2 − Q2 is minimum.
The ddsPLS methodology is based on bootstrap versions of the R2 and the Q2 defined below.
More precisely, for the bootstrap sample of index b, among a total of B bootstrap samples, the Rb2 and the Qb2 are defined as $$ \begin{array}{cc} \begin{array}{rccc} &R^{2}_b & = & 1-\dfrac{ \left|\left|\mathbf{y}_{\text{IN}(b)} -\hat{\mathbf{y}}_{\text{IN}(b)}\right|\right|^2 }{ \left|\left|\mathbf{y}_{\text{IN}(b)} -\bar{\mathbf{y}}_{\text{IN}(b)}\right|\right|^2 }, \end{array} & \begin{array}{rccc} &Q^{2}_b & = & 1-\dfrac{ \left|\left|\mathbf{y}_{\text{OOB}(b)} -\hat{\mathbf{y}}_{\text{OOB}(b)}\right|\right|^2 }{ \left|\left|\mathbf{y}_{\text{OOB}(b)} -\bar{\mathbf{y}}_{\text{IN}(b)}\right|\right|^2 }, \end{array} \end{array} $$ where the IN(b) (IN for “In Bag”) is the list on indices of the observations selected in the bootstrap sample b and OOB(b) (OOB for “Out-Of-Bag”) is the list on indices not selected in the bootstrap sample. Then, the subscripts IN(b) correspond to values of the object taken for in-bag indices (respectively for subscript notation OOB(b) and out-of-bag indices). Also, the “$\hat{\mathbf{y}}$” notation corresponds to the estimation of “y” by the current model (based on the in-bag sample) and “$\bar{\mathbf{y}}$” stands for the mean estimator of “y”. The ddsPLS methodology aggregates the B descriptors Rb2 and Qb2 as follows: $$ \bar{R}^{2}_{B}=\frac{1}{B}\sum_{b=1}^{B}R^2_b ~~~\mbox{and}~~~ \bar{Q}^{2}_{B}=\frac{1}{B}\sum_{b=1}^{B}Q^2_b. $$
Even if the notations of the metrics R̄B2 and Q̄B2 show a square, they can be negative. Indeed, for large samples, they actually compare the quality of the built model to the mean prediction model:
As the objective of a linear prediction model is to build models better than the mean prediction model, a rule of thumb is to select models for which $$ \begin{array}{ccc} \mbox{(A)} &:& \bar{Q}^{2}_{B}>0. \end{array} $$
A second rule of thumb is necessary to determine if a new component is relevant to improve the overall prediction power of the model. If the rth-component is tested, then the condition writes
$$ \begin{array}{ccc} \mbox{(Ar)} &:& \bar{Q}^{2}_{B,r}>0, \end{array} $$
where Q̄B, r2 is the “component-version” of Q̄B2” defined in Appendix A.1.
Remark 1 The metrics R̄B2 and Q̄B2 are in fact estimators of a same statistic γ, associated with a prediction model 𝒫: $$ \begin{array}{ccc} \gamma(\mathcal{P}) & = & 1-\dfrac{ \sum_{j=1}^q \mbox{var} ({y}_{j} -y_{j}^{(\mathcal{P})}) }{ \sum_{j=1}^q \mbox{var} ({y}_{j})}, \end{array} $$ where yj(𝒫)) is the prediction of the jth component of y by the model 𝒫. The closer this statistic is to 1, the more accurate the model 𝒫 is.
As said before, a ddsPLS model of r components is based on r parameters (λ1, …, λr), denoted 𝒫λ1, …, λr. The metrics R̄B2, Q̄B2 or Q̄B, r2 are functions of the values taken by the estimated ddsPLS models $\widehat{\mathcal{P}}$, which depend on the r estimated λ̂1, …, λ̂r parameter. The ddsPLS methodology is based on a set of to be tested values for estimating each λs, ∀s ∈ [ [1, r] ], which is denoted as Λ in the following. Their values are pick in [0, 1].
We denote by $$ \bar{R}^{2}_{B}(\widehat{\mathcal{P}}_{\widehat{\lambda}_1,\dots,\widehat{\lambda}_{r}})~~~\text{and}~~~\bar{Q}^{2}_{B}(\widehat{\mathcal{P}}_{\widehat{\lambda}_1,\dots,\widehat{\lambda}_{r}}) $$ the values of the two metrics for the estimated ddsPLS model $\widehat{\mathcal{P}}$ of r components, based on the r estimated regularization coefficients λ̂1, …, λ̂r. For each component, the ddsPLS methodology seeks the model which minimizes the difference between those two metrics, more precisely:
$$ \begin{array}{cccc} \widehat{\lambda}_r &=& \mbox{arg min}_{\lambda\in \Lambda} & \bar{R}^{2}_{B}( \widehat{\mathcal{P}}_{\widehat{\lambda}_1,\dots,\widehat{\lambda}_{r-1},\lambda} ) - \bar{Q}^{2}_{B}( \widehat{\mathcal{P}}_{\widehat{\lambda}_1,\dots,\widehat{\lambda}_{r-1},\lambda} ),\\ && \mbox{s.t} & \left\{\begin{array}{l} \bar{Q}^{2}_{B}( \widehat{\mathcal{P}}_{\widehat{\lambda}_1,\dots,\widehat{\lambda}_{r-1},\lambda} )>\bar{Q}^{2}_{B}( \widehat{\mathcal{P}}_{\widehat{\lambda}_1,\dots,\widehat{\lambda}_{r-1}} ),\\ \bar{Q}^{2}_{B,r}( \widehat{\mathcal{P}}_{\widehat{\lambda}_1,\dots,\widehat{\lambda}_{r-1},\lambda} )>0. \end{array}\right. \end{array} $$
The rth-component is not built if λ̂r = ∅ and so the selected model is a (r − 1)-component (if r − 1 > 0 or is the mean estimator model) with estimated values for the regularization coefficients (λ̂1, …, λ̂r − 1).
The package depends on a low number of low level packages. They are of two types:
Parallelization packages: foreach
,
parallel
and doParallel
(which depends on both
of the previous packages).
C++-developer
packages: Rcpp
(for basic C++ development)
and RcppEigen
(for inner mathematical operations).
#> Loading required package: foreach
In the following, we study a synthetic structure defined as, in general
$$ \left\{ \begin{array}{l} \mathbf{x} = \mathbf{A}'\boldsymbol{\phi} + \boldsymbol{\epsilon},\ \mathbf{A}\in\mathbb{R}^{R\times p}, \\ \mathbf{y} = \mathbf{D}'\boldsymbol{\phi}+ \boldsymbol{\xi},\ \mathbf{D}\in\mathbb{R}^{R\times q}, \end{array} \right. $$
where R is the total number of eigenvectors of A′A = var(x) with non-null projections on A′D = cov(x, y). In the PLS context this is the theoretical number of components.
For the sack of the proposed simulations, we use R = 2 (associated to the dimension of ϕ), p = 1000 (associated to the dimension of A′ϕ and ϵ) and q = 3 (associated to the dimension of D′ϕ and ξ).
The total number of such data-sets is rarely higher than p = 1000 and the constraint n< < p holds most of times. This is an high-dimensionnal data-set.
Each observation of those random vectors are generated following a multivariate normal distribution such as ψ = (ϕ′, ϵ1, …, 100′/σ, ϵ101, …, 1000′, ξ1, 2′, ξ3)′ ∼ 𝒩(02 + 1000 + 3, 𝕀2 + 1000 + 3),
where σ is the standard deviation of the additive noise. A response variable y of q = 3 components is generated as a linear combination of the latent variable ϕ to which is added a Gaussian noise ξ. The equivalent process generates a variable x of p = 1000 components, from the matrix A and Gaussian additive noise ${\sigma}=\sqrt{1-0.95^2}\approx0.312$. The columns of A and D are normalized such as
∀(i, j) ∈ [ [1, p] ] × [ [1, q] ], var(xi) = var(yj) = 1.
Remark 2 Taking into account Remark 1 and the current statistical model, we can define a theoretical (understanding if n → +∞) value for γ(𝒫) which is $$ \begin{array}{ccc} \gamma^\star & = & 1-\dfrac{ \sum_{j=1}^q \mbox{var} (\epsilon_j) }{ \sum_{j=1}^q \mbox{var} ({y}_{j})}= 2(1-\sigma^2)/3\approx0.602. \end{array} $$ Comparing this theorethical value to the values of Q̄B2 helps interpretability, for theorethical work only. If Q̄B2< < γ⋆ then the corresponding model is not enough efficient. If Q̄B2> > γ⋆ then the corresponding model overfits the data.
More precisely we propose to study the following matrices A and D $$ \begin{array}{c l c c} & \mathbf{A} =\sqrt{1-\sigma^2} \left(\begin{array}{ccc} \boldsymbol{1}_{50}' & \sqrt{\alpha}\boldsymbol{1}_{25}' & \boldsymbol{0}_{25}' & \boldsymbol{0}_{900}'\\ \boldsymbol{0}_{50}' & \sqrt{1-\alpha}\boldsymbol{1}_{25}' & \boldsymbol{0}_{25}' & \boldsymbol{0}_{900}'\\ \boldsymbol{0}_{50}' & \boldsymbol{0}_{25}' &\boldsymbol{1}_{25}' & \boldsymbol{0}_{900}'\\ \end{array} \right)_{(3,1000)} &\text{ and }& \mathbf{D}=\sqrt{1-\sigma^2} \left(\begin{array}{ccc} 1 & \sqrt{\beta_0} & 0 \\ 0 & \sqrt{1-\beta_0} & 0 \\ 0 & 0 & 0 \\ \end{array} \right)_{(3,3)}, \end{array} $$ where α ∈ [0, 1] can be easily interpreted. Indeed, α controls the correlation between the components x1…50 and x51…75. Also β0 = 0.1. It indirectly controls the association between x and y2. The effects of α on those two associations are detailed in the following table.
Value of α | α ≈ 0 | α ≈ 1 |
---|---|---|
cor(x1…50, x51…75) | Strong | Low |
cor(x, y2) | Strong | Low |
If α > 1/2, the second variable response component, y2, is hard to predict and the variables x51…75 are hard to select. The objective is to build models predicting y1 and y2 but not y3. Also components 1 to 75 of x should be selected.
The function getData
, printed in Appendix
A.2 allows to simulate a couple
(X,Y) according to this structure.
As a popint of comparison, we can build the ddsPLS model for which Λ = {0} and look at its prediction performances, through the Q̄B2 statistics.
mo0 <- ddsPLS( datas$X, datas$Y,lambdas = 0,
n_B=n_B,NCORES=NCORES,verbose = FALSE)
sum0 <- summary(mo0,return = TRUE)
print(sum0$R2Q2[,c(1,4)])
mo0 <- ddsPLS( datas$X, datas$Y,lambdas = 0,
n_B=n_B,NCORES=1,verbose = FALSE)
sum0 <- summary(mo0,return = TRUE)
#> ______________
#> | ddsPLS |
#> =====================----------------=====================
#> The optimal ddsPLS model is built on 2 component(s)
#>
#> The bootstrap quality statistics:
#> ---------------------------------
#> lambda R2 R2_r Q2 Q2_r
#> Comp. 1 0 0.39 0.39 0.36 0.36
#> Comp. 2 0 0.63 0.24 0.56 0.31
#>
#>
#> The explained variance (in %):
#> -----------------------
#>
#> In total: 62.83
#> - - -
#>
#> Per component or cumulated:
#> - - - - - - - - -
#> Comp. 1 Comp. 2
#> Per component 38.9 23.93
#> Cumulative 38.9 62.83
#>
#> Per response variable:
#> - - - - - - - -
#> Y1 Y2 Y3
#> Comp. 1 78.74 34.30 3.65
#> Comp. 2 13.90 57.78 0.11
#>
#> Per response variable per component:
#> - - - - - - - - - - - -
#> Y1 Y2 Y3
#> Comp. 1 78.74 34.30 3.65
#> Comp. 2 13.90 57.78 0.11
#>
#> ...and cumulated to:
#> - - - - - - -
#> Y1 Y2 Y3
#> Comp. 1 78.74 34.30 3.65
#> Comp. 2 92.64 92.09 3.75
#>
#> For the X block:
#> - - - - - - - - -
#> Comp. 1 Comp. 2
#> Per component 5.32 2.45
#> Cumulative 5.32 7.76
#> ===================== =====================
#> ================
It is possible to compare the prediction qualities of the two models using
print(sum0$R2Q2[,c(1,4)])
#> lambda Q2
#> Comp. 1 0 0.3575314
#> Comp. 2 0 0.5582657
print(sum_up$R2Q2[,c(1,4)])
#> lambda Q2
#> Comp. 1 0.5172414 0.3314618
#> Comp. 2 0.5172414 0.5982472
In that context, the sparse ddsPLS approach allows to get better prediction rate than the “non selection” ddsPLS model.
plot
It is also possible to plot different things thanks to the plot S3-method. In the representation with λ in abscissa:
The different values given to the argument type
would
give representation that helps the analyst concluding on the final
quality of the model. The different values are
type="predict"
to draw the predicted values of
y against the observed. This can be useful to locate
potential outliers (observations away from the distribution…) that would
drive the model (… but close to the bisector).type="criterion"
to draw the values of the metrics
R̄B2 − Q̄B2.
This is the optimized metrics.type="Q2"
to draw the Q̄B2
metrics which represents the overall prediction quality of the build
model, one component after another.type="prop"
to draw the proportion of bootstrap models
with a positive Qb, r2,
this ∀b ∈ [ [1, B] ].
Since this is tricky to interpret negative values for Qb, r2
(apart from describing models which perform worse than the mean
prediction model) negative values for Q̄B, r2
is necessarily hardly interpretable. However, the proportion of models
with positive Qb, r2
can be interpreted as a probability to finally build a model 𝒫 with a positive γ(𝒫). This can be interesting to
look at this metrics to interprettype="weightX"
or type="weightY"
to draw
the values of the weights for each component for the X
block of for the Y block. If there is an a
priori, such as a functional one, on the variables of
X (resp. Y), this a priori
(which is not currently taken into account in the
ddsPLS model) must certainly have an impact on the
values of the weights parameters. This can be characterized by a
structure of the weights on each component. In the opposite case, if the
model is not enough sparse or too sparse, for example, the analyst is
invited to modify the parameterization of the model, by limiting the
grid of accessible λ for
example.Simply specifying type="predict"
.
Since the the variable y3 has not been selected, its predicted values are constantly equal to the mean estimation. The two other columns of Y are described with more than 90% accuracy and no observation seems to guide the model at the expense of other observations.
To plot the criterion, the plot argument type must
be set to criterion
. It is possible to move the legend with
the legend.position argument
.
The legend title gives the total explained variance by the model built on the two components while the legend itself gives the explained variance by each of the considered component.
the previous figure does not provide information on the prediction
quality of the model. This information can be found using the same
S3-method fixing the parameter type to
type="Q2"
.
According to that figure it is clear that the chosen optimal model (minimizing R̄B2 − Q̄B2) represents a model for which the Q̄B2 on each of its component is not far from being maximum.
It is also possible to directly plot the values of the weights for each component. For the Y block and for the X block, such as
where it is clear that the first component explains y1 while the second one explains y2. Looking at the weights on the block X:
the variables x1…50 (resp. x51…75) are selected on the first (resp. second) component, which is associated only with y1 (resp. y2).
For a given bootstrap sample indexed by b, we define $$ \begin{array}{rccc} &Q^{2}_{b,r} & = & 1-\dfrac{ \left|\left|\mathbf{y}_{\text{OOB}(b)} -\hat{\mathbf{y}}_{\text{OOB}(b)}^{(r)}\right|\right|^2 }{ \left|\left|\mathbf{y}_{\text{OOB}(b)} -\bar{\mathbf{y}}_{\text{IN}(b)}^{(r-1)}\right|\right|^2 }, \end{array} $$
where, $\hat{\mathbf{y}}_{\text{OOB}(b)}^{(r)}$ is the prediction of yOOB(b) for the model built on r components. The interpretation of this metrics is that if Qb, r2 > 0 then the r-components based model predicts better than the r − 1-components based model. Naturally, the aggregated version of this metrics is
$$ \bar{Q}^{2}_{B,r}=\frac{1}{B}\sum_{b=1}^{B}Q^2_{b,r} $$
We propose the following function
getData
#> function (n = 100, alpha = 0.4, beta_0 = 0.2, sigma = 0.3, p1 = 50,
#> p2 = 25, p3 = 25, p = 1000)
#> {
#> R1 = R2 = R3 <- 1
#> d <- R1 + R2 + R3
#> A0 <- matrix(c(c(rep(1/sqrt(R1), p1), rep(sqrt(alpha), p2),
#> rep(0, p3), rep(0, p - p1 - p2 - p3)), c(rep(0, p1),
#> rep(sqrt(1 - alpha), p2), rep(0, p3), rep(0, p - p1 -
#> p2 - p3)), c(rep(0, p1), rep(0, p2), rep(1, p3),
#> rep(0, p - p1 - p2 - p3))), nrow = d, byrow = TRUE)
#> A <- eps * A0
#> D0 <- matrix(c(1, 0, 0, sqrt(beta_0), sqrt(1 - beta_0), 0,
#> 0, 0, 0), nrow = d, byrow = FALSE)
#> D <- eps * D0
#> q <- ncol(D)
#> L_total <- q + p
#> psi <- MASS::mvrnorm(n, mu = rep(0, d + L_total), Sigma = diag(d +
#> L_total))
#> phi <- psi[, 1:d, drop = F]
#> errorX <- matrix(rep(sqrt(1 - apply(A^2, 2, sum)), n), n,
#> byrow = TRUE)
#> errorY <- matrix(rep(sqrt(1 - apply(D^2, 2, sum)), n), n,
#> byrow = TRUE)
#> X <- phi %*% A + errorX * psi[, d + 1:p, drop = F]
#> Y <- phi %*% D + errorY * psi[, d + p + 1:q, drop = F]
#> list(X = X, Y = Y)
#> }
#> <bytecode: 0x5640407b3a40>
where the output is a list of two matrices X and Y.
For a model built on R components
$$ \begin{array}{rccc} &\text{ExpVar}_{1:R} & = & \left(1-\frac{1}{n}\sum_{i=1}^n\dfrac{ \left|\left|\mathbf{y}_i -\hat{\mathbf{y}}^{(1:R)}_i\right|\right|^2 }{ \left|\left|\mathbf{y}_i -\boldsymbol{\mu}_{\mathbf{y}}\right|\right|^2 } \right)*100, \end{array} $$
where
$$ \hat{\mathbf{y}}^{(1:R)} = \left( \mathbf{X}-\boldsymbol{\mu}_{\mathbf{x}} \right) \mathbf{U}_{(1:R)}\left(\mathbf{P}_{(1:R)}'\mathbf{U}_{(1:R)}\right)^{-1}\mathbf{C}_{(1:R)} + \boldsymbol{\mu}_{\mathbf{y}}, $$ and U(1 : R) is the concatenation of the R weights, P(1 : R) is the concatenation of the R scores pr = X(r)′tr/tr′tr and C(1 : R) is the concatenation of the R scores cr = ΠrY(r)′tr/tr′tr. The deflated matrices are defined such as X(r + 1) = X(r) − trpr′, Y(r + 1) = Y(r) − trcr′, and Πr is the diagonal matrix with 1 element if the associated response variable is selected and 0 elsewhere. Also, μy and μx are the estimated mean matrices.
$$ \begin{array}{rccc} &\text{ExpVar}_{R} & = & \left(1-\frac{1}{n}\sum_{i=1}^n\dfrac{ \left|\left|\mathbf{y}_i -\hat{\mathbf{y}}^{(R)}_i\right|\right|^2 }{ \left|\left|\mathbf{y}_i -\boldsymbol{\mu}_{\mathbf{y}}\right|\right|^2 } \right)*100, \end{array} $$
$$ \begin{array}{rccc} &\text{ExpVar}_{1:R}^{(j)} & = & \left( 1-\frac{1}{n}\sum_{i=1}^n\dfrac{ \left|\left|y_{i,j} -\hat{{y}}^{(1:R)}_{i,j}\right|\right|^2 }{ \left|\left|{y}_{i,j} -{\mu}_{{y}_j}\right|\right|^2 } \right)*100, \end{array} $$
$$ \begin{array}{rccc} &\text{ExpVar}_{R}^{(j)} & = & \left( 1-\frac{1}{n}\sum_{i=1}^n\dfrac{ \left|\left|y_{i,j} -\hat{{y}}^{(R)}_{i,j}\right|\right|^2 }{ \left|\left|{y}_{i,j} -{\mu}_{{y}_j}\right|\right|^2 } \right)*100, \end{array} $$