Title: | Multidimensional Scaling for Big Data |
---|---|
Description: | MDS is a statistical tool for dimensionality reduction that takes as input a distance matrix of dimensions n × n. When n is large, classical algorithms suffer from computational problems and the MDS configuration cannot be obtained. This package addresses these problems by means of six algorithms, two of them original proposals: - Landmark MDS, proposed by De Silva V. and JB. Tenenbaum (2004). - Interpolation MDS, proposed by Delicado P. and C. Pachón-García (2021) <arXiv:2007.11919> (original proposal). - Reduced MDS, proposed by Paradis E. (2018). - Pivot MDS, proposed by Brandes U. and C. Pich (2007). - Divide-and-conquer MDS, proposed by Delicado P. and C. Pachón-García (2021) <arXiv:2007.11919> (original proposal). - Fast MDS, proposed by Yang, T., J. Liu, L. McMillan and W. Wang (2006). |
Authors: | Cristian Pachón García [aut, cre]
|
Maintainer: | Cristian Pachón García <[email protected]> |
License: | MIT + file LICENSE |
Version: | 3.0.0 |
Built: | 2025-02-02 03:09:53 UTC |
Source: | https://github.com/pachoning/bigmds |
Roughly speaking, a large data set x, of size n, is divided into parts; classical MDS is then performed over every part and, finally, the partial configurations are combined so that all the points lie on the same coordinate system, with the aim of obtaining a global MDS configuration.
divide_conquer_mds(x, l, c_points, r, n_cores)
x |
A matrix with n points (rows) and k variables (columns). |
l |
The size for which classical MDS can be computed efficiently
(using cmdscale()). |
c_points |
Number of points used to align the MDS solutions obtained by the
division of x into p data subsets. |
r |
Number of principal coordinates to be extracted. |
n_cores |
Number of cores to use to run the algorithm. |
The divide-and-conquer MDS starts by dividing the n points into p
partitions: the first partition contains l points and the others
contain l - c_points points. Therefore, p = 1 + (n - l)/(l - c_points).
The partitions are created at random.
Once the partitions are created, c_points different random
points are taken from the first partition and concatenated to the other
partitions. After that, classical MDS is applied to each partition,
with target low-dimensional configuration of dimension r.
Since all the partitions share c_points
points with the first one, a Procrustes transformation can be applied to align all
the configurations. Finally, all the configurations are
concatenated in order to obtain a global MDS configuration.
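The steps above can be sketched in plain R as follows. This is only an illustration, not the package's implementation: procrustes_rotation is a hypothetical helper (an orthogonal Procrustes fit via SVD), and cmdscale() stands in for the classical MDS step.

```r
set.seed(42)
n <- 1000; l <- 200; c_points <- 10; r <- 2
x <- matrix(rnorm(4 * n), nrow = n) %*% diag(c(9, 4, 1, 1))

# Rotation (orthogonal Procrustes) taking `source` onto `target`;
# both matrices must be centered and have the same number of rows.
procrustes_rotation <- function(target, source) {
  s <- svd(crossprod(source, target))
  s$u %*% t(s$v)
}

# Random partitions: the first of size l, the rest of size l - c_points.
idx <- sample(n)
first <- idx[1:l]
rest <- split(idx[-(1:l)], ceiling(seq_along(idx[-(1:l)]) / (l - c_points)))

conf <- matrix(NA_real_, nrow = n, ncol = r)
conf[first, ] <- cmdscale(dist(x[first, ]), k = r)
anchor <- sample(first, c_points)  # connecting points shared with every partition

for (part in rest) {
  m <- cmdscale(dist(x[c(anchor, part), ]), k = r)  # classical MDS per partition
  a_new <- m[1:c_points, , drop = FALSE]            # anchors, partition coordinates
  a_old <- conf[anchor, , drop = FALSE]             # anchors, global coordinates
  ctr_new <- colMeans(a_new); ctr_old <- colMeans(a_old)
  R <- procrustes_rotation(sweep(a_old, 2, ctr_old), sweep(a_new, 2, ctr_new))
  body <- sweep(m[-(1:c_points), , drop = FALSE], 2, ctr_new) %*% R
  conf[part, ] <- sweep(body, 2, ctr_old, FUN = "+")
}
```

Each partition is embedded together with the c_points anchor points; the anchors' two sets of coordinates determine the rotation and translation that map the partition into the global configuration.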
Returns a list containing the following elements:
points |
A matrix that consists of n points (rows) and r variables (columns)
corresponding to the principal coordinates. Since
a dimensionality reduction is performed, r << k. |
eigen |
The first r largest eigenvalues: λ̄_i, i ∈ {1, …, r}, where
λ̄_i = Σ_j n_j λ_i^j / n, being λ_i^j the i-th eigenvalue from partition j
and n_j the size of partition j. |
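A small worked example of this size-weighted average across partitions; the eigenvalues and partition sizes below are made up for illustration.

```r
# Hypothetical eigenvalues from 3 partitions (rows) for r = 2 principal
# coordinates (columns), together with the partition sizes.
lambda <- matrix(c(9.1, 4.2,
                   8.8, 3.9,
                   9.3, 4.0), nrow = 3, byrow = TRUE)
sizes <- c(200, 190, 190)

# Size-weighted average of the i-th eigenvalue across partitions.
eigen_bar <- colSums(lambda * sizes) / sum(sizes)
```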
Delicado P. and C. Pachón-García (2021). Multidimensional Scaling for Big Data. https://arxiv.org/abs/2007.11919.
Borg, I. and P. Groenen (2005). Modern Multidimensional Scaling: Theory and Applications. Springer.
set.seed(42)
x <- matrix(data = rnorm(4 * 10000), nrow = 10000) %*% diag(c(9, 4, 1, 1))
mds <- divide_conquer_mds(x = x, l = 200, c_points = 5 * 2, r = 2, n_cores = 1)
head(mds$points)
mds$eigen
Fast MDS uses recursive programming in combination with a divide-and-conquer
strategy in order to obtain an MDS configuration for a given
large data set x.
fast_mds(x, l, s_points, r, n_cores)
x |
A matrix with n points (rows) and k variables (columns). |
l |
The size for which classical MDS can be computed efficiently
(using cmdscale()). |
s_points |
Number of points used to align the MDS solutions obtained by
the division of x into p data subsets. |
r |
Number of principal coordinates to be extracted. |
n_cores |
Number of cores to use to run the algorithm. |
Fast MDS randomly divides the whole sample data set x, of size n,
into p = l/s_points data subsets, where l is
the limit size for which classical MDS is applicable. Each one of the p
data subsets has size ñ = n/p. If ñ ≤ l, then classical MDS is applied
to each data subset. Otherwise, fast MDS is recursively applied.
In either case, a final MDS configuration is obtained for each data subset.
In order to align all the partial solutions, a small subset of size s_points
is randomly selected from each data subset. These subsets are joined
to form an alignment set, over which classical MDS is performed, giving rise to an
alignment configuration. Every data subset shares s_points points with the alignment
set. Therefore, every MDS configuration can be aligned with the alignment configuration
using a Procrustes transformation.
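The recursion described above can be sketched as follows. This is an illustration only, not the package's implementation: cmdscale() plays the role of classical MDS and the Procrustes alignment is done inline via SVD.

```r
set.seed(42)
x <- matrix(rnorm(4 * 1000), nrow = 1000) %*% diag(c(9, 4, 1, 1))

fast_mds_sketch <- function(x, l, s_points, r) {
  n <- nrow(x)
  if (n <= l) return(cmdscale(dist(x), k = r))  # base case: classical MDS
  p <- max(2, floor(l / s_points))              # number of data subsets
  parts <- split(sample(n), rep_len(seq_len(p), n))
  confs <- lapply(parts, function(ids)
    fast_mds_sketch(x[ids, , drop = FALSE], l, s_points, r))
  # Alignment set: s_points sampled per subset, embedded jointly.
  samp <- lapply(parts, function(ids) sample(seq_along(ids), s_points))
  align_ids <- unlist(Map(function(ids, s) ids[s], parts, samp), use.names = FALSE)
  align_conf <- cmdscale(dist(x[align_ids, , drop = FALSE]), k = r)
  out <- matrix(NA_real_, n, r)
  for (j in seq_len(p)) {
    a_target <- align_conf[(j - 1) * s_points + seq_len(s_points), , drop = FALSE]
    a_source <- confs[[j]][samp[[j]], , drop = FALSE]
    ct <- colMeans(a_target); cs <- colMeans(a_source)
    # Orthogonal Procrustes rotation mapping the shared points onto the
    # alignment configuration, applied to the whole partial solution.
    s <- svd(crossprod(sweep(a_source, 2, cs), sweep(a_target, 2, ct)))
    out[parts[[j]], ] <- sweep(sweep(confs[[j]], 2, cs) %*% (s$u %*% t(s$v)),
                               2, ct, FUN = "+")
  }
  out
}

mds_conf <- fast_mds_sketch(x, l = 200, s_points = 10, r = 2)
```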
Returns a list containing the following elements:
points |
A matrix that consists of n individuals (rows) and r variables (columns)
corresponding to the principal coordinates. Since
we are performing a dimensionality reduction, r << k. |
eigen |
The first r largest eigenvalues: λ̄_i, i ∈ {1, …, r}, where
λ̄_i = Σ_j n_j λ_i^j / n, being λ_i^j the i-th eigenvalue from partition j
and n_j the size of partition j. |
Delicado P. and C. Pachón-García (2021). Multidimensional Scaling for Big Data. https://arxiv.org/abs/2007.11919.
Yang, T., J. Liu, L. McMillan and W. Wang (2006). A fast approximation to multidimensional scaling. In Proceedings of the ECCV Workshop on Computation Intensive Methods for Computer Vision (CIMCV).
Borg, I. and P. Groenen (2005). Modern Multidimensional Scaling: Theory and Applications. Springer.
set.seed(42)
x <- matrix(data = rnorm(4 * 10000), nrow = 10000) %*% diag(c(9, 4, 1, 1))
mds <- fast_mds(x = x, l = 200, s_points = 5 * 2, r = 2, n_cores = 1)
head(mds$points)
mds$eigen
When the size of the data set is too large, this algorithm
consists of taking a random sample from it of size l, being l the limit size for which
classical MDS is applicable, performing classical MDS on it, and extending the
obtained results to the rest of the data set by using Gower's
interpolation formula, which allows adding a new set of points
to an existing MDS configuration.
interpolation_mds(x, l, r, n_cores)
x |
A matrix with n points (rows) and k variables (columns). |
l |
The size for which classical MDS can be computed efficiently
(using cmdscale()). |
r |
Number of principal coordinates to be extracted. |
n_cores |
Number of cores to use to run the algorithm. |
Gower's interpolation formula is the central piece of this algorithm, since it allows adding a new set of points to an existing MDS configuration so that the new one has the same coordinate system.
Given the matrix x
with n points (rows) and k
variables (columns), a first data subset (based on a random sample)
of size l
is taken and used to compute an MDS configuration.
The remaining part of x
is randomly divided into p = (n - l)/l
data subsets. For every data subset, an MDS
configuration is obtained by means of Gower's interpolation formula and the first
MDS configuration obtained previously. Every MDS configuration is appended
to the existing one so that, at the end of the process, a global MDS
configuration for x
is obtained.
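One common form of Gower's interpolation formula can be sketched as follows: for a classical MDS configuration X with centered columns, a new point with vector of squared distances d2 to the original points is placed at y = (1/2) (X'X)^(-1) X' (q - d2), where q holds the squared norms of the rows of X. The snippet below is an illustration, not the package's code; the placement is exact only when r equals the full dimension, and is approximate here since r < k.

```r
set.seed(42)
n <- 200; r <- 2
x <- matrix(rnorm(4 * (n + 1)), nrow = n + 1) %*% diag(c(9, 4, 1, 1))
base <- x[1:n, ]
new_pt <- x[n + 1, ]

# Classical MDS on the first data subset: the columns of X are centered.
X <- cmdscale(dist(base), k = r)

# Gower's interpolation for one new point.
q <- rowSums(X^2)                       # squared norms of the configuration rows
d2 <- colSums((t(base) - new_pt)^2)     # squared distances to the base points
y <- 0.5 * solve(crossprod(X), crossprod(X, q - d2))
```

Since the formula only involves matrix products with the existing configuration, each new point (or batch of points) is placed without recomputing the MDS of the base sample.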
This method is similar to landmark_mds() and reduced_mds().
Returns a list containing the following elements:
points |
A matrix that consists of n individuals (rows) and r variables (columns)
corresponding to the principal coordinates. Since
we are performing a dimensionality reduction, r << k. |
eigen |
The first r largest eigenvalues: λ̄_i, i ∈ {1, …, r}, where each λ̄_i is obtained
from applying classical MDS to the first data subset. |
Delicado P. and C. Pachón-García (2021). Multidimensional Scaling for Big Data. https://arxiv.org/abs/2007.11919.
Borg, I. and P. Groenen (2005). Modern Multidimensional Scaling: Theory and Applications. Springer.
Gower JC. (1968). Adding a point to vector diagrams in multivariate analysis. Biometrika.
set.seed(42)
x <- matrix(data = rnorm(4 * 10000), nrow = 10000) %*% diag(c(9, 4, 1, 1))
mds <- interpolation_mds(x = x, l = 200, r = 2, n_cores = 1)
head(mds$points)
mds$eigen
The Landmark MDS (LMDS) algorithm first applies classical MDS to a subset of the data (the landmark points) and then projects the remaining individuals onto the landmark low-dimensional configuration using a distance-based triangulation procedure.
landmark_mds(x, num_landmarks, r)
x |
A matrix with n points (rows) and k variables (columns). |
num_landmarks |
Number of landmark points to obtain an initial MDS configuration. It is
equivalent to the l parameter used in divide_conquer_mds(), fast_mds(),
interpolation_mds() and reduced_mds(). |
r |
Number of principal coordinates to be extracted. |
LMDS first applies classical MDS to a subset of the data (the landmark points). Then it uses a distance-based triangulation procedure to project the non-landmark individuals; this procedure coincides with Gower's interpolation formula.
This method is similar to interpolation_mds() and reduced_mds().
Returns a list containing the following elements:
points |
A matrix that consists of n points (rows) and r variables (columns)
corresponding to the principal coordinates. Since
a dimensionality reduction is performed, r << k. |
eigen |
The first r largest eigenvalues: λ̄_i, i ∈ {1, …, r}, where each λ̄_i is obtained
from applying classical MDS to the first data subset. |
Delicado P. and C. Pachón-García (2021). Multidimensional Scaling for Big Data. https://arxiv.org/abs/2007.11919.
Borg, I. and P. Groenen (2005). Modern Multidimensional Scaling: Theory and Applications. Springer.
De Silva V. and JB. Tenenbaum (2004). Sparse multidimensional scaling using landmark points. Technical Report, Stanford University.
Gower JC. (1968). Adding a point to vector diagrams in multivariate analysis. Biometrika.
set.seed(42)
x <- matrix(data = rnorm(4 * 10000), nrow = 10000) %*% diag(c(9, 4, 1, 1))
mds <- landmark_mds(x = x, num_landmarks = 200, r = 2)
head(mds$points)
mds$eigen
Pivot MDS, introduced in the literature on graph layout algorithms, is similar to
Landmark MDS (landmark_mds()), but it uses the distance information between landmark and non-landmark
points to improve the initial low-dimensional configuration,
as more relations than just those between landmark points are taken into account.
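The construction can be sketched as follows (an illustration, not the package's code): double-center the n × m matrix of squared point-to-pivot distances and take its top singular vectors. Scaling the left singular vectors by the square roots of the singular values is one common convention, assumed here.

```r
set.seed(42)
n <- 1000; m <- 100; r <- 2
x <- matrix(rnorm(4 * n), nrow = n) %*% diag(c(9, 4, 1, 1))
pivots <- sample(n, m)

# n x m matrix of squared distances between every point and the pivots.
d2 <- as.matrix(dist(x))[, pivots]^2   # computes all n x n distances; fine for a sketch

# Double-center d2 using its row means, pivot-column means and grand mean.
C <- -0.5 * (sweep(sweep(d2, 1, rowMeans(d2)), 2, colMeans(d2)) + mean(d2))

# Coordinates from the top left singular vectors of C, scaled by the
# square roots of the corresponding singular values.
sv <- svd(C, nu = r, nv = 0)
conf <- sv$u %*% diag(sqrt(sv$d[1:r]))
```

Because C uses all n × m point-to-pivot distances, every point contributes to the decomposition, in contrast to Landmark MDS, where only landmark-to-landmark distances enter the initial embedding.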
pivot_mds(x, num_pivots, r)
x |
A matrix with n points (rows) and k variables (columns). |
num_pivots |
Number of pivot points to obtain an initial MDS configuration. It is
equivalent to the l parameter used in divide_conquer_mds(), fast_mds(),
interpolation_mds() and reduced_mds(). |
r |
Number of principal coordinates to be extracted. |
Returns a list containing the following elements:
points |
A matrix that consists of n individuals (rows) and r variables (columns)
corresponding to the principal coordinates. Since
we are performing a dimensionality reduction, r << k. |
eigen |
The first r largest eigenvalues: λ̄_i, i ∈ {1, …, r}, where each λ̄_i is obtained
from applying classical MDS to the first data subset. |
Delicado P. and C. Pachón-García (2021). Multidimensional Scaling for Big Data. https://arxiv.org/abs/2007.11919.
Brandes U. and C. Pich (2007). Eigensolver Methods for Progressive Multidimensional Scaling of Large Data. Graph Drawing.
Borg, I. and P. Groenen (2005). Modern Multidimensional Scaling: Theory and Applications. Springer.
Gower JC. (1968). Adding a point to vector diagrams in multivariate analysis. Biometrika.
set.seed(42)
x <- matrix(data = rnorm(4 * 10000), nrow = 10000) %*% diag(c(9, 4, 1, 1))
mds <- pivot_mds(x = x, num_pivots = 200, r = 2)
head(mds$points)
mds$eigen
A data subset is selected and classical MDS is performed on it to obtain the corresponding low-dimensional configuration. Then the remaining points are projected onto this initial configuration.
reduced_mds(x, l, r, n_cores)
x |
A matrix with n points (rows) and k variables (columns). |
l |
The size for which classical MDS can be computed efficiently
(using cmdscale()). |
r |
Number of principal coordinates to be extracted. |
n_cores |
Number of cores to use to run the algorithm. |
Gower's interpolation formula is the central piece of this algorithm, since it allows adding a new set of points to an existing MDS configuration so that the new one has the same coordinate system.
Given the matrix x
with n points (rows) and k
variables (columns), a first data subset (based on a random sample)
of size l
is taken and used to compute an MDS configuration.
The remaining part of x
is randomly divided into p = (n - l)/l
data subsets. For every remaining data point, an MDS
configuration is obtained by means of Gower's interpolation formula and the first
MDS configuration obtained previously. Every configuration is appended
to the existing one so that, at the end of the process, a global MDS
configuration for x
is obtained.
This method is similar to landmark_mds() and interpolation_mds().
Returns a list containing the following elements:
points |
A matrix that consists of n individuals (rows) and r variables (columns)
corresponding to the principal coordinates. Since
we are performing a dimensionality reduction, r << k. |
eigen |
The first r largest eigenvalues: λ̄_i, i ∈ {1, …, r}, where each λ̄_i is obtained
from applying classical MDS to the first data subset. |
Delicado P. and C. Pachón-García (2021). Multidimensional Scaling for Big Data. https://arxiv.org/abs/2007.11919.
Paradis E. (2018). Multidimensional Scaling With Very Large Datasets. Journal of Computational and Graphical Statistics.
Borg, I. and P. Groenen (2005). Modern Multidimensional Scaling: Theory and Applications. Springer.
Gower JC. (1968). Adding a point to vector diagrams in multivariate analysis. Biometrika.
set.seed(42)
x <- matrix(data = rnorm(4 * 10000), nrow = 10000) %*% diag(c(9, 4, 1, 1))
mds <- reduced_mds(x = x, l = 200, r = 2, n_cores = 1)
head(mds$points)
mds$eigen