Title: | A Fast Clustering Algorithm for High Dimensional Data Based on the Gram Matrix Decomposition |
---|---|
Description: | Clustering algorithm for high dimensional data. Assuming that P feature measurements on N objects are arranged in an N×P matrix X, this package provides clustering based on the left Gram matrix XX^T. To simulate test data, type "help('simulate_HD_data')" and to learn how to use the clustering algorithm, type "help('RJclust')". To cite this package, type 'citation("RJcluster")'. |
Authors: | Shahina Rahman [aut], Valen E. Johnson [aut], Suhasini Subba Rao [aut], Rachael Shudde [aut, cre, trl] |
Maintainer: | Rachael Shudde <[email protected]> |
License: | GPL (>= 2) |
Version: | 3.2.4 |
Built: | 2024-10-31 18:38:30 UTC |
Source: | https://github.com/cran/RJcluster |
Package: | RJcluster |
Type: | Package |
Version: | 3.2.4 |
Date: | 07-15-2021 |
License: | GPL>=2 |
Calculates normalized mutual information (NMI) and adjusted mutual information (AMI). Both values lie between 0 and 1 and measure how closely the two sets of cluster labels agree. A value closer to 1 means the labels in v1 and v2 are more similar, and a value closer to 0 means they are less similar.
Mutual_Information(v1, v2)
v1 |
vector containing first classification labels |
v2 |
vector containing second classification labels |
See these links for a more formal definition of AMI and NMI.
Returns mutual information:
nmi |
NMI value |
ami |
AMI value |
cluster1 <- sample(1:5, size = 10, replace = TRUE)
cluster2 <- sample(1:2, size = 10, replace = TRUE)
Mutual_Information(cluster1, cluster2)
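To make the NMI value more concrete, the following base R sketch computes a normalized mutual information from the contingency table of two label vectors. The helper `nmi_sketch` is hypothetical and not part of RJcluster, and the normalization used here (dividing by the arithmetic mean of the two entropies) is an assumption; `Mutual_Information` may normalize differently.

```r
# Hypothetical helper, not the package's Mutual_Information().
# Assumption: NMI = 2 * I(v1; v2) / (H(v1) + H(v2)).
nmi_sketch <- function(v1, v2) {
  stopifnot(length(v1) == length(v2))
  joint <- table(v1, v2) / length(v1)   # joint distribution of label pairs
  p1 <- rowSums(joint)                  # marginal distribution of v1 labels
  p2 <- colSums(joint)                  # marginal distribution of v2 labels
  entropy <- function(p) {
    p <- p[p > 0]                       # skip empty cells, 0 * log(0) := 0
    -sum(p * log(p))
  }
  h1 <- entropy(p1)
  h2 <- entropy(p2)
  expected <- outer(p1, p2)             # joint distribution under independence
  nz <- joint > 0
  mi <- sum(joint[nz] * log(joint[nz] / expected[nz]))
  if (h1 + h2 == 0) return(1)           # both labelings are constant
  2 * mi / (h1 + h2)
}

a <- c(1, 1, 2, 2, 3, 3)
nmi_sketch(a, a)                          # identical labelings
nmi_sketch(a, c(2, 2, 3, 3, 1, 1))        # same partition, labels renamed
nmi_sketch(c(1, 1, 2, 2), c(1, 2, 1, 2))  # statistically independent labelings
```

Under this normalization, relabeling the clusters does not change the score, which is why the measure suits comparing clusterings whose label numbers are arbitrary.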
This is a high dimensional clustering algorithm for data in matrix form. Two penalty methods are available,
depending on the size of the data and the desired accuracy. The first is the default method: the hockey stick penalty. There is also the BIC penalty.
For large n, the scaled method can be used (scaleRJ = TRUE), which uses the approximation method of RJclust. The scaleRJ method
requires a parameter n_bins (usually sqrt(p)) that splits the data into different buckets.
All methods need a C_max value, which is an upper limit on the possible
number of clusters.
RJclust(
  data,
  penalty = "hockey_stick",
  scaleRJ = FALSE,
  C_max = 10,
  criterion = "VVI",
  n_bins = NULL,
  seed = 1,
  verbose = FALSE
)
data |
Data input, must be in matrix form. Currently no support for missing values |
penalty |
A string naming the penalty method. Options include: "bic" and "hockey_stick" (default = "hockey_stick") |
scaleRJ |
Should the scaled version of RJ be used, suggested for data where n > 1000 (default = FALSE) |
C_max |
Maximum number of clusters to look for (default is 10) |
criterion |
Model of covariance structure (default = "VVI") |
n_bins |
Number of cuts for the scaled RJ algorithm, used when scaleRJ = TRUE (default = sqrt(p)) |
seed |
Seed (default = 1) |
verbose |
Should progress be printed? (default = FALSE) |
All implementations use a C++ backend to reduce runtime.
The criterion argument controls the type of covariance structure. See the Mclust documentation for more information. Note that criterion "kmeans" is the same as "EEI". Using "kmeans" is not recommended if the classes are suspected to be imbalanced.
Returns the RJ algorithm result for "aic" and "bic" ("mclust" and "scale" will return an mclust object):
K |
number of clusters found |
class |
Class labels |
penalty |
Penalty values at each iteration |
mean |
Mean matrix |
prob |
Probability values |
z |
Z values from mclust (NULL if penalty = "full_covariance") |
X = simulate_HD_data()
X = X$X
clust = RJclust(X, penalty = "hockey_stick", C_max = 10)
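The package description states that clustering is based on the left Gram matrix XX^T of the N×P data matrix X. The base R sketch below only illustrates what that matrix is and why it helps when P is much larger than N: its size is N×N regardless of P. The toy dimensions are arbitrary, and any scaling or further processing that RJclust applies internally is not shown here.

```r
# Illustration only: build the left Gram matrix of a small random data matrix.
set.seed(1)
n <- 6    # number of objects (rows)
p <- 50   # number of features (columns), deliberately larger than n
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

G <- tcrossprod(X)   # same as X %*% t(X): inner products between rows of X

dim(G)               # n x n, independent of p
isSymmetric(G)       # a Gram matrix is symmetric
```

Entry G[i, j] is the inner product of objects i and j across all P features, so each row of G describes one object by its similarity to every other object.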
This is simulation data to check the performance of RJcluster. Data can be simulated for any n, p, and cluster sizes. The simulated features are of two
types: noisy features and signal features. The share of features that is noisy is controlled by the sparsity parameter. The noisy features are split into two
equal halves, and their variance is set by noise_variance. The signal features are split in two as well: the first half takes its means from the first column of mu
and the second half from the second column of mu, with variance signal_variance.
simulate_HD_data(
  size_vector = c(20, 20, 20, 20),
  p = 220,
  mu = matrix(c(1.5, 2.5, 0, 1.5, 0, -1.5, -2.5, -1.5), ncol = 2, byrow = TRUE),
  signal_variance = 1,
  noise_variance = 1,
  sparsity = 0.09,
  seed = 1234
)
size_vector |
A vector of the sizes of the different clusters (default = a balanced case of 4 clusters of size 20, c(20, 20, 20, 20)) |
p |
The number of columns in the simulated matrix (default = 220) |
mu |
The matrix of means, of dimension length(size_vector) x 2. The first column of means is for the first half of the informative features, and the second column of means is for the second half of the informative features (default is described in the RJcluster paper) |
signal_variance |
Variance of the signal part of the generated data. A value of 1 indicates a high SNR, a value of 2 indicates a low SNR (default = 1) |
noise_variance |
Variance of the noisy part of the generated data (Default = 1) |
sparsity |
What percent of the data should be informative? A value between 0 and 1, a higher value means more data is informative (default = 0.09) |
seed |
Random seed. Change if generating multiple simulation datasets (default = 1234) |
The data in the paper is generated with 4 clusters, in a balanced case of c(20, 20, 20, 20) and an unbalanced case of c(20, 20, 200, 200),
with p = 220 in both cases. The default is a balanced, high signal case with mu as the matrix in the RJcluster paper.
Returns simulation data for X and Y values
X |
Matrix of dimension sum(size_vector) x p |
Y |
Vector of class labels of length sum(size_vector), with unique values of 1:length(size_vector) |
data = simulate_HD_data()
X = data$X
Y = data$Y
print(head(X))
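The layout described above (a block of informative features whose means come from the two columns of mu, followed by noise features) can be sketched in base R as follows. The function `sketch_hd_data` is a hypothetical stand-in, not the package's `simulate_HD_data`. It assumes that the number of informative features is `round(p * sparsity)`, that features are normally distributed, and that every noise feature has mean 0 and variance noise_variance; the package may differ on each of these points.

```r
# Hypothetical stand-in for simulate_HD_data(), for illustration only.
sketch_hd_data <- function(size_vector = c(20, 20, 20, 20), p = 220,
                           mu = matrix(c(1.5, 2.5, 0, 1.5, 0, -1.5, -2.5, -1.5),
                                       ncol = 2, byrow = TRUE),
                           signal_variance = 1, noise_variance = 1,
                           sparsity = 0.09, seed = 1234) {
  set.seed(seed)
  n <- sum(size_vector)
  Y <- rep(seq_along(size_vector), times = size_vector)  # class label per row
  n_signal <- round(p * sparsity)   # assumption: rounding rule for informative features
  half <- n_signal %/% 2
  # Column of mu used by each informative feature: 1 for the first half, 2 for the second
  mu_col <- rep(c(1, 2), times = c(half, n_signal - half))
  # Mean of row i, informative feature j is mu[Y[i], mu_col[j]]
  means <- mu[Y, mu_col, drop = FALSE]
  signal <- means + matrix(rnorm(n * n_signal, sd = sqrt(signal_variance)), nrow = n)
  # Assumption: all noise features are N(0, noise_variance)
  noise <- matrix(rnorm(n * (p - n_signal), sd = sqrt(noise_variance)), nrow = n)
  list(X = cbind(signal, noise), Y = Y)
}

sim <- sketch_hd_data()
dim(sim$X)     # sum(size_vector) rows and p columns
table(sim$Y)   # cluster sizes, matching size_vector
```

With the defaults, 0.09 * 220 = 19.8, so this sketch rounds to 20 informative features and leaves 200 noise features.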