Threaded Mode | Linear Mode

rawi · 12-03-2019, 03:23 PM

Cluster Analysis is a multivariate statistical method to look for homogeneous groups within elements. For example it is used in market segmentation.

The program uses K-Means-method to generate clusters. For the experts: It uses euclidian distances and centroid method. Input is a data matrix in matrix M3, whereas the rows are the elements and the columns are the variables. If you have the data matrix on your computer e.g. in Excel you can copy it to the spreadsheet application in the HP Prime and from there to the matrix using the connectivity software. I could not directly copy to a matrix but perhaps there is a way I could not find.

Secondly you have to give the information which of the variables you want to have active, i.e. which of the variables shall be used in the clustering procedure. You can use only 2 or all variables you have. Sometimes it makes sense not to use all variables. But the final analysis is shown for all variables.

K-Means-method needs the number of clusters (= homogeneous groups) you want to have. Then it generates as many cluster centers by randomly selecting elements. In the next step it allocates the other elements to the centers. Each element is attributed to the cluster where the distance to the center is lowest. Then the centers are computed newly as means of clusters. Now because the centers changed it may be that for some of the elements another center is nearer. So they attribution to the centers is done again. This is repeated as often as there is no further improvement.

Then another random choice of elements is used to form a new center and to optimize it again. This is necessary because the method mostly only finds local optima. To find a good solution you will repeat the procedure e.g. 100 times. The program takes the best solution found as the final solution.

The HP prime is very fast for a machine of its size and has a lot of memory. So you can do this as well for reasonable sized problems. I did it for 418 elements, 25 variables in total, 25 active variables and 6 clusters, doing the procedure 120 times. The computation time was about 90 minutes.

Using the software:
Input:
M3: Data matrix (rows: elements, columns: variables
Number of cases must exceed number of variables.
L3: List of active variables. Length of list is number of variables. For each variable you put in either 0 (variable is not active and not used for the cluster algorithm) of 1 (variable is used for the cluster algorithm).

If you run the software you will be asked for the number of clusters. Put in a number between 2 (minimum) and 20 (maximum).
Next you will be asked how often you want to repeat the procedure. Put in a number between 10 and 1000.
If your numbers are below the minima or above the maxima your number will be set to minimum or maximum.

The program first standardizes the variables by subtracting from each variable the mean and dividing by the standard deviation. The standardized variables are saved in matrix M4. This is done to make the variables comparable.

During the computation you get for every repeat of the procedure the starting elements for the cluster centers, the total variance within the cluster as the result of this iteration and the minimum reached so far.

Finally you get some text on the terminal that shows you where to find the results:
Cluster sizes in list L4: Number of elements per cluster. Ideally the numbers should be about same size.
Pseudo-F-Statistics in L5:For every variable the variance within the cluster is set in relation to the variance between the clusters. The higher the number the more differentiating the clusters are for this variable.
Cluster attributions in L6: For every element you get a number denoting the cluster to which this element is attributed.
Cluster centers in M5: Means of the standardized variables.
Variances within the clusters im M6: Variances of all variables within all clusters.
Variances between the clusters per variable in list L7.

Enjoy!

Raimund Wildner

EXPORT KMEANSE()
BEGIN
// INPUT: DATA MATRIX M3
// LIST ACTIVE OF VARIABLES: L3
LOCAL VZ,VZ1,FZ,CZ;
// NUMBER OF VAR., ACT. VAR., CASES, CLUSTER
LOCAL LDIM; // DIMENSIONS DATA MATRIX
LOCAL M41; // STAND. ACTIVE VARIABLES
LOCAL LSTART; // STARTING VALUES FOR CLUSTER
LOCAL I1,I2,I3,I4,J1; // LOOP INDICES
LOCAL AK1,AK2,AL1,AL2,ALMIN,LB; //FOR COMP.
LOCAL LMIN,IZ,FLAG;
// OPTIMUM ALLOC, NUMBER OF ITER.,FLAG
LOCAL AKMIN,PF1,PF2;
LOCAL LM,LV; // LISTS OF MEAN, VARIANCES
DIM(M3)▶LDIM;
LDIM(1)▶FZ;
LDIM(2)▶VZ;
IP(ΣLIST(L3)+.5)▶VZ1;
INPUT(CZ,"NUMBER OF CLUSTERS","CZ=",2);
INPUT(IZ,"NUMBER OF ITERATONS","IZ=",10);
MAX(CZ,2)▶CZ; // MIN NO OF CLUSTERS=2
MIN(CZ,MIN(FZ-1,20))▶CZ; // MAX NO OF CLUSTERS=20
MAX(IZ,10)▶IZ; // MIN NO OF ITERATIONS=10
MIN(IZ,1000)▶IZ; // MAX NO OF ITERATIONs=1000
IP(CZ+.5)▶CZ;
REDIM(M4,{FZ,VZ}); // STANDARDIZED DATA
REDIM(M5,{CZ,VZ1}); // CLUSTER CENTRES
REDIM(M6,{CZ,VZ1}); // VARIANCES
MAKEMAT(0,FZ,VZ1)▶M41; // ACTIVE VARIABLES
MAKELIST(0,I1,1,FZ,1)▶LMIN;
//MAKE STANDARDIZED DATA MATRIX
mean(M3)▶LM; //VARIABLE MEANS
variance(M3)▶LV; //VARIABLE VARIANCES
0▶J1;
FOR I1 FROM 1 TO VZ DO
0▶FLAG;
IF IP(L3(I1)+.5)=1 THEN
J1+1▶J1;
1▶FLAG;
END;
FOR I2 FROM 1 TO FZ DO
(M3(I2,I1)-LM(I1))/√LV(I1)▶M4(I2,I1);
IF FLAG=1 THEN
M4(I2,I1)▶M41(I2,J1);
END;
END;
END;
1▶ALMIN;
FOR J1 FROM 1 TO IZ DO
//MAKE RANDOM STARTING SOLUTION
MAKELIST(0,I1,1,CZ,1)▶LSTART;
0▶L6; // CLUSTER ATTRIBUTION
RANDINT(CZ,1,FZ)▶LSTART;
REPEAT
0▶AK1;
FOR I1 FROM 1 TO CZ-1 DO
FOR I2 FROM I1+1 TO CZ DO
IF LSTART(I1)=LSTART(I2) THEN
AK1+1▶AK1;
RANDINT(FZ)▶LSTART(I2);
END;
END;
END;
UNTIL AK1<0.5;
SORT(LSTART)▶LSTART;
FOR I1 FROM 1 TO CZ DO
FOR I2 FROM 1 TO VZ1 DO
M41(LSTART(I1),I2)▶M5(I1,I2);
END;
END;
PRINT(LSTART);
// START OF OPTIMIZATION ROUTINE ------
// ATTRIBUTION OF CLUSTERS TO CASES ---
1▶AL1;
0▶I4;
REPEAT
I4+1▶I4;
FOR I1 FROM 1 TO FZ DO
VZ1*CZ▶AK1;
FOR I2 FROM 1 TO CZ DO
0▶AK2;
FOR I3 FROM 1 TO VZ1 DO
AK2+(M41(I1,I3)-M5(I2,I3))^2▶AK2;
END;
AK2/VZ1▶AK2;
IF AK2<AK1 THEN
I2▶L6(I1);
AK2▶AK1;
END;
END;
END;
// COMPUTATION OF NEW CENTERS
// VARIANCES, NO OF CASES PER CLUSTER
FOR I1 FROM 1 TO CZ DO
FOR I2 FROM 1 TO VZ1 DO
0▶M5(I1,I2);
0▶M6(I1,I2);
END;
END;
MAKELIST(0,I1,1,CZ)▶L4;
FOR I1 FROM 1 TO FZ DO
L4(L6(I1))+1▶L4(L6(I1));
FOR I2 FROM 1 TO VZ1 DO
M5(L6(I1),I2)+M41(I1,I2)▶M5(L6(I1),I2);
M6(L6(I1),I2)+M41(I1,I2)^2▶M6(L6(I1),I2);
END;
END;
0▶AL2;
FOR I1 FROM 1 TO CZ DO
FOR I2 FROM 1 TO VZ1 DO
M5(I1,I2)/L4(I1)▶M5(I1,I2);
M6(I1,I2)/L4(I1)-M5(I1,I2)^2▶M6(I1,I2);
AL2+M6(I1,I2)▶AL2;
END;
END;
AL2/VZ1/CZ▶AL2;
AL1-AL2▶AK1;
AL2▶AL1;
UNTIL AK1<0.0001;
PRINT(J1 +" ACTUAL: " +AL2 +" MIN: " +ALMIN);
IF AL2<ALMIN THEN
L6▶LMIN;
AL2▶ALMIN;
END;
END;
// FINISH ------------------------
LMIN▶L6;
REDIM(M5,{CZ,VZ});
REDIM(M6,{CZ,VZ});
// COMP. OF CLUSTER CENTERS AT OPTIMUM -
FOR I1 FROM 1 TO CZ DO
FOR I2 FROM 1 TO VZ DO
0▶M5(I1,I2);
0▶M6(I1,I2);
END;
END;
// NUMBER OF ELEMENTS PER CLUSTER (L4),
// CLUSTER CENTERS (M5), VARIANCES (M6)
MAKELIST(0,I1,1,CZ)▶L4;
FOR I1 FROM 1 TO FZ DO
L4(L6(I1))+1▶L4(L6(I1));
FOR I2 FROM 1 TO VZ DO
M5(L6(I1),I2)+M4(I1,I2)▶M5(L6(I1),I2);
M6(L6(I1),I2)+M4(I1,I2)^2▶M6(L6(I1),I2);
END;
END;
0▶AL2;
FOR I1 FROM 1 TO CZ DO
FOR I2 FROM 1 TO VZ DO
M5(I1,I2)/L4(I1)▶M5(I1,I2);
M6(I1,I2)/L4(I1)-M5(I1,I2)^2▶M6(I1,I2);
AL2+M6(I1,I2)▶AL2;
END;
END;
AL2/VZ/CZ▶AL2;
//PSEUDO-F-STATISTICS, VARIANCES
MAKELIST(0,I1,1,VZ)▶L5;
MAKELIST(0,I1,1,VZ)▶L7;
0▶PF1;
0▶PF2;
FOR I1 FROM 1 TO VZ DO
FOR I2 FROM 1 TO CZ DO
L5(I1)+L4(I2)+M6(I2,I1)▶L5(I1);
L7(I1)+L4(I2)*M5(I2,I1)^2▶L7(I1);
END;
PF1+L5(I1)▶PF1;
PF2+L7(I1)▶PF2;
IF L5(I1)<.00001 THEN
9999▶L5(I1)
ELSE
(L7(I1)/(CZ-1))/(L5(I1)/(FZ-CZ))▶L5(I1);
END;
L7(I1)/(FZ-1)▶L7(I1);
END;
PF1/VZ▶PF1;
PF2/(VZ*CZ)*FZ/(FZ-1)▶PF2;
PRINT("--------------------------");
PRINT("RESULT K-MEANS CLUSTERING:");
PRINT("TOTAL VAR WITHIN: " +PF2);
PRINT("TOTAL VAR BETWEEN: " +PF1);
PRINT("CLUSTER SIZES: L4,");
PRINT("PSEUDO-F-STATISTICS: L5,");
PRINT("CLUSTER ATTRIBUTIONS: L6,");
PRINT("CLUSTER CENTERS: M5,");
PRINT("VARIANCES WITHIN: M6,");
PRINT("VARIANCES BETWEEN: L7");
PRINT("READY.");
END;