Cluster Analysis
|
12-03-2019, 03:23 PM
Post: #1
|
|||
|
|||
Cluster Analysis
Cluster Analysis is a multivariate statistical method to look for homogeneous groups within elements. For example it is used in market segmentation.
The program uses K-Means-method to generate clusters. For the experts: It uses euclidian distances and centroid method. Input is a data matrix in matrix M3, whereas the rows are the elements and the columns are the variables. If you have the data matrix on your computer e.g. in Excel you can copy it to the spreadsheet application in the HP Prime and from there to the matrix using the connectivity software. I could not directly copy to a matrix but perhaps there is a way I could not find. Secondly you have to give the information which of the variables you want to have active, i.e. which of the variables shall be used in the clustering procedure. You can use only 2 or all variables you have. Sometimes it makes sense not to use all variables. But the final analysis is shown for all variables. K-Means-method needs the number of clusters (= homogeneous groups) you want to have. Then it generates as many cluster centers by randomly selecting elements. In the next step it allocates the other elements to the centers. Each element is attributed to the cluster where the distance to the center is lowest. Then the centers are computed newly as means of clusters. Now because the centers changed it may be that for some of the elements another center is nearer. So they attribution to the centers is done again. This is repeated as often as there is no further improvement. Then another random choice of elements is used to form a new center and to optimize it again. This is necessary because the method mostly only finds local optima. To find a good solution you will repeat the procedure e.g. 100 times. The program takes the best solution found as the final solution. The HP prime is very fast for a machine of its size and has a lot of memory. So you can do this as well for reasonable sized problems. I did it for 418 elements, 25 variables in total, 25 active variables and 6 clusters, doing the procedure 120 times. The computation time was about 90 minutes. Using the software: Input: M3: Data matrix (rows: elements, columns: variables Number of cases must exceed number of variables. L3: List of active variables. Length of list is number of variables. For each variable you put in either 0 (variable is not active and not used for the cluster algorithm) of 1 (variable is used for the cluster algorithm). If you run the software you will be asked for the number of clusters. Put in a number between 2 (minimum) and 20 (maximum). Next you will be asked how often you want to repeat the procedure. Put in a number between 10 and 1000. If your numbers are below the minima or above the maxima your number will be set to minimum or maximum. The program first standardizes the variables by subtracting from each variable the mean and dividing by the standard deviation. The standardized variables are saved in matrix M4. This is done to make the variables comparable. During the computation you get for every repeat of the procedure the starting elements for the cluster centers, the total variance within the cluster as the result of this iteration and the minimum reached so far. Finally you get some text on the terminal that shows you where to find the results: Cluster sizes in list L4: Number of elements per cluster. Ideally the numbers should be about same size. Pseudo-F-Statistics in L5:For every variable the variance within the cluster is set in relation to the variance between the clusters. The higher the number the more differentiating the clusters are for this variable. Cluster attributions in L6: For every element you get a number denoting the cluster to which this element is attributed. Cluster centers in M5: Means of the standardized variables. Variances within the clusters im M6: Variances of all variables within all clusters. Variances between the clusters per variable in list L7. Enjoy! Raimund Wildner EXPORT KMEANSE() BEGIN // INPUT: DATA MATRIX M3 // LIST ACTIVE OF VARIABLES: L3 LOCAL VZ,VZ1,FZ,CZ; // NUMBER OF VAR., ACT. VAR., CASES, CLUSTER LOCAL LDIM; // DIMENSIONS DATA MATRIX LOCAL M41; // STAND. ACTIVE VARIABLES LOCAL LSTART; // STARTING VALUES FOR CLUSTER LOCAL I1,I2,I3,I4,J1; // LOOP INDICES LOCAL AK1,AK2,AL1,AL2,ALMIN,LB; //FOR COMP. LOCAL LMIN,IZ,FLAG; // OPTIMUM ALLOC, NUMBER OF ITER.,FLAG LOCAL AKMIN,PF1,PF2; LOCAL LM,LV; // LISTS OF MEAN, VARIANCES DIM(M3)▶LDIM; LDIM(1)▶FZ; LDIM(2)▶VZ; IP(ΣLIST(L3)+.5)▶VZ1; INPUT(CZ,"NUMBER OF CLUSTERS","CZ=",2); INPUT(IZ,"NUMBER OF ITERATONS","IZ=",10); MAX(CZ,2)▶CZ; // MIN NO OF CLUSTERS=2 MIN(CZ,MIN(FZ-1,20))▶CZ; // MAX NO OF CLUSTERS=20 MAX(IZ,10)▶IZ; // MIN NO OF ITERATIONS=10 MIN(IZ,1000)▶IZ; // MAX NO OF ITERATIONs=1000 IP(CZ+.5)▶CZ; REDIM(M4,{FZ,VZ}); // STANDARDIZED DATA REDIM(M5,{CZ,VZ1}); // CLUSTER CENTRES REDIM(M6,{CZ,VZ1}); // VARIANCES MAKEMAT(0,FZ,VZ1)▶M41; // ACTIVE VARIABLES MAKELIST(0,I1,1,FZ,1)▶LMIN; //MAKE STANDARDIZED DATA MATRIX mean(M3)▶LM; //VARIABLE MEANS variance(M3)▶LV; //VARIABLE VARIANCES 0▶J1; FOR I1 FROM 1 TO VZ DO 0▶FLAG; IF IP(L3(I1)+.5)=1 THEN J1+1▶J1; 1▶FLAG; END; FOR I2 FROM 1 TO FZ DO (M3(I2,I1)-LM(I1))/√LV(I1)▶M4(I2,I1); IF FLAG=1 THEN M4(I2,I1)▶M41(I2,J1); END; END; END; 1▶ALMIN; FOR J1 FROM 1 TO IZ DO //MAKE RANDOM STARTING SOLUTION MAKELIST(0,I1,1,CZ,1)▶LSTART; 0▶L6; // CLUSTER ATTRIBUTION RANDINT(CZ,1,FZ)▶LSTART; REPEAT 0▶AK1; FOR I1 FROM 1 TO CZ-1 DO FOR I2 FROM I1+1 TO CZ DO IF LSTART(I1)=LSTART(I2) THEN AK1+1▶AK1; RANDINT(FZ)▶LSTART(I2); END; END; END; UNTIL AK1<0.5; SORT(LSTART)▶LSTART; FOR I1 FROM 1 TO CZ DO FOR I2 FROM 1 TO VZ1 DO M41(LSTART(I1),I2)▶M5(I1,I2); END; END; PRINT(LSTART); // START OF OPTIMIZATION ROUTINE ------ // ATTRIBUTION OF CLUSTERS TO CASES --- 1▶AL1; 0▶I4; REPEAT I4+1▶I4; FOR I1 FROM 1 TO FZ DO VZ1*CZ▶AK1; FOR I2 FROM 1 TO CZ DO 0▶AK2; FOR I3 FROM 1 TO VZ1 DO AK2+(M41(I1,I3)-M5(I2,I3))^2▶AK2; END; AK2/VZ1▶AK2; IF AK2<AK1 THEN I2▶L6(I1); AK2▶AK1; END; END; END; // COMPUTATION OF NEW CENTERS // VARIANCES, NO OF CASES PER CLUSTER FOR I1 FROM 1 TO CZ DO FOR I2 FROM 1 TO VZ1 DO 0▶M5(I1,I2); 0▶M6(I1,I2); END; END; MAKELIST(0,I1,1,CZ)▶L4; FOR I1 FROM 1 TO FZ DO L4(L6(I1))+1▶L4(L6(I1)); FOR I2 FROM 1 TO VZ1 DO M5(L6(I1),I2)+M41(I1,I2)▶M5(L6(I1),I2); M6(L6(I1),I2)+M41(I1,I2)^2▶M6(L6(I1),I2); END; END; 0▶AL2; FOR I1 FROM 1 TO CZ DO FOR I2 FROM 1 TO VZ1 DO M5(I1,I2)/L4(I1)▶M5(I1,I2); M6(I1,I2)/L4(I1)-M5(I1,I2)^2▶M6(I1,I2); AL2+M6(I1,I2)▶AL2; END; END; AL2/VZ1/CZ▶AL2; AL1-AL2▶AK1; AL2▶AL1; UNTIL AK1<0.0001; PRINT(J1 +" ACTUAL: " +AL2 +" MIN: " +ALMIN); IF AL2<ALMIN THEN L6▶LMIN; AL2▶ALMIN; END; END; // FINISH ------------------------ LMIN▶L6; REDIM(M5,{CZ,VZ}); REDIM(M6,{CZ,VZ}); // COMP. OF CLUSTER CENTERS AT OPTIMUM - FOR I1 FROM 1 TO CZ DO FOR I2 FROM 1 TO VZ DO 0▶M5(I1,I2); 0▶M6(I1,I2); END; END; // NUMBER OF ELEMENTS PER CLUSTER (L4), // CLUSTER CENTERS (M5), VARIANCES (M6) MAKELIST(0,I1,1,CZ)▶L4; FOR I1 FROM 1 TO FZ DO L4(L6(I1))+1▶L4(L6(I1)); FOR I2 FROM 1 TO VZ DO M5(L6(I1),I2)+M4(I1,I2)▶M5(L6(I1),I2); M6(L6(I1),I2)+M4(I1,I2)^2▶M6(L6(I1),I2); END; END; 0▶AL2; FOR I1 FROM 1 TO CZ DO FOR I2 FROM 1 TO VZ DO M5(I1,I2)/L4(I1)▶M5(I1,I2); M6(I1,I2)/L4(I1)-M5(I1,I2)^2▶M6(I1,I2); AL2+M6(I1,I2)▶AL2; END; END; AL2/VZ/CZ▶AL2; //PSEUDO-F-STATISTICS, VARIANCES MAKELIST(0,I1,1,VZ)▶L5; MAKELIST(0,I1,1,VZ)▶L7; 0▶PF1; 0▶PF2; FOR I1 FROM 1 TO VZ DO FOR I2 FROM 1 TO CZ DO L5(I1)+L4(I2)+M6(I2,I1)▶L5(I1); L7(I1)+L4(I2)*M5(I2,I1)^2▶L7(I1); END; PF1+L5(I1)▶PF1; PF2+L7(I1)▶PF2; IF L5(I1)<.00001 THEN 9999▶L5(I1) ELSE (L7(I1)/(CZ-1))/(L5(I1)/(FZ-CZ))▶L5(I1); END; L7(I1)/(FZ-1)▶L7(I1); END; PF1/VZ▶PF1; PF2/(VZ*CZ)*FZ/(FZ-1)▶PF2; PRINT("--------------------------"); PRINT("RESULT K-MEANS CLUSTERING:"); PRINT("TOTAL VAR WITHIN: " +PF2); PRINT("TOTAL VAR BETWEEN: " +PF1); PRINT("CLUSTER SIZES: L4,"); PRINT("PSEUDO-F-STATISTICS: L5,"); PRINT("CLUSTER ATTRIBUTIONS: L6,"); PRINT("CLUSTER CENTERS: M5,"); PRINT("VARIANCES WITHIN: M6,"); PRINT("VARIANCES BETWEEN: L7"); PRINT("READY."); END; |
|||
« Next Oldest | Next Newest »
|
User(s) browsing this thread: 1 Guest(s)