How to Find Optimal Clusters in R

K-means clustering is one of the most widely used clustering techniques in machine learning.
With the K-means clustering technique, each observation in a dataset is assigned to one of K clusters.
The ultimate goal is to have K clusters in which the observations within each cluster are relatively similar to one another and considerably dissimilar from the observations in other clusters.
The first stage in k-means clustering is to decide on a value for K or the number of clusters we want to group the observations into.
The elbow method is one of the most popular approaches to choosing a value for K.
It entails plotting the total within-cluster sum of squares on the y-axis and the number of clusters on the x-axis, then locating the "elbow" or bend in the plot.
The position on the x-axis where the "elbow" occurs indicates the best number of clusters to use in the k-means algorithm.
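The elbow curve can also be computed by hand without any extra packages: fit kmeans() for each candidate value of K and record the total within-cluster sum of squares (the tot.withinss component). A minimal base-R sketch (the range of K and the seed are arbitrary choices):

```r
# Standardize the built-in USArrests data
df <- scale(na.omit(USArrests))

set.seed(1234)  # arbitrary seed, for reproducibility

# Total within-cluster sum of squares for K = 1..10
wss <- sapply(1:10, function(k) {
  kmeans(df, centers = k, nstart = 25)$tot.withinss
})

# Plot the elbow curve; look for the bend where wss stops dropping sharply
plot(1:10, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```

The wss values fall quickly at first and then level off; the K at which the decline flattens is the elbow.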
The elbow method in R is demonstrated in the example that follows.
How to Find Optimal Clusters in R
We'll use R's built-in USArrests dataset for this example, which contains the number of murder, assault, and rape arrests per 100,000 residents in each U.S. state in 1973, along with the percentage of the population living in urban areas (UrbanPop).
The code below loads the dataset, removes rows with missing values, and scales each variable to have a mean of 0 and a standard deviation of 1.
Now let’s load the data
df <- USArrests
Then we can remove rows with missing values
df <- na.omit(df)
Before clustering, we need to scale the data frame so that each variable has a mean of 0 and a standard deviation of 1.
df <- scale(df)
Let’s see the first six rows of the dataset
head(df)
               Murder   Assault   UrbanPop         Rape
Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
Arizona    0.07163341 1.4788032  0.9989801  1.042878388
Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
California 0.27826823 1.2628144  1.7589234  2.067820292
Colorado   0.02571456 0.3988593  0.8608085  1.864967207
To determine the ideal number of clusters for the k-means algorithm, we'll use the fviz_nbclust() function from the factoextra package to plot the number of clusters against the total within-cluster sum of squares.
library(cluster)
library(factoextra)
Plot the number of clusters vs. the total within-cluster sum of squares
fviz_nbclust(df, kmeans, method = "wss")

At k = 4 clusters, there appears to be an "elbow" or bend in the plot; the total within-cluster sum of squares begins to level off at this point.
This suggests that four is the optimal number of clusters to use with the k-means algorithm.
Although using more clusters would produce an even lower sum of squares, we would likely be overfitting the training data, causing the k-means algorithm to perform worse on new data.
We can now run k-means clustering on the dataset using the kmeans() function from base R's stats package, with the recommended value of k = 4.
We can make this example reproducible
set.seed(1234)
Now perform k-means clustering with k = 4 clusters
km <- kmeans(df, centers = 4, nstart = 25)
Let’s view the output
km
K-means clustering with 4 clusters of sizes 13, 13, 8, 16

Cluster means:
      Murder    Assault   UrbanPop        Rape
1  0.6950701  1.0394414  0.7226370  1.27693964
2 -0.9615407 -1.1066010 -0.9301069 -0.96676331
3  1.4118898  0.8743346 -0.8145211  0.01927104
4 -0.4894375 -0.3826001  0.5758298 -0.26165379

Clustering vector:
       Alabama         Alaska        Arizona       Arkansas     California
             3              1              1              3              1
      Colorado    Connecticut       Delaware        Florida        Georgia
             1              4              4              1              3
        Hawaii          Idaho       Illinois        Indiana           Iowa
             4              2              1              4              2
        Kansas       Kentucky      Louisiana          Maine       Maryland
             4              2              3              2              1
 Massachusetts       Michigan      Minnesota    Mississippi       Missouri
             4              1              2              3              1
       Montana       Nebraska         Nevada  New Hampshire     New Jersey
             2              2              1              2              4
    New Mexico       New York North Carolina   North Dakota           Ohio
             1              1              3              2              4
      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina
             4              4              4              4              3
  South Dakota      Tennessee          Texas           Utah        Vermont
             2              3              1              4              2
      Virginia     Washington  West Virginia      Wisconsin        Wyoming
             4              4              2              2              4

Within cluster sum of squares by cluster:
[1] 19.922437 11.952463  8.316061 16.212213
 (between_SS / total_SS =  71.2 %)

Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"
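Each piece of this printout can also be extracted from the fitted object directly, using the component names listed under "Available components". A self-contained sketch (same seed and settings as in this example):

```r
df <- scale(na.omit(USArrests))
set.seed(1234)
km <- kmeans(df, centers = 4, nstart = 25)

km$size                  # number of states in each cluster
km$centers               # cluster means on the scaled variables
km$tot.withinss          # total within-cluster sum of squares
km$betweenss / km$totss  # share of variance explained by the clustering
```

The last line reproduces the between_SS / total_SS ratio shown in the printed output.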
Additionally, we can add each state’s cluster assignments to the initial dataset.
Now add cluster assignment to the original data
finaldata <- cbind(USArrests, cluster = km$cluster)
head(finaldata)
           Murder Assault UrbanPop Rape cluster
Alabama      13.2     236       58 21.2       3
Alaska       10.0     263       48 44.5       1
Arizona       8.1     294       80 31.0       1
Arkansas      8.8     190       50 19.5       3
California    9.0     276       91 40.6       1
Colorado      7.9     204       78 38.7       1
Each observation from the original data frame has now been assigned to one of four clusters.
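With the cluster labels attached, one natural follow-up is to summarize each cluster on the original, unscaled variables; base R's aggregate() handles this. A sketch under the same seed as in the example above:

```r
df <- scale(na.omit(USArrests))
set.seed(1234)
km <- kmeans(df, centers = 4, nstart = 25)
finaldata <- cbind(USArrests, cluster = km$cluster)

# Mean of each variable within each cluster, on the original scale
aggregate(finaldata[, 1:4], by = list(cluster = finaldata$cluster), FUN = mean)
```

This makes the clusters easier to interpret, since the averages are in the original units (arrests per 100,000 residents, percent urban population) rather than standardized scores.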