Data Science Tutorials

How to Find Optimal Clusters in R?

Posted on September 10 by Admin

K-means clustering is one of the most widely used clustering techniques in machine learning.

With the K-means clustering technique, each observation in a dataset is assigned to one of K clusters.

The goal is to end up with K clusters in which the observations within each cluster are relatively similar to one another, while observations in different clusters are considerably dissimilar.

The first step in k-means clustering is to decide on a value for K, the number of clusters we want to group the observations into.

The elbow method is one of the most popular approaches to choosing a value for K.

It entails plotting the total within-cluster sum of squares on the y-axis against the number of clusters on the x-axis and locating the plot’s “elbow”, or bend.

The x-axis value at which the “elbow” occurs indicates a good number of clusters to use in the k-means algorithm.

The elbow method in R is demonstrated in the example that follows.

How to Find Optimal Clusters in R

We’ll use the built-in USArrests dataset for this example. It records the number of arrests per 100,000 residents for murder, assault, and rape in each of the 50 US states in 1973, along with the percentage of each state’s population living in urban areas (UrbanPop).

The code below loads the dataset, removes rows with missing values, and scales each variable to have a mean of 0 and a standard deviation of 1.

Now let’s load the data

df <- USArrests

Then we can remove rows with missing values

df <- na.omit(df)

Before clustering, we need to scale the data so that variables measured in different units contribute equally to the distance calculations. Scale each variable to have a mean of 0 and a standard deviation of 1.

df <- scale(df)
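As a quick sanity check on the scaling step, the sketch below confirms that every column ends up with a mean of 0 and a standard deviation of 1. It assumes the same preprocessing as above (built-in USArrests, missing rows dropped, then scaled); the names col_means and col_sds are only illustrative.

```r
# Same preprocessing as in the text: drop rows with NA, then scale
df <- scale(na.omit(USArrests))

# Column means should be 0 up to floating-point error
col_means <- colMeans(df)

# Column standard deviations should be 1
col_sds <- apply(df, 2, sd)

print(round(col_means, 10))
print(round(col_sds, 10))

# scale() returns a matrix and keeps the original centers and
# standard deviations as attributes, which is handy for un-scaling later
print(attr(df, "scaled:center"))
print(attr(df, "scaled:scale"))
```

Note that scale() returns a matrix rather than a data frame, which is fine for kmeans().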

Let’s see the first six rows of the dataset

head(df)
               Murder   Assault   UrbanPop         Rape
Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
Arizona    0.07163341 1.4788032  0.9989801  1.042878388
Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
California 0.27826823 1.2628144  1.7589234  2.067820292
Colorado   0.02571456 0.3988593  0.8608085  1.864967207

To determine a suitable number of clusters for the k-means algorithm, we’ll use the fviz_nbclust() function from the factoextra package to plot the number of clusters against the total within-cluster sum of squares.

library(cluster)
library(factoextra)

Plot the number of clusters against the total within-cluster sum of squares

fviz_nbclust(df, kmeans, method = "wss")

The plot appears to have an “elbow”, or bend, at k = 4 clusters. Beyond this point, the total within-cluster sum of squares starts to level off.

This suggests that four clusters is a good choice for the k-means algorithm on this dataset.

Although using more clusters would lower the total within-cluster sum of squares further, the improvement past the elbow is small, and we would likely be overfitting: the extra clusters would capture noise in this dataset rather than structure that holds up on new data.
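The elbow curve can also be sketched by hand in base R, which makes it clearer what is being plotted. The example below is a minimal sketch of the same idea, not a reproduction of how factoextra works internally: the search range 1 to 10 and the use of nstart = 25 are assumptions chosen for illustration, and k_values and wss are hypothetical names.

```r
# Same preprocessing as in the text
df <- scale(na.omit(USArrests))

# k-means uses random starting centers, so fix the seed
set.seed(1234)

# Assumed search range for the number of clusters
k_values <- 1:10

# Total within-cluster sum of squares for each candidate k
wss <- sapply(k_values, function(k) {
  kmeans(df, centers = k, nstart = 25)$tot.withinss
})

# Plot k against total WSS and look for the bend
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

With a single cluster, the total within-cluster sum of squares equals the total sum of squares of the scaled data, so the curve always starts at its highest point and the question is only where the drop flattens out.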

We can now run k-means clustering on the dataset using the kmeans() function, which ships with base R in the stats package, with the suggested value of k = 4.

K-means starts from randomly chosen centers, so set a seed to make this example reproducible

set.seed(1234)

Now perform k-means clustering with k = 4 clusters

km <- kmeans(df, centers = 4, nstart = 25)
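The nstart = 25 argument tells kmeans() to try 25 random sets of starting centers and keep the solution with the lowest total within-cluster sum of squares. The sketch below illustrates why that helps by comparing several single-start runs with one multi-start run. The number of single runs (10) and the names single_runs and multi_run are assumptions made for illustration.

```r
# Same preprocessing as in the text
df <- scale(na.omit(USArrests))
set.seed(1234)

# Total WSS from 10 independent runs that each use one random start
single_runs <- replicate(10, kmeans(df, centers = 4, nstart = 1)$tot.withinss)

# Total WSS from one call that keeps the best of 25 random starts
multi_run <- kmeans(df, centers = 4, nstart = 25)$tot.withinss

# Single starts can land on different local optima;
# the multi-start result should be at or near the low end of this range
print(range(single_runs))
print(multi_run)
```

If the single-start values vary, the algorithm is getting stuck in different local optima depending on where it starts, which is the situation a larger nstart is meant to guard against.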

Let’s view the output

km
K-means clustering with 4 clusters of sizes 13, 13, 8, 16
Cluster means:
      Murder    Assault   UrbanPop        Rape
1  0.6950701  1.0394414  0.7226370  1.27693964
2 -0.9615407 -1.1066010 -0.9301069 -0.96676331
3  1.4118898  0.8743346 -0.8145211  0.01927104
4 -0.4894375 -0.3826001  0.5758298 -0.26165379
Clustering vector:
       Alabama         Alaska        Arizona       Arkansas     California
             3              1              1              3              1
      Colorado    Connecticut       Delaware        Florida        Georgia
             1              4              4              1              3
        Hawaii          Idaho       Illinois        Indiana           Iowa
             4              2              1              4              2
        Kansas       Kentucky      Louisiana          Maine       Maryland
             4              2              3              2              1
 Massachusetts       Michigan      Minnesota    Mississippi       Missouri
             4              1              2              3              1
       Montana       Nebraska         Nevada  New Hampshire     New Jersey
             2              2              1              2              4
    New Mexico       New York North Carolina   North Dakota           Ohio
             1              1              3              2              4
      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina
             4              4              4              4              3
  South Dakota      Tennessee          Texas           Utah        Vermont
             2              3              1              4              2
      Virginia     Washington  West Virginia      Wisconsin        Wyoming
             4              4              2              2              4
Within cluster sum of squares by cluster:
[1] 19.922437 11.952463  8.316061 16.212213
 (between_SS / total_SS =  71.2 %)
Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"    
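The items listed under "Available components" can be pulled out of the fitted object with the $ operator. The sketch below refits the same model as above (same seed and arguments) and extracts a few of them, including the between_SS / total_SS percentage shown in the printout.

```r
# Refit the model exactly as in the text
df <- scale(na.omit(USArrests))
set.seed(1234)
km <- kmeans(df, centers = 4, nstart = 25)

# Number of states in each cluster
km$size

# Total within-cluster sum of squares, and the same value summed per cluster
km$tot.withinss
sum(km$withinss)

# Share of total variance explained by the clustering, in percent
round(100 * km$betweenss / km$totss, 1)
```

The total sum of squares always splits into the between-cluster part plus the total within-cluster part, so a higher percentage means tighter, better separated clusters.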

Additionally, we can add each state’s cluster assignments to the initial dataset.

Now add cluster assignment to the original data

finaldata <- cbind(USArrests, cluster = km$cluster)
head(finaldata)
           Murder Assault UrbanPop Rape cluster
Alabama      13.2     236       58 21.2       3
Alaska       10.0     263       48 44.5       1
Arizona       8.1     294       80 31.0       1
Arkansas      8.8     190       50 19.5       3
California    9.0     276       91 40.6       1
Colorado      7.9     204       78 38.7       1

Each observation from the original data frame has now been assigned to one of four clusters.
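To interpret the clusters, it can help to look at the average of each original (unscaled) variable within each cluster. The sketch below does this with aggregate() from base R; it refits the same model as above, and the name cluster_means is only illustrative.

```r
# Refit the model exactly as in the text
df <- scale(na.omit(USArrests))
set.seed(1234)
km <- kmeans(df, centers = 4, nstart = 25)

# Mean of each original variable within each cluster
cluster_means <- aggregate(USArrests, by = list(cluster = km$cluster), FUN = mean)
print(cluster_means)

# Number of states assigned to each cluster
print(table(km$cluster))
```

Reading across a row shows what characterizes that cluster, for example whether it groups states with high arrest rates and high urban population or states that are low on both.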

Copyright © 2025 Data Science Tutorials.
