Descriptive Statistics in R » Data Science Tutorials

Descriptive Statistics in R: A Step-by-Step Guide

Descriptive statistics are a crucial part of data analysis, as they provide a snapshot of the central tendency and variability of a dataset.

In base R, two functions cover most everyday descriptive statistics: summary() and sapply().

In this article, we will explore how to use these functions to gain a deeper understanding of our data.

Method 1: Using the summary() Function

The summary() function is a simple and efficient way to calculate various descriptive statistics for each variable in a data frame. To use this function, simply call it on your data frame, like so:

summary(my_data)

For each numeric variable, summary() returns the minimum, first quartile, median, mean, third quartile, and maximum.

For example, let’s say we have the following data frame:

df <- data.frame(x=c(1, 4, 4, 5, 6, 7, 10, 12),
                 y=c(2, 2, 3, 3, 4, 5, 11, 11),
                 z=c(8, 9, 9, 9, 10, 13, 15, 17))

We can use the summary() function to calculate descriptive statistics for each variable:

summary(df)

This will output:

       x                y                z        
 Min.   : 1.000   Min.   : 2.000   Min.   : 8.00  
 1st Qu.: 4.000   1st Qu.: 2.750   1st Qu.: 9.00  
 Median : 5.500   Median : 3.500   Median : 9.50  
 Mean   : 6.125   Mean   : 5.125   Mean   :11.25  
 3rd Qu.: 7.750   3rd Qu.: 6.500   3rd Qu.:13.50  
 Max.   :12.000   Max.   :11.000   Max.   :17.00  
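
As a small sketch, assuming the same df as above, summary() can also be called on a single column. The result behaves like a named vector, so one statistic can be pulled out by name. The variable name x_summary is introduced here only for illustration.

```r
# Recreate the example data frame so this snippet runs on its own
df <- data.frame(x = c(1, 4, 4, 5, 6, 7, 10, 12),
                 y = c(2, 2, 3, 3, 4, 5, 11, 11),
                 z = c(8, 9, 9, 9, 10, 13, 15, 17))

# Summarize one column instead of the whole data frame
x_summary <- summary(df$x)

# Extract individual statistics by their labels
x_median <- x_summary[["Median"]]
x_mean   <- x_summary[["Mean"]]

print(x_median)
print(x_mean)
```

This can be handy when a single value, such as the median, needs to be reused later in a script rather than just read off the printed table.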

Method 2: Using the sapply() Function

The sapply() function is a more versatile option for calculating descriptive statistics. It allows us to specify a custom function to apply to each variable in the data frame.

For example, we can use the sapply() function to calculate the standard deviation of each variable:

sapply(df, sd, na.rm=TRUE)

This will output:

       x        y        z 
3.522884 3.758324 3.327376
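
The function passed to sapply() may also return more than one value. The following is a minimal sketch, assuming the same df as above, in which each column yields a named vector holding its mean and standard deviation. The names stats and col are illustrative, not part of the original example.

```r
# Recreate the example data frame so this snippet runs on its own
df <- data.frame(x = c(1, 4, 4, 5, 6, 7, 10, 12),
                 y = c(2, 2, 3, 3, 4, 5, 11, 11),
                 z = c(8, 9, 9, 9, 10, 13, 15, 17))

# Return two statistics per column; sapply() simplifies the
# equal-length results into a matrix (rows = statistics, columns = variables)
stats <- sapply(df, function(col) {
  c(mean = mean(col, na.rm = TRUE),
    sd   = sd(col, na.rm = TRUE))
})

print(stats)
```

Because every column returns a vector of the same length, the result is a matrix with one row per statistic and one column per variable, so a single value can be read with, for example, stats["sd", "y"].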

We can also use the sapply() function to calculate more complex descriptive statistics by defining a custom function within it.

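For example, let’s say we want to calculate the range (maximum minus minimum) of each variable. Note that na.rm=TRUE must be passed to max() and min() themselves; passing it as an extra argument to sapply() would forward it to the anonymous function, which does not accept it, and R would stop with an "unused argument" error:

sapply(df, function(x) max(x, na.rm=TRUE) - min(x, na.rm=TRUE))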

This will output:

 x  y  z 
11  9  9 
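
Since "range" here means the difference between the largest and smallest value, the same result can be sketched with diff(range(...)). The example below uses a small hypothetical data frame, df_na, that contains missing values, to show where na.rm = TRUE has to go; the names df_na, col, and ranges are assumptions made for illustration.

```r
# Hypothetical data frame with missing values (not from the example above)
df_na <- data.frame(x = c(1, 4, NA, 12),
                    y = c(2, NA, 5, 11))

# range() returns c(min, max); diff() turns that pair into max - min.
# na.rm = TRUE is passed to range() itself so that NAs are ignored.
ranges <- sapply(df_na, function(col) diff(range(col, na.rm = TRUE)))

print(ranges)
```

Without na.rm = TRUE, any column containing an NA would return NA for its range.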

Conclusion

In this article, we have explored two methods for calculating descriptive statistics in R: the summary() function and the sapply() function.

The summary() function provides a quick and easy way to calculate common descriptive statistics for each variable in a data frame.

The sapply() function offers more flexibility and allows us to define custom functions to calculate more complex descriptive statistics.

By using these functions effectively, we can gain a deeper understanding of our data and make more informed decisions about our analysis and visualization strategies.