Descriptive Statistics in R: A Step-by-Step Guide
Descriptive statistics are a crucial part of data analysis, as they provide a snapshot of the central tendency and variability of a dataset.
In R, there are two primary functions that can be used to calculate descriptive statistics: summary() and sapply().
In this article, we will explore how to use these functions to gain a deeper understanding of our data.
Method 1: Using the summary() Function
The summary() function is a simple and efficient way to calculate various descriptive statistics for each variable in a data frame. To use this function, simply call it on your data frame, like so:
summary(my_data)
For each numeric variable, the summary() function returns the minimum, first quartile, median, mean, third quartile, and maximum.
For example, let’s say we have the following data frame:
df <- data.frame(x=c(1, 4, 4, 5, 6, 7, 10, 12), y=c(2, 2, 3, 3, 4, 5, 11, 11), z=c(8, 9, 9, 9, 10, 13, 15, 17))
We can use the summary() function to calculate descriptive statistics for each variable:
summary(df)
This will output:
       x                y                z
 Min.   : 1.000   Min.   : 2.000   Min.   : 8.00
 1st Qu.: 4.000   1st Qu.: 2.750   1st Qu.: 9.00
 Median : 5.500   Median : 3.500   Median : 9.50
 Mean   : 6.125   Mean   : 5.125   Mean   :11.25
 3rd Qu.: 7.750   3rd Qu.: 6.500   3rd Qu.:13.50
 Max.   :12.000   Max.   :11.000   Max.   :17.00
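As a cross-check, the six values that summary() reports for a numeric column can be reproduced with ordinary base R functions. The sketch below does this for column x of the example data frame; the helper name six_numbers is hypothetical and not part of R.

```r
# Same data frame as in the example above
df <- data.frame(x = c(1, 4, 4, 5, 6, 7, 10, 12),
                 y = c(2, 2, 3, 3, 4, 5, 11, 11),
                 z = c(8, 9, 9, 9, 10, 13, 15, 17))

# Hypothetical helper: the six statistics summary() reports for a numeric vector
six_numbers <- function(v) {
  # names = FALSE keeps the quantile labels ("25%", "75%") out of the result names
  q <- quantile(v, probs = c(0.25, 0.75), names = FALSE)
  c(Min = min(v), Q1 = q[1], Median = median(v),
    Mean = mean(v), Q3 = q[2], Max = max(v))
}

x_stats <- six_numbers(df$x)
print(x_stats)
```

The values for x should agree with the first column of the summary(df) output shown above.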
Method 2: Using the sapply() Function
The sapply() function is a more versatile option for calculating descriptive statistics. It allows us to specify a custom function to apply to each variable in the data frame.
For example, we can use the sapply() function to calculate the standard deviation of each variable:
sapply(df, sd, na.rm=TRUE)
This will output:
       x        y        z
3.522884 3.758324 3.327376
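The na.rm=TRUE argument in the call above is passed through sapply() to sd(), which then drops missing values before computing. The example df has no missing values, so the sketch below uses a small hypothetical data frame, df_na, to illustrate the difference.

```r
# Hypothetical data frame with one missing value in column a
df_na <- data.frame(a = c(1, NA, 3), b = c(2, 4, 6))

# Without na.rm, a column containing NA yields NA
sd_default <- sapply(df_na, sd)

# With na.rm = TRUE, sapply() forwards the argument to sd(), which drops the NA
sd_clean <- sapply(df_na, sd, na.rm = TRUE)

print(sd_default)
print(sd_clean)
```

Column b is unaffected in both calls, while column a changes from NA to the standard deviation of its two observed values.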
We can also use the sapply() function to calculate more complex descriptive statistics by defining a custom function within it.
For example, let’s say we want to calculate the range of each variable:
sapply(df, function(col) max(col, na.rm=TRUE) - min(col, na.rm=TRUE))
This will output:
x y z
11 9 9
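A custom function passed to sapply() can also return several values at once, in which case sapply() assembles the results into a matrix with one column per variable. A minimal sketch, assuming the same example data frame; the function name describe_col and the choice of statistics are illustrative only.

```r
# Same data frame as in the example above
df <- data.frame(x = c(1, 4, 4, 5, 6, 7, 10, 12),
                 y = c(2, 2, 3, 3, 4, 5, 11, 11),
                 z = c(8, 9, 9, 9, 10, 13, 15, 17))

# Hypothetical helper returning several statistics for one numeric column
describe_col <- function(col) {
  c(mean  = mean(col, na.rm = TRUE),
    sd    = sd(col, na.rm = TRUE),
    range = diff(range(col, na.rm = TRUE)))  # max minus min
}

# sapply() simplifies the equal-length named vectors into a matrix:
# rows are statistics, columns are variables
stats <- sapply(df, describe_col)
print(round(stats, 3))
```

Individual values can then be looked up by name, for example stats["sd", "y"].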
Conclusion
In this article, we have explored two methods for calculating descriptive statistics in R: the summary() function and the sapply() function.
The summary() function provides a quick and easy way to calculate common descriptive statistics for each variable in a data frame.
The sapply() function offers more flexibility and allows us to define custom functions to calculate more complex descriptive statistics.
By using these functions effectively, we can gain a deeper understanding of our data and make more informed decisions about our analysis and visualization strategies.