Box-Cox transformation in R
The Box-Cox transformation is a power transformation that helps reduce nonlinearity between variables, unequal variances, and skewness in a variable.
It is therefore useful for turning a variable into a new variable whose distribution is closer to normal.
Box-Cox family
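The following expression defines the Box-Cox family of transformations for the different values of lambda:

$$
y(\lambda) =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\log(x), & \lambda = 0
\end{cases}
$$

where y is the transformed variable and lambda (λ) is the transformation parameter. The following table lists the most common transformations: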
λ | Transformation |
-2 | 1/x^2 |
-1 | 1/x |
-0.5 | 1/sqrt(x) |
0 | log(x) |
0.5 | sqrt(x) |
1 | x |
2 | x^2 |
In practice, if the estimated transformation parameter is close to one of the values in the table above, it is advisable to use the table value rather than the exact estimate, because the resulting transformation is easier to interpret.
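The family and the table can be sketched in a few lines of R. This is a minimal illustration, not part of MASS: the helper names box_cox and nearest_table_lambda are hypothetical, and snapping to the nearest table value is only a rough stand-in for checking the confidence interval.

```r
# Box-Cox transform of a strictly positive numeric vector.
# box_cox is a hypothetical helper name used for illustration only.
box_cox <- function(x, lambda) {
  stopifnot(is.numeric(x), all(x > 0))
  if (lambda == 0) {
    log(x)                      # limiting case of the family
  } else {
    (x^lambda - 1) / lambda     # general case
  }
}

# Pick the table value closest to an estimated lambda (hypothetical helper).
nearest_table_lambda <- function(lambda_hat,
                                 table_values = c(-2, -1, -0.5, 0, 0.5, 1, 2)) {
  table_values[which.min(abs(table_values - lambda_hat))]
}

x <- c(0.5, 1, 2, 4)
box_cox(x, 0)                # identical to log(x)
box_cox(x, 0.5)              # equals 2 * (sqrt(x) - 1)
nearest_table_lambda(0.02)   # closest table value to an estimate of 0.02
```

Note that the table lists the simplified power forms (for example sqrt(x)), which differ from the exact expression only by a shift and a scale.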
The boxcox function in R
The boxcox function from the MASS package in R can be used to estimate the transformation parameter using maximum likelihood estimation.
The function also shows the parameter's 95% confidence interval on the plot it draws. Its arguments are as follows:
boxcox(object, lambda = seq(-2, 2, 1/10), plotit = TRUE, interp, eps = 1/50, xlab = expression(lambda), ylab = "log-Likelihood", ...)
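As a rough sketch of how the lambda and plotit arguments can be used, the example below searches a finer grid and suppresses the plot. The simulated data and the variable names sample_values and grid are assumptions made for illustration only.

```r
library(MASS)

# Illustrative right-skewed, strictly positive sample (seeded for reproducibility).
set.seed(1)
sample_values <- rexp(50)

# Finer grid restricted to [-1, 1]; plotit = FALSE skips the plot and
# returns the profile log-likelihood evaluated on the grid.
grid <- seq(-1, 1, by = 0.01)
b <- boxcox(lm(sample_values ~ 1), lambda = grid, plotit = FALSE)

str(b)                                 # list with components x (lambda) and y (log-likelihood)
lambda_hat <- b$x[which.max(b$y)]      # grid value with the highest log-likelihood
lambda_hat
```

A finer grid gives a more precise estimate at the cost of more model evaluations.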
Example
Consider the sample vector x below, which does not follow a normal distribution:
x <- c(0.2, 0.528, 0.11, 0.260, 0.091, 1.314, 1.52, 0.244, 1.981,
       0.273, 0.461, 0.366, 1.407, 0.79, 2.266)
hist(x)
To determine an appropriate lambda, fit a linear model with the lm function and pass it to the boxcox function, as shown below:
library(MASS)
boxcox(lm(x ~ 1))
In the resulting plot, the dashed vertical line in the middle marks the estimated parameter (lambda hat), and the two outer dashed lines mark the bounds of its 95% confidence interval.
Because the plot shows that 0 lies inside the confidence interval of the optimal lambda, and because the estimate in this example is very close to 0, the best choice is to apply the logarithmic transformation to the data (see the table in the first section).
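The interval can also be read off numerically instead of from the plot. The sketch below assumes the usual likelihood-ratio cutoff (the maximum log-likelihood minus half the 95% quantile of a chi-squared distribution with one degree of freedom); the names lambda_hat, cutoff, and ci are illustrative.

```r
library(MASS)

x <- c(0.2, 0.528, 0.11, 0.260, 0.091, 1.314, 1.52, 0.244, 1.981,
       0.273, 0.461, 0.366, 1.407, 0.79, 2.266)

# Profile log-likelihood on a fine grid, without drawing the plot.
b <- boxcox(lm(x ~ 1), lambda = seq(-2, 2, by = 0.01), plotit = FALSE)

lambda_hat <- b$x[which.max(b$y)]

# Lambdas whose log-likelihood is within the cutoff form the approximate 95% interval.
cutoff <- max(b$y) - qchisq(0.95, df = 1) / 2
ci <- range(b$x[b$y >= cutoff])

lambda_hat
ci
ci[1] <= 0 && 0 <= ci[2]   # TRUE when the log transformation is inside the interval
```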
# Transformed data
new <- log(x)
# Histogram
hist(new)
The data now appears to be more closely following a normal distribution, but you can also run a statistical test like the Shapiro-Wilk test to make sure:
shapiro.test(new)

Shapiro-Wilk normality test

data: new
W = 0.94531, p-value = 0.4538

Because the p-value is greater than the usual significance levels (1%, 5%, and 10%), there is no evidence to reject the null hypothesis of normality.
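To put the result in context, the same test can be run on the raw data and the two p-values compared side by side. This is only a sketch; the object names before and after are illustrative.

```r
x <- c(0.2, 0.528, 0.11, 0.260, 0.091, 1.314, 1.52, 0.244, 1.981,
       0.273, 0.461, 0.366, 1.407, 0.79, 2.266)

# Shapiro-Wilk test on the raw and on the log-transformed data.
before <- shapiro.test(x)
after  <- shapiro.test(log(x))

# Side-by-side p-values; a larger value means less evidence against normality.
c(raw = before$p.value, log = after$p.value)
```

With a sample this small the test has limited power, so a histogram or a normal Q-Q plot remains a useful complement.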
Extracting the exact lambda
If the confidence interval of the estimated parameter does not contain any of the table's values, you can extract the estimated lambda with the following code:
library(MASS)
b <- boxcox(lm(x ~ 1))
# Exact lambda
lambda <- b$x[which.max(b$y)]
lambda
0.02020202
Using the expression from the first section, you can now transform the variable:
new_x_exact <- (x ^ lambda - 1) / lambda
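If results need to be reported on the original scale, the transformation can be inverted. The sketch below uses the lambda value reported above and assumes lambda is not zero; x_back is an illustrative name.

```r
x <- c(0.2, 0.528, 0.11, 0.260, 0.091, 1.314, 1.52, 0.244, 1.981,
       0.273, 0.461, 0.366, 1.407, 0.79, 2.266)
lambda <- 0.02020202

# Forward transform, as in the article.
new_x_exact <- (x^lambda - 1) / lambda

# Inverse transform for lambda != 0 (for lambda == 0 the inverse is exp()).
x_back <- (lambda * new_x_exact + 1)^(1 / lambda)
all.equal(x, x_back)

# Because lambda is close to 0, the result stays close to log(x).
max(abs(new_x_exact - log(x)))
```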