In this post on checking missing values in R, we’ll start on data wrangling, which is the pre-processing and preparation of data.
In fact, data wrangling is often said to consume more than 70% of your time when you practice data science. We’ll only look at a few of the most important commands to keep things as simple as possible.
Even so, you will devote a significant amount of time to reshaping your data in various ways. Some valuable R packages have been built for that, which we’ll look at today.
So, let’s take a look at the slides for pre-processing data with R. And, of course, you’re going to set yourself up first.
Checking Missing Values in R
So the first thing we’ll do is look at a command that checks for missing values. Missing values in R are written NA, and the function that tests for them is is.na().
Its help page tells you that it’s in the base package, that NA stands for “not available”, and that NA marks a missing value. So you can check for missing values, and you’ll also be able to impute them, that is, replace them with another value.
This is quite important when we get started, because some functions don’t accept missing data, or behave in surprising ways when they receive it.
So for instance, if we start with an example dataset, a small vector that I build with the command c(), and I want to compute, say, the mean of that example, R tells me that the result of this computation is NA.
vec <- c(1, 2, 3, 4, NA)
vec
[1] 1 2 3 4 NA
mean(vec)
[1] NA
So it does give me a result, but the result is the value “not available”. In general this can happen for two reasons.
One is that strings have been mixed in with the actual numbers, and R doesn’t know how to compute the mean of strings. The other is that there are real missing values, which is the case here: the fifth element of vec is NA.
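To illustrate the first reason, here is a minimal sketch using a hypothetical vector named mixed (it is not part of the original example). It assumes a single string has slipped in among the numbers and shows what R does with it.

```r
# Hypothetical vector: one string mixed in with the numbers.
mixed <- c(1, 2, "three", 4)

# c() coerces everything to a single type, so the numbers become strings too.
class(mixed)

# mean() cannot average strings: it warns and returns NA.
# suppressWarnings() only hides the warning so the output stays short.
mixed_mean <- suppressWarnings(mean(mixed))
mixed_mean

# Converting back to numeric turns the unparseable entry into a real NA.
as_numbers <- suppressWarnings(as.numeric(mixed))
is.na(as_numbers)
```

After the conversion, the string shows up as a proper missing value, so the same is.na() check applies to both situations.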
Either way, the output is NA.
So if at some point in your function you do a manipulation that gives you an NA, it will propagate all the way down through the different results as you go along. If there are any missing values, the result will tell you.
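A small sketch of that propagation, reusing the same vec. The variable names doubled, total and has_missing are only illustrative, and anyNA() is a base R helper that is not mentioned in the original walkthrough.

```r
vec <- c(1, 2, 3, 4, NA)

# Arithmetic on NA gives NA, element by element.
doubled <- vec * 2
doubled

# Summaries built on the vector inherit the NA as well.
total <- sum(vec)
total

# anyNA() is a quick yes/no check for missing values.
has_missing <- anyNA(vec)
has_missing
```

Checking with anyNA() early in a function is one way to catch the problem before it reaches the final result.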
Where are the missing values?
is.na() answers this element by element: the first value isn’t missing, the second isn’t missing, and the third and fourth aren’t missing, but the fifth is recognized as being missing.
That’s because NA is a special reserved value in R. It is written without quotes, so it is not the same thing as the string "NA".
Furthermore, you must be cautious when importing data, because other software often marks missing values with a placeholder number such as 9999. R will not recognize that as missing; you must re-code it to NA yourself.
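Re-coding such a placeholder might look like the sketch below. The vector raw and the choice of 9999 as the placeholder are assumptions made for illustration; your data may use a different code.

```r
# Hypothetical imported column where 9999 was used as the missing-value code.
raw <- c(12, 9999, 7, 9999, 3)

# R treats 9999 as an ordinary number, so nothing is flagged as missing
# and the mean is badly inflated.
flagged_before <- anyNA(raw)
mean_before <- mean(raw)

# Re-code the placeholder to NA with logical indexing.
cleaned <- raw
cleaned[cleaned == 9999] <- NA

# Now the two placeholder positions are recognized as missing.
flagged_after <- is.na(cleaned)
mean_after <- mean(cleaned, na.rm = TRUE)
```

When the data comes from a file, read.csv() also has an na.strings argument that can do this re-coding at import time.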
So, here again is our little example, where we have encoded mostly numbers and then one NA, so there are no characters in it. If we call is.na() on it, we get:
is.na(vec)
[1] FALSE FALSE FALSE FALSE TRUE
Everything is false except the true in the fifth position, according to the output.
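Because is.na() returns a logical vector, it combines with other base functions to locate and count the missing values. The following is a sketch with illustrative variable names, not a step from the original walkthrough.

```r
vec <- c(1, 2, 3, 4, NA)

# Positions of the missing values.
missing_positions <- which(is.na(vec))
missing_positions

# Number of missing values (TRUE counts as 1).
missing_count <- sum(is.na(vec))
missing_count

# Share of the vector that is missing.
missing_share <- mean(is.na(vec))
missing_share
```

For this vector, that gives position 5, one missing value, and a missing share of 0.2.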
So if we call mean(vec), it returns NA, because mean() returns NA whenever the vector contains at least one missing value.
However, many R functions have an argument called na.rm for this situation. Setting na.rm to TRUE removes the missing values first, so the mean is computed over the remaining values only.
mean(vec, na.rm = TRUE)
[1] 2.5
The same works for median() and many other functions that accept na.rm,
but be careful: if you have missing values, this drops them silently, so the result is based on fewer observations than the vector holds.
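As a sketch of that caution, the lines below apply na.rm to a few other base functions and also count how many values were actually used. The variable names are illustrative only.

```r
vec <- c(1, 2, 3, 4, NA)

# Other base functions accept na.rm in the same way as mean().
median_value <- median(vec, na.rm = TRUE)
total <- sum(vec, na.rm = TRUE)
largest <- max(vec, na.rm = TRUE)

# Keep track of how many values went into the result and how many were dropped.
n_used <- sum(!is.na(vec))
n_dropped <- sum(is.na(vec))
```

Here the median is 2.5, the sum is 10 and the maximum is 4, each computed from 4 values with 1 dropped. Reporting n_dropped next to a summary is one way to keep the removal visible.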
We will discuss how to replace NA values in an upcoming post.