Detecting and Dealing with Outliers: First Step

Posted on May 1, updated May 12, by Jim

As a first step in detecting and dealing with outliers, we're going to take a closer look at the mammal sleep data (mammalsleep, loaded here from the mice package).

Let's start with a summary of the data.

library(mice)
summary(mammalsleep)
 species         bw                brw         
 African elephant         : 1   Min.   :   0.005   Min.   :   0.14  
 African giant pouched rat: 1   1st Qu.:   0.600   1st Qu.:   4.25  
 Arctic Fox               : 1   Median :   3.342   Median :  17.25  
 Arctic ground squirrel   : 1   Mean   : 198.790   Mean   : 283.13  
 Asian elephant           : 1   3rd Qu.:  48.202   3rd Qu.: 166.00  
 Baboon                   : 1   Max.   :6654.000   Max.   :5712.00  
 (Other)                  :56                   
                    
      sws               ps              ts             mls         
 Min.   : 2.100   Min.   :0.000   Min.   : 2.60   Min.   :  2.000  
 1st Qu.: 6.250   1st Qu.:0.900   1st Qu.: 8.05   1st Qu.:  6.625  
 Median : 8.350   Median :1.800   Median :10.45   Median : 15.100  
 Mean   : 8.673   Mean   :1.972   Mean   :10.53   Mean   : 19.878  
 3rd Qu.:11.000   3rd Qu.:2.550   3rd Qu.:13.20   3rd Qu.: 27.750  
 Max.   :17.900   Max.   :6.600   Max.   :19.90   Max.   :100.000  
 NA's   :14       NA's   :12      NA's   :4       NA's   :4       
 
       gt               pi             sei             odi       
 Min.   : 12.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.: 35.75   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
 Median : 79.00   Median :3.000   Median :2.000   Median :2.000  
 Mean   :142.35   Mean   :2.871   Mean   :2.419   Mean   :2.613  
 3rd Qu.:207.50   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :645.00   Max.   :5.000   Max.   :5.000   Max.   :5.000  
 NA's   :4   

The summary() function takes a data frame and summarises each column independently, choosing a summary that suits the column's type: counts for a factor, and the minimum, quartiles, mean, maximum, and number of NA's for a numeric column.
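To see how summary() treats columns of different types, here is a minimal sketch on a made-up three-row data frame. The name toy and all of its values are assumptions for illustration and are not taken from mammalsleep.

```r
# Hypothetical toy data; not taken from mammalsleep
toy <- data.frame(
  species = factor(c("Cat", "Dog", "Fox")),
  sws     = c(10.9, NA, 6.1)
)

# Factor column: one count per level
species_counts <- summary(toy$species)

# Numeric column: min, quartiles, mean, max, plus the number of NA's
sws_summary <- summary(toy$sws)

print(species_counts)
print(sws_summary)
```

Calling summary(toy) on the whole data frame applies the same per-column logic and lays the results out side by side, as in the output above.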


The first variable is species (it is an ordinary column, not the row names). The summary shows a count of one for each species it lists, and groups the remaining 56 under (Other) to save space.

Those 56 are not missing. Every row is labeled with its own species.

The first thing we notice is bw, the body weight: its range is very wide, from a minimum of 0.005 to a maximum of 6654.

Detecting and Dealing with Outliers

This is a good occasion to point out that you should be on the lookout for values that are very different from the rest, which we refer to as outliers.
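A quick way to see how far the extremes sit from the rest is to look at the range and at the few largest values. The sketch below uses a hypothetical vector bw. Only 0.005 and 6654 match the summary above; the other numbers are made up.

```r
# Hypothetical body weights; only 0.005 and 6654 appear in the summary above
bw <- c(6654, 0.005, 3.3, 48, 0.6)

# Smallest and largest value, ignoring any NA's
bw_range <- range(bw, na.rm = TRUE)

# How many times larger the maximum is than the minimum
spread_ratio <- bw_range[2] / bw_range[1]

# The three largest values, to see how isolated the maximum is
largest <- head(sort(bw, decreasing = TRUE), 3)

print(bw_range)
print(spread_ratio)
print(largest)
```

A wide range is not an error in itself. It only tells you which values deserve a closer look.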

For example, we can ask which row holds the maximum of mammalsleep$bw:

which.max(mammalsleep$bw)

This returns 1. Note that which.max() gives the position of the largest value, not the value itself, so the heaviest animal is in row 1. Looking at that row, mammalsleep[1, ], shows the real culprit.

The African elephant is the largest animal in the data, and the row lists the values we have for it. Now let's look at the minimum.
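The difference between the position returned by which.max() and the value returned by max() can be sketched on a small stand-in data frame. The name toy and the Baboon weight are assumptions; the two extreme weights are the ones shown in this post.

```r
# Hypothetical stand-in for mammalsleep; the Baboon weight is made up
toy <- data.frame(
  species = c("African elephant", "Baboon", "Lesser short-tailed shrew"),
  bw      = c(6654, 10.55, 0.005)
)

idx       <- which.max(toy$bw)  # position of the largest weight
max_value <- max(toy$bw)        # the largest weight itself
heaviest  <- toy[idx, ]         # the full row for that animal

print(idx)
print(max_value)
print(heaviest)
```

Indexing the data frame with the position keeps the species and all other columns together, which is what you need when deciding whether the extreme value is plausible.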


mammalsleep[which.min(mammalsleep$bw),]
                     species
32 Lesser short-tailed shrew
      bw  brw sws  ps  ts mls
32 0.005 0.14 7.7 1.4 9.1 2.6
     gt pi sei odi
32 21.5  5   2   4

Both extremes, the African elephant at the top and the lesser short-tailed shrew at the bottom, are reasonable.

That is, they are not errors. Compare that with a case where you look up the maximum and the value is, for example, 9,999, which does not make sense. If we discovered a possum that weighed 9,999, more than the elephant, we would know something was wrong.

Most likely that number was used to encode a missing value. It is not a valid measurement.

So, before doing any imputation, I would have to replace that numeric code with a genuine missing value (NA).
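A minimal sketch of that replacement, on made-up data: the possum row and the code 9999 are hypothetical (mammalsleep contains no such value), and missing_code is an assumed name for whatever sentinel your own data uses.

```r
# Hypothetical data: 9999 stands in for "not recorded"
toy <- data.frame(
  species = c("African elephant", "Possum", "Lesser short-tailed shrew"),
  bw      = c(6654, 9999, 0.005)
)

missing_code <- 9999  # assumed sentinel; check your data's documentation

# Replace the sentinel with a genuine missing value.
# %in% returns FALSE for existing NA's, so they are left untouched.
toy$bw[toy$bw %in% missing_code] <- NA

# The maximum is now a real measurement again
max_after <- max(toy$bw, na.rm = TRUE)

print(toy)
print(max_after)
```

Once the code is stored as NA, summary() reports it under NA's instead of letting it distort the maximum and the mean, and imputation routines can recognise it as missing.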

Looking for outliers, and documenting them, is therefore a useful double-check on your data.

Typos, values that have been merged together, and missing columns all cause problems that can make your data produce absurd results, and it is your responsibility to correct them.

The computer can assist you with this, but you must stay involved in every part of the process.

Looking at the maximum and minimum of each variable is a good way to start.

Then document everything. Note which value is the largest and which is the smallest, both for yourself and for the people who will read your report.

That way everyone knows what the extreme points are and can check them against what they already know.
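One way to keep such notes is a small table of extremes that can be pasted into a report. This is only a sketch: toy, numeric_cols, and extremes are assumed names, and apart from the two bw extremes and the shrew's sws value, the numbers are made up.

```r
# Hypothetical stand-in for mammalsleep; most values are made up
toy <- data.frame(
  species = c("African elephant", "Baboon", "Lesser short-tailed shrew"),
  bw      = c(6654, 10.55, 0.005),
  sws     = c(NA, 9.1, 7.7),
  stringsAsFactors = FALSE
)

numeric_cols <- names(toy)[sapply(toy, is.numeric)]

# One row per numeric column: the extreme values and the species they belong to
extremes <- do.call(rbind, lapply(numeric_cols, function(col) {
  x <- toy[[col]]
  data.frame(
    variable    = col,
    min         = min(x, na.rm = TRUE),
    min_species = toy$species[which.min(x)],
    max         = max(x, na.rm = TRUE),
    max_species = toy$species[which.max(x)]
  )
}))

print(extremes)
```

Both which.min() and which.max() skip missing values, so the species reported are always those of actual measurements.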
