Data Science Tutorials


Detecting and Dealing with Outliers: First Step

Posted on May 1 (updated May 12) by Jim

Detecting and dealing with outliers: we’re going to look a little more closely at the mammalsleep data from the mice package.

Let’s start with a summary of the data set.

library(mice)
summary(mammalsleep)
 species         bw                brw         
 African elephant         : 1   Min.   :   0.005   Min.   :   0.14  
 African giant pouched rat: 1   1st Qu.:   0.600   1st Qu.:   4.25  
 Arctic Fox               : 1   Median :   3.342   Median :  17.25  
 Arctic ground squirrel   : 1   Mean   : 198.790   Mean   : 283.13  
 Asian elephant           : 1   3rd Qu.:  48.202   3rd Qu.: 166.00  
 Baboon                   : 1   Max.   :6654.000   Max.   :5712.00  
 (Other)                  :56                   
                    
      sws               ps              ts             mls         
 Min.   : 2.100   Min.   :0.000   Min.   : 2.60   Min.   :  2.000  
 1st Qu.: 6.250   1st Qu.:0.900   1st Qu.: 8.05   1st Qu.:  6.625  
 Median : 8.350   Median :1.800   Median :10.45   Median : 15.100  
 Mean   : 8.673   Mean   :1.972   Mean   :10.53   Mean   : 19.878  
 3rd Qu.:11.000   3rd Qu.:2.550   3rd Qu.:13.20   3rd Qu.: 27.750  
 Max.   :17.900   Max.   :6.600   Max.   :19.90   Max.   :100.000  
 NA's   :14       NA's   :12      NA's   :4       NA's   :4       
 
       gt               pi             sei             odi       
 Min.   : 12.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
 1st Qu.: 35.75   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
 Median : 79.00   Median :3.000   Median :2.000   Median :2.000  
 Mean   :142.35   Mean   :2.871   Mean   :2.419   Mean   :2.613  
 3rd Qu.:207.50   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :645.00   Max.   :5.000   Max.   :5.000   Max.   :5.000  
 NA's   :4   

When you call summary() on a data frame, it summarises each column independently and picks a summary that suits the column’s type: counts for a factor, and the minimum, quartiles, mean, maximum, and number of NAs for a numeric column.
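As a rough illustration of that per-column behaviour, here is a minimal sketch on a small stand-in data frame. The name toy and the missing Baboon weight are assumptions made for the example; only the elephant and shrew weights are taken from the output in this post.

```r
# Toy stand-in for mammalsleep (hypothetical; not the real data set).
toy <- data.frame(
  species = factor(c("African elephant", "Baboon", "Lesser short-tailed shrew")),
  bw      = c(6654, NA, 0.005)  # the NA is made up to show the NA's line
)

# One call summarises every column according to its type.
summary(toy)

# The same summaries, column by column.
s_species <- summary(toy$species)  # factor: a count per level
s_bw      <- summary(toy$bw)       # numeric: min, quartiles, mean, max, NA's
s_species
s_bw
```

The factor column gets counts and the numeric column gets the six-number summary plus an NA count, which is the same pattern as in the mammalsleep output.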

The first variable is species. It is a column of the data frame, not the row names, and the summary shows a count of 1 for each species because every animal appears once. Only six are listed by name; the remaining 56 are grouped under “(Other)”.

Those 56 are not missing or unlabelled. Every row carries its own species name; summary() simply collapses them to save space.

The first numeric variable is bw, body weight (also called body mass). Its range is very wide: the minimum is 0.005 and the maximum is 6654.

Detecting and Dealing with Outliers

This is a good occasion to point out that you should be on the lookout for values that differ markedly from the rest, which we refer to as outliers.

For example, we can ask which observation holds the maximum of mammalsleep$bw:

which.max(mammalsleep$bw)

It returns 1. That is the row number of the largest weight, not the weight itself, so we can use it to look at the real culprit: mammalsleep[which.max(mammalsleep$bw), ].
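A minimal sketch of that lookup, using a small stand-in data frame rather than mammalsleep itself. The name toy and the Baboon weight are placeholders invented for the example; the elephant and shrew weights come from the output in this post.

```r
# Toy stand-in for mammalsleep (hypothetical; not the real data set).
toy <- data.frame(
  species = c("African elephant", "Baboon", "Lesser short-tailed shrew"),
  bw      = c(6654, 10, 0.005),  # the Baboon weight is a made-up placeholder
  stringsAsFactors = FALSE
)

i_max    <- which.max(toy$bw)  # position of the largest weight, not its value
heaviest <- toy[i_max, ]       # the whole row for that animal

i_max
heaviest
max(toy$bw)                    # the value itself, for comparison
```

Indexing with the result of which.max() keeps the species name next to the extreme value, which is what you need to judge whether the value is plausible.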

Row 1 is the African elephant, which is indeed the heaviest animal in the data. Now let’s look at the minimum.

mammalsleep[which.min(mammalsleep$bw),]
                     species    bw  brw sws  ps  ts mls   gt pi sei odi
32 Lesser short-tailed shrew 0.005 0.14 7.7 1.4 9.1 2.6 21.5  5   2   4

Both extremes, the African elephant and the lesser short-tailed shrew, are reasonable.

That is, they are not errors. Suppose instead that the maximum had turned out to be 9,999 and belonged to a possum. A possum heavier than the elephant does not make sense.

A value like that is most likely a code someone used for a missing value. It is not a valid measurement.

So, before doing any imputation, I would have to replace that numeric code with a genuine missing value (NA).
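A minimal sketch of that replacement, assuming 9999 is the sentinel code. The possum entry is the hypothetical case from the text, not a value in mammalsleep, and the vector names are invented for the example.

```r
# Hypothetical weights: 9999 stands for "missing", and one value is already NA.
bw <- c(elephant = 6654, possum = 9999, shrew = 0.005, unknown = NA)

# %in% never returns NA, so existing NAs cannot upset the logical index.
bw_clean <- bw
bw_clean[bw_clean %in% 9999] <- NA

which.max(bw)        # the sentinel wins: possum
which.max(bw_clean)  # after recoding: elephant
```

Once the code is a real NA, functions such as which.max() skip it, and imputation routines can treat it as missing rather than as a very heavy possum.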

Looking for outliers and documenting them is therefore a useful double-check on the data.

Typos, values that have been run together, and missing columns all cause problems that make your data produce absurd results, and it is your responsibility to correct them.

The computer can assist you with this, but you must stay involved in every part of the process.

Looking at the maximum and minimum of each variable is a good place to start.

So here’s what we do: document everything. Write down which value is the largest, which is the smallest, and which observation each belongs to, both for yourself and for the people who will read your report.

That way everyone is aware of the extreme points and can check them against what they already know.
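One way to keep such notes is a small table of extremes that can be pasted into the report. This is only a sketch: the helper extremes() and the data frame toy are hypothetical names, and the values are placeholders apart from the elephant and shrew figures shown earlier.

```r
# Toy stand-in for mammalsleep (hypothetical; not the real data set).
toy <- data.frame(
  species = c("African elephant", "Baboon", "Lesser short-tailed shrew"),
  bw      = c(6654, 10, 0.005),  # the Baboon weight is a placeholder
  sws     = c(NA, 9, 7.7),       # the NA and 9 are placeholders
  stringsAsFactors = FALSE
)

# For each variable, record the smallest and largest value and who owns it.
extremes <- function(df, id, vars) {
  rows <- lapply(vars, function(v) {
    x <- df[[v]]
    data.frame(
      variable = v,
      min      = min(x, na.rm = TRUE),
      min_id   = as.character(df[[id]][which.min(x)]),
      max      = max(x, na.rm = TRUE),
      max_id   = as.character(df[[id]][which.max(x)]),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

notes <- extremes(toy, id = "species", vars = c("bw", "sws"))
notes
```

Each row of notes names a variable, its two extreme values, and the observations they belong to, so a reader can compare them with what they already know.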
