Cleaning data

Introduction#

Cleaning data in R is paramount to make any analysis. whatever data you have, be it from measurements taken in the field or scraped from the web it is most probable that you will have to reshape it, transform it or filter it to make it suitable for your analysis.

In this documentation, we will cover the following topics:

Removing observations with missing data
Factorizing data
Removing incomplete Rows

Removing missing data from a vector

First lets create a vector called Vector1:

set.seed(123)
Vector1 <- rnorm(20)

And add missing data to it:

set.seed(123)
Vector1[sample(1:length(Vector1), 5)] <- NA

Now we can use the is.na function to subset the Vector

Vector1 <- Vector1[!is.na(Vector1)]

Now the resulting vector will have removed the NAs of the original Vector1

Removing incomplete rows

There might be times where you have a data frame and you want to remove all the rows that might contain an NA value, for that the function complete.cases is the best option.

We will use the first 6 rows of the airquality dataset to make an example since it already has NAs

x <- head(airquality)

This has two rows with NAs in the Solar.R column, to remove them we do the following

x_no_NA <- x[complete.cases(x),]

The resulting dataframe x_no_NA will only have complete rows without NAs