5 Cleaning Data


Notes taken during/inspired by the Datacamp course ‘Cleaning Data in R’ by Nick Carchedi.

5.1 Tidying data

In Hadley’s paper on tidy data, he talked about how columns in a data frame should be variables or attributes and rows should be observations - this sometimes does not happen if there are things like dummy variables as columns, which could be collapsed into a single column. Each data table (data frame) should also be about one particular type of observational unit, i.e. a table about countries should not have a table about cats embedded in it.

Hadley introduced the tidyr package to help tidy up messy data. There are two fundamental verbs of data tidying:

  • gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer

  • spread() takes two columns (key & value) and spreads them into multiple columns: it makes “long” data wider

gather(data, key, value, ...)

data: a data frame

key: the name of the new key column

value: the name of the new value column

...: names of the columns to gather, or columns to exclude with a minus sign (e.g. -time leaves the time column out of the gathered table)

spread(data, key, value)

data: a data frame

key: the name of the column containing the keys (whose values become the new column names)

value: the name of the column containing the values

# Apply gather() to bmi and save the result as bmi_long
bmi_long <- gather(bmi, year, bmi_value, -Country)

# Apply spread() to bmi_long
bmi_wide <- spread(bmi_long, year, bmi_value)
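
As a quick illustration of how the two verbs relate (using a small made-up data frame rather than the course's bmi data), gather() and spread() reverse each other:

# Load tidyr
library(tidyr)

# A small made-up data frame in "wide" format: one column per year
scores <- data.frame(name = c("Ann", "Bob"), Y2014 = c(10, 12), Y2015 = c(11, 14))

# Gather every column except name into year/score key-value pairs (wide to long)
scores_long <- gather(scores, year, score, -name)

# Spread the key-value pairs back out into one column per year (long to wide)
spread(scores_long, year, score)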

Another useful feature is separate(). This takes a single column and separates it into two (or more) columns, for instance splitting a year-month value (2015-10) into separate year and month columns.

separate(data, col, into, sep = "")

data: a data frame

col: bare name of column to separate

into: character vector of new column names

Optional sep = "": the character (e.g. /, @) on which to split the column. If it is omitted, separate() splits at any sequence of non-alphanumeric characters.

# separate year-mo into two columns
separate(treatments, year_mo, c("year", "month"))

We can also use the unite() function to combine two or more columns into one.

unite(data, col, ...)

data: a data frame

col: name of the new column

...: bare names of the columns to unite

The default separator within the new column is an underscore, however we can specify something different

Optional sep = "-": would use a hyphen as the separator instead

head(bmi_cc)

              Country_ISO  year  bmi_val
1          Afghanistan/AF Y1980 21.48678
2              Albania/AL Y1980 25.22533
3              Algeria/DZ Y1980 22.25703
4              Andorra/AD Y1980 25.66652
5               Angola/AO Y1980 20.94876
6  Antigua and Barbuda/AG Y1980 23.31424

So to separate Country_ISO into two columns

# Apply separate() to bmi_cc
bmi_cc_clean <- separate(bmi_cc, col = Country_ISO, into = c("Country", "ISO"), sep = "/")

# Apply unite() to bmi_cc_clean and reverse the separation
bmi_cc <- unite(bmi_cc_clean, Country_ISO, Country, ISO, sep = "-")

5.2 Preparing data for analysis

Often we need to convert, or in the case of raw data, create the appropriate data type for each variable prior to analysis. Some common data types include

  • character: “treatment”, “123”, “A”
  • numeric: 23.44, 120, NaN, Inf
  • integer: 4L, 1123L
  • factor: factor(“Hello”), factor(8)
  • logical: TRUE, FALSE, NA

We can use the class() function to determine the variable type, or we can pass it a value directly to check its type, e.g. class(77L) will return [1] "integer".

We can also use the coercion functions to change the types, such as as.numeric(), as.factor() and as.character().
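
For example (using made-up values):

# Check the type of a value
class(77L)          # "integer"
class("treatment")  # "character"

# Coerce between types
as.numeric("123")       # 123
as.character(23.44)     # "23.44"
as.factor(c("a", "b"))  # factor with levels a and b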

For dates and times, we can use the lubridate package.

# Load the lubridate package
library(lubridate)

# Parse as date
dmy("17 Sep 2015")

# Parse as date and time (with no seconds!)
mdy_hm("July 15, 2012 12:56")

# Coerce dob to a date (with no time)
students2$dob <- ymd(students2$dob)

# Coerce nurse_visit to a date and time
students2$nurse_visit <- ymd_hms(students2$nurse_visit)

5.3 String manipulation

Another useful package is stringr, which, like lubridate and other Hadley packages, has a consistent interface and provides a range of functions for dealing with strings. Some functions include

  • str_trim() - Trim leading and trailing white space
  • str_pad() - Pad with additional characters
  • str_detect() - Detect a pattern
  • str_replace() - Find and replace a pattern

# Load the stringr package
library(stringr)

# Trim all leading and trailing whitespace
str_trim(c("   Filip ", "Nick  ", " Jonathan"))

# Pad these strings with leading zeros
str_pad(c("23485W", "8823453Q", "994Z"), width = 9, side = "left", pad = "0")

# Detect all dates of birth (dob) in 1997
str_detect(students2$dob, "1997")

# In the sex column, replace "F" with "Female"...
students2$sex <-  str_replace(students2$sex, "F", "Female")

# ...And "M" with "Male"
students2$sex <- str_replace(students2$sex, "M", "Male")

R {base} also has some handy features for strings, including toupper() and tolower().
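
For instance:

# Convert case with base R
toupper("male")   # "MALE"
tolower("Male")   # "male"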

5.4 Missing, Specials and Outliers

Generally missing values in R are represented by NA. However, if the data has been imported from other systems, the values can be different, such as a . (dot) if imported from SPSS.

We can use is.na(df) to return a TRUE/FALSE matrix showing where there are NA values in a data frame. Or, for large datasets, we can use any(is.na(df)) to return a single TRUE or FALSE indicating whether there is an NA anywhere in the data frame.

Alternatively we can use sum(is.na(df)) to count how many NAs are in the data frame. Use complete.cases() to see which rows have no missing values.

Special values include Inf for infinite values and NaN for "not a number".
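
A minimal sketch of these checks, assuming a small made-up data frame df:

# A made-up data frame with missing and special values
df <- data.frame(x = c(1, NA, 3, NaN), y = c(Inf, 2, NA, 4))

is.na(df)                 # TRUE/FALSE matrix of missing values (NaN counts as NA)
any(is.na(df))            # TRUE if anything is missing
sum(is.na(df))            # how many values are missing
complete.cases(df)        # TRUE for rows with no missing values
df[complete.cases(df), ]  # keep only the complete rows

is.nan(df$x)              # detect NaN specifically
is.infinite(df$y)         # detect Inf / -Inf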

Outliers are best detected by measures such as the IQR or other numerical measures (see the EDA section), or by using a boxplot or a histogram/density plot. There are a number of likely reasons for an outlier:

  • Valid measurements
  • Variability in measurement
  • Experimental error
  • Data entry error

Outliers may be discarded or retained depending on the cause. In some instances we may want to cap, or put a limit on, the maximum value the outlier can take. Looking at the actual values and considering what is plausible can help; for instance a negative age, or a person's age above 200, is not plausible. However, such values may be data entry errors or, in the case of negative numbers, represent a deliberately coded missing value.
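
As a sketch, with a hypothetical vector of ages:

# Hypothetical ages containing implausible values
age <- c(21, 34, 29, 230, 41, -2, 38)

summary(age)   # the min and max flag the extremes
boxplot(age)   # values beyond the whiskers appear as outliers
hist(age)      # a long tail hints at suspect values

# Treat impossible values as missing, or cap at a plausible maximum
age[age < 0] <- NA
age[!is.na(age) & age > 120] <- 120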

5.5 Examples

The weather dataset suffers from one of the five most common symptoms of messy data: column names are values. In particular, the column names X1-X31 represent days of the month, which should really be values of a new variable called day.

head(weather)

X year month measure X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13 X14
1 2014 12 Max.TemperatureF 64 42 51 43 42 45 38 29 49 48 39 39 42 45
2 2014 12 Mean.TemperatureF 52 38 44 37 34 42 30 24 39 43 36 35 37 39
3 2014 12 Min.TemperatureF 39 33 37 30 26 38 21 18 29 38 32 31 32 33
4 2014 12 Max.Dew.PointF 46 40 49 24 37 45 36 28 49 45 37 28 28 29

# Load the tidyr package
library(tidyr)

# Gather the columns
weather2 <- gather(weather, day, value, X1:X31, na.rm = TRUE)

becomes

X year month measure day value
1 2014 12 Max.TemperatureF X1 64
2 2014 12 Mean.TemperatureF X1 52
3 2014 12 Min.TemperatureF X1 39
4 2014 12 Max.Dew.PointF X1 46
5 2014 12 MeanDew.PointF X1 40
6 2014 12 Min.DewpointF X1 26

Our data suffer from a second common symptom of messy data: values are variable names. Specifically, values in the measure column should be variables (i.e. column names) in our dataset. We also have an additional column (X) which is not needed, as it is just the row number.

# First remove column of row names
weather2 <- weather2[, -1]

# Spread the data
weather3 <- spread(weather2, measure, value)
year month day CloudCover    Events Max.Dew.PointF Max.Gust.SpeedMPH
2014    12  X1          6      Rain             46                29
2014    12  X2          7 Rain-Snow             40                29
2014    12  X3          8      Rain             49                38
2014    12  X4          3                       24                33
2014    12  X5          5      Rain             37                26
2014    12  X6          8      Rain             45                25

Now that the weather dataset adheres to tidy data principles, the next step is to prepare it for analysis. We’ll start by combining the year, month, and day columns and recoding the resulting character column as a date. We can use a combination of base R, stringr, and lubridate to accomplish this task.

# Remove X's from day column
weather3$day <- str_replace(weather3$day, "X", "")

# Unite the year, month, and day columns
weather4 <- unite(weather3, date, year, month, day, sep = "-")

# Convert date column to proper date format using lubridate's ymd()
weather4$date <- ymd(weather4$date)

# Rearrange columns using dplyr's select()
weather5 <- select(weather4, date, Events, CloudCover:WindDirDegrees)

It’s important for analysis that variables are coded appropriately. This is not yet the case with our weather data. Recall that functions such as as.numeric() and as.character() can be used to coerce variables into different types.

It’s important to keep in mind that coercions are not always successful, particularly if there’s some data in a column that you don’t expect. For example, the following will cause problems:

as.numeric(c(4, 6.44, "some string", 222))

So we can use the str_replace() function to fix such character values before coercing the column.
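
For example (with a made-up vector where a stray unit blocks the coercion):

# Load the stringr package
library(stringr)

# A made-up character vector where one value contains a unit
x <- c("4", "6.44", "8.5kg", "222")

as.numeric(x)                          # "8.5kg" becomes NA, with a warning

# Remove the offending characters first, then coerce
as.numeric(str_replace(x, "kg", ""))   # 4.00 6.44 8.50 222.00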

If we have missing data, we can use is.na() together with indexing to identify and then view only those rows with NA values in a variable of interest.

# Count missing values
sum(is.na(weather6))

# Find missing values
summary(weather6)

# Find indices of NAs in Max.Gust.SpeedMPH
ind <- which(is.na(weather6$Max.Gust.SpeedMPH))

# Look at the full rows for records missing Max.Gust.SpeedMPH
weather6[ind, ]

Besides missing values, we want to know if there are values in the data that are too extreme or bizarre to be plausible. A great way to start the search for these values is with summary().

Once implausible values are identified, they must be dealt with in an intelligent and informed way. Sometimes the best way forward is obvious and other times it may require some research and/or discussions with the original collectors of the data.

# Find row with Max.Humidity of 1000
ind <- which(weather6$Max.Humidity == 1000)

# Look at the data for that day
weather6[ind, ]

# Change 1000 to 100
weather6$Max.Humidity[ind] <- 100

Before officially calling our weather data clean, we want to put a couple of finishing touches on the data. These are a bit more subjective and may not be necessary for analysis, but they will make the data easier for others to interpret, which is generally a good thing.

There are a number of stylistic conventions in the R language. Depending on who you ask, these conventions may vary. Because the period (.) has special meaning in certain situations, we generally recommend using underscores (_) to separate words in variable names. We also prefer all lowercase letters so that no one has to remember which letters are uppercase or lowercase.
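
A small sketch of tidying up names, using a hypothetical data frame (the exact renaming for the weather data would depend on its final columns):

# Hypothetical data frame with inconsistent column names
df <- data.frame(Max.TempF = c(64, 42), Mean.TempF = c(52, 38))

# Lowercase everything and swap periods for underscores
names(df) <- tolower(names(df))
names(df) <- gsub(".", "_", names(df), fixed = TRUE)

names(df)   # "max_tempf" "mean_tempf"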