Treating your data: The old school vs tidyverse modern tools

By Gabriel Vasconcelos

When I first started using R there was no such thing as the tidyverse. Although some of the tidyverse packages were already available independently, I learned to treat my data mostly by brute force, combining pieces of information I had gathered from several sources. It is very interesting to compare that old school style with tidyverse code written with the magrittr pipe. Even if you want to stay old school, the tidyverse is here to stay, and it is the first tool taught in many data science courses based on R.

My objective is to show a very simple example comparing the two ways of writing. There are several ways to do what I am going to propose here, but I think this example is enough to capture the main differences between old school code and magrittr plus tidyverse. Magrittr is not new, but it seems to me that it has become more popular because of the tidyverse.

On to the example

I am going to generate a very simple data set with two variables indexed by letters. My objective is to sum the two variables over the rows corresponding to vowels.

set.seed(123)
M = 1000
db1 = data.frame(id = sample(letters, M, replace = TRUE), v1 = rnorm(M), v2 = rnorm(M))
vowels = c("a", "e", "i", "o", "u")
head(db1)
##   id          v1         v2
## 1  h -0.60189285 -0.8209867
## 2  u -0.99369859 -0.3072572
## 3  k  1.02678506 -0.9020980
## 4  w  0.75106130  0.6270687
## 5  y -1.50916654  1.1203550
## 6  b -0.09514745  2.1272136

The first strategy (old school) is to use aggregate and then some manipulation. First I aggregate the variables to get the sum for each letter, then I select the vowels and use colSums to obtain the final result.

ag1 = aggregate( . ~ id, data = db1, FUN = sum)
ag1 = ag1[ag1$id %in% vowels, ]
ag1 = colSums(ag1[, -1])
ag1
##        v1        v2
## 26.656837  6.644839

The second strategy (tidyverse) uses functions from the dplyr package and the forward-pipe operator (%>%) from magrittr. The forward-pipe allows us to chain several operations and get the final result in a single shot, so we do not need to create auxiliary objects like I did in the previous example. The first two lines do precisely the same as the aggregate call: group_by defines the variable used to create the groups, and summarise tells R how the values in each group should be combined. The third line keeps only the rows corresponding to vowels, and the final summarize sums each variable. As you can see, the results are the same. This approach returns an object called a tibble, which is a special type of data frame from the tidyverse with some different features, such as not converting strings to factors.

library(tidyverse)

ag2 = group_by(db1, id) %>%
  summarise(v1 = sum(v1), v2 = sum(v2)) %>%
  filter(id %in% vowels) %>%
  summarize(v1 = sum(v1), v2 = sum(v2))

ag2
## # A tibble: 1 x 2
##         v1       v2
##      <dbl>    <dbl>
## 1 26.65684 6.644839
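
To illustrate the point about factors, here is a minimal sketch (data.frame converted strings to factors by default when this post was written; since R 4.0 the default changed, so stringsAsFactors is set explicitly below), showing that a tibble keeps strings as character:

df = data.frame(id = c("a", "e"), stringsAsFactors = TRUE)
tb = tibble(id = c("a", "e"))
class(df$id)
## [1] "factor"
class(tb$id)
## [1] "character"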

The same thing using merge

Suppose that we want to do the same thing as in the previous example, but now we are dealing with two data frames: the one from the previous example and a second data frame of characteristics that tells us which letters are vowels.

aux = rep("consonant",length(letters))
aux[which(letters %in% vowels)] = "vowel"
db2 = data.frame(id = letters, type = aux)
head(db2)
##   id      type
## 1  a     vowel
## 2  b consonant
## 3  c consonant
## 4  d consonant
## 5  e     vowel
## 6  f consonant

The first approach uses merge to combine the two data frames and then sums the observations whose type is vowel.

merge1 = merge(db1, db2, by = "id")
head(merge1)
##   id          v1         v2  type
## 1  a -0.73657823  1.1903106 vowel
## 2  a  0.07987382 -1.1058145 vowel
## 3  a -1.20086933  0.4859824 vowel
## 4  a  0.32040231 -0.6196151 vowel
## 5  a -0.69493683 -1.0387278 vowel
## 6  a  0.15735335  1.6165776 vowel
merge1 = colSums(merge1[merge1[,4] == "vowel", 2:3])
merge1
##        v1        v2
## 26.656837  6.644839

The second approach uses the inner_join function from the dplyr package, then filters the vowel observations and uses summarise to sum them.

merge2 = inner_join(db1, db2, by = "id") %>%
  filter(type == "vowel") %>%
  summarise(v1 = sum(v1), v2 = sum(v2))
merge2
##         v1       v2
## 1 26.65684 6.644839

As you can see, the two ways of writing are very different. Naturally, there is some cost in moving from old school code to the tidyverse. However, the second style makes your code easier to read; it is part of the tidyverse philosophy to write code that can be read by humans. For example, something like this:

x = 1:10
sum(log(sqrt(x)))
## [1] 7.552206

becomes something like this if you use the forward-pipe, which simply passes the result on its left as the first argument of the function on its right:

x %>% sqrt() %>% log() %>% sum()
## [1] 7.552206
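
If you need the result on the left somewhere other than the first argument, magrittr also provides a dot placeholder; a small sketch:

2 %>% round(pi, digits = .)
## [1] 3.14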

For more information, check out the tidyverse website and the R for Data Science book, which is available for free online here.

Responses to Treating your data: The old school vs tidyverse modern tools

  1. Greg Arnold says:

    Why not just:
    colSums(db1[db1$id %in% vowels, sapply(db1, is.numeric)])

    • insightr says:

      Hi Greg. Thanks for your comment. As I said, there are many ways to do what I proposed.

    • Timothy Davis says:

      Performance would be another interesting comparison between old school and the tidyverse. One thing that has drawn some of us to the tidyverse is the fact that so much of the code has optimized C++ behind it.

      • Brian Stamper says:

        Looks like in this case base is faster:

        library(microbenchmark)
        microbenchmark(
          test1_base = {
            ag1 = aggregate( . ~ id, data = db1, FUN = sum)
            ag1 = ag1[ag1$id %in% vowels, ]
            ag1 = colSums(ag1[, -1])
          },
          test1_dplyr = {
            ag1 = group_by(db1, id) %>%
              summarise(v1 = sum(v1), v2 = sum(v2)) %>%
              filter(id %in% vowels) %>%
              summarize(v1 = sum(v1), v2 = sum(v2))
          }
        )
        

        Unit: milliseconds
        expr min lq mean median uq max neval cld
        test1_base 1.609314 1.660545 1.745496 1.703812 1.753540 3.229445 100 a
        test1_dplyr 5.017839 5.094009 5.339200 5.220356 5.327323 6.895173 100 b

        Without the aggregate step it is much faster, but base still wins:

        microbenchmark(
          test2_base = {
            ag1 = db1
            ag1 = ag1[ag1$id %in% vowels, ]
            ag1 = colSums(ag1[, -1])
          },
          test2_dplyr = {
            ag1 = db1 %>%
              filter(id %in% vowels) %>%
              summarize(v1 = sum(v1), v2 = sum(v2))
          }
        )
        

        Unit: microseconds
        expr min lq mean median uq max neval cld
        test2_base 160.150 178.0285 198.8722 194.2535 208.0755 338.029 100 a
        test2_dplyr 3886.872 3942.4590 4180.3349 4052.7315 4182.5340 6691.454 100 b

        I know this is not always the case; often dplyr is faster, and data.table is usually faster than dplyr. But dplyr wins the readability (and teachability) contest, hands down.

  2. thiagosilva says:

    There are definitely different ways of doing things, but your examples have unnecessary steps. Since you just want the sum over all vowels, and not per vowel, there is no reason to aggregate or group_by:

    # Alternative to example 1
    ag1 = db1[db1$id %in% vowels, ] # no need to aggregate, can go straight to filter
    ag1 = colSums(ag1[, -1])
    ag1
    #v1        v2 
    #26.656837  6.644839 
    
    # Alternative to example 2: no need to group by or summarize twice
    ag2 <- db1 %>% filter(id %in% vowels) %>%
      summarize(v1 = sum(v1), v2 = sum(v2))
    ag2
    #v1       v2
    #1 26.65684 6.644839
    

    This actually speaks in favor of the tidyverse. I started using R way before the tidyverse and changed to using dplyr only very recently. Still, I didn’t even blink when I read your “old school” code, but when I saw the double use of summarize() in the second example, the redundancy was obvious!

    • insightr says:

      You are right: there are more efficient ways to solve the problem. However, my point was to show the difference between the two writing styles in an example people could grasp in a few lines of text. If the problem were solved in a single line in each style, the main idea of the post would be lost.

      You made a very good point! In fact, it is much easier to see whether the code is as efficient as it can be when it is written using dplyr =).

    • José De Mello says:

      I think for R beginners, more concise code may translate into more confusion. I have come across situations like these myself. When I am coding for myself, it is easy to wrap multiple functions and remove redundancies. Things are different when you write something for someone else, as you often have to explain what you are doing. It is easy to fall into redundancy, but that is not a bad thing per se, because the code becomes more literal.

  3. Srikanth K S says:

    Here is a comparison of different methods on a larger dataframe:

    library("dplyr")
    library("data.table")
    library("microbenchmark")
    
    set.seed(123)
    M   = 1e7
    db1 = data.frame(id = sample(letters, M, replace = TRUE)
                     , v1 = rnorm(M)
                     , v2 = rnorm(M)
                     )
    vowels = c("a", "e", "i", "o", "u")
    head(db1)
    setDT(db1)
    
    microbenchmark(
        DT    = db1[id %in% vowels, .(v1 = sum(v1), v2 = sum(v2))]
      , dplyr = filter(db1, id %in% vowels) %>% summarize(v1 = sum(v1), v2 = sum(v2))
      , baseR = colSums(db1[db1$id %in% vowels, c("v1", "v2")])
      )
    

    yields this (on my computer):

    Unit: milliseconds
      expr      min       lq     mean   median       uq      max neval cld
        DT 115.6803 123.2886 132.2126 127.4978 131.3758 266.7618   100  a 
     dplyr 326.6213 337.6235 355.3938 343.7479 351.3250 491.6581   100   b
     baseR 308.4871 329.5107 365.9381 335.7553 354.9196 497.6719   100   b
    

    IMHO, the syntax of data.table is not as intuitive as dplyr’s, but getting used to it is only a matter of time, as the performance is usually far superior!

  4. Ista Zahn says:

    There really is no reason you cannot write “tidyverse-style” code without tidyverse. For example:

    db1[db1$id %in% vowels, ] ->..
    aggregate(. ~ id, data = .., FUN = sum) ->..
    colSums(..[, setdiff(names(..), "id")]) -> ag1
    ag1
    
    merge(db1, db2, by = "id") ->..
    subset(.., type == "vowel") ->..
    colSums(..[, setdiff(names(..), c("type","id"))])
    

    Don’t get me wrong, there are some really cool things about the tidyverse. However, I don’t really understand the “it makes the code cleaner/easier to understand” argument.

    • José De Mello says:

      I think for people who don’t have vast experience in coding, the tidyverse is far superior. For more experienced programmers, most of the base R options are just as good or even better. I don’t have a coding background myself and experienced large gains in productivity once I mastered the tidyverse. I think that for inexperienced coders, it takes longer to hit a productivity threshold with “abstract” coding syntax.
