Module 1 | Lesson 2

{Tidyverse} is your friend

When you begin your journey in R, you'll quickly encounter two different philosophies for writing code: base R and the tidyverse.

Both are powerful ways to work with data, but they offer different approaches and syntax.

Base R vs. Tidyverse

Base R is the original R system that comes right out of the box when you install R. It includes everything you need to perform data analysis and is incredibly powerful.

However, it can sometimes be complex and unintuitive for data manipulation and analysis, especially for those who are new to R.

Let's consider this line of code:

round(mean(subset(na.omit(data), species == "Adelie")$bill_length_mm), 2)

It is hard to read!

The tidyverse, on the other hand, is a collection of R packages designed with the goal of making data science faster, easier, and more accessible. It introduces a consistent and simplified syntax that can often make your code more readable and easier to write.

library(tidyverse)

data %>%
  filter(!is.na(bill_length_mm), species == "Adelie") %>%
  summarise(mean_bill_length = mean(bill_length_mm)) %>%
  pull(mean_bill_length) %>%
  round(2)
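To try both styles side by side, here is a minimal sketch. The `toy` data frame and its values are made up for illustration and simply stand in for `data`. One caveat: `na.omit()` drops rows with a missing value in any column, while the `filter()` call only checks `bill_length_mm`. The two agree here because the only missing value is in that column, but they can differ on data with gaps elsewhere.

```r
library(dplyr)

# Hypothetical toy data standing in for `data` (values are made up)
toy <- data.frame(
  species        = c("Adelie", "Adelie", "Adelie", "Gentoo"),
  bill_length_mm = c(39.1, NA, 40.3, 47.5)
)

# Base R: nested calls, read from the inside out
base_result <- round(
  mean(subset(na.omit(toy), species == "Adelie")$bill_length_mm),
  2
)

# Tidyverse: the same steps, read from top to bottom
tidy_result <- toy %>%
  filter(!is.na(bill_length_mm), species == "Adelie") %>%
  summarise(mean_bill_length = mean(bill_length_mm)) %>%
  pull(mean_bill_length) %>%
  round(2)

base_result  # 39.7
tidy_result  # 39.7
```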

What's in the Tidyverse?

The Tidyverse is an opinionated collection of R packages designed for data science. This suite of packages works in harmony because they share common data representations and syntax.

This collection of packages is developed mainly by the team at Posit (formerly RStudio), and it has become one of the most popular ways to use R for data science.

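The main packages include ggplot2 (graphs), dplyr (data manipulation), tidyr (reshaping), readr (data import), purrr (iteration), tibble (modern data frames), stringr (text), and forcats (factors).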

Why you will ❤️ the tidyverse

🤝 Friendly Syntax

Think of the Tidyverse as speaking a more straightforward version of the R language. It uses a consistent set of rules across its tools, so once you learn how to do something in one tool, it's easier to do something similar in another.

🧐 Easy to Read

Tidyverse code is like a well-organized book - it's written to be easy to read. This means when you look back at your code, or if someone else needs to check it, it's much clearer what each part is supposed to do.

🔗 Linking Steps Together

Imagine a production line where each step is clearly connected to the next - that's what the %>% symbol does in Tidyverse. It lets you link different tasks together in a way that's easy to follow, like a recipe.
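As a small illustration (the vector `x` is made up), the piped version below performs the same steps as the nested call, but in reading order. This is only a sketch of the idea, not the only way to write it.

```r
library(magrittr)  # provides %>%; dplyr and other tidyverse packages re-export it

x <- c(1, 4, 9)

# Nested: read from the inside out
nested <- round(sum(sqrt(x)), 1)

# Piped: each result is passed on to the next step
piped <- x %>%
  sqrt() %>%
  sum() %>%
  round(1)

nested  # 6
piped   # 6
```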

🛠️ Handling Data with Ease

Tidyverse has special tools, like dplyr for changing and fixing data, and tidyr for reshaping it. They're like having a Swiss Army knife for data - lots of functions in one place, all designed to make common data tasks simpler.
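Here is a minimal sketch of both packages at work. The `wide` table and its values are hypothetical, chosen only to keep the example short.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per penguin, one column per measurement
wide <- data.frame(
  id             = 1:3,
  bill_length_mm = c(39.1, 46.5, 50.0),
  bill_depth_mm  = c(18.7, 17.9, 15.2)
)

# dplyr: keep some rows and add a computed column
changed <- wide %>%
  filter(bill_length_mm > 40) %>%
  mutate(ratio = bill_length_mm / bill_depth_mm)

# tidyr: reshape from wide to long (one row per penguin and measurement)
long <- wide %>%
  pivot_longer(cols = -id, names_to = "measurement", values_to = "value")

nrow(changed)  # 2
nrow(long)     # 6
```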

◻️ Modern Data Tables

Tidyverse introduces tibbles, which are like the next-generation version of data frames in R. They're smarter and avoid some of the common frustrations you might run into with regular data frames.
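Two of those frustrations can be sketched as follows, using made-up values: partial matching of column names, and a single column silently turning into a plain vector.

```r
library(tibble)

df <- data.frame(bill_length_mm = c(39.1, 46.5))
tb <- tibble(bill_length_mm = c(39.1, 46.5))

# A data frame silently completes a partial column name...
df$bill  # 39.1 46.5

# ...whereas a tibble returns NULL and warns about the unknown column
tb$bill

# Selecting one column with [ keeps a tibble a tibble,
# while a data frame drops down to a plain vector
class(df[, "bill_length_mm"])  # "numeric"
class(tb[, "bill_length_mm"])  # "tbl_df" "tbl" "data.frame"
```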

→ Streamlined Work

The Tidyverse is like a well-coordinated team where each member knows what the others are doing. This makes your journey from starting a data project to finishing it smoother and less complicated.

🔃 Getting Data In and Out

Whether it's from a simple text file, a big spreadsheet, or a statistics program, Tidyverse has tools that make it faster and less of a headache to bring data into R and to export it out again.
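For text files, the readr package does this job. The sketch below writes a small made-up table to a temporary CSV file and reads it back; the data and file path are only for illustration. Spreadsheets and files from statistics programs are handled by companion packages such as readxl and haven.

```r
library(readr)

# Hypothetical data to export
penguins_small <- data.frame(
  species        = c("Adelie", "Gentoo"),
  bill_length_mm = c(39.1, 47.5)
)

# Export to a temporary CSV file
path <- tempfile(fileext = ".csv")
write_csv(penguins_small, path)

# Import it again; column types are guessed from the file contents
imported <- read_csv(path, show_col_types = FALSE)
imported
```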

📈 Making Graphs

ggplot2 allows creating graphs in a way that's a bit like building with blocks - step by step, with a consistent approach.
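A minimal sketch of that block-by-block approach, using made-up values: start with the data and the mapping, then add one layer at a time with `+`.

```r
library(ggplot2)

# Hypothetical data (values are made up)
toy <- data.frame(
  bill_length_mm = c(39.1, 40.3, 46.5, 50.0),
  bill_depth_mm  = c(18.7, 18.0, 17.9, 15.2),
  species        = c("Adelie", "Adelie", "Chinstrap", "Gentoo")
)

p <- ggplot(toy, aes(x = bill_length_mm, y = bill_depth_mm, colour = species)) +  # data and mapping
  geom_point(size = 3) +                                                          # one layer: points
  labs(x = "Bill length (mm)", y = "Bill depth (mm)") +                           # axis labels
  theme_minimal()                                                                 # overall look

# print(p) draws the plot
```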

🔄 Smarter Loops

The purrr package lets you do repetitive tasks without writing loops, which can be tricky for beginners. It's like having a robot that can repeat tasks quickly and without mistakes.
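As a rough comparison, here is the same task written with a `for` loop and with purrr's `map_dbl()`. The `measurements` list is made up for the example.

```r
library(purrr)

# Hypothetical list of measurement vectors
measurements <- list(
  a = c(39.1, 40.3),
  b = c(46.5, 50.0, 48.1)
)

# With a loop: create the output first, then fill it in step by step
means_loop <- numeric(length(measurements))
for (i in seq_along(measurements)) {
  means_loop[i] <- mean(measurements[[i]])
}

# With purrr: apply mean() to each element and get a numeric vector back
means_map <- map_dbl(measurements, mean)

means_map  # a = 39.7, b = 48.2
```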

Dealing with Text

The stringr package gives you a set of easy-to-use tools for when you need to work with text, making tasks like finding and replacing words less of a chore.
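A few typical stringr calls, sketched on a made-up character vector:

```r
library(stringr)

species <- c("adelie penguin", "Gentoo Penguin", "chinstrap")

# Find: which elements contain "penguin", ignoring case?
has_penguin <- str_detect(species, regex("penguin", ignore_case = TRUE))
has_penguin  # TRUE TRUE FALSE

# Replace: fix the capitalisation of one word
fixed <- str_replace(species, "adelie", "Adelie")
fixed        # "Adelie penguin" "Gentoo Penguin" "chinstrap"

# Tidy up: capitalise the first letter of every word
titled <- str_to_title(species)
titled       # "Adelie Penguin" "Gentoo Penguin" "Chinstrap"
```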

☀️ Working with Lists and Vectors

Tidyverse functions are often designed to work with whole columns or lists of data at once, so you don't have to tell R how to handle each individual item.
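For instance, in this sketch (with made-up body masses) `mutate()` converts and labels every row in one go, with no loop over individual values:

```r
library(dplyr)

# Hypothetical data (values are made up)
toy <- data.frame(body_mass_g = c(3750, 5000, 4200))

# Each expression is applied to the whole column at once
toy <- toy %>%
  mutate(
    body_mass_kg = body_mass_g / 1000,
    heavy        = if_else(body_mass_kg > 4, "yes", "no")
  )

toy$body_mass_kg  # 3.75 5.00 4.20
toy$heavy         # "no" "yes" "yes"
```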

👩🏽‍🦰 Helpful Community

The Tidyverse has a big group of users who are always creating new guides, answering questions, and helping each other out. It's like being part of a club where everyone is there to support you.

💪 Keeps Getting Better

The Tidyverse is like an app that's regularly updated with new features. It's always getting improvements and additions, which means it stays up-to-date with what data scientists need.

🚀 Expandable

There are lots of extra 'plugins' or packages that work with the Tidyverse, so you can add on specialized tools as you need them, just like adding apps to your phone.

Quiz time

Let's check your tidyverse knowledge (and expand it a bit 😀)

What is the Tidyverse?

An R package for statistical analysis.

A collection of R packages designed for data science that share common philosophies.

A new programming language based on R.

A graphical user interface for R.

Name three core principles that the Tidyverse packages adhere to.

Consistency, readability, and usability.

Complexity, dependency, and variety.

Speed, automation, and scalability.

Randomness, flexibility, and modularity.

Which package in the Tidyverse is primarily used for data manipulation?

ggplot2

readr

dplyr

forcats

What is the primary function of the ggplot2 package?

Data cleaning and transformation.

Time-series analysis.

Creating descriptive and exploratory data visualizations.

Database management.

Explain the purpose of the pipe operator in the Tidyverse.

It is used to assign values to variables.

It serves as a division operator in complex arithmetic operations.

It is used to exponentiate numbers.

It chains a sequence of functions together in a logical order, passing the result of each step on to the next.

How does the readr package enhance the data import experience in R?

By offering faster and more intuitive functions for reading tabular data.

By providing tools for 3D plotting.

By enabling the execution of Python code within R.

By automating the data cleaning process.

What is the main advantage of using tibble over the traditional data frame in R?

Tibbles allow for the execution of SQL queries directly.

Tibbles print more data to the console than traditional data frames.

Tibbles are more modern, providing a cleaner and more succinct display of data in the console.

Tibbles have built-in plotting capabilities.

Describe a common use case for the purrr package in the Tidyverse.

To create web applications within R.

To apply functions to each element of a list or vector, often replacing the need for loops.

To perform regression analysis.

To connect to relational database systems.

How does the tidyr package assist in transforming data into a tidy format?

By providing functions to merge and sort data frames.

By offering functions to convert wide data into long format and vice versa, making it easier to work with.

By encrypting data to maintain confidentiality.

By generating synthetic data for testing.

What is the significance of string manipulation in R, and which Tidyverse package is designed to handle such tasks?

String manipulation is not significant in R; it is primarily for statistical analysis.

String manipulation is used for database management. RSQLite is designed for this.

String manipulation is crucial for cleaning and preparing text data. The stringr package is designed for this job!

String manipulation is mainly for timestamp data, and the lubridate package is used for this purpose.

Conclusion

This page is not intended to provide a comprehensive tutorial on the tidyverse. If you're looking to master it, I highly recommend reading R for Data Science by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund, a book often considered the bible of tidyverse best practices.

Learning it in-depth would require a significant time investment. However, I hope you now grasp the distinction between tidyverse and base R. I also hope you're convinced of its benefits!

Now, let's enhance our penguin project by taking a step further and translating the original code into tidyverse conventions:

Homework

Open the script called analysis.R that we created in the previous lesson.

Install the dplyr and ggplot2 packages with the install.packages() function.

Use dplyr and ggplot2 functions to perform the data wrangling and dataviz tasks. Use Google or ChatGPT to help you. This is how it's done in real life!

Before I let you go, it is important to note that there is some criticism too! See here and here.
