Styling R Code

01. What is code styling?

‘Styling’ covers…

  • Code/project structure (e.g. 01-modelling.R -> 02-plots.R -> 03-export.R )

  • Code formatting

    • Naming things (variables, columns, functions etc)
    • Indentation/line breaks/spacing
    • Stuff specific to R, e.g. <- vs =
  • Wider principles like…

    • What does a good comment say?
    • What should a good function do?
  • Styling is about making your work easy to understand without changing its function

Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread

-Introduction to the Tidyverse Style Guide

Some specific benefits

  • Styling makes writing code easier (fewer decisions to make)

  • Styling makes reading code easier

  • Styling makes it easier to avoid bugs

Without styling

With styling

My styling journey

  • 2019: Started coding in R

    • Struggled to write clear code
    • Often got frustrated by needing to rewrite stuff
  • 2020-2021: Had some kind of set of conventions specific to myself

    • Would occasionally change how I did something
    • This made me dislike all the code I’d written previously
    • This made me feel sad
  • 2022: Started religiously following the Tidyverse style guide

    • My code needed fewer rewrites
    • I spent almost no time thinking about styling (just design)
    • My code was clearer
    • This made me feel happy
  • 2023: Life is great

02. Some specific styling tips

Naming things

  • Keep names as short as you can while still being descriptive. Prioritise being descriptive!
  • Only use abbreviations in special cases, e.g. acronyms
  • Don’t use the name to signal the type of the object
Bad
table_totalcost <- costs %>% 
  group_by(Category) %>% 
  summarise(Cost = sum(Cost))

model_for_use_later_on <- lm(Cost ~ Time, data = costs)

read_data_func <- function(path) {
  readr::read_csv(path, id = "filepath", na = "N/A")
}
Good
cost_totals <- costs %>% 
  group_by(Category) %>% 
  summarise(Cost = sum(Cost))

cost_model <- lm(Cost ~ Time, data = costs)

read_data <- function(path) {
  readr::read_csv(path, id = "filepath", na = "N/A")
}
  • Names for things like dataframes, vectors, values etc should be noun-like, e.g. costs, costs_summary, costs_uplift_factor etc

  • Names for functions should be verb-like, e.g. filter(), standardise_names() , extract_coefficients() etc

Name case

# snake_case
iris_summary <- summary(iris)

# Title_Snake_Case
Iris_Summary <- summary(iris)

# camelCase
irisSummary <- summary(iris)

# PascalCase
IrisSummary <- summary(iris)

# SCREAMING_SNAKE_CASE
IRIS_SUMMARY <- summary(iris)
  • Consistency should be prioritised above all else, but…
  • lower_snake_case should be preferred in most cases
  • Title_Snake_Case works well for column names
  • You might see camelCase in other packages, but you shouldn’t use it unless you’re doing serious object-oriented programming

Syntactic names

R has rules for names:

  • They may only contain letters, numbers, _ and ., and must start with a letter or a . (not followed by a number)

  • Other names must be surrounded by backticks:

# Good ('syntactic')
iris_proportions <- mutate(iris, across(1:4, ~ . / sum(.)))

# Bad: starts with a number
`01_iris_proportions` <- mutate(iris, across(1:4, ~ . / sum(.)))

# Bad: contains a non-alphanumeric character
`iris_%s` <- mutate(iris, across(1:4, ~ . / sum(.)))

# Bad: contains a space
`iris proportions` <- mutate(iris, across(1:4, ~ . / sum(.)))
  • When your data has non-syntactic column names, clean these up ASAP!

  • While names like my.data are allowed, avoid this naming style. Use my_data instead.
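
One way to check whether a name is syntactic is to compare it with what base R's make.names() would turn it into. This is only a minimal sketch: is_syntactic() is a hypothetical helper, not a function from any package mentioned here.

```r
# Hypothetical helper: a name is syntactic if make.names() leaves it unchanged
is_syntactic <- function(names) {
  make.names(names) == names
}

candidate_names <- c(
  "iris_proportions",
  "01_iris_proportions",
  "iris_%s",
  "iris proportions"
)

is_syntactic(candidate_names)
#> [1]  TRUE FALSE FALSE FALSE
```

Applied to colnames() of a freshly imported dataset, a check like this shows which columns need cleaning.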

Indentation

  • Question: Why indent?
  • Answer: Indentation shows code structure at a glance
  • Whenever you increase indentation, do it by exactly 2 spaces
  • This usually means that all function arguments should have their own line
Bad
 iris %>% 
 dplyr::mutate(Sepal.Area = Sepal.Width * Sepal.Length,
   Petal.Area = Petal.Width * Petal.Length) %>% 
  ggplot2::ggplot(aes(x = Sepal.Area, 
y = Petal.Area, colour = Species)) +
  ggplot2::geom_point()


Better
iris %>% 
  dplyr::mutate(Sepal.Area = Sepal.Width * Sepal.Length,
                Petal.Area = Petal.Width * Petal.Length) %>% 
  ggplot2::ggplot(aes(x = Sepal.Area, 
                      y = Petal.Area, 
                      colour = Species)) +
  ggplot2::geom_point()


Best
iris %>% 
  dplyr::mutate(
    Sepal.Area = Sepal.Width * Sepal.Length,
    Petal.Area = Petal.Width * Petal.Length
  ) %>% 
  ggplot2::ggplot(aes(
    x = Sepal.Area, 
    y = Petal.Area, 
    colour = Species
  )) +
  ggplot2::geom_point()

Comments: what should they say?

  • Question: How much should you comment?
  • Answer: As much as needed, but no more

If a comment is needed, it should explain the why, not the what/how (if what your code does isn’t clear, you should probably rewrite it).

Bad
plot_data <- mtcars %>% 
  rownames_to_column("car") %>% 
  as_tibble() %>% 
  
  # Reorder car levels by values of mpg
  mutate(
    car = fct_reorder(car, mpg)
  )


Good
plot_data <- mtcars %>% 
  rownames_to_column("car") %>% 
  as_tibble() %>% 
  
  # Order cars by efficiency (mpg) for plotting later
  mutate(
    car = fct_reorder(car, mpg)
  )

Comments: maximising clarity

plot_data <- mtcars %>% 
  
  # 1. Create a column for the car name
  rownames_to_column("car") %>% 
  
  # 2. Apply tibble format for nicer printing
  as_tibble() %>% 
  
  # 3. Order cars by efficiency (mpg) for plotting later
  mutate(car = fct_reorder(car, mpg))

Number your comments if it makes sense


# 1. Create a column for the car name
# 2. Apply tibble format for nicer printing
# 3. Order cars by efficiency (mpg) for plotting later
plot_data <- mtcars %>% 
  rownames_to_column("car") %>% 
  as_tibble() %>% 
  mutate(car = fct_reorder(car, mpg))

Prefer infrequent, detailed comments over frequent ones which are overly terse


# ~~ Prepare data for plotting ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# 1. Create a column for the car name
# 2. Apply tibble format for nicer printing
# 3. Order cars by efficiency (mpg) for plotting later
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
plot_data <- mtcars %>% 
  rownames_to_column("car") %>% 
  as_tibble() %>% 
  mutate(car = fct_reorder(car, mpg))

Fencing comments suggests a new ‘section’. This can help guide the reader to the most important information.

Comments: miscellaneous tips

  • Give each comment its own line unless there’s a really good reason not to

  • Don’t needlessly abbreviate things - use full sentences

  • Use the imperative mood for short comments:

    • Good (imperative mood):

      # Remove rows where Cost is NA

    • Bad (not imperative):

      # Removing rows where Cost is NA

  • If your code is more than 50% comments, consider switching to Quarto/R Markdown

Packages

  • Prefer packages which are widely used by other colleagues. Make sure you trust the packages you’re using!

  • Learn a bit about a package before using it. If it’s not well documented or maintained, find another approach.

    • This especially applies to packages used in answers on Stack Overflow

    • Tip: packages which have websites linked from GitHub are usually good!

  • Read a function’s documentation using help(pkg::fun) or ?pkg::fun. If a function is superseded or deprecated, use the recommended new approach.

library(dplyr, warn.conflicts = FALSE)

iris %>% 
  head(5) %>% 
  select_("Species", "Sepal.Width")
#>  Warning: `select_()` was deprecated in dplyr 0.7.0.
#>  ℹ Please use `select()` instead.
#>    Species Sepal.Width
#>  1  setosa         3.5
#>  2  setosa         3.0
#>  3  setosa         3.2
#>  4  setosa         3.1
#>  5  setosa         3.6
library(dplyr, warn.conflicts = FALSE)

iris %>% 
  head(5) %>% 
  select(all_of(c("Species", "Sepal.Width")))
#>    Species Sepal.Width
#>  1  setosa         3.5
#>  2  setosa         3.0
#>  3  setosa         3.2
#>  4  setosa         3.1
#>  5  setosa         3.6

Miscellaneous tips

  • Always use <- for assignment, not = or ->

  • You should (almost) never use <<- - there’s (almost) always a better approach

  • Space stuff out! E.g. 1 / (a + b + c) is better than 1/(a+b+c)

  • Only use return() for early returns; don’t put it at the end of every function

  • Write TRUE and FALSE, not T and F

  • Don’t comment out old code - delete it

  • Rewrite your code! Code you write once and never change isn’t likely to be very clear.
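
Two of these tips are easy to demonstrate. In the short sketch below, T turns out to be an ordinary variable that can be overwritten (TRUE is a reserved word and cannot), and return() is used only for the early exit. safe_mean() is a hypothetical function written for illustration.

```r
# T is just a variable bound to TRUE by default, so it can be reassigned
T <- 0
isTRUE(T)
#> [1] FALSE

# Remove the override so later code sees the usual value again
rm(T)

# Hypothetical example: return() for the early exit, none at the end
safe_mean <- function(x) {
  if (length(x) == 0) {
    return(NA_real_)
  }

  mean(x, na.rm = TRUE)
}

safe_mean(numeric(0))
#> [1] NA
safe_mean(c(1, 2, NA))
#> [1] 1.5
```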

03. Code Design

What is design?

  • Design is about making your code consistent, composable and reusable

  • Styling can be boiled down to a set of rules - design is more of an art

  • In R, good code design is mostly about writing good functions

Example: dplyr::select() is a masterclass in design:

all_cols   <- colnames(iris)
sepal_cols <- all_cols[startsWith(all_cols, "Sepal")]
iris_small <- iris[c("Species", sepal_cols)]

head(iris_small, 5)
#>    Species Sepal.Length Sepal.Width
#>  1  setosa          5.1         3.5
#>  2  setosa          4.9         3.0
#>  3  setosa          4.7         3.2
#>  4  setosa          4.6         3.1
#>  5  setosa          5.0         3.6
library(dplyr)

iris %>% 
  select(Species, starts_with("Sepal")) %>% 
  head(5)
#>    Species Sepal.Length Sepal.Width
#>  1  setosa          5.1         3.5
#>  2  setosa          4.9         3.0
#>  3  setosa          4.7         3.2
#>  4  setosa          4.6         3.1
#>  5  setosa          5.0         3.6

Both code chunks take the iris dataframe and select the Species column plus all columns which begin with "Sepal". Which is clearer?

Functions

  • Repeating code is bad - defining functions is the answer

  • This takes practice but makes code much easier to read and maintain

Bad
# Rescale a, b, c, and d to be between 0 and 1
df %>% 
  mutate(
    a = (a - min(a, na.rm = TRUE)) / 
      (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
    b = (b - min(b, na.rm = TRUE)) / 
      # Copy-paste bug: min(a) below should be min(b)
      (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
    c = (c - min(c, na.rm = TRUE)) / 
      (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
    d = (d - min(d, na.rm = TRUE)) / 
      (max(d, na.rm = TRUE) - min(d, na.rm = TRUE))
  )
Better
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

df %>% 
  mutate(
    a = rescale01(a),
    b = rescale01(b),
    c = rescale01(c),
    d = rescale01(d)
  )


Best
# across() applies rescale01() to columns a to d
# This finally eliminates all code repetition!
df %>% 
  mutate(across(c(a, b, c, d), rescale01))
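
To see what rescale01() does to a single vector, here is a small worked example with made-up values. The minimum maps to 0, the maximum maps to 1, and missing values stay missing.

```r
rescale01 <- function(x) {
  # Range of the finite, non-missing values: c(minimum, maximum)
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(10, 15, NA, 20))
#> [1] 0.0 0.5  NA 1.0
```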

Reproducibility

  • Code-based approaches work best when reproducibility is a focus

  • So, strive to clearly delimit each stage of your pipeline, and make prerequisites obvious:

    1. Data import (sources and access requirements should be obvious)

    2. Data cleaning (successful 1. is prerequisite)

    3. Modelling/analysis (successful 1. and 2. are prerequisite)

    4. Outputs (successful 1., 2. and 3. are all prerequisite)

  • Periodically restart your R session (Shift + Ctrl + F10) and rerun your code to make sure all stages still run together successfully
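
One way to make the rerun check routine is a small driver that sources each stage in order. This is a sketch only: run_pipeline() is a hypothetical helper, and it assumes the stage scripts sit in one directory and carry numeric prefixes such as 01-, 02-, 03-.

```r
# Hypothetical helper: run numbered stage scripts in order. Objects the
# stages create go into a new environment, not the global workspace.
run_pipeline <- function(script_dir = "R") {
  scripts <- sort(list.files(
    script_dir,
    pattern = "^[0-9]+-.*\\.R$",
    full.names = TRUE
  ))

  pipeline_env <- new.env(parent = globalenv())

  for (script in scripts) {
    message("Running ", basename(script))
    # An error in any stage stops the run, so later stages never see
    # the output of a failed prerequisite
    source(script, local = pipeline_env)
  }

  invisible(pipeline_env)
}
```

Calling run_pipeline() from the project root after a restart is then a one-line check that all stages still run together.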

Session reloading from .RData

  • By default, RStudio will save and reload your R session from a .RData file. This discourages a reproducible workflow, so disable this feature! (RStudio -> Tools -> Global Options -> General)

  • Generally avoid saving R objects with save() and saveRDS(). It’s better to put a bit more work in to export to CSV, Excel or SQL.
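
The extra work for CSV is mostly restating column types on import, since a CSV file stores only text. A minimal base R sketch with made-up data, writing to a temporary file:

```r
costs <- data.frame(
  Category = factor(c("Staff", "Premises")),
  Cost     = c(1200.5, 300),
  Date     = as.Date(c("2022-04-01", "2022-05-01"))
)

csv_path <- tempfile(fileext = ".csv")
write.csv(costs, csv_path, row.names = FALSE)

# CSV holds plain text, so declare the column types again when reading
costs_reloaded <- read.csv(
  csv_path,
  colClasses = c(Category = "factor", Cost = "numeric", Date = "Date")
)

str(costs_reloaded)
```

Unlike an .rds file, the exported CSV can be opened and checked outside R.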

Projects and filepaths

  • Create a new RStudio project for each new piece of work

  • A project is defined by the presence of a file with the .Rproj extension. This tells RStudio that everything in the folder is part of the project.

    • Never use setwd() as this can result in code which other people can’t run. Instead, use relative filepaths.
  • Use a sub-directory R/ for your production R scripts

  • Write a README.md file for each project explaining its purpose and steps to get started using it

  • For long-term projects, or ones which are on hold, use {renv} to keep track of dependencies

fs::dir_tree("example-proj")
example-proj
├── example-proj.Rproj
├── in-development
│   ├── pupil-counts.sql
│   └── testing.R
├── R
│   ├── 01-import.R
│   ├── 02-tidy.R
│   ├── 03-model.R
│   └── 04-export.R
├── README.md
├── reports
│   └── analysis-report.qmd
└── SQL
    └── read-starts.sql
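
With a project laid out like this, every path can be written relative to the project root. A small sketch using base R's file.path(); the outputs folder and its file name are hypothetical.

```r
# Avoid: an absolute path only works on one machine
# setwd("C:/Users/me/Documents/example-proj")

# Prefer: build paths relative to the project root
sql_path    <- file.path("SQL", "read-starts.sql")
output_path <- file.path("outputs", "cost-totals.csv")

sql_path
#> [1] "SQL/read-starts.sql"
```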

04. Some useful tools

janitor::clean_names()

  • clean_names() is a very quick and easy way to make syntactic column names. Use it!

  • Getting Title_Snake_Case is possible, but not obvious from the documentation

  • Renaming is lossy, so check the result

messy_df <- data.frame(
  pupilNumber      = 1:2, 
  `cost 2022`      = 1234, 
  `% totalFunding` = c(0.4, 0.6),
  `<50% funding`   = c(TRUE, FALSE), 
  check.names = FALSE
)

messy_df
#>    pupilNumber cost 2022 % totalFunding <50% funding
#>  1           1      1234            0.4         TRUE
#>  2           2      1234            0.6        FALSE

messy_df %>% 
  janitor::clean_names()
#>    pupil_number cost_2022 percent_total_funding x50_percent_funding
#>  1            1      1234                   0.4                TRUE
#>  2            2      1234                   0.6               FALSE

messy_df %>% 
  janitor::clean_names(case = "title", sep_out = "_")
#>    Pupil_Number Cost_2022 Percent_Total_Funding X50_Percent_Funding
#>  1            1      1234                   0.4                TRUE
#>  2            2      1234                   0.6               FALSE

styler::style_file()

{styler} is a powerful tool to use sparingly. Some reasonable use-cases:

  • Re-style a project you inherit

  • Re-style your own old projects after seeing this presentation

messy-code.R
#load packages
library(tidyverse);library(lubridate)

data_raw=read_csv(  "some_file.csv"  )

data_clean<-data_raw %>%
  mutate(Amount=Amount/sum(Amount),
    #Combine date parts into single column
    Date=make_date(Year,   Month,Day))%>%
    filter(
  # other years aren't relevant to analysis
          year(Date)==2020,
          Amount> 0.1
    )

ggplot(data_clean,aes(Date,Amount))+geom_line()
styler::style_file("messy-code.R")
# load packages
library(tidyverse)
library(lubridate)

data_raw <- read_csv("some_file.csv")

data_clean <- data_raw %>%
  mutate(
    Amount = Amount / sum(Amount),
    # Combine date parts into single column
    Date = make_date(Year, Month, Day)
  ) %>%
  filter(
    # other years aren't relevant to analysis
    year(Date) == 2020,
    Amount > 0.1
  )

ggplot(data_clean, aes(Date, Amount)) + geom_line()
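
Before restyling whole files, it can help to preview what {styler} would change. A minimal sketch using styler::style_text(), which styles a string and returns the result without touching any file (this assumes {styler} is installed):

```r
messy_line <- 'data_raw=read_csv(  "some_file.csv"  )'

# Style the code held in a string; nothing is written to disk
styled <- styler::style_text(messy_line)
styled
#> data_raw <- read_csv("some_file.csv")
```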

lintr::lint()

{lintr} is similar to {styler}, but it tells you about issues instead of fixing them.

  • Good for maintaining already well-styled code, not so good for restyling old code

  • Highly customisable, e.g. if you want to relax/not apply some rules

  • RStudio’s UI lets you click through to address individual lints

messy-code.R
#load packages
library(tidyverse);library(lubridate)

data_raw=read_csv(  "some_file.csv"  )

data_clean<-data_raw %>%
  mutate(Amount=Amount/sum(Amount),
    #Combine date parts into single column
    Date=make_date(Year,   Month,Day))%>%
    filter(
  # other years aren't relevant to analysis
          year(Date)==2020,
          Amount> 0.1
    )

ggplot(data_clean,aes(Date,Amount))+geom_line()
lintr::lint("messy-code.R")
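
The customisation mentioned above can be sketched as follows. This assumes {lintr} version 3.0 or later is installed; the 100-character limit is an arbitrary choice for illustration.

```r
library(lintr)

# Start from the default linters, but allow lines up to 100 characters
relaxed_linters <- linters_with_defaults(
  line_length_linter = line_length_linter(100)
)

# Lint a string directly instead of a file
lints <- lint(
  text = 'data_raw=read_csv(  "some_file.csv"  )',
  linters = relaxed_linters
)

lints
```

Each reported lint names the rule it comes from, which makes it easier to decide which rules to relax.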

Built-in tools in RStudio

Shortcut                 Action
Ctrl + I                 Correct indentation
Alt + Ctrl + Shift + M   ‘Rename in scope’, e.g. change myVar to my_var
Ctrl + Shift + F         Find (and replace) throughout multiple files
Ctrl + Shift + /         Wrap long comments over multiple lines
Alt + -                  Insert <- with the correct spacing
Ctrl + Shift + M         Insert %>% with the correct spacing
Alt + Ctrl + Shift + R   Inserts template function documentation
Alt + click/drag         Activate multiline cursor

Note: You can use Tools -> Keyboard Shortcuts Help for a full list of shortcuts

05. Resources
