A brief tour of tabbycat
A couple of months ago I was asked to do some diagnostic work on a set of regression models to help explain some of the results. The models were drawn from a dataset with a large number of categorical variables, and to pick apart the relationships in the data I found myself writing variations on the same data transformations over and over again.
I realised that some of these could easily be abstracted into functions, which I decided to organise into a library. A few weeks later this became tabbycat, an R package for tabulating and summarising categorical variables. Last week, tabbycat was published on CRAN.
What follows is a brief summary of tabbycat's main features. Full details of the API can be found in the GitHub readme, but it's quite a small package really. Its main benefits are that it lets you work quickly when exploring data interactively, and saves you a lot of repetitive typing. Everything you need to know to start using the package is shown below.
1. Setting up
To show how the package works, I'm going to use the House of Commons Library's voting summary dataset from the 2019 General Election. This file contains one row per Parliamentary constituency, and shows data about the election results in each seat. If you want to try out the examples, save this file locally so that it's available to R.
You will need to install tabbycat.
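Since the package is published on CRAN, the standard installer should be all that is needed. A minimal sketch, assuming a working R installation with access to a CRAN mirror:

```r
# Install the released version of tabbycat from CRAN (only needed once).
install.packages("tabbycat")
```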
Then load the libraries and the data we will use.
library(tidyverse)
library(tabbycat)

vs <- read_csv("voting-summary.csv")
The simplest function in the package is cat_count, which shows the frequency of discrete values in a categorical variable. Give the function a dataframe and the name of the column that contains the variable to count. In the example below we get the number and percentage of seats won by each political party at the 2019 General Election.
cat_count(vs, "first_party")

# A tibble: 11 × 3
   first_party number percent
   <chr>        <int>   <dbl>
 1 Con            365 0.562
 2 Lab            202 0.311
 3 SNP             48 0.0738
 4 LD              11 0.0169
 5 DUP              8 0.0123
 6 SF               7 0.0108
 7 PC               4 0.00615
 8 SDLP             2 0.00308
 9 Alliance         1 0.00154
10 Green            1 0.00154
11 Spk              1 0.00154
cat_count operates on dataframes and, like every function in the package that does, it takes the dataframe as its first argument, so you can use it in tidyverse pipelines. An alternative function called cat_vcount produces the same output for vector inputs, so passing it the first_party column as a vector gives the same result as the previous example.
cat_vcount can handle a wider range of inputs than cat_count, but it doesn't fit as easily into tidyverse pipelines.
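As a sketch of the two calling styles, assuming vs has been loaded as in the setup step and that cat_vcount takes the vector as its first argument. The filter on region_name is only there to illustrate a pipeline; it is not part of tabbycat.

```r
library(tidyverse)
library(tabbycat)

# Assumes `vs` was read from voting-summary.csv as in the setup step.

# Vector form: pass the column itself rather than a dataframe and a column name.
by_vector <- cat_vcount(vs$first_party)
by_vector

# Dataframe form in a pipeline: the piped dataframe becomes the first argument.
london <- vs |>
  filter(region_name == "London") |>
  cat_count("first_party")
london
```

The vector form is handy when the values do not live in a dataframe at all, while the dataframe form slots in after steps such as filter or mutate.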
3. Comparing and contrasting
The cat_compare function shows the distribution of one categorical variable within the groups of another. Call the function with a dataframe, the name of the variable to distribute down the rows, and the name of the variable to distribute along the columns. The percentages are calculated columnwise, like in cat_count.
cat_compare(vs, "first_party", "constituency_type")

# A tibble: 11 × 5
   first_party n_borough n_county p_borough p_county
   <chr>           <dbl>    <dbl>     <dbl>    <dbl>
 1 Alliance            0        1   0        0.00272
 2 Con                97      268   0.344    0.728
 3 DUP                 1        7   0.00355  0.0190
 4 Green               1        0   0.00355  0
 5 Lab               158       44   0.560    0.120
 6 LD                  5        6   0.0177   0.0163
 7 PC                  0        4   0        0.0109
 8 SDLP                1        1   0.00355  0.00272
 9 SF                  2        5   0.00709  0.0136
10 SNP                17       31   0.0603   0.0842
11 Spk                 0        1   0        0.00272
In the above example, we break down the seats that each party won by their constituency type. There are two constituency types: borough and county. Without getting into a long digression, borough constituencies are mainly urban, while county constituencies are mainly rural. The Conservatives won 73% of the county seats, but only 34% of the borough seats.
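To see what columnwise means in practice, the proportions can be recomputed by hand from the counts printed above. This is plain base R arithmetic on those numbers, not a call into tabbycat:

```r
# Seat counts by winning party, copied from the cat_compare output above.
n_borough <- c(Alliance = 0, Con = 97, DUP = 1, Green = 1, Lab = 158, LD = 5,
               PC = 0, SDLP = 1, SF = 2, SNP = 17, Spk = 0)
n_county <- c(Alliance = 1, Con = 268, DUP = 7, Green = 0, Lab = 44, LD = 6,
              PC = 4, SDLP = 1, SF = 5, SNP = 31, Spk = 1)

# Columnwise: each count is divided by the total of its own column,
# so every column of proportions sums to 1.
p_borough <- n_borough / sum(n_borough)
p_county <- n_county / sum(n_county)

round(p_borough[["Con"]], 3)  # 0.344
round(p_county[["Con"]], 3)   # 0.728
```

The column totals are 282 borough seats and 368 county seats, so the Conservative shares are 97/282 and 268/368 respectively.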
The cat_contrast function is similar to cat_compare, but it splits the variable that is distributed along the columns into two exclusive groups. This allows you to compare the distribution in one group against the rest of the dataset. You call the function in the same way as cat_compare, but with an additional argument, which indicates the group you want to contrast with the rest of the dataset.
cat_contrast(vs, "mp_gender", "region_name", "London")

# A tibble: 2 × 5
  mp_gender n_london n_other p_london p_other
  <chr>        <dbl>   <dbl>    <dbl>   <dbl>
1 Male            37     393    0.507   0.681
2 Female          36     184    0.493   0.319
In this example, we break down the seats won by men and women into those in London and those elsewhere in the UK. Almost half the constituencies in London returned women MPs, while less than a third of the seats in the rest of the country returned women MPs.
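The idea of splitting one variable into two exclusive groups can be sketched with plain dplyr. This is only an illustration of the concept on a made-up toy dataframe, not tabbycat's own implementation, and it returns the result in long rather than wide form:

```r
library(dplyr)

# Toy data standing in for the voting summary (values are made up).
toy <- tibble(
  mp_gender = c("Female", "Male", "Female", "Male", "Male", "Female"),
  region_name = c("London", "London", "Wales", "Scotland", "Wales", "London")
)

# Collapse region_name into two exclusive groups, count gender within each,
# then convert the counts to proportions within each group.
contrast <- toy |>
  mutate(group = if_else(region_name == "London", "london", "other")) |>
  count(mp_gender, group) |>
  group_by(group) |>
  mutate(p = n / sum(n)) |>
  ungroup()

contrast
```

Every row falls into exactly one of the two groups, which is what makes the comparison between the chosen group and the rest of the dataset meaningful.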
The last function is cat_summarise (or cat_summarize), which calculates summary statistics for a numerical variable for each group in a categorical variable. Before we use this function, let's create an interesting numerical variable to summarise.
Dividing the valid votes by the electorate gives the turnout in each constituency. Below we create a turnout variable in a tidyverse pipeline to show how these functions fit into tidy workflows. At the end of the pipeline we use cat_summarise to get the distribution of turnout by constituency type.
vs |>
  mutate(turnout = valid_votes / electorate) |>
  cat_summarise("constituency_type", "turnout")

# A tibble: 2 × 10
  constituency_type     n    na  mean     sd   min    lq   med    uq   max
  <chr>             <int> <int> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Borough             282     0 0.649 0.0569 0.493 0.609 0.654 0.686 0.787
2 County              368     0 0.689 0.0513 0.510 0.657 0.696 0.725 0.803
The median constituency turnout was around four percentage points higher in county constituencies (70%) than in borough constituencies (65%).
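For a sense of the repetitive typing this saves, here is a rough dplyr equivalent of a grouped summary with the same column names. It is a sketch on made-up toy data: the quantile method and the way missing values are counted are assumptions, and may differ from what cat_summarise actually does.

```r
library(dplyr)

# Toy data standing in for the voting summary (values are made up).
toy <- tibble(
  constituency_type = c("Borough", "Borough", "Borough", "County", "County", "County"),
  turnout = c(0.60, 0.65, 0.70, 0.66, 0.70, 0.74)
)

# One row per group, with a count, a missing-value count, and summary statistics.
summary_tbl <- toy |>
  group_by(constituency_type) |>
  summarise(
    n = n(),
    na = sum(is.na(turnout)),
    mean = mean(turnout, na.rm = TRUE),
    sd = sd(turnout, na.rm = TRUE),
    min = min(turnout, na.rm = TRUE),
    lq = quantile(turnout, 0.25, na.rm = TRUE, names = FALSE),
    med = median(turnout, na.rm = TRUE),
    uq = quantile(turnout, 0.75, na.rm = TRUE, names = FALSE),
    max = max(turnout, na.rm = TRUE)
  )

summary_tbl
```

Writing this out for every variable of interest is exactly the kind of boilerplate a single function call replaces.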
5. Further reading
The five functions shown here comprise tabbycat's core API, but I've skipped over some of the details. You may be interested in how the different functions handle NAs in the data. There are also some useful arguments that control the structure of the output they return. If you want to find out more, all the details can be found in the documentation.