Recoding
Split-Apply-Combine operations
Long and Wide data
Joins
Quick-and-dirty tables and plots
We’ll make extensive use of `tidyverse` packages for these sections.
Some genre conventions of tidy code
Most functions take a `.data` argument, which means they work well with the `|>` pipe, and you’ll rarely see `$` notation (see the sketch after this list).
A preference for tidy data, where each row represents a single case or observation, each column is a variable, and each cell contains a single value.
Some stylistic regularities, like related functions sharing a `prefix_suffix` naming style, e.g. `str_split`, `str_replace`, and `str_extract`.
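For instance, here’s the same row filter written with base R `$` notation and in tidyverse style with the pipe (a minimal sketch; the toy data frame `dat` is my own):

```r
library(dplyr)

dat <- data.frame(id = c("a", "b", "c"), value = c(3, 5, 7))

# base R: $ notation and bracket indexing
dat[dat$value > 4, ]

# tidyverse style: pipe the data into the function's .data argument
dat |> filter(value > 4)
```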
I’m interested in tracking the number of civilians killed by country in the first 6 months of 2025.
I’ve got a collection of “candidate events” from the UCDP. Here’s an example of the formatting:
My raw data is measured at the event level, but I want to aggregate it by country.
country | conflict_name | date_start | source_headline | deaths_civilians |
---|---|---|---|---|
Mexico | Jalisco Cartel New Generation - Civilians | 2025-04-24 | Condena Segob ejecución | 2 |
Ukraine | Russia - Ukraine | 2025-05-22 | In Aleshki, a civilian died during shelling by the Ukrainian Armed Forces | 1 |
Ukraine | Russia - Ukraine | 2025-02-21 | One person died and four were injured in shelling in the Kherson region within a day | 1 |
Mozambique | IS - Civilians | 2025-05-11 | Cabo Ligado Update: 5 — 18 May 2025;IS-linked group abducting children in Mozambique: HRW | 3 |
Burundi | XXX516 - Civilians | 2025-03-12 | Bulletin hebdomadaire Iteka n’Ijambo n°465 | 1 |
1. Split data into meaningful groups
2. Apply some statistical calculation for each group (average, range, variance, etc.)
3. Combine the results into a new summary table
While these sorts of operations could be performed in a loop, or using base functions like `aggregate`, we’ll use the `dplyr` package here.
`group_by(.data, ...)`: turns `.data` into a grouped table where each group is defined by the variables listed in `...`
```r
library(dplyr)

df <- data.frame(
  group = c("A", "A", "A", "A",
            "B", "B", "B", "B",
            "C", "C", "C"),
  value = c(3, 3, 5, 7, 2, 1, 1, 9, 2, 1, 1)
)

df_grouped <- df |>
  group_by(group)

# data looks the same after this step:
df_grouped
```
```
# A tibble: 11 × 2
# Groups:   group [3]
   group value
   <chr> <dbl>
 1 A         3
 2 A         3
 3 A         5
 4 A         7
 5 B         2
 6 B         1
 7 B         1
 8 B         9
 9 C         2
10 C         1
11 C         1
```
`summarize(.data, ...)`: applies one or more functions defined by `...` to each group in `.data` and then combines the results into a single data frame.
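For instance, a minimal sketch using `df_grouped` from the last example (expected output shown as comments):

```r
df_grouped |>
  summarize(mean_value = mean(value),
            n_obs = n())

# A tibble: 3 × 3
#   group mean_value n_obs
# 1 A           4.5      4
# 2 B           3.25     4
# 3 C           1.33     3
```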
An alternative to `summarize` is `mutate`. `mutate` works almost the same way, but the final result is then joined back to the original data.
`mutate` will be most useful when we need to do something like mean-centering or calculating a running sum for each group in a data set. (We’ll also sometimes use `mutate` without `group_by` to just add a new column to a data frame.)
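For instance, a minimal sketch of group-wise mean-centering with the same toy data:

```r
df |>
  group_by(group) |>
  mutate(centered = value - mean(value)) |>  # subtracts each group's own mean
  ungroup()
# all 11 original rows are kept, with `centered` joined on as a new column
```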
These functions also respect the grouping context, so be careful about how you use them after calling `group_by`, because they can give very different results for a grouped data frame (see the sketch after this list):

- `select`: selects certain columns from a tbl (with a more flexible syntax compared to base R)
- `filter`: removes rows based on a logical condition
- `slice_head`/`slice_tail`/`slice_sample`: take the top rows, bottom rows, or a random sample of rows
- `arrange`: sorts by one or more values
- `count`: counts the number of rows in the grouping context
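Since `df` is still handy, here’s a quick sketch of how grouping changes what `filter` does (the same idea applies to the `slice_*` functions and `count`):

```r
# ungrouped: filter() compares against the overall maximum, returning one row (value 9)
df |>
  filter(value == max(value))

# grouped: filter() compares within each group, returning one row per group (7, 9, and 2)
df |>
  group_by(group) |>
  filter(value == max(value))
```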
Here’s an example of using some other `dplyr` functions. The pipeline below:

- removes cases with 0 deaths,
- keeps the worst single incident for each conflict,
- selects some specific columns,
- sorts by civilian deaths in decreasing order,
- and ungroups the result.
```r
library(readr)

url <- 'https://ucdp.uu.se/downloads/candidateged/GEDEvent_v25_0_7.csv'
ged <- read_csv(url)

ged |>
  filter(deaths_civilians > 0) |>                        # remove cases with 0 deaths
  group_by(conflict_name) |>
  filter(deaths_civilians == max(deaths_civilians)) |>   # worst single incident per conflict
  select(date_start, conflict_name, deaths_civilians) |> # keep specific columns
  arrange(-deaths_civilians) |>                          # sort by civilian deaths, decreasing
  ungroup()
```
```
# A tibble: 143 × 3
   date_start          conflict_name                     deaths_civilians
   <dttm>              <chr>                                        <dbl>
 1 2025-07-12 00:00:00 RSF - Civilians                                402
 2 2025-07-13 00:00:00 Syria: Government                              321
 3 2025-07-09 00:00:00 AFC - Civilians                                169
 4 2025-07-13 00:00:00 Government of Syria - Civilians                 89
 5 2025-07-28 00:00:00 IS - Civilians                                  45
 6 2025-07-21 00:00:00 Forces of Dan Saadiya - Civilians               35
 7 2025-07-31 00:00:00 Russia - Ukraine                                31
 8 2025-07-10 00:00:00 Myanmar (Burma): Government                     30
 9 2025-07-13 00:00:00 Sudan: Government                               27
10 2025-07-29 00:00:00 Israel: Palestine                               26
# ℹ 133 more rows
```
The UCDP distinguishes between state-based, non-state, and one-sided forms of violence. I’m interested in tracking total deaths by country and violence type.
We can think of a single “record” of data as looking something like this:
How should I arrange my data for analysis?
This is another common complication: how do we represent complex data in a 2-dimensional matrix?
A classic example would be panel data, where multiple cases are tracked across multiple years:
- Country-year data on economic growth
- Tracking student GPAs across terms
- Public opinion surveys for the same set of respondents during an entire election
Putting the records in long format would mean putting each case in a row and using repetition to identify related cases:
Putting records in wide format means using additional columns for each group (either one column per country or per violence type, in the current example)
Long format may be inefficient for storage and transfer because information is repeated.
But it’s far more flexible: adding combatant death tolls is as simple as adding a new column.
Each row is a discrete observation as opposed to a collection of observations.
In general, R functions (especially code related to the tidyverse) will work better with long data.
`pivot_wider(.data, names_from, values_from)`: makes data wider. `names_from` identifies the new column names; `values_from` identifies the values in each cell.

`pivot_longer(.data, cols)`: makes data longer. Data in `cols` will form new rows.
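A minimal toy sketch of both functions round-tripping (the data frame and column names here are my own, echoing the GPA example above):

```r
library(tidyr)

scores <- data.frame(student = c("a", "a", "b", "b"),
                     term    = c("fall", "spring", "fall", "spring"),
                     gpa     = c(3.1, 3.4, 2.8, 3.0))

# long -> wide: one column per term
wide <- scores |>
  pivot_wider(names_from = term, values_from = gpa)

# wide -> long: the term columns form new rows again
wide |>
  pivot_longer(cols = c(fall, spring),
               names_to = 'term', values_to = 'gpa')
```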
How should I arrange my data for analysis?
```r
url <- 'https://ucdp.uu.se/downloads/candidateged/GEDEvent_v25_0_7.csv'
ged <- read_csv(url)

ged_by_type <- ged |>
  mutate(
    type_of_violence = factor(
      type_of_violence,
      levels = c(1, 2, 3),
      labels = c("State-based", "Non-state", "One-sided")
    )
  ) |>
  group_by(country, type_of_violence) |>
  summarise(est_deaths = sum(best))

ged_by_type
```
```
# A tibble: 99 × 3
# Groups:   country [50]
   country     type_of_violence est_deaths
   <chr>       <fct>                 <dbl>
 1 Afghanistan State-based               4
 2 Afghanistan Non-state                 1
 3 Afghanistan One-sided                 5
 4 Angola      State-based              22
 5 Bangladesh  State-based               2
 6 Bangladesh  Non-state                 5
 7 Bangladesh  One-sided                 3
 8 Belize      One-sided                 2
 9 Benin       State-based               0
10 Brazil      State-based               1
# ℹ 89 more rows
```
Try pivoting this to wide format and then back again.
Pivoting the data from long to wide gives one column for each conflict type.
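One way to produce the table below (a sketch; `ged_wide` is my own name for the result):

```r
library(tidyr)

ged_wide <- ged_by_type |>
  pivot_wider(names_from = type_of_violence,
              values_from = est_deaths)

ged_wide
```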
```
# A tibble: 50 × 4
# Groups:   country [50]
   country              `State-based` `Non-state` `One-sided`
   <chr>                        <dbl>       <dbl>       <dbl>
 1 Afghanistan                      4           1           5
 2 Angola                          22          NA          NA
 3 Bangladesh                       2           5           3
 4 Belize                          NA          NA           2
 5 Benin                            0          NA          NA
 6 Brazil                           1          35           2
 7 Burkina Faso                   296           6          58
 8 Burundi                         NA          NA          15
 9 Cambodia (Kampuchea)            11          NA          NA
10 Cameroon                        12           1          10
# ℹ 40 more rows
```
Putting the wide data back in its original format:
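A sketch of the reverse pivot (`names_to` and `values_to` set the new column names shown in the output below):

```r
ged_wide |>
  pivot_longer(cols = c(`State-based`, `Non-state`, `One-sided`),
               names_to = 'violence_type',
               values_to = 'est_deaths')
```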
```
# A tibble: 150 × 3
# Groups:   country [50]
   country     violence_type est_deaths
   <chr>       <chr>              <dbl>
 1 Afghanistan State-based            4
 2 Afghanistan Non-state              1
 3 Afghanistan One-sided              5
 4 Angola      State-based           22
 5 Angola      Non-state             NA
 6 Angola      One-sided             NA
 7 Bangladesh  State-based            2
 8 Bangladesh  Non-state              5
 9 Bangladesh  One-sided              3
10 Belize      State-based           NA
# ℹ 140 more rows
```
A general rule: prefer long data while you’re processing; reserve wide formats for your “finished products” (the things someone else might need to look at).
Keep in mind: Wide and long aren’t necessarily discrete categories. There are long-er and wide-er versions of most real-world data, so it’s really more of a relative concept.
I want to explore the relationship between “rugged terrain” and civil wars since 1945, but doing so requires me to combine data from two different data sets.
My first source is a list of all countries that had a civil war since 1989.
```r
# downloading battle deaths data
if (!file.exists('ucdp_datasets/BattleDeaths_v25_1.csv')) {
  download.file('https://ucdp.uu.se/downloads/brd/ucdp-brd-dyadic-251-csv.zip',
                destfile = 'battle_deaths.zip')
  unzip('battle_deaths.zip', exdir = 'ucdp_datasets')
}

# getting data on civil wars
civil_wars <- read_csv('ucdp_datasets/BattleDeaths_v25_1.csv') |>
  filter(type_of_conflict == '3') |>   # type 3 = intrastate (civil) conflict
  group_by(gwno_a) |>
  summarise(deaths = sum(bd_best)) |>
  filter(deaths > 1000) |>
  ungroup() |>
  mutate(gwno_a = as.numeric(gwno_a)) |>
  mutate(country = countrycode::countrycode(gwno_a,
                                            origin = 'gwn',
                                            destination = 'country.name')) |>
  relocate(country) |>
  arrange(country)

civil_wars
```
My second data source measures the logged % of mountainous terrain in a country:
This will require some kind of join, where two tables are combined based on a matching value in each table, called the key. But there’s more than one way to do this, especially when the tables don’t have perfectly matching keys.
A `left_join` will take all data from the left-hand side and all matching data on the right. A `right_join` will do the reverse. (“Left” vs. “right” is entirely a matter of which table you reference first in the join command.) Full joins will preserve all keys in both data sets (so Albania is kept, but with an `NA` in the deaths column).
```r
left_join(left_table, right_table, by = join_by(left_key == right_key))
```
For the current example:

- Put the terrain data first and use a `left_join`.
- Join on the Gleditsch-Ward country ID (country names are not very reliable!).
- We’ll want to fill `NA` values in `deaths` with 0, because these states had no civil war deaths.
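Here’s a sketch of what that could look like, assuming the terrain data lives in a data frame called `terrain` with its Gleditsch-Ward ID in a column named `gwno` (both names are hypothetical; substitute your own):

```r
library(dplyr)
library(tidyr)  # for replace_na()

merged <- terrain |>                       # hypothetical terrain data on the left...
  left_join(civil_wars,                    # ...so every country is kept
            by = join_by(gwno == gwno_a)) |>
  mutate(deaths = replace_na(deaths, 0))   # no match = no recorded civil war deaths
```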
- `inner_join` will drop any rows that don’t have a match in the other data set.
- `full_join` will avoid dropping any rows from either data set.
- `anti_join` will find rows on the left-hand side without a match on the right (you can use this to find discrepancies between the two tables).
- `fuzzy_join` (part of a separate package called `fuzzyjoin`) will allow imperfect matches, like slight misspellings.
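For instance, a quick sketch of using `anti_join` to list countries in the (hypothetical) `terrain` table that never matched a civil war record:

```r
# rows of `terrain` with no matching key in `civil_wars`
terrain |>
  anti_join(civil_wars, by = join_by(gwno == gwno_a))
```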