Data Wrangling

Data wrangling:

  • Recoding

  • Split-Apply-Combine operations

  • Long and Wide data

  • Joins

  • Quick-and-dirty tables and plots

Tidyverse conventions

We’ll make extensive use of tidyverse packages for these sections.

Some genre conventions of tidy code

  1. Most functions take the data frame as their first (.data) argument, which means they work well with |>, and you’ll rarely see $ notation.

  2. A preference for tidy data, where each row represents a single case or observation, each column is a variable, and each cell contains a single value.

  3. Some stylistic regularities: like related functions having a prefix_suffix naming style. Ex: str_split, str_replace, and str_extract
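As a small illustration of conventions 1 and 2, here’s the same group average written with base R’s $ notation and in pipe-friendly tidyverse style (a toy data frame, not the UCDP data):

```r
library(dplyr)

toy <- data.frame(group = c("A", "A", "B"),
                  value = c(1, 2, 3))

# base R style: $ notation and subsetting
mean(toy$value[toy$group == "A"])

# tidyverse style: the data frame is the first argument,
# so each step flows into the next with |>
toy |>
  filter(group == "A") |>
  summarize(mean_value = mean(value))
```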

Problem

I’m interested in tracking the number of civilians killed, by country, in the first 6 months of 2025.

I’ve got a collection of “candidate events” from the UCDP. Here’s an example of the formatting:

  • My raw data is measured at the event level

  • But I want to aggregate data by country

| country | conflict_name | date_start | source_headline | deaths_civilians |
|---|---|---|---|---|
| Mexico | Jalisco Cartel New Generation - Civilians | 2025-04-24 | Condena Segob ejecución | 2 |
| Ukraine | Russia - Ukraine | 2025-05-22 | In Aleshki, a civilian died during shelling by the Ukrainian Armed Forces | 1 |
| Ukraine | Russia - Ukraine | 2025-02-21 | One person died and four were injured in shelling in the Kherson region within a day | 1 |
| Mozambique | IS - Civilians | 2025-05-11 | Cabo Ligado Update: 5 — 18 May 2025; IS-linked group abducting children in Mozambique: HRW | 3 |
| Burundi | XXX516 - Civilians | 2025-03-12 | Bulletin hebdomadaire Iteka n’Ijambo n°465 | 1 |

Split-Apply-Combine operations

  • Many analyses will require some variation on this basic pattern:
  • Split data into meaningful groups

  • Apply some statistical calculation for each group (average, range, variance, etc.)

  • Combine the results into a new summary table

  • While these sorts of operations could be performed in a loop, or with base functions like aggregate, we’ll use the dplyr package here

Split-Apply-Combine

[Diagram: the data are split into groups A, B, and C, mean(Value) is applied to each group, and the results are combined into a final summary table (A: 4.5, B: 3.25, C: 1.33).]

Example of split-apply-combine process for a group-wise average

dplyr: group_by

group_by(.data, ...)

Turns .data into a grouped table where each group is defined by the variables listed in ...

  • This doesn’t do much on its own, but subsequent dplyr functions will be performed on the groups you’ve created. You can think of this as the “split” step.
library(dplyr)
df <- data.frame(group = c("A", "A", "A", "A",
                           "B", "B", "B", "B",
                           "C", "C", "C"),
                 value = c(3, 3, 5, 7, 2, 1, 1, 9, 2, 1, 1))

df_grouped <- df |>
  group_by(group)


# data looks the same after this step:
df_grouped
# A tibble: 11 × 2
# Groups:   group [3]
   group value
   <chr> <dbl>
 1 A         3
 2 A         3
 3 A         5
 4 A         7
 5 B         2
 6 B         1
 7 B         1
 8 B         9
 9 C         2
10 C         1
11 C         1

dplyr summarize

summarize(.data, ...)

Applies one or more functions defined by ... to each group in .data, then combines the results into a single data frame.

  • You can think of this as performing both the “apply” and “combine” steps.
df_summarized <- df |>
  group_by(group) |>
  summarize(mean_value = mean(value),
            sd_value = sd(value))

df_summarized

dplyr group_by + summarize

[Diagram: group_by(group) splits the data, summarize(mean_value = mean(value)) computes each group’s mean, and the results are combined into a final table (A: 4.5, B: 3.25, C: 1.33).]

Example of group_by and summarize

dplyr group_by + mutate

An alternative to summarize is mutate. Mutate works almost the same way, but the group-level results are joined back to the original data, so every row keeps its group’s summary value.

[Diagram: group_by(group) splits the data and mutate(mean_value = mean(value)) computes each group’s mean, but the group summaries are then joined back to the original rows, so each row of the result carries its group’s mean.]

dplyr mutate vs. summarize

Mutate will be most useful when we need to do something like mean-centering or calculating a running sum for each group in a data set. (We’ll also sometimes use mutate without group_by to simply add a new column to a data frame.)

df_summarized <- df |>
  group_by(group) |>
  summarize(mean_value = mean(value),
            sd_value = sd(value))

df_summarized

df_mutated <- df |>
  group_by(group) |>
  mutate(mean_value = mean(value),
         sd_value = sd(value)) |>
  # good rule: always run `ungroup` after a grouped mutate
  ungroup()

df_mutated
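For instance, here’s a sketch of group-wise mean-centering and a running sum, the two tasks mentioned above, using a toy data frame (hypothetical values, not the slides’ df):

```r
library(dplyr)

scores <- data.frame(group = c("A", "A", "A", "B", "B"),
                     value = c(3, 5, 7, 2, 9))

scores_centered <- scores |>
  group_by(group) |>
  mutate(centered = value - mean(value),     # subtract each group's mean
         running_total = cumsum(value)) |>   # cumulative sum within each group
  ungroup()

scores_centered
```

Because mutate keeps every original row, the result still has five rows, with the group-level calculations attached to each.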

Back to the problem

  • Back to the original problem: how do we turn event-level data into a total number of civilian deaths for each country?
  • Split by country

  • Apply a sum to the civilian death toll column in each group

  • Combine the results into a single data frame

library(readr)   # for read_csv
library(dplyr)

url <- 'https://ucdp.uu.se/downloads/candidateged/GEDEvent_v25_0_7.csv'
ged <- read_csv(url)

ged |>
  group_by(country) |>
  summarise(civilian_deaths = sum(deaths_civilians))

Some other common dplyr operations

These functions also respect the grouping context, so be careful about how you use them after calling group_by because they can give very different results for a grouped data frame:

  • select: selects certain columns from a tbl (with a more flexible syntax compared to base R)

  • filter: removes rows based on a logical condition

  • slice_head/slice_tail/slice_sample: take the top rows, bottom rows, or a random sample of rows.

  • arrange: sort by one or more values

  • count: count the number of rows in the grouping context.
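A quick sketch of what “respecting the grouping context” means, using a toy data frame (hypothetical values):

```r
library(dplyr)

toy <- data.frame(group = c("A", "A", "B", "B", "B"),
                  value = c(3, 5, 2, 9, 1))

# ungrouped: the single first row of the whole table
toy |> slice_head(n = 1)

# grouped: the first row *within each group* (two rows here)
toy |> group_by(group) |> slice_head(n = 1)

# count() with a grouping variable: rows per group
toy |> count(group)
```

The same call to slice_head returns one row on the ungrouped data but one row per group after group_by.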

Some other common dplyr operations

Here’s an example of using some other dplyr functions:

  • line 4: remove cases with 0 deaths

  • lines 5-6: get the worst single incident for each conflict

  • line 7: select some specific columns

  • line 8: sort by civilian deaths in decreasing order

  • line 9: ungroup the result

url <- 'https://ucdp.uu.se/downloads/candidateged/GEDEvent_v25_0_7.csv'
ged <- read_csv(url)
ged |>
  filter(deaths_civilians > 0) |>
  group_by(conflict_name) |>
  filter(deaths_civilians == max(deaths_civilians)) |>
  select(date_start, conflict_name, deaths_civilians) |>
  arrange(-deaths_civilians) |>
  ungroup()
# A tibble: 143 × 3
   date_start          conflict_name                       deaths_civilians
   <dttm>              <chr>                                          <dbl>
 1 2025-07-12 00:00:00 RSF - Civilians                                  402
 2 2025-07-13 00:00:00 Syria: Government                                321
 3 2025-07-09 00:00:00 AFC - Civilians                                  169
 4 2025-07-13 00:00:00 Government of Syria - Civilians                   89
 5 2025-07-28 00:00:00 IS - Civilians                                    45
 6 2025-07-21 00:00:00 Forces of Dan Saadiya   - Civilians               35
 7 2025-07-31 00:00:00 Russia - Ukraine                                  31
 8 2025-07-10 00:00:00 Myanmar (Burma): Government                       30
 9 2025-07-13 00:00:00 Sudan: Government                                 27
10 2025-07-29 00:00:00 Israel: Palestine                                 26
# ℹ 133 more rows

Problem

The UCDP distinguishes between state based, non-state, and one-sided forms of violence. I’m interested in tracking total deaths by country and violence type.

We can think of a single “record” of data as looking something like this:

[Diagram: a single record, one per country/violence type, with fields Country (e.g. Burkina Faso, Mexico, Ukraine), Conflict Type (e.g. state-based, non-state, one-sided), and Est. Deaths (e.g. 296, 376, 6870).]

How should I arrange my data for analysis?

Complex data and long vs. wide format

This is another common complication: how do we represent complex data in a 2-dimensional matrix?

A classic example would be panel data, where multiple cases are tracked across multiple years:

  • Country-year data on economic growth

  • Tracking student GPAs across terms

  • Public opinion surveys for the same set of respondents during an entire election

Long Data

Putting the records in long format would mean putting each case in a row and using repetition to identify related cases:

| Country | Conflict Type | Est. Deaths |
|---|---|---|
| Burkina Faso | state-based | 296 |
| Burkina Faso | non-state | 6 |
| Burkina Faso | one-sided | 58 |
| Mexico | state-based | 8 |
| Mexico | non-state | 376 |
| Mexico | one-sided | 22 |
| …more rows | | |

Wide Format

Putting records in wide format means using additional columns for each group (either one column per country or per violence type, in the current example)

| Country | state-based | non-state | one-sided |
|---|---|---|---|
| Burkina Faso | 296 | 6 | 58 |
| Mexico | 8 | 376 | 22 |
| …more rows | | | |

Long vs. Wide

  • Wide format data is generally easier on the eyeballs, especially when you want people to make comparisons across groups/periods.
  • It’s much easier to calculate totals by row or column.
  • But it’s less flexible: what if I want to track civilian and non-civilian deaths separately? I’d need a whole new table.

| Country | state-based | non-state | one-sided |
|---|---|---|---|
| Burkina Faso | 296 | 6 | 58 |
| Mexico | 8 | 376 | 22 |
| …more rows | | | |

Long vs. Wide

  • Long format may be inefficient for storage and transfer because information is repeated.

  • But it’s far more flexible: adding combatant death tolls is as simple as adding a new column.

  • Each row is a discrete observation as opposed to a collection of observations.

  • In general, R functions (especially code related to the tidyverse) will work better with long data.

| Country | Conflict Type | Est. Deaths |
|---|---|---|
| Burkina Faso | state-based | 296 |
| Burkina Faso | non-state | 6 |
| Burkina Faso | one-sided | 58 |
| Mexico | state-based | 8 |
| Mexico | non-state | 376 |
| Mexico | one-sided | 22 |
| …more rows | | |

Long to wide in tidyr

pivot_wider(.data, names_from, values_from)

Makes data wider. names_from will identify the new column names. values_from will identify the values in each cell.

df <- data.frame(group = c("A", "A", "A",
                           "B", "B", "B"),
                 period = c("First", "Second", "Third",
                            "First", "Second", "Third"),
                 value = c(3, 3, 5, 7, 2, 1))
df
  group period value
1     A  First     3
2     A Second     3
3     A  Third     5
4     B  First     7
5     B Second     2
6     B  Third     1
library(tidyr)

df |>
  pivot_wider(names_from = group, values_from = value)
# A tibble: 3 × 3
  period     A     B
  <chr>  <dbl> <dbl>
1 First      3     7
2 Second     3     2
3 Third      5     1
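One wrinkle: if some group/period combinations are missing, pivot_wider leaves NA in those cells by default; the values_fill argument substitutes a value of your choice. A toy sketch (note that group B has no “Second” row):

```r
library(dplyr)
library(tidyr)

toy <- data.frame(group  = c("A", "A", "B"),
                  period = c("First", "Second", "First"),
                  value  = c(3, 3, 7))

toy_wide <- toy |>
  pivot_wider(names_from = group, values_from = value,
              values_fill = 0)   # missing cells become 0 instead of NA

toy_wide
```

The same pattern applies to the UCDP country-by-violence-type table later in these slides, where many country/type combinations are absent.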

Wide-to-long in tidyr

pivot_longer(.data, cols)

Makes data longer. Data in cols will form new rows.

df <- data.frame(period = c("First", "Second", "Third"),
                 A = c(3, 3, 5),
                 B = c(7, 2, 1))
df
  period A B
1  First 3 7
2 Second 3 2
3  Third 5 1
library(tidyr)

df |>
  pivot_longer(cols = c(A, B))
# A tibble: 6 × 3
  period name  value
  <chr>  <chr> <dbl>
1 First  A         3
2 First  B         7
3 Second A         3
4 Second B         2
5 Third  A         5
6 Third  B         1
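By default the new columns are called name and value; the names_to and values_to arguments let you choose better names. A toy sketch:

```r
library(dplyr)
library(tidyr)

toy <- data.frame(period = c("First", "Second", "Third"),
                  A = c(3, 3, 5),
                  B = c(7, 2, 1))

toy_long <- toy |>
  pivot_longer(cols = c(A, B),
               names_to = "group",    # column holding the old column names
               values_to = "value")   # column holding the cell values

toy_long
```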

Back to the problem

How should I arrange my data for analysis?

url <- 'https://ucdp.uu.se/downloads/candidateged/GEDEvent_v25_0_7.csv'
ged <- read_csv(url)

ged_by_type <- ged |>
  mutate(type_of_violence = factor(type_of_violence,
                                   levels = c(1, 2, 3),
                                   labels = c("State-based", "Non-state", "One-sided"))) |>
  group_by(country, type_of_violence) |>
  summarise(est_deaths = sum(best))

ged_by_type
# A tibble: 99 × 3
# Groups:   country [50]
   country     type_of_violence est_deaths
   <chr>       <fct>                 <dbl>
 1 Afghanistan State-based               4
 2 Afghanistan Non-state                 1
 3 Afghanistan One-sided                 5
 4 Angola      State-based              22
 5 Bangladesh  State-based               2
 6 Bangladesh  Non-state                 5
 7 Bangladesh  One-sided                 3
 8 Belize      One-sided                 2
 9 Benin       State-based               0
10 Brazil      State-based               1
# ℹ 89 more rows

Back to the problem

Try pivoting this to wide format and then back again.

Pivoting the data from long to wide with one column for each conflict type.

ged_wide <- ged_by_type |>
  pivot_wider(names_from = type_of_violence,
              values_from = est_deaths)

ged_wide
# A tibble: 50 × 4
# Groups:   country [50]
   country              `State-based` `Non-state` `One-sided`
   <chr>                        <dbl>       <dbl>       <dbl>
 1 Afghanistan                      4           1           5
 2 Angola                          22          NA          NA
 3 Bangladesh                       2           5           3
 4 Belize                          NA          NA           2
 5 Benin                            0          NA          NA
 6 Brazil                           1          35           2
 7 Burkina Faso                   296           6          58
 8 Burundi                         NA          NA          15
 9 Cambodia (Kampuchea)            11          NA          NA
10 Cameroon                        12           1          10
# ℹ 40 more rows

Putting the wide data back in its original format:

ged_long <- ged_wide |>
  pivot_longer(cols = c(`State-based`, `Non-state`, `One-sided`),
               names_to = 'violence_type', # name the new columns
               values_to = 'est_deaths')

ged_long
# A tibble: 150 × 3
# Groups:   country [50]
   country     violence_type est_deaths
   <chr>       <chr>              <dbl>
 1 Afghanistan State-based            4
 2 Afghanistan Non-state              1
 3 Afghanistan One-sided              5
 4 Angola      State-based           22
 5 Angola      Non-state             NA
 6 Angola      One-sided             NA
 7 Bangladesh  State-based            2
 8 Bangladesh  Non-state              5
 9 Bangladesh  One-sided              3
10 Belize      State-based           NA
# ℹ 140 more rows

Long vs. Wide

  • A general rule: prefer long data while you’re processing stuff. You can use wide formats for your “finished products” (the stuff that someone else might need to look at)

  • Keep in mind: Wide and long aren’t necessarily discrete categories. There are long-er and wide-er versions of most real-world data, so it’s really more of a relative concept.

Problem

I want to explore the relationship between “rugged terrain” and civil wars since 1989, but doing so requires me to combine data from two different data sets.

My first source is a list of all countries that had a civil war since 1989.

# downloading battle deaths data
if(!file.exists('ucdp_datasets/BattleDeaths_v25_1.csv')){
  download.file('https://ucdp.uu.se/downloads/brd/ucdp-brd-dyadic-251-csv.zip',
                destfile = 'battle_deaths.zip')

  unzip('battle_deaths.zip',
        exdir = 'ucdp_datasets')
}


# getting data on civil wars
civil_wars <- read_csv('ucdp_datasets/BattleDeaths_v25_1.csv') |>
  filter(type_of_conflict == '3') |>
  group_by(gwno_a) |>
  summarise(deaths = sum(bd_best)) |>
  filter(deaths > 1000) |>
  ungroup() |>
  mutate(gwno_a = as.numeric(gwno_a)) |>
  mutate(country = countrycode::countrycode(gwno_a, origin = 'gwn',
                                            destination = 'country.name')) |>
  relocate(country) |>
  arrange(country)


civil_wars

My second data source measures the logged % of mountainous terrain in a country:

rugged <- peacesciencer::rugged |>
  filter(!is.na(gwcode)) |>
  mutate(country = countrycode::countrycode(gwcode, origin = 'gwn',
                                            destination = 'country.name'),
         country = ifelse(gwcode == 816, "Vietnam", country)) |>
  relocate(country) |>
  select(-ccode, -rugged) |>
  arrange(country) |>
  rename(rough = newlmtnest)


rugged

Joins

This will require some kind of join, where two tables are combined based on a matching value, called the key, in each. But there’s more than one way to do this, especially when the tables don’t have perfectly matching keys.

Civil Wars:

| Country | Country ID | Est. Deaths |
|---|---|---|
| Afghanistan | 700 | 54753 |
| Algeria | 615 | 18488 |
| Angola | 540 | 23088 |
| …more rows | | |

Terrain:

| Country | Country ID | Log % Rough Terrain |
|---|---|---|
| Afghanistan | 700 | 4.2 |
| Albania | 339 | 2.3 |
| Algeria | 615 | 2.8 |
| Angola | 540 | 2.4 |
| …more rows | | |

[Diagram: both tables feed into a single combined data set.]

Join types

  • Inner joins preserve only keys that are present in both data sets. (Rows with multiple matches are duplicated)

[Diagram: an inner join of the Civil Wars and Terrain tables. The combined table keeps Afghanistan, Algeria, and Angola, whose keys appear in both tables, and drops Albania, which appears only in the terrain data.]
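To make the duplication point concrete, here’s a toy sketch (hypothetical IDs, not the real data):

```r
library(dplyr)

terrain <- data.frame(id = c(1, 2), rough = c(4.2, 2.3))
events  <- data.frame(id = c(1, 1, 3), deaths = c(5, 2, 7))

# id 1 matches twice, so its terrain row is duplicated;
# id 2 (terrain only) and id 3 (events only) are dropped
inner_join(terrain, events, by = join_by(id))
```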

left/right join

  • A left_join will keep all rows from the left-hand table plus any matching data from the right; a right_join will do the reverse. (“left” vs. “right” is entirely a matter of which table you reference first in the join command)

[Diagram: a left_join with civil_wars as the left-hand side and the terrain data as the right-hand side. The result keeps only the civil-war countries, so Albania is again dropped.]

Full Join

Full joins will preserve all keys in both data sets. (so Albania is kept, but with an NA in the deaths column)

[Diagram: a full join of the terrain data and civil_wars. All four countries are kept; Albania gets an NA in the Est. Deaths column.]
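A toy sketch of the same idea (hypothetical IDs):

```r
library(dplyr)

terrain <- data.frame(id = c(1, 2), rough = c(4.2, 2.3))
wars    <- data.frame(id = c(1, 3), deaths = c(10, 7))

# all three ids survive; unmatched cells become NA
full_join(terrain, wars, by = join_by(id))
```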

Joins in dplyr

left_join(left_table, right_table, by = join_by(left_key == right_key))
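A minimal sketch of this syntax with toy tables (hypothetical IDs and column names, standing in for the real data):

```r
library(dplyr)

terrain <- data.frame(id = c(1, 2, 3),
                      rough = c(4.2, 2.3, 2.8))
wars    <- data.frame(country_id = c(1, 3),
                      deaths = c(54753, 18488))

# every terrain row is kept; id 2 has no match, so deaths is NA.
# join_by(id == country_id) handles the differently named key columns.
left_join(terrain, wars, by = join_by(id == country_id))
```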

Back to the original problem

  • put the terrain data first and use a left_join

  • join on the Gleditsch Ward country ID. (country names are not very reliable!)

  • We’ll want to fill NA values in deaths with “0” because these states had no civil war deaths.

joined_data <- left_join(rugged,
                         civil_wars,
                         by = join_by(gwcode == gwno_a))

joined_data

Back to the original problem

  • We can use replace_na to replace NA values with 0s

  • And we’ll just create an indicator that’s TRUE if a country had any civil war deaths

joined_data <- left_join(rugged,
                         civil_wars,
                         by = join_by(gwcode == gwno_a)) |>
  mutate(deaths = replace_na(deaths, 0),   # replace_na comes from tidyr
         any_civil_war = deaths > 0)

joined_data

Back to the original problem

lm(any_civil_war ~ rough, data = joined_data)|>
  broom::tidy()
library(ggplot2)

ggplot(joined_data, aes(x = any_civil_war, y = rough)) +
  geom_boxplot() +
  theme_bw() +
  ylab("natural log of % mountainous terrain") +
  xlab("Any civil wars after 1989?")

Other joins

  • inner_join will drop any rows that don’t have a match in the other data set

  • full_join will avoid dropping any rows from either dataset.

  • anti_join will find rows in the left-hand table without a match on the right. (you can use this to find discrepancies between data sets)

  • fuzzy_join (part of a separate package called fuzzyjoin) will allow imperfect matches like slight misspellings.
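For instance, a toy sketch of using anti_join to spot mismatched keys (hypothetical IDs):

```r
library(dplyr)

terrain <- data.frame(id = c(1, 2, 3))
wars    <- data.frame(id = c(1, 3, 4))

# rows of terrain with no match in wars
anti_join(terrain, wars, by = join_by(id))

# flip the arguments to find wars rows missing from terrain
anti_join(wars, terrain, by = join_by(id))
```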