Data Wrangling

Data wrangling:

  • Recoding

  • Split-Apply-Combine operations

  • Long and Wide data

  • Joins

  • Quick-and-dirty tables and plots

Tidyverse conventions

We’ll make extensive use of tidyverse packages for these sections.

Some genre conventions of tidy code

  1. Most functions take the data frame as their first (.data) argument, which means they work well with |>, and you’ll rarely see $ notation.

  2. A preference for tidy data, where each row represents a single case or observation, each column is a variable, and each cell contains a single value.

  3. Some stylistic regularities: like related functions having a prefix_suffix naming style. Ex: str_split, str_replace, and str_extract
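As a small illustration of conventions 1 and 2, here’s the same group average written with base R’s $ notation and in pipe-friendly tidyverse style (a toy data frame, not the UCDP data):

```r
library(dplyr)

toy <- data.frame(group = c("A", "A", "B"),
                  value = c(1, 2, 3))

# base R style: $ notation and subsetting
mean(toy$value[toy$group == "A"])

# tidyverse style: the data frame is the first argument,
# so each step flows into the next with |>
toy |>
  filter(group == "A") |>
  summarize(mean_value = mean(value))
```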

Problem

I’m interested in tracking the number of civilians killed, by country, in the first 6 months of 2025.

I’ve got a collection of “candidate events” from the UCDP. Here’s an example of the formatting:

  • My raw data is measured at the event level

  • But I want to aggregate data by country

| country | conflict_name | date_start | source_headline | deaths_civilians |
|---|---|---|---|---|
| Mexico | Jalisco Cartel New Generation - Civilians | 2025-04-24 | Condena Segob ejecución | 2 |
| Ukraine | Russia - Ukraine | 2025-05-22 | In Aleshki, a civilian died during shelling by the Ukrainian Armed Forces | 1 |
| Ukraine | Russia - Ukraine | 2025-02-21 | One person died and four were injured in shelling in the Kherson region within a day | 1 |
| Mozambique | IS - Civilians | 2025-05-11 | Cabo Ligado Update: 5 — 18 May 2025; IS-linked group abducting children in Mozambique: HRW | 3 |
| Burundi | XXX516 - Civilians | 2025-03-12 | Bulletin hebdomadaire Iteka n’Ijambo n°465 | 1 |

Split-Apply-Combine operations

  • Many analyses will require some variation on this basic pattern:
  • Split data into meaningful groups

  • Apply some statistical calculation for each group (average, range, variance, etc.)

  • Combine the results into a new summary table

  • While these sorts of operations could be performed in a loop, or with base functions like aggregate, we’ll use the dplyr package here

Split-Apply-Combine

[Diagram: the data are split into groups A, B, and C, mean(Value) is applied to each group, and the results are combined into a final summary table (A: 4.5, B: 3.25, C: 1.33).]

Example of split-apply-combine process for a group-wise average

dplyr: group_by

group_by(.data, ...)

Turns .data into a grouped table where each group is defined by the variables listed in ...

  • This doesn’t do much on its own, but subsequent dplyr functions will be performed on the groups you’ve created. You can think of this as the “split” step.
library(dplyr)
df <- data.frame(group = c("A", "A", "A", "A",
                           "B", "B", "B", "B",
                           "C", "C", "C"),
                 value = c(3, 3, 5, 7, 2, 1, 1, 9, 2, 1, 1))

df_grouped <- df |>
  group_by(group)


# data looks the same after this step:
df_grouped
# A tibble: 11 × 2
# Groups:   group [3]
   group value
   <chr> <dbl>
 1 A         3
 2 A         3
 3 A         5
 4 A         7
 5 B         2
 6 B         1
 7 B         1
 8 B         9
 9 C         2
10 C         1
11 C         1

dplyr summarize

summarize(.data, ...)

Applies one or more functions defined by ... to each group in .data, then combines the results into a single data frame.

  • You can think of this as performing both the “apply” and “combine” steps.
df_summarized <- df |>
  group_by(group) |>
  summarize(mean_value = mean(value),
            sd_value = sd(value))

df_summarized

dplyr group_by + summarize

[Diagram: group_by(group) splits the data, summarize(mean_value = mean(value)) computes each group’s mean, and the results are combined into a final table (A: 4.5, B: 3.25, C: 1.33).]

Example of group_by and summarize

dplyr group_by + mutate

An alternative to summarize is mutate. Mutate works almost the same way, but the group-level results are joined back to the original data, so every row keeps its group’s summary value.

[Diagram: group_by(group) splits the data and mutate(mean_value = mean(value)) computes each group’s mean, but the group summaries are then joined back to the original rows, so each row of the result carries its group’s mean.]

dplyr mutate vs. summarize

Mutate will be most useful when we need to do something like mean-centering or calculating a running sum for each group in a data set. (We’ll also sometimes use mutate without group_by to simply add a new column to a data frame.)

df_summarized <- df |>
  group_by(group) |>
  summarize(mean_value = mean(value),
            sd_value = sd(value))

df_summarized

df_mutated <- df |>
  group_by(group) |>
  mutate(mean_value = mean(value),
         sd_value = sd(value)) |>
  # good rule: always run `ungroup` after a grouped mutate
  ungroup()

df_mutated
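For instance, here’s a sketch of group-wise mean-centering and a running sum, the two tasks mentioned above, using a toy data frame (hypothetical values, not the slides’ df):

```r
library(dplyr)

scores <- data.frame(group = c("A", "A", "A", "B", "B"),
                     value = c(3, 5, 7, 2, 9))

scores_centered <- scores |>
  group_by(group) |>
  mutate(centered = value - mean(value),     # subtract each group's mean
         running_total = cumsum(value)) |>   # cumulative sum within each group
  ungroup()

scores_centered
```

Because mutate keeps every original row, the result still has five rows, with the group-level calculations attached to each.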

Back to the problem

  • Back to the original problem: how do we turn event-level data into a total number of civilian deaths for each country?
  • Split by country

  • Apply a sum to the civilian death toll column in each group

  • Combine the results into a single data frame

library(readr)   # for read_csv
library(dplyr)

url <- 'https://ucdp.uu.se/downloads/candidateged/GEDEvent_v25_0_7.csv'
ged <- read_csv(url)

ged |>
  group_by(country) |>
  summarise(civilian_deaths = sum(deaths_civilians))

Some other common dplyr operations

These functions also respect the grouping context, so be careful about how you use them after calling group_by because they can give very different results for a grouped data frame:

  • select: selects certain columns from a tbl (with a more flexible syntax compared to base R)

  • filter: removes rows based on a logical condition

  • slice_head/slice_tail/slice_sample: take the top rows, bottom rows, or a random sample of rows.

  • arrange: sort by one or more values

  • count: count the number of rows in the grouping context.
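A quick sketch of what “respecting the grouping context” means, using a toy data frame (hypothetical values):

```r
library(dplyr)

toy <- data.frame(group = c("A", "A", "B", "B", "B"),
                  value = c(3, 5, 2, 9, 1))

# ungrouped: the single first row of the whole table
toy |> slice_head(n = 1)

# grouped: the first row *within each group* (two rows here)
toy |> group_by(group) |> slice_head(n = 1)

# count() with a grouping variable: rows per group
toy |> count(group)
```

The same call to slice_head returns one row on the ungrouped data but one row per group after group_by.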

Some other common dplyr operations

Here’s an example of using some other dplyr functions:

  • line 4: remove cases with 0 deaths

  • lines 5-6: get the worst single incident for each conflict

  • line 7: select some specific columns

  • line 8: sort by civilian deaths in decreasing order

  • line 9: ungroup the result

url <- 'https://ucdp.uu.se/downloads/candidateged/GEDEvent_v25_0_7.csv'
ged <- read_csv(url)
ged |>
  filter(deaths_civilians > 0) |>
  group_by(conflict_name) |>
  filter(deaths_civilians == max(deaths_civilians)) |>
  select(date_start, conflict_name, deaths_civilians) |>
  arrange(-deaths_civilians) |>
  ungroup()
# A tibble: 143 × 3
   date_start          conflict_name                       deaths_civilians
   <dttm>              <chr>                                          <dbl>
 1 2025-07-12 00:00:00 RSF - Civilians                                  402
 2 2025-07-13 00:00:00 Syria: Government                                321
 3 2025-07-09 00:00:00 AFC - Civilians                                  169
 4 2025-07-13 00:00:00 Government of Syria - Civilians                   89
 5 2025-07-28 00:00:00 IS - Civilians                                    45
 6 2025-07-21 00:00:00 Forces of Dan Saadiya   - Civilians               35
 7 2025-07-31 00:00:00 Russia - Ukraine                                  31
 8 2025-07-10 00:00:00 Myanmar (Burma): Government                       30
 9 2025-07-13 00:00:00 Sudan: Government                                 27
10 2025-07-29 00:00:00 Israel: Palestine                                 26
# ℹ 133 more rows

Problem

The UCDP distinguishes between state based, non-state, and one-sided forms of violence. I’m interested in tracking total deaths by country and violence type.

We can think of a single “record” of data as looking something like this:

[Diagram: a single record, one per country/violence type, with fields Country (e.g. Burkina Faso, Mexico, Ukraine), Conflict Type (e.g. state-based, non-state, one-sided), and Est. Deaths (e.g. 296, 376, 6870).]

How should I arrange my data for analysis?

Complex data and long vs. wide format

This is another common complication: how do we represent complex data in a 2-dimensional matrix?

A classic example would be panel data, where multiple cases are tracked across multiple years:

  • Country-year data on economic growth

  • Tracking student GPAs across terms

  • Public opinion surveys for the same set of respondents during an entire election

Long Data

Putting the records in long format would mean putting each case in a row and using repetition to identify related cases:

| Country | Conflict Type | Est. Deaths |
|---|---|---|
| Burkina Faso | state-based | 296 |
| Burkina Faso | non-state | 6 |
| Burkina Faso | one-sided | 58 |
| Mexico | state-based | 8 |
| Mexico | non-state | 376 |
| Mexico | one-sided | 22 |
| …more rows | | |

Wide Format

Putting records in wide format means using additional columns for each group (either one column per country or per violence type, in the current example)

| Country | state-based | non-state | one-sided |
|---|---|---|---|
| Burkina Faso | 296 | 6 | 58 |
| Mexico | 8 | 376 | 22 |
| …more rows | | | |

Long vs. Wide

  • Wide format data is generally easier on the eyeballs, especially when you want people to make comparisons across groups/periods.
  • It’s much easier to calculate totals by row or column.
  • But it’s less flexible: what if I want to track civilian and non-civilian deaths separately? I’d need a whole new table.

| Country | state-based | non-state | one-sided |
|---|---|---|---|
| Burkina Faso | 296 | 6 | 58 |
| Mexico | 8 | 376 | 22 |
| …more rows | | | |

Long vs. Wide

  • Long format may be inefficient for storage and transfer because information is repeated.

  • But it’s far more flexible: adding combatant death tolls is as simple as adding a new column.

  • Each row is a discrete observation as opposed to a collection of observations.

  • In general, R functions (especially code related to the tidyverse) will work better with long data.

| Country | Conflict Type | Est. Deaths |
|---|---|---|
| Burkina Faso | state-based | 296 |
| Burkina Faso | non-state | 6 |
| Burkina Faso | one-sided | 58 |
| Mexico | state-based | 8 |
| Mexico | non-state | 376 |
| Mexico | one-sided | 22 |
| …more rows | | |

Long to wide in tidyr

pivot_wider(.data, names_from, values_from)

Makes data wider. names_from will identify the new column names. values_from will identify the values in each cell.

df <- data.frame(group = c("A", "A", "A",
                           "B", "B", "B"),
                 period = c("First", "Second", "Third",
                            "First", "Second", "Third"),
                 value = c(3, 3, 5, 7, 2, 1))
df
  group period value
1     A  First     3
2     A Second     3
3     A  Third     5
4     B  First     7
5     B Second     2
6     B  Third     1
library(tidyr)

df |>
  pivot_wider(names_from = group, values_from = value)
# A tibble: 3 × 3
  period     A     B
  <chr>  <dbl> <dbl>
1 First      3     7
2 Second     3     2
3 Third      5     1
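One wrinkle: if some group/period combinations are missing, pivot_wider leaves NA in those cells by default; the values_fill argument substitutes a value of your choice. A toy sketch (note that group B has no “Second” row):

```r
library(dplyr)
library(tidyr)

toy <- data.frame(group  = c("A", "A", "B"),
                  period = c("First", "Second", "First"),
                  value  = c(3, 3, 7))

toy_wide <- toy |>
  pivot_wider(names_from = group, values_from = value,
              values_fill = 0)   # missing cells become 0 instead of NA

toy_wide
```

The same pattern applies to the UCDP country-by-violence-type table later in these slides, where many country/type combinations are absent.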

Wide-to-long in tidyr

pivot_longer(.data, cols)

Makes data longer. Data in cols will form new rows.

df <- data.frame(period = c("First", "Second", "Third"),
                 A = c(3, 3, 5),
                 B = c(7, 2, 1))
df
  period A B
1  First 3 7
2 Second 3 2
3  Third 5 1
library(tidyr)

df |>
  pivot_longer(cols = c(A, B))
# A tibble: 6 × 3
  period name  value
  <chr>  <chr> <dbl>
1 First  A         3
2 First  B         7
3 Second A         3
4 Second B         2
5 Third  A         5
6 Third  B         1
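By default the new columns are called name and value; the names_to and values_to arguments let you choose better names. A toy sketch:

```r
library(dplyr)
library(tidyr)

toy <- data.frame(period = c("First", "Second", "Third"),
                  A = c(3, 3, 5),
                  B = c(7, 2, 1))

toy_long <- toy |>
  pivot_longer(cols = c(A, B),
               names_to = "group",    # column holding the old column names
               values_to = "value")   # column holding the cell values

toy_long
```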

Back to the problem

How should I arrange my data for analysis?

url <- 'https://ucdp.uu.se/downloads/candidateged/GEDEvent_v25_0_7.csv'
ged <- read_csv(url)

ged_by_type <- ged |>
  mutate(type_of_violence = factor(type_of_violence,
                                   levels = c(1, 2, 3),
                                   labels = c("State-based", "Non-state", "One-sided"))) |>
  group_by(country, type_of_violence) |>
  summarise(est_deaths = sum(best))

ged_by_type
# A tibble: 99 × 3
# Groups:   country [50]
   country     type_of_violence est_deaths
   <chr>       <fct>                 <dbl>
 1 Afghanistan State-based               4
 2 Afghanistan Non-state                 1
 3 Afghanistan One-sided                 5
 4 Angola      State-based              22
 5 Bangladesh  State-based               2
 6 Bangladesh  Non-state                 5
 7 Bangladesh  One-sided                 3
 8 Belize      One-sided                 2
 9 Benin       State-based               0
10 Brazil      State-based               1
# ℹ 89 more rows

Back to the problem

Try pivoting this to wide format and then back again.

Pivoting the data from long to wide with one column for each conflict type.

ged_wide <- ged_by_type |>
  pivot_wider(names_from = type_of_violence,
              values_from = est_deaths)

ged_wide
# A tibble: 50 × 4
# Groups:   country [50]
   country              `State-based` `Non-state` `One-sided`
   <chr>                        <dbl>       <dbl>       <dbl>
 1 Afghanistan                      4           1           5
 2 Angola                          22          NA          NA
 3 Bangladesh                       2           5           3
 4 Belize                          NA          NA           2
 5 Benin                            0          NA          NA
 6 Brazil                           1          35           2
 7 Burkina Faso                   296           6          58
 8 Burundi                         NA          NA          15
 9 Cambodia (Kampuchea)            11          NA          NA
10 Cameroon                        12           1          10
# ℹ 40 more rows

Putting the wide data back in its original format:

ged_long <- ged_wide |>
  pivot_longer(cols = c(`State-based`, `Non-state`, `One-sided`),
               names_to = 'violence_type', # name the new columns
               values_to = 'est_deaths')

ged_long
# A tibble: 150 × 3
# Groups:   country [50]
   country     violence_type est_deaths
   <chr>       <chr>              <dbl>
 1 Afghanistan State-based            4
 2 Afghanistan Non-state              1
 3 Afghanistan One-sided              5
 4 Angola      State-based           22
 5 Angola      Non-state             NA
 6 Angola      One-sided             NA
 7 Bangladesh  State-based            2
 8 Bangladesh  Non-state              5
 9 Bangladesh  One-sided              3
10 Belize      State-based           NA
# ℹ 140 more rows

Long vs. Wide

  • A general rule: prefer long data while you’re processing stuff. You can use wide formats for your “finished products” (the stuff that someone else might need to look at)

  • Keep in mind: Wide and long aren’t necessarily discrete categories. There are long-er and wide-er versions of most real-world data, so it’s really more of a relative concept.

Problem

I want to explore the relationship between “rugged terrain” and civil wars since 1989, but doing so requires me to combine data from two different data sets.

My first source is a list of all countries that had a civil war since 1989.

# downloading battle deaths data
if(!file.exists('ucdp_datasets/BattleDeaths_v25_1.csv')){
  download.file('https://ucdp.uu.se/downloads/brd/ucdp-brd-dyadic-251-csv.zip',
                destfile = 'battle_deaths.zip')

  unzip('battle_deaths.zip',
        exdir = 'ucdp_datasets')
}


# getting data on civil wars
civil_wars <- read_csv('ucdp_datasets/BattleDeaths_v25_1.csv') |>
  filter(type_of_conflict == '3') |>
  group_by(gwno_a) |>
  summarise(deaths = sum(bd_best)) |>
  filter(deaths > 1000) |>
  ungroup() |>
  mutate(gwno_a = as.numeric(gwno_a)) |>
  mutate(country = countrycode::countrycode(gwno_a, origin = 'gwn',
                                            destination = 'country.name')) |>
  relocate(country) |>
  arrange(country)


civil_wars

My second data source measures the logged % of mountainous terrain in a country:

rugged <- peacesciencer::rugged |>
  filter(!is.na(gwcode)) |>
  mutate(country = countrycode::countrycode(gwcode, origin = 'gwn',
                                            destination = 'country.name'),
         country = ifelse(gwcode == 816, "Vietnam", country)) |>
  relocate(country) |>
  select(-ccode, -rugged) |>
  arrange(country) |>
  rename(rough = newlmtnest)


rugged

Joins

This will require some kind of join, where two tables are combined based on a matching value, called the key, in each. But there’s more than one way to do this, especially when the tables don’t have perfectly matching keys.

Civil Wars:

| Country | Country ID | Est. Deaths |
|---|---|---|
| Afghanistan | 700 | 54753 |
| Algeria | 615 | 18488 |
| Angola | 540 | 23088 |
| …more rows | | |

Terrain:

| Country | Country ID | Log % Rough Terrain |
|---|---|---|
| Afghanistan | 700 | 4.2 |
| Albania | 339 | 2.3 |
| Algeria | 615 | 2.8 |
| Angola | 540 | 2.4 |
| …more rows | | |

[Diagram: both tables feed into a single combined data set.]

Join types

  • Inner joins preserve only keys that are present in both data sets. (Rows with multiple matches are duplicated)

[Diagram: an inner join of the Civil Wars and Terrain tables. The combined table keeps Afghanistan, Algeria, and Angola, whose keys appear in both tables, and drops Albania, which appears only in the terrain data.]
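To make the duplication point concrete, here’s a toy sketch (hypothetical IDs, not the real data):

```r
library(dplyr)

terrain <- data.frame(id = c(1, 2), rough = c(4.2, 2.3))
events  <- data.frame(id = c(1, 1, 3), deaths = c(5, 2, 7))

# id 1 matches twice, so its terrain row is duplicated;
# id 2 (terrain only) and id 3 (events only) are dropped
inner_join(terrain, events, by = join_by(id))
```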

left/right join

  • A left_join will keep all rows from the left-hand table plus any matching data from the right; a right_join will do the reverse. (“left” vs. “right” is entirely a matter of which table you reference first in the join command)

[Diagram: a left_join with civil_wars as the left-hand side and the terrain data as the right-hand side. The result keeps only the civil-war countries, so Albania is again dropped.]

Full Join

Full joins will preserve all keys in both data sets. (so Albania is kept, but with an NA in the deaths column)

[Diagram: a full join of the terrain data and civil_wars. All four countries are kept; Albania gets an NA in the Est. Deaths column.]
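A toy sketch of the same idea (hypothetical IDs):

```r
library(dplyr)

terrain <- data.frame(id = c(1, 2), rough = c(4.2, 2.3))
wars    <- data.frame(id = c(1, 3), deaths = c(10, 7))

# all three ids survive; unmatched cells become NA
full_join(terrain, wars, by = join_by(id))
```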

Joins in dplyr

left_join(left_table, right_table, by = join_by(left_key == right_key))
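A minimal sketch of this syntax with toy tables (hypothetical IDs and column names, standing in for the real data):

```r
library(dplyr)

terrain <- data.frame(id = c(1, 2, 3),
                      rough = c(4.2, 2.3, 2.8))
wars    <- data.frame(country_id = c(1, 3),
                      deaths = c(54753, 18488))

# every terrain row is kept; id 2 has no match, so deaths is NA.
# join_by(id == country_id) handles the differently named key columns.
left_join(terrain, wars, by = join_by(id == country_id))
```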

Back to the original problem

  • put the terrain data first and use a left_join

  • join on the Gleditsch Ward country ID. (country names are not very reliable!)

  • We’ll want to fill NA values in deaths with “0” because these states had no civil war deaths.

joined_data <- left_join(rugged,
                         civil_wars,
                         by = join_by(gwcode == gwno_a))

joined_data

Back to the original problem

  • We can use replace_na to replace NA values with 0s

  • And we’ll just create an indicator that’s TRUE if a country had any civil war deaths

joined_data <- left_join(rugged,
                         civil_wars,
                         by = join_by(gwcode == gwno_a)) |>
  mutate(deaths = replace_na(deaths, 0),   # replace_na comes from tidyr
         any_civil_war = deaths > 0)

joined_data

Back to the original problem

lm(any_civil_war ~ rough, data = joined_data)|>
  broom::tidy()
library(ggplot2)

ggplot(joined_data, aes(x = any_civil_war, y = rough)) +
  geom_boxplot() +
  theme_bw() +
  ylab("natural log of % mountainous terrain") +
  xlab("Any civil wars after 1989?")

Other joins

  • inner_join will drop any rows that don’t have a match in the other data set

  • full_join will avoid dropping any rows from either dataset.

  • anti_join will find rows in the left-hand table without a match on the right. (you can use this to find discrepancies between data sets)

  • fuzzy_join (part of a separate package called fuzzyjoin) will allow imperfect matches like slight misspellings.
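For instance, a toy sketch of using anti_join to spot mismatched keys (hypothetical IDs):

```r
library(dplyr)

terrain <- data.frame(id = c(1, 2, 3))
wars    <- data.frame(id = c(1, 3, 4))

# rows of terrain with no match in wars
anti_join(terrain, wars, by = join_by(id))

# flip the arguments to find wars rows missing from terrain
anti_join(wars, terrain, by = join_by(id))
```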