Data Wrangling Pt. II

What we’ll cover

  • Fancier joins

  • Working with text and dates

  • Basic plots with ggplot

Data

library(tidyverse) # loads read_csv() plus the wrangling/plotting functions used below

ccc <- read_csv('https://dataverse.harvard.edu/api/access/datafile/12027896')

Fancy joins

Anti-join

An anti_join() returns all rows of the left-hand side (LHS) table that do not have a match in the right-hand side (RHS) table. For instance:

library(countrycode)
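A minimal sketch of the idea (not necessarily the original slide's example; it assumes countrycode's built-in codelist table with its country.name.en column, plus a made-up reports tibble):

# made-up data: country names as they appear in some messy report
reports <- tibble(country = c("Germany", "Burma", "USA", "South Korea"))

# rows in `reports` whose names do NOT appear in the reference table
reports |>
  anti_join(codelist, by = join_by(country == country.name.en))

Any name that doesn't match the reference spelling (e.g., "Burma" vs. "Myanmar") is returned, which makes anti_join() handy for spotting records that will fail a later join.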

Cross joins

Also known as a Cartesian join. A cross_join() creates one row for every combination of a row from the LHS table and a row from the RHS table:

lhs <- tibble(letter = c("A", "B", "C"))
rhs <- tibble(number = c(1, 2))
combined <- cross_join(lhs, rhs)

combined
letter number
A 1
A 2
B 1
B 2
C 1
C 2

Nesting and unnesting

Problem: nested event data

Since events often include multiple behaviors and claims, some CCC variables contain “nested” data, where each value is separated by a semicolon:

ccc|>
  count(issue_tags_summary)|>
  arrange(-n)|>
  slice_head(n=5)
issue_tags_summary n
foreign affairs; war and peace 5530
civil rights; corruption; democracy; free speech; war and peace; patriotism; presidency 2127
democracy; war and peace; presidency 2075
war and peace 1513
economy; healthcare; war and peace; presidency 1275

How do we (non-destructively) get rid of the nesting from this to get something into the “tidy” format we want for analysis?

Splitting and unnesting

Once we split up the data, we’ve got three main options that can retain our tabular structure without losing any data: long, wide, or dummied (also known as “one-hot” encoding)

[Diagram: a nested “Issues” column from the original records reshaped three ways: long format (one row per event-issue pair), wide format (issue_1, issue_2, … columns), and dummy format (one 0/1 column per issue).]

Wide, long, and dummified

Wide format really only works here if the nested list has an ordering (otherwise, the decision to place a value in “issue1” vs. “issue2” is arbitrary)

ccc|>
  select(issue_tags_summary)|>
  mutate(issue_tags_summary = str_split(issue_tags_summary, ";"))|>
  unnest_wider(issue_tags_summary, names_sep='_')|>
  slice_head(n=5)
issue_tags_summary_1 issue_tags_summary_2 issue_tags_summary_3 issue_tags_summary_4 issue_tags_summary_5 issue_tags_summary_6 issue_tags_summary_7 issue_tags_summary_8 issue_tags_summary_9 issue_tags_summary_10 issue_tags_summary_11 issue_tags_summary_12 issue_tags_summary_13 issue_tags_summary_14 issue_tags_summary_15
banking and finance economy foreign affairs healthcare war and peace NA NA NA NA NA NA NA NA NA NA
war and peace policing NA NA NA NA NA NA NA NA NA NA NA NA NA
foreign affairs war and peace NA NA NA NA NA NA NA NA NA NA NA NA NA
foreign affairs war and peace NA NA NA NA NA NA NA NA NA NA NA NA NA
economy foreign affairs war and peace NA NA NA NA NA NA NA NA NA NA NA NA

Long format works well with many of our R functions, but we want to be careful about how we calculate/interpret proportions and counts because multi-issue events would receive more weight than single issue events.

ccc|>
  select(issue_tags_summary)|>
  mutate(
    event_id = row_number(),
    issue_tags_summary = str_split(issue_tags_summary, ";"))|>
  unnest_longer(issue_tags_summary)|>
  mutate(issue_tags_summary = str_squish(issue_tags_summary))|>
  slice_head(n=5)
issue_tags_summary event_id
banking and finance 1
economy 1
foreign affairs 1
healthcare 1
war and peace 1

Making dummies takes a little more legwork, but it preserves the 1 row per event structure of the original data.

ccc|>
  select(issue_tags_summary)|>
  mutate(
    event_id = row_number(),
    issue_tags_summary = str_split(issue_tags_summary, ";"),
    value = 1
    )|>
  unnest_longer(issue_tags_summary)|>
  mutate(issue_tags_summary = str_squish(issue_tags_summary))|>
  pivot_wider(names_from = issue_tags_summary, values_from =value, values_fill = 0
              )|>
  slice_head(n=5)
event_id banking and finance economy foreign affairs healthcare war and peace policing corruption racism criminal justice religion immigration patriotism women’s rights labor and workers rights energy environment guns legislative transportation lgbtqia animal rights indigenous peoples democracy presidency housing reproductive rights civil rights education judiciary NA local development taxes sports sexual violence covid disability rights drugs free speech science
1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Wide, long, and dummified: notes

  • All of this can be memory intensive for long strings or big data sets. For those scenarios, we can use methods and packages specifically designed for doing text analysis.

  • Nesting is often preferable for storage, transfer, and data entry because it’s compact with minimal repetition. So we typically don’t need to worry about this until we’re at the analysis stage.

  • There’s a set of (currently experimental) tidyr functions that combine the split-unnest process into a single command.
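For example, tidyr's separate_longer_delim() collapses the split-and-unnest steps above into a single call (a minimal sketch, equivalent to the str_split() + unnest_longer() approach):

ccc |>
  select(issue_tags_summary) |>
  separate_longer_delim(issue_tags_summary, delim = ";") |>
  mutate(issue_tags_summary = str_squish(issue_tags_summary)) |> # trim leftover spaces
  count(issue_tags_summary, sort = TRUE)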

Messy strings

Problem: spelling errors

Some variations on “demonstration” in the CCC data for 2025:

# keep rows with a single (non-nested) event type that mentions "demo"
demos <- ccc |>
  filter(!str_detect(event_type, ";|,")) |>
  filter(str_detect(tolower(event_type), "demo"))

demos|>
  count(event_type)|>
  arrange(-n)|>
  slice_tail(n=20)|>
  kableExtra::kbl()|>
  kableExtra::kable_classic()
event_type n
demonstration 18935
demonsatration 17
Demonstration 11
demonstrations 3
demosntration 3
demonstration and sit-in 2
Demonstartion 1
demomstration 1
demonstarion 1
demonstation 1
demonstraiton 1
demonstratiion 1
demonstratino 1
demontration 1
demostration 1
pdemonstration 1


“Manual” spelling correction

A labor-intensive (but precise) fix would be to use an exhaustive list of misspellings and then a logical test:

spellings<-c("demonstrations", 
             "demonsatration", 
             'demosntration', 
             'demomstration', 
             'Demonstration',
             'demonstration and sit-in',
             'demonstarion',
             'Demonstartion',
             'demonstation',
             'demonstratino',
             'pdemonstration',
             'demonstraiton',
             'demonstratiion',
             'demontration',
             'demostration',
             'demonstration'
             )

demos|>
  mutate(is_demonstration = event_type %in% spellings)|>
  count(is_demonstration)
is_demonstration n
TRUE 18981

Regular Expressions

Instead of typing every possibility, we could use a regular expression to catch values.

  • A regular expression is a way to match text patterns. It’s sort of like a wild-card search, but far more flexible.

  • Most mature programming languages have support for regular expressions (they’re really indispensable for this kind of problem!)

Some examples of regex operators:

Pattern Match
a|b ‘a’ or ‘b’
a+ ‘a’ one or more times
a* ‘a’ zero or more times
a? ‘a’ zero or one time
a{1,4} ‘a’ between 1 and 4 times
^a ‘a’ at the start of a string
a$ ‘a’ at the end of a string
(a) Assigns ‘a’ to a “capture” group
[^a-c] Anything except ‘a’, ‘b’, or ‘c’
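A few quick checks of these operators using stringr's str_detect() (a sketch; the example strings are made up):

str_detect(c("cat", "cot", "cut"), "c(a|o)t")  # TRUE TRUE FALSE: 'a' or 'o'
str_detect("colour", "colou?r")                # TRUE: 'u' zero or one time
str_detect(c("apple", "grape"), "^a")          # TRUE FALSE: 'a' at the start
str_detect(c("data", "date"), "a$")            # TRUE FALSE: 'a' at the end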

Regular Expressions

Here’s a regular expression you might use to match “demonstrations”

\b[d|D]emo\S+ons*\b

  • \b denotes a word boundary (start of a string, a space, punctuation mark, or newline)

  • [d|D] matches either a lowercase or uppercase d. (Within square brackets each character is already an alternative, so the | isn’t strictly needed; [dD] does the same thing.)

  • \S+ will match any non-whitespace character one or more times.

  • ons*\b means that the word must end with the letters on followed by zero or more s’s.

(Note that in R you need to escape backslashes, so \b will be \\b)

I recommend using a graphical regex-testing tool to develop and test complex expressions. Then just paste your results into an R command.

Regular Expressions

stringr functions:

  • str_detect returns TRUE if there’s a match anywhere in the string

  • str_replace will replace the first instance of a matching string with something else.

  • str_extract will extract the first instance of the string that matches your regular expression.

  • str_extract_all and str_replace_all will do this for all matches in a string, instead of just the first.

demos|>
  distinct(event_type)|>
  mutate(detect = str_detect(event_type, "\\b[d|D]emo\\S+ons*\\b"),
         replace = str_replace(event_type, "\\b[d|D]emo\\S+ons*\\b", "demonstration"),
         extract = str_extract(event_type, "\\b[d|D]emo\\S+ons*\\b")
         )
event_type detect replace extract
demonstration TRUE demonstration demonstration
demontration TRUE demonstration demontration
demostration TRUE demonstration demostration
demosntration TRUE demonstration demosntration
demonstarion TRUE demonstration demonstarion
pdemonstration FALSE pdemonstration NA
demomstration TRUE demonstration demomstration
demonstation TRUE demonstration demonstation
demonsatration TRUE demonstration demonsatration
Demonstration TRUE demonstration Demonstration
Demonstartion TRUE demonstration Demonstartion
demonstratino FALSE demonstratino NA
demonstrations TRUE demonstration demonstrations
demonstratiion TRUE demonstration demonstratiion
demonstration and sit-in TRUE demonstration and sit-in demonstration
demonstraiton TRUE demonstration demonstraiton

Regular Expressions: Notes

  • If you just need a simple string match, it might be better to use str_detect(x, coll("string_to_match")) to do literal matching (see the sketch after this list).

  • Any kind of wild card matching comes with some risk of errors, so be careful about using this in high-stakes scenarios.

  • Regular expressions can be computationally expensive because matching may require many passes (and backtracking) through a string.
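A minimal sketch of literal vs. regex matching (the example strings are made up):

# coll() treats the pattern as a literal string rather than a regex
str_detect(c("U.S.A.", "USSR"), "U.S.")        # TRUE TRUE  ("." is a regex wildcard)
str_detect(c("U.S.A.", "USSR"), coll("U.S.")) # TRUE FALSE (literal match only)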

Regular Expressions: Notes

This (a news headline about a major web outage, shown as an image in the original slides):

… was ultimately caused by this regular expression:


(?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

Making a dictionary with fuzzy matching

Finally, we might try to use some fuzzy matching to facilitate creating a dictionary.

  1. Make a list of valid strings and a list of all strings that occur in the data.

  2. Quantify the difference between each string.

  3. Join any rows where the distance is below some threshold. (typically setting this requires some manual inspection)

Making a dictionary with fuzzy matching

String Distance:

  • There are lots of ways to calculate the “distance” between strings.

  • The default for the R package “stringdist” is the restricted Damerau-Levenshtein distance (“optimal string alignment”, method = "osa"), which counts the number of operations needed to transform one string into another.

  • A single operation could be:

    • transposing “s” and “n” in “demosntration”
    • removing the “s” from “demonstrations” or adding an “n” to “demostration”
    • substituting “m” for “n” in “demomstration”
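For instance, each of these pairs is a single operation apart under the default method (a quick sketch using stringdist's stringdist() function with method = "osa"):

library(stringdist)

stringdist("demonstration", "demosntration", method = "osa")   # 1: transpose "n" and "s"
stringdist("demonstration", "demonstrations", method = "osa")  # 1: add/remove an "s"
stringdist("demonstration", "demomstration", method = "osa")   # 1: substitute "m" for "n"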

Making a dictionary with fuzzy matching

(note: if there’s a really large amount of text, then the expand.grid function below might eat up all your memory. So you would probably want to do this using a loop or just use an existing package for text analysis.)

library(stringdist)

valid_event_types = c("demonstration", 
                      "rally", 
                      "march",
                      "protest", 
                      "civil disobedience",
                      "counter-protest")
all_event_types<-ccc|>
  pluck('event_type')|>
  str_split(";")|>
  unlist()|>
  str_to_lower()|>
  str_squish()|>
  na.omit()|>
  unique()



distances<-expand.grid("valid" = valid_event_types, 
                       "all"  = all_event_types)|>
  mutate(distance = stringdist(valid, all, method='osa'))|>
  arrange(valid, distance)


distances|>
  filter(valid == "demonstration")|>
  ggplot(aes(x=distance, y=reorder(all, distance))) + 
  geom_bar(stat='identity') +
  ylab("Term") +
  xlab("Damerau-Levenshtein distance from \"demonstration\"") +
  theme_bw()

Making a dictionary with fuzzy matching

After some inspection, I decided that a distance below 4 was a reasonable cutoff for a match.

dictionary<-distances|>
  group_by(all)|>
  filter(distance < 4)|>
  slice_min(n=1, order_by=distance, with_ties=FALSE)


# viewing the distances for unmatched results:
anti_join(distances, dictionary, by=join_by(all == all))|>
  pivot_wider(names_from = valid, values_from =distance)|>
  kableExtra::kbl(caption='string distances for unmatched strings')|>
  kableExtra::kable_classic()
string distances for unmatched strings
all demonstration rally march protest civil disobedience counter-protest
caravan 10 5 5 7 16 13
other 11 5 5 5 16 12
demonstration and sit-in 11 22 22 20 21 19
vigil 12 5 5 7 15 15
(empty string) 13 5 5 7 18 15

Making a dictionary with fuzzy matching

Here’s how we might apply this dictionary to get the proportion of events featuring each “type” of activity across all events in the data:

ccc$event_id <- 1:nrow(ccc)

ccc_events<-ccc|>
  mutate(
    event_split = str_split(str_to_lower(event_type), ";"))|>
  select(event_id, event_split)|>
  unnest_longer(event_split)|>
  mutate(event_split = str_squish(event_split))|>
  left_join(dictionary, by =join_by(event_split == all))|>
  select(event_id, valid)|>
  mutate(one = 1)|>
  pivot_wider(names_from = valid, values_from = one, values_fill = 0)


colMeans(ccc_events[,2:ncol(ccc_events)])
     demonstration              rally              march    counter-protest 
       0.757491551        0.134209538        0.093203154        0.016259857 
civil disobedience            protest                 NA 
       0.003529854        0.099887345        0.004581299 

(Alternatively, you might just use a left join to add these values back on to the original CCC data)
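For example, a minimal sketch of that join (using the event_id column created above):

ccc <- ccc |>
  left_join(ccc_events, by = "event_id")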

Times and dates

Times and dates are complex

  • Different date formats: Aug 19, 2025; 2025-19-08; 19 August, 2025…

  • Time zones, leap years, and daylight saving time

  • Calendar/clock math, periods, and intervals

Times and dates

Base R uses specialized date and time classes to help with these complexities, but you’ll encounter issues if the dates aren’t formatted correctly:

as.Date("01/01/1970")              # everything is wrong here
[1] "0001-01-19"
as.Date("July 1, 2024")            # unable to recognize text
Error in charToDate(x): character string is not in a standard unambiguous format
as.Date("2025-09-20 13:40:13 EDT") # time is dropped here
[1] "2025-09-20"
as.Date(-1003)                     # this is fine for some reason
[1] "1967-04-04"
as.POSIXct("2025-09-30")
[1] "2025-09-30 EDT"
as.POSIXlt("2025-09-30")
[1] "2025-09-30 EDT"

Times and dates

The lubridate package provides a series of date conversion functions that can handle a lot of common date formats:

library(lubridate)
mdy('August 9, 2012')
[1] "2012-08-09"
dmy('9 Aug, 2012')
[1] "2012-08-09"
ymd('2012 8 9')
[1] "2012-08-09"

However, when this fails you can also use the format option of as.Date() to convert date strings as long as you supply a string that clarifies what each element represents:

as.Date('August 9, 2012', format = "%B %d, %Y")
[1] "2012-08-09"

Times and dates

For example, we can combine seq.Date() with lubridate's wday() and mday() helpers to flag every Friday the 13th in the data's date range:

days <- data.frame(date = seq.Date(from = min(ccc$date),
                                   to = max(ccc$date),
                                   by = "day")) |>
  mutate(fri = wday(date, label = TRUE) == "Fri",  # day-of-week label
         thirteenth = mday(date) == 13,            # day of the month
         f13 = fri & thirteenth)                   # Friday the 13th
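As a hypothetical follow-up (not from the original slides), we could join these flags onto the event data and compare average daily event counts on Fridays the 13th vs. other days:

ccc |>
  count(date, name = "events") |>    # events per calendar day
  right_join(days, by = "date") |>   # keep every day, even those with no events
  mutate(events = replace_na(events, 0)) |>
  group_by(f13) |>
  summarise(mean_events_per_day = mean(events))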

Visualization for exploratory analysis

Data

We’ve got some monthly counts of events organized by whether they were right or left wing.

event_counts<-ccc|>
  mutate(month = floor_date(date, unit='month'),
         valence = factor(valence, levels=c(0, 1, 2), 
                          labels=c("other/neither",
                                   'left-wing', 
                                   'right-wing'))
         )|>
  count(month, valence)|>
  complete(month = seq.Date(from= min(month), 
                            to=max(month), 
                            by='month'), 
           fill=list(n=0))|>
  drop_na()


event_counts
month valence n
2025-01-01 other/neither 1169
2025-01-01 left-wing 931
2025-01-01 right-wing 123
2025-02-01 other/neither 1045
2025-02-01 left-wing 1071
2025-02-01 right-wing 49
2025-03-01 other/neither 1273
2025-03-01 left-wing 2292
2025-03-01 right-wing 57
2025-04-01 other/neither 966
2025-04-01 left-wing 2939
2025-04-01 right-wing 113
2025-05-01 other/neither 1215
2025-05-01 left-wing 2063
2025-05-01 right-wing 26
2025-06-01 other/neither 1114
2025-06-01 left-wing 3381
2025-06-01 right-wing 194
2025-07-01 other/neither 971
2025-07-01 left-wing 2298
2025-07-01 right-wing 22
2025-08-01 other/neither 1259
2025-08-01 left-wing 2016
2025-08-01 right-wing 25

Parts of a plot

Required components:

  1. Data and mapping: We need to choose what elements we’ll be visualizing and how they map onto different parts of the plot (what goes on the x and y axis? If we’re color coding, what variables determine colors? Or shapes?)

  2. Layers: the visible representations of the data. The most important being geometries (bar, point, line etc.)

  3. Scales: the mapping from data units to distances, colors, etc.

  4. The remaining elements can be modified, but they’re often left at default values (ggplot assumes you want Cartesian coordinates unless you say otherwise, for instance)
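Putting the required pieces together in one (hypothetical) call, using the event_counts data built earlier:

ggplot(data = event_counts,                       # 1. the data...
       aes(x = month, y = n, color = valence)) +  #    ...and the aesthetic mapping
  geom_line() +                                   # 2. a layer (line geometry)
  scale_y_log10()                                 # 3. a scale (log-transformed y axis)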

Dimensionality

  • Data is multi-dimensional, but visuals are 2D.

  • A core challenge for good visualization is elegantly handling this dimensional mismatch by:

    • simplifying data
    • color/shape/symbol encoding
    • faceting

ggplot

You’ll always start with a call to ggplot

On its own, this just makes empty space.

ggplot()

Once we’ve got an aesthetic mapping, we start to get something more concrete:

base <- ggplot(data = event_counts, aes(x=n))

base

ggplot geometry

We can use the + to add more layers to a ggplot object, and we can add more than one kind of geometry to a single visualization.

base + geom_histogram()

base + geom_density() + geom_rug()

ggplot scales

We can also change the x or y scaling from the default using scale_ functions

base + geom_density()

base + geom_histogram() + scale_x_log10()

ggplot with multiple variables

The previous examples were all univariate plots, but we have a lot more to work with.

library(paletteer) # package for alternate color palettes
 


ggplot(event_counts, aes(x = month, y = n, color = valence)) +
  geom_line(lwd=.7) + 
  geom_point(aes(shape =valence), size=3) +
  theme_minimal() + 
  labs(x = "Date", 
       y='Number of events', 
       title='Monthly event counts through August 2025',
       caption = "Data source: Crowd Counting Consortium"
      )+
scale_x_date(date_breaks = "1 month", date_labels = "%B") +
scale_color_paletteer_d("nationalparkcolors::Acadia") 

Facets and small multiples

Big differences in scale between groups can make it difficult to see relative changes within each group. One way to deal with this is to use facets to draw multiple versions of the same plot.

In this example, the x-axis is fixed, but the y-axis is re-scaled for each group - making relative changes within groups more apparent (albeit at the cost of distorting absolute values)

event_counts|>
  mutate(valence = fct_relevel(valence, "left-wing", 
                                         "other/neither", 
                                         "right-wing"))|>
  ggplot(aes(x = month, y = n, color = valence)) +
  geom_line(lwd=.7) + 
  geom_point(aes(shape =valence), size=3) +
  theme_minimal() + 
  labs(x = "Date", 
       y='Number of events', 
       title='Monthly event counts through August 2025',
       caption = "Data source: Crowd Counting Consortium"
      ) +
  scale_x_date(date_breaks = "1 month", date_labels = "%B") +
  scale_color_paletteer_d("nationalparkcolors::Acadia")  +
  facet_wrap(~valence, ncol=1, scales='free_y') +
  # legend no longer needed here since the facets are already labeled
  theme(legend.position = "none")

General tips on data visualization

  • The best way to learn is through replication: find an example and adapt it. Galleries of worked ggplot examples are a great place to start.

  • Care.

  • Consider resolution and audience: slide decks should have big plots with minimal detail, written reports can tolerate more subtlety.

  • Almost all “3d” plots are pointless.