Data Wrangling Pt. II

What we’ll cover

  • Fancier joins

  • Working with text and dates

  • Basic plots with ggplot

Data

library(tidyverse) # loads read_csv() plus the wrangling/plotting functions used below

ccc <- read_csv('https://dataverse.harvard.edu/api/access/datafile/12027896')

Fancy joins

Anti-join

An anti_join() returns all rows of the left-hand side (LHS) table that do not have a match in the right-hand side (RHS) table. For instance:

library(countrycode)
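A minimal sketch of the idea (not necessarily the original slide's example; it assumes countrycode's built-in codelist table with its country.name.en column, plus a made-up reports tibble):

# made-up data: country names as they appear in some messy report
reports <- tibble(country = c("Germany", "Burma", "USA", "South Korea"))

# rows in `reports` whose names do NOT appear in the reference table
reports |>
  anti_join(codelist, by = join_by(country == country.name.en))

Any name that doesn't match the reference spelling (e.g., "Burma" vs. "Myanmar") is returned, which makes anti_join() handy for spotting records that will fail a later join.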

Cross joins

Also known as a Cartesian join. A cross_join() creates one row for every combination of a row from the LHS table and a row from the RHS table:

lhs <- tibble(letter = c("A", "B", "C"))
rhs <- tibble(number = c(1, 2))
combined <- cross_join(lhs, rhs)

combined
letter number
A 1
A 2
B 1
B 2
C 1
C 2

Nesting and unnesting

Problem: nested event data

Since events often include multiple behaviors and claims, some CCC variables contain “nested” data, where each value is separated by a semicolon:

ccc|>
  count(issue_tags_summary)|>
  arrange(-n)|>
  slice_head(n=5)
issue_tags_summary n
foreign affairs; war and peace 5530
civil rights; corruption; democracy; free speech; war and peace; patriotism; presidency 2127
democracy; war and peace; presidency 2075
war and peace 1513
economy; healthcare; war and peace; presidency 1275

How do we (non-destructively) get rid of the nesting from this to get something into the “tidy” format we want for analysis?

Splitting and unnesting

Once we split up the data, we’ve got three main options that can retain our tabular structure without losing any data: long, wide, or dummied (also known as “one-hot” encoding)

[Diagram: a nested “Issues” column from the original records reshaped three ways: long format (one row per event-issue pair), wide format (issue_1, issue_2, … columns), and dummy format (one 0/1 column per issue).]

Wide, long, and dummified

Wide format really only works here if the nested list has an ordering (otherwise, the decision to place a value in “issue1” vs. “issue2” is arbitrary)

ccc|>
  select(issue_tags_summary)|>
  mutate(issue_tags_summary = str_split(issue_tags_summary, ";"))|>
  unnest_wider(issue_tags_summary, names_sep='_')|>
  slice_head(n=5)
issue_tags_summary_1 issue_tags_summary_2 issue_tags_summary_3 issue_tags_summary_4 issue_tags_summary_5 issue_tags_summary_6 issue_tags_summary_7 issue_tags_summary_8 issue_tags_summary_9 issue_tags_summary_10 issue_tags_summary_11 issue_tags_summary_12 issue_tags_summary_13 issue_tags_summary_14 issue_tags_summary_15
banking and finance economy foreign affairs healthcare war and peace NA NA NA NA NA NA NA NA NA NA
war and peace policing NA NA NA NA NA NA NA NA NA NA NA NA NA
foreign affairs war and peace NA NA NA NA NA NA NA NA NA NA NA NA NA
foreign affairs war and peace NA NA NA NA NA NA NA NA NA NA NA NA NA
economy foreign affairs war and peace NA NA NA NA NA NA NA NA NA NA NA NA

Long format works well with many of our R functions, but we want to be careful about how we calculate/interpret proportions and counts because multi-issue events would receive more weight than single issue events.

ccc|>
  select(issue_tags_summary)|>
  mutate(
    event_id = row_number(),
    issue_tags_summary = str_split(issue_tags_summary, ";"))|>
  unnest_longer(issue_tags_summary)|>
  mutate(issue_tags_summary = str_squish(issue_tags_summary))|>
  slice_head(n=5)
issue_tags_summary event_id
banking and finance 1
economy 1
foreign affairs 1
healthcare 1
war and peace 1

Making dummies takes a little more legwork, but it preserves the 1 row per event structure of the original data.

ccc|>
  select(issue_tags_summary)|>
  mutate(
    event_id = row_number(),
    issue_tags_summary = str_split(issue_tags_summary, ";"),
    value = 1
    )|>
  unnest_longer(issue_tags_summary)|>
  mutate(issue_tags_summary = str_squish(issue_tags_summary))|>
  pivot_wider(names_from = issue_tags_summary, values_from =value, values_fill = 0
              )|>
  slice_head(n=5)
event_id banking and finance economy foreign affairs healthcare war and peace policing corruption racism criminal justice religion immigration patriotism women’s rights labor and workers rights energy environment guns legislative transportation lgbtqia animal rights indigenous peoples democracy presidency housing reproductive rights civil rights education judiciary NA local development taxes sports sexual violence covid disability rights drugs free speech science
1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Wide, long, and dummified: notes

  • All of this can be memory intensive for long strings or big data sets. For those scenarios, we can use methods and packages specifically designed for doing text analysis.

  • Nesting is often preferable for storage, transfer, and data entry because it’s compact with minimal repetition. So we typically don’t need to worry about this until we’re at the analysis stage.

  • There’s a set of (currently experimental) tidyr functions that combine the split-unnest process into a single command.
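For example, tidyr's separate_longer_delim() collapses the split-and-unnest steps above into a single call (a minimal sketch, equivalent to the str_split() + unnest_longer() approach):

ccc |>
  select(issue_tags_summary) |>
  separate_longer_delim(issue_tags_summary, delim = ";") |>
  mutate(issue_tags_summary = str_squish(issue_tags_summary)) |> # trim leftover spaces
  count(issue_tags_summary, sort = TRUE)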

Messy strings

Problem: spelling errors

Some variations on “demonstration” in the CCC data for 2025:

# keep rows with a single (non-nested) event type that mentions "demo"
demos <- ccc |>
  filter(!str_detect(event_type, ";|,")) |>
  filter(str_detect(tolower(event_type), "demo"))

demos|>
  count(event_type)|>
  arrange(-n)|>
  slice_tail(n=20)|>
  kableExtra::kbl()|>
  kableExtra::kable_classic()
event_type n
demonstration 18935
demonsatration 17
Demonstration 11
demonstrations 3
demosntration 3
demonstration and sit-in 2
Demonstartion 1
demomstration 1
demonstarion 1
demonstation 1
demonstraiton 1
demonstratiion 1
demonstratino 1
demontration 1
demostration 1
pdemonstration 1


“Manual” spelling correction

A labor-intensive (but precise) fix would be to use an exhaustive list of misspellings and then a logical test:

spellings<-c("demonstrations", 
             "demonsatration", 
             'demosntration', 
             'demomstration', 
             'Demonstration',
             'demonstration and sit-in',
             'demonstarion',
             'Demonstartion',
             'demonstation',
             'demonstratino',
             'pdemonstration',
             'demonstraiton',
             'demonstratiion',
             'demontration',
             'demostration',
             'demonstration'
             )

demos|>
  mutate(is_demonstration = event_type %in% spellings)|>
  count(is_demonstration)
is_demonstration n
TRUE 18981

Regular Expressions

Instead of typing every possibility, we could use a regular expression to catch values.

  • A regular expression is a way to match text patterns. It’s sort of like a wild-card search, but far more flexible.

  • Most mature programming languages have support for regular expressions (they’re really indispensable for this kind of problem!)

Some examples of regex operators:

Pattern Match
a|b ‘a’ or ‘b’
a+ ‘a’ one or more times
a* ‘a’ zero or more times
a? ‘a’ zero or one time
a{1,4} ‘a’ between 1 and 4 times
^a ‘a’ at the start of a string
a$ ‘a’ at the end of a string
(a) Assigns ‘a’ to a “capture” group
[^a-c] Anything except ‘a’, ‘b’, or ‘c’
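A few quick checks of these operators using stringr's str_detect() (a sketch; the example strings are made up):

str_detect(c("cat", "cot", "cut"), "c(a|o)t")  # TRUE TRUE FALSE: 'a' or 'o'
str_detect("colour", "colou?r")                # TRUE: 'u' zero or one time
str_detect(c("apple", "grape"), "^a")          # TRUE FALSE: 'a' at the start
str_detect(c("data", "date"), "a$")            # TRUE FALSE: 'a' at the end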

Regular Expressions

Here’s a regular expression you might use to match “demonstrations”

\b[d|D]emo\S+ons*\b

  • \b denotes a word boundary (start of a string, a space, punctuation mark, or newline)

  • [d|D] matches either a lowercase or uppercase d. (Within square brackets each character is already an alternative, so the | isn’t strictly needed; [dD] does the same thing.)

  • \S+ will match any non-whitespace character one or more times.

  • ons*\b means that the word must end with the letters on followed by zero or more s’s.

(Note that in R you need to escape backslashes, so \b will be \\b)

I recommend using a graphical regex-testing tool to develop and test complex expressions. Then just paste your results into an R command.

Regular Expressions

stringr functions:

  • str_detect returns TRUE if there’s a match anywhere in the string

  • str_replace will replace the first instance of a matching string with something else.

  • str_extract will extract the first instance of the string that matches your regular expression.

  • str_extract_all and str_replace_all will do this for all matches in a string, instead of just the first.

demos|>
  distinct(event_type)|>
  mutate(detect = str_detect(event_type, "\\b[d|D]emo\\S+ons*\\b"),
         replace = str_replace(event_type, "\\b[d|D]emo\\S+ons*\\b", "demonstration"),
         extract = str_extract(event_type, "\\b[d|D]emo\\S+ons*\\b")
         )
event_type detect replace extract
demonstration TRUE demonstration demonstration
demontration TRUE demonstration demontration
demostration TRUE demonstration demostration
demosntration TRUE demonstration demosntration
demonstarion TRUE demonstration demonstarion
pdemonstration FALSE pdemonstration NA
demomstration TRUE demonstration demomstration
demonstation TRUE demonstration demonstation
demonsatration TRUE demonstration demonsatration
Demonstration TRUE demonstration Demonstration
Demonstartion TRUE demonstration Demonstartion
demonstratino FALSE demonstratino NA
demonstrations TRUE demonstration demonstrations
demonstratiion TRUE demonstration demonstratiion
demonstration and sit-in TRUE demonstration and sit-in demonstration
demonstraiton TRUE demonstration demonstraiton

Regular Expressions: Notes

  • If you just need a simple string match, it might be better to use str_detect(x, coll("string_to_match")) to do literal matching (see the sketch after this list).

  • Any kind of wild card matching comes with some risk of errors, so be careful about using this in high-stakes scenarios.

  • Regular expressions can be computationally expensive because matching may require many passes (and backtracking) through a string.
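A minimal sketch of literal vs. regex matching (the example strings are made up):

# coll() treats the pattern as a literal string rather than a regex
str_detect(c("U.S.A.", "USSR"), "U.S.")        # TRUE TRUE  ("." is a regex wildcard)
str_detect(c("U.S.A.", "USSR"), coll("U.S.")) # TRUE FALSE (literal match only)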

Regular Expressions: Notes

This (a news headline about a major web outage, shown as an image in the original slides):

… was ultimately caused by this regular expression:


(?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))

Making a dictionary with fuzzy matching

Finally, we might try to use some fuzzy matching to facilitate creating a dictionary.

  1. Make a list of valid strings and a list of all strings that occur in the data.

  2. Quantify the difference between each string.

  3. Join any rows where the distance is below some threshold. (typically setting this requires some manual inspection)

Making a dictionary with fuzzy matching

String Distance:

  • There are lots of ways to calculate the “distance” between strings.

  • The default for the R package “stringdist” is the restricted Damerau-Levenshtein distance (“optimal string alignment”, method = "osa"), which counts the number of operations needed to transform one string into another.

  • A single operation could be:

    • transposing “s” and “n” in “demosntration”
    • removing the “s” from “demonstrations” or adding an “n” to “demostration”
    • substituting “m” for “n” in “demomstration”
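For instance, each of these pairs is a single operation apart under the default method (a quick sketch using stringdist's stringdist() function with method = "osa"):

library(stringdist)

stringdist("demonstration", "demosntration", method = "osa")   # 1: transpose "n" and "s"
stringdist("demonstration", "demonstrations", method = "osa")  # 1: add/remove an "s"
stringdist("demonstration", "demomstration", method = "osa")   # 1: substitute "m" for "n"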

Making a dictionary with fuzzy matching

(note: if there’s a really large amount of text, then the expand.grid function below might eat up all your memory. So you would probably want to do this using a loop or just use an existing package for text analysis.)

library(stringdist)

valid_event_types = c("demonstration", 
                      "rally", 
                      "march",
                      "protest", 
                      "civil disobedience",
                      "counter-protest")
all_event_types<-ccc|>
  pluck('event_type')|>
  str_split(";")|>
  unlist()|>
  str_to_lower()|>
  str_squish()|>
  na.omit()|>
  unique()



distances<-expand.grid("valid" = valid_event_types, 
                       "all"  = all_event_types)|>
  mutate(distance = stringdist(valid, all, method='osa'))|>
  arrange(valid, distance)


distances|>
  filter(valid == "demonstration")|>
  ggplot(aes(x=distance, y=reorder(all, distance))) + 
  geom_bar(stat='identity') +
  ylab("Term") +
  xlab("Damerau-Levenshtein distance from \"demonstration\"") +
  theme_bw()

Making a dictionary with fuzzy matching

After some inspection, I decided that a distance below 4 was a reasonable cutoff for a match.

dictionary<-distances|>
  group_by(all)|>
  filter(distance < 4)|>
  slice_min(n=1, order_by=distance, with_ties=FALSE)


# viewing the distances for unmatched results:
anti_join(distances, dictionary, by=join_by(all == all))|>
  pivot_wider(names_from = valid, values_from =distance)|>
  kableExtra::kbl(caption='string distances for unmatched strings')|>
  kableExtra::kable_classic()
string distances for unmatched strings
all demonstration rally march protest civil disobedience counter-protest
caravan 10 5 5 7 16 13
other 11 5 5 5 16 12
demonstration and sit-in 11 22 22 20 21 19
vigil 12 5 5 7 15 15
(empty string) 13 5 5 7 18 15

Making a dictionary with fuzzy matching

Here’s how we might apply this dictionary to get the proportion of events featuring each “type” of activity across all events in the data:

ccc$event_id <- 1:nrow(ccc)

ccc_events<-ccc|>
  mutate(
    event_split = str_split(str_to_lower(event_type), ";"))|>
  select(event_id, event_split)|>
  unnest_longer(event_split)|>
  mutate(event_split = str_squish(event_split))|>
  left_join(dictionary, by =join_by(event_split == all))|>
  select(event_id, valid)|>
  mutate(one = 1)|>
  pivot_wider(names_from = valid, values_from = one, values_fill = 0)


colMeans(ccc_events[,2:ncol(ccc_events)])
     demonstration              rally              march    counter-protest 
       0.757491551        0.134209538        0.093203154        0.016259857 
civil disobedience            protest                 NA 
       0.003529854        0.099887345        0.004581299 

(Alternatively, you might just use a left join to add these values back on to the original CCC data)
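For example, a minimal sketch of that join (using the event_id column created above):

ccc <- ccc |>
  left_join(ccc_events, by = "event_id")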

Times and dates

Times and dates are complex

  • Different date formats: Aug 19, 2025; 2025-19-08; 19 August, 2025…

  • Time zones, leap years, and daylight saving time

  • Calendar/clock math, periods, and intervals

Times and dates

Base R uses specialized date and time classes to help with these complexities, but you’ll encounter issues if the dates aren’t formatted correctly:

as.Date("01/01/1970")              # everything is wrong here
[1] "0001-01-19"
as.Date("July 1, 2024")            # unable to recognize text
Error in charToDate(x): character string is not in a standard unambiguous format
as.Date("2025-09-20 13:40:13 EDT") # time is dropped here
[1] "2025-09-20"
as.Date(-1003)                     # this is fine for some reason
[1] "1967-04-04"
as.POSIXct("2025-09-30")
[1] "2025-09-30 EDT"
as.POSIXlt("2025-09-30")
[1] "2025-09-30 EDT"

Times and dates

The lubridate package provides a series of date conversion functions that can handle a lot of common date formats:

library(lubridate)
mdy('August 9, 2012')
[1] "2012-08-09"
dmy('9 Aug, 2012')
[1] "2012-08-09"
ymd('2012 8 9')
[1] "2012-08-09"

However, when this fails you can also use the format option of as.Date() to convert date strings as long as you supply a string that clarifies what each element represents:

as.Date('August 9, 2012', format = "%B %d, %Y")
[1] "2012-08-09"

Times and dates

For example, we can combine seq.Date() with lubridate's wday() and mday() helpers to flag every Friday the 13th in the data's date range:

days <- data.frame(date = seq.Date(from = min(ccc$date),
                                   to = max(ccc$date),
                                   by = "day")) |>
  mutate(fri = wday(date, label = TRUE) == "Fri",  # day-of-week label
         thirteenth = mday(date) == 13,            # day of the month
         f13 = fri & thirteenth)                   # Friday the 13th
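As a hypothetical follow-up (not from the original slides), we could join these flags onto the event data and compare average daily event counts on Fridays the 13th vs. other days:

ccc |>
  count(date, name = "events") |>    # events per calendar day
  right_join(days, by = "date") |>   # keep every day, even those with no events
  mutate(events = replace_na(events, 0)) |>
  group_by(f13) |>
  summarise(mean_events_per_day = mean(events))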

Visualization for exploratory analysis

Data

We’ve got some monthly counts of events organized by whether they were right or left wing.

event_counts<-ccc|>
  mutate(month = floor_date(date, unit='month'),
         valence = factor(valence, levels=c(0, 1, 2), 
                          labels=c("other/neither",
                                   'left-wing', 
                                   'right-wing'))
         )|>
  count(month, valence)|>
  complete(month = seq.Date(from= min(month), 
                            to=max(month), 
                            by='month'), 
           fill=list(n=0))|>
  drop_na()


event_counts
month valence n
2025-01-01 other/neither 1169
2025-01-01 left-wing 931
2025-01-01 right-wing 123
2025-02-01 other/neither 1045
2025-02-01 left-wing 1071
2025-02-01 right-wing 49
2025-03-01 other/neither 1273
2025-03-01 left-wing 2292
2025-03-01 right-wing 57
2025-04-01 other/neither 966
2025-04-01 left-wing 2939
2025-04-01 right-wing 113
2025-05-01 other/neither 1215
2025-05-01 left-wing 2063
2025-05-01 right-wing 26
2025-06-01 other/neither 1114
2025-06-01 left-wing 3381
2025-06-01 right-wing 194
2025-07-01 other/neither 971
2025-07-01 left-wing 2298
2025-07-01 right-wing 22
2025-08-01 other/neither 1259
2025-08-01 left-wing 2016
2025-08-01 right-wing 25

Parts of a plot

Required components:

  1. Data and mapping: We need to choose what elements we’ll be visualizing and how they map onto different parts of the plot (what goes on the x and y axis? If we’re color coding, what variables determine colors? Or shapes?)

  2. Layers: the visible representations of the data. The most important being geometries (bar, point, line etc.)

  3. Scales: the mapping from data units to distances, colors, etc.

  4. The remaining elements can be modified, but they’re often left at default values (ggplot assumes you want Cartesian coordinates unless you say otherwise, for instance)
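Putting the required pieces together in one (hypothetical) call, using the event_counts data built earlier:

ggplot(data = event_counts,                       # 1. the data...
       aes(x = month, y = n, color = valence)) +  #    ...and the aesthetic mapping
  geom_line() +                                   # 2. a layer (line geometry)
  scale_y_log10()                                 # 3. a scale (log-transformed y axis)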

Dimensionality

  • Data is multi-dimensional, but visuals are 2D.

  • A core challenge for good visualization is elegantly handling this dimensional mismatch by:

    • simplifying data
    • color/shape/symbol encoding
    • faceting

ggplot

You’ll always start with a call to ggplot

On its own, this just makes empty space.

ggplot()

Once we’ve got an aesthetic mapping, we start to get something more concrete:

base <- ggplot(data = event_counts, aes(x=n))

base

ggplot geometry

We can use the + to add more layers to a ggplot object, and we can add more than one kind of geometry to a single visualization.

base + geom_histogram()

base + geom_density() + geom_rug()

ggplot scales

We can also change the x or y scaling from the default using scale_ functions

base + geom_density()

base + geom_histogram() + scale_x_log10()

ggplot with multiple variables

The previous examples were all univariate plots, but we have a lot more to work with.

library(paletteer) # package for alternate color palettes
 


ggplot(event_counts, aes(x = month, y = n, color = valence)) +
  geom_line(lwd=.7) + 
  geom_point(aes(shape =valence), size=3) +
  theme_minimal() + 
  labs(x = "Date", 
       y='Number of events', 
       title='Monthly event counts through August 2025',
       caption = "Data source: Crowd Counting Consortium"
      )+
scale_x_date(date_breaks = "1 month", date_labels = "%B") +
scale_color_paletteer_d("nationalparkcolors::Acadia") 

Facets and small multiples

Big differences in scale between groups can make it difficult to see relative changes within each group. One way to deal with this is to use facets to draw multiple versions of the same plot.

In this example, the x-axis is fixed, but the y-axis is re-scaled for each group - making relative changes within groups more apparent (albeit at the cost of distorting absolute values)

event_counts|>
  mutate(valence = fct_relevel(valence, "left-wing", 
                                         "other/neither", 
                                         "right-wing"))|>
  ggplot(aes(x = month, y = n, color = valence)) +
  geom_line(lwd=.7) + 
  geom_point(aes(shape =valence), size=3) +
  theme_minimal() + 
  labs(x = "Date", 
       y='Number of events', 
       title='Monthly event counts through August 2025',
       caption = "Data source: Crowd Counting Consortium"
      ) +
  scale_x_date(date_breaks = "1 month", date_labels = "%B") +
  scale_color_paletteer_d("nationalparkcolors::Acadia")  +
  facet_wrap(~valence, ncol=1, scales='free_y') +
  # legend no longer needed here since the facets are already labeled
  theme(legend.position = "none")

General tips on data visualization

  • The best way to learn is through replication: find an example and adapt it. Galleries of worked ggplot examples are a great place to start.

  • Care.

  • Consider resolution and audience: slide decks should have big plots with minimal detail, written reports can tolerate more subtlety.

  • Almost all “3d” plots are pointless.