Fancier joins
Working with text and dates
Basic plots with ggplot
Since events often include multiple behaviors and claims, CCC variables allow “nested” data, where each value is separated by a semi-colon:
issue_tags_summary | n |
---|---|
foreign affairs; war and peace | 5530 |
civil rights; corruption; democracy; free speech; war and peace; patriotism; presidency | 2127 |
democracy; war and peace; presidency | 2075 |
war and peace | 1513 |
economy; healthcare; war and peace; presidency | 1275 |
How do we (non-destructively) get rid of the nesting from this to get something into the “tidy” format we want for analysis?
Once we split up the data, we’ve got three main options that can retain our tabular structure without losing any data: long, wide, or dummied (also known as “one-hot” encoding)
Wide format really only works here if the nested list has an ordering (otherwise, the decision to place a value in “issue1” vs. “issue2” is arbitrary)
ccc|>
select(issue_tags_summary)|>
mutate(issue_tags_summary = str_split(issue_tags_summary, ";"))|>
unnest_wider(issue_tags_summary, names_sep='_')|>
slice_head(n=5)
issue_tags_summary_1 | issue_tags_summary_2 | issue_tags_summary_3 | issue_tags_summary_4 | issue_tags_summary_5 | issue_tags_summary_6 | issue_tags_summary_7 | issue_tags_summary_8 | issue_tags_summary_9 | issue_tags_summary_10 | issue_tags_summary_11 | issue_tags_summary_12 | issue_tags_summary_13 | issue_tags_summary_14 | issue_tags_summary_15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
banking and finance | economy | foreign affairs | healthcare | war and peace | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
war and peace | policing | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
foreign affairs | war and peace | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
foreign affairs | war and peace | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
economy | foreign affairs | war and peace | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Long format works well with many of our R functions, but we want to be careful about how we calculate/interpret proportions and counts because multi-issue events would receive more weight than single issue events.
issue_tags_summary | event_id |
---|---|
banking and finance | 1 |
economy | 1 |
foreign affairs | 1 |
healthcare | 1 |
war and peace | 1 |
Making dummies takes a little more legwork, but it preserves the 1 row per event structure of the original data.
ccc|>
select(issue_tags_summary)|>
mutate(
event_id = row_number(),
issue_tags_summary = str_split(issue_tags_summary, ";"),
value = 1
)|>
unnest_longer(issue_tags_summary)|>
mutate(issue_tags_summary = str_squish(issue_tags_summary))|>
pivot_wider(names_from = issue_tags_summary, values_from =value, values_fill = 0
)|>
slice_head(n=5)
event_id | banking and finance | economy | foreign affairs | healthcare | war and peace | policing | corruption | racism | criminal justice | religion | immigration | patriotism | women’s rights | labor and workers rights | energy | environment | guns | legislative | transportation | lgbtqia | animal rights | indigenous peoples | democracy | presidency | housing | reproductive rights | civil rights | education | judiciary | NA | local development | taxes | sports | sexual violence | covid | disability rights | drugs | free speech | science |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
All of this can be memory intensive for long strings or big data sets. For those scenarios, we can use methods and packages specifically designed for doing text analysis.
Nesting is often preferable for storage, transfer, and data entry because it’s compact with minimal repetition. So we typically don’t need to worry about this until we’re at the analysis stage.
There’s a set of (currently experimental) tidyr functions that combine the split-unnest process into a single command.
Some variations on “demonstration” in the CCC data for 2025:
event_type | n |
---|---|
demonstration | 18935 |
demonsatration | 17 |
Demonstration | 11 |
demonstrations | 3 |
demosntration | 3 |
demonstration and sit-in | 2 |
Demonstartion | 1 |
demomstration | 1 |
demonstarion | 1 |
demonstation | 1 |
demonstraiton | 1 |
demonstratiion | 1 |
demonstratino | 1 |
demontration | 1 |
demostration | 1 |
pdemonstration | 1 |
::: :::::
A labor-intensive - but precise - fix would be to use an exhaustive list of mis-spellings and then a logical expression:
spellings<-c("demonstrations",
"demonsatration",
'demosntration',
'demomstration',
'Demonstration',
'demonstration and sit-in',
'demonstarion',
'Demonstartion',
'demonstation',
'demonstratino',
'pdemonstration',
'demonstraiton',
'demonstratiion',
'demontration',
'demostration',
'demonstration'
)
demos|>
mutate(is_demonstration = event_type %in% spellings)|>
count(is_demonstration)
is_demonstration | n |
---|---|
TRUE | 18981 |
Instead of typing every possibility, we could use a regular expression to catch values.
A regular expression is a way to match text patterns. It’s sort of like a wild-card search, but far more flexible.
Most mature programming languages will have support for using regular expressions (they’re really indespensible for this kind of problem!)
Some examples of regex operators:
Pattern | Match |
---|---|
a|b | ‘a’ or ‘b’ |
a+ | ‘a’ one or more times |
a* | ‘a’ zero or more times |
a? | ‘a’ zero or one time |
a{1, 4} | ‘a’ between 1 to 5 times |
^a | ‘a’ at the start of a string |
a$ | ‘a’ at the end of a string |
(a) | Assigns ‘a’ to a “capture” group |
[^a-c] | Anything except ‘a’, ‘b’, or ‘c’ |
Here’s a regular expression you might use to match “demonstrations”
\[ \text{\b[d|D]emo\S+ons*\b} \]
\b
denotes a word boundary (start of a string, a space, punctuation mark, or newline)
[D|d]
matches either upper case D or lower case D.
\S+
will match any non-whitespace character between one and infinite times.
ons*\b
means that the word must end with the letters on
followed by either an s
or nothing.
(Note that in R you need to escape backslashes, so \b
will be \\b
)
I recommend using a graphical interface this one to develop and test complex expressions. Then just paste your results into an R command.
stringr
functions:
str_detect
returns TRUE
if there’s a match anywhere in the string
str_replace
will replace the first instance of a matching string with something else.
str_extract
will extract first instance of the string that matches your regular expression.
str_extract_all
and str_replace_all
will do this for all matches in a string, instead of just the first.
demos|>
distinct(event_type)|>
mutate(detect = str_detect(event_type, "\\b[d|D]emo\\S+ons*\\b"),
replace = str_replace(event_type, "\\b[d|D]emo\\S+ons*\\b", "demonstration"),
extract = str_extract(event_type, "\\b[d|D]emo\\S+ons*\\b")
)
event_type | detect | replace | extract |
---|---|---|---|
demonstration | TRUE | demonstration | demonstration |
demontration | TRUE | demonstration | demontration |
demostration | TRUE | demonstration | demostration |
demosntration | TRUE | demonstration | demosntration |
demonstarion | TRUE | demonstration | demonstarion |
pdemonstration | FALSE | pdemonstration | NA |
demomstration | TRUE | demonstration | demomstration |
demonstation | TRUE | demonstration | demonstation |
demonsatration | TRUE | demonstration | demonsatration |
Demonstration | TRUE | demonstration | Demonstration |
Demonstartion | TRUE | demonstration | Demonstartion |
demonstratino | FALSE | demonstratino | NA |
demonstrations | TRUE | demonstration | demonstrations |
demonstratiion | TRUE | demonstration | demonstratiion |
demonstration and sit-in | TRUE | demonstration and sit-in | demonstration |
demonstraiton | TRUE | demonstration | demonstraiton |
If you just need a simple string match, it might be better to use str_detect(coll(x), "string_to_match")
to do literal matching.
Any kind of wild card matching comes with some risk of errors, so be careful about using this in high-stakes scenarios.
Regular expressions can be computationally expensive because they require multiple passes through a string.
This:
… Was ultimately caused by this regular expression:
(?:(?:\"|'|\]|\}|\\|\d|(?:nan|infinity|true|false|null|undefined|symbol|math)|\`|\-|\+)+[)]*;?((?:\s|-|~|!|{}|\|\||\+)*.*(?:.*=.*)))
Finally, we might try to use some fuzzy matching to facilitate creating a dictionary.
Make a list of valid strings and a list of all strings that occur in the data.
Quantify the difference between each string.
Join any rows where the distance is below some threshold. (typically setting this requires some manual inspection)
String Distance:
There are lots of ways to calculate the “distance” between strings.
The default for the R package “stringdist” is the restricted Damerau-Levenshtein distance, which counts the number of operations need to transform one string into another.
A single operation could be:
(note: if there’s a really large amount of text, then the expand.grid
function below might eat up all your memory. So you would probably want to do this using a loop or just use an existing package for text analysis.)
library(stringdist)
valid_event_types = c("demonstration",
"rally",
"march",
"protest",
"civil disobedience",
"counter-protest")
all_event_types<-ccc|>
pluck('event_type')|>
str_split(";")|>
unlist()|>
str_to_lower()|>
str_squish()|>
na.omit()|>
unique()
distances<-expand.grid("valid" = valid_event_types,
"all" = all_event_types)|>
mutate(distance = stringdist(valid, all, method='osa'))|>
arrange(valid, distance)
distances|>
filter(valid == "demonstration")|>
ggplot(aes(x=distance, y=reorder(all, distance))) +
geom_bar(stat='identity') +
ylab("Term") +
xlab("Damerau-Levenshtein distance from \"demonstration\"") +
theme_bw()
After some inspection, I decided a distance of around 4 should be the maximum.
dictionary<-distances|>
group_by(all)|>
filter(distance < 4)|>
slice_min(n=1, order_by=distance, with_ties=FALSE)
# viewing the distances for unmatched results:
anti_join(distances, dictionary, by=join_by(all == all))|>
pivot_wider(names_from = valid, values_from =distance)|>
kableExtra::kbl(caption='string distances for unmatched strings')|>
kableExtra::kable_classic()
all | demonstration | rally | march | protest | civil disobedience | counter-protest |
---|---|---|---|---|---|---|
caravan | 10 | 5 | 5 | 7 | 16 | 13 |
other | 11 | 5 | 5 | 5 | 16 | 12 |
demonstration and sit-in | 11 | 22 | 22 | 20 | 21 | 19 |
vigil | 12 | 5 | 5 | 7 | 15 | 15 |
13 | 5 | 5 | 7 | 18 | 15 |
Here’s how we might apply this dictionary to get the proportion of events featuring each “type” of activity across all events in the data:
ccc$event_id <- 1:nrow(ccc)
ccc_events<-ccc|>
mutate(
event_split = str_split(str_to_lower(event_type), ";"))|>
select(event_id, event_split)|>
unnest_longer(event_split)|>
mutate(event_split = str_squish(event_split))|>
left_join(dictionary, by =join_by(event_split == all))|>
select(event_id, valid)|>
mutate(one = 1)|>
pivot_wider(names_from = valid, values_from = one, values_fill = 0)
colMeans(ccc_events[,2:ncol(ccc_events)])
demonstration rally march counter-protest
0.757491551 0.134209538 0.093203154 0.016259857
civil disobedience protest NA
0.003529854 0.099887345 0.004581299
(Alternatively, you might just use a left join to add these values back on to the original CCC data)
Times and dates are complex
Different date formats: Aug 19, 2025; 2025-19-08; 19 August, 2025…
Time zones, leap years, and daylight savings
Calendar/clock math, periods, and intervals
Base R uses specialized date and time classes to help with these complexities, but you’ll encounter issues if the dates aren’t formatted correctly:
[1] "0001-01-19"
Error in charToDate(x): character string is not in a standard unambiguous format
[1] "2025-09-20"
[1] "1967-04-04"
[1] "2025-09-30 EDT"
[1] "2025-09-30 EDT"
The lubridate package provides a series of date conversion functions that can handle a lot of common date formats:
[1] "2012-08-09"
[1] "2012-08-09"
[1] "2012-08-09"
However, when this fails you can also use the format
option of as.Date()
to convert date strings as long as you supply a string that clarifies what each element represents:
We’ve got some monthly counts of events organized by whether they were right or left wing.
event_counts<-ccc|>
mutate(month = floor_date(date, unit='month'),
valence = factor(valence, levels=c(0, 1, 2),
labels=c("other/neither",
'left-wing',
'right-wing'))
)|>
count(month, valence)|>
complete(month = seq.Date(from= min(month),
to=max(month),
by='month'),
fill=list(n=0))|>
drop_na()
event_counts
month | valence | n |
---|---|---|
2025-01-01 | other/neither | 1169 |
2025-01-01 | left-wing | 931 |
2025-01-01 | right-wing | 123 |
2025-02-01 | other/neither | 1045 |
2025-02-01 | left-wing | 1071 |
2025-02-01 | right-wing | 49 |
2025-03-01 | other/neither | 1273 |
2025-03-01 | left-wing | 2292 |
2025-03-01 | right-wing | 57 |
2025-04-01 | other/neither | 966 |
2025-04-01 | left-wing | 2939 |
2025-04-01 | right-wing | 113 |
2025-05-01 | other/neither | 1215 |
2025-05-01 | left-wing | 2063 |
2025-05-01 | right-wing | 26 |
2025-06-01 | other/neither | 1114 |
2025-06-01 | left-wing | 3381 |
2025-06-01 | right-wing | 194 |
2025-07-01 | other/neither | 971 |
2025-07-01 | left-wing | 2298 |
2025-07-01 | right-wing | 22 |
2025-08-01 | other/neither | 1259 |
2025-08-01 | left-wing | 2016 |
2025-08-01 | right-wing | 25 |
Required components:
Data and mapping: We need to choose what elements we’ll be visualizing and how they map onto different parts of the plot (what goes on the x and y axis? If we’re color coding, what variables determine colors? Or shapes?)
Layers: the visible representations of the data. The most important being geometries (bar, point, line etc.)
Scales: the mapping from data units to distances, colors, etc.
The remaining elements can be modified, but they’re often left at default values (ggplot assumes you want carteesian coordinates unless you say otherwise, for instance)
Data is multi-dimensional, but visuals are 2D.
A core challenge for good visualization is elegantly handling this dimensional mismatch by:
You’ll always start with a call to ggplot
We can use the +
to add more layers to a ggplot object, and we can add more than one kind of geometry to a single visualization.
We can also change the x or y scaling from the default using scale_
functions
The previous examples were all univariate plots, but we have a lot more to work with.
library(paletteer) # package for alternate color palettes
ggplot(event_counts, aes(x = month, y = n, color = valence)) +
geom_line(lwd=.7) +
geom_point(aes(shape =valence), size=3) +
theme_minimal() +
labs(x = "Date",
y='Number of events',
title='Monthly event counts through August 2025',
caption = "Data source: Crowd Counting Consortium"
)+
scale_x_date(date_breaks = "1 month", date_labels = "%B") +
scale_color_paletteer_d("nationalparkcolors::Acadia")
Big differences in the scales can make it difficult to identify relative changes within each group. One way to deal with this is by using a facet to draw multiple version of the same plot.
In this example, the x-axis is fixed, but the y-axis is re-scaled for each group - making relative changes within groups more apparent (albeit at the cost of distorting absolute values)
event_counts|>
mutate(valence = fct_relevel(valence, "left-wing",
"other/neither",
"right-wing"))|>
ggplot(aes(x = month, y = n, color = valence)) +
geom_line(lwd=.7) +
geom_point(aes(shape =valence), size=3) +
theme_minimal() +
labs(x = "Date",
y='Number of events',
title='Monthly event counts through August 2025',
caption = "Data source: Crowd Counting Consortium"
) +
scale_x_date(date_breaks = "1 month", date_labels = "%B") +
scale_color_paletteer_d("nationalparkcolors::Acadia") +
facet_wrap(~valence, ncol=1, scales='free_y') +
# legend no longer needed here since the facets are already labeled
theme(legend.position = "none")
The best way to learn is through replication: find an example and adapt it. here’s a great place to start
Care.
Consider resolution and audience: slide decks should have big plots with minimal detail, written reports can tolerate more subtlety.
Almost all “3d” plots are pointless.