Functions allow us to create reusable bits of code that take a set of inputs and return some data.
if statements allow us to write code that only runs when a specified condition evaluates as TRUE.
for, while and repeat loops allow us to apply the same code to each element of a vector, or to repeat a chunk of code until some condition is met.
Pseudo-code is an informal way of writing generic code that explains the steps without worrying about precise syntax. It’s a useful way to plan complex processes.
There are some conventions to pseudo-code (especially for mathematical operations), but no hard rules. The goal is just to create an outline of the intended process that’s not specific to any particular language.
What would this code do?
Remember that R functions are just a chunk of code you can easily repeat. To create a function, you can just run something like: function_name <- function(arguments){code}
Running the code below will create my_function in your R environment.
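As a minimal sketch of that pattern (the body of my_function here is a made-up example, chosen only to show the syntax):

```r
# Hypothetical example: a function that squares its input
my_function <- function(x) {
  x^2
}

# Once created, it can be called like any built-in R function
my_function(2)
```

After running this, my_function shows up in the environment pane (or in the output of ls()) alongside any other objects you have created.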
What will happen if I run the code below? Is that in line with your expectations?
The variables you specify in a function as arguments will always take first priority over objects of the same name inside the global environment.
Variables that are created inside functions are (usually) temporary. They disappear once the function finishes running unless you return them and assign the result to a variable.
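Both rules can be seen in a small sketch. The names x, y and add_one are placeholders invented for this illustration:

```r
x <- 10  # an object in the global environment

add_one <- function(x) {
  y <- x + 1  # y only exists while the function is running
  y
}

# Inside the function, x refers to the argument (1), not the global x (10)
result <- add_one(1)

# The global x is untouched by the function call
x

# y was never created in the global environment
# (assuming nothing called y was defined earlier in the session)
exists("y")
```

The value computed inside the function survives only because it was returned and assigned to result.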
Use return() to control what data is returned by your function:
Note that the function executes successfully, but xvar, xmu and xsd no longer exist once the function has finished running:
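A sketch of a function along these lines is shown here. The function name spread and the input values are assumptions made for illustration; only the variable names xmu, xvar and xsd come from the text:

```r
# Hypothetical function: computes a mean, variance and standard deviation,
# but only hands back the standard deviation
spread <- function(x) {
  xmu <- mean(x)                               # sample mean
  xvar <- sum((x - xmu)^2) / (length(x) - 1)   # sample variance
  xsd <- sqrt(xvar)                            # sample standard deviation
  return(xsd)
}

out <- spread(c(1, 2, 3, 4, 5))
out
```

Only the value passed to return() leaves the function. The intermediate objects are discarded when the call ends.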
Use the code below, but replace return(xsd) with return(xvar). What happens?
You can create an optional argument by specifying a default value.
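For instance, in this sketch (power and exponent are hypothetical names), the second argument can be left out because it has a default:

```r
# exponent is optional: it falls back to 2 when the caller does not supply it
power <- function(x, exponent = 2) {
  x^exponent
}

power(3)                # uses the default exponent
power(3, exponent = 3)  # overrides the default
```

Arguments without a default, such as x here, remain required.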
One method for measuring polarization is to calculate the standard deviation of party positions on an issue, weighted by each party’s share of the votes or seats in the last election.
Here are the left-right positions and seat shares for UK parties in 2024:
Here’s how I would calculate the weighted SD:
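One way to sketch that calculation is shown here. The positions and seat shares are invented placeholder values, not the 2024 UK figures, and the variable names are assumptions:

```r
# Placeholder data: three hypothetical parties (NOT real positions or seat shares)
positions <- c(2, 5, 8)          # left-right position of each party
seat_share <- c(0.5, 0.3, 0.2)   # each party's share of seats

# Normalize the weights so they sum to 1
w <- seat_share / sum(seat_share)

# Weighted mean position
wmean <- sum(w * positions)

# Weighted standard deviation: square root of the weighted average
# squared distance from the weighted mean
wsd <- sqrt(sum(w * (positions - wmean)^2))
wsd
```

Larger values of wsd mean that seats are held by parties further from the weighted center, which is the sense of polarization described above.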
Create a function that takes a vector of seats and positions and calculates the polarization measure.
One good rule of thumb is to make a function when you find yourself copy-pasting code more than twice in a single sitting.
A good practice is to specify functions at the top of the script so they’re easy to find. Alternatively, you can write them in a separate file and use source(filename) to load them into the environment.
Scenario:
I’m sharing an analysis of conflict data from the Uppsala Conflict Data Program (UCDP), but users need to download the files to replicate my analysis.
The files come in a zip folder, so I would also like to automatically unzip the resulting data once it’s available on the local device and place it in a new data directory.
Some files are quite large, so I want the code to download each file only once, even if users run the code repeatedly.
In general terms, I want something that does this:
An if statement allows us to selectively run chunks of code.
The general syntax is:
if(condition){
  # ... some code that runs if the condition is true
}
Code inside the {} runs only if the expression in condition is TRUE.
I can also add an else statement that will execute only if the if condition is FALSE.
Example:
if(condition){
  # ... some code that runs if the condition is true
}else{
  # ... code that runs if the condition is false
}
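A concrete version of this template, using a made-up variable x and message text:

```r
x <- 5

if (x > 3) {
  msg <- "x is greater than 3"
} else {
  msg <- "x is 3 or less"
}

msg
```

Exactly one of the two branches runs each time, so msg is always defined afterwards.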
I can also nest an if statement inside another if statement, to create code that requires multiple conditions:
if(condition){
  # ... some code that runs if the condition is true
  if(condition2){
    # ... code that runs if the first condition is true and condition2 is also true
  }
}
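A small runnable sketch of nesting, with invented values and labels:

```r
x <- 12
label <- "not positive"

if (x > 0) {
  label <- "positive"
  # The inner check is only reached when x > 0
  if (x %% 2 == 0) {
    label <- "positive and even"
  }
}

label
```

The same result could be written as a single condition with &&, but nesting is useful when some code should run after the first check regardless of the second.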
Back to our original problem:
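file.exists() checks for a file called battledeaths.zip in the current working directory and returns TRUE if it finds one.
! is a negation, so it flips FALSE to TRUE (and TRUE to FALSE). !file.exists() is therefore TRUE only when the file is missing.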
download.file downloads the file from the specified URL and saves it to the path given in the destfile argument (which can be abbreviated to dest).
The unzip function will extract the files from the zipped folder and place them in the directory given by exdir.
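Putting these pieces together, a minimal sketch might look like the following. The helper name fetch_once, the default directory name "data" and the choice of mode = "wb" are assumptions made for this illustration. The URL is the battle-deaths file listed in these notes:

```r
# Hypothetical helper: download and unzip a file only if it is not already present
fetch_once <- function(url, zipfile, exdir = "data") {
  if (!file.exists(zipfile)) {
    # mode = "wb" writes the file in binary mode, which matters for zips on Windows
    download.file(url = url, destfile = zipfile, mode = "wb")
    # exdir is created if it does not exist yet
    unzip(zipfile, exdir = exdir)
    return(TRUE)   # a download happened
  }
  FALSE            # the file was already there, so nothing was downloaded
}

# Example call (commented out so that sourcing this block downloads nothing):
# fetch_once("https://ucdp.uu.se/downloads/brd/ucdp-brd-dyadic-251-csv.zip",
#            "battledeaths.zip")
```

Running the call a second time skips the download, because the zip file already exists in the working directory.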
Write code to download the .csv version of the UCDP one-sided conflict data, and call the new zip file onesided.zip.
The URL for the file is:
https://ucdp.uu.se/downloads/nsos/ucdp-onesided-251-csv.zip
if statements are less important for code that you intend to use interactively (you can just skip lines, after all), but they’re critical for code that runs unattended or for making code that works inside of a function or loop.
The UCDP produces multiple data products, so what if my analysis requires users to grab multiple files?
files <- data.frame(
  filename = c('battledeaths.zip',
               'nonstate_actors.zip',
               'armed_conflicts.zip'),
  url = c('https://ucdp.uu.se/downloads/brd/ucdp-brd-dyadic-251-csv.zip',
          'https://ucdp.uu.se/downloads/nsos/ucdp-nonstate-251-csv.zip',
          'https://ucdp.uu.se/downloads/ucdpprio/ucdp-prio-acd-251-csv.zip'
  )
)
files
(In general, I want to avoid copy-pasting code, so I’ll want to do this using a loop.)
There are different flavors of loop, but they all allow us to repeat the same chunk of code multiple times.
A repeat loop runs the code inside of {} indefinitely unless it encounters a break statement.
A while loop keeps running for as long as a specified condition is TRUE.
A for loop runs the code once for each element in a specified vector.
In a while loop, the break statement is implicit. These loops stop as soon as the while condition evaluates as FALSE.
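The difference between the two can be sketched with a pair of counters (count and n are placeholder names):

```r
# repeat: runs forever unless break is reached
count <- 0
repeat {
  count <- count + 1
  if (count >= 3) {
    break
  }
}

# while: the stopping rule lives in the condition itself
n <- 0
while (n < 3) {
  n <- n + 1
}

c(count, n)
```

Both loops run their body three times. The while version simply moves the stopping condition out of the body and into the loop header.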
A for loop will just run for each element of the vector specified in the for statement:
Here, the loop variable (called i here) is created automatically, and it takes the value of the next element of the vector on each iteration of the loop.
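A minimal sketch of such a loop, with an invented vector and a result object called squares:

```r
squares <- c()

# i takes the values 1, 2 and 3 in turn
for (i in c(1, 2, 3)) {
  squares <- c(squares, i^2)
}

squares
```

The loop variable does not have to be called i, and the vector does not have to be numeric. A character vector of file names works the same way.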
What if I want to do row-wise addition of df?
An even more flexible and concise way to write the previous loop would be to use the : operator:
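As a sketch of the row-wise addition problem, assume a small data frame with two numeric columns. The contents of df and the column names a and b are made up for this example:

```r
# Hypothetical data frame standing in for df
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# Pre-allocate a vector to hold one total per row
row_totals <- numeric(nrow(df))

# 1:nrow(df) generates the row indices 1, 2, 3
for (i in 1:nrow(df)) {
  row_totals[i] <- df$a[i] + df$b[i]
}

row_totals
```

Because the index vector is built from nrow(df), the loop adapts automatically if rows are added. For data frames that might have zero rows, seq_len(nrow(df)) is the safer choice, since 1:0 counts down from 1 to 0 instead of producing an empty vector.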
See if you can write a loop to solve the problem we laid out at the start of this section.
files <- data.frame(
  filename = c('battledeaths.zip',
               'nonstate_actors.zip',
               'armed_conflicts.zip'),
  url = c('https://ucdp.uu.se/downloads/brd/ucdp-brd-dyadic-251-csv.zip',
          'https://ucdp.uu.se/downloads/nsos/ucdp-nonstate-251-csv.zip',
          'https://ucdp.uu.se/downloads/ucdpprio/ucdp-prio-acd-251-csv.zip'
  )
)
Why don’t operations like sum or mean require us to write a loop?
sum, prod, mean, etc. do ultimately use loops, but the loop isn’t written in R. R loops are comparatively very slow:
All code eventually gets translated to 1s and 0s, but R and Python do that translation each time you run a command. “Compiled” languages do the translation in one fell swoop.
Each iteration of an R loop requires translating the command all over again, so loops written in R are slow.
BUT: functions like sum call a loop that was already compiled in another language (sum uses compiled C code).
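The two approaches give the same answer, as this sketch with an arbitrary test vector shows:

```r
x <- as.numeric(1:10000)

# Explicit R loop: each iteration is interpreted by R
total <- 0
for (value in x) {
  total <- total + value
}

# Vectorized version: the loop runs inside compiled code
vectorized_total <- sum(x)

all.equal(total, vectorized_total)
```

Wrapping each version in system.time() on a much longer vector is a quick way to see the speed difference on your own machine.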
Use loops:
When commands aren’t already vectorized (like the file download problem we’re discussing here)
When the problem is small and you need fine-grained control over how things run
Apply family functions are a more concise way of writing loops in R. They use a syntax that’s similar to what we saw with aggregate, where a function specified by the FUN argument is applied to each element in a vector. So these two functions are more or less equivalent:
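A sketch of that equivalence, taking the square root of each element of an invented vector:

```r
x <- c(1, 4, 9)

# Version 1: explicit for loop
roots_loop <- numeric(length(x))
for (i in seq_along(x)) {
  roots_loop[i] <- sqrt(x[i])
}

# Version 2: sapply applies FUN to each element and simplifies the result to a vector
roots_apply <- sapply(x, FUN = sqrt)

identical(roots_loop, roots_apply)
```

In this particular case sqrt(x) alone would do the job, since sqrt is already vectorized. sapply earns its keep when FUN is something that only handles one element at a time.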
Some other “apply”-type functions are:
lapply works like sapply but always returns a list.
apply will apply a function to each row or each column of a matrix or array.
tapply applies a function to subsets of a vector, after grouping by some factor variable (similar to aggregate).
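Each of these can be tried on small invented inputs. All object names and values here are placeholders:

```r
# lapply: one list element per input element
doubled <- lapply(c(1, 2, 3), FUN = function(v) v * 2)

# apply: MARGIN = 1 works across rows, MARGIN = 2 down columns
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, MARGIN = 1, FUN = sum)
col_sums <- apply(m, MARGIN = 2, FUN = sum)

# tapply: split scores by group, then apply FUN to each group
scores <- c(1, 2, 3, 4)
group <- factor(c("a", "a", "b", "b"))
group_means <- tapply(scores, INDEX = group, FUN = mean)

group_means
```

The result of tapply is named by the levels of the grouping factor, so group_means["a"] picks out the mean for group a.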