Loops and Functions

Control Flow, loops and functions

What we’ll cover

  • We’ll be talking about how to use things like loops, conditions, and functions to make more flexible code.
    • functions allow us to create re-usable bits of code that take a set of inputs and return some data
    • if statements allow us to make code that only runs when a specified condition evaluates as TRUE
    • for, while and repeat allow us to apply the same code to each element of a vector, or repeat a chunk of code until some condition is met.

Pseudo-code

  • Pseudo-code is an informal way of writing generic code that explains the steps without worrying about precise syntax. It’s a useful way to plan complex processes.

  • There are some conventions to pseudo-code (especially for mathematical operations), but no hard rules. The goal is just to create an outline of the intended process that’s not specific to any particular language.

Pseudo-code

What would this code do?

\begin{algorithm} \begin{algorithmic} \STATE v = $\text{a single numeric value}$ \IF {$v \geq 0$} \STATE $v$ \ELSE \STATE $-1\times v$ \ENDIF \end{algorithmic} \end{algorithm}
v = -4

if(v >= 0){
  v
}else{
  v * -1
}
[1] 4

Pseudo-code

What would this code do?

\begin{algorithm} \begin{algorithmic} \STATE $X = \text{[a vector of numbers]}$ \STATE $S = 0$ \FOR{$\text{each value } i \text{ in } X$} \STATE S = S + i \ENDFOR \end{algorithmic} \end{algorithm}
x<-c(1, 2, 3, 4, 5)
s<-0

for(i in x){
  s<-s+i
}

s
[1] 15

Functions

Remember that an R function is just a chunk of code that you can easily reuse. To create one, run something like: function_name <- function(arguments){code}

Running the code below will create my_function in your R environment.

my_function<-function(x){
  print(sprintf("you entered %s", x))
}

Then you can run your function like this:

my_function(x=5)
[1] "you entered 5"

Function environment

What will happen if I run the code below? Is that in line with your expectations?

my_function<-function(x){
  x
  print(sprintf("you entered %s", x))
}


x <- 30

my_function(x=90)

The variables you specify in a function as arguments will always take first priority over objects of the same name inside the global environment.
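A minimal sketch of the same idea, using a made-up function called show_x: the argument x inside the function is separate from the global x, and changing it inside the function leaves the global value untouched.

```r
x <- 30                      # global x

show_x <- function(x) {      # show_x is a hypothetical name for illustration
  x <- x + 1                 # modifies only the local copy of x
  x
}

inside <- show_x(x = 90)     # uses the argument (90), not the global x (30)
inside                       # 91
x                            # still 30
```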

Function returns

Variables that are created inside a function are (usually) temporary. They disappear once the function finishes running; only the value the function returns survives, and only if you assign it to a variable.

Use return() to control what data is returned by your function:

my_sample<- c(1, 2, 3, 4, 5)
x_sd<-function(x){
  xmu <- mean(x)
  xvar <-  sum((x - xmu)^2)/(length(x)-1)
  xsd <- sqrt(xvar)
  return(xsd)
}
result<-x_sd(my_sample)
result
[1] 1.581139
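As a small aside, an R function with no explicit return() hands back the value of the last expression it evaluates. The sketch below (x_sd_short and x_summary are made-up names) shows that, along with one way to get several intermediate values out of a function: bundle them in a named list.

```r
my_sample <- c(1, 2, 3, 4, 5)

# No explicit return(): the value of the last expression is returned
x_sd_short <- function(x) {
  xmu <- mean(x)
  sqrt(sum((x - xmu)^2) / (length(x) - 1))
}

# Return several values at once by bundling them in a named list
x_summary <- function(x) {
  xmu  <- mean(x)
  xvar <- sum((x - xmu)^2) / (length(x) - 1)
  return(list(mean = xmu, var = xvar, sd = sqrt(xvar)))
}

x_sd_short(my_sample)
out <- x_summary(my_sample)
out$var
```

An explicit return() is still worth using when it makes the intent of a longer function easier to read.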

Function returns

Note that the function executes successfully, but xvar, xmu and xsd no longer exist once the function is finished running:

my_sample<- c(1, 2, 3, 4, 5)
x_sd<-function(x){
  xmu <- mean(x)
  xvar <-  sum((x - xmu)^2)/(length(x)-1)
  xsd <- sqrt(xvar)
  return(xsd)
}


result<-x_sd(my_sample)
xvar
Error: object 'xvar' not found

Function returns

Use the code below, but replace return(xsd) with return(xvar). What happens?

my_sample<- c(1, 2, 3, 4, 5)
x_sd<-function(x){
  xmu <- sum(x)/length(x)
  xvar <-  sum((x - xmu)^2)/(length(x)-1)
  xsd <- sqrt(xvar)
  return(xsd)
}
result<-x_sd(my_sample)
result
[1] 1.581139

Function default arguments

You can create an optional argument by specifying a default value.

my_sample<- c(1, 2, 3, 4, NA)
x_sd<-function(x, na.rm=TRUE){
  if(na.rm==TRUE){
    x<-x[!is.na(x)]
  }
  xmu <- sum(x)/length(x)
  xvar <-  sum((x - xmu)^2)/(length(x)-1)
  xsd <- sqrt(xvar)
  return(xsd)
}

x_sd(my_sample)
[1] 1.290994
x_sd(my_sample, na.rm=FALSE)
[1] NA
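One way to check a hand-written function like this is to compare it with the built-in equivalent. The sketch below redefines x_sd so that it runs on its own, compares it with sd(), and shows that arguments can be supplied by position or by name. Note that the built-in sd() defaults to na.rm = FALSE, unlike this version.

```r
my_sample <- c(1, 2, 3, 4, NA)

x_sd <- function(x, na.rm = TRUE) {
  if (na.rm) {
    x <- x[!is.na(x)]                 # drop missing values
  }
  xmu  <- sum(x) / length(x)
  xvar <- sum((x - xmu)^2) / (length(x) - 1)
  sqrt(xvar)
}

# Same answer as the built-in once the NAs are dropped
all.equal(x_sd(my_sample), sd(my_sample, na.rm = TRUE))

# Arguments can be matched by position or by name
x_sd(my_sample, FALSE)                # positional: second argument is na.rm
x_sd(na.rm = FALSE, x = my_sample)    # named: order does not matter
```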

Try it out

One method for measuring polarization is to calculate the standard deviation of party positions on an issue, weighted by each party’s share of the votes or seats in the last election.

Here are the left-right positions and seat shares for UK parties in 2024:

positions<-c(7.67, 3.81, 3.86, 3.32, 3.12, 2.05, 9.24)
seats<-c(121, 411,  72,   9,   4,   4,   5)

Here’s how I would calculate the weighted SD:

w <- seats/sum(seats) 
p_k<- sum(positions * w) / sum(w)
polarization<-sqrt(sum(w  * ((positions - p_k))^2))
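Written out as a formula, the calculation above looks like this. This is only a restatement of the code; the symbols s_i (seats of party i) and p_i (position of party i) are labels chosen here, not notation from a particular paper.

```latex
\[
w_i = \frac{s_i}{\sum_j s_j}, \qquad
\bar{p} = \frac{\sum_i w_i \, p_i}{\sum_i w_i}, \qquad
\text{polarization} = \sqrt{\sum_i w_i \, (p_i - \bar{p})^2}
\]
```

Because the weights sum to 1, the denominator of the weighted mean equals 1 and is kept only for clarity.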

Try it out

Create a function that takes a vector of seats and positions and calculates the polarization measure.

Answer:
polarization<-function(positions, seats){
  # normalized seats (sums to 1)
  w <- seats/sum(seats) 
  # weighted mean:
  p_k<- sum(positions * w) / sum(w)
  # sqrt of weighted squared deviations from weighted mean
  polarization<-sqrt(sum(w  * ((positions - p_k))^2))
  return(polarization)
}
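Once the function exists, it can be called on the UK data from the previous slide. The sketch below repeats the function so that it runs on its own and adds two quick sanity checks (the checks are suggestions, not part of the original exercise).

```r
polarization <- function(positions, seats) {
  w   <- seats / sum(seats)                 # normalized seats (sums to 1)
  p_k <- sum(positions * w) / sum(w)        # weighted mean position
  sqrt(sum(w * (positions - p_k)^2))        # weighted SD around that mean
}

positions <- c(7.67, 3.81, 3.86, 3.32, 3.12, 2.05, 9.24)
seats     <- c(121, 411, 72, 9, 4, 4, 5)

uk_2024 <- polarization(positions, seats)
uk_2024

# Check 1: if every party holds the same position, polarization is (essentially) 0
same_pos <- polarization(c(5, 5, 5), c(10, 20, 30))

# Check 2: only seat shares matter, so doubling every seat count changes nothing
doubled <- polarization(positions, seats * 2)
all.equal(doubled, uk_2024)
```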

Using functions

  • One good rule of thumb is to make a function when you find yourself copy-pasting code more than twice in a single sitting.

  • A good practice is to specify functions at the top of the script so they’re easy to find. Alternatively, you can write them in a separate file and use source(filename) to load them into the environment.
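A small sketch of the source() approach. To keep the example self-contained it writes the helper script to a temporary file; in a real project this would be an ordinary file such as a functions script sitting next to your analysis. The function double_it is made up for illustration.

```r
# Write a tiny helper script to a temporary file
helper_file <- tempfile(fileext = ".R")
writeLines(
  "double_it <- function(x){ x * 2 }",   # double_it is a hypothetical helper
  helper_file
)

# source() runs the file, which defines double_it in the environment
source(helper_file)
double_it(21)
```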

If statements

Problem: conditionally downloading a file

Scenario:

  • I’m sharing an analysis of conflict data from the Uppsala Conflict Data Program (UCDP), but users need to download the files to replicate my analysis.

  • The files come in a zip folder, so I would also like to automatically unzip the resulting data once it’s available on the local device and place it in a new data directory.

  • Some files are quite large, so I want the code to download each file only once, even if users run the code repeatedly.

Problem: in pseudo-code

In general terms, I want something that does this:

\begin{algorithm} \begin{algorithmic} \State $url = \text{a link to a .zip file}$ \State $\text{filename} = \text{a filename}$ \State $\text{datafolder} = \text{a folder to store unzipped data}$ \If{$\text{filename doesn't exist}$} \State \Call{download.file}{$\text{url}, \text{filename}$} \EndIf \State \Call{unzip.file}{$\text{filename}, \text{datafolder}$} \end{algorithmic} \end{algorithm}

If

An if statement allows us to selectively run chunks of code.

The general syntax is:

if(condition){
   [... some code that runs if the condition is true]
}

Code inside the {} runs only if the expression in condition is TRUE

i <- 10
if(i %% 2 == 0 ){
  print('i is evenly divisible by 2')
}
[1] "i is evenly divisible by 2"

If-else

I can also add an else statement that will execute only if the if condition is FALSE

Example:

if(condition){
   [... some code that runs if the condition is true]
}else{
  [...code that runs if the condition is false]
}
i <- 11
if(i %% 2 == 0 ){
  print('i is evenly divisible by 2')
}else{
  print('i is not evenly divisible by 2')
}
[1] "i is not evenly divisible by 2"

Nested ifs

I can also nest an if statement inside another if statement, to create code that requires multiple conditions:

if(condition){
   [... some code that runs if the condition is true]
   
   if(condition2){
    [... code that runs if condition 1 is true and condition2 is also true]
  }
}
i <- -10

if(i%%2==0){
  
  if(i<0){
    print("i is even and negative")
  }
}
[1] "i is even and negative"
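When the inner block is the only thing inside the outer if, the two conditions can also be combined with && instead of nesting. The sketch below is one possible rewrite; the extra else if and else branches and the msg variable are additions for illustration.

```r
i <- -10

# Same logic as the nested version, written as one combined condition
if (i %% 2 == 0 && i < 0) {
  msg <- "i is even and negative"
} else if (i %% 2 == 0) {
  msg <- "i is even and non-negative"
} else {
  msg <- "i is odd"
}

print(msg)
```

Nesting is still the better choice when the outer block needs to run some code that does not depend on the second condition.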

Conditional Downloads

Back to our original problem:

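  • file.exists() checks whether a file called battledeaths.zip exists in the current working directory.

  • ! is a negation, so it flips FALSE to TRUE (and TRUE to FALSE). !file.exists() is therefore TRUE only when the file is missing.

  • download.file downloads the file from the specified URL and saves it to the file path given in the dest argument (short for destfile; R accepts unambiguous abbreviations of argument names).

  • The unzip function extracts the files from the zipped folder and places them in the exdir folder.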

url<-'https://ucdp.uu.se/downloads/brd/ucdp-brd-dyadic-251-csv.zip'


if(!file.exists('battledeaths.zip')){
  download.file(url, dest='battledeaths.zip')
  
}

unzip('battledeaths.zip',
      exdir = 'ucdp_datasets')
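Since this pattern will be reused for other files, it could also be wrapped in a function. The sketch below is one possible version: the name fetch_zip, its argument names, and the choice to return whether a download happened are all assumptions made for illustration. It spells out destfile in full and sets mode = "wb", which download.file documents as the mode for binary files.

```r
# Hypothetical helper: download a zip file only if it is missing, then unzip it
fetch_zip <- function(url, zipfile, exdir = "ucdp_datasets") {
  downloaded <- FALSE
  if (!file.exists(zipfile)) {
    # mode = "wb" writes the file in binary mode
    download.file(url, destfile = zipfile, mode = "wb")
    downloaded <- TRUE
  }
  unzip(zipfile, exdir = exdir)
  invisible(downloaded)   # TRUE if a download happened on this call
}

# Usage (commented out because it needs an internet connection):
# fetch_zip('https://ucdp.uu.se/downloads/brd/ucdp-brd-dyadic-251-csv.zip',
#           'battledeaths.zip')
```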

Try it out

Write code to download the .csv version of the UCDP one-sided conflict data, and call the new zip file onesided.zip. The URL for the file is:

https://ucdp.uu.se/downloads/nsos/ucdp-onesided-251-csv.zip

Try it out

Answer:
url<-'https://ucdp.uu.se/downloads/nsos/ucdp-onesided-251-csv.zip'


if(!file.exists('onesided.zip')){
  download.file(url, dest='onesided.zip')
  
}

unzip('onesided.zip',
      exdir = 'ucdp_datasets')

Using “if” statements

  • if statements are less important for code that you intend to use interactively (you can just skip lines, after all)

  • but they’re critical for code that runs unattended or for making code that works inside of a function or loop.

Loops

Problem: downloading multiple files

The UCDP produces multiple data products, so what if my analysis requires users to grab multiple files?

files<-data.frame(
  filename = c('battledeaths.zip',
           'nonstate_actors.zip',
           'armed_conflicts.zip'),
  url = c('https://ucdp.uu.se/downloads/brd/ucdp-brd-dyadic-251-csv.zip',
          'https://ucdp.uu.se/downloads/nsos/ucdp-nonstate-251-csv.zip',
          'https://ucdp.uu.se/downloads/ucdpprio/ucdp-prio-acd-251-csv.zip'
          )
)
files

(In general, I want to avoid copy-pasting code, so I’ll want to do this using a loop.)

The problem in pseudo-code:

\begin{algorithm} \begin{algorithmic} \State $files = \text{a dataframe with filenames and URLs}$ \STATE $\text{datafolder} = \text{a folder to store unzipped data}$ \FOR{$\text{each row } i \text{ in } files$} \STATE $\text{url} = \text{files}[i, \text{url}]$ \STATE $\text{filename} = \text{files}[i, \text{filename}]$ \If{$\text{filename doesn't exist}$} \State \Call{download.file}{$\text{url}, \text{filename}$} \EndIf \State \Call{unzip.file}{$\text{filename}, \text{datafolder}$} \ENDFOR \end{algorithmic} \end{algorithm}

Varieties of loop

There are different flavors of loop, but they all allow us to repeat the same chunk of code multiple times.

  • a repeat loop runs the code inside of {} infinitely unless it encounters a break statement

  • a while loop keeps running as long as a specified condition is TRUE, and stops once it becomes FALSE

  • a for loop runs the code once for each element in a specified vector

A repeat loop

i <- 1
repeat{
  print(i)
  i <- i + 1
  if (i > 15) {
    break # break stops the loop
  }
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15

A while loop

In a while loop, the break statement is implicit. These loops stop as soon as the while condition evaluates as FALSE.

i <- 1
while(i <= 15){
  print(i)
  i <- i +1
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
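One practical difference between the two: a while loop checks its condition before the first iteration, while a repeat loop always runs its body at least once. A small sketch, with the counters n_while and n_repeat added purely to show how often each body runs:

```r
# while: the condition is checked first, so the body never runs here
i <- 20
n_while <- 0
while (i <= 15) {
  n_while <- n_while + 1
  i <- i + 1
}

# repeat: the body runs before the break test, so it runs once
j <- 20
n_repeat <- 0
repeat {
  n_repeat <- n_repeat + 1
  j <- j + 1
  if (j > 15) {
    break
  }
}

n_while    # 0
n_repeat   # 1
```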

A for loop

A for loop will just run for each element of the vector specified in the for statement:

my_letters<-c("a", "b", "c", "d") # avoids masking R's built-in letters vector
for(i in my_letters){ 
  
  print(i) 
  
}
[1] "a"
[1] "b"
[1] "c"
[1] "d"

Here, the loop variable (called i) is created automatically, and on each iteration it takes the value of the next element of the vector.
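The loop variable can hold either the elements themselves or their positions. The sketch below contrasts the two; the vector name parties and the labels result are made up for illustration, and seq_along() is the base R function that returns the positions 1, 2, ... for a vector.

```r
parties <- c("a", "b", "c", "d")   # example vector; the name is arbitrary

# Looping over the elements: i is "a", then "b", and so on
for (i in parties) {
  print(i)
}

# Looping over positions: i is 1, 2, 3, 4 and parties[i] is the element
labels <- character(length(parties))
for (i in seq_along(parties)) {
  labels[i] <- paste(i, parties[i])
}
labels
```

Looping over positions is useful when the result of each iteration needs to be stored, as it is here.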

A for loop

What if I want to do row-wise addition of df?

df<-data.frame(x = c(1, 2, 3, 4),
               y = c(4, 3, 2, -1)
               )

I could get a vector with a sequence from 1 to the number of rows in df, and then put that sequence in a for loop:

rowseq<-c(1, 2, 3, 4)

for(i in rowseq){ 
  print(df$x[i] + df$y[i])
  
}

A for loop

An even more flexible and concise way to write the previous loop would be to use the : operator:

1:nrow(df)
[1] 1 2 3 4
for(i in 1:nrow(df)){ 
  print(df$x[i] + df$y[i])
  
}
[1] 5
[1] 5
[1] 5
[1] 3
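Two refinements are worth sketching here, both suggestions rather than part of the original example. First, the results can be stored in a pre-allocated vector instead of printed. Second, seq_len(nrow(df)) behaves better than 1:nrow(df) when a data frame has zero rows, because 1:0 counts down and would run the loop body twice.

```r
df <- data.frame(x = c(1, 2, 3, 4),
                 y = c(4, 3, 2, -1))

# Store the row sums instead of printing them
row_sums <- numeric(nrow(df))
for (i in seq_len(nrow(df))) {
  row_sums[i] <- df$x[i] + df$y[i]
}
row_sums

# With zero rows, 1:nrow() counts down from 1 to 0; seq_len() is empty
empty <- df[0, ]
1:nrow(empty)          # 1 0
seq_len(nrow(empty))   # integer(0)
```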

Try it out

See if you can write a loop to solve the problem we laid out at the start of this section.

files<-data.frame(
  filename = c('battledeaths.zip',
           'nonstate_actors.zip',
           'armed_conflicts.zip'),
  url = c('https://ucdp.uu.se/downloads/brd/ucdp-brd-dyadic-251-csv.zip',
          'https://ucdp.uu.se/downloads/nsos/ucdp-nonstate-251-csv.zip',
          'https://ucdp.uu.se/downloads/ucdpprio/ucdp-prio-acd-251-csv.zip'
          )
)
Answer:
for(i in 1:nrow(files)){
  if(!file.exists(files$filename[i])){
    download.file(files$url[i] , dest=files$filename[i])
  }
  unzip(files$filename[i],
        exdir = 'ucdp_datasets')
}

Using loops

Why don’t operations like sum or mean require us to write a loop?

Technically, summing a vector is a kind of loop operation:

x<-c(1, 4, 5, 8, 2)
y<-0
for(i in x){
  y <- y + i
}

y
[1] 20

But you’ll never do this because you can just run:

sum(x)
[1] 20

Why isn’t everything a loop?

  • Many built-in R functions like sum, prod, mean etc. do ultimately use loops, but the loop isn’t written in R.
# the source code for the sum function isn't very informative, because
# it doesn't actually call R code at all:
sum
function (..., na.rm = FALSE)  .Primitive("sum")

Why isn’t everything a loop?

Loops written in R are slow compared with the compiled loops behind functions like sum.
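A rough way to see this for yourself is to time both approaches with system.time(). The sketch below is only illustrative: loop_sum is a made-up function, and the exact timings depend on your machine and R version, so no particular numbers are shown.

```r
x <- as.numeric(1:1e6)

# A hand-written R loop that adds up a vector
loop_sum <- function(v) {
  total <- 0
  for (i in v) {
    total <- total + i
  }
  total
}

t_loop <- system.time(s_loop <- loop_sum(x))
t_sum  <- system.time(s_builtin <- sum(x))

s_loop == s_builtin   # both approaches give the same answer
t_loop["elapsed"]     # time for the R loop
t_sum["elapsed"]      # time for the built-in sum
```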

Why isn’t everything a loop?

  • All code eventually gets translated to 1s and 0s, but R and Python do that translation each time you run a command. “Compiled” languages do the translation in one fell swoop.

  • Each iteration of an R loop requires translating the command all over again, so loops written in R are slow.

  • BUT: functions like sum call a loop that was already compiled in another language (sum uses compiled C code)

Using loops

Use loops:

  • When commands aren’t already vectorized (like the file download problem we’re discussing here)

  • When the problem is small and you need fine-grained control over how things run

Apply loops

Apply family functions are a more concise way of writing loops in R. They use a syntax that’s similar to what we saw with aggregate, where a function specified by the FUN argument is applied to each element in a vector. So these two chunks of code are more or less equivalent:

x<-1:10
for(i in 1:10){
  print(i)
}
x<-1:10
sapply(x, FUN=function(x) print(x))
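The main practical difference is that sapply collects and returns the result of each call, whereas a for loop needs somewhere to store its results. A sketch, using squaring as a stand-in for any function (the variable names are made up):

```r
x <- 1:10

# for loop: pre-allocate a result vector and fill it in
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# sapply: the results are collected and returned for us
squares_apply <- sapply(x, FUN = function(v) v^2)

squares_loop
squares_apply
```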

Apply loops

Some other “apply”-type functions are:

  • lapply works like sapply but always returns a list

  • apply will apply a function to each row or each column of a matrix or array

  • tapply applies a function to the values of a vector after grouping them by some factor variable (similar to aggregate)
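A short sketch of each function on toy data (the objects m, scores and group are invented for illustration):

```r
# lapply: always returns a list, one element per input
tens <- lapply(1:3, function(v) v * 10)

# apply: MARGIN = 1 works over rows, MARGIN = 2 works over columns
m <- matrix(1:6, nrow = 2)
row_totals <- apply(m, 1, sum)
col_totals <- apply(m, 2, sum)

# tapply: apply a function to a vector within groups
scores <- c(10, 20, 30, 40)
group  <- factor(c("a", "a", "b", "b"))
group_means <- tapply(scores, group, mean)

tens
row_totals
col_totals
group_means
```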

In-Class exercise

Click here for the in-class exercise