Programming for EOR
Individual Assignment
Maximilian Osterhaus
2023-2024
The assignment includes 1 exercise worth 30 points. The assignment is designed in such a way that the
majority of questions should be relatively easy to answer with the content covered in the lecture. However,
some questions deliberately are more difficult and require functions that we did not cover in the lecture. The
idea is to train your skills finding solutions using different resources (e.g., Stack Overflow) and solving more
complex real-life coding challenges.
You have to submit the assignment on Brightspace before Sunday 10.12.2023 at 18.00. Do not forget to
hand in both the .pdf file and the .Rmd file in the respective hand-in boxes (other file types cannot be
uploaded). Be careful when submitting your assignment: assignments that are handed in too late, in the
wrong box, or by email will not be checked and will receive grade 0.
Your grade is equal to the number of points that you achieved divided by the total number of possible
points (30). If you fail the assignment (< 17 points), you are eligible for a repair assignment due Sunday
17.01.2024 at 18.00. For the repair assignment, your grade is equal to min(5.5, points/maxpoints × 10).
Make sure that all your code is in accordance with the style guide (available on Brightspace). Points will be
subtracted from your grade for every infraction (up to 2 grade points from your assignment grade).
1
Exercise 1 (30 points)
Research links the consumption of sugar-sweetened beverages to increased body-weight, diabetes, cardiovascular risk factors, and dental caries among other adverse health effects. Some countries, states and
municipalities have taken this as an opportunity to introduce a tax on sugar-sweetened beverages. One of
the first to introduce such a tax was the city of Berkeley in California, USA, which introduced a tax of one
cent USD per liquid ounce (1¢/oz) on such beverages.
In this assignment, you will be working with data from this research article that studies the impact of the
tax on the prices of beverages in Berkeley and surrounding areas. The data includes information on prices
of a large number of products from varying categories sold in stores located in Berkeley and surroundings at
different points in time.
In the following questions, we will first look at price data from stores in Berkeley to see what happened to
the price of sugary and non-sugary beverages in Berkely after the tax was introduced. For that purpose, we
will use the two data sets data_stores.csv and data_prices.csv. The data set data_stores.csv contains a store
identifier (store_id), information on the type of the store (store_type), a product identifier (product_id),
the general product category (category1), information on the subcategory (category2), and information
on the time (year and month).
The data set data_prices.csv contains a store identifier (store_id), a product identifier (product_id),
information on the time (year and month), the price in dollars (price), the product size in liquid ounces
(size), and a variable named taxed that is equal to 1 if the product is addressed by the tax in general, and
0 otherwise (i.e., the variable taxed is equal to 1 for all sugar-sweetened beverages, also for periods before
the introduction of the tax).
a) The stores data covers large chain supermarkets (store_type = 1), independent corner stores
(store_type = 2), drugstores (store_type = 3), and gas stations (store_type = 4). Convert/replace the variable store_type to a factor variable with labels corresponding to the store types
(i) supermarkets, (ii) corner stores, (iii) drugstores, and (iv) gas stations.
Use the new variable to report the unique number of stores per store type, and the unique number of
products offered per store type (i.e., all products across all stores of the same type).
b) The product categories (category1) are reported in different formats: some are reported in lower case
letters, some are reported in upper cases letters, and some are reported with upper and lower case
letters. In addition, some categories use different separators within their names. Unify all product
categories in the variable category1 to be written exclusively in upper case letters and with hyphens
as separators.
For every product category (category1), report the unique number of products.
c) The variable category2 includes a more detailed description of beverages in comparison to the variable
category1, though only for few product categories. Products without additional information are
marked by a hyphen. Replace the hyphens by NAs.
For every product category (category1), report the unique subcategories (category2) without NAs in
your reported results.
d) Load the data set data_prices.csv and merge the data to your store data set that you prepared in
questions a) to c). Merge the two data sets in such a way that you keep only observations appearing
in both data sets. Add a new variable to your merged data set that corresponds to the price per liquid
ounce in dollar cents. Illustrate the distribution of prices in a box plot with a separate box for every
unique product category in category1.
2
e) Convert the variable taxed in your merged data from part d) into a factor variable with label Taxed
for all observations for which taxed is equal to 1 and with label Non-taxed for all observations for
which the variable taxed is equal to zero.
Use the factor variable to replicate the table below presenting summary statistics of the price per
ounce in cents for taxed and no-taxed beverages whereby the first column represents the number of
observations per group and the last column the standard deviations. Hint: You can use either the
package kableExtra or stargazer to create a table in markdown files. You do not need to recreate
the table format exactly. If you did not succeed to answer question d), you can use the data set
data_cleaned.csv to answer the question (and following questions).
Table 1: Summary Statistics of Price per Ounce (in cents) for Taxed and Non-taxed Beverages.
taxed N Min Median Mean Max Std
Non-taxed 1072 0.98 7.64 11.11 49.83 10.46
Taxed 1096 1.21 8.25 13.14 71.50 13.77
f) Use the merged data to replicate the figure below showing the percentage change in average prices
per ounce in cents for taxed and non-taxed drinks between December 2014 and June 2015 ((¯p2015 −
p¯2014)/p2014 ∗ 100) for the different store types in the data.
Supermarkets Corner Stores Drugstores Gas Stations
Store Type
Price Change in %
Non−taxed Taxed
Price Change Between December 2014 and June 2015 for Different Store Types
3
It is possible that the observed price patterns in Berkeley observed in the figure above are not solely due
to the tax introduced on sugar-sweetened beverages, but instead were also influenced by other events that
happened in Berkeley and surrounding areas. To investigate whether this is the case, you will use the data
set data_berkeley.csv to compare the prices in Berkeley before and after the tax to prices from stores located
outside of Berkeley.
The data set contains monthly average prices per liquid ounce in cents for different product categories
(category) collected from stores in and around Berkeley (location) at different points in time (year and
month). The reported prices (prices) are average prices across different stores and products of the same
category, whereby products are separated into products that are addressed by the tax (taxed = Taxed), and
products not addressed by the tax (taxed = Non-taxed). If prices changed in a similar way in surrounding
areas (where no tax on sugar-sweetened beverages was introduced), then the observed price changes in
Berkeley may not be primarily a result of the tax.
More precisely, the impact of the tax on the price of sugar-sweetened beverages in Berkeley, known as the
pass-through of a tax, can be calculated as the difference in the mean prices of sugar-sweetened beverages
(i.e., of beverages generally addressed by the tax) sold after the introduction of the tax in Berkeley and
in surrounding areas, minus the difference in the mean prices of sugar-sweetened beverages sold before the
introduction of the tax in stores in Berkeley, and in stores outside of Berkeley:
Pass-Through = (¯p
taxed
Berkeley,after − p¯
taxed
Non-Berkeley,after) − (¯p
taxed
Berkeley,before − p¯
taxed
Non-Berkeley,before)
However, one important assumption that needs to be satisfied for this calculation to return the true effect of
the tax on prices in Berkeley is that the price-trends in Berkeley and its surrounding areas must be parallel
before the introduction of the tax.
g) In order to study whether this assumption is satisfied, create a new variable of type date (with year,
month, and day) by combining the variables year and month and where the day of the month is always
set to the first day of the month. Use your new variable to calculate average prices at the date-locationtaxed level to replicate the figure below whereby the date is your newly created variable and where the
dashed vertical line illustrates the date where the tax was implemented (i.e., 01.01.2015).
Price per Ounce in Cents
taxed Non−taxed Taxed location Berkeley Non−Berkeley
Price Trends of Taxed and Non−Taxed Beverages in Berkeley and Surrounding Areas
h) According to the figure above, average prices of taxed beverages follow a parallel trend in Berkeley and
its surrounding areas before the introduction of the tax. Using only data from before first of November
2014 and from first of March 2015 onwards, calculate the pass-though of the tax using the formula
from above where average prices are calculated across all product categories.
i) Use a for loop to loop over all product categories that are addressed by the tax to calculate the passthrough of the tax for every product category separately. Use only data from before first of November
2014 and from first of March 2015 onwards for your calculations. Collect your results in a data frame
that includes a variable with the category and a variable with the pass-through rate that you calculated.