ECON3173 – Cross Section and Panel Data Analysis

Individual Project: Guidelines and Questions

This document provides guidelines and questions for the Individual Project of ECON3173, which accounts for 40% of the total marks.

Honoring the precepts of academic integrity and applying their principles are fundamental responsibilities of all students and scholars at BNBU. You are advised to read through the BNBU Guidelines for Handling Academic Dishonesty file on iSpace before you start your assignment. Any form of plagiarism or cheating can result in disciplinary and corrective action. Using generative AI tools is not allowed.

Deadline: by Dec.14, 2025.

Submission Method:

a) Please submit your typed assignment report as a single PDF file to Turnitin ‘Submission Link: Report’ via iSpace. The file name of your PDF submission should have the following format: ECON3173_Project_Student ID_Name in Pinyin (e.g., ECON3173_Project_190000001_Mi Lin).

b) Save your data and .do file(s) in a zip file. Name your zip file ECON3173_Project_Student ID_Name in Pinyin. Then upload the file to ‘Submission Link: Stata Data and Program’ via iSpace. You are expected to submit 2 .do files, namely PartA.do and PartB.do, which should be able to replicate each part of your submitted work.

c) Use the ‘ECON3173_Individual Project_Report Template’ file on the iSpace to input your report. Ensure you provide a question number for each part of your work.

Format Requirements

Cover page:	Please enter your name and student ID at the top of the report template cover page, available on iSpace.
Word limit:	The required minimum word count is 1,500 words, with a maximum of 2,000 words in total, excluding tables, graphs, and appendices.
Referencing:	Your report should include appropriate references in APA format to a variety of necessary literature sources and a wide-ranging bibliography of academic aspects of economics.
Font / Size:	Cambria 12 or Times New Roman 12.
Spacing / Sides:	1.0 / Single-sided / Single-line spacing between two paragraphs.
Pagination required:	Yes
Margins:	At 2.50 to both left and right, and ‘justified’.

Project Theme: Access to External Finance and Firm Performance

Introduction:

In this project, you are invited to empirically investigate the determinants of financial access and its subsequent causal impact on firm performance (measured by Sales) using the World Bank Enterprise Survey Data (WBESD).

One of the most cited constraints for firms in developing economies is a lack of access to external finance. You will test whether alleviating financial constraints (e.g., gaining access to credit) causes firms to expand output.

The project is divided into two analytical stages:

● Determinants of Credit: Using cross-sectional techniques to model the probability of a firm having a loan;

● Impact of Credit: Using panel data techniques to test if gaining access to credit causes firms to expand output.

The WBESD database collects information on firm performance, growth, and related factors. The entire database is available to researchers and includes all survey questions at the firm level.

Guidelines to download and prepare data for this individual project:

a) Please visit https://login.enterprisesurveys.org/ to register your user account for the WBESD database (see the snapshot below). Registration is free.

b) There are a total of 168 economies represented in the World Bank Enterprise Surveys Database (WBESD). Among these, 83 economies have a time span of at least three years. For their individual projects, students are required to use data from a panel of random combinations of three different economies out of the 83 economies.

Data allocation protocol: Students must first pick a lottery ticket number. An “Individual Project Lottery Ticket Sign-up Sheet” will be available in iSpace from 9 p.m. on Friday, 28/11/2025. Please sign up for a lottery ticket number by Sunday, 30/11/2025. We will operate on a 'first-come, first-served' basis.

A lucky draw will be conducted in class on Monday, 01/12/2025, to assign specific economies to each lottery ticket number.

c) Once registration is completed, log in and download the data following the steps below:

i. Login with your username and password. You will be directed to the ‘Full Survey Data’ page.

ii. Select ‘Panel data’ under ‘Survey Type’ on the left. Ensure you are on the ‘Data by Economy’ view instead of ‘Combined Data’. See the snapshot below.

iii. Download your economies’ corresponding data and documentation for all the available years.

For example, Afghanistan has two panel data files, one for 2005 and 2009, and the other for 2008, 2010, and 2014. Then download both of them.

iv. Extract the data and survey documentation files into a working folder on your PC.

The data file is now ready to open in Stata.

d) Appendix A at the end of this document offers guidelines for data construction and cleaning when working with WBESD data. Read it carefully before you begin.

Answer ALL of the Following Questions

Note that this is not an essay-type assignment. Please answer the questions one by one. For each question, the performance of the Stata do files accounts for 20% of the marks. Support your answers with regression tables, graphs, Stata output, and explanations/discussions.

Part A: Data Management and Exploratory Analysis (15%)

Q1 (5%) Data Preparation:

Use the Stata command “append” to combine data from all years and all the selected economies into a single Stata data file with a panel data format and complete the following data preparation tasks:

● Select and rename the variables according to Table 1 below. ‘Old name’ refers to the variable name in the original dataset, while ‘New name’ is the new corresponding name to be defined.

● Generate a new dummy variable creditdum: Equals 1 if the firm has a line of credit or loan from a financial institution (k8 = yes); otherwise 0.

● Generate a new dummy variable Femaledum: Equals 1 if the firm has female participation in ownership (b4 = yes); otherwise 0.

● Generate a new variable ln(sales): The natural logarithm of total annual sales.

Table 1: Variable List

Survey Questions	Old name	New name
The year the survey was conducted	year	year
Panel ID (the same ID for each firm across different years)	panelid	panelid
What percentage of this firm is owned by Private foreign individuals, companies, or organizations %	b2b	foreign
During the past fiscal year, what were this establishment’s total annual sales?	d2	sales
Total number of permanent, full-time workers at the end of the last fiscal year	l1	labor
Year of Survey – Year establishment began operations + 1	year - b5 + 1	age

Q2 (10%) Conduct exploratory data analysis:

● Provide summary statistics for the variables created in Q1.

● Compare the average ln(Sales) for firms with credit (creditdum = 1) versus those without (creditdum = 0). Is the difference statistically significant?

● Briefly comment on the prevalence of credit access across the different economies in your sample.

Part B: Cross-Sectional Analysis (20%)

Q3 (20%) Determinants of Access to Credit:

Before analyzing the effect of credit, we must understand who gets credit. Restrict your sample to the most recent survey year only (treat this sub-sample as cross-sectional data).

Estimate the probability of having a credit line based on firm characteristics:

$$\Pr(\text{creditdum}_i = 1 \mid x) = F\big(\beta_0 + \beta_1 \ln(\text{Labor}_i) + \beta_2 \text{Age}_i + \beta_3 \text{Foreign}_i + \beta_4 \text{Femaledum}_i\big) \tag{1}$$

● Estimate the model using both the Probit and Logit estimators. Report the results side by side. Compare the pseudo-R². Do the models yield consistent inferences regarding significance?

● Interpret the coefficient of Femaledum_i from the Logit model. Then, calculate and report the average marginal effects for all variables in the Probit model.

● Explain why the raw coefficients in non-linear binary response models cannot be interpreted as simple marginal effects (unlike in OLS).

Part C: Panel Regression and Causal Inference (65%)

Q4 (15%) Baseline Fixed Effects Model

Revert to the full Panel Dataset (all years and all three economies). Consider a standard performance model in which sales depend on labor inputs and firm characteristics. Report all the results side by side.

$$\ln(\text{Sales}_{it}) = \beta_0 + \beta_1 \ln(\text{Labor}_{it}) + \beta_2 \text{Age}_{it} + \beta_3 \text{Foreign}_{it} + u_{it} \tag{2}$$

● Estimate equation (2) using OLS, the Fixed Effects (FE) estimator controlling for time-invariant individual effects, the FE estimator controlling for individual-invariant time effects, and the FE estimator controlling for both time and individual effects. Provide examples of individual effects and time effects in the current context. Comment on your regression results.

● Compare the result of the FE estimator controlling for both time and individual effects to a Random Effects (RE) model using the Hausman test. Interpret the test result.

● Comment on the elasticity of sales with respect to labor in your preferred model.

Q5 (15%) The Effect of Credit Access (Naive Approach)

Expand your model from Q4 to include creditdum as the main variable of interest.

$$\ln(\text{Sales}_{it}) = \beta_0 + \beta_1 \text{creditdum}_{it} + \gamma x_{it} + \mu_i + \delta_t + \varepsilon_{it} \tag{3}$$

● Explore the WBESD database and include other appropriate control variables, based on the literature, as you see fit. Justify each additional control variable.

● Run the regression, interpret the coefficient β1, and explain the estimated result.

● Discuss to what extent the estimated coefficient on creditdum_it can be used for causal inference.

Q6 (15%) Causal Inference: Further Investigation

To better address causality, implement a Difference-in-Differences (DiD) strategy focusing on firms that changed their credit status.

● Define a Treatment Group (firms that did not have credit in period t - 1 but gained it in period t) and a Control Group (firms that never had credit).

● Estimate the standard Two-Way Fixed Effects (TWFE) DiD equation:

$$y_{it} = \alpha_i + \lambda_t + \delta_{DiD} (\text{Treat}_i \times \text{Post}_t) + \beta x_{it} + \varepsilon_{it} \tag{4}$$

● Report the estimate of δ_DiD.

● Discuss the Parallel Trends Assumption required for this estimator to be valid.

Q7 (20%) Robustness

To what extent could we use the estimated coefficient on Treat_i × Post_t obtained in Q6 for causal inference? How could we ensure that the Parallel Trends Assumption holds? Is the treatment effect long-lasting? Is the treatment effect homogeneous?

Illustrate a suitable empirical strategy for the above questions. Estimate the model using your chosen approach, and compare the results with those from Q6. Interpret and discuss the findings. Explore the WBESD database to include other variables as you see fit.

Appendix: Guidelines for Data Construction and Cleaning

(Read this carefully before starting your Stata analysis)

The World Bank Enterprise Survey Data (WBESD) is a rich resource, but it requires careful cleaning to be usable for empirical studies. Real-world data is rarely “ready to run”. Follow the steps below to construct your dataset.

Phase 1: Data Merging and Compilation

1. File Selection:

● Do not download single-year cross-section files (e.g., “Vietnam 2015”).

● Download the “Panel” datasets. These files usually have names like Vietnam-2015-2023-Panel-Data.dta. They contain the crucial “panelid” variable that links firms across time.

2. Combining Economies (The append Strategy):

● You need three economies. Do not try to merge them side-by-side. You want to stack them on top of each other (long format).

● Stata Workflow: Open the first country’s dataset, generate a country ID, and save it. Open the second, generate a country ID, append the first, and so on.

● Code Hint in Stata:

* First economy: tag it and start the combined file
use "Vietnam_Panel.dta", clear
gen country_name = "Vietnam"
save "combined_data.dta", replace

* Second economy: tag it, stack the combined file underneath, and save again
use "Senegal_Panel.dta", clear
gen country_name = "Senegal"
append using "combined_data.dta"
save "combined_data.dta", replace
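The two-economy hint above extends to three economies with a loop. What follows is a minimal sketch under stated assumptions, not a required implementation: the third economy (Peru), the toy stand-in files, and the tempfile names are hypothetical. With real data, the `use` line would point at the downloaded panel files instead.

```stata
* Hypothetical stand-ins for the three downloaded panel files (made-up data).
tempfile fVietnam fSenegal fPeru
foreach c in Vietnam Senegal Peru {
    clear
    set obs 2
    gen panelid = _n
    gen year = 2015
    save "`f`c''"
}

* Start from an empty file, then tag and stack each economy in turn.
clear
tempfile combined
save "`combined'", emptyok
foreach c in Vietnam Senegal Peru {
    use "`f`c''", clear
    gen country_name = "`c'"        // country ID, as in the hint above
    append using "`combined'"
    save "`combined'", replace
}

* Two toy firms per economy, stacked in long format.
tabulate country_name
assert _N == 6
```

The loop keeps the tagging step identical for every economy, so adding or swapping an economy only changes the list after `foreach`.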

3. Variable Standardization:

● Check variable names across countries. While the World Bank tries to standardize (e.g., d2 is always Sales), sometimes older files use d2_2015 or sales_val.

● Use the command lookfor sales or lookfor labor to find the correct variable codes in each dataset before appending.
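One way to act on this check is to rename a non-standard variable only when the standard name is absent. The sketch below runs on a made-up dataset in which sales are stored as sales_val (one of the alternative names mentioned above). The data values are invented, and the actual names in your files have to be confirmed with `lookfor`.

```stata
* Made-up file in which sales are stored under a non-standard name.
clear
input panelid year sales_val l1
1 2015 5000 12
2 2015 8000 30
end

lookfor sales        // lists variables whose name or label contains "sales"

* Rename only if d2 is absent and the alternative name is present.
capture confirm variable d2
if _rc {
    capture confirm variable sales_val
    if !_rc rename sales_val d2
}

confirm variable d2  // stops with an error if d2 is still missing
```

Running this for each economy before `append` means the stacked file has one sales column rather than several partly empty ones.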

Phase 2: Cleaning and Consistency

1. Handling Missing Values and Codes:

● WBESD often uses special codes for missing data:

○ -9 = Don't Know

○ -7 = Refusal

○ -8 = Does not apply

● Crucial Step: You must convert these to Stata missing values (.) before calculating means or running regressions. If you treat -9 as a real number, your averages will be wrong.

● Code Hint in Stata:

mvdecode _all, mv(-9 -8 -7)
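To see why the conversion matters, the following sketch applies `mvdecode` to a tiny made-up dataset and compares the summary of d2 before and after. All values, including the k8 codes, are invented for illustration. The variable list is restricted to survey items so that a variable in which -9, -8, or -7 is a legitimate value is not altered by accident.

```stata
* Made-up responses containing the special codes.
clear
input panelid year d2 l1 k8
1 2015 5000 12  1
2 2015   -9 30  2
3 2015 7000 -7 -8
end

count if inlist(d2, -9, -8, -7)   // how many coded non-responses in d2
summarize d2                      // the -9 is treated as a real sales figure

* Convert the special codes to system missing (.) in the listed variables.
mvdecode d2 l1 k8, mv(-9 -8 -7)

summarize d2                      // now based on valid responses only
assert !inlist(d2, -9, -8, -7)
```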

2. Outliers and Monetary Values:

● Sales (d2) are reported in local currency units (LCU).

● Do not compare raw nominal sales between Vietnam (Dong) and Senegal (CFA Franc) directly.

● Solution: We use log_sales and Country Fixed Effects (or Firm Fixed Effects). The logarithm roughly normalizes the scale differences.

● Winsorizing: Real data often has data entry errors (e.g., a firm reporting 1000% growth). It is good practice to winsorize the top/bottom 1% of continuous variables, such as sales and employee counts.

● Code Hint in Stata (requires ssc install winsor2):

winsor2 sales, cut(1 99) replace
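Because sales are in local currency, percentiles computed on the pooled sample are dominated by the economy with the largest nominal values. One possible alternative, sketched here in base Stata rather than with winsor2, is to winsorize within each economy and then take logs. The data are simulated and the helper variables p1 and p99 are hypothetical names.

```stata
* Simulated data: two economies whose currency scales differ by a factor of 1,000.
clear
set seed 12345
set obs 400
gen country_name = cond(_n <= 200, "Vietnam", "Senegal")
gen double sales = exp(rnormal(10, 1)) * cond(country_name == "Vietnam", 1000, 1)

* 1st and 99th percentiles computed separately for each economy.
bysort country_name: egen double p1  = pctile(sales), p(1)
bysort country_name: egen double p99 = pctile(sales), p(99)

* Pull values outside [p1, p99] back to the boundary; missing values stay missing.
replace sales = p1  if sales < p1  & !missing(sales)
replace sales = p99 if sales > p99 & !missing(sales)

gen double ln_sales = ln(sales)
assert inrange(sales, p1, p99) if !missing(sales)
tabstat ln_sales, by(country_name) statistics(mean sd min max)
```

Whether to winsorize by economy, by economy and wave, or on the pooled sample is a modelling choice that should be stated and justified in the report.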

Phase 3: Handling Panel Time Gaps

This is the most challenging part of using WBESD. Unlike annual stock market data, these surveys happen irregularly (e.g., 2013, 2016, 2020).

1. Declaring Panel Data:

● You cannot just use panelid if IDs are repeated across countries (e.g., Firm #1 in Vietnam and Firm #1 in Peru).

● Create a unique ID: egen unique_id = group(country_name panelid)

● Declare data: xtset unique_id year
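The steps above can be sketched on a small made-up panel. The block also builds a sequential wave counter per economy (the variable name wave is an assumption), because with irregular survey years a one-period lag is only meaningful when measured in waves rather than calendar years.

```stata
* Made-up panel: panelid 1 exists in both economies; Peru firm 2 skips 2017.
clear
input str7 country_name panelid year
"Vietnam" 1 2015
"Vietnam" 1 2023
"Vietnam" 2 2015
"Vietnam" 2 2023
"Peru"    1 2010
"Peru"    1 2017
"Peru"    1 2023
"Peru"    2 2010
"Peru"    2 2023
end

* Firm identifier that is unique across economies.
egen unique_id = group(country_name panelid)
isid unique_id year               // errors out if any firm-year is duplicated

xtset unique_id year
xtdescribe                        // participation pattern in calendar years

* Sequential wave index within each economy: 1, 2, 3, ...
bysort country_name (year): gen wave = sum(year != year[_n-1])

xtset unique_id wave
xtdescribe                        // same pattern, now measured in waves
```

With `xtset unique_id wave`, `L.x` refers to the previous survey wave of that economy, and a firm that skipped a wave gets a missing lag instead of a misleading one.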

2. Defining the "Treatment" (Switchers):

● A firm is "Treated" in the DiD sense if it goes from No Credit (k8=0) in one wave to Yes Credit (k8=1) in the next.

● Identify the year the switch happened. Since there are gaps, we assume the switch happened between the survey waves.

3. Imputing Dynamics for Event Studies (this is only relevant if you choose to conduct event studies):

● Because you don’t have data for every year (e.g., data exists for t = 2015 and t = 2019, but missing 2016, 2017, 2018), you cannot create a standard “Year-1, Year-2” event plot.

● The “Relative Wave” Solution: Instead of "Years since treatment", use “Waves since treatment”.

● Constructing the Variable: If a firm is treated in 2019 (it had no credit in 2015, but has credit in 2019):

○ 2015 is Time t = -1 (Pre-treatment)

○ 2019 is Time t = 0 (Treatment/Post)

○ 2023 is Time t = 1 (Post-treatment persistence)

Use these “Relative Time” indicators to plot your coefficients if needed.
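The switcher definition and the relative-wave variable can be sketched together on a made-up three-wave panel. This is an illustration under stated assumptions, not the required specification: creditdum is taken to be the 0/1 dummy from Q1, wave is a sequential wave index, and gained, first_treat_wave, ever_credit, treat, never, rel_wave, and did are hypothetical names.

```stata
* Made-up panel: firm 1 gains credit in wave 2, firm 2 never has credit,
* firm 3 always has credit.
clear
input unique_id wave creditdum
1 1 0
1 2 1
1 3 1
2 1 0
2 2 0
2 3 0
3 1 1
3 2 1
3 3 1
end
xtset unique_id wave

* Switch from no credit to credit between two consecutive waves.
gen byte gained = (creditdum == 1 & L.creditdum == 0) if !missing(creditdum, L.creditdum)

* First wave in which the firm is observed with newly gained credit.
bysort unique_id: egen first_treat_wave = min(cond(gained == 1, wave, .))
bysort unique_id: egen byte ever_credit = max(creditdum)

gen byte treat = !missing(first_treat_wave)       // switchers
gen byte never = (ever_credit == 0)               // never had credit
gen rel_wave   = wave - first_treat_wave if treat == 1
gen byte did   = (treat == 1 & wave >= first_treat_wave)

* Keep switchers and never-credit firms; always-credit firms drop out.
keep if treat == 1 | never == 1
list unique_id wave creditdum treat rel_wave did, sepby(unique_id)
```

For firm 1, rel_wave takes the values -1, 0, and 1, matching the 2015/2019/2023 example above. Firms that lose credit and later regain it would also be flagged as switchers by this rule, so how to treat them is a separate decision to document.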

Phase 4: Common Pitfalls to Avoid

The “Inconsistent Panel” Trap:

● Some firms appear in 2015, 2018, and 2023 but are missing in 2020.

● For the First Difference or Lagged models, Stata will drop these firms because it cannot calculate (t) - (t - 1).

● Check: Use xtdescribe to see your pattern. Ideally, keep firms that are present in consecutive waves for the DiD analysis.

● Creating a Time Index: Do not use the calendar year as your time index for xtset. Instead, generate a sequential Wave Index, e.g.,