MSBA 6030: Data Assignment 1
1 Instructions
This is an individual assignment, but you are permitted to talk to other students in case you get
stuck. Download the associated Compustat zip file + codebook, which contains annual financial
information for US companies over the period January 1950 to July 31, 2018. The dataset
contains all quantitatively reported financial statement variables (over 700), but for this homework
we will focus on a restricted subset.[1]
When you submit the homework, include the output+answers to the questions along with the
code used to generate the output (we will run the code to verify the output). Ideally, you should
use R in conjunction with Markdown to solve this homework.[2]
2 Objectives
There are three main objectives in this assignment:
1. Is there indirect evidence suggestive of data manipulation across the universe of publicly
traded US companies?
2. Could we have forecasted the accounting scandal at Satyam based on the Beneish M-Score
model?
3. Despite concerns about data quality, is accounting information incrementally useful for
predicting future cash flows across the universe of publicly traded US companies?
[1] The entire dataset is provided so that you can decide what you wish to use for the final project.
[2] If you strongly prefer to use Python, Jupyter notebooks are acceptable as well.
3 Homework
3.1 Pre-processing
1. The first step is to make sure the data is ready for analysis. For this assignment we only
need to use gvkey (company identifier), datadate (reporting period), fyear (fiscal year), revt,
rect, ppegt, epspi, ni, at, oancf, sic, and rdq. Use the codebook to familiarize yourself with what
these variables represent.
2. Restrict the sample to firms with fiscal years 1988 to 2017 inclusive.
3. Drop any observations with missing assets, revenue, net income, EPS, accounts receivable,
or operating cash flows (data errors or odd accounting rules).
4. If PPE is missing set it as zero (often the dataset reports missing when it’s zero).
5. Drop any observations where revenue or accounts receivable are negative (data errors or
odd accounting rules).
6. The dataset has some duplicate observations at the firm-fiscal year level where all variable
values are identical except rdq. Drop duplicates at the firm-fiscal year level based on every
variable except rdq.
7. Your final dataset should have approximately 248,288 observations for all variables except
rdq, where it is 215,786.[3]
[3] Depending on how you dropped the duplicates, you may have a slightly different count.
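
The pre-processing steps above can be sketched in Python with pandas as follows. This is only one possible reading of the steps, not a reference solution: the function name `preprocess` is hypothetical, and it assumes the raw Compustat file has already been loaded into a DataFrame whose column names match the codebook names listed in step 1.

```python
import pandas as pd

# Columns named in step 1 of the assignment.
KEEP = ["gvkey", "datadate", "fyear", "revt", "rect", "ppegt",
        "epspi", "ni", "at", "oancf", "sic", "rdq"]
# Variables that must be non-missing (step 3).
REQUIRED = ["at", "revt", "ni", "epspi", "rect", "oancf"]


def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply steps 1-6 to a raw Compustat-style DataFrame (hypothetical helper)."""
    df = raw[KEEP].copy()                                   # step 1: keep only needed columns
    df = df.assign(ppegt=df["ppegt"].fillna(0))             # step 4: missing PPE treated as zero
    df = df[df["fyear"].between(1988, 2017)]                # step 2: bounds are inclusive
    df = df.dropna(subset=REQUIRED)                         # step 3: drop missing key variables
    df = df[(df["revt"] >= 0) & (df["rect"] >= 0)]          # step 5: drop negative values
    dedup_cols = [c for c in KEEP if c != "rdq"]
    df = df.drop_duplicates(subset=dedup_cols)              # step 6: ignore rdq when de-duplicating
    return df.reset_index(drop=True)
```

`drop_duplicates` keeps the first row of each duplicate group by default, so the surviving rdq value depends on row order; this is one reason the final count can differ slightly from the one quoted in step 7.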
3.2 Descriptive statistics
1. The first step in any analysis is to understand the raw descriptive statistics, so report them.
Note that all variables are reported in millions (USD).
a. What is the average revenue, net income, and total assets for a firm in fiscal year 2017?
b. Plot the average revenue, net income, and total assets by fiscal year (1988-2017) and
comment on any trends.
c. The data you generated from the pre-processing steps is obviously different from the
raw data. Is there any potential bias in the "cleaned" data? What would you do to
learn whether the cleaned data is representative or non-representative of the raw data?
Provide some simple analysis to establish the degree of representativeness.
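
A minimal pandas sketch of the mechanics behind parts a to c, assuming the cleaned data is a DataFrame with the codebook column names. The helper names `yearly_means` and `retention_by_year` are hypothetical, and the share of raw rows retained per year is only one of many possible representativeness checks.

```python
import pandas as pd


def yearly_means(df: pd.DataFrame, cols=("revt", "ni", "at")) -> pd.DataFrame:
    """Average of each column by fiscal year (values are in USD millions)."""
    return df.groupby("fyear")[list(cols)].mean()


def retention_by_year(raw: pd.DataFrame, clean: pd.DataFrame) -> pd.Series:
    """Share of raw observations that survive cleaning, by fiscal year."""
    raw_n = raw.groupby("fyear").size()
    clean_n = clean.groupby("fyear").size()
    # Years that vanish entirely from the cleaned data get a share of 0.
    return (clean_n.reindex(raw_n.index, fill_value=0) / raw_n).rename("share_kept")
```

With matplotlib installed, `yearly_means(df).plot()` draws one line per variable against fiscal year, and `yearly_means(df).loc[2017]` returns the fiscal-year 2017 averages.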
3.3 Indirect evidence of earnings management
2. Are firms relatively more likely to report performance just better or just worse than last year's
performance?
a. Calculate the change in EPS from year t-1 to year t for each firm-year.[4] Plot the histogram
of the change in EPS, restricting the change in EPS to [-0.10, +0.10], in 1-cent increments.
b. Calculate the change in ROA from year t-1 to year t for each firm-year. ROA (return on
assets) is another common performance metric and is defined as net income scaled by
lagged total assets. Plot the histogram of the change in ROA, restricting the change in ROA
to [-0.10, +0.10], in 1% increments.
c. Compare the two distributions around zero; which variable is relatively more asymmetric
and in which direction of asymmetry?[5] Note that both EPS and ROA share the same
numerator (net income). Conjecture possible explanations for the relative symmetry
difference between the two variables.
[4] If the lagged variable is missing then set change in EPS as missing. Follow this rule throughout the exercise.
[5] You do not need to establish a formal statistical test here; raw summary statistics will be enough. The formal
method here involves local linear regressions.
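
The lagged-change calculations can be sketched as below. The helper names (`lag_by_firm`, `add_changes`, `histogram_counts`) and the output column names are hypothetical. The sketch assumes that "lagged" means the same firm's row for the immediately preceding fiscal year, so a gap in a firm's history yields a missing lag, in line with footnote 4.

```python
import numpy as np
import pandas as pd


def lag_by_firm(df: pd.DataFrame, col: str) -> pd.Series:
    """Prior fiscal year's value within each gvkey; NaN unless the previous row is fyear - 1.

    Assumes df is already sorted by gvkey and fyear.
    """
    g = df.groupby("gvkey")
    prev_val = g[col].shift(1)
    prev_year = g["fyear"].shift(1)
    return prev_val.where(df["fyear"] - prev_year == 1)


def add_changes(df: pd.DataFrame) -> pd.DataFrame:
    """Add change in EPS, ROA, and change in ROA as new columns."""
    df = df.sort_values(["gvkey", "fyear"]).reset_index(drop=True)
    df["d_eps"] = df["epspi"] - lag_by_firm(df, "epspi")
    df["roa"] = df["ni"] / lag_by_firm(df, "at")      # net income over lagged total assets
    df["d_roa"] = df["roa"] - lag_by_firm(df, "roa")
    return df


def histogram_counts(changes: pd.Series, lo=-0.10, hi=0.10, width=0.01):
    """Counts per bin of the given width for changes inside [lo, hi]."""
    x = changes.dropna()
    x = x[(x >= lo) & (x <= hi)]
    # Round the edges so floating-point drift does not shift the bin boundaries.
    edges = np.round(np.arange(lo, hi + width / 2, width), 10)
    counts, edges = np.histogram(x, bins=edges)
    return counts, edges
```

Values that sit exactly on a bin edge (for example an EPS change of exactly one cent) are sensitive to floating-point representation, so it is worth checking how such ties are assigned before interpreting the bins next to zero.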
3. Feel free to use an R package such as benford.analysis to generate χ² test statistics and
p-values.[6] For the following variables: revenue, operating cash flows, and total assets:
a. Without looking at the data, rank-order which variables you think are most likely to be
manipulated in the data and why.
b. Plot the distribution of the first digit (relative to Benford's law distribution on the same
graph). Report the χ² test statistic and interpret its significance. Are you surprised at
which variables are significant or insignificant?
c. Repeat part b for the second digit (note that this test only applies to numbers that have
at least two digits).
e. For this question, pool the digits from all three variables together for analysis [imagine
they all came from a single variable]. Focusing on the first digit's distribution, calculate
the difference between the actual frequency and Benford's Law's expected frequency.
Take the maximum absolute value of this difference across the 9 digits, and call it MAD
(maximum absolute deviation). Calculate MAD by fiscal year and plot it across time.
Discuss any patterns, e.g., overall trends or periods associated with corporate scandals
and regulation.
f. Thus far we have analyzed violations of Benford from the perspective of all firms across
the entire sample period or by fiscal year. Can you think of a way to test at the firm-year
level instead? If so, try it on your project firm [hint: use more data].
[6] https://cran.r-project.org/web/packages/benford.analysis/README.html
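
For readers working in Python rather than with benford.analysis, the first-digit calculations can be sketched with the standard library alone. This is an illustrative sketch, not the package's implementation: all function names are hypothetical, zeros and missing values are skipped, and negative values are analyzed by absolute value (an assumption that matters for operating cash flows).

```python
import math
from collections import Counter, defaultdict


def benford_expected():
    """Benford's law: P(first digit = d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x):
    """Leading nonzero digit of |x|, or None for zero, missing, or non-finite values."""
    if x is None or not math.isfinite(x) or x == 0:
        return None
    # In scientific notation the first character is the leading digit.
    return int(f"{abs(x):.16e}"[0])


def _digit_counts(values):
    digits = [d for d in map(first_digit, values) if d is not None]
    if not digits:
        raise ValueError("no usable values")
    return Counter(digits), len(digits)


def chi_square_first_digit(values):
    """Pearson chi-squared statistic against Benford's law (8 degrees of freedom)."""
    observed, n = _digit_counts(values)
    chi2 = sum((observed[d] - n * p) ** 2 / (n * p)
               for d, p in benford_expected().items())
    return chi2, n


def max_abs_deviation(values):
    """MAD as defined in part e: largest |actual - expected| frequency over the 9 digits."""
    observed, n = _digit_counts(values)
    return max(abs(observed[d] / n - p) for d, p in benford_expected().items())


def mad_by_year(rows):
    """rows: iterable of (fyear, value) pairs, pooled across the three variables."""
    by_year = defaultdict(list)
    for fyear, value in rows:
        by_year[fyear].append(value)
    return {year: max_abs_deviation(vals) for year, vals in sorted(by_year.items())}
```

With nine digit categories the statistic has 8 degrees of freedom, so values above roughly 15.51 reject Benford's law at the 5% level. With samples as large as this one, even small deviations tend to be statistically significant, which is worth keeping in mind when interpreting the results.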
4. The SEC uses a discretionary accruals model to screen firms for improper earnings
management.
a. Estimate the following OLS regression for each year separately
$$\text{Accruals}_{it} = \beta_0 + \beta_1 \text{Cash Revenue Growth}_{it} + \beta_2 \text{Gross PPE}_{it} + \text{SIC}_i + \epsilon_{it} \qquad (1)$$
i. where $\text{accruals}_{it}$ is net income - operating cash flows for firm i in fiscal year t
ii. where cash revenue growth is revenue - rect
iii. SIC is an industry classifier and should be coded as a vector of indicator values which
equal one for each SIC code and zero otherwise.[7]
iv. where all variables above (except SIC), including gross PPE, are scaled by prior year's
total assets
v. based on the estimated parameters, obtain the fitted residuals $\hat{\epsilon}_{it}$ for each firm-year
observation. Check to make sure that the fitted residuals have zero mean or you
must have done something wrong.
vi. firms may be interested in earnings management upwards or downwards; all we are
interested in is any evidence of earnings management. Create a new variable which
is the absolute value of $\hat{\epsilon}_{it}$. This is commonly known as the unsigned discretionary
accruals. We will use this in the next step.
b. The SEC prefers to detect earnings management with multiple signals. One such signal
is that firms that tend to manage earnings also tend to delay their financial reporting
(presumably to give them time to manage). The variable rdq measures the date of
the earnings announcement (when financial data is released). Datadate is the date on
which the firm's operating period ends, so rdq - datadate represents the delay in reporting.
Firms are obligated to report within a certain time period (mostly 120 days). Set delay
as missing for any observation where delay is negative or more than 180 days, as
those are unusual circumstances or data errors. What is the average delay (in days) in
the sample?
c. Estimate the following pooled OLS regression
$$\text{Delay}_{it} = \beta_0 + \beta_1 \text{Unsigned Discretionary Accrual}_{it} + \text{Firm}_i + \epsilon_{it} \qquad (2)$$
i. where $\text{Firm}_i$ represents a vector of indicators equaling 1 for each gvkey and zero
otherwise.
ii. what is the source of variation used in the data to identify $\beta_1$?
iii. provide the regression results (do not report the firm indicator coefficients) and interpret
the results.
[7] The regression software should automatically omit one SIC identifier, which is the baseline group.
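
The mechanics of part a (year-by-year OLS residuals) and part b (the reporting delay) can be sketched as follows. This is a sketch under stated assumptions, not the SEC's or the course's implementation: the function names and the column names `accruals`, `cash_rev`, and `ppe` (the already-scaled regression variables) are hypothetical, the DataFrame index is assumed unique, and `rdq` and `datadate` are assumed to be in a format `pd.to_datetime` parses directly. Equation (2) is not covered here.

```python
import numpy as np
import pandas as pd


def residuals_by_year(df, y="accruals", x=("cash_rev", "ppe"), group="sic"):
    """Residuals from an OLS of y on x plus industry indicators, fit separately per fyear.

    A full set of indicators with no separate intercept spans the same column
    space as an intercept plus all-but-one indicator, so the residuals match.
    """
    resid = pd.Series(np.nan, index=df.index)
    cols = [y, *x, group]
    for _, part in df.dropna(subset=cols).groupby("fyear"):
        dummies = pd.get_dummies(part[group].astype(str), dtype=float)
        X = np.column_stack([part[list(x)].to_numpy(dtype=float), dummies.to_numpy()])
        target = part[y].to_numpy(dtype=float)
        beta, *_ = np.linalg.lstsq(X, target, rcond=None)
        resid.loc[part.index] = target - X @ beta
    return resid


def reporting_delay(df, max_days=180):
    """Days from datadate to rdq; negative delays or delays above max_days become missing."""
    delay = (pd.to_datetime(df["rdq"]) - pd.to_datetime(df["datadate"])).dt.days
    return delay.where((delay >= 0) & (delay <= max_days))
```

The unsigned discretionary accrual is then `residuals_by_year(df).abs()`, and `residuals_by_year(df).groupby(df["fyear"]).mean()` should be zero up to rounding error, which is the check requested in step v.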
5. On January 8, 2009, Satyam Computer Services, a leading Indian outsourcing company that
serves more than a third of the Fortune 500 companies, announced that it had systematically
falsified accounts over the past several years. It is not hard to spot that fateful moment on
the price chart above. [For more detail, see attached New York Times article Satyam Chief
Admits Huge Fraud].
a. We want to see whether the Beneish Earnings Manipulation Detection model could have
picked up some early warning signs from the company's financial statements. Because
it is an Indian firm that trades as an ADR on the NYSE, the firm files an annual 20-F
rather than a 10-K.
b. Using the Beneish Earnings Manipulation Model, compute the company's M-Score for
both 2008 and 2007. How does the company score each year on this model? Would
these irregularities have been flagged in advance? Use the provided spreadsheet to fill
in the numbers. The M-Score components' formulas have been written for you, so the
calculations will be generated automatically once you provide the raw data. All the
information needed to make the calculations is provided in the attached exhibits (you
will be able to find the depreciation expense on the statement of cash flows). Note that
in 2007 the firm classified investment in bank deposits as a non-current asset. In 2008,
the bank deposits were re-classified as a current asset. To increase comparability across
time, treat the bank deposits as current assets in both periods in your calculations.
c. Interpreting the individual components of the M-Score, which input factors seem to
suggest warning signs?
d. The Sarbanes-Oxley Act of 2002 required boards of directors to form an audit committee
staffed by a financial expert. Did Satyam disclose such an expert? Reading the
background characteristics of the various board members, who in your mind would have
been the most qualified expert?
e. Overall, would you say this "Enron of India" was potentially detectable in advance?
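
For reference, the eight-variable M-Score published by Beneish (1999) combines its component indices linearly, as sketched below. The provided spreadsheet remains the authoritative source for this assignment; the function names here are hypothetical, and only two of the eight index definitions are spelled out as examples.

```python
def dsri(rec_t, sales_t, rec_prev, sales_prev):
    """Days Sales in Receivables Index: (receivables/sales) now vs. prior year."""
    return (rec_t / sales_t) / (rec_prev / sales_prev)


def sgi(sales_t, sales_prev):
    """Sales Growth Index: current sales over prior-year sales."""
    return sales_t / sales_prev


def beneish_m_score(dsri, gmi, aqi, sgi, depi, sgai, tata, lvgi):
    """Eight-variable Beneish M-Score; higher values indicate a higher likelihood of manipulation."""
    return (-4.84
            + 0.920 * dsri   # days sales in receivables index
            + 0.528 * gmi    # gross margin index
            + 0.404 * aqi    # asset quality index
            + 0.892 * sgi    # sales growth index
            + 0.115 * depi   # depreciation index
            - 0.172 * sgai   # SG&A expense index
            + 4.679 * tata   # total accruals to total assets
            - 0.327 * lvgi)  # leverage index
```

A firm with no year-over-year change (every index equal to 1 and zero accruals) scores -2.48. A commonly cited flagging cutoff is -2.22, although other cutoffs are also used in practice, so check which one the course materials apply.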
6. The objective of an accrual-based accounting system is to provide more relevant information
to investors at the expense of reliability, in comparison to pure cash flows. How can we
evaluate the relevance of the information? One of the principal tasks of an investor is to
determine a firm's future cash flows for the purposes of valuation. Using the post-processed
data you've assembled previously, estimate the following predictive regression (note the lag
operator), which asks how well the current year's accounting accruals and cash flows predict next
year's cash flows
$$\text{operating cash flows}_{i,t+1} = \beta_0 + \beta_1 \text{accruals}_{i,t} + \beta_2 \text{operating cash flows}_{i,t} + \epsilon_{i,t+1} \qquad (3)$$
where all variables are defined as previously (e.g., scaled by prior year's total assets).
a. Before you run the regression, provide an economic interpretation to the following three
cases
i. if ^ 1 = 0, if ^ 1 0
ii. Show the actual regression results. Are accounting accruals useful in forecasting
future cash flows? Are cash flows a useful predictor?
b. The sample period is 29 years (1988-2017); divide it into period 1 (1988-2002) and period
2 (2003-2017) and repeat the regression exercise. Comment on any differences between
the two sample periods. Are you surprised?
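
A minimal sketch of how equation (3) could be estimated with NumPy least squares. The names `lead_by_firm`, `fit_predictive`, `accruals`, and `oancf_s` (scaled operating cash flows) are hypothetical, and the sketch assumes next year's value must come from the same firm's immediately following fiscal year. It returns point estimates only; standard errors would need a dedicated regression package.

```python
import numpy as np
import pandas as pd


def lead_by_firm(df: pd.DataFrame, col: str) -> pd.Series:
    """Next fiscal year's value within each gvkey; NaN unless the next row is fyear + 1.

    Assumes df is already sorted by gvkey and fyear.
    """
    g = df.groupby("gvkey")
    next_val = g[col].shift(-1)
    next_year = g["fyear"].shift(-1)
    return next_val.where(next_year - df["fyear"] == 1)


def fit_predictive(df: pd.DataFrame, y="oancf_s", x=("accruals", "oancf_s")) -> dict:
    """OLS of next year's y on this year's x columns, with an intercept."""
    df = df.sort_values(["gvkey", "fyear"]).reset_index(drop=True)
    data = pd.DataFrame({"y_next": lead_by_firm(df, y), **{c: df[c] for c in x}}).dropna()
    X = np.column_stack([np.ones(len(data)), data[list(x)].to_numpy(dtype=float)])
    beta, *_ = np.linalg.lstsq(X, data["y_next"].to_numpy(dtype=float), rcond=None)
    return dict(zip(("const", *x), beta))
```

For part b, one option is `fit_predictive(df[df["fyear"].between(1988, 2002)])` and the same call for 2003 to 2017. Filtering before building the lead drops the 2002-to-2003 pair from both sub-samples, which is a design choice worth stating in the write-up.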