Module 03: Lab 01 – AD 688 Web Analytics
Important Reminder
All tasks for this lab must be performed on your AWS EC2 instance. Ensure you have accepted the GitHub Classroom assignment and cloned your repository before proceeding.
Step 1: GitHub Classroom Repository
To complete this assignment, you must first accept the GitHub Classroom Lab 05 invitation. Once accepted, follow the instructions below to clone the repository and start working.
Step 2: Clone the Repository on Your EC2 Instance
Option 1: Using Terminal
1. Open your terminal in VS Code.
2. Clone the repository:
git clone https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git
cd YOUR_REPO_NAME
3. Ensure all work is performed inside this cloned directory.
Option 2: Using VS Code Git Source Control
1. Open Visual Studio Code.
2. Click on the Source Control tab on the left panel.
3. Click on Clone Repository.
4. Paste the repository URL:
https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git
5. Select a local directory where you want to store the repository.
6. Once cloned, open the folder in VS Code and start working within it.
1 Setting Up Your AWS EC2 Environment
Before starting the assignment, you need to set up your EC2 instance with the required Spark ML and SQL packages.
1.1 Update and Upgrade System Packages
Run the following command to ensure your system is up to date and remove unnecessary packages:
sudo apt update && sudo apt upgrade -y && sudo apt autoremove -y
1.2 Install Java and Scala
Ensure Java and Scala are installed since Spark depends on them:
sudo apt install -y openjdk-11-jdk scala
1.3 Install Apache Spark
wget https://dlcdn.apache.org/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
sudo tar -xvf spark-3.5.4-bin-hadoop3.tgz -C /opt/
1.4 Set Environment Variables
echo "export SPARK_HOME=/opt/spark-3.5.4-bin-hadoop3" >> ~/.bashrc
echo "export PATH=$SPARK_HOME/bin:$PATH" >> ~/.bashrc
source ~/.bashrc
2 Objective
In this assignment, you will use Spark SQL to query job postings data from the Lightcast dataset. You will:
1. Load the job postings data into a Spark DataFrame.
2. Register the DataFrame as a temporary SQL table.
3. Run SQL queries to explore job roles, salaries, locations, and trends.
4. Save your queries and results in either a Quarto file ( .qmd ) or a Jupyter Notebook ( .ipynb ).
3 Tasks
3.1 Step 1: Create Your Analysis File and Configure Git Ignore
3.1.1 Preferred Language: Python
For this lab, Python is the preferred language for Spark SQL queries. However, if you are comfortable with R, you may use SparkR. Follow the R Spark SQL resource for guidance.
3.1.2 Using a Virtual Environment (Recommended)
Since managed EC2 instances may restrict system-wide pip installs and expect packages to be installed via apt install python3-xyz , it's best to use a Python virtual environment to avoid conflicts:
python3 -m venv .venv
source .venv/bin/activate
This will create an isolated environment where you can install packages without affecting the system Python.
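If you want to double-check that the virtual environment is active, here is a quick optional check from Python; it simply prints the interpreter's environment root, which should point into .venv while the environment is active:
import sys
# Should point into the .venv directory when the environment is active
print(sys.prefix)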
3.1.3 Install Python Dependencies
Activate your virtual environment before installing:
source .venv/bin/activate
pip install pyspark pandas jupyter notebook matplotlib plotly seaborn
3.1.4 Verify Spark Installation
pyspark --version
If the setup is successful, you should see the Spark version.
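Optionally, you can also confirm that PySpark runs from Python inside your virtual environment. This is a minimal sanity check, not part of the graded work:
from pyspark.sql import SparkSession

# Start a local Spark session and run a trivial computation
spark = SparkSession.builder.appName("SanityCheck").getOrCreate()
print(spark.version)           # should match the installed Spark version
print(spark.range(5).count())  # should print 5
spark.stop()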
3.1.5 Configure Git Ignore
1. Add large dataset files (e.g., lightcast_data.csv ) to .gitignore to prevent pushing them to GitHub.
2. Make sure to add the file to .gitignore first, and only then commit and sync.
3. Make sure to sync after you commit; forgetting to sync can cause errors later.
3.2 Task: Choose Your File Format
1. Inside your repository, create either:
A Quarto file ( spark_analysis.qmd ), or
A Jupyter Notebook ( spark_analysis.ipynb )
2. If using a Jupyter Notebook, add the following YAML metadata in the first cell:
---
title: "Spark SQL Job Data Analysis"
author: "Your Name"
format: html
embed-resources: true
date: "2020-02-25"
date-format: long
execute:
  echo: true
---
3. Only submit one of the two files ( .qmd or .ipynb ).
3.3 Step 2: Load and Prepare the Dataset
1. Load the Dataset into Spark:
gdown "https://drive.google.com/uc?id=1V2GCHGt2dkFGqVBeoUFckU4IhUgk4ocQ"
You can also copy the file from your Lab 04 folder using cp ../lab04-yourgithubusername/lightcast_job_postings.csv .
Read the CSV file into a Spark DataFrame.
Register it as a temporary SQL table.
from pyspark.sql import SparkSession
# Start a Spark session
spark = SparkSession.builder.appName("JobPostingsAnalysis").getOrCreate()
# Load the CSV file into a Spark DataFrame.
df = spark.read.option("header", "true").option("inferSchema", "true").csv("lightcast_job_postings.csv")
# Register the DataFrame as a temporary SQL table
df.createOrReplaceTempView("jobs")
2. Verify the Data:
Display the first five rows.
Show the schema (column names & data types).
# Verify the Data
# Display the first five rows
df.show(5)
# Show the schema (column names & data types)
df.printSchema()
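If you want a quick size check before writing queries, here is a short optional sketch (note that count() scans the full dataset, so it can take a moment):
# Optional: report the number of rows and columns in the DataFrame
print(f"Rows: {df.count()}, Columns: {len(df.columns)}")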
3.4 Step 3: Run Spark SQL Queries
Answer the following questions using Spark SQL. Make sure your code is visible and the output displays correctly in the rendered HTML.
For each query:
Write the SQL statement in your file.
Display the query results in a structured format.
Briefly explain the insights from the results.
For Example:
Find the number of job postings for each employment type and order them in descending order.
# Run a Spark SQL query to count job postings per employment type
job_counts_by_type = spark.sql("""
SELECT EMPLOYMENT_TYPE_NAME, COUNT(*) AS job_count
FROM jobs
GROUP BY EMPLOYMENT_TYPE_NAME
ORDER BY job_count DESC
""")
# Show the result
job_counts_by_type.show()
We can retrieve this information from the job postings table by counting the job IDs for each employment type. There are 3686 full-time jobs and 5635 part-time jobs in the dataset.
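Since pandas and the plotting libraries were installed earlier, you can optionally convert a Spark SQL result to a pandas DataFrame for charting. A minimal sketch using matplotlib (a chart is not required for this example, but it can make the insight easier to explain):
import matplotlib.pyplot as plt

# Optional: bring the small aggregated result into pandas and plot it
pdf = job_counts_by_type.toPandas()
pdf.plot(kind="bar", x="EMPLOYMENT_TYPE_NAME", y="job_count", legend=False)
plt.ylabel("Number of postings")
plt.tight_layout()
plt.show()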
Here are the questions:
1. How many job postings are there in the dataset?
2. Find the top 5 most common job titles.
3. Find the average salary for each employment type.
4. Which five states have the most job postings?
5. Calculate the salary range (max - min) for each job title in California.
6. Which top 5 industries have the highest average salaries and more than 100 job postings?
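As a starting point, here is a minimal sketch for question 1. The remaining questions follow the same pattern, but their column names (job title, salary, state, industry) depend on the Lightcast schema you see in printSchema() and are not assumed here:
# Question 1: total number of job postings in the dataset
total_postings = spark.sql("SELECT COUNT(*) AS total_postings FROM jobs")
total_postings.show()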
3.5 Submission Instructions
1. Commit and Push Your Work Using VS Code Source Control.
2. Submit Only Your GitHub Repository Link on Blackboard.
Resources and Installing R
Apache Spark SQL Documentation
Quarto Documentation
Jupyter Notebook Documentation
R Spark SQL Guide
Installing R and Adding Packages
If you prefer using R instead of Python, you can install R and necessary packages on your EC2 instance.
Install R on Ubuntu
sudo apt update
sudo apt install -y r-base
Verify R Installation
Run the following command to check if R is installed correctly:
R --version
Install R Packages Using Ubuntu Command Line
You can install R packages directly from the command line by using the Rscript command:
sudo Rscript -e 'install.packages("tidyverse", repos="http://cran.rstudio.com/")'
For installing multiple packages:
sudo Rscript -e 'install.packages(c("sparklyr", "dplyr", "ggplot2"), repos="http://cran.rstudio.com/")'
Connecting R with Spark
To use Spark with R, install sparklyr and configure the connection:
install.packages("sparklyr")
library(sparklyr)
spark_install()
sc <- spark_connect(master = "local")
For more details, refer to the R Spark SQL Guide.