Module 03: Lab 01 – AD 688 Web Analytics
Important Reminder
All tasks for this lab must be performed on your AWS EC2 instance. Ensure you have accepted the GitHub Classroom assignment and cloned your repository before proceeding.
Step 1: GitHub Classroom Repository
To complete this assignment, you must first accept the GitHub Classroom Lab 05 invitation. Once accepted, follow the instructions below to clone the repository and start working.
Step 2: Clone the Repository on Your EC2 Instance
Option 1: Using Terminal
1. Open your terminal in VS Code.
2. Clone the repository:
git clone https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git
cd YOUR_REPO_NAME
3. Ensure all work is performed inside this cloned directory.
Option 2: Using VS Code Git Source Control
1. Open Visual Studio Code.
2. Click on the Source Control tab on the left panel.
3. Click on Clone Repository.
4. Paste the repository URL:
https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git
5. Select a local directory where you want to store the repository.
6. Once cloned, open the folder in VS Code and start working within it.
1 Setting Up Your AWS EC2 Environment
Before starting the assignment, you need to set up your EC2 instance with the required Spark ML and SQL packages.
1.1 Update and Upgrade System Packages
Run the following command to ensure your system is up to date and remove unnecessary packages:
sudo apt update && sudo apt upgrade -y && sudo apt autoremove -y
1.2 Install Java and Scala
Ensure Java and Scala are installed since Spark depends on them:
sudo apt install -y openjdk-11-jdk scala
1.3 Install Apache Spark
wget https://dlcdn.apache.org/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
sudo tar -xvf spark-3.5.4-bin-hadoop3.tgz -C /opt/
1.4 Set Environment Variables
echo "export SPARK_HOME=/opt/spark-3.5.4-bin-hadoop3" >> ~/.bashrc
echo "export PATH=$SPARK_HOME/bin:$PATH" >> ~/.bashrc
source ~/.bashrc
2 Objective
In this assignment, you will use Spark SQL to query job postings data from the Lightcast dataset. You will:
1. Load the job postings data into a Spark DataFrame.
2. Register the DataFrame as a temporary SQL table.
3. Run SQL queries to explore job roles, salaries, locations, and trends.
4. Save your queries and results in either a Quarto file ( .qmd ) or a Jupyter Notebook ( .ipynb ).
3 Tasks
3.1 Step 1: Create Your Analysis File and Configure Git Ignore
3.1.1 Preferred Language: Python
For this lab, Python is the preferred language for Spark SQL queries. However, if you are comfortable with R, you may use SparkR. Follow the R Spark SQL resource for guidance.
3.1.2 Using a Virtual Environment (Recommended)
Since managed EC2 instances may restrict system-wide pip installs and expect packages to be installed via apt install python3-xyz , it's best to use a Python virtual environment to avoid conflicts:
python3 -m venv .venv
source .venv/bin/activate
This will create an isolated environment where you can install packages without affecting the system Python.
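If you want to double-check that the virtual environment is active, here is a quick optional check from Python; it simply prints the interpreter's environment root, which should point into .venv while the environment is active:
import sys
# Should point into the .venv directory when the environment is active
print(sys.prefix)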
3.1.3 Install Python Dependencies
Activate your virtual environment before installing:
source .venv/bin/activate
pip install pyspark pandas jupyter notebook matplotlib plotly seaborn
3.1.4 Verify Spark Installation
pyspark --version
If the setup is successful, you should see the Spark version.
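Optionally, you can also confirm that PySpark runs from Python inside your virtual environment. This is a minimal sanity check, not part of the graded work:
from pyspark.sql import SparkSession

# Start a local Spark session and run a trivial computation
spark = SparkSession.builder.appName("SanityCheck").getOrCreate()
print(spark.version)           # should match the installed Spark version
print(spark.range(5).count())  # should print 5
spark.stop()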
3.1.5 Configure Git Ignore
1. Add large dataset files (e.g., lightcast_data.csv ) to .gitignore to prevent pushing them to GitHub.
2. Make sure to add the file to .gitignore first, and only then commit and sync.
3. Make sure to sync after you commit; forgetting to sync can cause errors later.
3.2 Task: Choose Your File Format
1. Inside your repository, create either:
A Quarto file ( spark_analysis.qmd ), or
A Jupyter Notebook ( spark_analysis.ipynb )
2. If using a Jupyter Notebook, add the following YAML metadata in the first cell:
---
title: "Spark SQL Job Data Analysis"
author: "Your Name"
format: html
embed-resources: true
date: "2020-02-25"
date-format: long
execute:
  echo: true
---
3. Only submit one of the two files ( .qmd or .ipynb ).
3.3 Step 2: Load and Prepare the Dataset
1. Load the Dataset into Spark:
gdown "https://drive.google.com/uc?id=1V2GCHGt2dkFGqVBeoUFckU4IhUgk4ocQ"
You can also copy the file from your Lab 04 folder using cp ../lab04-yourgithubusername/lightcast_job_postings.csv .
Read the CSV file into a Spark DataFrame.
Register it as a temporary SQL table.
from pyspark.sql import SparkSession
# Start a Spark session
spark = SparkSession.builder.appName("JobPostingsAnalysis").getOrCreate()
# Load the CSV file into a Spark DataFrame.
df = spark.read.option("header", "true").option("inferSchema", "true").csv("lightcast_job_postings.csv")
# Register the DataFrame as a temporary SQL table
df.createOrReplaceTempView("jobs")
2. Verify the Data:
Display the first five rows.
Show the schema (column names & data types).
# Verify the Data
# Display the first five rows
df.show(5)
# Show the schema (column names & data types)
df.printSchema()
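If you want a quick size check before writing queries, here is a short optional sketch (note that count() scans the full dataset, so it can take a moment):
# Optional: report the number of rows and columns in the DataFrame
print(f"Rows: {df.count()}, Columns: {len(df.columns)}")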
3.4 Step 3: Run Spark SQL Queries
Answer the following questions using Spark SQL. Make sure your code is visible and the output displays correctly in the rendered HTML.
For each query:
Write the SQL statement in your file.
Display the query results in a structured format.
Briefly explain the insights from the results.
For Example:
Find the number of job postings for each employment type and order them in descending order.
# Run a Spark SQL query to count job postings per employment type
job_counts_by_type = spark.sql("""
SELECT EMPLOYMENT_TYPE_NAME, COUNT(*) AS job_count
FROM jobs
GROUP BY EMPLOYMENT_TYPE_NAME
ORDER BY job_count DESC
""")
# Show the result
job_counts_by_type.show()
We can retrieve this information from the job postings table by counting the job IDs for each employment type. There are 3686 full-time jobs and 5635 part-time jobs in the dataset.
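Since pandas and the plotting libraries were installed earlier, you can optionally convert a Spark SQL result to a pandas DataFrame for charting. A minimal sketch using matplotlib (a chart is not required for this example, but it can make the insight easier to explain):
import matplotlib.pyplot as plt

# Optional: bring the small aggregated result into pandas and plot it
pdf = job_counts_by_type.toPandas()
pdf.plot(kind="bar", x="EMPLOYMENT_TYPE_NAME", y="job_count", legend=False)
plt.ylabel("Number of postings")
plt.tight_layout()
plt.show()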
Here are the questions:
1. How many job postings are there in the dataset?
2. Find the top 5 most common job titles.
3. Find the average salary for each employment type.
4. Which five states have the most job postings?
5. Calculate the salary range (max - min) for each job title in California.
6. Which top 5 industries have the highest average salaries and more than 100 job postings?
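As a starting point, here is a minimal sketch for question 1. The remaining questions follow the same pattern, but their column names (job title, salary, state, industry) depend on the Lightcast schema you see in printSchema() and are not assumed here:
# Question 1: total number of job postings in the dataset
total_postings = spark.sql("SELECT COUNT(*) AS total_postings FROM jobs")
total_postings.show()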
3.5 Submission Instructions
1. Commit and Push Your Work Using VS Code Source Control.
2. Submit Only Your GitHub Repository Link on Blackboard.
Resources and Installing R
Apache Spark SQL Documentation
Quarto Documentation
Jupyter Notebook Documentation
R Spark SQL Guide
Installing R and Adding Packages
If you prefer using R instead of Python, you can install R and necessary packages on your EC2 instance.
Install R on Ubuntu
sudo apt update
sudo apt install -y r-base
Verify R Installation
Run the following command to check if R is installed correctly:
R --version
Install R Packages Using Ubuntu Command Line
You can install R packages directly from the command line by using the Rscript command:
sudo Rscript -e 'install.packages("tidyverse", repos="http://cran.rstudio.com/")'
For installing multiple packages:
sudo Rscript -e 'install.packages(c("sparklyr", "dplyr", "ggplot2"), repos="http://cran.rstudio.com/")'
Connecting R with Spark
To use Spark with R, install sparklyr and configure the connection:
install.packages("sparklyr")
library(sparklyr)
spark_install()
sc <- spark_connect(master = "local")
For more details, refer to the R Spark SQL Guide.