
ISIT312 Big Data Management

Assignment 3

Spring 2025

Scope

This assignment includes tasks related to querying a data cube, design and implementation of an HBase table, querying and manipulating data in an HBase table, data processing with Pig, and data processing with Spark.

This assignment is due on Saturday, 01 November 2025, 7:00pm (sharp).

This assignment is worth 20% of the total evaluation in the subject.

The assignment consists of 4 tasks, and the specification of each task starts on a new page.

Only electronic submission through Moodle at:

https://moodle.uowplatform.edu.au/login/index.php

will be accepted. The submission procedure is explained at the end of the Assignment 1 specification.

A policy regarding late submissions is included in the subject outline.

Only one submission of Assignment 3 is allowed and only one submission per student is accepted.

A submission marked by Moodle as "late" is always treated as a late submission no matter how many seconds it is late.

A submission that has an incorrect file attached is treated as a valid submission, with all consequences following from the evaluation of the attached file.

All files left on Moodle in the state "Draft (not submitted)" will not be evaluated.

A submission of compressed files (zipped, gzipped, rared, tarred, 7-zipped, lzhed, etc.) is not allowed. Compressed files will not be evaluated.

An implementation that does not compile due to one or more syntax and/or runtime errors scores no marks.

Using any sort of Generative Artificial Intelligence (GenAI) for this assignment is NOT allowed!

It is expected that all tasks included within Assignment 3 will be solved individually, without any cooperation with other students. If you have any doubts or questions, please consult your lecturer or tutor during lab classes or office hours. Plagiarism will result in a FAIL grade being recorded for the assessment task.

Task 1 (5 marks)

Querying a data cube

Use Hive to create an internal table  TRIP with records of individual trips, with each record containing the driver's license, the truck's registration, the trip length in kilometers, and the date of the trip.

create table TRIP (
  registration   char(7),
  license        char(7),
  kilometers     decimal(2),
  tday           decimal(2),
  tmonth         decimal(2),
  tyear          decimal(4) )
row format delimited fields terminated by ','
stored as textfile;

Remove the header line from the file task1.csv and save the file.

Populate the table by loading data from the file task1.csv.
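
A minimal sketch of the load step, assuming the file has been copied to a local path such as /home/bigdata/task1.csv (the path is hypothetical):

-- load the local csv file into the TRIP table
load data local inpath '/home/bigdata/task1.csv' into table TRIP;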

(1) 0.5 mark

Implement the following query using a GROUP BY clause with the CUBE operator.

Find the total number of trips per driver (license), per truck (registration), per driver and truck (license, registration), and the total number of trips.
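
A hedged sketch of one possible statement (the column alias is illustrative):

select license, registration, count(*) as total_trips
from TRIP
group by license, registration with cube;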

(2) 0.5 mark

Implement the following query using a GROUP BY clause with the ROLLUP operator.

Find the longest trip (kilometers) per driver (license), per driver and truck (license, registration), and the longest trip overall.
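
A possible shape of the statement, as a sketch; note that ROLLUP over (license, registration) yields exactly the per-driver-and-truck, per-driver, and overall groups:

select license, registration, max(kilometers) as longest_trip
from TRIP
group by license, registration with rollup;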

(3) 0.5 mark

Implement the following query using a GROUP BY clause with the GROUPING SETS operator.

Find the shortest trip (kilometers) per driver (license), per truck (registration), and per driver and year (license, tyear).
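
A hedged sketch of the statement:

select license, registration, tyear, min(kilometers) as shortest_trip
from TRIP
group by license, registration, tyear
grouping sets ((license), (registration), (license, tyear));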

Implement the following SQL queries as SELECT statements using the window partitioning technique.

(4)  0.5 mark

For each truck, list its registration number, the length of its longest and shortest trips (in kilometers), the total number of trips, and the average trip length (in kilometers).
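
One possible sketch; DISTINCT collapses the duplicate rows that window aggregation produces for every trip of the same truck:

select distinct registration,
       max(kilometers) over (partition by registration) as longest_trip,
       min(kilometers) over (partition by registration) as shortest_trip,
       count(*)        over (partition by registration) as total_trips,
       avg(kilometers) over (partition by registration) as avg_trip
from TRIP;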

(5) 0.5 mark

For each truck, list its registration (registration) and all its trips (license, tday, tmonth, tyear, kilometers) sorted in descending order of trip length (kilometers), together with a rank (position number in ascending order) for each trip. Use the analytic function ROW_NUMBER().
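
A sketch of one possible statement:

select registration, license, tday, tmonth, tyear, kilometers,
       row_number() over (partition by registration
                          order by kilometers desc) as trip_rank
from TRIP
order by registration, kilometers desc;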

(6) 0.5 mark

For each driver, list his/her license number (license), the total length of all his/her trips (kilometers), and the average length of all trips (kilometers).
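
A hedged sketch using window aggregates (a plain GROUP BY would also satisfy the wording):

select distinct license,
       sum(kilometers) over (partition by license) as total_length,
       avg(kilometers) over (partition by license) as avg_length
from TRIP;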

(7) 0.5 mark

For each driver (license) and truck (registration), and for each trip length (kilometers), list the longest trip length (kilometers) aggregated per driver (license).
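
A sketch of one reading of this query:

select license, registration, kilometers,
       max(kilometers) over (partition by license) as longest_per_driver
from TRIP;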

(8) 0.5 mark

For each truck (registration), find how the total trip length (kilometers) changed year by year (tyear). Order the results in ascending order of years (tyear).
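
One reading of "changed year by year" is the difference between consecutive yearly totals; a hedged sketch using LAG over per-year totals:

select registration, tyear, total_km,
       total_km - lag(total_km) over (partition by registration
                                      order by tyear) as change_from_prev_year
from (select registration, tyear, sum(kilometers) as total_km
      from TRIP
      group by registration, tyear) yearly
order by tyear;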

(9) 0.5 mark

For each truck (registration), list the average length of the current and the previous trip (kilometers). Order the results in ascending order of trip length (kilometers).
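
A sketch, assuming "previous" is interpreted relative to the required ordering by trip length:

select registration, kilometers,
       avg(kilometers) over (partition by registration
                             order by kilometers
                             rows between 1 preceding and current row)
         as avg_current_previous
from TRIP
order by kilometers;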

(10) 0.5 mark

For each truck (registration), list the average length of the current, the previous, and the next trip (kilometers). Order the results in ascending order of trip length (kilometers).
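
A sketch under the same assumption, widening the window frame by one following row:

select registration, kilometers,
       avg(kilometers) over (partition by registration
                             order by kilometers
                             rows between 1 preceding and 1 following)
         as avg_previous_current_next
from TRIP
order by kilometers;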

When ready, save your SELECT statements in a file solution1.hql. Then, process the script file solution1.hql and save the results in a report solution1.txt.

Deliverables

A file solution1.txt that contains a report from processing the SELECT statements implementing the queries listed above.

Task 2 (5 marks)

Design and implementation of HBase table (3 marks)

(1)  Consider the following conceptual schema of a sample data cube designed to analyze vehicle repairs by mechanics for vehicle owners.

Design a single HBase table to store the data described by the conceptual schema above.

Create an HBase script solution2-1.hb with HBase shell commands that create the HBase table and load sample data into it. Load into the table information about at least two vehicles, two owners, two mechanics, and three repairs.
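
Since the design depends on the schema in the figure, the following is only a minimal sketch, assuming one row per repair and hypothetical column families VEHICLE, OWNER, MECHANIC and REPAIR (the table name, row keys, qualifiers and values are all illustrative):

# create the table with one column family per dimension
create 'REPAIRS', 'VEHICLE', 'OWNER', 'MECHANIC', 'REPAIR'
# one row per repair; the key 'repair1' is hypothetical
put 'REPAIRS', 'repair1', 'VEHICLE:registration', 'ABC123'
put 'REPAIRS', 'repair1', 'OWNER:name', 'James Bond'
put 'REPAIRS', 'repair1', 'MECHANIC:name', 'Harry Potter'
put 'REPAIRS', 'repair1', 'REPAIR:cost', '350'
scan 'REPAIRS'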

When ready, use the HBase shell to process the script file solution2-1.hb and save a report from the processing in a file solution2-1.txt.

Querying HBase table (2 marks)

(2)  Consider the conceptual schema given below. The schema represents a data cube where students submit assignments; each submission consists of several files and is related to one subject.

Download a file task2-2.hb with HBase shell commands. Process the script task2-2.hb. Processing of the script creates an HBase table task2-2 and loads some data into it.

Use the HBase shell to implement the queries and data manipulations listed below. Implementation of each step is worth 0.4 of a mark.

Save the queries and data manipulations in a file solution2-2.hb. Note that the implementation of the queries and data manipulations listed below may need more than one HBase shell command; a hedged sketch of possible commands follows the list.

(1) Find all information included in the column family SUBJECT qualified by code and the column family FILES qualified by fnumber1 and fnumber2.

(2) Find all information about a subject that has a code 312; list two versions per cell.

(3) Find all information about a submission of assignment 1 performed by a student 007 in a subject 312; list one version per cell.

(4) Replace the submission date of assignment 1 performed by a student 007 in a subject 312 with a date 02-APR-2019, and then list the column family SUBMISSION to verify the results.

(5)  Add a column family DEGREE that contains information about the titles of the degrees the students are enrolled in. Assume that a student can enrol in only one degree. Then add information about the title of the degree of the student with number 007. The degree title is up to you. List all information about the student with number 007.
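
As a hedged sketch only, the commands below outline one possible implementation; the row key '312-007-1' and the qualifier sdate are hypothetical and must be replaced with the keys and qualifiers actually created by task2-2.hb:

# (1) scan the requested columns
scan 'task2-2', {COLUMNS => ['SUBJECT:code', 'FILES:fnumber1', 'FILES:fnumber2']}
# (2) two versions per cell, restricted to subject code 312
scan 'task2-2', {COLUMNS => 'SUBJECT', VERSIONS => 2, FILTER => "SingleColumnValueFilter('SUBJECT', 'code', =, 'binary:312')"}
# (3) one version per cell for the hypothetical submission row
get 'task2-2', '312-007-1', {COLUMN => 'SUBMISSION', VERSIONS => 1}
# (4) a put writes a new version, which effectively replaces the date
put 'task2-2', '312-007-1', 'SUBMISSION:sdate', '02-APR-2019'
scan 'task2-2', {COLUMNS => 'SUBMISSION'}
# (5) add the new family, load a title, then list the student's row
alter 'task2-2', NAME => 'DEGREE'
put 'task2-2', '312-007-1', 'DEGREE:title', 'Bachelor of Computer Science'
get 'task2-2', '312-007-1'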

When ready, start the HBase shell and process the script file solution2-2.hb with the HBase shell commands. Save a report from processing of the script in a file solution2-2.txt.

Deliverables

A file solution2-1.txt with a listing from processing of the script file solution2-1.hb.

A file solution2-2.txt with a listing from processing of the script file solution2-2.hb.

Task 3 (5 marks)

Data processing with Pig Latin

Consider the following logical schema of a two-dimensional data cube.

Download a file task3.zip published on Moodle together with the specification of Assignment 3 and unzip it. You should obtain a folder task3 with the following files: driver.csv, truck.csv and trip.csv.

Use a text editor to examine the contents of the files.

Upload the files into HDFS.

Open a Terminal window and start the pig command line interface to Pig. Use the pig command line interface to implement the following actions. Implementation of each step is worth 1 mark.

(1)  Use the load command to load the files truck.csv, driver.csv and trip.csv from HDFS into Pig storage.
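
A hedged sketch of the load statements, assuming the files sit in the HDFS home directory and have the column layout suggested by the queries below (the field names and types are assumptions to be checked against the actual files):

-- assumed layouts; verify against the real csv files
truck = LOAD 'truck.csv' USING PigStorage(',')
        AS (registration:chararray, make:chararray, model:chararray);
driver = LOAD 'driver.csv' USING PigStorage(',')
         AS (license:chararray, first_name:chararray, last_name:chararray);
trip = LOAD 'trip.csv' USING PigStorage(',')
       AS (license:chararray, registration:chararray,
           destination:chararray, kilometers:int);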

Use Pig Latin and the Pig Grunt command line interface to implement and process the following queries; a hedged sketch of possible implementations follows the list.

(2)  Find the full names (first name, last name) of drivers who used trucks manufactured (make) by either DAF or MAN.

(3)  Find the full names (first name, last name) of drivers who used trucks manufactured (make) by DAF and, on another occasion, trucks manufactured by MAN.

(4)  Find the full names (first name, last name) of drivers who never travelled to Albany.

(5)  Find the total number of times each truck (registration) was used on a trip to Albany. There is no need to list the trucks never used on any trip to Albany.
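
A hedged sketch of possible implementations, relying on the assumed field names from the load step above:

-- (2) drivers who used a DAF or a MAN truck
tt = JOIN trip BY registration, truck BY registration;
tt2 = FILTER tt BY truck::make == 'DAF' OR truck::make == 'MAN';
tt3 = JOIN tt2 BY trip::license, driver BY license;
tt4 = FOREACH tt3 GENERATE driver::first_name, driver::last_name;
q2 = DISTINCT tt4;
DUMP q2;

-- (3) drivers who used both a DAF truck and a MAN truck
daf0 = FILTER tt BY truck::make == 'DAF';
daf1 = FOREACH daf0 GENERATE trip::license AS license;
daf = DISTINCT daf1;
man0 = FILTER tt BY truck::make == 'MAN';
man1 = FOREACH man0 GENERATE trip::license AS license;
man = DISTINCT man1;
both = JOIN daf BY license, man BY license;
q3j = JOIN both BY daf::license, driver BY license;
q3 = FOREACH q3j GENERATE driver::first_name, driver::last_name;
DUMP q3;

-- (4) drivers with no trip to Albany, via a left outer join
alb0 = FILTER trip BY destination == 'Albany';
alb1 = FOREACH alb0 GENERATE license;
alb = DISTINCT alb1;
q4j = JOIN driver BY license LEFT OUTER, alb BY license;
q4f = FILTER q4j BY alb::license IS NULL;
q4 = FOREACH q4f GENERATE driver::first_name, driver::last_name;
DUMP q4;

-- (5) number of trips to Albany per truck
alb2 = FILTER trip BY destination == 'Albany';
g5 = GROUP alb2 BY registration;
q5 = FOREACH g5 GENERATE group AS registration, COUNT(alb2) AS trips;
DUMP q5;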

Once completed, copy the entire contents of the Terminal window, including data loading outputs, processed queries, ALL messages, and ALL results, to the clipboard. Then paste these contents into a text file named solution3.txt.

Deliverables

A file solution3.txt that contains a listing of the data loadings and queries performed above, ALL messages, and the results of the operations. The file solution3.txt must be created through Copy/Paste of the entire contents of the Terminal window into the file solution3.txt. No screen dumps are allowed and no screen dumps will be evaluated. Solutions that do not include the query processing messages will not receive any marks.

Task 4 (5 marks)

Data processing with Spark

Consider the following logical schema of a two-dimensional data cube.

Download a file task4.zip published on Moodle together with the specification of Assignment 3 and unzip it. You should obtain a folder task4 with the following files: driver.csv, truck.csv and trip.csv.

Use a text editor to examine the contents of the files.

Upload the files into HDFS.

Open a Terminal window and start the pyspark command line interface to Spark. Use the pyspark command line interface to implement the following actions. Implementation of each step is worth 1 mark.

(1)  Create the schemas for the files truck.csv, driver.csv, and trip.csv.
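
A hedged sketch of the schema definitions, assuming the same hypothetical column layout as in Task 3 (to be checked against the actual files):

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# assumed column layouts; verify against the real csv files
truckSchema = StructType([
    StructField('registration', StringType()),
    StructField('make', StringType()),
    StructField('model', StringType())])
driverSchema = StructType([
    StructField('license', StringType()),
    StructField('first_name', StringType()),
    StructField('last_name', StringType())])
tripSchema = StructType([
    StructField('license', StringType()),
    StructField('registration', StringType()),
    StructField('destination', StringType()),
    StructField('kilometers', IntegerType())])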

(2)  Create the data frames with the contents of the files truck.csv, driver.csv, and trip.csv using the schemas created in the previous step.

Count the total number of rows in each frame and then list the contents of each frame.
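
A sketch of this step; whether a header=True option is needed depends on whether the csv files still contain header lines, and the relative paths assume the files sit in the HDFS home directory:

truck = spark.read.csv('truck.csv', schema=truckSchema)
driver = spark.read.csv('driver.csv', schema=driverSchema)
trip = spark.read.csv('trip.csv', schema=tripSchema)
# count the rows, then list the contents of each frame
print(truck.count(), driver.count(), trip.count())
truck.show()
driver.show()
trip.show()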

(3)  Create and process the following query directly on the trips DataFrame, without creating a temporary view.

Find the total number of times each driver (license, first name, last name) travelled to Albany. There is no need to list the drivers who never travelled to Albany.
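
A hedged sketch using only DataFrame methods (column names as assumed above):

q3 = (trip.filter(trip.destination == 'Albany')
          .join(driver, 'license')
          .groupBy('license', 'first_name', 'last_name')
          .count())
q3.show()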

(4)  Create a temporary view over a data frame with information about the trips and drivers (see the sketch after step (5)).

(5)  Execute the following query on a temporary view containing information about the trips and drivers.

Find the total number of times each driver (license, first name, last name) travelled to Albany. There is no need to list the drivers who never travelled to Albany.
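
A sketch covering steps (4) and (5); the view name tripdriver is an assumption:

# (4) temporary view over the joined trips and drivers
trip.join(driver, 'license').createOrReplaceTempView('tripdriver')
# (5) the query executed on the temporary view
spark.sql("""SELECT license, first_name, last_name, COUNT(*) AS trips
             FROM tripdriver
             WHERE destination = 'Albany'
             GROUP BY license, first_name, last_name""").show()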

When ready, copy to the clipboard the contents of the Terminal window with the operations processed above and the results listed in the window, and paste them from the clipboard into a text file solution4.txt.

Deliverables

A file solution4.txt that contains a listing of the operations performed above and the results of the operations. The file solution4.txt must be created through Copy/Paste of the contents of the Terminal window into the file solution4.txt. No screen dumps are allowed and no screen dumps will be evaluated.

Submission of Assignment 3

Note that you have only one submission, so make absolutely sure that you submit the correct files with the correct contents. No other submission is possible!

Submit the files solution1.txt, solution2-1.txt, solution2-2.txt, solution3.txt, and solution4.txt through Moodle in the following way:

(1)  Access Moodle at http://moodle.uowplatform.edu.au/

(2)  To log in, use the Login link located in the upper right corner of the Web page or in the middle of the bottom of the Web page

(3)  When logged in, select the site ISIT912/312 (S225) Big Data Management

(4)  Scroll down to the section Assessment items (Assignments)

(5)  Click the link In this place you can submit the outcomes of your work on the tasks included in Assignment 3 for ISIT312 students.

(6)  Click the button Add Submission

(7)  Move the file solution1.txt into the area File submissions. You can also use the link Add…

(8)  Repeat step (7) for the files solution2-1.txt, solution2-2.txt, solution3.txt, and solution4.txt.

(9)  Click the button Submit assignment.

(10) Tick the checkbox with the attached text: By checking this box, I confirm that this submission is my own work, I accept responsibility for any copyright infringement that may occur as a result of this submission, and I acknowledge that this submission may be forwarded to a text-matching service.

(11) Click the button Continue

(12) Check that the Submission status is Submitted for grading.



