
DATA420-24S2 (C)

Assignment 2

The Million Song Dataset (MSD)

Due on Friday, October 18 by 5:00 PM.

If you have questions about the assignment, please use the forum or send me an email.

LEARN

Help forum for Assignment 2

Report upload (pdf)

Supplementary material upload (zip, limited to 10 MB)

Instructions

• You are encouraged to work together to understand and solve each of the tasks, but you must submit your own work.

• Any assignments submitted after the deadline without obtaining an extension will receive a 50% penalty.

• The forum is a great place to ask questions about the assignment material. Questions will be answered by the lecturer or tutors as soon as possible. Students can also answer each other’s questions, and everyone benefits from the answers and the resulting discussion.

• All data under hdfs:///data/ is read only. Please use your own home directory to store your outputs e.g. hdfs:///user/abc123/outputs/.

• I recommend that you use the pyspark notebook provided on LEARN as this will make it easier for you to develop code and check outputs interactively.

• Please be mindful that you are sharing cluster resources. Keep an eye on Spark using the user interface mathmadslinux2p:8080, and don’t leave a shell running for longer than necessary. You can change the number of executors, cores, and memory you are using, but be prepared to scale down if others are waiting for resources. If you need to ask someone else to scale down their resources, email them using their user code (e.g. abc123@uclive.ac.nz).
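As a rough illustration of scaling resources up or down, the sketch below turns a resource request into Spark configuration properties. It assumes the cluster runs in standalone mode, where `spark.cores.max` caps the total cores an application can claim. The helper names, the application name, and the example values are hypothetical, and the notebook provided on LEARN may already set these for you.

```python
def resource_conf(executors: int, cores_per_executor: int, executor_memory_gb: int) -> dict:
    """Translate a resource request into Spark configuration properties."""
    if min(executors, cores_per_executor, executor_memory_gb) < 1:
        raise ValueError("all resource values must be positive")
    return {
        # Upper bound on the total cores this application may claim
        # (assumes a standalone cluster).
        "spark.cores.max": str(executors * cores_per_executor),
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.memory": f"{executor_memory_gb}g",
    }


def build_session(app_name: str, conf: dict):
    """Create a SparkSession with the given properties (requires pyspark)."""
    # Imported lazily so that resource_conf can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Example values only; pick something modest when the cluster is busy.
conf = resource_conf(executors=2, cores_per_executor=2, executor_memory_gb=4)
print(conf)

# On the cluster (hypothetical application name):
# spark = build_session("abc123-assignment-2", conf)
# ...
# spark.stop()  # release the executors when you are done
```

Calling `spark.stop()` when you finish is what frees the resources for other users.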

Requirements

• Your report should be submitted as a single pdf file via LEARN. Any additional code, images, and supplementary material should be submitted as a single zip file via LEARN. You should not submit any outputs as part of your supplementary material, leave these in your home directory in HDFS.

• The body of your report should be no more than 10 pages long, including figures and tables in the body but excluding a cover page, table of contents, references, appendices, and supplementary material. You need to be accurate and concise and you need to demonstrate depth of understanding.

• You should make sensible choices concerning margins, font size, spacing, and formatting. For example, margins between 0.5” and 1”, a sans-serif font e.g. Arial with font size 11 or 12, line spacing 1 or 1.15, and sensible use of monospaced code blocks, tables, and images.

• You should reference any external resources using a citation format such as APA or MLA, including any online resources which you used to obtain snippets of code or examples. You must reference any use of Grammarly, ChatGPT, or any other generative AI tools to improve the quality of your own original work.

• You must not use any content generated by ChatGPT or other generative AI tools directly in your report. These are useful tools for aggregating existing knowledge but they are too wordy and lack the context required to provide concise, specific answers to questions that require qualitative reasoning.

Structure

• Your report should have the following sections and can also use question numbers as subheadings to group paragraphs, tables, and figures that you use to answer the questions that have been asked. This assignment is more open ended and you should explain and justify any decisions that you make along the way.

- Background

In this section you should give a brief overview of what you are doing in the assignment including any useful links or references to background material and a high level description of any difficulties that you had.

- Data processing

In this section you should give an overview of the relevant datasets and provide any references for discussion in the sections below. You should describe the steps that you took to load the data and anything you discovered along the way, but you should not include any actual outputs.

- Audio similarity

In this section you should describe the continuous features and the categorical target for your classification algorithms in more detail, describe the algorithms, their strengths and weaknesses, and their hyperparameters, describe how you trained the algorithms to predict a binary and a multiclass outcome, discuss the performance of the algorithms, and talk about how you would do hyperparameter tuning. You should explain any decisions that you made about feature selection, splitting, sampling, hyperparameters, and metrics.

- Song recommendations

In this section you should describe the distributions of the user-song play counts, describe any choices that you had to make to use the counts for collaborative filtering, talk about the performance of the collaborative filtering model using some specific examples and the ranking metrics that you have evaluated, and discuss any other considerations for using the collaborative filtering model to generate recommendations in the real world. You should explain the implications of the choices that you had to make and discuss any other systems that you would need to generate recommendations for all users of your recommendation system.

- Conclusions

In this section you should reflect on the decisions that you made, what you have achieved, and what you have learned.

- References

In this section you should list all references that you have used or referenced.

The Million Song Dataset (MSD)

In this assignment we will study a collection of datasets referred to as the Million Song Dataset (MSD), a project initiated by The Echo Nest and LabROSA. The Echo Nest was a research spin-off from the MIT Media Lab established with the goal of understanding the audio and textual content of recorded music, and was acquired by Spotify after 10 years for 50 million Euro.

• The Million Song Dataset (MSD)

The main dataset contains feature analysis and metadata for one million songs, with song ID, track ID, artist ID, and 51 other fields such as the year, title, artist tags, and various audio properties such as loudness, beat, tempo, and time signature. Note that there is a distinction between the terms song and track. Song is a general concept representing the musical work, and track is a specific instance or recording of a song. There may be multiple tracks for one song, differing in how they were recorded or how they sound. The tracks in the dataset have been matched to songs and artists.

The Million Song Dataset also contains other datasets contributed by organisations and the community,

• SecondHandSongs (cover songs)

• musiXmatch dataset (song lyrics)

• Last.fm dataset (song-level tags and similarity)

• Taste Profile subset (user-song plays)

• thisismyjam-to-MSD mapping (user-song plays, imperfectly joined)

• tagtraum genre annotations (genre labels)

• All Music genre datasets (genre labels)

• All Music audio feature datasets (extracted from samples)

We will focus on the All Music and Taste Profile datasets, but you are free to explore the other datasets on your own and as part of the challenges. There are many online resources and some publications exploring these datasets as well.

MSD Allmusic Genre Dataset (MAGD)

Many song annotations have been generated for the MSD by sources such as Last.fm, musiXmatch, and the Million Song Dataset Benchmarks by Schindler et al. These benchmarks contain song level genre and style annotations. We will focus on the MSD Allmusic Genre Dataset (MAGD) provided by the Music Information Retrieval group at the Vienna University of Technology.

This dataset is included on the million song dataset benchmarks downloads page and class frequencies are provided on the MSD Allmusic Genre Dataset (MAGD) details page as well. For more information about the genres themselves, have a look at the All Music genres page.

Audio features

The Music Information Retrieval group at the Vienna University of Technology also downloaded audio samples for 994,960 songs in the dataset which were available from an online provider, most in the form of 30 or 60 second snippets. They used these snippets to extract a multitude of features to allow comparison between the songs and prediction of song attributes,

• Rhythm Patterns

• Statistical Spectrum Descriptors

• Rhythm Histograms

• Temporal Statistical Spectrum Descriptors

• Temporal Rhythm Histograms

• Modulation Frequency Variance

• Marsyas

- Timbral features

• jMir

- Spectral Centroid

- Spectral Rolloff Point

- Spectral Flux

- Compactness

- Spectral Variability

- Root Mean Square

- Zero Crossings

- Fraction of Low Energy Windows

- Low-level features derivatives

- Method of Moments

- Area of Moments

- Linear Predictive Coding (LPC)

- MFCC features

These features are described in detail on the million song dataset benchmarks downloads page and the audio feature extraction page, and the number of features is listed along with file names and sizes for the separate audio feature sets.

Taste Profile

The Taste Profile dataset contains real user-song play counts from undisclosed organisations. All songs have been matched to identifiers in the main million song dataset and can be joined with this dataset to retrieve additional song attributes. This is an implicit feedback dataset as users interact with songs by playing them but do not explicitly indicate a preference for the song.

The dataset has an issue with the matching between the Taste Profile tracks and the million song dataset tracks. Some tracks were matched to the wrong songs, as the user data needed to be matched to song metadata, not track metadata. Approximately 5,000 tracks are matched to the wrong songs and approximately 13,000 matches are not verified. This is described in their blog post in detail. Thankfully, we don’t need to resolve this matching even though it might be useful later when we want to assess the performance of our ranking algorithms quantitatively.

TASKS

The assignment is separated into a number of sections, each of which explores the data in increasing detail, from processing to analysis to visualizations. You should plan to complete one section per week to line up with the lecture material and to complete everything on time.

Each question is broken down into the following,

• What you need to do

Tasks that you need to step through in order before you start your write up.

• What should be included in your write up

Specific items that you should include in your write up or general comments on what you should talk about as you describe your method, present your results, and discuss your conclusions step by step.

• Tips

Any other suggestions to help you if you get stuck.

These supplement the report requirements detailed on the cover page. In general, your write up should be accurate and concise and should demonstrate depth of understanding. The tasks are intentionally detailed to make it easier to work through the assignment step by step.

Data processing

The datasets that we need for the assignment have been copied to the following locations,

Windows T:\courses\2024\DATA420-24S2\data\msd\

Linux /scratch-network/courses/2024/DATA420-24S2/data/msd/

HDFS hdfs:///data/msd/

The main/summary directory contains the song metadata from the main million song dataset but none of the audio analysis, song lyrics, genre tags, or user-song plays (see the page on getting the dataset).

The genre directory contains each of the genre datasets provided by the Music Information Retrieval group at the Vienna University of Technology. There is one tab separated file for each genre subset and you can use them interchangeably.

The audio directory contains audio feature sets from the Music Information Retrieval group at the Vienna University of Technology. The audio/attributes directory contains attribute names from the headers of the ARFF files and the audio/features directory contains the audio features themselves.

The tasteprofile directory contains the user-song play counts from the Taste Profile dataset as well as logs identifying mismatches and matches that were manually accepted. You don’t need to do anything with these mismatches.

Q1 First you will explore the relevant datasets that you need at a high level.

What you need to do

Read through the links above and then use the hdfs command to explore the relevant datasets under hdfs:///data/msd/.

(a) Give an overview of the structure of the datasets, including their sizes, formats, data types, and how each dataset has been stored in HDFS.

(b) Figure out how to load each of the different types of datasets, even if you infer schema and only load a small subset of the data for testing.

What you should include in your write up

(1) A directory tree showing how the datasets are organized in HDFS.

(2) A table or equivalent containing the names, sizes, formats, data types, and number of rows in each of the datasets.

(3) A brief comment on how the number of rows compares to the total number of unique songs.
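One possible way to produce the directory tree for the write up is to render a list of paths as an indented tree. The sketch below is only an illustration: the helper names are hypothetical, and the path list contains just the directories named in this handout. In practice you would fill it from a recursive listing such as `hdfs dfs -ls -R hdfs:///data/msd/`.

```python
def build_tree(paths):
    """Nest slash-separated paths into a dictionary of dictionaries."""
    tree = {}
    for path in paths:
        node = tree
        for part in path.strip("/").split("/"):
            node = node.setdefault(part, {})
    return tree


def render_tree(tree, prefix=""):
    """Return the lines of an indented tree, one entry per line."""
    lines = []
    names = sorted(tree)
    for i, name in enumerate(names):
        last = i == len(names) - 1
        lines.append(prefix + ("└── " if last else "├── ") + name)
        # Children of the last entry need no vertical connector.
        lines.extend(render_tree(tree[name], prefix + ("    " if last else "│   ")))
    return lines


# Directories named in the assignment text, relative to hdfs:///data/msd/.
paths = [
    "audio/attributes",
    "audio/features",
    "genre",
    "main/summary",
    "tasteprofile",
]
tree_lines = render_tree(build_tree(paths))
print("hdfs:///data/msd/")
print("\n".join(tree_lines))
```

The same helpers work on full file paths, so file names and sizes can be added once you have explored each directory.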

Q2 Complete the following data preprocessing.

What you need to do

(a) Load the audio feature attribute names and types from the audio/attributes directory and think about how they can be used to define schemas for the audio feature datasets themselves. Note that each of the separate attribute and feature datasets share the same prefix and that the attribute types are named consistently. Think about how you can create a StructType automatically by mapping attribute types to pyspark.sql.types objects.

(b) Load one or more of the audio feature attribute datasets from the audio/features directory using the schema that you obtained in step (a).

Think about the advantages and disadvantages of using these column names as you move on to develop models in the sections below.

Develop a systematic way to rename columns in the audio feature datasets after you load them in Spark so that you could refer to each column in each dataset uniquely if you merged them together. Your column names should be short, intuitive, and easy to read.

Do not actually load and merge all of the audio feature datasets together as this would require a significant amount of distributed memory.
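The idea in step (a) of mapping attribute types to pyspark.sql.types objects can be sketched as follows. This is a minimal sketch under two assumptions that you should verify against the actual files: that each attribute file holds one `name,type` row per column, and that the type names are ARFF-style (for example real, numeric, string). The sample contents and helper names are hypothetical.

```python
import csv
import io

# Assumed mapping from ARFF-style attribute types to pyspark.sql.types class names.
# Check the distinct type names in the attribute files before relying on it.
TYPE_MAP = {
    "real": "DoubleType",
    "numeric": "DoubleType",
    "integer": "IntegerType",
    "string": "StringType",
}


def parse_attributes(text):
    """Parse 'name,type' rows into (name, spark_type_name) pairs."""
    fields = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue
        name, attr_type = row[0].strip(), row[1].strip().lower()
        fields.append((name, TYPE_MAP[attr_type]))
    return fields


def to_struct_type(fields):
    """Build a StructType from (name, spark_type_name) pairs (requires pyspark)."""
    # Imported lazily so that parse_attributes can be checked without Spark installed.
    from pyspark.sql import types as T

    return T.StructType(
        [T.StructField(name, getattr(T, type_name)(), True) for name, type_name in fields]
    )


# Hypothetical attribute file contents, for illustration only.
sample = "feature_one,real\nfeature_two,NUMERIC\ntrack_id,string\n"
fields = parse_attributes(sample)
print(fields)

# On the cluster (the dataset name is left as a placeholder):
# schema = to_struct_type(fields)
# df = spark.read.csv("hdfs:///data/msd/audio/features/<dataset>", schema=schema)
```

Lower-casing the type name before the lookup means inconsistent capitalisation in the attribute files does not need special handling, and an unknown type fails loudly with a `KeyError` rather than being silently mis-typed.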

What you should include in your write up

(1) A brief summary of the information that is contained in the audio attributes datasets and why these are separate from the audio features datasets.

(2) A clear explanation of how you used these to create a StructType automatically, including any additional processing.

(3) A clear explanation of how you renamed columns in the audio feature datasets. You should explain any decisions that you made and why.

(4) Answers to any other questions that you have been asked.
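For the column renaming in Q2, one simple convention is a short dataset prefix followed by a zero-padded position, with the identifier column left unchanged so that datasets can still be joined on it. The sketch below shows that convention only. The prefix, the column names, and the helper name are hypothetical, and other schemes (for example keeping part of the original name) may be more readable.

```python
def short_names(prefix, columns, id_column):
    """Map original column names to '<prefix>_<index>' names, keeping the ID column as is."""
    mapping = {}
    position = 0
    for column in columns:
        if column == id_column:
            mapping[column] = id_column
        else:
            mapping[column] = f"{prefix}_{position:03d}"
            position += 1
    return mapping


# Hypothetical column names, for illustration only.
columns = [
    "Some_Long_Feature_Name_Overall_Standard_Deviation_1",
    "Some_Long_Feature_Name_Overall_Average_1",
    "track_id",
]
mapping = short_names("slf", columns, id_column="track_id")
print(mapping)

# On the cluster, apply the new names in column order:
# df = df.toDF(*[mapping[c] for c in df.columns])
```

A positional scheme is short and unique across datasets, but it loses the meaning carried by the original names, so it is worth keeping the mapping so that features can be traced back later.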

Audio similarity

In this section you will explore using features based on the audio waveform of a track to predict its genre. If this is possible, it could enable an online music streaming service to offer a unique service to their customers. For example, it may be possible to compare songs based entirely on the way their tracks sound, and to discover rare songs that are similar to popular songs even though they have no collaborative filtering relationship. This would enable users to have more precise control over variety, and to discover songs they would not have found any other way.

Q1 The audio feature datasets each have different levels of detail. If we were developing a real world model we would eventually consider all of them to select the best features for our particular model but for this assignment you can just use one of them to save time.

What you need to do

(a) Pick one of the audio feature datasets to use in the questions below.

The audio features are continuous values, obtained using methods such as digital signal processing and psycho-acoustic modeling. Generate descriptive statistics for each audio feature or column in the dataset you have picked.

Are any features distributed in the same way? Are any features strongly correlated?

(b) Load the MSD Allmusic Genre Dataset (MAGD).

Visualize the distribution of genres for the tracks that were matched.
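To answer the question about strongly correlated features, it can help to list the feature pairs whose absolute correlation exceeds a threshold. The sketch below assumes you already have a correlation matrix on the driver as nested lists (for example, collected from `pyspark.ml.stat.Correlation.corr` on an assembled vector column). The feature names, the matrix values, and the 0.9 threshold are made up for illustration.

```python
def correlated_pairs(names, matrix, threshold=0.9):
    """Return (name_a, name_b, correlation) for pairs with |correlation| >= threshold."""
    pairs = []
    for i in range(len(names)):
        # Only look above the diagonal so each pair is reported once.
        for j in range(i + 1, len(names)):
            if abs(matrix[i][j]) >= threshold:
                pairs.append((names[i], names[j], matrix[i][j]))
    # Strongest relationships first, regardless of sign.
    return sorted(pairs, key=lambda pair: -abs(pair[2]))


# Made-up correlation matrix for three hypothetical features.
names = ["f_000", "f_001", "f_002"]
matrix = [
    [1.00, 0.95, -0.20],
    [0.95, 1.00, -0.92],
    [-0.20, -0.92, 1.00],
]
pairs = correlated_pairs(names, matrix, threshold=0.9)
for name_a, name_b, value in pairs:
    print(f"{name_a} ~ {name_b}: {value:+.2f}")
```

What to do with such pairs (drop one feature, combine them, or keep both) depends on the algorithm you choose and should be justified in the write up.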

What you should include in your write up

(1) A brief summary of the specific audio features dataset you have chosen and why.

(2) A table containing the descriptive statistics for the audio features or a representative sample of the audio features if there are too many to fit on one page for the dataset you have picked.

(3) Any conclusions that you can make from the descriptive statistics and whether they will influence how you train your model or use it for inference.

(4) A clear explanation of any choices you need to make based on how features are correlated.

(5) A visualization of the distribution of genres and a brief comment on how this distribution will impact the performance of your model.

(6) Answers to any other questions that you have been asked.
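When commenting on how the genre distribution will affect the model, it may help to quantify the imbalance. One common option, shown here purely as an illustration and not as a required method, is inverse-frequency class weights, which could be supplied to a Spark ML classifier that supports a `weightCol`. The genre labels and counts below are hypothetical placeholders.

```python
def class_weights(counts):
    """Inverse-frequency weights: total / (number_of_classes * class_count)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


# Hypothetical genre counts, for illustration only.
counts = {"genre_a": 800, "genre_b": 150, "genre_c": 50}
weights = class_weights(counts)
for label, weight in weights.items():
    print(f"{label}: count={counts[label]}, weight={weight:.3f}")
```

With these weights every class contributes the same total weight, so rare genres are not drowned out by the dominant one. Resampling is an alternative, and either choice changes which metrics are meaningful.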
