Submission Format
If you used Databricks, submit the link to your published notebook in a Word or PDF document. Do not submit HTML, Jupyter notebook, or archive (DBC) formats.
If you used a local instance of Spark, submit a Jupyter notebook.
Setup Instructions
1. Read the article:
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
2. Download the data: You can either download the JSON file from the Assignment 3 home page in Learn under 'EXPECTATIONS AND ASSIGNMENT FILES', or download it from GitHub to your computer:
https://github.com/dmatrix/examples/blob/master/spark/databricks/notebooks/py/data/iot_devices.json Click on the raw data source, right-click, and choose "Save as". The file downloads to your computer, and you can then import it into Databricks. NOTE:
3. Import files: To import the file you downloaded in step 2 into your Databricks cluster, see: https://www.projectpro.io/recipes/create-dataframe-from-json-file-read-data-from-dbfs-and-write-into-dbfs
4. Import the notebook below. It already contains some data exploration for your reference. For details on how to import a notebook, see: https://docs.databricks.com/user-guide/notebooks/index.html
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5915990090493625/411609171004360/6085673883631125/latest.html
5. Run it. NOTE: Before trying to run the notebook, create a cluster and attach the imported notebook to it (use the "Detached" button in the upper-left corner).
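Before uploading the file in step 3, it can help to confirm that the download is the raw JSON data and not a saved GitHub HTML page. The following is a minimal sketch in plain Python, not part of the assignment notebook. It assumes the file stores one JSON object per line; the function name `summarize_json_lines` and the `max_records` parameter are hypothetical names chosen for illustration.

```python
import json
from collections import Counter


def summarize_json_lines(path, max_records=1000):
    """Parse up to max_records non-blank lines and count how often each field name occurs.

    Assumes one JSON object per line (an assumption about the file layout).
    Raises json.JSONDecodeError if a line is not valid JSON, and ValueError
    if a line parses to something other than a JSON object.
    """
    field_counts = Counter()
    parsed = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError("expected a JSON object on each line")
            field_counts.update(record.keys())
            parsed += 1
            if parsed >= max_records:
                break
    return parsed, dict(field_counts)
```

Calling `summarize_json_lines("iot_devices.json")` on the downloaded file returns the number of records parsed and a dictionary of field names with their counts. An exception at this point usually means the browser saved something other than the raw data.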
Questions (12 marks)
1. Explain the main differences between RDDs, DataFrames and Datasets (4 marks)
2.Answer the following questions:
2.1 How many sensor pads are reported to be from Poland (2 marks)
2.2 How many different LCDs (distinct colors) are present in the dataset (2 marks)
2.3 Find 5 countries that have the largest number of MAC devices used (2 marks)
2.4 Propose and try an interesting statistical test or machine learning model that you could use to gain insight from this dataset. Note that you do not have to use machine learning for this question. You can apply any analysis to the data, for example using SparkSQL or Python visualization libraries. Another option could be to apply correlation functions or other Spark functions to analyze the data. (2 marks)
NOTE: You may use MLlib in 2.4: https://spark.apache.org/docs/latest/ml-guide.html. Marks are awarded for the idea and the implementation of the test/ML model.
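As one illustration of the "correlation functions" option mentioned in 2.4, the sketch below computes a Pearson correlation coefficient in plain Python. It is only a sketch of the idea, not a model answer. The readings are hypothetical toy values and are not taken from iot_devices.json, and the function name `pearson_correlation` is chosen for illustration.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy readings for illustration only.
temperatures = [10, 20, 30, 40]
humidities = [80, 60, 40, 20]
print(pearson_correlation(temperatures, humidities))  # prints -1.0
```

On a cluster, the same quantity for two numeric columns of a Spark DataFrame can be obtained with `DataFrame.stat.corr(col1, col2)`. Which columns are worth comparing depends on the fields actually present in the dataset.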