Spark Programming Practice on Hadoop Platform
The goal of this document is to practice Spark programming on a Hadoop platform by solving the following problems.
- In the text file (Youvegottofindwhatyoulove.txt), show the 30 most frequently occurring words and the average number of times each occurs per sentence. Based on the result, what are the characteristics of these words?
- Implement a program to calculate the average amount of credit card trips for each passenger count from one to four in the 2017.09 NYC Yellow Taxi trip data. In the NYC Taxi data, “Passenger_count” is a driver-entered value. Also explain how you deal with the data loss issue.
- For each of tasks 1 and 2 above, compare the execution time on a local worker and on a YARN cluster, and discuss your observations.
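
One possible starting point for the word-frequency task is sketched below. It is only a sketch under stated assumptions: sentences are assumed to end at `.`, `!` or `?`, words are lowercased ASCII letters with an optional apostrophe, and the helper names (`split_sentences`, `tokenize`, `top_words_local`, `top_words_spark`) are hypothetical. The pure-Python function mirrors the logic so it can be checked without a cluster; the Spark function expects an existing `SparkContext`.

```python
import re
from collections import Counter
from operator import add

# Assumption: a sentence ends at '.', '!' or '?'.
SENTENCE_END = re.compile(r"[.!?]+")
# Assumption: a word is a run of letters, optionally followed by 'xx (e.g. "you've").
WORD = re.compile(r"[a-z]+(?:'[a-z]+)?")


def split_sentences(text):
    """Split raw text into non-empty, stripped sentences."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def tokenize(sentence):
    """Lowercase a sentence and return its words."""
    return WORD.findall(sentence.lower())


def top_words_local(text, n=30):
    """Return (word, count, count / number_of_sentences) for the n most frequent words."""
    sentences = split_sentences(text)
    counts = Counter(w for s in sentences for w in tokenize(s))
    total = len(sentences)
    return [(w, c, c / total) for w, c in counts.most_common(n)]


def top_words_spark(sc, path, n=30):
    """Same computation with the RDD API; `sc` is an existing SparkContext."""
    # wholeTextFiles keeps each file in one record, so sentences that
    # span several lines are not cut at line breaks.
    sentences = sc.wholeTextFiles(path).values().flatMap(split_sentences).cache()
    total = sentences.count()
    counts = sentences.flatMap(tokenize).map(lambda w: (w, 1)).reduceByKey(add)
    top = counts.takeOrdered(n, key=lambda kv: -kv[1])
    return [(w, c, c / total) for w, c in top]
```

Here the "average occurrences in a sentence" is taken to mean the total count of a word divided by the number of sentences; if the task intends a different definition, only the final division needs to change.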
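
For the taxi task, the following is a minimal sketch rather than a definitive solution. It assumes the CSV has columns named `payment_type`, `passenger_count` and `total_amount`, that `payment_type` code 1 denotes a credit card, and that "amount" means `total_amount`; these should be checked against the data dictionary for the 2017.09 file. Rows with missing, unparsable or out-of-range values are simply dropped, which is one possible way to handle the data loss issue. The function names are hypothetical.

```python
from collections import defaultdict

CREDIT_CARD = "1"  # assumption: payment_type code 1 means credit card


def parse_trip(row):
    """Return (passenger_count, total_amount) for a usable credit card trip, else None."""
    try:
        if row["payment_type"].strip() != CREDIT_CARD:
            return None
        passengers = int(row["passenger_count"])
        amount = float(row["total_amount"])
    except (KeyError, ValueError, TypeError, AttributeError):
        # Missing column, empty field or non-numeric value: treat as lost data.
        return None
    if not 1 <= passengers <= 4 or amount <= 0:
        return None
    return passengers, amount


def average_by_passengers(rows):
    """Average total_amount per passenger count over an iterable of dict rows."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        parsed = parse_trip(row)
        if parsed is not None:
            passengers, amount = parsed
            sums[passengers] += amount
            counts[passengers] += 1
    return {p: sums[p] / counts[p] for p in sorted(sums)}


def average_by_passengers_spark(spark, path):
    """Same filter and aggregation with the DataFrame API; `spark` is a SparkSession."""
    from pyspark.sql import functions as F

    df = spark.read.csv(path, header=True, inferSchema=True)
    return (
        df.filter(
            (F.col("payment_type") == 1)
            & F.col("passenger_count").between(1, 4)
            & (F.col("total_amount") > 0)  # null comparisons drop the row
        )
        .groupBy("passenger_count")
        .agg(F.avg("total_amount").alias("avg_amount"))
        .orderBy("passenger_count")
    )
```

Dropping bad rows is the simplest policy; reporting how many rows were discarded alongside the averages would make the effect of that choice visible.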
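
For the timing comparison, a small wall-clock helper such as the one below may be enough. It is a sketch: `timed` is a hypothetical name, and the script name `job.py` in the comments is a placeholder. Because Spark transformations are lazy, the measured function has to end in an action (for example `count`, `collect` or `takeOrdered`), otherwise only the plan construction is timed.

```python
import time


def timed(job, *args, **kwargs):
    """Run job(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = job(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# The same script can then be submitted in both modes and the timings compared:
#   spark-submit --master "local[*]" job.py   # local worker (job.py is a placeholder)
#   spark-submit --master yarn job.py         # YARN cluster
```

Running each configuration several times and comparing averages helps separate scheduling and startup overhead from the actual computation.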