• Spark Programming Exercises


    1. Define a collection containing several strings and group its elements by their first letter.

    Code:

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.collection.mutable.ListBuffer

    object StringDemo2 {
      def main(args: Array[String]): Unit = {
        // Build the source collection
        val list = ListBuffer("java", "php", "python", "js")
        val sparkConf = new SparkConf().setAppName("test").setMaster("local[*]")
        val sc = new SparkContext(sparkConf)
        val rdd = sc.makeRDD(list)
        val rdd2 = rdd.groupBy(_.charAt(0)) // key each string by its first character
        rdd2.collect().foreach(println)
        sc.stop()
      }
    }

    Output:

    (p,CompactBuffer(php, python))
    (j,CompactBuffer(java, js))

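    The grouping logic can be sanity-checked without a Spark cluster, since `groupBy` on a plain Scala collection has the same semantics as `RDD.groupBy` (minus the shuffle). A minimal sketch (the object name is illustrative):

```scala
// Plain-Scala check of the grouping step (no Spark needed):
object GroupByFirstLetter {
  // groupBy builds a Map from first character to all strings starting with it,
  // mirroring what RDD.groupBy(_.charAt(0)) produces per key.
  def group(words: Seq[String]): Map[Char, Seq[String]] =
    words.groupBy(_.charAt(0))
}
```
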
    2. Count how many log entries occur at the same time (to the hour) in a log file.

    Data in log.txt:

    83.149.9.216 - - 17/05/2015:10:05:03 +0000 GET /presentations/logstash-monitorama-2013/images/kibana-search.png
    83.149.9.216 - - 17/05/2015:10:05:43 +0000 GET /presentations/logstash-monitorama-2013/images/kibana-dashboard3.png
    83.149.9.216 - - 17/05/2015:10:05:47 +0000 GET /presentations/logstash-monitorama-2013/plugin/highlight/highlight.js
    83.149.9.216 - - 17/05/2015:10:05:12 +0000 GET /presentations/logstash-monitorama-2013/plugin/zoom-js/zoom.js
    83.149.9.216 - - 17/05/2015:10:05:07 +0000 GET /presentations/logstash-monitorama-2013/plugin/notes/notes.js
    83.149.9.216 - - 17/05/2015:10:05:34 +0000 GET /presentations/logstash-monitorama-2013/images/sad-medic.png
    83.149.9.216 - - 17/05/2015:10:05:57 +0000 GET /presentations/logstash-monitorama-2013/css/fonts/Roboto-Bold.ttf

    Code:

    import java.text.SimpleDateFormat
    import org.apache.spark.{SparkConf, SparkContext}

    object reduceTest {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("test").setMaster("local[*]")
        val sc = new SparkContext(sparkConf)
        val rdd = sc.textFile("input\\log.txt", 1)
        val rdd2 = rdd.map(line => {
          val data = line.split(" ")
          // Parse the raw timestamp, then re-format it down to hour precision
          val inFormat = new SimpleDateFormat("dd/MM/yyyy:HH:mm:ss")
          val date = inFormat.parse(data(3))
          val outFormat = new SimpleDateFormat("yyyy:MM:dd:HH")
          val time = outFormat.format(date)
          (time, 1)
        }).groupBy(_._1)
        val rdd3 = rdd2.map {
          case (k, v) => (k, v.size)
        }
        rdd3.saveAsTextFile("out\\out17")
        sc.stop()
      }
    }

    Output:

    (2015:05:17:10,7)

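    One note on the code above: `groupBy(_._1)` followed by `map { case (k, v) => (k, v.size) }` shuffles every `(time, 1)` pair, while `reduceByKey(_ + _)` on those same pairs would pre-aggregate map-side and produce the same counts. The hour-bucketing step itself is plain Java/Scala date handling and can be tested without Spark (object and method names here are illustrative):

```scala
import java.text.SimpleDateFormat

// Sketch of the hour-bucketing step: parse the raw timestamp,
// re-format it to hour precision, then count entries per bucket.
object HourBucket {
  def bucket(ts: String): String = {
    val in  = new SimpleDateFormat("dd/MM/yyyy:HH:mm:ss")
    val out = new SimpleDateFormat("yyyy:MM:dd:HH")
    out.format(in.parse(ts))
  }

  def countByHour(timestamps: Seq[String]): Map[String, Int] =
    timestamps.groupBy(bucket).map { case (hour, ts) => (hour, ts.size) }
}
```
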
    3. Group words that are composed of the same letters.

    Data in text01.txt:

    abc acb java
    avaj bac
    cba abc
    jvaa php hpp
    pph python thonpy

    Code:

    import org.apache.spark.{SparkConf, SparkContext}

    object Demo {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("test").setMaster("local[*]")
        val sc = new SparkContext(sparkConf)
        val rdd = sc.textFile("input/text01.txt")
        rdd.flatMap(_.split(" "))
          .map { str =>
            // Sorting a word's characters yields a canonical key
            // shared by all of its anagrams
            val chars = str.toCharArray
            java.util.Arrays.sort(chars)
            (new String(chars), str)
          }
          .groupByKey()
          .collect()
          .foreach(println)
        sc.stop()
      }
    }

    Output:

    (hpp,CompactBuffer(php, hpp, pph))
    (hnopty,CompactBuffer(python, thonpy))
    (aajv,CompactBuffer(java, avaj, jvaa))
    (abc,CompactBuffer(abc, acb, bac, cba, abc))

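    The core idea — sorting a word's characters to get a canonical key shared by all its anagrams — is independent of Spark, so it can be checked on plain collections. A minimal sketch (names are illustrative):

```scala
// Plain-Scala sketch of the anagram key used above.
object AnagramKey {
  // "java".sorted == "aajv": all anagrams collapse to the same key.
  def key(word: String): String = word.sorted

  def groupAnagrams(words: Seq[String]): Map[String, Seq[String]] =
    words.groupBy(key)
}
```
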
    4. Deduplicate words using Spark.

    Data in text02.txt:

    java php java
    python php java
    python mysql
    hadoop hadoop
    java php python
    hadoop php

    Code:

    import org.apache.spark.{SparkConf, SparkContext}

    object Demo2 {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("test").setMaster("local[*]")
        val sc = new SparkContext(sparkConf)
        val rdd = sc.textFile("input/text02.txt")
        // Split each line into words, then drop duplicates across the dataset
        rdd.flatMap(_.split(" ")).distinct().collect().foreach(println)
        sc.stop()
      }
    }

    Output:

    php
    python
    mysql
    java
    hadoop

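    The `flatMap` + `distinct` pipeline maps directly onto plain Scala collections, which makes the logic easy to verify locally (names are illustrative):

```scala
// Plain-Scala version of the flatMap + distinct pipeline.
object DistinctWords {
  // Split each line into words, keeping the first occurrence of each word.
  def distinctWords(lines: Seq[String]): Seq[String] =
    lines.flatMap(_.split(" ")).distinct
}
```

    Unlike `RDD.distinct()`, `Seq.distinct` preserves first-occurrence order; the RDD version shuffles, so its output order is not guaranteed.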
    5. Use Spark to compute the average temperature for January and February 2005.

    Data in text03.txt:

    2005 01 02 04 -11
    2005 01 02 05 -17
    2005 01 02 06 -17
    2005 01 02 07 -17
    2005 01 02 08 -17
    2005 01 02 09 -17
    2005 01 02 10 -22
    2005 01 02 11 -22
    2005 01 02 12 -28
    2005 01 02 13 -33
    2005 02 02 14 -39
    2005 02 02 15 -28
    2005 02 02 16 0
    2005 02 02 17 11
    2005 02 02 18 17
    2005 02 02 19 17
    2005 02 02 20 22
    2005 02 02 21 28
    2005 02 02 22 28
    2005 02 02 23 22

    Code:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}

    object Demo3 {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("test").setMaster("local[*]")
        val sc = new SparkContext(sparkConf)
        val rdd = sc.textFile("input/text03.txt")
        // Fields per line: year month day hour temperature.
        // Splitting on whitespace is more robust than fixed substring offsets,
        // which break when the temperature column varies in width.
        val rdd2: RDD[(String, Int)] = rdd.map { line =>
          val fields = line.split("\\s+")
          (fields(1), fields(4).toInt)
        }
        val rdd3: RDD[(String, Iterable[Int])] = rdd2.groupByKey()
        // Integer division truncates the average toward zero
        rdd3.mapValues(temps => temps.sum / temps.size).collect().foreach(println)
        sc.stop()
      }
    }

    Output:

    (02,7)
    (01,-20)

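    Note that the per-month average uses integer division, so (01,-20) is the truncation of the true mean -20.1. The parse-and-average logic can be sketched on plain collections (names are illustrative); in the Spark job itself, accumulating a (sum, count) pair with `aggregateByKey` instead of `groupByKey` would avoid materializing every temperature per key:

```scala
// Parse-and-average sketch on plain collections.
// Fields per line: year month day hour temperature.
object MonthlyAverage {
  def averages(lines: Seq[String]): Map[String, Int] =
    lines
      .map { line =>
        val fields = line.split("\\s+")
        (fields(1), fields(4).toInt)
      }
      .groupBy(_._1)
      // Integer division truncates toward zero, as in the Spark version
      .map { case (month, recs) => (month, recs.map(_._2).sum / recs.size) }
}
```
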
    6. Use Spark to sum the total traffic of phone numbers starting with 137, 138, and 139.

    Data in text04.txt:

    13726230503 81
    13826544101 50
    13926435656 30
    13926251106 40
    13826544101 2106
    13826544101 1432
    13719199419 300

    Code:

    import org.apache.spark.{SparkConf, SparkContext}

    object Demo4 {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("test").setMaster("local[*]")
        val sc = new SparkContext(sparkConf)
        val rdd = sc.textFile("input/text04.txt")
        rdd.map { line =>
          val str = line.split(" ")
          // Key by the three-digit prefix; the value is that record's traffic
          (str(0).substring(0, 3), str(1).toInt)
        }.reduceByKey(_ + _).collect().foreach(println)
        sc.stop()
      }
    }

    Output:

    (138,3588)
    (139,70)
    (137,381)

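    As with the previous exercises, the prefix-keying and summing logic can be checked on plain collections; a `groupBy` followed by a per-group sum mirrors `reduceByKey(_ + _)` in the Spark job (names are illustrative):

```scala
// Prefix-keying and summing sketch on plain collections.
object TrafficByPrefix {
  def totals(lines: Seq[String]): Map[String, Int] =
    lines
      .map { line =>
        val fields = line.split(" ")
        (fields(0).take(3), fields(1).toInt) // key by the three-digit prefix
      }
      .groupBy(_._1)
      .map { case (prefix, recs) => (prefix, recs.map(_._2).sum) }
}
```
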
  • Original post: https://blog.csdn.net/m0_55834564/article/details/125469859