[Hadoop] Reading ClickHouse Data in Spark

Reading data from a ClickHouse database

import scala.collection.mutable.ArrayBuffer
import java.util.Properties
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SparkSession

// Build the JDBC connection properties used to reach ClickHouse.
def getCKJdbcProperties(
    batchSize: String = "100000",
    socketTimeout: String = "300000",
    numPartitions: String = "50",
    rewriteBatchedStatements: String = "true"): Properties = {
  val properties = new Properties
  properties.put("driver", "ru.yandex.clickhouse.ClickHouseDriver")
  properties.put("user", "default")
  properties.put("password", "<database password>")
  // Spark JDBC batch size; this option only applies to writes.
  properties.put("batchsize", batchSize)
  // Socket timeout passed to the driver, in milliseconds.
  properties.put("socket_timeout", socketTimeout)
  properties.put("numPartitions", numPartitions)
  // A MySQL Connector/J setting; it likely has no effect with the ClickHouse driver.
  properties.put("rewriteBatchedStatements", rewriteBatchedStatements)
  properties
}

// Read data from the ClickHouse database.
// `spark` is the SparkSession that spark-shell provides.
val today = "2023-06-05"
val ckProperties = getCKJdbcProperties()
val ckUrl = "jdbc:clickhouse://233.233.233.233:8123/ss"
val ckTable = "ss.test"
val ckDF = spark.read.jdbc(ckUrl, ckTable, ckProperties)
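
The `today` value above is defined but never used in the read. One common way to use such a value is to pass a parenthesized subquery in place of the table name, so the filter runs inside ClickHouse rather than in Spark. The following is a minimal sketch, not the author's method: `pushdownTable` is a hypothetical helper and `dt` is an assumed column name, since the source does not show the schema of `ss.test`.

```scala
// Hypothetical helper: wraps a filtered SELECT as a derived table so the
// database applies the date filter before any rows reach Spark.
// `dateColumn` is an assumed column name; the real schema is not shown.
// Values are interpolated directly, so only pass trusted input.
def pushdownTable(table: String, dateColumn: String, day: String): String =
  s"(SELECT * FROM $table WHERE $dateColumn = '$day') AS t"

val filteredTable = pushdownTable("ss.test", "dt", "2023-06-05")
println(filteredTable)
```

The resulting string can be passed as the table argument, for example `spark.read.jdbc(ckUrl, filteredTable, ckProperties)`, because Spark accepts anything that is valid in a FROM clause there.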

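**show** displays the data, similar to `select * from test`
1. ckDF.show displays the first 20 rows by default
2. ckDF.show(3) displays the given number of rows
3. ckDF.show(false) displays the first 20 rows without truncating long values (by default, strings longer than 20 characters are truncated)
4. ckDF.show(3, 0) displays 3 rows; the second argument is the maximum number of characters shown per value, and 0 disables truncation
**ckDF.collect** fetches all the data in ckDF and returns it as an Array object
ckDF.collectAsList works like collect, but returns a List object instead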
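
The variants above can be tried without a ClickHouse server. This is a small sketch that assumes a local SparkSession and a made-up three-row DataFrame named `demoDF`; the column names and values are illustrative only.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Local session for experimentation; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("show-demo").getOrCreate()
import spark.implicits._

// Made-up sample data standing in for the ClickHouse table.
val demoDF = Seq(
  ("10.0.0.1", "a-fairly-long-user-agent-string-over-20-chars"),
  ("10.0.0.2", "curl"),
  ("10.0.0.3", "wget")
).toDF("ip_src", "agent")

demoDF.show()       // up to 20 rows, long strings truncated
demoDF.show(2)      // first 2 rows
demoDF.show(false)  // up to 20 rows, no truncation
demoDF.show(2, 0)   // first 2 rows, no truncation

// Both calls pull every row to the driver.
val rows: Array[Row] = demoDF.collect()
val rowList: java.util.List[Row] = demoDF.collectAsList()
```

Because `collect` and `collectAsList` materialize the whole DataFrame on the driver, they are best kept for small results.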

**ckDF.describe**("ip_src").show(3) returns summary statistics for the specified column

scala> ckDF.describe("ip_src").show(3)
+-------+------+                                                                
|summary|ip_src|
+-------+------+
|  count|855035|
|   mean|  null|
| stddev|  null|
+-------+------+
only showing top 3 rows
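
In the output above, mean and stddev are null because `ip_src` is a string column. The sketch below runs `describe` on a numeric column instead; it assumes a local SparkSession and a made-up DataFrame, and `port` is an illustrative column name that does not come from the source table.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimentation; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("describe-demo").getOrCreate()
import spark.implicits._

// Made-up sample data with one string column and one numeric column.
val demoDF = Seq(("10.0.0.1", 80), ("10.0.0.2", 443), ("10.0.0.3", 22)).toDF("ip_src", "port")

// describe returns a DataFrame with one row per statistic.
val stats = demoDF.describe("port")
stats.show()

// The first column, `summary`, holds the name of each statistic.
val labels = stats.collect().map(_.getString(0)).toSeq
println(labels)
```

The full result has five rows (count, mean, stddev, min, max), so `show(3)` in the transcript above only prints the first three.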

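first, head, take, and takeAsList fetch a number of rows
1. first returns the first row
2. head returns the first row; head(n: Int) returns the first n rows
3. take(n: Int) returns the first n rows
4. takeAsList(n: Int) returns the first n rows as a List
These methods return one or more rows as a Row or an Array[Row]. first and head do the same thing. take and takeAsList bring the fetched data back to the Driver, so watch the data volume when using them to avoid an OutOfMemoryError on the Driver.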
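
The return types of these methods can be checked with a short sketch. It assumes a local SparkSession and a made-up single-partition DataFrame; none of the names below come from the source table.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Local session for experimentation; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("take-demo").getOrCreate()
import spark.implicits._

// Made-up sample data.
val demoDF = Seq(("10.0.0.1", 80), ("10.0.0.2", 443), ("10.0.0.3", 22)).toDF("ip_src", "port")

val firstRow: Row = demoDF.first()            // a single Row
val headRow: Row = demoDF.head()              // same result as first()
val headRows: Array[Row] = demoDF.head(2)     // first 2 rows
val taken: Array[Row] = demoDF.take(2)        // first 2 rows
val takenList: java.util.List[Row] = demoDF.takeAsList(2)

// limit(n) returns a DataFrame instead, so nothing is pulled to the driver yet.
val limited = demoDF.limit(2)
```

When only a bounded sample is needed for further processing, `limit(n)` keeps the data as a DataFrame, whereas the methods above materialize rows on the driver.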


Original article: https://blog.csdn.net/qq_35240081/article/details/136421437