One batch every 5 minutes, joining a large table with a small table; per batch the large table is about 3 GB and the small table about 5 MB.
/usr/bin/spark-submit \
  --master yarn \
  --deploy-mode client \
  --queue root.xxx \
  --executor-memory 8g \
  --num-executors 20 \
  --executor-cores 4 \
  --driver-memory 10g \
  --conf spark.sql.shuffle.partitions=10 \
  --class com.xxx.Xxx \
  hdfs://xxx/xxx.jar 202208170015
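For a 3 GB-vs-5 MB join like this, broadcasting the small table lets Spark do a map-side broadcast hash join and avoids shuffling the large table. A minimal sketch of the job body; the table names, paths, and join key are hypothetical, not taken from the original job:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object SmallBigJoin {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("big-small-join").getOrCreate()

    // Hypothetical tables: big_table is ~3 GB per batch, small_table ~5 MB.
    val big   = spark.table("db.big_table")
    val small = spark.table("db.small_table")

    // broadcast() hints Spark to ship the 5 MB table to every executor,
    // so the large table is joined in place with no shuffle.
    val joined = big.join(broadcast(small), Seq("id"))

    joined.write.mode("overwrite").saveAsTable("db.result_table")
    spark.stop()
  }
}
```

Without the hint, Spark broadcasts automatically only when the small side is under spark.sql.autoBroadcastJoinThreshold (10 MB by default), so a 5 MB table would usually qualify anyway; the explicit hint just makes the choice deterministic.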
Spark History page

YARN page
With --deploy-mode client, the driver runs on the submitting machine, so driver logs print directly to the local console.

vim $SPARK_HOME/conf/log4j.properties
# Set everything to be logged to the console
# log4j.rootCategory=INFO, console   # the default level is INFO
log4j.rootCategory=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO

In the Spark SQL WHERE clause, the partition column's value was not wrapped in quotes, so the predicate was not pushed down. Inspecting the execution plan showed the following:

After wrapping the partition value in single quotes (''), the predicate was pushed down. Inspecting the execution plan again showed the following:
The InMemoryFileIndex array now contains only the HDFS paths of the Hive partitions that need to be computed, which avoids scanning the entire HDFS table directory and greatly reduces the job's runtime.
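The issue can be illustrated with a hypothetical partitioned table; db.tbl and its string partition column dt are made-up names, and 20220817 stands in for the batch's partition value:

```scala
// Hypothetical table db.tbl, partitioned by a STRING column dt.

// BAD: the literal is an unquoted number, so the comparison forces a
// cast on dt. The filter no longer matches the partition column
// directly, partition pruning fails, and InMemoryFileIndex lists
// every partition directory under the table's HDFS path.
spark.sql("SELECT * FROM db.tbl WHERE dt = 20220817")

// GOOD: the quoted literal matches the column's string type, the
// filter is recognized as a partition predicate, and only that
// partition's HDFS directory is listed and scanned.
spark.sql("SELECT * FROM db.tbl WHERE dt = '20220817'")
```

Exact pruning behavior varies by Spark version, but comparing the execution plans (as done above) for the two queries shows whether the PartitionFilters entry is populated.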
