
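Performance modeling and parameter optimization of Spark job
Cite this article: CUI Xiaolong, ZHANG Min, LIU Xiang, GUO Xi. Performance modeling and parameter optimization of Spark job[J]. Experimental Technology and Management, 2021(3): 146-152.
Authors: CUI Xiaolong  ZHANG Min  LIU Xiang  GUO Xi
Affiliation: School of Computer and Communication Engineering, University of Science and Technology Beijing
Funding: National Natural Science Foundation of China (61602031); Fundamental Research Funds for the Central Universities (FRF-BD-19-012A); University of Science and Technology Beijing Major Teaching Reform Project (JG2019ZD02).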
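Abstract: The Apache Spark distributed big data computing framework is widely used, but its many configuration parameters make it hard to use, and a poor configuration can severely degrade job performance. Studying how Spark parameters affect performance, and optimizing them automatically, is therefore of practical importance. This paper analyzes the key parameters that govern system behavior in Spark jobs, builds a performance model, and explores methods and strategies for automatic Spark parameter optimization. Parameters that affect performance are extracted from job execution, 19 mainstream regression models are compared, and the 6 with the best generality and fit are selected. Performance models are then built for different types of Spark jobs on a specific cluster, and an improved multi-start hill-climbing search algorithm uses these models to search the parameter space for the optimal parameter combination. Experiments show that Spark job performance improves substantially after parameter optimization.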

Keywords: Apache Spark  performance modeling  machine learning  parameter tuning  search algorithm

Performance modeling and parameter optimization of Spark job
CUI Xiaolong,ZHANG Min,LIU Xiang,GUO Xi.Performance modeling and parameter optimization of Spark job[J].Experimental Technology and Management,2021(3):146-152.
Authors:CUI Xiaolong  ZHANG Min  LIU Xiang  GUO Xi
Institution:(Beijing Key Laboratory of Materials Science Knowledge Engineering,School of Computer and Communication Engineering,University of Science and Technology Beijing,Beijing 100083,China)
Abstract: Apache Spark, a distributed big data computing framework, is widely used, but its numerous configuration parameters make it difficult to use, and a poor configuration can seriously degrade job execution performance. It is therefore important to study how Spark parameters affect performance and to optimize those parameters automatically. This paper analyzes the key parameters that affect system behavior in Spark jobs, establishes a performance model, and explores methods and strategies for automatic optimization of Spark parameters. The parameters that affect performance are extracted from the job execution process, and 19 mainstream regression models are compared to obtain 6 regression models with good generality and fit, which are used to build performance models on a specific cluster for different types of Spark jobs. Finally, based on the established performance models, an improved multi-start hill-climbing search algorithm is used to search the parameter space for the best combination of parameters. Experiments show that Spark job performance improves substantially after parameter optimization.
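The search step described in the abstract can be illustrated with a minimal sketch of plain multi-start hill climbing over a discrete parameter grid. This is not the paper's improved variant, whose details the abstract does not give. The parameter names, the candidate values, and the `predicted_runtime` function are all hypothetical; `predicted_runtime` is a toy stand-in for a trained regression model, not a real one.

```python
import random

# Hypothetical search space: names and candidate values are illustrative only.
SPACE = {
    "executor_cores": [1, 2, 4, 8],
    "executor_memory_gb": [2, 4, 8, 16],
    "shuffle_partitions": [50, 100, 200, 400, 800],
}
NAMES = list(SPACE)


def decode(idx):
    """Map a tuple of grid indices to a concrete configuration dict."""
    return {name: SPACE[name][i] for name, i in zip(NAMES, idx)}


def predicted_runtime(idx):
    """Toy stand-in for a trained performance model (seconds, lower is better)."""
    cfg = decode(idx)
    return (100.0
            + 5.0 * abs(cfg["executor_cores"] - 4)
            + 2.0 * abs(cfg["executor_memory_gb"] - 8)
            + 0.05 * abs(cfg["shuffle_partitions"] - 200))


def neighbors(idx):
    """Yield configurations that differ by one grid step in one parameter."""
    for d, name in enumerate(NAMES):
        for step in (-1, 1):
            j = idx[d] + step
            if 0 <= j < len(SPACE[name]):
                yield idx[:d] + (j,) + idx[d + 1:]


def hill_climb(start, cost):
    """Move to the best neighbor until no neighbor improves the cost."""
    current, current_cost = start, cost(start)
    while True:
        best = min(neighbors(current), key=cost)
        best_cost = cost(best)
        if best_cost >= current_cost:
            return current, current_cost
        current, current_cost = best, best_cost


def multi_start_hill_climb(cost, n_starts=5, seed=0):
    """Run hill climbing from several random starts and keep the best result."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_starts):
        start = tuple(rng.randrange(len(SPACE[n])) for n in NAMES)
        results.append(hill_climb(start, cost))
    return min(results, key=lambda r: r[1])


best_idx, best_cost = multi_start_hill_climb(predicted_runtime)
best_config = decode(best_idx)
print(best_config, best_cost)
```

Restarting from several random points reduces the chance of stopping in a poor local optimum, which matters when the fitted model is not as well behaved as this toy function. In practice the cost function would wrap the prediction of whichever regression model was fitted for the job type and cluster at hand.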
Keywords:Apache Spark  performance modeling  machine learning  parameter optimization  search algorithm
This article is indexed in VIP (维普) and other databases.