
Spark Source Code Analysis 13 - Tuning Spark

 

We can refer to http://spark.incubator.apache.org/docs/latest/tuning.html for the detailed tuning documentation.

 

     After tuning, Spark can process 200 MB of logs per minute on a single worker with 1 GB of memory. Processing the logs of each batch interval takes about 35-40 seconds.

     Here are the points I tuned.

    

    I changed the logs so that each user contains 20 KB of logs; 5000 users therefore produce about 100 MB of logs. When each user's share of the logs is smaller, the concurrent processing speed improves.

    JavaDStream<String> stringStream = jsc.socketTextStream("0.0.0.0",
            ConfigUtil.getInt(ConfigUtil.KEY_SPARK_REMOTE_FLUME_LISTENER_PORT),
            StorageLevel.MEMORY_AND_DISK());

 

Pass StorageLevel.MEMORY_AND_DISK() when creating the stream. The default, StorageLevel.MEMORY_AND_DISK_2(), keeps two replicas of the received data and therefore uses double the memory.

    Replace the code String user = currentLine.substring(start, end); with String user = new String(currentLine.substring(start, end).toCharArray());

On older JDKs, substring does not copy the characters: the new String shares the same backing char array as currentLine. So as long as some object holds a reference to user, the whole currentLine buffer cannot be garbage collected even after currentLine itself is no longer needed.
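A minimal sketch of the difference, assuming an older JDK (before 7u6) where substring shares the parent's backing char array; the class name, log line, and offsets are made up for illustration:

    // Hypothetical example of the substring copy described above.
    public class SubstringCopyExample {
        static String extractUser(String currentLine, int start, int end) {
            // Shares currentLine's backing char[] on older JDKs, so holding the
            // result keeps the whole log line in memory:
            // String user = currentLine.substring(start, end);

            // Copying the characters gives the result its own small char[],
            // so the large currentLine can be collected once nothing else uses it:
            return new String(currentLine.substring(start, end).toCharArray());
        }

        public static void main(String[] args) {
            String line = "2014-01-01 10:00:00 user=alice action=login";
            System.out.println(extractUser(line, 20, 30)); // prints "user=alice"
        }
    }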

 

    Update the Spark configuration. Below is the latest configuration; when the data changes, it sometimes needs to be adjusted.

 

        sparkConf.setMaster(ConfigUtil.getString(ConfigUtil.KEY_SPARK_REMOTE_MASTER)).setAppName(appName)
                .setJars(new String[]{ConfigUtil.getString(ConfigUtil.KEY_SPARK_REMOTE_JAR_LOCATION)})
                .set("spark.executor.memory", "1024m")
                .set("spark.streaming.unpersist", "true")
                .set("spark.rdd.compress", "true")
                .set("spark.default.parallelism", "12")
                .set("spark.storage.memoryFraction", "0.3")
                .set("spark.cleaner.ttl", "1200")
                //.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                //.set("spark.kryo.registrator", "com.seven.oi.spark.KryoRegistratorEx")
                //.setExecutorEnv("SPARK_JAVA_OPTS", "-XX:NewRatio=1 -XX:+UseCompressedStrings -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
                //.setExecutorEnv("SPARK_JAVA_OPTS", "-XX:NewRatio=1 -XX:+UseCompressedStrings")
                //use fastutil to store map
                .set("spark.shuffle.memoryFraction", "0.3");

 

   Set spark.rdd.compress = true so that cached RDDs use less memory.

   When an OutOfMemoryError happens, we can also consider increasing spark.default.parallelism.

   Sometimes we can use org.apache.spark.serializer.KryoSerializer. Kryo is significantly faster and more compact than Java serialization, which is the default. (This needs more testing; so far I can't see any difference in my tests.)
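For reference, a minimal sketch of what a Kryo registrator could look like (the class name and registered types are only illustrative; the commented-out .set(...) lines in the configuration above show where spark.serializer and spark.kryo.registrator are wired in):

    import com.esotericsoftware.kryo.Kryo;
    import org.apache.spark.serializer.KryoRegistrator;

    // Hypothetical registrator: register the classes that are shuffled or cached
    // so Kryo can encode them compactly instead of writing full class names.
    public class KryoRegistratorEx implements KryoRegistrator {
        @Override
        public void registerClasses(Kryo kryo) {
            kryo.register(String[].class);
            kryo.register(java.util.HashMap.class);
        }
    }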

   When there is a lot of young-generation GC, we can consider enlarging the young generation by setting -XX:NewRatio.

   When the JDK is 64-bit, we can use -XX:+UseCompressedStrings to compress Strings. I found that a 64-bit JDK uses more memory than a 32-bit one.

   We can also consider increasing spark.shuffle.memoryFraction when there are a lot of shuffle operations.

   If you want Spark to submit more tasks at a time, you can increase SPARK_WORKER_CORES (as shown below); Spark will never submit more tasks than SPARK_WORKER_CORES at a time.
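For example, this could be set in conf/spark-env.sh on each worker before starting it (the value 4 here is only an illustration):

    # conf/spark-env.sh: let this worker offer 4 cores, i.e. run up to 4 tasks concurrently
    export SPARK_WORKER_CORES=4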

 
