Sunday, 21 May 2017

Interview ready: Why do we need to configure driver memory in Spark?

5) Why do we need to configure driver memory in Spark?


The --driver-memory flag controls the amount of memory allocated to the driver. It defaults to 1 GB and should be increased whenever your application calls a collect() or take(N) action on a large RDD.
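As a minimal sketch of how this is typically configured (the 4g figure, class name and app name are illustrative, not prescriptive):

// At submit time (shell): the reliable way to size the driver heap, since in
// client mode the driver JVM is already running before application code executes:
//   spark-submit --driver-memory 4g --class com.example.MyApp myapp.jar
//
// In cluster mode the equivalent config key can also be set programmatically,
// as long as it is set before the driver JVM is launched:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DriverMemoryDemo")              // illustrative app name
  .config("spark.driver.memory", "4g")      // has no effect in client mode at runtime
  .getOrCreate()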


Let's look at this in more detail:


1. If your Spark job consists purely of transformations and ends in a distributed output action like rdd.saveAsTextFile, then the driver needs only a small amount of memory (roughly 100 MB), since it is only responsible for shipping the application's files to the cluster and collecting metrics. See Case 1 in the sketch after this list.


2. If your Spark job requires the driver to participate in the computation, e.g. an ML algorithm that must materialize results on each iteration and broadcast them for the next one, then the driver's memory requirement grows with the amount of data passing through it. Operations like collect, take and takeSample deliver data to the driver, so the driver needs enough memory to hold that data (Case 2 in the sketch below).
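Here is a sketch contrasting the two cases. The HDFS paths, the toy model and the update logic are all illustrative, and spark refers to the session created in the earlier sketch:

val sc = spark.sparkContext

// Case 1: pure transformations ending in a distributed write.
// Each executor writes its own partitions; no dataset flows to the driver.
val lines = sc.textFile("hdfs:///input/logs")             // hypothetical path
lines.filter(_.contains("ERROR"))
     .saveAsTextFile("hdfs:///output/errors")             // distributed action

// Case 2: the driver participates in the computation.
// Each iteration an aggregate is materialized on the driver (reduce is an
// action) and re-broadcast, so the driver must hold it in memory.
var weights = Array.fill(1000)(0.0)                       // toy model
val data = sc.parallelize(Seq.fill(100)(Array.fill(1000)(0.01)))  // toy training data
val n = data.count().toDouble
for (iter <- 1 to 10) {
  val bcast = sc.broadcast(weights)                       // ship weights to executors
  val gradSum = data
    .map(x => x.zip(bcast.value).map { case (xi, wi) => xi - wi })  // toy "gradient"
    .reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })    // lands on the driver
  weights = weights.zip(gradSum).map { case (w, g) => w + 0.1 * g / n }
  bcast.destroy()                                         // free executor-side copies
}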


Example:


If you have a 10 GB RDD in the cluster and call val WebServerData = rdd.collect, then you will need 10 GB + 100 MB of memory in the driver to hold that data.
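As a sketch of that scenario, where rdd stands for the hypothetical 10 GB RDD from the example, with lower-memory alternatives shown for contrast:

// collect() pulls every partition back to the driver: for a 10 GB RDD the
// driver heap must hold roughly the full 10 GB on top of its baseline overhead.
val WebServerData = rdd.collect()

// Lower-pressure alternatives when the full dataset is not needed on the driver:
val sample = rdd.take(1000)                  // only 1,000 records reach the driver
rdd.saveAsTextFile("hdfs:///output/dump")    // hypothetical path; nothing returned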


The Spark framework reports driver memory usage in the SparkContext log, spark.log. This log lives on the workbench where the Spark job is launched, at the location specified in /etc/spark/log4j.properties.