Repository: spark
Updated Branches:
  refs/heads/branch-1.0-jdbc 1d4103a09 -> 9caf3a965

[SPARK-2696] Reduce default value of spark.serializer.objectStreamReset

The current default value of spark.serializer.objectStreamReset is 10,000.
When trying to re-partition (e.g., to 64 partitions) a large file (e.g., 500MB), containing 1MB records, the serializer will cache 10000 x 1MB x 64 ~= 640 GB which will cause out of memory errors.

This patch sets the default value to a more reasonable default value (100).

Author: Hossein <[email protected]>

Closes #1595 from falaki/objectStreamReset and squashes the following commits:

650a935 [Hossein] Updated documentation
1aa0df8 [Hossein] Reduce default value of spark.serializer.objectStreamReset

(cherry picked from commit 66f26a4610aede57322cb7e193a50aecb6c57d22)
Signed-off-by: Matei Zaharia <[email protected]>

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9caf3a96
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9caf3a96
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9caf3a96

Branch: refs/heads/branch-1.0-jdbc
Commit: 9caf3a9659821a3b4fd4394c9f4134adff9caf88
Parents: 1d4103a
Author: Hossein <[email protected]>
Authored: Sat Jul 26 01:04:56 2014 -0700
Committer: Matei Zaharia <[email protected]>
Committed: Sat Jul 26 01:05:05 2014 -0700

----------------------------------------------------------------------
 .../main/scala/org/apache/spark/serializer/JavaSerializer.scala | 2 +-
 docs/configuration.md                                           | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/9caf3a96/core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala
----------------------------------------------------------------------
diff --git a/core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala b/core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala
index 0a7e1ec..a7fa057 100644
--- a/core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala
+++ b/core/src/main/scala/org/apache/spark/serializer/JavaSerializer.scala
@@ -108,7 +108,7 @@ private[spark] class JavaSerializerInstance(counterReset: Int) extends Serialize
  */
 @DeveloperApi
 class JavaSerializer(conf: SparkConf) extends Serializer with Externalizable {
-  private var counterReset = conf.getInt("spark.serializer.objectStreamReset", 10000)
+  private var counterReset = conf.getInt("spark.serializer.objectStreamReset", 100)

   def newInstance(): SerializerInstance = new JavaSerializerInstance(counterReset)

http://git-wip-us.apache.org/repos/asf/spark/blob/9caf3a96/docs/configuration.md
----------------------------------------------------------------------
diff --git a/docs/configuration.md b/docs/configuration.md
index 71fafa5..44766ef 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -362,13 +362,13 @@ Apart from these, the following properties are also available, and may be useful
 </tr>
 <tr>
   <td><code>spark.serializer.objectStreamReset</code></td>
-  <td>10000</td>
+  <td>100</td>
   <td>
     When serializing using org.apache.spark.serializer.JavaSerializer, the serializer caches
     objects to prevent writing redundant data, however that stops garbage collection of those
     objects. By calling 'reset' you flush that info from the serializer, and allow old
     objects to be collected. To turn off this periodic reset set it to a value <= 0.
-    By default it will reset the serializer every 10,000 objects.
+    By default it will reset the serializer every 100 objects.
   </td>
 </tr>
 <tr>
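
For readers who want to see the trade-off the commit message describes, here is a minimal sketch, not Spark's actual JavaSerializerInstance code. It uses only java.io.ObjectOutputStream, whose reset() method discards the stream's table of already-written objects. The names ObjectStreamResetSketch, worstCaseRetainedMB, and ResettingWriter are hypothetical and exist only for this illustration. The estimate assumes decimal units (1 GB = 1000 MB) and one open stream per output partition, which is how the commit message's 10000 x 1MB x 64 figure appears to be derived.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object ObjectStreamResetSketch {

  /** Upper bound, in MB, on data kept reachable by the streams' object caches:
    * objects written between resets x record size x number of open streams. */
  def worstCaseRetainedMB(resetInterval: Long, recordMB: Long, openStreams: Long): Long =
    resetInterval * recordMB * openStreams

  /** Hypothetical wrapper that calls reset() after every `counterReset` writes.
    * A value <= 0 disables the periodic reset, as the documented property does. */
  final class ResettingWriter(counterReset: Int) {
    private val bytes = new ByteArrayOutputStream()
    private val out = new ObjectOutputStream(bytes)
    private var counter = 0
    var resets = 0

    def write(obj: AnyRef): Unit = {
      out.writeObject(obj)
      counter += 1
      if (counterReset > 0 && counter >= counterReset) {
        out.reset() // forget previously written objects so they can be collected
        counter = 0
        resets += 1
      }
    }

    /** Bytes emitted so far, after flushing the stream's internal buffer. */
    def size: Int = { out.flush(); bytes.size() }
  }

  def main(args: Array[String]): Unit = {
    // Old default (10000) versus new default (100), 1 MB records, 64 partitions.
    println(s"old default: ${worstCaseRetainedMB(10000L, 1L, 64L)} MB")
    println(s"new default: ${worstCaseRetainedMB(100L, 1L, 64L)} MB")

    // Writing the same array twice: without a reset the second write is a short
    // back-reference; with a reset after every write the array is written again.
    val payload = new Array[Byte](1000)
    val noReset = new ResettingWriter(0)
    val everyWrite = new ResettingWriter(1)
    for (w <- Seq(noReset, everyWrite)) { w.write(payload); w.write(payload) }
    println(s"no reset: ${noReset.size} bytes, reset every write: ${everyWrite.size} bytes")
  }
}
```

Under these assumptions the bound drops from 640,000 MB (about 640 GB) to 6,400 MB (about 6.4 GB). The second half of main shows the cost on the other side: resetting more often gives up some de-duplication of repeated objects in the output.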
