The goal of this topic is to document best practices when asking Apache Spark related questions.
When asking Apache Spark related questions, please include the following information:
Build definition (e.g. build.sbt, pom.xml) if applicable, or external dependency versions (Python, R) when applicable.
Master URL (e.g. local[n], Spark standalone, YARN, Mesos), deploy mode (client, cluster) and other submit options if applicable.
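One convenient way to report this information is to include the code that creates the session along with the version it prints. This is only a sketch; the application name and master URL below are illustrative assumptions, not recommendations:

```scala
// Sketch: reporting how the application is configured.
// The appName and master values are illustrative assumptions;
// replace them with what you actually use.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("question-example")
  .master("local[4]")
  .getOrCreate()

// Include this output in your question:
println(spark.version)
```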
Please try to provide minimal example input data in a format that can be used directly by answerers, without tedious and time-consuming parsing: for example, an input file or a local collection, together with all the code required to create the distributed data structures.
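For example, a local collection converted to a DataFrame is self-contained and immediately runnable. This is a minimal sketch; the column names and values are arbitrary placeholders:

```scala
// Sketch: a self-contained input that answerers can run directly,
// instead of a path to a private file. Column names and values
// are illustrative placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mre").master("local[2]").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "foo", 2.0),
  (2, "bar", 3.5)
).toDF("id", "label", "value")

df.show()
```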
When applicable, always include type information: a StructType or the output of Dataset.printSchema.
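For instance, pasting the output of printSchema directly into the question removes any guesswork about column types. A minimal sketch (the DataFrame here is an illustrative assumption):

```scala
// Sketch: including schema information with a question.
// The example DataFrame is an illustrative assumption.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("schema-example").master("local[2]").getOrCreate()
import spark.implicits._

val df = Seq((1, "foo")).toDF("id", "label")

// Paste this output into the question:
df.printSchema()
```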
If a particular problem occurs only at scale, use random data generators (Spark provides some useful utilities in org.apache.spark.mllib.random.RandomRDDs and org.apache.spark.graphx.util.GraphGenerators).
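As a sketch of this approach, RandomRDDs can generate large synthetic datasets with a couple of lines; the size and partition count below are arbitrary example values:

```scala
// Sketch: generating random data to reproduce a problem that only
// appears at scale. Size and partition count are arbitrary examples.
import org.apache.spark.mllib.random.RandomRDDs
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("random-data").master("local[2]").getOrCreate()

// One million standard-normal doubles spread over 10 partitions:
val data = RandomRDDs.normalRDD(spark.sparkContext, 1000000L, 10)
println(data.count())
```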
Please use type annotations when possible. While the compiler can easily keep track of the types, it is not so easy for mere mortals. For example:
val lines: RDD[String] = rdd.map(someFunction)
def f(x: String): Int = ???
are better than:
val lines = rdd.map(someFunction)
def f(x: String) = ???
When the question is related to debugging a specific exception, always provide the relevant traceback. While it is advisable to remove duplicated output (from different executors or attempts), don't cut tracebacks down to a single line or the exception class only.
Depending on the context, try to provide details such as: