Apache Spark

Learn about using Sentry with Apache Spark.

The Spark Integration adds support for the Python API for Apache Spark, PySpark.

This integration is experimental and in an alpha state. The integration API may experience breaking changes in further minor versions.

The Spark driver integration is supported for Spark 2 and above.

To configure the SDK, initialize it with the integration before you create a SparkContext or SparkSession.

import sentry_sdk
from sentry_sdk.integrations.spark import SparkIntegration

from pyspark.sql import SparkSession

if __name__ == "__main__":
    sentry_sdk.init(
        dsn="https://examplePublicKey@o0.ingest.sentry.io/0",
        # Set traces_sample_rate to 1.0 to capture 100%
        # of transactions for tracing.
        traces_sample_rate=1.0,
        # Set profiles_sample_rate to 1.0 to profile 100%
        # of sampled transactions.
        # We recommend adjusting this value in production.
        profiles_sample_rate=1.0,
        integrations=[
            SparkIntegration(),
        ],
    )

    spark = SparkSession\
        .builder\
        .appName("ExampleApp")\
        .getOrCreate()
    ...
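The ordering requirement exists because the integration instruments the Spark context when sentry_sdk.init() runs; a context created beforehand is never patched. The stand-in classes below sketch that behavior only conceptually (they are not the real sentry_sdk or PySpark APIs):

```python
# Toy illustration of why sentry_sdk.init() must run before the
# SparkContext/SparkSession is created. Stand-in classes only; not
# the real sentry_sdk or PySpark APIs.

class FakeSparkContext:
    # Class-level flag that the "integration" flips when installed.
    instrumented_flag = False

    def __init__(self):
        # Each context records whether instrumentation was active
        # at the moment it was created.
        self.instrumented = FakeSparkContext.instrumented_flag

def fake_sentry_init():
    # Models SparkIntegration patching the context class at init time.
    FakeSparkContext.instrumented_flag = True

early = FakeSparkContext()   # created before init: never instrumented
fake_sentry_init()
late = FakeSparkContext()    # created after init: instrumented
```

A context created too early silently misses instrumentation, which is why the real SDK must be initialized first.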

The Spark worker integration is supported for Spark versions 2.4.x and 3.1.x.

Create a file called sentry_daemon.py with the following content:

sentry_daemon.py
import sentry_sdk
from sentry_sdk.integrations.spark import SparkWorkerIntegration
import pyspark.daemon as original_daemon

if __name__ == "__main__":
    sentry_sdk.init(
        dsn="https://examplePublicKey@o0.ingest.sentry.io/0",
        # Set traces_sample_rate to 1.0 to capture 100%
        # of transactions for tracing.
        traces_sample_rate=1.0,
        # Set profiles_sample_rate to 1.0 to profile 100%
        # of sampled transactions.
        # We recommend adjusting this value in production.
        profiles_sample_rate=1.0,
        integrations=[
            SparkWorkerIntegration(),
        ],
    )

    original_daemon.manager()
    ...

In your spark-submit command, add the following configuration options so the Spark cluster can use the Sentry integration.

Command Line Option | Parameter | Usage
--py-files | sentry_daemon.py | Sends the sentry_daemon.py file to your Spark clusters
--conf | spark.python.use.daemon=true | Configures Spark to use a daemon to execute its Python workers
--conf | spark.python.daemon.module=sentry_daemon | Configures Spark to use the Sentry custom daemon
./bin/spark-submit \
    --py-files sentry_daemon.py \
    --conf spark.python.use.daemon=true \
    --conf spark.python.daemon.module=sentry_daemon \
    example-spark-job.py
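If you launch jobs programmatically, the same flags can be assembled in Python. A minimal sketch, where the helper name, binary path, and file names are illustrative placeholders rather than part of the SDK:

```python
# Hypothetical helper that builds the spark-submit argument list shown
# above. The spark-submit path and file names are placeholders; adjust
# them for your deployment.
def build_sentry_submit_args(job_file, daemon_file="sentry_daemon.py"):
    return [
        "./bin/spark-submit",
        # Ship the daemon file to the cluster alongside the job.
        "--py-files", daemon_file,
        # Run Python workers through a daemon process...
        "--conf", "spark.python.use.daemon=true",
        # ...and make that daemon the Sentry wrapper module.
        "--conf", "spark.python.daemon.module=sentry_daemon",
        job_file,
    ]

args = build_sentry_submit_args("example-spark-job.py")
```

The resulting list could then be handed to something like subprocess.run(args) from a launcher script.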

  • You must have the Sentry Python SDK installed on all your clusters to use the Spark integration. The easiest way to do this is to run an initialization script on all your clusters:
easy_install pip
pip install --upgrade sentry-sdk
  • In order to access certain tags (app_name, application_id), the worker integration requires the driver integration to also be active.

  • The worker integration only works on UNIX-based systems due to the daemon process using signals for child management.

This integration can be set up for Google Cloud Dataproc. We recommend using Cloud Dataproc image version 1.4 with Spark 2.4, or image version 2.0 with Spark 3.1 (as required by the worker integration).

  1. Set up an Initialization action to install the sentry-sdk on your Dataproc cluster.

  2. Add the driver integration to your main Python file submitted in the job submit screen.

  3. Add the sentry_daemon.py file under Additional python files in the job submit screen. You must first upload the daemon file to a bucket so the job can access it.

  4. Add the configuration properties listed above, spark.python.use.daemon=true and spark.python.daemon.module=sentry_daemon, in the job submit screen.
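For step 4, jobs submitted from the command line carry the same two settings as a comma-separated properties string. A small sketch that assembles that value (the dictionary and variable names here are illustrative):

```python
# Hypothetical: assemble a comma-separated properties value carrying
# the two Spark daemon settings from step 4.
props = {
    "spark.python.use.daemon": "true",
    "spark.python.daemon.module": "sentry_daemon",
}
properties_value = ",".join(f"{k}={props[k]}" for k in sorted(props))
```

If you submit with the gcloud CLI, a value like this is what the job's properties flag expects; check the gcloud dataproc documentation for the exact flag names on your version.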
