---
title: "Apache Spark"
description: "Learn about using Sentry with Apache Spark."
url: https://docs.sentry.io/platforms/python/integrations/spark/
---

# Apache Spark | Sentry for Python

The Spark Integration adds support for the Python API for Apache Spark, [PySpark](https://spark.apache.org/).

**This integration is experimental and in an alpha state**. The integration API may experience breaking changes in further minor versions.

### [Driver](https://docs.sentry.io/platforms/python/integrations/spark.md#driver)

The Spark driver integration is supported for Spark 2 and above.

To configure the SDK, initialize it with the integration before you create a `SparkContext` or `SparkSession`.

In addition to capturing errors, you can monitor interactions between multiple services or applications by [enabling tracing](https://docs.sentry.io/concepts/key-terms/tracing.md). You can also collect and analyze performance profiles from real users with [profiling](https://docs.sentry.io/product/explore/profiling.md).

Select which Sentry features you'd like to install in addition to Error Monitoring to get the corresponding installation and configuration instructions below.

Error Monitoring\[ ]Tracing\[ ]Profiling

```python
import sentry_sdk
from sentry_sdk.integrations.spark import SparkIntegration

if __name__ == "__main__":
    sentry_sdk.init(
        dsn="___PUBLIC_DSN___",
        # Add data like request headers and IP for users, if applicable;
        # see https://docs.sentry.io/platforms/python/data-management/data-collected/ for more info
        send_default_pii=True,
        # ___PRODUCT_OPTION_START___ performance
        # Set traces_sample_rate to 1.0 to capture 100%
        # of transactions for tracing.
        traces_sample_rate=1.0,
        # ___PRODUCT_OPTION_END___ performance
        # ___PRODUCT_OPTION_START___ profiling
        # To collect profiles for all profile sessions,
        # set `profile_session_sample_rate` to 1.0.
        profile_session_sample_rate=1.0,
        # Profiles will be automatically collected while
        # there is an active span.
        profile_lifecycle="trace",
        # ___PRODUCT_OPTION_END___ profiling
        integrations=[
            SparkIntegration(),
        ],
    )

    spark = SparkSession\
        .builder\
        .appName("ExampleApp")\
        .getOrCreate()
    ...
```

### [Worker](https://docs.sentry.io/platforms/python/integrations/spark.md#worker)

The spark worker integration is supported for Spark versions 2.4.x and 3.1.x.

Create a file called `sentry-daemon.py` with the following content:

`sentry-daemon.py`

```python
import sentry_sdk
from sentry_sdk.integrations.spark import SparkWorkerIntegration
import pyspark.daemon as original_daemon

if __name__ == '__main__':
    sentry_sdk.init(
        dsn="___PUBLIC_DSN___",
        # Add data like request headers and IP for users, if applicable;
        # see https://docs.sentry.io/platforms/python/data-management/data-collected/ for more info
        send_default_pii=True,
        # ___PRODUCT_OPTION_START___ performance
        # Set traces_sample_rate to 1.0 to capture 100%
        # of transactions for tracing.
        traces_sample_rate=1.0,
        # ___PRODUCT_OPTION_END___ performance
        # ___PRODUCT_OPTION_START___ profiling
        # To collect profiles for all profile sessions,
        # set `profile_session_sample_rate` to 1.0.
        profile_session_sample_rate=1.0,
        # Profiles will be automatically collected while
        # there is an active span.
        profile_lifecycle="trace",
        # ___PRODUCT_OPTION_END___ profiling
        integrations=[
            SparkWorkerIntegration(),
        ],
    )

    original_daemon.manager()
    ...
```

In your `spark_submit` command, add the following configuration options so the spark clusters can use the Sentry integration.

| Command Line Options | Parameter                                 | Usage                                                          |
| -------------------- | ----------------------------------------- | -------------------------------------------------------------- |
| --py-files           | sentry\_daemon.py                         | Sends the `sentry_daemon.py` file to your Spark clusters       |
| --conf               | spark.python.use.daemon=true              | Configures Spark to use a daemon to execute its Python workers |
| --conf               | spark.python.daemon.module=sentry\_daemon | Configures Spark to use the Sentry custom daemon               |

```bash
./bin/spark-submit \
    --py-files sentry_daemon.py \
    --conf spark.python.use.daemon=true \
    --conf spark.python.daemon.module=sentry_daemon \
    example-spark-job.py
```

## [Behavior](https://docs.sentry.io/platforms/python/integrations/spark.md#behavior)

* You must have the Sentry Python SDK installed on all your clusters to use the Spark integration. The easiest way to do this is to run an initialization script on all your clusters:

```bash
easy_install pip
pip install --upgrade sentry-sdk
```

* In order to access certain tags (`app_name`, `application_id`), the worker integration requires the driver integration to also be active.

* The worker integration only works on UNIX-based systems due to the daemon process using signals for child management.

## [Google Cloud Dataproc](https://docs.sentry.io/platforms/python/integrations/spark.md#google-cloud-dataproc)

This integration can be set up for [Google Cloud Dataproc](https://cloud.google.com/dataproc/). It's recommended that Cloud Dataproc image version 1.4 or 2.0 be used with Spark 2.4 and 3.1, respectively, (as required by the worker integration).

1. Set up an [Initialization action](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions) to install the `sentry-sdk` on your Dataproc cluster.

2. Add the driver integration to your main python file submitted in in the job submit screen

3. Add the `sentry_daemon.py` under *Additional python files* in the job submit screen. You must first upload the daemon file to a bucket to access it.

4. Add the configuration properties listed above, `spark.python.use.daemon=true` and `spark.python.daemon.module=sentry_daemon` in the job submit screen.
