Installation

Setup

An Apache Spark distribution must be installed before installing Apache Toree. You can download a copy of Apache Spark here. Throughout the rest of this guide we assume you have downloaded and extracted the Apache Spark distribution to /usr/local/bin/apache-spark/.
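
For example, downloading and extracting a Spark distribution might look like the following sketch; the version number and download URL are assumptions, so substitute the release you actually want:

# Version and mirror URL below are placeholders; adjust as needed
wget https://downloads.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
tar -xzf spark-3.5.1-bin-hadoop3.tgz
sudo mv spark-3.5.1-bin-hadoop3 /usr/local/bin/apache-spark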

Installing Toree via Pip

The quickest way to install Apache Toree is through the toree pip package.

pip install toree

This will install a jupyter application called toree, which can be used to install and configure different Apache Toree kernels.

jupyter toree install --spark_home=/usr/local/bin/apache-spark/

You can confirm the installation by verifying that the apache_toree_scala kernel is listed in the output of the following command:

jupyter kernelspec list
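
If the installation succeeded, the output will contain an entry similar to the one below (the exact path is an assumption and varies by platform and install options):

Available kernels:
  apache_toree_scala    /usr/local/share/jupyter/kernels/apache_toree_scala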

Options

Arguments that take values are actually convenience aliases to full Configurables, whose aliases are listed on the help line. For more information on full Configurables, see '--help-all'.

--user
    Install to the per-user kernel registry
--debug
    set log level to logging.DEBUG (maximize logging output)
--replace
    Replace any existing kernel spec with this name.
--sys-prefix
    Install to Python's sys.prefix. Useful in conda/virtual environments.
--interpreters=<Unicode> (ToreeInstall.interpreters)
    Default: 'Scala'
    A comma separated list of the interpreters to install. The names of the
    interpreters are case sensitive.
--toree_opts=<Unicode> (ToreeInstall.toree_opts)
    Default: ''
    Specify command line arguments for Apache Toree.
--python_exec=<Unicode> (ToreeInstall.python_exec)
    Default: 'python'
    Specify the python executable. Defaults to "python"
--kernel_name=<Unicode> (ToreeInstall.kernel_name)
    Default: 'Apache Toree'
    Install the kernel spec with this name. This is also used as the base of the
    display name in jupyter.
--log-level=<Enum> (Application.log_level)
    Default: 30
    Choices: (0, 10, 20, 30, 40, 50, 'DEBUG', 'INFO', 'WARN', 'ERROR', 'CRITICAL')
    Set the log level by value or name.
--config=<Unicode> (JupyterApp.config_file)
    Default: ''
    Full path of a config file.
--spark_home=<Unicode> (ToreeInstall.spark_home)
    Default: '/usr/local/spark'
    Specify where the spark files can be found.
--spark_opts=<Unicode> (ToreeInstall.spark_opts)
    Default: ''
    Specify command line arguments to proxy for spark config.
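
For example, the options above can be combined to install a per-user kernel spec under a custom name; the kernel name and paths here are illustrative only:

jupyter toree install --user --replace \
    --kernel_name='Toree Dev' \
    --interpreters=Scala,PySpark \
    --spark_home=/usr/local/bin/apache-spark/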

Configuring Spark

Toree is started using the spark-submit script, so Spark is configured the same way it would be for any spark-submit job. There are two ways of setting Spark configuration options.

The first is at install time with the --spark_opts command line option.

jupyter toree install --spark_opts='--master=local[4]'

The second option is configured at run time through the SPARK_OPTS environment variable.

SPARK_OPTS='--master=local[4]' jupyter notebook

Note: There is an order of precedence to the configuration options. SPARK_OPTS will overwrite any values configured in --spark_opts.
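
The same mechanism works for any spark-submit option; the specific values below (driver memory and an additional Spark conf setting) are illustrative only:

SPARK_OPTS='--master=local[4] --driver-memory 2g --conf spark.executor.memory=2g' jupyter notebook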

Configuring Toree

There are some configuration options that are specific to Toree.

Option                                             Description
------                                             -----------
--default-interpreter                              default interpreter for the kernel
--default-repositories                             comma separated list of additional
                                                     repositories to resolve
--default-repository-credentials                   comma separated list of credential
                                                     files to use
-h, --help                                         display help information
--interpreter-plugin
--ip                                               used to bind sockets
--jar-dir                                          directory where user added jars are
                                                     stored (MUST EXIST)
--magic-url                                        path to a magic jar
--max-interpreter-threads <Integer>                total number of worker threads to use
                                                     to execute code
--spark-context-initialization-timeout <Long>      number of milliseconds allowed for
                                                     creation of the spark context; default
                                                     is 100 milliseconds
--alternate-sigint  <String>                       specifies the signal to use instead of SIGINT
                                                     for interrupting a long-running cell; value
                                                     does not include the SIG prefix; use of
                                                     USR2 is recommended
--nosparkcontext                                   kernel should not create a spark
                                                     context
-v, --version                                      display version information

There are two ways of setting these configuration options.

The first is at install time with the --toree_opts command line option.

jupyter toree install --toree_opts='--nosparkcontext'

The second option is configured at run time through the TOREE_OPTS environment variable.

TOREE_OPTS='--nosparkcontext' jupyter notebook

Note: There is an order of precedence to the configuration options. TOREE_OPTS will overwrite any values configured in --toree_opts.
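
Several Toree options can be combined in the same way; the combination below is an illustrative sketch using options from the table above:

TOREE_OPTS='--nosparkcontext --max-interpreter-threads 4' jupyter notebook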

Installing Multiple Kernels

Apache Toree provides support for multiple languages. To enable additional languages, pass the interpreters you want as a comma-separated list to the --interpreters flag:

jupyter toree install --interpreters=Scala,PySpark,SparkR,SQL

The available interpreters and their supported languages are:

Language   Spark Implementation   Value to provide to Apache Toree
--------   --------------------   --------------------------------
Scala      Scala with Spark       Scala
Python     Python with PySpark    PySpark
R          R with SparkR          SparkR
SQL        Spark SQL              SQL
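
After installing multiple interpreters, jupyter kernelspec list shows one kernel per interpreter. Kernel names typically follow the apache_toree_<interpreter> pattern seen above; the paths below are illustrative:

Available kernels:
  apache_toree_scala     /usr/local/share/jupyter/kernels/apache_toree_scala
  apache_toree_pyspark   /usr/local/share/jupyter/kernels/apache_toree_pyspark
  apache_toree_sparkr    /usr/local/share/jupyter/kernels/apache_toree_sparkr
  apache_toree_sql       /usr/local/share/jupyter/kernels/apache_toree_sql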

Interpreter Requirements

  • R version 3.2+
  • Make sure that the package directory R uses when installing packages is writable; this is needed to install the modified SparkR library, which is done automatically before any R code is run.

If the package directory is not writable by Apache Toree, you will see an error similar to the following:

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
Warning in install.packages("sparkr_bundle.tar.gz", repos = NULL, type = "source") :
'lib = "/usr/local/lib/R/site-library"' is not writable
Error in install.packages("sparkr_bundle.tar.gz", repos = NULL, type = "source") :
unable to install packages
Execution halted
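
One possible workaround (an assumption, not part of the official setup) is to point R at a user-writable library directory before starting the kernel:

# Hypothetical fix: use a per-user R library that is writable
mkdir -p ~/R/library
export R_LIBS_USER=~/R/library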