Host the Spark UI on Amazon SageMaker Studio

Amazon SageMaker offers several ways to run distributed data processing jobs with Apache Spark, a popular distributed computing framework for big data processing.

You can run Spark applications interactively from Amazon SageMaker Studio by connecting SageMaker Studio notebooks and AWS Glue Interactive Sessions to run Spark jobs with a serverless cluster. With interactive sessions, you can choose Apache Spark or Ray to easily process large datasets, without worrying about cluster management.

Alternatively, if you need more control over the environment, you can use a pre-built SageMaker Spark container to run Spark applications as batch jobs on a fully managed distributed cluster with Amazon SageMaker Processing. This option lets you select several types of instances (compute optimized, memory optimized, and more), the number of nodes in the cluster, and the cluster configuration, thereby enabling greater flexibility for data processing and model training.

Finally, you can run Spark applications by connecting Studio notebooks with Amazon EMR clusters, or by running your Spark cluster on Amazon Elastic Compute Cloud (Amazon EC2).

All these options allow you to generate and store Spark event logs and analyze them through the web-based user interface commonly called the Spark UI, which is served by a Spark History Server and lets you monitor the progress of Spark applications, track resource usage, and debug errors.

In this post, we share a solution for installing and running Spark History Server on SageMaker Studio and accessing the Spark UI directly from the SageMaker Studio IDE, for analyzing Spark logs produced by different AWS services (AWS Glue Interactive Sessions, SageMaker Processing jobs, and Amazon EMR) and stored in an Amazon Simple Storage Service (Amazon S3) bucket.

Solution overview

The solution integrates Spark History Server into the Jupyter Server app in SageMaker Studio. This allows users to access Spark logs directly from the SageMaker Studio IDE. The integrated Spark History Server supports the following:

Accessing logs generated by SageMaker Processing Spark jobs
Accessing logs generated by AWS Glue Spark applications
Accessing logs generated by self-managed Spark clusters and Amazon EMR

A utility command line interface (CLI) called sm-spark-cli is also provided for interacting with the Spark UI from the SageMaker Studio system terminal. The sm-spark-cli lets you manage Spark History Server without leaving SageMaker Studio.

The solution consists of shell scripts that perform the following actions:

Install Spark on the Jupyter Server for SageMaker Studio user profiles or for a SageMaker Studio shared space
Install the sm-spark-cli for a user profile or shared space

Install the Spark UI manually in a SageMaker Studio domain

To host Spark UI on SageMaker Studio, complete the following steps:

Choose System terminal from the SageMaker Studio launcher.

Run the following commands in the system terminal:

curl -LO https://github.com/aws-samples/amazon-sagemaker-spark-ui/releases/download/v0.1.0/amazon-sagemaker-spark-ui-0.1.0.tar.gz
tar -xvzf amazon-sagemaker-spark-ui-0.1.0.tar.gz

cd amazon-sagemaker-spark-ui-0.1.0/install-scripts
chmod +x install-history-server.sh
./install-history-server.sh

The commands will take a few seconds to complete.

When the installation is complete, you can start the Spark UI by using the provided sm-spark-cli and access it from a web browser by running the following code:

sm-spark-cli start s3://DOC-EXAMPLE-BUCKET/<SPARK_EVENT_LOGS_LOCATION>

The S3 location where the event logs produced by SageMaker Processing, AWS Glue, or Amazon EMR are stored can be configured when running Spark applications.
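For a self-managed Spark cluster, the event log location is typically set through the standard Spark properties spark.eventLog.enabled and spark.eventLog.dir. The following is a minimal sketch rather than part of the solution: the helper name event_log_conf, the application file my_app.py, and the spark-events/ prefix are hypothetical, and the s3a:// scheme assumes the Hadoop S3A connector is available on the cluster (Amazon EMR typically uses s3:// instead).

```shell
# Hypothetical helper (not part of sm-spark-cli): prints the spark-submit
# flags that enable event logging and point it at a log directory.
event_log_conf() {
  if [ -z "${1:-}" ]; then
    echo "usage: event_log_conf <event-log-dir>" >&2
    return 1
  fi
  # "--" stops printf from treating the leading "--conf" as an option.
  printf -- '--conf spark.eventLog.enabled=true --conf spark.eventLog.dir=%s\n' "$1"
}

# Example (assumed bucket, prefix, and application file):
#   spark-submit $(event_log_conf s3a://DOC-EXAMPLE-BUCKET/spark-events/) my_app.py
```

The directory passed here is the one you would later point the Spark History Server at when starting it with sm-spark-cli.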

For SageMaker Studio notebooks and AWS Glue Interactive Sessions, you can set up the Spark event log location directly from the notebook by using the sparkmagic kernel.

The sparkmagic kernel contains a set of tools for interacting with remote Spark clusters through notebooks. It offers magic (%spark, %sql) commands to run Spark code, perform SQL queries, and configure Spark settings like executor memory and cores.

For the SageMaker Processing job, you can configure the Spark event log location directly from the SageMaker Python SDK.

Refer to the AWS documentation for additional information.

You can choose the generated URL to access the Spark UI.

The following screenshot shows an example of the Spark UI.

You can check the status of the Spark History Server by using the sm-spark-cli status command in the Studio System terminal.

You can also stop the Spark History Server when needed.
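When scripting these checks, it can help to fail early if the CLI has not been installed yet. The sketch below is an illustration only: the wrapper name spark_ui is hypothetical, and the stop subcommand name is an assumption based on the description above, so check the CLI's own help output before relying on it.

```shell
# Hypothetical wrapper: fail with a clear message when sm-spark-cli is not
# installed yet, otherwise pass all arguments through unchanged.
spark_ui() {
  if ! command -v sm-spark-cli >/dev/null 2>&1; then
    echo "sm-spark-cli not found; run install-history-server.sh first" >&2
    return 127
  fi
  sm-spark-cli "$@"
}

# Usage (the "stop" subcommand name is an assumption):
#   spark_ui status
#   spark_ui stop
```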

Automate the Spark UI installation for users in a SageMaker Studio domain

As an IT admin, you can automate the installation for SageMaker Studio users by using a lifecycle configuration. This can be done for all user profiles under a SageMaker Studio domain or for specific ones. See Customize Amazon SageMaker Studio using Lifecycle Configurations for more details.

You can create a lifecycle configuration from the install-history-server.sh script and attach it to an existing SageMaker Studio domain. The installation is run for all the user profiles in the domain.

From a terminal configured with the AWS Command Line Interface (AWS CLI) and appropriate permissions, run the following commands:

curl -LO https://github.com/aws-samples/amazon-sagemaker-spark-ui/releases/download/v0.1.0/amazon-sagemaker-spark-ui-0.1.0.tar.gz
tar -xvzf amazon-sagemaker-spark-ui-0.1.0.tar.gz

cd amazon-sagemaker-spark-ui-0.1.0/install-scripts

# Base64-encode the install script; -A keeps the output on a single line
LCC_CONTENT=$(openssl base64 -A -in install-history-server.sh)

# Register the script as a JupyterServer lifecycle configuration and print its ARN
aws sagemaker create-studio-lifecycle-config \
	--studio-lifecycle-config-name install-spark-ui-on-jupyterserver \
	--studio-lifecycle-config-content "$LCC_CONTENT" \
	--studio-lifecycle-config-app-type JupyterServer \
	--query 'StudioLifecycleConfigArn'

# Attach the lifecycle configuration to the domain defaults; the ARN must match the one printed above
aws sagemaker update-domain \
	--region {YOUR_AWS_REGION} \
	--domain-id {YOUR_STUDIO_DOMAIN_ID} \
	--default-user-settings \
	'{
	"JupyterServerAppSettings": {
	"DefaultResourceSpec": {
	"LifecycleConfigArn": "arn:aws:sagemaker:{YOUR_AWS_REGION}:{YOUR_AWS_ACCOUNT_ID}:studio-lifecycle-config/install-spark-ui-on-jupyterserver",
	"InstanceType": "system"
	},
	"LifecycleConfigArns": [
	"arn:aws:sagemaker:{YOUR_AWS_REGION}:{YOUR_AWS_ACCOUNT_ID}:studio-lifecycle-config/install-spark-ui-on-jupyterserver"
	]
	}}'

After Jupyter Server restarts, the Spark UI and the sm-spark-cli will be available in your SageMaker Studio environment.
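The lifecycle configuration ARN passed to update-domain has the form arn:aws:sagemaker:<region>:<account-id>:studio-lifecycle-config/<name>, and a mistyped segment is an easy way for the attachment to fail. As a small sketch, a helper like the following can assemble the value; the function name lcc_arn and the sample Region and account ID are hypothetical placeholders, not values from this post.

```shell
# Hypothetical helper: assemble a Studio lifecycle configuration ARN from
# its Region, AWS account ID, and configuration name.
lcc_arn() {
  if [ "$#" -ne 3 ]; then
    echo "usage: lcc_arn <region> <account-id> <config-name>" >&2
    return 1
  fi
  printf 'arn:aws:sagemaker:%s:%s:studio-lifecycle-config/%s\n' "$1" "$2" "$3"
}

# Example with placeholder values (111122223333 is a documentation-style account ID):
lcc_arn eu-west-1 111122223333 install-spark-ui-on-jupyterserver
# prints arn:aws:sagemaker:eu-west-1:111122223333:studio-lifecycle-config/install-spark-ui-on-jupyterserver
```

The create-studio-lifecycle-config call already prints the ARN it created, so comparing that output with the string built here is a quick sanity check before updating the domain.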

Clean up

In this section, we show you how to clean up the Spark UI in a SageMaker Studio domain, either manually or automatically.

Manually uninstall the Spark UI

To manually uninstall the Spark UI in SageMaker Studio, complete the following steps:

Choose System terminal in the SageMaker Studio launcher.

Run the following commands in the system terminal:

cd amazon-sagemaker-spark-ui-0.1.0/install-scripts

chmod +x uninstall-history-server.sh
./uninstall-history-server.sh

Uninstall the Spark UI automatically for all SageMaker Studio user profiles

To automatically uninstall the Spark UI in SageMaker Studio for all user profiles, complete the following steps:

On the SageMaker console, choose Domains in the navigation pane, then choose the SageMaker Studio domain.

On the domain details page, navigate to the Environment tab.
Select the lifecycle configuration for the Spark UI on SageMaker Studio.
Choose Detach.

Delete and restart the Jupyter Server apps for the SageMaker Studio user profiles.
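If the domain has many user profiles, the last step can also be prepared from the AWS CLI. The sketch below only prints the delete-app commands so they can be reviewed before anything is deleted. The function name, the domain ID, and the profile names are hypothetical, and it assumes each profile's Jupyter Server app uses the app name default.

```shell
# Hypothetical helper: print (not run) the AWS CLI command that deletes the
# JupyterServer app named "default" for each given user profile in a domain.
print_delete_jupyterserver_apps() {
  if [ "$#" -lt 2 ]; then
    echo "usage: print_delete_jupyterserver_apps <domain-id> <profile>..." >&2
    return 1
  fi
  domain_id="$1"
  shift
  for profile in "$@"; do
    echo "aws sagemaker delete-app --domain-id $domain_id --user-profile-name $profile --app-type JupyterServer --app-name default"
  done
}

# Example with a placeholder domain ID and profile names:
print_delete_jupyterserver_apps d-xxxxxxxxxxxx alice bob
```

Printing rather than executing keeps the sketch safe to run; once the output looks right, the commands can be pasted into a terminal with suitable permissions.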

Conclusion

In this post, we shared a solution you can use to quickly install the Spark UI on SageMaker Studio. With the Spark UI hosted on SageMaker, machine learning (ML) and data engineering teams can use scalable cloud compute to access and analyze Spark logs from anywhere and speed up their project delivery. IT admins can standardize and expedite the provisioning of the solution in the cloud and avoid proliferation of custom development environments for ML projects.

All the code shown as part of this post is available in the GitHub repository.

About the Authors

Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering experience and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing soccer.

Bruno Pistone is an AI/ML Specialist Solutions Architect for AWS based in Milan. He works with customers of any size, helping them understand their technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His field of expertise includes end-to-end machine learning, machine learning industrialization, and generative AI. He enjoys spending time with his friends and exploring new places, as well as traveling to new destinations.