
Build a centralized monitoring and reporting solution for Amazon SageMaker using Amazon CloudWatch


Amazon SageMaker is a fully managed machine learning (ML) platform that provides a comprehensive set of services to serve end-to-end ML workloads. As recommended by AWS as a best practice, customers have used separate accounts to simplify policy management for users and isolate resources by workload and account. However, as more users and teams adopt the ML platform in the cloud, monitoring large ML workloads in a scaling multi-account environment becomes more challenging. For better observability, customers are looking for solutions to monitor cross-account resource usage and track activities, such as job launch and running status, which is essential for their ML governance and management requirements.

SageMaker services, such as Processing, Training, and Hosting, collect metrics and logs from the running instances and push them to users' Amazon CloudWatch accounts. To view the details of these jobs in different accounts, you need to log in to each account, find the corresponding jobs, and look into the status. There is no single pane of glass that can easily show this cross-account and multi-job information. Furthermore, the cloud admin team needs to provide individuals access to different SageMaker workload accounts, which adds additional management overhead for the cloud platform team.

In this post, we present a cross-account observability dashboard that provides a centralized view for monitoring SageMaker user activities and resources across multiple accounts. It allows end-users and the cloud management team to efficiently monitor which ML workloads are running, view the status of these workloads, and trace back different account activities at certain points in time. With this dashboard, you don't need to navigate the SageMaker console and click into each job to find the details of the job logs. Instead, you can easily view the running jobs and job status, troubleshoot job issues, and set up alerts when issues are identified in shared accounts, such as job failures, underutilized resources, and more. You can also control access to this centralized monitoring dashboard or share the dashboard with relevant stakeholders for auditing and management requirements.

Overview of solution

This solution is designed to enable centralized monitoring of SageMaker jobs and activities across a multi-account environment. It has no dependency on AWS Organizations, but can be adopted easily in an Organizations or AWS Control Tower environment. This solution can help the operations team get a high-level view of all SageMaker workloads spread across multiple workload accounts from a single pane of glass. It also has an option to enable CloudWatch cross-account observability across SageMaker workload accounts to provide access to monitoring telemetry such as metrics, logs, and traces from the centralized monitoring account. An example dashboard is shown in the following screenshot.

The following diagram shows the architecture of this centralized dashboard solution.

SageMaker has native integration with Amazon EventBridge, which monitors status change events in SageMaker. EventBridge enables you to automate SageMaker and respond automatically to events such as a training job status change or endpoint status change. Events from SageMaker are delivered to EventBridge in near-real time. For more information about SageMaker events monitored by EventBridge, refer to Automating Amazon SageMaker with Amazon EventBridge. In addition to the SageMaker native events, AWS CloudTrail publishes events when you make API calls, which also stream to EventBridge so that they can be used by many downstream automation or monitoring use cases. In our solution, we use EventBridge rules in the workload accounts to stream SageMaker service events and API events to the monitoring account's event bus for centralized monitoring.
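As a rough sketch of what such a forwarding rule can match on, the following snippet builds an EventBridge event pattern for SageMaker status-change events. The detail-type strings are the documented SageMaker event types; the helper function itself is illustrative and not part of the repo.

```python
import json


def build_sagemaker_event_pattern():
    """Build an EventBridge event pattern matching SageMaker job
    status-change events from the aws.sagemaker source.
    Illustrative only; the repo's actual rule may match more event types."""
    return {
        "source": ["aws.sagemaker"],
        "detail-type": [
            "SageMaker Training Job State Change",
            "SageMaker Processing Job State Change",
            "SageMaker Endpoint State Change",
        ],
    }


if __name__ == "__main__":
    # This JSON is what you would supply as the EventPattern of a rule whose
    # target is the monitoring account's event bus.
    print(json.dumps(build_sagemaker_event_pattern(), indent=2))
```

A rule with this pattern in each workload account, targeting the central event bus, is what moves the events across account boundaries.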

In the centralized monitoring account, the events are captured by an EventBridge rule and further processed into different targets:

  • A CloudWatch log group, to use for the following:
    • Auditing and archival purposes. For more information, refer to the Amazon CloudWatch Logs User Guide.
    • Analyzing log data with CloudWatch Logs Insights queries. CloudWatch Logs Insights enables you to interactively search and analyze your log data in CloudWatch Logs. You can run queries to help you respond to operational issues more efficiently and effectively. If an issue occurs, you can use CloudWatch Logs Insights to identify potential causes and validate deployed fixes.
    • Support for the CloudWatch Metrics Insights query widget for high-level operations in the CloudWatch dashboard, adding CloudWatch Logs Insights queries to dashboards, and exporting query results.
  • An AWS Lambda function to complete the following tasks:
    • Perform custom logic to augment SageMaker service events. One example is performing a metric query on the SageMaker job host's utilization metrics when a job completion event is received.
    • Convert event information into metrics in certain log formats, ingested as EMF logs. For more information, refer to Embedding metrics within logs.
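To make the EMF idea concrete, here is a minimal, hypothetical sketch (plain standard library, with illustrative namespace and metric names) of the Embedded Metric Format record such a function emits. Printing this JSON line to stdout from a Lambda is enough for CloudWatch Logs to extract the metric.

```python
import json
import time


def emf_record(account: str, job_type: str, job_count: int = 1) -> dict:
    """Build a minimal CloudWatch Embedded Metric Format (EMF) record that
    publishes a JobCount metric with account and jobType dimensions.
    Namespace and metric names are illustrative, not the repo's exact values."""
    return {
        "_aws": {
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": "SageMakerMonitoring",
                    "Dimensions": [["account", "jobType"]],
                    "Metrics": [{"Name": "JobCount", "Unit": "Count"}],
                }
            ],
        },
        # Dimension values and the metric value live at the top level.
        "account": account,
        "jobType": job_type,
        "JobCount": job_count,
    }


if __name__ == "__main__":
    # A Lambda would simply print this line; CloudWatch extracts the metric.
    print(json.dumps(emf_record("111111111111", "PROCESSING_JOB")))
```

The repo's Lambda uses the aws-embedded-metrics library (the `@metric_scope` decorator shown later) rather than hand-built JSON, but the emitted format is the same.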

The example in this post is supported by the native CloudWatch cross-account observability feature to achieve cross-account access to metrics, logs, and traces. As shown at the bottom of the architecture diagram, it integrates with this feature to enable cross-account metrics and logs. To enable this, the necessary permissions and resources need to be created in both the monitoring account and the source workload accounts.

You can use this solution for either AWS accounts managed by Organizations or standalone accounts. The following sections explain the steps for each scenario. Note that within each scenario, steps are performed in different AWS accounts. For your convenience, the account type in which to perform the step is highlighted at the beginning of each step.

Prerequisites

Before starting this procedure, clone our source code from the GitHub repo in your local environment or AWS Cloud9. Additionally, you need the following:

Deploy the solution in an Organizations environment

If the monitoring account and all SageMaker workload accounts are in the same organization, the required infrastructure in the source workload accounts is created automatically via an AWS CloudFormation StackSet from the organization's management account. Therefore, no manual infrastructure deployment into the source workload accounts is required. When a new account is created or an existing account is moved into a target organizational unit (OU), the source workload infrastructure stack will be automatically deployed and included in the scope of centralized monitoring.

Set up monitoring account resources

We need to collect the following AWS account information to set up the monitoring account resources, which we use as the inputs for the setup script later on.

Input | Description | Example
Home Region | The Region where the workloads run. | ap-southeast-2
Monitoring account AWS CLI profile name | You can find the profile name in ~/.aws/config. This is optional; if not provided, the default AWS credentials from the chain are used. | .
SageMaker workload OU path | The OU path that has the SageMaker workload accounts. Keep the / at the end of the path. | o-1a2b3c4d5e/r-saaa/ou-saaa-1a2b3c4d/

To retrieve the OU path, you can go to the Organizations console, and under AWS accounts, find the information to assemble the OU path. For the following example, the corresponding OU path is o-ye3wn3kyh6/r-taql/ou-taql-wu7296by/.
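If you prefer to assemble the path programmatically from IDs you have already noted down, the following hypothetical helper (the function name is ours, not from the repo) shows the expected shape: organization ID, then root ID, then each nested OU ID, with a trailing slash.

```python
def build_ou_path(org_id: str, root_id: str, ou_ids: list) -> str:
    """Join organization ID, root ID, and nested OU IDs into the OU path
    string the setup script expects, keeping the trailing slash."""
    return "/".join([org_id, root_id, *ou_ids]) + "/"


if __name__ == "__main__":
    # Reproduces the example path shown above.
    print(build_ou_path("o-ye3wn3kyh6", "r-taql", ["ou-taql-wu7296by"]))
```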

After you retrieve this information, run the following command to deploy the required resources to the monitoring account:

./scripts/organization-deployment/deploy-monitoring-account.sh

You will get the following outputs from the deployment. Keep a note of the outputs to use in the next step when deploying the management account stack.

Set up management account resources

We need to collect the following AWS account information to set up the management account resources, which we use as the inputs for the setup script later on.

Input | Description | Example
Home Region | The Region where the workloads run. This needs to be the same as the monitoring stack. | ap-southeast-2
Management account AWS CLI profile name | You can find the profile name in ~/.aws/config. This is optional; if not provided, the default AWS credentials from the chain are used. | .
SageMaker workload OU ID | Here we use just the OU ID, not the path. | ou-saaa-1a2b3c4d
Monitoring account ID | The account ID where the monitoring stack is deployed. | .
Monitoring account role name | The output MonitoringAccountRoleName from the previous step. | .
Monitoring account event bus ARN | The output MonitoringAccountEventbusARN from the previous step. | .
Monitoring account sink identifier | The output MonitoringAccountSinkIdentifier from the previous step. | .

You can deploy the management account resources by running the following command:

./scripts/organization-deployment/deploy-management-account.sh

Deploy the solution in a non-Organizations environment

If your environment doesn't use Organizations, the monitoring account infrastructure stack is deployed in a similar way, but with a few changes. However, the workload infrastructure stack needs to be deployed manually into each workload account. Therefore, this method is suitable for an environment with a limited number of accounts. For a large environment, it's recommended to consider using Organizations.

Set up monitoring account resources

We need to collect the following AWS account information to set up the monitoring account resources, which we use as the inputs for the setup script later on.

Input | Description | Example
Home Region | The Region where the workloads run. | ap-southeast-2
SageMaker workload account list | A list of accounts that run SageMaker workloads and stream events to the monitoring account, separated by commas. | 111111111111,222222222222
Monitoring account AWS CLI profile name | You can find the profile name in ~/.aws/config. This is optional; if not provided, the default AWS credentials from the chain are used. | .

We can deploy the monitoring account resources by running the following command after you collect the required information:

./scripts/individual-deployment/deploy-monitoring-account.sh

We get the following outputs when the deployment is complete. Keep a note of the outputs to use in the next step when deploying the workload account stack.

Set up workload account monitoring infrastructure

We need to collect the following AWS account information to set up the workload account monitoring infrastructure, which we use as the inputs for the setup script later on.

Input | Description | Example
Home Region | The Region where the workloads run. This needs to be the same as the monitoring stack. | ap-southeast-2
Monitoring account ID | The account ID where the monitoring stack is deployed. | .
Monitoring account role name | The output MonitoringAccountRoleName from the previous step. | .
Monitoring account event bus ARN | The output MonitoringAccountEventbusARN from the previous step. | .
Monitoring account sink identifier | The output MonitoringAccountSinkIdentifier from the previous step. | .
Workload account AWS CLI profile name | You can find the profile name in ~/.aws/config. This is optional; if not provided, the default AWS credentials from the chain are used. | .

We can deploy the workload account monitoring infrastructure by running the following command:

./scripts/individual-deployment/deploy-workload-account.sh

Visualize ML tasks on the CloudWatch dashboard

To check if the solution works, we need to run multiple SageMaker processing jobs and SageMaker training jobs on the workload accounts that we used in the previous sections. The CloudWatch dashboard is customizable based on your own scenarios. Our sample dashboard consists of widgets for visualizing SageMaker Processing jobs and SageMaker Training jobs. All jobs for the monitored workload accounts are displayed on this dashboard. For each type of job, we show three widgets: the total number of jobs, the number of failed jobs, and the details of each job. In our example, we have two workload accounts. Through this dashboard, we can easily see that one workload account has both processing jobs and training jobs, and the other workload account only has training jobs. As with other capabilities in CloudWatch, we can set the refresh interval, specify the graph type, and zoom in or out, or we can run actions such as downloading logs as a CSV file.

Customize your dashboard

The solution provided in the GitHub repo includes both SageMaker Training job and SageMaker Processing job monitoring. If you want to add more dashboards to monitor other SageMaker jobs, such as batch transform jobs, you can follow the instructions in this section to customize your dashboard. By modifying the index.py file, you can customize the fields that you want to display on the dashboard. You can access all details that are captured by CloudWatch through EventBridge. In the Lambda function, you can choose the necessary fields that you want to display on the dashboard. See the following code:

@metric_scope
def lambda_handler(event, context, metrics):

    try:
        event_type = None
        try:
            event_type = SAGEMAKER_STAGE_CHANGE_EVENT(event["detail-type"])
        except ValueError as e:
            print("Unexpected event received")

        if event_type:
            account = event["account"]
            detail = event["detail"]

            job_detail = {
                "DashboardQuery": "True"
            }
            job_detail["Account"] = account
            job_detail["JobType"] = event_type.name

            metrics.set_dimensions({"account": account, "jobType": event_type.name}, use_default=False)
            metrics.set_property("JobType", event_type.value)

            if event_type == SAGEMAKER_STAGE_CHANGE_EVENT.PROCESSING_JOB:
                job_status = detail.get("ProcessingJobStatus")

                metrics.set_property("JobName", detail.get("ProcessingJobName"))
                metrics.set_property("ProcessingJobArn", detail.get("ProcessingJobArn"))

                job_detail["JobName"] = detail.get("ProcessingJobName")
                job_detail["ProcessingJobArn"] = detail.get("ProcessingJobArn")
                job_detail["Status"] = job_status
                job_detail["StartTime"] = detail.get("ProcessingStartTime")
                job_detail["InstanceType"] = detail.get("ProcessingResources").get("ClusterConfig").get("InstanceType")
                job_detail["InstanceCount"] = detail.get("ProcessingResources").get("ClusterConfig").get("InstanceCount")
                if detail.get("FailureReason"):
To customize the dashboard or widgets, you can modify the source code in the monitoring-account-infra-stack.ts file. Note that the field names you use in this file need to be the same as those (the keys of job_detail) defined in the Lambda file:

 // CloudWatch Dashboard
    const sagemakerMonitoringDashboard = new cloudwatch.Dashboard(
      this, 'sagemakerMonitoringDashboard',
      {
        dashboardName: Parameters.DASHBOARD_NAME,
        widgets: []
      }
    )

    // Processing Job
    const processingJobCountWidget = new cloudwatch.GraphWidget({
      title: "Total Processing Job Count",
      stacked: false,
      width: 12,
      height: 6,
      left:[
        new cloudwatch.MathExpression({
          expression: `SEARCH('{${AWS_EMF_NAMESPACE},account,jobType} jobType="PROCESSING_JOB" MetricName="ProcessingJobCount_Total"', 'Sum', 300)`,
          searchRegion: this.region,
          label: "${PROP('Dim.account')}",
        })
      ]
    });
    processingJobCountWidget.position(0,0)
    const processingJobFailedWidget = new cloudwatch.GraphWidget({
      title: "Failed Processing Job Count",
      stacked: false,
      width: 12,
      height: 6,
      right:[
        new cloudwatch.MathExpression({
          expression: `SEARCH('{${AWS_EMF_NAMESPACE},account,jobType} jobType="PROCESSING_JOB" MetricName="ProcessingJobCount_Failed"', 'Sum', 300)`,
          searchRegion: this.region,
          label: "${PROP('Dim.account')}",
        })
      ]
    })
    processingJobFailedWidget.position(12,0)

    const processingJobInsightsQueryWidget = new cloudwatch.LogQueryWidget(
      {
        title: 'SageMaker Processing Job History',
        logGroupNames: [ingesterLambda.logGroup.logGroupName],
        view: cloudwatch.LogQueryVisualizationType.TABLE,
        queryLines: [
          'sort @timestamp desc',
          'filter DashboardQuery == "True"',
          'filter JobType == "PROCESSING_JOB"',
          'fields Account, JobName, Status, Duration, InstanceCount, InstanceType, Host, fromMillis(StartTime) as StartTime, FailureReason',
          'fields Metrics.CPUUtilization as CPUUtil, Metrics.DiskUtilization as DiskUtil, Metrics.MemoryUtilization as MemoryUtil',
          'fields Metrics.GPUMemoryUtilization as GPUMemoryUtil, Metrics.GPUUtilization as GPUUtil',
        ],
        width: 24,
        height: 6,
      }
    );
    processingJobInsightsQueryWidget.position(0, 6)
    sagemakerMonitoringDashboard.addWidgets(processingJobCountWidget);
    sagemakerMonitoringDashboard.addWidgets(processingJobFailedWidget);
    sagemakerMonitoringDashboard.addWidgets(processingJobInsightsQueryWidget);

After you modify the dashboard, you need to redeploy this solution from scratch. You can run the Jupyter notebook provided in the GitHub repo to rerun the SageMaker pipeline, which will launch the SageMaker Processing jobs again. When the jobs are finished, you can go to the CloudWatch console, and under Dashboards in the navigation pane, choose Custom Dashboards. You can find the dashboard named SageMaker-Monitoring-Dashboard.

Clean up

If you no longer need this custom dashboard, you can clean up the resources. To delete all the resources created, use the code in this section. The cleanup is slightly different for an Organizations environment vs. a non-Organizations environment.

For an Organizations surroundings, use the next code:

make destroy-management-stackset # Execute against the management account
make destroy-monitoring-account-infra # Execute against the monitoring account

For a non-Organizations surroundings, use the next code:

make destroy-workload-account-infra # Execute against each workload account
make destroy-monitoring-account-infra # Execute against the monitoring account

Alternatively, you can log in to the monitoring account, workload accounts, and management account to delete the stacks from the CloudFormation console.

Conclusion

In this post, we discussed the implementation of a centralized monitoring and reporting solution for SageMaker using CloudWatch. By following the step-by-step instructions outlined in this post, you can create a multi-account monitoring dashboard that displays key metrics and consolidates logs related to your various SageMaker jobs from different accounts in real time. With this centralized monitoring dashboard, you can have better visibility into the activities of SageMaker jobs across multiple accounts, troubleshoot issues more quickly, and make informed decisions based on real-time data. Overall, the implementation of a centralized monitoring and reporting solution using CloudWatch offers an efficient way for organizations to manage their cloud-based ML infrastructure and resource usage.

Please try out the solution and send us your feedback, either in the AWS forum for Amazon SageMaker or through your usual AWS contacts.

To learn more about the cross-account observability feature, refer to the blog post Amazon CloudWatch Cross-Account Observability.


About the Authors

Jie Dong is an AWS Cloud Architect based in Sydney, Australia. Jie is passionate about automation, and loves to develop solutions that help customers improve productivity. Event-driven systems and serverless frameworks are his expertise. In his own time, Jie likes to work on building his smart home and to explore new smart home gadgets.

Melanie Li, PhD, is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers build solutions using state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing ML solutions with best practices. In her spare time, she loves to explore nature and spend time with family and friends.

Gordon Wang is a Senior AI/ML Specialist TAM at AWS. He helps strategic customers with AI/ML best practices across many industries. He is passionate about computer vision, NLP, generative AI, and MLOps. In his spare time, he loves running and hiking.

