Build protein folding workflows to accelerate drug discovery on Amazon SageMaker

Drug development is a complex and long process that involves screening thousands of drug candidates and using computational or experimental methods to evaluate leads. According to McKinsey, a single drug can take 10 years and cost an average of $2.6 billion to go through disease target identification, drug screening, drug-target validation, and eventual commercial launch. Drug discovery is the research component of this pipeline that generates candidate drugs with the highest likelihood of being effective with the least harm to patients. Machine learning (ML) methods can help identify suitable compounds at each stage in the drug discovery process, resulting in more streamlined drug prioritization and testing, saving billions in drug development costs (for more information, refer to AI in biopharma research: A time to focus and scale).

Drug targets are typically biological entities called proteins, the building blocks of life. The 3D structure of a protein determines how it interacts with a drug compound; therefore, understanding the protein 3D structure can add significant improvements to the drug development process by screening for drug compounds that fit the target protein structure better. Another area where protein structure prediction can be useful is understanding the diversity of proteins, so that we only select for drugs that selectively target specific proteins without affecting other proteins in the body (for more information, refer to Improving target assessment in biomedical research: the GOT-IT recommendations). Precise 3D structures of target proteins can enable drug design with higher specificity and lower likelihood of cross-interactions with other proteins.

However, predicting how proteins fold into their 3D structure is a difficult problem, and traditional experimental methods such as X-ray crystallography and NMR spectroscopy can be time-consuming and expensive. Recent advances in deep learning methods for protein research have shown promise in using neural networks to predict protein folding with remarkable accuracy. Folding algorithms like AlphaFold2, ESMFold, OpenFold, and RoseTTAFold can be used to quickly build accurate models of protein structures. Unfortunately, these models are computationally expensive to run and the results can be cumbersome to compare at the scale of thousands of candidate protein structures. A scalable solution for using these various tools will allow researchers and commercial R&D teams to quickly incorporate the latest advances in protein structure prediction, manage their experimentation processes, and collaborate with research partners.

Amazon SageMaker is a fully managed service to prepare, build, train, and deploy high-quality ML models quickly by bringing together a broad set of capabilities purpose-built for ML. It offers a fully managed environment for ML, abstracting away the infrastructure, data management, and scalability requirements so you can focus on building, training, and testing your models.

In this post, we present a fully managed ML solution with SageMaker that simplifies the operation of protein folding structure prediction workflows. We first discuss the solution at a high level and its user experience. Next, we walk you through how to set up compute-optimized workflows of AlphaFold2 and OpenFold with SageMaker. Finally, we demonstrate how you can track and compare protein structure predictions as part of a typical analysis. The code for this solution is available in the following GitHub repository.

Solution overview

In this solution, scientists can interactively launch protein folding experiments, analyze the 3D structure, monitor the job progress, and track the experiments in Amazon SageMaker Studio.

The following screenshot shows a single run of a protein folding workflow with Amazon SageMaker Studio. It includes the visualization of the 3D structure in a notebook, run status of the SageMaker jobs in the workflow, and links to the input parameters and output data and logs.

The following diagram illustrates the high-level solution architecture.

To understand the architecture, we first define the key components of a protein folding experiment as follows:

  • FASTA target sequence file – The FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes.
  • Genetic databases – A genetic database is one or more sets of genetic data stored together with software to enable users to retrieve genetic data. Several genetic databases are required to run AlphaFold and OpenFold algorithms, such as BFD, MGnify, PDB70, PDB, PDB seqres, UniRef30 (FKA UniClust30), UniProt, and UniRef90.
  • Multiple sequence alignment (MSA) – A sequence alignment is a way of arranging the primary sequences of a protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. The input features for predictions include MSA data.
  • Protein structure prediction – The structure of input target sequences is predicted with folding algorithms like AlphaFold2 and OpenFold that use a multitrack transformer architecture trained on known protein templates.
  • Visualization and metrics – Visualize the 3D structure with the py3Dmol library as an interactive 3D visualization. You can use metrics to evaluate and compare structure predictions, most notably root-mean-square deviation (RMSD) and template modeling score (TM-score).
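As a small illustration of the FASTA format described in the list above, the following sketch parses FASTA text into a mapping of header to sequence using only the Python standard library. The function name parse_fasta and the example record are hypothetical and are not taken from the solution's repository; real pipelines often rely on a dedicated library instead.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA-formatted text into {header: sequence}."""
    records: Dict[str, str] = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header line starts a new record; drop the leading ">"
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped; join them into one string
            records[header] += line.upper()
    return records


# Hypothetical record wrapped over two lines, for illustration only.
example = """>example_target hypothetical protein
MKTAYIAKQR
QISFVKSHFS
"""

records = parse_fasta(example)
for name, sequence in records.items():
    print(name, len(sequence))
```

Each single-letter code is one amino acid, so the length of the joined string is the length of the primary sequence.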

The workflow contains the following steps:

  1. Scientists use the web-based SageMaker ML IDE to explore the code base, build protein sequence analysis workflows in SageMaker Studio notebooks, and run protein folding pipelines via the graphical user interface in SageMaker Studio or the SageMaker SDK.
  2. Genetic and structure databases required by AlphaFold and OpenFold are downloaded prior to pipeline setup using Amazon SageMaker Processing, an ephemeral compute feature for ML data processing, to an Amazon Simple Storage Service (Amazon S3) bucket. With SageMaker Processing, you can run a long-running job with the proper compute without setting up any compute cluster and storage and without needing to shut down the cluster. Data is automatically saved to a specified S3 bucket location.
  3. An Amazon FSx for Lustre file system is set up, with the data repository being the S3 bucket location where the databases are saved. FSx for Lustre can scale to hundreds of GB/s of throughput and millions of IOPS with low-latency file retrieval. When starting an estimator job, SageMaker mounts the FSx for Lustre file system to the instance file system, then starts the script.
  4. Amazon SageMaker Pipelines is used to orchestrate multiple runs of protein folding algorithms. SageMaker Pipelines offers a visual interface for interactive job submission, traceability of the progress, and repeatability.
  5. Within a pipeline, two computationally heavy protein folding algorithms, AlphaFold and OpenFold, are run with SageMaker estimators. This configuration supports mounting of an FSx for Lustre file system for high-throughput database search in the algorithms. A single inference run is divided into two steps: an MSA construction step using an optimal CPU instance and a structure prediction step using a GPU instance. These substeps, like SageMaker Processing in Step 2, are ephemeral, on-demand, and fully managed. Job output such as MSA files, predicted PDB structure files, and other metadata files are saved in a specified S3 location. A pipeline can be designed to run one single protein folding algorithm or run both AlphaFold and OpenFold after a common MSA construction.
  6. Runs of the protein folding prediction are automatically tracked by Amazon SageMaker Experiments for further analysis and comparison. The job logs are kept in Amazon CloudWatch for monitoring.


To follow this post and run this solution, you need to have completed several prerequisites. Refer to the GitHub repository for a detailed explanation of each step.

Run protein folding on SageMaker

We use the fully managed capabilities of SageMaker to run computationally heavy protein folding jobs without much infrastructure overhead. SageMaker uses container images to run custom scripts for generic data processing, training, and hosting. You can easily start an ephemeral job on demand that runs a program with a container image with a couple of lines of the SageMaker SDK without self-managing any compute infrastructure. Specifically, the SageMaker estimator job provides flexibility when it comes to choice of container image, run script, and instance configuration, and supports a wide variety of storage options, including file systems such as FSx for Lustre. The following diagram illustrates this architecture.

Folding algorithms like AlphaFold and OpenFold use a multitrack transformer architecture trained on known protein templates to predict the structure of unknown peptide sequences. These predictions can be run on GPU instances to provide best throughput and lowest latency. The input features for these predictions, however, include MSA data. MSA algorithms are CPU-dependent and can require several hours of processing time.

Running both the MSA and structure prediction steps in the same computing environment can be cost-inefficient because the expensive GPU resources remain idle while the MSA step runs. Therefore, we optimize the workflow into two steps. First, we run a SageMaker estimator job on a CPU instance specifically to compute MSA alignment given a particular FASTA input sequence and source genetic databases. Then we run a SageMaker estimator job on a GPU instance to predict the protein structure with a given input MSA alignment and a folding algorithm like AlphaFold or OpenFold.

Run MSA generation

For MSA computation, we include a custom script and a script adapted from the existing AlphaFold prediction source. Note that this script may need to be updated if the source AlphaFold code is updated. The custom script is provided to the SageMaker estimator via script mode. The key components of the container image, script mode implementation, and setting up a SageMaker estimator job are also part of the next step of running folding algorithms, and are described further in the following section.

Run AlphaFold

We get started by running an AlphaFold structure prediction with a single protein sequence using SageMaker. Running an AlphaFold job involves three simple steps, as can be seen in 01-run_stepbystep.ipynb. First, we build a Docker container image based on AlphaFold’s Dockerfile so that we can also run AlphaFold in SageMaker. Second, we construct the script that instructs how AlphaFold should be run. Third, we construct and run a SageMaker estimator with the script, the container, instance type, data, and configuration for the job.

Container image

The runtime requirement for a container image to run AlphaFold (OpenFold as well) in SageMaker can be greatly simplified with AlphaFold’s Dockerfile. We only need to add a handful of simple layers on top to install a SageMaker-specific Python library so that a SageMaker job can communicate with the container image. See the following code:

# In Dockerfile.alphafold
## SageMaker specific
RUN pip3 install sagemaker-training --upgrade --no-cache-dir
ENV PATH="/opt/ml/code:${PATH}"
# this environment variable is used by the SageMaker Estimator to determine our user code directory

Input script

We then provide the entry script, which runs the prediction script from the AlphaFold repository that is currently placed in the container under /app/alphafold/. When this script is run, the location of the genetic databases and the input FASTA sequence will be populated by SageMaker as environment variables (SM_CHANNEL_GENETIC and SM_CHANNEL_FASTA, respectively). For more information, refer to Input Data Configuration.
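To make the channel environment variables concrete, here is a minimal sketch of how an entry script could resolve the local directory of each input channel. The helper name channel_dir and the simulated environment are assumptions for illustration; the fallback path follows the standard SageMaker convention of placing channel data under /opt/ml/input/data/<channel name>.

```python
import os
from typing import Mapping


def channel_dir(channel: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the local directory for a SageMaker input channel."""
    # SageMaker exposes each channel as SM_CHANNEL_<NAME IN UPPER CASE>
    default = f"/opt/ml/input/data/{channel}"
    return env.get(f"SM_CHANNEL_{channel.upper()}", default)


# Simulated environment for illustration; in a real job SageMaker sets these.
fake_env = {
    "SM_CHANNEL_GENETIC": "/opt/ml/input/data/genetic",
    "SM_CHANNEL_FASTA": "/opt/ml/input/data/fasta",
}
genetic_dir = channel_dir("genetic", fake_env)
fasta_dir = channel_dir("fasta", fake_env)
print(genetic_dir, fasta_dir)
```

Passing the environment as an argument keeps the helper easy to exercise outside a SageMaker job, where these variables are not set.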

Estimator job

We next create a job using a SageMaker estimator with the following key input arguments, which instruct SageMaker to run a specific script using a specified container with the instance type or count, your networking option of choice, and other parameters for the job. vpc_subnet_ids and security_group_ids instruct the job to run inside a specific VPC where the FSx for Lustre file system is located so that we can mount and access the file system in the SageMaker job. The output path refers to an S3 bucket location where the final product of AlphaFold will be uploaded automatically by SageMaker at the end of a successful job. Here we also set a parameter, DB_PRESET, as an example of a value that is passed in and accessed as an environment variable at runtime. See the following code:

from sagemaker.estimator import Estimator
vpc_subnet_ids=['subnet-xxxxxxxxx'] # okay to use a default VPC
env={'DB_PRESET': db_preset} # <full_dbs|reduced_dbs>
output_path="s3://%s/%s/job-output/"%(default_bucket, prefix)

estimator_alphafold = Estimator(
    source_dir="src", # directory where the entry script and other runtime files are located
    entry_point="", # our script that runs /app/alphafold/
    image_uri=alphafold_image_uri, # container image to use
    instance_count=instance_count, #

Finally, we gather the data and let the job know where it is. The fasta data channel is defined as an S3 data input that will be downloaded from an S3 location into the compute instance at the beginning of the job. This allows great flexibility to manage and specify the input sequence. The genetic data channel, on the other hand, is defined as a FileSystemInput that will be mounted onto the instance at the beginning of the job. Using an FSx for Lustre file system to bring in close to 3 TB of data avoids repeatedly downloading data from an S3 bucket to a compute instance. We call the .fit method to kick off an AlphaFold job:

from sagemaker.inputs import FileSystemInput
file_system_directory_path=f'/{fsx_mount_id}/{prefix}/alphafold-genetic-db' # should be the full prefix from the S3 data repository

file_system_access_mode="ro" # Specify the access mode (read-only)
file_system_type="FSxLustre" # Specify your file system type

genetic_db = FileSystemInput(

s3_fasta=sess.upload_data(path="sequence_input/T1030.fasta", # FASTA location locally
                          key_prefix='alphafoldv2/sequence_input') # S3 prefix. Bucket is sagemaker default bucket
fasta = sagemaker.inputs.TrainingInput(s3_fasta,
data_channels_alphafold = {'genetic': genetic_db, 'fasta': fasta}

wait=False) # wait=False gets the cell back in the notebook; set to True to see the logs as the job progresses

That’s it. We just submitted a job to SageMaker to run AlphaFold. The logs and output, including .pdb prediction files, will be written to Amazon S3.

Run OpenFold

Running OpenFold in SageMaker follows a similar pattern, as shown in the second half of 01-run_stepbystep.ipynb. First, we add a simple layer on top of OpenFold’s Dockerfile to install the SageMaker-specific library, which makes the container image SageMaker compatible. Second, we construct a script as an entry point for the SageMaker job. In that script, we run the inference script from OpenFold, which is available in the container image, with the same genetic databases we downloaded for AlphaFold and OpenFold’s model weights (--openfold_checkpoint_path). In terms of input data locations, besides the genetic databases channel and the FASTA channel, we introduce a third channel, SM_CHANNEL_PARAM, so that we can flexibly pass in the model weights of choice from the estimator construct when we define and submit a job. With the SageMaker estimator, we can easily submit jobs with different entry_point, image_uri, environment, inputs, and other configurations for OpenFold with the same signature. For the data channel, we add a new channel, param, as an Amazon S3 input, alongside the same genetic databases from the FSx for Lustre file system and the FASTA file from Amazon S3. This, again, lets us easily specify the model weights to use from the job construct. See the following code:

param = sagemaker.inputs.TrainingInput(s3_param,

data_channels_openfold = {"genetic": genetic_db, 'fasta': fasta, 'param': param}


To access the final output after the job completes, we run the following commands:

!aws s3 cp {estimator_openfold.model_data} openfold_output/model.tar.gz
!tar zxfv openfold_output/model.tar.gz -C openfold_output/

Runtime performance

The following table shows that splitting the MSA alignment and the folding algorithm into two jobs brings the cost down to about 57% and 51% of the single-job cost for AlphaFold and OpenFold, respectively (a reduction of roughly 43% and 49% by the table's figures). The split allows us to right-size the compute for each job: ml.m5.4xlarge for MSA alignment and ml.g5.2xlarge for AlphaFold and OpenFold.

Job Details               | Instance Type | Input FASTA Sequence | Runtime    | Cost
MSA alignment + OpenFold  | ml.g5.4xlarge | T1030                | 50 minutes | $1.69
MSA alignment + AlphaFold | ml.g5.4xlarge | T1030                | 65 minutes | $2.19
MSA alignment             | ml.m5.4xlarge | T1030                | 46 minutes | $0.71
OpenFold                  | ml.g5.2xlarge | T1030                | 6 minutes  | $0.15
AlphaFold                 | ml.g5.2xlarge | T1030                | 21 minutes | $0.53
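The arithmetic behind the comparison can be checked directly from the table. The sketch below uses only the dollar figures listed above and computes, for each algorithm, the cost of the split workflow (MSA job plus folding job), that cost as a share of the single-job cost, and the resulting reduction. By these figures, the split workflow costs about 57% (AlphaFold) and 51% (OpenFold) of the single-job run, which corresponds to a reduction of roughly 43% and 49%. The variable names are illustrative.

```python
# Costs in USD for target T1030, copied from the table above.
combined = {"AlphaFold": 2.19, "OpenFold": 1.69}  # MSA + folding in one GPU job
msa_only = 0.71  # MSA alignment on a CPU instance
folding_only = {"AlphaFold": 0.53, "OpenFold": 0.15}  # folding on a GPU instance

summary = {}
for algorithm, single_job_cost in combined.items():
    split_cost = msa_only + folding_only[algorithm]
    ratio = split_cost / single_job_cost
    summary[algorithm] = {
        "split_cost": round(split_cost, 2),
        "percent_of_single_job": round(100 * ratio),
        "percent_reduction": round(100 * (1 - ratio)),
    }

for algorithm, row in summary.items():
    print(algorithm, row)
```

Note that the split workflow adds a couple of minutes of total runtime (46 + 21 minutes versus 65 minutes for AlphaFold), so the benefit here is cost rather than wall-clock time.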

Build a repeatable workflow using SageMaker Pipelines

With SageMaker Pipelines, we can create an ML workflow that takes care of managing data between steps, orchestrating their runs, and logging. SageMaker Pipelines also provides us a UI to visualize our pipeline and easily run our ML workflow.

A pipeline is created by combining a number of steps. In this pipeline, we combine three training steps, each of which requires a SageMaker estimator. The estimators defined in this notebook are very similar to those defined in 01-run_stepbystep.ipynb, with the exception that we use Amazon S3 locations to point to our inputs and outputs. The dynamic variables allow SageMaker Pipelines to run steps one after another and also enable the user to retry failed steps. The following screenshot shows a Directed Acyclic Graph (DAG), which provides information on the requirements for and relationships between each step of our pipeline.

Dynamic variables

SageMaker Pipelines is capable of taking user inputs at the start of every pipeline run. We define the following dynamic variables, which we would like to change during each experiment:

  • FastaInputS3URI – Amazon S3 URI of the FASTA file uploaded via SDK, Boto3, or manually.
  • FastaFileName – Name of the FASTA file.
  • db_preset – Selection between full_dbs or reduced_dbs.
  • MaxTemplateDate – AlphaFold’s MSA step will search for the available templates before the date specified by this parameter.
  • ModelPreset – Select between AlphaFold models including monomer, monomer_casp14, monomer_ptm, and multimer.
  • NumMultimerPredictionsPerModel – Number of seeds to run per model when using multimer system.
  • InferenceInstanceType – Instance type to use for inference steps (both AlphaFold and OpenFold). The default value is ml.g5.2xlarge.
  • MSAInstanceType – Instance type to use for MSA step. The default value is ml.m5.4xlarge.

See the next code:

from sagemaker.workflow.parameters import ParameterString

fasta_file = ParameterString(name="FastaFileName")
fasta_input = ParameterString(name="FastaInputS3URI")
pipeline_db_preset = ParameterString(name="db_preset",
                                     enum_values=['full_dbs', 'reduced_dbs'])
max_template_date = ParameterString(name="MaxTemplateDate")
model_preset = ParameterString(name="ModelPreset")
num_multimer_predictions_per_model = ParameterString(name="NumMultimerPredictionsPerModel")
msa_instance_type = ParameterString(name="MSAInstanceType", default_value="ml.m5.4xlarge")
instance_type = ParameterString(name="InferenceInstanceType", default_value="ml.g5.2xlarge")
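Because these parameters are plain strings, a typo in a value may only surface once a job is already running. As a hedged illustration, the following standard-library sketch checks a parameter dictionary before it is submitted. The helper validate_parameters is hypothetical and not part of the SageMaker SDK or the solution's repository; the allowed values come from the list of dynamic variables above, and the YYYY-MM-DD date format is an assumption based on the example value used in this post.

```python
from datetime import datetime
from typing import Dict

# Allowed values taken from the description of the dynamic variables.
ALLOWED = {
    "db_preset": {"full_dbs", "reduced_dbs"},
    "ModelPreset": {"monomer", "monomer_casp14", "monomer_ptm", "multimer"},
}


def validate_parameters(params: Dict[str, str]) -> Dict[str, str]:
    """Raise ValueError on a bad value; return params unchanged otherwise."""
    for name, allowed in ALLOWED.items():
        if name in params and params[name] not in allowed:
            raise ValueError(f"{name}={params[name]!r} is not one of {sorted(allowed)}")
    if "MaxTemplateDate" in params:
        # strptime raises ValueError if the date is not in YYYY-MM-DD form
        datetime.strptime(params["MaxTemplateDate"], "%Y-%m-%d")
    if "NumMultimerPredictionsPerModel" in params:
        if int(params["NumMultimerPredictionsPerModel"]) < 1:
            raise ValueError("NumMultimerPredictionsPerModel must be at least 1")
    return params


checked = validate_parameters({
    "db_preset": "full_dbs",
    "ModelPreset": "monomer",
    "MaxTemplateDate": "2020-05-14",
    "NumMultimerPredictionsPerModel": "5",
})
print(checked["db_preset"])
```

A check like this could run in the notebook just before the pipeline is started, so that an invalid preset fails immediately instead of after compute has been provisioned.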

A SageMaker pipeline is constructed by defining a series of steps and then chaining them together in a specific order where the output of a previous step becomes the input to the next step. Steps can be run in parallel and defined to have a dependency on a previous step. In this pipeline, we define an MSA step, which is the dependency for an AlphaFold inference step and OpenFold inference step that run in parallel. See the following code:

step_msa = TrainingStep(

step_alphafold = TrainingStep(

step_openfold = TrainingStep(
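The dependency structure of these three steps (one MSA step feeding two folding steps) can be sketched independently of SageMaker with the graphlib module from the Python standard library. This is only an illustration of how a DAG scheduler derives the run order; the step labels are illustrative, and SageMaker Pipelines resolves dependencies with its own machinery.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key depends on the steps listed in its value set.
dependencies = {
    "msa": set(),
    "alphafold": {"msa"},
    "openfold": {"msa"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

waves = []
while sorter.is_active():
    # Steps whose dependencies are all finished can run in parallel
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)

print(waves)
```

The first wave contains only the MSA step, and the second wave contains both folding steps, which matches the parallel layout described above.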

To put all the steps together, we call the Pipeline class and provide a pipeline name, pipeline input variables, and the individual steps:

pipeline_name = "ProteinFoldWorkflow"
pipeline = Pipeline(
    steps=[step_msa, step_alphafold, step_openfold],

pipeline.upsert(role_arn=role, # run this if it's the first time setting up the pipeline

Run the pipeline

In the last cell of the notebook 02-define_pipeline.ipynb, we show how to run a pipeline using the SageMaker SDK. The dynamic variables we described earlier are provided as follows:

!mkdir ./sequence_input/
!curl '' > ./sequence_input/T1030.fasta

pathName = f'./sequence_input/{fasta_file_name}'

'db_preset': 'full_dbs',
'FastaFileName': fasta_file_name,
'MaxTemplateDate': '2020-05-14',
'ModelPreset': 'monomer',
'NumMultimerPredictionsPerModel': '5',
execution = pipeline.start(execution_display_name="SDK-Executetd",
                           execution_description='This pipeline was executed via SageMaker SDK',

Track experiments and compare protein structures

For our experiment, we use an example protein sequence from the CASP14 competition, which provides an independent mechanism for the assessment of methods of protein structure modeling. The target T1030 is derived from the PDB 6POO protein, and has 237 amino acids in the primary sequence. We run the SageMaker pipeline to predict the protein structure of this input sequence with both OpenFold and AlphaFold algorithms.

When the pipeline is complete, we download the predicted .pdb files from each folding job and visualize the structure in the notebook using py3Dmol, as in the notebook 04-compare_alphafold_openfold.ipynb.

The following screenshot shows the prediction from the AlphaFold prediction job.

The predicted structure is compared against its known base reference structure with PDB code 6poo archived in RCSB. We analyze the prediction performance against the base PDB code 6poo with three metrics: RMSD, RMSD with superposition, and template modeling score, as described in Comparing structures.

Algorithm | Input Sequence | Comparison With | RMSD   | RMSD with Superposition | Template Modeling Score
AlphaFold | T1030          | 6poo            | 247.26 | 3.87                    | 0.3515
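The large gap between RMSD and RMSD with superposition is easier to interpret with a small example. The following sketch computes RMSD without superposition and the TM-score sum for one fixed alignment, using only the standard library. It is not the code from the solution's notebook: the function names and the synthetic coordinates are assumptions for illustration, and tm_score_fixed skips the search over superpositions that a full TM-score calculation performs. It uses the standard distance scale d0 = 1.24 * (L - 15)^(1/3) - 1.8 for a target of length L.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(a: Sequence[Point], b: Sequence[Point]) -> float:
    """RMSD between paired coordinates, without superposition."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(total / len(a))


def tm_score_fixed(a: Sequence[Point], b: Sequence[Point], target_length: int) -> float:
    """TM-score sum for one fixed superposition (no search over rotations)."""
    if target_length < 19:
        # Below 19 residues the d0 formula is not positive
        raise ValueError("this sketch assumes a target of at least 19 residues")
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    terms = (1.0 / (1.0 + (math.dist(p, q) / d0) ** 2) for p, q in zip(a, b))
    return sum(terms) / target_length


# Synthetic coordinates for illustration only: 100 points on a line,
# and a copy translated by the vector (3, 4, 0), that is, 5 units away.
reference = [(float(i), 0.0, 0.0) for i in range(100)]
shifted = [(x + 3.0, y + 4.0, z) for x, y, z in reference]

print(rmsd(reference, reference))
print(rmsd(reference, shifted))
print(tm_score_fixed(reference, reference, target_length=100))
print(tm_score_fixed(reference, shifted, target_length=100))
```

The shifted copy has exactly the same shape as the reference, yet its RMSD is about 5 because the two sets of coordinates sit in different places. Superposition removes this kind of rigid-body offset before measuring deviation, which is why RMSD with superposition is typically far lower than the raw value.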

The folding algorithms are now compared against each other for multiple FASTA sequences: T1030, T1090, and T1076. New target sequences may not have the base PDB structure in reference databases, and therefore it’s useful to compare the variability between folding algorithms.

Algorithm | Input Sequence | Comparison With | RMSD  | RMSD with Superposition | Template Modeling Score
AlphaFold | T1030          | OpenFold        | 73.21 | 24.8                    | 0.0018
AlphaFold | T1076          | OpenFold        | 38.71 | 28.87                   | 0.0047
AlphaFold | T1090          | OpenFold        | 30.03 | 20.45                   | 0.005

The following screenshot shows the runs of ProteinFoldWorkflow for the three FASTA input sequences with SageMaker Pipelines:

We also log the metrics with SageMaker Experiments as new runs of the same experiment created by the pipeline:

from sagemaker.experiments.run import Run, load_run
with Run(experiment_name=experiment_name, run_name=input_name_1, sagemaker_session=sess) as run:
    run.log_metric(name=metric_type + "rmsd_cur", value=rmsd_cur_one, step=1)
    run.log_metric(name=metric_type + "rmds_fit", value=rmsd_fit_one, step=1)
    run.log_metric(name=metric_type + "tm_score", value=tmscore_one, step=1)

We then analyze and visualize these runs on the Experiments web page in SageMaker Studio.

The following chart depicts the RMSD value between AlphaFold and OpenFold for the three sequences: T1030, T1076, and T1090.


In this post, we described how you can use SageMaker Pipelines to set up and run protein folding workflows with two popular structure prediction algorithms: AlphaFold2 and OpenFold. We demonstrated a cost-performant solution architecture of multiple jobs that separates the compute requirements for MSA generation from structure prediction. We also highlighted how you can visualize, evaluate, and compare predicted 3D structures of proteins in SageMaker Studio.

To get started with protein folding workflows on SageMaker, refer to the sample code in the GitHub repo.

About the authors

Michael Hsieh is a Principal AI/ML Specialist Solutions Architect. He works with HCLS customers to advance their ML journey with AWS technologies and his expertise in medical imaging. As a Seattle transplant, he loves exploring the great mother nature the city has to offer, such as the hiking trails, scenery kayaking in the SLU, and the sunset at Shilshole Bay.

Shivam Patel is a Solutions Architect at AWS. He comes from a background in R&D and combines this with his business knowledge to solve complex problems faced by his customers. Shivam is most passionate about workloads in machine learning, robotics, IoT, and high-performance computing.

Hasan Poonawala is a Senior AI/ML Specialist Solutions Architect at AWS. Hasan helps customers design and deploy machine learning applications in production on AWS. He has over 12 years of work experience as a data scientist, machine learning practitioner, and software developer. In his spare time, Hasan loves to explore nature and spend time with friends and family.

Jasleen Grewal is a Senior Applied Scientist at Amazon Web Services, where she works with AWS customers to solve real-world problems using machine learning, with special focus on precision medicine and genomics. She has a strong background in bioinformatics, oncology, and clinical genomics. She is passionate about using AI/ML and cloud services to improve patient care.
