Managing deep learning models can be difficult due to the large number of parameters and settings needed across all modules. The training module might need parameters like `batch_size` or `num_epochs`, or parameters for the learning rate scheduler. Similarly, the data preprocessing module might need `train_test_split` or parameters for image augmentation.
A naive way to manage or introduce these parameters into a pipeline is to pass them as CLI arguments while running the scripts. Command-line arguments can be tedious to enter, and managing all parameters in a single file may not be possible. TOML files provide a cleaner way to manage configurations, and scripts can load the necessary parts of the configuration as a Python `dict` without needing boilerplate code to read/parse command-line arguments.
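For comparison, the CLI-argument approach looks something like this (a minimal sketch with hypothetical parameter names, not code from the original project). Every new hyperparameter means another `add_argument()` call, and every run means typing the full flag list by hand:

```python
import argparse

# Each hyperparameter needs its own add_argument() call.
parser = argparse.ArgumentParser(description="Train a model")
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--num_epochs", type=int, default=10)
parser.add_argument("--learning_rate", type=float, default=0.001)

# Equivalent to: python train.py --batch_size 64 --num_epochs 20
args = parser.parse_args(["--batch_size", "64", "--num_epochs", "20"])
print(args.batch_size, args.num_epochs, args.learning_rate)  # 64 20 0.001
```

As the parameter count grows, both the parser definition and the shell invocations become unwieldy, which is exactly the pain point a config file removes.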
In this blog, we’ll explore the use of TOML in configuration files and how we can use them efficiently across training/deployment scripts.
TOML, which stands for Tom’s Obvious Minimal Language, is a file format designed specifically for configuration files. The concept of a TOML file is quite similar to YAML/YML files, which can store key-value pairs in a tree-like hierarchy. An advantage of TOML over YAML is its readability, which becomes important when there are multiple nested levels.

Personally, apart from the enhanced readability, I find no practical reason to prefer TOML over YAML. Using YAML is perfectly fine; PyYAML, for instance, is a popular Python package for parsing YAML.
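To make the comparison concrete, here is the same nested setting expressed in both formats (hypothetical keys, for illustration only). In TOML, the full path to a nested table is spelled out in its header:

```toml
[train.scheduler]
name = "cosine"
warmup_steps = 500
```

whereas YAML expresses the same hierarchy through indentation alone:

```yaml
train:
  scheduler:
    name: cosine
    warmup_steps: 500
```

With deep nesting, TOML’s explicit `[train.scheduler]` header keeps the context visible even when the section scrolls off-screen, which is the readability advantage mentioned above.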
There are two benefits of using TOML for storing model/data/deployment configurations for ML models:
Managing all configurations in a single file: With TOML files, we can create multiple groups of settings required for different modules. For instance, in figure 1, the settings related to the model’s training procedure are nested under the `[train]` attribute; similarly, the `port` and `host` required for deploying the model are stored under `[deploy]`. We need not jump between `train.py` or `deploy.py` to change their parameters; instead, we can globalize all settings from a single TOML configuration file.
This could be super helpful if we’re training the model on a virtual machine, where code editors or IDEs are not available for editing files. A single config file is easy to edit with `vim` or `nano`, which are available on most VMs.
To read the configuration from a TOML file, two Python packages can be used: `toml` and `munch`. `toml` will help us read the TOML file and return its contents as a Python `dict`. `munch` will convert the contents of the `dict` to enable attribute-style access of elements. For instance, instead of writing `config["train"]["num_epochs"]`, we can simply write `config.train.num_epochs`, which enhances readability.
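Under the hood, the attribute-style access that `munch` provides boils down to a `dict` subclass that routes attribute lookups through the mapping. A minimal stdlib-only sketch of the idea (not `munch`’s actual implementation):

```python
class AttrDict(dict):
    """Recursively wrap a dict so keys are also readable as attributes."""

    def __init__(self, mapping):
        # Wrap nested dicts so attribute access works at every level.
        super().__init__(
            {k: AttrDict(v) if isinstance(v, dict) else v for k, v in mapping.items()}
        )

    def __getattr__(self, name):
        # Fall back to dict lookup when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


config = AttrDict({"train": {"num_epochs": 10, "batch_size": 32}})
print(config.train.num_epochs)  # → 10
```

`munch.munchify` does essentially this recursive wrapping for us, with extra conveniences on top.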
Consider the following file structure:

```
- config.py
- train.py
- project_config.toml
```

`project_config.toml` contains the configuration for our ML project:
```toml
[data]
vocab_size = 5589
seq_length = 10
test_split = 0.3
data_path = "dataset/"
data_tensors_path = "data_tensors/"

[model]
embedding_dim = 256
num_blocks = 5
num_heads_in_block = 3

[train]
num_epochs = 10
batch_size = 32
learning_rate = 0.001
checkpoint_path = "auto"
```
In `config.py`, we create a function that returns the munchified version of this configuration, using `toml` and `munch`:

```shell
$> pip install toml munch
```
```python
import toml
import munch

def load_global_config(filepath: str = "project_config.toml"):
    return munch.munchify(toml.load(filepath))

def save_global_config(new_config, filepath: str = "project_config.toml"):
    with open(filepath, "w") as file:
        toml.dump(new_config, file)
```
Now, in any of our project files, like `train.py` or `predict.py`, we can load this configuration:

```python
from config import load_global_config

config = load_global_config()

batch_size = config.train.batch_size
lr = config.train.learning_rate

if config.train.checkpoint_path == "auto":
    # Make a directory with the current timestamp as its name
    pass
```
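The `"auto"` branch above is left as a stub; one possible way to fill it in (an illustrative sketch with a hypothetical helper, not the author’s actual implementation) is to derive the checkpoint directory from the current timestamp:

```python
import os
import tempfile
from datetime import datetime

def resolve_checkpoint_path(configured_path: str, root: str = ".") -> str:
    # With "auto", create a fresh directory named after the current
    # timestamp, e.g. checkpoints_2024-05-01_12-30-59; otherwise use
    # the configured path as given. Either way, ensure it exists.
    if configured_path == "auto":
        stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        configured_path = os.path.join(root, f"checkpoints_{stamp}")
    os.makedirs(configured_path, exist_ok=True)
    return configured_path

# Demo in a temporary directory so we don't litter the project.
with tempfile.TemporaryDirectory() as tmp:
    path = resolve_checkpoint_path("auto", root=tmp)
    print(os.path.isdir(path))  # True
```

This keeps each training run’s checkpoints in a distinct directory without any extra CLI flags.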
The output of `print(toml.load(filepath))` is:

```python
{'data': {'data_path': 'dataset/',
          'data_tensors_path': 'data_tensors/',
          'seq_length': 10,
          'test_split': 0.3,
          'vocab_size': 5589},
 'model': {'embedding_dim': 256, 'num_blocks': 5, 'num_heads_in_block': 3},
 'train': {'batch_size': 32,
           'checkpoint_path': 'auto',
           'learning_rate': 0.001,
           'num_epochs': 10}}
```
If you’re using MLOps tools like W&B Tracking or MLflow, maintaining the configuration as a `dict` can be helpful, as we can pass it directly as an argument.
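For example, `wandb.init(config=...)` accepts a nested `dict` directly, while MLflow’s `log_params` expects a flat mapping, so a small helper can flatten the nested TOML structure first. The `flatten` function below is a hypothetical sketch, not part of either library:

```python
def flatten(config: dict, parent: str = "") -> dict:
    # {"train": {"batch_size": 32}} -> {"train.batch_size": 32}
    flat = {}
    for key, value in config.items():
        name = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


config = {"train": {"batch_size": 32, "learning_rate": 0.001}}
print(flatten(config))  # {'train.batch_size': 32, 'train.learning_rate': 0.001}
# mlflow.log_params(flatten(config))  # with mlflow installed
```

The dotted keys keep the table structure visible in the tracker’s parameter list.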
Hope you’ll consider using TOML configurations in your next ML project! It’s a clean way of managing settings that are either global or local to your training/deployment or inference scripts.

Instead of writing long CLI arguments, the scripts can directly load the configuration from the TOML file. If we wish to train two versions of a model with different hyperparameters, we just need to change the TOML file loaded in `config.py`. I’ve started using TOML files in my recent projects, and experimentation has become faster. MLOps tools can also manage versions of a model along with their configurations, but the simplicity of the approach discussed above is unique and requires minimal changes to existing projects.

Hope you’ve enjoyed reading. Have a nice day ahead!