
Jazz Chords Parsing with Transformers | by Francesco Foscarin | Aug, 2023


In this article, I summarize part of my research paper "Predicting Music Hierarchies with a Graph-Based Neural Decoder", which presents a data-driven system able to parse jazz chord sequences.

This research is motivated by my frustration with grammar-based parsing systems (which were the only option available for music data):

  • The grammar-building phase requires a lot of domain knowledge
  • The parser will fail on unseen configurations or noisy data
  • It is challenging to account for multiple musical dimensions in a single grammar rule
  • There is no well-supported, actively maintained Python framework to help with development

My approach (inspired by similar work in Natural Language Processing), instead, does not rely on any grammar, produces partial results for noisy inputs, trivially handles multiple musical dimensions, and is implemented in PyTorch.

If you are not familiar with parsing and grammars, or simply need to refresh your knowledge, I will now take a step back.

What is "parsing"?

The term parsing refers to predicting/inferring a tree (the mathematical structure) whose leaves are the elements of the sequence.

Okay then, but why would we need a tree?

Let's start with the following sequence of jazz chords (section A of "Take the A Train").

Chord sequence.

In jazz music, chords are connected by a complex system of perceptual relations. For example, the Dm7 is a preparation for the dominant chord G7. This means that the Dm7 is less important than the G7 and could, for example, be omitted in a different reharmonization. Similarly, the D7 is a secondary dominant (a dominant of a dominant), also referring to G7.

This kind of harmonic relation can be expressed with a tree and can be useful for music analysis or for performing tasks like reharmonization. However, since chords in music pieces are mostly available as a sequence, we want a system that can automatically build such a tree structure.

Before continuing, we need to differentiate between two kinds of trees.

Musicologists tend to use what are called constituent trees, which you can see in the picture below. Constituent trees contain leaves (chords in blue — elements of the input sequence) and internal nodes (chords in orange — reductions of their children leaves).

In this work, instead, we consider another kind of tree, called a dependency tree. This kind of tree does not have internal nodes, only directed arcs connecting the elements of the sequence.

We can produce the dependency tree from the constituent tree with an algorithm that will be discussed later.

Since this is a data-driven approach, we need a dataset of chord sequences (the input data) associated with a dataset of trees (the ground truth) for training and testing. We use the Jazz Harmony Treebank (JHT)¹, which is publicly available in this GitHub repository (it can be freely used for non-commercial purposes, and I obtained the author's permission to use it in this article). Specifically, they provide a JSON file with all chords and annotations.

We model each chord in input to our system with three features:

  1. The root, an integer in [0..11], where C -> 0, C# -> 1, etc.
  2. The basic form, an integer in [0..5], which selects among major, minor, augmented, half-diminished, diminished, and suspended (sus).
  3. The extension, an integer in [0, 1, 2], which selects among 6, minor 7, or major 7.

To produce the chord features from a chord label (a string), we can use a regular expression as follows (note that this code works for this dataset; the format may differ in other chord datasets).

import re

def parse_chord_label(chord_label):
    # Define a regex pattern for chord symbols
    pattern = r"([A-G][#b]?)(m|\+|%|o|sus)?(6|7|\^7)?"
    # Match the pattern against the input chord label
    match = re.match(pattern, chord_label)
    if match:
        # Extract the root, basic chord form, and extension from the match object
        root = match.group(1)
        form = match.group(2) or "M"
        ext = match.group(3) or ""
        return root, form, ext
    else:
        # Raise an error if the input is not a valid chord symbol
        raise ValueError("Invalid chord symbol: {}".format(chord_label))
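
As a follow-up, here is a minimal sketch of how the string triple returned by parse_chord_label could be mapped to the three integers described above. The index orders follow the lists in this article and are an assumption; the original repository may use a different encoding.

# Assumed index orders (following the lists above); the original repository may differ.
PITCH_CLASS = {"C": 0, "C#": 1, "Db": 1, "D": 2, "D#": 3, "Eb": 3, "E": 4, "F": 5,
               "F#": 6, "Gb": 6, "G": 7, "G#": 8, "Ab": 8, "A": 9, "A#": 10,
               "Bb": 10, "B": 11}
FORM_INDEX = {"M": 0, "m": 1, "+": 2, "%": 3, "o": 4, "sus": 5}
EXT_INDEX = {"6": 0, "7": 1, "^7": 2}

def chord_label_to_features(chord_label):
    root, form, ext = parse_chord_label(chord_label)
    # chords without an explicit extension are mapped to 0 here; an arbitrary choice for this sketch
    return PITCH_CLASS[root], FORM_INDEX[form], EXT_INDEX.get(ext, 0)

print(chord_label_to_features("Dm7"))  # (2, 1, 1)
print(chord_label_to_features("G7"))   # (7, 0, 1)
print(chord_label_to_features("C^7"))  # (0, 0, 2)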

Finally, we need to produce the dependency tree. The JHT dataset only contains constituent trees, encoded as a nested dictionary. We import them and transform them into dependency trees with a recursive function. The mechanism of our function can be described as follows.

We start from a fully formed constituent tree and a dependency tree without any dependency arcs, consisting only of the nodes labelled with sequence elements. The algorithm groups all internal tree nodes with their primary child (which all have the same label) and uses all secondary child relations originating from each group to create dependency arcs between the group label and the secondary child label.

def parse_jht_to_dep_tree(jht_dict):
    """Parse the python jazz harmony tree dict to a list of dependencies and a list of the chords in the leaves."""
    all_leaves = []

    def _iterative_parse_jht(dict_elem):
        """Recursive helper that parses the jazz harmony tree dict to a list of dependencies."""
        children = dict_elem["children"]
        if children == []:  # recursion ending condition
            out = (
                [],
                {"index": len(all_leaves), "label": dict_elem["label"]},
            )
            # add the label of the current node to the global list of leaves
            all_leaves.append(dict_elem["label"])
            return out
        else:  # recursive call
            assert len(children) == 2
            current_label = noast(dict_elem["label"])  # noast is a label-normalization helper from the original repository
            out_list = []  # dependency list
            iterative_result_left = _iterative_parse_jht(children[0])
            iterative_result_right = _iterative_parse_jht(children[1])
            # merge the dependency lists computed deeper in the tree
            out_list.extend(iterative_result_left[0])
            out_list.extend(iterative_result_right[0])
            # check whether the label corresponds to the left or right child and return the corresponding result
            if iterative_result_right[1]["label"] == current_label:  # default if both children are equal is to go with the left-right arc
                # append the dependency for the current node
                out_list.append((iterative_result_right[1]["index"], iterative_result_left[1]["index"]))
                return out_list, iterative_result_right[1]
            elif iterative_result_left[1]["label"] == current_label:
                # print("right-left arc on label", current_label)
                # append the dependency for the current node
                out_list.append((iterative_result_left[1]["index"], iterative_result_right[1]["index"]))
                return out_list, iterative_result_left[1]
            else:
                raise ValueError("Something went wrong with label", current_label)

    dep_arcs, root = _iterative_parse_jht(jht_dict)
    dep_arcs.append((-1, root["index"]))  # add the connection to the root, with index -1
    # add a self-loop to the root
    dep_arcs.append((-1, -1))  # add the loop connection to the root, with index -1
    return dep_arcs, all_leaves
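
To see what the function produces, here is a toy example. The nested dict is hand-written in the format the function expects (a "label" and a "children" list), and noast is stubbed as the identity, since the real label-normalization helper lives in the original repository.

def noast(label):
    return label  # stub for this toy example only

toy_tree = {
    "label": "C",
    "children": [
        {"label": "G7", "children": [
            {"label": "Dm7", "children": []},
            {"label": "G7", "children": []},
        ]},
        {"label": "C", "children": []},
    ],
}
dep_arcs, leaves = parse_jht_to_dep_tree(toy_tree)
print(leaves)    # ['Dm7', 'G7', 'C']
print(dep_arcs)  # [(1, 0), (2, 1), (-1, 2), (-1, -1)]: G7 -> Dm7, C -> G7, root -> C, root self-loop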

The functioning mechanism of our parsing model is pretty simple: we consider all possible arcs and use an arc predictor (a simple binary classifier) to predict whether each arc should be part of the tree or not.

However, it is pretty hard to make this choice based only on the two chords we are trying to connect. We need some context. We build such context with a transformer encoder.

To sum up, our parsing model acts in two steps:

  1. the input sequence is passed through a transformer encoder to enrich it with contextual information;
  2. a binary classifier evaluates the graph of all possible dependency arcs to filter out the unwanted arcs.

The transformer encoder follows the standard architecture. We use a learnable embedding layer to map each categorical input feature to points in a continuous multidimensional space. All embeddings are then summed together, so it is up to the network to "decide" which dimensions to use for each feature.

import torch
import torch.nn as nn
import numpy as np

class TransformerEncoder(nn.Module):
    def __init__(
        self,
        input_dim,
        hidden_dim,
        encoder_depth,
        n_heads=4,
        dropout=0,
        embedding_dim=8,
        activation="gelu",
    ):
        super().__init__()
        self.input_dim = input_dim
        self.positional_encoder = PositionalEncoding(
            d_model=input_dim, dropout=dropout, max_len=200
        )
        encoder_layer = nn.TransformerEncoderLayer(d_model=input_dim, dim_feedforward=hidden_dim, nhead=n_heads, dropout=dropout, activation=activation)
        encoder_norm = nn.LayerNorm(input_dim)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, num_layers=encoder_depth, norm=encoder_norm)
        self.embeddings = nn.ModuleDict({
            "root": nn.Embedding(12, embedding_dim),
            "form": nn.Embedding(len(CHORD_FORM), embedding_dim),
            "ext": nn.Embedding(len(CHORD_EXTENSION), embedding_dim),
            "duration": nn.Embedding(len(JTB_DURATION), embedding_dim),
            "metrical": nn.Embedding(METRICAL_LEVELS, embedding_dim)
        })

    def forward(self, sequence):
        root = sequence[:, 0]
        form = sequence[:, 1]
        ext = sequence[:, 2]
        duration = sequence[:, 3]
        metrical = sequence[:, 4]
        # transform the categorical features into embeddings
        root = self.embeddings["root"](root.long())
        form = self.embeddings["form"](form.long())
        ext = self.embeddings["ext"](ext.long())
        duration = self.embeddings["duration"](duration.long())
        metrical = self.embeddings["metrical"](metrical.long())
        # sum all embeddings
        z = root + form + ext + duration + metrical
        # add positional encoding
        z = self.positional_encoder(z)
        # reshape to (seq_len, batch = 1, input_dim)
        z = torch.unsqueeze(z, dim=1)
        # run the transformer encoder (no attention mask is needed here)
        z = self.transformer_encoder(src=z, mask=None)
        # remove the batch dimension
        z = torch.squeeze(z, dim=1)
        return z

class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 500):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-np.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # add the pre-computed positional encoding and apply dropout
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)
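
As a quick sanity check, the encoder can be run on a dummy sequence. The vocabulary-size constants below are placeholders for this smoke test only (the real values are defined in the original repository), and input_dim is set equal to embedding_dim so that the summed embeddings match the transformer width.

# Placeholder constants, for this smoke test only
CHORD_FORM = ["M", "m", "+", "%", "o", "sus"]
CHORD_EXTENSION = ["6", "7", "^7"]
JTB_DURATION = list(range(16))
METRICAL_LEVELS = 4

encoder = TransformerEncoder(input_dim=8, hidden_dim=64, encoder_depth=2, embedding_dim=8)
dummy_sequence = torch.zeros(10, 5)  # 10 chords, each with 5 categorical features
z = encoder(dummy_sequence)
print(z.shape)  # torch.Size([10, 8])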

The arc predictor is just a linear layer taking as input the concatenation of the hidden features of the two chords. The classification step for all arcs is done in parallel thanks to the power of matrix multiplication.

import torch.nn.functional as F

class ArcPredictor(nn.Module):
    def __init__(self, hidden_channels, activation=F.gelu, dropout=0.3):
        super().__init__()
        self.activation = activation
        self.root_linear = nn.Linear(1, hidden_channels)  # linear layer to produce the root features
        self.lin1 = nn.Linear(2 * hidden_channels, hidden_channels)
        self.lin2 = nn.Linear(hidden_channels, 1)
        self.dropout = nn.Dropout(dropout)
        self.norm = nn.LayerNorm(hidden_channels)

    def forward(self, z, pot_arcs):
        # add a row of features for the root element
        root_feat = self.root_linear(torch.ones((1, 1), device=z.device))
        z = torch.vstack((root_feat, z))
        # proceed with the computation
        z = self.norm(z)
        # concatenate the embeddings of the two nodes, shape (num_pot_arcs, 2*hidden_channels)
        z = torch.cat([z[pot_arcs[:, 0]], z[pot_arcs[:, 1]]], dim=-1)
        # pass through a linear layer, shape (num_pot_arcs, hidden_channels)
        z = self.lin1(z)
        # pass through the activation, shape (num_pot_arcs, hidden_channels)
        z = self.activation(z)
        # normalize
        z = self.norm(z)
        # dropout
        z = self.dropout(z)
        # pass through another linear layer, shape (num_pot_arcs, 1)
        z = self.lin2(z)
        # return a vector of shape (num_pot_arcs,)
        return z.view(-1)
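
For illustration, here is a quick usage sketch with made-up shapes: given contextual embeddings for four chords, we score every candidate head → dependent pair, where index 0 in pot_arcs stands for the root feature added inside the forward pass.

predictor = ArcPredictor(hidden_channels=8)
z = torch.randn(4, 8)  # contextual embeddings for 4 chords, as produced by the encoder
pot_arcs = torch.tensor([(h, d) for h in range(5) for d in range(1, 5) if h != d])
logits = predictor(z, pot_arcs)  # one score per candidate arc
print(logits.shape)  # torch.Size([16])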

We can put the transformer encoder and the arc predictor in a single torch module to simplify their usage.

class ChordParser(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, dropout=0.2, embedding_dim=8, use_embedding=True, n_heads=4):
        super().__init__()
        self.activation = nn.functional.gelu
        # initialize the encoder
        self.encoder = TransformerEncoder(input_dim, hidden_dim, num_layers, dropout=dropout, embedding_dim=embedding_dim, n_heads=n_heads)
        # initialize the decoder
        self.decoder = ArcPredictor(input_dim, dropout=dropout)

    def forward(self, note_features, pot_arcs, mask=None):
        z = self.encoder(note_features)
        return self.decoder(z, pot_arcs)

As a loss function, we use the sum of two losses:

  • The binary cross-entropy loss: the idea is to see our problem as a binary classification problem, where each arc can be predicted or not.
  • The cross-entropy loss: the idea is to see our problem as a multiclass classification problem, where, for each head in a head → dep arc, we need to predict which one is the correct dependent among all other chords.

loss_bce = torch.nn.BCEWithLogitsLoss()
loss_ce = torch.nn.CrossEntropyLoss(ignore_index=-1)
# the two loss terms are summed at each training step (see the sketch below)
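
Below is a minimal sketch of how the two terms could be combined in a training step. The tensor names and shapes are illustrative, not taken from the original repository.

import torch

loss_bce = torch.nn.BCEWithLogitsLoss()
loss_ce = torch.nn.CrossEntropyLoss(ignore_index=-1)

# one logit per candidate arc; the target is 1.0 if the arc is in the ground-truth tree, else 0.0
arc_logits = torch.randn(16, requires_grad=True)
arc_targets = torch.randint(0, 2, (16,)).float()

# one row of scores per head, one column per candidate dependent;
# the target is the index of the correct dependent, or -1 where there is nothing to predict
head_logits = torch.randn(4, 5, requires_grad=True)
dep_targets = torch.randint(0, 5, (4,))

total_loss = loss_bce(arc_logits, arc_targets) + loss_ce(head_logits, dep_targets)
total_loss.backward()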

There is one problem that we still need to solve: the fact that the predicted arcs should form a tree structure is not enforced at any point during training. Therefore, we could end up with an invalid configuration, such as a loop of arcs. Fortunately, there is an algorithm that we can use to ensure that this does not happen: the Eisner algorithm.²

Instead of simply assuming that an arc exists if its predicted probability is greater than 0.5, we save all predictions in a square matrix (the adjacency matrix) of size (number of chords, number of chords) and run the Eisner algorithm on it.

# Adapted from https://github.com/HMJW/biaffine-parser
import numpy as np

def eisner(scores, return_probs=False):
    """Parse using Eisner's algorithm.
    The matrix follows the convention:
    scores[i][j] = p(i=head, j=dep) = p(i --> j)
    """
    rows, columns = scores.shape
    assert rows == columns, 'scores matrix must be square'
    num_words = rows - 1  # Number of words (excluding root).
    # Initialize CKY table.
    complete = np.zeros([num_words+1, num_words+1, 2])  # s, t, direction (right=1).
    incomplete = np.zeros([num_words+1, num_words+1, 2])  # s, t, direction (right=1).
    complete_backtrack = -np.ones([num_words+1, num_words+1, 2], dtype=int)  # s, t, direction (right=1).
    incomplete_backtrack = -np.ones([num_words+1, num_words+1, 2], dtype=int)  # s, t, direction (right=1).
    incomplete[0, :, 0] -= np.inf
    # Loop from smaller items to larger items.
    for k in range(1, num_words+1):
        for s in range(num_words-k+1):
            t = s + k
            # First, create incomplete items.
            # left tree
            incomplete_vals0 = complete[s, s:t, 1] + complete[(s+1):(t+1), t, 0] + scores[t, s]
            incomplete[s, t, 0] = np.max(incomplete_vals0)
            incomplete_backtrack[s, t, 0] = s + np.argmax(incomplete_vals0)
            # right tree
            incomplete_vals1 = complete[s, s:t, 1] + complete[(s+1):(t+1), t, 0] + scores[s, t]
            incomplete[s, t, 1] = np.max(incomplete_vals1)
            incomplete_backtrack[s, t, 1] = s + np.argmax(incomplete_vals1)
            # Second, create complete items.
            # left tree
            complete_vals0 = complete[s, s:t, 0] + incomplete[s:t, t, 0]
            complete[s, t, 0] = np.max(complete_vals0)
            complete_backtrack[s, t, 0] = s + np.argmax(complete_vals0)
            # right tree
            complete_vals1 = incomplete[s, (s+1):(t+1), 1] + complete[(s+1):(t+1), t, 1]
            complete[s, t, 1] = np.max(complete_vals1)
            complete_backtrack[s, t, 1] = s + 1 + np.argmax(complete_vals1)

    value = complete[0][num_words][1]
    heads = -np.ones(num_words + 1, dtype=int)
    backtrack_eisner(incomplete_backtrack, complete_backtrack, 0, num_words, 1, 1, heads)

    value_proj = 0.0
    for m in range(1, num_words+1):
        h = heads[m]
        value_proj += scores[h, m]

    if return_probs:
        return heads, value_proj
    else:
        return heads

def backtrack_eisner(incomplete_backtrack, complete_backtrack, s, t, direction, complete, heads):
    """
    Backtracking step in Eisner's algorithm.
    - incomplete_backtrack is a (NW+1)-by-(NW+1) numpy array indexed by a start position,
    an end position, and a direction flag (0 means left, 1 means right). This array contains
    the arg-maxes of each step in the Eisner algorithm when building *incomplete* spans.
    - complete_backtrack is a (NW+1)-by-(NW+1) numpy array indexed by a start position,
    an end position, and a direction flag (0 means left, 1 means right). This array contains
    the arg-maxes of each step in the Eisner algorithm when building *complete* spans.
    - s is the current start of the span
    - t is the current end of the span
    - direction is 0 (left attachment) or 1 (right attachment)
    - complete is 1 if the current span is complete, and 0 otherwise
    - heads is a (NW+1)-sized numpy array of integers which is a placeholder for storing the
    head of each word.
    """
    if s == t:
        return
    if complete:
        r = complete_backtrack[s][t][direction]
        if direction == 0:
            backtrack_eisner(incomplete_backtrack, complete_backtrack, s, r, 0, 1, heads)
            backtrack_eisner(incomplete_backtrack, complete_backtrack, r, t, 0, 0, heads)
            return
        else:
            backtrack_eisner(incomplete_backtrack, complete_backtrack, s, r, 1, 0, heads)
            backtrack_eisner(incomplete_backtrack, complete_backtrack, r, t, 1, 1, heads)
            return
    else:
        r = incomplete_backtrack[s][t][direction]
        if direction == 0:
            heads[s] = t
            backtrack_eisner(incomplete_backtrack, complete_backtrack, s, r, 1, 1, heads)
            backtrack_eisner(incomplete_backtrack, complete_backtrack, r+1, t, 0, 1, heads)
            return
        else:
            heads[t] = s
            backtrack_eisner(incomplete_backtrack, complete_backtrack, s, r, 1, 1, heads)
            backtrack_eisner(incomplete_backtrack, complete_backtrack, r+1, t, 0, 1, heads)
            return
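
As a usage note, here is a minimal sketch of how the arc probabilities produced by the model could be arranged into the square score matrix and decoded with the functions above. The names and index conventions are illustrative: row/column 0 stands for the root, and chord i of the sequence maps to row/column i + 1.

import numpy as np
import torch

num_chords = 4
pot_arcs = torch.tensor([(h, d) for h in range(num_chords + 1)
                         for d in range(1, num_chords + 1) if h != d])
arc_logits = torch.randn(len(pot_arcs))      # stand-in for the ArcPredictor output
arc_probs = torch.sigmoid(arc_logits)

scores = np.zeros((num_chords + 1, num_chords + 1))  # scores[i, j] = p(i = head, j = dep)
idx = pot_arcs.numpy()
scores[idx[:, 0], idx[:, 1]] = arc_probs.numpy()

heads = eisner(scores)
print(heads)  # heads[m] is the predicted head of chord m (0 means the root); heads[0] is unused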

I presented a system for the dependency parsing of chord sequences, which uses a transformer to build contextual hidden representations of the chords, and a classifier to select whether two chords should be connected by an arc.

The main advantage with respect to competing systems is that this approach does not rely on any particular symbolic grammar; therefore, it can consider multiple musical features simultaneously, make use of sequential context information, and produce partial results for noisy inputs.

To keep this article to a reasonable size, both the explanation and the code focus on the most interesting parts of the system. You can find a more complete explanation in this scientific article and all the code in this GitHub repository.

(All images are by the author.)

