Open Language Models
As open source language models become more readily available, it's easy to get lost in all the options.
How do we evaluate their performance and compare them? And how can we confidently say that one model is better than another?
This article provides some answers by presenting training and evaluation metrics, along with general and specific benchmarks, to give you a clear picture of your model's performance.
If you missed it, check out the first article in the Open Language Models series:
Language models define a probability distribution over a vocabulary of words to select the most likely next word in a sequence. Given a text, a language model assigns a probability to each word in the language, and the most likely one is chosen.
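To make this concrete, here is a minimal sketch with a toy vocabulary and made-up scores (not a real model): it shows how raw model scores are turned into a probability distribution over the vocabulary, from which the most likely next word is picked.

```python
import numpy as np

# Toy vocabulary and hypothetical raw scores (logits) a model might
# produce for the next word after a context like "the cat sat on the".
vocabulary = ["cat", "dog", "mat", "sat", "the"]
logits = np.array([0.2, 0.1, 3.5, 0.3, 0.5])

# Softmax turns the scores into a probability distribution over the vocabulary.
probabilities = np.exp(logits) / np.exp(logits).sum()

# The predicted next word is the one with the highest probability.
next_word = vocabulary[int(np.argmax(probabilities))]
print(dict(zip(vocabulary, probabilities.round(3))))
print("Predicted next word:", next_word)  # -> "mat"
```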
Perplexity measures how well a language model can predict the next word in a given sequence. As a training metric, it shows how well the model has learned its training set.
We won't go into the mathematical details but, intuitively, minimizing perplexity means maximizing the predicted probability.
In other words, the best model is the one that isn't surprised when it sees new text because it's expecting it, meaning it has already predicted well which words come next in the sequence.
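As a rough illustration, the short sketch below computes perplexity from hypothetical probabilities a model assigned to the true next words of a held-out text: perplexity is the exponential of the average negative log-probability, so higher probabilities on the true words mean lower (better) perplexity.

```python
import numpy as np

# Hypothetical probabilities the model assigned to each actual next word
# in a held-out text (made-up values for illustration).
predicted_probs = np.array([0.40, 0.25, 0.60, 0.05, 0.30])

# Perplexity = exp of the average negative log-probability.
perplexity = np.exp(-np.mean(np.log(predicted_probs)))
print(f"Perplexity: {perplexity:.2f}")

# A model that assigned probability 1.0 to every true word would reach
# the minimum possible perplexity of 1.0.
```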
While perplexity is helpful, it doesn't consider the meaning behind the words or the context in which they're used, and it's influenced by how we tokenize our data: different language models with different vocabularies and tokenization strategies can produce different perplexity scores, making direct comparisons less meaningful.
Perplexity is a useful but limited metric. We use it primarily to track progress during a model's training or to compare…