After publishing my recent article on how to estimate the cost of using OpenAI’s API, I received an interesting comment: someone had noticed that the OpenAI API is much more expensive in some languages, such as those written in Chinese, Japanese, or Korean (CJK) characters, than in English.
I wasn’t aware of this issue, but quickly learned that it is an active research area: At the beginning of this year, a paper called “Language Model Tokenizers Introduce Unfairness Between Languages” by Petrov et al. showed that the “same text translated into different languages can have drastically different tokenization lengths, with differences of up to 15 times in some cases.”
As a refresher, tokenization is the process of splitting a text into a list of tokens, which are common sequences of characters in a text.
The difference in tokenization lengths is an issue because the OpenAI API bills in units of 1,000 tokens. So if a comparable text has up to 15 times more tokens, it will also cost up to 15 times as much to process with the API.
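To make the cost impact concrete, here is a small sketch of the arithmetic (the per-1,000-token price below is an assumption for illustration, not a current OpenAI rate):

```python
# Illustrative cost calculation for per-token billing.
# PRICE_PER_1K is a hypothetical price in USD, not an actual OpenAI rate.
PRICE_PER_1K = 0.002


def api_cost(n_tokens: int, price_per_1k: float = PRICE_PER_1K) -> float:
    """Cost of a request when the API bills in units of 1,000 tokens."""
    return n_tokens / 1000 * price_per_1k


english_tokens = 1000
# Per Petrov et al., the same text can need up to 15x more tokens
# in another language.
other_language_tokens = english_tokens * 15

print(f"English:        ${api_cost(english_tokens):.4f}")
print(f"Other language: ${api_cost(other_language_tokens):.4f}")
```

The cost scales linearly with the token count, so a 15x longer tokenization translates directly into a 15x higher bill.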
Let’s translate the phrase “Hello world” into Japanese (こんにちは世界) and transliterate it into Hindi (हैलो वर्ल्ड). When we tokenize the new phrases with the cl100k_base tokenizer used in OpenAI’s GPT models, we get the following results (you can find the code I used for these experiments at the end of this article):
From the above graph, we can make two interesting observations:
- The number of letters for…