
Why OpenAI’s API Is More Expensive for Non-English Languages | by Leonie Monigatti | Aug, 2023


Beyond words: How byte pair encoding and Unicode encoding factor into pricing disparities

How can it be that the phrase “Hello world” has two tokens in English and 12 tokens in Hindi?

After publishing my recent article on how to estimate the cost for OpenAI’s API, I received an interesting comment: a reader had noticed that the OpenAI API is much more expensive in other languages, such as those using Chinese, Japanese, or Korean (CJK) characters, than in English.

Comment by a reader on my recent article on how to estimate the cost for OpenAI’s API with the tiktoken library

I wasn’t aware of this issue, but quickly realized that this is an active research field: At the beginning of this year, a paper called “Language Model Tokenizers Introduce Unfairness Between Languages” by Petrov et al. [2] showed that the “same text translated into different languages can have drastically different tokenization lengths, with differences up to 15 times in some cases.”

As a refresher, tokenization is the process of splitting a text into a list of tokens, which are common sequences of characters in a text.

An example of tokenization

The difference in tokenization lengths is an issue because the OpenAI API is billed in units of 1,000 tokens. Thus, if a comparable text produces up to 15 times more tokens, it will cost up to 15 times as much to process through the API.
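To make the arithmetic concrete, here is a minimal sketch of how cost scales with token count under per-1,000-token billing. The price and the token counts are hypothetical placeholders chosen for illustration, not actual OpenAI rates or measured values.

```python
def estimate_cost(num_tokens: int, usd_per_1k_tokens: float) -> float:
    """Return the cost in USD for a given number of tokens,
    assuming a flat price per 1,000 tokens."""
    return num_tokens / 1000 * usd_per_1k_tokens


# Hypothetical price per 1,000 tokens (assumption, not an official rate).
USD_PER_1K_TOKENS = 0.002

# Hypothetical token counts for the "same" text in two languages,
# using the worst-case factor of 15 reported by Petrov et al.
english_tokens = 2_000
other_language_tokens = english_tokens * 15

english_cost = estimate_cost(english_tokens, USD_PER_1K_TOKENS)
other_language_cost = estimate_cost(other_language_tokens, USD_PER_1K_TOKENS)

print(f"English:        {english_tokens:>6} tokens -> ${english_cost:.4f}")
print(f"Other language: {other_language_tokens:>6} tokens -> ${other_language_cost:.4f}")
print(f"Cost ratio: {other_language_cost / english_cost:.1f}x")
```

Because the price per token is constant, the cost ratio between two languages is simply the ratio of their token counts.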

Let’s translate the phrase “Hello world” into Japanese (こんにちは世界) and transcribe it into Hindi (हैलो वर्ल्ड). When we tokenize the new phrases with the cl100k_base tokenizer used in OpenAI’s GPT models, we get the following results (you can find the code I used for these experiments at the end of this article):

Number of letters and tokens (cl100k_base) for the phrase “Hello world” in English, Japanese, and Hindi
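As a rough complement that needs no tokenizer, the sketch below compares the number of Unicode code points with the number of UTF-8 bytes for the three phrases. It does not compute token counts (that requires a tokenizer such as tiktoken); it only illustrates, under the assumption that a byte-level BPE tokenizer like cl100k_base starts from UTF-8 bytes, why non-Latin scripts begin with more raw material to merge. The helper name `char_and_byte_counts` is made up for this example.

```python
# The three phrases from the text above.
phrases = {
    "English": "Hello world",
    "Japanese": "こんにちは世界",
    "Hindi": "हैलो वर्ल्ड",
}


def char_and_byte_counts(text: str) -> tuple[int, int]:
    """Return (number of Unicode code points, number of UTF-8 bytes).

    Note: len() counts code points, which can differ from the number
    of letters a reader perceives (e.g. Devanagari vowel signs).
    """
    return len(text), len(text.encode("utf-8"))


for language, phrase in phrases.items():
    n_chars, n_bytes = char_and_byte_counts(phrase)
    print(f"{language}: {n_chars} code points, {n_bytes} UTF-8 bytes")
```

ASCII characters take one byte each in UTF-8, whereas the Japanese and Devanagari characters used here take three bytes each, so the byte sequences a tokenizer has to compress are considerably longer for the same short phrase.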

From the above graph, we can make two interesting observations:

  1. The number of letters for…

