After publishing my recent article on how to estimate the cost of using OpenAI’s API, I received an interesting comment: someone had noticed that the OpenAI API is much more expensive in some languages, such as those written in Chinese, Japanese, or Korean (CJK) characters, than in English.
I wasn’t aware of this issue, but quickly learned that it is an active research area: At the beginning of this year, a paper called “Language Model Tokenizers Introduce Unfairness Between Languages” by Petrov et al. showed that the “same text translated into different languages can have drastically different tokenization lengths, with differences of up to 15 times in some cases.”
As a refresher, tokenization is the process of splitting a text into a list of tokens, which are common sequences of characters in a text.
The difference in tokenization lengths is an issue because the OpenAI API bills in units of 1,000 tokens. So if a comparable text has up to 15 times more tokens, it will also cost up to 15 times as much to process with the API.
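To make the cost impact concrete, here is a small sketch of the arithmetic (the per-1,000-token price below is an assumption for illustration, not a current OpenAI rate):

```python
# Illustrative cost calculation for per-token billing.
# PRICE_PER_1K is a hypothetical price in USD, not an actual OpenAI rate.
PRICE_PER_1K = 0.002


def api_cost(n_tokens: int, price_per_1k: float = PRICE_PER_1K) -> float:
    """Cost of a request when the API bills in units of 1,000 tokens."""
    return n_tokens / 1000 * price_per_1k


english_tokens = 1000
# Per Petrov et al., the same text can need up to 15x more tokens
# in another language.
other_language_tokens = english_tokens * 15

print(f"English:        ${api_cost(english_tokens):.4f}")
print(f"Other language: ${api_cost(other_language_tokens):.4f}")
```

The cost scales linearly with the token count, so a 15x longer tokenization translates directly into a 15x higher bill.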
Let’s translate the phrase “Hello world” into Japanese (こんにちは世界) and transliterate it into Hindi (हैलो वर्ल्ड). When we tokenize the new phrases with the cl100k_base tokenizer used in OpenAI’s GPT models, we get the following results (you can find the code I used for these experiments at the end of this article):
From the above graph, we can make two interesting observations:
- The number of letters for…