Learn how to create a Python-based token visualization tool for OpenAI and Azure OpenAI GPT-based models that visualizes token boundaries using the latest encodings from OpenAI.
OpenAI relies on Byte Pair Encoding (BPE) for tokenization, implemented in the tiktoken library. Most token visualization tools (including OpenAI's) rely on the older GPT-2/GPT-3 encodings and won't always produce accurate results for models such as GPT-4, ChatGPT (gpt-3.5-turbo), and text-embedding-ada-002, which use the newer cl100k_base encoding.
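To see what this looks like in practice, here is a minimal sketch (assuming tiktoken is installed, e.g. `pip install tiktoken`; the sample text is arbitrary) that resolves the correct encoding for a model and prints the token boundaries:

```python
import tiktoken

text = "Tokenization with cl100k_base differs from GPT-2/3 encodings."

# encoding_for_model resolves the right encoding for a given model,
# e.g. cl100k_base for gpt-4 and gpt-3.5-turbo.
enc = tiktoken.encoding_for_model("gpt-4")
token_ids = enc.encode(text)

print(f"{len(token_ids)} tokens")

# decode_single_token_bytes returns the raw bytes behind each token id,
# which lets us mark the boundary between adjacent tokens.
pieces = [
    enc.decode_single_token_bytes(t).decode("utf-8", errors="replace")
    for t in token_ids
]
print("|".join(pieces))
```

Running the same text through the GPT-2/GPT-3 encodings (e.g. `tiktoken.get_encoding("r50k_base")`) will generally produce different token counts and boundaries, which is why tools built on the old encodings can be misleading for cl100k_base models.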
Links:
Demo Site – https://tokenization.azurewebsites.net/
Code for Jupyter notebooks and full app – https://github.com/OpsConfig/OpenAI_Lab
OpenAI Tokenization Tool – https://platform.openai.com/tokenizer
Microsoft Semantic Kernel token explanation docs – https://learn.microsoft.com/semantic-kernel/prompt-engineering/tokens