
Generate synthetic data with BigQuery DataFrames and LLMs


In the realm of big data analytics, a common challenge has been the separation between data processing and machine learning workflows. Traditionally, data engineers would use tools like Apache Spark for large-scale data processing in data warehouses like BigQuery, while data scientists would leverage libraries like pandas and scikit-learn for machine learning tasks. This disjointed approach led to inefficiencies, data duplication, and delays in deriving insights from data.

At the same time, AI’s success hinges on vast amounts of data. The generation and management of synthetic data (fabricated data that mimics real-world data) has therefore become a critical operation for any business. Synthetic data is generated either algorithmically, to model datasets used in production, or by trained ML models such as generative AI. It can emulate operational or production data, facilitating the training of machine learning (ML) models or the evaluation of mathematical models.

BigQuery DataFrames as a solution

BigQuery DataFrames bridges the gap between data processing and machine learning by providing a unified, scalable, and cost-efficient platform for both tasks. This empowers organizations to accelerate their data-driven initiatives, improve collaboration between teams, and unlock the full potential of their data. BigQuery DataFrames is an open-source Python package that provides a pandas-like DataFrame and a scikit-learn-like ML library for big data. It uses BigQuery and the rest of Google Cloud as the storage and compute platform under the hood. It provides easy compute extensibility by integrating with Google Cloud Functions, and generative AI capabilities, including state-of-the-art generative AI models, by integrating with Vertex AI. This versatile set of capabilities allows BigQuery DataFrames to be used for developing scalable AI applications.

BigQuery DataFrames allows you to generate artificial data at scale and mitigates a number of issues around moving data outside of your ecosystem or using third-party solutions. When dealing with sensitive personal data, synthetic data offers a privacy-preserving alternative: you can share and collaborate on datasets, and deploy analytical models into production, without exposing individuals’ private information. Synthetic data also provides a safe environment for testing and validation. You can simulate edge cases, outliers, and rare events that might not be present in your real dataset. Finally, before you change your data warehouse schema or ETL processes, synthetic data lets you simulate the impact of those changes, preventing costly errors and downtime.

BigQuery DataFrames and synthetic data generation in action

Synthetic data generation is a need that arises in many applications where:

  • Real data generation is slow and expensive
  • Sharing original data has a high governance bar compared with synthetic data, i.e., there are stringent rules, regulations, and oversight
  • Larger scale data is needed for simulations

Let’s see how BigQuery DataFrames works together with LLMs to generate synthetic data right inside BigQuery. The process has two main stages and a number of substages, as shown below:

Code generation

1. Set the schema and provide instructions to the LLM

1.1 The user knows the schema of the data they need

1.2 They have a high-level sense of the code that could generate such data

1.3 They express, in a natural language (NL) prompt, the intent to generate code that produces such data on a small scale

1.4 Enrich the prompt with hints to guide the LLM to generate correct code

2. Send the prompt to the LLM and get generated code
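As a rough illustration of steps 1.1 to 1.4, the prompt can be assembled from a schema plus a list of hints. This is a minimal sketch only: the schema, the `build_prompt` helper, and the hint wording are hypothetical, and the call that actually sends the prompt to an LLM (step 2) is deliberately left out.

```python
# Hypothetical schema: column name -> plain-language description of the values.
SCHEMA = {
    "name": "full name of a person",
    "age": "integer between 18 and 90",
    "city": "name of a city",
}


def build_prompt(schema, num_rows, hints=()):
    """Assemble a natural-language prompt asking an LLM for data-generating code."""
    columns = "\n".join(f"- {col}: {desc}" for col, desc in schema.items())
    lines = [
        f"Write Python code that generates {num_rows} rows of synthetic data "
        "as a pandas DataFrame with the following columns:",
        columns,
    ]
    if hints:
        # Step 1.4: hints steer the LLM toward code that is easy to review and run.
        lines.append("Hints:")
        lines.extend(f"- {hint}" for hint in hints)
    return "\n".join(lines)


# The hints below are illustrative assumptions, not requirements of any API.
prompt = build_prompt(
    SCHEMA,
    num_rows=100,
    hints=[
        "Use the faker library to produce realistic values.",
        "Store the result in a variable named result_df.",
    ],
)
print(prompt)
```

Keeping the schema and hints as data rather than hard-coding them into the prompt text makes it easier to adjust them when the review in the next stage sends you back to step 1.1.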

Code execution

3. Review and execute the code, loop back to step 1.1 if needed (human-in-the-loop)

4. Deploy the code as a remote_function and execute it at the desired scale

5. Post-process to produce the data in the desired shape
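The execution stage can be sketched locally before anything is deployed. The example below is an assumption-laden stand-in: `generate_rows` plays the role of the reviewed, LLM-generated code, a plain Python loop replaces execution of the deployed function at scale, and the JSON-string return value assumes the deployed function takes one scalar input and returns one scalar output.

```python
import json
import random


def generate_rows(seed: int, num_rows: int = 3) -> str:
    """Hypothetical stand-in for reviewed LLM-generated code.

    Returns the generated rows as a single JSON string so that the function
    stays scalar-in, scalar-out.
    """
    rng = random.Random(seed)  # seeded so each input yields reproducible rows
    cities = ["Lisbon", "Nairobi", "Osaka"]  # illustrative values only
    rows = [
        {"age": rng.randint(18, 90), "city": rng.choice(cities)}
        for _ in range(num_rows)
    ]
    return json.dumps(rows)


# Step 4 (simulated): a local loop stands in for running the deployed
# function over a large column of inputs.
raw_outputs = [generate_rows(seed) for seed in range(4)]

# Step 5: parse each JSON string and flatten everything into one list of records.
records = [row for raw in raw_outputs for row in json.loads(raw)]
print(len(records))
print(records[0])
```

Running the function on a handful of inputs like this is a cheap way to do the human-in-the-loop review in step 3 before paying for execution at full scale.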

