Analyzing Humanitarian Knowledge Unstructured Excel Tables with ChatGPT Code Interpreter | by Matthew Harris

Some Preliminary Exploration with Code Interpreter

Created by DALL-E2 with immediate “baby’s crayon drawing of a cheerful robotic processing information, with graphs within the background”

TL;DR

The brand new experimental function ‘Code Interpreter’ supplies native assist for producing and working Python code as a part of utilizing ChatGPT. It reveals nice potential for performing information engineering and evaluation duties, offering a conversational interface that non-technical customers may doubtlessly use. This text presents some checks of ChatGPT (GPT-4) Code Interpreter on an unstructured Excel desk from my previous blog post, to see if it is ready to mechanically convert this desk to a extra customary kind that might be loaded right into a database. With restricted prompting, it was capable of determine the hierarchical heading construction however was unable to generate code that will parse the desk precisely. On adjusting the immediate to counsel utilizing the openpyxl Python library to extract details about Excel merged cells, it was capable of parse the desk in a single try. Nonetheless, on repeating the duty with the very same immediate, it failed. With no management but over the temperature parameter to make outcomes extra deterministic, Code Interpreter doesn’t seem capable of deal with this explicit activity constantly. It’s early days although and solely a beta function, the sample for automated information processing utilizing Giant Language Fashions is probably going right here to remain and can little question enhance over time.

This week ChatGPT launched a brand new function referred to as Code Interpreter, which permits ChatGPT to generate and name Python code, in addition to add information information to carry out duties corresponding to information evaluation. As I’ve explored in previous blog posts, Giant Language Fashions have the potential for simplifying information engineering and evaluation duties. The LangChain mission has some great patterns, and there’s already a whole lot of industrial exercise on this space, so it’s fascinating to see OpenAI beginning to provide native assist.

There are numerous articles already out there exploring OpenAI Code Interpreter, however I questioned how effectively it would carry out utilizing a number of the tabular information I’ve beforehand explored as discovered on the wonderful Humanitarian Data Exchange (HDX). Having the ability to provide pure language interfaces for platforms corresponding to HDX opens the best way for much less technical customers to discover and perceive this information, which has implications for anticipating and accelerating response instances for humanitarian catastrophe occasions.

Code Interpreter is an ‘Alpha’ function at the moment, that means it’s in an early testing section and never a part of customary ChatGPT. To entry it you will want to:

Be a ChatGPT+ subscriber, costing $20 per 30 days
Go to https://chat.openai.com/
Choose the “…” subsequent to your identify bottom-left, and choose “Settings”
Click on on “Beta Options” and activate “Code Interpreter”
Again within the chat window hover over both GPT-3.5 or GPT-4 and choose “Code Interpreter”

It’s value noting that initially you needed to be on OpenAI’s plugin waitlist, however I’m undecided if that’s nonetheless the case. The options appeared for me regardless that I’ve not had affirmation of being given entry by way of the record. If the above doesn’t work, you may need to be added.

As talked about in a previous blog post, tables in Excel information can are available in all kinds of great varieties with merged cells, clean rows, and different issues which might make automated processing a little bit of a problem. For this text, I made a decision to attempt utilizing GPT-4 with Code Interpreter to research a typical instance of an Excel desk as discovered on the Humanitarian Knowledge Alternate (HDX) …

Instance of an irregular desk in Excel, with clean high rows, labels and merged cells. Completely readable for people, however a problem to parse for information science. This file was sourced from the Humanitarian Data Exchange

Although Code Interpreter has entry to open datasets, it doesn’t but embody HDX and doesn’t have the power to entry the web. I anticipate it will change fairly quickly, however within the meantime, I downloaded the file, began a session on https://chat.openai.com/ as described above, then uploaded this file by clicking the ‘+’ within the chat entry field …

After importing, ChatGPT will get proper into issues by loading the info into Python Pandas and doing a little preliminary evaluation …

Straight away it has noticed that the desk within the sheet is considerably irregular, and instantly tries to deal with this …

That’s truly spot on. In actual fact, it goes on to ask …

I responded …

It’s additionally doable to see how precisely it got here to this conclusion by deciding on the ‘Present Work’ dropdown below every response which supplies the generated Python code and consequence …

Its strategy appears cheap, nevertheless it’s value noting that it doesn’t use native Excel details about merged cells, as an alternative straight loading into Pandas.

ChatGPT will get to work and mechanically begins attempting to course of the info …

However it fails at this instantly. On deciding on ‘Present Work’ we see that it has generated some Python for parsing our desk …

# Drop the rows that aren't wanted
information = information.drop([0, 1])# Reset the index
information = information.reset_index(drop=True)
# Outline the land dimension classes
land_sizes = information.iloc[:, 1].values
# Initialize empty lists to carry the tidy information
sub_counties = []
genders = []
land_sizes_tidy = []
numbers = []
percentages = []
# Iterate over the sub-counties and genders
for i in vary(4, information.form[1], 2):
sub_county = information.columns[i].cut up(' ')[0]
gender = information.columns[i].cut up(' ')[1]
# Iterate over the land dimension classes
for j in vary(len(land_sizes)):
land_size = land_sizes[j]
quantity = information.iloc[j, i]
proportion = information.iloc[j, i + 1]
# Append the values to the lists
sub_counties.append(sub_county)
genders.append(gender)
land_sizes_tidy.append(land_size)
numbers.append(quantity)
percentages.append(proportion)
# Create a tidy dataframe
tidy_data = pd.DataFrame({
'Sub County': sub_counties,
'Gender': genders,
'Land Measurement': land_sizes_tidy,
'Quantity': numbers,
'Proportion': percentages
})
tidy_data

However this results in a bounds error …

ChatGPT doesn’t surrender and mechanically begins debugging …

Which is fairly spectacular in that it appears to have recognized the problem accurately. Nonetheless, it’s maybe lower than spectacular given it had already recognized the precise column hierarchy on the very begin of the dialog and appears to have ‘Misplaced’ this data.

Once more, it will get again into issues and mechanically proceeds …

Which provides a desk with headers like this …

The place we see information included within the column heading, suggesting it hasn’t recognized the place the column finish and information begins. In actual fact, it even spots this and bravely continues …

At this level it will get itself right into a little bit of a confused state, attempting out a lot of issues in cycles not displayed right here.

Ultimately, I feel the token restrict was breached and era stopped, with the desk trying like this …

Spot-checking the above values in ‘Present Work’ output in comparison with the unique desk, we see that for the final ‘Whole’ row the values look right, however there are two ‘Bomet Central Femail N Bomet’ column headings. It spots this …

Because it appeared so shut, I requested ChatGPT to proceed …

I had left it a short while earlier than asking it to proceed which I think resulted within the code atmosphere job being terminated. It appeared comfortable to start out this again up once more, however in doing so had misplaced some variables …

I did what was prompted and reuploaded the file, and it picked issues up once more. Ultimately, that is the desk it produced …

Which is nice …. for simply the Whole rows from the unique desk. ChatGPT has misplaced all different rows the place the info was cut up by acreage, so in truth, the parsing has failed.

I identified that it was in truth lacking a column associated to land dimension, which it had in truth recognized instantly after the primary preliminary add …

At this level ChatGPT began off on one other quest, iterating by means of a number of makes an attempt at parsing the sheet, none of which we finally profitable. A full hyperlink to the chat could be discovered here.

I began a model new chat session and tried again, which resulted in several outcomes, which has implications for the reproducibility of this system. However irrespective of how a lot I attempted, the outcomes had been by no means right.

In my previous blog post, I used to be capable of obtain essentially the most success in parsing tables like the instance supplied by utilizing data extracted from Excel to point which cells had been merged. That is key to understanding the desk heading hierarchy and what we people use when taking a look at Excel tables. If utilizing solely Pandas Excel parsing — which is what ChatGPT did in our check — somewhat than interrogating the Excel construction cell-by-cell this data is misplaced, making it so much tougher to efficiently parse tables with merged heading cells.

So what about if we give ChatGPT a tip to make use of ‘openpyxl’ straight and to consider merged cells?

Similar to final time, it will get issues proper from the beginning …

And once more proceeds with its evaluation …

Nonetheless, this time, having been supplied some steering relating to method …. It flattened the desk virtually completely with simply the one immediate above and no additional intervention.

Outcomes of ChatGPT ‘Code Interpreter’ processing an unstructured Excel desk right into a kind that is likely to be loaded right into a SQL Database. This was created with a immediate that steered utilizing the ‘openpyxl’ Python library to determine merged cells in column headings

The one problem is that final row, however with additional prompting we will take away this simply. To see the total interplay and generated code for this see here.

I requested ChatGPT to generate the total code to see what it lastly got here up with …

from openpyxl import load_workbook
import pandas as pd# Load the workbook
wb = load_workbook(filename='/mnt/information/Variety of Acreage below Irrigation.xlsx')
ws = wb.energetic
information = ws.values
information = record(information)
# Convert the tuples to lists
information = [list(row) for row in data]
# Operate to deal with None values within the first two columns
def handle_none(information):
last_valid = [None, None]
for row in information:
for i in vary(2):
if row[i] shouldn't be None:
last_valid[i] = row[i]
else:
row[i] = last_valid[i]
return information
data_filled = handle_none(information)
# Assemble the headers
headers = [None] * len(data_filled[0])
for row in data_filled[2:6]:
for i, worth in enumerate(row):
if worth shouldn't be None:
if headers[i] is None:
headers[i] = worth
else:
headers[i] += '_' + worth
# Modify the primary two headers
headers[0] = 'Query'
headers[1] = 'Class'
# Modify '%' headers
for i in vary(len(headers)):
if headers[i] == '%':
headers[i] = headers[i-1].rsplit('_', 1)[0] + '_%'
# Modify 'Female_N' and 'Female_%' headers
headers[6] = 'Sub county_Chepalungu_Female_N'
headers[7] = 'Sub county_Chepalungu_Female_%'
headers[10] = 'Bomet Central_Female_N'
headers[11] = 'Bomet Central_Female_%'
# Create a DataFrame
df = pd.DataFrame(data_filled[6:], columns=headers)
# Save the DataFrame as a CSV file
df.to_csv('/mnt/information/Number_of_Acreage_under_Irrigation_SQL.csv', index=False)

Which appears cheap. It’s not generic and has traces particular to the file being processed. I think we would want extra prompting to (possibly) make ChaGPT generate generic code, however for the duty on this research it was capable of parse the unstructured desk properly.

Nice consequence!

Given the actual fact within the first check that ChatGPT gave completely different outcomes with the identical immediate, I made a decision to repeat the very same succsessful immediate to see how issues behaved within the profitable check. Sadly, it got here up with a completely completely different, and incorrect answer utilizing the very same immediate.

NOT a terrific consequence!

Within the API the mannequin could be made extra deterministic and produce repeatable outcomes by lowering the temperature parameter, however as Code Interpreter isn’t out there within the API simply but, I used to be not capable of experiment with this.

After initially failing, we had been capable of immediate ChatGPT to accurately parse an unstructured desk by offering some coding tips on how one may do that in Python, which is a fairly wonderful consequence truly. Nonetheless, outcomes weren’t reproducible with the very same immediate failing on a second try. That is possible as a result of we don’t but have management over the mannequin temperature parameter on this beta function.

One other fascinating limitation was famous, for instance when token limits are breached and completions cease earlier than the duty is full, requiring one other immediate to hold on. Additionally, the method is somewhat sluggish as ChatGPT goes by means of iterations attempting out completely different chunks of code. It’s not but a method which might be utilized to duties requiring fast responses.

Principally, Code Interpreter seems to be actually spectacular and reveals nice promise, however doesn’t seem prepared simply but for the duty tried above.

So for now a minimum of, albeit a really quick time … I’ve one up on ChatGPT. 😊