Data cleaning and preprocessing
The next step after reading the data is to clean and preprocess it. I'm sure you've heard many times that this step usually takes the most time.
There are several issues that need to be handled before creating the world map visualizations. ChatGPT detected them and provided a solution in less than a minute. Pretty impressive!
The code in the above snippet:
# code generated by Code Interpreter
# Check if the country names are consistent across both dataframes
population_countries = set(population_data['Country Name'])
coordinates_countries = set(country_coordinates['name'])
# Countries in the population data but not in the coordinates data
missing_in_coordinates = population_countries - coordinates_countries
# Countries in the coordinates data but not in the population data
missing_in_population = coordinates_countries - population_countries
There are some countries missing in the coordinates dataset and also some in the population dataset.
What ChatGPT did in the code snippet above:
- Create Python sets of country names in both datasets
- Compare them in each direction to find the difference as a set
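The set-difference step is easy to try on a toy example. The sketch below uses two small made-up sets of names (not the real datasets) to show how subtracting in each direction gives the names unique to each side.

```python
# Toy data for illustration only; these are not the real datasets.
population_names = {"Czechia", "Russian Federation", "France", "World"}
coordinate_names = {"Czech Republic", "Russia", "France", "Antarctica"}

# Names that appear only on the population side
only_in_population = population_names - coordinate_names

# Names that appear only on the coordinates side
only_in_coordinates = coordinate_names - population_names

# Sets are unordered, so sort before printing for stable output
print(sorted(only_in_population))
print(sorted(only_in_coordinates))
```

A name such as "France" that is spelled the same way in both sets drops out of both results, which is why only the mismatched spellings remain to be mapped by hand.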
ChatGPT resolved the mismatched names in less than a minute:
The code in the above snippet:
# code generated by Code Interpreter
# Mapping of country names from the population dataset to the coordinates dataset
country_name_mapping = {
'Bahamas, The': 'Bahamas',
'Brunei Darussalam': 'Brunei',
'Cabo Verde': 'Cape Verde',
'Congo, Dem. Rep.': 'Congo [DRC]',
'Congo, Rep.': 'Congo [Republic]',
"Cote d'Ivoire": "Côte d'Ivoire",
'Czechia': 'Czech Republic',
'Egypt, Arab Rep.': 'Egypt',
'Gambia, The': 'Gambia',
'Hong Kong SAR, China': 'Hong Kong',
'Iran, Islamic Rep.': 'Iran',
"Korea, Dem. People's Rep.": 'North Korea',
'Korea, Rep.': 'South Korea',
'Kyrgyz Republic': 'Kyrgyzstan',
'Lao PDR': 'Laos',
'Macao SAR, China': 'Macau',
'Micronesia, Fed. Sts.': 'Micronesia',
'Myanmar': 'Myanmar [Burma]',
'North Macedonia': 'Macedonia [FYROM]',
'Russian Federation': 'Russia',
'Sao Tome and Principe': 'São Tomé and Príncipe',
'Slovak Republic': 'Slovakia',
'Syrian Arab Republic': 'Syria',
'Turkiye': 'Turkey',
'Venezuela, RB': 'Venezuela',
'Yemen, Rep.': 'Yemen',
'Eswatini': 'Swaziland'
}
# Clean the population dataset
population_data_clean = population_data.replace({"Country Name": country_name_mapping})
# Drop the rows that don't exist in the coordinates dataset
population_data_clean = population_data_clean[population_data_clean['Country Name'].isin(coordinates_countries)]
# Merge the population data with the country coordinates data
merged_data = pd.merge(population_data_clean, country_coordinates, left_on='Country Name', right_on='name')
# Keep only the necessary columns
merged_data = merged_data[['Country Name', '2022', 'latitude', 'longitude']]
# Rename the columns for clarity
merged_data.columns = ['Country', 'Population', 'Latitude', 'Longitude']
merged_data.head()
We now have the data in a nice and clean format. This step would normally take a lot of time and manual effort. ChatGPT did it in about a minute.
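Before dropping unmatched rows, it can be worth checking which names still fail to match after the renaming. The following is a minimal sketch on toy data, not part of the code ChatGPT generated. The tiny dataframes, their values, and the `checked` and `unmatched` variables are assumptions made for illustration. It relies on the `indicator` parameter of `pd.merge`, which adds a `_merge` column describing where each row came from.

```python
import pandas as pd

# Toy stand-ins for the real dataframes (values are made up for illustration)
population_data = pd.DataFrame({
    "Country Name": ["Czechia", "France", "World"],
    "2022": [100, 200, 300],
})
country_coordinates = pd.DataFrame({
    "name": ["Czech Republic", "France"],
    "latitude": [49.8, 46.2],
    "longitude": [15.5, 2.2],
})
country_name_mapping = {"Czechia": "Czech Republic"}

# Rename countries in the population data to match the coordinates data
population_data_clean = population_data.replace({"Country Name": country_name_mapping})

# A left merge with indicator=True adds a "_merge" column that says
# whether each population row found a match in the coordinates data
checked = pd.merge(
    population_data_clean,
    country_coordinates,
    left_on="Country Name",
    right_on="name",
    how="left",
    indicator=True,
)

# Rows marked "left_only" have no coordinates and would be dropped later
unmatched = checked.loc[checked["_merge"] == "left_only", "Country Name"].tolist()
print(unmatched)
```

In this toy data, "World" plays the role of an entry that has no coordinates. Listing such names first makes it easier to tell intentional drops from spellings that are simply missing from the mapping.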