Does rain predict rain? | Towards Data Science

Introducing useful climate datasets and validating a global warming prediction

Photograph by Wim van ‘t Einde on Unsplash

During a dreary stretch of June and July in Boston, it seemed to rain every time my family had planned to do something fun. We started to wonder if we were stuck in a rainy pattern and asked, “Does the fact that it has rained a lot for three days straight make it likely to rain tomorrow?” I realized that this question is easy to answer using available weather data.

This article presents the US weather datasets I used, the Python/pandas program I wrote to analyze the data, and the results. In short, yes, a stretch of rainy days strongly predicts more rain. And, surprisingly, the longer the stretch of rain, the more likely it is to rain the next day. The results also demonstrate a prediction from global warming models: there is more rain now than in earlier years.

There are two key datasets about rainfall from the US National Oceanic and Atmospheric Administration (NOAA).

I used Hourly Precipitation Data (HPD). The description page is generally helpful, but the link to the dataset under the Access tab is wrong and points to an older version. The new dataset is here and covers the period 1940 to 2022. HPD has fine granularity, with hourly precipitation amounts from more than 2000 collection stations across the US. Each CSV file contains all years for one station. I used only the daily totals, but the hourly information might be useful for future analysis. (See the Constraints tab for data usage rights.)

What about when it snows rather than rains? Snow accumulation is melted to find the equivalent rain amount. So all data in HPD includes liquid rain, melted snow, and everything in between, such as slush and hail.

There is another valuable dataset named Local Climatological Data (LCD) that could be used for similar analysis. LCD contains much more information than just precipitation: it also includes temperature, sunrise/sunset, air pressure, visibility, wind speed, fog, smoke, monthly summaries, and more. LCD is updated daily, so it contains yesterday’s data. To use it, you need to decode the Integrated Surface Dataset (ISD) station numbers.

The rain analysis program is written in Python/pandas. I wrote the code to be readable as is, but it’s worth exploring some special features.

The program can read the full list of HPD stations or a specific list of stations from a text file. This feature is used to re-run the program with various parameters while making sure to use the exact same stations as a previous run.

from rain_helpers import ALL_STATION_FILES

STATION_LIST_INPUT = "/Users/chuck/Desktop/Articles/hpd_stations_used_list_1940-1950.txt"
ALL_STATIONS = True  # use every station, or a specific list?

# Choose either all stations we know about, or a specific list of stations (usually from a previous run of this program)

if ALL_STATIONS:
    station_files = ALL_STATION_FILES
else:
    with open(STATION_LIST_INPUT, 'r') as fp:
        data = fp.read()
    station_files = data.split("\n")
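One detail worth guarding against when re-using a station list: splitting a file on newlines leaves an empty entry if the file ends with a newline. The sketch below shows one way to write and read such a list safely. The helper names `write_station_list` and `read_station_list` and the demo file names are hypothetical, not part of the original program.

```python
import os
import tempfile


def write_station_list(path, station_files):
    """Write one station file name per line (hypothetical helper)."""
    with open(path, "w") as fp:
        fp.write("\n".join(station_files) + "\n")


def read_station_list(path):
    """Read station file names, skipping blank lines and stray whitespace."""
    with open(path, "r") as fp:
        return [line.strip() for line in fp if line.strip()]


# Round trip with made-up file names in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "stations_used.txt")
    write_station_list(demo_path, ["STATION_A.csv", "STATION_B.csv"])
    demo_stations = read_station_list(demo_path)

print(demo_stations)
```

Filtering blank lines means a trailing newline in the file never turns into an empty station name that would later fail to load.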

Another handy feature is the ability to choose a subset of the station files. You can run the code very quickly with 1/100 of the stations for debugging, or something like 1/3 for an accurate approximation of the results. By the Law of Large Numbers, my tests with 1/3 (about 600 stations) produced results that are nearly identical to the full dataset.

SKIP_COUNT = 3  # 1 = don't skip any.

for i in range(0, len(station_files), SKIP_COUNT):
    station_url = HPD_LOCAL_DIR + station_files[i]
    stationDF = pd.read_csv(station_url, sep=',', header='infer', dtype=str)
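The same every-Nth selection can also be written as a list slice, which makes it easy to check how many stations a given skip count will touch before starting a long run. This is only a sketch: `every_nth` and the demo file names are made up for illustration.

```python
def every_nth(items, skip_count):
    """Return every skip_count-th item, starting with the first one."""
    if skip_count < 1:
        raise ValueError("skip_count must be at least 1")
    return items[::skip_count]


# Ten made-up station file names; a skip count of 3 keeps indices 0, 3, 6, 9.
demo_files = [f"station_{n:03d}.csv" for n in range(10)]
subset = every_nth(demo_files, 3)
print(len(subset), subset)
```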

Another speed improvement is to download all of the station files to your local machine, so you don’t have to fetch them from NOAA each time. The full set is about 20GB. If you don’t have this extra space, the code runs just fine while reading from the cloud.

HPD_CLOUD_DIR = ""  # Hourly Precipitation Data (HPD)
HPD_LOCAL_DIR = "/Users/chuck/Desktop/Articles/NOAA/HPD/"

station_url = HPD_LOCAL_DIR + station_files[i]  # toggle between local and cloud

The trickiest part of the code is the look-back on each date to see if it has been rainy for a stretch of previous days. The problem is that the data to be looked up is within the very same DataFrame, which calls for a self-join. It’s tempting to loop over the DataFrame and, for each row, look up previous dates as you go. But loops over large data structures are bad style in any programming language, especially pandas. My code solves this problem by taking a snapshot of the DataFrame, creating fields on each row that contain the nine previous dates (and one for tomorrow), and then using those fields to join with the snapshot.

# Grab a snapshot for a self-join later. Adjust field names to avoid confusion after the join.

stationCopyDF = stationDF
stationCopyDF = stationCopyDF[["STATION","DATE","DlySumToday"]] # keep just what we need
stationCopyDF = stationCopyDF.rename({"DlySumToday":"DlySumOther", "DATE":"DATEother"}, axis='columns')

# Add in some other dates, for which we'll pull in rainfall.

stationDF["DATE_minus9"] = stationDF["DATE"] - pd.offsets.Day(9)
stationDF["DATE_minus8"] = stationDF["DATE"] - pd.offsets.Day(8)
stationDF["DATE_minus1"] = stationDF["DATE"] - pd.offsets.Day(1)
stationDF["DATE_plus1"] = stationDF["DATE"] + pd.offsets.Day(1)

# Join other rainfall onto base record. Adjust column names to make clear what we did.

stationDF = stationDF.merge(stationCopyDF, how='inner', left_on=["STATION","DATE_minus9"], right_on = ["STATION","DATEother"])
stationDF = stationDF.rename({"DlySumOther":"DlySum9DaysAgo"}, axis='columns')
stationDF = stationDF.drop(columns=["DATEother"])

stationDF = stationDF.merge(stationCopyDF, how='inner', left_on=["STATION","DATE_minus8"], right_on = ["STATION","DATEother"])
stationDF = stationDF.rename({"DlySumOther":"DlySum8DaysAgo"}, axis='columns')
stationDF = stationDF.drop(columns=["DATEother"])


stationDF = stationDF.merge(stationCopyDF, how='inner', left_on=["STATION","DATE_minus1"], right_on = ["STATION","DATEother"])
stationDF = stationDF.rename({"DlySumOther":"DlySum1DayAgo"}, axis='columns')
stationDF = stationDF.drop(columns=["DATEother"])

stationDF = stationDF.merge(stationCopyDF, how='inner', left_on=["STATION","DATE_plus1"], right_on = ["STATION","DATEother"])
stationDF = stationDF.rename({"DlySumOther":"DlySumTomorrow"}, axis='columns')
stationDF = stationDF.drop(columns=["DATEother"])
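The repeated merge/rename/drop pattern can be folded into a loop over day offsets. The sketch below is one possible compact form, run on a tiny made-up station, and is not the original program: the function `add_lookback_columns`, the temporary column `DATEwanted`, and the toy rainfall values are assumptions for illustration. It assumes `DATE` is already a datetime column and `DlySumToday` is numeric.

```python
import pandas as pd


def add_lookback_columns(station_df, days_back=2):
    """Self-join a station's daily totals onto earlier dates and tomorrow."""
    # Snapshot with renamed fields, so the join does not create name clashes.
    snapshot = station_df[["STATION", "DATE", "DlySumToday"]].rename(
        columns={"DATE": "DATEother", "DlySumToday": "DlySumOther"}
    )

    # Map each new column name to its day offset relative to DATE.
    offsets = {}
    for n in range(days_back, 0, -1):
        name = "DlySum1DayAgo" if n == 1 else f"DlySum{n}DaysAgo"
        offsets[name] = -n
    offsets["DlySumTomorrow"] = 1

    out = station_df.copy()
    for new_col, shift_days in offsets.items():
        out["DATEwanted"] = out["DATE"] + pd.Timedelta(days=shift_days)
        # Inner join: rows without data on the wanted date are dropped.
        out = out.merge(
            snapshot,
            how="inner",
            left_on=["STATION", "DATEwanted"],
            right_on=["STATION", "DATEother"],
        )
        out = out.rename(columns={"DlySumOther": new_col})
        out = out.drop(columns=["DATEwanted", "DATEother"])
    return out


# Toy data: one made-up station with six consecutive days.
demo = pd.DataFrame({
    "STATION": ["X"] * 6,
    "DATE": pd.date_range("2020-01-01", periods=6, freq="D"),
    "DlySumToday": [0.0, 0.6, 0.7, 0.1, 0.9, 0.0],
})
joined = add_lookback_columns(demo, days_back=2)
print(joined)
```

Because every join is an inner join, only dates that have both a full look-back window and a "tomorrow" survive, which is why the six toy days shrink to three rows.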

After getting the previous days’ rainfall on each row, the code easily finds the length of each rainy period. Note that when calculating how many days it has been raining, today counts as a day.

stationDF["DaysOfRain"] = 0
stationDF.loc[(stationDF["DlySumToday"] >= RAINY), "DaysOfRain"] = 1
stationDF.loc[(stationDF['DlySumToday'] >= RAINY) & (stationDF['DlySum1DayAgo'] >= RAINY), 'DaysOfRain'] = 2
stationDF.loc[(stationDF['DlySumToday'] >= RAINY) & (stationDF['DlySum1DayAgo'] >= RAINY) & (stationDF['DlySum2DaysAgo'] >= RAINY), 'DaysOfRain'] = 3
... etc.
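The growing chain of conditions can be expressed with a running boolean mask that stays true only while every day so far has been rainy. The following is a sketch of that idea on made-up numbers, not the original code; `count_days_of_rain` is a hypothetical helper, and the columns are assumed to be numeric already.

```python
import pandas as pd

RAINY = 0.5  # inches; the threshold used in the article


def count_days_of_rain(df, max_days=3):
    """Length of the current rainy streak, with today counting as day 1."""
    lookback_cols = ["DlySumToday", "DlySum1DayAgo"] + [
        f"DlySum{n}DaysAgo" for n in range(2, max_days)
    ]
    still_raining = pd.Series(True, index=df.index)
    days_of_rain = pd.Series(0, index=df.index)
    for length, col in enumerate(lookback_cols, start=1):
        # The streak continues only if this day and all later days were rainy.
        still_raining &= df[col] >= RAINY
        days_of_rain[still_raining] = length
    return days_of_rain


# Toy rows with made-up rainfall amounts.
demo = pd.DataFrame({
    "DlySumToday":    [0.6, 0.6, 0.6, 0.1, 0.6],
    "DlySum1DayAgo":  [0.0, 0.7, 0.7, 0.9, 0.5],
    "DlySum2DaysAgo": [0.9, 0.0, 0.8, 0.9, 0.5],
})
demo["DaysOfRain"] = count_days_of_rain(demo, max_days=3)
print(demo)
```

The fourth row shows why the mask matters: two earlier rainy days do not count once today is dry.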

Using the years 2000 to 2021 inclusive, there are 1808 stations with valid data, with 8,967,394 data points (a date, location, and rain amount).

  • The average rainfall over all data points was 0.0983 inches, or about 1/10 of an inch.
  • The fraction of days it was rainy (≥ 0.5 inches) was 6.2%.
  • The fraction of dry days (≤ 0.05 inches) was 78.0%.

The answer to the original question that prompted this project is,

Yes, rainy days predict rain tomorrow. The longer it has been raining (up to 8 days), the more likely it is to rain again.

And a related result…

Rainy days predict how much rain is expected tomorrow. The longer it has been raining (up to 7 days), the more rain is likely tomorrow.

Two charts show this result.

Probability of Rain Tomorrow vs Days of Rain (image by Author)
Amount of Rain Tomorrow vs Days of Rain (image by Author)
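Numbers like those behind the two charts can be produced with a single group-by on the streak length. This is a minimal sketch under the assumption that each row already carries `DaysOfRain` and `DlySumTomorrow`; the function name `rain_tomorrow_by_streak` and the toy values are made up and are not the article's results.

```python
import pandas as pd

RAINY = 0.5  # inches; the threshold used in the article


def rain_tomorrow_by_streak(df):
    """Chance and average amount of rain tomorrow, per rainy-streak length."""
    flagged = df.assign(RainyTomorrow=df["DlySumTomorrow"] >= RAINY)
    return flagged.groupby("DaysOfRain").agg(
        chance_of_rain=("RainyTomorrow", "mean"),  # fraction of rainy tomorrows
        avg_rain=("DlySumTomorrow", "mean"),       # inches expected tomorrow
        data_points=("RainyTomorrow", "size"),
    )


# Toy rows only; the values are invented to exercise the function.
demo = pd.DataFrame({
    "DaysOfRain":     [0, 0, 1, 1, 2, 2],
    "DlySumTomorrow": [0.0, 0.6, 0.0, 0.7, 0.8, 0.9],
})
summary = rain_tomorrow_by_streak(demo)
print(summary)
```

Keeping the count of data points next to each fraction helps when judging long streaks, which are backed by far fewer observations.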

I experimented with different settings for “rainy day”, changing it from 0.5 inches to 0.75 and 1.0. These changes show the same general phenomenon of rainy days predicting more rain, but without perfect correlation through eight days. The definition of “rainy” as 0.5 inches seems to be the sweet spot for predicting tomorrow’s rain.

You might wonder where it could rain for 10 days straight. Over 22 years across the entire US, with almost 9 million data points, there were only 118 runs of such weather. Some of the places were: Boca Raton, FL; San Juan, PR; Kahuna Falls, HI; Kaumana, HI; Kihalani, HI; Paakea, HI; Pascagoula, MS; Quinault, WA; and Quilcene, WA.

Sequential days of dryness (< 0.05 inches of rain) also correlate well with dryness the next day, but the prediction is not as strong because the probabilities of a dry day tomorrow are very close to each other. The chance of a dry day tomorrow is always near the overall mean of 78%.

Sequential days of dryness are somewhat better at predicting the amount of rain expected the next day.

Amount of Rain Tomorrow vs Days of Dry (image by Author)

An obvious related question is whether the results described here have changed as the earth’s temperature rises from climate change. I ran the same analysis of US data from 1940 to 1960, 1960 to 1980, and 1980 to 2000.

The central result is the same: rainy days predict more rain. The exact numbers are slightly different in each time span, but they don’t change the strong correlation. For example, from 1960 to 1980, there were 1388 stations with valid data and 6,807,917 data points, with these results:

Fraction of days it's rainy after 1 rainy day = 17.3%
Fraction of days it's rainy after 2 rainy days = 19.6%
Fraction of days it's rainy after 3 rainy days = 27.4%
Fraction of days it's rainy after 4 rainy days = 37.1%
Fraction of days it's rainy after 5 rainy days = 43.8%
Fraction of days it's rainy after 6 rainy days = 51.5%
Fraction of days it's rainy after 7 rainy days = 52.4%

A more important corollary, predicted by climate change models, is that as the earth warms up, it will rain more. The HPD dataset can verify this, at least over the past 80 years.

The simple approach would be to take all of the current climate stations (about 2000) and look at rainfall data during each decade. But there is a potential bias in doing that. There are more weather stations now than there were in 1940, since stations have gradually been added over the past 80 years. It’s possible that newer stations were built in rainy places. If so, newer data would show more rain, but only because the overall set of stations is wetter than the 1940 set.

A more accurate approach is to find the set of stations that had rain data in the 1940s, and then use the same stations for every decade. My program can do this because it emits the list of stations actually used on each run. So first I found data for the years 1940 to 1950, and then re-used the emitted station list for 1950 to 1960, then 1960 to 1970, etc. This is about 840 stations with between 400K and 2.5M data points in the various decades.
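The fixed-cohort idea can be sketched in a few lines: find the stations that reported during the baseline decade, then average rainfall per decade using only those stations. This is an illustration on made-up rows, not the original program, which re-runs with an emitted station list instead. The function name, the half-open baseline window (1940 up to but not including 1950), and the toy values are assumptions.

```python
import pandas as pd


def decade_means_for_fixed_stations(df, baseline_start=1940, baseline_end=1950):
    """Average daily rainfall per decade, using only baseline-era stations."""
    years = df["DATE"].dt.year
    in_baseline = (years >= baseline_start) & (years < baseline_end)
    baseline_stations = set(df.loc[in_baseline, "STATION"])

    # Drop every station that did not report during the baseline window.
    cohort = df[df["STATION"].isin(baseline_stations)]
    decade = ((cohort["DATE"].dt.year // 10) * 10).rename("Decade")
    return cohort.groupby(decade)["DlySumToday"].mean()


# Toy rows: station A exists in the 1940s; station B is newer and wetter.
demo = pd.DataFrame({
    "STATION": ["A", "A", "B"],
    "DATE": pd.to_datetime(["1945-06-01", "1955-06-01", "1955-06-01"]),
    "DlySumToday": [0.1, 0.3, 2.0],
})
by_decade = decade_means_for_fixed_stations(demo)
print(by_decade)
```

In the toy data, the wetter station B would have pulled the 1950s average up to 1.15 inches; restricting to the baseline cohort removes that station-mix effect.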

The average rainfall in each decade should be very close, by the Law of Large Numbers again. But the chart below shows significantly increased rain over the same collection of stations. This is a remarkable result supporting a key prediction of global warming models.

Average Rainfall vs Decade (image by Author)

  • How the US National Weather Service makes forecasts.
  • Overview of US National Oceanic and Atmospheric Administration datasets.
  • US Environmental Protection Agency report on precipitation increase from climate change.
