
Shuffling Very Large Datasets | by Douglas Blank, PhD | Aug, 2023


Consider the problem of shuffling a dataset that is so large it doesn't even fit into memory. This article describes how you can do it easily and (relatively) quickly in Python.

These days it is not at all uncommon to find datasets that are measured in gigabytes, or even terabytes, in size. That much data can help tremendously in the training process to create robust Machine Learning models. But how can you shuffle such large datasets?

Photo by Jess Bailey on Unsplash

Imagine that you have a very large dataset with one item per line in a file. The details of the data are irrelevant for our goals here. The dataset could be lines in a CSV (comma-separated values) or TSV (tab-separated values) file, each line could be a JSON object, or each line could be the X, Y, Z values of a point in a large point cloud. All we need is that the dataset is formatted with one item per line.
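For instance, a "one item per line" file of JSON objects (often called JSON Lines) could be written like this; the filename items.jsonl and the records here are made up purely for illustration:

import json

# Hypothetical example: each line of the file is one JSON object
records = [{"id": 1, "x": 0.0}, {"id": 2, "x": 3.1}, {"id": 3, "x": 5.5}]

with open("items.jsonl", "w") as fp:
    for record in records:
        fp.write(json.dumps(record) + "\n")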

For files containing smaller datasets, one could randomize the file (called "shuffling") in memory using a simple Python function like this:

import random

def shuffle_in_memory(filename_in, filename_out):
    # Shuffle a file, line-by-line
    with open(filename_in) as fp:
        lines = fp.readlines()
    # Randomize them in place:
    random.shuffle(lines)
    # Write the new order out:
    with open(filename_out, "w") as fp:
        fp.writelines(lines)

The shuffle_in_memory() function takes an input filename and an output filename, shuffles the lines in memory using the built-in random.shuffle(), and writes the randomized data out. As its name implies, this function requires that all of the lines of the file be loaded into memory at once.

To test this function out, let's make some test files. The function make_file() takes the number of lines that you would like in the test file. The function will create the file and return the filename.

import os

def make_file(lines):
    filename = "test-%s.txt" % lines
    print("Making test file '%s'..." % filename)

    with open(filename, "w") as fp:
        for i in range(lines):
            fp.write(f"Line {i}\n")

    print("Done!")
    return filename

For example, making a file named "test-1000.txt" with 1,000 lines in it looks like this:

filename_in = make_file(1000)

After running this function, you should find in your current directory a file named "test-1000.txt" with 1,000 lines of text, like:

Line 0
Line 1
Line 2
Line 3
Line 4
Line 5
Line 6
Line 7
Line 8
Line 9
...

To test out our shuffle_in_memory() function, we'll name an output file, save the string in the variable filename_out, and call the function:

filename_out = "test-randomized-1000.txt"
shuffle_in_memory(filename_in, filename_out)

Now, there should be a second file in your directory named "test-randomized-1000.txt". It should be exactly the same size as "test-1000.txt" with exactly the same lines, but in a randomized order:

Line 110
Line 592
Line 887
Line 366
Line 52
Line 22
Line 891
Line 83
Line 931
Line 408
...

Okay, now for the big question: what do we do if we have a very large file? Let's make a medium-sized one with, say, 10 million lines. (For most computers this is still small enough to randomize in memory, but it is of sufficiently large size to practice with.) As before, we make the input file with a call to make_file():

filename_in_big = make_file(10_000_000)

This may take a few seconds. Afterwards you should have a file named "test-10000000.txt" in your directory. It should begin as before, but will have 10 million lines. The file is about 128 MB.
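The exact byte count matters in the next step, so it can be worth checking it directly with os.path.getsize() (a quick check; note that on Windows, text-mode newlines will make the file somewhat larger):

import os

size_in_bytes = os.path.getsize(filename_in_big)
print(f"{filename_in_big} is {size_in_bytes:,} bytes")
# On Linux/macOS this prints: test-10000000.txt is 128,888,890 bytes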

How do we randomize it? If we don't want to use all of our RAM, or we don't have enough RAM, we can use the hard disk instead. Here is a recursive algorithm based on a similar problem, sorting. The following function shuffle() is based on the merge sort algorithm.

First, it checks to see if a file is small enough to be shuffled in memory (the base case, in recursive function parlance). The parameter memory_limit is given in bytes. If the file size is smaller than memory_limit, then it will be shuffled in memory. If it is too big, then it is randomly split into a number of smaller files, and each of those is recursively shuffled. Finally, the contents of the smaller shuffled files are merged back together.

Here is the function:

import tempfile

def shuffle(filename_in, filename_out, memory_limit, file_split_count,
            depth=0, debug=False):
    if os.path.getsize(filename_in) < memory_limit:
        if debug: print(" " * depth, f"Level {depth + 1}",
                        "Shuffle in memory...")
        shuffle_in_memory(filename_in, filename_out)
    else:
        if debug: print(
            " " * depth, f"Level {depth + 1}",
            f"{os.path.getsize(filename_in)} is too big;",
            f"Split into {file_split_count} files..."
        )
        # Split the big file into smaller files
        temp_files = [tempfile.NamedTemporaryFile('w+', delete=False)
                      for i in range(file_split_count)]
        for line in open(filename_in):
            random_index = random.randint(0, len(temp_files) - 1)
            temp_files[random_index].write(line)

        # Now we shuffle each smaller file
        for temp_file in temp_files:
            temp_file.close()
            shuffle(temp_file.name, temp_file.name, memory_limit,
                    file_split_count, depth+1, debug)

        # And merge back in place of the original
        if debug: print(" " * depth, f"Level {depth + 1}",
                        "Merge files...")
        merge_files(temp_files, filename_out)

If this were a sorting algorithm, we would merge the files back together in a careful manner to create a sorted ordering. However, for shuffling, we don't care about merging them in a particular order, since we want them randomized. The merge_files() function therefore looks like:

def merge_files(temp_files, filename_out):
    with open(filename_out, "w") as fp_out:
        for temp_file in temp_files:
            with open(temp_file.name) as fp:
                line = fp.readline()
                while line:
                    fp_out.write(line)
                    line = fp.readline()

Note that we are careful not to read all of the lines of the files into memory at once. Let's test this out by setting the limit for in-memory shuffling to exactly the size of the file. Since the file size, 128,888,890 bytes, is not smaller than that limit, it will be broken into a number of smaller files. For this example, let's break the big file into 2, each of which will be small enough to shuffle in memory:

filename_out_big = "test-randomized-10000000.txt"
shuffle(filename_in_big, filename_out_big, 128_888_890, 2, debug=True)

This call results in the following output:

 Level 1 128888890 is too big; Split into 2 files...
  Level 2 Shuffle in memory...
  Level 2 Shuffle in memory...
 Level 1 Merge files...

And the resulting "test-randomized-10000000.txt" file should have 10 million lines, all randomized. A better test would be to reduce the memory allowed to be much smaller than the file to randomize, and to break the too-big files into more than 2. Let's say we only want to use about 1 MB of RAM, and break the files into 20 smaller files:

shuffle(filename_in_big, filename_out_big, 1_000_000, 20, debug=True)

This example will use no more than 1 MB of RAM for the in-memory shuffle, and will recursively decompose any subfiles that are bigger than that, 20 at a time.
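To convince yourself that the output really is a permutation of the input, one quick sanity check is to compare line counts without loading either file into memory (count_lines is a helper written just for this check, not part of the shuffle code):

def count_lines(filename):
    # Count lines one at a time so we never hold the whole file in memory
    count = 0
    with open(filename) as fp:
        for line in fp:
            count += 1
    return count

print(count_lines(filename_in_big), count_lines(filename_out_big))
# Both counts should be 10000000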

This algorithm will work on files of any size (well, you need enough disk space!). The more memory you can allocate for shuffle_in_memory(), the faster it will run. If the number of smaller files is too large, then you'll spend too much time opening and closing files. You can try different numbers for file_split_count, but I've had good luck with values between 20 and 200. The larger the initial file, the more subfiles you'll probably want.
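If you want to tune these parameters for your own machine, one rough way to compare settings is simply to time a few values of file_split_count (a sketch; the actual timings will depend heavily on your disk and available RAM):

import time

for split_count in [2, 20, 200]:
    start = time.time()
    shuffle(filename_in_big, filename_out_big, 1_000_000, split_count)
    print(f"file_split_count={split_count}: {time.time() - start:.1f} seconds")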

There are other algorithms that you can also use. I had high hopes for writing all of the lines into a SQLite database and SELECTing them in a random order, but it was no faster than the code above.

import sqlite3

def shuffle_sql(filename_in, filename_out, memory_limit, depth=0, debug=False):
    if os.path.getsize(filename_in) < memory_limit:
        if debug: print(" " * depth, f"Level {depth + 1}",
                        "Shuffle in memory...")
        shuffle_in_memory(filename_in, filename_out)
    else:
        if debug: print(
            " " * depth, f"Level {depth + 1}",
            f"{os.path.getsize(filename_in)} is too big;",
            f"Writing to SQLite database..."
        )
        temp_db = tempfile.NamedTemporaryFile(delete=False)
        connection = sqlite3.connect(temp_db.name)
        cursor = connection.cursor()
        cursor.execute("""
            CREATE TABLE IF NOT EXISTS lines (
                line TEXT
            );
        """)
        with open(filename_in) as fp:
            line = fp.readline()
            while line:
                cursor.execute("INSERT INTO lines (line) VALUES (?);", [line])
                line = fp.readline()
        connection.commit()
        with open(filename_out, "w") as fp:
            for line in cursor.execute("""
                    SELECT line FROM lines ORDER BY random();
                    """):
                fp.write(line[0])

shuffle_sql(filename_in_big, filename_out_big, 1_000_000, debug=True)

Can you beat the recursive shuffle algorithm, staying purely in Python? If so, I'd love to hear about it!

Interested in Artificial Intelligence, Machine Learning, and Data Science? Consider a clap and a follow. Let me know what you are interested in!


