
How To Speed Up SQL Queries Using Indexes [Python Edition]
Image by Author

 

Suppose you’re sifting through the pages of a book, and you want to find the information you’re looking for much faster. How would you do that? Well, you’d probably look up the index of terminologies and then jump to the pages that reference a particular term. Indexes in SQL work similarly to the indexes in books.

In most real-world systems, you’ll run queries against a database table with a large number of rows (think millions of rows). Queries that require a full-table scan through all the rows to retrieve the results will be quite slow. If you know that you’ll often have to query records based on some of the columns, you can create database indexes on those columns. This can speed up the query significantly.

So what will we learn today? We’ll learn to connect to and query a SQLite database in Python using the sqlite3 module. We’ll also learn how to add indexes and see how they improve performance.

To code along with this tutorial, you should have Python 3.7+ and SQLite installed in your working environment.

Note: The examples and sample output in this tutorial are for Python 3.10 and SQLite3 (version 3.37.2) on Ubuntu 22.04 LTS.

 

Connecting to a SQLite Database

We’ll use the built-in sqlite3 module. Before we start running queries, we need to:

  • connect to the database 
  • create a database cursor to run queries

To connect to the database, we’ll use the connect() function from the sqlite3 module. Once we’ve established a connection, we can call cursor() on the connection object to create a database cursor, as shown:

import sqlite3

# connect to the db
db_conn = sqlite3.connect('people_db.db')
db_cursor = db_conn.cursor()

 

Here we try to connect to the database people_db. If the database doesn’t exist, running the above snippet will create the SQLite database for us.
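As a side note (not part of the original snippet): sqlite3 connection objects also work as context managers that wrap a transaction, committing on success and rolling back if an exception is raised. A minimal sketch, using an in-memory database:

```python
import sqlite3

# The connection object doubles as a transaction context manager:
# it commits on a clean exit and rolls back on an exception.
with sqlite3.connect(":memory:") as conn:
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.execute("INSERT INTO t VALUES (1)")

# Note: the context manager commits but does NOT close the connection.
row = conn.execute("SELECT x FROM t").fetchone()
print(row)  # (1,)
conn.close()
```

Closing the connection is still your responsibility, which is why the tutorial calls close() explicitly.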

 

Creating a Database Table

Now, we’ll create a table in the database and populate it with records.

Let’s create a table named people in the people_db database with the following fields:

# main.py
...
# create table
db_cursor.execute('''CREATE TABLE people (
                  id INTEGER PRIMARY KEY,
                  name TEXT,
                  email TEXT,
                  job TEXT)''')


...

# commit the transaction and close the cursor and db connection
db_conn.commit()
db_cursor.close()
db_conn.close()

 

Synthetic Data Generation with Faker

 

We now need to insert records into the table. To do this, we’ll use Faker, a Python package for synthetic data generation, installable via pip:

 

After installing Faker, you can import the Faker class into the Python script:

# main.py
...
from faker import Faker
...

 

The next step is to generate and insert records into the people table. Just so we can see how indexes speed up queries, let’s insert a large number of records. Here, we’ll insert 100K records; set the num_records variable to 100000.

Then, we do the following:

  • Instantiate a Faker object fake and set the seed so we get reproducible results. 
  • Get a name string using first and last names, by calling first_name() and last_name() on the fake object.
  • Generate a fake domain by calling domain_name().
  • Use the first and last names and the domain to generate the email field.
  • Get a job for each person record using job().

We generate and insert records into the people table:

# create and insert records
fake = Faker() # be sure to import: from faker import Faker 
Faker.seed(42)

num_records = 100000

for _ in range(num_records):
    first = fake.first_name()
    last = fake.last_name()
    name = f"{first} {last}"
    domain = fake.domain_name()
    email = f"{first}.{last}@{domain}"
    job = fake.job()
    db_cursor.execute('INSERT INTO people (name, email, job) VALUES (?,?,?)', (name, email, job))

# commit the transaction and close the cursor and db connection
db_conn.commit()
db_cursor.close()
db_conn.close()

 

Now the main.py file has the following code:

# main.py
# imports
import sqlite3
from faker import Faker

# connect to the db
db_conn = sqlite3.connect('people_db.db')
db_cursor = db_conn.cursor()

# create table
db_cursor.execute('''CREATE TABLE people (
                  id INTEGER PRIMARY KEY,
                  name TEXT,
                  email TEXT,
                  job TEXT)''')


# create and insert records
fake = Faker()
Faker.seed(42)

num_records = 100000

for _ in range(num_records):
    first = fake.first_name()
    last = fake.last_name()
    name = f"{first} {last}"
    domain = fake.domain_name()
    email = f"{first}.{last}@{domain}"
    job = fake.job()
    db_cursor.execute('INSERT INTO people (name, email, job) VALUES (?,?,?)', (name, email, job))

# commit the transaction and close the cursor and db connection
db_conn.commit()
db_cursor.close()
db_conn.close()

 

Run this script (once) to populate the table with num_records records.

 

Querying the Database Table

Now that we have the table with 100K records, let’s run a sample query on the people table.

Let’s run a query to:

  • get the names and emails of the records where the job title is ‘Product manager’, and
  • limit the query results to 10 records.

We’ll use perf_counter_ns() from the time module to get the approximate execution time for the query.

# sample_query.py

import sqlite3
import time

db_conn = sqlite3.connect("people_db.db")
db_cursor = db_conn.cursor()

t1 = time.perf_counter_ns()

db_cursor.execute("SELECT name, email FROM people WHERE job='Product manager' LIMIT 10;")

res = db_cursor.fetchall()
t2 = time.perf_counter_ns()

print(res)
print(f"Query time without index: {(t2-t1)/1000} us")

 

Here’s the output:

Output >>
[
    ("Tina Woods", "[email protected]"),
    ("Toni Jackson", "[email protected]"),
    ("Lisa Miller", "[email protected]"),
    ("Katherine Guerrero", "[email protected]"),
    ("Michelle Lane", "[email protected]"),
    ("Jane Johnson", "[email protected]"),
    ("Matthew Odom", "[email protected]"),
    ("Isaac Daniel", "[email protected]"),
    ("Jay Byrd", "[email protected]"),
    ("Thomas Kirby", "[email protected]"),
]

Query time without index: 448.275 us

 

You can also invoke the SQLite command-line client by running sqlite3 db_name at the command line:

$ sqlite3 people_db.db
SQLite version 3.37.2 2022-01-06 13:25:41
Enter ".help" for usage hints.

 

To get the list of indexes, you can run .index. As there are no indexes currently, no index will be listed.
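Equivalently (an alternative not shown above), you can list indexes from Python by querying the sqlite_master catalog table. A self-contained sketch, using a fresh in-memory database as a stand-in for people_db.db:

```python
import sqlite3

# Fresh in-memory database with the same schema as people_db.db
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE people (
                id INTEGER PRIMARY KEY,
                name TEXT,
                email TEXT,
                job TEXT)""")

# sqlite_master catalogs every table, index, view, and trigger in the database
indexes = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='index'"
).fetchall()
print(indexes)  # [] -- no indexes yet
conn.close()
```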

You can also check the query plan like so:

sqlite> EXPLAIN QUERY PLAN SELECT name, email FROM people WHERE job='Product manager' LIMIT 10;
QUERY PLAN
`--SCAN people

 

Here the query plan is to scan all the rows, which is inefficient.

 

Creating an Index

To create a database index on a particular column, you can use the syntax:

CREATE INDEX index_name ON table_name (column(s))

 

Say we need to frequently look up the records of individuals with a particular job title. It’d help to create an index people_job_index on the job column:

# create_index.py

import time
import sqlite3

db_conn = sqlite3.connect('people_db.db')

db_cursor = db_conn.cursor()

t1 = time.perf_counter_ns()

db_cursor.execute("CREATE INDEX people_job_index ON people (job)")

t2 = time.perf_counter_ns()

db_conn.commit()

print(f"Time to create index: {(t2 - t1)/1000} us")


Output >>
Time to create index: 338298.6 us

 

Although creating the index takes this lengthy, it is a one-time operation. You’ll nonetheless get substantial speed-up when working a number of queries.

Now if you run .index in the SQLite command-line client, you’ll get:

sqlite> .index
people_job_index

 

Querying with the Index

If you now look at the query plan, you should be able to see that we now search the people table using the index people_job_index on the job column:

sqlite> EXPLAIN QUERY PLAN SELECT name, email FROM people WHERE job='Product manager' LIMIT 10;
QUERY PLAN
`--SEARCH people USING INDEX people_job_index (job=?)

 

You can re-run sample_query.py. Only modify the print() statement and see how long it takes for the query to run now:

# sample_query.py

import sqlite3
import time

db_conn = sqlite3.connect("people_db.db")
db_cursor = db_conn.cursor()

t1 = time.perf_counter_ns()

db_cursor.execute("SELECT name, email FROM people WHERE job='Product manager' LIMIT 10;")

res = db_cursor.fetchall()
t2 = time.perf_counter_ns()

print(res)
print(f"Query time with index: {(t2-t1)/1000} us")

 

Here’s the output:

Output >>
[
    ("Tina Woods", "[email protected]"),
    ("Toni Jackson", "[email protected]"),
    ("Lisa Miller", "[email protected]"),
    ("Katherine Guerrero", "[email protected]"),
    ("Michelle Lane", "[email protected]"),
    ("Jane Johnson", "[email protected]"),
    ("Matthew Odom", "[email protected]"),
    ("Isaac Daniel", "[email protected]"),
    ("Jay Byrd", "[email protected]"),
    ("Thomas Kirby", "[email protected]"),
]

Query time with index: 167.179 us

 

We see that the query now takes only about 167.179 microseconds to execute.

 

Performance Improvement

 

For our sample query, querying with the index is about 2.68 times faster, and we get a percentage speedup of 62.71% in execution time.
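These figures come straight from the two timed runs above; as a sanity check, you can compute them like so:

```python
# Timings reported by the two runs of sample_query.py, in microseconds
t_without_index = 448.275
t_with_index = 167.179

speedup = t_without_index / t_with_index
percent_faster = (t_without_index - t_with_index) / t_without_index * 100

print(f"{speedup:.2f}x faster")          # 2.68x faster
print(f"{percent_faster:.2f}% speedup")  # 62.71% speedup
```

Your own timings will differ by machine and SQLite version, but the relative gap should be similar.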

You can also try running a few more queries that involve filtering on the job column and see the performance improvement.

Also note: Because we’ve created an index only on the job column, queries that involve other columns will not run any faster than without the index.
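You can verify this with EXPLAIN QUERY PLAN. A self-contained sketch (using an in-memory database with the same schema; the exact plan strings can vary by SQLite version): a filter on the indexed job column uses the index, while a filter on the unindexed name column falls back to a full scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE people (
                id INTEGER PRIMARY KEY,
                name TEXT,
                email TEXT,
                job TEXT)""")
conn.execute("CREATE INDEX people_job_index ON people (job)")

def plan(query):
    # The fourth column of EXPLAIN QUERY PLAN output is the plan detail string
    return conn.execute("EXPLAIN QUERY PLAN " + query).fetchone()[3]

# Filtering on the indexed column uses the index...
job_plan = plan("SELECT name FROM people WHERE job='Product manager'")
# ...while filtering on an unindexed column scans the table
name_plan = plan("SELECT email FROM people WHERE name='Jane Doe'")

print(job_plan)
print(name_plan)
conn.close()
```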

 

Wrapping Up

I hope this guide helped you understand how creating database indexes on frequently queried columns can significantly speed up queries. This is just an introduction to database indexes. You can also create multi-column indexes, multiple indexes on the same column, and much more.
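As a taste of that, here is a sketch of a multi-column (composite) index; the index name people_job_name_index is my own illustration, not from the tutorial. An index on (job, name) can serve equality filters on job alone, or on job and name together:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE people (
                id INTEGER PRIMARY KEY,
                name TEXT,
                email TEXT,
                job TEXT)""")

# Composite index: the leftmost column (job) can be used on its own,
# and both columns together narrow the search further.
conn.execute("CREATE INDEX people_job_name_index ON people (job, name)")

detail = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT email FROM people WHERE job='Engineer' AND name='Jane Doe'"
).fetchone()[3]
print(detail)
conn.close()
```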

You can find all the code used in this tutorial in this GitHub repository. Happy coding!
 
 
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more.
 

