Turning CSVs Into SQLite Tables with Python
Recently, I was faced with a collection of several different, related .csv (comma separated values) files, which were zipped as .gz files. Though they hadn’t been pulled from a relational database, they shared several common columns that I wished to relate, then aggregate using SQL queries.
After a little searching and coming up empty, I was able to chain together a few disparate approaches to take my .csv files and get them to the desired state, all locally (without having to set up a SQL server from scratch, or upload to one).
The following approach was designed to run in Jupyter Notebook.
Overview:
1) Unzip all files
2) Convert files to pandas dataframes
3) Create SQL tables from those dataframes
gzip, shutil, and os in Python:
gzip will decompress each file.
shutil supports several high-level operations on collections of files, including copying.
os provides the usual Python-to-operating-system functionality, such as building paths and listing directories.
Unzipping Files
import gzip
import shutil
import os
Create a subfolder of your current working directory, and move the starting zipped file(s) (in my case, containing the .csv files) there. I named mine ‘zippedData’.
Use the os module functionality to piece together the path to fit your needs.
z_folder = 'zippedData'
cwd = os.getcwd()
dir_to_unzip = os.path.join(cwd, z_folder)
Define the unzipDir function. Its arguments are a directory string and an optional verbose boolean if you want to see progress reports on the unzipping. It iterates through each file in the directory, and if it has a .gz extension, creates a new name by truncating off the .gz part — leaving the natural extension (in my case, .csv). It then uses gzip to open each zipped file and shutil to write it out.
def unzipDir(dir_name, verbose=False):
    for name in os.listdir(dir_name):
        path = os.path.join(dir_name, name)
        if os.path.isfile(path) and path[-3:] == '.gz':
            if verbose:
                print('unzipping', path)
            new_name = name[:-3]
            new_path = os.path.join(dir_name, new_name)
            with gzip.open(path, 'rb') as f_in:
                with open(new_path, 'wb') as f_out:
                    shutil.copyfileobj(f_in, f_out)
Now, just run the function!
unzipDir(dir_to_unzip, verbose=True)
Great, now what?
Pandas Dataframes
If you want to inspect those fresh new .csv files, it's hard to go wrong with pandas; you'll likely be using it for further analysis later anyway.
The following function will take all those freshly unzipped .csv files and turn them into pandas dataframes, useful for inspecting to see if any of those relational-database-like relationships exist. We will store the created dataframes in a dictionary called dfs.
import pandas as pd
dfs = {}
def make_dfs(dir_name):
    for currfile in os.listdir(dir_name):
        path = os.path.join(dir_name, currfile)
        if os.path.isfile(path) and path[-4:] == '.csv':
            currdf = pd.read_csv(path, header=0)
            dfs[currfile] = currdf
make_dfs(dir_to_unzip)
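To see whether those relational-database-like relationships actually exist, a quick pass over the dictionary is enough. Here's a minimal sketch; the shapes and column names it prints will of course depend on your own files.
# Peek at each dataframe: its shape and column names
for name, df in dfs.items():
    print(name, df.shape, list(df.columns))
# Columns that appear in more than one dataframe are good candidates to join on later
from collections import Counter
col_counts = Counter(col for df in dfs.values() for col in df.columns)
shared = [col for col, n in col_counts.items() if n > 1]
print('Columns shared across files:', shared)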
Could we have just gone straight from .csv to sqlite? Sure, but the intermediate step of converting to a dataframe gives us a chance to inspect the underlying data. Furthermore, we get to use the pandas DataFrame's very useful .to_sql method to add rows to the SQL tables.
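For the record, the direct route is only a few lines. This is a rough, self-contained sketch that reuses dir_to_unzip from earlier and skips the inspection step; the database file name and the decision to name each table after its file are just examples.
import sqlite3
import pandas as pd
conn_direct = sqlite3.connect('TestDB1.db')  # example database file name
for currfile in os.listdir(dir_to_unzip):
    if currfile.endswith('.csv'):
        # Use the file name (minus .csv) as the table name
        pd.read_csv(os.path.join(dir_to_unzip, currfile)).to_sql(
            currfile[:-4], conn_direct, if_exists='replace', index=False)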
SQLite
Next, to load these into sqlite, import sqlite3. It's a lightweight and widely used SQL module for Python with lots of advantages (and a few drawbacks, such as not supporting full outer joins).
import sqlite3
from sqlite3 import Error
from pandas import DataFrame
Then, create a sqlite connection to a database. The connection created below will be the object that you can pass SQL commands to later, for querying.
def create_conn(path):
    conn = None
    try:
        conn = sqlite3.connect(path)
        print("Connection successful")
    except Error as e:
        print("The error {} occurred.".format(e))
    return conn
conn = create_conn('TestDB1.db')
c = conn.cursor()
Next, iterate through the dataframes created above. Since they’re stored as entries in an iterable dictionary, this is easy, as is creating variable SQL commands to pass into the connection we created above. Thankfully, our dataframe object has the aforementioned .to_sql method, which interacts with our sqlite connection.
The second part of the function below tests to see if the dataframe loaded into the sql table correctly by trying to select everything from it and display the first four lines.
def convertDFtoSQL(df_name):
    #df_name='df_movie_budgets'
    df = dfs[df_name]
    cols = str(list(df.columns))[1:-1]
    createString = 'CREATE TABLE ' + df_name + '(' + cols + ')'
    try:
        c.execute(createString)
    except:
        pass
    df.to_sql(df_name, conn, if_exists='replace', index=False)
    #Checking that it came through
    c.execute('SELECT * FROM ' + df_name)
    dfOut = pd.DataFrame(c.fetchall())
    dfOut.columns = [i[0] for i in c.description]
    print(df_name)
    display(dfOut.head(4))

for df in dfs.keys():
    convertDFtoSQL(df)
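With everything loaded, the original goal of relating and aggregating across the files is just a query away. The table and column names below are placeholders, so substitute your own; the point is only that the conn object now supports joins and aggregations across what used to be separate .csv files.
# Hypothetical example: join two of the loaded tables on a shared column and aggregate
query = """
SELECT t1.shared_id, COUNT(*) AS n_rows, SUM(t2.amount) AS total
FROM table_one AS t1
JOIN table_two AS t2 ON t1.shared_id = t2.shared_id
GROUP BY t1.shared_id
"""
results = pd.read_sql_query(query, conn)
display(results.head())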
I found this approach to be more manageable and scalable to subsequent tasks than hardcoding each unzip/conversion. I hope you find it useful too!
The collected code for the snippets above can be found here:
https://gist.github.com/ericthansen/bf82af45686385b58b4229cd20d573de.js