Working with SQL in Python
When you are starting off in Machine Learning you will play around with a lot of static datasets. This is normally in the form of an CSV document and sometimes more complex setups for computer vision problems.
While this is great, at some point you will need to communicate with relational databases. Having a solid foundation in databases can be indispensable in a data scientist or machine learning engineer role. You will often be working with data points spread across many systems and databases. Creating a database that brings together all your datapoint into one system that cleans the data as you expect can make your problems much easier to solve.
We will dive deeper into merging datasets in another post. Realistically you still start with a simple existing open database. We can find one we will look at here on Kaggle.
One of the most common ways in Python to talk with relational databases is using a library called SQLAlchemy. This tool is amazing in its coverage. The main databases you will run into are SQLite, Postgresql, MySQL, and MS-SQL, to name some of the many supported databases. With our local SQLite3 server running, let’s take a look at how you would go about making a query and loading data into a pandas DataFrame.
# Import packages from sqlalchemy import create_engine import pandas as pd
As with all our python scrips, we first need to import the packages we need. Luckily for our simple example, we just need a sub package from SQLAlchemy and pandas. Connecting to an SQL database is pretty simple—we just need to pass in the address for the SQLite server. We still need to connect to the database before we can talk to it. Make sure to make a note with SQLAlchemy that you need to explicitly connect and disconnect from your server.
engine = create_engine('sqlite:///soccer.sqlite') con = engine.connect()
With our connection made we can run SQL queries simply by running the
.execute() method on the connection object created in the last step. Here comes an important part of working with SQL in Python. SQL commands are pure strings. If you come from a Ruby background like myself you will be sad to hear we don’t get methods to easily pull data with SQL like you find in Rails. This is for the best as you will benefit from an understanding strong understanding of programming SQL.
rs = con.execute('SELECT * FROM Country') df = pd.DataFrame(rs.fetchall())
Once we have our results from the executed query we can load these into a Pandas data frame simply by passing the results into a data frame initialization method. To get all the data from the results we will need to use
# Be kind to your Database Admins and Close your connections when finished. con.close() # Print first 5 rows of DataFrame print(df.head())
Finally, we need to make sure to close the connection once we are done. You can keep this open if you have some more work to do. However, we can run these commands in a more idiomatic way.
with engine.connect() as con: rs = con.execute("SELECT player_api_id, birthday FROM Player”) df = pd.DataFrame(rs.fetchmany(size=3)) df.columns = rs.keys()
You will be used to this way of working with a database if you have experience working with files or TensorFlow v1 sessions. The beautiful thing about working with the database this way is you have made a very readable way of working inside a database connection with no fears of leaving connections opened by mistake.
You’ll notice with this last command that we didn’t fetch all results. We capped the results to three by using the
Using Pandas we can cut this code down further. Much like opening a CSV via the
.read_csv() method. Pandas provide a
.read_sql_query() method. We can pass our query as the first argument. The next required argument is the engine object that’a returned from SQLAlchemy’s
# Create engine: engine engine = create_engine('sqlite:///soccer.sqlite') # Execute query and store records in DataFrame: df df = pd.read_sql_query('SELECT * FROM Match WHERE home_team_goal >= 6 ORDER BY stage, engine)
Pandas will take care of opening and closing the connection for us. Up until now, we have done a pretty simple request. Below we have a more complex request, joining two tables and placing a conditional limiting the results to a scope we are looking for.
df = pd.read_sql_query('SELECT * FROM Player INNER JOIN Player_Attributes on Player.player_api_id = Player_Attributes.player_api_id WHERE volleys < 40’, engine)
Finally, what if we need to authenticate with our server? This is done when we create the engine. Below you can see what it would look like to connect to a MySQL server. Remember it is never a good idea to have your authentication directly in code. You might consider using environment variables or other ways to lock down communication so only your servers have this info and your keys aren’t just sitting on GitHub.
With this under our belts, we now have a good understanding on how to get started with talking to relational databases in Python. We know how to connect and query information. Then take the results and load them into data frames if we don’t directly load the results into the data frame by using pandas methods. If you are new to SQL and need to learn a bit more about the syntax I highly recommend playing with the dataset mentioned in the start of the article, as well checking out Kaggle’s SQL Lessons.