Joining files and databases with PythonQL
A common PythonQL use case is writing a single query against multiple data sources and having the PythonQL runtime figure out the best way to execute it. This scenario is very common in daily data science work, where data scientists try out different datasets and test their additional predictive power.
Let's start with querying database sources. Currently, as of release 0.9.51, only PostgreSQL is supported, but support for more databases is coming soon. To use a PostgreSQL source in a PythonQL query, we need to create an SQLAlchemy engine and declare the tables that we want to use in the query. Below we show a simple query that runs against two PostgreSQL tables:
#coding: pythonql
# Import the create_engine function from SQLAlchemy and our PostgreSQL source
from sqlalchemy import create_engine
from pythonql.sources.postgresql import PostgresTable
# Create the database engine
engine = create_engine('postgresql://localhost:5432/test')
# Specify the tables that will be used in the query. You can also specify which
# schema the tables are in (public by default).
table1 = PostgresTable(engine,table_name='client')
table2 = PostgresTable(engine,table_name='product',schema_name='products')
# Use the tables in your query:
res = [select (c,p,pr)
for c in table1
for p in table2
let pr = p.price
where c.id == p.client_id
]
Now suppose we want to add a .csv file to the mix. Say we receive a .csv file that contains marketing data for our products. The data can be joined on the product id and contains information about promotions that appeared on different web sites. This data could be used to measure the effectiveness of different campaigns. The file also contains tons of irrelevant marketing data and is quite big, so a naive nested-loop join with a database (which is also quite big) won't perform well in the Python runtime. Instead, we'll use an index join to speed things up.
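To see why the index join matters, here is a minimal plain-Python sketch (with hypothetical in-memory rows, not PythonQL's actual implementation) contrasting the two strategies: a nested loop compares every pair of rows, while an index join builds a hash index over one side once and probes it while scanning the other side a single time.

```python
# Hypothetical product and marketing rows, standing in for the real data.
products = [{'id': 1, 'name': 'widget'}, {'id': 2, 'name': 'gadget'}]
marketing = [{'product_id': 1, 'site': 'siteA'},
             {'product_id': 1, 'site': 'siteB'},
             {'product_id': 2, 'site': 'siteC'}]

# Nested-loop join: every product row is compared against every marketing
# row, so the cost is O(len(products) * len(marketing)).
nested = [(p, m) for p in products for m in marketing
          if m['product_id'] == p['id']]

# Index join: build a hash index over the marketing rows once, then probe
# it while scanning the products, roughly O(len(products) + len(marketing)).
index = {}
for m in marketing:
    index.setdefault(m['product_id'], []).append(m)
indexed = [(p, m) for p in products for m in index.get(p['id'], [])]

assert nested == indexed  # same result, far fewer comparisons on big inputs
```

When both inputs are large, the difference between the quadratic and the near-linear strategy is exactly what makes the query feasible in the Python runtime.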
So we will run the same query as above, but also join it with the marketing data. Since the marketing .csv file can contain multiple marketing data points for a single product, we package all that data into a list and attach it to the product. We will use pandas' nifty read_csv function to read the .csv file, and our wrapper to convert the resulting dataframe into a stream of PythonQL tuples. The query is shown below:
#coding: pythonql
from pandas import read_csv
from pythonql.pandas_lib import wrap_df
# Import the create_engine function from SQLAlchemy and our PostgreSQL source
from sqlalchemy import create_engine
from pythonql.sources.postgresql import PostgresTable
# Create the database engine
engine = create_engine('postgresql://localhost:5432/test')
# Specify the tables that will be used in the query. You can also specify which
# schema the tables are in (public by default).
table1 = PostgresTable(engine,table_name='client')
table2 = PostgresTable(engine,table_name='product',schema_name='products')
# Read in the .csv file data
csv_data = wrap_df( read_csv('myfile.csv'))
# Use the tables and csv file in the query:
res = [select (c,p,m)
for c in table1
for p in table2
for m in csv_data
where c.id == p.client_id and m.product_id == p.id and hint('index','p','m')
group by c, p
]
Notice that we have added a query hint at the end of the query, specifying that the join between the database and the .csv file should be an index join: the .csv data is indexed, and the database data is iterated over once.
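The group by clause is what packages the marketing rows into lists. A minimal plain-Python sketch (with hypothetical joined rows, not PythonQL's actual grouping machinery) of what "group by c, p" does to the ungrouped join output:

```python
# Hypothetical (client, product, marketing) rows as they come out of the join,
# before grouping.
rows = [('acme', 'widget', 'promo1'),
        ('acme', 'widget', 'promo2'),
        ('acme', 'gadget', 'promo3')]

# Grouping by (client, product) collects all matching marketing rows into a
# list per group, which is how each product ends up carrying its marketing data.
grouped = {}
for c, p, m in rows:
    grouped.setdefault((c, p), []).append(m)

# grouped == {('acme', 'widget'): ['promo1', 'promo2'],
#             ('acme', 'gadget'): ['promo3']}
```

This is why the query can return one tuple per (client, product) pair even when the .csv file holds many marketing data points for a single product.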