ETL Google Spreadsheet Using Pandas To BigQuery

Adam WB
Data Folks Indonesia
Jul 26, 2020


(Cover photo: https://unsplash.com/photos/VMKBFR6r_jg)

In this story I will share my experience with an ETL process that reads data from a Google Spreadsheet and stores it in BigQuery. Let’s begin.

First, of course, we need to add some data to a Google Spreadsheet. Here is my example data:

[Screenshot: the example data in Google Sheets]

I took this sample data from the Kaggle Airbnb Boston dataset. Kaggle is a place where you can find a lot of data and turn it into great insights.

After we have some data in the spreadsheet, it is time to enable the Google Sheets API at https://console.developers.google.com; the API overview page will confirm once it is enabled.

Now it’s time to write some code and see the magic happen.

I am using JupyterLab, so if you do not have JupyterLab, you can download Anaconda.

We need to import a few Python packages. Here are the ones required:

# Credentials and an authorized HTTP session for Google APIs
from google.oauth2 import service_account
from google.auth.transport.requests import AuthorizedSession
# gspread reads the sheet, pandas holds the data, BigQuery is the destination
import gspread
import pandas as pd
from google.cloud import bigquery

If running this fails with “no module named …”, you can install the missing packages first using pip in your environment :)
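For reference, these are the PyPI package names behind the imports above, installable in one line:

pip install gspread pandas pandas-gbq google-auth google-cloud-bigquery

(pandas-gbq is what powers df.to_gbq later on.)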

Did I mention that we have to create credentials to be able to read Google Sheets? :D

Well, do not forget to create a service account credential on the same page where you enabled the API. Then you can use the downloaded JSON key in the code below (mine is project-gsheet-viewer.json).

Here is how to find the Google Sheet id: in the sheet’s URL, copy the part after “d/” up to the next “/” (for example, in https://docs.google.com/spreadsheets/d/<sheet-id>/edit, it is the segment in the middle).
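Since the original code embed is not shown here, below is a minimal sketch of how these pieces fit together, matching the imports above. The scopes and sheet id are placeholders, and recent gspread versions also accept gspread.authorize(credentials) directly.

from google.oauth2 import service_account
from google.auth.transport.requests import AuthorizedSession
import gspread

# Read-only access to Sheets is enough for this step
scopes = ["https://www.googleapis.com/auth/spreadsheets.readonly"]

# Load the service account key downloaded from the console
credentials = service_account.Credentials.from_service_account_file(
    "project-gsheet-viewer.json", scopes=scopes
)

# Wire the credential into gspread through an authorized session
gc = gspread.Client(auth=credentials)
gc.session = AuthorizedSession(credentials)

# Open the sheet by its id and pull every cell as a list of rows
sheet = gc.open_by_key("your-google-sheet-id")
list_of_lists = sheet.sheet1.get_all_values()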

Wait, what happened? Hmm, it looks like we forgot an important step: we need to share our Google Sheet with our credential. If you open the credential’s JSON file, you will find a client_email field; that is what you need. Share the sheet with that client email and re-run your code (it should work now).

If you print list_of_lists after sharing the sheet, you should see your data as a list of rows, with the first row holding the column headers.

After the data is stored in the list_of_lists variable, it is time to load it into pandas.

I did some transformations on the column names so we can push the data into BigQuery smoothly.
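The original transformation code is not embedded here, so treat this as an illustrative sketch, assuming the first row of list_of_lists holds the headers: the headers are normalized to the letters, digits, and underscores that BigQuery column names allow.

import pandas as pd

# The first row holds the headers, the remaining rows hold the data
df = pd.DataFrame(list_of_lists[1:], columns=list_of_lists[0])

# BigQuery column names may only contain letters, digits, and underscores,
# so normalize the headers before pushing
df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(r"[^0-9a-zA-Z_]", "_", regex=True)
)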

Finally, we can inspect the data easily in pandas. Then how do we push the data to BigQuery? Hmm … here is the one-line code:

df.to_gbq('your-dataset.your-table-target', project_id='your-project-id', if_exists='replace')

What is a dataset here? A dataset is roughly what a schema is in a common database (like public in PostgreSQL); in BigQuery we need a dataset to hold our tables. You can read more about to_gbq in the pandas-gbq documentation.
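If the dataset does not exist yet, one way to create it from Python is with the google.cloud.bigquery client we imported earlier. This is a sketch assuming a service account key with BigQuery permissions (the filename project-bigquery-writer.json is a made-up placeholder; the permission setup is covered just below).

from google.cloud import bigquery
from google.oauth2 import service_account

# Hypothetical key file for a service account with BigQuery access
bq_credentials = service_account.Credentials.from_service_account_file(
    "project-bigquery-writer.json"
)
client = bigquery.Client(project="your-project-id", credentials=bq_credentials)

# Create the dataset only if it is not already there
dataset = bigquery.Dataset("your-project-id.your-dataset")
client.create_dataset(dataset, exists_ok=True)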

If you run that line, I think you will get an error saying ‘Access denied: … User does not have bigquery.tables.get permission for table …’.

Well, to solve that error we need to add authentication so our client code can access BigQuery. So we need to create another service account credential to make sure we have access to BigQuery (choose the right role when you create each service account; you can read more at https://cloud.google.com/docs/authentication/production and about BigQuery roles at https://cloud.google.com/bigquery/docs/access-control).

Make sure that you have the right roles to push your data to BigQuery. If you get an error about Google authentication, you can visit https://google-auth.readthedocs.io/en/latest/user-guide.html.
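With the roles in place, a minimal way to wire the new service account into the push is to pass its credential to to_gbq directly (to_gbq accepts a credentials argument via pandas-gbq; the key filename is the same made-up placeholder as above).

from google.oauth2 import service_account

# Service account key with BigQuery access (hypothetical filename)
bq_credentials = service_account.Credentials.from_service_account_file(
    "project-bigquery-writer.json"
)

# Pass the credential explicitly so pandas-gbq does not fall back to
# whatever default credentials happen to be on the machine
df.to_gbq(
    "your-dataset.your-table-target",
    project_id="your-project-id",
    if_exists="replace",
    credentials=bq_credentials,
)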

Re-run the df.to_gbq line and wait for the upload progress bar to finish.

Finally, you can check your BigQuery dashboard and see the new table there.

Now you can write queries with SQL syntax or apply BigQuery ML to your data.
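For example, the google.cloud.bigquery client from our imports can run a quick sanity check against the freshly loaded table; this sketch reuses the placeholder names from above.

from google.cloud import bigquery
from google.oauth2 import service_account

bq_credentials = service_account.Credentials.from_service_account_file(
    "project-bigquery-writer.json"
)
client = bigquery.Client(project="your-project-id", credentials=bq_credentials)

# Count the rows we just loaded
query = """
SELECT COUNT(*) AS row_count
FROM `your-project-id.your-dataset.your-table-target`
"""
for row in client.query(query).result():
    print(row.row_count)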

I am sharing this because I ran into some errors and was confused about setting up Google auth at the beginning. Hopefully it will help you (and me) if we hit these errors again in the future. Thank you!
