Data Extraction from GitHub and Auto-run or Schedule Python Script

First Part of the Blog —
Automated ETL for LIVE Tableau Public Visualizations

Eklavya Saxena
6 min read · Apr 10, 2020
https://public.tableau.com/profile/eklavya.saxena#!/vizhome/CoronavirusCOVID-19CasesWorldwide/COVID-19_Country

This blog is a part of Automated ETL for LIVE Tableau Public Visualizations and is sub-divided into two parts, namely:

  1. Extract Data from Raw .csv Files of GitHub User Content
  2. Automate Python Scripts with Task Scheduler on Windows

1. Extract Data from Raw .csv Files of GitHub User Content

Data Sourced from JHU CSSE GitHub Repository: https://github.com/CSSEGISandData/COVID-19

import numpy as np
import pandas as pd

Imported the relevant libraries.

df_names = ['confirmed_global', 'deaths_global', 'recovered_global'] 
df_list = [pd.DataFrame() for df in df_names]
df_dict = dict(zip(df_names, df_list))

As we need to extract 3 .csv files, I created a list ‘df_names’ holding the names of the dataframes. Note that these names are chosen to match the URL suffixes of the source files, as explained below.

Then, a list ‘df_list’ is declared to hold 3 empty dataframes. Note that the comprehension for df in df_names simply ensures that the number of empty dataframes created equals the number of elements in ‘df_names’.

Then, a dictionary ‘df_dict’ is created with key: value pairs of ‘df_names: df_list’. That is, each name is linked to an empty dataframe at the respective position.
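For readers who prefer a single step, the same dictionary of empty dataframes can be built with a dict comprehension. This is a minimal sketch, equivalent to the zip-based construction above:

```python
import pandas as pd

df_names = ['confirmed_global', 'deaths_global', 'recovered_global']

# one empty DataFrame per name; equivalent to zip(df_names, df_list)
df_dict = {name: pd.DataFrame() for name in df_names}

print(list(df_dict.keys()))               # the three names become the keys
print(df_dict['confirmed_global'].empty)  # each value starts empty -> True
```

Either construction works; the comprehension just avoids building the intermediate list.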

url_part = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_'

A base URL ‘url_part’ is declared. The dataframe names declared above become suffixes of this URL part: appending a name and ‘.csv’ yields the raw file URL for that dataframe.

for key, value in df_dict.items():
    # read the raw .csv from GitHub user content
    value = pd.read_csv(url_part + key + '.csv')

    # make the column names Python friendly
    value.rename(columns={'Province/State': 'Province_State',
                          'Country/Region': 'Country_Region'}, inplace=True)

    dim_col = value.columns[0:4]   # Province_State, Country_Region, Lat, Long
    date_col = value.columns[4:]   # one column per date

    # unpivot from wide to long format
    value = value.melt(id_vars=dim_col, value_vars=date_col,
                       var_name='Date', value_name=key)

    value['Date'] = pd.to_datetime(value['Date'])

    df_dict[key] = value

In order to load the data, I loop through the items in the dictionary ‘df_dict’ declared above. The empty dataframes are fed one by one using the pandas read_csv function, which reads the data from the URL formed by url_part + key (the respective name declared) + ‘.csv’. Then, the column names are renamed to make them Python friendly.

Now comes the interesting part: the pandas melt function, which did the magic of transforming the data for my Tableau dashboard. Basically, the melt function unpivots a DataFrame from wide to long format.

So, the melt function unpivots the 77 date columns from wide to long format: the headings of those columns are fed into a new ‘Date’ column (created by the value_vars = date_col, var_name = 'Date' parameters), and the numeric data is fed into a new column named after the key, such as ‘confirmed_global’ (created by the value_name = key parameter).
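The wide-to-long reshape can be seen on a toy frame with the same column layout as the JHU files (the values here are purely illustrative):

```python
import pandas as pd

# toy frame in the same wide shape: four dimension columns, then one column per date
wide = pd.DataFrame({
    'Province_State': ['Hubei'],
    'Country_Region': ['China'],
    'Lat': [30.97],
    'Long': [112.27],
    '1/22/20': [444],
    '1/23/20': [444],
})

dim_col = wide.columns[0:4]
date_col = wide.columns[4:]

# unpivot: each date column becomes a row, its header lands in 'Date'
long = wide.melt(id_vars=dim_col, value_vars=date_col,
                 var_name='Date', value_name='confirmed_global')

print(long.shape)  # (2, 6): 1 row x 2 date columns -> 2 rows, 6 columns
```

With the real files, the same call turns the 77 date columns into 77 rows per location.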

Then, the ‘Date’ column data type is changed to datetime and the key of the dictionary declared before is assigned with respective loaded and transformed dataframe.

join_on_col = ['Province_State', 'Country_Region', 'Lat', 'Long', 'Date']

df_COVID = df_dict['confirmed_global'].merge(df_dict['deaths_global'], on=join_on_col, how='outer') \
                                      .merge(df_dict['recovered_global'], on=join_on_col, how='outer')

df_COVID.rename(columns={'confirmed_global': 'Confirmed', 'deaths_global': 'Deaths', 'recovered_global': 'Recovered'}, inplace=True)

Now, the pandas merge function is used to merge the 3 dataframes on their common columns: ‘Province_State’, ‘Country_Region’, ‘Lat’, ‘Long’, and ‘Date’. Then, the columns of the merged dataframe ‘df_COVID’ are renamed.
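On toy one-row frames, the chained outer merge behaves like this (the values are illustrative only):

```python
import pandas as pd

join_on_col = ['Province_State', 'Country_Region', 'Lat', 'Long', 'Date']

base = {'Province_State': ['Hubei'], 'Country_Region': ['China'],
        'Lat': [30.97], 'Long': [112.27], 'Date': ['1/22/20']}

confirmed = pd.DataFrame({**base, 'confirmed_global': [444]})
deaths    = pd.DataFrame({**base, 'deaths_global': [17]})
recovered = pd.DataFrame({**base, 'recovered_global': [28]})

# an outer merge keeps every key combination, even if a frame is missing it
df = (confirmed
      .merge(deaths, on=join_on_col, how='outer')
      .merge(recovered, on=join_on_col, how='outer'))

print(list(df.columns)[-3:])  # the three measure columns end up side by side
```

The how='outer' choice matters because a location/date present in one file but missing from another still gets a row (with NaN in the missing measure), which is exactly what the NaN handling below cleans up.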

# to fill the NaN in the 'Province_State' column with the country name from 'Country_Region'
df_COVID['Province_State'] = np.where(df_COVID['Province_State'].isna(), df_COVID['Country_Region'], df_COVID['Province_State'])
# to fill the NaN in the last three columns with 0
df_COVID.iloc[:, -3:] = df_COVID.iloc[:, -3:].fillna(0)

These last lines of code remove NaN values from the dataframe ‘df_COVID’ as and where required. Later, this dataframe can be extracted to a .csv file using the to_csv function of pandas.
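A minimal sketch of this NaN handling on a toy frame (the rows and file name are hypothetical, for illustration only):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Province_State': [np.nan, 'Hubei'],
    'Country_Region': ['Italy', 'China'],
    'Confirmed': [2.0, 444.0],
    'Deaths': [np.nan, 17.0],
    'Recovered': [np.nan, 28.0],
})

# a missing Province_State falls back to the country name;
# note isna() is used because comparing against the string 'nan'
# would never match a real missing value
df['Province_State'] = np.where(df['Province_State'].isna(),
                                df['Country_Region'], df['Province_State'])

# missing counts in the last three columns become 0
df.iloc[:, -3:] = df.iloc[:, -3:].fillna(0)

print(df.loc[0, 'Province_State'])  # Italy

# optional export to csv (path is hypothetical)
# df.to_csv('df_COVID.csv', index=False)
```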

But I planned to export it to Google Sheets instead; to see how, follow the second part of this blog, Automated ETL for LIVE Tableau Public Visualizations.

2. Automate Python Scripts with Task Scheduler on Windows

Now that I had created a .py python script to ETL (Extract, Transform and Load) the data, I realized that the GitHub repository used to source the data is updated daily.

Searching for a way to run the python script daily, I came across the blog Automate your Python Scripts with Task Scheduler written by Vincent Tatan. However, if you are a macOS or Linux user, please refer to Scheduling Jobs With Crontab on macOS written by Ratik Sharma.

This section is briefly curated from the former blog:

  1. Create Windows Executable .bat File to Run Python
  2. Configure Task in Windows Task Scheduler
  3. Run the .bat File from the Task Created in Task Scheduler

2.1 Create Windows Executable .bat File to Run Python

A BAT file is a DOS batch file used to execute commands with the Windows Command Prompt (cmd.exe). It contains a series of line commands that typically might be entered at the DOS command prompt. BAT files are most commonly used to start programs and run maintenance utilities within Windows. Source: https://fileinfo.com/extension/bat

Create a new .bat file (e.g. etl-covid-data.bat) and edit it to write your command in the following format:

<python.exe location> <python script location>

C:\Users\eklav\AppData\Local\Programs\Python\Python37\python.exe "C:\Users\eklav\Documents\My Tableau Repository\Datasources\COVID-DataSource\COVID-19\COVID-DataExtraction.py"

Please ensure that your Python37\Lib\site-packages folder has all the relevant libraries or modules installed. Otherwise, execute pip install <module> in cmd.exe to download the required dependencies.

For debugging: save and run this .bat file by double-clicking it. You may prepend cmd /k to the command declared in the .bat file; this keeps the Windows Command Prompt (cmd.exe) window open after the .bat file finishes executing.

2.2 Configure Task in Windows Task Scheduler

  1. Search for and open ‘Task Scheduler’ from the ‘Start’ menu
  2. Click ‘Create Basic Task…’ under the ‘Actions’ tab on the right side of the ‘Task Scheduler’ window
    Declare the ‘Name:’ (e.g. COVID-ETL-PyScriptRun) and ‘Description:’ (e.g. This task will execute the python script required for COVID live updates)
  3. Task Trigger:
    Choose ‘When do you want the task to start?’ (e.g. Daily)
  4. Declare the frequency parameters for the previous selection
  5. Action:
    Choose ‘What action do you want the task to perform?’ (e.g. Start a program)
  6. Declare ‘Program/script:’ by browsing to the .bat file location created earlier. Additionally, you may set ‘Start in (optional):’ to the location of your application folder to access the relevant dependencies
  7. Finish:
    This shows the ‘Summary’ of all the declarations above. You may tick the checkbox ‘Open the Properties dialog for this task when I click Finish’ to discover helpful changes to your scheduled task. One edit that helped when my code broke: on the ‘General’ tab of the task properties, I checked ‘Run with highest privileges’

Following is the .gif animation for the 7 steps mentioned above:

[GIF animation for the 7 steps mentioned above]


Eklavya Saxena

Data Science proficient graduate from Northeastern University with experience in interpreting & analyzing data to drive growth