Data Extraction from GitHub and Auto-run or Schedule Python Script
First Part of the Blog —
Automated ETL for LIVE Tableau Public Visualizations
This blog is part of Automated ETL for LIVE Tableau Public Visualizations and is divided into two parts:
- Extract Data from Raw .csv Files of GitHub User Content
- Automate Python Scripts with Task Scheduler on Windows
1. Extract Data from Raw .csv Files of GitHub User Content
Data Sourced from JHU CSSE GitHub Repository: https://github.com/CSSEGISandData/COVID-19
import numpy as np
import pandas as pd
First, import the relevant libraries.
df_names = ['confirmed_global', 'deaths_global', 'recovered_global']
df_list = [pd.DataFrame() for df in df_names]
df_dict = dict(zip(df_names, df_list))
Since we need to extract 3 .csv files, I created a list ‘df_names’ holding the names of the dataframes. Note that the names are chosen to match the url_part explained below *.
Then, a list ‘df_list’ is declared to hold 3 empty dataframes. Note that the for df in df_names
in the comprehension simply ensures that the number of empty dataframes created equals the number of elements in the list ‘df_names’.
Then, a dictionary ‘df_dict’ is created with ‘df_names: df_list’ as its key: value pairs. That is, each name is linked to an empty dataframe at the corresponding position.
url_part = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_'
A URL part is created as required. * The names declared above will become suffixes of the url_part
declared.
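For instance, joining url_part with one of the names declared earlier yields the full raw-content URL. A minimal sketch using the ‘confirmed_global’ name from above:

```python
url_part = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_'

# Build the full URL for one of the three files: url_part + name + '.csv'
url = url_part + 'confirmed_global' + '.csv'
print(url)
```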
for key, value in df_dict.items():
    value = pd.read_csv(url_part + key + '.csv', parse_dates=[0])
    value.rename(columns={'Province/State': 'Province_State', 'Country/Region': 'Country_Region'}, inplace=True)
    dim_col = value.columns[0:4]
    date_col = value.columns[4:]
    value = value.melt(id_vars=dim_col, value_vars=date_col, var_name='Date', value_name=key)
    value['Date'] = pd.to_datetime(value['Date'])
    df_dict[key] = value
To load the data, I loop through the items of the dictionary ‘df_dict’ declared above. The empty dataframes are filled one by one using the pandas read_csv
function, which reads the data from the URL built as url_part
+ key (the respective name declared) + ‘.csv’. Then, the column names are renamed to make them Python friendly.
Now comes the interesting part: the pandas melt
function, which did the magic, enabling me to transform the data for my Tableau dashboard. Basically, the melt
function unpivots a dataframe from wide to long format. Refer to the debug print output below:
So, the melt
function unpivots the 77 date columns from wide format to long format: the headings of these columns are fed into a new column ‘Date’ (created via the value_vars = date_col, var_name = 'Date'
parameters), and the numeric data is fed into a new column ‘confirmed_global’ (created via the value_name = key
parameter of the melt
function).
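To make the wide-to-long transformation concrete, here is a minimal, self-contained sketch with a toy two-row, two-date frame. The column names mimic the JHU layout; the values are made up:

```python
import pandas as pd

# Toy wide-format frame: 4 dimension columns followed by date columns
df = pd.DataFrame({
    'Province_State': ['Hubei', 'Ontario'],
    'Country_Region': ['China', 'Canada'],
    'Lat': [30.97, 51.25],
    'Long': [112.27, -85.32],
    '1/22/20': [444, 0],
    '1/23/20': [444, 0],
})

dim_col = df.columns[0:4]   # identifier columns kept as-is
date_col = df.columns[4:]   # date columns to be unpivoted
long_df = df.melt(id_vars=dim_col, value_vars=date_col,
                  var_name='Date', value_name='confirmed_global')

# 2 rows x 2 date columns unpivot into 4 rows
print(long_df.shape)
```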
Then, the ‘Date’ column’s data type is changed to datetime, and each key of the dictionary declared before is assigned its respective loaded and transformed dataframe.
join_on_col = ['Province_State', 'Country_Region', 'Lat', 'Long', 'Date']
df_COVID = df_dict['confirmed_global'].merge(df_dict['deaths_global'], on=join_on_col, how='outer').merge(df_dict['recovered_global'], on=join_on_col, how='outer')
df_COVID.rename(columns={'confirmed_global': 'Confirmed', 'deaths_global': 'Deaths', 'recovered_global': 'Recovered'}, inplace=True)
Now, the pandas merge
function is used to merge the 3 different files, which have ‘Province_State’, ‘Country_Region’, ‘Lat’, ‘Long’, and ‘Date’ in common. Then, the column names of the merged dataframe ‘df_COVID’ are renamed.
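As a small illustration of how the outer merge stitches the frames together on their shared columns, here is a toy sketch (made-up values, and only two join keys for brevity):

```python
import pandas as pd

# Toy long-format frames sharing 'Country_Region' and 'Date'
confirmed = pd.DataFrame({'Country_Region': ['China', 'Canada'],
                          'Date': ['1/22/20', '1/22/20'],
                          'confirmed_global': [444, 0]})
deaths = pd.DataFrame({'Country_Region': ['China', 'Canada'],
                       'Date': ['1/22/20', '1/22/20'],
                       'deaths_global': [17, 0]})

# Outer merge keeps every (Country_Region, Date) pair from both frames
merged = confirmed.merge(deaths, on=['Country_Region', 'Date'], how='outer')
print(merged.columns.tolist())
```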
# to fill the NaN in the 'Province_State' column with country names from 'Country_Region'
df_COVID['Province_State'] = np.where(df_COVID['Province_State'].isna(), df_COVID['Country_Region'], df_COVID['Province_State'])

# to fill the NaN in the last three columns
df_COVID.iloc[0:, -3:] = df_COVID.iloc[0:, -3:].fillna(0)
These last lines of code remove NaN from the dataframe ‘df_COVID’ as and where required. Later, this dataframe can be extracted to a .csv file using the to_csv
function of pandas.
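If you do want a local .csv, this is a one-line export. A minimal sketch with a toy stand-in for df_COVID (the file name is only an example):

```python
import pandas as pd

# Toy stand-in for the final merged dataframe
df = pd.DataFrame({'Confirmed': [1], 'Deaths': [0], 'Recovered': [0]})

# index=False drops the row index from the output file
df.to_csv('df_COVID.csv', index=False)
```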
But I planned to export it to Google Sheets — check how by following the second part of blog — Automated ETL for LIVE Tableau Public Visualizations
2. Automate Python Scripts with Task Scheduler on Windows
Now that I have created a .py python script file to ETL (Extract, Transform and Load) the data, I realized that the GitHub repository used to source the data is updated daily.
Searching for a way to run the Python script daily, I came across the blog Automate your Python Scripts with Task Scheduler, written by Vincent Tatan. If you are a macOS or Linux user, refer instead to Scheduling Jobs With Crontab on macOS, written by Ratik Sharma.
This section is briefly curated from the former blog:
- Create Windows Executable .bat File to Run Python
- Configure Task in Windows Task Scheduler
2.1 Create Windows Executable .bat File to Run Python
A BAT file is a DOS batch file used to execute commands with the Windows Command Prompt (cmd.exe). It contains a series of line commands that typically might be entered at the DOS command prompt. BAT files are most commonly used to start programs and run maintenance utilities within Windows. Source: https://fileinfo.com/extension/bat
Create a new .bat file (e.g. etl-covid-data.bat) and edit it to write your command in the following format:
<python.exe location> <python script location>
C:\Users\eklav\AppData\Local\Programs\Python\Python37\python.exe "C:\Users\eklav\Documents\My Tableau Repository\Datasources\COVID-DataSource\COVID-19\COVID-DataExtraction.py"
Please ensure that your Python37\Lib\site-packages
folder has all the relevant libraries or modules installed. Otherwise, execute pip install <module>
in cmd.exe to download the required dependencies.
For debugging: save and run this .bat file by double-clicking it. You may prepend cmd /k
to the command declared in the .bat file. This keeps the Windows Command Prompt (cmd.exe) window open after the .bat file executes.
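For example, with the paths used above, the debugging variant of the .bat file would contain this single line (a sketch; your Python and script paths will differ):

```bat
cmd /k C:\Users\eklav\AppData\Local\Programs\Python\Python37\python.exe "C:\Users\eklav\Documents\My Tableau Repository\Datasources\COVID-DataSource\COVID-19\COVID-DataExtraction.py"
```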
2.2 Configure Task in Windows Task Scheduler
- Search and open ‘Task Scheduler’ from the ‘Start’ menu
- Click on ‘Create Basic Task…’ located under the ‘Actions’ tab on the right side of the ‘Task Scheduler’
- Declare the ‘Name:’ (e.g. COVID-ETL-PyScriptRun) and ‘Description:’ (e.g. This task will execute the python script required for COVID live updates)
- Task Trigger: choose ‘When do you want the task to start?’ (e.g. Daily)
- Declare the frequency parameters for the previous selection
- Action: choose ‘What action do you want the task to perform?’ (e.g. Start a program), then declare ‘Program/script:’ by browsing to the .bat file location created earlier. Additionally, you may set ‘Start in (optional):’ to the location of your application folder to access the relevant dependencies
- Finish: this shows the ‘Summary’ of all the declarations above. You may tick the checkbox ‘Open the Properties dialog for this task when I click Finish’ to discover interesting and helpful changes to your scheduled task. One edit that helped when my code broke: on the ‘General’ tab of the task properties, I checked ‘Run with highest privileges’
Following is a .gif animation of the 7 steps mentioned above:
Thank you for reading! I hope this blog revealed an interesting aspect of how to automate Python scripts on Windows. Let me know in a comment whether or not this helped. If this article was helpful, share it.