Code a Python script to scrape sales data from multiple PDF files, clean it, and consolidate the results into a single CSV file that is sent to Google Drive

Description

This is the third Data Source for the Data Pipeline Project. We wrote a Python script to scrape hundreds of PDF files that share the same layout. Some of the graphs in these files yield unstructured data when extracted, which the script must capture and refine.
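A minimal sketch of the capture-and-refine step, under stated assumptions: the ticket does not name the PDF library or the report fields, so pypdf, the `product`/`units`/`revenue` fields, and the line shape matched by `ROW_RE` are all hypothetical. Lines that do not match the expected shape (for example, stray chart labels) are skipped.

```python
import re
from pathlib import Path

# Assumed row shape, e.g. "Widget A   1,200   $3,450.50".
# The real layout is not described in the ticket.
ROW_RE = re.compile(
    r"^(?P<product>.+?)\s+(?P<units>[\d,]+)\s+\$?(?P<revenue>[\d,]+\.\d{2})$"
)


def extract_text(path: Path) -> str:
    """Return the text of every page in one PDF (pypdf is an assumed choice)."""
    from pypdf import PdfReader  # third-party; imported lazily

    reader = PdfReader(str(path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def parse_rows(text: str, source: str) -> list[dict]:
    """Keep only lines that look like sales rows; tag each with its source file."""
    rows = []
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match is None:
            continue  # unstructured residue such as axis labels or legends
        rows.append(
            {
                "source_file": source,
                "product": match["product"].strip(),
                "units": int(match["units"].replace(",", "")),
                "revenue": float(match["revenue"].replace(",", "")),
            }
        )
    return rows
```

In use, something like `parse_rows(extract_text(p), p.name)` would be called for each `p` in `Path("reports").glob("*.pdf")`, where `reports` is a placeholder folder name.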

Using Python libraries, we parse the content and enrich it to populate a data structure, which is then consolidated into a single CSV file. That file is subsequently sent to the assigned Google Drive folder as a Google Sheet.
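The consolidation and upload steps could look roughly like the sketch below. It is an illustration, not the script from this ticket: the column names in `FIELDS` are hypothetical, and the upload assumes the google-api-python-client package with a credentials object (`creds`) obtained elsewhere. Setting the target `mimeType` to the Google Sheets type asks Drive to convert the uploaded CSV on import.

```python
import csv
import io

# Hypothetical column layout; the real fields are not listed in the ticket.
FIELDS = ["source_file", "product", "units", "revenue"]


def rows_to_csv(rows) -> str:
    """Serialize parsed rows (dicts keyed by FIELDS) into one CSV document."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def upload_as_gsheet(csv_text: str, name: str, folder_id: str, creds) -> str:
    """Upload CSV text to a Drive folder, converting it to a Google Sheet.

    Assumes google-api-python-client is installed and `creds` is a valid
    credentials object with Drive scope. Returns the new file's ID.
    """
    from googleapiclient.discovery import build
    from googleapiclient.http import MediaIoBaseUpload

    service = build("drive", "v3", credentials=creds)
    media = MediaIoBaseUpload(
        io.BytesIO(csv_text.encode("utf-8")), mimetype="text/csv"
    )
    metadata = {
        "name": name,
        # Requesting the Sheets type triggers CSV-to-Sheet conversion.
        "mimeType": "application/vnd.google-apps.spreadsheet",
        "parents": [folder_id],
    }
    created = (
        service.files()
        .create(body=metadata, media_body=media, fields="id")
        .execute()
    )
    return created["id"]
```

Building the CSV in memory keeps the two steps separable: `rows_to_csv` can be checked without network access, and the same text can also be written to disk if a local copy is wanted.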

Child issues

XDP-51 Code the Script | Priority: Medium | Status: Done
XDP-52 Enhance Notebook comments | Priority: Medium | Status: Done
XDP-53 Test the script with 100 sample PDFs | Priority: Medium | Status: Done
XDP-54 Enable connection to Cloud DB | Priority: Medium | Status: To Do
