© AArete 2025
Context:
- Numerous public data sources are essential to the company's consulting practices
- The data has a degree of variability that requires creating and maintaining custom data pipelines
- Purpose: Automate the manual quarterly update of reference databases using Airflow DAGs
Overall Approach
- Develop an individual DAG for each reference file
- Create a main DAG that:
  - Checks each target table using max(audit_key) to determine whether current-quarter data is present
  - Triggers the respective DAG if the data is outdated
  - Runs daily to simplify scheduling and monitoring
  - Consolidates success and failure alerting in a single place
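The main DAG's freshness check can be sketched in plain Python as below. This is a minimal illustration, not the actual implementation: it assumes audit_key encodes the quarter as an integer YYYYQ (e.g. 20252), which the source does not state, and the names quarter_key, dags_to_trigger, and the table and DAG IDs are hypothetical.

```python
from datetime import date
from typing import Dict, List, Optional


def quarter_key(day: date) -> int:
    """Encode the quarter of a date as YYYYQ (ASSUMED audit_key format)."""
    return day.year * 10 + (day.month - 1) // 3 + 1


def dags_to_trigger(
    max_keys: Dict[str, Optional[int]],
    dag_for_table: Dict[str, str],
    today: date,
) -> List[str]:
    """Return the DAG IDs whose target table lacks current-quarter data.

    max_keys maps each table name to its max(audit_key), or None if the
    table is empty. dag_for_table maps each table name to the DAG that
    refreshes it (hypothetical names).
    """
    current = quarter_key(today)
    return [
        dag_for_table[table]
        for table, key in max_keys.items()
        if key is None or key < current
    ]


if __name__ == "__main__":
    stale = dags_to_trigger(
        {"ref_codes": 20251, "ref_rates": 20252},
        {"ref_codes": "load_ref_codes", "ref_rates": "load_ref_rates"},
        today=date(2025, 5, 14),
    )
    print(stale)
```

In Airflow, the max(audit_key) values would come from a query against each target table, and each returned ID could be handed to a TriggerDagRunOperator; both of those wiring details are left out of this sketch.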
Process of each DAG
- Scrape the source website to find the download link for the zip archive
- Extract the target file from the zip archive into a temporary folder
- Upload the file to the S3 bucket
- Call the stored procedure and send an email reporting success or failure
- QC check
- Clean up temporary folder
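The scraping, extraction, and cleanup steps above can be sketched with the Python standard library alone. This is an illustrative sketch under stated assumptions: the function names find_zip_links and process_archive are hypothetical, the download link is assumed to be an anchor tag whose href ends in .zip, and the S3 upload is passed in as a callable rather than implemented here.

```python
import shutil
import tempfile
import zipfile
from html.parser import HTMLParser
from pathlib import Path
from typing import Callable, List
from urllib.parse import urljoin


class _ZipLinkParser(HTMLParser):
    """Collect href values of <a> tags that point at .zip files."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().endswith(".zip"):
                self.links.append(value)


def find_zip_links(html: str, base_url: str) -> List[str]:
    """Return absolute URLs of all .zip links found in the page HTML."""
    parser = _ZipLinkParser()
    parser.feed(html)
    return [urljoin(base_url, link) for link in parser.links]


def process_archive(
    zip_path: str, target_name: str, upload: Callable[[Path], None]
) -> None:
    """Extract one file from the zip into a temp folder, upload it, clean up.

    upload is a placeholder for the S3 step (ASSUMPTION: any callable that
    accepts the extracted file's path).
    """
    tmp_dir = Path(tempfile.mkdtemp(prefix="ref_data_"))
    try:
        with zipfile.ZipFile(zip_path) as archive:
            extracted = Path(archive.extract(target_name, path=str(tmp_dir)))
        upload(extracted)
    finally:
        # The temporary folder is removed even if extraction or upload fails.
        shutil.rmtree(tmp_dir, ignore_errors=True)
```

Wrapping the cleanup in try/finally is one way to keep a failed run from leaving stale files behind; in an Airflow DAG the same effect could instead be achieved with a separate cleanup task that runs regardless of upstream status.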
