
Reddit ELT Pipeline


A data pipeline that extracts data from a specific subreddit via the Reddit API and feeds a Google Data Studio report. The project was motivated by interest in the subject and a desire to build skills, so it deliberately adds tools such as dbt, Airflow, Docker, and cloud-based storage for extra complexity.

Architecture

(architecture diagram)


Overview

  1. Data Extraction from Reddit API
  2. Managing AWS Resources with Terraform
  3. Data Upload to S3
  4. Data Loading to Redshift Serverless
  5. Data Transformation with dbt
  6. Apache Airflow Orchestration in Docker
  7. Data Visualization with Google Looker Studio
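Step 1 above, extraction from the Reddit API, can be sketched as follows. This is a minimal illustration rather than the repo's actual code, and the post fields picked out (`id`, `title`, `score`, `created_utc`) are assumptions about what the pipeline keeps:

```python
# Sketch of the extraction step: flatten a subreddit listing payload
# (the JSON shape the Reddit API returns for e.g. /r/<subreddit>/top)
# into plain row dicts ready to be written to CSV and uploaded to S3.

def flatten_listing(listing: dict) -> list[dict]:
    """Turn a Reddit API listing payload into a list of flat row dicts."""
    rows = []
    for child in listing.get("data", {}).get("children", []):
        post = child.get("data", {})
        rows.append({
            "id": post.get("id"),
            "title": post.get("title"),
            "score": post.get("score"),
            "created_utc": post.get("created_utc"),
        })
    return rows
```

In the real pipeline these rows would then be serialized and pushed to S3 before the Redshift load.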

Result

Link result. P.S. To avoid incurring further costs, the data for this chart was taken from a CSV file. For how to export a Redshift table to a CSV file, see this file.
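The linked file covers the Redshift-specific export; as a generic sketch (assuming the rows have already been fetched with any Redshift client), writing a result set out with Python's `csv` module looks like this:

```python
import csv

def rows_to_csv(rows, columns, path):
    """Write query-result rows (sequences of values) to a CSV file with a header."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(columns)   # header row
        writer.writerows(rows)     # one line per fetched row
```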


Setup

  • Clone the repository:
git clone https://github.com/Baitur5/reddit_api_elt.git
cd reddit_api_elt
  • Get Reddit API keys from here
  • You must have the AWS CLI configured
  • Create a variables.tf file in the root directory and set up the following variables:
variable "s3_bucket" {
  type    = string
  default = ""
}

variable "namespace_name" {
  type    = string
  default = ""
}

variable "workgroup_name" {
  type    = string
  default = ""
}

# password for db admin
variable "password" {
  type    = string
  default = ""
}
  • Initialize Terraform and create resources:
terraform init
terraform apply
  • Create a .env file in the root directory and set up the following variables:
AIRFLOW_UID=
# client id from the Reddit API
CLIENT_ID=''
# secret key from the Reddit API
SECRET_KEY=''
IAM_ROLE=''
S3_BUCKET=''
SUBREDDIT=''
WORKGROUP=''
DBT_TABLE=''
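Inside the containers these settings are read from the environment; a minimal sketch of fail-fast config loading (the variable list matches the .env above, but the helper itself is illustrative, not the repo's code):

```python
import os

# Names match the .env file above; AIRFLOW_UID is consumed by Docker itself.
REQUIRED_VARS = [
    "CLIENT_ID", "SECRET_KEY", "IAM_ROLE",
    "S3_BUCKET", "SUBREDDIT", "WORKGROUP", "DBT_TABLE",
]

def load_config() -> dict:
    """Read the pipeline's settings from the environment, failing fast on gaps."""
    missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in REQUIRED_VARS}
```

Failing fast here surfaces a half-filled .env at startup instead of midway through a DAG run.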
  • Set up Airflow in Docker:
docker compose up airflow-init
docker compose up

Contributions and Issues

Contributions and pull requests are welcome, as are issue reports. Please handle sensitive information such as API keys and credentials securely.

Where inspiration came from:

Thanks to this repo for providing inspiration and clear documentation to learn from.