Building an End-to-End Data Engineering Project with Reddit API Data using AWS S3, Redshift, Tableau, Airflow in Docker, and Terraform

Introduction

Mohammad Obaidullah Tusher
23 min read · May 20, 2023

Data engineering is a core part of today's technology landscape. It allows organizations to process and analyze vast amounts of data, leading to critical insights and informed decision-making.

Motivation

As a data enthusiast, I’ve always been passionate about exploring new technologies and building end-to-end projects that display my skills. In this blog post, I’ll take you on a journey through my latest project, where I’ve harnessed the power of AWS S3, Redshift, Tableau, Airflow in Docker, and Terraform to analyze Reddit API statistics. Through this project, I’ve not only enhanced my knowledge of these tools but also learned new skills that I’ll be sharing with you. So, let’s dive in!

I’m going to set up a data pipeline to extract some Reddit data from the r/datascience sub-Reddit.

This project was inspired by the interest in data science and the Q&A content on the official sub-Reddit!

Architecture

  1. Extract data using the Reddit API.
  2. Load it into an Amazon Web Services (AWS) S3 bucket.
  3. Transform it with dbt.
  4. Copy it into AWS Redshift.
  5. Build a dashboard with Tableau.
  6. Orchestrate the pipeline with Airflow in Docker.
  7. Create the AWS resources with Terraform.

Output

Tableau Dashboard

In this project, I built a live Tableau visualization connected to an active Amazon Redshift cluster, which gave me useful insights into the data. However, I also wanted to offer another way to keep using the visualization without paying for a live database.

To do this, I switched the data source from the live database to a static CSV file. This means anyone can explore the Tableau visualization without the cost of keeping a database running.

Follow this link to see the Tableau visualization on Tableau Public. This visualization gives people interactive and insightful views of the data, so they can learn more about the information on the r/datascience sub-Reddit and discover new things.

Table of Contents

  1. Overview
  2. Reddit API Configuration
  3. AWS Account
  4. Infrastructure with Terraform
  5. Configuration Details
  6. Docker & Airflow
  7. dbt
  8. Dashboard
  9. Final Notes & Termination

GitHub — tusher16/reddit-api-analytics-ELT

Overview:

The ability to extract, process, and analyze data is absolutely necessary in the data-driven world that we live in today. As someone who is passionate about data, one of my goals was to build a pipeline that would not only provide a dashboard but also give me the opportunity to learn new skills and get experience with a variety of technologies. Python’s PRAW API wrapper, Amazon Web Services S3 (cloud storage), Amazon Web Services Redshift (cloud data warehousing), Apache Airflow (orchestration tool), and dbt (data transformation tool) are just some of the technologies that are utilized in the pipeline that I developed.

While the pipeline runs, data is retrieved from Reddit's application programming interface (API) and saved to a CSV file. The whole workflow is defined as an Airflow DAG (Directed Acyclic Graph).

The extracted data contains a variety of fields, including post ID, author name, and others. The DAG is configured to retrieve data from the most recent twenty-four hours.

The CSV file is then loaded into an Amazon S3 bucket, and after that, it is copied into an Amazon Redshift table for storage.

The entire process is orchestrated by Apache Airflow, which runs inside Docker, so we don't have to set up Airflow manually on the host machine.

In addition to Apache Airflow, this project comprises two more components.

First, we connect dbt to our data warehouse so that it can transform the data. We don't use dbt heavily in this part, but it is a helpful data transformation tool, and the dbt work will be covered in a separate part of this project. Second, we generate some visuals by connecting a business intelligence tool to our warehouse.

Reddit API Configuration

To extract Reddit data, we’ll need to use its Application Programming Interface (API). Here are the steps we’ll need to follow to set this up:

  1. Create a Reddit account.
  2. Navigate to the following link and create an app. Make sure to select “script” from the radio buttons during the setup process: https://www.reddit.com/prefs/apps
  3. Take note of a few things once our app is set up, including:
  • The name of the app: we can choose any name for our app.
  • The App ID: a unique identifier for our app, which we'll need in our code to authenticate requests to the Reddit API.
  • The API secret key: another credential we'll need in our code to authenticate requests to the Reddit API.

By following these steps, we’ll be able to set up our Reddit API and begin extracting data for our project.
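To give a sense of how these credentials are used, here is a minimal sketch (not one of the project scripts) of authenticating with PRAW. The placeholder strings are hypothetical and should be replaced with your own App ID, secret key, and Reddit username.

import praw

# The placeholder strings below are hypothetical; use the App ID and secret
# from the app settings page and your own Reddit username.
reddit = praw.Reddit(
    client_id="YOUR_APP_ID",
    client_secret="YOUR_SECRET_KEY",
    user_agent="reddit-api-analytics-ELT by u/your_username",
)

# Quick read-only check: print the title of one hot post from r/datascience
for submission in reddit.subreddit("datascience").hot(limit=1):
    print(submission.title)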

AWS Account

To store our Reddit data in the cloud, we'll be using Amazon Web Services (AWS). Specifically, we'll be using two AWS services:

  1. Simple Storage Service (S3) — This is object storage that allows us to store our raw Reddit data in a cloud-based S3 bucket as objects.
  2. Redshift — This is a cloud-based data warehousing service that enables us to execute SQL queries on large datasets with fast processing speeds.

While we could use a local database like PostgreSQL, using cloud-based tools like AWS is best practice and allows for more scalability.

Setup

To get started with AWS, follow these steps:

The AWS website provides detailed documentation on how to create an account with the free tier, secure the account, and finally configure the CLI.

  1. Set up a personal AWS account by following the instructions on the AWS website and sign up for the free tier.
  • To set up a personal AWS account, go to this link.
  • Follow the instructions here and set up the free tier.

2. Secure our account by enabling Multi-Factor Authentication (MFA) for the root user. The root user is a special account that has access to everything, so it’s important to secure it. Also, be sure to set up an IAM user with admin permissions to use instead of the root user for most tasks.

  • To secure the account, go to this link.

3. Set up the AWS Command Line Interface (CLI) by following the guide provided by AWS. This will allow us to control AWS services from the command line interface. By the end of this, we should have a folder in our home directory called “.aws” that contains a “credentials” file.

  • Set up the CLI by following this guide.

Anyone can follow the official AWS guide, but for this blog I am showing only the CLI step. I installed the AWS CLI on my Ubuntu 22.04.2 LTS system with the following commands:

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

After installation, run the following command to check the AWS CLI version:

aws --version

Configure AWS CLI credentials:

To configure the credentials, we use the aws configure command. Before running it, we first need to create an IAM user and generate access keys for it.

First, go to IAM > Users and click on “Add users”.

Step 1: Create a user named “reddit-api-user” and click “Next”.
Step 2: Add permissions and click “Create user”.

Next, go to IAM > Users, click on “reddit-api-user”, and open the “Security credentials” tab.

Step 3: Create an access key.
Step 4: Select “CLI” and click “Next”.
Step 5: Create the access key.
Step 6: Save the access keys.

When we run the aws configure command, the terminal will prompt for:

  • AWS Access Key ID
  • AWS Secret Access Key
  • Default Region: Provide the Region in the following format us-east-1. For a list of Region names and codes, see this table.
  • Default Output Format: This is how the output should be displayed by default, and includes, but is not limited to: json, yaml, text. Review the documentation for all options.

Once completed, we should see the following in the terminal (I chose eu-central-1 as my default Region because it is the closest to me):

aws configure

AWS Access Key ID [None]: ANOTREALACCESSKEYID
AWS Secret Access Key [None]: ANOTREALSECRETACCESSKEY
Default region name [None]: eu-central-1
Default output format [None]: json

The credentials end up in the ~/.aws/credentials file, which looks like this:

[default]
aws_access_key_id = XXXX
aws_secret_access_key = XXXX

By completing these steps, we’ll be able to set up our AWS environment and start storing our Reddit data in the cloud.
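As an optional sanity check (not part of the original pipeline), we can confirm that boto3 picks up the configured credentials by asking AWS STS who we are:

import boto3

# This call succeeds only if the CLI credentials configured above are picked up correctly.
identity = boto3.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])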

Project Folder and Git repository

In the meantime, I created a Git repository on GitHub and cloned it to my local machine.


Then I created a virtualenv for this project and activated the environment.

Then I added a requirements.txt and installed the packages using pip install -r requirements.txt:

boto3==1.24.7
botocore==1.27.7
certifi==2022.5.18.1
charset-normalizer==2.0.12
configparser==5.2.0
idna==3.3
jmespath==1.0.0
numpy==1.22.4
pandas==1.4.2
praw==7.6.0
prawcore==2.3.0
psycopg2-binary==2.9.3
pyarrow==8.0.0
python-dateutil==2.8.2
pytz==2022.1
requests==2.28.0
s3transfer==0.6.0
six==1.16.0
update-checker==0.18.0
urllib3==1.26.9
websocket-client==1.3.2

AWS Infrastructure with Terraform

We will be utilizing the infrastructure-as-code tool known as Terraform in order to both store the data that we have pulled from Reddit and set up a data warehousing solution with Amazon Web Services. With the help of this program, we will be able to rapidly configure (and delete) our AWS resources using code.

It is essential to keep in mind that Terraform is not limited to use with Amazon Web Services (AWS) and is compatible with other cloud resources. Check out this Tutorial for a brief introduction to Terraform if we’re just getting started with the program.

The following Amazon Web Services resources will be produced using Terraform:

  1. Redshift Cluster: As the final storage location for our data, we will be using Redshift, which is a columnar data warehousing solution that is provided by AWS.
  2. IAM Role for Redshift: Redshift will be granted authorization to read data from S3 by using this role.
  3. S3 Bucket: This will function as object storage for the data that we have collected from Reddit.
  4. Security Group: We will attach this security group to Redshift to allow incoming traffic so that our dashboard can connect to it. Keep in mind, however, that allowing all traffic into a resource is not a good idea in a real production setting.

Setup:

  1. Install Terraform: To install Terraform, we can follow the installation instructions for our operating system from the official Terraform website.

Ensure that our system is up to date and that the gnupg, software-properties-common, and curl packages are installed. We will use these packages to verify HashiCorp's GPG signature and install HashiCorp's Debian package repository.

$ sudo apt-get update && sudo apt-get install -y gnupg software-properties-common

Install the HashiCorp GPG key.

$ wget -O- https://apt.releases.hashicorp.com/gpg | \
gpg --dearmor | \
sudo tee /usr/share/keyrings/hashicorp-archive-keyring.gpg

Verify the key’s fingerprint.

$ gpg --no-default-keyring \
--keyring /usr/share/keyrings/hashicorp-archive-keyring.gpg \
--fingerprint

The gpg command will report the key fingerprint:

/usr/share/keyrings/hashicorp-archive-keyring.gpg
-------------------------------------------------
pub rsa4096 XXXX-XX-XX [SC]
AAAA AAAA AAAA AAAA
uid [ unknown] HashiCorp Security (HashiCorp Package Signing) <security+packaging@hashicorp.com>
sub rsa4096 XXXX-XX-XX [E]

Add the official HashiCorp repository to our system. The lsb_release -cs command finds the distribution release codename for our current system, such as buster, groovy, or sid.

$ echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] \
https://apt.releases.hashicorp.com $(lsb_release -cs) main" | \
sudo tee /etc/apt/sources.list.d/hashicorp.list

Download the package information from HashiCorp.

$ sudo apt update

Install Terraform from the new repository.

$ sudo apt-get install terraform

Verify the installation

Verify that the installation worked by opening a new terminal session and listing Terraform’s available subcommands.

$ terraform -help
Usage: terraform [-version] [-help] <command> [args]
The available commands for execution are listed below.
The most common, useful commands are shown first, followed by
less common or more advanced commands. If you're just getting
started with Terraform, stick with the common commands. For the
other commands, please read the help and docs before usage.
##...

After installing Terraform, create a directory named terraform and navigate to it in our project:

cd ~/reddit-api-analytics-ELT/terraform

Next, create and open the variables.tf file in our preferred text editor and fill in the default parameters. Here are the parameters we'll need to specify:

  • aws_region: The AWS region where we want to create our resources. For example, eu-central-1.
  • s3_bucket_name: The name of the S3 bucket where we want to store our extracted Reddit data. This should be a unique name that doesn't violate any S3 bucket naming constraints (for example, no underscores or uppercase letters). For example, <yourfullname>-reddit-bucket.
  • db_master_user_password: The password for the master database user for our Redshift cluster. This password should contain upper and lowercase letters, as well as numbers.

Be careful not to commit your password or other sensitive details to your version control system. You may want to add the Terraform state file to your .gitignore file. For this project I have uploaded everything for educational purposes, but don't forget to keep these details private in other projects.

I created a main.tf file inside the terraform directory and added the configuration that creates the required Redshift cluster and S3 bucket.

I also created an output.tf file inside the terraform directory.

After specifying your parameters, run the following command to download the AWS Terraform plugin:

terraform init

Then, run the following command to create a plan based on main.tf and execute the planned changes to create resources in AWS:

terraform apply

This will create all the AWS resources for us.

Optionally, we can run the following command to terminate the resources:

terraform destroy

Configuration

Following that, we will need to generate a configuration file that has our specific information. The extract and load scripts that are a part of our pipeline will use the details that are provided here.

Setup:

  1. Create an airflow folder inside our project. Inside airflow, create an extraction folder.
  2. Create a configuration file under ~/Reddit-API-Pipeline/airflow/extraction/ called configuration.conf:
  3. Copy in the following:
[aws_config]
bucket_name = "reddit bucket name"
redshift_username = awsuser
redshift_password = "redshift password"
redshift_hostname = "redshift Host Name"
redshift_role = RedShiftLoadRole
redshift_port = 5439
redshift_database = dev
account_id = "account id"
aws_region = eu-central-1

[reddit_config]
secret = "your reddit secret key"
developer = tusher_16
name = api-analytics-ELT
client_id = "your reddit client_id"

Change the required values

  • If we need to see the contents for the aws_config section, change directories back into the terraform folder and run the terraform output command. It will output the settings that we need to store in the aws_config section of our configuration file. Just make sure that any “ characters are removed from the strings.
terraform output

For reddit_config, these are the details we took note of after setting up our Reddit app. Note that developer is your Reddit username.
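For illustration, here is a minimal sketch of how the extraction scripts can read this file with Python's configparser; the variable names and the container path are assumptions, not the exact code from the repository.

import configparser

parser = configparser.ConfigParser()
# Path inside the Airflow container once the extraction folder is mounted (assumed)
parser.read("/opt/airflow/extraction/configuration.conf")

SECRET = parser.get("reddit_config", "secret")
CLIENT_ID = parser.get("reddit_config", "client_id")
BUCKET_NAME = parser.get("aws_config", "bucket_name")
AWS_REGION = parser.get("aws_config", "aws_region")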

In addition to that, I have included a .gitignore file located within the main project directory to prevent the uploading of configuration files that include our credentials.

# Local terraform directories
*/.terraform/

# State files
*.tfstate
*.tfstate.*

# Crash log files
*.log

# Sensitive data
*.tfvars

#local venv files
venv/

# ignore config info
**/airflow.cfg

Docker & Airflow

We set the LIMIT parameter in our Reddit extraction script to None to get the most current posts from the last 24 hours. This means that instead of a certain number of posts, the script will return all posts from the last 24 hours. We chose to focus on r/Datascience, but you can easily change the script to target a different subreddit or even multiple subreddits. By running the process every day, we can keep adding new data to our dataset, which gives us a better idea of what’s popular on Reddit.

We’ll use two tools, Apache Airflow and Docker, to manage the process of extracting and loading data into our pipeline.

Apache Airflow is an open-source platform that lets us create workflows, called Directed Acyclic Graphs (DAGs), to organize and schedule complex data pipelines. Even though using Airflow for our simple pipeline might seem like overkill, it is a best practice to use a process management tool like Airflow to make sure that our pipeline can grow and is reliable.

Docker, on the other hand, is a tool that lets us make and control containers. These containers are self-contained packages that have all the dependencies, libraries, and tools needed to run a specific program. In our case, we’ll use Docker to make a container that has everything we need to run Apache Airflow and all of its dependencies. This will help us avoid any problems that might come up if we install and run Airflow on different machines that aren’t compatible with each other.

By using these two tools together, we can make sure that our data pipeline is strong, scalable, and easy to manage. Don’t worry if you don’t know how to use Airflow or Docker. We’ve given you links to lessons that will help you get started.

For Docker, see this link.

For Airflow, see this link.

Setup Airflow in Docker

To run Airflow and create our pipeline, we'll use Docker, a containerization platform that allows us to create isolated environments for our applications. We'll use a pre-built docker-compose.yaml file that defines all the services we need for Airflow, such as the web server and scheduler. This link has the contents of the docker-compose.yaml file.

Inside the airflow folder we have created our docker-compose.yaml file.

However, we made a few changes to the original file to customize it for our needs.

Firstly, we added two extra lines under volumes to mount two folders from our local file system into the Docker containers. The first line mounts the extraction folder containing our scripts to /opt/airflow/extraction, where our Airflow DAG can find them. The second line mounts our AWS credentials as read-only into the containers.

- ${AIRFLOW_PROJ_DIR:-.}/extraction:/opt/airflow/extraction
- $HOME/.aws/credentials:/home/airflow/.aws/credentials:ro

Additionally, we added a line to pip install specific packages within the containers. We used the _PIP_ADDITIONAL_REQUIREMENTS variable to specify these packages, including praw, boto3, configparser, and psycopg2-binary. Note that there are other ways to install these packages within the containers.

_PIP_ADDITIONAL_REQUIREMENTS: ${_PIP_ADDITIONAL_REQUIREMENTS:- praw boto3 configparser psycopg2-binary}

This is the docker-compose.yaml file on GitHub.

Installing Docker

First install Docker. Instructions: https://docs.docker.com/get-docker/

I am using Ubuntu, so I am following this link: https://docs.docker.com/desktop/install/ubuntu/

First, install using the apt repository.

Before you install Docker Engine for the first time on a new host machine, you need to set up the Docker repository. Afterward, we can install and update Docker from the repository.

Set up the repository

  1. Update the apt package index and install packages to allow apt to use a repository over HTTPS:
$ sudo apt-get update
$ sudo apt-get install ca-certificates curl gnupg
  2. Add Docker’s official GPG key:
$ sudo install -m 0755 -d /etc/apt/keyrings
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
$ sudo chmod a+r /etc/apt/keyrings/docker.gpg
  3. Use the following command to set up the repository:
echo \
"deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
"$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | \
sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

Install Docker Engine

  1. Update the apt package index:
$ sudo apt-get update
  2. Install Docker Engine, containerd, and Docker Compose.

To install the latest version, run:

$ sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
  3. Verify that the Docker Engine installation is successful by running the hello-world image.
$ sudo docker run hello-world

This command downloads a test image and runs it in a container. When the container runs, it prints a confirmation message and exits.

We have now successfully installed and started Docker Engine.

Install Docker Desktop

Next, download the latest DEB package.

Install the package with apt as follows:

$ sudo apt-get update
$ sudo apt-get install ./docker-desktop-<version>-<arch>.deb
$ docker compose version
Docker Compose version v2.17.3

$ docker --version
Docker version 23.0.5, build bc4487a

$ docker version
Client: Docker Engine - Community
Cloud integration: v1.0.31
Version: 23.0.5
API version: 1.42
<...>

To enable Docker Desktop to start on login, from the Docker menu, select Settings > General > Start Docker Desktop when we log in.

Alternatively, open a terminal and run:

systemctl --user enable docker-desktop

Now that Docker is installed, let's run our Airflow setup.

First, let's create our first script, extract_reddit_etl.py, inside the airflow/extraction folder, to extract data from Reddit.

extract_reddit_etl.py:

This Python script is part of an Airflow DAG and is designed to extract top posts from the Reddit API for a specific subreddit (in this case, “datascience”). The script uses the PRAW library to connect to the Reddit API and retrieve the desired posts.

The script reads configuration variables from a file and sets options for the data extraction, such as the time filter (“day”) and the fields to be extracted from the Reddit posts.

The main function of the script extracts the data from the Reddit API, transforms it by converting epoch timestamps to UTC format and boolean values to a consistent format, and then saves the extracted data to a CSV file in the “/tmp” folder.

The script includes error handling for API exceptions and other potential errors during the extraction process. It also provides a basic function to connect to the Reddit API, retrieve posts for a subreddit, and extract the necessary data.

Overall, this script serves as a useful tool for automating the extraction of Reddit data and can be integrated into an Airflow DAG for scheduled execution as part of a larger data pipeline.
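Below is a condensed sketch of the extraction logic described above, assuming a handful of field names and PRAW's top() listing with time_filter="day"; the actual script in the repository may differ in details.

import pandas as pd
import praw

FIELDS = ["id", "title", "author", "score", "num_comments", "created_utc", "over_18"]

def extract_top_posts(reddit: praw.Reddit, subreddit: str = "datascience") -> pd.DataFrame:
    """Pull top posts from the last day and return them as a DataFrame."""
    rows = []
    # limit=None returns every available post for the chosen time filter
    for post in reddit.subreddit(subreddit).top(time_filter="day", limit=None):
        rows.append(
            {
                "id": post.id,
                "title": post.title,
                "author": str(post.author),
                "score": post.score,
                "num_comments": post.num_comments,
                "created_utc": post.created_utc,
                "over_18": post.over_18,
            }
        )
    df = pd.DataFrame(rows, columns=FIELDS)
    # Convert the epoch timestamp to UTC and keep the boolean column consistent
    df["created_utc"] = pd.to_datetime(df["created_utc"], unit="s", utc=True)
    df["over_18"] = df["over_18"].astype(bool)
    return df

# Usage: df = extract_top_posts(reddit); df.to_csv("/tmp/<YYYYMMDD>.csv", index=False)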

upload_aws_s3_etl.py:

This Python script is part of a larger Airflow DAG and is responsible for uploading a file to an AWS S3 bucket. The file to be uploaded is determined by a command line argument provided in the format YYYYMMDD.

The script reads AWS credentials and configuration variables from a file, including the bucket name and AWS region. It also validates the command line argument to ensure it is provided correctly.

The main function of the script connects to the S3 service using the boto3 library, creates the S3 bucket if it doesn’t already exist, and then uploads the specified file to the bucket.

The script includes error handling to handle exceptions that may occur during the S3 connection or file upload process.

Overall, this script serves as a convenient tool for automating the upload of files to an AWS S3 bucket, making it suitable for integration into an Airflow DAG as part of a larger data pipeline.
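Here is a minimal sketch of that upload step using boto3; the bucket name, region, and file naming convention are assumptions for illustration, not the exact script from the repository.

import sys

import boto3
from botocore.exceptions import ClientError

def upload_to_s3(date_str: str, bucket: str, region: str) -> None:
    """Upload /tmp/<YYYYMMDD>.csv to the given S3 bucket."""
    s3 = boto3.client("s3", region_name=region)
    try:
        # Create the bucket if it does not already exist
        s3.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
    except ClientError as err:
        if err.response["Error"]["Code"] not in ("BucketAlreadyOwnedByYou", "BucketAlreadyExists"):
            raise
    s3.upload_file(f"/tmp/{date_str}.csv", bucket, f"{date_str}.csv")

if __name__ == "__main__":
    # e.g. python upload_aws_s3_etl.py 20230520
    upload_to_s3(sys.argv[1], "my-reddit-bucket", "eu-central-1")  # hypothetical bucket name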

upload_aws_redshift_etl.py:

This Python script is part of a larger Airflow DAG and is responsible for loading data from an S3 bucket into a Redshift database. The script takes a command line argument in the format YYYYMMDD, which represents the name of the file to be copied from S3.

The script reads AWS credentials and configuration variables from a file, including the Redshift database connection details, S3 bucket name, and the Redshift role to be assumed. It also validates the command line argument to ensure it is provided correctly.

The main function of the script establishes a connection to the Redshift database using psycopg2, creates the target table if it doesn’t already exist, creates a temporary table to store the data from S3, copies the data from the S3 file to the temporary table, deletes any records with the same ID from the main table, inserts the data from the temporary table into the main table, and finally drops the temporary table.

The script provides error handling and gracefully exits in case of any connection or execution errors.

Overall, this script serves as a valuable component in an Airflow DAG for automating the process of loading data from an S3 bucket into a Redshift database, ensuring data consistency and updating existing records as necessary.
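The following sketch shows one way to implement that staging-table upsert with psycopg2 and Redshift's COPY command; the table definition, column list, and role name are assumptions, not the exact script from the repository.

import psycopg2

def copy_to_redshift(conn, date_str: str, bucket: str, account_id: str, role: str) -> None:
    """Stage the CSV from S3 into a temp table, then upsert into the main table."""
    file_path = f"s3://{bucket}/{date_str}.csv"
    role_arn = f"arn:aws:iam::{account_id}:role/{role}"
    with conn, conn.cursor() as cur:
        # Target table (column list is an assumption for illustration)
        cur.execute("""
            CREATE TABLE IF NOT EXISTS reddit (
                id varchar PRIMARY KEY,
                title varchar(400),
                author varchar,
                score int,
                num_comments int,
                created_utc timestamp,
                over_18 bool
            );
        """)
        cur.execute("CREATE TEMP TABLE reddit_staging (LIKE reddit);")
        cur.execute(f"""
            COPY reddit_staging FROM '{file_path}'
            IAM_ROLE '{role_arn}'
            IGNOREHEADER 1 DELIMITER ',' CSV;
        """)
        # Remove existing rows with the same post ID, then insert the fresh data
        cur.execute("DELETE FROM reddit USING reddit_staging WHERE reddit.id = reddit_staging.id;")
        cur.execute("INSERT INTO reddit SELECT * FROM reddit_staging;")
        cur.execute("DROP TABLE reddit_staging;")

# conn = psycopg2.connect(host=..., port=5439, dbname="dev", user="awsuser", password=...)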

validation.py:

This code snippet defines a function called validate_input that validates a date input in the format YYYYMMDD, raising a ValueError and exiting the program if the input is invalid.
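A possible implementation along the lines described above (the error message wording is an assumption):

import datetime
import sys

def validate_input(date_input: str) -> None:
    """Exit the program if the argument is not a date in YYYYMMDD format."""
    try:
        datetime.datetime.strptime(date_input, "%Y%m%d")
    except ValueError:
        print(f"Invalid date input: {date_input!r} (expected YYYYMMDD)")
        sys.exit(1)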

Now we have created elt_reddit_pipeline.py inside airflow/dags

elt_reddit_pipeline:

This code defines an Airflow DAG (elt_reddit_pipeline) that orchestrates an Extract-Load-Transform (ELT) process for Reddit data. It consists of three tasks: extract_reddit_data, upload_to_s3, and copy_to_redshift. Each task is executed sequentially, with the output name of the extracted file passed as an argument to each task. The DAG runs daily, starting from the previous day, and the tasks execute Bash commands that invoke Python scripts for data extraction, uploading to AWS S3, and copying to AWS Redshift.
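For reference, here is a condensed sketch of such a DAG using BashOperator; the start date, retry settings, and exact script paths are assumptions, and the real DAG in the repository may differ.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

# {{ ds_nodash }} is an Airflow template that renders the run date as YYYYMMDD,
# matching the file-name argument passed to the three scripts.
with DAG(
    dag_id="elt_reddit_pipeline",
    schedule_interval="@daily",
    start_date=datetime(2023, 5, 1),  # assumed start date
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:

    extract_reddit_data = BashOperator(
        task_id="extract_reddit_data",
        bash_command="python /opt/airflow/extraction/extract_reddit_etl.py {{ ds_nodash }}",
    )

    upload_to_s3 = BashOperator(
        task_id="upload_to_s3",
        bash_command="python /opt/airflow/extraction/upload_aws_s3_etl.py {{ ds_nodash }}",
    )

    copy_to_redshift = BashOperator(
        task_id="copy_to_redshift",
        bash_command="python /opt/airflow/extraction/upload_aws_redshift_etl.py {{ ds_nodash }}",
    )

    extract_reddit_data >> upload_to_s3 >> copy_to_redshift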

Running Airflow

To begin our pipeline, we need to start Airflow by following a few prerequisite steps. Here’s a step-by-step guide:

  1. Increase the CPU and memory allocation in Docker Desktop’s resource settings according to our PC’s capabilities.
  2. Open terminal and navigate to the directory where we have the Reddit API Pipeline files. In this case, assuming it’s located at ~/Reddit-API-Pipeline/airflow, run the following command:
cd ~/Reddit-API-Pipeline/airflow
  3. Create the necessary folders required by Airflow, including the logs and plugins folders. If the dags folder already exists and contains the DAG script used by Airflow, we don't need to create it again. Run the following command:
mkdir -p ./logs ./plugins
  4. Configure the Airflow quick-start by setting the host user ID. This is necessary for permissions and access control within the Airflow environment. Run the following command:
echo -e "AIRFLOW_UID=$(id -u)" > .env
  5. Ensure we are still in the airflow directory. Now, initialize the Airflow database by running the following command. Note that the Docker daemon (background process) should be running before proceeding:
docker-compose up airflow-init

This process may take a few minutes as it sets up the Airflow database.

  6. Once the Airflow database is initialized, it's time to create the Airflow containers. This step might take a while. Run the following command:
docker-compose up
  7. To check the running containers, either in Docker Desktop or through the command line, use the following command:
docker ps

This will list the active containers along with their details.

  8. Additionally, we can connect to a Docker container and explore its file system by executing the following command:
docker exec -it <CONTAINER ID> bash

Replace <CONTAINER ID> with the ID of the container you want to access. Once inside the container, you can navigate around the file system and perform various tasks.

After following the previous steps to start Airflow and access the Airflow Web Interface, here are some additional instructions:

  1. Go to http://localhost:8080 to access the Airflow Web Interface.
  2. Once the Airflow Web Interface loads, you’ll be prompted to enter a username and password. Use “airflow” for both the username and password to log in.
  3. To familiarize yourself with the Airflow UI and its features, it’s recommended to refer to guides or tutorials available online. You can find helpful resources like this guide [link] that explain the Airflow UI in detail.
  4. The DAG named elt_reddit_pipeline should be set to start running automatically once the containers are created. Depending on the timing, it may have already finished running by the time you log in. This setting is specified in the docker-compose file using the parameter AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'false'. The next DAG run is scheduled for midnight.
  5. To check the status of the DAG run, click on the DAG name (elt_reddit_pipeline) in the Airflow Web Interface. Navigate to the Tree view, where all the boxes representing tasks should be dark green if the DAG run was successful. If any boxes appear red, indicating issues or failures, you can click on them, access the logs, and investigate the error.
  6. If you want to shut down the Airflow containers, navigate to the airflow directory in your terminal and run the following command:
docker-compose down
  7. If you want to completely remove the containers, delete volumes with database data, and download images again, run the following command instead:
docker-compose down --volumes --rmi all

Please be careful when using the second command, as it will remove all data and images associated with the Airflow containers. It's advisable to have a good understanding of Docker commands before executing it.

By following these instructions, you should be able to access the Airflow Web Interface, monitor the DAG runs, and manage the Airflow containers effectively.

Explanation

The elt_reddit_pipeline DAG is a simple Airflow DAG that consists of three tasks executed sequentially.

  1. The extract_reddit_data task fetches the top posts from the r/Datascience subreddit, extracting attributes such as the number of comments, and saves the data as a CSV file in the /tmp folder.
  2. The upload_to_s3 task uploads the generated CSV file to an AWS S3 bucket for storage, using the bucket created by Terraform.
  3. The copy_to_redshift task ensures the existence of a Redshift table and uses the COPY command to load the data from the S3 CSV file into the table, ensuring no duplicate data based on post IDs.

This DAG is designed to automate the extraction, uploading, and loading of Reddit data into Redshift, providing a scalable and efficient ELT pipeline for further analysis.

dbt

For data transformation using dbt, I am creating a separate part of this project.

Data Visualization Using Tableau

Tableau is a strong tool for visualizing data that lets you connect to different data sources and make dashboards that are both visually appealing and interactive. In this part, we’ll show you how to connect your Amazon Redshift data to Tableau and make a compelling data visualization.

Step 1: Install Tableau and Set Up Redshift Connection:

Download and install Tableau Desktop from the Tableau website.

  • Launch Tableau Desktop and click on “Connect to Data” to start a new connection.
  • In the “Connect” pane, select “Amazon Redshift” from the list of available connectors.
  • Enter your Redshift database credentials, including the server name, database name, username, and password.
  • Click “Sign In” to establish the connection with your Redshift database.

Step 2: Choose Data and Create Visualizations:

  • Tableau will show a list of tables and views in the database once it is linked to Redshift. Choose the data source (table or view) you want to use to make a chart. Move the chosen data source to the Tableau screen by dragging and dropping it.

Step 3: Publish and Share Your Visualization:

  1. Sign up for a Tableau Public account:
  • Visit the Tableau Public website (public.tableau.com) and click on the “Download the App” or “Get Started” button.

2. Save your Tableau workbook:

  • In Tableau Desktop, click on the “File” menu and select “Save to Tableau Public…”.
  • Choose a location on your computer to save the workbook.

3. Publish your workbook to Tableau Public:

  • In the “Save to Tableau Public” dialog box, provide a Title and Description for your visualization.
  • Choose the desired privacy setting for your visualization (Public, Unlisted, or Password Protected).
  • Click on the “Save” button to initiate the publishing process.

4. Wait for the publishing process to complete:

  • Tableau Desktop will upload your workbook and data to Tableau Public.
  • Once the upload is finished, a web browser window will open, displaying your published visualization on Tableau Public.

This is the dashboard published on my Tableau Public profile: https://public.tableau.com/app/profile/obaidullah.tusher/viz/DataScience-Reddit/Dashboard1

Tableau Basic Tutorial: Get Started with Tableau Desktop — Tableau

Termination

Terminating AWS Resources: To terminate your AWS resources provisioned using Terraform, follow these steps:

  1. Open a terminal and navigate to the Terraform directory.
  2. Execute the command terraform destroy to initiate the resource termination process.
  3. Allow Terraform to complete the destruction of the resources.
  4. Verify in the AWS console that all the resources created earlier have been successfully deleted.

Cleaning Up Docker Containers: To clean up Docker containers, volumes, and images, follow these steps:

  1. Navigate to the directory where you initially ran docker-compose up.
  2. Run the command docker-compose down --volumes --rmi all to stop and delete the containers, delete volumes with database data, and remove downloaded images.
  3. Next, execute docker system prune -a --volumes to remove all stopped containers, networks not used by any containers, unused images, volumes, and dangling build cache.
  4. Note: Exercise caution when running this command, as it permanently deletes unused resources.
