🧠 S.C.H.A.D.: Architecting a Cloud-Agnostic Big Data Pipeline

Streaming | Clickstream | Hadoop | Analytics | Datacenter

🚀 A personal deep-dive into designing portable big data architectures using open-source tools mirrored in AWS, Azure, and GCP.
💡 Yes, I do use cloud technologies from different providers in my daily work. I still find it valuable to compare offerings and understand the underlying technology in order to make better decisions on cost and effort.

📌 Why I Created S.C.H.A.D.

Cloud providers offer incredible power, but every platform makes different trade-offs. This project explores how to:

  • Compare cloud services by understanding open-source equivalents
  • Avoid vendor lock-in by working at the tech layer
  • Build scalable analytics platforms from scratch

S.C.H.A.D. stands for:

  • Streaming
  • Clickstream
  • Hadoop
  • Analytics
  • Datacenter

It's a modular, portable architecture designed to be deployed without reliance on any specific cloud provider, while still mapping cleanly onto AWS, Azure, and GCP services.


🧩 System Architecture

```mermaid
flowchart TD
    subgraph DataGeneration
        A[Clickstream Generator]
    end

    subgraph Producers
        B1[Kafka Producer]
        B2[Akka Producer]
    end

    subgraph Messaging
        C[Kafka]
    end

    subgraph Processing
        D1[Spark Streaming]
        D2[Spark Batch + Hive]
    end

    subgraph Storage
        E1[HDFS / Parquet]
        E2[Hive Tables]
    end

    subgraph Visualization
        F[Zeppelin Dashboard]
    end

    subgraph Orchestration
        G1[Docker Compose]
        G2[Ansible Scripts]
    end

    A --> B1 --> C
    A --> B2 --> C
    C --> D1 --> E1 --> F
    C --> D2 --> E2 --> F
    G1 --> B1
    G1 --> B2
    G1 --> C
    G2 --> D1
    G2 --> D2
```

This shows the complete ingest → process → store → visualize flow, with orchestration kept separate so each stage stays modular and replaceable.
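
To make this flow concrete, below is a minimal sketch of the producer side: a simulated clickstream event pushed into Kafka with the plain Kafka client from Scala. The event fields, topic name (`clickstream`), and broker address are illustrative assumptions, not the exact schema or configuration used in the repos.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

// Hypothetical clickstream event; the real generator's schema may differ.
case class ClickEvent(userId: String, page: String, action: String, ts: Long) {
  def toJson: String =
    s"""{"userId":"$userId","page":"$page","action":"$action","ts":$ts}"""
}

object ClickstreamProducer extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // local broker from Docker Compose
  props.put("key.serializer", classOf[StringSerializer].getName)
  props.put("value.serializer", classOf[StringSerializer].getName)

  val producer = new KafkaProducer[String, String](props)
  val pages    = Seq("/home", "/product", "/cart", "/checkout")

  // Emit a small burst of simulated events to the (assumed) "clickstream" topic.
  (1 to 100).foreach { i =>
    val event = ClickEvent(s"user-${i % 10}", pages(i % pages.size), "view", System.currentTimeMillis())
    producer.send(new ProducerRecord("clickstream", event.userId, event.toJson))
  }
  producer.flush()
  producer.close()
}
```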


☁️ Cloud Tool Mapping

The following table and diagrams show how each open-source tool maps to managed services in the major cloud providers.

📊 Mapping Table

| S.C.H.A.D. Tool | AWS Equivalent | Azure Equivalent | GCP Equivalent |
| --- | --- | --- | --- |
| Kafka | Amazon MSK | Azure Event Hubs | Google Pub/Sub |
| Spark | AWS Glue / EMR | Azure Synapse / HDInsight | Dataproc / Dataflow |
| Hive | Athena / Glue Catalog | Synapse SQL Pools | BigQuery |
| Akka | Lambda / ECS | Azure Functions | Cloud Functions |
| Docker | ECS / EKS | AKS / ACI | GKE / Cloud Run |
| Zeppelin | SageMaker Studio | Synapse Notebooks | Colab / Vertex AI Workbench |

🧪 Ingestion Layer

```mermaid
graph TB
  subgraph Ingestion
    KAFKA[Kafka Open Source] --> AWSMSK[Amazon MSK]
    KAFKA --> AZUREEVENT[Azure Event Hubs]
    KAFKA --> GCPPUBSUB[Google Pub/Sub]
  end
```
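
The practical payoff of this mapping is that the producer code itself stays the same; what changes between open-source Kafka, MSK, and Event Hubs is essentially the connection and auth configuration (Pub/Sub is the odd one out, since it usually needs a connector rather than a broker swap). A hedged sketch of that idea, with placeholder endpoints and credentials:

```scala
import java.util.Properties

// Same producer code, different target: only connection/auth properties change.
// Endpoints and credentials below are placeholders, not real values.
def kafkaProps(target: String): Properties = {
  val p = new Properties()
  target match {
    case "local" => // open-source Kafka from Docker Compose
      p.put("bootstrap.servers", "localhost:9092")
    case "msk" => // Amazon MSK brokers over TLS
      p.put("bootstrap.servers", "b-1.my-cluster.kafka.us-east-1.amazonaws.com:9094")
      p.put("security.protocol", "SSL")
    case "eventhubs" => // Azure Event Hubs' Kafka-compatible endpoint
      p.put("bootstrap.servers", "my-namespace.servicebus.windows.net:9093")
      p.put("security.protocol", "SASL_SSL")
      p.put("sasl.mechanism", "PLAIN")
      p.put("sasl.jaas.config",
        """org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="<connection-string>";""")
    case other =>
      throw new IllegalArgumentException(s"unknown target: $other")
  }
  p
}
```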

🧠 Compute Layer

```mermaid
graph TB
  subgraph Compute
    SPARK[Spark Streaming + Batch] --> AWSGLUE[AWS Glue / EMR]
    SPARK --> AZURESYNAPSE[Azure Synapse]
    SPARK --> GCPDATAPROC[GCP Dataproc]

    AKKA[Akka Producer] --> AWSLAMBDA[AWS Lambda]
    AKKA --> AZUREFUNC[Azure Functions]
    AKKA --> GCPCLOUDRUN[GCP Cloud Run]
  end
```
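
As an illustration of the processing side, a minimal Spark Structured Streaming job could read the clickstream topic and land Parquet on HDFS roughly like this. The topic, paths, and schema carry over from the producer sketch above and are assumptions rather than the exact code in the Spark repo:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object ClickstreamStreamingJob extends App {
  val spark = SparkSession.builder()
    .appName("schad-clickstream-streaming")
    .getOrCreate()
  import spark.implicits._

  // Schema of the hypothetical JSON events produced above.
  val schema = new StructType()
    .add("userId", StringType)
    .add("page", StringType)
    .add("action", StringType)
    .add("ts", LongType)

  val events = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clickstream")
    .load()
    .select(from_json($"value".cast("string"), schema).as("e"))
    .select("e.*")
    .withColumn("event_time", (col("ts") / 1000).cast("timestamp"))

  // Land raw events as Parquet on HDFS so the batch/Hive layer can pick them up.
  events.writeStream
    .format("parquet")
    .option("path", "hdfs:///schad/clickstream/raw")
    .option("checkpointLocation", "hdfs:///schad/clickstream/_checkpoints")
    .start()
    .awaitTermination()
}
```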

💾 Storage & Query Layer

```mermaid
graph TB
  subgraph StorageQuery["Storage & Query"]
    HIVE[Hive Open Source] --> AWSATHENA[Athena / Glue Catalog]
    HIVE --> AZURESQL[Synapse SQL]
    HIVE --> BIGQUERY[BigQuery]
  end
```
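
To keep the query layer portable, the batch side can register that Parquet output as an external Hive table and drive everything with plain SQL, which then carries over with only dialect tweaks to Athena, Synapse SQL, or BigQuery. A sketch using the same illustrative names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("schad-clickstream-sql")
  .enableHiveSupport() // required so the table lands in the Hive metastore
  .getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS schad")

// Register the Parquet output from the streaming job as an external Hive table.
// Database, table, and path names are illustrative.
spark.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS schad.clickstream_raw (
    |  userId STRING,
    |  page STRING,
    |  action STRING,
    |  ts BIGINT,
    |  event_time TIMESTAMP
    |) STORED AS PARQUET
    |LOCATION 'hdfs:///schad/clickstream/raw'""".stripMargin)

// A typical analytical query: most-viewed pages.
spark.sql(
  """SELECT page, COUNT(*) AS views
    |FROM schad.clickstream_raw
    |GROUP BY page
    |ORDER BY views DESC
    |LIMIT 10""".stripMargin).show()
```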

📊 Visualization Layer

```mermaid
graph TB
  subgraph Visualization
    ZEPPELIN[Zeppelin Notebook] --> SAGEMAKER[Amazon SageMaker Studio]
    ZEPPELIN --> AZURENOTE[Azure Synapse Notebook]
    ZEPPELIN --> COLAB[GCP Colab]
  end
```
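
Inside Zeppelin, a dashboard paragraph is essentially a Spark SQL query plus `z.show`. A sketch of a `%spark` paragraph, assuming the `schad.clickstream_raw` table from the previous section exists:

```scala
// %spark  -- Zeppelin paragraph: page views per hour, rendered as a chart
val viewsPerHour = spark.sql(
  """SELECT date_trunc('hour', event_time) AS hour, page, COUNT(*) AS views
    |FROM schad.clickstream_raw
    |GROUP BY date_trunc('hour', event_time), page
    |ORDER BY hour""".stripMargin)

// z is Zeppelin's built-in ZeppelinContext; z.show renders the DataFrame
// as an interactive table/chart in the notebook.
z.show(viewsPerHour)
```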

🔗 Repository Breakdown

Each component is broken into individual repos to simulate real-world modularity and dev team structures.

| Component | Description | Repo |
| --- | --- | --- |
| Clickstream Generator | Simulates user activity on a site | GitHub |
| Kafka Producer | Pushes data into Kafka from simulated input | GitHub |
| Akka Producer | Actor-based producer using Akka Streams (see the sketch below) | GitHub |
| Spark Applications | Real-time + batch ETL and transformation logic | GitHub |
| Hive SQL Layer | DDL and analytical SQL queries | GitHub |
| Zeppelin Notebooks | Interactive visualization notebooks | Private |
| Orchestration Scripts | Docker Compose, Ansible playbooks | Private |
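
The Akka Producer above is the actor/stream-based alternative to the plain Kafka producer. A minimal Alpakka Kafka sketch of that idea follows; the payload and topic name are the same illustrative assumptions, and the actual repo may be structured quite differently:

```scala
import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.scaladsl.Source
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer

import scala.concurrent.duration._

object AkkaClickstreamProducer extends App {
  // The implicit ActorSystem also provides the stream materializer (Akka 2.6+).
  implicit val system: ActorSystem = ActorSystem("schad-akka-producer")

  val settings = ProducerSettings(system, new StringSerializer, new StringSerializer)
    .withBootstrapServers("localhost:9092")

  // Tick-driven stream of simulated events into the (assumed) "clickstream" topic.
  Source.tick(0.seconds, 100.millis, ())
    .zipWithIndex
    .map { case (_, i) =>
      val json =
        s"""{"userId":"user-${i % 10}","page":"/home","action":"view","ts":${System.currentTimeMillis()}}"""
      new ProducerRecord[String, String]("clickstream", s"user-${i % 10}", json)
    }
    .runWith(Producer.plainSink(settings))
}
```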

🧠 What I Learned

  • Hands-on skills in streaming architectures (Kafka + Spark)
  • Cross-cloud platform mapping of data tools
  • Low-level debugging and orchestration using Ansible and Docker
  • Designing cloud-agnostic systems from first principles

💬 Want to Learn More?

👉 Visit the SCHAD Meta Repository to explore the full breakdown.
👉 Connect on LinkedIn if you'd like to discuss cloud architectures or data engineering!


S.C.H.A.D. isn't just a proof of concept; it's a blueprint for understanding the open-source foundations of cloud-native data platforms.