🧠 S.C.H.A.D.: Architecting a Cloud-Agnostic Big Data Pipeline

Streaming | Clickstream | Hadoop | Analytics | Datacenter

🚀 A personal deep-dive into designing portable big data architectures using open-source tools mirrored in AWS, Azure, and GCP.
💡 Yes, I do use cloud technologies from different providers in my daily work. I still find it valuable to compare offerings and understand the underlying technology in order to make better decisions on cost and effort.

📌 Why I Created S.C.H.A.D.

Cloud providers offer incredible power, but every platform makes different trade-offs. This project explores how to:

  • Compare cloud services by understanding open-source equivalents
  • Avoid vendor lock-in by working at the tech layer
  • Build scalable analytics platforms from scratch

S.C.H.A.D. stands for:

  • Streaming
  • Clickstream
  • Hadoop
  • Analytics
  • Datacenter

It's a modular, portable architecture designed to be deployed without reliance on any specific cloud provider, while still mapping cleanly onto AWS, Azure, and GCP services.


🧩 System Architecture

```mermaid
flowchart TD
    subgraph DataGeneration
        A[Clickstream Generator]
    end

    subgraph Producers
        B1[Kafka Producer]
        B2[Akka Producer]
    end

    subgraph Messaging
        C[Kafka]
    end

    subgraph Processing
        D1[Spark Streaming]
        D2[Spark Batch + Hive]
    end

    subgraph Storage
        E1[HDFS / Parquet]
        E2[Hive Tables]
    end

    subgraph Visualization
        F[Zeppelin Dashboard]
    end

    subgraph Orchestration
        G1[Docker Compose]
        G2[Ansible Scripts]
    end

    A --> B1 --> C
    A --> B2 --> C
    C --> D1 --> E1 --> F
    C --> D2 --> E2 --> F
    G1 --> B1
    G1 --> B2
    G1 --> C
    G2 --> D1
    G2 --> D2
```

This shows the complete ingest → process → store → visualize flow, with orchestration kept separate so each stage stays modular and replaceable.
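
To make this flow concrete, below is a minimal sketch of the producer side: a simulated clickstream event pushed into Kafka with the plain Kafka client from Scala. The event fields, topic name (`clickstream`), and broker address are illustrative assumptions, not the exact schema or configuration used in the repos.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

// Hypothetical clickstream event; the real generator's schema may differ.
case class ClickEvent(userId: String, page: String, action: String, ts: Long) {
  def toJson: String =
    s"""{"userId":"$userId","page":"$page","action":"$action","ts":$ts}"""
}

object ClickstreamProducer extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092") // local broker from Docker Compose
  props.put("key.serializer", classOf[StringSerializer].getName)
  props.put("value.serializer", classOf[StringSerializer].getName)

  val producer = new KafkaProducer[String, String](props)
  val pages    = Seq("/home", "/product", "/cart", "/checkout")

  // Emit a small burst of simulated events to the (assumed) "clickstream" topic.
  (1 to 100).foreach { i =>
    val event = ClickEvent(s"user-${i % 10}", pages(i % pages.size), "view", System.currentTimeMillis())
    producer.send(new ProducerRecord("clickstream", event.userId, event.toJson))
  }
  producer.flush()
  producer.close()
}
```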


☁️ Cloud Tool Mapping

The following table and diagrams show how each open-source tool maps to managed services in the major cloud providers.

📊 Mapping Table

| S.C.H.A.D. Tool | AWS Equivalent | Azure Equivalent | GCP Equivalent |
| --- | --- | --- | --- |
| Kafka | Amazon MSK | Azure Event Hubs | Google Pub/Sub |
| Spark | AWS Glue / EMR | Azure Synapse / HDInsight | Dataproc / Dataflow |
| Hive | Athena / Glue Catalog | Synapse SQL Pools | BigQuery |
| Akka | Lambda / ECS | Azure Functions | Cloud Functions |
| Docker | ECS / EKS | AKS / ACI | GKE / Cloud Run |
| Zeppelin | SageMaker Studio | Synapse Notebooks | Colab / Vertex AI Workbench |

🧪 Ingestion Layer

```mermaid
graph TB
  subgraph Ingestion
    KAFKA[Kafka Open Source] --> AWSMSK[Amazon MSK]
    KAFKA --> AZUREEVENT[Azure Event Hubs]
    KAFKA --> GCPPUBSUB[Google Pub/Sub]
  end
```
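
The practical payoff of this mapping is that the producer code itself stays the same; what changes between open-source Kafka, MSK, and Event Hubs is essentially the connection and auth configuration (Pub/Sub is the odd one out, since it usually needs a connector rather than a broker swap). A hedged sketch of that idea, with placeholder endpoints and credentials:

```scala
import java.util.Properties

// Same producer code, different target: only connection/auth properties change.
// Endpoints and credentials below are placeholders, not real values.
def kafkaProps(target: String): Properties = {
  val p = new Properties()
  target match {
    case "local" => // open-source Kafka from Docker Compose
      p.put("bootstrap.servers", "localhost:9092")
    case "msk" => // Amazon MSK brokers over TLS
      p.put("bootstrap.servers", "b-1.my-cluster.kafka.us-east-1.amazonaws.com:9094")
      p.put("security.protocol", "SSL")
    case "eventhubs" => // Azure Event Hubs' Kafka-compatible endpoint
      p.put("bootstrap.servers", "my-namespace.servicebus.windows.net:9093")
      p.put("security.protocol", "SASL_SSL")
      p.put("sasl.mechanism", "PLAIN")
      p.put("sasl.jaas.config",
        """org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="<connection-string>";""")
    case other =>
      throw new IllegalArgumentException(s"unknown target: $other")
  }
  p
}
```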

🧠 Compute Layer

```mermaid
graph TB
  subgraph Compute
    SPARK[Spark Streaming + Batch] --> AWSGLUE[AWS Glue / EMR]
    SPARK --> AZURESYNAPSE[Azure Synapse]
    SPARK --> GCPDATAPROC[GCP Dataproc]

    AKKA[Akka Producer] --> AWSLAMBDA[AWS Lambda]
    AKKA --> AZUREFUNC[Azure Functions]
    AKKA --> GCPCLOUDRUN[GCP Cloud Run]
  end
```
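
As an illustration of the processing side, a minimal Spark Structured Streaming job could read the clickstream topic and land Parquet on HDFS roughly like this. The topic, paths, and schema carry over from the producer sketch above and are assumptions rather than the exact code in the Spark repo:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object ClickstreamStreamingJob extends App {
  val spark = SparkSession.builder()
    .appName("schad-clickstream-streaming")
    .getOrCreate()
  import spark.implicits._

  // Schema of the hypothetical JSON events produced above.
  val schema = new StructType()
    .add("userId", StringType)
    .add("page", StringType)
    .add("action", StringType)
    .add("ts", LongType)

  val events = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clickstream")
    .load()
    .select(from_json($"value".cast("string"), schema).as("e"))
    .select("e.*")
    .withColumn("event_time", (col("ts") / 1000).cast("timestamp"))

  // Land raw events as Parquet on HDFS so the batch/Hive layer can pick them up.
  events.writeStream
    .format("parquet")
    .option("path", "hdfs:///schad/clickstream/raw")
    .option("checkpointLocation", "hdfs:///schad/clickstream/_checkpoints")
    .start()
    .awaitTermination()
}
```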

💾 Storage & Query Layer

```mermaid
graph TB
  subgraph StorageQuery["Storage & Query"]
    HIVE[Hive Open Source] --> AWSATHENA[Athena / Glue Catalog]
    HIVE --> AZURESQL[Synapse SQL]
    HIVE --> BIGQUERY[BigQuery]
  end
```
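
To keep the query layer portable, the batch side can register that Parquet output as an external Hive table and drive everything with plain SQL, which then carries over with only dialect tweaks to Athena, Synapse SQL, or BigQuery. A sketch using the same illustrative names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("schad-clickstream-sql")
  .enableHiveSupport() // required so the table lands in the Hive metastore
  .getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS schad")

// Register the Parquet output from the streaming job as an external Hive table.
// Database, table, and path names are illustrative.
spark.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS schad.clickstream_raw (
    |  userId STRING,
    |  page STRING,
    |  action STRING,
    |  ts BIGINT,
    |  event_time TIMESTAMP
    |) STORED AS PARQUET
    |LOCATION 'hdfs:///schad/clickstream/raw'""".stripMargin)

// A typical analytical query: most-viewed pages.
spark.sql(
  """SELECT page, COUNT(*) AS views
    |FROM schad.clickstream_raw
    |GROUP BY page
    |ORDER BY views DESC
    |LIMIT 10""".stripMargin).show()
```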

📊 Visualization Layer

```mermaid
graph TB
  subgraph Visualization
    ZEPPELIN[Zeppelin Notebook] --> SAGEMAKER[Amazon SageMaker Studio]
    ZEPPELIN --> AZURENOTE[Azure Synapse Notebook]
    ZEPPELIN --> COLAB[GCP Colab]
  end
```
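
Inside Zeppelin, a dashboard paragraph is essentially a Spark SQL query plus `z.show`. A sketch of a `%spark` paragraph, assuming the `schad.clickstream_raw` table from the previous section exists:

```scala
// %spark  -- Zeppelin paragraph: page views per hour, rendered as a chart
val viewsPerHour = spark.sql(
  """SELECT date_trunc('hour', event_time) AS hour, page, COUNT(*) AS views
    |FROM schad.clickstream_raw
    |GROUP BY date_trunc('hour', event_time), page
    |ORDER BY hour""".stripMargin)

// z is Zeppelin's built-in ZeppelinContext; z.show renders the DataFrame
// as an interactive table/chart in the notebook.
z.show(viewsPerHour)
```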

🔗 Repository Breakdown

Each component is broken into individual repos to simulate real-world modularity and dev team structures.

| Component | Description | Repo |
| --- | --- | --- |
| Clickstream Generator | Simulates user activity on a site | GitHub |
| Kafka Producer | Pushes data into Kafka from simulated input | GitHub |
| Akka Producer | Actor-based producer using Akka Streams (see the sketch below) | GitHub |
| Spark Applications | Real-time + batch ETL and transformation logic | GitHub |
| Hive SQL Layer | DDL and analytical SQL queries | GitHub |
| Zeppelin Notebooks | Interactive visualization notebooks | Private |
| Orchestration Scripts | Docker Compose, Ansible playbooks | Private |
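
The Akka Producer above is the actor/stream-based alternative to the plain Kafka producer. A minimal Alpakka Kafka sketch of that idea follows; the payload and topic name are the same illustrative assumptions, and the actual repo may be structured quite differently:

```scala
import akka.actor.ActorSystem
import akka.kafka.ProducerSettings
import akka.kafka.scaladsl.Producer
import akka.stream.scaladsl.Source
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.StringSerializer

import scala.concurrent.duration._

object AkkaClickstreamProducer extends App {
  // The implicit ActorSystem also provides the stream materializer (Akka 2.6+).
  implicit val system: ActorSystem = ActorSystem("schad-akka-producer")

  val settings = ProducerSettings(system, new StringSerializer, new StringSerializer)
    .withBootstrapServers("localhost:9092")

  // Tick-driven stream of simulated events into the (assumed) "clickstream" topic.
  Source.tick(0.seconds, 100.millis, ())
    .zipWithIndex
    .map { case (_, i) =>
      val json =
        s"""{"userId":"user-${i % 10}","page":"/home","action":"view","ts":${System.currentTimeMillis()}}"""
      new ProducerRecord[String, String]("clickstream", s"user-${i % 10}", json)
    }
    .runWith(Producer.plainSink(settings))
}
```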

🧠 What I Learned

  • Hands-on skills in streaming architectures (Kafka + Spark)
  • Cross-cloud platform mapping of data tools
  • Low-level debugging and orchestration using Ansible and Docker
  • Designing cloud-agnostic systems from first principles

💬 Want to Learn More?

👉 Visit the SCHAD Meta Repository to explore the full breakdown.
👉 Connect on LinkedIn if you'd like to discuss cloud architectures or data engineering!


S.C.H.A.D. isn't just a proof of concept; it's a blueprint for understanding the open-source foundations of cloud-native data platforms.