S.C.H.A.D.
Date: June 2, 2025
S.C.H.A.D. - Architecting a Cloud-Agnostic Big Data Pipeline
Streaming | Clickstream | Hadoop | Analytics | Datacenter
A personal deep-dive into designing portable big data architectures using open-source tools mirrored in AWS, Azure, and GCP.
Yes, I do use cloud technologies from different providers in my daily workflow. I still find it valuable to compare offerings and understand the underlying technology so I can make better decisions about cost and effort.
Why I Created S.C.H.A.D.
Cloud providers offer incredible power, but every platform makes different trade-offs. This project explores how to:
- Compare cloud services by understanding open-source equivalents
- Avoid vendor lock-in by working at the tech layer
- Build scalable analytics platforms from scratch
S.C.H.A.D. stands for:
- Streaming
- Clickstream
- Hadoop
- Analytics
- Datacenter
It's a modular, portable architecture designed to be deployed without reliance on any specific cloud provider, while still being comparable to AWS, Azure, and GCP services.
System Architecture
flowchart TD
subgraph DataGeneration
A[Clickstream Generator]
end
subgraph Producers
B1[Kafka Producer]
B2[Akka Producer]
end
subgraph Messaging
C[Kafka]
end
subgraph Processing
D1[Spark Streaming]
D2[Spark Batch + Hive]
end
subgraph Storage
E1[HDFS / Parquet]
E2[Hive Tables]
end
subgraph Visualization
F[Zeppelin Dashboard]
end
subgraph Orchestration
G1[Docker Compose]
G2[Ansible Scripts]
end
A --> B1 --> C
A --> B2 --> C
C --> D1 --> E1 --> F
C --> D2 --> E2 --> F
G1 --> B1
G1 --> B2
G1 --> C
G2 --> D1
G2 --> D2
This shows a complete ingest → process → store → visualize flow, with Docker Compose orchestrating the producers and Kafka, and Ansible orchestrating the Spark jobs.
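The data flow in the diagram can also be written down as a small adjacency map, which makes it easy to check that every route from the generator ends at the dashboard. This is only a minimal sketch in Python: the node names are copied from the diagram, while `PIPELINE` and `paths` are hypothetical names chosen for illustration, and the orchestration nodes are left out because they carry no data.

```python
# Data-flow edges copied from the architecture diagram (orchestration omitted).
PIPELINE = {
    "Clickstream Generator": ["Kafka Producer", "Akka Producer"],
    "Kafka Producer": ["Kafka"],
    "Akka Producer": ["Kafka"],
    "Kafka": ["Spark Streaming", "Spark Batch + Hive"],
    "Spark Streaming": ["HDFS / Parquet"],
    "Spark Batch + Hive": ["Hive Tables"],
    "HDFS / Parquet": ["Zeppelin Dashboard"],
    "Hive Tables": ["Zeppelin Dashboard"],
    "Zeppelin Dashboard": [],
}


def paths(graph, start, end):
    """Return every path from start to end in an acyclic graph, depth-first."""
    if start == end:
        return [[start]]
    found = []
    for nxt in graph.get(start, []):
        for tail in paths(graph, nxt, end):
            found.append([start] + tail)
    return found


if __name__ == "__main__":
    for route in paths(PIPELINE, "Clickstream Generator", "Zeppelin Dashboard"):
        print(" -> ".join(route))
```

Every route passes through Kafka, which is what lets either producer be swapped out without touching the processing side.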
Cloud Tool Mapping
The following tables and diagrams show how each open-source tool maps to a managed service in major cloud providers.
Mapping Table
| S.C.H.A.D. Tool | AWS Equivalent | Azure Equivalent | GCP Equivalent |
|---|---|---|---|
| Kafka | Amazon MSK | Azure Event Hubs | Google Pub/Sub |
| Spark | AWS Glue / EMR | Azure Synapse / HDInsight | Dataproc / Dataflow |
| Hive | Athena / Glue Catalog | Synapse SQL Pools | BigQuery |
| Akka | Lambda / ECS | Azure Functions | Cloud Functions |
| Docker | ECS / EKS | AKS / ACI | GKE / Cloud Run |
| Zeppelin | SageMaker Studio | Synapse Notebooks | Colab / Vertex AI Workbench |
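For scripting comparisons, the same mapping can be kept as plain data. The sketch below copies the rows for the pipeline components from the table (the notebook row is omitted for brevity); the dictionary layout and the function names `managed_equivalent` and `stack_for` are assumptions made for illustration, not part of any S.C.H.A.D. repository.

```python
# Rows copied from the mapping table; the dict layout is illustrative.
CLOUD_EQUIVALENTS = {
    "Kafka": {"aws": "Amazon MSK", "azure": "Azure Event Hubs", "gcp": "Google Pub/Sub"},
    "Spark": {"aws": "AWS Glue / EMR", "azure": "Azure Synapse / HDInsight", "gcp": "Dataproc / Dataflow"},
    "Hive": {"aws": "Athena / Glue Catalog", "azure": "Synapse SQL Pools", "gcp": "BigQuery"},
    "Akka": {"aws": "Lambda / ECS", "azure": "Azure Functions", "gcp": "Cloud Functions"},
    "Docker": {"aws": "ECS / EKS", "azure": "AKS / ACI", "gcp": "GKE / Cloud Run"},
}


def managed_equivalent(tool, provider):
    """Return the managed service listed for an open-source tool on one provider."""
    by_name = {name.lower(): row for name, row in CLOUD_EQUIVALENTS.items()}
    try:
        return by_name[tool.lower()][provider.lower()]
    except KeyError:
        raise ValueError(f"no mapping for tool={tool!r}, provider={provider!r}") from None


def stack_for(provider):
    """List every tool's managed equivalent for a single provider."""
    return {tool: row[provider.lower()] for tool, row in CLOUD_EQUIVALENTS.items()}


if __name__ == "__main__":
    print(managed_equivalent("kafka", "AWS"))
    for tool, service in stack_for("gcp").items():
        print(f"{tool:8} -> {service}")
```

Keeping the table as data means a cost or feature comparison can iterate over one provider's column instead of hard-coding service names.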
Ingestion Layer
graph TB
subgraph Ingestion
KAFKA[Kafka Open Source] --> AWSMSK[Amazon MSK]
KAFKA --> AZUREEVENT[Azure Event Hubs]
KAFKA --> GCPPUBSUB[Google Pub/Sub]
end
Compute Layer
graph TB
subgraph Compute
SPARK[Spark Streaming + Batch] --> AWSGLUE[AWS Glue / EMR]
SPARK --> AZURESYNAPSE[Azure Synapse]
SPARK --> GCPDATAPROC[GCP Dataproc]
AKKA[Akka Producer] --> AWSLAMBDA[AWS Lambda]
AKKA --> AZUREFUNC[Azure Functions]
AKKA --> GCPCLOUDRUN[GCP Cloud Run]
end
Storage & Query Layer
graph TB
subgraph Storage & Query
HIVE[Hive Open Source] --> AWSATHENA[Athena / Glue Catalog]
HIVE --> AZURESQL[Synapse SQL]
HIVE --> BIGQUERY[BigQuery]
end
Visualization Layer
graph TB
subgraph Visualization
ZEPPELIN[Zeppelin Notebook] --> SAGEMAKER[Amazon SageMaker Studio]
ZEPPELIN --> AZURENOTE[Azure Synapse Notebook]
ZEPPELIN --> COLAB[GCP Colab]
end
Repository Breakdown
Each component lives in its own repo to simulate real-world modularity and dev team structures.
| Component | Description | Repo |
|---|---|---|
| Clickstream Generator | Simulates user activity on a site | GitHub |
| Kafka Producer | Pushes data into Kafka from simulated input | GitHub |
| Akka Producer | Actor-based producer using Akka Streams | GitHub |
| Spark Applications | Real-time + batch ETL and transformation logic | GitHub |
| Hive SQL Layer | DDL and analytical SQL queries | GitHub |
| Zeppelin Notebooks | Interactive visualization notebooks | Private |
| Orchestration Scripts | Docker Compose, Ansible playbooks | Private |
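To give a feel for the first component in the table, here is a minimal sketch of what a clickstream generator could look like. It is not the code from the repository: the event fields, the `PAGES` and `ACTIONS` vocabularies, and the `make_event` and `generate` names are all assumptions made for illustration. The `send` argument stands in for whatever a producer uses to publish a message.

```python
import json
import random
import uuid
from datetime import datetime, timezone

# Hypothetical vocabularies; the real generator's schema is not shown in this post.
PAGES = ["/", "/products", "/cart", "/checkout"]
ACTIONS = ["view", "click", "add_to_cart", "purchase"]


def make_event(rng, user_count=100):
    """Build one synthetic click event as a JSON-serializable dict."""
    return {
        # Derive the UUID from rng so a fixed seed gives repeatable IDs.
        "event_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        "user_id": f"user-{rng.randrange(user_count)}",
        "page": rng.choice(PAGES),
        "action": rng.choice(ACTIONS),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def generate(n, send, seed=None):
    """Emit n events through send, a callable that takes one JSON string."""
    rng = random.Random(seed)
    for _ in range(n):
        send(json.dumps(make_event(rng)))


if __name__ == "__main__":
    # Print three events; a producer would pass its own publish function instead.
    generate(3, print, seed=42)
```

Passing `send` as a callable keeps the generator independent of the messaging layer, so the same events could feed either producer in the table.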
What I Learned
- Hands-on skills in streaming architectures (Kafka + Spark)
- Cross-cloud platform mapping of data tools
- Low-level debugging and orchestration using Ansible and Docker
- Designing cloud-agnostic systems from first principles
Want to Learn More?
Visit the SCHAD Meta Repository to explore the full breakdown. Connect on LinkedIn if you'd like to discuss cloud architectures or data engineering!
S.C.H.A.D. isn't just a proof of concept; it's a blueprint for understanding the open-source foundations of cloud-native data platforms.
