SaaS Subscription Analytics Pipeline - Layered Data Architecture with Deterministic ETL

Overview

This project demonstrates the design of an end-to-end analytics engineering pipeline for a SaaS subscription business.

The pipeline simulates operational SaaS activity, ingests the generated data into analytical infrastructure, and transforms it into a structured dimensional model suitable for analytical workloads.

The primary focus of the project is not dashboard creation, but rather pipeline architecture, data modeling discipline, and analytical correctness.

The system evolves in two stages:

a local development stack for experimentation and modeling.
a cloud-based analytics pipeline simulating a production-style environment.

DESIGN INTENT
This project prioritizes architectural clarity, deterministic transformations, and explicit analytical models over scale or performance optimization.

Business Problem

Operational SaaS systems generate multiple streams of events:

subscription lifecycle events
payment transactions
product usage activity
user account records

In theory, these events should make it easy to answer fundamental business questions:

Is recurring revenue growing?
Are users retaining or churning?
Which subscription plans generate the most revenue?

In practice, however, raw operational data rarely arrives in a form that is immediately suitable for analytics. Real-world SaaS data pipelines must deal with a wide range of instability, including:

Duplicate events, often caused by retry mechanisms
Late-arriving data, where events appear days after they actually occurred
Schema inconsistencies, especially during product evolution
Marketing-driven noise, such as referral campaigns introducing unexpected fields
Operational spikes, where sudden traffic surges produce unusual event patterns

These issues are not rare edge cases; they are normal characteristics of production systems. As a result, a pipeline can run successfully while still producing incorrect analytical results.

This creates a subtle but dangerous failure mode: the system appears healthy, but the metrics it produces are misleading.

For analytics systems supporting business decisions, incorrect metrics are often worse than missing metrics.

Defining the Analytical Guarantees

Before designing any analytics pipeline, it is important to define what guarantees the system is expected to provide.

Operational data pipelines can often run successfully while still producing misleading metrics. This happens when transformations succeed technically but violate analytical assumptions such as event ordering, uniqueness, or consistent dimensional attributes.

To avoid these failure modes, the pipeline in this project is designed around a small set of explicit analytical guarantees.

1. Metric Consistency

Business metrics must remain consistent across queries and analytical models. For example, calculations such as:

Monthly Recurring Revenue (MRR)
churn rate
retention cohorts

should produce the same results regardless of where they are queried from within the analytics layer. This requires:

well-defined event grains
consistent dimensional joins
clearly separated fact tables

Without these constraints, different queries may unintentionally compute the same metric in different ways.

2. Deterministic Transformations

All transformations in the pipeline are designed to be deterministic. Given the same input data, the pipeline should always produce the same analytical outputs.

This requirement eliminates hidden dependencies such as:

non-deterministic ordering
implicit deduplication logic
transformations that depend on execution timing

Deterministic transformations make analytical results reproducible and debuggable.
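
As a minimal sketch of what this guarantee means in practice, the snippet below sorts on an explicit, total ordering key and compares a digest of the output across two runs with differently ordered input. The field names (event_id, event_ts) and the helper names are illustrative assumptions, not the project's actual code.

```python
import hashlib
import json
import random


def transform(events):
    """Order rows by an explicit, total key so output never depends on input order."""
    # event_id breaks ties between events that share a timestamp.
    ordered = sorted(events, key=lambda e: (e["event_ts"], e["event_id"]))
    return [{"event_id": e["event_id"], "event_ts": e["event_ts"]} for e in ordered]


def fingerprint(rows):
    """Stable digest of the output, useful for comparing two pipeline runs."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


events = [
    {"event_id": "e3", "event_ts": "2024-01-01T10:00:00"},
    {"event_id": "e1", "event_ts": "2024-01-01T10:00:00"},  # timestamp tie with e3
    {"event_id": "e2", "event_ts": "2024-01-01T09:00:00"},
]

shuffled = events[:]
random.Random(7).shuffle(shuffled)  # simulate a different arrival order

run_a = fingerprint(transform(events))
run_b = fingerprint(transform(shuffled))
```

If the sort key were only event_ts, the two tied events could come out in either order and the digests could differ between runs, which is exactly the kind of hidden dependency this guarantee rules out.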

3. Explicit Analytical Grain

Each analytical dataset must define a clear grain.

For event-based analytics, this means that every fact table represents a single type of business event.

For example:

fct_payments → one payment transaction
fct_subscription_events → one subscription lifecycle event
fct_product_events → one product interaction

This prevents ambiguous joins and reduces the risk of double counting. Explicit grain definitions are essential when working with event-driven SaaS datasets.
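
One way to make a grain definition enforceable is to check that the grain key is unique before a table is published. The sketch below is a hypothetical Python equivalent of a uniqueness test; the assert_grain helper and the sample columns are assumptions for illustration.

```python
from collections import Counter


def assert_grain(rows, grain_columns):
    """Raise if more than one row shares the same grain key; return the row count."""
    keys = Counter(tuple(row[c] for c in grain_columns) for row in rows)
    duplicates = {key: n for key, n in keys.items() if n > 1}
    if duplicates:
        raise ValueError(f"grain {grain_columns} violated: {duplicates}")
    return len(keys)


# One row per payment transaction, matching the fct_payments grain.
fct_payments = [
    {"payment_id": "p1", "user_id": "u1", "amount_usd": 29.0},
    {"payment_id": "p2", "user_id": "u1", "amount_usd": 29.0},
]

row_count = assert_grain(fct_payments, ["payment_id"])
```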

4. Traceability to Source Data

The pipeline must preserve a clear lineage between analytical models and the original event data. This requirement ensures that analytical anomalies can always be traced back to their source.

To support this, the pipeline maintains a raw data layer where events are stored with minimal modification.

Traceability makes it possible to investigate issues such as:

duplicate payment records
unexpected plan changes
timestamp anomalies

without losing access to the original data.

5. Failure on Data Quality Violations

Perhaps the most important guarantee is that the pipeline fails when analytical correctness is at risk.

Many pipelines treat data quality checks as warnings. However, allowing the pipeline to continue in the presence of severe anomalies can silently corrupt business metrics.

In this design, certain conditions intentionally trigger pipeline failure.

Examples include:

abnormal spikes in duplicate events
unexpected null increases in critical fields
schema inconsistencies affecting analytical joins

When these conditions occur, the pipeline stops and emits operational alerts. This approach prioritizes analytical correctness over pipeline availability.

In other words, the system is designed around the principle that:

producing incorrect metrics is worse than temporarily producing no metrics at all.
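
A hedged sketch of such a blocking check is shown below. The DataQualityError class, the function name, and the 5% threshold are hypothetical; the source does not state the actual thresholds used.

```python
class DataQualityError(Exception):
    """Raised when a check indicates that metrics would be unsafe to publish."""


def check_duplicate_rate(total_rows, duplicate_rows, max_rate=0.05):
    """Return the duplicate rate, or raise if it exceeds the allowed maximum.

    max_rate is an assumed threshold chosen only for illustration.
    """
    rate = duplicate_rows / total_rows if total_rows else 0.0
    if rate > max_rate:
        raise DataQualityError(
            f"duplicate rate {rate:.1%} exceeds allowed {max_rate:.1%}"
        )
    return rate


normal_rate = check_duplicate_rate(total_rows=1000, duplicate_rows=20)
```

Because the check raises instead of logging a warning, an orchestrator that treats uncaught exceptions as task failures will stop downstream tasks from running.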

What the Pipeline Ultimately Produces

Once the analytical guarantees are defined, the next step is determining what the pipeline must ultimately produce.

The goal of the system is not simply to move data between services, but to transform operational SaaS events into reliable analytical datasets that can support business decision making.

To achieve this, the pipeline automates the full lifecycle of data processing using a modern analytics stack.

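At a high level, the system performs the following sequence of operations:

generate synthetic events → upload to Cloudflare R2 → ingest into Snowflake → transform with dbt → report run status through Airflow and Telegram

Each stage represents a distinct responsibility in the pipeline.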

Automated Data Pipeline

The pipeline begins with a Python-based synthetic data generator that simulates operational SaaS activity.

The generator produces event datasets such as:

user registrations
subscription lifecycle events
payment transactions
product usage events

These datasets are written as Parquet files, a columnar format commonly used for storing analytical event data.

The generated files are then uploaded to Cloudflare R2, which acts as the object storage layer for raw event data.

From there, the pipeline ingests the data into Snowflake, where it becomes the raw analytical layer of the warehouse.

Once the data is available in Snowflake, transformations are executed using dbt, which organizes the analytical models into structured layers.

The pipeline transforms the raw event data through several stages:

RAW → STAGING → FOUNDATION → MARTS

These transformations convert loosely structured event streams into a clean dimensional model suitable for analytics.

All ingestion and transformation steps are orchestrated using Apache Airflow, allowing the entire pipeline to run automatically according to a defined workflow.

Analytical Outputs

The primary analytical output of the pipeline is a structured dataset designed for subscription analytics.

At the core of the analytical model is a dimensional star schema, which organizes the event data into clear fact tables and dimensions.

This structure enables the creation of higher-level business models, including:

Monthly Recurring Revenue (MRR)
Cohort Retention Analysis
Customer Lifetime Value (LTV)

These models represent the types of metrics commonly used to evaluate SaaS performance.

The result is a dataset that allows analysts to reliably answer questions such as:

how revenue evolves over time
how long customers stay subscribed
how product usage relates to long-term value

Operational Outputs

In addition to analytical datasets, the pipeline also produces operational signals that indicate whether the system is functioning correctly.

Pipeline execution is orchestrated by Apache Airflow, which manages the dependency graph between ingestion and transformation tasks.

Airflow also acts as the monitoring layer for the pipeline.

During execution, the system sends Telegram notifications indicating the status of the workflow.

Two primary outcomes are reported:

Pipeline success, when all tasks complete successfully
Pipeline failure, when any stage of the pipeline encounters an error or data quality violation

These alerts provide lightweight observability into the pipeline without requiring a dedicated monitoring platform.

Why the Pipeline Uses a Layered Analytics Architecture

Most modern analytics pipelines separate transformations into multiple logical layers.

A common structure looks like this:

RAW → STAGING → FOUNDATION → MARTS

Each layer exists for a specific purpose and isolates different types of transformations.

This layered design helps prevent a common problem in data pipelines: mixing ingestion logic, technical cleanup, and business modeling in the same transformation step.

In an earlier project (an e-commerce data pipeline), I implemented this structure with a strict separation between the analytical model and the business consumption layer. The experience from that design strongly influenced the architecture used in this project.

1. Raw Layer - Preserving Source Truth

The raw layer is designed as a 1:1 snapshot of the source data.

Data is ingested directly from the source without structural transformation so that any change in upstream systems can always be traced back to this layer.

By preserving the raw data exactly as it arrives, the pipeline maintains a reliable reference point for:

validation
debugging
historical comparison across runs

This approach improves data trustworthiness by minimizing the risk of accidental data loss or unintended alteration during ingestion.

Another practical benefit is operational simplicity. Persisting raw data inside the warehouse eliminates the need to repeatedly reload source files when investigating anomalies.

The trade-off is increased storage usage, since raw data is fully retained rather than transformed or discarded early. However, this cost is accepted deliberately in exchange for stronger data integrity guarantees.

2. Staging Layer - Technical Standardization

The staging layer exists to clean and standardize raw data before it is used to construct the analytical model.

At this stage, the data still closely resembles the raw source, but structural issues are resolved through transformations such as:

explicit type casting
timestamp normalization
format standardization
basic structural cleanup

Importantly, no business logic is introduced in the staging layer.

The goal is not to derive insights yet, but to ensure the data is technically consistent and ready for modeling.

By isolating type casting and normalization here, downstream models become significantly easier to reason about. Analytical transformations no longer need to deal with low-level data inconsistencies.

In short, staging acts as a technical preparation layer: the data is no longer raw, but it is not yet analytical.
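
To illustrate the kind of purely technical cleanup that belongs here, the sketch below normalizes a timestamp to UTC and trims an identifier. The column names and the rule that naive timestamps are treated as UTC are assumptions made for the example.

```python
from datetime import datetime, timezone


def stage_event(raw):
    """Apply technical standardization only: no business logic, no filtering."""
    ts = datetime.fromisoformat(raw["event_ts"])
    if ts.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    return {
        "event_id": str(raw["event_id"]).strip(),
        "event_ts": ts.astimezone(timezone.utc),
    }


staged = stage_event({"event_id": " e1 ", "event_ts": "2024-03-01T09:30:00+07:00"})
```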

3. Foundation Layer - The Canonical Analytical Model

The foundation layer contains the core analytical data model.

This layer implements a Kimball-style star schema, where event data is organized into well-defined fact tables and dimensions.

Each table has a clearly defined grain, and relationships between facts and dimensions are explicitly controlled.

In the earlier e-commerce pipeline design, this layer was treated as a protected canonical model that was not queried directly by analysts. Instead, all downstream consumption occurred through business-specific marts.

This design introduced a deliberate trade-off.

Restricting direct access to the foundation model improved analytical consistency and reduced the risk of subtle errors such as double counting or inconsistent aggregations.

However, it also created a bottleneck for exploratory analysis.

Business marts had to be created or modified every time analysts wanted to explore the data from a new perspective.

Adapting the Model for SaaS Analytics

In fast-moving SaaS environments, this strict separation can become counterproductive.

Unlike traditional transactional domains such as e-commerce, SaaS analytics often requires rapid exploratory analysis across multiple dimensions of product usage and subscription behavior.

Analysts frequently need to:

drill into subscription lifecycle events
explore product interaction patterns
investigate retention or churn anomalies

Creating a dedicated mart for every analytical question can slow down this process significantly.

For this reason, the design in this project intentionally allows the star schema in the foundation layer to be queried directly.

This provides analysts with a flexible analytical base where they can perform exploratory queries without waiting for new marts to be created.

The trade-off is that analysts must understand the grain and structure of the underlying fact tables.

However, in SaaS environments where analysts are often deeply familiar with product behavior, this flexibility tends to outweigh the risks.

In this architecture, the foundation layer serves two roles simultaneously:

a stable canonical analytical model
a flexible base for exploratory analysis

4. Marts Layer - Business-Specific Metrics

The marts layer remains the final consumption layer, but its purpose shifts slightly in this architecture.

Instead of acting as the only interface for analytics, marts now focus primarily on curated business metrics.

Examples include:

Monthly Recurring Revenue (MRR)
cohort retention models
customer lifetime value (LTV)

These models encode business definitions that should remain consistent across teams.

Because marts sit downstream from the canonical star schema, they can evolve independently without affecting the integrity of the core analytical model.

This design preserves a balance between:

analytical flexibility for exploration
stable business metrics for reporting

A Local-First Development Strategy

Before building the full cloud pipeline, the analytical model in this project was first developed in a local environment.

There were two main reasons for this decision.

First, this project represents an exploration of SaaS analytics patterns, including subscription lifecycle modeling and recurring revenue metrics.

Developing the transformations locally made it possible to iterate quickly while understanding the structure of the data and the analytical questions the system needs to answer.

Second, cloud infrastructure introduces operational complexity and cost.

Services such as cloud warehouses and orchestration platforms are powerful, but they also make experimentation slower and more expensive during the early stages of development.

For these reasons, the initial phase of the project focused on establishing a stable analytical foundation locally. This avoided cloud costs while the analytical logic was still maturing.

Once the data model and transformation structure were stable, the pipeline could be migrated to a cloud environment with greater confidence.

Transitioning to a Cloud Analytics Architecture

Once the analytical model became stable in the local environment, the next step was to move the pipeline into a cloud-based architecture.

The goal of this transition was not simply to move data processing to the cloud, but to introduce several capabilities that are difficult to replicate in local workflows:

automated pipeline orchestration
centralized analytical storage
reproducible scheduled execution
operational monitoring

Rather than building a complex infrastructure stack, the architecture intentionally uses a small set of well-established tools from the modern analytics ecosystem. Each component serves a specific role in the pipeline.

Object Storage: Cloudflare R2

The pipeline uses Cloudflare R2 as the object storage layer for raw event data.

R2 serves a similar role to services such as Amazon S3: it stores immutable files that represent the original event data generated by the system.

One of the main motivations for choosing R2 is cost efficiency. Compared with many cloud storage solutions, R2 provides competitive storage pricing while still supporting the same object storage paradigm.

Another advantage is ecosystem compatibility. R2 follows an S3-compatible API, which means that the pipeline design remains portable. If the system later needs to migrate to Amazon S3 or another object storage provider, the transition can be performed with minimal architectural changes.

This makes R2 a practical choice for projects that aim to simulate real-world data platforms without introducing unnecessary infrastructure cost.

Data Warehouse: Snowflake

The analytical warehouse used in this project is Snowflake.

Snowflake has become a widely adopted platform in modern analytics engineering because it separates compute and storage while providing a SQL-first analytical environment.

For this pipeline, Snowflake serves as the central system responsible for:

storing the raw analytical layer
executing SQL transformations
supporting dimensional models and business marts

Because Snowflake is designed specifically for analytical workloads, it integrates naturally with tools such as dbt and orchestration frameworks.

Using Snowflake also reflects common patterns in modern data teams, where the warehouse becomes the primary environment for both transformation and analysis.

Transformation Layer: dbt

Transformations inside the warehouse are implemented using dbt (data build tool).

dbt provides a structured framework for organizing SQL transformations into modular models. Instead of writing large, hard-coded SQL scripts, transformations are defined as reusable models with explicit dependencies.

This approach offers several advantages.

First, it reduces repetitive SQL patterns that commonly appear in data pipelines. Transformations can be structured and version-controlled in a way that resembles software engineering workflows.

Second, dbt makes it easier to introduce automated testing and documentation.

Features such as dbt tests allow the pipeline to enforce data quality checks directly within the transformation layer, ensuring that critical assumptions about the data are validated during execution.

Finally, dbt fits particularly well in organizations where the data platform is evolving quickly. Because transformations are modular and declarative, new models can be added or modified without rewriting the entire pipeline.

Pipeline Orchestration: Apache Airflow

To automate the execution of the entire pipeline, the system uses Apache Airflow as the orchestration layer.

Airflow manages the workflow that connects each stage of the pipeline, including:

data ingestion
transformation execution
data quality checks
downstream model builds

By defining these tasks as a directed acyclic graph (DAG), Airflow ensures that each stage runs in the correct order and only after its dependencies have completed successfully.
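
The ordering guarantee can be illustrated without Airflow itself. The sketch below uses Python's standard-library graphlib to resolve a dependency graph into a valid execution order; it is not Airflow code, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (names are illustrative).
dependencies = {
    "upload_to_r2": {"generate_data"},
    "load_raw": {"upload_to_r2"},
    "dbt_staging": {"load_raw"},
    "dbt_foundation": {"dbt_staging"},
    "dbt_tests": {"dbt_foundation"},
    "dbt_marts": {"dbt_tests"},
    "notify": {"dbt_marts"},
}

# static_order() yields every task after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
```

A cycle in the dependencies would raise graphlib.CycleError, which mirrors why the workflow has to be acyclic in the first place.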

Automation plays an important role here.

Without orchestration, many steps in the pipeline would require manual execution, increasing both operational overhead and the risk of human error.

With Airflow managing the workflow, the pipeline can run automatically on a schedule. This allows the system to operate continuously without requiring manual intervention.

In practice, this means that engineering time can be focused more on improving analytical models and investigating insights, rather than repeatedly executing pipeline tasks.

Operational Monitoring: Telegram Alerts

Even with automated orchestration, pipelines require monitoring.

To provide lightweight observability, the system sends Telegram alerts when the pipeline completes execution.

Two primary outcomes are reported:

pipeline success
pipeline failure

These notifications allow the pipeline operator to immediately detect issues such as ingestion failures or failed transformations.

Telegram was chosen primarily because it is simple, reliable, and essentially free to operate.

Unlike full monitoring platforms, it does not introduce additional infrastructure or cost, yet still provides immediate visibility into pipeline status.

For a project of this scale, this lightweight approach strikes a practical balance between observability and operational simplicity.
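
As a rough sketch, an alert of this kind can be sent through the Telegram Bot API's sendMessage method. The function name, token, and chat id below are placeholders, and the snippet only builds the request so that it can be inspected without network access.

```python
import json
import urllib.request


def build_telegram_request(bot_token, chat_id, text):
    """Build (but do not send) a POST request for the Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    body = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder credentials; real values would come from a secrets store.
request = build_telegram_request("TOKEN", "12345", "Pipeline failed: dbt tests")

# To actually send the alert:
# urllib.request.urlopen(request, timeout=10)
```

In an orchestrated setup, a function like this would typically be called from success and failure callbacks rather than from the tasks themselves.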

Simulating Real-World Data Instability (Chaos Scenarios)

Data pipelines are often designed under the assumption that upstream systems produce clean and well-structured data.

In reality, operational data is rarely this predictable. Production pipelines frequently encounter issues such as:

delayed events
duplicate records
schema changes
inconsistent categorical values
unexpected null spikes

Rather than treating these situations as rare edge cases, this project intentionally simulates several common data instability scenarios.

The goal is to ensure that the pipeline can handle messy input data while still producing reliable analytical outputs.

These scenarios are introduced deliberately during data generation and ingestion, creating a controlled form of chaos testing for the analytics pipeline.

Both the local development stack and the cloud analytics stack include their own chaos scenarios, reflecting different types of issues that occur in real data systems.

Chaos in the Local Development Stack

The local stack focuses on data-level anomalies that commonly occur in event-driven systems.

These scenarios test whether the analytical models remain correct even when upstream data evolves or behaves unexpectedly.

Five chaos scenarios are introduced in the local environment:

1. Late Arriving Events

Some events arrive later than when they actually occurred.

Approximately 3% of events are delayed by one month, meaning the event timestamp and ingestion batch timestamp do not match.

This situation commonly occurs in production pipelines when upstream systems retry failed deliveries or when ingestion pipelines experience temporary delays.

Handling strategy

Rather than rejecting late events, the pipeline tracks them explicitly.

the original event_date represents when the event occurred
batch_month represents when the event arrived

Analytical queries are based on event_date, ensuring temporal accuracy.

Late events are therefore accepted rather than discarded, even though they may cause historical reports to change slightly over time.

This reflects a common trade-off in analytics systems:

temporal accuracy is often more important than perfectly stable historical reports.
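
The difference between the two timestamps can be seen in a small sketch. The sample events are invented, and the month format (YYYY-MM strings) is an assumption about how batch_month is stored.

```python
from collections import Counter
from datetime import date

events = [
    {"event_id": "e1", "event_date": date(2024, 1, 15), "batch_month": "2024-01"},
    {"event_id": "e2", "event_date": date(2024, 1, 28), "batch_month": "2024-02"},  # late
    {"event_id": "e3", "event_date": date(2024, 2, 3), "batch_month": "2024-02"},
]


def event_month(event):
    """Month in which the event actually occurred, as a YYYY-MM string."""
    return event["event_date"].strftime("%Y-%m")


# Analytical view: attribute each event to the month it occurred in.
by_event_month = Counter(event_month(e) for e in events)

# Arrival view: attribute each event to the batch that delivered it.
by_batch_month = Counter(e["batch_month"] for e in events)

# Events whose batch is later than their event month arrived late.
late_ids = [e["event_id"] for e in events if event_month(e) < e["batch_month"]]
```

Here the January count rises from one to two once the February batch lands, which is the small historical restatement the trade-off above accepts.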

2. Plan Rename (Rebranding Event)

At month six, the product plan “Pro” is renamed to “Pro Plus”.

Both names refer to the same underlying product but appear as different values in the event data. This scenario simulates a typical business event such as product rebranding.

Handling strategy

The pipeline models this change using Slowly Changing Dimension Type 2 (SCD Type 2) in the dim_plans table.

Each version of the plan receives its own validity period. This allows historical queries to correctly reflect the plan name that existed at the time of each event.

Importantly, the values are not standardized into a single label, because the change itself is analytically meaningful.

Tracking the rename makes it possible to analyze how the rebranding affects subscription behavior.
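
A minimal sketch of the point-in-time lookup that SCD Type 2 enables is shown below. The column names (plan_key, plan_id, valid_from, valid_to), the calendar dates, and the half-open validity interval are assumptions for illustration; the source only states that each version has its own validity period.

```python
from datetime import date

# Two versions of the same plan; valid_to of None marks the current version.
dim_plans = [
    {"plan_key": 1, "plan_id": "pro", "plan_name": "Pro",
     "valid_from": date(2024, 1, 1), "valid_to": date(2024, 6, 1)},
    {"plan_key": 2, "plan_id": "pro", "plan_name": "Pro Plus",
     "valid_from": date(2024, 6, 1), "valid_to": None},
]


def plan_version_at(plan_id, event_date):
    """Return the plan version whose [valid_from, valid_to) range covers event_date."""
    for row in dim_plans:
        started = row["valid_from"] <= event_date
        not_ended = row["valid_to"] is None or event_date < row["valid_to"]
        if row["plan_id"] == plan_id and started and not_ended:
            return row
    raise LookupError(f"no version of plan {plan_id!r} valid on {event_date}")


before_rename = plan_version_at("pro", date(2024, 5, 31))["plan_name"]
after_rename = plan_version_at("pro", date(2024, 6, 1))["plan_name"]
```

Using a half-open interval means an event on the rename date matches exactly one version, which avoids double counting in joins.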

3. Schema Evolution

At month eight, new columns appear in the upstream dataset.

Two additional fields are introduced:

ingestion_source
promo_code

This scenario simulates a common production situation where upstream services deploy new event attributes without coordinating with downstream pipelines.

Handling strategy

The ingestion process dynamically adapts to schema changes.

If a column appears that does not exist in the database table, the ingestion script automatically performs an ALTER TABLE to add the missing column.

However, the pipeline still evaluates whether the new field has analytical value.

promo_code is retained because it may influence future marketing analysis
ingestion_source is dropped because it is purely technical metadata

This approach allows the pipeline to remain robust to schema changes while avoiding unnecessary data bloat.
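
The ALTER TABLE step could look roughly like the sketch below. It uses SQLite from the standard library as a stand-in database, and the table name, column types, and helper name are assumptions rather than the project's actual ingestion script.

```python
import sqlite3


def add_missing_columns(conn, table, incoming_columns):
    """Add any incoming column that the target table does not have yet.

    Identifiers are interpolated directly, so this is only safe when table and
    column names come from a trusted source such as the file schema.
    """
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for name, sql_type in incoming_columns.items():
        if name not in existing:
            conn.execute(f'ALTER TABLE {table} ADD COLUMN "{name}" {sql_type}')
            added.append(name)
    return added


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_subscriptions (subscription_id TEXT, plan TEXT)")

# Schema of a new batch that carries the two extra fields.
incoming = {
    "subscription_id": "TEXT",
    "plan": "TEXT",
    "ingestion_source": "TEXT",
    "promo_code": "TEXT",
}
added = add_missing_columns(conn, "raw_subscriptions", incoming)
columns_after = [row[1] for row in conn.execute("PRAGMA table_info(raw_subscriptions)")]
```

In this sketch both columns land in the raw table, consistent with the raw layer storing data as received; deciding which of them to carry forward would happen downstream.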

4. Duplicate Payments

In month ten, approximately 2% of payment events are duplicated.

Duplicate events are a common issue in distributed systems, often caused by retry mechanisms or idempotency failures.

Unlike some other anomalies, duplicates directly corrupt analytical metrics such as revenue totals.

Handling strategy

Duplicates are removed in the staging layer using window functions.

For each payment_id, the pipeline keeps the most recent record and discards the rest.

Additional dbt tests enforce that the cleaned dataset maintains a unique payment identifier, ensuring that duplicates cannot silently propagate downstream.
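
The keep-the-latest rule can be sketched in plain Python as follows. It mirrors the logic of a window function that ranks rows per payment_id and keeps the first; the ingested_at column used to decide which record is most recent is an assumption, since the source does not name the ordering column.

```python
def deduplicate_payments(rows):
    """Keep the most recently ingested row for each payment_id."""
    latest = {}
    for row in rows:
        current = latest.get(row["payment_id"])
        # Rows tied on ingested_at are assumed to be identical retry copies,
        # so keeping the first one seen does not change the result.
        if current is None or row["ingested_at"] > current["ingested_at"]:
            latest[row["payment_id"]] = row
    # Sort the output so it does not depend on input order.
    return sorted(latest.values(), key=lambda r: r["payment_id"])


payments = [
    {"payment_id": "p1", "amount_usd": 29, "ingested_at": "2024-10-01T00:00:00"},
    {"payment_id": "p1", "amount_usd": 29, "ingested_at": "2024-10-01T00:05:00"},  # retry
    {"payment_id": "p2", "amount_usd": 99, "ingested_at": "2024-10-02T00:00:00"},
]

clean = deduplicate_payments(payments)
revenue_before = sum(p["amount_usd"] for p in payments)
revenue_after = sum(p["amount_usd"] for p in clean)
```

The gap between the two revenue totals shows why this anomaly, unlike a late event, has to be removed rather than merely flagged.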

5. Datatype Drift

In month twelve, the amount_usd field changes from a numeric type to a string.

This situation simulates upstream systems that modify data serialization without updating downstream consumers.

Handling strategy

The staging layer performs explicit type casting to enforce the expected data type.

If invalid values appear that cannot be converted to numeric format, the transformation fails immediately.

Additional dbt tests verify that the resulting values are non-null and non-negative.

This follows a fail-fast philosophy: type errors should surface immediately rather than silently propagating into analytical models.
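
A fail-fast cast of this kind might look like the sketch below. The function name and the choice of Decimal as the target type are assumptions; the checks mirror the non-null and non-negative tests described above.

```python
from decimal import Decimal, InvalidOperation


def cast_amount_usd(value):
    """Cast a raw amount to Decimal, raising immediately on anything invalid."""
    if value is None:
        raise ValueError("amount_usd is null")
    try:
        amount = Decimal(str(value).strip())
    except InvalidOperation as exc:
        raise ValueError(f"amount_usd is not numeric: {value!r}") from exc
    # Decimal accepts "NaN" and "Infinity", so reject those explicitly.
    if not amount.is_finite() or amount < 0:
        raise ValueError(f"amount_usd is out of range: {value!r}")
    return amount


from_number = cast_amount_usd(29.0)      # value before the drift
from_string = cast_amount_usd(" 29.00")  # value after the drift
```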

Chaos in the Cloud Analytics Stack

The cloud environment introduces a different category of instability.

While the local stack focuses on structural issues such as schema evolution and duplicates, the cloud stack simulates business-driven anomalies and operational spikes.

Four major chaos scenarios are introduced:

1. Plan Migration (Month 3 Onward)

Subscription plans are migrated from the previous naming convention:

Free → Starter
Pro → Growth
Business → Enterprise

However, the migration script introduces inconsistent variants such as:

GROWTH
Growht
rowth_plan

These messy values simulate real-world migrations where data transformations are not perfectly executed.

Handling strategy

The staging layer normalizes the plan values using conditional logic.

Values are cleaned using trimming and upper-case normalization before mapping them to a controlled set of valid plans.

At the same time, the pipeline preserves the original value in a separate column.

plan_raw → original source value
plan_cleaned → normalized canonical value

This approach preserves the source signal while still enabling consistent analytical modeling.
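
A hedged sketch of this normalization is shown below. The alias table covers only the variants listed above, and mapping unrecognized values to "Unknown" is an assumption made for the example.

```python
# Keys are compared after trimming and upper-casing the raw value.
PLAN_ALIASES = {
    "STARTER": "Starter",
    "GROWTH": "Growth",
    "GROWHT": "Growth",      # typo variant from the migration script
    "ROWTH_PLAN": "Growth",  # malformed variant from the migration script
    "ENTERPRISE": "Enterprise",
}


def clean_plan(plan_raw):
    """Return both the untouched source value and its canonical plan name."""
    key = plan_raw.strip().upper() if plan_raw is not None else None
    # Assumption: anything outside the controlled set is labeled "Unknown".
    return {"plan_raw": plan_raw, "plan_cleaned": PLAN_ALIASES.get(key, "Unknown")}


cleaned = [clean_plan(value) for value in ["GROWTH", " Growht ", "rowth_plan", "Starter"]]
```

Keeping plan_raw next to plan_cleaned means a wrong mapping can be corrected later without reloading the source data.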

2. Referral Code Noise

Beginning in month eight, marketing campaigns introduce inconsistent referral code formats.

Examples include:

REF-123
ref_123
N/A
""
ORGANIC

These inconsistencies are common in marketing attribution pipelines.

Handling strategy

The staging layer normalizes valid referral codes while converting invalid formats to null values.

This produces a cleaned column:

referral_code_cleaned

while maintaining analytical consistency in downstream models.
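
One possible shape for this rule is sketched below. The canonical REF-<digits> format and the decision to treat ORGANIC as a non-code are assumptions; the source only states that valid codes are normalized and invalid formats become null.

```python
import re

# Accepts REF followed by an optional separator and digits, case-insensitively
# once the input has been upper-cased.
_REFERRAL_PATTERN = re.compile(r"^REF[-_ ]?(\d+)$")


def clean_referral_code(raw):
    """Return a canonical REF-<digits> code, or None when the value is not a code."""
    if raw is None:
        return None
    match = _REFERRAL_PATTERN.match(raw.strip().upper())
    return f"REF-{match.group(1)}" if match else None


samples = ["REF-123", "ref_123", "N/A", "", "ORGANIC"]
referral_code_cleaned = [clean_referral_code(value) for value in samples]
```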

3. Viral Usage Spike

Starting in month ten, the system experiences a sudden growth surge where user activity increases roughly fourfold.

This surge produces unusually high event volumes and occasionally causes timestamp collisions. Rather than artificially smoothing these spikes, the pipeline records them as signals.

Handling strategy

Events that arrive outside expected temporal patterns are flagged using an is_late_arriving indicator. These signals are surfaced later in a dedicated data quality mart, allowing anomalies to be observed without distorting the analytical model.

4. Null Spike in Plan Data

During the same growth period, onboarding systems become temporarily overwhelmed.

As a result, approximately 30% of new subscription records contain null values for the plan field.

Handling strategy

Instead of dropping these records, the pipeline preserves them while labeling the missing value as Unknown.

This ensures that anomalous behavior remains visible in the analytical layer rather than silently disappearing.

Null spikes are surfaced in a dedicated data quality monitoring model.
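
The two halves of this strategy, labeling and monitoring, can be sketched as follows. The month column, the plan_was_null flag, and the helper names are hypothetical.

```python
from collections import defaultdict


def label_missing_plan(record):
    """Keep the record, replace a null plan with 'Unknown', and flag that it was null."""
    plan = record["plan"]
    return {
        **record,
        "plan": plan if plan is not None else "Unknown",
        "plan_was_null": plan is None,
    }


def null_rate_by_month(records):
    """Share of records with a null plan per month, for a monitoring model."""
    totals, nulls = defaultdict(int), defaultdict(int)
    for record in records:
        totals[record["month"]] += 1
        nulls[record["month"]] += record["plan"] is None
    return {month: nulls[month] / totals[month] for month in sorted(totals)}


# Ten synthetic subscriptions in one month, three of them missing a plan.
records = [{"month": "2024-10", "plan": None if i < 3 else "Growth"} for i in range(10)]

rates = null_rate_by_month(records)
labeled = [label_missing_plan(r) for r in records]
```

Computing the rate on the raw values before labeling matters: once nulls become "Unknown", the spike is still visible as a category but no longer as a null count.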

Layer Responsibilities for Chaos Handling

Across both environments, the pipeline follows a consistent philosophy for handling messy data. Each transformation layer has a clearly defined responsibility.

Raw layer

store data exactly as received
no transformations applied

Staging layer

normalize formats
clean duplicates
cast types
flag anomalies

Foundation layer

enforce analytical structure
stop the pipeline if data becomes invalid

Marts layer

surface analytical insights
expose anomaly signals through monitoring models

This layered approach allows the pipeline to separate data cleaning, anomaly detection, and analytical modeling, ensuring that messy operational data does not silently corrupt business metrics.

Data Quality Policy

Designing the pipeline is not only about moving and transforming data, but also about defining what level of data quality is acceptable.

In real systems, data is rarely perfect. Events may arrive late, values can be inconsistent, and upstream systems occasionally produce invalid records.

Because of that, this project adopts a clear data quality policy: not every anomaly should be silently fixed, but every anomaly should be observable.

1. Fail the Pipeline When Data Becomes Unsafe

Some issues indicate that the data is no longer reliable for analytics. When this happens, the pipeline should stop immediately rather than silently producing incorrect metrics.

The pipeline is designed to crash intentionally if critical constraints fail after the dimensional modeling stage, for example:

invalid keys
broken joins
major schema inconsistencies

This prevents corrupted data from propagating into downstream marts.

The failure event is captured by the orchestration layer, and a notification is sent through the monitoring channel so the issue can be investigated quickly.

2. Surface Anomalies Instead of Hiding Them

Not all data issues should cause the pipeline to fail. Many real-world anomalies are still analytically meaningful.

Examples include:

late arriving events
temporary null spikes during traffic surges
noisy referral codes generated by marketing campaigns

Instead of removing or silently correcting these records, the pipeline preserves them while adding flags or normalization logic during the staging transformation.

This approach ensures that the data warehouse reflects the actual behavior of the product, rather than presenting an artificially “perfect” dataset.

What This Project Demonstrates

This project is not meant to simulate a large-scale production data platform.

Instead, it demonstrates how a modern analytics pipeline can be designed intentionally from the start, even in a small environment.

The goal is to show how different components of the modern data stack can work together to build a pipeline that is automated, observable, and resilient to imperfect data.

1. End-to-End Modern Data Stack

The pipeline demonstrates a complete analytics workflow, starting from data generation all the way to business-ready metrics.

Synthetic SaaS event data is generated using Python and stored in object storage. From there, the data is ingested into the warehouse, transformed through multiple modeling layers, and finally exposed as analytical marts such as MRR, cohort retention, and customer lifetime value.

Each stage represents a common layer found in modern data platforms: raw ingestion, staging transformations, dimensional modeling, and analytical marts.

2. Automation Through Orchestration

Another key aspect demonstrated in this project is pipeline automation.

Instead of running ingestion or transformation scripts manually, the entire workflow is orchestrated through scheduled tasks. The pipeline can ingest new data, run transformations, execute data quality tests, and notify the monitoring channel without manual intervention.

This mirrors how production analytics systems are typically managed, where reliability and repeatability are critical.

3. Handling Imperfect Data

Real-world data pipelines must deal with inconsistencies and unexpected changes.

This project intentionally introduces several forms of data irregularity, such as late-arriving events, inconsistent plan names, referral noise, and temporary null spikes.

Rather than simply cleaning everything away, the pipeline demonstrates how these anomalies can be detected, normalized, or surfaced depending on their analytical importance.

This highlights an important principle: a good analytics system does not assume perfect data; it is designed to expect imperfect inputs.

4. Structuring Data for Analytics

Finally, the project demonstrates how raw event data can be transformed into a structured analytical model.

Through dimensional modeling and star schema design, the warehouse organizes raw events into a format that supports common SaaS metrics and analytical queries.

This ensures that downstream analysis remains stable even when upstream data evolves.

Together, these components illustrate how a relatively small project can still reflect many of the design considerations found in real-world analytics engineering environments.

References

GitHub Repository: SaaS Analytics Pipeline
Phase 3 Reflection: Phase 3 - Moving from Local Data Pipelines to Cloud Analytics Engineering
