"Transforming messy transaction logs into high-fidelity behavioral insights through Python, Postgres, and a star-schema architecture."
This pipeline was built to solve the "messy data" problem. Using **Python and Pandas**, I standardize disparate bank logs and user-activity data, ensuring every record is cleaned, validated, and ready for downstream analytics.
Categorical-to-Ordinal Transformation
Grouped Imputation for Missing Values
Automated Unit Testing for Data Integrity
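The first two steps above can be sketched in Pandas as follows; the column names, tier labels, and fallback rule are illustrative assumptions, not the pipeline's actual schema:

```python
import numpy as np
import pandas as pd

# Hypothetical raw log with a categorical spend tier and missing amounts
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "spend_tier": ["Low", "High", "Medium", "Medium", "High"],
    "amount": [10.0, np.nan, 55.0, 45.0, np.nan],
})

# Categorical-to-ordinal transformation: map tiers onto an ordered scale
tier_order = {"Low": 0, "Medium": 1, "High": 2}
df["spend_tier_ord"] = df["spend_tier"].map(tier_order)

# Grouped imputation: fill missing amounts with the per-user mean,
# falling back to the global mean for users with no observed values
df["amount"] = df.groupby("user_id")["amount"].transform(
    lambda s: s.fillna(s.mean())
)
df["amount"] = df["amount"].fillna(df["amount"].mean())
```

Grouped imputation preserves per-user spending behavior rather than flattening everyone toward a single global average.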
Messy CSV bank statements and disparate app logs.
Pandas cleaning, imputation, and ordinal mapping.
Postgres Fact/Dim tables optimized for SQL analytics.
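The load stage of the three steps above can be sketched with `DataFrame.to_sql`. Table and column names here are assumptions, and an in-memory SQLite database stands in for Postgres so the example is self-contained:

```python
import sqlite3
import pandas as pd

# Hypothetical cleaned frames ready for the star schema
dim_customers = pd.DataFrame({
    "user_id": [1, 2],
    "user_name": ["alice", "bob"],
})
fact_transactions = pd.DataFrame({
    "transaction_id": [101, 102, 103],
    "user_id": [1, 1, 2],
    "revenue": [600.0, 500.0, 80.0],
})

# In production this would be a Postgres connection (e.g. via SQLAlchemy)
conn = sqlite3.connect(":memory:")
dim_customers.to_sql("dim_customers", conn, index=False)
fact_transactions.to_sql("fact_transactions", conn, index=False)

# The fact/dim split keeps analytics queries to simple keyed joins
segments = pd.read_sql(
    """
    SELECT c.user_name, SUM(f.revenue) AS lifetime_value
    FROM fact_transactions f
    JOIN dim_customers c ON f.user_id = c.user_id
    GROUP BY c.user_id, c.user_name
    ORDER BY lifetime_value DESC
    """,
    conn,
)
```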
-- Segments users based on total lifetime spend
SELECT
    c.user_name,
    COUNT(f.transaction_id) AS total_purchases,
    SUM(f.revenue) AS lifetime_value,
    CASE
        WHEN SUM(f.revenue) > 1000 THEN 'VIP'
        ELSE 'Standard'
    END AS customer_segment
FROM fact_transactions f
JOIN dim_customers c ON f.user_id = c.user_id
GROUP BY c.user_id, c.user_name
ORDER BY lifetime_value DESC;
-- Classifies users by purchase recency: Churned (>180 days), At Risk (>90 days), or Active
WITH LastPurchase AS (
    SELECT user_id, MAX(purchase_date) AS last_order_date
    FROM fact_transactions
    GROUP BY user_id
)
SELECT
    c.user_name,
    lp.last_order_date,
    CURRENT_DATE - lp.last_order_date AS days_since_last_order,
    CASE
        WHEN CURRENT_DATE - lp.last_order_date > 180 THEN 'Churned'
        WHEN CURRENT_DATE - lp.last_order_date > 90 THEN 'At Risk'
        ELSE 'Active'
    END AS churn_status
FROM LastPurchase lp
JOIN dim_customers c ON lp.user_id = c.user_id;
Standardizes messy logs, applies grouped imputation to missing fields, and maps categorical spending levels onto an ordinal scale for statistical analysis.
Designed around fact/dimension separation, so relational joins stay fast as dataset volume scales.
Integrated unit testing of data transformations verifies numerical correctness before data reaches production-grade financial environments.
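A minimal sketch of such a test, assuming a hypothetical `map_spend_tier` helper stands in for one of the pipeline's transformations:

```python
import pandas as pd

def map_spend_tier(series: pd.Series) -> pd.Series:
    """Map categorical spend tiers onto an ordinal scale."""
    return series.map({"Low": 0, "Medium": 1, "High": 2})

def test_ordinal_mapping_preserves_order():
    # The encoding must respect Low < Medium < High
    result = map_spend_tier(pd.Series(["Low", "Medium", "High"]))
    assert result.tolist() == [0, 1, 2]

def test_unknown_tier_yields_missing():
    # Unmapped categories surface as NaN rather than silently wrong values
    result = map_spend_tier(pd.Series(["Low", "Unknown"]))
    assert result.isna().tolist() == [False, True]

test_ordinal_mapping_preserves_order()
test_unknown_tier_yields_missing()
```

Under pytest these functions would be collected and run automatically; calling them directly keeps the sketch self-contained.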