Hi, I'm Subhajit 👋🏻
I am a Data Engineer with nearly 4 years of hands-on experience architecting and maintaining production-scale ETL/ELT pipelines, enterprise data lakehouses, and distributed processing systems. Prior to this, I earned my B.Tech in Computer Science and Engineering from MAKAUT.
I specialize in the AWS data ecosystem, leveraging S3, Glue (PySpark), EMR, Redshift, and Athena, alongside modern platforms like Snowflake and Delta Lake. I have engineered and scaled 150+ automated pipelines processing hundreds of GBs daily with periodic TB-scale refreshes, establishing robust medallion architectures (Bronze/Silver/Gold) and incremental load methodologies (CDC/SCD).
My engineering philosophy revolves around end-to-end ownership. From metadata-driven ingestion of raw data and complex transformation logic to CI/CD deployments, orchestration via AWS Step Functions, and targeted SQL query optimization, I prioritize rigorous data integrity, fault tolerance, and compute cost efficiency in live production environments.
Experience
Tata Consultancy Services
Data Engineer | System Engineer
July 2024 – Present
- Architected and maintained 150+ production-grade AWS ETL/ELT pipelines (S3, Glue, EMR, Redshift, Athena) across a medallion architecture (Bronze/Silver/Gold), processing hundreds of GBs daily with periodic TB-scale refreshes for downstream BI consumers.
- Standardized the data lake storage layer on Delta Lake on S3, used alongside Parquet, enabling ACID transactions, schema enforcement, and reliable upserts across the entire data lake.
- Designed and implemented a metadata-driven ingestion framework using AWS Glue Crawlers for automated schema discovery across Parquet, JSON, and CSV sources — reducing downstream data latency by ~30%.
- Orchestrated fault-tolerant batch ETL workflows using AWS Step Functions with retry logic and failure handling, reducing manual intervention and improving execution reliability in production (a state-machine sketch follows this list).
- Developed incremental load logic (insert/update/delete) using SQL and PySpark, aligned with dimensional modeling, keeping Redshift tables current and analytically consistent (see the merge sketch after this list).
- Established CI/CD pipelines using Jenkins and Git for version-controlled, repeatable deployments across dev, test, and production environments.
- Optimized SQL query performance by ~20–25% through partition pruning, predicate pushdown, and efficient joins in Redshift and Athena — reducing unnecessary data scans and compute overhead.
- Implemented proactive pipeline monitoring using CloudWatch logs, alerts, and dashboards — enabling early detection and resolution of production issues before downstream impact.
- Reduced compute and storage costs by optimizing S3 storage patterns, query structures, and scan footprints across Glue and Athena workloads.
- Partnered with analytics, QA, and BI teams across full Agile delivery cycles — sprint planning, backlog refinement, and production support — to ensure reliable, business-aligned data delivery.
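
The retry logic mentioned above maps to Step Functions' built-in `Retry` and `Catch` fields in Amazon States Language. Below is a minimal, illustrative definition registered via boto3; the job name, topic ARN, role ARN, and account ID are placeholders, not production values.

```python
# Illustrative fault-tolerant batch workflow: run a Glue job with
# exponential-backoff retries, then notify on unrecoverable failure.
import json
import boto3

definition = {
    "StartAt": "RunGlueJob",
    "States": {
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "example-etl-job"},  # placeholder name
            "Retry": [
                {
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 60,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }
            ],
            "Catch": [
                {"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}
            ],
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:etl-alerts",
                "Message": "ETL workflow failed",
            },
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="example-batch-etl",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/example-sfn-role",
)
```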
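The incremental (insert/update/delete) pattern is, at its core, a Delta Lake MERGE keyed on a business identifier, with a CDC flag deciding the action per row. This is a minimal sketch assuming the delta-spark package; the S3 paths, `order_id` key, and `change_type` column are illustrative, not the production schema.

```python
# Minimal CDC-style incremental upsert into a Delta table
# (illustrative paths and column names; assumes delta-spark).
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder.appName("incremental-upsert")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Incoming batch of change records with a CDC flag column: I/U/D.
updates = spark.read.parquet("s3://example-bucket/bronze/orders_changes/")

target = DeltaTable.forPath(spark, "s3://example-bucket/silver/orders/")

(
    target.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedDelete(condition="s.change_type = 'D'")
    .whenMatchedUpdateAll(condition="s.change_type = 'U'")
    .whenNotMatchedInsertAll(condition="s.change_type = 'I'")
    .execute()
)
```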
Data Engineer | Assistant System Engineer
July 2023 – June 2024
- Architected and built a cloud data lakehouse on AWS, migrating and consolidating on-premises data sources into a scalable S3 architecture and cutting infrastructure costs by £487K annually.
- Established a medallion architecture (Bronze/Silver/Gold) with Delta Lake on S3 as the storage standard, using AWS DMS for ingestion — enabling ACID transactions, schema enforcement, and reliable upserts.
- Built end-to-end data transformation and processing jobs using AWS Glue (PySpark) and Lambda, orchestrated via AWS Step Functions with retry logic and failure handling for production reliability (a job skeleton follows this list).
- Delivered data to 6 downstream consumer systems based on business requirements: Athena (SQL querying), QuickSight (dashboards), PostgreSQL (cost-efficient result serving), Redshift (analytical workloads), Snowflake (project-specific), and DynamoDB, ensuring each team received fit-for-purpose data.
- Collaborated across an Agile team via sprint planning, Jira-based task management, and cross-functional coordination with QA, analytics, and BI stakeholders.
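
A rough skeleton of the bronze-to-silver Glue (PySpark) jobs described above. The catalog database, table name, and S3 paths are invented for the example, and writing Delta output assumes the job is configured with Glue's Delta Lake support (`--datalake-formats delta`).

```python
# Skeleton Glue job: read raw (bronze) data from the Data Catalog,
# cleanse and deduplicate, and write a silver Delta table.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Bronze layer: raw DMS output registered in the Glue Data Catalog
# (placeholder database/table names).
bronze = glue_context.create_dynamic_frame.from_catalog(
    database="bronze_db", table_name="customers_raw"
).toDF()

# Silver layer: standardize fields and deduplicate on the business key.
silver = (
    bronze.dropDuplicates(["customer_id"])
    .withColumn("email", F.lower(F.trim(F.col("email"))))
    .filter(F.col("customer_id").isNotNull())
)

silver.write.format("delta").mode("overwrite").save(
    "s3://example-bucket/silver/customers/"
)

job.commit()
```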
Data Engineer | Assistant System Engineer - Trainee
July 2022 – June 2023
- Migrated ETL pipelines from Informatica PowerCenter to Informatica Intelligent Cloud Services (IICS), handling diverse input and output data types across source and target systems to modernize the data integration layer.
- Developed end-to-end ETL mappings, wrote solution design documentation in Confluence, and used Oracle SQL for data extraction and validation, ensuring data accuracy at the source.
- Executed an on-premises Oracle database migration to Oracle Cloud Infrastructure (OCI), handling end-to-end data integration and ensuring consistency between source and target systems throughout the migration lifecycle (a reconciliation sketch follows this list).
- Built and validated ETL pipelines in IICS end-to-end — covering data cleaning via Oracle SQL, unit testing, sanity testing, and system integration testing (SIT).
- Maintained version control and task tracking using GitHub and Jira across both projects within an Agile delivery model.
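
One of the simpler consistency checks in a migration like this is row-count reconciliation between source and target. The sketch below uses the python-oracledb driver; the connection details and table list are placeholders, not the project's actual values.

```python
# Compare row counts per table between the on-prem source and the
# OCI target after migration (placeholder credentials and tables).
import os
import oracledb

TABLES = ["CUSTOMERS", "ORDERS", "PAYMENTS"]  # illustrative list

def row_count(conn, table: str) -> int:
    with conn.cursor() as cur:
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        return cur.fetchone()[0]

source = oracledb.connect(
    user="etl_user", password=os.environ["SRC_DB_PASSWORD"], dsn="onprem_db"
)
target = oracledb.connect(
    user="etl_user", password=os.environ["TGT_DB_PASSWORD"], dsn="oci_db"
)

for table in TABLES:
    src, tgt = row_count(source, table), row_count(target, table)
    status = "OK" if src == tgt else "MISMATCH"
    print(f"{table}: source={src} target={tgt} [{status}]")
```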
Education
Maulana Abul Kalam Azad University of Technology (MAKAUT)
2018 - 2022
Bachelor of Technology (B.Tech) in Computer Science and Engineering
- Relevant courses: Data Structures & Algorithms, Operating Systems, Database Management Systems
- GPA: 8.90
Projects
Near-Real-Time Event-Driven ELT Pipeline (AWS EventBridge, DynamoDB, Lambda, S3, Snowflake, Snowpipe)
- Engineered a near-real-time, event-driven ELT pipeline using AWS EventBridge, DynamoDB Streams, Lambda, S3, and Snowflake — ingesting and processing live API data end-to-end with low latency.
- Solved continuous ingestion of semi-structured JSON by staging raw events in DynamoDB, transforming via Lambda, and landing processed records into S3 before Snowflake ingestion (handler sketched after this list).
- Automated continuous data loading from S3 into Snowflake using Snowpipe — eliminating batch job dependency and enabling near-real-time data availability.
- Applied analytical transformations using Snowflake SQL to produce analytics-ready datasets across the AWS-native and Snowflake layers.
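
The Lambda stage above can be sketched as a handler that deserializes DynamoDB Stream images and lands newline-delimited JSON in S3, from where Snowpipe auto-ingest (assumed to watch the landing prefix) loads it into Snowflake. The bucket name and key layout are hypothetical.

```python
# Lambda handler: DynamoDB Streams -> newline-delimited JSON in S3.
import json
import uuid
import boto3
from boto3.dynamodb.types import TypeDeserializer

s3 = boto3.client("s3")
deserializer = TypeDeserializer()

def handler(event, context):
    rows = []
    for record in event["Records"]:
        # Only new/changed items carry a NewImage; skip deletions here.
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        image = record["dynamodb"]["NewImage"]
        # Convert DynamoDB's typed attributes into plain Python values.
        rows.append({k: deserializer.deserialize(v) for k, v in image.items()})

    if rows:
        # One JSON object per line, the layout Snowpipe copies from S3.
        body = "\n".join(json.dumps(row, default=str) for row in rows)
        s3.put_object(
            Bucket="example-landing-bucket",          # placeholder bucket
            Key=f"events/{uuid.uuid4()}.json",        # illustrative layout
            Body=body.encode("utf-8"),
        )
    return {"processed": len(rows)}
```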
Serverless Batch ETL Pipeline (Amazon S3, AWS Glue, PySpark, Amazon Athena, Amazon QuickSight)
- Designed and built a fully serverless batch ETL pipeline using Amazon S3, AWS Glue (PySpark), and Glue Crawlers — ingesting, processing, and structuring analytical datasets for downstream reporting.
- Architected a scalable data lake with raw and curated layers (medallion-style) — handling automated schema discovery, data cleansing, and batch transformations via Spark-based Glue jobs.
- Integrated the AWS Glue Data Catalog with Amazon Athena for serverless SQL querying on S3, connected to Amazon QuickSight to deliver business-ready reporting and dashboards (see the sketch after this list).
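
A condensed sketch of the raw-to-curated step: PySpark writes date-partitioned Parquet, a Glue Crawler keeps the Data Catalog in sync, and Athena's partition pruning then limits scans to matching partitions. The paths and partition column are illustrative, not the project's actual schema.

```python
# Raw -> curated batch step: cleanse raw JSON and write Parquet
# partitioned by date so Athena can prune partitions at query time.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("curated-batch").getOrCreate()

raw = spark.read.json("s3://example-bucket/raw/sales/")  # placeholder path

curated = (
    raw.dropna(subset=["sale_id"])                 # drop records missing the key
    .withColumn("sale_date", F.to_date("sale_ts"))  # derive partition column
)

(
    curated.write.mode("overwrite")
    .partitionBy("sale_date")
    .parquet("s3://example-bucket/curated/sales/")
)
# A Glue Crawler over the curated prefix registers the table and its
# partitions, so Athena queries filtering on sale_date read only the
# matching partitions instead of scanning the full dataset.
```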
Skills
Python
SQL
AWS
Amazon S3
AWS Glue
Amazon Athena
Amazon Redshift
AWS Lambda
Amazon DynamoDB
Amazon QuickSight
Snowflake
Git
Achievements
AWS Certified Cloud Practitioner
July 2025
Validation Number: a6447414ae25419cba12941481cd10a2
- Foundational understanding of AWS Cloud concepts, security, compliance, technology, and billing.