Data Engineering Bulletin

Data Engineering Bulletin - Digest#4

Latest data engineering updates from Tech Companies

Suraj Mishra
Feb 6

Introduction

  • This week's update includes Google, Meta, Snowflake, Databricks, Confluent, Lyft, Yelp, and others.

  • It covers data pipeline design patterns, event-driven asynchronous computing, customer stories, financial data pipeline design, and much more.

  • I hope you enjoy it.

If you like this bulletin, consider subscribing to this newsletter and sharing it with others.


Google Cloud

  • What Data Pipeline Architecture should I use?
    Designing a data pipeline depends on factors such as the source data format, the tech stack, the processing pattern (ETL, ELT, or ETLT), and how changes are captured (change data capture). Multiple design patterns are available, such as the data warehouse pipeline or the data lake pipeline, and the decision ultimately comes down to the data consumer's use case and the cost of the architecture.
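
    A minimal sketch of the ETL-vs-ELT distinction mentioned above; the extract, transform, and load functions are placeholders, not a real pipeline, and the column name is invented for illustration.

    ```python
    # Hypothetical sketch: ETL transforms before loading; ELT loads raw data
    # first and transforms inside the destination store.
    def extract():
        return [{"amount": "10"}, {"amount": "25"}]  # raw rows (strings)

    def transform(rows):
        return [{"amount": int(r["amount"])} for r in rows]  # typed rows

    def load(rows, store):
        store.extend(rows)
        return store

    # ETL: transform happens before the warehouse load.
    warehouse = []
    load(transform(extract()), warehouse)

    # ELT: raw data lands first, then is transformed in place.
    lake = []
    load(extract(), lake)
    lake[:] = transform(lake)

    print(warehouse == lake)  # True: both end with the same typed rows
    ```

    Either pattern yields the same final rows here; the real trade-off is where the transformation compute runs and how long raw data is retained.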

  • How Oden provides actionable recommendations with network resiliency to optimize manufacturing processes
    Oden Technologies is a technology solution provider for manufacturers that uses data to give real-time visibility into the biggest causes of inefficiency and recommendations to address them. They aggregate time-series data from multiple devices and process it in real time using Dataflow before writing to a time-series database. They discuss challenges with late-arriving data caused by network disruptions, quality specifications for the final product, and their solution architecture.

Meta

  • Asynchronous computing at Meta: Overview and learnings
    The Meta engineering team shares architectural changes to its event-driven asynchronous computing platform that make it easy to integrate with multiple data sources. They moved from a pull-based mechanism for scheduling workloads across the compute layer to a push-based, event-driven one, while streamlining data ingestion so that each team or user no longer writes duplicate ingestion code.
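
    The pull-vs-push distinction above can be sketched in a few lines; this is an illustrative toy, not Meta's implementation, and the event names are invented. In the push model, the platform dispatches each event to a registered handler as it arrives instead of workers polling a store for pending work.

    ```python
    import queue
    import threading

    events = queue.Queue()
    results = []

    def handler(event):
        # User-supplied function; the platform invokes it once per event.
        return f"processed {event}"

    def dispatcher():
        # Push-based: block until an event arrives, then invoke the handler.
        while True:
            event = events.get()
            if event is None:  # sentinel to shut down
                break
            results.append(handler(event))

    worker = threading.Thread(target=dispatcher)
    worker.start()
    for e in ["resize_image", "send_notification"]:
        events.put(e)  # producers push events; no polling loop on the consumer
    events.put(None)
    worker.join()
    print(results)  # ['processed resize_image', 'processed send_notification']
    ```

    A pull-based design would instead have each worker periodically query a shared store for pending work, which wastes cycles when the queue is empty and adds latency when it is not.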

Snowflake

  • Learn How OneWeb Delivers Space-Based Connectivity with Snowflake
    OneWeb is a company with 648 low Earth orbit satellites that provides connectivity to governments, businesses, and communities. OneWeb uses a data mesh architecture and ingests 55 billion rows every day, relying on Snowflake to scale that data. With the Snowflake stack they let consumers access their data and open up new data monetization opportunities worth millions of dollars.

  • How Implementing a Data Catalog Optimizes Your Snowflake Data Cloud Migration
    During a data migration, enterprises often struggle to determine which data they have on-premises, which of it should move to the cloud, and in what order. A lift-and-shift strategy, where data is simply copied from one place to another, often takes precedence, but the outcome is undesirable for many organizations because it mostly moves tech debt from one place to another. The author explains why cataloging on-premises data is important and proposes a matrix that identifies high-value, low-complexity data as the first target to move to the cloud.
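
    The value/complexity matrix can be reduced to a simple ordering rule; this sketch is hypothetical, with dataset names and scores invented for illustration rather than taken from the article.

    ```python
    # Hypothetical prioritization: each cataloged dataset gets a business-value
    # and a migration-complexity score; high-value, low-complexity data goes first.
    datasets = [
        {"name": "sales_orders",   "value": 9, "complexity": 2},
        {"name": "clickstream",    "value": 6, "complexity": 8},
        {"name": "legacy_reports", "value": 2, "complexity": 3},
    ]

    def migration_priority(d):
        # Higher value and lower complexity -> earlier in the migration queue.
        return d["value"] - d["complexity"]

    migration_order = sorted(datasets, key=migration_priority, reverse=True)
    print([d["name"] for d in migration_order])
    # ['sales_orders', 'legacy_reports', 'clickstream']
    ```

    The scoring function is the judgment call: a real catalog would weight factors like regulatory constraints and downstream consumers, not just two numbers.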

Databricks

  • Design Patterns for Batch Processing in Financial Services
    Although batch processing sounds like an outdated paradigm in the era of stream processing, for some industries it remains vital due to the nature of their operations. One such area is financial services: real-time systems solve many problems in volatile market conditions, but back-office batch processing is still relevant. Databricks shares an architecture to consider for financial-services batch processing and discusses integrating third-party data ingestion services like Fivetran with the Databricks Lakehouse.

Confluent

  • Secure Shared Services with Data Streaming: OAuth, Client Quotas, and more
    We can all agree that real-time everything is gaining ground in every organization, and each team is excited to be part of an event-driven platform. The scope of applications served by such platforms has grown from internal to external-facing. Shared infrastructure is the cost-effective way to maintain performance at that scale, but the model brings challenges when it comes to securing these applications. With this in mind, the Confluent team has released many new features in Confluent Cloud (the SaaS/managed offering of the Confluent Kafka cluster), such as centralized identity management, enhanced role-based access control, and more.

Lyft

  • Powering Millions of Real-Time Decisions with LyftLearn Serving
    The Lyft team built LyftLearn Serving, a component of their in-house ML platform, for a variety of ML use cases like ride price optimization, incentives, and fraud detection. LyftLearn Serving is an ML deployment and serving system. They walk through its major design decisions: it is made up of microservices comprising an HTTP serving layer, a core library layer, third-party libraries, and a custom ML predict-code layer. They also describe how each team gets a customized deployment along with its own set of tooling and runtime.

Yelp

  • Rebuilding a Cassandra cluster using Yelp’s Data Pipeline
    Yelp uses Apache Cassandra for its primary and derived data stores. In this article, they describe how they rebuilt one of the clusters, removing malformed data using Yelp's data pipeline. The pipeline ingests a stream of events from the original cluster and uses Stream SQL to filter the data into two sinks: valid events go to the sanitized cluster, while invalid or malformed events move to a separate stream for further root-cause analysis.
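
    The two-sink split described above can be sketched as follows; this is an illustrative Python stand-in for the Stream SQL filter, not Yelp's code, and the validation rule and field names are invented.

    ```python
    # Hypothetical routing: each event from the original cluster lands in the
    # "sanitized" sink if it validates, or in the "malformed" sink for analysis.
    def is_valid(event):
        # Invented rule for illustration: every event needs a non-empty string id.
        return isinstance(event.get("id"), str) and event["id"] != ""

    stream = [
        {"id": "biz_1", "rating": 4},
        {"id": "",      "rating": 5},   # malformed: empty id
        {"id": "biz_2", "rating": 3},
    ]

    sanitized, malformed = [], []
    for event in stream:
        (sanitized if is_valid(event) else malformed).append(event)

    print(len(sanitized), len(malformed))  # 2 1
    ```

    In a real streaming engine the same logic is a pair of filtered queries over one source topic, so the malformed stream can be replayed later once the root cause is fixed.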

Hashicorp

  • Event-Driven Access Controls with HashiCorp Boundary and Vault
    The HashiCorp team has built a prototype, Rift, that engineers the workflow for assigning and revoking permissions for the on-call engineer when a production incident occurs, without any manual intervention. It saves time for security engineers, who no longer need to be available to grant permissions to the on-call engineer. They discuss the problems with the manual approach, the solution, and the workflow design in detail.
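
    The grant-on-open, revoke-on-close workflow can be sketched as a pair of event handlers; this is a hypothetical illustration of the pattern, not Rift's implementation, and the event and engineer names are invented.

    ```python
    # Hypothetical event-driven access control: an incident-opened event grants
    # the on-call engineer temporary access; incident-closed revokes it.
    grants = set()

    def on_incident_event(event, on_call_engineer):
        if event == "incident_opened":
            grants.add(on_call_engineer)      # grant without manual approval
        elif event == "incident_closed":
            grants.discard(on_call_engineer)  # revoke automatically

    on_incident_event("incident_opened", "alice")
    print("alice" in grants)  # True: access granted for the incident's duration
    on_incident_event("incident_closed", "alice")
    print("alice" in grants)  # False: access revoked when the incident closes
    ```

    Tying the grant's lifetime to the incident's lifetime is what removes the human from the loop: there is no standing permission to forget to clean up.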

Product Upgrades

  • Introducing Upgrades to Databricks Notebooks — New Editor, Python Formatting, and More

  • AWS Lambda Now Supports Maximum Concurrency for SQS as Event Source

Webinars, Courses & Training

  • Data Engineering using AWS Data Analytics

  • Data Engineering Essentials using SQL, Python, and PySpark

  • From 0 to 1: The Cassandra Distributed Database


If you have any feedback, requests, or improvements for this newsletter, consider commenting so that I can improve it in the future.


Thanks for reading Data Engineering Bulletin! Subscribe for free to receive new posts and support my work.

© 2023 Suraj Mishra