Data Engineering Bulletin

Data Engineering Bulletin - Digest#4

Latest data engineering updates from Tech Companies

Suraj Mishra
Feb 6

Introduction

  • This week's update includes Google, Meta, Snowflake, Databricks, Confluent, Lyft, Yelp, and others.

  • It covers data pipeline design patterns, event-driven asynchronous computing, customer stories, financial data pipeline design, and much more.

  • I hope you enjoy it.

If you like this bulletin, consider subscribing to this newsletter and sharing it with others.


Google Cloud

  • What Data Pipeline Architecture should I use?
    Designing a data pipeline depends on factors such as the source data format, the tech stack, the processing pattern (ETL, ELT, or ETLT), and how changes are captured (change data capture). Multiple design patterns are available, such as the data warehouse pipeline or the data lake pipeline, and the decision ultimately comes down to the data consumer's use case and the cost of the architecture.
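
    A minimal sketch of the ETL-vs-ELT distinction mentioned above; the extract, transform, and load functions are placeholders, not a real pipeline, and the column name is invented for illustration.

    ```python
    # Hypothetical sketch: ETL transforms before loading; ELT loads raw data
    # first and transforms inside the destination store.
    def extract():
        return [{"amount": "10"}, {"amount": "25"}]  # raw rows (strings)

    def transform(rows):
        return [{"amount": int(r["amount"])} for r in rows]  # typed rows

    def load(rows, store):
        store.extend(rows)
        return store

    # ETL: transform happens before the warehouse load.
    warehouse = []
    load(transform(extract()), warehouse)

    # ELT: raw data lands first, then is transformed in place.
    lake = []
    load(extract(), lake)
    lake[:] = transform(lake)

    print(warehouse == lake)  # True: both end with the same typed rows
    ```

    Either pattern yields the same final rows here; the real trade-off is where the transformation compute runs and how long raw data is retained.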

  • How Oden provides actionable recommendations with network resiliency to optimize manufacturing processes
    Oden Technologies is a technology solution provider for manufacturers that uses data to give real-time visibility into the biggest causes of inefficiency and recommendations to address them. They aggregate time-series data from multiple devices and process it in real time using Dataflow before writing to a time-series database. They discuss challenges with late-arriving data caused by network disruptions, quality specifications for the final product, and their solution architecture.

Meta

  • Asynchronous computing at Meta: Overview and learnings
    The Meta engineering team shares architectural changes to its event-driven asynchronous computing platform that make it easy to integrate with multiple data sources. They moved from a pull-based mechanism for scheduling workloads across the compute layer to a push-based, event-driven one, while streamlining data ingestion so that each team or user no longer writes duplicate ingestion code.
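
    The pull-vs-push distinction above can be sketched in a few lines; this is an illustrative toy, not Meta's implementation, and the event names are invented. In the push model, the platform dispatches each event to a registered handler as it arrives instead of workers polling a store for pending work.

    ```python
    import queue
    import threading

    events = queue.Queue()
    results = []

    def handler(event):
        # User-supplied function; the platform invokes it once per event.
        return f"processed {event}"

    def dispatcher():
        # Push-based: block until an event arrives, then invoke the handler.
        while True:
            event = events.get()
            if event is None:  # sentinel to shut down
                break
            results.append(handler(event))

    worker = threading.Thread(target=dispatcher)
    worker.start()
    for e in ["resize_image", "send_notification"]:
        events.put(e)  # producers push events; no polling loop on the consumer
    events.put(None)
    worker.join()
    print(results)  # ['processed resize_image', 'processed send_notification']
    ```

    A pull-based design would instead have each worker periodically query a shared store for pending work, which wastes cycles when the queue is empty and adds latency when it is not.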

Snowflake

  • Learn How OneWeb Delivers Space-Based Connectivity with Snowflake
    OneWeb is a company with 648 low Earth orbit satellites that provides connectivity to governments, businesses, and communities. OneWeb uses a data mesh architecture and ingests 55 billion rows every day, relying on Snowflake to scale that data. With the Snowflake stack they let consumers access their data and open up new data monetization opportunities worth millions of dollars.

  • How Implementing a Data Catalog Optimizes Your Snowflake Data Cloud Migration
    During a data migration, enterprises often struggle to determine which data they have on-premises, which of it should move to the cloud, and in what order. A lift-and-shift strategy, where data is simply copied from one place to another, often takes precedence, but the outcome is undesirable for many organizations because it mostly moves tech debt from one place to another. The author explains why cataloging on-premises data is important and proposes a matrix that identifies high-value, low-complexity data as the first target to move to the cloud.
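
    The value/complexity matrix can be reduced to a simple ordering rule; this sketch is hypothetical, with dataset names and scores invented for illustration rather than taken from the article.

    ```python
    # Hypothetical prioritization: each cataloged dataset gets a business-value
    # and a migration-complexity score; high-value, low-complexity data goes first.
    datasets = [
        {"name": "sales_orders",   "value": 9, "complexity": 2},
        {"name": "clickstream",    "value": 6, "complexity": 8},
        {"name": "legacy_reports", "value": 2, "complexity": 3},
    ]

    def migration_priority(d):
        # Higher value and lower complexity -> earlier in the migration queue.
        return d["value"] - d["complexity"]

    migration_order = sorted(datasets, key=migration_priority, reverse=True)
    print([d["name"] for d in migration_order])
    # ['sales_orders', 'legacy_reports', 'clickstream']
    ```

    The scoring function is the judgment call: a real catalog would weight factors like regulatory constraints and downstream consumers, not just two numbers.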

Databricks

  • Design Patterns for Batch Processing in Financial Services
    Although batch processing sounds like an outdated paradigm in the era of stream processing, for some industries it remains vital due to the nature of their operations. One such area is financial services: real-time systems solve many problems in volatile market conditions, but back-office batch processing is still relevant. Databricks shares an architecture to consider for financial-services batch processing and discusses integrating third-party data ingestion services like Fivetran with the Databricks Lakehouse.

Confluent

  • Secure Shared Services with Data Streaming: OAuth, Client Quotas, and more
    We can all agree that real-time everything is gaining ground in every organization, and each team is excited to be part of an event-driven platform. The scope of applications served by such platforms has grown from internal to external-facing. Shared infrastructure is the cost-effective way to maintain performance at that scale, but the model brings challenges when it comes to securing these applications. With this in mind, the Confluent team has released many new features in Confluent Cloud (the SaaS/managed offering of the Confluent Kafka cluster), such as centralized identity management, enhanced role-based access control, and more.

Lyft

  • Powering Millions of Real-Time Decisions with LyftLearn Serving
    The Lyft team built LyftLearn Serving, a component of their in-house ML platform, for a variety of ML use cases like ride price optimization, incentives, and fraud detection. LyftLearn Serving is an ML deployment and serving system. They walk through its major design decisions: it is made up of microservices comprising an HTTP serving layer, a core library layer, third-party libraries, and a custom ML predict-code layer. They also describe how each team gets a customized deployment along with its own set of tooling and runtime.

Yelp

  • Rebuilding a Cassandra cluster using Yelp’s Data Pipeline
    Yelp uses Apache Cassandra for its primary and derived data stores. In this article, they describe how they rebuilt one of the clusters, removing malformed data using Yelp's data pipeline. The pipeline ingests a stream of events from the original cluster and uses Stream SQL to filter the data into two sinks: valid events go to the sanitized cluster, while invalid or malformed events move to a separate stream for further root-cause analysis.
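
    The two-sink split described above can be sketched as follows; this is an illustrative Python stand-in for the Stream SQL filter, not Yelp's code, and the validation rule and field names are invented.

    ```python
    # Hypothetical routing: each event from the original cluster lands in the
    # "sanitized" sink if it validates, or in the "malformed" sink for analysis.
    def is_valid(event):
        # Invented rule for illustration: every event needs a non-empty string id.
        return isinstance(event.get("id"), str) and event["id"] != ""

    stream = [
        {"id": "biz_1", "rating": 4},
        {"id": "",      "rating": 5},   # malformed: empty id
        {"id": "biz_2", "rating": 3},
    ]

    sanitized, malformed = [], []
    for event in stream:
        (sanitized if is_valid(event) else malformed).append(event)

    print(len(sanitized), len(malformed))  # 2 1
    ```

    In a real streaming engine the same logic is a pair of filtered queries over one source topic, so the malformed stream can be replayed later once the root cause is fixed.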

Hashicorp

  • Event-Driven Access Controls with HashiCorp Boundary and Vault
    The HashiCorp team has built a prototype, Rift, that engineers the workflow for assigning and revoking permissions for the on-call engineer when a production incident occurs, without any manual intervention. It saves time for security engineers, who no longer need to be available to grant permissions to the on-call engineer. They discuss the problems with the manual approach, the solution, and the workflow design in detail.
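
    The grant-on-open, revoke-on-close workflow can be sketched as a pair of event handlers; this is a hypothetical illustration of the pattern, not Rift's implementation, and the event and engineer names are invented.

    ```python
    # Hypothetical event-driven access control: an incident-opened event grants
    # the on-call engineer temporary access; incident-closed revokes it.
    grants = set()

    def on_incident_event(event, on_call_engineer):
        if event == "incident_opened":
            grants.add(on_call_engineer)      # grant without manual approval
        elif event == "incident_closed":
            grants.discard(on_call_engineer)  # revoke automatically

    on_incident_event("incident_opened", "alice")
    print("alice" in grants)  # True: access granted for the incident's duration
    on_incident_event("incident_closed", "alice")
    print("alice" in grants)  # False: access revoked when the incident closes
    ```

    Tying the grant's lifetime to the incident's lifetime is what removes the human from the loop: there is no standing permission to forget to clean up.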

Product Upgrades

  • Introducing Upgrades to Databricks Notebooks — New Editor, Python Formatting, and More

  • AWS Lambda Now Supports Maximum Concurrency for SQS as Event Source

Webinars, Courses & Training

  • Data Engineering using AWS Data Analytics

  • Data Engineering Essentials using SQL, Python, and PySpark

  • From 0 to 1: The Cassandra Distributed Database


If you have any feedback, requests, or improvements for this newsletter, consider commenting so that I can improve it in the future.


Thanks for reading Data Engineering Bulletin! Subscribe for free to receive new posts and support my work.

© 2023 Suraj Mishra