Data Engineering Bulletin


Data Engineering Bulletin - Week #2

This week's updates cover data management, workflows, static SQL analysis, real-time messaging, cross-cloud analytics, and much more.

Suraj Mishra
Dec 6, 2022

Introduction

  • As a Data Engineer, I am always curious to know what others are doing in this space. The best way to learn about what is happening in other companies is to read their tech blogs and get inspiration for my next project.

  • Data Engineering Bulletin is a filtered version of those efforts that I feel might be helpful to my Data Engineering community. The bulletin covers what matters the most.

  • If you like this bulletin, consider subscribing to it to get more of these weekly.


Netflix Tech Blog

  • Ready-to-go Sample Data Pipelines With Dataflow

    Dataflow is Netflix's homegrown solution for data pipeline management. In this article, they explain the features Dataflow supports. One of them helps newly onboarded users get started quickly: templated jobs with configurable business logic, shipped as sample workflows.

Meta

  • Enabling static analysis of SQL queries at Meta

    Meta talks about its static analysis tool, UPM, which takes a SQL query as input and returns a semantic tree. The infrastructure teams at Meta leverage UPM to improve the performance of SQL queries and to build column-level lineage.

Confluent

  • Data pipeline: What Why and How
    This post from Confluent mainly discusses the foundational principles of data pipelines and explains when to consider building real-time data pipelines.

  • Diagnose and Debug Apache Kafka Issues: Understanding Increased Connections

    In this article, Confluent talks about the different types of connections created in Apache Kafka, such as producer-initiated, broker-initiated, and consumer-initiated connections, along with the metrics to watch to detect spikes in connection count, connection rate, and so on.

Go Jek

  • Customizing VerneMQ - The message broker for info superhighway
    GoJek talks about the improvements and features they added to open-source VerneMQ, an MQTT-based message broker. MQTT is a protocol used when devices connect from remote locations with network or bandwidth constraints. IoT devices typically use this protocol to publish messages.

Walmart Labs

  • How to create a BigQuery Table with External Data Source
    A Walmart engineer shares how to create external data sources for BigQuery, along with their pros and cons. External data sources are great when you don't want to move your data from the source into BigQuery and would rather run BigQuery queries directly on the external data. The trade-off is that such queries often cannot take advantage of BigQuery's internal features, such as caching, or match its native performance.

Google Cloud

  • Movie Score Prediction with BigQuery

    This article walks through building a movie score predictor using BigQuery, Vertex AI, MongoDB, Cloud Run, and Cloud Functions.

  • Built with BigQuery: How Datalaksa provides unified marketing and customer data warehouse
    Datalaksa, a company serving clients in Southeast Asia, provides a unified marketing and customer data warehouse. They share their use case and their reasons for choosing BigQuery as their data warehouse solution.

  • Break down data silos with BigQuery Omni
    The Google Cloud team talks about upcoming features for BigQuery Omni, which enables data analytics across clouds, including Azure and AWS, through a single BigQuery interface.

  • Cloud PubSub announces general availability of exactly-once delivery
    Google Cloud Pub/Sub now supports exactly-once delivery within a single region. It is achieved through an internal deduplication layer backed by persistent storage. Latency will increase, but single delivery is guaranteed. At the moment, it is only available for pull subscriptions, not push subscriptions.
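
    To make the deduplication idea concrete, here is a minimal conceptual sketch in plain Python. It is not Pub/Sub's implementation or API: the `DedupingSubscription` class, its `deliver` method, and the message IDs are hypothetical names invented for illustration, and an in-memory set stands in for the persistent storage mentioned above.

```python
class DedupingSubscription:
    """Toy model of a deduplication layer that remembers acknowledged message IDs."""

    def __init__(self):
        # Stand-in for persistent storage (assumption: a real service would
        # keep this state durably, not in process memory).
        self._acked_ids = set()

    def deliver(self, message_id, payload, handler):
        """Run handler(payload) unless message_id was already acknowledged.

        Returns True if the handler ran, False if the message was a duplicate.
        """
        if message_id in self._acked_ids:
            return False
        handler(payload)
        # Record the acknowledgement only after the handler succeeds, so a
        # failed attempt can still be redelivered later.
        self._acked_ids.add(message_id)
        return True


processed = []
sub = DedupingSubscription()
# The same message arrives twice, as can happen with at-least-once delivery.
results = [
    sub.deliver("m-1", "order created", processed.append),
    sub.deliver("m-1", "order created", processed.append),
    sub.deliver("m-2", "order shipped", processed.append),
]
```

    The extra lookup and write on every delivery give a rough intuition for why this kind of guarantee tends to add latency.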

InfoQ

  • AWS Lambda Snapstart Accelerate Java Functions
    AWS's newly announced Lambda SnapStart speeds up initialization of Java functions on AWS Lambda, which helps Java workloads such as microservices and data processing.

  • Using Serverless Websockets to Enable Real-Time Messaging
    The author discusses the protocols available for building real-time, live experiences and singles out WebSocket as a promising and widely used choice for event-driven systems. However, managing WebSocket infrastructure is not easy, and the author argues that serverless is the way forward.

  • AWS announces DataZone, new Data Management service to Govern Data
    Amazon announces a new data management service called DataZone, which integrates AWS products such as Redshift and Athena with other data sources to provide better data management in the cloud.

AWS

  • Point-to-Point Integration with Amazon Event Bridge Pipes
    AWS launches a new EventBridge feature called Pipes, which helps us quickly get started building event-driven systems without writing glue code between services.

  • Step Functions Distributed Map – A Serverless Solution for Large-Scale Parallel Data Processing
    AWS announces a new Step Functions feature called Distributed Map for large-scale data processing workloads. Step Functions lets us build no-code workflows, but until now it was limited to 40 parallel executions.

  • New for Amazon Redshift – Simplify Data Ingestion and Make Your Data Warehouse More Secure and Reliable

    New feature announcements for Amazon Redshift covering data ingestion from S3, MySQL, and Kinesis, along with security features such as dynamic data masking and data governance.

  • Join the Preview – AWS Glue Data Quality
    AWS introduces Data Quality for AWS Glue, which analyzes a table and recommends data quality checks for it. We can fine-tune those rules or write our own.

Databricks

  • Memory Profiling in PySpark
    Databricks talks about recent contributions to open-source Spark for memory profiling on Spark executors, mainly for code that runs UDFs.

  • Near-real-time IoT Robust Anomaly Detection Framework
    The author talks about how digitization produces huge volumes of real-time data, especially in IoT-related use cases. Anomaly detection is a common need across these use cases, but finding outliers is not trivial. The author shares a framework for building anomaly detection that evaluates different ML models and combines their results with a weighted average.
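
    As a rough illustration of combining several detectors with a weighted average, here is a small sketch in plain Python. It is not the framework from the article: two simple statistical scorers (z-score and median absolute deviation) stand in for ML models, and the weights, the 0.5 threshold, and the sample readings are all assumptions made up for the example.

```python
from statistics import mean, median, pstdev


def zscore_scores(values):
    """Score each point by its absolute distance from the mean, in standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [abs(v - mu) / sigma for v in values]


def mad_scores(values):
    """Score each point by its distance from the median, in median absolute deviations."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [0.0] * len(values)
    return [abs(v - med) / mad for v in values]


def normalize(scores):
    """Rescale scores to [0, 1] so detectors with different ranges are comparable."""
    top = max(scores)
    return [s / top if top > 0 else 0.0 for s in scores]


def ensemble_scores(values, weighted_detectors):
    """Combine normalized detector scores as a weighted average."""
    total_weight = sum(weight for _, weight in weighted_detectors)
    combined = [0.0] * len(values)
    for detector, weight in weighted_detectors:
        for i, score in enumerate(normalize(detector(values))):
            combined[i] += weight * score / total_weight
    return combined


# Hypothetical sensor readings with one obvious spike at index 4.
readings = [20.1, 19.8, 20.3, 20.0, 35.7, 19.9, 20.2]
# Hypothetical weights; a real setup would tune these from model evaluation.
scores = ensemble_scores(readings, [(zscore_scores, 0.4), (mad_scores, 0.6)])
anomalies = [i for i, s in enumerate(scores) if s > 0.5]
```

    Normalizing each detector's scores before averaging keeps one detector with a large numeric range from dominating the combined score.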

Snowflake

  • Data Access Control: Thoughts from the trenches
    Snowflake discusses data access control at various layers, such as Dataset, Physical Data Mart, and Virtual Data Mart, along with potential security issues at each layer.

  • How Retailers Increase Customer Satisfaction and Retention with the Snowflake Retail Data Cloud
    A Customer 360 view is a key pillar of the Retail Data Cloud. Snowflake shares customer use cases (DoorDash, US Foods, etc.) of using the Snowflake Data Cloud to build a robust Customer 360 strategy.

  • Data Vault Techniques on Snowflake: Handling Semi-Structured Data
    Snowflake shares how to ingest and query semi-structured data in Data Vault.

The course of the week

  • If you want to learn Data Engineering on AWS, please check out this bestselling course, which includes 25+ hours of content and has 11k+ students enrolled.

    Data Engineering using AWS Data Analytics

    Build Data Engineering Pipelines on AWS using Data Analytics Services - Glue, EMR, Athena, Kinesis, Lambda, Redshift

Support the work

  • If you like my work, you can support it by buying me a coffee on Ko-fi.

Thanks for reading Data Engineering Bulletin! Subscribe for free to receive new posts.

