Data Engineering Bulletin - Week #2
This week's updates cover data management, workflows, static SQL analysis, real-time messaging, cross-cloud analytics, and much more.
Introduction
As a Data Engineer, I am always curious to know what others are doing in this space. The best way to learn about what is happening in other companies is to read their tech blogs and get inspiration for my next project.
The Data Engineering Bulletin is a filtered version of those efforts, collecting the posts I feel might be most helpful to the Data Engineering community. The bulletin covers what matters most.
If you like this bulletin, consider subscribing to it to get more of these weekly.
Netflix Tech Blog
Ready-to-go Sample Data Pipelines With Dataflow
Dataflow is Netflix's homegrown solution for data pipeline management. In this article, they explain the features Dataflow supports. One of them lets onboarded users get started quickly: templated Sample Workflow jobs with configurable business logic.
Meta
Enabling static analysis of SQL queries at Meta
Meta describes its static analysis tool, UPM, which takes each SQL query an author executes and returns a semantic tree. The infrastructure team at Meta leverages UPM to enhance the performance of SQL queries and to build column-level lineage.
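Meta has not published UPM's internals in full, but the core idea of deriving column-level lineage from a parsed query can be sketched in a few lines. The toy function below handles only trivial `SELECT a, b FROM t` queries with a regex (a real tool builds a complete semantic tree); all names are illustrative.

```python
import re

def column_lineage(sql):
    """Toy column-level lineage: map each output column of a simple
    'SELECT a, b FROM t' query back to its source table.
    Illustrative only -- a real analyzer like UPM parses the query
    into a semantic tree and resolves joins, aliases, and subqueries."""
    m = re.match(r"\s*SELECT\s+(.+?)\s+FROM\s+(\w+)", sql,
                 re.IGNORECASE | re.DOTALL)
    if not m:
        raise ValueError("unsupported query shape")
    cols, table = m.group(1), m.group(2)
    return {col.strip(): table for col in cols.split(",")}

print(column_lineage("SELECT user_id, country FROM events"))
# {'user_id': 'events', 'country': 'events'}
```

Even this toy version shows why a semantic representation is useful: once each output column is tied to its source, lineage graphs and query rewrites become ordinary data transformations.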
Confluent
Data Pipelines: What, Why, and How
This post from Confluent mainly discusses the foundational principles of data pipelines and explains when to consider building real-time ones.

Diagnose and Debug Apache Kafka Issues: Understanding Increased Connections
In this article, Confluent walks through the different types of connections created in Apache Kafka (producer-initiated, broker-initiated, and consumer-initiated) and which metrics to watch to catch spikes in connection count, connection rate, and so on.
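The monitoring idea reduces to comparing successive samples of a connection-count metric and alerting when the implied creation rate jumps. A minimal, hedged sketch (the sampling interval and threshold are made up, and this is not Kafka's actual metrics API):

```python
def connection_spike(samples, interval_s=60, rate_threshold=10.0):
    """Flag intervals where the connection-creation rate (new
    connections per second between successive samples) exceeds a
    threshold. `samples` is a list of total-connection counts taken
    every `interval_s` seconds -- a stand-in for a broker's
    connection-count metric."""
    alerts = []
    for i in range(1, len(samples)):
        rate = max(samples[i] - samples[i - 1], 0) / interval_s
        if rate > rate_threshold:
            alerts.append((i, rate))
    return alerts

# A sudden jump of 1200 connections within one 60 s interval:
print(connection_spike([100, 110, 1310, 1320]))  # [(2, 20.0)]
```

In practice you would watch the broker's own connection-rate metrics rather than differencing counts yourself, but the alerting logic is the same shape.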
GoJek
Customizing VerneMQ - The message broker for info superhighway
GoJek talks about the improvements and feature additions they made to open-source VerneMQ, an MQTT-based message broker. MQTT is a protocol used when devices connect from remote locations with network/bandwidth constraints; IoT devices typically use it to publish messages.
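A defining feature of MQTT brokers like VerneMQ is hierarchical topic routing with wildcards: `+` matches one topic level and `#` matches all remaining levels. The toy matcher below illustrates that routing rule only; it is not VerneMQ code, and the topic names are invented.

```python
def topic_matches(pattern, topic):
    """Toy MQTT-style topic matching: '+' matches exactly one
    level, '#' matches all remaining levels."""
    p, t = pattern.split("/"), topic.split("/")
    for i, seg in enumerate(p):
        if seg == "#":
            return True          # multi-level wildcard: match the rest
        if i >= len(t):
            return False         # pattern longer than topic
        if seg != "+" and seg != t[i]:
            return False         # literal level mismatch
    return len(p) == len(t)      # no trailing topic levels left over

print(topic_matches("devices/+/status", "devices/bike42/status"))  # True
print(topic_matches("devices/#", "devices/bike42/gps/lat"))        # True
print(topic_matches("devices/+/status", "devices/bike42/gps"))     # False
```

This topic hierarchy is what lets constrained IoT devices publish to one well-known topic while many subscribers filter with wildcards on the broker side.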
Walmart Labs
How to create a BigQuery Table with External Data Source
A Walmart engineer shares how to create external data sources for BigQuery, along with their pros and cons. External data sources are great when you don't want to move data from the source into BigQuery and instead run BigQuery queries directly on the external data; in doing so, however, you often cannot leverage internal BigQuery features such as caching and performance optimizations.
Google Cloud
Movie Score Prediction with BigQuery
This article covers the use case of building a Movie Score Predictor which is built by using BigQuery, Vertex AI, MongoDB, Cloud Run, and Cloud Functions.
Built with BigQuery: How Datalaksa provides unified marketing and customer data warehouse
Datalaksa, a company serving clients in Southeast Asia, provides a unified marketing and customer data warehouse. They share their use case and the reasons they chose BigQuery as their data warehouse solution.

Break down data silos with BigQuery Omni
The Google Cloud team talks about upcoming features for BigQuery Omni, which mainly enables data analytics across clouds, including Azure and AWS, through a single BigQuery interface.

Cloud Pub/Sub announces general availability of exactly-once delivery
Google Cloud Pub/Sub now supports exactly-once semantics within a single region. It's achieved through an internal deduplication layer backed by persistent storage. Although latency increases, single delivery is guaranteed. At the moment it's only available for pull subscribers, not push-based subscribers.
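The deduplication idea is simple to sketch: remember the IDs of messages already handled and drop redeliveries. The in-memory version below is only a conceptual illustration, not Pub/Sub's internals; in the real service, the delivered-ID state lives in persistent regional storage, which is also where the extra latency comes from.

```python
class DedupDeliverer:
    """Toy illustration of exactly-once delivery: track IDs of
    messages already handled so redeliveries (publisher retries,
    duplicates in transit) are suppressed. In-memory only -- a real
    system persists this state so it survives restarts."""

    def __init__(self):
        self.delivered = set()   # persistent storage in a real system

    def deliver(self, message_id, payload, handler):
        if message_id in self.delivered:
            return False         # duplicate: drop it
        handler(payload)
        self.delivered.add(message_id)
        return True

received = []
d = DedupDeliverer()
d.deliver("m1", "hello", received.append)
d.deliver("m1", "hello", received.append)   # retried duplicate
print(received)  # ['hello'] -- handled exactly once
```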
InfoQ
AWS Lambda SnapStart Accelerates Java Functions
AWS's newly announced Lambda SnapStart enables faster initialization for Java functions running on AWS Lambda, such as microservices and data processing workloads.

Using Serverless WebSockets to Enable Real-Time Messaging
The author surveys the protocols available for building real-time/live experiences and finds WebSocket, one of the most widely used, to stand out for building event-driven systems for live experiences. However, managing WebSocket infrastructure is not easy, and the author sees serverless as the way forward.

AWS announces DataZone, a new Data Management service to Govern Data
Amazon announces a new Data Management Service called DataZone that integrates AWS products like Redshift/Athena along with other data sources to provide better data management in the cloud.
AWS
Point-to-Point Integration with Amazon EventBridge Pipes
AWS launches a new service under EventBridge called Pipes, which helps you quickly build an event-driven system without writing glue code between systems.

Step Functions Distributed Map – A Serverless Solution for Large-Scale Parallel Data Processing
AWS announces a new Step Functions feature called distributed map for large-scale parallel data processing workloads. Step Functions lets us build no-code workflows, but was previously limited to 40 parallel executions.

There were also new feature announcements for Amazon Redshift covering data ingestion from S3, MySQL, and Kinesis, along with security features such as dynamic data masking and data governance.
Join the Preview – AWS Glue Data Quality
AWS introduces Data Quality for AWS Glue, which analyzes a table and recommends data quality checks for it. We can fine-tune those rules or write our own as well.
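Glue Data Quality expresses checks in its own rules language, but the kinds of rules it recommends (completeness, uniqueness, and similar per-column constraints) are easy to sketch. The stand-in below evaluates two such checks over plain dicts; it is not Glue's rule language, and the thresholds are invented.

```python
def check_rules(rows, column, min_completeness=0.95, unique=False):
    """Toy data-quality checks in the spirit of recommended rules:
    completeness (fraction of non-null values) and optional
    uniqueness for one column. Not AWS Glue's actual rule engine."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v is not None]
    results = {
        "completeness_ok": len(non_null) / len(values) >= min_completeness
    }
    if unique:
        results["unique_ok"] = len(set(non_null)) == len(non_null)
    return results

rows = [{"id": 1}, {"id": 2}, {"id": None}, {"id": 2}]
print(check_rules(rows, "id", min_completeness=0.9, unique=True))
# {'completeness_ok': False, 'unique_ok': False}
```

The value of the managed service is that it proposes sensible rules like these from profiling the table, instead of you hand-writing every threshold.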
Databricks
Memory Profiling in PySpark
Databricks describes recent contributions to open-source Spark that add memory profiling for Spark executors, mainly when executing UDFs.

Near-real-time IoT Robust Anomaly Detection Framework
The author explains how digitization produces tons of real-time data, especially in IoT-related use cases. Detecting anomalies is a common requirement across these use cases, but it is not trivial, and finding outliers is not straightforward. The author shares a framework for building anomaly detection by evaluating different ML models and combining their results with a weighted average.
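The ensemble step of such a framework, combining several models' anomaly scores into one via a weighted average, can be sketched in a few lines. The model names, scores, weights, and threshold below are all made up for illustration; the article's actual models and weighting scheme may differ.

```python
def combined_anomaly_score(scores, weights):
    """Combine per-model anomaly scores (each assumed in [0, 1])
    into a single score via a weighted average -- a sketch of the
    ensemble idea, with hypothetical model names and weights."""
    total = sum(weights.values())
    return sum(scores[m] * w for m, w in weights.items()) / total

scores = {"isolation_forest": 0.9, "zscore": 0.4, "arima_residual": 0.7}
weights = {"isolation_forest": 0.5, "zscore": 0.2, "arima_residual": 0.3}

s = combined_anomaly_score(scores, weights)
print(round(s, 2))        # 0.74
is_anomaly = s > 0.6      # threshold is an assumption, tuned per use case
```

Weighting lets a team lean on the models that historically performed best for a given sensor or signal, instead of trusting any single detector.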
Snowflake
Data Access Control: Thoughts from the trenches
Snowflake discusses data access control at various layers, such as the dataset, physical data mart, and virtual data mart, along with the potential security issues at each layer.

How Retailers Increase Customer Satisfaction and Retention with the Snowflake Retail Data Cloud
A Customer 360 view is the key pillar for building the Retail Data Cloud. Snowflake shares customer use cases (DoorDash, US Foods, etc.) of using the Snowflake Data Cloud to build a robust Customer 360 strategy.

Data Vault Techniques on Snowflake: Handling Semi-Structured Data
Snowflake shares how to ingest and query semi-structured data in a Data Vault.

The course of the week
If you want to learn data engineering on AWS, check out this bestseller course, which includes 25+ hours of content and has 11k+ students enrolled.
Data Engineering using AWS Data Analytics
Support the work
If you like my work, you can support it by buying me a coffee on Ko-fi.