Data Engineering Bulletin - Week #3
This week's update covers efficient data loading, streaming best practices, real-time telemetry, flood prediction, migrations, and much more.
Introduction
As a Data Engineer, I am always curious about what others are doing in this space. The best way to learn what is happening at other companies is to read their tech blogs and get inspiration for my next project.
Data Engineering Bulletin is a filtered selection of those posts that I feel might be helpful to the data engineering community. It covers what matters most.
If you like this bulletin, consider subscribing to it to get more of these weekly.
Google Cloud
Running database migrations with Cloud Run Jobs
Cloud Run Jobs are useful when we need to execute admin tasks, one-off commands, or complex operations such as database migrations. A regular Cloud Run service starts a process and keeps it running, like a web server or an API service, but there was no concrete way to run a one-off command or admin task. Cloud Run Jobs let users execute such jobs in a serverless way.
How a steel distributor reinvents its data science & ML workflows with Vertex AI
In this blog post, you will learn how Klöckner used Google Cloud Vertex AI services such as AutoML and Vertex AI Pipelines to improve internal processes through machine learning, including increasing velocity in ML model building, reducing time to production, and providing solutions for production-level ML challenges such as versioning, lineage, and reproducibility.
Best practices of Dataproc Persistent History Server
The challenge with ephemeral clusters and Dataproc Serverless for Spark is that you lose the application logs when the cluster machines are deleted after the job completes. Persistent History Server provides access to completed Hadoop and Spark application details for jobs executed on different ephemeral clusters or on serverless Spark.
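As a rough sketch of what such a setup might look like (the bucket, region, and cluster names are placeholders of my own; check the Dataproc documentation for the full set of properties, e.g. the MapReduce equivalents, before relying on this):

```shell
# Create a long-lived, single-node Persistent History Server (PHS) cluster.
# The wildcard in the log directory lets one PHS serve event logs written
# by many ephemeral clusters into the same bucket.
gcloud dataproc clusters create phs-cluster \
  --region=us-central1 \
  --single-node \
  --enable-component-gateway \
  --properties="spark:spark.history.fs.logDirectory=gs://my-phs-bucket/*/spark-job-history"

# Point each ephemeral job cluster's Spark event logs at the same bucket,
# so its application history survives cluster deletion.
gcloud dataproc clusters create ephemeral-etl \
  --region=us-central1 \
  --properties="spark:spark.eventLog.dir=gs://my-phs-bucket/ephemeral-etl/spark-job-history,spark:spark.history.fs.logDirectory=gs://my-phs-bucket/ephemeral-etl/spark-job-history"
```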
The business value of Cloud SQL: how companies speed up deployments, lower costs and boost agility
Google Cloud shares resources on the benefits of cloud-based databases and how companies can save costs using Cloud SQL, Google's managed database service (MySQL, PostgreSQL, SQL Server).
Performance considerations for loading data into BigQuery
When loading data into Google BigQuery, compressed formats do not automatically give a performance boost; in fact, in some cases uncompressed formats load faster than compressed ones. If the data is compressed, it is better to split it into files of 256 MB or less and parallelize the load to speed up the process.
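To make the parallel-load advice concrete, here is a minimal, self-contained sketch (the helper name and the newline-delimited JSON format are my own choices, not from the post) that splits a file into chunks of at most a given size, breaking only on record boundaries; in practice the limit would be 256 MB and the resulting parts would be loaded into BigQuery concurrently:

```python
import os
import tempfile

# Hypothetical helper: split a newline-delimited file into chunks of at most
# max_bytes each, so the parts can be loaded into BigQuery in parallel.
# Chunks break on line boundaries, so no record is split across files.
def split_for_parallel_load(path, out_dir, max_bytes):
    def flush(buf, part):
        out = os.path.join(out_dir, f"part-{part:05d}.ndjson")
        with open(out, "wb") as o:
            o.writelines(buf)
        return out

    paths, part, size, buf = [], 0, 0, []
    with open(path, "rb") as f:
        for line in f:
            if buf and size + len(line) > max_bytes:
                paths.append(flush(buf, part))
                part, size, buf = part + 1, 0, []
            buf.append(line)
            size += len(line)
    if buf:
        paths.append(flush(buf, part))
    return paths

# Demo with a tiny 64-byte limit standing in for 256 MB.
with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, "data.ndjson")
    with open(src, "wb") as f:
        f.write(b'{"id": %d}\n' * 20 % tuple(range(20)))
    parts = split_for_parallel_load(src, d, max_bytes=64)
    print(len(parts))                                      # 4
    print(all(os.path.getsize(p) <= 64 for p in parts))    # True
```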
Confluent
Building Event Streaming Applications in .NET
Typically, when we think of Apache Kafka and its ecosystem, most of the development and documentation one can find is in Java, since Apache Kafka itself is built in Java. But there is support for other languages and frameworks, such as .NET; this blog from Confluent explains how to build event streaming applications with the .NET client library.
How Fully Managed Connectors Make Apache Kafka Easier
In this blog, Confluent notes that although the developer-run connector ecosystem is free, managing the infrastructure to run those connectors is not, and it always requires troubleshooting and specialized knowledge. Confluent's managed connectors make it easy to connect different sources and sinks to a Kafka cluster without worrying about infrastructure management or needing deep expertise.
Databricks
Streaming in Production: Collected Best Practices
Databricks shares best practices for deploying streaming applications in a production environment. Unit testing, triggers, fault tolerance, and much more are covered.
Announcing General Availability of Data lineage in Unity Catalog
Databricks adds a data lineage capability to Unity Catalog. It can now track column- and table-level lineage in real time, and it can also track lineage from notebooks, workflows, and dashboards.
How Databricks Powers Stantec's Flood Predictor Engine
A customer story from Stantec explaining how Databricks helped them build flood-estimation products. Flood prediction requires heavy computation over geospatial data, and they used Databricks Delta Live Tables to meet those computation needs.
Why We Migrated From Apache Airflow to Databricks Workflows at YipitData
The YipitData team was running a custom installation of Apache Airflow on top of Databricks clusters, since there was no managed data orchestration solution on Databricks at the time. Once Workflows was released, the YipitData team planned the move and created a migration script to translate Airflow DAGs into Workflows configurations. The overall transition was smooth for them.
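As a toy illustration of what such a translation script might do (this is my own simplified sketch, not YipitData's actual script; the Airflow-side field names are invented, and the output only approximates the Databricks Jobs task schema):

```python
import json

# Hypothetical translator: turn a minimal Airflow-style DAG description into
# a Databricks Workflows-style job spec. The input shape ("tasks", "notebook",
# "upstream") is an assumption made for this example.
def airflow_to_workflow(dag):
    tasks = []
    for task_id, spec in dag["tasks"].items():
        tasks.append({
            "task_key": task_id,
            "notebook_task": {"notebook_path": spec["notebook"]},
            # Airflow upstream dependencies become Workflows depends_on edges.
            "depends_on": [{"task_key": up} for up in spec.get("upstream", [])],
        })
    return {
        "name": dag["dag_id"],
        "schedule": {"quartz_cron_expression": dag.get("schedule", "")},
        "tasks": tasks,
    }

dag = {
    "dag_id": "daily_ingest",
    "schedule": "0 0 6 * * ?",
    "tasks": {
        "extract": {"notebook": "/pipelines/extract"},
        "transform": {"notebook": "/pipelines/transform", "upstream": ["extract"]},
    },
}
print(json.dumps(airflow_to_workflow(dag), indent=2))
```

The real migration would also need to map retries, clusters, and schedules, but the core idea is the same: task IDs become `task_key`s and upstream edges become `depends_on` entries.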
Snowflake
How Snowflake’s Native Apps Simplify Technical Orchestration for My Data Outlet Customers
Snowflake's client My Data Outlet (MDO) shares their use case for Snowflake's Native Apps, which helped them share investment data with their clients through application code while maintaining security, controlling compute costs, and more.
Canada Drives Reduces Customer Acquisition Cost (CAC) by 20% with Snowflake and RudderStack
Canada Drives uses Snowflake and RudderStack to build a customer data platform that reduces customer acquisition cost, lowers SaaS spend, and decreases the time it takes to sell cars.
Open Data Is Back in Business
The author talks about the promise of open data and how its purpose and definition have changed. Open data is not only about transparency; it opens up a whole new economy. That said, most government data is not accessible, tagged, classified, or self-serviceable. Hence the Snowflake Marketplace plays an important role in providing access to and self-service discovery of open data through its ecosystem of tools.
GitHub
Creating an accessible search experience with the QueryBuilder component
The GitHub team explains their work on the QueryBuilder component, which helps them tackle different types of queries, for example discussions, actions, autocomplete, jump-to, etc. They will also be open-sourcing this component for other developers.
AWS
Heads-Up: Amazon S3 Security Changes Are Coming in April of 2023
There are two security changes coming to S3: S3 Block Public Access will be enabled by default, and ACLs will be disabled by default. These settings have been the recommended practice for some time, and from April 2023 they will be the default.
Grafana
How to build a Formula 1 real-time analytics stack with Azure Data Explorer and Grafana Cloud
The Grafana team shares and explains a design for handling real-time telemetry data from F1 races, and how Azure Data Explorer and Grafana Cloud can be used together to build real-time data analytics.
Cloudera
How Agencies Can Gain the Cyber Edge with Smart Data Solutions
Cloudera shares how the Cloudera Data Platform can be used to build smart data solutions that help organizations gain an edge over cyber attacks like the one that happened to SolarWinds.
Introducing Cloudera DataFlow Designer: Self-service, No-Code Dataflow Design
The Cloudera team explains how their new offering, DataFlow Designer, helps developers build data pipelines through a visual, no-code interface with minimal admin tasks.
The course of the week
If you want to learn data engineering on AWS, check out this bestseller course, which includes 25+ hours of content and has 11k+ students enrolled.
Data Engineering using AWS Data Analytics
Support the work
If you like my work, you can support it by buying me a Ko-fi.