Data Engineering Bulletin - Week #1

Nov 28, 2022

As a Data Engineer, I am always curious to know what others are doing in this space. The best way to learn about what is happening in other companies is to read their tech blogs and get inspiration for my next project.
Data Engineering Bulletin is a filtered version of that reading: the posts I feel might be helpful to the Data Engineering community. The bulletin covers what matters the most.
If you like this bulletin, consider subscribing to it to get more of these weekly.

Log data pipeline with Spark Streaming
This blog post is an internship report on a use case where the author builds a data pipeline that collects audit logs from a Hadoop cluster and adds Spark processing for alerting and log format conversion.
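As a rough illustration of the two processing steps mentioned above (log format conversion and alerting), here is a minimal plain-Python sketch. It assumes audit lines carry a timestamp followed by space-separated key=value fields; the sample lines, the helper names `parse_audit_line` and `should_alert`, and the alert rule are all hypothetical. The pipeline in the post runs on Spark, so this only shows the per-record logic, not the author's implementation.

```python
import json
import re

# Hypothetical sample lines in a key=value audit style (an assumption, not taken from the post).
SAMPLE_LINES = [
    "2022-11-28 10:00:01 allowed=true ugi=alice ip=/10.0.0.5 cmd=open src=/data/a.csv",
    "2022-11-28 10:00:02 allowed=false ugi=bob ip=/10.0.0.9 cmd=delete src=/data/b.csv",
]

FIELD_RE = re.compile(r"(\w+)=(\S+)")


def parse_audit_line(line):
    """Format conversion: turn one key=value audit line into a dict."""
    date, time, rest = line.split(" ", 2)
    record = {"timestamp": f"{date} {time}"}
    record.update(FIELD_RE.findall(rest))  # list of (key, value) pairs
    return record


def should_alert(record):
    """Hypothetical alert rule: flag any operation that was denied."""
    return record.get("allowed") == "false"


records = [parse_audit_line(line) for line in SAMPLE_LINES]
alerts = [r for r in records if should_alert(r)]

for r in records:
    print(json.dumps(r))  # one JSON document per audit event
print("alerts:", len(alerts))
```

In a streaming job the same two functions would be applied to each incoming record, with alerts routed to a notification channel instead of printed.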
Data Lineage Implementation on Line Big Data Platform
This post covers the author’s journey to build an in-house data lineage system in order to better manage job failures. The author covers the functionality built in the initial scope of the project, along with notifications, monitoring, and system design.

Unifying Data Lake Across Multiple Regions
Expedia shares its experience in building a unified cross-regional data lake. The author discusses the strategies that were tested and implemented to achieve this unification, including federation and replication, among others.

Stream Processing for Survey Response Use Case
Confluent shares how to build a survey response analysis pipeline using Apache Kafka and the Confluent Platform. They also provide an out-of-the-box SurveyMonkey connector for Kafka, so users don’t have to write publisher code.
If you are interested in more use cases, you can find them here.

Enriching & Storing Events with Flink SQL, Hive, and the SQL Stream Builder Interface
Cloudera shares how to use Apache Flink SQL and Hive tables along with Cloudera’s SQL Stream Builder user interface. The design covers enriching streams with a Hive table and storing the output in a Hive table as a sink.

Connected-Stories’s Use Case on Google Cloud
Connected-Stories, a Google Cloud user, shares its experience of building a next-generation video ads platform on GCP using products like Vertex AI, BigQuery, Cloud Pub/Sub, and Cloud Composer. They ingest 200M+ events from user interactions and use them to serve better, more personalized ad experiences.
Optimizing Cloud Storage Cost with new Autoclass service
GCP introduces a new service that manages data placement so that customers can optimize their Cloud Storage costs based on access patterns.
The service automatically moves data to colder storage when it is not being accessed.
New BigQuery Public Dataset to Improve Security Posture of Open Source Dependencies
Google Cloud released a new public dataset on BigQuery called Open Source Insights, which should help security teams manage the security posture of their dependencies. The Open Source Insights project scans millions of packages from Go, Maven, and PyPI and computes their dependency graphs. It then labels them with security advisories, licenses, popularity metrics, etc.

Send Logs to any destination for Monitoring and Analytics
Cloudflare released a new service called Logpush that allows sending application logs to any destination, such as cloud storage or Datadog.
With this service, Cloudflare expands the observability options it offers its clients.
How Cloudflare uses Analytics Engine
Cloudflare released Analytics Engine Service earlier this year for their customers. In this blog, Cloudflare talks about how its customers can start using this service to gather telemetry data and use it to improve their products and add new features.

Introducing Ingestion Time Clustering
Databricks releases a new clustering method based on event ingestion time. They observed that in many use cases the data source delivers incremental data that is already sorted by time, so automatically clustering data by ingestion time makes sense for these workloads.
They also tested this with online retailers and found a 19x performance increase in SELECT and DELETE operations.
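To see why keeping data in arrival order can help time-filtered queries, here is a toy plain-Python sketch of file skipping with per-file min/max timestamps. It is not Databricks’ implementation: the "files" are in-memory lists, and `write_files` and `files_to_scan` are hypothetical helpers that only illustrate the general idea.

```python
import random


def write_files(rows, rows_per_file):
    """Split rows into 'files' in the given order, recording min/max timestamp per file."""
    files = []
    for i in range(0, len(rows), rows_per_file):
        chunk = rows[i:i + rows_per_file]
        files.append({"min_ts": min(chunk), "max_ts": max(chunk), "rows": chunk})
    return files


def files_to_scan(files, lo, hi):
    """Count files whose [min_ts, max_ts] range overlaps the query range [lo, hi]."""
    return sum(1 for f in files if f["max_ts"] >= lo and f["min_ts"] <= hi)


timestamps = list(range(1000))        # events arriving in time order
shuffled = timestamps[:]
random.Random(0).shuffle(shuffled)    # same events, but with time order lost

clustered = write_files(timestamps, 100)   # layout preserves ingestion order
scattered = write_files(shuffled, 100)     # layout ignores ingestion order

clustered_scans = files_to_scan(clustered, 250, 260)
scattered_scans = files_to_scan(scattered, 250, 260)
print("files scanned (clustered):", clustered_scans)
print("files scanned (scattered):", scattered_scans)
```

With the time-ordered layout, a narrow time filter touches a single file, while the shuffled layout forces the query to open many more files because every file spans a wide time range.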

Observing Snowflake workloads with Snowflake Alerts
Snowflake announces Snowflake Alerts (private beta), which lets clients write SQL-based alerts that check for certain trigger conditions and then take action when a condition is met.
Virtual Warehouses and Chargeback
The author talks about Data Vault techniques in Snowflake and how to make efficient use of Snowflake’s decoupled architecture of compute, storage, and access layers. The post also covers how to use that separation to charge costs back to each department in an organization based on its compute and storage usage patterns.
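The chargeback idea boils down to splitting a bill in proportion to usage. A minimal sketch, assuming usage is measured as a single number per department; the function name `chargeback`, the department names, and all figures below are made up for illustration and do not come from the post.

```python
def chargeback(total_cost, usage_by_department):
    """Split total_cost across departments in proportion to their usage."""
    total_usage = sum(usage_by_department.values())
    if total_usage == 0:
        # Nothing was used, so nothing is charged back.
        return {dept: 0.0 for dept in usage_by_department}
    return {
        dept: total_cost * used / total_usage
        for dept, used in usage_by_department.items()
    }


# Hypothetical usage figures per department (for example, warehouse credits consumed).
credits_used = {"finance": 120.0, "marketing": 60.0, "engineering": 20.0}

bill = chargeback(1000.0, credits_used)
for dept, amount in bill.items():
    print(f"{dept}: {amount:.2f}")
```

Giving each department its own virtual warehouse would make the per-department usage numbers straightforward to collect, which is what makes this kind of split practical.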
3 attributes of Data Economy
Snowflake lists the top 10 attributes of data infrastructure and discusses three of them in detail: scalability, versatility, and data applications.
Be The Match Company’s Use Case with Snowflake
Snowflake shares Be The Match’s use case, in which they use Snowflake to build a single source of truth for their case managers’ operations team. This single source of truth consists of data related to the bone marrow and blood cell registry. Be The Match uses the extensive dataset it has collected to match patients to donors (70-80% of the time) to help cure blood cancer.
