Data Engineering Bulletin


Data Engineering Bulletin - Week #1

This week's updates cover data lineage, streaming, SQL, databases, data ingestion, cost optimization, and more.

Suraj Mishra
Nov 28, 2022

Introduction

  • As a Data Engineer, I am always curious to know what others are doing in this space. The best way to learn about what is happening in other companies is to read their tech blogs and get inspiration for my next project.

  • Data Engineering Bulletin is a filtered version of those efforts that I feel might be helpful to my data engineering community. The bulletin covers what matters most.

  • If you like this bulletin, consider subscribing to it to get more of these weekly.


Line Corp

  • Log data pipeline with Spark Streaming
    This post is an internship report on a use case in which the author builds a data pipeline that collects audit logs from a Hadoop cluster and processes them with Spark for alerting and log format conversion.

  • Data Lineage Implementation on Line Big Data Platform

    This post covers the author's journey to build an in-house data lineage system in order to better manage job failures. The author describes the functionality built in the project's initial scope, along with notifications, monitoring, and system design.

Expedia

  • Unifying Data Lake Across Multiple Regions

    Expedia shares its experience building a unified cross-regional data lake. The author discusses the strategies that were tested and implemented to achieve this unification, including federation and replication.

Confluent

  • Stream Processing for Survey Response Use Case

    Confluent shares how to build a survey response analysis pipeline using Apache Kafka and the Confluent Platform. They also provide an out-of-the-box SurveyMonkey connector for Kafka, so users don't have to write publisher code.
    If you are interested in more use cases, you can find them here

Cloudera

  • Enriching & Storing events with Flink SQL, Hive, and SQL Stream builder interface

    Cloudera shares how to use Apache Flink SQL and Hive tables together with Cloudera's SQL Stream Builder user interface. The design enriches a stream with data from a Hive table and stores the output in a Hive table used as a sink.

Google Cloud

  • Connected-Stories’s Use Case on Google Cloud

    Connected-Stories, a Google Cloud user, shares its experience building a next-generation video ads platform on GCP using products such as Vertex AI, BigQuery, Cloud Pub/Sub, and Cloud Composer. The platform ingests 200M+ user interaction events and uses them to serve better, more personalized ad experiences.

  • Optimizing Cloud Storage Cost with new Autoclass service

    GCP introduces a new service that manages data placement so that customers can optimize their Cloud Storage costs based on access patterns.
    The service automatically moves data to colder storage when it is not being accessed.

  • New BigQuery Public Dataset to Improve Security Posture of Open Source Dependencies
    Google Cloud released a new public dataset on BigQuery called Open Source Insights, which should help security teams manage the security posture of their dependencies. The Open Source Insights project scans millions of packages from Go, Maven, PyPI, and other ecosystems, computes their dependency graphs, and then labels them with security advisories, licenses, popularity metrics, etc.
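
The "compute the dependency graph, then label it" idea in the Open Source Insights item can be sketched in a few lines of Python. This is only an illustration of the concept, not how the dataset is built or queried: the package names, the `DIRECT_DEPS` map, and the `ADVISORIES` labels below are all made up for the example.

```python
from collections import deque

# Hypothetical direct-dependency map: package -> packages it depends on.
DIRECT_DEPS = {
    "app": ["web", "orm"],
    "web": ["http", "templating"],
    "orm": ["driver"],
    "http": [],
    "templating": [],
    "driver": ["http"],
}

# Hypothetical advisory labels keyed by package name.
ADVISORIES = {"templating": ["EXAMPLE-0001"], "driver": ["EXAMPLE-0002"]}


def transitive_deps(root, direct_deps):
    """Return every package reachable from root, excluding root itself."""
    seen = set()
    queue = deque(direct_deps.get(root, []))
    while queue:
        pkg = queue.popleft()
        if pkg in seen:
            continue
        seen.add(pkg)
        queue.extend(direct_deps.get(pkg, []))
    return seen - {root}


def flagged_deps(root, direct_deps, advisories):
    """Map each transitive dependency that has advisories to those advisories."""
    return {
        pkg: advisories[pkg]
        for pkg in sorted(transitive_deps(root, direct_deps))
        if pkg in advisories
    }


print(flagged_deps("app", DIRECT_DEPS, ADVISORIES))
```

The point of walking the full graph is that `app` never declares `templating` or `driver` directly, yet both show up as flagged because they are pulled in transitively.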

Cloudflare

  • Send Logs to any destination for Monitoring and Analytics
    Cloudflare released a new service called Logpush that lets customers send application logs to any destination, such as cloud storage or Datadog.
    With this service, Cloudflare expands the observability options available to its clients.

  • How Cloudflare uses Analytics Engine
    Cloudflare released Analytics Engine Service earlier this year for their customers. In this blog, Cloudflare talks about how its customers can start using this service to gather telemetry data and use it to improve their products and add new features.

Databricks

  • Introducing Ingestion Time Clustering

    Databricks releases a new clustering method based on event ingestion time. They observed that in many use cases the data source delivers incremental data that is already sorted by time, so automatically clustering data by ingestion time makes sense for these workloads.
    They also tested this with online retailers and found a 19x performance increase in SELECT and DELETE operations.
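
To see why keeping data in ingestion order can help time-filtered queries, here is a toy Python model. It is not Databricks' implementation; it only assumes that each file keeps min/max timestamp statistics and that a query can skip any file whose range does not overlap its filter. The helper names (`write_files`, `files_to_scan`) and the row counts are invented for the sketch.

```python
import random


def write_files(timestamps, rows_per_file):
    """Split rows into files in the given order; keep (min, max) stats per file."""
    files = []
    for start in range(0, len(timestamps), rows_per_file):
        chunk = timestamps[start:start + rows_per_file]
        files.append((min(chunk), max(chunk)))
    return files


def files_to_scan(files, lo, hi):
    """Count files whose [min, max] range overlaps the query range [lo, hi]."""
    return sum(1 for fmin, fmax in files if fmax >= lo and fmin <= hi)


# 1,000 rows with increasing ingestion timestamps 0..999, 100 rows per file.
arrival_order = list(range(1000))

# The same rows written in shuffled order, as a stand-in for lost clustering.
shuffled = arrival_order[:]
random.Random(0).shuffle(shuffled)

clustered_files = write_files(arrival_order, rows_per_file=100)
scattered_files = write_files(shuffled, rows_per_file=100)

# A query for a narrow, recent time window.
print(files_to_scan(clustered_files, 950, 999))  # only the last file overlaps
print(files_to_scan(scattered_files, 950, 999))  # usually most of the files
```

When files are written in arrival order, their time ranges do not overlap, so a narrow time filter touches very few of them. Once that order is lost, nearly every file's range spans the whole timeline and little can be skipped.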

Snowflake

  • Observing Snowflake workloads with Snowflake Alerts
    Snowflake announces Snowflake Alerts (private beta), which lets clients write SQL-based alerts that fire on defined trigger conditions and then take an action when a condition is met.

  • Virtual Warehouses and Chargeback
    The author discusses Data Vault techniques in Snowflake and how to make efficient use of Snowflake's decoupled compute, storage, and access layers. The post also covers how this separation can support chargeback to each department in an organization based on its compute and storage usage patterns.

  • 3 attributes of Data Economy
    Snowflake lists the top 10 attributes of data infrastructure and discusses three of them in detail: scalability, versatility, and data applications.

  • Be The Match Company’s Use Case with Snowflake
    Snowflake shares Be The Match's use case, in which they use Snowflake to build a single source of truth for their case manager's operations team. This single source of truth consists of data related to the bone marrow and blood cell registry. Be The Match uses the extensive dataset it has collected to match patients to donors (70-80% of the time) to cure blood cancer.
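
The condition-plus-action pattern behind the Snowflake Alerts item above can be illustrated without Snowflake at all. The sketch below uses Python's built-in `sqlite3` module as a stand-in database; it does not use Snowflake's alert syntax, and the `query_history` table, the 600-second threshold, and the `check_alert` helper are assumptions made up for the example.

```python
import sqlite3


def check_alert(conn, condition_sql, action):
    """Run condition_sql; if it returns any rows, call action(rows) and return True."""
    rows = conn.execute(condition_sql).fetchall()
    if rows:
        action(rows)
        return True
    return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE query_history (query_id TEXT, elapsed_seconds REAL)")
conn.executemany(
    "INSERT INTO query_history VALUES (?, ?)",
    [("q1", 12.0), ("q2", 640.0), ("q3", 3.5)],
)

# Hypothetical trigger condition: any query that ran longer than 10 minutes.
LONG_RUNNING = "SELECT query_id FROM query_history WHERE elapsed_seconds > 600"

# The action here just records the offending rows; a real one might send a message.
notifications = []
fired = check_alert(conn, LONG_RUNNING, notifications.append)
print(fired, notifications)
```

In practice a check like this would run on a schedule, with the action sending a notification or kicking off a remediation step rather than appending to a list.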

The course of the week

    • If you want to learn data engineering on AWS, please check out this bestselling course, which includes 25+ hours of content and has 11k+ students enrolled.

      Data Engineering using AWS Data Analytics

      Build Data Engineering Pipelines on AWS using Data Analytics Services - Glue, EMR, Athena, Kinesis, Lambda, Redshift

Support the work

  • If you like my work, you can support it by buying me a coffee on Ko-fi.


Thanks for reading Data Engineering Bulletin! Subscribe for free to receive new posts and support my work.

© 2023 Suraj Mishra