What, Why, and How of Streaming Data

August 3, 2020

Streaming data has become very popular lately, thanks to its many use cases and the growing number of tools and databases built for stream analytics. It's increasingly important to have a working idea of what it is, and that's what this blog post is all about.

Questions that this blog post will answer:

  1. What is streaming data?
  2. What are the sources of streaming data?
  3. Examples of streaming data.
  4. How is it collected, processed, and analyzed?
  5. How does it help companies make data-driven decisions?

This blog post will also answer how companies know more about YOU than you do.

Data you're familiar with: product data (product_id, name, price, ...) and customer data (customer_name, id, ...). This kind of data is typically stored in SQL or NoSQL databases, depending on the business problem.

Streaming data, on the other hand, is completely different from the data that most of you are familiar with.

Streaming data: data that's coming continuously from various sources and that needs to be processed either in real-time or after some time.

Let's break it down further.

Stream data = data generated from events (most of the time)

Event: anything that happens in the real world at a particular timestamp.

Continuous stream of events: never-ending succession of individual events.

They're continuous just like the water flowing through the river. And the same goes for stream data.
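
To make this concrete, here is a minimal Python sketch of an event and a never-ending stream of them. The `Event` class, its fields, and the `event_stream` generator are made up for illustration and don't come from any particular library.

```python
import itertools
import time
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Event:
    """Something that happened at a particular time (hypothetical shape)."""
    name: str
    timestamp: float  # seconds since the Unix epoch


def event_stream() -> Iterator[Event]:
    """Yield events forever; a real stream has no natural end either."""
    for i in itertools.count():
        yield Event(name=f"event-{i}", timestamp=time.time())


# A consumer can never load the whole stream; it can only take a slice of it.
first_three = list(itertools.islice(event_stream(), 3))
```

The point of the generator is that nothing ever holds "all the events". Whatever reads the stream has to deal with events as they arrive.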

Now, let's come back to the world of programming.

Applications run 24x7, serving millions of users. These applications generate both kinds of data: the data you're familiar with, and streaming data.

Examples of streaming data:

1) Logs: they tell developers what's happening during the execution of code (you know about them already)

Logs are continuously generated by applications. They capture what happened, what did not, where a problem in the code arose, which edge cases you missed, etc.

2) Events data

User X clicked product Y. User X liked tweet Y. User X added product Y to the cart. User X spent Y minutes watching video Z.

These are all events that are being sent to servers for data analysis.
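
As a rough sketch, one such event might be encoded as a JSON line before it is sent to a server. The field names (`user_id`, `action`, `target_id`, `timestamp`) and the `make_event` helper are assumptions for illustration; every company defines its own event schema.

```python
import json
from datetime import datetime, timezone


def make_event(user_id: str, action: str, target_id: str, when: datetime) -> str:
    """Encode one user action as a JSON line (field names are hypothetical)."""
    return json.dumps({
        "user_id": user_id,
        "action": action,
        "target_id": target_id,
        "timestamp": when.isoformat(),
    })


# "User X added product Y to the cart", with a fixed timestamp for the example.
when = datetime(2020, 8, 3, 12, 0, 0, tzinfo=timezone.utc)
line = make_event("user-x", "add_to_cart", "product-y", when)
print(line)
```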

This is how companies know more about you than you do.

3) Time-series data

Data generated by sensors and IoT devices also counts as streaming data.

Sensor X reads Y value at Z time.

Timestamps play a big role in streaming data because we need to know when each event occurred.

Streaming data offers a different way of looking at data.
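
One common way timestamps get used is to group readings into fixed time windows and summarize each window. The sketch below averages sensor values per 60-second window. The `Reading` tuple layout and the `average_per_window` function are hypothetical, and it runs over a short finite list only to keep the example small.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Reading = Tuple[str, float, int]  # (sensor_id, value, unix_timestamp_seconds)


def average_per_window(
    readings: Iterable[Reading], window_seconds: int
) -> Dict[Tuple[str, int], float]:
    """Average each sensor's values within fixed, non-overlapping time windows."""
    buckets: Dict[Tuple[str, int], List[float]] = defaultdict(list)
    for sensor_id, value, ts in readings:
        window_start = ts - (ts % window_seconds)  # align to the window boundary
        buckets[(sensor_id, window_start)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


readings = [
    ("sensor-x", 20.0, 0),
    ("sensor-x", 22.0, 30),
    ("sensor-x", 30.0, 60),
]
averages = average_per_window(readings, window_seconds=60)
```

Without the timestamp there would be no way to tell which readings belong together, which is why almost every streaming event carries one.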

Collection of stream data

Stream data is generated by event producers and is then usually sent to Apache Kafka topics, from which downstream systems consume it.

Apache Kafka decouples event producers from event consumers.
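
The idea of decoupling can be illustrated with a toy in-memory log. This is not Kafka's API, and the `Topic` class with its `publish` and `poll` methods is invented for this sketch. It only mimics the general idea that producers append to a log while each consumer keeps track of its own read position.

```python
from typing import Dict, List


class Topic:
    """Toy append-only log with per-consumer read positions (not Kafka's API)."""

    def __init__(self) -> None:
        self._log: List[str] = []
        self._offsets: Dict[str, int] = {}

    def publish(self, message: str) -> None:
        # Producers only append; they know nothing about who reads.
        self._log.append(message)

    def poll(self, consumer: str) -> List[str]:
        # Each consumer resumes from its own offset, independent of the others.
        start = self._offsets.get(consumer, 0)
        messages = self._log[start:]
        self._offsets[consumer] = len(self._log)
        return messages


topic = Topic()
topic.publish("click:product-1")
topic.publish("click:product-2")

analytics_batch = topic.poll("analytics")  # reads both messages
topic.publish("click:product-3")
alerting_batch = topic.poll("alerting")    # a late consumer still sees all three
analytics_next = topic.poll("analytics")   # only the message it has not seen yet
```

Because the producer never talks to a consumer directly, new consumers can be added later without touching the code that produces events.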

You can read more about Kafka here.

Processing of stream data

The raw event data is processed using either the Kafka Streams API or Apache Samza.
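
To give a feel for what "processing" means here, the sketch below chains a stateless filter with a stateful running count using plain Python generators. It does not use the API of any stream processing framework. The event shape and the function names `only_clicks` and `running_click_counts` are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple

RawEvent = Dict[str, str]


def only_clicks(events: Iterable[RawEvent]) -> Iterator[RawEvent]:
    """Stateless step: drop everything that is not a click."""
    return (e for e in events if e["action"] == "click")


def running_click_counts(events: Iterable[RawEvent]) -> Iterator[Tuple[str, int]]:
    """Stateful step: emit an updated per-product count after every click."""
    counts: Counter = Counter()
    for e in events:
        counts[e["product"]] += 1
        yield e["product"], counts[e["product"]]


raw = [
    {"action": "click", "product": "A"},
    {"action": "like", "product": "A"},
    {"action": "click", "product": "B"},
    {"action": "click", "product": "A"},
]
updates = list(running_click_counts(only_clicks(raw)))
```

Each step handles one event at a time and passes results along, so the same pipeline would work on a stream that never ends.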

The analytics part is mainly done with Apache Druid, a database built to serve applications with high data ingestion rates and stream analytics workloads.

How companies use stream analytics

The data generated from stream events is GOLD for companies. This is where most of the value and insight about users lies.

Numerous use cases:

  1. Deployed some UX improvements? User activity will tell you whether they're good or not.
  2. A/B testing, or tweaked some things on the website? Again, user activity comes to the rescue.
  3. Users are spending less time on the website now? You need to look at what went wrong.
  4. Heap memory is almost full and the container is about to terminate? Alert on Slack. Fix that.
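
The last use case can be sketched as a small check over a stream of heap usage samples. The 0.9 threshold, the `heap_alerts` function, and the message text are assumptions for illustration, and actually delivering the alert to Slack is left out.

```python
from typing import Iterable, Iterator


def heap_alerts(usage_ratios: Iterable[float], threshold: float = 0.9) -> Iterator[str]:
    """Yield one alert each time heap usage crosses above the threshold."""
    above = False
    for ratio in usage_ratios:
        if ratio >= threshold and not above:
            yield f"Heap usage at {ratio:.0%}, container may be terminated soon"
        above = ratio >= threshold  # remember state to avoid repeated alerts


# Hypothetical samples: fraction of the heap in use over time.
samples = [0.50, 0.85, 0.92, 0.95, 0.70, 0.91]
alerts = list(heap_alerts(samples))
```

Alerting only when the value crosses the threshold, rather than on every sample above it, keeps the channel from being flooded while the problem persists.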

And many more.