Meetup: Streaming Data Pipeline Development

© 2023 Cloudera, Inc. All rights reserved.
Streaming Data Pipeline Development
Tim Spann
Principal Developer Advocate
25-April-2023

https://attend.cloudera.com/nificommitters0503

FLaNK Stack
Tim Spann
@PaasDev // Blog: www.datainmotion.dev
Principal Developer Advocate.
Princeton Future of Data Meetup.
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC
https://medium.com/@tspann
https://github.com/tspannhw
Apache NiFi x Apache Kafka x Apache Flink x Java


FLiP Stack Weekly
This week in Apache NiFi, Apache Flink, Apache
Kafka, Apache Spark, Apache Iceberg, Python,
Java and Open Source friends.
https://bit.ly/32dAJft

Future of Data - Princeton + Virtual
@PaasDev
https://www.meetup.com/futureofdata-princeton/
From Big Data to AI to Streaming to Containers to
Cloud to Analytics to Cloud Storage to Fast Data to
Machine Learning to Microservices to ...

FREE LEARNING ENVIRONMENT

CSP Community
Edition
• Kafka, KConnect, SMM, SR,
Flink, and SSB in Docker
• Runs in Docker
• Try new features quickly
• Develop applications locally
● Docker compose file of CSP to run from command line w/o any
dependencies, including Flink, SQL Stream Builder, Kafka, Kafka
Connect, Streams Messaging Manager and Schema Registry
○ $> docker compose up
● Licensed under the Cloudera Community License
● Unsupported
● Community Group Hub for CSP
● Find it on docs.cloudera.com under Applications

STREAMING

WHAT IS REAL-TIME?

ENABLING ANALYTICS AND INSIGHTS ANYWHERE
Driving enterprise business value
[Diagram labels]
STREAMING DATA SOURCES: Clickstream, Market data, Machine logs, Social
ENTERPRISE DATA SOURCES: CRM, Customer history, Research, Compliance Data, Risk Data, Lending
Layers: STORAGE & PROCESSING, ANALYTICS & INSIGHTS
Components: REAL-TIME STREAMING ENGINE, ANALYTICS & DATA WAREHOUSE, DATA SCIENCE/MACHINE LEARNING, CENTRALIZED DATA PLATFORM
Flows: Stream Ingest, Ingest – Data at Rest, Deploy Models
Outputs: BI Solutions, SQL, Predictive Analytics, Actions & Alerts, [SQL] Real-Time Apps
• Model Building
• Model Training
• Model Scoring

STREAMING FROM ... TO ... WHILE ...
Data distribution as a first class citizen
[Diagram labels]
Sources: IoT Devices, LOG DATA SOURCES (App Logs, Laptops/Servers, Mobile Apps, Security Agents), ON-PREM DATA SOURCES
UNIVERSAL DATA DISTRIBUTION (Ingest, Transform, Deliver): Ingest Gateway; Ingest Processors; Router, Filter & Transform Processors; Destination Processors
Destinations: BIG DATA CLOUD SERVICES, CLOUD BUSINESS PROCESS SERVICES *, CLOUD DATA* ANALYTICS/SERVICE (Cloudera DW), CLOUD WAREHOUSE

EVENT-DRIVEN ORGANIZATION
Modernize your data and applications
CDF Event Streaming Platform
Integration - Processing - Management - Cloud
[Diagram labels: Stream ETL, Cloud Storage, Application, Data Lake, Data Stores, Make Payment, µServices, Streams, Edge - IoT, Dashboard]

BUILDING REAL-TIME REQUIRES A TEAM

APACHE KAFKA
I Can Haz Data?

Yes, Franz, It’s Kafka
Let’s do a metamorphosis on your data. Don’t fear changing data.
You don’t need to be a brilliant writer to stream
data.
Franz Kafka was a German-speaking
Bohemian novelist and short-story writer,
widely regarded as one of the major figures of
20th-century literature. His work fuses
elements of realism and the fantastic.
Wikipedia

STREAMS MESSAGING WITH KAFKA
• Highly reliable distributed messaging system.
• Decouples applications, enabling many-to-many
patterns.
• Publish-Subscribe semantics.
• Horizontal scalability.
• Efficient implementation to operate at speed with
big data volumes.
• Organized by topic to support several use cases.

What is Apache Kafka?
– Distributed: horizontally scalable
– Partitioned: the data is split-up and distributed across the brokers
– Replicated: allows for automatic failover
– Unique: Kafka does not track the consumption of messages (the consumers
do)
– Fast: designed from the ground up with a focus on performance and
throughput
– Kafka was built at LinkedIn in 2011
– Open sourced as an Apache project
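The "Partitioned" and "Unique" points above can be illustrated with a small in-memory model. This is only a sketch, not Kafka client code: the `Topic` and `Consumer` classes are hypothetical, and the CRC32-based partitioner is an assumption chosen for illustration rather than Kafka's actual partitioner. It shows records with the same key landing in the same partition, and each consumer keeping its own read position.

```python
import zlib
from collections import defaultdict


class Topic:
    """Toy in-memory topic: a fixed number of append-only partitions."""

    def __init__(self, name, num_partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Hypothetical partitioner: stable hash of the key modulo partition count.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def publish(self, key, value):
        partition = self.partition_for(key)
        self.partitions[partition].append((key, value))
        return partition


class Consumer:
    """The consumer, not the topic, remembers how far it has read."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)  # partition index -> next offset to read

    def poll(self):
        records = []
        for index, log in enumerate(self.topic.partitions):
            start = self.offsets[index]
            records.extend(log[start:])
            self.offsets[index] = len(log)  # advance this consumer's position only
        return records


topic = Topic("sensors")
for i in range(6):
    topic.publish(f"sensor-{i % 2}", i)

first = Consumer(topic)
second = Consumer(topic)
batch_one = first.poll()   # reads everything published so far
batch_two = first.poll()   # nothing new: first's offsets already advanced
replay = second.poll()     # independent offsets: second reads from the start
```

Because the topic never deletes on read, a second consumer can replay the same records without affecting the first.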

What Can You Do With Apache Kafka?
• Web site activity: track page views, searches, etc. in real time
• Events & log aggregation: particularly in distributed systems where messages
come from multiple sources
• Monitoring and metrics: aggregate statistics from distributed applications and
build a dashboard application
• Stream processing: process raw data, clean it up, and forward it on to another
topic or messaging system
• Real-time data ingestion: fast processing of a very large volume of messages

KAFKA TERMINOLOGY
• Kafka is a publish/subscribe messaging system comprised of the
following components:
– Topic: a message feed
– Producer: a process that publishes messages to a topic
– Consumer: a process that subscribes to a topic and processes its messages
– Broker: a server in a Kafka cluster

Apache Kafka
[Diagram: three Source Systems publish to Kafka; Fraud Detection, Security Systems, and Real-Time Monitoring subscribe (Many-To-Many Publish-Subscribe)]

[Diagram: edge-to-analytics pipeline, by stage]
DATA COLLECTION AT THE EDGE (MiNiFi): C++ agent on each of US-West Fleet, US-Central Fleet, US-East Fleet
INGEST GATEWAY POWERED BY KAFKA (Apache Kafka): topics gateway-west-raw-sensors, gateway-central-raw-sensors, gateway-east-raw-sensors
DATA FLOW APPS POWERED BY NIFI (Apache NiFi)
DATA SYNDICATE SERVICES (Apache Kafka): KAFKA CLUSTER GEO 1 and KAFKA CLUSTER GEO 2, each with Kafka Topics syndicate-transmission, syndicate-temp, syndicate-speed, syndicate-geo; Replication / Data Deployment
STREAMING ANALYTICS APPS (Apache Flink): Micro Batch Analytics, Stream Analytics App, Micro Services; Complex Low Latent; Apache Flink, Structured Streaming

APACHE FLINK

Flink SQL
https://www.datainmotion.dev/2021/04/cloudera-sql-stream-builder-ssb-updated.html
● Streaming Analytics
● Continuous SQL
● Continuous ETL
● Complex Event Processing
● Standard SQL Powered by Apache Calcite

CONTINUOUS SQL
● SSB is a Continuous SQL engine
● It’s SQL with a slightly different mental model, and that difference has big implications
Traditional Parse/Execute/Fetch model Continuous SQL Model
Hint: The query is boundless and never finishes, and time matters
AKA: SELECT * FROM foo WHERE 1=0 -- will run forever
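The continuous model can be sketched in plain Python with generators. This is an analogy under stated assumptions, not how SSB or Flink executes SQL: `sensor_stream` and `continuous_query` are hypothetical names, and the readings are synthetic. The point is that the query has no natural end, so the caller decides how much of the result to take.

```python
import itertools


def sensor_stream():
    """Unbounded source: yields synthetic readings forever."""
    for n in itertools.count():
        yield {"id": n, "temp": 60 + (n * 7) % 50}


def continuous_query(stream, predicate):
    """Rough analogue of SELECT * FROM stream WHERE predicate.

    It never finishes on its own; it emits a row whenever one matches.
    """
    for row in stream:
        if predicate(row):
            yield row


hot = continuous_query(sensor_stream(), lambda row: row["temp"] > 100)

# The result set is endless, so we take a bounded slice of it.
first_three = list(itertools.islice(hot, 3))
```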

Flink SQL
-- specify Kafka partition key on output
SELECT foo AS _eventKey FROM sensors
-- use event time timestamp from kafka
-- exactly once compatible
SELECT eventTimestamp FROM sensors
-- nested structures access
SELECT foo.`bar` FROM table; -- must quote nested column
-- timestamps
SELECT * FROM payments
WHERE eventTimestamp > CURRENT_TIMESTAMP - interval '10' second;
-- unnest
SELECT b.*, u.*
FROM bgp_avro b,
UNNEST(b.path) AS u(pathitem)
-- aggregations and windows
SELECT card,
MAX(amount) as theamount,
TUMBLE_END(eventTimestamp, interval '5' minute) as ts
FROM payments
WHERE lat IS NOT NULL
AND lon IS NOT NULL
GROUP BY card,
TUMBLE(eventTimestamp, interval '5' minute)
HAVING COUNT(*) > 4 -- >4==fraud
-- try to do this ksql!
SELECT us_west.user_score+ap_south.user_score
FROM kafka_in_zone_us_west us_west
FULL OUTER JOIN kafka_in_zone_ap_south ap_south
ON us_west.user_id = ap_south.user_id;
Key Takeaway: Rich SQL grammar with advanced time and aggregation tools
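To make the tumbling-window aggregation in the fraud query concrete, here is a minimal Python sketch of the same grouping logic. It is a simplification under stated assumptions: it processes a finite list in one pass, whereas Flink evaluates windows incrementally over an unbounded stream; timestamps are assumed to be epoch seconds; and the function names and sample payments are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # interval '5' minute


def tumble_end(event_time):
    """Exclusive end of the 5-minute tumbling window containing event_time."""
    return (event_time // WINDOW_SECONDS + 1) * WINDOW_SECONDS


def suspicious_cards(payments, min_count=5):
    """Group by (card, window) and keep groups with at least min_count payments."""
    groups = defaultdict(list)
    for p in payments:
        if p["lat"] is None or p["lon"] is None:
            continue  # WHERE lat IS NOT NULL AND lon IS NOT NULL
        groups[(p["card"], tumble_end(p["ts"]))].append(p["amount"])
    rows = [
        {"card": card, "theamount": max(amounts), "ts": end}
        for (card, end), amounts in groups.items()
        if len(amounts) >= min_count  # HAVING COUNT(*) > 4
    ]
    return sorted(rows, key=lambda r: (r["card"], r["ts"]))


payments = [
    {"card": "A", "amount": amount, "ts": ts, "lat": 40.3, "lon": -74.6}
    for amount, ts in [(10, 0), (20, 60), (30, 120), (40, 180), (50, 240), (99, 300)]
] + [
    {"card": "B", "amount": 500, "ts": 30, "lat": None, "lon": None},
    {"card": "B", "amount": 5, "ts": 90, "lat": 40.3, "lon": -74.6},
]
flagged = suspicious_cards(payments)
```

Card A has five payments in the first window, so it is flagged; its sixth payment falls into the next window and does not count toward the first.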

CLOUDERA SQL STREAM BUILDER
Making Streaming Analytics accessible to everyone with SQL
Application Developer
● Develop & test SQL queries with a
powerful UI
● Expose streaming data to
applications through materialized
views
● Single button “Push to
production” turns SQL queries into
a Flink application
Business Analyst
● Explore Streaming Data using SQL
without learning new skills
● Build new real-time business
reporting applications

SQL STREAM BUILDER (SSB)
SQL STREAM BUILDER allows
developers, analysts, and data
scientists to write streaming
applications with industry
standard SQL.
No Java or Scala code
development required.
Simplifies access to data in Kafka
& Flink. Connectors to batch data in
HDFS, Kudu, Hive, S3, JDBC, CDC
and more
Enrich streaming data with batch
data in a single tool
Democratize access to real-time data with just SQL

SCHEMA
● AVRO - Schema Registry
● JSON - Schema Auto-detect
● Virtual Table design pattern
● Kafka Data Source
auto-created in SSB
{
"fields": [
{
"doc": "Type inferred from '215'",
"name": "userid",
"type": "long"
},
{
"doc": "Type inferred from '94204'",
"name": "amount",
"type": "long"
}
],
"name": "inferredSchema",
"type": "record"
}
Key Takeaway: Integrated with schema registry, also auto-detection for JSON types.
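The shape of the inferred schema above can be reproduced with a few lines of Python. This is a sketch of the idea only, not SSB's detection logic: the type mapping is a simplifying assumption (nulls, arrays, and nested objects are not handled), and `infer_type` and `infer_schema` are hypothetical names.

```python
import json


def infer_type(value):
    """Map a JSON value to an Avro primitive type name (simplified assumption)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"


def infer_schema(sample_json, name="inferredSchema"):
    """Build an Avro-style record schema from one sample JSON message."""
    record = json.loads(sample_json)
    fields = [
        {"doc": f"Type inferred from '{value}'", "name": key, "type": infer_type(value)}
        for key, value in record.items()
    ]
    return {"fields": fields, "name": name, "type": "record"}


schema = infer_schema('{"userid": 215, "amount": 94204}')
```

A single sample can mislead (for example, an amount that is sometimes fractional), which is one reason to prefer a registered Avro schema when one exists.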

SSB MATERIALIZED VIEWS
Key Takeaway: MVs allow data scientists, analysts, and developers to consume data from the firehose

Streaming ETL Data Pipeline Made Simple with SQL StreamBuilder
Write Streaming
Result to Kudu
Join 2 Streaming
User Event Topics
Enrich Stream from
Warehouse HR Table
Enrich Stream from RT
Mart Timesheet Table
Filter &
Transform

Streaming Data Lineage with SDX
DATA GOVERNANCE FOR THE ENTIRE STREAMING PIPELINE
• Track Consumer, Producer, Topics
and Consumer Group Lineage
• No changes required to
Consumers or Producers
• End-To-End lineage from consumer
to producer

SSB Projects - Container Structure for All Assets of SQL Streaming Job
SDLC for Streaming SQL Applications With First Class Git Integration
Project in SSB
SSB Project provides the container structure
for all the assets for your streaming app.
Project is configured with a git repository
SSB allows you to
push/import projects
to/from Git
Project Represented In Git
The streaming application assets in
git within the project structure

SDLC Life Cycle with SSB Projects
Create SSB Project &
Configure Git Repo
Step 1
Run Service Discovery to
register Kafka, Hive, etc
Step 2
Create/Develop
Streaming Assets & Test
Step 3
Check-in Project
Into Git
Step 4
Import Project from Git into SSB
Prod, Setup Monitoring & Deploy
Step 5

Moving Beyond Draining of Streams Into Lakes: Analytics-in-Stream
[Diagram labels]
Data Sources
Data Distribution Service (Cloudera DataFlow)
Streaming Storage Substrate (Cloudera Stream Processing): Kafka + NiFi enables real-time ingestion into lakes / analytics services
Data-At-Rest Analytics: Warehouses & Operational DB, Data Lakes & Lake Houses
Data-In-Motion Streaming Analytics (Cloudera Stream Processing): Kafka + Flink enables streaming analytics
Low Latency Data Products: Data Apps Powered by Streaming Insights and used by other Analytics Services

DATAFLOW
APACHE NIFI

Cloudera DataFlow: Universal Data Distribution Service
UNIVERSAL DATA DISTRIBUTION WITH CLOUDERA DATAFLOW (CDF)
Connect to Any Data Source Anywhere then Process and Deliver to Any Destination
[Diagram labels]
Sources: Data born in the cloud, Data born outside the cloud
Ingest: Active (Connectors: Connect & Pull), Passive (Gateway Endpoint: Send)
Process: Route, Filter, Enrich, Transform
Distribute: Connectors, Deliver, Any destination

CLOUDERA DATAFLOW - POWERED BY APACHE NiFi
Ingest and manage data from edge-to-cloud using a no-code interface
● #1 data ingestion/movement engine
● Strong community
● Product maturity over 11 years
● Deploy on-premises or in the cloud
● 400+ pre-built processors
● Built-in data provenance
● Guaranteed delivery
● Throttling and Backpressure

CLOUDERA FLOW AND EDGE MANAGEMENT
Enable easy ingestion, routing, management and delivery of any data anywhere (Edge, cloud, data
center) to any downstream system with built in end-to-end security and provenance
Advanced tooling to industrialize
flow development (Flow Development
Life Cycle)
ACQUIRE
• Over 300 Prebuilt Processors
• Easy to build your own
• Parse, Enrich & Apply Schema
• Filter, Split, Merge & Route
• Throttle & Backpressure
FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG
PROCESS
HASH, MERGE, EXTRACT, DUPLICATE, SPLIT, ENCRYPT, TAIL, EVALUATE, EXECUTE, GEOENRICH, SCAN, REPLACE, TRANSLATE, CONVERT, ROUTE TEXT, ROUTE CONTENT, ROUTE CONTEXT, ROUTE RATE, DISTRIBUTE LOAD
DELIVER
• Guaranteed Delivery
• Full data provenance from
acquisition to delivery
• Diverse, Non-Traditional Sources
• Eco-system integration
FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG

Processing one million events per second with Apache NiFi
https://blog.cloudera.com/benchmarking-nifi-performance-and-scalability/

PROVENANCE

EXTENSIBILITY
• Built from the ground up with extensions in mind
• Service-loader pattern for…
– Processors
– Controller Services
– Reporting Tasks
– Prioritizers
• Extensions packaged as NiFi Archives (NARs)
– Deploy to the NiFi lib directory and restart
– Same model as standard components

NiFi Load Balancing
• Improve NiFi cluster throughput
• Defined at connection level
• Configurable balancing
strategies
• Critical for scale up paradigm in
Kubernetes
• Alleviates S2S balancing “hack”
customers use

QUEUE CONFIGURATION
• FlowFile Expiration - Data that cannot be processed in a timely
fashion can be automatically removed from the flow.
• Back Pressure Thresholds - Thresholds indicate how much data
should be allowed to exist in the queue before the component
that is the source of the Connection is no longer scheduled to
run. This allows the system to avoid being overrun with data.
• Load Balance Strategy – Strategy to distribute the data in a flow
across the nodes in the cluster. When enabled, compression can
be configured on FlowFile contents and attributes.
• Prioritization – Determines the order in which flow files are
processed.
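Three of the queue settings above (expiration, back pressure, prioritization) can be modeled with a small priority queue. This is a simplified sketch, not NiFi's queue: the `Connection` class, its count-based threshold, and the explicit `now` argument are assumptions for illustration, and load balancing is left out.

```python
import heapq
import itertools


class Connection:
    """Toy connection queue with a back pressure threshold and expiration."""

    def __init__(self, back_pressure_count=3, expiration_seconds=60):
        self.back_pressure_count = back_pressure_count
        self.expiration_seconds = expiration_seconds
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order per priority

    def accepting(self):
        """False once the threshold is reached: upstream should not be scheduled."""
        return len(self._heap) < self.back_pressure_count

    def offer(self, flowfile, priority, now):
        if not self.accepting():
            return False  # back pressure applied
        heapq.heappush(self._heap, (priority, next(self._order), now, flowfile))
        return True

    def poll(self, now):
        """Return the highest-priority unexpired FlowFile, dropping expired ones."""
        while self._heap:
            _, _, enqueued_at, flowfile = heapq.heappop(self._heap)
            if now - enqueued_at <= self.expiration_seconds:
                return flowfile
        return None


conn = Connection(back_pressure_count=2, expiration_seconds=10)
accepted_a = conn.offer("a", priority=5, now=0)
accepted_b = conn.offer("b", priority=1, now=0)
accepted_c = conn.offer("c", priority=1, now=0)  # queue is full
first_out = conn.poll(now=5)                      # lower number = higher priority
accepted_d = conn.offer("d", priority=0, now=8)
second_out = conn.poll(now=12)
third_out = conn.poll(now=12)                     # "a" has expired by now
```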

RECORD-ORIENTED DATA WITH NIFI
• Record Readers - Avro, CSV, Grok, IPFIX, JASN1, JSON, Parquet,
Scripted, Syslog5424, Syslog, WindowsEvent, XML
• Record Writers - Avro, CSV, FreeFormText, JSON, Parquet,
Scripted, XML
• Record Reader and Writer support referencing a schema registry
for retrieving schemas when necessary.
• Enable processors that accept any data format without having to
worry about the parsing and serialization logic.
• Allows us to keep FlowFiles larger, each consisting of multiple
records, which results in far better performance.
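The reader/writer split can be sketched in Python: the conversion step only sees records, so parsing and serialization are swappable. This is an illustration of the pattern, not NiFi's Record API; the function names are hypothetical and the schema here is simply taken from the CSV header.

```python
import csv
import io
import json


def read_csv_records(content):
    """Reader: parse CSV text into a list of records (dicts)."""
    return list(csv.DictReader(io.StringIO(content)))


def write_json_records(records):
    """Writer: serialize a list of records as a JSON array."""
    return json.dumps(records)


def convert(content, reader, writer):
    """Format-agnostic step: it never touches parsing or serialization details."""
    return writer(reader(content))


# One FlowFile carrying several records, rather than one FlowFile per record.
flowfile_content = "id,temp\n1,70\n2,85\n"
converted = convert(flowfile_content, read_csv_records, write_json_records)
```

Swapping in a different reader or writer changes the format without changing `convert`, which mirrors how a record-oriented processor is configured rather than rewritten.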

RUNNING SQL ON FLOWFILES
• Evaluates one or more SQL queries against the contents of a
FlowFile.
• This can be used, for example, for field-specific filtering,
transformation, and row-level filtering.
• Columns can be renamed, simple calculations and aggregations
performed.
• The SQL statement must be valid ANSI SQL and is powered by
Apache Calcite.
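To show the kind of query this enables, the sketch below runs SQL over the records of a single FlowFile using Python's built-in `sqlite3` as a stand-in engine. That substitution is an assumption for illustration: NiFi uses Apache Calcite, not SQLite, and dialect details differ. The table name `FLOWFILE`, the `query_records` helper, and the sample records are likewise chosen for the example.

```python
import sqlite3


def query_records(records, sql):
    """Load records into an in-memory table named FLOWFILE and run sql on it."""
    columns = list(records[0].keys())
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(f"CREATE TABLE FLOWFILE ({', '.join(columns)})")
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(
            f"INSERT INTO FLOWFILE VALUES ({placeholders})",
            [tuple(record[c] for c in columns) for record in records],
        )
        cursor = conn.execute(sql)
        names = [d[0] for d in cursor.description]
        return [dict(zip(names, row)) for row in cursor.fetchall()]
    finally:
        conn.close()


records = [
    {"card": "A", "amount": 120},
    {"card": "B", "amount": 15},
    {"card": "A", "amount": 80},
]

# Row-level filtering plus a renamed column.
big = query_records(
    records,
    "SELECT card, amount AS usd FROM FLOWFILE WHERE amount > 50 ORDER BY amount DESC",
)

# A simple aggregation.
totals = query_records(
    records,
    "SELECT card, SUM(amount) AS total FROM FLOWFILE GROUP BY card ORDER BY card",
)
```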

Apache NiFi with Python Custom Processors
Python as a 1st class citizen

READYFLOW
GALLERY
• Cloudera provided flow
definitions
• Cover most common data flow
use cases
• Optimized to work with CDP
sources/destinations
• Can be deployed and adjusted
as needed

FLOW CATALOG
• Central repository for flow
definitions
• Import existing NiFi flows
• Manage flow definitions
• Initiate flow deployments

DEPLOYMENT
WIZARD
• Turns flow definitions into flow
deployments
• Guides users through providing
required configuration
• Choose NiFi runtime version
• Pick from pre-defined NiFi node sizes
• Define KPIs for the deployment
Start Deployment Wizard Provide Parameters
Configure Sizing & Scaling Define KPIs

KEY
PERFORMANCE
INDICATORS
• Visibility into flow deployments
• Track high level flow
performance
• Track in-depth NiFi component
metrics
• Defined in Deployment Wizard
• Monitoring & Alerts in
Deployment Details
KPI Definition in Deployment Wizard KPI Monitoring

DASHBOARD
• Central Monitoring View
• Monitors flow deployments
across CDP environments
• Monitors flow deployment
health & performance
• Drill into flow deployment to
monitor system metrics and
deployment events

DEPLOYMENT
MANAGER
• Manage flow deployment
lifecycle
(Suspend/Start/Terminate)
• Add/Edit KPIs
• Change sizing configuration
• Update parameters
• Change NiFi version of the
deployment
• Gateway to NiFi canvas

NIFI VERSION
UPGRADES
• Pick up NiFi hotfixes easily
• Upgrade (or downgrade) the
hotfix version of existing
deployments
• Rolling upgrade (if the
deployment has >1 NiFi nodes)

BEST PRACTICES

STREAMING TECH DEBT TIPS
• Version Control All Assets
• Managed Public Cloud like Cloudera
• Use DevOps and APIs
• Latest Java and Python
• Stream Sizing (NiFi, Kafka, Flink)

Streaming
Solutions
When to use what?
Routing vs Analytics
Listeners
Joins
In-Memory
Operational Load
Current Skills
Use NiFi
Doing more than just Syndication
Not just small Kafka sized events
Edge Management is needed
Listener Type use cases that bind to ports
Lightweight ETL, Lineage, Provenance, Message Replay
Use Flink
Joining Streams
Windowing
Late Data Handling
Streaming Analytics
Use KConnect
Kafka Centric
In-Memory Stateless

RESOURCES AND WRAP-UP

Resources

Upcoming Events
April 26
May 10
May 9
