© 2023 Cloudera, Inc. All rights reserved.
Streaming Data Pipeline Development
Tim Spann
Principal Developer Advocate
25-April-2023
https://attend.cloudera.com/nificommitters0503
FLaNK Stack
Tim Spann
@PaasDev // Blog: www.datainmotion.dev
Principal Developer Advocate.
Princeton Future of Data Meetup.
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC
https://medium.com/@tspann
https://github.com/tspannhw
Apache NiFi x Apache Kafka x Apache Flink x Java
FLiP Stack Weekly
This week in Apache NiFi, Apache Flink, Apache
Kafka, Apache Spark, Apache Iceberg, Python,
Java and Open Source friends.
https://bit.ly/32dAJft
Future of Data - Princeton + Virtual
@PaasDev
https://www.meetup.com/futureofdata-princeton/
From Big Data to AI to Streaming to Containers to
Cloud to Analytics to Cloud Storage to Fast Data to
Machine Learning to Microservices to ...
FREE LEARNING ENVIRONMENT
CSP Community Edition
• Kafka, KConnect, SMM, SR, Flink, and SSB in Docker
• Runs in Docker
• Try new features quickly
• Develop applications locally
● Docker compose file of CSP to run from command line w/o any
dependencies, including Flink, SQL Stream Builder, Kafka, Kafka
Connect, Streams Messaging Manager and Schema Registry
○ $> docker compose up
● Licensed under the Cloudera Community License
● Unsupported
● Community Group Hub for CSP
● Find it on docs.cloudera.com under Applications
STREAMING
WHAT IS REAL-TIME?
ENABLING ANALYTICS AND INSIGHTS ANYWHERE
Driving enterprise business value
Diagram labels:
• STREAMING DATA SOURCES: Clickstream, Market data, Machine logs, Social
• ENTERPRISE DATA SOURCES: CRM, Customer history, Research, Compliance Data, Risk Data, Lending
• Data flows: Stream Ingest; Ingest – Data at Rest; Deploy Models
• STORAGE & PROCESSING: REAL-TIME STREAMING ENGINE, CENTRALIZED DATA PLATFORM
• ANALYTICS & INSIGHTS: ANALYTICS & DATA WAREHOUSE, DATA SCIENCE/MACHINE LEARNING (Model Building, Model Training, Model Scoring)
• Outputs: BI Solutions, SQL, Predictive Analytics, Actions & Alerts, Real-Time Apps
STREAMING FROM … TO … WHILE …
Data distribution as a first class citizen
Diagram labels:
• Sources: IoT Devices, LOG DATA SOURCES, ON-PREM DATA SOURCES, App Logs, Laptops/Servers, Mobile Apps, Security Agents
• UNIVERSAL DATA DISTRIBUTION (Ingest, Transform, Deliver): Ingest Processors; Ingest Gateway; Router, Filter & Transform Processors; Destination Processors
• Destinations: BIG DATA CLOUD SERVICES, CLOUD BUSINESS PROCESS SERVICES *, CLOUD DATA* ANALYTICS/SERVICE (Cloudera DW), CLOUD WAREHOUSE
EVENT-DRIVEN ORGANIZATION
Modernize your data and applications
CDF Event Streaming Platform: Integration - Processing - Management - Cloud
Diagram labels: Stream ETL, Cloud Storage, Application, Data Lake, Data Stores, Make Payment, µServices, Streams, Edge - IoT, Dashboard
BUILDING REAL-TIME REQUIRES A TEAM
APACHE KAFKA
I Can Haz Data?
Yes, Franz, It’s Kafka
Let’s do a metamorphosis on your data. Don’t fear changing data.
You don’t need to be a brilliant writer to stream
data.
Franz Kafka was a German-speaking
Bohemian novelist and short-story writer,
widely regarded as one of the major figures of
20th-century literature. His work fuses
elements of realism and the fantastic.
Wikipedia
STREAMS MESSAGING WITH KAFKA
• Highly reliable distributed messaging system.
• Decouples applications and enables many-to-many
patterns.
• Publish-Subscribe semantics.
• Horizontal scalability.
• Efficient implementation to operate at speed with
big data volumes.
• Organized by topic to support several use cases.
What is Apache Kafka?
– Distributed: horizontally scalable
– Partitioned: the data is split-up and distributed across the brokers
– Replicated: allows for automatic failover
– Unique: Kafka does not track the consumption of messages (the consumers
do)
– Fast: designed from the ground up with a focus on performance and
throughput
– Kafka was built at LinkedIn in 2011
– Open sourced as an Apache project
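The partitioning and consumer-side tracking described above can be illustrated with a toy in-memory model. This is only a sketch: `ToyTopic`, `ToyConsumer`, and the CRC32-based partition choice are hypothetical stand-ins for illustration, not Kafka's client API or its actual partitioner.

```python
import zlib
from collections import defaultdict


class ToyTopic:
    """A topic split into a fixed number of append-only partitions."""

    def __init__(self, name, num_partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stable hash (an assumption for this sketch) so the same key
        # always lands in the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)


class ToyConsumer:
    """Tracks its own read position; the topic keeps no per-consumer state."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)  # partition -> next offset to read

    def poll(self):
        records = []
        for p, log in enumerate(self.topic.partitions):
            start = self.offsets[p]
            records.extend(log[start:])
            self.offsets[p] = len(log)
        return records


topic = ToyTopic("sensors")
for i in range(6):
    topic.append(f"device-{i % 2}", {"reading": i})

fast, slow = ToyConsumer(topic), ToyConsumer(topic)
first_batch = fast.poll()    # reads all six records
second_batch = fast.poll()   # nothing new, this consumer's offsets advanced
replay = slow.poll()         # an independent consumer still sees everything
```

Because the read position lives in the consumer, a second consumer can read the same log from the beginning without affecting the first.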
What Can You Do With Apache Kafka?
• Web site activity: track page views, searches, etc. in real time
• Events & log aggregation: particularly in distributed systems where messages
come from multiple sources
• Monitoring and metrics: aggregate statistics from distributed applications and
build a dashboard application
• Stream processing: process raw data, clean it up, and forward it on to another
topic or messaging system
• Real-time data ingestion: fast processing of a very large volume of messages
KAFKA TERMINOLOGY
• Kafka is a publish/subscribe messaging system comprised of the
following components:
– Topic: a message feed
– Producer: a process that publishes messages to a topic
– Consumer: a process that subscribes to a topic and processes its messages
– Broker: a server in a Kafka cluster
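A minimal sketch of how these four terms relate, using plain Python objects rather than a real Kafka client. The class names, the `payments` topic, and the source names are all hypothetical and exist only for illustration.

```python
from collections import defaultdict


class ToyBroker:
    """A single server holding named topics (message feeds)."""

    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def fetch(self, topic, offset):
        return self.topics[topic][offset:]


class ToyProducer:
    """A process that publishes messages to a topic."""

    def __init__(self, broker, source):
        self.broker = broker
        self.source = source

    def send(self, topic, payload):
        self.broker.publish(topic, {"source": self.source, "payload": payload})


class ToySubscriber:
    """A process that subscribes to a topic and pulls its messages."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.position = 0

    def poll(self):
        batch = self.broker.fetch(self.topic, self.position)
        self.position += len(batch)
        return batch


broker = ToyBroker()
producers = [ToyProducer(broker, name) for name in ("web", "mobile", "pos")]
fraud = ToySubscriber(broker, "payments")
monitoring = ToySubscriber(broker, "payments")

for amount, producer in zip((10, 20, 30), producers):
    producer.send("payments", amount)

fraud_batch = fraud.poll()
monitoring_batch = monitoring.poll()
```

Three producers and two subscribers share one topic without knowing about each other, which is the many-to-many decoupling the deck refers to.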
Apache Kafka
Diagram: Source System (x3) → Kafka → Fraud Detection, Security Systems, Real-Time Monitoring (Many-To-Many Publish-Subscribe)
Apache Kafka
Diagram labels (MiNiFi, Apache Kafka, Apache NiFi, Apache Kafka, Apache Flink):
• DATA COLLECTION AT THE EDGE: C++ agents for the US-West, US-Central, and US-East fleets
• INGEST GATEWAY POWERED BY KAFKA: topics gateway-west-raw-sensors, gateway-central-raw-sensors, gateway-east-raw-sensors
• DATA FLOW APPS POWERED BY NIFI
• DATA SYNDICATE SERVICES: KAFKA CLUSTER GEO 1 and KAFKA CLUSTER GEO 2, each with Kafka topics syndicate-transmission, syndicate-temp, syndicate-speed, syndicate-geo; Replication / Data Deployment
• STREAMING ANALYTICS APPS (Apache Flink, Structured Streaming): Micro Batch Analytics Stream Analytics App, Micro Services Stream Analytics App, Complex Low-Latency Stream Analytics App
APACHE FLINK
Flink SQL
https://www.datainmotion.dev/2021/04/cloudera-sql-stream-builder-ssb-updated.html
● Streaming Analytics
● Continuous SQL
● Continuous ETL
● Complex Event Processing
● Standard SQL Powered by Apache Calcite
CONTINUOUS SQL
● SSB is a Continuous SQL engine
● It’s SQL, with a slightly different mental model that has big implications
Traditional Parse/Execute/Fetch model vs. Continuous SQL model
Hint: The query is boundless and never finishes, and time matters
AKA: SELECT * FROM foo WHERE 1=0 -- will run forever
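The "never finishes" idea can be sketched with Python generators. This is an analogy under stated assumptions, not how SSB or Flink executes a query: `sensor_stream` and `continuous_select` are hypothetical names, and the source is synthetic.

```python
import itertools


def sensor_stream():
    """Unbounded source: yields one reading after another, forever."""
    for i in itertools.count():
        yield {"sensor_id": i % 3, "temp": 20 + (i % 7)}


def continuous_select(stream, predicate):
    """Emit matching rows as they arrive; never returns on an unbounded source."""
    for row in stream:
        if predicate(row):
            yield row


query = continuous_select(sensor_stream(), lambda r: r["temp"] > 24)

# The caller decides how much of the endless result to take.
first_three = list(itertools.islice(query, 3))
```

A predicate that never matches, like `WHERE 1=0`, would keep scanning the stream without ever yielding a row, which is why such a query runs forever instead of returning an empty result.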
Flink SQL
-- specify Kafka partition key on output
SELECT foo AS _eventKey FROM sensors
-- use event time timestamp from kafka
-- exactly once compatible
SELECT eventTimestamp FROM sensors
-- nested structures access
SELECT foo.`bar` FROM table; -- must quote nested column
-- timestamps
SELECT * FROM payments
WHERE eventTimestamp > CURRENT_TIMESTAMP - INTERVAL '10' SECOND;
-- unnest
SELECT b.*, u.*
FROM bgp_avro b,
     UNNEST(b.path) AS u(pathitem)
-- aggregations and windows
SELECT card,
       MAX(amount) AS theamount,
       TUMBLE_END(eventTimestamp, INTERVAL '5' MINUTE) AS ts
FROM payments
WHERE lat IS NOT NULL
  AND lon IS NOT NULL
GROUP BY card,
         TUMBLE(eventTimestamp, INTERVAL '5' MINUTE)
HAVING COUNT(*) > 4 -- >4==fraud
-- try to do this ksql!
SELECT us_west.user_score + ap_south.user_score
FROM kafka_in_zone_us_west us_west
FULL OUTER JOIN kafka_in_zone_ap_south ap_south
ON us_west.user_id = ap_south.user_id;
Key Takeaway: Rich SQL grammar with advanced time and aggregation tools
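To make the tumbling-window query above concrete, here is a rough batch re-implementation of its logic over a finite list. It is a sketch, not Flink: timestamps are assumed to be epoch seconds, the sample payments are invented, and a real streaming engine would emit results as windows close rather than after reading all input.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60


def tumble_start(ts, size=WINDOW_SECONDS):
    """Align an epoch-second timestamp to the start of its tumbling window."""
    return ts - (ts % size)


def suspicious_cards(payments, threshold=4):
    """Per (card, window): MAX(amount), kept only when COUNT(*) > threshold."""
    groups = defaultdict(list)
    for p in payments:
        if p.get("lat") is None or p.get("lon") is None:
            continue  # WHERE lat IS NOT NULL AND lon IS NOT NULL
        key = (p["card"], tumble_start(p["eventTimestamp"]))
        groups[key].append(p["amount"])
    return [
        # ts mirrors TUMBLE_END: the exclusive upper bound of the window.
        {"card": card, "theamount": max(amounts), "ts": start + WINDOW_SECONDS}
        for (card, start), amounts in sorted(groups.items())
        if len(amounts) > threshold
    ]


def payment(card, amount, ts, lat=1.0, lon=2.0):
    return {"card": card, "amount": amount, "eventTimestamp": ts, "lat": lat, "lon": lon}


payments = (
    [payment("A", a, t) for a, t in [(10, 0), (50, 60), (20, 120), (5, 180), (7, 299)]]
    + [payment("B", 900, 30)]             # only one payment in its window
    + [payment("A", 999, 300)]            # falls into the next window
    + [payment("A", 5000, 10, lat=None)]  # filtered out by the WHERE clause
)
flagged = suspicious_cards(payments)
```

Only card A in the first five-minute window has more than four payments, so it is the only group that survives the `HAVING` filter.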
CLOUDERA SQL STREAM BUILDER
Making Streaming Analytics accessible to everyone with SQL
Application Developer
● Develop & test SQL queries with a
powerful UI
● Expose streaming data to
applications through materialized
views
● Single button “Push to
production” turns SQL queries into
a Flink application
Business Analyst
● Explore Streaming Data using SQL
without learning new skills
● Build new real-time business
reporting applications
SQL STREAM BUILDER (SSB)
SQL STREAM BUILDER allows
developers, analysts, and data
scientists to write streaming
applications with industry
standard SQL.
No Java or Scala code
development required.
Simplifies access to data in Kafka
& Flink. Connectors to batch data in
HDFS, Kudu, Hive, S3, JDBC, CDC
and more
Enrich streaming data with batch
data in a single tool
Democratize access to real-time data with just SQL
SCHEMA
● AVRO - Schema Registry
● JSON - Schema Auto-detect
● Virtual Table design pattern
● Kafka Data Source
auto-created in SSB
{
"fields": [
{
"doc": "Type inferred from '215'",
"name": "userid",
"type": "long"
},
{
"doc": "Type inferred from '94204'",
"name": "amount",
"type": "long"
}
],
"name": "inferredSchema",
"type": "record"
}
Key Takeaway: Integrated with schema registry, also auto-detection for JSON types.
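JSON auto-detection of this kind can be approximated in a few lines. The sketch below is an assumption about the general technique, not SSB's implementation: `infer_type` and `infer_schema` are hypothetical helpers, and nested objects or arrays are simply treated as strings.

```python
import json


def infer_type(value):
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if value is None:
        return "null"
    return "string"  # simplification: nested values are not modeled


def infer_schema(sample_json, name="inferredSchema"):
    """Build an Avro-style record schema from one sample JSON message."""
    record = json.loads(sample_json)
    fields = [
        {
            "doc": f"Type inferred from '{value}'",
            "name": key,
            "type": infer_type(value),
        }
        for key, value in record.items()
    ]
    return {"fields": fields, "name": name, "type": "record"}


schema = infer_schema('{"userid": 215, "amount": 94204}')
```

For the sample message used here, the result has the same shape as the inferred schema shown on the slide.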
SSB MATERIALIZED VIEWS
Key Takeaway: MVs allow data scientists, analysts, and developers to consume data from the firehose
Streaming ETL Data Pipeline Made Simple with SQL StreamBuilder
Write Streaming
Result to Kudu
Join 2 Streaming
User Event Topics
Enrich Stream from
Warehouse HR Table
Enrich Stream from RT
Mart Timesheet Table
Filter &
Transform
Streaming Data Lineage with SDX
DATA GOVERNANCE FOR THE ENTIRE STREAMING PIPELINE
• Track Consumer, Producer, Topics
and Consumer Group Lineage
• No changes required to
Consumers or Producers
• End-To-End lineage from consumer
to producer
SSB Projects - Container Structure for All Assets of SQL Streaming Job
SDLC for Streaming SQL Applications With First Class Git Integration
Project in SSB
SSB Project provides the container structure
for all the assets for your streaming app.
Project is configured with a git repository
SSB allows you to
push/import projects
to/from Git
Project Represented In Git
The streaming application assets in
git within the project structure
SDLC with SSB Projects
Step 1: Create SSB Project & Configure Git Repo
Step 2: Run Service Discovery to register Kafka, Hive, etc.
Step 3: Create/Develop Streaming Assets & Test
Step 4: Check-in Project Into Git
Step 5: Import Project from Git into SSB Prod, Set Up Monitoring & Deploy
Moving Beyond Draining of Streams Into Lakes: Analytics-in-Stream
Diagram labels:
• Data Sources
• Data Distribution Service (Cloudera DataFlow)
• Streaming Storage Substrate (Cloudera Stream Processing): Kafka + NiFi enables real-time ingestion into lakes / analytics services
• Data-At-Rest Analytics: Warehouses & Operational DB, Data Lakes & Lake Houses
• Data-In-Motion Streaming Analytics (Cloudera Stream Processing): Kafka + Flink enables streaming analytics; Streaming Analytics; Low Latency Data Products
• Data Apps Powered by Streaming Insights and used by other Analytics Services
DATAFLOW
APACHE NIFI
Cloudera DataFlow: Universal Data Distribution Service
UNIVERSAL DATA DISTRIBUTION WITH CLOUDERA DATAFLOW (CDF)
Connect to Any Data Source Anywhere then Process and Deliver to Any Destination
Diagram labels:
• Sources: Data born in the cloud; Data born outside the cloud
• Ingest: Active (Connectors, Connect & Pull); Passive (Gateway Endpoint, Send)
• Process: Route, Filter, Enrich, Transform
• Distribute: Connectors
• Deliver: Any destination
CLOUDERA DATAFLOW - POWERED BY APACHE NiFi
Ingest and manage data from edge-to-cloud using a no-code interface
● #1 data ingestion/movement engine
● Strong community
● Product maturity over 11 years
● Deploy on-premises or in the cloud
● 400+ pre-built processors
● Built-in data provenance
● Guaranteed delivery
● Throttling and Backpressure
CLOUDERA FLOW AND EDGE MANAGEMENT
Enable easy ingestion, routing, management and delivery of any data anywhere (Edge, cloud, data
center) to any downstream system with built in end-to-end security and provenance
Advanced tooling to industrialize
flow development (Flow Development
Life Cycle)
ACQUIRE
• Over 300 Prebuilt Processors
• Easy to build your own
• Parse, Enrich & Apply Schema
• Filter, Split, Merge & Route
• Throttle & Backpressure
FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG
PROCESS
HASH, MERGE, EXTRACT, DUPLICATE, SPLIT, ENCRYPT, TAIL, EVALUATE, EXECUTE, GEOENRICH, SCAN, REPLACE, TRANSLATE, CONVERT, ROUTE TEXT, ROUTE CONTENT, ROUTE CONTEXT, ROUTE RATE, DISTRIBUTE LOAD
DELIVER
• Guaranteed Delivery
• Full data provenance from acquisition to delivery
• Diverse, Non-Traditional Sources
• Eco-system integration
FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG
Processing one million events per second with Apache NiFi
https://blog.cloudera.com/benchmarking-nifi-performance-and-scalability/
PROVENANCE
EXTENSIBILITY
• Built from the ground up with extensions in mind
• Service-loader pattern for…
– Processors
– Controller Services
– Reporting Tasks
– Prioritizers
• Extensions packaged as NiFi Archives (NARs)
– Deploy to the NiFi lib directory and restart
– Same model as standard components
NiFi Load Balancing
• Improve NiFi cluster throughput
• Defined at connection level
• Configurable balancing
strategies
• Critical for scale up paradigm in
Kubernetes
• Alleviates S2S balancing “hack”
customers use
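Two of the balancing strategies, round robin and partition by attribute, can be sketched as plain functions. This is an illustration only: the node names, the `region` attribute, the dict-based FlowFile shape, and the CRC32 hash are assumptions made for the example and do not reflect NiFi internals.

```python
import itertools
import zlib


def round_robin(flowfiles, nodes):
    """Spread FlowFiles evenly across nodes in arrival order."""
    assignment = {node: [] for node in nodes}
    for flowfile, node in zip(flowfiles, itertools.cycle(nodes)):
        assignment[node].append(flowfile)
    return assignment


def partition_by_attribute(flowfiles, nodes, attribute):
    """Send FlowFiles that share an attribute value to the same node."""
    assignment = {node: [] for node in nodes}
    for flowfile in flowfiles:
        value = str(flowfile["attributes"].get(attribute, ""))
        index = zlib.crc32(value.encode("utf-8")) % len(nodes)
        assignment[nodes[index]].append(flowfile)
    return assignment


nodes = ["nifi-0", "nifi-1", "nifi-2"]
regions = ["east", "west", "east", "west", "east", "west"]
flowfiles = [
    {"id": i, "attributes": {"region": region}} for i, region in enumerate(regions)
]

rr = round_robin(flowfiles, nodes)
by_region = partition_by_attribute(flowfiles, nodes, "region")
```

Round robin maximizes evenness, while partitioning by attribute trades evenness for keeping related data on one node.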
QUEUE CONFIGURATION
• FlowFile Expiration - Data that cannot be processed in a timely
fashion can be automatically removed from the flow.
• Back Pressure Thresholds - Thresholds indicate how much data
should be allowed to exist in the queue before the component
that is the source of the Connection is no longer scheduled to
run. This allows the system to avoid being overrun with data.
• Load Balance Strategy – Strategy to distribute the data in a flow
across the nodes in the cluster. When enabled, compression can
be configured on FlowFile contents and attributes.
• Prioritization – Determines the order in which flow files are
processed.
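The interplay of expiration, back pressure, and prioritization can be sketched with a small priority queue. This is a toy model under stated assumptions: `ToyConnection`, its parameters, and the explicit `now` clock are hypothetical, and real NiFi stops scheduling the source component rather than rejecting an offer.

```python
import heapq
import itertools


class ToyConnection:
    def __init__(self, back_pressure_count=3, expiration_seconds=60):
        self.back_pressure_count = back_pressure_count
        self.expiration_seconds = expiration_seconds
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order per priority

    def accepts_more(self):
        """False means the upstream component should not be scheduled to run."""
        return len(self._heap) < self.back_pressure_count

    def offer(self, flowfile, now, priority=0):
        if not self.accepts_more():
            return False
        heapq.heappush(self._heap, (priority, next(self._seq), now, flowfile))
        return True

    def poll(self, now):
        """Return the highest-priority unexpired FlowFile, dropping expired ones."""
        while self._heap:
            _, _, enqueued_at, flowfile = heapq.heappop(self._heap)
            if now - enqueued_at <= self.expiration_seconds:
                return flowfile
        return None


conn = ToyConnection(back_pressure_count=3, expiration_seconds=60)
conn.offer("old", now=0, priority=5)
conn.offer("urgent", now=10, priority=1)
conn.offer("normal", now=20, priority=5)
blocked = conn.offer("overflow", now=21)  # threshold reached, not accepted

first = conn.poll(now=30)   # lowest priority number wins
second = conn.poll(now=70)  # "old" has expired by now and is skipped
third = conn.poll(now=70)   # queue is empty
```

Lower numbers mean higher priority in this sketch; that ordering is a choice made for the example.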
RECORD-ORIENTED DATA WITH NIFI
• Record Readers - Avro, CSV, Grok, IPFIX, JASN1, JSON, Parquet,
Scripted, Syslog5424, Syslog, WindowsEvent, XML
• Record Writers - Avro, CSV, FreeFormText, JSON, Parquet,
Scripted, XML
• Record Reader and Writer support referencing a schema registry
for retrieving schemas when necessary.
• Enable processors that accept any data format without having to
worry about the parsing and serialization logic.
• Allows us to keep FlowFiles larger, each consisting of multiple
records, which results in far better performance.
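The reader/writer split can be illustrated with a format-agnostic conversion function. This is a sketch of the pattern rather than NiFi code: the function names and the sample CSV content are assumptions made for the example.

```python
import csv
import io
import json


def read_csv_records(content):
    """Record reader: parse FlowFile content into a stream of dict records."""
    yield from csv.DictReader(io.StringIO(content))


def write_json_records(records):
    """Record writer: serialize all records back into a single FlowFile body."""
    return json.dumps(list(records))


def convert(content, reader, writer, transform=lambda r: r):
    """Format-agnostic step: it only sees records, never the raw encoding."""
    return writer(transform(r) for r in reader(content))


flowfile_content = "userid,amount\n215,94204\n216,120\n"
converted = convert(
    flowfile_content,
    read_csv_records,
    write_json_records,
    transform=lambda r: {**r, "amount": int(r["amount"])},
)
```

Swapping in a different reader or writer changes the format without touching `convert`, and one FlowFile carries many records through a single call.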
RUNNING SQL ON FLOWFILES
• Evaluates one or more SQL queries against the contents of a
FlowFile.
• This can be used, for example, for field-specific filtering,
transformation, and row-level filtering.
• Columns can be renamed, simple calculations and aggregations
performed.
• The SQL statement must be valid ANSI SQL and is powered by
Apache Calcite.
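The idea of querying FlowFile content with SQL can be tried out locally with Python's built-in `sqlite3` module standing in for Calcite. This is an approximation under stated assumptions: SQLite's dialect differs from Calcite's, `query_records` is a hypothetical helper, and the table name `FLOWFILE` follows the convention of NiFi's QueryRecord processor.

```python
import sqlite3


def query_records(records, sql):
    """Load records into an in-memory table named FLOWFILE and run one query."""
    columns = list(records[0].keys())
    conn = sqlite3.connect(":memory:")
    try:
        conn.row_factory = sqlite3.Row
        column_list = ", ".join(f'"{c}"' for c in columns)
        conn.execute(f"CREATE TABLE FLOWFILE ({column_list})")
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(
            f"INSERT INTO FLOWFILE VALUES ({placeholders})",
            [tuple(r[c] for c in columns) for r in records],
        )
        return [dict(row) for row in conn.execute(sql)]
    finally:
        conn.close()


records = [
    {"card": "A", "amount": 10},
    {"card": "B", "amount": 250},
    {"card": "A", "amount": 300},
]

# Row-level filter, column rename, and a simple calculation in one statement.
large = query_records(
    records,
    "SELECT card AS card_id, amount * 2 AS doubled "
    "FROM FLOWFILE WHERE amount > 100 ORDER BY amount",
)
```

The single statement covers the three uses listed above: filtering rows, renaming a column, and computing a derived value.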
Apache NiFi with Python Custom Processors
Python as a 1st class citizen
READYFLOW GALLERY
• Cloudera provided flow
definitions
• Cover most common data flow
use cases
• Optimized to work with CDP
sources/destinations
• Can be deployed and adjusted
as needed
FLOW CATALOG
• Central repository for flow
definitions
• Import existing NiFi flows
• Manage flow definitions
• Initiate flow deployments
DEPLOYMENT WIZARD
• Turns flow definitions into flow
deployments
• Guides users through providing
required configuration
• Choose NiFi runtime version
• Pick from pre-defined NiFi node sizes
• Define KPIs for the deployment
Screens: Start Deployment Wizard, Provide Parameters, Configure Sizing & Scaling, Define KPIs
KEY PERFORMANCE INDICATORS
• Visibility into flow deployments
• Track high level flow
performance
• Track in-depth NiFi component
metrics
• Defined in Deployment Wizard
• Monitoring & Alerts in
Deployment Details
Screens: KPI Definition in Deployment Wizard, KPI Monitoring
DASHBOARD
• Central Monitoring View
• Monitors flow deployments
across CDP environments
• Monitors flow deployment
health & performance
• Drill into flow deployment to
monitor system metrics and
deployment events
DEPLOYMENT MANAGER
• Manage flow deployment
lifecycle
(Suspend/Start/Terminate)
• Add/Edit KPIs
• Change sizing configuration
• Update parameters
• Change NiFi version of the
deployment
• Gateway to NiFi canvas
NIFI VERSION UPGRADES
• Pick up NiFi hotfixes easily
• Upgrade (or downgrade) the
hotfix version of existing
deployments
• Rolling upgrade (if the
deployment has >1 NiFi nodes)
BEST PRACTICES
STREAMING TECH DEBT TIPS
• Version Control All Assets
• Managed Public Cloud like Cloudera
• Use DevOps and APIs
• Latest Java and Python
• Stream Sizing (NiFi, Kafka, Flink)
Streaming Solutions: When to use what?
Considerations: Routing vs Analytics, Listeners, Joins, In-Memory, Operational Load, Current Skills
Use NiFi
• Doing more than just Syndication
• Not just small Kafka-sized events
• Edge Management is needed
• Listener Type use cases that bind to ports
• Lightweight ETL, Lineage, Provenance, Message Replay
Use Flink
• Joining Streams
• Windowing
• Late Data Handling
• Streaming Analytics
Use KConnect
• Kafka Centric
• In-Memory Stateless
RESOURCES AND WRAP-UP
Resources
Upcoming Events
April 26
May 10
May 9
THANK YOU