Bigdata vs. Data Warehousing
     Synergy or Conflict?



          Thomas Kejser
        thomas@kejser.org
       http://blog.kejser.org
          @thomaskejser
Who is this Guy?


Thomas Kejser
http://blog.kejser.org
@thomaskejser

• Formerly: Lead SQLCAT EMEA
• Now:      CTO Fusion-io EMEA

• 15 years of database experience
• Performance Tuner
Human Consciousness Doesn’t Scale

[Chart: world population in billions (y-axis, 5 to 10) by year (x-axis, 2000 to 2250). Source: United Nations Projections]
Text Messages in a Table

CREATE TABLE AllTexts (
    Sender BIGINT                 -- 8 B
    , Receiver BIGINT             -- 8 B
    , SenderLocation BIGINT       -- 8 B
    , ReceiverLocation BIGINT     -- 8 B
    , Time DATETIME               -- 8 B
    , SMS VARCHAR(140)            -- 140 B
);
-- Total: 5 x 8 B + 140 B = 180 bytes per row
How much do we text?

• World Average
    •   6.1 Trillion Text Messages / year
    •   About 80% cell phone coverage
    •   7 billion people
    •   3 messages/day/person
• But:
    • Teenagers: 50 messages/day




Source: Pew Internet Research 2010 & ITU
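The 3 messages/day/person figure can be reproduced from the other numbers on the slide. This is a minimal sketch that assumes the average is taken over people with cell phone coverage and over a 365-day year; the slide does not say which population it divides by, and dividing by all 7 billion people would give about 2.4 instead.

```python
# Back-of-envelope check of the world-average texting rate.
TEXTS_PER_YEAR = 6.1e12     # from the slide
WORLD_POPULATION = 7e9      # from the slide
CELL_COVERAGE = 0.80        # from the slide
DAYS_PER_YEAR = 365         # assumption

# Assumption: only people with coverage send texts.
phone_users = WORLD_POPULATION * CELL_COVERAGE
texts_per_user_per_day = TEXTS_PER_YEAR / phone_users / DAYS_PER_YEAR

print(f"{phone_users / 1e9:.1f} billion phone users")
print(f"{texts_per_user_per_day:.1f} texts/day/person")
```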
How much will we EVER text?

• 9B people acting like teenagers (in 2050)
  • 50 texts/day
• That’s 450 billion texts/day
  • 164 Trillion texts/year (about 27x today)
  • 180 bytes each
  • Assume x3 compression
• Approximation: 10 Petabytes/year in 2050
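The chain of multiplications behind the 10 petabyte figure can be written out as a short sketch. The inputs are the slide's own numbers; the 365-day year and the decimal petabyte (10^15 bytes) are assumptions. The ratio to today's 6.1 trillion texts per year works out to roughly 27.

```python
# Worst-case texting volume in 2050, using the slide's inputs.
PEOPLE_2050 = 9e9
TEXTS_PER_PERSON_PER_DAY = 50
ROW_BYTES = 180             # row size of the AllTexts table
COMPRESSION_RATIO = 3
TEXTS_PER_YEAR_TODAY = 6.1e12
DAYS_PER_YEAR = 365         # assumption
PETABYTE = 1e15             # assumption: decimal petabyte

texts_per_day = PEOPLE_2050 * TEXTS_PER_PERSON_PER_DAY
texts_per_year = texts_per_day * DAYS_PER_YEAR
growth_vs_today = texts_per_year / TEXTS_PER_YEAR_TODAY
raw_pb = texts_per_year * ROW_BYTES / PETABYTE
compressed_pb = raw_pb / COMPRESSION_RATIO

print(f"{texts_per_day / 1e9:.0f} billion texts/day")
print(f"{texts_per_year / 1e12:.0f} trillion texts/year")
print(f"{growth_vs_today:.0f}x today's volume")
print(f"{raw_pb:.1f} PB raw, {compressed_pb:.1f} PB compressed")
```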
Moore’s Hard Drives

[Chart: hard drive capacity in GB (log scale) by year]

Can it be done?
How Large is this/year?



Hard Disk (4TB): 2.5”
Wine Bottle (75cl): 4.0”



            About 1500 Wine Bottles
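The slide does not show how it arrives at 1500 bottles. One reading that reproduces the figure, offered as a guess rather than the author's stated method, is to lay the disks for one year of data (10 PB) side by side and express the length of that row in bottle widths, using the two measurements given on the slide.

```python
# Guessed reconstruction of the wine bottle comparison.
YEARLY_DATA_PB = 10         # result of the texting estimate
DISK_TB = 4
DISK_WIDTH_INCHES = 2.5     # from the slide
BOTTLE_WIDTH_INCHES = 4.0   # from the slide

disks = YEARLY_DATA_PB * 1000 / DISK_TB          # decimal units assumed
row_length_inches = disks * DISK_WIDTH_INCHES
bottles = row_length_inches / BOTTLE_WIDTH_INCHES

print(f"{disks:.0f} disks")
print(f"about {bottles:.0f} wine bottles")
```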
In the Data Center

• Calculating:
  • 2U Storage = 24 Disks (includes compute)
  • 4TB per Disk
  • 100TB in 2U (a bit less)
  • 10PB = 200U storage
• About six racks
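The rack count can be checked without rounding 96 TB up to 100 TB. This is a rough sketch: the 42U rack height is an assumption that does not appear on the slide, and the calculation packs every rack full. It lands on five racks, so the slide's "about six" presumably leaves room for switches and power, which is also a guess.

```python
import math

# Rack-space estimate for 10 PB, using the slide's enclosure numbers.
DISK_TB = 4
DISKS_PER_ENCLOSURE = 24
ENCLOSURE_HEIGHT_U = 2
TARGET_PB = 10
RACK_HEIGHT_U = 42          # assumption: full-height rack, not from the slide

tb_per_enclosure = DISK_TB * DISKS_PER_ENCLOSURE
enclosures = math.ceil(TARGET_PB * 1000 / tb_per_enclosure)
rack_units = enclosures * ENCLOSURE_HEIGHT_U
racks = math.ceil(rack_units / RACK_HEIGHT_U)

print(f"{tb_per_enclosure} TB per 2U enclosure")
print(f"{enclosures} enclosures, {rack_units}U in total")
print(f"{racks} fully packed racks")
```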
Warehouses Serve Us Well…
… And it is Becoming a Commodity

• Good Management Interfaces
• Standard SQL
  • with a few extensions
• Appliances
• Support system
• Homogeneous HW
  • In chunks
vs.
PDW vs. Hive – Scan/seek
Query 1:
SELECT count(*)
FROM lineitem

Query 2:
SELECT max(l_quantity)
FROM lineitem
WHERE l_orderkey > 1000
  AND l_orderkey < 100000
GROUP BY l_linestatus

[Chart: elapsed seconds (y-axis, 0 to 1500) for Query 1 and Query 2; series: Hive, PDW]
PDW vs. Hive - Joins
SELECT max(l_orderkey)
FROM orders
JOIN lineitem
ON l_orderkey = o_orderkey

PDW-U:
• orders partitioned on o_custkey
• lineitem partitioned on l_partkey
PDW-P:
• orders partitioned on o_orderkey
• lineitem partitioned on l_orderkey

[Chart: elapsed seconds (y-axis, 0 to 4000) for Hive, PDW-U, and PDW-P]
What does Big Data need to Catch up?

• Thread startup times
• Co-location awareness
• Files vs. optimized DB memory structures
• Column stores and other DB tech

Generic is good…

… but when there is structure, make use of it!
What is Bigdata?

Very Unstructured Data
How many Pictures of Cats?

• Flickr Today:
  • 300MB/month
  • 2GB/year
  • 51M users (too small?)

• Estimate: 51M users x 2GB/year = 102 PB/year

• 10x the text message volume

Source: Wikipedia
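The photo estimate is a single multiplication, shown here next to the texting result for comparison. This sketch assumes the 2GB/year figure is per user and uses decimal units; the 10 PB/year for text messages is the rounded result of the earlier texting estimate.

```python
# Photo volume versus text message volume, using the slide's inputs.
FLICKR_USERS = 51e6
GB_PER_USER_PER_YEAR = 2    # assumption: the 2GB/year figure is per user
TEXT_PB_PER_YEAR = 10       # rounded result of the texting estimate
GB_PER_PB = 1e6             # decimal units

photo_pb_per_year = FLICKR_USERS * GB_PER_USER_PER_YEAR / GB_PER_PB
ratio_to_texts = photo_pb_per_year / TEXT_PB_PER_YEAR

print(f"{photo_pb_per_year:.0f} PB of photos per year")
print(f"{ratio_to_texts:.1f}x the text message volume")
```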
How big is this in wine bottles?
We have learned how to store it!
What is HDFS?

• Distributed File System
• Open Source
• No more SAN

• The Failure Unit is the Server
Fully unstructured data is boring

…Unless you get money for storing it
Acquiring Personal Information




Your Semi-structured Data, the Old Fashioned Way
The Social Angle

Who do you talk to and how often?
The Reasons

Why do you own a cell phone?
Saturday, 1:39am   - at The Pub




Your Semi-structured Data, For Free
Big Value

Extraction of meaning and insight
from semi-structured data
Extracting Meaning from Humans

Method                             Examples
Turn semi-structure to structure   Image recognition, network proximity and super nodes, social media
Needle in a haystack               Extract outliers, fraud
Herd behaviors                     Clustering, pattern recognition, “Customers who bought this also bought”
Text classification and search     Text indexes, syntactic counting, PageRank
Text to structure                  Semantic analysis, loose structure into structure
Find New Customers



[Diagram: social graph with nodes Tommy, Thomas, and Michael]

“Michael, who is respected among his peers, often talks about his new, cool gadgets”
Cross Sell




“Families who own an Aston Martin will often buy a Mini Cooper too”
Free Information
Need: Lots of CPU Cores!
Need: Data Centers!
Provisioning has to be REALLY fast
Things to Learn for the Future

• Get good at
  • Statistics (again)
  • Distributed Algorithms
  • Tuning
• Understand Physical Constraints
• Acquire deep domain knowledge
Something is Changing


      Today                             Tomorrow




     CAPEX Hardware     OPEX Hardware       You
The Mother of All Stovepipes
Big Data / Staging (No Model)

[Diagram labels: “Data you are afraid to lose”; “Data you actually need”; “Delivery (Model)”]
Synergy




[Diagram labels: “Create Structure for me”; “Warehouse”; “Here is a table”]
Applying Social Media to Structure
Summary

    Data Warehouse                     Big Data

•   There is a model                   •   Don’t bother modeling!
•   Seek Co-location                   •   Optional Co-Location
•   Respond in seconds                 •   Respond in minutes
•   Calculate first, query after       •   Calculate while querying
•   Expensive HW                       •   Cheap HW
•   Optimise for target HW             •   Good enough on all HW
•   Homogeneous HW                     •   Heterogeneous HW
•   Pay vendor, expect optimised       •   Free license, optimise yourself

Editor's Notes

  • #4 We are at the end of the growth curve... 9B is our total population... This is an important observation because many data estimates are based on human activity and have so far assumed exponential growth. This is NOT the case anymore!
  • #8 This shows the development of hard drive capacity over time.
  • #9 The calculation is not meant to be read, just letting people know we did the calc and what it PHYSICALLY means (see the animation)... There is a real cost to storing a lot of data, and this is one of the reasons cloud makes a lot of sense. Wine bottles.
  • #19 This is Hyde Park... From one end to the other...