Introduction to Big Data & Basic Data Analysis


Big Data Everywhere!


•       Lots of data is being collected and warehoused

–   Web data, e-commerce

–   Purchases at department/grocery stores

–   Bank/credit card transactions

–   Social networks


How much data?


•        Google processes 20 PB a day (2008)


•        Wayback Machine has 3 PB + 100 TB/month (3/2009)


•        Facebook has 2.5 PB of user data + 15 TB/day (4/2009)


•        eBay has 6.5 PB of user data + 50 TB/day (5/2009)


•        CERN’s Large Hadron Collider (LHC) generates 15 PB a year



The Earthscope

  The Earthscope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas fault, sure, but also the plume of magma underneath Yellowstone and much, much more. (http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)

 

Types of Data

       Relational Data (Tables/Transaction/Legacy Data)

       Text Data (Web)

       Semi-structured Data (XML)

       Graph Data

   Social Network, Semantic Web (RDF), …

 

       Streaming Data

   You can only scan the data once

 

 

 

What to do with these data?

       Aggregation and Statistics

   Data warehouse and OLAP

       Indexing, Searching, and Querying

   Keyword based search

   Pattern matching (XML/RDF)

       Knowledge discovery

   Data Mining

   Statistical Modeling

 

 

Statistics 101

Random Sample and Statistics

         Population: the set or universe of all entities under study.

         Examining the entire population may be infeasible or too expensive.

         Instead, we draw a random sample from the population and compute statistics from the sample that estimate the corresponding population parameters of interest.

Statistic

        Let $S_i$ denote the random variable corresponding to data point $x_i$. A statistic $\hat{\theta}$ is then a function $\hat{\theta} : (S_1, S_2, \ldots, S_n) \to \mathbb{R}$.

        If we use the value of a statistic to estimate a population parameter, this value is called a point estimate of the parameter, and the statistic is called an estimator of the parameter.
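
As a minimal sketch of a point estimate, assume a hypothetical population of the integers 1 to 100000 (not from the slides). The sample mean of a random sample serves as the estimator, and its value on one sample is the point estimate of the population mean. The function name and the sample size are illustrative choices.

```python
import random
import statistics


def sample_mean_estimate(population, n, seed=0):
    """Draw a random sample of size n and return its mean (a point estimate)."""
    rng = random.Random(seed)            # fixed seed so the run is repeatable
    sample = rng.sample(population, n)   # sampling without replacement
    return statistics.fmean(sample)


# Hypothetical population, small enough here to compute the true parameter.
population = list(range(1, 100001))
true_mean = statistics.fmean(population)
estimate = sample_mean_estimate(population, n=1000)

print("population mean:", true_mean)
print("point estimate :", estimate)
```

In practice the population mean is unknown; it is computed here only to show how close the estimate lands.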

Empirical Cumulative Distribution Function

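$\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \le x)$

where $I(x_i \le x)$ is 1 if $x_i \le x$ and 0 otherwise.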

Example
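
One way to illustrate the empirical cumulative distribution function is the short sketch below. It assumes the standard definition (the fraction of sample points less than or equal to x); the sample values and the function name `ecdf` are made up for illustration.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return a function F where F(x) is the fraction of sample points <= x."""
    xs = sorted(sample)
    n = len(xs)

    def F(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(xs, x) / n

    return F


F = ecdf([3, 1, 4, 1, 5])   # hypothetical sample
for x in (0, 1, 3, 4.5, 5):
    print(x, F(x))
```

The function is a step function: it jumps by 1/n at every sample point and by more where values repeat.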

Measures of Central Tendency (Mean)

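Population Mean: $\mu = E[X]$, the expected value of the random variable $X$. Its usual estimator is the sample mean $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$.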

Measures of Central Tendency (Median)

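Population Median: any value $m$ such that $P(X \le m) \ge \frac{1}{2}$ and $P(X \ge m) \ge \frac{1}{2}$. The sample median is the middle value of the sorted sample (the average of the two middle values when $n$ is even).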

Example
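
A small sketch of the two measures on a hypothetical sample (the values are invented for illustration). It uses Python's standard `statistics` module and shows that a single extreme value pulls the mean far more than the median.

```python
import statistics

# Hypothetical sample with one extreme value (100).
sample = [2, 3, 5, 7, 11, 13, 100]

mean = statistics.fmean(sample)     # sum of values divided by n
median = statistics.median(sample)  # middle value of the sorted sample

print("mean  :", mean)
print("median:", median)
```

Dropping the value 100 would change the mean substantially but move the median only from 7 to 6.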

Measures of Dispersion (Range)

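Range: the difference between the maximum and minimum values of $X$. The sample range is $\hat{r} = \max_i x_i - \min_i x_i$.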

Measures of Dispersion (Inter-Quartile Range)

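Inter-Quartile Range (IQR): the difference between the third and first quartiles, $IQR = Q_3 - Q_1 = F^{-1}(0.75) - F^{-1}(0.25)$, where $F^{-1}$ is the quantile (inverse cumulative distribution) function.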

Measures of Dispersion
(Variance and Standard Deviation)
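
The dispersion measures in this part (range, inter-quartile range, variance, standard deviation) can be sketched as follows on a hypothetical sample. Variance is taken here as the mean squared deviation from the mean, and the standard deviation as its square root; the version that divides by n - 1 is shown alongside. Quartile conventions differ between tools, so the IQR below reflects the default method of `statistics.quantiles` and is an assumption rather than the slides' definition.

```python
import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data

value_range = max(sample) - min(sample)

# Quartiles with the module's default ("exclusive") method.
q1, q2, q3 = statistics.quantiles(sample, n=4)
iqr = q3 - q1

variance = statistics.pvariance(sample)        # divides by n
std_dev = statistics.pstdev(sample)            # square root of the above
sample_variance = statistics.variance(sample)  # divides by n - 1

print("range          :", value_range)
print("IQR            :", iqr)
print("variance (1/n) :", variance)
print("std deviation  :", std_dev)
print("variance (n-1) :", sample_variance)
```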

Univariate Normal Distribution

Multivariate Normal Distribution

OLAP and Data Mining

Warehouse Architecture

Star Schemas

                 A star schema is a common organization for data at a warehouse. It consists of:

•   Fact table: a very large accumulation of facts such as sales.

–    Often “insert-only.”

•   Dimension tables: smaller, generally static information about the entities involved in the facts.

Terms

       Fact table

       Dimension tables

       Measures

Star

Cube

3-D Cube

ROLAP vs. MOLAP

       ROLAP: Relational On-Line Analytical Processing

       MOLAP: Multi-Dimensional On-Line Analytical Processing

Aggregates

       Operators: sum, count, max, min, median, avg

       “Having” clause

       Using dimension hierarchy

   average by region (within store)

   maximum by month (within date)
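
The aggregate operations above can be sketched with Python's built-in `sqlite3` module. The tiny star schema below (a `sale` fact table and a `store` dimension table) and all of its rows are hypothetical; the queries show a plain aggregate, an aggregate over a dimension hierarchy (region within store), and a HAVING clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE store (store_id TEXT PRIMARY KEY, region TEXT);
CREATE TABLE sale  (store_id TEXT, product TEXT, day INTEGER, amt INTEGER);
INSERT INTO store VALUES ('c1', 'east'), ('c2', 'east'), ('c3', 'west');
INSERT INTO sale VALUES
  ('c1', 'p1', 1, 12), ('c3', 'p2', 1, 11), ('c1', 'p1', 2, 50),
  ('c2', 'p1', 1, 8),  ('c3', 'p1', 2, 44), ('c2', 'p2', 2, 4);
""")

# Sum of the measure (amt) per day.
totals_by_day = conn.execute(
    "SELECT day, SUM(amt) FROM sale GROUP BY day ORDER BY day"
).fetchall()

# Average by region: roll stores up to their region via the dimension table.
avg_by_region = conn.execute(
    """SELECT s.region, AVG(f.amt)
       FROM sale AS f JOIN store AS s ON f.store_id = s.store_id
       GROUP BY s.region ORDER BY s.region"""
).fetchall()

# HAVING filters groups after aggregation.
big_products = conn.execute(
    """SELECT product, SUM(amt) FROM sale
       GROUP BY product HAVING SUM(amt) > 100"""
).fetchall()

print(totals_by_day)
print(avg_by_region)
print(big_products)
```

The fact table holds the measure (`amt`), while the dimension table supplies the attribute (`region`) used to group at a coarser level.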

What is Data Mining?

       Discovery of useful, possibly unexpected, patterns in data

       Non-trivial extraction of implicit, previously unknown and potentially useful information from data

       Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Data Mining Tasks

       Classification [Predictive]

       Clustering [Descriptive]

       Association Rule Discovery [Descriptive]

       Sequential Pattern Discovery [Descriptive]

       Regression [Predictive]

       Deviation Detection [Predictive]

       Collaborative Filtering [Predictive]

 

Classification: Definition

       Given a collection of records (training set)

   Each record contains a set of attributes; one of the attributes is the class.

       Find a model for the class attribute as a function of the values of the other attributes.

       Goal: previously unseen records should be assigned a class as accurately as possible.

   A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
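
The train/test procedure described above can be sketched as follows. The model here is a 1-nearest-neighbor classifier, chosen only because it is short; the records, the 30% test fraction, and the function names are all assumptions made for illustration.

```python
import random


def nearest_neighbor_predict(train, point):
    """Return the class of the training record closest to `point`."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    _, label = min(train, key=lambda rec: sq_dist(rec[0], point))
    return label


def holdout_accuracy(records, test_fraction=0.3, seed=0):
    """Split records into training and test sets and return test accuracy."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    correct = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return correct / len(test)


# Hypothetical records: ((attribute_1, attribute_2), class)
records = (
    [((x, y), "low") for x in range(5) for y in range(5)]
    + [((x, y), "high") for x in range(10, 15) for y in range(10, 15)]
)

accuracy = holdout_accuracy(records)
print("test accuracy:", accuracy)
```

Because the two hypothetical classes are well separated, the test accuracy is perfect here; real data rarely behaves this way.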

Decision Trees

Clustering

K-Means Clustering
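
A minimal sketch of k-means, assuming the usual iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points, until nothing changes). The points, the random initialization, and the function names are illustrative assumptions, not taken from the slides.

```python
import random


def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def k_means(points, k, iterations=100, seed=0):
    """Return (centroids, clusters) after iterating until the centroids settle."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # start from k distinct data points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]   # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:     # converged
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D points forming two obvious groups.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, clusters = k_means(points, k=2)
print(sorted(centroids))
```

The result of k-means generally depends on the initial centroids; on data this cleanly separated the two groups are recovered regardless.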

Association Rule Mining

Association Rule Discovery

       Marketing and Sales Promotion:

   Let the rule discovered be

             {Bagels, … } --> {Potato Chips}

   Potato Chips as consequent => can be used to determine what should be done to boost its sales.

   Bagels in the antecedent => can be used to see which products would be affected if the store discontinues selling bagels.

   Bagels in antecedent and Potato Chips in consequent => can be used to see what products should be sold with Bagels to promote sales of Potato Chips!

       Supermarket shelf management.

       Inventory management.
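
To make the rule {Bagels} --> {Potato Chips} concrete, the sketch below computes the two standard measures of an association rule, support and confidence, on a handful of hypothetical market baskets. The transactions and the function names are invented for illustration.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"Bagels", "Potato Chips", "Milk"},
    {"Bagels", "Potato Chips"},
    {"Bagels", "Milk"},
    {"Potato Chips", "Soda"},
    {"Bagels", "Potato Chips", "Soda"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions)


def support(itemset, transactions):
    """Fraction of transactions that contain `itemset`."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return (count(antecedent | consequent, transactions)
            / count(antecedent, transactions))


rule_support = support({"Bagels", "Potato Chips"}, transactions)
rule_confidence = confidence({"Bagels"}, {"Potato Chips"}, transactions)
print("support   :", rule_support)
print("confidence:", rule_confidence)
```

A rule is usually reported only when both measures exceed thresholds chosen by the analyst.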

Collaborative Filtering

         Goal: predict what movies/books/… a person may be interested in, on the basis of

     Past preferences of the person

     Other people with similar past preferences

     The preferences of such people for a new movie/book/…

         One approach based on repeated clustering

     Cluster people on the basis of preferences for movies

     Then cluster movies on the basis of being liked by the same clusters of people

     Again cluster people based on their preferences for (the newly created clusters of) movies

     Repeat above till equilibrium

         Above problem is an instance of collaborative filtering, where users collaborate in the task of filtering information to find information of interest
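
The goal stated above (predict a person's interest from other people with similar past preferences) can be sketched with a simple neighbor-based predictor. This is not the repeated-clustering approach described in the list; it is a swapped-in, simpler technique. The ratings, the similarity measure (one over one plus the mean absolute rating difference), and all names are assumptions.

```python
def similarity(ratings, a, b):
    """Similarity of two users based on the items both have rated."""
    shared = set(ratings[a]) & set(ratings[b])
    if not shared:
        return 0.0
    diff = sum(abs(ratings[a][m] - ratings[b][m]) for m in shared) / len(shared)
    return 1.0 / (1.0 + diff)


def predict_rating(ratings, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    weighted = total = 0.0
    for other, prefs in ratings.items():
        if other == user or item not in prefs:
            continue
        w = similarity(ratings, user, other)
        weighted += w * prefs[item]
        total += w
    return weighted / total if total else None


# Hypothetical ratings on a 1-5 scale; "ann" has not rated movie "C".
ratings = {
    "ann": {"A": 5, "B": 1},
    "bob": {"A": 5, "B": 1, "C": 5},
    "cat": {"A": 1, "B": 5, "C": 1},
}

prediction = predict_rating(ratings, "ann", "C")
print("predicted rating for ann on C:", prediction)
```

The prediction leans toward bob's rating because bob's past preferences match ann's exactly, while cat's are nearly opposite.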

Other Types of Mining

        Text mining: application of data mining to textual documents

    cluster Web pages to find related pages

    cluster pages a user has visited to organize their visit history

    classify Web pages automatically into a Web directory

        Graph Mining:

    Deal with graph data

Data Streams

         What are Data Streams?

     Continuous streams

     Huge, Fast, and Changing

         Why Data Streams?

     Streams arrive too fast, and in too large a volume, to be stored in full.

     “Real-time” processing

         Window Models

     Landmark window (Entire Data Stream)

     Sliding Window

     Damped Window

         Mining Data Streams

 

 

 

A Simple Problem

         Finding frequent items

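    Given a sequence $(x_1, \ldots, x_N)$ where $x_i \in [1, m]$, and a real number $\theta$ between zero and one.

    Find every item whose relative frequency exceeds $\theta$, i.e. that occurs more than $\theta N$ times.

    Naïve algorithm: keep $m$ counters, one per possible item.

         The number of frequent items is at most $1/\theta$

         Problem: $N \gg m \gg 1/\theta$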

KRP algorithm
Karp et al. (TODS 2003)
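
A sketch of a counter-based one-pass method for the frequent-items problem, written in the style commonly associated with Misra and Gries and with Karp et al. The exact bookkeeping below is an assumption and may differ from the slide's presentation. It keeps fewer than 1/θ counters, so memory depends on θ rather than on m or N, and it returns a candidate set guaranteed to contain every item that occurs more than θN times. The set may also contain infrequent items, so exact counts need a second pass.

```python
import math


def frequent_candidates(stream, theta):
    """One pass over `stream`; returns a superset of the items whose
    count exceeds theta * N, using at most ceil(1/theta) - 1 counters."""
    if not 0 < theta <= 1:
        raise ValueError("theta must be in (0, 1]")
    k = math.ceil(1 / theta)
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement every counter and drop zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return set(counters)


def frequent_items(data, theta):
    """Second pass over stored data to keep only truly frequent candidates."""
    candidates = frequent_candidates(data, theta)
    threshold = theta * len(data)
    return {c for c in candidates if data.count(c) > threshold}


# Hypothetical stream: item 1 appears 5 times out of 10.
stream = [1, 2, 1, 3, 1, 4, 1, 5, 1, 2]
candidates = frequent_candidates(stream, theta=0.3)
print("candidates:", candidates)
print("frequent  :", frequent_items(stream, theta=0.3))
```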

Streaming Sample Problem

       Scan the dataset once

       Sample K records

   Each record has an equal probability of being sampled

   With N records in total, that probability is K/N
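
One well-known way to meet these requirements is reservoir sampling, sketched below. The slides do not name the method, so treating it as the intended solution is an assumption. The first K records fill the reservoir; record number i (counting from 1) then replaces a random slot with probability K/i, which leaves every record in the final sample with probability K/N without knowing N in advance.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k records chosen uniformly at random in one pass over `stream`."""
    rng = rng or random.Random()
    reservoir = []
    for i, record in enumerate(stream):   # i is 0-based
        if i < k:
            reservoir.append(record)      # fill the reservoir first
        else:
            j = rng.randint(0, i)         # uniform over 0..i inclusive
            if j < k:                     # happens with probability k/(i+1)
                reservoir[j] = record
    return reservoir


# Hypothetical stream of 1000 records, sampled with K = 10.
sample = reservoir_sample(range(1000), k=10, rng=random.Random(42))
print(sample)
```

If the stream holds fewer than K records, the function simply returns all of them.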






Posted by MSNU