Data Warehousing & Data Mining
Wolf-Tilo Balke Silviu Homoceanu
Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de
• How to build a DW
– The DW Project: usual tasks, hardware, software, timeline (phases)
– Data Extract/Transform/Load (ETL):
• Data storage structures, extraction strategies (e.g., scraping, sniffing)
• Transformation: data quality, integration
• Loading: issues and strategies (bulk loading of fact data is a must)
– Metadata:
• Describes the contents of a DW, comprises all the intermediate products of ETL
• Helps in understanding how to use the DW
Data Warehousing & OLAP – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 2
Summary
8. Business Intelligence
8.1 Business Intelligence Overview
8.2 Principles of Data Mining
8.3 Association Rule Mining
DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig 3
8. Business Intelligence
• What is Business Intelligence (BI)?
– The process, technologies and tools needed to turn data into information, information into knowledge and knowledge into plans that drive profitable business action
– BI comprises data warehousing, business analytic tools, and content/knowledge management
8.1 BI Overview
• Typical BI applications are
– Customer segmentation
– Propensity to buy (customer disposition to buy)
– Customer profitability
– Fraud detection
– Customer attrition (loss of customers)
– Channel optimization (connecting with the customer)
8.1 BI Overview
• Customer segmentation
– What market segments do my customers fall into, and what are their characteristics?
– Personalize customer relationships for higher customer satisfaction and retention
8.1 BI Overview
• Propensity to buy
– Which customers are most likely to respond to my promotion?
– Target the right customers
• Increase campaign profitability by focusing on the customers most likely to buy
8.1 BI Overview
• Customer profitability
– What is the lifetime profitability of my customer?
– Make individual business interaction decisions based on the overall profitability of customers
8.1 BI Overview
• Fraud detection
– How can I tell which transactions are likely to be fraudulent?
• If your wife has just proposed to increase your life insurance policy, you should probably order pizza for a while
– Quickly determine fraud and take immediate action to minimize damage
8.1 BI Overview
• Customer attrition
– Which customer is at risk of leaving?
– Prevent loss of high-value customers and let go of lower-value customers
• Channel optimization
– What is the best channel to reach my customer in each segment?
– Interact with customers based on their preference and your need to manage cost
8.1 BI Overview
• BI architecture
8.1 BI Overview
• Automated decision tools
– Rule-based systems that provide a solution, usually in one functional area, to a specific repetitive management problem in one industry
• E.g., automated loan approval, intelligent price setting
• Business performance management (BPM)
– Based on the balanced scorecard methodology
– A framework for defining, implementing, and managing an enterprise’s business strategy by linking objectives with factual measures
8.1 BI Overview
• Dashboards
– Provide a comprehensive visual view of corporate performance measures, trends, and exceptions from multiple business areas
• Allow executives to see hot spots in seconds and explore the situation
8.1 BI Overview
• What is data mining (knowledge discovery in databases)?
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
• What is not data mining?
– (Deductive) query processing
– Expert systems or small statistical programs
8.2 Data Mining
• Data Mining applications
– Database analysis and decision support
• Market analysis and management
• Risk analysis and management, e.g., forecasting, customer retention, improved underwriting, quality control, competitive analysis
• Fraud detection and management
– Other applications
• Text mining (newsgroups, email, documents) and Web analysis (Google Analytics)
• Intelligent query answering
8.2 Principles of DM
• Market analysis
– Targeted marketing / customer profiling
• Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
– Determine customer purchasing patterns over time
– Cross-market analysis
• Associations/correlations between product sales
• Prediction based on the association of information
– Provide summary information
• Various multidimensional summary reports
• Statistical summary information (data central tendency and variation)
8.2 Principles of DM
• Corporate analysis and risk management
– Finance planning and asset evaluation
• Cash flow analysis and prediction
• Trend analysis, time series, etc.
– Resource planning
• Summarize and compare the resources and spending
– Competition
• Monitor competitors and market directions
• Group customers into classes and apply a class-based pricing procedure
• Set pricing strategy in a highly competitive market
8.2 Principles of DM
• Architecture of DM systems
8.2 Data Mining
[Figure: architecture of a DM system. Components shown: databases and data warehouse (loaded via ETL and filtering), database or data warehouse server, data mining engine, pattern evaluation, graphical user interface, knowledge base]
• DM functionalities
– Association (correlation and causality)
• Multi-dimensional vs. single-dimensional association
• age(X, “20..29”), income(X, “20..29K”) ⟶ buys(X, “PC”) [support = 2%, confidence = 60%]
• contains(T, “computer”) ⟶ contains(T, “software”) [support = 1%, confidence = 75%]
– Classification and Prediction
• Finding models (functions) that describe and distinguish classes or concepts for future predictions
• Presentation: decision-tree, classification rule, neural network
• Prediction: predict some unknown or missing numerical values
8.2 Data Mining
– Cluster analysis
• Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
• Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity
– Outlier analysis
• Outlier: a data object that does not comply with the general behavior of the data
• Can be considered as noise or exception, but is quite useful in fraud detection, rare events analysis
8.2 Data Mining
• Association rule mining has the objective of finding all co-occurrence relationships (called associations), among data items
– Classical application: market basket data analysis, which aims to discover how items are purchased by customers in a supermarket
• E.g., Cheese ⟶ Wine [support = 10%, confidence = 80%]
meaning that 10% of the customers buy cheese and wine together, and 80% of customers buying cheese also buy wine
8.3 Association Rule Mining
• Basic concepts of association rules
– Let I = {i_1, i_2, …, i_m} be a set of items. Let T = {t_1, t_2, …, t_n} be a set of transactions, where each transaction t_i is a set of items such that t_i ⊆ I.
– An association rule is an implication of the form: X ⟶ Y, where X ⊂ I, Y ⊂ I and X ⋂ Y = ∅
8.3 Association Rule Mining
• Association rule mining: market basket analysis example
– I: the set of all items sold in a store
• E.g., i_1 = Beef, i_2 = Chicken, i_3 = Cheese, …
– T: the set of transactions
• Each transaction is the content of a customer’s basket
• E.g., t_1: Beef, Chicken, Milk; t_2: Beef, Cheese; t_3: Cheese, Wine; t_4: …
– An association rule might be
• Beef, Chicken ⟶ Milk, where {Beef, Chicken} is X and {Milk} is Y
8.3 Association Rule Mining
• Rules can be weak or strong
– The strength of a rule is measured by its support and confidence
– The support of a rule X ⟶ Y is the percentage of transactions in T that contain both X and Y
• Can be seen as an estimate of the probability Pr(X ∪ Y ⊆ t_i)
• With n as the number of transactions in T, the support of the rule X ⟶ Y is:
support = |{i | X ∪ Y ⊆ t_i}| / n
8.3 Association Rule Mining
– The confidence of a rule X ⟶ Y is the percentage of transactions in T containing X that also contain Y
• Can be seen as an estimate of the probability Pr(Y ⊆ t_i | X ⊆ t_i)
confidence = |{i | X ∪ Y ⊆ t_i}| / |{j | X ⊆ t_j}|
8.3 Association Rule Mining
• How do we interpret support and confidence?
– If support is too low, the rule may just occur due to chance
• Acting on a rule with low support may not be profitable since it covers too few cases
– If confidence is too low, we cannot reliably predict Y from X
• The objective of mining association rules is to discover all association rules in T whose support and confidence reach the minimum thresholds (minsup, minconf)!
8.3 Association Rule Mining
• Finding rules based on support and confidence thresholds
– Let minsup = 30% and minconf = 80%
– Chicken, Clothes ⟶ Milk is valid, [sup = 3/7 (42.86%), conf = 3/3 (100%)]
– Clothes ⟶ Milk, Chicken is also valid, and there are more…
8.3 Association Rule Mining
Transactions
T1 Beef, Chicken, Milk
T2 Beef, Cheese
T3 Cheese, Boots
T4 Beef, Chicken, Cheese
T5 Beef, Chicken, Clothes, Cheese, Milk
T6 Clothes, Chicken, Milk
T7 Chicken, Milk, Clothes
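The support and confidence values quoted for this table can be recomputed with a few lines of Python. This is only a minimal sketch: the helper names `support` and `confidence` are illustrative and not part of the lecture material.

```python
# Transactions T1..T7 from the example table.
transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Clothes", "Chicken", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """support(X ∪ Y) / support(X) for the rule X -> Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

# Rule: Chicken, Clothes -> Milk
sup = support({"Chicken", "Clothes", "Milk"}, transactions)
conf = confidence({"Chicken", "Clothes"}, {"Milk"}, transactions)
print(f"sup = {sup:.4f}, conf = {conf:.2f}")
```

With minsup = 30% and minconf = 80%, both values pass the thresholds (3/7 and 3/3).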
• This is a rather simplistic view of shopping baskets
– Some important information is not considered, e.g., the quantity of each item purchased, the price paid, …
• There are a large number of rule mining algorithms
– They use different strategies and data structures
– Given the same data, minsup and minconf, they all produce the same set of rules
8.3 Association Rule Mining
• Approaches in association rule mining
– Apriori algorithm
– Mining with multiple minimum supports
– Mining class association rules
• The best known mining algorithm is the Apriori algorithm
– Step 1: find all frequent itemsets (set of items with support ≥ minsup)
– Step 2: use frequent itemsets to generate rules
8.3 Association Rule Mining
• Step 1: frequent itemset generation
– The key is the apriori property (downward closure property): any subset of a frequent itemset is also a frequent itemset
• E.g., for minsup = 30%
8.3 Apriori Algorithm: Step 1
For the transactions T1 to T7 listed earlier, the frequent itemset {Chicken, Clothes, Milk} and all of its subsets are frequent:
– 3 items: {Chicken, Clothes, Milk}
– 2 items: {Chicken, Clothes}, {Chicken, Milk}, {Clothes, Milk}
– 1 item: {Chicken}, {Clothes}, {Milk}
• Finding frequent items
– Find all 1-item frequent itemsets; then all 2-item frequent itemsets, etc.
– In each iteration k, only consider itemsets that contain a frequent (k-1)-itemset
– Optimization: the algorithm assumes that items are sorted in lexicographic order
• The order is used throughout the algorithm in each itemset
• {w[1], w[2], …, w[k]} represents a k-itemset w consisting of items w[1], w[2], …, w[k], where w[1] < w[2] < … < w[k] according to the lexicographic order
8.3 Apriori Algorithm: Step 1
– Initial step
• Find frequent itemsets of size 1: F_1
– Generalization, k ≥ 2
• C_k = candidates of size k: those itemsets of size k that could be frequent, given F_{k-1}
• F_k = those itemsets that are actually frequent, F_k ⊆ C_k (need to scan the database once)
8.3 Finding frequent items
– Generalization: candidate generation uses F_{k-1} as input and returns a superset (the candidates) of the set of all frequent k-itemsets. It has two steps:
• Join step: generate all possible candidate itemsets C_k of length k, e.g., I_k = join(A_{k-1}, B_{k-1}) ⟺ A_{k-1} = {i_1, i_2, …, i_{k-2}, i_{k-1}} and B_{k-1} = {i_1, i_2, …, i_{k-2}, i'_{k-1}} and i_{k-1} < i'_{k-1}; then I_k = {i_1, i_2, …, i_{k-2}, i_{k-1}, i'_{k-1}}
• Prune step: remove those candidates in C_k that violate the downward closure property, i.e., that include a (k-1)-subset which is not frequent
8.3 Apriori Algorithm: Step 1
– Generalization, e.g., F_3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}
• Try joining each pair of itemsets from F_3
– {1, 2, 3} and {1, 2, 4} share their first two items and join to {1, 2, 3, 4}
– {1, 3, 4} and {1, 3, 5} share their first two items and join to {1, 3, 4, 5}
– No other pair shares its first two items
• After join C_4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}
• Pruning:
– {1, 2, 3, 4}: the 3-subsets {1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {2, 3, 4} are all ∈ F_3 ⟹ it is a good candidate
– {1, 3, 4, 5}: of the 3-subsets {1, 3, 4}, {1, 3, 5}, {1, 4, 5}, {3, 4, 5}, both {1, 4, 5} and {3, 4, 5} are ∉ F_3 ⟹ removed from C_4
• After pruning C_4 = {{1, 2, 3, 4}}
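The join and prune steps can be sketched in Python as follows. This is an illustrative sketch, not the lecture's reference implementation: the function name `candidate_gen` is an assumption, and itemsets are represented as sorted tuples so that "same first k-2 items" becomes a simple prefix comparison.

```python
from itertools import combinations

def candidate_gen(f_prev):
    """Join + prune: build size-k candidates from frequent (k-1)-itemsets.

    Itemsets are sorted tuples, so sharing the first k-2 items is a prefix test.
    """
    f_prev = sorted(f_prev)
    frequent = set(f_prev)
    candidates = []
    for a, b in combinations(f_prev, 2):
        # Join step: identical prefix; a sorts before b, so a[-1] < b[-1].
        if a[:-1] == b[:-1]:
            c = a + (b[-1],)
            # Prune step: every (k-1)-subset of c must itself be frequent.
            if all(s in frequent for s in combinations(c, len(c) - 1)):
                candidates.append(c)
    return candidates

F3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
C4 = candidate_gen(F3)
print(C4)  # [(1, 2, 3, 4)]
```

The candidate (1, 3, 4, 5) is produced by the join but dropped by the prune test, matching the example.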
• Finding frequent items, example, minsup = 0.5
TID Items
T100 1, 3, 4
T200 2, 3, 5
T300 1, 2, 3, 5
T400 2, 5
– First T scan ({item}:count)
• C_1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
• F_1: {1}:2, {2}:3, {3}:3, {5}:3; {4} has a support of 1/4 < 0.5, so it does not belong to the frequent items
• C_2 = prune(join(F_1))
join: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5};
prune: C_2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5} (all items belong to F_1)
– Second T scan
• C_2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
• F_2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
• Join: we could join {1,3} only with {1,4} or {1,5}, but they are not in F_2. The only possible join in F_2 is {2, 3} with {2, 5}, resulting in {2, 3, 5}
• prune({2, 3, 5}): {2, 3}, {2, 5}, {3, 5} all belong to F_2, hence C_3: {2, 3, 5}
– Third T scan
• {2, 3, 5}:2, so sup({2, 3, 5}) = 50% and the minsup condition is fulfilled. Then F_3: {2, 3, 5}
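The three scans of this example can be reproduced with a compact level-wise loop. The sketch below is one possible way to write it, under the assumption that itemsets are sorted tuples and transactions are Python sets; the function name `apriori` is illustrative.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise search; returns {itemset (sorted tuple): support count}."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    candidates = [(i,) for i in items]
    frequent = {}
    while candidates:
        # One scan of T per level to count each candidate.
        counts = {c: sum(set(c) <= t for t in transactions) for c in candidates}
        level = {c: k for c, k in counts.items() if k / n >= minsup}
        frequent.update(level)
        # Join itemsets sharing a prefix, then prune by downward closure.
        candidates = []
        for a, b in combinations(sorted(level), 2):
            if a[:-1] == b[:-1]:
                c = a + (b[-1],)
                if all(s in level for s in combinations(c, len(c) - 1)):
                    candidates.append(c)
    return frequent

T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]  # T100..T400
freq = apriori(T, minsup=0.5)
for itemset, count in sorted(freq.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(itemset, count)
```

The loop stops after the third scan because a single frequent 3-itemset cannot be joined with anything.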
• Step 2: generating rules from frequent itemsets
– Frequent itemsets are not the same as association rules
– One more step is needed to generate association rules: for each frequent itemset I, for each proper nonempty subset X of I:
• Let Y = I \ X; X ⟶ Y is an association rule if Confidence(X ⟶ Y) ≥ minconf, where
– Support(X ⟶ Y) := |{i | X ∪ Y ⊆ t_i}| / n = support(I)
– Confidence(X ⟶ Y) := |{i | X ∪ Y ⊆ t_i}| / |{j | X ⊆ t_j}| = support(I) / support(X)
8.3 Apriori Algorithm: Step 2
• Rule generation example, minconf = 50%
– Suppose {2, 3, 5} is a frequent itemset, with sup=50%, as calculated in step 1
– Proper nonempty subsets: {2, 3}, {2, 5}, {3, 5}, {2}, {3}, {5}, with sup = 50%, 75%, 50%, 75%, 75%, 75% respectively
– These generate the following association rules:
• 2,3 ⟶ 5, confidence=100%; (sup(I)=50%; sup({2,3})=50%; 50/50 = 1)
• 2,5 ⟶ 3, confidence=67%; (50/75)
• 3,5 ⟶ 2, confidence=100%; (…)
• 2 ⟶ 3,5, confidence=67%
• 3 ⟶ 2,5, confidence=67%
• 5 ⟶ 2,3, confidence=67%
– All rules have support = support(I) = 50%
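The six rules above can be derived mechanically from the support counts recorded in step 1. The following is a small sketch under that assumption; `rules_from` is a hypothetical helper name and the counts are copied from the T100 to T400 example.

```python
from itertools import combinations

# Support counts recorded during itemset generation (n = 4 transactions).
count = {
    (2,): 3, (3,): 3, (5,): 3,
    (2, 3): 2, (2, 5): 3, (3, 5): 2,
    (2, 3, 5): 2,
}

def rules_from(itemset, count, minconf):
    """Yield (X, Y, confidence) for every proper nonempty subset X of itemset."""
    for r in range(1, len(itemset)):
        for x in combinations(itemset, r):
            y = tuple(i for i in itemset if i not in x)
            conf = count[itemset] / count[x]  # support(I) / support(X)
            if conf >= minconf:
                yield x, y, conf

rules = list(rules_from((2, 3, 5), count, minconf=0.5))
for x, y, conf in rules:
    print(x, "->", y, f"conf={conf:.0%}")
```

No transaction is read here: only the recorded counts are used, which is why this step is cheap compared to itemset generation.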
• Rule generation, summary
– In order to compute the confidence of X ⟶ Y, we need to know support(I) and support(X)
– All the required information for confidence computation has already been recorded in itemset generation
• No need to read the transactions data any more
• This step is not as time-consuming as frequent itemsets generation
8.3 Apriori Algorithm: Step 2
• Apriori Algorithm, summary
– If k is the size of the largest itemset, then it makes at most k passes over the data (in practice, k is bounded, e.g., by 10)
– The mining exploits sparseness of data, and high minsup and minconf thresholds
– A high minsup threshold makes it impossible to find rules involving rare items in the data. The solution is mining with multiple minimum supports
8.3 Apriori Algorithm
• Mining with multiple minimum supports
– A single minimum support assumes that all items in the data are of the same nature and/or have similar frequencies, which is incorrect…
– In practice, some items appear very frequently in the data, while others rarely appear
• E.g., in a supermarket, people buy cooking pans much less frequently than they buy bread and milk
8.3 Multiple Minimum Supports
• Rare item problem: if the frequencies of items vary significantly, we encounter two problems
– If minsup is set too high, those rules that involve rare items will not be found
– To find rules that involve both frequent and rare items, minsup has to be set very low. This may cause combinatorial explosion because those frequent items will be associated with one another in all possible ways
8.3 Multiple Minimum Supports
• Multiple Minimum Supports
– Each item can have a minimum item support
• Different support requirements for different rules
– To prevent very frequent items and very rare items from appearing in the same itemset S, we introduce a support difference constraint (φ)
• max_{i∈S}{sup(i)} − min_{i∈S}{sup(i)} ≤ φ, where 0 ≤ φ ≤ 1 is user-specified
8.3 Multiple Minimum Supports
• Minsup of a rule
– Let MIS(i) be the minimum item support (MIS) value of item i. The minsup of a rule R is the lowest MIS value of the items in the rule:
• Rule R: i_1, i_2, …, i_k ⟶ i_{k+1}, …, i_r satisfies its minimum support if its actual support is ≥ min(MIS(i_1), MIS(i_2), …, MIS(i_r))
• E.g., the user-specified MIS values are as follows:
MIS(bread) = 2%, MIS(shoes) = 0.1%, MIS(clothes) = 0.2%
– clothes ⟶ bread [sup=0.15%, conf=70%] doesn’t satisfy its minsup, since min(0.2%, 2%) = 0.2% > 0.15%
– clothes ⟶ shoes [sup=0.15%, conf=70%] satisfies its minsup, since min(0.2%, 0.1%) = 0.1% ≤ 0.15%
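The minsup test for a rule under multiple minimum supports reduces to a single comparison. The sketch below uses the MIS values of the example; the function name `satisfies_minsup` is an assumption made for illustration.

```python
# User-specified minimum item supports, as fractions (2%, 0.1%, 0.2%).
MIS = {"bread": 0.02, "shoes": 0.001, "clothes": 0.002}

def satisfies_minsup(rule_items, actual_support, mis):
    """A rule meets its minsup if its support reaches the lowest MIS of its items."""
    return actual_support >= min(mis[i] for i in rule_items)

# clothes -> bread and clothes -> shoes, both with sup = 0.15%
print(satisfies_minsup({"clothes", "bread"}, 0.0015, MIS))  # False
print(satisfies_minsup({"clothes", "shoes"}, 0.0015, MIS))  # True
```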
8.3 Multiple Minimum Supports
• Downward closure property is not valid anymore
– E.g., consider four items 1, 2, 3 and 4 in a database
Their minimum item supports are
• MIS(1) = 10%, MIS(2) = 20%, MIS(3) = 5%, MIS(4) = 6%
• {1, 2} with a support of 9% is infrequent since min(10%, 20%) > 9%, but {1, 2, 3} could be frequent if it had a support of, e.g., 7%, since min(10%, 20%, 5%) = 5%
– If applied, downward closure eliminates {1, 2}, so that {1, 2, 3} is never evaluated
8.3 Multiple Minimum Supports
• How do we solve the downward closure property problem?
– Sort all items in I according to their MIS values (make it a total order)
• The order is used throughout the algorithm in each itemset
– Each itemset w is of the following form:
• {w[1], w[2], …, w[k]}, consisting of items w[1], w[2], …, w[k], where MIS(w[1]) ≤ MIS(w[2]) ≤ … ≤ MIS(w[k])
8.3 Multiple Minimum Supports
• Multiple minimum supports is an extension of the Apriori algorithm
– Step 1: frequent itemset generation
• Initial step
– Produce the seeds for generating candidate itemsets
• Candidate generation
– For k = 2
• Generalization
– For k > 2, pruning step differs from the Apriori algorithm
– Step 2: rule generation
8.3 Multiple Minimum Supports
• Step 1: frequent itemset generation
– E.g., I = {1, 2, 3, 4}, with given MIS(1)=10%, MIS(2)=20%, MIS(3)=5%, MIS(4)=6%, and consider n = 100 transactions:
– Initial step
• Sort I according to the MIS value of each item. Let M represent the sorted items
– Sorting I gives M = {3, 4, 1, 2}
• Scan the data once to record the support count of each item
– E.g., {3}:6, {4}:3, {1}:9 and {2}:25
8.3 Multiple Minimum Supports: Step 1
• Go through the items in M to find the first item i that meets MIS(i). Insert it into a list of seeds L
• For each subsequent item j in M (after i), if sup(j) ≥ MIS(i), then insert j into L
– MIS(3) = 5%; sup({3}) = 6%; sup({3}) > MIS(3), so L = {3}
– sup({4}) = 3% < MIS(3), so L remains {3}
– sup({1}) = 9% > MIS(3), so L = {3, 1}
– sup({2}) = 25% > MIS(3), so L = {3, 1, 2}
• Calculate F_1 from L based on the MIS of each item in L
– F_1 = {{3}, {2}}, since sup({1}) = 9% < MIS(1)
• Why not eliminate {1} directly? Why calculate L and not directly F?
– The downward closure property no longer holds for F due to multiple minimum supports
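The initial pass (sorting by MIS, building the seed list L, then deriving F_1) can be sketched as follows. The function name `init_pass` and the variable names are assumptions; the MIS values and support counts are those of the running example.

```python
MIS = {1: 0.10, 2: 0.20, 3: 0.05, 4: 0.06}   # minimum item supports
count = {1: 9, 2: 25, 3: 6, 4: 3}            # support counts from one scan
n = 100                                      # number of transactions

def init_pass(mis, count, n):
    """Return (M, L): items sorted by MIS, and the seed list built from them."""
    order = sorted(mis, key=mis.get)          # M: ascending MIS
    seeds, threshold = [], None
    for item in order:
        sup = count[item] / n
        if threshold is None:
            # Look for the first item i that meets its own MIS(i).
            if sup >= mis[item]:
                threshold = mis[item]
                seeds.append(item)
        elif sup >= threshold:
            # Later items only need to reach MIS(i) of that first item.
            seeds.append(item)
    return order, seeds

M, L = init_pass(MIS, count, n)
# F_1 keeps only the seeds that also meet their own MIS.
F1 = [i for i in L if count[i] / n >= MIS[i]]
print(M, L, F1)  # [3, 4, 1, 2] [3, 1, 2] [3, 2]
```

Item 1 stays in L although it is not in F_1, because it may still take part in frequent itemsets whose lowest MIS is MIS(3).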
8.3 Multiple Minimum Supports: Step 1
Given: MIS(1)=10%, MIS(2)=20%, MIS(3)=5%, MIS(4)=6%, n=100; support counts {3}:6, {4}:3, {1}:9, {2}:25
– Candidate generation, k = 2.
Let φ = 10% (support difference)
• Take each item (seed) from L in order. Use L and not F_1, because the downward closure property is no longer valid!
• Test the chosen item against its MIS: sup({3}) ≥ MIS(3)
– If true, then we can use this item to form a level 2 candidate
– If not, then go to the next element in L
• If true, e.g., sup({3}) = 6% ≥ MIS(3) = 5%, then try to form a level 2 candidate together with each of the next items in L, e.g., {3, 1}, then {3, 2}
8.3 Multiple Minimum Supports: Step 1
Items    1   2   3   4
MIS (%)  10  20  5   6
SUP (%)  9   25  6   3
L = {3, 1, 2}
– {3, 1} is a candidate :⟺ sup({1}) ≥ MIS(3) and |sup({3}) − sup({1})| ≤ φ
• sup({1}) = 9%; MIS(3) = 5%; sup({3}) = 6%; φ := 10%
• 9% > 5% and |6% − 9%| < 10%, thus C_2 = {{3, 1}}
– Now try {3, 2}
• sup({2}) = 25%; 25% > 5% but |6%-25%| > 10% so this candidate will be rejected due to the support difference constraint
– Pick the next seed from L, i.e., 1 (needed to try {1, 2})
• sup({1}) < MIS(1), so we cannot use 1 as a seed!
– Candidate generation for k = 2 remains C_2 = {{3, 1}}
• Now read the transaction list and calculate the support of each itemset in C_2. Let’s assume sup({3, 1}) = 6%, which is larger than min(MIS(3), MIS(1)) = 5%
• Thus F_2 = {{3, 1}}
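The level 2 candidate generation with the support difference constraint can be written as a double loop over the seed list. This is a sketch only: `level2_candidates` is a hypothetical name, and all supports are given in percent to keep the arithmetic exact.

```python
MIS = {1: 10, 2: 20, 3: 5, 4: 6}   # minimum item supports in percent
SUP = {1: 9, 2: 25, 3: 6, 4: 3}    # item supports in percent
seeds = [3, 1, 2]                  # the seed list L, in MIS order
PHI = 10                           # support difference constraint in percent

def level2_candidates(seeds, mis, sup, phi):
    """Pair each usable seed l with every later seed h that passes both tests."""
    candidates = []
    for idx, l in enumerate(seeds):
        if sup[l] < mis[l]:
            continue  # l does not meet its own MIS, so it cannot start a pair
        for h in seeds[idx + 1:]:
            if sup[h] >= mis[l] and abs(sup[h] - sup[l]) <= phi:
                candidates.append((l, h))
    return candidates

C2 = level2_candidates(seeds, MIS, SUP, PHI)
print(C2)  # [(3, 1)]
```

The pair (3, 2) fails the φ test and seed 1 fails its own MIS, so only (3, 1) remains.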
– Generalization, k > 2: candidate generation uses F_{k-1} as input and returns a superset (the candidates) of the set of all frequent k-itemsets. It has two steps:
• Join step: same as in the case of k = 2
I_k = join(A_{k-1}, B_{k-1}) ⟺ A_{k-1} = {i_1, i_2, …, i_{k-2}, i_{k-1}} and B_{k-1} = {i_1, i_2, …, i_{k-2}, i'_{k-1}} and i_{k-1} < i'_{k-1} and |sup(i_{k-1}) − sup(i'_{k-1})| ≤ φ. Then I_k = {i_1, i_2, …, i_{k-2}, i_{k-1}, i'_{k-1}}
• Prune step: for each (k-1)-subset s of I_k, if s is not in F_{k-1}, then I_k can be removed from C_k (it is not a good candidate).
There is however one exception to this rule: when s does not include the first item of I_k, the candidate is kept, unless the second item of I_k has the same MIS as the first
8.3 Multiple Minimum Supports: Step 1
– Generalization, k > 2, example: let’s consider F_3 = {{1, 2, 3}, {1, 2, 5}, {1, 3, 4}, {1, 3, 5}, {1, 4, 5}, {1, 4, 6}, {2, 3, 5}}
• After join we obtain {1, 2, 3, 5}, {1, 3, 4, 5} and {1, 4, 5, 6} (we do not consider the support difference constraint)
• After pruning we get C_4 = {{1, 2, 3, 5}, {1, 3, 4, 5}}
– {1, 2, 3, 5} is ok
– {1, 3, 4, 5} is not deleted although {3, 4, 5} ∉ F_3, because {3, 4, 5} does not contain the first item 1 and MIS(3) > MIS(1). If MIS(3) = MIS(1), it could be deleted
– {1, 4, 5, 6} is deleted because {1, 5, 6} ∉ F_3
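The modified prune step can be sketched as below. The slides do not give MIS values for this example, so the values in `mis_distinct` and `mis_equal` are purely hypothetical; they only respect the stated relations MIS(3) > MIS(1) and MIS(3) = MIS(1). The name `ms_candidate_gen` is also an assumption, items are assumed to be numbered in MIS order, and the support difference constraint is ignored, as in the example.

```python
from itertools import combinations

def ms_candidate_gen(f_prev, mis):
    """Join on a shared prefix, then prune with the first-item exception."""
    frequent = set(f_prev)
    out = []
    for a, b in combinations(f_prev, 2):  # f_prev is listed in MIS order
        if a[:-1] != b[:-1]:
            continue
        c = a + (b[-1],)
        keep = True
        for s in combinations(c, len(c) - 1):
            # A subset without the first item may only be used for pruning
            # when the second item has the same MIS as the first one.
            must_check = c[0] in s or mis[c[1]] == mis[c[0]]
            if must_check and s not in frequent:
                keep = False
                break
        if keep:
            out.append(c)
    return out

F3 = [(1, 2, 3), (1, 2, 5), (1, 3, 4), (1, 3, 5), (1, 4, 5), (1, 4, 6), (2, 3, 5)]
mis_distinct = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6}  # hypothetical, MIS(3) > MIS(1)
mis_equal = {1: 1, 2: 1, 3: 1, 4: 4, 5: 5, 6: 6}     # hypothetical, MIS(3) = MIS(1)

print(ms_candidate_gen(F3, mis_distinct))  # [(1, 2, 3, 5), (1, 3, 4, 5)]
print(ms_candidate_gen(F3, mis_equal))     # [(1, 2, 3, 5)]
```

In the second call {1, 3, 4, 5} is deleted, because with equal MIS values the missing subset {3, 4, 5} is allowed to prune it.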
8.3 Multiple Minimum Supports: Step 1
• Step 2: rule generation
– The downward closure property is not valid anymore, therefore there are frequent k-itemsets that contain non-frequent (k-1)-subsets
• In the Apriori algorithm we only recorded the support of frequent itemsets
• For the non-frequent itemsets we do not have the support value recorded
• This problem arises when we form rules of the form A,B ⟶ C, where MIS(C) = min(MIS(A), MIS(B), MIS(C))
• Conf(A,B ⟶ C) = sup({A,B,C}) / sup({A,B})
We have the support of {A, B, C} because it is frequent, but we may not have the support of {A, B}, since it need not be frequent by itself
• This is called the head-item problem
8.3 Multiple Minimum Supports: Step 2
• Rule generation example
– {Shoes, Clothes, Bread} is a frequent itemset since
• min(MIS(Shoes), MIS(Clothes), MIS(Bread)) = 0.1% < sup({Shoes, Clothes, Bread}) = 0.12%
– However, {Clothes, Bread} is not frequent (since min(MIS(Clothes), MIS(Bread)) = 0.2% > sup({Clothes, Bread}) = 0.15%)
• So we may not be able to calculate the confidence of all rules depending on Shoes, i.e., the rules:
– Clothes, Bread ⟶ Shoes
– Clothes ⟶ Shoes, Bread
– Bread ⟶ Shoes, Clothes
8.3 Multiple Minimum Supports: Step 2
Items     Bread  Clothes  Shoes
MIS (%)   2      0.2      0.1

Itemset                   SUP (%)
{Clothes, Bread}          0.15
{Shoes, Clothes, Bread}   0.12
– Head-item problem, e.g., Clothes, Bread ⟶ Shoes; Clothes ⟶ Shoes, Bread; Bread ⟶ Shoes, Clothes
• If some item on the right side of a rule has the minimum MIS (e.g., Shoes), we may not be able to calculate the confidence without reading the data again
• The chance of succeeding without another data read increases if we also record the support of one non-frequent sub-itemset: the itemset obtained by eliminating the item with the minimum MIS, e.g.,
– {Clothes, Bread, Shoes} – {Shoes} = {Clothes, Bread}
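The effect of recording that one extra support value can be illustrated with a lookup table. This is a sketch under the assumption that recorded supports are kept in a dictionary keyed by itemset; the names `recorded`, `without_extra` and `confidence` are illustrative.

```python
# Supports in percent, taken from the example tables.
recorded = {
    frozenset({"Shoes", "Clothes", "Bread"}): 0.12,  # frequent itemset
    frozenset({"Clothes", "Bread"}): 0.15,           # non-frequent, recorded anyway
}
# Same bookkeeping without the extra entry, as in plain Apriori.
without_extra = {frozenset({"Shoes", "Clothes", "Bread"}): 0.12}

def confidence(x, y, recorded):
    """Return support(X ∪ Y) / support(X), or None if a support is missing."""
    x, y = frozenset(x), frozenset(y)
    if x not in recorded or (x | y) not in recorded:
        return None  # would require another scan of the transactions
    return recorded[x | y] / recorded[x]

conf = confidence({"Clothes", "Bread"}, {"Shoes"}, recorded)
print(f"{conf:.2f}")                                                   # 0.80
print(confidence({"Clothes", "Bread"}, {"Shoes"}, without_extra))      # None
```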
8.3 Multiple Minimum Supports: Step 2
• Advantages
– It is a more realistic model for practical applications
– The model enables us to find rare item rules, but without producing a huge number of meaningless rules with frequent items
– By setting MIS values of some items to 100% (or more), we can effectively instruct the algorithms not to generate rules only involving these items
8.3 Multiple Minimum Supports
• Mining Class Association Rules (CAR)
– Normal association rule mining doesn’t have a target
• It finds all possible rules that exist in data, i.e., any item can appear as a consequent or a condition of a rule
– However, in some applications, the user is interested in some targets
• E.g., the user has a set of text documents from some known topics. He wants to find out what words are associated or correlated with each topic
8.3 Association Rule Mining
• CAR, example
– A text document data set
• doc 1: Student, Teach, School : Education
• doc 2: Student, School : Education
• doc 3: Teach, School, City, Game : Education
• doc 4: Baseball, Basketball : Sport
• doc 5: Basketball, Player, Spectator : Sport
• doc 6: Baseball, Coach, Game, Team : Sport
• doc 7: Basketball, Team, City, Game : Sport
– Let minsup = 20% and minconf = 60%. Examples of class association rules:
• Student, School ⟶ Education [sup= 2/7, conf = 2/2]
• Game ⟶ Sport [sup= 2/7, conf = 2/3]
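The support and confidence of a class association rule can be checked directly against the seven documents. The sketch below assumes each document is stored as a (word set, class label) pair; `rule_item_stats` is a hypothetical helper name.

```python
docs = [
    ({"Student", "Teach", "School"}, "Education"),
    ({"Student", "School"}, "Education"),
    ({"Teach", "School", "City", "Game"}, "Education"),
    ({"Baseball", "Basketball"}, "Sport"),
    ({"Basketball", "Player", "Spectator"}, "Sport"),
    ({"Baseball", "Coach", "Game", "Team"}, "Sport"),
    ({"Basketball", "Team", "City", "Game"}, "Sport"),
]

def rule_item_stats(condset, label, docs):
    """Return (support, confidence) of the rule condset -> label."""
    condset = set(condset)
    cond = sum(condset <= words for words, _ in docs)
    both = sum(condset <= words and y == label for words, y in docs)
    return both / len(docs), both / cond

print(rule_item_stats({"Student", "School"}, "Education", docs))  # sup 2/7, conf 2/2
print(rule_item_stats({"Game"}, "Sport", docs))                   # sup 2/7, conf 2/3
```

Both rules pass minsup = 20% and minconf = 60%.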
8.3 Class Association Rules
• CAR algorithm
– Unlike normal association rules, CARs can be mined directly in one step
– The key operation is to find all rule-items that have support above minsup
• A rule-item is of the form (condset, y), where condset is a set of items from I (i.e., condset ⊆ I), and y ∈ Y is a class label where I ⋂ Y = ∅
– Each transaction basically represents a rule
• condset ⟶ y
– The Apriori algorithm can be modified to generate CARs
8.3 Class Association Rules
• CAR can also be extended with multiple minimum supports
– The user can specify different minimum supports for different classes, which effectively assigns a different minimum support to the rules of each class
• E.g., a data set with two classes, Yes and No. We may want rules of class Yes to have the minimum support of 5% and rules of class No to have the minimum support of 10%
– By setting minimum class supports to 100% we can skip generating rules of those classes
8.3 Class Association Rules
• Tools
– Open source projects
• Weka
• RapidMiner
– Commercial
• Intelligent Miner, replaced by DB2 Data Warehouse Editions
• PASW Modeler, developed by SPSS
• Oracle Data Mining (ODM)
8.3 Association Rule Mining
• Apriori algorithm, on car characteristics data set
– Class values: unacc., acc., good, vgood
– Attributes:
• Buying cost: vhigh, high, med, low
• Maintenance costs: vhigh, high, med, low
• Number of doors: 2, 3, 4, 5 or more
• Persons: 2, 4, more
• Safety: low, med, high
8.3 Association Rule Mining
• Apriori algorithm
– Number of rules
– Support interval
• Upper and lower bound
– Class index
– Confidence
8.3 Association Rule Mining
• Apriori algorithm
– Largest frequent itemsets comprise 3 items
– Most powerful rules are simple rules
• Most people find 2-person cars unacceptable
8.3 Association Rule Mining
– Lower confidence rule (62%)
• If a 4-seat car is found unacceptable, then it is because it is unsafe (rule 30)
8.3 Association Rule Mining
• Open source projects also have their limits
– Car accidents data set
• 350 000 rows
• 54 attributes
8.3 Association Rule Mining
• Business Intelligence Overview
– Customer segmentation, propensity to buy, customer profitability, attrition, etc.
• Data Mining Overview
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
• Association Rule Mining
– Apriori algorithm, support, confidence, downward closure property
– Multiple minimum supports solve the “rare-item” problem
– Head-item problem