
DW & DM – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig


Academic year: 2021


Data Warehousing & Data Mining

Wolf-Tilo Balke, Kinda El Maarry

Institut für Informationssysteme, Technische Universität Braunschweig

http://www.ifis.cs.tu-bs.de

9. Business Intelligence

9.1 Business Intelligence Overview

9.2 Principles of Data Mining

9.3 Association Rule Mining


9. Business Intelligence

What is Business Intelligence (BI)?

–The process, technologies and tools needed to turn data into information, information into knowledge, and knowledge into plans that drive profitable business action

–BI comprises data warehousing, business analytic tools, and content/knowledge management


9.1 BI Overview

Typical BI applications are

–Customer segmentation

–Propensity to buy (customer disposition to buy)

–Customer profitability

–Fraud detection

–Customer attrition (loss of customers)

–Channel optimization (connecting with the customer)


9.1 BI Overview

Customer segmentation

–What market segments do my customers fall into, and what are their characteristics?

–Personalize customer relationships for higher customer satisfaction and retention

9.1 BI Overview

Propensity to buy

–Which customers are most likely to respond to my promotion?

–Target the right customers

Increase campaign profitability by focusing on the customers most likely to buy

9.1 BI Overview


Customer profitability

–What is the lifetime profitability of my customer?

–Make individual business interaction decisions based on the overall profitability of customers


9.1 BI Overview

Fraud detection

–How can I tell which transactions are likely to be fraudulent?

If your wife has just proposed to increase your life insurance policy, you should probably order pizza for a while

–Quickly determine fraud and take immediate action to minimize damage


9.1 BI Overview

Customer attrition

–Which customer is at risk of leaving?

–Prevent loss of high-value customers and let go of lower-value customers

Channel optimization

–What is the best channel to reach my customer in each segment?

–Interact with customers based on their preference and your need to manage cost


9.1 BI Overview


Automated decision tools

–Rule-based systems that provide a solution, usually in one functional area, to a specific repetitive management problem in one industry

E.g., automated loan approval, intelligent price setting

Business performance management (BPM)

–A framework for defining, implementing and managing an enterprise’s business strategy by linking objectives with factual measures (key performance indicators)


9.1 BI Overview

Dashboards

–Provide a comprehensive visual view of corporate performance measures, trends, and exceptions from multiple business areas

Allow executives to see hot spots in seconds and explore the situation

9.1 BI Overview

What is data mining (knowledge discovery in databases)?

–Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

9.2 Data Mining


Market analysis

–Targeted marketing / customer profiling

Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.

–Determine customer purchasing patterns over time

–Cross-market analysis

Associations/co-relations between product sales

Prediction based on the association of information

–…


9.2 Applications

Corporate analysis and risk management

–Finance planning and asset evaluation

Cash flow analysis and prediction

Trend analysis, time series, etc.

–Resource planning

Summarize and compare the resources and spending

–Competition

Monitor competitors and market directions

Group customers into classes and apply a class-based pricing procedure

Set pricing strategy in a highly competitive market


9.2 Applications

Architecture of DM systems


9.2 Data Mining

[Figure: architecture of a data mining system. Components shown: graphical user interface; pattern evaluation; data mining engine; knowledge base; database or data warehouse server; ETL / filtering; databases and data warehouse]

Association (correlation and causality)

Multi-dimensional vs. single-dimensional association

age(X, “20..29”) ∧ income(X, “20..29K”) → buys(X, “PC”) [support = 2%, confidence = 60%]

contains(T, “computer”) → contains(T, “software”) [support = 1%, confidence = 75%]

Classification and prediction

Finding models (functions) that describe and distinguish classes or concepts for future predictions

Presentation: decision tree, classification rule, neural network

Prediction: predict some unknown or missing numerical values


9.2 Data Mining Techniques

Cluster analysis

–Class label is unknown: group data to form new classes, e.g., advertising based on client groups

–Clustering is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity

Outlier analysis

–Outlier: a data object that does not comply with the general behavior of the data

–Can be considered as noise or exception, but is quite useful in fraud detection, rare events analysis

9.2 Data Mining Techniques

Association rule mining has the objective of finding all co-occurrence relationships (called associations), among data items

–Classical application: market basket data analysis, which aims to discover how items are purchased by customers in a supermarket

E.g., Cheese → Wine [support = 10%, confidence = 80%]

meaning that 10% of the customers buy cheese and wine together, and 80% of customers buying cheese also buy wine

9.3 Association Rule Mining


Basic concepts of association rules

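–Let $I = \{i_1, i_2, \dots, i_m\}$ be a set of items.

Let $T = \{t_1, t_2, \dots, t_n\}$ be a set of transactions, where each transaction $t_i$ is a set of items such that $t_i \subseteq I$.

–An association rule is an implication of the form:

$X \rightarrow Y$, where $X \subset I$, $Y \subset I$ and $X \cap Y = \emptyset$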


9.3 Association Rule Mining

Association rule mining: market basket analysis example

–I: set of all items sold in a store

E.g., $i_1$ = Beef, $i_2$ = Chicken, $i_3$ = Cheese, …

–T: set of transactions

The content of a customer's basket

E.g., $t_1$: Beef, Chicken, Milk; $t_2$: Beef, Cheese; $t_3$: Cheese, Wine; $t_4$: …

–An association rule might be

Beef, Chicken → Milk, where {Beef, Chicken} is X and {Milk} is Y


9.3 Association Rule Mining

Rules can be weak or strong

–The strength of a rule is measured by its support and confidence

–The support of a rule $X \rightarrow Y$ is the percentage of transactions in T that contain both X and Y

Can be seen as an estimate of the probability $\Pr(X \cup Y \subseteq t_i)$

With n as the number of transactions in T, the support of the rule is:

$\mathit{support} = |\{i \mid X \cup Y \subseteq t_i\}| \, / \, n$


9.3 Association Rule Mining

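–The confidence of a rule $X \rightarrow Y$ is the percentage of transactions in T containing X that also contain $X \cup Y$

Can be seen as an estimate of the probability $\Pr(Y \subseteq t_i \mid X \subseteq t_i)$

$\mathit{confidence} = |\{i \mid X \cup Y \subseteq t_i\}| \, / \, |\{j \mid X \subseteq t_j\}|$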


9.3 Association Rule Mining

How do we interpret support and confidence?

–If support is too low, the rule may just occur due to chance

Acting on a rule with low support may not be profitable since it covers too few cases

–If confidence is too low, we cannot reliably predict Y from X

The objective of mining association rules is to discover all association rules in T whose support and confidence are at least the user-specified minimum thresholds (minsup, minconf)

9.3 Association Rule Mining

Finding rules based on support and confidence thresholds

–Let minsup = 30% and minconf = 80%

–Chicken, Clothes → Milk is valid [sup = 3/7 (42.86%), conf = 3/3 (100%)]

–Clothes → Milk, Chicken is also valid, and there are more…

9.3 Association Rule Mining

Transactions

T1  Beef, Chicken, Milk
T2  Beef, Cheese
T3  Cheese, Boots
T4  Beef, Chicken, Cheese
T5  Beef, Chicken, Clothes, Cheese, Milk
T6  Clothes, Chicken, Milk
T7  Chicken, Milk, Clothes
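The support and confidence values quoted for this transaction table can be checked with a small Python sketch. The function names `support` and `confidence` are illustrative choices, not part of the lecture; the data is the seven-transaction table above.

```python
# The seven example transactions from the slide, one set per basket.
transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Clothes", "Chicken", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """support(X union Y) / support(X) for the rule X -> Y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

sup = support({"Chicken", "Clothes", "Milk"}, transactions)
conf = confidence({"Chicken", "Clothes"}, {"Milk"}, transactions)
print(f"sup = {sup:.4f}, conf = {conf:.2f}")
```

With minsup = 30% and minconf = 80%, both thresholds are met for Chicken, Clothes → Milk.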


This is a rather simplistic view of shopping baskets

–Some important information is not considered, e.g., the quantity of each item purchased, the price paid, …

There are a large number of rule mining algorithms

–They use different strategies and data structures

–For a given T, minsup and minconf, their resulting sets of rules are all the same


9.3 Association Rule Mining

Approaches in association rule mining

–Apriori algorithm

–Mining with multiple minimum supports

–Mining class association rules

The best known mining algorithm is the Apriori algorithm

–Step 1: find all frequent itemsets (set of items with support ≥ minsup)

–Step 2: use frequent itemsets to generate rules


9.3 Association Rule Mining

Step 1: frequent itemset generation

–The key is the apriori property (downward closure property): any subset of a frequent itemset is also a frequent itemset

E.g., for minsup = 30%


9.3 Apriori Algorithm: Step 1

Transactions

T1  Beef, Chicken, Milk
T2  Beef, Cheese
T3  Cheese, Boots
T4  Beef, Chicken, Cheese
T5  Beef, Chicken, Clothes, Cheese, Milk
T6  Clothes, Chicken, Milk
T7  Chicken, Milk, Clothes

[Figure: the frequent itemset {Chicken, Clothes, Milk} and its subsets {Chicken, Clothes}, {Chicken, Milk}, {Clothes, Milk}, {Chicken}, {Clothes}, {Milk}]

Finding frequent items

–Find all 1-item frequent itemsets; then all 2-item frequent itemsets, etc.

–In each iteration k, only consider itemsets that contain a frequent (k-1)-itemset

–Optimization: the algorithm assumes that items are sorted in lexicographic order

The order is used throughout the algorithm in each itemset

{w[1], w[2], …, w[k]} represents a k-itemset w consisting of items w[1], w[2], …, w[k], where w[1] < w[2] < … < w[k] according to the lexicographic order


9.3 Apriori Algorithm: Step 1

Initial step

Find frequent itemsets of size 1: F1

Generalization, k ≥ 2

$C_k$ = candidates of size k: those itemsets of size k that could be frequent, given $F_{k-1}$

$F_k$ = those itemsets that are actually frequent, $F_k \subseteq C_k$ (need to scan the database once)

9.3 Finding frequent items

Generalization of candidates uses $F_{k-1}$ as input and returns a superset (candidates) of the set of all frequent k-itemsets. It has two steps:

Join step: generate all possible candidate itemsets $C_k$ of length k, e.g., $I_k = \mathrm{join}(A_{k-1}, B_{k-1})$ with $A_{k-1} = \{i_1, i_2, \dots, i_{k-2}, i_{k-1}\}$, $B_{k-1} = \{i_1, i_2, \dots, i_{k-2}, i'_{k-1}\}$ and $i_{k-1} < i'_{k-1}$; then $I_k = \{i_1, i_2, \dots, i_{k-2}, i_{k-1}, i'_{k-1}\}$

Prune step: remove those candidates in $C_k$ that do not respect the downward closure property (i.e., that include non-frequent (k-1)-subsets)

9.3 Apriori Algorithm: Step 1


–Generalization e.g., F3= {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}

Try joining each 2 candidates from F3


9.3 Apriori Algorithm: Step 1

[Figure: join step on F3. {1, 2, 3} and {1, 2, 4} join to {1, 2, 3, 4}; {1, 3, 4} and {1, 3, 5} join to {1, 3, 4, 5}; no other pair of itemsets in F3 shares its first two items]

After join: C4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}

Pruning:

After pruning: C4 = {{1, 2, 3, 4}}
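The join and prune steps can be sketched in Python as follows. This is a minimal illustration, assuming itemsets are stored as sorted tuples so that the lexicographic order is implicit; the function name `apriori_gen` is a hypothetical choice.

```python
from itertools import combinations

def apriori_gen(frequent_prev):
    """Join + prune: build candidate k-itemsets from frequent (k-1)-itemsets.

    Itemsets are sorted tuples, so the lexicographic item order is implicit.
    """
    prev = sorted(frequent_prev)
    prev_set = set(prev)
    candidates = []
    for a, b in combinations(prev, 2):
        # Join: both itemsets share their first k-2 items; since `prev` is
        # sorted, the last item of a is smaller than the last item of b.
        if a[:-1] == b[:-1]:
            cand = a + (b[-1],)
            # Prune: every (k-1)-subset of the candidate must be frequent.
            if all(s in prev_set for s in combinations(cand, len(cand) - 1)):
                candidates.append(cand)
    return candidates

F3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
C4 = apriori_gen(F3)
print(C4)  # [(1, 2, 3, 4)]
```

The candidate (1, 3, 4, 5) is produced by the join but dropped by the prune check, because (1, 4, 5) and (3, 4, 5) are not in F3.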


9.3 Apriori Algorithm: Step 1

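[Figure: prune step, with F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}]

{1, 2, 3, 4}: its 3-subsets {2, 3, 4}, {1, 2, 3}, {1, 2, 4}, {1, 3, 4} are all in F3, so it is a good candidate

{1, 3, 4, 5}: of its 3-subsets {3, 4, 5}, {1, 3, 4}, {1, 3, 5}, {1, 4, 5}, the subsets {3, 4, 5} and {1, 4, 5} are not in F3, so it is removed from C4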

Finding frequent items, example, minsup = 0.5

–First T scan ({item}:count)

C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3

F1: {1}:2, {2}:3, {3}:3, {5}:3;

{4} has a support of ¼ < 0.5 so it does not belong to the frequent items

C2= prune(join(F1))

join : {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5};

prune: C2 : {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}; (all items belong to F1)


9.3 Apriori Algorithm: Step 1

TID   Items
T100  1, 3, 4
T200  2, 3, 5
T300  1, 2, 3, 5
T400  2, 5

–Second T scan

C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2

F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2

Join: we could join {1, 3} only with {1, 4} or {1, 5}, but they are not in F2. The only possible join in F2 is {2, 3} with {2, 5}, resulting in {2, 3, 5};

prune({2, 3, 5}): {2, 3}, {2, 5}, {3, 5} all belong to F2, hence C3: {2, 3, 5}

–Third T scan

{2, 3, 5}:2, then sup({2, 3, 5}) = 50%, minsup condition is fulfilled. Then F3: {2, 3, 5}
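The three scans of this example can be reproduced with a compact level-wise search. The sketch below is one possible Python rendering, not the lecture's reference implementation; `frequent_itemsets` is a hypothetical name and itemsets are again kept as sorted tuples.

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Level-wise Apriori search; returns {itemset (sorted tuple): support count}."""
    n = len(transactions)
    # First scan: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    level = {s: c for s, c in counts.items() if c / n >= minsup}
    result = dict(level)
    while level:
        # Candidate generation: join, then prune by downward closure.
        prev = sorted(level)
        candidates = []
        for a, b in combinations(prev, 2):
            if a[:-1] == b[:-1]:
                cand = a + (b[-1],)
                if all(s in level for s in combinations(cand, len(cand) - 1)):
                    candidates.append(cand)
        # One scan over the data per level to count the candidates.
        level = {}
        for cand in candidates:
            count = sum(set(cand) <= t for t in transactions)
            if count / n >= minsup:
                level[cand] = count
        result.update(level)
    return result

T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
F = frequent_itemsets(T, minsup=0.5)
for itemset in sorted(F, key=lambda s: (len(s), s)):
    print(itemset, F[itemset])
```

For the four transactions above this yields four frequent 1-itemsets, four frequent 2-itemsets and the single frequent 3-itemset (2, 3, 5) with count 2.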


9.3 Apriori Algorithm: Step 1


Step 2: generating rules from frequent itemsets

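–Frequent itemsets are not the same as association rules

–One more step is needed to generate association rules: for each frequent itemset I, for each proper nonempty subset X of I:

Let $Y = I \setminus X$; $X \rightarrow Y$ is an association rule if:

$\mathit{Confidence}(X \rightarrow Y) \geq \mathit{minconf}$,

$\mathit{Support}(X \rightarrow Y) := |\{i \mid X \cup Y \subseteq t_i\}| \, / \, n = \mathit{support}(I)$

$\mathit{Confidence}(X \rightarrow Y) := |\{i \mid X \cup Y \subseteq t_i\}| \, / \, |\{j \mid X \subseteq t_j\}| = \mathit{support}(I) \, / \, \mathit{support}(X)$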

9.3 Apriori Algorithm: Step 2

• Rule generation example, minconf = 50%

–Suppose {2, 3, 5} is a frequent itemset, with sup = 50%, as calculated in step 1

–Proper nonempty subsets: {2, 3}, {2, 5}, {3, 5}, {2}, {3}, {5}, with sup = 50%, 75%, 50%, 75%, 75%, 75% respectively

–These generate the following association rules:

2, 3 → 5, confidence = 100% (sup(I) = 50%; sup({2, 3}) = 50%; 50/50 = 1)

2, 5 → 3, confidence = 67% (50/75)

3, 5 → 2, confidence = 100% (…)

2 → 3, 5, confidence = 67%

3 → 2, 5, confidence = 67%

5 → 2, 3, confidence = 67%

–All rules have support = support(I) = 50%
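A short Python sketch of this rule generation step, assuming the support counts from step 1 are available in a dictionary keyed by sorted tuples (the name `rules_from_itemset` and the dictionary layout are illustrative assumptions):

```python
from itertools import combinations

# Support counts recorded during frequent itemset generation (n = 4 transactions).
support_count = {
    (2,): 3, (3,): 3, (5,): 3,
    (2, 3): 2, (2, 5): 3, (3, 5): 2,
    (2, 3, 5): 2,
}

def rules_from_itemset(itemset, support_count, minconf):
    """Return (X, Y, confidence) for every rule X -> Y with Y = I minus X."""
    rules = []
    for size in range(1, len(itemset)):
        for x in combinations(itemset, size):
            y = tuple(i for i in itemset if i not in x)
            # confidence = support(I) / support(X); no data scan is needed.
            conf = support_count[itemset] / support_count[x]
            if conf >= minconf:
                rules.append((x, y, conf))
    return rules

rules = rules_from_itemset((2, 3, 5), support_count, minconf=0.5)
for x, y, conf in rules:
    print(f"{x} -> {y}: confidence = {conf:.0%}")
```

All six rules pass minconf = 50%; raising minconf to 80% would keep only the two rules with 100% confidence.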

9.3 Apriori Algorithm: Step 2


Rule generation, summary

–In order to obtain $X \rightarrow Y$, we need to know support($X \cup Y$) and support(X)

–All the required information for confidence computation has already been recorded in itemset generation

No need to read the transactions data any more

This step is not as time-consuming as frequent itemsets generation


9.3 Apriori Algorithm: Step 2

Apriori Algorithm, summary

–If k is the size of the largest itemset, then it makes at most k passes over the data (in practice, k is bounded, e.g., by 10)

–The mining exploits sparseness of data, and high minsup and minconf thresholds

–High minsup threshold makes it impossible to find rules involving rare items in the data.

The solution is a mining with multiple minimum supports approach


9.3 Apriori Algorithm

Mining with multiple minimum supports

–Single minimum support assumes that all items in the data are of the same nature and/or have similar frequencies, which is incorrect…

–In practice, some items appear very frequently in the data, while others rarely appear

E.g., in a supermarket, people buy cooking pans much less frequently than they buy bread and milk


9.3 Multiple Minimum Supports

Rare item problem: if the frequencies of items vary significantly, we encounter two problems

–If minsup is set too high, those rules that involve rare items will not be found

–To find rules that involve both frequent and rare items, minsup has to be set very low. This may cause combinatorial explosion because those frequent items will be associated with one another in all possible ways


9.3 Multiple Minimum Supports

Multiple Minimum Supports

–Each item can have a minimum item support

Different support requirements for different rules

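–To prevent very frequent items and very rare items from appearing in the same itemset, we introduce a support difference constraint ($\varphi$)

$\max_{i \in s}\{\mathit{sup}(i)\} - \min_{i \in s}\{\mathit{sup}(i)\} \leq \varphi$ for each itemset s,

where $0 \leq \varphi \leq 1$ is user specified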

9.3 Multiple Minimum Supports

Minsup of a rule

–Let MIS(i) be the minimum item support (MIS) value of item i. The minsup of a rule R is the lowest MIS value of the items in the rule:

Rule $R: i_1, i_2, \dots, i_k \rightarrow i_{k+1}, \dots, i_r$ satisfies its minimum support if its actual support is $\geq \min(\mathit{MIS}(i_1), \mathit{MIS}(i_2), \dots, \mathit{MIS}(i_r))$

E.g., the user-specified MIS values are as follows:

MIS(bread) = 2%, MIS(shoes) = 0.1%, MIS(clothes) = 0.2%

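clothes → bread [sup = 0.15%, conf = 70%] doesn’t satisfy its minsup, since min(MIS(clothes), MIS(bread)) = 0.2% > 0.15%

clothes → shoes [sup = 0.15%, conf = 70%] satisfies its minsup, since min(MIS(clothes), MIS(shoes)) = 0.1% ≤ 0.15%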
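The minsup check for a rule under multiple minimum supports can be written down directly. In this sketch the helper name `satisfies_minsup` is hypothetical, and percentages are expressed as fractions (2% = 0.02).

```python
# User-specified minimum item supports from the example, as fractions.
MIS = {"bread": 0.02, "shoes": 0.001, "clothes": 0.002}

def satisfies_minsup(rule_items, actual_support, mis):
    """A rule's minsup is the lowest MIS value among all items in the rule."""
    return actual_support >= min(mis[i] for i in rule_items)

# clothes -> bread with sup = 0.15%
print(satisfies_minsup({"clothes", "bread"}, 0.0015, MIS))
# clothes -> shoes with sup = 0.15%
print(satisfies_minsup({"clothes", "shoes"}, 0.0015, MIS))
```

The first call prints False and the second True, matching the two example rules.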

9.3 Multiple Minimum Supports


Downward closure property is not valid anymore

–E.g., consider four items 1, 2, 3 and 4 in a database

Their minimum item supports are

MIS(1) = 10%, MIS(2) = 20%, MIS(3) = 5%, MIS(4) = 6%

{1, 2} with a support of 9% is infrequent since min(10%, 20%) > 9%, but {1, 2, 3} could be frequent if it had a support of, e.g., 7%, since min(10%, 20%, 5%) = 5%

If applied, downward closure eliminates {1, 2}, so that {1, 2, 3} is never evaluated


9.3 Multiple Minimum Supports

How do we solve the downward closure property problem?

–Sort all items in I according to their MIS values (make it a total order)

The order is used throughout the algorithm in each itemset

–Each itemset w is of the following form:

{w[1], w[2], …, w[k]}, consisting of items w[1], w[2], …, w[k], where MIS(w[1]) ≤ MIS(w[2]) ≤ … ≤ MIS(w[k])


9.3 Multiple Minimum Supports

Multiple minimum supports is an extension of the Apriori algorithm

–Step 1: frequent itemset generation

Initial step

Produce the seeds for generating candidate itemsets

Candidate generation

For k = 2

Generalization

For k > 2, the pruning step differs from the Apriori algorithm

–Step 2: rule generation


9.3 Multiple Minimum Supports

Step 1: frequent itemset generation

–E.g., I = {1, 2, 3, 4}, with given MIS(1) = 10%, MIS(2) = 20%, MIS(3) = 5%, MIS(4) = 6%, and consider n = 100 transactions:

Initial step

Sort I according to the MIS value of each item. Let M represent the sorted items

Sorting I gives M = {3, 4, 1, 2}

Scan the data once to record the support count of each item

E.g., {3}:6, {4}:3, {1}:9 and {2}:25


9.3 Multiple Minimum Supports: Step 1

Go through the items in M to find the first item i that meets its own MIS(i), i.e., sup(i) ≥ MIS(i). Insert it into a list of seeds L

For each subsequent item j in M (after i), if sup(j) ≥ MIS(i), then insert j into L

MIS(3) = 5%; sup({3}) = 6%; sup({3}) > MIS(3), so L = {3}

sup({4}) = 3% < MIS(3), so L remains {3}

sup({1}) = 9% > MIS(3), so L = {3, 1}

sup({2}) = 25% > MIS(3), so L = {3, 1, 2}

Calculate F1 from L based on the MIS of each item in L

F1 = {{3}, {2}}, since sup({1}) = 9% < MIS(1)

• Why not eliminate {1} directly? Why calculate L and not directly F?

–Downward closure property is not valid from F anymore due to multiple minimum supports
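The initial pass (sorting by MIS, building the seed list L, then deriving F1) can be sketched as below. The function name `init_pass` is an assumption for illustration; the MIS values and support counts are the ones from the running example.

```python
MIS = {1: 0.10, 2: 0.20, 3: 0.05, 4: 0.06}
count = {1: 9, 2: 25, 3: 6, 4: 3}   # support counts from one data scan
n = 100                              # number of transactions

def init_pass(mis, count, n):
    """Return (M, L, F1): sorted items, seed list, frequent 1-itemsets."""
    M = sorted(mis, key=lambda i: mis[i])  # items by ascending MIS
    L = []
    threshold = None  # MIS of the first item that meets its own MIS
    for item in M:
        sup = count[item] / n
        if threshold is None:
            if sup >= mis[item]:
                threshold = mis[item]
                L.append(item)
        elif sup >= threshold:
            L.append(item)
    # F1 keeps only those seeds that meet their own MIS.
    F1 = [i for i in L if count[i] / n >= mis[i]]
    return M, L, F1

M, L, F1 = init_pass(MIS, count, n)
print(M, L, F1)  # [3, 4, 1, 2] [3, 1, 2] [3, 2]
```

Item 1 stays in L although it is not in F1, which is what later allows {3, 1} to be tried as a level 2 candidate.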

9.3 Multiple Minimum Supports: Step 1

MIS(1) = 10%, MIS(2) = 20%, MIS(3) = 5%, MIS(4) = 6%, n = 100

{3}:6, {4}:3, {1}:9, {2}:25

Candidate generation, k = 2.

Let $\varphi$ = 10% (support difference)

Take each item (seed) from L in order.

Use L and not F1, because the downward closure property is no longer valid!

Test the chosen item against its MIS: sup({3}) ≥ MIS(3)

If true, then we can use this item to form a level 2 candidate

If not, then go to the next element in L

If true, e.g., sup({3}) = 6% ≥ MIS(3) = 5%, then try to form a level 2 candidate together with each of the next items in L, e.g., {3, 1}, then {3, 2}

9.3 Multiple Minimum Supports: Step 1

Items 1 2 3 4

MIS 10 20 5 6

SUP 9 25 6 3

L {3, 1, 2}


–{3, 1} is a candidate $:\Leftrightarrow \mathit{sup}(\{1\}) \geq \mathit{MIS}(3)$ and $|\mathit{sup}(\{3\}) - \mathit{sup}(\{1\})| \leq \varphi$

sup({1}) = 9%; MIS(3) = 5%; sup({3}) = 6%; $\varphi$ := 10%

9% > 5% and |6% - 9%| < 10%, thus C2 = {{3, 1}}

–Now try {3, 2}

sup({2}) = 25%; 25% > 5% but |6%-25%| > 10% so this candidate will be rejected due to the support difference constraint


9.3 Multiple Minimum Supports: Step 1


–Pick the next seed from L, i.e., 1 (needed to try {1, 2})

sup({1}) < MIS(1), so we cannot use 1 as a seed!

–Candidate generation for k = 2 remains C2 = {{3, 1}}

Now read the transaction list and calculate the support of each itemset in C2. Let’s assume sup({3, 1}) = 6%, which is larger than min(MIS(3), MIS(1)) = 5%

Thus F2 = {{3, 1}}
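The level 2 candidate generation with the support difference constraint can be sketched as follows. The function name `level2_candidates` is a hypothetical choice, and supports are given as fractions taken from the running example.

```python
def level2_candidates(L, mis, sup, phi):
    """Form 2-item candidates from the seed list L (ordered by ascending MIS)."""
    candidates = []
    for idx, seed in enumerate(L):
        if sup[seed] < mis[seed]:
            continue  # the seed does not meet its own MIS, skip it
        for other in L[idx + 1:]:
            # The partner must reach the seed's MIS, and the two supports
            # must not differ by more than phi.
            if sup[other] >= mis[seed] and abs(sup[other] - sup[seed]) <= phi:
                candidates.append((seed, other))
    return candidates

MIS = {1: 0.10, 2: 0.20, 3: 0.05, 4: 0.06}
sup = {1: 0.09, 2: 0.25, 3: 0.06, 4: 0.03}
C2 = level2_candidates([3, 1, 2], MIS, sup, phi=0.10)
print(C2)  # [(3, 1)]
```

{3, 2} is rejected by the support difference constraint, and seed 1 is skipped because sup({1}) < MIS(1), so {1, 2} is never formed.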


9.3 Multiple Minimum Supports: Step 1


Generalization, k > 2, uses $F_{k-1}$ as input and returns a superset (candidates) of the set of all frequent k-itemsets. It has two steps:

Join step: same as in the case of k = 2

$I_k = \mathrm{join}(A_{k-1}, B_{k-1})$ with $A_{k-1} = \{i_1, i_2, \dots, i_{k-2}, i_{k-1}\}$, $B_{k-1} = \{i_1, i_2, \dots, i_{k-2}, i'_{k-1}\}$, $i_{k-1} < i'_{k-1}$ and $|\mathit{sup}(i_{k-1}) - \mathit{sup}(i'_{k-1})| \leq \varphi$. Then $I_k = \{i_1, i_2, \dots, i_{k-2}, i_{k-1}, i'_{k-1}\}$

Prune step: for each (k-1)-subset s of $I_k$, if s is not in $F_{k-1}$, then $I_k$ can be removed from $C_k$ (it is not a good candidate).

There is however one exception to this rule: when s does not include the first item from $I_k$


9.3 Multiple Minimum Supports: Step 1

Generalization, k > 2, example: let’s consider F3 = {{1, 2, 3}, {1, 2, 5}, {1, 3, 4}, {1, 3, 5}, {1, 4, 5}, {1, 4, 6}, {2, 3, 5}}

After the join we obtain {1, 2, 3, 5}, {1, 3, 4, 5} and {1, 4, 5, 6} (we do not consider the support difference constraint)

After pruning we get C4 = {{1, 2, 3, 5}, {1, 3, 4, 5}}

{1, 2, 3, 5} is ok

{1, 3, 4, 5} is not deleted although {3, 4, 5} ∉ F3, because MIS(3) > MIS(1). If MIS(3) = MIS(1), it could be deleted

{1, 4, 5, 6} is deleted because {1, 5, 6} ∉ F3
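The modified prune step can be illustrated in Python. The slide does not give MIS values for items 1 to 6, so the values below are purely hypothetical (chosen so that MIS(3) > MIS(1)); candidates are assumed to be tuples already ordered by ascending MIS, and `ms_prune` is an illustrative name.

```python
from itertools import combinations

def ms_prune(candidates, frequent_prev, mis):
    """Prune step under multiple minimum supports.

    A missing (k-1)-subset s only disqualifies candidate c if s contains
    the first item c[0], or if MIS(c[1]) equals MIS(c[0]).
    """
    prev = set(frequent_prev)
    kept = []
    for c in candidates:
        ok = True
        for s in combinations(c, len(c) - 1):
            must_be_frequent = c[0] in s or mis[c[1]] == mis[c[0]]
            if must_be_frequent and s not in prev:
                ok = False
                break
        if ok:
            kept.append(c)
    return kept

F3 = [(1, 2, 3), (1, 2, 5), (1, 3, 4), (1, 3, 5), (1, 4, 5), (1, 4, 6), (2, 3, 5)]
joined = [(1, 2, 3, 5), (1, 3, 4, 5), (1, 4, 5, 6)]
# Hypothetical MIS values (not from the slides), ascending with the item number.
mis = {1: 0.01, 2: 0.02, 3: 0.03, 4: 0.04, 5: 0.05, 6: 0.06}
C4 = ms_prune(joined, F3, mis)
print(C4)  # [(1, 2, 3, 5), (1, 3, 4, 5)]
```

If the hypothetical values were changed so that MIS(3) = MIS(1), the missing subset (3, 4, 5) would count and (1, 3, 4, 5) would be removed as well.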


9.3 Multiple Minimum Supports: Step 1

Step 2: rule generation

–Downward closure property is not valid anymore, therefore we have frequent k-itemsets which contain non-frequent (k-1)-subsets

For those non-frequent itemsets we do not have the support value recorded

This problem arises when we form rules of the form A, B → C, where MIS(C) = min(MIS(A), MIS(B), MIS(C))

Conf(A, B → C) = sup({A, B, C}) / sup({A, B})

We have the frequency of {A, B, C} because it is frequent, but we don’t have the frequency needed to calculate the support of {A, B}, since it is not frequent by itself

This is called the head-item problem

9.3 Multiple Minimum Supports: Step 2

Rule generation example

–{Shoes, Clothes, Bread} is a frequent itemset since

MIS({Shoes, Clothes, Bread}) = 0.1% < sup({Shoes, Clothes, Bread}) = 0.12%

–However, {Clothes, Bread} is not, since neither Clothes nor Bread can seed frequent itemsets

So we may not be able to calculate the confidence of all rules depending on Shoes, i.e., the rules:

Clothes, Bread → Shoes

Clothes → Shoes, Bread

Bread → Shoes, Clothes

9.3 Multiple Minimum Supports: Step 2

Items    Bread  Clothes  Shoes
MIS (%)  2      0.2      0.1

Itemset  {Clothes, Bread}  {Shoes, Clothes, Bread}
SUP (%)  0.15              0.12


Head-item problem, e.g.:

–Clothes, Bread → Shoes;

–Clothes → Shoes, Bread;

–Bread → Shoes, Clothes.

If an item on the right-hand side of a rule has the minimum MIS (e.g., Shoes), we may not be able to calculate the confidence without reading the data again


9.3 Multiple Minimum Supports: Step 2

Advantages

–It is a more realistic model for practical applications

–The model enables us to find rare item rules, but without producing a huge number of meaningless rules with frequent items

–By setting MIS values of some items to 100% (or more), we can effectively instruct the algorithms not to generate rules only involving these items


9.3 Multiple Minimum Supports

Mining Class Association Rules (CAR)

–Normal association rule mining doesn’t have a target

It finds all possible rules that exist in data, i.e., any item can appear as a consequent or a condition of a rule

–However, in some applications, the user is interested in some targets

E.g., the user has a set of text documents from some known topics and wants to find out what words are associated or correlated with each topic


9.3 Association Rule Mining

• CAR, example

–A text document data set

doc 1: Student, Teach, School : Education

doc 2: Student, School : Education

doc 3: Teach, School, City, Game : Education

doc 4: Baseball, Basketball : Sport

doc 5: Basketball, Player, Spectator : Sport

doc 6: Baseball, Coach, Game, Team : Sport

doc 7: Basketball, Team, City, Game : Sport

–Let minsup = 20% and minconf = 60%. Examples of class association rules:

Student, School → Education [sup = 2/7, conf = 2/2]

Game → Sport [sup = 2/7, conf = 2/3]
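The two example rules can be verified on the document data set with a few lines of Python. The helper name `car_stats` is an assumption for illustration; each document is modeled as a pair of a word set and a class label.

```python
# The seven labeled documents from the example: (words, class).
docs = [
    ({"Student", "Teach", "School"}, "Education"),
    ({"Student", "School"}, "Education"),
    ({"Teach", "School", "City", "Game"}, "Education"),
    ({"Baseball", "Basketball"}, "Sport"),
    ({"Basketball", "Player", "Spectator"}, "Sport"),
    ({"Baseball", "Coach", "Game", "Team"}, "Sport"),
    ({"Basketball", "Team", "City", "Game"}, "Sport"),
]

def car_stats(condition, label, docs):
    """Support and confidence of the class association rule condition -> label."""
    covered = [cls for words, cls in docs if condition <= words]
    hits = sum(cls == label for cls in covered)
    return hits / len(docs), hits / len(covered)

print(car_stats({"Student", "School"}, "Education", docs))  # sup 2/7, conf 2/2
print(car_stats({"Game"}, "Sport", docs))                   # sup 2/7, conf 2/3
```

Both rules exceed minsup = 20% and minconf = 60%; the consequent is always a class label, never a word.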


9.3 Class Association Rules

CAR can also be extended with multiple minimum supports

–The user can specify different minimum supports for different classes, which effectively assigns a different minimum support to the rules of each class

E.g. a data set with two classes, Yes and No. We may want rules of class Yes to have the minimum support of 5% and rules of class No to have the minimum support of 10%

–By setting minimum class supports to 100% we can skip generating rules of those classes

9.3 Class Association Rules

Tools

–Open source projects

Weka

RapidMiner

–Commercial

Intelligent Miner, replaced by DB2 Data Warehouse Editions

PASW Modeler, developed by SPSS

Oracle Data Mining (ODM)

9.3 Association Rule Mining


Apriori algorithm, on CAR data-sets

–Class values: unacceptable, acceptable, good, very good

–And 6 attributes:

Buying cost: vhigh, high, med, low

Maintenance costs: vhigh, high, med, low


9.3 Association Rule Mining

Apriori algorithm

–Number of rules

–Support interval

Upper and lower bound

–Class index

–Confidence


9.3 Association Rule Mining

Apriori algorithm

–Largest frequent itemsets comprise 3 items

–Most powerful rules are simple rules

Most people find 2-person cars unacceptable


9.3 Association Rule Mining

–Lower confidence rule (62%)

If a 4-seat car is found unacceptable, then it is because it is unsafe (rule 30)


9.3 Association Rule Mining

Open source projects also have their limits

–Car accidents data set

350 000 rows

54 attributes

9.3 Association Rule Mining

• Business Intelligence Overview

–Customer segmentation, propensity to buy, customer profitability, attrition, etc.

• Data Mining Overview

–Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases

• Association Rule Mining

–Apriori algorithm, support, confidence, downward closure property

–Multiple minimum supports solve the “rare-item” problem

–Head-item problem

Summary


Data Mining

–Time Series data

Trend and Similarity Search Analysis

–Sequence Patterns


Next lecture
