
Fakultät für Physik


Institut für experimentelle Kernphysik

Prof. Dr. G. Quast, Prof. Dr. M. Feindt, Dr. A. Zupanc

Tutorial groups: G. Sieber, B. Kronenbitter, A. Heller. Issued: 21.06.2012

Computer exercise for the lecture Moderne Methoden der Datenanalyse
Exercise 7: Data Mining Cup: Cuts

Since 2000, a competition has been organised each year to extract information relevant to a specific task from a large amount of data: the Data Mining Cup (www.data-mining-cup.de). In 2005 the challenge was to predict whether a customer of an online shop would pay for their order or not. This is also the subject of this and the following exercises.

A text file containing a detailed description of the exercise can be found on the web page of this course. The other files needed for this exercise are provided there as well: a root file containing the training data, for which it is known whether the customer paid; a root file containing the test data with unknown customer behaviour; and a text file describing the variables in the datasets. These files are also available at:

/home/staff/zupanc/Datenanalyse/DMC.

The simplest and most intuitive way of performing a selection or classification is the application of cuts. For many problems this approach is entirely sufficient. And even where other methods are superior for more complex problems, like the one we are facing here, a cut-based study can help to understand the data.

• Exercise 7.1:

Explore the (training) data and try to find out which variables can be used to predict whether a customer will pay or not. Plot the distributions of the variables for good and bad customers.

Make profile plots of the target for all variables.

Study correlations of variables as well. A minimal ROOT sketch for these plots follows below.
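The plots of exercise 7.1 can be produced directly with TTree::Draw. The following macro is only a minimal sketch: the file name train.root, the target branch name target (assumed here to be 1 for bad and 0 for good customers) and the variable names someVariable and someOtherVariable are placeholders, since the sheet does not list them; only the tree name h1 is taken from exercise 7.2.

    // explore.C -- run with: root -l explore.C
    void explore() {
       TFile *f = TFile::Open("train.root");      // placeholder file name
       TTree *h1 = (TTree*)f->Get("h1");          // tree name as used in exercise 7.2

       // distributions of one variable for good and bad customers
       TCanvas *c1 = new TCanvas("c1", "distributions");
       h1->Draw("someVariable>>h_good(50)", "target==0");          // good customers
       h1->Draw("someVariable>>h_bad(50)",  "target==1", "same");  // bad customers, overlaid

       // profile plot: the mean of the target per bin is the fraction of bad customers
       TCanvas *c2 = new TCanvas("c2", "profile");
       h1->Draw("target:someVariable>>p_var(50)", "", "prof");

       // correlation between two input variables
       TCanvas *c3 = new TCanvas("c3", "correlation");
       h1->Draw("someVariable:someOtherVariable", "", "colz");
    }

Repeating the profile plot for every branch (for instance by looping over h1->GetListOfBranches()) gives a quick overview of which variables carry information about the target.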

• Exercise 7.2:

Invent some cuts for the classification of orders. Evaluate the quality of these cuts by calculating the score as defined in the detailed description of the exercise.

Use the MakeClass method of the TTree object h1 to generate the source file for this task.

Add the score calculation code to the Loop method of the generated tree analysis class (see the sketch below).
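One possible way to set this up, again only as a sketch with placeholder names: generate the analysis class once in an interactive ROOT session and then add the cut and the score bookkeeping to its Loop method. The cut on someVariable, the threshold and the score weights below are invented for illustration; the actual score has to be computed according to the formula given in the detailed exercise description.

    // in an interactive ROOT session on the training file:
    //   root -l train.root
    //   h1->MakeClass("DMCAnalysis")   // writes DMCAnalysis.h and DMCAnalysis.C

    // sketch of the modified Loop method in DMCAnalysis.C
    #include <iostream>

    void DMCAnalysis::Loop()
    {
       if (fChain == 0) return;
       Long64_t nentries = fChain->GetEntriesFast();

       const double someThreshold = 0.5;   // placeholder cut value
       double score = 0;

       for (Long64_t jentry = 0; jentry < nentries; jentry++) {
          if (LoadTree(jentry) < 0) break;
          fChain->GetEntry(jentry);

          // placeholder cut: classify the order as high risk (1) if someVariable is large
          int decision = (someVariable > someThreshold) ? 1 : 0;

          // compare with the known outcome on the training data and update the score;
          // replace these placeholder weights by the definition from the exercise text
          if      (decision == 1 && target == 1) score += 1.0;   // correctly flagged bad order
          else if (decision == 1 && target == 0) score -= 1.0;   // good order flagged by mistake
       }
       std::cout << "score: " << score << std::endl;
    }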

• Exercise 7.3:

Apply your cut-based classification to the test dataset in the root file class.root. Create a text file containing one order number and the corresponding decision per line, separated by a space. Use 1 for high-risk and 0 for low-risk orders.


Probably the simplest way to produce such a text file is to open the class.root file in the analysis class created in exercise 7.2 and print the two numbers to standard output in the Loop method. The output can then be redirected to a file with the > operator, as shown in the sketch below.
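A sketch of the corresponding Loop method when the analysis class is run over the test tree from class.root; orderNumber and someVariable are placeholders for the actual branch names:

    #include <iostream>

    void DMCAnalysis::Loop()
    {
       if (fChain == 0) return;
       Long64_t nentries = fChain->GetEntriesFast();
       const double someThreshold = 0.5;   // same placeholder cut as in exercise 7.2

       for (Long64_t jentry = 0; jentry < nentries; jentry++) {
          if (LoadTree(jentry) < 0) break;
          fChain->GetEntry(jentry);
          int decision = (someVariable > someThreshold) ? 1 : 0;   // 1 = high risk, 0 = low risk
          std::cout << orderNumber << " " << decision << std::endl;
       }
    }

If the macro that opens class.root and calls Loop is called, say, classify.C, the output can be written to a file with

    root -l -b -q classify.C > decisions.txt

where classify.C and decisions.txt are again just example names.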

• Exercise 7.4:

Give the text file produced in exercise 7.3 to a tutor, who will calculate a score for you.
