

4.6.2 Experiment Setup and Benchmark Dataset

The setup of our CNN pattern retrieval experiments is as follows:

1. We constructed, similarly to the MAGNOSTICS pattern response evaluation, an appropriate classifier training dataset of less degenerated benchmark images, i.e., a degeneration level of ≤ 4% for point-swap and noise and ≤ 4% for index swaps. Our selection is detailed below. Table 4.2 gives an overview of the number of images that fulfill these properties.

2. We conducted a quasi 10-fold cross-validation, in which each repetition separates 20% of all benchmark data (1,114 images) for validation purposes.

3. We derived the following performance measures and meta information for each trial:

precision, recall, F1 score, F2 score, CNN training time, and the file ranking with the final binary pattern decision and classification score.
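The per-trial retrieval measures listed above can be sketched as follows. This is a minimal illustration, not our actual evaluation code; the function names are ours, and the F-beta formula is the standard weighted harmonic mean (beta > 1 weights recall higher).

```python
def fbeta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall (beta > 1 favors recall)."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def binary_retrieval_scores(y_true, y_pred):
    """Compute precision, recall, F1, and F2 from binary pattern decisions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, fbeta(precision, recall, 1), fbeta(precision, recall, 2)
```

With ten folds and ten repetitions, these four values are computed once per trial and then averaged per experiment condition.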


Table 4.2 Composition of the evaluation dataset. The numbers in the table reflect the number of matrix images for a specific base pattern and modification method combination.

As stated above, we use the same less degenerated base pattern matrices for the MAGNOSTICS evaluation and the CNN experiments. Thus, we are able to compare the experiment results and examine the usefulness of both approaches in a comparative manner.

In particular, we use all matrix images with (1) a BlackWhitePointSwap modification level ≤ 4%, (2) an IndexSwap modification level ≤ 4%, (3) a Masking modification level ≤ 4%, and (4) 50 random noise images. Table 4.2 shows the pattern and degeneration degree distributions for the benchmark dataset used in the MAGNOSTICS approach and the CNN experiments.
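The selection rule above can be sketched as a simple filter over the benchmark metadata. The record fields (`modification`, `level`) are hypothetical stand-ins for our actual image metadata, used here only to make the rule explicit:

```python
def select_benchmark(images, max_level=0.04, n_noise=50):
    """Keep modified images whose degeneration level is at most max_level,
    plus up to n_noise random noise images (field names are illustrative)."""
    selected = [img for img in images
                if img["modification"] in ("BlackWhitePointSwap", "IndexSwap", "Masking")
                and img["level"] <= max_level]
    noise = [img for img in images if img["modification"] == "Noise"][:n_noise]
    return selected + noise
```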

To reach a statistically more significant evaluation, we repeated all experiments ten times, leaving us with 100 precision and recall values for each experiment condition (one base pattern). All experiments were conducted on the Scientific Compute Cluster in Konstanz (SCCKN), a platform for High Performance Computing (HPC) and High Throughput Computing (HTC, “Big Data”) at the University of Konstanz. In total, 1,366 h 21 min 35 s of computing time were used to train all CNN classifiers.

In the following, we report on the standard information retrieval performance measures: precision, recall, and the combined weighted harmonic mean measures of the former, F1 and F2 (the latter weighting recall higher than precision). We give further information whenever the experiment results are not obvious. Additionally, we report on the class training score progression, an indicator of how much certainty is added in each iteration of the classifier training.
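The class training score progression can be sketched as the per-iteration gain in the mean classification score over the training images. This is an illustrative computation, not our actual training instrumentation:

```python
def training_score_progression(scores_per_iteration):
    """Given a list of per-iteration classification scores (one list of
    per-image scores per training iteration), return the gain in mean
    score from one iteration to the next -- an indicator of how much
    certainty each iteration adds."""
    means = [sum(scores) / len(scores) for scores in scores_per_iteration]
    return [later - earlier for earlier, later in zip(means, means[1:])]
```

A flat progression (gains near zero) over many iterations is what we refer to below as a missing training convergence.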

Lastly, we conducted a multi-label classification CNN experiment in which the classifier differentiates between the six pattern class labels. This experiment is detailed in Figure 4.6.2.

CNN Experiment Results: Block Pattern

The Block CNN shows a good average retrieval performance, with a precision of 0.85 and a recall of 0.74. The average CNN training duration was 22:50:07 (hh:mm:ss) for 150 iterations.

Only one CNN experiment showed no training convergence, thus leading to degraded retrieval performance results.

Generally, Block patterns seem to be reasonably well recognizable with CNNs.

Figure 4.18 Results of the Block Pattern Retrieval CNN Experiment.

CNN Experiment Results: Off-Diagonal Block Pattern

The Off-Diagonal Block CNN shows a retrieval performance comparable to the Block CNN, with an average precision of 0.81 and a recall of 0.71. The average CNN training duration was 22:50:07 for 150 iterations. In total, 228 h 45 min 56 s of computing time were used for the training. All CNN experiments showed training convergence.

Generally, Off-Diagonal Blocks seem to be reasonably well recognizable with CNNs.

Figure 4.19 Results of the Off-Diagonal Block Pattern Retrieval CNN Experiment.

CNN Experiment Results: Star/Line Pattern

The Star/Line CNN shows a poor retrieval performance, with an average precision of 0.61 and a recall of just 0.36. The average CNN training duration was 22:45:25 for 150 iterations. In total, 227 h 34 min 17 s of computing time were used. Only four of the ten CNN experiments showed training convergence.

Generally, CNNs seem to have problems recognizing Star/Line patterns. This might be because the benchmark images are quite diverse and show a lot of pattern variability (differences in the number of lines, line width, etc.).

Figure 4.20 Results of the Star/Line Pattern Retrieval CNN Experiment.

CNN Experiment Results: Band Pattern

Band patterns can be recognized relatively well with CNN classifiers. The average precision over the ten experiment repetitions was 0.85, but the recall values averaged only 0.52.

This suggests that bands can be classified with quite reasonable certainty, but that they are often confused with the Line pattern.

Figure 4.21 Results of the Band Pattern Retrieval CNN Experiment.

The average CNN training duration was 22:43:36 for 150 iterations. In total, 227 h 16 min 02 s of computing time were used. All ten CNN experiments showed training convergence.

CNN Experiment Results: Noise (Anti-)Pattern

The noise anti-pattern CNN experiments showed consistently that CNNs are not able to capture what the term “noise” means in matrix plots. In other words, CNNs are built to represent the dominant structural characteristics of a plot; whenever these structures are highly variable or simply nonexistent, no discriminative feature can be extracted.

Figure 4.22 Results of the Noise Anti-Pattern Retrieval CNN Experiment.

The average training duration was 22:47:52. Although all ten experiments consistently showed training convergence, none of the CNNs was able to capture the noise anti-pattern.

CNN Experiment Results: Bandwidth (Anti-)Pattern

Similar to the Band pattern, the Bandwidth pattern shows average retrieval performance.

The average precision value is 0.82, but the average recall value is only 0.40, suggesting that the Bandwidth pattern is not well discriminated from other patterns. A closer inspection of the results reveals that Line and Band patterns appear in the upper certainty ranks of the classifier (false positives). The average training duration for the experiments was 22 h 45 min 35 s, which is significantly faster than all other experiments (in relative and absolute terms).
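The closer inspection mentioned above can be sketched as follows: sort the file ranking by classification score and list the true labels of false positives among the top ranks. The record layout (score, binary decision, true label) is illustrative, not our actual output format:

```python
def false_positives_in_top_ranks(ranking, target_label, k=10):
    """Return the true pattern labels of false positives among the k
    highest-scoring files. `ranking` is a list of
    (score, predicted_positive, true_label) tuples (illustrative layout)."""
    top = sorted(ranking, key=lambda entry: entry[0], reverse=True)[:k]
    return [label for score, predicted, label in top
            if predicted and label != target_label]
```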

Figure 4.23 Results of the Bandwidth Anti-Pattern Retrieval CNN Experiment.

CNN Experiment Results: Multi Pattern Classifier

We conducted one slightly adapted CNN experiment with the purpose of classifying the matrix images with respect to the patterns they contain. This is a multi-label classification problem, where the classifier outputs not a binary “existence” decision but rather a pattern label (Block, Off-Diagonal Block, Line, etc.).
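The change relative to the binary experiments can be illustrated with the classifier head: instead of a single pattern/no-pattern score, the network emits one score per pattern class, and the predicted label is the class with the highest softmax probability. The following is a minimal pure-Python sketch; the label list and ordering are illustrative:

```python
import math

PATTERN_LABELS = ["block", "off-diagonal block", "star/line",
                  "band", "bandwidth", "noise"]

def softmax(scores):
    """Convert raw class scores into a probability distribution."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(scores):
    """Return the pattern label with the highest softmax probability."""
    probs = softmax(scores)
    return PATTERN_LABELS[probs.index(max(probs))]
```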

Our setup differs from the aforementioned experiments in that 589 training iterations were conducted (training was terminated after a longer convergence phase). On average, each training pass took 47 h 40 min 9 s.

The retrieval results, depicted in Figure 4.24, are unexpectedly poor: only an average precision of 0.16 and an average recall of 0.16 were reached, leading to the conclusion that the CNN’s expressiveness may not be sufficient to reflect the pattern differences appropriately.

Figure 4.24 Results of the Multi Pattern Retrieval CNN Experiment.