# Data sets for association rule mining
| Scenario | Rule size (number of relevant features) | Distribution of 0/1 | Noise | # Data sets |
|:-------------------------------------------:|:--------------------------------------------:|:-----------------------------:|:---------------:|:-----------:|
| 1_1 | 1 (color) | Equivalent (250/250) | - | 500 |
| 1_2 | 2 (color/pos) | Equivalent (250/250(125/125)) | - | 500 |
| 1_3 (= 1_2 with non-equivalent distribution) | 2 (color/pos) | Non-Equivalent (167/166/166) | - | 500 |
| 1_4 (= 1_1 with noise) | 1 (color) | Equivalent (250/250) | 1 (size, v = 5) | 500 |
| 1_5 (= 1_2 with noise) | 2 (color/pos) | Equivalent (250/250(125/125)) | 1 (size, v = 5) | 500 |
| 1_6 (= 1_3 with noise) | 2 (color/pos) | Non-Equivalent (167/166/166) | 1 (size, v = 5) | 500 |
| 1_7 (= 1_2 with 1000 Data sets) | 2 (color/pos) | Equivalent (250/250(125/125)) | - | 1000 |
| 1_8 (= 1_1 with smooth 25% prob. noise) | 1 (color) | Equivalent (250/250) | 1 (size, v = 2) | 500 |
| 2_1 | 2 (shape/pos) | Equivalent (250/250(random)) | - | 500 |
| 2_2 (= 2_1 with noise) | 2 (shape/pos) | Equivalent (250/250(random)) | 1 (size, v = 3) | 500 |
# Rules:
Scenario 1_1, 1_4, 1_8: color is the decisive criterion (green object in scene => 0; blue object in scene => 1)
* easy example for proof of concept
* assumptions:
    * exactly one object is visible in each image
    * overall, two objects can occur in the scene
        * Object 1: green, rectangular, fixed size
        * Object 2: blue, rectangular, fixed size
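
The color rule for Scenario 1_1 can be sketched in a few lines. This is only an illustration: the function name `label_scenario_1_1` and the representation of a scene as a single color string are assumptions, not part of the data set.

```python
from collections import Counter

# Hypothetical record layout: one object per image, described only by its color.
LABEL_BY_COLOR = {"green": 0, "blue": 1}


def label_scenario_1_1(color: str) -> int:
    """Return 1 for a blue object and 0 for a green object."""
    return LABEL_BY_COLOR[color]


# Balanced toy data: 250 green and 250 blue scenes, matching the table.
colors = ["green"] * 250 + ["blue"] * 250
labels = [label_scenario_1_1(c) for c in colors]
label_counts = Counter(labels)  # how often each label occurs
```
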
Scenario 1_2, 1_3, 1_5, 1_6, 1_7: color and position are the decisive criteria (blue object in scene + position in the lower half of the table (y > scene_height / 2) => 1; otherwise 0)
* interesting aspect: Dependency of the features: color AND position have to be in one rule
* assumptions:
    * exactly one object is visible in each image
    * overall, two objects can occur in the scene
        * Object 1: green, rectangular, fixed size
        * Object 2: blue, rectangular, fixed size
* Scenario 1_2 is the basic version.
* Scenario 1_3 is a variant of Scenario 1_2 with a non-equivalent distribution: 333 data sets are labelled negatively and 167 data sets are labelled positively. The 333 negatively labelled data sets comprise 166 data sets in which a green object is placed and 167 data sets in which a blue object is placed in the wrong (upper) half of the table. In addition, a very small amount of label noise is included: the negatively labelled data contain 9 records that should actually be considered positive (blue object positioned in the lower half of the table, but label "0" is assigned).
* Scenario 1_5 is a variant of Scenario 1_2 with noise on the feature "size": the features "Sizes X" and "Sizes Y" vary by -5, -4, ..., +4, +5 points.
* Scenario 1_6 is a variant of Scenario 1_3 with noise on the feature "size": the features "Sizes X" and "Sizes Y" vary by -5, -4, ..., +4, +5 points. IN CONTRAST to Scenario 1_3, there is NO NOISE regarding the LABEL assignment.
* Scenario 1_7 is a variant of Scenario 1_2; in contrast to Scenario 1_2, it contains twice as many data sets, i.e., 1000 data sets.
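
The combined rule and the size noise described above could look roughly like the following sketch. The function names, the base size of 20, and the scene height of 100 are hypothetical, and the uniform distribution of the size offset is an assumption; the source only states the range of the offsets.

```python
import random


def label_scenario_1_2(color: str, y: float, scene_height: float) -> int:
    """Return 1 only for a blue object in the lower half (y > scene_height / 2)."""
    return int(color == "blue" and y > scene_height / 2)


def noisy_size(base_size: int, v: int, rng: random.Random) -> int:
    """Shift base_size by an integer offset in [-v, +v] (uniform: an assumption)."""
    return base_size + rng.randint(-v, v)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
scene_height = 100      # hypothetical scene height
# Size noise as in Scenarios 1_5 and 1_6 (v = 5); base size 20 is hypothetical.
sizes = [noisy_size(20, 5, rng) for _ in range(1000)]
```

Because the label depends on color and position together, neither feature alone separates the classes: a blue object in the upper half and a green object in the lower half both receive label 0.
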
Scenario 2_1, 2_2: shape and position are the decisive criteria (circular object in scene + position in the upper seventh of the table)
* interesting aspect: Dependency of the features: circular AND position have to be in one rule
* assumptions:
    * exactly four objects are visible in each image
    * overall, four objects can occur in the scene
        * Object 1: red, circular, fixed size
        * Object 2: red, circular, fixed size
        * Object 3: red, rectangular, fixed size
        * Object 4: red, rectangular, fixed size
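
A possible reading of the Scenario 2 rule is sketched below. Several points are assumptions rather than facts from the data set: that y grows downward (so the upper seventh is `y < scene_height / 7`, consistent with the lower half being `y > scene_height / 2` in Scenario 1_2), that a single circular object in that region is sufficient, and that the positive label is 1. `SceneObject` and `label_scenario_2` are hypothetical names.

```python
from typing import NamedTuple, Sequence


class SceneObject(NamedTuple):
    shape: str  # "circular" or "rectangular"
    y: float    # vertical position, assumed to grow downward


def label_scenario_2(objects: Sequence[SceneObject], scene_height: float) -> int:
    """Return 1 if any circular object lies in the upper seventh of the scene."""
    limit = scene_height / 7
    return int(any(o.shape == "circular" and o.y < limit for o in objects))


# Four objects per scene: two circular and two rectangular, as listed above.
example_scene = [
    SceneObject("circular", 5.0),
    SceneObject("circular", 40.0),
    SceneObject("rectangular", 3.0),
    SceneObject("rectangular", 60.0),
]
example_label = label_scenario_2(example_scene, scene_height=70.0)
```
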
Note: The data in the Excel table are in the order in which they were inserted (ordered by label). They should be tested both ordered by label and ordered by name.
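
The two orderings mentioned in the note can be produced as in this minimal sketch. The field names `name` and `label` and the example records are hypothetical; the actual column names in the Excel table may differ.

```python
# Hypothetical records; the real table may use different column names.
records = [
    {"name": "scene_003", "label": 0},
    {"name": "scene_001", "label": 1},
    {"name": "scene_002", "label": 0},
    {"name": "scene_004", "label": 1},
]

# Ordered by label (ties broken by name, so the result is deterministic).
by_label = sorted(records, key=lambda r: (r["label"], r["name"]))

# Ordered by name only, which interleaves the labels.
by_name = sorted(records, key=lambda r: r["name"])
```
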