# Data sets for association rule mining

| Scenario | Size of rule (number of relevant features) | Distribution of 0/1 | Noise | # Data sets |
|:-------------------------------------------:|:--------------------------------------------:|:-----------------------------:|:---------------:|:-----------:|
| 1_1 | 1 (color) | Equivalent (250/250) | - | 500 |
| 1_2 | 2 (color/pos) | Equivalent (250/250(125/125)) | - | 500 |
| 1_3 (= 1_2 with non-equivalent distribution) | 2 (color/pos) | Non-Equivalent (167/166/166) | - | 500 |
| 1_4 (= 1_1 with noise) | 1 (color) | Equivalent (250/250) | 1 (size, v = 5) | 500 |
| 1_5 (= 1_2 with noise) | 2 (color/pos) | Equivalent (250/250(125/125)) | 1 (size, v = 5) | 500 |
| 1_6 (= 1_3 with noise) | 2 (color/pos) | Non-Equivalent (167/166/166) | 1 (size, v = 5) | 500 |
| 1_7 (= 1_2 with 1000 data sets) | 2 (color/pos) | Equivalent (250/250(125/125)) | - | 1000 |
| 1_8 (= 1_1 with smooth 25% prob. noise) | 1 (color) | Equivalent (250/250) | 1 (size, v = 2) | 500 |
| 2_1 | 2 (shape/pos) | Equivalent (250/250(random)) | - | 500 |
| 2_2 (= 2_1 with noise) | 2 (shape/pos) | Equivalent (250/250(random)) | 1 (size, v = 3) | 500 |

# Rules:

Scenario 1_1, 1_4, 1_8: color is the decisive criterion (green object in scene => 0; blue object in scene => 1)

* easy example for proof of concept
* assumptions:
    * exactly one object is visible in each image
    * overall, two objects can occur in the scene
        * Object 1: green, rectangular, fixed size
        * Object 2: blue, rectangular, fixed size

Scenario 1_2, 1_3, 1_5, 1_6, 1_7: color and position are the decisive criteria (blue object in scene + position in the lower half of the table (y > scene_height / 2) => 1; otherwise 0)

* interesting aspect: dependency of the features: color AND position have to be in one rule
* assumptions:
    * exactly one object is visible in each image
    * overall, two objects can occur in the scene
        * Object 1: green, rectangular, fixed size
        * Object 2: blue, rectangular, fixed size
* Scenario 1_2 is the basic version.
* Scenario 1_3 is a variant of Scenario 1_2 with a non-equivalent distribution: 333 data sets are labelled negative and 167 data sets are labelled positive. The 333 negatively labelled data sets consist of 166 data sets in which a green object is placed and 167 data sets in which a blue object is placed in the wrong (upper) half of the table. In addition, a small amount of label noise is included: the negatively labelled data contain 9 records that should actually be positive (blue object positioned in the lower half of the table, but label "0" assigned).
* Scenario 1_5 is a variant of Scenario 1_2 with noise on the feature "size": the features "Sizes X" and "Sizes Y" vary by -5, -4, ..., +4, +5 points.
* Scenario 1_6 is a variant of Scenario 1_3 with noise on the feature "size": the features "Sizes X" and "Sizes Y" vary by -5, -4, ..., +4, +5 points. IN CONTRAST to Scenario 1_3, there is NO NOISE in the LABEL assignment.
* Scenario 1_7 is a variant of Scenario 1_2 that contains twice as many data sets, i.e., 1000 data sets.

Scenario 2_1, 2_2: shape and position are the decisive criteria (circular object in scene + position in the upper seventh of the table)

* interesting aspect: dependency of the features: circular AND position have to be in one rule
* assumptions:
    * exactly four objects are visible in each image
    * overall, four objects can occur in the scene
        * Object 1: red, circular, fixed size
        * Object 2: red, circular, fixed size
        * Object 3: red, rectangular, fixed size
        * Object 4: red, rectangular, fixed size

Note: The data in the Excel table are stored in the order in which they were inserted (ordered by label). They should be tested both ordered by label AND ordered by name.
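
The labelling rules and the size noise described above can be sketched in Python as follows. This is a minimal illustration, not the code used to generate the data sets. All names (`SceneObject`, `label_color`, `label_color_position`, `label_shape_position`, `add_size_noise`) are hypothetical. It assumes that `y` grows downward (so that the lower half is `y > scene_height / 2`, as stated above), that "upper seventh" means `y < scene_height / 7`, that one circular object in that region is enough for a positive label, and that the size noise is drawn uniformly from the integers `-v..+v`.

```python
import random
from dataclasses import dataclass


@dataclass
class SceneObject:
    color: str   # e.g. "green", "blue", "red"
    shape: str   # "rectangular" or "circular"
    y: float     # vertical position; assumed to grow downward


def label_color(obj: SceneObject) -> int:
    """Scenarios 1_1, 1_4, 1_8: blue object => 1, green object => 0."""
    return 1 if obj.color == "blue" else 0


def label_color_position(obj: SceneObject, scene_height: float) -> int:
    """Scenarios 1_2, 1_3, 1_5, 1_6, 1_7: blue AND in the lower half => 1."""
    in_lower_half = obj.y > scene_height / 2
    return 1 if obj.color == "blue" and in_lower_half else 0


def label_shape_position(objects: list[SceneObject], scene_height: float) -> int:
    """Scenarios 2_1, 2_2: a circular object in the upper seventh => 1.

    Assumption: the upper seventh is y < scene_height / 7, and a single
    matching circular object is sufficient.
    """
    return 1 if any(
        o.shape == "circular" and o.y < scene_height / 7 for o in objects
    ) else 0


def add_size_noise(size: int, v: int, rng: random.Random) -> int:
    """Vary a size feature by an integer offset in -v..+v.

    Assumption: offsets are drawn uniformly; the source only states the range.
    """
    return size + rng.randint(-v, v)


if __name__ == "__main__":
    rng = random.Random(42)
    obj = SceneObject(color="blue", shape="rectangular", y=80.0)
    print(label_color(obj), label_color_position(obj, scene_height=100.0))
    print(add_size_noise(20, v=5, rng=rng))
```

The position-dependent rules are kept as separate functions from the color-only rule so that the feature dependency (color AND position, or shape AND position) stays explicit in the code.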