---
tags: machine learning
---
# ML-01-LeibetsederRiegerRoitherSeiberl
#### Team: Manuel Leibetseder, Maike Rieger, Andreas Roither, Markus Seiberl
## Part 1: Diamond prices
### 1: Give an overview of the dataset structure by answering those questions:
**How many samples and features are in the dataset?**

The dataset contains 53940 samples (rows) and 10 features (columns).
**What are the feature data types?**
Six of the ten features are floats (`float64`), three are categorical strings (`object`), and `price` is an integer (`int64`).
| Feature | Type |
|---------|---------|
| carat | float64 |
| cut | object |
| color | object |
| clarity | object |
| depth | float64 |
| table | float64 |
| price | int64 |
| x | float64 |
| y | float64 |
| z | float64 |
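
The numbers above can be read directly from a pandas DataFrame. The following is a minimal sketch on a hypothetical three-row toy frame (the values and the file name in the comment are assumptions, not the real dataset):

```python
import pandas as pd

# Hypothetical toy rows; the real data would come from something like
# pd.read_csv("diamonds.csv"), where the file name is an assumption.
toy = pd.DataFrame(
    {
        "carat": [0.23, 0.21, 0.29],
        "cut": ["Ideal", "Premium", "Good"],
        "price": [326, 326, 334],
    }
)

n_samples, n_features = toy.shape  # rows are samples, columns are features
print(f"{n_samples} samples, {n_features} features")
print(toy.dtypes)  # one dtype per column
```
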
**Are diamonds balanced across color, cut and clarity?**
| Color | Amount |
|-------|--------|
| G | 11292 |
| E | 9797 |
| F | 9542 |
| H | 8304 |
| D | 6775 |
| I | 5422 |
| J | 2808 |

| Cut | Amount |
|-----------|--------|
| Ideal | 21551 |
| Premium | 13791 |
| Very Good | 12082 |
| Good | 4906 |
| Fair | 1610 |

| Clarity | Amount |
|---------|--------|
| SI1 | 13065 |
| VS2 | 12258 |
| SI2 | 9194 |
| VS1 | 8171 |
| VVS2 | 5066 |
| VVS1 | 3655 |
| IF | 1790 |
| I1 | 741 |

None of the three categorical features is balanced. Comparing the rarest and the most frequent value of each feature gives an imbalance of roughly 1:4 for color (J vs. G), 1:13 for cut (Fair vs. Ideal), and 1:18 for clarity (I1 vs. SI1).
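
The imbalance ratios can be recomputed from the counts in the three tables above. This sketch uses only those counts; the helper name `imbalance_ratio` is made up for illustration:

```python
# Counts copied from the tables above.
color_counts = {"G": 11292, "E": 9797, "F": 9542, "H": 8304,
                "D": 6775, "I": 5422, "J": 2808}
cut_counts = {"Ideal": 21551, "Premium": 13791, "Very Good": 12082,
              "Good": 4906, "Fair": 1610}
clarity_counts = {"SI1": 13065, "VS2": 12258, "SI2": 9194, "VS1": 8171,
                  "VVS2": 5066, "VVS1": 3655, "IF": 1790, "I1": 741}


def imbalance_ratio(counts):
    """Ratio of the most frequent to the rarest category (hypothetical helper)."""
    return max(counts.values()) / min(counts.values())


for name, counts in [("color", color_counts), ("cut", cut_counts),
                     ("clarity", clarity_counts)]:
    print(f"{name}: 1:{imbalance_ratio(counts):.1f}")
```

On a DataFrame, the same counts would typically come from `value_counts()` on the respective column.
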
### 2: Visualize diamond prices using a histogram, boxplot, and density plot.
##### Boxplot

- The interquartile range (IQR) of the boxplot shows that the middle 50% of the diamonds are priced between about 950 (Q1) and 5324.25 (Q3).
- The lowest 25% of the prices range from a little above 0 up to Q1, i.e. to roughly 1,000.
- The upper whisker ends at the largest value within 1.5 × IQR above the third quartile. With the Q1 and Q3 values from task 3, this limit is Q3 + 1.5 × IQR = 11885.625, so the whisker ends just below ~12,000.
- Beyond the whisker there are many outliers, ranging from ~12,000 to ~19,500.
- The green line shows the median, which is 2401.
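
The whisker limits of a boxplot follow from Q1 and Q3. As a small worked example, assuming the usual 1.5 × IQR rule and using the Q1 and Q3 values reported in the summary table of task 3:

```python
# Q1 and Q3 of the diamond price, taken from the summary table in task 3.
q1 = 950.0
q3 = 5324.25

iqr = q3 - q1                  # interquartile range
upper_fence = q3 + 1.5 * iqr   # whisker may not extend above this value
lower_fence = q1 - 1.5 * iqr   # negative here, so no low outliers are possible

print(iqr, upper_fence, lower_fence)
```

Since prices cannot be negative, the lower whisker simply ends at the minimum price, and all outliers lie above the upper fence.
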
##### Density

- The density plot shows that the most common diamond price is about $2,500.
##### Histogram

**Is there a trend visible in those plots?**
Yes, a trend is visible in two of the plots.
**If yes, which is it and in which plots can you see it?**
The trend is visible in the density plot and in the histogram: the higher the price, the fewer diamonds there are in the dataset.
### 3: Calculate and state the mean, median, standard deviation, median absolute deviation (MAD), 1st and 3rd quartile (Q1 and Q3), and interquartile range (IQR) of the diamond price.
| mean | median | std | mad | q1 | q3 | iqr |
|-----------|--------|-----------|-----------|-----|---------|---------|
| 3932.7997 | 2401.0 | 3989.4397 | 3031.6032 | 950 | 5324.25 | 4374.25 |
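
A sketch of how such a summary could be computed with pandas, shown on a hypothetical toy series. One caveat: `Series.mad()` in older pandas versions computed the *mean* absolute deviation and was removed in pandas 2.0, so both variants are computed explicitly here. Which of the two the `mad` column above reports is not stated, so treat that as an open assumption.

```python
import pandas as pd


def price_summary(prices: pd.Series) -> pd.Series:
    """Summary statistics for a numeric series (hypothetical helper)."""
    q1 = prices.quantile(0.25)
    q3 = prices.quantile(0.75)
    return pd.Series(
        {
            "mean": prices.mean(),
            "median": prices.median(),
            "std": prices.std(),  # sample standard deviation (ddof=1)
            # deviation from the mean, averaged
            "mean_abs_dev": (prices - prices.mean()).abs().mean(),
            # deviation from the median, then the median of those
            "median_abs_dev": (prices - prices.median()).abs().median(),
            "q1": q1,
            "q3": q3,
            "iqr": q3 - q1,
        }
    )


# Toy values with one large outlier, not taken from the diamonds data.
toy_prices = pd.Series([1, 2, 3, 4, 100], name="price")
summary = price_summary(toy_prices)
print(summary)
```

The toy series illustrates why the distinction matters: a single outlier inflates the mean absolute deviation, while the median absolute deviation stays small.
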
### 4: Plot the diamond price against the carat values as a scatterplot.

**Is there a trend visible in the plot?**
Yes, a trend is visible when plotting the price against the carat.
**If yes, which is it?**
The price rises with the carat value: the lower the carat, the lower the price. The plot also shows that the dataset contains far more diamonds with a low carat value (and therefore a low price) than with a high one. Between 3 and 5 carat there are so few diamonds that the points could be counted by hand, whereas the low-carat region is densely populated.
The counts per carat value confirm this and can be obtained in Python with `df_diamonds['carat'].value_counts()`.
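
Both observations (fewer diamonds at high carat, higher price at high carat) can be checked numerically. A minimal sketch on hypothetical toy data; the frame name and values are made up and only mimic the shape of the trend:

```python
import pandas as pd

# Hypothetical toy data, not taken from the diamonds dataset.
toy = pd.DataFrame(
    {
        "carat": [0.3, 0.3, 0.3, 0.5, 0.5, 1.0, 2.0],
        "price": [400, 420, 410, 1500, 1400, 5000, 15000],
    }
)

# How many diamonds exist per carat value (most frequent first).
carat_counts = toy["carat"].value_counts()

# Median price per carat value; groupby sorts the carat values ascending.
median_price = toy.groupby("carat")["price"].median()

print(carat_counts)
print(median_price)
```
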
### 5: Analyze the correlation between diamond price and diamond x, y, and z dimensions.
**Pairwise plot:**

We created a pairwise plot for this task because it makes it easy to compare two features with each other.
**Is there a trend visible between x, y, and z? If yes, which is it?**
x, y, and z show a roughly linear relationship with each other.
**Is there a trend visible between the dimensions and the price? If yes, which is it?**
A trend between the dimensions and the price is barely visible in the plot. There is only a slight upward trend, which is strongest for dimension x. As far as a trend can be seen, it appears linear for all three dimensions.
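
A correlation matrix gives a numeric check of what a pair plot suggests, which helps when a few extreme values stretch the axes and make a trend hard to see. The sketch below uses hypothetical toy values and Pearson correlation via `DataFrame.corr()`:

```python
import pandas as pd

# Hypothetical toy measurements, not taken from the diamonds dataset.
toy = pd.DataFrame(
    {
        "x": [3.9, 4.3, 5.7, 6.5, 8.0],
        "y": [4.0, 4.3, 5.8, 6.4, 8.1],
        "z": [2.4, 2.7, 3.5, 4.0, 5.0],
        "price": [330, 550, 2400, 5000, 15000],
    }
)

# Pairwise Pearson correlation between all four columns.
corr = toy[["x", "y", "z", "price"]].corr()
print(corr.round(3))
```

Values close to 1 indicate a strong positive linear relationship, values close to 0 indicate no linear relationship.
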
### 6: Analyze diamond prices per diamond color. Create boxplots showing diamond price boxes for each diamond color (all boxes should be in one figure). Create density plots showing diamond prices for each diamond color (all densities should be in one figure).

Again, the dataset contains more lower-priced diamonds than higher-priced ones.

**Is there a trend visible?**
Yes, there is a trend visible.
**If yes, which one?**
Colors D and E occur more frequently in the dataset than I or J, which matches the color counts from task 1. The relationship is not strictly monotonic (G is the most frequent color), but from G onwards the number of diamonds drops with every step towards J.
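
Per-color statistics complement the plots. A minimal sketch with `groupby`, using a hypothetical toy frame (values invented for illustration):

```python
import pandas as pd

# Hypothetical toy data, not taken from the diamonds dataset.
toy = pd.DataFrame(
    {
        "color": ["D", "D", "E", "J", "J", "J"],
        "price": [500, 700, 900, 3000, 5000, 7000],
    }
)

# Count, median, and mean of the price for each color.
per_color = toy.groupby("color")["price"].agg(["count", "median", "mean"])
print(per_color)
```
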
### 7: Use vectorized commands (= no loops!) to answer these questions:
**How many diamonds have a price above 9500?**
5734
**How many diamonds have a price above 9500 and have color “D”?**
461
**What is the mean and std of the price of all color “D” diamonds with cut “Fair”?**
Mean: 4291.061349693252
Std: 3286.114238174996
**What is the median and mad of the price of all color “J” diamonds with cut “Ideal”?**
Median: 4096.0
Mad: 3467.586682378051
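
The questions above can be answered with boolean masks instead of loops. This is a sketch of the pattern on a hypothetical toy frame, so the printed numbers differ from the answers above:

```python
import pandas as pd

# Hypothetical toy data, not taken from the diamonds dataset.
toy = pd.DataFrame(
    {
        "price": [12000, 9600, 3000, 500, 10000, 4000],
        "color": ["D", "E", "D", "J", "D", "J"],
        "cut": ["Fair", "Ideal", "Fair", "Ideal", "Ideal", "Ideal"],
    }
)

# Summing a boolean mask counts the rows where the condition is True.
n_above = (toy["price"] > 9500).sum()

# Combine conditions element-wise with & (each condition in parentheses).
n_above_d = ((toy["price"] > 9500) & (toy["color"] == "D")).sum()

# Select the prices of a subgroup, then aggregate.
d_fair = toy.loc[(toy["color"] == "D") & (toy["cut"] == "Fair"), "price"]
j_ideal = toy.loc[(toy["color"] == "J") & (toy["cut"] == "Ideal"), "price"]

print(n_above, n_above_d)
print(d_fair.mean(), d_fair.std())
print(j_ideal.median())
```
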
**Create two copies of the dataframe that contain only the price and carat features. Apply a log with base 10 to both features in one of those dataframes, and square (x' = x²) the features in the other dataframe. What is the mean and std of the transformed features in both dataframes?**
Copy 1 (log10):
| | price | carat |
|------|----------|-----------|
| Mean | 3.381751 | -0.171532 |
| Std | 0.440657 | 0.253987 |
Copy 2 (squared):
| | price | carat |
|------|----------|-----------|
| Mean | 3.138225e+07 | 8.613903e-01 |
| Std | 6.049189e+07 | 1.056506e+00 |
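
Both transformations are element-wise and need no loop. A minimal sketch on hypothetical toy values chosen so that the results are easy to verify by hand:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the diamonds dataset.
toy = pd.DataFrame(
    {"price": [10.0, 100.0, 1000.0], "carat": [0.1, 1.0, 10.0]}
)

# Two independent copies that contain only the two features.
log_df = np.log10(toy[["price", "carat"]].copy())  # element-wise log10
sq_df = toy[["price", "carat"]].copy() ** 2        # element-wise square

log_stats = log_df.agg(["mean", "std"])
sq_stats = sq_df.agg(["mean", "std"])

print(log_stats)
print(sq_stats)
```

Note that log10 of a value below 1 is negative, which is why a log-transformed carat column can have a negative mean when most diamonds weigh less than one carat.
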
## Part 2: Cell Body Segmentation Data
### 1: Which classes exist? Are they (roughly) balanced?
Classes: PS, WS
| | PS | WS |
|---------|------|-----|
| Amount | 1300 | 719 |

The classes are not balanced: the ratio of WS to PS is roughly 1:1.8, i.e. close to 1:2.
### 2: Which noteworthy trends of features and relations between features as well as features and Class do you see?

*AvgIntenCh1* and *AvgIntenCh4* show a linear trend. Other feature pairs show a linear boundary, with all values lying on one side of it. In general, the blue class (PS) scatters more than the other class (WS).
**Relations:**
- *AvgIntenCh1* and *AvgIntenCh4* show the least scatter and have a high positive correlation, which makes them the most strongly related pair.
- *ConvexHullAreaRatioCh1* and *ConvexHullPerimRatioCh1* behave almost oppositely, i.e. they appear to be negatively correlated.

**No Relations:**
- *AngleCh1* shows almost no visible relation to any other feature: the points are widely scattered regardless of which feature is plotted against it.
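
The visual impression of related and unrelated features can be backed up by ranking all feature pairs by the absolute value of their correlation. The sketch below reuses three feature names from the dataset, but the values are hypothetical toy numbers and `rank_feature_pairs` is a made-up helper:

```python
import numpy as np
import pandas as pd


def rank_feature_pairs(df: pd.DataFrame) -> pd.Series:
    """Rank feature pairs by absolute Pearson correlation (hypothetical helper)."""
    corr = df.corr()
    # Keep only the upper triangle so each pair appears exactly once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


# Hypothetical toy values; only the column names come from the dataset.
toy = pd.DataFrame(
    {
        "AvgIntenCh1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "AvgIntenCh4": [2.1, 3.9, 6.2, 7.8, 10.1, 11.9],
        "AngleCh1": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
    }
)

ranked = rank_feature_pairs(toy)
print(ranked)
```

Pairs at the top of the ranking are strongly related (positively or negatively), pairs at the bottom show little linear relation.
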
### 3: If you would need to distinguish the classes with those features, which features would you choose, and why?
Classes PS + WS:

Class PS


Class WS


#### Feature comparison of classes PS and WS:
[Figures: density plots and boxplots of each feature, compared between classes PS and WS]
To distinguish the classes, we would choose the features *AvgIntenCh2* and *ConvexHullPerimRatioCh1*. In the density plots above, these two features separate the classes most clearly: it is easy to draw a vertical line and say which side belongs to class PS and which to class WS. For the other features, the density curves of PS and WS look mostly the same.
The boxplots above support this choice. For *ConvexHullPerimRatioCh1*, the values of the two classes do not fully overlap and only a few values fall into the same range. For *AvgIntenCh2*, the middle 50% of the two classes do not share the same range, so the classes are distinguishable. For all other features, the overlapping areas are larger than for the two chosen ones.
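
The boxplot argument (do the middle 50% of the two classes overlap?) can be turned into a simple check. This is a rough sketch, assuming exactly two classes in a column named `Class`; the helper name and the toy values are hypothetical:

```python
import pandas as pd


def iqr_boxes_overlap(df: pd.DataFrame, feature: str,
                      class_col: str = "Class") -> bool:
    """True if the Q1..Q3 ranges of the two classes overlap (hypothetical helper)."""
    # One row per class, columns 0.25 and 0.75.
    q = df.groupby(class_col)[feature].quantile([0.25, 0.75]).unstack()
    (lo_a, hi_a), (lo_b, hi_b) = q.to_numpy()
    return bool(lo_a <= hi_b and lo_b <= hi_a)


# Hypothetical toy data: one well-separated and one overlapping feature.
toy = pd.DataFrame(
    {
        "Class": ["PS"] * 4 + ["WS"] * 4,
        "separated": [1, 2, 3, 4, 10, 11, 12, 13],
        "mixed": [1, 2, 3, 4, 2, 3, 4, 5],
    }
)

print(iqr_boxes_overlap(toy, "separated"))
print(iqr_boxes_overlap(toy, "mixed"))
```

Features whose boxes do not overlap are promising candidates for separating the classes with a single threshold.
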