# Data Science 1 Exam
Topic: Classification
Grade: 1.3
F: Fouché (examiner)
I: Me
## Procedure
F: What is Data Science?
I: It's the science of extracting knowledge from data.
F: We had this Process, the KDD Process. Can you explain it to me?
I: It can be described as a sequence of alternating data representations and processing steps: (*names the data representations and the steps*)
F: Okay, what would be an example of something we do in the transformation step?
I: Fitting data values into a certain range (*I didn't explicitly say normalization*)
F: What were the DS topics?
I: Classification and its evaluation, Association Rules, Clustering, and Outlier Detection.
F: What is the difference between supervised and unsupervised learning?
I: In supervised learning the class labels are given, and we learn to map the data to their classes using these labels. In unsupervised learning no labels are given, so we have to learn both the assignment of the data to classes *and* the classes themselves. (*I continue to assign the topics from the previous question to supervised/unsupervised and then start to mention how k-NN could also be used as a sort of supervised outlier detector*)
F: (*Stops me*) I actually want to talk about classification today. You've already mentioned k-NN. What other types did we learn?
I: k-NN, linear classifiers/SVMs, decision trees, and Bayesian classifiers.
F: Please explain the k-NN Classifier.
I: (*explains k-NN*)
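For readers of this protocol, a minimal sketch of the k-NN idea (my own illustration; Euclidean distance and majority voting are assumptions, the lecture may use other choices):

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_query, k=5):
    """Classify x_query by a majority vote among its k nearest training points.
    X_train and y_train are NumPy arrays (features and class labels)."""
    # distance from the query point to every training point (Euclidean)
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # majority vote over their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
```

Note that k-NN is a lazy learner: there is no training step beyond storing the data, and an odd k avoids ties in binary problems.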
F: Okay, now how would you use k-NN as an outlier detector?
I: (*explains*)
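One common way to turn this into an outlier score (my sketch; whether the lecture scores by the k-th distance or by the average of the k distances I don't recall):

```python
import numpy as np

def knn_outlier_scores(X, k=5):
    """Score each point by the distance to its k-th nearest neighbour;
    large scores mean the point lies far from any dense region, i.e. an outlier candidate."""
    # full pairwise distance matrix (fine for small data sets)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    dists.sort(axis=1)
    # column 0 is the distance of each point to itself (always 0),
    # so column k is the distance to the k-th nearest other point
    return dists[:, k]
```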
F: And how would you use k-NN for regression?
I: I find the k nearest neighbours along one dimension and then average their values in the other dimension.
F: What about more than 2 dims, does this work as well?
I: Yes, then I find the k nearest neighbours in all but one dimension and average their values in the remaining dimension, which is the one I'm concerned about.
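As a sketch, under the same assumptions as above (uniform averaging; a distance-weighted average would also be possible):

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=5):
    """Predict the target value for x_query as the mean target of its k nearest
    neighbours in feature space (all dimensions except the one being predicted)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(np.mean(y_train[nearest]))
```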
F: Okay, now what are Pros and Cons of k-NN?
I: (*lists pros and cons*)
F: Okay, let's move on. What is a decision tree?
I: (*explains what a DT is and that there are two phases: construction and pruning*)
F: Let's focus on construction.
I: (*explains construction some more*)
F: How do we decide where and what to split?
I: We covered three measures: information gain, Gini index, and misclassification error.
F: What is Information Gain?
I: It's the reduction in entropy when comparing the dataset before and after a split.
F: Define entropy, either in words or write down the formula.
I: (*does both at the same time*)
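For reference, the standard formulas (up to the notation used in the lecture):

$$
H(D) = -\sum_{i=1}^{c} p_i \log_2 p_i,
\qquad
\mathrm{IG}(D, A) = H(D) - \sum_{v \in \operatorname{values}(A)} \frac{|D_v|}{|D|}\, H(D_v)
$$

where $p_i$ is the relative frequency of class $i$ in $D$ and $D_v$ is the subset of $D$ whose value for attribute $A$ is $v$.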
F: Okay, now we have the attribute for the split; how do we find the value to split at?
I: (*I admit that I don't remember whether we split along all of the attribute's values or only split into two and search for the best split value*)
F: That depends on the approach we're using. How is it done with ID3?
I: (*How annoying that I don't remember this when it was the example in the Q&A session... I say that I don't remember and explain the stopping criteria of ID3 instead, to show that I remember something*)
F: How do we prevent overfitting?
I: This is where pruning comes in. We can do prepruning and postpruning.
F: How do we decide what to prune in postpruning?
I: (*I ramble a bit, mention ChiMerge, but in the end I don't know*)
F: Okay, next question: What if we want to avoid certain types of errors or if certain types of errors are worse than others?
I: We can adjust the class ratio in the training data by removing data points of one class or by artificially creating additional, similar points of the class we want more of. We can also calculate the conditional risk.
F: How do we calculate the conditional risk?
I: (*explains the formula in words*)
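In the usual notation (which may differ slightly from the slides), the conditional risk of decision $\alpha_i$ given observation $x$ is

$$
R(\alpha_i \mid x) = \sum_{j} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x)
$$

where $\lambda(\alpha_i \mid \omega_j)$ is the loss incurred for deciding $\alpha_i$ when the true class is $\omega_j$; we choose the decision with minimal conditional risk. As far as I know, the loss values are not learned from the data but specified by the application.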
F: Where do we get the Loss from?
I: (*I don't know*)
F: Okay, then let's move on to the next question: What do we mean by the "no free lunch" theorem?
I: There's no single classifier that is best for every problem; it always depends on the kind of problem one is trying to solve.
F: Okay, the time is up.
:::info
The next questions would have been about Bayes classification and Ensembles
:::