# データ工学特論 めも
###### tags: `class`
## 第3回
This class explained word definitions, association analysis, and memory-based reasoning.
Three word definitions were taught: confidence, support, and lift. Confidence is the conditional probability of Event A given the occurrence of Event B. Support is the joint probability of Event A and Event B. Lift is the ratio of the confidence to the probability of Event A. If this index is greater than 1, we can consider that there might be a relationship between A and B.
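A minimal Python sketch of the three indices for a rule B ⇒ A, using a made-up transaction list (the item names and counts are only for illustration):

```python
# Sketch: support, confidence, and lift for a rule B => A,
# computed from a made-up transaction list.
transactions = [
    {"diapers", "beer"},
    {"diapers", "beer", "milk"},
    {"diapers"},
    {"beer"},
    {"milk"},
]
N = len(transactions)

def prob(*items):
    # fraction of transactions containing all of the given items
    return sum(1 for t in transactions if all(i in t for i in items)) / N

def support(a, b):
    return prob(a, b)                    # joint probability P(A and B)

def confidence(a, b):
    return prob(a, b) / prob(b)          # conditional probability P(A | B)

def lift(a, b):
    return confidence(a, b) / prob(a)    # confidence(B => A) / P(A)

print(support("beer", "diapers"))        # 0.4
print(confidence("beer", "diapers"))     # 0.666...
print(lift("beer", "diapers"))           # 1.111... (> 1, so the items look related)
```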
Association analysis analyzes which items are related to which other items. For example, buying diapers is related to buying beer in a supermarket. In detail, this method finds association rules such as "Event P => Event Q", which mean that there is a relationship between Event P and Event Q. The important point is that not all of the obtained rules are beneficial. The example above, the relationship between buying diapers and buying beer, is valuable. However, the relationship between buying an expensive computer and contracting a 3-year guarantee is NOT valuable, because that relationship is easily predictable. Since the purpose of this method is to find valuable relationships, we have to pay attention to this.
The procedure for this method has 2 stages. The 1st stage is to define items, or to determine the abstractness of the items. In this stage, we decide what subjects we treat as targets. If we define the targets abstractly, the analysis is easy, but the disadvantage is that less frequent events are ignored. On the other hand, a concrete definition makes it easy to focus on particular items, but the disadvantages are that complex rules are obtained and a long execution time is required. In addition, some items might be hidden behind others; these are called virtual items. For instance, how the user buys the item is relevant here. The 2nd stage is to calculate support, confidence, and lift. In this stage we have to overcome the difficulty that the amount of calculation can be huge: if there are N items, roughly N^2 candidate rules have to be evaluated (see the sketch below). The advantages of this analysis are 3 points. The 1st is that it generates easily understandable results. The 2nd is that it is applicable to variable-length data. The 3rd is that it uses simple and understandable calculations. In contrast, the disadvantages of this method are also 3 points. The 1st is the huge calculation cost when the number of items is huge. The 2nd is the need for a suitable definition or abstraction of the items. The 3rd is that it does not explain the phenomena of rarely bought items.
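A rough sketch of the 2nd stage, again with made-up data and arbitrary thresholds: every ordered pair of items becomes a candidate rule (about N^2 of them for N items), and only the pairs that clear assumed minimum support/confidence values are kept.

```python
# Sketch of stage 2: enumerate all ordered item pairs as candidate rules
# (roughly N^2 rules for N items) and keep those that clear the assumed
# minimum support and confidence thresholds.
from itertools import permutations

transactions = [
    {"diapers", "beer"},
    {"diapers", "beer", "milk"},
    {"diapers"},
    {"beer"},
    {"milk"},
]
items = set().union(*transactions)
n_total = len(transactions)

MIN_SUPPORT = 0.3      # arbitrary thresholds, for illustration only
MIN_CONFIDENCE = 0.5

for q, p in permutations(items, 2):          # candidate rule: Q => P
    n_q = sum(1 for t in transactions if q in t)
    n_pq = sum(1 for t in transactions if p in t and q in t)
    support = n_pq / n_total
    confidence = n_pq / n_q if n_q else 0.0
    if support >= MIN_SUPPORT and confidence >= MIN_CONFIDENCE:
        print(f"{q} => {p}: support={support:.2f}, confidence={confidence:.2f}")
```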
Memory-based reasoning reasons from memory (stored records) by measuring a distance, which is defined by the analyst, and applying a combining function (結合関数). The procedure has 3 steps. The 1st step is to normalize or standardize the data in the records. The 2nd step is to search for the records nearest to an input record; a distance function is used to find the nearest records. The 3rd step is to predict a result from the records found in step 2; in this step, a combining function is used to derive a result from the obtained records.
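A minimal sketch of the 3 steps, assuming min-max normalization, Euclidean distance, k = 3, and majority vote as the combining function (the stored records and all of these concrete choices are illustrative; the lecture only says the analyst defines the distance and combining functions):

```python
# Sketch of memory-based reasoning: (1) normalize, (2) find the nearest stored
# records with a distance function, (3) combine their labels into a prediction.
import math
from collections import Counter

records = [([170, 65], "M"), ([160, 50], "F"), ([180, 80], "M"), ([155, 48], "F")]
query = [165, 55]

# Step 1: min-max normalization per feature, so no feature dominates the distance.
columns = list(zip(*([r[0] for r in records] + [query])))
lo = [min(col) for col in columns]
hi = [max(col) for col in columns]

def normalize(x):
    return [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(x, lo, hi)]

# Step 2: distance function (Euclidean here) to find the k nearest records.
def dist(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(normalize(a), normalize(b))))

k = 3
nearest = sorted(records, key=lambda r: dist(r[0], query))[:k]

# Step 3: combining function (majority vote here) over the nearest records.
prediction = Counter(label for _, label in nearest).most_common(1)[0][0]
print(prediction)
```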
### Association analysis
Methods
- association rules
- $$\text{Event P} \Rightarrow \text{Event Q}$$
- there is a relationship between P and Q
- **Not all of obtained rules are beneficial**
- 1 < 2
- $$\text{buying computer} \Rightarrow \text{rich}$$
---
A sample
- $$\text{Orange Juice} \Rightarrow \text{Pizza}$$
---
- Confidence is the conditional probability of Event A given the occurrence of Event B
    - confidence (信頼度)
    - can be thought of as the conditional probability that Event A occurs when Event B has occurred
    - N := the cumulative number of all customers (the total number of transactions)
    - $$\mathrm{confidence} = \frac{N(A \cap B)}{N(B)} = \frac{N(A \cap B)/N}{N(B)/N} = \frac{P(A \cap B)}{P(B)}$$
- Support is the joint probability (同時確率) of Event A and Event B
    - $$\mathrm{support} = P(A \cap B) = \frac{N(A \cap B)}{N}$$
- Lift is the ratio of the confidence to the probability of Event A
    - the ratio of the confidence to the probability that A occurs (?)
    - take the probability of buying orange juice given that pizza was bought, and divide it by the probability of buying orange juice; if the result is greater than 1, the association between buying pizza and buying orange juice is stronger than buying orange juice on its own
    - lift is the index for this (worked example below)
    - if it is greater than 1, we can say the item is strongly tied to the other item
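A worked example with made-up numbers: suppose there are N = 100 transactions, 40 contain pizza, 20 contain orange juice, and 12 contain both. Then
$$\mathrm{lift} = \frac{P(\text{OJ} \mid \text{Pizza})}{P(\text{OJ})} = \frac{12/40}{20/100} = \frac{0.3}{0.2} = 1.5 > 1$$
so buying pizza raises the chance of buying orange juice above its baseline probability.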
---
Procedure
1. Define items or determine the abstractness of items
2. Calculate support/confidence/lift
---
Choice of the correct abstraction level for input items
---
Virtual items (disappearance), hidden properties
Overcoming the difficulty during the calculation
---
advantages of association analysis
- generates easily understandable results
- applicable to variable-length data
- uses simple and understandable calculations
---
disadvantages of association analysis
- the calculation cost is huge if the number of items is huge
- needs a suitable definition/abstraction of items
- does not explain the phenomena of rarely bought items
    - i.e., the case where there is little data for them
### Memory based reasoning
Memory-based reasoning reasons from memory (stored records) by measuring a distance, which is defined by the analyst, and applying a combining function (結合関数). The procedure has 3 steps. The 1st step is to normalize or standardize the data in the records. The 2nd step is to search for the records nearest to an input record; a distance function is used to find the nearest records. The 3rd step is to predict a result from the records found in step 2; in this step, a combining function is used to derive a result from the obtained records.
The definition of the distance function must meet 3 conditions: non-negativity, symmetry, and the triangle inequality.
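Written out, these are the standard metric conditions:
$$d(x, y) \ge 0, \qquad d(x, y) = d(y, x), \qquad d(x, z) \le d(x, y) + d(y, z)$$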
## 第2回
The relation between statistics and data mining comes down to **which discipline each is based on**
- statistics is based on mathematics
- data mining is based on computers
data mining is used as a part of, or a basis for, statistics
---
Difference between statistics and data mining
Statistics describes the data distribution with a small number of parameters such as means and variances. It requires a null hypothesis (帰無仮説) for a test.
Data mining is used to categorize data, to generate rules, to make predictions, etc. It uses all the data. It does NOT require a null hypothesis.
---
important things for learning data mining techniques
- understand the pros and cons of each DM method
- investigate previous cases of applications of data mining methods
    - look into the cases from before you apply the method yourself
    - use them as a reference to judge whether enough information has been obtained to carry out the analysis
---
Model := algorithms + parameters (values)
1. modify the data format to apply the method you want (data cleaning)
2. fix the parameters based on learning
- clustering model
- classification model
- prediction model
- time line analysis model
the forms of output are different depending on the selected model
---
verification-oriented models (検証指向のモデル): think by first forming a hypothesis
- verify the results against an analyst's assumptions
- these models are used to classify data or make predictions based on model parameters learned in advance
discovery-oriented models (発見指向のモデル)
- discover the meaning of data through data mining
- these models are used to find clusters etc.
---
Overfitting is a condition in which a model is too specialized to the training data because it has too many learning parameters.
Underfitting is a condition in which the parameters are insufficient to extract meaningful information.
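A minimal sketch of both conditions, using made-up noisy quadratic data and NumPy polynomial fitting (the data, the polynomial degrees, and the train/test split are illustrative assumptions):

```python
# Sketch: underfitting vs. overfitting on made-up noisy quadratic data.
# Degree 1 has too few parameters to capture the curve (underfit); degree 15
# has so many that it can chase the training noise (overfit), which typically
# shows up as a low train error but a worse test error than degree 2.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = x**2 + rng.normal(0, 0.05, size=x.size)   # the true relation is quadratic

x_train, y_train = x[::2], y[::2]              # even indices for training
x_test, y_test = x[1::2], y[1::2]              # odd indices held out for testing

def mse(coeffs, xs, ys):
    return np.mean((np.polyval(coeffs, xs) - ys) ** 2)

for degree in (1, 2, 15):
    coeffs = np.polyfit(x_train, y_train, degree)
    print(f"degree {degree:2d}: train MSE={mse(coeffs, x_train, y_train):.4f}, "
          f"test MSE={mse(coeffs, x_test, y_test):.4f}")
```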
---
Explainability
Easy to explain
- clustering
- association analysis
- decision tree
Difficult to explain
- neural network
---
Association analysis
is it discovery-oriented?
is it verification-oriented?
both of them are applicable, I think
- analyzes which items are related to other items
---
Memory based reasoning(記憶ベース推論)
I think this one requires setting up a distance function, so:
- it reasons from memory (stored records) by measuring a distance, which is defined by the analyst, and applying a combining function (結合関数)
---
Clustering
- find clusters of mutually similar data
---
Link analysis
- find patterns in links of data based on graph theory or network theory
- Google search engine
- pages are linked to other pages
- this structure is regarded as a DAG (Directed Acyclic Graph)
---
Decision tree(決定木)
- a tree used to find how each parameter affects the result
---
Artificial neural networks
- a model of the neural networks in the brain
- it learns patterns between input and corresponding output in training data
---
Genetic algorithm(GA)
- a model to find a solution that maximizes a given function
- it is based on the idea of the evolution of species (animals), using operations such as mutation and crossover
## 第1回
- Contents covered in the lectures
    - the first half is data mining
    - the second half is text mining and deep learning
- Grading
    - submission of minute papers
    - small quizzes might also be included
    - a presentation or a report
- Slides will apparently be handled via Google Classroom
- Questions are accepted at any time via chat
---
- Structured data
    - the kind used for data mining (DM)
    - tables
- Unstructured data
    - the kind used for text mining (TM)
    - documents
---
DM (Data Mining)
- decide on the method best suited to the purpose, edit the large amount of stored data into a format suited to that method, and then analyze it
    - what do you want to know from the data?
        - if the purpose is unclear, it will not end well
    - for that, you need to understand the methods
        - the lectures may only be able to cover this part
- after the analysis, interpret the results in light of business knowledge and put them to use in your own work
    - obtaining results you already know is not rewarding
---
TM (Text Mining)
- since the data is unstructured, it needs to be converted into structured data
---
Deep Learning
- just a neural network
    - a neural network with 4 or more layers
- belongs to machine learning, but is also applied to data science
---
Q&A
scikit-learn、Keras、TensorFlowによる実践
https://www.amazon.co.jp/dp/4873119286
Kaggleで勝つデータ分析の技術 門脇 大輔 https://www.amazon.co.jp/dp/4297108437