# Time series

## Base knowledge

- [Basics 101 : Machine Learning Algorithms](https://medium.com/knowyourml/basics-101-machine-learning-algorithms-b61ae27070ca)

![](https://i.imgur.com/QjTEUt1.png)

### Time series type

Time series data can be classified into two types:

1. Measurements gathered at regular time intervals (metrics)
2. Measurements gathered at irregular time intervals (events)

### Time series data vs. cross-sectional and panel data

- If all you need is a timestamp, it’s probably time series data.
- If you need something other than a timestamp, it’s probably cross-sectional data.
- If you need a timestamp plus something else, like an ID, it’s probably panel data.

#### Time series data definition

Time series data is a collection of observations (behavior) for a single subject (entity) at different time intervals (generally equally spaced in the case of metrics, or unequally spaced in the case of events).

For example: max temperature, humidity and wind (three behaviors) in New York City (single entity), collected on the first day of every year (multiple points in time).

The relevance of time as an axis makes time series data distinct from other types of data.

#### Cross-sectional data definition

Cross-sectional data is a collection of observations (behavior) for multiple subjects (entities such as different individuals or groups) at a single point in time.

For example: max temperature, humidity and wind (three behaviors) in New York City, SFO, Boston and Chicago (multiple entities) on 1/1/2015 (single instance).

In cross-sectional studies, there is no natural ordering of the observations (e.g. explaining people’s wages by reference to their respective education levels, where the individuals’ data could be entered in any order).
For example: the closing price of a group of 50 stocks at a given moment in time, the inventory of a given product at specific stores, and a list of grades obtained by a class of students on a given exam.

#### Panel data (longitudinal data) definition

Panel data is also called cross-sectional time series data, because it combines the two types above (i.e., a collection of observations for multiple subjects at multiple instances).

Panel data or longitudinal data is multi-dimensional data involving measurements over time. Panel data contains observations of multiple phenomena obtained over multiple time periods for the same firms or individuals. A study that uses panel data is called a longitudinal study or panel study.

For example: max temperature, humidity and wind (three behaviors) in New York City, SFO, Boston and Chicago (multiple entities) on the first day of every year (multiple points in time).

:::info
Panel data (called 面板數據 in mainland China) is the combination of cross-sectional data and time series data in statistics and econometrics. Panel data is different from pooled cross-sectional data. Panel data consists of observations of the same subjects at different points in time. Pooled cross-sectional data is a data set obtained by drawing separate samples from the same large population at different points in time and pooling them together. For example, many surveys of individuals, households and firms are repeated at regular intervals, often once a year; if a random sample is drawn in each period, combining those random samples gives a pooled cross section.

https://www.youtube.com/watch?v=99jR6-dgrj8&ab_channel=AnalyticsUniversity
:::

## Components of Time Series

![](https://i.imgur.com/Nm16qSM.png)

https://www.fromthegenesis.com/components-of-time-series/

## Advantages of time series analysis

- Reliability: time series analysis uses historical data to represent conditions along a progressive linear chart. The data used is collected over a period of time, for example weekly, monthly or yearly. This makes the data and the forecasts reliable.
- Seasonal patterns: because the data relates to a series of periods, it helps us understand and predict seasonal patterns. For example, a time series may show that demand for ethnic clothing increases not only during Diwali but also during the wedding season.
- Estimation of trends: time series analysis helps identify trends. Data trends are useful to managers because they show increases or decreases in sales, production, share prices, and so on.
- Growth: time series analysis helps measure financial growth. It also helps measure the internal growth of an organization that leads to economic growth.

https://www.toppr.com/guides/fundamentals-of-business-mathematics-and-statistics/time-series-analysis/definition-of-time-series-analysis/

## Time series clustering

Clustering of time-series data is mostly utilized for discovery of interesting patterns in time-series datasets:

1. Recognizing dynamic changes in time-series: detection of correlation between time-series [36]. For example, in financial databases, it can be used to find companies whose stock prices move similarly.
2. Prediction and recommendation: a hybrid technique combining clustering and function approximation per cluster can help users predict and recommend [37–40]. For example, in scientific databases, it can address problems such as finding the patterns of solar magnetic wind to predict today’s pattern.
3. Pattern discovery: discovering the interesting patterns in databases. For example, in a marketing database, different daily sales patterns of a specific product in a store can be discovered.

### Application of time series clustering

![](https://i.imgur.com/4ateEP4.png)

### Challenge of time series clustering

Time-series clustering is challenging because, ==first of all==, time-series data are often far larger than memory size and are consequently stored on disk, which drastically slows down the clustering process. The ==second challenge== is that time-series data are often high dimensional, which makes handling them difficult for many clustering algorithms and also slows down clustering. ==Finally==, the third challenge concerns the similarity measures used to form the clusters. Similar time-series must be found, which requires time-series similarity matching: calculating the similarity among whole time-series using a similarity measure. This process is also known as “whole sequence matching”, where the whole lengths of the time-series are considered during distance calculation.
However, the process is complicated: time-series data are naturally noisy and include outliers and shifts, and on the other hand the lengths of time-series vary while the distance among them still needs to be calculated. ==These common issues have made the similarity measure a major challenge for data miners.==

### The time-series clustering approaches

- In the shape-based approach, the shapes of two time-series are matched as well as possible by a non-linear stretching and contracting of the time axes. This approach has also been labelled a raw-data-based approach because it typically works directly with the raw time-series data. Shape-based algorithms usually employ conventional clustering methods that are compatible with static data, with the distance/similarity measure replaced by one appropriate for time-series.
- In the feature-based approach, the raw time-series are converted into feature vectors of lower dimension, and a conventional clustering algorithm is then applied to the extracted feature vectors. Usually an equal-length feature vector is calculated from each time-series, followed by Euclidean distance measurement.
- In model-based methods, a raw time-series is transformed into model parameters (a parametric model for each time-series), and then a suitable model distance and a clustering algorithm (usually a conventional one) are chosen and applied to the extracted model parameters. However, model-based approaches have been shown to usually have scalability problems, and their performance degrades when the clusters are close to each other.
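The difference between the shape-based and feature-based views can be sketched on toy data. This is only an illustration under stated assumptions: the function names (`euclidean`, `dtw`, `features`), the three toy series, and the choice of summary features are hypothetical and not taken from the review. The DTW shown is the plain unconstrained dynamic-programming formulation with absolute difference as the local cost.

```python
import math


def euclidean(a, b):
    """Lock-step distance: point i of a is only compared with point i of b."""
    if len(a) != len(b):
        raise ValueError("Euclidean distance needs equal-length series")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(a, b):
    """Shape-based distance: dynamic time warping, O(len(a) * len(b))."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # extend the best of: stretch a, stretch b, or advance both
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def features(series):
    """Feature-based view: summarise a series as (mean, std, mean abs step).

    The three features are an arbitrary choice for illustration.
    """
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    step = sum(abs(series[i] - series[i - 1]) for i in range(1, n)) / (n - 1)
    return (mean, std, step)


# Toy series: the same bump occurring at two different times, and a flat line.
bump = [0, 0, 1, 2, 1, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 2, 1, 0]
flat = [0.5] * 8

print("euclidean(bump, shifted):", euclidean(bump, shifted))
print("dtw(bump, shifted):      ", dtw(bump, shifted))
print("dtw(bump, flat):         ", dtw(bump, flat))
print("features(bump):          ", features(bump))
print("features(shifted):       ", features(shifted))
```

On this toy data, warping the time axis lets DTW align the two bumps exactly, while the lock-step Euclidean distance penalises the shift. The feature vectors of the two bump series coincide, so a conventional algorithm run on them would treat the series as identical. Either distance could then be handed to a conventional clustering algorithm, which is the pattern the list above describes.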
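![](https://i.imgur.com/FoYjYOQ.png)

:::info
Reviewing existing works in the literature, it is implied that time-series clustering essentially has five components: ==dimensionality reduction or representation method==, ==distance measurement==, ==clustering algorithm==, ==prototype definition==, and ==evaluation==
:::

### Representation method

Purpose of dimensionality reduction: reduce noise and computation time without giving up too much accuracy.

![](https://i.imgur.com/1hhv5Gx.png =450x)

![](https://i.imgur.com/uPZCdQQ.png)

### Similarity/dissimilarity measures

Different measures can be used for the distance between time-series. Some similarity measures are proposed for a specific time-series representation, for example MINDIST, which is compatible with SAX [84]; others work regardless of the representation method or are compatible with raw time-series. In traditional clustering, the distance between static objects is based on exact matching, whereas in time-series clustering the distance is computed approximately. In particular, to compare time-series with irregular sampling intervals and lengths, it is important to determine their similarity adequately. Different distance measures have been designed to specify similarity between time-series. Hausdorff distance, modified Hausdorff (MODH), HMM-based distance, Dynamic Time Warping (DTW), Euclidean distance, Euclidean distance in a PCA subspace, and Longest Common Sub-Sequence (LCSS) are the most popular distance measures for time-series data. References for these distance measures are given in Table 3. ==One of the simplest ways to calculate the distance between two time-series is to treat them as univariate time-series and then compute a distance measure across all time points.==

![](https://i.imgur.com/62hHkhq.png =400x)

Three ways to measure distance:

1. time based: similarity is assessed at each time step; correlation-based distances or the Euclidean distance are suitable for this objective.
2. shape based: the goal is to find time-series that are similar in shape, and the time at which a pattern occurs does not matter. Elastic methods [108,113] such as Dynamic Time Warping (DTW) [114] are therefore used to compute dissimilarity.
3. change based (structure based): this approach usually relies on modelling methods such as Hidden Markov Models (HMM) [116] or ARMA processes [107,117], and similarity is then measured on the parameters of the models fitted to the time-series. It suits long time-series, not medium or short ones.

:::info
1) Studies of the above approaches as similarity/dissimilarity measures show that the most effective and accurate ones are based on dynamic programming (DP), but their execution time is very expensive (the cost of comparing two time-series is quadratic in the length of the time-series) [143]. Although constraints are usually placed on these distance/similarity measures to reduce complexity [119,144], careful parameter tuning is needed to make them efficient and effective. A trade-off between speed and accuracy therefore has to be made when using these measures. A further question is how effective distance measures are on large-scale time-series datasets; the literature does not answer it, because most of the works considered are based on rather small datasets.

2) Research on similarity measures has to cope with the challenge of time-series of varying length. Another big challenge is incompatibility between the distance measure and the representation method. For example, one common approach to time-series analysis is based on the frequency domain [85,109]; when working in frequency space it is hard to find similarity between series and to produce the value-based differences used for clustering.

3) Euclidean distance and DTW are the most commonly used similarity measures in time-series clustering. One study showed that Euclidean distance is surprisingly competitive in terms of time-series classification accuracy [145], although DTW also has undeniable advantages for similarity measurement.
:::

### Time-series cluster prototypes

1. The medoid sequence of the set
2. The average sequence of the set
3. The local search prototype

### Time-series clustering algorithms

![](https://i.imgur.com/J4qZMiL.png)

### Time-series clustering evaluation measures

- The validation of algorithms should be performed on various ranges of datasets (unless the algorithm is created only for a specific set). The dataset used should be published and freely available.
- Implementation bias must be avoided by careful design of the experiments.
- If possible, data and algorithms should be freely provided.
- New similarity measures should be compared with simple and stable metrics such as Euclidean distance.

In general, ==evaluating extracted clusters is not easy in the absence of data labels [26] and it is still an open problem==. The definition of clusters depends on the user and the domain, and it is subjective. For example, the number of clusters, the size of clusters, the definition of outliers, and the definition of similarity among the time-series in a problem are all concepts that depend on the task at hand and should be declared subjectively.

![](https://i.imgur.com/eyNuGjp.png)

:::info
In scalar accuracy measurements, a single real number is generated to represent the accuracy of different clustering methods. Numerical measures that are applied to judge various aspects of cluster validity are classified into two types:

- ***External Index***: this index is used to measure the similarity of formed clusters to the externally supplied class labels or ground truth, and is the most popular clustering evaluation method [215]. In the literature, this index is also known as external criterion, external validation, extrinsic methods, and supervised methods because the ground truth is available. (It requires a labelled dataset. Using external indices to judge how good the discovered clusters are is one of the most popular ways to evaluate clustering quality [215], and it is also used to evaluate the model proposed in this study. However, it cannot be applied directly to real-life unsupervised tasks, because ground truth is not available for every dataset.)
- ***Internal Index***: this index is used to measure the goodness of a clustering structure without respect to external information.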
In the literature, this index is also known as internal criterion, internal validation, intrinsic and unsupervised methods.
:::

:::warning
## Internal Index

Typical objective functions in clustering formalize the goal of attaining high intra-cluster similarity (objects within a cluster are similar) and low inter-cluster similarity (objects from different clusters are dissimilar). Internal validation compares solutions based on the goodness of fit between each clustering and the data. Internal validity indices evaluate clustering results by using only features and information inherent in a dataset. They are usually used when the true solutions (ground truth) are unknown. However, this index can only make comparisons between different clustering approaches that are generated using the same model/metric. Otherwise, it makes assumptions about cluster structure.

There are many internal indices, such as Sum of Squared Error, Silhouette index, Davies-Bouldin, Calinski-Harabasz, Dunn index, R-squared index, Hubert-Levin (C-index), Krzanowski-Lai index, Hartigan index, Root-Mean-Square Standard Deviation (RMSSTD) index, Semi-Partial R-squared (SPR) index, Distance between two clusters (CD) index, Weighted inter-intra index, Homogeneity index, and Separation index.

Sum of Squared Error (SSE) is an objective function that describes the coherence of a given cluster; “better” clusters are expected to give lower SSE values [241]. For evaluating clusters in terms of accuracy, SSE is the most common measure across different works [18,165]. For each time-series, the error is the distance to the nearest cluster.
:::

## Further research directions

### online/offline training (batch learning)

#### Questions

1. Definitions of online and batch learning

:::info
Online learning: the system can be trained incrementally by feeding it data instances sequentially, either individually or in small groups called mini-batches. Each learning step is fast and cheap, so the system can easily learn about new data on the fly.

Online learning is great for systems that receive data as a continuous flow (e.g., stock prices) and need to adapt to change rapidly or autonomously. It is also a good option if you have limited computing resources: once the system has been trained on the new data, it does not need them anymore and they can be discarded. This can save a large amount of space.

Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine’s main memory. This is called out-of-core learning. The algorithm loads part of the data, runs a training step on that data and repeats the process until it has run on all of the data. Out-of-core learning is usually done offline, so it is better thought of as incremental learning.

One important parameter of online learning systems is how fast they should adapt to changing data: this is called the learning rate. If you set a high learning rate, the system will rapidly adapt to new data, but it will also tend to quickly forget the old data. If you set a low learning rate, the system will have more inertia: it will learn more slowly, but it will also be less sensitive to noise or outliers in the new data.

A major disadvantage of online learning is that if bad data is fed to the system, the system’s performance will gradually decline.
:::

:::info
Batch learning: the system is incapable of learning incrementally. It must be trained using all the available data. This generally takes a lot of time and computing resources, so it is typically done offline. First the system is trained, then it is launched into production and runs without learning anymore; it just applies what it has learned. This is also called offline learning.

If the system has to be trained on new data, you need to train a new version of the system from scratch on the full dataset, then stop the old system and replace it with the new one. If the system needs to adapt to rapidly changing data, a more reactive solution is needed. This process uses a lot of computing resources (CPU, memory space, disk space, disk I/O, network I/O, etc.) and cannot learn autonomously.
:::

- [Online Learning vs Offline Learning](https://www.kaggle.com/getting-started/179176)

![](https://i.imgur.com/wtIbKuD.png =250x) ![](https://i.imgur.com/gp9vxUf.png =250x) ![](https://i.imgur.com/RdvBOHK.png =250x) ![](https://i.imgur.com/ccFMAVF.png =250x)

2. Do the characteristics and application scenarios of online learning differ from those of offline learning? [Online learning: definition, algorithms, pros and cons, and differences from batch learning](https://www.zhihu.com/tardis/sogou/ans/1406328045)
3. What problems does online learning run into?
4. Is online training supervised or unsupervised learning?
5. Can online and offline (batch) learning be used together?
6. How should the difference between them be evaluated?
7. Batch learning vs. online learning vs. active learning? [Prof. Hsuan-Tien Lin, Machine Learning Foundations, Lecture 3 study notes](https://blog.fukuball.com/lin-xuan-tian-jiao-shou-ji-qi-xue-xi-ji-shi-machine-learning-foundations-di-san-jiang-xue-xi-bi-ji/)

- Looking at machine learning from the perspective of how data is fed to the machine: feeding all the data at once is called batch learning. Supervised learning methods often feed data this way.
- Feeding new data in gradually is called online learning. A machine trained with batch learning cannot adjust its skill afterwards and may become less and less accurate, whereas online learning can keep adjusting and improving. The PLA algorithm is easy to apply to online learning, and reinforcement learning methods also often feed data in an online fashion.
- When the machine can ask questions and is then fed data based on the answers, it is called active learning. The hope is that the machine can use some strategy for asking questions and so learn and improve gradually.

:::info
To sum up, machine learning comes in many forms. By how the data is supplied, it can be divided into batch learning, online learning and active learning. The form of the data is determined by the input variables Xn and the output values yn. By the source of the input variables Xn, it can be divided into concrete features, raw features and abstract features. By the type of the output values yn, it can be divided into binary classification, multiclass classification, regression and structured learning problems. By how labels for yn are given, it can be divided into supervised learning, unsupervised learning, semi-supervised learning and reinforcement learning.

https://ycc.idv.tw/ml-course-foundations_1.html
:::

### Multivariate time series

### Similarity evaluation for time series

### Producing a representative waveform

### Determining subsequences dynamically

### Panel data analysis

### Fairness

## Reference

- [Time-series data mining](https://dl.acm.org/doi/abs/10.1145/2379776.2379788?casa_token=oqey7VggIOIAAAAA:XWwUpbM9NdpcvlTM9fFtDD6zEgtJRUrnWo2IpRi8Iz48ePasNZQYq-Sb9BwCnMGOTytZWvoZ37sB)
- [What is time series data?](https://www.influxdata.com/what-is-time-series-data/)
- [Aghabozorgi, Saeed, Ali Seyed Shirkhorshidi, and Teh Ying Wah. "Time-series clustering–a decade review." Information systems 53 (2015): 16-38.](https://www.sciencedirect.com/science/article/pii/S0306437915000733?casa_token=1r4734AwOesAAAAA:Ok-wEozA9A0ms-GOdrwpnhCTaCIpJg9hj2ftuFFgTV4yuAw2C-lzlzEc7qoyUPksZFYTkeilpvQ)
- [awesome-time-series](https://github.com/cuge1995/awesome-time-series)

Time series methods for financial data:

1. Financial data is non-continuous time series data
2. ==The particular characteristics of financial data==