# NeurIPS2021 TimeSeriesCoreset Rebuttal
## Response to Area Chair
Thank you for your efforts in engaging with the reviewers and leading the discussions.
We appreciate the many constructive comments of the review team, and were delighted to see that four of the five reviewers recommend acceptance. We believe the critiques of Reviewer xadL (who alone rated 4 and did not recommend acceptance) on the empirical relevance of the time series model, the technical novelty of the coreset construction, and the empirical results are likely based on misunderstandings. In our response to the reviewer, we provide clarifications, and we therefore hope the reviewer will raise their evaluation to match the other four reviewers, all of whom acknowledge the technical novelty and relevance of our model.
Further, in response to the review team's request, we now report an additional benchmark beyond uniform sampling to highlight the value of our coreset for clustering time series data. This benchmark is a coreset for clustering cross-sectional data, constructed using [23]. We believe this additional analysis further clarifies and strengthens the contribution of the paper.
We explain below why we disagree with the three specific critiques of Reviewer xadL, and hope our clarifications, together with the additional analysis reported in point 4, will further address their concerns:
**1. Empirical Appropriateness of Modeling Assumptions**
> **"The model (with the assumptions imposed to get small coresets) seems inappropriate."**
As we discuss in the paper, there are three key sub-literatures within time series clustering, based on whether they operate directly on raw data, on features extracted from the raw data, or on models built from the raw data (see [44,2] for surveys). Our approach belongs to the third sub-literature on model-based clustering of time series data, specifically focusing on the AR(1) model, which is widely used in real-world applications (see, e.g., Frühwirth-Schnatter, Sylvia. "Panel data analysis: a survey on model-based clustering of time series." Advances in Data Analysis and Classification 5, no. 4 (2011): 251-280).
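To make the modeling setting concrete, below is a minimal simulation sketch, assuming one standard parameterization of a GMM with AR(1) errors; the paper's exact generative model is its Eq. (1), and the error recursion, function names, and parameter values here are illustrative assumptions only.

```python
# Illustrative data generator for model-based time series clustering:
# entity i first draws a latent cluster l, then evolves via an AR(1) error
# process around the cluster mean. ASSUMED parameterization (not Eq. (1)):
#   x_{it} = mu^(l) + e_{it},  e_{it} = lam^(l) * e_{i,t-1} + nu_{it},
#   nu_{it} ~ N(0, Sigma^(l)), with per-coordinate autocorrelations lam^(l).
import numpy as np

def simulate_entity(mu, lam, cov, T, rng):
    """Simulate one entity's length-T series around cluster mean mu."""
    d = mu.shape[0]
    e = np.zeros(d)                      # AR(1) error state
    X = np.empty((T, d))
    for t in range(T):
        e = lam * e + rng.multivariate_normal(np.zeros(d), cov)
        X[t] = mu + e
    return X

rng = np.random.default_rng(0)
k, d, N, T = 3, 2, 100, 50
mus = 5.0 * rng.standard_normal((k, d))   # cluster means mu^(l)
lams = np.full((k, d), 0.5)               # autocorrelations Lambda^(l)
covs = [np.eye(d) for _ in range(k)]      # covariances Sigma^(l)
labels = rng.integers(k, size=N)          # latent cluster of each entity
data = [simulate_entity(mus[l], lams[l], covs[l], T, rng) for l in labels]
```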
We note that other reviewers have acknowledged the importance of our problem setting and our model:
* *Reviewer Wpms*: "this is the **first paper** on coreset for time series data in the unsupervised setting."
* *Reviewer V91b*: "The problem is indeed **very interesting and relevant**."
* *Reviewer eQJc*: "this is the **first construction** of coresets in this setting (the problem of clustering multivariate time series)"
**2. Technical Novelty**
> **"Technically there is not more than already seen in the papers [23] where coresets for GMMs (with bounded condition) are developed."**
While we agree that our entity coreset construction uses some ideas from [23], novel technical ideas are required in our paper. A natural generalization of [23] to time series data would treat all observations $x_{it}$ as independent and compute a sensitivity for each $x_{it}$ directly for importance sampling. However, this approach cannot capture the property that the multiple observations $x_{i,1},\ldots,x_{i,T_i}$ are drawn from the same Gaussian model (some $l\in [k]$). To handle multiple observations, we show that although each $\psi_i$ consists of $T_i$ sub-functions $\psi_{it}$, it can be approximated by a single function of the average observation $b_i = \frac{\sum_{t\in [T_i]}x_{it}}{T_i}$; specifically, $\psi_i(\mu, I_d, 0_d) = d(b_i, \mu)^2 + O(1)$. This property enables us to "reduce" the representative complexity of $\psi_i$ and motivates two key steps of our construction: 1) For the **sensitivity function**, we give a reduction to a certain $k$-means clustering problem on the average observations $b_i$ of entities (Definition 4.2), by upper bounding the maximum effect of covariances and autocorrelations and applying the fact that $\psi_i(\mu, I_d, 0_d) = d(b_i, \mu)^2 + O(1)$. 2) For the **pseudo-dimension**, we prove that there are only $\mathrm{poly}(k,d)$ intrinsic operators in $\psi_i$ between parameters $\theta$ and observations $x_{it}$, based on the reduced representative complexity of $\psi_i$. We will make the novelty over [23] clearer in the final version.
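For concreteness, the following is a minimal sketch of this sensitivity-sampling pattern, not the paper's exact Algorithm 1: the provable sensitivity bound of Definition 4.2 is replaced by a crude surrogate built from the $k$-means cost of the average observations $b_i$, and the function name, constants, and weighting are illustrative assumptions.

```python
# Illustrative sketch of entity-level sensitivity sampling via k-means on the
# average observations b_i (NOT the paper's exact Algorithm 1 or its bound).
import numpy as np
from sklearn.cluster import KMeans

def entity_coreset(X_list, k, m, seed=0):
    """X_list: list of (T_i, d) arrays; returns m sampled entity indices + weights."""
    rng = np.random.default_rng(seed)
    B = np.stack([X.mean(axis=0) for X in X_list])          # b_i = mean_t x_it
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(B)
    d2 = ((B[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(-1)
    cost = d2.min(axis=1)                                   # k-means cost of each b_i
    n = len(X_list)
    s = cost / max(cost.sum(), 1e-12) + 1.0 / n             # crude sensitivity surrogate
    p = s / s.sum()                                         # importance-sampling distribution
    idx = rng.choice(n, size=m, replace=True, p=p)
    w = 1.0 / (m * p[idx])                                  # unbiased importance weights
    return idx, w
```

In the paper, the surrogate above is replaced by the provable sensitivity upper bound of Definition 4.2, which additionally accounts for the maximum effect of covariances and autocorrelations.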
Further, with our additional benchmark using the coreset for clustering cross-sectional data [23], it is now also empirically clear that our extension of coreset construction for time series clustering is practically significant for this widely used class of time series models.
We note that other reviewers also acknowledge and appreciate the technical novelty in our paper:
* *Reviewer Wpms*: "Overall I believe the paper has **reasonable novelty**."
* *Reviewer V91b*: "the analysis and proofs are indeed **deep and non-trivial**"
* *Reviewer eQJc*: "The authors propose a **novel** coreset construction framework for the problem of clustering multivariate time series".
**3. Coreset Performance Relative to Uniform Sampling**
> **“Even on synthetic data the coresets show only slight improvements over uniform sampling.”**
We believe this might be due to a misinterpretation of the results we report. As mentioned in Line 89, to achieve a fit similar to that on the full data, our coreset needs far fewer entity-time observations (<6% of the data); therefore, the computation time for a given level of accuracy is reduced by a factor of 2x-8x compared to uniform sampling.
We note that other reviewers also agree that our coreset algorithm outperforms uniform sampling:
* *Reviewer Wpms*: "They demonstrate the accuracy and efficiency of their coreset by superior empirical performance over synthetic data as compared to uniform sampling."
* *Reviewer V91b*: "The authors validate their coreset empirically on synthetic data, as compared to a uniform random sample."
**4. Coreset Performance Relative to Coreset for Clustering Cross-sectional Data[23]**
Further, in response to other reviewers' suggestions, we now show that our coreset for time series clustering outperforms the coreset for cross-sectional clustering in [23], both in terms of fit ($\gamma_S$) and computation time. We hope this further clarifies the empirical value of the time series clustering coreset relative to the existing literature, which has so far developed clustering coresets only for cross-sectional data.
**Table 1a: Coreset performance (fit) on Synthetic Data 1 ($N=500$, $T_i=500$).**
|$\varepsilon$|$V_S^\star$ of **CRGMM**|$V_S^\star$ of **Uni**|$V_S^\star$ of **LFKF**|$V^\star$|$\gamma_S^\star$ of **CRGMM**|$\gamma_S^\star$ of **Uni**|$\gamma_S^\star$ of **LFKF**|coreset size|
|---|---|---|---|---|---|---|---|---|
|0.1|2050|2058|2069|2041|22|34|56|1514|
|0.2|2093|2210|2264|2041|104|342|446|404|
|0.3|2194|2398|2384|2041|306|714|686|191|
|0.4|2335|3963|2705|2041|588|3844|1328|93|
|0.5|2383|3304|3461|2041|684|2526|2840|72|
**Table 1b: Coreset performance (fit) on Synthetic Data 2 ($N=200$, $T_i=1250$).**
|$\varepsilon$|$V_S^\star$ of **CRGMM**|$V_S^\star$ of **Uni**|$V_S^\star$ of **LFKF**|$V^\star$|$\gamma_S^\star$ of **CRGMM**|$\gamma_S^\star$ of **Uni**|$\gamma_S^\star$ of **LFKF**|coreset size|
|---|---|---|---|---|---|---|---|---|
|0.1|811|825|841|812|2|26|58|1718|
|0.2|824|895|871|812|24|166|118|447|
|0.3|864|958|994|812|104|292|364|199|
|0.4|860|1190|1114|812|96|756|604|98|
|0.5|910|1361|1284|812|196|1098|944|71|
**Table 2a: Coreset performance (computation time) on Synthetic Data 1 ($N=500$, $T_i=500$).**
|$\varepsilon$|$T_S+T_C$ (s) of **CRGMM**|$T_S+T_C$ (s) of **Uni**|$T_S+T_C$ (s) of **LFKF**|$T_\mathcal{X}$ (s)|
|---|---|---|---|---|
|0.1|109|2416|1355|3436|
|0.2|41|1054|652|3436|
|0.3|62|419|392|3436|
|0.4|38|621|260|3436|
|0.5|47|132|147|3436|
**Table 2b: Coreset performance (computation time) on Synthetic Data 2 ($N=200$, $T_i=1250$).**
|$\varepsilon$|$T_S+T_C$ (s) of **CRGMM**|$T_S+T_C$ (s) of **Uni**|$T_S+T_C$ (s) of **LFKF**|$T_\mathcal{X}$ (s)|
|---|---|---|---|---|
|0.1|694|1859|1687|9787|
|0.2|139|1991|1484|9787|
|0.3|51|527|832|9787|
|0.4|43|408|450|9787|
|0.5|57|389|277|9787|
## Review Wpms, Score: 7, Confidence: 3
Thank you for your positive review and for your comments on the writing and presentation of empirical results. We accept your suggestions and provide clarifications for your queries. Further, in response to your suggestion, we have added a new benchmark using a coreset for clustering cross-sectional data and show the empirical value of our time series clustering coreset. We hope these changes further strengthen your support for our paper. Please see our point-by-point response below.
> **"Quality and Clarity. It would be useful if some convention is followed for writing these variables. For e.g. vectors in small bold letters, matrices in capital bold letters, scalars in small letters etc. In current state it becomes very difficult while reading and one has to go to and fro to check if a given variable is vector, scalar etc."**
*Response:* Thank you for the suggestion. We will revise the paper to follow the suggested convention for different types of variables.
> **"On line 183 eq. 2 should it be $f'(\alpha(\theta), \theta)$ and not just alpha as a parameter?"**
*Response:* Yes, it should be $f'(\alpha'(\theta), \theta)$.
> **"What is P on line 174? What is tau in the input of algorithm 1 as subscripst of S?"**
*Response:* $P$ should be $P_\mathcal{X}$. The $\tau$ is a typo: $\mathcal{S}_\tau^d$ should be $\mathcal{S}^d$.
> **"They are only for synthetic data. It would be good if we can have some experiments with real datasets."**
*Response:* It would indeed be a valuable future direction to empirically evaluate our findings on real-world datasets. However, given space constraints and the objective of our current paper to show the value of coresets for time series clustering, we chose to follow your suggestion below and add an additional benchmark using coresets for cross-sectional data (specifically [23]). We discuss the details of those results below. We hope you agree with our prioritization.
> **"Is it also possible to at least sample points in the first part using some existing k-means coreset building procedure and compare with that?"**
*Response:* Thanks for your comments. In response to your suggestion, in addition to uniform sampling, we also consider coreset-based sampling for cross-sectional data (i.e., $T=1$, where entity-time pairs are considered independent) as a baseline. The results show that our coreset for time series clustering outperforms the coreset for cross-sectional clustering in [23], both in terms of fit ($\gamma_S$) and computation time. We hope this further clarifies the empirical value of the time series clustering coreset relative to the existing literature, which has so far developed clustering coresets only for cross-sectional data.
**Table 1a: Coreset performance (fit) on Synthetic Data 1 ($N=500$, $T_i=500$).**
|$\varepsilon$|$V_S^\star$ of **CRGMM**|$V_S^\star$ of **Uni**|$V_S^\star$ of **LFKF**|$V^\star$|$\gamma_S^\star$ of **CRGMM**|$\gamma_S^\star$ of **Uni**|$\gamma_S^\star$ of **LFKF**|coreset size|
|---|---|---|---|---|---|---|---|---|
|0.1|2050|2058|2069|2041|22|34|56|1514|
|0.2|2093|2210|2264|2041|104|342|446|404|
|0.3|2194|2398|2384|2041|306|714|686|191|
|0.4|2335|3963|2705|2041|588|3844|1328|93|
|0.5|2383|3304|3461|2041|684|2526|2840|72|
**Table 1b: Coreset performance (fit) on Synthetic Data 2 ($N=200$, $T_i=1250$).**
|$\varepsilon$|$V_S^\star$ of **CRGMM**|$V_S^\star$ of **Uni**|$V_S^\star$ of **LFKF**|$V^\star$|$\gamma_S^\star$ of **CRGMM**|$\gamma_S^\star$ of **Uni**|$\gamma_S^\star$ of **LFKF**|coreset size|
|---|---|---|---|---|---|---|---|---|
|0.1|811|825|841|812|2|26|58|1718|
|0.2|824|895|871|812|24|166|118|447|
|0.3|864|958|994|812|104|292|364|199|
|0.4|860|1190|1114|812|96|756|604|98|
|0.5|910|1361|1284|812|196|1098|944|71|
**Table 2a: Coreset performance (computation time) on Synthetic Data 1 ($N=500$, $T_i=500$).**
|$\varepsilon$|$T_S+T_C$ (s) of **CRGMM**|$T_S+T_C$ (s) of **Uni**|$T_S+T_C$ (s) of **LFKF**|$T_\mathcal{X}$ (s)|
|---|---|---|---|---|
|0.1|109|2416|1355|3436|
|0.2|41|1054|652|3436|
|0.3|62|419|392|3436|
|0.4|38|621|260|3436|
|0.5|47|132|147|3436|
**Table 2b: Coreset performance (computation time) on Synthetic Data 2 ($N=200$, $T_i=1250$).**
|$\varepsilon$|$T_S+T_C$ (s) of **CRGMM**|$T_S+T_C$ (s) of **Uni**|$T_S+T_C$ (s) of **LFKF**|$T_\mathcal{X}$ (s)|
|---|---|---|---|---|
|0.1|694|1859|1687|9787|
|0.2|139|1991|1484|9787|
|0.3|51|527|832|9787|
|0.4|43|408|450|9787|
|0.5|57|389|277|9787|
###########################################
## Review xadL, Score: 4, Confidence: 4
Thank you for your comments and questions, which give us the chance to better clarify our novelty and contributions. We answer your questions below and hope that our responses address your concerns and that you will increase your support for the paper.
> **"The model (with the assumptions imposed to get small coresets) seems inappropriate. When the covariance matrices have constantly bounded conditions and the autocorrelation AR(1) is used, then every time series is just a random walk concentrated around their mean. This is very well behaved time series data. Why is there no comparison or related work for maybe more appropriate time series (polygonal curve) clustering eg under dynamic time-warping or Frechet distance, and where values can oscillate arbitrarily. See for example the recent series of works by K. Buchin or A. Driemel or D. Rhode."**
*Response:* Thank you for requesting this clarification. As we discuss in the paper, there are three key sub-literatures within time series clustering, based on whether they operate directly on raw data, on features extracted from the raw data, or on models built from the raw data (see [44,2] for surveys). The references you mention belong to the first two streams of literature. Our approach belongs to the third sub-literature on model-based clustering of time series data, specifically focusing on the AR(1) model. We will add a reference to a survey on model-based clustering to highlight the importance of this sub-literature and its various real-world applications (e.g., Frühwirth-Schnatter, Sylvia. "Panel data analysis: a survey on model-based clustering of time series." Advances in Data Analysis and Classification 5, no. 4 (2011): 251-280).
> **"Technically there is not more than already seen in the papers [23] where coresets for GMMs (with bounded condition) are developed."**
*Response:* While we agree that our entity coreset construction uses some ideas from [23], novel technical ideas are required in our paper. A natural generalization of [23] to time series data would treat all observations $x_{it}$ as independent and compute a sensitivity for each $x_{it}$ directly for importance sampling. However, this approach cannot capture the property that the multiple observations $x_{i,1},\ldots,x_{i,T_i}$ are drawn from the same Gaussian model (some $l\in [k]$). To handle multiple observations, we show that although each $\psi_i$ consists of $T_i$ sub-functions $\psi_{it}$, it can be approximated by a single function of the average observation $b_i = \frac{\sum_{t\in [T_i]}x_{it}}{T_i}$; specifically, $\psi_i(\mu, I_d, 0_d) = d(b_i, \mu)^2 + O(1)$. This property enables us to "reduce" the representative complexity of $\psi_i$ and motivates two key steps of our construction: 1) For the **sensitivity function**, we give a reduction to a certain $k$-means clustering problem on the average observations $b_i$ of entities (Definition 4.2), by upper bounding the maximum effect of covariances and autocorrelations and applying the fact that $\psi_i(\mu, I_d, 0_d) = d(b_i, \mu)^2 + O(1)$. 2) For the **pseudo-dimension**, we prove that there are only $\mathrm{poly}(k,d)$ intrinsic operators in $\psi_i$ between parameters $\theta$ and observations $x_{it}$, based on the reduced representative complexity of $\psi_i$. We will make the novelty over [23] clearer in the final version.
> **"Why is the more recent generalization without this bound [25] being dismissed? I would rather deal with more general covariances but loose the log n factor."**
*Response:* Generalizing our results in the same manner as [25] generalizes [23] would be an interesting future direction. Currently, it is unclear how to extend the approach of [25] to time series data: [25] assumes that each point is an integral point within a bounded box, while we consider a continuous GMM generative model (1); hence, each coordinate of an observation $x_{itr}$ is drawn from a continuous GMM distribution, is non-integral with probability 1, and can be unbounded. We will add this to the discussion section.
> **"The authors try to sell it as new or different by pointing out the two weighting functions $w(i)$ and $w_i(j)$ which due to non-linearities can't be combined, but in principle it was the same in [34] up to finally defining $w(i,j) = w(i) * w_i(j)$."**
*Response:* If you are referring to Lines 185-189, we agree with you and will clarify in the final version.
> **"Another thing that they point out that the component's cost functions are not bounded as in [34] but the bound on the eigenvalues is just the same, noting that the eigenvalues are dominating the costs in the GMM."**
*Response:* While the eigenvalues do affect the cost functions $\psi_i$, the positions of the means $\mu^{(l)}$ also affect them. For instance, consider the case where all eigenvalues are 1 and all autocorrelations are 0, i.e., for all $l\in [k]$, $\Sigma^{(l)}= I_d$ and $\Lambda^{(l)}=0_d$. In this case, it is easy to see that $\frac{\max_{\mu\in \mathbb{R}^d}\psi_i(\mu, I_d, 0_d)}{\min_{\mu\in \mathbb{R}^d}\psi_i(\mu, I_d, 0_d)} = \infty$, i.e., the value of $\psi_i(\mu, I_d, 0_d)$ is unbounded as $\mu$ varies. In contrast, the components' cost functions are bounded in [34] since the means $\mu^{(l)}$ do not appear there. We will clarify this point in the final version.
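For intuition, this unboundedness (and the approximation $\psi_i(\mu, I_d, 0_d) = d(b_i, \mu)^2 + O(1)$ used in our construction) can be seen from the standard decomposition around the average observation $b_i$, assuming for illustration that $\psi_i(\cdot, I_d, 0_d)$ equals, up to additive constants, the normalized sum of squared deviations:

$$\frac{1}{T_i}\sum_{t\in [T_i]} \|x_{it} - \mu\|^2 = \|b_i - \mu\|^2 + \frac{1}{T_i}\sum_{t\in [T_i]} \|x_{it} - b_i\|^2 .$$

The first term grows without bound as $\|\mu\|\to\infty$, while the second does not depend on $\mu$; hence the ratio above is indeed infinite.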
> **"Even on synthetic data the coresets show only slight improvements over uniform sampling."**
>
> *and*
>
> **"Since the authors are free to model the data in any way, they could at least simulate one data set that is particularly hard for uniform sampling where coresets show their strength. Currently it looks like if data follows the generating model, sensitivities are pretty uniform."**
*Response to the above two comments:*
As mentioned in Line 89, to achieve a fit similar to that on the full data, our coreset needs far fewer entity-time observations (<6% of the data); therefore, the computation time for a given level of accuracy is reduced by a factor of 2x-8x compared to uniform sampling.
Given our clarification above that our coreset does indeed do better than uniform sampling, we hope you agree that the existing simulated datasets already show the strength of our coresets.
> **"Why is there no real-world data comparison? Is there even any real data that can be modeled similarly to the GMM model described here?"**
*Response:*
*Can real data be modeled using our GMM model?* As we discussed earlier, model-based clustering of time series data is an important sub-literature in time series clustering, and AR(1) models with a one-period lag are a primary workhorse for time series modeling. In the model-based clustering survey mentioned above, please see Eq. (1) in Section 2.1, which reflects the first-order Markov process captured by our AR(1) model.
*Why no real-world data comparison?* It would indeed be a valuable future direction to empirically evaluate our findings on real-world datasets. However, given space constraints and the objective of our current paper to show the value of coresets for time series clustering, we chose to further demonstrate the value of our coreset by adding an additional benchmark using coresets for cross-sectional data (i.e., $T=1$, where entity-time pairs are considered independent; specifically, we use [23]). We discuss the details of those results below. We hope you agree with our prioritization.
The results show that our coreset for time series clustering outperforms the coreset for cross-sectional clustering in [23], both in terms of fit ($\gamma_S$) and computation time. We hope this further clarifies the empirical value of the time series clustering coreset relative to the existing literature, which has so far developed clustering coresets only for cross-sectional data.
**Table 1a: Coreset performance (fit) on Synthetic Data 1 ($N=500$, $T_i=500$).**
|$\varepsilon$|$V_S^\star$ of **CRGMM**|$V_S^\star$ of **Uni**|$V_S^\star$ of **LFKF**|$V^\star$|$\gamma_S^\star$ of **CRGMM**|$\gamma_S^\star$ of **Uni**|$\gamma_S^\star$ of **LFKF**|coreset size|
|---|---|---|---|---|---|---|---|---|
|0.1|2050|2058|2069|2041|22|34|56|1514|
|0.2|2093|2210|2264|2041|104|342|446|404|
|0.3|2194|2398|2384|2041|306|714|686|191|
|0.4|2335|3963|2705|2041|588|3844|1328|93|
|0.5|2383|3304|3461|2041|684|2526|2840|72|
**Table 1b: Coreset performance (fit) on Synthetic Data 2 ($N=200$, $T_i=1250$).**
|$\varepsilon$|$V_S^\star$ of **CRGMM**|$V_S^\star$ of **Uni**|$V_S^\star$ of **LFKF**|$V^\star$|$\gamma_S^\star$ of **CRGMM**|$\gamma_S^\star$ of **Uni**|$\gamma_S^\star$ of **LFKF**|coreset size|
|---|---|---|---|---|---|---|---|---|
|0.1|811|825|841|812|2|26|58|1718|
|0.2|824|895|871|812|24|166|118|447|
|0.3|864|958|994|812|104|292|364|199|
|0.4|860|1190|1114|812|96|756|604|98|
|0.5|910|1361|1284|812|196|1098|944|71|
**Table 2a: Coreset performance (computation time) on Synthetic Data 1 ($N=500$, $T_i=500$).**
|$\varepsilon$|$T_S+T_C$ (s) of **CRGMM**|$T_S+T_C$ (s) of **Uni**|$T_S+T_C$ (s) of **LFKF**|$T_\mathcal{X}$ (s)|
|---|---|---|---|---|
|0.1|109|2416|1355|3436|
|0.2|41|1054|652|3436|
|0.3|62|419|392|3436|
|0.4|38|621|260|3436|
|0.5|47|132|147|3436|
**Table 2b: Coreset performance (computation time) on Synthetic Data 2 ($N=200$, $T_i=1250$).**
|$\varepsilon$|$T_S+T_C$ (s) of **CRGMM**|$T_S+T_C$ (s) of **Uni**|$T_S+T_C$ (s) of **LFKF**|$T_\mathcal{X}$ (s)|
|---|---|---|---|---|
|0.1|694|1859|1687|9787|
|0.2|139|1991|1484|9787|
|0.3|51|527|832|9787|
|0.4|43|408|450|9787|
|0.5|57|389|277|9787|
###########################################
## Review 3r9a, Score: 7, Confidence: 3
Thank you for your positive review and feedback. We answer your specific questions below.
> **"Line 141, $\Delta_k$ instead of $\Delta^k$"**
*Response:* Yes, you are right. It is a typo.
> **"$e_{it}$ is not a scalar based on its definition at Line 150. Did you mean to write $e_{it}\in\mathbb{R}^d$?"**
*Response:* Yes, we mean $e_{it}\in\mathbb{R}^d$.
> **"Why didn't you apply your coreset on real-world panel data?"**
*Response:* It would indeed be a valuable future direction to empirically evaluate our findings on real-world datasets. However, given space constraints and the objective of our current paper to show the value of coresets for time series clustering, we chose to further demonstrate the value of our coreset by adding an additional benchmark using coresets for cross-sectional data (i.e., $T=1$, where entity-time pairs are considered independent; specifically, we use [23]). We discuss the details of those results below. We hope you agree with our prioritization.
The results show that our coreset for time series clustering outperforms the coreset for cross-sectional clustering in [23], both in terms of fit ($\gamma_S$) and computation time. We hope this further clarifies the empirical value of the time series clustering coreset relative to the existing literature, which has so far developed clustering coresets only for cross-sectional data.
**Table 1a: Coreset performance (fit) on Synthetic Data 1 ($N=500$, $T_i=500$).**
|$\varepsilon$|$V_S^\star$ of **CRGMM**|$V_S^\star$ of **Uni**|$V_S^\star$ of **LFKF**|$V^\star$|$\gamma_S^\star$ of **CRGMM**|$\gamma_S^\star$ of **Uni**|$\gamma_S^\star$ of **LFKF**|coreset size|
|---|---|---|---|---|---|---|---|---|
|0.1|2050|2058|2069|2041|22|34|56|1514|
|0.2|2093|2210|2264|2041|104|342|446|404|
|0.3|2194|2398|2384|2041|306|714|686|191|
|0.4|2335|3963|2705|2041|588|3844|1328|93|
|0.5|2383|3304|3461|2041|684|2526|2840|72|
**Table 1b: Coreset performance (fit) on Synthetic Data 2 ($N=200$, $T_i=1250$).**
|$\varepsilon$|$V_S^\star$ of **CRGMM**|$V_S^\star$ of **Uni**|$V_S^\star$ of **LFKF**|$V^\star$|$\gamma_S^\star$ of **CRGMM**|$\gamma_S^\star$ of **Uni**|$\gamma_S^\star$ of **LFKF**|coreset size|
|---|---|---|---|---|---|---|---|---|
|0.1|811|825|841|812|2|26|58|1718|
|0.2|824|895|871|812|24|166|118|447|
|0.3|864|958|994|812|104|292|364|199|
|0.4|860|1190|1114|812|96|756|604|98|
|0.5|910|1361|1284|812|196|1098|944|71|
**Table 2a: Coreset performance (computation time) on Synthetic Data 1 ($N=500$, $T_i=500$).**
|$\varepsilon$|$T_S+T_C$ (s) of **CRGMM**|$T_S+T_C$ (s) of **Uni**|$T_S+T_C$ (s) of **LFKF**|$T_\mathcal{X}$ (s)|
|---|---|---|---|---|
|0.1|109|2416|1355|3436|
|0.2|41|1054|652|3436|
|0.3|62|419|392|3436|
|0.4|38|621|260|3436|
|0.5|47|132|147|3436|
**Table 2b: Coreset performance (computation time) on Synthetic Data 2 ($N=200$, $T_i=1250$).**
|$\varepsilon$|$T_S+T_C$ (s) of **CRGMM**|$T_S+T_C$ (s) of **Uni**|$T_S+T_C$ (s) of **LFKF**|$T_\mathcal{X}$ (s)|
|---|---|---|---|---|
|0.1|694|1859|1687|9787|
|0.2|139|1991|1484|9787|
|0.3|51|527|832|9787|
|0.4|43|408|450|9787|
|0.5|57|389|277|9787|
> **"Please provide a small discussion regarding the lower bound on the coreset size which is needed for GMMs on time-series data."**
*Response:* Thank you for the comment; we will add a discussion of the lower bound on the coreset size in the final version. There is currently no provable lower bound. We conjecture that without the assumption on the condition number ((1) in Assumption 4.1), the coreset size must depend exponentially on $k$ and logarithmically on $n$. The intuition comes from the simple setting where all $T_i=1$ (GMM with static data), in which [25] reduces the problem to projective clustering, whose coreset size depends exponentially on $k$ and logarithmically on $n$. Moreover, [25] conjectures that these dependencies are unavoidable for GMM coresets.
###########################################
## Review V91b, Score: 6, Confidence: 4
Thanks for your positive review and feedback. We answer your specific questions below and hope that you will increase your support for our paper.
> **"The contributions paragraph is very high-level, and lacks the important details (e.g., coreset size, actual computation time, etc.)."**
*Response:* Thanks for your feedback; we will add these details to the contributions paragraph.
> **"The coreset obtains an additive error of epsilon times $f'$, where $f'$ is a different cost function. This relation should be investigated more thoroughly, both in theory and in the conducted experiments."**
*Response:* Thanks for your suggestion; we will add a discussion of the relation between $f'_S$ and $f$. In theory, by an argument similar to Inequality (2), we have
$$f'_S(\alpha'(\theta), \theta) + \phi(\alpha, \theta) \in (1\pm \varepsilon)\cdot f'(\alpha,\theta) + \phi(\alpha,\theta) \in (1\pm \varepsilon)\cdot f(\alpha, \theta) \pm 2\varepsilon\, \phi(\alpha,\theta).$$
Thus, if $\phi(\alpha, \theta)\lesssim f(\alpha, \theta)$, we conclude that $f'_S(\alpha'(\theta), \theta)+ \phi(\alpha,\theta) \in (1\pm 3\varepsilon)\cdot f(\alpha, \theta)$. For the empirical results, we will add a comparison between optimizing $f_S(\alpha, \theta)$ directly and optimizing $f'_S(\alpha'(\theta), \theta) + \phi(\alpha, \theta)$ on the coreset $S$ in the final version.
> **"I would recommend that the authors add a "hardness" paragraph that discusses this topic."**
*Response:* Thank you for the suggestion; we will add a paragraph discussing our technical novelty. While we agree that our entity coreset uses some ideas from [23], novel technical ideas are required in our construction. A natural generalization of [23] to time series data would treat all observations $x_{it}$ as independent and compute a sensitivity for each $x_{it}$ directly for importance sampling. However, this approach cannot capture the property that the multiple observations $x_{i,1},\ldots,x_{i,T_i}$ are drawn from the same Gaussian model (some $l\in [k]$). To handle multiple observations, we show that although each $\psi_i$ consists of $T_i$ sub-functions $\psi_{it}$, it can be approximated by a single function of the average observation $b_i = \frac{\sum_{t\in [T_i]}x_{it}}{T_i}$; specifically, $\psi_i(\mu, I_d, 0_d) = d(b_i, \mu)^2 + O(1)$. This property enables us to "reduce" the representative complexity of $\psi_i$ and motivates two key steps of our construction: 1) For the **sensitivity function**, we reduce to a certain $k$-means clustering problem on the average observations $b_i$ of entities (Definition 4.2), by upper bounding the maximum effect of covariances and autocorrelations and applying the fact that $\psi_i(\mu, I_d, 0_d) = d(b_i, \mu)^2 + O(1)$. 2) For the **pseudo-dimension**, we prove that there are only $\mathrm{poly}(k,d)$ intrinsic operators in $\psi_i$ between parameters $\theta$ and observations $x_{it}$, based on the reduced representative complexity of $\psi_i$.
> **"Is there a provable lower bound for the coreset size?"**
*Response:* No, we do not have a provable lower bound for the coreset size. We conjecture that without the assumption on the condition number ((1) in Assumption 4.1), the coreset size must depend exponentially on $k$ and logarithmically on $n$. The intuition comes from the simple setting where all $T_i=1$ (GMM with static data), in which [25] reduces the problem to projective clustering, whose coreset size depends exponentially on $k$ and logarithmically on $n$. Moreover, [25] conjectures that these dependencies are unavoidable for GMM coresets. Investigating this direction is an interesting topic for future work.
> **"Establishing a connection to those problems (clustering input sets, or clustering shapes) seems very relevant."** -- Remain
*Response:* Thank you for the suggestion. We will add a paragraph to discuss these problems in the related work section in the final version.
> **"How was the optimization problem solved on the full data? Was it done similarly to the EM solution applied on the coreset?"**
*Response:* Yes. As mentioned in Footnote 2, we implement an EM algorithm on both the full data and the coreset.
> **"It seems a bit strange that the proposed coreset achieves a 2x-8x acceleration in solving Problem 3.1 compared to the uniform sampling, as the random sample is computed much faster, and is of the same size as the coreset. Can the authors elaborate on this?"**
*Response:* From Table 1, the construction time $T_C$ is at most 2 seconds for both our coreset and uniform sampling, which is at most 5% of the computation time $T_\mathcal{X}$. Thus, our coreset achieves a 2x-8x acceleration in solving Problem 3.1 compared to uniform sampling because the computation time on the coreset is much shorter. As mentioned in Footnote 3, a possible explanation is that our coreset selects a set $I_S$ of entities that is more representative than a uniform sample, and hence achieves a better convergence speed.
> **"The dimension d tested was only $d=2$ (probably due to the large dependency, in theory, on $d$). However, as the theory in most coreset papers is far too pessimistic, it would be interesting to see this effect in practice for $d>2$."**
*Response:* We agree that it would be interesting to assess the sensitivity of the coreset size to $d$ in practice. However, given competing suggestions for additional experiments, we decided to prioritize the comparison of our coreset for time series clustering against the coreset for cross-sectional clustering based on [23]. We find that our coreset outperforms the coreset from [23] on both fit and computation time. We will mention the dependence on $d$ as an important topic for future work.
> **"In the definition of the input set $X$, the subscript of the first element $x_{i1}$ of every $X_i$ does not have a comma, while the last element $x_{i,Ti}$ does have a comma. It should be consistent. Also, $X_i$ is said to be a subset of $\mathbb{R}^{T_i \times d}$. However, as $X_i$ is not a set, the "$\subset$" should be replaced with "$\in$", and it should be $\mathbb{R}^{d\times T_i}$ as there are $T_i$ columns and $d$ rows in $X_i$."**
*Response:* We will make $x_{i,t}$ consistent and fix these typos.
> **"The argmax in Line 37: it is unclear over what set are we maximizing the expression."**
*Response:* It should be $\arg\max_{\alpha, \theta^{(1)}, \ldots, \theta^{(k)}}$.
> **"Line 38: missing closing bracket."**
*Response:* Yes, it should be $\theta^{(l)})$ instead of $\theta^{(l)}$.
> **"Line 80: "where objective" -> "where the objective"."**
*Response:* Yes, it is a typo.
> **"Line 91: "compared to uniform", sentence not complete."**
*Response:* It should be "uniform sampling" instead of "uniform".
> **"Line 174: If I'm not missing anything, then $P$ is undefined (should be $P_\mathcal{X}$). Similarly in Definition 3.3."**
*Response:* Yes, $P$ should be $P_\mathcal{X}$.
> **"Line 182: Extra closing bracket."**
*Response:* Yes, it should be $\Sigma^{(l)}$ instead of $\Sigma^{(l)})$.
> **"I would recommend either spacing the titles more, or placing some vertical dividers in between unrelated columns."**
*Response:* Thank you for the suggestion. We will place vertical dividers in between unrelated columns.
> **"An illustration figure to demonstrate the "GMM clustering with time series data" problem (even for d=2) would be very appreciated."** -- Remain
*Response:* Thank you for the suggestion. We will include an illustration figure in the final version.
###########################################
## Review eQJc, Score: 6, Confidence: 2
Thanks for your positive review and feedback. We answer your specific questions below and hope that you will increase your support for our paper.
> **"What if you keep all entities but sample only time? What if you keep all time periods but sample only entities?"**
*Response:* These two approaches would result in a subset of entity-time pairs whose size depends on either the number of entities $N$ or the number of time periods $T$, which is much larger than our coreset size. Empirically, however, we want to compare the achieved likelihood ratio between our coreset and the baseline algorithms at the same sample size. Therefore, we did not include these two approaches as baselines.
Please also note that, in response to your next comment, we now use a coreset for cross-sectional data, which ignores the time series aspect, as a new baseline and compare against it under different levels of autocorrelation.
We hope you agree with our reasoning for why we chose the baselines we have.
> **"What if you forget the time in your time series and directly apply the algorithm from [45]? What gain do you truly get with your approach?"**
*Response:* To capture the spirit of your comment (and feedback from other members of the review team), we now compare against a coreset algorithm for cross-sectional data with $T=1$ (specifically [23]), which ignores the time series aspects.
The results show that our coreset for time series clustering outperforms the coreset for cross-sectional clustering in [23], both in terms of fit ($\gamma_S$) and computation time. We hope this further clarifies the empirical value of the time series clustering coreset relative to the existing literature, which has so far developed clustering coresets only for cross-sectional data.
**Table 1a: Coreset performance (fit) on Synthetic Data 1 ($N=500$, $T_i=500$).**
|$\varepsilon$|$V_S^\star$ of **CRGMM**|$V_S^\star$ of **Uni**|$V_S^\star$ of **LFKF**|$V^\star$|$\gamma_S^\star$ of **CRGMM**|$\gamma_S^\star$ of **Uni**|$\gamma_S^\star$ of **LFKF**|coreset size|
|---|---|---|---|---|---|---|---|---|
|0.1|2050|2058|2069|2041|22|34|56|1514|
|0.2|2093|2210|2264|2041|104|342|446|404|
|0.3|2194|2398|2384|2041|306|714|686|191|
|0.4|2335|3963|2705|2041|588|3844|1328|93|
|0.5|2383|3304|3461|2041|684|2526|2840|72|
**Table 1b: Coreset performance (fit) on Synthetic Data 2 ($N=200$, $T_i=1250$).**
|$\varepsilon$|$V_S^\star$ of **CRGMM**|$V_S^\star$ of **Uni**|$V_S^\star$ of **LFKF**|$V^\star$|$\gamma_S^\star$ of **CRGMM**|$\gamma_S^\star$ of **Uni**|$\gamma_S^\star$ of **LFKF**|coreset size|
|---|---|---|---|---|---|---|---|---|
|0.1|811|825|841|812|2|26|58|1718|
|0.2|824|895|871|812|24|166|118|447|
|0.3|864|958|994|812|104|292|364|199|
|0.4|860|1190|1114|812|96|756|604|98|
|0.5|910|1361|1284|812|196|1098|944|71|
**Table 2a: Coreset performance (computation time) on Synthetic Data 1 ($N=500$, $T_i=500$).**
|$\varepsilon$|$T_S+T_C$ (s) of **CRGMM**|$T_S+T_C$ (s) of **Uni**|$T_S+T_C$ (s) of **LFKF**|$T_\mathcal{X}$ (s)|
|---|---|---|---|---|
|0.1|109|2416|1355|3436|
|0.2|41|1054|652|3436|
|0.3|62|419|392|3436|
|0.4|38|621|260|3436|
|0.5|47|132|147|3436|
**Table 2b: Coreset performance (computation time) on Synthetic Data 2 ($N=200$, $T_i=1250$).**
|$\varepsilon$|$T_S+T_C$ (s) of **CRGMM**|$T_S+T_C$ (s) of **Uni**|$T_S+T_C$ (s) of **LFKF**|$T_\mathcal{X}$ (s)|
|---|---|---|---|---|
|0.1|694|1859|1687|9787|
|0.2|139|1991|1484|9787|
|0.3|51|527|832|9787|
|0.4|43|408|450|9787|
|0.5|57|389|277|9787|