# What the reviewers care about
Note that one can submit a rebuttal PDF revision.
Reviewer 1 (8). Their complaints are mostly minor and can be addressed in an individualized reply.
Reviewer 2 (6). Their complaints are mostly about extensions of the algorithm to more expressive IRT models such as 2PL, 3PL, or MIRT. They also raise points about the ethics discussion and some smaller comments that can be addressed in an individualized reply.
Reviewer 3 (6). They also commented on extensions of the model, and have a small comment about joint estimation that can be addressed easily.
Reviewer 4 (6). They commented that we should have compared our method to a Bayesian method and discussed our experimental results in more depth (why AUC and log-likelihood are similar across methods while the spectral algorithm does better in terms of top-K accuracy). The detailed experiment discussion can be addressed in an individual reply to this reviewer.
## Thoughts after reviews
* Maybe we can position the paper more as applying the spectral algorithm to IRT, i.e., bringing attention to the utility of spectral methods for IRT.
# Main rebuttal
We thank the reviewers for all of their comments. We’re glad to see that the reviewers appreciate the novel application of the spectral algorithm to the Rasch model and the theoretical guarantees obtained in our work. We will address here the main comments from the reviewers, mostly pertaining to our experiments and extensions of the spectral method. We will also address more specific comments from each reviewer individually.
## Extensions of the spectral algorithm
As our work proposes a new approach to IRT based on spectral methods, extensions of the algorithm to more expressive IRT models are certainly promising research directions.
To reviewer 3: The spectral algorithm can be extended to perform joint estimation of both user and item parameters. We first apply the spectral algorithm to obtain the item parameters $\hat\beta$. Each individual user parameter can then be obtained by maximizing the log-likelihood
$\hat \theta_l = \arg\max_{\theta_l} \sum_{i} X_{li}\log\frac{1}{1+\exp(-(\theta_l - \hat\beta_i))} + (1-X_{li})\log\frac{1}{1+\exp(-(\hat\beta_i-\theta_l))}.$
This is a concave optimization problem in $\theta_l$ and can be solved efficiently.
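As a minimal sketch (not the exact implementation in our code), the per-user problem can be solved by bisection on the gradient, $\sum_i (X_{li} - \sigma(\theta_l - \hat\beta_i))$, which is strictly decreasing in $\theta_l$; the function name and search interval here are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_user_theta(responses, beta_hat, lo=-10.0, hi=10.0, tol=1e-8):
    """Maximize the per-user Rasch log-likelihood over theta_l by bisection.

    The gradient sum_i (X_li - sigmoid(theta_l - beta_i)) is strictly
    decreasing in theta_l, so its root is the unique maximizer (we clip
    to the search interval when all responses are 0 or all are 1, where
    the unconstrained maximizer diverges).
    """
    def grad(theta):
        return sum(x - sigmoid(theta - b) for x, b in zip(responses, beta_hat))

    if grad(lo) <= 0:   # essentially all-incorrect responses
        return lo
    if grad(hi) >= 0:   # essentially all-correct responses
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if grad(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, a user answering two items of symmetric difficulty $\pm 1$ with one correct and one incorrect response is placed at $\hat\theta_l = 0$.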
To reviewers 2 and 3: We can also extend the spectral algorithm to account for population heterogeneity under the mixed Rasch model [1]. The mixed Rasch model is similar to the multivariate IRT model in that it aims to capture the idea pointed out by reviewer 2: a student might be good at one subject (arithmetic) while being bad at another (literature). In terms of modeling, each of these subjects corresponds to a mixture component, and both the student abilities and the test difficulties differ from one mixture component to another.
A mixture learning algorithm can proceed as follows:
* Perform spectral clustering on the rows of the response matrix $X$ (such as via sparse PCA) to produce $K$ clusters of rows.
* Run the spectral algorithm on these $K$ clusters to obtain $K$ sets of parameters.
* The value of $K$ can be chosen using cross-validation or the Bayesian information criterion.
Given that the spectral algorithm runs significantly faster than other estimation algorithms, we can see its utility in learning mixtures of Rasch models where $K$ is large. The learned parameter estimates can be used to study the different subpopulations of students and how the test difficulties vary among these subpopulations.
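A minimal, hypothetical sketch of the clustering step, assuming a dense numpy response matrix; plain PCA stands in for the sparse PCA mentioned above, and scikit-learn's `KMeans` performs the clustering (the function name is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_rows(X, K, r=None):
    """Step 1 of the mixture recipe: spectrally cluster the rows of the
    binary response matrix X (n users x m items) into K clusters.

    We project the centered rows onto the top-r right singular vectors
    (a plain-PCA stand-in for sparse PCA) and run k-means on the
    projections.
    """
    r = r if r is not None else K
    Xc = X - X.mean(axis=0, keepdims=True)          # center the columns
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    proj = Xc @ Vt[:r].T                            # n x r row embeddings
    return KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(proj)

# Step 2 would run the spectral Rasch estimator separately on each row
# subset X[labels == k] to obtain the K sets of parameters.
```

On a toy matrix with two clearly separated row populations, the two groups are recovered exactly.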
[1] J Rost, C Carstensen, M Von Davier. Applying the Mixed Rasch Model to Personality Questionnaires (1997).
## Comparison to a Bayesian method
We have performed some additional experiments with a Bayesian estimation method whose implementation can be found at https://github.com/nd-ball/py-irt, based on the recent paper [2]. The Bayesian algorithm uses a hierarchical prior and variational inference for parameter estimation. Given that the paper [2] is well cited, we believe that this is a strong baseline to compare to.
The message is similar to the experimental results in our main paper: the spectral algorithm is competitive in terms of accuracy while being significantly more efficient than the Bayesian method. Furthermore, like the other methods shown in our experiments, the Bayesian method does not perform as well as the spectral method in terms of top-K ranking. The spectral method's practical advantages are most apparent in recommender-system applications and large-scale education datasets.
| Dataset  | AUC (Bayesian) | AUC (Spectral) |
|:--------:|:--------------:|:--------------:|
| LSAT     | 0.706 | 0.707 |
| 3 Grades | 0.532 | 0.532 |
| UCI      | 0.565 | 0.565 |
| ML-100K  | 0.681 | 0.662 |

| Dataset  | LogLik (Bayesian) | LogLik (Spectral) |
|:--------:|:-----------------:|:-----------------:|
| LSAT     | -0.487 | -0.487 |
| 3 Grades | -0.681 | -0.687 |
| UCI      | -0.693 | -0.706 |
| ML-100K  | -0.635 | -0.646 |

| Dataset | Top-K (Bayesian) | Top-K (Spectral) |
|:-------:|:----------------:|:----------------:|
| ML-100K | 0; 0; 0.04 | 0.4; 0.6; 0.54 |

| Dataset  | Time (Bayesian) | Time (Spectral) |
|:--------:|:---------------:|:---------------:|
| LSAT     | 63   | 0.028 |
| 3 Grades | 27   | 0.015 |
| UCI      | 26   | 0.021 |
| ML-100K  | 2.8k | 2     |
[2] Natesan et al., Bayesian Prior Choice in IRT Estimation Using MCMC and Variational Bayes
<!-- ## Discussions of experiments -->
# Reviewer 1
In addition to our main reply above, we hope to address some of your specific comments here.
> "Theorems 3.5 and 3.6 appear to be missing something ... "
This is a Cramér-Rao lower bound, so the estimator $T$ is assumed to be unbiased: its expectation must equal the true parameter for every $\beta^*$. An estimator that always outputs $T(X) = 0$ is therefore not unbiased.
> "existence and uniqueness of the estimator, i.e., of the stationary distribution of (3)"
This is a good point. The existence and uniqueness of the stationary distribution can be guaranteed by making sure that the Markov chain is ergodic. Roughly speaking, this means the chain can reach every state from every other state (irreducibility) and does not cycle deterministically (aperiodicity). This is guaranteed by our regularization scheme, which ensures that the pairwise transition probabilities are always positive and hence that the graph underlying the Markov chain is connected. If accepted, we will emphasize this point in the final version of the paper.
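To illustrate why positive transition probabilities suffice, here is a toy sketch (not our paper's construction) that mixes an arbitrary row-stochastic matrix with the uniform chain, making it ergodic, and computes the stationary distribution by power iteration; the function names are illustrative:

```python
import numpy as np

def regularize(P, eps=1e-3):
    """Toy version of the regularization discussed above: mix P with the
    uniform chain so every transition probability is strictly positive,
    which makes the chain irreducible and aperiodic."""
    n = P.shape[0]
    return (1 - eps) * P + eps * np.ones((n, n)) / n

def stationary_distribution(P, tol=1e-12, max_iter=10000):
    """Power iteration for the stationary distribution of a row-stochastic
    matrix P; convergence to a unique distribution is guaranteed when P
    is ergodic, e.g. when every entry is positive."""
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new = pi @ P
        if np.abs(new - pi).sum() < tol:
            break
        pi = new
    return pi
```

For instance, the periodic chain `[[0, 1], [1, 0]]` has no unique limiting behavior, but its regularized version converges to the uniform stationary distribution.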
> "The similarities and differences to existing pairwise & spectral approaches"
This is a good suggestion as well. We did not discuss this in greater depth mainly because of the space limit. If accepted, we will discuss the connection to other pairwise methods in the final version, which allows for more space.
> Typos and references
Thank you for pointing out these details. We'll surely address them in the final version.
> Why is the MMLE so bad in the Top-K case, despite achieving good AUC and log-likelihood? (I don't understand the caption of Table 1).
This seems to be a point of confusion for reviewer 4 (id: L9zk) as well. Please take a look at our reply to reviewer 4 where we clarify our setup and provide more explanations as to why the spectral method tends to outperform other methods in terms of top-K accuracy while AUC and log-likelihood tend to be similar among the methods.
> Can you comment to what extent the proofs of Thms 3.1 and 3.2. mirror the proofs in Rank Centrality paper for the BTL model?
It is reasonable to see a lot of similarities between our algorithm and Rank Centrality, as both construct a Markov chain and compute its stationary distribution to recover parameter estimates. As a starting point, the analyses of the two algorithms use similar tools (such as Lemma A.3 in our paper) to bound the error in the stationary distribution of a perturbed Markov chain. However, the pairwise transition probabilities in the two algorithms are defined differently. Our sampling model is also different from the Erdős-Rényi sampling model for the comparison graph in the analysis of Rank Centrality. Furthermore, the Rasch model has two sets of parameters while the BTL model has one. This difference plays out in our two separate results for two regimes of m: when m is a constant or small relative to n, versus when m also grows with n.
# Reviewer 2
We hope that our main reply above shows how the spectral algorithm can be extended to more expressive IRT models such as the mixed Rasch model.
As for your specific comments about the 2PL and 3PL models: the spectral algorithm is derived from certain properties that are characteristic of the 1PL model, so we are unsure what an extension to the 2PL and 3PL models would look like. The idea of a factored Markov chain sounds interesting, but we are not familiar enough with the topic to say much here.
> Potential negative social impacts ...
This is a very good point. We will certainly include a more thoughtful discussion of this in the final version of the paper.
> Minor points
Thanks for pointing these out, especially our claims on item parameter estimation. Perhaps a better way of phrasing this would be "one-sided parameter estimation", whether it is estimating the students' abilities or the items' parameters in the context of recommendation systems.
> The authors cite several other spectral methods but do not make clear if any of the theoretical results come from those papers. Is Proposition 2.1 specific to the current paper or was it proved in one of the earlier works?
This proposition and its proof are new. The pairwise transition probabilities of the Markov chain in our algorithm are different from those in Rank Centrality.
> For Theorem 3.3 (where m grows) the authors need to describe to this audience why JMLE fails in this case and what is different about their algorithm that makes it succeed.
We are actually not claiming that JMLE fails in this regime; this theorem only refers to the spectral method (the punchline is that when m grows, the spectral algorithm enjoys a better, in fact optimal, error rate). However, it has been shown that JMLE fails when the number of items m is a constant, whereas our algorithm remains consistent in that regime (Theorem 3.1).
> related works and how our algorithm differs
You raised very good points about the related work section and we will incorporate the suggested changes in the final version. We also hope to clarify the connection between our algorithm and Rank Centrality, the closest spectral algorithm in the literature. Superficially, they seem similar: both construct a Markov chain and estimate the parameters from its stationary distribution. However, the constructions of the Markov chains, i.e., the pairwise transition probabilities, are different. The resulting analysis is also different, as our sampling model differs from the Erdős-Rényi graph sampling model in the Rank Centrality paper. Furthermore, the BTL model to which Rank Centrality applies has one set of parameters, whereas the Rasch model has two. This difference plays out in Theorems 3.1 and 3.3, where we give two different guarantees for two regimes of m; the message is that the relative size of m and n affects the estimation error. Likewise, the theorems and propositions in the paper are new and not simple extensions of the results for Rank Centrality.
# Reviewer 3
We hope that our main reply above addresses both of your comments about potential extensions of the spectral method as well as joint parameter estimation (learning both user and items parameters).
For joint parameter estimation, one can also avoid using the MLE (say, due to computational concerns) by running the spectral algorithm twice to estimate $\hat \beta$ and $\hat \theta$. However, since the spectral algorithm essentially learns the 'differences' in the parameters and outputs normalized versions of the user and item parameters, one needs to "align" the two sets of parameters. Fortunately, this alignment comes in as a single scalar $a$. We estimate $\hat a$ by solving the following problem (keeping $\hat \beta$ and $\hat \theta$ fixed):
$\hat a = \arg\max_a \sum_{l\in [n], i\in[m]} X_{li} \log\frac{1}{1+\exp(-(\hat\theta_l + a - \hat\beta_i))} + (1-X_{li}) \log\frac{1}{1+\exp(-(\hat\beta_i - \hat\theta_l - a))}$
This is a concave maximization problem in a single scalar variable and can be solved efficiently.
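A minimal sketch of this one-dimensional problem, solved by bisection on the gradient $\sum_{l,i} (X_{li} - \sigma(\hat\theta_l + a - \hat\beta_i))$, which is strictly decreasing in $a$; the function name and search interval are illustrative, not from our code:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_alignment(X, theta_hat, beta_hat, lo=-10.0, hi=10.0, tol=1e-8):
    """Estimate the alignment scalar a by bisection on the log-likelihood
    gradient sum_{l,i} (X_li - sigmoid(theta_l + a - beta_i)), which is
    strictly decreasing in a because the objective is concave.
    X is a list of per-user response lists over the same items."""
    def grad(a):
        return sum(
            X[l][i] - sigmoid(theta_hat[l] + a - beta_hat[i])
            for l in range(len(theta_hat))
            for i in range(len(beta_hat))
        )

    if grad(lo) <= 0:   # degenerate data pushes a to the boundary
        return lo
    if grad(hi) >= 0:
        return hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if grad(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With two users of equal estimated ability, one correct and one incorrect response on a single item of matching difficulty, the estimated offset is $\hat a = 0$, as expected by symmetry.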
# Reviewer 4
We hope that the additional experiments in our main reply above address your question on how our method compares to a Bayesian method. The Bayesian algorithm we compare against is based on a recent and well-cited paper [2], so we believe it is a strong baseline.
Your comments on the discussion of our experimental results are well appreciated. If accepted, we will discuss the experimental results in greater depth, as the final version allows more space.
An important question you posed, which can serve as a good discussion point, is:
> Why AUC, log-likelihood are similar while top-K accuracy differs significantly between the methods
For the smaller-scale education datasets (UCI, 3 Grades, LSAT), the number of items (tests) is small and there are no missing responses, so there is much data relative to the number of parameters. Since the algorithms operate on the same statistical model, they output very similar estimates and their performance is very close. The difference is more apparent on the large-scale datasets and in top-K accuracy (which is not defined for the education datasets), where the responses are sparse and the number of items is large.
To transform ratings data into binary response data, for each user we convert all ratings higher than that user's average to 0 and all others to 1 (an item with a higher parameter value is better), as in [3]. The reference top-K set is the set of items with the highest average ratings. However, we removed items with very high average ratings but few ratings (<= 10) from this reference set, as we treat those ratings as noisy.
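A small illustrative sketch of this preprocessing step (the data layout and function name are assumptions, not from our code):

```python
def binarize_ratings(ratings_by_user):
    """Convert each user's ratings to binary responses as described above:
    ratings strictly above the user's own average become 0, all others
    become 1. `ratings_by_user` maps a user id to {item_id: rating};
    items a user never rated simply stay missing."""
    binary = {}
    for user, ratings in ratings_by_user.items():
        avg = sum(ratings.values()) / len(ratings)
        binary[user] = {item: 0 if r > avg else 1
                        for item, r in ratings.items()}
    return binary
```

For a user with ratings {5, 1, 3} (average 3), only the 5 is above average, so it maps to 0 and the others map to 1.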
Note that the values $K \in \{10, 25, 50\}$ are relatively small compared to the number of items in the large-scale datasets we use. This is applicable in settings such as RecSys, where we care about identifying the top few items. We observe that the other methods tend to "overvalue" items with noisy responses, whereas our method, which uses pairwise differentials, is less susceptible to noisy ratings. This could explain why, on some datasets, some of the baseline algorithms fail to identify any of the top items. Perhaps this highlights a limitation of classical IRT methods in applications like RecSys where there is much noisy data.
On the other hand, AUC and log-likelihood are evaluated on a held-out dataset containing a large number of items and users. The differences among the methods, when averaged over users and items, are quite small, so the performance gap becomes less significant.
> Some minor points about presentation
These are all good points and we'll certainly address these in a final version.
[2] Natesan et al., Bayesian Prior Choice in IRT Estimation Using MCMC and Variational Bayes
[3] Lan et al., An Estimation and Analysis Framework for the Rasch Model (ICML 18')