# Pool Pricing, 2023
## Reviewer List
- Yiding Feng
- Wei Tang
- Yiwei Chen
- Arnoud V. den Boer
- Boxiao (Beryl) Chen
- Renato Paes Leme
- Balasubramanian Sivan
## 12/1/23 Rebuttal Notes
### Reviewer 1
- Confusion about $n$ and $k$
### Main Points to Cover
## 102723 Delta from conference to journal
- Immediate-buy model:
- read Talluri, van Ryzin Section 5.5
- Mention that the special case $\lambda = \infty$ has been studied
- A related question (no need to add this to the paper, since the rest of the paper does not consider limited price changes; but just keep in mind in case someone asks)
- prices $v_1,...,v_k$
- allow $m<k$ price changes
- each valuation group has $n_i$ buyers, unknown
- (maybe) structural lemma holds for infinite $\lambda$
- Goal: find a non-adap policy $(t_1,p_1,...,t_m,p_m)$
- LB for regret (Su)
- Restless bandits:
- Each arm $a$ has a state space $S_a$
- Play an arm $a$; its state then evolves according to a known transition kernel $P_a$
- When playing an arm, a random reward is drawn
- Guha, Munagala, Peng'09 showed a constant-factor approximation (against the optimal policy)
- But different from ours: in our problem, the arms (prices) that are not selected may still evolve
- Adv bandits.
- Protocol: in each round $t$,
- adversary selects a function $r_t: [k] \rightarrow [0,1]$
- learner selects arm $X_t$
- learner receives and observes $r_t(X_t)$
- but does not observe the other $r_t(X)$'s
- Our problem can be viewed as a special case
- Discretize the time horizon into $T$ segments of length $\Delta$
- In each round $t$,
- suppose for each price $p$, there are $N_{t,p}$ remaining buyers with valuation $\ge p$
- adv selects reward func $r_t$ such that $r_t(p) = p\cdot N_{t,p} \cdot \lambda \Delta$ for each price/arm $p$
- learner chooses price $X_t$
- receives and observes $r_t(X_t)$
- **Known** result on adv bandits: there is an alg (EXP3) that guarantees regret $O(\sqrt T)$ against any sequence of reward functions $(r_t)$, relative to the best FPP
- **Implication** for our problem: $E[rev]$ of EXP3 is at least $E[Rev(FPP(p))]-\sqrt T$ for any fixed price $p$
- Fix any policy (possibly not an FPP) $\pi$; it induces a distribution $D_\pi$ over sequences of reward functions $(r_t)$
- Then, $E_{D_\pi}[Rev]$ is exactly the mean revenue of $\pi$
- Since for ANY sequence $(r_t)$ and ANY FPP at, say, $p$, we have $E[R;r_1,...,r_T] \ge Rev(FPP(p);r_1,...,r_T)-\sqrt T$
- Benchmark is the best fixed-price policy (FPP)
- We need to argue that any FPP is bad for our problem
- Can illustrate this using just two prices (see the sketch after this list).
- Simulation
- Non-adap:
- Given prices $v_1,...,v_k$.
- The $n_i$'s are unknown but $n$ is known
- Stream model: in each round, the valuation of the customer is $v_i$ w.p. $q_i = n_i / n$
- If $q_i$ is known, then choose $p^* = \arg\max_i \{v_i \cdot \sum_{j\le i} q_j \}$
- Encode a policy as $(t_1,...,t_k)$
- The best non-adap policy is given by the bilevel program $$\max_{t_i}\min_{q_i} \frac{\sum_{i} t_i v_i \sum_{j\le i} q_j }{\max_i \left\{ v_i \sum_{j\le i} q_j \right\}}$$
- What is the best non-adap policy under the stream model?
- Adap: compare with
- a good alg for the stream model (message: a good alg for the stream model can be terrible under our model)
- a good algo for adv bandits, e.g. EXP3 (message: at a high level, our problem can be cast in the adversarial-bandits framework, but a direct application of adv-bandit algorithms is not good)
- best non-adap policy that knows the true instance
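Below is a minimal Monte Carlo sketch for these planned simulations: it simulates the pool model directly (Poisson monitoring; a customer buys at the first monitored price at or below her valuation) and compares the two fixed-price policies against a markdown policy on a two-price instance. The parameters and the `(duration, price)` policy encoding are illustrative placeholders, not the paper's final experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_pool(policy, valuations, lam=5.0, T=1.0, trials=2000):
    """Estimate E[revenue] of a non-adaptive policy under the pool model.

    policy: list of (duration, price) pairs, durations summing to T.
    valuations: the n customers' valuations (n_i customers at value v_i).
    Each customer monitors at Poisson(lam) times and buys at the first
    monitoring time whose posted price is <= her valuation.
    """
    edges = np.cumsum([d for d, _ in policy])   # segment endpoints
    prices = [p for _, p in policy]
    total = 0.0
    for _ in range(trials):
        for v in valuations:
            t = rng.exponential(1 / lam)
            while t < T:
                p = prices[np.searchsorted(edges, t, side="right")]
                if p <= v:                      # first affordable monitored price
                    total += p
                    break
                t += rng.exponential(1 / lam)
    return total / trials

# Two-price instance: either fixed-price policy loses a constant fraction
# of revenue relative to the markdown policy.
vals = [1.0] * 100 + [0.5] * 200
for pol in ([(1.0, 1.0)], [(1.0, 0.5)], [(0.5, 1.0), (0.5, 0.5)]):
    print(pol, "->", round(simulate_pool(pol, vals), 1))
```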
## Quick Notes
The estimation error is a martingale (of some sort):
- $D_1 = \mathrm{Binom}(n_1, q)$ and $\hat n_1 = D_1 / q$
- $D_2 = \mathrm{Binom}(n_1+n_2 - D_1, q)$ and $\hat n_2 = D_2 / q - (\hat n_1 - D_1)$
- $E[\hat n_2 - n_2 \mid D_1=d_1] = (n_1+n_2 - d_1) - d_1/q + d_1 - n_2 = n_1 -d_1/q$
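A quick Monte Carlo sanity check of the conditional-bias identity above (all parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, q, trials = 500, 300, 0.3, 200_000

d1 = rng.binomial(n1, q, size=trials)      # D_1 = Binom(n_1, q)
d2 = rng.binomial(n1 + n2 - d1, q)         # D_2 = Binom(n_1 + n_2 - D_1, q)
n1_hat = d1 / q
n2_hat = d2 / q - (n1_hat - d1)

# Condition on one value of D_1 and compare with the predicted n_1 - d_1/q.
d1_star = int(np.median(d1))
mask = d1 == d1_star
print("empirical E[n2_hat - n2 | D1]:", round((n2_hat[mask] - n2).mean(), 3))
print("predicted n1 - d1/q:         ", round(n1 - d1_star / q, 3))
```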
Fact: $\sum_i \sqrt {n_i} \le \sqrt {kn}$ (by Cauchy–Schwarz).
## 102023
Appendix to submit to AISTATS
Section 2 (prelim):
- Prop 2.3 monotonicity (one paragraph)
- Prop 2.4: formula for the revenue function (one paragraph)
- Thm 2.5 DP:
- restrict our attention to non-adap policies
- state: $(p,t,m)$ where
- p is the current price
- t is the remaining time
- $m$ is the number of rem customers w/ valuation $\ge p$
- Value func $V(p,t,m)$:
- Bellman recursion: see Thomas's DP below for the explicit form
Thomas's DP:
- Stages $i \in [k]$ (corresponding to $v_1,\ldots,v_k$)
- Remaining time $t$
- Number of customers $m$ w/ valuation $\geq v_i$
$$V(i,t,m) = \max_{0 \leq t' \leq t} \{v_i\cdot m\cdot (1-\exp(-\lambda t')) + V(i+1, t-t', m\cdot \exp(-\lambda t') + n_{i+1}) \}$$
Think about this as just optimizing our non-adaptive revenue function in a sequence of $k$ stages.
In particular, we want $V(1,1,n_1)$. For the base case, $V(k,t,m) = v_k\cdot m\cdot (1-\exp(-\lambda t))$
$m \in [0,n]$
$t \in [0,1]$
Should discretize $m,t$ in powers of $(1+\epsilon)$.
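As a concreteness check, here is a direct implementation sketch of Thomas's DP. For brevity it searches $t'$ on a uniform grid and carries $m$ as a continuous quantity rather than using the $(1+\epsilon)$-power discretization suggested above; the instance numbers are illustrative.

```python
import math

lam = 1.0
v = [1.0, 0.6, 0.3]    # prices v_1 > ... > v_k (illustrative)
n = [100, 150, 250]    # n_i customers with valuation exactly v_i
GRID = 60              # resolution of the search over t'

def V(i, t, m):
    """Optimal remaining revenue from stage i, with time t left and m
    customers of valuation >= v_i still present (the recursion above)."""
    if i == len(v) - 1:                       # base case: last price
        return v[i] * m * (1 - math.exp(-lam * t))
    best = 0.0
    for j in range(GRID + 1):
        tp = t * j / GRID                     # time t' spent at price v_i
        sold = m * (1 - math.exp(-lam * tp))  # expected sales at v_i
        rest = V(i + 1, t - tp, m * math.exp(-lam * tp) + n[i + 1])
        best = max(best, v[i] * sold + rest)
    return best

# V(1, 1, n_1) in the notes' 1-indexed notation:
print("optimal non-adaptive revenue ~", round(V(0, 1.0, n[0]), 2))
```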
**Section 3 (Non-adp)**
- Lem 3.2: be careful about mathematical rigor. Can be a bit subtle for adap policies, but immediate for non-adap ones
- Lem 3.4: high-school algebra
- LP: argue that if one of the constraints is not tight, then we can improve the solution strictly
- Thm 3.5: just expand the proof sketch
- Thm 3.6: choose tiny $\lambda$ s.t. the linearization loses only $\epsilon$
**Adaptive:**
- Full proof for $k=2$ prices
- for general $k$, STATE the following:
- (1) Azuma-Hoeffding for submartingales
- (2) explain how to construct the submartingale (subtract the width of the confidence interval)
- (3) combine (1),(2) to derive an upper bound on $|\sum_k (n_k - \hat n_k)|$
- Important points to make:
- explain why the above is the only difference from the $k=2$ analysis.
- Mention that the naive approach leads to linear dependence in $k$ (recall that we summed over the absolute values)
- Clever step: swap absolute value with summation. Intuition: we expect the errors in the estimation terms to cancel each other, which leads to concentration
- explain intuitively why the constructed partial sum is a submartingale
## 082523
Key features of our problem:
### Outline of intro:
- 1,2) Give a concrete retail scenario: fashion clothing, PS5
- 3) the commonly used demand learning model: describe the model.
- 4) HOWEVER, in some applications (like the ones we mentioned in the beginning), the i.i.d. assumption in the stream model fails. Specifically, the distribution of valuation is not fixed over time. In fact, high valuation customers tend to make purchase early and leave the market, which causes a distr shift. Moreover, the demand at a price is not independent over time. (NEED JUSTIFICATION)
- 5) Previous work has considered how to address these two limitations. One line of work considers dynamic pricing under non-stationary demand distributions. Describe the model. But this non-stationarity is Exogenous, whereas the non-stationarity in our problem is Endogenous.
- 6) Previous work has also considered non-independent...(i) strategic consumers (ii) RL-based RM models. But RL approaches do not work for our problem because (a) our transition is monotonic, so we are not able to revisit a state multiple times (b) work on learning POMDPs usually makes stringent assumptions which do not hold (?) in our problem, and does not leverage the special structure of the problem.
- 7) Thus motivated, we consider a single-item revenue-maximization problem where each customer has unit demand and a private valuation drawn from an unknown distribution, and makes repeated myopic interactions with the seller according to a separate stochastic process. More precisely, xxxx
- Aside: POMDP
One sentence summary:
- repeated interaction
- separate stochastic visits
- unknown valuation distr
- non-stationarity
- unit-demand
- myopic/non-strategic
- single-product
- inf inventory
### Related work
- stream model for dynamic pricing
- patience window with random length - explain
- markdown pricing
- RL/POMDP
- adv sequence (Kleinberg, Leighton). Explain why a fixed-price policy can be terrible in our problem.
## 081923
Argue how our problem is different from the following lines of works:
- customers with patience window
- strategic consumer behavior
- Jose Correa's work: related to Titing's section (one difference is that we do not pre-announce)
TO DO:
- For Thomas: why can't we just use Kleinberg's alg for Lipschitz bandits? Find a concrete example.
- For Titing: why can't we use e.g. Correa et al's preannounced price policy? Find a concrete example.
- For all: read the papers below, and summarize each within 3 sentences.
We realized that the "pool" element is not entirely new. What is new is the combination of demand learning and the pool model.
Read these "pool-based" model papers:
- Optimal Pricing Schemes for an Impatient Buyer.
Deng et al SODA'23.
https://arxiv.org/pdf/2106.02149.pdf
- Contingent Preannounced Pricing Policies With Strategic Consumers.
Correa et al, OR'16.
https://pubsonline.informs.org/doi/pdf/10.1287/opre.2015.1452
- Dynamic Pricing with Heterogeneous Patience Levels.
Lobel, OR'20
https://pages.stern.nyu.edu/~ilobel/patient-consumers.pdf
- Optimal Pricing of Seasonal Products in the Presence of Forward-Looking Consumers.
Aviv, Pazgal, MSOM'08
https://pubsonline.informs.org/doi/pdf/10.1287/msom.1070.0183
- Dynamic pricing when consumers are strategic: Analysis of posted and contingent pricing schemes.
Dasu et al, EJOR'10
https://www.sciencedirect.com/science/article/pii/S037722170900890X
- Responsive Pricing of Fashion Products: The Effects of Demand Learning and Strategic Consumer Behavior.
Aviv et al, MS'19
https://pubsonline.informs.org/doi/epdf/10.1287/mnsc.2018.3114
- Models of the Spiral-Down Effect in Revenue Management.
Cooper, Kleywegt, OR'06
https://pubsonline.informs.org/doi/pdf/10.1287/opre.1060.0304
- Markdown Policies for Demand Learning with Forward-looking Customers.
Birge, Chen, Keskin, SSRN
https://www.researchgate.net/profile/N-Bora-Keskin/publication/329794518_Markdown_Policies_for_Demand_Learning_and_Strategic_Customer_Behavior/links/5ebee7af299bf1c09ac097f6/Markdown-Policies-for-Demand-Learning-and-Strategic-Customer-Behavior.pdf
- Optimal Continuous Pricing with Strategic Consumers
Correa, MS'17
https://www.dii.uchile.cl/~jcorrea/papers/Journals/BCP2017.pdf
## 14 July (all)
Recall: our alg (LP) for computing a non-adap policy is to linearize the CR and reduce this robust optimization problem to an LP. Our alg knows the $v_i$'s but not the $q_i$'s.
So far: for any $(v_i)_{i=1,...,k}$, our algo has CR at least $$\frac{ALGO(v,q)}{UB(v,q)} \ge \frac 1{k -\frac{v_2}{v_1} - \frac{v_3}{v_2}-...-\frac{v_k}{v_{k-1}}}$$ for any $q$.
Suppose we are given a price interval $[\delta, 1]$ instead of $(v_i)$. Consider a discretization $v_k = \delta$, $v_{k-1} = \delta + (1-\delta)/k$, ..., $v_0=1$. Let $f$ be the pdf of the valuation, supported on $[\delta, 1]$. Define $q_i= \int_{v_{i+1}}^{v_i} f(x)\, dx$.
Notations:
- $I$ is the original instance (continuous)
- $I'$ is the discretized instance
- $\pi$ is the non-adp policy computed by our ALG using $I'$ as input
- $Rev(\pi), Rev'(\pi)$ are the expected revenues under $I,I'$
We need to formalize the following gaps:
- Step 1: $Rev(\pi) \approx Rev'(\pi)$
- Step 2: $Rev'(\pi) \ge xx \cdot UB(I')$ due to Theorem 1
- Step 3: $UB(I) \approx UB(I')$
Let's start with Step 3. By definition, we have $$UB(I')= (1-e^{-\lambda T}) \sum_{i=1}^k q_i v_i $$ and $$UB(I)= (1-e^{-\lambda T}) \int q(v) v\ dv = (1-e^{-\lambda T}) \sum_i \int_{v_{i+1}}^{v_i} q(v) v\ dv.$$
Up to the common factor $(1-e^{-\lambda T})$, we can sandwich $UB(I)$ as $$\sum_{i=0}^{k-1} v_{i+1} \int_{v_{i+1}}^{v_i}q(v)\,dv \le UB(I) \le \sum_{i=0}^{k-1} v_i \int_{v_{i+1}}^{v_i}q(v)\,dv.$$
So the corresponding terms in these two summations differ by a multiplicative factor of at most $(\delta + 1/k)/\delta = 1+1/(\delta k)$.
So combining Steps 1–3, we obtain a CR of $\frac 1{\left(1+\frac 1{\delta k}\right)\log k}$. The first-order optimality condition (w.r.t. $k$) is approximately
$$\log k = \delta k,$$
i.e., $$k/\log k =\frac 1 \delta.$$
Guess: $k= \frac 1\delta \log\frac 1\delta$. Therefore we conclude that
**Coro.** For any distribution supported on $[a,b]$ with $0<a<b$, we can achieve a CR of roughly $\frac 1{\log \frac ba}$.
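A quick numeric check of the guess against the first-order condition $\log k = \delta k$ (values of $\delta$ illustrative); the ratio tends to 1 only slowly, since $\log k = \log\frac 1\delta + \log\log\frac 1\delta$ under the guess:

```python
import math

# Compare delta*k with log k under the guess k = (1/delta) * log(1/delta).
for delta in (0.1, 0.01, 0.001, 1e-6):
    k = (1 / delta) * math.log(1 / delta)
    print(f"delta={delta:g}: delta*k={delta*k:.2f}, log k={math.log(k):.2f}, "
          f"ratio={math.log(k)/(delta*k):.3f}")
```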
## 12 July (all)
Recap:
- Consider a market of customers whose valuations follow distribution $D$ with support $\{v_1,.., v_k\}$, with $q_i=P(V=v_i)$
- Customers monitor the price at rate $\lambda$,
- the revenue of any policy is upper bounded by $$UB = \sum_i q_i v_i (1 - e^{-\lambda T})$$
Notations:
- the optimal non-adap revenue is denoted by $OPT$,
- Rev of our algo: ALGO
- surrogate for OPT: UB
Recall: Sps the algo knows $v$. Then for any $q$, $$\frac{ALGO(v,q)}{UB(v,q)} \ge \frac 1{k -\frac{v_2}{v_1} - \frac{v_3}{v_2}-...-\frac{v_k}{v_{k-1}}}$$.
Interesting special cases:
- when $v_i$ are all close to each other, then the RHS $\approx 1/(k-(k-1))=1$
- when $v_i = 1-i/(k-1)$ then $\frac{v_i}{v_{i-1}} = 1-\frac 1{k-i}$, and thus $$k-\frac{v_2}{v_1} - \frac{v_3}{v_2}-...-\frac{v_k}{v_{k-1}}=k-\left(1-\frac 1{k-1}\right) - \left(1-\frac 1{k-2}\right) - ... -\left(1-\frac 12\right) = \log k +O(1)$$
**Coro (?).** There is an algo such that for any valuation distr $D$ supported on $[a,b]$, the algo returns a non-adap policy whose revenue is at least $\frac{f(k)}{\log k}$ of OPT.
**Idea.** Choose $k$ and consider a uniform grid of $[a,b]$. "round" each $x\in [a,b]$ to the closest grid point.
Aside: we are interested in finding an example where (1) $\lambda T$ is small and (2) UB and OPT can differ by a lot.
Consider the following: $k=2$, $T=1$. Then the expected revenue of the $s$-policy is
$$r(s; q_1,q_2) := q_1 v_1 (1-e^{-\lambda s}) + q_1 v_2 \cdot e^{-\lambda s} \cdot (1-e^{-\lambda (1-s)}) + q_2 v_2\cdot (1-e^{-\lambda (1-s)}). $$
On the other hand, $$UB= q_1 v_1 (1-e^{-\lambda})+q_2 v_2 (1-e^{-\lambda})$$
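A quick scan along the lines of the aside above, probing the OPT-vs-UB gap for small $\lambda$ (parameters illustrative):

```python
import math

lam, v1, v2, q1, q2 = 0.2, 1.0, 0.5, 0.5, 0.5   # small lambda*T regime

def r(s):
    """Expected revenue of the s-policy (switch from v1 to v2 at time s)."""
    a = 1 - math.exp(-lam * s)
    b = 1 - math.exp(-lam * (1 - s))
    return q1 * v1 * a + q1 * v2 * math.exp(-lam * s) * b + q2 * v2 * b

ub = (q1 * v1 + q2 * v2) * (1 - math.exp(-lam))
opt = max(r(i / 1000) for i in range(1001))      # grid search over s
print(f"OPT = {opt:.4f}, UB = {ub:.4f}, ratio = {opt/ub:.3f}")
```

With these illustrative numbers the ratio comes out noticeably below 1, so UB is already strictly loose for $k=2$ when $\lambda T$ is small.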
## 07 July (Su, Titing, Thomas)
### Next steps:
Non-adap algo:
- **Thm 1.** structural lemma: best non-adap policy is markdown
- **Thm 2 (positive result)**: The valuation follows a $k$-point distribution supported on $\{v_1,...v_k\}$. Then there is an alg that finds a non-adap policy that guarantees a $CR(v_1,....v_k)$ fraction of the revenue of the best non-adap policy.
- Corollary. Note that $1/\log k\ge CR(\{v_i\})\ge 1/k$, so our alg guarantees a $1/k$ fraction of the opt revenue for **any** instance $(v_i)$.
- For large $k$, can we get a better bound using a "core set"? For example, if $k=10^6$ and $v_i=i/k$, a natural idea is to consider a geometric sequence of prices. Then we need to consider the discretization error.
- **Thm 3 (negative)**. For any algo, there is an instance on which the output of this alg earns only at most xx fraction of OPT.
idea. Recall that we solved the max-min problem EXACTLY. Also recall that our surrogate can be close to OPT sometimes. Can we combine these and obtain an immediate lower bound (I guess it is $\rho/k$ where $\rho$ is the "loss" due to taking the surrogate) on the CR?
May use the following multiplicative version of the "f-g" lemma: suppose $\rho \le f/g \le 1$; then $$\max_{t_i} \min_{q_i} g(t,q) \ge \max_{t_i} \min_{q_i} f(t,q) \ge \rho \max_{t_i} \min_{q_i} g(t,q)$$
Adap algo
- **Thm 4.** generalize from 2 to k prices (Thomas). The Explore-then-Commit (ETC) policy has regret $n^{3/4}$.
- Rmk. Discuss why other ideas from bandits don't easily extend to this setting (E.g., we don't have a notion of "regret per round" due to the nature of our consumer purchasing model).
- **Thm 5**. (lower bound) Any policy has regret at least xxx on some instance.
Paper Organization:
- Introduction
- Model
- Characterization of Non-Adaptive OPT
- Competitive Ratio for Non-Adaptive Policies
- Regret Bounds for Adaptive Policies (Demand Learning)
- Conclusion and Future Work
## 5 July (Titing, Thomas)
We are mostly done with the regret bound proof for learn-then-optimize. Just a few details remain. We summarize the result here.
### Theorem (Regret for ${\cal A_{\rm LTO}}$)
For $T' = C\frac{1}{\lambda n^{1/4}}$ where $C > 0$ is a constant depending on problem parameters, we have that
$$ {\rm Reg}({\cal A}_{\rm LTO}) \leq O(n^{3/4})$$
Proof:
From Lemmas 1-5 below, we have the following bound on the regret:
$${\rm Reg}({\cal A}_{\rm LTO}) \leq n_1v_1 \lambda T' + C\frac{\sqrt{n_1 \log (n)}}{ \lambda T'} + \frac{T'}{T} OPT + o(1) $$
Applying the choice of $T'$, we get the stated bound.
## 29 June 2023 (Thomas):
Here's an overview of what Su and I discussed yesterday. Our goal is to give a regret bound for Learn-Then-Optimize for the simple two-price discrete distribution problem. After, we will generalize this to 3 prices but where one of the prices is negligible (never included in the optimal solution) and then to $k$ different prices.
Here is the problem description. There are two price levels $v_1 > v_2$, with $n_1$ and $n_2$ customers at each level respectively. We assume that $n = n_1 + n_2$ is known and that $n_1, n_2 = \Theta(n)$ (but otherwise we are uncertain about the split between $n_1$ and $n_2$). Each customer monitors the price according to an independent Poisson process with rate $\lambda$ during the time horizon $T$. A customer buys at the first point they monitor the price and observe that the current price is below their valuation. By scaling, we may assume that $\lambda = T = 1$ (this simplifies some expressions).
Let $\textrm{Rev}(\cal A)$ be the expected revenue of an algorithm $\cal A$. We define our benchmark to be the optimal non-adaptive policy which knows $n_1$ and $n_2$. For this two-price problem, such a policy is described by a single switching time $s$. Let $R(s; T, n_1, n_2)$ be the expected revenue (expectation taken over the random monitoring times) of the policy which uses price $v_1$ for the first $s$ time units, then price $v_2$ for the remaining $T-s$ time units. It can be seen that:
$$ R(s;T, n_1,n_2):=n_1 v_1 (1-e^{-\lambda s}) + n_1 v_2 \cdot e^{-\lambda s} \cdot (1-e^{-\lambda (T-s)}) + n_2 v_2\cdot (1-e^{-\lambda (T-s)}) $$
Using this, we define $\textrm{OPT} := \max_{s \in [0,T]} R(s; T, n_1, n_2)$. Our goal is to give algorithms (adaptive policies) with low regret:
$$ \textrm{Regret}({\cal A}) := {\rm OPT} - {\rm Rev}({\cal A}) $$
Ideally, we wish to have ${\rm Regret}({\cal A})/n \to 0$ as $n \to \infty$.
Next we describe the Learn-Then-Optimize algorithm for this setting.
### Algorithm ${\cal A}_{\rm LTO}$:
- $T' =$ exploration time to be specified later (optimized during the analysis).
- Use price $v_1$ for time $T'$ and observe sales $D_1$.
- Construct estimates $\hat n_1$ and $\hat n_2$ for $n_1$ and $n_2$.
- Compute $\hat s = \arg\max_s R(s; T-T', \hat n_1, \hat n_2 )$
- Use price $v_1$ for time $\hat s$, then use price $v_2$ for time $T-T'-\hat{s}$
**Remark** If we want to tighten the bound later, we may need to modify the optimization step. E.g., replace $\hat n_1$ with $\hat n_1 - D_1$, etc.
First, it is easy to see that our algorithm computes a feasible policy as the total time used by our algorithm is $T' + \hat{s} + T-T'-\hat{s} = T$. Thus the main thing we need to analyze is its regret.
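Here is a code sketch of ${\cal A}_{\rm LTO}$ run once on a synthetic two-price instance, using the closed-form $R(\cdot)$ above; the grid search, constants, and instance numbers are illustrative choices, not part of the analysis.

```python
import math
import numpy as np

rng = np.random.default_rng(2)
lam, T, v1, v2 = 1.0, 1.0, 1.0, 0.5
n1, n2 = 4000, 6000
n = n1 + n2                                   # n is known; the split is not

def R(s, horizon, m1, m2):
    """Closed-form expected revenue of switching from v1 to v2 at time s."""
    a = 1 - math.exp(-lam * s)
    b = 1 - math.exp(-lam * (horizon - s))
    return m1 * v1 * a + m1 * v2 * math.exp(-lam * s) * b + m2 * v2 * b

# Explore: price v1 for time T'; sales are Binomial in the pool model.
T_exp = 1 / (lam * n ** 0.25)                 # T' ~ 1/(lam * n^{1/4})
D1 = rng.binomial(n1, 1 - math.exp(-lam * T_exp))

# Estimate the split, then optimize the switch time on the remaining horizon.
n1_hat = D1 / (1 - math.exp(-lam * T_exp))
n2_hat = n - n1_hat
grid = np.linspace(0, T - T_exp, 1001)
s_hat = max(grid, key=lambda s: R(s, T - T_exp, n1_hat, n2_hat))

# Realized exploration revenue plus expected exploitation revenue, with
# the n1 - D1 high-value customers remaining in the pool.
lto = D1 * v1 + R(s_hat, T - T_exp, n1 - D1, n2)
opt = max(R(s, T, n1, n2) for s in np.linspace(0, T, 1001))
print(f"T'={T_exp:.3f}, s_hat={s_hat:.3f}, OPT={opt:.1f}, LTO~{lto:.1f}")
```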
### Lemma 1 (Overall Regret Bound)
Let $N_1$ be the random number of type 1 customers remaining after exploration for time $T'$. Then we have ${\rm Reg}({\cal A}_{\rm LTO}) \leq A + B + C + D$, where:
1. (Revenue estimation error) $A := |E[R(\hat s; T-T', N_1, n_2)] -E[R(\hat s; T-T', \hat n_1, \hat n_2)] |$
(idea: concentration)
2. (Optimization error) $B := | E[R(\hat s; T-T', \hat n_1, \hat n_2)] - E[OPT(T-T', \hat n_1, \hat n_2)] |$
(idea: = 0 by defn of $\hat s$)
3. (Optimum estimation error) $C := | E[OPT(T-T', \hat n_1, \hat n_2)] - OPT(T-T', n_1, n_2)|$
(idea: use ``f-g'' lemma on the estimation error)
4. (Exploration error) $D := |OPT(T-T', n_1, n_2) - OPT|$
(Idea: shrinkage argument)
Break D into two substeps:
- D': (effect of a shorter horizon) for any $n_1,n_2$, we have $|OPT(T-T', n_1, n_2) - OPT(T, n_1, n_2)|$ small (bound this using the shrinkage argument)
- D'': (effect of distribution shift) recall that $N_1=n_1-D_1$ is the number of high val customers after exploration phase. For **any** $t$, want to bound $$|E [ OPT(t, N_1, n_2) ] - OPT(t, n_1, n_2)|$$
(idea: we already showed that whp $n_1 \approx N_1$, so we can condition on the good event, and consider the most pessimistic $N_1$, then the expectation is gone. Then use f-g lemma.)
Proof sketch: First note that ${\rm Rev}({\cal A}_{\rm LTO}) \geq E[R(\hat s; T-T', N_1, n_2)]$. Next, we add "zero" several times and apply triangle inequality to get the terms $A$, $B$, and $C$ defined in the lemma statement.
Recall that the demand $D_1$ in the exploration period is the sum of iid Bernoulli's, each with mean $1-e^{-\lambda T'}$. Formally,
$D_1 \sim {\rm Binom}(n_1, 1-e^{-\lambda T'})$, so $\hat n_1 = D_1 / (1-e^{-\lambda T'})$ is an unbiased estimate of $n_1$.
### Claim 1.
Let $\hat n_1 = D_1/(1-e^{-\lambda T'})$ and $\hat n_2 = n - \hat n_1$, then $\hat n_1$ and $\hat n_2$ are concentrated around their means.
Pf.
Note that $D_1$ is the sum of iid Bernoullis, each with mean $1-e^{-\lambda T'} \approx \lambda T'$, so by Hoeffding, $$P[|D_1-E[D_1]| > \gamma] \le e^{-\gamma^2/ n_1}.$$
So, to make the RHS less than some target confidence level $\delta$, we can take $\gamma = \sqrt{n_1\log (1/\delta)}$. Note that $$|\hat n_1 - n_1| = \frac 1{1-e^{-\lambda T'}}\cdot |D_1-E[D_1]|,$$ so the above implies that w.p. $1-\delta$, we have
$$|\hat n_1 - n_1|\le\frac{\sqrt {n_1\log(1/\delta)}}{\lambda T'} $$
### Claim 2.
For all $s,T,n_1,n'_1,n_2, n'_2,$ we have
$$|R(s; T, n_1, n_2) - R(s; T, n'_1, n'_2) | \leq |n_1-n'_1|v_1 + |n_2-n'_2|v_2 $$
Proof: Since $R(\cdot)$ is linear in $n_1,n_2$, we have that
$$ R(s; T, n_1, n_2) - R(s; T, n'_1, n'_2) = \\(n_1 - n'_1) v_1 (1-e^{-\lambda s}) + (n_1-n'_1) v_2 \cdot e^{-\lambda s} \cdot (1-e^{-\lambda (T-s)}) + (n_2 - n'_2) v_2\cdot (1-e^{-\lambda (T-s)}) $$
Taking absolute value on both sides, applying triangle inequality, and noting that $v_2 \leq v_1$ yields the claim.
### Lemma 2 (Revenue Estimation Error)
Let $\hat n_1$ and $\hat n_2$ be the estimators for $n_1$ and $n_2$ given in Claim 1 and $\hat s$ be the solution computed in our algorithm, then we have:
$$A = |E[R(\hat s; T-T', N_1, n_2)] -E[R(\hat s; T-T', \hat n_1, \hat n_2)] | \leq E[D_1] v_1 \leq n_1 \lambda T' v_1$$
Proof: Note that it suffices to show that this relation holds for all $s \in [0,T]$ (in place of $\hat s$) and $t > 0$ (in place of $T-T'$). First note that since $R(\cdot)$ is linear in $n_1$ and $n_2$, we have that $E[R(s; t, N_1, n_2)] = R(s; t, E[N_1], n_2) = R(s; t, n_1 - E[D_1], n_2)$. Similarly, we have $E[R(s; t, \hat n_1, \hat n_2)] = R(s; t, E[\hat n_1], E[\hat n_2]) = R(s; t, n_1, n_2)$.
Thus by Claim 2 we have:
$$ |E[R(s; t, N_1, n_2)] - E[R(s; t, \hat n_1, \hat n_2)]| \leq |n_1- E[D_1] - n_1|v_1 + |n_2-n_2|v_2 = E[D_1]v_1. $$
Next since, $D_1 \sim {\rm Binom}(n_1, 1-e^{-\lambda T'})$, we have $E[D_1] = n_1 (1 - e^{ - \lambda T'}) \leq n_1 \lambda T'$, which completes the proof.
### Lemma 3 (Optimization Error)
$B = | E[R(\hat s; T-T', \hat n_1, \hat n_2)] - E[OPT(T-T', \hat n_1, \hat n_2)] | = 0$
Proof: This directly follows from the definition of $\hat s$ and $OPT(\cdot)$.
### Lemma 4 (Optimum Estimation Error)
$C = | E[OPT(T-T', \hat n_1, \hat n_2)] - OPT(T-T', n_1, n_2)| \leq O\left( \frac{\sqrt{n_1 \log (n)}}{ \lambda T'} \right) + o(1)$
Here the $O()$ notation hides constant factors and the $o(1)$ term represents lower order terms that go to $0$ as $n \to \infty$.
Proof sketch: Let ${\cal B}$ be the event that at least one of $\hat n_1$ or $\hat n_2$ is further than $\epsilon$ from its expectation. We use the law of total expectation to compute the difference above, conditioning on ${\cal B}$ and ${\neg \cal B}$. Then we construct a bound on each case and apply Claim 1 to complete the proof.
Proof: We will show that this lemma holds for all $t \in [0,T]$ (and thus it holds for the specific choice of $t = T-T'$). Let ${\cal B}$ be the event that at least one of $\hat n_1$ or $\hat n_2$ is further than $\epsilon$ from their expectations. Here we will take $\epsilon = C\frac{\sqrt{n_1\log(\frac{1}{\delta})}}{\lambda T'}$ (for some constant $C > 0$) and $\delta = 1/n^2$. Note that by our choice of $\epsilon$ and $\delta$, from Claim 1 (Hoeffding's inequality + union bound) we have
$$\Pr[B] \leq O(1)\cdot \delta = \frac{O(1)}{n^2}.$$
By the law of total expectation we can rewrite $E[OPT(t, \hat n_1, \hat n_2) - OPT(t, n_1, n_2)]$ as:
$$ E[|OPT(t, \hat n_1, \hat n_2) - OPT(t, n_1, n_2)| \mid {\cal B}] \Pr[{\cal B}] + E[|OPT(t, \hat n_1, \hat n_2) - OPT(t, n_1, n_2)| \mid \neg{\cal B}] \Pr[\neg {\cal B}].$$
We will bound each term separately. We claim the first term is $o(1)$ since $\Pr[B] \leq (1/n^2)$ and the difference can be at most $O(nv_1)$.
The second term is at most $2\epsilon$ by Claim 2 and the "$f$-$g$" lemma (TODO: add this lemma somewhere). This completes the proof.
### Lemma 5 (Exploration Error)
$D \leq \frac{T'}{T} OPT$
Proof: Let $s = \arg\max_s R(s, T, n_1, n_2)$, and $s' = \frac{(T - T')s}{T}$, i.e., we shrink the optimal solution of $R(s, T, n_1, n_2)$ by the factor $\frac{(T - T')}{T}$. Since $s$ is feasible for $R(s, T, n_1, n_2)$, namely $s \le T$, $s'$ is also feasible for $R(s', T - T', n_1, n_2)$. Since $T-T' \le T$ and $s'$ is a feasible solution, we have $ R(s', T - T', n_1, n_2) \le OPT(T - T', n_1, n_2) \le OPT(T, n_1, n_2)$. Note that $\frac{1 - e^{-x}}{x}$ is decreasing in $x$; thus,
\begin{align}
R(s', T - T', n_1, n_2) &= n_1 v_1 (1-e^{-\frac{T- T'}{T}\lambda s}) + n_1 v_2 \cdot e^{-\frac{T- T'}{T}\lambda s} \cdot (1-e^{-\frac{T- T'}{T}\lambda (T-s)}) + n_2 v_2\cdot (1-e^{-\frac{T- T'}{T}\lambda (T-s)})\\
& \ge \frac{T- T'}{T} R(s, T, n_1, n_2)
\end{align}
Therefore,
\begin{align}
|OPT(T, n_1, n_2) - OPT(T - T', n_1, n_2)| & \le |OPT(T, n_1, n_2) - R(s', T - T', n_1, n_2)|\\
& = |R(s, T, n_1, n_2) - R(s', T - T', n_1, n_2)|\\
& \le|R(s, T, n_1, n_2) - \frac{T - T'}{T} R(s, T, n_1, n_2) |\\
& = \frac{T'}{T} R(s, T, n_1, n_2),
\end{align}
which completes the proof.
Proof sketch: See the notes from 28 June 2023 for the basic idea.
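A numeric spot-check of the dilation inequality $R(s', T-T') \ge \frac{T-T'}{T} R(s^*, T)$ used above, over random instances (a sanity check, not a proof; instance ranges are arbitrary):

```python
import math
import random

random.seed(0)

def R(s, horizon, n1, n2, lam=1.0, v1=1.0, v2=0.5):
    a = 1 - math.exp(-lam * s)
    b = 1 - math.exp(-lam * (horizon - s))
    return n1 * v1 * a + n1 * v2 * math.exp(-lam * s) * b + n2 * v2 * b

for _ in range(1000):
    T, Tp = 1.0, random.uniform(0.0, 0.5)
    n1, n2 = random.randint(1, 100), random.randint(1, 100)
    s_star = max((i / 500 for i in range(501)), key=lambda s: R(s, T, n1, n2))
    lhs = R(s_star * (T - Tp) / T, T - Tp, n1, n2)   # dilated policy
    rhs = (T - Tp) / T * R(s_star, T, n1, n2)
    assert lhs >= rhs - 1e-9
print("dilation inequality held on all random instances")
```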
## 28 June 2023 (Su, Thomas):
Simplifying assumptions:
- $\lambda = T = 1$.
- Also assume $n_1,n_2 \ge Cn$.
Algo:
- exploration $T'$ time.
The probability that a high-val customer bought in the exploration phase is $1-e^{-\lambda T'}$. So $E[N_1] = n_1 \cdot e^{-\lambda T'}$.
Q: if $T'$ is small (e.g., $T'=n^{-1/2}$), then $n_1 \cdot e^{-\lambda T'} \approx n_1$?
A: Yes. Note that $\lambda T' = n^{-1/2}$ is small, so by Taylor expansion around $0$, we have
$$ 1-e^{-\lambda T'} = \lambda T' + o(T') = n^{-1/2} + o(n^{-1/2}). $$
Thus, $N_1 \sim (1-n^{-1/2})\cdot n_1$
Next question: how does a (small) distribution shift affect the revenue in the **exploitation** phase?
Assume $n_1 = \Theta(n)$.
Let $D_1$ be the demand in the exploration phase. Essentially, $D_1$ is just the sum of $n_1$ iid Bernoulli's with means $1-e^{-\lambda T'}\approx \lambda T'=n^{-1/2}$, so $$ED_1 = n_1 - E[N_1] \sim n_1/n^{1/2} \approx C\sqrt n.$$
$B$ is the (bad) event that $N_1 < n_1(1 - C/n^{1/2})$, or equivalently, $D_1> C \sqrt n$.
Chernoff: for $X_1,X_2,\ldots,X_n \in \{0,1\}$, $S = \sum_i X_i$, and $\mu = E[S]$:
$$\Pr[|S-\mu| \geq \epsilon \mu ] \leq 2\exp(-\epsilon^2 \mu/3).$$
Taking $\epsilon=n^{-1/4}\sqrt{\log n}$ (here $\mu = C\sqrt{n}$) makes the RHS at most $1/n^2$ for suitable $C$. Thus, $P[\neg B] \geq 1-1/n^2$.
$s' = s^*(1-T')$
$$ E[R(s'; 1-T', N_1, n_2)] \geq E[R(s^*(1-T'); 1-T', N_1, n_2)| \neg B] \Pr[\neg B] $$
Hopefully:
$$ E[R(s^*(1-T'); 1-T', N_1, n_2)| \neg B] \Pr[\neg B] \geq R(s^*; n_1,n_2) - o(1) $$
## 23 June 2023: Post-meeting notes (Thomas)
I was thinking about the bound we were working on earlier, and it may be a bit more involved (but doable) than we thought. First let's walk through the easy part of the calculations. Let $R(s;T,n_1,n_2)$ and $OPT(T,n_1,n_2)$ be defined as below. Suppose we run exploration at price $v_1$ for some time $T' \ll T$. Let $N_1$ be the random number of type 1 customers remaining after exploration. Our goal is to find a bound of the form
$$ E[OPT(T-T',N_1,n_2)] \geq \alpha OPT(T,n_1,n_2) - \beta $$
for $\alpha \in (0,1)$ as large as possible and $\beta > 0$ as small as possible (note that the reverse inequality is true with $(\alpha,\beta) = (1,0)$). Let $s^* = \arg\max_{s\in [0,T]} R(s;T, n_1,n_2)$ and define $s' = s^*(\frac{T-T'}{T})$. At a high level, we dilate the optimal policy for time horizon $T$ to be a feasible policy for time horizon $T-T'$. Thus we have the following:
$$ E[OPT(T-T',N_1,n_2)] \geq E[R(s'; T-T', N_1, n_2)] = \sum_{x=0}^{n_1} R(s'; T-T', x, n_2) \Pr[N_1 = x] $$
$N_1$ is the number of remaining type 1 customers after exploring at price $v_1$ for time $T'$, so $N_1$ is Binomially distributed with parameters $n_1$ and $r = \exp(-\lambda T')$ since each type 1 customer (of which there are $n_1$ total) monitors the price independently and has probability $r$ of not monitoring the price in the interval $[0,T']$.
Directly bounding the above expression may get complicated quickly so I think it may be useful to break things into cases.
* Case 1: $n_1$ is large in terms of $n$. In this case, we can use the intuition that as long as $T' \to 0$ as $n \to \infty$, then by concentration inequalities $N_1 \approx n_1$ with high probability. Then we argue that we basically just have to deal with shrinking $s^*$ by a factor of $(T-T')/T \approx 1$.
* Case 2: $n_1$ is small in terms of $n$. In this case, we cannot use the same argument since even if we take $T'$ to be vanishing as $n \to \infty$, we do not get as strong concentration on $N_1$ since $n_1 \ll n$. However in this case we can probably argue that most of $OPT(T,n_1,n_2)$'s value was given by selling to type 2 customers (since $n_2$ is large).
## 23 June 2023: adap policy (cont'd)
Simplest setting: the number of type-$i$ customers (i.e., those with val $v_i$) is $n_i$. For simplicity, we assume $i\in\{1,2\}$.
For simplicity, we also assume $n:=\sum_i n_i$ is known.
If $n_i$ is known, then the seller chooses the $s$ that maximizes the expected revenue
$$R(s;T, n_1,n_2):=n_1 v_1 (1-e^{-\lambda s}) + n_1 v_2 \cdot e^{-\lambda s} \cdot (1-e^{-\lambda (T-s)}) + n_2 v_2\cdot (1-e^{-\lambda (T-s)}).$$ This is on the order of $\Theta (n)$. Denote by $s^*$ the optimal solution to the above.
### high level idea, corrected
*Defn*. Let $OPT(T, n_1,n_2):=R(s^*;T, n_1,n_2)$ be the optimal rev if we start with $n_i$ type-i customers.
Denote by $N_1<n_1$ the (random) number of remaining type-1 customer at the end of the exploration phase.
Then the key problem is to bound
$$OPT(T,n_1,n_2) - E[OPT(T-T',N_1,n_2)]$$ where the expectation is over $N_i$. We emphasize that $n_2$ does not change after the exploration.
### Step 3
Intuition: if $T'=o(1)$ then the distributional shift is small, i.e., the ratio $N_1 / n_2$ should not be too different from $n_1/n_2$. Therefore, we can apply the "one-to-one shrinkage" argument to bound this gap.
### Step 2
Observe that every non-adap policy for the $T$ horizon corresponds to a policy for $T-m$.
Let's state this lemma for general $k$.
**Lemma** For any $k$, and any non-adap policy (which is markdown) $(t_j)$ totalling to 1, we have $OPT(T) - OPT(T')\le ??$
Pf. Fix one customer with valuation $v$. We want to know what percentage of revenue we can get from her if the horizon is shortened by a factor of $T'/T$.
To this end, we need to first find the expression for the expected revenue:
$$E[R \mid V = v_i] = \sum_{j\ge i} v_j e^{-T_i^{j-1}} \cdot (1-e^{-t_j})$$ (see the 9 June notes below).
### High level, flawed
Step (1): show that if we stay at $v_1$ for $m$ rounds, then we have estimation error $\delta(m)$.
Step (2): the optimal rev in the shortened horizon $T-m$ is at most xx from that of the original horizon $T$
Step (3): if the estimation error is $\delta(m)$, then lose $O(\delta)$ on the shortened horizon.
The regret would then be $$\lesssim n\cdot m + (OPT(T)-OPT(T-m) ) + \delta(m).$$
The above analysis is flawed since the distribution of valuations for the shortened instance is **different** from the original problem!
### Step 3
We start with Step 3: how would an estimation error $\epsilon$ on $q_1$ affect $s^*$?
**Lemma.** Suppose $f(s),g(s)$ are defined on the same domain and that $|f(s)-g(s)|\le \delta$ for all $s$. Then, $|\max_s f(s) - \max_s g(s)|\le \delta$
Applying the above on $R(s; q_1,q_2)$ and $R(s; \hat q_1,\hat q_2)$, then we have:
**Coro.** If $|q_i - \hat q_i|\le \delta$, then $|R^* - \hat R^*|\le 3\delta \|v\|_\infty$.
Pf. $$|R- \hat R|\le \delta v_1 (1-e^{-\lambda s}) + \delta v_2 \cdot e^{-\lambda s} \cdot (1-e^{-\lambda (T-s)}) + \delta v_2\cdot (1-e^{-\lambda (T-s)})\le 3\delta \|v\|_\infty.$$
Q2.
Note that if we select $v_1$, then only type-1 customers may buy. If we select $v_2$, then both types may buy.
Suppose that up to time $t$, we have been at price $v_1$ for $s$ time units. Then, the expected demand from those $s$ time units is $E[D] = (1-e^{-\lambda s}) \cdot q_1 n$. Suppose we observe $D$. A natural estimator for $q_1$ is then $$\hat q_1 = \frac D{(1-e^{-\lambda s}) n}$$
Then, by concentration bounds (?), we have $|D - E[D]|\le f(s,n)$.
Consider the following learn-then-earn algo:
## 21 June 2023: adap policy
Recall that we have $n^{3/4}$ adap regret against the best non-adap policy that knows the true demand. But this result assumes linear demand. We believe a similar regret bound (against the same BM) can be shown for the non-param setting, just like for the stream model.
Setting: Suppose the valuation is supported on $v_1>...>v_k$, and the prob mass $q_i$ is unknown. An algo takes $v_i$ as input and computes $t_i$. The CR of this alg is given by the worst-case $q_i$.
**Conj.** there is an algorithm (output of the LP below from 16 June 2023) that takes $(v_i)$ as input and returns a $(t_i)$ which guarantees $\Omega(1/k)$ rev under any instance $\{(q_i,v_i)\}$.
Adap version: recall two-armed bandits: $\mu_1,\mu_2$ reward rates, time horizon $T$, denote $|\mu_1 -\mu_2|=\Delta$. Then, there is an alg with regret $\le \Delta^{-1} \log T$.
Confidence interval: let $\bar \mu_i$ be the empirical mean of arm $i$ after selecting it $t$ times. Then, w.p. $1-t^{-5}$, we have $|\mu_i - \bar \mu_i| \le \sqrt{\frac{\log t}{t}} \sim t^{-1/2}$.
Successive Elimination (SE): play round robin until a separation occurs.
**Lemma**. After $t^*:=\Delta^{-2}\log T$ rounds, the two C.I. become disjoint whp.
Idea. To have $t^{-1/2}\le \Delta/2$, we only need $t\ge \Delta^{-2}$.
Regret: note that after time $t^*$, we suffer 0 regret per round. Before $t^*$ we suffer $\Delta$ regret per round. So, the total regret is $\Delta \cdot t^* + 0 \cdot (T-t^*)\le \Delta^{-1}\log T$.
Analogy: $n$ customers, $n$ known, valuations $v_1 , v_2, 0$; (wlog) assume a policy only plays $v_1$ or $v_2$. Each of the $n$ customers has valuation $v_i$ w.p. $q_i$ (unknown). Note that $q_1 + q_2 < 1$.
Benchmark: Suppose the $q_i$'s were known. Recall that the optimal non-adap policy is markdown, so we can encode a policy by when the unique price reduction occurs.
Sps the price goes from $v_1$ to $v_2$ at time $s$. Then, the expected revenue is $$q_1 v_1 (1-e^{-\lambda s}) + q_1 v_2 \cdot e^{-\lambda s} \cdot (1-e^{-\lambda (T-s)}) + q_2 v_2\cdot (1-e^{-\lambda (T-s)}).$$
#### Algo idea
Note that if we select $v_1$, then only type-1 customers may buy. If we select $v_2$, then both types may buy.
Suppose that up to time $t$, we have been at price $v_1$ for $s$ time units. Then, the expected demand from those $s$ time units is $E[D] = (1-e^{-\lambda s}) \cdot q_1 n$. Suppose we observe $D$. A natural estimator for $q_1$ is then $$\hat q_1 = \frac D{(1-e^{-\lambda s}) n}$$
Then, by concentration bounds (?), we have $|D - E[D]|\le f(s,n)$.
High-level idea for UB: (Q1) how does the length of exploration affect the estimation error? (Q2) how does the estimation error affect the revenue?
Idea for LB: choose $v_i,q_i$ s.t. optimal $s=0$. Then, argue that (1) any policy that always stays at the low price $v_2$ is bad because it never explores (2) any policy that ever stays at $v_1$ is bad for the above instance.
## 16 June 2023
Recall that $$UB_D = E_{V\sim D}[V] \cdot (1-e^{-\lambda T}) = \sum_i q_i v_i (1-e^{-\lambda T}).$$
The RHS is the revenue that we can get if we present price $v_i$ to every customer with valuation $v_i$. This is a (highly) personalized policy; however, this paper is not about personalized policies!
Instead, we interpret the above as an upper bound on the expected revenue of **any** policy.
In fact, for any $T$, the maximum revenue we can hope for from a customer with valuation $v$ is $v \cdot (1-e^{-\lambda T})$.
**Prop.** For any non-adap policy, the E-rev is at most $UB_D$.
Why is this useful: the optimal non-adap policy is given by a DP, but we do not have a clean expression for its E-revenue. So we instead compute the approximation ratio against the above $UB_D$. The Prop ensures that any policy that earns a $c$ fraction of $UB_D$ also earns a $c$ fraction of the best non-adap policy's revenue.
#### Non-adap: show CR for different classes of q's.
Reduces to: given $(v_i)$ and a **class** $\cal F$ of ${\bf q}=(q_i)$, we want to find
$$\max_{t_i}\min_{{\bf q} \in \cal F} \frac{\sum_i q_i \sum_{j=i}^k v_jt_j}{\sum_i q_i v_i}$$
where $\sum_i t_i =1$.
**Lemma** Let $v_i,u_i\ge 0$, and consider the ratio $f(q)=\frac{\sum_i q_i u_i}{\sum_i q_i v_i}$. Then, $f(q)$ is minimized by the one-hot vector where all mass is on $i^* = \arg\min_i u_i/v_i$.
**Proof.** Wlog assume $i^*=1$. Then, for any $i$, we have $u_i/v_i\ge u_1/v_1$, i.e., $v_i \cdot \frac {u_1}{v_1} \le u_i$. Summing over all $i$'s, we have
$$ \frac{u_1}{v_1} \sum_i q_i v_i =\sum_i q_i (v_i \cdot \frac{u_1}{v_1})\le \sum_i q_i u_i, $$
i.e., $u_1/v_1 \le f(q)$. Note that the LHS is just $f(e_1)$ where $e_1=(1,0...,0)$, and we are done.
Q: What is the worst ${\bf q}$ for the inner min?
A: **q** = (0,..,0,1,0...), the 1 appears at the min of the following
$$\frac{\sum_{j=i}^k v_jt_j}{v_i}$$.
The maximization problem is equivalent to
$$ \max_{t_j} \min_{i} \frac{\sum_{j=i}^k v_jt_j}{v_i}$$
Let's rewrite it as an LP: given constants $v_i$, consider
$$ \max_{t_j, c}\ c, \\ s.t. \quad \sum_{j=i}^k v_jt_j \ge c v_i,\ {\rm for\ all} \ i $$
**To do offline: show that the optimal solution is achieved when all constraints are tight.**
With all constraints tight, back-substitution gives $$t_k = c = \frac 1{k - v_2/v_1 - v_3/v_2 - \cdots - v_{k}/v_{k-1}} = \frac 1{1 + (1 - v_2/v_1) + (1- v_3/v_2) + \cdots + (1 - v_{k}/v_{k-1})};$$ for an instance with linearly decreasing prices (cf. the 12 July notes), the denominator is $1/2 + 1/3 + \cdots \approx \log k$, so $c \sim 1/\log k$.
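A small sketch of this tight-constraint solution in code, checking the $1/\log k$ scaling on a linearly decreasing price grid (the instance choice is illustrative):

```python
import math

def cr_and_times(v):
    """Tight-constraint solution of the LP above: t_i = c*(1 - v[i+1]/v[i])
    for i < k and t_k = c, with sum(t_i) = 1 forcing
    c = 1 / (k - sum_i v[i+1]/v[i])."""
    k = len(v)
    c = 1 / (k - sum(v[i + 1] / v[i] for i in range(k - 1)))
    t = [c * (1 - v[i + 1] / v[i]) for i in range(k - 1)] + [c]
    assert abs(sum(t) - 1) < 1e-9
    return c, t

for k in (4, 16, 64, 256):
    v = [1 - i / k for i in range(k)]    # v_1 = 1 down to v_k = 1/k
    c, _ = cr_and_times(v)
    print(f"k={k}: c = {c:.4f}, 1/log k = {1/math.log(k):.4f}")
```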
(connection to Ski-rental?)
This optimization problem is related to the *ski rental* problem with buying cost $N$: on each day we either rent skis at cost 1 (if we haven't bought) or buy them at cost $N$, until the end of the ski trip. In terms of competitive ratio, the optimal deterministic policy is to buy on day $N$, which has a competitive ratio of $2-1/N$. In general, we can use an LP to solve for the randomized policy with the optimal competitive ratio. Let $f = (f_1,f_2,\ldots,f_N)$ be a distribution over buying times. The LP is as follows
$$ \min_{f, \alpha}\ \alpha, \\ s.t. \quad \sum_{t=1}^x \frac{N+t-1}{x} f_t + \sum_{t=x+1}^N f_t \leq \alpha \quad \forall x \in [N] \\
\ \quad \sum_{t=1}^N f_t = 1$$
The optimal solution balances (makes tight) all the constraints: $\sum_{t=1}^x \frac{N+t-1}{x} f_t + \sum_{t=x+1}^N f_t = \alpha$ for all $x \in [N]$.
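Solving this LP numerically (a quick sketch; `scipy` is assumed available) recovers the familiar ski-rental constant, with $\alpha$ approaching $e/(e-1)\approx 1.582$ as $N$ grows:

```python
import numpy as np
from scipy.optimize import linprog

N = 20
# Variables: f_1, ..., f_N, alpha; minimize alpha.
c = np.zeros(N + 1)
c[-1] = 1.0
A_ub = np.zeros((N, N + 1))                       # one constraint per x
for x in range(1, N + 1):
    for t in range(1, N + 1):
        A_ub[x - 1, t - 1] = (N + t - 1) / x if t <= x else 1.0
    A_ub[x - 1, -1] = -1.0                        # ... - alpha <= 0
A_eq = np.append(np.ones(N), 0.0).reshape(1, -1)  # sum_t f_t = 1
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(N), A_eq=A_eq, b_eq=[1.0])
print(f"N={N}: alpha = {res.x[-1]:.4f}  (e/(e-1) = {np.e/(np.e-1):.4f})")
```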
## 9 June 2023
Basic setting for the *non-adap* case. WLOG assume there is just one customer. Suppose:
- Valuation $V\sim D$ takes value on $\{v_1,\dots,v_k\}$ (known) and $P[V=v_i]=q_i$ (unknown)
- Monitoring rate is $\lambda=1$
- continuous time horizon is $[0, T]$
**Observation**: in the non-adap setting, we can assume there is only one customer.
Recall that an optimal non-adap policy must be markdown. So we may specify a non-adap policy by specifying the time $t_1,...t_k$ spent at each price $v_1>\dots>v_k$, where $\sum t_i=T$.
To simplify the notation, we denote $T_i^j = \sum_{s=i}^j t_s$. In particular, $T_1^k = T$.
**Benchmark:** Suppose the valuation $V$ of this customer is known to be some $v_i$; then choose $p_1=v_i, t_1=T$. The expected revenue is $$E[R|V=v_i] = v_i (1 - e^{-\lambda T}).$$ Note that here the expectation is only over the monitoring behavior. The expected revenue if we always know $V$ is then
$$OPT_D = \sum_i q_i \cdot E[R|V=v_i] = \sum_i q_i v_i (1-e^{-\lambda T}).$$
As a special case, if $\lambda T$ is small, then by Taylor expansion, the above can be approximated as $OPT \approx \sum_i q_i v_i \cdot \lambda T$.
Rmk. If $\lambda=\infty$, then the above quantity is just the area under the demand curve.
Now suppose the realization of $V$ is not known to the seller.
**Q:** Given a family $\mathcal{F}$ of distributions supported on $\{v_1,...,v_k\}$, consider a non-adap policy $\pi=(t_1,...,t_k)$ where $\sum t_i = T$. What is the "guaranteed fraction" of OPT, if we know that $D\in \mathcal{F}$? Formally, we want to find $$\max_{t_i} \min_{D\in \cal F} \frac{E_{V\sim D}[R(t_1,...,t_k; V)]}{OPT_D}$$
**Step 1.** Find a simple expression for $E_{V\sim D}[R(t_1,...,t_k; V)]$ given any non-adap policy $(t_i)$ and $D$.
As a shorthand, we denote by $R:=R(t_1,...,t_k)$ the random revenue.
We define the E-revenue under a fixed $D$. To this end, fix any $v_i$, and consider $j\ge i$.
The buyer purchases in the $j$-th interval if (i) she has never bought before and (ii) she monitors at least once in the $j$-th interval.
Thus, the probability of a purchase in the $j$-th interval is $e^{-T_i^{j-1}} \cdot (1-e^{-t_j})$. Then, the E-revenue of $\pi$ is $$E[R|V = v_i] = \sum_{j\ge i} v_j e^{-T_i^{j-1}} \cdot (1-e^{-t_j}).$$ Now, taking expectation over $v_i$, we have
$$E_{V\sim D}[R] = \sum_i q_i \cdot E[R|V=v_i] = \sum_i q_i \sum_{j\ge i} v_j e^{-T_i^{j-1}} \cdot (1-e^{-t_j}).$$
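The Step 1 formula, transcribed directly into code for sanity checks (the instance is illustrative); it also prints the $OPT_D$ benchmark from above for comparison:

```python
import math

def expected_revenue(t, v, q, lam=1.0):
    """E_{V~D}[R] = sum_i q_i sum_{j>=i} v_j exp(-lam*T_i^{j-1}) (1 - exp(-lam*t_j))."""
    total = 0.0
    for i in range(len(v)):
        wait = 0.0            # T_i^{j-1}: time already spent at prices v_i..v_{j-1}
        for j in range(i, len(v)):
            total += q[i] * v[j] * math.exp(-lam * wait) * (1 - math.exp(-lam * t[j]))
            wait += t[j]
    return total

v, q = [1.0, 0.6, 0.3], [0.2, 0.3, 0.5]
t = [1 / 3] * 3               # uniform split of T = 1
opt_d = sum(qi * vi for qi, vi in zip(q, v)) * (1 - math.exp(-1.0))
print(f"Rev = {expected_revenue(t, v, q):.4f}, OPT_D = {opt_d:.4f}")
```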
**Step 2.** Find the competitive ratio, given a non-adap policy and $D$.
**Observation**. If $\lambda$ is large, then the problem is easy. In fact, for $\lambda=\infty$, we have $OPT_D = Area(D) = Rev_D$. This suggests that to construct "bad" instances, we should focus on small $\lambda$'s.
Note that $\lambda=1$ and that when $\lambda T$ is small, we can approximate both OPT and Rev using Taylor expansion as $${\rm OPT}_D \approx {\rm OPT}'_D :=\sum_i q_i v_i T.$$ To approximate the revenue, observe that $e^{-T_i^{j-1}}(1-e^{-t_j})=e^{-T_i^{j-1}} - e^{-T_i^j} \approx -T_i^{j-1} - (-T_i^j) = t_j$, so
$${\rm Rev_D} \approx {\rm Rev}'_D:=\sum_i q_i\sum_{j=i}^{k} v_j t_j$$
We refer to ${\rm Rev'_D}$ and ${\rm OPT'_D}$ as the surrogate Rev and OPT. We next show that any CR on the surrogate carries over to the true CR.
**Lemma.** Sps ${\rm Rev'_D}/{\rm OPT'_D}\ge c$, then ${\rm Rev_D}/{\rm OPT_D}\ge c$.
Pf. I think (?) the proof follows from the following fact: given any $0<\alpha< \beta$, we have $(1-e^{-\alpha})/(1-e^{-\beta})\ge \alpha/\beta$. This is true because $x\mapsto (1-e^{-x})/x$ is decreasing on $(0,\infty)$, so $(1-e^{-\alpha})/\alpha \ge (1-e^{-\beta})/\beta$; rearranging gives the claim. (Titing also checked using Wolfram.) $\blacksquare$
Q: Given the $t_i$'s, what is the worst combination of $(q_i)$ and $(v_i)$? Formally, we want to maximize $$\frac{\sum_i q_i v_i T}{\sum_i q_i\sum_{j=i}^{k} v_j t_j}$$
Homework for Titing:
1. for $t_i=T/k$, find the CR as a func of $q_i, v_i$. Then choose $q_i,v_i$ to make this ratio small.
2. what about general $t_i$? Ultimately we want to show a result like this:
**Lemma** The competitive ratio is lower bounded by $\frac{\sum_i q_i\sum_{j=i}^{k} v_j t_j}{\sum_i q_i v_i T}.$
**Prop (which we don't have yet).** The non-adap policy xxx guarantees a xx fraction of OPT whenever $q_i,v_i$ satisfy Assumption xxxx.
Reduces to: given $(v_i)$ and a **class** $\cal F$ of ${\bf q}=(q_i)$, we want to find
$$\max_{t_i}\min_{{\bf q} \in \cal F} \frac{\sum_i q_i \sum_{j=i}^k v_jt_j}{\sum_i q_i v_i}$$
where $\sum_i t_i =1$.
Q: What is the worst ${\bf q}$ for the inner min?
A: **q** = (0,..,0,1,0...), the 1 appears at the min of the following
$$\frac{\sum_{j=i}^k v_jt_j}{v_i}$$
For example, we can show the following results using the above Prop.
**Theorem** For the linear demand functions $\cal F$, i.e., $(q_i)$ have the pattern $0$, constant, $0$, then the above ratio is at least xxx.
Does it make it easier if we consider additive loss instead of multiplicative?
Consider the following $$\min_{t_i}\max_{{\bf q} \in \cal F} \sum_i q_i v_i - \sum_i q_i \sum_{j=i}^k v_jt_j$$
Let's simplify the obj:
$$\sum_i q_i v_i - \sum_i q_i \sum_{j=i}^k v_j t_j = \sum_i q_i \cdot (v_i - \sum_{j=i}^k v_jt_j)$$
Next time(s):
- Recall that we have $n^{3/4}$ adap regret against the best non-adap policy that knows the true demand. But this result assumes linear demand. We believe a similar regret bound (against the same BM) can be shown for the non-param setting, just like for the stream model.
- Non-adap: show CR for different classes of q's.
## 7 June 2023
#### EC'22 paper (Alaei et al)
- considered supply/inventory $1$ or $m$.
- Valuation is iid from a known distr.
- The seller announces in advance a price set $\{p_1,\dots,p_k\}$ that the buyers are allowed to choose from.
- What's known to everyone: when the bids are submitted, the seller selects the top $m$ bids to allocate, and each such buyer pays exactly her bid.
- Assume the bids $b_1,...,b_n$ form a BNE. They showed existence of "symmetric" BNEs, i.e., BNEs that depend only on the valuations. Then (?) we consider the worst-case revenue over all BNEs.
They reduced this to the so-called Batched Prophet Inequality. Recall the (non-batched) P.I.:
- there is one unit of good, and the "bids" $X_t \sim D$ i.i.d.
- In each round $t$, the seller sees $X_t$ and decides whether to accept or reject irrevocably.
- Benchmark is $E_{X_t\sim D} [X_{\max}]$ where $X_{\max} =\max\{X_t\}$.
- A common alg for PI is the threshold policy: accept the first $X_t \ge $ some threshold $\tau$.
- For a suitable threshold, this alg achieves the optimal $1/2$ competitive ratio.
#### Batched PI:
- Now consider multithreshold $\tau_1\ge\dots\ge \tau_k$, a descending sequence. We still have iid $X_t\sim D$ (known).
- Let $j$ be the smallest number such that $S_t:=\{X_t \ge \tau_j\}\neq \emptyset$.
- If $m\ge |S_t|$, then assign to everyone in $S_t$. Else randomly draw $I_1,\dots I_m \sim S_t$ and receive revenue $\sum_{i=1}^m X_{I_i}$.
- Note: for $k=m=1$ this is equivalent to random-order PI
#### Connection and distinctions to our problem:
- Observe that in BPI, if the inventory is $\infty$, then just choose all $X_t$'s (assuming all $X_t \ge 0$).
- **Q**: Is it true that the non-adap, immediate-buy version of Pool pricing is equivalent to their problem with $\infty$ inventory?
- **A**: A natural "reduction" would be: view our prices $p_1,\dots,p_k$ as their $\tau_i$'s. Then the purchase behaviour becomes the same: buyer $t$ buys at time $i$ iff $X_t$ is greater than the "price" $\tau_i$ at round $i$ AND $X_t < \tau_{i-1}$.
However, in our problem, when a buyer buys, she pays $\tau_i$, whereas in BPI, she pays her valuation $X_t$ (which may be $>\tau_i$).
- Aside: if $k$ is very large, then $X_t = \tau_{i}$ + tiny term.
- We also note that the benchmarks are of the same spirit. In BPI, if the $X_t$'s were known, then the seller earns $$X_{\max}^m := \max_{S \subseteq [n] : |S| \leq m} \sum_{t\in S} X_t.$$
In our problem, the benchmark is $$Area(D) = \int_0^1 D(v)\ dv= \int_0^1 P[V\ge v]\ dv = E[V].$$
In particular, when we have sufficient inventory, i.e., $m= n$, then $E[X_{\max}^n] =n\cdot E[X_1] = n\cdot E[V]$.
To do: check out (1) the papers that cited this EC'22 paper (2) on-going work by the authors
Q: Thomas brought up this idea: can we argue that any optimal adap policy is markdown in this way? Fix a price trajectory; if we flip a non-markdown pair, then the expected revenue can only go up. (There may be some independence issues.)
Recall our non-adap results so far.
Statement: we want to find a non-adap policy $\pi=(p_1,t_1,...,p_k,t_k)$.
We measure the quality of a non-adap policy by its "loss". For a fixed demand curve $D$, the loss is
$$Loss(\pi, D) = Area(D) - E[Rev(\pi)].$$
We care about the worst-case loss over a given family $\mathcal{F}$ of $D$'s, i.e.,
$$\max_{D\in \mathcal{F}} Loss(\pi, D).$$
We proved:
1. For $\mathcal{F}$= Lipschitz, we have a policy with Loss < xxx (idea: for any fixed $\pi$, we construct a "bad" curve for $\pi$. Then optimize over all $\pi$'s in a "box"-shaped feasible region)
2. Linear