HackMD - Collaborative Markdown Knowledge Base

*Cost-Benefit Assessment of Seasonal Forecasts for Fresh Produce* ## People involved - Tosin Babasola (o.l.babasola@bath.ac.uk) - Christopher Nankervis (chris@weatherlogistics.com) - Chuan Fu Yap (chuanfu.yap@manchester.ac.uk) - David Barton (david.barton@bristol.ac.uk) - Constantin Puiu (constantin.puiu@maths.ox.ac.uk) - Chinasa Juliet Ojiako (chinasaojiako@gmail.com and c.j.ojiako@lboro.ac.uk) - Colm Connaughton (C.P.Connaughton@warwick.ac.uk) - Kamil Kulesza (Kamil.Kulesza@maths.com.pl) ## Introduction Many sectors of the economy are exposed to the impacts of weather or climate conditions, either directly or indirectly. The agriculture sector is clearly an example of a sector whose production is directly impacted by weather variation which result to changes to the cost of production or demand for their products. Therefore, understanding the effect of weather variability and assessing the cost benefit of the seasonal forecast of fresh produces are of great importance to the weather logistics who aim to help the farmers (growers) of fresh produce to manage their supply and also reduce the possible waste. One of the leading cause of waste in the fresh produce is the impact of the climate variability which leads to reduction of grower yield and decrease in the acquired profit. The weather logistics produce local seasonal ensemble forecasts of weather conditions on a rolling 3-months horizon and this seasonal forecasts are more informative than climatology over this horizon where the users are primarily in the agricultural sector. So, we aim to determine the how the value of a seasonal forecast to a grower of fresh produce be quantified. The information available are: the historical local temperature and precipitation observations for some representative growing sites in the UK and Spain together with the corresponding ensemble forecast data and the historical wholesale price data for some representative crops such as spinach, asparagus and courgettes. Thus, in this report, we shall work towards determining the economic value (cost-benefit determination) of the weather forecast such as temperature, rainfall, solar radiation etc for the growers. For instance, the cost of production of the crop against its wholesale price. In order to determine whether the available weather data is linked with the wholesale price we implemented the machine learning approach. ## Questions to be answered? - Determine economic value (cost-benefit determination) of weather forecast (temperature, rainfall, solar radiation etc) for the growers, for example cost of production of crop vs wholesale price of crop (the cost-benefit). Steps to be taken... 1. Determine if weather data is linked to wholesale price, to be attempted with suite of Machine Learning. ### General set up: * At the start of a season, growers should choose a growing strategy, $S$. E.g. proportion of spinach to asparagus or something similarly simple. * The subsequent weather timeseries for the season, $W_t$ determines the payoff, $P(S, W_t)$ for that strategy. * If $W_t$ were known, the grower should select the strategy by solving the optimisation problem: $$ S^* = argmax_S\, P(S, W_t) $$ * However, $W_t$ is uncertain so we "should" instead solve the following optimisation problem: $$ S^* = argmax_S\, \mathbb{E} [P(S, W_t)] $$ where the expectation is taken with respect to the ensemble of possible outcomes for the weather timeseries, $W_t$. * What should we take as this ensemble? One option is to use an ensemble of outcomes drawn from the climatology. A second is to use the ensemble of outcomes drawn from the seasonal forecast. If the seasonal forecast is informative, the latter should yield a higher payoff? * The missing piece is that we don't know the payoff function $P(S,W_t)$ that relates the strategy and the weather to the payoff at the end of the season. Suggestion is to model this. * **Option 1** : build an ML model that relates historical weather timeseries to historical prices and use this to estimate the price that will be paid for the produce that is grown once the weather for that season is observed. **Drawback**: ignores the fact that the amount of produce resulting from a given strategy also depends on the weather. * **Option 2**: build a simple cropping model that relates the amount of produce produced to the observed weather time series (eg using a variant of the temperature thresholding rule suggested by Chris). Once the amount of produce is determined for a given season, estimate the price obtained for it using the average historical price for that month. **Drawback**: ignores the fact that the price also depends on the weather. * If these approaches look sensible, the two could be combined to better take account of the drawbacks. ## Data Quirks ### Observed data * Each month has 31 days. Sometimes the 31st day contains non-zero data which is a duplicate of the 31st day of the previous month. (E.g., the Spanish temperature data in `Cancarixthourly_May2020.npy` and `Cancarixthourly_Jun2020.npy`.) * Missing precipitation data (see example plot below for one of the areas with missing data in certain period) ![](https://i.imgur.com/S9hYMIW.png) ### Forecast data * Forecast data is stored in json format. Each file contains multiple json objects so the json parser will fail if you just pipe the file straight into it. Here's an example that works: ``` import json import numpy as np # Read json file with open("CancarixSeasonalforecast_TmaxDAILY-Early-Summer2020esgi.json", 'r') as f: data=f.read() # Split into separate json objects - there are two in the file objs = data.replace("}{","}|{").split('|') # Parse each object farm_data = json.loads(objs[0]) forecast_data = json.loads(objs[1]) ``` The forecast data itself is a 15 x 92 array. The columns are the dates contained in ```forecast_data["columns"]```. The rows are a bunch of different quantities contained in ```forecast_data["indices"]```: ``` ['Forecast Min.', '25th Centile', 'Forecast Mean', '75th Centile', 'Forecast Max.', 'Climate Min.', 'Climate 25th Centile', 'Climate Mean', 'Climate 75th Centile', 'Climate Max.', 'Benchmark Min.', 'Benchmark 25th Centile', 'Benchmark Mean', 'Benchmark 75th Centile', 'Benchmark Max.'] ``` The actual numbers are stored in ```forecast_data["data"]```. The fields can be accessed by index as follows: ``` import datetime as dt dates = [dt.datetime.strptime(date, '%m/%d/%Y').date() for date in forecast["columns"]] A = dict() for i, key in enumerate(forecast["index"]): print(i, " ", key) A[key] = np.array(forecast["data"])[i,:] ``` Plot as follows: ``` import matplotlib.pyplot as plt plt.plot(dates, A["Forecast Max."]) plt.ylabel('Max temp') ax = plt.gca() ax.tick_params(axis='x', rotation=70) ``` ![](https://i.imgur.com/Km6JDwY.png) ### Reading in the observation data Hopefully something like this fixes the number of days in the month correctly: ``` months = ['May','Jun', 'Jul'] ndays = dict(zip(months, [31, 30, 31])) year = "2020" Tmax = [] for m in months: filename = "Cancarixthourly_"+m+year+".npy" print(filename) T = np.load(obs_data_folder+filename)[0:ndays[m],:] Tmax.extend([np.nanmax(T[i,:]) for i in range(0,np.shape(T)[0])]) ``` EXtract the maximum temperature per day: ``` Tmax = [np.nanmax(T[i,:]) for i in range(0,np.shape(T)[0])] ``` Compare the observation to the forecast maximum temperature: ``` plt.plot(dates, A["Forecast Max."], label="forecast") plt.plot(dates, Tmax, label="observed") plt.ylabel('Max temp') ax = plt.gca() ax.legend() ax.tick_params(axis='x', rotation=70) ``` ![](https://i.imgur.com/hv76o6M.png) ![](https://i.imgur.com/LKGgVYq.png) ## Second Data approach 0. All the below uses UK Asparagus (price) data. 1. What does the data look like? Would the Seasonal Average Explain anything? ![](https://i.imgur.com/iMg0Y7P.png) 2. Should we be worried about inflation? ![](https://i.imgur.com/sqvpfY3.png) No! 3. How bad would a Naive Seasonal Average Forecast perform with 2015-2019 "train data" and 2020 test data? ![](https://i.imgur.com/GU1mquF.png) RMS = 2.58. Not that bad! 4. What next? Will the remaining variance be explained by the observed weather (with appropriate delay etc.)? To do this, we detrend our data by subtracting the seasonal mean (using years 2015 - 2019) from the actual price time series. This is what we get: ![](https://i.imgur.com/f0vjEqh.png) We will now try to find approapriate features from the weather data to further explain the remaining variance. We shall begin by identifying the relevant time-window for a given [week,price] tuple (the week number matters here). 5. We need two numbers: (1) the time it takes for an asparagus spear to grow from the moment it "appears"; (2) how fast does the asparagus get to the market from the day it was harvested. t1 ------ t2 ----> t3 X '=' F1,F2,F3,F4 (t1->t2); .... y = Price(t3) t11..........t12 t21..........t22 ## A pay-off model from a previous study group This is an email from David Allwright who worked on a similar problem about balancing supply and demand for lettuces at a previous study group. It would be useful to incorporate it as a component of our $P(S, W_t)$ function - it will need some assumptions about demand levels and price elasticity but has the nice feature that it incorporates penalties for over-supply which we will need if our pay-off function is to avoid trivially maximising the production of the highest priced produce. > Dear Bernard, John, Colin, Yes, G's Growers brought a problem about lettuces to that meeting. Their issue was to improve the match of supply to demand, by using weather forecasts and a growth model. I don't have any documents about it. Supply S and demand D were day-by-day functions. Roughly speaking demand was split as retail+wholesale, D=D_r+D_w. The retail price was roughly constant say p_r. It was important to meet the retail market completely, so there was some (perhaps reputational) effective cost c_r of failing to meet D_r. So if the total supply was S then your retail sales are S_r=min(S,D_r) and your gain from retail is p_r*S_r-c_r*max(D_r-S,0) those terms being the sales, and then the reputational cost if S<D_r. Then if S>S_r that leaves you with S-S_r that you try to sell to the wholesale market. That has some elasticity of demand D_w(p_w). So in the wholesale market you sell S_w=min(D_w(p_w),S-S_r), and you gain p_w*S_w. Then you may be left with S-S_r-S_w unsaleable lettuces. For some reason with lettuces this had to be ploughed in. (I don't know why it couldn't just be left in the field like Chris showed yesterday, but apparently with lettuces it was ploughed in.) So there was a cost of waste, c_p*(S-S_r-S_w). I think that was the economic side of it. The growth model for the lettuces was something that G's Growers had got someone else to set up based on weather & growth data from their fields. ### Flow chart for the pay-off function We think in terms of discrete 3-month growing seasons. At the start of each season, a grower selects a strategy (essentially how much of each crop to plant). This choice together with the 3-month time series of weather conditions during the season are fed into a cropping model which outputs the amount of each crop available for the market at the end of the season. These yields are then fed into a market model. The market model includes the demand at the end of the season, any fixed costs associated with the chosen strategy and outputs the market value of the produce at the end of the season: ![](https://i.imgur.com/4uGFTAN.png) ### Growing strategies We need a set of strategies described by a manageably small number of parameters (since we will need to do optimisation). To achieve this we imagine that the grower has to choose how to allocate a fixed growing area to 3 different crops. Thus the space of strategies is parametrised by two real numbers, each between 0 and 1 representing the proportions of the first and second crops (the proportion of the third is determined by assuming that the total area is always used). The 3 crops have different characteristics in terms of their growth response to variations in temperature, their average price and the fixed costs associated with their production: * **spinach**: low fixed cost, low temperature threshold, low temperature sensitivity, low price * **asparagus**: high fixed cost, high temperature threshold, high temperature sensitivity, high price * **courgettes** : medium fixed cost, medium temperature threshold, low temperature sensitivity, medium price ### Cropping models: We propose to use a simple model that expresses the yield per unit area of a particular crop produced during the growing season as a function of the "growing degree days" (GDD) of the season: $$ GDD = \int_0^S \max(T(t) - T_{base}, 0)\, dt $$ where $S$ is the length of the growing season, $T_{base}$ is lower threshold for growth, and T(t) is the temperature realised during the season. I suggest that we us a sigmoid function to represent the dependence of yield on GDD, G: $$ Y_c(G) = \frac{1}{1 + \exp[-\alpha^{(c)}\,(G - G_0^{(c)})]} $$ with the superscript $(c)$ indicating that the parameters for this function should vary for different crops in order to introduce heterogeneity into the growing strategies. ### Market models: Much of the current work focuses on the market model component. For simplicity we can consider the demand to be fixed and known in advance. We have 3 options under consideration: 1. **Simple model** : price of each crop is the average of the historical price (perhaps adjusted for seasonality). 2. **Machine learning model** : price of each crop is obtained from a machine learning model that attempts to predict wholesale prices from historical price data and historical weather data. 3. **System dynamics model**: price is determined from a variation of David Allwright's model above. Needs some development. ## Weekly wholesale model for UK spinach As a rapid prototype based on available data, a simple model based on GDD and weekly rainfall accumulation is developed. This was chosen based on case study 3 regarding spinach farm in Cambridge Waterbeach, therefore historical data from Cambridge Waterbeach temperature and preciptation was used. As wholesale price availability starts from 2015, the window of data starts there, and due to missing data in precipitation, window of data ends in 2017. Target regression prediction is: * the weekly wholesale price in UK for spinach which starts at April each year. Input features are as follow: * weekly approximate GDD that accumulates from start of available date of each year up to the weekly wholesale price date, the starting window from which accumulation is done shifts weekly along with the target wholesale price, GDD for spinach is maxtemp-mintemp/2. * weekly rainfall of that week, they are summed from the 7 days before wholesale price. Given the small dimensionality, Elastic Net was used. Elastic Net is Linear regression with combined L1 and L2 priors as regularizer, which helps in reducing model complexity. Tuned Elastic-Net gives 0.01 mean squared error. ![](https://i.imgur.com/I7GsrIG.png) Following on from this, to account for the time series where there is autocorrelation in wholesale prices, one order differencing was done in prices where price_t - price_t-1. The differenced/detrended values was the target for prediction instead. This time Random Forest Regressor is used instead as Linear model did not work as well. Random Forest is an ensemble decision tree algorithm which is known to capture non-linear data well. However, the random forest was untuned given the lack of time. ![](https://i.imgur.com/mpNitKA.png) This is the result with the differenced undone where the starting value is added back using the average of starting values from the previous 2 years, where starting value is the first available wholesale price available for the year. This random forest model using detrended price give mean squared error of 0.03 which performed worse than the previous iteration. ![](https://i.imgur.com/2JfO5dk.png) Digging further into this outcome, it seems accumulated GDD (labeled gdu on the plots) have more correlations with the actual price than detrended data. ![](https://i.imgur.com/YNbRVI7.png)![](https://i.imgur.com/a4axktB.png) ### Improvement on this It would be beneficial to detrend the cumulative sum of GDD as well, as it is also a time series data, and engineer features from it. Naively, polynomial or trigonometric transformations could be employed. Additionally, stacking an additional learner/transformer prior to price regression could be done. The additional learner/transformer would be used to learn the optimal period to sum the GDD or to engineer useful features from the approximate GDD. ## Market model 3 : Retail/wholesale market with price elasticity, contractual obligations and wastage costs. Let us develop a bit further the more complex market model from the previous study group. Once the amount of produce, Q, is known for the season, the main ideas are the following: * the retail market is supplied first. The retail market has fixed demand and a fixed (higher) price. * failure to supply the retail market incurs a penalty in proportion to the shortfall - e.g. grower is contractually obliged to source the shortfall from elsewhere. * any produce that remains after supplying the retail market is sold on the wholesale market. Prices in the wholesale market are lower than in the retail market and the demand is elastic. This means the demand varies with the price. By optimising the price, the grower may or may not be able to sell all of the remaining produce. * Any produce which remains after selling on the wholesale market incurs a wastage cost (e.g. for disposal) Let us define * $Y$ the amount of produce obtained at the end of the season. * $Q_R$ the retail market demand (the amount of produce that can be sold on the retail market) * $p_R$ the retail market price * $Q_W(p)$ the amount of produce that can be sold on the wholesale market. Since the wholesale market is elastic, this depends on the wholesale price, $p$. * $p_W$ the wholesale price charged by the grower. * $Q_D$ the quantity of produce that remains unsold and must be disposed of. * $c_R$ cost per unit shortfall in the event that there is insufficient produce to supply the retail market. * $c_D$ cost per unit to dispose of unsold produce. **Income for the grower**: Retail market sales: $p_R \min(Y, Q_R)$ Wholesale market sales : $p_W\, Q_W(p_W)$ **Costs for the grower**: Penalty for failing to meet retail demand: $c_R \max(Q_R-Y, 0)$ Cost of disposal of unsold produce: $c_D \max(Y-Q_R - Q_W(p_W), 0)$. We need to supply a model of the wholesale market price elasticity of demand, $Q_W(p)$, and a rule for determining the wholesale price, $p_W$. Let us assume a linear relationship between demand and price (what the economists call the "price elasticity of demand curve"): $$Q_W(p) = Q_{max} - \epsilon \,p $$. Here $Q_{max}$ and $\epsilon >0$ are given properties of the market. The demand goes down as the price goes up. $Q_{max}$ is the maximum capacity of the market and $\epsilon$ controls how quickly the demand decreases as price increases. Given $Q_w(p)$, the grower should choose the price, $p_W$, to maximise the revenue, $R(p) = p\, Q_W(p)$. Thus $$ p_W = \arg \max_p p\, (Q_{max} - \epsilon \,p) $$ This would give: $p_W = \frac{Q_{max}}{2 \epsilon}$ However there is a constraint: we can't supply the wholesale market with more than $Y - Q_R$. Thus we need to satisfy the constraint: $$Q_W(p) \leq Y-Q_R.$$ This can be expressed as a constraint on the price: $\begin{align*} Q_{max} - \epsilon p &\leq Y - Q_R\\ \Rightarrow p &\geq \frac{Q_{max} - (Y - Q_R)}{\epsilon} \end{align*}$. Including this constraint, we have $$ p_W = \max(\frac{Q_{max}}{2 \epsilon}, \frac{Q_{max} - (Y - Q_R)}{\epsilon} ). $$ With some additional algebra, this then gives the amount of produce sold on the wholesale market as $$Q_W(p_W) = \left\{\begin{array}{ll} \frac{Qmax}{2} & \text{if } Y - Q_R > \frac{Q_{max}}{2}\\ Y - Q_R & \text{if } Y-Q_R \leq \frac{Q_{max}}{2} \end{array}\right.$$ I think this is consistent: if there is less produce available than the amount that optimises revenue, then you just sell all of it. 210415 morning Kamil to Colm (& possibly others): In connection with our (KK-CC) evening discusion on "lettuce model" I spoke with Christoper this morning. Looks like both D_r and D_w can be increased by lowering the price. So you can sell more for the lower price, but if you take into account that there is cost (waste management) associated with products you do no sell, then you can make extra money as long as your price decrease is lower than the cost of managing the waste. Hence there is possibility to add an extra bit to the "lettuce model". I am off for the other project, but will pop-in some time today.