# Linear Regression Forecast (LRF)
## Context
The purpose of this document is to show one way to build a delivery expectation for a team through a probabilistic, mathematical approach. The method used is a linear regression over our accumulated throughput to forecast when a project would end (a minimal sketch of the idea follows the list below).
Some other points that this document seeks to address:
- Take the burden of the "when will it be delivered?" question off the individual engineer and move it onto the process or flow.
- Build delivery expectations with stakeholders in a more robust way (less guesswork).
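To make the idea concrete, here is a minimal sketch of a linear regression over accumulated throughput. The weekly throughput values and the backlog size are hypothetical and exist only to illustrate the mechanics (fit a line to the cumulative total and project where it crosses the backlog); the disclaimer below explains how the approach is adjusted for the non-normal distribution of throughput.

```python
# Minimal sketch of a Linear Regression Forecast (LRF) over cumulative throughput.
# The weekly throughput values and the backlog size are hypothetical.
import numpy as np

weekly_throughput = [5, 8, 6, 9, 7]  # hypothetical tasks completed per week so far
backlog_size = 60                    # hypothetical total number of tasks in the project

weeks = np.arange(1, len(weekly_throughput) + 1)
cumulative = np.cumsum(weekly_throughput)

# Fit a straight line (degree-1 polynomial) to the accumulated throughput
slope, intercept = np.polyfit(weeks, cumulative, deg=1)

# Project the week where the fitted line reaches the backlog size
forecast_week = (backlog_size - intercept) / slope
print(f"Fitted delivery rate: {slope:.1f} tasks/week")
print(f"Projected completion around week {np.ceil(forecast_week):.0f}")
```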
An important disclaimer before reading this analysis:
Throughput and lead times do not follow a normal distribution, so in principle it would not make sense to calculate a linear regression over them (see [Looking at Lead Time in a different way](http://blog.plataformatec.com.br/2016/03/looking-at-lead-time-in-a-different-way/) and [Assumptions of Linear Regression](https://www.statisticssolutions.com/assumptions-of-linear-regression/)).
However, it is possible to work around this problem with some changes to the approach: for each scenario, I manually set the throughput rate that would make the forecast line reach the full backlog on a given date. By using the 95th and 75th percentiles of throughput, we avoid anchoring the forecast on a single central value such as the 50th percentile (the median).
In summary: since we are dealing with a non-normal distribution, projects will rarely have a distribution where mean, median, and mode are identical (*[see studies about it here](http://blog.plataformatec.com.br/2016/01/power-of-the-metrics-dont-use-average-to-forecast-deadlines/)*). As the data becomes skewed, the average loses its ability to provide the best central location for the data because the skewed tail drags it away from the typical value.
As we don’t have information about the data distribution, **it’s wrong to make a forecast based on an average**. So, if you are a Product Owner or Product Manager, you shouldn’t use the average throughput when you are trying to establish a release date.
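A small sketch of that difference, using a hypothetical, right-skewed sample of weekly throughput: the average sits above the typical week, while the percentiles describe the distribution without that distortion.

```python
# Sketch of why percentiles are preferred over the average for skewed throughput.
# The weekly throughput sample is hypothetical and right-skewed on purpose.
import numpy as np

weekly_throughput = np.array([4, 5, 5, 6, 6, 6, 7, 8, 14, 21])

print("mean:", weekly_throughput.mean())              # dragged up by the two outlier weeks
print("p50 :", np.percentile(weekly_throughput, 50))  # median, the typical week
print("p75 :", np.percentile(weekly_throughput, 75))
print("p95 :", np.percentile(weekly_throughput, 95))
```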
## Forecasting Data
The data modeling was built on the team's throughput analysis; however, some assumptions were added to the simulation of this domain (a sketch of how they feed the projection follows the list):
- **A variation in scope of around 20% was added.**
*This parameter was added to simulate a situation closer to reality where the scope is variable.*
- **Throughput data was modeled to consider a more pessimistic scenario.**
*Even with the throughput percentiles in hand, they were reduced in the modeling to reflect a sudden change of direction for the entire team, which usually affects the delivery rate.*
- **Work on the tasks would start in week 1, between 04/05 and 08/05.**
*This premise is based on the alignment and coordination with the PMs.*
- **The tasks were handled in a homogeneous manner, without differentiating the size or effort.**
*Task size is not very important when flow efficiency is low; what matters is to compare items of approximately the same batch size: stories with stories, tasks with tasks, and so on.*
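A minimal sketch of how these assumptions could be combined in the projection. The backlog size, the percentile values, and the exact size of the pessimistic reduction are hypothetical; only the shape of the calculation (inflate the scope by ~20%, reduce the throughput percentiles, divide) reflects the modeling described above.

```python
# Sketch of how the modeling assumptions above could feed the projection.
# Backlog size, percentile values, and the pessimism factor are hypothetical.
import math

backlog_size = 20          # hypothetical number of tasks planned today
scope_variation = 0.20     # ~20% scope growth assumed in the modeling
pessimism_factor = 0.80    # hypothetical reduction applied to the throughput percentiles

throughput_percentiles = {"p75": 9, "p95": 12}  # hypothetical tasks/week

expected_scope = math.ceil(backlog_size * (1 + scope_variation))

for name, rate in throughput_percentiles.items():
    adjusted_rate = rate * pessimism_factor
    weeks = math.ceil(expected_scope / adjusted_rate)
    print(f"{name}: ~{adjusted_rate:.1f} tasks/week -> {weeks} weeks for {expected_scope} tasks")
```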
**Data modeling:**

To avoid using the average on a non-Gaussian distribution, we analyze the median and higher percentiles, which are better approaches for advising the team when estimating how long new features will take.
## Forecasting Preview

## Conclusions
Using percentiles to project our delivery forecast avoids the error of being skewed by the average: as the data becomes skewed, the average loses its ability to provide the best central location for the data because the skewed tail drags it away from the typical value (*[this study can be found here](http://blog.plataformatec.com.br/2016/01/power-of-the-metrics-dont-use-average-to-forecast-deadlines/)*). In this case, analyzing our burnup, we have the following three scenarios:

In short, taking into account the 20% scope variation and a weekly delivery rate of around 6 tasks (*the team never delivered fewer than that*), we would be able to guarantee delivery of all tasks in **week 4**, between **25/05** and **29/05**. This forecast is meant to be dynamic: every week it will be updated with the accumulated weekly throughput, so we can monitor the progression of the delivery rate and check which scenario we are trending toward (optimistic, realistic, or pessimistic).
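For reference, a back-of-the-envelope version of that arithmetic, assuming a hypothetical backlog of about 20 tasks (the actual backlog size is not stated in this document): inflating it by the 20% scope variation and dividing by the floor of 6 tasks per week lands on week 4.

```python
# Worked version of the realistic-scenario arithmetic above.
# The backlog size is hypothetical; the 20% variation and the 6 tasks/week
# floor come from the text.
import math

backlog_size = 20                    # hypothetical
scope_variation = 0.20               # 20% scope variation from the modeling
min_weekly_throughput = 6            # "the team never delivered fewer than that"

total_scope = math.ceil(backlog_size * (1 + scope_variation))    # 24 tasks
weeks_needed = math.ceil(total_scope / min_weekly_throughput)    # 4 weeks

print(f"{total_scope} tasks at {min_weekly_throughput}/week -> week {weeks_needed}")
```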
---
## Glossary (*Work in Progress*)
- Lead Times
- Throughput
- Burnup
- Percentiles
- Linear Regression
- Non-Gaussian
- Non-Normal Distribution