In real-world cases we will typically work with larger datasets, but I will use a mini dataset of 4 datapoints to run some step-by-step examples below. Let's insert the four X and Y couples into a scatter plot: The regression line for these 4 datapoints would be the red line: But is this regression line a good/strong fit to the datapoints? This can be answered by the correlation coefficient (r): The calculated correlation coefficient (r) for our line is 0.63, which gives an r² of 0.40.

Therefore, we can note that approximately 40% of the variation in Y can be explained by the variation in X. That is a relatively low fit, so our model is not a good one. We could already see this from the graph, because the datapoints fall relatively far from the line. In some cases, this can be dealt with by transforming the data to achieve linearity, see: Transforming data to achieve linearity. In a real-world case, we would first of all consider getting more observations than just these four, but for this exercise example, I'll leave it as is. In the case of a strong fit, we can use the line to predict Y values for X values that are not included in our observations. Let's take this mini dataset as an example: With an r² = 0.974, we can say that approximately 97.4% of the variation in Y can be explained by the variation in X.

For example, there are cases where Y is difficult to measure and X is easy to measure, as in the dataset of Australian timber. In this case the density of the wood (X) is easy to measure and the hardness/durability (Y) is difficult to measure. With a strong fit we can immediately estimate the hardness (Y) for a given density (X), which, without the regression model, would have been very difficult and costly. That means that we can now predict a Y value for a given X value that has not been observed. In our mini data example, there is no observation for X = 2.5, but as our line fits the datapoints very well, we can simply read the value off the line at X = 2.5: In real-world situations our statistical software will calculate this Y-value for us and report the associated uncertainty as well.

That means that we can never follow the line outside of the observed X-values. We don't know these areas, so we cannot just follow the regression line beyond the lowest and the highest X-values: The reason that we cannot extrapolate is that we cannot estimate for an interval of X values that we have not observed. Outside the observed X values the relationship could be completely different, like this, where we only see from X ≈ 90: Below X ≈ 90, we don't know what happens; it could be a completely different situation.

More on this in Caution in simple linear regression. With the estimated regression line, we intend to estimate the true line. When referring to the regression line, we often mean the estimated regression line. In the following, we will see the different notations for these two lines.
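The fit-and-check workflow above can be sketched in a few lines of Python. The article's actual four datapoints are not reproduced here, so the numbers below are made-up stand-ins; the formulas for the least-squares slope, intercept, and the correlation coefficient r are the standard ones.

```python
from statistics import mean

# Hypothetical 4-point mini dataset -- illustrative values only,
# not the datapoints used in the article.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 5.0, 4.0, 7.0]

def fit_line(xs, ys):
    """Least-squares slope and intercept for simple linear regression."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def correlation(xs, ys):
    """Pearson correlation coefficient r."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

slope, intercept = fit_line(xs, ys)
r = correlation(xs, ys)
# r**2 is the share of the variation in Y explained by the variation in X
print(slope, intercept, r ** 2)
```

Squaring r gives the r² quoted in the text; a value near 1 means the points lie close to the line, a value like 0.40 means they scatter far from it.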
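"Reading a value off the line" at an unobserved X, as done for X = 2.5 above, is just evaluating the fitted equation. The slope and intercept below are hypothetical stand-ins, not the coefficients from the article's dataset.

```python
# Hypothetical fitted coefficients -- illustrative only.
slope, intercept = 1.4, 1.0

def predict(x, slope, intercept):
    """Evaluate the fitted regression line at x."""
    return intercept + slope * x

# No observation exists at X = 2.5, but the line still gives an estimate.
print(predict(2.5, slope, intercept))  # -> 4.5
```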
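The warning against extrapolation can be enforced in code: a small sketch of a prediction helper that refuses to answer outside the observed X-range. The data and coefficients are again hypothetical.

```python
# Illustrative observed X-values and fitted coefficients (not from the article).
observed_x = [1.0, 2.0, 3.0, 4.0]
slope, intercept = 1.4, 1.0

def predict_in_range(x):
    """Predict Y at x, but refuse to extrapolate beyond the observed X-range."""
    if not (min(observed_x) <= x <= max(observed_x)):
        raise ValueError(
            f"X={x} lies outside the observed range; refusing to extrapolate"
        )
    return intercept + slope * x

print(predict_in_range(2.5))  # inside the observed range -> 4.5
# predict_in_range(90.0) would raise ValueError: outside the observed range
```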