Simple linear regression

In this analysis, we are interested in whether changes in values of the quantitative predictor variable (denoted X) are associated with changes in values of the quantitative response variable (denoted Y). In a designed experiment, we may have grounds to believe the association is causal. In an observational study, we can still describe the association, but without attributing it to causation.

For a simple linear regression, we assume that the relationship between the two variables is linear, which you can check with a scatterplot (as above). In this analysis, a line is fitted to the data (to learn more about the process behind this, take statistics!). The equation for the line is as follows, where m is the slope and b is the y-intercept:

\hat{Y} = mX + b

The “hat” appears above the Y to denote that it is a predicted value for the Y variable.

We perform a regression analysis to help understand the relationship between X and Y. We can learn the equation of the fitted line and use it to make predictions for Y based on future X values. We can also assess if the relationship between X and Y is a significant one using a hypothesis test.
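As a minimal sketch of these two uses (learning the equation of the fitted line and predicting Y for a new X value), the Python code below fits a least-squares line to a few made-up (X, Y) pairs. The data values and the function names fit_line and predict are assumptions for illustration only; they do not come from any real study.

```python
# Hypothetical (X, Y) pairs, made up purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]


def fit_line(xs, ys):
    """Return (m, b) for the least-squares line Y-hat = m*X + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-products and sum of squares around the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx            # slope
    b = y_bar - m * x_bar    # y-intercept
    return m, b


def predict(m, b, x_new):
    """Predicted Y (Y-hat) for a new X value."""
    return m * x_new + b


m, b = fit_line(xs, ys)
y_hat = predict(m, b, 5.0)
print(f"slope m = {m}, intercept b = {b}, prediction at X = 5: {y_hat}")
```

Predictions are most trustworthy for X values inside the range of the data used to fit the line; here X = 5 lies just outside that range, so treat such a prediction with caution.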

For this test, our null and alternative hypotheses are:

H_0: The slope of the regression line is zero.

H_A: The slope of the regression line does not equal zero.

You might wonder why we test the slope and not the intercept. So, let’s talk about the slope and intercept a bit. In many situations, the intercept is either not useful or nonsensical. Consider a regression using tree diameter (X) to predict tree height (Y). The intercept would be the average tree height when tree diameter is 0. That just doesn’t make sense to talk about. In this example, the slope would be the expected change in tree height for a one-unit increase in tree diameter. You’d argue in this case that the slope should probably be positive. So, we focus on the slope more often than the intercept.

Why do we care if the slope is zero versus not zero? (We might instead test a slope of zero against a slope that is specifically positive or specifically negative, which is a one-tailed approach.) If the slope is 0, the regression line is flat; that is, the regression line would become:

\hat{Y} = b

This basically says that the response is equal to a constant. In other words, using X to try to predict Y is no better than just using a constant value for Y. If the slope is significantly different from 0, then using X to predict Y is better than using a constant for Y. That’s why we focus on this particular test for simple linear regression.
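To make the slope test a little more concrete, here is a hedged Python sketch of the usual test statistic: the fitted slope divided by its standard error, with n - 2 degrees of freedom. It also compares the prediction error of the fitted line with that of a constant (the mean of Y). The data are made up, and the function name slope_t_statistic is hypothetical; the text above does not specify how the test is computed, so treat this as one standard approach rather than the procedure used in this course.

```python
import math

# Hypothetical (X, Y) pairs, made up purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]


def slope_t_statistic(xs, ys):
    """Return (slope, standard error, t statistic, degrees of freedom,
    SSE of the fitted line, SSE of the constant-only model)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    # Error when predicting with the fitted line.
    sse_line = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    # Error when predicting every Y with a constant (the mean of Y).
    sse_constant = sum((y - y_bar) ** 2 for y in ys)
    df = n - 2
    se_m = math.sqrt(sse_line / df / sxx)
    t_stat = m / se_m
    return m, se_m, t_stat, df, sse_line, sse_constant


m, se_m, t_stat, df, sse_line, sse_constant = slope_t_statistic(xs, ys)
print(f"slope = {m:.3f}, SE = {se_m:.3f}, t = {t_stat:.3f}, df = {df}")
print(f"SSE with line = {sse_line:.3f}, SSE with constant = {sse_constant:.3f}")
```

The p-value comes from comparing the t statistic to a t distribution with n - 2 degrees of freedom. Statistical software reports it directly (for example, scipy.stats.linregress in Python returns a two-sided p-value for this test).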

Regression is a very versatile statistical technique. Applied appropriately, it can handle non-linear relationships, work with many predictor variables (not just one), and even be used to run the two-sample t-test and ANOVA procedures. To learn more, take more statistics!

Some key points:

You should always identify the response (Y) and predictor (X) variables.
The regression line provides predictions for Y based on X. Doing the reverse is also possible but would require refitting the line and might not make sense in all situations.
You should check that the relationship observed is linear with a scatterplot before running the regression analysis. If the relationship is not linear, variable transformations may be used to help improve the linearity. (Ask for assistance from a statistician!)
It is possible to find that a predictor is significant even when the model does not fit well. The ideal relationship would have a strong model fit and a significant predictor. Judging model fit is done with different statistics than those used for this test procedure.
As with our previous tests, there are conditions that need to be met here, and we will assume they hold. Most software has easily accessible diagnostic plots to help check these conditions.
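The point above about model fit can be illustrated with one commonly used fit statistic, the coefficient of determination (R squared): the proportion of the variation in Y that the fitted line accounts for. The text does not name a specific fit statistic, so the choice of R squared, the made-up data, and the function name r_squared are all assumptions in this Python sketch.

```python
# Hypothetical (X, Y) pairs, made up purely for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]


def r_squared(xs, ys):
    """Proportion of the variation in Y explained by the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    # Variation left over after fitting the line.
    sse = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    # Total variation in Y around its mean.
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst


r2 = r_squared(xs, ys)
print(f"R squared = {r2:.2f}")
```

A value near 1 means the line accounts for most of the variation in Y; a value near 0 means it accounts for very little, even if the slope test happens to be significant.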

Assumptions

Before proceeding with a statistical test, you will want to make sure assumptions are met.

Example

Collared lizards (Crotaphytus collaris) are territorial, with males maintaining a territory that overlaps with several females. Successfully establishing and defending a large territory provides greater mating opportunities, and hence potential reproductive success, for males. Therefore, males compete for territory, typically using their jaws as weapons. Collared lizards are a sexually dimorphic species, with males having much larger heads (jaws and associated muscles) than females. Having larger jaws may influence the strength of the bite and potentially the ability of a male to defend and expand the size of his territory. Thus, Lappin and Husak (2005) sought to determine whether bite force (strength of bite) could predict the size of territory for males of this species. (Example and data from Whitlock and Schluter 2009.)

Photo credit: Bryce Chackerian https://commons.wikimedia.org/w/index.php?curid=945972

H_0: Slope of regression line of territory area in relation to bite force is zero.

H_A: Slope of regression line of territory area in relation to bite force does not equal zero.

Since we are using bite force to predict the size of territory, bite force is the predictor (X) variable, and territory.area is the response (Y) variable.

Statistics for Biology
