User Tools

Site Tools


A Simple linear regression model and plotting a graphic

Regression is used to analyze relationships between continuous variables. In statistics, simple linear regression is the least squares estimator of a linear regression model with a single predictor variable. It describes the linear relationship between a predictor variable, plotted on the x-axis, and a response variable, plotted on the y-axis. Simple linear regression fits a straight line through the set of n points. The sum of squared residuals of the model is the smallest possible.
The fitted line has the slope equal to the correlation between y and x, corrected by the ratio of standard deviation of these variables. The intercept of the fitted line is such that it passes through the center of mass (x, y) of the data points.
Suppose there are n data points {yi, xi}, where i = 1, 2, …, n.
The goal is to find the equation of the straight line y = α + β x

This would provide a “best” fit for the data points: such as a line that minimizes the sum of squared residuals of the linear regression model.


As a first example of a linear model we use the Orange data set. This data includes measurements of the growth of a sample of five orange trees at a location in California.
The response is the circumference of the tree at a particular height from the ground (often converted to “diameter at breast height”).

R include many datasets in memory. Different library carry different datasets as well. To know more about your data availability type

Plotting data

To see how age and circumference data are related use the plot(x,y) function

plot(Orange$circumference,Orange$age) # plotting the default layout
# Now add some graphical parameters
plot(Orange$circumference, Orange$age , xlab="Age", ylab="Circumference", main="Growth of five orange trees in California" ) 
## See the differences by the tree sampled
points(Orange$circumference[Orange$Tree==1], Orange$age[Orange$Tree==1], pch=1, col="black")
points(Orange$circumference[Orange$Tree==2], Orange$age[Orange$Tree==2], pch=2, col="blue")
points(Orange$circumference[Orange$Tree==3], Orange$age[Orange$Tree==3], pch=3, col="red")
points(Orange$circumference[Orange$Tree==4], Orange$age[Orange$Tree==4], pch=4, col="magenta")
points(Orange$circumference[Orange$Tree==5], Orange$age[Orange$Tree==5], pch=5, col="green")
lines(Orange$circumference[Orange$Tree==1], Orange$age[Orange$Tree==1], pch=1, col="black")
lines(Orange$circumference[Orange$Tree==2], Orange$age[Orange$Tree==2], pch=2, col="blue")
lines(Orange$circumference[Orange$Tree==3], Orange$age[Orange$Tree==3], pch=3, col="red")
lines(Orange$circumference[Orange$Tree==4], Orange$age[Orange$Tree==4], pch=4, col="magenta")
lines(Orange$circumference[Orange$Tree==5], Orange$age[Orange$Tree==5], pch=5, col="green")

Simple Linear Regression model

We will now fit a simple linear model to circumference as a function of age. To perform linear regression we create a linear model using the lm (linear model) function: Note that the variables are in the opposite order from the plot command, which is plot(x,y), whereas the linear model lm(Y~x) is read as model Age as a function of circumference


The lm command produces no output at all, but it creates orange.lm as an object
i.e. a data structure consisting of multiple parts, holding the results of a regression analysis with Age being modeled as a function of Circumference. R needs additional commands to extract the desired information about the model or display results graphically.

To get a summary of the results, enter the command


You will vizualize the following information:

lm(formula = Orange$age ~ Orange$circumference)
    Min      1Q  Median      3Q     Max
-317.88 -140.90  -17.20   96.54  471.16
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)           16.6036    78.1406   0.212    0.833
Orange$circumference   7.8160     0.6059  12.900 1.93e-14 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 203.1 on 33 degrees of freedom
Multiple R-squared: 0.8345,     Adjusted R-squared: 0.8295
F-statistic: 166.4 on 1 and 33 DF,  p-value: 1.931e-14

R sets up model objects so that the function summary “knows” that orange.lm was created by lm, and produces an appropriate summary of results for an lm object.
The table of coefficients gives the estimated regression line as age = 16.6036 + 7.8160 × circumference, and associated with each coefficient is the standard error of the estimate the t-statistic value for testing whether the coefficient is nonzero, and the p-value corresponding to the t-statistic. Below the table, the adjusted R-squared gives the estimated fraction of the variance explained by the regression line, and the p-value in the last line is an overall test for significance of the model against the null hypothesis that the response variable is independent of the predictors.

To add the regression line in the previously plotted graphic:

abline(orange.lm, cex=3, col="brown")

You can get the coefficients by using the coef function:

 (Intercept) Orange$circumference
   16.603609             7.815998

You can also “interrogate” orange.lm directly. Type names(orange.lm) to get a list of the components of orange.lm, and then use the $ symbol to extract components according to their names.

 > names(orange.lm)
   [1] "coefficients"  "residuals"     "effects"       "rank"
   [5] "fitted.values" "assign"        "qr"            "df.residual"
   [9] "xlevels"       "call"          "terms"         "model"

For more information (perhaps more than you want) about orange.lm, use str(orange.lm) (for structure). You can get the regression coefficients this way:

 (Intercept) Orange$circumference
   16.603609             7.815998
wiki/lm_and_graphicr.txt · Last modified: 2017/12/05 22:53 (external edit)