The scientific method is frequently used as a guided approach to learning. Linear statistical methods are widely used as part of this learning process.

Linear models describe a continuous response variable as a function of one or more predictor variables. They can help you understand and predict the behavior of complex systems or analyze experimental, financial, and biological data. Linear regression is a statistical method used to create a linear model.

#### Mathematical Modelling:

Let $Y$ be the dependent variable of dimension $n \times 1$, $X_{n \times p}$ be the matrix of independent variables, and $\theta_{p \times 1}$ be the unknown parameters; then the linear model can be written as:

$$Y_{n \times 1} = X_{n \times p}\,\theta_{p \times 1} + \epsilon_{n \times 1}$$

where $E(\epsilon) = 0$ and $D(\epsilon) = \sigma^2 I_n$ ($D(\cdot)$ denoting dispersion).

#### Usage:

- **Prediction:** Estimates of the individual parameters are of less importance for prediction than the overall influence of the $x$ variables on $y$. However, good estimates are needed to achieve good prediction performance.
- **Data Description or Explanation:** The scientist or engineer uses the estimated model to summarize or describe the observed data.
- **Parameter Estimation:** The values of the estimated parameters may have theoretical implications for a postulated model.
- **Variable Selection or Screening:** The emphasis is on determining the importance of each predictor variable in modeling the variation in $y$. The predictors that are associated with an important amount of variation in $y$ are retained; those that contribute little are deleted.
- **Control of Output:** A cause-and-effect relationship between $y$ and the $x$ variables is assumed. The estimated model might then be used to control the output of a process by varying the inputs. By systematic experimentation, it may be possible to achieve the optimal output.

#### Classification:

Linear models can mainly be classified into three types:

- **Simple linear regression:** models using only one predictor
- **Multiple linear regression:** models using multiple predictors
- **Multivariate linear regression:** models for multiple response variables
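For instance, simple linear regression is just the matrix form above with $p = 2$, the first column of the design matrix being all ones (the intercept):

```latex
y_i = \theta_0 + \theta_1 x_i + \epsilon_i,
\qquad
\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}
\begin{pmatrix} \theta_0 \\ \theta_1 \end{pmatrix}
+
\begin{pmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{pmatrix}
```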

#### Parameter Estimation:

We apply the method of Least Squares to estimate the parameter $\theta$, which involves minimizing the error sum of squares $L$, given by:

$$L = (y - X\theta)'(y - X\theta) = \sum_{i=1}^n \Big(y_i - \sum_j x_{ij}\theta_j\Big)^2$$

Differentiating $L$ w.r.t. $\theta$ and equating the derivative to $0$, we obtain the following set of linear equations, also called the **Normal Equations**:

$$X'X\hat{\theta} = X'y$$

where $\hat{\theta}$ is an estimator of $\theta$, referred to as the least squares estimate.
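As a quick numerical check — a sketch in Python/NumPy rather than the R used later in the post, with an arbitrary simulated design matrix — solving the normal equations on noiseless data recovers the generating coefficients exactly:

```python
import numpy as np

# Hypothetical small design: intercept column plus two random predictors
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
theta_true = np.array([1.0, 2.0, -3.0])
y = X @ theta_true  # noiseless response, so the fit should be exact

# Solve the normal equations X'X theta = X'y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(theta_hat)  # recovers [1, 2, -3]
```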

Predicted values are $\hat{y} = X\hat{\theta} = Hy$, where $H = X(X'X)^{-1}X'$ is the hat matrix, which is symmetric and idempotent, i.e. $H^2 = H$.
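The properties of the hat matrix are easy to verify numerically (again a NumPy sketch, with an arbitrary full-rank design matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))             # arbitrary full-rank design matrix
H = X @ np.linalg.inv(X.T @ X) @ X.T     # hat matrix H = X (X'X)^{-1} X'

print(np.allclose(H @ H, H))   # idempotent: H^2 = H  -> True
print(np.allclose(H.T, H))     # symmetric:  H' = H   -> True
```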

**Exercise**: Check that the normal equations are consistent (i.e., admit a solution) whatever be the rank of $X$.

(Hint: $X'y \in C(X') = C(X'X)$, where $C(X)$ means the column space of $X$.)

Now, suppose $X \sim N(0, I_n)$. Then $X'AX \sim \chi^2_k$ iff $A$ is idempotent, where $k = \mathrm{rank}(A)$.

Now, assuming $\epsilon \sim N(0, \sigma^2 I_n)$, we may compute the MLE (Maximum Likelihood Estimator) of $\theta$ and $\sigma^2$. After some straightforward calculations, we arrive at the following estimates:

$$\hat{\theta} = (X'X)^{-1}X'y, \qquad \hat{\sigma}^2 = \frac{(y - X\hat{\theta})'(y - X\hat{\theta})}{n},$$

assuming the rank of $X$ is $p$. (Otherwise, we can use the Generalized Inverse, but let's not go into that in this post.)
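These estimates can be checked numerically — a Python/NumPy sketch on simulated data (the seed, sample size, and true parameter values are arbitrary). Note that the MLE of $\sigma^2$ divides by $n$ and is therefore slightly smaller than the usual unbiased estimate, which divides by $n - p$:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
theta = np.array([1.0, 0.5, -2.0])
sigma = 1.5
y = X @ theta + rng.normal(scale=sigma, size=n)

# The MLE of theta coincides with the least squares estimate
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ theta_hat

sigma2_mle = resid @ resid / n            # MLE: divide by n
sigma2_unbiased = resid @ resid / (n - p) # unbiased: divide by n - p
print(theta_hat, sigma2_mle, sigma2_unbiased)
```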

Now, we can comment on the distributions of the estimates obtained.
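For reference, the standard results (under $\epsilon \sim N(0, \sigma^2 I_n)$ and $\mathrm{rank}(X) = p$) are:

```latex
\hat{\theta} \sim N_p\!\left(\theta,\ \sigma^2 (X'X)^{-1}\right),
\qquad
\frac{n\hat{\sigma}^2}{\sigma^2}
= \frac{(y - X\hat{\theta})'(y - X\hat{\theta})}{\sigma^2}
\sim \chi^2_{n-p},
```

with $\hat{\theta}$ and $\hat{\sigma}^2$ independent.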

#### R codes

Here are some useful commands for R software:

```r
# Multiple linear regression example
fit <- lm(y ~ x1 + x2 + x3, data = mydata)  # fit the model
summary(fit)                                # show results
```

Other useful functions:

```r
coefficients(fit)           # model coefficients
confint(fit, level = 0.95)  # confidence intervals for model parameters
fitted(fit)                 # predicted values
residuals(fit)              # residuals
anova(fit)                  # anova table
vcov(fit)                   # covariance matrix for model parameters
influence(fit)              # regression diagnostics
```

No topic in Statistics is fully understood until it is applied to some real data, so readers should try applying this method to a real dataset for complete comprehension. You can use R, Matlab, or Python, whichever suits you better.

You may find datasets in the UCI Database.