Ordinary least squares

{{Short description|Method for estimating the unknown parameters in a linear regression model}}{{Regression bar}}In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model (with fixed level-one{{clarify|reason=What does "fixed" and "level-one" mean?|date=December 2023}} effects of a linear function of a set of explanatory variables) by the principle of least squares: minimizing the sum of the squares of the differences between the observed dependent variable (values of the variable being observed) in the input dataset and the output of the (linear) function of the independent variables.

Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression surface: the smaller the differences, the better the model fits the data. The resulting estimator can be expressed by a simple formula, especially in the case of a simple linear regression, in which there is a single regressor on the right side of the regression equation.

The OLS estimator is consistent for the level-one fixed effects when the regressors are exogenous and there is no perfect multicollinearity (rank condition), and consistent for the variance estimate of the residuals when the regressors have finite fourth moments.WEB, What is a complete list of the usual assumptions for linear regression?,weblink 2022-09-28, Cross Validated, en, By the Gauss–Markov theorem, OLS is optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, the method of OLS provides minimum-variance mean-unbiased estimation when the errors have finite variances. Under the additional assumption that the errors are normally distributed with zero mean, OLS is the maximum likelihood estimator, and it outperforms any non-linear unbiased estimator.

Linear model

[[File:Okuns law quarterly differences.svg|300px|thumb|Okun's law in macroeconomics states that in an economy the GDP growth should depend linearly on the changes in the unemployment rate. Here the ordinary least squares method is used to construct the regression line describing this law.]]

Suppose the data consists of n observations \left\{\mathbf{x}_i, y_i\right\}_{i=1}^n. Each observation i includes a scalar response y_i and a column vector \mathbf{x}_i of p parameters (regressors), i.e., \mathbf{x}_i = \left[x_{i1}, x_{i2}, \dots, x_{ip}\right]^\operatorname{T}. In a linear regression model, the response variable, y_i, is a linear function of the regressors:
y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i,
or in vector form,
y_i = \mathbf{x}_i^\operatorname{T} \boldsymbol{\beta} + \varepsilon_i,
where \mathbf{x}_i, as introduced previously, is a column vector of the i-th observation of all the explanatory variables; \boldsymbol{\beta} is a p \times 1 vector of unknown parameters; and the scalar \varepsilon_i represents unobserved random variables (errors) of the i-th observation. \varepsilon_i accounts for the influences upon the responses y_i from sources other than the explanatory variables \mathbf{x}_i. This model can also be written in matrix notation as
\mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon},
where \mathbf{y} and \boldsymbol{\varepsilon} are n \times 1 vectors of the response variables and the errors of the n observations, and \mathbf{X} is an n \times p matrix of regressors, also sometimes called the design matrix, whose row i is \mathbf{x}_i^\operatorname{T} and contains the i-th observations on all the explanatory variables.

Typically, a constant term is included in the set of regressors \mathbf{X}, say, by taking x_{i1} = 1 for all i = 1, \dots, n. The coefficient \beta_1 corresponding to this regressor is called the intercept. Without the intercept, the fitted hyperplane is forced to pass through the origin \mathbf{x}_i = \vec{0}.

Regressors do not have to be statistically independent of one another for estimation to be consistent, but perfect multicollinearity (an exact linear relationship among the regressors) makes the coefficients unidentifiable; short of that, multicollinearity inflates the variance of the estimates without destroying consistency. As a concrete example where regressors are not independent, we might suspect the response depends linearly both on a value and its square, in which case we would include one regressor whose value is the square of another regressor. In that case, the model is quadratic in the second regressor, but it is nonetheless still considered a linear model because the model is still linear in the parameters (\boldsymbol{\beta}).
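As an illustration of the setup above, the following sketch (assuming NumPy; the data and variable names are made up for illustration) builds a design matrix whose first column is the constant regressor and whose third column is the square of the second, so the fitted model is quadratic in the underlying variable yet still linear in the parameters.

<syntaxhighlight lang="python">
# Minimal sketch (illustrative data): a design matrix X with an intercept
# column and a squared regressor, so the model is quadratic in z but still
# linear in the parameters beta.
import numpy as np

rng = np.random.default_rng(0)
z = rng.uniform(-2.0, 2.0, size=50)                  # a single underlying explanatory variable
X = np.column_stack([np.ones_like(z), z, z**2])      # regressors: constant, z, z^2
beta_true = np.array([1.0, -0.5, 2.0])
y = X @ beta_true + rng.normal(scale=0.3, size=z.shape)  # y = X beta + noise
print(X.shape, y.shape)                              # (50, 3) (50,)
</syntaxhighlight>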

Matrix/vector formulation

Consider an overdetermined system
\sum_{j=1}^{p} x_{ij} \beta_j = y_i, \qquad (i = 1, 2, \dots, n),
of n linear equations in p unknown coefficients, \beta_1, \beta_2, \dots, \beta_p, with n > p. This can be written in matrix form as
\mathbf{X} \boldsymbol{\beta} = \mathbf{y},
where
\mathbf{X} = \begin{bmatrix}
X_{11} & X_{12} & \cdots & X_{1p} \\
X_{21} & X_{22} & \cdots & X_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
X_{n1} & X_{n2} & \cdots & X_{np}
\end{bmatrix}, \qquad
\boldsymbol\beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}, \qquad
\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.
(Note: for a linear model as above, not all elements in \mathbf{X} contain information on the data points. The first column is populated with ones, X_{i1} = 1; only the other columns contain actual data. So here p is equal to the number of regressors plus one.)

Such a system usually has no exact solution, so the goal is instead to find the coefficients \boldsymbol{\beta} which fit the equations "best", in the sense of solving the quadratic minimization problem
\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\operatorname{arg\,min}} \, S(\boldsymbol{\beta}),
where the objective function S is given by
S(\boldsymbol{\beta}) = \sum_{i=1}^n \left| y_i - \sum_{j=1}^p X_{ij} \beta_j \right|^2 = \left\| \mathbf{y} - \mathbf{X} \boldsymbol\beta \right\|^2.
A justification for choosing this criterion is given in Properties below. This minimization problem has a unique solution, provided that the p columns of the matrix \mathbf{X} are linearly independent, given by solving the so-called normal equations:{{anchor|Normal equations}}
\left( \mathbf{X}^\operatorname{T} \mathbf{X} \right) \hat{\boldsymbol{\beta}} = \mathbf{X}^\operatorname{T} \mathbf{y}.
{{anchor|Normal matrix}} The matrix \mathbf{X}^\operatorname{T} \mathbf{X} is known as the normal matrix or Gram matrix, and the matrix \mathbf{X}^\operatorname{T} \mathbf{y} is known as the moment matrix of regressand by regressors.BOOK, Arthur S., Goldberger, Arthur Goldberger, Classical Linear Regression, Econometric Theory, New York, John Wiley & Sons, 1964, 0-471-31101-4, 158,weblink Finally, \hat{\boldsymbol{\beta}} is the coefficient vector of the least-squares hyperplane, expressed as
\hat{\boldsymbol{\beta}} = \left( \mathbf{X}^\operatorname{T} \mathbf{X} \right)^{-1} \mathbf{X}^\operatorname{T} \mathbf{y},
or, equivalently,
\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + \left(\mathbf{X}^\operatorname{T} \mathbf{X}\right)^{-1} \mathbf{X}^\operatorname{T} \boldsymbol{\varepsilon}.
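The closed-form solution above translates directly into code. A minimal sketch, assuming NumPy; the helper name ols_fit and the test data are illustrative, not from the article. It solves the normal equations with a linear solve rather than an explicit inverse, and checks the result against NumPy's own least-squares routine.

<syntaxhighlight lang="python">
import numpy as np

def ols_fit(X, y):
    """Solve the normal equations (X^T X) beta = X^T y for the OLS coefficients.

    Direct translation of the formula above; in practice a QR- or SVD-based
    routine such as np.linalg.lstsq is preferred for numerical stability.
    """
    XtX = X.T @ X
    Xty = X.T @ y
    return np.linalg.solve(XtX, Xty)   # avoids forming the explicit inverse

# Illustrative check against NumPy's least-squares solver
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([2.0, 1.0, -3.0]) + rng.normal(size=100)
beta_hat = ols_fit(X, y)
assert np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0])
</syntaxhighlight>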

Estimation

Suppose b is a "candidate" value for the parameter vector β. The quantity {{math|yi − xiTb}}, called the residual for the i-th observation, measures the vertical distance between the data point {{math|(xi, yi)}} and the hyperplane {{math|1=y = xTb}}, and thus assesses the degree of fit between the actual data and the model. The sum of squared residuals (SSR) (also called the error sum of squares (ESS) or residual sum of squares (RSS))BOOK, Hayashi, Fumio, Fumio Hayashi, Econometrics, Princeton University Press, 2000, 15, is a measure of the overall model fit:
S(b) = \sum_{i=1}^n \left(y_i - x_i^\operatorname{T} b\right)^2 = (y - Xb)^\operatorname{T}(y - Xb),
where T denotes the matrix transpose, and the rows of X, denoting the values of all the independent variables associated with a particular value of the dependent variable, are Xi = xiT. The value of b which minimizes this sum is called the OLS estimator for β. The function S(b) is quadratic in b with positive-definite Hessian, and therefore this function possesses a unique global minimum at b = \hat\beta, which can be given by the explicit formula:{{harvtxt|Hayashi|2000|loc=page 18}}[proof]
\hat\beta = \operatorname{arg\,min}_{b \in \mathbb{R}^p} S(b) = \left(X^\operatorname{T} X\right)^{-1} X^\operatorname{T} y.
The product N = XTX is a Gram matrix, and its inverse, Q = N−1, is the cofactor matrix of β,BOOK,weblink Adjustment Computations: Spatial Data Analysis, 9780471697282, Ghilani, Charles D., Wolf, Paul R., 12 June 2006, BOOK,weblink GNSS – Global Navigation Satellite Systems: GPS, GLONASS, Galileo, and more, 9783211730171, Hofmann-Wellenhof, Bernhard, Lichtenegger, Herbert, Wasle, Elmar, 20 November 2007, BOOK,weblink GPS: Theory, Algorithms and Applications, 9783540727156, Xu, Guochang, 5 October 2007, closely related to its covariance matrix, Cβ. The matrix (XTX)−1XT = QXT is called the Moore–Penrose pseudoinverse matrix of X. This formulation highlights the point that estimation can be carried out if, and only if, there is no perfect multicollinearity between the explanatory variables (which would cause the Gram matrix to have no inverse).

After we have estimated β, the fitted values (or predicted values) from the regression will be
\hat{y} = X \hat\beta = P y,
where P = X(XTX)−1XT is the projection matrix onto the space V spanned by the columns of X. This matrix P is also sometimes called the hat matrix because it "puts a hat" onto the variable y. Another matrix, closely related to P, is the annihilator matrix {{math|1=M = In − P}}; this is a projection matrix onto the space orthogonal to V. Both matrices P and M are symmetric and idempotent (meaning that {{math|1=P2 = P}} and {{math|1=M2 = M}}), and relate to the data matrix X via the identities {{math|1=PX = X}} and {{math|1=MX = 0}}.{{harvtxt|Hayashi|2000|loc=page 19}} The matrix M creates the residuals from the regression:
\hat\varepsilon = y - \hat{y} = y - X\hat\beta = My = M(X\beta + \varepsilon) = (MX)\beta + M\varepsilon = M\varepsilon.
{{anchor|Reduced chi-squared}}Using these residuals we can estimate the value of σ2 using the reduced chi-squared statistic:
s^2 = \frac{\hat\varepsilon^\mathrm{T} \hat\varepsilon}{n-p} = \frac{(My)^\mathrm{T} My}{n-p} = \frac{y^\mathrm{T} M^\mathrm{T} M y}{n-p} = \frac{y^\mathrm{T} M y}{n-p} = \frac{S(\hat\beta)}{n-p}, \qquad
\hat\sigma^2 = \frac{n-p}{n}\; s^2
The denominator, n − p, is the statistical degrees of freedom. The first quantity, s2, is the OLS estimate for σ2, whereas the second, \hat\sigma^2, is the MLE estimate for σ2. The two estimators are quite similar in large samples; the first is always unbiased, while the second is biased but has a smaller mean squared error. In practice s2 is used more often, since it is more convenient for hypothesis testing. The square root of s2 is called the regression standard error,Julian Faraway (2000), Practical Regression and Anova using R, standard error of the regression,BOOK, Kenney, J., Keeping, E. S., 1963, Mathematics of Statistics, van Nostrand, 187, BOOK, Zwillinger, D., 1995, Standard Mathematical Tables and Formulae, Chapman&Hall/CRC, 0-8493-2479-3, 626, or standard error of the equation.

It is common to assess the goodness-of-fit of the OLS regression by comparing how much of the initial variation in the sample can be reduced by regressing onto X. The coefficient of determination R2 is defined as the ratio of the "explained" variance to the "total" variance of the dependent variable y, in the case where the total sum of squares decomposes into the regression sum of squares plus the sum of squared residuals:{{harvtxt|Hayashi|2000|loc=page 20}}
R^2 = \frac{\sum (\hat{y}_i - \overline{y})^2}{\sum (y_i - \overline{y})^2} = \frac{y^\mathrm{T} P^\mathrm{T} L P y}{y^\mathrm{T} L y} = 1 - \frac{y^\mathrm{T} M y}{y^\mathrm{T} L y} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}
where TSS is the total sum of squares for the dependent variable, L = I_n - \frac{1}{n} J_n, and J_n is an n×n matrix of ones. (L is a centering matrix which is equivalent to regression on a constant; it simply subtracts the mean from a variable.) In order for R2 to be meaningful, the matrix X of data on regressors must contain a column vector of ones to represent the constant whose coefficient is the regression intercept. In that case, R2 will always be a number between 0 and 1, with values close to 1 indicating a good degree of fit.

The variance in the prediction of the independent variable as a function of the dependent variable is given in the article Polynomial least squares.
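A compact computational sketch of the quantities just defined (fitted values, residuals, s2 and R2), assuming NumPy and a design matrix that already includes a column of ones; the helper name ols_summary is illustrative.

<syntaxhighlight lang="python">
import numpy as np

def ols_summary(X, y):
    """Fitted values, residuals, s^2 and R^2 for an OLS fit.

    X must include a column of ones for the intercept, as required for R^2
    to be meaningful.
    """
    n, p = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    y_hat = X @ beta_hat                # P y, the projection of y onto col(X)
    resid = y - y_hat                   # M y, the annihilated part
    s2 = resid @ resid / (n - p)        # unbiased estimate of sigma^2
    tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - (resid @ resid) / tss    # R^2 = 1 - RSS/TSS
    return beta_hat, y_hat, s2, r2
</syntaxhighlight>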

Simple linear regression model

If the data matrix X contains only two variables, a constant and a scalar regressor xi, then this is called the "simple regression model". This case is often considered in beginner statistics classes, as it provides much simpler formulas, suitable even for manual calculation. The parameters are commonly denoted as {{math|(α, β)}}:
y_i = \alpha + \beta x_i + \varepsilon_i.
The least squares estimates in this case are given by simple formulas
\begin{align}
\widehat\beta &= \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \\[2pt]
\widehat\alpha &= \bar{y} - \widehat\beta \, \bar{x}.
\end{align}
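These two formulas translate directly into code. A minimal sketch, assuming NumPy and synthetic data; the helper simple_ols is illustrative, not a standard library function.

<syntaxhighlight lang="python">
import numpy as np

def simple_ols(x, y):
    """Closed-form intercept and slope for the simple regression y = a + b x + e."""
    x_bar, y_bar = x.mean(), y.mean()
    b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    a = y_bar - b * x_bar
    return a, b

# Illustrative check against np.polyfit
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 + 1.5 * x + rng.normal(size=200)
a, b = simple_ols(x, y)
slope, intercept = np.polyfit(x, y, 1)
assert np.allclose([a, b], [intercept, slope])
</syntaxhighlight>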

Alternative derivations

In the previous section the least squares estimator \hat\beta was obtained as a value that minimizes the sum of squared residuals of the model. However, it is also possible to derive the same estimator from other approaches. In all cases the formula for the OLS estimator remains the same: {{math|1=β̂ = (XTX)−1XTy}}; the only difference is in how we interpret this result.

Projection

[[File:OLS geometric interpretation.svg|thumb|250px|OLS estimation can be viewed as a projection onto the linear space spanned by the regressors. (Here each of X_1 and X_2 refers to a column of the data matrix.)]]{{cleanup merge|21=section|Linear least squares}}For mathematicians, OLS is an approximate solution to an overdetermined system of linear equations {{math|Xβ ≈ y}}, where β is the unknown. Assuming the system cannot be solved exactly (the number of equations n is much larger than the number of unknowns p), we are looking for a solution that provides the smallest discrepancy between the right- and left-hand sides. In other words, we are looking for the solution that satisfies
\hat\beta = \operatorname{arg}\min_\beta \, \lVert \mathbf{y} - \mathbf{X}\boldsymbol\beta \rVert,
where {{math|{{norm|·}}}} is the standard L2 norm in the n-dimensional Euclidean space Rn. The predicted quantity Xβ is just a certain linear combination of the vectors of regressors. Thus, the residual vector {{math|y − Xβ}} will have the smallest length when y is projected orthogonally onto the linear subspace spanned by the columns of X. The OLS estimator \hat\beta in this case can be interpreted as the coefficients of the vector decomposition of {{math|1=ŷ = Py}} along the basis of X.

In other words, the gradient equations at the minimum can be written as:
(\mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}})^{\top} \mathbf{X} = 0.
A geometrical interpretation of these equations is that the vector of residuals, \mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}}, is orthogonal to the column space of \mathbf{X}, since the dot product (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) \cdot \mathbf{X}\mathbf{v} is equal to zero for any conformable vector \mathbf{v}. This means that \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}} is the shortest of all possible vectors \mathbf{y} - \mathbf{X}\boldsymbol\beta, that is, the variance of the residuals is the minimum possible. This is illustrated at the right.

Introducing \hat{\boldsymbol{\gamma}} and a matrix K, with the assumption that the matrix [\mathbf{X}\ \mathbf{K}] is non-singular and K^\operatorname{T} X = 0 (cf. Orthogonal projections), the residual vector should satisfy the following equation:
\hat{\mathbf{r}} := \mathbf{y} - \mathbf{X} \hat{\boldsymbol{\beta}} = \mathbf{K} \hat{\boldsymbol{\gamma}}.
The equation and solution of linear least squares are thus described as follows:
\begin{align}
\mathbf{y} &= \begin{bmatrix} \mathbf{X} & \mathbf{K} \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{\gamma}} \end{bmatrix}, \\
\Rightarrow \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{\gamma}} \end{bmatrix} &= \begin{bmatrix} \mathbf{X} & \mathbf{K} \end{bmatrix}^{-1} \mathbf{y} = \begin{bmatrix} \left(\mathbf{X}^{\top} \mathbf{X}\right)^{-1} \mathbf{X}^{\top} \\ \left(\mathbf{K}^{\top} \mathbf{K}\right)^{-1} \mathbf{K}^{\top} \end{bmatrix} \mathbf{y}.
\end{align}

Another way of looking at it is to consider the regression line to be a weighted average of the lines passing through the combination of any two points in the dataset.WEB, Akbarzadeh, Vahab, Line Estimation, 7 May 2014,weblink Although this way of calculation is more computationally expensive, it provides better intuition for OLS.
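The orthogonality of the residual vector to the column space of X, and the defining properties of the projection matrices P and M, can be verified numerically. A small NumPy sketch with synthetic data:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = rng.normal(size=30)

P = X @ np.linalg.inv(X.T @ X) @ X.T   # projection onto the column space of X
M = np.eye(30) - P                     # annihilator: projection onto the orthogonal complement
resid = M @ y                          # residual vector y - X beta_hat

assert np.allclose(P @ P, P) and np.allclose(P, P.T)  # idempotent and symmetric
assert np.allclose(X.T @ resid, 0)                    # residuals orthogonal to col(X)
</syntaxhighlight>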

Maximum likelihood

The OLS estimator is identical to the maximum likelihood estimator (MLE) under the normality assumption for the error terms.{{harvtxt|Hayashi|2000|loc=page 49}}[proof] This normality assumption has historical importance, as it provided the basis for the early work in linear regression analysis by Yule and Pearson.{{Citation needed|date=February 2010}} From the properties of MLE, we can infer that the OLS estimator is asymptotically efficient (in the sense of attaining the Cramér–Rao bound for variance) if the normality assumption is satisfied.{{harvtxt|Hayashi|2000|loc=page 52}}

Generalized method of moments

In the iid case, the OLS estimator can also be viewed as a GMM estimator arising from the moment conditions
\mathrm{E}\big[\, x_i \left(y_i - x_i^\operatorname{T} \beta\right) \,\big] = 0.
These moment conditions state that the regressors should be uncorrelated with the errors. Since xi is a p-vector, the number of moment conditions is equal to the dimension of the parameter vector β, and thus the system is exactly identified. This is the so-called classical GMM case, when the estimator does not depend on the choice of the weighting matrix.

Note that the original strict exogeneity assumption {{math|E[εi {{!}} xi] {{=}} 0}} implies a far richer set of moment conditions than stated above. In particular, this assumption implies that for any vector-function {{math|ƒ}}, the moment condition {{math|E[ƒ(xi)·εi] {{=}} 0}} will hold. However, it can be shown using the Gauss–Markov theorem that the optimal choice of function {{math|ƒ}} is to take {{math|ƒ(x) {{=}} x}}, which results in the moment equation posted above.

Properties

Assumptions

{{see also|Linear regression#Assumptions}}There are several different frameworks in which the linear regression model can be cast in order to make the OLS technique applicable. Each of these settings produces the same formulas and the same results. The only difference is the interpretation and the assumptions which have to be imposed in order for the method to give meaningful results. The choice of the applicable framework depends mostly on the nature of the data at hand, and on the inference task which has to be performed.

One of the lines of difference in interpretation is whether to treat the regressors as random variables, or as predefined constants. In the first case (random design) the regressors xi are random and sampled together with the yi{{'}}s from some population, as in an observational study. This approach allows for a more natural study of the asymptotic properties of the estimators. In the other interpretation (fixed design), the regressors X are treated as known constants set by a design, and y is sampled conditionally on the values of X as in an experiment. For practical purposes, this distinction is often unimportant, since estimation and inference are carried out while conditioning on X. All results stated in this article are within the random design framework.

Classical linear regression model

The classical model focuses on "finite sample" estimation and inference, meaning that the number of observations n is fixed. This contrasts with the other approaches, which study the asymptotic behavior of OLS as the number of observations grows to infinity.
  • Correct specification. The linear functional form must coincide with the form of the actual data-generating process.
  • Strict exogeneity. The errors in the regression should have conditional mean zero:{{harvtxt|Hayashi|2000|loc=page 7}} \operatorname{E}[\,\varepsilon \mid X\,] = 0. The immediate consequence of the exogeneity assumption is that the errors have mean zero: {{Math|1=E[ε] = 0}} (by the law of total expectation), and that the regressors are uncorrelated with the errors: {{math|1=E[XTε] = 0}}. {{paragraph}}The exogeneity assumption is critical for the OLS theory. If it holds then the regressor variables are called exogenous. If it does not, then those regressors that are correlated with the error term are called endogenous,{{harvtxt|Hayashi|2000|loc=page 187}} and the OLS estimator becomes biased. In such a case the method of instrumental variables may be used to carry out inference.
  • No linear dependence. The regressors in X must all be linearly independent. Mathematically, this means that the matrix X must have full column rank almost surely:{{harvtxt|Hayashi|2000|loc=page 10}} \Pr\!\big[\,\operatorname{rank}(X) = p\,\big] = 1. Usually, it is also assumed that the regressors have finite moments up to at least the second moment. Then the matrix {{math|1=Qxx = E[XTX / n]}} is finite and positive semi-definite. {{paragraph}}When this assumption is violated the regressors are called linearly dependent or perfectly multicollinear. In such a case the value of the regression coefficient β cannot be learned, although prediction of y values is still possible for new values of the regressors that lie in the same linearly dependent subspace.
  • Spherical errors: \operatorname{Var}[\,\varepsilon \mid X\,] = \sigma^2 I_n, where {{mvar|In}} is the identity matrix in dimension n, and σ2 is a parameter which determines the variance of each observation. This σ2 is considered a nuisance parameter in the model, although usually it is also estimated. If this assumption is violated then the OLS estimates are still valid, but no longer efficient. {{paragraph}}It is customary to split this assumption into two parts:
    • Homoscedasticity: {{math|1=E[ εi2 {{!}} X ] = σ2}}, which means that the error term has the same variance σ2 in each observation. When this requirement is violated, this is called heteroscedasticity; in such a case a more efficient estimator would be weighted least squares. If the errors have infinite variance then the OLS estimates will also have infinite variance (although by the law of large numbers they will nonetheless tend toward the true values so long as the errors have zero mean). In this case, robust estimation techniques are recommended.
    • No autocorrelation: the errors are uncorrelated between observations: {{math|1=E[ εiεj {{!}} X ] = 0}} for {{math|i ≠ j}}. This assumption may be violated in the context of time series data, panel data, cluster samples, hierarchical data, repeated measures data, longitudinal data, and other data with dependencies. In such cases generalized least squares provides a better alternative than the OLS. Another expression for autocorrelation is serial correlation.
  • Normality. It is sometimes additionally assumed that the errors have a normal distribution conditional on the regressors:{{harvtxt|Hayashi|2000|loc=page 34}} \varepsilon \mid X \sim \mathcal{N}(0, \sigma^2 I_n). This assumption is not needed for the validity of the OLS method, although certain additional finite-sample properties can be established when it holds (especially in the area of hypothesis testing). Also when the errors are normal, the OLS estimator is equivalent to the maximum likelihood estimator (MLE), and therefore it is asymptotically efficient in the class of all regular estimators. Importantly, the normality assumption applies only to the error terms; contrary to a popular misconception, the response (dependent) variable is not required to be normally distributed.JOURNAL, Williams, M. N, Grajales, C. A. G, Kurkiewicz, D, Assumptions of multiple regression: Correcting two misconceptions, Practical Assessment, Research & Evaluation, 2013, 18, 11,weblink

Independent and identically distributed (iid)

In some applications, especially with cross-sectional data, an additional assumption is imposed: that all observations are independent and identically distributed. This means that all observations are taken from a random sample, which makes all the assumptions listed earlier simpler and easier to interpret. This framework also allows one to state asymptotic results (as the sample size {{math|n → ∞}}), which are understood as the theoretical possibility of fetching new independent observations from the data generating process. The list of assumptions in this case is:
  • iid observations: (xi, yi) is independent from, and has the same distribution as, (xj, yj) for all {{nowrap|i ≠ j}};
  • no perfect multicollinearity: {{Math|1=Qxx = E[ xi xiT ]}} is a positive-definite matrix;
  • exogeneity: {{Math|1=E[ εi {{!}} xi ] = 0;}}
  • homoscedasticity: {{Math|1=Var[ εi {{!}} xi ] = σ2}}.

Time series model

  • The stochastic process {xi, yi} is stationary and ergodic; if {xi, yi} is nonstationary, OLS results are often spurious unless {xi, yi} is co-integrating.WEB, Memento on EViews Output,weblink 28 December 2020,
  • The regressors are predetermined: E[xiεi] = 0 for all i = 1, ..., n;
  • The p×p matrix {{math|1=Qxx = E[ xi xiT ]}} is of full rank, and hence positive-definite;
  • {xiεi} is a martingale difference sequence, with a finite matrix of second moments {{math|1=Qxxε² = E[ εi2xi xiT ]}}.

Finite sample properties

First of all, under the strict exogeneity assumption the OLS estimators \hat\beta and s2 are unbiased, meaning that their expected values coincide with the true values of the parameters:{{harvtxt|Hayashi|2000|loc=pages 27, 30}}[proof]
\operatorname{E}[\, \hat\beta \mid X \,] = \beta, \quad \operatorname{E}[\, s^2 \mid X \,] = \sigma^2.
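Unbiasedness of \hat\beta and s2 can be illustrated with a small Monte Carlo experiment under a fixed design. The simulation below (NumPy; the sample size, coefficients, and noise level are chosen purely for illustration) averages the estimates over repeated draws of the errors.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 50, 3, 1.5
beta = np.array([1.0, -2.0, 0.5])
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # fixed design, held constant

beta_hats, s2s = [], []
for _ in range(10_000):
    y = X @ beta + rng.normal(scale=sigma, size=n)   # new error draw each replication
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    beta_hats.append(b)
    s2s.append(resid @ resid / (n - p))

print(np.mean(beta_hats, axis=0))  # close to beta = [1.0, -2.0, 0.5]
print(np.mean(s2s))                # close to sigma^2 = 2.25
</syntaxhighlight>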
If the strict exogeneity does not hold (as is the case with many time series models, where exogeneity is assumed only with respect to the past shocks but not the future ones), then these estimators will be biased in finite samples.

{{anchor|Covariance matrix}}The variance-covariance matrix (or simply covariance matrix) of \hat\beta is equal to{{harvtxt|Hayashi|2000|loc=page 27}}
\operatorname{Var}[\, \hat\beta \mid X \,] = \sigma^2 \left(X^\operatorname{T} X\right)^{-1} = \sigma^2 Q.
In particular, the standard error of each coefficient \hat\beta_j is equal to the square root of the j-th diagonal element of this matrix. The estimate of this standard error is obtained by replacing the unknown quantity σ2 with its estimate s2. Thus,
\widehat{\operatorname{s.e.}}(\hat\beta_j) = \sqrt{s^2 \left(X^\operatorname{T} X\right)^{-1}_{jj}}
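A minimal sketch of these classical (homoscedastic) standard errors, assuming NumPy; the helper name ols_standard_errors and the test data are illustrative.

<syntaxhighlight lang="python">
import numpy as np

def ols_standard_errors(X, y):
    """Classical standard errors: square roots of the diagonal of s^2 (X^T X)^-1."""
    n, p = X.shape
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - p)            # unbiased estimate of sigma^2
    cov_beta = s2 * np.linalg.inv(X.T @ X)  # estimated covariance matrix of beta_hat
    return beta_hat, np.sqrt(np.diag(cov_beta))

# Illustrative usage with synthetic data
rng = np.random.default_rng(6)
X = np.column_stack([np.ones(80), rng.normal(size=(80, 2))])
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=80)
beta_hat, se = ols_standard_errors(X, y)
</syntaxhighlight>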
It can also be easily shown that the estimator \hat\beta is uncorrelated with the residuals from the model:
\operatorname{Cov}[\, \hat\beta, \hat\varepsilon \mid X \,] = 0.
The Gauss–Markov theorem states that under the spherical errors assumption (that is, the errors should be uncorrelated and homoscedastic) the estimator \hat\beta is efficient in the class of linear unbiased estimators. This is called the best linear unbiased estimator (BLUE). Efficiency should be understood as follows: if we were to find some other estimator \tilde\beta which is linear in y and unbiased, then
\operatorname{Var}[\, \tilde\beta \mid X \,] - \operatorname{Var}[\, \hat\beta \mid X \,] \geq 0
in the sense that this is a nonnegative-definite matrix. This theorem establishes optimality only in the class of linear unbiased estimators, which is quite restrictive. Depending on the distribution of the error terms ε, other, non-linear estimators may provide better results than OLS.

Assuming normality

The properties listed so far are all valid regardless of the underlying distribution of the error terms. However, if one is willing to assume that the normality assumption holds (that is, that {{math|ε ~ N(0, σ2In)}}), then additional properties of the OLS estimators can be stated.

The estimator \hat\beta is normally distributed, with mean and variance as given before:BOOK, Takeshi, Amemiya, Takeshi Amemiya, Advanced Econometrics,weblink registration, Harvard University Press, 1985, 13, 9780674005600,
\hat\beta \sim \mathcal{N}\big(\beta,\ \sigma^2 (X^\mathrm{T} X)^{-1}\big).
This estimator reaches the Cramér–Rao bound for the model, and thus is optimal in the class of all unbiased estimators. Note that unlike the Gauss–Markov theorem, this result establishes optimality among both linear and non-linear estimators, but only in the case of normally distributed error terms.

The estimator s2 will be proportional to a chi-squared distribution:{{harvtxt|Amemiya|1985|loc=page 14}}
s^2 \sim \frac{\sigma^2}{n-p} \cdot \chi^2_{n-p}
The variance of this estimator is equal to {{math|2σ4/(n − p)}}, which does not attain the Cramér–Rao bound of {{math|2σ4/n}}. However, it was shown that there are no unbiased estimators of σ2 with variance smaller than that of the estimator s2.BOOK, C. R., Rao, C. R. Rao, Linear Statistical Inference and its Applications, New York, J. Wiley & Sons, 1973, Second, 319, 0-471-70823-2, If we are willing to allow biased estimators, and consider the class of estimators that are proportional to the sum of squared residuals (SSR) of the model, then the best (in the sense of the mean squared error) estimator in this class will be {{math|1=~σ2 = SSR / (n − p + 2)}}, which even beats the Cramér–Rao bound in the case when there is only one regressor ({{nowrap|1=p = 1}}).{{harvtxt|Amemiya|1985|loc=page 20}}

Moreover, the estimators \hat\beta and s2 are independent,{{harvtxt|Amemiya|1985|loc=page 27}} a fact which comes in useful when constructing the t- and F-tests for the regression.

Influential observations

{{see also|Leverage (statistics)}}As was mentioned before, the estimator \hat\beta is linear in y, meaning that it represents a linear combination of the dependent variables yi. The weights in this linear combination are functions of the regressors X, and generally are unequal. The observations with high weights are called influential because they have a more pronounced effect on the value of the estimator.

To analyze which observations are influential we remove a specific j-th observation and consider how much the estimated quantities are going to change (similarly to the jackknife method). It can be shown that the change in the OLS estimator for β will be equal toBOOK, Davidson, Russell, MacKinnon, James G., Estimation and Inference in Econometrics, New York, Oxford University Press, 1993, 0-19-506011-3, 33,
\hat\beta^{(j)} - \hat\beta = -\frac{1}{1-h_j} \left(X^\mathrm{T} X\right)^{-1} x_j \hat\varepsilon_j,
where {{math|1=hj = xjT (XTX)−1xj}} is the j-th diagonal element of the hat matrix P, and xj is the (column) vector of regressors corresponding to the j-th observation. Similarly, the change in the predicted value for the j-th observation resulting from omitting that observation from the dataset will be equal to
\hat{y}_j^{(j)} - \hat{y}_j = x_j^\mathrm{T} \hat\beta^{(j)} - x_j^\mathrm{T} \hat\beta = -\frac{h_j}{1-h_j} \, \hat\varepsilon_j
From the properties of the hat matrix, {{math|0 ≤ hj ≤ 1}}, and they sum up to p, so that on average {{math|hj ≈ p/n}}. These quantities hj are called the leverages, and observations with high hj are called leverage points.{{harvtxt|Davidson|MacKinnon|1993|loc=page 36}} Usually the observations with high leverage ought to be scrutinized more carefully, in case they are erroneous, or outliers, or in some other way atypical of the rest of the dataset.
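The leverages and the leave-one-out identity above can be checked numerically. A NumPy sketch with synthetic data; it verifies that the leverages sum to p and that dropping the highest-leverage observation changes its fitted value by exactly −hj/(1 − hj) times its residual.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(4)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
# Leverages: diagonal elements h_j = x_j^T (X^T X)^-1 x_j of the hat matrix
h = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)

assert np.isclose(h.sum(), X.shape[1])          # leverages sum to p

# Predicted change in the j-th fitted value when observation j is dropped
j = int(np.argmax(h))
predicted_change = -h[j] / (1 - h[j]) * resid[j]

mask = np.arange(n) != j
beta_drop = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
actual_change = X[j] @ beta_drop - X[j] @ beta_hat
assert np.isclose(predicted_change, actual_change)
</syntaxhighlight>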

Partitioned regression

Sometimes the variables and corresponding parameters in the regression can be logically split into two groups, so that the regression takes form
y = X_1 \beta_1 + X_2 \beta_2 + \varepsilon,
where X1 and X2 have dimensions n×p1, n×p2, and β1, β2 are p1×1 and p2×1 vectors, with {{math|1=p1 + p2 = p}}.

The Frisch–Waugh–Lovell theorem states that in this regression the residuals \hat\varepsilon and the OLS estimate \hat\beta_2 will be numerically identical to the residuals and the OLS estimate for β2 in the following regression:{{harvtxt|Davidson|MacKinnon|1993|loc=page 20}}
M_1 y = M_1 X_2 \beta_2 + \eta,
where M1 is the annihilator matrix for the regressors X1.

The theorem can be used to establish a number of theoretical results. For example, having a regression with a constant and another regressor is equivalent to subtracting the means from the dependent variable and the regressor and then running the regression for the de-meaned variables but without the constant term.
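The Frisch–Waugh–Lovell theorem is easy to verify numerically: partialling X1 out of both y and X2 and regressing the residuals reproduces the β2 estimate from the full regression. A NumPy sketch with synthetic data:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(5)
n = 200
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])  # first group (includes the constant)
X2 = rng.normal(size=(n, 2))                            # second group
y = X1 @ np.array([1.0, 2.0]) + X2 @ np.array([-1.0, 0.5]) + rng.normal(size=n)

# Full regression of y on [X1, X2]
X = np.hstack([X1, X2])
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
beta2_full = beta_full[X1.shape[1]:]

# FWL: annihilate X1 from both y and X2, then regress the residuals
M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T
beta2_fwl = np.linalg.lstsq(M1 @ X2, M1 @ y, rcond=None)[0]

assert np.allclose(beta2_full, beta2_fwl)
</syntaxhighlight>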

Constrained estimation

Suppose it is known that the coefficients in the regression satisfy a system of linear equations
A\colon\quad Q^\operatorname{T} \beta = c,
where Q is a p×q matrix of full rank, and c is a q×1 vector of known constants, where {{nowrap|''q 
