TLinearRegression Class

This article gives the source code for a nice, clean implementation of linear regression by the linear, least-squares error algorithm. (This article originally appeared in the March 17, 2000 issue of The Unofficial Newsletter of Delphi Users.)

The class provides properties returning the average for both the independent and dependent variables, as well as the variance, covariance, and correlation. A function method is provided to interpolate along the line defined by the computed slope and intercept. The code is well documented, giving the basic theoretical outline and some notes as to what the various methods are doing. An example project is also included.

{ ****************************************************************** }
{                                                                    }
{  Class implementing linear regression by linear-least-squares      }
{                                                                    }
{  Copyright © 2000, Berend M. Tober. All rights reserved.           }
{  Author's E-mail - mailto:btober@computer.org                      }
{  Other components at                                               }
{  http://www.connix.com/~btober/delphi.htm                          }
{                                                                    }
{ ****************************************************************** }

unit linregr; {Linear Regression Class}

{ Note Regarding the Problem Statement
  ------------------------------------
  This class is used for finding the best-fit slope (m) and
  y-intercept (b), as in the equation for a straight line

      y := m*x + b

  in the case where you have more than two (x,y) data point pairs,
  i.e., you have (Xi, Yi) for i:=1,2,...,n, with n>2. In this
  situation you can write n equations:

      Y1 := m*X1 + b + E1
      Y2 := m*X2 + b + E2
      .
      .
      .
      Yn := m*Xn + b + En

  giving a system of n linear equations that you need to solve for
  two unknowns, namely, m and b, where the Ei are error terms. The
  linear-least-squares error algorithm employed here finds the best
  m and b such that the sum of the squared error terms is minimized.
  The linear algebra behind the solution is to write an n-by-2
  matrix A that has rows looking like [1, Xi] for i=1,..,n and a
  2-by-1 vector u of unknowns with the intercept and slope, i.e.,
  u = [b,m]', where (') indicates "take the transpose of". There's
  a bunch of other technical detail stuff like "finding an
  orthogonal basis for a sub-space", and "guaranteed unique
  solutions", etc., that I vaguely recall from my course in real
  vectors, but it boils down to writing

      e'e = (y - Au)'(y - Au) = |y - Au|^2

  to show the best solution is found by solving for u in the
  equation

      Au = y

  where, to summarize:

      y = [Y1,Y2,...,Yn]'
      u = [b,m]'
      A = [[1,X1]', [1,X2]',...,[1,Xn]']'

  In the process of solving this you eventually get to the explicit
  solution

      u = inv(A'A)(A'y)

  where inv() means "take the inverse of". This is the equation
  implemented in this class. It calculates the statistics "on the
  fly", so it works for any number of two or more data points. This
  is all well-known, straight-forward stuff out of a textbook. I
  don't claim any credit for the math...I'm not inventing anything
  new here, just making a slick implementation.

  Note Regarding Exceptions
  -------------------------
  This class does not raise exceptions: your controlling application
  must trap 'divide by zero' errors that can occur in

      function GetCorrelation
      function GetIntercept
      function GetSlope

  I didn't build in this exception handling because it would entail
  a USES clause. As it is, this unit has NO explicit dependencies on
  ANY other units, and I couldn't bear to destroy that purity!

  The 'divide by zero' circumstance can occur when accessing any of
  the derived, covariance-related properties if:

      you have accumulated less than two data points, or

      you have "degenerate" data, i.e., all the x's are the same
      (this corresponds to a slope of infinity), in which case you
      will find that CovarianceXX will be zero (or within some small
      error range of zero).
  Note Regarding Normalizing Data
  -------------------------------
  It is recommended that you normalize data in advance of applying
  this linear-least-squares algorithm, if you can. For purely
  multiplicative scaling, apply the following: if you scale data
  input as (x',y') := (Ax,By), then you must

      1) adjust the computed intercept by b := b'/B, and
      2) adjust the computed slope by m := m'*A/B }

interface

type
  TRegressionRec = record
    N     :Integer; { Count of number of data points accumulated }
    AvgX  :Double;  { Average of independent variables }
    AvgY  :Double;  { Average of dependent variables }
    AvgXX :Double;  { Average of squared independent variables }
    AvgYY :Double;  { Average of squared dependent variables }
    AvgXY :Double;  { Average of cross product }
  end;

  TLinearRegression = class(TObject)
  private
    FRegressionRec: TRegressionRec;
    function GetCorrelation: Double;
    function GetCovarianceXX: Double;
    function GetCovarianceYY: Double;
    function GetCovarianceXY: Double;
    function GetIntercept: Double;
    function GetSlope: Double;
  public
    constructor Create;
    procedure Clear;
    function Add(X, Y: Double): Integer;
    function Interpolate(x: Double): Double;
    property AverageX: Double read FRegressionRec.AvgX;
    property AverageY: Double read FRegressionRec.AvgY;
    property Count: Integer read FRegressionRec.N;
    property Correlation: Double read GetCorrelation;
    property CovarianceXX: Double read GetCovarianceXX;
    property CovarianceYY: Double read GetCovarianceYY;
    property CovarianceXY: Double read GetCovarianceXY;
    property Intercept: Double read GetIntercept;
    property Slope: Double read GetSlope;
  end;

implementation

function Accumulate(
  N: Integer; {Sequence number of this new data point}
  a,          {The existing average (of N-1 data points)}
  x: Double   {The new data value to be included in running average}
  ): Double;  {Returns the running average}
begin
  Result := (N-1)/N;
  Result := a*Result + x/N;
end;

constructor TLinearRegression.Create;
begin
  inherited Create;
  Clear;
end;

function TLinearRegression.Add(X, Y: Double): Integer;
{This function 'accumulates' another (x,y) data point}
begin
  with FRegressionRec do
  begin
    Inc(N);
    AvgX  := Accumulate(N, AvgX, X);
    AvgY  := Accumulate(N, AvgY, Y);
    AvgXX := Accumulate(N, AvgXX, X*X);
    AvgYY := Accumulate(N, AvgYY, Y*Y);
    AvgXY := Accumulate(N, AvgXY, X*Y);
    Result := N;
  end;
end;

procedure TLinearRegression.Clear;
begin
  with FRegressionRec do
  begin
    N := 0;
    AvgX := 0;
    AvgY := 0;
    AvgXX := 0;
    AvgYY := 0;
    AvgXY := 0;
  end;
end;

function TLinearRegression.GetCovarianceXX: Double;
begin
  with FRegressionRec do
    Result := AvgXX - AvgX*AvgX;
end;

function TLinearRegression.GetCovarianceYY: Double;
begin
  with FRegressionRec do
    Result := AvgYY - AvgY*AvgY;
end;

function TLinearRegression.GetCovarianceXY: Double;
begin
  with FRegressionRec do
    Result := AvgXY - AvgX*AvgY;
end;

function TLinearRegression.GetCorrelation: Double;
var
  CovXX, CovYY: Double;
begin
  with FRegressionRec do
  begin
    CovXX := GetCovarianceXX;
    CovYY := GetCovarianceYY;
    Result := 0;
    if CovXX = 0.0 then
      Result := 1.0 {Degenerate case with infinite slope}
    else if CovYY = 0.0 then
      Result := 1.0 {Data has zero slope}
    else
      Result := GetCovarianceXY/Sqrt(CovXX*CovYY);
  end;
end;

function TLinearRegression.GetIntercept: Double;
begin
  {Note that if CovXX is zero, then no unique y-intercept exists}
  with FRegressionRec do
    Result := (AvgY*AvgXX - AvgX*AvgXY)/GetCovarianceXX;
end;

function TLinearRegression.GetSlope: Double;
begin
  {Note that if CovXX is zero, then slope is infinite}
  with FRegressionRec do
    Result := GetCovarianceXY/GetCovarianceXX;
end;

function TLinearRegression.Interpolate(x: Double): Double;
begin
  with FRegressionRec do
    Result := x*Slope + Intercept;
end;

end.
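The Accumulate helper above maintains a running average rather than storing every data point: folding the n-th value in via avg*(n-1)/n + x/n reproduces the ordinary batch mean. As a quick sanity check of that identity, here is a small Python sketch (an illustration only, not part of the original Delphi unit):

```python
def accumulate(n, avg, x):
    # Fold the n-th value into the running average of the first n-1
    # values, mirroring the Pascal Accumulate function above.
    return avg * (n - 1) / n + x / n

data = [75.0, 78.0, 88.0, 92.0, 95.0]
avg = 0.0
for i, v in enumerate(data, start=1):
    avg = accumulate(i, avg, v)

print(avg)                    # running average
print(sum(data) / len(data))  # batch average; the two agree
```

The same incremental update is applied to x, y, x*x, y*y, and x*y in Add, which is why the class works "on the fly" for any stream of two or more points.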
Sample project file follows:

program Example;

uses
  WinCRT, linregr;

type
  TDataArray = Array[1..10, 1..2] of Real;

const
  Data1: TDataArray = (
    (75,81), (78,73), (88,85), (92,85), (95,89),
    (67,73), (55,66), (73,81), (74,81), (80,81)
  );
  { These are the "correct" results for the data set above:
    r=0.891, m=0.513, b=39.658 }

  Data2: TDataArray = ( {Zero slope}
    (75,1), (78,1), (88,1), (92,1), (95,1),
    (67,1), (55,1), (73,1), (74,1), (80,1)
  );

  Data3: TDataArray = ( {Infinite slope}
    (2,81), (2,73), (2,85), (2,85), (2,89),
    (2,73), (2,66), (2,81), (2,81), (2,81)
  );

procedure TestRegression(Data: TDataArray);
var
  i: Word;
begin
  with TLinearRegression.Create do
  begin
    for i := 1 to 10 do
      Add(Data[i,1], Data[i,2]);
    try
      write(Correlation:12:3);
      write(Slope:12:3);
      writeln(Intercept:12:3);
      writeln('Average of X''s', AverageX:12:3);
      writeln('Average of Y''s', AverageY:12:3);
      writeln('Std Dev of X''s', sqrt(CovarianceXX):12:3);
      writeln('Std Dev of Y''s', sqrt(CovarianceYY):12:3);
    except
    end;
    writeln;
    Free;
  end;
end;

begin
  writeln('This is what the correlation, slope and intercept values should be:');
  writeln(0.891:12:3, 0.513:12:3, 39.658:12:3, #13);
  writeln('Sample results:');
  TestRegression(Data1);
  writeln(#13'Data with zero slope');
  TestRegression(Data2);
  writeln(#13'Degenerate data with infinite slope');
  TestRegression(Data3);
end.
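The expected figures quoted for Data1 (r=0.891, m=0.513, b=39.658) can be cross-checked outside Delphi. The following Python sketch (an illustration only, not part of the original article) applies the same average-and-covariance formulas the class uses, just computed in batch instead of incrementally:

```python
def regression_stats(pairs):
    # Same quantities the Delphi class accumulates: averages of
    # x, y, x*x, y*y and x*y, then the covariance formulas.
    n = len(pairs)
    ax  = sum(x for x, _ in pairs) / n
    ay  = sum(y for _, y in pairs) / n
    axx = sum(x * x for x, _ in pairs) / n
    ayy = sum(y * y for _, y in pairs) / n
    axy = sum(x * y for x, y in pairs) / n
    cov_xx = axx - ax * ax
    cov_yy = ayy - ay * ay
    cov_xy = axy - ax * ay
    m = cov_xy / cov_xx                    # Slope (fails if cov_xx = 0)
    b = (ay * axx - ax * axy) / cov_xx     # Intercept
    r = cov_xy / (cov_xx * cov_yy) ** 0.5  # Correlation
    return r, m, b

data1 = [(75, 81), (78, 73), (88, 85), (92, 85), (95, 89),
         (67, 73), (55, 66), (73, 81), (74, 81), (80, 81)]
r, m, b = regression_stats(data1)
print(round(r, 3), round(m, 3), round(b, 3))  # 0.891 0.513 39.658
```

For the degenerate Data3 set (all x's equal to 2), cov_xx comes out as zero, which is exactly the 'divide by zero' circumstance the unit's exception note warns about.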