Curve Fitting Handout #1 S. Arlinghaus Polynomials--general form y=a_nx^n + b_(n-1)x^(n-1) +...+ a_3x^3 + a_2x^2 + a_1x^1 + a_0x^0 The simplest form of a polynomial is a straight line, generated either as y=a_0x^0=a constant or y=a_1x^1 + a_0x^0=mx+b where m is the slope of the line and (0,b) is the intercept of the line with the y-axis. 1. Simple fitting of unbounded curves using Excel 5.0 Given a finite set of data points in two dimensions. Find a straight line that "best" fits the scatter of points. One might minimize the absolute value of the displacements of the points from the line; one might minimize the squares of the absolute values of displacements; or, one might minimize a variety of other measures. The "regression" analysis featured in most spreadsheets is a "least squares" analysis that minimizes the squares of the absolute values of displacements of the actual data points from the line to be fit. Consider the following simple example: Data points--filled in red: (1,1) (2,3) (3,3) displacements from line (points filled in blue) (3,3m+b) (2,3) Displacements of the red points (3,3) from the blue points along the line y=mx+b. (2,2m+b) Displacement is a vertical measure. (1,m+b) --differences in y-coordinates. d_1: |m+b-1| d_2: |2m+b-3| (1,1) d_3: |3m+b-3| MINIMIZE: the sum of the squares of the displacements so that the absolute value is no problem to deal with. The line that "best" fits the data will be the one that is least far from the data points--the one for which the sums of the squares of the displacements is a minimum. MINIMIZE: f(m,b) = (m+b-1)^2 + (2m+b-3)^2 + (3m+b-3)^2 SIMPLIFY: f(m,b) = 14m^2 + 12mb - 32m +3b^2 - 14b + 19 To find the minimum, find suitable partial derivatives, set them equal to zero, and solve to find critical points. partial of f with respect to m = 28m + 12b - 32 partial of f with respect to b = 12m + 6b - 14 Set the two partial derivatives equal to zero and solve the set of two equations in two variables: Solve: 14m + 6b =16 12m + 6b =14 Therefore, m=1; b = 1/3 a minimum (in theory could be a maximum or saddle point). So, y=mx+b becomes, for these data points: y=x+1/3, the line of best fit obtained by minimizing squares of displacements. This very simple example illustrates WHY the strategy works; it also illustrates that the role of outliers is perhaps overly emphasized--a data point that is 5 units from the line of best fit contributes a value of 25. Thus, in some cases, one might want to consider lines minimizing strict linear displacement, even though it is more trouble to deal with and is not customarily treated in "black-box" mode, as is least squares analysis. BLACK-BOX treatment of least squares Most spreadsheets offer a very easy strategy for dealing with least squares "regression" analysis. Of course, the manner of handling the analysis often changes, from one update to the next, so having some understanding of what is happening is important. Use Excel 5.0 to find the line of best fit to the data points above. Data points: x y 1 1 2 3 3 3 Enter these by hand, or import them and set them up in a manner such as that above. Then, use the commands SLOPE and INTERCEPT to find the values to put in y=mx+b Type =SLOPE({1,3,3},{1,2,3}) to return the value of the slope of the line of least squares-- note that the numbers in the first set are the y's and in the second set are the x's. Type =INTERCEPT({1,3,3},{1,2,3}) to return the value of the y-intercept of the line of least squares-- note that the numbers in the first set are the y's and in the second set are the x's. SLOPE CALCULATION: INTERCEPT CALCULATION: Then, it is an easy matter to read off the equation of the line of least squares as y=1*x+0.3333333 The equation can be used to forecast values. x y y=mx+b 1 1 1.333333 Enter the extended set of x-values--4,5,6,7,8,9 2 3 2.333333 3 3 3.333333 Then, in the y-column, next to the 4, type =a118+0.333333 (that is, mx+b). 4 4.33333 4.333333 Then, copy the cell b118 and paste it into the highlighted range from b119 to b123. 5 5.33333 5.333333 6 6.33333 6.333333 You can use chartWizard to graph the results. In many installations of 7 7.33333 7.333333 Excel, you can do all this at once using tools, data analysis. 8 8.33333 8.333333 9 9.33333 9.333333 2. Simple non-linear curve fitting. Exponential functions in general: y=a^x, a is a fixed number Suppose a>1 An increase in x causes an eventually sharper increase in y. Consider y=2^x for example x y -1 0.5 0 1 1 2 2 4 (0,1) Suppose 00 and that a is not 1. Then, one of the two familiar exponential curves emerges. 3. Least squares and exponential curves. Basic definition: Log_a y = x if and only if a^x = y. That is, exponential and logarithmic function are inverses--one can be used to undo the other. Consider the same data set as before: (1,1), (2,3), (3,3). Suppose that we can fit a line of the form ln y = mx + b to the data. Then, from this line, we get (using the basic definition) y=e^(mx+b) as an exponential fit to the data. To do this, take the ln of the y values: (1, ln 1), (2, ln 3), (3, ln 3). Now, fit a line to these points, using least squares. x ln y y 1 0 1 2 1.098612 3 3 1.098612 3 Slope: 0.549306 Intercept: -0.366204 Thus, ln y = 0.549306x-0.3662 So, y = e^(0.549306x-0.3662) So, the exponential fit to the given values of x=1,2,3 is given below, as is the projected fit for x=4,...,9 actual y y=x+0.333 x y=e^(0.549306*x-0.3662) 1 1.200942 1 1.3333 2 2.080092 3 2.3333 3 3.602824 3 3.3333 4 6.240273 4.3333 5 10.80847 5.3333 6 18.72081 6.3333 7 32.4254 7.3333 8 56.16243 8.3333 9 97.27617 9.3333 4. Least squares and logarithmic curves. Basic definition: Log_a y = x if and only if a^x = y. That is, exponential and logarithmic function are inverses--one can be used to undo the other. Consider the same data set as before: (1,1), (2,3), (3,3). Suppose that we can fit a line of the form e^ y = mx + b to the data. Then, from this line, we get (using the basic definition) y=ln(mx+b) as a logarithmic curve fit to the data. To do this, take the exp of the y values: (1, e^ 1), (2, e^ 3), (3, e^ 3). Now, fit a line to these points, using least squares. x e ^y y 1 2.718282 1 2 20.08554 3 3 20.08554 3 Slope: 8.683629 Intercept: -3.070804 Thus, e^ y = 8.683629x-3.0708 So, y = ln(8.683629x-3.0708) So, the logarithmic curve fit to the given values of x=1,2,3 is given below, as is the projected fit for x=4,...,9 actual y y=x+0.333 x y=e^(0.549306*x-0.3662) y=ln(8.683629*x-3.0708) exponential linear (actual) linear (proj.) logarithmic 1 1.200942 1 1.3333 1.725055 2 2.080092 3 2.3333 2.660012 3 3.602824 3 3.3333 3.134628 4 6.240273 4.3333 3.455171 5 10.80847 5.3333 3.697526 6 18.72081 6.3333 3.892452 7 32.4254 7.3333 4.05551 8 56.16243 8.3333 4.19567 9 97.27617 9.3333 4.318579 All of linear, exponential, and logarithmic curves offer a reasonable fit to the actual data; however, the various futures forecast by each are vastly different!! Note diagrams of this sort in Meadows--this is the theory behind them. If the tools, data analysis feature is installed, least squares is easy to do; additional information is produced, and graphs can be drawn directly, too. x e ^y y 1 2.718282 1 2 20.08554 3 3 20.08554 3