acgc.stats.bivariate_lines

Specialized methods of bivariate line fitting

  • Standard major axis (SMA) also called reduced major axis (RMA)
  • York regression, for data with errors in x and y
  • Theil-Sen, non-parametric slope estimation (use numba to accelerate the function in this module)
  1# -*- coding: utf-8 -*-
  2"""Specialized methods of bivariate line fitting
  3
  4* Standard major axis (SMA) also called reduced major axis (RMA)
  5* York regression, for data with errors in x and y
  6* Theil-Sen, non-parametric slope estimation (use numba to accelerate the function in this module)
  7"""
  8# from collections import namedtuple
  9import numpy as np
 10import scipy.stats as stats
 11from sklearn.covariance import MinCovDet
 12import statsmodels.formula.api as smf
 13import statsmodels.robust.norms as norms
 14import warnings
 15#from numba import jit
 16
 17__all__ = [
 18    "sma",
 19    "smafit",
 20    "sen",
 21    "sen_slope",
 22    "york"
 23]
 24# Aliases
 25def sen_slope(*args,**kwargs):
 26    '''Alias for `sen`'''
 27    return sen(*args,**kwargs)
 28def smafit(*args,**kwargs):
 29    '''Alias for `sma`'''
 30    return sma(*args,**kwargs)
 31
 32def sma(X,Y,W=None,
 33           data=None,
 34           cl=0.95,
 35           intercept=True,
 36           robust=False,robust_method='FastMCD'):
 37    '''Standard Major-Axis (SMA) line fitting
 38    
 39    Calculate standard major axis, aka reduced major axis, fit to 
 40    data X and Y. The main advantage of this over ordinary least squares is 
 41    that the best fit of Y to X will be the same as the best fit of X to Y.
 42    
 43    The fit equations and confidence intervals are implemented following 
 44    Warton et al. (2006). Robust fits use the FastMCD covariance estimate 
 45    from Rousseeuw and Van Driessen (1999). While there are many alternative 
 46    robust covariance estimators (e.g. other papers by D.I. Warton using M-estimators), 
 47    the FastMCD algorithm is default in Matlab. When the standard error or 
 48    uncertainty of each point is known, then weighted SMA may be preferrable to 
 49    robust SMA. The conventional choice of weights for each point i is 
 50    W_i = 1 / ( var(X_i) + var(Y_i) ), where var() is the variance 
 51    (squared standard error).
 52    
 53    References 
 54    Warton, D. I., Wright, I. J., Falster, D. S. and Westoby, M.: 
 55        Bivariate line-fitting methods for allometry, Biol. Rev., 81(02), 259, 
 56        doi:10.1017/S1464793106007007, 2006.
 57    Rousseeuw, P. J. and Van Driessen, K.: A Fast Algorithm for the Minimum 
 58        Covariance Determinant Estimator, Technometrics, 41(3), 1999.
 59
 60    Parameters
 61    ----------
 62    X, Y : array_like or str
 63        Input values, Must have same length.
 64    W    : array_like or str, optional
 65        array of weights for each X-Y point, typically W_i = 1/(var(X_i)+var(Y_i)) 
 66    data : dict_like, optional
 67        data structure containing variables. Used when X, Y, or W are str.
 68    cl   : float (default = 0.95)
 69        Desired confidence level for output. 
 70    intercept : bool, default=True
 71        Specify if the fitted model should include a non-zero intercept.
 72        The model will be forced through the origin (0,0) if intercept=False.
 73    robust : bool, default=False
 74        Use statistical methods that are robust to the presence of outliers
 75    robust_method: {'FastMCD' (default), 'Huber', 'Biweight'}
 76        Method for calculating robust variance and covariance. Options:
 77        - 'MCD' or 'FastMCD' for Fast MCD
 78        - 'Huber' for Huber's T: reduce, not eliminate, influence of outliers
 79        - 'Biweight' for Tukey's Biweight: reduces then eliminates influence of outliers
 80
 81        
 82    Returns
 83    -------
 84    fitresult : dict 
 85        Contains the following keys:
 86        - slope (float)
 87            Slope or Gradient of Y vs. X
 88        - intercept (float)
 89            Y intercept.
 90        - slope_ste (float)
 91            Standard error of slope estimate
 92        - intercept_ste (float)
 93            standard error of intercept estimate
 94        - slope_interval ([float, float])
 95            confidence interval for gradient at confidence level cl
 96        - intercept_interval ([float, float])
 97            confidence interval for intercept at confidence level cl
 98        - df_model (float)
 99            degrees of freedom for model
100        - df_resid (float)
101            degrees of freedom for residuals
102        - params ([float,float])
103            array of fitted parameters
104    '''
105
106    def str2var( v, data ):
107        '''Extract variable named v from Dataframe named data'''
108        try:
109            return data[v]
110        except Exception as exc:
111            raise ValueError( 'Argument data must be provided with a key named '+v ) from exc
112
113    # If variables are provided as strings, get values from the data structure
114    if isinstance( X, str ):
115        X = str2var( X, data )
116    if isinstance( Y, str ):
117        Y = str2var( Y, data )
118    if isinstance( W, str ):
119        W = str2var( W, data )
120
121    # Make sure arrays have the same length
122    assert ( len(X) == len(Y) ), 'Arrays X and Y must have the same length'
123    if W is None:
124        W = np.zeros_like(X) + 1
125    else:
126        assert ( len(W) == len(X) ), 'Array W must have the same length as X and Y'
127
128    # Make sure cl is within the range 0-1
129    assert (cl < 1), 'cl must be less than 1'
130    assert (cl > 0), 'cl must be greater than 0'  
131
132    # Drop any NaN elements of X, Y, or W
133    # Infinite values are allowed but will make the result undefined
134    # idx = ~np.logical_or( np.isnan(X0), np.isnan(Y0) )
135    idx = ~np.isnan(X) * ~np.isnan(Y) * ~np.isnan(W)
136
137    X0 = X[idx]
138    Y0 = Y[idx]
139    W0 = W[idx]
140
141    # Number of observations
142    N = len(X0)
143
144    # Degrees of freedom for the model
145    if intercept:
146        dfmod = 2
147    else:
148        dfmod = 1
149
150
151    # Choose whether to use methods robust to outliers
152    if robust:
153
154        # Choose the robust method
155        if ((robust_method.lower() =='mcd') or (robust_method.lower() == 'fastmcd') ):
156            # FAST MCD
157
158            if not intercept:
159                # intercept=False could possibly be supported by calculating
160                # using mcd.support_ as weights in an explicit variance/covariance calculation
161                raise NotImplementedError('FastMCD method only supports SMA with intercept')
162
163            # Fit robust model of mean and covariance
164            mcd = MinCovDet().fit( np.array([X0,Y0]).T )
165
166            # Robust mean
167            Xmean = mcd.location_[0]
168            Ymean = mcd.location_[1]
169
170            # Robust variance of X, Y
171            Vx    = mcd.covariance_[0,0]
172            Vy    = mcd.covariance_[1,1]
173
174            # Robust covariance
175            Vxy   = mcd.covariance_[0,1]
176
177            # Number of observations used in mean and covariance estimate
178            # excludes observations marked as outliers
179            N = mcd.support_.sum()
180
181        elif ((robust_method.lower() =='biweight') or (robust_method.lower() == 'huber') ):
182
183            # Tukey's Biweight and Huber's T
184            if robust_method.lower()=='biweight':
185                norm = norms.TukeyBiweight()
186            else:
187                norm = norms.HuberT()
188
189            # Get weights for downweighting outliers
190            # Fitting a linear model the easiest way to get these
191            # Options include "TukeyBiweight" (totally removes large deviates)
192            # "HuberT" (linear, not squared weighting of large deviates)
193            rweights = smf.rlm('y~x+1',{'x':X0,'y':Y0},M=norm).fit().weights
194
195            # Sum of weight and weights squared, for convienience
196            rsum  = np.sum( rweights )
197            rsum2 = np.sum( rweights**2 )
198
199            # Mean
200            Xmean = np.sum( X0 * rweights ) / rsum
201            Ymean = np.sum( Y0 * rweights ) / rsum
202
203            # Force intercept through zero, if requested
204            if not intercept:
205                Xmean = 0
206                Ymean = 0
207
208            # Variance & Covariance
209            Vx    = np.sum( (X0-Xmean)**2 * rweights**2 ) / rsum2
210            Vy    = np.sum( (Y0-Ymean)**2 * rweights**2 ) / rsum2
211            Vxy   = np.sum( (X0-Xmean) * (Y0-Ymean) * rweights**2 ) / rsum2
212
213            # Effective number of observations
214            N = rsum
215
216        else:
217
218            raise NotImplementedError("sma hasn't implemented robust_method={:%s}".\
219                                      format(robust_method))
220    else:
221
222        if intercept:
223
224            wsum = np.sum(W)
225
226            # Average values
227            Xmean = np.sum(X0 * W0) / wsum
228            Ymean = np.sum(Y0 * W0) / wsum
229
230            # Covariance matrix
231            cov = np.cov( X0, Y0, ddof=1, aweights=W0**2 )
232
233            # Variance
234            Vx = cov[0,0]
235            Vy = cov[1,1]
236
237            # Covariance
238            Vxy = cov[0,1]
239
240        else:
241
242            # Force the line to pass through origin by setting means to zero
243            Xmean = 0
244            Ymean = 0
245
246            wsum = np.sum(W0)
247
248            # Sum of squares in place of variance and covariance
249            Vx = np.sum( X0**2 * W0 ) / wsum
250            Vy = np.sum( Y0**2 * W0 ) / wsum
251            Vxy= np.sum( X0*Y0 * W0 ) / wsum
252
253    # Standard deviation
254    Sx = np.sqrt( Vx )
255    Sy = np.sqrt( Vy )
256
257    # Correlation coefficient (equivalent to np.corrcoef()[1,0] for non-robust cases)
258    R = Vxy / np.sqrt( Vx * Vy )
259
260    #############
261    # SLOPE
262
263    Slope  = np.sign(R) * Sy / Sx
264
265    # Standard error of slope estimate
266    ste_slope = np.sqrt( 1/(N-dfmod) * Sy**2 / Sx**2 * (1-R**2) )
267
268    # Confidence interval for Slope
269    B = (1-R**2)/(N-dfmod) * stats.f.isf(1-cl, 1, N-dfmod)
270    ci_grad = Slope * ( np.sqrt( B+1 ) + np.sqrt(B)*np.array([-1,+1]) )
271
272    #############
273    # INTERCEPT
274
275    if intercept:
276        Intercept = Ymean - Slope * Xmean
277
278        # Standard deviation of residuals
279        # New Method: Formula from smatr R package (Warton)
280        # This formula avoids large residuals of outliers when using robust=True
281        Sr = np.sqrt((Vy - 2 * Slope * Vxy + Slope**2 *  Vx ) * (N-1) / (N-dfmod) )
282
283        # OLD METHOD
284        # Standard deviation of residuals
285        #resid = Y0 - (Intercept + Slope * X0 )    
286        # Population standard deviation of the residuals
287        #Sr = np.std( resid, ddof=0 )      
288
289        # Standard error of the intercept estimate
290        ste_int = np.sqrt( Sr**2/N + Xmean**2 * ste_slope**2  )
291
292        # Confidence interval for Intercept
293        tcrit = stats.t.isf((1-cl)/2,N-dfmod)
294        ci_int = Intercept + ste_int * np.array([-tcrit,tcrit])
295
296    else:
297
298        # Set Intercept quantities to zero
299        Intercept = 0
300        ste_int   = 0
301        ci_int    = np.array([0,0])
302
303    result = dict( slope            = Slope,
304                   intercept        = Intercept,
305                   slope_ste        = ste_slope,
306                   intercept_ste    = ste_int,
307                   slope_interval   = ci_grad,
308                   intercept_interval = ci_int,
309                   df_model         = dfmod,
310                   df_resid         = N-dfmod,
311                   params           = np.array([Slope,Intercept]),
312                   nobs             = N,
313                   fittedvalues     = Intercept + Slope * X0,
314                   resid            = Intercept + Slope * X0 - Y0 )
315
316    # return Slope, Intercept, ste_slope, ste_int, ci_grad, ci_int
317    return result
318
319def york( x, y, err_x=1, err_y=1, rerr_xy=0 ):
320    '''York regression accounting for error in x and y
321    Follows the notation and algorithm of York et al. (2004) Section III
322    
323    Parameters
324    ----------
325    x, y : ndarray
326        dependent (x) and independent (y) variables for fitting
327    err_x, err_y : ndarray (default=1)
328        standard deviation of errors/uncertainty in x and y
329    rerr_xy : float (default=0)
330        correlation coefficient for errors in x and y, 
331        default to rerr_xy=0 meaning that the errors in x are unrelated to errors in y
332        err_x, err_y, and rerr_xy can be constants or arrays of the same length as x and y
333    
334    Returns
335    -------
336    fitresult : dict 
337        Contains the following keys:
338        - slope (float)
339            Slope or Gradient of Y vs. X
340        - intercept (float)
341            Y intercept.
342        - slope_ste (float)
343            Standard error of slope estimate
344        - intercept_ste (float)
345            standard error of intercept estimate
346        - slope_interval ([float, float])
347            confidence interval for gradient at confidence level cl
348        - intercept_interval ([float, float])
349            confidence interval for intercept at confidence level cl
350        - df_model (float)
351            degrees of freedom for model
352        - df_resid (float)
353            degrees of freedom for residuals
354        - params ([float,float])
355            array of fitted parameters
356    '''
357
358    # relative error tolerance required for convergence
359    rtol = 1e-15
360
361    # Initial guess for slope, from ordinary least squares
362    result = stats.linregress( x, y )
363    b = result[0]
364
365    # Weights for x and y
366    wx = 1 / err_x**2
367    wy = 1 / err_y**2
368
369    # Combined weights
370    alpha = np.sqrt( wx * wy )
371
372    # Iterate until solution converges, but not more 50 times
373    maxiter=50
374    for i in range(1,maxiter):
375
376        # Weight for point i
377        W = wx * wy / ( wx + b**2 * wy - 2 * b * rerr_xy * alpha )
378        Wsum = np.sum( W )
379
380        # Weighted means
381        Xbar = np.sum( W * x ) / Wsum
382        Ybar = np.sum( W * y ) / Wsum
383
384        # Deviation from weighted means
385        U = x - Xbar
386        V = y - Ybar
387
388        # parameter needed for slope
389        beta = W * ( U / wy + b*V / wx - (b*U + V) * rerr_xy / alpha )
390
391        # Update slope estimate
392        bnew = np.sum( W * beta * V ) / np.sum( W * beta * U )
393
394        # Break from loop if new value is very close to old value
395        if np.abs( (bnew-b)/b ) < rtol:
396            break
397        else:
398            b = bnew
399
400    if i==maxiter:
401        raise ValueError( f'York regression failed to converge in {maxiter:d} iterations' )
402
403    # Intercept
404    a = Ybar - b * Xbar
405
406    # least-squares adjusted points, expectation values of X and Y
407    xa = Xbar + beta
408    ya = Ybar + b*beta
409
410    # Mean of adjusted points
411    xabar = np.sum( W * xa ) / Wsum
412    yabar = np.sum( W * ya ) / Wsum
413
414    # Devaiation of adjusted points from their means
415    u = xa - xabar
416    v = ya - yabar
417
418    # Variance of slope and intercept estimates
419    varb = 1 / np.sum( W * u**2 )
420    vara = 1 / Wsum + xabar**2 * varb
421
422    # Standard error of slope and intercept
423    siga = np.sqrt( vara )
424    sigb = np.sqrt( varb )
425
426    # Define a named tuple type that will contain the results
427    # result = namedtuple( 'result', 'slope intercept sigs sigi params sigma' )
428
429    # Return results as a named tuple, User can access as a regular tuple too
430    # return result( b, a, sigb, siga, [b,a], [sigb, siga] )
431
432    dfmod = 2
433    N = np.sum( ~np.isnan(x) * ~np.isnan(y) )
434
435    result = dict( slope         = b,
436                intercept        = a,
437                slope_ste        = sigb,
438                intercept_ste    = siga,
439                slope_interval   = [None,None],
440                intercept_interval = [None,None],
441                df_model         = dfmod,
442                df_resid         = N-dfmod,
443                params           = np.array([b,a]),
444                nobs             = N,
445                fittedvalues     = a + b * x,
446                resid            = a + b * x - y )
447
448    return result
449
450#@jit(nopython=True)
451def sen( x, y ):
452    '''Estimate linear trend using the Thiel-Sen method
453    
454    This non-parametric method finds the median slope among all
455    combinations of time points. 
456    scipy.stats.theilslopes provides the same slope estimate, with  
457    confidence intervals. However, this function is faster for 
458    large datasets due to Numba 
459    
460    Parameters
461    ----------
462    x : array_like (N,)
463        independent variable
464    y : array_like (N,)
465        dependent variable
466    
467    Returns
468    -------
469    sen : float
470        the median slope
471    slopes : array (N*N,)
472        all slope estimates from all combinations of x and y
473    '''
474
475    with warnings.catch_warnings():
476        warnings.simplefilter('always', DeprecationWarning)
477        warnings.warn(f'Sen function is slow unless numba.jit is used. Use scipy.stats.theilslopes instead.',
478                    DeprecationWarning, stacklevel=2)
479        
480    if len( x ) != len( y ):
481        print('Inputs x and y must have same dimension')
482        return np.nan
483
484    # Find number of time points
485    n = len( x )
486
487    # Array to hold all slope estimates
488    slopes = np.zeros(  np.ceil( n * ( n-1 ) / 2 ).astype('int') )
489    slopes[:] = np.nan
490
491    count = 0
492
493    for i in range(n):
494        for j in range(i+1, n):
495
496            # Slope between elements i and j
497            slopeij = ( y[j] - y[i] ) / ( x[j] - x[i] )
498
499            slopes[count] = slopeij
500
501            count += 1
502
503    # Thiel-Sen estimate is the median slope, neglecting NaN
504    sen = np.nanmedian( slopes )
505
506    return sen, slopes
def sma( X, Y, W=None, data=None, cl=0.95, intercept=True, robust=False, robust_method='FastMCD'):
 33def sma(X,Y,W=None,
 34           data=None,
 35           cl=0.95,
 36           intercept=True,
 37           robust=False,robust_method='FastMCD'):
 38    '''Standard Major-Axis (SMA) line fitting
 39    
 40    Calculate standard major axis, aka reduced major axis, fit to 
 41    data X and Y. The main advantage of this over ordinary least squares is 
 42    that the best fit of Y to X will be the same as the best fit of X to Y.
 43    
 44    The fit equations and confidence intervals are implemented following 
 45    Warton et al. (2006). Robust fits use the FastMCD covariance estimate 
 46    from Rousseeuw and Van Driessen (1999). While there are many alternative 
 47    robust covariance estimators (e.g. other papers by D.I. Warton using M-estimators), 
 48    the FastMCD algorithm is default in Matlab. When the standard error or 
 49    uncertainty of each point is known, then weighted SMA may be preferrable to 
 50    robust SMA. The conventional choice of weights for each point i is 
 51    W_i = 1 / ( var(X_i) + var(Y_i) ), where var() is the variance 
 52    (squared standard error).
 53    
 54    References 
 55    Warton, D. I., Wright, I. J., Falster, D. S. and Westoby, M.: 
 56        Bivariate line-fitting methods for allometry, Biol. Rev., 81(02), 259, 
 57        doi:10.1017/S1464793106007007, 2006.
 58    Rousseeuw, P. J. and Van Driessen, K.: A Fast Algorithm for the Minimum 
 59        Covariance Determinant Estimator, Technometrics, 41(3), 1999.
 60
 61    Parameters
 62    ----------
 63    X, Y : array_like or str
 64        Input values, Must have same length.
 65    W    : array_like or str, optional
 66        array of weights for each X-Y point, typically W_i = 1/(var(X_i)+var(Y_i)) 
 67    data : dict_like, optional
 68        data structure containing variables. Used when X, Y, or W are str.
 69    cl   : float (default = 0.95)
 70        Desired confidence level for output. 
 71    intercept : bool, default=True
 72        Specify if the fitted model should include a non-zero intercept.
 73        The model will be forced through the origin (0,0) if intercept=False.
 74    robust : bool, default=False
 75        Use statistical methods that are robust to the presence of outliers
 76    robust_method: {'FastMCD' (default), 'Huber', 'Biweight'}
 77        Method for calculating robust variance and covariance. Options:
 78        - 'MCD' or 'FastMCD' for Fast MCD
 79        - 'Huber' for Huber's T: reduce, not eliminate, influence of outliers
 80        - 'Biweight' for Tukey's Biweight: reduces then eliminates influence of outliers
 81
 82        
 83    Returns
 84    -------
 85    fitresult : dict 
 86        Contains the following keys:
 87        - slope (float)
 88            Slope or Gradient of Y vs. X
 89        - intercept (float)
 90            Y intercept.
 91        - slope_ste (float)
 92            Standard error of slope estimate
 93        - intercept_ste (float)
 94            standard error of intercept estimate
 95        - slope_interval ([float, float])
 96            confidence interval for gradient at confidence level cl
 97        - intercept_interval ([float, float])
 98            confidence interval for intercept at confidence level cl
 99        - df_model (float)
100            degrees of freedom for model
101        - df_resid (float)
102            degrees of freedom for residuals
103        - params ([float,float])
104            array of fitted parameters
105    '''
106
107    def str2var( v, data ):
108        '''Extract variable named v from Dataframe named data'''
109        try:
110            return data[v]
111        except Exception as exc:
112            raise ValueError( 'Argument data must be provided with a key named '+v ) from exc
113
114    # If variables are provided as strings, get values from the data structure
115    if isinstance( X, str ):
116        X = str2var( X, data )
117    if isinstance( Y, str ):
118        Y = str2var( Y, data )
119    if isinstance( W, str ):
120        W = str2var( W, data )
121
122    # Make sure arrays have the same length
123    assert ( len(X) == len(Y) ), 'Arrays X and Y must have the same length'
124    if W is None:
125        W = np.zeros_like(X) + 1
126    else:
127        assert ( len(W) == len(X) ), 'Array W must have the same length as X and Y'
128
129    # Make sure cl is within the range 0-1
130    assert (cl < 1), 'cl must be less than 1'
131    assert (cl > 0), 'cl must be greater than 0'  
132
133    # Drop any NaN elements of X, Y, or W
134    # Infinite values are allowed but will make the result undefined
135    # idx = ~np.logical_or( np.isnan(X0), np.isnan(Y0) )
136    idx = ~np.isnan(X) * ~np.isnan(Y) * ~np.isnan(W)
137
138    X0 = X[idx]
139    Y0 = Y[idx]
140    W0 = W[idx]
141
142    # Number of observations
143    N = len(X0)
144
145    # Degrees of freedom for the model
146    if intercept:
147        dfmod = 2
148    else:
149        dfmod = 1
150
151
152    # Choose whether to use methods robust to outliers
153    if robust:
154
155        # Choose the robust method
156        if ((robust_method.lower() =='mcd') or (robust_method.lower() == 'fastmcd') ):
157            # FAST MCD
158
159            if not intercept:
160                # intercept=False could possibly be supported by calculating
161                # using mcd.support_ as weights in an explicit variance/covariance calculation
162                raise NotImplementedError('FastMCD method only supports SMA with intercept')
163
164            # Fit robust model of mean and covariance
165            mcd = MinCovDet().fit( np.array([X0,Y0]).T )
166
167            # Robust mean
168            Xmean = mcd.location_[0]
169            Ymean = mcd.location_[1]
170
171            # Robust variance of X, Y
172            Vx    = mcd.covariance_[0,0]
173            Vy    = mcd.covariance_[1,1]
174
175            # Robust covariance
176            Vxy   = mcd.covariance_[0,1]
177
178            # Number of observations used in mean and covariance estimate
179            # excludes observations marked as outliers
180            N = mcd.support_.sum()
181
182        elif ((robust_method.lower() =='biweight') or (robust_method.lower() == 'huber') ):
183
184            # Tukey's Biweight and Huber's T
185            if robust_method.lower()=='biweight':
186                norm = norms.TukeyBiweight()
187            else:
188                norm = norms.HuberT()
189
190            # Get weights for downweighting outliers
191            # Fitting a linear model the easiest way to get these
192            # Options include "TukeyBiweight" (totally removes large deviates)
193            # "HuberT" (linear, not squared weighting of large deviates)
194            rweights = smf.rlm('y~x+1',{'x':X0,'y':Y0},M=norm).fit().weights
195
196            # Sum of weight and weights squared, for convienience
197            rsum  = np.sum( rweights )
198            rsum2 = np.sum( rweights**2 )
199
200            # Mean
201            Xmean = np.sum( X0 * rweights ) / rsum
202            Ymean = np.sum( Y0 * rweights ) / rsum
203
204            # Force intercept through zero, if requested
205            if not intercept:
206                Xmean = 0
207                Ymean = 0
208
209            # Variance & Covariance
210            Vx    = np.sum( (X0-Xmean)**2 * rweights**2 ) / rsum2
211            Vy    = np.sum( (Y0-Ymean)**2 * rweights**2 ) / rsum2
212            Vxy   = np.sum( (X0-Xmean) * (Y0-Ymean) * rweights**2 ) / rsum2
213
214            # Effective number of observations
215            N = rsum
216
217        else:
218
219            raise NotImplementedError("sma hasn't implemented robust_method={:%s}".\
220                                      format(robust_method))
221    else:
222
223        if intercept:
224
225            wsum = np.sum(W)
226
227            # Average values
228            Xmean = np.sum(X0 * W0) / wsum
229            Ymean = np.sum(Y0 * W0) / wsum
230
231            # Covariance matrix
232            cov = np.cov( X0, Y0, ddof=1, aweights=W0**2 )
233
234            # Variance
235            Vx = cov[0,0]
236            Vy = cov[1,1]
237
238            # Covariance
239            Vxy = cov[0,1]
240
241        else:
242
243            # Force the line to pass through origin by setting means to zero
244            Xmean = 0
245            Ymean = 0
246
247            wsum = np.sum(W0)
248
249            # Sum of squares in place of variance and covariance
250            Vx = np.sum( X0**2 * W0 ) / wsum
251            Vy = np.sum( Y0**2 * W0 ) / wsum
252            Vxy= np.sum( X0*Y0 * W0 ) / wsum
253
254    # Standard deviation
255    Sx = np.sqrt( Vx )
256    Sy = np.sqrt( Vy )
257
258    # Correlation coefficient (equivalent to np.corrcoef()[1,0] for non-robust cases)
259    R = Vxy / np.sqrt( Vx * Vy )
260
261    #############
262    # SLOPE
263
264    Slope  = np.sign(R) * Sy / Sx
265
266    # Standard error of slope estimate
267    ste_slope = np.sqrt( 1/(N-dfmod) * Sy**2 / Sx**2 * (1-R**2) )
268
269    # Confidence interval for Slope
270    B = (1-R**2)/(N-dfmod) * stats.f.isf(1-cl, 1, N-dfmod)
271    ci_grad = Slope * ( np.sqrt( B+1 ) + np.sqrt(B)*np.array([-1,+1]) )
272
273    #############
274    # INTERCEPT
275
276    if intercept:
277        Intercept = Ymean - Slope * Xmean
278
279        # Standard deviation of residuals
280        # New Method: Formula from smatr R package (Warton)
281        # This formula avoids large residuals of outliers when using robust=True
282        Sr = np.sqrt((Vy - 2 * Slope * Vxy + Slope**2 *  Vx ) * (N-1) / (N-dfmod) )
283
284        # OLD METHOD
285        # Standard deviation of residuals
286        #resid = Y0 - (Intercept + Slope * X0 )    
287        # Population standard deviation of the residuals
288        #Sr = np.std( resid, ddof=0 )      
289
290        # Standard error of the intercept estimate
291        ste_int = np.sqrt( Sr**2/N + Xmean**2 * ste_slope**2  )
292
293        # Confidence interval for Intercept
294        tcrit = stats.t.isf((1-cl)/2,N-dfmod)
295        ci_int = Intercept + ste_int * np.array([-tcrit,tcrit])
296
297    else:
298
299        # Set Intercept quantities to zero
300        Intercept = 0
301        ste_int   = 0
302        ci_int    = np.array([0,0])
303
304    result = dict( slope            = Slope,
305                   intercept        = Intercept,
306                   slope_ste        = ste_slope,
307                   intercept_ste    = ste_int,
308                   slope_interval   = ci_grad,
309                   intercept_interval = ci_int,
310                   df_model         = dfmod,
311                   df_resid         = N-dfmod,
312                   params           = np.array([Slope,Intercept]),
313                   nobs             = N,
314                   fittedvalues     = Intercept + Slope * X0,
315                   resid            = Intercept + Slope * X0 - Y0 )
316
317    # return Slope, Intercept, ste_slope, ste_int, ci_grad, ci_int
318    return result

Standard Major-Axis (SMA) line fitting

Calculate standard major axis, aka reduced major axis, fit to data X and Y. The main advantage of this over ordinary least squares is that the best fit of Y to X will be the same as the best fit of X to Y.

The fit equations and confidence intervals are implemented following Warton et al. (2006). Robust fits use the FastMCD covariance estimate from Rousseeuw and Van Driessen (1999). While there are many alternative robust covariance estimators (e.g. other papers by D.I. Warton using M-estimators), the FastMCD algorithm is default in Matlab. When the standard error or uncertainty of each point is known, then weighted SMA may be preferrable to robust SMA. The conventional choice of weights for each point i is W_i = 1 / ( var(X_i) + var(Y_i) ), where var() is the variance (squared standard error).

References Warton, D. I., Wright, I. J., Falster, D. S. and Westoby, M.: Bivariate line-fitting methods for allometry, Biol. Rev., 81(02), 259, doi:10.1017/S1464793106007007, 2006. Rousseeuw, P. J. and Van Driessen, K.: A Fast Algorithm for the Minimum Covariance Determinant Estimator, Technometrics, 41(3), 1999.

Parameters
  • X, Y (array_like or str): Input values, Must have same length.
  • W (array_like or str, optional): array of weights for each X-Y point, typically W_i = 1/(var(X_i)+var(Y_i))
  • data (dict_like, optional): data structure containing variables. Used when X, Y, or W are str.
  • cl (float (default = 0.95)): Desired confidence level for output.
  • intercept (bool, default=True): Specify if the fitted model should include a non-zero intercept. The model will be forced through the origin (0,0) if intercept=False.
  • robust (bool, default=False): Use statistical methods that are robust to the presence of outliers
  • robust_method ({'FastMCD' (default), 'Huber', 'Biweight'}): Method for calculating robust variance and covariance. Options:
    • 'MCD' or 'FastMCD' for Fast MCD
    • 'Huber' for Huber's T: reduce, not eliminate, influence of outliers
    • 'Biweight' for Tukey's Biweight: reduces then eliminates influence of outliers
Returns
  • fitresult (dict): Contains the following keys:
    • slope (float) Slope or Gradient of Y vs. X
    • intercept (float) Y intercept.
    • slope_ste (float) Standard error of slope estimate
    • intercept_ste (float) standard error of intercept estimate
    • slope_interval ([float, float]) confidence interval for gradient at confidence level cl
    • intercept_interval ([float, float]) confidence interval for intercept at confidence level cl
    • df_model (float) degrees of freedom for model
    • df_resid (float) degrees of freedom for residuals
    • params ([float,float]) array of fitted parameters
def smafit(*args, **kwargs):
29def smafit(*args,**kwargs):
30    '''Alias for `sma`'''
31    return sma(*args,**kwargs)

Alias for sma

def sen(x, y):
452def sen( x, y ):
453    '''Estimate linear trend using the Thiel-Sen method
454    
455    This non-parametric method finds the median slope among all
456    combinations of time points. 
457    scipy.stats.theilslopes provides the same slope estimate, with  
458    confidence intervals. However, this function is faster for 
459    large datasets due to Numba 
460    
461    Parameters
462    ----------
463    x : array_like (N,)
464        independent variable
465    y : array_like (N,)
466        dependent variable
467    
468    Returns
469    -------
470    sen : float
471        the median slope
472    slopes : array (N*N,)
473        all slope estimates from all combinations of x and y
474    '''
475
476    with warnings.catch_warnings():
477        warnings.simplefilter('always', DeprecationWarning)
478        warnings.warn(f'Sen function is slow unless numba.jit is used. Use scipy.stats.theilslopes instead.',
479                    DeprecationWarning, stacklevel=2)
480        
481    if len( x ) != len( y ):
482        print('Inputs x and y must have same dimension')
483        return np.nan
484
485    # Find number of time points
486    n = len( x )
487
488    # Array to hold all slope estimates
489    slopes = np.zeros(  np.ceil( n * ( n-1 ) / 2 ).astype('int') )
490    slopes[:] = np.nan
491
492    count = 0
493
494    for i in range(n):
495        for j in range(i+1, n):
496
497            # Slope between elements i and j
498            slopeij = ( y[j] - y[i] ) / ( x[j] - x[i] )
499
500            slopes[count] = slopeij
501
502            count += 1
503
504    # Thiel-Sen estimate is the median slope, neglecting NaN
505    sen = np.nanmedian( slopes )
506
507    return sen, slopes

Estimate linear trend using the Thiel-Sen method

This non-parametric method finds the median slope among all combinations of time points. scipy.stats.theilslopes provides the same slope estimate, with
confidence intervals. However, this function is faster for large datasets due to Numba

Parameters
  • x (array_like (N,)): independent variable
  • y (array_like (N,)): dependent variable
Returns
  • sen (float): the median slope
  • slopes (array (N*N,)): all slope estimates from all combinations of x and y
def sen_slope(*args, **kwargs):
26def sen_slope(*args,**kwargs):
27    '''Alias for `sen`'''
28    return sen(*args,**kwargs)

Alias for sen

def york(x, y, err_x=1, err_y=1, rerr_xy=0):
320def york( x, y, err_x=1, err_y=1, rerr_xy=0 ):
321    '''York regression accounting for error in x and y
322    Follows the notation and algorithm of York et al. (2004) Section III
323    
324    Parameters
325    ----------
326    x, y : ndarray
327        dependent (x) and independent (y) variables for fitting
328    err_x, err_y : ndarray (default=1)
329        standard deviation of errors/uncertainty in x and y
330    rerr_xy : float (default=0)
331        correlation coefficient for errors in x and y, 
332        default to rerr_xy=0 meaning that the errors in x are unrelated to errors in y
333        err_x, err_y, and rerr_xy can be constants or arrays of the same length as x and y
334    
335    Returns
336    -------
337    fitresult : dict 
338        Contains the following keys:
339        - slope (float)
340            Slope or Gradient of Y vs. X
341        - intercept (float)
342            Y intercept.
343        - slope_ste (float)
344            Standard error of slope estimate
345        - intercept_ste (float)
346            standard error of intercept estimate
347        - slope_interval ([float, float])
348            confidence interval for gradient at confidence level cl
349        - intercept_interval ([float, float])
350            confidence interval for intercept at confidence level cl
351        - df_model (float)
352            degrees of freedom for model
353        - df_resid (float)
354            degrees of freedom for residuals
355        - params ([float,float])
356            array of fitted parameters
357    '''
358
359    # relative error tolerance required for convergence
360    rtol = 1e-15
361
362    # Initial guess for slope, from ordinary least squares
363    result = stats.linregress( x, y )
364    b = result[0]
365
366    # Weights for x and y
367    wx = 1 / err_x**2
368    wy = 1 / err_y**2
369
370    # Combined weights
371    alpha = np.sqrt( wx * wy )
372
373    # Iterate until solution converges, but not more 50 times
374    maxiter=50
375    for i in range(1,maxiter):
376
377        # Weight for point i
378        W = wx * wy / ( wx + b**2 * wy - 2 * b * rerr_xy * alpha )
379        Wsum = np.sum( W )
380
381        # Weighted means
382        Xbar = np.sum( W * x ) / Wsum
383        Ybar = np.sum( W * y ) / Wsum
384
385        # Deviation from weighted means
386        U = x - Xbar
387        V = y - Ybar
388
389        # parameter needed for slope
390        beta = W * ( U / wy + b*V / wx - (b*U + V) * rerr_xy / alpha )
391
392        # Update slope estimate
393        bnew = np.sum( W * beta * V ) / np.sum( W * beta * U )
394
395        # Break from loop if new value is very close to old value
396        if np.abs( (bnew-b)/b ) < rtol:
397            break
398        else:
399            b = bnew
400
401    if i==maxiter:
402        raise ValueError( f'York regression failed to converge in {maxiter:d} iterations' )
403
404    # Intercept
405    a = Ybar - b * Xbar
406
407    # least-squares adjusted points, expectation values of X and Y
408    xa = Xbar + beta
409    ya = Ybar + b*beta
410
411    # Mean of adjusted points
412    xabar = np.sum( W * xa ) / Wsum
413    yabar = np.sum( W * ya ) / Wsum
414
415    # Devaiation of adjusted points from their means
416    u = xa - xabar
417    v = ya - yabar
418
419    # Variance of slope and intercept estimates
420    varb = 1 / np.sum( W * u**2 )
421    vara = 1 / Wsum + xabar**2 * varb
422
423    # Standard error of slope and intercept
424    siga = np.sqrt( vara )
425    sigb = np.sqrt( varb )
426
427    # Define a named tuple type that will contain the results
428    # result = namedtuple( 'result', 'slope intercept sigs sigi params sigma' )
429
430    # Return results as a named tuple, User can access as a regular tuple too
431    # return result( b, a, sigb, siga, [b,a], [sigb, siga] )
432
433    dfmod = 2
434    N = np.sum( ~np.isnan(x) * ~np.isnan(y) )
435
436    result = dict( slope         = b,
437                intercept        = a,
438                slope_ste        = sigb,
439                intercept_ste    = siga,
440                slope_interval   = [None,None],
441                intercept_interval = [None,None],
442                df_model         = dfmod,
443                df_resid         = N-dfmod,
444                params           = np.array([b,a]),
445                nobs             = N,
446                fittedvalues     = a + b * x,
447                resid            = a + b * x - y )
448
449    return result

York regression accounting for error in x and y Follows the notation and algorithm of York et al. (2004) Section III

Parameters
  • x, y (ndarray): dependent (x) and independent (y) variables for fitting
  • err_x, err_y (ndarray (default=1)): standard deviation of errors/uncertainty in x and y
  • rerr_xy (float (default=0)): correlation coefficient for errors in x and y, default to rerr_xy=0 meaning that the errors in x are unrelated to errors in y err_x, err_y, and rerr_xy can be constants or arrays of the same length as x and y
Returns
  • fitresult (dict): Contains the following keys:
    • slope (float) Slope or Gradient of Y vs. X
    • intercept (float) Y intercept.
    • slope_ste (float) Standard error of slope estimate
    • intercept_ste (float) standard error of intercept estimate
    • slope_interval ([float, float]) confidence interval for gradient at confidence level cl
    • intercept_interval ([float, float]) confidence interval for intercept at confidence level cl
    • df_model (float) degrees of freedom for model
    • df_resid (float) degrees of freedom for residuals
    • params ([float,float]) array of fitted parameters