Effects of Missing Data on Estimates of Monthly Mean General Circulation Statistics

Abstract
To assess the impact of missing data on general circulation statistics, an analysis has been carried out on the effect of systematically decreasing the amount of available data. The accuracy of monthly mean analyses of a variety of parameters, including means, variances, and covariances, is determined from ECMWF data as a function of the number of twice-daily analyses included in the monthly mean. Because the standard error of a monthly mean depends on the standard deviation of the daily values and the effective number of independent observations during each month, results have been expressed as the ratio of the root-mean-square error (RMSE) in the monthly mean to the daily standard deviation, thus allowing fairly universal relationships to be developed for application to many parameters, and to different seasons, latitudes, and longitudes. Results are indeed dependent upon the number of observations and the autocorrelation within each series, but could be modeled sufficiently well, for this purpose, with a first-order autoregressive (Markov) process to allow simulated data to be used for further tests. Experiments have been carried out varying the number of missing observations that were 1) evenly spaced, 2) randomly spaced, or 3) occurring in single blocks. If missing observations are randomly spaced, the RMSE increases by factors of 2–3 over equally spaced data and there is virtually no advantage due to autocorrelation in the data. If the missing data occur in one block, another increase in RMSE occurs by a factor of up to 2. The zonal mean daily standard deviations for the horizontal wind components, vertical p-velocity ω, geopotential height, and temperature are presented for January. Tables are given of the ratio of the RMSE of the monthly mean to the daily standard deviation, and along with the autocorrelation, these allow estimates of either the relative or absolute errors to be expected in monthly mean statistics.
For more persistent variables, fewer observations per month are needed to accurately define a monthly mean. Consequently, for variables such as ω, in which the persistence in the series is weak, up to twice as many observations are needed to bring about the same RMSE ratio as for other linear variables. The RMSE ratios for variances and covariances are only slightly greater than those obtained for highly correlated linear variables such as the wind components u and v. As an example, for the zonal wind component at 300 hPa, 11 observations equally distributed in time would produce an RMSE in the monthly mean wind of ∼2.4 m s⁻¹. If the same number of observations were randomly distributed, the RMSE would increase to 4.1 m s⁻¹; or if 20% of the observations were missing in one block, the RMSE would be 2.4 m s⁻¹. For the monthly mean poleward momentum flux by the transient eddies, the expected RMSE would be up to 27 m² s⁻² for 11 randomly distributed observations.
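
The kind of simulated-data experiment described above can be sketched roughly as follows. This is a minimal illustration, not the procedure used in the study: it assumes a unit-variance first-order autoregressive series of 62 twice-daily values per month, an illustrative lag-one autocorrelation of 0.8, and hypothetical helper names (`simulate_ar1`, `rmse_ratio`, `compare_sampling`). The RMSE is taken here as the error of the subsample mean relative to the mean of the complete month, divided by the pooled daily standard deviation.

```python
import numpy as np


def simulate_ar1(n_months, n_obs, r1, rng):
    """Simulate independent unit-variance AR(1) series, one row per month."""
    x = np.empty((n_months, n_obs))
    x[:, 0] = rng.standard_normal(n_months)
    innov_sd = np.sqrt(1.0 - r1 ** 2)  # keeps the series variance at 1
    for t in range(1, n_obs):
        x[:, t] = r1 * x[:, t - 1] + innov_sd * rng.standard_normal(n_months)
    return x


def rmse_ratio(x, mask):
    """RMSE of the subsample monthly mean (relative to the full monthly
    mean), divided by the pooled standard deviation of the individual values.
    `mask` is True where an observation is available."""
    mask = np.broadcast_to(mask, x.shape)
    sub_mean = (x * mask).sum(axis=1) / mask.sum(axis=1)
    err = sub_mean - x.mean(axis=1)
    return np.sqrt(np.mean(err ** 2)) / x.std()


def compare_sampling(r1=0.8, n_obs=62, n_keep=11, n_months=5000, seed=0):
    """Compare evenly spaced, randomly spaced, and single-block sampling.
    All parameter values are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    x = simulate_ar1(n_months, n_obs, r1, rng)
    t = np.arange(n_obs)

    # 1) Evenly spaced: n_keep observations spread across the month.
    even = np.zeros(n_obs, dtype=bool)
    even[np.round(np.linspace(0, n_obs - 1, n_keep)).astype(int)] = True

    # 2) Randomly spaced: a different random subset of n_keep per month.
    perm = rng.random((n_months, n_obs)).argsort(axis=1)
    random_mask = perm < n_keep

    # 3) Single block: one contiguous gap, random start, same number kept.
    n_missing = n_obs - n_keep
    start = rng.integers(0, n_obs - n_missing + 1, size=n_months)[:, None]
    block = ~((t >= start) & (t < start + n_missing))

    return {
        "even": rmse_ratio(x, even),
        "random": rmse_ratio(x, random_mask),
        "block": rmse_ratio(x, block),
    }


if __name__ == "__main__":
    for name, ratio in compare_sampling().items():
        print(f"{name:>6}: RMSE / daily std = {ratio:.3f}")
```

Multiplying a ratio of this kind by an observed daily standard deviation gives an absolute error estimate in the units of the variable; the values produced by this sketch depend on the assumed autocorrelation and should not be read as the tabulated results.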