THE NATURE OF FINANCIAL DATA


A vast amount of effort is spent on producing complex market models. However, comparatively little resource is directed toward the other essential component of modelling--the test data. Here we discuss some of the common problems encountered when testing models with financial data.

 

WHY THERE IS NEVER ENOUGH...

In financial mathematics there is never enough data. There are several reasons for this.

* The data may exist, but be spread between firms unwilling to share it.

* Data may never have been collected.

* There may be only a few data points in existence, for example interest rate volatility smiles. In forex markets, the smile is readily apparent, but for interest rate options, there is only enough data to be reasonably certain that the volatility smile exists, and not enough to do detailed testing.

* There may be plenty of data but only some is relevant. It is usually considered that only data which is fairly recent is sufficiently relevant to be of use. If this is true, then there is always a restricted amount of data which may reasonably be used--and unlike data collection in the physical sciences, it is not possible to go back and collect more data.

 

DATA ERRORS

Any time series of data will contain errors. These result from a variety of causes including capture problems, incorrect input and market problems. Careful data analysis at the time of capture will reduce the number of these errors, and subsequent error checks upon the series will gradually capture remaining errors. It is unwise to assume that data series will ever be entirely 'clean', and the results of any analysis should always bear the possibility of data error in mind.

WHAT TO DO WITH A BAD DATA POINT

If a bad data point is identified, there are four possible courses of action. We list them in descending order of preference.

* Replace the bad data point with a correct one, if available.

* Remove the data point. This is quite acceptable, as long as any subsequent analysis is adjusted accordingly--for example, if a single point is removed from a series of 10, averages are then computed by dividing by nine.

* Replace the data point with the same value as the previous data point. This should not be done too frequently.

* Interpolate between the surrounding data points. This should only be done when there is no alternative. A series with many interpolated points is not a useful series.
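The last three treatments can be sketched as follows. This is a minimal illustration with made-up rate values (the spike at index 4 is hypothetical); note the divisor in the average after removal.

```python
# Sketch of handling a bad data point in a rate series.
# The series and the spike at index 4 are illustrative values only.

def mean_after_removal(series, bad_index):
    """Average after removing the bad point: divide by the reduced
    count (nine for a series of ten), not the original length."""
    kept = [x for i, x in enumerate(series) if i != bad_index]
    return sum(kept) / len(kept)

def carry_forward(series, bad_index):
    """Replace the bad point with the previous point's value
    (acceptable, but not too frequently)."""
    fixed = list(series)
    fixed[bad_index] = fixed[bad_index - 1]
    return fixed

def interpolate(series, bad_index):
    """Linear interpolation between the surrounding points --
    a last resort."""
    fixed = list(series)
    fixed[bad_index] = (fixed[bad_index - 1] + fixed[bad_index + 1]) / 2.0
    return fixed

rates = [5.1, 5.2, 5.0, 5.3, 50.0, 5.2, 5.1, 5.3, 5.2, 5.1]  # 50.0 is the spike
avg = mean_after_removal(rates, 4)
```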

TYPES OF DATA ERRORS

There are different types of data errors, some of which are more easily detected than others. We list them below, together with some ways of detecting them.

Large spikes and missing data are easily seen, but non-obvious spikes are very difficult to spot. We outline below a method for checking interest rate data, and one for checking forex rate data.

To check for small spikes in interest rate data, collect all the data series which make up the yield curve. Then fit each day of this yield curve data to a simple yield curve model, and store the sum of the squares of the residuals for each day in a series of its own. If any one of the series from which the yield curve is constructed contains a data error, it will show up as a large, uncharacteristic spike in this series of residual sums. The data for that day may then be examined more closely to find which of the interest rate series contains the error (see Figures 1 and 2).
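A hypothetical sketch of this residual-spike check follows. The quadratic fit stands in for whatever simple yield curve model is preferred, and the curves, maturities and the tenfold-over-median flagging threshold are all illustrative assumptions, not the article's own parameters.

```python
import numpy as np

def daily_ssr(maturities, curves, degree=2):
    """Fit each day's yield curve to a simple model (here a quadratic
    in maturity) and return the sum of squared residuals per day."""
    ssr = []
    for y in curves:
        coeffs = np.polyfit(maturities, y, degree)
        resid = np.asarray(y) - np.polyval(coeffs, maturities)
        ssr.append(float(np.sum(resid ** 2)))
    return ssr

def flag_suspect_days(ssr, factor=10.0):
    """Days whose residual sum sits far above the typical (median)
    level are candidates for closer inspection."""
    med = float(np.median(ssr))
    return [i for i, s in enumerate(ssr) if s > factor * med]

# Illustrative data: ten smooth curves, with a hidden spike on day 3.
maturities = np.array([0.25, 0.5, 1.0, 2.0, 3.0, 5.0, 7.0, 10.0])
curves = [4.0 + (0.5 + 0.001 * d) * np.sqrt(maturities) for d in range(10)]
curves[3][2] += 1.0  # a non-obvious error in one rate on day 3
suspects = flag_suspect_days(daily_ssr(maturities, curves))
```

The spike barely changes the look of day 3's curve, but its fit residuals stand out clearly against the other days'.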

To check for similar small spikes in forex rate data, we need several series. If we have, for example, the USD/DEM rate, the USD/GBP rate and the GBP/DEM rate, then we know they should be internally consistent; if they are not, one of them must contain an error. For this kind of check to work, all the series must be captured within a very short time period, as forex rates move quickly. Small inconsistencies may simply be due to timing differences, and should be allowed for.
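A minimal sketch of this triangle check, with illustrative quotes: here each rate X/Y is taken as units of Y per unit of X, so the quoted USD/DEM rate should equal USD/GBP multiplied by GBP/DEM, and the 0.1% tolerance is an assumed allowance for timing differences.

```python
def cross_rate_error(usd_dem, usd_gbp, gbp_dem):
    """Relative gap between the quoted USD/DEM rate and the rate
    implied by the other two legs of the triangle."""
    implied = usd_gbp * gbp_dem
    return abs(usd_dem - implied) / implied

def flag_inconsistent(usd_dem, usd_gbp, gbp_dem, tol=0.001):
    """Indices where the triangle fails by more than `tol` -- the
    tolerance (0.1% here, illustrative) allows for small timing
    differences between the captured quotes."""
    return [i for i in range(len(usd_dem))
            if cross_rate_error(usd_dem[i], usd_gbp[i], gbp_dem[i]) > tol]

# Illustrative quotes: the third observation contains an error.
usd_dem = [1.7080, 1.7000, 1.7500]
usd_gbp = [0.6100, 0.6080, 0.6100]
gbp_dem = [2.8000, 2.7960, 2.8000]
bad = flag_inconsistent(usd_dem, usd_gbp, gbp_dem)
```

The second observation is slightly off triangle but within the timing tolerance; only the third is flagged.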

* Autocorrelation errors. Data within a day is correlated with itself: a data point at 10 am will be correlated, on average, with a data point at 3 pm the same day. To avoid picking up these spurious correlations, it is important that daily data is collected at the same time each day.

* Manufactured data errors. Some series, like zero coupon interest rates, are not captured directly from market data but are created from it and then stored. This can cause problems when the calculation method is changed. If a new calculation method is introduced, all manufactured data series should be 'back-filled'--recalculated right from their start date--so that their properties do not change at the date the new method came in.
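As a hypothetical illustration of back-filling, suppose a stored zero-coupon series switches from simply-compounded to continuously-compounded rates (both conventions here are assumptions for the example, not the article's). The entire history is recomputed with the new method, rather than switching conventions mid-series:

```python
import math

def zero_simple(df, t):
    """Old method (assumed for the example): simply-compounded zero rate."""
    return (1.0 / df - 1.0) / t

def zero_continuous(df, t):
    """New method (assumed): continuously-compounded zero rate."""
    return -math.log(df) / t

def backfill(discount_factors, t, method):
    """Recompute the manufactured series over its entire history with
    the current method, so its statistical properties do not jump at
    the date the calculation method changed."""
    return [method(df, t) for df in discount_factors]

dfs = [0.95, 0.94, 0.96]              # stored one-year discount factors
clean = backfill(dfs, 1.0, zero_continuous)
```

A series built with `zero_simple` up to the changeover date and `zero_continuous` afterwards would show a spurious shift in level at that date; the back-filled series does not.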

Clearly, ensuring clean and accurate data is not a trivial job. However, ignoring data quality problems all too frequently leads to practical proof of the old computing maxim 'garbage in, garbage out.'

This week's Learning Curve was written by Jessica James, from First Chicago NBD's strategic risk management group in London.
