09 August 2010

Skill in Prediction, Part IIb, The Naive Prediction

OK, I received 14 usable, independent naive predictions in response to my request (thanks, all), which you can see in the graph above.  The time series is of course the CRU global temperature anomaly (for January, multiplied by 100 and then with 100 added, just to throw you off ;-).
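(For the curious, the disguise is a one-line transformation and is easy to undo.  Here is a minimal sketch in Python, using made-up values rather than the actual CRU series:)

    # Sketch of the disguise applied to the anomalies (values here are illustrative, not the CRU data)
    def disguise(anomalies_c):
        # anomaly in degrees C -> multiply by 100, then add 100
        return [round(a * 100 + 100) for a in anomalies_c]

    def undisguise(disguised):
        # reverse the transformation to recover the anomaly in degrees C
        return [(d - 100) / 100.0 for d in disguised]

    print(disguise([-0.25, 0.10, 0.42]))   # [75, 110, 142]
    print(undisguise([75, 110, 142]))      # [-0.25, 0.1, 0.42]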

As you can see, I received a very wide range of proposed naive forecasts.  The red line shows the average of the various forecasts. It has only a very small trend.

I was motivated to do this little experiment after reading Julia Hargreaves' recent essay on the skill of Jim Hansen's 1988 Scenario B forecast.  She used a naive forecast of simply extending the final value of the observed data into the future as her baseline (which several people suggested in this exercise).  Not surprisingly, she found that Hansen's forecast was skillful compared to this baseline, as shown below.

My initial reaction was that this naive forecast set far too low a threshold for a skill test.  But then again, I knew what the dataset was and how history had played out, so I could simply be reflecting my own biases in that judgment.  I therefore decided to conduct this little blog experiment as a blind test, using my readers to see what might result (and thanks to Dr. Hargreaves for the CRU data that she used).  What you readers came up with is little different from what Hargreaves used.

Upon reflection, it would have been better to stop the CRU data in 1988 when asking for the naive forecasts.  However, given that the dataset through 2009 has more of a trend than the data through 1988, I can conclude that Hargreaves' use of a zero-trend baseline is certainly justifiable.  But, as you can see from the spread of naive forecasts in the figure at the top, many other possible naive trends are also justifiable.
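To make the two kinds of naive baseline concrete, here is a minimal sketch in Python.  The numbers are made-up stand-ins for an observed series (they are not the CRU data, and nothing here is taken from Hargreaves' analysis); it simply contrasts the zero-trend persistence forecast with a linear-trend extrapolation:

    import numpy as np

    # Illustrative stand-in for an observed annual series (NOT the real CRU data)
    years = np.arange(1970, 1989)
    rng = np.random.default_rng(0)
    obs = 0.01 * (years - 1970) + rng.normal(0.0, 0.1, years.size)

    horizon = np.arange(1989, 2010)

    # Naive baseline 1: persistence (zero trend) -- carry the last observed value forward
    persistence = np.full(horizon.size, obs[-1])

    # Naive baseline 2: extrapolate a straight line fitted to the observed history
    slope, intercept = np.polyfit(years, obs, 1)
    trend_forecast = slope * horizon + intercept

Both are "naive" in the sense that they use no physics at all, yet they can imply quite different futures, and therefore quite different hurdles for a forecast to clear.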

This helps to illustrate the fact that the selection of the naive trend against which to measure skill is in many respects arbitrary and almost certainly influenced by extra-scientific considerations.  Were I a forecaster whose salary depended on a skillful forecast, I'd certainly argue for an easy-to-beat metric of skill!  You can imagine all sorts of such issues becoming wrapped up in a debate over the appropriate naive forecast used to determine skill.
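For those who like numbers, "skill" here is usually quantified with a skill score of the form SS = 1 - MSE(forecast) / MSE(naive baseline), so that SS > 0 means the forecast beats the baseline.  A small sketch (made-up values, not from any real evaluation) shows how the same forecast can look brilliant against one naive baseline and worthless against another:

    import numpy as np

    def skill_score(forecast, naive, observed):
        # SS = 1 - MSE(forecast) / MSE(naive); positive means the forecast beats the baseline
        mse_f = np.mean((np.asarray(forecast) - np.asarray(observed)) ** 2)
        mse_n = np.mean((np.asarray(naive) - np.asarray(observed)) ** 2)
        return 1.0 - mse_f / mse_n

    # Made-up illustration: the same forecast judged against two different naive baselines
    observed = np.array([0.20, 0.25, 0.35, 0.40, 0.45])
    forecast = np.array([0.22, 0.28, 0.30, 0.38, 0.50])

    easy_baseline = np.full(5, 0.20)              # persistence of the last pre-forecast value
    harder_baseline = 0.20 + 0.06 * np.arange(5)  # a naive upward trend

    print(skill_score(forecast, easy_baseline, observed))    # ~0.95: looks very skillful
    print(skill_score(forecast, harder_baseline, observed))  # negative: loses to this baseline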

Some lessons from this exercise:

1. In situations where an evaluation of skill is being conducted and there are no well-established naive baselines, it is probably not a good idea for the same person or group doing the evaluation to also come up with the naive forecast against which skill will be judged -- especially after the forecast has been made and the observations collected.

2. Related, metrics of skill should be negotiated and agreed upon at the time that a forecast is issued, to avoid this sort of problem.  Those in the climate community who issue long-term forecasts (such as those associated with climate change) generally do not have a systematic approach to forecast verification.

Evaluating the skill of forecasts is important, even if it takes a long time for that judgment to occur.