Some Case Studies

1.Academia das Ciências de Lisboa (Lisbon Academy of Sciences), Lisbon, Portugal.
1.Academia das Ciências de Lisboa (Lisbon Academy of Sciences), Lisbon, Portugal.
This chapter deals with the statistical analysis of data considered in the parts: univariate extremes data and multivariate extremes data related to stochastic processes and sequence of extremes connected to multivariate extremes data. The case studies of maximum wind speed data in Lisbon, flood discharges of the North Saskachevan River at Edmonton and flood discharges of the Fox River at Berlin and Wrightstown are analysed.
Univariate extremes data , Multivariate extremes data , Multivariate extremes , Exceedance probabilities
We will deal here with the statistical analysis of data considered in Parts 2 (univariate extremes data) and 3 (multivariate extremes data); the situation related to stochastic processes and sequence of extremes (still in its infancy) must, in general, be connected to multivariate extremes data and was, in part, sketched in the Chapters of Part 4 and Annex 6 and Chapter 16 (GEA).
We will begin by analysing some univariate data and then, presupposing the analysis technique, deal with bivariate and multivariate data. We will not of course detail the computation of the estimators of parameters, of quantiles and of exceedance probabilities because they follow directly from the formulae and examples given in Parts 2 and 3.
The data that follow correspond to maximum wind speeds in Lisbon between 1941 and 1970[1]
The data for exploratory and analytical study are contained in Table 17.1.
Year |
Km/h |
Year |
Km/h |
Year |
Km/h |
1941 |
129 |
1951 |
96 |
1961 |
86 |
1942 |
117 |
1952 |
72 |
1962 |
91 |
1943 |
100 |
1953 |
98 |
1963 |
96 |
1944 |
100 |
1954 |
85 |
1964 |
89 |
1945 |
132 |
1955 |
124 |
1965 |
90 |
1946 |
94 |
1956 |
108 |
1966 |
89 |
1947 |
108 |
1957 |
102 |
1967 |
89 |
1948 |
113 |
1958 |
102 |
1968 |
84 |
1949 |
96 |
1959 |
112 |
1969 |
107 |
1950 |
113 |
1960 |
107 |
1970 |
111 |
a. Quick exploration of extremes data:
The calculation of the statistic
\(\mathrm{ Q=\frac{x_{n}^{̓}-x_{ \left[ {n}/{2} \right] +1}^{̓}}{x_{ \left[ {n}/{2} \right] +1}^{̓}-x_{1}^{̓}}=\frac{x_{max}-x_{med}}{x_{med}-x_{min}} }\),
as \(\mathrm{ x_{max}=132,x_{med}=100,x_{min}=72 }\) gives
\(\mathrm{ Q_{30}=1.1428571 }\).
As for \(\mathrm{ \theta =0 }\) we have, for \(\mathrm{ n=30 }\),
\(\mathrm{ \beta _{30,0}=\frac{log~30+log \left( log~2 \right) }{log\,log~30-log\,log~2}=1.9078381 }\)
and \(\mathrm{ \alpha _{30,0}=\frac{1}{log\,log~30}=.8169083 }\),
we see that
\(\mathrm{ \frac{Q_{30}- \beta _{30,0}}{ \alpha _{30,0}}=-.9364343 }\).
Recall that \(\mathrm{ \left( Q_{n}- \beta _{n,0} \right) / \alpha _{n,0} }\) has asymptotically a reduced Gumbel distribution if the i.i.d. sample has the same distribution we see that \(\mathrm{ \left( Q_{30}- \beta _{30,0} \right) / \alpha _{30,0} }\) is between the bounds for shortest 95% - probability interval for the Gumbel distribution and therefore we can, initially, assume it as the approximate distribution of maxima.
For the plotting we must know \(\mathrm{ x_{ \left[ .367n \right] +1}^{̓}=x_{12}^{̓}=96 }\) and \(\mathrm{ x_{ \left[ .692n \right] +1}^{̓}=x_{21}^{̓}=108 }\) and so the plot in the statistical choice modification of Gumbel probability papers gives the graph Figure 17.1 that follows, agreeing with the use of the Gumbel model.
b. Analytical choice of the model:
Once having seen a general trend for the interpretation of data as (approximately) described by the Gumbel distribution, let us confirm (or reject) it by analytic study.
The ML estimators of \(\mathrm{ \left( \lambda , \delta \right) }\) are
\(\mathrm{ \hat{\lambda} =94.71, \hat{\delta} =12.49 }\),
and so the statistical choice statistic has the value
\(\mathrm{ \hat{V}_{30}=-9.63 }\).
Consequently, as \(\mathrm{ \hat{V}_{n}/\sqrt[]{2.09797\times n} }\) has the value,
\(\mathrm{ \hat{V} _{30}/\sqrt[]{2.09797\times 30}=-1.21 }\),
which is in the acceptance region for the asymptotically normal distribution of \(\mathrm{ \hat{V}_{n}/\sqrt[]{n~ \hat{\sigma} _{0}^{2}} }\), thus confirming the exploratory analysis. For the value of \(\mathrm{ \hat{V}_{n}/\sqrt[]{n~ \hat{\sigma} _{0}^{2}} =-1.21 }\), obviously, we do not need to go to the refined bounds which are, for \(\mathrm{ \alpha =.05 }\) and to the order \(\mathrm{ O ( n^{-1} ) }\), \(\mathrm{ b_{30}=-~1.71 }\) and \(\mathrm{ a_{30}=2.91 }\), confirming the decision.
It would be useful to divide the sample into two parts (about 2/3 and 1/3) and confirm the results.
To a certain extent this was done in the papers by Tiago de Oliveira (1981) and Fransén and Tiago de Oliveira (1984): in the former, data for greatest ages of death for men and for women in Sweden are studied for the years 1905/1958, and in the latter corresponding data for a larger set of years (1905/1970); and the results are coherent.
Other cases have been considered in the paper by Fransén and Tiago de Oliveira (1984) and can be considered as exercises. We will only follow another case leading to a different decision — the choice of the Fréchet distribution.
It would also be useful to read Whitmore et al. (1987) where data for maximum wind speeds are analysed for Canada.
In this case we consider not only the raw data but, as a consequence of the rejection of the Gumbel distribution as a possible model, their logarithms. The ordered raw data, for a period of 47 years in \(\mathrm{ 1000~ft^{3}/s }\), is[2]
(Table 17.2):
i |
xi |
i |
xi |
i |
xi |
i |
xi |
1 |
19.885 |
13 |
30.380 |
25 |
40.400 |
37 |
61.740 |
2 |
20.940 |
14 |
31.500 |
26 |
42.250 |
38 |
65.440 |
3 |
21.820 |
15 |
32.600 |
27 |
44.020 |
39 |
65.597 |
4 |
23.700 |
16 |
32.680 |
28 |
44.730 |
40 |
66.000 |
5 |
24.88 |
17 |
34.400 |
29 |
44.900 |
41 |
74.100 |
6 |
25.460 |
18 |
35.347 |
30 |
46.300 |
42 |
75.800 |
7 |
25.760 |
19 |
35.700 |
31 |
50.330 |
43 |
84.100 |
8 |
26.720 |
20 |
38.100 |
32 |
51.442 |
44 |
106.600 |
9 |
27.500 |
21 |
39.020 |
33 |
57.220 |
45 |
109.700 |
10 |
28.100 |
22 |
39.200 |
34 |
58.700 |
46 |
121.970 |
11 |
28.600 |
23 |
40.000 |
35 |
58.800 |
47 |
185.560 |
12 |
30.200 |
24 |
40.400 |
36 |
61.200 |
|
|
a. Quick exploration of extremes data:
We will avoid the graphical analysis, which is as before, but will only consider the statistic
\(\mathrm{ Q_{47}=\frac{x_{max}-x_{med}}{x_{med}-x_{min}} }\)
which as \(\mathrm{ x_{max}=185.560 }\), \(\mathrm{ x_{med}=~40.400 }\) and \(\mathrm{ x_{min}=~19.885 }\) takes the value \(\mathrm{ Q_{47}=7.0757982~ }\); we have \(\mathrm{ \beta _{47,0}=2.0317188 }\) and of \(\mathrm{ \alpha _{47,0}=.7417784 }\). So
\(\mathrm{ \frac{Q_{47}- \beta _{47,0}}{ \alpha _{47,0}}=6.7999815 }\),
thus rejecting the acceptance of the Gumbel distribution as an approximate model.
b. Analytical choice of the model:
The values of \(\mathrm{ ( \hat{\lambda} ,\hat{ \delta} ) }\) are \(\mathrm{ \hat{\lambda}=38.151 }\) and \(\mathrm{ \hat{ \delta }=17.740 }\), thus giving \(\mathrm{ \hat{ V}_{47}=37.77531 }\) and then \(\mathrm{ \hat{ V}_{47} /\sqrt[]{2.09797 \times 47}=3.804 }\), outside the acceptance region of the asymptotically normal distribution of \(\mathrm{ \hat{ V}_{47}/\sqrt[]{n \hat{\sigma }_{0}^{2}} }\) at a level even smaller than 2%. But this value leads to the Fréchet distribution in the statistical choice procedure.
c. Analysis of the logarithms of raw data:
The value of \(\mathrm{ \left( Q_{47}- \beta _{47,0} \right) / \alpha _{47,0} }\) and of \(\mathrm{ \hat{V}_{47}/\sqrt[]{2.09797 \times 47} }\) suggest the relevance of trying the Fréchet distribution. Denoting by \(\mathrm{ ’x=log~x }\) we get \(\mathrm{ ’x_{max}=5.2234 }\), \(\mathrm{ ’x_{med}=3.6988 }\) and \(\mathrm{ ’x_{min}=2.9900 }\) and \(\mathrm{ ’Q47=2.1509594 }\) and so \(\mathrm{ \left( ’Q_{47}- \beta _{47,0} \right) / \alpha _{47,0}=.1607495 }\), which leads us to accept the Gumbel distribution for \(\mathrm{ log~x }\) and thus suggests the use of the Fréchet approximation.
Let us confirm this analytically. We have \(\mathrm{ \hat{ ’ \lambda }=3.5489 }\), \(\mathrm{ \hat{ ’\delta} =.3963 }\), \(\mathrm{ \hat{ ’V}_{47}=-1.95849 }\) and \(\mathrm{ \hat{ ’V}_{47}/\sqrt[]{2.09797 \times 47}=-1.197 }\), thus leading to the acceptance of Gumbel distribution for the logarithms of raw data and thus the Fréchet distribution for raw data, as seen in (b).
In this analysis we will use flood data the Fox river at two points of its course: Berlin (upstream) and Wrightstown (downstream), both in Wisconsin, USA[3]
Year |
Berlin |
Wrightstown |
Year |
Berlin |
Wrightstown |
1000 ft3/s |
1000 ft3/s |
1000 ft3/s |
1000 ft3/s |
||
1918 |
6.05 |
16.3 |
1935 |
4.34 |
11.1 |
1919 |
2.67 |
13.1 |
1936 |
4.34 |
6.3 |
1920 |
5.15 |
16.6 |
1937 |
3.26 |
13.5 |
1921 |
2.45 |
14.2 |
1938 |
6.19 |
18.0 |
1922 |
5.95 |
20.1 |
1939 |
4.91 |
18.2 |
1923 |
6.05 |
13.7 |
1940 |
4.72 |
17.5 |
1924 |
4.02 |
15.5 |
1941 |
3.54 |
16.6 |
1925 |
2.52 |
8.3 |
1942 |
2.74 |
19.8 |
1926 |
3.44 |
9.1 |
1943 |
5.08 |
21.3 |
1927 |
3.17 |
13.3 |
1944 |
2.29 |
10.8 |
1928 |
5.92 |
15.1 |
1945 |
3.46 |
15.8 |
1929 |
6.62 |
20.6 |
1946 |
6.90 |
21.3 |
1930 |
3.00 |
6.6 |
1947 |
3.16 |
11.0 |
1931 |
1.14 |
3.1 |
1948 |
4.54 |
10.3 |
1932 |
1.91 |
9.9 |
1949 |
2.00 |
6.4 |
1933 |
2.60 |
8.9 |
1950 |
4.63 |
10.9 |
1934 |
1.91 |
6.7 |
|
|
|
We can, evidently, expect dependence between the yearly flood discharges at Berlin (B) and Wrightstown (W). The graphs on the probability paper show that the Gumbel distribution fits well to the observed marginal flood discharges. The ML estimators are \(\mathrm{ \hat{\lambda }_{B}=3.21070, \hat{\delta} _{B}=1.33880 }\), \(\mathrm{ \hat{ \lambda} _{W}=10.89643 }\) and \(\mathrm{ \hat{ \delta} _{W}=4.61371 }\) and the empirical Kolmogoroff-Smirnov statistics are \(\mathrm{ KS_{B}=.07462 }\) and \(\mathrm{ {KS}_{W}=.09957 }\) with \(\mathrm{ \sqrt[]{n}~{KS}_{B}=.429 }\) and \(\mathrm{ \sqrt[]{n}~{KS}_{W}=.572 }\) accepting thus the Gumbel model for the margins. Evidently the better analysis is to use the \(\mathrm{ \hat{V}_{n} }\) statistics for each margin.
The elements for our statistical decision are the values of the estimated correlation coefficient \(\mathrm{ \rho ^{*}=.693, }\)the medians \(\mathrm{ \tilde{\mu} _{B}=3.54 }\) and \(\mathrm{ \tilde{ \mu} _{W}=13.5 }\) and the number \(\mathrm{ N=24}\) of pairs \(\mathrm{ \left( x_{i},y_{i} \right) }\) such that \(\mathrm{ x_{i}> \tilde{ \mu} _{B} ~ }\) and \(\mathrm{ y_{i}> \tilde{ \mu}_{W} }\) (first quadrant) or \(\mathrm{ x_{i}< \tilde{ \mu} _{B} }\) and \(\mathrm{ y_{i}< \tilde{ \mu}_{W} }\) (third quadrant). This value of \(\mathrm{ N }\) is not coincident with the total of the first and third quadrants (22) given in Gumbel and Mustafi (1967) because the authors, instead of computing the medians directly from the sample, estimate the margin parameters and from them compute estimated medians. The test of independence, as \(\mathrm{ \sqrt[]{n}~ \rho ^{*}=\sqrt[]{33} \times .693=3.981 }\)is larger than the values \(\mathrm{ \lambda _{0.05}=1.64 }\) and \(\mathrm{ \lambda _{0.01}=2.33 }\), leads to rejection of independence at level 1%. Consequently we will have to estimate \(\mathrm{ \theta }\).
Let us consider first the mixed model. As the larger value of \(\mathrm{ ~ \rho ^{*} }\) in the mixed case is 2/3, we have no point estimator of \(\mathrm{ \theta }\) by this method (the best being evidently \(\mathrm{ \theta ^{*}=1 }\), allowing for sampling errors).
The relation \(\mathrm{ p^{**}=p_{M} \left( \theta ^{**} \right) =2^{-2+{ \theta ^{**}}/{2}}=\frac{N}{2~n}=\frac{24}{2 \times 33}=.36364 }\) gives \(\mathrm{ \theta ^{**}= 1.0 81 }\) and thus is not solvable for \(\mathrm{ \theta ^{**} }\) as we should have \(\mathrm{ 0 \leq \theta ^{**} \leq 1 }\); also note that the maximum value of \(\mathrm{ p_{M} \left( \theta \right) }\) is \(\mathrm{ p_{M} \left( 1 \right) =.35355<\frac{24}{2 \times 33}=.36364 }\). The mixed model must therefore be rejected.
Consider now the logistic model. As \(\mathrm{ 0 \leq \rho _{L} \left( \theta \right) \leq 1 }\) the equation \(\mathrm{ \rho _{L} \left( \theta ^{*} \right) = \theta ^{*} \left( 2- \theta ^{*} \right) =.693 }\) can be solved giving \(\mathrm{ \theta ^{*}=446 }\). As \(\mathrm{ p_{L} \left( \theta \right) =2^{-2^{1- \theta }} }\), the equation \(\mathrm{ p^{**}=p_{L} \left( \theta ^{**} \right) =\frac{N}{2n}=.36364 }\) gives \(\mathrm{ \theta ^{**}=.4546 }\), which is close to the previous estimate, justifying to a certain extent the use of the logistic model.
Note that this is a margin (location-dispersion) parameter-free approach. As said, Gumbel and Mustafi (1967) used estimators of the margin parameters and computed the estimated reduced difference, comparing it, by the Kolmogorov-Smirnov test, with the logistic and confirming the suggestion to use the logistic model as an approximation.
Recall that estimation of the margin parameters involves as yet unsolved problems with the use of the Kolmogorov-Smimov test, although it gives a hint for the logistic model, as stated.
The 5% asymptotic confidence interval is given by
\(\mathrm{ \vert p_{L} \left( \theta \right) -p^{**} \vert \leq 1.96\sqrt[]{p^{**} \left( .5-p^{**} \right) }/\sqrt[]{n} }\) which is, in our case,
\(\mathrm{ \vert p_{L} \left( \theta \right) -.36364 \vert \leq .07598~or~.28766 \leq p_{L} \left( \theta \right) \leq .43962 }\) and so
\(\mathrm{ .154 \leq \theta \leq .754 }\), a very large (asymptotic) confidence interval.
Let us now consider the biextremal model. The equation \(\mathrm{ \rho _{B} \left( \theta ^{*} \right) =.693 }\) gives \(\mathrm{ \theta ^{*}=.55 }\). The solution from \(\mathrm{ p_{B} \left( \theta ^{**} \right) =2^{-2+ \theta ^{**}}=.36364 }\) is \(\mathrm{ .54 1 }\); the approximate confidence interval for the significance level 5% is \(\mathrm{ \vert p_{B} \left( \theta \right) -.36364 \vert \leq .07598~is~.202 \leq \theta \leq .814 }\).
For the Gumbel model the equation \(\mathrm{ \rho _{G} \left( \theta ^{*} \right) =.693 }\) gives \(\mathrm{ \theta ^{*}=.723 }\) and the confidence interval from \(\mathrm{ p_{G} \left( \theta \right) =p_{B} \left( \theta \right) }\) is also \(\mathrm{ .202 \leq \theta \leq .814 }\), which suggests that the Gumbel model is not appropriate to this problem, the solution of \(\mathrm{ \rho \left( \theta ^{*} \right) =.6846 }\) being close to the upper bound of the confidence interval; in fact the model arises naturally in questions connected with failures. A simple check can be the estimation of \(\mathrm{ \theta }\)from the use of Kendall’s \(\mathrm{ \tau_{G}= \theta / \left( 2- \theta \right) }\).
The question of discriminating between models, as for the remaining ones (logistic, biextremal) in this example is, as yet, an unsolved problem. Here, though, the solution is simple.
As in the biextremal model we have for the reduced values \(\mathrm{ Y=X }\) with probability \(\mathrm{ \theta }\) and \(\mathrm{ Y>X }\) with probability \(\mathrm{ 1- \theta }\), we can expect that, roughly, \(\mathrm{ n~ \theta }\) observations \(\mathrm{ \left( x_{i},y_{i} \right) }\) should be in a straight line\(\mathrm{ y=ax+b }\) and rest should be such that \(\mathrm{ y>ax+b }\). In our case, as \(\mathrm{ \theta ^{*}=.541 }\) and \(\mathrm{ n= 33 }\) we should have roughly \(\mathrm{ {.541} \times 33 \approx18 }\) points in a straight line; the scatter diagram given in Gumbel and Mustafi (1967) shows immediately that this is not the case. Of the four models considered, the only one not yet rejected is, for the moment, the logistic. Also the plotting of estimated reduced differences on the logistic paper, given also in Gumbel and Mustafi (1967), speaks in favour of the logistic model.
The application of the strip method — see Posner et al. (1969) referred to in Chapter — to the logistic model as a flood model for the Fox River confirms its usefulness.
It should be remarked here that if we expect bivariate models to be absolutely continuous here, as in other cases, the logistic model seems to be a good candidate. Also, as shown in Tiago de Oliveira (1987) the discrepancy between the logistic and \(\mathrm{ \left( 2,N \right) }\) natural model is small and not easily seen for such small samples. The intrinsic estimation will also not give large discrepancies.
[1]. The author thanks the Portuguese Institute for Meteorology and Geophysics for having kindly supplied the data.
[2]. Data taken from van Montfort (1970).
[3]. Data taken from Gumbel and Mustafi (1967) .
Fransén, A. and Tiago de Oliveira, J., 1984. Statistical choice of univariate extreme models, part II, in Statistical Extremes and Applications, Tiago de Oliveira, J, ed., 373-394, D. Reidel, Dordrecht.
Tiago de Oliveira, J., 1981. Statistical choice of univariate extreme models, Statistical Distributions in Scientific work, 6, G. P. Patil ed., 6, 367-387, D. Reidel, Dordrecht.
Tiago de Oliveira, J., 1987. Comparaison entre les modeles logistique et naturel pour les maxima et extensions, C. R. Acad. Sc. Paris, t. 305(1), 481-484.
Van Montfort, M.A.J., 1970. On testing that the distribution of extremes is of type I when type II is the alternative, J. Hydrology, 11, 421-427.
Whitmore, G. A., et al. 1987. Case studies in data analysis, nº 5. Canad. J. Statist., 15, (4), 311-337.