Abstract

Dating from the seminal work of Ellison and Glaeser in 1997, a wealth of evidence for the ubiquity of industrial agglomerations has been published. However, most of these results are based on analyses of single (scalar) indices of agglomeration. Hence, it is not surprising that industries deemed to be similar by such indices can often exhibit very different patterns of agglomeration—with respect to the number, size and spatial extent of individual agglomerations. The purpose of this article is thus to propose a more detailed spatial analysis of agglomeration in terms of multiple-cluster patterns, where each cluster represents a (roughly) convex set of contiguous regions within which the density of establishments is relatively uniform. The key idea is to develop a simple probability model of multiple clusters, called cluster schemes, and then to seek a ‘best’ cluster scheme for each industry by employing a standard model-selection criterion. Our ultimate objective is to provide a richer characterization of spatial agglomeration patterns that will allow more meaningful comparisons of these patterns across industries.

1. Introduction

Economic agglomeration is the single most dominant feature of industrial location patterns throughout the modern world. In Japan, with a population density more than 10 times that of the USA, land is generally considered to be extremely scarce. Yet, more than 60% of the total population and more than 80% of total employment are concentrated in less than 3% of total area. Similar observations can be made for any other developed country. The extent of this concentration phenomenon explains why economic agglomeration is now a major topic in urban and regional economics (see, e.g. Henderson and Thisse, 2004). Industrial agglomeration has also gained increasing interest in the management literature, dating from the seminal work of Porter (1990) on ‘industrial cluster theory.’

In terms of empirical work, a substantial number of studies on industrial agglomeration have been published in recent decades. Some of them have proposed indices of industrial agglomeration that allow testable comparisons of the degree of agglomeration among industries (Brülhart and Traeger, 2005; Duranton and Overman, 2005; Mori et al., 2005; Marcon and Puech, 2010). The results of these works suggest that industrial agglomeration is far more ubiquitous than previously believed and extends well beyond the traditional types of industrial agglomeration (such as information technology industries in Silicon Valley and automobile manufacturing in Detroit). Moreover, the degree of such agglomeration has been shown to vary widely across industries.

But while these studies provide ample evidence for the ubiquity of industrial agglomerations, they tell us very little about the actual spatial structure of agglomerations. In particular (to our knowledge), there have been no systematic efforts to determine the number, location and spatial extent of agglomerations within individual industries. Most indices of agglomeration currently in use measure the discrepancy between industry-specific regional distributions of establishments/employment and some hypothetical reference distribution representing ‘complete dispersion.’1 But even if industries are judged to be similar with respect to these indices, their spatial patterns of agglomeration may appear to be quite different. The reason for this is that such patterns are basically multidimensional in nature and are not easily compared with any single index.

This can be illustrated by a sample of our results for Japanese manufacturing industries (developed in more detail in Section 5, and in our companion paper, Mori and Smith, 2011b). Here, we consider two industries that are virtually indistinguishable in terms of their overall degree of spatial concentration (as measured by the Kullback–Leibler measure of concentration sketched in Section 5). But the actual patterns of agglomeration for these two industries are quite different. The agglomeration pattern of the first industry, classified as ‘plastic compounds and reclaimed plastics’, is seen in Figure 13(b). (For now, the areas marked in gray can be regarded as industrial agglomerations.) The concentration of this industry lies mainly along the inland industrial belt extending westward from Tokyo to Hiroshima. Moreover, the individual clusters of establishments within this belt are seen to be densely packed from end to end. Our second industry, classified as ‘soft drinks and carbonated water’, exhibits a very different pattern of agglomeration. As seen in Figure 14(b), this industry is spread throughout the nation, but exhibits a large number of local agglomerations. A closer inspection of these industries reveals the nature of these differences. On the one hand, plastic components constitute essential inputs to a variety of manufactured goods, from automobiles to TV sets. Hence, the concentration of this industry along the industrial belt forms a series of intermediate markets for other manufacturing industries using these components. On the other hand, soft drinks are more directly oriented to final markets serving consumers. So while there are still sufficient scale economies to warrant industrial agglomerations, these agglomerations are widely scattered and essentially follow patterns of population density.

Figure 1. Geographical framework.

Thus, while summary measures of spatial concentration (or dispersion) are unquestionably useful for a wide range of global comparisons, the above illustration suggests that more detailed representations of spatial agglomeration patterns can in principle allow much richer types of comparisons. With this in mind, our central objective is to propose a methodology for representing and identifying such agglomeration patterns.

Before doing so, it is important to note that there have been other attempts to develop statistical measures that are more multidimensional in nature. Most notably, the K-density approach of Duranton and Overman (2005) utilizes pairwise distances between individual establishments and is capable of indicating the spatial extent of an agglomeration. In a similar vein, Mori et al. (2005) proposed a spatially decomposable index of regional localization that yields some information about the most relevant geographic scales of agglomeration within individual industries. However, neither of these approaches is designed to identify specific (map) locations of industrial agglomerations, from which spatial patterns of agglomerations can be characterized.

Methodologically, our approach is closely related to cluster-identification methods proposed by Besag and Newell (1991), Kulldorff and Nagarwalla (1995) and Kulldorff (1997) that have been used for the detection of disease clusters in epidemiology.2 As with the agglomeration indices mentioned above, these methods start by postulating a null hypothesis of ‘no clustering’ (in terms of a uniform distribution of industrial locations across regions), and then seek to test this hypothesis by finding a single ‘most significant’ cluster of regions with respect to this hypothesis. Candidate clusters are typically defined to be approximately circular areas containing all regions with centroids within some specified distance from a reference point (e.g. the centroid of a ‘central’ region). While this approach is in principle extendable to multiple clusters by recursion (i.e. by removing the cluster found and repeating the procedure), such extensions are piecemeal at best.3

Hence, our strategy is essentially to generalize their approach by finding the single most significant ‘cluster scheme’ rather than ‘cluster’. We do so by formalizing these schemes as probability models to which appropriate statistical model-selection criteria can be applied for finding a ‘best cluster scheme’. Here, a cluster scheme is simply a partition of space in which it is postulated that firms are more likely to locate in ‘cluster’ partitions than elsewhere.4 Our probability model then amounts to a multinomial sampling model on this partition. These candidate cluster schemes can in principle be compared by means of standard model-selection criteria, including Akaike’s (1973) information criterion, Schwarz’s (1978) Bayesian information criterion (BIC) and the normalized maximum likelihood of Kontkanen and Myllymäki (2005).

To find a best model (cluster scheme) with respect to such criteria, it would of course be ideal to compare all possible cluster schemes constructible from the given system of regions. But even for modest numbers of regions, this is a practical impossibility. Hence, a second major objective of this article is to develop a reasonable algorithm for searching the space of possible cluster schemes. Our approach can be considered as an elaboration of the basic ideas proposed by Besag and Newell (1991) in which one starts with an individual region and then adds contiguous regions within a given distance from this initial region to identify the single most significant cluster. In particular, we generalize the Besag–Newell concept of clusters by imposing only convexity rather than circularity. Although searching over possible convex sets of regions is computationally impractical when the number of regions is large, the procedure becomes reasonably simple if the (continuous) location space is approximated by a (discrete) regional network. Accordingly, we develop the notion of a convex solid, representing convexity in the regional network.

In this context, cluster schemes are grown by (i) adding new disjoint clusters or by (ii) either expanding or combining existing clusters until no further improvement in the given model-selection criterion is possible. The final result is thus a ‘locally best cluster scheme’ with respect to this criterion. Although the criteria listed above are conceptually different, it turns out that the cluster schemes found are in high agreement across different criteria. Thus, in this article, we will focus on BIC, which turns out to be the most parsimonious criterion in terms of the number of clusters found (Mori and Smith, 2009, Section 3).

The rest of the article is organized as follows. We begin in Section 2 by defining a probabilistic location model for an establishment, where location probabilities are assumed to be industry-specific and independent for each establishment within a given industry as well as across industries. Our criterion for model selection in terms of BIC is also developed. In Section 3, we introduce the notion of convex solids and then in Section 4 present a practical procedure for cluster detection which searches for the best cluster scheme consisting of a set of distinct ‘convex’ clusters. The results of this procedure are then illustrated in Section 5 in terms of the selected pair of Japanese industries discussed above. Here, we sketch a classification scheme for agglomeration patterns in terms of ‘global extent’ (GE) and ‘local density’ (LD) that can be employed to quantify the spatial scale of industrial agglomeration and dispersion. A possible refinement and the results of sensitivity analyses for our cluster detection are also presented. Finally, in Section 6, we briefly discuss a number of directions for further research.

2. A probability model of agglomeration patterns

To motivate our approach to cluster detection, we begin by observing that recent theoretical results on equilibrium location patterns in continuous space (e.g. Tabuchi and Thisse, 2011; Ikeda et al., 2012; Hsu, 2012) suggest that there is remarkable commonality among possible equilibrium patterns of agglomeration within each industry. In particular, the number, size and spacing of agglomerations are shown to be well preserved under a variety of stable equilibria. From this perspective, our objective is to identify these common features. To do so, we treat such equilibria as stationary states and develop a probabilistic model of location behavior within such stationary states. In particular, while individual location decisions may be based on the prevailing steady-state distribution, they can nonetheless be treated as statistically independent events, i.e. as random samples from this distribution.5 This simplification of course precludes any questions about the process of cluster formation, or even the economic rationale for clustering. Rather, our goal here is to provide a simple statistical framework within which the most salient features of these equilibrium cluster patterns can be identified.6

To this end, we start by assuming that the location behavior of individual establishments in a given industry can be treated as independent random samples from an unknown industry-specific locational probability distribution, formula, over a continuous location space, formula (e.g. a national location space). Hence, for any (measurable) subregion, formula, the probability that a randomly sampled establishment locates in formula is denoted by formula. In this context, the class of all possible location models corresponds to the set of probability measures on formula.

However, observable location data are here assumed to be only in terms of establishment counts for each of a set of disjoint basic regions (e.g. municipalities), formula, indexed by formula. These regions are assumed to partition formula, so that formula. Hence, the only relevant features of the location probability distribution, formula, for our purposes are the location probabilities for each basic region:
(2.1)

We now consider an approximation of formula by probability models, formula, that postulate areas of relatively intense locational activity. Each model is characterized by a ‘cluster scheme’, formula, consisting of disjoint clusters of basic regions, formula, formula, within which establishments are more densely located. For the present, such clusters are left unspecified. A more detailed model of individual clusters is developed in Section 3.

If the full extent of cluster formula in formula is denoted by formula then the corresponding location probabilities, formula, are implicitly taken to define areas of concentration.7 To complete these probability models, let the set of residual regions be denoted by formula, and let formula, with corresponding location probability, formula

Each cluster scheme, formula, then constitutes a partition of the regional index set, formula, and the location probabilities formula yield a probability distribution on formula.8 Finally, to specify location probabilities for basic regions, it is assumed that within each cluster, formula, the location behavior of individual establishments is completely random.9 To define ‘complete randomness’ in the present setting, it is important to focus on those locations within each basic region where establishments could potentially locate (excluding, e.g. bodies of water). Such locations are here designated as the economic area of each region.10 Hence, if for each basic region formula, we let formula denote the (economic) area of formula, so that the total area of cluster formula is given by
(2.2)
then for each establishment locating in formula, it is postulated that the conditional probability of locating in basic region, formula, is proportional to the area of region formula, i.e. that
(2.3)
But since formula implies that formula, if we let formula, it then follows that for all formula
(2.4)
Hence, for each cluster scheme, formula, Expression (2.4) yields a well-defined cluster probability model, formula, which is comparable with the unknown true model (2.1). Note moreover that since all area values are known, it follows that for each given cluster scheme, formula, the only unknown parameters are given by the formula-dimensional vector of cluster probabilities, formula.11
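The area-proportional rule behind the cluster probability model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `region_probabilities`, the region labels and the numerical values are hypothetical, and the residual set is assumed to be treated in the same way as a cluster.

```python
def region_probabilities(part_probs, scheme, areas):
    """Spread each part's probability over its regions in proportion to area.

    part_probs: probabilities of the parts of the scheme (assumed to sum to 1)
    scheme:     list of disjoint sets of region labels covering all regions
                (here the residual set comes first, then the clusters)
    areas:      dict mapping region label -> economic area
    """
    probs = {}
    for p_j, part in zip(part_probs, scheme):
        part_area = sum(areas[r] for r in part)
        for r in part:
            # Conditional on locating in this part, the location is
            # "completely random", i.e. proportional to economic area.
            probs[r] = p_j * areas[r] / part_area
    return probs


# Hypothetical four-region example: one cluster {A, B} and residual {C, D}.
areas = {"A": 1.0, "B": 3.0, "C": 2.0, "D": 2.0}
scheme = [{"C", "D"}, {"A", "B"}]
part_probs = [0.2, 0.8]

probs = region_probabilities(part_probs, scheme, areas)
for region in sorted(probs):
    print(region, round(probs[region], 3))
```

Once the scheme and the areas are fixed, the only free quantities are the part probabilities, which is why the parameter vector of each model is so small.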
Within this modeling framework, we now consider a sequence of formula independent location decisions by individual establishments. For each establishment, formula, let its location choice be modeled by a random (indicator) vector, formula, with formula if establishment formula locates in region formula, and formula, otherwise. This set of location decisions is then representable by a random matrix of indicators, formula, with the following finite set of possible realizations (location patterns):
(2.5)
By independence, the probability distribution of formula under the unknown true distribution in (2.1) is given for each location pattern, formula, by
(2.6)
where the total number of establishments locating in region formula is denoted by
(2.7)
[see Expression (2.5)]. Similarly, for each cluster probability model, formula, the postulated distribution of formula is given for each pattern, formula, by
(2.8)
where the relevant parameter vector, formula, for each such model has been made explicit. In most contexts, it will turn out that the locational frequencies, formula, are sufficient statistics, since by definition
(2.9)
where the factor, formula, is independent of parameter vector, formula.
This likelihood function will form the central element in our comparisons among candidate cluster schemes. As mentioned in Section 1, the specific model-selection criterion to be used here is the BIC of Schwarz (1978). As with a number of other criteria, BIC is essentially a ‘penalized likelihood’ measure. To state this criterion precisely, we first recall from Expression (2.9), that for any given cluster scheme, formula, the log likelihood of parameter vector, formula, given an observed location pattern, formula, is of the form
(2.10)
But since the second term is independent of formula, it follows at once (by differentiation) that the maximum-likelihood estimate, formula, of formula is given for each formula simply by the fraction of establishments in formula, i.e.
(2.11)
By substituting (2.11) into (2.10), we obtain a corresponding estimate of the maximum log-likelihood value for model formula,
(2.12)
But since likelihood values are non-decreasing in the number of parameters estimated, it follows in particular that values of formula will almost always increase as more clusters are introduced. Hence, the ‘best’ cluster scheme with respect to model fit alone is the completely disaggregated scheme in which every basic region constitutes its own cluster. To avoid this obvious ‘overfitting’ problem, BIC penalizes those cluster schemes with larger numbers of clusters, formula, and for any given sample size, formula, is of the form
(2.13)
In the actual computations involved in cluster detection (to be described in Section 4), it turned out to be convenient to evaluate the cluster scheme, formula, relative to the uniform probability distribution model as a benchmark in which individual establishment location follows uniform probability density over economic area. If the BIC value for the uniform probability distribution model is denoted by formulaformula, where formula represents the total area, then we may reformulate this measure in terms of BIC-differences from this benchmark model as
(2.14)
where formula is the log-likelihood ratio between the cluster and benchmark models:
(2.15)
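The chain from the log-likelihood to the BIC difference can be written out as follows. This is a sketch in assumed notation rather than a transcription of the displayed expressions: $n_r$ is the establishment count in basic region $r$, $n_j$ and $a_{C_j}$ are the count and economic area of the $j$-th part of the scheme (with $j = 0$ for the residual set, assumed to be treated like a cluster), $a$ is the total economic area, $k_C$ is the number of clusters, and the sign convention (larger values are better) is also an assumption.

```latex
\begin{align*}
L_C(p_C \mid x) &= \sum_{j=0}^{k_C} n_j \ln p_C(j)
   \;+\; \sum_{j=0}^{k_C} \sum_{r \in C_j} n_r \ln\frac{a_r}{a_{C_j}}, \\
\hat{p}_C(j) &= \frac{n_j}{n}, \qquad n_j = \sum_{r \in C_j} n_r, \\
BIC_C &= L_C(\hat{p}_C \mid x) - \frac{k_C}{2}\ln n, \\
BIC_0 &= \sum_{r} n_r \ln\frac{a_r}{a}, \\
BIC_C - BIC_0 &= LR_C - \frac{k_C}{2}\ln n, \qquad
LR_C = \sum_{j=0}^{k_C} n_j \ln\frac{n_j/n}{a_{C_j}/a}.
\end{align*}
```

The area terms $\sum_r n_r \ln a_r$ appear in both the cluster model and the benchmark and cancel in the difference, so only part-level counts and areas are needed to compare schemes.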

Since the sample size (number of establishments) for each industry is fixed, it plays no direct role in model selection for that industry. But when comparing cluster patterns for different industries, this penalty term will be more severe in industries with larger numbers of establishments. So, all else being equal, BIC tends to yield more parsimonious cluster schemes for larger industries. Moreover, it tends to yield more parsimonious cluster schemes for all industries than the other model-selection criteria mentioned above. It is for this reason that we choose to focus on BIC in the present application.
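The BIC comparison described in this section can be sketched numerically as follows. The sketch assumes the standard Schwarz form (maximized log-likelihood minus half the number of free parameters times the log of the sample size), counts one free parameter per cluster, and treats the residual set like a cluster; the function name `bic_difference` and all data are hypothetical.

```python
import math


def bic_difference(counts, areas, clusters):
    """BIC of a cluster scheme minus BIC of the uniform benchmark.

    counts:   dict region -> number of establishments
    areas:    dict region -> economic area
    clusters: list of disjoint sets of regions (the residual set is implied)
    """
    n = sum(counts.values())
    total_area = sum(areas.values())
    residual = set(counts) - set().union(*clusters)
    parts = list(clusters) + ([residual] if residual else [])

    # Log-likelihood ratio at the maximum-likelihood estimates n_j / n.
    log_lr = 0.0
    for part in parts:
        n_j = sum(counts[r] for r in part)
        a_j = sum(areas[r] for r in part)
        if n_j > 0:  # a part with no establishments contributes zero
            log_lr += n_j * math.log((n_j / n) / (a_j / total_area))

    # Assumed penalty: one free parameter per cluster.
    penalty = 0.5 * len(clusters) * math.log(n)
    return log_lr - penalty


# Hypothetical data: three equal-area regions, establishments piled up in A.
counts = {"A": 80, "B": 10, "C": 10}
areas = {"A": 1.0, "B": 1.0, "C": 1.0}

no_cluster = bic_difference(counts, areas, [])
one_cluster = bic_difference(counts, areas, [{"A"}])
two_clusters = bic_difference(counts, areas, [{"A"}, {"B"}])
print(no_cluster, one_cluster, two_clusters)
```

In this toy example, splitting off region B as a second cluster does not improve the fit (B and C have the same density), so the extra penalty makes the two-cluster scheme score lower than the one-cluster scheme.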

3. A model of clusters as convex solids

Given the set of basic regions, formula, it might seem desirable to treat cluster schemes, formula, as arbitrary partitions of formula, and then to identify the best cluster scheme from this class, i.e.
(3.1)

But from a practical viewpoint, the number of possible partitions can be enormous for even modest numbers of basic regions.12 Moreover, without further restrictions, the components of such partitions can be bizarre and difficult to interpret as ‘clusters’. This has long been recognized by cluster analysts, who have typically proposed that clusters be roughly circular in shape (as in Besag and Newell, 1991; Kulldorff and Nagarwalla, 1995; Kulldorff, 1997). Here, we propose a more flexible class of clusters that preserve spatial compactness by requiring only that they be ‘approximately convex’. We further simplify the identification of convex clusters by representing the location space in terms of a discrete regional network, since from a practical viewpoint, searching over candidate convex clusters is much simpler on networks than in Euclidean space (especially when the space is large). This network-based (as opposed to Euclidean space-based) approach is particularly useful when economically meaningful distances are adopted (such as travel distance and time), rather than simplistic straight-line distances between regions. Before developing the details of this approach, it is useful to begin with a brief overview.
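To give a sense of scale for the partition count mentioned above, the number of ways to partition a set of labeled regions into nonempty groups is the Bell number. The short sketch below computes it with the Bell triangle; it counts unrestricted set partitions only, which is an assumption about how ‘arbitrary partitions’ should be counted, and the function name is hypothetical.

```python
def bell_number(n):
    """Number of partitions of an n-element set, via the Bell triangle."""
    row = [1]
    for _ in range(n - 1):
        new_row = [row[-1]]           # each row starts with the previous row's last entry
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]


for n_regions in (5, 10, 25):
    print(n_regions, bell_number(n_regions))
```

Even a system of 25 regions already admits more than 10^18 partitions, which is why exhaustive comparison is ruled out.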

To define clusters of basic regions, we first require that they be convex sets with respect to the underlying network. This means simply that clusters must include all regions on shortest paths between their members (in the same way, planar convex sets include all lines between their points). But unlike straight-line planar paths, shortest paths on discrete networks can sometimes exclude regions that are obviously interior to the desired clusters, thus leaving ‘holes’ (as shown in Figures 5 and 6).13 It is thus appropriate to ‘fill’ these holes by requiring that regional clusters be convex solid sets with respect to the underlying network. The formal procedures for developing these convex solid sets will in fact be utilized in the cluster detection algorithm itself, as detailed in Section 4.2.

3.1. A discrete network representation of the regional system

Recall from Section 2 that the relevant location space, formula, is partitioned into a set of basic regions, formula, indexed by formula. For our present purposes, it is convenient to consider a larger world region, formula, in which formula resides, so that formula denotes the ‘rest of the world’, as shown schematically in Figure 1. As in Section 2, we identify formula with the set of regional labels for formula. In this framework, the boundary of the given location space consists of the subset of basic regions, formula, that share boundary points (i.e. the edges of a basic region cell) with formula. This distinguished set of boundary regions (shown in gray) will play an important role in Section 3.3.

Within this basic continuous geographical framework, we next develop a discrete network representation of the regional system that contains all the relevant information needed for our cluster model. The nodes of this network are represented by the set formula of basic regions, and the links are taken to represent pairs of regional ‘neighbors’ in terms of the underlying regional network. Here, it is assumed that data are available on minimal travel distances, formula, between each pair of regions, formula, say between their designated administrative centers. These neighbors should of course include regional pairs formula for which the shortest route from formula to formula passes through no regions other than formula and formula. But for computational convenience, we choose to approximate this relation by the standard ‘contiguity’ relation that takes each pair of basic regions sharing some common boundary to be neighbors. While this approximation is reasonable in most cases, there are exceptions. Consider for example the coastal regions, formula and formula, joined by a bridge, as shown in Figure 2. Here, it is clear that the shortest route (path) between regions formula and formula passes through no other regions, even though formula and formula share no common boundary. Hence, to maintain a reasonable notion of ‘closeness’ among neighbors, it is appropriate to include such regional pairs as neighbors. Finally, it is mathematically convenient to include formula as a neighbor of itself (since formula is always ‘closer’ to itself than to any other region).

Figure 2. Bridge example.

If this set of neighbors for region formula is denoted by formula, then for the region formula shown in the schematic regional system of Figure 1, formula is seen to consist of eight neighbors other than formula itself. Our only formal requirement is that neighbors be symmetric, i.e. that formula if and only if formula. If we now denote the full set of neighbor pairs by formula, then this defines the relevant set of links for our discrete network representation, formula, of the regional system. A simple example of such a regional network, formula, is shown in Figure 3. Here, formula consists of 25 square regions shown on the left. These regions are connected by the road network shown by dotted lines on the left, with travel distances on each of the 40 links (to be discussed later) displayed on the right. Hence, formula in this case consists of the 40 distinct regional pairs associated with each of these links, together with the 25 identity pairs formula.

Figure 3. Regional network example.

Next, we employ travel distances between neighbors to approximate the entire regional network by a shortest path metric on network formula. To do so, let each sequence, formula, of linked neighbors [i.e. with formula for formula] be designated as a path in formula, and let the set of all paths in formula be denoted by formula. If for each pair of regions, formula, we denote the subset of all paths from formula to formula in formula by formula, then to ensure that shortest paths between all pairs of regions are meaningful, we henceforth assume that formula for all formula, i.e. that the given regional network formula is connected.14 In this context, if the length, formula, of path, formula, is now taken to be the sum of travel distances on each of its links, i.e. formula, then for any pair of regions, formula, the shortest path distance, formula, from formula to formula is taken to be the length of the (possibly nonunique) shortest path from formula to formula:
(3.2)

The set of all shortest paths in formula is then denoted by formulaformula. The shortest path distances in (3.2) are easily seen to define a metric on formula, i.e. to satisfy (i) formula, (ii) formula and (iii) formula for all formula. Moreover, these distances always agree with travel distances between neighbors (i.e. formula for all formula). But for non-neighbors, formula, it will generally be true that formula (since the shortest route from formula to formula on the actual network may not pass through any intermediate regional centers). Hence, these shortest path distances are only an approximation to shortest route distances.15 The advantage of this approximation for our present purposes is that for any formula and formula, the number of paths in formula is generally much smaller than the number of routes from formula to formula on the network, so that shortest paths in formula are more easily identified.
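The construction of shortest path distances from neighbor links can be sketched as follows. The example network is hypothetical: a 5 x 5 grid of square regions with links between regions sharing an edge, which gives 40 links as in the regional network example, and a unit travel distance on every link. Dijkstra's algorithm is used here as one standard way to compute the distances; the text does not prescribe a particular algorithm.

```python
import heapq


def grid_network(size=5, link_distance=1.0):
    """Hypothetical regional network: size x size regions, edge-sharing neighbors."""
    nodes = [(i, j) for i in range(size) for j in range(size)]
    links = {}
    for (i, j) in nodes:
        for (di, dj) in ((1, 0), (0, 1)):
            k = (i + di, j + dj)
            if k[0] < size and k[1] < size:
                links[((i, j), k)] = link_distance
    return nodes, links


def shortest_path_distances(nodes, links):
    """All-pairs shortest path distances, running Dijkstra from every region."""
    adjacent = {r: [] for r in nodes}
    for (r, s), t in links.items():   # neighbor relation is symmetric
        adjacent[r].append((s, t))
        adjacent[s].append((r, t))
    distances = {}
    for source in nodes:
        dist = {source: 0.0}
        heap = [(0.0, source)]
        while heap:
            d_u, u = heapq.heappop(heap)
            if d_u > dist.get(u, float("inf")):
                continue              # stale heap entry
            for v, t in adjacent[u]:
                alt = d_u + t
                if alt < dist.get(v, float("inf")):
                    dist[v] = alt
                    heapq.heappush(heap, (alt, v))
        distances[source] = dist
    return distances


nodes, links = grid_network()
d = shortest_path_distances(nodes, links)
print(len(links), d[(0, 0)][(4, 4)])
```

Because every path is a sequence of neighbor links, the search only ever touches the 40 links rather than the full set of routes on an underlying road network.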

3.2. Convexity in networks

Within this network framework, we now return to the question of defining candidate clusters as spatially coherent groups of basic regions. As mentioned in Section 1, the standard approach to this problem is to require that clusters be as close to ‘circular’ as possible. To broaden this class, we begin by observing that a key property of circular sets in the plane is their convexity. More generally, a set, formula, in the plane is convex if and only if for every pair of points, formula, the set formula also contains the line segment joining formula and formula. But since lines are shortest paths with respect to Euclidean distance, an equivalent definition of convexity would be to say that formula contains all shortest paths between points in formula. Since shortest paths are equally well defined for the network model above, it then follows that we can identify convex sets in the same way.

In particular, a set of basic regions, formula, is now said to be formula-convex if and only if for every pair of regions formula and formula in formula, the set of regions on every shortest path from formula to formula is also in formula.16 More formally, if for any path, formula, we now denote the set of distinct points in formula by formula, and if the family of all nonempty subsets of formula is denoted by formula, then  

Definition 3.1 (formula-Convexity)

(i) A subset of basic regions, formula, is said to be formula-convex iff for all formula, formula. (ii) The family of all formula-convex sets in formula is denoted by formula.

For example, suppose that in the schematic regional system of Figure 4, it is assumed that regional squares sharing boundary points (faces or corners) are always neighbors, and that travel distance, formula, between neighbors is simply the Euclidean distance between their centers. Then, with respect to the induced shortest path distance, formula, it is clear that the set, formula, on the left consisting of four black squares is not formula-convex, since the gray squares in the middle figure belong to shortest paths between the black squares. But even if these gray squares are added to formula, the resulting set is still not formula-convex, since the four white squares remaining in the middle belong to shortest paths between the gray squares. However, if these four squares are added, then the resulting set on the right is seen to be formula-convex since all squares on every shortest path between squares in the set are included.

Figure 4. formula-Convexification of sets.

This process of adding shortest paths actually yields a well-defined constructive procedure for ‘convexifying’ a given set, which can be formalized as follows. Let
(3.3)
denote the formula-interval of all points on shortest paths from formula to formula, and let the mapping, formula, defined for all formula by
(3.4)
be designated as the interval function generated by formula. For notational convenience, we set formulaformula, and construct the m th-iterate of formula recursively by formula for all formula and formula. Since formula for all formula, it follows from (3.4) that for each set, formula,
(3.5)
By the same argument, it follows that for any formula and formula with formula, we must have formula. Hence, these interval iterates satisfy the following nesting property for all formula,
(3.6)
and thus constitute a monotone nondecreasing sequence of sets. It then follows that for any subset, formula, of nodes in the finite network, formula, there must be an integer, formulaformula,17 such that formula.18 The smallest such integer:
(3.7)
is called the geodesic iteration number of set, formula.19 With these definitions, it is well known that the unique smallest formula-convex set containing a given set formula is given by the formula-convex hull (see Proposition A.2 in the appendix for a proof of this assertion),
(3.8)
The mapping, formula, defined by (3.8) is designated as the formula-convexification function. With this definition, it is shown in Proposition A.3 of the appendix that formula-convex sets are equivalently characterized as the fixed points of this mapping, i.e. a set formula is formula-convex if and only if formula. So the family of all formula-convex sets can be equivalently defined as
(3.9)
However, for purposes of constructing formula-convex sets, it is more useful to note that they are equivalently characterized as the fixed points of the interval function, formula (as shown in the Corollary to Proposition A.3). Hence, formula can also be written as
(3.10)

This in turn implies that a simple constructive algorithm for obtaining formula is to iterate formula until the iteration number, formula is found. This procedure is in fact illustrated by Figure 4, where formula.
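
The iteration described above can be sketched in a few lines. The following is only an illustrative sketch, not the authors' implementation: the function names, the dictionary-based network representation and the nine-node example (a ring of eight regions joined by short links, plus a central region reached only by long links, loosely in the spirit of Figure 5) are all assumptions made for this example.

```python
from itertools import product

def shortest_path_distances(nodes, links):
    """All-pairs shortest-path distances (Floyd-Warshall) on an undirected network."""
    inf = float("inf")
    d = {(u, v): (0.0 if u == v else inf) for u, v in product(nodes, repeat=2)}
    for (u, v), length in links.items():
        d[u, v] = d[v, u] = min(d[u, v], length)
    for k, i, j in product(nodes, repeat=3):  # k (intermediate node) varies slowest
        if d[i, k] + d[k, j] < d[i, j]:
            d[i, j] = d[i, k] + d[k, j]
    return d

def interval(S, nodes, d, tol=1e-9):
    """One application of the interval function: every node lying on some
    shortest path between two members of S (S itself is always included)."""
    return {v for v in nodes for r in S for s in S
            if abs(d[r, v] + d[v, s] - d[r, s]) <= tol}

def d_convexify(S, nodes, d):
    """Iterate the interval function to its fixed point.
    Returns the resulting set and the number of iterations needed."""
    current, m = set(S), 0
    while True:
        nxt = interval(current, nodes, d)
        if nxt == current:
            return current, m
        current, m = nxt, m + 1

# Hypothetical network: ring regions 0..7 joined by links of length 1,
# and a central region "c" joined to every ring region by links of length 3.
ring = list(range(8))
nodes = ring + ["c"]
links = {(i, (i + 1) % 8): 1.0 for i in ring}
links.update({(i, "c"): 3.0 for i in ring})

d = shortest_path_distances(nodes, links)
hull, m = d_convexify({0, 2, 4, 6}, nodes, d)
print(sorted(hull, key=str), m)
```

In this toy network the four starting regions are completed to the full ring after one iteration, while the central region is never added, so the resulting set has a 'hole' of the kind discussed next.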

But while this particular set, formula, does indeed look reasonably compact (and close to circular), this is not always the case. One simple counterexample is shown in Figure 5. Given the regional network, formula in Figure 3, suppose that formula consists of the four regions shown in black on the left in Figure 5. These regions are assumed to be connected by major highways as shown by the heavy lines on the right in Figure 3, with travel distances, formula, on each link. All other road links are assumed to be circuitous secondary roads, as represented by a travel distance of formula on each link. Here, it is clear that the formula-convexification, formula, of formula is obtained by adding all other regions connected by the ring of major highways (as shown in gray on the right in Figure 5), since shortest paths between such regions are always on these highways. But since the central region shown in white is not on any of these paths, we see that formula is a formula-convex set with a ‘hole’ in the middle.

This is very different from convex sets in the plane, which are always ‘solid’. But in more general metric spaces, this need not be true. Indeed, for the present case of a network (or graph) structure, the notion of a ‘hole’ itself is not even meaningful. For example, if the central node in Figure 5 were pulled ‘outside’ the coastal regions (leaving all links intact) then the network, formula, would remain the same. So it is clear that the above notion of a ‘hole’ depends on additional spatial structure, including the positions of regions relative to one another. In particular, since the present notion of d-convexity is intended to approximate convexity in the original location space, it is appropriate to fill these holes.

Finally, it is of interest to note that even with simpler approximations to travel distances, such holes can still exist. For example, if shortest paths between adjacent regions are approximated by straight-line paths between their geometric centroids, then this same convexification procedure can still yield holes. This is illustrated by the simple four-region example in Figure 6, where the three exterior regions are seen to form a convex set containing all shortest paths between them. Hence, the central region is not part of this convex set and constitutes an obvious hole.

3.3. Convex solids in networks

These observations motivate the spatial structure that we now impose in order to characterize ‘solid’ subsets of formula in formula. The key idea here is to recall from Figure 1 that relative to the rest of the world, there is a distinguished collection of boundary regions, formula, that are essentially ‘external’ to all subsets of formula. If for any subset, formula, and boundary region, formula, it is true that formula, then it is reasonable to assert that formula is outside of formula.20 This set of boundary regions, formula, thus defines a natural reference set for distinguishing regions in the complement, formula, of formula that are ‘inside’ or ‘outside’ of formula. In particular, we now say that a complementary region, formula, is inside formula if and only if every path joining formula to a boundary region in formula must pass through at least one region of formula. For example, given the set, formula, of black squares in Figure 7, the complementary region formula is seen to be inside of formula since every path to the boundary, formula, must intersect formula. Similarly, the complementary region formula is not inside formula, since there is a path from formula to formula that does not intersect formula. To formalize this concept, we now let the set of all paths from any region, formula, to formula be denoted by formula. Then, for any nonempty set, formula, the set of all complementary regions inside formula is given by
(3.11)
and is designated as the interior complement of formula.

Figure 5. d-Convex set with a hole.

Figure 6. Non-solid d-convex set.

Figure 7. Inside versus outside.

With this concept, we now say that a set, formula, is solid if and only if its interior complement is empty. In addition, we can now solidify a set formula by simply adjoining its interior complement. More formally, we now say that:  

Definition 3.2 (Solidity)
For any nonempty subset, formula, (i) formula is said to be solid iff formula. (ii) The set formed by adding formula to formula,
(3.12)
is designated as the solidification of formula. (iii) The family of all solid sets in formula is denoted by formula.

The justification for the terminology in (ii) is given by Lemma A.1 in the appendix, where it is shown that for any set, formula, the set, formula, is solid in the sense of (i) above. The mapping, formula, induced by (3.12) is designated as the solidification function. As with the formula-convexification function above, it also follows that solid sets are precisely the fixed points of the solidification function (see Lemma A.2 in the appendix).
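
As a rough illustration of Definition 3.2, the interior complement can be found by a flood fill that starts from the boundary regions and never enters the given set: any complementary region that the fill cannot reach is 'inside'. This is a sketch under stated assumptions rather than the authors' code. The function names and the adjacency-dictionary representation are hypothetical, and the boundary regions are assumed to appear as ordinary nodes in the adjacency structure.

```python
from collections import deque

def interior_complement(S, adjacency, boundary):
    """Regions outside S from which every path to a boundary region crosses S."""
    S = set(S)
    # Flood fill outward from the boundary, moving only through regions not in S.
    reached = {b for b in boundary if b not in S}
    queue = deque(reached)
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in S and v not in reached:
                reached.add(v)
                queue.append(v)
    # Whatever is neither in S nor reachable from the boundary is inside S.
    return set(adjacency) - S - reached

def solidify(S, adjacency, boundary):
    """Adjoin the interior complement to S."""
    return set(S) | interior_complement(S, adjacency, boundary)

# Hypothetical network: ring regions 0..7, a central region "c" adjacent to
# every ring region, and a single boundary region "b" adjacent to every ring region.
ring = list(range(8))
adjacency = {i: {(i - 1) % 8, (i + 1) % 8, "c", "b"} for i in ring}
adjacency["c"] = set(ring)
adjacency["b"] = set(ring)

print(interior_complement(set(ring), adjacency, {"b"}))    # the full ring encloses "c"
print(interior_complement({0, 2, 4, 6}, adjacency, {"b"}))  # gaps in the ring: nothing is inside
```

In this example the full ring is not solid, since its interior complement contains the central region, whereas the four alternating ring regions form a solid set because the central region can still reach the boundary through the gaps.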

With these definitions, the two properties of formula-convexity and solidity are taken to constitute our desired model of clusters in formula. Hence, we now combine them as follows:  

Definition 3.3 (d-Convex solids)
For any nonempty subset, formula, (i) if formula is both formula-convex and solid, then formula is designated as a formula-convex solid in formula. (ii) The composite image set,
(3.13)
is designated as the formula-convex solidification of formula.

If we now let formula denote the family of all formula-convex solids in formula, then it follows at once from Definitions 3.1–3.3 that
(3.14)

3.4. Convex solidification of sets

As with (3.11) and (3.12), Expression (3.13) induces a composite mapping, formula, designated as the formula-convex solidification function. We now examine this function in more detail. To do so, it is instructive to begin by observing that the order in which these two maps are composed is critical. In particular, it is not true that the formula-convexification of a solid set is necessarily a formula-convex solid. This can be illustrated by the example in Figures 3 and 5. If the exterior squares are taken to define the relevant boundary set, formula, in Figure 3, then it is clear that the original set, formula, of four black squares is solid, since there are paths from every complementary region to formula that do not intersect formula.21 But, the formula-convexification, formula, of formula is precisely the non-solid set that was used to motivate solidification. So in this case, the composite image, formula is not solid (and hence not a formula-convex solid).

With this in mind, the key result of this section, established in Theorem A.1 of the appendix, is to show that the terminology in Definition 3.3 is justified, i.e. that:  

Property 3.4 (d-Convex solidification)

For any set, formula, the image set, formula, is a formula-convex solid.

Hence, if one is enlarging a given cluster, formula, by adding a set, formula, of new regions to construct a new cluster containing formula, one need only formula-convexify this set by the algorithm
(3.15)
and then solidify the resulting set by identifying all regions in the interior complement formula of formula and forming
(3.16)

This algorithm has already been illustrated by the simple case in Figure 4, where no solidification was required. A somewhat more detailed illustration is given in Figures 8 and 9. Figure 8 exhibits a subsystem of 19 (hexagonal) basic regions in formula, along with the major road network (solid and dashed lines) connecting the centers of these regions. As in Figure 4, it is assumed that there are primary roads (freeways) and secondary roads. Some regions lie along freeway corridors, as denoted by solid network links with travel distance (or time) values of formula. Other regions are connected by secondary roads denoted by dashed network links with higher values of formula.

Figure 8. Regional subsystem.

Figure 9. Formation of composite clusters.

A possible sequence of steps in the formation of a composite cluster in this subsystem is depicted in Figure 9. Stage 1 begins at the point where it has been determined that an existing cluster (d-convex solid), formula, of three regions (shown in black) should be expanded to include a secondary set, formula, of two regions (also shown in black). Given the shortest path distances, formula, generated by the formula-values in Figure 8, it is clear that the d-convexification, formula, of this composite set, formula, is given by adding the gray regions as shown in Stage 2. This larger ring of regions lies entirely on freeway corridors and thus includes all shortest paths joining its members (in a manner similar to the ring of regions in Figure 5). Hence, the two regions in the center of this ring lie in the interior complement of formula and are thus added in Stage 3 to form a new cluster (d-convex solid), formula, containing formula. In Stage 4, it is determined that one additional singleton set, formula, should also be added to the existing cluster, formula. Again, Stage 5 shows that all regions on the freeway corridors from formula to formula should be added in a new d-convexification, formula. Finally, this d-convex set is again seen to have two regions in its interior complement, which are thus added to achieve the final d-convex solid cluster, formula.

Before proceeding, it is appropriate to note several additional features of this formula-convex solidification procedure that parallel the basic procedure of formula-convexification itself. First, as a parallel to formula-convex hulls in (3.8), it is shown in Theorem A.3 of the appendix that for any given set of regions, formula, the formula-convex solidification, formula, yields a ‘best formula-convex solid approximation’ to formula in the sense that:  

Property 3.5 (Minimality of d-convex solidifications)

For any set, formula, the d-convex solidification, formula, of formula is the smallest d-convex solid containing formula.

Hence, this process of cluster formation can be regarded as a smoothing procedure that approximates each candidate set of high-density regions by a more spatially coherent convexified version of this set.

Recall that our network representation of space is adopted mainly for computational efficiency, and that d-convexity is intended to approximate convexity in the original location space. Property 3.5 indicates that the d-convex solidification of a set in the network plays the role of its convex hull in Euclidean space. Thus, as desired, it is conceptually consistent to adopt d-convex solids as convex approximations of the spatial coverage of a given cluster.

Next, as a parallel to the fixed-point property of formula-convexifications, it is shown in Theorem A.4 of the appendix that the procedure in (3.15) and (3.16) always yields a fixed point of the composite mapping, formula:  

Property 3.6 (d-Convex solid fixed points)

A set, formula, is a d-convex solid iff formula.

Hence, the family, formula, of all formula-convex solids in (3.14) can equivalently be written as formula. In this form, each new cluster is seen to be a natural ‘stopping point’ of the combined formula-convexification and solidification procedure above.

4. A cluster-detection procedure

Given the cluster model developed above, the set of relevant cluster schemes for regional network formula can now be formalized as follows:  

Definition 4.1 (Cluster schemes)

A finite partition, formula, of formula is designated as a cluster scheme for formula iff (i) (d-convex solidity) formula for all formula and (ii) (disjointness) formula for all formula with formula. Let formula denote the class of admissible cluster schemes for formula.

Below, we develop our search procedure to identify the best cluster scheme. Before developing the details of this procedure, however, it is useful to begin with an overview.

For any given industry, we start with the single best cluster consisting of a single basic region. Then, at each subsequent step, we decide whether we should (i) stay with the current cluster scheme; (ii) expand one of the existing clusters or (iii) start a new cluster. In alternative (ii), we compare potential expansions of all the existing clusters. Such expansions involve annexations of nearby regions (or clusters) which are then further enlarged to maintain d-convex solidity. A new cluster in alternative (iii) consists of the best basic region in the current set of residual regions, formula. At each step, the best option among these three is selected, and the system of clusters continues growing until option (i) is evaluated as the best among the three. Before completing the description of this procedure (in Section 4.2), we specify the details of option (ii) above in the next section.

4.1. Operational rules for cluster expansion

At each step of the search procedure outlined above, option (ii) involves the expansion of an existing cluster by first annexing certain nearby regions and then further enlarging this set to maintain ‘spatial cohesiveness’. In view of the above definition of a cluster scheme, this requires that such annexations be enlarged so as to maintain both formula-convex solidity and disjointness with respect to other existing clusters. This procedure can sometimes require the annexation of other existing clusters, as illustrated by Figure 10. Given the subsystem of a regional network shown in Figure 8, suppose that the current cluster scheme includes the clusters formula and formula shown in Stage 1 of Figure 10. Suppose also that it has been determined that the next step of the search procedure should be an expansion of cluster formula to include the set formula shown in Stage 1. The composite cluster, formula, resulting from formula-convex solidification of formula, includes formula together with the gray region shown in Stage 2. But since cluster formula is seen to overlap this composite cluster, it is clear that disjointness between clusters can only be maintained by annexing cluster formula as well. This results in the larger composite cluster, formula, shown by the combined black and gray region of Stage 3 in Figure 10.

Figure 10. Formation of composite clusters.

More generally, if some current cluster, formula, is to be expanded by annexing a set formula, then the formula-convex solidification, formula, must be further enlarged to include all clusters, formula, intersecting formula. For any given current cluster scheme formula, this procedure can be formalized in terms of the following operator, formula, defined for all formula by
(4.1)
where the relevant sets, formula, of interest will be of the form, formula, with formula and formula. Observe next that this single operation is not sufficient, since the resulting image sets, formula, may fail to be formula-convex solids. Moreover, the formula-convex solidification, formula, may again fail to be disjoint from other existing clusters in formula. So it should be clear that what is needed here is an iteration of this operator until both conditions are met. To formalize such iterations, we proceed as in Section 3.2 by letting the iterates of formula be defined for each formula by formula, formula and formula for all formula. Since it is clear by definition that formula for all formula, this yields a monotone nondecreasing sequence of sets in formula. Hence, by the same arguments leading to (3.7), it again follows that there must be an integer, formulaformula, such that formula. As a parallel to (3.7), we may thus designate the smallest integer, formula, satisfying this condition as the expansion iteration number of formula given formula. Finally, if (as a parallel to formula-convex hulls) we now designate the resulting fixed point of formula,
(4.2)
as the formula-compatible expansion of formula, then it is this set that satisfies the expansion properties we need. First observe that the fixed point property, formula, of this expanded set implies at once from (4.1) that for all clusters formula with formula we must have formula. Thus, formula is always disjoint from any clusters, formula, that have not already been absorbed into formula. Moreover, this in turn implies from (4.1) that formula, and hence that formula must be a formula-convex solid.
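
The fixed-point iteration behind this expansion can be sketched as follows. Since the exact form of (4.1) is not reproduced here, the operator is assumed, following the verbal description above, to take the d-convex solidification of the current set and then adjoin every existing cluster that intersects it. The function names are hypothetical, and the convex solidification step is passed in as a callable. In the example, a bounding-box fill on a square lattice is used purely as a simple stand-in for d-convex solidification.

```python
def compatible_expansion(S, clusters, convex_solidify):
    """Iterate: convex-solidify the current set, then absorb every existing
    cluster that intersects the result, until nothing changes.
    Returns the fixed point and the number of iterations needed."""
    current, steps = frozenset(S), 0
    while True:
        hull = frozenset(convex_solidify(current))
        absorbed = [C for C in clusters if hull & C]
        nxt = hull.union(*absorbed)
        if nxt == current:
            return current, steps
        current, steps = nxt, steps + 1

def bounding_box(cells):
    """Stand-in for d-convex solidification on a square lattice (an assumption
    of this example): fill the smallest axis-aligned rectangle containing the cells."""
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    return {(x, y) for x in range(min(xs), max(xs) + 1)
                   for y in range(min(ys), max(ys) + 1)}

# Hypothetical cluster scheme on a lattice of (x, y) cells.
C1 = frozenset({(0, 0), (1, 0)})   # the cluster being expanded
C2 = frozenset({(2, 1), (3, 1)})
C3 = frozenset({(3, 2), (3, 3)})
C4 = frozenset({(6, 6)})           # far away, should remain untouched
clusters = [C1, C2, C3, C4]
Q = {(2, 2)}                       # residual region to be annexed

expanded, steps = compatible_expansion(C1 | Q, clusters, bounding_box)
print(len(expanded), steps)
```

Here a single application of the operator is not enough: absorbing one overlapping cluster enlarges the hull, which then overlaps another cluster, so the iteration continues until the expanded set is both a fixed point of the stand-in solidification and disjoint from every cluster it has not absorbed.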

4.2. Cluster-detection procedure

In terms of Definition 4.1, the objective of this procedure, which we now designate as the cluster-detection procedure, is to find a cluster scheme, formula, satisfying,
(4.3)

From a practical viewpoint, it should be stressed that the following search procedure will only guarantee that the cluster scheme found is a ‘local maximum’ of (4.3) with respect to the class of admissible ‘perturbations’ in formula defined by the procedure itself.

To specify these perturbations in more detail, we begin with the following notational conventions. At each stage, formula, of this procedure, let formulaformula denote the current cluster scheme in formula. The procedure then starts at stage formula with the null cluster scheme, formula, containing no clusters. By Expressions (2.14) and (2.15), it follows that the corresponding initial value of the objective function in (4.3) must be formula. Given data, formula, at stage formula, we then seek the modification (perturbation), formula, of formula in formula which yields the highest value of formula. As outlined above, these modifications are of two types: (i) the formation of a new cluster in scheme formula or (ii) the expansion of an existing cluster in scheme formula. We now develop each of these steps in turn.

4.2.1. New cluster formation

Given the current cluster scheme, formula, at stage formula, one can start a new cluster, formula, by choosing some residual region, formula, which is disjoint with all existing clusters. Hence, the set of feasible choices for formula is given by formula. For each formula, the corresponding expanded cluster scheme is then given by formula, where formula, formula, formula and formula for formula. The superscript ‘0’ in cluster scheme, formula, indicates that a change is made to the residual region, formula, rather than to one of the clusters in formula. Note that since formula is automatically a formula-convex solid, and since formula guarantees that disjointness of all clusters is maintained, it follows that formula, and hence that formula is an admissible modification of formula.

The best candidate for new cluster formation is of course the region, formula, that yields the highest value of the objective function, i.e. for which formulaformula. For purposes of comparison with other possible modifications of formula, we now set
(4.4)

4.2.2. Expansion of an existing cluster

Next, we consider a potential expansion of each cluster, formula, by annexing a set formula of nearby regions in formula. While the basic mechanics of this expansion procedure were developed in Section 4.1, the specific choice of formula was not. Recall that such annexations can potentially result in large expansions of formula, given the need to preserve both formula-convex solidity and disjointness. Hence, to maintain reasonably ‘small increments’ in our search process, it is appropriate to restrict initial annexations to single regions whenever possible. Of course, when such regions are already part of another cluster, it will be necessary to annex the whole cluster to preserve disjointness. But to motivate our basic approach, it is convenient to start by considering the annexation of a single region not in any other cluster, i.e. to set formula for some formula. Here, it would seem natural to consider only regions in the immediate neighborhood of formula. However, this often turns out to be too restrictive, since there may exist much better choices that are not direct neighbors of formula.

In fact, it might seem more reasonable to consider all possible regions in formula and simply let our model-selection criterion determine the best choice. But if one allows choices of formula ‘far away’ from formula, then our formula-convex solidity and disjointness criteria can lead to the formation of very large clusters that violate any notion of spatial cohesiveness.22 So it is convenient at this point to introduce a new set of neighborhoods which strike a compromise between these two extremes. To do so, we first extend shortest path distances, formula, between points to corresponding distances between points and sets by letting
(4.5)
for formula and formula. Since formula is a metric on formula, it is well known that for each set, formula, (4.5) yields a well-defined distance function that preserves the usual continuity properties of formula on formula (e.g. Berge, 1963, Ch. 5). Hence, one can define well-behaved neighborhoods of formula in terms of this distance function as follows. For each formula, the formula-neighborhood of formula in formula is defined to be formula. Hence, the appropriate choices for expansions of formula are taken to be regions in formula for some pre-specified choice of parameter formula.23
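
A minimal sketch of these neighborhoods is given below. It assumes that the point-to-set distance in (4.5) is the usual minimum of point-to-point distances and that the neighborhood includes all regions within distance at most the chosen threshold (whether the inequality is strict is an assumption here). The names `distance_to_set`, `neighborhood` and `delta` are hypothetical.

```python
def distance_to_set(r, S, d):
    """Distance from region r to set S: the smallest distance to any member of S."""
    return min(d(r, s) for s in S)

def neighborhood(S, regions, d, delta):
    """All regions whose distance to S is at most delta (S itself included)."""
    return {r for r in regions if distance_to_set(r, S, d) <= delta}

# Hypothetical example: ten regions along a line with unit spacing.
regions = range(10)
d = lambda r, s: abs(r - s)
C = {3, 4}

nearby = neighborhood(C, regions, d, delta=2)
expansion_candidates = nearby - C   # regions that could be annexed to C
print(sorted(expansion_candidates))
```

With a threshold of 2, the candidates for annexation are the regions up to two steps away on either side of the cluster, which is less restrictive than direct neighbors but still excludes distant regions.
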
As mentioned above, there are two cases that need to be distinguished here. First suppose that for some given cluster formula we consider the annexation of a region not in any other cluster, i.e. a region formula. Then, it follows from Expression (4.2) that the corresponding formula-compatible expansion of formula is given by
(4.6)
Thus, the cluster scheme, formula, resulting from this expansion has the form
(4.7)
where by Expression (4.1), the set of all other clusters in formula is given by
(4.8)
and where the corresponding residual region has the form:
(4.9)

As above, if formula now denotes the region in formula that yields the highest value of the objective function, i.e. for which formula, then the best cluster expansion for formula in formula starting with regions in formula is given by formula.

Next, recall that it is possible that another cluster, formula in formula, intersects formula so that the annexation of formula is a possible expansion of formula. For this case, it is necessary to annex the entire cluster formula in order to preserve disjointness. So if we now define the index set, formula [not to be confused with interval sets formula in Section 3.2], and for each formula replace (4.6) with the formula-compatible expansion formula, then as a parallel to (4.7)–(4.9), the cluster scheme, formula, resulting from this expansion now has the form
(4.10)
with the set of all other clusters in formula given by
(4.11)
and with corresponding residual region:
(4.12)
If formula denotes the cluster in formula that yields the highest value of the objective function, i.e. for which formula, then the best cluster expansion for formula in formula is given by formula. Hence, the best cluster expansion, formula, of formula starting with cluster formula is given by
(4.13)

4.2.3. Revision of the cluster scheme

Finally, given these candidate modifications, formula, of formula in formula [as defined by (4.4) together with (4.13)], let formula be the best candidate, as defined by
(4.14)

There are then two possibilities left to consider: If formula, then set formula and proceed to stage formula. On the other hand, if formula, then no (local) improvement can be made, and the cluster-detection procedure terminates with the (locally) optimal cluster scheme, formula.
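
The overall control flow of this search can be sketched as a generic greedy loop. This is only an illustration under strong simplifying assumptions: regions lie along a line, clusters are filled intervals (standing in for d-convex solids), expansions annex residual regions within two steps of a cluster, and the objective is an arbitrary stand-in that is not the BIC criterion of the paper. All names and numbers are hypothetical.

```python
def greedy_search(initial, candidates, score):
    """At each stage, evaluate every admissible modification of the current
    scheme and move to the best one only if it strictly improves the score."""
    scheme, best = initial, score(initial)
    while True:
        options = list(candidates(scheme))
        if not options:
            return scheme, best
        top = max(options, key=score)
        if score(top) <= best:
            return scheme, best          # no local improvement: stop
        scheme, best = top, score(top)

counts = [0, 5, 6, 0, 0, 1, 0, 7, 0]     # hypothetical establishment counts

def score(scheme):
    # Stand-in objective: establishments captured, minus 2 per clustered
    # region and 1 per cluster.
    return sum(sum(counts[r] for r in C) - 2 * len(C) - 1 for C in scheme)

def candidates(scheme):
    residual = set(range(len(counts))) - set().union(*scheme)
    for r in sorted(residual):           # start a new singleton cluster
        yield scheme | {frozenset({r})}
    for C in scheme:                     # expand an existing cluster
        for r in sorted(residual):
            if min(abs(r - s) for s in C) <= 2:
                members = C | {r}
                grown = frozenset(range(min(members), max(members) + 1))
                # Drop C and any cluster swallowed by the grown interval.
                untouched = {D for D in scheme if not (D & grown)}
                yield frozenset(untouched | {grown})

final_scheme, final_score = greedy_search(frozenset(), candidates, score)
print(sorted(sorted(C) for C in final_scheme), final_score)
```

As in the procedure above, the result is only a local optimum with respect to the perturbations that the candidate generator allows.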

Finally, it is of interest to note that this cluster-detection procedure is roughly analogous to the ‘mixed forward search’ procedure in stepwise regression, where in the present case, we add new clusters or merge existing ones until some locally optimal stopping point is found. With this analogy in mind, it is in principle possible to consider ‘mixed backward search’ procedures as well. For example, one could start with a maximal number of singleton clusters and proceed by either eliminating or merging clusters until a stopping point is reached. Some experiments with this approach produced results similar to the present search procedure, but proved to be far more computationally demanding.

4.3. A test of spurious clustering

Although the cluster-detection procedure developed above will always find a (locally) best cluster scheme, formula, with respect to the BIC criterion used, there is still the statistical question of whether such clustering could simply have occurred by chance. Hence, one can ask how the optimal criterion value, formula, obtained compares with typical values obtainable by applying the same cluster-detection procedure to randomly generated spatial data. This can be formalized in terms of the hypothesis of complete spatial randomness, which in the present context asserts that the probability, formula, that any given establishment will locate in region, formula, is proportional to the areal size, formula, of that region, i.e. that
(4.15)

While the sampling distribution of formula under this hypothesis is complex, it can easily be estimated by Monte Carlo simulation. More precisely, for any given industrial location pattern of formula establishments, one can use (4.15) to generate, say, 1000 random location patterns of formula establishments, and apply the cluster-detection procedure to each pattern. This will yield 1000 values of formula, say formula. If the value for the actual cluster scheme, formula, is, say, bigger than all but five of these in the ordering of values, formula, then the chance, formula, of getting a value as large as this (under the hypothesis that formula is coming from the same population of random patterns) is formula. This would indicate very ‘significant clustering’. On the other hand, if formula were only bigger than, say, 800 of these values, then the formula value, formula, would suggest that the observed cluster scheme, formula, is not sufficiently significant to warrant further investigation. This procedure was used in the following illustrative application [as well as in the more extensive applications in Mori and Smith (2011a, 2011b, 2012) and Hsu et al. (2011)].
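
The simulation logic can be sketched as follows. This is a minimal illustration, not the procedure actually used: the test statistic is passed in as a callable, and in the example a simple maximum-density statistic stands in for the criterion value returned by the full cluster-detection procedure. The p-value convention (number of simulated values at least as large as the observed one, plus one, divided by the number of simulations plus one) is a common Monte Carlo convention assumed here, and all names and data are hypothetical.

```python
import random
from collections import Counter

def csr_pattern(areas, n, rng):
    """Locate n establishments independently, with each region chosen
    with probability proportional to its area."""
    regions = list(range(len(areas)))
    tally = Counter(rng.choices(regions, weights=areas, k=n))
    return [tally[r] for r in regions]

def monte_carlo_p_value(observed_counts, areas, statistic, n_sim=1000, seed=0):
    """Share of random patterns whose statistic is at least the observed one."""
    rng = random.Random(seed)
    n = sum(observed_counts)
    t_obs = statistic(observed_counts)
    simulated = [statistic(csr_pattern(areas, n, rng)) for _ in range(n_sim)]
    at_least = sum(t >= t_obs for t in simulated)
    return (at_least + 1) / (n_sim + 1)

# Hypothetical data: four regions with unequal areas, 50 establishments.
areas = [1, 1, 2, 4]
observed = [40, 5, 3, 2]

def max_density(counts):
    # Stand-in statistic: highest establishment density over all regions.
    return max(c / a for c, a in zip(counts, areas))

p = monte_carlo_p_value(observed, areas, max_density, n_sim=200)
print(p)
```

In practice the callable would run the full cluster-detection procedure on each simulated pattern and return its optimal criterion value, which is why the number of simulations is the main driver of computational cost.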

4.4. Essential clusters

The identified clusters, formula, vary in terms of their contribution to the value of formula. While the clusters with larger contributions are often insensitive to small perturbations of the original regional distribution of establishments, those with smaller contributions may be sensitive. Thus, to obtain more robust results, it may be useful to focus on those essential clusters which account for a large share of formula.

To formalize this idea, we start by assuming that an optimal cluster scheme, formula, has been found for the industry. To identify the essential clusters in formula, we proceed recursively by successively adding those clusters in formula with maximum incremental contributions to formula.24 This recursion starts with the ‘empty’ cluster scheme represented by formula, where formula denotes the full set of regions, formula. If the set of (non-residual) clusters in formula is denoted by formula, then we next consider each possible ‘one-cluster’ scheme created by choosing a cluster, formula, and forming formula, with formula. The ‘most significant’ of these, denoted by formula, is then taken to be the cluster scheme with the maximum BIC value (defined below). If this is called stage formula, and if the essential cluster scheme found at each stage formula is denoted by formula, then the recursive construction of these schemes can be defined more precisely as follows.

For each formula, let formula denote the (non-residual) clusters in formula (so that for formula we have formula), and for each cluster not yet included in formula, i.e. each formula, let formula be defined by, formula, where formula. Then, the additional essential cluster, formula (formula), at stage formula is defined by
(4.16)
where formula is the estimated maximum log-likelihood ratio for model formula given [in a manner paralleling expression (2.15)] by
(4.17)
where formula and formula. Thus, at each stage formula, the likelihood-maximizing cluster, formula, is removed from the residual region, formula, and added to the set of essential clusters in formula. The resulting formula value at each stage formula is then given by
(4.18)
with
(4.19)
Finally, the incremental contribution of each new cluster, formula, to BIC within formula is given by the increment for its associated cluster scheme, formula, as follows:
(4.20)

To identify the relevant set of essential clusters in formula, one simple criterion would be to require that each has a BIC contribution of at least some specified fraction, formula, of formula. In terms of this criterion, the procedure would stop at the first stage, formula, where additional increments fail to satisfy this condition, i.e. where formula. Refer to Mori and Smith (2011b, Section 3) for an application of these essential clusters.
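
The recursion and stopping rule can be sketched as below. The criterion is passed in as a callable that maps a list of clusters to a value, since expressions (4.17)–(4.19) are not reproduced here. In the toy example it is a simple additive function, which is an assumption made only to keep the illustration short (the actual BIC is not additive across clusters, which is why the recursion re-evaluates every remaining candidate at each stage). The names `essential_clusters`, `bic`, `mu` and the contribution values are hypothetical.

```python
def essential_clusters(clusters, bic, mu):
    """Add clusters one at a time in order of largest incremental contribution,
    stopping when the next increment falls below the fraction mu of the value
    for the full cluster scheme."""
    total = bic(list(clusters))
    chosen, remaining = [], list(clusters)
    current = bic([])
    while remaining:
        best = max(remaining, key=lambda C: bic(chosen + [C]))
        gain = bic(chosen + [best]) - current
        if gain < mu * total:
            break
        chosen.append(best)
        remaining.remove(best)
        current += gain
    return chosen

# Toy criterion: each cluster label has a fixed, made-up contribution.
contribution = {"A": 50, "B": 30, "C": 4, "D": 1}
toy_bic = lambda scheme: sum(contribution[C] for C in scheme)

selected = essential_clusters(["C", "A", "D", "B"], toy_bic, mu=0.05)
print(selected)
```

With these made-up numbers the threshold is 5% of 85, so the two dominant clusters are retained and the recursion stops when the third candidate contributes less than the threshold.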

5. An illustrative application

In this section, we illustrate the above procedure in terms of the two Japanese industries discussed in Section 1, which for convenience we refer to here as simply ‘plastics’ and ‘soft drinks’, respectively. These two industries are part of the larger study in Mori and Smith (2011b) that applies the present methodology to 163 manufacturing industries in Japan. As discussed in Section 4.2 of that article, the test of spurious clustering above identified nine industries with spurious clustering, so that only 154 industries were used in the final analysis. The appropriate notion of a ‘basic region’, formula, for purposes of this study was taken to be the municipality category equivalent to a city-ward-town-village in Japan. The relevant set formula was then taken to be the 3207 municipalities geographically connected to the major islands of Japan, as shown in Figure 11.25

Figure 11. Basic regions (shi-ku-cho-son) of Japan.

5.1. Comparison with a scalar measure of agglomeration

The choice of these two industries is motivated by their similarity in terms of overall degree of agglomeration. This can be illustrated in terms of the formula-index developed in Mori et al. (2005), which for a given industry formula is defined as the Kullback–Leibler (1951) divergence of its establishment location probability distribution, formula, [as in expression (2.1)] from purely random establishment locations. Here, the latter is characterized by the uniform probability distribution, formula, with formula [as in expression (4.15)]. By using the sample estimate of formula, namely, formula with formula [as in expression (2.7)], a corresponding estimate of this formula-index is given by
(5.1)
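
As a small numerical illustration of this index, the sketch below computes the Kullback–Leibler divergence of the empirical location shares from the area-proportional reference distribution described above. The function name and the example counts and areas are hypothetical, and regions with no establishments are taken to contribute zero to the sum (the usual convention for this divergence).

```python
import math

def d_index(counts, areas):
    """Kullback-Leibler divergence of empirical establishment shares (counts / n)
    from the area-proportional shares (area / total area)."""
    n, a = sum(counts), sum(areas)
    return sum((c / n) * math.log((c / n) / (ar / a))
               for c, ar in zip(counts, areas) if c > 0)

# Establishments spread exactly in proportion to area: no divergence.
print(d_index([10, 10, 20], [1, 1, 2]))

# All establishments in one of four equal-sized regions: divergence of ln 4.
print(d_index([8, 0, 0, 0], [1, 1, 1, 1]))
```

Because the index collapses the whole location pattern into one number, two industries with very different numbers, sizes and locations of clusters can still obtain nearly the same value.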

The intuition behind this particular index is that it provides a natural measure of distance between probability distributions. So by taking uniformity to represent the complete absence of clustering, it is reasonable to assume that those distributions ‘more distant’ from the uniform distribution should involve more clustering. Note also that since both formula and formula are based on similar log-likelihood measures of ‘distance from uniformity’, our cluster detection procedure is closer in spirit to this scalar measure than other possible choices such as the index by Ellison and Glaeser (1997).26 Hence formula provides a natural candidate for comparing the advantages of this approach over scalar measures in general. The histogram of divergence values, formula, for the 154 industries in Japan is shown in Figure 12 and is seen to range from formula up to formula. With respect to this overall range, the formula values, formula and formula, for soft drinks and plastics, respectively, are seen to be virtually identical.

Figure 12. Frequency distribution of D-values of Japanese manufacturing industries.

But in spite of this overall similarity, the agglomeration patterns obtained for these two industries are substantially different, as seen in Figures 13 and 14.

Panel (a) of each figure displays the establishment densities for the corresponding industry, where those basic regions with higher densities are shown as darker. In Panel (b), the individual clusters in the derived cluster scheme, formula, are represented by enclosed gray areas. The portion of each cluster in lighter gray shows those basic regions which contain no establishments (but are included in formula by the process of convex solidification).

Before examining these patterns in detail, it is of interest to consider the results of the cluster-detection procedure itself. By comparing the establishment densities and cluster schemes in Panels (a) and (b) of each figure, respectively, it is clear that these cluster schemes closely reflect the underlying densities from which they were obtained. Notice also that individual clusters are by no means ‘circular’ in shape. Rather each consists of an easily recognizable set of contiguous basic regions (municipalities) in formula that approximates the area of higher establishment density in Panel (a) of the figure. Notice also that certain clusters in each pattern are themselves contiguous. We shall return to this point below.

To compare these two agglomeration patterns in more detail, we begin by observing that while the plastics industry is more than twice as large as soft drinks in terms of the number of establishments (formula versus formula), its agglomeration pattern contains only formula clusters versus formula clusters for soft drinks. This illustrates the relative parsimony of our cluster-detection procedure with respect to larger industries, as mentioned following the definition of BIC in expression (2.13). Notice also that clustering is indeed much stronger in the plastics industry than in soft drinks. This can be seen in several ways. First, the share of plastics establishments in clusters is much larger than for soft drinks (formula versus formula). Second, the clusters for plastics are larger on average, not only in terms of establishments per cluster (as implied by the statistics above) but also in terms of areal extent, where they are more than three times larger.

5.2. Global extent versus local density of agglomerations

Aside from these general comparisons in terms of summary statistics, the level of spatial detail in each of these agglomeration patterns allows a much broader range of comparative measures. While such measures are developed in more detail in Mori and Smith (2011b), their essential elements are well illustrated in terms of the present pair of industries. As mentioned in Section 1, the plastics industry is primarily concentrated along the industrial belt of Japan as in Figure 13(b). More generally, industries often tend to concentrate within specific subregions of the nation, i.e. are themselves ‘spatially contained’. To make this precise in terms of our present model of cluster schemes, we adopt a two-stage approach. First, we identify the essential clusters with μ = 0.05 (as defined in Section 4.4) in the optimal cluster scheme, formula, for a given industry. We then define the essential containment (e-containment) for that industry to be the convex solidification of these essential clusters, in other words, the smallest convex solid²⁷ containing all these essential clusters for the industry. The e-containment for the plastics industry is indicated by the hatched area in Figure 13(c) which clearly distinguishes the ‘industrial belt’ portion of this industry. In contrast, the e-containment for soft drinks shown in Figure 14(c) appears to be much larger and reflects the wide scattering of essential clusters for this industry.

While these visual summaries of ‘containment’ can be very informative, it is often more useful to quantify such relations for purposes of analysis. One possibility here is to define the global extent (GE) of an industry to be the fraction of area in its e-containment relative to the nation as a whole.²⁸ In the present case, the GE values for plastics and soft drinks are formula and formula, respectively. So in terms of this measure, it is clear that the clusters of the plastics industry are much more localized than those of soft drinks.

Next observe that while the GE of the plastics industry is much smaller than that of soft drinks, the average size of its essential clusters is actually much larger. As is clear from Figures 13 and 14, these clusters are thus more densely packed inside the e-containment of the plastics industry. To capture this additional dimension of agglomeration patterns, we now designate the fraction of e-containment area represented by these essential clusters as the local density (LD) of the industry. Since the LD values for plastics and soft drinks are given, respectively, by formula and formula, it is also clear that the agglomeration pattern for plastics is much more locally dense than that of soft drinks.
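Both GE and LD reduce to simple area ratios. The sketch below illustrates them under the assumption that the e-containment and the essential clusters are already available as sets of basic-region identifiers. The function names, the `area` lookup and the toy numbers are hypothetical and are not taken from the study.

```python
def area_of(regions, area):
    """Total area of a collection of basic regions."""
    return sum(area[r] for r in regions)

def global_extent(e_containment, nation, area):
    """GE: share of national area covered by the e-containment."""
    return area_of(e_containment, area) / area_of(nation, area)

def local_density(essential_clusters, e_containment, area):
    """LD: share of e-containment area covered by the essential clusters."""
    covered = set().union(*essential_clusters)  # regions in any essential cluster
    return area_of(covered, area) / area_of(e_containment, area)

# Toy example: ten regions of unit area (hypothetical data).
area = {r: 1.0 for r in range(10)}
nation = set(range(10))
e_containment = {0, 1, 2, 3}
essential_clusters = [{0}, {2, 3}]

ge = global_extent(e_containment, nation, area)             # 4 of 10 regions
ld = local_density(essential_clusters, e_containment, area)  # 3 of 4 regions
```

An industry with a small GE and a large LD, as described for plastics, would thus have an e-containment covering little of the nation but largely filled by its essential clusters.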

5.3. Refinements of cluster schemes

Recall that in terms of our basic probability model of cluster schemes, formula, individual clusters, formula, are implicitly assumed to constitute sets of basic regions with similar (and unusually high) establishment density. But the relations between these clusters are left unspecified. In this regard, it was observed above that the optimal cluster schemes, formula, for both plastics and soft drinks contain clusters that are mutually contiguous. Here, it is natural to ask why such clusters were not ‘joined’ at some stage during the cluster-detection procedure. The reason is that our basic cluster probability model assumes that location probabilities are essentially uniform within each cluster [as in expression (2.3)], so that maximum-likelihood estimates for cluster probabilities, formula, are simply proportional to the number of establishments, formula, in that cluster. Hence, with respect to the BIC measure underlying this procedure, contiguous clusters with very different uniform densities often yield a better fit to establishment data than does their union with its associated uniform density. As one illustration, there is a contiguous chain of clusters for the plastics industry extending westward from Tokyo as far as Osaka [Figure 13(b)]. Here, the establishment densities in these contiguous areas are sufficiently different that treating each as a separate cluster yields a better overall fit in terms of BIC, even though the resulting scheme is penalized for its larger number of clusters.
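The trade-off between fit and penalty can be made concrete with a small numerical sketch. It rests on several assumptions that are not confirmed by this section: location probabilities within a cluster are proportional to economic area, cluster probabilities are estimated by establishment shares, and BIC takes the form of the maximized log-likelihood minus (k/2) ln n for a scheme with k clusters and n establishments. The function `scheme_bic` and all numbers are hypothetical; the study's own definition is expression (2.13).

```python
import math

def scheme_bic(clusters, residual):
    """BIC-type score of a cluster scheme under the assumptions stated above.

    clusters: list of (establishments, economic_area) pairs, one per cluster.
    residual: (establishments, economic_area) for all regions outside clusters.
    Likelihood terms shared by every scheme on the same regions are omitted,
    so scores are only meaningful for comparing schemes on the same data.
    """
    parts = list(clusters) + [residual]
    n = sum(count for count, _ in parts)
    log_lik = 0.0
    for count, area in parts:
        if count > 0:
            log_lik += count * math.log(count / (n * area))
    return log_lik - 0.5 * len(clusters) * math.log(n)

rest = (90, 98.0)  # establishments and area outside the clusters (toy values)

# Two contiguous areas with very different densities: separate clusters win.
separate = scheme_bic([(100, 1.0), (10, 1.0)], rest)
merged = scheme_bic([(110, 2.0)], rest)

# Two contiguous areas with nearly equal densities: the merged cluster wins.
separate_similar = scheme_bic([(100, 1.0), (95, 1.0)], rest)
merged_similar = scheme_bic([(195, 2.0)], rest)
```

In the first comparison the gain in fit from allowing two densities outweighs the extra penalty term, while in the second it does not, which is the behavior described for the chain of clusters between Tokyo and Osaka.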

It is often the case, however, that there are not only very different establishment densities among contiguous clusters, but also strong ‘central’ clusters: Tokyo, Nagoya and Osaka in this case. More generally, this suggests that there is often more spatial structure in cluster schemes than is captured by a simple listing of their clusters. In particular, this example suggests that a grouping of contiguous clusters around each central cluster (i.e. with the highest establishment density) might best be treated as single agglomerations for an industry.

To formalize these ideas, we begin with a given cluster scheme, formula, that has been identified for an industry. For each individual cluster, formula, let formula be the set of contiguous neighbors of formula in formula (including formula itself), so that by definition there exists for each formula a basic region, formula, which is adjacent to cluster formula, i.e. with formula.²⁹ For each formula, the maximal-density cluster in its immediate neighborhood, formula, can then be identified by a hill climbing function, formula.³⁰ In particular, if formula, then cluster formula is a local peak of establishment density with respect to its contiguous neighbors, formula, and hence can be considered as a central cluster in its vicinity. More generally, we can generate a unique central cluster for each formula by recursive applications of this hill climbing function. To do so, we begin by setting formula and constructing mth iterates of formula by formula for all integers, formula.³¹ It can easily be verified that this recursive mapping reaches a fixed point after a finite number of iterations. If the smallest such number is denoted by formula formula, then the fixed point of this mapping, say, formula, identifies the unique central cluster generated by each cluster formula. Accordingly, we now define the corresponding agglomeration, formula, generated by formula to be the solidification of all clusters leading to the same central cluster, formula, i.e. formula. Note that if formula is an isolated cluster, i.e. if formula, then by definition, formula. Moreover, for all clusters, formula, either formula or formula. So this procedure essentially transforms the cluster scheme, formula, by grouping its contiguous clusters into distinct agglomerations, each with a central cluster.
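The grouping step can be sketched in a few lines. The code below is an illustration only: clusters are represented by labels, `neighbors` lists the contiguous clusters of each cluster, `density` gives each cluster's establishment density, and ties are broken by label order (an assumption standing in for the additional uniqueness conditions mentioned in the text). The final solidification of each group is omitted.

```python
def hill_climb(cluster, neighbors, density):
    """Return the densest cluster among `cluster` and its contiguous neighbors."""
    best = cluster
    for other in sorted(neighbors[cluster]):  # sorted for a deterministic tie-break
        if density[other] > density[best]:
            best = other
    return best

def group_by_central_cluster(neighbors, density):
    """Map each central cluster to the set of clusters that climb to it."""
    groups = {}
    for cluster in neighbors:
        current = cluster
        while True:
            nxt = hill_climb(current, neighbors, density)
            if nxt == current:  # fixed point: a local density peak
                break
            current = nxt       # density strictly increases, so this terminates
        groups.setdefault(current, set()).add(cluster)
    return groups

# Toy example: a chain A - B - C - D plus an isolated cluster E.
neighbors = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}, "E": set()}
density = {"A": 5.0, "B": 9.0, "C": 4.0, "D": 7.0, "E": 1.0}
groups = group_by_central_cluster(neighbors, density)
```

Here clusters A and C both climb to B, D is its own local peak, and the isolated cluster E forms an agglomeration by itself, so five clusters reduce to three groups.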

Agglomerations identified for the plastics industry are shown in Figure 15, where the 43 clusters have been reduced to 30 agglomerations and darker colors indicate larger concentrations of establishments. Notice in particular that certain individual clusters in the Tokyo, Nagoya and Osaka areas have now been joined to larger agglomerations in these respective areas.

Figure 13. Spatial distributions of establishments and clusters (plastics industry). (a) Density of establishments (per km²), (b) clusters and (c) essential containment.

Figure 14. Spatial distributions of establishments and clusters (soft drinks industry). (a) Density of establishments (per km²), (b) clusters and (c) essential containment.

Figure 15. Agglomerations of plastics industry.

5.4. Sensitivity analysis

Finally, we report on the sensitivity of identified cluster schemes with respect to small perturbations to both the search algorithm and to regional boundaries. In doing so, it should be stressed that our main objective has been to propose the first practical framework for identifying industrial clusters on a map using regional data. So many refinements of the present search procedure are yet to be made, such as optimizing its computational efficiency. But even at this preliminary stage, it is nonetheless informative to consider the robustness of this procedure with respect to possible perturbations.

5.4.1. Alternative initial clusters

We first investigate the sensitivity of the results with respect to alternative starting points. To do so, we now re-initialize the cluster search procedure for a given industry by taking the initial cluster to be a randomly chosen municipality (with a strictly positive number of establishments of the industry in question). In particular, we have generated 10 such samples for each of formula industries with non-spurious clusters.

To compare the cluster schemes identified for industry formula in each of these samples with the original cluster scheme, we focus on their agreement in terms of establishments belonging to clusters. To do so, let formula and formula denote, respectively, the sets of municipalities in formula belonging to clusters identified for industry formula using (i) the original initial cluster and (ii) the formula-th sampled initial cluster. If for any set of municipalities, formula, we let formula denote the total number of formula establishments in these municipalities, then the agreement between these cluster schemes can be measured in terms of the share, formula, of cluster establishments common to both, as defined by
(5.2)
where formula with formula and formula.³² Over the full set of samples, formula, the minimum value of formula observed was 0.952 (with a mean of 0.999). On this basis, we conclude that the identified cluster schemes are highly robust to perturbation of the initial cluster.
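Since this share is described as an instance of the Jaccard measure, a minimal sketch is given here under the assumption that it equals the number of establishments in municipalities common to both cluster sets divided by the number in municipalities belonging to either set. The function name and inputs are hypothetical, and the precise definition is the one in (5.2).

```python
def cluster_agreement(original, sampled, establishments):
    """Establishment-weighted Jaccard share between two sets of municipalities.

    original, sampled: sets of municipalities belonging to clusters.
    establishments:    dict mapping municipality -> establishment count.
    """
    def count(regions):
        return sum(establishments.get(r, 0) for r in regions)

    in_either = count(original | sampled)
    if in_either == 0:
        raise ValueError("no establishments in either set of cluster municipalities")
    return count(original & sampled) / in_either

# Toy example with three municipalities (hypothetical counts).
establishments = {"m1": 10, "m2": 5, "m3": 5}
share = cluster_agreement({"m1", "m2"}, {"m1", "m3"}, establishments)
```

A value of 1 means that both runs place exactly the same establishments in clusters, so a minimum of 0.952 across samples indicates near-complete agreement.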

5.4.2. Perturbation of regional divisions

Next, we employ simulation methods to determine whether the identified cluster schemes are sensitive to small perturbations of municipality boundaries. To construct each perturbation, we first randomly partition the set of all municipalities, formula, into mutually exclusive adjacent pairs. In particular, if formula denotes the set of all adjacent municipality pairs, then this partition is given by a randomly selected maximal subset, formula, of mutually exclusive pairs in formula [which by definition satisfies the two conditions, that (i) formula for all formula, and that (ii) for each formula there is some formula with formula]. Second, for each pair, formula, we reallocate 5% of both establishments and economic area from one municipality to the other,³³ where the direction of the reallocation is determined randomly.
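One simple way to realize such a perturbation is sketched below. It assumes that a greedy pass over a shuffled list of adjacent pairs is an acceptable way to draw a maximal set of mutually exclusive pairs, and that the 5% is taken from the municipality on the giving side of each pair. Neither detail is specified in the text, and all function names are hypothetical.

```python
import random

def random_maximal_pairs(adjacent_pairs, rng):
    """Greedily pick a maximal set of disjoint pairs from a shuffled pair list."""
    pairs = list(adjacent_pairs)
    rng.shuffle(pairs)
    used, chosen = set(), []
    for a, b in pairs:
        if a not in used and b not in used:
            chosen.append((a, b))
            used.update((a, b))
    return chosen

def perturb(establishments, area, adjacent_pairs, rng, share=0.05):
    """Move `share` of establishments and area across each chosen pair."""
    est, ar = dict(establishments), dict(area)
    for a, b in random_maximal_pairs(adjacent_pairs, rng):
        # The direction of the reallocation is chosen at random.
        src, dst = (a, b) if rng.random() < 0.5 else (b, a)
        moved_est = round(share * est[src])  # note: Python rounds exact halves to even
        moved_area = share * ar[src]
        est[src] -= moved_est
        est[dst] += moved_est
        ar[src] -= moved_area
        ar[dst] += moved_area
    return est, ar
```

Because every pair not chosen shares a municipality with some chosen pair, the greedy selection is maximal in the sense of condition (ii), and total establishments and total economic area are preserved by construction.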

Using this procedure, we again generated 10 randomly perturbed samples for each of the 154 industries. Within each industry, we focus only on the set of essential clusters in the original cluster scheme (using formula as defined in Section 4.4). With respect to these clusters, we then compute the share of industry establishments that continues to appear in the essential clusters identified for the perturbation. In all but three industries, these industry shares exceeded formula in each of the 10 sample perturbations. A common property of these three exceptional industries (‘alcoholic beverages’, ‘paving materials’ and ‘cement and its products’) is that they are relatively ubiquitous. As for paving materials, the original clusters account for only formula of all establishments, which is the smallest among all industries (where the average value is formula). Thus, the majority of establishments of this industry are located outside clusters. As for the other two, while original clusters do account for a substantial percentage of industry establishments (more than formula), these clusters are spread over more than 40% of the national economic area, as opposed to an average share of formula for all industries. Thus, for extremely ubiquitous industries of this type, the change in lumpiness of establishment distributions across municipalities induced by such perturbations in the regional allocation of establishments and economic area can in principle produce quite different clustering patterns. But for the majority of industries, our cluster-identification procedure does appear to produce robust results.³⁴

Figures 16 and 17 show the results for our two example industries, plastics and soft drinks, respectively. The top and bottom panels in each figure show the clusters identified under the actual and perturbed municipality boundaries, respectively, where for the latter, we chose the random sample for which the industry share deviates the most from the actual pattern, i.e. the worst case. In each panel, the darker clusters represent the essential clusters. These two examples indicate that not only are the essential clusters robust, but the entire cluster distributions are also quite similar between the actual and perturbed samples. Basically, this same property holds for 151 of the 154 three-digit industries.

Figure 16. Clusters under the actual and perturbed municipality boundaries (plastics industry). (a) Clusters under the actual municipality boundaries and (b) clusters under perturbed municipality boundaries.

Figure 17. Clusters under the actual and perturbed municipality boundaries (soft drinks industry). (a) Clusters under the actual municipality boundaries and (b) clusters under perturbed municipality boundaries.

6. Concluding remarks

In this article, we have developed a simple cluster-scheme model of agglomeration patterns and have constructed an information-based algorithm for identifying such patterns. To the best of our knowledge, this constitutes the first systematic framework for doing so. In addition, this formal framework opens up a number of possible directions for further research. In particular, by utilizing the identified clusters, it becomes possible for the first time to directly identify the spatial patterns of industrial agglomerations on a map, and to test the hypotheses implied by recent theoretical developments on economic agglomeration in many-region or continuous location spaces (e.g. Fujita et al., 1999; Tabuchi and Thisse, 2011; Ikeda et al., 2011; Hsu, 2012). Below, we touch on two areas where initial investigations are already under way.

6.1. Cluster-based choice cities for industries

In our previous work (Mori et al., 2008), we reported on an empirical regularity between the (population) size and industrial structure of cities in Japan, designated as the number-average size (NAS) rule. This regularity (also established for the USA by Hsu, 2012) asserts a negative log-linear relation between the number and average population size of those cities where a given industry is present. Hence, the validity of the NAS rule depends critically on how such ‘industrial presence’ is defined. In a follow-up paper (Mori and Smith, 2011a), we have employed the present cluster-detection procedure to identify cities where given industries exhibit a ‘substantial’ presence with respect to their agglomeration patterns. In particular, if formula denotes the relevant set of cities in formula, and if formula is the cluster scheme identified for industry formula, then each city formula formula containing establishments from at least one of the clusters in formula is designated as a cluster-based (cb) choice city for industry formula. This cb-approach to industrial presence yields a sharper version of the NAS rule for the case of Japan. In addition, by identifying those cb-choice cities shared by different industries, this also provides one approach to analyzing spatial coordination between industries. In ongoing work (Hsu et al., 2012), we are examining the consequences of such industrial coordination for city size distributions, and in particular for the Rank Size Rule. In addition, by examining the spacing between cb-choice cities for industries, one can also formulate a range of testable propositions about the spatial structure of urban hierarchies.

6.2. Regional agglomeration analysis

As emphasized in Section 1, most analyses of industrial agglomeration have relied on overall indices of agglomeration, and hence have necessarily been aggregate in nature. However, the present identification of local cluster patterns for industries allows the possibility for more disaggregate spatial analyses. Of particular interest is the question of why industries agglomerate in certain regions and not others. While this question has of course been addressed by a variety of theoretical models, there has been little empirical work done to date. This is in large part due to the conspicuous absence of ‘local agglomeration’ measures. While the present cluster-scheme model is not itself numerical, it nonetheless suggests a number of possibilities for such measures.

The simplest are of course binary variables indicating the ‘presence’ or ‘absence’ of agglomeration. Indeed, the above definition of cb-choice cities yields precisely a binary variable of this type on the set of cities, formula. Hence, given appropriate socio-economic data for cities, formula formula, one could in principle test for significant predictors of industrial presence in these cities by employing standard logit or probit models.

Alternatively, one may focus directly on the individual clusters for each industry. Here, one might characterize the degree of local agglomeration for each industry in terms of the contribution of these clusters to the industry as a whole. Natural candidates include the fraction of industry establishments or employment in each cluster. Given the availability of data at the municipality level, one could in principle aggregate such data to the cluster level and use this to identify predictors of local agglomeration by more standard types of linear regression models. As one illustration, in Japan, data on education levels (among others) are available at the municipality level. Thus, by employing appropriate summary measures, ‘education accessibility’ across cluster municipalities can be defined. Then, by treating ‘industry’ as a categorical variable, one can attempt to compare the relative importance of these local accessibilities in attracting various industries. Regression analyses of this type will be presented in subsequent work (Mori and Smith, 2012).

Acknowledgments

In developing the basic idea of this article, we benefited from discussions with Tomoki Nakaya, Yoshihiko Nishiyama and Yukio Sadahiro. The road-network distance and map data of Japan were constructed by Takashi Kirimura. We also thank Asao Ando, David Bernstein, Gilles Duranton, Masahisa Fujita, Kazuhiko Kakamu, Kiyoshi Kobayashi, Yasusada Murata, Koji Nishikimi, Henry Overman, Yasuhiro Sato, Kazuhiro Yamamoto, Xiao-Ping Zheng, two anonymous referees as well as the editor, Kristian Behrens, for their constructive comments.

Funding

This study was conducted as a part of the Project, ‘The formation of economic agglomerations and the emergence of order in their spatial patterns: Theory, evidence, and policy implications,’ undertaken at RIETI, and partially supported by The Kajima Foundation and by The Grant in Aid for Research (Nos 13851002, 16683001, 17330052, 18903016, 19330049 and the 21st Century COE Program) of the Ministry of Education, Culture, Sports, Science and Technology of Japan.

1 Examples of such reference distributions are the regional distribution of all-industry employment or establishments (e.g. Ellison and Glaeser, 1997; Duranton and Overman, 2005), and that of economic area (e.g. Mori et al., 2005).

2 We shall use ‘clusters’ and ‘agglomerations’ interchangeably throughout the analysis to follow. However, one possible distinction between these terms is suggested in Section 5.3.

3 The recursive application of such procedures gives rise to the notorious ‘multiple testing’ problem that these procedures were originally designed to overcome. In essence, multiple applications of this procedure tend to identify too many clusters as being significant. For a further discussion of this ‘false discovery’ problem, see Castro and Singer (2005) together with the references cited therein.

4 An alternative approach would be to characterize spatial distributions of establishments as smooth surfaces, by utilizing the density estimation methods in Billings and Johnson (2012). However, a primary advantage of our present discrete characterization of agglomerations as spatially disjoint clusters is to allow systematic identification of the location, spatial extent and size of each individual agglomeration. This derived data can in turn lead to more detailed analyses of industrial agglomerations (as discussed in Section 6).

5 Here, all firms within each industry are implicitly treated as identical single-establishment firms.

6 A complementary clustering approach has recently been proposed by Kerr and Kominers (2012) which identifies establishment clusters based on maximal interaction distances. This distance approach is particularly useful when relevant interactions can be documented, as in the case of patent citations within research-intensive industries.

7 An implicit assumption here is that the regions formula in each cluster are contiguous. This assumption is not crucial at present, but will play a central role in the construction of clusters below.

8 A formal definition of cluster schemes is given in Definition 4.1.

9 This implicitly assumes that the regions within a given cluster not only have high densities of establishments but also that these densities are similar.

10 As pointed out by a referee, ‘economic area’ is at best a crude approximation to actual usable area for firms. But without more detailed information, we believe that it provides the best approximation currently available.

11 Note that formula is constructable from formula as shown above.

12 For instance, the numbers of counties in the USA and municipalities in Japan are both over 3000.

13 In Section 3.2, it is shown that such holes persist for even straight-line approximations to travel networks.

14 See Mori and Smith (2011b, Section 4.2.1) for the treatment of major off-shore islands.

15 This approximation appears to be good for the municipality network in Japan considered in Section 5. For the ratios of short-path over shortest route distances (formula) across all 4,491,991 relevant pairs of municipalities, the mean and the 99.5 percentile point are 1.14 and 1.28, respectively.

16 Our present notion of formula-convexity is an instance of the more general notion of geodesic convexity applied to graphs and appears to have first been introduced by Soltan (1983).

17 Throughout this article, we denote cardinality of a set formula by formula.

18 Since formula implies from (3.6) that formula, and since formula for all formula, it follows that this expansion process can involve at most formula steps.

19 In our present application, this iteration number is typically small.

20 Even if formula is an element of formula, it must always be part of the boundary of formula. Hence, it is still reasonable to assert that formula is ‘on the outside’ of formula.

21 Note also from this example that the notion of ‘solidity’ by itself is rather weak. However, when applied to formula-convex sets, this turns out to be exactly what is needed for ‘filling holes’.

22 The inclusion of large undeveloped regions (e.g. mountains and inland sea) of the nation can lead to an exaggerated depiction of agglomeration involving areas that are mostly devoid of establishments. It should be noted that this is in part due to our use of economic area (rather than total area), which effectively ignores such undeveloped land when expanding clusters.

23 In our application in Section 5, the value used is formula km, which was chosen so that any single expansion of a cluster cannot include a large section without economic area (e.g. inland sea and lakes). This formula value covers about 90% of the shortest path distances between neighboring basic regions (municipalities) in our application. It is also worth noting from a practical viewpoint that this use of uniform formula-neighborhoods has the added advantage of controlling (at least in part) for size differences among basic regions.

24 The procedure for identifying essential clusters in formula is different from the one used to identify formula in Section 4.2. Here, candidate clusters considered are only those in formula itself.

25 The establishment counts across these industries are taken from the Establishment and Enterprise Census of Japan in 2001. Economic area of each municipality is obtained by subtracting forests, lakes, marshes and undeveloped area from the total area of the municipality. The data are available from the Toukei de Miru Shi-Ku-Cho-Son no Sugata in 2002 and 2003 (in Japanese) by the Statistical Information Institute for Consulting and Analysis of Japan.

26 In fact, the Ellison–Glaeser index is highly correlated with formula (refer to Mori et al., 2005, Section D). So, the arguments in this section would remain essentially the same.

27 Recall Property 3.3 of convex solidification.

28 Here, we use the full geographic areas of basic regions rather than economic area, to give a better representation of ‘extent’. See further discussions in Mori and Smith (2011b, Section 3.2).

29 Recall from Section 3.1 that formula is the set of adjacent neighbors for basic region formula.

30 In practice, this solution is almost always unique. But if not, then additional conditions must be imposed to ensure uniqueness (such as choosing the maximal-density cluster with the largest number of establishments).

31 For example, formula.

32 This measure is an instance of the standard Jaccard measure of similarity between sets.

33 The number of establishments to be reallocated is rounded to the nearest integer.

34 Note that since municipality sizes vary significantly for the case of Japan, with geographic areas of municipalities ranging from 1.64 to 1408.1 km², these 5% reallocations of establishments and economic area are actually not ‘small’ perturbations for relatively large municipalities. Hence, the results here indicate strong robustness of our approach for the case of localized industries.

References

Akaike, H. (1973) Information theory as an extension of the maximum likelihood principle. In B. N. Petrov and F. Csaki (eds) Second International Symposium on Information Theory, pp. 267–281. Budapest: Akademiai Kiado.

Berge, C. (1963) Topological Spaces. New York: MacMillan.

Besag, J., Newell, J. (1991) The detection of clusters in rare diseases. Journal of the Royal Statistical Society, Series A, 154: 143–155.

Billings, S. B., Johnson, E. B. (2012) A non-parametric test for industrial specialization. Journal of Urban Economics, 71: 312–331.

Brülhart, M., Traeger, R. (2005) An account of geographic concentration patterns in Europe. Regional Science and Urban Economics, 35: 597–624.

Castro, M. C., Singer, B. H. (2005) Controlling the false discovery rate: a new application to account for multiple and dependent tests in local statistics of spatial association. Geographical Analysis, 38: 180–208.

Duranton, G., Overman, H. G. (2005) Testing for localization using micro-geographic data. Review of Economic Studies, 72: 1077–1106.

Ellison, G., Glaeser, E. L. (1997) Geographic concentration in US manufacturing industries: a dartboard approach. Journal of Political Economy, 105: 889–927.

Fujita, M., Krugman, P., Venables, A. J. (1999) The Spatial Economy: Cities, Regions, and International Trade. Cambridge, MA: MIT Press.

Henderson, J. V., Thisse, J.-F. (2004) Handbook of Regional and Urban Economics, vol. 4. Amsterdam: North-Holland.

Hsu, W. (2012) Central place theory and the city size distribution. Economic Journal, 122: 903–932.

Hsu, W., Mori, T., Smith, T. E. (2011) Industrial location and city size: does space matter? Unpublished data.

Ikeda, K., Akamatsu, T., Kono, T. (2012) Spatial period doubling agglomeration of a core-periphery model with a system of cities. Journal of Economic Dynamics and Control, 36: 754–778.

Kerr, W. R., Kominers, S. D. (2012) Agglomerative forces and cluster shapes. Working Paper 12-09, Center for Economic Studies, U.S. Census Bureau.

Kontkanen, P., Buntine, W., Myllymäki, P., Rissanen, J., Tirri, H. (2003) Efficient computation of stochastic complexity. In C. M. Bishop and B. J. Frey (eds) Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, pp. 181–188. Society for Artificial Intelligence and Statistics.

Kontkanen, P., Myllymäki, P. (2005) Analyzing the Stochastic Complexity via Tree Polynomials. Technical Report, 2005-4. Helsinki Institute for Information Technology.

Kullback, S., Leibler, R. A. (1951) On information and sufficiency. Annals of Mathematical Statistics, 22: 79–86.

Kulldorff, M. (1997) A spatial scan statistic. Communications in Statistics - Theory and Methods, 26: 1481–1496.

Kulldorff, M., Nagarwalla, N. (1995) Spatial disease clusters: detection and inference. Statistics in Medicine, 14: 799–810.

Marcon, E., Puech, F. (2010) Measures of the geographic concentration of industries: improving distance-based methods. Journal of Economic Geography, 10: 745–762.

Mori, T., Nishikimi, K., Smith, T. E. (2005) A divergence statistic for industrial localization. Review of Economics and Statistics, 87: 635–651.

Mori, T., Nishikimi, K., Smith, T. E. (2008) The number-average size rule: a new empirical relationship between industrial location and city size. Journal of Regional Science, 48: 165–211.

Mori, T., Smith, T. E. (2009) A probabilistic modeling approach to the detection of industrial agglomerations. Discussion Paper No. 682, Institute of Economic Research, Kyoto University.

Mori, T., Smith, T. E. (2011a) An industrial agglomeration approach to central place and city size regularities. Journal of Regional Science, 51: 694–731.

Mori, T., Smith, T. E. (2011b) Analysis of industrial agglomeration patterns: an application to manufacturing industries in Japan. Discussion Paper No. 794, Institute of Economic Research, Kyoto University.

Mori, T., Smith, T. E. (2012) Spatial approach to identifying agglomeration determinants. In progress. Unpublished data.

Porter, M. E. (1990) The Competitive Advantage of Nations. New York: The Free Press.

Schwarz, G. (1978) Estimating the dimension of a model. Annals of Statistics, 6: 461–464.

Soltan, V. P. (1983) D-convexity in graphs. Soviet Mathematics-Doklady, 28: 419–421.

Tabuchi, T., Thisse, J.-F. (2011) A new economic geography model of central places. Journal of Urban Economics, 69: 240–252.

Appendix

Formal analysis of d-convex solids

To develop formal properties of d-convex solids, we require a few additional definitions. First, for any path, formula, let formulaformula denote the reverse path in formula. Next, for any two paths, formula, with formula, the combined path, formula is designated as the concatenation of formula and formula. It then follows by definition that the length of any concatenated path, formula, is simply the sum of the lengths of formula and formula, i.e. that formulaformula. Using this and (3.5)–(3.8), it is convenient to establish the following well-known properties of d-convex sets, as in Definition 3.1 of the text. First, we show that for the d-convexification function, formula, in (3.8), the naming of this function is justified by the fact that:  

Proposition A.1 (d-Convexification)

For all formula, the image set, formula, is d-convex.

 
Proof

For any formula and shortest path, formula, it must be shown that formula But by definition, formula for some formula Hence, by (3.6), it follows that formulaformula . Thus, formulaformula.▪

Next, we show that the d-convex hull, formula, can be characterized as the unique smallest d-convex superset of formula. More precisely, if formula denotes the family of all d-convex sets in formula, then we have:

Proposition A.2 (minimality of d-convexifications)
For all formula,
(A.1)
 
Proof
By Proposition A.1, formula, and by (3.5)
(A.2)

Hence, it suffices to show that for all sets, formula, with formula and formula, we must have formula. By the definition of formula this in turn is equivalent to showing that formula for all formula. But by (3.4),
(A.3)
Moreover, by (3.3) and (3.4) together with the definition of d-convexity, it follows that
(A.4)

Hence, we may conclude from (A.3) and (A.4) that formula. Finally, since the same argument shows that formula, the result follows by induction on formula.▪

Finally, using these two results, we show that d-convex sets can be equivalently characterized as the fixed points of the d-convexification mapping, formula:  

Proposition A.3 (d-convex fixed points)
For all formula,
(A.5)
 
Proof

If formula then formula by Proposition A.1. Conversely, if formula then formula by (A.2), and formula by Proposition A.2, hence formula.▪

This in turn implies that the family, formula, of d-convex sets can be equivalently defined as in Expression (3.9) of the text. While this definition provides a natural parallel to the case of d-convex solids developed below, the more useful interval characterization of formula in Expression (3.10) of the text can easily be obtained from Proposition A.3 as follows:

Corollary (interval fixed points)
For all formula,
(A.6)
 
Proof

Since formula by (A.4) (with formula), and since formula holds for all formula [by (3.5)], it follows on the one hand that formula. Conversely, since formula for all formula (by recursion on formula), it follows from (3.8) and Proposition A.3 that formula.▪
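The fixed-point characterization above can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the construction used in the text: the graph is a small hypothetical undirected, connected example, distances are hop counts (a stand-in for the shortest-path distances of the text), and the function names `all_pairs_hops`, `interval` and `d_convexify` are illustrative. The interval of a set is taken to be the set together with every node lying on some shortest path between two of its members, and the d-convexification is obtained by iterating the interval map until it stops growing.

```python
from collections import deque
from itertools import combinations

def all_pairs_hops(adj):
    """Hop-count shortest-path distances via BFS from every node (connected graph assumed)."""
    dist = {}
    for src in adj:
        d = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[src] = d
    return dist

def interval(dist, S):
    """S plus every node lying on some shortest path between two members of S."""
    out = set(S)
    for a, b in combinations(sorted(S), 2):
        out |= {v for v in dist if dist[a][v] + dist[v][b] == dist[a][b]}
    return out

def d_convexify(dist, S):
    """Iterate the interval map until a fixed point is reached."""
    cur = set(S)
    while True:
        nxt = interval(dist, cur)
        if nxt == cur:
            return cur
        cur = nxt

# Hypothetical graph: x and y are each adjacent to a, b and z; w hangs off z.
ADJ = {
    "a": ["x", "y"], "b": ["x", "y"], "z": ["x", "y", "w"],
    "x": ["a", "b", "z"], "y": ["a", "b", "z"], "w": ["z"],
}
DIST = all_pairs_hops(ADJ)

one_step = interval(DIST, {"a", "b"})   # a single application of the interval map
hull = d_convexify(DIST, {"a", "b"})    # the limit of repeated applications

print(sorted(one_step))
print(sorted(hull))
print(interval(DIST, hull) == hull)     # the hull is a fixed point of the interval map
```

In this example a single application of the interval map is not enough: node `z` only enters at the second step, as a point on a shortest path between `x` and `y`. The pendant node `w` never enters, so the hull is a proper subset of the graph.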

Given these properties of d-convex sets, one objective of this appendix is to show that each of these properties is inherited by d-convex solids. To do so, we begin with an analysis of solid sets as in Definition 3.2 of the text. First, in a manner paralleling Proposition A.1, we show that for the solidification function, formula, defined by (3.12), the naming of this function is justified by the fact that:

Lemma A.1 (Solidification)

For all formula, the image set, formula, is solid.

 
Proof

If formula, then it must be shown that for all formula there is some path, formula with formula. But for any formula, it follows that formula and formula, so that by the definition of formula in (3.11), it must be true that there is some boundary region, formula, and path, formula with formula. Next, we show that formula as well. To do so, suppose to the contrary that formula, so that for some formula, formula with formula and formula. Then, again by the definition of formula it must be true that formula, which contradicts the fact that formula and formula. Hence, formulaformula, and the result is established.▪

If the family of all solid sets in formula is denoted by formula, then we next show that these sets are precisely the fixed points of the solidification function:  

Lemma A.2 (solid fixed points)
For all formula,
(A.7)
 
Proof

If formula then formula, so that formula by (3.12). Conversely, if formula, then by Lemma A.1, formula.▪

As a parallel to (A.6), this in turn implies that the family of solid sets in formula can be equivalently defined as follows:
(A.8)

Finally, solid sets also exhibit the following nesting property:  

Lemma A.3 (solid nesting)
For all formula,
(A.9)
 
Proof

Since formula, it suffices to show that formula. Hence, consider any formula and observe from the above that formula. Hence, it remains to consider formula. Here, we show that formula must be in formula. To do so, observe first that formula. Moreover, formula implies that for any path, formula we must have formula. But formula then implies formula. Hence, formula, and the result is established. ▪
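The three lemmas on solid sets can be checked on a toy example. The sketch below rests on assumptions that go beyond what this appendix states explicitly: regions are cells of a hypothetical 5 x 5 grid with four-neighbour adjacency, the boundary regions are the outer ring of cells, and solidification is read (following the proof of Lemma A.1) as adding every complementary region that cannot reach a boundary region by a path avoiding the set. The nesting property is read here as monotonicity of solidification under set inclusion. The names `grid_adjacency` and `solidify` are illustrative.

```python
from collections import deque

def grid_adjacency(n):
    """Four-neighbour adjacency on an n x n grid of cells."""
    adj = {(i, j): [] for i in range(n) for j in range(n)}
    for (i, j) in adj:
        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nb = (i + di, j + dj)
            if nb in adj:
                adj[(i, j)].append(nb)
    return adj

def solidify(adj, boundary, S):
    """Add to S every region that cannot reach the boundary without crossing S."""
    S = set(S)
    outside = {b for b in boundary if b not in S}
    queue = deque(outside)
    while queue:                      # flood fill from the boundary, avoiding S
        u = queue.popleft()
        for v in adj[u]:
            if v not in S and v not in outside:
                outside.add(v)
                queue.append(v)
    return set(adj) - outside

N = 5
ADJ = grid_adjacency(N)
BOUNDARY = {(i, j) for (i, j) in ADJ if i in (0, N - 1) or j in (0, N - 1)}

# A ring of eight cells around the centre (2, 2): the centre is a "hole".
RING = {(i, j) for (i, j) in ADJ if max(abs(i - 2), abs(j - 2)) == 1}
filled = solidify(ADJ, BOUNDARY, RING)

# Removing one side cell opens the ring, so the centre is no longer enclosed.
broken = RING - {(1, 2)}

print(sorted(filled - RING))                               # cells added by solidification
print(solidify(ADJ, BOUNDARY, filled) == filled)           # image set is a fixed point
print(solidify(ADJ, BOUNDARY, broken) <= filled)           # nesting under inclusion
```

The flood fill starts from boundary cells outside the set, so any cell it never reaches is either in the set or enclosed by it; this is the complement-based reading of solidity used in the proof of Lemma A.1.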

With these properties of solid sets, we are ready to analyze d-convex solids in formula. As asserted in the text, our key result is to show that d-convexity is preserved under solidifications:

Theorem A.1 (Solidification invariance of d-convexity)

For all d-convex sets, formula, the image set, formula, is also d-convex.

 
Proof

Suppose to the contrary that for some d-convex set, formula, the image set formula is not d-convex. Then, there must exist some pair of elements, formulaformula, and some shortest path, formula, with formula. But if formula then by the d-convexity of formula we would have formula. So at least one of these elements must be in formula. Without loss of generality, we may suppose that formula and that formula is some element of formula, so that formula with formula and formula. But then we must have formula. For if not then we obtain a contradiction as follows. Since formula and formula, there must be some path, formula with formula. Hence, the combined path, formulaformula, then satisfies formula, which contradicts the hypothesis that formula. Thus, we may assume that there is some formula and consider the following two cases:

  • (i)

    Suppose first that formula is also an element of formula. We then show that this contradicts the hypothesized shortest path property of formula as follows. Observe first that if formula denotes the reverse path for formula above, then the same argument used for formula above now shows that there must be some formula, so that formulaformulaformula with formula and formula. These paths are shown in Figure 18.

Figure 18. Example (formula).

But if we choose any shortest path, formula (as in Figure 18), then it follows from the d-convexity of formula, together with formula and formula that formula [since every shortest path in formula lies in formula, and formula]. Hence, for the path, formulaformula, we must have formulaformula which contradicts the shortest path property of formula.

  • (ii)

    Finally, suppose that formula, and for the point formula above, consider the representation of formula as formula with formula and formula, as shown in Figure 19.

Figure 19. Example (formula).

Then, we again show that this contradicts the shortest path property of formula as follows. For any shortest path, formula (as in Figure 19), the d-convexity of formula, together with formula and formula, now implies that formula. Thus, for the path, formula, we must have formulaformula which again contradicts the shortest path property of formula. Hence, for each pair of elements, formulaformula, there can be no shortest path, formula, with formula, so that formula is d-convex.▪

With this result, we can now establish parallels to Propositions A.1, A.2 and A.3 above for d-convex solids, as in Definition 3.3. First, we show that for the d-convex solidification function, formula, in (3.13), the naming of this function is justified by the fact that:  

Theorem A.2 (d-convex solidification)

For each set, formula, the image set, formula, is a d-convex solid.

 
Proof
First observe from Definition 3.3 that we may use Expressions (A.6) and (A.7) to define the family of all d-convex solids in equivalent terms as
(A.10)

Hence, it suffices to show that formula. But by Proposition A.1, it follows that formula, and hence as a direct consequence of Theorem A.1 that formula. Moreover, since formula also implies from Lemma A.1 that formula, it then follows that formula.▪

Next, as a parallel to Proposition A.2, we now have:  
Theorem A.3 (minimality of d-convex solidifications)
For each set, formula,
(A.11)
 
Proof

First observe from Theorem A.2 that formula and from Expression (A.2) that formula (since by definition, formula for all formula). Hence, it suffices to show that formula whenever formula. But by Proposition A.2, formula and formula imply that formula. Moreover, since formula, we obtain the following conclusion from Lemma A.3 together with Lemma A.2:
(A.12)

Hence, the result is established.▪

Finally, we may use these results to show that d-convex solids are equivalently characterized as fixed points of the d-convex solidification function, formula:

Theorem A.4 (d-convex solid fixed points)
For all formula,
(A.13)
 
Proof

If formula then by Theorem A.2, formula. Conversely, if formula then since formula implies from Proposition A.3 that formula, we may conclude from Lemma A.2 that formula, and the result is established.▪
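As a numerical illustration of Theorems A.1, A.2 and A.4, the following minimal sketch composes the two operations on a small hypothetical graph. Several things here are assumptions rather than statements from the text: distances are hop counts on an undirected, connected graph, the boundary regions are the four outer nodes `o0` to `o3`, the d-convex solidification is taken to be d-convexification followed by solidification (the order suggested by the proof of Theorem A.2), and all function names are illustrative.

```python
from collections import deque
from itertools import combinations

def all_pairs_hops(adj):
    """Hop-count shortest-path distances via BFS from every node (connected graph assumed)."""
    dist = {}
    for src in adj:
        d = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[src] = d
    return dist

def interval(dist, S):
    """S plus every node lying on some shortest path between two members of S."""
    out = set(S)
    for a, b in combinations(sorted(S), 2):
        out |= {v for v in dist if dist[a][v] + dist[v][b] == dist[a][b]}
    return out

def d_convexify(dist, S):
    """Iterate the interval map until a fixed point is reached."""
    cur = set(S)
    while True:
        nxt = interval(dist, cur)
        if nxt == cur:
            return cur
        cur = nxt

def solidify(adj, boundary, S):
    """Add to S every node that cannot reach the boundary without crossing S."""
    S = set(S)
    outside = {b for b in boundary if b not in S}
    queue = deque(outside)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in S and v not in outside:
                outside.add(v)
                queue.append(v)
    return set(adj) - outside

def d_convex_solidify(adj, dist, boundary, S):
    """Assumed composition: d-convexify first, then solidify."""
    return solidify(adj, boundary, d_convexify(dist, S))

# Hypothetical graph: a 4-cycle r0..r3, one outer (boundary) node per ring node,
# and an inner node h attached only to r0.
ADJ = {
    "r0": ["r1", "r3", "o0", "h"], "r1": ["r0", "r2", "o1"],
    "r2": ["r1", "r3", "o2"], "r3": ["r2", "r0", "o3"],
    "o0": ["r0"], "o1": ["r1"], "o2": ["r2"], "o3": ["r3"], "h": ["r0"],
}
BOUNDARY = {"o0", "o1", "o2", "o3"}
DIST = all_pairs_hops(ADJ)

hull = d_convexify(DIST, {"r0", "r2"})
result = d_convex_solidify(ADJ, DIST, BOUNDARY, {"r0", "r2"})

print(sorted(hull))
print(sorted(result))
print(interval(DIST, result) == result)            # still d-convex after solidification
print(solidify(ADJ, BOUNDARY, result) == result)   # and solid
```

Here the d-convex hull of `{r0, r2}` is the full ring, which encloses `h`; solidification adds `h`, and the resulting set remains d-convex, so it is a fixed point of the composed map.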