Power-law Distributions in Empirical Data

Throughout many fields of science, one finds quantities which behave (or are claimed to behave) according to a power-law distribution. That is, one quantity of interest, y, scales as another number x raised to some exponent:

$$y \propto x^{-\alpha}.$$

Power-law distributions made it big in complex systems when it was discovered (or rather re-discovered) that a simple procedure for growing a network, called “preferential attachment,” yields networks in which the probability of finding a node with exactly k other nodes connected to it falls off as k to some exponent:

$$p(k) \propto k^{-\gamma}.$$

The constant γ is typically found to be between 2 and 3. Now, from my parenthetical remarks, the Gentle Reader may have gathered that the story is not quite a simple one. There are, indeed, many complications and subtleties, one of which is an issue which might sound straightforward: how do we know a power-law distribution when we see one? Can we just plot our data on a log-log graph and see if it falls on a straight line? Well, as Eric and I are fond of saying, “You can hide a multitude of sins on a log-log graph.”
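To make the growth procedure mentioned above concrete, here is a minimal sketch of a preferential-attachment process in Python. It is an illustration under my own assumptions, not code from any of the papers discussed: the function name `preferential_attachment`, the parameter `m` (edges added per new node), and the fully connected starting core are all choices made for this example.

```python
import random

def preferential_attachment(n_nodes, m=2, seed=0):
    """Grow a network by preferential attachment; return each node's degree."""
    rng = random.Random(seed)
    # Assumed starting point: a fully connected core of m + 1 nodes,
    # so every core node begins with degree m.
    degree = [m] * (m + 1)
    # 'stubs' lists each node once per edge end, so a uniform draw from it
    # picks a node with probability proportional to its current degree.
    stubs = [v for v in range(m + 1) for _ in range(m)]
    for new in range(m + 1, n_nodes):
        targets = set()
        while len(targets) < m:          # m distinct, degree-biased targets
            targets.add(rng.choice(stubs))
        degree.append(m)                 # the newcomer arrives with m edges
        for t in targets:
            degree[t] += 1
            stubs.extend((t, new))       # record both ends of the new edge
    return degree

degrees = preferential_attachment(20000, m=2, seed=1)
print("smallest degree:", min(degrees), "largest degree:", max(degrees))
```

Most nodes stay near the minimum degree while a few early arrivals collect a great many links, which is the heavy tail that p(k) describes. Tabulating `degrees` and staring at a log-log plot is exactly the temptation the rest of this post warns about.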

Via Dave Bacon comes word of a review article on this very subject. Clauset, Shalizi and Newman offer us “Power-law distributions in empirical data” (7 June 2007), whose abstract reads as follows:

Power-law distributions occur in many situations of scientific interest and have significant consequences for our understanding of natural and man-made phenomena. Unfortunately, the empirical detection and characterization of power laws is made difficult by the large fluctuations that occur in the tail of the distribution. In particular, standard methods such as least-squares fitting are known to produce systematically biased estimates of parameters for power-law distributions and should not be used in most circumstances. Here we describe statistical techniques for making accurate parameter estimates for power-law data, based on maximum likelihood methods and the Kolmogorov-Smirnov statistic. We also show how to tell whether the data follow a power-law distribution at all, defining quantitative measures that indicate when the power law is a reasonable fit to the data and when it is not. We demonstrate these methods by applying them to twenty-four real-world data sets from a range of different disciplines. Each of the data sets has been conjectured previously to follow a power-law distribution. In some cases we find these conjectures to be consistent with the data while in others the power law is ruled out.
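For a feel of what "maximum likelihood methods and the Kolmogorov-Smirnov statistic" involve, here is a small Python sketch. It uses the standard maximum-likelihood estimator for a continuous power law, which is α = 1 + n / Σ ln(x_i / x_min), and a plain Kolmogorov-Smirnov distance between the data and the fitted curve. The function names are mine, and the lower cutoff `x_min` is simply handed in, which is a simplification: in real data the cutoff usually has to be estimated as well.

```python
import math
import random

def sample_power_law(n, alpha, x_min=1.0, seed=0):
    """Draw n values from a continuous power law p(x) ~ x^(-alpha), x >= x_min."""
    rng = random.Random(seed)
    # Inverse-transform sampling: x = x_min * (1 - u)^(-1 / (alpha - 1)).
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]

def fit_alpha(data, x_min):
    """Maximum-likelihood exponent for the continuous case, x_min assumed known."""
    tail = [x for x in data if x >= x_min]
    alpha = 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)
    return alpha, len(tail)

def ks_distance(data, alpha, x_min):
    """Largest gap between the empirical CDF of the tail and the fitted CDF."""
    tail = sorted(x for x in data if x >= x_min)
    n = len(tail)
    worst = 0.0
    for i, x in enumerate(tail):
        model = 1.0 - (x / x_min) ** (1.0 - alpha)   # fitted CDF at x
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        worst = max(worst, abs(model - i / n), abs(model - (i + 1) / n))
    return worst

data = sample_power_law(20000, alpha=2.5, seed=42)
alpha_hat, n_tail = fit_alpha(data, x_min=1.0)
print("estimated alpha:", alpha_hat)
print("KS distance:", ks_distance(data, alpha_hat, x_min=1.0))
```

Note that no histogram, binning, or straight-line fit appears anywhere: the estimate comes directly from the individual data values.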

After going over the theory involved, the authors look at twenty-four different real-world data sets which have been claimed to follow a power-law distribution. They compute p-values based on the power-law model to judge whether or not a power law is actually a good description of that data. For most of the twenty-four datasets, the power-law description holds up pretty well. In one case — the distribution of word frequencies in Moby Dick — the power-law description trumps all others, while in other cases, fits to other curves may still be plausible, but power laws certainly aren’t ruled out.
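The idea behind such a p-value can be sketched with a simple bootstrap: fit the power law, measure how far the data sit from the fit, then generate many synthetic data sets from the fitted law and ask how often they sit at least as far from their own fits. The Python below is a simplified illustration and not the authors' code. It assumes the cutoff `x_min` is known and that every data point lies at or above it, and all function names are hypothetical.

```python
import math
import random

def draw_power_law(n, alpha, x_min, rng):
    """Inverse-transform samples from p(x) ~ x^(-alpha) on x >= x_min."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]

def fit_alpha(data, x_min):
    """Continuous maximum-likelihood exponent; all data assumed >= x_min."""
    return 1.0 + len(data) / sum(math.log(x / x_min) for x in data)

def ks_distance(data, alpha, x_min):
    """Largest gap between the empirical CDF and the fitted power-law CDF."""
    xs = sorted(data)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        model = 1.0 - (x / x_min) ** (1.0 - alpha)
        gap = max(gap, abs(model - i / n), abs(model - (i + 1) / n))
    return gap

def bootstrap_p_value(data, x_min, n_sims=200, seed=0):
    """Fraction of synthetic power-law samples that fit worse than the data."""
    rng = random.Random(seed)
    alpha = fit_alpha(data, x_min)
    observed = ks_distance(data, alpha, x_min)
    worse = 0
    for _ in range(n_sims):
        fake = draw_power_law(len(data), alpha, x_min, rng)
        # Refit each synthetic sample so it gets the same benefit as the data.
        if ks_distance(fake, fit_alpha(fake, x_min), x_min) >= observed:
            worse += 1
    return worse / n_sims

rng = random.Random(1)
shifted_exponential = [1.0 + rng.expovariate(1.0) for _ in range(1000)]
print("p-value:", bootstrap_p_value(shifted_exponential, x_min=1.0))
```

A small p-value means that genuine power-law samples almost never fit as badly as the data do, which is the sense in which a power law gets "ruled out." A large p-value only says the power law has not been ruled out, not that it is correct.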

However, in seven of the twenty-four datasets, the authors’ calculations show that a power-law model is just no good!

In particular, the distributions for the HTTP connections, earthquakes, web links, fires, wealth, web hits, and the metabolic network cannot plausibly be considered to follow a power law; the probability of getting a fit as poor as that observed purely by chance is very small in each case and one would have to be unreasonably optimistic to see power-law behavior in any of these data sets. (For two data sets — the HTTP connections and wealth distribution — the power law, while not a good fit, is nonetheless better than the alternatives, implying that these data sets are not well-characterized by any of the functional forms considered here.)

Furthermore, they find that it is in general difficult to tell a power-law distribution apart from another kind known as a log-normal, whose probability density function looks like this:

$$p(x;\mu,\sigma) = \frac{1}{\sqrt{2\pi}} \frac{e^{-(\ln x - \mu)^2/(2\sigma^2)}}{x \sigma},$$

where μ is the mean and σ the standard deviation of the logarithm of x. The authors note that

the log-normal is not ruled out for any of our data sets, save the HTTP connections. In every case it is a plausible alternative and in a few it is strongly favored. In fact, we find that it is in general extremely difficult to tell the difference between a log-normal and true power-law behavior. Indeed over realistic ranges of x the two distributions are very closely equal, so it appears unlikely that any test would be able to tell them apart unless we have an extremely large data set. Thus one must again rely on physical intuition to draw any final conclusions. Otherwise, the best that one can say is that the data do not rule out the power law, but that other distributions are as good or better a fit to the data.
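A short calculation (mine, not taken from the paper) suggests why the two are so hard to tell apart. Taking the logarithm of the log-normal density above gives

$$\ln p(x;\mu,\sigma) = -\ln x - \frac{(\ln x - \mu)^2}{2\sigma^2} - \ln\left(\sigma\sqrt{2\pi}\right),$$

which is a parabola in ln x, whereas a power law would be a straight line. The local slope on a log-log plot is

$$\frac{d \ln p}{d \ln x} = -\left(1 + \frac{\ln x - \mu}{\sigma^2}\right),$$

so the log-normal behaves like a power law whose effective exponent, 1 + (ln x − μ)/σ², drifts slowly with x. Over one decade in x, ln x changes by about 2.3, so the effective exponent shifts by roughly 2.3/σ². For a broad distribution, say σ = 3, that is a drift of only about 0.26 per decade, which can easily disappear into the noise in the tail of a modest data set.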

Because this is the age of Web 3.1, Aaron Clauset has a blog entry devoted to discussing this paper, here. He and his co-authors have even made their code publicly available — if you can use MATLAB and R, it’s all yours!

Applications to the technological singularity are left as an exercise to the interested reader.

One thought on “Power-law Distributions in Empirical Data”

1. ben says:

Rely on physical intuition? Uh oh.