Data Analysis – Do You Really Mean Average?

In the corporate world I see this issue quite frequently.  Specifically, I will hear a request where the wording doesn’t match what the requestor is ultimately looking for.  To illustrate, I have included an example below that shows ten different customers within a territory.  For each customer the total revenue year-to-date is listed.  To make the point clear, I listed Customer 5 with revenue that is far higher than the rest.

Now here’s the question I typically hear:

"What is the average customer size (revenue) for Territory A?"

Here is what that really means most of the time:

"What is a typical customer size (revenue) for Territory A?"

You may think it’s semantics, but it’s really not.  I don’t want to turn this into a statistics lesson, but average (mean) doesn’t always translate into typical.  Because Customer 5 is such an outlier, the average (sum of all customer revenue divided by count of customers) will be higher than if that customer fell into the typical range like the rest.

I have included the median revenue amount for the ten customers, which I think is generally a better indicator of a typical customer than the mean or average.  The median is simply the number in the middle once the values are sorted; with an even count, such as ten customers, it is the average of the two middle values.  In reality, Customer 5’s revenue could be 875 zillion dollars and the median amount wouldn’t change.  When there are thousands of records and you need to know what the typical amount is, it’s often safer to choose the median unless you want to take the time to calculate min, max, median, standard deviation and mean to compare.
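To make the comparison concrete, here is a minimal Python sketch. The revenue figures are invented for illustration and are not the numbers from my example; the `summarize` helper is also just a hypothetical name. It uses only the standard library `statistics` module.

```python
from statistics import mean, median, stdev

# Hypothetical year-to-date revenue for ten customers in one territory.
# These values are made up for illustration; Customer 5 is the outlier.
revenue = [40_000, 45_000, 50_000, 55_000, 800_000,
           58_000, 60_000, 62_000, 65_000, 70_000]


def summarize(values):
    """Return the handful of statistics worth comparing side by side."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "stdev": stdev(values),  # sample standard deviation
    }


summary = summarize(revenue)
print(f"mean:   {summary['mean']:,.0f}")    # 130,500
print(f"median: {summary['median']:,.0f}")  # 59,000

# Make the outlier absurdly large: the mean explodes, the median does not move.
extreme = revenue.copy()
extreme[4] = 875 * 10**21
extreme_summary = summarize(extreme)
print(f"median with extreme outlier: {extreme_summary['median']:,.0f}")  # 59,000
```

With one customer at 800,000, the mean lands at more than double the median, even though nine of the ten customers sit between 40,000 and 70,000. A large gap between the two, or a standard deviation bigger than the mean, is a quick hint that an outlier is distorting the average.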

"In probability theory and statistics, a median is described as the numeric value separating the higher half of a sample, a population, or a probability distribution, from the lower half." [source]

The real question to answer is this: can a typical territory have one very large customer, or is this a unique situation that should not be considered normal?  The answer makes all the difference in which calculation to use.  Most often I will include both.

Median vs. Average Example

It’s my belief that most people are simply familiar with the term average because it’s so commonly used.  The underlying reason that average is more prevalent in analysis is probably that it’s very easy to calculate.  Before spreadsheet software automated the median calculation, it was much more difficult to get a median amount, even with a calculator.

As a data analyst, it’s prudent to know the difference between mean and median and when each is applicable.  Telling the CEO/CFO that the typical customer is roughly $131,000 when one customer is atypical and the true amount is more like $57,000 can be a career changer.


13 Responses to “Data Analysis – Do You Really Mean Average?”

  1. Jon Peltier Says:

    This is the whole premise of a book by Sam Savage called The Flaw of Averages. It describes how people look at averages instead of whole distributions (“typical” and “median” are also incomplete measures), and completely misunderstand risk.

    Tony Reply:

    Jon, you may understand whole distribution principles, but I think in general, most executives may not even know what median is outside of the middle of two roads, which metaphorically is right on.

    Thanks for the tip on the book. I’ll have to have Amazon Prime sneak it into my Christmas stocking.

    Jon Peltier Reply:

    A client sent me a copy of the book when I helped him with some charts last month. I’d done some business modeling work with him: I built the templates and some UDFs, and he hooked the model up to a Monte Carlo routine.

    Imagine my amusement when I got to the relevant part of the book, and found some of the graphs I’d made for his modeling project.

    The book is easy to read, not too formal, so even some of those executives might have a chance.

  2. Naomi B. Robbins Says:

    You are treating the terms “average” and “mean” as synonyms. Many people do. When I went to school we were taught that the term “average” meant “typical value” and that means, medians and modes were all forms of averages. I just checked three statistical dictionaries; two of them said that average could be median or mode as well as mean while the third said that average is the same as mean. The third is less accurate in many other areas as well.

    If I heard a request for an average value, I’d question which average they meant or provide more than one with an explanation of when each was appropriate. Mean, median, and mode are statistical concepts while average is an everyday term.

    Tony Reply:

    Naomi, you are right. Mean and average don’t always translate to the same thing. There can be geometric mean, arithmetic mean, population mean, etc. For the purposes of this post, I am referring to the arithmetic mean when I said “Mean”. I think the confusion comes from people using the term average when really meaning arithmetic mean. Again, you are correct in that mean, median and mode all measure central tendency (average).

    Going back to the purpose of this post, I think the thing to take away is that if you are only providing the arithmetic mean, commonly called the “average”, you are really putting yourself out there in terms of accuracy risk.

    Jon Peltier Reply:

    In Excel, you have to use the AVERAGE function to calculate the mean. Because of this, I tend to forget that average is a more general term.

    Tony Reply:

    I guess =arithmeticmean(x:x) may have been too long to have to write out. On another note, why do you have to type out “average” in Excel? You should be able to just type =avg(x:x).

  3. Mathias Says:

    Can’t agree more with Jon. Why settle for half the truth, when you can have the whole truth? In the end, whether you produce the mean, median, or the mode (which btw matches the informal definition of “typical customer size” better than the median), you are still reducing a whole set of numbers to ONE single number. No matter how you do it, there will be loss of information.

    Tony Reply:

    Mathias, thanks for stopping by and leaving a comment.

    Along the same lines as Naomi’s response, central tendency is measured by mean, median and mode. You are correct in that all of these functions reduce a population down to a single number. But that’s the point. How would you respond to an Executive if they were to ask directly what a typical customer size is in terms of revenue? You could answer with the arithmetic mean and median only to add a disclaimer for the faults of each OR we could introduce a confidence interval. I think going to whole distribution and standard deviations would be over most people’s heads in the corporate world.

    I could say, with a 99% confidence level, the typical revenue is between x and y. Most non-statisticians probably won’t go beyond arithmetic mean (they would say “average”) or maybe median. Using a confidence interval may be the way to go. What do you think?

  4. Naomi B. Robbins Says:

    How about giving the executives what they ask for as well as a plot of all the data?

    Tony Reply:

    Naomi – what would you provide if asked that question? I wrote to Mathias saying that using a confidence interval may be a good option. I do like the idea of plotting all of the data points. But if I could only choose one, it would probably be median.

    Jon Peltier Reply:

    The confidence intervals are not too much better than just a central measure. Nowadays, though, you can enter a distribution, either via a distribution type and its parameters or by entering a polygon, and let Monte Carlo produce an outcome distribution.

  5. Peter Robinson Says:

    Hi Tony, thanks for raising the issues around Averages. I have long had similar discussions with computer measurement vendors. It seems that because the mean is so easily calculated using a running total and a running count of samples, it is mindlessly provided as the standard measure of computer performance when it really is quite useless for the purpose.
    Jon, thanks for the book link. That looks interesting.
