Online Data Analysis and Visualization Tool [Poll]

February 2nd, 2010

Not too long ago, I got a tip from someone on Twitter with a link to a site called Verifiable.com.  Upon further investigation, I learned that this site is similar to Manyeyes.com in that you can upload a data set and, using the tools on the site, create data visualizations. 

At first glance, the site seems somewhat plain.  After digging into what the site is about, I quickly learned that they utilize sound and popular theory in the data visualization field.  On their about page, the first line that explains the features is:

“A clean, low-chartjunk philosophy — no shadows, no pie charts, no 3-D bar graphs, just the ink you need. [verifable.com]”

Well, simply reading that piqued my interest because they follow principles similar to the ones I use when creating charts/graphs.  No frills.  I like the fact that you can create charts that don’t have excessive grid lines, shadows, weak labeling and limited charting options.  Below you will see a few examples from their site.  You can also follow the links to see the visualizations in an interactive environment.  As you will see, there is a lot of data (hover over), many different options and some good visuals.  Granted, I had no idea what some of the charts were trying to show, but in general this site gives you a seemingly good tool to apply charting/graphing best practices.

Major League Baseball Payroll Efficiency 2006-2008 

[Interactive version]

U.S. Unemployment Rates by Education, 1992-Latest

[Interactive version]

Verifiable also offers a Pro version of their tool where you can keep your data and visualizations private and receive premium support.  The cost is minimal with the Pro version going for $29.95/year. 

I didn’t try to upload a data set to give the site a full trial, but it definitely looks interesting.  I am not sure how much demand there is for online data visualization using a tool like Verifiable.

Mean and Median – Part 2

January 19th, 2010

Back in December I wrote a post about using the arithmetic mean and median when analyzing data.  This is a follow-up post that shares some insights from Naomi Robbins, author of Creating More Effective Graphs.  The following paragraph was Naomi’s response to my question, “what would you provide if asked what is a typical customer size?”

“I’d give the median together with the graph below. The graph, modeled after Figure 4.1 of Creating More Effective Graphs, shows all the data, as Jon suggested. I’d say something like:

Median-Mean Chart

"I provided the median (shown by the black line in the figure) rather than the mean (shown by the light cyan line) since as you can see from the figure, the mean is not a typical value. There are no actual customers who have revenues near to the mean value because customer 5 influences the mean so strongly since its revenues are so much higher than the others. Half of our customers have revenues less than the median while the other half have revenues greater than the median. The middle half of our customers have revenues that are between the dotted lines."

For a slightly larger customer base I’d jitter the points. For a much larger customer base I’d replace this strip plot (also called a one-dimensional scatter plot) with a box plot. By box plot, I mean a Tukey box plot. I object to every software program and every author redefining box plots so that the reader can’t read them without an explanation.

The figure was drawn with R. However, it is easy to reproduce it using Excel or other software. [Robbins]

I think this goes back to one of my original points, which was that many people just provide the mean and it can be very misleading.  The graph above that Naomi provided illustrates this point clearly.  The light cyan line [mean] isn’t even close to the majority of the data points.  The median is much more representative of a typical customer value, but it is also not perfect.  Combine the median, mean, quartiles and actual values and now you’re providing real value.  This chart clearly shows the grouping of typical customers, the outlier and where the median and mean fall.  Thank you Naomi for the insights!
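For readers who want the numbers behind a Tukey box plot, here is a minimal Python sketch of the summary Naomi describes: mean, median, quartiles and the points flagged as outliers.  The revenue figures are hypothetical stand-ins (I don’t have the data behind her figure), the function name tukey_summary is my own, and the "inclusive" quartile method is just one of several conventions, so treat this as an illustration rather than her method.

```python
import statistics

# Hypothetical year-to-date revenues for ten customers (illustrative only,
# not the data behind the figure above). The fifth value is the outlier.
revenues = [42_000, 48_000, 51_000, 55_000, 800_000,
            56_000, 58_000, 61_000, 66_000, 73_000]


def tukey_summary(values):
    """Return mean, median, quartiles and Tukey-style outliers for values."""
    # quantiles() with n=4 returns the three quartile cut points.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Tukey's fences: points beyond 1.5 * IQR from the quartiles are outliers.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(values),
        "median": q2,
        "q1": q1,
        "q3": q3,
        "outliers": [v for v in values if v < lower_fence or v > upper_fence],
    }


summary = tukey_summary(revenues)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The middle half of the customers falls between q1 and q3, which corresponds to the dotted lines in the figure, and only the very large customer lands outside the fences.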

If you are interested in providing a guest post, please contact me for more information or to submit a proposal.

Happy Holidays, Merry Christmas and Happy New Year!

December 23rd, 2009

There is quite a lot going on this Holiday season and I hope to be back with a more regular posting schedule after New Year’s.  I hope you have a safe and happy Holiday(s)! 

Data Analysis – Do You Really Mean Average?

December 17th, 2009

In the corporate world I see this issue quite frequently.  Specifically, I will hear a request where the verbiage doesn’t align with what the requestor is ultimately looking for.  To illustrate, I have included an example below that shows ten different customers within a territory.  For each customer the total revenue year-to-date is listed.  To make the illustration relevant for this example, I listed Customer 5 with revenue that is dramatically higher than the rest. 

Now here’s the question I typically hear:

"What is the average customer size (revenue) for Territory A?"

Here is what that really means most of the time:

"What is a typical customer size (revenue) for Territory A?"

You may think it’s semantics, but it’s really not.  I don’t want to turn this into a statistics lesson, but average (mean) doesn’t always translate into typical.  Because Customer 5 is such an outlier, the average (sum of all customer revenue divided by count of customers) will be higher than if that customer fell into the typical range like the rest.

I have included the median revenue amount for the ten customers, which I think is probably a better measure of typical (in general) than the mean or average.  The median is simply defined as the number in the middle when the values are sorted; with an even count, such as our ten customers, it is the average of the two middle values.  In reality, Customer 5’s revenue could be 875 zillion dollars and the median amount wouldn’t change.  When there are thousands of records and you need to know what the typical amount is, it’s often safer to choose median unless you want to take the time to calculate min, max, median, standard deviation and mean to compare.
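To make the point concrete, here is a small Python sketch.  The ten revenue figures are hypothetical; I picked them only so the mean and median land near the $131,000 and $57,000 amounts in this example, and they are not the actual values from the chart.

```python
import statistics

# Hypothetical year-to-date revenue for Customers 1 through 10 (illustrative
# values only). Customer 5 is the outlier at list position 4.
revenues = [42_000, 48_000, 51_000, 55_000, 800_000,
            56_000, 58_000, 61_000, 66_000, 73_000]

mean_revenue = statistics.mean(revenues)      # sum divided by count
median_revenue = statistics.median(revenues)  # average of the two middle values

print(f"Mean:   {mean_revenue:,.0f}")
print(f"Median: {median_revenue:,.0f}")

# Make Customer 5 absurdly large: the mean explodes, the median does not move.
inflated = revenues.copy()
inflated[4] = 10**15
print(f"Mean with inflated Customer 5:   {statistics.mean(inflated):,.0f}")
print(f"Median with inflated Customer 5: {statistics.median(inflated):,.0f}")
```

Only the two middle values feed the median, so anything that happens to the largest customer leaves it untouched.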

"In probability theory and statistics, a median is described as the numeric value separating the higher half of a sample, a population, or a probability distribution, from the lower half." [source]

Now the real question that would need to be answered is, can a typical territory have one very large customer or is this a unique situation and should not be considered normal?  Answering the preceding question will make all the difference in what calculation to use.  Most often I will include both.

Median vs. Average Example

It’s my belief that most people are simply familiar with the term average because it’s so commonly used.  The underlying reason that average is more prevalent in analysis is probably due to the fact that it’s very easy to calculate.  Before spreadsheet software was available that automated the median calculation, it was much more difficult to get a median amount even with a calculator.

As a data analyst, it’s prudent to know the difference between mean and median and when each is applicable.  Telling the CEO/CFO that the typical customer is roughly $131,000 when one customer is atypical and the true amount is more like $57,000 can be a career changer.

Rainbow Chart – Twitter Messages Per Day

December 14th, 2009

Below is a great example of the wrong use of color in a column chart.  Use color to differentiate between segments, but don’t use it when time is on the x-axis for the different days.

A better use of color may be for each quarter within the year.  Using the chart below, it would make more sense to have every first week of the month always in one color, like blue.  Then, at least you could easily compare the first week of each month quickly.  I’m not even going to touch the chart title not matching what is actually being displayed in the graph – days vs. weeks.

You really can’t make the color mistake if you use a line graph, just saying.

image

[Source]

Business Intelligence Vendor Size is Important

December 1st, 2009

The most recent copy of Information Management had the image below on page 8.  What’s funny is the person figure on the left looks like it’s wearing pants.  Oh wait, those aren’t pants, the blue is part of the data visualization.  The person on the right looks to be wearing orange work boots or ski boots for that matter.  The article by Julie Langenkamp is interesting and discusses how small vendors tend to rank much higher than large vendors in product support and other areas.

Person chart

[image source]

112009_pendse_fig2 

[image source]

It appears that small vendors scored better than large vendors in every single category of complaints as shown in the chart above.  In the chart below, you will see that small vendors appeared to provide more benefit to the customer/client than large or medium vendors.

Benefits

[image source]

There’s a lot more to the article if you are interested in business intelligence. 

Happy Thanksgiving!

November 25th, 2009

I would like to wish you and your families a happy, joyous and safe Thanksgiving.  A special thank you goes to our men and women of the military stationed around the world in harm’s way. Be safe, strong and we’ll see you home soon.  As for my agenda, I plan to spend some quality time with my family and use this as an excuse to enjoy some of my favorite fine Ruffino Chianti and home cooking.  Life is short.  I want to enjoy every single holiday with my now young girls and wonderful wife.  Happy Thanksgiving!

Investment Growth Chart

November 24th, 2009

One of the benefits I truly enjoy is having USAA as my insurance company.  It only took me one phone call to their customer service center to know why they consistently rank near the top among companies for customer service.  I can think of a few big companies that could learn a lot from how USAA treats their customers/policy holders. 

While flipping through their recent magazine, I quickly noticed the chart below, which is called "The Snowball Effect."  The heading that was cut off states the following:

"What’s the hardest-working investment tool you can use? The power of time. Beth, Bob and Bridget all invested $2,500 at the same 6 percent rate of return.  But see how compounding made Beth’s account grow? That’s the value of starting early."

Going back to high school and college, one of the first lessons one learns is the time value of money and the compounding effect.  I won’t get into that, but what I did want to touch on is the chart below that left me speechless for a bit.  There are so many things wrong with it that it wasn’t even worth taking out my red pen.  

Investment Growth Chart 

I recreated the data from the chart in Excel (shown below) and used fictitious numbers for the middle of the graph.  Each person starts with the same money and each has an end amount.  So I basically filled in the blanks.   I know my chart doesn’t have Beth with her arms raised in celebration or decimals, but it’s definitely a lot cleaner and easier to understand.  This isn’t rocket science, is it?

Investment Growth Chart
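The compounding behind "The Snowball Effect" can be sketched in a few lines of Python.  The $2,500 and 6 percent come from the magazine heading; the number of years each person stays invested is not in the text I quoted, so the horizons below are made-up assumptions, as is annual compounding.

```python
def future_value(principal, annual_rate, years):
    """Value of a one-time investment with interest compounded once a year."""
    return principal * (1 + annual_rate) ** years


PRINCIPAL = 2_500  # from the heading
RATE = 0.06        # from the heading

# Hypothetical holding periods in years (assumed for illustration only).
horizons = {"Beth": 40, "Bob": 30, "Bridget": 20}

balances = {name: future_value(PRINCIPAL, RATE, years)
            for name, years in horizons.items()}

for name, balance in balances.items():
    print(f"{name}: {horizons[name]} years -> ${balance:,.2f}")
```

Whatever the real horizons were, the shape of the result is the same: the person who starts earliest ends with the most, because each extra year multiplies the whole balance by 1.06.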

Gradient Fill and Deception with Charts and Graphs

November 10th, 2009

Below you will see a column chart that appeared in the weekend’s print edition of the Baltimore Sun.  It’s no secret that they used a gradient fill on the columns to give it the fading appearance.  I’m not a big fan of the gradient fill on the 2009 columns, but this could work for the previous year’s numbers (2008) if the intent was to minimize the prior year.  I doubt that was the case as I’m sure they were trying to make the chart "pretty" or different than the default setup.

BS Unemployment Chart 

Below you will see a replica that I made using Excel and the fill effects formatting option.  It looks alright, but something still isn’t right.  What is the problem with this chart?

BS Chart Replica

The problem is the y-axis and the scale that was used.  I don’t think this is a straight out misrepresentation in order to mislead, but it could be.  That’s the risk you face when manipulating the axis.  Yes, the columns take up a lot of space when the axis starts at zero, but that’s the correct method here.  To help illustrate my point, check out the exact same chart (below) with the y-axis starting at zero.

BS Chart Replica - Axis

This version using the correct axis setting accurately shows that October, year-over-year, is not three times as much, but only about 1.5 times as much.  Also, look at the trend of the first replica chart.  The upward trend definitely has a greater slope compared to the replica with the correct axis.  To help prove this visually, check out the side-by-side comparison below using a trendline in the chart.  The slope of the chart on the left is much greater than the one on the right.  If you were presenting this data in something like PowerPoint or SlideShare, and quickly went to the next slide, the audience might not catch the axis starting at 5 and the steep trendline would be the point taken from the data.

BS Chart Replica - Slope

Furthermore, forget the gradient fill and go with something like the chart below if you want to highlight the current year.

BS Chart Replica - Color 2
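The axis distortion discussed in this post can be put into numbers with a short Python sketch.  The two values below are hypothetical (I am not reproducing the Baltimore Sun figures); I chose them so that an axis starting at 5 makes one column look three times as tall as the other, and the function name apparent_ratio is my own.

```python
def apparent_ratio(earlier, later, axis_start=0.0):
    """Ratio of drawn column heights when the y-axis begins at axis_start."""
    if earlier <= axis_start:
        raise ValueError("earlier value must sit above the axis start")
    # A column is drawn from axis_start up to its value, so the visible
    # height is the value minus axis_start.
    return (later - axis_start) / (earlier - axis_start)


# Hypothetical October values for the prior and current year.
prior_year, current_year = 7.0, 11.0

truncated = apparent_ratio(prior_year, current_year, axis_start=5.0)
honest = apparent_ratio(prior_year, current_year)

print(f"Axis starting at 5: columns look {truncated:.2f}x")
print(f"Axis starting at 0: columns look {honest:.2f}x")
```

With the axis at zero the column heights are proportional to the values themselves, which is why that is the setting to use for column charts.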

iPhone Data Visualization Application?

November 5th, 2009

I recently came across a few iPhone applications (Apps) that allow a user to view or edit spreadsheets in Excel.  Some have pretty good reviews and others, well, not so good.  I think there is some benefit to being able to view data visualizations, charts, graphs, spreadsheets and reports on your phone.  I think the capability is probably limited, as it would be nearly impossible to do large-scale spreadsheets on a phone.  Also, the screen size would limit the size and amount of data that could be displayed. 

Excel iPhone app

Here are a few spreadsheet type apps for the iPhone:

In my opinion, the best option is still to view web-published visualizations from a company like Spotfire or Tableau to see near real-time data, trends and visualizations.  Let’s forget about trying to build spreadsheets on your phone, because that isn’t going to happen.