Archive for the ‘Statistics’ Category

Online Data Analysis and Visualization Tool [Poll]

Tuesday, February 2nd, 2010

Not too long ago, I got a tip from someone on Twitter with a link to a site called Verifiable.com.  Upon further investigation, I learned that this site is similar to Manyeyes.com in that you can upload a data set and, using the tools on the site, create data visualizations. 

At first glance, the site seems somewhat plain.  After digging into what the site is about, I quickly learned that they utilize sound and popular theory in the data visualization field.  On their about page, the first line that explains the features is:

“A clean, low-chartjunk philosophy — no shadows, no pie charts, no 3-D bar graphs, just the ink you need. [verifable.com]”

Well, simply reading that piqued my interest because they follow principles similar to the ones I use when creating charts/graphs.  No frills.  I like the fact that you can create charts without excessive grid lines, shadows, weak labeling or limited charting options.  Below you will see a few examples from their site.  You can also follow the links to see the visualizations in an interactive environment.  As you will see, there is a lot of data (hover over), many different options and some good visuals.  Granted, I had no idea what some of the charts were trying to show, but in general this site gives you a seemingly good tool to apply charting/graphing best practices.

Major League Baseball Payroll Efficiency 2006-2008 

[Interactive version]

U.S. Unemployment Rates by Education, 1992-Latest

[Interactive version]

Verifiable also offers a Pro version of their tool where you can keep your data and visualizations private and receive premium support.  The cost is minimal with the Pro version going for $29.95/year. 

I didn’t try to upload a data set to give the site a full trial, but it definitely looks interesting.  I am not sure how much demand there is for online data visualization using a tool like Verifiable.

Data Analysis – Do You Really Mean Average?

Thursday, December 17th, 2009

In the corporate world I see this issue quite frequently.  Specifically, I will hear a request where the wording doesn’t match what the requestor is ultimately looking for.  To illustrate, I have included an example below that shows ten different customers within a territory.  For each customer the total revenue year-to-date is listed.  To make the point clear, I listed Customer 5 with revenue that is far higher than the rest.

Now here’s the question I typically hear:

"What is the average customer size (revenue) for Territory A?"

Here is what that really means most of the time:

"What is a typical customer size (revenue) for Territory A?"

You may think it’s semantics, but it’s really not.  I don’t want to turn this into a statistics lesson, but average (mean) doesn’t always translate into typical.  Because Customer 5 is such an outlier, the average (sum of all customer revenue divided by count of customers) will be higher than if that customer fell into the typical range like the rest.

I have included the median revenue amount for the ten customers, which I think is probably a better indicator of a typical customer (in general) than the mean or average.  The median is simply defined as the number in the middle once the values are sorted; with an even count, such as ten customers, it is the average of the two middle values.  In reality, Customer 5’s revenue could be 875 zillion dollars and the median amount wouldn’t change.  When there are thousands of records and you need to know what the typical amount is, it’s often safer to choose median unless you want to take the time to calculate min, max, median, standard deviation and mean to compare.

"In probability theory and statistics, a median is described as the numeric value separating the higher half of a sample, a population, or a probability distribution, from the lower half." [source]
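To make the difference concrete, here is a minimal Python sketch.  The ten revenue figures are invented for illustration (they are not the values from my example table), and the variable names are my own.  It computes the mean and median, then inflates the outlier to show that the median does not move.

```python
from statistics import mean, median

# Hypothetical year-to-date revenue for ten customers in one territory.
# These numbers are made up for illustration only.
revenue = [48_000, 52_000, 55_000, 61_000, 900_000,
           50_000, 58_000, 63_000, 56_000, 60_000]

avg = mean(revenue)    # pulled upward by the one very large customer
mid = median(revenue)  # average of the two middle values after sorting

# Make the outlier (the fifth customer) absurdly large.
inflated = revenue.copy()
inflated[4] = 875_000_000
mid_inflated = median(inflated)  # the middle of the list is unaffected

print(f"Mean:   {avg:,.0f}")
print(f"Median: {mid:,.0f}")
print(f"Median with a much larger outlier: {mid_inflated:,.0f}")
```

With made-up numbers like these, the mean lands well above what nine of the ten customers actually generate, while the median stays in the range where most customers sit.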

Now the real question that would need to be answered is, can a typical territory have one very large customer or is this a unique situation and should not be considered normal?  Answering the preceding question will make all the difference in what calculation to use.  Most often I will include both.

Median vs. Average Example

It’s my belief that most people are simply familiar with the term average because it’s so commonly used.  The underlying reason that average is more prevalent in analysis is probably due to the fact that it’s very easy to calculate.  Before spreadsheet software was available that automated the median calculation, it was much more difficult to get a median amount even with a calculator.

As a data analyst, it’s prudent to know the difference between mean and median and when each is applicable.  Telling the CEO/CFO that the typical customer is roughly $131,000 when one customer is atypical and the true amount is more like $57,000 can be a career changer.

Little Known Way to Improve Google [website] Analytics

Friday, August 21st, 2009

I think everyone in the world knows about Google Analytics, but this other service can drastically help improve the effectiveness of a website or blog as a complement to [not a replacement for] Google Analytics.  The website is called crazyegg and offers a service to “visualize your visitors” based on the volume of monthly visitors you want to measure.  The cost starts at a mere $9 per month for 10,000 visitors and 10 pages.

There are two features that I think are valuable and can help you visualize conversions on a website or blog.  They are:

1.  Heatmap


The heatmap option allows you to visually see where on the site most visitors click.  This can be extremely helpful for trying to sell advertising, place key products or featured items.  Some other features of the site include:

  • The ability to exclude your own IP address so the results are not skewed
  • Setting up tests to see what people are doing on a page
  • Reports
  • Email notifications

2.  Confetti


The confetti option allows you to see dots in color where visitors are clicking.  The list option lets you see who the top 15 referrers are, search terms, operating systems, browser and so on.  If you use one of these services, please share your feedback in the comments section.

Check out their demo here.


New Data Analysis Book

Tuesday, August 4th, 2009

HFDA 

A new book on data analysis was released yesterday by Michael Milton called Head First Data Analysis.  The Head First books have gotten rave reviews, but I had never actually read one until now.  I was able to be involved in the technical review of this book, which gave me a first glimpse into the content.  I will say there were a few pleasant surprises as I went from chapter to chapter.  Among them: exercises that utilized the open source statistical program R, examples using relevant data, theories and ideas, and coverage of all the major topics without going so deep that it was no longer relevant.  I also liked the fact that almost anyone could pick up this book and start applying the concepts immediately.  It is designed to be a very easy read with rich content.  Here is a note from Amazon that describes what this book is about:

“Today, interpreting data is a critical decision-making factor for businesses and organizations. If your job requires you to manage and analyze all kinds of data, turn to Head First Data Analysis, where you’ll quickly learn how to collect and organize data, sort the distractions from the truth, find meaningful patterns, draw conclusions, predict the future, and present your findings to others.

Whether you’re a product developer researching the market viability of a new product or service, a marketing manager gauging or predicting the effectiveness of a campaign, a salesperson who needs data to support product presentations, or a lone entrepreneur responsible for all of these data-intensive functions and more, the unique approach in Head First Data Analysis is by far the most efficient way to learn what you need to know to convert raw data into a vital business tool.
You’ll learn how to:

  • Determine which data sources to use for collecting information
  • Assess data quality and distinguish signal from noise
  • Build basic data models to illuminate patterns, and assimilate new information into the models
  • Cope with ambiguous information
  • Design experiments to test hypotheses and draw conclusions
  • Use segmentation to organize your data within discrete market groups
  • Visualize data distributions to reveal new relationships and persuade others
  • Predict the future with sampling and probability models
  • Clean your data to make it useful
  • Communicate the results of your analysis to your audience ” [source]

I think the Head First series methodology and design creates an experience where the reader is able to learn and apply the concepts quickly.  It was a pleasure working with the folks at O’Reilly.

Data Versus Information – Financial Bailout (Part 1 of 2)

Tuesday, March 17th, 2009

The Financial Lobbying information below is a great example of the difference between giving someone data and providing them with information.  The designer stopped far too short when putting this matrix together because they left all the work for me to do.  If you’re like me and you see this grid, what are the first few things you do?

Financial Lobbying

[source]

When I saw this, I immediately did these things:

  1. Quickly read the title and sub title
  2. Scanned the companies looking for a familiar one
  3. Started calculating percentages of each to the total
  4. Thought about how much these bailouts are of the total bailout package

I am only looking for some basic statistics and context for this data.  I need to put it into perspective and try to tell a story.  I recreated this data in Excel and added a few simple columns to illustrate my points.  Also, we aren’t even talking about charts or graphs, just a simple matrix.

First, I have the same matrix with one additional column for the percent each company is of the total financial bailout spend.  Also, you’ll notice I abbreviated the numbers in the millions to save space.  Finally, I removed the zebra striping because it really isn’t needed in such a small data set.

Financial 1

In the next example below, I added an additional column that represents the percent each company is of the total bailout package.  Now I can see that these eight large financial companies make up 26 percent of the total bailout spend, assuming a $700 billion total.  What this does is put the data in some perspective versus just showing a bunch of numbers.

Financial 2
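The two added columns are simple to compute outside of Excel as well.  Below is a rough Python sketch of the same idea.  The company names and dollar amounts are placeholders I made up for illustration, not the figures from the matrix; only the $700 billion package total comes from the discussion above.

```python
# Placeholder bailout amounts in billions of dollars.
# Company names and amounts are hypothetical, for illustration only.
bailout_billions = {
    "Company A": 45.0,
    "Company B": 45.0,
    "Company C": 25.0,
    "Company D": 10.0,
}

TOTAL_PACKAGE_BILLIONS = 700.0  # assumed total bailout package


def share(amount, total):
    """Return amount as a percentage of total, rounded to one decimal."""
    return round(100 * amount / total, 1)


group_total = sum(bailout_billions.values())

# One row per company: name, amount, % of listed companies, % of package.
rows = [
    (name, amount, share(amount, group_total),
     share(amount, TOTAL_PACKAGE_BILLIONS))
    for name, amount in bailout_billions.items()
]

group_share_of_package = share(group_total, TOTAL_PACKAGE_BILLIONS)

for name, amount, pct_group, pct_package in rows:
    print(f"{name:<10} {amount:>6.1f}  {pct_group:>5.1f}%  {pct_package:>5.1f}%")
print(f"Listed companies are {group_share_of_package}% of the package")
```

The point is the same as with the spreadsheet: the raw amounts are data, and the two percentage columns are what start to turn them into information.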

In part 2, I will show you a few more changes that I made to the matrix that speak to the revenue columns.

New Stephen Few Book On Quantitative Analysis

Tuesday, January 13th, 2009

There is a very interesting discussion going on that Jorge Camoes started on his blog, Charts.  The discussion is about Edward Tufte’s principles and business charts or data visualizations.  From that post, Jon Peltier chimed in and provided his insights.  These two discussions are centered on implementing Tufte’s design principles in the corporate [business] sector.

I am introducing a third piece to the discussion that I think may help.  In reading two of Stephen Few’s books, I’ve seen references to Tufte’s work, which dates back to the 80s.  Personally, I think Tufte’s book, The Visual Display of Quantitative Information (TVDoQI), is one of the most influential books I have on data visualization.  I think Stephen’s upcoming book may write a new chapter on data visualization for analysis that will help bridge the gap between theory and practice in the business world.  I would be willing to bet that this book will be the next staple in the library of anyone involved in data visualization and analysis.

What I really enjoy about Few’s books is that they are very applicable to the business world and present data in a simple and intuitive way.  I first got wind of this book back in November of 2007, when Stephen and I had a brief conversation.  Ever since that Friday in November, I have been anxious to see it released.  The posts by Jon and Jorge, along with a tweet via Twitter reminded me that the release date should be near.

Right now you can pre-order his new hardcover book on Amazon for $29. Its release date is scheduled for 4/1/2009.  I personally think it will be worth every penny.

Now You See It: Simple Visualization Techniques for Quantitative Analysis

Now you see it book

Book Description per Amazon:

"This companion to Show Me the Numbers teaches the fundamental principles and practices of quantitative data analysis. Employing a methodology that is primarily learning by example and “thinking with our eyes,” this manual features graphs and practical analytical techniques that can be applied to a broad range of data analysis tools—including the most commonly used Microsoft Excel. This approach is particularly valuable to those who need to make sense of quantitative business data by discerning meaningful patterns, trends, relationships, and exceptions that reveal business performance, potential problems and opportunities, and hints about the future. It provides practical skills that are useful to managers at all levels and to those interested in keeping a keen eye on their business." [Amazon]

What do you think?  Is this book going to be worth the hype or another book that’ll soon be forgotten?

Other books worth a look by Stephen Few:

  1. Show Me The Numbers: Designing Tables and Graphs to Enlighten, Stephen Few
  2. Information Dashboard Design: The Effective Visual Communication of Data, Stephen Few

There are referral links within this post to Amazon. However, there is no endorsement from Stephen Few for this post or any reference to his books.

The Biggest Hazard to Americans [Chart Review]

Monday, December 22nd, 2008

The information drawn from this visualization doesn’t come too easily.  There are two distinct sets of data shown within this map.  First is the actual map, which could be considered a heat map.  The shading of the different regions represents the Standardized Mortality Ratios.  The second set of data is the pie charts located in the ten different regions, which show the proportion of mortality in each hazard category.

Yahoo Pie Chart Map

[via Yahoo]

After studying the map, I can conclude that regions III and V have the highest mortality due to heat/drought.  Is it me or does that seem a little odd because the regions aren’t too hot?  Not really that shocking is that the Northeast has the highest mortality due to winter weather.  Out West, the highest mortality comes from severe weather.  The rest of the pie chart data is pretty evenly mixed.

When asked about the biggest threats to life, the article states:

“According to our results, the answer is heat,” Susan Cutter and Kevin Borden of the University of South Carolina wrote in their report, which gathered data from 1970 to 2004.

I don’t know what data they were looking at because I’m not sure I can reach that same conclusion based on this map.  Also, who made up the categories?  Should tornadoes fall under severe weather?  Or maybe severe weather has lightning within the category.  Nope, lightning is a category by itself.  Can someone please tell me what Geophysical is and how it relates to mortality?  The categories don’t make much sense.

Regarding the standard deviations, how many people (reading Yahoo) do you think can decipher this data?  Wikipedia has a pretty good explanation of Standardized Mortality Ratios (SMRs).  Basically, it’s the ratio of actual (observed) deaths to expected deaths.

  • If the number is 1.0, the actual number of deaths equals the expected number
  • If it is less than 1.0, the actual deaths are fewer than expected
  • If it is higher than 1.0, the actual deaths are greater than expected
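The ratio and the three cases above can be sketched in a few lines of Python.  The function names and the death counts are hypothetical, chosen only to show each case; they are not taken from the map or the report.

```python
def smr(observed_deaths, expected_deaths):
    """Standardized Mortality Ratio: observed deaths / expected deaths."""
    if expected_deaths <= 0:
        raise ValueError("expected_deaths must be positive")
    return observed_deaths / expected_deaths


def interpret(ratio):
    """Translate an SMR into plain language."""
    if ratio > 1.0:
        return "more deaths than expected"
    if ratio < 1.0:
        return "fewer deaths than expected"
    return "deaths equal to expected"


# Hypothetical (observed, expected) counts for three made-up regions.
examples = {
    "Region X": (150, 100),
    "Region Y": (80, 100),
    "Region Z": (100, 100),
}

for region, (observed, expected) in examples.items():
    ratio = smr(observed, expected)
    print(f"{region}: SMR = {ratio:.2f} ({interpret(ratio)})")
```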

Referencing the map with this information, section VIII has significantly more deaths than would have been expected.  On the other hand, California and the Northeast have significantly fewer deaths than would be expected.  It seems like the color scheme and ranges should be different.  I would expect two sets of ranges, one above and one below 1.0, each easily identified by its own color.  Or, maybe black is the right color being that we are talking about death.

In summary, I think there is some potential here and with a few tweaks this data visualization could be more effective.

Is this an effective way to show this data?

Do you think the categories make sense?

Seasonality in Data

Thursday, December 4th, 2008

Here is a good excerpt about seasonality. 

"For analyzing general price trends in the economy, seasonally adjusted changes are usually preferred since they eliminate the effect of changes that normally occur at the same time and in about the same magnitude every year—such as price movements resulting from changing climatic conditions, production cycles, model changeovers, holidays, and sales." source 

I am not an expert statistician, but hopefully the two examples I found below will help give you a high level understanding of seasonality.  At a minimum, when you see footnotes in periodicals, like the one in a recent post, you may have a better understanding.  Also, when analyzing data, it’s always a good rule to think about the data and whether there is a seasonal effect involved.  You could get drastically different results if you tried to forecast or model using data from a period of extreme seasonality.

Pre Adjustment for Seasonality

The seasonal effect is extremely visible in the example below.

figure_01

Seasonally Adjusted

After the data is adjusted for seasonality (smoothed out) it is much easier to see the periods of decline.

 figure_02

Bureau of Transportation Statistics (source)
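For a feel of what "adjusting for seasonality" does, here is a deliberately simple Python sketch.  It is not the method behind the charts above and is far cruder than something like X-12 ARIMA: it just computes one seasonal index per period (that period's average divided by the overall average) and divides each value by its index.  The quarterly numbers and function names are made up for illustration.

```python
from statistics import mean


def seasonal_indices(series, period):
    """One index per season: that season's mean divided by the overall mean."""
    overall = mean(series)
    return [mean(series[i::period]) / overall for i in range(period)]


def seasonally_adjust(series, period):
    """Divide each observation by the index for its season."""
    indices = seasonal_indices(series, period)
    return [value / indices[i % period] for i, value in enumerate(series)]


# Hypothetical quarterly data for two years: a strong third-quarter peak
# every year, with the second year running lower than the first.
quarterly = [80, 100, 130, 90,
             72, 90, 117, 81]

adjusted = seasonally_adjust(quarterly, period=4)

for raw, adj in zip(quarterly, adjusted):
    print(f"raw = {raw:>4}   adjusted = {adj:6.1f}")
```

In the raw numbers the third-quarter spike dominates; once the seasonal swing is divided out, the drop from the first year to the second is what stands out.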

Related to Seasonality

X-12 ARIMA – A Census Bureau method for removing seasonal factors

BV4.1 – Software developed by the Federal Statistical Office of Germany that can adjust data for seasonality

Jon Peltier – Blog post about seasonality

Best Method for Illustrating a Data Point

Monday, November 3rd, 2008

Figure

The above statistic was shown in a recent copy of Businessweek and shows the average U.S. State debt per capita.  This method of calling attention to data is typically seen in magazines, newspapers and other periodicals.  I can honestly say that I read 99% of these callout boxes when I come across them.  Maybe it’s just me, but they effectively grab my attention.  Now, I could be biased because I am so tied to data and statistics, but I would guess others may feel the same way. 

One tip that I will give readers is that this method of highlighting data can be effectively used in presentations to draw attention to the slide.  I would be much more apt to notice a fact or statistic like this versus a slide with just bullet points.  Some of the best PowerPoint presentations I’ve seen include this method of presenting data.  All the work that goes into the data gathering and analysis is a waste if nobody pays attention to the results, right?  This method can also be used in dashboards if done sparingly and only for extremely important statistics.

Do you typically read these callouts when you come across them?

Do you agree that they are effective in presentations?

MIT Blackjack Group – 21 and Rounders

Monday, September 22nd, 2008

21

It appears that Amazon.com now has movies and TV shows available for immediate viewing.  This new service is like a rental, but online.  If you click on the image above or the link, you will be directed to the Video On Demand section of Amazon.  I confess that I’m a huge fan of Amazon and haven’t bought a book anywhere else, probably since the 90’s.  Just this year, I opened a PRIME account with Amazon and haven’t looked back.

Just recently, I watched the movie 21 with Kevin Spacey and really enjoyed it contrary to what every critic wrote about it when the movie hit the theaters.  I would have gone to the theater, but my two-year-old probably wouldn’t have liked it as much as her Elmo movies.

In addition to just about every blackjack book there is, I have read the two books below by Ben Mezrich and have seen every documentary on the MIT group because of my obsession with numbers, statistics and the incredible MIT story.  I know a lot of embellishment went into the books and movie, but just the idea of a team beating the casinos out of millions blows me away.  The books are an amazing read and you will probably finish them within a day or two if you’re like me.  I have long since loaned my books away to friends, family and my wife, all of whom loved the books.

If you like numbers, statistics, gambling or blackjack, these are a must have.

Busting Vegas, Ben Mezrich

Bringing Down the House, Ben Mezrich

Another great movie related to statistics and numbers is Rounders.  Rounders is probably one of the best poker movies of all time and features Ed Norton and Matt Damon.

Rounders

“Gimme three stacks of high society!”