
Small Data is Hard – Benchmarking Lessons

A Journey in Benchmarking With Small Data

Recently, while working on an algorithm to detect jobs that are slow to fill in our Jobaline Network, I ran into a problem: I didn’t have enough data at the fine grain I was seeking. The journey of overcoming that problem inspired me to document the process. Jobaline is a cloud-based, mobile-friendly, automated recruitment tool optimized for hourly jobs. We are a data-driven company and do a good job of collecting and using data to power our customers’ recruitment efforts, maximize efficiency within our system and measure our business. We have an ever-growing database of worker data that, at the time of this writing, has reached almost 200 million unique data points. Even with all that data, there are still times when I run into issues slicing on many dimensions to drill deep into an analysis.

It’s a hard world for little things

At Jobaline we deliver local, prescreened, relevant workers to the jobs our customers post through us. One of our key metrics is time to fill, or the rate at which we deliver workers to open jobs. The challenge was to compare each job to a benchmark to see if it was filling slower than a certain percentile of the other jobs in that benchmark. The benchmark was jobs that had been open the same number of days, in the same state, and in the same job category. The table below illustrates the objective (with completely fake data), and a short code sketch after the table shows one way to compute it. The benchmark represents the 25th percentile of average delivered applicants for all jobs in that type, state and days opened. If the delivered applicants for the actual job are fewer than its benchmark, it is a slow-moving job.

| Job Id | Job Type | State | Days Opened | Delivered Applicants | Benchmark Delivered Applicants | Slow Moving? |
|--------|----------|-------|-------------|----------------------|--------------------------------|--------------|
| 123 | Retail | CA | 2 | 11 | 5 | N |
| 124 | Retail | CA | 2 | 4  | 5 | Y |
| 125 | Retail | CA | 3 | 9  | 6 | N |
| 126 | Retail | CA | 4 | 13 | 8 | N |
| 127 | Retail | WA | 3 | 8  | 7 | N |
| 128 | Retail | WA | 3 | 6  | 7 | Y |
| 129 | Retail | WA | 3 | 11 | 7 | N |
| 130 | Retail | WA | 5 | 9  | 9 | N |
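
Our actual implementation runs inside our data pipeline, but to make the idea concrete, here is a minimal sketch of the comparison. It assumes a pandas DataFrame with made-up column names, so treat it as an illustration rather than how we actually store or process the data:

```python
import pandas as pd

# Hypothetical job-level data shaped like the table above (column names are made up)
jobs = pd.DataFrame({
    "job_id":               [123, 124, 125, 126, 127, 128, 129, 130],
    "job_type":             ["Retail"] * 8,
    "state":                ["CA", "CA", "CA", "CA", "WA", "WA", "WA", "WA"],
    "days_opened":          [2, 2, 3, 4, 3, 3, 3, 5],
    "delivered_applicants": [11, 4, 9, 13, 8, 6, 11, 9],
})

# Benchmark: the 25th percentile of delivered applicants across jobs of the same
# type, in the same state, that have been open the same number of days
jobs["benchmark"] = (
    jobs.groupby(["job_type", "state", "days_opened"])["delivered_applicants"]
        .transform(lambda s: s.quantile(0.25))
)

# A job is slow moving if it has delivered fewer applicants than its benchmark
jobs["slow_moving"] = jobs["delivered_applicants"] < jobs["benchmark"]
print(jobs[["job_id", "delivered_applicants", "benchmark", "slow_moving"]])
```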

If this analysis had been across all of the jobs in our network, that wouldn’t have been a problem (we have a lot of data). The problem was that we were only considering a new type of job, for which we didn’t have a meaningful amount of data for every job category/state/age combination, so some jobs didn’t have a meaningful benchmark. Now that you have context, I am going to shift to a hypothetical example to continue, but first I must define what I consider to be “small data.”

Defining small data is about as nebulous and shifty as defining big data, and no one definition can suffice across the board. For the purposes of this discussion, I will offer the following definition of small data.

Small data is structured data that is too sparse to provide meaningful analytical results when sliced across all necessary dimensions.

It’s not a stereotype if it’s always true

Let’s say you have a database of 10 million records, each being an order for a product placed online. Ten million records certainly does not constitute big data, but most would hesitate to describe it as ‘small.’ However, if you are trying to conduct an analysis of order profit by product type, size, country, time of year, destination, price and discount, there may be groupings that have very few, if any, results.

Let’s say the analysis compares each purchase to a benchmark and measures the difference in order to identify the most profitable rewards programs. The tough part here is defining an accurate benchmark. The unit margin of movie posters shipping to the next city over isn’t comparable to industrial equipment shipping overseas, and orders at the height of the Christmas season shouldn’t be compared to orders in March. If, after slicing your data by all of the necessary dimensions to get a good benchmark, you don’t have enough results for a meaningful analysis in each subsection, you may be dealing with ‘small data.’

Just to crystallize the problem statement a little further, consider the following fictitious aggregated dataset of retail orders.

| Product Type | Size | Location | Purchase Quarter | Destination | Price Bucket | Discount % | Num of Purchases | Avg Profit |
|---|---|---|---|---|---|---|---|---|
| Home Furnishing | Large | US | Q1 | New York | 15-20k | 0%  | 1,001,534 | $2,617 |
| Home Furnishing | Large | US | Q1 | New York | 15-20k | 10% | 11        | $7,143 |
| Home Furnishing | Large | US | Q1 | New York | 1k-5k  | 0%  | 5,694     | $1,704 |
| Home Furnishing | Large | US | Q1 | New York | 1k-5k  | 5%  | 67,894    | $401   |
| Home Furnishing | Large | US | Q1 | New York | 20k+   | 0%  | 22,412    | $9,716 |
| Home Furnishing | Large | US | Q1 | New York | 20k+   | 10% | 2         | $662   |
| Home Furnishing | Large | US | Q1 | Seattle  | 15-20k | 0%  | 51,367    | $5,183 |
| Home Furnishing | Large | US | Q1 | Seattle  | 15-20k | 10% | 324       | $587   |
| Home Furnishing | Large | US | Q1 | Seattle  | 1k-5k  | 0%  | 8,471     | $248   |
| Home Furnishing | Large | US | Q1 | Seattle  | 1k-5k  | 5%  | 1,253     | $2,179 |
| Home Furnishing | Large | US | Q1 | Seattle  | 20k+   | 0%  | 962       | $1,198 |
| Home Furnishing | Large | US | Q1 | Seattle  | 20k+   | 10% | 51        | $5,792 |
| Total |  |  |  |  |  |  | 1,159,975 | $3,119 |

We can see the number of purchases and average profit by Product Type, Size, Location, Quarter, Destination, Price and Discount %. Some combinations of dimensions do not yield much data. If this represented benchmarks, and we wanted to compare each purchase to the average profit of its group, that would be a problem. In the data above, the Q1 – New York – 20k+ price bucket – 10% discount grouping has only 2 records. That is not enough volume to provide a valid benchmark against which to measure the transactions that fall into that bucket.

This type of problem can occur at even the largest organizations. The more data available, the more sophisticated business users want to be in their analysis, and the more likely it is that there will be levels of stratification that are not sufficient for extrapolation.

Business consumers of data want to aggregate, slice, dice and analyze data across all conceivable dimensions in order to get relevant insights and trends that can give their organizations a leg up internally and externally.

This is hard with small data and can lead to incorrect conclusions that can do far more harm than good. We really can’t solve this completely until data volumes grow, but we can mitigate the risk.

Step 1 – Sample smartly

One way to ensure your benchmarks have enough data to be used effectively in analysis is to define the minimum population size needed to give a level of accuracy you are comfortable with and that is statistically significant. Simple statistical methods can help calculate the necessary sample size for the benchmark. While the following is no substitute for an intro stats course, it is a place to start if you want to enhance the statistical significance of your benchmarks. All you need to provide is a few variables (I recognize this is simplified, but it is a good starting point):

  • Confidence Interval (Margin of Error) – How far above or below the actual average you are willing to let your sample average fall.
  • Confidence Level – The % chance that your sample average will fall within the confidence interval described above. This corresponds to a z-score, which can be found in a standard z-score table.
  • Standard Deviation – A measure of how spread out your data is (the square root of the variance).

The minimum sample size formula is:

n = \left(\frac{z \cdot \sigma}{E}\right)^2

where z is the z-score for your confidence level, \sigma is the standard deviation and E is the margin of error.

For a 95% confidence level (corresponding to a margin of error of +/- 5%), the z-score is 1.96.

Assuming a standard deviation of 0.5 and a margin of error of 0.05, the math works out as:

n = \left(\frac{1.96 \cdot 0.5}{0.05}\right)^2 = 384.16

So, rounding up, any category with fewer than 385 values should not be used for benchmarking given these parameters. This is a simplified example; to learn more, see this great article. In the table above, the 10% discount groupings fall below this threshold. For the buckets that have enough observations, run your full analysis. For the ones that don’t, move to step 2 to minimize your risk and optimize the accuracy of the output.
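
To make the arithmetic concrete, here is a tiny Python sketch of that calculation (the function name and structure are mine, purely for illustration):

```python
import math

def minimum_sample_size(z_score: float, std_dev: float, margin_of_error: float) -> int:
    """Minimum sample size for a benchmark bucket: n = (z * sigma / E)^2, rounded up."""
    return math.ceil((z_score * std_dev / margin_of_error) ** 2)

# 95% confidence level -> z = 1.96; standard deviation = 0.5; margin of error = 0.05
print(minimum_sample_size(1.96, 0.5, 0.05))  # 385
```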

Step 2 – Roll it on up… and benchmark again

Further aggregation is not optimal because you lose the explanatory power of whatever dimensions you remove from the model. In the data above, if we remove the discount % dimension, we no longer have any stratifications below our minimum sample size, so we are good to go. But what if there are valuable insights to be derived from looking at the discounts? The key here is to have a simple decision tree: if the number of observations for a combination of dimensions is greater than the minimum calculated above, run your analysis using every dimension; otherwise, remove dimensions until you reach that level.
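
As a rough sketch of that decision tree (again assuming pandas and hypothetical column names for order-level data, not an actual production query), the two benchmark sets could be built like this:

```python
import pandas as pd

MIN_SAMPLE_SIZE = 385  # from the sample-size calculation in step 1

DETAILED_DIMS = ["product_type", "size", "location", "quarter",
                 "destination", "price_bucket", "discount_pct"]
ROLLED_UP_DIMS = DETAILED_DIMS[:-1]  # drop discount % when buckets are too thin

def build_benchmarks(orders: pd.DataFrame, dims: list) -> pd.DataFrame:
    """Aggregate order-level data into benchmark buckets over the given dimensions."""
    return (orders.groupby(dims)
                  .agg(num_purchases=("profit", "size"),
                       avg_profit=("profit", "mean"))
                  .reset_index())

def split_benchmarks(orders: pd.DataFrame):
    """Return (detailed, rolled_up) benchmark tables.

    Detailed benchmarks keep discount %, but only for buckets that meet the
    minimum sample size; every other bucket falls back to the rolled-up table.
    """
    detailed = build_benchmarks(orders, DETAILED_DIMS)
    detailed = detailed[detailed["num_purchases"] >= MIN_SAMPLE_SIZE]
    rolled_up = build_benchmarks(orders, ROLLED_UP_DIMS)
    return detailed, rolled_up
```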

This means there are now two sets of data. The first set contains the buckets where the number of purchases is greater than the minimum sample size. It keeps discount % included because each bucket has enough data to be used in the analysis.

| Product Type | Size | Location | Purchase Quarter | Destination City | Price Bucket | Discount % | Num of Purchases | Avg Profit |
|---|---|---|---|---|---|---|---|---|
| Home Furnishing | Large | US | Q1 | New York | 1k-5k | 0% | 5,694  | $1,704 |
| Home Furnishing | Large | US | Q1 | New York | 1k-5k | 5% | 67,894 | $401   |
| Home Furnishing | Large | US | Q1 | Seattle  | 1k-5k | 0% | 8,471  | $248   |
| Home Furnishing | Large | US | Q1 | Seattle  | 1k-5k | 5% | 1,253  | $2,179 |
| Total |  |  |  |  |  |  | 83,312 | $1,133 |

The second set of data removes discount % and combines the rows where one of the buckets was too small.

| Product Type | Size | Location | Purchase Quarter | Destination City | Price Bucket | Number of Purchases | Avg Profit |
|---|---|---|---|---|---|---|---|
| Home Furnishing | Large | US | Q1 | New York | 15-20k | 1,001,545 | $2,617 |
| Home Furnishing | Large | US | Q1 | New York | 20k+   | 22,414    | $9,715 |
| Home Furnishing | Large | US | Q1 | Seattle  | 15-20k | 51,691    | $5,154 |
| Home Furnishing | Large | US | Q1 | Seattle  | 20k+   | 1,013     | $1,429 |

Now you can run the full benchmark analysis against orders that fall into the buckets in the first dataset, and the less targeted benchmarks against the rest. That way everything gets evaluated, but we have appropriately sized benchmarks for the comparisons, so we can make decisions from them with confidence. As your dataset grows, the number of benchmarks that are too small for the full analysis will shrink, so it is important to periodically review which groupings go through which branch of the decision tree (or, better yet, evaluate it dynamically each time the analysis is run).
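
Continuing the hypothetical sketch, evaluating an individual order would then try the fully sliced benchmark first and fall back to the rolled-up one. Plain dictionaries are used here just to keep the lookup logic obvious:

```python
def pick_benchmark(order, detailed, rolled_up, detailed_dims, rolled_up_dims):
    """Prefer the fully sliced benchmark for this order; fall back to the rolled-up one.

    `detailed` and `rolled_up` map tuples of dimension values to a benchmark
    average profit, e.g.
    detailed[("Home Furnishing", "Large", "US", "Q1", "Seattle", "1k-5k", "5%")] = 2179.
    """
    detailed_key = tuple(order[d] for d in detailed_dims)
    if detailed_key in detailed:
        return detailed[detailed_key]
    rolled_up_key = tuple(order[d] for d in rolled_up_dims)
    return rolled_up.get(rolled_up_key)  # None if even the rolled-up bucket is missing
```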

A conclusion is the place where you got tired thinking.

-Martin H. Fischer

This is not an exhaustive list of the problems or solutions that can arise from not having enough data to support the most granular analytical requests of your business users. However, it is a good start and a solid way to minimize the risk of bad decisions being made as a result of poor analysis. Remember, the purpose of business intelligence is to enable data-driven decisions that help the organization grow and become more efficient. The better that purpose is fulfilled, the quicker your data will grow (as the company grows!) and the fewer problems you will have with small data.

This is my third blog post (and the most technical), so I want your feedback! If you like the content and want to keep up to date on my posts, subscribe to my mailing list by providing your name and email in the form on the right. Also check out my other posts, and like, share and comment!

Cheers!

  • This is the first blog post I have attempted that is semi-technical. Please let me know if anything is difficult to understand or could be explained better. I want your feedback!

  • bryan young

    Great job on working through sample size issues. I run into this a lot when we try to analyze consumer sentiment with Nielsen data. The sample size for a very specific consumer can sometimes be 10 or less. It can be tempting to put together a story using this data but in the end it is misleading.

    I’m reading “Thinking, Fast and Slow” by Daniel Kahneman, and he goes into these issues at great length. A lot of social science research runs into this issue as well: studies underestimate the sample size needed to create reproducible research. That led to the recent news this year that almost half of 100 studies’ results were not reproducible.

    http://www.smithsonianmag.com/ist/?next=/science-nature/scientists-replicated-100-psychology-studies-and-fewer-half-got-same-results-180956426/

    Keep up the good work and the interesting articles. We’ll have to meet up the next time I’m in Seattle!

    • Wow, interesting article. I imagine psychology is one of the more difficult fields for reproducibility. It was good to see support for the validity of p-values.

      Sample size is incredibly important. The problem is, some of the greatest findings come from segments with less data, because variation will likely be higher. It can be tempting to go with those findings, especially if they support your hypothesis.

      It is also a lot of work to properly handle segments with small data. If the algorithm is built into an application or product, it can really impact performance as well. The best way to handle that is to actually materialize the benchmarks and minimum sample size in your database and repopulate every so often so that all that logic isn’t being calculated on the fly.

      Let me know the next time you’re up here. I actually make it down to Portland pretty often, as I have family that way. We should definitely meet up and grab a drink.
