http://g.co/gtac
Slides: https://docs.google.com/presentation/d/1zAgKXFOQn02PVik9b4YkV0ZJ2wJIaGAf5oFY_dUyDD8/pub

Celal Ziftci (Google) and Ben Greenberg (MIT graduate student)

It is common practice to use a sample of production data in tests. Examples include:
* Sanity Test: Feed a sample of production data into your system to see if anything fails.
* A/B Test: Take a large chunk of production data, run it through the current and new versions of your system, and diff the outputs for inspection.

To get a sample of production data, teams typically use ad-hoc solutions, such as:
* Manually looking at the distribution of specific fields (e.g. numeric fields), or
* Choosing a totally random sample.

However, these approaches have a serious downside: they can miss rare events (e.g. edge cases), which increases the risk of uncaught bugs in production. To mitigate this risk, teams choose very large samples, but such large samples bring further downsides:
* Rare events can still be missed,
* The runtime of tests greatly increases, and
* Diffs become too large for a human being to comprehend, with a lot of repetition.

In this talk, we propose a novel statistical data sampling technique to "smartly" choose a "good" sample from production data that:
* Guarantees rare events will not be missed, and
* Minimizes the size of the chosen sample by eliminating duplicates.

Our technique catches rare/boundary cases, keeps the sample size to a minimum, and implicitly reduces the manual burden on developers of inspecting test outputs/diffs. It also supports parallel execution (e.g. MapReduce) so that vast amounts of data can be processed in a short time frame to choose the sample.
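The abstract does not spell out the algorithm; the talk and slides do. As a rough intuition only, here is a minimal Python sketch of one way to get the two stated properties (rare events are never dropped, duplicates are collapsed): bucket each record by a coarse feature signature and keep one representative per bucket. The `signature` function and the field names in it are hypothetical, not the authors' actual statistical technique.

```python
def signature(record):
    """Coarse feature signature (hypothetical): records sharing a
    signature are treated as duplicates for sampling purposes.
    The fields used here ('country', 'items') are illustrative only."""
    return (
        tuple(sorted(k for k, v in record.items() if v is not None)),
        record.get("country"),
        min(len(record.get("items", [])), 10),  # bucket long lists together
    )

def smart_sample(records):
    """Keep one representative record per signature bucket.

    Every distinct signature contributes exactly one record, so rare
    signatures (edge cases) are guaranteed to appear in the sample,
    while heavily repeated signatures collapse to a single entry."""
    buckets = {}
    for rec in records:
        buckets.setdefault(signature(rec), rec)
    return list(buckets.values())
```

A scheme like this also parallelizes naturally: in a MapReduce setting, the mapper can emit the signature as the key and the reducer can keep one record per key.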