SAMPLING · 9-MINUTE READ · By Ruben Ugarte on February 13 2017.
If you’re a heavy user of Google Analytics, then you know about the dreaded “sampling issue”. If you don’t, then let’s see how Google themselves describe sampling:
“Sampling in Analytics is the practice of selecting a subset of data from your traffic and reporting on the trends available in that sample set. Sampling is widely used in statistical analysis because analyzing a subset of data gives similar results to analyzing all of the data. In addition, sampling speeds up processing for reports when the volume of data is so large as to slow down report queries.”
We recently worked with a client who was constantly running into sampling issues (their site had more than 500,000 sessions per month), which meant that we needed to find a way to work with the unsampled data without spending hundreds of hours banging our heads against the wall.
Let’s take a look at how we solved the Google Analytics sampling issue without going crazy.
Why Sampling is an Issue and How it Can Hide Crucial Issues With Your Data
The first obvious issue of sampling is that you aren’t seeing 100% of the data. This means that the numbers you’re seeing aren’t exact. You could have 26,000 visitors from Organic Search or you could have 49,000 visitors.
A 30% sample might be too small even for trends.
This kind of ambiguity is the opposite of how we expect analytics to work. The whole reason why we even decided to use Google Analytics was to get accurate numbers on our traffic and users.
The second issue depends on how big the sample is. Sampling is meant to show you the trends within your data. This means that if Organic Search is responsible for 50% of your traffic in your sample, it should be around the same number if you were to look at the complete data set.
However, this depends on how big your sample is. If the sample is 90% of all sessions, then the overall trends are likely to be accurate. However, if the sample is less than 1% of all sessions, then even the trends themselves could be completely wrong.
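To put rough numbers on this, here is a minimal sketch of the margin of error for a traffic share at different sample rates. It assumes Google Analytics picks sessions as a simple random sample, which may not match how sampling actually works, and it uses the standard formula for the error of a proportion. The 500,000 sessions come from the client example above, while the 0.5% segment is a made-up figure for illustration.

```python
import math

def margin_of_error(share, sample_sessions, total_sessions, z=1.96):
    """Approximate 95% margin of error for a share estimated from a random sample."""
    if sample_sessions >= total_sessions:
        return 0.0  # the whole population was analyzed, so nothing is estimated
    standard_error = math.sqrt(share * (1 - share) / sample_sessions)
    # Finite population correction: the error shrinks as the sample nears the total.
    fpc = math.sqrt((total_sessions - sample_sessions) / (total_sessions - 1))
    return z * standard_error * fpc

TOTAL_SESSIONS = 500_000  # monthly sessions, as in the client example
SEGMENT_SHARE = 0.005     # hypothetical small segment: 0.5% of all sessions

for rate in (0.9, 0.3, 0.01, 0.001):
    sample = int(TOTAL_SESSIONS * rate)
    moe = margin_of_error(SEGMENT_SHARE, sample, TOTAL_SESSIONS)
    relative = moe / SEGMENT_SHARE
    print(f"{rate:.1%} sample: segment could be off by about {relative:.0%} of its value")
```

The relative error grows quickly as the sample rate drops, and it is worst for small segments. A big-picture number like total Organic Search share holds up far better than a narrow slice of your traffic.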
Sayf Sharif from Lunar Metrics talks about how even a 50% sample can hide issues in the data:
“Even at higher sample rates we’ve noticed issues. Once we detected an upwards of a 10% overall change at a near 50% sample. A client site when compared month to month, year over year, showed a 5% increase, with a 48% sample. However, when we looked at the data unsampled, we were able to show that instead of improving by 5% it had actually decreased by 5%. That’s an issue.” – Source
One of the biggest issues that we have seen when working with clients is getting them to trust the data. If you can’t trust your data, you won’t take it seriously and you might as well not have it. Sampling can create a distrust of the data and make it irrelevant inside your company.
There are a few different ways of avoiding the sampling issue but let’s focus on two: within the GA interface and using the Google Analytics API.
Simple Workarounds to Sampling Within the Google Analytics Interface
The GA interface is the easiest way to work with your data so avoiding sampling here is important. You have 3 options if you want to avoid sampling:
Option 1: Change the date range
Sampling kicks in when a report covers too much data, so you can often avoid it by shortening the date range, e.g. 2 weeks instead of 1 month.
The issue with this option is obvious. You may need to look at a whole month and not just 2 weeks. You could try combining two date periods (2 weeks at a time) to get a month but that isn’t perfect either.
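One reason combining periods isn’t perfect: totals like sessions and pageviews can simply be added, but a ratio like bounce rate has to be recomputed from the underlying counts, and unique users can’t be added at all because the same person may show up in both periods. A minimal sketch of the safe way to combine two periods, using made-up numbers and hypothetical field names:

```python
def combine_periods(periods):
    """Combine per-period totals into one set of numbers for the full range."""
    sessions = sum(p["sessions"] for p in periods)
    bounces = sum(p["bounces"] for p in periods)
    pageviews = sum(p["pageviews"] for p in periods)
    return {
        "sessions": sessions,
        "pageviews": pageviews,
        # Recompute the ratio from raw counts instead of averaging the two rates.
        "bounce_rate": bounces / sessions if sessions else 0.0,
    }

# Illustrative figures only, not real client data.
first_half = {"sessions": 1000, "bounces": 400, "pageviews": 3000}   # 40% bounce rate
second_half = {"sessions": 3000, "bounces": 1800, "pageviews": 7000}  # 60% bounce rate

month = combine_periods([first_half, second_half])
naive_average = (400 / 1000 + 1800 / 3000) / 2

print(month)
print("naive average of the two rates:", naive_average)
```

Averaging the two bounce rates directly would give equal weight to both halves of the month, even though the second half has three times the sessions.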
Option 2: Increase the precision of reports
Google Analytics lets you increase the precision of reports. This button is available near the date range as seen in the image below:
Moving the setting to the highest value of “Higher Precision” doubles the sample size that you can work through. If your sample size is high to start with e.g. 70%+, you could most likely avoid sampling altogether by increasing precision.
Option 3: Stick to Standard Reports
The Standard Reports in Google Analytics are never sampled. These are the core overview reports that you find under nearly every section. For example, we can see the “Audience Overview” report below showing all the possible data even though the numbers are high enough to justify sampling.
Our client couldn’t use any of these workarounds. We needed to view long date periods such as an entire year and we needed accurate numbers for key metrics like pageviews, sessions and bounce rate.
This meant that we needed a fast way of pulling unsampled data from Google Analytics. We decided to work with the Google Analytics API.
How We Avoided Sampling by Using the Google Analytics API + Supermetrics
The Google Analytics API lets you export all your data into a CSV or Excel file. It really is quite magical once you get the hang of it.
The Query Explorer lets you limit results to avoid sampling.
The API also lets you limit how much data you export at any given time which means you could limit the sampling within the data. Instead of pulling data for 1 month, you can pull data for 2 weeks and then do 2 exports to get the complete month.
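Splitting a date range into smaller windows is easy to script. Here is a minimal sketch that only does the date math. It assumes inclusive start and end dates and a 14-day window, and it leaves the actual API call out. The function name is our own, not part of any Google library.

```python
from datetime import date, timedelta

def split_date_range(start, end, days_per_chunk=14):
    """Yield (chunk_start, chunk_end) pairs, inclusive, covering start..end with no overlap."""
    chunk_start = start
    while chunk_start <= end:
        chunk_end = min(chunk_start + timedelta(days=days_per_chunk - 1), end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end + timedelta(days=1)

chunks = list(split_date_range(date(2017, 1, 1), date(2017, 1, 31)))
for chunk_start, chunk_end in chunks:
    # ISO format (YYYY-MM-DD) is a convenient way to pass dates to a query.
    print(chunk_start.isoformat(), "to", chunk_end.isoformat())
```

Each window would become its own query. If a window still comes back sampled, you can shrink `days_per_chunk` and try again.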
This works well but it can become a bit of a pain if you have to do multiple exports or if you need to split your data across 20 different queries and then combine that data into one file.
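To give a sense of the stitching work involved, here is a rough sketch that merges several partial exports into one table. It assumes each export has already been loaded as a list of rows with a `source` column and a `sessions` column. Both column names and the numbers are placeholders for illustration.

```python
import csv
import io
from collections import defaultdict

def merge_exports(exports, key="source", metric="sessions"):
    """Add up a metric across several partial exports, grouped by a dimension."""
    totals = defaultdict(int)
    for rows in exports:
        for row in rows:
            totals[row[key]] += int(row[metric])
    return dict(totals)

def to_csv(totals, key="source", metric="sessions"):
    """Render the merged totals as CSV text, sorted by dimension value."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([key, metric])
    for name in sorted(totals):
        writer.writerow([name, totals[name]])
    return buffer.getvalue()

# Two made-up exports, one per two-week window.
weeks_1_2 = [{"source": "google", "sessions": "120"}, {"source": "direct", "sessions": "50"}]
weeks_3_4 = [{"source": "google", "sessions": "180"}, {"source": "direct", "sessions": "30"}]

merged = merge_exports([weeks_1_2, weeks_3_4])
print(to_csv(merged))
```

This only works for metrics that can be added together, such as sessions or pageviews. With 20 queries, keeping track of which files have been merged is where most of the manual effort goes.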
We decided to use Supermetrics for Google Drive to help us streamline this process. After we installed the addon to Google Sheets, we got access to the sidebar below:
The Supermetrics addon in Google Sheets.
This interface matches what the Query Explorer API tool is looking for but it makes your life easier in a few important ways:
1. You can ask Supermetrics to “avoid sampling”. This doesn’t always work but it worked in most of the queries that we did. Supermetrics will break your query into multiple queries behind the scenes to avoid sampling.
2. It dumps the data straight into Google Sheets, which is where we wanted to work with it. Even if we needed 3 queries to get all the unsampled data, everything ended up in the same spreadsheet.
3. Once we created a query, we could edit it or use it to create similar queries. We were doing regional analysis, which meant that we were looking at the same numbers filtered by different regions e.g. California, New York, etc.
We could create our query once, and then simply edit the filter to change the state without having to create the query from scratch.
4. We could create advanced segments inside Google Analytics and then use them with Supermetrics to get specific metrics and dimensions.
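The "create once, then change the filter" idea from point 3 looks roughly like this if you build the queries yourself. This is a sketch, not how Supermetrics stores its queries. The parameter names follow the Core Reporting API (v3) as seen in the Query Explorer, the view ID is a placeholder, and the dates and regions are examples.

```python
# Base query shared by every regional report. The view ID below is a placeholder.
BASE_QUERY = {
    "ids": "ga:XXXXXXXX",
    "start-date": "2016-01-01",
    "end-date": "2016-12-31",
    "metrics": "ga:sessions,ga:pageviews,ga:bounceRate",
    "dimensions": "ga:date",
    "samplingLevel": "HIGHER_PRECISION",
}

def query_for_region(base, region):
    """Return a copy of the base query filtered to a single region."""
    query = dict(base)  # copy so the shared template is never modified
    query["filters"] = f"ga:region=={region}"
    return query

regional_queries = [query_for_region(BASE_QUERY, r) for r in ("California", "New York")]
for query in regional_queries:
    print(query["filters"])
```

Everything except the filter stays identical across regions, which keeps the reports comparable and makes a mistake in the base query a one-line fix.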
This simple interface saved us 35+ hours in our reporting work. We didn’t have to manually piece together multiple files and all of our unsampled data was in one Google Sheet. You can view a short 5 minute video here on how you can start using Supermetrics to analyze millions of Google Analytics hits and the best tips that we learned from this project.
If we made a mistake with a query (which happened often), we could simply run the query again and get new data. We could magically go through millions of sessions without breaking a sweat.
Sampling in Google Analytics will continue to be an issue, especially since Google doesn’t offer an affordable option beyond Google Analytics 360 (which starts at $150,000 a year).
In the meantime, you could try a few of these options to get around sampling and start working with all your data.
Do you have any other useful workarounds to this issue? Let me know in the comments.
About Ruben Ugarte:
Ruben Ugarte is the founder of Practico Analytics, which helps companies get better insights out of Google Analytics and similar tools. He put together a short video that will show you how to start using Supermetrics to analyze millions of Google Analytics hits and avoid the sampling issue. You can download this free 5-minute video here.