It seems like data lakes are all the rage these days. Why is that?
And more importantly, if you work with marketing data, why should you care?
And if you do care, but don’t know how to query data in your cloud storage bucket from BigQuery, how would you go about doing that?
In this post, I’ll do my best to answer these three questions.
3 reasons why you should care about data lakes
Based on my experience, here are the three reasons why you should care about data lakes.
Versatility is hard to pin down in the context of data lakes, so let’s get specific. When your data is stuck in an Excel file or Google Sheet, it can be hard to use that data for any kinds of automated processes. I’m not saying it’s impossible, in fact I’ve written so to the contrary. But your options are, let’s say… limited.
When your data is in a data lake, you can expose it to all kinds of advanced data technologies that are designed to make use of data. Databases, data warehouses, custom applications, and not to mention advanced data science tools like Apache Spark. When your data is in a data lake, the number of options you have expands significantly.
Durability means that your data will be safe no matter what happens to your computer or any database you might be storing your data on. If your laptop hard drive crashes and it was storing all of the files that you were using to build those reports, that can have a pretty severe impact on your business.
Of course, any good IT group will make sure that the contents of laptops and production databases are backed up. But still, this will likely result in some kind of downtime for your reporting or analysis workflow.
Why bother when you can ensure that your data will remain untouched in a remote location backed by 99.999999999% durability and 99.99% availability? An in-depth discussion of durability vs. reliability and why they matter is beyond the scope of this post, but you should know that both of those figures published by Google are extremely high, and far safer than your laptop hard drive.
Security is something that many marketers don’t think too much about, but is critically important when talking about confidential client data, media spend, and annual budgets. Having been in the industry for a few years now, the number of publicly available Google Sheets with link-based sharing enabled containing confidential data that I’ve seen is… frightening to say the least.
Having your data in a data lake provides you with the tools to enact an iron-clad security strategy, ensuring that only those who are supposed to access your data are able to.
Notice how I said “provides you with the tools.” Your data will be pretty secure with most default configurations, but you’ll want to discuss access permissions with your system administrators to ensure that security policies are consistently applied in your organization.
How to query data in a cloud storage bucket from BigQuery
Are you excited to start using data lakes yet? Unless you’re some kind of data or infosec nerd, based on the above probably not. Let me give you a very tangible example that I see as a gateway toward data lake adoption in marketing teams every day: querying data in BigQuery from Google Cloud Storage.
You might do this if:
- You need to report on media spend for data sources that don’t work with your third-party integration tools (like Supermetrics)
- You have a spreadsheet containing client account information that changes frequently and you need to make that data easily accessible inside of BigQuery
- A vendor or partner of yours can only send you data in raw files and you don’t want to build an ETL process to get it into BigQuery
Let’s talk about how you might set this up.
Step 1: Get your data ready
We’re going to use CSV as our data format for this demonstration, but many other formats are supported by BigQuery for these purposes.
In general, you want your data to be separated into columns and rows, preferably with some kind of date if you can. It’s easiest if you have a top row of headers in the file, but that’s optional.
The only trick to watch out for here is making sure your date columns are formatted correctly. You can make sure that they are saved correctly by applying custom formatting to the cells that have dates in them.
Step 2: Create a cloud storage bucket
The next step is to create a cloud storage bucket to store your data in. When people talk about data lakes, really they’re just referring to a collection of cloud storage buckets usually hosted by Google, Amazon or Microsoft. I hope I didn’t let you down, considering there is no “Create Data Lake” button anywhere in this guide.
For customers in the US, you’ll need to roll with us-central1 as your region, as we’ll be using what are called “External Tables” in BigQuery. European customers will need to create a bucket in europe-west1. At the time of writing, these are the only bucket locations that support external tables. Here’s a guide that will be updated when this changes.
If you don’t have a Google Cloud project, feel free to create one using their generous free tier of $300 credits in your first year.
Most of the default options for your bucket are fine, just make sure you specify a “Regional” bucket and pick the region you want to query data from.
Step 3: Upload your data to your cloud storage bucket
Once you’ve created a Cloud Storage Bucket, you’ll want to upload your data to it. This step is pretty easy.
Step 4: Build an external table in BigQuery that references the data in your cloud storage bucket
Here’s where the fun part starts.
Now, we’re going to build an external table in BigQuery that references the data in Google Cloud Storage.
Jump into BigQuery and create a dataset if you haven’t already. Make sure your dataset is in the same location as your bucket.
Once you’ve created your dataset, click the + icon to create a new table in your dataset. Here are the settings we’ll use to get this set up:
- Create table from: Google Cloud Storage
- Select file from GCS bucket: evans-central-bucket/ad-campaign-data_2020-05-01.csv
- File format: CSV
- Table type: External table
- Table name: external_csv_campaign_data (you can make this whatever you’d like)
- date: DATE
- campaign_id: INTEGER
- campaign_name: STRING
- impressions: INTEGER
- clicks: INTEGER
- cost: FLOAT
- Advanced options:
- Field delimiter: Comma
- Header rows to skip: 1
If you have any questions about data types in BigQuery, you can consult this documentation.
Once you have the settings adjusted to suit your CSV file, click “Create table” and it will create a table reference in your specified dataset.
Step 5: Query the data
You’ll now be able to query your data in Google Cloud Storage exactly the same as if it were a native table in BigQuery.
And that’s it!
Whether you’re piping your data from BigQuery into Data Studio, Power BI, or some other visualization destination, you now have the ability to use the power of BigQuery on top of your newly-created data lake. You can of course change and update the underlying data at any time.
For those of you who are looking to get creative here, I’ll show you two things that are absolute must-have features in my book:
If you want to select data from multiple files that share a similar file name but have identical schemas, you can accomplish this using a wildcard in your table configuration. Notice how I specified the table URI in this example with a wildcard (asterisk) in place of the date for my filename.
This means that BigQuery is going to search my Cloud Storage bucket for any files that fit the pattern, and will load data only from those files. In this case, you can see that we’re going to load data from all files with that filename, but independent of any specific dates. This means that you can continue to add files to Google Cloud Storage and they’ll automatically be picked up by BigQuery when the new queries run.
If you use the trick above, you’ll then face this question eventually: which file is this data coming from? Another variation of this question is, how do I limit the file that I’m selecting?
Enter the _FILE_NAME pseudo column. This can be used to accomplish both of the above. You can use it in your SELECT statement, as long as you give it an alias.
You can also use it to limit the data that is selected by your query. This actually reduces the amount of data loaded into BigQuery, so this can be used to improve the performance of your queries if you’re dealing with a large amount of data.
Querying data in your cloud storage bucket from BigQuery isn’t all that hard if you follow the tips and tricks I shared in this post.
If you have any questions, feel free to drop us a message on twitter by tagging @Supermetrics.