Because of the increasingly fragmented marketing data landscape, data-driven marketers now need specialized tools to reduce the time it takes for data to be processed into information. While marketing data warehouses are not a new concept, the legacy model to implement them has serious drawbacks.
Customers need to first purchase large capacity hardware, install and manage it locally (on-premise), and then buy and deploy additional tools to extract, transform, and load data. Not only is this a significant capital investment and involves a very high time-to-insights-delivery, this model has a significant reliance on IT, and almost no self-service potential as far as marketers are considered.
Not surprisingly, leading technology vendors have jumped into the fray with each providing its own technology flavor for addressing the common drawbacks of on-premise data warehousing. Products from companies such as Snowflake, Amazon, and Google Cloud provide a full suite of capabilities for significantly expediting not just the core data warehousing phases of data preparation, transformation, and loading but also in how this data can be used for visualization and machine learning.
In this article, we evaluate Google Cloud, Snowflake, and Amazon Cloud and discuss their specific technology components and capabilities relevant for marketers looking to quickly and efficiently convert data into information.
Try Supermetrics for BigQuery with a free 14-day trial.
Aligning marketing technology investments to business needs
Given the range of technology options for cloud-based data warehousing, it is important to align investments to actual business needs. We divide marketing business needs into three categories:
- Basic reporting-In this category, Marketers are only looking to do basic reporting and most often with data from isolated sources. Something like customized reports from Google Analytics, Google Ads, or Facebook. Most often, their requirement is to simply pull data from these sources and then push into a data warehouse without any transformation. All the logic for processing this data happens in the visualization tool (e.g. Tableau, Google Data Studio, R etc.)
- Business Intelligence with complex drill-down reporting using multi-source data-In this scenario, the marketing needs are too complex to be able to process data in a BI/analytics tool. Data must be extracted from multiple sources, at varying time intervals, and transformed with a large number of business rules before it can be consumed. However, the end requirement is still to be able to report on known metrics.
- Machine learning and advanced analytics-In this category, Marketers typically don’t know what they are looking for in their data. The requirement is to be able to quickly source, process and store large quantities of raw data without any predefined schemas and then arbitrarily slice and dice it for various modeling needs.
In the sections below, we outline the most relevant tool capabilities to consider for each of these business needs categories.
In this category, the focus is mainly on quick and automated data extraction. None of the three platforms provide an in-built data sourcing capability. However, marketers can use third-party tools such as Supermetrics, to quickly transform data from a range of sources (e.g. Google Ads, Google Analytics, Facebook insights etc.) into products like Google’s BigQuery/Cloud Storage, Amazon’s S3/Redshift/RDS, and Snowflake.
These third-party source connectors include several in-built features such as authorization and token management, time/event-based data refresh, incremental or full load refreshes, and in-built monitoring of data extraction tasks. Marketers can easily control these options through web/app-based management consoles. Another very important aspect of these off-the-shelf connectors is the end-to-end protection of data as it moves from source to destination in the cloud.
All the three platforms provide ready integration with tools such as Tableau, Qlik, and Google Data Studio, which can then be leveraged for building reports and also for ad-hoc data analysis.
Next up the business maturity continuum are marketers looking to go beyond basic reporting of isolated data sources to analyzing data from multiple sources and for specific business metrics.
Consider an ecommerce store looking to optimize its acquisition marketing spend. The company spends on both Facebook and Google ads but in many cases, both the platforms take credit for conversions. This results in misaligned investments if going by reporting available within Google/Facebook or even if we just report on these two sources in isolation. The requirement here is to de-duplicate the conversions and then use some sort of attribution scheme to allocate credit to the various participating channels.
To do this, a marketer may need to the following:
- Automatically source data from all campaigns in Facebook into a table. Do the same with Google Ads.
- Build another table that contains conversion data based on cookie id/email or some other identifier. This table could be populated as soon as a conversion takes place.
- As soon as an entry is recorded in the conversion table, automatically trigger some rules to populate a third table with the time of first touch from Facebook (if any) and also from Google (if any) campaigns.
- If the marketer is using a first touch attribution scheme, this third table can then provide insights into which campaigns (FB/Google) brought the user in first.
To implement this technically in each of the three platforms, the following components would need to be used:
Components used: Third-party source connectors (e.g. Supermetrics for BigQuery), Google BigQuery (storage layer), Google Dataflow (automated data pipeline for ETL)
Here, the source connectors would write Facebook/Google Data into separate BigQuery tables on a regular basis. A batch job can then be run using Google Dataflow (for example every night) and which will create the de-duplicate user conversion record by looking at the Facebook, Google and conversion data tables.
Components used: Third-party source connectors (e.g. Supermetrics), Amazon Redshift or RDS (storage layer), Amazon Data Pipeline (automated data pipeline for ETL)
Similar to Google Cloud, the source connectors can pipe data from Facebook/Google Ads/conversion tables into Amazon RDS which provides relational storage for transactional data. A batch data pipeline could then be triggered every night to process last day’s conversions from RDS and move the rows into Amazon Redshift which is a specialized product for querying large amounts of data.
Unlike Amazon and Google Cloud, the Snowflake platform actually provides data warehousing capabilities that internally use either Amazon, Google, or Azure platforms. It doesn’t provide any hardware resources of its own, instead, the entire setup is physically deployed on the customer’s selected underlying cloud. The USPs of the Snowflake platform are its unique data storage and processing design which (as the vendor claims) provide better performance than using the native cloud components directly (e.g. Google BigQuery, Amazon Redshift etc.)
Unlike Google BigQuery and Amazon Redshift/RDS, the options to pipe data into the Snowflake platform are limited. Additionally, the capability of transforming data using a pipeline-based approach is also missing. The platform is largely focused on providing extremely fast and optimized query processing engines. This means that unlike the other two platforms, the customer will have to prepare its data externally and then load into Snowflake. For customers who already have dedicated ETL tools that can source and transform data, Snowflake can provide an extremely fast and scalable query processing engine.
In our use case, customers would somehow source the data from Facebook/Google into a staging area using one of the pipeline products in the market (e.g. Supermetrics). They would then need to process this data using dedicated hardware and finally, move it back into Snowflake.
Advanced analytics and machine learning
Marketers use machine learning with marketing data for a number of use cases including next best action recommendations, churn prediction, audience targeting, ad optimization, and customer segmentation to name a few. This requires data to be sliced and diced arbitrarily at a non-aggregated level.
For example, consider a marketer looking to do econometrics on marketing campaign data. In order to this, he needs to be able to collect daily data at arbitrary levels of granularity (ad, ad group, audience segment, geo, interest category, campaign etc.) and then feed it into some kind of regression analysis software which can then identify various statistical variables like R-squared, beta coefficients for various dependent variables, standard deviations for various models etc.
These metrics then help marketers make important decisions about not just which marketing efforts perform better than others but also where to focus efforts. The core data processing needs here are not so much about fast query performance on structured data, but more about things like data preparation capabilities, data partitioning, integration with machine learning models, and ability to deploy selected models on new datasets quickly.
The relevant components on the three platforms would include:
Datalab, BigQuery, Dataproc, Dataflow, Dataprep, Tensorflow
Third-party source connectors, S3/RDS, SageMaker
A key differentiator in the Snowflake platform is its integration with a number of third-party data science specialists (e.g. Zepl) rather than provide any dedicated component of its own for machine learning specifically. This architecture allows Snowflake customers to leverage the platform as a core data storage engine while piping the data into specialized third-party tools via the Partner Connect platform.
Selecting a cloud warehouse is about getting the balance right between ready-made functionality, business needs and cost/time/skills constraints. There is no one-size-fits-all solution and marketers must make a customized template to select the best technology option, keeping in mind actual business needs and time/cost constraints. The formulation of business needs must specifically consider the core focus of analysis, given that very different products and solutions exist depending upon whether the needs are centered around basic reporting, business intelligence or advanced statistical analysis.
To get started with your marketing data warehousing journey, book a demo.