How to use your GSC data in BigQuery like a pro?


In February 2023, Google announced the bulk data export, allowing companies to export their raw GSC data to BigQuery, Google’s cloud data warehouse. This change had two main objectives:

  1. Allow companies (and SEOs) to get ownership of their data without having to use the API. The API is great, but it does come with pitfalls such as limited historical data, sampling and anonymized queries (more info here).
  2. Monetize GSC (a free tool) through BigQuery, a paid product. As we’ll see in this article, costs are low for most projects, but some companies will have to pay hundreds or even thousands of euros per month depending on the data volume their websites generate.

In this article, I’ll explain what you can expect from this integration and which best practices you should follow to leverage this data. If you haven’t activated the export yet, note that you’ll get less value from this article than if you already have data flowing in.

How to activate the bulk export?

I won’t go into much detail on this part because I assume that some of you have already done it. If this is not the case, please refer to this wonderful article explaining every step with screenshots. I’ll just add a couple of things:

  • If you’re not using BigQuery as a cloud warehouse, you’ll still have to go through BigQuery to get your data. But you can ask your data team (or a freelance data engineer, who can set it up easily for you) to send the data from BigQuery to any other warehouse.
  • You can’t get historical data: if you activate the export today, you’ll have data from today onwards. This is a pity, but there is no workaround available, unfortunately.

How much will it cost me?

This is the first question everybody asks, because we all know the AWS billing meme, which applies to any other cloud provider as well.

This is a tricky question because it depends on countless factors. In a nutshell, you’re charged for two things in BigQuery:

  • Storage: €0.02 / GB per month. This means that if you keep 1TB of data in BigQuery for a full month, you’ll pay €20. To estimate this cost, you can activate the bulk export and look at the data generated during the first couple of days. For instance, this specific table is generating around 3GB / day, and I can forecast costs based on that.

Note that

  1. If your traffic increases, the amount of data generated by GSC may increase as well. Therefore, consider the situation your project is in when you forecast.
  2. GSC generates more than one table by default (more on that later) and you may create new ones yourself for your reporting. I highly recommend assuming that you’ll end up storing double the amount of data GSC generates, just in case.

Anyway, unless you are dealing with huge websites, storage costs are often insignificant. I’ve dealt with projects generating dozens of millions of clicks per day, and even then, the costs were less than €100 / month.
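If you want to check these volumes on your own project, you can look at the size of the daily partitions the export creates. Below is a minimal sketch, assuming the default searchconsole dataset name and a placeholder project ID:

-- Logical GB stored per daily partition of the URL-level export table
-- (metadata only, so this costs virtually nothing to run).
SELECT
  partition_id AS day,
  ROUND(total_logical_bytes / POW(1024, 3), 2) AS gb_stored
FROM `your_project.searchconsole.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'searchdata_url_impression'
  AND partition_id != '__NULL__'
ORDER BY partition_id DESC;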

  • Querying: this is where things can get expensive if you don’t follow some best practices. The general rule is that you’ll pay €5 / TB processed.

If you use the BigQuery UI, you can easily see the amount of data a query needs to process before you run it, and therefore what it will cost you. In the example below, the query would cost €1.90.

The issue is that when you use a third-party tool to query or visualize your data, this information is not provided.

For instance:

  1. I use TablePlus to manage and query some of the databases I have access to, including BigQuery. I don’t know how much a query will cost when I run it from this tool, which is why I double-check it in the BigQuery UI first.
  2. Looker Studio has a BigQuery connector to visualize your data easily. Under the hood, Looker Studio generates and executes the queries on your behalf. A simple dashboard can end up running dozens of queries and becoming costly if the underlying data is not well optimized.

Keep reading, and I’ll explain later how to optimize the tables Looker Studio uses with simple data engineering techniques, so that you won’t have to worry about querying costs.

Tables generated by the export

The export will generate two tables, as you can quickly see after activating the export:

  • searchdata_site_impression: this table contains data aggregated by property. The list of available fields (list here) is similar to what you may already be used to if you’ve used the API. The main difference is that anonymized query data is included: you’ll have rows where the query field is empty while the other fields are populated.
  • searchdata_url_impression: this table contains data aggregated by URL. The list of available fields (list here) is similar to the first table (with the url field added), but we also get boolean is_[search_appearance_type] fields indicating the search appearance type, such as is_amp_top_stories. To be honest, you won’t use these fields much unless you work in specific industries.
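For instance, here is a minimal sketch (with a placeholder project ID and the default searchconsole dataset name) showing how much of your daily click volume comes from anonymized queries in the property-level table:

-- Daily clicks split between anonymized and non-anonymized queries.
-- The date filter keeps the scan (and therefore the cost) small.
SELECT
  data_date,
  SUM(IF(query IS NULL, clicks, 0)) AS anonymized_clicks,
  SUM(IF(query IS NOT NULL, clicks, 0)) AS non_anonymized_clicks
FROM `your_project.searchconsole.searchdata_site_impression`
WHERE data_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
GROUP BY data_date
ORDER BY data_date DESC;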

How to use your data (the wrong way)?

Let’s start with the wrong way of using your data. Let’s assume that you already have a couple of months of data and you want to create a GSC report in Looker Studio.

You could follow these simple steps:

  • Add connectors to both tables and then create your report

Easy, right? Well, yes, it is, but this is precisely what you should not do! Why? These tables contain plenty of information and plenty of rows. One example:

11 billion rows!! This is a huge amount of data, and by plugging your Looker Studio report directly into these sources, you create two levels of inefficiency:

  • Speed: the report will be painfully slow because the underlying SQL queries generated by Looker Studio can take a while to execute. In that specific case, minutes (!!). If you thought the GA4 UI was slow, try plugging a Looker Studio report into a BigQuery table with billions of rows in it.
  • Cost: a full page load (including graphs, tables …) can easily end up costing >€10 depending on the dimensions you require.
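By the way, you can check how big your own export tables are without scanning them (and paying for it) by querying the dataset metadata. A minimal sketch, again with a placeholder project ID and the default dataset name:

-- Row counts and sizes of the export tables, read from metadata only.
SELECT
  table_id,
  row_count,
  ROUND(size_bytes / POW(1024, 3), 1) AS gb
FROM `your_project.searchconsole.__TABLES__`;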

Now, for small properties this might not be an issue: you’re overpaying, but on a rather small dataset, so convenience outweighs the cost and speed benefits. But for big properties, I strongly advise you to double-check your costs and ensure that you’re not overspending.

To achieve that, just open this article and use the second SQL query. I’ve adapted it slightly in my case to get the total per day, and you can see that the amount consumed can be quite high if your setup is not optimized.

In this example, Looker Studio consumed 23TB (!!) in a single day, around €115.
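For reference, here is a minimal sketch in the spirit of that query (not the exact one from the linked article): it returns the total TB billed per day for query jobs in your project, which you can then refine to isolate Looker Studio jobs. Adjust the region qualifier to wherever your data lives.

-- Total TB billed per day over the last 30 days.
-- Replace `region-eu` with your own region (e.g. region-us).
SELECT
  DATE(creation_time) AS day,
  ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS tb_billed
FROM `region-eu`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY'
GROUP BY day
ORDER BY day DESC;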

So yes, you can end up spending a lot of money if you use your data the wrong way, and you don’t know what you are doing.

Now, what about using it the right way?

How to use your data (the right way)?

As you can expect, this setup is not as straightforward as the previous one, but I’ll guide you through it.

The modern data architecture

This article is not an introduction to data engineering, but I want to highlight how modern data architecture is usually designed to help you understand what we are doing.

  • You have raw data (in our case, what Google loads into the searchconsole tables). These tables include all the available information but are never used directly for reporting, for the reasons we explained previously.
  • Based on this raw data, final data is built, including only the metrics & dimensions we need in our dashboards.

Why do we have these two different levels?

  • raw data is needed because we sometimes have new needs, and having everything stored allows us to apply changes to our historical data. Storage costs being low, this is not an issue cost-wise, and it comes with countless benefits when you need to apply changes to your other tables.
  • final data is also needed because it ensures that the tables used for visualization (maybe by dozens of people) are fast and cost-effective.

You can obviously have more steps between raw and final data, but this is the rough idea behind any modern data architecture.

Define the required information

Based on what Google is providing us in those tables, you need to define what information you need in your reporting table. Let’s assume that we just need the following:

Included by default

  • date
  • clicks
  • impressions

Have to be calculated

  • page_type: we are dealing with a huge website; therefore we need to classify our pages because we won’t dig into page-level data in our report.
  • query_type: same comment, but we want to be able to separate branded and non-branded traffic.

And that’s it for now! Let’s build our final data table with that in mind.

Data transformation

The easiest way to create a data transformation pipeline in BigQuery is through Dataform (or dbt, a similar solution). I may write a complete article on the topic another day, but this is not the objective here; we’ll go through the basics, though. Dataform is a service offered by Google to schedule SQL workflows for data transformation in BigQuery. It allows you to transform your raw data into final data with simple SQL code.

  • Follow this article to set it up correctly, up to the section where the GA4 code starts to be generated. Obviously, this is not what we want here!
  • In the workflow_settings.yaml file that is automatically generated …

… modify the content of the following variables:

  • defaultProject: the ID of your current GCP project. If you don’t know it, just click on the banner located at the top of your screen and copy the ID from the list that appears.
  • defaultLocation: the location of the dataset where your table will live (for instance EU or US). It should match the location of your GSC export dataset.
  • defaultDataset: the name of the dataset where the transformed data should be added.
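For reference, a minimal sketch of what the modified file can look like (the values are placeholders, and any other generated lines can stay as they are):

defaultProject: my-gcp-project-id   # your GCP project ID
defaultLocation: EU                 # location of your BigQuery datasets
defaultDataset: gsc_reporting       # dataset where the final table will be created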

  • In the declarations.js file that must be located inside the sources folder …

… copy the following content, where you’ll need to indicate the project_id that you already added in the previous step. <your_dataset_id> is often searchconsole, but since you can customize it when you configure the bulk export, it can also be something else.

declare({
   database: "<your_project_id>",
   schema:   "<your_dataset_id>",
   name:     "searchdata_site_impression",
});

declare({
   database: "<your_project_id>",
   schema:   "<your_dataset_id>",
   name:     "searchdata_url_impression",
});

This will allow you to use shortcuts in your code to refer to these tables, as shown below:

  • Finally, create a SQLX file with the following content.

This first part configures the table that Dataform will create. We’ll configure an incremental table, where only new data from our source tables is added. This ensures that the execution is faster and that we’re charged only for the new data processed, not the whole table.

config {
    type: "incremental", 
    tags: ["daily"],
    bigquery:{
        partitionBy:"date", 
    }
}

If we use the table shown before, an incremental run will cost us around 3GB (€0.015), while a full run would cost €9.9. Given that we want to add new data daily, this difference is quite significant.

We also partition the table by date, a common practice to reduce costs. Indeed, if a query uses a qualifying filter on the value of the partitioning column, BigQuery can scan the partitions that match the filter and skip the remaining ones. For instance:

  • I have a 100GB table covering 100 days
  • If I query the full table, the execution will cost 100GB
  • If I query the table for 10 days, I will be charged 10GB if the table is partitioned by day, but 100GB if it is not.

As GSC data is almost always queried with a date filter, this partitioning allows us to save a decent amount of money.
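In practice, a report query like the sketch below (table and column names are illustrative) only scans the partitions for the dates requested:

-- Thanks to partitioning by date, this only scans the last 28 daily partitions.
SELECT
  date,
  SUM(clicks) AS clicks
FROM `your_project.your_dataset.gsc_final_data`
-- filter on the partitioning column so BigQuery can prune partitions
WHERE date >= DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
GROUP BY date;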

We then add this part of the code …


js {
  // Builds a reference to <project.dataset>.INFORMATION_SCHEMA.PARTITIONS for the
  // dataset this table lives in (self() returns the fully qualified table name).
  const PARTITIONS = `${self().split(".").slice(0, -1).join(".")}`+'.INFORMATION_SCHEMA.PARTITIONS`';
  module.exports = PARTITIONS;
}

pre_operations {
    -- gsc_checkpoint holds the most recent date already loaded into this table,
    -- read from its partition metadata; on a full run it defaults to 2023-01-01.
    DECLARE gsc_checkpoint DEFAULT DATE('2023-01-01');
    SET gsc_checkpoint = (
    ${when(incremental(),
    `SELECT PARSE_DATE('%Y%m%d', MAX(partition_id)) AS max_date FROM ${PARTITIONS} where table_name = '${self().split(".").pop().replace('`','')}' and partition_id != "__NULL__"`,
    `SELECT DATE('2023-01-01')`)}
    );
}

… that does one simple thing: it gets the maximum date already available in our final data table. Why is this useful?

  • When we build the table from scratch, we want to load all the available data from the raw tables
  • When we run an incremental load (the table already exists), we just want to add new data, i.e. rows where the date in the raw data is after the maximum date already present in the final data.

This is why we need these two snippets. We can then continue with our main query: simple SQL that you should be able to understand if you know some SQL basics. Otherwise, read my article on SQL for SEO. The only new things should be:

  • The incremental condition (${when(incremental(), `WHERE data_date > gsc_checkpoint`)}), which runs only for an incremental load. It uses the maximum date we calculated before.
  • The ${ref("searchdata_url_impression")} shortcut, which can be used in Dataform to refer to the tables we added in the declarations.js file.
WITH raw_gsc_data AS (
    SELECT
    *
    FROM ${ref("searchdata_url_impression")}
    ${when(incremental(), `WHERE data_date > gsc_checkpoint`) }
)

SELECT
-- data_date is the date column provided by the bulk export; we expose it as `date`
data_date AS date,
CASE
    WHEN url LIKE '%/blog/%' THEN 'Blog'
    ELSE 'Other'
END AS page_type,
CASE
    WHEN query LIKE '%mybrand%' THEN 'branded'
    WHEN query IS NULL THEN 'anonymized'
    ELSE 'non-branded'
END AS query_type,
SUM(clicks) AS clicks,
SUM(impressions) AS impressions
FROM raw_gsc_data
GROUP BY date, page_type, query_type

If you’ve followed this process correctly, you should see the following on the right. The non-incremental run will always be more expensive because it rebuilds your table from scratch (which may be needed if you add new dimensions or metrics, for instance). If you get an error, debug it and fix it.

Run the model using the explanation provided at the end of the article I’ve already linked and ensure that your new table is available. You can then schedule your workflow to run every day.

  • If new GSC data is available, the data will be added to your table
  • If new GSC data is not available, nothing will happen.

Looker Studio

You can now connect your new table to Looker Studio and benefit from your GSC data through an optimized (speed- and cost-wise) setup. Nothing else to add here, to be honest: you’ll use the data as you usually do.

Conclusion

Applying simple data engineering techniques to your data workflow is frightening at the beginning because there are numerous new concepts to learn. Nevertheless, the benefits far outweigh the learning curve.

You’ll be able to unleash the power of your GSC data without worrying about the cost, unless you’re dealing with colossal web properties.

Now go apply this logic to your project and start using your GSC data like a pro!
