Batch Query with GA4 API
In this post I will share some code and insights for querying the GA4 API. Official documentation exists, but its explanations are thin, so I will cover the process in more detail below.
Assumptions
We will assume a few things given the scope of this post:
You have a GA4 instance set up and configured
You have data flowing into your GA4 instance in the period within which you query
You have generated a token to access your GA4 instance via API
Environment
I am running Miniconda (a lightweight Anaconda distribution). Some of the package installs below may be redundant depending on your preferred environment.
Package Install & Import
First, install the necessary packages.
!pip install google-analytics-data
!pip install pandas
Then import relevant packages for batch querying.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import DateRange
from google.analytics.data_v1beta.types import Dimension
from google.analytics.data_v1beta.types import Metric
from google.analytics.data_v1beta.types import RunReportRequest
from google.analytics.data_v1beta.types import BatchRunReportsRequest
import pandas as pd
import os
import datetime
Function Definitions
There are several functions we will need in order to run the batch queries for today's use case.
Date list function
The first function is a simple one for creating a list of dates. We will use this list as the iterable in a for loop when we execute our batch queries, querying one day at a time; this keeps each request within the API's row limits and scales to longer date ranges.
def date_diff_list(start_date, end_date):
    """Return a list of YYYY-MM-DD strings from start_date to end_date, inclusive."""
    date_list = []
    start_date = pd.to_datetime(start_date)
    end_date = pd.to_datetime(end_date)
    # step one day at a time
    delta = datetime.timedelta(days=1)
    # iterate over the range of dates
    while start_date <= end_date:
        date_list.append(start_date.strftime("%Y-%m-%d"))
        start_date += delta
    return date_list
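For example, a three-day window produces three entries:

date_diff_list('2022-11-01', '2022-11-03')
# ['2022-11-01', '2022-11-02', '2022-11-03']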
Batch Response Report function
The function below is specific to the number and nature of the queries you want to batch. In my case there are five individual queries: geo, meta, page, traffic, and ad. All five are issued as sub-requests within a single batch call; five is also the maximum number of requests the API accepts per batch.
def get_ga4_batch_report_df(property_id, dimensions_name_list, metrics, start_date, end_date, limit):
    property_val = f"properties/{property_id}"
    # build Dimension objects for each of the five queries,
    # in order: geo, meta, page, traffic, ad
    dimension_groups = [
        [Dimension(name=d) for d in names] for names in dimensions_name_list
    ]
    # build Metric objects (empty in this demo, but passed through for reuse)
    metrics_ga4 = [Metric(name=m) for m in metrics]
    date_range_query = [DateRange(start_date=start_date, end_date=end_date)]
    client = BetaAnalyticsDataClient()
    batch_report_request = BatchRunReportsRequest(
        property=property_val,
        requests=[
            RunReportRequest(
                property=property_val,
                dimensions=dims,
                metrics=metrics_ga4,
                date_ranges=date_range_query,
                limit=limit,
            )
            for dims in dimension_groups
        ],
    )
    response = client.batch_run_reports(batch_report_request)
    return response
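The returned response object holds one report per sub-request, in the same order the requests were issued. Assuming the configuration values and dimension lists defined in the sections below, a quick sanity check might look like this:

response = get_ga4_batch_report_df(property_id, dimensions_name_list,
                                   metrics, start_date, end_date, limit)
# one report per sub-request: 0=geo, 1=meta, 2=page, 3=traffic, 4=ad
for i, report in enumerate(response.reports):
    dims = [header.name for header in report.dimension_headers]
    print(f"report {i}: {report.row_count} rows, dimensions: {dims}")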
Batch report to dataframe function
The next function takes the reports from the batch response above and transforms them into a single pandas dataframe.
def ga4_batch_response_to_df(batchreport):
    all_data = pd.DataFrame()
    for report in batchreport:
        dim_len = len(report.dimension_headers)
        metric_len = len(report.metric_headers)
        report_data = []
        for row in report.rows:
            row_data = {}
            # map each dimension header to its value in this row
            for i in range(dim_len):
                row_data[report.dimension_headers[i].name] = row.dimension_values[i].value
            # map each metric header to its value in this row
            for i in range(metric_len):
                row_data[report.metric_headers[i].name] = row.metric_values[i].value
            report_data.append(row_data)
        report_df = pd.DataFrame(report_data)
        # stack the reports; shared key columns align, other columns fill with NaN
        all_data = pd.concat([all_data, report_df])
    return all_data
Configuration
There are several configuration variables we will set:
The token for accessing the GA4 API from Google
The property ID for our GA4 instance
The start and end dates for the query window
The row limit for query size
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'token.json'
property_id = ''
start_date = '2022-11-01'
end_date = '2022-11-07'
limit = 100000
If you only want a single day's worth of queries, set the two dates equal to each other. Here we will assume we want multiple days' worth. The limit is set to 100,000, the maximum number of rows the API will return per request.
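If a single day ever returns more rows than the limit, you would need to paginate. A minimal single-report sketch using the API's offset parameter (not used elsewhere in this post) might look like:

client = BetaAnalyticsDataClient()
page_request = RunReportRequest(
    property=f"properties/{property_id}",
    dimensions=[Dimension(name="dateHour")],
    date_ranges=[DateRange(start_date=start_date, end_date=end_date)],
    limit=limit,
    offset=limit,  # skip the first page of rows already fetched
)
page_two = client.run_report(page_request)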
Next, we need to decide which dimensions and metrics to pull.
What dimensions and metrics do you need?
The GA4 Dimensions & Metrics Explorer (https://ga-dev-tools.google/ga4/dimensions-metrics-explorer/) is a helpful tool for learning what variables you can query.
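The same information is also available programmatically via the Data API's metadata endpoint. A short sketch, assuming property_id is set as in the configuration above:

from google.analytics.data_v1beta.types import GetMetadataRequest

client = BetaAnalyticsDataClient()
request = GetMetadataRequest(name=f"properties/{property_id}/metadata")
metadata = client.get_metadata(request)
# list every dimension this property can be queried for,
# including any customEvent dimensions you have registered
for dimension in metadata.dimensions:
    print(dimension.api_name)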
Dimensions
For the purposes of this demo, I will be using five functionally oriented queries:
Geo: Country, Region, City, etc.
Meta: If you host multiple client sites under one GA4 implementation, this query can pull the specific customEvent variables you have defined to identify them, etc.
Page: Page Title, Landing Page, other customEvent variables you have defined, etc.
Traffic: Referring URLs, Session Source/Medium, Session Channel Grouping, etc.
Ad: Campaign ID/Name, Session Google Ads Keywords, Creative IDs, etc.
The code below translates the above requirements into a format suitable for the batch query, provided you substitute your own customEvent variable names for the placeholders and have defined a custom session ID:
geo_dimensions_name = ['customEvent:custom_session_id', 'dateHour', 'fullPageUrl', 'country', 'region', 'city']
meta_dimensions_name = ['customEvent:custom_session_id', 'dateHour', 'fullPageUrl', 'customEvent:_____', 'customEvent:_____']
page_dimensions_name = ['customEvent:custom_session_id', 'dateHour', 'fullPageUrl', 'pageTitle', 'landingPage', 'customEvent:_________']
traffic_dimensions_name = ['customEvent:custom_session_id', 'dateHour', 'fullPageUrl', 'pageReferrer', 'sessionSourceMedium', 'sessionDefaultChannelGrouping']
ad_dimensions_name = ['customEvent:custom_session_id', 'dateHour', 'fullPageUrl', 'campaignId', 'campaignName', 'sessionGoogleAdsKeyword', 'googleAdsCreativeId']
# assemble the five dimension-name lists in the order
# the batch function expects: geo, meta, page, traffic, ad
dimensions_name_list = [
    geo_dimensions_name,
    meta_dimensions_name,
    page_dimensions_name,
    traffic_dimensions_name,
    ad_dimensions_name,
]
Metrics
I am not interested in any metrics for the purposes of this demo; see the illustration after the code below for how metrics would be added. This query will generate one row per page hit, per hour, within a user's session.
metrics = []
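If you did want metrics, you would populate this list with metric API names from the explorer; a quick illustration using standard GA4 metrics (this demo keeps the list empty):

# illustration only; the demo above keeps metrics = []
metrics = ['screenPageViews', 'sessions', 'activeUsers']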
Daily Loop
Now that we have our parameters set, we can execute the functions defined above in a for loop over the list of dates generated by our date list function.
combined_df = pd.DataFrame()
for day in date_diff_list(start_date, end_date):
    # query one day at a time to stay within per-request limits
    response = get_ga4_batch_report_df(property_id, dimensions_name_list, metrics, day, day, limit)
    batch_df = ga4_batch_response_to_df(response.reports)
    # the five reports share these three key columns; grouping on them and
    # taking first() picks the first non-null value per column, merging the
    # stacked reports into one row per key
    group_df = batch_df.groupby(['customEvent:custom_session_id', 'dateHour', 'fullPageUrl'])
    group_df = pd.DataFrame(group_df.first()).reset_index()
    combined_df = pd.concat([combined_df, group_df])
Finally, resetting the index gives the combined dataframe a consistent, contiguous index.
combined_df_reset = combined_df.reset_index(drop=True)
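From here you can persist the result however you like; for example (the filename is arbitrary):

# write the combined results to disk for downstream analysis
combined_df_reset.to_csv('ga4_batch_export.csv', index=False)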