Discover your Strava Data with Serverless & Advanced Analytics in GCP/AWS.

Max Habra
9 min read · Mar 17, 2021

Advanced analytics and data exploration are already the bread and butter of many companies (think GAFAM, BATX), but the idea is gaining ground everywhere, on the business side as well as on a more personal level.

The goal here is to show how easy it is to explore data and generate insights and metrics when the right tools are used, and that data exploration can be applied to almost any situation.

To demonstrate this, we'll use an activity that was immensely popular in 2020 and shows no sign of slowing down in 2021: cycling.

If you track your activities for performance, progress or simply to keep count of what you've done, there is a good chance you use Strava, a social media platform for athletes.

Interceptor, the decent road bike that serves me well.

Like its 60+ million other users, I use Strava to keep a record of my rides. Strava collects a great amount of data, including GPS information; we will explore all of it.

The Setup

We have our collection of activities, and we want to end up with a personalized dashboard where we can explore and play with that data.

From Strava to Minority Report with the Cloud's power

We start with our data source (Strava) and, through serverless components, get access to more advanced analytics and big data tooling. We will use building blocks that can be automated and require almost no maintenance past the initial setup.

note: all the infrastructure and Python code used for this demo is available on github. Also, this use case is within the capabilities of every major cloud provider.

Requirements

To perform advanced analytics, you’ll need a few prerequisites:

  • A source of data: for our use case, we use the Strava API, which provides access to our activities.
  • A set of tools: cloud providers (Microsoft Azure, Amazon AWS, Google Cloud Platform) all provide the building blocks to carry out what we want to do, and at a low price.
  • An insight goal: even with data exploration in place, you won't achieve much without a premise and target questions. Here, we will try to measure progress, performance and improvement across our own activities. This can be done with Google's DataStudio, Amazon's QuickSight or Microsoft Power BI.

Setting The Source

Strava provides an API to our data; it requires signing up and setting up authentication. Using their Getting Started page, we can easily create an application, set up our authentication and retrieve our credentials:

  1. Client ID
  2. Client Secret
  3. Refresh Token

The Strava API authenticates using OAuth2, which requires a Client ID and a Client Secret. Once this setup is done, API calls must use the Refresh Token to retrieve a temporary Access Token, which expires after a few hours but lets us fetch data based on the permissions we granted. If you have never set up a similar application or used OAuth2, you can easily find tutorials to help you do it.

note: by default, Strava grants the read scope, which only permits reading public activities. If your Strava profile is private, you must request the read_all scope.
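If you have never generated the Refresh Token, the one-time authorization can be sketched as follows (a minimal example; the redirect URI and placeholder credentials are assumptions, not values from this article's repo):

# 1. Open this URL in a browser, approve the application, and copy the "code"
#    parameter from the redirect; scope=read_all is needed for private activities.
authorize_url = (
    'https://www.strava.com/oauth/authorize'
    '?client_id=YOUR_CLIENT_ID'
    '&redirect_uri=http://localhost/exchange_token'
    '&response_type=code'
    '&scope=read_all'
)

# 2. Exchange that one-time code for the long-lived refresh_token.
import requests
resp = requests.post(
    'https://www.strava.com/oauth/token',
    params={
        'client_id': 'YOUR_CLIENT_ID',
        'client_secret': 'YOUR_CLIENT_SECRET',
        'code': 'CODE_FROM_REDIRECT',
        'grant_type': 'authorization_code',
    },
)
print(resp.json()['refresh_token'])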

Setting The Data Platform

To achieve our goal of retrieving our data, we must be able to do a few things automatically:

  1. Retrieve our activities at regular intervals without any human interaction.
  2. Store that data.
  3. Allow ourselves to explore the data, design graphs and visualize insights and breakthroughs.

There are several ways to do this; among the easiest is to use a cloud platform such as Google Cloud. Our main goal is to achieve it with no human interaction or overhead and with the highest possible reliability (no crashes, bugs or issues). We can get there by building a serverless architecture as follows:

Let’s go into the details of that workflow.

The main part of our whole setup is the serverless Cloud Function. The Function performs all the work for us: retrieving the authentication tokens, calling Strava, retrieving the data and storing it. However, the Cloud Function requires other building blocks to store or access that information:

  1. Cloud Scheduler: At regular intervals, Cloud Scheduler (essentially a cronjob) triggers our serverless function for us. The Scheduler calls the endpoint of the Cloud Function (via an HTTP POST) every 6 hours (note: Strava allows 100 calls every 15 minutes, up to 1,000 calls a day). Since our function doesn't require a specific parameter, the body of the call can be empty, but it still has to be valid JSON.
  2. Cloud IAM: The Function requires permissions, via a service account. For this case, it needs:
  • Permission to launch data jobs (bigquery.jobUser)
  • Permission to store data in BigQuery (bigquery.dataEditor)
  • Permission to read our Strava secrets (secretmanager.secretAccessor)

3. Secret Manager: The Strava API requires various secrets for its OAuth2 handshake. We could store those secrets anywhere, but that wouldn't be good practice. We therefore store our secrets in… Secret Manager, where they belong.

Strava API Secrets in Secret Manager

4. BigQuery: BQ is Google's data warehouse / SQL storage solution. This is where the Function will store our data. It's also the SQL service the analytical tool will connect to.

5. Analytical Tools: Various tools, such as Microsoft Power BI, can connect to the data and help us explore it. For this use case, we will use Google's own analytical tool, DataStudio.

While you can create this setup by hand, the Terraform code is available here to automate the process.

The Serverless Function

This function is coded in Python 3.8. While the full code is available on github, here is a breakdown:

  1. Retrieve Strava information: we store our Strava OAuth2 secrets in Secret Manager, and the function must retrieve them at runtime:
# Imports shared by all the snippets below
import logging
import requests
from google.cloud import bigquery, secretmanager

def fetch_from_secretmanager(project_id, secret_id):
    # Read the latest version of a secret stored in Secret Manager
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
    response = client.access_secret_version(request={"name": name})
    payload = response.payload.data.decode("UTF-8")
    logging.info(f'Retrieved {secret_id}')
    return payload

This sub-function is called 3 times to retrieve the Client ID, Client Secret and Refresh Token; it requires your project_id and each secret key. This is all done using Secret Manager's Python library.

2. Retrieve Access_Token: Once our function has the required secrets, it can request the temporary access token; the sub-function takes the secrets as parameters:

def fetch_strava_accesstoken(clientid, secret, refreshtoken):
    # Exchange the long-lived refresh token for a short-lived access token
    resp = requests.post(
        'https://www.strava.com/api/v3/oauth/token',
        params={'client_id': clientid, 'client_secret': secret,
                'grant_type': 'refresh_token', 'refresh_token': refreshtoken}
    )
    response = resp.json()
    logging.info('Retrieved refresh_token & access_token')
    return response['access_token']

The call returns a JSON payload with various fields; the one we want is access_token, which we return.
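For reference, the token response looks roughly like this (illustrative values; the field names come from Strava's OAuth2 flow):

# Illustrative shape of the /oauth/token response:
# {
#   "token_type": "Bearer",
#   "access_token": "a1b2c3...",
#   "expires_at": 1616000000,
#   "expires_in": 21600,
#   "refresh_token": "d4e5f6..."
# }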

3. Retrieve Data from Strava: Now that we are authorized to call the Strava API, let's fetch our activities:

def fetch_strava_activities(token):
    # Page through the athlete's activities, 200 at a time (Strava's maximum)
    page, activities = 1, []
    while True:
        resp = requests.get(
            'https://www.strava.com/api/v3/athlete/activities',
            headers={'Authorization': f'Bearer {token}'},
            params={'page': page, 'per_page': 200}
        )
        data = resp.json()
        activities += data
        if len(data) < 200:    # last (partial) page reached
            break
        page += 1

    logging.info(f'Fetched {len(activities)} activities')
    return activities

The sub-function calls the athlete/activities endpoint of the API. Strava returns up to 200 activities per call, so in case the athlete has more than 200, a while loop is used to go through all the pages. The sub-function returns all the activities in JSON format.
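Each element of that list is a summary of one activity. To give an idea of what ends up in BigQuery, here is an illustrative subset of the fields of a single ride (the exact set depends on your device and activity type):

# Illustrative subset of one activity from athlete/activities:
# {
#   "name": "Morning Ride",
#   "type": "Ride",
#   "start_date": "2021-03-14T09:02:13Z",
#   "distance": 42195.0,             # meters
#   "moving_time": 5400,             # seconds
#   "total_elevation_gain": 350.0,   # meters
#   "average_speed": 7.8,            # meters/second
#   "average_watts": 165.3           # present for rides with (estimated) power
# }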

4. Store in BigQuery: It is now time to load it all into our SQL database:

def activites_to_bq(jsonl_lines, project, dataset, table):
    # Load the JSON activities into BigQuery, replacing the table contents
    bq_client = bigquery.Client()
    job_config = bigquery.job.LoadJobConfig()
    logging.info(f'Loading in {project} / {dataset} / {table}')
    job_config.source_format = bigquery.job.SourceFormat.NEWLINE_DELIMITED_JSON
    job_config.write_disposition = bigquery.job.WriteDisposition.WRITE_TRUNCATE  # Overwrite
    job_config.create_disposition = bigquery.job.CreateDisposition.CREATE_IF_NEEDED
    job_config.autodetect = True  # Let BigQuery infer the schema from the JSON
    job = bq_client.load_table_from_json(
        json_rows=jsonl_lines,
        destination=f'{project}.{dataset}.{table}',
        job_config=job_config
    )

    logging.info(f'Launched job id: {job.job_id}')
    return job.job_id

Using BigQuery's Python library, we load the activities JSON into our BigQuery table.
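Once the load job completes, a quick query confirms the rows landed (a small sketch; the project, dataset and table names below are placeholders):

from google.cloud import bigquery

# Count the activities loaded by the function (names are placeholders).
bq_client = bigquery.Client()
sql = "SELECT COUNT(*) AS activities FROM `my-project.strava.activities`"
for row in bq_client.query(sql).result():
    print(row.activities)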

5. The Call Sequence: Now that all our sub-functions are ready, we can run the whole process in order:

def run(request):
    strava_clientid = fetch_from_secretmanager(GCP_PROJECT_ID, STORED_CLIENT)
    strava_clientsecret = fetch_from_secretmanager(GCP_PROJECT_ID, STORED_SECRET)
    strava_refreshtoken = fetch_from_secretmanager(GCP_PROJECT_ID, STORED_REFRESHTOKEN)

    strava_accesstoken = fetch_strava_accesstoken(strava_clientid, strava_clientsecret, strava_refreshtoken)

    strava_activities = fetch_strava_activities(strava_accesstoken)

    activites_to_bq(strava_activities, GCP_PROJECT_ID, BQ_DATASET, BQ_TABLE)
    return "Strava API Job completed."

We retrieve our secrets into temporary variables, fetch the temporary access_token, fetch our activities and store them in BigQuery. This is the entry point that gets triggered at step #1 by Cloud Scheduler.
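If you want to sanity-check the whole chain before deploying, the entry point can also be called locally (assuming you are authenticated with gcloud and the GCP_PROJECT_ID, STORED_* and BQ_* constants are defined):

# Quick local test of the full pipeline; the request argument is unused,
# so passing None is enough for this sketch.
if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    print(run(None))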

Exploring The Data

So far we have managed to gather our data in a powerful data warehouse service; we now need a way to explore, display and visualize it. Using DataStudio, we can create a source (connector) and connect our BigQuery dataset to DataStudio. We can then visualize the data, make correlations between the data fields Strava collected, and build reports and insights.

Adding a BigQuery source to DataStudio.

Cloud Data warehouses such as BigQuery or AWS Redshift can easily be connected to your favourite data exploration tool:

Discovering Insights

Ultimately, we didn't build all this just to display statistics in a fancy way, but to figure out insights for ourselves (or our business). A powerful analytical tool lets us see and analyse patterns we would not normally notice.

While more advanced tools exist, such as DataRobot, for this demo we will keep it simple and try to find progression across our activities. Doing this requires knowledge of the data, but also of the business and the industry.

In cycling, the effort deployed on a ride can be approximated by correlating distance with terrain; not all rides are flat (actually, none are), so to really gauge the difficulty of a ride we cannot consider distance alone and must also account for wind, weather, elevation, etc. By combining these, we can compare activities and physical effort more accurately:

left: Weekly averages with watts progression; right: quarterly overview.
metrics used

With those graphs and new metrics, we can now draw correlations between our activities and our progression.

In this use case, using Running Totals & Moving Averages lets us see how riding correlates with better-performing activities (orange line).

The quarterly average watts is also quite self-explanatory.
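If you prefer to compute such metrics outside DataStudio, the same kind of rolling average can be sketched in a few lines of pandas (an illustrative example; it assumes the placeholder table name used earlier and that average_watts is populated):

import pandas as pd
from google.cloud import bigquery

# Pull the fields we need straight from BigQuery into a DataFrame.
bq_client = bigquery.Client()
sql = """
    SELECT start_date, distance, total_elevation_gain, average_watts
    FROM `my-project.strava.activities`
    WHERE type = 'Ride'
"""
df = bq_client.query(sql).to_dataframe()

# Weekly average watts, smoothed with a 4-week moving average.
df['start_date'] = pd.to_datetime(df['start_date'])
weekly = df.set_index('start_date')['average_watts'].resample('W').mean()
print(weekly.rolling(window=4, min_periods=1).mean())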

Building on AWS

If you wish to build the same infrastructure on AWS, you can rely on this architecture and switch to the AWS libraries in the Lambda function:

On AWS, Secrets Manager can retrieve and update OAuth2 tokens directly, streamlining the Lambda workflow to only fetching Strava data and storing it in a data warehouse service; on AWS, S3 buckets can serve as that data warehouse.
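As a rough sketch of that Lambda (the secret name, its JSON keys and the bucket name are assumptions for illustration, and the two fetch_strava_* helpers are the ones shown earlier):

import json
import boto3
import requests

def handler(event, context):
    # Read the Strava credentials from AWS Secrets Manager
    # (the secret name and its JSON keys are assumptions).
    secrets = boto3.client('secretsmanager')
    creds = json.loads(secrets.get_secret_value(SecretId='strava/api')['SecretString'])

    token = fetch_strava_accesstoken(creds['client_id'], creds['client_secret'], creds['refresh_token'])
    activities = fetch_strava_activities(token)

    # Store the raw activities as a JSON object in S3, our data warehouse here.
    s3 = boto3.client('s3')
    s3.put_object(
        Bucket='my-strava-bucket',
        Key='activities/activities.json',
        Body=json.dumps(activities).encode('utf-8'),
    )
    return {'statusCode': 200, 'body': f'{len(activities)} activities stored'}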

You can connect QuickSight to the S3 bucket containing your JSON activities. This simplifies the setup quite a bit while increasing reliability. Going even further, you can leverage Lake Formation for additional analytics, security and crawling capabilities.

AWS Lake Formation design & features

