Cloud Data Fusion: Using the HTTP Plugin to Batch API Calls

Business Challenge

Available HTTP Data Fusion Plugins
Result of API Call

Static API Batch Pipeline

Simple pipeline for calling a series of APIs
General Configuration for HTTP Plugin
Format the schema of the call
Custom pagination section
import json

url_array = ["","","","","","",""]

def get_next_page_url(url, page, headers):
    # Find the position of the current URL in the array. The first URL in the
    # array should be the one configured in the plugin's URL field.
    position_index = url_array.index(url)

    # If the length of the array equals the position of the current URL + 1,
    # we're at the end of the array, so stop paginating.
    if len(url_array) == position_index + 1:
        return None

    # Otherwise, return the next URL in the array.
    return url_array[position_index + 1]
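A quick way to sanity-check this logic outside Data Fusion is to drive the function by hand. The loop below is only a stand-in for the plugin's pagination engine, and the example URLs are placeholders:

```python
url_array = ["https://api.example.com/a",
             "https://api.example.com/b",
             "https://api.example.com/c"]

def get_next_page_url(url, page, headers):
    position_index = url_array.index(url)
    if len(url_array) == position_index + 1:
        return None
    return url_array[position_index + 1]

# Simulate the plugin: start from the configured URL and follow the chain
# until the function returns None.
visited = []
current = url_array[0]
while current is not None:
    visited.append(current)
    current = get_next_page_url(current, page=None, headers=None)

print(visited)  # every URL in the array, in order
```

Each URL in the array is visited exactly once, and the run stops cleanly when the last one is reached.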
Running the Pipeline
Output schema from HTTP Plugin
Results of pipeline in BQ

Dynamic API Batch Pipeline

  • Static single node Dataproc cluster (not ephemeral)
  • Database/File to store the list of URLs
SQL for inserting the data into the table
Query output from the table
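For reference, the insert can be built and issued from a short script. The statement below is a sketch: the dataset, table, and column names (`mydataset.api_urls`, `url`) are hypothetical, and the URLs are placeholders for your real endpoints:

```python
# Build a BigQuery INSERT statement for the URL table.
# Table and column names here are hypothetical examples.
table = "mydataset.api_urls"
urls = ["https://api.example.com/a",
        "https://api.example.com/b"]

values = ", ".join("('{}')".format(u) for u in urls)
insert_sql = "INSERT INTO `{}` (url) VALUES {}".format(table, values)
print(insert_sql)
```

The resulting statement can be run in the BigQuery console or passed to the BigQuery client library of your choice.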
  • BigQuery Plugin: Pulls the URL data from BigQuery
  • Repartitioner Plugin: Ensures that there is only one output file. This is important if there are thousands of calls to make.
  • File Plugin: Writes the results to the local filesystem of the Dataproc cluster. Writing to HDFS would add significant latency to each API call.
Ingest pipeline for URLs
  • If the use case requires making potentially hundreds to thousands of calls, it’s more efficient to read that data from the local filesystem rather than from HDFS or GCS. Since the plugin calls the Python script on every iteration, each remote read of the file would add latency to the pipeline.
  • The HTTP plugin doesn’t have any dependency management, so using any libraries outside of what’s already available tends to cause issues.
File Plugin Configuration
The repartitioner will output only one partition
Custom pagination code for API calls
import json

def get_next_page_url(url, page, headers):
    # Read the URL list written by the ingest pipeline to the local filesystem.
    with open('/tmp/url_output/part-r-00000', 'r') as url_file:
        urls = [u.strip() for u in url_file.readlines()]

    if url in urls:
        position_index = urls.index(url)
        # At the end of the list: stop paginating.
        if len(urls) == position_index + 1:
            return None
        next_url = urls[position_index + 1]
        return next_url

    # The current URL isn't in the file yet (first call), so start at the top.
    next_url = urls[0]
    return next_url
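The same file-based logic can be exercised locally before deploying. The snippet below writes a stand-in for the repartitioned `part-r-00000` file to a temp directory and walks the list; the URLs are placeholders:

```python
import os
import tempfile

urls = ["https://api.example.com/1",
        "https://api.example.com/2"]

# Write a stand-in for the repartitioned output file.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "part-r-00000")
with open(path, "w") as f:
    f.write("\n".join(urls))

def get_next_page_url(url, page, headers):
    with open(path) as f:
        file_urls = [u.strip() for u in f.readlines()]
    if url in file_urls:
        position_index = file_urls.index(url)
        if len(file_urls) == position_index + 1:
            return None
        return file_urls[position_index + 1]
    # First call: the configured URL isn't in the file, start at the top.
    return file_urls[0]

first = get_next_page_url("https://configured-url", None, None)
second = get_next_page_url(first, None, None)
last = get_next_page_url(urls[-1], None, None)
```

The first call falls through to the top of the list, subsequent calls advance one URL at a time, and the final URL returns None to end pagination.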
  • HTTPSourceURLS_final (sources the URLs and stores them in a file)
  • getHTTPData_filebased_final (reads the file from the previous step, fetches the data, and writes it to BigQuery)
Inbound Trigger on the HTTPSourceURLS_final pipeline
Select the proper compute profile for both deployed pipelines
Loaded data from API Calls




I’m a Google Customer Engineer interested in all things data. I love helping customers leverage their data to build new and powerful data-driven applications!

Justin Taras
