Copyright (C) 2014 Google Inc.
- Introduction
- Prerequisites and initial setup
- Start up a GKE cluster
- Configure and deploy your app
- Query your BigQuery table
- Shut down your replicated pods and cluster
- Troubleshooting
- Appendix: Optional: Use a customized Docker image for your app
This directory contains an example Google Kubernetes Engine (GKE) app that shows how you can build a 'pipeline' to stream Twitter data into BigQuery, via Google Cloud PubSub.
Kubernetes Engine is Google's managed version of Kubernetes, an open source container orchestrator originally developed by Google, and now managed by a community of contributors. We won't go into much detail about Kubernetes' features in this tutorial, but see its docs for more information.
Bigquery lets you run fast, SQL-like queries against multi-terabyte datasets in seconds, using the processing power of Google's infrastructure.
PubSub provides many-to-many, asynchronous messaging that decouples senders and receivers. It allows for secure and highly available communication between independently written applications and delivers low-latency, durable messaging.
The app uses uses PubSub to buffer the data coming in from Twitter and to decouple ingestion from processing. One of the Kubernetes app pods reads the data from Twitter and publishes it to a PubSub topic. Other pods subscribe to the PubSub topic, grab data in small batches, and stream it into BigQuery. The figure below suggests this flow.
This app can be thought of as a 'workflow' type of app-- it doesn't have a web front end (though Kubernetes is great for those types of apps as well). Instead, it is designed to continously run a scalable data ingestion pipeline. Note that PubSub provides guaranteed at-least-once message delivery. This means that we might sometimes see a duplicated item, but as each tweet has a UID, that's not an issue for this example.
See also a related app, in the ../redis directory of this repo, which uses
Redis instead of PubSub. The general structure of this
example is similar in many respects to that of the Redis example.
Note: This tutorial uses several billable components of Google Cloud Platform. The cost of running this tutorial will vary depending on run time. New Cloud Platform users may be eligible for a free trial.
To work through this example, you must have a Google Cloud Platform project with
the required APIs enabled. In the Cloud Developers Console,
create a new project or choose an existing project, then in the APIs & auth > APIs panel,
enable the BigQuery, Google Compute Engine, PubSub, Google Cloud Storage, and Google Cloud Storage JSON APIs.
You will be prompted to enable billing if you have not previously done so.
This tutorial uses gcloud, the Google Cloud SDK, to interact with GKE and other parts of the Google Cloud Platform (GCP). If you like, you can run the rest of the tutorial from the Cloud Shell of your GCP project. It has everything installed that you will need, including gcloud.
Start up the Cloud Shell by clicking on this 'prompt' icon in the Cloud Console:
Alternately, you can install the SDK according to these instructions, and run locally, e.g. on your laptop. If you install locally, make sure that you've authenticated using your Google Account:
gcloud auth loginthen set the default project for the Cloud SDK to your project:
gcloud config set project <your_project_id>In the Cloud Shell, you won't have to auth, but confirm you're using the correct project via:
gcloud config list projectand set it as above if necessary.
If you haven't already, clone the Github repository for this example:
git clone https://github.com/GoogleCloudPlatform/kubernetes-bigquery-python.git(If you prefer, you can download it instead.)
Then change to the pubsub subdirectory of the repo:
cd kubernetes-bigquery-python/pubsubWe'll stay in this directory for the rest of the tutorial.
Next, create a BigQuery table to store your tweets. BigQuery groups tables into abstraction layers called datasets, so first create a dataset as necessary. You can do this from the web UI, or from the command line like this:
bq mk <your-dataset-name>The bq command line tool is included in the Cloud SDK.
If you prefer, you can use an existing project dataset instead.
Then, create a new table (e.g., tweets) in that
dataset, to contain your incoming tweets. Each BigQuery table must be defined by
a schema. This example includes a predefined schema in the kubernetes-bigquery-python/bigquery-setup subdirectory,
schema.json, that you can use to define your table. From the pubsub directory, run:
bq mk -t <your-dataset-name>.tweets ../bigquery-setup/schema.json(If you prefer, you can also create your table via the web BigQuery UI, pasting in the schema from schema.json).
To use the Twitter API, you need to create a Twitter application.
Create a Twitter account as necessary, then create a Twitter application..
In the Twitter Application Management page, navigate to the Keys and
Access Tokens tab. Note your Consumer Key and Consumer Secret.
Then, click the Create my access token button to create a new access token, and note your
Access Token and Access Token Secret.
Before running the example, in addition to the Twitter and BigQuery
configuration described in the tutorial, you will need to create a PubSub
topic in your Cloud project. An easy way to do this is via the API explorer, in the "Try this API" section on this page. Specify your topic name like this: projects/your-project/topics/your-topic.
Note down the name of the topic you created.
Create a new GKE cluster in the Cloud Console by essentially following the instructions here, except that you will enable additional scopes for the cluster nodes before you click the "Create" button; specifically, you will enable access to both BigQuery and PubSub.
Click the "More" link towards the bottom of the GKE cluster creation page:
In the expanded panel, enable BigQuery project access:
and PubSub project access:
Click "Create" when you're done. This configuration will let apps running on this cluster access the BigQuery and PubSub APIs.
Notes: It's possible to create a GKE cluster from the command line as well.
If you've already run the Redis tutorial and optionally enabled the PubSub scope when you were setting up that cluster, you can use the same cluster for this example.
Once your cluster is created, configure kubectl command-line access. You can see the command that does this by clicking into the cluster details in the Cloud Console, then clicking the "Connect to cluster" link. You'll be shown a command that looks like the following, but configured for your info.
gcloud container clusters get-credentials <cluster-name> --zone <cluster-zone> --project <your-project-name>Run the version of that command for your cluster, then check that it was successful by running:
kubectl get nodesNow you're ready to configure your app. This requires editing two Kubernetes .yaml config files.
(If you should want to do any customization of the app code, see the section: Appendix: Use a customized Docker image for your app. Build and push your container image before you deploy.)
In Kubernetes, pods-- rather than individual application containers-- are the smallest deployable units that can be created, scheduled, and managed.
A replica set ensures that a specified number of pod "replicas" are running at any one time. If there are too many, it will kill some. If there are too few, it will start more. As opposed to just creating singleton pods or even creating pods in bulk, a replica set replaces pods that are deleted or terminated for any reason, such as in the case of node failure.
A Deployment provides declarative updates for Pods and Replica Sets. You only need to describe the desired state in a Deployment object, and the Deployment controller will change the actual state to the desired state at a controlled rate for you.
We will use Deployments for both parts of our Kubernetes app. The first, specified by twitter-stream.yaml, defines one replica of a container that will read in tweets via the Twitter streaming API and dump them to a PubSub topic. We're only using one replica here so that we don't open up multiple Twitter API connections on the same app. However, we're still using a replicated pod for the robustness that gives us-- if the pod crashes for some reason, it will be restarted, since will specify that there should always be one running.
The second part of the app, specified by bigquery-controller.yaml, defines two replicas of a container that will subscribe to the same PubSub topic, pull off tweets in small batches, and insert them into a BigQuery table via the BigQuery Streaming API. Here, we can use multiple pods-- they will use the same subscription to read from the PubSub topic, thus distributing the load. If source throughput were to increase, we could increase the number of these bigquery-controller pods.
Edit twitter-stream.yaml. Set your PUBSUB_TOPIC to the name of the topic you created.
Then, set the Twitter authentication information to the values you noted when setting up your Twitter application (CONSUMERKEY,CONSUMERSECRET, ACCESSTOKEN, and ACCESSTOKENSEC).
Edit bigquery-controller.yaml. Set your PUBSUB_TOPIC, and set your PROJECT_ID, BQ_DATASET, and BQ_TABLE information.
(If you optionally built your own docker image as described in the Appendix, also replace the image string gcr.io/google-samples/pubsub-bq-pipe:v5 with the name of the container image that you have built and pushed.)
After starting up your GKE cluster, and configuring your .yaml files, create the Deployments from the pubsub directory:
kubectl.sh create -f bigquery-controller.yaml
kubectl.sh create -f twitter-stream.yamlTo see your running pods, run:
kubectl get pods -o wideYou'll see a list of the pods that are running, the containers they're using, and the node they're running on in the cluster.
Because the bigquery-controller Deployment has specified two replicas, you will see two pods running with names like bigquery-controller-xxxx.
You can see whether each pod is Running or Pending. If a pod isn't moving into
the Running state after about a minute, that is an indication that it isn't starting up properly.
See the "Troubleshooting" section below.
You can run:
kubectl get deploymentsto see the system's defined deployments, and how many replicas each is specified to have.
You can also use the Kubernetes Dashboard interface to inspect your cluster and its components. Do this by running:
kubectl proxythen navigating to the following location in your browser: http://localhost:8001/ui.
For fun, try resizing bigquery-controller once its pods are running:
kubectl scale --replicas=3 deployment bigquery-controllerYou should see an additional third pod running shortly.
Note: don't resize the twitter-stream-controller. You should only open up one Twitter streaming connection for an app at a time.
After setting up your pipeline, let it collect tweets for a while – a few hours should do, but the longer you let it run, the richer your data set will be. After you have some data in your BigQuery table, you can try running some sample queries.
Visit the BigQuery console
and click Compose Query to begin writing a new query. In the following, replace your-dataset-name with the actual name of your dataset, and tweets with your table if you named it differently.
This example query demonstrates how to find the most retweeted tweets in your table, filtering on a specific term (in this case, "android"):
SELECT
text,
MAX(retweeted_status.retweet_count) AS max_retweets,
retweeted_status.user.screen_name
FROM
`your-dataset-name.tweets`
WHERE
LOWER(text) LIKE '%android%'
GROUP BY
text,
retweeted_status.user.screen_name
ORDER BY
max_retweets DESC
LIMIT
1000You might also find it interesting to filter your collected tweets by a set of terms. The following query filters by the words "Kubernetes," "BigQuery," or "Redis" (case-insensitive):
SELECT
created_at,
text,
id,
retweeted_status.retweet_count,
user.screen_name
FROM
`your-dataset-name.tweets`
WHERE
LOWER(text) LIKE '%kubernetes%'
OR LOWER(text) LIKE '%bigquery%'
OR LOWER(text) LIKE '%redis%'
ORDER BY
created_at DESC
LIMIT
1000The following query looks for a correlation between the number of favorites and the number of retweets in your set of tweets, grouped by language:
SELECT
CORR(retweeted_status.retweet_count, retweeted_status.favorite_count),
lang,
COUNT(*) c
FROM `your-dataset-name.tweets`
GROUP BY lang
HAVING c > 2000000
ORDER BY 1You could also investigate whether the speakers of a specific language prefer favoriting to retweeting, or vice versa:
SELECT
CORR(retweeted_status.retweet_count, retweeted_status.favorite_count),
lang,
COUNT(*) c,
AVG(retweeted_status.retweet_count) avg_rt,
AVG(retweeted_status.favorite_count) avg_fv,
AVG(retweeted_status.retweet_count)/AVG(retweeted_status.favorite_count) ratio_rt_fv
FROM `your-dataset-name.tweets`
WHERE retweeted_status.retweet_count > 1 AND retweeted_status.favorite_count > 1
GROUP BY lang
HAVING c > 1000000
ORDER BY 1;Labels make it easy to select the resources you want to stop or delete, e.g.:
kubectl delete deployment -l "name in (twitter-stream, bigquery-controller)"If you'd like to shut down your cluster altogether, you can delete it in the Cloud Console (or from the command line). Be sure to do this when you're done with it, so that you don't get charged.
You can inspect the components of your cluster in the Kubernetes dashboard by running:
kubectl proxyTo confirm that all your nodes, pods, and deployments are up and running properly, you can run the following commands from the command line:
kubectl get nodes
kubectl get pods
kubectl get deploymentsFor the pods, you can see whether each pod is Running or Pending. If a pod isn't moving into
the Running state after about a minute, that is an indication that it isn't starting up properly.
Double check that the pods show that they are using the correct container image name. You may also want to double check your .yaml file edits.
If nothing is obvious, a good next step is to look at the pod logs. You can do this from your local machine via:
kubectl logs <pod-name>You can also get a shell to a running container.
If you don't see evidence that your container has ever started sucessfully, double check the name of the container image that the pod is using, and make sure that you have pushed a container with that tag to the container registry.
This step is optional.
The example app uses a Docker image that runs the app's python scripts.
If you like, you can just use the prebuilt docker image, gcr.io/google-samples/pubsub-bq-pipe:v5, for your app. This is the image used by default in the bigquery-controller.yaml and twitter-stream.yaml files.
Follow the instructions below if you'd like to add customization and use your own image instead. For this, you'll either need Docker installed locally, or you can run Docker in the Cloud Shell. (You could also alternately use Google Cloud Container Builder.)
Once the image is built, it needs to be pushed somewhere that Kubernetes can access it. We'll use the Google Container Registry (GCR). It uses a Google Cloud Storage bucket in your own project to store the images, for privacy and low latency. The GCR docs provide more information on GCR and how to push images to it.
To build and push your Docker image to GCR, cd to the pubsub-pipe-image subdirectory, and run the following series of commands.
First, build your image, replacing <your-project-name> with your project name:
gcr.io/<your-project-name>/pubsub-bq-pipe:v1This builds your image according to the specifications of the Dockerfile in that directory. (You can name your image something other than 'pubsub-bq-pipe').
Then, push your image to the
Google Container Registry (GCR), again
replacing <your-project-name> with your project:
gcloud docker -- push gcr.io/<your-project-name>/pubsub-bq-pipe:v1Finally, edit twitter-stream.yaml and bigquery-controller.yaml, and
replace gcr.io/google-samples/pubsub-bq-pipe:v4 with the name of your image.





