This section provides an overview of what google-bigquery is, and why a developer might want to use it.
It should also mention any large subjects within google-bigquery, and link out to the related topics. Since the Documentation for google-bigquery is new, you may need to create initial versions of those related topics.
Installation or Setup
Detailed instructions on getting google-bigquery set up or installed.
Analyzing 50 billion Wikipedia pageviews in 5 seconds (BigQuery beginner tutorial)
Hi everyone! This is a demo I love running for people that get started with BigQuery. So let’s run some simple queries to get you started.
You will need a Google Cloud project:
- Go to https://bigquery.cloud.google.com/.
- If it tells you to create a project, follow the link to create a project, and create a project.
- Come back to https://bigquery.cloud.google.com/.
- You don’t need a credit card. Everyone gets a free 1TB for analysis each month.
- BigQuery charges per query. Before running a query you will be able to see how much each query costs.
- Find the pageviews for May 2015 at https://bigquery.cloud.google.com/table/fh-bigquery:wikipedia.pagecounts_201505
Note: Google protects your data with the highest security standards (PCI, ISO, HIPAA, SOC, etc), but it’s also easy to share data if you want so - as I did here. https://cloud.google.com/security/
This table has 5 columns: datehour language title requests content_size. They basically say “this wikipedia page in this language had these many requests at this hour”.
This table has almost 6 billion rows (379 GB of data).
To find out how many pageviews Wikipedia had during May, you can add up all the 6 billion lines of requests:
SELECT SUM(requests) FROM [fh-bigquery:wikipedia.pagecounts_201505]
Did you notice how fast that was? (1.8s elapsed, 43.1 GB processed for me)
Let’s do something more complex. Let’s run a regular expression over these 6 billion rows. How fast could this be?
SELECT SUM(requests) req, title FROM [fh-bigquery:wikipedia.pagecounts_201505] WHERE REGEXP_MATCH(title, 'Red.*t') GROUP BY title ORDER BY req DESC LIMIT 100
How fast was it for you? Did you find Reddit in the results?
This last query processed 269 GB: More than a quarter of the free monthly terabyte. Why?
BigQuery looks at the columns you process on your query. ‘title’ is a big column - it contains text. The ‘requests’ column is only 43.1 GB.
To make your free terabyte last, extract data to smaller tables. For example, I have a table with only the top 65,000 English Wikipedia pages pageviews. The same query processes only 1.18 GB - you can run almost a 1000 of them for free a month.
SELECT SUM(requests) req, title FROM [fh-bigquery:wikipedia.pagecounts_201408_en_top65k] WHERE REGEXP_MATCH(title, 'Red.*t') GROUP BY title ORDER BY req DESC LIMIT 100
You can’t create tables with the free monthly terabyte - it’s only for analysis. Activate your free $300 for new Google Cloud Platform accounts, or ask me here to do an extract for you. I will be happy to do so.
Loading data into BigQuery
To load data into BigQuery, you will need to activate billing for your project - try it with your free $300 for new accounts.
Create a dataset in your project to load the data to: https://i.imgur.com/FRClJ3K.jpg.
Find the raw logs shared by Wikipedia at https://dumps.wikimedia.org/other/pagecounts-raw/
wget one of these files into your computer, like https://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-06/pagecounts-20150616-160000.gz
Install the ‘bq’ tool. https://cloud.google.com/bigquery/bq-command-line-tool
Load it into BigQuery:
bq load -F" " --quote "" YourProjectName:DatasetName.pagecounts_20150616_16 pagecounts-20150616-160000.gz language,title,requests:integer,content_size:integer
Wait a couple minutes. While you wait, let me explain that line: This is not a CSV file, it’s a space separated file (-F” ”) that doesn’t use quotes (—quote ""), we choose a destination table in a dataset in your project (remember to create the dataset first), we chose the file to load, and we define the 4 columns this file has.
Note that BigQuery will happily ingest .gz files, up to a certain size. For very large files it’s better to un-compress them and put them in Google Cloud Storage first. That’s what I did with the reddit comments that /u/Stuck_In_The_Matrix compiled. Those files were large, but BigQuery ingested them in 2 minutes or so.