
How To Fragment Bigquery Response Into 10000 In Every Request?

I have the BigQuery query 'SELECT visitorId , totals.visits FROM [12123333.ga_sessions_20160602]', which returns 500k rows in one request. But I want to fetch the data in fragments of 10,000 rows per request.

Solution 1:

One option would be to write the result of your query into a destination table and then use the Tabledata: list API to retrieve data from that table in a paged manner - either using maxResults and pageToken to retrieve page by page, or maxResults and startIndex to retrieve a specific range of rows.
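The paging logic above can be sketched in Python with the google-cloud-bigquery client, whose Client.list_rows method wraps the tabledata.list API (start_index / max_results). The project, dataset, and table names below are placeholders, not from the original answer:

```python
from typing import Iterator, Tuple


def page_ranges(total_rows: int, page_size: int = 10000) -> Iterator[Tuple[int, int]]:
    """Yield (start_index, max_results) pairs that cover total_rows in order."""
    for start in range(0, total_rows, page_size):
        yield start, min(page_size, total_rows - start)


def fetch_pages(client, table_id: str, total_rows: int, page_size: int = 10000):
    """Read a destination table page by page via tabledata.list
    (exposed as Client.list_rows in google-cloud-bigquery)."""
    for start, size in page_ranges(total_rows, page_size):
        # Each call reads directly from table storage - no query is billed.
        yield list(client.list_rows(table_id, start_index=start, max_results=size))


if __name__ == "__main__":
    # Hypothetical environment - adjust project/dataset/table to yours.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        destination="my-project.my_dataset.temp_sessions",
        use_legacy_sql=True,  # the question uses legacy-SQL table syntax
    )
    client.query(
        "SELECT visitorId, totals.visits FROM [12123333.ga_sessions_20160602]",
        job_config=job_config,
    ).result()  # pay once here to materialize the result

    for page in fetch_pages(client, "my-project.my_dataset.temp_sessions", 500000):
        print(len(page))  # handle each batch of up to 10,000 rows
```

With 500k rows and a page size of 10,000 this yields 50 requests against the temp table.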

Another option would be to add a row number to your query (something like below):

SELECT visitorId, totals.visits,
  ROW_NUMBER() OVER() AS num
FROM [12123333.ga_sessions_20160602]

while still writing the result into a destination temp table, and then retrieve data from that table using the new num field for grouping, for example num % 10000 = {group_number}. Or you can use INTEGER(num / 10000) = {group_number} - whichever you like more.

SELECT visitorId, totals.visits
FROM tempTable
WHERE num % 10000 = 0

and the next group with

WHERE num % 10000 = 1

and so on ...

Please note: the second option uses the expensive (execution-wise, not billing-wise) ROW_NUMBER() function, which requires all data for each partition (in this case there is only one partition - all rows) to be on the same node - so depending on the number of rows it may or may not work. For your specific example with just 500K rows it is going to work - but if you extend it to a table with millions and millions of rows it might not (depending on how much data you output in each row and the number of rows).

One more note: in the first option you pay only once, when you generate the result and save it into the temp table. After that it is free, in the sense that the Tabledata.list API does not run a BigQuery query per se, but rather reads directly from the underlying data. In the second option you pay both when you generate the temp table and each time you retrieve yet another group - because those are all BigQuery queries. Moreover, each time you get the data for a specific group you are charged for scanning the whole temp table - so in your case that is an extra 50 full scans.

This makes (in your case) the first option around 51 times cheaper than the second one :o)

Solution 2:

It sounds like you are asking for data pagination with a page size of 10,000. You could use the following query:

SELECT visitorId, totals.visits
FROM (
   SELECT visitorId, totals.visits, ROW_NUMBER() OVER() AS rownum
   FROM [12123333.ga_sessions_20160602]
) WHERE rownum BETWEEN 1 AND 10000

and for the next page:

SELECT visitorId, totals.visits
FROM (
   SELECT visitorId, totals.visits, ROW_NUMBER() OVER() AS rownum
   FROM [12123333.ga_sessions_20160602]
) WHERE rownum BETWEEN 10001 AND 20000
