mirror of https://github.com/djohnlewis/stackdump synced 2024-12-04 15:07:36 +00:00
stackdump/questions (1).json

1 line
491 KiB
JSON

2021-06-10 08:46:47 +00:00
{"items":[{"tags":["sql","google-bigquery"],"owner":{"reputation":1,"user_id":15480821,"user_type":"registered","profile_image":"https://lh3.googleusercontent.com/a-/AOh14GhaQIWPcBlhCShm2ZSOy8UJWCIU6YeDr2mzDUq0=k-s128","display_name":"Joe Griffin","link":"https://stackoverflow.com/users/15480821/joe-griffin"},"is_answered":false,"view_count":1,"answer_count":0,"score":0,"last_activity_date":1623259457,"creation_date":1623259457,"question_id":67908983,"share_link":"https://stackoverflow.com/q/67908983","body_markdown":"The query I&#39;m using does the job that I need it to do. Basically I need to take the maximum &#39;Fee&#39; from each day in the past 7 days, then average those together to get a 7 day average for each ID. \r\n\r\nThe data in &#39;table_one&#39; has over 2 billion rows, however it contains data from the whole of 2021 whereas I only need data for the most recent week, which is why i&#39;ve used the &#39;_PARTITIONTIME&#39; line, however there are likely still millions of records after filtering. I have a feeling the issue could be due to the two &#39;rank () over&#39; lines. 
When I look in the &#39;Execution Details&#39; in BQ, the wait times are the longest but the compute times are also considerably long.\r\n\r\n```with raw_data as (\r\nselect TimeStamp\r\n, EXTRACT(DAYOFWEEK FROM TimeStamp) as day_of_week\r\n, RestaurantId\r\n, Fee\r\n, MinAmount\r\n, Zone\r\nfrom table_one to,\r\nunnest(calculatedfees) as fees,\r\nunnest(Bands) as bands\r\nwhere\r\ntimestamp between timestamp(date_sub(date_trunc(current_date, week(monday)), interval 1 week)) and timestamp(date_trunc(current_date, week(monday)))\r\nand _PARTITIONTIME &gt;= timestamp(current_date()-8)\r\nand Fee &lt; 500),\r\n\r\nranked_data as (\r\nselect *\r\n, rank() over(partition by RestaurantId, Zone, day_of_week order by TimeStamp desc) as restaurant_zone_timestamp_rank -- identify last update for restaurant / zone pair\r\n, rank() over(partition by RestaurantId, Zone, TimeStamp, day_of_week order by MinAmount desc) as restaurant_zone_timestamp_minamount_rank -- identify highest &quot;MinAmount&quot; band\r\nfrom raw_data),\r\n\r\ndaily_week_fee as (\r\nselect RestaurantId\r\n, day_of_week\r\n, max(Fee) as max_delivery_fee\r\nfrom ranked_data\r\nwhere restaurant_zone_timestamp_rank = 1\r\nand restaurant_zone_timestamp_minamount_rank = 1\r\ngroup by 1,2),\r\n\r\navg_max_fee as (\r\nselect \r\nRestaurantID\r\n, avg(max_delivery_fee) as weekly_avg_fee\r\nfrom daily_week_fee\r\ngroup by 1)\r\n\r\nSELECT restaurant.restaurant_id_local\r\n, restaurant.restaurant_name\r\n, amf.weekly_avg_fee/100 as avg_df\r\nFROM restaurants r\r\nLEFT JOIN avg_max_fee amf\r\nON CAST(r.restaurant_id AS STRING) = amf.RestaurantId\r\nWHERE company.brand_key = &quot;Country&quot;\r\n\r\n```","link":"https://stackoverflow.com/questions/67908983/my-query-is-taking-a-very-long-time-any-suggestions-on-how-i-can-optimise-it","title":"My query is taking a very long time, any suggestions on how I can optimise it?","body":"<p>The query I'm using does the job that I need it to do. 
Basically I need to take the maximum 'Fee' from each day in the past 7 days, then average those together to get a 7 day average for each ID.</p>\n<p>The data in 'table_one' has over 2 billion rows, however it contains data from the whole of 2021 whereas I only need data for the most recent week, which is why i've used the '_PARTITIONTIME' line, however there are likely still millions of records after filtering. I have a feeling the issue could be due to the two 'rank () over' lines. When I look in the 'Execution Details' in BQ, the wait times are the longest but the compute times are also considerably long.</p>\n<pre><code>select TimeStamp\n, EXTRACT(DAYOFWEEK FROM TimeStamp) as day_of_week\n, RestaurantId\n, Fee\n, MinAmount\n, Zone\nfrom table_one to,\nunnest(calculatedfees) as fees,\nunnest(Bands) as bands\nwhere\ntimestamp between timestamp(date_sub(date_trunc(current_date, week(monday)), interval 1 week)) and timestamp(date_trunc(current_date, week(monday)))\nand _PARTITIONTIME &gt;= timestamp(curre