Skip to content

perf: reimplement unpivot to use cross join rather than union #47

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Sep 22, 2023

Conversation

TrevorBergeron
Copy link
Contributor

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@TrevorBergeron TrevorBergeron requested review from a team as code owners September 21, 2023 20:19
@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Sep 21, 2023
@shobsi
Copy link
Contributor

shobsi commented Sep 21, 2023

Thanks! Could you please paste SQLs before and after, and the bigframes example showcasing the improvement?

@TrevorBergeron
Copy link
Contributor Author

TrevorBergeron commented Sep 22, 2023

before:

  SELECT t3.`rowindex`, t3.`int64_col`, t3.`int64_too`, t3.`rowindex_2`,
         (row_number() OVER (ORDER BY t3.`rowindex` IS NULL ASC, t3.`rowindex` ASC) - 1) AS `bigframes_ordering_id`
  FROM (

    SELECT *
    FROM `project`.`dataset`.`table`
    FOR SYSTEM_TIME AS OF TIMESTAMP('2023-09-22T16:39:40.111000+00:00')

  ) t3
),
t1 AS (
  SELECT t3.`rowindex`, t3.`col_3`, t3.`unpivot_2`, t3.`bigframes_ordering_id`
  FROM (
    WITH t0 AS (
      SELECT t3.`rowindex`, t3.`int64_col`, t3.`int64_too`, t3.`rowindex_2`,
             (row_number() OVER (ORDER BY t3.`rowindex` IS NULL ASC, t3.`rowindex` ASC) - 1) AS `bigframes_ordering_id`
      FROM (

        SELECT *
        FROM `project`.`dataset`.`table`
        FOR SYSTEM_TIME AS OF TIMESTAMP('2023-09-22T16:39:40.111000+00:00')

      ) t3
    ),
    t4 AS (
      SELECT t0.`rowindex`, 'int64_too' AS `col_3`, t0.`int64_too` AS `unpivot_2`,
             (t0.`bigframes_ordering_id` * 3) + 1 AS `bigframes_ordering_id`
      FROM t0
    ),
    t5 AS (
      SELECT t0.`rowindex`, 'int64_col' AS `col_3`, t0.`int64_col` AS `unpivot_2`,
             (t0.`bigframes_ordering_id` * 3) + 0 AS `bigframes_ordering_id`
      FROM t0
    ),
    t6 AS (
      SELECT t0.`rowindex`, 'rowindex_2' AS `col_3`, t0.`rowindex_2` AS `unpivot_2`,
             (t0.`bigframes_ordering_id` * 3) + 2 AS `bigframes_ordering_id`
      FROM t0
    )
    SELECT *
    FROM t5
    UNION ALL
    SELECT *
    FROM t4
    UNION ALL
    SELECT *
    FROM t6
  ) t3
)
SELECT t2.*
FROM (
  SELECT t1.`rowindex`, t1.`col_3`, t1.`unpivot_2`,
         t1.`unpivot_2` IS NOT NULL AS `bigframes_predicate`,
         t1.`bigframes_ordering_id`
  FROM t1
) t2
WHERE t2.`bigframes_predicate`
ORDER BY t2.`bigframes_ordering_id` IS NULL ASC, t2.`bigframes_ordering_id` ASC

after:

  SELECT `rowindex`,
    CASE `unpivot_offsets_4`
      WHEN 0 THEN 'int64_col'
      WHEN 1 THEN 'int64_too'
      WHEN 2 THEN 'rowindex_2'
      ELSE CAST(NULL AS STRING)
    END AS `col_3`,
    CASE `unpivot_offsets_4`
      WHEN 0 THEN `int64_col`
      WHEN 1 THEN `int64_too`
      WHEN 2 THEN `rowindex_2`
      ELSE CAST(NULL AS INT64)
    END AS `unpivot_2`,
         (`bigframes_ordering_id` * 3) + `unpivot_offsets_4` AS `bigframes_ordering_id`
  FROM (
    SELECT t4.`rowindex`, t4.`int64_col`, t4.`int64_too`, t4.`rowindex_2`,
           (row_number() OVER (ORDER BY t4.`rowindex` IS NULL ASC, t4.`rowindex` ASC) - 1) AS `bigframes_ordering_id`
    FROM (

      SELECT *
        FROM `project`.`dataset`.`table`
      FOR SYSTEM_TIME AS OF TIMESTAMP('2023-09-22T16:38:11.237000+00:00')

    ) t4
  ) t2
    CROSS JOIN UNNEST(ARRAY<STRUCT<unpivot_offsets_4 INT64>>[STRUCT(0 AS unpivot_offsets_4), STRUCT(1 AS unpivot_offsets_4), STRUCT(2 AS unpivot_offsets_4)]) t3
)
SELECT t1.*
FROM (
  SELECT t0.`rowindex`, t0.`col_3`, t0.`unpivot_2`,
         t0.`unpivot_2` IS NOT NULL AS `bigframes_predicate`,
         t0.`bigframes_ordering_id`
  FROM t0
) t1
WHERE t1.`bigframes_predicate`
ORDER BY t1.`bigframes_ordering_id` IS NULL ASC, t1.`bigframes_ordering_id` ASC```

@TrevorBergeron TrevorBergeron merged commit f9a93ce into main Sep 22, 2023
@TrevorBergeron TrevorBergeron deleted the neostack branch September 22, 2023 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: m Pull request size is medium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants