Skip to content

feat: add DataFrame.eval, DataFrame.query #361

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 20 commits into from
Mar 26, 2024
Merged

feat: add DataFrame.eval, DataFrame.query #361

merged 20 commits into from
Mar 26, 2024

Conversation

TrevorBergeron
Copy link
Contributor

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Jan 30, 2024
@TrevorBergeron TrevorBergeron marked this pull request as ready for review January 30, 2024 02:46
@TrevorBergeron TrevorBergeron requested review from a team as code owners January 30, 2024 02:46
@TrevorBergeron TrevorBergeron requested a review from shobsi January 30, 2024 02:46
from typing import Optional

import pandas
import pandas.core.computation.parsing as pandas_eval_parsing
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a private method. We'll need to be very careful with this if we want to go with this implementation.

Please instead pull this function into third_party so we can avoid surprises.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Was even considering vendoring the entire eval implementation. That way, I could get rid of the series.values.dtype calls and also be more resilient to any pandas changes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was even considering vendoring the entire eval implementation.

Let's do that. Even though eval is public, it does spook me to pass in anything but pandas objects into it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have now vendored the pandas eval implementation

@product-auto-label product-auto-label bot added size: l Pull request size is large. and removed size: m Pull request size is medium. labels Jan 30, 2024
from typing import Optional

import pandas
import pandas.core.computation.parsing as pandas_eval_parsing
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was even considering vendoring the entire eval implementation.

Let's do that. Even though eval is public, it does spook me to pass in anything but pandas objects into it.

@product-auto-label product-auto-label bot added size: xl Pull request size is extra large. and removed size: l Pull request size is large. labels Mar 21, 2024
@TrevorBergeron TrevorBergeron requested a review from tswast March 21, 2024 20:11
Copy link
Collaborator

@tswast tswast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! A few questions to resolve before we merge.

def eval(self, expr: str) -> DataFrame:
import bigframes.core.eval as bf_eval

return bf_eval.eval(self, expr, target=self)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eval uses .copy() on the target if inplace=False

)
def test_df_query(scalars_dfs, expr):
# local_var is referenced in expressions
local_var = 3 # NOQA
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow! Didn't realize it went as far as snatching up locals to pass through.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, its kind of scary, you tell it how many stack frames to look up and it will bring all those variables in scope for the evaluation.

"""
for element in line:
if iterable_not_string(element):
yield from flatten(element)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL: yield from https://stackoverflow.com/a/26109157/101923 I always just iterated and yielded.

Edit: Added in Python 3.3. https://docs.python.org/3/whatsnew/3.3.html#pep-380 I had to be compatible with Python 2.x for far too long, so I missed a lot of these features.


if overlap:
s = ", ".join([repr(x) for x in overlap])
raise NumExprClobberingError(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine we don't actually use the NumExpr package (https://github.com/pydata/numexpr) right? Maybe we can delete this function?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how much the "python" engine has changed over the years? If it's been pretty stable, then I think there's probably more benefit in cutting the vendored code to just the "python" engine code. If it's pretty actively changing, then keeping this file as close to as in sync with upstream makes more sense.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems the python engine itself is pretty stable. Went ahead and removed remaining NumExpr stuff

str
Engine name.
"""
from pandas.core.computation.check import NUMEXPR_INSTALLED
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto, I doubt we actually want to use numexpr, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

use Python.
"""
try:
return x.isin(y)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside: if y is a Series of an ARRAY column, would this do an element-wise "is in" check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so

return pprint_thing(f"{self.op}({self.operand})")

@property
def return_type(self) -> np.dtype:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funny to see this pattern here too. We're gonna have so many expression trees, lol.

BACKTICK_QUOTED_STRING = 100


def create_valid_python_identifier(name: str) -> str:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside: I wonder if we should use this function when creating column names? Could make them more interpretable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could borrow this approach with sql special characters


Operates on columns only, not specific rows or elements. This allows
`eval` to run arbitrary code, which can make you vulnerable to code
injection if you pass user input to this function.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Examples?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added


def query(self, expr: str) -> DataFrame | None:
"""
Query the columns of a DataFrame with a boolean expression.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Examples?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

@TrevorBergeron TrevorBergeron requested a review from tswast March 23, 2024 20:48
@tswast tswast merged commit 5e28ebd into main Mar 26, 2024
16 checks passed
@tswast tswast deleted the eval2 branch March 26, 2024 19:06
Genesis929 pushed a commit that referenced this pull request Apr 9, 2024
* feat: add DataFrame.eval, DataFrame.query

* address pr comments

* add docstring, disable new tests for legacy pandas

* vendor the pandas eval implementation

* amend eval docstring

* fix doctest expectation

* amend doctest

* pr comments

* Fix doctest for eval
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: xl Pull request size is extra large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants