-
Notifications
You must be signed in to change notification settings - Fork 48
feat: add DataFrame.eval, DataFrame.query #361
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
bigframes/eval.py
Outdated
from typing import Optional | ||
|
||
import pandas | ||
import pandas.core.computation.parsing as pandas_eval_parsing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a private method. We'll need to be very careful with this if we want to go with this implementation.
Please instead pull this function into third_party
so we can avoid surprises.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Was even considering vendoring the entire eval implementation. That way, I could get rid of the series.values.dtype calls and also be more resilient to any pandas changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was even considering vendoring the entire eval implementation.
Let's do that. Even though eval
is public, it does spook me to pass in anything but pandas objects into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have now vendored the pandas eval
implementation
bigframes/eval.py
Outdated
from typing import Optional | ||
|
||
import pandas | ||
import pandas.core.computation.parsing as pandas_eval_parsing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was even considering vendoring the entire eval implementation.
Let's do that. Even though eval
is public, it does spook me to pass in anything but pandas objects into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! A few questions to resolve before we merge.
def eval(self, expr: str) -> DataFrame: | ||
import bigframes.core.eval as bf_eval | ||
|
||
return bf_eval.eval(self, expr, target=self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we only set target
if inplace=True
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eval uses .copy()
on the target if inplace=False
) | ||
def test_df_query(scalars_dfs, expr): | ||
# local_var is referenced in expressions | ||
local_var = 3 # NOQA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow! Didn't realize it went as far as snatching up locals to pass through.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, its kind of scary, you tell it how many stack frames to look up and it will bring all those variables in scope for the evaluation.
""" | ||
for element in line: | ||
if iterable_not_string(element): | ||
yield from flatten(element) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL: yield from
https://stackoverflow.com/a/26109157/101923 I always just iterated and yielded.
Edit: Added in Python 3.3. https://docs.python.org/3/whatsnew/3.3.html#pep-380 I had to be compatible with Python 2.x for far too long, so I missed a lot of these features.
|
||
if overlap: | ||
s = ", ".join([repr(x) for x in overlap]) | ||
raise NumExprClobberingError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I imagine we don't actually use the NumExpr package (https://github.com/pydata/numexpr) right? Maybe we can delete this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how much the "python" engine has changed over the years? If it's been pretty stable, then I think there's probably more benefit in cutting the vendored code to just the "python" engine code. If it's pretty actively changing, then keeping this file as close to as in sync with upstream makes more sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems the python engine itself is pretty stable. Went ahead and removed remaining NumExpr stuff
str | ||
Engine name. | ||
""" | ||
from pandas.core.computation.check import NUMEXPR_INSTALLED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto, I doubt we actually want to use numexpr, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
use Python. | ||
""" | ||
try: | ||
return x.isin(y) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside: if y is a Series of an ARRAY column, would this do an element-wise "is in" check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so
return pprint_thing(f"{self.op}({self.operand})") | ||
|
||
@property | ||
def return_type(self) -> np.dtype: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Funny to see this pattern here too. We're gonna have so many expression trees, lol.
BACKTICK_QUOTED_STRING = 100 | ||
|
||
|
||
def create_valid_python_identifier(name: str) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside: I wonder if we should use this function when creating column names? Could make them more interpretable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could borrow this approach with sql special characters
|
||
Operates on columns only, not specific rows or elements. This allows | ||
`eval` to run arbitrary code, which can make you vulnerable to code | ||
injection if you pass user input to this function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Examples?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
|
||
def query(self, expr: str) -> DataFrame | None: | ||
""" | ||
Query the columns of a DataFrame with a boolean expression. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Examples?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
* feat: add DataFrame.eval, DataFrame.query * address pr comments * add docstring, disable new tests for legacy pandas * vendor the pandas eval implementation * amend eval docstring * fix doctest expectation * amend doctest * pr comments * Fix doctest for eval
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Fixes #<issue_number_goes_here> 🦕