-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Filter update #67
base: main
Are you sure you want to change the base?
Filter update #67
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was my first round reading. Will check again tomorrow in more detail.
molpipeline/mol2mol/filter.py
Outdated
FilterCatalog.SmartsMatcher(smarts, smarts) | ||
for i, smarts in enumerate(self.patterns) | ||
] | ||
rdkit_filter = FilterCatalog.FilterCatalog() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to check if this works in parallel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a test, but not sure whether this is what you want. please have a look
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sorry this was more a reminder for myself. Some rdkit objects cannot be pickled properly. I'll have look at your test,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@c-w-feldmann did you check this one again?
I have one mypy issue, which I do not understand. Also three questions:
|
This comment was marked as resolved.
This comment was marked as resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my head is exploding, but beside the small additions in the doc string, nothing I can complain about. @JochenSiegWork , your turn.
|
||
@property | ||
def filter_elements(self) -> Mapping[str, tuple[Optional[int], Optional[int]]]: | ||
"""Get filter elements as dict.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add note that this was implemented to keep backwards compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure that this is true. I implemented this because I liked to name the descriptors "descriptors" in the init for the DescriptorFilter and the patterns "patterns" in the PatternsFilters. But I could easily get rid of this. As the classes were not there yet, there is no real backward compatibility. What is your opinion? Is the naming not so important? Should we always go with filter_elements? I would remove quite some code and move get params and set params to the base method... @c-w-feldmann @JochenSiegWork your opinion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that you can access the patterns, descriptor names, etc. I also think "filter_elements" is a suitable name for that property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I discussed with @JochenSiegWork that it might make sense to implement this quite generic with just filter_elements, and include a good descriptoin in the docstring
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I implemented this generically now. Please check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
almost there :)
if not unique_elements.issubset(self.allowed_element_numbers): | ||
forbidden_elements = unique_elements - self.allowed_element_numbers | ||
to_process_value = ( | ||
Chem.AddHs(value) if 1 in self.allowed_element_numbers else value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implicitly modifying the molecule does not seem right. I understand that this can make sense when you are interested in exact element counts in the molecules. However, the filter should just filter molecules and not implicitly modify them before filtering. I think the most consistent way would be to let the user explicitly add hydrogens before the filter, which would result in an additional pipeline element. However, I think a compromise would also be fine by adding a flag add_hs
to the ElementFilter
's constructor to let the user choose.
@c-w-feldmann what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I disagree... If the Use adds the element number 1 in allowed_elements, I am pretty sure he would like to get the Hydrogens counted... The returned molecules do NOT have the added hydrogens, because this creates a new mol object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're absolutely right, and I understand your point. I have 3 aspects I am unhappy with:
- It's not explicit but implicit, and I advocate the "Explicit is Better Than Implicit" principle. However, this might also be solved with good, explicit docstrings that you can add.
- The molecules might already have been prepared with Hydrogens added. Calling Chem.AddHs again adds additional computation for every molecule.
- I am not sure about our use cases, but for docking, for example, the users typically prepare the protonation states of their molecules with some other tool beforehand. In which case readding hydrogen might mess with the prepared molecules. Is something similar also possible in our use cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, could we check, whether hydrogens are there already? Or should we alternatively log a warning here if hydrogens are requested but the count is 0 and tell people to e.g. use an AddHs Pipeline module? (not sure whether this exists?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the three of us should talk about it together. This is a more conceptual decision we need to make.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resolved most, and some comments. Please have a look
|
||
@property | ||
def filter_elements(self) -> Mapping[str, tuple[Optional[int], Optional[int]]]: | ||
"""Get filter elements as dict.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure that this is true. I implemented this because I liked to name the descriptors "descriptors" in the init for the DescriptorFilter and the patterns "patterns" in the PatternsFilters. But I could easily get rid of this. As the classes were not there yet, there is no real backward compatibility. What is your opinion? Is the naming not so important? Should we always go with filter_elements? I would remove quite some code and move get params and set params to the base method... @c-w-feldmann @JochenSiegWork your opinion?
if not unique_elements.issubset(self.allowed_element_numbers): | ||
forbidden_elements = unique_elements - self.allowed_element_numbers | ||
to_process_value = ( | ||
Chem.AddHs(value) if 1 in self.allowed_element_numbers else value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I disagree... If the Use adds the element number 1 in allowed_elements, I am pretty sure he would like to get the Hydrogens counted... The returned molecules do NOT have the added hydrogens, because this creates a new mol object
@c-w-feldmann @JochenSiegWork I added an additional filter (ComplexFilter) which can be initiatied with multiple moltomol elements and uses the keep matches logic I also had a look on auto2mol - I was a bit puzzled about serializability (there you also just set the elements as attributes, no get_params / set_params functionality given. Please have a look on the implementation especially regarding serializability |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just two small comments
|
||
@property | ||
def filter_elements(self) -> Mapping[str, tuple[Optional[int], Optional[int]]]: | ||
"""Get filter elements as dict.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that you can access the patterns, descriptor names, etc. I also think "filter_elements" is a suitable name for that property.
Strong yes. I think we should combine SMARTS and SMILES filter within a single class, because every SMILES is a valid SMARTS. Effectively a list of SMILES is a list of SMARTS.
No. But since you pre-compute the molecules for the SMARTS/SMILES now this is done automatically during the construction of the filter. So you already included this feature now.
We talked about this. We can in the future because it should fit but we don't have any real use case right now. So postpone. |
params["mode"] = self.mode | ||
if deep: | ||
params["filter_elements"] = { | ||
element: (count_tuple[0], count_tuple[1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@c-w-feldmann You are more familiar with sklearns conventions in set_params/get_params. Is this already considered a "deep" copy? As I understand the code, the element
can also be a full-fledged object, e.g. in the ComplexFilter. The current code here copies just the reference to element
not the object itself.
if not unique_elements.issubset(self.allowed_element_numbers): | ||
forbidden_elements = unique_elements - self.allowed_element_numbers | ||
to_process_value = ( | ||
Chem.AddHs(value) if 1 in self.allowed_element_numbers else value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the three of us should talk about it together. This is a more conceptual decision we need to make.
def __int__( | ||
class SmilesFilter(_BasePatternsFilter): | ||
"""Filter to keep or remove molecules based on SMILES patterns. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we keep the SmilesFilter, then add a comment in the docstring here: "In contrast to the SMARTSFilter, which also can match SMILES, the SmilesFilter checks kekulized bonds for aromaticity and then sets it to aromatic while the SmartsFilter detects alternating single and double bonds."
Closes #66 and #61 .
Additionally implements a (jsonable) DescriptorsFilter.
Also did some code cleaning and removed unnecessary lines of code without a breaking change