Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding a data generator class based on the demo notebook #411

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

nikisix
Copy link

@nikisix nikisix commented Oct 24, 2023

Copy link
Contributor

@wd60622 wd60622 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still taking a look at this. You might want to run make lint locally in order to format

df = self.df
self.is_lagged = True # signals that adstock has been applied
self.lag_fn = fn
self.alphas = np.random.beta(.5, .5, self.num_channels)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to use or pass another random seed here

@wd60622
Copy link
Contributor

wd60622 commented Oct 26, 2023

Can you post a picture of some of the different plots and their variations

Comment on lines 115 to 120
Parameters
suffix: str
Channel suffix. I.e. x1_adstock "_adstock" would be the
channel suffix.
show: bool
Can be turned off for overlays.'''
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stick with numpy docstrings


def plot_adstock_effect(self):
'''Plots the raw channels compared to when their adstock effects are applied.'''
assert self.is_lagged
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more informative error messages and error types could be helpfuk

Comment on lines 219 to 227
if not self.is_lagged:
fig, ax = self.plot_channel_spends(show=False)
self.plot_channel_spends(
suffix="_saturated", title="Saturation Effect", fig=fig, ax=ax, alpha=.5)
else:
fig, ax = self.plot_channel_spends(suffix="_adstock", show=False)
self.plot_channel_spends(
suffix="_adstock_saturated", alpha=.5, title="Saturation Effect",
fig=fig, ax=ax)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These can be consolidate and by using if block to define different kwargs

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call

Comment on lines 247 to 252
(self.sin_coef, self.cos_coef) = np.random.exponential(seasonality_scale, size=2)
num_periods = len(df)
freq = 52 # weeks per year
df["cs"] = -self.sin_coef * np.sin(num_periods * 2 * np.pi * df.index / freq)
df["cc"] = self.cos_coef * np.cos(num_periods * 2 * np.pi * df.index / freq)
df["seasonality"] = 0.5 * (df["cs"] + df["cc"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This already exists in the mmm/utils.py
Could that be utilized?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That native approach of creating 2N columns and dotting with the sin and cos coefs would make the df unwieldy to pass around. I prefer keeping them compressed into a single 'seasonality' column.

I did however generalize this to N degrees of Fourier series.

import seaborn as sns
import warnings

warnings.filterwarnings('ignore', category=FutureWarning)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are these coming from?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python3.11/site-packages/seaborn/_core.py:1218: FutureWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, CategoricalDtype) instead
  if pd.api.types.is_categorical_dtype(vector):

moved it back to matplotlib

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind, too many plots in other areas, just going to keep the warnings statement unless it's a big deal.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They're coming from seaborn^

Copy link

codecov bot commented Oct 31, 2023

Codecov Report

Merging #411 (b490b15) into main (916dce4) will decrease coverage by 9.86%.
The diff coverage is 0.00%.

❗ Current head b490b15 differs from pull request most recent head 93b2b45. Consider uploading reports for the commit 93b2b45 to get more accurate results

@@            Coverage Diff             @@
##             main     #411      +/-   ##
==========================================
- Coverage   88.67%   78.81%   -9.86%     
==========================================
  Files          21       22       +1     
  Lines        1880     2115     +235     
==========================================
  Hits         1667     1667              
- Misses        213      448     +235     
Files Coverage Δ
pymc_marketing/mmm/data_generator.py 0.00% <0.00%> (ø)

@nikisix
Copy link
Author

nikisix commented Oct 31, 2023

@nikisix
Copy link
Author

nikisix commented Oct 31, 2023

Everything should be addressed now. Let me know if there's anything else.

@nikisix
Copy link
Author

nikisix commented Oct 31, 2023

Also, looks like a new version just dropped. Let me know if you'd like me to remake this PR on top of a fresh branch or not.

logistic_saturation,
)

warnings.filterwarnings("ignore", category=FutureWarning)
Copy link
Contributor

@ricardoV94 ricardoV94 Nov 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't have a blank warning filter like this. You can suppress specific warnings around specific function calls instead with a local with warnings.catch_warnings():

)

warnings.filterwarnings("ignore", category=FutureWarning)
FIGSIZE = (15, 8)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Matplotlib / Seaborn / Arviz use a global variable (something rc) that controls things like this. We should let those be used by default, and users can tweak them that way as well.

Comment on lines +68 to +69
seed: int = sum(map(ord, "six-mmm"))
self.rng: np.random.Generator = np.random.default_rng(seed=seed)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allow user to provide an optional seed?

@ricardoV94
Copy link
Contributor

This functionality is fine, although I fear a bit hard to maintain as we change how MMM works. Ideally the MMM models could be used to generate synthetic data, but that's another story.

We should add a couple of tests to make sure basic functionality is not fundamentally broken.

@ricardoV94 ricardoV94 added enhancement New feature or request MMM labels Nov 1, 2023
@nikisix
Copy link
Author

nikisix commented Nov 1, 2023

This functionality is fine, although I fear a bit hard to maintain as we change how MMM works. Ideally the MMM models could be used to generate synthetic data, but that's another story.

Interesting... I can remove the pymc_marketing dependencies (on the transform functions) to reduce your having to worry about breaking the generator with core functionality upgrades. Let me know.

We should add a couple of tests to make sure basic functionality is not fundamentally broken.

Sure, I'll try and get some tests in there (this week perhaps).

@wd60622
Copy link
Contributor

wd60622 commented May 9, 2024

Hi @nikisix
Are you still interested in this?

It might be good to separate the data generation into:

  • X data generation
  • y data generation

The y data generation like @ricardoV94 mentioned can be do with the model itself. We have some examples of that being done here and here with the pm.do operator.

Separating these would allow us to not duplicate the data generation process defined by the model!

@nikisix
Copy link
Author

nikisix commented May 10, 2024

Hi @wd60622,
Thanks for reaching out, responses inline.

Hi @nikisix Are you still interested in this?

A bit, but honestly I don't have a ton of time for this right now.

It might be good to separate the data generation into:

  • X data generation
  • y data generation

The y data generation like @ricardoV94 mentioned can be do with the model itself. We have some examples of that being done here and here with the pm.do operator.

Separating these would allow us to not duplicate the data generation process defined by the model!

Doesn't that seem a bit circular though? As I implied to @ricardoV94, my intuition actually pulls me the opposite direction and remove all model/transform dependencies. This has 2 benefits:

  1. Model selection can run on generated data in an unbiased way. Ex. if generating data from an adstock hill model for instance then adstock hill model types would have an unfair advantage in model comparison.
  2. Decouples the data generator from model code, making the code-base more robust.

Am I missing something perhaps?

However, I did write a test file with graphs (as requested). So if that's the delta between this PR and acceptance I can commit it, and y'all can take it from there.

@nikisix
Copy link
Author

nikisix commented May 10, 2024

@wd60622 on second thought I agree with you guys about leveraging the models directly. Mainly because I can't think of a way to implement an independent data generation process. So you might as well couple this to a model type, give up on using it for model selection, but have a cleaner approach to parameter recovery. Point taken. Also the do operator looks slick.
Like I said, kind of swamped right now. Would you be ok accepting this (+test) as a first round, and then we can update the core generator to use pymc.do as a second pass?

@wd60622
Copy link
Contributor

wd60622 commented Jun 12, 2024

@wd60622 on second thought I agree with you guys about leveraging the models directly. Mainly because I can't think of a way to implement an independent data generation process. So you might as well couple this to a model type, give up on using it for model selection, but have a cleaner approach to parameter recovery. Point taken. Also the do operator looks slick. Like I said, kind of swamped right now. Would you be ok accepting this (+test) as a first round, and then we can update the core generator to use pymc.do as a second pass?

Hi @nikisix
I think it'd be good to just have a generation process for

  • dates
  • controls / events
  • spends

Then just remove the target generation process which can be saved for later. I.e. remove intercept, trend, seasonality, etc because that is handled by the model.

Maybe @juanitorduz has some thoughts too

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request MMM
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants