mainkoon81/U004-project-Python-AB-testing-typical

04-AB-testing

Case-I. A higher 'Click-Through-Rate' on the new website? Does the experiment webpage drive higher traffic than the control webpage?

  • package: Pandas, NumPy, Matplotlib, Seaborn

Case-II. Does a more career-focused description of the course lead to more success for both the website and the student?

  • package: Pandas, NumPy, Matplotlib, Seaborn

[Case I]. Does the 'experiment' webpage drive higher traffic than the 'control' webpage? Which one is better?

Data:

  • timestamp
  • id
  • group: control / experiment
  • action: view / click

  • number of unique users - 6328
df['id'].nunique()
  • size of control group and experiment group - (3332, 2996)
df.query('group == "control"').id.nunique(), df.query('group == "experiment"').id.nunique() 
  • action types in this experiment
df.action.unique()
  • duration of this experiment - ('2016-09-24 17:42:27.839496', '2017-01-18 10:24:08.629327')
df.timestamp.min(), df.timestamp.max()

Story:

  • Definition of 'Click-Through-Rate' (CTR)
    • The number of unique visitors who click at least once / the number of unique visitors who view the page
  • Why would we use 'Click-Through-Rate' instead of the raw number of clicks to compare the performance of the control and experiment pages? Because the proportion of users who click is a fairer comparison than the number of clicks when the groups have different sizes: one version could receive more total clicks even though the other version has a greater percentage of users who click (Simpson's paradox).
  • Steps to analyze the results of this A/B test (a compact sketch follows this list):
      1. We computed the observed difference in the metric (CTR) between the control and experiment groups.
      2. We simulated the sampling distribution for the difference in proportions (i.e., the difference in CTR).
      3. We used this sampling distribution to simulate the distribution under the null hypothesis, by creating a random normal distribution centered at 0 with the same spread and size.
      4. We computed the p-value by finding the proportion of values in the null distribution that were greater than our observed difference.
      5. We used this p-value to determine the statistical significance of our observed difference.
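
Below is a minimal, self-contained sketch of these five steps, assuming a DataFrame df with columns 'group' ("control"/"experiment"), 'action' ("view"/"click"), and 'id'; the helper names ctr and ab_test_ctr are illustrative, not part of the original notebook.

import numpy as np

def ctr(sub_df):
    # CTR = unique users who click at least once / unique users who view the page
    return sub_df.query('action == "click"').id.nunique() / sub_df.query('action == "view"').id.nunique()

def ab_test_ctr(df, n_boot=10000):
    # step 1: observed difference in CTR
    obs_diff = ctr(df.query('group == "experiment"')) - ctr(df.query('group == "control"'))
    # step 2: bootstrap the sampling distribution of the difference in proportions
    diffs = []
    for _ in range(n_boot):
        b_samp = df.sample(df.shape[0], replace=True)
        diffs.append(ctr(b_samp.query('group == "experiment"')) - ctr(b_samp.query('group == "control"')))
    diffs = np.array(diffs)
    # step 3: simulate the distribution under the null hypothesis
    null_vals = np.random.normal(0, diffs.std(), diffs.size)
    # step 4: p-value = proportion of null values greater than the observed difference
    p_value = (null_vals > obs_diff).mean()
    # step 5: compare p_value to the chosen alpha (e.g. 0.05)
    return obs_diff, p_value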

  • Extract all the actions from the 'control' group and compute its CTR - 0.2797118847539016
control_df = df.query('group == "control"')
control_ctr = control_df.query('action == "click"').id.nunique() / control_df.query('action == "view"').id.nunique()
  • Extract all the actions from the 'experiment' group and compute its CTR - 0.3097463284379172
experiment_df = df.query('group == "experiment"')
experiment_ctr = experiment_df.query('action == "click"').id.nunique() / experiment_df.query('action == "view"').id.nunique()

Observed Difference in CTR - 0.030034443684015644

obs_diff = experiment_ctr - control_ctr; obs_diff

Let's see whether this difference is significant!

  • We simulate the distribution under the null hypothesis (rather than building a confidence interval) because this is a one-tailed test.
  • First, use bootstrapping to simulate the sampling distribution for the difference in proportions (CTRs):
import numpy as np
import matplotlib.pyplot as plt

diffs = []

for _ in range(10000):
    # resample the full dataset with replacement and recompute each group's CTR
    b_samp = df.sample(df.shape[0], replace=True)
    control_df = b_samp.query('group == "control"')
    experiment_df = b_samp.query('group == "experiment"')
    control_ctr = control_df.query('action == "click"').id.nunique() / control_df.query('action == "view"').id.nunique()
    experiment_ctr = experiment_df.query('action == "click"').id.nunique() / experiment_df.query('action == "view"').id.nunique()
    diffs.append(experiment_ctr - control_ctr)

diffs = np.array(diffs)

Simulating From the Null Hypothesis

  • Simulate distribution under the null hypothesis
null_vals = np.random.normal(0, diffs.std(), diffs.size)
  • Plot the null distribution and a line at our observed difference
plt.hist(null_vals)
plt.axvline(x=obs_diff, color='red');

  • Compute p-value for (H0: diff <= 0)
(null_vals > obs_diff).mean()

The p-value is 0.0044, well below 0.05, so we reject H0.


[Case II]. Playing with metrics. Does a more career-focused description of the course lead to more success for both the website and the student?

Data: On the online education website, a second change was made: a more career-focused description (ad) on a course overview page, in the hope that it will encourage more users to enroll in and complete the course. In this experiment, we analyze the following metrics (a compact pandas sketch of all four follows the two dataset descriptions below):

  • Enrollment Rate: Click-through rate for the Enroll button on the course overview page
  • Average Reading Duration: Average number of seconds spent on the course overview page
  • Average Classroom Time: Average number of days spent in the classroom for students enrolled in the course
  • Completion Rate: Course completion rate for students enrolled in the course

>Dataset A

  • timestamp
  • id
  • group: control / experiment
  • action: view / enroll
  • duration

>Dataset B

  • timestamp
  • id
  • group: control / experiment
  • total_days
  • completed: True / False
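
As a quick reference before testing each metric, here is a hedged sketch of how the four metrics map to pandas expressions, assuming Dataset A is loaded as df_a and Dataset B as df_b (these names, and the helper metrics_for, are illustrative only; the sections below compute each metric directly on df).

def metrics_for(group_name, df_a, df_b):
    a = df_a.query('group == @group_name')
    b = df_b.query('group == @group_name')
    # Enrollment Rate: unique users who enroll / unique users who view the overview page
    enrollment_rate = a.query('action == "enroll"').id.nunique() / a.query('action == "view"').id.nunique()
    # Average Reading Duration: mean seconds spent on the overview page
    avg_reading_duration = a.duration.mean()
    # Average Classroom Time: mean days spent in the classroom by enrolled students
    avg_classroom_time = b.total_days.mean()
    # Completion Rate: students who completed / students enrolled
    completion_rate = b.query('completed == True').id.nunique() / b.id.nunique()
    return enrollment_rate, avg_reading_duration, avg_classroom_time, completion_rate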

Let's determine whether the difference observed for each metric is statistically significant individually.

Metric 1. Enrollment Rate

  • Compute 'Click-Through-Rate' for control group - 0.2364438839848676
control_df = df.query('group == "control"')
control_ctr = control_df.query('action == "enroll"').id.nunique() / control_df.query('action == "view"').id.nunique()
  • Compute 'Click-Through-Rate' for experiment group - 0.2668693009118541
experiment_df = df.query('group == "experiment"')
experiment_ctr = experiment_df.query('action == "enroll"').id.nunique() / experiment_df.query('action == "view"').id.nunique()
  • Compute the observed difference in CTR - 0.030425416926986526
obs_diff = experiment_ctr - control_ctr
  • Create a sampling distribution of the difference in proportions with bootstrapping
diffs = []
for i in range(10000):
    b_samp = df.sample(df.shape[0], replace=True)
    ctrl_df = b_samp.query('group == "control"')
    exp_df = b_samp.query('group == "experiment"')
    ctrl_ctr = ctrl_df.query('action == "enroll"').id.nunique() / ctrl_df.query('action == "view"').id.nunique()
    exp_ctr = exp_df.query('action == "enroll"').id.nunique() / exp_df.query('action == "view"').id.nunique()
    diffs.append(exp_ctr - ctrl_ctr)

diffs = np.array(diffs)
diffs.shape, diffs.size, diffs.mean(), diffs.std()
  • Simulate distribution under the null hypothesis
null_vals = np.random.normal(0, diffs.std(), diffs.size)
  • Plot the observed statistic against the null distribution
plt.hist(null_vals);
plt.axvline(obs_diff, c='red')

  • Compute p-value
(null_vals > obs_diff).mean()

Reject H0! The p-value is 0.022, below 0.05, so the difference is significant. With a Type I error rate of 0.05, we have evidence that the enrollment rate for this course increases when using the experimental description on its overview page.

Metric 2. Average Reading Duration

control_mean = df.query('group == "control"').duration.mean()
experiment_mean = df.query('group == "experiment"').duration.mean()
obs_diff = experiment_mean - control_mean

diffs = []
for _ in range(10000):
    b_samp = df.sample(df.shape[0], replace=True)
    control_mean = b_samp.query('group == "control"').duration.mean()
    experiment_mean = b_samp.query('group == "experiment"').duration.mean()
    diffs.append(experiment_mean - control_mean)

diffs = np.array(diffs)
null_vals = np.random.normal(0, diffs.std(), diffs.size)
plt.hist(null_vals)
plt.axvline(x=obs_diff, color='red')

(null_vals > obs_diff).mean()

Reject H0! The p-value is approximately 0, so the difference is significant. With a Type I error rate of 0.05, we have evidence that the average reading duration for this course increases when using the experimental description on its overview page.

Metric 3. Average Classroom Time

control_mean = df.query('group == "control"').total_days.mean()
experiment_mean = df.query('group == "experiment"').total_days.mean()
obs_diff = experiment_mean - control_mean

diffs = []
size = df.shape[0]
for _ in range(10000):
    b_samp = df.sample(size, replace=True)
    control_mean = b_samp.query('group == "control"').total_days.mean()
    experiment_mean = b_samp.query('group == "experiment"').total_days.mean()
    diffs.append(experiment_mean - control_mean)

diffs = np.array(diffs)
null_vals = np.random.normal(0, diffs.std(), diffs.size)
plt.hist(null_vals)
plt.axvline(obs_diff, c='red')

(null_vals > obs_diff).mean()

Reject H0! The p-value is 0.0321, below 0.05, so the difference is significant. With a Type I error rate of 0.05, we have evidence that the average classroom time for this course increases when using the experimental description on its overview page.

Metric 4. Completion Rate

control_df = df.query('group == "control"')
control_ctr = control_df.query('completed == True').id.nunique() / control_df.id.nunique()
experiment_df = df.query('group == "experiment"')
experiment_ctr = experiment_df.query('completed == True').id.nunique() / experiment_df.id.nunique()

obs_diff = experiment_ctr - control_ctr

diffs = []
size = df.shape[0]
for _ in range(10000):
    b_samp = df.sample(size, replace=True)
    control_df = b_samp.query('group == "control"')
    experiment_df = b_samp.query('group == "experiment"')
    control_ctr = control_df.query('completed == True').id.nunique() / control_df.id.nunique()
    experiment_ctr = experiment_df.query('completed == True').id.nunique() / experiment_df.id.nunique()
    diffs.append(experiment_ctr - control_ctr)

diffs = np.array(diffs)
null_vals = np.random.normal(0, diffs.std(), diffs.size)
plt.hist(null_vals)
plt.axvline(obs_diff, c='red')

(null_vals > obs_diff).mean()

Reject H0! The p-value is 0.039, below 0.05, so the difference is significant. With a Type I error rate of 0.05, we have evidence that the completion rate for this course increases when using the experimental description on its overview page.

Analyzing Multiple Metrics

  • The more metrics you evaluate, the more likely you are to observe significant differences just by chance: the probability of at least one false positive grows as the number of metrics increases. Luckily, this multiple comparisons problem can be handled in several ways, for example with the Bonferroni correction, which divides the significance level alpha by the number of tests.
  • Here are the p-values computed for the four metrics in this experiment
    • 1.Enrollment Rate: 0.022
    • 2.Average Reading Duration: 0
    • 3.Average Classroom Time: 0.032
    • 4.Completion Rate: 0.039
  • If our original alpha value was 0.05, the Bonferroni-corrected alpha value is 0.05 / 4 = 0.0125. With this corrected alpha, the only statistically significant metric is Average Reading Duration.
  • Since the Bonferroni method is too conservative when we expect correlation among metrics, we can approach this problem better with more sophisticated methods, such as the closed testing procedure, the Boole-Bonferroni bound, or the Holm-Bonferroni method (a sketch of the Bonferroni and Holm-Bonferroni corrections follows this list).
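
Here is a small sketch of the Bonferroni and Holm-Bonferroni corrections applied to the four p-values above (alpha = 0.05); the dictionary and variable names are illustrative.

p_values = {'Enrollment Rate': 0.022,
            'Average Reading Duration': 0.0,
            'Average Classroom Time': 0.032,
            'Completion Rate': 0.039}
alpha = 0.05
m = len(p_values)

# Bonferroni: compare every p-value to alpha / m (= 0.0125 here)
bonferroni_significant = {name: p < alpha / m for name, p in p_values.items()}

# Holm-Bonferroni: sort the p-values and compare the k-th smallest to alpha / (m - k)
holm_significant = {}
for k, (name, p) in enumerate(sorted(p_values.items(), key=lambda item: item[1])):
    if p < alpha / (m - k):
        holm_significant[name] = True
    else:
        break  # this and every larger p-value fail to reject H0

print(bonferroni_significant)  # only Average Reading Duration is True
print(holm_significant)        # Holm agrees here: only Average Reading Duration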

Note

  • When designing an A/B test and drawing conclusions from its results, there are some common considerations to keep in mind.
    • Novelty effect and change aversion when existing users first experience a change
    • Sufficient traffic and conversions to have significant and repeatable results
    • Best metric choice for making the ultimate decision (eg. measuring revenue vs. clicks)
    • Long enough run time for the experiment to account for changes in behavior based on time of day/week or seasonal events.
    • Practical significance of a conversion rate (the cost of launching a new feature vs. the gain from the increase in conversion)
    • Consistency among test subjects in the control and experiment group (imbalance in the population represented in each group can lead to situations like Simpson's Paradox)
