Merge pull request #184 from holukas/meteoscreening-update
Meteoscreening updates
holukas committed Aug 28, 2024
2 parents 2b81037 + a22b4ca commit e05ee15
Showing 16 changed files with 4,432 additions and 2,425 deletions.
41 changes: 36 additions & 5 deletions CHANGELOG.md
@@ -2,6 +2,37 @@

![DIIVE](images/logo_diive1_256px.png)

## v0.80.0 | 28 Aug 2024

### Additions

- Added outlier tests to step-wise meteoscreening from database: `Hampel`, `HampelDaytimeNighttime` and `TrimLow` (`diive.pkgs.qaqc.meteoscreening.StepwiseMeteoScreeningDb`); a minimal sketch of the Hampel idea follows this list
- Added parameter to control whether to output the middle timestamp when loading parquet files with `load_parquet()`; by default, `output_middle_timestamp=True` (`diive.core.io.files.load_parquet`)
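
The `StepwiseMeteoScreeningDb` call signatures are not shown in this changelog. The following is a minimal, illustrative sketch of the Hampel idea only (flag values that deviate from a rolling median by more than `k` times the MAD; `0=OK`, `2=outlier` as elsewhere in diive), not the diive implementation:

```python
import numpy as np
import pandas as pd


def hampel_flag(series: pd.Series, window: int = 13, k: float = 3.0) -> pd.Series:
    """Vectorized approximation of a Hampel filter: flag values that deviate
    from a centered rolling median by more than k * MAD (0=OK, 2=outlier)."""
    med = series.rolling(window, center=True, min_periods=1).median()
    # 1.4826 scales the MAD to match a standard deviation for normal data
    mad = 1.4826 * (series - med).abs().rolling(window, center=True, min_periods=1).median()
    outlier = (series - med).abs() > k * mad
    return pd.Series(np.where(outlier, 2, 0), index=series.index)
```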

### Environment

- Re-created environment and created new `lock` file
- Currently using Python 3.9.19

### Notebooks

- Added new notebook for creating a flag that indicates missing values (`notebooks/OutlierDetection/MissingValues.ipynb`)
- Updated notebook for meteoscreening from database (`notebooks/MeteoScreening/StepwiseMeteoScreeningFromDatabase.ipynb`)
- Updated notebook for loading and saving parquet files (`notebooks/Formats/LoadSaveParquetFile.ipynb`)

### Tests

- Added unittest for flagging missing values (`tests.test_outlierdetection.TestOutlierDetection.test_missing_values`)
- 37/37 unittests ran successfully

### Bugfixes

- Fixed links in README: links to notebooks needed to be absolute
- Fixed issue with the list returned by `diive.pkgs.analyses.histogram.Histogram.peakbins`

## v0.79.1 | 26 Aug 2024

### Additions
@@ -500,25 +531,25 @@ multiple outlier tests into one single overall outlier flag.

## v0.72.0 | 25 Mar 2024

## New feature
### New feature

- Added new heatmap plotting class `HeatmapYearMonth` that allows plotting a variable in year/month classes (`diive.core.plotting.heatmap_datetime.HeatmapYearMonth`); a rough sketch of the idea follows below
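
As a rough illustration of the year/month aggregation behind such a plot (pandas plus matplotlib, not the `HeatmapYearMonth` class itself):

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_year_month_heatmap(series: pd.Series) -> None:
    """Show a datetime-indexed series as year (y) vs month (x) mean values.
    Illustrative sketch only."""
    grid = series.groupby([series.index.year, series.index.month]).mean().unstack()
    fig, ax = plt.subplots()
    im = ax.imshow(grid, aspect="auto")
    ax.set_xticks(range(grid.shape[1]), labels=grid.columns)
    ax.set_yticks(range(grid.shape[0]), labels=grid.index)
    ax.set_xlabel("Month")
    ax.set_ylabel("Year")
    fig.colorbar(im, ax=ax)
    plt.show()
```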

![DIIVE](images/plotHeatmapYearMonth_diive_v0.72.0.png)

## Changes
### Changes

- Refactored code for class `HeatmapDateTime` (`diive.core.plotting.heatmap_datetime.HeatmapDateTime`)
- Added new base class `HeatmapBase` for heatmap plots. Currently used by `HeatmapYearMonth`
and `HeatmapDateTime` (`diive.core.plotting.heatmap_base.HeatmapBase`)

## Notebooks
### Notebooks

- Added new notebook for `HeatmapDateTime` (`notebooks/Plotting/HeatmapDateTime.ipynb`)
- Added new notebook for `HeatmapYearMonth` (`notebooks/Plotting/HeatmapYearMonth.ipynb`)

## Bugfixes
### Bugfixes

- Fixed bug in `HeatmapDateTime` where the last record of each day was not shown

@@ -1318,7 +1349,7 @@ to `diive`. From now on, new example notebooks will be added regularly.

## v0.52.3 | 10 Mar 2023

## Additions
### Additions

- Added plotting library `bokeh` to dependencies

8 changes: 4 additions & 4 deletions README.md
@@ -116,9 +116,9 @@ _Create single outlier flags where `0=OK` and `2=outlier`._
- **Hampel filter daytime/nighttime**, separately for daytime and nighttime data ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/HampelDaytimeNighttime.ipynb))
- **Local standard deviation**: Identify outliers based on the local standard deviation from a running median ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/LocalSD.ipynb))
- **Local outlier factor**: Identify outliers based on local outlier factor, across all data ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/LocalSD.ipynb))
- **Local outlier factor daytime/nighttime**: Identify outliers based on local outlier factor, daytime nighttime separately ([notebook example](notebooks/OutlierDetection/LocalOutlierFactorDaytimeNighttime.ipynb))
- **Manual removal**: Remove time periods (from-to) or single records from time series ([notebook example](notebooks/OutlierDetection/ManualRemoval.ipynb))
- **Missing values**: Simply creates a flag that indicated available and missing data in a time series
- **Local outlier factor daytime/nighttime**: Identify outliers based on local outlier factor, daytime nighttime separately ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/LocalOutlierFactorDaytimeNighttime.ipynb))
- **Manual removal**: Remove time periods (from-to) or single records from time series ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/ManualRemoval.ipynb))
- **Missing values**: Simply creates a flag that indicates available and missing data in a time series ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/MissingValues.ipynb))
- **Trimming**: Remove values below a threshold and remove an equal number of records from the high end of the data ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/TrimLow.ipynb))
- **z-score**: Identify outliers based on the z-score across all time series data ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/zScore.ipynb))
- **z-score increments daytime/nighttime**: Identify outliers based on the z-score of double increments ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/zScoreIncremental.ipynb))
@@ -130,7 +130,7 @@ _Create single outlier flags where `0=OK` and `2=outlier`._
- **Diel cycle per month** ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/DielCycle.ipynb))
- **Heatmap date/time**: showing values (z) of time series as date (y) vs time (x) ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/HeatmapDateTime.ipynb))
- **Heatmap year/month**: showing values (z) of time series as year (y) vs month (x) ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/HeatmapYearMonth.ipynb))
- **Histogram**: includes options to show z-score limits and to highlight the peak distribution bin ([notebook example](notebooks/Plotting/Histogram.ipynb))
- **Histogram**: includes options to show z-score limits and to highlight the peak distribution bin ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/Histogram.ipynb))
- **Long-term anomalies**: calculate and plot long-term anomaly for a variable, per year, compared to a reference period. ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/LongTermAnomalies.ipynb))
- **Time series plot**: Simple (interactive) time series plot ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/TimeSeries.ipynb))
- **ScatterXY plot** ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/ScatterXY.ipynb))
6 changes: 4 additions & 2 deletions diive/core/io/files.py
@@ -42,13 +42,15 @@ def save_parquet(filename: str, data: DataFrame or Series, outpath: str or None
return str(filepath)


def load_parquet(filepath: str or Path) -> DataFrame:
def load_parquet(filepath: str or Path, output_middle_timestamp: bool = True) -> DataFrame:
"""
Load data from Parquet file to pandas DataFrame
Args:
filepath: str
filepath to parquet file
output_middle_timestamp: Converts the timestamp to show the middle
of the averaging interval.
Returns:
pandas DataFrame, data from Parquet file as pandas DataFrame
@@ -57,7 +59,7 @@ def load_parquet(filepath: str or Path) -> DataFrame:
df = read_parquet(filepath)
toc = time.time() - tic
# Check timestamp, also detects frequency of time series, this info was lost when saving to the parquet file
df = TimestampSanitizer(data=df).get()
df = TimestampSanitizer(data=df, output_middle_timestamp=output_middle_timestamp).get()
print(f"Loaded .parquet file {filepath} ({toc:.3f} seconds). "
f"Detected time resolution of {df.index.freq} / {df.index.freqstr} ")
return df
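
A usage sketch for the new parameter (the file name is hypothetical). Assuming a 30-minute series stored with end-of-interval timestamps, the default shifts each timestamp to the middle of its averaging interval, e.g. 10:30 becomes 10:15:

```python
from diive.core.io.files import load_parquet

# Default: timestamps are converted to the middle of the averaging interval,
# e.g. a 30-min record stamped 10:30 (interval end) is shown as 10:15.
df = load_parquet("data.parquet")

# Assumed behavior of False: skip the conversion, keep timestamps as stored.
df_raw = load_parquet("data.parquet", output_middle_timestamp=False)
```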
2 changes: 1 addition & 1 deletion diive/pkgs/analyses/histogram.py
@@ -83,7 +83,7 @@ def peakbins(self):
"""Returns the five bins with the most counts in decreasing order"""
ix_maxcounts = self.results['COUNTS'].sort_values(ascending=False).head(5).index
peakbins = self.results['BIN_START_INCL'].iloc[ix_maxcounts]
return list(peakbins.values)
return peakbins.values.tolist()
# Find bin with maximum counts
# idx_maxcounts = self.results['COUNTS'].idxmax()
# return self.results['BIN_START_INCL'].iloc[idx_maxcounts]
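
The switch from `list(peakbins.values)` to `peakbins.values.tolist()` is presumably about element types: `tolist()` converts NumPy scalars to native Python numbers, whereas `list()` keeps NumPy scalar objects. A small demonstration:

```python
import numpy as np

arr = np.array([1.5, 2.5])
print(type(list(arr)[0]))     # <class 'numpy.float64'>
print(type(arr.tolist()[0]))  # <class 'float'>
```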
138 changes: 78 additions & 60 deletions diive/pkgs/gapfilling/xgboost_ts.py
@@ -18,7 +18,6 @@
- XXX
"""
import numpy as np
import xgboost as xgb
from pandas import DataFrame

@@ -114,7 +113,7 @@ def example_xgbts():
# subsetcols = [TARGET_COL, 'Tair_f', 'VPD_f', 'Rg_f', 'SWC_FF0_0.15_1', 'PPFD']

# Example data
from diive.configs.exampledata import load_exampledata_parquet, load_exampledata_parquet_long
from diive.configs.exampledata import load_exampledata_parquet_long
df_orig = load_exampledata_parquet_long()

# # Create a large gap
@@ -123,14 +122,12 @@ def example_xgbts():
# df = df[remove].copy()

# Subset
# keep = (df_orig.index.year >= 1997) & (df_orig.index.year <= 2001)
# df = df_orig[keep].copy()
df = df_orig.copy()


keep = (df_orig.index.year >= 1997) & (df_orig.index.year <= 2001)
df = df_orig[keep].copy()
# df = df_orig.copy()

# Checking nighttime
nt_locs = df['Rg_f'] < 50
# nt_locs = df['Rg_f'] < 50
# nt = df[nt_locs].groupby(df[nt_locs].index.year).agg(['mean'])
# means_nt = nt[TARGET]['mean']
# # import matplotlib.pyplot as plt
@@ -140,16 +137,16 @@ def example_xgbts():
# mean_nt_0613 = means_nt.loc[2006:2013].mean()
# corr_nt = mean_nt_0613 / mean_nt_9704

# corr_nt = 100
# corr_nt = 1.19
corr_nt = 0.7759670068746911
corr_df = df[['Rg_f', 'NEE_CUT_REF_orig']].copy()
corr_df['gain'] = 1
# nt_locs_9704 = (df.index.year >= 1997) & (df.index.year <= 2004)
nt_locs_9704 = (df.index.year >= 1997) & (df.index.year <= 2004) & (df['Rg_f'] < 50)
corr_df.loc[nt_locs_9704, 'gain'] = corr_nt
corr_df['NEE_CUT_REF_orig'] = corr_df['NEE_CUT_REF_orig'].multiply(corr_df['gain'])
df[TARGET_COL] = corr_df[TARGET_COL].copy()
# # corr_nt = 100
# # corr_nt = 1.19
# corr_nt = 0.7759670068746911
# corr_df = df[['Rg_f', 'NEE_CUT_REF_orig']].copy()
# corr_df['gain'] = 1
# # nt_locs_9704 = (df.index.year >= 1997) & (df.index.year <= 2004)
# nt_locs_9704 = (df.index.year >= 1997) & (df.index.year <= 2004) & (df['Rg_f'] < 50)
# corr_df.loc[nt_locs_9704, 'gain'] = corr_nt
# corr_df['NEE_CUT_REF_orig'] = corr_df['NEE_CUT_REF_orig'].multiply(corr_df['gain'])
# df[TARGET_COL] = corr_df[TARGET_COL].copy()

# df[nt_locs].groupby(df[nt_locs].index.year).agg(['mean'])['NEE_CUT_REF_orig']
# df[~nt_locs].groupby(df[~nt_locs].index.year).agg(['mean'])['NEE_CUT_REF_orig']
@@ -262,26 +259,26 @@ def example_xgbts():
# xgbts.reduce_features()
# xgbts.report_feature_reduction()

xgbts.trainmodel(showplot_scores=False, showplot_importance=False)
xgbts.report_traintest()

xgbts.fillgaps(showplot_scores=False, showplot_importance=False)
xgbts.report_gapfilling()

observed = df[TARGET_COL]
gapfilled = xgbts.get_gapfilled_target()

frame = {
nee_mds.name: nee_mds,
gapfilled.name: gapfilled,
}
import pandas as pd
checkdf = pd.DataFrame.from_dict(frame, orient='columns')
checkdf = checkdf.groupby(checkdf.index.year).agg('sum')
checkdf['diff'] = checkdf[gapfilled.name].subtract(checkdf[nee_mds.name])
checkdf = checkdf.multiply(0.02161926)
print(checkdf)
print(checkdf.sum())
# xgbts.trainmodel(showplot_scores=False, showplot_importance=False)
# xgbts.report_traintest()
#
# xgbts.fillgaps(showplot_scores=False, showplot_importance=False)
# xgbts.report_gapfilling()

# observed = df[TARGET_COL]
# gapfilled = xgbts.get_gapfilled_target()

# frame = {
# nee_mds.name: nee_mds,
# gapfilled.name: gapfilled,
# }
# import pandas as pd
# checkdf = pd.DataFrame.from_dict(frame, orient='columns')
# checkdf = checkdf.groupby(checkdf.index.year).agg('sum')
# checkdf['diff'] = checkdf[gapfilled.name].subtract(checkdf[nee_mds.name])
# checkdf = checkdf.multiply(0.02161926)
# print(checkdf)
# print(checkdf.sum())

# rfts.feature_importances
# rfts.scores
@@ -304,15 +301,15 @@ def example_xgbts():
# f"\nMIN_SAMPLES_SPLIT: {MIN_SAMPLES_SPLIT} "
# f"/ MIN_SAMPLES_LEAF: {MIN_SAMPLES_LEAF} "
# )
title="title"
from diive.core.plotting.timeseries import TimeSeries
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
TimeSeries(series=gapfilled.multiply(0.02161926).cumsum(), ax=ax).plot(color='blue')
TimeSeries(series=nee_mds.multiply(0.02161926).cumsum(), ax=ax).plot(color='orange')
fig.suptitle(f'{title}', fontsize=16)
# ax.set_ylim(-2000, 200)
fig.show()
# title = "title"
# from diive.core.plotting.timeseries import TimeSeries
# import matplotlib.pyplot as plt
# fig, ax = plt.subplots()
# TimeSeries(series=gapfilled.multiply(0.02161926).cumsum(), ax=ax).plot(color='blue')
# TimeSeries(series=nee_mds.multiply(0.02161926).cumsum(), ax=ax).plot(color='orange')
# fig.suptitle(f'{title}', fontsize=16)
# # ax.set_ylim(-2000, 200)
# fig.show()

# from diive.core.plotting.heatmap_datetime import HeatmapDateTime
# HeatmapDateTime(series=observed).show()
@@ -333,19 +330,19 @@ def example_xgbts():
# # plt.legend()
# plt.show()

from diive.core.plotting.cumulative import CumulativeYear
CumulativeYear(
series=gapfilled.multiply(0.02161926),
series_units="units",
yearly_end_date=None,
# yearly_end_date='08-11',
start_year=1997,
end_year=2022,
show_reference=True,
excl_years_from_reference=None,
# excl_years_from_reference=[2022],
# highlight_year=2022,
highlight_year_color='#F44336').plot(digits_after_comma=0)
# from diive.core.plotting.cumulative import CumulativeYear
# CumulativeYear(
# series=gapfilled.multiply(0.02161926),
# series_units="units",
# yearly_end_date=None,
# # yearly_end_date='08-11',
# start_year=1997,
# end_year=2022,
# show_reference=True,
# excl_years_from_reference=None,
# # excl_years_from_reference=[2022],
# # highlight_year=2022,
# highlight_year_color='#F44336').plot(digits_after_comma=0)
# CumulativeYear(
# series=nee_mds.multiply(0.02161926),
# series_units="units",
@@ -369,6 +366,27 @@ def example_xgbts():
# each_month=True, legend_n_col=2, ylim=[-0.4, 0.2])
# # d = dc.get_data()

from yellowbrick.model_selection import ValidationCurve, validation_curve
import numpy as np
# viz = ValidationCurve(
# xgbts.model_, param_name="max_depth",
# param_range=np.arange(3, 10), cv=10, scoring="r2"
# )
y = df[TARGET_COL]
X = df[['Tair_f', 'VPD_f', 'Rg_f']]

viz = validation_curve(
xgbts.model_, X, y, param_name="n_estimators",
param_range=np.arange(10, 20), cv=10, scoring="r2",
)

# Fit and show the visualizer
viz.fit(X, y)
viz.show()

# from yellowbrick.datasets import load_energy
# x,y = load_energy()


print("Finished.")

3 changes: 2 additions & 1 deletion diive/pkgs/outlierdetection/stepwiseoutlierdetection.py
@@ -169,7 +169,8 @@ def flag_outliers_increments_zcore_test(self, thres_zscore: int = 30, showplot:

def flag_outliers_trim_low_test(self, trim_daytime: bool = False, trim_nighttime: bool = False,
lower_limit: float = None, showplot: bool = False, verbose: bool = False):
"""XXX"""
"""Flag values below a given absolute limit as outliers, then flag an
equal number of datapoints at the high end as outliers."""
series_cleaned = self._series_hires_cleaned.copy()
flagtest = TrimLow(series=series_cleaned, idstr=self.idstr,
trim_daytime=trim_daytime, trim_nighttime=trim_nighttime,
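
A minimal sketch of the trimming logic the new docstring describes (not the `TrimLow` implementation itself): flag everything below `lower_limit`, then flag an equal number of the highest values:

```python
import pandas as pd


def trim_low_flag(series: pd.Series, lower_limit: float) -> pd.Series:
    """Flag values below lower_limit and an equal number of the highest
    values (0=OK, 2=outlier). Sketch only, not the diive implementation."""
    flag = pd.Series(0, index=series.index)
    too_low = series < lower_limit
    flag[too_low] = 2
    n_low = int(too_low.sum())
    if n_low > 0:
        # Flag the same number of records at the high end of the data
        flag[series.nlargest(n_low).index] = 2
    return flag
```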
11 changes: 2 additions & 9 deletions diive/pkgs/qaqc/flags.py
@@ -63,17 +63,10 @@ def __init__(self, series: Series, idstr: str = None, verbose: bool = False):
super().__init__(series=series, flagid=self.flagid, idstr=idstr)
self.verbose = verbose

def calc(self, repeat: bool = False):
"""Calculate overall flag, based on individual flags from multiple iterations.
Args:
repeat: If *True*, the outlier detection is repeated until all
outliers are removed.
"""
def calc(self, repeat=False):
self._overall_flag, n_iterations = self.repeat(self.run_flagtests, repeat=False)
# if self.showplot:
# self.defaultplot(n_iterations=n_iterations)
# self.defaultplot(n_iterations=1)

if self.verbose:
print(f"MISSING VALUES TEST: Generated new flag variable {self.overall_flag.name}, "
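
The core of the missing-values test is simple; a sketch under the `0=OK` / `2=outlier` flag convention used throughout diive (the flag name below is hypothetical, not the class's actual output name):

```python
import numpy as np
import pandas as pd


def missing_values_flag(series: pd.Series) -> pd.Series:
    """Return a flag that is 0 where a value is available and 2 where it is
    missing. Illustrative sketch only."""
    return pd.Series(np.where(series.isna(), 2, 0), index=series.index,
                     name=f"FLAG_{series.name}_MISSING_TEST")  # hypothetical name
```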