Merge pull request #205 from holukas/indev

Indev

holukas committed Sep 11, 2024
2 parents e05ee15 + 47fe57c commit 0a9f1d7
Showing 33 changed files with 6,841 additions and 4,825 deletions.
101 changes: 92 additions & 9 deletions CHANGELOG.md
@@ -2,6 +2,88 @@

![DIIVE](images/logo_diive1_256px.png)

## v0.81.0 | 11 Sep 2024

### Expanding Flux Processing Capabilities

This update expands post-processing of eddy covariance data in the context of the `FluxProcessingChain`.
The goal is to offer a complete chain for post-processing ecosystem flux data, specifically designed to work
seamlessly with the standardized `_fluxnet` output file of the
widely used [EddyPro](https://www.licor.com/env/products/eddy-covariance/eddypro) software.

diive now offers USTAR filtering based on *known* constant thresholds across the entire dataset (similar
to the `CUT` scenarios in FLUXNET data). Seasonal (DJF, MAM, JJA, SON) thresholds are calculated internally,
but applying them on a seasonal basis or using variable thresholds per year (like FLUXNET's `VUT` scenarios)
is not yet implemented.
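
For illustration, flagging by a constant USTAR threshold reduces to a simple comparison. The following is a
minimal sketch, not diive's API; the column names and the example threshold are assumptions:

```python
import pandas as pd

def flag_constant_ustar(ustar: pd.Series, threshold: float) -> pd.Series:
    """Sketch of constant-threshold USTAR filtering: flag 0 = OK, 2 = low turbulence.

    Illustrates the concept only; diive's classes add scenario handling and
    integration with the flux processing chain.
    """
    flag = pd.Series(0, index=ustar.index, name=f"FLAG_USTAR_THRES_{threshold}")
    flag.loc[ustar < threshold] = 2
    return flag

# Hypothetical usage: keep NEE only where turbulence was sufficient
# nee_filtered = df["NEE"].where(flag_constant_ustar(df["USTAR"], threshold=0.12) == 0)
```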

With this update, the `FluxProcessingChain` class handles the following processing steps (a usage sketch
follows the list):

- Level-2: Quality flag expansion
- Level-3.1: Storage correction
- Level-3.2: Outlier removal
- Level-3.3: (new) USTAR filtering (with constant thresholds for now)
- (upcoming) Level-4.1: long-term gap-filling using random forest and XGBoost
- For info about the different flux levels
see [Swiss FluxNet flux processing chain](https://www.swissfluxnet.ethz.ch/index.php/data/ecosystem-fluxes/flux-processing-chain/)
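
A minimal usage sketch of the chain is shown below. Only `level32_flag_outliers_localsd_test` and
`level33_constant_ustar` are named in this release; the constructor arguments, the Level-2/Level-3.1 method
names and all parameter names are assumptions:

```python
# Hedged sketch, not a verified recipe: everything marked "assumed" is not
# confirmed by this changelog entry.
from diive.pkgs.fluxprocessingchain.fluxprocessingchain import FluxProcessingChain

# df: DataFrame loaded from an EddyPro _fluxnet output file (assumed)
fpc = FluxProcessingChain(maindf=df, fluxcol="FC")  # constructor args assumed
fpc.level2_quality_flag_expansion()  # Level-2 (method name assumed)
fpc.level31_storage_correction()  # Level-3.1 (method name assumed)
fpc.level32_flag_outliers_localsd_test(n_sd=4, winsize=480, constant_sd=True)  # params assumed
fpc.level33_constant_ustar(thresholds=[0.05, 0.10, 0.15])  # params assumed
```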

### New features

- Added class to apply multiple known constant USTAR (friction velocity) thresholds, creating flags that indicate
  time periods of low turbulence for multiple USTAR scenarios. The constant thresholds must be known
  beforehand, e.g., from an earlier USTAR detection run or from FLUXNET results
  (`diive.pkgs.flux.ustarthreshold.FlagMultipleConstantUstarThresholds`)
- Added class to apply a single known constant USTAR threshold
  (`diive.pkgs.flux.ustarthreshold.FlagSingleConstantUstarThreshold`)
- Added `FlagMultipleConstantUstarThresholds` to the flux processing chain (
`diive.pkgs.fluxprocessingchain.fluxprocessingchain.FluxProcessingChain.level33_constant_ustar`)
- Added USTAR detection algorithm based on Papale et al., 2006 (`diive.pkgs.flux.ustarthreshold.UstarDetectionMPT`)
- Added function to analyze high-quality ecosystem fluxes, which helps in understanding the range of the
  highest-quality data (`diive.pkgs.flux.hqflux.analyze_highest_quality_flux`)
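
For context, the core criterion of the Papale et al. (2006) moving-point test can be sketched as follows.
This simplified version works on nighttime data in a single temperature class and omits the temperature
stratification and bootstrapping that `UstarDetectionMPT` would need for the full method:

```python
import pandas as pd

def mpt_threshold_sketch(nee: pd.Series, ustar: pd.Series,
                         n_classes: int = 20, ratio: float = 0.99) -> float:
    """Simplified USTAR threshold criterion after Papale et al. (2006).

    Nighttime NEE is binned into USTAR classes; the threshold is the first
    class whose mean NEE reaches `ratio` of the mean NEE of all higher classes.
    """
    classes = pd.qcut(ustar, q=n_classes, duplicates="drop")
    nee_means = nee.groupby(classes, observed=True).mean()
    ustar_means = ustar.groupby(classes, observed=True).mean()
    for i in range(len(nee_means) - 1):
        if nee_means.iloc[i] >= ratio * nee_means.iloc[i + 1:].mean():
            return float(ustar_means.iloc[i])
    return float(ustar_means.iloc[-1])  # fall back to the highest class
```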

### Additions

- `LocalSD` outlier detection can now use a constant SD:
    - Added parameter to use the standard deviation across all data (constant) instead of the rolling SD when
      calculating the upper and lower limits that define outliers around the rolling median
      (`diive.pkgs.outlierdetection.localsd.LocalSD`)
- Added to step-wise outlier detection (
`diive.pkgs.outlierdetection.stepwiseoutlierdetection.StepwiseOutlierDetection.flag_outliers_localsd_test`)
- Added to meteoscreening from database (
`diive.pkgs.qaqc.meteoscreening.StepwiseMeteoScreeningDb.flag_outliers_localsd_test`)
- Added to flux processing chain (
`diive.pkgs.fluxprocessingchain.fluxprocessingchain.FluxProcessingChain.level32_flag_outliers_localsd_test`)
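
The difference between the rolling and the new constant variant can be sketched like this (illustrative
only; parameter names are assumptions):

```python
import pandas as pd

def localsd_limits(series: pd.Series, winsize: int = 480, n_sd: float = 4.0,
                   constant_sd: bool = False) -> tuple[pd.Series, pd.Series]:
    """Sketch of the LocalSD idea: upper/lower outlier limits around the rolling median.

    With constant_sd=True, the SD of the whole series replaces the rolling SD.
    """
    rmedian = series.rolling(winsize, center=True, min_periods=3).median()
    if constant_sd:
        sd = series.std()  # one SD across all data
    else:
        sd = series.rolling(winsize, center=True, min_periods=3).std()
    return rmedian + n_sd * sd, rmedian - n_sd * sd

# Values outside the limits would be flagged 2 (outlier), values inside 0 (OK).
```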

### Changes

- Replaced Matplotlib's `.plot_date()` with `.plot()`, since `.plot_date()` is deprecated
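
The replacement pattern, also visible in the diffs below, calls `.plot()` with an explicit marker and no
line style:

```python
import matplotlib.pyplot as plt
import pandas as pd

s = pd.Series([1.0, 2.5, 1.8], index=pd.date_range("2024-09-01", periods=3, freq="D"))
fig, ax = plt.subplots()
# Deprecated: ax.plot_date(s.index, s, fmt="o")
ax.plot(s.index, s, marker="o", linestyle="none")  # pattern used in this release
plt.show()
```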

### Notebooks

- Added notebook for plotting cumulative sums per year (`notebooks/Plotting/CumulativesPerYear.ipynb`)
- Added notebook for removing outliers based on the z-score in a rolling time window (
  `notebooks/OutlierDetection/zScoreRolling.ipynb`)
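
The method behind `zScoreRolling.ipynb` follows the usual rolling z-score pattern; a minimal sketch with
assumed defaults:

```python
import pandas as pd

def flag_zscore_rolling(series: pd.Series, winsize: int = 48,
                        threshold: float = 4.0) -> pd.Series:
    """Flag values whose z-score in a rolling window exceeds threshold (0 = OK, 2 = outlier)."""
    rmean = series.rolling(winsize, center=True, min_periods=3).mean()
    rsd = series.rolling(winsize, center=True, min_periods=3).std()
    zscore = (series - rmean) / rsd
    flag = pd.Series(0, index=series.index, name="FLAG_ZSCORE_ROLLING")
    flag.loc[zscore.abs() > threshold] = 2
    return flag
```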

### Bugfixes

- Fixed bug when saving a pandas Series to parquet (`diive.core.io.files.save_parquet`)
- Fixed bug when plotting `doy_mean_cumulative`: no longer crashes when years defined in parameter
  `excl_years_from_reference` are not in the dataset (`diive.core.times.times.doy_mean_cumulative`);
  a sketch of such a guard follows this list
- Fixed deprecation warning when plotting in `bokeh` (interactive plots)
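
A guard of the following kind prevents the `excl_years_from_reference` crash; this is a hypothetical
sketch, and the actual fix in `doy_mean_cumulative` may differ:

```python
# Hypothetical guard: ignore exclusion years that are absent from the data
# instead of crashing; the actual fix in diive may differ.
years_available = set(series.index.year)
excl_years = [y for y in excl_years_from_reference if y in years_available]
```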

### Tests

- Added unittest for `LocalSD` using constant SD (
`tests.test_outlierdetection.TestOutlierDetection.test_localsd_with_constantsd`)
- Added unittest for rolling z-score outlier removal (
`tests.test_outlierdetection.TestOutlierDetection.test_zscore_rolling`)
- Improved check whether figure and axis were created (`tests.test_plots.TestPlots.test_histogram`)
- 39/39 unittests ran successfully

### Environment

- Added new package `scikit-optimize`
- Added new package `category_encoders`

## v0.80.0 | 28 Aug 2024

### Additions
@@ -18,7 +100,8 @@

### Notebooks

- Added new notebook for creating a flag that indicates missing values (
  `notebooks/OutlierDetection/MissingValues.ipynb`)
- Updated notebook for meteoscreening from database (
`notebooks/MeteoScreening/StepwiseMeteoScreeningFromDatabase.ipynb`)
- Updated notebook for loading and saving parquet files (`notebooks/Formats/LoadSaveParquetFile.ipynb`)
@@ -2048,17 +2131,17 @@ which allows the calculation of the flux detection limit following Langford et al.

- None

#### **REFERENCES**

Langford, B., Acton, W., Ammann, C., Valach, A., & Nemitz, E. (2015). Eddy-covariance data with low signal-to-noise
ratio: Time-lag determination, uncertainties and limit of detection. Atmospheric Measurement Techniques, 8(10),
4197–4213. https://doi.org/10.5194/amt-8-4197-2015

# References
## **REFERENCES**

- Hollinger, D. Y., & Richardson, A. D. (2005). Uncertainty in eddy covariance measurements
and its application to physiological models. Tree Physiology, 25(7),
873–885. https://doi.org/10.1093/treephys/25.7.873
- Langford, B., Acton, W., Ammann, C., Valach, A., & Nemitz, E. (2015). Eddy-covariance data with low signal-to-noise
ratio: Time-lag determination, uncertainties and limit of detection. Atmospheric Measurement Techniques, 8(10),
4197–4213. https://doi.org/10.5194/amt-8-4197-2015
- Papale, D., Reichstein, M., Aubinet, M., Canfora, E., Bernhofer, C., Kutsch, W., Longdoz, B., Rambal, S., Valentini,
R., Vesala, T., & Yakir, D. (2006). Towards a standardized processing of Net Ecosystem Exchange measured with eddy
covariance technique: Algorithms and uncertainty estimation. Biogeosciences, 3(4),
571–583. https://doi.org/10.5194/bg-3-571-2006
- Pastorello, G., et al. (2020). The FLUXNET2015 dataset and the ONEFlux processing pipeline
  for eddy covariance data. Scientific Data, 7, 225. https://doi.org/10.1038/s41597-020-0534-3

5 changes: 4 additions & 1 deletion README.md
@@ -6,6 +6,8 @@

[![DOI](https://zenodo.org/badge/708559210.svg)](https://zenodo.org/doi/10.5281/zenodo.10884017)

*`diive` is currently under active development with frequent updates.*

# Time series data processing

`diive` is a Python library for time series processing, in particular ecosystem data. Originally developed
@@ -123,10 +125,11 @@ _Create single outlier flags where `0=OK` and `2=outlier`._
- **z-score**: Identify outliers based on the z-score across all time series data ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/zScore.ipynb))
- **z-score increments daytime/nighttime**: Identify outliers based on the z-score of double increments ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/zScoreIncremental.ipynb))
- **z-score daytime/nighttime**: Identify outliers based on the z-score, separately for daytime and nighttime ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/zScoreDaytimeNighttime.ipynb))
- **z-score rolling**: Identify outliers based on the rolling z-score
- **z-score rolling**: Identify outliers based on the rolling z-score ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/zScoreRolling.ipynb))

### Plotting

- **Cumulatives per year** ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/CumulativesPerYear.ipynb))
- **Diel cycle per month** ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/DielCycle.ipynb))
- **Heatmap date/time**: showing values (z) of time series as date (y) vs time (x) ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/HeatmapDateTime.ipynb))
- **Heatmap year/month**: showing values (z) of time series as year (y) vs month (x) ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/HeatmapYearMonth.ipynb))
63 changes: 32 additions & 31 deletions diive/core/base/flagbase.py

@@ -165,19 +165,19 @@ def defaultplot(self, n_iterations: int = 1):
ax_ok = fig.add_subplot(gs[1, 0], sharex=ax_series)
ax_ok_hist = fig.add_subplot(gs[1, 1])

ax_series.plot_date(self.series.index, self.series,
label=f"{self.series.name}", color="#607D8B",
alpha=.5, markersize=8, markeredgecolor='none')
ax_series.plot_date(self.series[rejected].index, self.series[rejected],
label="outlier (rejected)", color="#F44336", alpha=1,
markersize=12, markeredgecolor='none', fmt='X')
ax_series.plot(self.series.index, self.series,
label=f"{self.series.name}", color="#607D8B", linestyle='none', markeredgewidth=1,
marker='o', alpha=.5, markersize=6, markeredgecolor="#607D8B", fillstyle='none')
ax_series.plot(self.series[rejected].index, self.series[rejected],
label="outlier (rejected)", color="#F44336", alpha=1, linestyle='none',
markersize=12, markeredgecolor='none', marker='X')
hist_kwargs = dict(method='n_bins', n_bins=None, highlight_peak=True, show_zscores=True, show_info=False,
show_title=False, show_zscore_values=False, show_grid=False)
HistogramPlot(self.series, **hist_kwargs).plot(ax=ax_series_hist)

ax_ok.plot_date(self.series[ok].index, self.series[ok],
label="filtered series", alpha=.5,
markersize=8, markeredgecolor='none')
ax_ok.plot(self.series[ok].index, self.series[ok],
label="filtered series", alpha=.5, linestyle='none', markeredgewidth=1,
marker='o', markersize=6, markeredgecolor="#607D8B", fillstyle='none')
HistogramPlot(self.series[ok], **hist_kwargs).plot(ax=ax_ok_hist)

default_format(ax=ax_series)
@@ -220,56 +220,57 @@ def plot_outlier_daytime_nighttime(self, series: Series, flag_daytime: Series,
fig.suptitle(title, fontsize=24, fontweight='bold')

ax_series = fig.add_subplot(gs[0, 0])
ax_series.xaxis.axis_date()
ax_series_hist = fig.add_subplot(gs[0, 1])
ax_cleaned = fig.add_subplot(gs[0, 2], sharex=ax_series)
ax_cleaned.xaxis.axis_date()
ax_cleaned_hist = fig.add_subplot(gs[0, 3])

ax_series_dt = fig.add_subplot(gs[1, 0])
ax_series_dt.xaxis.axis_date()
ax_series_dt_hist = fig.add_subplot(gs[1, 1])
ax_cleaned_dt = fig.add_subplot(gs[1, 2], sharex=ax_series)
ax_cleaned_dt.xaxis.axis_date()
ax_cleaned_dt_hist = fig.add_subplot(gs[1, 3])

ax_series_nt = fig.add_subplot(gs[2, 0], sharex=ax_series)
ax_series_nt.xaxis.axis_date()
ax_series_nt_hist = fig.add_subplot(gs[2, 1])
ax_cleaned_nt = fig.add_subplot(gs[2, 2], sharex=ax_series)
ax_cleaned_nt.xaxis.axis_date()
ax_cleaned_nt_hist = fig.add_subplot(gs[2, 3])

axes_series = [ax_series, ax_cleaned, ax_series_dt, ax_cleaned_dt, ax_series_nt, ax_cleaned_nt]
axes_hist = [ax_series_hist, ax_cleaned_hist, ax_series_dt_hist,
ax_cleaned_dt_hist, ax_series_nt_hist, ax_cleaned_nt_hist]
hist_kwargs = dict(method='n_bins', n_bins=None, highlight_peak=True, show_zscores=True, show_info=False,
show_title=False, show_zscore_values=False, show_grid=False)
series_kwargs = dict(x=df.index, fmt='o', mec='none', alpha=.2, color='black')
series_kwargs = dict(marker='o', mec='black', markeredgewidth=1, alpha=.2, fillstyle='none', linestyle='none')

# Column 0
ax_series.plot_date(
y=df['CLEANED'], label=f"OK ({df['CLEANED'].count()} values)", **series_kwargs)
ax_series.plot_date(
x=df.index, y=df['OUTLIER'], fmt='X', ms=10, mec='none',
alpha=.9, color='red', label=f"outlier ({df['OUTLIER'].count()} values)")
ax_series_dt.plot_date(
y=df['UNFILTERED_DT'], label=f"series ({df['UNFILTERED_DT'].count()} values)", **series_kwargs)
ax_series_dt.plot_date(
x=df.index, y=df['OUTLIER_DT'], fmt='X', ms=10, mec='none',
alpha=.9, color='red', label=f"outlier ({df['OUTLIER_DT'].count()} values)")
ax_series_nt.plot_date(
y=df['UNFILTERED_NT'], label=f"series ({df['UNFILTERED_NT'].count()} values)", **series_kwargs)
ax_series_nt.plot_date(
x=df.index, y=df['OUTLIER_NT'], fmt='X', ms=10, mec='none',
alpha=.9, color='red', label=f"outlier ({df['OUTLIER_NT'].count()} values)")
ax_series.plot(df.index, df['CLEANED'], label=f"OK ({df['CLEANED'].count()} values)", **series_kwargs)
ax_series.plot(df.index, df['OUTLIER'], marker='X', ms=10, mec='none', linestyle='none',
alpha=.9, color='red', label=f"outlier ({df['OUTLIER'].count()} values)")
ax_series_dt.plot(df.index, df['UNFILTERED_DT'], label=f"series ({df['UNFILTERED_DT'].count()} values)",
**series_kwargs)
ax_series_dt.plot(df.index, df['OUTLIER_DT'], marker='X', ms=10, mec='none', linestyle='none',
alpha=.9, color='red', label=f"outlier ({df['OUTLIER_DT'].count()} values)")
ax_series_nt.plot(df.index, df['UNFILTERED_NT'], label=f"series ({df['UNFILTERED_NT'].count()} values)",
**series_kwargs)
ax_series_nt.plot(df.index, df['OUTLIER_NT'], marker='X', ms=10, mec='none', linestyle='none',
alpha=.9, color='red', label=f"outlier ({df['OUTLIER_NT'].count()} values)")

# Column 1
HistogramPlot(s=df['UNFILTERED'], **hist_kwargs).plot(ax=ax_series_hist)
HistogramPlot(s=df['UNFILTERED_DT'], **hist_kwargs).plot(ax=ax_series_dt_hist)
HistogramPlot(s=df['UNFILTERED_NT'], **hist_kwargs).plot(ax=ax_series_nt_hist)

# Column 2
ax_cleaned.plot_date(
y=df['CLEANED'], label=f"cleaned ({df['CLEANED'].count()} values)", **series_kwargs)
ax_cleaned_dt.plot_date(
y=df['CLEANED_DT'], label=f"cleaned daytime ({df['CLEANED_DT'].count()} values)", **series_kwargs)
ax_cleaned_nt.plot_date(
y=df['CLEANED_NT'], label=f"cleaned nighttime ({df['CLEANED_NT'].count()} values)", **series_kwargs)
ax_cleaned.plot(df.index, df['CLEANED'], label=f"cleaned ({df['CLEANED'].count()} values)", **series_kwargs)
ax_cleaned_dt.plot(df.index, df['CLEANED_DT'], label=f"cleaned daytime ({df['CLEANED_DT'].count()} values)",
**series_kwargs)
ax_cleaned_nt.plot(df.index, df['CLEANED_NT'], label=f"cleaned nighttime ({df['CLEANED_NT'].count()} values)",
**series_kwargs)

# Column 3
HistogramPlot(s=df['CLEANED'], **hist_kwargs).plot(ax=ax_cleaned_hist)
2 changes: 2 additions & 0 deletions diive/core/io/files.py

@@ -36,6 +36,8 @@ def save_parquet(filename: str, data: DataFrame or Series, outpath: str or None
"""
filepath = set_outpath(outpath=outpath, filename=filename, fileextension='parquet')
tic = time.time()
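# Parquet stores tabular data, so a Series is converted to a one-column DataFrame before writing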
if isinstance(data, Series):
data = data.to_frame()
data.to_parquet(filepath)
toc = time.time() - tic
print(f"Saved file {filepath} ({toc:.3f} seconds).")
51 changes: 16 additions & 35 deletions diive/core/plotting/cumulative.py

@@ -60,6 +60,7 @@ def __init__(self,

# Create axis
self.fig, self.ax = pf.create_ax()
self.ax.xaxis.axis_date()

def _add_reference(self, digits_after_comma):

@@ -69,12 +70,12 @@

# label = f"{year}: {cumulative_df[year].dropna().iloc[-1]:.2f}"
mean_end = self.mean_doy_cumulative_df['MEAN_DOY_TIME'].iloc[-1]
self.ax.plot_date(x=self.mean_doy_cumulative_df.index.values,
y=self.mean_doy_cumulative_df['MEAN_DOY_TIME'].values,
color='black', alpha=1,
ls='-', lw=theme.WIDTH_LINE_WIDER,
marker='', markeredgecolor='none', ms=0,
zorder=99, label=f'mean {mean_end:.{digits_after_comma}f}')
self.ax.plot(self.mean_doy_cumulative_df.index.values,
self.mean_doy_cumulative_df['MEAN_DOY_TIME'].values,
color='black', alpha=1,
ls='-', lw=theme.WIDTH_LINE_WIDER,
marker='', markeredgecolor='none', ms=0,
zorder=99, label=f'mean {mean_end:.{digits_after_comma}f}')
# self.ax.fill_between(mean_cumulative_df.index.values,
# mean_cumulative_df['MEAN+1.96_SD'].values,
# mean_cumulative_df['MEAN-1.96_SD'].values,
@@ -125,12 +126,12 @@ def plot(self, showplot: bool = True, digits_after_comma: int = 2):
lw = theme.WIDTH_LINE_WIDER if year == self.highlight_year else theme.WIDTH_LINE_DEFAULT
color = self.highlight_year_color if year == self.highlight_year else color_list[ix]

self.ax.plot_date(x=self.cumulatives_per_year_df.index,
y=self.cumulatives_per_year_df[year],
color=color, alpha=1,
ls='-', lw=lw,
marker='', markeredgecolor='none', ms=0,
zorder=99, label=label)
self.ax.plot(self.cumulatives_per_year_df.index,
self.cumulatives_per_year_df[year],
color=color, alpha=1,
ls='-', lw=lw,
marker='', markeredgecolor='none', ms=0,
zorder=99, label=label)

# Show reference
if self.show_reference:
@@ -143,29 +144,9 @@


def example():
# # Test data
# from diive.core.io.filereader import ReadFileType
# loaddatafile = ReadFileType(
# filetype='DIIVE-CSV-30MIN',
# filepath=r"M:\Downloads\_temp\CH_LAE_FP2021_2004-2020_ID20210607205711.diive.csv",
# # filepath=r"F:\Dropbox\luhk_work\_current\fp2022\7-14__IRGA627572__addingQCF0\CH-DAV_FP2022.1_1997-2022.08_ID20220826234456_30MIN.diive.csv",
# data_nrows=None)
# data_df, metadata_df = loaddatafile.get_filedata()
#
# from diive.core.io.files import save_as_pickle
# filepath = save_as_pickle(
# outpath=r"M:\Downloads\_temp",
# # outpath=r'F:\Dropbox\luhk_work\_current\fp2022\7-14__IRGA627572__addingQCF0',
# filename='CH_LAE_FP2021_2004-2020_ID20210607205711.diive.csv',
# # filename='CH-DAV_FP2022.1_1997-2022.08_ID20220826234456_30MIN.diive.csv',
# data=data_df)

# Test data
from diive.core.io.files import load_pickle
df_orig = load_pickle(
filepath=r"M:\Downloads\_temp\CH_LAE_FP2021_2004-2020_ID20210607205711.diive.csv.pickle"
# filepath=r'F:\Dropbox\luhk_work\_current\fp2022\7-14__IRGA627572__addingQCF0\CH-DAV_FP2022.1_1997-2022.08_ID20220826234456_30MIN.diive.csv.pickle'
)
from diive.configs.exampledata import load_exampledata_parquet
df_orig = load_exampledata_parquet()

df = df_orig.copy()

@@ -190,7 +171,7 @@ def example():
# series.index = pd.to_datetime(series.index)
# series = series.groupby(series.index.year).mean() # yearly mean

series = df['NEE_f'].copy()
series = df['NEE_CUT_REF_f'].copy()
# series = df['NEE_CUT_REF_f'].copy()
series = series.multiply(0.02161926) # umol CO2 m-2 s-1 --> g C m-2 30min-1
series_units = r'($\mathrm{gC\ m^{-2}}$)'