v0.67.0

holukas committed Jan 9, 2024
2 parents ee91ad7 + 9d61f4c commit 2c18ae6
Showing 55 changed files with 26,068 additions and 4,734 deletions.
11 changes: 7 additions & 4 deletions .gitignore
@@ -2,10 +2,10 @@

# e.g.: /src
/.idea/
target/
__manuscripts/
__workbench/
__todo/
/__local_folders

/notebooks/_scratch/
/notebooks/Workbench/FLUXNET_CH4-N2O_Committee_WP2/data/

# Byte-compiled / optimized / DLL files
__pycache__/
@@ -114,3 +114,6 @@ venv.bak/
/notebooks/Manuscripts/Hörtnagl et al. (2023) - NEP Penalty/
/notebooks/Workbench/example_n2o_outlier/
/notebooks/Workbench/FLUXNET CH4 N2O Committee WP2/data/
/notebooks/Workbench/FLUXNET_CH4-N2O_Committee_WP2/data/
/__archived/
/__local_folders/
47 changes: 47 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,53 @@

![DIIVE](images/logo_diive1_256px.png)

## v0.67.0 | 9 Jan 2024

### Updates to flux processing chain

The flux processing chain was updated to make processing more streamlined and easier to follow. One of the
biggest changes is the implementation of the `repeat` keyword for outlier tests. With this keyword set to `True`, the
respective test is repeated until no more outliers are found. How the flux processing chain can be used is shown in
the updated `FluxProcessingChain` notebook (`notebooks/FluxProcessingChain/FluxProcessingChain.ipynb`).

### New features

- Added new class `QuickFluxProcessingChain`, which runs a simplified version of the flux processing chain. This quick
  version relies largely on default values, so only some basic user settings are
  needed. (`diive.pkgs.fluxprocessingchain.fluxprocessingchain.QuickFluxProcessingChain`)
- Added new repeater function for outlier detection: `repeater` is a wrapper that executes an outlier detection
  method multiple times, where each iteration gets its own outlier flag. As an example: the simple z-score test is run
  a first time and then repeated until no more outliers are found; each iteration outputs a flag. This is now used in
  `StepwiseOutlierDetection` and thus in the flux processing chain Level-3.2 (outlier detection) and in the
  meteoscreening in `StepwiseMeteoScreeningDb` (not yet checked in this update). To repeat an outlier method, use the
  `repeat` keyword arg (see the `FluxProcessingChain` notebook for
  examples). (`diive.pkgs.outlierdetection.repeater.repeater`)
- Added new function `filter_strings_by_elements`: returns a list of strings from list1 that contain all of the
  elements in list2. (`core.funcs.funcs.filter_strings_by_elements`)
- Added new function `flag_steadiness_horizontal_wind_eddypro_test`: creates a flag for the steadiness of horizontal
  wind u from the sonic anemometer. Makes direct use of the EddyPro output files and converts the flag to a
  standardized 0/1 flag. (`pkgs.qaqc.eddyproflags.flag_steadiness_horizontal_wind_eddypro_test`)
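
The repeat-until-clean logic behind `repeater` can be sketched in plain Python. This is a hypothetical standalone sketch using only the stdlib, not the diive implementation; the threshold and data values are made up for illustration:

```python
from statistics import mean, stdev

def zscore_outliers(values: list[float], threshold: float) -> list[int]:
    """Return positions of values whose absolute z-score exceeds the threshold."""
    m, s = mean(values), stdev(values)
    if s == 0:
        return []
    return [i for i, v in enumerate(values) if abs((v - m) / s) > threshold]

def repeat_outlier_test(values: list[float], threshold: float = 2.0) -> list[list[int]]:
    """Repeat the z-score test until an iteration finds no new outliers.
    Each iteration yields its own 'flag': a list of original indices."""
    remaining = list(enumerate(values))  # keep track of original indices
    flags_per_iteration = []
    while len(remaining) >= 3:
        outliers = zscore_outliers([v for _, v in remaining], threshold)
        if not outliers:
            break
        flags_per_iteration.append([remaining[i][0] for i in outliers])
        remaining = [item for i, item in enumerate(remaining) if i not in outliers]
    return flags_per_iteration

data = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.1, 50.0, 1.0, 9.0]
print(repeat_outlier_test(data))  # extreme values flagged over successive passes
```

In diive itself this behavior is enabled per test via the `repeat` keyword, and each pass writes its own flag.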

### Changes

- Added automatic calculation of daytime and nighttime flags whenever the flux processing chain is
  started (`diive.pkgs.fluxprocessingchain.fluxprocessingchain.FluxProcessingChain._add_swinpot_dt_nt_flag`)

### Removed features

- Removed class `ThymeBoostOutlier` for outlier detection. At the moment it was not possible to get it to work properly.

### Bugfixes

- The kwarg `fmt` is used slightly differently for `plot_date` and `plot` in `matplotlib`: it is always defined for
  `plot_date`, while it is optional for `plot`. The `fmt` kwarg is now used to avoid the warning:
  *UserWarning: marker is redundantly defined by the 'marker' keyword argument and the fmt string "o" (-> marker='o').
  The keyword argument will take precedence.* Therefore `fmt="X"` is now used instead of `marker="X"`. See also the
  answer [here](https://stackoverflow.com/questions/69188540/userwarning-marker-is-redundantly-defined-by-the-marker-keyword-argument-when).

### Environment

- Removed `thymeboost`

## v0.66.0 | 2 Nov 2023

### New features
1 change: 0 additions & 1 deletion README.md
@@ -92,7 +92,6 @@ Fill gaps in time series with various methods
- Missing values: Simply creates a flag that indicates available and missing data in a time series
- Seasonal trend decomposition using LOESS, identify outliers based on seasonal-trend decomposition and
z-score calculations
- Thymeboost: Identify outliers based on [thymeboost](https://github.com/tblume1992/ThymeBoost)
- z-score: Identify outliers based on the z-score across all time series data
- z-score: Identify outliers based on the z-score, separately for daytime and nighttime
- z-score: Identify outliers based on max z-scores in the interquartile range data
22 changes: 22 additions & 0 deletions diive/configs/filetypes/FLUXNET-CH4-HH-CSV-30MIN.yml
@@ -0,0 +1,22 @@
GENERAL:
NAME: "FLUXNET-CH4-HH-CSV-30MIN"
DESCRIPTION: "The Data Product for the FLUXNET-CH4 Release."
TAGS: [ "FLUXNET" ]

FILE:
EXTENSION: "*.csv"
COMPRESSION: "None"

TIMESTAMP:
DESCRIPTION: "1 column with full timestamp with seconds"
INDEX_COLUMN: [ 'TIMESTAMP_END' ]
DATETIME_FORMAT: "%Y%m%d%H%M"
SHOWS_START_MIDDLE_OR_END_OF_RECORD: "end"

DATA:
HEADER_SECTION_ROWS: [ 0, 1, 2 ]
SKIP_ROWS: [ 0, 1 ]
HEADER_ROWS: [ 0 ]
NA_VALUES: [ -9999 ]
FREQUENCY: "30T"
DELIMITER: ","
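
The `DATETIME_FORMAT` entry uses Python `strptime`/`strftime` codes. As a quick sketch (with a made-up timestamp), `%Y%m%d%H%M` parses `TIMESTAMP_END` values like this:

```python
from datetime import datetime

# FLUXNET-CH4 half-hourly files carry timestamps such as TIMESTAMP_END = 202401091330
ts = datetime.strptime("202401091330", "%Y%m%d%H%M")
print(ts)  # 2024-01-09 13:30:00
```

Because `SHOWS_START_MIDDLE_OR_END_OF_RECORD` is `"end"`, this timestamp marks the end of the 30-minute averaging interval.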
22 changes: 22 additions & 0 deletions diive/configs/filetypes/ICOS_H2R_CSVZIP_1MIN.yml
@@ -0,0 +1,22 @@
GENERAL:
NAME: "ICOS_H2R_CSVZIP_1MIN"
DESCRIPTION: "Compressed (zipped) ICOS format with 2-row header (variable names and units) and ISO timestamp."
TAGS: [ "ICOS" ]

FILE:
EXTENSION: "*.csv"
COMPRESSION: "zip"

TIMESTAMP:
DESCRIPTION: "1 column with full ISO timestamp with seconds"
INDEX_COLUMN: [ 0 ]
DATETIME_FORMAT: "%Y%m%d%H%M%S"
SHOWS_START_MIDDLE_OR_END_OF_RECORD: "end"

DATA:
HEADER_SECTION_ROWS: [ 0, 1 ]
SKIP_ROWS: [ ]
HEADER_ROWS: [ 0, 1 ]
NA_VALUES: [ -9999 ]
FREQUENCY: "1MIN"
DELIMITER: ","
31 changes: 19 additions & 12 deletions diive/core/base/flagbase.py
@@ -8,21 +8,26 @@
import numpy as np
import pandas as pd
from pandas import Series, DatetimeIndex
from diive.core.plotting.plotfuncs import default_format, default_legend

import diive.core.plotting.styles.LightTheme as theme
from diive.core.funcs.funcs import validate_id_string
from diive.core.plotting.plotfuncs import default_format, default_legend

class FlagBase():

def __init__(self, series: Series, flagid: str, levelid: str = None):
class FlagBase:

def __init__(self, series: Series, flagid: str, idstr: str = None, verbose: bool = True):
self.series = series
self._flagid = flagid
self._levelid = levelid
self._flagname = self._generate_flagname()
self._idstr = validate_id_string(idstr=idstr)
self.verbose = verbose

self.flagname = self._generate_flagname()

self._filteredseries = None
self._flag = None

print(f"Generating flag {self._flagname} for variable {self.series.name} ...")
print(f"Generating flag {self.flagname} for variable {self.series.name} ...")

@property
def flag(self) -> Series:
@@ -62,18 +67,20 @@ def setfiltered(self, rejected: DatetimeIndex):
def reset(self):
self._filteredseries = self.series.copy()
# Generate flag series with NaNs
self._flag = pd.Series(index=self.series.index, data=np.nan, name=self._flagname)
self._flag = pd.Series(index=self.series.index, data=np.nan, name=self.flagname)

def _generate_flagname(self) -> str:
"""Generate standardized name for flag variable"""
flagname = "FLAG"
if self._levelid: flagname += f"_L{self._levelid}"
if self._idstr:
flagname += f"{self._idstr}"
flagname += f"_{self.series.name}"
if self._flagid: flagname += f"_{self._flagid}"
if self._flagid:
flagname += f"_{self._flagid}"
flagname += f"_TEST"
return flagname

def plot(self, ok:DatetimeIndex, rejected:DatetimeIndex, plottitle:str=""):
def plot(self, ok: DatetimeIndex, rejected: DatetimeIndex, plottitle: str = ""):
"""Basic plot that shows time series with and without outliers"""
fig = plt.figure(facecolor='white', figsize=(16, 7))
gs = gridspec.GridSpec(2, 1) # rows, cols
@@ -83,8 +90,8 @@ def plot(self, ok:DatetimeIndex, rejected:DatetimeIndex, plottitle:str=""):
ax_series.plot_date(self.series.index, self.series, label=f"{self.series.name}", color="#42A5F5",
alpha=.5, markersize=2, markeredgecolor='none')
ax_series.plot_date(self.series[rejected].index, self.series[rejected],
label="outlier (rejected)", color="#F44336", marker="X", alpha=1,
markersize=8, markeredgecolor='none')
label="outlier (rejected)", color="#F44336", alpha=1,
markersize=8, markeredgecolor='none', fmt='X')
ax_ok.plot_date(self.series[ok].index, self.series[ok], label=f"OK", color="#9CCC65", alpha=.5,
markersize=2, markeredgecolor='none')
default_format(ax=ax_series)
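
With the switch from `levelid` to the validated `idstr`, flag names follow the pattern `FLAG{idstr}_{series}_{flagid}_TEST`. The sketch below mirrors (but does not import) the diive helpers; the series and flag ids are hypothetical examples:

```python
def validate_id_string(idstr: str) -> str:
    """Ensure the id string starts with an underscore (mirrors diive's helper)."""
    if idstr:
        idstr = idstr if idstr.startswith('_') else f'_{idstr}'
    return idstr

def generate_flagname(series_name: str, flagid: str, idstr: str = None) -> str:
    """Standardized flag variable name: FLAG{idstr}_{series}_{flagid}_TEST"""
    flagname = "FLAG"
    idstr = validate_id_string(idstr)
    if idstr:
        flagname += idstr
    flagname += f"_{series_name}"
    if flagid:
        flagname += f"_{flagid}"
    flagname += "_TEST"
    return flagname

print(generate_flagname("NEE", "OUTLIER_ZSCORE", idstr="L3.2"))
# → FLAG_L3.2_NEE_OUTLIER_ZSCORE_TEST
```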
2 changes: 1 addition & 1 deletion diive/core/dfun/frames.py
@@ -741,7 +741,7 @@ def add_continuous_record_number(df: DataFrame) -> DataFrame:
newcol = '.RECORDNUMBER'
data = range(1, len(df) + 1)
df[newcol] = data
print(f"Added new column {newcol} with record numbers from {df[newcol][0]} to {df[newcol][-1]}.")
print(f"Added new column {newcol} with record numbers from {df[newcol].iloc[0]} to {df[newcol].iloc[-1]}.")
return df


32 changes: 32 additions & 0 deletions diive/core/funcs/funcs.py
@@ -2,6 +2,38 @@
from pandas import Series


def validate_id_string(idstr: str):
if idstr:
# idstr = idstr if idstr.endswith('_') else f'{idstr}_'
idstr = idstr if idstr.startswith('_') else f'_{idstr}'
return idstr


def filter_strings_by_elements(list1: list[str], list2: list[str]) -> list[str]:
"""Returns a list of strings from list1 that contain all of the elements in list2.
The function uses a set to keep track of the elements in list2, which makes it more
efficient than iterating over the list twice with this one-liner:
result = [s1 for s1 in list1 if all(s2 in str(s1) for s2 in list2)]
Args:
list1: A list of strings.
list2: A list of elements to check for in each string in list1.
Returns:
A list of strings from list1 that contain all of the elements in list2.
"""
if not list1 or not list2:
return []

elements_in_other_list = set(list2)
result = []
for s1 in list1:
if all(s2 in str(s1) for s2 in elements_in_other_list):
result.append(s1)
return result


def zscore(series: Series) -> Series:
"""Calculate the z-score of each record in *series*"""
mean, std = np.mean(series), np.std(series)
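
A quick usage sketch for `filter_strings_by_elements` (a self-contained copy of the function above; the column names are hypothetical):

```python
def filter_strings_by_elements(list1: list[str], list2: list[str]) -> list[str]:
    """Return strings from list1 that contain all of the elements in list2."""
    if not list1 or not list2:
        return []
    elements = set(list2)
    return [s1 for s1 in list1 if all(s2 in str(s1) for s2 in elements)]

cols = ["FLAG_NEE_OUTLIER_TEST", "FLAG_TA_OUTLIER_TEST", "NEE_ORIG"]
print(filter_strings_by_elements(cols, ["FLAG", "NEE"]))
# → ['FLAG_NEE_OUTLIER_TEST']
```

This is handy for picking out, e.g., all flag columns that belong to one variable.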
3 changes: 2 additions & 1 deletion diive/core/io/filereader.py
@@ -25,7 +25,8 @@ def search_files(searchdirs: str or list, pattern: str) -> list:
""" Search files and store their filename and the path to the file in dictionary. """
# found_files_dict = {}
foundfiles = []
if isinstance(searchdirs, str): searchdirs = [searchdirs] # Use str as list
if isinstance(searchdirs, str):
searchdirs = [searchdirs] # Use str as list
for searchdir in searchdirs:
for root, dirs, files in os.walk(searchdir):
for idx, settings_file_name in enumerate(files):
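
`search_files` recursively walks the given directories. A minimal standalone sketch follows; the pattern matching is assumed to be `fnmatch`-style glob matching, which may differ from the actual implementation:

```python
import fnmatch
import os

def search_files(searchdirs, pattern: str) -> list:
    """Recursively collect paths of files matching a glob-style pattern."""
    if isinstance(searchdirs, str):
        searchdirs = [searchdirs]  # accept a single directory given as string
    foundfiles = []
    for searchdir in searchdirs:
        for root, dirs, files in os.walk(searchdir):
            for filename in files:
                if fnmatch.fnmatch(filename, pattern):
                    foundfiles.append(os.path.join(root, filename))
    return sorted(foundfiles)
```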
77 changes: 57 additions & 20 deletions diive/core/plotting/scatter.py
@@ -1,8 +1,11 @@
from typing import Literal

import matplotlib.pyplot as plt
import pandas as pd
from pandas import Series

import diive.core.plotting.plotfuncs as pf
from diive.core.dfun.stats import q25, q75


class ScatterXY:
@@ -15,8 +18,9 @@ def __init__(
title: str = None,
ax: plt.Axes = None,
nbins: int = 0,
binagg: Literal['mean', 'median'] = 'median',
xlim: list = None,
ylim: list = None
ylim: list or Literal['auto'] = None
):
"""
@@ -29,9 +33,12 @@ def __init__(
self.yunits = yunits
self.ax = ax
self.nbins = nbins
self.binagg = binagg
self.xlim = xlim
self.ylim = ylim

self.binagg = None if self.nbins == 0 else self.binagg

self.xy_df = pd.concat([x, y], axis=1)
self.xy_df = self.xy_df.dropna()

@@ -46,7 +53,7 @@ def _databinning(self):
group, bins = pd.qcut(self.xy_df[self.xname], q=self.nbins, retbins=True, duplicates='drop')
groupcol = f'GROUP_{self.xname}'
self.xy_df[groupcol] = group
self.xy_df_binned = self.xy_df.groupby(groupcol).agg({'mean', 'std', 'count'})
self.xy_df_binned = self.xy_df.groupby(groupcol).agg({'mean', 'median', 'std', 'count', q25, q75})

def plot(self):
"""Generate plot"""
@@ -73,22 +80,31 @@ def _plot(self, nbins: int = 10):
label=label)

if self.nbins > 0:

_min = self.xy_df_binned[self.yname]['count'].min()
_max = self.xy_df_binned[self.yname]['count'].max()
self.ax.scatter(x=self.xy_df_binned[self.xname]['mean'],
y=self.xy_df_binned[self.yname]['mean'],
c='none',
s=80,
marker='o',
edgecolors='r',
lw=2,
label=f"binned data, mean±SD "
f"({_min}-{_max} values per bin)")
self.ax.errorbar(x=self.xy_df_binned[self.xname]['mean'],
y=self.xy_df_binned[self.yname]['mean'],
xerr=self.xy_df_binned[self.xname]['std'],
yerr=self.xy_df_binned[self.yname]['std'],
elinewidth=3, ecolor='red', alpha=.6, lw=0)
self.ax.plot(self.xy_df_binned[self.xname][self.binagg],
self.xy_df_binned[self.yname][self.binagg],
c='r', ms=10, marker='o', lw=2,
# c='none', ms=80, marker='o', edgecolors='r', lw=2,
label=f"binned data ({self.binagg}, {_min}-{_max} values per bin)")



if self.binagg == 'median':
self.ax.fill_between(self.xy_df_binned[self.xname][self.binagg],
self.xy_df_binned[self.yname]['q25'],
self.xy_df_binned[self.yname]['q75'],
alpha=.2, zorder=10, color='red',
label="interquartile range")

if self.binagg == 'mean':
self.ax.errorbar(x=self.xy_df_binned[self.xname][self.binagg],
y=self.xy_df_binned[self.yname][self.binagg],
xerr=self.xy_df_binned[self.xname]['std'],
yerr=self.xy_df_binned[self.yname]['std'],
elinewidth=3, ecolor='red', alpha=.6, lw=0,
label="standard deviation")

self._apply_format()
self.ax.locator_params(axis='x', nbins=nbins)
@@ -99,12 +115,31 @@ def _apply_format(self):
if self.xlim:
xmin = self.xlim[0]
xmax = self.xlim[1]
self.ax.set_xlim(xmin, xmax)

if self.ylim:
else:
xmin = self.xy_df[self.xname].quantile(0.01)
xmax = self.xy_df[self.xname].quantile(0.99)
self.ax.set_xlim(xmin, xmax)

if self.ylim == 'auto':
if self.binagg == 'median':
ymin = self.xy_df_binned[self.yname]['q25'].min()
ymax = self.xy_df_binned[self.yname]['q75'].max()
elif self.binagg == 'mean':
_lowery = self.xy_df_binned[self.yname]['mean'].sub(self.xy_df_binned[self.yname]['std'])
_uppery = self.xy_df_binned[self.yname]['mean'].add(self.xy_df_binned[self.yname]['std'])
ymin = _lowery.min()
ymax = _uppery.max()
else:
ymin = self.xy_df[self.yname].quantile(0.01)
ymax = self.xy_df[self.yname].quantile(0.99)
elif isinstance(self.ylim, list):
ymin = self.ylim[0]
ymax = self.ylim[1]
self.ax.set_ylim(ymin, ymax)
else:
ymin = self.xy_df[self.yname].min()
ymax = self.xy_df[self.yname].max()

self.ax.set_ylim(ymin, ymax)

pf.add_zeroline_y(ax=self.ax, data=self.xy_df[self.yname])

@@ -118,6 +153,8 @@ def _apply_format(self):
labelspacing=0.2,
ncol=1)

self.ax.set_title(self.title, size=20)

# pf.nice_date_ticks(ax=self.ax, minticks=3, maxticks=20, which='x', locator='auto')

# if self.showplot:
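
`ScatterXY` now aggregates each x-quantile bin with the chosen statistic (`mean` or `median`) plus its spread. The core binning idea can be sketched without pandas; this is a rough stdlib analogue of `pd.qcut` followed by `groupby().agg()`, not the diive code:

```python
from statistics import median, quantiles

def bin_by_x_quantiles(x, y, nbins=2):
    """Assign each (x, y) pair to an x-quantile bin and return the median
    and count of y per bin (roughly what pd.qcut + groupby/median does)."""
    edges = quantiles(x, n=nbins)  # nbins-1 internal cut points
    bins = [[] for _ in range(nbins)]
    for xi, yi in zip(x, y):
        idx = sum(xi > e for e in edges)  # number of cut points below xi
        bins[idx].append(yi)
    return [{"median": median(ys), "count": len(ys)} for ys in bins if ys]

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [10, 12, 11, 13, 30, 32, 31, 33]
print(bin_by_x_quantiles(xs, ys, nbins=2))
```

In the plot itself, the per-bin medians are drawn as a line with the interquartile range shaded, or the per-bin means with standard-deviation error bars.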