Merge pull request #128 from holukas/ML-long-term-gap-filling
ML long-term gap filling
holukas committed Jun 11, 2024
2 parents ceebdb4 + 8464e20 commit 60e6623
Showing 26 changed files with 5,762 additions and 3,133 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -5,6 +5,7 @@
/__local_folders
/notebooks/_scratch/
/notebooks/Workbench/FLUXNET_CH4-N2O_Committee_WP2/data/
/diive/configs/exampledata/local

# Byte-compiled / optimized / DLL files
__pycache__/
46 changes: 46 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,52 @@

![DIIVE](images/logo_diive1_256px.png)

## v0.77.0 | 11 Jun 2024

### Additions

- Plotting cumulatives with `CumulativeYear` now also shows the cumulative for the reference, i.e. for the mean over the
reference years (`diive.core.plotting.cumulative.CumulativeYear`)
- Plotting `DielCycle` now accepts `ylim` parameter (`diive.core.plotting.dielcycle.DielCycle`)
- Added long-term dataset for local testing purposes (internal only) (`diive.configs.exampledata.load_exampledata_parquet_long`)
- Added several classes in preparation for long-term gap-filling for a future update

### Changes

- Several updates and changes to the base class for regressor decision
trees (`diive.core.ml.common.MlRegressorGapFillingBase`):
- The data are now split into training set and test set at the very start of regressor setup. This test set is used
to evaluate models on unseen data. The default split is 80% training and 20% test data.
- Plotting (scores, importances etc.) is now generally separated from the method where they are calculated.
- The same `random_state` is now used for all processing steps
- Refactored code
- Beautified console output
- When correcting for relative humidity values above 100%, the maximum of the corrected time series is now set to 100,
after the (daily) offset was removed (`diive.pkgs.corrections.offsetcorrection.remove_relativehumidity_offset`)
- During feature reduction in machine learning regressors, features with permutation importance < 0 are now always
removed (`diive.core.ml.common.MlRegressorGapFillingBase._remove_rejected_features`)
- Changed default parameters for quick random forest gap-filling (`diive.pkgs.gapfilling.randomforest_ts.QuickFillRFTS`)
- Improved the clarity of console output for several functions and methods
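The 80/20 split with a shared `random_state`, as described above, can be sketched as follows. This is a simplified illustration under assumed behavior, not the actual `MlRegressorGapFillingBase` code, and the function name is hypothetical:

```python
import random

def split_train_test(n_samples: int, test_size: float = 0.2, random_state: int = 42):
    """Shuffle record indices reproducibly, then hold out a test fraction.

    Reusing the same random_state for all later processing steps keeps the
    whole pipeline reproducible end to end.
    """
    rng = random.Random(random_state)  # fixed seed -> reproducible shuffle
    indices = list(range(n_samples))
    rng.shuffle(indices)
    n_test = int(round(n_samples * test_size))
    # Last (1 - test_size) share for training, first test_size share for testing
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_train_test(1000)
print(len(train_idx), len(test_idx))  # 800 200
```

Calling the function twice with the same `random_state` yields identical splits, which is the point of fixing the seed across all steps.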
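The feature-reduction rule above (always drop features with negative permutation importance) can be illustrated with a small sketch; the helper and the feature names are hypothetical stand-ins, not the actual `_remove_rejected_features` implementation:

```python
def remove_rejected_features(importances: dict) -> list:
    """Keep only features whose permutation importance is >= 0.

    A negative score means the model performed *better* when the feature was
    shuffled, i.e. the feature carried no usable signal and is always removed.
    """
    return [name for name, score in importances.items() if score >= 0]

scores = {"TA": 0.31, "SW_IN": 0.12, "NOISE": -0.02}  # hypothetical scores
print(remove_rejected_features(scores))  # ['TA', 'SW_IN']
```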
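The order of operations in the relative-humidity correction above (remove the daily offset first, then cap at 100%) can be sketched like this; a simplified stand-in for `remove_relativehumidity_offset` that operates on plain lists rather than time series:

```python
def correct_relative_humidity(values, daily_offsets):
    """Subtract the (daily) offset from each record, then cap at 100%.

    Capping happens *after* offset removal, so any values that remain above
    100% after the correction are limited to the physical maximum.
    """
    return [min(v - off, 100.0) for v, off in zip(values, daily_offsets)]

print(correct_relative_humidity([101.5, 99.0, 103.2], [1.0, 0.0, 2.0]))
# 100.5 and 101.2 are capped to 100.0; 99.0 passes through unchanged
```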

### Environment

- Added package [dtreeviz](https://github.com/parrt/dtreeviz?tab=readme-ov-file) to visualize decision trees

### Notebooks

- Updated notebook (`notebooks/GapFilling/RandomForestGapFilling.ipynb`)
- Updated notebook (`notebooks/GapFilling/LinearInterpolation.ipynb`)
- Updated notebook (`notebooks/GapFilling/XGBoostGapFillingExtensive.ipynb`)
- Updated notebook (`notebooks/GapFilling/XGBoostGapFillingMinimal.ipynb`)
- Updated notebook (`notebooks/GapFilling/RandomForestParamOptimization.ipynb`)
- Updated notebook (`notebooks/GapFilling/QuickRandomForestGapFilling.ipynb`)

### Tests

- Updated and fixed test case (`tests.test_outlierdetection.TestOutlierDetection.test_zscore_increments`)
- Updated and fixed test case (`tests.test_gapfilling.TestGapFilling.test_gapfilling_randomforest`)

## v0.76.2 | 23 May 2024

### Additions
78 changes: 54 additions & 24 deletions README.md
@@ -16,7 +16,8 @@ Recent releases: [Releases](https://github.com/holukas/diive/releases)

## Overview of example notebooks

- For many examples see notebooks here: [Notebook overview](https://github.com/holukas/diive/blob/main/notebooks/OVERVIEW.ipynb)
- More notebooks are added constantly.

## Current Features
@@ -25,7 +26,8 @@ Recent releases: [Releases](https://github.com/holukas/diive/releases)

- Calculate z-aggregates in quantiles (classes) of x and
y ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Analyses/CalculateZaggregatesInQuantileClassesOfXY.ipynb))
- Daily correlation ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Analyses/DailyCorrelation.ipynb))
- Decoupling: Sorting bins
method ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Analyses/DecouplingSortingBins.ipynb))
- Find data gaps ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Analyses/GapFinder.ipynb))
@@ -42,7 +44,8 @@ Recent releases: [Releases](https://github.com/holukas/diive/releases)

### Create variable

- Calculate time since last occurrence, e.g. since last precipitation ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/CalculateVariable/TimeSince.ipynb))
- Calculate daytime flag, nighttime flag and potential radiation from latitude and
longitude ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/CalculateVariable/Daytime_and_nighttime_flag.ipynb))
- Day/night flag from sun angle
@@ -78,9 +81,11 @@ Recent releases: [Releases](https://github.com/holukas/diive/releases)

### Flux processing chain

For info about the Swiss FluxNet flux levels, see [here](https://www.swissfluxnet.ethz.ch/index.php/data/ecosystem-fluxes/flux-processing-chain/).

- Flux processing chain ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/FluxProcessingChain/FluxProcessingChain.ipynb))
- The notebook example shows the application of:
- Level-2 quality flags
- Level-3.1 storage correction
@@ -101,10 +106,14 @@ Format data to specific formats

Fill gaps in time series with various methods

- XGBoostTS ([notebook example (minimal)](https://github.com/holukas/diive/blob/main/notebooks/GapFilling/XGBoostGapFillingMinimal.ipynb), [notebook example (more extensive)](https://github.com/holukas/diive/blob/main/notebooks/GapFilling/XGBoostGapFillingExtensive.ipynb))
- RandomForestTS ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/GapFilling/RandomForestGapFilling.ipynb))
- Linear interpolation ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/GapFilling/LinearInterpolation.ipynb))
- Quick random forest gap-filling ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/GapFilling/QuickRandomForestGapFilling.ipynb))

### Outlier Detection

@@ -116,10 +125,14 @@ Fill gaps in time series with various methods

Single outlier tests create a flag where `0=OK` and `2=outlier`.

- Absolute limits ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/AbsoluteLimits.ipynb))
- Absolute limits, separately defined for daytime and nighttime data ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/AbsoluteLimitsDaytimeNighttime.ipynb))
- Incremental z-score: Identify outliers based on the z-score of double increments ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/zScoreIncremental.ipynb))
- Local standard deviation: Identify outliers based on the local standard deviation from a running median ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/LocalSD.ipynb))
- Local outlier factor: Identify outliers based on local outlier factor, across all data
- Local outlier factor: Identify outliers based on local outlier factor, daytime and nighttime separately
- Manual removal: Remove time periods (from-to) or single records from time series
@@ -130,7 +143,8 @@ Single outlier tests create a flag where `0=OK` and `2=outlier`.

### Plotting

- Diel cycle per month ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/DielCycle.ipynb))
- Heatmap showing values (z) of time series as date (y) vs time (
x) ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/HeatmapDateTime.ipynb))
- Heatmap showing values (z) of time series as year (y) vs month (
@@ -148,11 +162,14 @@ Single outlier tests create a flag where `0=OK` and `2=outlier`.
database ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/MeteoScreening/StepwiseMeteoScreeningFromDatabase.ipynb))

### Resampling

- Calculate diel cycle per month ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Resampling/ResamplingDielCycle.ipynb))

### Stats

- Time series stats ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Stats/TimeSeriesStats.ipynb))

### Timestamps

@@ -163,22 +180,35 @@ Single outlier tests create a flag where `0=OK` and `2=outlier`.

## Installation

`diive` is currently developed under Python 3.9.7, but newer (and many older) versions should also work.
### Using pip

`pip install diive`

### Using poetry

`poetry add diive`

### Using conda

`conda install -c conda-forge diive`

### From source

Directly use the .tar.gz file of the desired version:

`pip install https://github.com/holukas/diive/archive/refs/tags/v0.76.2.tar.gz`

### Create and use a conda environment for diive

One way to install and use `diive` with a specific Python version on a local machine:

- Install [miniconda](https://docs.conda.io/en/latest/miniconda.html)
- Start `miniconda` prompt
- Create an environment named `diive-env` that contains Python 3.9.7: `conda create --name diive-env python=3.9.7`
- Activate the new environment: `conda activate diive-env`
- Install `diive` using pip: `pip install diive`
- If you want to use `diive` in Jupyter notebooks, you can install Jupyterlab.
In this example Jupyterlab is installed from the `conda` distribution channel `conda-forge`:
`conda install -c conda-forge jupyterlab`
22 changes: 17 additions & 5 deletions diive/configs/exampledata/__init__.py
@@ -15,6 +15,12 @@ def load_exampledata_parquet() -> DataFrame:
return data_df


def load_exampledata_parquet_long() -> DataFrame:
filepath = Path(DIR_PATH) / 'local/exampledata_PARQUET_CH-DAV_FP2022.5_1997-2022_ID20230206154316_30MIN.parquet'
data_df = load_parquet(filepath=filepath)
return data_df


def load_exampledata_DIIVE_CSV_30MIN():
filepath = Path(DIR_PATH) / 'exampledata_DIIVE-CSV-30MIN_CH-DAV_FP2022.5_2022.07_ID20230206154316_30MIN.diive.csv'
loaddatafile = ReadFileType(filetype='DIIVE-CSV-30MIN',
@@ -103,6 +109,7 @@ def load_exampledata_TOA5_DAT_1MIN():
data_df, metadata_df = loaddatafile.get_filedata()
return data_df, metadata_df


def load_exampledata_GENERIC_CSV_HEADER_1ROW_TS_MIDDLE_FULL_1MIN_long():
filepath = Path(
DIR_PATH) / 'exampledata_GENERIC-CSV-HEADER-1ROW-TS-MIDDLE-FULL-1MIN_CH-FRU_iDL_BOX1_0_1_TBL1_20240401-0000.dat.csv'
@@ -129,17 +136,22 @@ def load_exampledata_EDDYPRO_FLUXNET_CSV_30MIN_with_datafilereader_parameters():
dfr = DataFileReader(filepath=filepath,
data_header_section_rows=[0], # Header section (before data) comprises 1 row
data_skip_rows=[], # Skip no rows
data_header_rows=[0], # Header with variable names and units, in this case only variable names in first row of header
data_varnames_row=0, # Variable names are in first row of header
data_varunits_row=None, # Header does not contain any variable units
data_na_vals=[-9999], # List of values interpreted as missing values, EddyPro uses -9999 for missing values in output file
data_freq="30min", # Time resolution of the data is 30-minutes
data_delimiter=",", # This csv file uses the comma as delimiter
data_nrows=None, # How many data rows to read from files, mainly used for testing, in this case None to read all rows in file
timestamp_idx_col=["TIMESTAMP_END"], # Name of the column that is used for the timestamp index
timestamp_datetime_format="%Y%m%d%H%M", # Timestamp in the files looks like this: 202107010300
timestamp_start_middle_end="end", # Timestamp in the file defined in *timestamp_idx_col* refers to the END of the averaging interval
output_middle_timestamp=True, # Timestamp in output dataframe (after reading the file) refers to the MIDDLE of the averaging interval
compression=None) # File is not compressed (not zipped)
data_df, metadata_df = dfr.get_data()
return data_df, metadata_df
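The `timestamp_start_middle_end="end"` / `output_middle_timestamp=True` combination above shifts each timestamp from the end of its averaging interval to the middle. A minimal sketch of that conversion, assuming a fixed averaging period and not reflecting the actual `DataFileReader` internals:

```python
from datetime import datetime, timedelta

def end_to_middle(ts_end: datetime, freq_minutes: int = 30) -> datetime:
    """Convert an END-of-interval timestamp to the MIDDLE of the interval
    by subtracting half the averaging period."""
    return ts_end - timedelta(minutes=freq_minutes / 2)

# A 30-minute record stamped 03:00 (END) covers 02:30-03:00; its middle is 02:45.
print(end_to_middle(datetime(2021, 7, 1, 3, 0)))  # 2021-07-01 02:45:00
```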
10 changes: 4 additions & 6 deletions diive/core/dfun/frames.py
@@ -786,11 +786,10 @@ def rolling_variants(df, records: int, aggtypes: list, exclude_cols: list = None

def add_continuous_record_number(df: DataFrame) -> DataFrame:
"""Add continuous record number as new column"""
newcol = '.RECORDNUMBER'
data = range(1, len(df) + 1)
df[newcol] = data
print(f"++ Added new column {newcol} with record numbers from {df[newcol].iloc[0]} to {df[newcol].iloc[-1]}.")
return df


@@ -830,7 +829,7 @@ def lagged_variants(df: DataFrame,
Example:
"""

if len(df.columns) == 1:
if df.columns[0] in exclude_cols:
raise Exception(f"(!) No lagged variants can be created "
@@ -881,9 +880,8 @@ _included.append(col)
_included.append(col)

if verbose:
print(f"++ Added new columns with lagged variants for: {_included} (lags between {lag[0]} and {lag[1]} "
      f"with stepsize {stepsize}), no lagged variants for: {_excluded}.")
return df


