Merge pull request #184 from holukas/meteoscreening-update
Meteoscreening updates
holukas committed Aug 28, 2024
2 parents 2b81037 + a22b4ca commit e05ee15
Showing 16 changed files with 4,432 additions and 2,425 deletions.
41 changes: 36 additions & 5 deletions CHANGELOG.md
@@ -2,6 +2,37 @@

![DIIVE](images/logo_diive1_256px.png)

## v0.80.0 | 28 Aug 2024

### Additions

- Added outlier tests to step-wise meteoscreening from database: `Hampel`, `HampelDaytimeNighttime` and `TrimLow` (`diive.pkgs.qaqc.meteoscreening.StepwiseMeteoScreeningDb`); a minimal sketch of the Hampel idea follows this list
- Added parameter to control whether to output the middle timestamp when loading parquet files with `load_parquet()`; by default, `output_middle_timestamp=True` (`diive.core.io.files.load_parquet`)
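
The `StepwiseMeteoScreeningDb` call signatures are not shown in this changelog. The following is a minimal, illustrative sketch of the Hampel idea only (flag values that deviate from a rolling median by more than `k` times the MAD; `0=OK`, `2=outlier` as elsewhere in diive), not the diive implementation:

```python
import numpy as np
import pandas as pd


def hampel_flag(series: pd.Series, window: int = 13, k: float = 3.0) -> pd.Series:
    """Vectorized approximation of a Hampel filter: flag values that deviate
    from a centered rolling median by more than k * MAD (0=OK, 2=outlier)."""
    med = series.rolling(window, center=True, min_periods=1).median()
    # 1.4826 scales the MAD to match a standard deviation for normal data
    mad = 1.4826 * (series - med).abs().rolling(window, center=True, min_periods=1).median()
    outlier = (series - med).abs() > k * mad
    return pd.Series(np.where(outlier, 2, 0), index=series.index)
```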

### Environment

- Re-created environment and created new `lock` file
- Currently using Python 3.9.19

### Notebooks

- Added new notebook for creating a flag that indicates missing values (`notebooks/OutlierDetection/MissingValues.ipynb`)
- Updated notebook for meteoscreening from database (`notebooks/MeteoScreening/StepwiseMeteoScreeningFromDatabase.ipynb`)
- Updated notebook for loading and saving parquet files (`notebooks/Formats/LoadSaveParquetFile.ipynb`)

### Tests

- Added unittest for flagging missing values (`tests.test_outlierdetection.TestOutlierDetection.test_missing_values`)
- 37/37 unittests ran successfully

### Bugfixes

- Fixed links in README: links to notebooks needed to be absolute
- Fixed issue with the list returned by `diive.pkgs.analyses.histogram.Histogram.peakbins`

## v0.79.1 | 26 Aug 2024

### Additions
@@ -500,25 +531,25 @@ multiple outlier tests into one single overall outlier flag.

## v0.72.0 | 25 Mar 2024

## New feature
### New feature

- Added new heatmap plotting class `HeatmapYearMonth` that allows plotting a variable in year/month classes (`diive.core.plotting.heatmap_datetime.HeatmapYearMonth`); a rough sketch of the idea follows below
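
As a rough illustration of the year/month aggregation behind such a plot (pandas plus matplotlib, not the `HeatmapYearMonth` class itself):

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_year_month_heatmap(series: pd.Series) -> None:
    """Show a datetime-indexed series as year (y) vs month (x) mean values.
    Illustrative sketch only."""
    grid = series.groupby([series.index.year, series.index.month]).mean().unstack()
    fig, ax = plt.subplots()
    im = ax.imshow(grid, aspect="auto")
    ax.set_xticks(range(grid.shape[1]), labels=grid.columns)
    ax.set_yticks(range(grid.shape[0]), labels=grid.index)
    ax.set_xlabel("Month")
    ax.set_ylabel("Year")
    fig.colorbar(im, ax=ax)
    plt.show()
```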

![DIIVE](images/plotHeatmapYearMonth_diive_v0.72.0.png)

## Changes
### Changes

- Refactored code for class `HeatmapDateTime` (`diive.core.plotting.heatmap_datetime.HeatmapDateTime`)
- Added new base class `HeatmapBase` for heatmap plots. Currently used by `HeatmapYearMonth`
and `HeatmapDateTime` (`diive.core.plotting.heatmap_base.HeatmapBase`)

## Notebooks
### Notebooks

- Added new notebook for `HeatmapDateTime` (`notebooks/Plotting/HeatmapDateTime.ipynb`)
- Added new notebook for `HeatmapYearMonth` (`notebooks/Plotting/HeatmapYearMonth.ipynb`)

## Bugfixes
### Bugfixes

- Fixed bug in `HeatmapDateTime` where the last record of each day was not shown

@@ -1318,7 +1349,7 @@ to `diive`. From now on, new example notebooks will be added regularly.

## v0.52.3 | 10 Mar 2023

## Additions
### Additions

- Added plotting library `bokeh` to dependencies

8 changes: 4 additions & 4 deletions README.md
@@ -116,9 +116,9 @@ _Create single outlier flags where `0=OK` and `2=outlier`._
- **Hampel filter daytime/nighttime**, separately for daytime and nighttime data ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/HampelDaytimeNighttime.ipynb))
- **Local standard deviation**: Identify outliers based on the local standard deviation from a running median ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/LocalSD.ipynb))
- **Local outlier factor**: Identify outliers based on local outlier factor, across all data ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/LocalSD.ipynb))
- **Local outlier factor daytime/nighttime**: Identify outliers based on local outlier factor, daytime nighttime separately ([notebook example](notebooks/OutlierDetection/LocalOutlierFactorDaytimeNighttime.ipynb))
- **Manual removal**: Remove time periods (from-to) or single records from time series ([notebook example](notebooks/OutlierDetection/ManualRemoval.ipynb))
- **Missing values**: Simply creates a flag that indicated available and missing data in a time series
- **Local outlier factor daytime/nighttime**: Identify outliers based on local outlier factor, daytime nighttime separately ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/LocalOutlierFactorDaytimeNighttime.ipynb))
- **Manual removal**: Remove time periods (from-to) or single records from time series ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/ManualRemoval.ipynb))
- **Missing values**: Simply creates a flag that indicates available and missing data in a time series ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/MissingValues.ipynb))
- **Trimming**: Remove values below a threshold and remove an equal number of records from the high end of the data ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/TrimLow.ipynb))
- **z-score**: Identify outliers based on the z-score across all time series data ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/zScore.ipynb))
- **z-score increments daytime/nighttime**: Identify outliers based on the z-score of double increments ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/zScoreIncremental.ipynb))
@@ -130,7 +130,7 @@ _Create single outlier flags where `0=OK` and `2=outlier`._
- **Diel cycle per month** ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/DielCycle.ipynb))
- **Heatmap date/time**: showing values (z) of time series as date (y) vs time (x) ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/HeatmapDateTime.ipynb))
- **Heatmap year/month**: showing values (z) of time series as year (y) vs month (x) ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/HeatmapYearMonth.ipynb))
- **Histogram**: includes options to show z-score limits and to highlight the peak distribution bin ([notebook example](notebooks/Plotting/Histogram.ipynb))
- **Histogram**: includes options to show z-score limits and to highlight the peak distribution bin ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/Histogram.ipynb))
- **Long-term anomalies**: calculate and plot long-term anomaly for a variable, per year, compared to a reference period. ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/LongTermAnomalies.ipynb))
- **Time series plot**: Simple (interactive) time series plot ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/TimeSeries.ipynb))
- **ScatterXY plot** ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/ScatterXY.ipynb))
6 changes: 4 additions & 2 deletions diive/core/io/files.py
@@ -42,13 +42,15 @@ def save_parquet(filename: str, data: DataFrame or Series, outpath: str or None
return str(filepath)


def load_parquet(filepath: str or Path) -> DataFrame:
def load_parquet(filepath: str or Path, output_middle_timestamp: bool = True) -> DataFrame:
"""
Load data from Parquet file to pandas DataFrame
Args:
filepath: str
filepath to parquet file
output_middle_timestamp: Converts the timestamp to show the middle
of the averaging interval.
Returns:
pandas DataFrame, data from Parquet file as pandas DataFrame
@@ -57,7 +59,7 @@ def load_parquet(filepath: str or Path) -> DataFrame:
df = read_parquet(filepath)
toc = time.time() - tic
# Check timestamp, also detects frequency of time series, this info was lost when saving to the parquet file
df = TimestampSanitizer(data=df).get()
df = TimestampSanitizer(data=df, output_middle_timestamp=output_middle_timestamp).get()
print(f"Loaded .parquet file {filepath} ({toc:.3f} seconds). "
f"Detected time resolution of {df.index.freq} / {df.index.freqstr} ")
return df
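
A usage sketch for the new parameter (the file name is hypothetical). Assuming a 30-minute series stored with end-of-interval timestamps, the default shifts each timestamp to the middle of its averaging interval, e.g. 10:30 becomes 10:15:

```python
from diive.core.io.files import load_parquet

# Default: timestamps are converted to the middle of the averaging interval,
# e.g. a 30-min record stamped 10:30 (interval end) is shown as 10:15.
df = load_parquet("data.parquet")

# Assumed behavior of False: skip the conversion, keep timestamps as stored.
df_raw = load_parquet("data.parquet", output_middle_timestamp=False)
```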
2 changes: 1 addition & 1 deletion diive/pkgs/analyses/histogram.py
@@ -83,7 +83,7 @@ def peakbins(self):
"""Returns the five bins with the most counts in decreasing order"""
ix_maxcounts = self.results['COUNTS'].sort_values(ascending=False).head(5).index
peakbins = self.results['BIN_START_INCL'].iloc[ix_maxcounts]
return list(peakbins.values)
return peakbins.values.tolist()
# Find bin with maximum counts
# idx_maxcounts = self.results['COUNTS'].idxmax()
# return self.results['BIN_START_INCL'].iloc[idx_maxcounts]
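
The switch from `list(peakbins.values)` to `peakbins.values.tolist()` is presumably about element types: `tolist()` converts NumPy scalars to native Python numbers, whereas `list()` keeps NumPy scalar objects. A small demonstration:

```python
import numpy as np

arr = np.array([1.5, 2.5])
print(type(list(arr)[0]))     # <class 'numpy.float64'>
print(type(arr.tolist()[0]))  # <class 'float'>
```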
138 changes: 78 additions & 60 deletions diive/pkgs/gapfilling/xgboost_ts.py
@@ -18,7 +18,6 @@
- XXX
"""
import numpy as np
import xgboost as xgb
from pandas import DataFrame

@@ -114,7 +113,7 @@ def example_xgbts():
# subsetcols = [TARGET_COL, 'Tair_f', 'VPD_f', 'Rg_f', 'SWC_FF0_0.15_1', 'PPFD']

# Example data
from diive.configs.exampledata import load_exampledata_parquet, load_exampledata_parquet_long
from diive.configs.exampledata import load_exampledata_parquet_long
df_orig = load_exampledata_parquet_long()

# # Create a large gap
@@ -123,14 +122,12 @@ def example_xgbts():
# df = df[remove].copy()

# Subset
# keep = (df_orig.index.year >= 1997) & (df_orig.index.year <= 2001)
# df = df_orig[keep].copy()
df = df_orig.copy()


keep = (df_orig.index.year >= 1997) & (df_orig.index.year <= 2001)
df = df_orig[keep].copy()
# df = df_orig.copy()

# Checking nighttime
nt_locs = df['Rg_f'] < 50
# nt_locs = df['Rg_f'] < 50
# nt = df[nt_locs].groupby(df[nt_locs].index.year).agg(['mean'])
# means_nt = nt[TARGET]['mean']
# # import matplotlib.pyplot as plt
@@ -140,16 +137,16 @@ def example_xgbts():
# mean_nt_0613 = means_nt.loc[2006:2013].mean()
# corr_nt = mean_nt_0613 / mean_nt_9704

# corr_nt = 100
# corr_nt = 1.19
corr_nt = 0.7759670068746911
corr_df = df[['Rg_f', 'NEE_CUT_REF_orig']].copy()
corr_df['gain'] = 1
# nt_locs_9704 = (df.index.year >= 1997) & (df.index.year <= 2004)
nt_locs_9704 = (df.index.year >= 1997) & (df.index.year <= 2004) & (df['Rg_f'] < 50)
corr_df.loc[nt_locs_9704, 'gain'] = corr_nt
corr_df['NEE_CUT_REF_orig'] = corr_df['NEE_CUT_REF_orig'].multiply(corr_df['gain'])
df[TARGET_COL] = corr_df[TARGET_COL].copy()
# # corr_nt = 100
# # corr_nt = 1.19
# corr_nt = 0.7759670068746911
# corr_df = df[['Rg_f', 'NEE_CUT_REF_orig']].copy()
# corr_df['gain'] = 1
# # nt_locs_9704 = (df.index.year >= 1997) & (df.index.year <= 2004)
# nt_locs_9704 = (df.index.year >= 1997) & (df.index.year <= 2004) & (df['Rg_f'] < 50)
# corr_df.loc[nt_locs_9704, 'gain'] = corr_nt
# corr_df['NEE_CUT_REF_orig'] = corr_df['NEE_CUT_REF_orig'].multiply(corr_df['gain'])
# df[TARGET_COL] = corr_df[TARGET_COL].copy()

# df[nt_locs].groupby(df[nt_locs].index.year).agg(['mean'])['NEE_CUT_REF_orig']
# df[~nt_locs].groupby(df[~nt_locs].index.year).agg(['mean'])['NEE_CUT_REF_orig']
@@ -262,26 +259,26 @@ def example_xgbts():
# xgbts.reduce_features()
# xgbts.report_feature_reduction()

xgbts.trainmodel(showplot_scores=False, showplot_importance=False)
xgbts.report_traintest()

xgbts.fillgaps(showplot_scores=False, showplot_importance=False)
xgbts.report_gapfilling()

observed = df[TARGET_COL]
gapfilled = xgbts.get_gapfilled_target()

frame = {
nee_mds.name: nee_mds,
gapfilled.name: gapfilled,
}
import pandas as pd
checkdf = pd.DataFrame.from_dict(frame, orient='columns')
checkdf = checkdf.groupby(checkdf.index.year).agg('sum')
checkdf['diff'] = checkdf[gapfilled.name].subtract(checkdf[nee_mds.name])
checkdf = checkdf.multiply(0.02161926)
print(checkdf)
print(checkdf.sum())
# xgbts.trainmodel(showplot_scores=False, showplot_importance=False)
# xgbts.report_traintest()
#
# xgbts.fillgaps(showplot_scores=False, showplot_importance=False)
# xgbts.report_gapfilling()

# observed = df[TARGET_COL]
# gapfilled = xgbts.get_gapfilled_target()

# frame = {
# nee_mds.name: nee_mds,
# gapfilled.name: gapfilled,
# }
# import pandas as pd
# checkdf = pd.DataFrame.from_dict(frame, orient='columns')
# checkdf = checkdf.groupby(checkdf.index.year).agg('sum')
# checkdf['diff'] = checkdf[gapfilled.name].subtract(checkdf[nee_mds.name])
# checkdf = checkdf.multiply(0.02161926)
# print(checkdf)
# print(checkdf.sum())

# rfts.feature_importances
# rfts.scores
@@ -304,15 +301,15 @@ def example_xgbts():
# f"\nMIN_SAMPLES_SPLIT: {MIN_SAMPLES_SPLIT} "
# f"/ MIN_SAMPLES_LEAF: {MIN_SAMPLES_LEAF} "
# )
title="title"
from diive.core.plotting.timeseries import TimeSeries
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
TimeSeries(series=gapfilled.multiply(0.02161926).cumsum(), ax=ax).plot(color='blue')
TimeSeries(series=nee_mds.multiply(0.02161926).cumsum(), ax=ax).plot(color='orange')
fig.suptitle(f'{title}', fontsize=16)
# ax.set_ylim(-2000, 200)
fig.show()
# title = "title"
# from diive.core.plotting.timeseries import TimeSeries
# import matplotlib.pyplot as plt
# fig, ax = plt.subplots()
# TimeSeries(series=gapfilled.multiply(0.02161926).cumsum(), ax=ax).plot(color='blue')
# TimeSeries(series=nee_mds.multiply(0.02161926).cumsum(), ax=ax).plot(color='orange')
# fig.suptitle(f'{title}', fontsize=16)
# # ax.set_ylim(-2000, 200)
# fig.show()

# from diive.core.plotting.heatmap_datetime import HeatmapDateTime
# HeatmapDateTime(series=observed).show()
@@ -333,19 +330,19 @@ def example_xgbts():
# # plt.legend()
# plt.show()

from diive.core.plotting.cumulative import CumulativeYear
CumulativeYear(
series=gapfilled.multiply(0.02161926),
series_units="units",
yearly_end_date=None,
# yearly_end_date='08-11',
start_year=1997,
end_year=2022,
show_reference=True,
excl_years_from_reference=None,
# excl_years_from_reference=[2022],
# highlight_year=2022,
highlight_year_color='#F44336').plot(digits_after_comma=0)
# from diive.core.plotting.cumulative import CumulativeYear
# CumulativeYear(
# series=gapfilled.multiply(0.02161926),
# series_units="units",
# yearly_end_date=None,
# # yearly_end_date='08-11',
# start_year=1997,
# end_year=2022,
# show_reference=True,
# excl_years_from_reference=None,
# # excl_years_from_reference=[2022],
# # highlight_year=2022,
# highlight_year_color='#F44336').plot(digits_after_comma=0)
# CumulativeYear(
# series=nee_mds.multiply(0.02161926),
# series_units="units",
@@ -369,6 +366,27 @@ def example_xgbts():
# each_month=True, legend_n_col=2, ylim=[-0.4, 0.2])
# # d = dc.get_data()

from yellowbrick.model_selection import ValidationCurve, validation_curve
import numpy as np
# viz = ValidationCurve(
# xgbts.model_, param_name="max_depth",
# param_range=np.arange(3, 10), cv=10, scoring="r2"
# )
y = df[TARGET_COL]
X = df[['Tair_f', 'VPD_f', 'Rg_f']]

viz = validation_curve(
xgbts.model_, X, y, param_name="n_estimators",
param_range=np.arange(10, 20), cv=10, scoring="r2",
)

# Fit and show the visualizer
viz.fit(X, y)
viz.show()

# from yellowbrick.datasets import load_energy
# x,y = load_energy()


print("Finished.")

3 changes: 2 additions & 1 deletion diive/pkgs/outlierdetection/stepwiseoutlierdetection.py
@@ -169,7 +169,8 @@ def flag_outliers_increments_zcore_test(self, thres_zscore: int = 30, showplot:

def flag_outliers_trim_low_test(self, trim_daytime: bool = False, trim_nighttime: bool = False,
lower_limit: float = None, showplot: bool = False, verbose: bool = False):
"""XXX"""
"""Flag values below a given absolute limit as outliers, then flag an
equal number of datapoints at the high end as outliers."""
series_cleaned = self._series_hires_cleaned.copy()
flagtest = TrimLow(series=series_cleaned, idstr=self.idstr,
trim_daytime=trim_daytime, trim_nighttime=trim_nighttime,
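
A minimal sketch of the trimming logic the new docstring describes (not the `TrimLow` implementation itself): flag everything below `lower_limit`, then flag an equal number of the highest values:

```python
import pandas as pd


def trim_low_flag(series: pd.Series, lower_limit: float) -> pd.Series:
    """Flag values below lower_limit and an equal number of the highest
    values (0=OK, 2=outlier). Sketch only, not the diive implementation."""
    flag = pd.Series(0, index=series.index)
    too_low = series < lower_limit
    flag[too_low] = 2
    n_low = int(too_low.sum())
    if n_low > 0:
        # Flag the same number of records at the high end of the data
        flag[series.nlargest(n_low).index] = 2
    return flag
```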
11 changes: 2 additions & 9 deletions diive/pkgs/qaqc/flags.py
@@ -63,17 +63,10 @@ def __init__(self, series: Series, idstr: str = None, verbose: bool = False):
super().__init__(series=series, flagid=self.flagid, idstr=idstr)
self.verbose = verbose

def calc(self, repeat: bool = False):
"""Calculate overall flag, based on individual flags from multiple iterations.
Args:
repeat: If *True*, the outlier detection is repeated until all
outliers are removed.
"""
def calc(self, repeat=False):
self._overall_flag, n_iterations = self.repeat(self.run_flagtests, repeat=False)
# if self.showplot:
# self.defaultplot(n_iterations=n_iterations)
# self.defaultplot(n_iterations=1)

if self.verbose:
print(f"MISSING VALUES TEST: Generated new flag variable {self.overall_flag.name}, "
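
The core of the missing-values test is simple; a sketch under the `0=OK` / `2=outlier` flag convention used throughout diive (the flag name below is hypothetical, not the class's actual output name):

```python
import numpy as np
import pandas as pd


def missing_values_flag(series: pd.Series) -> pd.Series:
    """Return a flag that is 0 where a value is available and 2 where it is
    missing. Illustrative sketch only."""
    return pd.Series(np.where(series.isna(), 2, 0), index=series.index,
                     name=f"FLAG_{series.name}_MISSING_TEST")  # hypothetical name
```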