04_basic_filters.qmd

---
title: Basic Filtering
---

```{r}
#| include: false
knitr::read_chunk("XX_functions.R")
```

Here we apply broad filters for missing details, time, space, and species. 
We'll omit runs missing deployment information for tags and receivers, 
runs which occur during times we're not interested in, and runs which occur
for species/receiver pairs which are not within a particular species' range 
(ranges were determined in [Range Maps](03_range_maps.html)), also omitting 
species we're not interested in.

The idea is to reduce the size of the data we're working with as much as possible
before we get into the fine details of assessing data quality.

However, we don't want to simply delete records in our original data because

- we're working with data which may need to be updated
- if we delete the data we can't go back if we change our criteria

So we will create lists of all the run/tagID combinations *to be removed*,
and we'll store these as a new dataset.

When we want to calculate our final summaries, we can apply these filters 
in the last step *before* `collect()`ing (flattening) the feather data files. 

## Setup
```{r}
#| message: false
#| cache: false
source("XX_setup.R")

recvs <- read_csv("Data/02_Datasets/receivers.csv")
runs <- open_dataset("Data/02_Datasets/runs", format = "feather")
```

## Non-deployments

Both tags and receivers have device ids and deployment ids. 
Deployments are associated with a particular device (tag or receiver) 
being deployed at a certain time. 

Many erroneous runs are associated with a tag at a time
in which that tag was not recorded as deployed. 

Funnily enough this can also occur with receivers where the run occurs at a time
where the associated receiver is not recorded as deployed.

This could occur for several reasons:

- The metadata is incorrect and the tag/receiver was, in fact deployed at that time
- This is false positive recording of the tag
- This is non-trustworthy receiver data, which may have been collecting during 
  testing or installation

As we cannot be sure which is correct, for now we will omit those runs.

:::{.panel-tabset}

### Create filter
In our custom `runs` tables, any run/tag combination with missing a `tagDeployID` 
or a `recvDeployID` did not actually have a record of the tag/receiver being deployed at that time.

:::{.callout-tip}
#### `speciesID` vs. `tagDeployID`
Technically we should omit those without a tagDeployID, however, looking at the
records which *have* a tagDeployID but do *not have* a speciesID shows that 
they are all test tags, so we'll go ahead and just remove all records missing
a speciesID.

See the "Explore - Tags" panel for this exploration
:::

```{r}
no_deps <- filter(runs, is.na(speciesID) | is.na(recvDeployID)) |>
  collect()
```

### Explore - Tags

To be sure that we're not missing things, we'll explore what is removed if
we use `tagDeployID` rather than `speciesID` (as we could conceivably have
tags with a deploy id but not species ids).

```{r}
no_tag_dep_check <- filter(runs, is.na(tagDeployID))
```


For example, in this project for this tag, we have no species ids for a series
of runs. 
```{r}
no_tag_dep_check |>
  select(tagID, tagDeployID, speciesID, tsBegin, tsEnd, motusFilter) |>
  filter(tagID == 63138) |>
  collect_ts()
```

Note that the `motusFilter` is also 0, indicating that they are likely false positives
based on other metrics as well.

Now if we look at the tag deployment dates for that tag, we see that these
runs occurred even before the tag was deployed, further indicating that they 
are false positives.
```{r}
runs |>
  filter(tagID == 63138) |>
  select(tagID, tagDeployID,  tsStartTag, tsEndTag, speciesID) |>
  distinct() |>
  collect_ts()
```

However, just because we have a tagDeployID, it doesn't necessarily mean we
have a speciesID.

But in all cases where there *is* a tagDeployID but there *is not* a speciesID, 
it's because the tag is a test tag (`test = 1`). 
```{r}
runs |> 
  filter(!is.na(tagDeployID), is.na(speciesID)) |>
  select(tagID, tagDeployID, test) |>
  distinct() |>
  collect()
```

Therefore, we can actually remove all records with missing species ids without worry.


### Explore - Receivers

For example, in this project (484) for this receiver device ID (3071), 
we have no deployment ids (`recvDeployID`) for a series of runs. 

```{r}
filter(runs, is.na(recvDeployID)) |>
  select(proj_id, recvDeployID, recvDeviceID) |>
  distinct() |>
  collect()

filter(runs, is.na(recvDeployID)) |>
  filter(proj_id == 484, recvDeviceID == 3071) |>
  select(proj_id, runID, recvDeployID, recvDeviceID, tagDeployID) |>
  collect_ts()
```

Now if we look at the receiver deployment dates for that device, we see that these
runs occurred before the receiver was deployed, although not that much earlier. 

Perhaps these are semi-legitimate runs, which were detected when testing and 
installing the receiver.
```{r}
tbl(dbs[["484"]], "recvDeps") |>
  filter(deviceID == "3071") |>
  select(deviceID, deployID, projectID, tsStart, tsEnd, status) |>
  collect_ts()
```
:::

## Time

In this filter, we omit times of year that don't apply to our study.
This is the easiest filter to apply as it doesn't rely on metadata or other variables.

**Which dates do we keep?**

- Fall and Spring migration
- August - December & February - July

:::{.panel-tabset}

### Create filter
Get all runs that are NOT within these dates (Jan, July, Dec)
```{r}
noise_time <- runs |>
  filter(monthBegin %in% c(1, 7, 12)) |>
  collect()
```

### Example of the data omitted

```{r}
#| fig-width: 8
#| fig-asp: 1.5

d <- runs |>
  filter(proj_id == 464) |>
  mutate(omit = if_else(monthBegin %in% c(1, 7, 12), TRUE, FALSE)) |>
  select(runID, tagID, tsBegin, omit, timeBegin) |>
  collect()

ggplot(data = d, aes(x = timeBegin, y = factor(tagID), colour = omit)) +
  theme_bw() +
  geom_point() +
  scale_color_viridis_d(end = 0.8)
```
:::

## Space

Here we identify runs which are *not* associated with appropriate stations as determined by
overlap with a species range map ([Range Maps](03_range_maps.html)).

These are runs associated with a species which is unlikely to be found at that
station and can therefore be assumed to be false positives.

Note that this step *also* omits species we're not interested in, as they are
not included in the list of species ranges and receivers.

:::{.panel-tabset}


### Create filter

First we'll add the master list of whether or not a species is in range for each 
receiver deployment.
```{r}
rs <- recvs |>
  mutate(speciesID = as.integer(id), 
         recvDeployID = as.integer(recvDeployID)) |>
  filter(in_range) |>
  select(speciesID, recvDeployID, in_range)
```

Now, get all runs which are NOT in range given the species and the receiver deployment.
```{r}
noise_space <- runs |>
  left_join(rs, by = c("recvDeployID", "speciesID")) |>
  filter(is.na(in_range)) |>
  select(-"in_range") |>
  collect()
```


### Check filter values

Many of these omitted are omitted because they are missing the speciesID or the
recevDeployID (which is already accounted for the the [Non-deployments](#non-depoloyments) 
section above.

However, let's take a look at the remaining ones to check.

Hmm, most of these are poor quality runs (i.e. the `motusFilter` is 0). 
So this is a good sign that these are indeed false positives which should be 
omitted.

```{r}
noise_space |> 
  filter(!is.na(speciesID), !is.na(recvDeployID)) |>
  count(proj_id, motusFilter) |>
  collect() |>
  gt() |>
  gt_theme()
```


Let's take a closer look at one species in one project.

There is quite possibly a lot of noise (see the amount of scatter in the `motusFilter = 0` 
category. And clearly this individual (Spotted Towhee) had a home base in 
southern BC.

```{r}
ns <- noise_space |> 
  filter(proj_id == 484, speciesID == 18550) |>
  select("runID", "tagID") |>
  mutate(in_range = 0)

coords <- tbl(dbs[["484"]], "recvDeps") |> 
  select("deployID", "latitude", "longitude") |> 
  collect()

rn <- filter(runs, proj_id == 484, speciesID == 18550) |>
  left_join(coords, by = c("recvDeployID" = "deployID")) |>
  left_join(ns, by = c("runID", "tagID")) |>
  filter(!is.na(latitude), !is.na(longitude)) |>
  select(motusFilter, tagDeployID, tsBegin, tsEnd, in_range, latitude, longitude) |>
  collect() |>
  mutate(in_range = factor(replace_na(in_range, 1)))

map <- ne_countries(country = c("Canada", "United States of America"), returnclass = "sf")

rn_sf_cnt <- rn |>
  summarize(n = n(), .by = c("motusFilter", "in_range", "latitude", "longitude")) |> 
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326)
  
ggplot(rn_sf_cnt) +
  geom_sf(data = map) +
  geom_sf(aes(colour = in_range, size = n)) +
  facet_wrap(~motusFilter)

```

:::

## Duplicate runIDs

Double check that there are no duplicate runIDs, which can happen when joining
all the data together if we don't account for duplicate/overlapping receiver/tag 
deployments. 

Additionally there are some duplicates based on the same runID in different batches...
This occurs in `allruns` too, so isn't an artifact of the custom joins, but still
not ideal. 

These *should* have been dealt with in `custom_runs()` but just in case, we'll
verify here.

```{r}
runs |>
  count(runID) |>
  filter(n > 1) |>
  collect()
```

Okay, good, there are zero duplicates (zero rows).


## Looking at the filters

First we'll combine and collect the 'bad' runs.
```{r}
noise <- bind_rows(no_deps, noise_time, noise_space) |>
  distinct()
```


Next we'll take a look at how this compares to the motusFilter.


Original runs
```{r}
count(runs, proj_id, motusFilter) |>
  filter(proj_id == 484) |>
  collect() |>
  arrange(motusFilter)
```

Remaining (non-bad) runs
```{r}
anti_join(runs, noise, by = c("runID", "tagDeployID", "recvDeployID")) |>
  filter(proj_id == 484) |>
  count(proj_id, motusFilter) |>
  collect()
```

There are still many remaining 'bad' data (according to the `motusFilter`)... perhaps we should
use that as well, or see what happens after we do the next stage of fine scale
filtering.

## Saving filters

We'll save the 'bad data' for use in the next steps.

```{r}
write_feather(noise, sink = "Data/02_Datasets/noise_runs.feather")
```

## Wrap up
Disconnect from the databases
```{r}
walk(dbs, dbDisconnect)
```

## Reproducibility
{{< include _reproducibility.qmd >}}