-
Notifications
You must be signed in to change notification settings - Fork 0
/
04_basic_filters.qmd
370 lines (275 loc) · 10.1 KB
/
04_basic_filters.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
---
title: Basic Filtering
---
```{r}
#| include: false
knitr::read_chunk("XX_functions.R")
```
Here we apply broad filters for missing details, time, space, and species.
We'll omit runs missing deployment information for tags and receivers,
runs which occur during times we're not interested in, and runs which occur
for species/receiver pairs which are not within a particular species' range
(ranges were determined in [Range Maps](03_range_maps.html)), also omitting
species we're not interested in.
The idea is to reduce the size of the data we're working with as much as possible
before we get into the fine details of assessing data quality.
However, we don't want to simply delete records in our original data because
- we're working with data which may need to be updated
- if we delete the data we can't go back if we change our criteria
So we will create lists of all the run/tagID combinations *to be removed*,
and we'll store these as a new dataset.
When we want to calculate our final summaries, we can apply these filters
in the last step *before* `collect()`ing (flattening) the feather data files.
## Setup
```{r}
#| message: false
#| cache: false
source("XX_setup.R")
recvs <- read_csv("Data/02_Datasets/receivers.csv")
runs <- open_dataset("Data/02_Datasets/runs", format = "feather")
```
## Non-deployments
Both tags and receivers have device ids and deployment ids.
Deployments are associated with a particular device (tag or receiver)
being deployed at a certain time.
Many erroneous runs are associated with a tag at a time
in which that tag was not recorded as deployed.
Funnily enough this can also occur with receivers where the run occurs at a time
where the associated receiver is not recorded as deployed.
This could occur for several reasons:
- The metadata is incorrect and the tag/receiver was, in fact deployed at that time
- This is false positive recording of the tag
- This is non-trustworthy receiver data, which may have been collecting during
testing or installation
As we cannot be sure which is correct, for now we will omit those runs.
:::{.panel-tabset}
### Create filter
In our custom `runs` tables, any run/tag combination with missing a `tagDeployID`
or a `recvDeployID` did not actually have a record of the tag/receiver being deployed at that time.
:::{.callout-tip}
#### `speciesID` vs. `tagDeployID`
Technically we should omit those without a tagDeployID, however, looking at the
records which *have* a tagDeployID but do *not have* a speciesID shows that
they are all test tags, so we'll go ahead and just remove all records missing
a speciesID.
See the "Explore - Tags" panel for this exploration
:::
```{r}
no_deps <- filter(runs, is.na(speciesID) | is.na(recvDeployID)) |>
collect()
```
### Explore - Tags
To be sure that we're not missing things, we'll explore what is removed if
we use `tagDeployID` rather than `speciesID` (as we could conceivably have
tags with a deploy id but not species ids).
```{r}
no_tag_dep_check <- filter(runs, is.na(tagDeployID))
```
For example, in this project for this tag, we have no species ids for a series
of runs.
```{r}
no_tag_dep_check |>
select(tagID, tagDeployID, speciesID, tsBegin, tsEnd, motusFilter) |>
filter(tagID == 63138) |>
collect_ts()
```
Note that the `motusFilter` is also 0, indicating that they are likely false positives
based on other metrics as well.
Now if we look at the tag deployment dates for that tag, we see that these
runs occurred even before the tag was deployed, further indicating that they
are false positives.
```{r}
runs |>
filter(tagID == 63138) |>
select(tagID, tagDeployID, tsStartTag, tsEndTag, speciesID) |>
distinct() |>
collect_ts()
```
However, just because we have a tagDeployID, it doesn't necessarily mean we
have a speciesID.
But in all cases where there *is* a tagDeployID but there *is not* a speciesID,
it's because the tag is a test tag (`test = 1`).
```{r}
runs |>
filter(!is.na(tagDeployID), is.na(speciesID)) |>
select(tagID, tagDeployID, test) |>
distinct() |>
collect()
```
Therefore, we can actually remove all records with missing species ids without worry.
### Explore - Receivers
For example, in this project (484) for this receiver device ID (3071),
we have no deployment ids (`recvDeployID`) for a series of runs.
```{r}
filter(runs, is.na(recvDeployID)) |>
select(proj_id, recvDeployID, recvDeviceID) |>
distinct() |>
collect()
filter(runs, is.na(recvDeployID)) |>
filter(proj_id == 484, recvDeviceID == 3071) |>
select(proj_id, runID, recvDeployID, recvDeviceID, tagDeployID) |>
collect_ts()
```
Now if we look at the receiver deployment dates for that device, we see that these
runs occurred before the receiver was deployed, although not that much earlier.
Perhaps these are semi-legitimate runs, which were detected when testing and
installing the receiver.
```{r}
tbl(dbs[["484"]], "recvDeps") |>
filter(deviceID == "3071") |>
select(deviceID, deployID, projectID, tsStart, tsEnd, status) |>
collect_ts()
```
:::
## Time
In this filter, we omit times of year that don't apply to our study.
This is the easiest filter to apply as it doesn't rely on metadata or other variables.
**Which dates do we keep?**
- Fall and Spring migration
- August - December & February - July
:::{.panel-tabset}
### Create filter
Get all runs that are NOT within these dates (Jan, July, Dec)
```{r}
noise_time <- runs |>
filter(monthBegin %in% c(1, 7, 12)) |>
collect()
```
### Example of the data omitted
```{r}
#| fig-width: 8
#| fig-asp: 1.5
d <- runs |>
filter(proj_id == 464) |>
mutate(omit = if_else(monthBegin %in% c(1, 7, 12), TRUE, FALSE)) |>
select(runID, tagID, tsBegin, omit, timeBegin) |>
collect()
ggplot(data = d, aes(x = timeBegin, y = factor(tagID), colour = omit)) +
theme_bw() +
geom_point() +
scale_color_viridis_d(end = 0.8)
```
:::
## Space
Here we identify runs which are *not* associated with appropriate stations as determined by
overlap with a species range map ([Range Maps](03_range_maps.html)).
These are runs associated with a species which is unlikely to be found at that
station and can therefore be assumed to be false positives.
Note that this step *also* omits species we're not interested in, as they are
not included in the list of species ranges and receivers.
:::{.panel-tabset}
### Create filter
First we'll add the master list of whether or not a species is in range for each
receiver deployment.
```{r}
rs <- recvs |>
mutate(speciesID = as.integer(id),
recvDeployID = as.integer(recvDeployID)) |>
filter(in_range) |>
select(speciesID, recvDeployID, in_range)
```
Now, get all runs which are NOT in range given the species and the receiver deployment.
```{r}
noise_space <- runs |>
left_join(rs, by = c("recvDeployID", "speciesID")) |>
filter(is.na(in_range)) |>
select(-"in_range") |>
collect()
```
### Check filter values
Many of these omitted are omitted because they are missing the speciesID or the
recevDeployID (which is already accounted for the the [Non-deployments](#non-depoloyments)
section above.
However, let's take a look at the remaining ones to check.
Hmm, most of these are poor quality runs (i.e. the `motusFilter` is 0).
So this is a good sign that these are indeed false positives which should be
omitted.
```{r}
noise_space |>
filter(!is.na(speciesID), !is.na(recvDeployID)) |>
count(proj_id, motusFilter) |>
collect() |>
gt() |>
gt_theme()
```
Let's take a closer look at one species in one project.
There is quite possibly a lot of noise (see the amount of scatter in the `motusFilter = 0`
category. And clearly this individual (Spotted Towhee) had a home base in
southern BC.
```{r}
ns <- noise_space |>
filter(proj_id == 484, speciesID == 18550) |>
select("runID", "tagID") |>
mutate(in_range = 0)
coords <- tbl(dbs[["484"]], "recvDeps") |>
select("deployID", "latitude", "longitude") |>
collect()
rn <- filter(runs, proj_id == 484, speciesID == 18550) |>
left_join(coords, by = c("recvDeployID" = "deployID")) |>
left_join(ns, by = c("runID", "tagID")) |>
filter(!is.na(latitude), !is.na(longitude)) |>
select(motusFilter, tagDeployID, tsBegin, tsEnd, in_range, latitude, longitude) |>
collect() |>
mutate(in_range = factor(replace_na(in_range, 1)))
map <- ne_countries(country = c("Canada", "United States of America"), returnclass = "sf")
rn_sf_cnt <- rn |>
summarize(n = n(), .by = c("motusFilter", "in_range", "latitude", "longitude")) |>
st_as_sf(coords = c("longitude", "latitude"), crs = 4326)
ggplot(rn_sf_cnt) +
geom_sf(data = map) +
geom_sf(aes(colour = in_range, size = n)) +
facet_wrap(~motusFilter)
```
:::
## Duplicate runIDs
Double check that there are no duplicate runIDs, which can happen when joining
all the data together if we don't account for duplicate/overlapping receiver/tag
deployments.
Additionally there are some duplicates based on the same runID in different batches...
This occurs in `allruns` too, so isn't an artifact of the custom joins, but still
not ideal.
These *should* have been dealt with in `custom_runs()` but just in case, we'll
verify here.
```{r}
runs |>
count(runID) |>
filter(n > 1) |>
collect()
```
Okay, good, there are zero duplicates (zero rows).
## Looking at the filters
First we'll combine and collect the 'bad' runs.
```{r}
noise <- bind_rows(no_deps, noise_time, noise_space) |>
distinct()
```
Next we'll take a look at how this compares to the motusFilter.
Original runs
```{r}
count(runs, proj_id, motusFilter) |>
filter(proj_id == 484) |>
collect() |>
arrange(motusFilter)
```
Remaining (non-bad) runs
```{r}
anti_join(runs, noise, by = c("runID", "tagDeployID", "recvDeployID")) |>
filter(proj_id == 484) |>
count(proj_id, motusFilter) |>
collect()
```
There are still many remaining 'bad' data (according to the `motusFilter`)... perhaps we should
use that as well, or see what happens after we do the next stage of fine scale
filtering.
## Saving filters
We'll save the 'bad data' for use in the next steps.
```{r}
write_feather(noise, sink = "Data/02_Datasets/noise_runs.feather")
```
## Wrap up
Disconnect from the databases
```{r}
walk(dbs, dbDisconnect)
```
## Reproducibility
{{< include _reproducibility.qmd >}}