The `id` argument needs to be more discoverable from documentation #505

MarekGierlinski · 2023-08-07T07:47:10Z

I often deal with data which is split into multiple files identified only by the file name. For example, experimental results from multiple samples, where the file name identifies the sample, but there is no information about the sample inside the file. It would be useful to have an option, while reading multiple files, to add a column with either the input file name or a name specified in a vector provided alongside input file names.

Here is an example. Consider 3 TSV files:

File sample_a.txt:

gene	count
gene1	19
gene2	22
gene3	14

File sample_b.txt:

gene	count
gene1	26
gene2	24
gene3	18

File sample_c.txt:

gene	count
gene1	22
gene2	17
gene3	24

A command:

files <- fs::dir_ls()
df <- vroom(files, col_file_name = "sample_file")

would create the following tibble:

# A tibble: 9 × 3
  gene  count sample_file 
  <chr> <int> <chr>       
1 gene1    19 sample_a.txt
2 gene2    22 sample_a.txt
3 gene3    14 sample_a.txt
4 gene1    26 sample_b.txt
5 gene2    24 sample_b.txt
6 gene3    18 sample_b.txt
7 gene1    22 sample_c.txt
8 gene2    17 sample_c.txt
9 gene3    24 sample_c.txt

Alternatively, a vector of names could be provided to fill the column. For example, file_names = c("a", "b", "c") would place a, b and c in the new column instead of the file names. You can probably come up with better names for these additional arguments.

I hope I'm not the only one who would find this useful.

jennybc · 2023-08-08T01:40:17Z

You can use the id argument of vroom() for this.

id
Either a string or 'NULL'. If a string, the output will contain a variable with that name with the filename(s) as the value. If 'NULL', the default, no variable will be created.
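As a minimal sketch of how this applies to the three sample files from the original post: the temporary directory, the way the files are recreated, and the column name `sample_file` are assumptions chosen for illustration, not part of the original report.

```r
library(vroom)

# Recreate the three sample files from the original post in a temporary
# directory (directory name chosen here for illustration only).
dir <- file.path(tempdir(), "vroom-id-demo")
dir.create(dir, showWarnings = FALSE)
samples <- list(
  sample_a = c(19L, 22L, 14L),
  sample_b = c(26L, 24L, 18L),
  sample_c = c(22L, 17L, 24L)
)
for (name in names(samples)) {
  vroom_write(
    data.frame(gene = paste0("gene", 1:3), count = samples[[name]]),
    file.path(dir, paste0(name, ".txt")),
    delim = "\t"
  )
}

files <- list.files(dir, pattern = "^sample_.*\\.txt$", full.names = TRUE)

# `id = "sample_file"` adds a column recording which file each row came from.
df <- vroom(files, id = "sample_file", delim = "\t", show_col_types = FALSE)
df
```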

But this is not advertised well in vroom's documentation, I will admit. It is more discoverable in readr, which is where most vroom usage actually originates. Here's an example borrowed from readr:

library(vroom)

continents <- c("africa", "americas", "asia", "europe", "oceania")
filepaths <- vapply(
  paste0("mini-gapminder-", continents, ".csv"),
  FUN = readr::readr_example,
  FUN.VALUE = character(1)
)
vroom(filepaths, id = "file")
#> Rows: 26 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): country
#> dbl (4): year, lifeExp, pop, gdpPercap
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 26 × 6
#>    file                                   country  year lifeExp    pop gdpPercap
#>    <chr>                                  <chr>   <dbl>   <dbl>  <dbl>     <dbl>
#>  1 /Users/jenny/Library/R/arm64/4.3/libr… Algeria  1952    43.1 9.28e6     2449.
#>  2 /Users/jenny/Library/R/arm64/4.3/libr… Angola   1952    30.0 4.23e6     3521.
#>  3 /Users/jenny/Library/R/arm64/4.3/libr… Benin    1952    38.2 1.74e6     1063.
#>  4 /Users/jenny/Library/R/arm64/4.3/libr… Botswa…  1952    47.6 4.42e5      851.
#>  5 /Users/jenny/Library/R/arm64/4.3/libr… Burkin…  1952    32.0 4.47e6      543.
#>  6 /Users/jenny/Library/R/arm64/4.3/libr… Burundi  1952    39.0 2.45e6      339.
#>  7 /Users/jenny/Library/R/arm64/4.3/libr… Argent…  1952    62.5 1.79e7     5911.
#>  8 /Users/jenny/Library/R/arm64/4.3/libr… Bolivia  1952    40.4 2.88e6     2677.
#>  9 /Users/jenny/Library/R/arm64/4.3/libr… Brazil   1952    50.9 5.66e7     2109.
#> 10 /Users/jenny/Library/R/arm64/4.3/libr… Canada   1952    68.8 1.48e7    11367.
#> # ℹ 16 more rows

Created on 2023-08-07 with reprex v2.0.2.9000
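The `id` column holds each path as it was passed to vroom(), which is why the output above shows long library paths. If you would rather have short sample labels, as requested in the original post, one option is to post-process the column afterwards. The sketch below uses base R only; the data frame is a hand-built stand-in for vroom() output, and the paths and labels are hypothetical.

```r
# Hand-built stand-in for the result of vroom(files, id = "sample_file");
# the paths below are hypothetical.
df <- data.frame(
  sample_file = rep(
    c("/data/run1/sample_a.txt",
      "/data/run1/sample_b.txt",
      "/data/run1/sample_c.txt"),
    each = 3
  ),
  gene = rep(paste0("gene", 1:3), times = 3),
  count = c(19, 22, 14, 26, 24, 18, 22, 17, 24)
)

# Option 1: keep only the file name, without directory or extension.
df$sample <- tools::file_path_sans_ext(basename(df$sample_file))

# Option 2: map each file name to a label of your choosing via a named vector.
labels <- c("sample_a.txt" = "a", "sample_b.txt" = "b", "sample_c.txt" = "c")
df$label <- unname(labels[basename(df$sample_file)])

# One row per input file, showing the original path and both derived columns.
unique(df[c("sample_file", "sample", "label")])
```

The lookup in option 2 is keyed on the file name rather than on position, so it does not depend on the order in which the files were read.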

I'm going to change the title of this issue to reflect the need for documentation.

jennybc · 2023-08-08T01:42:02Z

"Reading multiple files" is featured prominently in the README, so that would be an obvious place to use or at least mention id. Probably in addition to adding an example for vroom().

MarekGierlinski · 2023-08-08T07:54:15Z

Oh, indeed, it is there. I was actually learning vroom from the tidyverse blog, which also contains a section on reading multiple files. It would be nice to update this one too, if the author is available to do it.

Thanks a lot for your help, and for being so nice about what is essentially an RTFM issue.

jennybc changed the title ~~Consider adding a column with file name while reading multiple files~~ The id argument needs to be more discoverable from documentation Aug 8, 2023

jennybc closed this as completed in ea00a1f Sep 24, 2023
