The id argument needs to be more discoverable from documentation #505

Closed
MarekGierlinski opened this issue Aug 7, 2023 · 3 comments

@MarekGierlinski

I often deal with data which is split into multiple files identified only by the file name. For example, experimental results from multiple samples, where the file name identifies the sample, but there is no information about the sample inside the file. It would be useful to have an option, while reading multiple files, to add a column with either the input file name or a name specified in a vector provided alongside input file names.

Here is an example. Consider 3 TSV files:

File sample_a.txt:

gene	count
gene1	19
gene2	22
gene3	14

File sample_b.txt:

gene	count
gene1	26
gene2	24
gene3	18

File sample_c.txt:

gene	count
gene1	22
gene2	17
gene3	24

A command:

files <- fs::dir_ls()
df <- vroom(files, col_file_name = "sample_file")

would create the following tibble:

# A tibble: 9 × 3
  gene  count sample_file 
  <chr> <int> <chr>       
1 gene1    19 sample_a.txt
2 gene2    22 sample_a.txt
3 gene3    14 sample_a.txt
4 gene1    26 sample_b.txt
5 gene2    24 sample_b.txt
6 gene3    18 sample_b.txt
7 gene1    22 sample_c.txt
8 gene2    17 sample_c.txt
9 gene3    24 sample_c.txt

Alternatively, a vector of names could be provided to populate the column; for example, file_names = c("a", "b", "c") would place a, b and c in the column instead of the file names. You can probably come up with better names for these additional arguments.
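
The label variant amounts to a lookup from file name to label. Here is a minimal sketch in base R, assuming the file name is already present as a column; the `labels` vector and the small stand-in data frame are illustrative only and not part of vroom:

```r
# Hypothetical labels, one per input file, keyed by file name.
labels <- c("sample_a.txt" = "a", "sample_b.txt" = "b", "sample_c.txt" = "c")

# Stand-in for the combined table, with one row per file for brevity.
df <- data.frame(
  gene = "gene1",
  count = c(19L, 26L, 22L),
  sample_file = c("sample_a.txt", "sample_b.txt", "sample_c.txt"),
  stringsAsFactors = FALSE
)

# Index the named vector by file name to translate each row to its label.
df$sample <- unname(labels[df$sample_file])
```

Keying the lookup by file name rather than by position means the labels stay correct even if the files are listed in a different order.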

I hope I'm not the only one who would find this useful.

@jennybc
Member

jennybc commented Aug 8, 2023

You can use the id argument of vroom() for this.

id
Either a string or 'NULL'. If a string, the output will contain a variable with that name with the filename(s) as the value. If 'NULL', the default, no variable will be created.

But this is not advertised well in vroom's documentation, I will admit. It is more discoverable in readr, which is where most vroom usage actually originates. Here's an example borrowed from readr:

library(vroom)

continents <- c("africa", "americas", "asia", "europe", "oceania")
filepaths <- vapply(
  paste0("mini-gapminder-", continents, ".csv"),
  FUN = readr::readr_example,
  FUN.VALUE = character(1)
)
vroom(filepaths, id = "file")
#> Rows: 26 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): country
#> dbl (4): year, lifeExp, pop, gdpPercap
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 26 × 6
#>    file                                   country  year lifeExp    pop gdpPercap
#>    <chr>                                  <chr>   <dbl>   <dbl>  <dbl>     <dbl>
#>  1 /Users/jenny/Library/R/arm64/4.3/libr… Algeria  1952    43.1 9.28e6     2449.
#>  2 /Users/jenny/Library/R/arm64/4.3/libr… Angola   1952    30.0 4.23e6     3521.
#>  3 /Users/jenny/Library/R/arm64/4.3/libr… Benin    1952    38.2 1.74e6     1063.
#>  4 /Users/jenny/Library/R/arm64/4.3/libr… Botswa…  1952    47.6 4.42e5      851.
#>  5 /Users/jenny/Library/R/arm64/4.3/libr… Burkin…  1952    32.0 4.47e6      543.
#>  6 /Users/jenny/Library/R/arm64/4.3/libr… Burundi  1952    39.0 2.45e6      339.
#>  7 /Users/jenny/Library/R/arm64/4.3/libr… Argent…  1952    62.5 1.79e7     5911.
#>  8 /Users/jenny/Library/R/arm64/4.3/libr… Bolivia  1952    40.4 2.88e6     2677.
#>  9 /Users/jenny/Library/R/arm64/4.3/libr… Brazil   1952    50.9 5.66e7     2109.
#> 10 /Users/jenny/Library/R/arm64/4.3/libr… Canada   1952    68.8 1.48e7    11367.
#> # ℹ 16 more rows

Created on 2023-08-07 with reprex v2.0.2.9000
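
Applied to the three sample files from the original post, the same call might look like the sketch below. It assumes the files live in a scratch directory created just for this demonstration; the directory name and the `basename()` clean-up step are illustrative additions, not something vroom does on its own.

```r
library(vroom)

# Recreate the three sample files from the original post in a scratch directory.
demo_dir <- file.path(tempdir(), "vroom-id-demo")
dir.create(demo_dir, showWarnings = FALSE)
counts <- list(a = c(19, 22, 14), b = c(26, 24, 18), c = c(22, 17, 24))
for (s in names(counts)) {
  write.table(
    data.frame(gene = paste0("gene", 1:3), count = counts[[s]]),
    file = file.path(demo_dir, paste0("sample_", s, ".txt")),
    sep = "\t", quote = FALSE, row.names = FALSE
  )
}

# Read all files at once; `id` names the column that records the source file.
files <- list.files(demo_dir, pattern = "^sample_.*\\.txt$", full.names = TRUE)
df <- vroom(files, id = "sample_file", delim = "\t", show_col_types = FALSE)

# The id column holds the path as supplied; reduce it to the bare file name.
df$sample_file <- basename(df$sample_file)
```

As in the output above, the id column is filled with the full path and comes first in the result, so trimming it with `basename()` gives values such as `sample_a.txt`.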

I'm going to change the title of this issue to reflect the need for documentation.

jennybc changed the title from "Consider adding a column with file name while reading multiple files" to "The id argument needs to be more discoverable from documentation" on Aug 8, 2023
@jennybc
Member

jennybc commented Aug 8, 2023

"Reading multiple files" is featured prominently in the README, so that would be an obvious place to use or at least mention id. Probably in addition to adding an example for vroom().

@MarekGierlinski
Author

Oh, indeed, it is there. I was actually learning vroom from the tidyverse blog, which also contains a section on reading multiple files. It would be nice to update this one too, if the author is available to do it.

Thanks a lot for your help and being so nice, as it is essentially an RTFM issue.
