vroom::vroom() reads data as a single column #126

Closed
nerutenbeck opened this issue Jun 3, 2019 · 7 comments

Comments

@nerutenbeck

nerutenbeck commented Jun 3, 2019

In attempting to read a .csv file with 44 columns, I ran into a parsing failure that produced a data object with a single column. Reading the same file with readr::read_csv() succeeds.

treeUrl <- "https://apps.fs.usda.gov/fia/datamart/CSV/WI_TREE.zip"
treeZip <- "/tmp/WI_TREE.zip"
download.file(treeUrl, treeZip)              # fetch the FIA tree table archive
system(paste0("cd /tmp; unzip WI_TREE.zip")) # extract WI_TREE.csv
treeFile <- "/tmp/WI_TREE.csv"

# vroom mis-guesses the delimiter and returns a single-column object
vroom_trees <- vroom::vroom(treeFile)

# readr parses the same file successfully
readr_trees <- readr::read_csv(treeFile)
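
A quick way to see the symptom (an illustrative check, not part of the original report) is to compare the dimensions of the two results:

dim(vroom_trees)   # reportedly a single column
dim(readr_trees)   # the full set of columns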

Thanks for the great package!

@jimhester
Collaborator

The heuristic used to guess the delimiter is not perfect; in this case it does not guess the delimiter correctly, so a newline is used as the fallback. You can specify the delimiter explicitly for this data with the delim argument.

vroom::vroom("/tmp/WI_TREE.csv", delim = ",")
#> Observations: 1,109,323
#> Variables: 207
#> chr  [  1]: P2A_GRM_FLG
#> dbl  [104]: CN, PLT_CN, PREV_TRE_CN, INVYR, STATECD, UNITCD, COUNTYCD, PLOT, SUBP, TREE...
#> lgl  [100]: DAMTYP2, DAMSEV2, WDLDSTEM, CVIGORCD, TREEHISTCD, BHAGE, TOTAGE, CULLDEAD, ...
#> date [  2]: CREATED_DATE, MODIFIED_DATE
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 1,109,323 x 207
#>         CN  PLT_CN PREV_TRE_CN INVYR STATECD UNITCD COUNTYCD  PLOT  SUBP
#>      <dbl>   <dbl>       <dbl> <dbl>   <dbl>  <dbl>    <dbl> <dbl> <dbl>
#>  1 2.41e13 2.41e13          NA  1983      55      1       37     2   101
#>  2 2.41e13 2.41e13          NA  1983      55      1       37     2   102
#>  3 2.41e13 2.41e13          NA  1983      55      1       37     2   103
#>  4 2.41e13 2.41e13          NA  1983      55      1       37     2   103
#>  5 2.41e13 2.41e13          NA  1983      55      1       37     2   104
#>  6 2.41e13 2.41e13          NA  1983      55      1       37     2   104
#>  7 2.41e13 2.41e13          NA  1983      55      1       37     2   104
#>  8 2.41e13 2.41e13          NA  1983      55      1       37     2   107
#>  9 2.41e13 2.41e13          NA  1983      55      1       37     2   108
#> 10 2.41e13 2.41e13          NA  1983      55      1       37     2   110
#> # … with 1,109,313 more rows, and 198 more variables: TREE <dbl>,
#> #   CONDID <dbl>, AZIMUTH <dbl>, DIST <dbl>, PREVCOND <dbl>,
#> #   STATUSCD <dbl>, SPCD <dbl>, SPGRPCD <dbl>, DIA <dbl>, DIAHTCD <dbl>,
#> #   HT <dbl>, HTCD <dbl>, ACTUALHT <dbl>, TREECLCD <dbl>, CR <dbl>,
#> #   CCLCD <dbl>, TREEGRCD <dbl>, AGENTCD <dbl>, CULL <dbl>, DAMLOC1 <dbl>,
#> #   DAMTYP1 <dbl>, DAMSEV1 <dbl>, DAMLOC2 <dbl>, DAMTYP2 <lgl>,
#> #   DAMSEV2 <lgl>, DECAYCD <dbl>, STOCKING <dbl>, WDLDSTEM <lgl>,
#> #   VOLCFNET <dbl>, VOLCFGRS <dbl>, VOLCSNET <dbl>, VOLCSGRS <dbl>,
#> #   VOLBFNET <dbl>, VOLBFGRS <dbl>, VOLCFSND <dbl>, GROWCFGS <dbl>,
#> #   GROWBFSL <dbl>, GROWCFAL <dbl>, MORTCFGS <dbl>, MORTBFSL <dbl>,
#> #   MORTCFAL <dbl>, REMVCFGS <dbl>, REMVBFSL <dbl>, REMVCFAL <dbl>,
#> #   DIACHECK <dbl>, MORTYR <dbl>, SALVCD <dbl>, UNCRCD <dbl>,
#> #   CPOSCD <dbl>, CLIGHTCD <dbl>, CVIGORCD <lgl>, CDENCD <dbl>,
#> #   CDIEBKCD <dbl>, TRANSCD <dbl>, TREEHISTCD <lgl>, DIACALC <dbl>,
#> #   BHAGE <lgl>, TOTAGE <lgl>, CULLDEAD <lgl>, CULLFORM <lgl>,
#> #   CULLMSTOP <lgl>, CULLBF <lgl>, CULLCF <lgl>, BFSND <lgl>, CFSND <lgl>,
#> #   SAWHT <lgl>, BOLEHT <lgl>, FORMCL <lgl>, HTCALC <dbl>,
#> #   HRDWD_CLUMP_CD <lgl>, SITREE <dbl>, CREATED_BY <lgl>,
#> #   CREATED_DATE <date>, CREATED_IN_INSTANCE <dbl>, MODIFIED_BY <lgl>,
#> #   MODIFIED_DATE <date>, MODIFIED_IN_INSTANCE <dbl>, MORTCD <lgl>,
#> #   HTDMP <dbl>, ROUGHCULL <lgl>, MIST_CL_CD <lgl>, CULL_FLD <dbl>,
#> #   RECONCILECD <dbl>, PREVDIA <dbl>, FGROWCFGS <dbl>, FGROWBFSL <dbl>,
#> #   FGROWCFAL <dbl>, FMORTCFGS <dbl>, FMORTBFSL <dbl>, FMORTCFAL <dbl>,
#> #   FREMVCFGS <dbl>, FREMVBFSL <dbl>, FREMVCFAL <dbl>, P2A_GRM_FLG <chr>,
#> #   TREECLCD_NERS <lgl>, TREECLCD_SRS <lgl>, TREECLCD_NCRS <dbl>,
#> #   TREECLCD_RMRS <lgl>, STANDING_DEAD_CD <dbl>, PREV_STATUS_CD <dbl>, …

Created on 2019-06-04 by the reprex package (v0.2.1)
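
As the output above notes, spec() returns a copy-pastable column specification and col_types silences the guessing message. A minimal sketch of that follow-up (the object name and the two columns shown are just an illustrative subset of the 207; in practice paste in the full specification that spec() prints):

wi_trees <- vroom::vroom("/tmp/WI_TREE.csv", delim = ",")
vroom::spec(wi_trees)   # prints a copy-pastable cols() specification

# re-read with the types pinned down (illustrative subset only)
vroom::vroom("/tmp/WI_TREE.csv", delim = ",",
             col_types = vroom::cols(CN = vroom::col_double(),
                                     P2A_GRM_FLG = vroom::col_character()))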

@jimhester
Collaborator

If we look into more robust parsing (#105), this would be a useful test dataset.

@mhbw

mhbw commented Jun 14, 2019

I ran into this as well; wouldn't having the same operational heuristic as read.csv be helpful? Since you're mostly looking at doing a csv reading product, it seems odd that it wouldn't automatically know the separator is a comma...

@jimhester
Collaborator

How am I mostly looking at doing a csv reading product?

@nerutenbeck
Author

I ran into this as well; wouldn't having the same operational heuristic as read.csv be helpful? Since you're mostly looking at doing a csv reading product, it seems odd that it wouldn't automatically know the separator is a comma...

This is rude. Adding delim = "," explicitly solves the problem, and more robust parsing has already been suggested as a feature.

@mhbw

mhbw commented Jun 14, 2019

I ran into this as well; wouldn't having the same operational heuristic as read.csv be helpful? Since you're mostly looking at doing a csv reading product, it seems odd that it wouldn't automatically know the separator is a comma...

This is rude.

You know what? You're right. I could have phrased it differently, and the world doesn't need more rude people on the internet, so I apologize. Thanks for pointing it out.

@mhbw

mhbw commented Jun 14, 2019

How am I mostly looking at doing a csv reading product?

Well, maybe I misread it, but as the docs say:

"The most common type of delimited files are CSV (Comma Separated Values) files"

and the tidyverse page that I found this through says:

"vroom reads rectangular data, such as comma separated (csv), tab separated (tsv) or fixed width files (fwf) into R. It performs similar roles to functions like readr::read_csv(), data.table::fread() or read.csv(). But for many datasets vroom::vroom() can read them much, much faster (hence the name)."

That explicitly references CSVs a lot and compares itself to the two major CSV libraries, so that's why I thought this was mostly a csv reading product. Maybe that's not the case, but the page says 'feedback welcome', so I'd like to make a polite suggestion that commas be the default (a thin wrapper along those lines is sketched below). It's also mentioned that delim solves the problem, but I didn't find that in the rollout articles or the front page (to be fair, it's in the docs).

Anyhow, big fan of the tool, was very fast on pulling down a rather large file, and I enjoyed it, so bravo on making a cool package.
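
For what it's worth, the suggestion boils down to something like this thin wrapper (a hypothetical helper, not part of vroom), which sidesteps the guesser by always passing an explicit comma:

vroom_csv <- function(file, ...) {
  vroom::vroom(file, delim = ",", ...)
}

trees <- vroom_csv("/tmp/WI_TREE.csv")   # no delimiter guessing involved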

jimhester added a commit that referenced this issue Dec 6, 2019
The previous logic was hard to follow and did not work well for many
real world files. The new code seems simpler to read and also works
better on files which used to fail. It also deals better with quoted
fields.

Fixes #126
Fixes #141
Fixes #167
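
For readers following along, the guessing step the commit describes works roughly like the sketch below (an illustration only, not the actual vroom implementation): read a handful of lines, ignore quoted fields, count candidate delimiters, and prefer the one that appears a consistent, non-zero number of times on every line, falling back to a newline otherwise.

guess_delim <- function(file, candidates = c(",", "\t", ";", "|"), n = 10) {
  lines <- readLines(file, n = n, warn = FALSE)
  # strip quoted fields so delimiters embedded inside quotes are not counted
  lines <- gsub('"[^"]*"', "", lines)
  counts <- vapply(candidates, function(d) {
    per_line <- lengths(regmatches(lines, gregexpr(d, lines, fixed = TRUE)))
    # a plausible delimiter occurs the same non-zero number of times on every line
    if (length(per_line) > 0 && per_line[1] > 0 && length(unique(per_line)) == 1) per_line[1] else 0L
  }, integer(1))
  if (all(counts == 0)) "\n" else candidates[which.max(counts)]
}

guess_delim("/tmp/WI_TREE.csv")   # should return "," for the file in this issue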