vroom::vroom() reads data as a single column #126

Closed
nerutenbeck opened this issue Jun 3, 2019 · 7 comments

Comments

@nerutenbeck

nerutenbeck commented Jun 3, 2019

In attempting to read a .csv file with 44 columns, I ran into a parsing failure that produced a data object with a single column. Reading the same file with readr::read_csv() succeeds.

treeUrl <- "https://apps.fs.usda.gov/fia/datamart/CSV/WI_TREE.zip"
treeZip <- "/tmp/WI_TREE.zip"
download.file(treeUrl, treeZip)              # fetch the FIA tree table archive
system(paste0("cd /tmp; unzip WI_TREE.zip")) # extract WI_TREE.csv
treeFile <- "/tmp/WI_TREE.csv"

# vroom mis-guesses the delimiter and returns a single-column object
vroom_trees <- vroom::vroom(treeFile)

# readr parses the same file successfully
readr_trees <- readr::read_csv(treeFile)
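
A quick way to see the symptom (an illustrative check, not part of the original report) is to compare the dimensions of the two results:

dim(vroom_trees)   # reportedly a single column
dim(readr_trees)   # the full set of columns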

Thanks for the great package!

@jimhester
Collaborator

The heuristic used to guess the delimiter is not perfect; in this case it does not guess the delimiter correctly, so a newline is used as the fallback. You can specify the delimiter explicitly for this data with the delim argument.

vroom::vroom("/tmp/WI_TREE.csv", delim = ",")
#> Observations: 1,109,323
#> Variables: 207
#> chr  [  1]: P2A_GRM_FLG
#> dbl  [104]: CN, PLT_CN, PREV_TRE_CN, INVYR, STATECD, UNITCD, COUNTYCD, PLOT, SUBP, TREE...
#> lgl  [100]: DAMTYP2, DAMSEV2, WDLDSTEM, CVIGORCD, TREEHISTCD, BHAGE, TOTAGE, CULLDEAD, ...
#> date [  2]: CREATED_DATE, MODIFIED_DATE
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message
#> # A tibble: 1,109,323 x 207
#>         CN  PLT_CN PREV_TRE_CN INVYR STATECD UNITCD COUNTYCD  PLOT  SUBP
#>      <dbl>   <dbl>       <dbl> <dbl>   <dbl>  <dbl>    <dbl> <dbl> <dbl>
#>  1 2.41e13 2.41e13          NA  1983      55      1       37     2   101
#>  2 2.41e13 2.41e13          NA  1983      55      1       37     2   102
#>  3 2.41e13 2.41e13          NA  1983      55      1       37     2   103
#>  4 2.41e13 2.41e13          NA  1983      55      1       37     2   103
#>  5 2.41e13 2.41e13          NA  1983      55      1       37     2   104
#>  6 2.41e13 2.41e13          NA  1983      55      1       37     2   104
#>  7 2.41e13 2.41e13          NA  1983      55      1       37     2   104
#>  8 2.41e13 2.41e13          NA  1983      55      1       37     2   107
#>  9 2.41e13 2.41e13          NA  1983      55      1       37     2   108
#> 10 2.41e13 2.41e13          NA  1983      55      1       37     2   110
#> # … with 1,109,313 more rows, and 198 more variables: TREE <dbl>,
#> #   CONDID <dbl>, AZIMUTH <dbl>, DIST <dbl>, PREVCOND <dbl>,
#> #   STATUSCD <dbl>, SPCD <dbl>, SPGRPCD <dbl>, DIA <dbl>, DIAHTCD <dbl>,
#> #   HT <dbl>, HTCD <dbl>, ACTUALHT <dbl>, TREECLCD <dbl>, CR <dbl>,
#> #   CCLCD <dbl>, TREEGRCD <dbl>, AGENTCD <dbl>, CULL <dbl>, DAMLOC1 <dbl>,
#> #   DAMTYP1 <dbl>, DAMSEV1 <dbl>, DAMLOC2 <dbl>, DAMTYP2 <lgl>,
#> #   DAMSEV2 <lgl>, DECAYCD <dbl>, STOCKING <dbl>, WDLDSTEM <lgl>,
#> #   VOLCFNET <dbl>, VOLCFGRS <dbl>, VOLCSNET <dbl>, VOLCSGRS <dbl>,
#> #   VOLBFNET <dbl>, VOLBFGRS <dbl>, VOLCFSND <dbl>, GROWCFGS <dbl>,
#> #   GROWBFSL <dbl>, GROWCFAL <dbl>, MORTCFGS <dbl>, MORTBFSL <dbl>,
#> #   MORTCFAL <dbl>, REMVCFGS <dbl>, REMVBFSL <dbl>, REMVCFAL <dbl>,
#> #   DIACHECK <dbl>, MORTYR <dbl>, SALVCD <dbl>, UNCRCD <dbl>,
#> #   CPOSCD <dbl>, CLIGHTCD <dbl>, CVIGORCD <lgl>, CDENCD <dbl>,
#> #   CDIEBKCD <dbl>, TRANSCD <dbl>, TREEHISTCD <lgl>, DIACALC <dbl>,
#> #   BHAGE <lgl>, TOTAGE <lgl>, CULLDEAD <lgl>, CULLFORM <lgl>,
#> #   CULLMSTOP <lgl>, CULLBF <lgl>, CULLCF <lgl>, BFSND <lgl>, CFSND <lgl>,
#> #   SAWHT <lgl>, BOLEHT <lgl>, FORMCL <lgl>, HTCALC <dbl>,
#> #   HRDWD_CLUMP_CD <lgl>, SITREE <dbl>, CREATED_BY <lgl>,
#> #   CREATED_DATE <date>, CREATED_IN_INSTANCE <dbl>, MODIFIED_BY <lgl>,
#> #   MODIFIED_DATE <date>, MODIFIED_IN_INSTANCE <dbl>, MORTCD <lgl>,
#> #   HTDMP <dbl>, ROUGHCULL <lgl>, MIST_CL_CD <lgl>, CULL_FLD <dbl>,
#> #   RECONCILECD <dbl>, PREVDIA <dbl>, FGROWCFGS <dbl>, FGROWBFSL <dbl>,
#> #   FGROWCFAL <dbl>, FMORTCFGS <dbl>, FMORTBFSL <dbl>, FMORTCFAL <dbl>,
#> #   FREMVCFGS <dbl>, FREMVBFSL <dbl>, FREMVCFAL <dbl>, P2A_GRM_FLG <chr>,
#> #   TREECLCD_NERS <lgl>, TREECLCD_SRS <lgl>, TREECLCD_NCRS <dbl>,
#> #   TREECLCD_RMRS <lgl>, STANDING_DEAD_CD <dbl>, PREV_STATUS_CD <dbl>, …

Created on 2019-06-04 by the reprex package (v0.2.1)
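
As the output above notes, spec() returns a copy-pastable column specification and col_types silences the guessing message. A minimal sketch of that follow-up (the object name and the two columns shown are just an illustrative subset of the 207; in practice paste in the full specification that spec() prints):

wi_trees <- vroom::vroom("/tmp/WI_TREE.csv", delim = ",")
vroom::spec(wi_trees)   # prints a copy-pastable cols() specification

# re-read with the types pinned down (illustrative subset only)
vroom::vroom("/tmp/WI_TREE.csv", delim = ",",
             col_types = vroom::cols(CN = vroom::col_double(),
                                     P2A_GRM_FLG = vroom::col_character()))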

@jimhester
Collaborator

If we look into more robust parsing (#105), this would be a useful test dataset.

@mhbw

mhbw commented Jun 14, 2019

I ran into this as well; wouldn't having the same operational heuristic as read.csv be helpful? Since you're mostly looking at doing a csv reading product, it seems odd that it wouldn't automatically know the separator is a comma...

@jimhester
Collaborator

How am I mostly looking at doing a csv reading product?

@nerutenbeck
Author

I ran into this as well; wouldn't having the same operational heuristic as read.csv be helpful? Since you're mostly looking at doing a csv reading product, it seems odd that it wouldn't automatically know the separator is a comma...

This is rude. Adding delim = "," explicitly solves the problem, and more robust parsing has already been suggested as a feature.

@mhbw

mhbw commented Jun 14, 2019

I ran into this as well; wouldn't having the same operational heuristic as read.csv be helpful? Since you're mostly looking at doing a csv reading product, it seems odd that it wouldn't automatically know the separator is a comma...

This is rude.

You know what? You're right. I could have phrased it differently, and the world doesn't need more rude people on the internet, so I apologize. Thanks for pointing it out.

@mhbw

mhbw commented Jun 14, 2019

How am I mostly looking at doing a csv reading product?

Well, maybe I misread it, but as the docs say:

"The most common type of delimited files are CSV (Comma Separated Values) files"

and the tidyverse page that I found this through says:

"vroom reads rectangular data, such as comma separated (csv), tab separated (tsv) or fixed width files (fwf) into R. It performs similar roles to functions like readr::read_csv(), data.table::fread() or read.csv(). But for many datasets vroom::vroom() can read them much, much faster (hence the name)."

That explicitly references CSVs a lot and compares itself to the two major CSV libraries, so that's why I thought this was mostly a csv reading product. Maybe that's not the case, but the page says 'feedback welcome', so I'd like to make a polite suggestion that commas be the default (a thin wrapper along those lines is sketched below). It's also mentioned that delim solves the problem, but I didn't find that in the rollout articles or the front page (to be fair, it's in the docs).

Anyhow, big fan of the tool, was very fast on pulling down a rather large file, and I enjoyed it, so bravo on making a cool package.
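
For what it's worth, the suggestion boils down to something like this thin wrapper (a hypothetical helper, not part of vroom), which sidesteps the guesser by always passing an explicit comma:

vroom_csv <- function(file, ...) {
  vroom::vroom(file, delim = ",", ...)
}

trees <- vroom_csv("/tmp/WI_TREE.csv")   # no delimiter guessing involved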

jimhester added a commit that referenced this issue Dec 6, 2019
The previous logic was hard to follow and did not work well for many
real world files. The new code seems simpler to read and also works
better on files which used to fail. It also deals better with quoted
fields.

Fixes #126
Fixes #141
Fixes #167
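
For readers following along, the guessing step the commit describes works roughly like the sketch below (an illustration only, not the actual vroom implementation): read a handful of lines, ignore quoted fields, count candidate delimiters, and prefer the one that appears a consistent, non-zero number of times on every line, falling back to a newline otherwise.

guess_delim <- function(file, candidates = c(",", "\t", ";", "|"), n = 10) {
  lines <- readLines(file, n = n, warn = FALSE)
  # strip quoted fields so delimiters embedded inside quotes are not counted
  lines <- gsub('"[^"]*"', "", lines)
  counts <- vapply(candidates, function(d) {
    per_line <- lengths(regmatches(lines, gregexpr(d, lines, fixed = TRUE)))
    # a plausible delimiter occurs the same non-zero number of times on every line
    if (length(per_line) > 0 && per_line[1] > 0 && length(unique(per_line)) == 1) per_line[1] else 0L
  }, integer(1))
  if (all(counts == 0)) "\n" else candidates[which.max(counts)]
}

guess_delim("/tmp/WI_TREE.csv")   # should return "," for the file in this issue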