
Japanese kanji in CSV header automatically replaced with Chinese Pinyin #145

Open · aruneko opened this issue Nov 10, 2021 · 10 comments


aruneko commented Nov 10, 2021

When I uploaded a CSV file to a datastore via XLoader, I noticed that all Japanese kanji in the CSV header were replaced with Chinese Pinyin.

For example, when XLoader loads the following CSV file:

id,緯度,経度
1,35.6824572,139.763119
2,35.6733858,139.7506726

we get this table:

id Wei Du Jing Du
1 35.6824572 139.763119
2 35.6733858 139.7506726

Note: the pronunciation of 緯度 is /ido/ and the pronunciation of 経度 is /ke:do/ in Japanese, so you can see these are quite far from the Chinese readings.

Instead, I want to get a table like this, so the kanji should not be replaced:

id 緯度 経度
1 35.6824572 139.763119
2 35.6733858 139.7506726

I know unidecode does this in this function, but I don't have a good idea for how to avoid transliterating the kanji in a CSV header.
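
A quick way to reproduce the behaviour outside of XLoader (assuming the unidecode package is installed) is:

from unidecode import unidecode

headers = ["id", "緯度", "経度"]
print([unidecode(h) for h in headers])
# unidecode transliterates CJK characters via their Mandarin readings,
# so this prints something like ['id', 'Wei Du ', 'Jing Du ']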


KatiRG commented May 4, 2023

The application of unidecode() to CSV headers here https://github.com/ckan/ckanext-xloader/blob/master/ckanext/xloader/loader.py#L355 will also strip French accents, which is not good for our French users. The PyPI docs recommend:

It’s best to not use Unidecode for strings that are directly visible to users of your application.

Why is unidecode() applied by default in xloader?
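
For illustration, the same unidecode call flattens accented characters to plain ASCII (the strings here are just made-up examples):

from unidecode import unidecode

print(unidecode("année"))        # prints 'annee'
print(unidecode("température"))  # prints 'temperature'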


wardi commented May 4, 2023

It's being used because these are used as postgres column names in the datastore database, which can be difficult to access for users not used to entering non-ascii text via the API.

For preview it is possible to add the original names to the data dictionary, but that's an extra step right now.

We should experiment with not converting these column headings, or only filtering specific characters known to cause problems. Have you tried removing this conversion to see if these column names cause any problems?
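
As a rough sketch of the "filter only specific characters" idea (not existing xloader code; the function name and the character list are assumptions):

import re

def clean_header(name):
    # Keep Unicode letters and digits intact; drop only characters likely to
    # cause trouble in a quoted postgres identifier or in the datastore API.
    name = re.sub(r'["\x00\r\n]', "", name.strip())
    # Note: postgres truncates identifiers to 63 *bytes* (NAMEDATALEN - 1),
    # so multibyte names may also need byte-aware truncation.
    return name or "unnamed"

print(clean_header("緯度"))  # '緯度' - the kanji are preserved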


KatiRG commented May 4, 2023

Thanks @wardi, yes, I have experimented with removing unidecode() from the header names, and this preserves the original names, which is good for us.


aimalkhan commented May 4, 2023

@wardi

It's being used because these are used as postgres column names in the datastore database, which can be difficult to access for users not used to entering non-ascii text via the API.

What are some examples of this difficulty in access?

Have you tried removing this conversion to see if these column names cause any problems?

For the test resources we are using, there are no issues. However, there could be issues with characters we haven't encountered yet. Is there a test dataset that was used previously (such as one that would have led to the decision to use unidecode()) that we can test with?


wardi commented May 4, 2023

The difficulty I'm thinking of is for people without input methods set up to enter non-ASCII characters reliably, and the ambiguity of some Unicode characters. It's not a blocker, just a possible impediment.

If we update our installation docs to make sure that the databases are created with a UTF-8 encoding, then we should be able to use any language postgres supports in the column names, and I think removing unidecode (or putting it behind a configuration option) would be a great change for this repo.

For testing names we have to follow postgres restrictions for column naming and be aware of case-folding behavior.

aimalkhan commented

We can make a PR to make the unidecode() call here (headers = [unidecode(header) for header in headers]) controllable from the .ini file, turned on by default. Will that suffice as an acceptable change?

update our installation docs to make sure that the databases are created with a UTF-8 encoding

The documentation already mentions and provides steps for using UTF-8 encoding, so that is taken care of.

column naming and be aware of case-folding

AFAIK the postgres naming restrictions are already being followed, and the case-folding behavior won't change with the proposed PR.
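
A minimal sketch of what such a toggle could look like in loader.py (the config key name below is an assumption, not an agreed option):

from unidecode import unidecode
from ckan.plugins import toolkit

def normalize_headers(headers):
    # Hypothetical config key; the real name would be decided in the PR.
    if toolkit.asbool(toolkit.config.get("ckanext.xloader.unidecode_headers", True)):
        headers = [unidecode(header) for header in headers]
    return headers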


wardi commented May 11, 2023

@aimalkhan sounds good, but we'll need to check that the column name will be valid as a postgres column name (at a minimum no " characters, but I'm sure the rules are stricter than that).

@jqnatividad you did some work on code that validated column names for datapusher+ recently, right? Maybe we can borrow that.

@aimalkhan let's also disable unidecode by default instead of enabling it, and include a note in the changes or readme about this.
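
A sketch of the kind of checks that implies (a hypothetical helper combining the postgres identifier limit with the datastore's reserved names; not existing xloader code):

def is_valid_datastore_column(name):
    # Hypothetical helper, not existing xloader code.
    if not name or '"' in name or "\x00" in name:
        return False   # unquotable as a postgres identifier
    if name.startswith("_"):
        return False   # the datastore reserves _-prefixed columns, including _id
    if len(name.encode("utf-8")) > 63:
        return False   # postgres NAMEDATALEN limit of 63 bytes
    return True

print(is_valid_datastore_column("緯度"))  # True
print(is_valid_datastore_column("_id"))   # False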


jqnatividad commented May 11, 2023

Yes @wardi, in DP+, especially since it supports spreadsheets, we found a lot of users were uploading worksheets with column header names that were not valid PostgreSQL column identifiers.

That's why I created qsv's safenames command that Datapusher+ leverages.

It guarantees that only "safe names" are created: it sanitizes column header names so that they are not only valid Postgres identifiers but also respect additional CKAN restrictions (columns that start with "_" and the special "_id" column), and it ensures duplicate column names are handled as well.

Further, DP+ adds the original column names to the Data Dictionary as column labels. The labels are used automatically, instead of the sanitized names, when viewing the resource with the DataTables view - that way, you still get the original header names when viewing the dataset, while the "safe", machine-readable header names are the ones used by the Datastore API and when exporting the data.

While we're on the topic of aliases: DP+ also has an "auto-alias" option that automatically creates a human-readable resource alias (by default org-packagename-resourcename, with optional collision detection) that you can use instead of the resource id hash in CKAN API calls and in SQL queries (as a CKAN resource alias is really just a PostgreSQL view).

https://github.com/dathere/datapusher-plus/blob/3a3c69687f16973cbdfa9f52a4d6db11ac48f908/datapusher/dot-env.template#L132-L143

@aimalkhan, if you're so inclined, I encourage you to try out DP+, as I'd love to get it exercised with more languages. I took care to make sure it supports UTF-8 properly; I have some folks in Europe, China and Japan testing it, and I'd love to see more.

FYI, while we were at OpenGov we sponsored xloader, as the original datapusher was the largest source of issues after a CKAN installation. DP+ has since combined the best of datapusher and xloader, and I'd be interested in your feedback.


wardi commented May 11, 2023

@jqnatividad for the alphanumeric test, are you using the Unicode categories L* and N*?
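
For reference, a quick way to check those categories in Python (a generic illustration, not qsv or xloader code):

import unicodedata

def is_letter_or_digit(ch):
    # General categories starting with L (letters) or N (numbers)
    return unicodedata.category(ch)[0] in ("L", "N")

print([(c, is_letter_or_digit(c)) for c in "緯é1_"])
# [('緯', True), ('é', True), ('1', True), ('_', False)]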


jqnatividad commented May 11, 2023

@wardi I have no specific test for those unicode categories, but since Rust strings are UTF-8, I "think" it's something we get for free.

Do you have specific test cases I can try to check out support for those specific categories?
