
Japanese kanji in CSV header automatically replaced with Chinese Pinyin #145

Open · aruneko opened this issue Nov 10, 2021 · 10 comments


aruneko commented Nov 10, 2021

When I uploaded a CSV file to a datastore via XLoader, I noticed that all Japanese kanji in the CSV header were replaced with Chinese Pinyin.

For example, when XLoader loads the following CSV file:

id,緯度,経度
1,35.6824572,139.763119
2,35.6733858,139.7506726

we get this table:

id Wei Du Jing Du
1 35.6824572 139.763119
2 35.6733858 139.7506726

Note: the pronunciation of 緯度 is /ido/ and the pronunciation of 経度 is /ke:do/ in Japanese, so you can see these are quite far from the Chinese readings.

Instead, I want to get a table like this, so the kanji should not be replaced:

id 緯度 経度
1 35.6824572 139.763119
2 35.6733858 139.7506726

I know unidecode does this in this function, but I don't have a good idea for how to avoid transliterating the kanji in a CSV header.
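
A quick way to reproduce the behaviour outside of XLoader (assuming the unidecode package is installed) is:

from unidecode import unidecode

headers = ["id", "緯度", "経度"]
print([unidecode(h) for h in headers])
# unidecode transliterates CJK characters via their Mandarin readings,
# so this prints something like ['id', 'Wei Du ', 'Jing Du ']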


KatiRG commented May 4, 2023

The application of unidecode() to CSV headers here https://github.com/ckan/ckanext-xloader/blob/master/ckanext/xloader/loader.py#L355 will also strip French accents, which is not good for our French users. The PyPI docs recommend:

It’s best to not use Unidecode for strings that are directly visible to users of your application.

Why is unidecode() applied by default in xloader?
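
For illustration, the same unidecode call flattens accented characters to plain ASCII (the strings here are just made-up examples):

from unidecode import unidecode

print(unidecode("année"))        # prints 'annee'
print(unidecode("température"))  # prints 'temperature'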


wardi commented May 4, 2023

It's being used because these are used as postgres column names in the datastore database, which can be difficult to access for users not used to entering non-ascii text via the API.

For preview it is possible to add the original names to the data dictionary, but that's an extra step right now.

We should experiment with not converting these column headings, or only filtering specific characters known to cause problems. Have you tried removing this conversion to see if these column names cause any problems?
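
As a rough sketch of the "filter only specific characters" idea (not existing xloader code; the function name and the character list are assumptions):

import re

def clean_header(name):
    # Keep Unicode letters and digits intact; drop only characters likely to
    # cause trouble in a quoted postgres identifier or in the datastore API.
    name = re.sub(r'["\x00\r\n]', "", name.strip())
    # Note: postgres truncates identifiers to 63 *bytes* (NAMEDATALEN - 1),
    # so multibyte names may also need byte-aware truncation.
    return name or "unnamed"

print(clean_header("緯度"))  # '緯度' - the kanji are preserved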


KatiRG commented May 4, 2023

Thanks @wardi, yes, I have experimented with removing unidecode() from the header names, and this preserves the original names, which is good for us.


aimalkhan commented May 4, 2023

@wardi

It's being used because these are used as postgres column names in the datastore database, which can be difficult to access for users not used to entering non-ascii text via the API.

What are some examples of this difficulty in access?

Have you tried removing this conversion to see if these column names cause any problems?

For the test resources we are using, there are no issues. However, there could be issues with characters we haven't encountered yet. Is there a test dataset that was used previously (such as one that would have led to the decision to use unidecode()) that we can test with?


wardi commented May 4, 2023

The difficulty I'm thinking of is for people without input methods set up to enter non-ASCII characters reliably, and the ambiguity of some Unicode characters. It's not a blocker, just a possible impediment.

If we update our installation docs to make sure that the databases are created with a UTF-8 encoding, then we should be able to use any language postgres supports in the column names, and I think removing unidecode (or putting it behind a configuration option) would be a great change for this repo.

For testing names we have to follow postgres restrictions for column naming and be aware of case-folding behavior.

aimalkhan commented

We can make a PR to make the unidecode() call here (headers = [unidecode(header) for header in headers]) controllable from the .ini file, turned on by default. Will that suffice as an acceptable change?

update our installation docs to make sure that the databases are created with a UTF-8 encoding

The documentation already mentions and provides steps for using UTF-8 encoding, so that is taken care of.

column naming and be aware of case-folding

AFAIK the postgres naming restrictions are already being followed, and the case-folding behavior won't change with the proposed PR.
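
A minimal sketch of what such a toggle could look like in loader.py (the config key name below is an assumption, not an agreed option):

from unidecode import unidecode
from ckan.plugins import toolkit

def normalize_headers(headers):
    # Hypothetical config key; the real name would be decided in the PR.
    if toolkit.asbool(toolkit.config.get("ckanext.xloader.unidecode_headers", True)):
        headers = [unidecode(header) for header in headers]
    return headers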


wardi commented May 11, 2023

@aimalkhan sounds good, but we'll need to check that the column name will be valid as a postgres column name (at a minimum no " characters, but I'm sure the rules are stricter than that).

@jqnatividad you did some work on code that validated column names for datapusher+ recently, right? Maybe we can borrow that.

@aimalkhan let's also disable unidecode by default instead of enabling it, and include a note in the changes or readme about this.
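
A sketch of the kind of checks that implies (a hypothetical helper combining the postgres identifier limit with the datastore's reserved names; not existing xloader code):

def is_valid_datastore_column(name):
    # Hypothetical helper, not existing xloader code.
    if not name or '"' in name or "\x00" in name:
        return False   # unquotable as a postgres identifier
    if name.startswith("_"):
        return False   # the datastore reserves _-prefixed columns, including _id
    if len(name.encode("utf-8")) > 63:
        return False   # postgres NAMEDATALEN limit of 63 bytes
    return True

print(is_valid_datastore_column("緯度"))  # True
print(is_valid_datastore_column("_id"))   # False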


jqnatividad commented May 11, 2023

Yes @wardi, in DP+, especially since it supports spreadsheets, we found a lot of users were uploading worksheets with column header names that were not valid PostgreSQL column identifiers.

That's why I created qsv's safenames command that Datapusher+ leverages.

It guarantees that only "safe names" are created: it sanitizes column header names so that they are not only valid Postgres identifiers but also respect additional CKAN restrictions (columns that start with "_" and the special "_id" column), and it ensures duplicate column names are handled as well.

Further, DP+ adds the original column names to the Data Dictionary as column labels. The labels are used automatically, instead of the sanitized names, when viewing the resource with the DataTables view - that way, you still get the original header names when viewing the dataset, while the "safe", machine-readable header names are the ones used by the Datastore API and when exporting the data.

While we're on the topic of aliases: DP+ also has an "auto-alias" option that automatically creates a human-readable resource alias (by default org-packagename-resourcename, with optional collision detection) that you can use instead of the resource id hash in CKAN API calls and in SQL queries (as a CKAN resource alias is really just a PostgreSQL view).

https://github.com/dathere/datapusher-plus/blob/3a3c69687f16973cbdfa9f52a4d6db11ac48f908/datapusher/dot-env.template#L132-L143

@aimalkhan, if you're so inclined, I encourage you to try out DP+, as I'd love to get it exercised with more languages. I took care to make sure it supports UTF-8 properly; I have some folks in Europe, China and Japan testing it, and I'd love to see more.

FYI, while we were at OpenGov we sponsored xloader, as the original datapusher was the largest source of issues after a CKAN installation. DP+ has since combined the best of datapusher and xloader, and I'd be interested in your feedback.


wardi commented May 11, 2023

@jqnatividad for the alphanumeric test, are you using the Unicode categories L* and N*?
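
For reference, a quick way to check those categories in Python (a generic illustration, not qsv or xloader code):

import unicodedata

def is_letter_or_digit(ch):
    # General categories starting with L (letters) or N (numbers)
    return unicodedata.category(ch)[0] in ("L", "N")

print([(c, is_letter_or_digit(c)) for c in "緯é1_"])
# [('緯', True), ('é', True), ('1', True), ('_', False)]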


jqnatividad commented May 11, 2023

@wardi I have no specific test for those unicode categories, but since Rust strings are UTF-8, I "think" it's something we get for free.

Do you have specific test cases I can try to check out support for those specific categories?
