View Issue Details

IDProjectCategoryView StatusLast Update
0000747HTML & PERLBug Report - Interfacepublic2018-03-07 23:15
ReporterQuietust Assigned ToDerIdiot  
PrioritynormalSeverityminorReproducibilityalways
Status resolvedResolutionfixed 
Fixed in Version2018-04 
Summary0000747: Inconsistent sorting by title on MyList
DescriptionOn my MyList page, if I sort anime by title they are listed in a very peculiar order, not sorted by their Japanese names (shown in kana/kanji) or their English names. If I sort in reverse, it reproduces the same odd sorting in reverse.
Additional InformationExample 1: (all examples are sorted in ascending order)
*ラブひな
*ちょびっツ
*らき☆すた
*モンスター

Example 2:
*スクールランブル
*タイドライン・ブルー
*ハチミツとクローバー
*シリアルエクスペリメンツレイン

Example 3:
*あずまんが大王
*ああっ女神さまっ

Example 4:
*スクールランブル一学期補習
*万能文化猫娘 (1998)
*スクールランブル 二学期
Tagslanguage, mylist, sorting

Relationships

has duplicate 0000941 closedDerIdiot .hack sorted differently in lists 
has duplicate 0001116 closed .hack does not show in mylist & anime list 
has duplicate 0001815 closedDerIdiot mylist + japanese alphabetical order 

Activities

epoximator

2008-02-25 08:39

reporter   ~0001752

bump, postgresql updated. sorted by lower, btw

DerIdiot

2014-09-02 14:23

administrator   ~0003407

http://postgresql.1045698.n5.nabble.com/Japanese-words-not-distinguished-td1845979.html

as him we are using the english locale which seems to be the cause of this. so if we care to fix this we need to change the locale to C

Belove

2016-05-17 05:25

reporter   ~0003766

"The C and POSIX collations both specify "traditional C" behavior, in which only the ASCII letters "A" through "Z" are treated as letters, and sorting is done strictly by character code byte values" http://www.postgresql.org/docs/current/static/collation.html
That sounds like virtually the opposite of a remedy. Unicode codepoints are not really in any proper collation order. If collation were specified elsewhere, this might be an acceptable database default, if not ideal, since default behavior would still not be good.

What you want, perhaps, is standard, full Unicode Collation Algorithm (UCA) using the generic Default Unicode Collation Element Table (DUCET) or a derivative. http://www.unicode.org/faq/collation.html http://www.unicode.org/reports/tr10/
That should achieve reasonable results, although you may have additional requirements. For example, much like "A" and "a" have the same collation weight, many characters in other languages do too, such as Hiragana/Kata Kana with and without voicing marks, which may test as equal (for better and worse). Also, for Japanese Kanji , "The Default Unicode Collation Element Table uses the Unicode order for CJK ideographs (Kanji). This represents a radical-stroke ordering for the characters in JIS levels 1 and 2. If a different order is needed, such as an exact match to binary JIS order for these characters, that can be achieved with tailoring". If you want to sort based on reading, "the common practice for sorting in a database by reading is to store the reading in a separate field, and construct the sort keys from the readings." to store the reading, and sort by that. Now, you don't currently store the reading to my knowledge, except approximately, as a transcription. You could create this value at least initially, if desired -- perhaps -- by attempting to converte at least ja-t to Hiragana (the same issue exists for other Asian languages). That would, however, for better or worse, not include symbols and other things we leave out of the ja-t values we use. Furthermore, most languages have language-specific rules, since they often share characters with other languages. "Q: How are mixed Japanese and Chinese handled? A: The Unicode Collation Algorithm specifies how collation works for a single context. In this respect, mixed Japanese and Chinese are no different than mixed Swedish and German, or any other languages that use the same characters. Generally, the customers using a particular collation will want text sorted uniformly, no matter what the source. Japanese customers would want them sorted in the Japanese fashion, etc. There are contexts where foreign words are called out separately and sorted in a separate group with different collation conventions. Such cases would require the source fields to be tagged with the type of desired collation (or tagged with a language, which is then used to look up an associated collation)."
This may not apply, depending on how PostgresSQL implements sorting, but: an additional recommendation, if you want sort orders to be consistent ("stable sorts"), even if on an arbitrary basis to most observers --- when collation character (or character combination) weights are equivalent --- is to append the record number to the field or sort on the field + the record number. http://www.unicode.org/notes/tn9/#Forcing_Stable_Sorts (one method.)

Since PostgreSQL does not define it's own collations and character sets, they are imported from the host OS's locale settings. http://www.postgresql.org/docs/current/static/collation.html
http://www.postgresql.org/docs/current/static/multibyte.html (re. client character encoding and server character encoding)
http://www.postgresql.org/docs/current/interactive/sql-createcollation.html http://www.postgresql.org/docs/current/interactive/sql-altercollation.html http://www.postgresql.org/docs/current/interactive/sql-dropcollation.html
http://www.postgresql.org/docs/current/interactive/sql-createdatabase.html
http://www.postgresql.org/docs/current/interactive/sql-createtable.html http://www.postgresql.org/docs/current/interactive/sql-altertable.html
http://www.postgresql.org/docs/current/interactive/sql-expressions.html#SQL-SYNTAX-COLLATE-EXPRS
http://php.net/manual/en/function.pg-connect.php (see client-encoding connection parameter; for PHP)
http://www.postgresql.org/docs/current/static/libpq-connect.html#LIBPQ-PARAMKEYWORDS (see client-encoding connection parameter, etc.)
The starting point seems to be determining the best locale setting or settings to create a collation or collations from. _I'm unclear on what the best available locale settings to use as a basis are_, and it's possible that there is no language-agnostic locale available by default in the Linux installation. The most important thing is probably that the selected locale has complete Unicode coverage in the character set you're using (presumably UTF-8). There is no one correct collation for all languages, unfortunately, since many languages use the same characters and have different collation rules. Other times, there ar different potential correct collation rules for a given language depending on requirements. Some locales may leave out rules for other languages simply because they do not consider them relevant, even if they would create a more universal ruleset.

You might be able to create a locale that duplicates DUCET/UCA exactly, based on the Unicode CLDR (Common Language Data Repository), using the XML and tools available from unicode.org: http://cldr.unicode.org/ http://www.unicode.org/Public/UCA/latest/allkeys.txt http://unicode.org/repos/cldr/trunk/common/uca/allkeys_CLDR.txt
That might be more language-agnostic while still being more complete than locale settings already included in your OS installation -- making an educated guess. I'm not sure if it would be _better_ than other locale settings or not.

I'm not sure how far you would want to go down the rabbit hole of achieving ideal/best possible multi-lingual behaviors. I haven't done sufficient research, and studying what approaches others have taken (Wikipedia?) might be useful, but here are some initial thoughts of my own:
It should be possible to import multiple collations for different languages into the database, and you might use COLLATE clauses in SQL, and reference the language identifier for records to select the correct collation. I think this would be very complex, and you would probably want to be sure that the database defaults provided reasonable results for existing queries. These are just loose thoughts, as I suspect you won't worry too much about this being perfect. Also, when a table includes results in multiple languages, you might want to have languages with similar symbols sorted together, but still have, for results in any given language, the collation be internally correct for that language. This might be a complex problem, require additional server resources at runtime (such as large/multiple indexes and/or more complex queries), and additional ongoing development resources to solve and maintain well... And what collation should be used for searches? You'd have to determine the language someone is searching in, it seems.
Possibly the problem could be simplified along the lines of the Hiragana reading collation solution I described, by using a seperate field -- one that would likely be used only internally -- with a single value that will collate as you desire it to within the context of the entire database. The correct values to use would depend on the collation used for the field, and you would want to generate a custom value for the field based on the original language text and transform it in such a way that it will collate with all other records as desired, using rules that vary on the input language. If you did this for all languages, then this could potentially be a single solution to sort by Hiragana reading (if desired) as well as all other language's own collation rules.


There's also the potential to improve on how searches are matched to the database through this process. Any change in collation will have some effects in that arena, but it might be nice to be able to have searches match any of AniDB's supported languages' collation rules (or multiple sets of rules, such as kanji radicals + strokes as well as readings and/or kana and transliterations) from a single search/query. Thoughts above touch on things that could achieve this, but the idea would need more development to be fully understood, at least by me. Maybe if you had the seperate field I described with a special version of the text used for sorting purposes, whose value was defined differently according to at least language... and you don't definitively know the language someone is searching in, you could convert search text into language/converted search text pairs for every language, which could then be used as a pair to match records.

Anyway, for the more complex ideas I've mentioned, I suspect compromises could be made for simplicity and performance. Also, I'm just elucidating some immature thoughts of my own in areas I don't have a lot of expertise in.

Belove

2016-05-17 05:34

reporter   ~0003767

Multiple collation indexes can be created on the same field:
http://www.postgresql.org/docs/current/static/indexes-collations.html

(Though there may be performance considerations.)

DerIdiot

2018-03-02 20:19

administrator   ~0004202

postgresql10 implemented icu

https://blog.2ndquadrant.com/icu-support-postgresql-10/
https://blog.ohgaki.net/postgresql-10-icu-locale-collation-enables-natural-japanese-sorting

so we will probably employ und-x-icu which has a few minor drawbacks though like punctuation being considered compared to en_GB so .hack ends up on top again, but considering using the jhumpbar hid it anyway as that was punctuation unaware this might be ok.

Issue History

Date Modified Username Field Change
2007-08-22 22:13 Quietust New Issue
2008-02-10 22:11 ninjamask Tag Attached: language
2008-02-10 22:11 ninjamask Tag Attached: mylist
2008-02-10 22:11 ninjamask Tag Attached: sorting
2008-02-25 08:39 epoximator Note Added: 0001752
2009-08-31 17:19 DerIdiot Status new => acknowledged
2014-09-01 13:35 DerIdiot Relationship added has duplicate 0000941
2014-09-01 13:36 DerIdiot Relationship added has duplicate 0001116
2014-09-02 14:23 DerIdiot Note Added: 0003407
2014-09-03 09:41 DerIdiot Relationship added has duplicate 0001815
2016-04-17 02:08 DerIdiot Assigned To => DerIdiot
2016-04-17 02:08 DerIdiot Status acknowledged => assigned
2016-05-17 05:25 Belove Note Added: 0003766
2016-05-17 05:34 Belove Note Added: 0003767
2018-03-02 20:19 DerIdiot Note Added: 0004202
2018-03-07 23:15 DerIdiot Status assigned => resolved
2018-03-07 23:15 DerIdiot Resolution open => fixed
2018-03-07 23:15 DerIdiot Fixed in Version => 2018-04