0000747: Inconsistent sorting by title on MyList - AniDB Bug Tracker

ID	Project	Category	View Status	Date Submitted	Last Update

0000747	HTML & PERL	Bug Report - Interface	public	2007-08-22 22:13	2018-03-07 23:15

Reporter	Quietust	Assigned To	DerIdiot
Priority	normal	Severity	minor	Reproducibility	always
Status	resolved	Resolution	fixed
Fixed in Version	2018-04

Summary	0000747: Inconsistent sorting by title on MyList
Description	On my MyList page, if I sort anime by title they are listed in a very peculiar order, not sorted by their Japanese names (shown in kana/kanji) or their English names. If I sort in reverse, it reproduces the same odd sorting in reverse.
Additional Information	Example 1: (all examples are sorted in ascending order) ラブひなちょびっツらき☆すたモンスター Example 2: スクールランブルタイドライン・ブルーハチミツとクローバーシリアルエクスペリメンツレイン Example 3: あずまんが大王ああっ女神さまっ Example 4: スクールランブル一学期補習万能文化猫娘 (1998) *スクールランブル二学期
Tags	language, mylist, sorting

epoximator 2008-02-25 08:39 reporter ~0001752	bump, postgresql updated. sorted by lower, btw

DerIdiot 2014-09-02 14:23 administrator ~0003407	http://postgresql.1045698.n5.nabble.com/Japanese-words-not-distinguished-td1845979.html as him we are using the english locale which seems to be the cause of this. so if we care to fix this we need to change the locale to C

Belove 2016-05-17 05:25 reporter ~0003766	"The C and POSIX collations both specify "traditional C" behavior, in which only the ASCII letters "A" through "Z" are treated as letters, and sorting is done strictly by character code byte values" http://www.postgresql.org/docs/current/static/collation.html That sounds like virtually the opposite of a remedy. Unicode codepoints are not really in any proper collation order. If collation were specified elsewhere, this might be an acceptable database default, if not ideal, since default behavior would still not be good. What you want, perhaps, is standard, full Unicode Collation Algorithm (UCA) using the generic Default Unicode Collation Element Table (DUCET) or a derivative. http://www.unicode.org/faq/collation.html http://www.unicode.org/reports/tr10/ That should achieve reasonable results, although you may have additional requirements. For example, much like "A" and "a" have the same collation weight, many characters in other languages do too, such as Hiragana/Kata Kana with and without voicing marks, which may test as equal (for better and worse). Also, for Japanese Kanji , "The Default Unicode Collation Element Table uses the Unicode order for CJK ideographs (Kanji). This represents a radical-stroke ordering for the characters in JIS levels 1 and 2. If a different order is needed, such as an exact match to binary JIS order for these characters, that can be achieved with tailoring". If you want to sort based on reading, "the common practice for sorting in a database by reading is to store the reading in a separate field, and construct the sort keys from the readings." to store the reading, and sort by that. Now, you don't currently store the reading to my knowledge, except approximately, as a transcription. You could create this value at least initially, if desired -- perhaps -- by attempting to converte at least ja-t to Hiragana (the same issue exists for other Asian languages). That would, however, for better or worse, not include symbols and other things we leave out of the ja-t values we use. Furthermore, most languages have language-specific rules, since they often share characters with other languages. "Q: How are mixed Japanese and Chinese handled? A: The Unicode Collation Algorithm specifies how collation works for a single context. In this respect, mixed Japanese and Chinese are no different than mixed Swedish and German, or any other languages that use the same characters. Generally, the customers using a particular collation will want text sorted uniformly, no matter what the source. Japanese customers would want them sorted in the Japanese fashion, etc. There are contexts where foreign words are called out separately and sorted in a separate group with different collation conventions. Such cases would require the source fields to be tagged with the type of desired collation (or tagged with a language, which is then used to look up an associated collation)." This may not apply, depending on how PostgresSQL implements sorting, but: an additional recommendation, if you want sort orders to be consistent ("stable sorts"), even if on an arbitrary basis to most observers --- when collation character (or character combination) weights are equivalent --- is to append the record number to the field or sort on the field + the record number. http://www.unicode.org/notes/tn9/#Forcing_Stable_Sorts (one method.) Since PostgreSQL does not define it's own collations and character sets, they are imported from the host OS's locale settings. http://www.postgresql.org/docs/current/static/collation.html http://www.postgresql.org/docs/current/static/multibyte.html (re. client character encoding and server character encoding) http://www.postgresql.org/docs/current/interactive/sql-createcollation.html http://www.postgresql.org/docs/current/interactive/sql-altercollation.html http://www.postgresql.org/docs/current/interactive/sql-dropcollation.html http://www.postgresql.org/docs/current/interactive/sql-createdatabase.html http://www.postgresql.org/docs/current/interactive/sql-createtable.html http://www.postgresql.org/docs/current/interactive/sql-altertable.html http://www.postgresql.org/docs/current/interactive/sql-expressions.html#SQL-SYNTAX-COLLATE-EXPRS http://php.net/manual/en/function.pg-connect.php (see client-encoding connection parameter; for PHP) http://www.postgresql.org/docs/current/static/libpq-connect.html#LIBPQ-PARAMKEYWORDS (see client-encoding connection parameter, etc.) The starting point seems to be determining the best locale setting or settings to create a collation or collations from. _I'm unclear on what the best available locale settings to use as a basis are_, and it's possible that there is no language-agnostic locale available by default in the Linux installation. The most important thing is probably that the selected locale has complete Unicode coverage in the character set you're using (presumably UTF-8). There is no one correct collation for all languages, unfortunately, since many languages use the same characters and have different collation rules. Other times, there ar different potential correct collation rules for a given language depending on requirements. Some locales may leave out rules for other languages simply because they do not consider them relevant, even if they would create a more universal ruleset. You might be able to create a locale that duplicates DUCET/UCA exactly, based on the Unicode CLDR (Common Language Data Repository), using the XML and tools available from unicode.org: http://cldr.unicode.org/ http://www.unicode.org/Public/UCA/latest/allkeys.txt http://unicode.org/repos/cldr/trunk/common/uca/allkeys_CLDR.txt That might be more language-agnostic while still being more complete than locale settings already included in your OS installation -- making an educated guess. I'm not sure if it would be _better_ than other locale settings or not. I'm not sure how far you would want to go down the rabbit hole of achieving ideal/best possible multi-lingual behaviors. I haven't done sufficient research, and studying what approaches others have taken (Wikipedia?) might be useful, but here are some initial thoughts of my own: It should be possible to import multiple collations for different languages into the database, and you might use COLLATE clauses in SQL, and reference the language identifier for records to select the correct collation. I think this would be very complex, and you would probably want to be sure that the database defaults provided reasonable results for existing queries. These are just loose thoughts, as I suspect you won't worry too much about this being perfect. Also, when a table includes results in multiple languages, you might want to have languages with similar symbols sorted together, but still have, for results in any given language, the collation be internally correct for that language. This might be a complex problem, require additional server resources at runtime (such as large/multiple indexes and/or more complex queries), and additional ongoing development resources to solve and maintain well... And what collation should be used for searches? You'd have to determine the language someone is searching in, it seems. Possibly the problem could be simplified along the lines of the Hiragana reading collation solution I described, by using a seperate field -- one that would likely be used only internally -- with a single value that will collate as you desire it to within the context of the entire database. The correct values to use would depend on the collation used for the field, and you would want to generate a custom value for the field based on the original language text and transform it in such a way that it will collate with all other records as desired, using rules that vary on the input language. If you did this for all languages, then this could potentially be a single solution to sort by Hiragana reading (if desired) as well as all other language's own collation rules. There's also the potential to improve on how searches are matched to the database through this process. Any change in collation will have some effects in that arena, but it might be nice to be able to have searches match any of AniDB's supported languages' collation rules (or multiple sets of rules, such as kanji radicals + strokes as well as readings and/or kana and transliterations) from a single search/query. Thoughts above touch on things that could achieve this, but the idea would need more development to be fully understood, at least by me. Maybe if you had the seperate field I described with a special version of the text used for sorting purposes, whose value was defined differently according to at least language... and you don't definitively know the language someone is searching in, you could convert search text into language/converted search text pairs for every language, which could then be used as a pair to match records. Anyway, for the more complex ideas I've mentioned, I suspect compromises could be made for simplicity and performance. Also, I'm just elucidating some immature thoughts of my own in areas I don't have a lot of expertise in.

Belove 2016-05-17 05:34 reporter ~0003767	Multiple collation indexes can be created on the same field: http://www.postgresql.org/docs/current/static/indexes-collations.html (Though there may be performance considerations.)

DerIdiot 2018-03-02 20:19 administrator ~0004202	postgresql10 implemented icu https://blog.2ndquadrant.com/icu-support-postgresql-10/ https://blog.ohgaki.net/postgresql-10-icu-locale-collation-enables-natural-japanese-sorting so we will probably employ und-x-icu which has a few minor drawbacks though like punctuation being considered compared to en_GB so .hack ends up on top again, but considering using the jhumpbar hid it anyway as that was punctuation unaware this might be ok.

Date Modified	Username	Field	Change
2007-08-22 22:13	Quietust	New Issue
2008-02-10 22:11	ninjamask	Tag Attached: language
2008-02-10 22:11	ninjamask	Tag Attached: mylist
2008-02-10 22:11	ninjamask	Tag Attached: sorting
2008-02-25 08:39	epoximator	Note Added: 0001752
2009-08-31 17:19	DerIdiot	Status	new => acknowledged
2014-09-01 13:35	DerIdiot	Relationship added	has duplicate 0000941
2014-09-01 13:36	DerIdiot	Relationship added	has duplicate 0001116
2014-09-02 14:23	DerIdiot	Note Added: 0003407
2014-09-03 09:41	DerIdiot	Relationship added	has duplicate 0001815
2016-04-17 02:08	DerIdiot	Assigned To	=> DerIdiot
2016-04-17 02:08	DerIdiot	Status	acknowledged => assigned
2016-05-17 05:25	Belove	Note Added: 0003766
2016-05-17 05:34	Belove	Note Added: 0003767
2018-03-02 20:19	DerIdiot	Note Added: 0004202
2018-03-07 23:15	DerIdiot	Status	assigned => resolved
2018-03-07 23:15	DerIdiot	Resolution	open => fixed
2018-03-07 23:15	DerIdiot	Fixed in Version	=> 2018-04