Einar Saukas wrote: ↑Fri Jul 19, 2019 5:44 pm
RMartins wrote: ↑Fri Jul 19, 2019 2:42 amIf table data does not contain Upper case characters, we can never match them as Upper case characters.
However, if table data contains Upper case characters, then we can match or ignore them, depending on the collation being used.
I understood your point. I just happen to disagree with you
That's perfectly fine.
If we would all think alike, it would be boring
But I believe you are disagreeing, because I might not have explained it well enough.
If you take into account, that you are using a collation that is accent insensitive and case insensitive, it doesn't matter, what accentuated characters or which case they are in the database, because an A = a = Â = à = À = á = ... etc... using the
*_unicode_ai_ci collation.
So having the original data accents and case on the search_by_titles table, doesn't affect how you currently do your searches.
If you don't believe me, change one and try
Einar Saukas wrote: ↑Fri Jul 19, 2019 5:44 pm
This was your proposal for uppercase:
So, if user uses Upper case in the search string, then it means they are looking for that specific case, for example to distinguish the start of a Word or a Name. But if the user uses all lower case, then it means that any match is acceptable including in the middle of a word or name.
According to your proposal, searching for "Earth Shock" won't find anything, because the exact spelling is "Earthshock".
Nope.
NOTE: his is why I believe, I might not have explained myself well.
In search engines, things known to have different spellings, like your example above, are perfectly fine.
They just need several entries into the search index (not database index), like if they were alternative names (aliases).
Clipping/removing the spaces, dashes, funny characters, etc..., to try to retain only the parts that are common to facilitate a match is not the way to go about it, because it will give a lot of false positives in other unrelated searches.
So in this case, you would have 3 records in search_by_titles, namely:
- "Earth Shock"
- "Earthshock"
- "Earth-Shock"
that map to the same exact game id (entry_id).
If you think about it, you already do this with the "aliases" table.
The problem with clipping the uncommon stuff, besides giving false positives, is that it will not work for names (aliases) that are no exact matches once you strip the funny(uncommon) characters between the several spellings.
A good example is the "Live and Let Die" versus "Aquablast".
Same game, different "spellings" or names for the same thing.
I hope that clears any doubt about it.
You get the good but you also get the bad, by clipping data.
If instead you just keep the several spellings, you get the good part, but without the bad ones.
Einar Saukas wrote: ↑Fri Jul 19, 2019 5:44 pm
The point is, I believe searches should be as permissive as possible. Right now, users don't need to know the exact spelling of a title. Searching for "Earth Shock", "earthshock" and "earth-shock" will find the intended game.
Completely agree.
Einar Saukas wrote: ↑Fri Jul 19, 2019 5:44 pm
It goes even further: right now, searching for "night rally" will find both "
Night Rally" and "Nightmare Rally".
I hope you agree with me, that those results are not the outcome, of clipping the funny/uncommon chars.
Because "night rally" when stripped of the funny chars, would become "nightrally", and would never match "nightmarerally".
I haven't checked, but that outcome is most probably the result of the "aliases" table or similar.
Einar Saukas wrote: ↑Fri Jul 19, 2019 5:44 pm
And searching for "Ocean Software" will also find titles published by Ocean brands and subsidiaries, not just Ocean directly. That's how searches in Martijn's WoS work and I always liked it, thus the reason I provided something similar with these auxiliary tables. It's a very different requirement from autocomplete, that only makes sense for exact matches.
Again, that is not the result of clipping the funny/uncommon chars.
That is the result of pre-processing some table joins to include extra keys in the search_by_* tables, which is exactly what I'm proposing.
We can still have all those features, without having the false positives.
Autocomplete is just a search, like any other. I see no difference there.
Einar Saukas wrote: ↑Fri Jul 19, 2019 5:44 pm
RMartins wrote: ↑Fri Jul 19, 2019 2:42 am
Punctuation in data is not required to be removed, for similar reasons as case (being possible to match it, if needed).
I disagree...
Can you please try a search, after changing a few entries to have the correct case and accents, and confirm that you will still find them with the current collation in use ?
NOTE: If this isn't true, MySQL has a bug in that collation.
Einar Saukas wrote: ↑Fri Jul 19, 2019 5:44 pm
RMartins wrote: ↑Fri Jul 19, 2019 2:42 amIf we remove accents, we can no longer match them explicitly, since they are not there.
But if we have accents available in table data, we can ignore it (accent insensitive), just by using a different collation, or use it for mathching (accent sensitive).
And I still think searches here should be always accent insensitive. Otherwise searching for "iniciación" won't find any results.
OK, here is a matter of opinion, that I won't debate, because my proposition is based on the supposition of user intent, since we do not have a real search engine at hand, which would allow definition of rules to match cases like these, or misspells like char reversals and other combinations.
"Iniciation" versus "iniciacion" versus "inicaison" versus "iniciasno", etc...
Note however, if user knows and uses NON accentuated chars and gets the results with the correct accents, they will know what to do.
This is what I referred has user intent, in the search.
Einar Saukas wrote: ↑Fri Jul 19, 2019 5:44 pm
RMartins wrote: ↑Fri Jul 19, 2019 2:42 am
Regarding special cases, ( like "
Off Road" versus "
Offroad" ) the standard solution is to have a dictionary for those (where synonyms can be defined) or special match/translationrules (when using a search engine), but probably the best we can do here is to provide aliases (AKA synonym) for those known cases, including the entire title for simplification, for example in the "aliases" table or similar.
So there should be 2 entries like "
Off Road Racing" versus "
Offroad Racing".
Even so, your suggestion wouldn't be enough to locate the game called "4x4 Off-Road Racing". Unless you are suggesting to store lots of possible variants for each game title in the database?
That's exactly right!
That's the whole idea of a search Index (completely different from a database Index), and they usually are huge!
Specially if we do not have a search engine, that can generate the combinations by itself, given some (many) rules.
Since we are using the poor man SQL solution, we need to build the search index tables, with all valid combinations and then search that, using SQL.
Einar Saukas wrote: ↑Fri Jul 19, 2019 5:44 pm
If so, that would interfere with autocomplete. Imagine someone typing "knight" and autocomplete showing 10 variants of the same title ("Knight Time", "Knight Tyme", "Knight-time", "Knight-tyme", "Knighttime", "Knighttyme", "Knightime", "Knightyme", etc) and none for Knight Lore, for instance.
Actually that will work fine, because we only show the real title, extracted from the "entries.title", while searching using the search_by_titles.entry_title, and after performing an UNION operation, that removes the duplicates.
Code: Select all
SELECT E.title AS name
FROM entries E INNER JOIN search_by_titles SBT ON (SBT.entry_id = E.id)
WHERE SBT.entry_title LIKE ?
ORDER BY E.title
LIMIT 10
Notice that the result comes from the entries table => E.title
Einar Saukas wrote: ↑Fri Jul 19, 2019 5:44 pm
Besides, it feels wrong to store aliases in the database for games that were never really known by those titles, just to help searches that are already working without this.
Now, that I can relate!
It's that feeling that it seems a waste due to apparent duplication of data.
But that is the domain of the problem unfortunately.
But let me clarify that when i mentioned "aliases", it was to make it simpler to relate.
I know that "aliases" table has a more in depth context, which is more akin to different language spellings or country releases.
So when I said "aliases" I meant real aliases as in possible replacement names, which are usually handled in specific collections (AKA tables) for search indexes.
Thank you for your feedback.
I really appreciate it.
I hope I was able to clarify any doubt, I may have left previously.