Predictive Search

Broken link? Feature request? Anything related to the Spectrum Computing website here.
User avatar
Einar Saukas
Bugaboo
Posts: 3070
Joined: Wed Nov 15, 2017 2:48 pm

Re: Predictive Search

Post by Einar Saukas »

Sure, I understand your ideas. They are quite similar to my considerations when I created these tables about a year ago. But I still don't see how they could be improved.

I don't really need to store text in these tables as lowercase only, since all comparisons use "ci" collation. Everything is currently converted to lowercase just to reduce unnecessary duplicates. Also collation is currently defined in these tables just for convenience, so site owners don't need to remember to specify collation in their queries. Changing these details wouldn't improve searches or help autocomplete.

Moreover, I still think it's a good idea to remove spaces and punctuation from these tables, otherwise users searching for either "Offroad Racing" or "Off Road Racing" wouldn't find anything.

Admittedly I'm not an expert in this area so I could be wrong. If you have any changes to suggest, please go ahead!
User avatar
Einar Saukas
Bugaboo
Posts: 3070
Joined: Wed Nov 15, 2017 2:48 pm

Re: Predictive Search

Post by Einar Saukas »

RMartins wrote: Thu Jul 18, 2019 3:18 pmAlso remember, that quick search, includes: Titles, Aliases, Authors and Publishers.
You probably don't need to distinguish between authors and publishers.

In field "author" you can simply autocomplete against:

Code: Select all

SELECT name FROM labels WHERE id IN
(SELECT label_id FROM search_by_authors)
Analogously for "publisher":

Code: Select all

SELECT name FROM labels WHERE id IN
(SELECT label_id FROM search_by_publishers)
In field "title" you can autocomplete against:

Code: Select all

SELECT title FROM entries 
UNION ALL SELECT title FROM aliases
And in a generic field "text" in basic search, you can autocomplete against:

Code: Select all

SELECT title AS text FROM entries 
UNION ALL SELECT title AS text FROM aliases 
UNION ALL SELECT name AS text FROM labels
For simplicity I have omitted COLLATION, GROUP BY, and ORDER BY. They are left as exercise for the reader :)

This way, it's also easy to impose a limit of 40 results overall, instead of partial limits of 10 each.
User avatar
RMartins
Manic Miner
Posts: 776
Joined: Thu Nov 16, 2017 3:26 pm

Re: Predictive Search

Post by RMartins »

Einar Saukas wrote: Thu Jul 18, 2019 8:48 pm Sure, I understand your ideas. They are quite similar to my considerations when I created these tables about a year ago. But I still don't see how they could be improved.

I don't really need to store text in these tables as lowercase only, since all comparisons use "ci" collation. Everything is currently converted to lowercase just to reduce unnecessary duplicates. Also collation is currently defined in these tables just for convenience, so site owners don't need to remember to specify collation in their queries. Changing these details wouldn't improve searches or help autocomplete.
I agree, except with the last sentence :)

NOTE:
If table data does not contain Upper case characters, we can never match them as Upper case characters.
However, if table data contains Upper case characters, then we can match or ignore them, depending on the collation being used.

So, there is an improvement to be gained, by not forcing the content to be all lower case, which is to allow matching case if required.

Another improvement, is that instead of using MYSQL collation utf8_unicode_ci , the MB4 version should be used (utf8mb4_unicode_ci), according to note on the following documentation page https://dev.mysql.com/doc/refman/8.0/en ... icode.html (this is mentioned in several places in the documentaiton, since the 3 byte version, is a MySQL fluke and non standard Umicode.

Einar Saukas wrote: Thu Jul 18, 2019 8:48 pm Moreover, I still think it's a good idea to remove spaces and punctuation from these tables, otherwise users searching for either "Offroad Racing" or "Off Road Racing" wouldn't find anything.
Punctuation in data is not required to be removed, for similar reasons as case (being possible to match it, if needed).

NOTE:
If we remove accents, we can no longer match them explicitly, since they are not there.
But if we have accents available in table data, we can ignore it (accent insensitive), just by using a different collation, or use it for mathching (accent sensitive).

NOTE:
Current collation in use utf8_unicode_ci, is equivalent to utf8_unicode_ai_ci (3 byte Unicode accent insensitive and case insensitive).
This is why searching for '%aña%' returns results that contain 'ana'.

Here is some documentation reference on this:
https://dev.mysql.com/doc/refman/8.0/en ... names.html

Regarding special cases, ( like "Off Road" versus "Offroad" ) the standard solution is to have a dictionary for those (where synonyms can be defined) or special match/translationrules (when using a search engine), but probably the best we can do here is to provide aliases (AKA synonym) for those known cases, including the entire title for simplification, for example in the "aliases" table or similar.
So there should be 2 entries like "Off Road Racing" versus "Offroad Racing".

This "aliases" (dictionary) can then be used, when generating the search indexes (AKA search_by_* tables in our case).
Einar Saukas wrote: Thu Jul 18, 2019 8:48 pm Admittedly I'm not an expert in this area so I could be wrong. If you have any changes to suggest, please go ahead!
Neither am I, because search is a huge subject, with many considerations and p´ros and cons, like any engineering solution, sometimes compromises need to be done.

But I can give you practical examples of this stuff I'm talking about.

For example, in every place in the world, there are streets or areas that are known by a different name, then their real name, usually for cultural reasons. And hence this is the name that most people use to search stuff in that area.

For example, in Lisbon, there was a big international event in 1998, The World International Exposition, also known as Expo 98.

The surrounding area of that event, got culturally known as "EXPO", which in fact, maps to 3 different parishes.
So when people search for stuff in EXPO, we know that we should limit the search to that geographic location.

The same happens, to a river side road that is culturally called "Marginal", since it's on the margin/bank of the river, but it has 3 or 4 names, since it crosses several parishes and 2 counties.

The same probably happens with some games, that might have some culturally distinct name.

No good example comes to mind right now, but a tentative example is "Chaos - The Battle of Wizards", that is known by many as just "Chaos". Not a very good example, because the cultural name is a substring of the real name.

Another example is "Live and Let Die" which is also known as "Aquablast". Although these different names are from different releases, but the idea behind is the same.
User avatar
RMartins
Manic Miner
Posts: 776
Joined: Thu Nov 16, 2017 3:26 pm

Re: Predictive Search

Post by RMartins »

Einar Saukas wrote: Thu Jul 18, 2019 9:07 pm
RMartins wrote: Thu Jul 18, 2019 3:18 pmAlso remember, that quick search, includes: Titles, Aliases, Authors and Publishers.
You probably don't need to distinguish between authors and publishers.

...
Currently the Quick search is using this query:

Code: Select all

SELECT N.name
FROM(
(
	SELECT	E.title AS name
	FROM	entries E	INNER JOIN search_by_titles SBT ON (SBT.entry_id = E.id)
	WHERE	SBT.entry_title LIKE ?
	ORDER	BY E.title
	LIMIT 10
)
UNION
(
	SELECT	A.title AS name
	FROM	aliases A
	WHERE	A.title LIKE ?
	ORDER	BY A.title
	LIMIT 10
)
UNION
(
	SELECT	L.name AS name
	FROM	labels L	INNER JOIN search_by_authors SBA ON (SBA.label_id = L.id)
	WHERE	L.name LIKE ?
	ORDER	BY L.name
	LIMIT 10
)
UNION
(
	SELECT	L.name AS name
	FROM	labels L	INNER JOIN search_by_publishers SBP ON (SBP.label_id = L.id)
	WHERE	L.name LIKE ?
	ORDER	BY L.name
	LIMIT 10
)) N
ORDER BY N.name
LIMIT 10
Einar Saukas wrote: Thu Jul 18, 2019 9:07 pm This way, it's also easy to impose a limit of 40 results overall, instead of partial limits of 10 each.
Not sure I understood what you meant.

It does not matter, if we have 40 results of each, if we are only going to present 10 in total.
At most they could be 10 of the same type, so we will never show the other 30 of each type.
If you look at the above query, it should be self explanatory.

I would suggest we take the more technical detail somewhere else, to avoid loosing most of the current readers :lol:

However, I can give a simple example, on how to query for each intent.
Here is a simple explanation of how COLLATIONs work, using weights.
https://dev.mysql.com/doc/refman/8.0/en ... ation.html

Lets focus just on titles for example.

Code: Select all

SELECT	E.title AS name
FROM	entries E	INNER JOIN search_by_titles SBT ON (SBT.entry_id = E.id)
WHERE	SBT.entry_title LIKE ?
ORDER	BY E.title
LIMIT 10
Current COLLATION is utf8_unicode_ci, meaning the search is accent insensitive and case insensitive.

Now suppose the user, introduced a search string like 'aña':
- this means the user is interested in matching the special char 'ñ'.
- the user did not use any Upper case character ( so is not interested in a specific case match ).

From the observation of the search input string, we can infer that we should do an accent sensitive and case insensitive search.

Code: Select all

SELECT	E.title AS name
FROM	entries E	INNER JOIN search_by_titles SBT ON (SBT.entry_id = E.id)
WHERE	SBT.entry_title LIKE ? COLLATE utf8mb4_unicode_as_ci
ORDER	BY E.title
LIMIT 10
Notice the extra "COLLATE utf8mb4_unicode_as_ci"!

NOTE:
Remember that to reap the benefits of the several distinct collations, the table data needs to contain mixed case and special chars, so that the comparisons will work as expected.

So, for all the possible combinations, we would need the following 4 COLLATIONS:
- utf8_unicode_ai_ci <=> utf8_unicode_ci
- utf8_unicode_as_cs <=> utf8_unicode_cs
- utf8_unicode_as_ci
- utf8_unicode_ai_cs

And this is were most MySQL installs will give us trouble, since these COLLATIONS are not ALL available by default!

But this can be fixed, by adding them to the server.
https://dev.mysql.com/doc/refman/8.0/en ... e-uca.html
It might not be as easy as we would like (might require a recompile), for the mixed ones at least.

But at first try, we could live with just the first two, that are usually available or more easily installable.

NOTE: to simplify the example, I'm not using the utf8mb4, which we should.

To check for locally available/installed collations, run this command:

Code: Select all

SHOW COLLATION WHERE Charset = 'utf8mb4';
UPDATE:
Using specific COLLATIONS on a query, might trip MySQL into not using indexes.
Or might require extra indexes on the same field, but for a different collation.
Something to keep an eye on.
User avatar
RMartins
Manic Miner
Posts: 776
Joined: Thu Nov 16, 2017 3:26 pm

Re: Predictive Search

Post by RMartins »

RMartins wrote: Fri Jul 19, 2019 3:15 am ...

UPDATE:
Using specific COLLATIONS on a query, might trip MySQL into not using indexes.
...
But using SQL LIKE usually prevents the use of any index, so it might not be a big concern, since the database is small.
User avatar
Einar Saukas
Bugaboo
Posts: 3070
Joined: Wed Nov 15, 2017 2:48 pm

Re: Predictive Search

Post by Einar Saukas »

RMartins wrote: Fri Jul 19, 2019 2:42 amIf table data does not contain Upper case characters, we can never match them as Upper case characters.
However, if table data contains Upper case characters, then we can match or ignore them, depending on the collation being used.
I understood your point. I just happen to disagree with you :)

This was your proposal for uppercase:
So, if user uses Upper case in the search string, then it means they are looking for that specific case, for example to distinguish the start of a Word or a Name. But if the user uses all lower case, then it means that any match is acceptable including in the middle of a word or name.
According to your proposal, searching for "Earth Shock" won't find anything, because the exact spelling is "Earthshock".

The point is, I believe searches should be as permissive as possible. Right now, users don't need to know the exact spelling of a title. Searching for "Earth Shock", "earthshock" and "earth-shock" will find the intended game. It goes even further: right now, searching for "night rally" will find both "Night Rally" and "Nightmare Rally". And searching for "Ocean Software" will also find titles published by Ocean brands and subsidiaries, not just Ocean directly. That's how searches in Martijn's WoS work and I always liked it, thus the reason I provided something similar with these auxiliary tables. It's a very different requirement from autocomplete, that only makes sense for exact matches.

RMartins wrote: Fri Jul 19, 2019 2:42 amAnother improvement, is that instead of using MYSQL collation utf8_unicode_ci , the MB4 version should be used (utf8mb4_unicode_ci), according to note on the following documentation page https://dev.mysql.com/doc/refman/8.0/en ... icode.html (this is mentioned in several places in the documentaiton, since the 3 byte version, is a MySQL fluke and non standard Umicode.
Good point, I will fix it. Thanks!

RMartins wrote: Fri Jul 19, 2019 2:42 am Punctuation in data is not required to be removed, for similar reasons as case (being possible to match it, if needed).
I disagree...

RMartins wrote: Fri Jul 19, 2019 2:42 amIf we remove accents, we can no longer match them explicitly, since they are not there.
But if we have accents available in table data, we can ignore it (accent insensitive), just by using a different collation, or use it for mathching (accent sensitive).
And I still think searches here should be always accent insensitive. Otherwise searching for "iniciación" won't find any results.

RMartins wrote: Fri Jul 19, 2019 2:42 am Regarding special cases, ( like "Off Road" versus "Offroad" ) the standard solution is to have a dictionary for those (where synonyms can be defined) or special match/translationrules (when using a search engine), but probably the best we can do here is to provide aliases (AKA synonym) for those known cases, including the entire title for simplification, for example in the "aliases" table or similar.
So there should be 2 entries like "Off Road Racing" versus "Offroad Racing".
Even so, your suggestion wouldn't be enough to locate the game called "4x4 Off-Road Racing". Unless you are suggesting to store lots of possible variants for each game title in the database? If so, that would interfere with autocomplete. Imagine someone typing "knight" and autocomplete showing 10 variants of the same title ("Knight Time", "Knight Tyme", "Knight-time", "Knight-tyme", "Knighttime", "Knighttyme", "Knightime", "Knightyme", etc) and none for Knight Lore, for instance.

Besides, it feels wrong to store aliases in the database for games that were never really known by those titles, just to help searches that are already working without this.
User avatar
RMartins
Manic Miner
Posts: 776
Joined: Thu Nov 16, 2017 3:26 pm

Re: Predictive Search

Post by RMartins »

Einar Saukas wrote: Fri Jul 19, 2019 5:44 pm
RMartins wrote: Fri Jul 19, 2019 2:42 amIf table data does not contain Upper case characters, we can never match them as Upper case characters.
However, if table data contains Upper case characters, then we can match or ignore them, depending on the collation being used.
I understood your point. I just happen to disagree with you :)
That's perfectly fine. 8-)

If we would all think alike, it would be boring :)

But I believe you are disagreeing, because I might not have explained it well enough.

If you take into account, that you are using a collation that is accent insensitive and case insensitive, it doesn't matter, what accentuated characters or which case they are in the database, because an A = a = Â = à = À = á = ... etc... using the *_unicode_ai_ci collation.

So having the original data accents and case on the search_by_titles table, doesn't affect how you currently do your searches.
If you don't believe me, change one and try :D

Einar Saukas wrote: Fri Jul 19, 2019 5:44 pm This was your proposal for uppercase:
So, if user uses Upper case in the search string, then it means they are looking for that specific case, for example to distinguish the start of a Word or a Name. But if the user uses all lower case, then it means that any match is acceptable including in the middle of a word or name.
According to your proposal, searching for "Earth Shock" won't find anything, because the exact spelling is "Earthshock".
Nope.
NOTE: his is why I believe, I might not have explained myself well.

In search engines, things known to have different spellings, like your example above, are perfectly fine.
They just need several entries into the search index (not database index), like if they were alternative names (aliases).

Clipping/removing the spaces, dashes, funny characters, etc..., to try to retain only the parts that are common to facilitate a match is not the way to go about it, because it will give a lot of false positives in other unrelated searches.

So in this case, you would have 3 records in search_by_titles, namely:
- "Earth Shock"
- "Earthshock"
- "Earth-Shock"
that map to the same exact game id (entry_id).

If you think about it, you already do this with the "aliases" table.

The problem with clipping the uncommon stuff, besides giving false positives, is that it will not work for names (aliases) that are no exact matches once you strip the funny(uncommon) characters between the several spellings.

A good example is the "Live and Let Die" versus "Aquablast".
Same game, different "spellings" or names for the same thing.

I hope that clears any doubt about it.
You get the good but you also get the bad, by clipping data.
If instead you just keep the several spellings, you get the good part, but without the bad ones.

Einar Saukas wrote: Fri Jul 19, 2019 5:44 pm The point is, I believe searches should be as permissive as possible. Right now, users don't need to know the exact spelling of a title. Searching for "Earth Shock", "earthshock" and "earth-shock" will find the intended game.
Completely agree.
Einar Saukas wrote: Fri Jul 19, 2019 5:44 pm It goes even further: right now, searching for "night rally" will find both "
Night Rally" and "Nightmare Rally".
I hope you agree with me, that those results are not the outcome, of clipping the funny/uncommon chars.
Because "night rally" when stripped of the funny chars, would become "nightrally", and would never match "nightmarerally".

I haven't checked, but that outcome is most probably the result of the "aliases" table or similar.
Einar Saukas wrote: Fri Jul 19, 2019 5:44 pm And searching for "Ocean Software" will also find titles published by Ocean brands and subsidiaries, not just Ocean directly. That's how searches in Martijn's WoS work and I always liked it, thus the reason I provided something similar with these auxiliary tables. It's a very different requirement from autocomplete, that only makes sense for exact matches.
Again, that is not the result of clipping the funny/uncommon chars.
That is the result of pre-processing some table joins to include extra keys in the search_by_* tables, which is exactly what I'm proposing. 8-)

We can still have all those features, without having the false positives.

Autocomplete is just a search, like any other. I see no difference there.
Einar Saukas wrote: Fri Jul 19, 2019 5:44 pm
RMartins wrote: Fri Jul 19, 2019 2:42 am Punctuation in data is not required to be removed, for similar reasons as case (being possible to match it, if needed).
I disagree...
Can you please try a search, after changing a few entries to have the correct case and accents, and confirm that you will still find them with the current collation in use ?

NOTE: If this isn't true, MySQL has a bug in that collation.
Einar Saukas wrote: Fri Jul 19, 2019 5:44 pm
RMartins wrote: Fri Jul 19, 2019 2:42 amIf we remove accents, we can no longer match them explicitly, since they are not there.
But if we have accents available in table data, we can ignore it (accent insensitive), just by using a different collation, or use it for mathching (accent sensitive).
And I still think searches here should be always accent insensitive. Otherwise searching for "iniciación" won't find any results.
OK, here is a matter of opinion, that I won't debate, because my proposition is based on the supposition of user intent, since we do not have a real search engine at hand, which would allow definition of rules to match cases like these, or misspells like char reversals and other combinations.
"Iniciation" versus "iniciacion" versus "inicaison" versus "iniciasno", etc...

Note however, if user knows and uses NON accentuated chars and gets the results with the correct accents, they will know what to do.
This is what I referred has user intent, in the search.
Einar Saukas wrote: Fri Jul 19, 2019 5:44 pm
RMartins wrote: Fri Jul 19, 2019 2:42 am Regarding special cases, ( like "Off Road" versus "Offroad" ) the standard solution is to have a dictionary for those (where synonyms can be defined) or special match/translationrules (when using a search engine), but probably the best we can do here is to provide aliases (AKA synonym) for those known cases, including the entire title for simplification, for example in the "aliases" table or similar.
So there should be 2 entries like "Off Road Racing" versus "Offroad Racing".
Even so, your suggestion wouldn't be enough to locate the game called "4x4 Off-Road Racing". Unless you are suggesting to store lots of possible variants for each game title in the database?
That's exactly right!
That's the whole idea of a search Index (completely different from a database Index), and they usually are huge!
Specially if we do not have a search engine, that can generate the combinations by itself, given some (many) rules.

Since we are using the poor man SQL solution, we need to build the search index tables, with all valid combinations and then search that, using SQL.
Einar Saukas wrote: Fri Jul 19, 2019 5:44 pm
If so, that would interfere with autocomplete. Imagine someone typing "knight" and autocomplete showing 10 variants of the same title ("Knight Time", "Knight Tyme", "Knight-time", "Knight-tyme", "Knighttime", "Knighttyme", "Knightime", "Knightyme", etc) and none for Knight Lore, for instance.
Actually that will work fine, because we only show the real title, extracted from the "entries.title", while searching using the search_by_titles.entry_title, and after performing an UNION operation, that removes the duplicates.

Code: Select all

SELECT	E.title AS name
FROM	entries E	INNER JOIN search_by_titles SBT ON (SBT.entry_id = E.id)
WHERE	SBT.entry_title LIKE ?
ORDER	BY E.title
LIMIT 10
Notice that the result comes from the entries table => E.title
Einar Saukas wrote: Fri Jul 19, 2019 5:44 pm
Besides, it feels wrong to store aliases in the database for games that were never really known by those titles, just to help searches that are already working without this.
Now, that I can relate! :lol:

It's that feeling that it seems a waste due to apparent duplication of data.
But that is the domain of the problem unfortunately.

But let me clarify that when i mentioned "aliases", it was to make it simpler to relate.
I know that "aliases" table has a more in depth context, which is more akin to different language spellings or country releases.

So when I said "aliases" I meant real aliases as in possible replacement names, which are usually handled in specific collections (AKA tables) for search indexes.

Thank you for your feedback.
I really appreciate it.

I hope I was able to clarify any doubt, I may have left previously.
User avatar
Einar Saukas
Bugaboo
Posts: 3070
Joined: Wed Nov 15, 2017 2:48 pm

Re: Predictive Search

Post by Einar Saukas »

RMartins wrote: Fri Jul 19, 2019 3:15 am
Einar Saukas wrote: Thu Jul 18, 2019 9:07 pmYou probably don't need to distinguish between authors and publishers.
Currently the Quick search is using this query:
There's no need. All labels are valid in quick search.

For instance, try searching for "columbia pictures" in quick search to see what happens. It's neither author or publisher, but you will still get a result.

RMartins wrote: Fri Jul 19, 2019 3:15 am
Einar Saukas wrote: Thu Jul 18, 2019 9:07 pm This way, it's also easy to impose a limit of 40 results overall, instead of partial limits of 10 each.
Not sure I understood what you meant.
Sorry, I understood you had an overall limit of 40 results. Now I see in your query that your overall limit is 10. Never mind!

RMartins wrote: Fri Jul 19, 2019 7:35 pmBut I believe you are disagreeing, because I might not have explained it well enough.
No. I understood exactly what you meant. There's no need to discuss it technically, because the problem is not technical. We just happen to have different ideas for the intended behavior.

You assume that users searching for "cadáv" are only interested in words containing this accent. I assume they want "Cadàveriön" but didn't memorize the exact spelling.

You assume that users searching for "off-road" are only interested in words containing this hyphen. I assume they want all off road games regardless of spelling.

You assume that users searching for "night rally" don't want to see "Nightmare Rally" in the list of results. I assume they do.

Basically you assume most users will bother to learn the nuances of using capital letters or not, using punctuation or not, etc. I assume they just want to find a game with minimum effort.

RMartins wrote: Fri Jul 19, 2019 7:35 pm If you take into account, that you are using a collation that is accent insensitive and case insensitive, it doesn't matter, what accentuated characters or which case they are in the database, because an A = a = Â = à = À = á = ... etc... using the *_unicode_ai_ci collation.

So having the original data accents and case on the search_by_titles table, doesn't affect how you currently do your searches.
If you don't believe me, change one and try :D
Of course I believe you. That's exactly what I wrote in a previous post. In this post I wrote "I don't really need to store text in these tables as lowercase only, since all comparisons use "ci" collation. Everything is currently converted to lowercase just to reduce unnecessary duplicates."

RMartins wrote: Fri Jul 19, 2019 7:35 pm They just need several entries into the search index (not database index), like if they were alternative names (aliases).
And that's exactly what I wanted to avoid, because maintaining a list of alternative names would be a PITA.

RMartins wrote: Fri Jul 19, 2019 7:35 pmIf you think about it, you already do this with the "aliases" table.
No, I don't. The "aliases" table contains alternative official names for the same game. It doesn't make sense to add other titles to this table that a game never had, just to help searches.

RMartins wrote: Fri Jul 19, 2019 7:35 pmYou get the good but you also get the bad, by clipping data.
If instead you just keep the several spellings, you get the good part, but without the bad ones.
I know. False positives is the price to pay for easier, broader searches. I think that's a good tradeoff.

RMartins wrote: Fri Jul 19, 2019 7:35 pmWe can still have all those features, without having the false positives.
I disagree, because my understanding of desired features is different than yours (as I described at the beginning of this post).

RMartins wrote: Fri Jul 19, 2019 7:35 pmAutocomplete is just a search, like any other. I see no difference there.
I do!

RMartins wrote: Fri Jul 19, 2019 7:35 pmCan you please try a search, after changing a few entries to have the correct case and accents, and confirm that you will still find them with the current collation in use ?
RMartins wrote: Fri Jul 19, 2019 7:35 pmSince we are using the poor man SQL solution, we need to build the search index tables, with all valid combinations and then search that, using SQL.
Again, that's what I'm trying to avoid.

RMartins wrote: Fri Jul 19, 2019 7:35 pmActually that will work fine, because we only show the real title, extracted from the "entries.title", while searching using the search_by_titles.entry_title, and after performing an UNION operation, that removes the duplicates.
Your proposal wouldn't be able to find "Electra 9000", for instance. It's the title of a game re-released by Alternative, probably better known than the original title from the first K'Soft release.

RMartins wrote: Fri Jul 19, 2019 7:35 pm Thank you for your feedback.
I really appreciate it.

I hope I was able to clarify any doubt, I may have left previously.
Absolutely!

And I hope my answers didn't sound as negative criticism, it wasn't my intent. You originally asked about the motivations for the current implementation, so I tried to explain them all and point out the reasons I didn't do it differently. However I have no right to impose my own preferences on any websites. You are currently working on autocomplete and searches at SC, so the site functionality should be your decision, not mine. Please feel free to implement your ideas, and if there's any way I can help just let me know. I'm confident you can do a good job, and hopefully the issues I pointed out can help you make even better choices. I will gladly leave the ball in your court now! :)

BTW tables "search_by" are included in ZXDB distribution right now, just for convenience. However they are not really part of the database content. I actually like the idea of distributing them separately, to emphasize they are optional and motivate each website to come up with its own creative solutions. Good luck!
User avatar
RMartins
Manic Miner
Posts: 776
Joined: Thu Nov 16, 2017 3:26 pm

Re: Predictive Search

Post by RMartins »

Once again thank you for your feedback [mention]Einar Saukas[/mention].

This I s just a debate so I hope I'm not comming across as imposing anything.That's not the intent.

Having said that, it would be nice to know what other users think about this subject, in order to get more feedback and ideas.

One of the reasons why I did not want to go too technical, so not to scare the rest of the crowd 😂

Please come back and give us your opinion or insight on how you use or want to use search.

Thank you in advance for any feedback.
User avatar
Einar Saukas
Bugaboo
Posts: 3070
Joined: Wed Nov 15, 2017 2:48 pm

Re: Predictive Search

Post by Einar Saukas »

RMartins wrote: Sat Jul 20, 2019 12:48 amThis I s just a debate so I hope I'm not comming across as imposing anything.That's not the intent.
Sure, don't worry about it!

RMartins wrote: Sat Jul 20, 2019 12:48 amHaving said that, it would be nice to know what other users think about this subject, in order to get more feedback and ideas.
I suspect this discussion has been too abstract for most users. It's hard for anyone to evaluate what's the best interface without actually trying it.

I suggest you implement your idea in your alternate page (or even replace the main page), afterwards write a post describing the changes and asking for feedback.
User avatar
RMartins
Manic Miner
Posts: 776
Joined: Thu Nov 16, 2017 3:26 pm

Re: Predictive Search

Post by RMartins »

Einar Saukas wrote: Sat Jul 20, 2019 11:36 pm ...
RMartins wrote: Sat Jul 20, 2019 12:48 amHaving said that, it would be nice to know what other users think about this subject, in order to get more feedback and ideas.
I suspect this discussion has been too abstract for most users. It's hard for anyone to evaluate what's the best interface without actually trying it.

I suggest you implement your idea in your alternate page (or even replace the main page), afterwards write a post describing the changes and asking for feedback.
For now the intent was to provide an autocomplete, and that's done and working.
I will probably change it a bit, to allow something like 40 entries, but just show 10, but then allow to scroll to see the remaining 30.

To review the way how search is being done, I need to first understand the data and relation between the several tables involved, in more detail.
In particular, how other info is kept in the labels table.



Some extra coments below.
Einar Saukas wrote: Sat Jul 20, 2019 12:05 am ...
RMartins wrote: Fri Jul 19, 2019 7:35 pmBut I believe you are disagreeing, because I might not have explained it well enough.
No. I understood exactly what you meant. There's no need to discuss it technically, because the problem is not technical. We just happen to have different ideas for the intended behavior.

You assume that users searching for "cadáv" are only interested in words containing this accent. I assume they want "Cadàveriön" but didn't memorize the exact spelling.

You assume that users searching for "off-road" are only interested in words containing this hyphen. I assume they want all off road games regardless of spelling.

You assume that users searching for "night rally" don't want to see "Nightmare Rally" in the list of results. I assume they do.

Basically you assume most users will bother to learn the nuances of using capital letters or not, using punctuation or not, etc. I assume they just want to find a game with minimum effort.
Yes I assume all that, behind what I mentioned as user intent.

They can still search exactly like you describe, just by not using any case or accent.
You can call that a "nuance" they will have to learn. I have no problem with that.

The thing is, if they do not want to mach every single "ana" in the search topic, but just want to get the ones that match "aña" they can, and will not have FALSE positives.
With the current solution they can't, it's case insensitive and accent insensitive all the way.

Having a choice for refining the search, seems an improvement or win to me, but that might not be the opinion of the majority.

Ideally we would have a search engine, and someone to maintain it, but we don't.
Einar Saukas wrote: Sat Jul 20, 2019 12:05 am ...
RMartins wrote: Fri Jul 19, 2019 7:35 pm They just need several entries into the search index (not database index), like if they were alternative names (aliases).
And that's exactly what I wanted to avoid, because maintaining a list of alternative names would be a PITA.
OK, I understand that it involves some effort, but that's the only feasible (NON FALSE positives) way to do it, using only SQL.
Einar Saukas wrote: Sat Jul 20, 2019 12:05 am
RMartins wrote: Fri Jul 19, 2019 7:35 pmIf you think about it, you already do this with the "aliases" table.
No, I don't. The "aliases" table contains alternative official names for the same game. It doesn't make sense to add other titles to this table that a game never had, just to help searches.
I believe I explained that it should be some other table, and not the "aliases" table, since that has a specific release context.
A table specific for this purpose, like "title_synonyms" or similar.
i.e. specific tables and data created for handling searching, like the search_by_* tables already are.
Einar Saukas wrote: Sat Jul 20, 2019 12:05 am
RMartins wrote: Fri Jul 19, 2019 7:35 pmYou get the good but you also get the bad, by clipping data.
If instead you just keep the several spellings, you get the good part, but without the bad ones.
I know. False positives is the price to pay for easier, broader searches. I think that's a good tradeoff.
OK, I just believe we can improve on the current solution.
Einar Saukas wrote: Sat Jul 20, 2019 12:05 am ...
RMartins wrote: Fri Jul 19, 2019 7:35 pmSince we are using the poor man SQL solution, we need to build the search index tables, with all valid combinations and then search that, using SQL.
Again, that's what I'm trying to avoid.
I understand that, but by creating the search_by_* tables, aren't you actually already doing something similar ?
Surely not as involved as it could be, but a start of what a typical "search index" is.
Einar Saukas wrote: Sat Jul 20, 2019 12:05 am ...
Your proposal wouldn't be able to find "Electra 9000", for instance. It's the title of a game re-released by Alternative, probably better known than the original title from the first K'Soft release.
I searched for "Electra" and it finds it, but searching for "electra" fails to find it, curious!
Maybe I missed something.

Are you referring to the fact, that current quick search query, doesn't include "labels" alone (without joining to authors or publishers) ?

If that's the case, I was just following the tip on the field "Search by Title, Author, Publisher".
Maybe we should include the "labels" individually too.

But to be fair, I still haven't grasped the "labels" table completely, because it seems a bit messy to have some of it related with other tables, giving it context (like authors or publishers), but part of it, is neither of those, and has not relation with any other extra context giving table.
I find it a bit odd, although I do understand that this is probably related with legacy data.

Again, thank you for your feedback.
I do appreciate it.
User avatar
Einar Saukas
Bugaboo
Posts: 3070
Joined: Wed Nov 15, 2017 2:48 pm

Re: Predictive Search

Post by Einar Saukas »

RMartins wrote: Sun Jul 21, 2019 3:27 pm I searched for "Electra" and it finds it, but searching for "electra" fails to find it, curious!
Maybe I missed something.
Both "Electra" and "electra" provide the same results in searching. But they show different results in autocomplete.

RMartins wrote: Sun Jul 21, 2019 3:27 pm Are you referring to the fact, that current quick search query, doesn't include "labels" alone (without joining to authors or publishers) ?
It's the opposite. Current quick search query includes "labels" alone (but autocomplete doesn't).

RMartins wrote: Sun Jul 21, 2019 3:27 pm If that's the case, I was just following the tip on the field "Search by Title, Author, Publisher".
A more accurate description would be "Search by Title, Author, Publisher, License Owner, Magazine Name, Team Name, Magazine Columnist, etc"

RMartins wrote: Sun Jul 21, 2019 3:27 pm Maybe we should include the "labels" individually too.
Yes.

RMartins wrote: Sun Jul 21, 2019 3:27 pm But to be fair, I still haven't grasped the "labels" table completely
In ZXDB, table "labels" contains all entities. It includes real people's names, nicknames, companies, user groups, fantasy names used by publishers, etc.

Some of them are authors (of certain titles), for instance Joffa.

Some of them are publishers (or certain titles), for instance Ocean Software.

Some of them are both authors and publishers, for instance Jonathan Cauldwell.

Some of them are owners (of certain licenses), for instance Daley Thompson.

Some of them are both authors and license owners, for instance Alfonso Azpiri.

There are a few "relationship" tables that associate entries to the entities that authored them, or published them, or own the licenses related to them. They are described here:

https://github.com/zxdb/ZXDB

RMartins wrote: Sun Jul 21, 2019 3:27 pm because it seems a bit messy to have some of it related with other tables, giving it context (like authors or publishers), but part of it, is neither of those, and has not relation with any other extra context giving table.
I find it a bit odd, although I do understand that this is probably related with legacy data.
No, it's not odd, not messy, and not related with legacy data. We already had this conversation here.

A certain person could have authored different games. Perhaps also published a few of them, worked as columnist in some magazine, or created some character that was later licensed to others. Don't you think it makes sense to store information about each person (country, nickname, homepage, etc) in a separate table, instead of duplicating it everywhere?

RMartins wrote: Sun Jul 21, 2019 3:27 pm Again, thank you for your feedback.
I do appreciate it.
You are welcome!
User avatar
RMartins
Manic Miner
Posts: 776
Joined: Thu Nov 16, 2017 3:26 pm

Re: Predictive Search

Post by RMartins »

Einar Saukas wrote: Mon Jul 22, 2019 5:58 pm
RMartins wrote: Sun Jul 21, 2019 3:27 pm I searched for "Electra" and it finds it, but searching for "electra" fails to find it, curious!
Maybe I missed something.
Both "Electra" and "electra" provide the same results in searching. But they show different results in autocomplete.

RMartins wrote: Sun Jul 21, 2019 3:27 pm Are you referring to the fact, that current quick search query, doesn't include "labels" alone (without joining to authors or publishers) ?
It's the opposite. Current quick search query includes "labels" alone (but autocomplete doesn't).
OK, I see what you mean.
The quick search is different from the autocomplete search, in current implementation.
That's true, due to the tip definition that I followed.

I can re-align that, if that's the intent.
Einar Saukas wrote: Mon Jul 22, 2019 5:58 pm ...
There are a few "relationship" tables that associate entries to the entities that authored them, or published them, or own the licenses related to them. They are described here:

https://github.com/zxdb/ZXDB
Yes, you have mentioned something along those lines before.
Einar Saukas wrote: Mon Jul 22, 2019 5:58 pm
RMartins wrote: Sun Jul 21, 2019 3:27 pm because it seems a bit messy to have some of it related with other tables, giving it context (like authors or publishers), but part of it, is neither of those, and has not relation with any other extra context giving table.
I find it a bit odd, although I do understand that this is probably related with legacy data.
No, it's not odd, not messy, and not related with legacy data. We already had this conversation here.
Maybe I haven't explained myself well enough.
I'm not criticizing the ZXDB tables or it's data, per se.

Let me try and explain what I meant by "has not relation with any other extra context giving table".
And I may be wrong, so please correct me if that is the case.

"Labels" table, has a bunch of records, some:
- have a reference to the "authors" table, giving that "label" context in the sense that it's an author name.
- have a reference to the "publishers" table, giving that "label" context in the sense that it's a publisher label/name.
- have a reference to the "licensors" table, giving that "label" context in the sense that it's a License Owner label/name.
but there seems to exist more label data in "labels" table then just "authors", "publishers" or "licensors".

Assuming this is true (might not be), I haven't seen any extra table referenced, in order to give context to whatever other stuff is in the "labels" records.

At least, that's might current view, which can be wrong.
Einar Saukas wrote: Mon Jul 22, 2019 5:58 pm A certain person could have authored different games. Perhaps also published a few of them, worked as columnist in some magazine, or created some character that was later licensed to others. Don't you think it makes sense to store information about each person (country, nickname, homepage, etc) in a separate table, instead of duplicating it everywhere?
YEs, I do agree with this.

But from my current perspective (mentioned above), there seems to be some info that does not have context associated.
And that was my only complain about feeling it's "messy" or a "bit odd".

Thank you for your feedback.
User avatar
Einar Saukas
Bugaboo
Posts: 3070
Joined: Wed Nov 15, 2017 2:48 pm

Re: Predictive Search

Post by Einar Saukas »

RMartins wrote: Mon Jul 22, 2019 7:27 pm OK, I see what you mean.
The quick search is different from the autocomplete search, in current implementation.
That's true, due to the tip definition that I followed.

I can re-align that, if that's the intent.
Yes, I think you should accept all "labels".

Right now, quick searches also provides other results besides authors and publishers.

Einar Saukas wrote: Mon Jul 22, 2019 5:58 pm Maybe I haven't explained myself well enough.
I'm not criticizing the ZXDB tables or it's data, per se.

Let me try and explain what I meant by "has not relation with any other extra context giving table".
And I may be wrong, so please correct me if that is the case.

"Labels" table, has a bunch of records, some:
- have a reference to the "authors" table, giving that "label" context in the sense that it's an author name.
- have a reference to the "publishers" table, giving that "label" context in the sense that it's a publisher label/name.
- have a reference to the "licensors" table, giving that "label" context in the sense that it's a License Owner label/name.
but there seems to exist more label data in "labels" table then just "authors", "publishers" or "licensors".

Assuming this is true (might not be), I haven't seen any extra table referenced, in order to give context to whatever other stuff is in the "labels" records.
There are many other references:

* A label can be referenced to another label. For instance, label "Frobush" is a nickname that belongs to label "Jonathan M. Smith". Because of this, you can search for this nickname to find his games, although he was never credited as "Frobush" in any of them.

* A label can be a team of developers. For instance, label "Bizarre Developments" was a team behind the development of a few titles. It wasn't an individual author, publisher or license owner for any of them.

* A label can be a columnist responsible for a regular section in a magazine, in our magazine references. For instance "Lloyd Mangram".

* A label can be a company that's not a publisher, but it's the main company that owns several other publishers. For instance "British Telecom".

And so on.

You can take a look at the ZXDB schema to get all possible references. Or you can simply allow searches on all existing labels.
User avatar
Metalbrain
Microbot
Posts: 107
Joined: Thu Feb 15, 2018 2:14 pm

Re: Predictive Search

Post by Metalbrain »

Regarding the "aña" vs "ana" search, would it be possible to make an initial search that is case/accent insensitive , and then refine it to prioritize the case/accent sensitive results? This way the intended results would appear first, but if someone makes a mistake (such as the cadáv vs cadàv one), he'd still get the result.

I'd love to have this kind of search, and the fact that Google (and search engines in general) just does insensitive searches, no matter how specific you write the terms has always frustrated the hell out of me.
User avatar
RMartins
Manic Miner
Posts: 776
Joined: Thu Nov 16, 2017 3:26 pm

Re: Predictive Search

Post by RMartins »

Metalbrain wrote: Tue Jul 23, 2019 10:06 am Regarding the "aña" vs "ana" search, would it be possible to make an initial search that is case/accent insensitive , and then refine it to prioritize the case/accent sensitive results? This way the intended results would appear first, but if someone makes a mistake (such as the cadáv vs cadàv one), he'd still get the result.
Well, yes and no.

That means we have to make 2 queries, or enlarge the current query, to replicate every single subquery, which is basically the same.

However, even if we do this ( at least for autocomplete), the number of displayable results is limited to some N (currently 10).
So adding other entries, could generate duplicates (if using separate queries).

On the other hand, if we get the positive matches, they will have the real name, which is correctly accented and cased.
So I'm not really sure if there is a benefit there.

The real benefit we would like is to avoid false positives, while still finding all the entries we are currently able to find.
Metalbrain wrote: Tue Jul 23, 2019 10:06 am I'd love to have this kind of search, and the fact that Google (and search engines in general) just does insensitive searches, no matter how specific you write the terms has always frustrated the hell out of me.
Yes, in general, having some control on the match is desirable, but Google results are hugely more complex, and using a real search engine, with tons of configurable properties and rules. Not even comparable to something like an SQL query. :D
User avatar
RMartins
Manic Miner
Posts: 776
Joined: Thu Nov 16, 2017 3:26 pm

Re: Predictive Search

Post by RMartins »

Although I haven't had much feedback yet, I know there is some problem, when different case searches are returning different results.
Searching for "Chaos" or "chaos" gives a different result list.

As soon as I have some extra free time, I'll take a look a find the reason for it.
I already have an idea of what it might be.

If any one has some other issue to report, please do so.
If I have all the known issues available, I can fix them all in one session, if possible.
Last edited by RMartins on Tue Jul 23, 2019 3:40 pm, edited 1 time in total.
User avatar
Metalbrain
Microbot
Posts: 107
Joined: Thu Feb 15, 2018 2:14 pm

Re: Predictive Search

Post by Metalbrain »

RMartins wrote: Tue Jul 23, 2019 2:57 pm
Metalbrain wrote: Tue Jul 23, 2019 10:06 am Regarding the "aña" vs "ana" search, would it be possible to make an initial search that is case/accent insensitive , and then refine it to prioritize the case/accent sensitive results? This way the intended results would appear first, but if someone makes a mistake (such as the cadáv vs cadàv one), he'd still get the result.
Well, yes and no.

That means we have to make 2 queries, or enlarge the current query, to replicate every single subquery, which is basically the same.

However, even if we do this ( at least for autocomplete), the number of displayable results is limited to some N (currently 10).
So adding other entries, could generate duplicates (if using separate queries).

On the other hand, if we get the positive matches, they will have the real name, which is correctly accented and cased.
So I'm not really sure if there is a benefit there.

The real benefit we would like is to avoid false positives, while still finding all the entries we are currently able to find.
For autocomplete, we'd need 2 separate searches with the capped number of results (only if uppercase or special chars are used, otherwise just the insensitive search would be enough), and we'd show the N first results of the sensitive search first, and if there are not enough sensitive results we'd fill up the results later with the results from the insensitive search, placed lower (and removing any duplicates from the sensitive search).

For the main search (without capping the number of results), the sensitive search could be made only within the results of the insensitive search, to make it faster (and once again, only if needed). And the results (if we get them) from the inner sensitive search would appear first, and the "only insensitive" results would appear lower.
Post Reply