Predictive Search

ZXDunny · Post by **ZXDunny** » Wed Jul 17, 2019 12:25 pm

RMartins wrote: ↑Wed Jul 17, 2019 11:42 am
ZXDunny wrote: ↑Wed Jul 17, 2019 11:22 am No effect here, I've cleared cookies, refreshed hard with CTRL+F5 and rebooted Win10.

Still no predictive search.
Did you notice that you need to use a different URL ?
https://spectrumcomputing.co.uk/index_new.php

Aha! That was the problem - I was just clicking on the banner at the top of the forum

Feel a bit of a pratt now!

PeterJ · Post by **PeterJ** » Wed Jul 17, 2019 1:39 pm

No worries [mention]ZXDunny[/mention]

Just glad you are now able to test.

ZXDunny · Post by **ZXDunny** » Wed Jul 17, 2019 2:30 pm

Yeah, it works. It's a bit weird typing in "Cyb" and getting results that have "cyb" somewhere inside them, rather than starting with it. We may need to add in some modifiers for the user to specify where the results need to match.

RMartins · Post by **RMartins** » Wed Jul 17, 2019 2:41 pm

ZXDunny wrote: ↑Wed Jul 17, 2019 2:30 pm Yeah, it works. It's a bit weird typing in "Cyb" and getting results that have "cyb" somewhere inside them, rather than starting with it. We may need to add in some modifiers for the user to specify where the results need to match.

That's the current behaviour, of the search_by tables.

"Search" is a big subject. I know because I was responsible (tech lead) for web platform of "Yellow Pages" in my country, for about 3 years.
Doing search properly, requires a very different approach than using SQL and is something that takes constant effort to maintain.

Here the approach being followed, is to have the best we can just using SQL, so some limitations apply, but we will try to improve whenever we can.

Some things still need to be debated, taking into account, what features we would like to have and what can be done with an SQL only solution.

RMartins · Post by **RMartins** » Wed Jul 17, 2019 10:42 pm

To the ones, that keep an eye on the Tab Icons, there is a proposal for a new one on this new page.

It was based on the site Logo, but its design was adapted to fit the size restrictions of a favicon.
Previous favicon was just a rescale of the site logo, and becomes unreadable and blurry at smaller versions.

NOTE: It might not show up on every browser, since there are a lot of settings for the favicon, and several formats, so some might still be missing.

Metalbrain · Post by **Metalbrain** » Thu Jul 18, 2019 9:07 am

Tested in several browsers, it's working on all of them.

Regarding the Sir Fred problem, it can be found if we skip the space (like looking for sirfr). Looking just for "Sir ", we only get these results:
Sir Clive
Sir Clive and Mr. ZX
Sir Computers

and looking for "sir " in lowercase, we get nothing

RMartins · Post by **RMartins** » Thu Jul 18, 2019 3:10 pm

"Sir Fred", or any other Title that is searched using spaces, will not easily find a match.

And the reason for this, is that search tables, effectively remove the spaces in the titles.
... replace( lower(title),' ','' ),

I'm not sure what was the reason/motivation behind removing the spaces, but I'm sure there must have been one.

The problem with removing spaces is that then it becomes possible to find text formed by concatenating 2 previous words.
For example, we can find "Green Beret" by searching for "nbe", which is not what users expect.

Another example, is that searching for "green beret" doesn't return any result.
But searching for "greenberet" will get you a result for "Green Beret".

From a usability stand point, humans typically search for words or parts of words, they never expect to find substrings of concatenations of words.

Another implicit usability rule, is that if users are very specific about something, is because they are passing an intent.
Examples are Upper case letters or special chars like "ñ".

So, if user uses Upper case in the search string, then it means they are looking for that specific case, for example to distinguish the start of a Word or a Name. But if the user uses all lower case, then it means that any match is acceptable including in the middle of a word or name.

If user uses special chars, it means they explicitly want to find that specific accentuated char. Searching for "aña" means that the user wants matches like "Cañas", but not "Anagrams" or "Zemana", but we currently get those.
However, if the user searches for "ana", We expect all of these to be valid hits: "Cañas", "Anagrams" or "Zemana".

The good news, is that all of the above, can be done using just SQL, a pré-processed search table and collation rules.

These are all things, that need some debate, to find the best compromise for an SQL only solution.

RMartins · Post by **RMartins** » Thu Jul 18, 2019 3:18 pm

On another note, take into account, that the autocomplete feature, can not show a huge amount of results, so the result is being limited to the first 10 entries, listed in alphabetical ascending order.

If the user does not find the result he expects, he needs to drill down into a more specific content, by continuing to write a larger search string.

Also remember, that quick search, includes: Titles, Aliases, Authors and Publishers.

So, 10 of each are gathered and sorted individually, and then sorted as a whole (this guarantees the results are deterministic for the same query) and only the first 10 are shown (of potentially 40 results).

If you are unlucky, it can be 10 of the same type: 10 titles or 10 Authors, etc... so you might need to drill down until you get to the sweet spot, where what you are searching for, actually shows up in that 10 slot window.

Einar Saukas · Post by **Einar Saukas** » Thu Jul 18, 2019 3:37 pm

RMartins wrote: ↑Thu Jul 18, 2019 3:10 pmI'm not sure what was the reason/motivation behind removing the spaces, but I'm sure there must have been one.

The motivation is that users searching for either "Knight Lore" or "Knightlore" will work in both cases. Searching for either "Chop Lifter" or "Choplifter" will also work in both cases. Incidentally the exact spellings are "Knight Lore" and "Choplifter", but nobody needs to know that.

Ignoring special characters has the same motivation. Otherwise nobody would ever find "Cadàveriön". And it has to work both ways, otherwise searching for proper spellings like "iniciación" would not find anything, because most foreign titles imported from Martijn's WoS were missing accentuation (we are gradually fixing it but it's going to take some time).

Thus basically "search_by" tables were created to help searches work as good as possible. They were never intended to help autocomplete, and I doubt there's a single solution that would work for both. You may need to create your own auxiliary tables for that.

RMartins · Post by **RMartins** » Thu Jul 18, 2019 4:05 pm

Einar Saukas wrote: ↑Thu Jul 18, 2019 3:37 pm
RMartins wrote: ↑Thu Jul 18, 2019 3:10 pmI'm not sure what was the reason/motivation behind removing the spaces, but I'm sure there must have been one.
The motivation is that users searching for either "Knight Lore" or "Knightlore" will work in both cases. Searching for either "Chop Lifter" or "Choplifter" will also work in both cases. Incidentally the exact spellings are "Knight Lore" and "Choplifter", but nobody needs to know that.

And I have nothing against that.
But that is why search is a hard topic.

For these kinds of wanted matches, a specific extra entry should be created for it, instead of just clipping the space on the existing one.
What I mean is the search table should have both spellings (or how many that title is usually searched for).

This is why I said, that a well fined tune search engine, takes time and effort to maintain.

Einar Saukas wrote: ↑Thu Jul 18, 2019 3:37 pm Ignoring special characters has the same motivation. Otherwise nobody would ever find "Cadàveriön". And it has to work both ways, otherwise searching for proper spellings like "iniciación" would not find anything, because most foreign titles imported from Martijn's WoS were missing accentuation (we are gradually fixing it but it's going to take some time).

I also do not have anything against that.
However, it should not be "the only way" to search (case insensitive and accent insensitive).

For example, maybe add some entries in the "aliases" tables could be kept as search related aliases for titles.

NOTE: we can use a different COLLATION, on each query condition for example.

We can determine the collation to use, depending on the input search string for example (if user used upper case, or used special chars).
Look at my explanation on user intent. I can make an example later, if needed.

Einar Saukas wrote: ↑Thu Jul 18, 2019 3:37 pm Thus basically "search_by" tables were created to help searches work as good as possible. They were never intended to help autocomplete, and I doubt there's a single solution that would work for both. You may need to create your own auxiliary tables for that.

What I'm proposing, is that we can in fact use the same system, and we probably should.
We just need to refine it a bit, to fit the intended user searches.

Any user would expect that if he can find something using quick search, to be able to find it on advanced search using the same text on the proper field.

But anyway, these are intended as ideas for debate.
Just trying to help out improve the search.

Einar Saukas · Post by **Einar Saukas** » Thu Jul 18, 2019 8:48 pm

Sure, I understand your ideas. They are quite similar to my considerations when I created these tables about a year ago. But I still don't see how they could be improved.

I don't really need to store text in these tables as lowercase only, since all comparisons use "ci" collation. Everything is currently converted to lowercase just to reduce unnecessary duplicates. Also collation is currently defined in these tables just for convenience, so site owners don't need to remember to specify collation in their queries. Changing these details wouldn't improve searches or help autocomplete.

Moreover, I still think it's a good idea to remove spaces and punctuation from these tables, otherwise users searching for either "Offroad Racing" or "Off Road Racing" wouldn't find anything.

Admittedly I'm not an expert in this area so I could be wrong. If you have any changes to suggest, please go ahead!

Einar Saukas · Post by **Einar Saukas** » Thu Jul 18, 2019 9:07 pm

RMartins wrote: ↑Thu Jul 18, 2019 3:18 pmAlso remember, that quick search, includes: Titles, Aliases, Authors and Publishers.

You probably don't need to distinguish between authors and publishers.

In field "author" you can simply autocomplete against:

Code: Select all

SELECT name FROM labels WHERE id IN
(SELECT label_id FROM search_by_authors)

Analogously for "publisher":

Code: Select all

SELECT name FROM labels WHERE id IN
(SELECT label_id FROM search_by_publishers)

In field "title" you can autocomplete against:

Code: Select all

SELECT title FROM entries 
UNION ALL SELECT title FROM aliases

And in a generic field "text" in basic search, you can autocomplete against:

Code: Select all

SELECT title AS text FROM entries 
UNION ALL SELECT title AS text FROM aliases 
UNION ALL SELECT name AS text FROM labels

For simplicity I have omitted COLLATION, GROUP BY, and ORDER BY. They are left as exercise for the reader

This way, it's also easy to impose a limit of 40 results overall, instead of partial limits of 10 each.

RMartins · Post by **RMartins** » Fri Jul 19, 2019 2:42 am

Einar Saukas wrote: ↑Thu Jul 18, 2019 8:48 pm Sure, I understand your ideas. They are quite similar to my considerations when I created these tables about a year ago. But I still don't see how they could be improved.

I don't really need to store text in these tables as lowercase only, since all comparisons use "ci" collation. Everything is currently converted to lowercase just to reduce unnecessary duplicates. Also collation is currently defined in these tables just for convenience, so site owners don't need to remember to specify collation in their queries. Changing these details wouldn't improve searches or help autocomplete.

I agree, except with the last sentence

NOTE:
If table data does not contain Upper case characters, we can never match them as Upper case characters.
However, if table data contains Upper case characters, then we can match or ignore them, depending on the collation being used.

So, there is an improvement to be gained, by not forcing the content to be all lower case, which is to allow matching case if required.

Another improvement, is that instead of using MYSQL collation utf8_unicode_ci , the MB4 version should be used (utf8mb4_unicode_ci), according to note on the following documentation page https://dev.mysql.com/doc/refman/8.0/en ... icode.html (this is mentioned in several places in the documentaiton, since the 3 byte version, is a MySQL fluke and non standard Umicode.

Einar Saukas wrote: ↑Thu Jul 18, 2019 8:48 pm Moreover, I still think it's a good idea to remove spaces and punctuation from these tables, otherwise users searching for either "Offroad Racing" or "Off Road Racing" wouldn't find anything.

Punctuation in data is not required to be removed, for similar reasons as case (being possible to match it, if needed).

NOTE:
If we remove accents, we can no longer match them explicitly, since they are not there.
But if we have accents available in table data, we can ignore it (accent insensitive), just by using a different collation, or use it for mathching (accent sensitive).

NOTE:
Current collation in use utf8_unicode_ci, is equivalent to utf8_unicode_ai_ci (3 byte Unicode accent insensitive and case insensitive).
This is why searching for '%aña%' returns results that contain 'ana'.

Here is some documentation reference on this:
https://dev.mysql.com/doc/refman/8.0/en ... names.html

Regarding special cases, ( like "Off Road" versus "Offroad" ) the standard solution is to have a dictionary for those (where synonyms can be defined) or special match/translationrules (when using a search engine), but probably the best we can do here is to provide aliases (AKA synonym) for those known cases, including the entire title for simplification, for example in the "aliases" table or similar.
So there should be 2 entries like "Off Road Racing" versus "Offroad Racing".

This "aliases" (dictionary) can then be used, when generating the search indexes (AKA search_by_* tables in our case).

Einar Saukas wrote: ↑Thu Jul 18, 2019 8:48 pm Admittedly I'm not an expert in this area so I could be wrong. If you have any changes to suggest, please go ahead!

Neither am I, because search is a huge subject, with many considerations and p´ros and cons, like any engineering solution, sometimes compromises need to be done.

But I can give you practical examples of this stuff I'm talking about.

For example, in every place in the world, there are streets or areas that are known by a different name, then their real name, usually for cultural reasons. And hence this is the name that most people use to search stuff in that area.

For example, in Lisbon, there was a big international event in 1998, The World International Exposition, also known as Expo 98.

The surrounding area of that event, got culturally known as "EXPO", which in fact, maps to 3 different parishes.
So when people search for stuff in EXPO, we know that we should limit the search to that geographic location.

The same happens, to a river side road that is culturally called "Marginal", since it's on the margin/bank of the river, but it has 3 or 4 names, since it crosses several parishes and 2 counties.

The same probably happens with some games, that might have some culturally distinct name.

No good example comes to mind right now, but a tentative example is "Chaos - The Battle of Wizards", that is known by many as just "Chaos". Not a very good example, because the cultural name is a substring of the real name.

Another example is "Live and Let Die" which is also known as "Aquablast". Although these different names are from different releases, but the idea behind is the same.

RMartins · Post by **RMartins** » Fri Jul 19, 2019 3:15 am

Einar Saukas wrote: ↑Thu Jul 18, 2019 9:07 pm
RMartins wrote: ↑Thu Jul 18, 2019 3:18 pmAlso remember, that quick search, includes: Titles, Aliases, Authors and Publishers.
You probably don't need to distinguish between authors and publishers.

...

Currently the Quick search is using this query:

Code: Select all

SELECT N.name
FROM(
(
	SELECT	E.title AS name
	FROM	entries E	INNER JOIN search_by_titles SBT ON (SBT.entry_id = E.id)
	WHERE	SBT.entry_title LIKE ?
	ORDER	BY E.title
	LIMIT 10
)
UNION
(
	SELECT	A.title AS name
	FROM	aliases A
	WHERE	A.title LIKE ?
	ORDER	BY A.title
	LIMIT 10
)
UNION
(
	SELECT	L.name AS name
	FROM	labels L	INNER JOIN search_by_authors SBA ON (SBA.label_id = L.id)
	WHERE	L.name LIKE ?
	ORDER	BY L.name
	LIMIT 10
)
UNION
(
	SELECT	L.name AS name
	FROM	labels L	INNER JOIN search_by_publishers SBP ON (SBP.label_id = L.id)
	WHERE	L.name LIKE ?
	ORDER	BY L.name
	LIMIT 10
)) N
ORDER BY N.name
LIMIT 10

Einar Saukas wrote: ↑Thu Jul 18, 2019 9:07 pm This way, it's also easy to impose a limit of 40 results overall, instead of partial limits of 10 each.

Not sure I understood what you meant.

It does not matter, if we have 40 results of each, if we are only going to present 10 in total.
At most they could be 10 of the same type, so we will never show the other 30 of each type.
If you look at the above query, it should be self explanatory.

I would suggest we take the more technical detail somewhere else, to avoid loosing most of the current readers

However, I can give a simple example, on how to query for each intent.
Here is a simple explanation of how COLLATIONs work, using weights.
https://dev.mysql.com/doc/refman/8.0/en ... ation.html

Lets focus just on titles for example.

Code: Select all

SELECT	E.title AS name
FROM	entries E	INNER JOIN search_by_titles SBT ON (SBT.entry_id = E.id)
WHERE	SBT.entry_title LIKE ?
ORDER	BY E.title
LIMIT 10

Current COLLATION is utf8_unicode_ci, meaning the search is accent insensitive and case insensitive.

Now suppose the user, introduced a search string like 'aña':
- this means the user is interested in matching the special char 'ñ'.
- the user did not use any Upper case character ( so is not interested in a specific case match ).

From the observation of the search input string, we can infer that we should do an accent sensitive and case insensitive search.

Code: Select all

SELECT	E.title AS name
FROM	entries E	INNER JOIN search_by_titles SBT ON (SBT.entry_id = E.id)
WHERE	SBT.entry_title LIKE ? COLLATE utf8mb4_unicode_as_ci
ORDER	BY E.title
LIMIT 10

Notice the extra "COLLATE utf8mb4_unicode_as_ci"!

NOTE:
Remember that to reap the benefits of the several distinct collations, the table data needs to contain mixed case and special chars, so that the comparisons will work as expected.

So, for all the possible combinations, we would need the following 4 COLLATIONS:
- utf8_unicode_ai_ci <=> utf8_unicode_ci
- utf8_unicode_as_cs <=> utf8_unicode_cs
- utf8_unicode_as_ci
- utf8_unicode_ai_cs

And this is were most MySQL installs will give us trouble, since these COLLATIONS are not ALL available by default!

But this can be fixed, by adding them to the server.
https://dev.mysql.com/doc/refman/8.0/en ... e-uca.html
It might not be as easy as we would like (might require a recompile), for the mixed ones at least.

But at first try, we could live with just the first two, that are usually available or more easily installable.

NOTE: to simplify the example, I'm not using the utf8mb4, which we should.

To check for locally available/installed collations, run this command:

Code: Select all

SHOW COLLATION WHERE Charset = 'utf8mb4';

UPDATE:
Using specific COLLATIONS on a query, might trip MySQL into not using indexes.
Or might require extra indexes on the same field, but for a different collation.
Something to keep an eye on.

RMartins · Post by **RMartins** » Fri Jul 19, 2019 11:12 am

RMartins wrote: ↑Fri Jul 19, 2019 3:15 am ...

UPDATE:
Using specific COLLATIONS on a query, might trip MySQL into not using indexes.
...

But using SQL LIKE usually prevents the use of any index, so it might not be a big concern, since the database is small.

Einar Saukas · Post by **Einar Saukas** » Fri Jul 19, 2019 5:44 pm

RMartins wrote: ↑Fri Jul 19, 2019 2:42 amIf table data does not contain Upper case characters, we can never match them as Upper case characters.
However, if table data contains Upper case characters, then we can match or ignore them, depending on the collation being used.

I understood your point. I just happen to disagree with you

This was your proposal for uppercase:

So, if user uses Upper case in the search string, then it means they are looking for that specific case, for example to distinguish the start of a Word or a Name. But if the user uses all lower case, then it means that any match is acceptable including in the middle of a word or name.

According to your proposal, searching for "Earth Shock" won't find anything, because the exact spelling is "Earthshock".

The point is, I believe searches should be as permissive as possible. Right now, users don't need to know the exact spelling of a title. Searching for "Earth Shock", "earthshock" and "earth-shock" will find the intended game. It goes even further: right now, searching for "night rally" will find both "Night Rally" and "Nightmare Rally". And searching for "Ocean Software" will also find titles published by Ocean brands and subsidiaries, not just Ocean directly. That's how searches in Martijn's WoS work and I always liked it, thus the reason I provided something similar with these auxiliary tables. It's a very different requirement from autocomplete, that only makes sense for exact matches.

RMartins wrote: ↑Fri Jul 19, 2019 2:42 amAnother improvement, is that instead of using MYSQL collation utf8_unicode_ci , the MB4 version should be used (utf8mb4_unicode_ci), according to note on the following documentation page https://dev.mysql.com/doc/refman/8.0/en ... icode.html (this is mentioned in several places in the documentaiton, since the 3 byte version, is a MySQL fluke and non standard Umicode.

Good point, I will fix it. Thanks!

RMartins wrote: ↑Fri Jul 19, 2019 2:42 am Punctuation in data is not required to be removed, for similar reasons as case (being possible to match it, if needed).

I disagree...

RMartins wrote: ↑Fri Jul 19, 2019 2:42 amIf we remove accents, we can no longer match them explicitly, since they are not there.
But if we have accents available in table data, we can ignore it (accent insensitive), just by using a different collation, or use it for mathching (accent sensitive).

And I still think searches here should be always accent insensitive. Otherwise searching for "iniciación" won't find any results.

RMartins wrote: ↑Fri Jul 19, 2019 2:42 am Regarding special cases, ( like "Off Road" versus "Offroad" ) the standard solution is to have a dictionary for those (where synonyms can be defined) or special match/translationrules (when using a search engine), but probably the best we can do here is to provide aliases (AKA synonym) for those known cases, including the entire title for simplification, for example in the "aliases" table or similar.
So there should be 2 entries like "Off Road Racing" versus "Offroad Racing".

Even so, your suggestion wouldn't be enough to locate the game called "4x4 Off-Road Racing". Unless you are suggesting to store lots of possible variants for each game title in the database? If so, that would interfere with autocomplete. Imagine someone typing "knight" and autocomplete showing 10 variants of the same title ("Knight Time", "Knight Tyme", "Knight-time", "Knight-tyme", "Knighttime", "Knighttyme", "Knightime", "Knightyme", etc) and none for Knight Lore, for instance.

Besides, it feels wrong to store aliases in the database for games that were never really known by those titles, just to help searches that are already working without this.

RMartins · Post by **RMartins** » Fri Jul 19, 2019 7:35 pm

Einar Saukas wrote: ↑Fri Jul 19, 2019 5:44 pm
RMartins wrote: ↑Fri Jul 19, 2019 2:42 amIf table data does not contain Upper case characters, we can never match them as Upper case characters.
However, if table data contains Upper case characters, then we can match or ignore them, depending on the collation being used.
I understood your point. I just happen to disagree with you

That's perfectly fine.

If we would all think alike, it would be boring

But I believe you are disagreeing, because I might not have explained it well enough.

If you take into account, that you are using a collation that is accent insensitive and case insensitive, it doesn't matter, what accentuated characters or which case they are in the database, because an A = a = Â = à = À = á = ... etc... using the *_unicode_ai_ci collation.

So having the original data accents and case on the search_by_titles table, doesn't affect how you currently do your searches.
If you don't believe me, change one and try

Einar Saukas wrote: ↑Fri Jul 19, 2019 5:44 pm This was your proposal for uppercase:

So, if user uses Upper case in the search string, then it means they are looking for that specific case, for example to distinguish the start of a Word or a Name. But if the user uses all lower case, then it means that any match is acceptable including in the middle of a word or name.
According to your proposal, searching for "Earth Shock" won't find anything, because the exact spelling is "Earthshock".

Nope.
NOTE: his is why I believe, I might not have explained myself well.

In search engines, things known to have different spellings, like your example above, are perfectly fine.
They just need several entries into the search index (not database index), like if they were alternative names (aliases).

Clipping/removing the spaces, dashes, funny characters, etc..., to try to retain only the parts that are common to facilitate a match is not the way to go about it, because it will give a lot of false positives in other unrelated searches.

So in this case, you would have 3 records in search_by_titles, namely:
- "Earth Shock"
- "Earthshock"
- "Earth-Shock"
that map to the same exact game id (entry_id).

If you think about it, you already do this with the "aliases" table.

The problem with clipping the uncommon stuff, besides giving false positives, is that it will not work for names (aliases) that are no exact matches once you strip the funny(uncommon) characters between the several spellings.

A good example is the "Live and Let Die" versus "Aquablast".
Same game, different "spellings" or names for the same thing.

I hope that clears any doubt about it.
You get the good but you also get the bad, by clipping data.
If instead you just keep the several spellings, you get the good part, but without the bad ones.

Einar Saukas wrote: ↑Fri Jul 19, 2019 5:44 pm The point is, I believe searches should be as permissive as possible. Right now, users don't need to know the exact spelling of a title. Searching for "Earth Shock", "earthshock" and "earth-shock" will find the intended game.

Completely agree.

Einar Saukas wrote: ↑Fri Jul 19, 2019 5:44 pm It goes even further: right now, searching for "night rally" will find both "
Night Rally" and "Nightmare Rally".

I hope you agree with me, that those results are not the outcome, of clipping the funny/uncommon chars.
Because "night rally" when stripped of the funny chars, would become "nightrally", and would never match "nightmarerally".

I haven't checked, but that outcome is most probably the result of the "aliases" table or similar.

Einar Saukas wrote: ↑Fri Jul 19, 2019 5:44 pm And searching for "Ocean Software" will also find titles published by Ocean brands and subsidiaries, not just Ocean directly. That's how searches in Martijn's WoS work and I always liked it, thus the reason I provided something similar with these auxiliary tables. It's a very different requirement from autocomplete, that only makes sense for exact matches.

Again, that is not the result of clipping the funny/uncommon chars.
That is the result of pre-processing some table joins to include extra keys in the search_by_* tables, which is exactly what I'm proposing.

We can still have all those features, without having the false positives.

Autocomplete is just a search, like any other. I see no difference there.

Einar Saukas wrote: ↑Fri Jul 19, 2019 5:44 pm
RMartins wrote: ↑Fri Jul 19, 2019 2:42 am Punctuation in data is not required to be removed, for similar reasons as case (being possible to match it, if needed).
I disagree...

Can you please try a search, after changing a few entries to have the correct case and accents, and confirm that you will still find them with the current collation in use ?

NOTE: If this isn't true, MySQL has a bug in that collation.

Einar Saukas wrote: ↑Fri Jul 19, 2019 5:44 pm
RMartins wrote: ↑Fri Jul 19, 2019 2:42 amIf we remove accents, we can no longer match them explicitly, since they are not there.
But if we have accents available in table data, we can ignore it (accent insensitive), just by using a different collation, or use it for mathching (accent sensitive).
And I still think searches here should be always accent insensitive. Otherwise searching for "iniciación" won't find any results.

OK, here is a matter of opinion, that I won't debate, because my proposition is based on the supposition of user intent, since we do not have a real search engine at hand, which would allow definition of rules to match cases like these, or misspells like char reversals and other combinations.
"Iniciation" versus "iniciacion" versus "inicaison" versus "iniciasno", etc...

Note however, if user knows and uses NON accentuated chars and gets the results with the correct accents, they will know what to do.
This is what I referred has user intent, in the search.

Einar Saukas wrote: ↑Fri Jul 19, 2019 5:44 pm
RMartins wrote: ↑Fri Jul 19, 2019 2:42 am Regarding special cases, ( like "Off Road" versus "Offroad" ) the standard solution is to have a dictionary for those (where synonyms can be defined) or special match/translationrules (when using a search engine), but probably the best we can do here is to provide aliases (AKA synonym) for those known cases, including the entire title for simplification, for example in the "aliases" table or similar.
So there should be 2 entries like "Off Road Racing" versus "Offroad Racing".
Even so, your suggestion wouldn't be enough to locate the game called "4x4 Off-Road Racing". Unless you are suggesting to store lots of possible variants for each game title in the database?

That's exactly right!
That's the whole idea of a search Index (completely different from a database Index), and they usually are huge!
Specially if we do not have a search engine, that can generate the combinations by itself, given some (many) rules.

Since we are using the poor man SQL solution, we need to build the search index tables, with all valid combinations and then search that, using SQL.

Einar Saukas wrote: ↑Fri Jul 19, 2019 5:44 pm
If so, that would interfere with autocomplete. Imagine someone typing "knight" and autocomplete showing 10 variants of the same title ("Knight Time", "Knight Tyme", "Knight-time", "Knight-tyme", "Knighttime", "Knighttyme", "Knightime", "Knightyme", etc) and none for Knight Lore, for instance.

Actually that will work fine, because we only show the real title, extracted from the "entries.title", while searching using the search_by_titles.entry_title, and after performing an UNION operation, that removes the duplicates.

Code: Select all

SELECT	E.title AS name
FROM	entries E	INNER JOIN search_by_titles SBT ON (SBT.entry_id = E.id)
WHERE	SBT.entry_title LIKE ?
ORDER	BY E.title
LIMIT 10

Notice that the result comes from the entries table => E.title

Einar Saukas wrote: ↑Fri Jul 19, 2019 5:44 pm
Besides, it feels wrong to store aliases in the database for games that were never really known by those titles, just to help searches that are already working without this.

Now, that I can relate!

It's that feeling that it seems a waste due to apparent duplication of data.
But that is the domain of the problem unfortunately.

But let me clarify that when i mentioned "aliases", it was to make it simpler to relate.
I know that "aliases" table has a more in depth context, which is more akin to different language spellings or country releases.

So when I said "aliases" I meant real aliases as in possible replacement names, which are usually handled in specific collections (AKA tables) for search indexes.

Thank you for your feedback.
I really appreciate it.

I hope I was able to clarify any doubt, I may have left previously.

Einar Saukas · Post by **Einar Saukas** » Sat Jul 20, 2019 12:05 am

RMartins wrote: ↑Fri Jul 19, 2019 3:15 am
Einar Saukas wrote: ↑Thu Jul 18, 2019 9:07 pmYou probably don't need to distinguish between authors and publishers.
Currently the Quick search is using this query:

There's no need. All labels are valid in quick search.

For instance, try searching for "columbia pictures" in quick search to see what happens. It's neither author or publisher, but you will still get a result.

RMartins wrote: ↑Fri Jul 19, 2019 3:15 am
Einar Saukas wrote: ↑Thu Jul 18, 2019 9:07 pm This way, it's also easy to impose a limit of 40 results overall, instead of partial limits of 10 each.
Not sure I understood what you meant.

Sorry, I understood you had an overall limit of 40 results. Now I see in your query that your overall limit is 10. Never mind!

RMartins wrote: ↑Fri Jul 19, 2019 7:35 pmBut I believe you are disagreeing, because I might not have explained it well enough.

No. I understood exactly what you meant. There's no need to discuss it technically, because the problem is not technical. We just happen to have different ideas for the intended behavior.

You assume that users searching for "cadáv" are only interested in words containing this accent. I assume they want "Cadàveriön" but didn't memorize the exact spelling.

You assume that users searching for "off-road" are only interested in words containing this hyphen. I assume they want all off road games regardless of spelling.

You assume that users searching for "night rally" don't want to see "Nightmare Rally" in the list of results. I assume they do.

Basically you assume most users will bother to learn the nuances of using capital letters or not, using punctuation or not, etc. I assume they just want to find a game with minimum effort.

RMartins wrote: ↑Fri Jul 19, 2019 7:35 pm If you take into account, that you are using a collation that is accent insensitive and case insensitive, it doesn't matter, what accentuated characters or which case they are in the database, because an A = a = Â = à = À = á = ... etc... using the *_unicode_ai_ci collation.

So having the original data accents and case on the search_by_titles table, doesn't affect how you currently do your searches.
If you don't believe me, change one and try

Of course I believe you. That's exactly what I wrote in a previous post. In this post I wrote "I don't really need to store text in these tables as lowercase only, since all comparisons use "ci" collation. Everything is currently converted to lowercase just to reduce unnecessary duplicates."

RMartins wrote: ↑Fri Jul 19, 2019 7:35 pm They just need several entries into the search index (not database index), like if they were alternative names (aliases).

And that's exactly what I wanted to avoid, because maintaining a list of alternative names would be a PITA.

RMartins wrote: ↑Fri Jul 19, 2019 7:35 pmIf you think about it, you already do this with the "aliases" table.

No, I don't. The "aliases" table contains alternative official names for the same game. It doesn't make sense to add other titles to this table that a game never had, just to help searches.

RMartins wrote: ↑Fri Jul 19, 2019 7:35 pmYou get the good but you also get the bad, by clipping data.
If instead you just keep the several spellings, you get the good part, but without the bad ones.

I know. False positives is the price to pay for easier, broader searches. I think that's a good tradeoff.

RMartins wrote: ↑Fri Jul 19, 2019 7:35 pmWe can still have all those features, without having the false positives.

I disagree, because my understanding of desired features is different than yours (as I described at the beginning of this post).

RMartins wrote: ↑Fri Jul 19, 2019 7:35 pmAutocomplete is just a search, like any other. I see no difference there.

I do!

RMartins wrote: ↑Fri Jul 19, 2019 7:35 pmCan you please try a search, after changing a few entries to have the correct case and accents, and confirm that you will still find them with the current collation in use ?

RMartins wrote: ↑Fri Jul 19, 2019 7:35 pmSince we are using the poor man SQL solution, we need to build the search index tables, with all valid combinations and then search that, using SQL.

Again, that's what I'm trying to avoid.

RMartins wrote: ↑Fri Jul 19, 2019 7:35 pmActually that will work fine, because we only show the real title, extracted from the "entries.title", while searching using the search_by_titles.entry_title, and after performing an UNION operation, that removes the duplicates.

Your proposal wouldn't be able to find "Electra 9000", for instance. It's the title of a game re-released by Alternative, probably better known than the original title from the first K'Soft release.

RMartins wrote: ↑Fri Jul 19, 2019 7:35 pm Thank you for your feedback.
I really appreciate it.

I hope I was able to clarify any doubt, I may have left previously.

Absolutely!

And I hope my answers didn't sound as negative criticism, it wasn't my intent. You originally asked about the motivations for the current implementation, so I tried to explain them all and point out the reasons I didn't do it differently. However I have no right to impose my own preferences on any websites. You are currently working on autocomplete and searches at SC, so the site functionality should be your decision, not mine. Please feel free to implement your ideas, and if there's any way I can help just let me know. I'm confident you can do a good job, and hopefully the issues I pointed out can help you make even better choices. I will gladly leave the ball in your court now!

BTW tables "search_by" are included in ZXDB distribution right now, just for convenience. However they are not really part of the database content. I actually like the idea of distributing them separately, to emphasize they are optional and motivate each website to come up with its own creative solutions. Good luck!

RMartins · Post by **RMartins** » Sat Jul 20, 2019 12:48 am

Once again thank you for your feedback [mention]Einar Saukas[/mention].

This I s just a debate so I hope I'm not comming across as imposing anything.That's not the intent.

Having said that, it would be nice to know what other users think about this subject, in order to get more feedback and ideas.

One of the reasons why I did not want to go too technical, so not to scare the rest of the crowd

Please come back and give us your opinion or insight on how you use or want to use search.

Thank you in advance for any feedback.

Einar Saukas · Post by **Einar Saukas** » Sat Jul 20, 2019 11:36 pm

RMartins wrote: ↑Sat Jul 20, 2019 12:48 amThis I s just a debate so I hope I'm not comming across as imposing anything.That's not the intent.

Sure, don't worry about it!

RMartins wrote: ↑Sat Jul 20, 2019 12:48 amHaving said that, it would be nice to know what other users think about this subject, in order to get more feedback and ideas.

I suspect this discussion has been too abstract for most users. It's hard for anyone to evaluate what's the best interface without actually trying it.

I suggest you implement your idea in your alternate page (or even replace the main page), afterwards write a post describing the changes and asking for feedback.

RMartins · Post by **RMartins** » Sun Jul 21, 2019 3:27 pm

Einar Saukas wrote: ↑Sat Jul 20, 2019 11:36 pm ...
RMartins wrote: ↑Sat Jul 20, 2019 12:48 amHaving said that, it would be nice to know what other users think about this subject, in order to get more feedback and ideas.
I suspect this discussion has been too abstract for most users. It's hard for anyone to evaluate what's the best interface without actually trying it.

I suggest you implement your idea in your alternate page (or even replace the main page), afterwards write a post describing the changes and asking for feedback.

For now the intent was to provide an autocomplete, and that's done and working.
I will probably change it a bit, to allow something like 40 entries, but just show 10, but then allow to scroll to see the remaining 30.

To review the way how search is being done, I need to first understand the data and relation between the several tables involved, in more detail.
In particular, how other info is kept in the labels table.

Some extra coments below.

Einar Saukas wrote: ↑Sat Jul 20, 2019 12:05 am ...
RMartins wrote: ↑Fri Jul 19, 2019 7:35 pmBut I believe you are disagreeing, because I might not have explained it well enough.
No. I understood exactly what you meant. There's no need to discuss it technically, because the problem is not technical. We just happen to have different ideas for the intended behavior.

You assume that users searching for "cadáv" are only interested in words containing this accent. I assume they want "Cadàveriön" but didn't memorize the exact spelling.

You assume that users searching for "off-road" are only interested in words containing this hyphen. I assume they want all off road games regardless of spelling.

You assume that users searching for "night rally" don't want to see "Nightmare Rally" in the list of results. I assume they do.

Basically you assume most users will bother to learn the nuances of using capital letters or not, using punctuation or not, etc. I assume they just want to find a game with minimum effort.

Yes I assume all that, behind what I mentioned as user intent.

They can still search exactly like you describe, just by not using any case or accent.
You can call that a "nuance" they will have to learn. I have no problem with that.

The thing is, if they do not want to mach every single "ana" in the search topic, but just want to get the ones that match "aña" they can, and will not have FALSE positives.
With the current solution they can't, it's case insensitive and accent insensitive all the way.

Having a choice for refining the search, seems an improvement or win to me, but that might not be the opinion of the majority.

Ideally we would have a search engine, and someone to maintain it, but we don't.

Einar Saukas wrote: ↑Sat Jul 20, 2019 12:05 am ...

RMartins wrote: ↑Fri Jul 19, 2019 7:35 pm They just need several entries into the search index (not database index), like if they were alternative names (aliases).
And that's exactly what I wanted to avoid, because maintaining a list of alternative names would be a PITA.

OK, I understand that it involves some effort, but that's the only feasible (NON FALSE positives) way to do it, using only SQL.

Einar Saukas wrote: ↑Sat Jul 20, 2019 12:05 am
RMartins wrote: ↑Fri Jul 19, 2019 7:35 pmIf you think about it, you already do this with the "aliases" table.
No, I don't. The "aliases" table contains alternative official names for the same game. It doesn't make sense to add other titles to this table that a game never had, just to help searches.

I believe I explained that it should be some other table, and not the "aliases" table, since that has a specific release context.
A table specific for this purpose, like "title_synonyms" or similar.
i.e. specific tables and data created for handling searching, like the search_by_* tables already are.

Einar Saukas wrote: ↑Sat Jul 20, 2019 12:05 am
RMartins wrote: ↑Fri Jul 19, 2019 7:35 pmYou get the good but you also get the bad, by clipping data.
If instead you just keep the several spellings, you get the good part, but without the bad ones.
I know. False positives is the price to pay for easier, broader searches. I think that's a good tradeoff.

OK, I just believe we can improve on the current solution.

Einar Saukas wrote: ↑Sat Jul 20, 2019 12:05 am ...
RMartins wrote: ↑Fri Jul 19, 2019 7:35 pmSince we are using the poor man SQL solution, we need to build the search index tables, with all valid combinations and then search that, using SQL.
Again, that's what I'm trying to avoid.

I understand that, but by creating the search_by_* tables, aren't you actually already doing something similar ?
Surely not as involved as it could be, but a start of what a typical "search index" is.

Einar Saukas wrote: ↑Sat Jul 20, 2019 12:05 am ...
Your proposal wouldn't be able to find "Electra 9000", for instance. It's the title of a game re-released by Alternative, probably better known than the original title from the first K'Soft release.

I searched for "Electra" and it finds it, but searching for "electra" fails to find it, curious!
Maybe I missed something.

Are you referring to the fact, that current quick search query, doesn't include "labels" alone (without joining to authors or publishers) ?

If that's the case, I was just following the tip on the field "Search by Title, Author, Publisher".
Maybe we should include the "labels" individually too.

But to be fair, I still haven't grasped the "labels" table completely, because it seems a bit messy to have some of it related with other tables, giving it context (like authors or publishers), but part of it, is neither of those, and has not relation with any other extra context giving table.
I find it a bit odd, although I do understand that this is probably related with legacy data.

Again, thank you for your feedback.
I do appreciate it.

Einar Saukas · Post by **Einar Saukas** » Mon Jul 22, 2019 5:58 pm

RMartins wrote: ↑Sun Jul 21, 2019 3:27 pm I searched for "Electra" and it finds it, but searching for "electra" fails to find it, curious!
Maybe I missed something.

Both "Electra" and "electra" provide the same results in searching. But they show different results in autocomplete.

RMartins wrote: ↑Sun Jul 21, 2019 3:27 pm Are you referring to the fact, that current quick search query, doesn't include "labels" alone (without joining to authors or publishers) ?

It's the opposite. Current quick search query includes "labels" alone (but autocomplete doesn't).

RMartins wrote: ↑Sun Jul 21, 2019 3:27 pm If that's the case, I was just following the tip on the field "Search by Title, Author, Publisher".

A more accurate description would be "Search by Title, Author, Publisher, License Owner, Magazine Name, Team Name, Magazine Columnist, etc"

RMartins wrote: ↑Sun Jul 21, 2019 3:27 pm Maybe we should include the "labels" individually too.

Yes.

RMartins wrote: ↑Sun Jul 21, 2019 3:27 pm But to be fair, I still haven't grasped the "labels" table completely

In ZXDB, table "labels" contains all entities. It includes real people's names, nicknames, companies, user groups, fantasy names used by publishers, etc.

Some of them are authors (of certain titles), for instance Joffa.

Some of them are publishers (or certain titles), for instance Ocean Software.

Some of them are both authors and publishers, for instance Jonathan Cauldwell.

Some of them are owners (of certain licenses), for instance Daley Thompson.

Some of them are both authors and license owners, for instance Alfonso Azpiri.

There are a few "relationship" tables that associate entries to the entities that authored them, or published them, or own the licenses related to them. They are described here:

https://github.com/zxdb/ZXDB

RMartins wrote: ↑Sun Jul 21, 2019 3:27 pm because it seems a bit messy to have some of it related with other tables, giving it context (like authors or publishers), but part of it, is neither of those, and has not relation with any other extra context giving table.
I find it a bit odd, although I do understand that this is probably related with legacy data.

No, it's not odd, not messy, and not related with legacy data. We already had this conversation here.

A certain person could have authored different games. Perhaps also published a few of them, worked as columnist in some magazine, or created some character that was later licensed to others. Don't you think it makes sense to store information about each person (country, nickname, homepage, etc) in a separate table, instead of duplicating it everywhere?

RMartins wrote: ↑Sun Jul 21, 2019 3:27 pm Again, thank you for your feedback.
I do appreciate it.

You are welcome!

RMartins · Post by **RMartins** » Mon Jul 22, 2019 7:27 pm

Einar Saukas wrote: ↑Mon Jul 22, 2019 5:58 pm
RMartins wrote: ↑Sun Jul 21, 2019 3:27 pm I searched for "Electra" and it finds it, but searching for "electra" fails to find it, curious!
Maybe I missed something.
Both "Electra" and "electra" provide the same results in searching. But they show different results in autocomplete.

RMartins wrote: ↑Sun Jul 21, 2019 3:27 pm Are you referring to the fact, that current quick search query, doesn't include "labels" alone (without joining to authors or publishers) ?
It's the opposite. Current quick search query includes "labels" alone (but autocomplete doesn't).

OK, I see what you mean.
The quick search is different from the autocomplete search, in current implementation.
That's true, due to the tip definition that I followed.

I can re-align that, if that's the intent.

Einar Saukas wrote: ↑Mon Jul 22, 2019 5:58 pm ...
There are a few "relationship" tables that associate entries to the entities that authored them, or published them, or own the licenses related to them. They are described here:

https://github.com/zxdb/ZXDB

Yes, you have mentioned something along those lines before.

Einar Saukas wrote: ↑Mon Jul 22, 2019 5:58 pm
RMartins wrote: ↑Sun Jul 21, 2019 3:27 pm because it seems a bit messy to have some of it related with other tables, giving it context (like authors or publishers), but part of it, is neither of those, and has not relation with any other extra context giving table.
I find it a bit odd, although I do understand that this is probably related with legacy data.
No, it's not odd, not messy, and not related with legacy data. We already had this conversation here.

Maybe I haven't explained myself well enough.
I'm not criticizing the ZXDB tables or it's data, per se.

Let me try and explain what I meant by "has not relation with any other extra context giving table".
And I may be wrong, so please correct me if that is the case.

"Labels" table, has a bunch of records, some:
- have a reference to the "authors" table, giving that "label" context in the sense that it's an author name.
- have a reference to the "publishers" table, giving that "label" context in the sense that it's a publisher label/name.
- have a reference to the "licensors" table, giving that "label" context in the sense that it's a License Owner label/name.
but there seems to exist more label data in "labels" table then just "authors", "publishers" or "licensors".

Assuming this is true (might not be), I haven't seen any extra table referenced, in order to give context to whatever other stuff is in the "labels" records.

At least, that's might current view, which can be wrong.

Einar Saukas wrote: ↑Mon Jul 22, 2019 5:58 pm A certain person could have authored different games. Perhaps also published a few of them, worked as columnist in some magazine, or created some character that was later licensed to others. Don't you think it makes sense to store information about each person (country, nickname, homepage, etc) in a separate table, instead of duplicating it everywhere?

YEs, I do agree with this.

But from my current perspective (mentioned above), there seems to be some info that does not have context associated.
And that was my only complain about feeling it's "messy" or a "bit odd".

Thank you for your feedback.

Einar Saukas · Post by **Einar Saukas** » Mon Jul 22, 2019 8:48 pm

RMartins wrote: ↑Mon Jul 22, 2019 7:27 pm OK, I see what you mean.
The quick search is different from the autocomplete search, in current implementation.
That's true, due to the tip definition that I followed.

I can re-align that, if that's the intent.

Yes, I think you should accept all "labels".

Right now, quick searches also provides other results besides authors and publishers.

Einar Saukas wrote: ↑Mon Jul 22, 2019 5:58 pm Maybe I haven't explained myself well enough.
I'm not criticizing the ZXDB tables or it's data, per se.

Let me try and explain what I meant by "has not relation with any other extra context giving table".
And I may be wrong, so please correct me if that is the case.

"Labels" table, has a bunch of records, some:
- have a reference to the "authors" table, giving that "label" context in the sense that it's an author name.
- have a reference to the "publishers" table, giving that "label" context in the sense that it's a publisher label/name.
- have a reference to the "licensors" table, giving that "label" context in the sense that it's a License Owner label/name.
but there seems to exist more label data in "labels" table then just "authors", "publishers" or "licensors".

Assuming this is true (might not be), I haven't seen any extra table referenced, in order to give context to whatever other stuff is in the "labels" records.

There are many other references:

* A label can be referenced to another label. For instance, label "Frobush" is a nickname that belongs to label "Jonathan M. Smith". Because of this, you can search for this nickname to find his games, although he was never credited as "Frobush" in any of them.

* A label can be a team of developers. For instance, label "Bizarre Developments" was a team behind the development of a few titles. It wasn't an individual author, publisher or license owner for any of them.

* A label can be a columnist responsible for a regular section in a magazine, in our magazine references. For instance "Lloyd Mangram".

* A label can be a company that's not a publisher, but it's the main company that owns several other publishers. For instance "British Telecom".

And so on.

You can take a look at the ZXDB schema to get all possible references. Or you can simply allow searches on all existing labels.

Metalbrain · Post by **Metalbrain** » Tue Jul 23, 2019 10:06 am

Regarding the "aña" vs "ana" search, would it be possible to make an initial search that is case/accent insensitive , and then refine it to prioritize the case/accent sensitive results? This way the intended results would appear first, but if someone makes a mistake (such as the cadáv vs cadàv one), he'd still get the result.

I'd love to have this kind of search, and the fact that Google (and search engines in general) just does insensitive searches, no matter how specific you write the terms has always frustrated the hell out of me.

Spectrum Computing

Predictive Search

Re: Predictive Search

Re: Predictive Search

Re: Predictive Search

Re: Predictive Search

Re: Predictive Search

Re: Predictive Search

Re: Predictive Search

Re: Predictive Search

Re: Predictive Search

Re: Predictive Search

Re: Predictive Search

Re: Predictive Search

Re: Predictive Search

Re: Predictive Search

Re: Predictive Search

Re: Predictive Search

Re: Predictive Search

Re: Predictive Search

Re: Predictive Search

Re: Predictive Search

Re: Predictive Search

Re: Predictive Search

Re: Predictive Search

Re: Predictive Search

Re: Predictive Search