Re: New Database Model ZXDB

Posted: Wed Jun 19, 2019 6:25 pm
by Einar Saukas
djnzx48 wrote:
Wed Jun 19, 2019 5:27 am
Not sure if this is a bug or not, but in the past I've noticed the magazine references often link to one past the intended page. For example, the Pssst link takes me to page 79 when the actual reference is on page 78. It seems like the page numbers in archive.org links start at zero, with no page number specified for the first page, so subtracting one from the page number in the SC links might fix the problem.
Yes, that's a problem. And it's something that needs to be fixed in ZXDB, not SC.

Let me explain:

When you click on a page number, SC opens the corresponding image file. So if it needs to show page 4 of issue #52 of magazine X, it simply has to open a file like "magx05200004.jpg" or something similar. That's easy.

However, when you click on the "VIEW" link next to the page number, SC will try to open the corresponding page of the PDF file inside the browser. But how can it find out which page of the PDF that is?

* In certain magazines, the cover is not included in numbering. Therefore page number 4 is actually the 5th page of the PDF file (as if cover was page number zero).

* In other magazines, the cover is considered as page number 1. Therefore page number 4 coincides with the 4th page of the PDF.

* In other magazines, page number 4 could be the 8th page of the PDF, due to additional index and advertising pages without numbering at the beginning (as if cover was page number -3), for instance.
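
The three cases above reduce to a single rule once each issue records the number its cover would carry (0, 1 and -3 in the examples). A minimal Python sketch, assuming a hypothetical per-issue value named `cover_page` (an illustrative name, not an existing ZXDB column) and assuming numbering is consistent within the issue:

```python
def pdf_page_for(printed_page: int, cover_page: int) -> int:
    """Return the 1-based position in the PDF of a printed page number.

    cover_page is the number the cover would carry if it were numbered
    (hypothetical per-issue value; not an existing ZXDB field).
    """
    if printed_page < cover_page:
        raise ValueError("printed page comes before the cover")
    # The cover is always the 1st page of the PDF, so shift by its number.
    return printed_page - cover_page + 1


# The three cases from the list, all asking for printed page 4:
print(pdf_page_for(4, 0))   # cover not numbered (as if page zero) -> 5
print(pdf_page_for(4, 1))   # cover counted as page 1 -> 4
print(pdf_page_for(4, -3))  # unnumbered pages before page 1 -> 8
```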

The best way to solve this problem is to store this information in ZXDB for each magazine issue. The magazines currently indexed as PDF in ZXDB are listed below:

Crash

Jogos 80

Micro Mart

Mundo Spectrum

Personal Computer News

Planeta Sinclair Almanaque

RetroMagazine

Sinclair User

Spectrum Today (EN)

The Spectrum Show

If anyone is willing to help, by checking "what would be the corresponding page number for the cover?" for even a few of these issues, it would help us a lot!

Re: New Database Model ZXDB

Posted: Wed Jun 19, 2019 8:27 pm
by PeterJ
Thanks for the great explanation @Einar Saukas

When I go to our MicroMart page (I didn't know that was included in ZXDB):

https://spectrumcomputing.co.uk/index.p ... mag_id=280

I click on any PDF and it says Page not Found on archive.org. Is that something with the path? Every PDF link seems to be trying to go to:

https://archive.org/download/Micro-Mart ... pecial.pdf

Re: New Database Model ZXDB

Posted: Thu Jun 20, 2019 3:00 am
by djnzx48
Einar Saukas wrote:
Wed Jun 19, 2019 6:25 pm
djnzx48 wrote:
Wed Jun 19, 2019 5:27 am
Not sure if this is a bug or not, but in the past I've noticed the magazine references often link to one past the intended page. For example, the Pssst link takes me to page 79 when the actual reference is on page 78. It seems like the page numbers in archive.org links start at zero, with no page number specified for the first page, so subtracting one from the page number in the SC links might fix the problem.
Yes, that's a problem. And it's something that needs to be fixed in ZXDB, not SC.
OK, I didn't know whether the links were stored in the database or autogenerated on the website. But are you sure the problem isn't just a simple off-by-one error? Having a brief look, all the magazines I've seen so far have the correct page numbers on archive.org, except for:

PCN: the numbering starts at the contents page rather than the cover. So the archive.org numbers are two pages ahead.

The Spectrum Show Magazine (issues #0 and #1 only): the numbering starts on the next page after the cover. So the archive.org numbers are one page ahead.

In the archive.org URLs, #page/n1 refers to the second page of the magazine, #page/n2 refers to the third, and so on. For the title page, this parameter is simply omitted.
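
A small Python sketch of that convention, for illustration only. It assumes the `/details/<identifier>/page/nN` path form of the viewer link (the post writes the parameter as `#page/n1`), and the function name is made up here:

```python
def viewer_url(identifier: str, pdf_page: int) -> str:
    """Build an archive.org viewer link for a 1-based page position."""
    if pdf_page < 1:
        raise ValueError("pdf_page is 1-based")
    base = f"https://archive.org/details/{identifier}"
    if pdf_page == 1:
        return base  # title page: the page parameter is simply omitted
    # n1 is the second page, n2 the third, and so on.
    return f"{base}/page/n{pdf_page - 1}"


print(viewer_url("crash-magazine-40", 15))
# https://archive.org/details/crash-magazine-40/page/n14
```

If the links are generated from a 1-based page count without this subtraction, every link lands one page too far, which would match the symptom described.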

EDIT: Only, some magazine scans don't even have consistent numbering within themselves. For example, this Crash #40 scan has an extra page inserted after page 13 (some kind of fold-out poster?) Without verifying every scan, you can't be sure whether the page numbers are all correct.

Re: New Database Model ZXDB

Posted: Thu Jun 20, 2019 3:30 am
by djnzx48
PeterJ wrote:
Wed Jun 19, 2019 8:27 pm
Thanks for the great explanation @Einar Saukas

When I go to our MicroMart page (I didn't know that was included in ZXDB):

https://spectrumcomputing.co.uk/index.p ... mag_id=280

I click on any PDF and it says Page not Found on archive.org. Is that something with the path? Every PDF link seems to be trying to go to:

https://archive.org/download/Micro-Mart ... pecial.pdf
It's because all the links have '-Special' appended to them, when only a few of them are actually specials. Try for instance the 2015/1/22 link and it works.
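
In other words, the suffix should depend on a per-issue flag rather than being added unconditionally. A rough Python sketch of the idea; the `is_special` flag, the function name and the example stem are all hypothetical, and the real file names on archive.org are not reproduced here:

```python
def pdf_filename(stem: str, is_special: bool) -> str:
    """Append '-Special' only for issues actually flagged as specials."""
    suffix = "-Special" if is_special else ""
    return f"{stem}{suffix}.pdf"


# "Example-Issue" is a placeholder stem, not a real archive.org name.
print(pdf_filename("Example-Issue", is_special=False))  # Example-Issue.pdf
print(pdf_filename("Example-Issue", is_special=True))   # Example-Issue-Special.pdf
```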

Re: New Database Model ZXDB

Posted: Thu Jun 20, 2019 8:06 am
by hikoki
I wonder if the archive.org api search could be used for page numbers.
https://openlibrary.org/dev/docs/bookurls
I guess the search term would have to contain footer or header words, characteristic of every magazine.

For example, a search link for page 54 on a Crash magazine containing the quoted term "54 Crash"
https://archive.org/details/crash-magaz ... 4+crash%22
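
Building such a link is mostly a matter of quoting the phrase and URL-encoding it. A Python sketch, assuming the `/details/<identifier>/search/<query>` form of the link; the function name and the choice of identifier are only for illustration:

```python
from urllib.parse import quote_plus


def footer_search_url(identifier: str, page: int, footer_word: str) -> str:
    """Link to a full-text search for a quoted footer such as "54 crash"."""
    phrase = f'"{page} {footer_word}"'  # quotes make it an exact phrase
    # quote_plus turns the quotes into %22 and the space into '+'.
    return f"https://archive.org/details/{identifier}/search/{quote_plus(phrase)}"


print(footer_search_url("crash-magazine-40", 54, "crash"))
# https://archive.org/details/crash-magazine-40/search/%2254+crash%22
```

Whether the search finds anything still depends on the footer having been OCRed for that scan.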

Re: New Database Model ZXDB

Posted: Thu Jun 20, 2019 8:18 am
by djnzx48
You mean scanning the bottom of every page with OCR to find out the page numbers?

I suppose it might work in theory, but it could be kind of impractical. How would you distinguish them from any other number on the page? And a lot of the full-page advertisements don't have page numbers anyway.

EDIT: Heh, so it actually works? That's interesting. It still seems hackish though as you'd need a specific query for every magazine layout.

Re: New Database Model ZXDB

Posted: Thu Jun 20, 2019 9:36 am
by hikoki
djnzx48 wrote:
Thu Jun 20, 2019 8:18 am
And a lot of the full-page advertisements don't have page numbers anyway.

EDIT: Heh, so it actually works? That's interesting. It still seems hackish though as you'd need a specific query for every magazine layout.
well you could provide users with both the expected and hackish links :)

a script to detect such searches without results might be useful to locate which pages are not numbered
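
The core of such a script could be as small as the sketch below. It assumes the hit counts have already been collected somehow, one per printed page number; how they are fetched from archive.org is deliberately left out, and the names are made up:

```python
def pages_without_hits(hits_by_page: dict[int, int]) -> list[int]:
    """Return the printed page numbers whose footer search found nothing."""
    return sorted(page for page, hits in hits_by_page.items() if hits == 0)


# Made-up counts for illustration: pages 3 and 5 returned no results.
sample = {1: 1, 2: 1, 3: 0, 4: 2, 5: 0}
print(pages_without_hits(sample))  # [3, 5]
```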

EDIT

An example of how to data-mine the Internet Archive:
https://programminghistorian.org/en/les ... et-archive

Re: New Database Model ZXDB

Posted: Thu Jun 20, 2019 12:09 pm
by Einar Saukas
hikoki wrote:
Thu Jun 20, 2019 8:06 am
I wonder if the archive.org api search could be used for page numbers.
https://openlibrary.org/dev/docs/bookurls
I guess the search term would have to contain footer or header words, characteristic of every magazine.

For example, a search link for page 54 on a Crash magazine containing the quoted term "54 Crash"
https://archive.org/details/crash-magaz ... /"54+crash"
Apparently only a few magazines have tagged pages. For instance it doesn't seem to work for Crash issue #84...

Re: New Database Model ZXDB

Posted: Thu Jun 20, 2019 12:29 pm
by hikoki
Einar Saukas wrote:
Thu Jun 20, 2019 12:09 pm
Apparently only a few magazines have tagged pages. For instance it doesn't seem to work for Crash issue #84...
Footers on Crash 84 follow a different pattern: january (black square) numberpage
If you know the ASCII code of the black square, it should work:

https://archive.org/details/crash-magazine-84/search/%22january+(black square)+numberpage%22

EDIT

@Einar Saukas you are right in this case as footers don't seem to be OCRed
well, you can automatically detect the lack of footers by crawling mags with different URLs. If your search term doesn't return one single result then provide just the estimated /page url
*trying to outsmart Einar*
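
The fallback suggested above could be sketched in Python as follows. This is only an outline under assumptions: the hit count is taken as an input, however it was obtained, and all names are hypothetical:

```python
def best_link(search_hits: int, search_url: str, estimated_url: str) -> str:
    """Prefer the footer-search link; fall back to the estimated page link."""
    # No hits means the footer was probably not OCRed for this scan.
    return search_url if search_hits > 0 else estimated_url


search = "https://archive.org/details/crash-magazine-40/search/%2254+crash%22"
estimate = "https://archive.org/details/crash-magazine-40/page/n53"
print(best_link(0, search, estimate))  # falls back to the /page link
```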

Re: New Database Model ZXDB

Posted: Thu Jun 20, 2019 9:28 pm
by Einar Saukas
djnzx48 wrote:
Thu Jun 20, 2019 3:00 am
EDIT: Only, some magazine scans don't even have consistent numbering within themselves. For example, this Crash #40 scan has an extra page inserted after page 13 (some kind of fold-out poster?) Without verifying every scan, you can't be sure whether the page numbers are all correct.
Actually that's an error in the PDF version. This was supposed to be a single page:

https://archive.org/download/World_of_S ... 000015.jpg

But it was broken into 2 pages in PDF:

https://archive.org/details/crash-magazine-40/page/n14

If anyone provides a fixed PDF, I can ask people at Archive.org to replace it. It would fix numbering in this issue.