New Database Model ZXDB

This is the place for general discussion and updates about the ZXDB Database. This forum is not specific to Spectrum Computing.

Moderator: druellan

Einar Saukas
Manic Miner
Posts: 911
Joined: Wed Nov 15, 2017 2:48 pm

Re: New Database Model ZXDB

Post by Einar Saukas » Wed Jun 19, 2019 6:25 pm

djnzx48 wrote:
Wed Jun 19, 2019 5:27 am
Not sure if this is a bug or not, but in the past I've noticed the magazine references often link to one past the intended page. For example, the Pssst link takes me to page 79 when the actual reference is on page 78. It seems like the page numbers in archive.org links start at zero, with no page number specified for the first page, so subtracting one from the page number in the SC links might fix the problem.
Yes, that's a problem. And it's something that needs to be fixed in ZXDB, not SC.

Let me explain:

When you click on a page number, SC opens the corresponding image file. So if it needs to show page 4 of issue #52 of magazine X, it simply has to open a file like "magx05200004.jpg" or something similar. That's easy.

However, when you click on the "VIEW" link next to the page number, SC tries to open the corresponding page of the PDF file inside the browser. But how can it find out which PDF page that is?

* In certain magazines, the cover is not included in the numbering. Therefore page number 4 is actually the 5th page of the PDF file (as if the cover were page number zero).

* In other magazines, the cover counts as page number 1. Therefore page number 4 coincides with the 4th page of the PDF.

* In other magazines, page number 4 could be the 8th page of the PDF, for instance, because of unnumbered index and advertising pages at the beginning (as if the cover were page number -3).
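
The three cases above reduce to a single value per issue: the page number the cover would have. Here is a minimal sketch in Python, where the function name and the `cover_page_number` parameter are only illustrative assumptions and not actual ZXDB columns:

```python
def pdf_page_index(printed_page: int, cover_page_number: int) -> int:
    """Return the 1-based position in the PDF of a printed page number.

    cover_page_number is the number the cover would carry if the magazine's
    own numbering were extended back to it (hypothetical per-issue value).
    """
    if printed_page < cover_page_number:
        raise ValueError("printed page precedes the cover")
    return printed_page - cover_page_number + 1


# The three cases described above, all for printed page number 4:
print(pdf_page_index(4, 0))   # cover not numbered (as if page 0)  -> 5
print(pdf_page_index(4, 1))   # cover counted as page 1            -> 4
print(pdf_page_index(4, -3))  # unnumbered pages first (cover -3)  -> 8
```

With one such number stored per issue, the "VIEW" link could in principle be computed rather than guessed.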

The best way to solve this problem is to store this information in ZXDB for each magazine issue. The magazines currently indexed as PDF in ZXDB are listed below:

Crash

Jogos 80

Micro Mart

Mundo Spectrum

Personal Computer News

Planeta Sinclair Almanaque

RetroMagazine

Sinclair User

Spectrum Today (EN)

The Spectrum Show

If anyone is willing to help by checking which page number the cover would correspond to, even for just a few of these issues, it would help us a lot!

PeterJ
Site Admin
Posts: 1265
Joined: Thu Nov 09, 2017 7:19 pm
Location: Surrey, UK

Re: New Database Model ZXDB

Post by PeterJ » Wed Jun 19, 2019 8:27 pm

Thanks for the great explanation @Einar Saukas

I went to our Micro Mart page (I didn't know that was included in ZXDB):

https://spectrumcomputing.co.uk/index.p ... mag_id=280

When I click on any PDF, archive.org says "Page not Found". Is something wrong with the path? Every PDF link seems to point to:

https://archive.org/download/Micro-Mart ... pecial.pdf

djnzx48
Manic Miner
Posts: 482
Joined: Wed Dec 06, 2017 2:13 am
Location: New Zealand

Re: New Database Model ZXDB

Post by djnzx48 » Thu Jun 20, 2019 3:00 am

Einar Saukas wrote:
Wed Jun 19, 2019 6:25 pm
djnzx48 wrote:
Wed Jun 19, 2019 5:27 am
Not sure if this is a bug or not, but in the past I've noticed the magazine references often link to one past the intended page. For example, the Pssst link takes me to page 79 when the actual reference is on page 78. It seems like the page numbers in archive.org links start at zero, with no page number specified for the first page, so subtracting one from the page number in the SC links might fix the problem.
Yes, that's a problem. And it's something that needs to be fixed in ZXDB, not SC.
OK, I didn't know whether the links were stored in the database or autogenerated on the website. But are you sure the problem isn't just a simple off-by-one error? Having a brief look, all the magazines I've seen so far have the correct page numbers on archive.org, except for:

PCN: the numbering starts at the contents page rather than the cover. So the archive.org numbers are two pages ahead.

The Spectrum Show Magazine (issues #0 and #1 only): the numbering starts on the next page after the cover. So the archive.org numbers are one page ahead.

In the archive.org URLs, #page/n1 refers to the second page of the magazine, #page/n2 refers to the third, and so on. For the title page, this parameter is simply omitted.
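
The numbering rule above could be sketched as follows. This is only an illustration: the function name is made up, and it assumes the "/page/nK" path form seen in archive.org "details" links rather than the "#page/nK" fragment form.

```python
def archive_page_url(item_id: str, pdf_page: int) -> str:
    """Build an archive.org viewer URL for the pdf_page-th page (1-based)."""
    if pdf_page < 1:
        raise ValueError("pdf_page is 1-based")
    base = f"https://archive.org/details/{item_id}"
    if pdf_page == 1:
        return base  # title page: the page parameter is simply omitted
    return f"{base}/page/n{pdf_page - 1}"  # n1 is the second page, n2 the third


print(archive_page_url("crash-magazine-40", 1))
print(archive_page_url("crash-magazine-40", 15))
```

So a link meant for the Nth page of the scan needs n(N-1), which is where a plain off-by-one would come from.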

EDIT: However, some magazine scans don't even have consistent numbering internally. For example, this Crash #40 scan has an extra page inserted after page 13 (some kind of fold-out poster?). Without verifying every scan, you can't be sure whether the page numbers are all correct.

djnzx48
Manic Miner
Posts: 482
Joined: Wed Dec 06, 2017 2:13 am
Location: New Zealand

Re: New Database Model ZXDB

Post by djnzx48 » Thu Jun 20, 2019 3:30 am

PeterJ wrote:
Wed Jun 19, 2019 8:27 pm
Thanks for the great explanation @Einar Saukas

I went to our Micro Mart page (I didn't know that was included in ZXDB):

https://spectrumcomputing.co.uk/index.p ... mag_id=280

When I click on any PDF, archive.org says "Page not Found". Is something wrong with the path? Every PDF link seems to point to:

https://archive.org/download/Micro-Mart ... pecial.pdf
It's because all the links have '-Special' appended to them, when only a few of them are actually specials. Try the 2015/1/22 link, for instance, and it works.

hikoki
Manic Miner
Posts: 389
Joined: Thu Nov 16, 2017 10:54 am

Re: New Database Model ZXDB

Post by hikoki » Thu Jun 20, 2019 8:06 am

I wonder if the archive.org search API could be used to find page numbers.
https://openlibrary.org/dev/docs/bookurls
I guess the search term would have to contain footer or header words, characteristic of every magazine.

For example, here is a search link for page 54 of a Crash magazine, using the quoted term "54 Crash":
https://archive.org/details/crash-magaz ... 4+crash%22
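
A rough sketch of how such a link could be built in Python. It assumes the search URL has the form https://archive.org/details/<item>/search/<query>; the function name and the item identifier used below are only illustrative.

```python
from urllib.parse import quote_plus


def footer_search_url(item_id: str, page: int, footer_word: str) -> str:
    """Build a search URL for a quoted footer phrase such as "54 crash"."""
    phrase = f'"{page} {footer_word}"'
    # quote_plus encodes the double quotes as %22 and the space as +
    return f"https://archive.org/details/{item_id}/search/{quote_plus(phrase)}"


print(footer_search_url("crash-magazine-40", 54, "crash"))
# https://archive.org/details/crash-magazine-40/search/%2254+crash%22
```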
Last edited by hikoki on Thu Jun 20, 2019 9:31 am, edited 5 times in total.

djnzx48
Manic Miner
Posts: 482
Joined: Wed Dec 06, 2017 2:13 am
Location: New Zealand

Re: New Database Model ZXDB

Post by djnzx48 » Thu Jun 20, 2019 8:18 am

You mean scanning the bottom of every page with OCR to find out the page numbers?

I suppose it might work in theory, but it could be kind of impractical. How would you distinguish them from any other number on the page? And a lot of the full-page advertisements don't have page numbers anyway.

EDIT: Heh, so it actually works? That's interesting. It still seems hackish though as you'd need a specific query for every magazine layout.

hikoki
Manic Miner
Posts: 389
Joined: Thu Nov 16, 2017 10:54 am

Re: New Database Model ZXDB

Post by hikoki » Thu Jun 20, 2019 9:36 am

djnzx48 wrote:
Thu Jun 20, 2019 8:18 am
And a lot of the full-page advertisements don't have page numbers anyway.

EDIT: Heh, so it actually works? That's interesting. It still seems hackish though as you'd need a specific query for every magazine layout.
Well, you could provide users with both the expected and the hackish links :)

A script that detects searches with no results might be useful for locating pages that are not numbered.

EDIT

An example of how to data-mine the Internet Archive:
https://programminghistorian.org/en/les ... et-archive

Einar Saukas
Manic Miner
Posts: 911
Joined: Wed Nov 15, 2017 2:48 pm

Re: New Database Model ZXDB

Post by Einar Saukas » Thu Jun 20, 2019 12:09 pm

hikoki wrote:
Thu Jun 20, 2019 8:06 am
I wonder if the archive.org search API could be used to find page numbers.
https://openlibrary.org/dev/docs/bookurls
I guess the search term would have to contain footer or header words, characteristic of every magazine.

For example, a search link for page 54 on a Crash magazine containing the quoted term "54 Crash"
https://archive.org/details/crash-magaz ... /"54+crash"
Apparently only a few magazines have tagged pages. For instance it doesn't seem to work for Crash issue #84...

hikoki
Manic Miner
Posts: 389
Joined: Thu Nov 16, 2017 10:54 am

Re: New Database Model ZXDB

Post by hikoki » Thu Jun 20, 2019 12:29 pm

Einar Saukas wrote:
Thu Jun 20, 2019 12:09 pm
Apparently only a few magazines have tagged pages. For instance it doesn't seem to work for Crash issue #84...
Footers in Crash 84 follow a different pattern: january (black square) numberpage
If you know the character code of the black square, it should work:

https://archive.org/details/crash-magazine-84/search/%22january+(black square)+numberpage%22

EDIT

@Einar Saukas you are right in this case, as the footers don't seem to be OCRed.
Well, you can automatically detect the lack of footers by crawling the magazines with different URLs. If your search term doesn't return a single result, then provide just the estimated /page URL.
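
Something like this, as a rough sketch only. The `count_hits` callable is a hypothetical stand-in for whatever would actually run the search (not shown here), and the function name is made up.

```python
from typing import Callable


def best_page_link(item_id: str, pdf_page: int, search_query: str,
                   count_hits: Callable[[str, str], int]) -> str:
    """Prefer the footer search link; fall back to the estimated /page URL."""
    base = f"https://archive.org/details/{item_id}"
    if count_hits(item_id, search_query) > 0:
        return f"{base}/search/{search_query}"
    # No OCRed footer found: link to the estimated page instead.
    return base if pdf_page == 1 else f"{base}/page/n{pdf_page - 1}"


# Stand-in for a real search call, which is not shown here.
def no_hits(item_id: str, query: str) -> int:
    return 0


print(best_page_link("crash-magazine-84", 54, "%2254+crash%22", no_hits))
# https://archive.org/details/crash-magazine-84/page/n53
```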
*trying to outsmart Einar*

Einar Saukas
Manic Miner
Posts: 911
Joined: Wed Nov 15, 2017 2:48 pm

Re: New Database Model ZXDB

Post by Einar Saukas » Thu Jun 20, 2019 9:28 pm

djnzx48 wrote:
Thu Jun 20, 2019 3:00 am
EDIT: However, some magazine scans don't even have consistent numbering internally. For example, this Crash #40 scan has an extra page inserted after page 13 (some kind of fold-out poster?). Without verifying every scan, you can't be sure whether the page numbers are all correct.
Actually that's an error in the PDF version. This was supposed to be a single page:

https://archive.org/download/World_of_S ... 000015.jpg

But it was broken into 2 pages in PDF:

https://archive.org/details/crash-magazine-40/page/n14

If anyone provides a fixed PDF, I can ask people at Archive.org to replace it. It would fix numbering in this issue.
