This page is DanBri's record of how he got ISBNs out of Wikipedia. There are similar notes in the NoTube project wiki, but these are public.

Basically, I ran a series of scripts from Jakob Voss. The DBPedia guys also have 14,000-15,000 ISBNs. This technique finds about half a million, but they are only mentions of ISBNs; the page that mentions one might be about something else.


Script extraction

Info from Jakob:

Jakob Voss: $ ./ en
...downloading a lot of data

$ ./ enwiki-latest-pages-articles.xml.bz2 | awk -F'|' '$1=="ISBN" {print $5";"$2}'
...get ISBN;Article mapping
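
For readers who prefer Ruby to awk, the same field selection can be sketched as below. This assumes, as the awk command implies, that the extraction script emits pipe-separated rows with the template name in field 1, the article title in field 2 and the ISBN in field 5. The isbn_article_pairs helper and the sample rows are made up for illustration; the real output may look different.

```ruby
# Sketch: turn pipe-separated extraction rows into "ISBN;Article" lines.
# Field positions mirror the awk command ($1 template, $2 article, $5 ISBN).
# The helper name and the sample rows are illustrative assumptions.
def isbn_article_pairs(lines)
  lines.each_with_object([]) do |line, out|
    fields = line.chomp.split('|', -1)   # -1 keeps trailing empty fields
    next unless fields[0] == 'ISBN'      # same test as awk's $1=="ISBN"
    out << "#{fields[4]};#{fields[1]}"   # awk fields are 1-based, Ruby's are 0-based
  end
end

sample = [
  "ISBN|Example article|x|y|0-306-40615-2",
  "PMID|Example article|x|y|12345",
]
puts isbn_article_pairs(sample)
```

Only the first sample row survives the filter, because the second row's first field is not "ISBN".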

I put the scripts online, but might move them later. I'm not sure where the originals are hosted; possibly on the main Wikipedia site.

"I just downloaded and parsed the Finnish Wikipedia for ISBNs, so it's doable in some minutes to hours. The DBPedia guys have their own extraction framework. By the way the new version of the Bibliographic Ontology (BIBO 1.3) has a bibo:cites property - if you want to build RDF triples."
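
Following Jakob's suggestion, each ISBN;Article row could become one bibo:cites triple. Here is a minimal sketch, assuming articles are identified by their English Wikipedia URL and books by a urn:isbn: URI. Both URI choices and the cites_triple helper are illustrative assumptions, not something the scripts above produce.

```ruby
require 'erb'

# Property URI in the Bibliographic Ontology namespace.
BIBO_CITES = 'http://purl.org/ontology/bibo/cites'

# Sketch: build one N-Triples line "article bibo:cites book".
# Assumptions: English Wikipedia URLs identify articles and
# urn:isbn: URIs identify books.
def cites_triple(isbn, article)
  title = ERB::Util.url_encode(article.strip.tr(' ', '_'))
  "<http://en.wikipedia.org/wiki/#{title}> <#{BIBO_CITES}> <urn:isbn:#{isbn}> ."
end

puts cites_triple('9780306406157', 'Example article')
```

Since these are only mentions, a triple built this way says that the article cites the book, not that the article is about it.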


It turns out this data is a bit dirty and needs cleaning and normalising. Most of it is usable, though.

Again, a script from Jakob (needs the SeeAlso::Identifier::ISBN Perl module):

$ perl -MSeeAlso::Identifier::ISBN -n -e '($i,$a)=split(";");print new SeeAlso::Identifier::ISBN($i)->value . ";".$a;' < bookrefs2.txt > bookrefs3.txt
$ awk -F';' '$1!=""{print}' bookrefs3.txt > bookrefs4.txt
$ awk -F';' '$1!=""{n++}; END {print n}' bookrefs3.txt
$ awk -F';' '$1==""{n++}; END {print n}' bookrefs3.txt

The big number is (hopefully :) the number of good values...
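
To show what the cleaning step involves without the Perl module, here is a standalone sketch of ISBN normalisation. It is not a port of SeeAlso::Identifier::ISBN and may not match that module's output format. It strips punctuation, verifies the ISBN-10 or ISBN-13 check digit, and returns a 13-digit form, or an empty string for invalid input. The normalise_isbn and isbn13_check_digit names are made up here.

```ruby
# Sketch: normalise an ISBN string to 13 digits, or "" if it is not valid.
# A stand-in using the standard check-digit rules, not the Perl module's code.
def isbn13_check_digit(first12)
  # Weights alternate 1,3,1,3,... over the first twelve digits.
  sum = first12.chars.each_with_index.sum { |c, i| c.to_i * (i.even? ? 1 : 3) }
  ((10 - sum % 10) % 10).to_s
end

def normalise_isbn(raw)
  s = raw.to_s.upcase.gsub(/[^0-9X]/, '')   # drop hyphens, spaces, "ISBN" prefix
  case s
  when /\A\d{9}[\dX]\z/                      # ISBN-10: weighted sum must be 0 mod 11
    sum = s.chars.each_with_index.sum { |c, i| (c == 'X' ? 10 : c.to_i) * (10 - i) }
    return '' unless (sum % 11).zero?
    body = '978' + s[0, 9]                   # re-prefix and recompute the check digit
    body + isbn13_check_digit(body)
  when /\A\d{13}\z/                          # ISBN-13: final digit must match
    isbn13_check_digit(s[0, 12]) == s[12] ? s : ''
  else
    ''
  end
end

puts normalise_isbn('0-306-40615-2')
```

Returning an empty string for bad input matches how the awk commands above tell good rows from bad ones by testing whether the first field is empty.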

Ruby matcher code

Quick hack to compare this list (486690 good ISBNs) to a local collection of books:

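#!/usr/bin/env ruby

# Build an ISBN -> article lookup from the cleaned list.
db = {}
`gzcat bookrefs3.txt.gz`.split(/\n/).each do |line|
  x, y = line.split(/;/, 2)
  next if x.nil? || x.empty?   # skip rows whose ISBN failed normalisation
  puts "Map: #{x} -> #{y}"
  db[x] = y                    # a later mention of the same ISBN overwrites an earlier one
end

# Local book list, assumed to hold one normalised ISBN per line.
l = File.readlines("/Users/danbri/Documents/danbri-books.txt").map(&:strip)
l.each do |book|
  puts "Looking for: '#{book}'"
  if db[book] != nil
    puts "MATCH: #{book} -> #{db[book]}"
  end
end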
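
The matcher's lookup hash holds at most one article per ISBN, although an ISBN can be mentioned in several articles. Here is a sketch of a variant that keeps all of them, working on in-memory lines so it can be tried without the real files. The build_index and find_matches names and the sample data are assumptions for illustration.

```ruby
# Sketch: index "ISBN;Article" lines so each ISBN maps to every article mentioning it.
def build_index(lines)
  index = Hash.new { |h, k| h[k] = [] }
  lines.each do |line|
    isbn, article = line.chomp.split(';', 2)
    next if isbn.nil? || isbn.empty? || article.nil?   # skip rows with no usable ISBN
    index[isbn] << article
  end
  index
end

# Return { isbn => [articles] } for the books that appear in the index.
def find_matches(index, books)
  books.map(&:strip)
       .select { |b| index.key?(b) }   # key? does not create empty entries
       .map { |b| [b, index[b]] }
       .to_h
end

index = build_index(["9780306406157;Article A", "9780306406157;Article B", ";Bad row"])
p find_matches(index, ["9780306406157\n", "9999999999999\n"])
```

Using key? for the lookup avoids triggering the hash's default block, so looking up unknown books does not grow the index.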


For the DBPedia ISBNs, info thanks to Chris Bizer and Anja Jentzsch. There are 15,000 or so ISBNs.

select distinct ?book ?isbn where {?book <> ?isbn}

(Run this against their SPARQL database.)