# Last edited on 2003-08-19 20:20:48 by stolfi

Obtaining from the NEC Citeseer all papers that reference my papers

GETTING THE BASIC DATA

  Did a [Search citations] with the query

    stolfi j or stolfi jorge or j stol or jorge stol

  The "stol" queries are necessary because Postscript and/or TeX
  render "fi" as a ligature, which the Citeseer cannot handle.

  Got 101 papers (many duplicates, and some bogus ones, such as
  Guido Stolfi's), with

    759 citations on 2003-07-04
    771 citations on 2003-08-19

  Saved the result pages as "citation-search-results/${dt}/stolfi-N.html"
  where {N} is 0,1,2 (50 entries each).

  Extracted raw citations from those pages:

    set dt = 2003-07-04
    # set dt = 2003-07-21
    # set dt = 2003-08-19
    extract-papers-from-nec-hits \
        citation-search-results/${dt}/stolfi-{0,1,2}.html \
      | fix-nec-database-bugs \
      | splice-bib-entries \
      | cleanup-isi-nec-citations \
      | sort-bib-file \
      | add-keys-to-citations \
      > target-paper-bibs/${dt}/stolfi.bib

    cat stolfi-cit-0-edt.bib \
      | fix-nec-database-bugs \
      | splice-bib-entries \
      | cleanup-isi-nec-citations \
      | sort-bib-file \
      | add-keys-to-citations \
      > .xxx

  Comparing with the manually edited version:

    cat stolfi-cit-0-edt.bib | sort | uniq > .edt
    cat target-paper-bibs/${dt}/stolfi.bib | sort | uniq > .raw

    # Lines only in the raw file, ignoring the NEC bookkeeping fields:
    diff -Bb .raw .edt \
      | prettify-diff-output \
      | egrep -v '^ *([>]|[-][-][-]|$)' \
      | egrep -v '^ *[<] *(ctxurl|docurl|neckey)' \
      > .diffn

    # Lines only in the edited file, ignoring the hand-corrected fields:
    diff -Bb .raw .edt \
      | prettify-diff-output \
      | egrep -v '^ *([<]|[-][-][-]|$)' \
      | egrep -v '^ *[>] *(author|title|booktitle)' \
      > .diffo

    # Full diff in file order, with both kinds of expected differences removed:
    diff -Bb target-paper-bibs/${dt}/stolfi.bib stolfi-cit-0-edt.bib \
      | prettify-diff-output \
      | egrep -v '^ *[<] *(citations|ctxurl|docurl|neckey)' \
      | egrep -v '^ *[>] *(citations|author|title|booktitle)' \
      > .diff

GETTING THE CITING PAPERS

  Followed the [Context] button of each citation for the quad-edge
  paper (in all incarnations) and then for the various AA papers.
  Saved those pages as "context-htmls/{NNNNN-N}.html", where {NNNNN-N}
  is basically a random number.
  The cited paper corresponding to each file was noted in the file
  "stolfi-cit-0-edt.bib".

  Extracted the citing papers through the script

    foreach f ( context-htmls/[0-9]*.html )
      echo $f
      cat $f | extract-citations-from-nec-context > ${f:r}.hcit
    end

  The result is a list of URLs like

    http://citeseer.nj.nec.com/isenburg99triangle.html
    http://citeseer.nj.nec.com/371559.html
    ...

  Then collected all those URLs into a file, removing the common prefix:

    cat context-htmls/*.hcit \
      | sed -e 's"http://citeseer.nj.nec.com/""' \
      | sed -e 's: *$::' \
      | sort | uniq \
      > paper-htmls/all.hcits

  Fetching those entries:

    set delay = 300
    foreach f ( `cat paper-htmls/all.hcits` )
      echo "=== $f =================================================="
      wget "http://citeseer.nj.nec.com/${f}" -N -nv -O "paper-htmls/$f"
      sleep ${delay}
      @ delay = ${delay} + 67
      @ tmp = $delay - 300
      if ( ${delay} > 600 ) set delay = ${tmp}
    end

  Checking whether we got them all:

    ( cd paper-htmls && ls -1d *.html | sort > .fetched )
    diff paper-htmls/all.hcits paper-htmls/.fetched
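
  Note on the fetch loop: the pause between requests grows by 67
  seconds per request and is pulled back by 300 whenever it exceeds
  600, so successive requests are spaced irregularly between 300 and
  600 seconds. As a rough illustration only (written in Python rather
  than the csh used in these notes; the name delay_schedule and its
  parameters are made up here), the same schedule can be reproduced as:

```python
from itertools import islice


def delay_schedule(start=300, step=67, limit=600, wrap=300):
    """Yield the successive sleep times (in seconds) of the csh fetch loop.

    The defaults mirror the constants in the loop; the parameter names
    themselves are hypothetical and do not appear in the original notes.
    """
    delay = start
    while True:
        yield delay        # value passed to `sleep` for this request
        delay += step      # @ delay = ${delay} + 67
        if delay > limit:  # if ( ${delay} > 600 ) ...
            delay -= wrap  # ... set delay = ${tmp}, where tmp = delay - 300


if __name__ == "__main__":
    # Print the pauses that would follow the first ten requests.
    print(list(islice(delay_schedule(), 10)))
```

  Since 67 does not divide 300, the pauses do not settle into a short
  repeating pattern, which is presumably the point of the odd increment.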