# Last edited on 2003-08-19 20:20:48 by stolfi

Obtaining from the NEC Citeseer all papers that reference my papers

GETTING THE BASIC DATA

  Did a [Search citations] with the query
  
    stolfi j or stolfi jorge or j stol or jorge stol
    
  The "stol" queries are necessary because Postscript and/or TeX 
  render "fi" as a ligature, which the Citeseer cannot handle.
  
  Got 101 papers (many duplicates, and some bogus ones, such 
  as Guido Stolfi's), with 
  
      759 citations on 2003-07-04
      771 citations on 2003-08-19
  
  Saved the result pages as "citation-search-results/${dt}/stolfi-{N}.html"
  where ${dt} is the search date and {N} is 0, 1, or 2 (up to 50 entries each).
  
  Extracted raw citations from those pages:
  
    set dt = 2003-07-04
    # set dt = 2003-07-21
    # set dt = 2003-08-19

    extract-papers-from-nec-hits \
        citation-search-results/${dt}/stolfi-{0,1,2}.html \
      | fix-nec-database-bugs \
      | splice-bib-entries \
      | cleanup-isi-nec-citations \
      | sort-bib-file \
      | add-keys-to-citations \
      > target-paper-bibs/${dt}/stolfi.bib
      
  Ran the same cleanup pipeline on the manually edited file:

    cat stolfi-cit-0-edt.bib \
      | fix-nec-database-bugs \
      | splice-bib-entries \
      | cleanup-isi-nec-citations \
      | sort-bib-file \
      | add-keys-to-citations \
      > .xxx

  Comparing with manually edited version:
  
    cat stolfi-cit-0-edt.bib | sort | uniq > .edt
    cat target-paper-bibs/${dt}/stolfi.bib | sort | uniq > .raw
    diff -Bb .raw .edt  \
      | prettify-diff-output \
      | egrep -v '^ *([>]|[-][-][-]|$)' \
      | egrep -v '^ *[<] *(ctxurl|docurl|neckey)' \
      > .diffn
    diff -Bb .raw .edt  \
      | prettify-diff-output \
      | egrep -v '^ *([<]|[-][-][-]|$)' \
      | egrep -v '^ *[>] *(author|title|booktitle)' \
      > .diffo

    diff -Bb target-paper-bibs/${dt}/stolfi.bib stolfi-cit-0-edt.bib \
      | prettify-diff-output \
      | egrep -v '^ *[<] *(citations|ctxurl|docurl|neckey)' \
      | egrep -v '^ *[>] *(citations|author|title|booktitle)' \
      > .diff
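
  As a rough illustration of what the two filters behind ".diffn" keep,
  here is a sketch run on made-up input.  The sample lines are
  hypothetical, and it assumes that prettify-diff-output preserves the
  "<" and ">" markers of diff.  It uses "grep -E", which is equivalent
  to egrep.

```shell
# Hypothetical diff-like lines: "<" = only in .raw, ">" = only in .edt.
printf '%s\n' \
    '< author = {A. Author}' \
    '< ctxurl = {http://example.org/ctx}' \
    '> title = {Some Title}' \
    '---' \
  | grep -E -v '^ *([>]|[-][-][-]|$)' \
  | grep -E -v '^ *[<] *(ctxurl|docurl|neckey)'
```

  Only the "< author" line survives: the first filter drops the ">"
  lines, the "---" separators and the blank lines, and the second drops
  the ctxurl, docurl and neckey fields.  The ".diffo" filters do the
  mirror-image job for the ">" side.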

GETTING THE CITING PAPERS
  
  Clicked the [Context] button of each citation for the quad-edge
  paper (in all incarnations) and then for the various AA papers.
  Saved those pages as "context-htmls/{NNNNN-N}.html", where {NNNNN-N} is
  basically a random number.  The cited paper corresponding to each
  file was noted in the file "stolfi-cit-0-edt.bib".
  
  Extracted the citing papers through the script

    foreach f ( context-htmls/[0-9]*.html )
      echo $f
      cat $f | extract-citations-from-nec-context > ${f:r}.hcit
    end
   
  The result is a list of URLs like 
  
    http://citeseer.nj.nec.com/isenburg99triangle.html 
    http://citeseer.nj.nec.com/371559.html 
    ...
    
  Then collected all those URLs into a file, removing the common prefix:
  
    cat context-htmls/*.hcit \
      | sed -e 's"http://citeseer.nj.nec.com/""' \
      | sed -e 's: *$::' \
      | sort | uniq \
      > paper-htmls/all.hcits 
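
  A small sketch of the same clean-up on three sample lines (the two
  URLs shown above, one of them repeated and one with a trailing
  blank), just to show the effect of each stage.  The input is made up
  for the example.

```shell
# Strip the common prefix, then trailing blanks, then drop duplicates.
printf '%s\n' \
    'http://citeseer.nj.nec.com/isenburg99triangle.html ' \
    'http://citeseer.nj.nec.com/371559.html' \
    'http://citeseer.nj.nec.com/isenburg99triangle.html' \
  | sed -e 's"http://citeseer.nj.nec.com/""' \
  | sed -e 's: *$::' \
  | sort | uniq
```

  This prints "371559.html" and "isenburg99triangle.html", one per
  line.  The trailing blanks must be removed before "uniq", otherwise
  the two copies of the second URL would be counted as distinct.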

  Fetching those entries:
  
    # Pause between requests, varying from 300 to 600 seconds.
    set delay = 300
    foreach f ( `cat paper-htmls/all.hcits` ) 
      echo "=== $f =================================================="
      wget "http://citeseer.nj.nec.com/${f}" -N -nv -O "paper-htmls/$f"
      sleep ${delay}
      # Lengthen the pause by 67 seconds, wrapping around above 600.
      @ delay = ${delay} + 67
      @ tmp = $delay - 300
      if ( ${delay} > 600 ) set delay = ${tmp}
    end
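
  The pause schedule of the loop above can be reproduced without
  fetching anything.  The following sketch restates the same arithmetic
  in awk (a re-implementation for illustration only, not part of the
  original procedure) and prints the first twelve pauses, in seconds.

```shell
# Same rule as the loop: add 67, and subtract 300 whenever the result exceeds 600.
awk 'BEGIN { d = 300; for (i = 0; i < 12; i++) { print d; d += 67; if (d > 600) d -= 300 } }'
```

  The values are 300, 367, 434, 501, 568, 335, 402, 469, 536, 303,
  370, 437.  Every pause stays between 300 and 600 seconds, and the
  step of 67 keeps the requests from falling into a fixed rhythm.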
    
  Checking whether we got them all:
  
    ( cd paper-htmls && ls -1d *.html | sort > .fetched )
    diff paper-htmls/all.hcits paper-htmls/.fetched
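
  A possible variant of this check, sketched on toy data in a scratch
  directory (the /tmp/hcits-demo paths and the file "empty.html" are
  hypothetical).  It uses "comm" to list only the entries still
  missing, and "find" to spot empty files.  The latter matters because
  "wget -O" creates the output file before the transfer starts, so a
  failed fetch can leave a zero-length file that still counts as
  fetched.

```shell
# Toy data in a scratch directory (hypothetical paths, not the real files).
mkdir -p /tmp/hcits-demo
printf '%s\n' 371559.html isenburg99triangle.html > /tmp/hcits-demo/all.hcits
printf '%s\n' 371559.html > /tmp/hcits-demo/.fetched
# Lines present only in the first (sorted) file: entries not yet fetched.
comm -23 /tmp/hcits-demo/all.hcits /tmp/hcits-demo/.fetched
# Fetched files that are empty, hence probably failed downloads.
touch /tmp/hcits-demo/empty.html
find /tmp/hcits-demo -name '*.html' -size 0
```

  On the toy data, "comm" reports "isenburg99triangle.html" and "find"
  reports "/tmp/hcits-demo/empty.html".  Both input files of "comm"
  must be sorted in the same collating order.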