# Last edited on 2011-12-15 12:46:35 by stolfi

FETCHING

  Obtained all cards from the remote site:

    wget-it.sh

  Took a couple of days; result was 20630 files named
  "gene-card.php?clusterid=cluster1" to
  "gene-card.php?clusterid=cluster20630".

RENAMING

  Renamed each file "gene-card.php?clusterid=clusterNNNNN" to
  "NN/NNN.html" after zero-padding:

    rename-files.sh

EXTRACTING DATA

  Extracted the relevant data from "NN/NNN.html" to "NN/NNN.dat"
  via "NN/NNN.spt":

    split-and-extract-all.sh

MATCHING PROBESETS TO KH GENES

  Obtaining the probeset and KH gene files:

    wget -nv http://dl.dropbox.com/u/35254929/myKH.txt
    wget -nv http://dl.dropbox.com/u/35254929/probeset.txt

  Matching them:

    match-probesets-to-KH.gawk myKH.txt

      loaded 30965 map pairs from probeset.txt

  The output is two files, "ps-KH-dir.txt"
  (Probeset,scaffold,start,end,KHgenes) and "ps-KH-inv.txt"
  (KHGene,scaffold,Probesets).

MATCHING PROBESETS AND KH GENES TO ANISEED CARDS

  Output with one line for each (probeset,KH model) pair:

    match-ps-KH-to-cards.gawk -v collapse=0 ??/???.dat \
      | sort -t, -k1,1n -k3,3 -k2,2 \
      > ps-KH-anis1.csv

      there were 30965 probesets.
      there were 7674 probesets without KH models.
      there were 690 probesets without ANISEED cards.
      there were 41400 (probesets,KH model) pairs.
      there were 1327 (probesets,KH model) pairs without ANISEED cards.

  Output with one collapsed line for each probeset:

    match-ps-KH-to-cards.gawk -v collapse=1 ??/???.dat \
      | sort -t, -k1,1n -k2,2 \
      > ps-KH-anis2.csv

      there were 30965 probesets.
      there were 7674 probesets without KH models.
      there were 690 probesets without ANISEED cards.
      there were 41400 (probesets,KH model) pairs.
      there were 1327 (probesets,KH model) pairs without ANISEED cards.

LISTING GENE NAMES BY INSITU WELL CODE

  Running:

    describe-wells.gawk ??/???.dat \
      | sort -t, -k1,1 -k2,2 \
      > InSitu-genes.csv

      read 20170 cards
      there were 14543 InSitu clone codes.
      there were 9593 cards without InSitu clone codes.
      there were 1782 InSitu clones without manual or inferred gene names.
      there were 315 InSitu clones without KH gene names.
      there were 2419 InSitu clones without full ORFs.

  The plate map was manually extracted from the stderr output and
  saved to "InSitu-plates.txt".

SHIPPING

  Packing the ".dat" files:

    tar -cvzf datfiles.tgz ??/???.dat

  Sending to manaus:

    rsync -avzu \
      00-Notebook.txt \
      datfiles.tgz \
      myKH.txt probeset.txt aniseedV3_GM.txt \
      ps-KH-dir.txt ps-KH-inv.txt \
      ps-KH-anis1.csv ps-KH-anis2.csv \
      InSitu-genes.csv InSitu-plates.txt \
      *.gawk *.sh \
      stolfi@manaus.ic.unicamp.br:projects/ciona/ &
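
  For reference, the name mapping described under RENAMING can be
  sketched in bash as below. This is only an illustration, not the
  contents of "rename-files.sh", which are not reproduced in this
  notebook. It assumes that the cluster number is zero-padded to five
  digits, with the first two digits used as the directory and the last
  three as the file name; the function names "card_path" and
  "rename_cards" are hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch of the RENAMING step; NOT the actual rename-files.sh.
# Assumption: cluster numbers are zero-padded to 5 digits, "NNNNN" -> "NN/NNN.html".

# Print the "NN/NNN.html" path for one fetched file name.
card_path() {
  local name="$1"
  # Strip everything up to and including "clusterid=cluster".
  local num="${name##*clusterid=cluster}"
  local padded
  # Zero-pad to 5 digits; "10#" forces base 10 even with leading zeros.
  padded=$(printf '%05d' "$((10#$num))")
  # First two digits are the directory, last three the file name.
  echo "${padded:0:2}/${padded:2:3}.html"
}

# Rename every fetched card found in the current directory.
rename_cards() {
  local f dest
  for f in gene-card.php\?clusterid=cluster*; do
    [ -e "$f" ] || continue        # skip the literal pattern if nothing matched
    dest=$(card_path "$f")
    mkdir -p "${dest%/*}"          # create the "NN" directory if needed
    mv -n -- "$f" "$dest"          # -n: never overwrite an existing card
  done
}
```

  Under these assumptions, calling "rename_cards" in the download
  directory would move "gene-card.php?clusterid=cluster123" to
  "00/123.html", which is the layout that the "??/???.dat" patterns in
  the later steps rely on.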