Bracing against the wind
www.documentroot.com
Tuesday, January 08, 2013
Compression ratio and genome assembly quality
One easy measure of a genome's complexity is how well it compresses. After conversion to 2-bit format, some genomes compress better than others. bzip2 uses a large block size by default (900 kB), so the ratio of compressed to uncompressed size of a 2-bit-encoded FASTA file should be a good measure of complexity. I can't think of a good source of data to test this theory. Maybe look at the Amos validate paper. The source of "complexity-measure.sh" follows. It works well: it is fast, and it produces a percentage as its only output:
#!/bin/bash -e
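# Input FASTA file; the FIFOs and the size file are named after it.
in=$1
f2b=$in.f2b
bzi=$in.bzi
rm -f "$f2b" "$bzi"
mkfifo "$f2b" "$bzi"
# Convert to 2-bit in the background, writing into the first FIFO.
faToTwoBit "$in" "$f2b" &
# Copy the 2-bit stream into the second FIFO while counting its bytes.
tee "$bzi" < "$f2b" | perl -ne '$t+=length($_); END{print "$t\n"}' > "$in.bsz" &
# Compress the copy and count the compressed bytes.
comp=$(bzip2 < "$bzi" | perl -ne '$t+=length($_); END{print "$t\n"}')
wait
# Print the compressed size as a percentage of the 2-bit size.
perl -ne "printf qq{%.4f\n}, 100*$comp/\$_" "$in.bsz"
rm -f "$f2b" "$bzi" "$in.bsz"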
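Lacking real data, here is a rough sanity check of the idea, sketched on synthetic input. Everything in it is an assumption for illustration: the helper name ratio_percent is hypothetical, the sequences are made up, and it compresses plain-text bases rather than 2-bit data, so the absolute percentages will not match what the script above reports. It only shows that a repetitive sequence should come out with a much lower percentage than a pseudo-random one.

```shell
#!/bin/bash
# Hypothetical helper (not part of complexity-measure.sh): print the
# bzip2-compressed size of stdin as a percentage of its original size.
ratio_percent() {
    local tmp raw comp
    tmp=$(mktemp)
    cat > "$tmp"
    raw=$(wc -c < "$tmp")
    comp=$(bzip2 -c < "$tmp" | wc -c)
    rm -f "$tmp"
    awk -v c="$comp" -v r="$raw" 'BEGIN { printf "%.4f\n", 100 * c / r }'
}

# Low complexity: the motif ACGT repeated 50000 times (200000 bases).
repeat_pct=$(yes ACGT | head -n 50000 | tr -d '\n' | ratio_percent)

# High complexity: 200000 pseudo-random bases from awk's rand(),
# seeded with a fixed value so the run is repeatable on a given awk.
random_pct=$(awk 'BEGIN {
    srand(42)
    for (i = 0; i < 200000; i++)
        printf "%s", substr("ACGT", int(rand() * 4) + 1, 1)
}' | ratio_percent)

echo "repeat: ${repeat_pct}%  random: ${random_pct}%"
```

If the theory holds, a highly repetitive assembly should land nearer the first number and a complex one nearer the second.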