Bracing against the wind  
www.documentroot.com  

Tuesday, January 08, 2013

Compression ratio and genome assembly quality

When assembling a genome, you would expect the complexity of the finished assembly to be comparable to that of genomes of similar size from neighboring taxa.

One easy measure of complexity is the degree to which a genome can be compressed. After converting to 2-bit format, which strips out the trivial redundancy of storing each base as an 8-bit character, some genomes compress better than others. bzip2 defaults to its largest block size (900 kB), so the ratio of compressed to uncompressed size of the 2-bit encoding of a FASTA file should be a good measure of complexity.

I can't think of a good source of data to test this theory. Maybe look at the AMOS validate paper.

Here is the source of "complexity-measure.sh". It works well: it is fast, and it prints a percentage as its only output:

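#!/bin/bash -e
# Usage: complexity-measure.sh genome.fa
# Prints the bzip2-compressed size of the 2-bit encoding as a percentage of its uncompressed size.

in=${1:?usage: complexity-measure.sh genome.fa}

# Named pipes, so the 2-bit data never has to be stored on disk.
f2b=$in.f2b
bzi=$in.bzi

rm -f "$f2b" "$bzi"

mkfifo "$f2b" "$bzi"

# Convert the FASTA input to 2-bit format, writing into the first pipe.
faToTwoBit "$in" "$f2b" &

# Copy the 2-bit stream into the second pipe while counting its bytes.
tee "$bzi" < "$f2b" | wc -c > "$in.bsz" &
# Compress the copy and count the compressed bytes.
comp=$(bzip2 < "$bzi" | wc -c)
wait

# Compressed size as a percentage of the uncompressed size.
perl -ne "printf qq{%.4f\n}, 100*$comp/\$_" "$in.bsz"

rm -f "$f2b" "$bzi" "$in.bsz"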
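
Since the point is to compare an assembly against related genomes, a small wrapper like the one below could run the script over several FASTA files and print one line per file. This is a sketch, not part of the original script: `measure_all` and the file names in the usage comment are hypothetical.

```bash
#!/bin/bash
# Sketch: run a complexity script over several FASTA files.
# measure_all is a hypothetical wrapper. Its first argument is the command to
# run (for example the path to complexity-measure.sh); the rest are FASTA files.
measure_all() {
    local script=$1 fa
    shift
    for fa in "$@"; do
        # One tab-separated line per file: name, then the percentage.
        printf '%s\t%s\n' "$fa" "$("$script" "$fa")"
    done
}

# Example (hypothetical file names), sorted by percentage:
#   measure_all ./complexity-measure.sh my-assembly.fa relative1.fa relative2.fa | sort -k2,2n
```

An assembly whose percentage sits well away from those of its close relatives would be the one to look at more closely.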



(C) 2002 Erik Aronesty/DocumentRoot.Com. Right to copy, without attribution, is given freely to anyone for any reason.

