Control: tags -1 + patch pending
On Wed, 2 Apr 2025 10:05:52 +0300 Andrius Merkys <mer...@debian.org> wrote:
I finally managed to isolate the difference in cdhit output which causes
segfaults in provean. It seems that cdhit >= 4.8.1-4 replaced full FASTA
headers in its output with partial IDs:
diff -r /home/andrius/provean/good/cdhit.cluster /home/andrius/provean/bad/cdhit.cluster
1c1
< >gi|119610548|gb|EAW90142.1| tumor protein p53 (Li-Fraumeni syndrome), isoform CRA_c
---
> >EAW90142.1 tumor protein p53 (Li-Fraumeni syndrome), isoform CRA_c [Homo sapiens]
I need to look deeper into whether cdhit can be persuaded to use the old
output format. If not, provean will have to be adjusted to the change.
I was wrong: it is blastdbcmd that has changed its default output format
and no longer replicates the full input FASTA header. I have patched the
code to explicitly set the requested output format.
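To illustrate the header difference in the diff above: a consumer of these headers could also be made tolerant of both forms. The following is only a minimal sketch in Python, not the patch applied to provean. The function name fasta_accession is hypothetical, and the assumption that the accession is the fourth pipe-separated field is taken only from the single legacy header shown in the diff.

```python
def fasta_accession(header: str) -> str:
    """Return the accession from a FASTA header in either form.

    Hypothetical helper for illustration; not part of provean or BLAST+.
    """
    # Drop the leading '>' and keep only the ID token before the first space.
    parts = header.lstrip(">").split(None, 1)
    if not parts:
        return ""
    seq_id = parts[0]
    # Legacy form as seen in the diff: gi|<number>|<db>|<accession>|
    # (assumption: the accession is always the fourth field).
    fields = seq_id.split("|")
    if len(fields) >= 4 and fields[0] == "gi":
        return fields[3]
    # Newer form: the ID token is already the bare accession.
    return seq_id


old = ">gi|119610548|gb|EAW90142.1| tumor protein p53 (Li-Fraumeni syndrome), isoform CRA_c"
new = ">EAW90142.1 tumor protein p53 (Li-Fraumeni syndrome), isoform CRA_c [Homo sapiens]"
print(fasta_accession(old) == fasta_accession(new))
```

With such a normalization both headers map to the same accession, so downstream code would not depend on which header form the tool emits.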
It would be nice to add an autopkgtest to prevent regressions, but the
input database is ~12GB (and it seems that only one from [1] works).
Andrius
[1] ftp://ftp.jcvi.org/data/provean/nr_Aug_2011/