Hi everyone,

I thought it would be a good idea to introduce myself before I get on to bug 
fixing details. So, here goes.

I'm Karthik (*waves*) and you may have seen me hanging around on #kde as 
karthikp. I have a background in aerospace engineering and got my doctorate in 
2012 studying flame structure using optical diagnostic techniques. However, I 
now work in another field, lithography, creating optical proximity correction 
recipes. As for relevant coding skills, I'm very comfortable in C++, perhaps 
less so with Qt and CMake. Oh, and I can use git in my sleep, so that helps. :)

I caught the KDE bug upon discovering Kubuntu sometime around 2006. In 2008, I 
caught the Arch bug as well, and the two have been with me ever since. These 
are bugs I _don't_ want to fix. :) The bug I _do_ want to fix is this one, 
which affects the sonnet spell checking framework: 
https://bugs.kde.org/show_bug.cgi?id=337145

I reported this over the weekend, having run into it all through my grad 
years, when automatic spell checking in katepart would highlight "spelling 
errors" all over my data files. So, the goal is to make sonnet smarter so that 
it avoids spell checking numbers, generally speaking. I wanted to share what 
I've been doing so that a) someone can tell me whether I'm on the right path, 
or b) if someone else is already working on it, we can pool our efforts and 
avoid duplicating work.

Here's what I've done so far. The bug exists in 4.x for sure, and since the 
code in the kf5/sonnet repo seems to be much the same implementation as 
kdecore/sonnet in kdelibs, I think the fix would need to be applied in both 
places.

My first approach was to try to extend the isValidWord() function in the 
Filter class to identify "words" that are actually numbers. I started by just 
converting the QString to a double using toDouble() and using its success flag 
to identify numbers. This catches most of the simple forms, like 1, 1.0, 
1.0e-1, 1.0E+1, etc., but fails for numbers with field separators like 1,000. 
So, I added another test: if the word contains a comma, split it and check 
whether each non-empty part is a number. If so, it's not a valid word. This 
worked great... for a time.
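
To make the idea above concrete, here is a minimal sketch in plain C++. It is 
not the actual sonnet patch: std::strtod stands in for QString::toDouble(), 
and the helper names looksLikePlainNumber() and looksLikeNumber() are 
hypothetical, made up for illustration.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: true if `token` parses completely as a plain decimal
// floating-point literal such as 1, 1.0, 1.0e-1 or 1.0E+1.
bool looksLikePlainNumber(const std::string &token)
{
    // Restrict the alphabet first so that strtod extras ("inf", "nan",
    // hex floats, leading whitespace) are not treated as numbers.
    if (token.empty() ||
        token.find_first_not_of("0123456789+-.eE") != std::string::npos) {
        return false;
    }
    char *end = nullptr;
    std::strtod(token.c_str(), &end);
    // Accept only if the whole token was consumed by the conversion.
    return end == token.c_str() + token.size();
}

// Hypothetical helper: additionally accepts comma-separated forms like 1,000
// by checking that every non-empty part between commas is a plain number.
bool looksLikeNumber(const std::string &token)
{
    if (looksLikePlainNumber(token)) return true;
    if (token.find(',') == std::string::npos) return false;

    bool sawPart = false;
    std::size_t start = 0;
    while (start <= token.size()) {
        std::size_t comma = token.find(',', start);
        if (comma == std::string::npos) comma = token.size();
        const std::string part = token.substr(start, comma - start);
        if (!part.empty()) {
            if (!looksLikePlainNumber(part)) return false;
            sawPart = true;
        }
        start = comma + 1;
    }
    return sawPart;
}
```

In this sketch, looksLikeNumber("1,000") is true while looksLikeNumber("1/2") 
is false, which is exactly the gap described further down. Note that strtod 
and toDouble() do not accept precisely the same inputs, so the real Qt call 
may behave differently at the edges.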

However, it couldn't handle this way of writing numbers: 1.23(4). This form 
is often found in scientific data, where the number in the brackets denotes 
the standard error in the last significant digits. So, I added another test 
for the presence of ( or ) and did the split dance again. That also worked 
great... for a time.
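
As an illustration of the bracket form, here is a small plain C++ sketch. 
Rather than splitting on the brackets, it checks the shape value(digits) 
directly. The helper name looksLikeNumberWithUncertainty() is hypothetical, 
and std::strtod again stands in for QString::toDouble().

```cpp
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper: true if `token` has the form value(digits),
// for example 1.23(4).
bool looksLikeNumberWithUncertainty(const std::string &token)
{
    const std::size_t open = token.find('(');
    if (open == std::string::npos || open == 0 || token.back() != ')') {
        return false;
    }

    // The text between the brackets must be one or more decimal digits.
    const std::string digits = token.substr(open + 1, token.size() - open - 2);
    if (digits.empty()) return false;
    for (unsigned char c : digits) {
        if (!std::isdigit(c)) return false;
    }

    // The text before the bracket must parse completely as a decimal number.
    const std::string value = token.substr(0, open);
    if (value.find_first_not_of("0123456789+-.") != std::string::npos) {
        return false;
    }
    char *end = nullptr;
    std::strtod(value.c_str(), &end);
    return end == value.c_str() + value.size();
}
```

This accepts 1.23(4) and rejects things like abc(4) or 1.23(), but it is yet 
another special case bolted onto the number test.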

Then came the doozy. What about 1/2? This opens the door to all kinds of 
expressions with operators: 1+2, 1-2, etc. Comparisons like 1<2 should also be 
exempt from spell checking. At this point, the approach rapidly got out of hand.

My next approach (which I'm still in the middle of) is to work in setBuffer() 
instead of isValidWord(). This uses QTextBoundaryFinder to break up the text 
into words. I had high hopes for using the boundary type Grapheme instead of 
Word, but that seems to treat every character as a valid boundary. I'm now 
going to try combining this with QChar::isLetterOrNumber() to identify word 
and number boundaries so that isValidWord() can then just drop "words" composed 
entirely of digits.
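
Here is a rough sketch of that splitting idea in plain C++, under some loud 
assumptions: std::isalnum is an ASCII-only stand-in for 
QChar::isLetterOrNumber(), QTextBoundaryFinder is not involved at all, and 
the function names splitIntoTokens() and tokensToSpellCheck() are 
hypothetical.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Split `text` into maximal runs of letters and digits.
// std::isalnum is an ASCII-only stand-in for QChar::isLetterOrNumber().
std::vector<std::string> splitIntoTokens(const std::string &text)
{
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current.push_back(static_cast<char>(c));
        } else if (!current.empty()) {
            // Any other character ends the current run.
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

// Hypothetical filter: keep only tokens that are not made up purely of
// digits, so that only those are handed to the spell checker.
std::vector<std::string> tokensToSpellCheck(const std::string &text)
{
    std::vector<std::string> result;
    for (const std::string &token : splitIntoTokens(text)) {
        bool allDigits = true;
        for (unsigned char c : token) {
            if (!std::isdigit(c)) {
                allDigits = false;
                break;
            }
        }
        if (!allDigits) result.push_back(token);
    }
    return result;
}
```

With this splitting, 1/2, 1<2, 1,000 and 1.23(4) all fall apart into 
digit-only pieces and are skipped without any special cases. One limitation 
of the sketch: 1.0e-1 splits into 1, 0e and 1, so the fragment 0e would still 
reach the spell checker.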

I'd appreciate any thoughts/advice on this problem. If anyone else is working 
on the sonnet code, do let me know. Also, who's the current maintainer of the 
sonnet code base? The bug tracker CC's Zack Rusin, but the repo names Martin 
Sandsmark. If either of you is actively maintaining sonnet, I'd love to pepper 
you with more questions!

Otherwise, I'll post an update later this week, or more likely over the 
weekend, with what I hope will be a working solution. I'll bug everyone then 
for help in reviewing my work, and we can finally close this bug!

Thanks,
Karthik
