dweiss commented on pull request #2277: URL: https://github.com/apache/lucene-solr/pull/2277#issuecomment-773476961
Hi Peter.

The file size itself doesn't matter if we can assume some kind of leading buffer in which these flags have to occur and which we can rewind. Implementing this is technically easy - for example via BufferedInputStream with a reasonably large internal buffer, then a mark on the zeroth byte. Once you reach your flags, you reset the stream back to the mark.

The only problem I see here is a minor potential for the buffer limit to fall in the middle of a multi-byte UTF-8 sequence, for example, which could trigger some kind of decoding exception... but this can be worked around.

I'll try to do this, time permitting. It's not much of a problem - it can be done later too. I have to limit my time for Lucene to reasonable chunks though. :)

> for Kinyarwanda it's 38MB.

Is this one of the openoffice dictionaries? Once we have them all parse successfully it'd be a good baseline test.
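
For illustration only, the mark/reset idea mentioned above could look roughly like the sketch below. It is not code from the pull request: the class name `HeaderRewind`, the `HEADER_LIMIT` value and the choice to look for a `SET ` line are all assumptions made for the example. The header bytes are decoded as ISO-8859-1, where every byte maps to a character, so a buffer limit that cuts a multi-byte UTF-8 sequence in half cannot cause a decoding error.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Lucene: peek at the start of a stream, then rewind it.
final class HeaderRewind {
    // Assumed upper bound on how far into the file the directives may occur.
    static final int HEADER_LIMIT = 8192;

    static Charset sniffCharset(BufferedInputStream in) throws IOException {
        in.mark(HEADER_LIMIT); // remember the zeroth byte
        byte[] head = new byte[HEADER_LIMIT];
        int n = 0;
        while (n < head.length) { // read at most HEADER_LIMIT bytes
            int r = in.read(head, n, head.length - n);
            if (r < 0) {
                break; // end of stream reached before the limit
            }
            n += r;
        }
        in.reset(); // rewind so the real parser starts from byte zero

        // ISO-8859-1 decoding never fails, even if the limit splits a UTF-8 sequence.
        String text = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        for (String line : text.split("\\R")) {
            if (line.startsWith("SET ")) { // assumed directive of interest
                return Charset.forName(line.substring(4).trim());
            }
        }
        return StandardCharsets.ISO_8859_1; // assumed fallback when nothing is declared
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "SET UTF-8\nFLAG long\n".getBytes(StandardCharsets.UTF_8);
        BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        System.out.println(sniffCharset(in));
        System.out.println((char) in.read()); // the stream is back at the first byte
    }
}
```

Since no more than `HEADER_LIMIT` bytes are read after `mark`, the mark stays valid and `reset` succeeds regardless of the total file size.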