Stemming and accents

2017-02-10 Thread Dominique Bejean
Hi,

Is the SnowballPorterFilter sensitive to accents, for French for
instance?

If I use both SnowballPorterFilter and ASCIIFoldingFilter, do I have to
configure ASCIIFoldingFilter after SnowballPorterFilter?

Regards.

Dominique
-- 
Dominique Béjean
06 08 46 12 43


Re: Removing duplicate terms from query

2017-02-10 Thread Ere Maijala
Thanks for the insight. You're right, of course, regarding the score 
calculation. I'll think about it. There are certain cases where the 
search is human-obviously bad and could be cleaned up, but it's not too 
easy to write rules for that.


--Ere

On 9.2.2017 at 18.37, Walter Underwood wrote:

1. I don’t think this is a good idea. It means that a search for “hey hey hey” 
won’t score that document higher.

2. Maybe you want to change how tf is calculated. Ignore multiple occurrences 
of a word.

I ran into this with the movie title “New York, New York” at Netflix. It isn’t 
twice as much about New York, but it needs to be the best match for the query 
“new york new york”.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



On Feb 9, 2017, at 5:18 AM, Ere Maijala  wrote:

Thanks Emir.

I was thinking of something very simple like doing what 
RemoveDuplicatesTokenFilter does but ignoring positions. It would of course 
still be possible to have the same term multiple times, but at least the 
adjacent ones could be deduplicated. The reason I'm not too eager to do it in a 
query preprocessor is that I'd have to essentially duplicate functionality of 
the query analysis chain that contains ICUTokenizerFactory, 
WordDelimiterFilterFactory and whatnot.
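
For context, a rough sketch of the kind of query analysis chain in question
(field type name, filter order and attributes are illustrative, not the exact
schema in use):

<fieldType name="text_dedup" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="query">
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- only removes duplicates that share the same position, so "term term term" passes through unchanged -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>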

Regards,
Ere

On 9.2.2017 at 14.52, Emir Arnautovic wrote:

Hi Ere,

I don't think that there is such a filter. Implementing such a filter would
require looking backward, which violates the streaming approach of token
filters and leads to unpredictable memory usage.

I would do it as part of query preprocessor and not necessarily as part
of Solr.

HTH,
Emir


On 09.02.2017 12:24, Ere Maijala wrote:

Hi,

I just noticed that while we use RemoveDuplicatesTokenFilter during
query time, it will consider term positions and not really do anything
e.g. if query is 'term term term'. As far as I can see the term
positions make no difference in a simple non-phrase search. Is there a
built-in way to deal with this? I know I can write a filter to do
this, but I feel like this would be something quite basic to do for
the query. And I don't think it's even anything too weird for normal
users to do. Just consider e.g. searching for music by title:

Hey, hey, hey ; Shivers of pleasure

I also verified that, at least according to debugQuery=true and
anecdotal evidence, the search really slows down if you repeat the same
term enough.

--Ere




--
Ere Maijala
Kansalliskirjasto / The National Library of Finland





--
Ere Maijala
Kansalliskirjasto / The National Library of Finland


OCR image contains cyrillic characters

2017-02-10 Thread Игорь Абрашин
Hello, community!
Have you managed to recognize jpf, tiff, or other images with Cyrillic text inside?
At the moment I get only Latin letters (it looks like ugly transliterated text) in
the result. For images containing only Latin letters it works fine.
Does anyone have any suggestions, best practices, or case studies related to
this situation?


Re: OCR image contains cyrillic characters

2017-02-10 Thread Rick Leir
My guess is that you are using Tika and Tesseract. The latter is
complex, and you can start learning at


https://wiki.apache.org/tika/TikaOCR   <--shows you how to work with TIFF

The traineddata for Cyrillic is here:

https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

https://github.com/tesseract-ocr/tesseract/issues/147

You likely need to enhance the images before running Tesseract.

cheers -- Rick

On 2017-02-10 05:03 AM, Игорь Абрашин wrote:

Hello, community!
Did you manage to recognize jpf,tiff or whatever with cyrillics text inside?
Ive got only latin letter (looks like ugly translite text) in result for
that moment.For image contains only lattin letters it works fine.
Does anyone have any suggestion, best practice or case studies refer to
this situation?





Re: Problem with cyrillics letters through Tika OCR indexing

2017-02-10 Thread Alexandre Rafalovitch
At what level exactly is this a problem? Are you looking for a way for Solr
to pass the -l rus flag to Tika?

Or are you saying that whatever OCR is used here is bad? In the second
case, this is probably not a question for Solr or even Tika, but for
whatever the underlying OCR library is.

The stack is deep here; more precision is required.
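
If it is the first case, one possible direction is a custom Tika config - a
rough sketch only, assuming a Tika version recent enough to honor per-parser
params in a tika-config.xml (and that the ExtractingRequestHandler is pointed
at it via its tika.config parameter); the exact parameter support in the Tika
bundled with your Solr would need verifying:

<?xml version="1.0" encoding="UTF-8"?>
<!-- hypothetical tika-config.xml: ask the Tesseract parser to OCR Russian -->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <param name="language" type="string">rus</param>
      </params>
    </parser>
  </parsers>
</properties>

Testing the same file with the standalone tika-app and this config first would
show whether the problem is in Tika or in the Solr integration.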

Good luck,
Alex

On 10 Feb 2017 2:52 AM, "Абрашин, Игорь Олегович" 
wrote:

Hello everyone, I've encountered the error mentioned in the title.

The original image is attached and the recognized text is below:
3ApaBCTyI7ITe 9| )KVIBy xopomo



Has anyone faced something similar?
I need to mention that tesseract recognizes it more correctly with the -l rus
option.

Thanks in advance!





*Best regards,*

*Игорь Абрашин*

*ООО «НОВАТЭК НТЦ»*

*office tel.: +7 (3452) 680-386 <+7%20345%20268-03-86>*

*internal corporate ext.: 22-586*

[image: 121]


Re: how to get modified field data if it doesn't exist in meta

2017-02-10 Thread Alexandre Rafalovitch
A custom update request processor that looks up the file by name and gets
the date should work.

Regards,
Alex

On 10 Feb 2017 2:39 AM, "Gytis Mikuciunas"  wrote:

Hi,

We have started to use Solr for indexing our documents (vsd, vsdx,
xls, xlsx, doc, docx, pdf, txt).

A modified-date value is needed for each file. MS Office files and PDFs have
this value.
The problem is with txt files, as they don't have it in their metadata.

Is there any possibility of getting it from the OS level and forcing it to be
added to Solr when we do the indexing?

p.s.

Windows 2012 server, single instance

typical command we use: java -Dauto -Dc=index_sandbox -Dport=80
-Dfiletypes=vsd,vsdx,xls,xlsx,doc,docx,pdf,txt -Dbasicauth=admin: -jar
example/exampledocs/post.jar "M:\DNS_dump"


Regards,

Gytis


Re: Interval Facets with JSON

2017-02-10 Thread Tom Evans
On Wed, Feb 8, 2017 at 11:26 PM, deniz  wrote:
> Tom Evans-2 wrote
>> I don't think there is such a thing as an interval JSON facet.
>> Whereabouts in the documentation are you seeing an "interval" as JSON
>> facet type?
>>
>>
>> You want a range facet surely?
>>
>> One thing with range facets is that the gap is fixed size. You can
>> actually do your example however:
>>
>> json.facet={hieght_facet:{type:range, gap:20, start:160, end:190,
>> hardend:True, field:height}}
>>
>> If you do require arbitrary bucket sizes, you will need to do it by
>> specifying query facets instead, I believe.
>>
>> Cheers
>>
>> Tom
>
>
> nothing other than
> https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-IntervalFaceting
> for documentation on intervals...  i am ok with range queries as well but
> intervals would fit better because of different sizes...

That documentation is not for JSON facets though. You can't pick and
choose features from the old facet system and use them in JSON facets
unless they are mentioned in the JSON facet documentation:

https://cwiki.apache.org/confluence/display/solr/JSON+Request+API

and (not official documentation)

http://yonik.com/json-facet-api/

Cheers

Tom


Re: Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

2017-02-10 Thread Tom Evans
Hi Mike

Looks like you are trying to get a list of the distinct item ids in a
result set, ordered by the most frequent item ids?

Can you use collapsing qparser for this instead? Should be much quicker.

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results

Every document with the same item_id would need to be on the same
shard for this to work, and I'm not sure whether you can actually get the
count of collapsed documents, if that is necessary for you.


Another option might be to use hyperloglog function - hll() - instead
of unique(), which should give slightly better performance.

Cheers

Tom

On Thu, Feb 9, 2017 at 11:58 AM, Bryant, Michael
 wrote:
> Hi all,
>
> I'm converting my legacy facets to JSON facets and am seeing much better 
> performance, especially with high cardinality facet fields. However, the one 
> issue I can't seem to resolve is excessive memory usage (and OOM errors) when 
> trying to simulate the effect of "group.facet" to sort facets according to a 
> grouping field.
>
> My situation, slightly simplified is:
>
> Solr 4.6.1
>
>   *   Doc set: ~200,000 docs
>   *   Grouping by item_id, an indexed, stored, single value string field with 
> ~50,000 unique values, ~4 docs per item
>   *   Faceting by person_id, an indexed, stored, multi-value string field 
> with ~50,000 values (w/ a very skewed distribution)
>   *   No docValues fields
>
> Each document here is a description of an item, and there are several 
> descriptions per item in multiple languages.
>
> With legacy facets I use group.field=item_id and group.facet=true, which 
> gives me facet counts with the number of items rather than descriptions, and 
> correctly sorted by descending item count.
>
> With JSON facets I'm doing the equivalent like so:
>
> &json.facet={
> "people": {
> "type": "terms",
> "field": "person_id",
> "facet": {
> "grouped_count": "unique(item_id)"
> },
> "sort": "grouped_count desc"
> }
> }
>
> This works, and is somewhat faster than legacy faceting, but it also produces 
> a massive spike in memory usage when (and only when) the sort parameter is 
> set to the aggregate field. A server that runs happily with a 512MB heap OOMs 
> unless I give it a 4GB heap. With sort set to (the default) "count desc" 
> there is no memory usage spike.
>
> I would be curious if anyone has experienced this kind of memory usage when 
> sorting JSON facets by stats and if there’s anything I can do to mitigate it. 
> I’ve tried reindexing with docValues enabled on the relevant fields and it 
> seems to make no difference in this respect.
>
> Many thanks,
> ~Mike


Java version set to 1.8 for SOLR 6.4.0

2017-02-10 Thread Uchit Patel
I have installed SOLR 6.4.0 on a Linux box. I have both Java 1.7.0 and 1.8.0
on the box. By default it points to 1.7.0, and some other applications use
Java 1.7.0. I want to set Java 1.8.0 only for SOLR 6.4.0. What do I need to
update so that only SOLR 6.4.0 uses Java 1.8.0? I don't want to remove Java
1.7.0 because some other applications use it.
Thanks.
Regards,
Uchit Patel


Sent from Yahoo Mail for iPhone


On Tuesday, February 7, 2017, 7:14 PM, Uchit Patel 
 wrote:

Hi,
I want detailed step-by-step guidance for a SOLR version upgrade from 5.1.0 to
6.4.0, for both Windows and Linux.
Thanks.
Regards,
Uchit Patel
Sr. GIS Analyst
Waste Management Inc.
 



Re: Java version set to 1.8 for SOLR 6.4.0

2017-02-10 Thread Alexandre Rafalovitch
That's really an operating-system question. Whatever invokes your
Solr instance (system script, command line, etc.) should have its path
and/or JAVA_HOME pointing to Java 8.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 10 February 2017 at 09:53, Uchit Patel  wrote:
> I have installed SOLR 6.4.0 on a Linux box. I have both Java 1.7.0 and 1.8.0
> on the box. By default it points to 1.7.0, and some other applications use
> Java 1.7.0. I want to set Java 1.8.0 only for SOLR 6.4.0. What do I need to
> update so that only SOLR 6.4.0 uses Java 1.8.0? I don't want to remove Java
> 1.7.0 because some other applications use it.
> Thanks.
> Regards,
> Uchit Patel
>
>
> Sent from Yahoo Mail for iPhone
>
>
> On Tuesday, February 7, 2017, 7:14 PM, Uchit Patel 
>  wrote:
>
> Hi,
> I want detailed step by step guidance for SOLR version upgrade from 5.1.0 to 
> 6.4.0 for Windows and Linux both.
> Thanks.
> Regards,
> Uchit Patel
> Sr. GIS Analyst
> Waste Management Inc.
>
>


Re: Solr Heap Dump: Any suggestions on what to look for?

2017-02-10 Thread Kelly, Frank
Thanks Shawn,

Yeah, I think we have identified the root cause thanks to some of the
suggestions here.

Originally we stopped using deleteByQuery as we saw it caused some large
CPU spikes (see https://issues.apache.org/jira/browse/LUCENE-7049) and
Solr pauses, and switched to using a search and then deleteById. It worked
fine on our (small) test collections.

But with 200M documents it appears that deleteById causes the heap to
increase dramatically (we guess fieldCache gets populated with a large
number of object ids?)
To confirm our suspicion we put “docValues”=“true” on the schema and began
to reindex, and the heap memory usage dropped significantly - in fact heap
memory usage on the Solr VMs dropped by half.

Can someone confirm (or deny) our suspicion that deleteById results in
some on-heap caching of the unique key (id?)?


Cheers!

-Frank

P.s. Interestingly, when I searched the wiki for docs on deleteById I did not
find any:
https://cwiki.apache.org/confluence/dosearchsite.action?where=solr&spaceSearch=true&queryString=deleteById


P.p.s. Separately, we are also turning off the filterCache. We know from
usage and plugin stats that it is not in use, but it's best to turn it off
entirely for risk reduction.

 
Frank Kelly
Principal Software Engineer
 
HERE 
5 Wayside Rd, Burlington, MA 01803, USA
42° 29' 7" N 71° 11' 32" W
 
  






On 2/9/17, 11:00 AM, "Shawn Heisey"  wrote:

>On 2/9/2017 6:19 AM, Kelly, Frank wrote:
>> Got a heap dump on an Out of Memory error.
>> Analyzing the dump now in Visual VM
>>
>> Seeing a lot of byte[] arrays (77% of our 8GB Heap) in
>>
>>   * TreeMap$Entry
>>   * FieldCacheImpl$SortedDocValues
>>
>> We’re considering switch over to DocValues but would rather be
>> definitive about the root cause before we experiment with DocValues
>> and require a reindex of our 200M document index
>> In each of our 4 data centers.
>>
>> Any suggestions on what I should look for in this heap dump to get a
>> definitive root cause?
>>
>
>Analyzing the cause of large memory allocations when the large
>allocations are byte[] arrays might mean that it's a low-level class,
>probably in Lucene.  Solr will likely have almost no influence on these
>memory allocations, except by changing the schema to enable docValues,
>which changes the particular Lucene code that is called.  Note that
>wiping the index and rebuilding it from scratch is necessary when you
>enable docValues.
>
>Another possible source of problems like this is the filterCache.  A 200
>million document index (assuming it's all on the same machine) results
>in filterCache entries that are 25 million bytes each.  In Solr
>examples, the filterCache defaults to a size of 512.  If a cache that
>size on a 200 million document index fills up, it will require nearly 13
>gigabytes of heap memory.
>
>Thanks,
>Shawn
>



Re: Solr Heap Dump: Any suggestions on what to look for?

2017-02-10 Thread Kelly, Frank
To clarify:


"we put “docValues”=“true” on the schema" should have said
"we put “docValues”=“true” on the id field only"

-Frank
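
In schema terms that change is just one attribute on the uniqueKey field - a
sketch, with the field name and type assumed:

<field name="id" type="string" indexed="true" stored="true" required="true" docValues="true"/>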

On 2/10/17, 10:27 AM, "Kelly, Frank"  wrote:

>Thanks Shawn,
>
>Yeah think we have identified root cause thanks to some of the suggestions
>here.
>
>Originally we stopped using deleteByQuery as we saw it caused some large
>CPU spikes (see https://issues.apache.org/jira/browse/LUCENE-7049) and
>Solr pauses
>And switched to using a search and then deleteById. It worked fine on our
>(small) test collections.
>
>But with 200M documents it appears that deleteById causes the heap to
>increase dramatically (we guess fieldCache gets populated with a large
>number of object ids?)
>To confirm our suspicion we put “docValues”=“true” on the schema and began
>to reindex and the heap memory usage dropped significantly - in fact heap
>memory usage on the Solr VMs dropped by a half.
>
>Can someone confirm (or deny) our suspicion that deleteById results in
>some on-heap caching of the unique key (id?)?
>
>
>Cheers!
>
>-Frank
>
>P.s. Interesting when I searched the Wiki for docs on deleteById I did not
>find any
>https://cwiki.apache.org/confluence/dosearchsite.action?where=solr&spaceSearch=true&queryString=deleteById
>
>
>P.p.s Separately we are also turning off FilterCache but we know from
>usage and plugin stats that it is not in use but best to turn it off
>entirely for risk reduction
>
> 
>Frank Kelly
>Principal Software Engineer
> 
>HERE 
>5 Wayside Rd, Burlington, MA 01803, USA
>42° 29' 7" N 71° 11' 32" W
> 
> 
>e.com%2F&data=01%7C01%7C%7Cd9606e62fa5a421a08d008d451c95f04%7C6d4034cd7225
>4f72b85391feaea64919%7C1&sdata=R%2BAbWMlSJ%2FRN0oAF3smwJawoQGr4U4%2BFdKCxy
>XWLXIg%3D&reserved=0>
>itter.com%2Fhere&data=01%7C01%7C%7Cd9606e62fa5a421a08d008d451c95f04%7C6d40
>34cd72254f72b85391feaea64919%7C1&sdata=qnVxW4o1CDcnjOiKdqjhCddGHUqbVlZuvxp
>zMxRme0s%3D&reserved=0>
>cebook.com%2Fhere&data=01%7C01%7C%7Cd9606e62fa5a421a08d008d451c95f04%7C6d4
>034cd72254f72b85391feaea64919%7C1&sdata=YaluC4BvPWpKhe5HQ8aaJqy7eW4SIOEdls
>8tNp63xV0%3D&reserved=0>
>nkedin.com%2Fcompany%2Fheremaps&data=01%7C01%7C%7Cd9606e62fa5a421a08d008d4
>51c95f04%7C6d4034cd72254f72b85391feaea64919%7C1&sdata=jLfR0kUX4yDZ29FeJEN5
>2jRUxYAPOEaXqoq3L67xSBk%3D&reserved=0>
>stagram.com%2Fhere%2F&data=01%7C01%7C%7Cd9606e62fa5a421a08d008d451c95f04%7
>C6d4034cd72254f72b85391feaea64919%7C1&sdata=xKrwI%2BcUq0sSNf%2FUUdiz9GA%2B
>ckjttBO61qCk1%2BwlsTk%3D&reserved=0>
>
>
>
>On 2/9/17, 11:00 AM, "Shawn Heisey"  wrote:
>
>>On 2/9/2017 6:19 AM, Kelly, Frank wrote:
>>> Got a heap dump on an Out of Memory error.
>>> Analyzing the dump now in Visual VM
>>>
>>> Seeing a lot of byte[] arrays (77% of our 8GB Heap) in
>>>
>>>   * TreeMap$Entry
>>>   * FieldCacheImpl$SortedDocValues
>>>
>>> We’re considering switch over to DocValues but would rather be
>>> definitive about the root cause before we experiment with DocValues
>>> and require a reindex of our 200M document index
>>> In each of our 4 data centers.
>>>
>>> Any suggestions on what I should look for in this heap dump to get a
>>> definitive root cause?
>>>
>>
>>Analyzing the cause of large memory allocations when the large
>>allocations are byte[] arrays might mean that it's a low-level class,
>>probably in Lucene.  Solr will likely have almost no influence on these
>>memory allocations, except by changing the schema to enable docValues,
>>which changes the particular Lucene code that is called.  Note that
>>wiping the index and rebuilding it from scratch is necessary when you
>>enable docValues.
>>
>>Another possible source of problems like this is the filterCache.  A 200
>>million document index (assuming it's all on the same machine) results
>>in filterCache entries that are 25 million bytes each.  In Solr
>>examples, the filterCache defaults to a size of 512.  If a cache that
>>size on a 200 million document index fills up, it will require nearly 13
>>gigabytes of heap memory.
>>
>>Thanks,
>>Shawn
>>
>



Re: how to get modified field data if it doesn't exist in meta

2017-02-10 Thread Erick Erickson
Would TimestampUpdateProcessorFactory work?
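
A sketch of what that chain could look like in solrconfig.xml, assuming a
schema field named "file_modified" (the field name is illustrative). Note that
this factory stamps the time the document is indexed rather than the file's
mtime, so it only approximates the OS-level modified date:

<updateRequestProcessorChain name="add-index-time" default="true">
  <processor class="solr.TimestampUpdateProcessorFactory">
    <!-- fills the field only when the incoming document has no value for it -->
    <str name="fieldName">file_modified</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>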

Best,
Erick

On Fri, Feb 10, 2017 at 4:59 AM, Alexandre Rafalovitch
 wrote:
> Custom update request processor that looks up a file from the name and gets
> the date should work.
>
> Regards,
> Alex
>
> On 10 Feb 2017 2:39 AM, "Gytis Mikuciunas"  wrote:
>
> Hi,
>
> We have started to use solr for our documents indexing (vsd, vsdx,
> xls,xlsx, doc, docx, pdf, txt).
>
> Modified date values is needed for each file. MS Office's files, pdfs have
> this value.
> Problem is with txt files as they don't have this value in their meta.
>
> Is there any possibility to get it somehow from os level and force adding
> it to solr when we do indexing.
>
> p.s.
>
> Windows 2012 server, single instance
>
> typical command we use: java -Dauto -Dc=index_sandbox -Dport=80
> -Dfiletypes=vsd,vsdx,xls,xlsx,doc,docx,pdf,txt -Dbasicauth=admin: -jar
> example/exampledocs/post.jar "M:\DNS_dump"
>
>
> Regards,
>
> Gytis


Re: Stemming and accents

2017-02-10 Thread Erick Erickson
The easiest way to answer that is to define two different fieldTypes,
one with Snowball first and one with ASCIIFolding first, fire up the
admin/analysis page and give it some input. That'll show you _exactly_
what transformations take place at each step.
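
For illustration, a minimal pair of fieldTypes for that comparison - the type
names, tokenizer and language attribute are placeholders, not from this thread:

<!-- stem first, fold accents second -->
<fieldType name="text_fr_stem_then_fold" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>

<!-- fold accents first, stem second -->
<fieldType name="text_fr_fold_then_stem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>

Feeding accented input such as "général" to both types on the analysis screen
makes any difference in stemming immediately visible.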

Best,
Erick

On Fri, Feb 10, 2017 at 12:26 AM, Dominique Bejean
 wrote:
> Hi,
>
> Is the SnowballPorterFilter sensitive to the accents for French for
> instance ?
>
> If I use both SnowballPorterFilter and ASCIIFoldingFilter, do I have to
> configure ASCIIFoldingFilter after SnowballPorterFilter  ?
>
> Regards.
>
> Dominique
> --
> Dominique Béjean
> 06 08 46 12 43


Copying SolrCloud collections (Replication? Backup/Restore?)

2017-02-10 Thread Kelly, Frank
Hello,

We have 100M+ documents across 2 collections and need to reindex the
entirety of the collections, as we need to turn on “docValues”:true on a number
of fields (see previous emails from this week :-] ).
Unfortunately we have 4 AWS regions, each with its own SolrCloud cluster and
its own copy of the entire search index.
So we have to do this reindex 4 times, and in each case we have to take down
the region as we need to delete the collection. And reindexing takes about 2-3
days.

Is there some way we can reindex in one (offline) region and then use some
mechanism - replication? Backup/restore? EBS snapshot? - to “copy and paste” a
known Solr state from one SolrCloud instance to another?
From that state we’d then just reindex the delta (from when the snapshot was
taken to now).

Appreciate any thoughts or ideas or hear how other folks do it,

Thanks!

-Frank




Frank Kelly

Principal Software Engineer



HERE

5 Wayside Rd, Burlington, MA 01803, USA

42° 29' 7" N 71° 11' 32" W



Re: Copying SolrCloud collections (Replication? Backup/Restore?)

2017-02-10 Thread Erick Erickson
First, perhaps the slickest way to reindex without as much downtime would
be to just index to a _new_ collection. Then use "collection aliasing" to
point incoming requests for the old collection at the new one. True, you do
need extra hardware...

But that aside, Solr (well Lucene really) indexes are just files. There's a
collection-wide backup restore but check the PDF for your Solr version to
see if it's available to you.

Beyond that, just copy things around. So here's a process, modify as you
see fit:
1> index to your new collection in region 1
2> in region 2, create a new collection with the same number of shards (no
followers, leader-only).
3> with the Solr instances in region 2 down, copy the data dir from your
servers in region 1 to the corresponding data dir on your severs in region
2. It is _very_ important that the hash ranges match. If you look at your
state.json you'll see an entry for each shard like "hash_range
0x800-0x. The hash range on the source must match exactly the
hash range on dest in region 2. Double check this as you basically copy
from collection_shard1_replica1...data(on region 1)/data to
collection_shard1_replica1...data on region 2.
4> Once this is done for all shards, bring up Solr on region 2 and verify
it's as you expect.
5> Use the Collections API to ADDREPLICA in region 2 to build out your
collection. the ADDREPLICA will automatically copy the index from the
leader.

Best,
Erick

On Fri, Feb 10, 2017 at 10:12 AM, Kelly, Frank  wrote:

> Hello,
>
>   We have a 100M+ documents across 2 collections and need to reindex the
> entirety of the Collections as we need to turn on “docValues”:true on a
> number of fields (see previous emails from this week :-] ).
> Unfortunately we have 4 AWS regions each with their own SolrCloud cluster
> each with its own copy of the entire search index.
> So we have to do this reindex 4 times and in each case we have to take
> down each region as we need to delete the collection. And reindexing takes
> about 2-3 days.
>
> Is there someway we can reindex in one (offline) region and then use some
> mechanism - replication? Backup/restore? EBS snapshot? to “copy and paste”
> a known Solr state from one SolrCloud instance to another.
> From that state then we’d just reindex the delta (from when the snapshot
> was taken to now)
>
> Appreciate any thoughts or ideas or hear how other folks do it,
>
> Thanks!
>
> -Frank
>
>
>
>
> *Frank Kelly*
>
> *Principal Software Engineer*
>
>
>
> HERE
>
> 5 Wayside Rd, Burlington, MA 01803, USA
>
> *42° 29' 7" N 71° 11' 32" W*
>
>
> 
>


Field collapsing, facets, and qtime: caching issue?

2017-02-10 Thread Ronald K. Braun
I'm experimenting with field collapsing in solrcloud 6.2.1 and have this
set of request parameters against a collection:

/default?indent=on&q=*:*&wt=json&fq={!collapse+field=groupid}

My default handler is just defaults:

<requestHandler name="/default" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>

The first query runs about 600ms, then subsequent repeats of the same query
are 0-5ms for qTime, which I interpret to mean that the query is cached
after the first hit.  All as expected.

However, if I enable facets without actually requesting a facet:

/default?indent=on&q=*:*&wt=json&fq={!collapse+field=groupid}&facet=true

then every submission of the query runs at ~600ms.  I interpret this to
mean that caching is somehow defeated when facet processing is set.  Facets
are empty as expected:

facet_counts": {
  "facet_queries": { },
  "facet_fields": { },
  "facet_ranges": { },
  "facet_intervals": { },
  "facet_heatmaps": { }
}

If I remove the collapse directive

/default?indent=on&q=*:*&wt=json&facet=true

qTimes are back down to 0 after the initial query whether or not faceting
is requested.

Is this expected behaviour or am I missing some supporting configuration
for proper field collapsing?

Thanks!

Ron


Re: Copying SolrCloud collections (Replication? Backup/Restore?)

2017-02-10 Thread Kelly, Frank
Thanks Erick for that idea and the fast response


Cheers!

F

On 2/10/17, 1:24 PM, "Erick Erickson"  wrote:

>First, perhaps the slickest way to reindex without as much downtime would
>be to just index to a _new_ collection. Then use "collection aliasing" to
>point incoming requests to the old collection to the new one. True, you do
>need extra hardware
>
>But that aside, Solr (well Lucene really) indexes are just files. There's
>a
>collection-wide backup restore but check the PDF for your Solr version to
>see if it's available to you.
>
>Beyond that, just copy things around. So here's a process, modify as you
>see fit:
>1> index to your new collection in region 1
>2> in region 2, create a new collection with the same number of shards (no
>followers, leader-only).
>3> with the Solr instances in region 2 down, copy the data dir from your
>servers in region 1 to the corresponding data dir on your severs in region
>2. It is _very_ important that the hash ranges match. If you look at your
>state.json you'll see an entry for each shard like "hash_range
>0x800-0x. The hash range on the source must match exactly the
>hash range on dest in region 2. Double check this as you basically copy
>from collection_shard1_replica1...data(on region 1)/data to
>collection_shard1_replica1...data on region 2.
>4> Once this is done for all shards, bring up Solr on region 2 and verify
>it's as you expect.
>5> Use the Collections API to ADDREPLICA in region 2 to build out your
>collection. the ADDREPLICA will automatically copy the index from the
>leader.
>
>Best,
>Erick
>
>On Fri, Feb 10, 2017 at 10:12 AM, Kelly, Frank 
>wrote:
>
>> Hello,
>>
>>   We have a 100M+ documents across 2 collections and need to reindex the
>> entirety of the Collections as we need to turn on “docValues”:true on a
>> number of fields (see previous emails from this week :-] ).
>> Unfortunately we have 4 AWS regions each with their own SolrCloud
>>cluster
>> each with its own copy of the entire search index.
>> So we have to do this reindex 4 times and in each case we have to take
>> down each region as we need to delete the collection. And reindexing
>>takes
>> about 2-3 days.
>>
>> Is there someway we can reindex in one (offline) region and then use
>>some
>> mechanism - replication? Backup/restore? EBS snapshot? to “copy and paste”
>> a known Solr state from one SolrCloud instance to another.
>> From that state then we’d just reindex the delta (from when the snapshot
>> was taken to now)
>>
>> Appreciate any thoughts or ideas or hear how other folks do it,
>>
>> Thanks!
>>
>> -Frank
>>
>>
>>
>>
>> *Frank Kelly*
>>
>> *Principal Software Engineer*
>>
>>
>>
>> HERE
>>
>> 5 Wayside Rd, Burlington, MA 01803, USA
>>
>> *42° 29' 7" N 71° 11' 32" W*
>>
>>

Re: Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

2017-02-10 Thread Bryant, Michael
Hi Tom,

Well the collapsing query parser is… a much better solution to my problems!  
Thanks for cluing me in to this, I love it when you can delete a load of hacks 
for something both simpler and faster.

Best,
~Mike


--
Mike Bryant

Research Associate
Department of Digital Humanities
King’s College London

On 10 Feb 2017, at 14:37, Tom Evans 
mailto:tevans...@googlemail.com>> wrote:

Hi Mike

Looks like you are trying to get a list of the distinct item ids in a
result set, ordered by the most frequent item ids?

Can you use collapsing qparser for this instead? Should be much quicker.

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results

Every document with the same item_id would need to be on the same
shard for this to work, and I'm not sure you can actually get the
count of collapsed documents or not, if that is necessary for you.


Another option might be to use hyperloglog function - hll() - instead
of unique(), which should give slightly better performance.

Cheers

Tom

On Thu, Feb 9, 2017 at 11:58 AM, Bryant, Michael
 wrote:
Hi all,

I'm converting my legacy facets to JSON facets and am seeing much better 
performance, especially with high cardinality facet fields. However, the one 
issue I can't seem to resolve is excessive memory usage (and OOM errors) when 
trying to simulate the effect of "group.facet" to sort facets according to a 
grouping field.

My situation, slightly simplified is:

Solr 4.6.1

 *   Doc set: ~200,000 docs
 *   Grouping by item_id, an indexed, stored, single value string field with 
~50,000 unique values, ~4 docs per item
 *   Faceting by person_id, an indexed, stored, multi-value string field with 
~50,000 values (w/ a very skewed distribution)
 *   No docValues fields

Each document here is a description of an item, and there are several 
descriptions per item in multiple languages.

With legacy facets I use group.field=item_id and group.facet=true, which gives 
me facet counts with the number of items rather than descriptions, and 
correctly sorted by descending item count.

With JSON facets I'm doing the equivalent like so:

&json.facet={
   "people": {
   "type": "terms",
   "field": "person_id",
   "facet": {
   "grouped_count": "unique(item_id)"
   },
   "sort": "grouped_count desc"
   }
}

This works, and is somewhat faster than legacy faceting, but it also produces a 
massive spike in memory usage when (and only when) the sort parameter is set to 
the aggregate field. A server that runs happily with a 512MB heap OOMs unless I 
give it a 4GB heap. With sort set to (the default) "count desc" there is no 
memory usage spike.

I would be curious if anyone has experienced this kind of memory usage when 
sorting JSON facets by stats and if there’s anything I can do to mitigate it. 
I’ve tried reindexing with docValues enabled on the relevant fields and it 
seems to make no difference in this respect.

Many thanks,
~Mike



Re: Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

2017-02-10 Thread Bryant, Michael
Darn, spoke too soon. Field collapsing throws off my facet counts where facet 
fields differ within groups.

Back to the drawing board. FWIW, I tried hyperloglog for JSON facet aggregate 
counts and it has the same issue as unique() when used as the facet sort 
parameter - while reasonably fast it uses masses of memory.

Cheers,
~Mike

--
Mike Bryant

Research Associate
Department of Digital Humanities
King’s College London

On 10 Feb 2017, at 18:53, Bryant, Michael 
mailto:michael.bry...@kcl.ac.uk>> wrote:

Hi Tom,

Well the collapsing query parser is… a much better solution to my problems!  
Thanks for cluing me in to this, I love it when you can delete a load of hacks 
for something both simpler and faster.

Best,
~Mike


--
Mike Bryant

Research Associate
Department of Digital Humanities
King’s College London

On 10 Feb 2017, at 14:37, Tom Evans 
mailto:tevans...@googlemail.com>>
 wrote:

Hi Mike

Looks like you are trying to get a list of the distinct item ids in a
result set, ordered by the most frequent item ids?

Can you use collapsing qparser for this instead? Should be much quicker.

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results

Every document with the same item_id would need to be on the same
shard for this to work, and I'm not sure you can actually get the
count of collapsed documents or not, if that is necessary for you.


Another option might be to use hyperloglog function - hll() - instead
of unique(), which should give slightly better performance.

Cheers

Tom

On Thu, Feb 9, 2017 at 11:58 AM, Bryant, Michael
 wrote:
Hi all,

I'm converting my legacy facets to JSON facets and am seeing much better 
performance, especially with high cardinality facet fields. However, the one 
issue I can't seem to resolve is excessive memory usage (and OOM errors) when 
trying to simulate the effect of "group.facet" to sort facets according to a 
grouping field.

My situation, slightly simplified is:

Solr 4.6.1

*   Doc set: ~200,000 docs
*   Grouping by item_id, an indexed, stored, single value string field with 
~50,000 unique values, ~4 docs per item
*   Faceting by person_id, an indexed, stored, multi-value string field with 
~50,000 values (w/ a very skewed distribution)
*   No docValues fields

Each document here is a description of an item, and there are several 
descriptions per item in multiple languages.

With legacy facets I use group.field=item_id and group.facet=true, which gives 
me facet counts with the number of items rather than descriptions, and 
correctly sorted by descending item count.

With JSON facets I'm doing the equivalent like so:

&json.facet={
  "people": {
  "type": "terms",
  "field": "person_id",
  "facet": {
  "grouped_count": "unique(item_id)"
  },
  "sort": "grouped_count desc"
  }
}

This works, and is somewhat faster than legacy faceting, but it also produces a 
massive spike in memory usage when (and only when) the sort parameter is set to 
the aggregate field. A server that runs happily with a 512MB heap OOMs unless I 
give it a 4GB heap. With sort set to (the default) "count desc" there is no 
memory usage spike.

I would be curious if anyone has experienced this kind of memory usage when 
sorting JSON facets by stats and if there’s anything I can do to mitigate it. 
I’ve tried reindexing with docValues enabled on the relevant fields and it 
seems to make no difference in this respect.

Many thanks,
~Mike




Re: Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

2017-02-10 Thread Yonik Seeley
On Thu, Feb 9, 2017 at 6:58 AM, Bryant, Michael
 wrote:
> Hi all,
>
> I'm converting my legacy facets to JSON facets and am seeing much better 
> performance, especially with high cardinality facet fields. However, the one 
> issue I can't seem to resolve is excessive memory usage (and OOM errors) when 
> trying to simulate the effect of "group.facet" to sort facets according to a 
> grouping field.

Yeah, I sort of expected this... but haven't gotten around to
implementing something that takes less memory yet.
If you're faceting on A and sorting by unique(B), then memory use is
O(cardinality(A)*cardinality(B))
We can definitely do a lot better.

-Yonik


Re: Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

2017-02-10 Thread Yonik Seeley
FYI, I just opened https://issues.apache.org/jira/browse/SOLR-10122 for this
-Yonik

On Fri, Feb 10, 2017 at 4:32 PM, Yonik Seeley  wrote:
> On Thu, Feb 9, 2017 at 6:58 AM, Bryant, Michael
>  wrote:
>> Hi all,
>>
>> I'm converting my legacy facets to JSON facets and am seeing much better 
>> performance, especially with high cardinality facet fields. However, the one 
>> issue I can't seem to resolve is excessive memory usage (and OOM errors) 
>> when trying to simulate the effect of "group.facet" to sort facets according 
>> to a grouping field.
>
> Yeah, I sort of expected this... but haven't gotten around to
> implementing something that takes less memory yet.
> If you're faceting on A and sorting by unique(B), then memory use is
> O(cardinality(A)*cardinality(B))
> We can definitely do a lot better.
>
> -Yonik


commongrams

2017-02-10 Thread David Hastings
Hey All,
I followed an old blog post about implementing common grams, and used
the 400 most popular words file on a subset of my data. The original index
size was 33gb with 2.2 million documents; using the 400 words, it grew to 96gb.
I scaled it down to the 100 most common words and got to about 76gb, but
with a cold phrase search going from 4 seconds at 400 words to 6 with 100.
This will not really scale well, as the base index that this is a subset
of right now has 22 million documents and sits around 360gb. At this
rate, it would be around a TB index size. Is there a common
hardware/software configuration to handle TB-size indexes?
thanks,
DH
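
For reference, a rough sketch of the kind of common-grams setup being described
above - the field type name and words file are illustrative, not taken from the
post:

<fieldType name="text_cgrams" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- indexes word-pair shingles for the listed common words in addition to the single terms, which is where the extra index size comes from -->
    <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- prefers the pre-built pairs at query time so phrase queries over common words stay fast -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>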


Re: Stemming and accents

2017-02-10 Thread Ahmet Arslan
Hi,

I have experimented before, and found that Snowball is sensitive to 
accents/diacritics.
Please see for more details: 
http://www.sciencedirect.com/science/article/pii/S0306457315001053

Ahmet



On Friday, February 10, 2017 11:27 AM, Dominique Bejean 
 wrote:
Hi,

Is the SnowballPorterFilter sensitive to the accents for French for
instance ?

If I use both SnowballPorterFilter and ASCIIFoldingFilter, do I have to
configure ASCIIFoldingFilter after SnowballPorterFilter  ?

Regards.

Dominique
-- 
Dominique Béjean
06 08 46 12 43 


Re: Field collapsing, facets, and qtime: caching issue?

2017-02-10 Thread Joel Bernstein
It's been a little while since I looked at this section of the code. But
what I believe is going on is that the queryResultCache has kicked in which
will give you the DocList (the top N docs that match query/filters/sort)
back immediately. But faceting requires a DocSet which is a bitset of all
docs that match the query. The DocSet is not cached in this scenario, so it
needs to be regenerated, which means re-running the query/collapse. So I
believe your instincts are correct. This same issue gets worse if you have
facets that need refinement. In this scenario the DocSet is needed on the
first and second pass and is not cached, so the query/collapse needs to be
run twice for facets.

The fix for this would be to start caching the DocSets needed for faceting.




Joel Bernstein
http://joelsolr.blogspot.com/

On Fri, Feb 10, 2017 at 1:29 PM, Ronald K. Braun  wrote:

> I'm experimenting with field collapsing in solrcloud 6.2.1 and have this
> set of request parameters against a collection:
>
> /default?indent=on&q=*:*&wt=json&fq={!collapse+field=groupid}
>
> My default handler is just defaults:
>
> <requestHandler name="/default" class="solr.SearchHandler">
>   <lst name="defaults">
>     <str name="echoParams">explicit</str>
>   </lst>
> </requestHandler>
>
> The first query runs about 600ms, then subsequent repeats of the same query
> are 0-5ms for qTime, which I interpret to mean that the query is cached
> after the first hit.  All as expected.
>
> However, if I enable facets without actually requesting a facet:
>
> /default?indent=on&q=*:*&wt=json&fq={!collapse+field=groupid}&facet=true
>
> then every submission of the query runs at ~600ms.  I interpret this to
> mean that caching is somehow defeated when facet processing is set.  Facets
> are empty as expected:
>
> facet_counts": {
>   "facet_queries": { },
>   "facet_fields": { },
>   "facet_ranges": { },
>   "facet_intervals": { },
>   "facet_heatmaps": { }
> }
>
> If I remove the collapse directive
>
> /default?indent=on&q=*:*&wt=json&facet=true
>
> qTimes are back down to 0 after the initial query whether or not faceting
> is requested.
>
> Is this expected behaviour or am I missing some supporting configuration
> for proper field collapsing?
>
> Thanks!
>
> Ron
>


Re: Problem with cyrillics letters through Tika OCR indexing

2017-02-10 Thread Игорь Абрашин
It's the same problem for me. So, probably the first case: how to force the
Tika parser to recognize Cyrillic characters as required. For me it tries to
recognize Russian text as an English transliteration, and the result shows the
Russian text using only the Latin alphabet.

On 10 Feb 2017 at 17:55, "Alexandre Rafalovitch" <
arafa...@gmail.com> wrote:

> At what level is this exactly a problem? Are you looking for a way for
> Solr to pass -L rus flag to Tika?
>
> Or you are saying that whatever OCR is used here is bad. In the second
> case, this is probably not a question for Solr or even Tika but for
> whatever underlying OCR library is.
>
> The stack is deep here, more precision is required.
>
> Удачи,
> Alex
>
> On 10 Feb 2017 2:52 AM, "Абрашин, Игорь Олегович" <
> igor.abras...@novatek.ru> wrote:
>
> Hello, everyone I’m encountered the error mentioned at the title?
>
> The original image attached and recognized text below:
> 3ApaBCTyI7ITe 9| )KVIBy xopomo
>
>
>
> Does anyone faced the similar?
> Need to mentioned that tesseract recognize it more correctly with –l rus
> option.
>
> Thanks in advance!
>
>
>
>
>
> *С уважением, *
>
> *Игорь Абрашин*
>
> *ООО «НОВАТЭК НТЦ»*
>
> *тел. раб.: +7 (3452) 680-386 <+7%20345%20268-03-86>*
>
> *тел. внутр. корпор.: 22-586*
>
> [image: 121]
>
>
>
>
>


Re: OCR image contains cyrillic characters

2017-02-10 Thread Игорь Абрашин
Hi, Rick.
I didn't mean that training is needed, because tesseract works well separately.
So, the Tika included in Solr doesn't try to use the Russian dictionary to
recognize Cyrillic text, and the result comes up using only the English
alphabet.

On 10 Feb 2017 at 15:28, "Rick Leir" 
wrote:

> My guess is that you are using using Tika and Tesseract. The latter is
> complex, and you can start learning at
>
> https://wiki.apache.org/tika/TikaOCR   <--shows you how to work with TIFF
>
> The traineddata for Cyrillic is here:
>
> https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
>
> https://github.com/tesseract-ocr/tesseract/issues/147
>
> You likely need to enhance the images before running Tesseract.
>
> cheers -- Rick
>
> On 2017-02-10 05:03 AM, Игорь Абрашин wrote:
>
>> Hello, community!
>> Did you manage to recognize jpf,tiff or whatever with cyrillics text
>> inside?
>> Ive got only latin letter (looks like ugly translite text) in result for
>> that moment.For image contains only lattin letters it works fine.
>> Does anyone have any suggestion, best practice or case studies refer to
>> this situation?
>>
>>
>