Re: OCR image contains cyrillic characters

2017-02-10 Thread Игорь Абрашин
Hi, Rick. I didn't mean that he needs to train it, because Tesseract works well on its own. The Tika bundled with Solr doesn't try to use the Russian dictionary to recognize Cyrillic text, so the result comes back using only the English alphabet. On 10 Feb 2017 at 15:28, "Rick Leir" wrote: > My guess is that you

Re: Problem with cyrillics letters through Tika OCR indexing

2017-02-10 Thread Игорь Абрашин
I have the same problem. So it is probably the first case: how to force the Tika parser to recognize Cyrillic characters as required. For me it tries to recognize Russian text as English transliteration, so the Russian text shows up in the result using only the Latin alphabet. On 10 Feb 2017 at 17:55, "Alexandre Rafalovitch"

Re: Field collapsing, facets, and qtime: caching issue?

2017-02-10 Thread Joel Bernstein
It's been a little while since I looked at this section of the code. But what I believe is going on is that the queryResultCache has kicked in which will give you the DocList (the top N docs that match query/filters/sort) back immediately. But faceting requires a DocSet which is a bitset of all doc

Re: Stemming and accents

2017-02-10 Thread Ahmet Arslan
Hi, I have experimented before, and found that Snowball is sensitive to accents/diacritics. Please see for more details: http://www.sciencedirect.com/science/article/pii/S0306457315001053 Ahmet On Friday, February 10, 2017 11:27 AM, Dominique Bejean wrote: Hi, Is the SnowballPorterFilter

commongrams

2017-02-10 Thread David Hastings
Hey All, I followed an old blog post about implementing the common grams, and used the 400 most popular words file on a subset of my data. The original index size was 33gb with 2.2 million documents; using the 400, it grew to 96gb. I scaled it down to the 100 most common words and got to about 76gb,

Re: Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

2017-02-10 Thread Yonik Seeley
FYI, I just opened https://issues.apache.org/jira/browse/SOLR-10122 for this -Yonik On Fri, Feb 10, 2017 at 4:32 PM, Yonik Seeley wrote: > On Thu, Feb 9, 2017 at 6:58 AM, Bryant, Michael > wrote: >> Hi all, >> >> I'm converting my legacy facets to JSON facets and am seeing much better >> perfor

Re: Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

2017-02-10 Thread Yonik Seeley
On Thu, Feb 9, 2017 at 6:58 AM, Bryant, Michael wrote: > Hi all, > > I'm converting my legacy facets to JSON facets and am seeing much better > performance, especially with high cardinality facet fields. However, the one > issue I can't seem to resolve is excessive memory usage (and OOM errors)

Re: Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

2017-02-10 Thread Bryant, Michael
Darn, spoke too soon. Field collapsing throws off my facet counts where facet fields differ within groups. Back to the drawing board. FWIW, I tried hyperloglog for JSON facet aggregate counts and it has the same issue as unique() when used as the facet sort parameter - while reasonably fast it
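For readers following this thread, the kind of request under discussion can be sketched as below: a JSON terms facet sorted by a distinct-count aggregate, using either `unique()` or `hll()`. This is a minimal sketch, not the poster's actual request; the field names `category` and `item_id` and the facet label `by_field` are hypothetical.

```python
import json


def terms_facet_sorted_by_distinct(facet_field, id_field, use_hll=False, limit=10):
    """Build a json.facet body: a terms facet sorted by a distinct-count aggregate.

    facet_field and id_field are placeholders for fields in your own schema.
    """
    agg = "hll" if use_hll else "unique"
    return {
        "by_field": {  # hypothetical facet label
            "type": "terms",
            "field": facet_field,
            "limit": limit,
            # Sort buckets by the aggregate defined in the nested "facet" block.
            "sort": "distinct_items desc",
            "facet": {"distinct_items": f"{agg}({id_field})"},
        }
    }


# Hypothetical field names; rows=0 because only the facet counts are wanted.
body = terms_facet_sorted_by_distinct("category", "item_id", use_hll=True)
params = {"q": "*:*", "rows": 0, "json.facet": json.dumps(body)}
```

Switching `use_hll` only changes which aggregate function is named in the request; the thread reports that both variants show the same memory behaviour when used as the sort key.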

Re: Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

2017-02-10 Thread Bryant, Michael
Hi Tom, Well the collapsing query parser is… a much better solution to my problems! Thanks for cluing me in to this, I love it when you can delete a load of hacks for something both simpler and faster. Best, ~Mike -- Mike Bryant Research Associate Department of Digital Humanities King’s

Re: Copying SolrCloud collections (Replication? Backup/Restore?)

2017-02-10 Thread Kelly, Frank
Thanks Erick for that idea and the fast response Cheers! F On 2/10/17, 1:24 PM, "Erick Erickson" wrote: >First, perhaps the slickest way to reindex without as much downtime would >be to just index to a _new_ collection. Then use "collection aliasing" to >point incoming requests to the old col

Field collapsing, facets, and qtime: caching issue?

2017-02-10 Thread Ronald K. Braun
I'm experimenting with field collapsing in solrcloud 6.2.1 and have this set of request parameters against a collection: /default?indent=on&q=*:*&wt=json&fq={!collapse+field=groupid} My default handler is just defaults: explicit The first query runs about
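The request above can be assembled programmatically, which avoids hand-escaping the `{!collapse ...}` local-params syntax. The sketch below only builds the URL path and query string; the helper name `collapse_query` is hypothetical, and the handler and field name are taken from the message.

```python
from urllib.parse import urlencode


def collapse_query(handler, group_field, q="*:*"):
    """Return a handler path plus query string that collapses results on group_field."""
    params = [
        ("indent", "on"),
        ("q", q),
        ("wt", "json"),
        # Collapsing query parser applied as a filter query.
        ("fq", "{!collapse field=%s}" % group_field),
    ]
    return handler + "?" + urlencode(params)


url = collapse_query("/default", "groupid")
```

`urlencode` percent-encodes the braces and encodes the space as `+`, matching the `collapse+field=groupid` form shown in the message.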

Re: Copying SolrCloud collections (Replication? Backup/Restore?)

2017-02-10 Thread Erick Erickson
First, perhaps the slickest way to reindex without as much downtime would be to just index to a _new_ collection. Then use "collection aliasing" to redirect incoming requests from the old collection to the new one. True, you do need extra hardware. But that aside, Solr (well Lucene really) indexes a
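The aliasing step mentioned here goes through the Collections API `CREATEALIAS` action. A minimal sketch of building that call follows; the base URL, alias name, and collection name are hypothetical, and the helper `create_alias_url` is not part of any Solr client library.

```python
from urllib.parse import urlencode


def create_alias_url(solr_base, alias, collection):
    """Build a Collections API URL that points `alias` at `collection`."""
    params = {"action": "CREATEALIAS", "name": alias, "collections": collection}
    return f"{solr_base}/admin/collections?{urlencode(params)}"


# Hypothetical names: clients query "products", which now resolves to "products_v2".
url = create_alias_url("http://localhost:8983/solr", "products", "products_v2")
```

Issuing the same call again with a different `collections` value repoints the alias, so clients that query the alias need no change when the reindexed collection goes live.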

Copying SolrCloud collections (Replication? Backup/Restore?)

2017-02-10 Thread Kelly, Frank
Hello, We have 100M+ documents across 2 collections and need to reindex the entirety of the Collections as we need to turn on “docValues”:true on a number of fields (see previous emails from this week :-] ). Unfortunately we have 4 AWS regions each with their own SolrCloud cluster each with

Re: Stemming and accents

2017-02-10 Thread Erick Erickson
The easiest way to answer that is to define two different fieldTypes, one with Snowball first and one with ASCIIFolding first, fire up the admin/analysis page and give it some input. That'll show you _exactly_ what transformations take place at each step. Best, Erick On Fri, Feb 10, 2017 at 12:26

Re: how to get modified field data if it doesn't exist in meta

2017-02-10 Thread Erick Erickson
Would TimestampUpdateProcessorFactory work? Best, Erick On Fri, Feb 10, 2017 at 4:59 AM, Alexandre Rafalovitch wrote: > Custom update request processor that looks up a file from the name and gets > the date should work. > > Regards, > Alex > > On 10 Feb 2017 2:39 AM, "Gytis Mikuciunas" wrot

Re: Solr Heap Dump: Any suggestions on what to look for?

2017-02-10 Thread Kelly, Frank
To clarify "we put “docValues”=“true” on the schema" should have said "we put “docValues”=“true” on the id field only" -Frank On 2/10/17, 10:27 AM, "Kelly, Frank" wrote: >Thanks Shawn, > >Yeah think we have identified root cause thanks to some of the suggestions >here. > >Originally we stopp

Re: Solr Heap Dump: Any suggestions on what to look for?

2017-02-10 Thread Kelly, Frank
Thanks Shawn, Yeah, I think we have identified the root cause thanks to some of the suggestions here. Originally we stopped using deleteByQuery as we saw it caused some large CPU spikes (see https://issues.apache.org/jira/browse/LUCENE-7049) and Solr pauses, and switched to using a search and then delete
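The snippet is cut off, so the following is only an assumption about what "a search and then delete" looks like in practice: collect the matching ids with a query, then send delete-by-id commands in batches. The sketch covers the second half, producing JSON update bodies in Solr's `{"delete": [...]}` form; the helper name and batch size are hypothetical.

```python
import json


def delete_by_id_payloads(ids, batch_size=1000):
    """Yield JSON update bodies that delete documents by id, one batch at a time."""
    ids = list(ids)
    for start in range(0, len(ids), batch_size):
        # Each body is a complete JSON update command for one batch of ids.
        yield json.dumps({"delete": ids[start:start + batch_size]})


# Hypothetical ids, as would be returned by the preceding search.
payloads = list(delete_by_id_payloads(["doc1", "doc2", "doc3"], batch_size=2))
```

Each payload would be posted to the collection's update handler with a JSON content type; batching keeps individual requests small.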

Re: Java version set to 1.8 for SOLR 6.4.0

2017-02-10 Thread Alexandre Rafalovitch
That's the operating system question really. Whatever invokes your Solr instance (system script, command line, etc) should have its path and/or JAVA_HOME pointing to Java 8. Regards, Alex. http://www.solr-start.com/ - Resources for Solr users, new and experienced On 10 February 2017 at
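If Solr is launched from a script, one way to apply this advice is to build the environment explicitly instead of relying on the system default. This is a minimal sketch; the JDK path in the example is hypothetical and the helper `java8_env` is illustrative only.

```python
import os


def java8_env(java_home, base_env=None):
    """Return a copy of the environment with JAVA_HOME set and its bin dir first on PATH."""
    env = dict(os.environ if base_env is None else base_env)
    env["JAVA_HOME"] = java_home
    env["PATH"] = os.path.join(java_home, "bin") + os.pathsep + env.get("PATH", "")
    return env


# Hypothetical install location; pass the result as env= to subprocess.run
# when starting Solr, e.g. subprocess.run(["bin/solr", "start"], env=env).
env = java8_env("/usr/lib/jvm/java-1.8.0", {"PATH": "/usr/bin"})
```

Because only the child process sees the modified environment, other applications on the box can keep using Java 1.7.0.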

Java version set to 1.8 for SOLR 6.4.0

2017-02-10 Thread Uchit Patel
I have installed SOLR 6.4.0 on a Linux box. I have both Java 1.7.0 and 1.8.0 on the box. By default it points to 1.7.0. Some other

Re: Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

2017-02-10 Thread Tom Evans
Hi Mike Looks like you are trying to get a list of the distinct item ids in a result set, ordered by the most frequent item ids? Can you use collapsing qparser for this instead? Should be much quicker. https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results Every document w

Re: Interval Facets with JSON

2017-02-10 Thread Tom Evans
On Wed, Feb 8, 2017 at 11:26 PM, deniz wrote: > Tom Evans-2 wrote >> I don't think there is such a thing as an interval JSON facet. >> Whereabouts in the documentation are you seeing an "interval" as JSON >> facet type? >> >> >> You want a range facet surely? >> >> One thing with range facets is t

Re: how to get modified field data if it doesn't exist in meta

2017-02-10 Thread Alexandre Rafalovitch
Custom update request processor that looks up a file from the name and gets the date should work. Regards, Alex On 10 Feb 2017 2:39 AM, "Gytis Mikuciunas" wrote: Hi, We have started to use solr for our documents indexing (vsd, vsdx, xls,xlsx, doc, docx, pdf, txt). Modified date values is
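The suggestion is a server-side update request processor, which would be written in Java. The lookup step itself can be illustrated on the client side: read the file's modification time and format it in the UTC date form Solr expects. This is a sketch of that step only, not of the processor; the helper name is hypothetical.

```python
import os
from datetime import datetime, timezone


def modified_date_for_solr(path):
    """Return the file's last-modified time formatted as a Solr UTC date string."""
    mtime = os.path.getmtime(path)
    return datetime.fromtimestamp(mtime, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

The returned string can be sent as the value of a date field when the document's own metadata carries no modified date.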

Re: Problem with cyrillics letters through Tika OCR indexing

2017-02-10 Thread Alexandre Rafalovitch
At what level is this exactly a problem? Are you looking for a way for Solr to pass -L rus flag to Tika? Or you are saying that whatever OCR is used here is bad. In the second case, this is probably not a question for Solr or even Tika but for whatever underlying OCR library is. The stack is deep
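To check the OCR layer in isolation, Tesseract can be run directly with an explicit language. Its command-line flag is lowercase `-l`, and several languages can be joined with `+`. The sketch below only builds the argument list; the image file name is hypothetical, and whether Solr's bundled Tika forwards a language setting is not shown here.

```python
def tesseract_command(image_path, languages=("rus",)):
    """Build a Tesseract CLI invocation that prints recognized text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", "+".join(languages)]


# Hypothetical input file; run with subprocess.run(cmd, capture_output=True).
cmd = tesseract_command("scan.tiff", languages=("rus", "eng"))
```

If this produces Cyrillic output while the Solr-indexed result does not, the problem lies in how the language reaches Tesseract through Tika rather than in the OCR engine.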

Re: OCR image contains cyrillic characters

2017-02-10 Thread Rick Leir
My guess is that you are using Tika and Tesseract. The latter is complex, and you can start learning at https://wiki.apache.org/tika/TikaOCR <--shows you how to work with TIFF The traineddata for Cyrillic is here: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files https://gith

OCR image contains cyrillic characters

2017-02-10 Thread Игорь Абрашин
Hello, community! Has anyone managed to recognize JPG, TIFF or other images with Cyrillic text inside? So far I get only Latin letters in the result (it looks like ugly transliterated text). For images containing only Latin letters it works fine. Does anyone have any suggestions, best practices or case studies re

Re: Removing duplicate terms from query

2017-02-10 Thread Ere Maijala
Thanks for the insight. You're right, of course, regarding the score calculation. I'll think about it. There are certain cases where the search is obviously bad to a human and could be cleaned up, but it's not too easy to write rules for that. --Ere On 9.2.2017 at 18.37, Walter Underwood wrote: 1.

Stemming and accents

2017-02-10 Thread Dominique Bejean
Hi, Is the SnowballPorterFilter sensitive to accents, in French for instance? If I use both SnowballPorterFilter and ASCIIFoldingFilter, do I have to configure ASCIIFoldingFilter after SnowballPorterFilter? Regards. Dominique -- Dominique Béjean 06 08 46 12 43