Hi, Rick.
I didn't mean that it needs training, because Tesseract works well separately.
So the Tika bundled in Solr doesn't try to use the Russian dictionary to
recognize Cyrillic text, and the result uses only the English alphabet.
On 10 Feb 2017 at 15:28, "Rick Leir"
wrote:
> My guess is that you
The same problem for me. So it's probably the first case, or a question of how
to force the Tika parser to recognize Cyrillic characters as required. For me
it tries to recognize the Russian text as an English transliteration, and the
result shows the Russian text using only the Latin alphabet.
On 10 Feb 2017 at 17:55, "Alexandre Rafalovitch"
It's been a little while since I looked at this section of the code, but what
I believe is going on is that the queryResultCache has kicked in, which will
give you the DocList (the top N docs that match query/filters/sort) back
immediately. But faceting requires a DocSet, which is a bitset of all
doc
Hi,
I have experimented before, and found that Snowball is sensitive to
accents/diacritics.
Please see for more details:
http://www.sciencedirect.com/science/article/pii/S0306457315001053
Ahmet
On Friday, February 10, 2017 11:27 AM, Dominique Bejean
wrote:
Hi,
Is the SnowballPorterFilter
Hey All,
I followed an old blog post about implementing common grams, and used a file
of the 400 most popular words on a subset of my data. The original index
size was 33GB with 2.2 million documents; using the 400 words, it grew to 96GB.
I scaled it down to the 100 most common words and got to about 76GB,
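For reference, an analysis chain along these lines can be sketched with Solr's common-grams filters; the fieldType name and the words filename below are placeholders, not taken from the post:

```xml
<!-- Sketch of a common-grams setup: CommonGramsFilterFactory at index
     time, CommonGramsQueryFilterFactory at query time, both fed the
     popular-words file discussed above ("commonwords.txt" is assumed). -->
<fieldType name="text_commongrams" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

The index growth reported above is the expected trade-off: each pairing of a common word with its neighbor becomes an extra indexed token.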
FYI, I just opened https://issues.apache.org/jira/browse/SOLR-10122 for this
-Yonik
On Fri, Feb 10, 2017 at 4:32 PM, Yonik Seeley wrote:
> On Thu, Feb 9, 2017 at 6:58 AM, Bryant, Michael
> wrote:
>> Hi all,
>>
>> I'm converting my legacy facets to JSON facets and am seeing much better
>> perfor
On Thu, Feb 9, 2017 at 6:58 AM, Bryant, Michael
wrote:
> Hi all,
>
> I'm converting my legacy facets to JSON facets and am seeing much better
> performance, especially with high cardinality facet fields. However, the one
> issue I can't seem to resolve is excessive memory usage (and OOM errors)
Darn, spoke too soon. Field collapsing throws off my facet counts where facet
fields differ within groups.
Back to the drawing board. FWIW, I tried hyperloglog for JSON facet aggregate
counts and it has the same issue as unique() when used as the facet sort
parameter - while reasonably fast it
Hi Tom,
Well the collapsing query parser is… a much better solution to my problems!
Thanks for cluing me in to this, I love it when you can delete a load of hacks
for something both simpler and faster.
Best,
~Mike
--
Mike Bryant
Research Associate
Department of Digital Humanities
King’s
Thanks Erick for that idea and the fast response
Cheers!
F
On 2/10/17, 1:24 PM, "Erick Erickson" wrote:
>First, perhaps the slickest way to reindex without as much downtime would
>be to just index to a _new_ collection. Then use "collection aliasing" to
>point incoming requests to the old col
I'm experimenting with field collapsing in solrcloud 6.2.1 and have this
set of request parameters against a collection:
/default?indent=on&q=*:*&wt=json&fq={!collapse+field=groupid}
My default handler is just defaults:
explicit
The first query runs about
First, perhaps the slickest way to reindex without as much downtime would
be to just index to a _new_ collection. Then use "collection aliasing" to
point incoming requests to the old collection to the new one. True, you do
need extra hardware
But that aside, Solr (well Lucene really) indexes a
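The aliasing step Erick describes can be sketched with the Collections API CREATEALIAS action; the host and collection names below are placeholders, and the call is built as a string rather than executed:

```shell
# Placeholder host and collection names; run the curl line against a live cluster.
SOLR="http://localhost:8983/solr"
# CREATEALIAS atomically (re)points the alias, so clients querying
# "products" switch from the old collection to products_v2 with no downtime.
ALIAS_CALL="$SOLR/admin/collections?action=CREATEALIAS&name=products&collections=products_v2"
# curl "$ALIAS_CALL"
echo "$ALIAS_CALL"
```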
Hello,
We have a 100M+ documents across 2 collections and need to reindex the
entirety of the Collections as we need to turn on “docValues”:true on a number
of fields (see previous emails from this week :-] ).
Unfortunately we have 4 AWS regions each with their own SolrCloud cluster each
with
The easiest way to answer that is to define two different fieldTypes,
one with Snowball first and one with ASCIIFolding first, fire up the
admin/analysis page and give it some input. That'll show you _exactly_
what transformations take place at each step.
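A minimal sketch of the two fieldTypes to compare on the admin/analysis page; the names and the French language setting are assumptions for illustration:

```xml
<!-- Hypothetical fieldTypes differing only in filter order, for
     side-by-side comparison on the admin/analysis page. -->
<fieldType name="text_fr_stem_first" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
<fieldType name="text_fr_fold_first" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>
```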
Best,
Erick
On Fri, Feb 10, 2017 at 12:26
Would TimestampUpdateProcessorFactory work?
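A minimal solrconfig.xml chain using that factory might look like the following; the chain and field names are placeholders:

```xml
<updateRequestProcessorChain name="add-indexed-at" default="true">
  <!-- Sets "indexed_at" to NOW unless the document already has a value -->
  <processor class="solr.TimestampUpdateProcessorFactory">
    <str name="fieldName">indexed_at</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

Note this stamps the time of indexing, not the file's modified date, so it only fits if index time is an acceptable stand-in.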
Best,
Erick
On Fri, Feb 10, 2017 at 4:59 AM, Alexandre Rafalovitch
wrote:
> Custom update request processor that looks up a file from the name and gets
> the date should work.
>
> Regards,
> Alex
>
> On 10 Feb 2017 2:39 AM, "Gytis Mikuciunas" wrot
To clarify
"we put 'docValues'='true' on the schema" should have said
"we put 'docValues'='true' on the id field only"
-Frank
On 2/10/17, 10:27 AM, "Kelly, Frank" wrote:
>Thanks Shawn,
>
>Yeah think we have identified root cause thanks to some of the suggestions
>here.
>
>Originally we stopp
Thanks Shawn,
Yeah think we have identified root cause thanks to some of the suggestions
here.
Originally we stopped using deleteByQuery as we saw it caused some large
CPU spikes (see https://issues.apache.org/jira/browse/LUCENE-7049) and
Solr pauses
And switched to using a search and then delete
That's the operating system question really. Whatever invokes your
Solr instance (system script, command line, etc) should have its path
and/or JAVA_HOME pointing to Java 8.
Regards,
Alex.
http://www.solr-start.com/ - Resources for Solr users, new and experienced
On 10 February 2017 at
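For example, forcing Java 8 before starting Solr can be sketched as below; the JDK path is an assumption, so check what is actually installed on the box:

```shell
# Point the Solr launcher at the Java 8 install. The path below is an
# assumption -- list /usr/lib/jvm (or /usr/java) to find yours.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH="$JAVA_HOME/bin:$PATH"
# bin/solr start        # bin/solr honors JAVA_HOME when picking a JVM
```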
I have installed Solr 6.4.0 on a Linux box. I have both Java 1.7.0 and
1.8.0 on the box. By default it points to 1.7.0. Some other
Hi Mike
Looks like you are trying to get a list of the distinct item ids in a
result set, ordered by the most frequent item ids?
Can you use collapsing qparser for this instead? Should be much quicker.
https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
Every document w
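A sketch of what Tom is suggesting, in the same style as the request earlier in the thread; the field name item_id is a placeholder:

```
/default?q=*:*&fq={!collapse+field=item_id}&wt=json
```

The collapse qparser keeps one representative document per item_id, which is typically much cheaper than faceting over a high-cardinality field just to deduplicate.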
On Wed, Feb 8, 2017 at 11:26 PM, deniz wrote:
> Tom Evans-2 wrote
>> I don't think there is such a thing as an interval JSON facet.
>> Whereabouts in the documentation are you seeing an "interval" as JSON
>> facet type?
>>
>>
>> You want a range facet surely?
>>
>> One thing with range facets is t
Custom update request processor that looks up a file from the name and gets
the date should work.
Regards,
Alex
On 10 Feb 2017 2:39 AM, "Gytis Mikuciunas" wrote:
Hi,
We have started to use Solr to index our documents (vsd, vsdx, xls, xlsx,
doc, docx, pdf, txt).
Modified date values is
At what level is this exactly a problem? Are you looking for a way for Solr
to pass -L rus flag to Tika?
Or are you saying that whatever OCR is used here is bad? In the second
case, this is probably not a question for Solr or even Tika, but for
whatever the underlying OCR library is.
The stack is deep
My guess is that you are using Tika and Tesseract. The latter is
complex, and you can start learning at
https://wiki.apache.org/tika/TikaOCR <--shows you how to work with TIFF
The traineddata for Cyrillic is here:
https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
https://gith
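Outside Solr, the equivalent standalone Tesseract invocation can be sketched as below (assumes the rus traineddata from that page is installed; the command is built as a string so nothing runs without the tool present):

```shell
# Requires the "rus" traineddata (e.g. the tesseract-ocr-rus package, or
# rus.traineddata dropped into the tessdata directory).
OCR_CMD="tesseract scan.tif out -l rus"   # would write Cyrillic text to out.txt
# $OCR_CMD   # uncomment on a box with tesseract installed
echo "$OCR_CMD"
```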
Hello, community!
Did you manage to recognize JPG, TIFF, or other images with Cyrillic text
inside? I've got only Latin letters (it looks like ugly transliterated text)
in the result at the moment. For images containing only Latin letters it
works fine.
Does anyone have any suggestion, best practice or case studies re
Thanks for the insight. You're right, of course, regarding the score
calculation. I'll think about it. There are certain cases where the
search is human-obviously bad and could be cleaned up, but it's not too
easy to write rules for that.
--Ere
On 9 Feb 2017 at 18:37, Walter Underwood wrote:
1.
Hi,
Is the SnowballPorterFilter sensitive to the accents for French for
instance ?
If I use both SnowballPorterFilter and ASCIIFoldingFilter, do I have to
configure ASCIIFoldingFilter after SnowballPorterFilter ?
Regards.
Dominique
--
Dominique Béjean
06 08 46 12 43