Re: Positions files analysis

Erick Erickson Tue, 28 Jun 2016 08:01:35 -0700

Yeah, Luke is the way to go. If you're patient, you can also use the admin
UI >> schema browser: pick a field and hit the "load term info" button.
You'll see some terms and, in light gray, the total number of terms in
your index for that replica.


Since this is a text field, the TermsComponent can also help. Basically you can
return the terms from your index that start with some text, say the first
1,000 terms that start with "123". This isn't very scientific, but it's a
very quick way to get some evidence that your suspicion is correct.

see: https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
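
For what it's worth, here is a rough Python sketch of the TermsComponent check
described above. It assumes a /terms request handler is configured for the
core, that the field is called "text", and that the core lives at
http://localhost:8983/solr/mycore (all placeholders, adjust to your setup).
It also assumes the default JSON layout, where each field's terms come back as
a flat [term, count, term, count, ...] list, and that your analysis chain keeps
a number like 1234.4546786585899544 as a single term. The "8 or more fractional
digits" cutoff is an arbitrary choice.

```python
import re
from urllib.parse import urlencode

# Terms that look like long decimal numbers, e.g. "1234.4546786585899544".
# The cutoff of 8+ fractional digits is an arbitrary assumption.
LONG_FLOAT = re.compile(r"^\d+\.\d{8,}$")


def build_terms_url(base_url, field, prefix, limit=1000):
    """Build a TermsComponent request for terms in `field` starting with `prefix`."""
    params = {
        "terms": "true",
        "terms.fl": field,
        "terms.prefix": prefix,
        "terms.limit": limit,
        "wt": "json",
    }
    # Assumes the handler is registered at /terms (hypothetical for your config).
    return f"{base_url}/terms?{urlencode(params)}"


def summarize_terms(response, field):
    """Return (terms returned, terms that look like long floats).

    `response` is the parsed JSON body; the per-field value is assumed to be
    a flat list alternating term and document count.
    """
    flat = response["terms"][field]
    pairs = list(zip(flat[0::2], flat[1::2]))
    junk = [term for term, _count in pairs if LONG_FLOAT.match(term)]
    return len(pairs), len(junk)


# Hand-written sample response, not real output from any index.
sample = {"terms": {"text": ["1234.4546786585899544", 3, "1234", 10, "123abc", 1]}}
url = build_terms_url("http://localhost:8983/solr/mycore", "text", "123")
total, junk = summarize_terms(sample, "text")
```

To run it against a live core, fetch `url` (for example with
urllib.request.urlopen), parse the body with json.load, and pass the result to
summarize_terms. Repeating this for a handful of prefixes ("0" through "9")
gives a quick, unscientific feel for how many of the returned terms are long
numbers.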

Best,
Erick

On Tue, Jun 28, 2016 at 1:40 AM, Avi Steiner <astei...@varonis.com> wrote:
> Thanks Erick.
> I don't want to disable the phrase searches option.
> I just wonder if there is any way I can find terms within the index, and
> thought the .pos file analysis might be a direction.
> I suspect that our index is full of long float numbers (e.g.
> 1234.4546786585899544) which may be unnecessary. Before I make any changes
> to our indexing process (like dropping such terms), I want to prove my
> suspicion.
> I can run a search using a regex to find how many _documents_
> contain those terms, but I would like to know how many such _terms_ (unique
> or total) are indexed. Is there a way to do it? Maybe with Luke?
>
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Tuesday, June 28, 2016 8:27 AM
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: Positions files analysis
>
> Positions are necessary if you need to do "phrase searches".
> If that's not necessary, simply turn that option off in your schema for the
> fields where it's unnecessary. See the reference guide for termVectors,
> termPositions, and termOffsets.
>
> I'm really not sure what you're asking by:
> "Is there a way I can read/analyze index files as .pos?"
>
> The various file extensions are a result of the options you define on your 
> fields, that's just the way Lucene works...
>
> Best,
> Erick
>
> On Mon, Jun 27, 2016 at 7:25 AM, asteiner <astei...@varonis.com> wrote:
>> Hi
>>
>> I have a very large index and I'd like to see how I can reduce it.
>> Some of the largest files in the index are the .pos files (positions).
>> There are many Excel files indexed with formulas, so I suspect that a
>> large part of the index is taken up by junk terms such as very long numbers.
>> Is there a way I can read/analyze index files as .pos?
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Positions-files-analysis-tp4284485.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
