Solr 5.2.1 versus Solr 4.7.0 performance

2015-08-26 Thread Esther Goldbraich
Hello,
We have benchmarked a set of queries on Solr 4.7.0 and 5.2.1 (with the same
data and the same solrconfig.xml) and saw better query performance on Solr
4.7.0 (5-15% better than 5.2.1, with the exception of a 100% improvement for
one of the queries).
We used the same JVM (IBM 1.7) and the same JVM parameters.
Index's size is ~500G, spread over 64 shards, with replication factor 2.
Do you know about any config / setup change for Solr 5.2.1 that can 
improve the performance? Any idea what causes this behavior?
Thank you,
Esther




Re: Solr performance is slow with just 1GB of data indexed

2015-08-26 Thread Zheng Lin Edwin Yeo
Hi Toke,

Thank you for the link.

I'm using Solr 5.2.1, but I think the carrot2 bundled with it is a slightly
older version, as I'm using the latest carrot2-workbench-3.10.3, which was
only released recently. I've changed all the settings like fragSize and
desiredClusterCountBase to be the same on both sides, and I'm now able to
get very similar cluster results.

Now I've tried increasing carrot.fragSize to 75 and carrot.summarySnippets
to 2, and setting carrot.produceSummary to true. With these settings, I'm
mostly able to get the cluster results back within 2 to 3 seconds when I set
rows=200. I'm still checking whether the cluster labels are OK, but in
theory do you think this is a suitable setting to improve the clustering
results and at the same time improve the performance?

Regards,
Edwin



On 26 August 2015 at 13:58, Toke Eskildsen  wrote:

> On Wed, 2015-08-26 at 10:10 +0800, Zheng Lin Edwin Yeo wrote:
> > I'm currently trying out on the Carrot2 Workbench and get it to call Solr
> > to see how they did the clustering. Although it still takes some time to
> do
> > the clustering, but the results of the cluster is much better than mine.
> I
> > think its probably due to the different settings like the fragSize and
> > desiredCluserCountBase?
>
> Either that or the carrot bundled with Solr is an older version.
>
> > By the way, the link on the clustering example
> > https://cwiki.apache.org/confluence/display/solr/Result is not working
> as
> > it says 'Page Not Found'.
>
> That is because it is too long for a single line. Try copy-pasting it:
>
> https://cwiki.apache.org/confluence/display/solr/Result
> +Clustering#ResultClustering-Configuration
>
> - Toke Eskildsen, State and University Library, Denmark
>
>
>
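[A sketch to make the settings above concrete: this is roughly how the
carrot.* parameters Edwin mentions would be sent from SolrJ. The core URL,
the handler name (/clustering) and the query text are assumptions, not from
the thread.]

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ClusteringQueryExample {
  public static void main(String[] args) throws Exception {
    // Assumed core URL and request handler with the clustering component enabled
    HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/collection1");
    SolrQuery q = new SolrQuery("your search terms");
    q.setRequestHandler("/clustering");
    q.setRows(200);                        // cluster the top 200 results
    q.set("carrot.produceSummary", true);  // cluster on snippets instead of whole fields
    q.set("carrot.fragSize", 75);          // snippet fragment size
    q.set("carrot.summarySnippets", 2);    // snippets per document
    QueryResponse rsp = solr.query(q);
    System.out.println("QTime: " + rsp.getQTime() + " ms");
    solr.close();
  }
}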


Re: Search opening hours

2015-08-26 Thread Upayavira


On Tue, Aug 25, 2015, at 10:54 PM, Yonik Seeley wrote:
> On Tue, Aug 25, 2015 at 5:02 PM, O. Klein  wrote:
> > I'm trying to find the best way to search for stores that are open NOW.
> 
> It's probably not the *best* way, but assuming it's currently 4:10pm,
> you could do
> 
> +open:[* TO 1610] +close:[1610 TO *]
> 
> And to account for days of the week have different fields for each day
> openM, closeM, openT, closeT, etc...  not super elegant, but seems to
> get the job done.

So, the basic question is what does "now" mean? If it is 5:29pm and a
shop closes at 5:30pm, does that count as "open"? If you want to query
"a single time" within a range, then Yonik's approach will work
(although I'd use open0 to open6 for the days of the week).

If you want to find a range within another range, then use what
Alexandre suggested - spatial search functionality. For example, you
could say, is the shop open for 10 minutes either side of "now". Of
course, you could use spatial for a time within a range, and it might be
a little more elegant because you can use a multivalued field to specify
the open/close ranges for your store.

Upayavira
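[A sketch of the open/close range approach described above, with per-day
integer fields open0..close6 (0 = Monday) holding times encoded as HHmm, as
in Yonik's 1610 example. The field names and query setup are assumptions,
and this ignores the overnight-closing case discussed later in the thread.]

import java.util.Calendar;
import org.apache.solr.client.solrj.SolrQuery;

public class OpenNowFilter {

  // Builds a filter such as: +open0:[* TO 1610] +close0:[1610 TO *]
  static String openNowFilter(Calendar now) {
    int day = (now.get(Calendar.DAY_OF_WEEK) + 5) % 7;   // Calendar.MONDAY (2) -> 0
    int hhmm = now.get(Calendar.HOUR_OF_DAY) * 100 + now.get(Calendar.MINUTE);
    return "+open" + day + ":[* TO " + hhmm + "] +close" + day + ":[" + hhmm + " TO *]";
  }

  public static void main(String[] args) {
    SolrQuery q = new SolrQuery("*:*");
    q.addFilterQuery(openNowFilter(Calendar.getInstance()));
    System.out.println(q);   // prints the resulting query parameters
  }
}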


Re: Search opening hours

2015-08-26 Thread O. Klein
Thank you for responding.

Yonik's solution is what I had in mind. Was hoping for something more
elegant, as he said, but it will work.

The thing I haven't figured out is how to deal with closing times early
morning next day.

So it's 22:00 now and opening hours are 20:00 to 03:00

Can this be done with either or both approaches?





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-opening-hours-tp4225250p4225339.html
Sent from the Solr - User mailing list archive at Nabble.com.


Hash of solr documents

2015-08-26 Thread david . davila
Hi,

I have read in a post on the Internet that the hash SolrCloud calculates
over the key field, to route each document to a shard, is indexed. Is this
true? If so, is there any way to show this hash for each document?

Thanks,

David

best way for adding a new field to all indexed documents...

2015-08-26 Thread Roxana Danger
Hello,
   I have an index created with Solr, and I would like to add a new
field to all the documents of the index. I suppose I could a) use an
update request handler, or b) create another index, importing the data from
the initial index plus the data for my new field. Which would be the best
approach? Would the background processing re-index the documents?
   Thank you very much in advance,
Roxana

-- 
Roxana Danger | Data Scientist | Dragon Court, 27-29 Macklin Street, London,
WC2B 5LX | Tel: 020 7067 4568 | reed.co.uk - The UK's #1 job site.



Re: Hash of solr documents

2015-08-26 Thread Anshum Gupta
Hi David,

The route key itself is indexed, but not the hash value. Why do you need to
know and display the hash value? This seems like an XY problem to me:
http://people.apache.org/~hossman/#xyproblem

On Wed, Aug 26, 2015 at 1:17 AM,  wrote:

> Hi,
>
> I have read in one post in the Internet that the hash Solr Cloud
> calculates over the key field to send each document to a different shard
> is indexed. Is this true? If true, is there any way to show this hash for
> each document?
>
> Thanks,
>
> David




-- 
Anshum Gupta
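[For completeness: since the hash is computed rather than stored, the usual
way to see it is to recompute it on the client. A sketch, assuming the Hash
utility class that ships with SolrJ (org.apache.solr.common.util.Hash) and a
plain uniqueKey with no "!" route-key separator, which is what the
compositeId router hashes. Treat the class and method names as assumptions
to verify against your SolrJ version.]

import org.apache.solr.common.util.Hash;

public class DocHash {
  public static void main(String[] args) {
    String id = args.length > 0 ? args[0] : "example-doc-1";
    // 32-bit murmur3 over the whole id, seed 0 - compare the hex value
    // against the shard ranges shown in clusterstate.json
    int hash = Hash.murmurhash3_x86_32(id, 0, id.length(), 0);
    System.out.println(id + " -> " + Integer.toHexString(hash));
  }
}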


Re: best way for adding a new field to all indexed documents...

2015-08-26 Thread Mikhail Khludnev
Sadly, it's always a problem
http://searchivarius.org/blog/how-rename-fields-solr


On Wed, Aug 26, 2015 at 11:20 AM, Roxana Danger <
roxana.dan...@reedonline.co.uk> wrote:

> Hello,
>I have a index created with solr, and I would like to add a new
> field to all the documents of the index. I suppose I could a) use an
> updateRequestHandler or b) create another index importing the data from the
> initial index and the data of my new field. Which could be the best
> approach? Will the background processing be re-indexing the documents?
>Thank you very much in advance,
> Roxana
>
> --
> Roxana Danger | Data Scientist Dragon Court, 27-29 Macklin Street, London,
> WC2B 5LX Tel: 020 7067 4568 [image: reed.co.uk] 
> The
> UK's #1 job site.  [image: Follow us on Twitter]
> 
>  [image:
> Like us on Facebook] 
>  It's time to Love Mondays »
> 
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Please answer my question on StackOverflow ... "Best approach to guarantee commits in SOLR"

2015-08-26 Thread Charlie Hull

On 25/08/2015 13:21, Simer P wrote:

http://stackoverflow.com/questions/32138845/what-is-the-best-approach-to-guarantee-commits-in-apache-solr
.

*Question:* How can I get "guaranteed commits" with Apache Solr, where
persisting data to disk and visibility are both equally important?

*Background:* We have a website which requires high-end search
functionality for machine learning and also requires guaranteed commits for
financial transactions. We just want to use Solr as our only datastore, to
keep things simple, and *do not* want to use another database on the side.

I can't seem to find any answer to this question. The simplest solution for
a financial transaction seems to be to periodically query Solr for the
record after it has been persisted, but this can mean a longer wait time; is
there a better solution?

Can anyone please suggest a solution for achieving "guaranteed commits"
with Solr?

Firstly, if you're asking here, you're likely to be answered here, not
on Stack Overflow.


A search engine is not a database. Although both Solr and Elasticsearch
are often used as primary stores with varying degrees of success, they
are, after all, search engines, designed for that use.


Cheers

Charlie

--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


Re: Search opening hours

2015-08-26 Thread Stefan Matheis
Have a look at the links that Alexandre mentioned. It's a somewhat non-obvious
solution, because you'd probably not think of spatial features while dealing
with opening times - but it's worth having a look.

-Stefan 


On Wednesday, August 26, 2015 at 10:16 AM, O. Klein wrote:

> Thank you for responding.
> 
> Yonik's solution is what I had in mind. Was hoping for something more
> elegant, as he said, but it will work.
> 
> The thing I haven't figured out is how to deal with closing times early
> morning next day.
> 
> So it's 22:00 now and opening hours are 20:00 to 03:00
> 
> Can this be done with either or both approaches?
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Search-opening-hours-tp4225250p4225339.html
> Sent from the Solr - User mailing list archive at Nabble.com 
> (http://Nabble.com).
> 
> 




Re: New Solr installation fails to create collection/core

2015-08-26 Thread deviantcode
I ran into this exact problem trying out the latest Solr [5.3.0].
@Scott, how did you fix it?
KR
Henry



--
View this message in context: 
http://lucene.472066.n3.nabble.com/re-New-Solr-installation-fails-to-create-core-tp4221768p4225350.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Hash of solr documents

2015-08-26 Thread david . davila
Yes, it's an XY problem :)

We are running our first tests of splitting our shard (Solr 5.1).

The problem we have is this: the number of documents indexed in the new
shards is lower than in the original one (19814 and 19653, vs 61100), and
always the same. We have no idea why Solr is doing this. A problem with
some documents, or with the segment?

A long time after we changed from "normal" Solr to SolrCloud, we found
that the "router" parameter in clusterstate.json was incorrect: we wanted
"compositeId" and it was set to "explicit". The solution was to delete
clusterstate.json and restart Solr. We are thinking that maybe the problem
with the SPLIT is related to that: some documents are stored with the hash
value and others are not, and SPLIT needs that to distribute them. But I
know that this likely has nothing to do with the SPLIT problem; it's only
an idea.

This is the log; all seems to be normal:

INFO  - 2015-08-26 09:13:47.654; org.apache.solr.handler.admin.CoreAdminHandler; Invoked split action for core: buscon
INFO  - 2015-08-26 09:13:47.656; org.apache.solr.update.DirectUpdateHandler2; start commit{,optimize=false,openSearcher=true,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
INFO  - 2015-08-26 09:13:47.656; org.apache.solr.update.DirectUpdateHandler2; No uncommitted changes. Skipping IW.commit.
INFO  - 2015-08-26 09:13:47.657; org.apache.solr.core.SolrCore; SolrIndexSearcher has not changed - not re-opening: org.apache.solr.search.SolrIndexSearcher
INFO  - 2015-08-26 09:13:47.657; org.apache.solr.update.DirectUpdateHandler2; end_commit_flush
INFO  - 2015-08-26 09:13:47.658; org.apache.solr.update.SolrIndexSplitter; SolrIndexSplitter: partitions=2 segments=1
INFO  - 2015-08-26 09:13:47.922; org.apache.solr.update.SolrIndexSplitter; SolrIndexSplitter: partition #0 partitionCount=2 range=0-3fff
INFO  - 2015-08-26 09:13:47.922; org.apache.solr.update.SolrIndexSplitter; SolrIndexSplitter: partition #0 partitionCount=2 range=0-3fff segment #0 segmentCount=1
INFO  - 2015-08-26 09:22:19.533; org.apache.solr.update.SolrIndexSplitter; SolrIndexSplitter: partition #1 partitionCount=2 range=4000-7fff
INFO  - 2015-08-26 09:22:19.536; org.apache.solr.update.SolrIndexSplitter; SolrIndexSplitter: partition #1 partitionCount=2 range=4000-7fff segment #0 segmentCount=1
INFO  - 2015-08-26 09:30:44.141; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={targetCore=buscon_shard2_0_replica1&targetCore=buscon_shard2_1_replica1&action=SPLIT&core=buscon&wt=javabin&qt=/admin/cores&version=2} status=0 QTime=1016486
INFO  - 2015-08-26 09:30:44.387; org.apache.solr.handler.admin.CoreAdminHandler; Applying buffered updates on core: buscon_shard2_0_replica1
INFO  - 2015-08-26 09:30:44.387; org.apache.solr.handler.admin.CoreAdminHandler; No buffered updates available. core=buscon_shard2_0_replica1
INFO  - 2015-08-26 09:30:44.388; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={name=buscon_shard2_0_replica1&action=REQUESTAPPLYUPDATES&wt=javabin&qt=/admin/cores&version=2} status=0 QTime=2
INFO  - 2015-08-26 09:30:44.441; org.apache.solr.handler.admin.CoreAdminHandler; Applying buffered updates on core: buscon_shard2_1_replica1
INFO  - 2015-08-26 09:30:44.441; org.apache.solr.handler.admin.CoreAdminHandler; No buffered updates available. core=buscon_shard2_1_replica1
INFO  - 2015-08-26 09:30:44.441; org.apache.solr.servlet.SolrDispatchFilter; [admin] webapp=null path=/admin/cores params={name=buscon_shard2_1_replica1&action=REQUESTAPPLYUPDATES&wt=javabin&qt=/admin/cores&version=2} status=0 QTime=0
INFO  - 2015-08-26 09:30:44.743; org.apache.solr.common.cloud.ZkStateReader$2; A cluster state change: WatchedEvent state:SyncConnected type:NodeDataChanged path:/clusterstate.json, has occurred - updating... (live nodes size: 4)




Thanks,

David



From:   Anshum Gupta 
To:     "solr-user@lucene.apache.org" , 
Date:   26/08/2015 10:27
Subject:        Re: Hash of solr documents



Hi David,

The route key itself is indexed, but not the hash value. Why do you need 
to
know and display the hash value? This seems like an XY problem to me:
http://people.apache.org/~hossman/#xyproblem

On Wed, Aug 26, 2015 at 1:17 AM,  wrote:

> Hi,
>
> I have read in one post in the Internet that the hash Solr Cloud
> calculates over the key field to send each document to a different shard
> is indexed. Is this true? If true, is there any way to show this hash 
for
> each document?
>
> Thanks,
>
> David




-- 
Anshum Gupta



Re: Search opening hours

2015-08-26 Thread O. Klein
Those options don't fix my problem with closing times the next morning, or is
there a way to do this?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Search-opening-hours-tp4225250p4225354.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Exact substring search with ngrams

2015-08-26 Thread Christian Ramseyer
On 26/08/15 00:24, Erick Erickson wrote:
> Hmmm, this sounds like a nonsensical question, but "what do you mean
> by arbitrary substring"?
> 
> Because if your substrings consist of whole _tokens_, then ngramming
> is totally unnecessary (and gets in the way). Phrase queries with no slop
> fulfill this requirement.
> 
> But let's assume you need to march within tokens, i.e. if the doc
> contains "my dog has fleas", you need to match input like "as fle", in this
> case ngramming is an option.

Yeah the "as fle"-thing is exactly what I want to achieve.

> 
> You have substantially different index and query time chains. The result is 
> that
> the offsets for all the grams at index time are the same in the quick 
> experiment
> I tried, all were 1. But at query time, each gram had an incremented position.
> 
> I'd start by using the query time analysis chain for indexing also. Next, I'd
> try enclosing multiple words in double quotes at query time and go from there.
> What you have now is an anti-pattern in that having substantially
> different index
> and query time analysis chains is not something that's likely to be very
> predictable unless you know _exactly_ what the consequences are.
> 
> The admin/analysis page is your friend, in this case check the
> "verbose" checkbox
> to see what I mean.

Hmm, interesting. I had the additional \R tokenizer in the index chain
because the document can be multiple lines (but the search text is
always a single line), and if the document was

my dog
has fleas

I wouldn't want some variant of "og ha" to match, but I didn't realize
it didn't give me any positions, like you noticed.

I'll try to experiment some more, thanks for the hints!

Chris

> 
> Best,
> Erick
> 
> On Tue, Aug 25, 2015 at 3:00 PM, Christian Ramseyer  wrote:
>> Hi
>>
>> I'm trying to build an index for technical documents that basically
>> works like "grep", i.e. the user gives an arbitray substring somewhere
>> in a line of a document and the exact matches will be returned. I
>> specifically want no stemming etc. and keep all whitespace, parentheses
>> etc. because they might be significant. The only normalization is that
>> the search should be case-insensitvie.
>>
>> I tried to achieve this by tokenizing on line breaks, and then building
>> trigrams of the individual lines:
>>
>> 
>>
>> 
>>
>> > pattern="\R" group="-1"/>
>>
>> > minGramSize="3" maxGramSize="3"/>
>> 
>>
>> 
>>
>> 
>>
>> > minGramSize="3" maxGramSize="3"/>
>> 
>>
>> 
>> 
>>
>> Then in the search, I use the edismax parser with mm=100%, so given the
>> documents
>>
>>
>> {"id":"test1","content":"
>> encryption
>> 10.0.100.22
>> description
>> "}
>>
>> {"id":"test2","content":"
>> 10.100.0.22
>> description
>> "}
>>
>> and the query content:encryption, this will turn into
>>
>> "parsedquery_toString":
>>
>> "+((content:enc content:ncr content:cry content:ryp
>> content:ypt content:pti content:tio content:ion)~8)",
>>
>> and return only the first document. All fine and dandy. But I have a
>> problem with possible false positives. If the search is e.g.
>>
>> content:.100.22
>>
>> then the generated query will be
>>
>> "parsedquery_toString":
>> "+((content:.10 content:100 content:00. content:0.2 content:.22)~5)",
>>
>> and because all of tokens are also generated for document test2 in the
>> proximity of 5, both documents will wrongly be returned.
>>
>> So somehow I'd need to express the query "content:.10 content:100
>> content:00. content:0.2 content:.22" with *the tokens exactly in this
>> order and nothing in between*. Is this somehow possible, maybe by using
>> the termvectors/termpositions stuff? Or am I trying to do something
>> that's fundamentally impossible? Other good ideas how to achieve this
>> kind of behaviour?
>>
>> Thanks
>> Christian
>>
>>
>>



Re: Search opening hours

2015-08-26 Thread Upayavira


On Wed, Aug 26, 2015, at 10:17 AM, O. Klein wrote:
> Those options don't fix my problem with closing times the next morning,
> or is
> there a way to do this?

Use the spatial model, and a time window of a week. There are 10,080
minutes in a week, so you could use that as your scale.

Assuming the week starts at 00:00 Monday morning, you might index Monday
9:00-23:00 as  540:1380

Tuesday 9am-Wednesday 1am would be 1980:2940

You convert your NOW time into a "minutes since Monday 00:00" and do a
spatial search within that time.

If it is now Monday, 11:23am, that would be 11*60+23=683, so you would
do a search for 683:683.

If you have a shop that is open over Sunday night to Monday, you just
list it as open until Sunday 23:59 and open again Monday 00:00.

Would that do it?

Upayavira
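[A small sketch of the arithmetic behind this model: converting "now" into
minutes since Monday 00:00, and splitting an opening window that wraps past
the end of the week into two ranges for a multivalued field. Plain Java,
nothing Solr-specific; the encoding is the 0..10079 scale described above.]

import java.time.LocalDateTime;

public class WeekMinutes {
  static final int WEEK = 7 * 24 * 60;   // 10,080 minutes per week

  // Minutes since Monday 00:00 (Monday 11:23 -> 683)
  static int minutesSinceMonday(LocalDateTime t) {
    int day = t.getDayOfWeek().getValue() - 1;   // Monday -> 0
    return day * 1440 + t.getHour() * 60 + t.getMinute();
  }

  // An opening window that wraps past Sunday 23:59 is indexed as two ranges.
  static int[][] openingRanges(int open, int close) {
    if (close >= open) {
      return new int[][] { { open, close } };
    }
    return new int[][] { { open, WEEK - 1 }, { 0, close } };
  }

  public static void main(String[] args) {
    System.out.println(minutesSinceMonday(LocalDateTime.now()));
    int open  = 6 * 1440 + 22 * 60;   // Sunday 22:00
    int close = 3 * 60;               // Monday 03:00 of the following week
    for (int[] r : openingRanges(open, close)) {
      System.out.println(r[0] + ":" + r[1]);   // 9960:10079 and 0:180
    }
  }
}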


Re: Behavior of grouping on a field with same value spread across shards.

2015-08-26 Thread Modassar Ather
Thanks Erick.

On Wed, Aug 26, 2015 at 12:11 PM, Erick Erickson 
wrote:

> That should be the case.
>
> Best,
> Erick
>
> On Tue, Aug 25, 2015 at 8:55 PM, Modassar Ather 
> wrote:
> > Thanks Erick,
> >
> > I saw the link. So is it that the grouping functionality works fine in
> > distributed search except the two cases mentioned in the link?
> >
> > Regards,
> > Modassar
> >
> > On Tue, Aug 25, 2015 at 10:40 PM, Erick Erickson <
> erickerick...@gmail.com>
> > wrote:
> >
> >> That's not really the case. Perhaps you're confusing
> >> group.ngroups and group.facet with just grouping?
> >>
> >> See the ref guide:
> >>
> >>
> https://cwiki.apache.org/confluence/display/solr/Result+Grouping#ResultGrouping-DistributedResultGroupingCaveats
> >>
> >> Best,
> >> Erick
> >>
> >> On Tue, Aug 25, 2015 at 4:51 AM, Modassar Ather  >
> >> wrote:
> >> > Hi,
> >> >
> >> > As per my understanding, to group on a field all documents with the
> same
> >> > value in the field have to be in the same shard.
> >> >
> >> > Can we group by a field where the documents with the same value in
> that
> >> > field will be distributed across shards?
> >> > Please let me know what are the limitations, feature not available or
> >> > performance issues for such fields?
> >> >
> >> > Thanks,
> >> > Modassar
> >>
>


Re: Tokenizers and DelimitedPayloadTokenFilterFactory

2015-08-26 Thread Jamie Johnson
Thanks again Erick. I created
https://issues.apache.org/jira/browse/SOLR-7975, though I didn't attach a
patch because my current implementation is not generally useful right now;
it meets my use case but likely would not meet others'. I will look into
generalizing this to allow something custom to be plugged in.
On Aug 26, 2015 2:46 AM, "Erick Erickson"  wrote:

> Sure, I think it's fine to raise a JIRA, especially if you can include
> a patch, even a preliminary one to solicit feedback... which I'll
> leave to people who are more familiar with that code...
>
> I'm not sure how generally useful this would be, and if it comes
> at a cost to normal searching there's sure to be lively discussion.
>
> Best
> Erick
>
> On Tue, Aug 25, 2015 at 7:50 PM, Jamie Johnson  wrote:
> > Looks like I have something basic working for Trie fields.  I am doing
> > exactly what I said in my previous email, so good news there.  I think
> this
> > is a big step as there are only a few field types left that I need to
> > support, those being date (should be similar to Trie) and Spatial fields,
> > which at a glance looked like it provided a way to provide the token
> stream
> > through an extension.  Definitely need to look more though.
> >
> > All of this said though, is this really the right way to get payloads
> into
> > these types of fields?  Should a jira feature request be added for this?
> > On Aug 25, 2015 8:13 PM, "Jamie Johnson"  wrote:
> >
> >> Right, I had assumed (obviously here is my problem) that I'd be able to
> >> specify payloads for the field regardless of the field type.  Looking at
> >> TrieField that is certainly non-trivial.  After a bit of digging it
> appears
> >> that if I wanted to do something here I'd need to build a new TrieField,
> >> override createField and provide a Field that would return something
> like
> >> NumericTokenStream but also provide the payloads.  Like you said sounds
> >> "interesting" to say the least...
> >>
> >> Were payloads not really intended to be used for these types of fields
> >> from a Lucene perspective?
> >>
> >>
> >> On Tue, Aug 25, 2015 at 6:29 PM, Erick Erickson <
> erickerick...@gmail.com>
> >> wrote:
> >>
> >>> Well, you're going down a path that hasn't been trodden before ;).
> >>>
> >>> If you can treat your primitive types as text types you might get
> >>> some traction, but that makes a lot of operations like numeric
> >>> comparison difficult.
> >>>
> >>> H. another idea from left field. For single-valued types,
> >>> what about a sidecar field that has the auth token? And even
> >>> for a multiValued field, two parallel fields are guaranteed to
> >>> maintain order so perhaps you could do something here. Yes,
> >>> I'm waving my hands a LOT here.
> >>>
> >>> I suspect that trying to have a custom type that incorporates
> >>> payloads for, say, trie fields will be "interesting" to say the least.
> >>> Numeric types are packed to save storage etc. so it'll be
> >>> an adventure..
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>> On Tue, Aug 25, 2015 at 2:43 PM, Jamie Johnson 
> wrote:
> >>> > We were originally using this approach, i.e. run things through the
> >>> > KeywordTokenizer -> DelimitedPayloadFilter -> WordDelimiterFilter.
> >>> Again
> >>> > this works fine for text, though I had wanted to use the
> >>> StandardTokenizer
> >>> > in the chain.  Is there an equivalent filter that does what the
> >>> > StandardTokenizer does?
> >>> >
> >>> > All of this said this doesn't address the issue of the primitive
> field
> >>> > types, which at this point is the bigger issue.  Given this use case
> >>> should
> >>> > there be another way to provide payloads?
> >>> >
> >>> > My current thinking is that I will need to provide custom
> >>> implementations
> >>> > for all of the field types I would like to support payloads on which
> >>> will
> >>> > essentially be copies of the standard versions with some extra
> "sugar"
> >>> to
> >>> > read/write the payloads (I don't see a way to wrap/delegate these at
> >>> this
> >>> > point because AttributeSource has the attribute retrieval related
> >>> methods
> >>> > as final so I can't simply wrap another tokenizer and return my added
> >>> > attributes + the wrapped attributes).  I know my use case is a bit
> >>> strange,
> >>> > but I had not expected to need to do this given that Lucene/Solr
> >>> supports
> >>> > payloads on these field types, they just aren't exposed.
> >>> >
> >>> > As always I appreciate any ideas if I'm barking up the wrong tree
> here.
> >>> >
> >>> > On Tue, Aug 25, 2015 at 2:52 PM, Markus Jelsma <
> >>> markus.jel...@openindex.io>
> >>> > wrote:
> >>> >
> >>> >> Well, if i remember correctly (i have no testing facility at hand)
> >>> >> WordDelimiterFilter maintains payloads on emitted sub terms. So if
> you
> >>> use
> >>> >> a KeywordTokenizer, input 'some text^PAYLOAD', and have a
> >>> >> DelimitedPayloadFilter, the entire string gets a payload. You can
> then
> >>> >> split th

Re: Solr performance is slow with just 1GB of data indexed

2015-08-26 Thread Toke Eskildsen
On Wed, 2015-08-26 at 15:47 +0800, Zheng Lin Edwin Yeo wrote:

> Now I've tried to increase the carrot.fragSize to 75 and
> carrot.summarySnippets to 2, and set the carrot.produceSummary to
> true. With this setting, I'm mostly able to get the cluster results
> back within 2 to 3 seconds when I set rows=200. I'm still trying out
> to see if the cluster labels are ok, but in theory do you think this
> is a suitable setting to attempt to improve the clustering results and
> at the same time improve the performance?

I don't know - the quality/performance point as well as which knobs to
tweak is extremely dependent on your corpus and your hardware. A person
with better understanding of carrot might be able to do better sanity
checking, but I am not at all at that level.

Related, it seems to me that the question of how to tweak the clustering
has little to do with Solr and a lot to do with carrot (assuming here
that carrot is the bottleneck). You might have more success asking in a
carrot forum?


- Toke Eskildsen, State and University Library, Denmark





Re: splitting shards on 4.7.2 with custom plugins

2015-08-26 Thread Jeff Courtade
Hi,


So I got the shards to split. But they are very unbalanced.


 7204922 total docs on the original collection

shard1_0 numdocs 3661699

shard1_1 numdocs 3543132

shard2_0 numdocs 0

shard2_1 numdocs 0

Any ideas?

This is what I had to do to get this to split with the custom libs.

I got shard1 to split successfully, and it created replicas on the other
servers in the cloud for the new shard/shards.


This is the gist of it.


When you split a shard, Solr creates 2 new cores.

When creating a core, it uses the solr/solr.xml settings for the classpath
etc.

This is why searches etc. work fine and can find the opa plugins, but when
we called SPLITSHARD it could not.


I had to move the custom jars outside of the collection directory and add
this to solr/solr.xml on the 4 nodes.


info here  https://wiki.apache.org/solr/Solr.xml%204.4%20and%20beyond






<str name="sharedLib">${sharedLib:../lib}</str>


When you restart, you can see it in the log loading the jars from the new
location.



INFO  - 2015-08-25 23:40:52.297; org.apache.solr.core.CoreContainer;
loading shared library: /opt/solr/solr-4.7.2/solr01/solr/../lib

INFO  - 2015-08-25 23:40:52.298; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/opt/solr/solr-4.7.2/solr01/lib/commons-pool-1.6.jar' to
classloader

INFO  - 2015-08-25 23:40:52.298; org.apache.solr.core.SolrResourceLoader;
Adding
'file:/opt/solr/solr-4.7.2/solr01/lib/query-processing-language-0.2-SNAPSHOT.jar'
to classloader

INFO  - 2015-08-25 23:40:52.299; org.apache.solr.core.SolrResourceLoader;
Adding
'file:/opt/solr/solr-4.7.2/solr01/lib/jetty-continuation-8.1.10.v20130312.jar'
to classloader

INFO  - 2015-08-25 23:40:52.301; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/opt/solr/solr-4.7.2/solr01/lib/groovy-all-2.0.4.jar' to
classloader

INFO  - 2015-08-25 23:40:52.302; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/opt/solr/solr-4.7.2/solr01/lib/qpl-solr472-0.2-SNAPSHOT.jar'
to classloader

INFO  - 2015-08-25 23:40:52.302; org.apache.solr.core.SolrResourceLoader;
Adding
'file:/opt/solr/solr-4.7.2/solr01/lib/jetty-jmx-8.1.10.v20130312.jar' to
classloader

INFO  - 2015-08-25 23:40:52.303; org.apache.solr.core.SolrResourceLoader;
Adding
'file:/opt/solr/solr-4.7.2/solr01/lib/jetty-deploy-8.1.10.v20130312.jar' to
classloader

INFO  - 2015-08-25 23:40:52.303; org.apache.solr.core.SolrResourceLoader;
Adding 'file:/opt/solr/solr-4.7.2/solr01/lib/ext/' to classloader

INFO  - 2015-08-25 23:40:52.303; org.apache.solr.core.SolrResourceLoader;
Adding
'file:/opt/solr/solr-4.7.2/solr01/lib/jetty-xml-8.1.10.v20130312.jar' to
classloader

so I then ran the split and checked on it in the morning

http://dj01.aws.narasearch.us:8981/solr/admin/collections?action=SPLITSHARD&collection=collection1&shard=shard1


it succeeded and created replicas.

ls /opt/solr/solr-4.7.2/solr0*/solr/

/opt/solr/solr-4.7.2/solr01/solr/:
bin  collection1_shard1_0_replica1  README.txt  zoo.cfg
collection1  collection1_shard1_1_replica1  solr.xml

/opt/solr/solr-4.7.2/solr02/solr/:
bin  collection1  README.txt  solr.xml  zoo.cfg

/opt/solr/solr-4.7.2/solr03/solr/:
bin  collection1  collection1_shard1_0_replica2  README.txt  solr.xml
 zoo.cfg

/opt/solr/solr-4.7.2/solr04/solr/:
bin  collection1  collection1_shard1_1_replica2  README.txt  solr.xml
 zoo.cfg


and it actually distributed it

[root@dj01 solr]# du -sh *
4.0K    bin
41G     collection1
18G     collection1_shard1_0_replica1
16G     collection1_shard1_1_replica1
4.0K    README.txt
4.0K    solr.xml
4.0K    zoo.cfg
[root@dj01 solr]# du -sh
/opt/solr/solr-4.7.2/solr04/solr/collection1_shard1_1_replica2
16G /opt/solr/solr-4.7.2/solr04/solr/collection1_shard1_1_replica2
[root@dj01 solr]# du -sh
/opt/solr/solr-4.7.2/solr03/solr/collection1_shard1_0_replica2
18G /opt/solr/solr-4.7.2/solr03/solr/collection1_shard1_0_replica2


Jeff Courtade
M: 240.507.6116
On Aug 25, 2015 11:09 PM, "Anshum Gupta"  wrote:

> Can you elaborate a bit more on the setup, what do the custom plugins do,
> what error do you get ? It seems like a classloader/classpath issue to me
> which doesn't really relate to Shard splitting.
>
>
> On Tue, Aug 25, 2015 at 7:59 PM, Jeff Courtade 
> wrote:
>
> > I am getting failures when trying too split shards on solr 4.2.7 with
> > custom plugins.
> >
> > It fails regularily it cannot find the jar files for  plugins when
> creating
> > the new cores/shards.
> >
> > Ideas?
> >
> > --
> > Thanks,
> >
> > Jeff Courtade
> > M: 240.507.6116
> >
>
>
>
> --
> Anshum Gupta
>


Re: splitting shards on 4.7.2 with custom plugins

2015-08-26 Thread Jeff Courtade
I'm looking at the clusterstate.json to see why it is doing this. I really
don't understand it, though...

{"collection1":{
"shards":{
  "shard1":{
"range":"8000-",
"state":"active",
"replicas":{
  "core_node1":{
"state":"active",
"base_url":"http://10.135.2.153:8981/solr";,
"core":"collection1",
"node_name":"10.135.2.153:8981_solr",
"leader":"true"},
  "core_node10":{
"state":"active",
"base_url":"http://10.135.2.153:8982/solr";,
"core":"collection1",
"node_name":"10.135.2.153:8982_solr"}}},
  "shard2":{
"range":"0-7fff",
"state":"inactive",
"replicas":{
  "core_node9":{
"state":"active",
"base_url":"http://10.135.2.153:8984/solr";,
"core":"collection1",
"node_name":"10.135.2.153:8984_solr",
"leader":"true"},
  "core_node11":{
"state":"active",
"base_url":"http://10.135.2.153:8983/solr";,
"core":"collection1",
"node_name":"10.135.2.153:8983_solr"}}},
  "shard1_1":{
"range":null,
"state":"active",
"parent":null,
"replicas":{
  "core_node6":{
"state":"active",
"base_url":"http://10.135.2.153:8981/solr";,
"core":"collection1_shard1_1_replica1",
"node_name":"10.135.2.153:8981_solr",
"leader":"true"},
  "core_node8":{
"state":"active",
"base_url":"http://10.135.2.153:8984/solr";,
"core":"collection1_shard1_1_replica2",
"node_name":"10.135.2.153:8984_solr"}}},
  "shard1_0":{
"range":null,
"state":"active",
"parent":null,
"replicas":{
  "core_node5":{
"state":"active",
"base_url":"http://10.135.2.153:8981/solr";,
"core":"collection1_shard1_0_replica1",
"node_name":"10.135.2.153:8981_solr",
"leader":"true"},
  "core_node7":{
"state":"active",
"base_url":"http://10.135.2.153:8983/solr";,
"core":"collection1_shard1_0_replica2",
"node_name":"10.135.2.153:8983_solr"}}},
  "shard2_0":{
"range":"0-3fff",
"state":"active",
"replicas":{
  "core_node13":{
"state":"active",
"base_url":"http://10.135.2.153:8984/solr";,
"core":"collection1_shard2_0_replica1",
"node_name":"10.135.2.153:8984_solr",
"leader":"true"},
  "core_node14":{
"state":"active",
"base_url":"http://10.135.2.153:8982/solr";,
"core":"collection1_shard2_0_replica2",
"node_name":"10.135.2.153:8982_solr"}}},
  "shard2_1":{
"range":"4000-7fff",
"state":"active",
"replicas":{
  "core_node12":{
"state":"active",
"base_url":"http://10.135.2.153:8984/solr";,
"core":"collection1_shard2_1_replica1",
"node_name":"10.135.2.153:8984_solr",
"leader":"true"},
  "core_node15":{
"state":"active",
"base_url":"http://10.135.2.153:8981/solr";,
"core":"collection1_shard2_1_replica2",
"node_name":"10.135.2.153:8981_solr",
"maxShardsPerNode":"1",
"router":{"name":"compositeId"},
"replicationFactor":"1",
"autoCreated":"true"}}


--
Thanks,

Jeff Courtade
M: 240.507.6116

On Wed, Aug 26, 2015 at 8:44 AM, Jeff Courtade 
wrote:

> Hi,
>
>
> So i got the shards too split. But they are very unbalanced.
>
>
>  7204922 total docs on the original collection
>
> shard1_0 numdocs 3661699
>
> shard1_1 numdocs 3543132
>
> shard2_0 numdocs 0
>
> shard2_1 numdcs 0
>
> Any ideas?
>
> This is what i had to do to get this to split with the custom libs
>
> I got shard1 to split successfully and it created replicas on the other
> servers in the cloud for the new shard/shards.
>
>
> This is the jist of it.
>
>
> When you split a shard solr creates a 2 new cores.
>
> When creating a core it uses the solr/solr.xml settings for classpath
> etc
>
> This is why searches etc work fine and can find the opa plugins but when
> we called shardsplit it could not.
>
>
> I had to move the custom jars outside of the collection directory and add
> this to solr/solr.xml on the 4 nodes.
>
>
> info here  https://wiki.apache.org/solr/Solr.xml%204.4%20and%20beyond
>
>
>
> 
>
>
> ${sharedLib:../lib}
>
>
> when you restart you can see it in the log loading the jars form the new
> location.
>
>
>
> INFO  - 2015-08-25 23:40:52.297; org.apache.solr.core.CoreContainer;
> loading shared library: /opt/solr/solr-4.7.2/solr01/solr/../lib
>
> INFO  - 2015-08-25 23:40:52.298; org.apache.solr.core.SolrResourceLoader;
> Adding 'file:/opt/solr/solr-4.7.2/solr01/lib/commo

Connect and sync two solr server

2015-08-26 Thread shahper

Hi,

I want to connect two SolrCloud servers and sync their indexes to each
other, so that if either server is down we can work with the other, and
whenever I update or add to the index on either server the other also gets
updated.


shahper











Re: Search opening hours

2015-08-26 Thread Darren Spehr
If you wanted to try a spatial approach that blended times like above, you
could try a polygon of minimum width that spans the globe - this is
literally using spatial search (geocodes) against time. So in this scenario
you logically subdivide the polygon into 7 distinct regions (for days) and
then within this you can define, like a timeline, what open and closed
mean. The problem of 3AM is taken care of because of its continuous
nature - i.e. one day is adjacent to the next, with Sunday and Monday backing
up to each other. Just a thought.

On Wed, Aug 26, 2015 at 5:38 AM, Upayavira  wrote:

>
>
> On Wed, Aug 26, 2015, at 10:17 AM, O. Klein wrote:
> > Those options don't fix my problem with closing times the next morning,
> > or is
> > there a way to do this?
>
> Use the spatial model, and a time window of a week. There are 10,080
> minutes in a week, so you could use that as your scale.
>
> Assuming the week starts at 00:00 Monday morning, you might index Monday
> 9:00-23:00 as  540:1380
>
> Tuesday 9am-Wednesday 1am would be 1980:2940
>
> You convert your NOW time into a "minutes since Monday 00:00" and do a
> spatial search within that time.
>
> If it is now Monday, 11:23am, that would be 11*60+23=683, so you would
> do a search for 683:683.
>
> If you have a shop that is open over Sunday night to Monday, you just
> list it as open until Sunday 23:59 and open again Monday 00:00.
>
> Would that do it?
>
> Upayavira
>



-- 
Darren


Re: Search opening hours

2015-08-26 Thread Upayavira
Darren,

That was delightfully dense. Do you think you could unpack it a bit
more? Possibly some sample (pseudo) queries?

Upayavira 

On Wed, Aug 26, 2015, at 03:02 PM, Darren Spehr wrote:
> If you wanted to try a spatial approach that blended times like above,
> you
> could try a polygon of minimum width that spans the globe - this is
> literally using spatial search (geocodes) against time. So in this
> scenario
> you logically subdivide the polygon into 7 distinct regions (for days)
> and
> then within this you can defined, like a timeline, what open and closed
> means. The problem of 3AM is taken care of because of it's continuous
> nature - ie one day is adjacent to the next, with Sunday and Monday
> backing
> up to each other. Just a thought.
> 
> On Wed, Aug 26, 2015 at 5:38 AM, Upayavira  wrote:
> 
> >
> >
> > On Wed, Aug 26, 2015, at 10:17 AM, O. Klein wrote:
> > > Those options don't fix my problem with closing times the next morning,
> > > or is
> > > there a way to do this?
> >
> > Use the spatial model, and a time window of a week. There are 10,080
> > minutes in a week, so you could use that as your scale.
> >
> > Assuming the week starts at 00:00 Monday morning, you might index Monday
> > 9:00-23:00 as  540:1380
> >
> > Tuesday 9am-Wednesday 1am would be 1980:2940
> >
> > You convert your NOW time into a "minutes since Monday 00:00" and do a
> > spatial search within that time.
> >
> > If it is now Monday, 11:23am, that would be 11*60+23=683, so you would
> > do a search for 683:683.
> >
> > If you have a shop that is open over Sunday night to Monday, you just
> > list it as open until Sunday 23:59 and open again Monday 00:00.
> >
> > Would that do it?
> >
> > Upayavira
> >
> 
> 
> 
> -- 
> Darren


Re: Search opening hours

2015-08-26 Thread Upayavira
"delightfully dense" = really intriguing, but I couldn't quite
understand it - really hoping for more info

On Wed, Aug 26, 2015, at 03:49 PM, Upayavira wrote:
> Darren,
> 
> That was delightfully dense. Do you think you could unpack it a bit
> more? Possibly some sample (pseudo) queries?
> 
> Upayavira 
> 
> On Wed, Aug 26, 2015, at 03:02 PM, Darren Spehr wrote:
> > If you wanted to try a spatial approach that blended times like above,
> > you
> > could try a polygon of minimum width that spans the globe - this is
> > literally using spatial search (geocodes) against time. So in this
> > scenario
> > you logically subdivide the polygon into 7 distinct regions (for days)
> > and
> > then within this you can defined, like a timeline, what open and closed
> > means. The problem of 3AM is taken care of because of it's continuous
> > nature - ie one day is adjacent to the next, with Sunday and Monday
> > backing
> > up to each other. Just a thought.
> > 
> > On Wed, Aug 26, 2015 at 5:38 AM, Upayavira  wrote:
> > 
> > >
> > >
> > > On Wed, Aug 26, 2015, at 10:17 AM, O. Klein wrote:
> > > > Those options don't fix my problem with closing times the next morning,
> > > > or is
> > > > there a way to do this?
> > >
> > > Use the spatial model, and a time window of a week. There are 10,080
> > > minutes in a week, so you could use that as your scale.
> > >
> > > Assuming the week starts at 00:00 Monday morning, you might index Monday
> > > 9:00-23:00 as  540:1380
> > >
> > > Tuesday 9am-Wednesday 1am would be 1980:2940
> > >
> > > You convert your NOW time into a "minutes since Monday 00:00" and do a
> > > spatial search within that time.
> > >
> > > If it is now Monday, 11:23am, that would be 11*60+23=683, so you would
> > > do a search for 683:683.
> > >
> > > If you have a shop that is open over Sunday night to Monday, you just
> > > list it as open until Sunday 23:59 and open again Monday 00:00.
> > >
> > > Would that do it?
> > >
> > > Upayavira
> > >
> > 
> > 
> > 
> > -- 
> > Darren


Re: Solr performance is slow with just 1GB of data indexed

2015-08-26 Thread Zheng Lin Edwin Yeo
Thanks for your recommendation Toke.

Will try to ask in the carrot forum.

Regards,
Edwin

On 26 August 2015 at 18:45, Toke Eskildsen  wrote:

> On Wed, 2015-08-26 at 15:47 +0800, Zheng Lin Edwin Yeo wrote:
>
> > Now I've tried to increase the carrot.fragSize to 75 and
> > carrot.summarySnippets to 2, and set the carrot.produceSummary to
> > true. With this setting, I'm mostly able to get the cluster results
> > back within 2 to 3 seconds when I set rows=200. I'm still trying out
> > to see if the cluster labels are ok, but in theory do you think this
> > is a suitable setting to attempt to improve the clustering results and
> > at the same time improve the performance?
>
> I don't know - the quality/performance point as well as which knobs to
> tweak is extremely dependent on your corpus and your hardware. A person
> with better understanding of carrot might be able to do better sanity
> checking, but I am not at all at that level.
>
> Related, it seems to me that the question of how to tweak the clustering
> has little to do with Solr and a lot to do with carrot (assuming here
> that carrot is the bottleneck). You might have more success asking in a
> carrot forum?
>
>
> - Toke Eskildsen, State and University Library, Denmark
>
>
>
>


Re: Search opening hours

2015-08-26 Thread Darren Spehr
Sure - and sorry for its density. I reread it and thought the same ;)

So imagine a polygon of say 1/2 mile width (I made that up) that stretches
around the equator. Let's call this a week's timeline and subdivide it into
7 blocks, one for each day. For the sake of simplicity, assume it's a line
(which, if I recall correctly, is supported in Solr as an infinitely small polygon)
starting at (0,-180) for Monday at 12:00 AM and ending back at (0,180) for
Sunday at 11:59 PM. By subdivide you can think of it either radially or by
longitude, but you have 360 degrees to divide into 7, which means that
every hour is represented by a range of roughly 2.143 degrees (360/7/24).
These regions represent each day and hour (or less), and the region
boundaries represent midnight for the day before.

Now for indexing - your open hours then become a combination of these
subdivisions. If you're open 24x7 then the whole polygon is indexed. If
you're only open on Monday from 9-5 then only the polygon between
(0,-160.7) and (0,-143.57) is indexed. With careful attention to detail you
can index any combination of times this way.

So now the varsity question is how to do this with a fluctuating calendar?
I think this example can be extended to include searching against any given
day of the week in a year, or years. Just imagine a translation layer that
adjusts the latitude N or S by some amount to represent which day in which
year you're looking for. Make sense?

On Wed, Aug 26, 2015 at 10:50 AM, Upayavira  wrote:

> "delightfully dense" = really intriguing, but I couldn't quite
> understand it - really hoping for more info
>
> On Wed, Aug 26, 2015, at 03:49 PM, Upayavira wrote:
> > Darren,
> >
> > That was delightfully dense. Do you think you could unpack it a bit
> > more? Possibly some sample (pseudo) queries?
> >
> > Upayavira
> >
> > On Wed, Aug 26, 2015, at 03:02 PM, Darren Spehr wrote:
> > > If you wanted to try a spatial approach that blended times like above,
> > > you
> > > could try a polygon of minimum width that spans the globe - this is
> > > literally using spatial search (geocodes) against time. So in this
> > > scenario
> > > you logically subdivide the polygon into 7 distinct regions (for days)
> > > and
> > > then within this you can defined, like a timeline, what open and closed
> > > means. The problem of 3AM is taken care of because of it's continuous
> > > nature - ie one day is adjacent to the next, with Sunday and Monday
> > > backing
> > > up to each other. Just a thought.
> > >
> > > On Wed, Aug 26, 2015 at 5:38 AM, Upayavira  wrote:
> > >
> > > >
> > > >
> > > > On Wed, Aug 26, 2015, at 10:17 AM, O. Klein wrote:
> > > > > Those options don't fix my problem with closing times the next
> morning,
> > > > > or is
> > > > > there a way to do this?
> > > >
> > > > Use the spatial model, and a time window of a week. There are 10,080
> > > > minutes in a week, so you could use that as your scale.
> > > >
> > > > Assuming the week starts at 00:00 Monday morning, you might index
> Monday
> > > > 9:00-23:00 as  540:1380
> > > >
> > > > Tuesday 9am-Wednesday 1am would be 1980:2940
> > > >
> > > > You convert your NOW time into a "minutes since Monday 00:00" and do
> a
> > > > spatial search within that time.
> > > >
> > > > If it is now Monday, 11:23am, that would be 11*60+23=683, so you
> would
> > > > do a search for 683:683.
> > > >
> > > > If you have a shop that is open over Sunday night to Monday, you just
> > > > list it as open until Sunday 23:59 and open again Monday 00:00.
> > > >
> > > > Would that do it?
> > > >
> > > > Upayavira
> > > >
> > >
> > >
> > >
> > > --
> > > Darren
>



-- 
Darren


Re: Solr 5.2.1 versus Solr 4.7.0 performance

2015-08-26 Thread Shawn Heisey
On 8/26/2015 1:11 AM, Esther Goldbraich wrote:
> We have benchmarked a set of queries on Solr 4.7.0 and 5.2.1 (with same 
> data, same solrconfig.xml) and saw better query performance on Solr 4.7.0 
> (5-15% better than 5.2.1, with an exception of 100% improvement for one of 
> the queries ).
> Using same JVM (IBM 1.7) and JVM params.
> Index's size is ~500G, spread over 64 shards, with replication factor 2.
> Do you know about any config / setup change for Solr 5.2.1 that can 
> improve the performance? Any idea what causes this behavior?

I have little experience comparing the performance of different
versions, but I have a general sense that OS disk caching becomes
increasingly important to Solr's performance as time goes on.  What this
means in real terms is that if you have enough memory for adequate OS
disk caching, using a later version of Solr will probably yield better
performance, but if you don't have enough memory, you might actually see
*worse* performance.

A question that might become important later, but doesn't really affect
the immediate things I'm thinking about: what GC tuning options are you
using?

How much RAM do you have in each machine, and how big is Solr's heap? 
How much index data actually lives on each server?  Be sure to count all
replicas on each machine.

https://wiki.apache.org/solr/SolrPerformanceProblems#RAM

Thanks,
Shawn



Re: how to index document with multiple words (phrases) and words permutation?

2015-08-26 Thread afrooz
Simon, thanks a lot. That is a great tool. I am trying to use it.
Great solution.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-index-document-with-multiple-words-phrases-and-words-permutation-tp4224919p4225425.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Search opening hours

2015-08-26 Thread Darren Spehr
Sorry - I didn't finish my thought. I need to address querying :) So, using
the above to define what's in the index, your queries for a day/time become
a CONTAINS operation against the field. Let's say that the field is defined
as a location_rpt using JTS and its Spatial Factory (which supports
polygons) - oh, and it would need to be multi-valued. Querying the field
would require first translating "now" or "in an hour" or "Monday at 9am" to
a geocode, then hitting the index with a CONTAINS request per the docs:

https://cwiki.apache.org/confluence/display/solr/Spatial+Search


On Wed, Aug 26, 2015 at 11:23 AM, Darren Spehr  wrote:

> Sure - and sorry for its density. I reread it and thought the same ;)
>
> So imagine a polygon of say 1/2 mile width (I made that up) that stretches
> around the equator. Let's call this a week's timeline and subdivide it into
> 7 blocks, one for each day. For the sake of simplicity assume it's a line
> (which I forget but is supported in Solr as an infinitely small polygon)
> starting at (0,-180) for Monday at 12:00 AM and ending back at (0,180) for
> Sunday at 11:59 PM. By subdivide you can think of it either radially or by
> longitude, but you have 360 degrees to divide into 7, which means that
> every hour is represented by a range of roughly 2.143 degrees (360/7/24).
> These regions represent each day and hour (or less), and the region
> boundaries represent midnight for the day before.
>
> Now for indexing - your open hours then become a combination of these
> subdivisions. If you're open 24x7 then the whole polygon is indexed. If
> you're only open on Monday from 9-5 then only the polygon between
> (0,-160.7) and (0,-143.57) is indexed. With careful attention to detail you
> can index any combination of times this way.
>
> So now the varsity question is how to do this with a fluctuating calendar?
> I think this example can be extended to include searching against any given
> day of the week in a year, or years. Just imagine a translation layer that
> adjusts the latitude N or S by some amount to represent which day in which
> year you're looking for. Make sense?
>
> On Wed, Aug 26, 2015 at 10:50 AM, Upayavira  wrote:
>
>> "delightfully dense" = really intriguing, but I couldn't quite
>> understand it - really hoping for more info
>>
>> On Wed, Aug 26, 2015, at 03:49 PM, Upayavira wrote:
>> > Darren,
>> >
>> > That was delightfully dense. Do you think you could unpack it a bit
>> > more? Possibly some sample (pseudo) queries?
>> >
>> > Upayavira
>> >
>> > On Wed, Aug 26, 2015, at 03:02 PM, Darren Spehr wrote:
>> > > If you wanted to try a spatial approach that blended times like above,
>> > > you
>> > > could try a polygon of minimum width that spans the globe - this is
>> > > literally using spatial search (geocodes) against time. So in this
>> > > scenario
>> > > you logically subdivide the polygon into 7 distinct regions (for days)
>> > > and
>> > > then within this you can defined, like a timeline, what open and
>> closed
>> > > means. The problem of 3AM is taken care of because of it's continuous
>> > > nature - ie one day is adjacent to the next, with Sunday and Monday
>> > > backing
>> > > up to each other. Just a thought.
>> > >
>> > > On Wed, Aug 26, 2015 at 5:38 AM, Upayavira  wrote:
>> > >
>> > > >
>> > > >
>> > > > On Wed, Aug 26, 2015, at 10:17 AM, O. Klein wrote:
>> > > > > Those options don't fix my problem with closing times the next
>> morning,
>> > > > > or is
>> > > > > there a way to do this?
>> > > >
>> > > > Use the spatial model, and a time window of a week. There are 10,080
>> > > > minutes in a week, so you could use that as your scale.
>> > > >
>> > > > Assuming the week starts at 00:00 Monday morning, you might index
>> Monday
>> > > > 9:00-23:00 as  540:1380
>> > > >
>> > > > Tuesday 9am-Wednesday 1am would be 1980:2940
>> > > >
>> > > > You convert your NOW time into a "minutes since Monday 00:00" and
>> do a
>> > > > spatial search within that time.
>> > > >
>> > > > If it is now Monday, 11:23am, that would be 11*60+23=683, so you
>> would
>> > > > do a search for 683:683.
>> > > >
>> > > > If you have a shop that is open over Sunday night to Monday, you
>> just
>> > > > list it as open until Sunday 23:59 and open again Monday 00:00.
>> > > >
>> > > > Would that do it?
>> > > >
>> > > > Upayavira
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Darren
>>
>
>
>
> --
> Darren
>



-- 
Darren
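[To make the querying side above concrete, a sketch that maps "now" onto the
equatorial week timeline (Monday 00:00 = -180 degrees) and builds a Contains
filter. The field name open_hours, the core setup, and the exact spatial
field configuration (location_rpt with JTS, as Darren describes) are
assumptions.]

import java.time.LocalDateTime;
import org.apache.solr.client.solrj.SolrQuery;

public class TimelineSpatialFilter {
  // 360 degrees of longitude spread over 10,080 minutes of the week
  static final double DEG_PER_MINUTE = 360.0 / (7 * 24 * 60);

  // Monday 00:00 -> -180.0; Monday 09:00 -> roughly -160.71, as in the example above
  static double toLongitude(LocalDateTime t) {
    int minutes = (t.getDayOfWeek().getValue() - 1) * 1440 + t.getHour() * 60 + t.getMinute();
    return -180.0 + minutes * DEG_PER_MINUTE;
  }

  public static void main(String[] args) {
    double lon = toLongitude(LocalDateTime.now());
    SolrQuery q = new SolrQuery("*:*");
    // Ask whether an indexed opening-hours shape contains the query point
    q.addFilterQuery("open_hours:\"Contains(POINT(" + lon + " 0))\"");
    System.out.println(q);
  }
}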


Re: re:New Solr installation fails to create core

2015-08-26 Thread deviantcode
Hi Scott,
How about, having logged in as a privileged user, you run create_core as
the solr user? Something like this on a Red Hat env:

sudo -u solr ./bin/solr create_core -c demo

KR
Henry



--
View this message in context: 
http://lucene.472066.n3.nabble.com/re-New-Solr-installation-fails-to-create-core-tp4221768p4225361.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Connect and sync two solr server

2015-08-26 Thread Erick Erickson
From the description, this is straightforward SolrCloud where you
have replicas on the separate machines, see:
https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud

A different way of accomplishing this would be the master/slave style, see:
https://cwiki.apache.org/confluence/display/solr/Index+Replication

Best,
Erick

On Wed, Aug 26, 2015 at 6:55 AM, shahper  wrote:
> Hi,
>
> I want to connect two solrcloud server. and sync there indexes to each other
> so that is any server is down we can work with other and whenever I update
> or add index in any server the other also get updated.
>
> shahper
>
>
>
>
>
>
>
>
>


Re: Exact substring search with ngrams

2015-08-26 Thread Erick Erickson
bq: my dog
has fleas
I wouldn't  want some variant of "og ha" to match,

Here's where the mysterious "positionIncrementGap" comes in. If you
make this field "multiValued",  and index this like this:

<field name="blah">my dog</field>
<field name="blah">has fleas</field>


or equivalently in SolrJ just
doc.addField("blah", "my dog");
doc.addField("blah", "has fleas");

then the position of "dog" will be 2 and the position of "has" will be
102 assuming
the positionIncrementGap is the default 100. N.B. I'm not sure you'll
see this in the
admin/analysis page or not.

Anyway, now your example won't match across the two parts unless
you specify a "slop" up in the 101 range.

Best,
Erick

On Wed, Aug 26, 2015 at 2:19 AM, Christian Ramseyer  wrote:
> On 26/08/15 00:24, Erick Erickson wrote:
>> Hmmm, this sounds like a nonsensical question, but "what do you mean
>> by arbitrary substring"?
>>
>> Because if your substrings consist of whole _tokens_, then ngramming
>> is totally unnecessary (and gets in the way). Phrase queries with no slop
>> fulfill this requirement.
>>
>> But let's assume you need to march within tokens, i.e. if the doc
>> contains "my dog has fleas", you need to match input like "as fle", in this
>> case ngramming is an option.
>
> Yeah the "as fle"-thing is exactly what I want to achieve.
>
>>
>> You have substantially different index and query time chains. The result is 
>> that
>> the offsets for all the grams at index time are the same in the quick 
>> experiment
>> I tried, all were 1. But at query time, each gram had an incremented 
>> position.
>>
>> I'd start by using the query time analysis chain for indexing also. Next, I'd
>> try enclosing multiple words in double quotes at query time and go from 
>> there.
>> What you have now is an anti-pattern in that having substantially
>> different index
>> and query time analysis chains is not something that's likely to be very
>> predictable unless you know _exactly_ what the consequences are.
>>
>> The admin/analysis page is your friend, in this case check the
>> "verbose" checkbox
>> to see what I mean.
>
> Hmm interesting. I had the additional \R tokenizer in the index chain
> because the the document can be multiple lines (but the search text is
> always a single line) and if the document was
>
> my dog
> has fleas
>
> I wouldn't want some variant of "og ha" to match, but I didn't realize
> it didn't give me any positions like you noticed.
>
> I'll try to experiment some more, thanks for the hints!
>
> Chris
>
>>
>> Best,
>> Erick
>>
>> On Tue, Aug 25, 2015 at 3:00 PM, Christian Ramseyer  wrote:
>>> Hi
>>>
>>> I'm trying to build an index for technical documents that basically
>>> works like "grep", i.e. the user gives an arbitray substring somewhere
>>> in a line of a document and the exact matches will be returned. I
>>> specifically want no stemming etc. and keep all whitespace, parentheses
>>> etc. because they might be significant. The only normalization is that
>>> the search should be case-insensitive.
>>>
>>> I tried to achieve this by tokenizing on line breaks, and then building
>>> trigrams of the individual lines:
>>>
>>> <fieldType name="..." class="solr.TextField">
>>>
>>>   <analyzer type="index">
>>>     <tokenizer class="solr.PatternTokenizerFactory"
>>>                pattern="\R" group="-1"/>
>>>     <filter class="solr.NGramFilterFactory"
>>>             minGramSize="3" maxGramSize="3"/>
>>>   </analyzer>
>>>
>>>   <analyzer type="query">
>>>     <tokenizer class="solr.NGramTokenizerFactory"
>>>                minGramSize="3" maxGramSize="3"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>> Then in the search, I use the edismax parser with mm=100%, so given the
>>> documents
>>>
>>>
>>> {"id":"test1","content":"
>>> encryption
>>> 10.0.100.22
>>> description
>>> "}
>>>
>>> {"id":"test2","content":"
>>> 10.100.0.22
>>> description
>>> "}
>>>
>>> and the query content:encryption, this will turn into
>>>
>>> "parsedquery_toString":
>>>
>>> "+((content:enc content:ncr content:cry content:ryp
>>> content:ypt content:pti content:tio content:ion)~8)",
>>>
>>> and return only the first document. All fine and dandy. But I have a
>>> problem with possible false positives. If the search is e.g.
>>>
>>> content:.100.22
>>>
>>> then the generated query will be
>>>
>>> "parsedquery_toString":
>>> "+((content:.10 content:100 content:00. content:0.2 content:.22)~5)",
>>>
>>> and because all of the tokens are also generated for document test2 in the
>>> proximity of 5, both documents will wrongly be returned.
>>>
>>> So somehow I'd need to express the query "content:.10 content:100
>>> content:00. content:0.2 content:.22" with *the tokens exactly in this
>>> order and nothing in between*. Is this somehow possible, maybe by using
>>> the termvectors/termpositions stuff? Or am I trying to do something
>>> that's fundamentally impossible? Other good ideas how to achieve this
>>> kind of behaviour?
>>>
>>> Thanks
>>> Christian
>>>
>>>
>>>
>


Re: New Solr installation fails to create collection/core

2015-08-26 Thread Erick Erickson
Deviantcode, did you look at the referenced JIRA:

https://issues.apache.org/jira/browse/SOLR-7826

Or is that irrelevant?

Best,
Erick

On Wed, Aug 26, 2015 at 1:58 AM, deviantcode  wrote:
> I ran into this exact problem trying out the latest Solr (5.3.0). @Scott,
> how did you fix it?
> KR
> Henry
>
>
>


Re: Exact substring search with ngrams

2015-08-26 Thread Upayavira
The analysis tab does not support multi-valued fields. It only analyses a
single field value.

On Wed, Aug 26, 2015, at 05:05 PM, Erick Erickson wrote:
> bq: my dog
> has fleas
> I wouldn't  want some variant of "og ha" to match,
> 
> Here's where the mysterious "positionIncrementGap" comes in. If you
> make this field "multiValued",  and index this like this:
> 
> <field name="blah">my dog</field>
> <field name="blah">has fleas</field>
> 
> or equivalently in SolrJ just
> doc.addField("blah", "my dog");
> doc.addField("blah", "has fleas");
> 
> then the position of "dog" will be 2 and the position of "has" will be
> 102 assuming
> the positionIncrementGap is the default 100. N.B. I'm not sure you'll
> see this in the
> admin/analysis page or not.
> 
> Anyway, now your example won't match across the two parts unless
> you specify a "slop" up in the 101 range.
> 
> Best,
> Erick
> 
> On Wed, Aug 26, 2015 at 2:19 AM, Christian Ramseyer 
> wrote:
> > On 26/08/15 00:24, Erick Erickson wrote:
> >> Hmmm, this sounds like a nonsensical question, but "what do you mean
> >> by arbitrary substring"?
> >>
> >> Because if your substrings consist of whole _tokens_, then ngramming
> >> is totally unnecessary (and gets in the way). Phrase queries with no slop
> >> fulfill this requirement.
> >>
> >> But let's assume you need to match within tokens, i.e. if the doc
> >> contains "my dog has fleas", you need to match input like "as fle", in this
> >> case ngramming is an option.
> >
> > Yeah the "as fle"-thing is exactly what I want to achieve.
> >
> >>
> >> You have substantially different index and query time chains. The result 
> >> is that
> >> the offsets for all the grams at index time are the same in the quick 
> >> experiment
> >> I tried, all were 1. But at query time, each gram had an incremented 
> >> position.
> >>
> >> I'd start by using the query time analysis chain for indexing also. Next, 
> >> I'd
> >> try enclosing multiple words in double quotes at query time and go from 
> >> there.
> >> What you have now is an anti-pattern in that having substantially
> >> different index
> >> and query time analysis chains is not something that's likely to be very
> >> predictable unless you know _exactly_ what the consequences are.
> >>
> >> The admin/analysis page is your friend, in this case check the
> >> "verbose" checkbox
> >> to see what I mean.
> >
> > Hmm interesting. I had the additional \R tokenizer in the index chain
> > because the document can be multiple lines (but the search text is
> > always a single line) and if the document was
> >
> > my dog
> > has fleas
> >
> > I wouldn't want some variant of "og ha" to match, but I didn't realize
> > it didn't give me any positions like you noticed.
> >
> > I'll try to experiment some more, thanks for the hints!
> >
> > Chris
> >
> >>
> >> Best,
> >> Erick
> >>
> >> On Tue, Aug 25, 2015 at 3:00 PM, Christian Ramseyer  
> >> wrote:
> >>> Hi
> >>>
> >>> I'm trying to build an index for technical documents that basically
> >>> works like "grep", i.e. the user gives an arbitrary substring somewhere
> >>> in a line of a document and the exact matches will be returned. I
> >>> specifically want no stemming etc. and keep all whitespace, parentheses
> >>> etc. because they might be significant. The only normalization is that
> >>> the search should be case-insensitive.
> >>>
> >>> I tried to achieve this by tokenizing on line breaks, and then building
> >>> trigrams of the individual lines:
> >>>
> >>> <fieldType name="..." class="solr.TextField">
> >>>
> >>>   <analyzer type="index">
> >>>     <tokenizer class="solr.PatternTokenizerFactory"
> >>>                pattern="\R" group="-1"/>
> >>>     <filter class="solr.NGramFilterFactory"
> >>>             minGramSize="3" maxGramSize="3"/>
> >>>   </analyzer>
> >>>
> >>>   <analyzer type="query">
> >>>     <tokenizer class="solr.NGramTokenizerFactory"
> >>>                minGramSize="3" maxGramSize="3"/>
> >>>   </analyzer>
> >>> </fieldType>
> >>>
> >>> Then in the search, I use the edismax parser with mm=100%, so given the
> >>> documents
> >>>
> >>>
> >>> {"id":"test1","content":"
> >>> encryption
> >>> 10.0.100.22
> >>> description
> >>> "}
> >>>
> >>> {"id":"test2","content":"
> >>> 10.100.0.22
> >>> description
> >>> "}
> >>>
> >>> and the query content:encryption, this will turn into
> >>>
> >>> "parsedquery_toString":
> >>>
> >>> "+((content:enc content:ncr content:cry content:ryp
> >>> content:ypt content:pti content:tio content:ion)~8)",
> >>>
> >>> and return only the first document. All fine and dandy. But I have a
> >>> problem with possible false positives. If the search is e.g.
> >>>
> >>> content:.100.22
> >>>
> >>> then the generated query will be
> >>>
> >>> "parsedquery_toString":
> >>> "+((content:.10 content:100 content:00. content:0.2 content:.22)~5)",
> >>>
> >>> and because all of the tokens are also generated for document test2 in the
> >>> proximity of 5, both documents will wrongly be returned.
> >>>
> >>> So somehow I'd need to express the query "content:.10 content:100
> >>> content:00. content:0.2 content:.22" with *the tokens exactly in this
> >>> order and nothing in between*. Is this somehow possible, maybe by using
> >>> 

StrDocValues

2015-08-26 Thread Jamie Johnson
Are there any example implementation showing how StrDocValues works?  I am
not sure if this is the right place or not, but I was thinking about having
some document level doc value that I'd like to read in a function query to
impact if the document is returned or not.  Am I barking up the right tree
looking at this or is there another method to supporting this?


Re: Search opening hours

2015-08-26 Thread O. Klein
Darren,

This sounds like the solution I'm looking for, especially the nice fix for the
Sunday-Monday problem.

Never worked with spatial search before, so any pointers are welcome. 

Will start working on this solution.





Is Solr ready for Nested Documents importing and querying ?

2015-08-26 Thread Rafael
Hi, I'm using Solr and I'm starting to index my database. I work for a book
seller, and we have a lot of different publications (i.e. different
editions from different publishers) for the same book, so I was wondering
if it would be wise to model this "schema" using a hierarchical approach
(with nested docs). For example:

{
  title: 'The Hobbit',
  author: 'J. R. R. Tolkien',
  publications: [{
  isbn: 9780007591855,
  price: 0.99,
  pages: 200
}, {
  isbn: 9780007497904,
  price: 4.00,
  pages: 230
}
  ]
}

And, another question: how can I achieve this with the data-import-handler? I
found this: https://issues.apache.org/jira/browse/SOLR-5147 (I'm using Solr
5.3) and I was able to index the data, but I cannot retrieve the
publication values inside a book.
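
To make the question concrete, here is a minimal sketch of what I have in
mind (the doc_type field is made up; nested docs need some field that
distinguishes parents from children):

Indexing a parent with children through the JSON update handler:

  { "id": "book1", "doc_type": "book", "title": "The Hobbit",
    "_childDocuments_": [
      { "id": "pub1", "doc_type": "publication", "isbn": "9780007591855", "price": 0.99 },
      { "id": "pub2", "doc_type": "publication", "isbn": "9780007497904", "price": 4.00 }
    ]
  }

Returning each matching book together with its publications via the [child]
doc transformer:

  q=title:hobbit&fq=doc_type:book&fl=*,[child parentFilter=doc_type:book limit=100]

Searching on a child field but returning the parent book with the block join
parent parser:

  q={!parent which="doc_type:book"}isbn:9780007591855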

What do you think, guys? Or is it better to forget about nested documents
and go back to the old-fashioned denormalized approach?

Thanks.

[]'s
Rafael


Re: StrDocValues

2015-08-26 Thread Mikhail Khludnev
Hello Jamie,

Check here
https://github.com/apache/lucene-solr/blob/7f721a1f9323a85ce2b5b35e12b4788c31271b69/lucene/sandbox/src/java/org/apache/lucene/search/DocValuesRangeQuery.java#L185
Note: SortedSet works there even if the actual field is multiValued=false.
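
If you do want to read such a value from a function query yourself, here is a
minimal sketch of a custom ValueSource (class, field and token names are made
up); wrapping it in {!frange l=1} would drop documents where it returns 0:

  import java.io.IOException;
  import java.util.Map;
  import org.apache.lucene.index.DocValues;
  import org.apache.lucene.index.LeafReaderContext;
  import org.apache.lucene.index.SortedDocValues;
  import org.apache.lucene.queries.function.FunctionValues;
  import org.apache.lucene.queries.function.ValueSource;
  import org.apache.lucene.queries.function.docvalues.DoubleDocValues;

  public class TokenMatchValueSource extends ValueSource {
    private final String field;
    private final String requiredToken;

    public TokenMatchValueSource(String field, String requiredToken) {
      this.field = field;
      this.requiredToken = requiredToken;
    }

    @Override
    public FunctionValues getValues(Map context, LeafReaderContext readerContext)
        throws IOException {
      // single-valued string docValues for this segment (empty instance if the field is missing)
      final SortedDocValues dv = DocValues.getSorted(readerContext.reader(), field);
      return new DoubleDocValues(this) {
        @Override
        public double doubleVal(int doc) {
          int ord = dv.getOrd(doc);
          if (ord == -1) return 0d;                         // document has no value
          String value = dv.lookupOrd(ord).utf8ToString();  // the doc's stored token
          return requiredToken.equals(value) ? 1d : 0d;
        }
      };
    }

    @Override
    public String description() { return "tokenmatch(" + field + ")"; }

    @Override
    public boolean equals(Object o) {
      return o instanceof TokenMatchValueSource
          && field.equals(((TokenMatchValueSource) o).field)
          && requiredToken.equals(((TokenMatchValueSource) o).requiredToken);
    }

    @Override
    public int hashCode() { return field.hashCode() * 31 + requiredToken.hashCode(); }
  }

You would still need a small ValueSourceParser plugin registered in
solrconfig.xml to expose it under a function name.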


On Wed, Aug 26, 2015 at 8:48 PM, Jamie Johnson  wrote:

> Are there any example implementation showing how StrDocValues works?  I am
> not sure if this is the right place or not, but I was thinking about having
> some document level doc value that I'd like to read in a function query to
> impact if the document is returned or not.  Am I barking up the right tree
> looking at this or is there another method to supporting this?
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Data Import Handler use of JNDI decayed

2015-08-26 Thread Davis, Daniel (NIH/NLM) [C]
NLM tends to be rather security conscious. Nothing appears terribly wrong,
but the Solr layout doesn't include Jetty's start.ini or jetty.xml, so it
will have to be done the detailed way -
https://wiki.eclipse.org/Jetty/Feature/JNDI#Detailed_Setup
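
For anyone following along, a rough sketch of the pieces I expect to end up
with (resource name, driver and credentials are placeholders, and the Jetty
side still needs the jetty-env.xml / webapp wiring from the guide above):

A JNDI datasource declared on the Jetty side, e.g.:

  <New id="dihDataSource" class="org.eclipse.jetty.plus.jndi.Resource">
    <Arg></Arg>
    <Arg>jdbc/dihDataSource</Arg>
    <Arg>
      <New class="org.apache.commons.dbcp.BasicDataSource">
        <Set name="driverClassName">org.postgresql.Driver</Set>
        <Set name="url">jdbc:postgresql://dbhost:5432/docs</Set>
        <Set name="username">solr</Set>
        <Set name="password">secret</Set>
      </New>
    </Arg>
  </New>

and the DIH data-config then referencing it by name instead of embedding
credentials:

  <dataSource type="JdbcDataSource" jndiName="java:comp/env/jdbc/dihDataSource"/>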

Once I've figured it out, I'll request wiki edit permissions to add it in.

Dan Davis, Systems/Applications Architect (Contractor),
Office of Computer and Communications Systems,
National Library of Medicine, NIH



Re: Search opening hours

2015-08-26 Thread Darren Spehr
So thanks to the tireless efforts of David Smiley and the devs at Vivid
Solutions (not to mention the various contributors that help power Solr and
Lucene) spatial search is awesome, efficient and easy.  The biggest
roadblock I've run into is not having the JTS (Java Topology Suite) JAR
where Solr can find it. It doesn't ship with Solr OOB so you have to either
add it to one of the dynamic directories, or bundle it with the WAR (I
think pre-5.0). The link above has most of what you need to index data and
issue queries. I'd also suggest the sections on spatial search in "Solr In
Action" (Grainger, Potter) - they add a few more use cases that I've found
interesting. Finally, the aging wiki has some good info too:

http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4

Basically indexing spatial data is as easy as anything else: define the
field in the schema.xml, create the data and push it in. Now the data
in this case are boxes or polygons (effectively the same here) and come in
a specific format known as WKT, or Well-Known Text. I'd say unless you're
aiming at an advanced use case, set the max distance error (maxDistErr) on
the field config a little higher than normal - precision isn't really a
requirement here and good unit tests would alert you to any unforeseen
issues. Then for the
query side of the world you just ask for point inclusion like:

q=+polygon:"Contains(POINT(my_long my_lat))"

Please note that WKT reverses the order of lat/lng because it uses
Euclidean geometry conventions (so X=longitude and Y=latitude). Can't tell
you how many times my brain hurt thanks to this idiom combined with janky
client logic :) Anyway, that's about it - let me know if you have any other
questions.
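
To make the opening-hours idea concrete, a minimal sketch (the field and type
names are made up, and "minute of week" is just one possible encoding - here
Monday 00:00 is minute 0 and the week has 10080 minutes):

A non-geo RPT field in schema.xml:

  <fieldType name="minuteOfWeekRange"
             class="solr.SpatialRecursivePrefixTreeFieldType"
             geo="false" worldBounds="ENVELOPE(0, 10080, 1, 0)"
             distErrPct="0" maxDistErr="1"/>
  <field name="open_hours" type="minuteOfWeekRange"
         indexed="true" stored="true" multiValued="true"/>

Each opening interval is indexed as a rectangle whose X span is the minute
range and whose Y span is a dummy 0-1, e.g. Monday and Tuesday 09:00-17:30:

  open_hours: ENVELOPE(540, 1050, 1, 0)
  open_hours: ENVELOPE(1980, 2490, 1, 0)

"Open right now", say Tuesday 16:10 (minute 1440 + 970 = 2410), is then:

  fq=open_hours:"Contains(POINT(2410 0.5))"

An interval that crosses the Sunday-Monday boundary just gets split into two
rectangles at index time.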


On Wed, Aug 26, 2015 at 1:56 PM, O. Klein  wrote:

> Darren,
>
> This sounds like solution I'm looking for. Especially nice fix for the
> Sunday-Monday problem.
>
> Never worked with spatial search before, so any pointers are welcome.
>
> Will start working on this solution.
>
>
>
>



-- 
Darren


Securing Solr 5.3 with Basic Authentication

2015-08-26 Thread Gofio Code
With version 5.3, Solr has full-featured authentication and authorization
plugins that use Basic authentication and “permission rules”, which are
completely driven from ZooKeeper.

So I have tried that, without success, following the info in
https://cwiki.apache.org/confluence/display/solr/Securing+Solr and
http://lucidworks.com/blog/securing-solr-basic-auth-permission-rules:

I followed this steps:

*1) Set up a ZooKeeper ensemble (3 nodes).*

*2) I uploaded the file security.json to ZooKeeper*

I used this command to upload the file: zkcli.bat -zkhost localhost:2181
-cmd putfile /security.json security.json

Content of the file security.json:
{
"authentication":{
   "class":"solr.BasicAuthPlugin",
   "credentials":{"solr":"IV0EHq1OnNrj6gvRCwvFwTrZ1+z1oBbnQdiVC3otuq0=
Ndd7LKvVBAaZIF0QAVi1ekCfAJXr1GGfLtRUXhgrF8c="}
},
"authorization":{
   "class":"solr.RuleBasedAuthorizationPlugin",
   "user-role":{"solr":"admin"},
   "permissions":[{"name":"security-edit",
  "role":"admin"}]
}}

I also tried with this security.json content:

{"authentication":{"class":"solr.BasicAuthPlugin"},"authorization":{"class":"solr.RuleBasedAuthorizationPlugin"}}


*3) I started Solr 5.3.0 in cloud mode (with 'bootstrap'):*

I used this command:
./solr start -c -z "localhost:2181,localhost:2182,localhost:2183" -s
../server/solrcloud_test
-Dbootstrap_confdir=../server/solrcloud_test/configsets/basic_configs/conf
-Dcollection.configName=c_test_cfg -f


However, I can access http://localhost:8983/solr directly and the
browser doesn't ask me for credentials. In the Solr Admin UI I can see
/security.json (with the correct content) and even c_test_cfg under
/configs.

I can see this in the log when solr starts:

955  INFO  (main) [   ] o.a.s.c.CoreContainer Security conf doesn't exist.
Skipping setup for authorization module.
955  INFO  (main) [   ] o.a.s.c.CoreContainer No authentication plugin used.

Can anybody tell me what I'm doing wrong??


Re: StrDocValues

2015-08-26 Thread Jamie Johnson
I think I found it.  {!boost..} gave me what I was looking for, and then a
custom collector filtered out anything that I didn't want to show.

On Wed, Aug 26, 2015 at 1:48 PM, Jamie Johnson  wrote:

> Are there any example implementation showing how StrDocValues works?  I am
> not sure if this is the right place or not, but I was thinking about having
> some document level doc value that I'd like to read in a function query to
> impact if the document is returned or not.  Am I barking up the right tree
> looking at this or is there another method to supporting this?
>


find documents based on specific term frequency

2015-08-26 Thread Tang, Rebecca
Hi there,

We have an index built on Solr 5.0.  We received a user question:
"Is there a way to search for documents that have a word appearing more than a 
certain number of times? For example, I want to find documents that only have 
more than 10 instances of the word "genetics" …"

I'm not sure if it's possible to do this with Solr.  Does anyone know?


Rebecca Tang
Applications Developer, UCSF CKM
Industry Documents Digital Libraries
E: rebecca.t...@ucsf.edu



Re: StrDocValues

2015-08-26 Thread Jamie Johnson
I don't see it explicitly mentioned, but does the boost only get applied to
the final documents/score that matched the provided query or is it called
for each field that matched?  I'm assuming only once per document that
matched the main query, is that right?

On Wed, Aug 26, 2015 at 5:35 PM, Jamie Johnson  wrote:

> I think I found it.  {!boost..} gave me what i was looking for and then a
> custom collector filtered out anything that I didn't want to show.
>
> On Wed, Aug 26, 2015 at 1:48 PM, Jamie Johnson  wrote:
>
>> Are there any example implementation showing how StrDocValues works?  I
>> am not sure if this is the right place or not, but I was thinking about
>> having some document level doc value that I'd like to read in a function
>> query to impact if the document is returned or not.  Am I barking up the
>> right tree looking at this or is there another method to supporting this?
>>
>
>


Re: find documents based on specific term frequency

2015-08-26 Thread Chris Hostetter

: "Is there a way to search for documents that have a word appearing more 
: than a certain number of times? For example, I want to find documents 
: that only have more than 10 instances of the word "genetics" …"

Try...

q=text:genetics&fq={!frange+incl=false+l=10}termfreq('text','genetics')

Note: the q=text:genetics isn't necessary -- you could do any query and 
then filter on the numeric function range of the termfreq() function, or 
use that {!frange} as your main query (in which case all matching docs will 
have identical scores).  I just included that in the example to show how 
you can search & sort by the "normal" style scoring (which takes into 
account full TF-IDF and length normalization) while filtering on the TF 
using a function query.

You can also request the termfreq() as a pseudo field for each doc in 
the results, and parameterize the details to eliminate redundancy in 
the request params...


...&fq={!frange+incl=false+l=10+v=$tf}&fl=*,$tf&tf=termfreq('text','genetics')

Is the same as...

...&fq={!frange+incl=false+l=10}termfreq('text','genetics')&fl=*,termfreq('text','genetics')


A big caveat to this however is that the termfreq function operates on the 
*RAW* underlying term values -- no query time analyzer is used -- so 
if you do stemming or lowercasing in your index analyzer, you have to 
pass the stemmed/lowercased values to the function.  (Although I just filed 
SOLR-7981 since it occurs to me we can make this automatic in the future 
with a new function argument.)

https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-FunctionRangeQueryParser
https://cwiki.apache.org/confluence/display/solr/Function+Queries



-Hoss
http://www.lucidworks.com/

Re: StrDocValues

2015-08-26 Thread Yonik Seeley
On Wed, Aug 26, 2015 at 6:20 PM, Jamie Johnson  wrote:
> I don't see it explicitly mentioned, but does the boost only get applied to
> the final documents/score that matched the provided query or is it called
> for each field that matched?  I'm assuming only once per document that
> matched the main query, is that right?

Correct.

-Yonik


Re: Lucene/Solr 5.0 and custom FieldCahe implementation

2015-08-26 Thread Jamie Johnson
Sorry to poke this again, but I'm not following the last comment about how I
could go about extending SolrIndexSearcher and having the extension
used.  Is there an example of this?  Thanks again

Jamie
On Aug 25, 2015 7:18 AM, "Jamie Johnson"  wrote:

> I had seen this as well; if I overrode this by extending
> SolrIndexSearcher, how do I have my extension used?  I didn't see a way that
> it could be plugged in.
> On Aug 25, 2015 7:15 AM, "Mikhail Khludnev" 
> wrote:
>
>> On Tue, Aug 25, 2015 at 2:03 PM, Jamie Johnson  wrote:
>>
>> > Thanks Mikhail.  If I'm reading the SimpleFacets class correctly, it
>> > delegates to DocValuesFacets when the facet method is FC, which is what
>> > used to be FieldCache I believe.  DocValuesFacets either uses DocValues
>> > or builds them using the UninvertingReader.
>> >
>>
>> Ah, got it. Thanks for the reminder about these details. It seems like even
>> docValues=true doesn't help with your custom implementation.
>>
>>
>> >
>> > I am not seeing a clean extension point to add a custom
>> UninvertingReader
>> > to Solr, would the only way be to copy the FacetComponent and
>> SimpleFacets
>> > and modify as needed?
>> >
>> Sadly, yes. There is no proper extension point. Also, consider overriding
>> SolrIndexSearcher.wrapReader(SolrCore, DirectoryReader), where the
>> particular UninvertingReader is created; there you can pass your own one,
>> which refers to the custom FieldCache.
>>
>>
>> > On Aug 25, 2015 12:42 AM, "Mikhail Khludnev" <
>> mkhlud...@griddynamics.com>
>> > wrote:
>> >
>> > > Hello Jamie,
>> > > I don't understand how it could choose DocValuesFacets (which occurs for
>> > > docValues=true fields), but then switch to UninvertingReader/FieldCache,
>> > > which means docValues=false. If you can provide more details it would
>> > > be great.
>> > > Beside of that, I suppose you can only implement and inject your own
>> > > UninvertingReader, I don't think there is an extension point for this.
>> > It's
>> > > too specific requirement.
>> > >
>> > > On Tue, Aug 25, 2015 at 3:50 AM, Jamie Johnson 
>> > wrote:
>> > >
>> > > > as mentioned in a previous email I have a need to provide security
>> > > controls
>> > > > at the term level.  I know that Lucene/Solr doesn't support this so
>> I
>> > had
>> > > > baked something onto a 4.x baseline that was sufficient for my use
>> > cases.
>> > > > I am now looking to move that implementation to 5.x and am running
>> into
>> > > an
>> > > > issue around faceting.  Previously we were able to provide a custom
>> > cache
>> > > > implementation that would create separate cache entries given a
>> > > particular
>> > > > set of security controls, but in Solr 5 some faceting is delegated
>> to
>> > > > DocValuesFacets which delegates to UninvertingReader in my case (we
>> are
>> > > not
>> > > > storing DocValues).  The issue I am running into is that before 5.x
>> I
>> > had
>> > > > the ability to influence the FieldCache that was used at the Solr
>> level
>> > > to
>> > > > also include a security token into the key so each cache entry was
>> > scoped
>> > > > to a particular level.  With the current implementation the
>> FieldCache
>> > > > seems to be an internal detail that I can't influence in anyway.  Is
>> > this
>> > > > correct?  I had noticed this Jira ticket
>> > > > https://issues.apache.org/jira/browse/LUCENE-5427, is there any
>> > movement
>> > > > on
>> > > > this?  Is there another way to influence the information that is put
>> > into
>> > > > these caches?  As always thanks in advance for any suggestions.
>> > > >
>> > > > -Jamie
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Sincerely yours
>> > > Mikhail Khludnev
>> > > Principal Engineer,
>> > > Grid Dynamics
>> > >
>> > > 
>> > > 
>> > >
>> >
>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>> Principal Engineer,
>> Grid Dynamics
>>
>> 
>> 
>>
>