Solr Exact match boost Reduce the results

2015-06-12 Thread JACK
I have two fields; one is a copy field (dummy_name, copied from product_name)
whose only purpose is to score exact matches. I need the exact-match results
to come first, along with the entire result set of the fuzzy search.
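
For reference, a minimal sketch of the kind of exact-match companion field
being described, using the field names from this thread; the fieldType name
and analysis chain are assumptions, not the poster's actual schema:

<fieldType name="text_exact" class="solr.TextField">
  <analyzer>
    <!-- hypothetical: treat the whole value as a single lowercased token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="product_name" type="text_general" indexed="true" stored="true"/>
<field name="dummy_name" type="text_exact" indexed="true" stored="false"/>
<copyField source="product_name" dest="dummy_name"/>
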
1. To get the exact results first, I just put quotes around the search words.
That does bring the exact matches to the top, but the result set is far too
small, around 8,000. The query is:

q="laptop+bag"&df=product_name&defType=edismax&qf=product_name^0.01+dummy_name^200

2. The same query without quotes returns a far larger result set, but it does
not put the exact matches first:

q=laptop+bag&df=product_name&defType=edismax&qf=product_name^0.01+dummy_name^200

I need the large result set of the second option, with the exact matches
ranked first. Is this the right way to do it, or is there a problem in my
query?
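
For reference, the usual edismax way to keep the full unquoted result set
while still ranking exact or phrase matches first is to leave q unquoted and
add a phrase-field boost via pf; a sketch using the field names from this
thread (boost values are illustrative):

q=laptop bag
&defType=edismax
&qf=product_name^0.01 dummy_name^200
&pf=dummy_name^500 product_name^50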



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Exact-match-boost-Reduce-the-results-tp4211352.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Exact match boost Reduce the results

2015-06-12 Thread JACK
Hi Alessandro Benedetti ,

What I meant is this: suppose I have items like

dell laptop with bag
dell laptop
dell laptop without bag
dell inspiron laptop with bag

If I query for "dell laptop", the result should be:

dell laptop
dell laptop with bag
dell laptop without bag
dell inspiron laptop with bag

The exact match should come first; the rest can be in any order, but I should
still get the same number of results.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Exact-match-boost-Reduce-the-results-tp4211352p4211377.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Exact match boost Reduce the results

2015-06-12 Thread JACK
Hi, I need to search on the field product_name. To get exact matches first, I
created a copy field named dummy_name with the field definition above, and at
query time I simply boost that copy field. To actually get exact matches I
have to put quotes around the search words, but then the result set is much
smaller than the search without quotes. I need the same results as the
unquoted search, with the exact matches coming first.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Exact-match-boost-Reduce-the-results-tp4211352p4211409.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Exact match boost Reduce the results

2015-06-12 Thread JACK
The quoted search words will vary; they could be a single word or more than
one word. The ones in the query are just an example.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Exact-match-boost-Reduce-the-results-tp4211352p4211410.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Exact match boost Reduce the results

2015-06-12 Thread JACK
As explained above, I actually have around 10 lakh (1,000,000) documents, not
5 rows. It's not about synonyms. The Solr wiki FAQ says that to get
exact-match results first you should use a copy field with a different
configuration, which is why I followed this approach.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Exact-match-boost-Reduce-the-results-tp4211352p4211434.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Solr Exact match boost Reduce the results

2015-06-14 Thread JACK
Hi chillra,
I have changed the index-time and query-time field configuration, but it
still does not solve my problem.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Exact-match-boost-Reduce-the-results-tp4211352p4211788.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Exact match boost Reduce the results

2015-06-15 Thread JACK
Hi  Alessandro Benedetti,
The query is 
http://localhost:8983/solr/MYDBCORE/select?q=product_name:(laptop+bag)&wt=json&indent=true

1.Dell Inspiron 3542 Laptop (Black) without Laptop Bag
2.Dell 3542 15-inch Laptop with Laptop Bag by Dell
3.Dell Inspiron N3137 11-inch Laptop without Laptop Bag by Dell
4.Dell Inspiron 3442 14-Inch Laptop (Black) without Laptop Bag by Dell
5.Dell Inspiron 3442  Black 14 inch Laptop Without Laptop Bag by Dell
6.Dell Alienware 13-inch Laptop without Laptop Bag by Dell
7.Dell Vostro 3546 Laptop without Laptop Bag by Dell
8.Laptop - BAG
9.Laptop -BAG
10.Laptop-BAG

I need the last three results to come first; the rest can be in any order.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Exact-match-boost-Reduce-the-results-tp4211352p4211826.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Exact match boost Reduce the results

2015-06-15 Thread JACK
Hi  Alessandro Benedetti,
Here are my Analysis values for the index side:


WT, SF, WDF, LCF (each of these stages produces the same two tokens):

  text    raw_bytes            start  end  positionLength  type  position
  laptop  [6c 61 70 74 6f 70]  0      6    1               word  1
  bag     [62 61 67]           7      10   1               word  2

SF, PSF, KSF, EMSF (each of these stages produces the same two tokens, with a
keyword flag):

  text    raw_bytes            start  end  positionLength  type  keyword  position
  laptop  [6c 61 70 74 6f 70]  0      6    1               word  false    1
  bag     [62 61 67]           7      10   1               word  false    2

I can't understand how you get that result.
Can you help?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Exact-match-boost-Reduce-the-results-tp4211352p4211845.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Exact match boost Reduce the results

2015-06-22 Thread JACK
Hi  Alessandro Benedetti,

I have changed the query like this.

/select?q=("dell+laptop"~13+OR+"dell+laptop")&df=product_name&defType=edismax&qf=product_name^0.001+dummy_product_name^2&fl=product_name&wt=json&indent=true&debug=true

The corresponding results are given in the link below. Now I am getting the
exact match first.

http://pastebin.com/rAYrFiB8


Now the problem is the 8th result:
"product_name":"Dell Inspiron 15R 15.6-inch Laptop without Laptop Bag by
Dell"
It is not a relevant result. Can you check how this happens? Please look at
the scores.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Exact-match-boost-Reduce-the-results-tp4211352p4213382.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Exact match boost Reduce the results

2015-06-23 Thread JACK
Hi  Alessandro Benedetti,
Can you check what happens with the product_name given below? Just check its
score:

"product_name":"LAPTOP BATTERY DELL Inspiron 6400 1501 E1505 RD859 UD267
XU937"

How does this product come up in the results?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Exact-match-boost-Reduce-the-results-tp4211352p4213417.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr Boost Search word before Specific Content

2015-07-07 Thread JACK
I am using Solr 5.0.0 and I have a question about relevance boosting:

If I search for words like "laptop table", is there any way to boost results
where the search words appear before words like "by", "with", or "without"?

I used this query:

?defType=dismax
&q=foo bar
&bq=(*:* -by)^999

But this effectively boosts every document that does not contain "by" (or
"with", etc.), which pushes down the documents that do. How can I avoid that?

For example, if I search for "laptop table", the above query will not boost
the result "DGB Cooling Laptop Table by GDB", because it contains "by".

I just need to boost documents where the search words appear before certain
words like "by", "with", etc. Is that possible?

Example 2

If I search for "laptop bag", results where the search words appear before
"with", "by", "without", etc. should be boosted and returned first.

Say the documents are:

dell laptop with laptop bag
laptop bag with cover
laptop bag and table

The results should come back ordered like this:

laptop bag with cover
laptop bag and table
dell laptop with laptop bag

In the top two results the search phrase "laptop bag" appears before "with"
and "and"; results where the search words come before these words should be
ranked first.





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Boost-Search-word-before-Specific-Content-tp4216072.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Boost Search word before Specific Content

2015-07-07 Thread JACK
Hi Ahmet,

Can you elaborate a bit more?
Is it possible to solve my problem in Solr 5.0.0?
If yes, can you explain how?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Boost-Search-word-before-Specific-Content-tp4216072p4216257.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Ranking based on term position

2015-07-09 Thread JACK
Hi Li Li,

I am experiencing the same problem. Can you explain in a little more detail?
Where do I change these methods?
I am using Solr 5.0.0. How do I query this, and is any change needed at query
time?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Ranking-based-on-term-position-tp979271p4216522.html
Sent from the Solr - User mailing list archive at Nabble.com.


Get TF-IDF from index?

2007-11-19 Thread Jack
I have only used the basic functionality of Solr - post and query - which
works great.

I wonder if it's possible to get more information out of the index. For
example, I'd like to get the TF-IDF score of a given term, or get a list of
terms sorted by TF-IDF, not from a given document but from the whole corpus,
or simply to enumerate all indexed terms.

Maybe Solr doesn't expose this data, but I can use the Lucene jar to get it?
Since I am not familiar with Lucene either, any pointer is helpful.

Thanks,
jack
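
For reference, a rough sketch of enumerating indexed terms with their document
frequencies and a classic IDF value directly through the Lucene API. The
2007-era API differed; this targets a modern Lucene (8+), and the index path
and field name are assumptions:

import java.nio.file.Paths;
import org.apache.lucene.index.*;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class TermStats {
  public static void main(String[] args) throws Exception {
    // Open the index directory that Solr maintains (path is an assumption).
    try (IndexReader reader = DirectoryReader.open(
        FSDirectory.open(Paths.get("/path/to/solr/data/index")))) {
      String field = "text";                      // assumed field name
      Terms terms = MultiTerms.getTerms(reader, field);
      if (terms == null) return;
      TermsEnum te = terms.iterator();
      int numDocs = reader.numDocs();
      BytesRef term;
      while ((term = te.next()) != null) {
        int df = te.docFreq();                    // corpus-wide document frequency
        double idf = Math.log((double) numDocs / (df + 1));
        System.out.println(term.utf8ToString() + "\tdf=" + df + "\tidf=" + idf);
      }
    }
  }
}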


Re: LSA Implementation

2007-11-26 Thread Jack
Interesting. Patents are valid for 20 years so it expires next year? :)
PLSA does not seem to have been patented, at least not mentioned in
http://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis

On Nov 26, 2007 6:58 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> LSA (http://en.wikipedia.org/wiki/Latent_semantic_indexing) is
> patented, so it is not likely to happen unless the authors donate the
> patent to the ASF.
>
> -Grant
>
>
>
> On Nov 26, 2007, at 8:23 AM, Eswar K wrote:
>
> > All,
> >
> > Is there any plan to implement Latent Semantic Analysis as part of
> > Solr
> > anytime in the near future?
> >
> > Regards,
> > Eswar
>
> --
> Grant Ingersoll
> http://lucene.grantingersoll.com
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>
>
>


Re: the time factor

2008-05-20 Thread Jack
Hi Otis,

I tried this. It doesn't seem to solve my problem, though. I think
it's best used to make small adjustment when relevance scores are
similar. In my case, if I want to rank the most recent documents first
(because it's about news), I have to use very large boost, which will
end up getting the docs that are not so relevant to the top. I haven't
been able to get desired results of showing only recent documents with
decent relevance scores.

Ideally, I think it can be solved by doing a query for the past 24
hours and keeping the docs with best relevance scores, then another
query for the previous 24 hours ... but this really isn't very
efficient. Maybe OK for news because I may need to serve for up to 7
days. Still, 7 solr queries for a front-end query doesn't sound ideal.
So I'm still in search for a better way ...

Thanks,
Jack

On Tue, May 13, 2008 at 9:06 PM, Otis Gospodnetic
<[EMAIL PROTECTED]> wrote:
> The answer is: function queries! :)
> You can easily use function queries with DisMaxRequestHandler.  For example, 
> this is what you can add to the dismax config section in solrconfig.xml:
>
> <str name="bf">
>    recip(rord(addDate),1,1000,1000)^2.5
> </str>
>
> Assuming you have an addDate field, this will give fresher document some 
> boost.  Look for this on the Wiki, it's all there.
>
> Otis
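
For reference, the same boost function can also be passed per request rather
than configured in solrconfig.xml, e.g. (query text illustrative, field name
taken from the example above):

q=foo&defType=dismax&bf=recip(rord(addDate),1,1000,1000)^2.5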


Re: How to limit number of pages per domain

2008-05-22 Thread Jack
I think I'll give it a try. I haven't done this before. Are there any
instructions regarding how to apply the patch? I see 9 files, some
displayed in gray links, some in blue links; some named as .diff, some
.patch; one has 1.3 in file name, one has 1.3, I suppose the other
files are for both versions. Should I apply all of them?
https://issues.apache.org/jira/browse/SOLR-236
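
For reference, a JIRA patch is normally applied from the root of a checked-out
Solr source tree, using the patch file that matches your Solr version; a
sketch (directory name illustrative, -p level may vary):

cd solr-checkout
patch -p0 < SOLR-236.patch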

> Actually, the best documentation are really the comments in the JIRA issue 
> itself.
> Is there anyone actually using Solr with this patch?
>
>
> Otis


Re: Append fields to a document

2015-12-16 Thread Jack Krupansky
What is the nature of your documents that reproducing them is so expensive?
Whatever it is, you should spend some time trying to reduce it to something
more manageable and performant. Generally, the primary recommendation is to
simply reindex any documents that need to be updated since atomic update
has various caveats so that it is only useful in a subset of use cases.
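
For context, a field-level atomic update (the feature whose caveats are
mentioned above) looks like this in the XML update format; the field names are
assumptions, and all other fields must be stored (or have docValues) for the
document to be reconstructed correctly:

<add>
  <doc>
    <field name="id">doc-1</field>
    <!-- append a value to a multiValued field -->
    <field name="tags" update="add">new-tag</field>
    <!-- replace the value of a single-valued field -->
    <field name="price" update="set">42</field>
  </doc>
</add>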

-- Jack Krupansky

On Wed, Dec 16, 2015 at 10:09 AM, Jamie Johnson  wrote:

> I have a use case where we only need to append some fields to a document.
> To retrieve the full representation is very expensive but I can easily get
> the deltas.  Is it possible to just add fields to an existing Solr
> document?  I experimented with using overwrite=false, but that resulted in
> two documents with the same uniqueKey in the index (which makes sense).  Is
> there a way to accomplish what I'm looking to do in Solr?  My fields aren't
> all stored and think it will be too expensive for me to make that change.
> Any thoughts would be really appreciated.
>


Re: Slow query response.

2015-12-17 Thread Jack Krupansky
A single query with tens of thousands of terms is very clearly a misuse of
Solr. If it happens to work at all, consider yourself lucky. Are you using
a standard Solr query parser or the terms query parser that lets you write
a raw list of terms to OR?
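
For reference, the terms query parser takes a plain comma-separated list and
skips per-term scoring; a sketch using the field name f from the quoted
message:

  fq={!terms f=f}value1,value2,value3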

Are your nodes CPU-bound or I/O-bound during those 50-second intervals? My
bet is that your index does not fit fully in memory, causing lots of I/O to
repeatedly page in portions of the index and probably additional CPU usage
as well.

How many rows are you returning on each query? Are you using all these
terms just to filter a smaller query or to return a large bulk of documents?


-- Jack Krupansky

On Thu, Dec 17, 2015 at 7:01 AM, Modassar Ather 
wrote:

> Hi,
>
> I have a field f which is defined as follows.
>  omitNorms="true"/>
>
> Solr-5.2.1 is used. The index is spread across 12 shards (no replica) and
> the index size on each node is around 100 GB.
>
> When I search for 50 thousand values (ORed) in the field f it takes almost
> around 45 to 55 seconds.
> Per my understanding it is too slow. Kindly share your thoughts on this
> behavior and provide your suggestions.
>
> Thanks,
> Modassar
>


Re: While idexing millions of data Getting error

2015-12-18 Thread Jack Krupansky
Deep in that stack trace: "Suppressed: java.io.IOException: No space left
on device". Out of disk, apparently. Seems unlikely for the big disks on
most systems these days. Are you using SSD? They can be relatively small,
especially if on a box that has been virtualized into multiple VMs.

Some discussion of the initial HTTP error here:
http://stackoverflow.com/questions/29527803/eliminating-or-understanding-jetty-9s-illegalstateexception-too-much-data-aft

But maybe Solr/Lucene are behaving in some extreme manner when out of disk
space, cascading to the actual error you got at the client.

How many documents do you send at a time? How often do you commit?
Generally, you should send batches of documents, like 1,000 at a time.
Maybe commit every 50,000 documents.
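
A rough SolrJ sketch of that batching pattern (client construction shown for
SolrJ 6+; the URL, field names, and sizes are illustrative):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
  public static void main(String[] args) throws Exception {
    try (SolrClient client = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/mycollection").build()) {
      List<SolrInputDocument> batch = new ArrayList<>();
      for (int i = 0; i < 2_000_000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", Integer.toString(i));
        doc.addField("title", "document " + i);
        batch.add(doc);
        if (batch.size() == 1000) {      // send roughly 1,000 docs per request
          client.add(batch);
          batch.clear();
        }
        if (i > 0 && i % 50_000 == 0) {  // commit every ~50,000 docs
          client.commit();
        }
      }
      if (!batch.isEmpty()) {
        client.add(batch);
      }
      client.commit();                   // final commit
    }
  }
}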

2 million documents is nothing for Solr. I recommend 100 million per
node/shard as a rough practical limit although the exact practical limit
depends on your particular hardware and your particular data model and the
data itself.

How large is each document, roughly? Hundreds, thousands, or millions of
bytes? Are some documents extremely large?


-- Jack Krupansky

On Fri, Dec 18, 2015 at 10:30 AM, Toke Eskildsen 
wrote:

> Mugeesh Husain  wrote:
> > could you tell me the maximum number of limit for posting data to solr.
>
> The data size can be at most 2GB, possibly minus a few bytes. It is due to
> the HttpUrlComponent used inside of Solr, which only accepts a signed
> integer as size.
>
> As for the number of documents, the limit is 2 billion. It does not seem
> to be a problem in your case.
>
>
> - Toke Eskildsen
>


Re: Schema/Index design for disparate data sources (Federated / Google like search)

2015-12-22 Thread Jack Krupansky
Step one is to refine and more clearly state the requirements. Sure,
sometimes (most of the time?) the end user really doesn't know exactly what
they expect or want other than "Gee, I want to search for everything, isn't
that obvious??!!", but that simply means that an analyst is needed to
intervene before you leap to implementation. An analyst is someone who
knows how to interview all relevant parties (not just the approving
manager) to understand their true needs. I mean, who knows, maybe all they
really need is basic keyword search. Or... maybe they actually need a
full-blown data warehouse with precise access to each specific field of
each data source. Without knowing how refined user queries need to get,
there is little to go on here.

My other advice is to be careful not to overthink the problem - to imagine
that some complex solution is needed when the end users really only need to
do super basic queries. In general, managers are very poor when it comes to
analysis and requirement specification.

Do they need to do date searches on a variety of date fields?

Do they need to do numeric or range queries on specific numeric fields?

Do they need to do any exact match queries on raw character fields (as
opposed to tokenized text)?

Do they have fields like product names or numbers in addition to free-form
text?

Do they need to distinguish or weight titles from detailed descriptions?

You could have catchall fields for categories of field types like titles,
bodies, authors/names, locations, dates, numeric values. But... who
knows... this may be more than what an average user really needs.
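
A minimal sketch of the catchall idea in schema.xml (field and pattern names
are illustrative):

<field name="catchall" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="title_*" dest="catchall"/>
<copyField source="body_*" dest="catchall"/>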

As far as the concern about fields from different sources that are not
used, Lucene only stores and indexes fields which have values, so no
storage or performance is consumed when you have a lot of fields which are
not present for a particular data source.

-- Jack Krupansky

On Tue, Dec 22, 2015 at 11:25 AM, Susheel Kumar 
wrote:

> Hello,
>
> I am going thru few use cases where we have kind of multiple disparate data
> sources which in general doesn't have much common fields and i was thinking
> to design different schema/index/collection for each of them and query each
> of them separately and provide different result sets to the client.
>
> I have seen one implementation where all different fields from these
> disparate data sources are put together in single schema/design/collection
> that it can be searched easily using catch all field but this was having
> 200+ fields including copy fields. The problem i see with this design is
> ingestion will be slower (and scaling) as many of the fields for one data
> source will not be applicable when ingesting for other data source.
> Basically everything is being dumped into one huge schema/index/collection.
>
> After looking above, I am wondering how we can design this better in
> another implementation where we have the requirement to search across
> disparate source (each having multiple fields 10-15 fields searchable &
> 10-15 fields stored) with only 1 common field like description in each of
> the data sources.  Most of the time user may perform search on description
> and rest of the time combination of different fields. Similar to google
> like search where you search for "coffee" and it searches in various data
> sources (websites, maps, images, places etc.)
>
> My thought is to make separate indexes for each search scenario.  For
> example for single search box, we index description, other key fields which
> can be searched together  and their data source type into one index/schema
> that we don't make a huge index/schema and use the catch all field for
> search.
>
> And for other Advance search (field specific) scenario we create separate
> index/schema for each data sources.
>
> Any suggestions/guidelines on how we can better address this in terms of
> responsiveness and scaling? Each data source may have documents in 50-100+
> millions.
>
> Thanks,
> Susheel
>


Re: How to check when a search exceeds the threshold of timeAllowed parameter

2015-12-22 Thread Jack Krupansky
timeAllowed was designed to handle queries that by themselves consume lots
of resources, not to try to handle situations with large numbers of
requests that starve other requests from accessing CPU and I/O resources.

The usual technique for handling large numbers of requests is replication,
making more copies of the index that can each be searched in parallel.

How long do queries take when the site is operating normally?

Make sure that you have enough system memory to cache the index, otherwise
the machine will be thrashing with lots of I/O for competing requests.

-- Jack Krupansky

On Tue, Dec 22, 2015 at 8:43 PM, Vincenzo D'Amore 
wrote:

> Well... I can write everything, but really all this just to understand
> when timeAllowed
> parameter trigger a partial answer? I mean, isn't there anything set in the
> response when is partial?
>
> On Wed, Dec 23, 2015 at 2:38 AM, Walter Underwood 
> wrote:
>
> > We need to know a LOT more about your site. Number of documents, size of
> > index, frequency of updates, length of queries approximate size of server
> > (CPUs, RAM, type of disk), version of Solr, version of Java, and features
> > you are using (faceting, highlighting, etc.).
> >
> > After that, we’ll have more questions.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
> >
> > > On Dec 22, 2015, at 4:58 PM, Vincenzo D'Amore 
> > wrote:
> > >
> > > Hi All,
> > >
> > > my website is under pressure, there is a big number of concurrent
> > searches.
> > > When the connected users are too many, the searches becomes so slow
> that
> > in
> > > some cases users have to wait many seconds.
> > > The queue of searches becomes so long that, in same cases, servers are
> > > blocked trying to serve all these requests.
> > > As far as I know because some searches are very expensive, and when
> many
> > > expensive searches clog the queue server becomes unresponsive.
> > >
> > > In order to quickly workaround this herd effect, I have added a
> > > default timeAllowed to 15 seconds, and this seems help a lot.
> > >
> > > But during stress tests but I'm unable to understand when and what
> > requests
> > > are affected by timeAllowed parameter.
> > >
> > > Just be clear, I have configure timeAllowed parameter in a SolrCloud
> > > environment, given that partial results may be returned (if there are
> > any),
> > > how can I know when this happens? When the timeAllowed parameter
> trigger
> > a
> > > partial answer?
> > >
> > > Best regards,
> > > Vincenzo
> > >
> > >
> > >
> > > --
> > > Vincenzo D'Amore
> > > email: v.dam...@gmail.com
> > > skype: free.dev
> > > mobile: +39 349 8513251
> >
> >
>
>
> --
> Vincenzo D'Amore
> email: v.dam...@gmail.com
> skype: free.dev
> mobile: +39 349 8513251
>


Re: Best practices on monitoring Solr

2015-12-23 Thread Jack Krupansky
Solr does have a monitoring wiki page, but it is fairly weak and could use
more serious contribution, including suggestions from this email thread.

This is also a good example of where the wiki still has value relative to
the formal Solr Reference Guide. E.g., third parties can add tool and
service descriptions and users can add their experiences and own tips. The
Reference Guide probably should have at least a cursory summary of Solr
monitoring (beyond just documenting JMX), probably simply referring users
to the wiki. IOW, details on monitoring are beyond the scope of the
Reference Guide itself (other than raw JMX and ping.)

-- Jack Krupansky

On Wed, Dec 23, 2015 at 6:27 AM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi Shail,
> As William mentioned, our SPM <https://sematext.com/spm/index.html>
> allows you to monitor all main Solr/Jvm/Host metrics and also set up alerts
> for some values or use anomaly detection to notify you when something is
> about to be wrong. You can test all features for free for 30 days (no
> credit card required). There is embedded chat if you have some questions.
>
> HTH,
> Emir
>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>
>
> On 23.12.2015 07:38, William Bell wrote:
>
>> Sematext.com has a service for this...
>>
>> Or just curl "http://localhost:8983/solr//select?q=*:*" to
>> see
>> if it returns ?
>>
>> On Tue, Dec 22, 2015 at 12:15 PM, Tiwari, Shailendra <
>> shailendra.tiw...@macmillan.com> wrote:
>>
>> Hi,
>>>
>>> Last week our Solr Search was un-responsive and we need to re-boot the
>>> server, but we were able to find out after customer complained about it.
>>> What's best way to monitor that search is working?
>>> We can always add Gomez alerts from UI.
>>> What are the best practices?
>>>
>>> Thanks
>>>
>>> Shail
>>>
>>
>>
>>
>>
>


Re: Changing Solr Schema with Data

2015-12-28 Thread Jack Krupansky
All crucial data that you don't want to delete should be stored in a
non-Solr backing store, either flat files (e.g., CSV or Solr XML), an
RDBMS, or a NoSQL database. You should always be in a position to either
fully reindex or fully discard your Solr data. Solr is not a system of
record database. Was someone telling you something different?

-- Jack Krupansky

On Mon, Dec 28, 2015 at 1:48 PM, Salman Ansari 
wrote:

> Hi,
>
> I am facing an issue where I need to change Solr schema but I have crucial
> data that I don't want to delete. Is there a way where I can change the
> schema of the index while keeping the data intact?
>
> Regards,
> Salman
>


Re: Adding the same field value question

2015-12-28 Thread Jack Krupansky
Is the field multivalued?

-- Jack Krupansky

On Sun, Dec 27, 2015 at 11:16 PM, Jamie Johnson  wrote:

> What is the difference of adding a field with the same value twice or
> adding it once and boosting the field on add?  Is there a situation where
> one approach is preferred?
>
> Jamie
>


Re: Issue with if() statement

2015-12-31 Thread Jack Krupansky
You can't have spaces in a function query - the %20 will get expanded to a
space (just as a "+" would.)

And fq is "filter query" anyway, not "function query". Try: fq={!func}...

Not sure what the solution to those embedded spaces is, but you probably
need function queries there as well.


-- Jack Krupansky

On Thu, Dec 31, 2015 at 6:50 PM, William Bell  wrote:

> We are getting weird results with if(exists(a),b,c). We are getting b+c!!
>
>
> http://localhost:8983/solr/providersearch/select?q=*:*&wt=json&state=state:%22CO%22&state1=state:%22NY%22&fq=if(exists(query($state1)),{!lucene%20v=$state1},{!lucene%20v=$state})
>
> I am getting NY and CO!
>
> I only want $state1, which is NY.
>
> Any other ways to craft this?
>
>
> --
> Bill Bell
> billnb...@gmail.com
> cell 720-256-8076
>


Re: Multiple solr instances on one server

2016-01-04 Thread Jack Krupansky
See the Solr Reference Guide:

"
-s 

Sets the solr.solr.home system property; Solr will create core directories
under this directory. This allows you to run multiple Solr instances on the
same host while reusing the same server directory set using the -d
parameter. If set, the specified directory should contain a solr.xml file,
unless solr.xml exists in ZooKeeper. The default value is server/solr.
"
https://cwiki.apache.org/confluence/display/solr/Solr+Start+Script+Reference
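
Combining the two flags from the quoted section, something like the following
runs two independent instances from one install (ports and home directories
are illustrative):

bin/solr start -p 8983 -s /var/solr/node1
bin/solr start -p 8984 -s /var/solr/node2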



-- Jack Krupansky

On Mon, Jan 4, 2016 at 10:28 AM, Mugeesh Husain  wrote:

> you could start solr with multiple port like below
>
>
> bin/solr start -p 8983 one instance
> bin/solr start -p 8984 second instance and so its depend on you
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Multiple-solr-instances-on-one-server-tp4248411p4248413.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Many patterns against many sentences, storing all results

2016-01-05 Thread Jack Krupansky
It doesn't sound like a very good match with Solr - or any other search
engine or any relational database or data store for that matter. Sure,
maybe you can get something to work with extraordinary effort, but it is
unlikely that you will ever be happy with the results. You should probably
just bite the bullet and develop a full-custom in-memory data store that is
wired for the kinds of matching you are trying to accomplish. Sure, you can
probably scavenge some code/logic from Lucene, but that won't help with the
kind of patterns you are trying to match. Or... if you're willing to put
enough effort into it you might be able to develop a custom Lucene Query
class that did in fact align with your pattern matching requirements. But
that's not an out of the box feature at this stage.

It's not that this type of sentence matching is so unusual or hasn't come
up before (once a year or so?), but it just doesn't have any natural fit in
Lucene as it was originally conceptualized. Solr and Lucene are focused on
query of documents or matching a query against a document, not matching one
set of documents against another set of documents. IOW, Solr/Lucene is a
Query/Search engine, not a document/sentence set matching system.

A sentence matcher would be a great new feature for Lucene/Solr, but it's
not there today.

You can also take a look at Elasticsearch Percolator for another example of
matching incoming documents against stored queries:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html


-- Jack Krupansky

On Tue, Jan 5, 2016 at 11:05 AM, Allison, Timothy B. 
wrote:

> Might want to look into:
>
> https://github.com/flaxsearch/luwak
>
> or
>  https://github.com/OpenSextant/SolrTextTagger
>
> -Original Message-
> From: Will Moy [mailto:w...@fullfact.org]
> Sent: Tuesday, January 05, 2016 11:02 AM
> To: solr-user@lucene.apache.org
> Subject: Many patterns against many sentences, storing all results
>
> Hello
>
> Please may I have your advice as to whether Solr is a good tool for this
> job?
>
> We have (per year) –
> Up to 50,000,000 sentences
> And about 5,000 search patterns (i.e. queries)
>
> Our task is to identify all matches between any sentence and any search
> pattern.
>
> That list of detections must be kept up to date as patterns are added or
> updated (a handful an hour), and as new sentences are added.
>
> Some of the sentences will be added in real time, at probably max 100 /
> second and usually much less. The detections on these should be provided
> within 3 seconds.
>
> It's an unusual application in that we want all results in an external DB,
> and also in that every sentence is either a hit or not. we don't care about
> scoring results, only about matches for the exact search pattern entered.
>
> The application is automatically detecting instances of factchecked
> statements.
>
> The smaller-scale prototype was done with postgres full text searching,
> but that can't do exact phrase matching or other more sophisticated
> searches, so it's out.
>
> Thanks very much
>
> Will
>


Re: Count multivalued field issue

2016-01-06 Thread Jack Krupansky
Out of curiosity, where did you get your example code from - so we can
ensure that it gets corrected?

Here's a valid example, from de-dupe:

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">dedupe</str>
  </lst>
  ...
</requestHandler>

Note it is the request handler for "/update", not the "update handler."

See:
https://cwiki.apache.org/confluence/display/solr/De-Duplication

It is unfortunate that such an example is not given in the actual update
request processor doc, which only shows an example for the Solr Cell
request handler:
https://cwiki.apache.org/confluence/display/solr/Update+Request+Processors

If that still doesn't work, be sure to provide detail of what the symptom
is rather than simply saying that it doesn't work.


-- Jack Krupansky

On Wed, Jan 6, 2016 at 8:43 AM, marotosg  wrote:

> Hi,
>
> I am trying to add a new field to my schema to add the number of items of a
> multivalued field.
> I am using solr 4.11
>
> These are my fields on *schema.xml*
>  multiValued="true" stored="true" />
> 
>
> Here is the update done to my *solrconfig.xml*. I created an
> updateRequestProcessorChain
> and add it to the update handler
>
> 
> 
> countfields
> 
> 
>
> 
>
>  EmailListS
>  EmailListCountD
>
>
>  EmailListCountD
>
>
>  EmailListCountD
>  0
>
>
>
>  
>
> Am I doing something wrong here?
>
> Thanks for your help.
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Count-multivalued-field-issue-tp4248878.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Query behavior difference.

2016-01-06 Thread Jack Krupansky
The motivation for the constant-score rewrite is simply performance. As per
the Javadoc:

"*This method is faster than the BooleanQuery rewrite methods when the
number of matched terms or matched documents is non-trivial. Also, it will
never hit an errant BooleanQuery.TooManyClauses exception.*"

So that's a second reason - to avoid the max clause count limitation of
Boolean Query.

See:
https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/search/MultiTermQuery.html#CONSTANT_SCORE_REWRITE
https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/search/WildcardQuery.html


-- Jack Krupansky

On Wed, Jan 6, 2016 at 6:07 AM, Modassar Ather 
wrote:

> Please help me understand why queries like wildcard, prefix and few others
> are re-written into constant score query?
> Why the scoring factors are not taken into consideration in such queries?
>
> Please correct me if I am wrong that this behavior is per the query type
> irrespective of the parser used.
>
> Thanks,
> Modassar
>
> On Wed, Jan 6, 2016 at 12:56 PM, Modassar Ather 
> wrote:
>
> > Thanks for your response Ahmet.
> >
> > Best,
> > Modassar
> >
> > On Mon, Jan 4, 2016 at 5:07 PM, Ahmet Arslan 
> > wrote:
> >
> >> Hi,
> >>
> >> I think wildcard queries fl:networ* are re-written into Constant Score
> >> Query.
> >> fl=*,score should returns same score for all documents that are
> retrieved.
> >>
> >> Ahmet
> >>
> >>
> >>
> >> On Monday, January 4, 2016 12:22 PM, Modassar Ather <
> >> modather1...@gmail.com> wrote:
> >> Hi,
> >>
> >> Kindly help me understand how will relevance ranking differ int
> following
> >> searches.
> >>
> >> query : fl:network
> >> query : fl:networ*
> >>
> >> What I am observing that the results returned are different in both of
> >> them
> >> in a way that the top documents returned for q=fl:network is not present
> >> in
> >> the top results of q=fl:networ*.
> >> For example for q=fl:network I am getting top documents having around 20
> >> occurrence of network whereas the top result of q=fl:networ* has only
> >> couple of occurrence of network.
> >> I am aware of the underlying normalization process participation in
> >> relevance ranking of documents but not able to understand such a
> >> difference
> >> in the ranking of result for the queries.
> >>
> >> Thanks,
> >> Modassar
> >>
> >
> >
>


Re: Dynamically Adding query parameters in my custom Request Handler class

2016-01-09 Thread Jack Krupansky
Sure, you CAN do this, but why would you want to? I mean, what exactly is
the motivation here? If you truly have custom code to execute, fine, but if
all you are trying to do is set parameters, a custom request handler is
hitting a tack with a sledge hammer. For example, why isn't setting
defaults in solrconfig sufficient for your needs? At least then you can
change parameters with a simple text edit rather than require a Java build
and jar deploy.
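
For reference, a sketch of the kind of solrconfig.xml defaults/appends being
suggested here (handler name, fields, and values are illustrative):

<requestHandler name="/myselect" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">title^2 body</str>
  </lst>
  <lst name="appends">
    <str name="fq">category:active</str>
  </lst>
</requestHandler>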

Can you share what some of the requirements are for your custom request
handler, including the motivation? I'd hate to see you go off and invest
significant effort in a custom request handler when simpler techniques may
suffice.

-- Jack Krupansky

On Sat, Jan 9, 2016 at 12:08 PM, Ahmet Arslan 
wrote:

> Hi Mark,
>
> Yes this is possible. Better, you can use a custom SearchComponent for
> this task too.
> You retrieve solr parameters, wrap it into ModifiableSolrParams. Add extra
> parameters etc, then pass it to underlying search components.
>
> Ahmet
>
>
> On Saturday, January 9, 2016 3:59 PM, Mark Robinson <
> mark123lea...@gmail.com> wrote:
> Hi,
> When I initially fire a query against my Solr instance using SOLRJ I pass
> only, say q=*:*&fq=(myfield:vaue1).
>
> I have written a custom RequestHandler, which is what I call in my SolrJ
> query.
> Inside this custom request handler can I add more query params like say the
> facets etc.. so that ultimately facets are also received back in my results
> which were initially not specified when I invoked the Solr url using SolrJ.
>
> In short, instead of constructing the query dynamically initially in SolrJ
> I want to add the extra query params, adding a jar in Solr (a java code
> that will check certain conditions and dynamically add the query params
> after the initial SolrJ query is done). That is why I thought of a custom
> RH which would help we write a java class and deploy in Solr.
>
> Is this possible. Could some one get back please.
>
> Thanks!
> Mark.
>


Re: Pro and cons of using Solr Cloud vs standard Master Slave Replica

2016-01-13 Thread Jack Krupansky
The "Legacy Scaling and Distribution" section of the Solr Reference Guide
also gives info elated to so-called master-slave mode:
https://cwiki.apache.org/confluence/display/solr/Legacy+Scaling+and+Distribution

Also, although the old master-slave mode is still technically supported in
the sense that the code and doc is still there, You won't be able to get
the level of community support  here on the mailing list as you can get for
SolrCloud.

Unless you're simply trying to decide whether to leave an old legacy system
as-is with the old distributed mode, nobody should be considering a fresh
new distributed Solr deployment with anything other than SolrCloud.

(Hmmm... have any of the committers considered deprecating the old
non-SolrCloud distributed mode features?)

-- Jack Krupansky

On Wed, Jan 13, 2016 at 9:02 AM, Shivaji Dutta 
wrote:

> - SolrCloud uses zookeeper to manage HA
> - Zookeeper is a standard for all HA in Apache Hadoop
> - You have collections which will manage your shards across nodes
> - SolrJ Client is now fault tolerant with CloudSolrClient
>
> This is the way future direction of the product will go.
>
>
>
> On 1/13/16, 5:58 AM, "Gian Maria Ricci - aka Alkampfer"
>  wrote:
>
> >Thanks.
> >
> >--
> >Gian Maria Ricci
> >Cell: +39 320 0136949
> >
> >
> >
> >-Original Message-
> >From: Shawn Heisey [mailto:apa...@elyograg.org]
> >Sent: lunedì 11 gennaio 2016 18:28
> >To: solr-user@lucene.apache.org
> >Subject: Re: Pro and cons of using Solr Cloud vs standard Master Slave
> >Replica
> >
> >On 1/11/2016 4:28 AM, Gian Maria Ricci - aka Alkampfer wrote:
> >> a customer need a comprehensive list of all pro and cons of using
> >> standard Master Slave replica VS using Solr Cloud. I¹m interested
> >> especially in query performance consideration, because in this
> >> specific situation the rate of new documents is really slow, but the
> >> amount of data is about 50 millions of document, and the index size on
> >> disk for single core is about 30 GB.
> >
> >The primary advantage to SolrCloud is that SolrCloud handles most of the
> >administrative and operational details for you automatically.
> >
> >SolrCloud is a little more complicated to set up initially, because you
> >must worry about Zookeeper as well as Solr, but once it's properly set
> >up, there is no single point of failure.
> >
> >> Such amount of data should be easily handled by a Master Slave replica
> >> with a  single core replicated on a certain number of slaves, but we
> >> need to evaluate also the option of SolrCloud, especially for fault
> >> tolerance.
> >>
> >
> >Once you're beyond initial setup, fault tolerance with SolrCloud is much
> >easier than master/slave replication.  Switching a slave to a master is
> >possible, but the procedure is somewhat complicated.  SolrCloud does not
> >*have* masters, it is a true cluster.
> >
> >With master/slave replication, the master handles all indexing, and the
> >finished index segments are copied to the slaves via HTTP, and the slaves
> >simply need to open them.  SolrCloud does indexing on all shard replicas,
> >nearly simultaneously.  Usually this is an advantage, not a disadvantage,
> >but in heavy indexing situations master/slave replication
> >*might* show better performance on the slaves.
> >
> >Thanks,
> >Shawn
> >
> >
>
>


Re: &fq degrades qtime in a 20million doc collection

2016-01-13 Thread Jack Krupansky
I recall a couple of previous discussions regarding some sort of
filter/field cache change in Lucene where they removed what had been an
optimization for Solr.

-- Jack Krupansky

On Wed, Jan 13, 2016 at 8:10 PM, Erick Erickson 
wrote:

> It's quite surprising that you're getting this kind of query
> degradation by adding an "fq" clause
> unless something's really out of whack on the setup. How much memory
> are you giving
> the JVM? Are you autowarming? Are you indexing while this is going on,
> and if what are
> your commit parameters? If you add &debug=true to your query, one of
> the returned sections
> will be "timings" for the various components of a query measured in
> milliseconds. Occasionally
> there will be surprises in there.
>
> What are you measuring when you say it takes seconds? The time to
> render the result page or
> are you looking at the QTime parameter of the return packet?
>
> Best,
> Erick
>
> On Wed, Jan 13, 2016 at 4:27 PM, Anria B.  wrote:
> > hi Shawn
> >
> > Thanks for the quick answer.  As for the q=*,  we also saw similar
> results
> > in our testing when doing things like
> >
> > q=somefield:qval
> > &fq=otherfield:fqval
> >
> > Which makes a pure Lucene query.  I simplified things somewhat since our
> > results were always that as numFound got large, the query time degraded
> as
> > soon as we added any &fq in the mix.
> >
> > We also saw similar results for queries like
> >
> > q=query stuff
> > &defType=edismax
> > &df=afield
> > &qf=afield bfield cfield
> >
> >
> > So the query structure was not what created the 3-7 second query time, it
> > was always a correlation between is &fq in the query, and what is the
> > numFound.  We've run numerous load tests for bringing in good query with
> fq
> > values in the "newSearcher",  caches on, caches off  ... this same
> > phenomenon persisted.
> >
> > As for Tomcat, it's an easy enough test to run it in Jetty.  We will sure
> > try that!  GC we've had default and G1 setups.
> >
> > Thanks for giving us something to think about
> >
> > Anria
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/fq-degrades-qtime-in-a-20million-doc-collection-tp4250567p4250600.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Monitor backup progress when location parameter is used.

2016-01-14 Thread Jack Krupansky
I think the doc is wrong or at least misleading:
https://cwiki.apache.org/confluence/display/solr/Making+and+Restoring+Backups+of+SolrCores

"The backup operation can be monitored to see if it has completed by
sending the details command to the /replication handler..."

From reading the code, it looks like the snapshot details are only stored
and returned after the snapshot completes, either successfully or fails,
but there is nothing set or reported if a snapshot is in progress. So, if
you don't see a "backup" section in the response, that means the snapshot
is in progress.

I think it's worth a Jira - either to improve the doc or improve the code
to report backup as "inProgress... StartedAt...".

You can also look at the log... "Creating backup snapshot" indicates the
backup has started and "Done creating backup snapshot" indicates success or
"Exception while creating snapshot" indicates failure. If only that first
message appeals, it means the backup is still in progress.


-- Jack Krupansky

On Thu, Jan 14, 2016 at 9:23 AM, Gian Maria Ricci - aka Alkampfer <
alkamp...@nablasoft.com> wrote:

> If I start a backup operation using the location parameter
>
>
>
> *http://localhost:8983/solr/mycore/replication?command=backup&name=mycore&;
> <http://localhost:8983/solr/mycore/replication?command=backup&name=mycore&;>location=z:\temp\backupmycore*
>
>
>
> How can I monitor when the backup operation is finished? Issuing a
> standard *details* operation
>
>
>
> *http://localhost:8983/solr/ <http://localhost:8983/solr/> mycore
> /replication?command=details*
>
>
>
> does not gives me useful information, because there are no information on
> backup on returning data.
>
>
>
>
>
> 
>
>
>
> 
>
> 0
>
> 1
>
> 
>
> 
>
> 57.62 GB
>
>  name="indexPath">X:\NoSql\Solr\solr-5.3.1\server\solr\mycore\data\index/
>
> 
>
> 
>
> 1452534703494
>
> 1509
>
> 
>
> _2cw.fdt
>
> _2cw.fdx
>
> _2cw.fnm
>
> _2cw.nvd
>
> _2cw.nvm
>
> _2cw.si
>
> _2cw_Lucene50_0.doc
>
> _2cw_Lucene50_0.dvd
>
> _2cw_Lucene50_0.dvm
>
> _2cw_Lucene50_0.pos
>
> _2cw_Lucene50_0.tim
>
> _2cw_Lucene50_0.tip
>
> segments_15x
>
> 
>
> 
>
> 
>
> true
>
> false
>
> 1452534703494
>
> 1509
>
> 
>
>  name="confFiles">schema.xml,stopwords.txt,elevate.xml
>
> 
>
> optimize
>
> 
>
> true
>
> 
>
> 
>
>
>
> 
>
> --
> Gian Maria Ricci
> Cell: +39 320 0136949
>
>
>
>


Re: &fq degrades qtime in a 20million doc collection

2016-01-14 Thread Jack Krupansky
That sounds like it. Sorry my memory is so hazy.

Maybe Yonik can either confirm that that Jira is still outstanding or close
it, and confirm if these symptoms are related.

-- Jack Krupansky

On Thu, Jan 14, 2016 at 10:54 AM, Erick Erickson 
wrote:

> Jack:
>
> I think that was for faceting? SOLR-8096 maybe?
>
>
> On Thu, Jan 14, 2016 at 12:25 AM, Toke Eskildsen 
> wrote:
> > On Wed, 2016-01-13 at 15:01 -0700, Anria B. wrote:
> >
> > [256GB RAM]
> >
> >> 1.   Collection has 20-30 million docs.
> >
> > Just for completeness: How large is the collection in bytes?
> >
> >> 2.   q=*&fq=someField:SomeVal   ---> takes 2.5 seconds
> >> 3.q=someField:SomeVal -->  300ms
> >> 4.   as numFound -> infinity, qtime -> infinity.
> >
> > What are you doing besides the q + fq above? Assuming a modest size
> > index (let's say < 100GB), even 300ms for a simple key:value query is a
> > long time. Usual culprits for performance problems when result set size
> > grows are high values for rows, grouping and faceting.
> >
> > Could you experiment with
> >  q=queryA&fq=queryB
> > vs
> >  q=queryA AND queryB
> > to make sure that is is not the underlying queries themselves that
> > causes the slowdown?
> >
> > Also, could you check the speed of the first request for
> >  q=queryA&fq=queryB
> > vs subsequent requests (you might need to set the queryResultCache to
> > zero to force a re-calculation of the result set), to see whether is is
> > the creation of the fq result set or the intersection calculation that
> > is slow?
> >
> >> We have already tested different autoCommit strategies, and different
> values
> >> for heap size, starting at 16GB, 32GB, 64GB, 128GB ...The only
> place we
> >> saw a 100ms improvement was between 32 - -Xmx=64GB.
> >
> > I would guess the 100 ms improvement was due to a factor not related to
> > heap size. With the exception of a situation where the heap is nearly
> > full, increasing Xmx will not improve Solr performance significantly.
> >
> > Quick note: Never set Xmx in the range 32GB-40GB (40GB is approximate):
> > At the 32GB point, the JVM switches to larger pointers, which means that
> > effective heap space is _smaller_ for Xmx=33GB than it is for Xmx=31GB:
> >
> https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
> >
> > - Toke Eskildsen, State and University Library, Denmark
> >
> >
>


Re: Position increment in WordDelimiterFilter.

2016-01-14 Thread Jack Krupansky
Which release of Solr are you using? Last year (or so) there was a Lucene
change that had the effect of keeping all terms for WDF at the same
position. There was also some discussion about whether this was either a
bug or a bug fix, but I don't recall any resolution.

-- Jack Krupansky

On Thu, Jan 14, 2016 at 4:15 AM, Modassar Ather 
wrote:

> Hi,
>
> I have following definition for WordDelimiterFilter.
>
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>
>
> The analysis of 3d shows following four tokens and their positions.
>
> token position
> 3d 1
> 3   1
> 3d 1
> d   2
>
> Please help me understand why d is at 2? Should not it also be at position
> 1.
> Is it a bug and if not is there any attribute which I can use to restrict
> the position increment?
>
> Thanks,
> Modassar
>


Re: Solr Query Tuning

2016-01-14 Thread Jack Krupansky
Add &debug=all to your query and look at the "timing" section of the debug
output to see which Solr search component is consuming the time.

You may also have to add &debug=track to get the shard-specific info.

In theory, 19 of the shards should return nothing and the 20th will return
a single document.

Maybe one of the shard nodes is having trouble and takes way too long to do
essentially nothing.

Does the document ID have any special characters in it? If so, be sure to
escape them or put the ID in quotes, otherwise some piece of the ID may
match lots of documents, although even that should not be a big problem.

And make sure the ID field is string or numeric, not tokenized text.
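
For comparison, a lookup through the real-time get handler (which bypasses the
normal search path entirely) looks like this; the core name and id value are
illustrative:

http://localhost:8983/solr/mycollection/get?id=abc-123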


-- Jack Krupansky

On Thu, Jan 14, 2016 at 7:53 PM, Shawn Heisey  wrote:

> On 1/14/2016 5:20 PM, Shivaji Dutta wrote:
> > I am working with a customer that has about a billion documents on 20
> shards. The documents are extremely small about 100 characters each.
> > The insert rate is pretty good, but they are trying to fetch the
> document by using SolrJ SolrQuery
> >
> > Solr Query is taking about 1 min to return.
> >
> > The query is very simple
> > id:
> > Note the content of the document is just the documentid.
> >
> > Request for Information
> >
> > A) I am looking for some information as how I could go about tuning the
> query.
> > B) An alternate approach that I am thinking of is to use the "/get"
> request handler
> > Is this going to be faster than "/select"
> > C) I am looking at the debugQuery option, but I am unsure how to
> interpret this. I saw an slide share which talked about "
> http://explain.solr.pl/help";, but it only supports older versions of solr.
>
> I have no idea whether /get would be faster.  You'd need to try it.
>
> Can you provide the SolrJ code that you are using to do the query?
> Another useful item would be the entire entry from the Solr logfile for
> this query.  There will probably be multiple log entries for one query,
> usually the relevant log entry is the last one in the series.  I may
> need the schema, but we'll decide that later.
>
> Are all 20 shards on the same server, or have you got them spread out
> across multiple machines?  What is the replicationFactor on the
> collection?  If there are multiple machines, how many shards live on
> each machine, and how many machines do you have total?  Do you happen to
> know how large the Lucene index is for each of these shards?  How much
> total memory does each server have, and how large is the Java heap?  Is
> there software other than Solr running on the machine(s)?
>
> I am suspecting that you don't have enough memory for the operating
> system to effectively cache your index.  Good performance for a billion
> documents is going to require a lot of memory and probably a lot of
> servers.
>
> https://wiki.apache.org/solr/SolrPerformanceProblems
>
> Thanks,
> Shawn
>
>


Re: Solr Query Tuning

2016-01-14 Thread Jack Krupansky
Sounds intriguing. It would have to know for sure which query parser is
being used, which might be set in the server side defaults.

Over in Cassandra NoSQL database land we have the concept of "token aware
load balancing policy" on the client side that does the necessary magic
(requiring parsing of the query) to send the request to exactly the node
(or replica) that owns that token/ID.

But if you really just trying to "query by ID", that should really have a
nice clean API so you don't have to build query syntax.

-- Jack Krupansky

On Thu, Jan 14, 2016 at 8:41 PM, Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Stupid thought/question. Is there a query by id API that understands
> SolrCloud routing and can simply fwd the query to the shard that would hold
> said document? Barring that, can one use SolrJ's routing brains to see what
> shard a given id would be routed to and only query that shard?
>
> -Doug
>
> On Thursday, January 14, 2016, Jack Krupansky 
> wrote:
>
> > Add &debug=all to your query to see where the time is spent in the
> "timing"
> > section to see which Solr search component is consuming the time.
> >
> > You may also have to add &debug=track to get the shard-specific info.
> >
> > In theory, 19 of the shards should return nothing and the 20th will
> return
> > a single document.
> >
> > Maybe one of the shard nodes is having trouble and takes way too long to
> do
> > essentially nothing.
> >
> > Does the document ID have any special characters in it? If so, be sure to
> > escape them or put the ID in quotes, otherwise some piece of the ID may
> > match lots of documents, although even that should not be a big problem.
> >
> > And make sure the ID field is string or numeric, not tokenized text.
> >
> >
> > -- Jack Krupansky
> >
> > On Thu, Jan 14, 2016 at 7:53 PM, Shawn Heisey  > > wrote:
> >
> > > On 1/14/2016 5:20 PM, Shivaji Dutta wrote:
> > > > I am working with a customer that has about a billion documents on 20
> > > shards. The documents are extremely small about 100 characters each.
> > > > The insert rate is pretty good, but they are trying to fetch the
> > > document by using SolrJ SolrQuery
> > > >
> > > > Solr Query is taking about 1 min to return.
> > > >
> > > > The query is very simple
> > > > id:
> > > > Note the content of the document is just the documentid.
> > > >
> > > > Request for Information
> > > >
> > > > A) I am looking for some information as how I could go about tuning
> the
> > > query.
> > > > B) An alternate approach that I am thinking of is to use the "/get"
> > > request handler
> > > > Is this going to be faster than "/select"
> > > > C) I am looking at the debugQuery option, but I am unsure how to
> > > interpret this. I saw an slide share which talked about "
> > > http://explain.solr.pl/help";, but it only supports older versions of
> > solr.
> > >
> > > I have no idea whether /get would be faster.  You'd need to try it.
> > >
> > > Can you provide the SolrJ code that you are using to do the query?
> > > Another useful item would be the entire entry from the Solr logfile for
> > > this query.  There will probably be multiple log entries for one query,
> > > usually the relevant log entry is the last one in the series.  I may
> > > need the schema, but we'll decide that later.
> > >
> > > Are all 20 shards on the same server, or have you got them spread out
> > > across multiple machines?  What is the replicationFactor on the
> > > collection?  If there are multiple machines, how many shards live on
> > > each machine, and how many machines do you have total?  Do you happen
> to
> > > know how large the Lucene index is for each of these shards?  How much
> > > total memory does each server have, and how large is the Java heap?  Is
> > > there software other than Solr running on the machine(s)?
> > >
> > > I am suspecting that you don't have enough memory for the operating
> > > system to effectively cache your index.  Good performance for a billion
> > > documents is going to require a lot of memory and probably a lot of
> > > servers.
> > >
> > > https://wiki.apache.org/solr/SolrPerformanceProblems
> > >
> > > Thanks,
> > > Shawn
> > >
> > >
> >
>
>
> --
> *Doug Turnbull **| *Search Relevance Consultant | OpenSource Connections
> <http://opensourceconnections.com>, LLC | 240.476.9983
> Author: Relevant Search <http://manning.com/turnbull>
> This e-mail and all contents, including attachments, is considered to be
> Company Confidential unless explicitly stated otherwise, regardless
> of whether attachments are marked as such.
>


Re: Pro and cons of using Solr Cloud vs standard Master Slave Replica

2016-01-15 Thread Jack Krupansky
Yeah, and to the original question, there is no master list of features
showing how SolrCloud and legacy distributed mode compare, feature by feature.

And until SolrCloud actually does subsume every single (important) feature
of legacy distributed mode, Solr probably still needs to continue to
support legacy distributed mode, including backup.

The doc does need better coverage of backup and restore at the cluster
level, including configuration files. What's there now is basically the old
single-node replication backup. What exactly is the recommended best
practice for backing up a single shard, let alone all shards. Should
backups be collection-based as well?


-- Jack Krupansky

On Fri, Jan 15, 2016 at 3:26 AM, Gian Maria Ricci - aka Alkampfer <
alkamp...@nablasoft.com> wrote:

> Yes, I've checked that jira some weeks ago and it is the reason why I was
> telling that there is still no clear procedure to backup SolrCloud in
> current latest version.  I'm glad that the priority is Major, but until it
> is not closed in an official version, I have to tell to customers that
> there is not easy and supported backup procedure for SolrCloud
> configuration :(.
>
> --
> Gian Maria Ricci
> Cell: +39 320 0136949
>
>
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: giovedì 14 gennaio 2016 16:46
> To: solr-user 
> Subject: Re: Pro and cons of using Solr Cloud vs standard Master Slave
> Replica
>
> re: SolrCloud backup/restore:
> https://issues.apache.org/jira/browse/SOLR-5750
>
> not committed yet, but getting attention.
>
>
>
> On Thu, Jan 14, 2016 at 6:19 AM, Gian Maria Ricci - aka Alkampfer <
> alkamp...@nablasoft.com> wrote:
> > Actually there are situation where a restore is needed, suppose that
> someone does some error and deletes all documents from a collection, or
> maybe deletes a series of document, etc. I know that this is not likely to
> happen, but in mission critical enterprise system, we always need a
> detailed procedure for disaster recovering.
> >
> > For such scenario we need to plan the worst case, where everything is
> lost.
> >
> > With Master Slave is just a matter of recreating machines, reconfigure
> the core, and restore a backup, and the game is done, with SolrCloud is not
> really clear for me how can I backup / restore data. From what I've found
> in the internet I need to backup every shard of the collection, and, if we
> need to restore everything from a backup, we can recreate the collection
> and then restore all the individual shards. I do not know if this is a
> supported scenario / procedure, but theoretically it could work.
> >
> > --
> > Gian Maria Ricci
> > Cell: +39 320 0136949
> >
> >
> >
> > -Original Message-
> > From: Alessandro Benedetti [mailto:abenede...@apache.org]
> > Sent: giovedì 14 gennaio 2016 10:46
> > To: solr-user@lucene.apache.org
> > Subject: Re: Pro and cons of using Solr Cloud vs standard Master Slave
> > Replica
> >
> > It's true that SolrCloud is adding some complexity.
> > But few observations :
> >
> > SolrCloud has some disadvantages and can't beat the easiness and
> > simpleness
> >> of
> >> Master Slave Replica. So I can only encourage to keep Master Slave
> >> Replica in future versions.
> >
> >
> > I agree, it can happen situations when you have really simple and not
> critical systems.
> > Anyway old style replication is still used in SolrCloud, so I think it
> is going to stay for a while ( until is replaced with something else) .
> >
> > To answer to Gian :
> >
> > One of the problem I've found is that I've not found a simple way to
> > backup
> >> the content of a collection to restore in situation of disaster
> recovery.
> >> With simple master / slave scenario we can use the replication
> >> handler to generate backups that can be easily used to restore
> >> content of a core, while with SolrCloud is not clear how can we
> >> obtain a full backup
> >
> >
> > To be fair, Disaster recovery is when SolrCloud shines.
> > If you lose random nodes across your collection, you simply need to fix
> them and spin up again .
> > The system will automatically restore the content to the last version
> available ( the tlog first and the leader ( if the tlog is not enough)
> will help the dead node to catch up .
> > If you lose all the replicas for a shard and you lose the content in
> disk of all this replicas ( index and tlog), SolrCloud can't help you.
> > For this unlikely scenarios a backup is s

Re: Speculation on Memory needed to efficiently run a Solr Instance.

2016-01-15 Thread Jack Krupansky
Personally, I'll continue to recommend that the ideal goal is to fully
cache the entire Lucene index in system memory, as well as doing a proof of
concept implementation to validate actual performance for your actual data.
You can do a POC with a small fraction of your full data, like 15% or even
10%, and then it's fairly safe to simply multiply those numbers to get the
RAM needed for the full 100% of your data (or even 120% to allow for modest
growth.)
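
For example (numbers purely illustrative): if a 15% sample produces a 9 GB
index and performs well with about 12 GB of RAM free for the OS cache, then
the full data set should need roughly 9 / 0.15 = 60 GB of index and
12 / 0.15 = 80 GB of cache-able RAM, plus the 20% growth margin on top of that.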

Be careful about distinguishing search and query - sure, only a subset of
the data is needed to find the matching documents, but then the stored data
must be fetched to return the query results (search/lookup vs. query
results.) If the stored values are not also cached, you will increase the
latency of your overall query (returning results) even if the
search/match/lookup was reasonably fast.

So, the model is to prototype with a measured subset of your data, see how
the latency and system memory usage work out, and then scale that number up
for total memory requirement.

Again to be clear, if you really do need the best/minimal overall query
latency, your best bet is to have sufficient system memory to fully cache
the entire index. If you actually don't need minimal latency, then of
course you can feel free to trade off RAM for lower latency.



-- Jack Krupansky

On Fri, Jan 15, 2016 at 4:43 AM, Gian Maria Ricci - aka Alkampfer <
alkamp...@nablasoft.com> wrote:

> Hi,
>
>
>
> When it is time to calculate how much RAM a solr instance needs to run
> with good performance, I know that it is some form of art, but I’m looking
> at a general “formula” to have at least one good starting point.
>
>
>
> Apart the RAM devoted to Java HEAP, that is strongly dependant on how I
> configure caches, and the distribution of queries in my system, I’m
> particularly interested in the amount of RAM to leave to operating system
> to use File Cache.
>
>
>
> Suppose I have an index that is 51 GB in size; clearly, having that amount
> of ram devoted to the OS is the best approach, so all index files can be
> cached into memory by the OS, thus I can achieve maximum speed.
>
>
>
> But if I look at the detail of the index, in this particular example I see
> that the bigger file has .fdt extension, it is the stored field for the
> documents, so it affects retrieval of document data, not the real search
> process. Since this file is 24 GB of size, it is almost half of the space
> of the index.
>
>
>
> My question is: it could be safe to assume that a good starting point for
> the amount of RAM to leave to the OS is the dimension of the index less the
> dimension of the .fdt file because it has less importance in the search
> process?
>
>
>
> Are there any particular setting at OS level (CentOS linux) to have
> maximum benefit from OS file cache? (documentation at
> https://cwiki.apache.org/confluence/display/solr/Taking+Solr+to+Production#TakingSolrtoProduction-MemoryandGCSettings
> does not have any information related to OS configuration). Elasticsearch (
> https://www.elastic.co/guide/en/elasticsearch/reference/1.4/setup-configuration.html)
> generally have some suggestions such as using mlockall, disable swap etc
> etc, I wonder if there are similar suggestions for solr.
>
>
>
> Many thanks for all the great help you are giving me in this mailing list.
>
>
>
> --
> Gian Maria Ricci
> Cell: +39 320 0136949
>
>
>
>


Re: Issue with stemming and lemmatizing

2016-01-15 Thread Jack Krupansky
Yes, you can do all of that, but... Solr is more of a toolkit rather than a
packaged solution, so you will have plug together all the pieces yourself.
There are a variety of stemmers in Solr and any number of techniques for
how to index and query using the stemmed and unstemmed variants of words.

Plenty of doc for you to start reading. Once you get the basics, then you
can move on to more specific and advanced details:
https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers%2C+Tokenizers%2C+and+Filters
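
Just as a rough illustration of one common pattern (a sketch, not the only
way - the field names text_stemmed and text_exact are made up here): index
the text into two fields via copyField, give only one of them a
stemmer/lemmatizer in its analysis chain, and then weight the unstemmed field
higher at query time so an exact form like "creation" outranks stemmed matches:

    // sketch (SolrJ); assumes 'client' is an existing SolrClient and both fields exist
    SolrQuery q = new SolrQuery("creation");
    q.set("defType", "edismax");
    q.set("qf", "text_stemmed^1 text_exact^10");   // the exact, unstemmed form wins ties
    QueryResponse rsp = client.query(q);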




-- Jack Krupansky

On Fri, Jan 15, 2016 at 2:58 PM, sara hajili  wrote:

> I want to write my own text tokenizer.
> My question is: how does Solr treat stemming or lemmatizing?
> Does Solr store both the lemmatized token and the original token together?
> I mean, if at index time Solr lemmatizes "creation" to "create",
> and at query time the user wants to search for exactly "creation", not "create",
> how does Solr do that?
> If I lemmatize the query string "creation" to "create",
> then Solr will find everything matching "create", not just "creation".
> How does Solr behave with a stemmer or lemmatizer - does it index both the original
> and the lemmatized word?
>


Re: Solr Block join not working after parent update

2016-01-15 Thread Jack Krupansky
Read the note at the bottom of the doc page:
"One limitation of indexing nested documents is that the whole block of
parent-children documents must be updated together whenever any changes are
required. In other words, even if a single child document or the parent
document is changed, the whole block of parent-child documents must be
indexed together."

See:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-NestedChildDocuments

As Mikhail indicatefed, "*the whole block of parent-child documents must be
indexed together.*" They must also be updated together.

-- Jack Krupansky

On Fri, Jan 15, 2016 at 3:31 AM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> On Thu, Jan 14, 2016 at 10:01 PM, sairamkumar <
> sairam.subraman...@gmail.com>
> wrote:
>
> > This is a show stopper. Kindly suggest solution/alternative.
>
>
> update whole block.
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> <http://www.griddynamics.com>
> 
>


Re: Returning all documents in a collection

2016-01-20 Thread Jack Krupansky
Yes, Exporting Results Sets is the preferred and recommended technique for
returning all documents in a collection, or even simply for queries that
select a large number of documents, all of which are to be returned. It
uses efficient streaming rather than paging.

But... this great feature currently does not have support for
distributed/SolrCloud mode:
"The initial release treats all queries as non-distributed requests. So the
client is responsible for making the calls to each Solr instance and
merging the results.
Using SolrJ’s CloudSolrClient as a model, developers could build clients
that automatically send requests to all the shards in a collection (or
multiple collections) and then merge the sorted sets any way they wish."

-- Jack Krupansky

On Wed, Jan 20, 2016 at 8:41 AM, Susheel Kumar 
wrote:

> Hello Salman,
>
> Please checkout the export functionality
> https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets
>
> Thanks,
> Susheel
>
> On Wed, Jan 20, 2016 at 6:57 AM, Emir Arnautovic <
> emir.arnauto...@sematext.com> wrote:
>
> > Hi Salman,
> > You should use cursors in order to avoid "deep paging issues". Take a
> look
> > at
> https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results.
> >
> > Regards,
> > Emir
> >
> > --
> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> >
> >
> > On 20.01.2016 12:55, Salman Ansari wrote:
> >
> >> Hi,
> >>
> >> I am looking for a way to return all documents from a collection.
> >> Currently, I am restricted to specifying the number of rows using
> Solr.NET
> >> but I am looking for a better approach to actually return all documents.
> >> If
> >> I specify a huge number such as 1M, the processing takes a long time.
> >>
> >> Any feedback/comment will be appreciated.
> >>
> >> Regards,
> >> Salman
> >>
> >>
> >
>


Re: Returning all documents in a collection

2016-01-20 Thread Jack Krupansky
It would be nice to have an explicit section in the doc on the topic of
"Dealing with Large Result Sets" to point people to the various approaches
(paging, caching, export, streaming expressions, and how to select the best
one for a given use case.)

(And Joel is going to promise to update the doc for this stored field
restriction, right?!)
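
(For the plain paging case, the cursor approach mentioned earlier in the
thread is straightforward from SolrJ - rough sketch below, assuming an
existing SolrClient named 'client' and a sort that includes the uniqueKey
field, which cursors require:)

    // sketch: deep paging with cursorMark (org.apache.solr.common.params.CursorMarkParams)
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(1000);
    q.setSort("id", SolrQuery.ORDER.asc);          // must sort on the uniqueKey
    String cursor = CursorMarkParams.CURSOR_MARK_START;
    while (true) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
      QueryResponse rsp = client.query(q);
      // ... process rsp.getResults() ...
      String next = rsp.getNextCursorMark();
      if (cursor.equals(next)) break;              // cursor stopped moving: done
      cursor = next;
    }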

-- Jack Krupansky

On Wed, Jan 20, 2016 at 9:38 AM, Joel Bernstein  wrote:

> CloudSolrStream is available in Solr 5. The "search" streaming expression
> can used or CloudSolrStream can be used in directly.
>
> https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions
>
> The export handler does not export stored fields though. It only exports
> fields using DocValues caches. So you may need to re-index your data to use
> this feature.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Wed, Jan 20, 2016 at 9:29 AM, Salman Ansari 
> wrote:
>
> > Thanks Emir, Susheel and Jack for your responses. Just to update, I am
> > using Solr Cloud plus I want to get the data completely without
> pagination
> > or cursor (I mean in one shot). Is there a way to do this in Solr?
> >
> > Regards,
> > Salman
> >
> > On Wed, Jan 20, 2016 at 4:49 PM, Jack Krupansky <
> jack.krupan...@gmail.com>
> > wrote:
> >
> > > Yes, Exporting Results Sets is the preferred and recommended technique
> > for
> > > returning all documents in a collection, or even simply for queries
> that
> > > select a large number of documents, all of which are to be returned. It
> > > uses efficient streaming rather than paging.
> > >
> > > But... this great feature currently does not have support for
> > > distributed/SolrCloud mode:
> > > "The initial release treats all queries as non-distributed requests. So
> > the
> > > client is responsible for making the calls to each Solr instance and
> > > merging the results.
> > > Using SolrJ’s CloudSolrClient as a model, developers could build
> clients
> > > that automatically send requests to all the shards in a collection (or
> > > multiple collections) and then merge the sorted sets any way they
> wish."
> > >
> > > -- Jack Krupansky
> > >
> > > On Wed, Jan 20, 2016 at 8:41 AM, Susheel Kumar 
> > > wrote:
> > >
> > > > Hello Salman,
> > > >
> > > > Please checkout the export functionality
> > > >
> https://cwiki.apache.org/confluence/display/solr/Exporting+Result+Sets
> > > >
> > > > Thanks,
> > > > Susheel
> > > >
> > > > On Wed, Jan 20, 2016 at 6:57 AM, Emir Arnautovic <
> > > > emir.arnauto...@sematext.com> wrote:
> > > >
> > > > > Hi Salman,
> > > > > You should use cursors in order to avoid "deep paging issues".
> Take a
> > > > look
> > > > > at
> > > >
> https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results
> > .
> > > > >
> > > > > Regards,
> > > > > Emir
> > > > >
> > > > > --
> > > > > Monitoring * Alerting * Anomaly Detection * Centralized Log
> > Management
> > > > > Solr & Elasticsearch Support * http://sematext.com/
> > > > >
> > > > >
> > > > >
> > > > > On 20.01.2016 12:55, Salman Ansari wrote:
> > > > >
> > > > >> Hi,
> > > > >>
> > > > >> I am looking for a way to return all documents from a collection.
> > > > >> Currently, I am restricted to specifying the number of rows using
> > > > Solr.NET
> > > > >> but I am looking for a better approach to actually return all
> > > documents.
> > > > >> If
> > > > >> I specify a huge number such as 1M, the processing takes a long
> > time.
> > > > >>
> > > > >> Any feedback/comment will be appreciated.
> > > > >>
> > > > >> Regards,
> > > > >> Salman
> > > > >>
> > > > >>
> > > > >
> > > >
> > >
> >
>


Re: Couple of question about Virtualization and Load Balancer

2016-01-21 Thread Jack Krupansky
Official numbers? There are none. If for no other reason than that
performance is completely dependent on your specific hardware and your
specific data and your specific data model. The standard recommendation is
that you should do a proof of concept implementation with a reasonable
subset of your data and judge for yourself whether the throughput and
latency are sufficient for your own specific requirements. Not everyone has
extreme throughput and latency requirements. If your requirements are
extreme then virtualization will likely not to work out for you, but if
your requirements are reasonably mild and you adequately provision your
cluster with enough shards and enough replicas, then virtualization may
actually work out well for you. Either way, adequately provisioning the
cluster (not overloading individual nodes with either too many documents or
too many requests) is always essential unless you are working with a very
small collection of data with a very light load.

The standard recommendation is to avoid the use of a load balancer between
the app and Solr - since the server client API in SolrJ automatically does
smart routing and round-robin load balancing:
https://lucene.apache.org/solr/5_4_0/solr-solrj/org/apache/solr/client/solrj/impl/CloudSolrClient.html
https://lucene.apache.org/solr/5_4_0/solr-solrj/org/apache/solr/client/solrj/impl/CloudSolrServer.html
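
In other words, something like this on the application side (sketch;
ZooKeeper addresses and collection name are placeholders) - the client
watches the cluster state in ZooKeeper and spreads requests across live
replicas on its own:

    // sketch: CloudSolrClient does the routing and load balancing itself
    CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181/solr");
    client.setDefaultCollection("mycollection");
    QueryResponse rsp = client.query(new SolrQuery("some query"));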

You may want a load balancer in front of multiple instances of your app,
but that's not a question or issue for Solr. The only issue there is
assuring that you have enough Solr shards and replicas to handle the
aggregate request load.


-- Jack Krupansky

On Thu, Jan 21, 2016 at 6:37 AM, Gian Maria Ricci - aka Alkampfer <
alkamp...@nablasoft.com> wrote:

> Hi,
>
>
>
> I’ve a couple of quick question about production setup.
>
>
>
> The first one is about virtualization, I’d like to know if there are any
> official test on loss of performance in virtualization environment. I think
> that the loss of performance is negligible, and quick question on test
> infrastructure is confirming this, but I’d like to know if there is some
> official numbers on this.
>
>
>
> The second question is about Load Balancer: any clue on how to
> automatically change the configuration on the load balancer if some of the
> node goes down? I’m looking to advices on what to monitor, the simplest
> solution could be issuing some test query and verify if the node is able to
> answer, but it would be nice to know if there are some standard metrics to
> monitor to proactively alert. (E.g. heap size almost full, so it would
> probably be better to remove the node from the balancer and alert a human to
> have a look at the status of the node).
>
>
>
> Many thanks.
>
>
>
> --
> Gian Maria Ricci
> Cell: +39 320 0136949
>
>
>
>


Re: One complex wildcard query lead solr OOM

2016-01-21 Thread Jack Krupansky
The Lucene WildcardQuery class does have an additional constructor that has
a maxDeterminizedStates parameter to limit the size of the FSM generated by
a wildcard queery, and the QueryParserBase class does have a method to set
that parameter, setMaxDeterminizedStates, but there is no Solr support for
invoking that method.

It is probably worth a Jira to get such support. Even then, the question is
how Solr should respond to the exception that gets thrown when that limit
is reached.

Even if Solr had an option to disable complex wildcards, the question is
what you want to happen when a complex wildcard is used - should an
exception be thrown, or... what?

I suppose it might be simplest to have a Solr option to limit the number of
wildcard characters used in a term, like to 4 or 8 or something like that.
IOW, have Solr check the term before the WildcardQuery is generated.
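
That kind of check is easy enough to do on the client side today while
waiting for such an option - e.g. (sketch; the threshold of 4 is arbitrary):

    // sketch: reject terms that contain too many wildcard characters
    static boolean tooManyWildcards(String term, int max) {
      int count = 0;
      for (char c : term.toCharArray()) {
        if (c == '*' || c == '?') count++;
      }
      return count > max;   // e.g. max = 4
    }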

-- Jack Krupansky

On Thu, Jan 21, 2016 at 8:18 PM, Jian Mou  wrote:

> We are using Solr as our search engine, and recently notice some user
> input wildcard query can lead to Solr dead loop in
>
> org.apache.lucene.util.automaton.Operations.determinize()
>
> , and it also eats memory and finally OOM.
>
> the wildcard query seems like **?-???o·???è??**。
>
> Although we can validate the input parameter, but I also wonder is there
> any configuration which can disable complex wildcard query like this which
> leads to severe performance problems.
>
>
> Related stacktrace
>
>
> [image: Inline image 1]
>
>
>
> Thanks,
>
> Jian
>


Re: Mix Solr 4 and 5?

2016-01-22 Thread Jack Krupansky
Just to be clear, are you talking about a single app that does SolrJ calls
to both your CMS and your free text search index? So, one Java app that is
simultaneously sending requests to two Solr instances (once 4, one 5)?

-- Jack Krupansky

On Fri, Jan 22, 2016 at 1:57 AM, 
wrote:

> Hi,
>
> Long story short, we use a CMS that is integrated with Solr 4.6, with the
> solrj jar file in the global/common Tomcat classpath. We currently use a
> Google Search Appliance machine for our own freetext search needs, but plan
> to replace that with some other solution in the near future. Since we
> already work with solr because of the CMS integration, we would like to
> select solr for this project.
>
> But I would prefer to use the latest version, ie solr 5, and I am not sure
> how that would work in our situation. Can we use the solrj client for solr
> 4 when indexing and searching on a solr 5 server? If so, would we miss some
> important feature, and would this setup be future proof?
>
> Or can we somehow use both solr4 and solr 5 client libraries at the same
> time, in the same context? It is not possible to upgrade the solr server
> that the CMS is using, and it is not possible to remove the 4.6 solrj jar
> from the common classpath in Tomcat. That is, unless the solr 5 version of
> solrj is backwards compatible, so that we can switch the jar files and our
> CMS would still be able to index and search in it's own solr 4 server.
>
> What would you say that our options are? I would really not like having do
> low level http calls to the solr 5 server.
>
> Regards
> /Jimi
>


Re: Mix Solr 4 and 5?

2016-01-22 Thread Jack Krupansky
The doc is silent on this issue of SolrJ vs. server version compatibility
in general (e.g., 4 vs. 5.) That's not an absolute assurance, but at least
it's a possibility. And and far as I know, if you had a SolrJ 4 app and
upgraded the server (with no change in the index or data model), the app
should work fine. So... if you stick with SolrJ 4 and use the Solr 4 doc as
your guide, you should be okay. That's the theory.

Worst case, you would have to deploy a Solr 4 server. That's not the
preferred choice, but is a decent backup plan.


-- Jack Krupansky

On Fri, Jan 22, 2016 at 10:19 AM, Shawn Heisey  wrote:

> On 1/21/2016 11:57 PM, jimi.hulleg...@svensktnaringsliv.se wrote:
> > Long story short, we use a CMS that is integrated with Solr 4.6, with
> the solrj jar file in the global/common Tomcat classpath. We currently use
> a Google Search Appliance machine for our own freetext search needs, but
> plan to replace that with some other solution in the near future. Since we
> already work with solr because of the CMS integration, we would like to
> select solr for this project.
> >
> > But I would prefer to use the latest version, ie solr 5, and I am not
> sure how that would work in our situation. Can we use the solrj client for
> solr 4 when indexing and searching on a solr 5 server? If so, would we miss
> some important feature, and would this setup be future proof?
> >
> > Or can we somehow use both solr4 and solr 5 client libraries at the same
> time, in the same context? It is not possible to upgrade the solr server
> that the CMS is using, and it is not possible to remove the 4.6 solrj jar
> from the common classpath in Tomcat. That is, unless the solr 5 version of
> solrj is backwards compatible, so that we can switch the jar files and our
> CMS would still be able to index and search in it's own solr 4 server.
>
> If you are NOT running SolrCloud, then that should work with no
> problem.  The HTTP API is fairly static and has not seen any major
> upheaval recently.  If you're NOT running SolrCloud, you may even be
> able to replace the SolrJ jar in your existing system with the 5.4.1
> version (and update SolrJ's dependent jars) and have everything continue
> to work.
>
> If you ARE running SolrCloud, I would not try mixing 4.x and 5.x, in
> either direction.  SolrCloud is evolving very quickly ... I wouldn't
> even mix *minor* versions, much less *major* versions.  There are
> differences in how the zookeeper database is laid out, and mixing
> versions is not guaranteed to work, especially if SolrJ is older than
> Solr.  If the version difference is small and SolrJ is newer than Solr,
> there's a chance of success, but with the situation you have described,
> SolrCloud would likely not work.
>
> I have no idea about whether or not you can mix SolrJ versions in your
> client project.  This is extremely tricky to get working right with Java
> in general, and may not be possible.
>
> Thanks,
> Shawn
>
>


Re: Mix Solr 4 and 5?

2016-01-22 Thread Jack Krupansky
Personally, I think the Solr project should endeavor to commit to
guaranteeing that a SolrJ x.y client will be compatible with a Solr x+1.y2
Solr server. AFAICT there currently isn't such a formal compat commitment
or promise, but also AFAIK there is no known non-compat issue between SolrJ
4.y and Solr 5.y2. Let's see if anybody else knows of any. There might be
issues if you do extreme things like examining the detailed cluster state
from Zookeeper or use some of the non-traditional APIs introduced since 4.6
that may have been works in progress, but as long as you keep your app
usage of these more advanced features to a minimum, you should be able to
sidestep such issues.

If you try it and do run into a compat issue, we should make an effort to
consider that an unacceptable bug since upgrading clients is not always
such an easy or even feasible process, and if the old clients aren't using
any new features there would be a reasonable expectation that they should
continue to work.


-- Jack Krupansky

On Fri, Jan 22, 2016 at 10:40 AM, 
wrote:

> Yeah, sort of. Solr isn't bundled in the CMS, it is in a separate Tomcat
> instance. But our code is running on the same Tomcat as the CMS, and the
> CMS uses solrj 4.x to talk with its solr. And now we want to be able to
> talk with our own separate solr, running solr 5.x, and would prefer to use
> solrj for this.
>
> /Jimi
>
> -Original Message-
> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> Sent: Friday, January 22, 2016 10:11 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Mix Solr 4 and 5?
>
> Just to be clear, are you talking about a single app that does SolrJ calls
> to both your CMS and your free text search index? So, one Java app that is
> simultaneously sending requests to two Solr instances (once 4, one 5)?
>
> -- Jack Krupansky
>
> On Fri, Jan 22, 2016 at 1:57 AM, 
> wrote:
>
> > Hi,
> >
> > Long story short, we use a CMS that is integrated with Solr 4.6, with
> > the solrj jar file in the global/common Tomcat classpath. We currently
> > use a Google Search Appliance machine for our own freetext search
> > needs, but plan to replace that with some other solution in the near
> > future. Since we already work with solr because of the CMS
> > integration, we would like to select solr for this project.
> >
> > But I would prefer to use the latest version, ie solr 5, and I am not
> > sure how that would work in our situation. Can we use the solrj client
> > for solr
> > 4 when indexing and searching on a solr 5 server? If so, would we miss
> > some important feature, and would this setup be future proof?
> >
> > Or can we somehow use both solr4 and solr 5 client libraries at the
> > same time, in the same context? It is not possible to upgrade the solr
> > server that the CMS is using, and it is not possible to remove the 4.6
> > solrj jar from the common classpath in Tomcat. That is, unless the
> > solr 5 version of solrj is backwards compatible, so that we can switch
> > the jar files and our CMS would still be able to index and search in
> it's own solr 4 server.
> >
> > What would you say that our options are? I would really not like
> > having do low level http calls to the solr 5 server.
> >
> > Regards
> > /Jimi
> >
>


Re: Mix Solr 4 and 5?

2016-01-22 Thread Jack Krupansky
To be clear, having separate Solr servers on different versions should
definitely not be a problem. The only potential difficulty here is the
SolrJ vs. server back-compat issue.

-- Jack Krupansky

On Fri, Jan 22, 2016 at 10:57 AM, 
wrote:

> Shawn wrote:
> >
> > If you are NOT running SolrCloud, then that should work with no problem.
> > The HTTP API is fairly static and has not seen any major upheaval
> recently.
> > If you're NOT running SolrCloud, you may even be able to replace the
> > SolrJ jar in your existing system with the 5.4.1 version (and update
> > SolrJ's dependent jars) and have everything continue to work.
> >
> > If you ARE running SolrCloud, I would not try mixing 4.x and 5.x,
> > in either direction.  SolrCloud is evolving very quickly ... I wouldn't
> > even mix *minor* versions, much less *major* versions.
> > There are differences in how the zookeeper database is laid out,
> > and mixing versions is not guaranteed to work, especially if SolrJ
> > is older than Solr.  If the version difference is small and SolrJ is
> newer
> > than Solr, there's a chance of success, but with the situation you
> > have described, SolrCloud would likely not work.
>
> When you talk about not mixing 4.x and 5.x when using SolrCloud, you mean
> between the client and the server that talk to each other, right? Or would
> it be a problem keeping our existing non cloud solr 4.x server, upgrading
> the client solrj jar to 5.x (assuming this works, like you and others here
> seem to think it should/could), and then adding a new solr cloud 5.x
> server? That way, there the two separate communication "channels" are solrj
> 5.x <--> solr 4.x server, and, solrj 5.x  <--> solrcloud 5.x.
>
> Or does the mere presense of a solr 4.x server and a solr cloud 5.x server
> on the same network cause problems, even when they don't know about
> eachother?
>
> Regards
> /Jimi
>


Re: Taking Solr to production

2016-01-22 Thread Jack Krupansky
"1 Leader & 3 Replicas"

SolrCloud does not distinguish leaders from replicas - that's old
master-slave terminology. The leader is just one of the replicas.

So, are you really talking about 2 shards with 4 replicas each or 2 shards
with 2 replicas each?

Putting multiple replica instances on each machine isn't buying you
anything, just making it more complicated to manage.

Number of shards is determined by amount of data and whether query latency
can be achieved - use more shards if the query latency is too high.

2.5 million (2,500,000) documents is rather small, so unless your queries
are running really slow, it's not clear you even need sharding, but we
don't know your document and query complexity. Heavy faceting or complex
function queries?

Number of replicas is determined by query load - number of simultaneous
query requests, as well as HA availability requirements.




-- Jack Krupansky

On Fri, Jan 22, 2016 at 5:45 PM, Toke Eskildsen 
wrote:

> Aswath Srinivasan (TMS)  wrote:
> > * Totally about 2.5 million documents to  be indexed
> > * Documents average size is 512 KB - pdfs and htmls
>
> > This being said I was thinking I would take the Solr to production with,
> > * 2 shards, 1 Leader & 3 Replicas
>
> > Do you all think this set up will work? Will this server me 150 QPS?
>
> It certainly helps that you are batch updating. What is missing in this
> estimation is how large the documents are when indexed, as I guess the ½MB
> average is for the raw files? If they are your everyday short PDFs with
> images, meaning not a lot of text, handling 2M+ of them is easy. If they
> are all full-length books, it is another matter.
>
> Your document count is relatively low and if your index data end up being
> not-too-big (let's say 100GB), then you ought to consider having just a
> single shard with 4 replicas: There is a non-trivial overhead going from 1
> shard to more than one, especially if you are doing faceting.
>
> - Toke Eskildsen
>


Re: One complex wildcard query lead solr OOM

2016-01-24 Thread Jack Krupansky
Just escape them with a backslash. Or put each term in quotes.
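
If you build the query in SolrJ, ClientUtils can do the escaping for you -
rough sketch (org.apache.solr.client.solrj.util.ClientUtils):

    // sketch: escape *, ? and the other Lucene special characters in user input
    String raw = "?-???o*";                          // whatever the user typed
    String safe = ClientUtils.escapeQueryChars(raw);
    SolrQuery q = new SolrQuery("title:" + safe);    // wildcards now treated as literal text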

-- Jack Krupansky

On Sun, Jan 24, 2016 at 5:21 AM, Jian Mou  wrote:

> Hi Jack,
>
> Thanks! Do you know how to disable wildcards, What I want is if input is
> wildcards, just treat it as a normal char. I other words,
> I just want to disable wildcard search.
>
> Thanks,
> Jian
>
> On Fri, Jan 22, 2016 at 1:55 PM, Jack Krupansky 
> wrote:
>
> > The Lucene WildcardQuery class does have an additional constructor that
> has
> > a maxDeterminizedStates parameter to limit the size of the FSM generated
> by
> > a wildcard queery, and the QueryParserBase class does have a method to
> set
> > that parameter, setMaxDeterminizedStates, but there is no Solr support
> for
> > invoking that method.
> >
> > It is probably worth a Jira to get such support. Even then, the question
> is
> > how Solr should respond to the exception that gets thrown when that limit
> > is reached.
> >
> > Even if Solr had an option to disable complex wildcards, the question is
> > what you want to happen when a complex wildcard is used - should an
> > exception be thrown, or... what?
> >
> > I suppose it might be simplest to have a Solr option to limit the number
> of
> > wildcard characters used in a term, like to 4 or 8 or something like
> that.
> > IOW, have Solr check the term before the WildcardQuery is generated.
> >
> > -- Jack Krupansky
> >
> > On Thu, Jan 21, 2016 at 8:18 PM, Jian Mou  wrote:
> >
> > > We are using Solr as our search engine, and recently notice some user
> > > input wildcard query can lead to Solr dead loop in
> > >
> > > org.apache.lucene.util.automaton.Operations.determinize()
> > >
> > > , and it also eats memory and finally OOM.
> > >
> > > the wildcard query seems like **?-???o·???è??**。
> > >
> > > Although we can validate the input parameter, but I also wonder is
> there
> > > any configuration which can disable complex wildcard query like this
> > which
> > > lead to serve performance problems.
> > >
> > >
> > > Related statcktrace
> > >
> > >
> > > [image: Inline image 1]
> > >
> > >
> > >
> > > Thanks,
> > >
> > > Jian
> > >
> >
>


Re: unmerged index segments

2016-01-25 Thread Jack Krupansky
What exactly are your merge policy settings in solrconfig? They control
when the background merges will be performed. Sometimes they do need to be
tweaked.

-- Jack Krupansky

On Mon, Jan 25, 2016 at 1:50 PM, James Mason 
wrote:

> Hi,
>
> I’ve have a large index that has been added to over several years, and
> I’ve discovered that I have many segments that haven’t been updated for
> well over a year, even though I’m adding, updating and deleting records
> daily. My five largest segments all haven’t been updated for over a year.
>
> Meanwhile, the number of segments I have keeps on increasing, and I have
> hundreds of segment files that don’t seem to be getting merged past a
> certain size (e.g. the largest is 2Gb but my older segments are over 100Gb).
>
> My understanding was that background merges should be merging these older
> segments with newer data over time, but this doesn’t seem to be the case.
>
> I’m using Solr 4.9, but I was using an older version at the time that
> these ‘older’ segments were created.
>
> Any help on suggestions of what’s happening would be very much
> appreciated. And also any suggestion on how I can monitor what’s happening
> with the background merges.
>
> Thanks,
>
> James


Re: unmerged index segments

2016-01-26 Thread Jack Krupansky
Sorry I don't have any specific guidance since the results are so
unpredictable. But a much lower mergeFactor should result in more frequent
merges, which should reduce segment count but may slow indexing down.

If you make the change and then add enough documents to exceed the segment
size limit (ramBufferSizeMB and maxBufferedDocs), then it should trigger
the merge, we hope.

You may also have to use your own explicit <mergePolicy> element in order to get
control over more of the parameters of TieredMergePolicy, which is the
default. Solr is using <mergeFactor> to set the maxMergeAtOnce and
segmentsPerTier options to be the same, but you may want to change them to
differ.
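
For reference, those solrconfig settings map onto Lucene's TieredMergePolicy
setters, roughly like this (Lucene-level sketch only - in Solr you would
normally set the equivalent options in solrconfig rather than call these
directly):

    // sketch: the Lucene-side knobs behind the solrconfig merge settings
    TieredMergePolicy tmp = new TieredMergePolicy();
    tmp.setMaxMergeAtOnce(10);            // how many segments merge in one go
    tmp.setSegmentsPerTier(10);           // how many segments a tier may hold
    tmp.setMaxMergedSegmentMB(5 * 1024);  // default is ~5GB per merged segment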

Some doc to read:
https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig
https://wiki.apache.org/solr/SolrPerformanceFactors

The official Solr doc doesn't detail all the merge policy settings,
pointing you to the Javadoc, which for Tiered is here:
http://lucene.apache.org/core/5_4_0/core/org/apache/lucene/index/TieredMergePolicy.html

I did doc all of these options (as of Solr 4.4) in my Solr 4.x Deep Dive
e-book and I don't think much of that has changed since then:
http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html

-- Jack Krupansky

On Tue, Jan 26, 2016 at 3:37 AM, James Mason 
wrote:

> Hi Jack,
>
> Sorry, I should have put them on my original message.
>
> All merge policy settings are at their default except mergeFactor, which I
> now notice is quite high at 45. Unfortunately I don’t have the full history
> to see when this setting was changed, but I do know they haven’t been
> changed for well over a year, and that we did originally run Solr using the
> default settings.
>
> So reading about mergeFactor it sounds like this is likely the problem,
> and we’re simply not asking Solr to merge into these old and large segments
> yet?
>
> If I was to change this back down to the default of 10, would you expect
> we’d get quite an immediate and intense period of merging?
>
> If I was to launch a dupliacate test Solr instance, change the merge
> factor, and simply leave it for a few days, would it perform the background
> merge (so I can test to see if there’s enough memory etc for the merge to
> complete?).
>
> Thanks,
>
> James
>
>
>
> > On 25 Jan 2016, at 21:39, Jack Krupansky 
> wrote:
> >
> > What exacting are you merge policy settings in solrconfig? They control
> > when the background merges will be performed. Sometimes they do need to
> be
> > tweaked.
> >
> > -- Jack Krupansky
> >
> > On Mon, Jan 25, 2016 at 1:50 PM, James Mason 
> > wrote:
> >
> >> Hi,
> >>
> >> I’ve have a large index that has been added to over several years, and
> >> I’ve discovered that I have many segments that haven’t been updated for
> >> well over a year, even though I’m adding, updating and deleting records
> >> daily. My five largest segments all haven’t been updated for over a
> year.
> >>
> >> Meanwhile, the number of segments I have keeps on increasing, and I have
> >> hundreds of segment files that don’t seem to be getting merged past a
> >> certain size (e.g. the largest is 2Gb but my older segments are over
> 100Gb).
> >>
> >> My understanding was that background merges should be merging these
> older
> >> segments with newer data over time, but this doesn’t seem to be the
> case.
> >>
> >> I’m using Solr 4.9, but I was using an older version at the time that
> >> these ‘older’ segments were created.
> >>
> >> Any help on suggestions of what’s happening would be very much
> >> appreciated. And also any suggestion on how I can monitor what’s
> happening
> >> with the background merges.
> >>
> >> Thanks,
> >>
> >> James
>
>


Re: Solr cannot return result when query with # * like title:#7654321*

2016-01-27 Thread Jack Krupansky
Just to be sure, please post the lines of code or command line that you are
using to issue the query.

-- Jack Krupansky

On Wed, Jan 27, 2016 at 10:50 PM, Yonik Seeley  wrote:

> On Wed, Jan 27, 2016 at 10:47 PM, diyun2008  wrote:
> > Hi Yonik
> >
> >I do actually encode it like q=titile:%237654321* (which is :
> > q=titile:#7654321*)
>
> Yes, if you *need* to encode it yourself (i.e. if you're using curl,
> or a browser URL bar).  It really depends on the client you are using.
>
> -Yonik
>


Re: Adding new documents to the search results and rescoring. Is it possible?

2016-01-28 Thread Jack Krupansky
Please provide a little more context.

How exactly are new documents getting added to a result set? I mean, each
query has its own result set, so there really isn't any way for a new query
to impact the results of a previous query.

Scores are always calculated fresh on each query, so there would never be a
need to "re" score them. Are you simply looking for a way to shift/boost
the scores somehow? Again, tell us more about what you are actually trying
to achieve.

-- Jack Krupansky

On Thu, Jan 28, 2016 at 9:52 AM, vitaly bulgakov 
wrote:

> I have Solr 4.2. Is it possible to rescore results after adding new
> documents
> to the result set?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Adding-new-documents-to-the-search-results-and-rescoring-Is-it-possible-tp4253859.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Solr cannot return result when query with # * like title:#7654321*

2016-01-28 Thread Jack Krupansky
Thanks. This is what Yonik was referring to - that # is a special URL
syntax character which signifies that the text after the # is what is known
as a fragment identifier, which is separated from the path and query
parameters of the URL. The Solr query is simply one URL query parameter
(&name=value). You need to escape the #, such as %23. But if you are using
SolrJ, the escaping should be handled by the SolrJ API itself.

See:
https://en.wikipedia.org/wiki/Fragment_identifier
https://tools.ietf.org/html/rfc3986
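
For example, if you are building the URL by hand from Java, percent-encoding
the parameter value yourself produces the %23 form (sketch; exception
handling omitted):

    // sketch: java.net.URLEncoder turns '#' into %23 so it survives as part of the q parameter
    String q = URLEncoder.encode("title:#7654321*", "UTF-8");
    String url = "http://127.0.0.1:8080/solr/collection1/select?q=" + q;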

Just to be super clear, how exactly are you sending the query to Solr - if
using curl, please post the full curl command.


-- Jack Krupansky

On Thu, Jan 28, 2016 at 1:03 AM, diyun2008  wrote:

> The query is rather simple:
> http://127.0.0.1:8080/solr/collection1/select?q=title:#7654321*
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-cannot-return-result-when-query-with-like-title-7654321-tp4253541p4253760.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: implement exact match for one of the search fields only?

2016-01-28 Thread Jack Krupansky
A simple boost query (bq) might do the trick, using edismax:

q=dvd bracket
bq=spp_keyword_exact:"dvd bracket"^100
qf=P_VeryShortDescription P_ShortDescription P_CatConcatKeyword
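
In SolrJ terms that's roughly the following (sketch; 'client' is assumed to
be an existing SolrClient and the field names are the ones from your message):

    // sketch: boost the exact-phrase match on the keyword field
    SolrQuery q = new SolrQuery("dvd bracket");
    q.set("defType", "edismax");
    q.set("qf", "P_VeryShortDescription P_ShortDescription P_CatConcatKeyword");
    q.set("bq", "spp_keyword_exact:\"dvd bracket\"^100");
    QueryResponse rsp = client.query(q);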

-- Jack Krupansky

On Thu, Jan 28, 2016 at 12:49 PM, Erick Erickson 
wrote:

> bq: if you are interested phrase query, you should use String field
>
> If you do this, you will NOT be able to search within the string. I.e.
> if the doc field is "my dog has fleas" you cannot match
> "dog has" with a string-based field.
>
> If you want to match the _entire_ string or you want prefix-only
> matching, then string might work, i.e. if you _only_ want to be able
> to match
>
> "my dog has fleas"
> "my dog*"
> but not
> "dog has fleas".
>
> On to the root question though.
>
> I really think you want to look at edismax. What you're trying to do
> is apply the same search term to individual fields. In particular,
> the pf parameter will automatically apply the search terms _as a phrase_
> against the field specified, relieving you of having to enclose things
> in quotes.
>
> The manual way of doing this would be to construct an elaborate query, like
> q=spp_keyword_exact:"dvd bracket" OR P_ShortDescription:(dvd bracket)
> OR
>
> NOTE: the parens are necessary or the last part of the above would be
> parsed as
> P_ShortDescription:dvd default_searchfield:bracket
>
> And the &debug=query trick will show you exactly how things are actually
> searched, it's invaluable.
>
> Best,
> Erick
>
> On Thu, Jan 28, 2016 at 5:08 AM, Mugeesh Husain  wrote:
> > Hi,
> > if you are interested phrase query, you should use String field instead
> of
> > text field in schema like as
> >  
> >
> > this will solved you problem.
> >
> > if you are missing anything else let share
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/implement-exact-match-for-one-of-the-search-fields-only-tp4253786p4253827.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Nested documents and many-many relation

2016-01-29 Thread Jack Krupansky
If you wish to change, add, or delete a child or change the parent you must
do an add of the entire block again with both the parent and all children.
This is because the efficiency of Block Join comes from the documents being
adjacent in Lucene and segments are immutable in Lucene, so the entire
block must be written to a new segment.
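
With SolrJ that means rebuilding and re-sending the whole block in one add,
roughly like this (sketch; the ids, field names, and the allChildren
collection are illustrative only, and 'client' is an existing SolrClient):

    // sketch: re-index the parent plus ALL of its children as a single block
    SolrInputDocument parent = new SolrInputDocument();
    parent.addField("id", "parent-1");
    parent.addField("content_type", "parent");
    for (SolrInputDocument child : allChildren) {   // every child, not just the changed one
      parent.addChildDocument(child);
    }
    client.add(parent);
    client.commit();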


-- Jack Krupansky

On Fri, Jan 29, 2016 at 5:13 AM, Sathyakumar Seshachalam <
sathyakumar_seshacha...@trimble.com> wrote:

> Hi,
>
> Am trying to investigate the possibility of using Block Join query parser
> in a many-to-many relation scenario.
> Observation is that when a document is added as a child to more than one
> parent document (I use Solrj to do this), I seem to get two copies of the
> child document. Can this be avoided ? Is this per design ?
> Are there are articles talking about ways to model a many-to-many
> relationship (even if its a hacky solution).
>
>


Re: How much JVM should we allocate

2016-01-29 Thread Jack Krupansky
Ultimately, your JVM heap size will likely be somewhere in the 4GB to 12GB
range. Alas, you have to use trial and error to size it. If it's too small
you will hit OOM or performance degradation due to frequent GC. If it's too
large you will accumulate way too much garbage before a GC hits and then
the GC will take too long and even OOM if load is heavy enough to interfere
with GC. So, you're looking for a size where there is enough heap to avoid
OOM, plus a margin to allow for index growth and spikes in query load, but
not so small that GC occurs way too frequently. A simple binary search can
be used to find this sweet spot:

1. Pick a likely workable size, like 8 GB.
2. Index an amount of data that is either the real, intended load or a
simulation of that load and hit with plenty of queries, including complex
and expensive ones. (This should all be some automated test.)
3. If you hit OOM or performance is bad, repeat test with a heap size that
is half again bigger, like 12 GB. With each iteration, divide the increment
by 2 (8 GB, 4GB, 2GB, 1GB)
4. If you don't hit OOM and performance is decent, repeat test with a heap
size that is half again smaller, like 4GB.
5. Once you get to the point where you have tested two configs that differ
only by 1GB or less, you're done.
6. If the final smaller heap doesn't OOM, use it, otherwise use the larger
config
7. Add 10% to 20% (or even 25%) or so to that minimal working config, like
from 8GB to 10GB to have room to expand and handle spikes.
8. Run that final config for an extended period (days) with as realistic a
load as possible
9. If it too hits OOM or frequent GC, you may have to bump up the heap some
more, like another 10%.


-- Jack Krupansky

On Fri, Jan 29, 2016 at 11:51 AM, Erick Erickson 
wrote:

> And adding to Shawn's comment you want to have as little JVM as possible,
> see:
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> Best,
> Erick
>
> On Fri, Jan 29, 2016 at 5:02 AM, Shawn Heisey  wrote:
> > On 1/28/2016 10:24 PM, Midas A wrote:
> >> CPU : 4
> >> physical memory : 48 GB
> >>
> >>
> >> and we are only have solr on this server . How much JVM  can be
> allocate to
> >> run server smoothly.
> >
> > We don't know.  You haven't provided any information about your index or
> > how you use it.  Even if you do provide that information, any number
> > that we gave you might be completely wrong, and should be viewed as
> > *only* a starting point.
> >
> >
> https://lucidworks.com/blog/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> >
> > Thanks,
> > Shawn
> >
>


Re: Increasing maxMergedSegmentMB value

2016-01-30 Thread Jack Krupansky
From the Lucene MergePolicy Javadoc:

"Whenever the segments in an index have been altered by IndexWriter, either
the addition of a newly flushed segment, addition of many segments from
addIndexes* calls, or a previous merge that may now need to cascade,
IndexWriter invokes findMerges(MergeTrigger, SegmentInfos, IndexWriter) to
give the MergePolicy a chance to pick merges that are now required. This
method returns a MergePolicy.MergeSpecification instance describing the set
of merges that should be done, or null if no merges are necessary. When
IndexWriter.forceMerge is called, it calls
findForcedMerges(SegmentInfos, int, Map, IndexWriter) and the MergePolicy
should then return the necessary merges."

See:
https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/index/MergePolicy.html

IOW, when the next commit occurs that closes and flushes the currently open
segment.

Nothing will happen to any existing 10GB segments, now or ever in the
future since merging two 10GB segments would not be possible with a limit
of only 15GB.

Maybe you could clue us in as to what effect you are trying to achieve. I
mean, why should any app care whether segments are 10GB or 15GB?


-- Jack Krupansky

On Sat, Jan 30, 2016 at 6:28 PM, Shawn Heisey  wrote:

> On 1/30/2016 7:31 AM, Zheng Lin Edwin Yeo wrote:
> > I would like to find out, when I increase the maxMergedSegmentMB from
> 10240
> > (10GB) to 15360 (15GB), will all the 10GB segments that were created
> > previously be automatically merge to 15GB?
>
> Not necessarily.  It will make those 10GB+ segments eligible for further
> merging, whereas they would have been ineligible before the change.
>
> This might mean that one or more of those large segments will be merged
> soon after the change and restart/reload, but I do not know when it
> might happen.  It would probably wait until at least one new segment was
> created, at which time the merge policy would be consulted.
>
> Thanks,
> Shawn
>
>


Re: Increasing maxMergedSegmentMB value

2016-01-31 Thread Jack Krupansky
Make sure you fully digest Mike McCandless' blog post on segment merge
before trying to outguess his code:
http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html

Generally, I don't think you would want to merge just two segments.
Generally, you should do a bunch at a time, typically 10. IOW, take all the
segments on a tier and merge them into one segment at the next tier.

There is no documented practical upper limit for how big to make a single
segment, but very large segments are not likely to be optimized well in
Lucene, hence the default max merge size of 5GB. If you want to get a lot
above that, you're in uncharted territory. Besides, if you start pushing
your index well above the amount of available system memory your query
performance will suffer. I'd watch for the latter before pushing on the
former.


-- Jack Krupansky

On Sun, Jan 31, 2016 at 10:43 AM, Zheng Lin Edwin Yeo 
wrote:

> Thanks for your reply Shawn and Jack.
>
> I wanted to increase the segment size to 15GB, so that there will be lesser
> segments to search for during the query, which should potentially improve
> the query speed.
>
> What if I set the segment size to 20GB? Will all the existing 10GB segments
> be merge to 20GB, as now merging two 10GB segments will results in a 20GB
> segment?
>
> Regards,
> Edwin
>
>
> On 31 January 2016 at 12:16, Jack Krupansky 
> wrote:
>
> > From the Lucene MergePolicy Javadoc:
> >
> > "Whenever the segments in an index have been altered by IndexWriter
> > <
> >
> https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/index/IndexWriter.html
> > >,
> > either the addition of a newly flushed segment, addition of many segments
> > from addIndexes* calls, or a previous merge that may now need to cascade,
> > IndexWriter
> > <
> >
> https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/index/IndexWriter.html
> > >
> >  invokes findMerges(org.apache.lucene.index.MergeTrigger,
> > org.apache.lucene.index.SegmentInfos,
> org.apache.lucene.index.IndexWriter)
> > <
> >
> https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/index/MergePolicy.html#findMerges(org.apache.lucene.index.MergeTrigger
> > ,
> > org.apache.lucene.index.SegmentInfos,
> > org.apache.lucene.index.IndexWriter)> to
> > give the MergePolicy a chance to pick merges that are now required. This
> > method returns a MergePolicy.MergeSpecification
> > <
> >
> https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/index/MergePolicy.MergeSpecification.html
> > >
> > instance
> > describing the set of merges that should be done, or null if no merges
> are
> > necessary. When IndexWriter.forceMerge is called, it calls
> > findForcedMerges(SegmentInfos,int,Map,
> > IndexWriter)
> > <
> >
> https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/index/MergePolicy.html#findForcedMerges(org.apache.lucene.index.SegmentInfos
> > ,
> > int, java.util.Map, org.apache.lucene.index.IndexWriter)> and the
> > MergePolicy should then return the necessary merges."
> >
> > See:
> >
> >
> https://lucene.apache.org/core/5_4_0/core/org/apache/lucene/index/MergePolicy.html
> >
> > IOW, when the next commit occurs that closes and flushes the currently
> open
> > segment.
> >
> > Nothing will happen to any existing 10GB segments, now or ever in the
> > future since merging two 10GB segments would not be possible with a limit
> > of only 15GB.
> >
> > Maybe you could clue us in as to what effect you are trying to achieve. I
> > mean, why should any app care whether segments are 10GB or 15GB?
> >
> >
> > -- Jack Krupansky
> >
> > On Sat, Jan 30, 2016 at 6:28 PM, Shawn Heisey 
> wrote:
> >
> > > On 1/30/2016 7:31 AM, Zheng Lin Edwin Yeo wrote:
> > > > I would like to find out, when I increase the maxMergedSegmentMB from
> > > 10240
> > > > (10GB) to 15360 (15GB), will all the 10GB segments that were created
> > > > previously be automatically merge to 15GB?
> > >
> > > Not necessarily.  It will make those 10GB+ segments eligible for
> further
> > > merging, whereas they would have been ineligible before the change.
> > >
> > > This might mean that one or more of those large segments will be merged
> > > soon after the change and restart/reload, but I do not know when it
> > > might happen.  It would probably wait until at least one new segment
> was
> > > created, at which time the merge policy would be consulted.
> > >
> > > Thanks,
> > > Shawn
> > >
> > >
> >
>


Re: URI is too long

2016-01-31 Thread Jack Krupansky
Or try the terms query parser that lets you eliminate all the OR operators:
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-TermsQueryParser
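For example (field name and values are made up), a list that would otherwise
be hundreds of OR clauses can be written as:

    q={!terms f=id}12345,67890,24680,13579

It builds a single constant-scoring query from the comma-separated list (it can
also be used as an fq), and combined with a POST request it sidesteps the URI
length limit entirely.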


-- Jack Krupansky

On Sun, Jan 31, 2016 at 9:23 AM, Paul Libbrecht  wrote:

> How about using POST?
>
> paul
>
> > Salman Ansari <mailto:salman.rah...@gmail.com>
> > 31 January 2016 at 15:20
> > Hi,
> >
> > I am building a long query containing multiple ORs between query terms. I
> > started to receive the following exception:
> >
> > The remote server returned an error: (414) Request-URI Too Long. Any idea
> > what is the limit of the URL in Solr? Moreover, as a solution I was
> > thinking of chunking the query into multiple requests but I was wondering
> > if anyone has a better approach?
> >
> > Regards,
> > Salman
> >
>
>


Re: Determine if Merge is triggered in SOLR

2016-01-31 Thread Jack Krupansky
You would have to implement your own MergeScheduler that wrapped an
existing merge scheduler and then save the merge info and then write a
custom request handler to retrieve that saved info.

See:
https://lucene.apache.org/core/5_4_1/core/org/apache/lucene/index/MergeScheduler.html
https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig
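A very rough sketch of the first part, written against the Lucene 5.x API as I
recall it (untested; the class name and counter are invented, and the request
handler to expose the data is left out):

    import java.io.IOException;
    import java.util.concurrent.atomic.AtomicLong;
    import org.apache.lucene.index.ConcurrentMergeScheduler;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.MergeTrigger;

    // Wraps the default scheduler: each time IndexWriter asks for pending
    // merges to be run, remember it, then let the parent do the real work.
    public class RecordingMergeScheduler extends ConcurrentMergeScheduler {
      public static final AtomicLong MERGE_REQUESTS = new AtomicLong();

      @Override
      public void merge(IndexWriter writer, MergeTrigger trigger, boolean newMergesFound)
          throws IOException {
        MERGE_REQUESTS.incrementAndGet();  // or log trigger/newMergesFound somewhere durable
        super.merge(writer, trigger, newMergesFound);
      }
    }

You would register it with a <mergeScheduler class="..."/> element under
<indexConfig> and then read the counter back from your custom handler (or just
from the logs).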


-- Jack Krupansky

On Sun, Jan 31, 2016 at 1:59 PM, abhi Abhishek  wrote:

> Hi All,
> any suggestions/ ideas?
>
> Thanks,
> Abhishek
>
> On Tue, Jan 26, 2016 at 9:16 PM, abhi Abhishek 
> wrote:
>
> > Hi All,
> > is there a way in SOLR to determine if a merge has been triggered in
> > SOLR? is there a API exposed to query this?
> >
> > if its not available is there a way to do the same using lucene jar files
> > available in the SOLR libs?
> >
> > Appreciate your help.
> >
> > Best Regards,
> > Abhishek
> >
>


Re: Error in UIMA, probably opencalais,

2016-02-01 Thread Jack Krupansky
At the bottom (the fine print!) it says: lineNumber: 15; columnNumber: 7;
The element type "meta" must be terminated by the matching end-tag
"".

-- Jack Krupansky

On Mon, Feb 1, 2016 at 10:45 AM, Gian Maria Ricci - aka Alkampfer <
alkamp...@nablasoft.com> wrote:

> Hi,
>
>
>
> I’ve configured integration with UIMA but when I try to add a document I
> always got the error reported at bottom of the mail.
>
>
>
> It seems to be related to openCalais, but I’ve registered to OpenCalais
> and setup my token in solrconfig, so I wonder if anyone has some clue on
> what could be the reason of the error.
>
>
>
> I’m running this on Solr 5.3.1 instance running on linux.
>
>
>
> Gian Maria.
>
>
>
> null:org.apache.solr.common.SolrException: processing error null.
> id=doc4,  text="This is some textual content to verify UIMA integration..."
>
>  at
> org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processAdd(UIMAUpdateRequestProcessor.java:127)
>
>  at
> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:250)
>
>  at
> org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:177)
>
>  at
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:98)
>
>  at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>
>  at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
>
>  at org.apache.solr.core.SolrCore.execute(SolrCore.java:2068)
>
>  at
> org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:669)
>
>  at
> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:462)
>
>  at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:210)
>
>  at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
>
>  at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
>
>  at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>
>  at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>
>  at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>
>  at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
>
>  at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
>
>  at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>
>  at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>
>  at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
>
>  at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>
>  at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>
>  at
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
>
>  at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>
>  at org.eclipse.jetty.server.Server.handle(Server.java:499)
>
>  at
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
>
>  at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
>
>  at
> org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
>
>  at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
>
>  at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
>
>  at java.lang.Thread.run(Thread.java:745)
>
> Caused by: org.apache.uima.analysis_engine.AnalysisEngineProcessException
>
>  *at
> org.apache.uima.annotator.calais.OpenCalaisAnnotator.process(OpenCalaisAnnotator.java:208)*
>
>  at
> org.apache.uima.analysis_component.CasAnnotator_ImplBase.process(CasAnnotator_ImplBase.java:56)
>
>  at
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:377)
>
>  at
> org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:295)
>
>  at
> org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:567)
>

Re: alternative forum for SOLR user

2016-02-01 Thread Jack Krupansky
Some people prefer to use Stack Overflow, but this mailing list is still
the definitive "forum" for Solr users.

See:
http://stackoverflow.com/questions/tagged/solr


-- Jack Krupansky

On Mon, Feb 1, 2016 at 10:58 AM, Shawn Heisey  wrote:

> On 2/1/2016 1:13 AM, Jean-Jacques MONOT wrote:
> > I am a newbie with SOLR and just registered to this mailing list.
> >
> > Is there an alternative forum for SOLR user ? I am using this mailing
> > list for support, but did not find "real" web forum.
>
> Are you using "forum" as a word that can include a mailing list, or are
> you talking explicitly about a website for Solr that is running forum
> software?
>
> There is at least one "forum" website that actually mirrors this mailing
> list -- posts made on the forum are sent to the mailing list, and
> vice-versa.  The example I am thinking of is Nabble.
>
> This mailing list is the primary official path to find support on Solr
> -- the list is run by the Apache Software Foundation, which owns all
> rights connected to Solr.  There is no official "forum" website for the
> project, and nothing like it is planned for the near future.  Nabble is
> a third-party website.
>
> There are some third-party systems, entirely separate from this mailing
> list, that offer community support for Solr, such as stackoverflow.
> Another possibility is the #solr IRC channel, which is not exactly an
> official resource, but is frequented by users who have an official
> connection with the project.
>
> Thanks,
> Shawn
>
>


Re: Error configuring UIMA

2016-02-01 Thread Jack Krupansky
What was the specific error you had to correct? The NPE appears to be in
exception handling code so the actual exception is not indicated in the
stack trace.

The UIMA code is rather poor in terms of failing to check and report
missing parameters or bad parameters which in turn reference data that does
not exist.

-- Jack Krupansky

On Mon, Feb 1, 2016 at 10:18 AM, alkampfer  wrote:

>
>
> From: outlook_288fbf38c031d...@outlook.com
> To: solr-user@lucene.apache.org
> Cc:
> Date: Mon, 1 Feb 2016 15:59:02 +0100
> Subject: Error configuring UIMA
>
> I've solved the problem, it was caused by wrong configuration in
> solrconfig.xml.
>
> Thanks.
>
>
>
> > Hi,>  > I’ve followed the guide
> https://cwiki.apache.org/confluence/display/solr/UIMA+Integration to
> setup a UIMA integration to test this feature. The doc is not updated for
> Solr5, I’ve followed the latest comment to that guide and did some other
> changes but now each request to /update handler fails with the following
> error.>  > Someone have a clue on what I did wrong?>  > Thanks in advance.>
>  > {>   "responseHeader": {> "status": 500,> "QTime": 443>   },>
> "error": {> "trace": "java.lang.NullPointerException\n\tat
> org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processAdd(UIMAUpdateRequestProcessor.java:105)\n\tat
> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:143)\n\tat
> org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:113)\n\tat
> org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:76)\n\tat
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:98)\n\tat
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)\n\tat
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)\n\tat
> org.apache.solr.core.SolrCore.execute(SolrCore.java:2068)\n\tat
> org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:669)\n\tat
> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:462)\n\tat
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:210)\n\tat
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)\n\tat
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)\n\tat
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)\n\tat
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)\n\tat
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)\n\tat
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)\n\tat
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)\n\tat
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)\n\tat
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)\n\tat
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)\n\tat
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)\n\tat
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)\n\tat
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)\n\tat
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)\n\tat
> org.eclipse.jetty.server.Server.handle(Server.java:499)\n\tat
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)\n\tat
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)\n\tat
> org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)\n\tat
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)\n\tat
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)\n\tat
> java.lang.Thread.run(Thread.java:745)\n",> "code": 500>   }> }>  > --
> > Gian Maria Ricci
> > Cell: +39 320 0136949> >
>
>


Re: Error configuring UIMA

2016-02-01 Thread Jack Krupansky
Yeah, that's exactly the kind of innocent user error that UIMA simply has
no code to detect and reasonably report.

-- Jack Krupansky

On Mon, Feb 1, 2016 at 12:13 PM, Gian Maria Ricci - aka Alkampfer <
alkamp...@nablasoft.com> wrote:

> It was a stupid error, I've mistyped the logField configuration in UIMA
>
> I'd like error not to use the Id but another field, but I've mistyped in
> solrconfig.xml and then I've got that error.
>
> Gian Maria.
>
> --
> Gian Maria Ricci
> Cell: +39 320 0136949
>
>
> -Original Message-
> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> Sent: lunedì 1 febbraio 2016 16:54
> To: solr-user@lucene.apache.org
> Subject: Re: Error configuring UIMA
>
> What was the specific error you had to correct? The NPE appears to be in
> exception handling code so the actual exception is not indicated in the
> stack trace.
>
> The UIMA code is rather poor in terms of failing to check and report
> missing parameters or bad parameters which in turn reference data that does
> not exist.
>
> -- Jack Krupansky
>
> On Mon, Feb 1, 2016 at 10:18 AM, alkampfer 
> wrote:
>
> >
> >
> > From: outlook_288fbf38c031d...@outlook.com
> > To: solr-user@lucene.apache.org
> > Cc:
> > Date: Mon, 1 Feb 2016 15:59:02 +0100
> > Subject: Error configuring UIMA
> >
> > I've solved the problem, it was caused by wrong configuration in
> > solrconfig.xml.
> >
> > Thanks.
> >
> >
> >
> > > Hi,>  > I’ve followed the guide
> > https://cwiki.apache.org/confluence/display/solr/UIMA+Integration to
> > setup a UIMA integration to test this feature. The doc is not updated
> > for Solr5, I’ve followed the latest comment to that guide and did some
> > other changes but now each request to /update handler fails with the
> > following error.>  > Someone have a clue on what I did wrong?>  > Thanks
> in advance.>
> >  > {>   "responseHeader": {> "status": 500,> "QTime": 443>   },>
> > "error": {> "trace": "java.lang.NullPointerException\n\tat
> > org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processAdd(U
> > IMAUpdateRequestProcessor.java:105)\n\tat
> > org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.pro
> > cessUpdate(JsonLoader.java:143)\n\tat
> > org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.loa
> > d(JsonLoader.java:113)\n\tat
> > org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:76)\n\t
> > at
> > org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandl
> > er.java:98)\n\tat
> > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Con
> > tentStreamHandlerBase.java:74)\n\tat
> > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandle
> > rBase.java:143)\n\tat
> > org.apache.solr.core.SolrCore.execute(SolrCore.java:2068)\n\tat
> > org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:669)\n\
> > tat
> > org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:462)\n\tat
> > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter
> > .java:210)\n\tat
> > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter
> > .java:179)\n\tat
> > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletH
> > andler.java:1652)\n\tat
> > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:
> > 585)\n\tat
> > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.ja
> > va:143)\n\tat
> > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java
> > :577)\n\tat
> > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandle
> > r.java:223)\n\tat
> > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandle
> > r.java:1127)\n\tat
> > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:5
> > 15)\n\tat
> > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler
> > .java:185)\n\tat
> > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler
> > .java:1061)\n\tat
> > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.ja
> > va:141)\n\tat
> > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(Conte
> > xtHandlerCollection.java:215)\n\tat
> > org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerColle
> > ction.java:110)\n\tat
> > org.eclipse.jetty.server.handler.HandlerWr

Re: implement exact match for one of the search fields only?

2016-02-04 Thread Jack Krupansky
The desired architecture is that you use a middle app layer that clients
send queries to and that middle app layer then constructs the formal query
and sends it on to Solr proper. This architecture also enables breaking a
user query into multiple Solr queries and then aggregating the results.
Besides, the general goal is to avoid app clients talking directly to Solr
anyway.
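As a rough sketch of such a middle layer in SolrJ (field names, boosts and the
core URL are only illustrative; error handling omitted), the front end passes
nothing but the raw keywords and the app layer owns the formal query:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class SearchService {
      private final HttpSolrClient solr =
          new HttpSolrClient("http://localhost:8983/solr/product");

      // The parser choice, qf/pf fields and boosts live here, not in the front end.
      public QueryResponse search(String userKeywords) throws Exception {
        SolrQuery q = new SolrQuery(userKeywords);
        q.set("defType", "edismax");
        q.set("qf", "P_ShortDescription spp_keyword_exact^200");
        q.set("pf", "P_ShortDescription^50");  // whatever relevance recipe you settle on
        return solr.query(q);
      }
    }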

-- Jack Krupansky

On Thu, Feb 4, 2016 at 2:57 AM, Derek Poh  wrote:

> Hi Erick
>
> <<
> The manual way of doing this would be to construct an elaborate query,
> like q=spp_keyword_exact:"dvd bracket" OR P_ShortDescription:(dvd bracket)
> OR NOTE: the parens are necessary or the last part of the above would
> be parsed as P_ShortDescription:dvd default_searchfield:bracket
> >>
>
> Your suggestion to construct the query like q=spp_keyword_exact:"dvd
> bracket" OR P_ShortDescription:(dvd bracket) OR does not fit into our
> current implementation.
> The front-end pages will only pass the "q=search keywords" in the query to
> solr. The list of search fields (qf) is pre-defined in solr.
>
> Do you have any alternatives to implement your suggestion without making
> changes to the front-end?
>
> On 1/29/2016 1:49 AM, Erick Erickson wrote:
>
>> bq: if you are interested phrase query, you should use String field
>>
>> If you do this, you will NOT be able to search within the string. I.e.
>> if the doc field is "my dog has fleas" you cannot match
>> "dog has" with a string-based field.
>>
>> If you want to match the _entire_ string or you want prefix-only
>> matching, then string might work, i.e. if you _only_ want to be able
>> to match
>>
>> "my dog has fleas"
>> "my dog*"
>> but not
>> "dog has fleas".
>>
>> On to the root question though.
>>
>> I really think you want to look at edismax. What you're trying to do
>> is apply the same search term to individual fields. In particular,
>> the pf parameter will automatically apply the search terms _as a phrase_
>> against the field specified, relieving you of having to enclose things
>> in quotes.
>>
>> The manual way of doing this would be to construct an elaborate query,
>> like
>> q=spp_keyword_exact:"dvd bracket" OR P_ShortDescription:(dvd bracket)
>> OR
>>
>> NOTE: the parens are necessary or the last part of the above would be
>> parsed as
>> P_ShortDescription:dvd default_searchfield:bracket
>>
>> And the &debug=query trick will show you exactly how things are actually
>> searched, it's invaluable.
>>
>> Best,
>> Erick
>>
>> On Thu, Jan 28, 2016 at 5:08 AM, Mugeesh Husain 
>> wrote:
>>
>>> Hi,
>>> if you are interested phrase query, you should use String field instead
>>> of
>>> text field in schema like as
>>>   
>>>
>>> this will solved you problem.
>>>
>>> if you are missing anything else let share
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://lucene.472066.n3.nabble.com/implement-exact-match-for-one-of-the-search-fields-only-tp4253786p4253827.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>
>>
> --
>
>


Re: large number of fields

2016-02-05 Thread Jack Krupansky
This doesn't sound like a great use case for Solr - or any other search
engine for that matter. I'm not sure what you are really trying to
accomplish, but you are trying to put way too many balls in the air to
juggle efficiently. You really need to re-conceptualize your problem so
that it has far fewer moving parts. Sure, Solr can handle many millions or
even billions of documents, but the focus for scaling Solr is on more
documents and more nodes, not incredibly complex or large documents. The
key to effective and efficient use of Solr is that queries are "quite
short", definitely not "quite long."

That said, the starting point for any data modeling effort is to look at
the full range of desired queries and that should drive the data model. So,
give us more info on queries, in terms of plain English descriptions of
what the user is trying to achieve.


-- Jack Krupansky

On Fri, Feb 5, 2016 at 8:20 AM, Jan Verweij - Experts in search <
j...@searchxperts.nl> wrote:

> Hi,
> We store 50K products in Solr. We have 10K customers and each
> customer buys up to 10K of these products. Now we want to influence the
> results by adding a field for every customer.
> So we end up with 10K fields to influence the results based on the buying
> behavior of each customer (personal results). I don't think this is the way
> to go, so I'm looking for suggestions on how to solve this.
> One other option would be to:
> 1. create one multivalued field 'company_hitrate'
> 2. store for each company their [companyID]_[hitrate]
>
> During search, use boost fields [companyID]_50 … [companyID]_100. In this
> case the query can become quite long (51 options) but the number of
> fields is limited to 1. What kind of effect would this have on search
> performance?
> Any other suggestions?
> Jan.


Re: indexing pdf binary stored in mongodb?

2016-02-05 Thread Jack Krupansky
See if they are stored in BSON format using GridFS. If so, you can simply
use the mongofiles command to retrieve the PDF into a local file and index
that in Solr either using Solr Cell or Tika.

See:
http://blog.mongodb.org/post/183689081/storing-large-objects-and-files-in-mongodb
https://docs.mongodb.org/manual/reference/program/mongofiles/
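A bare-bones sketch of that second step using the Tika facade plus SolrJ (file
name, core URL and field names are invented; it assumes the PDF was first
pulled out of GridFS with something like "mongofiles -d mydb get report.pdf"):

    import java.io.File;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.Tika;

    public class IndexPdfFromMongo {
      public static void main(String[] args) throws Exception {
        // 1. Extract the text from the PDF that mongofiles wrote to disk.
        String text = new Tika().parseToString(new File("report.pdf"));

        // 2. Index it as an ordinary Solr document.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "report.pdf");
        doc.addField("content", text);

        HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/docs");
        solr.add(doc);
        solr.commit();
        solr.close();
      }
    }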


-- Jack Krupansky

On Fri, Feb 5, 2016 at 3:13 PM, Arnett, Gabriel 
wrote:

> Anyone have any experience indexing pdfs stored in binary form in mongodb?
>
> .
> Gabe Arnett
> Senior Director
> Moody's Analytics
>
> -
>
> The information contained in this e-mail message, and any attachment
> thereto, is confidential and may not be disclosed without our express
> permission. If you are not the intended recipient or an employee or agent
> responsible for delivering this message to the intended recipient, you are
> hereby notified that you have received this message in error and that any
> review, dissemination, distribution or copying of this message, or any
> attachment thereto, in whole or in part, is strictly prohibited. If you
> have received this message in error, please immediately notify us by
> telephone, fax or e-mail and delete the message and all of its attachments.
> Thank you. Every effort is made to keep our network free from viruses. You
> should, however, review this e-mail message, as well as any attachment
> thereto, for viruses. We take no responsibility and have no liability for
> any computer virus which may be transferred via this e-mail message.
>


Re: URI is too long

2016-02-06 Thread Jack Krupansky
And you're sure that you can't use the terms query parser, which was
explicitly designed for handling a very long list of terms to be implicitly
ORed?

-- Jack Krupansky

On Sat, Feb 6, 2016 at 2:26 PM, Salman Ansari 
wrote:

> It looked like there was another issue with my query. I had too many
> boolean operators (I believe maxBooleanClause property in SolrConfig.xml).
> I just looped in batch of 1000 to get all the docs. Not sure if there is a
> better way of handling this.
>
> Regards,
> Salman
>
>
> On Wed, Feb 3, 2016 at 12:29 AM, Shawn Heisey  wrote:
>
> > On 2/2/2016 1:46 PM, Salman Ansari wrote:
> > > OK then, if there is no way around this problem, can someone tell me
> the
> > > maximum size a POST body can handle in Solr?
> >
> > It is configurable in solrconfig.xml.  Look for the
> > formdataUploadLimitInKB setting in the 5.x configsets.  This setting
> > defaults to 2048, which means 2 megabytes.
> >
> > Thanks,
> > Shawn
> >
> >
>


Re: Solr architecture

2016-02-08 Thread Jack Krupansky
So is there any aging or TTL (in database terminology) of older docs?

And do all of your queries need to query all of the older documents all of
the time or is there a clear hierarchy of querying for aged documents, like
past 24-hours vs. past week vs. past year vs. older than a year? Sure, you
can always use a function query to boost by the inverse of document age,
but Solr would be more efficient with filter queries or separate indexes
for different time scales.
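For the filter-query route, a made-up example of a past-24-hours window (to
hour precision) would look like the following; rounding the bounds keeps the
filter cacheable across requests:

    fq=timestamp:[NOW/HOUR-24HOURS TO NOW/HOUR]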

Are documents ever updated or are they write-once?

Are documents explicitly deleted?

Technically you probably could meet those specs, but... how many
organizations have the resources and the energy to do so?

As a back of the envelope calculation, if Solr gave you 100 queries per
second per node, that would mean you would need 1,200 nodes. It would also
depend on whether those queries are very narrow so that a single node can
execute them or if they require fanout to other shards and then aggregation
of results from those other shards.

-- Jack Krupansky

On Mon, Feb 8, 2016 at 11:24 AM, Erick Erickson 
wrote:

> Short form: You really have to prototype. Here's the long form:
>
>
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> I've seen between 20M and 200M docs fit on a single piece of hardware,
> so you'll absolutely have to shard.
>
> And the other thing you haven't told us is whether you plan on
> _adding_ 2B docs a day or whether that number is the total corpus size
> and you are re-indexing the 2B docs/day. IOW, if you are adding 2B
> docs/day, 30 days later do you have 2B docs or 60B docs in your
> corpus?
>
> Best,
> Erick
>
> On Mon, Feb 8, 2016 at 8:09 AM, Susheel Kumar 
> wrote:
> > Also if you are expecting indexing of 2 billion docs as NRT or if it will
> > be offline (during off hours etc).  For more accurate sizing you may also
> > want to index say 10 million documents which may give you idea how much
> is
> > your index size and then use that for extrapolation to come up with
> memory
> > requirements.
> >
> > Thanks,
> > Susheel
> >
> > On Mon, Feb 8, 2016 at 11:00 AM, Emir Arnautovic <
> > emir.arnauto...@sematext.com> wrote:
> >
> >> Hi Mark,
> >> Can you give us bit more details: size of docs, query types, are docs
> >> grouped somehow, are they time sensitive, will they update or it is
> rebuild
> >> every time, etc.
> >>
> >> Thanks,
> >> Emir
> >>
> >>
> >> On 08.02.2016 16:56, Mark Robinson wrote:
> >>
> >>> Hi,
> >>> We have a requirement where we would need to index around 2 Billion
> docs
> >>> in
> >>> a day.
> >>> The queries against this indexed data set can be around 80K queries per
> >>> second during peak time and during non peak hours around 12K queries
> per
> >>> second.
> >>>
> >>> Can Solr realize this huge volumes.
> >>>
> >>> If so, assuming we have no constraints for budget what would be a
> >>> recommended Solr set up (number of shards, number of Solr instances
> >>> etc...)
> >>>
> >>> Thanks!
> >>> Mark
> >>>
> >>>
> >> --
> >> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> >> Solr & Elasticsearch Support * http://sematext.com/
> >>
> >>
>


Re: Solr architecture

2016-02-08 Thread Jack Krupansky
Oops... at 100 qps for a single node you would need 120 nodes to get to 12K
qps and 800 nodes to get 80K qps, but that is just an extremely rough
ballpark estimate, not some precise and firm number. And that's if all the
queries can be evenly distributed throughout the cluster and don't require
fanout to other shards, which effectively turns each incoming query into n
queries where n is the number of shards.

-- Jack Krupansky

On Mon, Feb 8, 2016 at 12:07 PM, Jack Krupansky 
wrote:

> So is there any aging or TTL (in database terminology) of older docs?
>
> And do all of your queries need to query all of the older documents all of
> the time or is there a clear hierarchy of querying for aged documents, like
> past 24-hours vs. past week vs. past year vs. older than a year? Sure, you
> can always use a function query to boost by the inverse of document age,
> but Solr would be more efficient with filter queries or separate indexes
> for different time scales.
>
> Are documents ever updated or are they write-once?
>
> Are documents explicitly deleted?
>
> Technically you probably could meet those specs, but... how many
> organizations have the resources and the energy to do so?
>
> As a back of the envelope calculation, if Solr gave you 100 queries per
> second per node, that would mean you would need 1,200 nodes. It would also
> depend on whether those queries are very narrow so that a single node can
> execute them or if they require fanout to other shards and then aggregation
> of results from those other shards.
>
> -- Jack Krupansky
>
> On Mon, Feb 8, 2016 at 11:24 AM, Erick Erickson 
> wrote:
>
>> Short form: You really have to prototype. Here's the long form:
>>
>>
>> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>
>> I've seen between 20M and 200M docs fit on a single piece of hardware,
>> so you'll absolutely have to shard.
>>
>> And the other thing you haven't told us is whether you plan on
>> _adding_ 2B docs a day or whether that number is the total corpus size
>> and you are re-indexing the 2B docs/day. IOW, if you are adding 2B
>> docs/day, 30 days later do you have 2B docs or 60B docs in your
>> corpus?
>>
>> Best,
>> Erick
>>
>> On Mon, Feb 8, 2016 at 8:09 AM, Susheel Kumar 
>> wrote:
>> > Also if you are expecting indexing of 2 billion docs as NRT or if it
>> will
>> > be offline (during off hours etc).  For more accurate sizing you may
>> also
>> > want to index say 10 million documents which may give you idea how much
>> is
>> > your index size and then use that for extrapolation to come up with
>> memory
>> > requirements.
>> >
>> > Thanks,
>> > Susheel
>> >
>> > On Mon, Feb 8, 2016 at 11:00 AM, Emir Arnautovic <
>> > emir.arnauto...@sematext.com> wrote:
>> >
>> >> Hi Mark,
>> >> Can you give us bit more details: size of docs, query types, are docs
>> >> grouped somehow, are they time sensitive, will they update or it is
>> rebuild
>> >> every time, etc.
>> >>
>> >> Thanks,
>> >> Emir
>> >>
>> >>
>> >> On 08.02.2016 16:56, Mark Robinson wrote:
>> >>
>> >>> Hi,
>> >>> We have a requirement where we would need to index around 2 Billion
>> docs
>> >>> in
>> >>> a day.
>> >>> The queries against this indexed data set can be around 80K queries
>> per
>> >>> second during peak time and during non peak hours around 12K queries
>> per
>> >>> second.
>> >>>
>> >>> Can Solr realize this huge volumes.
>> >>>
>> >>> If so, assuming we have no constraints for budget what would be a
>> >>> recommended Solr set up (number of shards, number of Solr instances
>> >>> etc...)
>> >>>
>> >>> Thanks!
>> >>> Mark
>> >>>
>> >>>
>> >> --
>> >> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>> >> Solr & Elasticsearch Support * http://sematext.com/
>> >>
>> >>
>>
>
>


Re: Need to move on SOlr cloud (help required)

2016-02-10 Thread Jack Krupansky
What exactly is your motivation? I mean, the primary benefit of SolrCloud
is better support for sharding, and you have only a single shard. If you
have no need for sharding and your master-slave replicated Solr has been
working fine, then stick with it. If only one machine is having a load
problem, then that one node should be replaced. There are indeed plenty of
good reasons to prefer SolrCloud over traditional master-slave replication,
but so far you haven't touched on any of them.

How much data (number of documents) do you have?

What is your typical query latency?


-- Jack Krupansky

On Wed, Feb 10, 2016 at 2:15 AM, kshitij tyagi 
wrote:

> Hi,
>
> We are currently using solr 5.2 and I need to move on solr cloud
> architecture.
>
> As of now we are using 5 machines :
>
> 1. I am using 1 master where we are indexing our data.
> 2. I replicate my data on other machines
>
> One or the other machine keeps on showing high load so I am planning to
> move on solr cloud.
>
> Need help on following :
>
> 1. What should be my architecture in case of 5 machines to keep (zookeeper,
> shards, core).
>
> 2. How to add a node.
>
> 3. what are the exact steps/process I need to follow in order to change to
> solr cloud.
>
> 4. How will indexing work in solr cloud? As of now I am using a mysql query
> to get the data on the master and then index it (how do I need to change
> this in case of solr cloud)?
>
> Regards,
> Kshitij
>


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Jack Krupansky
Is this a scenario that was working fine and suddenly deteriorated, or has
it always been slow?

-- Jack Krupansky

On Thu, Feb 11, 2016 at 4:33 AM, Matteo Grolla 
wrote:

> Hi,
>  I'm trying to optimize a solr application.
> The bottleneck are queries that request 1000 rows to solr.
> Unfortunately the application can't be modified at the moment, can you
> suggest me what could be done on the solr side to increase the performance?
> The bottleneck is just on fetching the results, the query executes very
> fast.
> I suggested caching .fdx and .fdt files on the file system cache.
> Anything else?
>
> Thanks
>


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Jack Krupansky
Are queries scaling linearly - does a query for 100 rows take 1/10th the
time (1 sec vs. 10 sec or 3 sec vs. 30 sec)?

Does the app need/expect exactly 1,000 documents for the query or is that
just what this particular query happened to return?

What does the query look like? Is it complex, or does it use wildcards or
function queries, or is it very simple keywords? How many operators?

Have you used the debugQuery=true parameter to see which search components
are taking the time?

-- Jack Krupansky

On Thu, Feb 11, 2016 at 9:42 AM, Matteo Grolla 
wrote:

> Hi Yonic,
>  after the first query I find 1000 docs in the document cache.
> I'm using curl to send the request and requesting javabin format to mimic
> the application.
> gc activity is low
> I managed to load the entire 50GB index in the filesystem cache, after that
> queries don't cause disk activity anymore.
> Time improves: queries that took ~30s now take <10s, but I hoped for better.
> I'm going to use jvisualvm's sampler to analyze where time is spent
>
>
> 2016-02-11 15:25 GMT+01:00 Yonik Seeley :
>
> > On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla 
> > wrote:
> > > Thanks Toke, yes, they are long times, and solr qtime (to execute the
> > > query) is a fraction of a second.
> > > The response in javabin format is around 300k.
> >
> > OK, That tells us a lot.
> > And if you actually tested so that all the docs would be in the cache
> > (can you verify this by looking at the cache stats after you
> > re-execute?) then it seems like the slowness is down to any of:
> > a) serializing the response (it doesn't seem like a 300K response
> > should take *that* long to serialize)
> > b) reading/processing the response (how fast the client can do
> > something with each doc is also a factor...)
> > c) other (GC, network, etc)
> >
> > You can try taking client processing out of the equation by trying a
> > curl request.
> >
> > -Yonik
> >
>


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Jack Krupansky
Good to know. Hmmm... 200ms for 10 rows is not outrageously bad, but still
relatively bad. Even 50ms for 10 rows would be considered barely okay.
But... again it depends on query complexity - simple queries should be well
under 50 ms for decent modern hardware.

-- Jack Krupansky

On Thu, Feb 11, 2016 at 10:36 AM, Matteo Grolla 
wrote:

> Hi Jack,
>   response times scale with rows. The relationship doesn't seem linear, but
> below 400 rows times are much faster,
> I view query times from solr logs and they are fast
> the same query with rows = 1000 takes 8s
> with rows = 10 takes 0.2s
>
>
> 2016-02-11 16:22 GMT+01:00 Jack Krupansky :
>
> > Are queries scaling linearly - does a query for 100 rows take 1/10th the
> > time (1 sec vs. 10 sec or 3 sec vs. 30 sec)?
> >
> > Does the app need/expect exactly 1,000 documents for the query or is that
> > just what this particular query happened to return?
> >
> > What does they query look like? Is it complex or use wildcards or
> function
> > queries, or is it very simple keywords? How many operators?
> >
> > Have you used the debugQuery=true parameter to see which search
> components
> > are taking the time?
> >
> > -- Jack Krupansky
> >
> > On Thu, Feb 11, 2016 at 9:42 AM, Matteo Grolla 
> > wrote:
> >
> > > Hi Yonic,
> > >  after the first query I find 1000 docs in the document cache.
> > > I'm using curl to send the request and requesting javabin format to
> mimic
> > > the application.
> > > gc activity is low
> > > I managed to load the entire 50GB index in the filesystem cache, after
> > that
> > > queries don't cause disk activity anymore.
> > > Time improves now queries that took ~30s take <10s. But I hoped better
> > > I'm going to use jvisualvm's sampler to analyze where time is spent
> > >
> > >
> > > 2016-02-11 15:25 GMT+01:00 Yonik Seeley :
> > >
> > > > On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla <
> > matteo.gro...@gmail.com>
> > > > wrote:
> > > > > Thanks Toke, yes, they are long times, and solr qtime (to execute
> the
> > > > > query) is a fraction of a second.
> > > > > The response in javabin format is around 300k.
> > > >
> > > > OK, That tells us a lot.
> > > > And if you actually tested so that all the docs would be in the cache
> > > > (can you verify this by looking at the cache stats after you
> > > > re-execute?) then it seems like the slowness is down to any of:
> > > > a) serializing the response (it doesn't seem like a 300K response
> > > > should take *that* long to serialize)
> > > > b) reading/processing the response (how fast the client can do
> > > > something with each doc is also a factor...)
> > > > c) other (GC, network, etc)
> > > >
> > > > You can try taking client processing out of the equation by trying a
> > > > curl request.
> > > >
> > > > -Yonik
> > > >
> > >
> >
>


Re: optimize requests that fetch 1000 rows

2016-02-11 Thread Jack Krupansky
Again, first things first... debugQuery=true and see which Solr search
components are consuming the bulk of qtime.

-- Jack Krupansky

On Thu, Feb 11, 2016 at 11:33 AM, Matteo Grolla 
wrote:

> virtual hardware, 200ms is taken on the client until response is written to
> disk
> qtime on solr is ~90ms
> not great but acceptable
>
> Is it possible that the method FilenameUtils.splitOnTokens is really so
> heavy when requesting a lot of rows on slow hardware?
>
> 2016-02-11 17:17 GMT+01:00 Jack Krupansky :
>
> > Good to know. Hmmm... 200ms for 10 rows is not outrageously bad, but
> still
> > relatively bad. Even 50ms for 10 rows would be considered barely okay.
> > But... again it depends on query complexity - simple queries should be
> well
> > under 50 ms for decent modern hardware.
> >
> > -- Jack Krupansky
> >
> > On Thu, Feb 11, 2016 at 10:36 AM, Matteo Grolla  >
> > wrote:
> >
> > > Hi Jack,
> > >   response time scale with rows. Relationship doens't seem linear
> but
> > > Below 400 rows times are much faster,
> > > I view query times from solr logs and they are fast
> > > the same query with rows = 1000 takes 8s
> > > with rows = 10 takes 0.2s
> > >
> > >
> > > 2016-02-11 16:22 GMT+01:00 Jack Krupansky :
> > >
> > > > Are queries scaling linearly - does a query for 100 rows take 1/10th
> > the
> > > > time (1 sec vs. 10 sec or 3 sec vs. 30 sec)?
> > > >
> > > > Does the app need/expect exactly 1,000 documents for the query or is
> > that
> > > > just what this particular query happened to return?
> > > >
> > > > What does they query look like? Is it complex or use wildcards or
> > > function
> > > > queries, or is it very simple keywords? How many operators?
> > > >
> > > > Have you used the debugQuery=true parameter to see which search
> > > components
> > > > are taking the time?
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > On Thu, Feb 11, 2016 at 9:42 AM, Matteo Grolla <
> > matteo.gro...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Yonic,
> > > > >  after the first query I find 1000 docs in the document cache.
> > > > > I'm using curl to send the request and requesting javabin format to
> > > mimic
> > > > > the application.
> > > > > gc activity is low
> > > > > I managed to load the entire 50GB index in the filesystem cache,
> > after
> > > > that
> > > > > queries don't cause disk activity anymore.
> > > > > Time improves now queries that took ~30s take <10s. But I hoped
> > better
> > > > > I'm going to use jvisualvm's sampler to analyze where time is spent
> > > > >
> > > > >
> > > > > 2016-02-11 15:25 GMT+01:00 Yonik Seeley :
> > > > >
> > > > > > On Thu, Feb 11, 2016 at 7:45 AM, Matteo Grolla <
> > > > matteo.gro...@gmail.com>
> > > > > > wrote:
> > > > > > > Thanks Toke, yes, they are long times, and solr qtime (to
> execute
> > > the
> > > > > > > query) is a fraction of a second.
> > > > > > > The response in javabin format is around 300k.
> > > > > >
> > > > > > OK, That tells us a lot.
> > > > > > And if you actually tested so that all the docs would be in the
> > cache
> > > > > > (can you verify this by looking at the cache stats after you
> > > > > > re-execute?) then it seems like the slowness is down to any of:
> > > > > > a) serializing the response (it doesn't seem like a 300K response
> > > > > > should take *that* long to serialize)
> > > > > > b) reading/processing the response (how fast the client can do
> > > > > > something with each doc is also a factor...)
> > > > > > c) other (GC, network, etc)
> > > > > >
> > > > > > You can try taking client processing out of the equation by
> trying
> > a
> > > > > > curl request.
> > > > > >
> > > > > > -Yonik
> > > > > >
> > > > >
> > > >
> > >
> >
>


Re: optimize requests that fetch 1000 rows

2016-02-12 Thread Jack Krupansky
Thanks for that critical clarification. Try...

1. A different response writer to see if that impacts the clock time.
2. Selectively remove fields from the fl field list to see if some
particular field has some issue.
3. If you simply return only the ID for the document, how fast/slow is that?

How many fields are in fl?
Any function queries in fl?
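In other words, keep the query and rows fixed and vary only the response side,
e.g. (collection name and query elided):

    .../select?q=...&rows=1000&wt=javabin           <- current behaviour
    .../select?q=...&rows=1000&wt=json              <- (1) different response writer
    .../select?q=...&rows=1000&wt=javabin&fl=id     <- (3) id only, then add fields back one at a time for (2)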


-- Jack Krupansky

On Fri, Feb 12, 2016 at 4:57 AM, Matteo Grolla 
wrote:

> Hi Jack,
>  tell me if I'm wrong but qtime accounts for search time excluding the
> fetch of stored fields (I have a 90ms qtime and a ~30s time to obtain the
> results on the client on a LAN infrastructure for 300kB response). debug
> explains how much of qtime is used by each search component.
> For me 90ms are ok, I wouldn't spend time trying to make them 50ms, it's
> the ~30s to obtain the response that I'd like to tackle.
>
>
> 2016-02-12 5:42 GMT+01:00 Jack Krupansky :
>
> > Again, first things first... debugQuery=true and see which Solr search
> > components are consuming the bulk of qtime.
> >
> > -- Jack Krupansky
> >
> > On Thu, Feb 11, 2016 at 11:33 AM, Matteo Grolla  >
> > wrote:
> >
> > > virtual hardware, 200ms is taken on the client until response is
> written
> > to
> > > disk
> > > qtime on solr is ~90ms
> > > not great but acceptable
> > >
> > > Is it possible that the method FilenameUtils.splitOnTokens is really so
> > > heavy when requesting a lot of rows on slow hardware?
> > >
> > > 2016-02-11 17:17 GMT+01:00 Jack Krupansky :
> > >
> > > > Good to know. Hmmm... 200ms for 10 rows is not outrageously bad, but
> > > still
> > > > relatively bad. Even 50ms for 10 rows would be considered barely
> okay.
> > > > But... again it depends on query complexity - simple queries should
> be
> > > well
> > > > under 50 ms for decent modern hardware.
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > On Thu, Feb 11, 2016 at 10:36 AM, Matteo Grolla <
> > matteo.gro...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > Hi Jack,
> > > > >   response time scale with rows. Relationship doens't seem
> linear
> > > but
> > > > > Below 400 rows times are much faster,
> > > > > I view query times from solr logs and they are fast
> > > > > the same query with rows = 1000 takes 8s
> > > > > with rows = 10 takes 0.2s
> > > > >
> > > > >
> > > > > 2016-02-11 16:22 GMT+01:00 Jack Krupansky <
> jack.krupan...@gmail.com
> > >:
> > > > >
> > > > > > Are queries scaling linearly - does a query for 100 rows take
> > 1/10th
> > > > the
> > > > > > time (1 sec vs. 10 sec or 3 sec vs. 30 sec)?
> > > > > >
> > > > > > Does the app need/expect exactly 1,000 documents for the query or
> > is
> > > > that
> > > > > > just what this particular query happened to return?
> > > > > >
> > > > > > What does they query look like? Is it complex or use wildcards or
> > > > > function
> > > > > > queries, or is it very simple keywords? How many operators?
> > > > > >
> > > > > > Have you used the debugQuery=true parameter to see which search
> > > > > components
> > > > > > are taking the time?
> > > > > >
> > > > > > -- Jack Krupansky
> > > > > >
> > > > > > On Thu, Feb 11, 2016 at 9:42 AM, Matteo Grolla <
> > > > matteo.gro...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Yonic,
> > > > > > >  after the first query I find 1000 docs in the document
> > cache.
> > > > > > > I'm using curl to send the request and requesting javabin
> format
> > to
> > > > > mimic
> > > > > > > the application.
> > > > > > > gc activity is low
> > > > > > > I managed to load the entire 50GB index in the filesystem
> cache,
> > > > after
> > > > > > that
> > > > > > > queries don't cause disk activity anymore.
> > > > > > > Time improves now queries that took ~30s take <10s. But I hoped
> > > > better
> > > > > >

Re: query knowledge graph

2016-02-12 Thread Jack Krupansky
"knowledge graph" is kind of vague - what did you have in mind? An example
would help.

-- Jack Krupansky

On Fri, Feb 12, 2016 at 7:27 AM, Midas A  wrote:

>  Please suggest how to create query knowledge graph for e-commerce
> application .
>
>
> please describe in detail . our mote is to improve relevancy . we are from
> LAMP back ground .
>


Re: Negating multiple array fileds

2016-02-14 Thread Jack Krupansky
Due to a bug (or poorly designed feature), you need to explicitly include a
non-negative query term in a purely negative sub-query. Usually this means
using *:* to select all documents. Note that the use of parentheses
introduces a sub-query. So, (-persons:*) s.b. (*:* -persons:*).

-- Jack Krupansky

On Sun, Feb 14, 2016 at 8:21 AM, Salman Ansari 
wrote:

> Hi,
>
> I think what I am asking should be easy to do but for some reasons I am
> facing issues in making that happen. The issue is that I want
> include/exclude some fields from my Solr query. All the fields that I need
> to include are multi valued int fields. When I include the fields I have
> the following query
>
> http://
>
> [MySolrServer]/solr/[Collection]/select?q=(persons:*)AND(places:*)AND(orgs:*)
> This does return the desired result. However, when I negate the values
>
> http://
>
> [MySolrServer]/solr/[Collection]/select?q=(-persons:*)AND(-places:*)AND(-orgs:*)
> This returns 0 documents although there are a lot of documents that have
> all those fields empty.
>
> Any ideas why this is happening?
>
> Appreciate any comments/feedback.
>
> Regards,
> Salman
>


Re: "pf" not supported by edismax?

2016-02-14 Thread Jack Krupansky
pf stands for phrase boosting, which implies tokenized text...
spp_keyword_exact sounds like it is not tokenized.

-- Jack Krupansky

On Sun, Feb 14, 2016 at 10:08 PM, Derek Poh  wrote:

> Hi
>
> Correct me If I am wrong, edismax is an extension of dismax, so it will
> support "pf".
> But from my testing I noticed "pf" is not working with edismax.
> From the debug information of a query using "pf" with edismax, there is no
> phrase match for the "pf" field "spp_keyword_exact".
> If I changed to dismax, it is doing a phrase match on the field.
>
> Is this normal?
>
> We are running Solr 4.10.4.
>
> Below is the queriesand their debug information.
>
> Query using "pf" with edismax and the debug statement:
>
> http://hkenedcdg1.globalsources.com:8983/solr/product/select?q=dvd%20bracket&qf=spp_keyword_exact&fl=P_SPPKW,P_NewShortDescription.P_CatConCatKeyword,P_VeryShortDescription&pf=spp_keyword_exact&debug=query&defType=edismax
>
> dvd bracket
> dvd bracket
> 
> (+(DisjunctionMaxQuery((spp_keyword_exact:dvd))
> DisjunctionMaxQuery((spp_keyword_exact:bracket))) ())/no_coord
> 
> 
> +((spp_keyword_exact:dvd) (spp_keyword_exact:bracket)) ()
> 
> ExtendedDismaxQParser
>
>
> Query using "pf" with dismax and the debug statement:
>
> http://hkenedcdg1.globalsources.com:8983/solr/product/select?q=dvd%20bracket&qf=spp_keyword_exact&fl=P_SPPKW,P_NewShortDescription.P_CatConCatKeyword,P_VeryShortDescription&pf=spp_keyword_exact&debug=query&defType=dismax
>
> dvd bracket
> dvd bracket
> 
> (+(DisjunctionMaxQuery((spp_keyword_exact:dvd))
> DisjunctionMaxQuery((spp_keyword_exact:bracket)))
> DisjunctionMaxQuery((spp_keyword_exact:dvd bracket)))/no_coord
> 
> 
> +((spp_keyword_exact:dvd) (spp_keyword_exact:bracket))
> (spp_keyword_exact:dvd bracket)
> 
> DisMaxQParser
>
> Derek
>
> --


Re: "pf" not supported by edismax?

2016-02-14 Thread Jack Krupansky
Maybe it is ignored because the tokenized phrase produces only a single term.
In any case, it won't be a phrase. pf only does something useful for phrases,
IOW where a PhraseQuery can be generated, and a PhraseQuery of more than a
single term would never match when the field value is a single term.

-- Jack Krupansky

On Mon, Feb 15, 2016 at 12:11 AM, Derek Poh  wrote:

> It is using KeywordTokenizerFactory. It is still consider as tokenized?
>
> Here's the field definition:
>  type="gs_keyword_exact" multiValued="true"/>
>
>  positionIncrementGap="100">
>   
> 
> 
> 
>   
>   
> 
> 
> 
>   
> 
>
>
> On 2/15/2016 12:43 PM, Jack Krupansky wrote:
>
>> pf stands for phrase boosting, which implies tokenized text...
>> spp_keyword_exact sounds like it is not tokenized.
>>
>> -- Jack Krupansky
>>
>> On Sun, Feb 14, 2016 at 10:08 PM, Derek Poh 
>> wrote:
>>
>> Hi
>>>
>>> Correct me If I am wrong, edismax is an extension of dismax, so it will
>>> support "pf".
>>> But from my testing I noticed "pf" is not working with edismax.
>>>  From the debug information of a query using "pf" with edismax, there is
>>> no
>>> phrase match for the "pf" field "spp_keyword_exact".
>>> If I changed to dismax, it is doing a phrase match on the field.
>>>
>>> Is this normal?
>>>
>>> We are running Solr 4.10.4.
>>>
>>> Below is the queriesand their debug information.
>>>
>>> Query using "pf" with edismax and the debug statement:
>>>
>>>
>>> http://hkenedcdg1.globalsources.com:8983/solr/product/select?q=dvd%20bracket&qf=spp_keyword_exact&fl=P_SPPKW,P_NewShortDescription.P_CatConCatKeyword,P_VeryShortDescription&pf=spp_keyword_exact&debug=query&defType=edismax
>>>
>>> dvd bracket
>>> dvd bracket
>>> 
>>> (+(DisjunctionMaxQuery((spp_keyword_exact:dvd))
>>> DisjunctionMaxQuery((spp_keyword_exact:bracket))) ())/no_coord
>>> 
>>> 
>>> +((spp_keyword_exact:dvd) (spp_keyword_exact:bracket)) ()
>>> 
>>> ExtendedDismaxQParser
>>>
>>>
>>> Query using "pf" with dismax and the debug statement:
>>>
>>>
>>> http://hkenedcdg1.globalsources.com:8983/solr/product/select?q=dvd%20bracket&qf=spp_keyword_exact&fl=P_SPPKW,P_NewShortDescription.P_CatConCatKeyword,P_VeryShortDescription&pf=spp_keyword_exact&debug=query&defType=dismax
>>>
>>> dvd bracket
>>> dvd bracket
>>> 
>>> (+(DisjunctionMaxQuery((spp_keyword_exact:dvd))
>>> DisjunctionMaxQuery((spp_keyword_exact:bracket)))
>>> DisjunctionMaxQuery((spp_keyword_exact:dvd bracket)))/no_coord
>>> 
>>> 
>>> +((spp_keyword_exact:dvd) (spp_keyword_exact:bracket))
>>> (spp_keyword_exact:dvd bracket)
>>> 
>>> DisMaxQParser
>>>
>>> Derek
>>>
>>> --
>>>
>>
> --
>
>


Re: Negating multiple array fields

2016-02-15 Thread Jack Krupansky
I should also have noted that your full query:

(-persons:*)AND(-places:*)AND(-orgs:*)

can be written as:

-persons:* -places:* -orgs:*

Which may work as is, or can also be written as:

*:* -persons:* -places:* -orgs:*
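
If these exclusions get applied to a lot of queries, they could also go into
filter queries (untested, same field names as your query):

q=*:*&fq=-persons:*&fq=-places:*&fq=-orgs:*

so each clause is cached in the filter cache independently of the main query.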




-- Jack Krupansky

On Mon, Feb 15, 2016 at 1:57 AM, Salman Ansari 
wrote:

> @Binoy: The query does work but for one term (-persons:[* TO *]) but it
> does not work for multiple terms such as
> http://[Myserver]/solr/[Collection]/select?q=(-persons:[* TO
> *])AND(-orgs:[*
> TO *])
> This returns zero records although I do have records that have both persons
> and orgs empty.
>
> @Jack: Replacing (-persons:*)AND(-orgs:*) with (*:* -persons:*)AND(*:*
> -orgs:*) did the trick. Thanks.
>
> Thanks you both for your comments.
>
> Salman
>
> On Sun, Feb 14, 2016 at 7:51 PM, Jack Krupansky 
> wrote:
>
> > Due to a bug (or poorly designed feature), you need to explicitly
> include a
> > non-negative query term in a purely negative sub-query. Usually this
> means
> > using *:* to select all documents. Note that the use of parentheses
> > introduces a sub-query. So, (-persons:*) s.b. (*:* -persons:*).
> >
> > -- Jack Krupansky
> >
> > On Sun, Feb 14, 2016 at 8:21 AM, Salman Ansari 
> > wrote:
> >
> > > Hi,
> > >
> > > I think what I am asking should be easy to do but for some reasons I am
> > > facing issues in making that happen. The issue is that I want
> > > include/exclude some fields from my Solr query. All the fields that I
> > need
> > > to include are multi valued int fields. When I include the fields I
> have
> > > the following query
> > >
> > > http://
> > >
> > >
> >
> [MySolrServer]/solr/[Collection]/select?q=(persons:*)AND(places:*)AND(orgs:*)
> > > This does return the desired result. However, when I negate the values
> > >
> > > http://
> > >
> > >
> >
> [MySolrServer]/solr/[Collection]/select?q=(-persons:*)AND(-places:*)AND(-orgs:*)
> > > This returns 0 documents although there are a lot of documents that
> have
> > > all those fields empty.
> > >
> > > Any ideas why this is happening?
> > >
> > > Appreciate any comments/feedback.
> > >
> > > Regards,
> > > Salman
> > >
> >
>


Re: Near Duplicate Documents, "authorization"? tf/idf implications, spamming the index?

2016-02-15 Thread Jack Krupansky
Sounds a lot like multi-tenancy, where you don't want the document
frequencies of one tenant to influence the query relevancy scores for other
tenants.

No ready solution.

Although, I have thought of a simplified document scoring using just tf and
leaving out df/idf. Not as good as a tf*idf or BM25 score, but it avoids the
pollution problem.
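
As a rough illustration of tf-only scoring that is already possible out of the
box (untested - the field name and terms are placeholders), function queries
can rank purely by term frequency:

q={!func}sum(termfreq(text,'foo'),termfreq(text,'bar'))&fl=id,score

That ignores document frequency entirely, so the duplicated documents don't
pollute anything - although you also lose the usual benefits of idf.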

I haven't heard of anybody in the Lucene space discussing a way to
categorize documents such that df is relative to a specified document
category and then the query specifies a document category. I suppose that
indexing and querying under some hypothetical similarity scheme could both
specify any number of document categories. But that's speculation on my
part.

-- Jack Krupansky

On Mon, Feb 15, 2016 at 6:42 PM, Chris Morley  wrote:

> Hey Solr people:
>
>  Suppose that we did not want to break up our document set into separate
> indexes, but had certain cases where many versions of a document were not
> relevant for certain searches.
>
> I guess this could be thought of as an "authorization" class of problem,
> however it is not that for us.  We have a few other fields that determine
> relevancy to the current query, based on what page the query is coming
> from.  It's kind of like authorization, but not really.
>
>  Anyway, I think the answer for how you would do it for authorization would
> solve it for our case too.
>
> So suppose you had 99 users and 100 documents, and Document 1 was the same
> for everybody, but each of the other 99 documents was slightly different and
> unique to one of the 99 users - though not "very" unique.  Suppose for
> instance that the only difference in the text of the 99 documents was that
> each was watermarked with the user's name.  Aren't you spamming your tf/idf
> at that point?  Is there a way around this?  Is there a way to say, hey,
> group these 99 documents together and only count 1 of them for tf/idf
> purposes?
>
>  When doing queries, each user would only ever see 2 documents, Document 1
> , plus whichever other document they specifically owned.
>
>  If there are web pages or book chapters I can read or re-read that address
> this class of problem, those references would be great.
>
>
>  -Chris.
>
>
>
>


Re: Negating multiple array fields

2016-02-17 Thread Jack Krupansky
I actually thought seriously about whether to mention wildcard vs. range,
but... it annoys me that the Lucene and query parser folks won't fix either
PrefixQuery or the query parsers to do the right/optimal thing for
single-asterisk query. I wrote up a Jira for it years ago, but for whatever
reason the difficulty persists. At one point one of the Lucene guys told me
that there was a filter query that could do both * and -* very efficiently,
but then later that was disputed, not to mention that filter query is now
gone. In any case, with the newer AutomatonQuery the single-asterisk
PrefixQuery case should always perform at least semi-reasonably no matter
what, especially since it is now a constant-score query, which it wasn't
many years ago.

Whether [* TO *] is actually a lot more (or less) efficient than
PrefixQuery for an empty prefix these days is... unknown to me, but I won't
give anybody grief for using it as a way of compensating for the
brain-damaged way that Lucene and Solr handle single-asterisk and negated
single-asterisk queries.


-- Jack Krupansky

On Tue, Feb 16, 2016 at 8:17 PM, Shawn Heisey  wrote:

> On 2/15/2016 9:22 AM, Jack Krupansky wrote:
> > I should also have noted that your full query:
> >
> > (-persons:*)AND(-places:*)AND(-orgs:*)
> >
> > can be written as:
> >
> > -persons:* -places:* -orgs:*
> >
> > Which may work as is, or can also be written as:
> >
> > *:* -persons:* -places:* -orgs:*
>
> Salman,
>
> One fact of Lucene operation is that purely negative queries do not
> work.  A negative query clause is like a subtraction.  If you make a
> query that only says "subtract these values", then you aren't going to
> get anything, because you did not start with anything.
>
> Adding the "*:*" clause at the beginning of the query says "start with
> everything."
>
> You might ask why a query of -field:value works, when I just said that
> it *won't* work.  This is because Solr has detected the problem and
> fixed it.  When the query is very simple (a single negated clause), Solr
> is able to detect the unworkable situation and implicitly add the "*:*"
> starting point, producing the expected results.  With more complex
> queries, like the one you are trying, this detection fails, and the
> query is executed as-is.
>
> Jack is an awesome member of this community.  I do not want to disparage
> him at all when I tell you that the rewritten query he provided will
> work, but is not optimal.  It can be optimized as the following:
>
> *:* -persons:[* TO *] -places:[* TO *] -orgs:[* TO *]
>
> A query clause of the format "field:*" is a wildcard query.  Behind the
> scenes, Solr will interpret this as "all possible values for field" --
> which sounds like it would be exactly what you're looking for, except
> that if there are ten million possible values in the field you're
> searching, the constructed Lucene query will quite literally include all
> ten million values.  Wildcard queries tend to use a lot of memory and
> run slowly.
>
> The [* TO *] syntax is an all-inclusive range query, which will usually
> be much faster than a wildcard query.
>
> Thanks,
> Shawn
>
>


Re: Reverse Engineer Query For a Given Result Set?

2016-02-18 Thread Jack Krupansky
Out of the box? No. Could you develop one? Probably, or at least a rough
approximation, at least some of the time... but probably at a cost
significantly greater than converting queries by hand.

If it is taking you 2-4 hours per query then that suggests that the query
complexity is not amenable to any simple mechanical reverse engineering.

What aspects of the conversion are taking you so many hours? A few examples
would be helpful.

A mechanical reverse engineering from results would likely reduce the
semantic content of the original query, so the reconstructed query may return
false positives or false negatives as new documents are added to the index
that no longer fit the pattern of the old results but are still within the
pattern of the original Oracle query. The question is whether that delta
matters for the actual application use case.

-- Jack Krupansky

On Thu, Feb 18, 2016 at 4:07 AM, Christian Effertz 
wrote:

> Hi,
>
> Can I somehow feed Solr with a result set or a list of primary keys and get
> the shortest query that leads to this result? In other terms, can I reverse
> engineer a query for a given result set?
>
> Some background why I ask this question:
> We are currently migrating a search application from Oracle Text to Solr.
> Our users have several (>30) complex queries that we need to migrate to our
> new Solr index. This can be done by hand, but is rather time consuming. To
> get an idea of how long the whole task would need, we started with a hand
> full of them. We spent ~2-4h per query to get everything right.
>
> Thank you for your input
>


Re: WhitespaceTokenizerFactory and PathHierarchyTokenizerFactory

2016-02-24 Thread Jack Krupansky
Your statement makes no sense. Please clarify. Express your requirement(s)
in plain English first before dragging in possible solutions. Technically,
path elements can have embedded spaces.

-- Jack Krupansky

On Wed, Feb 24, 2016 at 6:53 AM, Anil  wrote:

> HI,
>
> I need to use both WhitespaceTokenizerFactory and
> PathHierarchyTokenizerFactory for my use case.
>
> Solr supports only one tokenizer per analyzer chain. Is there any way we can
> achieve PathHierarchyTokenizerFactory functionality with filters?
>
> Please advise.
>
> Regards,
> Anil
>


Re: WhitespaceTokenizerFactory and PathHierarchyTokenizerFactory

2016-02-25 Thread Jack Krupansky
You still haven't stated exactly what your query requirements are. In Solr
you should always start with an analysis of how people will expect to query
the data and then work backwards to how to store and index the data to
achieve the desired queries.

Note that the standard tokenizer will tokenize all of the elements of a
path or IP address as separate terms. Ditto for a query, so you can
effectively do both keyword and phrase queries to match individual terms
(e.g., path elements) or phrases/sequences of path elements or IP address
components.
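
If it turns out you really do need both behaviors, one workaround (a sketch
only - field and type names are invented, untested) is to index the same
source text into two fields, one whitespace-tokenized and one
path-hierarchy-tokenized, and search both:

   <field name="msg" type="text_ws" indexed="true" stored="true"/>
   <field name="msg_path" type="text_path" indexed="true" stored="false"/>
   <copyField source="msg" dest="msg_path"/>

   <fieldType name="text_path" class="solr.TextField">
     <analyzer>
       <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
     </analyzer>
   </fieldType>

Then query both fields, e.g. defType=edismax&qf=msg+msg_path.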

-- Jack Krupansky

On Thu, Feb 25, 2016 at 12:41 AM, Anil  wrote:

> Sorry Jack for confusion.
>
> I have a field which holds free text. The text can contain a path, an IP, or
> any other free text.
>
> I would like to tokenize the text of the field using whitespace. If a token
> matches a path or IP pattern, it should be tokenized the way the path
> hierarchy tokenizer would do it.
>
>
> Regards,
> Anil
>
> On 24 February 2016 at 21:59, Jack Krupansky 
> wrote:
>
> > Your statement makes no sense. Please clarify. Express your
> requirement(s)
> > in plain English first before dragging in possible solutions.
> Technically,
> > path elements can have embedded spaces.
> >
> > -- Jack Krupansky
> >
> > On Wed, Feb 24, 2016 at 6:53 AM, Anil  wrote:
> >
> > > HI,
> > >
> > > i need to use both WhitespaceTokenizerFactory and
> > > PathHierarchyTokenizerFactory for use case.
> > >
> > > Solr supports only one tokenizer. is there any way we can achieve
> > > PathHierarchyTokenizerFactory  functionality with filters ?
> > >
> > > Please advice.
> > >
> > > Regards,
> > > Anil
> > >
> >
>


Re: Query time de-boost

2016-02-25 Thread Jack Krupansky
0.1 is a fractional boost - all intra-query boosts are multiplicative, not
additive, so term^0.1 reduces that term's contribution to the score by 90%.
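
For example (hypothetical field names, untested), with edismax:

q=foo+bar&defType=edismax&qf=title^0.1+description

a match in title contributes only a tenth of its normal score, so title is
effectively de-boosted relative to description.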

-- Jack Krupansky

On Wed, Feb 24, 2016 at 11:29 AM, shamik  wrote:

> Binoy, 0.1 is still a positive boost. With title getting the highest
> weight,
> this won't make any difference. I've tried this as well.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Query-time-de-boost-tp4259309p4259552.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Query time de-boost

2016-02-26 Thread Jack Krupansky
Could you share your actual numbers and test case? IOW, the document score
without ^0.01 and with ^0.01.

Again, to repeat, the specific boost factor may be positive, but the effect
of a fractional boost is to reduce the score, not add to it, so that a score
of 0.5 boosted by 0.1 would become 0.05. IOW, it de-boosts occurrences of
the term.

The point remains that you do not need a "negative boost" to de-boost a
term.


-- Jack Krupansky

On Fri, Feb 26, 2016 at 4:01 AM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi Jack,
> I just checked on 5.5 and 0.1 is positive boost.
>
> Regards,
> Emir
>
>
> On 26.02.2016 01:11, Jack Krupansky wrote:
>
>> 0.1 is a fractional boost - all intra-query boosts are multiplicative, not
>> additive, so term^0.1 reduces the term by 90%.
>>
>> -- Jack Krupansky
>>
>> On Wed, Feb 24, 2016 at 11:29 AM, shamik  wrote:
>>
>> Binoy, 0.1 is still a positive boost. With title getting the highest
>>> weight,
>>> this won't make any difference. I've tried this as well.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>>
>>> http://lucene.472066.n3.nabble.com/Query-time-de-boost-tp4259309p4259552.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>


Re: Solr regex documenation

2016-02-27 Thread Jack Krupansky
See:
https://lucene.apache.org/core/5_5_0/core/org/apache/lucene/search/RegexpQuery.html
https://lucene.apache.org/core/5_5_0/core/org/apache/lucene/util/automaton/RegExp.html

I vaguely recall a Jira about regex not working at all in Solr. I don't
recall reading about a resolution.
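
Assuming regex queries do work in your version, the key point from that RegExp
class is that the pattern is matched against whole terms and is implicitly
anchored, so ^ and $ are not operators. For example (field name is just an
illustration), with the standard query parser:

q=title:/lap.*/

matches terms starting with "lap" - no ^ or $ anchors are needed (or
supported).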


-- Jack Krupansky

On Sat, Feb 27, 2016 at 7:05 AM, Anil  wrote:

> Hi,
>
> Can someone point me to the Solr regex documentation?
>
> I read that it supports all Java regex features. I tried ^ and $, but they
> don't seem to work.
>
> Thanks,
> Anil
>


Re: Query time de-boost

2016-02-28 Thread Jack Krupansky
Thanks for clarifying - you are referring to the bq parameter, which is in
fact additive to the underlying score of the original query, whereas in the
main query, or with the bf, boost, qf, and pf parameters, the boosting is
multiplicative rather than additive.

IOW, only with the bq parameter do you need to use negative boost values - in
all the other contexts a fractional boost is sufficient.

It's unfortunate that the ref guide isn't more clear about this key
distinction.

Now hopefully we (and others!) are on the same page.


-- Jack Krupansky

On Sun, Feb 28, 2016 at 3:26 PM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> Hi Jack,
> I think we are talking about different things: I agree that boost is
> multiplicative, and boost values less than one will reduce the score, but if
> you use such a boost value in bq, it will still boost documents that are
> matching it. The simplest example is with ids. If you query:
>   q=id:a OR id:b
> both doc a and b will have same score. If you boost a^2 it will be first,
> if you boost a^0.1 it will be second. But if you use dismax's bq=id:a^0.1
> it will be first. In such case you have to use negative boost to make sure
> it is last.
>
> Are we on the same page now?
>
> Regards,
> Emir
>
>
> On 26.02.2016 16:00, Jack Krupansky wrote:
>
>> Could you share your actual numbers and test case? IOW, the document score
>> without ^0.01 and with ^0.01.
>>
>> Again, to repeat, the specific boost factor may be positive, but the
>> effect
>> of a fractional boost is to reduce, not add, to the score, so that a score
>> of 0.5 boosted by 0.1 would become 0.05. IOW, it de-boosts occurrences of
>> the term.
>>
>> The point remains that you do not need a "negative boost" to de-boost a
>> term.
>>
>>
>> -- Jack Krupansky
>>
>> On Fri, Feb 26, 2016 at 4:01 AM, Emir Arnautovic <
>> emir.arnauto...@sematext.com> wrote:
>>
>> Hi Jack,
>>> I just checked on 5.5 and 0.1 is positive boost.
>>>
>>> Regards,
>>> Emir
>>>
>>>
>>> On 26.02.2016 01:11, Jack Krupansky wrote:
>>>
>>> 0.1 is a fractional boost - all intra-query boosts are multiplicative,
>>>> not
>>>> additive, so term^0.1 reduces the term by 90%.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> On Wed, Feb 24, 2016 at 11:29 AM, shamik  wrote:
>>>>
>>>> Binoy, 0.1 is still a positive boost. With title getting the highest
>>>>
>>>>> weight,
>>>>> this won't make any difference. I've tried this as well.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>>
>>>>>
>>>>> http://lucene.472066.n3.nabble.com/Query-time-de-boost-tp4259309p4259552.html
>>>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>>>
>>>>>
>>>>> --
>>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>>> Solr & Elasticsearch Support * http://sematext.com/
>>>
>>>
>>>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>


Re: ExtendedDisMax configuration nowhere to be found

2016-02-28 Thread Jack Krupansky
Consult the Confluence wiki for more recent doc:
https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser

You can specify all the parameters on your query request as in the
examples, or by placing the parameters in the "defaults" section for your
request handler in solrconfig.xml.
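
For example, a minimal sketch for solrconfig.xml (the qf fields are just
placeholders):

   <requestHandler name="/select" class="solr.SearchHandler">
     <lst name="defaults">
       <str name="defType">edismax</str>
       <str name="qf">title^10 description</str>
     </lst>
   </requestHandler>

Anything you leave out of defaults can still be supplied per request.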


-- Jack Krupansky

On Sun, Feb 28, 2016 at 2:42 PM, 
wrote:

> Hi,
>
> I want to setup ExtendedDisMax in our solr 4.6 server, but I can't seem to
> find any example configuration for this. Ie the configuration needed in
> solrconfig.xml. In the wiki page
> http://wiki.apache.org/solr/ExtendedDisMax it simply says:
>
> "Extended DisMax is already configured in the example configuration, with
> the name edismax."
>
> But this is not true for the solrconfig.xml in our setup (it only contains
> an example for dismax, not edismax), and I downloaded the latest solr zip
> file (solr 5.5.0), and it didn't have either dismax or edismax in any of
> its solrconfig.xml files.
>
> Why is it so hard to find this configuration? Am I missing something
> obvious?
>
> Regards
> /Jimi
>


Re: ExtendedDisMax configuration nowhere to be found

2016-02-28 Thread Jack Krupansky
Yes, it absolutely is automagic - just look at those examples in the
Confluence ref guide. No special request handler is needed - just the
normal default handler. Just the defType and qf parameters are needed - as
shown in the wiki examples.

It really is that simple! All you have to supply is the list of fields to
query (qf) and your actual query text (q).
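
For example (hypothetical field names), a request as simple as:

/select?defType=edismax&qf=title+description&q=foo+bar

is all it takes - no solrconfig.xml changes required.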

I know, I know... some people just can't handle automatic. (Some people
hate DisneyLand/World!)

-- Jack Krupansky

On Sun, Feb 28, 2016 at 5:16 PM, 
wrote:

> I'm sorry, but I am still confused. I'm expecting to see some
> <requestHandler> tag somewhere. Why does neither the documentation nor the
> example solrconfig.xml contain such a tag?
>
> If the edismax requestHandler is defined automatically, the documentation
> should explain that. Also, there should still exist some xml code that
> corresponds exactly to that default setup, right? That is what I'm looking
> for.
>
> For now, this edismax thing seems to work "automagically", and I prefer to
> understand why and how something works.
>
> /Jimi
>
> -Original Message-
> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> Sent: Sunday, February 28, 2016 10:58 PM
> To: solr-user@lucene.apache.org
> Subject: Re: ExtendedDisMax configuration nowhere to be found
>
> Consult the Confluence wiki for more recent doc:
>
> https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser
>
> You can specify all the parameters on your query request as in the
> examples, or by placing the parameters in the "defaults" section for your
> request handler in solrconfig.xml.
>
>
> -- Jack Krupansky
>
> On Sun, Feb 28, 2016 at 2:42 PM, 
> wrote:
>
> > Hi,
> >
> > I want to setup ExtendedDisMax in our solr 4.6 server, but I can't
> > seem to find any example configuration for this. Ie the configuration
> > needed in solrconfig.xml. In the wiki page
> > http://wiki.apache.org/solr/ExtendedDisMax it simply says:
> >
> > "Extended DisMax is already configured in the example configuration,
> > with the name edismax."
> >
> > But this is not true for the solrconfig.xml in our setup (it only
> > contains an example for dismax, not edismax), and I downloaded the
> > latest solr zip file (solr 5.5.0), and it didn't have either dismax or
> > edismax in any of its solrconfig.xml files.
> >
> > Why is it so hard to find this configuration? Am I missing something
> > obvious?
> >
> > Regards
> > /Jimi
> >
>


Re: ExtendedDisMax configuration nowhere to be found

2016-02-28 Thread Jack Krupansky
So, all this hard work that people have put into Solr to make it more like
a Disney theme park is just... wasted... on you? Sigh. Okay, I guess we
can't please everyone.

-- Jack Krupansky

On Sun, Feb 28, 2016 at 5:40 PM, 
wrote:

> I have no problem with automatic. It is "automagical" stuff that I find a
> bit hard to like, i.e. things that are automatic but don't explain how and
> why they are automatic. But Disneyland and Disney World are actually
> really good examples of places where the magic stuff is suitable, i.e. in
> theme parks, designed mostly for kids. In the grown-up world of IT, most
> people prefer logical and documented stuff, not things that "just work"
> without explaining why. No offence :)
>
> /Jimi
>
> -Original Message-
> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> Sent: Sunday, February 28, 2016 11:31 PM
> To: solr-user@lucene.apache.org
> Subject: Re: ExtendedDisMax configuration nowhere to be found
>
> Yes, it absolutely is automagic - just look at those examples in the
> Confluence ref guide. No special request handler is needed - just the
> normal default handler. Just the defType and qf parameters are needed - as
> shown in the wiki examples.
>
> It really is that simple! All you have to supply is the list of fields to
> query (qf) and your actual query text (q).
>
> I know, I know... some people just can't handle automatic. (Some people
> hate DisneyLand/World!)
>
> -- Jack Krupansky
>
> On Sun, Feb 28, 2016 at 5:16 PM, 
> wrote:
>
> > I'm sorry, but I am still confused. I'm expecting to see some
> >  tag somewhere. Why doesn't the documentation nor the
> > example solrconfig.xml contain such a tag?
> >
> > If the edismax requestHandler is defined automatically, the
> > documentation should explain that. Also, there should still exist some
> > xml code that corresponds exactly to that default setup, right? That
> > is what I'm looking for.
> >
> > For now, this edismax thing seems to work "automagically", and I
> > prefer to understand why and how something works.
> >
> > /Jimi
> >
> > -Original Message-
> > From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> > Sent: Sunday, February 28, 2016 10:58 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: ExtendedDisMax configuration nowhere to be found
> >
> > Consult the Confluence wiki for more recent doc:
> >
> > https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Q
> > uery+Parser
> >
> > You can specify all the parameters on your query request as in the
> > examples, or by placing the parameters in the "defaults" section for
> > your request handler in solrconfig.xml.
> >
> >
> > -- Jack Krupansky
> >
> > On Sun, Feb 28, 2016 at 2:42 PM, 
> > wrote:
> >
> > > Hi,
> > >
> > > I want to setup ExtendedDisMax in our solr 4.6 server, but I can't
> > > seem to find any example configuration for this. Ie the
> > > configuration needed in solrconfig.xml. In the wiki page
> > > http://wiki.apache.org/solr/ExtendedDisMax it simply says:
> > >
> > > "Extended DisMax is already configured in the example configuration,
> > > with the name edismax."
> > >
> > > But this is not true for the solrconfig.xml in our setup (it only
> > > contains an example for dismax, not edismax), and I downloaded the
> > > latest solr zip file (solr 5.5.0), and it didn't have either dismax
> > > or edismax in any of its solrconfig.xml files.
> > >
> > > Why is it so hard to find this configuration? Am I missing something
> > > obvious?
> > >
> > > Regards
> > > /Jimi
> > >
> >
>


Re: ExtendedDisMax configuration nowhere to be found

2016-02-29 Thread Jack Krupansky
There is nothing wrong with features that appear to be automagical - that
should in fact be a goal for all modern software systems. Of course, there
is no magic, it's all real logic and any magic is purely appearance - it's
just that the underlying logic may be complex and not obvious to an
uninformed observer. Deliberately hiding information from users (e.g.,
implementation details) is indeed a goal for Solr - no mere mortal should
be exposed to the intricate detail of the underlying Lucene search library
or the apparent magic of edismax. In truth, nothing is hidden - the source
code of both Solr and Lucene is readily available. But to the user it may
(and should) appear magical and even automagical.

OTOH, maybe some of the doc on edismax was not as clear as it could have
been, in which case it is up to you to point out which specific passage(s)
caused your difficulty. AFAICT, nothing at all was hidden - the examples in
the doc (which I pointed you to) seem very simple and direct to the point.
If you experienced them otherwise, it is up to you to point out any
problems that you had. And as I pointed out, you had started with the old
wiki when you should have started with the current Solr Reference Guide.

The old edismax wiki should in fact have a tombstone warning that indicates
that it is obsolete and redirect people to the new doc. Out of curiosity,
how did you get to that old wiki page in the first place?

-- Jack Krupansky

On Mon, Feb 29, 2016 at 3:20 AM, 
wrote:

> There is no need to deliberately misinterpret what I wrote. What I was
> trying to say was that "automagical" things don't belong in a professional
> environment, because they hide important information from people. That is
> bad enough as it is, but if on top of that it is *intended* for things in
> Solr to be "automagical", i.e. *deliberately* hiding information from the
> Solr users, well, that attitude is just baffling in my eyes. I can only
> hope that I misunderstood you.
>
> /Jimi
>
> -Original Message-
> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> Sent: Sunday, February 28, 2016 11:44 PM
> To: solr-user@lucene.apache.org
> Subject: Re: ExtendedDisMax configuration nowhere to be found
>
> So, all this hard work that people have put into Solr to make it more like
> a Disney theme park is just... wasted... on you? Sigh. Okay, I guess we
> can't please everyone.
>
> -- Jack Krupansky
>
> On Sun, Feb 28, 2016 at 5:40 PM, 
> wrote:
>
> > I have no problem with automatic. It is "automagicall" stuff that I
> > find a bit hard to like. Ie things that are automatic, but doesn't
> > explain how and why they are automatic. But Disney Land and Disney
> > World are actually really good examples of places where the magic
> > stuff is suitable, ie in themeparks, designed mostly for kids. In the
> > grown up world of IT, most people prefer logical and documented stuff,
> not things that "just works"
> > without explaining why. No offence :)
> >
> > /Jimi
> >
> > -Original Message-
> > From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> > Sent: Sunday, February 28, 2016 11:31 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: ExtendedDisMax configuration nowhere to be found
> >
> > Yes, it absolutely is automagic - just look at those examples in the
> > Confluence ref guide. No special request handler is needed - just the
> > normal default handler. Just the defType and qf parameters are needed
> > - as shown in the wiki examples.
> >
> > It really is that simple! All you have to supply is the list of fields
> > to query (qf) and your actual query text (q).
> >
> > I know, I know... some people just can't handle automatic. (Some
> > people hate DisneyLand/World!)
> >
> > -- Jack Krupansky
> >
> > On Sun, Feb 28, 2016 at 5:16 PM, 
> > wrote:
> >
> > > I'm sorry, but I am still confused. I'm expecting to see some
> > >  tag somewhere. Why doesn't the documentation nor
> > > the example solrconfig.xml contain such a tag?
> > >
> > > If the edismax requestHandler is defined automatically, the
> > > documentation should explain that. Also, there should still exist
> > > some xml code that corresponds exactly to that default setup, right?
> > > That is what I'm looking for.
> > >
> > > For now, this edismax thing seems to work "automagically", and I
> > > prefer to understand why and how something works.
> > >
> > > /Jimi
> > >
> > > -Original Message-
> > > From: Jack Krupansky [mailt

Re: ExtendedDisMax configuration nowhere to be found

2016-02-29 Thread Jack Krupansky
It is indeed a problem that the old edismax wiki is result #1 from Google.
I find that annoying as well since I also use Google search as my first
step in accessing doc on everything.

-- Jack Krupansky

On Mon, Feb 29, 2016 at 10:03 AM, 
wrote:

> Thanks Shawn,
>
> I had more or less assumed that the cwiki site was focused on the latest
> Solr version, but never really noticed that the "reference guide" was
> available in version-specific releases. I guess that is partly because I
> prefer googling about a specific topic, instead of reading some reference
> guide cover to cover. And from a google search for "edismax" (for example),
> it's not really trivial to click one's way into a version-specific
> reference guide on that topic. Instead, one tends to land on the wiki pages
> (with the old wiki as the first hit, sometimes).
>
> Regards
> /Jimi
>
> -Original Message-
> From: Shawn Heisey [mailto:apa...@elyograg.org]
> Sent: Monday, February 29, 2016 3:45 PM
> To: solr-user@lucene.apache.org
> Subject: Re: ExtendedDisMax configuration nowhere to be found
>
> On 2/29/2016 7:00 AM, jimi.hulleg...@svensktnaringsliv.se wrote:
> > So, should I assume that the "Confluence wiki" is the correct place for
> all documentation, even for solr 4.6?
>
> If you want documentation specifically for 4.6, there are version-specific
> releases of the guide:
>
> https://archive.apache.org/dist/lucene/solr/ref-guide/
>
> The confluence wiki is the "live" version of the reference guide,
> applicable to whatever version of Solr is being worked on at the moment,
> not the released versions.  Because it's such a large documentation set and
> Solr evolves incrementally, quite a lot of the confluence wiki is
> applicable to older versions, but the wiki as a whole is not intended for
> those older versions.
>
> The project is gearing up to begin the work on releasing version 6.0, so
> you can expect a LOT of change activity on the confluence wiki in the near
> future.  I have no idea how long it will take to finish 6.0.  The last two
> major releases (4.0 and 5.0) took months, but there's strong hope on the
> team that it will only take a few weeks this time.
>
> If you want to keep an eye on the pulse of the project, join the dev list.
>
> http://lucene.apache.org/solr/resources.html#mailing-lists
>
> In addition to a fair number of messages from real people, the dev list
> receives automated email from back-end systems in the project
> infrastructure, which creates very high traffic.  The ability to create
> filters to move mail between folders may help you keep your sanity.
>
> Also listed on the link above page is the commit notification list, which
> offers a particularly verbose look into what's happening to the project.
>
> Thanks,
> Shawn
>
>


Re: ExtendedDisMax configuration nowhere to be found

2016-02-29 Thread Jack Krupansky
Interesting... if I google for "edismax solr" (I usually specify a product
name when searching for doc by feature name to avoid irrelevant hits) the
old wiki comes up as #1 and new doc as #2, but if I search for "edismax"
alone (which I normally wouldn't do out of a desire to limit matches to the
desired product) the new ref guide does indeed show up as #1 and the old
wiki as #2.

I'm not enough of an SEO expert to know how to de-boost the old wiki other
than outright deletion. I'm guessing it's due to a lot of inbound links,
maybe mostly from references in old emails.

In any case, a proper tombstone is probably the best step at this point.

-- Jack Krupansky

On Mon, Feb 29, 2016 at 10:39 AM, Jack Krupansky 
wrote:

> It is indeed a problem that the old edismax wiki is result #1 from Google.
> I find that annoying as well since I also use Google search as my first
> step in accessing doc on everything.
>
> -- Jack Krupansky
>
> On Mon, Feb 29, 2016 at 10:03 AM, 
> wrote:
>
>> Thanks Shawn,
>>
>> I had more or less assumed that the cwiki site was focused on the latest
>> Solr version, but never really noticed that the "reference guide" was
>> available in version-specific releases. I guess that is partly because I
>> prefer googling about a specific topic, instead of reading some reference
>> guide cover to cover. And from a google search for "edismax" (for example),
>> it's not really trivial to click one's way into a version-specific
>> reference guide on that topic. Instead, one tends to land on the wiki pages
>> (with the old wiki as the first hit, sometimes).
>>
>> Regards
>> /Jimi
>>
>> -Original Message-
>> From: Shawn Heisey [mailto:apa...@elyograg.org]
>> Sent: Monday, February 29, 2016 3:45 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: ExtendedDisMax configuration nowhere to be found
>>
>> On 2/29/2016 7:00 AM, jimi.hulleg...@svensktnaringsliv.se wrote:
>> > So, should I assume that the "Confluence wiki" is the correct place for
>> all documentation, even for solr 4.6?
>>
>> If you want documentation specifically for 4.6, there are
>> version-specific releases of the guide:
>>
>> https://archive.apache.org/dist/lucene/solr/ref-guide/
>>
>> The confluence wiki is the "live" version of the reference guide,
>> applicable to whatever version of Solr is being worked on at the moment,
>> not the released versions.  Because it's such a large documentation set and
>> Solr evolves incrementally, quite a lot of the confluence wiki is
>> applicable to older versions, but the wiki as a whole is not intended for
>> those older versions.
>>
>> The project is gearing up to begin the work on releasing version 6.0, so
>> you can expect a LOT of change activity on the confluence wiki in the near
>> future.  I have no idea how long it will take to finish 6.0.  The last two
>> major releases (4.0 and 5.0) took months, but there's strong hope on the
>> team that it will only take a few weeks this time.
>>
>> If you want to keep an eye on the pulse of the project, join the dev list.
>>
>> http://lucene.apache.org/solr/resources.html#mailing-lists
>>
>> In addition to a fair number of messages from real people, the dev list
>> receives automated email from back-end systems in the project
>> infrastructure, which creates very high traffic.  The ability to create
>> filters to move mail between folders may help you keep your sanity.
>>
>> Also listed on the link above page is the commit notification list, which
>> offers a particularly verbose look into what's happening to the project.
>>
>> Thanks,
>> Shawn
>>
>>
>


  1   2   3   4   5   6   7   8   9   10   >