Filter queries taking a long time, even with cache disabled
On a Solr 4.1 install I see that queries that use the fq parameter take a long time (upwards of 120 seconds), both with the standard Lucene query parser and also with edismax. I have added the {!cache=false} localparam to the filter query, but this does not speed up the query. Putting all the search terms in the main query returns results in milliseconds. Note that I am not using any wildcard queries; in each case I am specifying the field to search and the terms to search on. Where should I start to debug? -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Filter queries taking a long time, even with cache disabled
On Thu, Jun 27, 2013 at 12:14 PM, Upayavira wrote: > can you give an example? > Thank you. This is an example query: select ?q=search_field:iraq &fq={!cache=false}search_field:love%20obama &defType=edismax -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
No date.gap on pivoted facets
Consider the following query: select?q=*:* &facet=true &facet.date=added &facet.date.start=2013-04-01T00:00:00Z &facet.date.end=2013-06-30T00:00:00Z &facet.date.gap=%2b7DAYS &rows=0 &facet.pivot=added,provider In this query, the facet.date.gap is ignored and each individual second is faceted on. The issue remains the same even when reversing the order of the pivot: &facet.pivot=provider,added Is this a Solr bug, or am I pivoting wrong? This is on Solr 4.1.0 running on OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode) on Ubuntu Server 12.04. Thank you! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: No date.gap on pivoted facets
On Sun, Jun 30, 2013 at 5:33 PM, Jack Krupansky wrote: > Sorry, but Solr pivot faceting is based solely on "field" facets, not > "range" (or "date") facets. > Thank you. I tried adding that information to the SimpleFacetParameters wiki page, but that page seems to be defined as an "Immutable Page". > You can approximate date gaps by making a copy of your raw date field and > then manually "gap" (truncate) the date values so that their discrete > values correspond to your date gap. > Thank you, this is what I have done. > In the next release of my book I have a script for a > StatelessScriptUpdateProcessor (with examples) that supports truncation of > dates to a desired resolution, copying or modifying the input date as > desired. > Terrific, I anticipate the release. Next release? Did I miss the release? http://www.amazon.com/Lucene-Solr-Definitive-comprehensive-realtime/dp/1449359957/ -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
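P.S. For anyone searching the archives later: the truncation I do before indexing is roughly the following. This is only a sketch (not the script from the book); the 'added_week' destination field name and the bucket start date are my own choices, and it assumes the input date is never earlier than the bucket start.

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.Locale;
    import java.util.TimeZone;

    public class DateGapTruncator {

        // Round a timestamp down to the start of its 7-day bucket, counting
        // buckets from the same start date used as facet.date.start.
        public static String truncateToGap(Date added, Date gapStart) {
            final long gapMillis = 7L * 24 * 60 * 60 * 1000;   // +7DAYS
            long offset = added.getTime() - gapStart.getTime(); // assumes added >= gapStart
            long bucketStart = gapStart.getTime() + (offset / gapMillis) * gapMillis;

            SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.US);
            fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
            return fmt.format(new Date(bucketStart));
        }
    }

The returned string goes into the copy field (I call it 'added_week'), and that field is what facet.pivot is run on instead of 'added'.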
Re: How to improve the Solr "OR" query performance
On Wed, Jul 3, 2013 at 6:48 AM, huasanyelao wrote: > Nowadays, I've got an urgent task to improve the "OR" query performance with > solr. > I have deployed 9 shards with solr-cloud on two servers (each server: 16 > cores, 32G RAM). > The total document count: 60,000,000, total index size: 9G. > According to the requirement, I have to use the "OR" query to get results. > The average number of query terms is about 15. > The response time for an "OR" query is around 1-2 seconds (the "AND" query is just > about 30ms-40ms). > Our target: improve by 50%, that is, at most 500ms-1s per query. > The document count will soar to 80,000,000; however, performance should stay within > 500ms-1s per query. > Any advice or approach is appreciated. Thanks in advance. > What size documents? I've currently got stats like this, only a few more documents but 5s searches on 15 ORs: q=love%20OR%20hate%20OR%20beer%20OR%20sex%20OR%20peace%20OR%20war%20OR%20up%20OR%20down%20OR%20this%20OR%20that%20OR%20left%20OR%20right%20OR%20north%20OR%20south%20OR%20east%20OR%20west status=0, QTime=5604, echoed query: love OR hate OR beer OR sex OR peace OR war OR up OR down OR this OR that OR left OR right OR north OR south OR east OR west My index currently has 77461952 documents, most under 1 KiB each but with upwards of ten fields. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Find related words
How might one find the top related words for a given word in a Solr index? For instance, given the following single-field documents: 1: I love chocolate 2: I love Solr 3: I eat chocolate cake 4: You will eat chocolate candy Thus, given the word "Chocolate" Solr might find these top words: I, eat (2 matches each); love, cake, you, will, candy (1 match each) Thanks! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Find related words
Thank you Jack and Koji. I will take a look at MLT and also at the .zip files from LUCENE-474. Koji, did you have to modify the code for the latest Solr? -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
How might one search for dupe IDs other than faceting on the ID field?
To search for duplicate IDs, I am running the following query: select?q=*:*&facet=true&facet.field=id&rows=0 However, since upgrading from Solr 4.1 to Solr 4.3 I am receiving OutOfMemoryError errors instead of the desired facet: java.lang.OutOfMemoryError: Java heap space java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155) at ... Might there be a less resource-intensive way to get this information? This is Solr 4.3 running on Ubuntu Server 12.04 in Jetty. The index has over 100,000,000 small records, for a total of about 95 GiB of disk space, with Solr running on its own disk. Actually, the 'disk' is an Amazon Web Service EBS volume. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: How might one search for dupe IDs other than faceting on the ID field?
On Tue, Jul 30, 2013 at 9:21 PM, Aloke Ghoshal wrote: > Does adding facet.mincount=2 help? > > In fact, when adding facet.mincount=20 (I know that some dupes are in the hundreds) I got the OutOfMemoryError in seconds instead of minutes. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: How might one search for dupe IDs other than faceting on the ID field?
On Tue, Jul 30, 2013 at 9:23 PM, Michael Della Bitta wrote: > Are you talking about the document's ID field? > > If so, you can't have duplicates... the latter document would overwrite the > earlier. > > If not, sorry for asking irrelevant questions. :) > In Solr 4.1 we were using overwrite=false&allowDups=false in order to discard the new document, not overwrite the extant document. We knew at the time that the features were deprecated, and apparently allowDups=false stopped working in 4.3. We are testing new solutions, but we need to identify the dupes to get them out. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: How might one search for dupe IDs other than faceting on the ID field?
On Tue, Jul 30, 2013 at 9:24 PM, Shawn Heisey wrote: > Add &facet.method=enum to the query URL. This will cause Solr to enumerate > the facet information on every query rather than load it into the field > cache, which takes a lot of memory. Solr 4.1 was probably very close to > running out of memory as well. > > If you have enough OS disk cache for your index, the enum method should not > cause an enormous slowdown. If you don't have enough OS disk cache, then it > can make the facets run very slowly. > > Thanks, > Shawn > Thanks, the query ran for almost 2 full minutes but it returned results! I'll google for how to increase the disk cache for queries like this. Other than the Qtime, is there no way to judge the amount of memory required for a particular query to run? -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
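P.S. For anyone who finds this thread later, the query that finally worked for me translates to roughly the following in SolrJ. This is a sketch only: the core URL is an assumption, and 'id' is our uniqueKey field.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FindDuplicateIds {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://127.0.0.1:8983/solr/core");

            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0);                  // only the facet counts are needed
            q.setFacet(true);
            q.addFacetField("id");
            q.setFacetMinCount(2);         // only ids that appear more than once
            q.setFacetLimit(-1);           // no limit on the number of buckets
            q.set("facet.method", "enum"); // enumerate terms instead of building a field cache

            QueryResponse rsp = server.query(q);
            for (FacetField.Count dupe : rsp.getFacetField("id").getValues()) {
                System.out.println(dupe.getName() + " occurs " + dupe.getCount() + " times");
            }
        }
    }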
Re: How might one search for dupe IDs other than faceting on the ID field?
On Tue, Jul 30, 2013 at 9:43 PM, Michael Della Bitta wrote: > Since this is a one-time problem, have you thought of just dumping all the > IDs and looking for dupes using sort and awk or something similar to that? > All 100,000,000 of them :) That would take even longer! Also, I fear that this is not a one-time problem, rather that I should learn now how to tune Solr for intensive queries such as this. I learn by the problems encountered! Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: How might one search for dupe IDs other than faceting on the ID field?
On Tue, Jul 30, 2013 at 9:56 PM, Shawn Heisey wrote: > On 7/30/2013 12:49 PM, Dotan Cohen wrote: >> >> Thanks, the query ran for almost 2 full minutes but it returned >> results! I'll google for how to increase the disk cache for queries >> like this. Other than the Qtime, is there no way to judge the amount >> of memory required for a particular query to run? > > > The way you increase disk cache is to add memory to the server. Any memory > that's not being used by programs (OS, Solr, or anything else) is > automatically part of the disk cache. > > Thanks, > Shawn > I see, thanks. I thought that 'disk cache' was something on disk, such as swap space. The server is already maxed out on RAM:
$ free -m
             total       used       free     shared    buffers     cached
Mem:         14980      14906         73          0        167       5293
-/+ buffers/cache:       9444       5535
Swap:            0          0          0
-- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: How might one search for dupe IDs other than faceting on the ID field?
On Tue, Jul 30, 2013 at 11:00 PM, Mikhail Khludnev wrote: > Dotan, > > Could you please provide more lines of the stack trace? Sure, thanks:
java.lang.OutOfMemoryError: Java heap space
java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
 at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
 at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1307)
 at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:453)
 at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
 at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:560)
 at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
 at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1072)
 at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:382)
 at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
 at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1006)
 at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
 at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
 at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
 at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
 at org.eclipse.jetty.server.Server.handle(Server.java:365)
 at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:485)
 at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
 at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:926)
 at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:988)
 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:635)
 at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
 at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
 at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
 at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
 at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
 at java.lang.Thread.run(Thread.java:679)
Caused by: java.lang.OutOfMemoryError: Java heap space (returned as an HTTP 500 response)
> I have no idea why it got worse in 4.3. I know that 4.3 can use facets > backed on DocValues, which are modest for the heap. But from what I saw, > though I can be wrong, it's disabled for numeric facets. Hence, I can suggest to > reindex id as string docvalues and hope for them. However, it's doubtful to > reindex everything without strong guarantees. We also had issues with 4.2, though I really don't remember the details. Some simple queries such as 'q=ubuntu' would take tens of seconds whereas on 4.1 it was almost instantaneous. In fact, even in 4.3 I feel that things have slowed down terribly (3000 ms on simple queries whereas 4.1 would do it in tens of milliseconds or at most a few hundred). Of course, the index is constantly growing so that may be a factor. Note that in both cases the index and configuration were carried over from 4.1, so that may have been an issue.
Moving back from 4.2 to 4.1 I bit the bullet and deleted the extant documents. I no longer have that luxury now. > Also, I checked the source code of > http://wiki.apache.org/solr/TermsComponent and found that it can be > really memory modest (i.e. without sort or limit). > Be aware that the df-s returned by that component are unaware of deleted > documents, hence expungeDeletes before. > Thank you, I will look into that. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: How might one search for dupe IDs other than faceting on the ID field?
On Tue, Jul 30, 2013 at 11:14 PM, Jack Krupansky wrote: > The Solr SignatureUpdateProcessorFactory is designed to facilitate dedupe... > any particular reason you did not use it? > > See: > http://wiki.apache.org/solr/Deduplication > > and > > https://cwiki.apache.org/confluence/display/solr/De-Duplication > Actually, the guy who made the changes (a coworker) did in fact write an alternative UpdateHandler. I've just noticed that there are a bunch of dupes right now, though.

import java.io.IOException;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.solr.core.SolrCore;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.DirectUpdateHandler2;
import org.apache.solr.util.RefCounted;

public class DiscoAPIUpdateHandler extends DirectUpdateHandler2 {

    public DiscoAPIUpdateHandler(SolrCore core) {
        super(core);
    }

    @Override
    public int addDoc(AddUpdateCommand cmd) throws IOException {
        // If overwrite is set to false, fall back to the stock DirectUpdateHandler2
        // behaviour; this is done for debugging, to insert duplicates into Solr.
        if (!cmd.overwrite) {
            return super.addDoc(cmd);
        }

        // When using ref-counted objects you must decrement the ref count when you are done.
        RefCounted<SolrIndexSearcher> indexSearcher = this.core.getNewestSearcher(false);

        // The idea: run an internal Lucene query and check whether that id already exists.
        Term updateTerm;
        if (cmd.updateTerm != null) {
            updateTerm = cmd.updateTerm;
        } else {
            updateTerm = new Term("id", cmd.getIndexedId());
        }
        Query query = new TermQuery(updateTerm);
        TopDocs docs = indexSearcher.get().search(query, 2);
        if (docs.totalHits > 0) {
            // The index searcher is no longer needed.
            indexSearcher.decref();
            // Don't add the new document.
            return 0;
        }

        // The index searcher is no longer needed.
        indexSearcher.decref();
        // If we got here then it's a new document.
        return super.addDoc(cmd);
    }
}

> And I give a bunch of examples in my book. > I anticipate the book with esteem! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: How might one search for dupe IDs other than faceting on the ID field?
On Wed, Jul 31, 2013 at 12:48 AM, Jack Krupansky wrote: > You could also try the terms component which provides a very efficient > facet-like feature - counting the terms. And you can set a minimum term > frequency of 2, so only the dups would come back: > > curl "http://localhost:8983/solr/terms?terms.fl=id&terms.mincount=2"; > Thanks, Jack. This returns results with comparable Qtimes to the faceting on enum. Good to know! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: How might one search for dupe IDs other than faceting on the ID field?
On Wed, Jul 31, 2013 at 4:56 AM, Bill Bell wrote: > On Jul 30, 2013, at 12:34 PM, Dotan Cohen wrote: >> On Tue, Jul 30, 2013 at 9:21 PM, Aloke Ghoshal wrote: >>> Does adding facet.mincount=2 help? >> >> In fact, when adding facet.mincount=20 (I know that some dupes are in >> the hundreds) I got the OutOfMemoryError in seconds instead of >> minutes. >> >> Dotan Cohen > > This seems like a fairly large issue. Can you create a Jira issue ? > > Bill Bell I'll file an issue, but on what? What information should I include? How is this different from what you would expect? Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Don't cache filter queries
I need to use the filter query feature to filter my results, but I don't want the results cached, as documents are added to the index several times per second and the results will be stale immediately. Is there any way to disable filter query caching? This is on Solr 4.1 running in Jetty on Ubuntu Server. Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Don't cache filter queries
On Thu, Mar 21, 2013 at 6:22 PM, Chris Hostetter wrote: > > : Just add {!cache=false} to the filter in your query > : (http://wiki.apache.org/solr/SolrCaching#filterCache). > ... > : > I need to use the filter query feature to filter my results, but I > : > don't want the results cached as documents are added to the index > : > several times per second and the results will be stale immediately. Is > : > there any way to disable filter query caching? > > Or remove the filterCache config option from your solrconfig.xml if you > really don't want any caching of any filter queries. > > Frankly though: that's throwing the baby out with the bath water -- just > because you are updating your index super-fast-like doesn't mean you > aren't getting benefits from the caches, particularly from commonly > reused filters which are applied to many queries which might get > executed concurrently -- not to mention that a single filter might be > reused multiple times within a single request to solr. > > disabling cache *warming* can make a lot of sense in NRT cases, but > eliminating caching altogether rarely does. > Thanks. The problem is that the queries with filter queries are taking much longer to run (~60-80 ms) than the queries without (~1-4 ms). I figured that the problem may have been with the caching. In fact, running a query with a filter query and caching disabled is running in the range of 16-30 ms, which is quite an improvement. Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Out of memory on some faceting queries
.handleRequest(BlockingHttpConnection.java:53)
 at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:926)
 at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:988)
 at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:635)
 at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
","code":500}}
I notice that this only occurs on queries that run facets. I start Solr with the following command: sudo nohup java -XX:NewRatio=1 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -Dsolr.solr.home=/mnt/SolrFiles100/solr -jar /opt/solr-4.1.0/example/start.jar & The server seems to have enough memory:
$ free -m
             total       used       free     shared    buffers     cached
Mem:         14980      10604       4375          0        472       8078
-/+ buffers/cache:       2054      12925
Swap:            0          0          0
The server is 64-bit Ubuntu Server 12.04 LTS running Solr 4.1 and the following Java: $ java -version java version "1.6.0_27" OpenJDK Runtime Environment (IcedTea6 1.12.3) (6b27-1.12.3-0ubuntu1~12.04.1) OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode) Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Out of memory on some faceting queries
On Tue, Apr 2, 2013 at 12:59 PM, Toke Eskildsen wrote: > How many documents does your index have, how many fields do you facet on > and approximately how many unique values does your facet fields have? > 8971763 documents, growing at a rate of about 500 per minute. We actually expect that to be ~5 per minute once we get out of testing. Most documents are less than a KiB in the 'text' field, and they have a few other fields which store short strings, dates, or ints. You can think of these documents like tweets: short general purpose text messages. >> I notice that this only occurs on queries that run facets. I start >> Solr with the following command: >> sudo nohup java -XX:NewRatio=1 -XX:+UseParNewGC >> -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled >> -Dsolr.solr.home=/mnt/SolrFiles100/solr -jar >> /opt/solr-4.1.0/example/start.jar & > > You are not specifying any maximum heap size (-Xmx), which you should do > in order to avoid unpleasant surprises. Facets and sorting are often > memory hungry, but your system seems to have 13GB free RAM so the easy > solution attempt would be to increase the heap until Solr serves the > facets without OOM. > Thanks, I will start with "-Xmx8g" and test. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Out of memory on some faceting queries
On Tue, Apr 2, 2013 at 2:41 PM, Toke Eskildsen wrote: > 9M documents in a heavily updated index with faceting. Maybe you are > committing faster than the faceting can be prepared? > https://wiki.apache.org/solr/FAQ#What_does_.22exceeded_limit_of_maxWarmingSearchers.3DX.22_mean.3F > Thank you Toke, this is exactly on my "list of things to learn about Solr". We do get the error mentioned and we cannot reduce the amount of commits. Also, I do believe that we have the necessary server resources (16 GiB RAM). I have increased maxWarmingSearchers to 4, let's see how this goes. Thank you. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Out of memory on some faceting queries
On Tue, Apr 2, 2013 at 5:33 PM, Toke Eskildsen wrote: > On Tue, 2013-04-02 at 15:55 +0200, Dotan Cohen wrote: > > [Toke: maxWarmingSearchers limit exceeded?] > >> Thank you Toke, this is exactly on my "list of things to learn about >> Solr". We do get the error mentioned and we cannot reduce the amount >> of commits. Also, I do believe that we have the necessary server >> resources (16 GiB RAM). > > Memory does not help you if you commit too frequently. If you commit > each X seconds and warming takes X+Y seconds, then you will run out of > memory at some point. > >> I have increased maxWarmingSearchers to 4, let's see how this goes. > > If you still get the error with 4 concurrent searchers, you will have to > either speed up warmup time or commit less frequently. You should be > able to reduce facet startup time by switching to segment based faceting > (at the cost of worse search-time performance) or maybe by using > DocValues. Some of the current threads on the solr-user list are about > these topics. > > How often do you commit and how many unique values do your facet > fields have? > > Regards, > Toke Eskildsen > -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Out of memory on some faceting queries
On Tue, Apr 2, 2013 at 5:33 PM, Toke Eskildsen wrote: > Memory does not help you if you commit too frequently. If you commit > each X seconds and warming takes X+Y seconds, then you will run out of > memory at some point. > How might I time the warming? I've been googling warming since your earlier message but there does not seem to be any really good documentation on the subject. If there is anything that you feel I should be reading I would appreciate a link or a keyword to search on. I've read the Solr wiki on caching and performance, but other than that I don't see the issue addressed. >> I have increased maxWarmingSearchers to 4, let's see how this goes. > > If you still get the error with 4 concurrent searchers, you will have to > either speed up warmup time or commit less frequently. You should be > able to reduce facet startup time by switching to segment based faceting > (at the cost of worse search-time performance) or maybe by using > DocValues. Some of the current threads on the solr-user list are about > these topics. > > How often do you commit and how many unique values do your facet > fields have? > Batches of 20-50 results are added to Solr a few times a minute, and a commit is done after each batch since I'm calling Solr as such: http://127.0.0.1:8983/solr/core/update/json?commit=true Should I remove commit=true and run a cron job to commit once per minute? -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Out of memory on some faceting queries
> How often do you commit and how many unique values does your facet > fields have? > Most of the time I facet on one field that has about twenty unique values. However, once per day I would like to facet on the text field, which is a free-text field usually around 1 KiB (about 100 words), in order to determine what the top keywords / topics are. That query would take up to 200 seconds to run, but it does not have to return the results in real-time (the output goes to another process, not to a waiting user). -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
maxWarmingSearchers in Solr 4.
I have been dragging the same solrconfig.xml from Solr 3.x to 4.0 to 4.1, with no customization (bad, bad me!). I'm now looking into customizing it and I see that the Solr 4.1 solrconfig.xml is much simpler and shorter. Is this simply because many of the examples have been removed? In particular, I notice that there is no mention of maxWarmingSearchers in the Solr 4.1 solrconfig.xml. I assume that I can simply add it in, are there any other critical config options that are missing that I should be looking into as well? Would I be better off using the old Solr 3.x solrconfig.xml in Solr 4.1 as it contains so many examples? -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Out of memory on some faceting queries
On Tue, Apr 2, 2013 at 6:26 PM, Andre Bois-Crettez wrote: > warmupTime is available on the admin page for each type of cache (in > milliseconds) : > http://solr-box:8983/solr/#/core1/plugins/cache > > Or if you are only interested in the total : > http://solr-box:8983/solr/core1/admin/mbeans?stats=true&key=searcher > Thanks. >> Batches of 20-50 results are added to solr a few times a minute, and a >> commit is done after each batch since I'm calling Solr as such: >> http://127.0.0.1:8983/solr/core/update/json?commit=true Should I >> remove commit=true and run a cron job to commit once per minute? > > > Even better, it sounds like a job for CommitWithin : > http://wiki.apache.org/solr/CommitWithin > I'll look into that. Thank you! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
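P.S. For the archives: switching from commit=true on every batch to CommitWithin would look roughly like the following from SolrJ. This is just a sketch; the 60-second window is only a value I am considering, and the core URL is an assumption.

    import java.io.IOException;
    import java.util.Collection;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
        private final HttpSolrServer server =
                new HttpSolrServer("http://127.0.0.1:8983/solr/core");

        // Index a batch of 20-50 documents and let Solr commit them itself
        // within 60 seconds, instead of sending commit=true on every batch.
        public void indexBatch(Collection<SolrInputDocument> batch)
                throws SolrServerException, IOException {
            server.add(batch, 60000);  // commitWithin = 60,000 ms
        }
    }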
Re: Out of memory on some faceting queries
On Wed, Apr 3, 2013 at 10:11 AM, Toke Eskildsen wrote: >> However, once per day I would like to facet on the text field, >> which is a free-text field usually around 1 KiB (about 100 words), in >> order to determine what the top keywords / topics are. That query >> would take up to 200 seconds to run, [...] > > If that query is somehow part of your warming, then I am surprised that > search has worked at all with your commit frequency. That would however > explain your OOM if you have multiple warmups running at the same time. > No, the 'heavy facet' is not part of the warming. I run it at most once per day, at the end of the day. Solr is not shut down daily. > It sounds like TermsComponent would be a better fit for getting top > topics: https://wiki.apache.org/solr/TermsComponent > I had once looked at TermsComponent, but I think that I eliminated it as a possibility because I actually need the top keywords related to a specific keyword. For instance, I need to know which words are most commonly used with the word "coffee". -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
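P.S. For completeness, the once-a-day 'heavy facet' I described is roughly the following in SolrJ. A sketch only: 'text' is my free-text field, the limit of 50 is arbitrary, and the core URL is an assumption.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class RelatedKeywords {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://127.0.0.1:8983/solr/core");

            // Facet on the free-text field, restricted to documents mentioning "coffee",
            // to see which words co-occur with it most often.
            SolrQuery q = new SolrQuery("text:coffee");
            q.setRows(0);            // only the facet counts are needed
            q.setFacet(true);
            q.addFacetField("text");
            q.setFacetLimit(50);     // top 50 co-occurring terms
            q.setFacetMinCount(1);

            QueryResponse rsp = server.query(q);
            for (FacetField.Count c : rsp.getFacetField("text").getValues()) {
                System.out.println(c.getName() + "\t" + c.getCount());
            }
        }
    }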
Re: maxWarmingSearchers in Solr 4.
On Wed, Apr 3, 2013 at 7:55 PM, Shawn Heisey wrote: > In situations where I don't want to change the default value, I prefer > to leave config elements out of the solrconfig. It makes the config > smaller, and it also makes it so that I will automatically see benefits > from the default changing in new versions. > Thanks. This makes sense. I take it, then, that you update (or at least review) solrconfig for each new Solr version. As I become more familiar with that file I will begin doing the same. > In the case of maxWarmingSearchers, I would hope that you have your > system set up so that you would never need more than 1 warming searcher > at a time. If you do a commit while a previous commit is still warming, > Solr will try to create a second warming searcher. > How would I set the system up for that? We have very many commits (every few seconds) and each commit contains a few tens of documents (mostly smaller than 1 KiB per document). Right now we get about 200-300 searches per minute. Note that I expect both the commit rate and the search rate to increase 2-3 times in the next month, and ideally I should be able to scale it beyond that. I'm right now looking into sharding as a possible solution. > I went poking in the code, and it seems that maxWarmingSearchers > defaults to Integer.MAX_VALUE. I'm not sure whether this is a bad > default or not. It does mean that a pathological setup without > maxWarmingSearchers in the config will probably blow up with an > OutOfMemory exception, but is that better or worse than commits that > don't make new documents searchable? I can see arguments either way. > This is interesting: what you found is that the value in the stock solrconfig.xml file differs from the Solr default value. I think that this is bad practice: a single default should be decided upon, Solr should use this value when nothing is specified in solrconfig.xml, and that _same_value_ should be specified in the stock solrconfig.xml. Is it not a reasonable assumption that this would be the case? -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Understanding the Solr Admin page
I am expanding my Solr skills and would like to understand the Admin page better. I understand that understanding Java memory management and Java memory options will help me, and I am reading and experimenting on that front, but if there are any concise resources that are especially pertinent to Solr I would love to know about them. Everything that I've found is either a "do this" one-liner or expects Java experience which I don't have, and I don't know what I need to learn. I notice that some of the Args presented are in black text, and others in grey. Why are they presented differently? Where would I have found this information in the fine manual? When I start Solr with nohup, the resulting nohup.out file is _huge_. How might I start Solr such that INFO is not output, but only WARNINGs and SEVEREs are? In particular, I'd rather not log every query, even the invalid queries which also log as SEVERE. I thought that this would be easy to Google for, but it is not! If there is a concise document that examines this issue, I would love to know where on the wild wild web it exists. Thank you. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Out of memory on some faceting queries
On Wed, Apr 3, 2013 at 8:47 PM, Shawn Heisey wrote: > On 4/2/2013 3:09 AM, Dotan Cohen wrote: >> I notice that this only occurs on queries that run facets. I start >> Solr with the following command: >> sudo nohup java -XX:NewRatio=1 -XX:+UseParNewGC >> -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled >> -Dsolr.solr.home=/mnt/SolrFiles100/solr -jar >> /opt/solr-4.1.0/example/start.jar & > > It looks like you've followed some advice that I gave previously on how > to tune java. I have since learned that this advice is bad, it results > in long GC pauses, even with heaps that aren't huge. > I see, thanks. > As others have pointed out, you don't have a max heap setting, which > would mean that you're using whatever Java chooses for its default, > which might not be enough. If you can get Solr to successfully run for > a while with queries and updates happening, the heap should eventually > max out and the admin UI will show you what Java is choosing by default. > > Here is what I would now recommend for a beginning point on your Solr > startup command. You may need to increase the heap beyond 4GB, but be > careful that you still have enough free memory to be able to do > effective caching of your index. > > sudo nohup java -Xms4096M -Xmx4096M -XX:+UseConcMarkSweepGC > -XX:CMSInitiatingOccupancyFraction=75 -XX:NewRatio=3 > -XX:MaxTenuringThreshold=8 -XX:+CMSParallelRemarkEnabled > -XX:+ParallelRefProcEnabled -XX:+UseLargePages -XX:+AggressiveOpts > -Dsolr.solr.home=/mnt/SolrFiles100/solr -jar > /opt/solr-4.1.0/example/start.jar & > Thank you, I will experiment with that. > If you are running a really old build of java (latest versions on > Oracle's website are 1.6 build 43 and 1.7 build 17), you might want to > leave AggressiveOpts out. Some people would argue that you should never > use that option. > Great, thanks for the warning. This is what we're running; I'll see about updating it through my distro's package manager: $ java -version java version "1.6.0_27" OpenJDK Runtime Environment (IcedTea6 1.12.3) (6b27-1.12.3-0ubuntu1~12.04.1) OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode) -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: maxWarmingSearchers in Solr 4.
On Thu, Apr 4, 2013 at 10:54 PM, Shawn Heisey wrote: > You'll want to ensure that your autowarmCount value on Solr's caches is low > enough that each commit happens quickly. If it takes 5000 milliseconds to > warm the caches when you commit, then you want to be sure that you are > committing less often than that, or you'll quickly reach your > maxWarmingSearchers config value. If the commits are happening VERY > quickly, you may need to set autowarmCount to 0, and possibly disable caches > entirely. > I see. This seems to be the opposite of the approach that I was taking. >>> I went poking in the code, and it seems that maxWarmingSearchers >>> defaults to Integer.MAX_VALUE. I'm not sure whether this is a bad >>> default or not. It does mean that a pathological setup without >>> maxWarmingSearchers in the config will probably blow up with an >>> OutOfMemory exception, but is that better or worse than commits that >>> don't make new documents searchable? I can see arguments either way. >> >> >> This is interesting, what you found is that the value in the stock >> solrconfig.xml file differs from the Solr default value. I think that >> this is bad practice: a single default should be decided upon and Solr >> should use this value when nothing is specified in solrconfig.xml, and >> that _same_value_ should be specified in the stock solrconfig.xml. Is >> it not a reasonable assumption that this would be the case? > > > That was directed more at the other committers. I would argue that either a > low number or a relatively high number should be the default, but not > MAX_VALUE. The example config should have a commented out section for > maxWarmingSearchers that mentions the default. I'm having the same > discussion about maxBooleanClauses on SOLR-4586. > Right. > It's possible that this has already been discussed, and that everyone > prefers that a badly configured setup will eventually have a spectacular > blow up with OutOfMemory, rather than semi-silently ignoring commits. A > searcher object contains caches and uses a lot of memory, so having lots of > them around will eventually use up the entire heap. > Silently dropping data is by far the worse choice, I agree, especially as a default setting. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Why would one not use RemoveDuplicatesTokenFilterFactory?
I am looking through the schema of a Solr installation that I inherited last year. The original dev, who is unavailable for comment, has two types of text fields: one with RemoveDuplicatesTokenFilterFactory and one without. These fields are intended for full-text search. Why would someone _not_ use RemoveDuplicatesTokenFilterFactory on a field intended for full-text search? What are the drawbacks to using it? This application is very, very write heavy (hundreds of writes per minute) if that matters. It was running on websolr.com at the time, I've now moved it to Amazon Web Services. Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Why would one not use RemoveDuplicatesTokenFilterFactory?
On Fri, May 24, 2013 at 4:04 PM, Jack Krupansky wrote: > The primary purpose of this filter is in conjunction with the > KeywordRepeatFilterFactory and a stemmer, to remove the tokens that did not > produce a stem from the original token, so the keyword duplicate is no > longer needed. The goal is to index both the stemmed and unstemmed terms at > the same position. > > Whether your app is using the filter for that purpose remains to be seen. > > Removing duplicates from the raw input token stream would impact the term > frequency. > > -- Jack Krupansky > Thank you Jack. I thought that the filter only removed tokens with both identical position and identical text: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.RemoveDuplicatesTokenFilterFactory Are stemmed terms considered the same text as the original word, such that they will show as a dupe for the RemoveDuplicatesTokenFilterFactory? That seems odd. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Why would one not use RemoveDuplicatesTokenFilterFactory?
On Sun, May 26, 2013 at 8:16 PM, Jack Krupansky wrote: > The only comment I was trying to make here is the relationship between the > RemoveDuplicatesTokenFilterFactory and the KeywordRepeatFilterFactory. > > No, stemmed terms are not considered the same text as the original word. By > definition, they are a new value for the term text. > > I see, for some reason I did not concentrate on this key quote of yours: "...to remove the tokens that did not produce a stem ..." Now it makes perfect sense. Thank you, Jack! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
What exactly happens to extant documents when the schema changes?
When adding or removing a text field to/from the schema and then restarting Solr, what exactly happens to extant documents? Is the schema only consulted when Solr writes a document, so that extant documents are unaffected? Considering that Solr supports dynamic fields, my experimentation with removing and adding fields to the schema has shown almost no change in the results returned from the extant index. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: What exactly happens to extant documents when the schema changes?
On Tue, May 28, 2013 at 2:20 PM, Upayavira wrote: > The schema provides Solr with a description of what it will find in the > Lucene indexes. If you, for example, changed a string field to an > integer in your schema, that'd mess things up bigtime. I recently had to > upgrade a date field from the 1.4.1 date field format to the newer > TrieDateField. Given I had to do it on a live index, I had to add a new > field (just using copyfield) and re-index over the top, as the old field > was still in use. I guess, given my app now uses the new date field > only, I could presumably reindex the old date field with the new > TrieDateField format, but I'd want to try that before I do it for real. > Thank you for the insight. Unfortunately, with 20 million records and growing by hundreds each minute (social media posts) I don't see that I could ever reindex the data in a timely way. > However, if you changed a single valued field to a multi-valued one, > that's not an issue, as a field with a single value is still valid for a > multi-valued field. > > Also, if you add a new field, existing documents will be considered to > have no value in that field. If that is acceptable, then you're fine. > > I guess if you remove a field, then those fields will be ignored by > Solr, and thus not impact anything. But I have to say, I've never tried > that. > > Thus - changing the schema will only impact on future indexing. Whether > your existing index will still be valid depends upon the changes you are > making. > > Upayavira Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: What exactly happens to extant documents when the schema changes?
On Tue, May 28, 2013 at 3:58 PM, Jack Krupansky wrote: > The technical answer: Undefined and not guaranteed. > I was afraid of that! > Sure, you can experiment and see what the effects "happen" to be in any > given release, and maybe they don't tend to change (too much) between most > releases, but there is no guarantee that any given "change schema but keep > existing data without a delete of directory contents and full reindex" will > actually be benign or what you expect. > > As a general proposition, when it comes to changing the schema and not > deleting the directory and doing a full reindex, don't do it! Of course, we > all know not to try to walk on thin ice, but a lot of people will try to do > it anyway - and maybe it happens that most of the time the results are > benign. > In the case of this particular application, reindexing really is overly burdensome as the application is performing hundreds of writes to the index per minute. How might I gauge how much spare I/O Solr could commit to a reindex? All the data that I need is in fact in stored fields. Note that because the social media application that feeds our Solr index is global, there are no 'off hours'. > OTOH, you could file a Jira to propose that the effects of changing the > schema but keeping the existing data should be precisely defined and > documented, but, that could still change from release to release. > Seems like a lot of effort to document, for little benefit. I'm not going to file it. I would like to know, though: is the schema consulted at index time, query time, or both? > From a practical perspective for your original question: If you suddenly add > a field, there is no guarantee what will happen when you try to access that > field for existing documents, or what will happen if you "update" existing > documents. Sure, people can talk about what "happens to be true today", but > there is no guarantee for the future. Similarly for deleting a field from > the schema, there is no guarantee about the status of existing data, even > though people can chatter about "what it seems to do today." > > Generally, you should design your application around contracts and what is > guaranteed to be true, not what happens to be true from experiments or even > experience. Granted, that is the theory and sometimes you do need to rely on > experimentation and folklore and spotty or ambiguous documentation, but to > the extent possible, it is best to avoid explicitly trying to rely on > undocumented, uncontracted behavior. > Thanks. The application does change (added features) and we do not want to lose old data. > One question I asked long ago and never received an answer: what is the best > practice for doing a full reindex - is it sufficient to first do a delete of > "*:*", or does the Solr index directory contents or even the directory > itself need to be explicitly deleted first? I believe it is the latter, but > the former "seems" to work, most of the time. Deleting the directory itself > "seems" to be the best answer, to date - but no guarantees! > I don't have an answer for that, sorry! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Reindexing strategy
I see that I do need to reindex my Solr index. The index consists of 20 million documents with a few hundred new documents added per minute (social media data). The documents are mostly smaller than 1 KiB of data, but some may go as large as 10 KiB. All the data is text, and all indexed fields are stored. To reindex, I am considering adding a 'last_indexed' field, and having a Python or Java application pull out N results every T seconds when sorting on "last_indexed asc". How might I determine good values for N and T? I would like to know when the Solr index is 'overloaded', or whatever happens to Solr when it is being pushed beyond the limits of its hardware. What should I be looking at to know if Solr is over stressed? Is looking at CPU and memory good enough? Is there a way to measure I/O to the disk on which the Solr index is stored? Bear in mind that while the reindex is happening, clients will be performing searches and a few hundred documents will be written per minute. Note that the machine running Solr is an EC2 instance running on Amazon Web Services, and that the 'disk' on which the Solr index is stored is an EBS volume. Thank you. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Reindexing strategy
On Wed, May 29, 2013 at 2:41 PM, Upayavira wrote: > I presume you are running Solr on a multi-core/CPU server. If you kept a > single process hitting Solr to re-index, you'd be using just one of > those cores. It would take as long as it takes, I can't see how you > would 'overload' it that way. > I mean 'overload' Solr in the sense that it cannot read, process, and write data fast enough because too much data is being handled. I remind you that this system is writing hundreds of documents per minute. Certainly there is a limit to what Solr can handle. I ask how to know how close I am to this limit. > I guess you could have a strategy that pulls 100 documents with an old > last_indexed, and push them for re-indexing. If you get the full 100 > docs, you make a subsequent request immediately. If you get less than > 100 back, you know you're up-to-date and can wait, say, 30s before > making another request. > Actually, I would add a filter query for documents whose last_indexed value is before the last schema change, and stop when fewer documents were returned than were requested. Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
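P.S. Sketching that strategy in Java (SolrJ), the loop would look roughly like the following. The cutoff date, batch size N, pause T, core URL, and the 'last_indexed' field are assumptions that still need tuning; note the commitWithin window is kept shorter than the pause so the next batch query does not pick up the same documents again.

    import java.util.Date;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.client.solrj.util.ClientUtils;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;
    import org.apache.solr.common.SolrInputDocument;

    public class Reindexer {
        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://127.0.0.1:8983/solr/core");
            final String cutoff = "2013-05-29T00:00:00Z"; // time of the last schema change
            final int batchSize = 100;                    // N
            final long pauseMillis = 30000;               // T

            while (true) {
                SolrQuery q = new SolrQuery("*:*");
                q.addFilterQuery("last_indexed:[* TO " + cutoff + "]");
                q.setSort("last_indexed", SolrQuery.ORDER.asc);
                q.setRows(batchSize);

                QueryResponse rsp = solr.query(q);
                SolrDocumentList docs = rsp.getResults();

                for (SolrDocument d : docs) {
                    // All fields are stored, so each document can be rebuilt from the index itself.
                    SolrInputDocument in = ClientUtils.toSolrInputDocument(d);
                    in.removeField("_version_");
                    in.setField("last_indexed", new Date());
                    solr.add(in, 10000);  // commitWithin 10s, shorter than the pause below
                }

                if (docs.size() < batchSize) {
                    break;  // everything older than the cutoff has been reindexed
                }
                Thread.sleep(pauseMillis);  // throttle so searches and new documents are not starved
            }
        }
    }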
Removing a single value from a multiValue field
I have a Solr application with a multiValued field 'tags'. All fields are indexed in this application. There exists a uniqueKey field 'id' and a '_version_' field. This is running on Solr 4.x. In order to add a tag, the application retrieves the full document, creates a PHP array from the document structure, removes the '_version_' field, and then adds the appropriate tag to the 'tags' array. This is all then sent to Solr's update method via HTTP with 'overwrite=true'. Solr correctly replaces the extant document with the new document, which is identical with the exception of a new value for the '_version_' field and an additional value in the multiValued field 'tags'. This all works correctly. I am now adding a feature where one can remove tags. I am using the same business logic, however instead of adding a value to the 'tags' array I am removing one. I can confirm that the data being sent to Solr does not contain the removed tag. However, it seems that the old value for the multiValued field is persisted, that is, the old tag stays. I can see that the '_version_' field has a new value, so I see that the change was properly committed. Is there a known bug such that overwriting a document whose 'tags' field contains the values a and b with an otherwise-identical document whose 'tags' field contains only a has no effect? Can values in multiValued fields only be added, but not removed? Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Reindexing strategy
On Wed, May 29, 2013 at 5:37 PM, Shawn Heisey wrote: > It's impossible for us to give you hard numbers. You'll have to > experiment to know how fast you can reindex without killing your > servers. A basic tenet for such experimentation, and something you > hopefully already know: You'll want to get baseline measurements before > you begin testing for comparison. > Thanks. I wasn't looking for hard numbers, but rather am looking for what the signs of problems are. I know to keep my eye on memory and CPU, but I have no idea how to check disk I/O, and I'm not sure how to determine whether it has become saturated. > One of the most reliable Solr-specific indicators of pushing your > hardware too hard is that the QTime on your queries will start to > increase dramatically. Solr 4.1 and later has more granular query time > statistics in the UI - the median and 95% numbers are much more > important than the average. > Thank you, this will help. At least I now have a hard metric to see when Solr is getting overburdened (QTime). > Outside of that, if your overall IOwait CPU percentage starts getting > near (or above) 30-50%, your server is struggling. If all of your CPU > cores are staying near 100% usage, then it's REALLY struggling. > I see, thanks. > Assuming you have plenty of CPU cores, using fast storage and having > plenty of extra RAM will alleviate much of the I/O bottleneck. The > usual rule of thumb for good query performance is that you need enough > RAM to put 50-100% of your index in the OS disk cache. For blazing > performance during a rebuild, that becomes 100-200%. If you had 150%, > that would probably keep most indexes well-cached even during a rebuild. > > A rebuild will always lower performance, even with lots of RAM. > Considering that the Solr index is the only place that the data is stored, and that users are actively using the system, I was not planning on a rebuild but rather to iteratively reindex the extant documents, even as new documents are being pushed in. > My earlier reply to your other message has some other ideas that will > hopefully help. > Thank you Shawn! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: What exactly happens to extant documents when the schema changes?
On Wed, May 29, 2013 at 5:09 PM, Shawn Heisey wrote: > I handle this in a very specific way with my sharded index. This won't > work for all designs, and the precise procedure won't work for SolrCloud. > > There is a 'live' and a 'build' core for each of my shards. When I want > to reindex, the program makes a note of my current position for deletes, > reinserts, and new documents. Then I use a DIH full-import from mysql > into the build cores. Once the import is done, I run the update cycle > of deletes, reinserts, and new documents on those build cores, using the > position information noted earlier. Then I swap the cores so the new > index is online. > I do need to examine sharding and multiple cores. I'll look into that, thank you. By the way, don't google for DIH! It took me some time to figure out that it is DataImportHandler, as some people use the acronym for something completely different. > To adapt this for SolrCloud, I would need to use two collections, and > update a collection alias for what is considered live. > > To control the I/O and CPU usage, you might need some kind of throttling > in your update/rebuild application. > > I don't need any throttling in my design. Because I'm using DIH, the > import only uses a single thread for each shard on the server. I've got > RAID10 for storage and half of the CPU cores are still available for > queries, so it doesn't overwhelm the server. > > The rebuild does lower performance, so I have the other copy of the > index handle queries while the rebuild is underway. When the rebuild is > done on one copy, I run it again on the other copy. Right now I'm > half-upgraded -- one copy of my index is version 3.5.0, the other is > 4.2.1. Switching to SolrCloud with sharding and replication would > eliminate this flexibility, unless I maintained two separate clouds. > Thank you. I am not using Solr Cloud but if I ever consider it, then I will keep this in mind. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Removing a single value from a multiValue field
On Thu, May 30, 2013 at 3:42 PM, Jack Krupansky wrote: > First, you cannot do any internal editing of a multi-valued list, other > than: > > 1. Replace the entire list. > 2. Add values on to the end of the list. > Thank you. I meant that I am actually editing the entire document. Reading it, changing the values that I need, and then 'updating' it. I will look into updating only the single multiValued field. > But you can do both of those operations on a single multivalued field with > "atomic update" without reading and writing the entire document. > > Second, there is no "remove" operation in the Solr Update XML format. Only > "set" and "add". > > To simply replace the full, current value of one multi-valued field: > > <add> >   <doc> >     <field name="id">doc-id</field> >     <field name="tags" update="set">a</field> >     <field name="tags" update="set">b</field> >   </doc> > </add> > > If you simply want to append a couple of values: > > <add> >   <doc> >     <field name="id">doc-id</field> >     <field name="tags" update="add">a</field> >     <field name="tags" update="add">b</field> >   </doc> > </add> > > To empty out a multivalued field: > > <add> >   <doc> >     <field name="id">doc-id</field> >     <field name="tags" update="set" null="true"/> >   </doc> > </add> > > Thank you. I will see about translating that into the JSON format that I work with. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Removing a single value from a multiValue field
On Thu, May 30, 2013 at 5:01 PM, Jack Krupansky wrote: > You gave an XML example, so I assumed you were working with XML! > Right, I did give the output as XML. I find XML to be a great document markup language, but a terrible command format! Mostly due to (mis-)use of the attributes. > In JSON... > > [{"id": "doc-id", "tags": {"add": ["a", "b"]}}] > > and > > [{"id": "doc-id", "tags": {"set": null}}] > Thank you! That is much more intuitive and less ambiguous than the XML, would you not agree? > BTW, this kind of stuff is covered in the book, separate chapters for XML > and JSON, each with dozens of examples like this. > I have not posted on the book postings, but I will definitely order one. My vote is for spiral bound, though I know that the perfect-bound will look more professional on a bookshelf. I don't even care what the book costs, within reason. Any resource that compiles in a single package the wonderful methods that you and other contributors mention here and in other places online will pay for itself in short order. Apache Solr is an amazing product, but it is often obtuse and unintuitive. Other times one does not even know what Solr is capable of, such as the case in this thread, where I was parsing entire documents just to change a value in a multiValued field. Thank you very much! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
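P.S. For anyone who prefers a client library to raw JSON, the same "set" update is roughly the following in SolrJ. A sketch only: 'tags', 'doc-id' and the core URL are just the values from the example above.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class RemoveTagExample {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://127.0.0.1:8983/solr/core");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-id");

            // Atomic update: replace the whole multiValued field with the surviving
            // values, which is how a single tag is effectively removed.
            Map<String, Object> op = new HashMap<String, Object>();
            op.put("set", Arrays.asList("a"));
            doc.addField("tags", op);

            server.add(doc);
            server.commit();
        }
    }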
Re: Reindexing strategy
On Fri, May 31, 2013 at 3:57 AM, Michael Sokolov wrote: > On UNIX platforms, take a look at vmstat for basic I/O measurement, and > iostat for more detailed stats. One coarse measurement is the number of > blocked/waiting processes - usually this is due to I/O contention, and you > will want to look at the paging and swapping numbers - you don't want any > swapping at all. But the best single number to look at is overall disk > activity, which is the I/O percentage utilized number Shawn was mentioning. > > -Mike Great, thanks! I've got some terms to google. For those who follow in my footsteps, on Ubuntu the package 'sysstat' needs to be installed to use iostat. Here are my reference stats before starting to experiment, both for my own use later to compare and also so that if anybody sees anything amiss here then I would love to know about it. If there is any fine manual that is particularly urgent that I should read, please do mention it. Thanks! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Receiving unexpected Faceting results.
Consider the following Solr query: select?q=*:*&fq=tags:dotan-*&facet=true&facet.field=tags&rows=0 The 'tags' field is a multiValued field. I would expect the previous query to return only tags that begin with the string 'dotan-' such as: dotan-home dotan-work ...but not strings which do not begin with (or even contain) the string in question. However, I am getting results like these: the expected dotan-* tags with counts of 14 and 13, but also entries such as beatles (0) and beer (0). It _may_ be that the 'beer' and 'beatles' tags were once attached to the same documents as are attached the 'dotan-home' and/or 'dotan-work' tags. I've done a bit of experimenting on this Solr install, so I cannot be sure. However, considering that there are in fact 0 results for those two, I would not expect them to show up at all, even if they ever were attached to (i.e. once a value in the multiValued field of) any of the documents that match the filter query. So, the questions are: 1) How can I check whether the multiValued field of a particular document (given its uniqueKey id) ever contained a specific value? Alternatively, how can I see all the values that the document ever had for the field? I don't expect this to actually be possible, but I ask if it is, i.e. by examining certain aspects of the Solr index with a text editor. 2) If those spurious results are appearing, does that necessarily mean that those values of the multiValued field were in fact once in the multiValued field of documents matching the filter query? Thus, the answer to the previous question would be to simply run a query for the id of the document in question, and facet on the multiValued field with a large limit. 3) How can I have Solr return only those facet values for the field that in fact begin with 'dotan-', even if a document has other tags such as 'beatles'? 4) How can I have Solr return only those facet values which are larger than 0? Thank you! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Receiving unexpected Faceting results.
On Wed, Jun 5, 2013 at 3:38 PM, Raymond Wiker wrote: > 3) Use the parameter facet.prefix, e.g., facet.prefix=dotan-. Note: this > particular case will not work if the field you're faceting on is tokenised > (with "-" being used as a token separator). > > 4) Use the parameter facet.mincount - looks like you want to set it to 1, > instead of the default which is 0. Perfect, thank you Raymond! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
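Putting Raymond's two suggestions together with the original query, the request would look roughly like the following. It is sketched in Python only so the parameters are easy to read; whether facet.prefix works here still depends on the 'tags' field not being tokenised on '-':

    from urllib.parse import urlencode

    params = {
        "q": "*:*",
        "fq": "tags:dotan-*",
        "rows": 0,
        "facet": "true",
        "facet.field": "tags",
        "facet.prefix": "dotan-",  # only return facet values that start with 'dotan-'
        "facet.mincount": 1,       # drop facet values with a count of 0
    }
    print("select?" + urlencode(params))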
Re: Receiving unexpected Faceting results.
On Wed, Jun 5, 2013 at 3:41 PM, Brendan Grainger wrote: > Hi Dotan, > > I think all you need to do is add: > > facet.mincount=1 > > i.e. > > select?q=*:*&fq=tags:dotan-*&facet=true&facet.field=tags& > rows=0&facet.mincount=1 > > Note that you can do it per field as well: > > select?q=*:*&fq=tags:dotan-*&facet=true&facet.field=tags& > rows=0&f.tags.facet.mincount=1 > > http://wiki.apache.org/solr/SimpleFacetParameters#facet.mincount > Thanks, Brendan. I will review the available Facet Parameters, which I really should have thought to do before posting as it is already bookmarked!
Phrase matching with set union as opposed to set intersection on query terms
How would one write a query which should perform set union on the search terms (term1 OR term2 OR term3), and yet also perform phrase matching if both terms are found? I tried a few variants of the following, but in every case I am getting set intersection on the search terms: select?q={!q.op=OR}text:"term1 term2"~10 Thus, if term1 matches 10 documents and term2 matches 20 documents, then SET UNION would include all of the documents that have either term1 and/or term2. That means that between 20-30 results should be returned. Conversely, SET INTERSECTION would return only results with _both_ term1 _and_ term2, which could be between 0-10 documents. Note that in the application, users will be searching for any arbitrary number of terms, in fact they will be entering phrases. I can limit these phrases to 140 characters if needed. Thank you in advance! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Phrase matching with set union as opposed to set intersection on query terms
On Wed, Jun 5, 2013 at 6:10 PM, Shawn Heisey wrote: > On 6/5/2013 9:03 AM, Dotan Cohen wrote: >> How would one write a query which should perform set union on the >> search terms (term1 OR term2 OR term3), and yet also perform phrase >> matching if both terms are found? I tried a few variants of the >> following, but in every case I am getting set intersection on the >> search terms: >> >> select?q={!q.op=OR}text:"term1 term2"~10 > > A phrase search by definition will require all terms to be present. > Even though it is multiple terms, conceptually it is treated as a single > term. > > It sounds like what you are after is what edismax can do. If you define > the pf field in addition to the qf field, Solr will do something pretty > amazing - it will automatically construct a phrase query from a > non-phrase query and search with it against multiple fields. Done > correctly, this means that an exact match will be listed first in the > results. > > http://wiki.apache.org/solr/ExtendedDisMax#pf_.28Phrase_Fields.29 > > Thanks, > Shawn > Thank you Shawn, this pretty much does what I need it to do: select?defType=edismax&q={!q.op=OR}search_field:term1 term2&pf=search_field I'm reviewing the Edismax page now. Is there any other documentation that I should review? I have found the Edismax page at the wonderful lucidworks site, but if there is any other documentation that would help me squeeze the most out of Edismax then I would love to know about it. http://docs.lucidworks.com/display/solr/The+Extended+DisMax+Query+Parser Thank you very much! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Phrase matching with set union as opposed to set intersection on query terms
On Wed, Jun 5, 2013 at 6:23 PM, Jack Krupansky wrote: > term1 OR term2 OR "term1 term2"^2 > > term1 OR term2 OR "term1 term2"~10^2 > > The latter would rank documents with the terms nearby higher, and the > adjacent terms highest. > > term1 OR term2 OR "term1 term2"~10^2 OR "term1 term2"^20 OR "term2 term1"^20 > > To further boost adjacent terms. > > But the edismax pf/pf2/pf3 options might be good enough for you. > Thank you Jack. I suppose that I could write a script in PHP to create such a query string from an arbitrary-length phrase, but it wouldn't be pretty! Edismax does in fact meet my need, though. Thanks! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
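A rough sketch of the query-building script mentioned above, following Jack's pattern. This is illustrative only: the slop and boost values are simply the ones from his example, and a real version would need to escape Lucene special characters in the user input:

    def build_query(phrase, slop=10, near_boost=2, exact_boost=20):
        # Build 'term1 OR term2 OR "term1 term2"~slop^boost OR "term1 term2"^boost'
        # in the style described above.
        terms = phrase.split()
        clauses = list(terms)
        if len(terms) > 1:
            quoted = '"%s"' % " ".join(terms)
            clauses.append('%s~%d^%d' % (quoted, slop, near_boost))  # terms near each other
            clauses.append('%s^%d' % (quoted, exact_boost))          # exact phrase
        return " OR ".join(clauses)

    print(build_query("term1 term2"))
    # term1 OR term2 OR "term1 term2"~10^2 OR "term1 term2"^20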
Re: Phrase matching with set union as opposed to set intersection on query terms
> select?defType=edismax&q={!q.op=OR}search_field:term1 term2&pf=search_field > Is there any way to perform a fuzzy search with this method? I have tried appending "~1" to every term in the search like so: select?defType=edismax&q={!q.op=OR}search_field:term1~1%20term2~1&pf=search_field However, there are two issues: 1) It doesn't work! The results are identical to the results given when not appending "~1" to every term (or "~3"). 2) If at all possible, I would rather define the 'fuzziness' elsewhere. Right now I would have to mangle the user input in order to add the "~1" to the end of each term. Note that the ExtendedDisMax page does in fact mention that fuzziness is supported: http://wiki.apache.org/solr/ExtendedDisMax#Query_Syntax -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Phrase matching with set union as opposed to set intersection on query terms
On Wed, Jun 5, 2013 at 9:04 PM, Eustache Felenc wrote: > There is also http://wiki.apache.org/solr/SolrRelevancyCookbook with nice > examples. > Thank you. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Filtering on results with more than N words.
Is there any way to restrict the search results to only those documents with more than N words / tokens in the searched field? I thought that this would be an easy one to Google for, but I cannot figure it out or find any references. There are many references to word size in characters, but not to field size in words. Thank you. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
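One approach that should work, though it is an assumption on my part rather than something Solr offers out of the box: compute the word count client-side at index time, store it in an integer field (here called word_count, which would have to be added to the schema), and filter on it with a range query. A sketch:

    def with_word_count(doc, text_field="comment", count_field="word_count"):
        # Attach a whitespace-token count to the document before indexing it.
        doc = dict(doc)
        doc[count_field] = len(doc.get(text_field, "").split())
        return doc

    doc = with_word_count({"id": "42", "comment": "all that glitters is gold"})
    # doc["word_count"] == 5; at query time, filter with a range query such as
    #   fq=word_count:[6 TO *]   (only documents with more than 5 words)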
Re: Best way to retrieve 20 specific documents
On Tue, Nov 20, 2012 at 12:45 AM, Shawn Heisey wrote: > You can also use this query format: > > id:(123 OR 456 OR 789) > > This does get expanded internally by the query parser to the format that has > the field name on every clause, but it is sometimes easier to write code > that produces the above form. > Thank you Shawn, that is much cleaner and will be easier to debug when / if things go wrong. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
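For completeness, producing that form from a list of ids in client code is a one-liner (a sketch; 'id' is the uniqueKey field from Shawn's example):

    ids = ["123", "456", "789"]
    query = "id:(%s)" % " OR ".join(ids)
    # id:(123 OR 456 OR 789)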
Re: Error: _version_field must exist in schema
On Thu, Nov 22, 2012 at 9:26 PM, Nick Zadrozny wrote: > Belated reply, but this is probably something you should let us know about > directly at supp...@onemorecloud.com if it happens again. Cheers. > Hi Nick. This particular issue was on a Solr 4 instance on AWS, not on the Websolr account. But I commend you taking notice and taking an interest. Thank you! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: What should focus be on hardware for solr servers?
On Thu, Feb 14, 2013 at 5:54 PM, Michael Della Bitta wrote: > My dual-core, HT-enabled Dell Latitude from last year has this CPU: > model name : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz > bogomips: 4988.65 > > An m3.xlarge reports: > model name : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz > bogomips : 4000.14 > > I tried running geekbench and phoronix-test-suite and failed at both... > Anybody have a favorite, free, CLI benchmarking suite? > I'll suggest to the Phoronix team that they include some Solr tests in their suite. Solr does seem to be a perfect test for Phoronix, and much more relevant for some readers than John the Ripper or Quake. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Solr 4 Spatial: NoClassDefFoundError: com/vividsolutions/jts/geom/Geometry
Note that the issue is present in Solr 4.1 as well. I did find this post, which is not very encouraging: http://grokbase.com/t/lucene/solr-user/128sz03jdk/recursiveprefixtreestrategy-class-not-found Might the name of the class be simply a typo that is easily rectified? How might one go about checking which classes are available? Thank you. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Solr 4 Spatial: NoClassDefFoundError: com/vividsolutions/jts/geom/Geometry
On Wed, Feb 27, 2013 at 10:24 AM, Smiley, David W. wrote: > Dotan, > > http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4#Configuration > You need to put the JTS jar within Solr's WEB-INF/lib; unfortunately you > can't simply reference it via a <lib> entry and put it wherever. FWIW you > can find the same question and my response on Stackoverflow. > > ~ David > Thank you David. In fact I do frequent Stack Overflow. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Can't search words in quotes
On Thu, Feb 28, 2013 at 8:14 AM, Alex Cougarman wrote: > Thanks, Oussama. That was very useful information and we have added the > double quotes. One interesting trick: we had to change the way we did it to > wrap the pattern value in single quotes so we could have double quotes inside. > Hi Alex. Would you mind posting the new analyzers? -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Returning to Solr 4.0 from 4.1
Solr 4.1 has been giving me much trouble, rejecting documents on indexing. While I try to work my way through this, I would like to move our application back to Solr 4.0. However, now when I try to start Solr with the same index that was created with Solr 4.0 but has been running on 4.1 for a few days, I get this error chain: org.apache.solr.common.SolrException: Error opening new searcher Caused by: org.apache.solr.common.SolrException: Error opening new searcher Caused by: java.lang.IllegalArgumentException: A SPI class of type org.apache.lucene.codecs.Codec with name 'Lucene41' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath. The current classpath supports the following names: [Lucene40, Lucene3x] Obviously I'll not be installing Lucene41 in Solr 4.0, but is there any way to work around this? Note that neither solrconfig.xml nor schema.xml have changed. Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Returning to Solr 4.0 from 4.1
On Fri, Mar 1, 2013 at 11:28 AM, Rafał Kuć wrote: > Hello! > > I suppose the only way to make this work will be reindexing the data. > Solr 4.1 uses Lucene 4.1 as you know, which introduced new default > codec with stored fields compression and this is one of the reasons > you can't read that index with 4.0. > Thank you. My first inclination is to "reindex" the documents, but the only store of these documents is the Solr index itself. I am trying to find solutions to create a new core and to index the data in the old core into the new core. I'm not finding any good ways of going about this. Note that we are talking about ~18,000,000 (yes, 18 million) small documents similar to 'tweets' (mostly under 1 KiB each, very very few over 5 KiB). -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Returning to Solr 4.0 from 4.1
On Fri, Mar 1, 2013 at 11:59 AM, Rafał Kuć wrote: > Hello! > > I assumed that re-indexing can be painful in your case, if it wouldn't > you probably would re-index by now :) I guess (didn't test it myself), > that you can create another collection inside your cluster, use the > old codec for Lucene 4.0 (setting the version in solrconfig.xml should > be enough) and re-indexing, but still re-indexing will have to be > done. Or maybe someone knows a better way ? > Will I have to reindex via an external bridging script, such as a Python script which requests N documents at a time, indexes them into Solr 4.1, then requests the next N documents to index? Or is there an internal Solr / Lucene facility for this? I have actually looked for such a facility, but as I am unable to find one, I ask. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
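A minimal sketch of that kind of bridging script, assuming every field needed to rebuild a document is stored and that plain start/rows paging is acceptable. The core URLs, sort field, and page size are placeholders, and deep paging this way gets slow on very large result sets:

    import json
    import urllib.request
    from urllib.parse import urlencode

    OLD = "http://localhost:8983/solr/old_core"  # placeholder source core URL
    NEW = "http://localhost:8983/solr/new_core"  # placeholder destination core URL
    ROWS = 500                                   # documents fetched per request

    def fetch(start):
        # Page through the old core; requires that all needed fields are stored.
        params = urlencode({"q": "*:*", "sort": "id asc", "start": start,
                            "rows": ROWS, "wt": "json"})
        with urllib.request.urlopen("%s/select?%s" % (OLD, params)) as resp:
            return json.load(resp)["response"]["docs"]

    def index(docs):
        req = urllib.request.Request("%s/update?commit=true" % NEW,
                                     data=json.dumps(docs).encode("utf-8"),
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()

    start = 0
    while True:
        docs = fetch(start)
        if not docs:
            break
        for doc in docs:
            doc.pop("_version_", None)  # do not re-send Solr's internal field
        index(docs)
        start += ROWS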
Re: Returning to Solr 4.0 from 4.1
On Fri, Mar 1, 2013 at 12:22 PM, Rafał Kuć wrote: > Hello! > > As far as I know you have to re-index using external tool. > Thank you Rafał. That is what I figured. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Returning to Solr 4.0 from 4.1
On Fri, Mar 1, 2013 at 1:37 PM, Upayavira wrote: > Can you use a checkout from SVN? Does that resolve your issues? That is > what will become 4.2 when it is released soon: > > https://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/ > > Upayavira > Thank you. Which feature of 4.2 are you suggesting for this issue? Can Solr 4.2 natively import from a Solr index? -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Returning to Solr 4.0 from 4.1
On Sat, Mar 2, 2013 at 9:32 PM, Upayavira wrote: > What I'm questioning is whether the issue you see in 4.1 has been > resolved in Subversion. While I would not expect 4.0 to read a 4.1 > index, the SVN branch/4.2 should be able to do so effortlessly. > > Upayavira > I see, thanks. Actually, running a clean 4.1 with no previous index does not have the issues. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Get results only from the last hour
On Mon, Aug 20, 2012 at 3:00 PM, Markus Jelsma wrote: > Date queries are described here: http://wiki.apache.org/solr/SolrQuerySyntax > Terrific, thank you! > You must first make sure your dates end up in a Date fieldType and are in the > proper format. > Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
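For reference, once the dates are in a proper date field, "last hour" becomes a single range filter using Solr date math. The 'added' field name below is a placeholder and must be a Solr date field:

    from urllib.parse import urlencode

    # Only documents whose date field falls within the last hour.
    params = {"q": "*:*", "fq": "added:[NOW-1HOUR TO NOW]"}
    print("select?" + urlencode(params))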
Re: Faceting Facets
On Mon, Sep 3, 2012 at 5:50 PM, Alexey Serba wrote: > http://wiki.apache.org/solr/SimpleFacetParameters#Pivot_.28ie_Decision_Tree.29_Faceting > Thank you, that does seem to be only available on Solr 4.0. Luckily, we're using Websolr so upgrading is rather easy! Thanks! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Faceting Facets
On Mon, Sep 3, 2012 at 6:07 PM, Tanguy Moal wrote: > I think it's not possible to combine pivots with facet queries, nor with > facet ranges (or facet dates), please someone correct me if I'm wrong... > > I think only "standard" fields are "pivotable" :) > > That said, if you always use the same ranges for your DateTime field, you > *could* have a "string" version of the time field that only outputs the > hour of the day of the date contained in your time field, and then you'll > be able to use facet.pivots with those two text fields. > > You could still use the original date time field to constrain the results > set to return docs within the last 24 hours... > > Would that make sense to you ? > I think that I understand you! Actually, the DateTime is currently being stored as a UNIX timestamp for compatibility with other software. I had planned on converting it all over to the internal Solr Datetime type, but I now see that I should leave it as a timestamp. Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
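A sketch of the index-time bucketing Tanguy suggests: deriving a string field truncated to the hour from the UNIX timestamp so that it can participate in facet.pivot. The field names here are placeholders:

    from datetime import datetime, timezone

    def add_hour_bucket(doc, ts_field="timestamp", bucket_field="added_hour"):
        # Derive a string field truncated to the hour from a UNIX timestamp,
        # suitable for facet.pivot alongside other plain fields.
        dt = datetime.fromtimestamp(doc[ts_field], tz=timezone.utc)
        doc[bucket_field] = dt.strftime("%Y-%m-%dT%H:00:00Z")
        return doc

    doc = add_hour_bucket({"id": "42", "timestamp": 1346684567})
    # doc["added_hour"] == "2012-09-03T15:00:00Z"
    # The original timestamp field can still be used in an fq to limit
    # results to the last 24 hours.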
Re: Unexpected results in Solr 4 Pivot Faceting
On Fri, Sep 7, 2012 at 12:23 PM, Erik Hatcher wrote: > Pivot facets currently only work with individual terms, not ranges. > > The response you provided does look odd in that there are duplicate > timestamps listed, but pivots were only implemented for textual (string being > the most common type) fields initially. > I see, thanks. Other than creating an additional rounded-off timestamp field, are there any other solutions? Might ranges work if instead of a timestamp we used a real DateTime field? In any case, in order to pivot on the timestamp, will I have to change its type to string? -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Unexpected results in Solr 4 Pivot Faceting
On Fri, Sep 7, 2012 at 4:05 PM, Erik Hatcher wrote: > Ranges won't work at all pivots are purely by individual term currently. > > If you want to pivot by ranges, and you can define those ranges during > indexing, then you could make a field that represented which range each > document is in. > > doc: > id: 1234 > category: History > date_range_buckets: 2004/March->June > > or something like that. Then you could pivot on category and > date_range_buckets. It's a hacky workaround, but might just be sufficient > for some cases. > Thanks. As there are other applications using the index I was hoping to avoid adding a redundant work-around field. But it looks like the best solution. Just to be clear, as I'm not logged onto the dev server at the moment but it was implied in an earlier mail: Any field that is to be pivoted on needs to be a string field? Is that documented, as I cannot find that in the docs. Thanks! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Unexpected results in Solr 4 Pivot Faceting
On Fri, Sep 7, 2012 at 4:39 PM, Erik Hatcher wrote: >> Just to be clear, as I'm not logged onto the dev server at the moment >> but it was implied in an earlier mail: Any field that is to be pivoted >> on needs to be a string field? Is that documented, as I cannot find >> that in the docs. > > No, it doesn't need to be a string field but whatever terms come out of > the analysis process are what gets faceted upon. If it was a "text" field, > each word in the field would be a facet value. A "trie" field probably > doesn't work properly, as it indexes multiple terms per value and you'd get > odd values. Pivot faceting was initially implemented only with textual > terms in mind, and string is generally the desired type. > Thanks for the insight. I'll see how much time for experimentation I might afford. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Unexpected results in Solr 4 Pivot Faceting
On Fri, Sep 7, 2012 at 5:04 PM, Yonik Seeley wrote: > On Fri, Sep 7, 2012 at 9:39 AM, Erik Hatcher wrote: >> A "trie" field probably doesn't work properly, as it indexes multiple terms >> per value and you'd get odd values. > > I don't know about pivot faceting, but all of the other types of > faceting take this into account (hence faceting works fine on trie > fields). > Thanks. I am not familiar with the trie field, but I'll look into it. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Return only matched multiValued field
Assuming a multivalued, stored and indexed field with name "comment". When performing a search, I would like to return only the values of "comment" which contain the match. For example, when searching for "gold", instead of getting this result: <arr name="comment"> <str>Theres a lady whos sure</str> <str>all that glitters is gold</str> <str>and shes buying a stairway to heaven</str> </arr> I would prefer to get this result: <arr name="comment"> <str>all that glitters is gold</str> </arr> (pseudo-XML from memory, may not be accurate but illustrates the point) Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Solr unique key can't be blank
On Wed, Sep 12, 2012 at 5:27 PM, Ahmet Arslan wrote: > Hi Dotan, > > Did you define the following update processor chain in solrconfig.xml? > And did you reference it in an update handler? > > [updateRequestProcessorChain XML stripped by the list archive; only the field name "id" survives] > Thank you Ahmet! In fact, I did not know that the updateRequestProcessorChain needed to be defined in solrconfig.xml and I had tried to define it in schema.xml. I don't have access to solrconfig.xml (I am using Websolr) but I will contact them about adding it. Thank you. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Cannot insert text into solr.StrField field
On Fri, Sep 14, 2012 at 1:00 AM, Jack Krupansky wrote: > Did you check the log file? > > How are you adding data to Solr? Show us the actual input document or code. > The Solr instance is on Websolr, so I cannot check the log file directly. I will put in a feature request for that, though. I am adding the documents with Solr-PHP-Client. In fact, prefixing the variable with an (int) cast does resolve the issue I had found. This looks like an issue with PHP being weakly typed. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Return only matched multiValued field
Assuming a multivalued, stored and indexed field with name "comment". When performing a search, I would like to return only the values of "comment" which contain the match. For example, when searching for "gold", instead of getting this result: <arr name="comment"> <str>Theres a lady whos sure</str> <str>all that glitters is gold</str> <str>and shes buying a stairway to heaven</str> </arr> I would prefer to get this result: <arr name="comment"> <str>all that glitters is gold</str> </arr> (pseudo-XML from memory, may not be accurate but illustrates the point) Is there any way to do this with a Solr 4 index? The client accessing Solr is on a dial-up connection (no provision for DSL or other high-speed internet) so I'd like to move as little data over the wire as possible. In reality, the array will have tens of values, so returning only the relevant ones may reduce the data transferred by an order of magnitude. Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Return only matched multiValued field
> <field name="doctest" ... indexed="true" multiValued="true" /> > Note that in anonymizing the information, I introduced a typo. The above "doctest" should be "doctext". In any case, the field names in the production application and in the production schema do in fact match! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Return only matched multiValued field
On Mon, Sep 24, 2012 at 2:16 PM, Erick Erickson wrote: > Hmmm, works for me. What is your entire response packet? > > And you've covered the bases with indexed and stored so this > seems like it _should_ work. > I'm sorry, reducing the output to rows=1 helped me notice that the highlighted sections come after the main results. The highlighting feature works as expected. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Return only matched multiValued field
On Mon, Sep 24, 2012 at 9:47 AM, Mikhail Khludnev wrote: > Hi > It seems like highlighting feature. Thank you Mikhail. I actually do need the entire matched single entry, not a snippet of it. Looking at the example in the OP, with highlighting on "gold" I would get "glitters is gold", whereas I need "all that glitters is gold". Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
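If the whole matched value is what is needed rather than a snippet, the highlighter can usually be told not to fragment at all: setting hl.fragsize=0 should make it return the entire field value (worth verifying on the version in use). A sketch of the parameters, reusing the 'comment' field from the example:

    from urllib.parse import urlencode

    params = {
        "q": "comment:gold",
        "fl": "id",            # keep the full stored values out of the main response
        "hl": "true",
        "hl.fl": "comment",
        "hl.fragsize": 0,      # 0 = do not fragment: return the whole field value
        "hl.snippets": 10,     # allow several matching values per document
    }
    print("select?" + urlencode(params))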
How does Solr know which relative paths to use?
I have just installed Solr 4.0 on a test server. I start it like so:

$ pwd
/some/dir
$ java -jar start.jar

The Solr instance now looks like this:

CWD /some/dir
Instance /some/dir/solr/collection1
Data /some/dir/solr/collection1/data
Index /some/dir/solr/collection1/data/index

Where did the additional relative paths 'collection1', 'collection1/data', and 'collection1/data/index' come from? I know that I can change the value of CWD with the -Dsolr.solr.home flag, but what affects the relative paths mentioned? Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: How does Solr know which relative paths to use?
On Wed, Oct 17, 2012 at 12:16 AM, P Williams wrote: > Hi Dotan, > > It seems that the examples now use Multiple > Cores<http://wiki.apache.org/solr/CoreAdmin>by default. If your test > server is based on the stock example, you should > see a solr.xml file in your CWD path which is how Solr knows about the > relative paths. There should also be a README.txt file that will tell you > more about how the directory is expected to be organized. > > Cheers, > Tricia > Thanks. I read the top-level README.txt but now I see that the answer is in the solr/README.txt file. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Error: _version_field must exist in schema
On Thu, Oct 18, 2012 at 12:09 AM, Rafał Kuć wrote: > Hello! > > The _version_ field is needed by some of Solr 4.0 functionality like > transaction log or partial documents update. If you want to use them, > just update your schema.xml and put the _version_ field definition > there. > > However if you don't want those, you can remove the transaction log > configuration in your solrconfig.xml. However please remember that > when using SolrCloud you'll need that field. > Thanks. Where is that bit documented? I don't see it on the Solr wiki: http://wiki.apache.org/solr/SchemaXml I do have a Solr 4 Beta index running on Websolr that does not have such a field. It works, but throws many "Service Unavailable" and "Communication Error" errors. Might the lack of the _version_ field be the reason? Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Error: _version_field must exist in schema
On Thu, Oct 18, 2012 at 12:25 AM, Rafał Kuć wrote: > Hello! > > You can find some information about the requirements of SolrCloud at > http://wiki.apache.org/solr/SolrCloud . I don't know if _version_ is > mentioned elsewhere. > > As for Websolr - I'm afraid I can't say anything about the cause of > those errors without seeing the exception. > I see, thanks. I don't think that I'm using the SolrCloud feature. Is it enabled because there exist both "solr/collection1" and "multicore/core0"? -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Error: _version_field must exist in schema
On Thu, Oct 18, 2012 at 9:21 AM, Rafał Kuć wrote: > Hello! > > Look at your solrconfig.xml file, you should see something like that: > > <updateLog> <str name="dir">${solr.data.dir:}</str> </updateLog> > > Just remove it and Solr shouldn't bother you with the version field > information. However remember that some features won't work (like the > real time get or partial documents update). > Thank you. Is there any place where this is documented? It certainly does not appear in the relevant wiki page: http://wiki.apache.org/solr/SolrConfigXml > You can also add _version_ field to your schema and forget about it. > You don't need to do anything with it as it is used internally by > Solr. > That is exactly my plan, but I would also like to understand more about what is going on. I don't like cut-and-paste programming. Thank you very much! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Error: _version_field must exist in schema
On Thu, Oct 18, 2012 at 1:06 PM, Erick Erickson wrote: > I've updated the schema.xml page, see > http://wiki.apache.org/solr/SchemaXml#Recommended_fields > Great, thanks! > Care to change the schema.xml file to warn about this too and > submit a patch? > If you are referring to the example schema.xml file provided with Solr, then I'd love to. I'm signing up for the dev list now. Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
When Solr is slow, I'm seeing these in the logs: [collection1] Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later. [collection1] PERFORMANCE WARNING: Overlapping onDeckSearchers=2 Googling, I found this in the FAQ: "Typically the way to avoid this error is to either reduce the frequency of commits, or reduce the amount of warming a searcher does while it's on deck (by reducing the work in newSearcher listeners, and/or reducing the autowarmCount on your caches)" http://wiki.apache.org/solr/FAQ#What_does_.22PERFORMANCE_WARNING:_Overlapping_onDeckSearchers.3DX.22_mean_in_my_logs.3F I happen to know that the script will try to commit once every 60 seconds. How does one "reduce the work in newSearcher listeners"? What effect will this have? What effect will reducing the autowarmCount on caches have? Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
On Mon, Oct 22, 2012 at 5:02 PM, Rafał Kuć wrote: > Hello! > > You can check if the long warming is causing the overlapping > searchers. Check Solr admin panel and look at cache statistics, there > should be warmupTime property. > Thank you, I have gone over the Solr admin panel twice and I cannot find the cache statistics. Where are they? > Lowering the autowarmCount should lower the time needed to warm up, > howere you can also look at your warming queries (if you have such) > and see how long they take. > Thank you, I will look at that! -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
On Mon, Oct 22, 2012 at 5:27 PM, Mark Miller wrote: > Are you using Solr 3X? The occasional long commit should no longer > show up in Solr 4. > Thank you Mark. In fact, this is the production release of Solr 4. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
On Mon, Oct 22, 2012 at 7:29 PM, Shawn Heisey wrote: > On 10/22/2012 9:58 AM, Dotan Cohen wrote: >> >> Thank you, I have gone over the Solr admin panel twice and I cannot find >> the cache statistics. Where are they? > > > If you are running Solr4, you can see individual cache autowarming times > here, assuming your core is named collection1: > > http://server:port/solr/#/collection1/plugins/cache?entry=queryResultCache > http://server:port/solr/#/collection1/plugins/cache?entry=filterCache > > The warmup time for the entire searcher can be found here: > > http://server:port/solr/#/collection1/plugins/core?entry=searcher > > Thank you Shawn! I can see how I missed that data. I'm reviewing it now. Solr has a low barrier to entry, but quite a learning curve. I'm loving it! I see that the server is using less than 2 GiB of memory, whereas it is a dedicated Solr server with 16 GiB of memory. I understand that I can increase the query and document caches to increase performance, but I worry that this will increase the warm-up time to unacceptable levels. What is a good strategy for increasing the caches yet preserving performance after an optimize operation? Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
On Mon, Oct 22, 2012 at 9:22 PM, Mark Miller wrote: > Perhaps you can grab a snapshot of the stack traces when the 60 second > delay is occurring? > > You can get the stack traces right in the admin ui, or you can use > another tool (jconsole, visualvm, jstack cmd line, etc) > Thanks. I've refactored so that the index is optimized once per hour, instead of after each dump of commits. But when I need to increase the optimize frequency in the future, I will go through the stack traces. Thanks! In any case, the server has an extra 14 GiB of memory available; how might I make the best use of that for Solr, assuming both heavy reads and writes? Thanks. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
On Mon, Oct 22, 2012 at 10:01 PM, Walter Underwood wrote: > First, stop optimizing. You do not need to manually force merges. The system > does a great job. Forcing merges (optimize) uses a lot of CPU and disk IO and > might be the cause of your problem. > Thanks. Looking at the index statistics, I see that within minutes after running optimize, the stats say the index needs to be reoptimized. Though, the index still reads and writes fine even in that state. > Second, the OS will use the "extra" memory for file buffers, which really > helps performance, so you might not need to do anything. This will work > better after you stop forcing merges. A forced merge replaces every file, so > the OS needs to reload everything into file buffers. > I don't see that the memory is being used:

$ free -g
                   total     used     free   shared  buffers   cached
Mem:                  14        2       12        0        0        1
-/+ buffers/cache:              0       14
Swap:                  0        0        0

-- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
On Mon, Oct 22, 2012 at 10:44 PM, Walter Underwood wrote: > Lucene already did that: > > https://issues.apache.org/jira/browse/LUCENE-3454 > > Here is the Solr issue: > > https://issues.apache.org/jira/browse/SOLR-3141 > > People over-use this regardless of the name. In Ultraseek Server, it was > called "force merge" and we had to tell people to stop doing that nearly > every month. > Thank you for those links. I commented on the Solr bug. There are some very insightful comments in there. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
On Tue, Oct 23, 2012 at 3:52 AM, Shawn Heisey wrote: > As soon as you make any change at all to an index, it's no longer > "optimized." Delete one document, add one document, anything. Most of the > time you will not see a performance increase from optimizing an index that > consists of one large segment and a bunch of very tiny segments or deleted > documents. > I've since realized that by experimentation. I've probably saved quite a few minutes of reading time by investing hours of experiment time! > How big is your index, and did you run this right after a reboot? If you > did, then the cache will be fairly empty, and Solr has only read enough from > the index files to open the searcher. The number is probably too small to > show up on a gigabyte scale. As you issue queries, the cached amount will > get bigger. If your index is small enough to fit in the 14GB of free RAM > that you have, you can manually populate the disk cache by going to your > index directory and doing 'cat * > /dev/null' from the commandline or a > script. The first time you do it, it may go slowly, but if you immediately > do it again, it will complete VERY fast -- the data will all be in RAM. > The cat trick to get the files in RAM is great. I would not have thought that would work for binary files. The index is small, much less than the available RAM, for the time being. Therefore, there was nothing to fill it with, I now understand. Both 'free' outputs were after the system had been running for some time. > The 'free -m' command in your first email shows cache usage of 1243MB, which > suggests that maybe your index is considerably smaller than your available > RAM. Having loads of free RAM is a good thing for just about any workload, > but especially for Solr. Try running the free command without the -g so you > can see those numbers in kilobytes. > > I have seen a tendency towards creating huge caches in Solr because people > have lots of memory. It's important to realize that the OS is far better at > the overall job of caching the index files than Solr itself is. Solr caches > are meant to cache result sets from queries and filters, not large sections > of the actual index contents. Make the caches big enough that you see some > benefit, but not big enough to suck up all your RAM. > I see, thanks. > If you are having warm time problems, make the autowarm counts low. I have > run into problems with warming on my filter cache, because we have filters > that are extremely hairy and slow to run. I had to reduce my autowarm count > on the filter cache to FOUR, with a cache size of 512. When it is 8 or > higher, it can take over a minute to autowarm. > I will have to experiment with the warming. Thank you for the tips. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
On Tue, Oct 23, 2012 at 3:07 PM, Erick Erickson wrote: > Maybe you've been looking at it but one thing that I didn't see on a fast > scan was that maybe the commit bit is the problem. When you commit, > eventually the segments will be merged and a new searcher will be opened > (this is true even if you're NOT optimizing). So you're effectively committing > every 1-2 seconds, creating many segments which get merged, but more > importantly opening new searchers (which you are getting since you pasted > the message: Overlapping onDeckSearchers=2). > > You could pinpoint this by NOT committing explicitly, just set your autocommit > parameters (or specify commitWithin in your indexing program, which is > preferred). Try setting it at a minute or so and see if your problem goes away > perhaps? > > The NRT stuff happens on soft commits, so you have that option to have the > documents immediately available for search. > Thanks, Erick. I'll play around with different configurations. So far just removing the periodic optimize command worked wonders. I'll see how much it helps or hurts to run that daily or more or less frequent. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
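A sketch of what the commitWithin suggestion looks like from the indexing side: the commit deadline rides along with each update request instead of the client issuing explicit commit or optimize calls. The URL and field names are placeholders, and 60000 ms matches the roughly once-a-minute target mentioned earlier in the thread:

    import json
    import urllib.request

    def index_docs(docs, commit_within_ms=60000):
        # Send documents and let Solr commit them within the given window,
        # instead of the client issuing explicit commit/optimize calls.
        url = "http://localhost:8983/solr/update?commitWithin=%d" % commit_within_ms
        req = urllib.request.Request(url,
                                     data=json.dumps(docs).encode("utf-8"),
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()

    index_docs([{"id": "42", "comment": "all that glitters is gold"}])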
Re: Occasional Solr performance issues
On Wed, Oct 24, 2012 at 4:33 PM, Walter Underwood wrote: > Please consider never running "optimize". That should be called "force merge". > Thanks. I have been letting the system run for about two days already without an optimize. I will let it run a week, then merge to see the effect. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
I spoke too soon! Whereas three days ago when the index was new 500 records could be written to it in under 3 seconds, now that operation is taking a minute and a half, sometimes longer. I ran optimize() but that did not help the writes. What can I do to improve the write performance? Even opening the Logging tab of the Solr instance is taking quite a long time. In fact, I just left it for 20 minutes and it still hasn't come back with anything. I do have an SSH window open on the server hosting Solr and it doesn't look overloaded at all:

$ date && du -sh data/ && uptime && free -m
Fri Oct 26 13:15:59 UTC 2012
578M    data/
 13:15:59 up 4 days, 17:59, 1 user, load average: 0.06, 0.12, 0.22
                   total     used     free   shared  buffers   cached
Mem:               14980     3237    11743        0      284
-/+ buffers/cache:           729    14250
Swap:                  0        0        0

-- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
On Fri, Oct 26, 2012 at 4:02 PM, Shawn Heisey wrote: > > Taking all the information I've seen so far, my bet is on either cache > warming or heap/GC trouble as the source of your problem. It's now specific > information gathering time. Can you gather all the following information > and put it into a web paste page, such as pastie.org, and reply with the > link? I have gathered the same information from my test server and created > a pastie example. http://pastie.org/5118979 > > On the dashboard of the GUI, it lists all the jvm arguments. Include those. > > Click Java Properties and gather the "java.runtime.version" and > "java.specification.vendor" information. > > After one of the long update times, pause/stop your indexing application. > Click on your core in the GUI, open Plugins/Stats, and paste the following > bits with a header to indicate what each section is: > CACHE->filterCache > CACHE->queryResultCache > CORE->searcher > > Thanks, > Shawn > Thank you Shawn. The information is here: http://pastebin.com/aqEfeYVA -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
Re: Occasional Solr performance issues
On Fri, Oct 26, 2012 at 11:04 PM, Shawn Heisey wrote: > Warming doesn't seem to be a problem here -- all your warm times are zero, > so I am going to take a guess that it may be a heap/GC issue. I would > recommend starting with the following additional arguments to your JVM. > Since I have no idea how solr gets started on your server, I don't know > where you would add these: > > -Xmx4096M -Xms4096M -XX:NewRatio=1 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC > -XX:+CMSParallelRemarkEnabled > Thanks. I've added those flags to the line that I use to start Solr. Those are Java flags, not Solr flags, correct? I'm googling the flags now, but I find it interesting that I cannot find a canonical reference for them. > This allocates 4GB of RAM to java, sets up a larger than normal Eden space > in the heap, and uses garbage collection options that usually fare better in > a server environment than the default. Java memory management options are > like religion to some people ... I may start a flamewar with these > recommendations. ;) The best I can tell you about these choices: They made > a big difference for me. > Thanks. I will experiment with them empirically. The first step is to learn to read the debug info, though. I've been googling for days, but I must be missing something. Where is the information that I pasted in pastebin documented? > I would also recommend switching to a Sun/Oracle jvm. I have heard that > previous versions of Solr were not happy on variants like OpenJDK, I have no > idea whether that might still be the case with 4.0. If you choose to do > this, you probably have package choices in Ubuntu. I know that in Debian, > the package is called sun-java6-jre ... Ubuntu is probably something > similar. Debian has a CLI command 'update-java-alternatives' that will > quickly switch between different java implementations that are installed. > Hopefully Ubuntu also has this. If not, you might need the following > command instead to switch the main java executable: > > update-alternatives --config java > Thanks, I will take a look at the current Oracle JVM. -- Dotan Cohen http://gibberish.co.il http://what-is-what.com
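Assuming Solr is still being started with the bundled jetty start.jar as earlier in this thread, those arguments would go on that java command line itself, for example: java -Xmx4096M -Xms4096M -XX:NewRatio=1 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -jar start.jar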