Re: joins and filter queries affecting scoring

2011-10-28 Thread Martijn v Groningen
Have you tried using the join in the fq instead of the q?
Like this (assuming user_id_i is a field in the post document type and
self_id_i a field in the user document type):
q=posts_text:"hello"&fq={!join from=self_id_i
to=user_id_i}is_active_boolean:true

In this example the fq produces a docset that contains all user
documents that are active. This docset is used as a filter during the
execution of the main query (q param),
so it only returns posts that contain the text "hello" for active users.
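
(As a hedged sketch only: for the earlier question in this thread, where the
ranking should come from the query on the user documents alone, the same
pattern in the other direction might look like the following. Field names are
taken from the thread; the exact query is untested.)

  q=data_text:"test"
  &fq=is_active_boolean:true
  &fq={!join from=user_id_i to=self_id_i}posts_text:"hello"

Both conditions are filter queries here, so only data_text:"test" contributes
to the score.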

Martijn

On 28 October 2011 01:57, Jason Toy  wrote:
> Does anyone have any idea on this issue?
>
> On Tue, Oct 25, 2011 at 11:40 AM, Jason Toy  wrote:
>
>> Hi Yonik,
>>
>> Without a Join I would normally query user docs with:
>> q=data_text:"test"&fq=is_active_boolean:true
>>
>> When joining users with posts, I get no results:
>> q={!join from=self_id_i
>> to=user_id_i}data_text:"test"&fq=is_active_boolean:true&fq=posts_text:"hello"
>>
>>
>>
>> I am able to use this query, but it gives me the results in an order that I
>> don't want (nor do I understand its order):
>> q={!join from=self_id_i to=user_id_i}data_text:"test" AND
>> is_active_boolean:true&fq=posts_text:"hello"
>>
>> I want the order to be the same as I would get from my original
>> "q=data_text:"test"&fq=is_active_boolean:true", but with the ability to join
>> with the Posts docs.
>>
>>
>>
>>
>>
>> On Tue, Oct 25, 2011 at 11:30 AM, Yonik Seeley wrote:
>>
>>> Can you give an example of the request (URL) you are sending to Solr?
>>>
>>> -Yonik
>>> http://www.lucidimagination.com
>>>
>>>
>>>
>>> On Mon, Oct 24, 2011 at 3:31 PM, Jason Toy  wrote:
>>> > I have 2 types of docs, users and posts.
>>> > I want to view all the docs that belong to certain users by joining
>>> posts
>>> > and users together.  I have to filter the users with a filter query of
>>> > "is_active_boolean:true" so that the score is not effected,but since I
>>> do a
>>> > join, I have to move the filter query to the query parameter so that I
>>> can
>>> > get the filter applied. The problem is that since the is_active_boolean
>>> is
>>> > moved to the query, the score is affected which returns back an order
>>> that I
>>> > don't want.
>>> >  If I leave the is_active_boolean:true in the fq paramater, I get no
>>> > results back.
>>> >
>>> > My question is how can I apply a filter query to users so that the score
>>> is
>>> > not affected?
>>> >
>>>
>>
>>
>>
>> --
>> - sent from my mobile
>>
>>
>>
>
>
> --
> - sent from my mobile
>



-- 
Kind regards,

Martijn van Groningen


Always return total number of documents

2011-10-28 Thread Robert Brown
Currently I'm making 2 calls to Solr to be able to state "matched 20 
out of 200 documents".


Is there no way to return the total number of docs as part of a 
search?



--

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com



Re: Always return total number of documents

2011-10-28 Thread Michael Kuhlmann
On 28.10.2011 11:16, Robert Brown wrote:
> Is there no way to return the total number of docs as part of a search?

No, there isn't. Usually this information is of absolutely no value to the
end user.

A workaround would be to add some field to the schema that has the same
value for every document, and use this for faceting.

Greetings,
Kuli
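
(A hedged sketch of that workaround, with a hypothetical constant-valued
field named doc_marker. Facet counts are computed against the documents
matching q and fq, so to see the full index count in the same request the
restricting query has to sit in a tagged fq that the facet excludes; also,
only documents indexed after the field is added will carry the value.)

  In schema.xml:
    <field name="doc_marker" type="string" indexed="true" stored="false" default="all"/>

  Query:
    q=*:*&fq={!tag=uq}<your query>&rows=10&facet=true&facet.field={!ex=uq}doc_marker

numFound then gives the matched count, while the facet count for doc_marker's
single value gives the total number of documents. Note that with q=*:* the
results are no longer relevance-ranked.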


Re: Always return total number of documents

2011-10-28 Thread Robert Brown
Cheers Kuli,

This is actually of huge importance to our customers, to see how many
documents we store.

The faceting option sounds a bit messy; maybe we'll have to stick with
2 queries.


---

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com

On Fri, 28 Oct 2011 11:43:11 +0200, Michael Kuhlmann 
wrote:
> On 28.10.2011 11:16, Robert Brown wrote:
>> Is there no way to return the total number of docs as part of a search?
> 
> No, it isn't. Usually this information is of absolutely no value to the
> end user.
> 
> A workaround would be to add some field to the schema that has the same
> value for every document, and use this for facetting.
> 
> Greetings,
> Kuli



Re: Too many values for UnInvertedField faceting on field autocompleteField

2011-10-28 Thread Torsten Krah
On Wednesday, 26.10.2011, at 08:02 -0400, Yonik Seeley wrote:
> You can also try adding facet.method=enum directly to your request

Added 

  query.set("facet.method", "enum");

to my Solr query at code level and now it works. I don't know why the
handler stuff gets ignored or overridden, but it's OK for my use case to
specify it at query level.
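
(For reference, a hedged sketch of how such a default is normally declared on
a request handler in solrconfig.xml; the handler name is illustrative:)

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="facet.method">enum</str>
    </lst>
  </requestHandler>

Request-level parameters override these defaults, so anything that sets
facet.method on the request itself would win over the handler configuration.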

thx

Torsten




Solr 3.4 group.truncate does not work with facet queries

2011-10-28 Thread Ian Grainger
Hi, I'm using grouping with group.truncate=true. The following simple facet
query:

facet.query=Monitor_id:[38 TO 40]

Doesn't give the same number as the nGroups result (with
grouping.ngroups=true) for the equivalent filter query:

fq=Monitor_id:[38 TO 40]

I thought they should be the same - from the Wiki page: 'group.truncate: If
true, facet counts are based on the most relevant document of each group
matching the query.'

What am I doing wrong?

If I turn off group.truncate then the counts are the same, as I'd expect -
but unfortunately I'm only interested in the grouped results.

- I have also asked this question on StackOverflow, here:
http://stackoverflow.com/questions/7905756/solr-3-4-group-truncate-does-not-work-with-facet-queries

Thanks!

-- 
Ian

i...@isfluent.com 
+44 (0)1223 257903


Re: changing omitNorms on an already built index

2011-10-28 Thread Simon Willnauer
On Fri, Oct 28, 2011 at 12:20 AM, Robert Muir  wrote:
> On Thu, Oct 27, 2011 at 6:00 PM, Simon Willnauer
>  wrote:
>> We are not actively removing norms. If you set omitNorms=true and
>> index documents, they won't have norms for this field. Yet, other
>> segments still have norms until they get merged with a segment that has
>> no norms for that field, i.e. omits norms. omitNorms is anti-viral, so
>> once you set it to true it will be true for other segments eventually.
>> If you optimize your index you should see that the norms go away.
>>
>
> This is only true in trunk (4.x!)
> https://issues.apache.org/jira/browse/LUCENE-2846

ah right, I thought this was ported - nevermind! thanks robert

simon
>
> --
> lucidimagination.com
>


Re: Solr 3.4 group.truncate does not work with facet queries

2011-10-28 Thread Martijn v Groningen
Hi Ian,

I think this is a bug. After looking into the code, the facet.query
feature doesn't take the group.truncate option into account.
This needs to be fixed. You can open a new issue in Jira if you want to.

Martijn

On 28 October 2011 12:09, Ian Grainger  wrote:
> Hi, I'm using Grouping with group.truncate=true, The following simple facet
> query:
>
> facet.query=Monitor_id:[38 TO 40]
>
> Doesn't give the same number as the nGroups result (with
> grouping.ngroups=true) for the equivalent filter query:
>
> fq=Monitor_id:[38 TO 40]
>
> I thought they should be the same - from the Wiki page: 'group.truncate: If
> true, facet counts are based on the most relevant document of each group
> matching the query.'
>
> What am I doing wrong?
>
> If I turn off group.truncate then the counts are the same, as I'd expect -
> but unfortunately I'm only interested in the grouped results.
>
> - I have also asked this question on StackOverflow, here:
> http://stackoverflow.com/questions/7905756/solr-3-4-group-truncate-does-not-work-with-facet-queries
>
> Thanks!
>
> --
> Ian
>
> i...@isfluent.com 
> +44 (0)1223 257903
>



-- 
Kind regards,

Martijn van Groningen


Re: Solr 3.4 group.truncate does not work with facet queries

2011-10-28 Thread Ian Grainger
Thanks, Martijn. I have logged the bug here:
https://issues.apache.org/jira/browse/SOLR-2863

Is there any chance of a workaround for this issue before the bug is fixed?

If you want to answer the question on StackOverflow:
http://stackoverflow.com/questions/7905756/solr-3-4-group-truncate-does-not-work-with-facet-queries
I'll
accept your answer.


On Fri, Oct 28, 2011 at 12:14 PM, Martijn v Groningen <
martijn.v.gronin...@gmail.com> wrote:

> Hi Ian,
>
> I think this is a bug. After looking into the code the facet.query
> feature doesn't take into account the group.truncate option.
> This needs to be fixed. You can open a new issue in Jira if you want to.
>
> Martijn
>
> On 28 October 2011 12:09, Ian Grainger  wrote:
> > Hi, I'm using Grouping with group.truncate=true, The following simple
> facet
> > query:
> >
> > facet.query=Monitor_id:[38 TO 40]
> >
> > Doesn't give the same number as the nGroups result (with
> > grouping.ngroups=true) for the equivalent filter query:
> >
> > fq=Monitor_id:[38 TO 40]
> >
> > I thought they should be the same - from the Wiki page: 'group.truncate:
> If
> > true, facet counts are based on the most relevant document of each group
> > matching the query.'
> >
> > What am I doing wrong?
> >
> > If I turn off group.truncate then the counts are the same, as I'd expect
> -
> > but unfortunately I'm only interested in the grouped results.
> >
> > - I have also asked this question on StackOverflow, here:
> >
> http://stackoverflow.com/questions/7905756/solr-3-4-group-truncate-does-not-work-with-facet-queries
> >
> > Thanks!
> >
> > --
> > Ian
> >
> > i...@isfluent.com 
> > +44 (0)1223 257903
> >
>
>
>
> --
> Kind regards,
>
> Martijn van Groningen
>



-- 
Ian

i...@isfluent.com 
+44 (0)1223 257903


Solr Profiling

2011-10-28 Thread Rohit
Hi,

 

My Solr becomes very slow or hangs at times. We have done almost everything
possible, like:

- Giving 16GB of memory to the JVM
- Sharding

But these help only for some time. I want to profile the server and see what's
going wrong. How can I profile Solr remotely?

 

Regards,

Rohit
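
(A hedged sketch of one common way to profile a Solr JVM remotely: enable
remote JMX so that VisualVM or JConsole can attach. The port and the disabled
auth/SSL are illustrative only and not suitable for untrusted networks, and
"-jar start.jar" assumes the example Jetty setup.)

  java -Xmx16g \
       -Dcom.sun.management.jmxremote \
       -Dcom.sun.management.jmxremote.port=9010 \
       -Dcom.sun.management.jmxremote.authenticate=false \
       -Dcom.sun.management.jmxremote.ssl=false \
       -jar start.jar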

 



Re: solr break up word

2011-10-28 Thread Boris Quiroz
Hi Erick,

I'll try without the type="index" on the analyzer tag and then I'll
re-index some files.

Thanks for your answer.

On Thu, Oct 27, 2011 at 6:54 PM, Erick Erickson  wrote:
> Hmmm, I'm not sure what happens when you specify
>  (without type="index" and
> . I have no clue which one
> is used.
>
> Look at the admin/analysis page to understand how things are
> broken up.
>
> Did you re-index after you added the ngram filter?
>
> You'll get better help if you include example queries with
> &debugQuery=on appended, it'll give us a lot more to
> work with.
>
> Best
> Erick
>
> On Wed, Oct 26, 2011 at 4:14 PM, Boris Quiroz  wrote:
>> Hi,
>>
>> I've got Solr running on a CentOS server working OK, but sometimes my
>> application needs to index parts of a word. For example, if I search for
>> 'dislike' it works fine, but if I search for 'disl' it returns zero. Also, if I
>> search for 'disl*' it returns some values (the same as if I search for
>> 'dislike'), but if I search for 'dislike*' it returns zero too.
>>
>> So, I have two questions:
>>
>> 1. How exactly does the asterisk work as a wildcard?
>>
>> 2. What can I do to properly index parts of a word? I added these lines to my
>> schema.xml:
>>
>> [schema.xml fieldType definition with index-time and query-time analyzers,
>> including an n-gram filter with maxGramSize="15"; the XML tags were
>> stripped by the list archive]
>>
>> But I can't get it to work. Is what I did OK, or am I doing something wrong?
>>
>> Thanks.
>>
>> --
>> Boris Quiroz
>> boris.qui...@menco.it
>>
>>
>



-- 
Boris Quiroz
boris.qui...@menco.it


Re: Collection Distribution vs Replication in Solr

2011-10-28 Thread Alireza Salimi
So I have to ask my question again.
Is there any reason not to use Replication in Solr and to use Collection
Distribution instead?

Thanks

On Thu, Oct 27, 2011 at 5:33 PM, Alireza Salimi wrote:

> I can't see those benchmarks, can you?
>
> On Thu, Oct 27, 2011 at 5:20 PM, Marc Sturlese wrote:
>
>> Replication is easier to manage and a bit faster. See the performance
>> numbers: http://wiki.apache.org/solr/SolrReplication
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Collection-Distribution-vs-Replication-in-Solr-tp3458724p3459178.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>
>
>
> --
> Alireza Salimi
> Java EE Developer
>
>
>


-- 
Alireza Salimi
Java EE Developer


Re: Faceting on multiple fields, with multiple where clauses

2011-10-28 Thread Rubinho
Thank you Erik,
Now I understand the difference between q and qf.

Unfortunately, there is one unsolved problem left (I didn't find the answer
yesterday evening).

I added grouping to this query, because I want to show a group of trips with
the same code only once. (A trip has multiple departure days, and I just
want to show one trip, while in the detail screen I'll show all the available
trips (departure dates).)

When I don't filter by country, I receive all countries with their correct
counts.
When I do filter by country, the counts for my countries aren't grouped
anymore.

When I get the number of trips per month, I only get numbers for the next 2
months and no numbers for the other months (the trip should appear in each
month, because it has departures in each).

Can you help me again?
I'd appreciate it very much :)

http://localhost:8080/solr/select?facet=true&facet.date={!ex=SD}StartDate&f.StartDate.facet.date.start=2011-10-01T00:00:00Z&f.StartDate.facet.date.end=2012-09-30T00:00:00Z&f.StartDate.facet.date.gap=%2B1MONTH&facet.field={!ex=CC}CountryCode&rows=0&version=2.2&q=*:*&group=true&group.field=RoundtripgroupCode&group.truncate=true

These parts of the query are added when a selection is made:
&fq={!tag=CC}CountryCode:CR
&fq={!tag=SD}StartDate:[2011-10-01T00:00:00Z TO 2011-10-31T00:00:00Z]




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Faceting-on-multiple-fields-with-multiple-where-clauses-tp3457432p3460934.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: bbox issue

2011-10-28 Thread Yonik Seeley
Oops, didn't mean for this conversation to leave the mailing lists.

OK, so your lat and lon types were being stored as text but not
indexed (hence no search matches).
A dynamic field of "*" does tend to hide bugs/problems ;-)

> So should I have another for _latLon?  Would it look like:
> 

Yep.  It shouldn't be stored though (unless you just want to verify
for debugging).
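
(A hedged sketch of the kind of dynamicField definition meant here, assuming
a trie double type named "tdouble" is defined in the schema; the original
XML was stripped by the list archive:)

  <dynamicField name="*_latLon" type="tdouble" indexed="true" stored="false"/>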

-Yonik
http://www.lucidimagination.com



On Fri, Oct 28, 2011 at 9:35 AM, Christopher Gross  wrote:
> Hi Yonik.
>
> I never made a dynamicField definition for _latLon ... I was following
> the examples on http://wiki.apache.org/solr/SpatialSearchDev, so I
> just added the field type definition, then the field in the list of
> fields.  I wasn't aware that I had to do anything else.  The only
> dynamic that I have is:
>  multiValued="true"/>
>
> So should I have another for _latLon?  Would it look like:
> 
>
> -- Chris
>
>
>
> On Fri, Oct 28, 2011 at 9:27 AM, Yonik Seeley
>  wrote:
>> On Fri, Oct 28, 2011 at 8:42 AM, Christopher Gross  wrote:
>>> Hi Yonik.
>>>
>>> I'm having more of a problem now...
>>> I made the following lines in my schema.xml (in the appropriate places):
>>>
>>> >> subFieldSuffix="_latLon"/>
>>>
>>> >> required="false"/>
>>>
>>> I have data (did a q=*:*, found one with a point):
>>> 48.306074,14.286293
>>> 
>>> 48.306074
>>> 
>>> 
>>> 14.286293
>>> 
>>>
>>> I've tried to do a bbox:
>>> q=*:*&fq=point:[30.0,10.0%20TO%2050.0,20.0]
>>> q=*:*&fq={!bbox}&sfield=point&pt=48,14&d=50
>>>
>>> And neither of those seem to find the point...
>>
>> Hmmm, what's the dynamicField definition for _latLon?  Is it indexed?
>> If you add debugQuery=true, you should be able to see the underlying
>> range queries for your explicit range query.
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>


Re: bbox issue

2011-10-28 Thread Christopher Gross
Ah!  That all makes sense.  The example on the SpatialSearchDev page
should have that bit added in!

I'm back in business now, thanks Yonik!

-- Chris



On Fri, Oct 28, 2011 at 9:40 AM, Yonik Seeley
 wrote:
> Oops, didn't mean for this conversation to leave the mailing lists.
>
> OK, so your lat and lon types were being stored as text but not
> indexed (hence no search matches).
> A dynamic field of "*" does tend to hide bugs/problems ;-)
>
>> So should I have another for _latLon?  Would it look like:
>> 
>
> Yep.  It shouldn't be stored though (unless you just want to verify
> for debugging).
>
> -Yonik
> http://www.lucidimagination.com
>
>
>
> On Fri, Oct 28, 2011 at 9:35 AM, Christopher Gross  wrote:
>> Hi Yonik.
>>
>> I never made a dynamicField definition for _latLon ... I was following
>> the examples on http://wiki.apache.org/solr/SpatialSearchDev, so I
>> just added the field type definition, then the field in the list of
>> fields.  I wasn't aware that I had to do anything else.  The only
>> dynamic that I have is:
>> > multiValued="true"/>
>>
>> So should I have another for _latLon?  Would it look like:
>> 
>>
>> -- Chris
>>
>>
>>
>> On Fri, Oct 28, 2011 at 9:27 AM, Yonik Seeley
>>  wrote:
>>> On Fri, Oct 28, 2011 at 8:42 AM, Christopher Gross  
>>> wrote:
 Hi Yonik.

 I'm having more of a problem now...
 I made the following lines in my schema.xml (in the appropriate places):

 >>> subFieldSuffix="_latLon"/>

 >>> required="false"/>

 I have data (did a q=*:*, found one with a point):
 48.306074,14.286293
 
 48.306074
 
 
 14.286293
 

 I've tried to do a bbox:
 q=*:*&fq=point:[30.0,10.0%20TO%2050.0,20.0]
 q=*:*&fq={!bbox}&sfield=point&pt=48,14&d=50

 And neither of those seem to find the point...
>>>
>>> Hmmm, what's the dynamicField definition for _latLon?  Is it indexed?
>>> If you add debugQuery=true, you should be able to see the underlying
>>> range queries for your explicit range query.
>>>
>>> -Yonik
>>> http://www.lucidimagination.com
>>>
>>
>


Re: solr break up word

2011-10-28 Thread Boris Quiroz
Hi,

I solved the issue. I added to my schema.xml the following lines:




[schema.xml lines stripped by the list archive]


Then, I re-index and everything is working great :-)

Thanks for your help.
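
(The schema.xml lines above were stripped by the list archive. A hedged
sketch of a typical configuration for this kind of prefix matching uses an
edge n-gram filter on the index side only; the type name, tokenizer, and
gram sizes are illustrative:)

  <fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- indexes "dislike" as d, di, dis, disl, ... up to 15 characters -->
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

With a field of this type, a plain query for "disl" matches documents
containing "dislike" without needing a wildcard.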

On Fri, Oct 28, 2011 at 10:08 AM, Boris Quiroz  wrote:
> Hi Erick,
>
> I'll try without the type="index" on analyzer tag and then I'll
> re-index some files.
>
> Thanks for you answer.
>
> On Thu, Oct 27, 2011 at 6:54 PM, Erick Erickson  
> wrote:
>> Hmmm, I'm not sure what happens when you specify
>>  (without type="index" and
>> . I have no clue which one
>> is used.
>>
>> Look at the admin/analysis page to understand how things are
>> broken up.
>>
>> Did you re-index after you added the ngram filter?
>>
>> You'll get better help if you include example queries with
>> &debugQuery=on appended, it'll give us a lot more to
>> work with.
>>
>> Best
>> Erick
>>
>> On Wed, Oct 26, 2011 at 4:14 PM, Boris Quiroz  wrote:
>>> Hi,
>>>
>>> I've got Solr running on a CentOS server working OK, but sometimes my
>>> application needs to index parts of a word. For example, if I search for
>>> 'dislike' it works fine, but if I search for 'disl' it returns zero. Also, if I
>>> search for 'disl*' it returns some values (the same as if I search for
>>> 'dislike'), but if I search for 'dislike*' it returns zero too.
>>>
>>> So, I have two questions:
>>>
>>> 1. How exactly does the asterisk work as a wildcard?
>>>
>>> 2. What can I do to properly index parts of a word? I added these lines to
>>> my schema.xml:
>>>
>>> [schema.xml fieldType definition with index-time and query-time analyzers,
>>> including an n-gram filter with maxGramSize="15"; the XML tags were
>>> stripped by the list archive]
>>>
>>> But I can't get it to work. Is what I did OK, or am I doing something wrong?
>>>
>>> Thanks.
>>>
>>> --
>>> Boris Quiroz
>>> boris.qui...@menco.it
>>>
>>>
>>
>
>
>
> --
> Boris Quiroz
> boris.qui...@menco.it
>



-- 
Boris Quiroz
boris.qui...@menco.it


Updating a document multi-value field (no dup values) without needing it to be already committed

2011-10-28 Thread Thibaut Colar

Sorry for the lengthy text, it's a bit difficult to explain:

We are using Solr to index some user info like username, email (among 
other things).


I'm also trying to use facets for search, so for example, I added a
multi-value field to the user doc called "organizations" where I store the
names of the organizations that the user works for.


So I can use that field for faceted search and be able to filter a user
search query result by the organizations this user works for.


So now, the issue is that my code does something like:
1) Add user documents to Solr.
2) When a user is assigned an organization membership (role), update the user
   doc to set the organizations field.


Now I have the following issue with step 2: if I just do an
addField("organizations", "BigCorp") on the user doc, it will add that
value regardless of whether "organizations" already has that value ("BigCorp")
or not, but I want each org name to appear only once.


So the only way I found to get that behavior is to query the user document,
get the values of "organizations", and only add the new value if it's not
already in there - if (!userDoc.getValues("organizations").contains(value))
{ ... add the value to the doc and save it ... }


Now that works well, but only if I commit all the time (between steps 1 and
2 at least), because the document query will not work unless the document has
already been committed. Obviously, in theory it's best not to commit all the
time performance-wise, and it's impractical since I process those inserts in
batches.


*So I guess the main issue would be:*

 * Is there a way to update a multi-value field, without allowing duplicates,
   that would not require querying the doc to manually prevent duplicates?

 * Maybe some better way to do this?

Thanks.



Re: Updating a document multi-value field (no dup values) without needing it to be already committed

2011-10-28 Thread Thibaut Colar

A related question is:
Is there a way to update a doc to remove a specific value from a
multi-value field (in my case, remove a role)?


I managed to do that by querying the doc and reading all the other values
"manually", then saving, but that has the same issues and is inefficient.


On 10/28/11 10:04 AM, Thibaut Colar wrote:

Sorry for the lengthy text, it's a bit difficult to explain:

We are using Solr to index some user info like username, email (among 
other things).


I'm also trying to use facets for search, so for example, I added a 
multi-value field to user called "organizations" where I would store 
the name of the organizations that user work for.


So i can use that field for facetted search and be able to filter a 
user search query result by the organizations this user work for.


So now, the issue I have is my code does something like: 1) Add users 
documents to Solr 2) When a user is assigned an organization 
membership(role), update the user doc to set the organizations field


Now I have the following issue with step 2: If I just do a 
addField("organizations", "BigCorp") on the user doc, it will add that 
value regardless if organizations already have that value("BigCorp") 
or not, but I want each org name to appear only once.


So only way I found to get that behavior is to query the user 
document, get the values of "organization" and only add the new value 
if it's not already in there - if 
!userDoc.getValues("organiations").contains(value) {... add the value 
to the doc and save it ...}-


Now that works well, but only if I commit all the time(between step 1 
& 2 at least), because the document query will not work unless it has 
been committed already. Obviously in theory its best not to commit all 
the time performance-wise, and unpractical since I process those 
inserts in batches.


*So I guess the main issue would be:*

 *

   Is there a way to update a multi-value field, without allowing
   duplicates, that would not require querying the doc to manually
   prevent duplicates ?

 *

   Maybe some better way to do this ?

Thanks.






Recover index

2011-10-28 Thread Frederico Azeiteiro
Hello all,

 

When moving a SOLR index to another instance I lost the files:

segments.gen

segments_xk

 

I have the .cfs file complete.

 

What are my options to recover the data?

Any ideas that I can test?

 

Thank you.



Frederico Azeiteiro

 



Re: Query/Delete performance difference between straight HTTP and SolrJ

2011-10-28 Thread Shawn Heisey

On 10/27/2011 5:56 AM, Michael Sokolov wrote:
From everything you've said, it certainly sounds like a low-level I/O 
problem in the client, not a server slowdown of any sort.  Maybe Perl 
is using the same connection over and over (keep-alive) and Java is 
not.  I really don't know.  One thing I've heard is that 
StreamingUpdateSolrServer (I think that's what it's called) can give 
better throughput for large request batches.  If you're not using 
that, you may be having problems w/closing and re-opening connections?
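
(For reference, a minimal SolrJ sketch of the StreamingUpdateSolrServer
suggestion above, assuming SolrJ 3.x; the URL, queue size, thread count, and
field are illustrative:)

  import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class StreamingUpdateExample {
      public static void main(String[] args) throws Exception {
          // Queue up to 100 documents and drain them with 4 background threads,
          // re-using HTTP connections instead of opening one per request.
          StreamingUpdateSolrServer server =
                  new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);

          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "example-1");   // hypothetical field
          server.add(doc);                   // queued, sent asynchronously
          server.commit();                   // flush and commit when done
      }
  }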


I turned off the perl build system and had the Java program take over 
full build duties for both index chains.  It's been designed so one copy 
of the program can keep any number of index chains up to date 
simultaneously.


On the most recent hourly run, the servers without virtualization took 
50 seconds and the servers with virtualization and more memory took only 16 
seconds, so it looks like this problem has nothing to do with SolrJ; 
it's due to the 1000-clause queries actually taking a long time to 
execute.  The 16-second runtime is still longer than the last run by the 
Perl program (12 seconds), but I am also executing an index rebuild in 
the build cores on those servers, so I'm not overly concerned by that.


At this point there isn't any way for me to know whether the speedup 
with the old server builds is due to the extra memory (OS disk cache) or 
due to some quirk of virtualization.  I'm really hoping it's due to the 
extra memory, because I really don't want to go back to a virtualized 
environment.  I'll be able to figure it out after I eliminate my current 
bug and complete the migration.


Thank you very much to everyone who offered assistance.  It helped me 
make sure my testing was as unbiased as I could achieve.


Shawn



form-data post to ExtractingRequestHandler with utf-8 characters not handled

2011-10-28 Thread kgoess
I'm trying to post a PDF along with a whole bunch of metadata fields to the
ExtractingRequestHandler as multipart/form-data.   It works fine except for
the utf-8 character handling.  Here is what my post looks like (abridged):

   POST /solr/update/extract HTTP/1.1
   TE: deflate,gzip;q=0.3
   Connection: TE, close
   Host: localhost:8983
   Content-Length: 21418
   Content-Type: multipart/form-data;
boundary=wyAjGU0yDXmvWK8IWqY50a67Z2lsu2yU1UpEiPDX
   
   --wyAjGU0yDXmvWK8IWqY50a67Z2lsu2yU1UpEiPDX
   Content-Disposition: form-data; name=literal.title

   smart >>‘<< quote
   --wyAjGU0yDXmvWK8IWqY50a67Z2lsu2yU1UpEiPDX
   
   Content-Disposition: form-data; name="myfile";
filename="text.pdf.1174588823"
   Content-Type: application/pdf
   Content-Transfer-Encoding: binary

   ...binary pdf data

I've verified on the network that the quote character, a LEFT SINGLE
QUOTATION MARK (U+2018) is going across the wire as the utf-8 bytes "e2 80
98" which is correct.  However, when I search for the document in Solr, it's
coming back as the byte sequence "c3 a2 c2 80 c2 98" which I'm guessing is
it being double-utf8-encoded.

The multipart/form-data is MIME, which is supposed to be 7-bit, so I've
tried encoding any non-ascii fields as quoted-printable

   Content-Disposition: form-data; name=literal.title
   Content-Transfer-Encoding: quoted-printable

   smart >>=E2=80=98<< quote=

as well as base64

   Content-Disposition: form-data; name=literal.title
   Content-Transfer-Encoding: base64

   c21hcnQgPj7igJg8PCBxdW90ZSBmb29iYXI=

but what Solr puts in its index is just that value; it's not decoding either
the quoted-printable or the base64.  I've tried encoding the UTF-8 values as
HTML entities, but then Solr doesn't unescape them either, and any accented
characters are stored as the HTML entities, not as the Unicode characters.

Can anybody give me any pointers as to where I might be going wrong, where
to look for solutions, or any different/better ways to handle this?

Thanks!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/form-data-post-to-ExtractingRequestHandler-with-utf-8-characters-not-handled-tp3461731p3461731.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Partial updates?

2011-10-28 Thread mlevy
An ability to update would be extremely useful for us. Different parts of
records sometimes come from different databases, and being able to update
after creation of the Solr index would be extremely useful.

I've made some processes that read a record and add a new field to it. The
most awkward thing is when there's been a copyField: when the record is read
and re-saved, the copied field causes copyField to be invoked again.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Partial-updates-tp502570p3461740.html
Sent from the Solr - User mailing list archive at Nabble.com.


large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Roman Alekseenkov
Hi everyone,

I'm looking for some help with Solr indexing issues on a large scale.

We are indexing few terabytes/month on a sizeable Solr cluster (8
masters / serving writes, 16 slaves / serving reads). After certain
amount of tuning we got to the point where a single Solr instance can
handle index size of 100GB without much issues, but after that we are
starting to observe noticeable delays on index flush and they are
getting larger. See the attached picture for details, it's done for a
single JVM on a single machine.

We are posting data in 8 threads using javabin format and doing commit
every 5K documents, merge factor 20, and ram buffer size about 384MB.
From the picture it can be seen that a single-threaded index flushing
code kicks in on every commit and blocks all other indexing threads.
The hardware is decent (12 physical / 24 virtual cores per machine)
and it is mostly idle when the index is flushing. Very little CPU
utilization and disk I/O (<5%), with the exception of a single CPU
core which actually does index flush (95% CPU, 5% I/O wait).

My questions are:

1) will Solr changes from real-time branch help to resolve these
issues? I was reading
http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html
and it looks like we have exactly the same problem

2) what would be the best way to port these (and only these) changes
to 3.4.0? I tried to dig into the branching and revisions, but got
lost quickly. Tried something like "svn diff
[…]realtime_search@r953476 […]realtime_search@r1097767", but I'm not
sure if it's even possible to merge these into 3.4.0

3) what would you recommend for production 24/7 use? 3.4.0?

4) is there a workaround that can be used? also, I listed the stack trace below

Thank you!
Roman

P.S. This single "index flushing" thread spends 99% of all the time in
"org.apache.lucene.index.BufferedDeletesStream.applyDeletes", and then
the merge seems to go quickly. I looked it up and it looks like the
intent here is deleting old commit points (we are keeping only 1
non-optimized commit point per config). Not sure why is it taking that
long.

pool-2-thread-1 [RUNNABLE] CPU time: 3:31
java.nio.Bits.copyToByteArray(long, Object, long, long)
java.nio.DirectByteBuffer.get(byte[], int, int)
org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(byte[], int, int)
org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
org.apache.lucene.index.SegmentTermEnum.next()
org.apache.lucene.index.TermInfosReader.<init>(Directory, String,
FieldInfos, int, int)
org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentReader,
Directory, SegmentInfo, int, int)
org.apache.lucene.index.SegmentReader.get(boolean, Directory,
SegmentInfo, int, boolean, int)
org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo,
boolean, int, int)
org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, boolean)
org.apache.lucene.index.BufferedDeletesStream.applyDeletes(IndexWriter$ReaderPool,
List)
org.apache.lucene.index.IndexWriter.doFlush(boolean)
org.apache.lucene.index.IndexWriter.flush(boolean, boolean)
org.apache.lucene.index.IndexWriter.closeInternal(boolean)
org.apache.lucene.index.IndexWriter.close(boolean)
org.apache.lucene.index.IndexWriter.close()
org.apache.solr.update.SolrIndexWriter.close()
org.apache.solr.update.DirectUpdateHandler2.closeWriter()
org.apache.solr.update.DirectUpdateHandler2.commit(CommitUpdateCommand)
org.apache.solr.update.DirectUpdateHandler2$CommitTracker.run()
java.util.concurrent.Executors$RunnableAdapter.call()
java.util.concurrent.FutureTask$Sync.innerRun()
java.util.concurrent.FutureTask.run()
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor$ScheduledFutureTask)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker)
java.util.concurrent.ThreadPoolExecutor$Worker.run()
java.lang.Thread.run()


Re: large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Roman Alekseenkov
I'm wondering if this is relevant:
https://issues.apache.org/jira/browse/LUCENE-2680 - Improve how
IndexWriter flushes deletes against existing segments

Roman

On Fri, Oct 28, 2011 at 11:38 AM, Roman Alekseenkov
 wrote:
> Hi everyone,
>
> I'm looking for some help with Solr indexing issues on a large scale.
>
> We are indexing few terabytes/month on a sizeable Solr cluster (8
> masters / serving writes, 16 slaves / serving reads). After certain
> amount of tuning we got to the point where a single Solr instance can
> handle index size of 100GB without much issues, but after that we are
> starting to observe noticeable delays on index flush and they are
> getting larger. See the attached picture for details, it's done for a
> single JVM on a single machine.
>
> We are posting data in 8 threads using javabin format and doing commit
> every 5K documents, merge factor 20, and ram buffer size about 384MB.
> From the picture it can be seen that a single-threaded index flushing
> code kicks in on every commit and blocks all other indexing threads.
> The hardware is decent (12 physical / 24 virtual cores per machine)
> and it is mostly idle when the index is flushing. Very little CPU
> utilization and disk I/O (<5%), with the exception of a single CPU
> core which actually does index flush (95% CPU, 5% I/O wait).
>
> My questions are:
>
> 1) will Solr changes from real-time branch help to resolve these
> issues? I was reading
> http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html
> and it looks like we have exactly the same problem
>
> 2) what would be the best way to port these (and only these) changes
> to 3.4.0? I tried to dig into the branching and revisions, but got
> lost quickly. Tried something like "svn diff
> […]realtime_search@r953476 […]realtime_search@r1097767", but I'm not
> sure if it's even possible to merge these into 3.4.0
>
> 3) what would you recommend for production 24/7 use? 3.4.0?
>
> 4) is there a workaround that can be used? also, I listed the stack trace 
> below
>
> Thank you!
> Roman
>
> P.S. This single "index flushing" thread spends 99% of all the time in
> "org.apache.lucene.index.BufferedDeletesStream.applyDeletes", and then
> the merge seems to go quickly. I looked it up and it looks like the
> intent here is deleting old commit points (we are keeping only 1
> non-optimized commit point per config). Not sure why is it taking that
> long.
>
> pool-2-thread-1 [RUNNABLE] CPU time: 3:31
> java.nio.Bits.copyToByteArray(long, Object, long, long)
> java.nio.DirectByteBuffer.get(byte[], int, int)
> org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(byte[], int, 
> int)
> org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
> org.apache.lucene.index.SegmentTermEnum.next()
> org.apache.lucene.index.TermInfosReader.(Directory, String,
> FieldInfos, int, int)
> org.apache.lucene.index.SegmentCoreReaders.(SegmentReader,
> Directory, SegmentInfo, int, int)
> org.apache.lucene.index.SegmentReader.get(boolean, Directory,
> SegmentInfo, int, boolean, int)
> org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo,
> boolean, int, int)
> org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, boolean)
> org.apache.lucene.index.BufferedDeletesStream.applyDeletes(IndexWriter$ReaderPool,
> List)
> org.apache.lucene.index.IndexWriter.doFlush(boolean)
> org.apache.lucene.index.IndexWriter.flush(boolean, boolean)
> org.apache.lucene.index.IndexWriter.closeInternal(boolean)
> org.apache.lucene.index.IndexWriter.close(boolean)
> org.apache.lucene.index.IndexWriter.close()
> org.apache.solr.update.SolrIndexWriter.close()
> org.apache.solr.update.DirectUpdateHandler2.closeWriter()
> org.apache.solr.update.DirectUpdateHandler2.commit(CommitUpdateCommand)
> org.apache.solr.update.DirectUpdateHandler2$CommitTracker.run()
> java.util.concurrent.Executors$RunnableAdapter.call()
> java.util.concurrent.FutureTask$Sync.innerRun()
> java.util.concurrent.FutureTask.run()
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor$ScheduledFutureTask)
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker)
> java.util.concurrent.ThreadPoolExecutor$Worker.run()
> java.lang.Thread.run()
>


Re: large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Simon Willnauer
Hey Roman,

On Fri, Oct 28, 2011 at 8:38 PM, Roman Alekseenkov
 wrote:
> Hi everyone,
>
> I'm looking for some help with Solr indexing issues on a large scale.
>
> We are indexing few terabytes/month on a sizeable Solr cluster (8
> masters / serving writes, 16 slaves / serving reads). After certain
> amount of tuning we got to the point where a single Solr instance can
> handle index size of 100GB without much issues, but after that we are
> starting to observe noticeable delays on index flush and they are
> getting larger. See the attached picture for details, it's done for a
> single JVM on a single machine.
>
> We are posting data in 8 threads using javabin format and doing commit
> every 5K documents, merge factor 20, and ram buffer size about 384MB.
> From the picture it can be seen that a single-threaded index flushing
> code kicks in on every commit and blocks all other indexing threads.
> The hardware is decent (12 physical / 24 virtual cores per machine)
> and it is mostly idle when the index is flushing. Very little CPU
> utilization and disk I/O (<5%), with the exception of a single CPU
> core which actually does index flush (95% CPU, 5% I/O wait).
>
> My questions are:
>
> 1) will Solr changes from real-time branch help to resolve these
> issues? I was reading
> http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html
> and it looks like we have exactly the same problem

Did you also read http://bit.ly/ujLw6v - there I try to explain the
major difference between Lucene 3.x and 4.0 and why 3.x has these long
idle times. In Lucene 3.x a full flush / commit is a single-threaded
process; as you observed, there is only one thread making progress. In
Lucene 4 there is still a single thread executing the commit, but other
threads are not blocked anymore. Depending on how fast that thread can
flush, other threads might help flushing segments for that commit
concurrently, or simply index into new documents writers. So basically
4.0 won't have this problem anymore. The realtime branch you talk
about is already merged into 4.0 trunk.

>
> 2) what would be the best way to port these (and only these) changes
> to 3.4.0? I tried to dig into the branching and revisions, but got
> lost quickly. Tried something like "svn diff
> […]realtime_search@r953476 […]realtime_search@r1097767", but I'm not
> sure if it's even possible to merge these into 3.4.0

Possible, yes! Worth the trouble? I would say no!
DocumentsWriterPerThread (DWPT) is a very big change and I don't think
we should backport it into our stable branch. However, this feature
is very stable in 4.0.
>
> 3) what would you recommend for production 24/7 use? 3.4.0?

I think 3.4 is a safe bet! I personally tend to use trunk in
production too; the only problem is that it is basically a moving
target and introduces extra overhead on your side to watch changes and
index format modifications, which could basically prevent you from
doing simple upgrades.

>
> 4) is there a workaround that can be used? also, I listed the stack trace 
> below
>
> Thank you!
> Roman
>
> P.S. This single "index flushing" thread spends 99% of all the time in
> "org.apache.lucene.index.BufferedDeletesStream.applyDeletes", and then
> the merge seems to go quickly. I looked it up and it looks like the
> intent here is deleting old commit points (we are keeping only 1
> non-optimized commit point per config). Not sure why is it taking that
> long.

In 3.x there is no way to apply deletes without doing a flush (afaik).
In 3.x a flush means single-threaded again - similar to a commit, just
without syncing files to disk and writing a new segments file. In 4.0
you have way more control over this via
IndexWriterConfig#setMaxBufferedDeleteTerms, and deletes are also applied
without blocking other threads. In trunk we hijack indexing threads to
do all that work concurrently, so you get better CPU utilization and,
due to concurrent flushing, better and usually continuous I/O
utilization.
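
(A rough, hedged sketch of the 4.0/trunk API being referred to; the path and
values are illustrative only:)

  import java.io.File;

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  public class WriterConfigSketch {
      public static void main(String[] args) throws Exception {
          Directory dir = FSDirectory.open(new File("/path/to/index")); // hypothetical path
          IndexWriterConfig cfg = new IndexWriterConfig(
                  Version.LUCENE_40, new StandardAnalyzer(Version.LUCENE_40));
          cfg.setRAMBufferSizeMB(384);          // matches the buffer size used in this thread
          cfg.setMaxBufferedDeleteTerms(1000);  // illustrative: apply buffered deletes more eagerly
          IndexWriter writer = new IndexWriter(dir, cfg);
          // ... index documents ...
          writer.close();
      }
  }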

hope that helps.

simon
>
> pool-2-thread-1 [RUNNABLE] CPU time: 3:31
> java.nio.Bits.copyToByteArray(long, Object, long, long)
> java.nio.DirectByteBuffer.get(byte[], int, int)
> org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(byte[], int, 
> int)
> org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
> org.apache.lucene.index.SegmentTermEnum.next()
> org.apache.lucene.index.TermInfosReader.(Directory, String,
> FieldInfos, int, int)
> org.apache.lucene.index.SegmentCoreReaders.(SegmentReader,
> Directory, SegmentInfo, int, int)
> org.apache.lucene.index.SegmentReader.get(boolean, Directory,
> SegmentInfo, int, boolean, int)
> org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo,
> boolean, int, int)
> org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, boolean)
> org.apache.lucene.index.BufferedDeletesStream.applyDeletes(IndexWriter$ReaderPool,
> List)
> org.apache.lucene.index.IndexWriter.doFlush(boolean)
> org.apache.lucene.index.IndexWriter.flush(boolean, b

Re: large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Simon Willnauer
On Fri, Oct 28, 2011 at 9:17 PM, Simon Willnauer
 wrote:
> Hey Roman,
>
> On Fri, Oct 28, 2011 at 8:38 PM, Roman Alekseenkov
>  wrote:
>> Hi everyone,
>>
>> I'm looking for some help with Solr indexing issues on a large scale.
>>
>> We are indexing few terabytes/month on a sizeable Solr cluster (8
>> masters / serving writes, 16 slaves / serving reads). After certain
>> amount of tuning we got to the point where a single Solr instance can
>> handle index size of 100GB without much issues, but after that we are
>> starting to observe noticeable delays on index flush and they are
>> getting larger. See the attached picture for details, it's done for a
>> single JVM on a single machine.
>>
>> We are posting data in 8 threads using javabin format and doing commit
>> every 5K documents, merge factor 20, and ram buffer size about 384MB.
>> From the picture it can be seen that a single-threaded index flushing
>> code kicks in on every commit and blocks all other indexing threads.
>> The hardware is decent (12 physical / 24 virtual cores per machine)
>> and it is mostly idle when the index is flushing. Very little CPU
>> utilization and disk I/O (<5%), with the exception of a single CPU
>> core which actually does index flush (95% CPU, 5% I/O wait).
>>
>> My questions are:
>>
>> 1) will Solr changes from real-time branch help to resolve these
>> issues? I was reading
>> http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html
>> and it looks like we have exactly the same problem
>
> did you also read http://bit.ly/ujLw6v - here I try to explain the
> major difference between Lucene 3.x and 4.0 and why 3.x has these long
> idle times. In Lucene 3.x a full flush / commit is a single threaded
> process, as you observed there is only one thread making progress. In
> Lucene 4 there is still a single thread executing the commit but other
> threads are not blocked anymore. Depending on how fast the thread can
> flush other threads might help flushing segments for that commit
> concurrently or simply index into new documents writers. So basically
> 4.0 won't have this problem anymore. The realtime branch you talk
> about is already merged into 4.0 trunk.
>
>>
>> 2) what would be the best way to port these (and only these) changes
>> to 3.4.0? I tried to dig into the branching and revisions, but got
>> lost quickly. Tried something like "svn diff
>> […]realtime_search@r953476 […]realtime_search@r1097767", but I'm not
>> sure if it's even possible to merge these into 3.4.0
>
> Possible yes! Worth the trouble, I would say no!
> DocumentsWriterPerThread (DWPT) is a very big change and I don't think
> we should backport this into our stable branch. However, this feature
> is very stable in 4.0 though.
>>
>> 3) what would you recommend for production 24/7 use? 3.4.0?
>
> I think 3.4 is a safe bet! I personally tend to use trunk in
> production too the only problem is that this is basically a moving
> target and introduces extra overhead on your side to watch changes and
> index format modification which could basically prevent you from
> simple upgrades
>
>>
>> 4) is there a workaround that can be used? also, I listed the stack trace 
>> below
>>
>> Thank you!
>> Roman
>>
>> P.S. This single "index flushing" thread spends 99% of all the time in
>> "org.apache.lucene.index.BufferedDeletesStream.applyDeletes", and then
>> the merge seems to go quickly. I looked it up and it looks like the
>> intent here is deleting old commit points (we are keeping only 1
>> non-optimized commit point per config). Not sure why is it taking that
>> long.
>
> in 3.x there is no way to apply deletes without doing a flush (afaik).
> In 3.x a flush means single threaded again - similar to commit just
> without syncing files to disk and writing a new segments file. In 4.0
> you have way more control over this via
> IndexWriterConfig#setMaxBufferedDeleteTerms which are also applied
> without blocking other threads. In trunk we hijack indexing threads to
> do all that work concurrently so you get better cpu utilization and
> due to concurrent flushing better and usually continuous IO
> utilization.
>
> hope that helps.
>
> simon
>>
>> pool-2-thread-1 [RUNNABLE] CPU time: 3:31
>> java.nio.Bits.copyToByteArray(long, Object, long, long)
>> java.nio.DirectByteBuffer.get(byte[], int, int)
>> org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(byte[], int, 
>> int)
>> org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
>> org.apache.lucene.index.SegmentTermEnum.next()
>> org.apache.lucene.index.TermInfosReader.(Directory, String,
>> FieldInfos, int, int)
>> org.apache.lucene.index.SegmentCoreReaders.(SegmentReader,
>> Directory, SegmentInfo, int, int)
>> org.apache.lucene.index.SegmentReader.get(boolean, Directory,
>> SegmentInfo, int, boolean, int)
>> org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo,
>> boolean, int, int)
>> org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, boolean)
>> or

Re: large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Jason Rutherglen
> We should maybe try to fix this in 3.x too?

+1 I suggested it should be backported a while back.  Or that Lucene
4.x should be released.  I'm not sure what is holding up Lucene 4.x at
this point; bulk postings is only needed for PFOR.

On Fri, Oct 28, 2011 at 3:27 PM, Simon Willnauer
 wrote:
> On Fri, Oct 28, 2011 at 9:17 PM, Simon Willnauer
>  wrote:
>> Hey Roman,
>>
>> On Fri, Oct 28, 2011 at 8:38 PM, Roman Alekseenkov
>>  wrote:
>>> Hi everyone,
>>>
>>> I'm looking for some help with Solr indexing issues on a large scale.
>>>
>>> We are indexing few terabytes/month on a sizeable Solr cluster (8
>>> masters / serving writes, 16 slaves / serving reads). After certain
>>> amount of tuning we got to the point where a single Solr instance can
>>> handle index size of 100GB without much issues, but after that we are
>>> starting to observe noticeable delays on index flush and they are
>>> getting larger. See the attached picture for details, it's done for a
>>> single JVM on a single machine.
>>>
>>> We are posting data in 8 threads using javabin format and doing commit
>>> every 5K documents, merge factor 20, and ram buffer size about 384MB.
>>> From the picture it can be seen that a single-threaded index flushing
>>> code kicks in on every commit and blocks all other indexing threads.
>>> The hardware is decent (12 physical / 24 virtual cores per machine)
>>> and it is mostly idle when the index is flushing. Very little CPU
>>> utilization and disk I/O (<5%), with the exception of a single CPU
>>> core which actually does index flush (95% CPU, 5% I/O wait).
>>>
>>> My questions are:
>>>
>>> 1) will Solr changes from real-time branch help to resolve these
>>> issues? I was reading
>>> http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html
>>> and it looks like we have exactly the same problem
>>
>> did you also read http://bit.ly/ujLw6v - here I try to explain the
>> major difference between Lucene 3.x and 4.0 and why 3.x has these long
>> idle times. In Lucene 3.x a full flush / commit is a single threaded
>> process, as you observed there is only one thread making progress. In
>> Lucene 4 there is still a single thread executing the commit but other
>> threads are not blocked anymore. Depending on how fast the thread can
>> flush other threads might help flushing segments for that commit
>> concurrently or simply index into new documents writers. So basically
>> 4.0 won't have this problem anymore. The realtime branch you talk
>> about is already merged into 4.0 trunk.
>>
>>>
>>> 2) what would be the best way to port these (and only these) changes
>>> to 3.4.0? I tried to dig into the branching and revisions, but got
>>> lost quickly. Tried something like "svn diff
>>> […]realtime_search@r953476 […]realtime_search@r1097767", but I'm not
>>> sure if it's even possible to merge these into 3.4.0
>>
>> Possible yes! Worth the trouble, I would say no!
>> DocumentsWriterPerThread (DWPT) is a very big change and I don't think
>> we should backport this into our stable branch. However, this feature
>> is very stable in 4.0 though.
>>>
>>> 3) what would you recommend for production 24/7 use? 3.4.0?
>>
>> I think 3.4 is a safe bet! I personally tend to use trunk in
>> production too the only problem is that this is basically a moving
>> target and introduces extra overhead on your side to watch changes and
>> index format modification which could basically prevent you from
>> simple upgrades
>>
>>>
>>> 4) is there a workaround that can be used? also, I listed the stack trace 
>>> below
>>>
>>> Thank you!
>>> Roman
>>>
>>> P.S. This single "index flushing" thread spends 99% of all the time in
>>> "org.apache.lucene.index.BufferedDeletesStream.applyDeletes", and then
>>> the merge seems to go quickly. I looked it up and it looks like the
>>> intent here is deleting old commit points (we are keeping only 1
>>> non-optimized commit point per config). Not sure why is it taking that
>>> long.
>>
>> in 3.x there is no way to apply deletes without doing a flush (afaik).
>> In 3.x a flush means single threaded again - similar to commit just
>> without syncing files to disk and writing a new segments file. In 4.0
>> you have way more control over this via
>> IndexWriterConfig#setMaxBufferedDeleteTerms which are also applied
>> without blocking other threads. In trunk we hijack indexing threads to
>> do all that work concurrently so you get better cpu utilization and
>> due to concurrent flushing better and usually continuous IO
>> utilization.
>>
>> hope that helps.
>>
>> simon
>>>
>>> pool-2-thread-1 [RUNNABLE] CPU time: 3:31
>>> java.nio.Bits.copyToByteArray(long, Object, long, long)
>>> java.nio.DirectByteBuffer.get(byte[], int, int)
>>> org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(byte[], int, 
>>> int)
>>> org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
>>> org.apache.lucene.index.SegmentTermEnum.next()
>>> org.apache.lucene.index.TermInfosReader.(Directo

RE: Partial updates?

2011-10-28 Thread Brandon Ramirez
I would love to see this too.  Most of our data comes from a relational 
database, but there are some files on the file system related to our products 
that may need to be indexed.  The files have different change control / life 
cycle, so I can't be sure that our application will know when this data  
changes, so a recurring background re-index job would be helpful.  Having to go 
to the database to get 99% of the data (which didn't change anyway) to send 
along with the 1% from the file system is a big limitation.

This also prevents the use of DIH.


Brandon Ramirez | Office: 585.214.5413 | Fax: 585.295.4848 
Software Engineer II | Element K | www.elementk.com


-Original Message-
From: mlevy [mailto:ml...@ushmm.org] 
Sent: Friday, October 28, 2011 2:21 PM
To: solr-user@lucene.apache.org
Subject: Re: Partial updates?

An ability to update would be extremely useful for us. Different parts of 
records sometimes come from different databases, and being able to update after 
creation of the Solr index would be extremely useful.

I've made some processes that reads a record and adds a new field to it. The 
most awkward thing is when there's been a CopyField, when the record is read 
and re-saved, the copied field causes CopyField to be invoked again.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Partial-updates-tp502570p3461740.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Robert Muir
On Fri, Oct 28, 2011 at 5:03 PM, Jason Rutherglen
 wrote:

> +1 I suggested it should be backported a while back.  Or that Lucene
> 4.x should be released.  I'm not sure what is holding up Lucene 4.x at
> this point, bulk postings is only needed useful for PFOR.

This is not true, most modern index compression schemes, not just
PFOR-delta read more than one integer at a time.

That's why it's important not only to abstract away the encoding of the
index, but also to ensure that the enumeration APIs aren't biased
towards one-at-a-time vInt.

Otherwise we have "flexible indexing" where "flexible" means "slower
if you do anything but the default".

-- 
lucidimagination.com


edismax/boost: certain documents should be last

2011-10-28 Thread Paul
(I am using solr 3.4 and edismax.)

In my index, I have a multivalued field named "genre". One of the
values this field can have is "Citation". I would like documents that
have a genre field of Citation to always be at the bottom of the
search results.

I've been experimenting, but I can't seem to figure out the syntax of
the search I need. Here is the search that seems most logical to me
(newlines added here for readability):

q=%2bcontent%3Anotes+genre%3ACitation^0.01
&start=0
&rows=3
&fl=genre+title
&version=2.2
&defType=edismax

I get the same results whether I include "genre%3ACitation^0.01" or not.

Just to see if my names were correct, I put a minus sign before
"genre" and it did, in fact, stop returning all the documents
containing Citation.

What am I doing wrong?

Here are the results from the above query:


  
[Solr XML response; tags stripped by the list archive. responseHeader: status 0,
QTime 1; params echoed: fl=genre title, start=0, q=+content:notes genre:Citation^0.01,
rows=3, version=2.2, defType=edismax. All three returned docs have genre
"Citation" (the first also "Fiction"); their titles are "Notes on novelists
With some other notes", "Novel notes", and "Knock about notes".]
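
(A hedged sketch of one commonly suggested alternative: instead of
down-weighting Citation in q, add an edismax boost query that lifts
everything except that genre; the syntax is untested here:)

  q=%2Bcontent%3Anotes
  &defType=edismax
  &bq=(*:*+-genre:Citation)^10
  &fl=genre+title

Documents without genre:Citation receive the extra boost, so Citation-only
documents sink toward the bottom while content:notes still drives the base
relevance.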

  



i don't get why this says non-match

2011-10-28 Thread Robert Petersen
It looks to me like everything matches down the line, but the top level says
the otherQuery is a non-match... I don't get it.
[Solr XML response header; tags stripped by the list archive. Visible values:
status 0, QTime 77, and request params including "SyncMaster", "*,score",
"on", "on", "0", "+syncmaster -SyncMaster", "standard", "standard", "41",
and "2.2".]
[debug output; XML tags stripped by the list archive]
query string: +syncmaster -SyncMaster
parsed query: +moreWords:syncmaster -MultiPhraseQuery(moreWords:"sync (master syncmaster)")
parsed query (toString): +moreWords:syncmaster -moreWords:"sync (master syncmaster)"
explain for otherQuery "SyncMaster":
0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s)
  1.4043131 = (MATCH) fieldWeight(moreWords:syncmaster in 46710), product of:
    1.4142135 = tf(termFreq(moreWords:syncmaster)=2)
    9.078851 = idf(docFreq=41, maxDocs=135472)
    0.109375 = fieldNorm(field=moreWords, doc=46710)
  0.0 = match on prohibited clause (moreWords:"sync (master syncmaster)")
    9.393997 = (MATCH) weight(moreWords:"sync (master syncmaster)" in 46710), product of:
      2.5863855 = queryWeight(moreWords:"sync (master syncmaster)"), product of:
        23.481407 = idf(moreWords:"sync (master syncmaster)")
        0.1101461 = queryNorm
      3.6320949 = (MATCH) fieldWeight(moreWords:"sync (master syncmaster)" in 46710), product of:
        1.4142135 = tf(phraseFreq=2.0)
        23.481407 = idf(moreWords:"sync (master syncmaster)")
        0.109375 = fieldNorm(field=moreWords, doc=46710)



Re: large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Jason Rutherglen
> Otherwise we have "flexible indexing" where "flexible" means "slower
> if you do anything but the default".

The other encodings should exist as modules since they are pluggable.
4.0 can ship with the existing codec.  4.1 with additional codecs and
the bulk postings at a later time.

Otherwise it will be 6 months before 4.0 ships, that's too long.

Also it is an amusing contradiction that your argument flies in the
face of Lucid shipping 4.x today without said functionality.

On Fri, Oct 28, 2011 at 5:09 PM, Robert Muir  wrote:
> On Fri, Oct 28, 2011 at 5:03 PM, Jason Rutherglen
>  wrote:
>
>> +1 I suggested it should be backported a while back.  Or that Lucene
>> 4.x should be released.  I'm not sure what is holding up Lucene 4.x at
>> this point, bulk postings is only needed useful for PFOR.
>
> This is not true, most modern index compression schemes, not just
> PFOR-delta read more than one integer at a time.
>
> Thats why its important not only to abstract away the encoding of the
> index, but to also ensure that the enumeration apis aren't biased
> towards one-at-a-time vInt.
>
> Otherwise we have "flexible indexing" where "flexible" means "slower
> if you do anything but the default".
>
> --
> lucidimagination.com
>


Re: large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Robert Muir
On Fri, Oct 28, 2011 at 8:10 PM, Jason Rutherglen
 wrote:
>> Otherwise we have "flexible indexing" where "flexible" means "slower
>> if you do anything but the default".
>
> The other encodings should exist as modules since they are pluggable.
> 4.0 can ship with the existing codec.  4.1 with additional codecs and
> the bulk postings at a later time.

you don't know what you are talking about:  go look at the source
code. the whole problem is that encodings aren't pluggable.

>
> Otherwise it will be 6 months before 4.0 ships, that's too long.

sucks for you.

>
> Also it is an amusing contradiction that your argument flies in the
> face of Lucid shipping 4.x today without said functionality.
>

No it doesn't. trunk is open source. you can use it, too, if you want.

-- 
lucidimagination.com


Re: large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Jason Rutherglen
> abstract away the encoding of the index

Robert, this is what you wrote.  "Abstract away the encoding of the
index" means pluggable, otherwise it's not abstract and / or it's a
flawed design.  Sounds like it's the latter.


Re: URL Redirect

2011-10-28 Thread prr
Finotti Simone  yoox.com> writes:

> 
> Hello,
> 
> I have been assigned the task to migrate from Endeca to Solr.
> 
> The former engine allowed me to set keyword triggers that, when matched
exactly, caused the web client to
> redirect to a specified URL.
> 
> Does that feature exist in Solr? If so, where can I get some info?
> 
> Thank you



Hi, I am also looking at migrating from Endeca to Solr, but at first
look it seems extremely tedious to me. Please pass on any tips on how to
approach the problem.