Re: advice on creating a solr index when data source is from many unrelated db tables

2010-07-30 Thread Chantal Ackermann
Hi Ahmed,

fields that are empty do not impact the index. It's different from a
database.
I have text fields for different languages and per document there is
always only one of the languages set (the text fields for the other
languages are empty/not set). It works all very well and fast.

I wonder more about what you describe as "unrelated data" - why would
you want to put unrelated data into a single index? If you want to
search on all the data and return mixed results there surely must be
some kind of relation between the documents?

Chantal

On Thu, 2010-07-29 at 21:33 +0200, S Ahmed wrote:
> I understand (and it's straightforward) when you want to create an index for
> something simple like Products.
> 
> But how do you go about creating a Solr index when you have data coming from
> 10-15 database tables, and the tables have unrelated data?
> 
> The issue is then you would have many 'columns' in your index, and they will
> be NULL for much of the data since you are trying to shove 15 db tables into
> a single Solr/Lucene index.
> 
> 
> This must be a common problem, what are the potential solutions?





Re: advice on creating a solr index when data source is from many unrelated db tables

2010-07-30 Thread Gora Mohanty
On Thu, 29 Jul 2010 15:33:42 -0400
S Ahmed  wrote:

> I understand (and it's straightforward) when you want to create an
> index for something simple like Products.
> 
> But how do you go about creating a Solr index when you have data
> coming from 10-15 database tables, and the tables have unrelated
> data?
> 
> The issue is then you would have many 'columns' in your index,
> and they will be NULL for much of the data since you are trying
> to shove 15 db tables into a single Solr/Lucene index.
[...]

This should not be a problem. With the Solr DataImportHandler, any
NULL values for a given record will simply be ignored, i.e., the
Solr index for that document will not contain an entry for that
field.
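As an illustration, a data-config.xml along these lines (table, column and
field names here are invented for the example, not taken from any real
setup) would pull two unrelated tables into the same index; rows where,
say, AUTHOR is NULL simply come out as documents with no "author" field:

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/db"/>
  <document>
    <!-- one entity per source table; both feed the same index -->
    <entity name="article" query="select ID, TITLE, BODY from ARTICLES">
      <field column="ID" name="id"/>
      <field column="TITLE" name="title"/>
      <field column="BODY" name="body"/>
    </entity>
    <entity name="blog" query="select ID, TITLE, AUTHOR from BLOGS">
      <field column="ID" name="id"/>
      <field column="TITLE" name="title"/>
      <field column="AUTHOR" name="author"/>
    </entity>
  </document>
</dataConfig>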

Regards,
Gora


Re: Get unique values

2010-07-30 Thread Rafal Bluszcz Zawadzki
2010/7/28 Rafal Bluszcz Zawadzki 

> Hi,
>
> In my schema I have (inter alia) fields CollectionID and CollectionName.
>  These two values always match together, which means that for every value
> of CollectionID there is a matching value of CollectionName.
>
> I am interested in a query which allows me to get unique values of
> CollectionID with matching CollectionNames (the rest of the fields are of
> no interest to me in this query).
>
>
Finally I decided to store the values in one indexed field (Collections), and
the query below did the trick:

q=*:*&rows=0&facet=on&facet.field=Collections
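(In case it helps anyone with the same problem: one way to build such a
combined field, assuming you index via XML, is to store the pair as a
single value, e.g. <field name="Collections">12|Annual Reports</field>,
and split on the separator client-side when reading the facet values
back. The "12|Annual Reports" format is just an example, not something
Solr prescribes.)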

-- 
Rafał Zawadzki
http://dev.bluszcz.net/


Re: wildcard and proximity searches

2010-07-30 Thread Ahmet Arslan

> What approach should I use to perform wildcard and proximity
> searches?
> 
> Like: "solr mail*"~10
> 
> For getting docs where solr is within 10 words of "mailing",
> for instance?


You can do it with the plug-in described here:
https://issues.apache.org/jira/browse/SOLR-1604
It would be great if you test it and give feedback.



  


Solr and Lucene in South Africa

2010-07-30 Thread Jaco Olivier
Hi to all Solr/Lucene Users...

Our team had a discussion today regarding the Solr/Lucene community closer to
home.
I am hereby putting out an SOS to all Solr/Lucene users in the South African 
market and wish to organize a meet-up (or user support group) if at all 
possible.
It would be great to share some triumphs and pitfalls that were experienced.

* Sorry for hogging the User Mailing list with a non-technical question, but I
think this is the easiest way to get it done :)

Jaco Olivier
Web Specialist

Please note: This email and its content are subject to the disclaimer as 
displayed at the following link 
http://www.sabinet.co.za/?page=e-mail-disclaimer. Should you not have Web 
access, send an email to i...@sabinet.co.za and a 
copy will be sent to you


RE: wildcard and proximity searches

2010-07-30 Thread Frederico Azeiteiro
Hi Ahmet,

Thank you. I'll be happy to test it if I manage to install it OK. I'm a
newbie at solr but I'm going to try the instructions in the thread to
load it.

Some other doubts I have about wildcard searches:

a) I think wildcard search is by default "case sensitive"? Is there a
way to make it case insensitive?

b) I have about 6000 queries to run (they could have wildcards, proximity
searches or just normal queries). I discovered that the normal query
type doesn't work with wildcards, so I'm using the "Filter Query" to
run these. Is this field slower? I notice that using this field my
queries are much slower (I have some queries like *word* or *word1* or
*word2* that take about one minute to perform).
Is there a way to optimize these queries (without removing the wildcards
:))?

c) Is there a way to do phrase queries with wildcards? Like "This solr*
mail*"? Because in the tests I made, when using quotes I think the
wildcards are ignored.

d) How exactly do the pf (phrase fields) and ps (phrase slop)
parameters work, and what's the difference from proximity searches (ex:
"word word2"~20)?

Sorry for the long email and thank you for your help...
Frederico

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: sexta-feira, 30 de Julho de 2010 10:57
To: solr-user@lucene.apache.org
Subject: Re: wildcard and proximity searches


> What approach should I use to perform wildcard and proximity
> searches?
> 
> Like: "solr mail*"~10
> 
> For getting docs where solr is within 10 words of "mailing",
> for instance?


You can do it with the plug-in described here:
https://issues.apache.org/jira/browse/SOLR-1604
It would be great if you test it and give feedback.



  


RE: wildcard and proximity searches

2010-07-30 Thread Ahmet Arslan
> a) I think wildcard search is by default "case sensitive"?
> Is there a way to make it case insensitive?

Wildcard searches are not analyzed. To search case-insensitively you can
lowercase query terms at the client side (while using LowerCaseFilter at
index time), e.g. Mail* => mail*
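A minimal sketch of that setup (the field type name is invented):

<fieldType name="text_lc" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Then send q=mail* (lowercased by your client) rather than q=Mail*.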

 
> I discovered that the normal query type doesn't work with wildcards
> and so I'm using the "Filter Query" to query these. 

I don't understand this. Wildcard search works with the q parameter, if you
are asking that: &q=mail*

> field my
> queries are much slower (I have some queries like *word* or
> *word1* or
> *word2* that take about one minute to perform)
> Is there a way to optimize these queries (without removing
> the wildcards
> :))?

It is normal for leading wildcard searches to be slow. Using
ReversedWildcardFilterFactory at index time can speed them up.
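For example, in the index-time analyzer (the attribute values shown are
just the ones commonly used in the example schema):

<filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
        maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>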

But it is unusual to use both leading and trailing * operators. Why are you
doing this?

> c)Is there a way to do phrase queries with wildcards? Like
> "This solr*
> mail*"? Because the tests I made, when using quotes I think
> the wildcards are ignored.

By default it is not supported. With SOLR-1604 it is possible.

> d)How exactly works the pf (phrase fields) and ps (phrase
> slop)
> parameters and what's the difference for the proximity
> searches (ex:
> "word word2"~20)?

These parameters are specific to the dismax query parser.
http://wiki.apache.org/solr/DisMaxQParserPlugin
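For example, a dismax request along these lines (the field name body is
just an example):

&defType=dismax&q=word+word2&qf=body&pf=body&ps=20

still matches documents containing the terms anywhere, but boosts those
where the terms occur within 20 positions of each other in body. A
"word word2"~20 proximity query in the standard parser is, by contrast,
a hard requirement for matching.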



  


Re: Can't find org.apache.solr.client.solrj.embedded

2010-07-30 Thread Uwe Reh

Sorry,

I had inspected the ...core.jar three times without recognizing the
package. I was really blind. =8-)


thanks
Uwe

Am 26.07.2010 20:48, schrieb Chris Hostetter:

: where is a Jar, containing org.apache.solr.client.solrj.embedded?

Classes in the embedded package are useless w/o the rest of the Solr
internal "core" classes, so they are included directly in the
apache-solr-core-1.4.1.jar.

-Hoss



RE: wildcard and proximity searches

2010-07-30 Thread Frederico Azeiteiro
Hi Ahmet,

> a) I think wildcard search is by default "case sensitive"?
> Is there a way to make it case insensitive?
>> Wildcard searches are not analyzed. To search case-insensitively you can
>> lowercase query terms at the client side (while using LowerCaseFilter at
>> index time), e.g. Mail* => mail*
> 
> I discovered that the normal query type doesn't work with wildcards
> and so I'm using the "Filter Query" to query these. 
>> I don't understand this. Wildcard search works with the q parameter, if
>> you are asking that: &q=mail*

For the 2 points above, my bad. I'm already using the "lowercasefilter",
but I was not lowercasing the query with wildcards (the others are lowered
by the analyser). So it's working fine now! In my tests yesterday I was
probably testing &q=Mail* and &fq=mail* (and didn't notice the
difference), and I had read somewhere that it wasn't possible (probably in
an older solr version), so I came to the wrong conclusion that it wasn't
working.

>> But it is unusual to use both leading and trailing * operators. Why are
>> you doing this?

Yes I know, but I have a few queries that need this. I'll try the
"ReversedWildcardFilterFactory". 

>> By default it is not supported. With SOLR-1604 it is possible.
Ok then. I guess "SOLR-1604" is the answer to most of my problems. I'm
going to give it a try and then I'll share some feedback.

Thanks for your help and sorry for my newbie confusions. :)
Frederico

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: sexta-feira, 30 de Julho de 2010 12:09
To: solr-user@lucene.apache.org
Subject: RE: wildcard and proximity searches

> a) I think wildcard search is by default "case sensitive"?
> Is there a way to make it case insensitive?

Wildcard searches are not analyzed. To search case-insensitively you can
lowercase query terms at the client side (while using LowerCaseFilter at
index time), e.g. Mail* => mail*

 
> I discovered that the normal query type doesn't work with wildcards
> and so I'm using the "Filter Query" to query these. 

I don't understand this. Wildcard search works with the q parameter, if
you are asking that: &q=mail*

> field my
> queries are much slower (I have some queries like *word* or
> *word1* or
> *word2* that take about one minute to perform)
> Is there a way to optimize these queries (without removing
> the wildcards
> :))?

It is normal for leading wildcard searches to be slow. Using
ReversedWildcardFilterFactory at index time can speed them up.

But it is unusual to use both leading and trailing * operators. Why are
you doing this?

> c)Is there a way to do phrase queries with wildcards? Like
> "This solr*
> mail*"? Because the tests I made, when using quotes I think
> the wildcards are ignored.

By default it is not supported. With SOLR-1604 it is possible.

> d)How exactly works the pf (phrase fields) and ps (phrase
> slop)
> parameters and what's the difference for the proximity
> searches (ex:
> "word word2"~20)?

These parameters are specific to the dismax query parser.
http://wiki.apache.org/solr/DisMaxQParserPlugin



  


Re: Customize order field list ???

2010-07-30 Thread kenf_nc

I believe they come back alphabetically sorted (not sure if this is language
specific or not), so a quick way might be to change the name from createdate
to zz_createdate or something like that. 

Generally with XML you should not be worried about order however. It's
usually a sign of a design issue somewhere if the order of the fields
matters.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Customize-order-field-list-tp1007996p1008924.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Using Solr to perform range queries in Dspace

2010-07-30 Thread Mckeane



Thank you for your reply.

 This is some background as to what I am trying to achieve. I want to be able
to perform a search across numeric index ranges and get the results in
logical ordering instead of a lexicographic ordering using dspace. Currently,
if I do a search using the query var:[10 TO 50], and there are any values
with index 1000, 100, or a float, say 10.x, the result returns all these
values plus any other values that fall within the lexicographic range. A
similar result is returned if I enter any other numeric data type. In solr I
see that TrieDoubleField, TrieLongField, SortableIntField, etc. can be used
to perform numeric range queries and return the result in logical ordering. I
was thinking about using either the TrieField classes for int, double etc.
and/or the SortableIntField, SortableLongField classes defined in solr to
perform range query searches in dspace.
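As a sketch of that idea (field and type names are illustrative, using
Solr 1.4 syntax):

<fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
           omitNorms="true" positionIncrementGap="0"/>
<field name="var" type="tint" indexed="true" stored="true"/>

With a trie (or sortable) type, var:[10 TO 50] is evaluated numerically,
so values like 100 or 1000 no longer fall inside the range.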




-Mckeane
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Using-Solr-to-perform-range-queries-in-Dspace-tp987049p1008941.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr searching performance issues, using large documents

2010-07-30 Thread Li Li
Highlighting time is mainly spent on getting the field which you want
to highlight and tokenizing this field (if you don't store term vectors).
You can check what's wrong.

2010/7/30 Peter Spam :
> If I don't do highlighting, it's really fast.  Optimize has no effect.
>
> -Peter
>
> On Jul 29, 2010, at 11:54 AM, dc tech wrote:
>
>> Are you storing the entire log file text in SOLR? That's almost 3gb of
>> text that you are storing in SOLR. Try to:
>> 1) Check whether this is first-time performance or happens on repeat queries with the same fields
>> 2) Optimize the index and test performance again
>> 3) Index without storing the text and see what the performance looks like.
>>
>>
>> On 7/29/10, Peter Spam  wrote:
>>> Any ideas?  I've got 5000 documents with an average size of 850k each, and
>>> it sometimes takes 2 minutes for a query to come back when highlighting is
>>> turned on!  Help!
>>>
>>>
>>> -Pete
>>>
>>> On Jul 21, 2010, at 2:41 PM, Peter Spam wrote:
>>>
 From the mailing list archive, Koji wrote:

> 1. Provide another field for highlighting and use copyField to copy
> plainText to the highlighting field.

 and Lance wrote:
 http://www.mail-archive.com/solr-user@lucene.apache.org/msg35548.html

> If you want to highlight field X, doing the
> termOffsets/termPositions/termVectors will make highlighting that field
> faster. You should make a separate field and apply these options to that
> field.
>
> Now: doing a copyfield adds a "value" to a multiValued field. For a text
> field, you get a multi-valued text field. You should only copy one value
> to the highlighted field, so just copyField the document to your special
> field. To enforce this, I would add multiValued="false" to that field,
> just to avoid mistakes.
>
> So, all_text should be indexed without the term* attributes, and should
> not be stored. Then your document stored in a separate field that you use
> for highlighting and has the term* attributes.

 I've been experimenting with this, and here's what I've tried:

  <field name="..." ... multiValued="true" termVectors="true"
 termPositions="true" termOffsets="true" />
  <field name="..." ... multiValued="true" />
  <copyField ... />

 ... but it's still very slow (10+ seconds).  Why is it better to have two
 fields (one indexed but not stored, and the other not indexed but stored)
 rather than just one field that's both indexed and stored?


 From the Perf wiki page http://wiki.apache.org/solr/SolrPerformanceFactors

> If you aren't always using all the stored fields, then enabling lazy
> field loading can be a huge boon, especially if compressed fields are
> used.

 What does this mean?  How do you load a field lazily?

 Thanks for your time, guys - this has started to become frustrating, since
 it works so well, but is very slow!


 -Pete

 On Jul 20, 2010, at 5:36 PM, Peter Spam wrote:

> Data set: About 4,000 log files (will eventually grow to millions).
> Average log file is 850k.  Largest log file (so far) is about 70MB.
>
> Problem: When I search for common terms, the query time goes from under
> 2-3 seconds to about 60 seconds.  TermVectors etc are enabled.  When I
> disable highlighting, performance improves a lot, but is still slow for
> some queries (7 seconds).  Thanks in advance for any ideas!
>
>
> -Peter
>
>
> -
>
> 4GB RAM server
> % java -Xms2048M -Xmx3072M -jar start.jar
>
> -
>
> schema.xml changes:
>
>  <fieldType name="..." class="solr.TextField" ...>
>    <analyzer>
>      <tokenizer class="..."/>
>      <filter class="solr.WordDelimiterFilterFactory" ...
> generateNumberParts="0" catenateWords="0" catenateNumbers="0"
> catenateAll="0" splitOnCaseChange="0"/>
>      <filter class="..."/>
>    </analyzer>
>  </fieldType>
>
> ...
>
>  <field name="body" type="..." indexed="true" stored="true"
> multiValued="false" termVectors="true" termPositions="true"
> termOffsets="true" />
>  <field name="..." type="date" ... default="NOW" multiValued="false"/>
>  <field name="id" ... multiValued="false"/>
>  <field name="filename" ... multiValued="false"/>
>  <field name="version" ... multiValued="false"/>
>  <field name="device" ... multiValued="false"/>
>  <field name="first2md5" ... multiValued="false"/>
>  <field name="filesize" ... multiValued="false"/>
>  <field name="ckey" ... multiValued="false"/>
>
> ...
>
> <defaultSearchField>body</defaultSearchField>
>
> -
>
> solrconfig.xml changes:
>
>  2147483647
>  128
>
> -
>
> The query:
>
> rowStr = "&rows=10"
> facet =
> "&facet=true&facet.limit=10&facet.field=device&facet

Re: question about relevance

2010-07-30 Thread Bharat Jain
Hi,
   Thanks a lot for the info and your time. I think field collapse will work
for us. I looked at https://issues.apache.org/jira/browse/SOLR-236, but
which file should I use for the patch? We use solr-1.3.

Thanks
Bharat Jain


On Fri, Jul 30, 2010 at 12:53 AM, Chris Hostetter
wrote:

>
> : 1. There are user records of type A, B, C etc. (userId field in index is
> : common to all records)
> : 2. A user can have any number of A, B, C etc (e.g. think of A being a
> : language then user can know many languages like french, english, german
> etc)
> : 3. Records are currently stored as a document in index.
> : 4. A given query can match multiple records for the user
> : 5. If for a user more records are matched (e.g. if he knows both french
> and
> : german) then he is more relevant and should come top in UI. This is the
> : reason I wanted to add lucene scores assuming the greater score means
> more
> : relevance.
>
> if your goal is to get back "users" from each search, then you should
> probably change your indexing strategy so that each "user" has a single
> document -- fields like "language" can be multivalued, etc...
>
> then a search for "language:en language:fr" will return users who speak
> english or french, and the ones that speak both will score higher.
>
> if you really can't change the index structure, then essentially what you
> are looking for is a "field collapsing" solution on the userId field,
> where you want each collapsed group to get a cumulative score.  i don't
> know if the existing field collapsing patches support this -- if you are
> already willing/capable to do it in the client then that may be the
> simplest thing to support moving forward.
>
> Adding the scores is certainly one metric you could use -- it's generally
> suspicious to try and imply too much meaning to scores in lucene/solr, but
> that's because people typically try to imply broader absolute meaning.  in
> the case of a single query the scores are relative to each other, and adding
> up all the scores for a given userId is approximately what would happen in
> my example above -- except that there is also a "coord" factor that would
> penalize documents that only match one clause ... it's complicated, but
> as an approximation adding the scores might give you what you are looking
> for -- only you can know for sure based on your specific data.
>
>
>
> -Hoss
>
>


Document Boost with Solr Extraction - SolrContentHandler

2010-07-30 Thread jayendra patil
We are using Solr Extract Handler for indexing document metadata with
attachments. (/update/extract)
However, the SolrContentHandler doesn't seem to support the index-time document
boost attribute.
Probably, document.setDocumentBoost(Float.parseFloat(boost)) is missing.

Regards,
Jayendra


Re: advice on creating a solr index when data source is from many unrelated db tables

2010-07-30 Thread S Ahmed
So I have tables like this:

Users
UserSales
UserHistory
UserAddresses
UserNotes
ClientAddress
CalenderEvent
Articles
Blogs

Just seems odd to me, jamming all these tables into a single index.  But I
guess the idea of using a 'type' field to qualify exactly what I am
searching is a good idea, in case I need to filter for only 'articles' or
blogs or contacts etc.

But there might be 50 fields if I do this, no?



On Fri, Jul 30, 2010 at 4:01 AM, Chantal Ackermann <
chantal.ackerm...@btelligent.de> wrote:

> Hi Ahmed,
>
> fields that are empty do not impact the index. It's different from a
> database.
> I have text fields for different languages and per document there is
> always only one of the languages set (the text fields for the other
> languages are empty/not set). It works all very well and fast.
>
> I wonder more about what you describe as "unrelated data" - why would
> you want to put unrelated data into a single index? If you want to
> search on all the data and return mixed results there surely must be
> some kind of relation between the documents?
>
> Chantal
>
> On Thu, 2010-07-29 at 21:33 +0200, S Ahmed wrote:
> > I understand (and it's straightforward) when you want to create an index
> > for something simple like Products.
> >
> > But how do you go about creating a Solr index when you have data coming
> from
> > 10-15 database tables, and the tables have unrelated data?
> >
> > The issue is then you would have many 'columns' in your index, and they
> > will be NULL for much of the data since you are trying to shove 15 db
> > tables into a single Solr/Lucene index.
> >
> >
> > This must be a common problem, what are the potential solutions?
>
>
>
>


Re: Solr searching performance issues, using large documents

2010-07-30 Thread Peter Spam
I do store term vector:

<field name="body" type="..." indexed="true" stored="true" multiValued="false"
 termVectors="true" termPositions="true" termOffsets="true" />

-Pete

On Jul 30, 2010, at 7:30 AM, Li Li wrote:

> Highlighting time is mainly spent on getting the field which you want
> to highlight and tokenizing this field (if you don't store term vectors).
> You can check what's wrong.
> 
> 2010/7/30 Peter Spam :
>> If I don't do highlighting, it's really fast.  Optimize has no effect.
>> 
>> -Peter
>> 
>> On Jul 29, 2010, at 11:54 AM, dc tech wrote:
>> 
>>> Are you storing the entire log file text in SOLR? That's almost 3gb of
>>> text that you are storing in SOLR. Try to:
>>> 1) Check whether this is first-time performance or happens on repeat queries with the same fields
>>> 2) Optimize the index and test performance again
>>> 3) Index without storing the text and see what the performance looks like.
>>> 
>>> 
>>> On 7/29/10, Peter Spam  wrote:
 Any ideas?  I've got 5000 documents with an average size of 850k each, and
 it sometimes takes 2 minutes for a query to come back when highlighting is
 turned on!  Help!
 
 
 -Pete
 
 On Jul 21, 2010, at 2:41 PM, Peter Spam wrote:
 
> From the mailing list archive, Koji wrote:
> 
>> 1. Provide another field for highlighting and use copyField to copy
>> plainText to the highlighting field.
> 
> and Lance wrote:
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg35548.html
> 
>> If you want to highlight field X, doing the
>> termOffsets/termPositions/termVectors will make highlighting that field
>> faster. You should make a separate field and apply these options to that
>> field.
>> 
>> Now: doing a copyfield adds a "value" to a multiValued field. For a text
>> field, you get a multi-valued text field. You should only copy one value
>> to the highlighted field, so just copyField the document to your special
>> field. To enforce this, I would add multiValued="false" to that field,
>> just to avoid mistakes.
>> 
>> So, all_text should be indexed without the term* attributes, and should
>> not be stored. Then your document stored in a separate field that you use
>> for highlighting and has the term* attributes.
> 
> I've been experimenting with this, and here's what I've tried:
> 
>   <field name="..." ... multiValued="true" termVectors="true"
> termPositions="true" termOffsets="true" />
>   <field name="..." ... multiValued="true" />
>   <copyField ... />
> 
> ... but it's still very slow (10+ seconds).  Why is it better to have two
> fields (one indexed but not stored, and the other not indexed but stored)
> rather than just one field that's both indexed and stored?
> 
> 
> From the Perf wiki page http://wiki.apache.org/solr/SolrPerformanceFactors
> 
>> If you aren't always using all the stored fields, then enabling lazy
>> field loading can be a huge boon, especially if compressed fields are
>> used.
> 
> What does this mean?  How do you load a field lazily?
> 
> Thanks for your time, guys - this has started to become frustrating, since
> it works so well, but is very slow!
> 
> 
> -Pete
> 
> On Jul 20, 2010, at 5:36 PM, Peter Spam wrote:
> 
>> Data set: About 4,000 log files (will eventually grow to millions).
>> Average log file is 850k.  Largest log file (so far) is about 70MB.
>> 
>> Problem: When I search for common terms, the query time goes from under
>> 2-3 seconds to about 60 seconds.  TermVectors etc are enabled.  When I
>> disable highlighting, performance improves a lot, but is still slow for
>> some queries (7 seconds).  Thanks in advance for any ideas!
>> 
>> 
>> -Peter
>> 
>> 
>> -
>> 
>> 4GB RAM server
>> % java -Xms2048M -Xmx3072M -jar start.jar
>> 
>> -
>> 
>> schema.xml changes:
>> 
>>  <fieldType name="..." class="solr.TextField" ...>
>>    <analyzer>
>>      <tokenizer class="..."/>
>>      <filter class="solr.WordDelimiterFilterFactory" ...
>> generateNumberParts="0" catenateWords="0" catenateNumbers="0"
>> catenateAll="0" splitOnCaseChange="0"/>
>>      <filter class="..."/>
>>    </analyzer>
>>  </fieldType>
>>
>> ...
>>
>>  <field name="body" type="..." indexed="true" stored="true"
>> multiValued="false" termVectors="true" termPositions="true"
>> termOffsets="true" />
>>  <field name="..." type="date" ... default="NOW" multiValued="false"/>
>>  <field name="id" ... multiValued="false"/>
>>  <field name="filename" ... multiValued="false"/>
>>  <field name="version" ... multiValued="false"/>
>>  <field name="device" ... multiValued="false"/>
>>  <field name="first2md5" ... multiValued="false"/>
>>  <field name="filesize" ... multiValued="false"/>
>>  <field name="ckey" ... multiValued="false"/>
>>
>> ...
>>
>> <defaultSearchField>body</defaultSearchField>
>> 
>> -
>> 
>> solrconfig.xml changes:
>> 
>>  2147483647
>>  128
>>>

Programmatically retrieving numDocs (or any other statistic)

2010-07-30 Thread John DeRosa
I want to programmatically retrieve the number of indexed documents. I.e., get 
the value of numDocs.

The only two ways I've come up with are searching for "*:*" and reporting the
hit count, or sending an Http GET to
http://xxx.xx.xxx.xxx:8080/solr/admin/stats.jsp#core and searching for
<stat name="numDocs" > in the response.

Both seem to be overkill. Is there an easier way to ask SolrIndexSearcher, 
"what's numDocs"?

(I'm doing this in Python, using Pysolr, if that matters.)

Thanks!



Re: Help with schema design

2010-07-30 Thread Erick Erickson
I'd just index the eventtype, eventby and eventtime as separate fields. Then
queries look something like eventtype:update AND eventtime:[... TO *].

Similarly, for events updated by pramod, the query would be something like:
eventby:pramod AND eventtype:update
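A sketch of the corresponding field definitions (the type names assume
the stock example schema):

<field name="eventtype" type="string" indexed="true" stored="true"/>
<field name="eventby"   type="string" indexed="true" stored="true"/>
<field name="eventtime" type="date"   indexed="true" stored="true"/>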

HTH
Erick

On Wed, Jul 28, 2010 at 11:05 PM, Pramod Goyal wrote:

> Hi,
>I have a use case where i get a document and a list of events that has
> happened on the document. For example
>
> First document:
>  Some text content
> Events:
>  Event Type   Event By   Event Time
>  Update       Pramod     06062010 2:30:00
>  Update       Raj        06062010 2:30:00
>  View         Rahul      07062010 1:30:00
>
>
> I would like to support queries like: get all documents with Event Type = ?
> and Event Time greater than ?, and also queries like: get all the documents
> updated by Pramod.
> How should i design my schema to support this use case.
>
> Thanks,
> Regards,
> Pramod Goyal
>


Re: Solr using 1500 threads - is that normal?

2010-07-30 Thread Erick Erickson
Glad to help. Do be aware that there are several config values that influence
the commit frequency; they might also be relevant.
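For instance, the autocommit block in solrconfig.xml can trigger commits
on its own (the values below are only an example):

<autoCommit>
  <maxDocs>10000</maxDocs> <!-- commit after this many added docs -->
  <maxTime>60000</maxTime> <!-- or after this many milliseconds -->
</autoCommit>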

Best
Erick

On Thu, Jul 29, 2010 at 5:11 AM, Christos Constantinou <
ch...@simpleweb.co.uk> wrote:

> Eric,
>
> Thank you very much for the indicators! I had a closer look at the commit
> intervals and it seems that the application is gradually increasing the
> commits to almost once per second after some time - something that was
> hidden in the massive amount of queries in the log file. I have changed the
> code to use commitWithin rather than commit and everything looks much better
> now. I believe that might have solved the problem so thanks again.
>
> Christos
>
> On 29 Jul 2010, at 01:44, Erick Erickson wrote:
>
> > Your commits are very suspect. How often are you making changes to your
> > index?
> > Do you have autocommit on? Do you commit when updating each document?
> > Committing
> > too often and consequently firing off warmup queries is the first place
> I'd
> > look. But I
> > agree with dc tech, 1,500 is wy more than I would expect.
> >
> > Best
> > Erick
> >
> >
> >
> > On Wed, Jul 28, 2010 at 6:53 AM, Christos Constantinou <
> > ch...@simpleweb.co.uk> wrote:
> >
> >> Hi,
> >>
> >> Solr seems to be crashing after a JVM exception that new threads cannot
> be
> >> created. I am writing in hope of advice from someone that has
> experienced
> >> this before. The exception that is causing the problem is:
> >>
> >> Exception in thread "btpool0-5" java.lang.OutOfMemoryError: unable to
> >> create new native thread
> >>
> >> The memory that is allocated to Solr is 3072MB, which should be enough
> >> memory for a ~6GB data set. The documents are not big either, they have
> >> around 10 fields of which only one stores large text ranging between
> 1k-50k.
> >>
> >> The top command at the time of the crash shows Solr using around 1500
> >> threads, which I assume is not normal. Could it be that the threads are
> >> crashing one by one and new ones are created to cope with the queries?
> >>
> >> In the log file, right after the the exception, there are several
> thousand
> >> commits before the server stalls completely. Normally, the log file
> would
> >> report 20-30 document existence queries per second, then 1 commit per
> 5-30
> >> seconds, and some more infrequent faceted document searches on the data.
> >> However after the exception, there are only commits until the end of the
> log
> >> file.
> >>
> >> I am wondering if anyone has experienced this before or if it is some
> sort
> >> of known bug from Solr 1.4? Is there a way to increase the details of
> the
> >> exception in the logfile?
> >>
> >> I am attaching the output of a grep Exception command on the logfile.
> >>
> >> Jul 28, 2010 8:19:31 AM org.apache.solr.common.SolrException log
> >> SEVERE: org.apache.solr.common.SolrException: Error opening new
> searcher.
> >> exceeded limit of maxWarmingSearchers=2, try again later.
> >> Jul 28, 2010 8:19:31 AM org.apache.solr.common.SolrException log
> >> SEVERE: org.apache.solr.common.SolrException: Error opening new
> searcher.
> >> exceeded limit of maxWarmingSearchers=2, try again later.
> >> Jul 28, 2010 8:19:31 AM org.apache.solr.common.SolrException log
> >> SEVERE: org.apache.solr.common.SolrException: Error opening new
> searcher.
> >> exceeded limit of maxWarmingSearchers=2, try again later.
> >> Jul 28, 2010 8:19:32 AM org.apache.solr.common.SolrException log
> >> SEVERE: org.apache.solr.common.SolrException: Error opening new
> searcher.
> >> exceeded limit of maxWarmingSearchers=2, try again later.
> >> Jul 28, 2010 8:20:18 AM org.apache.solr.common.SolrException log
> >> SEVERE: org.apache.solr.common.SolrException: Error opening new
> searcher.
> >> exceeded limit of maxWarmingSearchers=2, try again later.
> >> Jul 28, 2010 8:20:48 AM org.apache.solr.common.SolrException log
> >> SEVERE: org.apache.solr.common.SolrException: Error opening new
> searcher.
> >> exceeded limit of maxWarmingSearchers=2, try again later.
> >> Jul 28, 2010 8:22:43 AM org.apache.solr.common.SolrException log
> >> SEVERE: org.apache.solr.common.SolrException: Error opening new
> searcher.
> >> exceeded limit of maxWarmingSearchers=2, try again later.
> >> Jul 28, 2010 8:27:53 AM org.apache.solr.common.SolrException log
> >> SEVERE: org.apache.solr.common.SolrException: Error opening new
> searcher.
> >> exceeded limit of maxWarmingSearchers=2, try again later.
> >> Jul 28, 2010 8:27:53 AM org.apache.solr.common.SolrException log
> >> SEVERE: org.apache.solr.common.SolrException: Error opening new
> searcher.
> >> exceeded limit of maxWarmingSearchers=2, try again later.
> >> Jul 28, 2010 8:27:53 AM org.apache.solr.common.SolrException log
> >> SEVERE: org.apache.solr.common.SolrException: Error opening new
> searcher.
> >> exceeded limit of maxWarmingSearchers=2, try again later.
> >> Jul 28, 2010 8:28:50 AM org.apache.solr.common.SolrException log
> >> SEVERE: org.apac

Re: Solr Indexing slows down

2010-07-30 Thread Erick Erickson
See the subject about 1500 threads. The first place I'd look is how
often you're committing. If you're committing before the warmup queries
from the previous commit have done their magic, you might be getting
into a death spiral.

HTH
Erick

On Thu, Jul 29, 2010 at 7:02 AM, Peter Karich  wrote:

> Hi,
>
> I am indexing a solr 1.4.0 core and committing gets slower and slower.
> Starting from 3-5 seconds for ~200 documents and ending with over 60
> seconds after 800 commits. Then, if I reload the index, it is as fast
> as before! And today I read a similar thread [1] and indeed: if I
> set autowarming for the caches to 0 the slowdown disappears.
>
> BUT at the same time I would like to offer searching on that core, which
> would be dramatically slowed down (due to no autowarming).
>
> Does someone know a better solution to avoid index-slow-down?
>
> Regards,
> Peter.
>
> [1] http://www.mail-archive.com/solr-user@lucene.apache.org/msg20785.html
>


Problems running on tomcat

2010-07-30 Thread Claudio Devecchi
Hi,

I'm new with solr and I'm doing my first installation under tomcat, I
followed the documentation on link (
http://wiki.apache.org/solr/SolrTomcat#Installing_Tomcat_6) but there are
some problems.
The http://localhost:8080/solr/admin page works fine, but in some cases, for
example when viewing my schema.xml from the admin console, the error below
happens:

HTTP Status 404 - /solr/admin/file/index.jsp

Has somebody already seen this? Is there some trick to it?

Tks

-- 
Claudio Devecchi


Re: Solr Indexing slows down

2010-07-30 Thread Peter Karich
Hi Erick!

thanks for the response!
I will answer your questions ;-)

> How often are you making changes to your index?

Every 30-60 seconds. Too heavy?


> Do you have autocommit on?

No.


> Do you commit when updating each document?

No. I commit after a batch update of 200 documents


> Committing too often and consequently firing off warmup queries is the first 
> place I'd look.

Why is committing firing warmup queries? Is there any documentation about
this subject?
How can I be sure that the previous commit has done its magic?

> there are several config values that influence the commit frequency


I now know the autowarm and the mergeFactor config. What else? Is this
documentation complete:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed ?

Regards,
Peter.

> See the subject about 1500 threads. The first place I'd look is how
> often you're committing. If you're committing before the warmup queries
> from the previous commit have done their magic, you might be getting
> into a death spiral.
>
> HTH
> Erick
>
> On Thu, Jul 29, 2010 at 7:02 AM, Peter Karich  wrote:
>
>   
>> Hi,
>>
>> I am indexing a solr 1.4.0 core and committing gets slower and slower.
>> Starting from 3-5 seconds for ~200 documents and ending with over 60
>> seconds after 800 commits. Then, if I reloaded the index, it is as fast
>> as before! And today I have read a similar thread [1] and indeed: if I
>> set autowarming for the caches to 0 the slowdown disappears.
>>
>> BUT at the same time I would like to offer searching on that core, which
>> would be dramatically slowed down (due to no autowarming).
>>
>> Does someone know a better solution to avoid index-slow-down?
>>
>> Regards,
>> Peter.
>>
>> [1] http://www.mail-archive.com/solr-user@lucene.apache.org/msg20785.html
>>
>> 


Re: Programmatically retrieving numDocs (or any other statistic)

2010-07-30 Thread Peter Karich
Both approaches are ok, I think. (although I don't know the python API)
BTW: If you query q=*:* then add rows=0 to avoid some traffic.
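For example, something like:

http://localhost:8080/solr/select?q=*:*&rows=0

and read the numFound attribute of the result element (host, port and
path are whatever your installation uses).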

Regards,
Peter.

> I want to programmatically retrieve the number of indexed documents. I.e., 
> get the value of numDocs.
>
> The only two ways I've come up with are searching for "*:*" and reporting the
> hit count, or sending an Http GET to
> http://xxx.xx.xxx.xxx:8080/solr/admin/stats.jsp#core and searching for
> <stat name="numDocs" > in the response.
>
> Both seem to be overkill. Is there an easier way to ask SolrIndexSearcher, 
> "what's numDocs"?
>
> (I'm doing this in Python, using Pysolr, if that matters.)
>
> Thanks!


Re: Solr searching performance issues, using large documents

2010-07-30 Thread Peter Karich
Hi Peter :-),

did you already try other values for

hl.maxAnalyzedChars=2147483647

? Also regular expression highlighting is more expensive, I think.
What does the 'fuzzy' variable mean? If you use it to query via
"~someTerm" instead of "someTerm",
then you should try the trunk of solr, which is a lot faster for fuzzy and
other wildcard searches.
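For example, adding something like &hl.maxAnalyzedChars=102400 to the
query would limit snippet analysis to roughly the first 100k of each
document instead of the whole file (that exact value is just a guess to
experiment with).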

Regards,
Peter.
 
> Data set: About 4,000 log files (will eventually grow to millions).  Average 
> log file is 850k.  Largest log file (so far) is about 70MB.
>
> Problem: When I search for common terms, the query time goes from under 2-3 
> seconds to about 60 seconds.  TermVectors etc are enabled.  When I disable 
> highlighting, performance improves a lot, but is still slow for some queries 
> (7 seconds).  Thanks in advance for any ideas!
>
>
> -Peter
>
>
> -
>
> 4GB RAM server
> % java -Xms2048M -Xmx3072M -jar start.jar
>
> -
>
> schema.xml changes:
>
> <fieldType name="..." class="solr.TextField" ...>
>   <analyzer>
>     <tokenizer class="..."/>
>     <filter class="solr.WordDelimiterFilterFactory" ...
> generateNumberParts="0" catenateWords="0" catenateNumbers="0"
> catenateAll="0" splitOnCaseChange="0"/>
>     <filter class="..."/>
>   </analyzer>
> </fieldType>
>
> ...
>
> <field name="body" type="..." indexed="true" stored="true"
> multiValued="false" termVectors="true" termPositions="true"
> termOffsets="true" />
> <field name="..." type="date" ... default="NOW" multiValued="false"/>
> <field name="id" ... multiValued="false"/>
> <field name="filename" ... multiValued="false"/>
> <field name="version" ... multiValued="false"/>
> <field name="device" ... multiValued="false"/>
> <field name="first2md5" ... multiValued="false"/>
> <field name="filesize" ... multiValued="false"/>
> <field name="ckey" ... multiValued="false"/>
>
> ...
>
> <defaultSearchField>body</defaultSearchField>
>
> -
>
> solrconfig.xml changes:
>
> 2147483647
> 128
>
> -
>
> The query:
>
> rowStr = "&rows=10"
> facet = 
> "&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version"
> fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
> termvectors = "&tv=true&qt=tvrh&tv.all=true"
> hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
> regexv = "(?m)^.*\n.*\n.*$"
> hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) + 
> "&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
> justq = '&q=' + CGI::escape('body:' + fuzzy + p['q'].to_s.gsub(/\\/, 
> '').gsub(/([:~!<>="])/,'\1') + fuzzy + minLogSizeStr)
>
> thequery = '/solr/select?timeAllowed=5000&wt=ruby' + (p['fq'].empty? ? '' : 
> ('&fq='+p['fq'].to_s) ) + justq + rowStr + facet + fields + termvectors + hl 
> + hl_regex
>
> baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) + '&rows=' + 
> p['rows'].to_s + '&minLogSize=' + p['minLogSize'].to_s
>
>
>   


-- 
http://karussell.wordpress.com/



Good list of English words that get "butchered" by Porter Stemmer

2010-07-30 Thread Otis Gospodnetic
Hello,

I'm looking for a list of English words that, when stemmed by Porter stemmer,
end up in the same stem as some similar, but unrelated words.  Below are some
examples:

# this gets stemmed to "iron", so if you search for "ironic", you'll get "iron" 
matches
ironic

# same stem as animal
anime
animated 
animation
animations

I imagine such a list could be added to the example protwords.txt

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



Re: Solr Indexing slows down

2010-07-30 Thread Otis Gospodnetic
Peter, there are events in solrconfig where you define warm up queries when a 
new searcher is opened.

There are also cache settings that play a role here.

30-60 seconds is pretty frequent for Solr.
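The relevant solrconfig.xml piece looks roughly like this (the query
shown is only a placeholder):

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">some typical query</str>
      <str name="rows">10</str>
    </lst>
  </arr>
</listener>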

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Peter Karich 
> To: solr-user@lucene.apache.org
> Sent: Fri, July 30, 2010 4:06:48 PM
> Subject: Re: Solr Indexing slows down
> 
> Hi Erick!
> 
> thanks for the response!
> I will answer your questions  ;-)
> 
> > How often are you making changes to your index?
> 
> Every  30-60 seconds. Too heavy?
> 
> 
> > Do you have autocommit  on?
> 
> No.
> 
> 
> > Do you commit when updating each  document?
> 
> No. I commit after a batch update of 200  documents
> 
> 
> > Committing too often and consequently firing off  warmup queries is the 
> > first 
>place I'd look.
> 
> Why is commiting firing  warmup queries? Is there any documentation about
> this subject?
> How can I  be sure that the previous commit has done its magic?
> 
> > there are  several config values that influence the commit frequency
> 
> 
> I now know  the autowarm and the mergeFactor config. What else? Is this
> documentation  complete:
> http://wiki.apache.org/lucene-java/ImproveIndexingSpeed ?
> 
> Regards,
> Peter.
> 
> > See the subject about 1500 threads. The  first place I'd look is how
> > often you're committing. If you're  committing before the warmup queries
> > from the previous commit have done  their magic, you might be getting
> > into a death spiral.
> >
> >  HTH
> > Erick
> >
> > On Thu, Jul 29, 2010 at 7:02 AM, Peter Karich wrote:
> >
> >  
> >> Hi,
> >>
> >> I am indexing a solr 1.4.0 core and committing gets slower and slower.
> >>  Starting from 3-5 seconds for ~200 documents and ending with over 60
> >>  seconds after 800 commits. Then, if I reloaded the index, it is as  fast
> >> as before! And today I have read a similar thread [1] and  indeed: if I
> >> set autowarming for the caches to 0 the slowdown  disappears.
> >>
> >> BUT at the same time I would like to offer  searching on that core, which
> >> would be dramatically slowed down (due  to no autowarming).
> >>
> >> Does someone know a better solution  to avoid index-slow-down?
> >>
> >> Regards,
> >>  Peter.
> >>
> >> [1] http://www.mail-archive.com/solr-user@lucene.apache.org/msg20785.html
> >>
> >> 
> 


Re: Programmatically retrieving numDocs (or any other statistic)

2010-07-30 Thread Otis Gospodnetic
I suppose you could write a component that just gets this info from
SolrIndexSearcher and writes it into the response?
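An untested sketch of such a component (method names assume the Solr 1.4
API; double-check them against your version):

import java.io.IOException;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class NumDocsComponent extends SearchComponent {
  @Override
  public void prepare(ResponseBuilder rb) throws IOException {}

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // numDocs straight from the searcher's IndexReader
    rb.rsp.add("numDocs", rb.req.getSearcher().getReader().numDocs());
  }

  @Override
  public String getDescription() { return "numDocs component"; }
  @Override
  public String getSourceId() { return "numDocsComponent"; }
  @Override
  public String getSource() { return "numDocsComponent"; }
  @Override
  public String getVersion() { return "1.0"; }
}

It would be registered with <searchComponent name="numdocs"
class="NumDocsComponent"/> and added to a handler's components list.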

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: John DeRosa 
> To: solr-user@lucene.apache.org
> Sent: Fri, July 30, 2010 1:39:03 PM
> Subject: Programmatically retrieving numDocs (or any other statistic)
> 
> I want to programmatically retrieve the number of indexed documents. I.e., 
> get  
>the value of numDocs.
> 
> The only two ways I've come up with are searching for "*:*" and reporting
> the hit count, or sending an Http GET to
> http://xxx.xx.xxx.xxx:8080/solr/admin/stats.jsp#core and searching for
> <stat name="numDocs" > in the response.
> 
> Both seem  to be overkill. Is there an easier way to ask SolrIndexSearcher, 
>"what's  numDocs"?
> 
> (I'm doing this in Python, using Pysolr, if that  matters.)
> 
> Thanks!
> 
> 


Re: question about relevance

2010-07-30 Thread Otis Gospodnetic
May I suggest looking at some of the related issues, say SOLR-1682


This issue is related to:
  SOLR-1682   Implement CollapseComponent
  SOLR-1311   pseudo-field-collapsing
  LUCENE-1421 Ability to group search results by field
  SOLR-1773   Field Collapsing (lightweight version)
  SOLR-237    Field collapsing


Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Bharat Jain 
> To: solr-user@lucene.apache.org
> Sent: Fri, July 30, 2010 10:40:19 AM
> Subject: Re: question about relevance
> 
> Hi,
>    Thanks a lot for the info and your time. I think field collapse will work
> for us. I looked at https://issues.apache.org/jira/browse/SOLR-236, but
> which file should I use for the patch? We use solr-1.3.
> 
> Thanks
> Bharat Jain
> 
> 
> On Fri,  Jul 30, 2010 at 12:53 AM, Chris Hostetter
> wrote:
> 
> >
> >  : 1. There are user records of type A, B, C etc. (userId field in index  is
> > : common to all records)
> > : 2. A user can have any number of  A, B, C etc (e.g. think of A being a
> > : language then user can know many  languages like french, english, german
> > etc)
> > : 3. Records are  currently stored as a document in index.
> > : 4. A given query can match  multiple records for the user
> > : 5. If for a user more records are  matched (e.g. if he knows both french
> > and
> > : german) then he is  more relevant and should come top in UI. This is the
> > : reason I wanted  to add lucene scores assuming the greater score means
> > more
> > :  relevance.
> >
> > if your goal is to get back "users" from each search, then you should
> > probably change your indexing strategy so that each "user" has a single
> > document -- fields like "language" can be multivalued, etc...
> >
> > then a search for "language:en language:fr" will return users who speak
> > english or french, and the ones that speak both will score higher.
> >
> > if you really can't change the index structure, then essentially what you
> > are looking for is a "field collapsing" solution on the userId field,
> > where you want each collapsed group to get a cumulative score.  i don't
> > know if the existing field collapsing patches support this -- if you are
> > already willing/capable to do it in the client then that may be the
> > simplest thing to support moving forward.
> >
> > Adding the scores is certainly one metric you could use -- it's generally
> > suspicious to try and imply too much meaning to scores in lucene/solr, but
> > that's because people typically try to imply broader absolute meaning.  in
> > the case of a single query the scores are relative to each other, and adding
> > up all the scores for a given userId is approximately what would happen in
> > my example above -- except that there is also a "coord" factor that would
> > penalize documents that only match one clause ... it's complicated, but
> > as an approximation adding the scores might give you what you are looking
> > for -- only you can know for sure based on your specific data.
> >
> >
> >
> > -Hoss
> >
> >
> 


Re: Good list of English words that get "butchered" by Porter Stemmer

2010-07-30 Thread Walter Underwood
Some collisions are listed here:

http://www.attivio.com/blog/34-attivio-blog/333-doing-things-with-words-part-three-stemming-and-lemmatization.html

Have you asked Martin Porter? You can find his e-mail here: 
http://tartarus.org/~martin/

wunder

On Jul 30, 2010, at 1:41 PM, Otis Gospodnetic wrote:

> Hello,
> 
> I'm looking for a list of English  words that, when stemmed by Porter 
> stemmer, 
> end up in the same stem as  some similar, but unrelated words.  Below are 
> some 
> examples:
> 
> # this gets stemmed to "iron", so if you search for "ironic", you'll get 
> "iron" 
> matches
> ironic
> 
> # same stem as animal
> anime
> animated 
> animation
> animations
> 
> I imagine such a list could be added to the example protwords.txt
> 
> Thanks,
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/







Stem collision, word protection, synonym hack

2010-07-30 Thread Otis Gospodnetic
Hello,

I'm wondering if anyone has good ideas for handling the following (Porter) 
stemming problem.
The word "city" gets stemmed to "citi".  But "citi" is short for "citibank",
so we have a conflict - the stems of both "city" and "citi" are "citi", so
when you search for "city", you will get matches that are really about
citi(bank).

Now, we could put "citi" in the "do not stem" list (protwords.txt), but it
will be of no use because "citi" is already in the fully stemmed form.  This
leaves the option of not stemming "cities" or "city" (and perhaps making
"city" a synonym for "cities" as a workaround) by adding those words to
protwords.txt, but this feels like a kluge.

Are there more elegant solutions for cases like this one?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



Re: Stem collision, word protection, synonym hack

2010-07-30 Thread Robert Zotter

Otis,

https://issues.apache.org/jira/browse/LUCENE-2055 may be of some help.

cheers

On 7/30/10 2:18 PM, Otis Gospodnetic wrote:

Hello,

I'm wondering if anyone has good ideas for handling the following (Porter)
stemming problem.
The word "city" gets stemmed to "citi".  But "citi" is short for "citibank",
so we have a conflict - the stems of both "city" and "citi" are "citi", so
when you search for "city", you will get matches that are really about
citi(bank).

Now, we could put "citi" in the "do not stem" list (protwords.txt), but it
will be of no use because "citi" is already in the fully stemmed form.  This
leaves the option of not stemming "cities" or "city" (and perhaps making
"city" a synonym for "cities" as a workaround) by adding those words to
protwords.txt, but this feels like a kluge.

Are there more elegant solutions for cases like this one?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

   




Re: Programmatically retrieving numDocs (or any other statistic)

2010-07-30 Thread John DeRosa
Thanks!

On Jul 30, 2010, at 1:11 PM, Peter Karich wrote:

> Both approaches are ok, I think. (although I don't know the python API)
> BTW: If you query q=*:* then add rows=0 to avoid some traffic.
> 
> Regards,
> Peter.
> 
>> I want to programmatically retrieve the number of indexed documents. I.e., 
>> get the value of numDocs.
>> 
>> The only two ways I've come up with are searching for "*:*" and reporting
>> the hit count, or sending an Http GET to
>> http://xxx.xx.xxx.xxx:8080/solr/admin/stats.jsp#core and searching for
>> <stat name="numDocs" > in the response.
>> 
>> Both seem to be overkill. Is there an easier way to ask SolrIndexSearcher, 
>> "what's numDocs"?
>> 
>> (I'm doing this in Python, using Pysolr, if that matters.)
>> 
>> Thanks!



Re: Good list of English words that get "butchered" by Porter Stemmer

2010-07-30 Thread Yonik Seeley
On Fri, Jul 30, 2010 at 4:41 PM, Otis Gospodnetic
 wrote:
> I'm looking for a list of English  words that, when stemmed by Porter stemmer,
> end up in the same stem as  some similar, but unrelated words.  Below are some
> examples:
>
> # this gets stemmed to "iron", so if you search for "ironic", you'll get 
> "iron"
> matches
> ironic
>
> # same stem as animal
> anime
> animated
> animation
> animations
>
> I imagine such a list could be added to the example protwords.txt

+1

No reason to make everyone come up with their own list.
Unless a good list already exists out there... we could semi-automate
it by running a large corpus through the stemmer and then for each
stem, list the original words.  The manual part would be looking at
the output to see the collisions (unless someone has a better idea).
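A rough sketch of that semi-automated pass (assuming the Snowball Porter
stemmer classes that ship with Lucene's snowball contrib, and an input
file with one word per line):

import java.io.File;
import java.util.*;
import org.tartarus.snowball.ext.PorterStemmer;

public class StemCollisions {
  public static void main(String[] args) throws Exception {
    PorterStemmer stemmer = new PorterStemmer();
    // stem -> all distinct corpus words that reduced to it
    Map<String, Set<String>> byStem = new TreeMap<String, Set<String>>();
    Scanner in = new Scanner(new File(args[0]));
    while (in.hasNext()) {
      String word = in.next().toLowerCase();
      stemmer.setCurrent(word);
      stemmer.stem();
      String stem = stemmer.getCurrent();
      Set<String> words = byStem.get(stem);
      if (words == null) byStem.put(stem, words = new TreeSet<String>());
      words.add(word);
    }
    // stems reached by more than one word are the candidates to eyeball
    for (Map.Entry<String, Set<String>> e : byStem.entrySet())
      if (e.getValue().size() > 1)
        System.out.println(e.getKey() + " <- " + e.getValue());
  }
}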

-Yonik
http://www.lucidimagination.com


Some basic DataImportHandler questions

2010-07-30 Thread Harry Smith
Just starting with DataImportHandler and had a few simple questions.

Is there a location for more in depth documentation other than
http://wiki.apache.org/solr/DataImportHandler?

Specifically I was looking for a detailed document outlining
data-config.xml, the fields and attributes and how they are used.

* Is there a way to dynamically generate field elements from the
supplied sql statement?

Example: Suppose one has a table of 100 fields. Entering this manually
for each field is not very efficient.

ie, if table has only 3 columns this is easy enough...

<entity name="item" query="select ID, NAME, MANU from ITEM">
    <field column="ID" name="id" />
    <field column="NAME" name="name" />
    <field column="MANU" name="manu" />
</entity>

What are the options if the ITEM table has dozens or hundreds of columns?


* Is there a way to apply insert logic based on the value of the incoming field?

My specific use case would be, if the incoming value is null, do not
add to Solr.

ie Record is :
ID : 50
NAME : Blahblah
MANU : null

<entity name="item" query="select ID, NAME, MANU from ITEM">
    <field column="ID" name="id" />
    <field column="NAME" name="name" />
    <field column="MANU" name="manu" />
</entity>

Using the preceding data-config.xml, is there a way to ignore null
fields altogether? I see some special commands listed, such as
$skipRecord; is there some type of $skipField operation?

Thanks


Re: Solr Indexing slows down

2010-07-30 Thread Peter Karich
Hi Otis,

does it mean that a new searcher is opened after I commit?
I thought only on startup...(?)

Regards,
Peter.

> Peter, there are events in solrconfig where you define warm up queries when a 
> new searcher is opened.
>
> There are also cache settings that play a role here.
>
> 30-60 seconds is pretty frequent for Solr.
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> - Original Message 
>   
>> From: Peter Karich 
>> To: solr-user@lucene.apache.org
>> Sent: Fri, July 30, 2010 4:06:48 PM
>> Subject: Re: Solr Indexing slows down
>>
>> Hi Erick!
>>
>> thanks for the response!
>> I will answer your questions  ;-)
>>
>> 
>>> How often are you making changes to your index?
>>>   
>> Every  30-60 seconds. Too heavy?
>>
>>
>> 
>>> Do you have autocommit  on?
>>>   
>> No.
>>
>>
>> 
>>> Do you commit when updating each  document?
>>>   
>> No. I commit after a batch update of 200  documents
>>
>>
>> 
>>> Committing too often and consequently firing off  warmup queries is the 
>>> first 
>>>   
>> place I'd look.
>>
>> Why is commiting firing  warmup queries? Is there any documentation about
>> this subject?
>> How can I  be sure that the previous commit has done its magic?
>>
>> 
>>> there are  several config values that influence the commit frequency
>>>   
>>
>> I now know  the autowarm and the mergeFactor config. What else? Is this
>> documentation  complete:
>> http://wiki.apache.org/lucene-java/ImproveIndexingSpeed ?
>>
>> Regards,
>> Peter.
>>
>> 
>>> See the subject about 1500 threads. The  first place I'd look is how
>>> often you're committing. If you're  committing before the warmup queries
>>> from the previous commit have done  their magic, you might be getting
>>> into a death spiral.
>>>
>>>  HTH
>>> Erick
>>>
>>> On Thu, Jul 29, 2010 at 7:02 AM, Peter Karich wrote:
>>>
>>>  
>>>   
 Hi,

 I am indexing a solr 1.4.0 core and committing gets slower and slower.
  Starting from 3-5 seconds for ~200 documents and ending with over 60
  seconds after 800 commits. Then, if I reloaded the index, it is as  fast
 as before! And today I have read a similar thread [1] and  indeed: if I
 set autowarming for the caches to 0 the slowdown  disappears.

 BUT at the same time I would like to offer  searching on that core, which
 would be dramatically slowed down (due  to no autowarming).

 Does someone know a better solution  to avoid index-slow-down?

 Regards,
  Peter.

 [1] http://www.mail-archive.com/solr-user@lucene.apache.org/msg20785.html

 
 
>> 
>   


-- 
http://karussell.wordpress.com/



RE: Good list of English words that get "butchered" by Porter Stemmer

2010-07-30 Thread Burton-West, Tom
A good starting place might be the list of stemming errors for the original 
Porter stemmer in this article that describes k-stem:

Krovetz, R. (1993). Viewing morphology as an inference process. In Proceedings 
of the 16th annual international ACM SIGIR conference on Research and 
development in information retrieval (pp. 191-202). Pittsburgh, Pennsylvania, 
United States: ACM. doi:10.1145/160688.160718

I don't know if the current porter stemmer is different.  I do see that on the 
snowball page there is a porter and a porter2 stemmer and this explanation is 
linked from the porter2 stemmer page: 
http://snowball.tartarus.org/algorithms/english/stemmer.html


Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search

-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Friday, July 30, 2010 4:42 PM
To: solr-user@lucene.apache.org
Subject: Good list of English words that get "butchered" by Porter Stemmer

Hello,

I'm looking for a list of English  words that, when stemmed by Porter stemmer, 
end up in the same stem as  some similar, but unrelated words.  Below are some 
examples:

# this gets stemmed to "iron", so if you search for "ironic", you'll get "iron" 
matches
ironic

# same stem as animal
anime
animated 
animation
animations

I imagine such a list could be added to the example protwords.txt

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



Re: Good list of English words that get "butchered" by Porter Stemmer

2010-07-30 Thread Robert Muir
Otis,

I think this is a great idea.

you could also go even further by making a better example for
StemmerOverrideFilter (stemdict.txt):
http://wiki.apache.org/solr/LanguageAnalysis#solr.StemmerOverrideFilterFactory

for example:
animated  animate
animation  animation
animations  animation

this might be a bit better (but more work!) than protected words since then
you could let animation and animations conflate, rather than just forcing
them all to stay unchanged. I wouldn't go crazy and worry about animator
matching animation etc., but would at least let plural forms match the
singular, without screwing other things up.

On Fri, Jul 30, 2010 at 4:41 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Hello,
>
> I'm looking for a list of English words that, when stemmed by the Porter
> stemmer,
> end up with the same stem as some similar, but unrelated, words. Below are
> some
> examples:
>
> # this gets stemmed to "iron", so if you search for "ironic", you'll get
> "iron"
> matches
> ironic
>
> # same stem as animal
> anime
> animated
> animation
> animations
>
> I imagine such a list could be added to the example protwords.txt
>
> Thanks,
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>


-- 
Robert Muir
rcm...@gmail.com
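
A hedged sketch of the wiring for that, following the wiki page linked above
(the field type is a cut-down example, not the stock schema):
StemmerOverrideFilterFactory goes in front of the stemmer, applies the
tab-separated input/output pairs from stemdict.txt, and marks the rewritten
tokens as keywords so the stemmer leaves them alone.

  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- entries like "animations<TAB>animation" conflate plurals without
           touching unrelated words such as anime or animal -->
      <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>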


Re: Solr searching performance issues, using large documents

2010-07-30 Thread Lance Norskog
Wait - how much text are you highlighting? You say these logfiles are X
big - how big are the actual documents you are storing?



On Fri, Jul 30, 2010 at 1:16 PM, Peter Karich  wrote:
> Hi Peter :-),
>
> did you already try other values for
>
> hl.maxAnalyzedChars=2147483647
>
> ? Also, regular expression highlighting is more expensive, I think.
> What does the 'fuzzy' variable mean? If you use it to query via
> "~someTerm" instead of "someTerm",
> then you should try the trunk of Solr, which is a lot faster for fuzzy and
> other wildcard searches.
>
> Regards,
> Peter.
>
>> Data set: About 4,000 log files (will eventually grow to millions).  Average 
>> log file is 850k.  Largest log file (so far) is about 70MB.
>>
>> Problem: When I search for common terms, the query time goes from under 2-3 
>> seconds to about 60 seconds.  TermVectors etc are enabled.  When I disable 
>> highlighting, performance improves a lot, but is still slow for some queries 
>> (7 seconds).  Thanks in advance for any ideas!
>>
>>
>> -Peter
>>
>>
>> -
>>
>> 4GB RAM server
>> % java -Xms2048M -Xmx3072M -jar start.jar
>>
>> -
>>
>> schema.xml changes:
>>
>>     
>>       
>>         
>>       
>>       > generateNumberParts="0" catenateWords="0" catenateNumbers="0" 
>> catenateAll="0" splitOnCaseChange="0"/>
>>       
>>     
>>
>> ...
>>
>>    > multiValued="false" termVectors="true" termPositions="true" 
>> termOffsets="true" />
>>     > default="NOW" multiValued="false"/>
>>    > multiValued="false"/>
>>    > multiValued="false"/>
>>    > multiValued="false"/>
>>    > multiValued="false"/>
>>    > multiValued="false"/>
>>    > multiValued="false"/>
>>    > multiValued="false"/>
>>
>> ...
>>
>>  
>>  body
>>  
>>
>> -
>>
>> solrconfig.xml changes:
>>
>>     2147483647
>>     128
>>
>> -
>>
>> The query:
>>
>> rowStr = "&rows=10"
>> facet = 
>> "&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version"
>> fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
>> termvectors = "&tv=true&qt=tvrh&tv.all=true"
>> hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
>> regexv = "(?m)^.*\n.*\n.*$"
>> hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) + 
>> "&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
>> justq = '&q=' + CGI::escape('body:' + fuzzy + p['q'].to_s.gsub(/\\/, 
>> '').gsub(/([:~!<>="])/,'\1') + fuzzy + minLogSizeStr)
>>
>> thequery = '/solr/select?timeAllowed=5000&wt=ruby' + (p['fq'].empty? ? '' : 
>> ('&fq='+p['fq'].to_s) ) + justq + rowStr + facet + fields + termvectors + hl 
>> + hl_regex
>>
>> baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) + '&rows=' + 
>> p['rows'].to_s + '&minLogSize=' + p['minLogSize'].to_s
>>
>>
>>
>
>
> --
> http://karussell.wordpress.com/
>
>



-- 
Lance Norskog
goks...@gmail.com
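
One cheap experiment along the lines Peter Karich suggested: cap how much of
each stored document the highlighter analyzes, instead of the full 2147483647
characters. A sketch against the Ruby above (the ~50KB cap is an arbitrary
starting point, not a tuned value):

  # let the highlighter look at only the first ~50KB of each log file;
  # snippets deep inside a 70MB file are lost, but highlighting cost
  # stops scaling with file size
  max_analyzed = 51_200
  hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400" +
       "&hl.maxAnalyzedChars=#{max_analyzed}"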


Re: Solr Indexing slows down

2010-07-30 Thread Otis Gospodnetic
As you make changes to your index, you probably want to see the new/modified 
documents in your search results.  In order to do that, the new searcher needs 
to be reopened, and this happens on commit.
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Peter Karich 
> To: solr-user@lucene.apache.org
> Sent: Fri, July 30, 2010 6:19:03 PM
> Subject: Re: Solr Indexing slows down
> 
> Hi Otis,
> 
> does it mean that a new searcher is opened after I commit?
> I thought only on startup...(?)
> 
> Regards,
> Peter.
> 
> > Peter, there are events in solrconfig where you define warm up queries
> > when a new searcher is opened.
> >
> > There are also cache settings that play a role here.
> >
> > 30-60 seconds is pretty frequent for Solr.
> >
> > Otis
> > 
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
> >
> >
> >
> > - Original Message 
> >
> >> From: Peter Karich 
> >> To: solr-user@lucene.apache.org
> >> Sent: Fri, July 30, 2010 4:06:48 PM
> >> Subject: Re: Solr Indexing slows down
> >>
> >> Hi Erick!
> >>
> >> thanks for the response!
> >> I will answer your questions ;-)
> >>
> >>> How often are you making changes to your index?
> >>
> >> Every 30-60 seconds. Too heavy?
> >>
> >>> Do you have autocommit on?
> >>
> >> No.
> >>
> >>> Do you commit when updating each document?
> >>
> >> No. I commit after a batch update of 200 documents
> >>
> >>> Committing too often and consequently firing off warmup queries is the
> >>> first place I'd look.
> >>
> >> Why is committing firing warmup queries? Is there any documentation about
> >> this subject?
> >> How can I be sure that the previous commit has done its magic?
> >>
> >>> there are several config values that influence the commit frequency
> >>
> >> I now know the autowarm and the mergeFactor config. What else? Is this
> >> documentation complete:
> >> http://wiki.apache.org/lucene-java/ImproveIndexingSpeed ?
> >>
> >> Regards,
> >> Peter.
> >>
> >>> See the subject about 1500 threads. The first place I'd look is how
> >>> often you're committing. If you're committing before the warmup queries
> >>> from the previous commit have done their magic, you might be getting
> >>> into a death spiral.
> >>>
> >>> HTH
> >>> Erick
> >>>
> >>> On Thu, Jul 29, 2010 at 7:02 AM, Peter Karich wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> I am indexing a solr 1.4.0 core and committing gets slower and slower.
> >>>> Starting from 3-5 seconds for ~200 documents and ending with over 60
> >>>> seconds after 800 commits. Then, if I reload the index, it is as fast
> >>>> as before! And today I read a similar thread [1] and indeed: if I
> >>>> set autowarming for the caches to 0 the slowdown disappears.
> >>>>
> >>>> BUT at the same time I would like to offer searching on that core, which
> >>>> would be dramatically slowed down (due to no autowarming).
> >>>>
> >>>> Does someone know a better solution to avoid index-slow-down?
> >>>>
> >>>> Regards,
> >>>> Peter.
> >>>>
> >>>> [1] http://www.mail-archive.com/solr-user@lucene.apache.org/msg20785.html
> >>
> >  
> 
> 
> -- 
> http://karussell.wordpress.com/
> 
> 
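
The warmup queries Otis mentions live in solrconfig.xml as a newSearcher event
listener, which fires on every commit that opens a new searcher. A minimal
sketch (the query itself is illustrative):

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <!-- typical queries that pre-fill the caches before the new
           searcher is put into service -->
      <lst>
        <str name="q">solr</str>
        <str name="start">0</str>
        <str name="rows">10</str>
      </lst>
    </arr>
  </listener>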


Re: Some basic DataImportHandler questions

2010-07-30 Thread Shalin Shekhar Mangar
On Sat, Jul 31, 2010 at 3:40 AM, Harry Smith wrote:

> Just starting with DataImportHandler and had a few simple questions.
>
> Is there a location for more in depth documentation other than
> http://wiki.apache.org/solr/DataImportHandler?
>
>
Umm, no, but let us know what is not covered well and it can be added.


> Specifically I was looking for a detailed document outlining
> data-config.xml, the fields and attributes and how they are used.
>
> * Is there a way to dynamically generate field elements from the
> supplied sql statement?
>
> Example: Suppose one has a table of 100 fields. Entering this manually
> for each field is not very efficient.
>
> ie, if table has only 3 columns this is easy enough...
>
> 
>
>
>
> 
>
> What are the options if the ITEM table has dozens or hundreds of columns?
>
>
Yes! You do not need to specify the column names as long as your Solr schema
defines the same field names.

See http://wiki.apache.org/solr/DataImportHandler#A_shorter_data-config


>
> * Is there a way to apply insert logic based on the value of the incoming
> field?
>
> My specific use case would be, if the incoming value is null, do not
> add to Solr.
>
>
DIH Transformers are the way. However, in this particular case, you do not
need to worry because nulls are not inserted into the index.

-- 
Regards,
Shalin Shekhar Mangar.
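
To make the "shorter data-config" concrete, a hedged sketch (driver, URL, and
table name are made up for illustration): with no field elements, DIH matches
result-set column names against Solr schema field names directly.

  <dataConfig>
    <dataSource driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost/mydb"
                user="db-user" password="db-pass"/>
    <document>
      <!-- no <field> elements: every column whose name matches a schema
           field is indexed; non-matching columns and NULL values are
           silently skipped -->
      <entity name="item" query="select * from ITEM"/>
    </document>
  </dataConfig>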


Re: Programmatically retrieving numDocs (or any other statistic)

2010-07-30 Thread Chris Hostetter

: I want to programmatically retrieve the number of indexed documents. I.e., 
get the value of numDocs.

Index level stats like this can be fetched from the LukeRequestHandler in 
any recent version of Solr...
http://localhost:8983/solr/admin/luke?numTerms=0

In future releases (ie: already in trunk and branch 3x) there is also the 
SolrInfoMBeanRequestHandler which will replace registry.jsp and stats.jsp 

https://issues.apache.org/jira/browse/SOLR-1750


-Hoss
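
If you are on SolrJ, the same handler can be queried without parsing the XML
by hand. A sketch against the 1.4-era API (the server URL is illustrative):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.request.LukeRequest;
  import org.apache.solr.client.solrj.response.LukeResponse;

  public class NumDocs {
    public static void main(String[] args) throws Exception {
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

      LukeRequest luke = new LukeRequest();   // defaults to /admin/luke
      luke.setNumTerms(0);                    // skip the per-field top terms
      LukeResponse rsp = luke.process(server);

      // numDocs is reported in the "index" section of the response
      System.out.println("numDocs = " + rsp.getIndexInfo().get("numDocs"));
    }
  }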