RE: Making my QParserPlugin the default one, with cores

2010-06-09 Thread Yuval Feinstein
Thanks, Ahmet.
Yes, my solrconfig.xml file is very similar to what you wrote.
When I use &echoparams=all and defType=myqp, I get:


<lst name="params">
  <str name="q">hi</str>
  <str name="echoParams">all</str>
  <str name="defType">myqp</str>
</lst>


However, when I do not use the defType (hoping it will be automatically
inserted from solrconfig), I get:


<lst name="params">
  <str name="q">hi</str>
  <str name="echoParams">all</str>
</lst>


Can you see what I am doing wrong?
Thanks,
Yuval


-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Tuesday, June 08, 2010 3:52 PM
To: solr-user@lucene.apache.org
Subject: Re: Making my QParserPlugin the default one, with cores

> It appears that the defType parameter is not being set by the request
> handler.

What do you get when you append &echoParams=all to your search url?

So you have something like this entry in solrconfig.xml:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">myqp</str>
  </lst>
</requestHandler>


Re: question about the fieldCollapseCache

2010-06-09 Thread Martijn v Groningen
The fieldCollapseCache should not be used as it is now; it uses too
much memory. It stores any information relevant to a field collapse
search, like document collapse counts, collapsed document ids /
fields, collapsed docset and uncollapsed docset (everything per unique
search). So the memory usage will grow for each unique query (and fast,
with all this information). So I think it's best to disable this cache
for now.
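
For reference, disabling it is just a matter of removing or commenting out the
cache declaration in solrconfig.xml. A rough sketch (assuming the SOLR-236
patch declares it with a fieldCollapseCache element and the usual cache
attributes):

<!-- field collapse cache disabled for now; uncomment to re-enable
<fieldCollapseCache
  class="solr.FastLRUCache"
  size="5000"
  initialSize="512"
  autowarmCount="0"/>
-->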

Martijn

On 8 June 2010 19:05, Jean-Sebastien Vachon  wrote:
> Hi All,
>
> I've been running some tests using 6 shards each one containing about 1 
> millions documents.
> Each shard is running in its own virtual machine with 7 GB of ram (5GB 
> allocated to the JVM).
> After about 1100 unique queries the shards start to struggle and run out of 
> memory. I've reduced all
> other caches without significant impact.
>
> When I remove completely the fieldCollapseCache, the server can keep up for 
> hours
> and use only 2 GB of ram. (I'm even considering returning to a 32 bits JVM)
>
> The size of the fieldCollapseCache was set to 5000 items. How can 5000 items 
> eat 3 GB of ram?
>
> Can someone tell me what is put in this cache? Has anyone experienced this 
> kind of problem?
>
> I am running Solr 1.4.1 with patch 236. All requests are collapsing on a 
> single field (pint) and
> collapse.maxdocs set to 200 000.
>
> Thanks for any hints...
>
>


Re: Filtering near-duplicates using TextProfileSignature

2010-06-09 Thread Markus Jelsma
Here's my config for the updateProcessor. It uses another signature method,
but I've used TextProfileSignature as well and it works - sort of.

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">content</str>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>


Of course, you must define the updateProcessor in your requestHandler; it's
commented out in mine at the moment.

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler">
  <lst name="defaults">
    <str name="update.processor">dedupe</str>
  </lst>
</requestHandler>

Also, I see you define minTokenLen = 3. Where does that come from? I haven't
seen anything on the wiki specifying such a parameter.


On Tuesday 08 June 2010 19:45:35 Neeb wrote:
> Hey Andrew,
> 
> Just wondering if you ever managed to run TextProfileSignature based
> deduplication. I would appreciate it if you could send me the code fragment
> for it from  solrconfig.
> 
> I have currently something like this, but not sure if I am doing it right:
> 
> <updateRequestProcessorChain name="dedupe">
>   <processor
> class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
>     <bool name="enabled">true</bool>
>     <str name="signatureField">signature</str>
>     <bool name="overwriteDupes">true</bool>
>     <str name="fields">title,author,abstract</str>
>     <str
> name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
>     <str name="minTokenLen">3</str>
>   </processor>
> </updateRequestProcessorChain>
> 
> --
> 
> Thanks in advance,
> -Ali
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350



Re: Filtering near-duplicates using TextProfileSignature

2010-06-09 Thread Markus Jelsma
Well, it got me too! KMail didn't properly order this thread. Can't seem to 
find Hatcher's reply anywhere. ??!!?


On Tuesday 08 June 2010 22:00:06 Andrew Clegg wrote:
> Andrew Clegg wrote:
> > Re. your config, I don't see a minTokenLength in the wiki page for
> > deduplication, is this a recent addition that's not documented yet?
> 
> Sorry about this -- stupid question -- I should have read back through the
> thread and refreshed my memory.
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350



Re: [Blacklight-development] facet data cleanup

2010-06-09 Thread Erik Hatcher


On Jun 8, 2010, at 1:57 PM, Naomi Dushay wrote:

Missing Facet Values:
---

to find how many documents are missing values:		 
facet.missing=true&facet.mincount=really big


http://your.solr.baseurl/select?rows=0&facet.field=ffldname&facet.mincount=1000&facet.missing=true

to find the documents with missing values:
		http://your.solr.baseurl/select?qt=standard&q=+uniquekey:[* TO *] -ffldname:[* TO *]


You could shorten that query to just q=-field_name:[* TO *]

Solr's "lucene" query parser supports top-level negative clauses.

And I'm assuming every doc has a unique key, so you could use *:*  
instead of uniquekey:[* TO *] - but I doubt one is really better than  
the other.


Erik



Re: Making my QParserPlugin the default one, with cores

2010-06-09 Thread Erik Hatcher
Yuval - my only hunch is that you're hitting a different request
handler than the one where you configured the default defType.  Send us
the URL you're hitting Solr with, and the full request handler mapping.
And are you sure the core you're hitting (since you mention multicore)
is the one you think it is?  Look at Solr's admin to see where the solr
home directory is and ensure you're looking at the right
solrconfig.xml.


Erik

On Jun 9, 2010, at 12:52 AM, Yuval Feinstein wrote:


Thanks, Ahmet.
Yes, my solrconfig.xml file is very similar to what you wrote.
When I use &echoparams=all and defType=myqp, I get:


hi
all
myqp


However, when I do not use the defType (hoping it will be  
automatically

Inserted from solrconfig),  I get:


hi
all


Can you see what I am doing wrong?
Thanks,
Yuval


-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com]
Sent: Tuesday, June 08, 2010 3:52 PM
To: solr-user@lucene.apache.org
Subject: Re: Making my QParserPlugin the default one, with cores

It appears that the defType parameter is not being set by the request
handler.


What do you get when you append &echoParams=all to your search url?

So you have something like this entry in solrconfig.xml

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">myqp</str>
  </lst>
</requestHandler>

Re: Filtering near-duplicates using TextProfileSignature

2010-06-09 Thread Andrew Clegg


Markus Jelsma wrote:
> 
> Well, it got me too! KMail didn't properly order this thread. Can't seem
> to 
> find Hatcher's reply anywhere. ??!!?
> 

Whole thread here:

http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tt479039.html
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp479039p881797.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Index search optimization for fulltext remote streaming

2010-06-09 Thread Danyal Mark

We have the following Solr configuration:

java -Xms512M -Xmx1024M -Dsolr.solr.home= -jar
start.jar

in SolrConfig.xml

   
false
4  
20  
1024
1
1000
1
native  
 



false
1024
4



false  
true  
  
  1
  0  

 false
  


Also, we have set autoCommit=false. Our PC spec:

Core2-Duo
2GB RAM
Solr Server running in localhost
Index Directory is also in local FileSystem
Input Fulltext files using remoteStreaming from another PC


Here, when we indexed 10 fulltext documents, the total time taken was
40 minutes. We want to reduce this time. We have been studying the
UpdateRequestProcessorChain section


  
   dedupe
  
   

How do we use this UpdateRequestProcessorChain in /update/extract/ to run
indexing in multiple chains (i.e. multiple threads)? Can you suggest whether I
can optimize the process by changing any of these configurations?

with regards,
Danyal Mark 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Index-search-optimization-for-fulltext-remote-streaming-tp828274p881809.html
Sent from the Solr - User mailing list archive at Nabble.com.


how to get multicore to work?

2010-06-09 Thread xdzgor

Hi - I can't seem to get "multicores" to work. I have a Solr installation
which does not have a "solr.xml" file - I assume this means it is not
multicore.

If I create a solr.xml, as described on
http://wiki.apache.org/solr/CoreAdmin, my solr installation fails - for
example I get 404 errors when trying to search, and "solr/admin" does not
work.

Is there more to it than simply creating solr.xml to get multicores to work?

Thanks,
Peter
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-get-multicore-to-work-tp881826p881826.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Filtering near-duplicates using TextProfileSignature

2010-06-09 Thread Neeb

Thanks guys.
I will try this with some test documents, fingers crossed.
And by the way, I got the minTokenLen parameter from one of the thread
replies (from Erik).

Cheerz,
Ali


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp479039p881840.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: how to get multicore to work?

2010-06-09 Thread Chris Rode
If you take a look in the examples directory there is a directory called
multicore. This is an example of the solr home of a multicore setup.

Otherwise, take a look at the logged output of Solr itself. It should tell
you what is wrong with the setup.

On 9 June 2010 11:08, xdzgor  wrote:

>
> Hi - I can't seem to get "multicores" to work. I have a solr installtion
> which does not have a "solr.xml" file - I assume this means it is not
> multicore.
>
> If I create a solr.xml, as described on
> http://wiki.apache.org/solr/CoreAdmin, my solr installation fails - for
> example I get 404 errors when trying to search, and "solr/admin" does not
> work.
>
> Is there more than simply making solr.xml to get multicores to work?
>
> Thanks,
> Peter
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/how-to-get-multicore-to-work-tp881826p881826.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


RE: Making my QParserPlugin the default one, with cores

2010-06-09 Thread Ahmet Arslan

> Thanks, Ahmet.
> Yes, my solrconfig.xml file is very similar to what you
> wrote.
> When I use &echoparams=all and defType=myqp, I get:
> 
> 
> hi
> all
> myqp
> 
> 
> However, when I do not use the defType (hoping it will be
> automatically 
> Inserted from solrconfig),  I get:
> 
> 
> hi
> all
> 
> 

In &echoParams=all the p should be capital. Just use &echoParams=all and don't
include defType explicitly. &echoParams=all will display the default parameters
that you specify in solrconfig.xml, so you can debug this way.

http://wiki.apache.org/solr/CoreQueryParameters#echoParams

If you don't see myqp listed under the defaults section, then it is not written
in solrconfig.xml.

Maybe you forgot to restart the core after editing solrconfig.xml?






Copyfield multi valued to single value

2010-06-09 Thread Marc Ghorayeb

Hello,
Is there a way to copy a multivalued field to a single-valued one, for example
by taking the first value of the multivalued field?
I am trying to sort my index by title, and my index contains Tika-extracted
titles which come in as multivalued, hence my title field is multivalued.
However, when I sort on the title field it crashes, because it cannot compare
two arrays, which is logical. So my thought was to copy only one value from the
array to another field.
Maybe there is another way to do that? Can anyone help me?
Thanks in advance!
Marc

requesthandler, variable ...

2010-06-09 Thread stockii

Hello.

I want to call the TermsComponent with this request:
http://host/solr/app/select/?q=har

I want the same result as when I use this request:
http://host/solr/app/terms/?q=har&terms.prefix=har
-->

9
9
9
...


This is my solrconfig.xml requestHandler:


   
 
 
terms  
  

  termsComponent

   
 
 
   
 
  true
suggest
index

 

  termsComponent

  


Is this possible?

Or how can I put the q value in the place of terms.prefix?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/requesthandler-variable-tp881906p881906.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Making my QParserPlugin the default one, with cores

2010-06-09 Thread Yuval Feinstein
Thanks again Ahmet and Erik.
Turns out that Solr was calling the correct query parser all along.
The real problem was a combination of the query cache and my hacking of the
query to enable BM25 scoring.
When I used a standard BooleanQuery, this behaved as documented.
Now I have to understand how to tweak my Lucene query data structure so that
the query caching works like it does for the standard Lucene queries.
Cheers,
Yuval

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Wednesday, June 09, 2010 1:36 PM
To: solr-user@lucene.apache.org
Subject: RE: Making my QParserPlugin the default one, with cores


> Thanks, Ahmet.
> Yes, my solrconfig.xml file is very similar to what you
> wrote.
> When I use &echoparams=all and defType=myqp, I get:
> 
> 
> hi
> all
> myqp
> 
> 
> However, when I do not use the defType (hoping it will be
> automatically 
> Inserted from solrconfig),  I get:
> 
> 
> hi
> all
> 
> 

In &echoParams=all  p should be capital. Just use &echoParams=all and don't 
include defType explicitly. &echoParams=all will display default parameters 
that you specify in solrconfig.xml. You can debug this way.

http://wiki.apache.org/solr/CoreQueryParameters#echoParams

If you don't see myqp listed under  then it is not written in solrconfig.xml.

May be you forgot to restart core after editing solrconfig.xml?



  


how to test solr's performance?

2010-06-09 Thread Li Li
Are there any built-in tools for performance testing? Thanks.


AW: how to get multicore to work?

2010-06-09 Thread Markus.Rietzler
- solr.xml has to reside in the solr.home dir. You can set this up with the
  java option
  -Dsolr.solr.home=
- admin is per core, so solr/CORENAME/admin will work

It is quite simple to set up.
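
For example, a minimal solr.xml (the core names and instanceDir values here are
just placeholders) looks roughly like this:

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0" />
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>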

> -Original Message-
> From: xdzgor [mailto:p...@alphasolutions.dk]
> Sent: Wednesday, 9 June 2010 12:08
> To: solr-user@lucene.apache.org
> Subject: how to get multicore to work?
> 
> 
> Hi - I can't seem to get "multicores" to work. I have a solr 
> installtion
> which does not have a "solr.xml" file - I assume this means it is not
> multicore.
> 
> If I create a solr.xml, as described on
> http://wiki.apache.org/solr/CoreAdmin, my solr installation 
> fails - for
> example I get 404 errors when trying to search, and 
> "solr/admin" does not
> work.
> 
> Is there more than simply making solr.xml to get multicores to work?
> 
> Thanks,
> Peter
> -- 
> View this message in context: 
> http://lucene.472066.n3.nabble.com/how-to-get-multicore-to-wor
k-tp881826p881826.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 


Re: question about the fieldCollapseCache

2010-06-09 Thread Jean-Sebastien Vachon
ok great.

I believe this should be mentioned in the wiki.

Later

On 2010-06-09, at 4:06 AM, Martijn v Groningen wrote:

> The fieldCollapseCache should not be used as it is now, it uses too
> much memory. It stores any information relevant for a field collapse
> search. Like document collapse counts, collapsed document ids /
> fields, collapsed docset and uncollapsed docset (everything per unique
> search). So the memory usage will grow for each unique query (and fast
> with all this information). So its best I think to disable this cache
> for now.
> 
> Martijn
> 
> On 8 June 2010 19:05, Jean-Sebastien Vachon  wrote:
>> Hi All,
>> 
>> I've been running some tests using 6 shards each one containing about 1 
>> millions documents.
>> Each shard is running in its own virtual machine with 7 GB of ram (5GB 
>> allocated to the JVM).
>> After about 1100 unique queries the shards start to struggle and run out of 
>> memory. I've reduced all
>> other caches without significant impact.
>> 
>> When I remove completely the fieldCollapseCache, the server can keep up for 
>> hours
>> and use only 2 GB of ram. (I'm even considering returning to a 32 bits JVM)
>> 
>> The size of the fieldCollapseCache was set to 5000 items. How can 5000 items 
>> eat 3 GB of ram?
>> 
>> Can someone tell me what is put in this cache? Has anyone experienced this 
>> kind of problem?
>> 
>> I am running Solr 1.4.1 with patch 236. All requests are collapsing on a 
>> single field (pint) and
>> collapse.maxdocs set to 200 000.
>> 
>> Thanks for any hints...
>> 
>> 



Re: question about the fieldCollapseCache

2010-06-09 Thread Martijn v Groningen
I agree. I'll add this information to the wiki.

On 9 June 2010 14:32, Jean-Sebastien Vachon  wrote:
> ok great.
>
> I believe this should be mentioned in the wiki.
>
> Later
>
> On 2010-06-09, at 4:06 AM, Martijn v Groningen wrote:
>
>> The fieldCollapseCache should not be used as it is now, it uses too
>> much memory. It stores any information relevant for a field collapse
>> search. Like document collapse counts, collapsed document ids /
>> fields, collapsed docset and uncollapsed docset (everything per unique
>> search). So the memory usage will grow for each unique query (and fast
>> with all this information). So its best I think to disable this cache
>> for now.
>>
>> Martijn
>>
>> On 8 June 2010 19:05, Jean-Sebastien Vachon  wrote:
>>> Hi All,
>>>
>>> I've been running some tests using 6 shards each one containing about 1 
>>> millions documents.
>>> Each shard is running in its own virtual machine with 7 GB of ram (5GB 
>>> allocated to the JVM).
>>> After about 1100 unique queries the shards start to struggle and run out of 
>>> memory. I've reduced all
>>> other caches without significant impact.
>>>
>>> When I remove completely the fieldCollapseCache, the server can keep up for 
>>> hours
>>> and use only 2 GB of ram. (I'm even considering returning to a 32 bits JVM)
>>>
>>> The size of the fieldCollapseCache was set to 5000 items. How can 5000 
>>> items eat 3 GB of ram?
>>>
>>> Can someone tell me what is put in this cache? Has anyone experienced this 
>>> kind of problem?
>>>
>>> I am running Solr 1.4.1 with patch 236. All requests are collapsing on a 
>>> single field (pint) and
>>> collapse.maxdocs set to 200 000.
>>>
>>> Thanks for any hints...
>>>
>>>
>
>



-- 
Met vriendelijke groet,

Martijn van Groningen


Solr spellcheck config

2010-06-09 Thread Bogdan Gusiev
Hi everyone,

I am trying to build the spellcheck index with *IndexBasedSpellChecker*:

<lst name="spellchecker">
  <str name="name">default</str>
  <str name="field">text</str>
  <str name="spellcheckIndexDir">./spellchecker</str>
</lst>

And I want to specify the dynamic field "*_text" as the field option:

<str name="field">*_text</str>
How can it be done?

Thanks, Bogdan

-- 
Bogdan Gusiev.
agre...@gmail.com


Issue with response header in SOLR running on Linux instance

2010-06-09 Thread bbarani

Hi,

I have been using SOLR for some time now and had no issues while I was using
it on Windows. Yesterday I moved the SOLR code to Linux servers and started
to index the data. Indexing completed successfully on the Linux servers, but
when I queried the index, the response header returned (by the SOLR instance
running on the Linux server) is different from the response header returned by
the SOLR instance that is running on the Windows instance.

Response header returned by SOLR instance running in windows machine

- 
  0 
  2219 
- 
  on 
  0 
  credit 
  2.2 
  10 
  
  


Response header returned by SOLR instance running in Linux machine

- 
- 
  0 
  26 
- 
  credit 
  
  

Any idea why this happens?

Thanks,
Barani

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Issue-with-response-header-in-SOLR-running-on-Linux-instance-tp882181p882181.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr Core Unload

2010-06-09 Thread abhatna...@vantage.com

Refering

http://lucene.472066.n3.nabble.com/unloading-a-solr-core-doesn-t-free-any-memory-td501246.html#a501246


Do we have any solution to free up memory after Solr Core Unload?


Ankit
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Core-Unload-tp882187p882187.html
Sent from the Solr - User mailing list archive at Nabble.com.


custom scorer in Solr

2010-06-09 Thread Fornoville, Tom
Hi all,

We are currently working on a proof-of-concept for a client using Solr
and have been able to configure all the features they want except the
scoring.

Problem is that they want scores that make results fall in buckets:

*   Bucket 1: exact match on category (score = 4)
*   Bucket 2: exact match on name (score = 3)
*   Bucket 3: partial match on category (score = 2)
*   Bucket 4: partial match on name (score = 1)

First thing we did was develop a custom similarity class that would
return the correct score depending on the field and an exact or partial
match.

The only problem now is that when a document matches on both the
category and name the scores are added together.
Example: searching for "restaurant" returns documents in the category
restaurant that also have the word restaurant in their name and thus get
a score of 5 (4+1), but they should only get 4.

I assume for this to work we would need to develop a custom Scorer class
but we have no clue on how to incorporate this in Solr.
Maybe there is even a simpler solution that we don't know about.

All suggestions welcome!

Thanks,

Tom



Re: Issue with response header in SOLR running on Linux instance

2010-06-09 Thread Markus Jelsma
Hi,


Check your requestHandler. It may preset some values that you don't see. Your 
echoParams setting may be explicit instead of all [1]. Alternatively, you 
could add the echoParams parameter to your query if it isn't set as an 
invariant in your requestHandler.

[1]: http://wiki.apache.org/solr/CoreQueryParameters
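
For example, a handler can pin it like this (the handler name and class here
are only illustrative):

<requestHandler name="standard" class="solr.SearchHandler">
  <lst name="invariants">
    <str name="echoParams">all</str>
  </lst>
</requestHandler>

If echoParams sits under invariants, the value in the URL is ignored; under
defaults it can still be overridden per request.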

Cheers,
 
On Wednesday 09 June 2010 15:25:09 bbarani wrote:
> Hi,
> 
> I have been using SOLR for sometime now and had no issues till I was using
> it in windows. Yesterday I moved the SOLR code to Linux servers and started
> to index the data. Indexing completed successfully in the linux severs but
> when I queried the index, the response header returned (by the SOLR
>  instance running in Linux server) is different from the response header
>  returned in SOLR instance that is running on windows instance.
> 
> Response header returned by SOLR instance running in windows machine
> 
> - 
>   0
>   2219
> - 
>   on
>   0
>   credit
>   2.2
>   10
>   
>   
> 
> 
> Response header returned by SOLR instance running in Linux machine
> 
> - 
> - 
>   0
>   26
> - 
>   credit
>   
>   
> 
> Any idea why this happens?
> 
> Thanks,
> Barani
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350



Re: Anyone using Solr spatial from trunk?

2010-06-09 Thread Rob Ganly
>>... but decided not to use it anyway?

that's pretty much correct.  the huge commercial scale of the project
dictates that we need as much system stability as possible from the outset;
thus the tools we are use must be established, community-tested and trusted
versions.  we also noticed that some of the regular non-geospatial queries
seemed to run slightly slower than on 1.4, with only a fraction of the total
amount of records we'd be searching in production (but that wasn't the main
reason for our decision).

i would perhaps use it for a much smaller [private] project where speed,
scaling and reliability weren't such critical issues.

future proofing was also a consideration:

*"With all the changes currently occurring with Solr, I would go so far as
to say that users should continue to use Solr 1.4. However, if you need
access to one of the many new features introduced in Solr 1.5+ or Lucene
3.x, then given Solr 3.1 a shot, and report back your experiences."  *(from
http://blog.jteam.nl/2010/04/14/state-of-solr/*).*

On 8 June 2010 21:09, Darren Govoni  wrote:

> So let me understand what you said. You went through the trouble to
> implement a geospatial
> solution using Solr 1.5, it worked really well. You saw no signs of
> instability, but decided not to use it anyway?
>
> Did you put it through a routine of tests and witness some stability
> problem? Or just guessing it had them?
>
> I'm just curious the reasoning behind your comment.
>
> On Tue, 2010-06-08 at 09:05 +0100, Rob Ganly wrote:
>
> > i used the 1.5 build a few weeks ago, implemented the geospatial
> > functionality and it worked really well.
> >
> > however due to the unknown quantity in terms of stability (and the
> uncertain
> > future of 1.5) etc. we decided not to use it in production.
> >
> > rob ganly
> >
> > On 8 June 2010 03:50, Darren Govoni  wrote:
> >
> > > I've been experimenting with it, but haven't quite gotten it to work as
> > > yet.
> > >
> > > On Mon, 2010-06-07 at 17:47 -0700, Jay Hill wrote:
> > >
> > > > I was wondering about the production readiness of the new-in-trunk
> > > spatial
> > > > functionality. Is anyone using this in a production environment?
> > > >
> > > > -Jay
> > >
> > >
> > >
>
>
>


Re: AW: XSLT for JSON

2010-06-09 Thread stockii

help me please =(
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/XSLT-for-JSON-tp845386p882319.html
Sent from the Solr - User mailing list archive at Nabble.com.


How Solr Manages Connected Database Updates

2010-06-09 Thread Sumit Arora
Hey All,

I am new to the Solr area, and just started exploring it and have done the
basic stuff; now I am stuck on the logic:

How does Solr manage connected database updates?

Scenario:

-- I wrote an indexing program which runs on Tomcat, and by running this
program, it reads data from a connected MySQL database and then performs
indexing.

Use case - the database is not fixed. It's a database for a web application,
where users keep inserting data, so the database has frequent updates,
almost every minute.

How should Solr automatically grab those changes and perform index updates?

Do I need to write a cron job kind of thing? Or use the Data Import Handler?
(There could be several ways?)

Is there anyone who can provide comments or share their experience if
someone has gone through a similar situation?

Thanks,
-Sumit


Diagnosing solr timeout

2010-06-09 Thread Paul
Hi all,

In my app, it seems like solr has become slower over time. The index
has grown a bit, and there are probably a few more people using the
site, but the changes are not drastic.

I notice that when a solr search is made, the amount of cpu and ram
spike precipitously.

I notice in the solr log, a bunch of entries in the same second that end in:

status=0 QTime=212
status=0 QTime=96
status=0 QTime=44
status=0 QTime=276
status=0 QTime=8552
status=0 QTime=16
status=0 QTime=20
status=0 QTime=56

and then:

status=0 QTime=315919
status=0 QTime=325071

My questions: How do I figure out what to fix? Do I need to start java
with more memory? How do I tell what is the correct amount of memory
to use? Is there something particularly inefficient about something
else in my configuration, or the way I'm formulating the solr request,
and how would I narrow down what it could be? I can't tell, but it
seems like it happens after solr has been running unattended for a
little while. Should I have a cron job that restarts solr every day?
Could the solr process be starved by something else on the server
(although -- the only other thing that is particularly running is
apache/passenger/rails app)?

In other words, I'm at a total loss about how to fix this.

Thanks!

P.S. In case this helps, here's the exact log entry for the first item
that failed:

Jun 9, 2010 1:02:52 PM org.apache.solr.core.SolrCore execute
INFO: [resources] webapp=/solr path=/select
params={hl.fragsize=600&facet.missing=true&facet=false&facet.mincount=1&ids=http://pm.nlx.com/xtf/view?docId%3Dshelley/shelley.04.xml;chunk.id%3Ddiv.ww.shelleyworks.v4.44,http://pm.nlx.com/xtf/view?docId%3Dshelley/shelley.06.xml;chunk.id%3Ddiv.ww.shelleyworks.v6.67,http://pm.nlx.com/xtf/view?docId%3Dtennyson_c/tennyson_c.02.xml;chunk.id%3Ddiv.tennyson.v2.1115,http://pm.nlx.com/xtf/view?docId%3Dmarx/marx.39.xml;chunk.id%3Ddiv.marx.engels.39.325,http://pm.nlx.com/xtf/view?docId%3Dshelley_j/shelley_j.01.xml;chunk.id%3Ddiv.ww.shelley.journals.v1.80,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.116,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.115,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.75,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.76,http://pm.nlx.com/xtf/view?docId%3Demerson/emerson.05.xml;chunk.id%3Dralph.waldo.v5.d083,http://pm.nlx.com/xtf/view?docId%3Dshelley/shelley.04.xml;chunk.id%3Ddiv.ww.shelleyworks.v4.31,http://pm.nlx.com/xtf/view?docId%3Dshelley_j/shelley_j.01.xml;chunk.id%3Ddiv.ww.shelley.journals.v1.88,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.03.xml;chunk.id%3Ddiv.eliot.romola.48&facet.limit=-1&hl.fl=text&hl.maxAnalyzedChars=512000&wt=javabin&hl=true&rows=30&version=1&fl=uri,archive,date_label,genre,source,image,thumbnail,title,alternative,url,role_ART,role_AUT,role_EDT,role_PBL,role_TRL,role_EGR,role_ETR,role_CRE,freeculture,is_ocr,federation,has_full_text,source_xml,uri&start=0&q=(*:*+AND+(life)+AND+(death)+AND+(of)+AND+(jason)+AND+federation:NINES)+OR+(*:*+AND+(life)+AND+(death)+AND+(of)+AND+(jason)+AND+federation:NINES+-genre:Citation)^5&facet.field=genre&facet.field=archive&facet.field=freeculture&facet.field=has_full_text&facet.field=federation&isShard=true&fq=year:"1882"}
status=0 QTime=315919


Dataimport in debug mode store a last index date

2010-06-09 Thread Marc Emery
Hi,

When using the data import handler and clicking on 'Debug now', it stores the
current date as 'last_index_time' in the dataimport.properties file.
Is that the right behaviour, given that debug doesn't do a commit?

Thanks
marc


Re: Diagnosing solr timeout

2010-06-09 Thread Jean-Sebastien Vachon
Have you looked at the garbage collector statistics? I've experienced this kind
of issue in the past, and I was getting huge spikes when the GC was doing its job.

On 2010-06-09, at 10:52 AM, Paul wrote:

> Hi all,
> 
> In my app, it seems like solr has become slower over time. The index
> has grown a bit, and there are probably a few more people using the
> site, but the changes are not drastic.
> 
> I notice that when a solr search is made, the amount of cpu and ram
> spike precipitously.
> 
> I notice in the solr log, a bunch of entries in the same second that end in:
> 
> status=0 QTime=212
> status=0 QTime=96
> status=0 QTime=44
> status=0 QTime=276
> status=0 QTime=8552
> status=0 QTime=16
> status=0 QTime=20
> status=0 QTime=56
> 
> and then:
> 
> status=0 QTime=315919
> status=0 QTime=325071
> 
> My questions: How do I figure out what to fix? Do I need to start java
> with more memory? How do I tell what is the correct amount of memory
> to use? Is there something particularly inefficient about something
> else in my configuration, or the way I'm formulating the solr request,
> and how would I narrow down what it could be? I can't tell, but it
> seems like it happens after solr has been running unattended for a
> little while. Should I have a cron job that restarts solr every day?
> Could the solr process be starved by something else on the server
> (although -- the only other thing that is particularly running is
> apache/passenger/rails app)?
> 
> In other words, I'm at a total loss about how to fix this.
> 
> Thanks!
> 
> P.S. In case this helps, here's the exact log entry for the first item
> that failed:
> 
> Jun 9, 2010 1:02:52 PM org.apache.solr.core.SolrCore execute
> INFO: [resources] webapp=/solr path=/select
> params={hl.fragsize=600&facet.missing=true&facet=false&facet.mincount=1&ids=http://pm.nlx.com/xtf/view?docId%3Dshelley/shelley.04.xml;chunk.id%3Ddiv.ww.shelleyworks.v4.44,http://pm.nlx.com/xtf/view?docId%3Dshelley/shelley.06.xml;chunk.id%3Ddiv.ww.shelleyworks.v6.67,http://pm.nlx.com/xtf/view?docId%3Dtennyson_c/tennyson_c.02.xml;chunk.id%3Ddiv.tennyson.v2.1115,http://pm.nlx.com/xtf/view?docId%3Dmarx/marx.39.xml;chunk.id%3Ddiv.marx.engels.39.325,http://pm.nlx.com/xtf/view?docId%3Dshelley_j/shelley_j.01.xml;chunk.id%3Ddiv.ww.shelley.journals.v1.80,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.116,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.115,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.75,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.01.xml;chunk.id%3Ddiv.eliot.novels.bede.76,http://pm.nlx.com/xtf/view?docId%3Demerson/emerson.05.xml;chunk.id%3Dralph.waldo.v5.d083,http://pm.nlx.com/xtf/view?docId%3Dshelley/shelley.04.xml;chunk.id%3Ddiv.ww.shelleyworks.v4.31,http://pm.nlx.com/xtf/view?docId%3Dshelley_j/shelley_j.01.xml;chunk.id%3Ddiv.ww.shelley.journals.v1.88,http://pm.nlx.com/xtf/view?docId%3Deliot/eliot.03.xml;chunk.id%3Ddiv.eliot.romola.48&facet.limit=-1&hl.fl=text&hl.maxAnalyzedChars=512000&wt=javabin&hl=true&rows=30&version=1&fl=uri,archive,date_label,genre,source,image,thumbnail,title,alternative,url,role_ART,role_AUT,role_EDT,role_PBL,role_TRL,role_EGR,role_ETR,role_CRE,freeculture,is_ocr,federation,has_full_text,source_xml,uri&start=0&q=(*:*+AND+(life)+AND+(death)+AND+(of)+AND+(jason)+AND+federation:NINES)+OR+(*:*+AND+(life)+AND+(death)+AND+(of)+AND+(jason)+AND+federation:NINES+-genre:Citation)^5&facet.field=genre&facet.field=archive&facet.field=freeculture&facet.field=has_full_text&facet.field=federation&isShard=true&fq=year:"1882"}
> status=0 QTime=315919



Re: Diagnosing solr timeout

2010-06-09 Thread Paul
>Have you looked at the garbage collector statistics? I've experienced this 
>kind of issues in the past
and I was getting huge spikes when the GC was doing its job.

I haven't, and I'm not sure what a good way to monitor this is. The
problem occurs maybe once a week on a server. Should I run jstat the
whole time and redirect the output to a log file? Is there another way
to get that info?

Also, I was suspecting GC myself. So, if it is the problem, what do I
do about it? It seems like increasing RAM might make the problem worse
because it would wait longer to GC, then it would have more to do.


TrieRange for storage of dates

2010-06-09 Thread Jason Rutherglen
What is the best practice? Perhaps we can amend the article at
http://www.lucidimagination.com/blog/2009/05/13/exploring-lucene-and-solrs-trierange-capabilities/
to include the recommendation (ie, dates are commonly unique).
I'm assuming using a long is the best choice.
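
For reference, the trie-encoded date variant in schema.xml looks roughly like
this (the field and type names, and the precisionStep, are only illustrative):

<fieldType name="tdate" class="solr.TrieDateField" precisionStep="6"
           omitNorms="true" positionIncrementGap="0"/>
<field name="created" type="tdate" indexed="true" stored="true"/>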


Re: Tomcat startup script

2010-06-09 Thread Sixten Otto
On Tue, Jun 8, 2010 at 4:18 PM,   wrote:
> The following should work on centos/redhat, don't forget to edit the paths,
> user, and java options for your environment. You can use chkconfig to add it
> to your startup.

Thanks, Colin.

Sixten


Some questions about ability of solr.

2010-06-09 Thread Vitaliy Avdeev
I am keeping some data in JSON format in an HBase table.
I would like to index this data with Solr.
Are there any examples of indexing an HBase table?

Every node in HBase has an attribute that saves the date when it was written to
the table.
Is there any option to search not only by text but also by the period of time
in which the data was written into HBase?


Re: general debugging techniques?

2010-06-09 Thread Jim Blomo
On Fri, Jun 4, 2010 at 3:14 PM, Chris Hostetter
 wrote:
> : That is still really small for 5MB documents. I think the default solr
> : document cache is 512 items, so you would need at least 3 GB of memory
> : if you didn't change that and the cache filled up.
>
> that assumes that the extracted text tika extracts from each document is
> the same size as the original raw files *and* that he's configured that
> content field to be "stored" ... in practice if you only stored=true the

Most times the extracted text is much smaller, though there are
occasional zip files that may expand in size (and in an unrelated
note, multifile zip archives cause tika 0.7 to hang currently).

> fast, 128MB is really, really, really small for a typical Solr instance.

In any case I bumped up the heap to 3G as suggested, which has helped
stability.  I have found that in practice I need to commit every
extraction because a crash or error will wipe out all extractions
after the last commit.

> if you are only seeing one log line per request, then you are just looking
> at the "request" log ... there should be more logs with messages from all
> over the code base with various levels of severity -- and using standard
> java log level controls you can turn these up/down for various components.

Unfortunately, I'm not very familiar with java deploys so I don't know
where the standard controls are yet.  As a concrete example, I do see
INFO level logs, but haven't found a way to move up to DEBUG level in
either Solr or Tomcat.  I was hoping debug statements would point to
where extraction/indexing hangs were occurring.  I will keep poking
around, thanks for the tips.

Jim


Re: Diagnosing solr timeout

2010-06-09 Thread Jean-Sebastien Vachon
I use the following article as a reference when dealing with GC related issues

http://www.petefreitag.com/articles/gctuning/

I suggest you activate the verbose option and send GC stats to a file. I don't
remember exactly what the option was, but you should find the information easily.

Good luck

On 2010-06-09, at 11:35 AM, Paul wrote:

>> Have you looked at the garbage collector statistics? I've experienced this 
>> kind of issues in the past
> and I was getting huge spikes when the GC was doing its job.
> 
> I haven't, and I'm not sure what a good way to monitor this is. The
> problem occurs maybe once a week on a server. Should I run jstat the
> whole time and redirect the output to a log file? Is there another way
> to get that info?
> 
> Also, I was suspecting GC myself. So, if it is the problem, what do I
> do about it? It seems like increasing RAM might make the problem worse
> because it would wait longer to GC, then it would have more to do.



Re: AW: how to get multicore to work?

2010-06-09 Thread xdzgor

Thanks for the comments. I still can't get this multicore thing to work!

Here is my directory structure:

d:
__apachesolr
lucidworks
__lucidworks
solr
__bin
__conf
__lib
tomcat

There is no solr.xml, and solr.solr.home points to
d:\apachesolr\lucidworks\lucidworks\solr

As it stands, solr works fine, and sites like
http://locahost:8983/solr/admin also work.

As soon as I put a solr.xml in the solr directory and restart the Tomcat
service, it all stops working.

  

  


Any idea where I can look?
Where is the solr startup log written?

Thanks,
Peter
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-get-multicore-to-work-tp881826p883780.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: general debugging techniques?

2010-06-09 Thread Lance Norskog
https://issues.apache.org/jira/browse/LUCENE-2387

There is a "memory leak" that causes the last PDF binary file image to
stick around while working on the next binary image. When you commit
after every extraction, you clear up this "memory leak".

This is fixed in trunk and should make it into a 'bug fix' Solr 1.4.1
if such a thing happens.

Lance

On Wed, Jun 9, 2010 at 10:13 AM, Jim Blomo  wrote:
> On Fri, Jun 4, 2010 at 3:14 PM, Chris Hostetter
>  wrote:
>> : That is still really small for 5MB documents. I think the default solr
>> : document cache is 512 items, so you would need at least 3 GB of memory
>> : if you didn't change that and the cache filled up.
>>
>> that assumes that the extracted text tika extracts from each document is
>> the same size as the original raw files *and* that he's configured that
>> content field to be "stored" ... in practice if you only stored=true the
>
> Most times the extracted text is much smaller, though there are
> occasional zip files that may expand in size (and in an unrelated
> note, multifile zip archives cause tika 0.7 to hang currently).
>
>> fast, 128MB is really, really, really small for a typical Solr instance.
>
> In any case I bumped up the heap to 3G as suggested, which has helped
> stability.  I have found that in practice I need to commit every
> extraction because a crash or error will wipe out all extractions
> after the last commit.
>
>> if you are only seeing one log line per request, then you are just looking
>> at the "request" log ... there should be more logs with messages from all
>> over the code base with various levels of severity -- and using standard
>> java log level controls you can turn these up/down for various components.
>
> Unfortunately, I'm not very familiar with java deploys so I don't know
> where the standard controls are yet.  As a concrete example, I do see
> INFO level logs, but haven't found a way to move up DEBUG level in
> either solr or tomcat.  I was hopeful debug statements would point to
> where extraction/indexing hangs were occurring.  I will keep poking
> around, thanks for the tips.
>
> Jim
>



-- 
Lance Norskog
goks...@gmail.com


Re: How Solr Manages Connected Database Updates

2010-06-09 Thread Lance Norskog
The DataImportHandler has a tool for fetching recent updates in the
database and indexing only those new&changed records.  It has no
scheduler. You would set up the DIH configuration and then write a
cron job to run it at regular intervals.
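
A minimal sketch of the DIH side (the table, column and entity names here are
just placeholders) is a deltaQuery/deltaImportQuery pair in data-config.xml:

<entity name="item" pk="id"
        query="SELECT * FROM item"
        deltaQuery="SELECT id FROM item WHERE last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT * FROM item WHERE id = '${dataimporter.delta.id}'"/>

The cron job then only has to hit .../dataimport?command=delta-import on
whatever schedule you need.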

Lance

On Wed, Jun 9, 2010 at 7:51 AM, Sumit Arora  wrote:
> Hey All,
>
> I am new to Solr Area, and just started exploring it and done basic stuff,
> now I am stuck with logic :
>
> How Solr Manages Connected Database Updates
>
> Scenario :
>
> -- Wrote one Indexing Program which runs on Tomcat , and by running this
> program, it reads  data from connected MySql Database and then perform
> Indexing.
>
> Use Case - Database is not fixed, Its a data base for a web application,
> from where user keep on inserting data, so database have frequent updates.
> almost every minute.
>
> How automatically solr should grab those changes and perform Index updation
> ?
>
>
> Do I need to Write a Cron Job kind of stuff ? Or Use Data Import Handler ?
> (Several ways could be ?)
>
> Is there any one who can provide his comments or share his experience If
> some one gone though from similar situation ?
>
> Thanks,
> -Sumit
>



-- 
Lance Norskog
goks...@gmail.com


Master master?

2010-06-09 Thread Glen Stampoultzis
Does Solr handle having two masters that are also slaves to each other (i.e.
in a cycle)?


Regards,

Glen


Re: Faceted Search Slows Down as index gets larger

2010-06-09 Thread Lance Norskog
The Distributed Search feature assumes that a document only exists in
one core. Updating a doc in a small core will fail because it may be
found twice.

If you are only updating a popularity score, and only need it for
boosting (but not for searching on a value), there is a feature called
the ExternalFileField:

http://www.lucidimagination.com/search/document/CDRG_ch04_4.4.4?q=ExternalFileField
http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
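
A rough idea of the schema side (the names and default value here are only
illustrative):

<fieldType name="externalPopularity" class="solr.ExternalFileField"
           keyField="id" defVal="0" valType="pfloat"
           indexed="false" stored="false"/>
<field name="popularity" type="externalPopularity"/>

The scores then live in an external file (named external_<fieldname>) alongside
the index and can be refreshed without reindexing the documents themselves.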

On Sun, Jun 6, 2010 at 10:26 PM, Andy  wrote:
> Yonik,
>
> Is there any documentation where I can read more about the big core + small 
> core setup?
>
> One issue for me is that I don't just add new documents. Many of the changes 
> is to update existing documents, such as updating the popularity score of the 
> documents. Would the big core + small core strategy still work in this case? 
> If not, is there any other way to mitigate the cache re-building problem of 
> facet search?
>
> --- On Sun, 6/6/10, Yonik Seeley  wrote:
>
>> From: Yonik Seeley 
>> Subject: Re: Faceted Search Slows Down as index gets larger
>> To: solr-user@lucene.apache.org
>> Date: Sunday, June 6, 2010, 1:54 PM
>> On Sun, Jun 6, 2010 at 1:12 PM,
>> Furkan Kuru 
>> wrote:
>> > We try to provide real-time search. So the index is
>> changing almost in every
>> > minute.
>> >
>> > We commit for every 100 documents received.
>> >
>> > The facet search is executed every 5 mins.
>>
>> OK, that's the problem - pretty much every facet search is
>> rebuilding
>> the facet cache, which takes most of the time (and facet.fc
>> is more
>> expensive than facet.enum in this regard).
>>
>> One strategy is to use distributed search... have some big
>> cores that
>> don't change often, and then small cores for the new stuff
>> that
>> changes rapidly.
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>
>
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Index-time vs. search-time boosting performance

2010-06-09 Thread Lance Norskog
Is it necessary that a document 1 year old be more relevant than one
that's 1 year and 1 hour old? In other words, can the boosting be
logarithmic wrt time instead of linear?

A schema design tip: you can store a separate date field which is
rounded down to the hour. This will make for a much smaller term
dictionary and therefore faster searching & range queries.
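
Something along these lines, with the field name purely illustrative:

<field name="published_hour" type="date" indexed="true" stored="false"/>

Index values already truncated to the hour (e.g. 2010-06-09T14:00:00Z); with
Solr's date math the query side can be rounded too, e.g.
published_hour:[NOW/HOUR-30DAYS TO NOW/HOUR].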

On Mon, Jun 7, 2010 at 4:08 AM, Asif Rahman  wrote:
> I still need a relatively precise boost.  No less precise than hourly.  I
> think that would make for a pretty messy field query.
>
>
> On Mon, Jun 7, 2010 at 2:15 AM, Lance Norskog  wrote:
>
>> If you are unhappy with the performance overhead of a function boost,
>> you can push it into a field query by boosting date ranges.
>>
>> You would group in date ranges: documents in September would be
>> boosted 1.0, October 2.0, November 3.0 etc.
>>
>>
>> On 6/5/10, Asif Rahman  wrote:
>> > Thanks everyone for your help so far.  I'm still trying to get to the
>> bottom
>> > of whether switching over to index-time boosts will give me a performance
>> > improvement, and if so if it will be noticeable.  This is all under the
>> > assumption that I can achieve the scoring functionality that I need with
>> > either index-time or search-time boosting (given the loss of precision.
>>  I
>> > can always dust off the old profiler to see what's going on with the
>> > search-time boosts, but testing the index-time boosts will require a full
>> > reindex, which could take days with our dataset.
>> >
>> > On Sat, Jun 5, 2010 at 9:17 AM, Robert Muir  wrote:
>> >
>> >> On Fri, Jun 4, 2010 at 7:50 PM, Asif Rahman  wrote:
>> >>
>> >> > Perhaps I should have been more specific in my initial post.  I'm
>> doing
>> >> > date-based boosting on the documents in my index, so as to assign a
>> >> higher
>> >> > score to more recent documents.  Currently I'm using a boost function
>> to
>> >> > achieve this.  I'm wondering if there would be a performance
>> improvement
>> >> if
>> >> > instead of using the boost function at search time, I indexed the
>> >> documents
>> >> > with a date-based boost.
>> >> >
>> >> >
>> >> Asif, without knowing more details, before you look at performance you
>> >> might
>> >> want to consider the relevance impacts of switching to index-time
>> boosting
>> >> for your use case too.
>> >>
>> >> You can read more about the differences here:
>> >> http://lucene.apache.org/java/3_0_1/scoring.html
>> >>
>> >> But I think the most important for this date-influenced use case is:
>> >>
>> >> "Indexing time boosts are preprocessed for storage efficiency and
>> written
>> >> to
>> >> the directory (when writing the document) in a single byte (!)"
>> >>
>> >> If you do this as an index-time boost, your boosts will lose lots of
>> >> precision for this reason.
>> >>
>> >> --
>> >> Robert Muir
>> >> rcm...@gmail.com
>> >>
>> >
>> >
>> >
>> > --
>> > Asif Rahman
>> > Lead Engineer - NewsCred
>> > a...@newscred.com
>> > http://platform.newscred.com
>> >
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>>
>
>
>
> --
> Asif Rahman
> Lead Engineer - NewsCred
> a...@newscred.com
> http://platform.newscred.com
>



-- 
Lance Norskog
goks...@gmail.com


Re: Need help with document format

2010-06-09 Thread Lance Norskog
This is what Field Collapsing does. It is a complex feature and is not
in the Solr trunk yet.

On Tue, Jun 8, 2010 at 9:15 AM, Moazzam Khan  wrote:
> How would I do a facet search if I did this and not get duplicates?
>
> Thanks,
> Moazzam
>
> On Mon, Jun 7, 2010 at 10:07 AM, Israel Ekpo  wrote:
>> I think you need a 1:1 mapping between the consultant and the company, else
>> how are you going to run your queries for let's say consultants that worked
>> for Google or AOL between March 1999 and August 2004?
>>
>> If the mapping is 1:1, your life would be easier and you would not need to
>> do extra parsing of the results your retrieved.
>>
>> Unfortunately, it looks like your are doing to have a lot of records.
>>
>> With an RDBMS, it is easier to do joins but with Lucene and Solr you have to
>> denormalize all the relationships.
>>
>> Hence in this particular scenario, if you have 5 consultants that worked for
>> 4 distinct companies you will have to send 20 documents to Solr
>>
>> On Mon, Jun 7, 2010 at 10:15 AM, Moazzam Khan  wrote:
>>
>>> Thanks for the replies guys.
>>>
>>>
>>> I am currently storing consultants like this ..
>>>
>>> 
>>>  123
>>>  tony
>>>  marjo
>>>  Google
>>>  AOL
>>> 
>>>
>>> I have a few multi valued fields so if I do it the way Israel
>>> suggested it, I will have tons of records. Do you think it will be
>>> better if I did this instead ?
>>>
>>>
>>> 
>>>  123
>>>  tony
>>>  marjo
>>>  Google_StartDate_EndDate
>>>  AOL_StartDate_EndDate
>>> 
>>>
>>> Or is what you guys said better?
>>>
>>> Thanks for all the help.
>>>
>>> Moazzam
>>>
>>>
>>> On Mon, Jun 7, 2010 at 1:10 AM, Lance Norskog  wrote:
>>> > And for 'present', you would pick some time far in the future:
>>> > 2100-01-01T00:00:00Z
>>> >
>>> > On 6/5/10, Israel Ekpo  wrote:
>>> >> You need to make each document added to the index a 1 to 1 mapping for
>>> each
>>> >> company and consultant combo
>>> >>
>>> >> 
>>> >>
>>> >> 
>>> >>     
>>> >>     >> >> stored="true" required="true"/>
>>> >>     >> >> stored="true" multiValued="false"/>
>>> >>     >> >> stored="true" multiValued="false"/>
>>> >>
>>> >>     
>>> >>     >> >> multiValued="false"/>
>>> >>     >> >> multiValued="false"/>
>>> >>     >> >> multiValued="false"/>
>>> >> 
>>> >>
>>> >> text
>>> >>
>>> >> 
>>> >> 
>>> >> 
>>> >>
>>> >> 
>>> >>
>>> >> 
>>> >>
>>> >> 
>>> >>     1_1
>>> >>     Michael
>>> >>     Davis
>>> >>     AOL
>>> >>     2006-02-13T15:26:37Z
>>> >>     2008-02-13T15:26:37Z
>>> >> 
>>> >>
>>> >> 
>>> >>     1_4
>>> >>     Michael
>>> >>     Davis
>>> >>     Google
>>> >>     2006-02-13T15:26:37Z
>>> >>     2009-02-13T15:26:37Z
>>> >> 
>>> >>
>>> >> 
>>> >>     2_3
>>> >>     Tom
>>> >>     Anderson
>>> >>     Yahoo
>>> >>     2001-01-13T15:26:37Z
>>> >>     2009-02-13T15:26:37Z
>>> >> 
>>> >>
>>> >> 
>>> >>     2_4
>>> >>     Tom
>>> >>     Anderson
>>> >>     Google
>>> >>     1999-02-13T15:26:37Z
>>> >>     2010-02-13T15:26:37Z
>>> >> 
>>> >>
>>> >>
>>> >> The you can search as
>>> >>
>>> >> q=company:X AND start_date:[X TO *] AND end_date:[* TO Z]
>>> >>
>>> >> On Fri, Jun 4, 2010 at 4:58 PM, Moazzam Khan 
>>> wrote:
>>> >>
>>> >>> Hi guys,
>>> >>>
>>> >>>
>>> >>> I have a list of consultants and the users (people who work for the
>>> >>> company) are supposed to be able to search for consultants based on
>>> >>> the time frame they worked for, for a company. For example, I should
>>> >>> be able to search for all consultants who worked for Bear Stearns in
>>> >>> the month of july. What is the best of accomplishing this?
>>> >>>
>>> >>> I was thinking of formatting the document like this
>>> >>>
>>> >>> 
>>> >>>    Bear Stearns
>>> >>>   2000-01-01
>>> >>>   present
>>> >>> 
>>> >>> 
>>> >>>    AIG
>>> >>>   1999-01-01
>>> >>>   2000-01-01
>>> >>> 
>>> >>>
>>> >>> Is this possible?
>>> >>>
>>> >>> Thanks,
>>> >>>
>>> >>> Moazzam
>>> >>>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> "Good Enough" is not good enough.
>>> >> To give anything less than your best is to sacrifice the gift.
>>> >> Quality First. Measure Twice. Cut Once.
>>> >> http://www.israelekpo.com/
>>> >>
>>> >
>>> >
>>> > --
>>> > Lance Norskog
>>> > goks...@gmail.com
>>> >
>>>
>>
>>
>>
>> --
>> "Good Enough" is not good enough.
>> To give anything less than your best is to sacrifice the gift.
>> Quality First. Measure Twice. Cut Once.
>> http://www.israelekpo.com/
>>
>



-- 
Lance Norskog
goks...@gmail.com


Indexing HTML

2010-06-09 Thread Blargy

What is the preferred way to index html using DIH (my html is stored in a
blob field in our database)? 

I know there is the built in HTMLStripTransformer but that doesn't seem to
work well with malformed/incomplete HTML. I've created a custom transformer
to first tidy up the html using JTidy then I pass it to the
HTMLStripTransformer like so:




However this method isn't fool-proof as you can see by my ignoreErrors
option. 

I quickly took a peek at Tika and I noticed that it has its own HtmlParser.
Is this something I should look into? Are there any alternatives that deal
with malformed/incomplete  html? Thanks






-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-HTML-tp884497p884497.html
Sent from the Solr - User mailing list archive at Nabble.com.


Can query boosting be used with a custom request handlers?

2010-06-09 Thread Andy
I want to try out the bobo plugin for Solr, which is a custom request  handler  
(http://code.google.com/p/bobo-browse/wiki/SolrIntegration).

At the same time I want to use BoostQParserPlugin to boost my queries, 
something like {!boost b=log(popularity)}foo

Can I use the {!boost} feature in conjunction with an external custom request 
handler like the bobo plugin, or does {!boost} only work with the standard 
request handler?


  


Re: Diagnosing solr timeout

2010-06-09 Thread Lance Norskog
Every time you reload the index it is to rebuild the facet cached
data. Could that be it?

Also, how big are the fields being highlighted? And are they indexed
with term vectors? (If not, the text is re-analyzed in flight with
term vectors.)
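
For reference, turning term vectors on for the highlighted field is a schema
change along these lines (the field name is just an example):

<field name="text" type="text" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>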

How big are the caches? Are they growing & growing?

On Wed, Jun 9, 2010 at 11:12 AM, Jean-Sebastien Vachon
 wrote:
> I use the following article as a reference when dealing with GC related issues
>
> http://www.petefreitag.com/articles/gctuning/
>
> I suggest you activate the verbose option and send GC stats to a file. I 
> don't remember exactly what
> was the option but you should find the information easily
>
> Good luck
>
> On 2010-06-09, at 11:35 AM, Paul wrote:
>
>>> Have you looked at the garbage collector statistics? I've experienced this 
>>> kind of issues in the past
>> and I was getting huge spikes when the GC was doing its job.
>>
>> I haven't, and I'm not sure what a good way to monitor this is. The
>> problem occurs maybe once a week on a server. Should I run jstat the
>> whole time and redirect the output to a log file? Is there another way
>> to get that info?
>>
>> Also, I was suspecting GC myself. So, if it is the problem, what do I
>> do about it? It seems like increasing RAM might make the problem worse
>> because it would wait longer to GC, then it would have more to do.
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Indexing HTML

2010-06-09 Thread Lance Norskog
The HTMLStripChar variants are newer and might work better.
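
If you go that route it is an index-time analyzer piece rather than a DIH
transformer - a rough sketch, with the type name made up:

<fieldType name="html_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>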

On Wed, Jun 9, 2010 at 8:38 PM, Blargy  wrote:
>
> What is the preferred way to index html using DIH (my html is stored in a
> blob field in our database)?
>
> I know there is the built in HTMLStripTransformer but that doesn't seem to
> work well with malformed/incomplete HTML. I've created a custom transformer
> to first tidy up the html using JTidy then I pass it to the
> HTMLStripTransformer like so:
>
>  ignoreErrors="true" propertiesFile="config/tidy.properties"/>
> 
>
> However this method isn't fool-proof as you can see by my ignoreErrors
> option.
>
> I quickly took a peek at Tika and I noticed that it has its own HtmlParser.
> Is this something I should look into? Are there any alternatives that deal
> with malformed/incomplete  html? Thanks
>
>
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Indexing-HTML-tp884497p884497.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Lance Norskog
goks...@gmail.com


how to have "shards" parameter by default

2010-06-09 Thread Scott Zhang
Hi. I am running distributed search on Solr.
I have 70 Solr instances, so each time I want to search I need to use
?shards=localhost:7500/solr,localhost..7620/solr

It is a very long URL.

So how can I put the shards parameter into the config file so I don't need to
type it each time?


thanks.
Scott


Re: how to have "shards" parameter by default

2010-06-09 Thread Scott Zhang
I tried putting "shards" into the default request handler.
But now each time I search, Solr hangs forever.
So what's the correct solution?

Thanks.

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="fl">*</str>
    <str name="version">2.1</str>
    <str name="shards">localhost:7500/solr,localhost:7501/solr,localhost:7502/solr,localhost:7503/solr,localhost:7504/solr,localhost:7505/solr,localhost:7506/solr</str>
  </lst>
</requestHandler>

On Thu, Jun 10, 2010 at 11:48 AM, Scott Zhang wrote:

> Hi. I am running distributed search on solr.
> I have 70 solr instances. So each time I want to search I need to use
> ?shards=localhost:7500/solr,localhost..7620/solr
>
> It is very long url.
>
> so how can I encode shards into config file then i don't need to type each
> time.
>
>
> thanks.
> Scott
>


Re: Indexing HTML

2010-06-09 Thread Blargy

Does the HTMLStripChar variant apply at index time or query time? Would it
matter to use one over the other?

As a side question, if I want to perform highlighter summaries against this
field do I need to store the whole field or just index it with
TermVector.WITH_POSITIONS_OFFSETS? 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-HTML-tp884497p884579.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing HTML

2010-06-09 Thread Blargy

Wait... do you mean I should try the HTMLStripCharFilterFactory analyzer at
index time?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-HTML-tp884497p884592.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Indexing HTML

2010-06-09 Thread Ken Krugler


On Jun 9, 2010, at 8:38pm, Blargy wrote:



What is the preferred way to index html using DIH (my html is stored  
in a

blob field in our database)?

I know there is the built in HTMLStripTransformer but that doesn't  
seem to
work well with malformed/incomplete HTML. I've created a custom  
transformer

to first tidy up the html using JTidy then I pass it to the
HTMLStripTransformer like so:




However this method isn't fool-proof as you can see by my ignoreErrors
option.

I quickly took a peek at Tika and I noticed that it has its own  
HtmlParser.
Is this something I should look into? Are there any alternatives  
that deal

with malformed/incomplete  html? Thanks


Actually the Tika HtmlParser just wraps TagSoup - that's a good option  
for cleaning up busted HTML.


-- Ken



+1 530-265-2225





Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






3.1-dev spatial search problem

2010-06-09 Thread nickdos

I'm running the 3.x branch and I'm trying to implement spatial searching. 

I am able to sort results by distance from a given lat/long using a query
like:

http://localhost:8080/solr/select/?q=_val_:"recip(dist(2, lat_long,
vector(-66.5,75.1)),1,1,0)"&fl=*,score

which gives me the expected results sorted by distance (location field type
called lat_long). 

However I cannot perform a spatial filter on the results.

When I try either:

http://localhost:8080/solr/select/?q={!sfilt fl=location}&pt=-65.5,75.2&d=20

http://localhost:8080/solr/select/?q={!sfilt fl=lat_long}&pt=-65.5,75.2&d=20

I get the following error:

SEVERE: org.apache.solr.common.SolrException: Unknown query type 'sfilt'

I have tried various forms of this query to no avail. Any suggestions?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/3-1-dev-spatial-search-problem-tp884741p884741.html
Sent from the Solr - User mailing list archive at Nabble.com.


Indexing Problem with SOLR multicore

2010-06-09 Thread seesiddharth

Hi,
  I am using SOLR with a Tomcat server. I have configured two
cores inside the SOLR home directory. The solr.xml file looks like:


  


  
 

I am also using DIH to upload the data into these two cores separately, and the
document count in the two cores is different. However, whenever I restart
the Tomcat server the document counts in the two cores show the same. Also,
both cores exist, but whenever I try to search the data in one core it
returns data from the other core.

E.g. if I search the data in the MyTestCore1 core then Solr returns the
result from the MyTestCore2 core (this is a problem), and if I search the
data in the MyTestCore2 core then Solr returns the data from the MyTestCore2
core (which is fine), or sometimes vice-versa happens...

Now if I reindex the data in the MyTestCore1 core using "Full data-import with
cleanup" then the problem gets sorted out, but it comes back again if I restart
my Tomcat server.

Is there any issue with my core configuration? Please help


Thanks,
Siddharth



-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-Problem-with-SOLR-multicore-tp884745p884745.html
Sent from the Solr - User mailing list archive at Nabble.com.