Function queries question

2009-11-20 Thread Oliver Beattie
Hi all,

I'm a relative newcomer to Solr, and I'm trying to use it in a project
of mine. I need to do a function query (I believe) to filter the
results so they are within a certain distance of a point. For this, I
understand I should use something like sqedist or hsin, and from the
documentation on the FunctionQuery page, I believe that the function
is executed on every "row" (or "record", not sure what the proper term
for this is). So, my question is threefold really; are those functions
the ones I should be using to perform a search where distance is one
of the criteria (there are others), and if so, does Solr execute the
query on every row (and again, if so, is there any way of preventing
this [like subqueries, though I know they're not supported])?

Sorry if this is a little confusing… any help would be greatly appreciated :)

Thanks,
Oliver


RE: schema-based Index-time field boosting

2009-11-20 Thread Ian Smith
Hi David, thanks for replying,

The field boost attribute was put there by me back in the 1.3 days, when
I somehow gained the mistaken impression that it was supposed to work!
Of course, despite a lot of searching I haven't been able to find
anything to back up my position ;)

Unfortunately our code (intentionally) has no idea what index it is
writing to so only a schema-based approach is really going to work for
us.

Of course, by now I am convinced that this might be a really good
feature - I might get the chance to look into it in the near future -
can anyone think of reasons why this might not work in practice?

Regards,

Ian.

-Original Message-
From: Smiley, David W. [mailto:dsmi...@mitre.org] 
Sent: 19 November 2009 19:29
To: solr-user@lucene.apache.org
Subject: Re: Index-time field boosting not working?

Hi Ian.  Thanks for buying my book.

The "boost" attribute goes on the field for the XML message you're
sending to Solr.  In your example you mistakenly placed it in the
schema.
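
For reference, a minimal sketch of what that looks like in a Solr XML update message (the field names here are made up, not taken from Ian's schema):

<add>
  <doc>
    <field name="id">doc1</field>
    <field name="title" boost="3.0">Some boosted title text</field>
    <field name="body">The rest of the document...</field>
  </doc>
</add>

Document-level boosts work the same way, via a boost attribute on the <doc> element.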

FYI, I use index-time boosting as well as query-time boosting.  Although
index-time boosting isn't something I can change on a whim, I've found
it far easier to control the scoring with than, say, function queries,
which would be the query-time substitute if the boost is a function of
particular field values.

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/


On Nov 18, 2009, at 6:40 AM, Ian Smith wrote:

> I have the following field configured in schema.xml:
> 
> <field name="..." type="text" ... omitNorms="false" boost="3.0" />
> 
> Where "text" is the type which came with the Solr distribution.  I 
> have not been able to get this configuration to alter any document 
> scores, and if I look at the indexes in Luke there is no change in the
> norms (compared to an un-boosted equivalent).
> 
> I have confirmed that document boosting works (via SolrJ), but our 
> field boosting needs to be done in the schema.
> 
> Am I doing something wrong (BTW I have tried using "3.0f" as well, no 
> difference)?
> 
> Also, I have seen no debug output during startup which would indicate 
> that field boosting is being configured - should there be any?
> 
> I have found no usage examples of this in the Solr 1.4 book, except a 
> vague discouragement - is this a deprecated feature?
> 
> TIA,
> 
> Ian
> 



Re: Configuring Solr to use RAMDirectory

2009-11-20 Thread Andrey Klochkov
I thought that SOLR-465 just does what is asked, i.e. one can use any
Directory implementation including RAMDirectory. Thomas, take a look at it.

On Thu, Nov 12, 2009 at 7:55 AM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> I think not out of the box, but look at SOLR-243 issue in JIRA.
>
> You could also put your index on ram disk (tmpfs), but it would be useless
> for writing to it.
>
> Note that when people ask about loading the whole index in memory
> explicitly, it's often a premature optimization attempt.
>
> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>
>
>
> - Original Message 
> > From: Thomas Nguyen 
> > To: solr-user@lucene.apache.org
> > Sent: Wed, November 11, 2009 8:46:11 PM
> > Subject: Configuring Solr to use RAMDirectory
> >
> > Is it possible to configure Solr to fully load indexes in memory?  I
> > wasn't able to find any documentation about this on either their site or
> > in the Solr 1.4 Enterprise Search Server book.
>
>


-- 
Andrew Klochkov
Senior Software Engineer,
Grid Dynamics


Re: Solr - Load Increasing.

2009-11-20 Thread kalidoss

Thank you all.

   I have increased the heap size from 1 GB to 1.5 GB; Solr is now started with 
java -Xms512M -Xmx1536M -jar start.jar. The CPU load is normal and Solr 
is not restarting frequently.


   My autocommit maxDocs was increased to 200.

   For the last 24 hours there have been no issues with load/restarts.

Thanks Guys.
Kalidoss.m,

Otis Gospodnetic wrote:

Your autocommit settings are still pretty aggressive causing very frequent 
commits, and that is using your CPU.
Yes, splitting the servers into a master and slaves tends to be the 
performant/scalable way to go.  There is no real downside to replication, 
really, just a bit of network traffic.

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 

From: kalidoss 
To: solr-user@lucene.apache.org
Sent: Wed, November 18, 2009 2:25:05 AM
Subject: Re: Solr - Load Increasing.

There seems to be some improvement. The write speeds are faster and server 
restarts are fewer.


We changed the configuration to:
50
1

Before the Change:
- Server Restarts: 10 times in 12 hours
- CPU load: Average:50 and Peak:90

After the Change:
- Server Restarts: 4 times in 12 hours.
- CPU load: Average:30 and Peak:~70

Our every day writes are around 60k and reads are around 1 million.

We are now changing maxDocs to 300 and maxTime to 1 ms, and hoping for some 
more improvement.


The system configuration is 4GB RAM and 4 core x 2 CPUs. We start the
solr (1.3) like this: java -Xms512M -Xmx1024M -jar start.jar

Is there any other way we can reduce the high CPU load in the system?

Do you guys think that upgrading to 1.4 and having replication in 
place, with reads and writes split into separate Solr instances, will help? How 
efficient will the replication be with the above-mentioned scenario? Is 
there any place we can look for info on the disadvantages of 
replication?


Please help.
Kalidoss.m,
Tom Alt wrote:

Nice to learn a new word for the day!

But to answer your question, or at least part of it, I don't really think
you want a configuration like

 
 1

 10
 


Committing every doc, and every 10 milliseconds? That's just asking for
problems. How about starting with 1000 docs, and five minutes for maxTime
(5*60*1000), or about 3 lakh milliseconds.
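
A minimal sketch of that suggestion as a solrconfig.xml autoCommit block (the
updateHandler element is the standard place for it; the values are just the
ones proposed above):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>1000</maxDocs>
    <!-- 5 * 60 * 1000 ms = 5 minutes -->
    <maxTime>300000</maxTime>
  </autoCommit>
</updateHandler>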

That should help performance a lot. Try that, and see how it works.

Tom

On Mon, Nov 16, 2009 at 2:43 PM, Shashi Kant wrote:

 

I think it would be useful for members of this list to realize that not
everyone uses the same metrology and terms.

It is very easy for "Americans" to use the imperial system and presume
everyone does the same; Europeans to use the metric system etc. Hopefully
members on this list would be persuaded to use or at least clarify their
terminology.

While the apocryphal saying goes "the great thing about standards is that
there are so many to choose from", we should all make an effort to communicate
across cultures and nations.



On Mon, Nov 16, 2009 at 5:33 PM, Israel Ekpo wrote:

On Mon, Nov 16, 2009 at 5:22 PM, Walter Underwood wrote:
Probably "lakh": 100,000.


So, 900k qpd and 3M docs.

http://en.wikipedia.org/wiki/Lakh

wunder

On Nov 16, 2009, at 2:17 PM, Otis Gospodnetic wrote:

   

Hi,

Your autoCommit settings are very aggressive.  I'm guessing that's what's
causing the CPU load.

btw. what is "laks"?

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
 
From: kalidoss 
To: solr-user@lucene.apache.org

Sent: Mon, November 16, 2009 9:11:21 AM
Subject: Solr - Load Increasing.

Hi All.

  My server Solr box CPU utilization is increasing to between 60 and 90%, and
sometimes Solr goes down and we restart it manually.

  No of documents in Solr: 30 lakh.
  No of add/update requests to Solr: 30 thousand / day; an average of around
500 writes every 30 minutes.
  No of search requests: 9 lakh / day.
  Size of the data directory: 4 GB.


  My system RAM is 8 GB.
  System available space: 12 GB.
  Processor family: Pentium Pro.

  Our Solr data size can increase to something like 90 lakh documents, and
writes per day will be around 1 lakh. - Hope that is possible with Solr.

  For write commits I have configured:

  <maxDocs>1</maxDocs>
  <maxTime>10</maxTime>

  Is all of the above possible? 90 lakh documents, 1 lakh writes per day and
30 lakh reads per day? - If yes, what type of system configuration would be
required?

  Please suggest us.

thanks,
Kalidoss.m,



field type definition

2009-11-20 Thread revas
Hello,

If I define a field type like this in the schema, is this correct?

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    ...
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    ...
  </analyzer>
</fieldType>

Here I am not differentiating between a query analyzer and an index analyzer,
and I am assuming that the same analysis will be used at both query and index
time. Is this correct?

Regards
Revas


Re: Function queries question

2009-11-20 Thread Grant Ingersoll

On Nov 20, 2009, at 3:15 AM, Oliver Beattie wrote:

> Hi all,
> 
> I'm a relative newcomer to Solr, and I'm trying to use it in a project
> of mine. I need to do a function query (I believe) to filter the
> results so they are within a certain distance of a point. For this, I
> understand I should use something like sqedist or hsin, and from the
> documentation on the FunctionQuery page, I believe that the function
> is executed on every "row" (or "record", not sure what the proper term
> for this is). So, my question is threefold really; are those functions
> the ones I should be using to perform a search where distance is one
> of the criteria (there are others),

Short answer: yes.  Long answer:  I just committed those functions this week.  
I believe they are good, but feedback is encouraged.

> and if so, does Solr execute the
> query on every row (and again, if so, is there any way of preventing
> this [like subqueries, though I know they're not supported])?

You can use the frange capability to filter first.  See 
http://www.lucidimagination.com/blog/tag/frange/

Here's an example from a soon-to-be-published article I'm writing:

http://localhost:8983/solr/select/?q=*:*&fq={!frange l=0 u=400}hsin(0.57, -1.3, lat_rad, lon_rad, 3963.205)

This should filter out all documents that are beyond 400 miles in distance from 
that point on a sphere (specified in radians; see also the rads() method).


> 
> Sorry if this is a little confusing… any help would be greatly appreciated :)

No worries, a lot of this spatial stuff is still being ironed out.  See 
https://issues.apache.org/jira/browse/SOLR-773 for the issue that is tracking 
all of the related issues.  The pieces are starting to come together and I'm 
pretty excited about it b/c not only will it bring native spatial support to 
Solr, it will also give Solr some exciting new general capabilities (sort by 
function, pseudo-fields, facet by function, etc.)

Re: field type definition

2009-11-20 Thread Grant Ingersoll

On Nov 20, 2009, at 7:22 AM, revas wrote:

> Hello,
> 
> If I define a field type like this in the schema, is this correct?
> 
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     ...
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>             catenateAll="0" splitOnCaseChange="1"/>
>     ...
>   </analyzer>
> </fieldType>
> 
> Here I am not differentiating between a query analyzer and an index
> analyzer, and I am assuming that the same analysis will be used at both
> query and index time. Is this correct?


Correct.  The analysis specified will be used at both query and index time.
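
If you ever do want them to differ, schema.xml lets you declare two analyzers
on the same fieldType. A minimal sketch (the type name and the exact filter
chain are illustrative; the class names are the stock Solr ones):

<fieldType name="text_split" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>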

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using 
Solr/Lucene:
http://www.lucidimagination.com/search



Re: Filtering query results

2009-11-20 Thread Grant Ingersoll

On Nov 19, 2009, at 4:59 PM, aseem cheema wrote:

> Hey Guys,
> I need to filter out some results based on who is performing the
> search. In other words, if a document is not accessible to a user
> performing search, I don't want it to be in the result set. What is
> the best/easiest way to do this reliable/securely in Solr?

Do you have ACL info on the document?  If so, you can likely do this through a 
filter (&fq=...).  If it is somewhere else, you will likely need to integrate 
in a component to do it.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using 
Solr/Lucene:
http://www.lucidimagination.com/search



Solr index on multiple drives.

2009-11-20 Thread swatkatz

Hi,

Can I have one instance of Solr write the index and data to multiple drives?
e.g.

Can I configure Solr to do something like -

c:\data
d:\data
e:\data

Or is the suggested way to use multiple Solr cores and have the application
shard the index across the cores ? Or is distributed search (by having
multiple Solr instances) the way to go ?
-- 
View this message in context: 
http://old.nabble.com/Solr-index-on-multiple-drives.-tp26441588p26441588.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Multi word synonym problem

2009-11-20 Thread Nair, Manas
Hi,
 
I tried using the recommended approach but to no benefit. The multiword 
synonyms are still not appearing in the result.
 
My schema.xml "text" fieldType has the SynonymFilterFactory in both the
index-time and query-time analyzer chains, with expand="true".

This "text" field is the defaultSearchField too.
 
If I give the synonym for Micheal Jackson as Michael Jackson, i.e. in my 
synonyms.txt file the entry is:
Micheal Jackson => Michael Jackson
 
The response is not searching for Michael Jackson. Instead it is searching for 
(text:Micheal and text:Jackson). To monitor the parsed query, I turned on 
debugQuery, and in this case the parsed query string was searching for 
Micheal and Jackson separately.
 
I was able to somehow get the correct response by modifying the synonyms.txt 
file. I changed the entry to:
Micheal Jackson , Michael Jackson  (replaced '=>' with ',').
 
Is there something that needs to be done with the schema part mentioned 
above? I would want the synonyms to work when I map them using =>.
 
Kindly help.
 
Thankyou,
Manas


From: AHMET ARSLAN [mailto:iori...@yahoo.com]
Sent: Thu 11/12/2009 1:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Multi word synonym problem



It is recommended [1] to use synonyms at index time only for various reasons 
especially with multi-word synonyms.

[1]http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

Only at index time, use expand=true and ignoreCase=true with synonyms.txt:

micheal, michael

OR:

micheal jackson, michael jackson

Note that it is important which filters you have before the synonym filter.
Be sure that you restart Tomcat and re-index.
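
A minimal sketch of that setup in schema.xml, with the synonym filter at index
time only (only the synonym-related parts are shown; the rest of the analyzer
chain is whatever your existing "text" type uses):

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>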

Query Micheal Jackson (not phrase search) should return the results
for Michael Jackson.

Hope this helps.

--- On Thu, 11/12/09, Nair, Manas  wrote:

> From: Nair, Manas 
> Subject: Multi word synonym problem
> To: solr-user@lucene.apache.org
> Cc: "Arumugam, Senthil Kumar" 
> Date: Thursday, November 12, 2009, 3:43 PM
> Hi Experts,
> 
> I would like help on multi word synonyms. The scenario is
> like:
> 
> I have a name Micheal Jackson(wrong term) which has a
> synonym Michael Jackson i.e.
> 
> Micheal Jackson => Michael Jackson
> 
> When I try to search for the word Micheal Jackson (not a
> phrase search), it is searching for text: Micheal , text:
> Jackson  and not for Michael Jackson.
> But when I search for "Micheal Jackson" (phrase search),
> solr is searching for "Michael Jackson" (the correct term).
> 
> The schema.xml for the particular core contains the 
> SynonymFilterFactory for text analyzer and is enabled during
> index as well as query time. The  SynonymFilterFactory
> during index and query time has the parameter expand=true.
> 
> Please help me as to how a multiword synonym can be made
> effective i.e I want a search for
> Micheal Jackson (not phrase search) to return the results
> for Michael Jackson.
> 
> What should be done so that Micheal Jackson is considered
> as one search term instead of splitting it.
> 
> Any help is greatly appreciated.
> 
> Thankyou,
> Manas Nair
>


 




Re: Upgrade to solr 1.4

2009-11-20 Thread kalidoss

I also want to upgrade from v1.3 to 1.4.

I replaced the 1.3 index directory with 1.4 and made the associated schema 
changes. It is throwing a lot of exceptions, like datatype mismatches with 
Integer, String, Date, etc. Even the results come back with errors, for 
example: "name="Alias">ERROR:SCHEMA-INDEX-MISMATCH,stringValue=14903346"


Are there any tools/notes for upgrading from 1.3 to 1.4, covering data and 
schema data types, etc.?


Please suggest us.

-Kalidoss.m,

Walter Underwood wrote:

We are using the script replication. I have no interest in spending time
configuring and QA'ing a different method when the scripts work fine.

We are running the nightly from 2009-05-11.

wunder

On 6/26/09 8:51 AM, "Shalin Shekhar Mangar"  wrote:


On Fri, Jun 26, 2009 at 9:11 PM, Walter Underwood
wrote:


Netflix is running a nightly build from May in production. We did our
normal QA on it, then ran it on one of our five servers for two weeks.
No problems. It is handling about 10% more traffic with 10% less CPU.

Wow, that is good news! Are you also using the java based replication?


We deployed 1.4 to all our servers yesterday.

Can you tell us which revision you used?










Solr Cell text extraction

2009-11-20 Thread Ian Smith
Hi Guys,

I am trying to use Solr Cell to extract body content from documents, and
also to pass along some literal field values.  Trouble is, some of the
literal fields contain spaces, colons etc. which cause a "bad request"
exception in the server.  However, if I URL encode these fields the
encoding is not stripped away, so it is still present in search
responses.

Is there a way to pass literal values containing non-URL safe characters
to Solr Cell?

Regards,

Ian.





Re: Upgrade to solr 1.4

2009-11-20 Thread kalidoss
In version 1.3 the EventDate field type is "date"; in 1.4 it is also "date", but we 
are getting the following error.


name="EventDate">ERROR:SCHEMA-INDEX-MISMATCH,stringValue=2008-05-16T07:19:28


-kalidoss.m,

kalidoss wrote:

Even i want to upgrade from v1.3 to 1.4

I did 1.3 index directory replace with 1.4 and associated schema 
changes in that. Its throwing lot of exception like datatype mismatch 
with Integer, String, Date, etc.  Even the results are coming with 
some error example: "name="Alias">ERROR:SCHEMA-INDEX-MISMATCH,stringValue=14903346"


Is there any tool/notes to upgrade from 1.3 to 1.4? on Data and schema 
data types etc?


Please suggest us.

-Kalidoss.m,

Walter Underwood wrote:

We are using the script replication. I have no interest in spending time
configuring and QA'ing a different method when the scripts work fine.

We are running the nightly from 2009-05-11.

wunder

On 6/26/09 8:51 AM, "Shalin Shekhar Mangar"  
wrote:



On Fri, Jun 26, 2009 at 9:11 PM, Walter Underwood
wrote:


Netflix is running a nightly build from May in production. We did our
normal QA on it, then ran it on one of our five servers for two weeks.
No problems. It is handling about 10% more traffic with 10% less CPU.

Wow, that is good news! Are you also using the java based replication?


We deployed 1.4 to all our servers yesterday.

Can you tell us which revision you used?














creating Lucene document from an external XML file.

2009-11-20 Thread Phanindra Reva
Hello All,
  I am a newbie using Solr and Lucene. In my task, I have
to create org.apache.lucene.document.Document objects from external,
valid Solr XML files. To be brief, depending on the names of the fields
I need to modify the corresponding values in a way that is specific to our
project. So I would like to know whether there is an API exposed to
create an org.apache.lucene.document.Document object directly from
an external XML file, because in my case I need to make changes to
the created Document object.
Please don't mind if it does not make sense.
Thanks.


RE: Solr Cell text extraction - non-issue

2009-11-20 Thread Ian Smith
Sorry guys, the bad request seemed to be caused elsewhere, no need to
URL encode now.
Ian.

-Original Message-
From: Ian Smith [mailto:ian.sm...@gossinteractive.com] 
Sent: 20 November 2009 15:26
To: solr-user@lucene.apache.org
Subject: Solr Cell text extraction

Hi Guys,

I am trying to use Solr Cell to extract body content from documents, and
also to pass along some literal field values.  Trouble is, some of the
literal fields contain spaces, colons etc. which cause a "bad request"
exception in the server.  However, if I URL encode these fields the
encoding is not stripped away, so it is still present in search
responses.

Is there a way to pass literal values containing non-URL safe characters
to Solr Cell?

Regards,

Ian.





Re: How to use DataImportHandler with ExtractingRequestHandler?

2009-11-20 Thread javaxmlsoapdev

Did you extend DIH to do this work? Can you share code samples? I have a
similar requirement where I need to index database records, and each record
has a column with a document path, so I need to create another index for
documents (we allow users to search both indexes separately) in parallel with
reading some metadata about the documents from the database as well. I have all
sorts of different document formats to index. FYI, I am on Solr 1.4.0. Any
pointers would be appreciated.

Thanks,

Sascha Szott wrote:
> 
> Hi Khai,
> 
> a few weeks ago, I was facing the same problem.
> 
> In my case, this workaround helped (assuming, you're using Solr 1.3): 
> For each row, extract the content from the corresponding pdf file using 
> a parser library of your choice (I suggest Apache PDFBox or Apache Tika 
> in case you need to process other file types as well), put it between
> 
>   <foo> ... </foo>
> 
> and store it in a text file. To keep the relationship between a file and 
> its corresponding database row, use the primary key as the file name.
> 
> Within data-config.xml use the XPathEntityProcessor as follows (replace 
> dbRow and primaryKey respectively):
> 
> <entity processor="XPathEntityProcessor"
>         forEach="/foo"
>         url="${dbRow.primaryKey}.xml">
>   ...
> </entity>
> 
> 
> And, by the way, in Solr 1.4 you do not have to put your content between 
> xml tags: use the PlainTextEntityProcessor instead of
> XPathEntityProcessor.
> 
> Best,
> Sascha
> 
> Khai Doan schrieb:
>> Hi all,
>> 
>> My name is Khai.  I have a table in a relational database.  I have
>> successfully use DataImportHandler to import this data into Apache Solr.
>> However, one of the column store the location of PDF file.  How can I
>> configure DataImportHandler to use ExtractingRequestHandler to extract
>> the
>> content of the PDF?
>> 
>> Thanks!
>> 
>> Khai Doan
>> 
> 
> 
> 

-- 
View this message in context: 
http://old.nabble.com/How-to-use-DataImportHandler-with-ExtractingRequestHandler--tp25267745p26443544.html
Sent from the Solr - User mailing list archive at Nabble.com.
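
For reference, a minimal data-config.xml sketch along the lines Sascha
describes above: a database entity with a nested XPathEntityProcessor entity
that pulls the extracted text from a per-row XML file. All table, column and
field names here are illustrative assumptions, not from the original mails:

<dataConfig>
  <!-- one data source for the database rows, one for the per-row XML files -->
  <dataSource name="db" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/docs" user="solr" password="solr"/>
  <dataSource name="files" type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="dbRow" dataSource="db"
            query="SELECT primaryKey, title FROM documents">
      <field column="primaryKey" name="id"/>
      <field column="title" name="title"/>
      <!-- reads ${dbRow.primaryKey}.xml and maps the text inside <foo> -->
      <entity name="fileContent" dataSource="files"
              processor="XPathEntityProcessor"
              forEach="/foo"
              url="${dbRow.primaryKey}.xml">
        <field column="content" xpath="/foo"/>
      </entity>
    </entity>
  </document>
</dataConfig>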



RE: Index documents with Solr

2009-11-20 Thread javaxmlsoapdev

Glock, did you get this approach to work? Let me know. 

Thanks,

Glock, Thomas wrote:
> 
> I have a similar situation but not expecting any easy setup.  Currently
> the tables contain both a url to the file and quite a bit of additional
> metadata about the file.  I'm planning one initial load to Solr by
> creating xml in my own utility which posts the xml.  Data is messy so DIH
> is not a good choice for this situation.  After the initial load (only
> ~12K documents - takes 10 minutes tops); I plan to perform a second pass
> which will use the extractingrequesthandler.  I know how the id will map
> but not clear yet how to get that id to ExtractingRequestHandler. Would be
> good to see different examples on the Wiki. Have not yet had a first
> attempt - hoping to in a day or so.
> 
> 
> -Original Message-
> From: javaxmlsoapdev [mailto:vika...@yahoo.com]
> Sent: Wed 04-Nov-2009 5:42 PM
> To: solr-user@lucene.apache.org
> Subject: Index documents with Solr
>  
> 
> Wanted to find out how people are using Solr's ExtractingRequestHandler to
> index different types of documents from a configuration file in an import
> fashion. I want to use this handler in a similar way how DataImportHandler
> works where you can issue "import" command from the URL to create an index
> reading database table(s). 
> 
> For documents, I have a db table which stores files paths. Want to read
> file's location from a db table then create an index after reading
> document
> content using ExtractingRequestHandler. Again trying to see if all this
> can
> be done just from a configuration same way how DataImportHandler handles
> this
> 
> -- 
> View this message in context:
> http://old.nabble.com/Index-documents-with-Solr-tp26205991p26205991.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 
> 
> 
> 

-- 
View this message in context: 
http://old.nabble.com/Index-documents-with-Solr-tp26205991p26443551.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Filtering query results

2009-11-20 Thread Glock, Thomas
Hi Aseem -

I had a similar challenge.  The solution that works for my case was to
add "role" as a repeating string value in the solr schema.  

Each piece of content contains 1 or more roles and these values are
supplied to solr for indexing.

Users also have one or more roles (which correspond exactly to the
metadata placed on content and supplied to Solr.)

So when performing the search query, we add an fq parameter to filter
search results.  For example q=Search Phrase&fq=role:(role1 || role2 ||
role3)

Note that ultimate restriction to content is handled elsewhere, this is
only done as a filtering mechanism for search results.  Additionally, we
do not have unlimited sets of roles and that helps to keep the query
string on the HTTP GET to a minimum.  Finally, the roles for my system
are additive such that if there is a match on any one role - the user
has access - so an OR clause works.  Your system may have more complex
role rules.
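
For illustration, a minimal sketch of how such a field could be declared in
schema.xml (the field name matches the example above; the other attributes are
assumptions):

<field name="role" type="string" indexed="true" stored="false"
       multiValued="true"/>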

-Original Message-
From: aseem cheema [mailto:aseemche...@gmail.com] 
Sent: Thursday, November 19, 2009 5:00 PM
To: solr-user@lucene.apache.org
Subject: Filtering query results

Hey Guys,
I need to filter out some results based on who is performing the search.
In other words, if a document is not accessible to a user performing
search, I don't want it to be in the result set. What is the
best/easiest way to do this reliable/securely in Solr?
Thanks
--
Aseem


Re: Upgrade to solr 1.4

2009-11-20 Thread Yonik Seeley
On Fri, Nov 20, 2009 at 10:26 AM, kalidoss
 wrote:
> In version 1.3 EventDate field type is date, In 1.4 also its date But we are
> getting the following error.

Use the schema you had with 1.3 and it should work.  The example
schemas are not backward compatible with an index built with the
previous version.

"date" in the 1.3 example schema mapped to DateField, but TrieDateField in 1.4.

-Yonik
http://www.lucidimagination.com


Default sort order for filter query

2009-11-20 Thread Mike
When I do a search using q=*:* and then narrow down the result set using 
a filter query, are there rules that are used for the sort order in the 
result set? In my results I have a "name" field that appears to be 
sorted descending in lexicographical order. For example:


Wyoming
Wynford
Wrightstown

The field declaration is:

<field name="name" type="text" ... />

and this is the first field that is defined that is of type text. I'm assuming
the ordering is arbitrary. I can easily set up a new field for sorting based
on the "string" type, but I'm curious what rules are deciding the sort
order in this example.


I'm using the dismax and the 1.4 release.

Mike





Re: Default sort order for filter query

2009-11-20 Thread Mike

Mike wrote:
When I do a search using q=*:* and then narrow down the result set 
using a filter query, are there rules that are used for the sort order 
in the result set? In my results I have a "name" field that appears to 
be sorted descending in lexicographical order. For example:


Wyoming
Wynford
Wrightstown

The field declaration is:

 and this 
is the first field that is defined that is of type text. I'm assuming 
the ordering is arbitrary. I can easily setup a new field for sorting 
based on the "string" type, but I'm curious what rules are deciding 
the sort order in this example.


I'm using the dismax and the 1.4 release.

Mike





Sorry for the noise - I think I have just answered my own question. The 
order in which docs are indexed determine the result sort order unless 
overridden via sort query parameters :)





Re: Default sort order for filter query

2009-11-20 Thread Yonik Seeley
On Fri, Nov 20, 2009 at 11:15 AM, Mike  wrote:
> Sorry for the noise - I think I have just answered my own question. The
> order in which docs are indexed determine the result sort order unless
> overridden via sort query parameters :)

Correct.  The internal lucene document id is the tiebreaker for all sorts.

-Yonik
http://www.lucidimagination.com


comparing index-time boost and sort in the case of a date field

2009-11-20 Thread Anil Cherian
Hi,

I have a requirement to get results in order of the latest date in a field
called approval_dt, i.e. results having the latest approval date should appear
first in the Solr results XML. Sorting "desc" on approval_dt gave me this.

Can an index-time boost be of use here to improve performance? Could you please
help me with an answer.

Thank You.
Anil.


Re: Default sort order for filter query

2009-11-20 Thread Mike

Yonik Seeley wrote:

On Fri, Nov 20, 2009 at 11:15 AM, Mike  wrote:
  

Sorry for the noise - I think I have just answered my own question. The
order in which docs are indexed determine the result sort order unless
overridden via sort query parameters :)



Correct.  The internal lucene document id is the tiebreaker for all sorts.

-Yonik
http://www.lucidimagination.com


  

Good to know and makes things more clear. Thanks.



Re: Default sort order for filter query

2009-11-20 Thread Yonik Seeley
On Fri, Nov 20, 2009 at 11:28 AM, Yonik Seeley
 wrote:
> On Fri, Nov 20, 2009 at 11:15 AM, Mike  wrote:
>> Sorry for the noise - I think I have just answered my own question. The
>> order in which docs are indexed determine the result sort order unless
>> overridden via sort query parameters :)
>
> Correct.

I should clarify - I meant "Correct" in that the sort order you are
seeing is by internal docid, not that it's the default sort.
The default sort is always by score... it's just that *:* is producing
the same score for all docs, thus the sort you see is the tiebreaker
(internal document id).

-Yonik
http://www.lucidimagination.com


Re: Solr 1.3 query and index perf tank during optimize

2009-11-20 Thread Michael
Hoss,

Using Solr 1.4, I see constant index growth until an optimize.  I
commit (hundreds of updates) every 5 minutes and have a mergefactor of
10, but every 50 minutes I don't see the index collapse down to its
original size -- it's slightly larger.

Over the course of a week, the index grew from 4.5 gigs to 6 gigs,
growing and shrinking in file size and count but generally upward.
Only when I manually optimized did the index return to 4.5 gigs.

So -- I thought I understood you to mean that if I frequently merge,
it's basically the same as an optimize, and cruft will get purged.  Am
I misunderstanding you?

Michael
PS: The extra 1.5G actually matters, as this is one of 8 cores and I'm
trying to keep it all in RAM.

On Tue, Nov 17, 2009 at 2:37 PM, Israel Ekpo  wrote:
> On Tue, Nov 17, 2009 at 2:24 PM, Chris Hostetter
> wrote:
>
>>
>> : Basically, search entries are keyed to other documents.  We have finite
>> : storage,
>> : so we purge old documents.  My understanding was that deleted documents
>> : still
>> : take space until an optimize is done.  Therefore, if I don't optimize,
>> the
>> : index
>> : size on disk will grow without bound.
>> :
>> : Am I mistaken?  If I don't ever have to optimize, it would make my life
>> : easier.
>>
>> deletions are purged as segments get merged.  if you want to force
>> deleted documents to be purged, the only way to do that at the
>> moment is to optimize (which merges all segments).  but if you are
>> continually deleteing/adding documents, the deletions will eventaully get
>> purged even if you never optimize.
>>
>>
>>
>>
>> -Hoss
>>
>>
>
> Chris,
>
> Since the mergeFactor controls the segment merge frequency and size and the
> number of segments is limited to mergeFactor - 1.
>
> Would one be correct to state that if some documents have been deleted from
> the index and the changes finalized with a call to commit, as more documents
> are added to the index, eventually the index will be  implicitly "*optimized
> *" and the deleted documents will be purged even without explicitly issuing
> an optimize statement?
>
>
> --
> "Good Enough" is not good enough.
> To give anything less than your best is to sacrifice the gift.
> Quality First. Measure Twice. Cut Once.
>


Re: Solr 1.3 query and index perf tank during optimize

2009-11-20 Thread Yonik Seeley
On Fri, Nov 20, 2009 at 12:24 PM, Michael  wrote:
> So -- I thought I understood you to mean that if I frequently merge,
> it's basically the same as an optimize, and cruft will get purged.  Am
> I misunderstanding you?

That only applies to the segments involved in the merge.  The deleted
documents are left behind when old segments are merged into a new
segment.

-Yonik
http://www.lucidimagination.com


Re: Filtering query results

2009-11-20 Thread aseem cheema
Thank you much for your responses guys. I do not have ACL. I need to
make a web service call to find out if a user has access to a
document. I was hoping to get search results, call the web service
with the IDs from the search results telling me what IDs the user has
access to, and then filter others before returning back to the user.
ACL and role based fq is definitely some food for thought. I will need
to figure out the synchronization issues.

Thanks
Aseem


On Fri, Nov 20, 2009 at 8:04 AM, Glock, Thomas  wrote:
> Hi Aseem -
>
> I had a similar challenge.  The solution that works for my case was to
> add "role" as a repeating string value in the solr schema.
>
> Each piece of content contains 1 or more roles and these values are
> supplied to solr for indexing.
>
> Users also have one or more roles (which correspond exactly to the
> metadata placed on content and supplied to Solr.)
>
> So when performing the search query, we add an fq parameter to filter
> search results.  For example q=Search Phrase&fq=role:(role1 || role2 ||
> role3)
>
> Note that ultimate restriction to content is handled elsewhere, this is
> only done as a filtering mechanism for search results.  Additionally, we
> do not have unlimited sets of roles and that helps to keep the query
> string on the HTTP GET to a minimum.  Finally, the roles for my system
> are additive such that if there is a match on any one role - the user
> has access - so an OR clause works.  Your system may have more complex
> role rules.
>
> -Original Message-
> From: aseem cheema [mailto:aseemche...@gmail.com]
> Sent: Thursday, November 19, 2009 5:00 PM
> To: solr-user@lucene.apache.org
> Subject: Filtering query results
>
> Hey Guys,
> I need to filter out some results based on who is performing the search.
> In other words, if a document is not accessible to a user performing
> search, I don't want it to be in the result set. What is the
> best/easiest way to do this reliable/securely in Solr?
> Thanks
> --
> Aseem
>



-- 
Aseem


Re: Solr 1.3 query and index perf tank during optimize

2009-11-20 Thread Michael
On Fri, Nov 20, 2009 at 12:35 PM, Yonik Seeley
 wrote:
> On Fri, Nov 20, 2009 at 12:24 PM, Michael  wrote:
>> So -- I thought I understood you to mean that if I frequently merge,
>> it's basically the same as an optimize, and cruft will get purged.  Am
>> I misunderstanding you?
>
> That only applies to the segments involved in the merge.  The deleted
> documents are left behind when old segments are merged into a new
> segment.

Your statement is leading me to believe that I have misunderstood the
merge process.  I thought that every time there are 10 segments, they
get merged down to 1.  Therefore, every time a merge happens, every
single segment in my entire index is "involved in the merge".  9
segments later, we're back to 10 segments, and they're merged into 1.
9 segments later, we're back to 10 segments once again, and they're
merged into 1.

Maybe I have misunderstood the mergeFactor docs.  Maybe instead it's like this?
1. Segment A1 fills with N docs, and a new segment A2 is created.
2. A2 fills with N docs, and A3 is created; A3 fills with N docs, etc.
3. A9 fills with N docs, and merging occurs: Segment B1 is created
with 10*N docs, segments A1-A9 are deleted.
4. A new segment A1 fills with N docs, and a new segment A2 is
created; B1 is still sitting with 10*N docs.
5. Eventually A1 through A9 each have N docs, and then merging occurs:
Segment B2 is created, with 10*N docs.
6. Eventually Segments B1 through B9 each have 10*N docs, and merging
occurs: Segment C1 is created, with 100*N docs.  Segments B1-B9 are
deleted.
7. A new A1 starts filling again.

Some time down the line I might have 4 D segments with 1000*N docs
each, 6 C segments with 100*N docs each, 8 B segments with 10*N docs
each, 2 A segments with N docs each, and an open A3 segment filling
up.

If this is right, then your statement above means that yes, each merge
of many As into 1 B purges all the deleted docs in A1-A9, but All my
Ds, Cs, and Bs aren't updated to purge deleted docs yet.  Only when
B1-B9 merge into a new C do their deleted docs get purged; only when
C1-C9 merge into a new D do their deleted docs get purged; etc.

Is this right?  Sorry it was so verbose!
Michael


Re: Solr 1.3 query and index perf tank during optimize

2009-11-20 Thread Yonik Seeley
On Fri, Nov 20, 2009 at 2:32 PM, Michael  wrote:
> On Fri, Nov 20, 2009 at 12:35 PM, Yonik Seeley
>  wrote:
>> On Fri, Nov 20, 2009 at 12:24 PM, Michael  wrote:
>>> So -- I thought I understood you to mean that if I frequently merge,
>>> it's basically the same as an optimize, and cruft will get purged.  Am
>>> I misunderstanding you?
>>
>> That only applies to the segments involved in the merge.  The deleted
>> documents are left behind when old segments are merged into a new
>> segment.
>
> Your statement is leading me to believe that I have misunderstood the
> merge process.  I thought that every time there are 10 segments, they
> get merged down to 1.  Therefore, every time a merge happens, every
> single segment in my entire index is "involved in the merge".  9
> segments later, we're back to 10 segments, and they're merged into 1.
> 9 segments later, we're back to 10 segments once again, and they're
> merged into 1.
>
> Maybe I have misunderstood the mergeFactor docs.  Maybe instead it's like 
> this?
> 1. Segment A1 fills with N docs, and a new segment A2 is created.
> 2. A2 fills with N docs, and A3 is created; A3 fills with N docs, etc.
> 3. A9 fills with N docs, and merging occurs: Segment B1 is created
> with 10*N docs, segments A1-A9 are deleted.
> 4. A new segment A1 fills with N docs, and a new segment A2 is
> created; B1 is still sitting with 10*N docs.
> 5. Eventually A1 through A9 each have N docs, and then merging occurs:
> Segment B2 is created, with 10*N docs.
> 6. Eventually Segments B1 through B9 each have 10*N docs, and merging
> occurs: Segment C1 is created, with 100*N docs.  Segments B1-B9 are
> deleted.
> 7. A new A1 starts filling again.
>
> Some time down the line I might have 4 D segments with 1000*N docs
> each, 6 C segments with 100*N docs each, 8 B segments with 10*N docs
> each, 2 A segments with N docs each, and an open A3 segment filling
> up.
>
> If this is right, then your statement above means that yes, each merge
> of many As into 1 B purges all the deleted docs in A1-A9, but All my
> Ds, Cs, and Bs aren't updated to purge deleted docs yet.  Only when
> B1-B9 merge into a new C do their deleted docs get purged; only when
> C1-C9 merge into a new D do their deleted docs get purged; etc.
>
> Is this right?  Sorry it was so verbose!

Yep, that's right.

-Yonik
http://www.lucidimagination.com


Huge load and long response times during search

2009-11-20 Thread Tomasz Kępski

Hi,

I'm using Solr (1.4) to search among about 3,500,000 documents. After the 
server kernel was updated to 64-bit, the system has started to suffer.

Our server has 8 GB of RAM and two Intel Core 2 Duo CPUs.
We used to have average loads around 2-2.5. That was not as good as it 
should be, but as long as HTTP response times were acceptable we did not 
care too much ;-)


For the last few days average loads are usually around 6, sometimes going 
even to 20. The PHP, MySQL and PostgreSQL based application is rather fine, 
but when it tries to access Solr it takes ages to load a page. In top, the 
Java process (Jetty) takes 200-250% of CPU, and iotop shows that most of 
the disk operations are done by Solr threads as well.


When we shut down Jetty the load goes down to 1.5 or even less than 1.

My index is ~12 GB. Below is a part of my solrconfig.xml:


   1024
   
   
   
   true
   true
   40
   200
   
   
 
solr 0 name="rows">10 
solr price name="start">0 10 
solr name="sort">rekomendacja 0 name="rows">10 
   static newSearcher warming query from 
solrconfig.xml

 
   
   
 
fast_warm 0 
10 
   static firstSearcher warming query from 
solrconfig.xml

 
   
   false


 
   
dismax
explicit
0.01

   name^90.0 scategory^450.0 brand^90.0 text^0.01 description^30






   brand,description,id,name,price,score


   4<100% 5<90%

100
*:*
   
 

sample query parameters from log looks like this:

2009-11-20 21:07:15 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select 
params={spellcheck=true&wt=json&rows=20&json.nl=map&start=520&facet=true&spellcheck.collate=true&fl=id,name,description,preparation,url,shop_id&q=camera&qt=dismax&version=1.3&hl.fl=name,description,atributes,brand,url&facet.field=shop_id&facet.field=brand&hl.fragsize=200&spellcheck.count=5&hl.snippets=3&hl=true} 
hits=3784 status=0 QTime=83

2009-11-20 21:07:15 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/spellCheckCompRH 
params={spellcheck=true&wt=json&rows=20&json.nl=map&start=520&facet=true&spellcheck.collate=true&fl=id,name,description,preparation,url,shop_id&q=camera&qt=dismax&version=1.3&hl.fl=name,description,atributes,brand,url&facet.field=shop_id&facet.field=brand&hl.fragsize=200&spellcheck.count=5&hl.snippets=3&hl=true} 
hits=3784 status=0 QTime=16


And at last the question ;-)
How to speed up the search?
Which parameters should I check first to find out what is the bottleneck?

Sorry for the verbose entry, but I would like to give as clear a picture as 
possible.


Thanks in advance,
Tom


Re: comparing index-time boost and sort in the case of a date field

2009-11-20 Thread Smiley, David W.
Using index time boosting isn't really a substitute for sorting.  It will be 
faster (I'm pretty sure) but isn't the same thing.  The index time boost is 
going to influence the score but not totally become the score... which means 
that in all likelihood there will be documents in search results that are out 
of order with respect to the approval_dt.  You might use high boost values as a 
compromise (ex: 100,200,300,...) but that wouldn't feel right to me in any case.

If your sorting result performance isn't fast enough then I'd discuss it here 
with everyone.  You'll want to put fields you sort on (like approval_dt) in a 
warming query so that when the search needs to sort on this field, the sort 
information is already cached.  This cache is invalidated when you modify the 
index, by the way.
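
For example, a minimal sketch of such a warming query in solrconfig.xml (the
listener syntax is standard; the specific query and sort are just assumptions
based on the field discussed here):

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="sort">approval_dt desc</str>
      <str name="rows">0</str>
    </lst>
  </arr>
</listener>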

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/

On Nov 20, 2009, at 11:34 AM, Anil Cherian wrote:

> Hi,
> 
> I have a requirement to get results in the order of latest date of a field
> called approval_dt. ie results having the latest approval date should appear
> first in the SOLR results xml. A sorting "desc" on approval_dt gave me this.
> 
> Can index-time boost be of use here to improve performance. Could you please
> help me with an answer.
> 
> Thank You.
> Anil.



Embedded solr with third party libraries

2009-11-20 Thread darniz

Hi
We are having an issue running our test cases with a third-party library under
embedded Solr.
For example, we are using the KStem library, which is not part of the Solr
distribution. When we run the test cases, our schema.xml has a definition for
Lucid KStem and it throws a ClassNotFoundException.
We declared the dependency on the two jars, lucid-kstem.jar and
lucid-solr-kstem.jar, but it still throws an error.

Until now, whenever we have to run embedded Solr we manually copy all the
config files, like schema.xml and solrconfig.xml, to a temp directory which is
treated as the Solr home; it is generally under the user home directory, so all
config files end up under user_home/solr/conf/. Below is an example:

C:\DOCUME~1\username\LOCALS~1\Temp\solr-all\0.8194571792905493\solr\conf\schema.xml

Now, in order for the jars to be loaded, should I copy the two jars to the
solr/lib directory? Is that the default location embedded Solr looks in for
extra jars?

Any advice?



 







-- 
View this message in context: 
http://old.nabble.com/Embedded-solr-with-third-party-libraries-tp26452534p26452534.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: schema-based Index-time field boosting

2009-11-20 Thread Chris Hostetter

: The field boost attribute was put there by me back in the 1.3 days, when
: I somehow gained the mistaken impression that it was supposed to work!
: Of course, despite a lot of searching I haven't been able to find
: anything to back up my position ;)

Solr has never supported anything like a "boost" parameter on fields in 
schema.xml.

: Of course, by now I am convinced that this might be a really good
: feature - I might get the chance to look into it in the near future -
: can anyone think of reasons why this might not work in practice?

Field boosting only makes sense if it's applied to only some of the 
documents in the index; if every document has an index-time boost on 
fieldX, then that boost is meaningless.

are you looking for query time boosting on fields?  like what dismax 
provides with the "qf" param?



-Hoss



Re: Huge load and long response times during search

2009-11-20 Thread Otis Gospodnetic
Tom,

It looks like the machine might simply be running too many things.  If the load 
is around 1 when Solr is not running, and this is a dual-core server, it shows 
its already relatively busy (cca 50% idle).  Your caches are not small, so I am 
guessing you either have to have a relatively big heap, or your heap is not 
large enough and it's the GC that's causing high CPU load.  If you are seeing 
Solr causing lots of IO, that's a sign the box doesn't have enough memory for 
all those servers running comfortably on it.

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
> From: Tomasz Kępski 
> To: solr 
> Sent: Fri, November 20, 2009 3:15:08 PM
> Subject: Huge load and long response times during search
> 
> Hi,
> 
> I'm using SOLR(1.4) to search among about 3,500,000 documents. After the 
> server 
> kernel was updated to 64bit system has started to suffer.
> Our server has 8G of RAM and double Intel Core 2 DUO.
> We used to have average loads around 2-2,5. It was not as good as it should 
> but 
> as long HTTP response times was acceptable we do not care to much ;-)
> 
> Since few days avg loads are usually around 6, sometimes goes even to 20. 
> PHP, 
> Mysql and Postgresql based application is rather fine, but when tries to 
> access 
> SOLR it takes ages to load page. In top java process (Jetty) takes 200-250% 
> of 
> CPU, iotop shows that most of the disk operations are done by SOLR threads as 
> well.
> 
> When we do shut down Jetty load goes down to 1,5 or even less than 1.
> 
> My index has ~12G below is a part of my solrconf.xml:
> 
> 
>   1024
>   
>  class="solr.LRUCache"
>  size="16384"
>  initialSize="4096"
>  autowarmCount="4096"/>
>   
>  class="solr.LRUCache"
>  size="16384"
>  initialSize="4096"
>  autowarmCount="1024"/>
>   
>  class="solr.LRUCache"
>  size="16384"
>  initialSize="16384"
>  autowarmCount="0"/>
>   true
>   true
>   40
>   200
>   
>   
> 
>   solr 0 
> name="rows">10 
>   solr price 
> name="start">0 10 
>   solr rekomendacja 
> name="start">0 10 
>   static newSearcher warming query from 
> solrconfig.xml
> 
>   
>   
> 
>   fast_warm 0 
> name="rows">10 
>   static firstSearcher warming query from 
> solrconfig.xml
> 
>   
>   false
> 
> 
> 
>   
> dismax
> explicit
> 0.01
> 
>name^90.0 scategory^450.0 brand^90.0 text^0.01 description^30
> 
> 
> 
> 
> 
> 
>brand,description,id,name,price,score
> 
> 
>4<100% 5<90%
> 
> 100
> *:*
>   
> 
> 
> sample query parameters from log looks like this:
> 
> 2009-11-20 21:07:15 org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/select 
> params={spellcheck=true&wt=json&rows=20&json.nl=map&start=520&facet=true&spellcheck.collate=true&fl=id,name,description,preparation,url,shop_id&q=camera&qt=dismax&version=1.3&hl.fl=name,description,atributes,brand,url&facet.field=shop_id&facet.field=brand&hl.fragsize=200&spellcheck.count=5&hl.snippets=3&hl=true}
>  
> hits=3784 status=0 QTime=83
> 2009-11-20 21:07:15 org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/spellCheckCompRH 
> params={spellcheck=true&wt=json&rows=20&json.nl=map&start=520&facet=true&spellcheck.collate=true&fl=id,name,description,preparation,url,shop_id&q=camera&qt=dismax&version=1.3&hl.fl=name,description,atributes,brand,url&facet.field=shop_id&facet.field=brand&hl.fragsize=200&spellcheck.count=5&hl.snippets=3&hl=true}
>  
> hits=3784 status=0 QTime=16
> 
> And at last the question ;-)
> How to speed up the search?
> Which parameters should I check first to find out what is the bottleneck?
> 
> Sorry for verbose entry but I would like to give as clear point of view as 
> possible
> 
> Thanks in advance,
> Tom



Re: creating Lucene document from an external XML file.

2009-11-20 Thread Otis Gospodnetic
Hi,
If I understand you correctly, you really want to be constructing 
SolrInputDocuments (not Lucene's Documents) and indexing those with SolrJ.  I 
don't think there is anything in the API that can read in an XML file and 
convert it into a SolrInputDocument instance, but aren't there libraries that 
can convert XML into Java objects and vice versa?  Maybe one of those could be used.

Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
> From: Phanindra Reva 
> To: solr-user@lucene.apache.org
> Sent: Fri, November 20, 2009 10:32:45 AM
> Subject: creating Lucene document from an external XML file.
> 
> Hello All,
>   I am a newbie with Solr and Lucene. In my task, I have
> to create org.apache.lucene.document.Document objects from external,
> valid Solr XML files. To be brief, depending on the names of the fields
> I need to modify the corresponding values, which are specific to our
> project. So I would like to know whether there is an API exposed to
> create an org.apache.lucene.document.Document object directly from
> an external XML file, because in my case I need to make changes to
> the created Document object.
> Please don't mind if it does not make sense.
> Thanks.



Re: Solr index on multiple drives.

2009-11-20 Thread Otis Gospodnetic
Hi,

No, dataDir is a single directory, so it is limited to a single partition on a
single drive.  But you can always have the disks in RAID, and then the index
could be spread over multiple drives.

Yes, if you have multiple Solr cores and multiple drives, you could put the cores on
different drives for performance reasons. Distributed search makes sense if
you can no longer fit a single index on one server. How to make search fast?
Make sure you have enough RAM.  And if you do not, put your index on an SSD.
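
For example, a rough sketch of the multi-core layout (core names and paths are
made up), with each core's dataDir pointed at a different drive:

  <!-- solr.xml: one core per drive -->
  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="core0" instanceDir="core0" />
      <core name="core1" instanceDir="core1" />
    </cores>
  </solr>

  <!-- core0/conf/solrconfig.xml -->
  <dataDir>D:/data/core0</dataDir>

  <!-- core1/conf/solrconfig.xml -->
  <dataDir>E:/data/core1</dataDir>

The application (or a shards parameter on one of the cores) then takes care of
spreading documents and queries across the cores.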

 Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR



- Original Message 
> From: swatkatz 
> To: solr-user@lucene.apache.org
> Sent: Fri, November 20, 2009 9:03:45 AM
> Subject: Solr index on multiple drives.
> 
> 
> Hi,
> 
> Can I have one instance of Solr write the index and data to multiple drives?
> e.g.
> 
> Can I configure Solr to do something like -
> 
> c:\data
> d:\data
> e:\data
> 
> Or is the suggested way to use multiple Solr cores and have the application
> shard the index across the cores ? Or is distributed search (by having
> multiple Solr instances) the way to go ?
> -- 
> View this message in context: 
> http://old.nabble.com/Solr-index-on-multiple-drives.-tp26441588p26441588.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Using DirectSolrConnection with Solrj

2009-11-20 Thread Lance Norskog
DirectSolrConnection is older and has not been changed in a year.
SolrJ is the preferred way to code an app against Solr.

SolrJ with the Embedded server will have the same performance
characteristics as DirectSolrConnection.
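
For reference, a rough sketch of the embedded setup with SolrJ (the solr home
path is made up; error handling omitted):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.core.CoreContainer;

  public class EmbeddedExample {
    public static void main(String[] args) throws Exception {
      // hypothetical solr home containing solr.xml and the core's conf/
      System.setProperty("solr.solr.home", "/path/to/solr/home");
      CoreContainer.Initializer initializer = new CoreContainer.Initializer();
      CoreContainer container = initializer.initialize();

      // "" selects the default core; same SolrServer API as CommonsHttpSolrServer
      SolrServer server = new EmbeddedSolrServer(container, "");
      QueryResponse rsp = server.query(new SolrQuery("*:*"));
      System.out.println("hits: " + rsp.getResults().getNumFound());

      container.shutdown();
    }
  }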

On Thu, Nov 19, 2009 at 5:55 AM, dipti khullar  wrote:
> Hi Solr experts
>
> We are using Solrj in our application -
> - CommonsHttpSolrServer (superclass:SolrServer)
>
> We are planning to compare the performance results of SolrServer with
> DirectSolrConnection.
>
> 1. I couldn't find any such class (like SolrServer for HTTP requests) that
> supports DirectSolrConnection in SolrJ.
>
> 2. Also, I found some posts on memory leak issues with Solr 1.3 and Lucene
> 2.4.0 if DirectSolrConnection is used.
> Should this be avoided with Lucene 2.4?
>
> Can somebody guide me on these two queries?
>
> Thanks
> Dipti Khullar
>



-- 
Lance Norskog
goks...@gmail.com


Re: Control DIH from PHP

2009-11-20 Thread Lance Norskog
Nice! I didn't notice that before. Very useful.

2009/11/19 Noble Paul നോബിള്‍  नोब्ळ् :
> you can pass the uniqueId as a param and use it in a sql query
> http://wiki.apache.org/solr/DataImportHandler#Accessing_request_parameters.
> --Noble
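
For illustration, a minimal sketch of that approach (the entity name, SQL
table/column, and the "id" parameter are all made up here):

  <!-- data-config.xml: the entity reads the request parameter -->
  <entity name="item"
          query="SELECT * FROM item WHERE id = '${dataimporter.request.id}'">
      ...
  </entity>

which can then be triggered from PHP with a plain HTTP request such as:

  http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true&id=1234

(clean=false keeps the rest of the index intact instead of wiping it first.)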
>
> On Thu, Nov 19, 2009 at 3:53 PM, Pablo Ferrari  wrote:
>> More specifically, I'm looking to update only one document using its unique
>> ID: I don't want the DIH to look up the whole database because I already know
>> the unique ID that has changed.
>>
>> Pablo
>>
>> 2009/11/19 Pablo Ferrari 
>>
>>>
>>>
>>> Hello!
>>>
>>> After working on Solr document updates using direct PHP code (using the
>>> SolrClient class), I want to use the DIH (Data Import Handler) to update my
>>> documents.
>>>
>>> Does anyone know how I can send commands to the DIH from PHP? Any idea or
>>> tutorial would be a great help because I'm not finding anything useful so
>>> far.
>>>
>>> Thank you for you time!
>>>
>>> Pablo
>>> Tinkerlabs
>>>
>>
>
>
>
> --
> -
> Noble Paul | Principal Engineer| AOL | http://aol.com
>



-- 
Lance Norskog
goks...@gmail.com


Re: Problem with SolrJ driver for Solr 1.4

2009-11-20 Thread Lance Norskog
Yes, these are both bugs. SolrJ should do field lists right, and
distributed search should work exactly the same as normal search.

Please file these in the JIRA.
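
Until that is fixed, one workaround sketch (not an official API, just using the
existing setters differently) is to pass the whole field list as a single fl
value:

  import org.apache.solr.client.solrj.SolrQuery;

  public class ScoreFieldWorkaround {
    public static SolrQuery buildQuery() {
      SolrQuery query = new SolrQuery("*:*");
      // one fl value instead of setFields("*") followed by setIncludeScore(true),
      // so shards see fl=*,score rather than two separate fl parameters
      query.setFields("*,score");
      return query;
    }
  }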

On Thu, Nov 19, 2009 at 8:32 AM, Asaf work  wrote:
> Hi,
>
> I'm using the SolrJ 1.4 client driver in a sharded Solr configuration and am
> experiencing 2 problems:
>
> 1) *The method SolrQuery.setIncludeScore(true)*:
> The current implementation of setIncludeScore(boolean) *adds* the value
> "score" to the FL parameter.
> This causes a problem when setFields is followed by setIncludeScore.
> If I do this:
>
> setFields("*");
> setIncludeScore(true);
>
> I would expect the outcome to be "fl=*,score"
> Instead the outcome is: "fl=* &fl=score" which fails to use the score field
> as FL is not a multi-valued field.
>
> The current implementation in the SolrJ SolrQuery object is:
> add("fl", "score")
> instead it should be:
> set("fl", get("fl") + ",score")
>
> obviously not as simplistic as that, but you catch my drift...
>
> 2) *Propagating "*,score" value to shards*:
> When doing an HTTP request to a Solr server using the shards parameter, the
> behavior of the response varies.
>
> The following requests cause the entire document (all fields) to be returned
> in the response:
>
> http://localhost:8180/solr/cpaCore/select/?q=*:*
>> http://localhost:8180/solr/cpaCore/select/?q=*:*&fl=score
>>
>> http://localhost:8180/solr/cpaCore/select/?q=*:*&shards=shardLocation/solr/cpaCore
>>
>
> The following request causes only the fields "id" and "score" to be returned
> in the response:
>
> http://localhost:8180/solr/cpaCore/select/?q=*:*&fl=score&shards=localhost:8180/solr/cpaCore
>>
>
> I don't know if this is by design, but it does make for some inconsistent
> behavior, as shard requests behave differently than regular requests.
> We have currently worked around these 2 issues; I'm just submitting them for
> your opinions and views on whether JIRA issues should be opened.
>
>
> With Thanks
>  Asaf Ary
>



-- 
Lance Norskog
goks...@gmail.com


Re: getting total index size & last update date/time from query

2009-11-20 Thread Lance Norskog
solr/admin/stats.jsp gives a much larger XML dump and also includes
these two data items.

Note that Luke can walk the entire index's data structures, so if you
have a large index it's like playing with fire.
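
If all you need programmatically is the document count, a rows=0 query is a
cheap alternative sketch (server URL made up); lastModified still has to come
from /admin/luke or stats.jsp as described above:

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

  public class IndexDocCount {
    public static void main(String[] args) throws Exception {
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
      SolrQuery q = new SolrQuery("*:*");
      q.setRows(0); // we only want numFound, not any documents
      long numDocs = server.query(q).getResults().getNumFound();
      System.out.println("documents in index: " + numDocs);
    }
  }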

On Thu, Nov 19, 2009 at 8:54 AM, Binkley, Peter
 wrote:
> The Luke request handler (normally available at /admin/luke) will
> give you the document count (not size on the disk, though, if that's
> what you want) and last update and other info:
>
> <lst name="index">
>        <int name="numDocs">14591</int>
>        <int name="maxDoc">14598</int>
>        <int name="numTerms">128730</int>
>        <long name="version">1196962176380</long>
>        <bool name="optimized">false</bool>
>        <bool name="current">true</bool>
>        <bool name="hasDeletions">true</bool>
>        <str name="directory">s/solr/data/index</str>
>        <date name="lastModified">2009-11-19T16:44:45Z</date>
> </lst>
>
> See http://wiki.apache.org/solr/LukeRequestHandler
>
> Peter
>
>
>
>> -Original Message-
>> From: Joel Nylund [mailto:jnyl...@yahoo.com]
>> Sent: Thursday, November 19, 2009 8:31 AM
>> To: solr-user@lucene.apache.org
>> Subject: getting total index size & last update date/time from query
>>
>> Hi,
>>
>> Looking for total number of documents in my index and the
>> last updated date/time of the index.
>>
>> Is there a way to get this through the standard query q=?
>>
>> If not, what is the best way to get this info from Solr?
>>
>> thanks
>> Joel
>>
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: index-time boost ... query

2009-11-20 Thread Lance Norskog
No, the reverse is true. Sorting is very, very fast in Lucene. The
first sort operation spends a lot of time building a data structure, and
subsequent sort calls reuse it.
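
For example, a SolrJ sketch of that kind of sort (only the approval_dt field
name is taken from this thread; the rest is made up):

  import org.apache.solr.client.solrj.SolrQuery;

  public class LatestApprovalsFirst {
    public static SolrQuery buildQuery(String userQuery) {
      SolrQuery query = new SolrQuery(userQuery);
      // newest approval dates first; documents with the same date
      // are then ordered by relevance score
      query.addSortField("approval_dt", SolrQuery.ORDER.desc);
      query.addSortField("score", SolrQuery.ORDER.desc);
      return query;
    }
  }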

On Thu, Nov 19, 2009 at 1:52 PM, Anil Cherian
 wrote:
> Hi David,
>
> I just tried sorting the results and I got the records with the latest
> approval_dt first.
>
> My question now is whether the index-time boosting method will improve
> response time, i.e. will I be able to achieve the same thing I achieved
> using sorting, but much faster, if I use index-time boosting?
>
> If you feel it helps could you please send me a sample query also.
>
> Thanks,
> Anil.
> On Thu, Nov 19, 2009 at 3:36 PM, Smiley, David W.  wrote:
>
>> Anil, without delving into why your boosting isn't working as you expect,
>> why don't you simply sort?  Based on a message you sent to me directly
>> (excerpted bellow), it seems you want sorting, not boosting.  You could
>> subsequently sort by score after approval_dt.
>>
>> ~ David Smiley
>> Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/
>>
>> >
>> > My ultimate aim is to always bring up records in the result having
>>  latest approval_dt to appear first using index-time boosting in SOLR. Could
>> you pls help me with some directions.
>>
>>
>>  On Nov 19, 2009, at 12:25 PM, Anil Cherian wrote:
>>
>> > Hi,
>> >
>> > I am working on index-time boosting.
>> > I have a field named approval_dt. I created that field in my Solr XML
>> > to be uploaded, by sorting my query in ascending order of approval_dt and
>> > then increasing the boost for this field by 0.1 as I encounter new records
>> > from the database. In my schema.xml that field has omitNorms=false
>> > (<field name="approval_dt" type="date" indexed="true" stored="true"
>> > omitNorms="false"/>)
>> >
>> > Suppose I am searching for a keyword say *India*. I want my results to
>> come
>> > in such a way that the ones with latest/ recent approval_dt should come
>> > first.
>> >
>> > I achieved this using the query-time boosting bf parameter. I am now trying
>> > it using the index-time approach.
>> > Any help is appreciated.
>> >
>> > Thank you.
>> > Anil
>>
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Solr 1.3 query and index perf tank during optimize

2009-11-20 Thread Lance Norskog
And, terms whose documents have been deleted are not purged. So, you
can merge all you like and the index will not shrink back completely.
Only an optimize will remove the "orphan" terms.

This is important because the orphan terms affect relevance
calculations. So you really want to purge them with an optimize. You
can do limited optimize passes with the 'maxSegments' option to the
optimize command.

http://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.22commit.22_and_.22optimize.22
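
For example, something like the following (the host, port and maxSegments value
are just an illustration) asks Solr to optimize down to at most 16 segments:

  curl http://localhost:8983/solr/update -H 'Content-type:text/xml' \
       --data-binary '<optimize maxSegments="16"/>'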

On Fri, Nov 20, 2009 at 11:37 AM, Yonik Seeley
 wrote:
> On Fri, Nov 20, 2009 at 2:32 PM, Michael  wrote:
>> On Fri, Nov 20, 2009 at 12:35 PM, Yonik Seeley
>>  wrote:
>>> On Fri, Nov 20, 2009 at 12:24 PM, Michael  wrote:
>>>> So -- I thought I understood you to mean that if I frequently merge,
>>>> it's basically the same as an optimize, and cruft will get purged.  Am
>>>> I misunderstanding you?
>>>
>>> That only applies to the segments involved in the merge.  The deleted
>>> documents are left behind when old segments are merged into a new
>>> segment.
>>
>> Your statement is leading me to believe that I have misunderstood the
>> merge process.  I thought that every time there are 10 segments, they
>> get merged down to 1.  Therefore, every time a merge happens, every
>> single segment in my entire index is "involved in the merge".  9
>> segments later, we're back to 10 segments, and they're merged into 1.
>> 9 segments later, we're back to 10 segments once again, and they're
>> merged into 1.
>>
>> Maybe I have misunderstood the mergeFactor docs.  Maybe instead it's like 
>> this?
>> 1. Segment A1 fills with N docs, and a new segment A2 is created.
>> 2. A2 fills with N docs, and A3 is created; A3 fills with N docs, etc.
>> 3. A9 fills with N docs, and merging occurs: Segment B1 is created
>> with 10*N docs, segments A1-A9 are deleted.
>> 4. A new segment A1 fills with N docs, and a new segment A2 is
>> created; B1 is still sitting with 10*N docs.
>> 5. Eventually A1 through A9 each have N docs, and then merging occurs:
>> Segment B2 is created, with 10*N docs.
>> 6. Eventually Segments B1 through B9 each have 10*N docs, and merging
>> occurs: Segment C1 is created, with 100*N docs.  Segments B1-B9 are
>> deleted.
>> 7. A new A1 starts filling again.
>>
>> Some time down the line I might have 4 D segments with 1000*N docs
>> each, 6 C segments with 100*N docs each, 8 B segments with 10*N docs
>> each, 2 A segments with N docs each, and an open A3 segment filling
>> up.
>>
>> If this is right, then your statement above means that yes, each merge
>> of many As into 1 B purges all the deleted docs in A1-A9, but All my
>> Ds, Cs, and Bs aren't updated to purge deleted docs yet.  Only when
>> B1-B9 merge into a new C do their deleted docs get purged; only when
>> C1-C9 merge into a new D do their deleted docs get purged; etc.
>>
>> Is this right?  Sorry it was so verbose!
>
> Yep, that's right.
>
> -Yonik
> http://www.lucidimagination.com
>



-- 
Lance Norskog
goks...@gmail.com


Index time boosts, payloads, and long query strings

2009-11-20 Thread Girish Redekar
Hi,

I'm relatively new to Solr/Lucene, and am using Solr (and not Lucene
directly) primarily because I can use it without writing Java code (the rest
of my project is coded in Python).

My application has the following requirements:
(a) ability to search over multiple fields, each with a different weight
(b) If possible, I'd like to have the ability to add extra or diminished
weight to particular tokens within a field
(c) My query strings are long (50-100 words)
(d) My index is 500K+ documents

1) The way to do (a) is field boosting (right?). My question is: Is all field
boosting done at query time? Even if I give index-time boosts to fields? Is
there a performance advantage in boosting fields at index time vs. using
something like fieldname:querystring^boost?
2) From what I've read, it seems that I can do (b) using payloads. However,
as this link (
http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/)
suggests, I will have to write a payload-aware query parser. I wanted to
confirm whether this is indeed the case - or is there an out-of-the-box way to
implement payloads? (I am using Solr 1.4.)
3) For my project, the user fills in multiple text boxes (for each query). I
combine these into a single query (with different treatment for the contents of
each text box). Consequently, my query looks something like (fieldname1:
queryterm1 queryterm2^2.0 queryterm3^3.0 +queryterm4)^1.0. Are there any
guidelines for improving the performance of such a system? (Sorry, this bit is
vague.)

Any help with this will be great !

Girish Redekar
http://girishredekar.net