Re: 2 solr dataImport requests on a single core at the same time

2010-07-22 Thread kishan

please help me
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/2-solr-dataImport-requests-on-a-single-core-at-the-same-time-tp978649p986351.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Dismax query response field number

2010-07-22 Thread scrapy

No, I'm talking about fields.

In my schema I've got about 15 fields with stored="true".

Like this:

   <field name="..." type="..." indexed="true" stored="true"/>

But when I run a query it returns only 10 fields; the last 4 or 5 are not in
the response??



 

-Original Message-
From: Lance Norskog 
To: solr-user@lucene.apache.org
Sent: Thu, Jul 22, 2010 2:47 am
Subject: Re: Dismax query response field number


Fields or documents? It will return all of the fields that are 'stored'.

The default number of documents to return is 10. Returning all of the
documents is very slow, so you have to request that with the rows=
parameter.
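For example (host, port and query here are placeholders):

```
http://localhost:8983/solr/select?q=*:*&rows=100
```

This returns up to 100 matching documents instead of the default 10.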



On Wed, Jul 21, 2010 at 3:32 PM, scr...@asia.com wrote:
>
> Hi,
>
> It seems that not all fields are returned from the query response when I use
> DISMAX? Only first 10??
>
> Any idea?
>
> Here is my solrconfig:
>
> <requestHandler name="dismax" class="solr.SearchHandler">
>   <lst name="defaults">
>     <str name="defType">dismax</str>
>     <str name="echoParams">explicit</str>
>     <str name="fl">*</str>
>     <float name="tie">0.01</float>
>     <str name="qf">
>        text^0.5 content^1.1 title^1.5
>     </str>
>     <str name="pf">
>        text^0.2 content^1.1 title^1.5
>     </str>
>     <str name="bf">
>        recip(price,1,1000,1000)^0.3
>     </str>
>     <str name="mm">
>        2&lt;-1 5&lt;-2 6&lt;90%
>     </str>
>     <int name="ps">100</int>
>     <str name="q.alt">*:*</str>
>     <str name="hl.fl">text features name</str>
>     <str name="f.name.hl.fragsize">0</str>
>     <str name="f.name.hl.alternateField">name</str>
>     <str name="f.text.hl.fragmenter">regex</str>
>   </lst>
> </requestHandler>

--
Lance Norskog
goks...@gmail.com


Re: Dismax query response field number

2010-07-22 Thread Grijesh.singh

Do you have data in that field as well? Solr returns only fields which have data.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Dismax-query-response-field-number-tp985567p986417.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Clustering results limit?

2010-07-22 Thread Stanislaw Osinski
Hi,

 I am attempting to cluster a query. It kinda works, but where my
> (regular) query returns 500 results the cluster only shows 1-10 hits for
> each cluster (5 clusters). Never more than 10 docs and I know its not
> right. What could be happening here? It should be showing dozens of
> documents per cluster.
>

Just to clarify -- how many documents do you see in the response (<result
name="response" /> section)? Clustering is performed on the search results
(in real time), so if you request 10 results, clustering will apply only to
those 10 results. To get a larger number of clusters you'd need to request
more results, e.g. 50, 100, 200 etc. Obviously, the trade-off here is that
it will take longer to fetch the documents from the index, and clustering
time will also increase. For some guidance on choosing the clustering
algorithm, you can take a look at the following section of the Carrot2 manual:
http://download.carrot2.org/stable/manual/#section.advanced-topics.fine-tuning.choosing-algorithm
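For example, a request along these lines (host, port and the clustering
parameter are assumptions -- they depend on how the ClusteringComponent is
wired into your request handler):

```
http://localhost:8983/solr/select?q=*:*&rows=100&clustering=true
```

Clustering is then performed over those 100 documents rather than the default 10.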

Cheers,

Staszek


Re: 2 solr dataImport requests on a single core at the same time

2010-07-22 Thread Alexey Serba
DataImportHandler does not support parallel execution of several
requests. You should either send your requests sequentially or
register several DIH handlers in solrconfig and use them in parallel.
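For example, registering two DIH instances in solrconfig.xml might look like
this (the handler names and config file names here are made up):

```xml
<!-- Two separate DIH instances; each can run an import independently. -->
<requestHandler name="/dataimport-a"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config-a.xml</str>
  </lst>
</requestHandler>
<requestHandler name="/dataimport-b"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config-b.xml</str>
  </lst>
</requestHandler>
```

Full-import requests sent to /dataimport-a and /dataimport-b can then run at
the same time.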


On Thu, Jul 22, 2010 at 11:20 AM, kishan  wrote:
>
> please help me
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/2-solr-dataImport-requests-on-a-single-core-at-the-same-time-tp978649p986351.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Dismax query response field number

2010-07-22 Thread scrapy

 Yes, I've got data... maybe my query is wrong?

select?q=moto&qt=dismax&q=city:Paris

Field city is not showing?

 


 

 

-Original Message-
From: Grijesh.singh 
To: solr-user@lucene.apache.org
Sent: Thu, Jul 22, 2010 10:07 am
Subject: Re: Dismax query response field number



Do you have data in that field as well? Solr returns only fields which have data.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Dismax-query-response-field-number-tp985567p986417.html
Sent from the Solr - User mailing list archive at Nabble.com.

 


Re: Securing Solr 1.4 in a glassfish container AS NEW THREAD

2010-07-22 Thread Bilgin Ibryam
Are you using the same instance of CommonsHttpSolrServer for all the
requests?

On Wed, Jul 21, 2010 at 4:50 PM, Sharp, Jonathan  wrote:

>
> Some further information --
>
> I tried indexing a batch of PDFs with the client and Solr CELL, setting
> the credentials in the httpclient. For some reason after successfully
> indexing several hundred files I start getting a "SolrException:
> Unauthorized" and an info message (for every subsequent file):
>
> INFO basic authentication scheme selected
> org.apache.commons.httpclient.HttpMethodDirector
> processWWWAuthChallenge
> INFO Failure authenticating with BASIC ''@host:port
>
> I increased session timeout in web.xml with no change. I'm looking
> through the httpclient authentication now.
>
> -Jon
>
> -Original Message-
> From: Sharp, Jonathan
> Sent: Friday, July 16, 2010 8:59 AM
> To: 'solr-user@lucene.apache.org'
> Subject: RE: Securing Solr 1.4 in a glassfish container AS NEW THREAD
>
> Hi Bilgin,
>
> Thanks for the snippet -- that helps a lot.
>
> -Jon
>
> -Original Message-
> From: Bilgin Ibryam [mailto:bibr...@gmail.com]
> Sent: Friday, July 16, 2010 1:31 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Securing Solr 1.4 in a glassfish container AS NEW THREAD
>
> Hi Jon,
>
> SolrJ (CommonsHttpSolrServer) internally uses apache http client to
> connect
> to solr. You can check there for some documentation.
> I secured solr also with BASIC auth-method and use the following snippet
> to
> access it from solrJ:
>
>  // set username and password
>  ((CommonsHttpSolrServer) server).getHttpClient()
>      .getParams().setAuthenticationPreemptive(true);
>  Credentials defaultcreds =
>      new UsernamePasswordCredentials("username", "secret");
>  ((CommonsHttpSolrServer) server).getHttpClient()
>      .getState().setCredentials(
>          new AuthScope("localhost", 80, AuthScope.ANY_REALM),
>          defaultcreds);
>
> HTH
> Bilgin Ibryam
>
>
>
> On Fri, Jul 16, 2010 at 2:35 AM, Sharp, Jonathan  wrote:
>
> > Hi All,
> >
> > I am considering securing Solr with basic auth in glassfish using the
> > container, by adding to web.xml and adding sun-web.xml file to the
> > distributed WAR as below.
> >
> > If using SolrJ to index files, how can I provide the credentials for
> > authentication to the http-client (or can someone point me in the
> direction
> > of the right documentation to do that or that will help me make the
> > appropriate modifications) ?
> >
> > Also any comment on the below is appreciated.
> >
> > Add this to web.xml
> > ---
> > <login-config>
> >   <auth-method>BASIC</auth-method>
> >   <realm-name>SomeRealm</realm-name>
> > </login-config>
> >
> > <security-constraint>
> >   <web-resource-collection>
> >     <web-resource-name>Admin Pages</web-resource-name>
> >     <url-pattern>/admin</url-pattern>
> >     <url-pattern>/admin/*</url-pattern>
> >     <http-method>GET</http-method><http-method>POST</http-method>
> >     <http-method>PUT</http-method><http-method>TRACE</http-method>
> >     <http-method>HEAD</http-method><http-method>OPTIONS</http-method>
> >     <http-method>DELETE</http-method>
> >   </web-resource-collection>
> >   <auth-constraint>
> >     <role-name>SomeAdminRole</role-name>
> >   </auth-constraint>
> > </security-constraint>
> >
> > <security-constraint>
> >   <web-resource-collection>
> >     <web-resource-name>Update Servlet</web-resource-name>
> >     <url-pattern>/update/*</url-pattern>
> >     <http-method>GET</http-method><http-method>POST</http-method>
> >     <http-method>PUT</http-method><http-method>TRACE</http-method>
> >     <http-method>HEAD</http-method><http-method>OPTIONS</http-method>
> >     <http-method>DELETE</http-method>
> >   </web-resource-collection>
> >   <auth-constraint>
> >     <role-name>SomeUpdateRole</role-name>
> >   </auth-constraint>
> > </security-constraint>
> >
> > <security-constraint>
> >   <web-resource-collection>
> >     <web-resource-name>Select Servlet</web-resource-name>
> >     <url-pattern>/select/*</url-pattern>
> >     <http-method>GET</http-method><http-method>POST</http-method>
> >     <http-method>PUT</http-method><http-method>TRACE</http-method>
> >     <http-method>HEAD</http-method><http-method>OPTIONS</http-method>
> >     <http-method>DELETE</http-method>
> >   </web-resource-collection>
> >   <auth-constraint>
> >     <role-name>SomeSearchRole</role-name>
> >   </auth-constraint>
> > </security-constraint>
> > ---
> >
> > Also add this as sun-web.xml
> >
> > <?xml version="1.0" encoding="UTF-8"?>
> > <!DOCTYPE sun-web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Application
> > Server 9.0 Servlet 2.5//EN"
> > "http://www.sun.com/software/appserver/dtds/sun-web-app_2_5-0.dtd">
> > <sun-web-app>
> >  <context-root>/Solr</context-root>
> >  <jsp-config>
> >   <property name="keepgenerated" value="true">
> >     <description>Keep a copy of the generated servlet class' java
> > code.</description>
> >   </property>
> >  </jsp-config>
> >  <security-role-mapping>
> >     <role-name>SomeAdminRole</role-name>
> >     <group-name>SomeAdminGroup</group-name>
> >  </security-role-mapping>
> >  <security-role-mapping>
> >     <role-name>SomeUpdateRole</role-name>
> >     <group-name>SomeUpdateGroup</group-name>
> >  </security-role-mapping>
> >  <security-role-mapping>
> >     <role-name>SomeSearchRole</role-name>
> >     <group-name>SomeSearchGroup</group-name>
> >  </security-role-mapping>
> > </sun-web-app>
> > --
> >
> > -Jon
> >
> >
> > -
> > SECURITY/CONFIDENTIALITY WARNING: This message and any attachments are
> > intended solely for the individual or entity to which they are
> addressed.
> > This communication may contain information that is privileged,
> confidential,
> > or exempt from disclosure under applicable law (e.g., personal health
> > information, research data, financial information). Because this
> e-mail has
> > been sent without encryption, individuals other than the intended
> recipient
> > may be able to view the information, forward it to others or tamper
> with the
> > information without the knowledge or consent of the sender. If you are
> not
> > the intended recipient, or the employee or person responsible for
> delivering
> > the message to the intended recipient, any dissemination, distribution
> or
> > copying of the communication is strictly prohibited. If you received
> the
> > communication in error, please notify the sender immediately by
> replying to
> > this message and deleting the message and any accompanying files from
> you

Re: Dismax query response field number

2010-07-22 Thread Peter Karich
Maybe it's too simple, but did you try rows=20 or something greater, as
Lance suggested?
=>

select?rows=20&qt=dismax

Regards,
Peter.

>  Yes i've data... maybe my query is wrong?
>
> select?q=moto&qt=dismax&q=city:Paris
>
> Field city is not showing?
>
>  
>
>
>  
>
>  
>
> -Original Message-
> From: Grijesh.singh 
> To: solr-user@lucene.apache.org
> Sent: Thu, Jul 22, 2010 10:07 am
> Subject: Re: Dismax query response field number
>
>
>
> Do you have data in that field as well? Solr returns only fields which have data.
>   


-- 
http://karussell.wordpress.com/



Re: solrconfig.xml and xinclude

2010-07-22 Thread Tommaso Teofili
Hi,
I am trying to do a similar thing within the schema.xml (using Solr 1.4.1),
having a (super)schema that is common to 2 instances and specific fields I
would like to include (with XInclude).
Something like this:

<schema>
   ...
  <fields>
    <field name="..." type="..." indexed="true" stored="true"
        required="false" multiValued="true"/>
    <field name="..." type="..." indexed="true" stored="true"
        required="false" multiValued="true"/>
    <field name="..." type="..." indexed="true" stored="true"
        required="false"/>
    ...
    <xi:include href="specific_fields_1.xml" parse="xml"
        xmlns:xi="http://www.w3.org/2001/XInclude"/>
  </fields>
  ...
</schema>

and it works with the specific_fields_1.xml (or specific_fields_2.xml) like
the following:

<field name="..." type="..." indexed="true" stored="true" required="false"/>

but it stops working when I add more than one field in the included XML:

<fields>
  <field name="..." type="..." indexed="true" stored="true" required="false"/>
  <field name="..." type="..." indexed="true" stored="false" required="false"/>
</fields>

and consequently modify the including element as following:

<xi:include href="specific_fields_1.xml" parse="xml" xpointer="/fields/field"
    xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:fallback>
    <xi:include href="specific_fields_2.xml" parse="xml"
        xpointer="/fields/field"/>
  </xi:fallback>
</xi:include>

I tried to modify the xpointer attribute value to:

fields/field
fields/*
/fields/*
element(/fields/field)
element(/fields/*)
element(fields/field)
element(fields/*)
but I had no luck.


Fiedzia, I think that xpointer="xpointer(something)" won't work as you can
read in the last sentence of the page regarding SolrConfig.xml [1].
I took a look at the Solr source code and I found a JUnit test for the
XInclude mechanism that tests the inclusion documented in the wiki [2][3].
I also found an entry on the Lucid Imagination website at [4] but couldn't fix
my issue.
Please, could someone help us regarding what is the right way to configure
XInclude inside Solr?
Thanks in advance for your time.
Cheers,
Tommaso

[1] : http://wiki.apache.org/solr/SolrConfigXml
[2] : http://wiki.apache.org/solr/SolrConfigXml#XInclude
[3] :
http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/test/org/apache/solr/core/TestXIncludeConfig.java
[4] :
http://www.lucidimagination.com/search/document/31a60b7ccad76de1/is_it_possible_to_use_xinclude_in_schema_xml


2010/7/21 fiedzia 

>
> I am trying to export some config options common to all cores into a single
> file,
> which would be included using xinclude. The only problem is how to include
> the children of a given node.
>
>
> common_solrconfig.xml looks like that:
>
> <config>
>   ...
> </config>
>
> solrconfig.xml looks like that:
>
> <config>
>   ...
> </config>
>
>
> now all of the following attempts have failed:
>
> <xi:include href="common_solrconfig.xml"
> xmlns:xi="http://www.w3.org/2001/XInclude"/>
> <xi:include href="common_solrconfig.xml" parse="xml"
> xmlns:xi="http://www.w3.org/2001/XInclude"/>
> <xi:include href="common_solrconfig.xml" xpointer="xpointer(config/*)"
> xmlns:xi="http://www.w3.org/2001/XInclude"/>
>
> <xi:include href="common_solrconfig.xml" xpointer="element(config/*)"
> xmlns:xi="http://www.w3.org/2001/XInclude"/>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/solrconfig-xml-and-xinclude-tp984058p984058.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Dismax query response field number

2010-07-22 Thread Chantal Ackermann
is this a typo in your query or in your e-mail?

you have the "q" parameter twice.
use "fq" for query inputs that mention a field explicitly when using
dismax.

So it should be:
select?q=moto&qt=dismax& fq =city:Paris

(the whitespace is only for visualization)


chantal


On Thu, 2010-07-22 at 11:03 +0200, scr...@asia.com wrote:
> Yes i've data... maybe my query is wrong?
> 
> select?q=moto&qt=dismax&q=city:Paris
> 
> Field city is not showing?
> 
>  
> 
> 
> 
> 
> 
> 
> -Original Message-
> From: Grijesh.singh 
> To: solr-user@lucene.apache.org
> Sent: Thu, Jul 22, 2010 10:07 am
> Subject: Re: Dismax query response field number
> 
> 
> 
> Do you have data in that field as well? Solr returns only fields which have data.





Re: Dismax query response field number

2010-07-22 Thread scrapy

 Thanks,

That was the problem!

select?q=moto&qt=dismax&fq=city:Paris


-Original Message-
From: Chantal Ackermann 
To: solr-user@lucene.apache.org 
Sent: Thu, Jul 22, 2010 12:47 pm
Subject: Re: Dismax query response field number


is this a typo in your query or in your e-mail?

you have the "q" parameter twice.
use "fq" for query inputs that mention a field explicitly when using
dismax.

So it should be:
select?q=moto&qt=dismax& fq =city:Paris

(the whitespace is only for visualization)


chantal


On Thu, 2010-07-22 at 11:03 +0200, scr...@asia.com wrote:
> Yes i've data... maybe my query is wrong?
> 
> select?q=moto&qt=dismax&q=city:Paris
> 
> Field city is not showing?
> 
>  
> 
> 
> 
> 
> 
> 
> -Original Message-
> From: Grijesh.singh 
> To: solr-user@lucene.apache.org
> Sent: Thu, Jul 22, 2010 10:07 am
> Subject: Re: Dismax query response field number
> 
> 
> 
> Do you have data in that field as well? Solr returns only fields which have data.




 


Re: Clustering results limit?

2010-07-22 Thread Darren Govoni
Staszek,
  Thank you. The cluster response has a maximum of 10 documents in each
cluster. I didn't set this limit and the query by itself returns 500+
documents. There should be many more than 10 in each cluster. Does it
default to 10 maybe? Or is there a way to say, cluster every result in
the query?

thank you, I will read the links again,
Darren

On Thu, 2010-07-22 at 10:15 +0200, Stanislaw Osinski wrote:

> Hi,
> 
>  I am attempting to cluster a query. It kinda works, but where my
> > (regular) query returns 500 results the cluster only shows 1-10 hits for
> > each cluster (5 clusters). Never more than 10 docs and I know its not
> > right. What could be happening here? It should be showing dozens of
> > documents per cluster.
> >
> 
> Just to clarify -- how many documents do you see in the response (<result
> name="response" /> section)? Clustering is performed on the search results
> (in real time), so if you request 10 results, clustering will apply only to
> those 10 results. To get a larger number of clusters you'd need to request
> more results, e.g. 50, 100, 200 etc. Obviously, the trade-off here is that
> it will take longer to fetch the documents from the index, clustering time
> will also increase. For some guidance on choosing the clustering algorithm,
> you can take a look at the following section of Carrot2 manual:
> http://download.carrot2.org/stable/manual/#section.advanced-topics.fine-tuning.choosing-algorithm
> .
> 
> Cheers,
> 
> Staszek




Re: Clustering results limit?

2010-07-22 Thread Darren Govoni
I set the rows=50 on my clustering URL in a browser and it returns more.

In my SolrJ, I used ModifiableSolrParams and I set ("rows",50) but it
still returns less than 10 for each cluster.

Is there a way to set rows wanted with ModifiableSolrParams?

thanks and sorry for the double post.

Darren

On Thu, 2010-07-22 at 10:15 +0200, Stanislaw Osinski wrote:

> Hi,
> 
>  I am attempting to cluster a query. It kinda works, but where my
> > (regular) query returns 500 results the cluster only shows 1-10 hits for
> > each cluster (5 clusters). Never more than 10 docs and I know its not
> > right. What could be happening here? It should be showing dozens of
> > documents per cluster.
> >
> 
> Just to clarify -- how many documents do you see in the response (<result
> name="response" /> section)? Clustering is performed on the search results
> (in real time), so if you request 10 results, clustering will apply only to
> those 10 results. To get a larger number of clusters you'd need to request
> more results, e.g. 50, 100, 200 etc. Obviously, the trade-off here is that
> it will take longer to fetch the documents from the index, clustering time
> will also increase. For some guidance on choosing the clustering algorithm,
> you can take a look at the following section of Carrot2 manual:
> http://download.carrot2.org/stable/manual/#section.advanced-topics.fine-tuning.choosing-algorithm
> .
> 
> Cheers,
> 
> Staszek




Re: Dismax query response field number

2010-07-22 Thread Justin Lolofie
scrapy, what version of solr are you using?

I'd like to do "fq=city:Paris" but it doesn't seem to work for me (solr
1.4) and the docs seem to suggest it's a feature that is coming but not
there yet? Or maybe I misunderstood?


On Thu, Jul 22, 2010 at 6:00 AM,   wrote:
>
>  Thanks,
>
> That was the problem!
>
>
>
>
> select?q=moto&qt=dismax& fq =city:Paris
>
>
>
>
>
>
>
>
>
>
>
> -Original Message-
> From: Chantal Ackermann 
> To: solr-user@lucene.apache.org 
> Sent: Thu, Jul 22, 2010 12:47 pm
> Subject: Re: Dismax query response field number
>
>
> is this a typo in your query or in your e-mail?
>
> you have the "q" parameter twice.
> use "fq" for query inputs that mention a field explicitly when using
> dismax.
>
> So it should be:
> select?q=moto&qt=dismax& fq =city:Paris
>
> (the whitespace is only for visualization)
>
>
> chantal
>
>
> On Thu, 2010-07-22 at 11:03 +0200, scr...@asia.com wrote:
>> Yes i've data... maybe my query is wrong?
>>
>> select?q=moto&qt=dismax&q=city:Paris
>>
>> Field city is not showing?
>>
>>
>>
>>
>>
>>
>>
>>
>> -Original Message-
>> From: Grijesh.singh 
>> To: solr-user@lucene.apache.org
>> Sent: Thu, Jul 22, 2010 10:07 am
>> Subject: Re: Dismax query response field number
>>
>>
>>
>> Do you have data in that field as well? Solr returns only fields which have data.
>
>
>
>
>
>


Re: Dismax query response field number

2010-07-22 Thread scrapy

 I'm using Solr 1.4.1

 


 

 

-Original Message-
From: Justin Lolofie 
To: solr-user@lucene.apache.org
Sent: Thu, Jul 22, 2010 2:57 pm
Subject: Re: Dismax query response field number


scrapy, what version of solr are you using?

I'd like to do "fq=city:Paris" but it doesn't seem to work for me (solr
1.4) and the docs seem to suggest it's a feature that is coming but not
there yet? Or maybe I misunderstood?


On Thu, Jul 22, 2010 at 6:00 AM,   wrote:
>
>  Thanks,
>
> That was the problem!
>
>
>
>
> select?q=moto&qt=dismax& fq =city:Paris
>
>
>
>
>
>
>
>
>
>
>
> -Original Message-
> From: Chantal Ackermann 
> To: solr-user@lucene.apache.org 
> Sent: Thu, Jul 22, 2010 12:47 pm
> Subject: Re: Dismax query response field number
>
>
> is this a typo in your query or in your e-mail?
>
> you have the "q" parameter twice.
> use "fq" for query inputs that mention a field explicitly when using
> dismax.
>
> So it should be:
> select?q=moto&qt=dismax& fq =city:Paris
>
> (the whitespace is only for visualization)
>
>
> chantal
>
>
> On Thu, 2010-07-22 at 11:03 +0200, scr...@asia.com wrote:
>> Yes i've data... maybe my query is wrong?
>>
>> select?q=moto&qt=dismax&q=city:Paris
>>
>> Field city is not showing?
>>
>>
>>
>>
>>
>>
>>
>>
>> -Original Message-
>> From: Grijesh.singh 
>> To: solr-user@lucene.apache.org
>> Sent: Thu, Jul 22, 2010 10:07 am
>> Subject: Re: Dismax query response field number
>>
>>
>>
>> Do you have data in that field as well? Solr returns only fields which have data.
>
>
>
>
>
>

 


Using Solr to perform range queries in Dspace

2010-07-22 Thread Mckeane

I'm trying to use DSpace to search across a range of indexes created and stored
using the DSIndexer.java class. I have seen that Solr can be used to perform
numerical range queries using either the TrieIntField,
TrieDoubleField, TrieLongField, etc. classes defined in Solr's API, or
SortableIntField.java, SortableLongField.java, SortableDoubleField.java. I would
like to know how to implement these classes in DSpace so that I can
perform numerical range queries. Any help would be greatly appreciated.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Using-Solr-to-perform-range-queries-in-Dspace-tp987049p987049.html
Sent from the Solr - User mailing list archive at Nabble.com.


Getting FileNotFoundException with repl command=backup?

2010-07-22 Thread Peter Sturge
Informational

Hi,

This information is for anyone who might be running into problems when
performing explicit periodic backups of Solr indexes. I encountered this
problem, and hopefully this might be useful to others.
A related Jira issue is: SOLR-1475.

The issue is: When you execute a 'command=backup' request, the snapshot
starts, but then fails later on with file not found errors. This aborts the
snapshot, and you end up with no backup.

This error occurs if, during the backup process, Solr performs more commits
than its 'maxCommitsToKeep' setting in solrconfig.xml. If you don't commit
very often, you probably won't see this problem.
If, however, like me, you have Solr committing very often, the commit point
files for the backup can get deleted before the backup finishes. This is
particularly true of larger indexes, where the backup can take some time.

Workaround 1:
One workaround to this is to set 'maxCommitsToKeep' to a number higher than
the total number of commits that can occur during the time it takes to do a
backup. Sounds like a 'finger-in-the-air' number? Well, yes it is.
If you commit every 20secs, and a full backup takes 10mins, you'll want a
value of at least 31. The trouble is, how long will a backup take? This can
vary hugely as the index grows, system is busy, disk fragmentation etc.
(my environment takes ~13mins to backup a 5.5GB index to a local folder)

An inefficiency of this approach that needs to be considered is the higher
the 'maxCommitsToKeep' number is, the more files you're going to have
lounging around in your index data folder - the majority of which never get
used. The collective size of these commit point files can be significant.
If you have a high mergeFactor, the number of files will increase as well.
You can set 'maxCommitAge' to delete old commit points after a certain time
- as long as it's not shorter than the 'worst-case' backup time.
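As a sketch, the relevant deletion-policy settings in solrconfig.xml look
something like this (the values are illustrative, not recommendations):

```xml
<deletionPolicy class="solr.SolrDeletionPolicy">
  <!-- keep enough commit points to cover the worst-case backup time -->
  <str name="maxCommitsToKeep">31</str>
  <str name="maxOptimizedCommitsToKeep">0</str>
  <!-- or expire old commit points by age instead -->
  <str name="maxCommitAge">30MINUTES</str>
</deletionPolicy>
```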

I set my 'maxCommitsToKeep' to 2400, and the file not found errors
disappeared (note that 2400 is a hugely conservative number to cater for a
backup taking 24hrs). My mergeFactor is 25, so I get a high number of files
in the index folder, they are generally small in size, but significant extra
storage can be required.

If you're willing to trade off some (ok, potentially a lot of) extraneous
disk usage to keep commit points around waiting for a backup command, this
approach addresses the problem.

Workaround 2:
A preferable method (IMHO): if you have an extra box, set up a read-only
replica, and then back up from the replica. You can then tune the slave
to suit your needs.
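The slave side of that setup in solrconfig.xml would look roughly like this
(the master URL here is an assumption -- point it at your real master core):

```xml
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <!-- hypothetical master host -->
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>
```

Backup requests (command=backup) can then be run against the slave without
racing the master's commits.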

Coding:
I'm not very familiar with the repl/backup code, but a coded way to address
this might be to save a commit point's index version files when a backup
command is received, then release them for deletion when complete.
Perhaps someone with good knowledge of this part of Solr could comment more
succinctly.


Thanks,
Peter


Re: faceted search with job title

2010-07-22 Thread Ken Krugler

Hi Savannah,

A few comments below, scattered in-line...

-- Ken

On Jul 21, 2010, at 3:08pm, Savannah Beckett wrote:

And I will have to recompile the dom or sax code each time I add a  
job board for
crawling.  A regex pattern is only a string which can be stored in a
text file or

db, and retrieved based on the job board.  What do you think?


You can store the XPath expressions in a text file as strings, and  
load/compile them as needed.



From: "Nagelberg, Kallin" 
To: "solr-user@lucene.apache.org" 
Sent: Wed, July 21, 2010 10:39:32 AM
Subject: RE: faceted search with job title

Yeah you should definitely just setup a custom parser for each  
site.. should be
easy to extract title using groovy's xml parsing along with tagsoup  
for sloppy

html.


Definitely yes re using TagSoup to clean up bad HTML.

And definitely yes to needing per-site "rules" (typically XPath +  
optional regex as needed) to extract specific details.


For a common class of sites powered by the same back-end, you can  
often re-use the same general rules as the markup that you care about  
is consistent.


If you can't find the pattern for each site leading to the job title  
how

can you expect solr to? Humans have the advantage here :P

-Kallin Nagelberg

-Original Message-
From: Savannah Beckett [mailto:savannah_becket...@yahoo.com]
Sent: Wednesday, July 21, 2010 12:20 PM
To: solr-user@lucene.apache.org
Cc: dave.sea...@magicalia.com
Subject: Re: faceted search with job title

mmm...there must be a better way...each job board has a different
format.  If there
are constantly new job boards being crawled, I don't think I can  
manually look
for specific sequence of tags that leads to job title.  Most of them  
don't even
have class or id.  There is no guarantee that the job title will be  
in the title
tag, or header tag.  Something else can be in the title.  Should I  
do this in a

class that extends IndexFilter in Nutch?


When I do this kind of thing I use Bixo (http://openbixo.org), but  
that requires knowledge of Cascading (& some Hadoop) in order to  
construct web mining workflows.




From: Dave Searle 
To: "solr-user@lucene.apache.org" 
Sent: Wed, July 21, 2010 8:42:55 AM
Subject: RE: faceted search with job title

You'd probably need to do some post processing on the pages and set  
up rules for
each website to grab that specific bit of data. You could load the  
html into an
xml parser, then use xpath to grab content from a particular tag  
with a class or

id, based on the particular website



-Original Message-
From: Savannah Beckett [mailto:savannah_becket...@yahoo.com]
Sent: 21 July 2010 16:38
To: solr-user@lucene.apache.org
Subject: faceted search with job title

Hi,
  I am currently using nutch to crawl some job pages from job  
boards.  They are
in my solr index now.  I want to do faceted search with the job  
titles.  How?
The job titles can be in any locations of the page, e.g. title,  
header,
content...   If I use indexfilter in Nutch to search the content for  
job title,
there are hundred of thousands of job titles, I can't hard code them  
all.  Do
you have a better idea?  I think I need the job title in a separate  
field in the

index to make it work with solr faceted search, am I right?


Yes, you'd want a separate "job title" field in the index. Though  
often the job titles are slight variants on each other, so this would  
probably work much better if you automatically found common phrases  
and used those, otherwise you get "Senior Bottlewasher" and "Sr.  
Bottlewasher" and "Sr Bottlewasher" as separate facet values.



Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: solrconfig.xml and xinclude

2010-07-22 Thread Tommaso Teofili
Just an update to say that the only way I figured out to include my 2
<field/> tags was via the element() scheme:

<xi:include href="specific_fields_1.xml" parse="xml" xpointer="element(...)"
    xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:fallback>
    <xi:include href="specific_fields_2.xml" parse="xml"
        xpointer="element(...)"/>
  </xi:fallback>
</xi:include>

obviously this is not desirable and clean at all even if it can make the
trick if the number of fields is very small.
Any other ideas?
Cheers,
Tommaso

2010/7/22 Tommaso Teofili 

> Hi,
> I am trying to do a similar thing within the schema.xml (using Solr 1.4.1),
> having a (super)schema that is common to 2 instances and specific fields I
> would like to include (with XInclude).
> Something like this:
>
> <schema>
>    ...
>   <fields>
>     <field name="..." type="..." indexed="true" stored="true"
>         required="false" multiValued="true"/>
>     <field name="..." type="..." indexed="true" stored="true"
>         required="false" multiValued="true"/>
>     <field name="..." type="..." indexed="true" stored="true"
>         required="false"/>
>     ...
>     <xi:include href="specific_fields_1.xml"
>         xmlns:xi="http://www.w3.org/2001/XInclude" parse="xml"/>
>   </fields>
>   ...
> </schema>
>
> and it works with the specific_fields_1.xml (or specific_fields_2.xml) like
> the following:
>
> <field name="..." type="..." indexed="true" stored="true" required="false"/>
>
> but it stops working when I add more than one field in the included XML:
>
> <fields>
>   <field name="..." type="..." indexed="true" stored="true" required="false"/>
>   <field name="..." type="..." indexed="true" stored="false" required="false"/>
> </fields>
>
> and consequently modify the including element as following:
>
> <xi:include href="specific_fields_1.xml" parse="xml"
>     xpointer="/fields/field" xmlns:xi="http://www.w3.org/2001/XInclude">
>   <xi:fallback>
>     <xi:include href="specific_fields_2.xml" parse="xml"
>         xpointer="/fields/field"/>
>   </xi:fallback>
> </xi:include>
>
> I tried to modify the xpointer attribute value to:
>
> fields/field
> fields/*
> /fields/*
> element(/fields/field)
> element(/fields/*)
> element(fields/field)
> element(fields/*)
>
> but I had no luck.
>
>
> Fiedzia, I think that xpointer="xpointer(something)" won't work as you can
> read in the last sentence of the page regarding SolrConfig.xml [1].
> I took a look at the Solr source code and I found a JUnit test for the
> XInclude mechanism that tests the inclusion documented in the wiki [2][3].
> I also found an entry on the Lucid Imagination website at [4] but couldn't fix
> my issue.
> Please, could someone help us regarding what is the right way to configure
> XInclude inside Solr?
> Thanks in advance for your time.
> Cheers,
> Tommaso
>
> [1] : http://wiki.apache.org/solr/SolrConfigXml
> [2] : http://wiki.apache.org/solr/SolrConfigXml#XInclude
> [3] :
> http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/test/org/apache/solr/core/TestXIncludeConfig.java
> [4] :
> http://www.lucidimagination.com/search/document/31a60b7ccad76de1/is_it_possible_to_use_xinclude_in_schema_xml
>
>
> 2010/7/21 fiedzia 
>
>
>> I am trying to export some config options common to all cores into a single
>> file,
>> which would be included using xinclude. The only problem is how to include
>> the children of a given node.
>>
>>
>> common_solrconfig.xml looks like that:
>>
>> <config>
>>   ...
>> </config>
>>
>> solrconfig.xml looks like that:
>>
>> <config>
>>   ...
>> </config>
>>
>>
>> now all of the following attempts have failed:
>>
>> <xi:include href="common_solrconfig.xml"
>> xmlns:xi="http://www.w3.org/2001/XInclude"/>
>> <xi:include href="common_solrconfig.xml" parse="xml"
>> xmlns:xi="http://www.w3.org/2001/XInclude"/>
>> <xi:include href="common_solrconfig.xml" xpointer="xpointer(config/*)"
>> xmlns:xi="http://www.w3.org/2001/XInclude"/>
>>
>> <xi:include href="common_solrconfig.xml" xpointer="element(config/*)"
>> xmlns:xi="http://www.w3.org/2001/XInclude"/>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/solrconfig-xml-and-xinclude-tp984058p984058.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>
>


Tree Faceting in Solr 1.4

2010-07-22 Thread Eric Grobler
Hi Solr Community

If I have:
COUNTRY CITY
Germany Berlin
Germany Hamburg
Spain   Madrid

Can I do faceting like:
Germany
  Berlin
  Hamburg
Spain
  Madrid

I tried to apply SOLR-792 to the current trunk but it does not seem to be
compatible.
Maybe there is a similar feature existing in the latest builds?

Thanks & Regards
Eric


Re: Tree Faceting in Solr 1.4

2010-07-22 Thread SR
Perhaps the following article can help: 
http://www.craftyfella.com/2010/01/faceting-and-multifaceting-syntax-in.html

-S


On Jul 22, 2010, at 5:39 PM, Eric Grobler wrote:

> Hi Solr Community
> 
> If I have:
> COUNTRY CITY
> Germany Berlin
> Germany Hamburg
> Spain   Madrid
> 
> Can I do faceting like:
> Germany
>  Berlin
>  Hamburg
> Spain
>  Madrid
> 
> I tried to apply SOLR-792 to the current trunk but it does not seem to be
> compatible.
> Maybe there is a similar feature existing in the latest builds?
> 
> Thanks & Regards
> Eric



Re: Clustering results limit?

2010-07-22 Thread Stanislaw Osinski
Hi,

In my SolrJ, I used ModifiableSolrParams and I set ("rows",50) but it
> still returns less than 10 for each cluster.
>

Oh, the number of documents per cluster very much depends on the
characteristics of your documents; it often happens that the algorithms
create larger numbers of smaller clusters. However, all returned documents
should get assigned to some cluster(s), the "Other Topics" one in the worst
case. Does that hold in your case?

If you'd like to tune clustering a bit, you can try Carrot2 tools:

http://download.carrot2.org/stable/manual/#section.getting-started.solr

and then:

http://download.carrot2.org/stable/manual/#chapter.tuning

Cheers,

S.


Delta import processing duration

2010-07-22 Thread Qwerky

I'm using Solr to index data from our data warehouse. The data is imported
through text files. I've written a custom FileImportDataImportHandler that
extends DataSource and it works fine - I've tested it with 280,000 records
and it manages to build the index in about 3 minutes. My problem is that
doing a delta update seems to take a really long time.

I've written a custom FileUpdateDataImportHandler which takes two files,
one for deletes and one for updates. I've tested with an update file
containing 18,000 records and a delete file containing 30 records - my
custom handler whizzed through them in a few seconds but the page at
/solr/admin/dataimport.jsp says the command is still running (it's been
running nearly an hour).

What's taking so long? Could there be some kind of inefficiency in the way
my update handler works?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Delta-import-processing-duration-tp987562p987562.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Tree Faceting in Solr 1.4

2010-07-22 Thread Eric Grobler
Thank you for the link.

I was not aware of the multifaceting syntax - this will enable me to run 1
less query on the main page!

However this is not a tree faceting feature.

Thanks
Eric




On Thu, Jul 22, 2010 at 4:51 PM, SR  wrote:

> Perhaps the following article can help:
> http://www.craftyfella.com/2010/01/faceting-and-multifaceting-syntax-in.html
>
> -S
>
>
> On Jul 22, 2010, at 5:39 PM, Eric Grobler wrote:
>
> > Hi Solr Community
> >
> > If I have:
> > COUNTRY CITY
> > Germany Berlin
> > Germany Hamburg
> > Spain   Madrid
> >
> > Can I do faceting like:
> > Germany
> >  Berlin
> >  Hamburg
> > Spain
> >  Madrid
> >
> > I tried to apply SOLR-792 to the current trunk but it does not seem to be
> > compatible.
> > Maybe there is a similar feature existing in the latest builds?
> >
> > Thanks & Regards
> > Eric
>
>


Solr on iPad?

2010-07-22 Thread Stephan Schwab

Dear Solr community,

does anyone know whether it may be possible or has already been done to
bring Solr to the Apple iPad so that applications may use a local search
engine?

Greetings,
Stephan

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-on-iPad-tp987655p987655.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr on iPad?

2010-07-22 Thread Andreas Jung
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Stephan Schwab wrote:
> Dear Solr community,
> 
> does anyone know whether it may be possible or has already been done to
> bring Solr to the Apple iPad so that applications may use a local search
> engine?

huh?

Solr requires Java. iPad does not support Java.
Solr is memory and cpu intensive...nothing that fits with the concept
of a tablet pc.

- -aj
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.10 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkxIfp8ACgkQCJIWIbr9KYwQgQCg0p1oiXuPf17/Vg2JEpVHlZql
bLEAoL46mARjhGkHsz30Kv1Agpf2xp6r
=86KI
-END PGP SIGNATURE-


Re: a bug of solr distributed search

2010-07-22 Thread Yonik Seeley
As the comments suggest, it's not a bug, but just the best we can do
for now since our priority queues don't support removal of arbitrary
elements.  I guess we could rebuild the current priority queue if we
detect a duplicate, but that will have an obvious performance impact.
Any other suggestions?

-Yonik
http://www.lucidimagination.com

On Wed, Jul 21, 2010 at 3:13 AM, Li Li  wrote:
> in QueryComponent.mergeIds. It will remove document which has
> duplicated uniqueKey with others. In current implementation, it use
> the first encountered.
>          String prevShard = uniqueDoc.put(id, srsp.getShard());
>          if (prevShard != null) {
>            // duplicate detected
>            numFound--;
>            collapseList.remove(id+"");
>            docs.set(i, null);//remove it.
>            // For now, just always use the first encountered since we
> can't currently
>            // remove the previous one added to the priority queue.
> If we switched
>            // to the Java5 PriorityQueue, this would be easier.
>            continue;
>            // make which duplicate is used deterministic based on shard
>            // if (prevShard.compareTo(srsp.shard) >= 0) {
>            //  TODO: remove previous from priority queue
>            //  continue;
>            // }
>          }
>
>  It iterates over ShardResponse by
> for (ShardResponse srsp : sreq.responses)
> But the sreq.responses may be different. That is -- shard1's result
> and shard2's result may interchange position
> So when an uniqueKey(such as url) occurs in both shard1 and shard2.
> which one will be used is unpredictable. But the scores of these 2
> docs differ because of different idf.
> So the same query will get different result.
> One possible solution is to sort ShardResponse srsp  by shard name.
>
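
[Editor's note: Li Li's closing suggestion, comparing shard names so the kept duplicate is deterministic, can be modeled outside Solr. The sketch below is a toy simulation of the merge, not the real QueryComponent code; each response is an array whose first element is a hypothetical shard name and whose remaining elements are document ids:]

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DeterministicMerge {
    // docId -> winning shard name. For a duplicated id, keep the doc from
    // the lexicographically smallest shard, so the outcome does not depend
    // on the order in which shard responses happen to be processed.
    static Map<String, String> merge(List<String[]> responses) {
        Map<String, String> uniqueDoc = new HashMap<String, String>();
        for (String[] resp : responses) {
            String shard = resp[0];
            for (int i = 1; i < resp.length; i++) {
                String id = resp[i];
                String prev = uniqueDoc.get(id);
                if (prev == null || shard.compareTo(prev) < 0) {
                    uniqueDoc.put(id, shard);
                }
            }
        }
        return uniqueDoc;
    }
}
```

Whichever order the responses arrive in, each duplicated id resolves to the same shard; the open question in the thread is doing this without rebuilding the priority queue.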


Re: Solr on iPad?

2010-07-22 Thread mbklein

Hi Stephan,

On a lark, I hacked up solr running under a small-footprint servlet engine
on my jailbroken iPad. You can see the console here: http://imgur.com/tHRh3

It's not a particularly practical solution, though, since Apple would never
approve a Java-based app for the App Store. Or a background service, for
that matter. So it would only ever run on a jailbroken iPad. Even if you're
willing to live with that, keeping the process running in the background all
the time would have a devastating impact on battery life.

Michael

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-on-iPad-tp987655p987716.html
Sent from the Solr - User mailing list archive at Nabble.com.


Providing token variants at index time

2010-07-22 Thread Paul Dlug
Is there a tokenizer that supports providing variants of the tokens at
index time? I'm looking for something that could take a syntax like:

International|I Business|B Machines|M

Which would take each pipe delimited token and preserve its position
so that phrase queries work properly. The above would result in
queries for "International Business Machines" as well as "I B M" or
any variants. The point is that the variants would be generated
externally as part of the indexing process so they may not be as
simple as the above.

Any ideas or do I have to write a custom tokenizer to do this?


Thanks,
Paul
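
[Editor's note: the parsing side of such a tokenizer is straightforward; the real work is wiring it into Lucene's token stream (term and position-increment attributes). A plain-Java sketch of just the core logic, with hypothetical names:]

```java
import java.util.ArrayList;
import java.util.List;

public class VariantParser {
    // A (token, position) pair; variants of one source word share a
    // position, which is what keeps phrase queries working across variants.
    static class Tok {
        final String text;
        final int position;
        Tok(String text, int position) { this.text = text; this.position = position; }
        public String toString() { return text + "@" + position; }
    }

    // "International|I Business|B Machines|M" ->
    // International@1 I@1 Business@2 B@2 Machines@3 M@3
    static List<Tok> parse(String input) {
        List<Tok> out = new ArrayList<Tok>();
        String[] words = input.split("\\s+");
        for (int pos = 0; pos < words.length; pos++) {
            for (String variant : words[pos].split("\\|")) {
                out.add(new Tok(variant, pos + 1));
            }
        }
        return out;
    }
}
```

In a real filter, the first variant at each position would get a position increment of 1 and the remaining variants an increment of 0.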


calling other core from request handler

2010-07-22 Thread Kevin Osborn
I have a multi-core environment and a custom request handler. However, I have 
one place where I would like to have my request handler on coreA query to 
coreB. 
This is not distributed search. This is just an independent query to get some 
additional data.

I am also guaranteed that each server will have the same core set. I am also 
guaranteed that I will not be reloading cores (just indexes).

It looks like I can call
coreA.getCoreDescriptor().getCoreContainer().getCore("coreB"); and then get
the Searcher and release it when I am done.

Is there a better way?

And it also appears that during the inform or init methods of my 
requestHandler, 
coreB is NOT guaranteed to already exist?


  

Re: Providing token variants at index time

2010-07-22 Thread Jonathan Rochkind
I think the Synonym filter should actually do exactly what you want, no? 


http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

Hmm, maybe not exactly what you want as you describe it. It comes close, 
maybe good enough. Do you REALLY need to support "I Business M" or "I B 
Machines" as source/query? Your spec suggests yes, synonym filter won't 
easily do that. But if you just want "International Business Machines" == 
"IBM", keeping positions intact for subsequent terms, I think synonym 
filter will do it. 

If not, I suppose you could look at its source to write your own. Or 
maybe there's some way to combine the PositionFilter with something else 
to do it, but I can't figure one out.


Jonathan

Paul Dlug wrote:

Is there a tokenizer that supports providing variants of the tokens at
index time? I'm looking for something that could take a syntax like:

International|I Business|B Machines|M

Which would take each pipe delimited token and preserve its position
so that phrase queries work properly. The above would result in
queries for "International Business Machines" as well as "I B M" or
any variants. The point is that the variants would be generated
externally as part of the indexing process so they may not be as
simple as the above.

Any ideas or do I have to write a custom tokenizer to do this?


Thanks,
Paul

  


Re: a bug of solr distributed search

2010-07-22 Thread Chris Hostetter

: As the comments suggest, it's not a bug, but just the best we can do
: for now since our priority queues don't support removal of arbitrary

FYI: I updated the DistributedSearch wiki to be more clear about this -- 
it previously didn't make it explicitly clear that docIds were supposed to 
be unique across all shards, and suggested that there was specific,
well-defined behavior when they weren't.


-Hoss



Re: Providing token variants at index time

2010-07-22 Thread Paul Dlug
On Thu, Jul 22, 2010 at 4:01 PM, Jonathan Rochkind  wrote:
> I think the Synonym filter should actually do exactly what you want, no?
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
>
> Hmm, maybe not exactly what you want as you describe it. It comes close,
> maybe good enough. Do you REALLY need to support "I Business M" or "I B
> Machines" as source/query? Your spec suggests yes, synonym filter won't
> easily do that.But if you just want "International Business Machines" ==
> "IBM", keeping positions intact for subsequent terms, I think synonym filter
> will do it.
> If not, I suppose you could look at it's source to write your own. Or maybe
> there's some way to combine the PositionFilter with something else to do it,
> but I can't figure one out.

The synonym approach won't work as I need to provide them in a file.
The variants may be more dynamic and not known in advance, the process
creating the documents to index does have that logic and could easily
put them into the document in a format a tokenizer could pull apart
later.


--Paul


RE: Finding distinct unique IDs in documents returned by fq -- Urgent Help Req

2010-07-22 Thread Chris Hostetter

: > I would like get the total count of the facet.field response values
: 
: I'm pretty sure there's no way to get Solr to do that -- other than not 
: setting a facet.limit, getting every value back in the response, and 
: counting them yourself (not feasible for very large counts).  I've 
: looked at trying to patch Solr to do it, because I could really use it 
: too; it's definitely possible, but made trickier because there are now 
: several different methods that Solr can use to do facetting, with 
: separate code paths.  It seems like an odd omission to me too.

beyond just having multiple facet algorithms for performance making it 
difficult to add this feature, the other issue is the performance of 
computing the number: in some algorithms it's relatively cheap (on a 
single server) but in others it's more expensive than computing the facet 
counts being returned (consider the case where we are sorting in term 
order - once we have collected counts for ${facet.limit} constraints, we 
can stop iterating over terms -- but to compute the total number of 
constraints (ie: terms) we would have to keep going and test every one of 
them against ${facet.mincount})

With distributed searching it becomes even more prohibitive -- your 
description of using an infinite facet.limit and asking for every value 
back to count them is exactly what would have to be done internally in a 
distributed faceting situation -- except they couldn't just be counted, 
they'd have to be deduped and then counted)

To do this efficiently, other data structures (denormalized beyond just 
the inverted index level) would need to be built.

-Hoss



Re: stats on a field with no values

2010-07-22 Thread Chris Hostetter
: 
: When I use the stats component on a field that has no values in the result set
: (ie, stats.missing == rowCount), I'd expect that 'min'and 'max' would be
: blank.
: 
: Instead, they seem to be the smallest and largest float values or something,
: min = 1.7976931348623157E308, max = 4.9E-324 .
: 
: Is this a bug?

off the top of my head it sounds like it ... would you mind opening an 
issue in Jira please?


-Hoss



Re: How to get the list of all available fields in a (sharded) index

2010-07-22 Thread Chris Hostetter

: I cannot find any info on how to get the list of current fields in an index
: (possibly sharded). With dynamic fields, I cannot simply parse the schema to

there isn't one -- the LukeRequestHandler can tell you what fields 
*actually* exist in your index, but you'd have to query it on each shard 
to know the full set of concrete fields in the entire distributed index.



-Hoss



Re: about warm up

2010-07-22 Thread Chris Hostetter

: I want to load full text into an external cache, So I added so codes
: in newSearcher where I found the warm up takes place. I add my codes

...

: public void newSearcher(SolrIndexSearcher newSearcher,
: SolrIndexSearcher currentSearcher) {
: warmTextCache(newSearcher,warmTextCache,new String[]{"title","content"});

...

: in warmTextCache I need a reader to get some docs

...

: So I need a reader, When I contruct a reader by myself like:
: IndexReader reader=IndexReader.open(...);
: Or by core.getSearcher().get().getReader()

Don't do that -- the readers/searchers are reference counted by the 
SolrCore, so unless you "release" your references cleanly you are 
likely to get into some interesting situations.

the newSearcher method you are implementing directly gives you the 
SolrIndexSearcher (the "newSearcher" arg) that will be used along with 
your cache.  why don't you use it to get the reader (the 
getReader() method) instead of jumping through these hoops you've been 
trying? 

 



-Hoss



commit is taking very very long time

2010-07-22 Thread bbarani

Hi,

I am not sure why some commits take a very long time. I have a batch indexing
which commits just once after it completes the indexing.

I tried to index just 36 rows but the total time taken was about 12
minutes. The indexing itself took only some 30 seconds; the commit took
the remaining time.



<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">dataimportHydrogen.xml</str>
    </lst>
  </lst>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages">
    <str name="Total Requests made to DataSource">4</str>
    <str name="Total Rows Fetched">36</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2010-07-22 15:42:28</str>
    <str name="">Indexing completed. Added/Updated: 4 documents. Deleted 0 documents.</str>
    <str name="Committed">2010-07-22 15:54:49</str>
    <str name="Optimized">2010-07-22 15:54:49</str>
    <str name="Total Documents Processed">4</str>
    <str name="Time taken ">0:12:21.632</str>
  </lst>
  <str name="WARNING">This response format is experimental.  It is likely to change in the future.</str>
</response>




I even set the autowarm count to 0 in the solrconfig.xml file but to no avail.
Any reason why the commit takes more time?

Also is there a way to reduce the time it takes?

I have attached my solrconfig / log for your reference.

http://lucene.472066.n3.nabble.com/file/n988220/SOLRerror.log SOLRerror.log 
http://lucene.472066.n3.nabble.com/file/n988220/solrconfig.xml
solrconfig.xml 

Thanks,
BB


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/commit-is-taking-very-very-long-time-tp988220p988220.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Novice seeking help to change filters to search without diacritics

2010-07-22 Thread Chris Hostetter

: I am new to Solr and seeking your help to change filter from
: ISOLatin1AccentFilterFactory to ASCIIFoldingFilterFactory files.  I am not

According to the files you posted, you aren't using the 
ISOLatin1AccentFilterFactory -- so problem solved w/o making any changes.

: sure what change is to be made and where exactly this change is to be made.
: And finally, what would replace mapping-ISOLatin1Accent.txt file?  I would

i think what's confusing you is that you are using the 
MappingCharFilterFactory with that file in your "text" field type to 
convert any ISOLatin1Accent characters to their "base" characters (i'm 
sure there is a more precise term for it, but i'm not a charset savant 
like rmuir so i don't know what it's called)

: like Solr to search both with and without diacritics found in
: transliteration of Indian languages with characters such as Ā ś ṛ ṇ, etc. 

your existing usage should allow that on any fields using the "text" type 
-- if you index those characters they will get "flattened" and if someone 
searches on those characters they will get "flattened" -- it's just like 
using LowerCaseFilter -- as long as you do it at index and query time 
everything is consistent.

if you want docs to score higher when even the accents match, just index 
and query across two fields: one with that charfilter and one w/o.



-Hoss


Re: Providing token variants at index time

2010-07-22 Thread Jonathan Rochkind

Paul Dlug wrote:

On Thu, Jul 22, 2010 at 4:01 PM, Jonathan Rochkind  wrote:
  


The synonym approach won't work as I need to provide them in a file.
The variants may be more dynamic and not known in advance, the process
creating the documents to index does have that logic and could easily
put them into the document in a format a tokenizer could pull apart
later.
Then maybe look at the source code of the synonym filter, and build your 
own filter, copying the parts that do the real work (or even 
sub-classing), but instead of using a file, using the transient state 
information that is for some reason only available at indexing time?


Don't entirely understand your use case, if you give some more explicit 
examples, others might have other ideas.


Jonathan


Re: Finding distinct unique IDs in documents returned by fq -- Urgent Help Req

2010-07-22 Thread Jonathan Rochkind

Chris Hostetter wrote:
computing the number:  in some algorithms it's relatively cheap (on a 
single server) but in others it's more expensive than computing the facet 
counts being returned (consider the case where we are sorting in term 
order - once we have collected counts for ${facet.limit} constraints, we 
can stop iterating over terms -- but to compute the total number of 
constraints (ie: terms) we would have to keep going and test every one of 
them against ${facet.mincount})
  
I've been told this before, but it still doesn't really make sense to 
me.  How can you possibly find the top N constraints, without having at 
least examined all the contraints?  How do you know which are the top N 
if there are some you haven't looked at? And if you've looked at them 
all, it's no problem to increment at a counter as you look at each one.  
Although I guess the facet.minCount test does possibly put a crimp in 
things, I don't ever use that param myself to be something other than 1, 
so hadn't considered it.


But I may be missing something. I've examined only one of the code 
paths/methods for faceting in source code, the one (if my reading was 
correct) that ends up used for high-cardinality multi-valued fields -- 
in that method, it looked like it should add no work at all to give you 
a facet unique value (result set value cardinality) count. (with 
facet.mincount of 1 anyway).  But I may have been mis-reading, or it may 
be that other methods are more troublesome.


At any rate, if I need it bad enough, I'll try to write my own facet 
component that does it (perhaps a subclass of the existing SimpleFacet), 
and see what happens.  It does seem to be something a variety of 
people's use cases could use, I see it mentioned periodically in the 
list serv archives.


Jonathan




WordDelimiterFilter and phrase queries?

2010-07-22 Thread Drew Farris
Hi All,

A question about the WordDelimiterFilter and position increments /
phrase queries:

I have a string like: 3-diphenyl-propanoic

When indexed gets it is broken up into the following tokens:

pos token offset
1 3 0-1
2 diphenyl 2-10
3 propanoic 11-20
3 diphenylpropanoic 2-20

The WordDelimiterFilter has catenateWords set to 1, which causes it to
emit 'diphenylpropanoic'. Note that position for this term is '3'.
(catentateAll is set to 0)

Say someone enters the query string 3-diphenylpropanoic

The query parser I'm using transforms this into a phrase query and the
indexed form is missed because based the positions of the terms '3'
and 'diphenylpropanoic' indicate they are not adjacent?

Is this intended behavior? I expect that the catenated word
'diphenylpropanoic' should have a position of 2 based on the position
of the first term in the concatenation, but perhaps I'm missing
something. This seems to be present in both 1.4.1 and the current
trunk.

- Drew
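
[Editor's note: the miss Drew describes can be reproduced with a tiny position model, a simplification covering only the two-term adjacent-phrase case, not real Lucene matching:]

```java
public class PhrasePositions {
    // token/position pairs as emitted by WordDelimiterFilter for
    // "3-diphenyl-propanoic" with catenateWords=1
    static final String[] TOKENS = {"3", "diphenyl", "propanoic", "diphenylpropanoic"};
    static final int[] POSITIONS = {1, 2, 3, 3};

    // a two-term phrase matches only if the terms occur at adjacent positions
    static boolean phraseMatches(String first, String second) {
        for (int i = 0; i < TOKENS.length; i++) {
            if (!TOKENS[i].equals(first)) continue;
            for (int j = 0; j < TOKENS.length; j++) {
                if (TOKENS[j].equals(second) && POSITIONS[j] == POSITIONS[i] + 1) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

With the catenated token at position 3, the phrase "3 diphenylpropanoic" finds no adjacent pair; if it were emitted at position 2 (that of "diphenyl", the first concatenated part), positions 1 and 2 would be adjacent and the phrase would match.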


Re: calling other core from request handler

2010-07-22 Thread Chris Hostetter
: It looks I can 
: call coreA.getCoreDescriptor().getCoreContainer().getCore("coreB"); and then 
get 
: the Searcher and release it when I am done.
: 
: Is there a better way?

not really ... not unless you want to do it via HTTP to "localhost"

: And it also appears that during the inform or init methods of my 
requestHandler, 
: coreB is NOT guaranteed to already exist?

correct ... your RequestHandler shouldn't make any assumptions about the 
order that core's are initialized in.

-Hoss



Duplicates

2010-07-22 Thread Pavel Minchenkov
Hi,

Is it possible to remove duplicates in search results by a given field?

Thanks.

-- 
Pavel Minchenkov


Re: Finding distinct unique IDs in documents returned by fq -- Urgent Help Req

2010-07-22 Thread Chris Hostetter
: > being returned (consider the case where we are sorting in term order - once
: > we have collected counts for ${facet.limit} constraints, we can stop
: > iterating over terms -- but to compute the total number of constraints (ie:
: > terms) we would have to keep going and test every one of them against
: > ${facet.mincount})
: >   
: I've been told this before, but it still doesn't really make sense to me.  How
: can you possibly find the top N constraints, without having at least examined
> all the constraints?  How do you know which are the top N if there are some you

that's exactly my point: in the scenario where you've asked for 
facet.mincount=N&facet.limit=M&facet.sort=index you don't have to find the 
"top" constraints, you just have to find the first M terms in index order 
that have a mincount of N.

: But I may be missing something. I've examined only one of the code
: paths/methods for faceting in source code, the one (if my reading was correct)
: that ends up used for high-cardinality multi-valued fields -- in that method,
: it looked like it should add no work at all to give you a facet unique value
: (result set value cardinality) count. (with facet.mincount of 1 anyway).  But
: I may have been mis-reading, or it may be that other methods are more
: troublesome.

in any case where you are sorting by *counts* then yes, all of the 
constraints have to be checked, so you can count them as you go -- but 
that doesn't scale in distributed faceting, you can't just add the counts 
up from each shard because you don't know what the overlap is -- hence my 
comment about how to dedup them.

there are some simple usecases where it's feasible, but in general it's a 
very hard problem.


-Hoss



Re: boosting particular field values

2010-07-22 Thread Chris Hostetter

I believe this came up on IRC, and the end result was that the bq was 
working fine; Justin just wasn't noticing because he added it to his 
solrconfig.xml (and not to the query URL) and his browser was still 
caching the page -- so he didn't see his boost affect anything.

(but i may be confusing justin with someone else)

: I'm using dismax request handler, solr 1.4.
: 
: I would like to boost the weight of certain fields according to their
: values... this appears to work:
: 
: bq=category:electronics^5.5
: 
: However, I think this boosting only affects sorting the results that
: have already matched? So if I only get 10 rows back, I might not get
: any records back that are category electronics. If I get 100 rows, I
: can see that bq is working. However, I only want to get 10 rows.
: 
: How does one affect the kinds of results that are matched to begin
: with? bq is the wrong thing to use, right?
: 
: Thanks for any help,
: Justin
: 



-Hoss



Re: Clustering results limit?

2010-07-22 Thread Darren Govoni
Yeah, my results count is 151 and only 21 documents appear in 6
clusters.

This is true whether I use URL or SolrJ.

When I use carrot workbench and point to my Solr using local clustering,
the workbench
has numerous clusters and all documents are placed.

On Thu, 2010-07-22 at 18:06 +0200, Stanislaw Osinski wrote:

> Hi,
> 
> In my SolrJ, I used ModifiableSolrParams and I set ("rows",50) but it
> > still returns less than 10 for each cluster.
> >
> 
> Oh, the number of documents per cluster very much depends on the
> characteristics of your documents, it often happens that the algorithms
> create larger numbers of smaller clusters. However, all returned documents
> should get assigned to some cluster(s), the Other Topics one in the worst
> case. Does that hold in your case?
> 
> If you'd like to tune clustering a bit, you can try Carrot2 tools:
> 
> http://download.carrot2.org/stable/manual/#section.getting-started.solr
> 
> and then:
> 
> http://download.carrot2.org/stable/manual/#chapter.tuning
> 
> Cheers,
> 
> S.




Re: Clustering results limit?

2010-07-22 Thread Darren Govoni
This seems to work from SolrJ now:

ModifiableSolrParams params = new ModifiableSolrParams();
params.set("qt", "/clustering");
params.set("q", query);
params.set("carrot.title", "filename_s");
params.set("rows", "100");
params.set("clustering", "true");
params.set("carrot.snippet", "excerpt_t");

The rows param needs to be a string I think.

thanks.

On Thu, 2010-07-22 at 19:10 -0400, Darren Govoni wrote:

> Yeah, my results count is 151 and only 21 documents appear in 6
> clusters.
> 
> This is true whether I use URL or SolrJ.
> 
> When I use carrot workbench and point to my Solr using local clustering,
> the workbench
> has numerous clusters and all documents are placed
> 
> On Thu, 2010-07-22 at 18:06 +0200, Stanislaw Osinski wrote:
> 
> > Hi,
> > 
> > In my SolrJ, I used ModifiableSolrParams and I set ("rows",50) but it
> > > still returns less than 10 for each cluster.
> > >
> > 
> > Oh, the number of documents per cluster very much depends on the
> > characteristics of your documents, it often happens that the algorithms
> > create larger numbers of smaller clusters. However, all returned documents
> > should get assigned to some cluster(s), the Other Topics one in the worst
> > case. Does that hold in your case?
> > 
> > If you'd like to tune clustering a bit, you can try Carrot2 tools:
> > 
> > http://download.carrot2.org/stable/manual/#section.getting-started.solr
> > 
> > and then:
> > 
> > http://download.carrot2.org/stable/manual/#chapter.tuning
> > 
> > Cheers,
> > 
> > S.
> 
> 




Re: Duplicates

2010-07-22 Thread Erick Erickson
If the field is a single token, just define the uniqueKey on it in your
schema.

Otherwise, this may be of interest:
http://wiki.apache.org/solr/Deduplication

Haven't used it myself though...

best
Erick

On Thu, Jul 22, 2010 at 6:14 PM, Pavel Minchenkov  wrote:

> Hi,
>
> Is it possible to remove duplicates in search results by a given field?
>
> Thanks.
>
> --
> Pavel Minchenkov
>


DIH stalling, how to debug?

2010-07-22 Thread Tommy Chheng

 Hi,
When I run my DIH script, it says it's "busy" but the "Total Requests 
made to DataSource" and "Total Rows Fetched" remain unchanged at 4 and 
6. It hasn't reported a failure.


How can I debug what is blocking the DIH?

--

@tommychheng
Programmer and UC Irvine Graduate Student
Find a great grad school based on research interests: http://gradschoolnow.com



Re: DIH stalling, how to debug?

2010-07-22 Thread Tommy Chheng

 Ok, it was a runaway SQL query which isn't using an index.

@tommychheng
Programmer and UC Irvine Graduate Student
Find a great grad school based on research interests: http://gradschoolnow.com


On 7/22/10 4:26 PM, Tommy Chheng wrote:

 Hi,
When I run my DIH script, it says it's "busy" but the "Total Requests 
made to DataSource" and "Total Rows Fetched" remain unchanged at 4 and 
6. It hasn't reported a failure.


How can I debug what is blocking the DIH?



Re: filter query on timestamp slowing query???

2010-07-22 Thread Chris Hostetter

: You are correct, first of all i haven't moved yet to the TrieDateField, but i
: am still waiting to find out a bit more information about it, and there's
: not a lot of info, other then in the xml file.

In general TrieFields are a way of trading disk space for range query 
speed.  they are explained fairly well if you look at the docs...

http://lucene.apache.org/solr/api/org/apache/solr/schema/TrieField.html
http://lucene.apache.org/java/2_9_0/api/all/org/apache/lucene/search/NumericRangeQuery.html

...although i realize now that TrieDateField's docs don't actually 
link to TrieField, where the explanation is provided.

AS for your usecase...

: I'll explain my use case, so you'll know a bit more. I have an  index that's
: being updated regularly, (every second i have 10 to 50 new documents, most
: of them are small)
: 
: Every 30 minutes, i ask the index what are the documents that were added to
: it, since the last time i queried it, that match a certain criteria.
: >From time to time, once a week or so, i ask the index for ALL the documents
: that match that criteria. (i also do this for not only one query, but
: several)
: This is why i need the timestamp filter.
: 
: The queries that don't have any time range, take a few seconds to finish,
: while the ones with time range, take a few minutes.
: Hope that helps understanding my situation, and i am open to any suggestion
: how to change the way things work, if it will improve performance.

you keep saying you run "simple queries" and gave an example of 
"myStrField:foo" and you say you "ask the index what are the documents 
that were added to it, since the last time i queried it" ... but you've 
never given any concrete example of a full Solr request that incorporates 
this timestamp filtering so we can see *exactly* what your requests look 
like.  Even with an index the size you are describing, and even with the 
slower performance of "DateField" compared to TrieDateField i find it hard 
to believe that a query for "myStrField:foo" would go from a few seconds 
to several minutes by adding an fq range query for a span of ~30 minutes.  
are you by any chance also *sorting* the documents by that timestamp field 
when you do this?

My best guess is that either:

  a) your "raw query performance" is generally really bad, but you don't 
notice when you do your "simple queries" because of solr's 
queryResultCache -- but this can't be used when you add the fq so you see 
the bad performance then.  If this is the situation I have no real 
suggestions

  b) when you do your individual requests that filter by your timestamp 
field you are also sorting by your timestamp field -- a field you don't 
ever sort on in any other queries so the filterCache needed for sorting 
needs to be built before those queries can be returned.  if you stop 
sorting on this timestamp field (or add a newSearcher warming query that 
does the same sort) then the problem should go away.



-Hoss
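
[Editor's note: for suggestion (b), a warming query in solrconfig.xml would look roughly like this; the field name "timestamp" is a placeholder for whatever field is actually sorted on:]

```xml
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- pre-build the cache entries needed for sorting on the timestamp field -->
    <lst>
      <str name="q">*:*</str>
      <str name="sort">timestamp desc</str>
      <str name="rows">0</str>
    </lst>
  </arr>
</listener>
```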



Re: Novice seeking help to change filters to search without diacritics

2010-07-22 Thread HSingh

Hoss, thank you for your helpful response!

: i think what's confusing you is that you are using the
: MappingCharFilterFactory with that file in your "text" field type to
: convert any ISOLatin1Accent characters to their "base" characters

The problem is that a large range of characters are not getting converted
to their base characters.  The ASCIIFoldingFilterFactory handles this
conversion for the entire Latin character set, including the extended sets
without having to specify individual characters and their equivalent base
characters.

Is there a way for me to switch to ASCIIFoldingFilterFactory?  If so, what
changes do I need to make to these files?  I would appreciate your help!
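(A minimal sketch of what the switch might look like in schema.xml -- the tokenizer and other filters shown are assumptions, since the poster's full "text" fieldType is not quoted here. Note that ASCIIFoldingFilterFactory is a token filter, not a char filter, so it replaces the MappingCharFilterFactory line but goes *inside* the analyzer chain, after the tokenizer:)

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- folds Latin-1 and the extended Latin ranges to their ASCII base
         characters, with no mapping file to maintain -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

(Remember to apply the same filter to both the index and query analyzers if they are defined separately, and to reindex after changing the chain.)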
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Novice-seeking-help-to-change-filters-to-search-without-diacritics-tp971263p988890.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Tree Faceting in Solr 1.4

2010-07-22 Thread rajini maski
I am also looking for the same feature in Solr and am very keen to know whether
it supports this kind of tree faceting... or whether we are forced to index in a
tree-faceting format like

1/2/3/4
1/2/3
1/2
1

In case of multilevel faceting, it gives only a 2-level tree facet, from what
i found..

If i give a query such as: country India, state Karnataka, and city
Bangalore... all i want is 1) a facet count for the condition above, 2) the
number of states in that country, and 3) the number of cities in that state:

Like => Country: India ,State:Karnataka , City: Bangalore <1>

 State:Karnataka
  Kerala
  Tamilnadu
  Andhra Pradesh... and so on

 City:  Mysore
  Hubli
  Mangalore
  Coorg and so on...


If I am doing
facet=on & facet.field={!ex=State}State & fq={!tag=State}State:Karnataka

All it gives me is facets on State, excluding only that filter query.. but i
was not able to do the same at the third level, i.e. a facet.field that also
gives me the counts of the cities in the state Karnataka..
Please let me know a solution for this...

Regards,
Rajani Maski





On Thu, Jul 22, 2010 at 10:13 PM, Eric Grobler wrote:

> Thank you for the link.
>
> I was not aware of the multifaceting syntax - this will enable me to run 1
> less query on the main page!
>
> However this is not a tree faceting feature.
>
> Thanks
> Eric
>
>
>
>
> On Thu, Jul 22, 2010 at 4:51 PM, SR  wrote:
>
> > Perhaps the following article can help:
> >
> http://www.craftyfella.com/2010/01/faceting-and-multifaceting-syntax-in.html
> >
> > -S
> >
> >
> > On Jul 22, 2010, at 5:39 PM, Eric Grobler wrote:
> >
> > > Hi Solr Community
> > >
> > > If I have:
> > > COUNTRY CITY
> > > Germany Berlin
> > > Germany Hamburg
> > > Spain   Madrid
> > >
> > > Can I do faceting like:
> > > Germany
> > >  Berlin
> > >  Hamburg
> > > Spain
> > >  Madrid
> > >
> > > I tried to apply SOLR-792 to the current trunk but it does not seem to
> be
> > > compatible.
> > > Maybe there is a similar feature existing in the latest builds?
> > >
> > > Thanks & Regards
> > > Eric
> >
> >
>


RE: Tree Faceting in Solr 1.4

2010-07-22 Thread Jonathan Rochkind
Solr does not yet, at least not simply, as far as I know, but there are ideas 
and some JIRAs, some with patches:

http://wiki.apache.org/solr/HierarchicalFaceting
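For reference, the pivot-faceting syntax proposed in the SOLR-792 work linked from that page looks roughly like the following -- the exact parameter name has varied between patch versions, so treat this as a sketch rather than a supported 1.4 API:

```
http://localhost:8983/solr/select?q=*:*&rows=0
   &facet=true
   &facet.pivot=Country,State,City
```

The response then carries a nested facet_pivot section: each Country value with its count, and under it each State with its count, and under that each City -- i.e. the India > Karnataka > Bangalore tree asked about below, in a single request.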



From: rajini maski [rajinima...@gmail.com]
Sent: Friday, July 23, 2010 12:34 AM
To: solr-user@lucene.apache.org
Subject: Re: Tree Faceting in Solr 1.4



Re: 2 solr dataImport requests on a single core at the same time

2010-07-22 Thread kishan

Hi, thank you very much, that solved my problem.
Having multiple request handlers will not degrade performance unless
we are sending parallel requests, am i right?


Thanks,
Prasad
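(For the archive, a sketch of what registering two DataImportHandler instances on one core looks like in solrconfig.xml -- the handler names and config file names below are made up for illustration. Extra handler registrations cost essentially nothing while idle; only running two imports concurrently competes for CPU, I/O and the index writer:)

```xml
<requestHandler name="/dataimport-orders"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config-orders.xml</str>
  </lst>
</requestHandler>

<requestHandler name="/dataimport-customers"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config-customers.xml</str>
  </lst>
</requestHandler>
```

(Each handler is then triggered independently, e.g. /dataimport-orders?command=full-import, and each keeps its own status.)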


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/2-solr-dataImport-requests-on-a-single-core-at-the-same-time-tp978649p989132.html
Sent from the Solr - User mailing list archive at Nabble.com.