Re: Fastest way to use solrj

2010-01-19 Thread Noble Paul നോബിള്‍ नोब्ळ्
2010/1/19 Tim Terlegård :
> There are a few ways to use solrj. I just learned that I can use the
> javabin format to get some performance gain. But when I try the binary
> format nothing is added to the index. This is how I try to use this:
>
>    server = new CommonsHttpSolrServer("http://localhost:8983/solr";)
>    server.setRequestWriter(new BinaryRequestWriter())
>    request = new UpdateRequest()
>    request.setAction(UpdateRequest.ACTION.COMMIT, true, true);
>    request.setParam("stream.file", "/tmp/data.bin")
>    request.process(server)
>
> Should this work? Could there be something wrong with the file? I
> haven't found a good reference for how to create a javabin file, but
> by reading the source code I came up with this (groovy code):
BinaryRequestWriter does not read from a file and post it; it serializes the
documents added to the request (see the sketch after the quoted message).
>
>    fieldId = new NamedList()
>    fieldId.add("name", "id")
>    fieldId.add("val", "9-0")
>    fieldId.add("boost", null)
>    fieldText = new NamedList()
>    fieldText.add("name", "text")
>    fieldText.add("val", "Some text")
>    fieldText.add("boost", null)
>    fieldNull = new NamedList()
>    fieldNull.add("boost", null)
>    doc = [fieldNull, fieldId, fieldText]
>    docs = [doc]
>    root = new NamedList()
>    root.add("docs", docs)
>    fos = new FileOutputStream("data.bin")
>    new JavaBinCodec().marshal(root, fos)
>
> I haven't found any examples of using stream.file like this with a
> binary file. Is it supported? Is it better/faster to use
> StreamingUpdateSolrServer and send everything over HTTP instead? Would
> code for that look something like this?
>
>    while (moreDocs) {
>        xmlDoc = readDocFromFileUsingSaxParser()
>        doc = new SolrInputDocument()
>        doc.addField("id", "9-0")
>        doc.addField("text", "Some text")
>        server.add(doc)
>    }
>
> To me it instinctively looks as if stream.file would be faster because
> it doesn't have to use HTTP and it doesn't have to create a bunch of
> SolrInputDocument objects.
>
> /Tim
>
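
For reference, the supported javabin path builds the documents in memory and
lets the request writer serialize them. A minimal SolrJ 1.4 sketch (the URL
and field values are illustrative):

    // Sketch: BinaryRequestWriter serializes the docs added to the request;
    // it does not stream a pre-built javabin file from disk.
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    server.setRequestWriter(new BinaryRequestWriter()); // updates sent as javabin

    UpdateRequest request = new UpdateRequest();
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "9-0");
    doc.addField("text", "Some text");
    request.add(doc);
    request.setAction(UpdateRequest.ACTION.COMMIT, true, true);
    request.process(server);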



-- 
-
Noble Paul | Systems Architect | AOL | http://aol.com


Re: build path

2010-01-19 Thread Wangsheng Mei
I think it is.

solr has a default servlet container "jetty" with the downloaded package
under folder "example" .
but I use tomcat a lot, so I deployed solr on tomcat using solr.war.

I don't know why solr will use jetty as default.

2010/1/19 Siv Anette Fjellkårstad 

> I apologize for the newbie questions :|
> Do I need a servlet container to run the tests?
>
> Kind regards,
> Siv
>
>
> 
>
> Fra: Wangsheng Mei [mailto:hairr...@gmail.com]
> Sendt: ti 19.01.2010 08:49
> Til: solr-user@lucene.apache.org
> Emne: Re: build path
>
>
>
> maybe you should add "-Dsolr.solr.home=" to your JAVA_OPTS
> before your servlet container starts.
>
>
> 2010/1/19 Siv Anette Fjellkårstad 
>
> > Hi!
> > I try to run the tests of Solr 1.4 in Eclipse, but most of them fail.
> > The error messages indicate that I am missing some config files in my build
> path.
> > Is there any documentation of how to get Solr up and running in Eclipse?
> If
> > not; How did you set up (build path for) Solr in Eclipse?
> >
> > Another question; Some of the tests also fail when I run ant test. Is
> that
> > normal?
> >
> > Sincerely,
> > Siv
> >
> >
> >
>
>
>
> --
> 梅旺生
>
>
>
>



-- 
梅旺生


Re: Restricting Facet to FilterQuery in combination with mincount

2010-01-19 Thread Shalin Shekhar Mangar
On Wed, Jan 13, 2010 at 4:55 PM, Chantal Ackermann <
chantal.ackerm...@btelligent.de> wrote:

> Hi all,
>
> is it possible to restrict the returned facets to only those that apply to
> the filter query but still use mincount=0? Keeping those that have a count
> of 0 but apply to the filter, and at the same time leaving out those that
> are not covered by the filter (and thus 0, as well).
>
> Some longer explanation of the question:
>
> Example (don't nail me down on biology here, it's just for illustration):
> q=type:mammal&facet.mincount=0&facet.field=type
>
> returns facets for all values stored in the field "type". Results would
> look like:
>
> mammal(2123)
> bird(0)
> dinosaur(0)
> fish(0)
> ...
>
> In this case setting facet.mincount=1 solves the problem. But consider:
>
> q=area:water&fq=type:mammal&facet.field=name&facet.mincount=0
>
> would return something like
> dolphin (20)
> blue whale (20)
> salmon (0) <= not covered by filter query
> lion (0)
> dog (0)
> ... (all sorts of animals, every possible value in field "name")
>
> My question is: how can I exclude those facets from the result that are not
> covered by the filter query. In this example: how can I exclude the
> non-mammals from the facets but keep all those mammals that are not matched
> by the actual query parameter?
>
>
I've read this twice but the problem is still not clear to me. I guess you
will have to explain it better to get a meaningful response.

-- 
Regards,
Shalin Shekhar Mangar.


Re: Does specifying a smaller number of rows in search improve efficiency?

2010-01-19 Thread Grant Ingersoll

On Jan 18, 2010, at 11:17 AM, Yonik Seeley wrote:

> On Mon, Jan 18, 2010 at 8:57 AM, Erick Erickson  
> wrote:
>> Nope. The problem is that SOLR needs to create a ranked
>> list. It has to search the entire corpus every time. There's
>> always the possibility that the very last document examined
>> would rank highest.
> 
> There's also the priority queue used to collect the top matches that
> needs to remain ordered.
> Finding and scoring matching documents will normally dominate the
> time, but if "N" becomes large (for collecting the top N matches), the
> priority queue operations can become significant.


See also https://issues.apache.org/jira/browse/SOLR-1726
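
To illustrate the cost Yonik describes: collecting the top N of M matches
with a bounded min-heap costs O(M log N), so the queue work grows with N.
A toy sketch using java.util.PriorityQueue (the method and its inputs are
illustrative only, not Solr code):

    import java.util.PriorityQueue;

    // Keep the n best scores seen so far in a min-heap of at most n entries;
    // each insertion/eviction is O(log n).
    static float[] topN(float[] scores, int n) {
        PriorityQueue<Float> heap = new PriorityQueue<Float>(n);
        for (float s : scores) {
            if (heap.size() < n) {
                heap.add(s);
            } else if (s > heap.peek()) { // beats the worst of the current top n
                heap.poll();
                heap.add(s);
            }
        }
        float[] best = new float[heap.size()];
        for (int i = best.length - 1; i >= 0; i--) {
            best[i] = heap.poll(); // poll ascending, fill best-first
        }
        return best;
    }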

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



Re: Fastest way to use solrj

2010-01-19 Thread Tim Terlegård
2010/1/19 Noble Paul നോബിള്‍  नोब्ळ् :
> 2010/1/19 Tim Terlegård :
>>    server = new CommonsHttpSolrServer("http://localhost:8983/solr";)
>>    server.setRequestWriter(new BinaryRequestWriter())
>>    request = new UpdateRequest()
>>    request.setAction(UpdateRequest.ACTION.COMMIT, true, true);
>>    request.setParam("stream.file", "/tmp/data.bin")
>>    request.process(server)
>>
>> Should this work? Could there be something wrong with the file? I
>> haven't found a good reference for how to create a javabin file, but
>> by reading the source code I came up with this (groovy code):

> BinaryRequestWriter does not read from a file and post it

Is there any other way or is this use case not supported? I tried this:

$ curl /solr/update/javabin -F stream.file=/tmp/data.bin
$ curl /solr/update -F stream.body='<commit/>'

Solr did read the file, because Solr complained when the file wasn't
in the format the JavaBinUpdateRequestCodec expected. But no data was
added to the index for some reason.

/Tim


Re: Interesting OutOfMemoryError on a 170M index

2010-01-19 Thread Shalin Shekhar Mangar
On Thu, Jan 14, 2010 at 4:04 AM, Minutello, Nick <
nick.minute...@credit-suisse.com> wrote:

> Agreed, commit every second.
>
> Assuming I understand what you're saying correctly:
> There shouldn't be any index readers - as at this point, just writing to
> the index.
> Did I understand correctly what you meant?
>
>
Solr opens a new IndexSearcher after a commit whether or not you are
querying it. So if you are committing every second, you are going to have a
number of IndexSearchers trying to warm themselves. That can cause an
OutOfMemoryError. Just indexing documents with a reasonable heap size
will not cause the JVM to go out of memory.
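
One common way around this is to batch the adds and commit every few
thousand documents instead of every second. A minimal sketch, assuming a
SolrJ server and a placeholder docs() source:

    // Commit every 10000 adds; each commit opens (and warms) a new searcher,
    // so fewer commits means fewer concurrent warming searchers.
    int count = 0;
    for (SolrInputDocument doc : docs()) {
        server.add(doc);
        if (++count % 10000 == 0) {
            server.commit();
        }
    }
    server.commit(); // final commit to make the tail visible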

-- 
Regards,
Shalin Shekhar Mangar.


Re: filter query parsing problem

2010-01-19 Thread John Thorhauer
Ahmet,

Thanks so much for the help.  I will give it a shot.

John

On Mon, Jan 18, 2010 at 4:40 PM, Ahmet Arslan  wrote:
>> I am submitting a query and it seems
>> to be parsing incorrectly.  Here
>> is the query with the debug output.  Any ideas what
>> the problem is:
>>
>> <arr name="filter_queries">
>>   <str>((VLog:814124 || VLog:12342) &&
>> (PublisherType:U || PublisherType:A))</str>
>> </arr>
>> <arr name="parsed_filter_queries">
>>   <str>+(VLog:814124 VLog:12342)
>> +PublisherType:u</str>
>> </arr>
>>
>> I would have thought that the parsed filter would have
>> looked like this:
>>         +(VLog:814124
>> VLog:12342) +(PublisherType:u PublisherType:a)
>
> It seems that StopFilterFactory is eating "A", which is a stop word. You can
> remove StopFilterFactory from the analyzer chain of PublisherType's field
> type, or you can remove the entry "a" from stopwords.txt.
>
>
>
>


Re: Tokenization and wild card search

2010-01-19 Thread Ahmet Arslan
> I have an issue and I'm not sure how to address it, so I
> hope someone can help me.
>  
> I have the following text in one of my fields:
> "ABC_Expedition_ERROR".   When I search on it
> like: "MyField:SDD_Expedition_PCB" (without quotes) it will
> fail to find me only this word “ABC_Expedition_ERROR”
> which I think is due to tokenization because of the
> underscore.

Do you want or do not want your query MyField:SDD_Expedition_PCB to return 
documents containing ABC_Expedition_ERROR?

> My solution is: "MyField:"SDD_Expedition_PCB"" (without the
> outer quotes, but quotes around the word
> “ABC_Expedition_ERROR”).  This works fine. 
> But then, how do I search on "SDD_Expedition_PCB" with wild
> card?  For example: "MyField:SDD_Expedition*" will not
> work.

Can you paste your field type of MyField? And give some examples what queries 
should return what documents.





How to backup / dump solr database

2010-01-19 Thread jmf
Hi,

I'm using Solr with the Plone CMS. I have just been following some tutorials,
and I would like to 'dump' the Solr database on the production server and make
it run in my development environment. Both are Linux.

So first the question is: is it possible?

Next, how could I do this? I have tried to simply copy-paste the
solr/data/index folder from production to dev, but it doesn't work. I haven't
found anything about this in the documentation.


-- 
Regards,
JeanMichel FRANCOIS
Makina-Corpus


Re: How to backup / dump solr database

2010-01-19 Thread Erik Hatcher
yes, it is possible.  and copying the index is exactly how to go about  
it.  what didn't work exactly?


be sure that the index directory goes under data/ and looks just like  
your production environment.


Erik

On Jan 19, 2010, at 8:08 AM, jmf wrote:


Hi,

I'm using Solr with the Plone CMS. I have just been following some tutorials,
and I would like to 'dump' the Solr database on the production server and make
it run in my development environment. Both are Linux.

So first the question is: is it possible?

Next, how could I do this? I have tried to simply copy-paste the
solr/data/index folder from production to dev, but it doesn't work. I haven't
found anything about this in the documentation.


--
Regards,
JeanMichel FRANCOIS
Makina-Corpus




Best ways to solve Parent-Child relationship without Denormalizing?

2010-01-19 Thread karthi_1986

Hi,

Here is an extract of my data schema in which my user should be able to
issue the following search:
company_description:pharmaceutical AND product_description:cosmetic

[Company profile]
 Company name
 Company url
 Company description
 Company user rating

[Product profile]
 Product name
 Product category
 Product description
 Product rating

So, I'm expecting a result where all cosmetic products created by
pharmaceutical companies are returned.

The problem is, I've read in posts a year old that this parent-child
relationship can only be solved by indexing the denormalized data together.
However, I'm dealing with 10,000,000 companies with possibly 10 products
each, so my data requirements are going to be HUGGEE!!

Is there a new feature in Solr which can handle this for me without the need
for de-normalization?
-- 
View this message in context: 
http://old.nabble.com/Best-wasy-to-solve-Parent-Child-relationship-without-Denormalizing--tp27225593p27225593.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Updating a single field in a Solr document

2010-01-19 Thread Shalin Shekhar Mangar
On Mon, Jan 18, 2010 at 5:11 PM, Raghuveer Kancherla <
raghuveer.kanche...@aplopio.com> wrote:

> Hi,
> I have 2 fields one with captures the category of the documents and an
> other
> which is a pre processed text of the document. Text of the document is
> fairly large.
> The category of the document changes often while the text remains the same.
> Search happens on both fields.
>
> The problem is, I have to index both the text and the category each time
> the
> category changes. The text being large obviously makes this suboptimal. Is
> there a patch or a tricky way to avoid indexing the text field every time.
>
>
Sure, make the text field stored, read the old document and create the
new one. Sorry, there is no way to update an indexed document in Solr (yet).
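
A minimal sketch of that read-modify-reindex workaround (SolrJ 1.4), assuming
all fields are stored, "id" is the unique key, and "category" is the field
being changed (the document id and values are illustrative):

    // Fetch the stored document, copy its fields, overwrite one, re-add.
    SolrDocument old = server.query(new SolrQuery("id:doc-42"))
                             .getResults().get(0);
    SolrInputDocument updated = new SolrInputDocument();
    for (String name : old.getFieldNames()) {
        updated.addField(name, old.getFieldValue(name)); // copy stored values
    }
    updated.setField("category", "new-category"); // the one field that changed
    server.add(updated); // re-adding with the same id replaces the old doc
    server.commit();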

-- 
Regards,
Shalin Shekhar Mangar.


RE: Interesting OutOfMemoryError on a 170M index

2010-01-19 Thread Minutello, Nick

Thanks.
Turns out the problem was related to throughput - I wasn't getting
enough docs indexed per second and an internal queue in a vendor library
was growing without bound.
Using the StreamingUpdateSolrServer fixed that.
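
For reference, a minimal sketch of that pattern (the queue size, thread
count, and the moreDocs()/nextDoc() helpers are illustrative, not SolrJ
APIs):

    // StreamingUpdateSolrServer buffers adds in a bounded queue and streams
    // them to Solr on background threads, so add() returns quickly.
    StreamingUpdateSolrServer server =
        new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 3);
    while (moreDocs()) {
        server.add(nextDoc());
    }
    server.commit(); // waits for the queue to drain, then commits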

-Nick

-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] 
Sent: 19 January 2010 12:04
To: solr-user@lucene.apache.org
Subject: Re: Interesting OutOfMemoryError on a 170M index

On Thu, Jan 14, 2010 at 4:04 AM, Minutello, Nick <
nick.minute...@credit-suisse.com> wrote:

> Agreed, commit every second.
>
> Assuming I understand what you're saying correctly:
> There shouldn't be any index readers - as at this point, just writing 
> to the index.
> Did I understand correctly what you meant?
>
>
Solr opens a new IndexSearcher after a commit whether or not you are
querying it. So if you are committing every second, you are going to
have a number of IndexSearchers trying to warm themselves. That can
cause an OutOfMemoryError. Just indexing documents with a reasonable
heap size will not cause the JVM to go out of memory.

--
Regards,
Shalin Shekhar Mangar.



Re: Best ways to solve Parent-Child relationship without Denormalizing?

2010-01-19 Thread Renaud Delbru

Hi,

SIREn [1] could help you to solve this task (look at the different
indexing examples). But currently, only a Lucene extension is available.
If you want to use it in Solr, you will have to implement your own
Solr plugin (which should require only a limited amount of work).


[1] http://siren.sindice.com/
--
Renaud Delbru

On 19/01/10 13:14, karthi_1986 wrote:

Hi,

Here is an extract of my data schema in which my user should be able to
issue the following search:
company_description:pharmaceutical AND product_description:cosmetic

[Company profile]
 Company name
 Company url
 Company description
 Company user rating

[Product profile]
 Product name
 Product category
 Product description
 Product rating

So, I'm expecting a result where all cosmetic products created by
pharmaceutical companies are returned.

The problem is, I've read in posts a year old that this parent-child
relationship can only be solved by indexing the denormalized data together.
However, I'm dealing with 10,000,000 companies with possibly 10 products
each, so my data requirements are going to be HUGGEE!!

Is there a new feature in Solr which can handle this for me without the need
for de-normalization?
   




RE: Tokenization and wild card search

2010-01-19 Thread johnmunir


I want the following searches to work:
 
  MyField:SDD_Expedition_PCB
 
This should match the word "SDD_Expedition_PCB" only, and not matching 
individual words such as "SDD" or "Expedition", or "PCB".

And the following search:
 
  MyField:SDD_Expedition*
 
Should match any word starting with "SDD_Expedition" and ending with anything 
else such as "SDD_Expedition_PBC", "SDD_Expedition_One", "SDD_Expedition_Two", 
"SDD_ExpeditionSolr", "SDD_ExpeditionSolr1.4", etc, but not matching individual 
words such as "SDD" or "Expedition".
 

The field type for "MyField" is (the field name is keywords):
 

 
And here is the analyzer I'm using:
 

  







  
  







  

 
Any help on how I can achieve the above is greatly appreciated.
 
Btw, if at all possible, I would like to be able to achieve this search without 
having to change how I'm indexing / tokenizing the data.  I'm looking for 
search syntax to make this work.
 
-- JM
 
-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Tuesday, January 19, 2010 7:57 AM
To: solr-user@lucene.apache.org
Subject: Re: Tokenization and wild card search
 
> I have an issue and I'm not sure how to address it, so I
> hope someone can help me.
>  
> I have the following text in one of my fields:
> "ABC_Expedition_ERROR".���When I search on it
> like: "MyField:SDD_Expedition_PCB" (without quotes) it will
> fail to find me only this word �ABC_Expedition_ERROR�
> which I think is due to tokenization because of the
> underscore.
 
Do you want or do not want your query MyField:SDD_Expedition_PCB to return 
documents containing ABC_Expedition_ERROR?
 
> My solution is: "MyField:"SDD_Expedition_PCB"" (without the
> outer quotes, but quotes around the word
> �ABC_Expedition_ERROR�).� This works fine.�
> But then, how do I search on "SDD_Expedition_PCB" with wild
> card?� For example: "MyField:SDD_Expedition*" will not
> work.
 
Can you paste your field type of MyField? And give some examples what queries 
should return what documents.



Urgent: SOLR Indexing missing tokens

2010-01-19 Thread Kranti™ K K Parisa
Hi All,

I have a problem with SOLR indexing. I am trying to index a 96-page PDF file
(using PDFBox to extract the file contents into a String). But surprisingly
the SOLR index is not built for the full document: I can't get all the
tokens, even though the field contains the full text of the PDF, as I am
storing the field as well as indexing it.

Are there any such limitations with SOLR indexing? Please let me know at the
earliest.

Thanks in advance!

Best Regards,
Kranti K K Parisa


Re: Urgent: SOLR Indexing missing tokens

2010-01-19 Thread Mark Miller
Kranti™ K K Parisa wrote:
> Hi All,
>
> I have a problem using SOLR indexing. I am trying to index 96 pages PDF file
> (using PDFBox for extracting the file contents into String). But
> surprisingly SOLR Indexing is not done for the full document. Means I can't
> get all the token how ever the field contains the full text of the PDF as i
> am storing the field along with indexing.
>
> Is there any such limitations with SOLR indexing, please let me know at the
> earliest.
>
> Thanks in advance!
>
> Best Regards,
> Kranti K K Parisa
>
>   
Take a look at maxFieldLength in solrconfig.xml

-- 
- Mark

http://www.lucidimagination.com





Re: Urgent: SOLR Indexing missing tokens

2010-01-19 Thread Kranti™ K K Parisa
Hi Mark,

I really appreciate the quick reply.

here is what I have in the config xml

<ramBufferSizeMB>32</ramBufferSizeMB>
<maxMergeDocs>2147483647</maxMergeDocs>
  *<maxFieldLength>10000</maxFieldLength>*
<writeLockTimeout>1000</writeLockTimeout>
<commitLockTimeout>10000</commitLockTimeout>

Does this matter for tokens? The field I am using has the full content of
the file (I checked that using the Lukeall jar file), however tokens are not
getting generated completely, because of which my search is not working on
the full content.

Please suggest.

Best Regards,
Kranti K K Parisa



On Tue, Jan 19, 2010 at 8:27 PM, Mark Miller  wrote:

> Kranti™ K K Parisa wrote:
> > Hi All,
> >
> > I have a problem using SOLR indexing. I am trying to index 96 pages PDF
> file
> > (using PDFBox for extracting the file contents into String). But
> > surprisingly SOLR Indexing is not done for the full document. Means I
> can't
> > get all the token how ever the field contains the full text of the PDF as
> i
> > am storing the field along with indexing.
> >
> > Is there any such limitations with SOLR indexing, please let me know at
> the
> > earliest.
> >
> > Thanks in advance!
> >
> > Best Regards,
> > Kranti K K Parisa
> >
> >
> Take a look at maxFieldLength in solrconfig.xml
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>


Re: Urgent: SOLR Indexing missing tokens

2010-01-19 Thread Mark Miller
It limits the number of tokens that will be indexed.

Kranti™ K K Parisa wrote:
> Hi Mark,
>
> I really appreciate the quick reply.
>
> here is what I have in the config xml
>
> 32
> 2147483647
>   *  1*
> 1000
> 1
>
> Does this matter with Tokens?? Because the field I am using is having
> the full content of the file ( I checked that using Lukeall jar file),
> how ever Tokens are not getting generated completely because of which
> my search not working for the full content.
>
> Please suggest.
>
> Best Regards,
> Kranti K K Parisa
>
>
>
> On Tue, Jan 19, 2010 at 8:27 PM, Mark Miller  > wrote:
>
> Kranti™ K K Parisa wrote:
> > Hi All,
> >
> > I have a problem using SOLR indexing. I am trying to index 96
> pages PDF file
> > (using PDFBox for extracting the file contents into String). But
> > surprisingly SOLR Indexing is not done for the full document.
> Means I can't
> > get all the token how ever the field contains the full text of
> the PDF as i
> > am storing the field along with indexing.
> >
> > Is there any such limitations with SOLR indexing, please let me
> know at the
> > earliest.
> >
> > Thanks in advance!
> >
> > Best Regards,
> > Kranti K K Parisa
> >
> >
> Take a look at maxFieldLength in solrconfig.xml
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>


-- 
- Mark

http://www.lucidimagination.com





Re: Urgent: SOLR Indexing missing tokens

2010-01-19 Thread Kranti™ K K Parisa
Hi Mark,

As you see my config file contains the value as 10,000
<maxFieldLength>10000</maxFieldLength>

But when I check thru Lukeall jar file I can see the Term count around
3,000.

Please suggest.

Best Regards,
Kranti K K Parisa



2010/1/19 Mark Miller 

> It limits the number of tokens that will be indexed.
>
> Kranti™ K K Parisa wrote:
> > Hi Mark,
> >
> > I really appreciate the quick reply.
> >
> > here is what I have in the config xml
> >
> > <ramBufferSizeMB>32</ramBufferSizeMB>
> > <maxMergeDocs>2147483647</maxMergeDocs>
> >   *<maxFieldLength>10000</maxFieldLength>*
> > <writeLockTimeout>1000</writeLockTimeout>
> > <commitLockTimeout>10000</commitLockTimeout>
> >
> > Does this matter with Tokens?? Because the field I am using is having
> > the full content of the file ( I checked that using Lukeall jar file),
> > how ever Tokens are not getting generated completely because of which
> > my search not working for the full content.
> >
> > Please suggest.
> >
> > Best Regards,
> > Kranti K K Parisa
> >
> >
> >
> > On Tue, Jan 19, 2010 at 8:27 PM, Mark Miller  > > wrote:
> >
> > Kranti™ K K Parisa wrote:
> > > Hi All,
> > >
> > > I have a problem using SOLR indexing. I am trying to index 96
> > pages PDF file
> > > (using PDFBox for extracting the file contents into String). But
> > > surprisingly SOLR Indexing is not done for the full document.
> > Means I can't
> > > get all the token how ever the field contains the full text of
> > the PDF as i
> > > am storing the field along with indexing.
> > >
> > > Is there any such limitations with SOLR indexing, please let me
> > know at the
> > > earliest.
> > >
> > > Thanks in advance!
> > >
> > > Best Regards,
> > > Kranti K K Parisa
> > >
> > >
> > Take a look at maxFieldLength in solrconfig.xml
> >
> > --
> > - Mark
> >
> > http://www.lucidimagination.com
> >
> >
> >
> >
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>


Re: Urgent: SOLR Indexing missing tokens

2010-01-19 Thread Kranti™ K K Parisa
Hi Mark,

I changed the value to 1,000,000,000 to just test my luck.

But unfortunately I am still not getting the index for all tokens.

Please suggest.

Best Regards,
Kranti K K Parisa



2010/1/19 Kranti™ K K Parisa 

> Hi Mark,
>
> As you see my config file contains the value as 10,000
> <maxFieldLength>10000</maxFieldLength>
>
> But when I check thru Lukeall jar file I can see the Term count around
> 3,000.
>
> Please suggest.
>
> Best Regards,
> Kranti K K Parisa
>
>
>
> 2010/1/19 Mark Miller 
>
> It limits the number of tokens that will be indexed.
>>
>> Kranti™ K K Parisa wrote:
>> > Hi Mark,
>> >
>> > I really appreciate the quick reply.
>> >
>> > here is what I have in the config xml
>> >
>> > <ramBufferSizeMB>32</ramBufferSizeMB>
>> > <maxMergeDocs>2147483647</maxMergeDocs>
>> >   *<maxFieldLength>10000</maxFieldLength>*
>> > <writeLockTimeout>1000</writeLockTimeout>
>> > <commitLockTimeout>10000</commitLockTimeout>
>> >
>> > Does this matter with Tokens?? Because the field I am using is having
>> > the full content of the file ( I checked that using Lukeall jar file),
>> > how ever Tokens are not getting generated completely because of which
>> > my search not working for the full content.
>> >
>> > Please suggest.
>> >
>> > Best Regards,
>> > Kranti K K Parisa
>> >
>> >
>> >
>> > On Tue, Jan 19, 2010 at 8:27 PM, Mark Miller > > > wrote:
>> >
>> > Kranti™ K K Parisa wrote:
>> > > Hi All,
>> > >
>> > > I have a problem using SOLR indexing. I am trying to index 96
>> > pages PDF file
>> > > (using PDFBox for extracting the file contents into String). But
>> > > surprisingly SOLR Indexing is not done for the full document.
>> > Means I can't
>> > > get all the token how ever the field contains the full text of
>> > the PDF as i
>> > > am storing the field along with indexing.
>> > >
>> > > Is there any such limitations with SOLR indexing, please let me
>> > know at the
>> > > earliest.
>> > >
>> > > Thanks in advance!
>> > >
>> > > Best Regards,
>> > > Kranti K K Parisa
>> > >
>> > >
>> > Take a look at maxFieldLength in solrconfig.xml
>> >
>> > --
>> > - Mark
>> >
>> > http://www.lucidimagination.com
>> >
>> >
>> >
>> >
>>
>>
>> --
>> - Mark
>>
>> http://www.lucidimagination.com
>>
>>
>>
>>
>


Re: Restricting Facet to FilterQuery in combination with mincount

2010-01-19 Thread Chantal Ackermann

Hi Shalin,

thanks for taking your time (reading it twice!).

Rephrasing the question:
(suppose mincount=0 and facet.limit > all possible facet values)

Currently, the facet results include ALL values for that facet field.
Say I have a field color and when I look at the statistics (LUKE), I can 
see that my index contains altogether 7 different colors. This is 
comparable to a group/count/distinct query in a SQL db.


Querying for "color" as facet field with mincount=0 should thus return 7 
facet fields with various count results.
This fact (7 different counts returned for "color") will not change no 
matter what the query (q) or the filter queries (fq) are - unless I 
change mincount.


Is that correct?

If so, then I was considering the cases why a facet count would be 0 
(always suppose mincount=0).


Case 1) No hit as defined by the query (q parameter) contains that 
specific facet value (e.g. the colors blue and green).
Case 2) This is like Case (1) but there is a filterquery on top, that 
excludes certain values from the facet field, so even before q is 
executed, it's clear that certain facet values are 0.
(e.g. the filter includes only hits with colors yellow and orange. So, 
by this filter, documents with the colors blue and green are already 
excluded from the set that is considered for the actual query (q).)
For me, this results in two different flavours of "0" counts: either the 
0 is the result of executing the query (q) or a result of a filterquery.


Now, I was wondering whether it is possible to find that out. It would 
allow to show 0 counts of values that are produced by the query (q), and 
at the same time exclude all facet values that are already excluded by 
the filter query.


Applying faceting to a subset (subselect / filter set) of the index, not
to everything - that might describe it as well.



Does that make sense?
Thanks,
Chantal


Shalin Shekhar Mangar schrieb:

On Wed, Jan 13, 2010 at 4:55 PM, Chantal Ackermann <
chantal.ackerm...@btelligent.de> wrote:


Hi all,

is it possible to restrict the returned facets to only those that apply to
the filter query but still use mincount=0? Keeping those that have a count
of 0 but apply to the filter, and at the same time leaving out those that
are not covered by the filter (and thus 0, as well).

Some longer explanation of the question:

Example (don't nail me down on biology here, it's just for illustration):
q=type:mammal&facet.mincount=0&facet.field=type

returns facets for all values stored in the field "type". Results would
look like:

mammal(2123)
bird(0)
dinosaur(0)
fish(0)
...

In this case setting facet.mincount=1 solves the problem. But consider:

q=area:water&fq=type:mammal&facet.field=name&facet.mincount=0

would return something like
dolphin (20)
blue whale (20)
salmon (0) <= not covered by filter query
lion (0)
dog (0)
... (all sorts of animals, every possible value in field "name")

My question is: how can I exclude those facets from the result that are not
covered by the filter query. In this example: how can I exclude the
non-mammals from the facets but keep all those mammals that are not matched
by the actual query parameter?



I've read this twice but the problem is still not clear to me. I guess you
will have to explain it better to get a meaningful response.

--
Regards,
Shalin Shekhar Mangar.




Re: Tokenization and wild card search

2010-01-19 Thread Erick Erickson
I'm pretty sure you're going to be disappointed about
the re-indexing part.

I'm pretty sure that WordDelimiterFilterFactory is tokenizing
your input in ways you don't expect, making your use-case
hard to accomplish.

It's basically splitting your input on all non-alpha characters,
so you're indexing the individual parts; see
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

I'd *strongly* suggest you examine the results of your indexing
in order to understand what's possible.

Get a copy of luke and examine your index or use the
SOLR admin Analysis page...

I suspect what you're really looking for is WhitespaceAnalyzer
or Keyword

On Tue, Jan 19, 2010 at 9:50 AM,  wrote:

>
>
> I want the following searches to work:
>
>  MyField:SDD_Expedition_PCB
>
> This should match the word "SDD_Expedition_PCB" only, and not matching
> individual words such as "SDD" or "Expedition", or "PCB".
>
> And the following search:
>
>  MyField:SDD_Expedition*
>
> Should match any word starting with "SDD_Expedition" and ending with
> anything else such as "SDD_Expedition_PBC", "SDD_Expedition_One",
> "SDD_Expedition_Two", "SDD_ExpeditionSolr", "SDD_ExpeditionSolr1.4", etc,
> but not matching individual words such as "SDD" or "Expedition".
>
>
> The field type for "MyField" is (the field name is keywords):
>
> required="false" multiValued="true">
>
> And here is the analyzer I'm using:
>
> positionIncrementGap="100">
>  
>
>
> words="stopwords.txt"/>
> generateWordParts="0" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0"/>
>
> protected="protwords.txt"/>
>
>  
>  
>
>
> words="stopwords.txt"/>
> generateWordParts="0" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0"/>
>
> protected="protwords.txt"/>
>
>  
>
>
> Any help on how I can achieve the above is greatly appreciated.
>
> Btw, if at all possible, I would like to be able to achieve this search
> without having to change how I'm indexing / tokenizing the data.  I'm
> looking for search syntax to make this work.
>
> -- JM
>
> -Original Message-
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: Tuesday, January 19, 2010 7:57 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Tokenization and wild card search
>
> > I have an issue and I'm not sure how to address it, so I
> > hope someone can help me.
> >
> > I have the following text in one of my fields:
> > "ABC_Expedition_ERROR".���When I search on it
> > like: "MyField:SDD_Expedition_PCB" (without quotes) it will
> > fail to find me only this word �ABC_Expedition_ERROR�
> > which I think is due to tokenization because of the
> > underscore.
>
> Do you want or do not want your query MyField:SDD_Expedition_PCB to return
> documents containing ABC_Expedition_ERROR?
>
> > My solution is: "MyField:"SDD_Expedition_PCB"" (without the
> > outer quotes, but quotes around the word
> > �ABC_Expedition_ERROR�).� This works fine.�
> > But then, how do I search on "SDD_Expedition_PCB" with wild
> > card?� For example: "MyField:SDD_Expedition*" will not
> > work.
>
> Can you paste your field type of MyField? And give some examples what
> queries should return what documents.
>
>


Re: Urgent: SOLR Indexing missing tokens

2010-01-19 Thread Erick Erickson
Did you reindex the documents you examined? That limit
is applied when you index.

Try searching the user list for maxfieldlength, this topic
has been discussed many times and you should find a
solution.

HTH
Erick

2010/1/19 Kranti™ K K Parisa 

> Can anyone suggest/guide me on this.
>
> Best Regards,
> Kranti K K Parisa
>
>
>
> 2010/1/19 Kranti™ K K Parisa 
>
> > Hi Mark,
> >
> > I changed the value to 1,000,000,000 to just test my luck.
> >
> > But unfortunately I am still not getting the index for all Token.
> >
> > Please suggest.
> >
> > Best Regards,
> > Kranti K K Parisa
> >
> >
> >
> > 2010/1/19 Kranti™ K K Parisa 
> >
> > Hi Mark,
> >>
> >> As you see my config file contains the value as 10,000
> >> <maxFieldLength>10000</maxFieldLength>
> >>
> >> But when I check thru Lukeall jar file I can see the Term count around
> >> 3,000.
> >>
> >> Please suggest.
> >>
> >> Best Regards,
> >> Kranti K K Parisa
> >>
> >>
> >>
> >> 2010/1/19 Mark Miller 
> >>
> >> It limits the number of tokens that will be indexed.
> >>>
> >>> Kranti™ K K Parisa wrote:
> >>> > Hi Mark,
> >>> >
> >>> > I really appreciate the quick reply.
> >>> >
> >>> > here is what I have in the config xml
> >>> >
> >>> > <ramBufferSizeMB>32</ramBufferSizeMB>
> >>> > <maxMergeDocs>2147483647</maxMergeDocs>
> >>> >   *<maxFieldLength>10000</maxFieldLength>*
> >>> > <writeLockTimeout>1000</writeLockTimeout>
> >>> > <commitLockTimeout>10000</commitLockTimeout>
> >>> >
> >>> > Does this matter with Tokens?? Because the field I am using is having
> >>> > the full content of the file ( I checked that using Lukeall jar
> file),
> >>> > how ever Tokens are not getting generated completely because of which
> >>> > my search not working for the full content.
> >>> >
> >>> > Please suggest.
> >>> >
> >>> > Best Regards,
> >>> > Kranti K K Parisa
> >>> >
> >>> >
> >>> >
> >>> > On Tue, Jan 19, 2010 at 8:27 PM, Mark Miller  >>> > > wrote:
> >>> >
> >>> > Kranti™ K K Parisa wrote:
> >>> > > Hi All,
> >>> > >
> >>> > > I have a problem using SOLR indexing. I am trying to index 96
> >>> > pages PDF file
> >>> > > (using PDFBox for extracting the file contents into String).
> But
> >>> > > surprisingly SOLR Indexing is not done for the full document.
> >>> > Means I can't
> >>> > > get all the token how ever the field contains the full text of
> >>> > the PDF as i
> >>> > > am storing the field along with indexing.
> >>> > >
> >>> > > Is there any such limitations with SOLR indexing, please let me
> >>> > know at the
> >>> > > earliest.
> >>> > >
> >>> > > Thanks in advance!
> >>> > >
> >>> > > Best Regards,
> >>> > > Kranti K K Parisa
> >>> > >
> >>> > >
> >>> > Take a look at maxFieldLength in solrconfig.xml
> >>> >
> >>> > --
> >>> > - Mark
> >>> >
> >>> > http://www.lucidimagination.com
> >>> >
> >>> >
> >>> >
> >>> >
> >>>
> >>>
> >>> --
> >>> - Mark
> >>>
> >>> http://www.lucidimagination.com
> >>>
> >>>
> >>>
> >>>
> >>
> >
>


Field collapsing patch error

2010-01-19 Thread Licinio Fernández Maurelo
Hi folks,

I've downloaded Solr release 1.4 and tried to apply the latest field
collapsing patch I've found. I got these errors:

d...@backend05:~/workspace/solr-release-1.4.0$ patch -p0 -i SOLR-236.patch

patching file src/test/test-files/solr/conf/solrconfig-fieldcollapse.xml
patching file src/test/test-files/solr/conf/schema-fieldcollapse.xml
patching file src/test/test-files/solr/conf/solrconfig.xml
patching file src/test/test-files/fieldcollapse/testResponse.xml
patching file
src/test/org/apache/solr/search/fieldcollapse/FieldCollapsingIntegrationTest.java
patching file
src/test/org/apache/solr/search/fieldcollapse/DistributedFieldCollapsingIntegrationTest.java
patching file
src/test/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapserTest.java

patching file
src/test/org/apache/solr/search/fieldcollapse/AdjacentCollapserTest.java

patching file
src/test/org/apache/solr/handler/component/CollapseComponentTest.java

patching file
src/test/org/apache/solr/client/solrj/response/FieldCollapseResponseTest.java

patching file
src/java/org/apache/solr/search/DocSetAwareCollector.java

patching file
src/java/org/apache/solr/search/fieldcollapse/CollapseGroup.java

patching file
src/java/org/apache/solr/search/fieldcollapse/DocumentCollapseResult.java

patching file
src/java/org/apache/solr/search/fieldcollapse/DocumentCollapser.java

patching file
src/java/org/apache/solr/search/fieldcollapse/collector/CollapseCollectorFactory.java

patching file
src/java/org/apache/solr/search/fieldcollapse/collector/DocumentGroupCountCollapseCollectorFactory.java

patching file
src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/AverageFunction.java

patching file
src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/MinFunction.java

patching file
src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/SumFunction.java

patching file
src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/MaxFunction.java

patching file
src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/AggregateFunction.java

patching file
src/java/org/apache/solr/search/fieldcollapse/collector/CollapseContext.java

patching file
src/java/org/apache/solr/search/fieldcollapse/collector/DocumentFieldsCollapseCollectorFactory.java
patching file
src/java/org/apache/solr/search/fieldcollapse/collector/AggregateCollapseCollectorFactory.java

patching file
src/java/org/apache/solr/search/fieldcollapse/collector/CollapseCollector.java

patching file
src/java/org/apache/solr/search/fieldcollapse/collector/FieldValueCountCollapseCollectorFactory.java

patching file
src/java/org/apache/solr/search/fieldcollapse/collector/AbstractCollapseCollector.java

patching file
src/java/org/apache/solr/search/fieldcollapse/AbstractDocumentCollapser.java

patching file
src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java

patching file
src/java/org/apache/solr/search/fieldcollapse/AdjacentDocumentCollapser.java

patching file
src/java/org/apache/solr/search/fieldcollapse/util/Counter.java

patching file
src/java/org/apache/solr/search/SolrIndexSearcher.java

patching file
src/java/org/apache/solr/search/DocSetHitCollector.java

patching file
src/java/org/apache/solr/handler/component/CollapseComponent.java

patching file
src/java/org/apache/solr/handler/component/QueryComponent.java

Hunk #1 FAILED at
522.

1 out of 1 hunk FAILED -- saving rejects to file
src/java/org/apache/solr/handler/component/QueryComponent.java.rej

patching file
src/java/org/apache/solr/util/DocSetScoreCollector.java

patching file
src/common/org/apache/solr/common/params/CollapseParams.java

patching file src/solrj/org/apache/solr/client/solrj/SolrQuery.java
Hunk #1 FAILED at 17.
Hunk #2 FAILED at 50.
Hunk #3 FAILED at 76.
Hunk #4 FAILED at 148.
Hunk #5 FAILED at 197.
Hunk #6 succeeded at 510 (offset -155 lines).
Hunk #7 succeeded at 566 (offset -155 lines).
5 out of 7 hunks FAILED -- saving rejects to file
src/solrj/org/apache/solr/client/solrj/SolrQuery.java.rej
patching file
src/solrj/org/apache/solr/client/solrj/response/QueryResponse.java
Hunk #1 succeeded at 17 with fuzz 1.
Hunk #2 FAILED at 42.
Hunk #3 FAILED at 58.
Hunk #4 succeeded at 117 with fuzz 2 (offset -8 lines).
Hunk #5 succeeded at 315 with fuzz 2 (offset 17 lines).
2 out of 5 hunks FAILED -- saving rejects to file
src/solrj/org/apache/solr/client/solrj/response/QueryResponse.java.rej
patching file
src/solrj/org/apache/solr/client/solrj/response/FieldCollapseResponse.java

Any ideas?

-- 
Lici
~Java Developer~


Re: Urgent: SOLR Indexing missing tokens

2010-01-19 Thread Kranti™ K K Parisa
Can anyone suggest/guide me on this.

Best Regards,
Kranti K K Parisa



2010/1/19 Kranti™ K K Parisa 

> Hi Mark,
>
> I changed the value to 1,000,000,000 to just test my luck.
>
> But unfortunately I am still not getting the index for all Token.
>
> Please suggest.
>
> Best Regards,
> Kranti K K Parisa
>
>
>
> 2010/1/19 Kranti™ K K Parisa 
>
> Hi Mark,
>>
>> As you see my config file contains the value as 10,000
>> <maxFieldLength>10000</maxFieldLength>
>>
>> But when I check thru Lukeall jar file I can see the Term count around
>> 3,000.
>>
>> Please suggest.
>>
>> Best Regards,
>> Kranti K K Parisa
>>
>>
>>
>> 2010/1/19 Mark Miller 
>>
>> It limits the number of tokens that will be indexed.
>>>
>>> Kranti™ K K Parisa wrote:
>>> > Hi Mark,
>>> >
>>> > I really appreciate the quick reply.
>>> >
>>> > here is what I have in the config xml
>>> >
>>> > <ramBufferSizeMB>32</ramBufferSizeMB>
>>> > <maxMergeDocs>2147483647</maxMergeDocs>
>>> >   *<maxFieldLength>10000</maxFieldLength>*
>>> > <writeLockTimeout>1000</writeLockTimeout>
>>> > <commitLockTimeout>10000</commitLockTimeout>
>>> >
>>> > Does this matter with Tokens?? Because the field I am using is having
>>> > the full content of the file ( I checked that using Lukeall jar file),
>>> > how ever Tokens are not getting generated completely because of which
>>> > my search not working for the full content.
>>> >
>>> > Please suggest.
>>> >
>>> > Best Regards,
>>> > Kranti K K Parisa
>>> >
>>> >
>>> >
>>> > On Tue, Jan 19, 2010 at 8:27 PM, Mark Miller >> > > wrote:
>>> >
>>> > Kranti™ K K Parisa wrote:
>>> > > Hi All,
>>> > >
>>> > > I have a problem using SOLR indexing. I am trying to index 96
>>> > pages PDF file
>>> > > (using PDFBox for extracting the file contents into String). But
>>> > > surprisingly SOLR Indexing is not done for the full document.
>>> > Means I can't
>>> > > get all the token how ever the field contains the full text of
>>> > the PDF as i
>>> > > am storing the field along with indexing.
>>> > >
>>> > > Is there any such limitations with SOLR indexing, please let me
>>> > know at the
>>> > > earliest.
>>> > >
>>> > > Thanks in advance!
>>> > >
>>> > > Best Regards,
>>> > > Kranti K K Parisa
>>> > >
>>> > >
>>> > Take a look at maxFieldLength in solrconfig.xml
>>> >
>>> > --
>>> > - Mark
>>> >
>>> > http://www.lucidimagination.com
>>> >
>>> >
>>> >
>>> >
>>>
>>>
>>> --
>>> - Mark
>>>
>>> http://www.lucidimagination.com
>>>
>>>
>>>
>>>
>>
>


Re: Urgent: SOLR Indexing missing tokens

2010-01-19 Thread Kranti™ K K Parisa
Hi Erik,

Yes, I deleted the index and re-indexed after increasing the value (I have
restarted Tomcat as well),

but still no luck. I was just wondering: the field that I am trying to
index has the complete document text in it, as I am storing that, but I am
not getting the complete terms/tokens into the index to perform the search.

What would be the suggestible Analyzers, filters that I should check with?

Currently I am using the following:

[field type and analyzer XML stripped by the list archive]

Please suggest
Best Regards,
Kranti K K Parisa



On Tue, Jan 19, 2010 at 9:03 PM, Erick Erickson wrote:

> Did you reindex the documents you examined? That limit
> is applied when you index.
>
> Try searching the user list for maxfieldlength, this topic
> has been discussed many times and you should find a
> solution.
>
> HTH
> Erick
>
> 2010/1/19 Kranti™ K K Parisa 
>
> > Can anyone suggest/guide me on this.
> >
> > Best Regards,
> > Kranti K K Parisa
> >
> >
> >
> > 2010/1/19 Kranti™ K K Parisa 
> >
> > > Hi Mark,
> > >
> > > I changed the value to 1,000,000,000 to just test my luck.
> > >
> > > But unfortunately I am still not getting the index for all Token.
> > >
> > > Please suggest.
> > >
> > > Best Regards,
> > > Kranti K K Parisa
> > >
> > >
> > >
> > > 2010/1/19 Kranti™ K K Parisa 
> > >
> > > Hi Mark,
> > >>
> > >> As you see my config file contains the value as 10,000
> > >> <maxFieldLength>10000</maxFieldLength>
> > >>
> > >> But when I check thru Lukeall jar file I can see the Term count around
> > >> 3,000.
> > >>
> > >> Please suggest.
> > >>
> > >> Best Regards,
> > >> Kranti K K Parisa
> > >>
> > >>
> > >>
> > >> 2010/1/19 Mark Miller 
> > >>
> > >> It limits the number of tokens that will be indexed.
> > >>>
> > >>> Kranti™ K K Parisa wrote:
> > >>> > Hi Mark,
> > >>> >
> > >>> > I really appreciate the quick reply.
> > >>> >
> > >>> > here is what I have in the config xml
> > >>> >
> > >>> > <ramBufferSizeMB>32</ramBufferSizeMB>
> > >>> > <maxMergeDocs>2147483647</maxMergeDocs>
> > >>> >   *<maxFieldLength>10000</maxFieldLength>*
> > >>> > <writeLockTimeout>1000</writeLockTimeout>
> > >>> > <commitLockTimeout>10000</commitLockTimeout>
> > >>> >
> > >>> > Does this matter with Tokens?? Because the field I am using is
> having
> > >>> > the full content of the file ( I checked that using Lukeall jar
> > file),
> > >>> > how ever Tokens are not getting generated completely because of
> which
> > >>> > my search not working for the full content.
> > >>> >
> > >>> > Please suggest.
> > >>> >
> > >>> > Best Regards,
> > >>> > Kranti K K Parisa
> > >>> >
> > >>> >
> > >>> >
> > >>> > On Tue, Jan 19, 2010 at 8:27 PM, Mark Miller <
> markrmil...@gmail.com
> > >>> > > wrote:
> > >>> >
> > >>> > Kranti™ K K Parisa wrote:
> > >>> > > Hi All,
> > >>> > >
> > >>> > > I have a problem using SOLR indexing. I am trying to index 96
> > >>> > pages PDF file
> > >>> > > (using PDFBox for extracting the file contents into String).
> > But
> > >>> > > surprisingly SOLR Indexing is not done for the full document.
> > >>> > Means I can't
> > >>> > > get all the token how ever the field contains the full text
> of
> > >>> > the PDF as i
> > >>> > > am storing the field along with indexing.
> > >>> > >
> > >>> > > Is there any such limitations with SOLR indexing, please let
> me
> > >>> > know at the
> > >>> > > earliest.
> > >>> > >
> > >>> > > Thanks in advance!
> > >>> > >
> > >>> > > Best Regards,
> > >>> > > Kranti K K Parisa
> > >>> > >
> > >>> > >
> > >>> > Take a look at maxFieldLength in solrconfig.xml
> > >>> >
> > >>> > --
> > >>> > - Mark
> > >>> >
> > >>> > http://www.lucidimagination.com
> > >>> >
> > >>> >
> > >>> >
> > >>> >
> > >>>
> > >>>
> > >>> --
> > >>> - Mark
> > >>>
> > >>> http://www.lucidimagination.com
> > >>>
> > >>>
> > >>>
> > >>>
> > >>
> > >
> >
>


Re: Urgent: SOLR Indexing missing tokens

2010-01-19 Thread Kranti™ K K Parisa
Hi

I was making the same mistake mentioned in this URL:
http://search.lucidimagination.com/search/document/30616a061f8c4bf6/solr_ignoring_maxfieldlength

maxFieldLength is there in 2 places. Earlier I changed it in the
indexDefaults section; now I have changed it in the mainIndex section also.

It worked. Thanks Mark & Erick, I appreciate your help.

Best Regards,
Kranti K K Parisa



2010/1/19 Kranti™ K K Parisa 

> Hi Erik,
>
> Yes, i deleted the index and re-indexed after increasing the value (i have
> restarted tomcat as well)
>
> but still no luck. but i was just wondering the field that i am trying to
> index has the complete document text in it as i am storing that. but not
> getting the complete terms/tokens into the index to perform the search.
>
> What would be the suggestible Analyzers, filters that I should check with?
>
> Currently I am using the following:
>
> [field type and analyzer XML stripped by the list archive]
>
> Please suggest
> Best Regards,
> Kranti K K Parisa
>
>
>
> On Tue, Jan 19, 2010 at 9:03 PM, Erick Erickson 
> wrote:
>
>> Did you reindex the documents you examined? That limit
>> is applied when you index.
>>
>> Try searching the user list for maxfieldlength, this topic
>> has been discussed many times and you should find a
>> solution.
>>
>> HTH
>> Erick
>>
>> 2010/1/19 Kranti™ K K Parisa 
>>
>> > Can anyone suggest/guide me on this.
>> >
>> > Best Regards,
>> > Kranti K K Parisa
>> >
>> >
>> >
>> > 2010/1/19 Kranti™ K K Parisa 
>> >
>> > > Hi Mark,
>> > >
>> > > I changed the value to 1,000,000,000 to just test my luck.
>> > >
>> > > But unfortunately I am still not getting the index for all Token.
>> > >
>> > > Please suggest.
>> > >
>> > > Best Regards,
>> > > Kranti K K Parisa
>> > >
>> > >
>> > >
>> > > 2010/1/19 Kranti™ K K Parisa 
>> > >
>> > > Hi Mark,
>> > >>
>> > >> As you see my config file contains the value as 10,000
>> > >> <maxFieldLength>10000</maxFieldLength>
>> > >>
>> > >> But when I check thru Lukeall jar file I can see the Term count
>> around
>> > >> 3,000.
>> > >>
>> > >> Please suggest.
>> > >>
>> > >> Best Regards,
>> > >> Kranti K K Parisa
>> > >>
>> > >>
>> > >>
>> > >> 2010/1/19 Mark Miller 
>> > >>
>> > >> It limits the number of tokens that will be indexed.
>> > >>>
>> > >>> Kranti™ K K Parisa wrote:
>> > >>> > Hi Mark,
>> > >>> >
>> > >>> > I really appreciate the quick reply.
>> > >>> >
>> > >>> > here is what I have in the config xml
>> > >>> >
>> > >>> > <ramBufferSizeMB>32</ramBufferSizeMB>
>> > >>> > <maxMergeDocs>2147483647</maxMergeDocs>
>> > >>> >   *<maxFieldLength>10000</maxFieldLength>*
>> > >>> > <writeLockTimeout>1000</writeLockTimeout>
>> > >>> > <commitLockTimeout>10000</commitLockTimeout>
>> > >>> >
>> > >>> > Does this matter with Tokens?? Because the field I am using is
>> having
>> > >>> > the full content of the file ( I checked that using Lukeall jar
>> > file),
>> > >>> > how ever Tokens are not getting generated completely because of
>> which
>> > >>> > my search not working for the full content.
>> > >>> >
>> > >>> > Please suggest.
>> > >>> >
>> > >>> > Best Regards,
>> > >>> > Kranti K K Parisa
>> > >>> >
>> > >>> >
>> > >>> >
>> > >>> > On Tue, Jan 19, 2010 at 8:27 PM, Mark Miller <
>> markrmil...@gmail.com
>> > >>> > > wrote:
>> > >>> >
>> > >>> > Kranti™ K K Parisa wrote:
>> > >>> > > Hi All,
>> > >>> > >
>> > >>> > > I have a problem using SOLR indexing. I am trying to index
>> 96
>> > >>> > pages PDF file
>> > >>> > > (using PDFBox for extracting the file contents into String).
>> > But
>> > >>> > > surprisingly SOLR Indexing is not done for the full
>> document.
>> > >>> > Means I can't
>> > >>> > > get all the token how ever the field contains the full text
>> of
>> > >>> > the PDF as i
>> > >>> > > am storing the field along with indexing.
>> > >>> > >
>> > >>> > > Is there any such limitations with SOLR indexing, please let
>> me
>> > >>> > know at the
>> > >>> > > earliest.
>> > >>> > >
>> > >>> > > Thanks in advance!
>> > >>> > >
>> > >>> > > Best Regards,
>> > >>> > > Kranti K K Parisa
>> > >>> > >
>> > >>> > >
>> > >>> > Take a look at maxFieldLength in solrconfig.xml
>> > >>> >
>> > >>> > --
>> > >>> > - Mark
>> > >>> >
>> > >>> > http://www.lucidimagination.com
>> > >>> >
>> > >>> >
>> > >>> >
>> > >>> >
>> > >>>
>> > >>>
>> > >>> --
>> > >>> - Mark
>> > >>>
>> > >>> http://www.lucidimagination.com
>> > >>>
>> > >>>
>> > >>>
>> > >>>
>> > >>
>> > >
>> >
>>
>
>


Re: Tokenization and wild card search

2010-01-19 Thread johnmunir


You are correct, the way I'm using tokenization is my issue.  It's too late to
re-index now, which is why I'm looking for a search syntax that will make the
search work.

I have tried various search syntaxes with no luck.  Is there no search syntax
to make this work without re-indexing?!
 
-- JM


-Original Message-
From: Erick Erickson 
To: solr-user@lucene.apache.org
Sent: Tue, Jan 19, 2010 10:30 am
Subject: Re: Tokenization and wild card search


I'm pretty sure you're going to be disappointed about
he re-indexing part.
I'm pretty sure that WordDelimiterFilterFactory is tokenizing
our input in ways you don't expect, making your use-case
ard to accomplish.
It's basically splitting your input on all non-alpha characters,
o you're indexing see
ttp://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
I'd *strongly* suggest you examine the results of your indexing
n order to understand what's possible.
Get a copy of luke and examine your index or use the
OLR admin Analysis page...
I suspect what you're really looking for is WhitespaceAnalyzer
r Keyword
On Tue, Jan 19, 2010 at 9:50 AM,  wrote:
>

 I want the following searches to work:

  MyField:SDD_Expedition_PCB

 This should match the word "SDD_Expedition_PCB" only, and not matching
 individual words such as "SDD" or "Expedition", or "PCB".

 And the following search:

  MyField:SDD_Expedition*

 Should match any word starting with "SDD_Expedition" and ending with
 anything else such as "SDD_Expedition_PBC", "SDD_Expedition_One",
 "SDD_Expedition_Two", "SDD_ExpeditionSolr", "SDD_ExpeditionSolr1.4", etc,
 but not matching individual words such as "SDD" or "Expedition".


 The field type for "MyField" is (the field name is keywords):



 And here is the analyzer I'm using:


  







  
  







  


 Any help on how I can achieve the above is greatly appreciated.

 Btw, if at all possible, I would like to be able to achieve this search
 without having to change how I'm indexing / tokenizing the data.  I'm
 looking for search syntax to make this work.

 -- JM

 -Original Message-
 From: Ahmet Arslan [mailto:iori...@yahoo.com]
 Sent: Tuesday, January 19, 2010 7:57 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Tokenization and wild card search

 > I have an issue and I'm not sure how to address it, so I
 > hope someone can help me.
 >
 > I have the following text in one of my fields:
 > "ABC_Expedition_ERROR".���When I search on it
 > like: "MyField:SDD_Expedition_PCB" (without quotes) it will
 > fail to find me only this word �ABC_Expedition_ERROR�
 > which I think is due to tokenization because of the
 > underscore.

 Do you want or do not want your query MyField:SDD_Expedition_PCB to return
 documents containing ABC_Expedition_ERROR?

 > My solution is: "MyField:"SDD_Expedition_PCB"" (without the
 > outer quotes, but quotes around the word
 > �ABC_Expedition_ERROR�).� This works fine.�
 > But then, how do I search on "SDD_Expedition_PCB" with wild
 > card?� For example: "MyField:SDD_Expedition*" will not
 > work.

 Can you paste your field type of MyField? And give some examples what
 queries should return what documents.





Re: Field collapsing patch error

2010-01-19 Thread Joe Calderon
this has come up before; my suggestion would be to use the 12/24
patch with trunk revision 892336:

http://www.lucidimagination.com/search/document/797549d29e1810d9/solr_1_4_field_collapsing_what_are_the_steps_for_applying_the_solr_236_patch

2010/1/19 Licinio Fernández Maurelo :
> Hi folks,
>
> i've downloaded solr release 1.4 and tried to apply  latest field collapsing
> patch i've
> found . Found errors :
>
> d...@backend05:~/workspace/solr-release-1.4.0$ patch -p0 -i SOLR-236.patch
>
> patching file src/test/test-files/solr/conf/solrconfig-fieldcollapse.xml
> patching file src/test/test-files/solr/conf/schema-fieldcollapse.xml
> patching file src/test/test-files/solr/conf/solrconfig.xml
> patching file src/test/test-files/fieldcollapse/testResponse.xml
> patching file
> src/test/org/apache/solr/search/fieldcollapse/FieldCollapsingIntegrationTest.java
> patching file
> src/test/org/apache/solr/search/fieldcollapse/DistributedFieldCollapsingIntegrationTest.java
> patching file
> src/test/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapserTest.java
>
> patching file
> src/test/org/apache/solr/search/fieldcollapse/AdjacentCollapserTest.java
>
> patching file
> src/test/org/apache/solr/handler/component/CollapseComponentTest.java
>
> patching file
> src/test/org/apache/solr/client/solrj/response/FieldCollapseResponseTest.java
>
> patching file
> src/java/org/apache/solr/search/DocSetAwareCollector.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/CollapseGroup.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/DocumentCollapseResult.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/DocumentCollapser.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/CollapseCollectorFactory.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/DocumentGroupCountCollapseCollectorFactory.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/AverageFunction.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/MinFunction.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/SumFunction.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/MaxFunction.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/AggregateFunction.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/CollapseContext.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/DocumentFieldsCollapseCollectorFactory.java
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/AggregateCollapseCollectorFactory.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/CollapseCollector.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/FieldValueCountCollapseCollectorFactory.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/AbstractCollapseCollector.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/AbstractDocumentCollapser.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/AdjacentDocumentCollapser.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/util/Counter.java
>
> patching file
> src/java/org/apache/solr/search/SolrIndexSearcher.java
>
> patching file
> src/java/org/apache/solr/search/DocSetHitCollector.java
>
> patching file
> src/java/org/apache/solr/handler/component/CollapseComponent.java
>
> patching file
> src/java/org/apache/solr/handler/component/QueryComponent.java
>
> Hunk #1 FAILED at
> 522.
>
> 1 out of 1 hunk FAILED -- saving rejects to file
> src/java/org/apache/solr/handler/component/QueryComponent.java.rej
>
> patching file
> src/java/org/apache/solr/util/DocSetScoreCollector.java
>
> patching file
> src/common/org/apache/solr/common/params/CollapseParams.java
>
> patching file src/solrj/org/apache/solr/client/solrj/SolrQuery.java
> Hunk #1 FAILED at 17.
> Hunk #2 FAILED at 50.
> Hunk #3 FAILED at 76.
> Hunk #4 FAILED at 148.
> Hunk #5 FAILED at 197.
> Hunk #6 succeeded at 510 (offset -155 lines).
> Hunk #7 succeeded at 566 (offset -155 lines).
> 5 out of 7 hunks FAILED -- saving rejects to file
> src/solrj/org/apache/solr/client/solrj/SolrQuery.java.rej
> patching file
> src/solrj/org/apache/solr/client/solrj/response/QueryResponse.java
> Hunk #1 succeeded at 17 with fuzz 1.
> Hunk #2 FAILED at 42.
> Hunk #3 FAILED at 58.
> Hunk #4 succeeded at 117 with fuzz 2 (offset -8 lines).
> Hunk #5 succeeded at 315 with fuzz 2 (offset 17 lines).
> 2 out of 5 hunks FAILED -- saving rejects to file
> src/solrj/org/apache/solr/client/solrj/response/QueryResp

Re: Updating a single field in a Solr document

2010-01-19 Thread Raghuveer Kancherla
Is this feature planned for any of the future releases? I ask because it will
help me plan my system architecture accordingly.

Thanks,
Raghu



On Tue, Jan 19, 2010 at 7:28 PM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> On Mon, Jan 18, 2010 at 5:11 PM, Raghuveer Kancherla <
> raghuveer.kanche...@aplopio.com> wrote:
>
> > Hi,
> > I have 2 fields: one which captures the category of the documents and
> > another
> > which is a pre processed text of the document. Text of the document is
> > fairly large.
> > The category of the document changes often while the text remains the
> same.
> > Search happens on both fields.
> >
> > The problem is, I have to index both the text and the category each time
> > the
> > category changes. The text being large obviously makes this suboptimal.
> Is
> > there a patch or a tricky way to avoid indexing the text field every
> time.
> >
> >
> Sure, make the text field as stored, read the old document and create the
> new one. Sorry, there is no way to update an indexed document in Solr
> (yet).
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: Updating a single field in a Solr document

2010-01-19 Thread Richard Frovarp

Shalin Shekhar Mangar wrote:

On Mon, Jan 18, 2010 at 5:11 PM, Raghuveer Kancherla <
raghuveer.kanche...@aplopio.com> wrote:

  

Hi,
I have 2 fields: one which captures the category of the documents and
another
which is a pre processed text of the document. Text of the document is
fairly large.
The category of the document changes often while the text remains the same.
Search happens on both fields.

The problem is, I have to index both the text and the category each time
the
category changes. The text being large obviously makes this suboptimal. Is
there a patch or a tricky way to avoid indexing the text field every time.




Sure, make the text field as stored, read the old document and create the
new one. Sorry, there is no way to update an indexed document in Solr (yet).

  
Could you use SolrJ to grab a record, convert it over to an input record 
and update the one field? I suppose Solr would still reindex the whole 
thing, but at least you wouldn't have to do full pre-processing on the 
source.


Richard
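
For what it's worth, a minimal SolrJ sketch of that read-modify-re-add
approach (assuming every field is stored; the URL and the field names "id"
and "category" are made up):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrDocument old = server.query(new SolrQuery("id:doc-42"))
                             .getResults().get(0);

    SolrInputDocument updated = new SolrInputDocument();
    for (String name : old.getFieldNames()) {
        updated.addField(name, old.getFieldValues(name)); // copy the stored values
    }
    updated.setField("category", "new-category");  // change just the one field
    server.add(updated);   // Solr still re-indexes the whole document
    server.commit();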


Re: Tokenization and wild card search

2010-01-19 Thread Erick Erickson
What I suspect would work is phrase queries with no slop.
Unfortunately, to get this to work right you need wildcards
inside phrases, which is NOT supported out of the box.

However, see SOLR 1604 for patches that address this...

http://issues.apache.org/jira/browse/SOLR-1604

HTH
Erick

P.S. Are you absolutely sure you can't re-index?


On Tue, Jan 19, 2010 at 11:11 AM,  wrote:

>
>
> You are correct, the way I'm using tokenization is my issue.  It's too late
> to re-index now, which is why I'm looking for a search syntax that will
> make the search work.
>
> I have tried various search syntax with no luck.  Is there no search syntax
> to make this work without re-indexing?!
>
> -- JM
>
>
> -Original Message-
> From: Erick Erickson 
> To: solr-user@lucene.apache.org
> Sent: Tue, Jan 19, 2010 10:30 am
> Subject: Re: Tokenization and wild card search
>
>
> I'm pretty sure you're going to be disappointed about
> the re-indexing part.
> I'm pretty sure that WordDelimiterFilterFactory is tokenizing
> your input in ways you don't expect, making your use-case
> hard to accomplish.
> It's basically splitting your input on all non-alpha characters,
> so for what you're indexing see
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
> I'd *strongly* suggest you examine the results of your indexing
> in order to understand what's possible.
> Get a copy of Luke and examine your index or use the
> SOLR admin Analysis page...
> I suspect what you're really looking for is WhitespaceAnalyzer
> or Keyword.
> On Tue, Jan 19, 2010 at 9:50 AM,  wrote:
> >
>
>  I want the following searches to work:
>
>  MyField:SDD_Expedition_PCB
>
>  This should match the word "SDD_Expedition_PCB" only, and not match
>  individual words such as "SDD" or "Expedition", or "PCB".
>
>  And the following search:
>
>  MyField:SDD_Expedition*
>
>  Should match any word starting with "SDD_Expedition" and ending with
>  anything else such as "SDD_Expedition_PBC", "SDD_Expedition_One",
>  "SDD_Expedition_Two", "SDD_ExpeditionSolr", "SDD_ExpeditionSolr1.4", etc,
>  but not match individual words such as "SDD" or "Expedition".
>
>
>  The field type for "MyField" is (the field name is keywords):
>
>     <field name="Keywords" type="text" required="false" multiValued="true"/>
>
>  And here is the analyzer I'm using:
>
>     <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
>                 generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>                 catenateAll="0"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
>                 generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>                 catenateAll="0"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
>       </analyzer>
>     </fieldType>
>
>
>  Any help on how I can achieve the above is greatly appreciated.
>
>  Btw, if at all possible, I would like to be able to achieve this search
>  without having to change how I'm indexing / tokenizing the data.  I'm
>  looking for search syntax to make this work.
>
>  -- JM
>
>  -Original Message-
>  From: Ahmet Arslan [mailto:iori...@yahoo.com]
>  Sent: Tuesday, January 19, 2010 7:57 AM
>  To: solr-user@lucene.apache.org
>  Subject: Re: Tokenization and wild card search
>
>  > I have an issue and I'm not sure how to address it, so I
>  > hope someone can help me.
>  >
>  > I have the following text in one of my fields:
>  > "ABC_Expedition_ERROR".���When I search on it
>  > like: "MyField:SDD_Expedition_PCB" (without quotes) it will
>  > fail to find me only this word "ABC_Expedition_ERROR"
>  > which I think is due to tokenization because of the
>  > underscore.
>
>  Do you want or do not want your query MyField:SDD_Expedition_PCB to return
>  documents containing ABC_Expedition_ERROR?
>
>  > My solution is: "MyField:"SDD_Expedition_PCB"" (without the
>  > outer quotes, but quotes around the word
>  > "ABC_Expedition_ERROR").  This works fine.
>  > But then, how do I search on "SDD_Expedition_PCB" with wild
>  > card?  For example: "MyField:SDD_Expedition*" will not
>  > work.
>
>  Can you paste your field type of MyField? And give some examples what
>  queries should return what documents.
>
>
>
>


RE: Tokenization and wild card search

2010-01-19 Thread Ahmet Arslan

> I want the following searches to work:
>  
>   MyField:SDD_Expedition_PCB
>  
> This should match the word "SDD_Expedition_PCB" only, and
> not match individual words such as "SDD" or "Expedition",
> or "PCB".
> 
> And the following search:
>  
>   MyField:SDD_Expedition*
>  
> Should match any word starting with "SDD_Expedition" and
> ending with anything else such as "SDD_Expedition_PBC",
> "SDD_Expedition_One", "SDD_Expedition_Two",
> "SDD_ExpeditionSolr", "SDD_ExpeditionSolr1.4", etc, but not
> match individual words such as "SDD" or "Expedition".

I just tested your type on the admin/analysis.jsp page (Solr 1.4.0) and two of 
your examples are reduced to:

SDD_Expedition_PCB    => sddexpeditionpcb
ABC_Expedition_ERROR  => abcexpeditionerror

in both query and index time.

I think there is a misunderstanding. With your type declaration, the query 
Keywords:SDD_Expedition_PCB shouldn't match 
individual words such as "SDD" or "Expedition", or "PCB". Something is 
inconsistent between the scenario in your first mail and your field type 
declaration. Can you run 
&q=Keywords:SDD_Expedition_PCB&debugQuery=on and send the debug info?


About the prefix query: Keywords:SDD_Expedition* would never match in your 
current configuration, because prefix and wildcard queries are not analyzed. 
The best thing you can do is convert this query to sddexpedition*; then it 
will bring you all of these: SDD_Expedition_PBC, SDD_Expedition_One, 
SDD_Expedition_Two, SDD_Expedition_Solr.
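
A hedged client-side sketch of that conversion (the regex is only an
approximation of the analyzer chain above, not the real filter logic):

    // Normalize the user's prefix roughly the way the index analyzer does
    // (catenate the parts, lowercase), then build the wildcard query.
    String toPrefixQuery(String userInput) {
        String prefix = userInput.toLowerCase(java.util.Locale.ENGLISH)
                                 .replaceAll("[^a-z0-9]", "");
        return "Keywords:" + prefix + "*";   // e.g. Keywords:sddexpedition*
    }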





Re: Updating a single field in a Solr document

2010-01-19 Thread Mauricio Scheffer
Here's the corresponding issue:
https://issues.apache.org/jira/browse/SOLR-139

On Tue, Jan 19, 2010 at 1:36 PM, Raghuveer Kancherla <
raghuveer.kanche...@aplopio.com> wrote:

> Is this feature planned for any of the future releases? I ask because it
> will
> help me plan my system architecture accordingly.
>
> Thanks,
> Raghu
>
>
>
> On Tue, Jan 19, 2010 at 7:28 PM, Shalin Shekhar Mangar <
> shalinman...@gmail.com> wrote:
>
> > On Mon, Jan 18, 2010 at 5:11 PM, Raghuveer Kancherla <
> > raghuveer.kanche...@aplopio.com> wrote:
> >
> > > Hi,
> > > I have 2 fields: one which captures the category of the documents and
> > > another
> > > which is a pre processed text of the document. Text of the document is
> > > fairly large.
> > > The category of the document changes often while the text remains the
> > same.
> > > Search happens on both fields.
> > >
> > > The problem is, I have to index both the text and the category each
> time
> > > the
> > > category changes. The text being large obviously makes this suboptimal.
> > Is
> > > there a patch or a tricky way to avoid indexing the text field every
> > time.
> > >
> > >
> > Sure, make the text field as stored, read the old document and create the
> > new one. Sorry, there is no way to update an indexed document in Solr
> > (yet).
> >
> > --
> > Regards,
> > Shalin Shekhar Mangar.
> >
>


Number of values limitation in multivalued field

2010-01-19 Thread SHS SOLR
* Can we define a field in our schema as multiValued (with stored=false,
indexed=true) that will hold up to 42K zipcode values associated with each
document?
* Is there any query-time performance impact with this?
* Is there any impact on index time?

The number of documents we are talking here is not more than 100 right now.
There is no requirement to facet or highlight or even show this field in the
search results. We only want to enable zipcode searches that would return
matching docs.

Thanks,


Comparison of Solr with Sharepoint Search

2010-01-19 Thread Abhishek Srivastava
Has anyone done a functionality comparison of Solr with Sharepoint/Fast
Search?

If yes, kindly share a few details here.

Thanks for your help in advance!

Regards,
Abhishek.


Rounding dates on sort and filter

2010-01-19 Thread Charlie Jackson
I've got a legacy date field that I'd like to round for sorting and
filtering. Right now, the index is large enough that sorting or
filtering on a date field takes 10-20 seconds (unless it's cached). I
know this is because the date field's precision is down to the
millisecond, and I don't really need that level of precision for most of
my searches. So, is it possible to round my field at query time without
having to reindex the field or add a second one? 

 

I already tried the function sorting in 1.5-dev, but my field isn't a
TrieDate field so I can't use the ms() function (which seems to allow
date math unlike the other functions). 

 

Thanks,

Charlie



Data storage, and textual analysis

2010-01-19 Thread Gora Mohanty
Hi,

Another simple query. I have set up a field to hold phonetic
equivalents, with the relevant part of schema.xml looking like:

  <fieldType name="..." class="solr.TextField">
    <analyzer>
      <tokenizer class="..."/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
              generateNumberParts="0" catenateWords="1" catenateNumbers="0"
              catenateAll="0"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="com.srijan.search.solr.analysis.AspellFilterFactory"/>
    </analyzer>
  </fieldType>


Here, com.srijan.search.solr.analysis.AspellFilterFactory is
a custom filter that provides a phonetic soundslike equivalent for
Indian languages transliterated into English. However, that is
irrelevant here, as the issue below holds even if I use the standard
solr.DoubleMetaphoneFilterFactory.

I have a data source where all text is upper-case, and from
various Solr-related discussions found through Google, I would have
thought that fields of this type would be stored as the lower-case,
soundslike equivalent. Instead the data (as seen through the Solr
admin. interface, or through a front-end search) seem to be stored
as is.

The Solr admin. analysis view does show the index and query
conversions as I would expect. Also, phonetic matches, and matches
with lower-case input work properly. I am just curious as to how
this works.

Regards,
Gora


Re: Replication Condition (Swapping indexers)

2010-01-19 Thread Shalin Shekhar Mangar
On Thu, Jan 14, 2010 at 6:30 AM, wojtekpia  wrote:

>
> I have a deployment with 2 indexers (2 cores in a single servlet
> container),
> and a farm of searchers that replicate from one of the indexers. Once in a
> while I need to re-index all my data, so I do that on my second indexer
> (while my original indexer still gets incremental updates), then swap my
> indexer cores. What's the condition that slaves check before replicating
> the
> master's data? I'm concerned that if it's purely based on index version or
> generation number, then my new index may appear older than the old one, and
> the searchers won't synchronize to it. I haven't been able to simulate the
> scenario I'm afraid of yet (searchers always seem to replicate after I swap
> cores), but I want to make sure I'm not just getting lucky. I realize I
> could force each searcher to synchronize with command=fetchindex, but I'd
> prefer to set something on the indexer than on each searcher.
>
>
> I'm using Solr 1.4 and using built-in replication via ReplicationHandler.
>
>
The slaves check if they have the same index version and generation number
as the master. If the generation is earlier than the master then only the
files missing on the slave are copied over from the master. If the
generation is greater than or equal to master then all the index files are
replicated from master.

In your case, I'm guessing that when you do a full re-index, you probably
start from an empty index and therefore, the generation of the master index
is always less than the slave's index. This causes all the files to be
replicated. If that is the case then you don't need to worry.
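
In other words, a pseudo-code paraphrase of the check described above (the
method names are invented; this is not actual Solr code):

    if (slaveGeneration < masterGeneration) {
        fetchOnlyMissingIndexFiles();  // normal incremental replication
    } else {
        fetchAllIndexFiles();          // e.g. right after a rebuilt core is swapped in
    }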

-- 
Regards,
Shalin Shekhar Mangar.


Re: build path

2010-01-19 Thread Amit Nithian
If you are going to run the unit tests in Eclipse, then for the given JUnit
run configuration, add the -Dsolr.solr.home= to the VM
arguments section of the run configuration for the given test.

On Tue, Jan 19, 2010 at 12:34 AM, Wangsheng Mei  wrote:

> I think it is.
>
> solr has a default servlet container "jetty" with the downloaded package
> under folder "example" .
> but I use tomcat a lot, so I deployed solr on tomcat using solr.war.
>
> I don't know why solr will use jetty as default.
>
> 2010/1/19 Siv Anette Fjellkårstad 
>
> > I apologize for the newbie questions :|
> > Do I need a servlet container to run the tests?
> >
> > Kind regards,
> > Siv
> >
> >
> > 
> >
> > Fra: Wangsheng Mei [mailto:hairr...@gmail.com]
> > Sendt: ti 19.01.2010 08:49
> > Til: solr-user@lucene.apache.org
> > Emne: Re: build path
> >
> >
> >
> > maybe you should add "-Dsolr.solr.home=" to your
> JAVA_OPTS
> > before your servlet container starts.
> >
> >
> > 2010/1/19 Siv Anette Fjellkårstad 
> >
> > > Hi!
> > > I try to run the tests of Solr 1.4 in Eclipse, but most of them
> fail.
> > > The error messages indicate that I miss some config files in my build
> > path.
> > > Is there any documentation of how to get Solr up and running in
> Eclipse?
> > If
> > > not; How did you set up (build path for) Solr in Eclipse?
> > >
> > > Another question; Some of the tests also fail when I run ant test. Is
> > that
> > > normal?
> > >
> > > Sincerely,
> > > Siv
> > >
> > >
> > > This email originates from Steria AS, Biskop Gunnerus' gate 14a, N-0051
> > > OSLO, http://www.steria.no  . This email and
> any
> > attachments may contain
> > > confidential/intellectual property/copyright information and is only
> for
> > the
> > > use of the addressee(s). You are prohibited from copying, forwarding,
> > > disclosing, saving or otherwise using it in any way if you are not the
> > > addressee(s) or responsible for delivery. If you receive this email by
> > > mistake, please advise the sender and cancel it immediately. Steria may
> > > monitor the content of emails within its network to ensure compliance
> > with
> > > its policies and procedures. Any email is susceptible to alteration and
> > its
> > > integrity cannot be assured. Steria shall not be liable if the message
> is
> > > altered, modified, falsified, or even edited.
> > >
> >
> >
> >
> > --
> > 梅旺生
> >
> >
> >
> > This email originates from Steria AS, Biskop Gunnerus' gate 14a, N-0051
> > OSLO, http://www.steria.no. This email and any attachments may contain
> > confidential/intellectual property/copyright information and is only for
> the
> > use of the addressee(s). You are prohibited from copying, forwarding,
> > disclosing, saving or otherwise using it in any way if you are not the
> > addressee(s) or responsible for delivery. If you receive this email by
> > mistake, please advise the sender and cancel it immediately. Steria may
> > monitor the content of emails within its network to ensure compliance
> with
> > its policies and procedures. Any email is susceptible to alteration and
> its
> > integrity cannot be assured. Steria shall not be liable if the message is
> > altered, modified, falsified, or even edited.
> >
>
>
>
> --
> 梅旺生
>


Re: Design Question - Dynamic Field Names (*)

2010-01-19 Thread Shalin Shekhar Mangar
On Sat, Jan 16, 2010 at 3:33 AM, Kumaravel Kandasami <
kumaravel.kandas...@gmail.com> wrote:

> Need your suggestion on how best to design the following requirement.
>
> - We have two indexes.
> Index 1: "name_index",
> Fields:
> "id" - indexed, not stored
>  "field_name" - indexed, stored.
>
> Index 2: "trans_index',
> Fields(Dynamic Schema):
> "id" - indexed, not stored
> "*" - indexed, stored.
>
> (Dynamic field names of the trans_index are the same as the "field_name"
> from
> the name_index.)
>
> - Requirement:
>
> User would select the field he wants to query from the "name_index".
> Once he selects one of the values from the 'field_name' (from the
> name_index), he queries the trans_index using the field_name.
>
>
> - Issue:
>
> When indexing the name_index field:"field_name" we are using the analyzer
> that would lowercase, strip spaces etc.
> Example: "First Name", "firstName" values are all stored and indexed as
> 'firstname'.
>
> However, when we store field names in the trans_index we would be storing
> as
> it is ... without analyzing.
> So User queries like 'firstname:a*' might not match.
>
> - Possible Solution:
>
> We are planning to have a custom analyzer that we would use while indexing
> (configured in the schema.xml file). The crawler program would also use the
> same analyzer to create field names.
>
> Is there any better design solutions ?
>
>
Your scenario sounds quite strange and it is still not clear why you are
doing all this. Perhaps the solution doesn't even require two indexes? Can
you describe the actual problem so that we can be of more help?

-- 
Regards,
Shalin Shekhar Mangar.


DIH delta import - last modified date

2010-01-19 Thread Yao Ge

I am struggling with the concept of delta import in DIH. According to the
documentation, the delta import will automatically record the last index
timestamp and make it available for use in the delta query. However, in many
cases where the last_modified timestamp in the database lags behind the
current time, the last index timestamp is not good for the delta query. Can
I pick a different mechanism to generate "last_index_time", by using a
timestamp computed from the database (such as from a column of the database)?
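
A sketch of one possible workaround: since deltaQuery is just SQL that you
control, the cutoff can come from the database itself rather than from
${dataimporter.last_index_time}. The table and column names here are
hypothetical:

    <entity name="item" pk="id"
            query="select * from item"
            deltaQuery="select id from item
                        where last_modified >
                              (select max(finished_at) from my_index_runs)"/>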
-- 
View this message in context: 
http://old.nabble.com/DIH-delta-import---last-modified-date-tp27231449p27231449.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Tokenization and wild card search

2010-01-19 Thread Erick Erickson
Listen to Ahmet, ignore me. I missed "catenateWords=1", which
should produce the tokens exactly as Ahmet said. So standard
wildcarding should work, it seems to me...

Sorry 'bout that

Erick

On Tue, Jan 19, 2010 at 12:01 PM, Ahmet Arslan  wrote:

>
> > I want the following searches to work:
> >
> >   MyField:SDD_Expedition_PCB
> >
> > This should match the word "SDD_Expedition_PCB" only, and
> > not match individual words such as "SDD" or "Expedition",
> > or "PCB".
> >
> > And the following search:
> >
> >   MyField:SDD_Expedition*
> >
> > Should match any word starting with "SDD_Expedition" and
> > ending with anything else such as "SDD_Expedition_PBC",
> > "SDD_Expedition_One", "SDD_Expedition_Two",
> > "SDD_ExpeditionSolr", "SDD_ExpeditionSolr1.4", etc, but not
> > match individual words such as "SDD" or "Expedition".
>
> I just tested your type in admin/analysis.jsp  (solr 1.4.0) page and two of
> your examples are reduced to:
>
> SDD_Expedition_PCB=> sddexpeditionpcb
> ABC_Expedition_ERROR  => abcexpeditionerror
>
> in both query and index time.
>
> I think there is a misunderstanding. With your type declaration, the query
> Keywords:SDD_Expedition_PCB shouldn't match
> individual words such as "SDD" or "Expedition", or "PCB". Something is
> inconsistent between the scenario in your first mail and your field type declaration. Can
> you run &q=Keywords:SDD_Expedition_PCB&debugQuery=on and send debug info?
>
>
> About the prefix query: Keywords:SDD_Expedition* would never match in your
> current configuration, because prefix and wildcard queries are not analyzed.
> The best thing you can do is convert this query to sddexpedition*; then it will
> bring you all these: SDD_Expedition_PBC, SDD_Expedition_One,
> SDD_Expedition_Two, SDD_Expedition_Solr.
>
>
>
>


Re: Number of values limitation in multivalued field

2010-01-19 Thread Erick Erickson
You should be able to do this no problem. Do be aware of the
maxFieldLength setting though; it defaults to 10,000 tokens but you
can change it in your solrconfig.xml. Beware, there are TWO
instances of this in the config file. See:
http://search.lucidimagination.com/search/document/30616a061f8c4bf6/solr_ignoring_maxfieldlength

What do you mean by index/search performance impact? As
compared to what?

I think the impacts will be negligible when compared to putting all
the zip codes into the field at once, and search time should be
unaffected over that alternative.

HTH
Erick

On Tue, Jan 19, 2010 at 12:11 PM, SHS SOLR  wrote:

> * Can we define a field in our schema as multiValued (with stored=false,
> indexed=true) that will hold upto 42K zipcode values associated to each
> document?
> * Is there any query time performance impact with this.
> * Is there any impact on index time.
>
> The number of documents we are talking here is not more than 100 right now.
> There is no requirement to facet or highlight or even show this field in
> the
> search results. We only want to enable zipcode searches that would return
> matching docs.
>
> Thanks,
>


Re: Data storage, and textual analysis

2010-01-19 Thread Otis Gospodnetic
Gora,

What you are seeing are the *stored* values, which are the original, unchanged 
field values.
Analysis is applied to text for *indexing* purposes.


Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
> From: Gora Mohanty 
> To: solr-user@lucene.apache.org
> Sent: Tue, January 19, 2010 1:41:05 PM
> Subject: Data storage, and textual analysis
> 
> Hi,
> 
> Another simple query. I have set up a field to hold phonetic
> equivalents, with the relevant part of schema.xml looking like:
> 
>   <fieldType name="..." class="solr.TextField">
>     <analyzer>
>       <tokenizer class="..."/>
>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>               generateNumberParts="0" catenateWords="1" catenateNumbers="0"
>               catenateAll="0"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="com.srijan.search.solr.analysis.AspellFilterFactory"/>
>     </analyzer>
>   </fieldType>
> Here, com.srijan.search.solr.analysis.AspellFilterFactory is
> a custom filter that provides a phonetic soundslike equivalent for
> Indian languages transliterated into English. However, that is
> irrelevant here, as the issue below holds even if I use the standard
> solr.DoubleMetaphoneFilterFactory.
> 
> I have a data source where all text is upper-case, and from
> various Solr-related discussions found through Google, I would have
> thought that fields of this type would be stored as the lower-case,
> soundslike equivalent. Instead the data (as seen through the Solr
> admin. interface, or through a front-end search) seem to be stored
> as is.
> 
> The Solr admin. analysis view does show the index and query
> conversions as I would expect. Also, phonetic matches, and matches
> with lower-case input work properly. I am just curious as to how
> this works.
> 
> Regards,
> Gora



Re: Rounding dates on sort and filter

2010-01-19 Thread Otis Gospodnetic
Charlie,

Query-time terms/tokens need to match what's in your index, and my guess is 
that if you just altered query-time date field analysis, you'd get a mismatch.  
Easy enough to check through Solr Admin Analysis page.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
> From: Charlie Jackson 
> To: solr-user@lucene.apache.org
> Sent: Tue, January 19, 2010 1:20:02 PM
> Subject: Rounding dates on sort and filter
> 
> I've got a legacy date field that I'd like to round for sorting and
> filtering. Right now, the index is large enough that sorting or
> filtering on a date field takes 10-20 seconds (unless it's cached). I
> know this is because the date field's precision is down to the
> millisecond, and I don't really need that level of precision for most of
> my searches. So, is it possible to round my field at query time without
> having to reindex the field or add a second one? 
> 
> 
> 
> I already tried the function sorting in 1.5-dev, but my field isn't a
> TrieDate field so I can't use the ms() function (which seems to allow
> date math unlike the other functions). 
> 
> 
> 
> Thanks,
> 
> Charlie



termsComponent and filter queries

2010-01-19 Thread Naomi Dushay
I have a field that has millions of values, and I need to get "the  
next X values" in alpha order.  The terms component works fabulously  
for this.


Here is a cooked up example of the terms

a
b
f
q
r
rr
rrr
y
z
zzz

So if I ask for the 3 terms after "r", I get "rr", "rrr" and "y".

But now I'd like to apply a filter query on a different field.  After  
the filter, my terms might be:


b
q
r
y
z
zzz

So the 3 terms after "r", given the filter, become  "y" "z" and "zzz"

Given that I have millions of terms, and they are not predictable for  
range queries ... how can I get


"the next X values" of my field
after one or more filters are applied?

- Naomi


Re: Number of values limitation in multivalued field

2010-01-19 Thread SHS SOLR
Thanks Erick,

I was not aware of the maxFieldLength.

* Query performance compared to storing data by zipcode: a schema to
accommodate this would have 42K * 60 documents approx. We also need to
consider duplicate document data with varying zipcodes in the index.

Hope this makes sense. We however wanted to understand if it is a good
practice to dump 42K tokens in a multivalued field.

Thanks,
Pavan.

On Tue, Jan 19, 2010 at 1:56 PM, Erick Erickson wrote:

> You should be able to do this no problem. Do be aware of the
> maxfieldlength though, it defaults to 10,000 tokens but you
> can change it in your schema.xml. Beware, there are TWO
> instances of this in the schema file. See:
>
> http://search.lucidimagination.com/search/document/30616a061f8c4bf6/solr_ignoring_maxfieldlength
>
> What do you mean by index/search performance impact? As
> compared to what?
>
> I think the impacts will be negligible when compared to putting all
> the zip codes into the field at once, and search time should be
> unaffected over that alternative.
>
> HTH
> Erick
>
> On Tue, Jan 19, 2010 at 12:11 PM, SHS SOLR  wrote:
>
> > * Can we define a field in our schema as multiValued (with stored=false,
> > indexed=true) that will hold upto 42K zipcode values associated to each
> > document?
> > * Is there any query time performance impact with this.
> > * Is there any impact on index time.
> >
> > The number of documents we are talking here is not more than 100 right
> now.
> > There is no requirement to facet or highlight or even show this field in
> > the
> > search results. We only want to enable zipcode searches that would return
> > matching docs.
> >
> > Thanks,
> >
>


Re: Google Commerce Search

2010-01-19 Thread Otis Gospodnetic
And what I recommended to my Fortune 1 client;) ...actually, just one 
correction:

 
> Secondly you should know that, you can not update or push Synonyms at run
> time.

You can, if you are okay with query-time synonym expansion.  The new
replication can be used to replicate not only indices, but also config
files, including the synonyms file.  My guess is that this is the same with GSA 
and other search vendors' solutions.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
> From: Mohamed Parvez 
> To: solr-user@lucene.apache.org
> Sent: Sun, January 17, 2010 1:35:17 PM
> Subject: Re: Google Commerce Search
> 
> I was in the same shoes as you, and did recommend and implement Solr to
> my Fortune 10 client. Solr is a great solution; it meets most of the
> requirements and lacks very few things.
> 
> In your case, I think you should know that Solr does handle synonyms very
> well as long as they are single-word to single-word mappings, even cases
> with many-to-one and one-to-many mappings. But when it comes to
> multi-word to multi-word mappings, it still works, but you need to do
> some tweaking.
> 
> Secondly, you should know that you cannot update or push synonyms at run
> time.
>
> Nevertheless, Solr just works out of the box on Windows and Linux, and I have
> tried it with various servers like Tomcat, Jetty, Weblogic etc. It works
> like a champ and I have had 100% uptime since we went to production.
> 
> -
> Thanks/Regards,
> Parvez
> 
> 
> 
> On Sun, Jan 17, 2010 at 4:30 AM, mrbelvedr wrote:
> 
> >
> > Our customer is a Fortune 5 big time company. They have millions of
> > vendors/products they work with daily. They have budget for whatever we
> > recommend but we like to use open source if it is a great alternative to
> > Google Search Appliance or Google Commerce Search.
> >
> > Google has recently introduced "Google Commerce Search" which allows
> > ecommerce merchants to have their products indexed by Google and shoppers
> > may search for products easily.
> >
> > Here is the URL of their new offering:
> >
> >
> > 
> http://www.google.com/commercesearch/#utm_source=en-et-na-us-merchants&utm_medium=et&utm_campaign=merchants
> >
> > Obviously this is a great solution. It offers all the great things like
> > spell checking, product synonyms, etc.  Is Solr able to do these features:
> >
> > * Index our MS Sql Server 2008 product table
> >
> > * Spell check for product brand names - user enters brand "sharpee" and the
> > search engine will reply "Did you mean 'Sharpie'? "
> >
> > * We have 2 million products stored in our MS Sql Server 2008, will Solr
> > handle that many products and give fast search results?
> >
> > Please advise if Solr will work as well as Google product?
> >
> > Thx!
> > --
> > View this message in context:
> > http://old.nabble.com/Google-Commerce-Search-tp27197509p27197509.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
> >



Re: Contributors - Solr in Action Case Studies

2010-01-19 Thread Otis Gospodnetic
Hi Gora,

Thanks, this sounds interesting, as I don't think we explicitly cover phonetic 
searches and talking explicitly about languages other than English will be 
useful to some readers.

Let's take further discussion off-line.


Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
> From: Gora Mohanty 
> To: solr-user@lucene.apache.org
> Sent: Sun, January 17, 2010 4:16:25 PM
> Subject: Re: Contributors - Solr in Action Case Studies
> 
> On Thu, 14 Jan 2010 11:09:41 -0800 (PST)
> Otis Gospodnetic wrote:
> [...]
> > If you are using Solr in some clever, interesting, or unusual way
> > and are willing to share this information, please get in touch.
> > 5 to max 10 pages (soft limits) per study is what we are hoping
> > for.  Feel free to respond on the list or reply to me directly.
> [...]
> 
> We have been getting to grips with Solr over the last couple of
> months, and while I am not sure how interesting this is to people
> outside of India, one of the things that we have just finished a
> beta version of is phonetic filters and spell-checking components
> for Solr, dealing with Indian languages. The aim is to have these
> work both for Unicode content/search terms, and for Indian
> languages transliterated into English. The latter is useful as
> many people, especially current computer users in India, find it
> more comfortable to type in transliterated English. These components
> use the standard Solr facilities, as well as established open-source
> spell-checking libraries like aspell, and the design goal includes
> fuzzy matches, such as between "Amitav" and "Amitabh", as there is
> often a fair amount of variance in English transliteration.
> 
> We see great potential for this as there is already a large amount
> of content in Indian language, and the government of India is
> putting in huge amounts of effort into generating more content.
> Please do let me know if this sounds interesting as a case study.
> 
> Regards,
> Gora



RE: Rounding dates on sort and filter

2010-01-19 Thread Charlie Jackson
Good point. So it doesn't sound like there's a way to do this without
adding a new field or reindexing. Thanks anyway. 

- Charlie


-Original Message-
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com] 
Sent: Tuesday, January 19, 2010 2:04 PM
To: solr-user@lucene.apache.org
Subject: Re: Rounding dates on sort and filter

Charlie,

Query-time terms/tokens need to match what's in your index, and my guess
is that if you just altered query-time date field analysis, you'd get a
mismatch.  Easy enough to check through Solr Admin Analysis page.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
> From: Charlie Jackson 
> To: solr-user@lucene.apache.org
> Sent: Tue, January 19, 2010 1:20:02 PM
> Subject: Rounding dates on sort and filter
> 
> I've got a legacy date field that I'd like to round for sorting and
> filtering. Right now, the index is large enough that sorting or
> filtering on a date field takes 10-20 seconds (unless it's cached). I
> know this is because the date field's precision is down to the
> millisecond, and I don't really need that level of precision for most
of
> my searches. So, is it possible to round my field at query time
without
> having to reindex the field or add a second one? 
> 
> 
> 
> I already tried the function sorting in 1.5-dev, but my field isn't a
> TrieDate field so I can't use the ms() function (which seems to allow
> date math unlike the other functions). 
> 
> 
> 
> Thanks,
> 
> Charlie



Re: Google Commerce Search

2010-01-19 Thread wojtekpia

While Solr is functionally platform independent, I have seen much better
performance on Linux than Windows under high load (related to SOLR-465). 


MitchK wrote:
> 
> As you know, Solr is fully written in Java and Java is still
> plattform-independent. ;)
> Learn more about Solr on http://www.lucene.apache.org/solr
> 
> 
> mrbelvedr wrote:
>> 
>> That sounds great. Could it also run on Windows?  I am interested in
>> hiring an experienced Solr freelancer to help us set up Solr on Windows
>> and configure it to index our products. If anybody is interested please
>> email tmil...@ktait.com
>> 
>> Thank you!
>> 
> 

-- 
View this message in context: 
http://old.nabble.com/Google-Commerce-Search-tp27197509p27232545.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: TermsComponent, multiple fields, total count

2010-01-19 Thread Otis Gospodnetic
Hi Lukas,

 
- Original Message 

> From: Lukas Kahwe Smith 

> I want to use TermsComponent for both auto complete suggestions but also 
> showing 

Is TermsComponent really that good for AutoComplete?
Have a look at http://www.sematext.com/demo/ac/index.html - doesn't use TC.

> a search "quality" meter. As in indicate the total number of matches (doesnt 
> need to be accurate, just a ballpark figure especially if there are a lot of 
> matches)

As in, you want each suggestion to include the number of documents it would 
match if that suggestion were run as the query?
Wouldn't that require one to execute that query, so if you want to show 10 
suggestions, you'd hit Solr 10 times?

> I also want to match multiple fields at once.

Can you give an example?

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

> I guess I can just issue multiple requests in order to get multiple fields 
> searched. But the total number is a bit more tricky. I can of course simply 
> add 
> up the counts for the limited number of results. But this is maybe a bit too 
> inaccurate and also seems like Lucene/Solr should be able to give me this 
> number 
> more efficiently.
> 
> regards,
> Lukas Kahwe Smith
> m...@pooteeweet.org



Please help: Failing tests

2010-01-19 Thread Siv Anette Fjellkårstad

I'm trying to run the unit tests from Eclipse. Almost half the tests are 
failing, and I don't know what I'm doing wrong. This is what I've done:

1. Checked out the code outside Eclipse's workspace
2. File > New > Project > Java project.
3. "Create project from existing source"

4. Five compiler errors. Fixed in this way:
Properties > Java Build Path > order and Export
Moved “JRE System Library” to the top

5. Run As > Run Configuration > Arguments > VM Arguments: -Dsolr.solr.home=

One of the error messages:

org.apache.solr.common.SolrException: QueryElevationComponent missing config 
file: 'elevate.xml
either: C:\data\solr\release-1.4.0\conf\elevate.xml or 
C:\DOCUME~1\saf\LOCALS~1\Temp\org.apache.solr.DisMaxRequestHandlerTest-1263934397062\elevate.xml

When I add one conf-directory to the build path, another one is still missing. 
What have I done wrong?

Please help me. I need the tests to run for a presentation in a few days, and 
I'm really struggling.

Kind regards,
Siv




-Opprinnelig melding-
Fra: Amit Nithian [mailto:anith...@gmail.com]
Sendt: ti 19.01.2010 19:51
Til: solr-user@lucene.apache.org
Emne: Re: build path

If you are going to run the unit tests in Eclipse, then for the given JUnit
run configuration, add the -Dsolr.solr.home= to the VM
arguments section of the run configuration for the given test.

On Tue, Jan 19, 2010 at 12:34 AM, Wangsheng Mei  wrote:

> I think it is.
>
> solr has a default servlet container "jetty" with the downloaded package
> under folder "example" .
> but I use tomcat a lot, so I deployed solr on tomcat using solr.war.
>
> I don't know why solr will use jetty as default.
>
> 2010/1/19 Siv Anette Fjellkårstad 
>
> > I apologize for the newbie questions :|
> > Do I need a servlet container to run the tests?
> >
> > Kind regards,
> > Siv
> >
> >
> > 
> >
> > Fra: Wangsheng Mei [mailto:hairr...@gmail.com]
> > Sendt: ti 19.01.2010 08:49
> > Til: solr-user@lucene.apache.org
> > Emne: Re: build path
> >
> >
> >
> > maybe you should add "-Dsolr.solr.home=" to your
> JAVA_OPTS
> > before your servlet container starts.
> >
> >
> > 2010/1/19 Siv Anette Fjellkårstad 
> >
> > > Hi!
> > > I try to run the tests of Solr 1.4 in Eclipse, but most of them
> fail.
> > > The error messages indicate that I miss some config files in my build
> > path.
> > > Is there any documentation of how to get Solr up and running in
> Eclipse?
> > If
> > > not; How did you set up (build path for) Solr in Eclipse?
> > >
> > > Another question; Some of the tests also fail when I run ant test. Is
> > that
> > > normal?
> > >
> > > Sincerely,
> > > Siv
> > >
> > >
> > > This email originates from Steria AS, Biskop Gunnerus' gate 14a, N-0051
> > > OSLO, http://www.steria.no  . This email and
> any
> > attachments may contain
> > > confidential/intellectual property/copyright information and is only
> for
> > the
> > > use of the addressee(s). You are prohibited from copying, forwarding,
> > > disclosing, saving or otherwise using it in any way if you are not the
> > > addressee(s) or responsible for delivery. If you receive this email by
> > > mistake, please advise the sender and cancel it immediately. Steria may
> > > monitor the content of emails within its network to ensure compliance
> > with
> > > its policies and procedures. Any email is susceptible to alteration and
> > its
> > > integrity cannot be assured. Steria shall not be liable if the message
> is
> > > altered, modified, falsified, or even edited.
> > >
> >
> >
> >
> > --
> > 梅旺生
> >
> >
> >
> > This email originates from Steria AS, Biskop Gunnerus' gate 14a, N-0051
> > OSLO, http://www.steria.no. This email and any attachments may contain
> > confidential/intellectual property/copyright information and is only for
> the
> > use of the addressee(s). You are prohibited from copying, forwarding,
> > disclosing, saving or otherwise using it in any way if you are not the
> > addressee(s) or responsible for delivery. If you receive this email by
> > mistake, please advise the sender and cancel it immediately. Steria may
> > monitor the content of emails within its network to ensure compliance
> with
> > its policies and procedures. Any email is susceptible to alteration and
> its
> > integrity cannot be assured. Steria shall not be liable if the message is
> > altered, modified, falsified, or even edited.
> >
>
>
>
> --
> 梅旺生
>



Re: Google Commerce Search

2010-01-19 Thread Mohamed Parvez
From the Solr Wiki about query-time synonym expansion:
"...synonyms containing multiple words... The recommended approach for dealing
with synonyms like this, is to expand the synonym when indexing. This is
because there are two potential issues that can arise at query time"
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

So we are stuck with query-time synonyms.
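
For illustration, the kind of synonyms.txt entries involved (hypothetical
lines; the multi-word mapping is the case the wiki says to expand at index
time):

    # single-word mapping: works at query time
    TV, television
    # multi-word mapping: expand at index time per the wiki
    sea biscuit, sea biscit, seabiscuit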

---
Thanks/Regards,
Parvez



On Tue, Jan 19, 2010 at 2:39 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> And what I recommended to my Fortune 1 client;) ...actually, just one
> correction:
>
>
> > Secondly you should know that, you can not update or push Synonyms at run
> > time.
>
> You can, if you are okay with query-time synonym expansion.  The new
> replication can be used to replicate not only indices, but also config
> files, including the synonyms file.  My guess is that this is the same with
> GSA and other search vendors' solutions.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
>
>
>
> - Original Message 
> > From: Mohamed Parvez 
> > To: solr-user@lucene.apache.org
> > Sent: Sun, January 17, 2010 1:35:17 PM
> > Subject: Re: Google Commerce Search
> >
> > I was in your same shoes as yours. And did recommend and implement Solr
> to
> > my Fortune 10 client. Solr is a great solution and does meet most of the
> > requirements and  lacks very few things.
> >
> > In your case, I think you should know that Solr does handle Synonyms very
> > well as long as they are single word to single word mappings.
> > Even cases with many to one and one to many mappings. But when it comes
> to
> > multi word to multi word mappings, it still works, but you need to do
> > some tweaking.
> >
> > Secondly you should know that, you can not update or push Synonyms at run
> > time.
> >
> > Nevertheless, Solr just works out of the box on Windows and Linux, and I have
> > tried it with various servers like Tomcat, Jetty, Weblogic etc. It works
> > like a champ and I have had 100% uptime since we went to production.
> >
> > -
> > Thanks/Regards,
> > Parvez
> >
> >
> >
> > On Sun, Jan 17, 2010 at 4:30 AM, mrbelvedr wrote:
> >
> > >
> > > Our customer is a Fortune 5 big time company. They have millions of
> > > vendors/products they work with daily. They have budget for whatever we
> > > recommend but we like to use open source if it is a great alternative
> to
> > > Google Search Appliance or Google Commerce Search.
> > >
> > > Google has recently introduced "Google Commerce Search" which allows
> > > ecommerce merchants to have their products indexed by Google and
> shoppers
> > > may search for products easily.
> > >
> > > Here is the URL of their new offering:
> > >
> > >
> > >
> >
> http://www.google.com/commercesearch/#utm_source=en-et-na-us-merchants&utm_medium=et&utm_campaign=merchants
> > >
> > > Obviously this is a great solution. It offers all the great things like
> > > spell checking, product synonyms, etc.  Is Solr able to do these
> features:
> > >
> > > * Index our MS Sql Server 2008 product table
> > >
> > > * Spell check for product brand names - user enters brand "sharpee" and
> the
> > > search engine will reply "Did you mean 'Sharpie'? "
> > >
> > > * We have 2 million products stored in our MS Sql Server 2008, will
> Solr
> > > handle that many products and give fast search results?
> > >
> > > Please advise if Solr will work as well as Google product?
> > >
> > > Thx!
> > > --
> > > View this message in context:
> > > http://old.nabble.com/Google-Commerce-Search-tp27197509p27197509.html
> > > Sent from the Solr - User mailing list archive at Nabble.com.
> > >
> > >
>
>


Re: TermsComponent, multiple fields, total count

2010-01-19 Thread Lukas Kahwe Smith

On 19.01.2010, at 21:55, Otis Gospodnetic wrote:

> Hi Lukas,
> 
> 
> - Original Message 
> 
>> From: Lukas Kahwe Smith 
> 
>> I want to use TermsComponent for both auto complete suggestions but also 
>> showing 
> 
> Is TermsComponent really that good for AutoComplete?
> Have a look at http://www.sematext.com/demo/ac/index.html - doesn't use TC.

will check it out.

>> a search "quality" meter. As in indicate the total number of matches (doesnt 
>> need to be accurate, just a ballpark figure especially if there are a lot of 
>> matches)
> 
> As in, you want each suggestion include the number of documents it would 
> match if that suggestion would be run as the query?
> Wouldn't that require one to execute that query, so if you want to show 10 
> suggestions, you'd hit Solr 10 times?

Hmm actually now that you ask, I guess what I want makes no sense.

If I type in "ver" and get various terms which start with "ver" obviously if I 
submit that search unless something is actually indexes as just "ver" there 
will obviously be no match at all.

Let me briefly explain where I am coming from.
We have a search field and above it is the number of total entities in the db.
Now as people are typing in search terms we want to give them an indication of 
how many results they can expect if they submit now.
But this UI concept was made by the UI team and was obviously inspired by a 
more RDBMS-like LIKE "foo%" search, which I guess could be implemented in Solr 
as well, but the question is whether it makes sense.

So I guess if I do use TC then it makes more sense to display a list of all 
autocomplete terms and their respective totals. If anything, I should update 
the number above as the person moves their focus to one of the autocomplete 
options.

>> I also want to match multiple fields at once.
> 
> Can you give an example?


I enter "Kreuz" but this could either be part of a persons name or of a street 
name, which are separate fields in my index mainly because they analyzed 
differently (person name using doublemetaphone and street name using word 
splitting to extract relevant parts for better matching).

regards,
Lukas Kahwe Smith
m...@pooteeweet.org





Re: complex query

2010-01-19 Thread Chris Hostetter

: I have considered building lucene index like:
: Document:  { title, content, author, friends }

That seems like the right approach.

: Thus, author and friends are two separate fields, so I can boost them
: separately.
: The problem is, if a document's author is the logged-in user, it's unnecessary
: to search the friends field, because it would not appear in that

That's like saying "if a document doesn't match my query, there's no 
point in checking if it matches my query" ... you don't know if 
something matches until you query against it, and you don't know if a doc 
is written by the logged-in user, or written by a friend of the logged-in 
user, unless you query against those criteria.
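
That is, the query has to cover both fields at once, e.g. something like
this (with a hypothetical user "bob" and arbitrary boosts):

    q=content:lucene AND (author:bob^2 OR friends:bob)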


-Hoss


Large Query Strings and performance

2010-01-19 Thread ldung

I am using Solr 1.4 with large query strings with 20+ terms and faceting on
a single multi-valued field in a 1 million record system. I am using Solr to
categorize text; that's why the query strings are big.

The performance gets worse the more search terms are used.  Is there any
way I can improve performance? I've tried shingling but it had no
effect, and I've tried everything in here:
http://wiki.apache.org/solr/SolrPerformanceFactors

Is there anything else I can try? Will sharding help?
-- 
View this message in context: 
http://old.nabble.com/Large-Query-Strings-and-performance-tp27233477p27233477.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Restricting Facet to FilterQuery in combination with mincount

2010-01-19 Thread Chris Hostetter

: Now, I was wondering whether it is possible to find that out. It would allow
: to show 0 counts of values that are produced by the query (q), and at the same
: time exclude all facet values that are already excluded by the filter query.
: 
: Applying facetting to a subset (subselect / filterset) of the index not to
: everything - that might describe it, as well.

you can "tag" a filter query so that face.tfield knows to ignore that fq 
when computing the constraint counts...

http://wiki.apache.org/solr/SimpleFacetParameters#LocalParams_for_faceting
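
The syntax looks roughly like this (a sketch, with a hypothetical tag name
"tp"):

    q=area:water&fq={!tag=tp}type:mammal&facet.field={!ex=tp}name&facet.mincount=0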

...but I'm pretty sure that still won't give you what you are looking for. 
In your "mammal" example it would just mean that the counts for your 
"name" facet would ignore the "fq=type:mammal" restriction and be based 
purely on the main q=area:water query ... so instead of "excluding" 
salmon(0) from the results, and leaving lion(0) and dog(0), you would 
presumably start getting a positive count for "salmon", but lion and dog 
still wouldn't match.

: > > q=area:water&fq=type:mammal&facet.field=name&facet.mincount=0
: > > 
: > > would return something like
: > > dolphin (20)
: > > blue whale (20)
: > > salmon (0) <= not covered by filter query
: > > lion (0)
: > > dog (0)

...even if you swapped the fq and q (which would alter your scores 
drastically), what tagging and excluding changes is the *counts* associated 
with a facet value -- there is no way to get "some zeros" to show while 
"other zeros" don't.

Typically the driving force behind something like this is a hierarchical 
taxonomy -- your animal example fitting nicely.  In those cases, you can 
make your facets use the full hierarchy (ie: mammal/lion, mammal/dog, 
fish/salmon instead of just lion/dog/salmon) and you can use facet.prefix 
to get the type of behavior you are talking about.


-Hoss



Unstemming after solr.PorterStemFilterFactory

2010-01-19 Thread Bogdan Vatkov
Hi,

I am indexing with the solr.PorterStemFilterFactory included, but then I need
to access the unstemmed versions of the terms. What would be the easiest way
to get the unstemmed versions?
Thanks in advance.

Best regards,
Bogdan




-- 
Best regards,
Bogdan


Re: Number of values limitation in multivalued field

2010-01-19 Thread Erick Erickson
As far as I know, there's no underlying difference between
adding all 42K tokens one at a time (multivalued)
or all at once (singlevalued), with one rather technical
difference: If you've changed the positionIncrementGap
to something other than "1" in your schema, then the
token offsets delta between successive value adds will be
something other than one. Put another way, there's
no difference if you leave positionIncrementGap="1".
And even that doesn't matter if you're not doing
proximity queries on that field.

You could even batch them up in chunks. I.e.
zip1 zip2 zip3
zip4 zip5 zip6
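
A minimal SolrJ sketch of both approaches (hypothetical field name "zip"
and made-up values):

    String[] zipCodes = {"10001", "10002", "10003"};
    SolrInputDocument doc = new SolrInputDocument();
    for (String zip : zipCodes) {
        doc.addField("zip", zip);      // one value at a time (multivalued add)
    }
    // ...or batched, several codes per value; the field's tokenizer
    // splits them back into individual tokens:
    doc.addField("zip", "10004 10005 10006");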

You're only talking 2.5M tokens or so, right? I predict
you'll never notice the data duplication etc. I'd guess that
it's too small of a data set to worry about...

HTH
Erick

On Tue, Jan 19, 2010 at 3:15 PM, SHS SOLR  wrote:

> Thanks Erik,
>
> I was not aware of the maxFieldLength.
>
> * Query performance compared to storing data by zipcode. Schema to
> accommodate this would have 42K * 60 documents approx. Also to consider
> duplicate document data with varying zipcode in the index.
>
> Hope this makes sense. We however wanted to understand if it is a good
> practice to dump 42K tokens in a multivalued field.
>
> Thanks,
> Pavan.
>
> On Tue, Jan 19, 2010 at 1:56 PM, Erick Erickson  >wrote:
>
> > You should be able to do this no problem. Do be aware of the
> > maxfieldlength though, it defaults to 10,000 tokens but you
> > can change it in your schema.xml. Beware, there are TWO
> > instances of this in the schema file. See:
> >
> >
> http://search.lucidimagination.com/search/document/30616a061f8c4bf6/solr_ignoring_maxfieldlength
> >
> > What do you mean by index/search performance impact? As
> > compared to what?
> >
> > I think the impacts will be negligible when compared to putting all
> > the zip codes into the field at once, and search time should be
> > unaffected over that alternative.
> >
> > HTH
> > Erick
> >
> > On Tue, Jan 19, 2010 at 12:11 PM, SHS SOLR  wrote:
> >
> > > * Can we define a field in our schema as multiValued (with
> > > stored=false, indexed=true) that will hold up to 42K zipcode values
> > > associated with each document?
> > > * Is there any query-time performance impact from this?
> > > * Is there any impact on index time?
> > >
> > > The number of documents we are talking here is not more than 100 right
> > now.
> > > There is no requirement to facet or highlight or even show this field
> in
> > > the
> > > search results. We only want to enable zipcode searches that would
> return
> > > matching docs.
> > >
> > > Thanks,
> > >
> >
>


Re: Unstemming after solr.PorterStemFilterFactory

2010-01-19 Thread Otis Gospodnetic
Bogdan,

You can get them from stored values of your fields, if you are storing them.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



- Original Message 
> From: Bogdan Vatkov 
> To: solr-user@lucene.apache.org
> Sent: Tue, January 19, 2010 5:28:51 PM
> Subject: Unstemming after solr.PorterStemFilterFactory
> 
> Hi,
> 
> I am indexing with the solr.PorterStemFilterFactory included, but then I
> need to access the unstemmed versions of the terms. What would be the
> easiest way to get the unstemmed version?
> Thanks in advance.
> 
> Best regards,
> Bogdan
> 
> 
> 
> 
> -- 
> Best regards,
> Bogdan



Re: Unstemming after solr.PorterStemFilterFactory

2010-01-19 Thread Bogdan Vatkov
I am using fields like:
  <field name="..." ... stored="true"/>
which contain multi-line text, not just single strings. What does "stored
values" mean?
I am relatively new to Solr.

I solved my issue by copy/pasting and enhancing
the SnowballPorterFilterFactory class, creating
SnowballPorterWithUnstemLowerCaseFilterFactory.
I added lowercasing inside the factory since I need to capture the original
terms, store them in a side file, and only then lowercase and stem.

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="..."/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter
      class="org.bogdan.solr.analysis.SnowballPorterWithUnstemLowerCaseFilterFactory"
      language="English" protected="protwords.txt" unstemmed="unstemmed.txt"/>
  </analyzer>
</fieldType>

I was wondering if there is an easier way (without doing this custom filter
that I did).

Best regards,
Bogdan

On Wed, Jan 20, 2010 at 12:38 AM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Bogdan,
>
> You can get them from stored values of your fields, if you are storing
> them.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
>
>
>
> - Original Message 
> > From: Bogdan Vatkov 
> > To: solr-user@lucene.apache.org
> > Sent: Tue, January 19, 2010 5:28:51 PM
> > Subject: Unstemming after solr.PorterStemFilterFactory
> >
> > Hi,
> >
> > I am indexing with the solr.PorterStemFilterFactory included, but then I
> > need to access the unstemmed versions of the terms. What would be the
> > easiest way to get the unstemmed version?
> > Thanks in advance.
> >
> > Best regards,
> > Bogdan
> >
> >
> >
> >
> > --
> > Best regards,
> > Bogdan
>
>


-- 
Best regards,
Bogdan


Extracting URLs while indexing

2010-01-19 Thread Bogdan Vatkov
Hi,

I want to extract URLs (http://..., as well as file://... or even //.)
while pushing documents into Solr.
Is it possible with the Filters/Analyzers available nowadays?
I looked into the doc but could not find anything related to it.

Best regards,
Bogdan


solr blocking on commit

2010-01-19 Thread Steve Conover
I'm using latest solr 1.4 with java 1.6 on linux.  I have a 3M
document index that's 10+GB.  We currently give solr 12GB of ram to
play in and our machine has 32GB total.

We're seeing a problem where solr blocks during commit - it won't
serve /select requests - in some cases for more than 15-30 seconds.
We'd like to somehow configure things such that there's no
interruption in /select service.

I've tried tweaking many parts of solrconfig (based on searches of
this mailing list): playing with the lock, with autowarming (including
shutting it off), max searchers, etc.  I've tried asynchronous commits
(waitSearcher=false and waitFlush=false).  The problem won't go away.
I run a fair amount of /select load at solr and even when it doesn't
take 20 seconds to commit, I definitely see pauses in the log when
commits are in progress.

I have this sense that I'm just missing something obvious (?)

Regards,
Steve


Re: solr blocking on commit

2010-01-19 Thread Yonik Seeley
On Tue, Jan 19, 2010 at 5:57 PM, Steve Conover  wrote:
> I'm using latest solr 1.4 with java 1.6 on linux.  I have a 3M
> document index that's 10+GB.  We currently give solr 12GB of ram to
> play in and our machine has 32GB total.
>
> We're seeing a problem where solr blocks during commit - it won't
> serve /select requests - in some cases for more than 15-30 seconds.
> We'd like to somehow configure things such that there's no
> interruption in /select service.

A commit shouldn't cause searches to block.
Could this perhaps be a stop-the-world GC pause that coincides with the commit?

-Yonik
http://www.lucidimagination.com


Re: termsComponent and filter queries

2010-01-19 Thread Yonik Seeley
You may be able to use faceting for this.
Use facet.method=enum - it will be more efficient for this specific use.

The main problem is that you can't specify a start term for faceting
though (you can only use numeric offset / limit into the list).

To do more will require either adding some terms component features to
faceting, or faceting features to terms component.
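
A hedged example of the faceting approach (field names invented):

    q=*:*&fq=otherfield:someval&rows=0
      &facet=true&facet.field=myfield&facet.method=enum
      &facet.mincount=1&facet.sort=index
      &facet.offset=1000&facet.limit=3

facet.mincount=1 drops terms that no longer match once the filter is
applied; the downside, as noted above, is that you page by numeric offset
rather than by a start term.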

-Yonik
http://www.lucidimagination.com

On Tue, Jan 19, 2010 at 3:14 PM, Naomi Dushay  wrote:
> I have a field that has millions of values, and I need to get "the next X
> values" in alpha order.  The terms component works fabulously for this.
>
> Here is a cooked up example of the terms
>
> a
> b
> f
> q
> r
> rr
> rrr
> y
> z
> zzz
>
> So if I ask for the 3 terms after "r", I get "rr", "rrr" and "y".
>
> But now I'd like to apply a filter query on a different field.  After the
> filter, my terms might be:
>
> b
> q
> r
> y
> z
> zzz
>
> So the 3 terms after "r", given the filter, become  "y" "z" and "zzz"
>
> Given that I have millions of terms, and they are not predictable for range
> queries ... how can I get
>
> "the next X values" of my field
> after one or more filters are applied?
>
> - Naomi
>


Re: Unstemming after solr.PorterStemFilterFactory

2010-01-19 Thread Erick Erickson
This is completely unnecessary. Fields can be both indexed and
stored, and the operations are orthogonal.

That is, when you specify that a field is indexed, it is run through
an analyzer and the *tokens* are indexed, after any
stemming, casing, etc.

Stored means that the original value, before any analysis
whatsoever, is put in a completely separate location.
It's only there for retrieval and display to the user. It's as if
a copy of the original text was put into one place, and the
tokens were put in another.

Consider the problem of book titles. If I have a title "The Old
Man and the Sea", I want to display that title as a result of
searching for "old sea man". Rather than force the separate
storage to be done programmatically, SOLR allows you to
specify these two options. So if I specify indexing and storing,
the tokens "old" "man" "sea" (assuming lowercasing,
stopword removal, etc) are added to the searchable index.
"The Old Man and the Sea" is copied somewhere else, and
when you ask for the *value* of the field, you get "The Old Man
and the Sea". This stored part of the index is never searched, it
is solely there for retrieval/display.
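
In schema.xml terms that is just, e.g. (field name illustrative):

    <field name="title" type="text" indexed="true" stored="true"/>

Queries run against the analyzed tokens; the fl parameter returns the
verbatim stored value.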

I'd really get a copy of the book, it'll save you lots of time and
effort.

HTH
Erick

On Tue, Jan 19, 2010 at 5:45 PM, Bogdan Vatkov wrote:

> I am using fields like:
>   <field name="..." ... stored="true"/>
> which contain multi-line text, not just single strings. What does "stored
> values" mean?
> I am relatively new to Solr.
>
> I solved my issue by copy/pasting and enhancing
> the SnowballPorterFilterFactory class, creating
> SnowballPorterWithUnstemLowerCaseFilterFactory.
> I added lowercasing inside the factory since I need to capture the
> original terms, store them in a side file, and only then lowercase and
> stem.
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="..."/>
>     <filter class="solr.StopFilterFactory"
>             ignoreCase="true"
>             words="stopwords.txt"
>             enablePositionIncrements="true"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1" generateNumberParts="1" catenateWords="1"
>             catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
>     <filter
>       class="org.bogdan.solr.analysis.SnowballPorterWithUnstemLowerCaseFilterFactory"
>       language="English" protected="protwords.txt" unstemmed="unstemmed.txt"/>
>   </analyzer>
> </fieldType>
>
> I was wondering if there is an easier way (without doing this custom filter
> that I did).
>
> Best regards,
> Bogdan
>
> On Wed, Jan 20, 2010 at 12:38 AM, Otis Gospodnetic <
> otis_gospodne...@yahoo.com> wrote:
>
> > Bogdan,
> >
> > You can get them from stored values of your fields, if you are storing
> > them.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> >
> >
> >
> > - Original Message 
> > > From: Bogdan Vatkov 
> > > To: solr-user@lucene.apache.org
> > > Sent: Tue, January 19, 2010 5:28:51 PM
> > > Subject: Unstemming after solr.PorterStemFilterFactory
> > >
> > > Hi,
> > >
> > > I am indexing with the solr.PorterStemFilterFactory included, but then
> > > I need to access the unstemmed versions of the terms. What would be the
> > > easiest way to get the unstemmed version?
> > > Thanks in advance.
> > >
> > > Best regards,
> > > Bogdan
> > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > > Bogdan
> >
> >
>
>
> --
> Best regards,
> Bogdan
>


Re: Problem indexing files

2010-01-19 Thread Chris Hostetter

I'm not sure if I fully understand what you mean, but is it possible that 
"contid" is your uniqueKey field? ... if so you are probably adding two 
documents, both with the same ID, so the second overwrites the first.

If I'm mistaken, could you elaborate a bit more -- provide your 
configs/schema and add some more asserts with a comment explaining which 
assertions fail.
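
If that is the case, something along these lines (a sketch; the ids are
invented) should index both files as two separate documents:

    ContentStreamUpdateRequest up1 = new ContentStreamUpdateRequest("/update/extract");
    up1.addFile(new File("d:/temp/test.txt"));
    up1.setParam("literal.contid", "doc1");
    server.request(up1);

    ContentStreamUpdateRequest up2 = new ContentStreamUpdateRequest("/update/extract");
    up2.addFile(new File("d:/temp/test2.txt"));
    up2.setParam("literal.contid", "doc2");  // distinct uniqueKey, no overwrite
    up2.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
    server.request(up2);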

: Hi all,
: 
: I'm trying to add multiple files to solr 1.4 with solrj.
: With this programm 1 Doc is added to solr:
: 
:   SolrServer server = SolrHelper.getServer();
:   server.deleteByQuery( "*:*" );// delete everything!
:   server.commit();
:   QueryResponse rsp = server.query( new SolrQuery( "*:*") );
:   Assert.assertEquals( 0, rsp.getResults().getNumFound() );
: 
:   ContentStreamUpdateRequest up = new
: ContentStreamUpdateRequest("/update/extract");
:   up.addFile(new File("d:/temp/test.txt"));
:   //up.addFile(new File("d:/temp/test2.txt")); //<-- Nothing
: added if removing the comment from this line.
:   up.setParam("literal.contid", "doc1");
:   up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
:   NamedList result = server.request(up);
:   UpdateResponse test = server.commit();
: 
: But no doc is added, if i remove the comment tag from the second addFile.
: What's wrong with this?
: 
: 
: Thanks,
: 
: Thomas
: 



-Hoss



Re: Extracting URLs while indexing

2010-01-19 Thread Erick Erickson
Do you mean you want the URLs to be extracted on the client?
If so, no. Filters/analyzers reside on the server, not the client.
You'll have to do it with custom code.
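
A minimal client-side sketch (the regex is illustrative, not exhaustive):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // pull URLs out of a document body before posting it to Solr
    static List<String> extractUrls(String text) {
        Pattern p = Pattern.compile("(?:https?|file)://\\S+");
        Matcher m = p.matcher(text);
        List<String> urls = new ArrayList<String>();
        while (m.find()) {
            urls.add(m.group());  // e.g. add these to a multivalued "url" field
        }
        return urls;
    }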

Erick

On Tue, Jan 19, 2010 at 5:48 PM, Bogdan Vatkov wrote:

> Hi,
>
> I want to extract URLs (http://..., as well as file://... or even //.)
> while pushing documents into Solr.
> Is it possible with the Filters/Analyzers available nowadays?
> I looked into the doc but could not find anything related to it.
>
> Best regards,
> Bogdan
>


Re: Design Question - Dynamic Field Names (*)

2010-01-19 Thread Kumaravel Kandasami
First, thanks for the response.

Yes, most likely we want to optimize down to one index. I think it is
possible that, coming from the RDBMS world, we might be over-complicating
the solution.

*Requirement:*
- We are indexing CSV files and generating field names dynamically from the
"header" line. Users should be able to *list all the possible header names*
(i.e. dynamic field names) and filter results based on some of the field
names.

- Also, list *all possible values* associated with a given field name.




Kumar_/|\_
www.saisk.com
ku...@saisk.com
"making a profound difference with knowledge and creativity..."


On Tue, Jan 19, 2010 at 1:33 PM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> On Sat, Jan 16, 2010 at 3:33 AM, Kumaravel Kandasami <
> kumaravel.kandas...@gmail.com> wrote:
>
> > Need to your suggestion in  best designing the following requirement.
> >
> > - We have two indexes.
> > Index 1: "name_index",
> > Fields:
> > "id" - indexed, not stored
> >  "field_name" - indexed, stored.
> >
> > Index 2: "trans_index',
> > Fields(Dynamic Schema):
> > "id" - indexed, not stored
> > "*" - indexed, stored.
> >
> > (Dynamic field names of the trans_index is the same as the "field_name"
> > from
> > the name_index.)
> >
> > - Requirement:
> >
> > User would select the field he wants to query from the "name_index".
> > Once he selects one of the values from the 'field_name' (from the
> > name_index), he queries the trans_index using the field_name.
> >
> >
> > - Issue:
> >
> > When indexing the name_index field:"field_name" we are using the analyzer
> > that would lowercase, strip spaces etc.
> > Example: "First Name", "firstName" values are all stored and indexed as
> > 'firstname'.
> >
> > However, when we store field names in the trans_index, we store them
> > as-is, without analyzing.
> > So user queries like 'firstname:a*' might not match.
> >
> > - Possible Solution:
> >
> > We are planning to have a custom analyzer that we would use while
> > indexing (configured in the schema.xml file). The crawler program would
> > use the same analyzer to create field names.
> >
> > Are there any better design solutions?
> >
> >
> Your scenario sounds quite strange and it is still not clear why you are
> doing all this. Perhaps the solution doesn't even require two indexes? Can
> you describe the actual problem so that we can be of more help?
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: solr blocking on commit

2010-01-19 Thread Steve Conover
I'll play with the GC settings and watch memory usage (I've done a
little bit of this already), but I have a sense that this isn't the
problem.

I should also note that in order to create the really long pauses I
need to post xml files full of documents that haven't been added in a
long time / ever.  Once a set of documents is posted to /update, if I
re-post it solr behaves pretty well - and that's true even if I
restart solr.

On Tue, Jan 19, 2010 at 3:05 PM, Yonik Seeley
 wrote:
> On Tue, Jan 19, 2010 at 5:57 PM, Steve Conover  wrote:
>> I'm using latest solr 1.4 with java 1.6 on linux.  I have a 3M
>> document index that's 10+GB.  We currently give solr 12GB of ram to
>> play in and our machine has 32GB total.
>>
>> We're seeing a problem where solr blocks during commit - it won't
>> serve /select requests - in some cases for more than 15-30 seconds.
>> We'd like to somehow configure things such that there's no
>> interruption in /select service.
>
> A commit shouldn't cause searches to block.
> Could this perhaps be a stop-the-world GC pause that coincides with the commit?
>
> -Yonik
> http://www.lucidimagination.com
>


Re: solr blocking on commit

2010-01-19 Thread Jay Hill
A couple of follow up questions:

- What type of garbage collector is in use?
- How often are you optimizing the index?
- In solrconfig.xml what is the setting for ?
- Right before and after you see this pause, check the output of
http://<host>:<port>/solr/admin/system, specifically the memory stats, and
send this to the list.

If possible definitely watch memory usage with something like JConsole, or
start the JVM with some of these params:
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
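
For example, a startup line along these lines (heap sizes and jar name are
just placeholders):

    java -Xms12g -Xmx12g -XX:+UseConcMarkSweepGC \
         -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
         -jar start.jar

On a heap this size, -XX:+UseConcMarkSweepGC is one collector worth trying
to shorten pauses.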

-Jay

On Tue, Jan 19, 2010 at 5:16 PM, Steve Conover  wrote:

> I'll play with the GC settings and watch memory usage (I've done a
> little bit of this already), but I have a sense that this isn't the
> problem.
>
> I should also note that in order to create the really long pauses I
> need to post xml files full of documents that haven't been added in a
> long time / ever.  Once a set of documents is posted to /update, if I
> re-post it solr behaves pretty well - and that's true even if I
> restart solr.
>
> On Tue, Jan 19, 2010 at 3:05 PM, Yonik Seeley
>  wrote:
> > On Tue, Jan 19, 2010 at 5:57 PM, Steve Conover 
> wrote:
> >> I'm using latest solr 1.4 with java 1.6 on linux.  I have a 3M
> >> document index that's 10+GB.  We currently give solr 12GB of ram to
> >> play in and our machine has 32GB total.
> >>
> >> We're seeing a problem where solr blocks during commit - it won't
> >> serve /select requests - in some cases for more than 15-30 seconds.
> >> We'd like to somehow configure things such that there's no
> >> interruption in /select service.
> >
> > A commit shouldn't cause searches to block.
> > Could this perhaps be a stop-the-world GC pause that coincides with the
> commit?
> >
> > -Yonik
> > http://www.lucidimagination.com
> >
>


Re: TermsComponent, multiple fields, total count

2010-01-19 Thread Erik Hatcher


On Jan 19, 2010, at 3:55 PM, Otis Gospodnetic wrote:

> > a search "quality" meter. As in, indicate the total number of matches
> > (doesn't need to be accurate, just a ballpark figure, especially if
> > there are a lot of matches)
>
> As in, you want each suggestion to include the number of documents it
> would match if that suggestion were run as the query?
> Wouldn't that require one to execute that query, so if you want to
> show 10 suggestions, you'd hit Solr 10 times?

Not if you use faceting with the facet.prefix capability :)  It gives
back counts per term suggested.
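
For instance (field name invented):

    q=*:*&rows=0&facet=true&facet.field=suggest_text
      &facet.prefix=qual&facet.limit=10

Each term returned under the "qual" prefix carries the number of documents
it appears in - one request, ten counted suggestions.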


Erik



SOLR Performance Tuning: Fuzzy Searches, Distance, BK-Tree

2010-01-19 Thread Fuad Efendi
Hi,


I am wondering: will SOLR or Lucene use caches for fuzzy searches? I mean
per-term caching or something, internal to Lucene, or maybe SOLR (SOLR may
use its own query parser)...

Anyway, I implemented a BK-tree and am playing with it right now; I altered
the FuzzyTermEnum class of Lucene...
http://en.wikipedia.org/wiki/BK-tree

- it seems the performance of fuzzy searches is boosted at least a hundred
times, but I need to do more tests... repeated similar (slightly different)
queries run with better performance, probably because of OS-level file
caching... but it could be the BK-tree distance! (although I need to use a
classic int distance instead of the float distance used by
Lucene/Levenshtein etc.)
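
For anyone curious, the core of a BK-tree is tiny. A minimal sketch over an
int Levenshtein distance (just the idea, not the actual patch):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class BKTree {
        private final String term;
        // children keyed by edit distance from this node's term
        private final Map<Integer, BKTree> children =
                new HashMap<Integer, BKTree>();

        BKTree(String term) { this.term = term; }

        void add(String t) {
            int d = distance(term, t);
            if (d == 0) return;  // duplicate term, nothing to do
            BKTree child = children.get(d);
            if (child == null) children.put(d, new BKTree(t));
            else child.add(t);
        }

        // collect all terms within maxDist of the query
        void search(String q, int maxDist, List<String> out) {
            int d = distance(term, q);
            if (d <= maxDist) out.add(term);
            // triangle inequality: only children keyed in
            // [d - maxDist, d + maxDist] can contain matches
            for (int k = Math.max(0, d - maxDist); k <= d + maxDist; k++) {
                BKTree child = children.get(k);
                if (child != null) child.search(q, maxDist, out);
            }
        }

        // classic dynamic-programming Levenshtein distance
        static int distance(String a, String b) {
            int[][] dp = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) dp[i][0] = i;
            for (int j = 0; j <= b.length(); j++) dp[0][j] = j;
            for (int i = 1; i <= a.length(); i++)
                for (int j = 1; j <= b.length(); j++)
                    dp[i][j] = Math.min(
                        Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1),
                        dp[i - 1][j - 1]
                            + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
            return dp[a.length()][b.length()];
        }
    }

Most of the win comes from the search pruning: each node visit narrows the
candidate set to a band of child subtrees instead of scanning every term.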

Thanks,
Fuad Efendi
+1 416-993-2060
http://www.tokenizer.ca/
Data Mining, Vertical Search






Re: DIH delta import - last modified date

2010-01-19 Thread Wangsheng Mei
I run DIH on the same machine as the database, which avoids the problem.

2010/1/20 Yao Ge 

>
> I am struggling with the concept of delta import in DIH. According to the
> documentation, the delta import will automatically record the last index
> time stamp and make it available to use for the delta query. However, in
> many cases the last_modified time stamp in the database lags behind the
> current time, so the last index time stamp is not good for the delta
> query. Can I pick a different mechanism to generate "last_index_time",
> using a time stamp computed from the database (such as from a column of
> the database)?
> --
> View this message in context:
> http://old.nabble.com/DIH-delta-import---last-modified-date-tp27231449p27231449.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


-- 
梅旺生


Re: complex query

2010-01-19 Thread Wangsheng Mei
Thanks for your attention.

I think I was just a little over-worried about search performance.
Fortunately, Solr has worked pretty nicely so far; it's fast enough for me.

2010/1/20 Chris Hostetter 

>
> : I have considered building lucene index like:
> : Document:  { title, content, author, friends }
>
> That seems like the right approach.
>
> : Thus, author and friends are two separate fields, so I can boost them
> : separately.
> : The problem is, if a document's author is the logged-in user, it's
> : unnecessary to search the friends field, because it would not appear in
> : that field.
>
> That's like saying "if a document doesn't match my query, there's no
> point in checking if it matches my query" ... you don't know if
> something matches until you query against it, and you don't know if a doc
> is written by the logged-in user, or written by a friend of the logged-in
> user, unless you query against those criteria.
>
>
> -Hoss
>



-- 
梅旺生


Re: Data storage, and textual analysis

2010-01-19 Thread Gora Mohanty
On Tue, 19 Jan 2010 12:02:27 -0800 (PST)
Otis Gospodnetic  wrote:

> Gora,
> 
> What you are seeing are the *stored* values, which are the
> original, unchanged field values. Analysis is applied to text for
> *indexing* purposes.
[...]

Ah, of course. Seems obvious now, and I had misread this message
in the mailing thread here:
http://old.nabble.com/How-does-search-work-with-phonetic-filter-factory---td24643678.html

Thanks for the help.

Regards,
Gora


Re: DIH delta import - last modified date

2010-01-19 Thread Noble Paul നോബിള്‍ नोब्ळ्
While invoking the delta-import you may pass the value as a request
parameter. That value can be used in the query as ${dih.request.xyz},
where xyz is the request parameter name.
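
A sketch of what that could look like (parameter and column names are
invented):

    Invocation:
      http://localhost:8983/solr/dataimport?command=delta-import&since=2010-01-19T00:00:00

    In data-config.xml:
      deltaQuery="SELECT id FROM docs WHERE last_modified > '${dih.request.since}'"

The delta query then keys off a timestamp you control rather than the
recorded last_index_time.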

On Wed, Jan 20, 2010 at 1:15 AM, Yao Ge  wrote:
>
> I am struggling with the concept of delta import in DIH. According to the
> documentation, the delta import will automatically record the last index
> time stamp and make it available to use for the delta query. However, in
> many cases the last_modified time stamp in the database lags behind the
> current time, so the last index time stamp is not good for the delta
> query. Can I pick a different mechanism to generate "last_index_time",
> using a time stamp computed from the database (such as from a column of
> the database)?
> --
> View this message in context: 
> http://old.nabble.com/DIH-delta-import---last-modified-date-tp27231449p27231449.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
-
Noble Paul | Systems Architect| AOL | http://aol.com