date:20090903

On Mon, Aug 31, 2009 at 10:47 PM, jOhn  wrote:

> This is mostly my misunderstanding of catenateAll="1" as I thought it would
> break down with an OR using the full concatenated word.
>
> Thus:
>
> Jokers Wild -> { jokers, wild } OR { jokerswild }
>
> But really it becomes: { jokers, {wild, jokerswild}} which will not match.
>
> And if you have a mistyped camel case like:
>
> jOkerswild -> { j, {okerswild, jokerswild}} again no match.
>
>
Sorry for the late reply. You still haven't given the fieldtype definition
that you were using.

I tried:

And I tried indexing "Jokers Wild" which matches when I query for
"jOkerswild" and "jokerswild". Note that if you change the tokenizer to
WhiteSpaceTokenizer then such queries won't match.

-- 
Regards,
Shalin Shekhar Mangar.

Re: score = sum of boosts

On Thu, Sep 3, 2009 at 4:09 AM, Joe Calderon  wrote:

> hello *, what would be the best approach to return the sum of boosts
> as the score?
>
> ex:
> a dismax handler boosts matches to field1^100 and field2^50, a query
> matches both fields hence the score for that row would be 150
>
>
Not really. The tf-idf score would be multiplied by 100 for field1 and by 50
for field2. The score can be more than 150 if both fields match.


>
> is this something i could do with a function query or do i need to
> hack up DisjunctionMaxScorer ?
>
>
Can you give a little more background on what you want to achieve this way?

-- 
Regards,
Shalin Shekhar Mangar.

Re: Problem querying for a value with a "space"

On Thu, Sep 3, 2009 at 1:45 AM, Adam Allgaier  wrote:

>
>  omitNorms="true"/>
> ...
> 
>
> I am indexing the "specific_LIST_s" with the value "For Sale".
> The document indexes just fine.  A query returns the document with the
> proper value:
>For Sale
>
> However, when I try to query on that field
>+specific_LIST_s:For Sale
>+specific_LIST_s:For+Sale
>+specific_LIST_s:For%20Sale
>
> I get no results with any one of those three queries.
>
>
Use +specific_LIST_s:(For Sale)
or
+specific_LIST_s:"For Sale"

-- 
Regards,
Shalin Shekhar Mangar.

Re: Return 2 fields per facet.. name and id, for example? / facet value search

On Fri, Aug 28, 2009 at 12:57 AM, Rihaed Tan  wrote:

> Hi,
>
> I have a similar requirement to Matthew (from his post 2 years ago). Is
> this
> still the way to go in storing both the ID and name/value for facet values?
> I'm planning to use id#name format if this is still the case and doing a
> prefix query. I believe this is a common requirement so I'd appreciate if
> any of you guys can share what's the best way to do it.
>
> Also, I'm indexing the facet values for text search as well. Should the
> field declaration below suffice the requirement?
>
>  required="true" multiValued="true"/>
>

There have been talks of having a pair field type in Solr but there is no
patch yet. So I guess the way proposed by Yonik is a good solution.

-- 
Regards,
Shalin Shekhar Mangar.

Re: Field Collapsing (was Re: Schema for group/child entity setup)

The development on this patch is quite active. It works well for single 
solr instance, but distributed search (ie. shards) is not yet supported. 
Using this page you can group search results based on a specific field. 
There are two flavors of field collapsing - adjacent and non-adjacent, 
the former collapses only document which happen to be located next to 
each other in the otherwise-non-collapsed results set. The later (the 
non-adjacent) one collapses all documents with the same field value 
(regardless of their position in the otherwise-non-collapsed results 
set). Note, that non-adjacent performs better than adjacent one. There's 
currently discussion to extend this support so in addition to collapsing 
the documents, extra information will be returned for the collapsed 
documents (see the discussion on the issue page).


Uri

R. Tan wrote:

I think this is what I'm looking for. What is the status of this patch?

On Thu, Sep 3, 2009 at 12:00 PM, R. Tan  wrote:

  

Hi Solrers,
I would like to get your opinion on how to best approach a search
requirement that I have. The scenario is I have a set of business listings
that may be group into one parent business (such as 7-eleven having several
locations). On the results page, I only want 7-eleven to show up once but
also show how many locations matched the query (facet filtered by state, for
example) and maybe a preview of the some of the locations.

Searching for the business name is straightforward but the locations within
the a result is quite tricky. I can do the opposite, searching for the
locations and faceting on business names, but it will still basically be the
same thing and repeat results with the same business name.

Any advice?

Thanks,
R

Exact Word Search

2009-09-03 Thread bhaskar chandrasekar

Hi,
 
Can any one help me with the below scenario?.
 
Scenario :
 
I have integrated Solr with Carrot2.
The issue is 
Assuming i give "bhaskar" as input string for search.
It should give me search results pertaining to bhaskar only.
 Example: It should not display search results as "chandarbhaskar" or
 "bhaskarc".
 Basically search should happen based on the exact word match. I am not 
bothered about case sensitive here
 How to achieve the above Scenario in Carrot2 ?.
 
Regards
Bhaskar

Solr question

2009-09-03 Thread SEZNEC Bruno

Hi,
 
Following solr tuto,
I send doc to solr by request :
curl
'http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&map.
content=attr_content&commit=true' --F "myfi...@oxiane.pdf"

023717


Reply seems OK, content is in the index,
but after no query match the doc...
 
TIA
Regards
Bruno

Re: questions about solr

On Wed, Sep 2, 2009 at 10:44 PM, Zhenyu Zhong wrote:

> Dear all,
>
> I am very interested in Solr and would like to deploy Solr for distributed
> indexing and searching. I hope you are the right Solr expert who can help
> me
> out.
> However, I have concerns about the scalability and management overhead of
> Solr. I am wondering if anyone could give me some guidance on Solr.
>
> Basically, I have the following questions,
> For indexing
> 1.  How does Solr handle the distributed indexing? It seems Solr generates
> index on a single box. What if the index is huge and can't sit on one box?
>

Solr leaves the distribution of index upto the user. So if you think your
index will not fit in one box, you figure out a sharding strategy (such as
hashing or round-robin) and index your collection into each shards.

Solr supports distributed search so that your query can use all the shards
to give you the results.


> 2.  Is it possible for Solr to generate index in HDFS?
>
>
Never tried but it seems so. See Jason's response and the Jira issue he has
mentioned.


> For searching
> 3.  Solr provides Master/Slave framework. How does the Solr distribute the
> search? Does Solr know which index/shard to deliver the query to? Or does
> it
> have to do a multicast query to all the nodes?
>
>
For a full-text search it is hard to figure out the correct shards because
matching document could be living anywhere (unless you shard in a very
clever way and your data can be sharded in that way). Each shard is queried,
the results are merged and returned as if you had queried a single Solr
server.


> For fault tolerance
> 4. Does Solr handle the management overhead automatically? suppose master
> goes down, how does Solr recover the master in order to get the latest
> index
> updates?

   Do we have to code ourselves to handle this?
>

It does not. You have to handle that yourself currently. Similar topics have
been discussed on this list in the past and some workarounds have been
suggested. I suggest you search the archives.


> 5. Suppose master goes down immediately after the index updates, while the
> updates haven't been replicated to the slaves, data loss seems to happen.
> Does Solr have any mechanism to deal with that?
>
>
No. If you want you can setup a backup master and index on both master and
backup machines to achieve redundancy. However switching between the master
and the backup would need to be done by you.


> Performance of real-time index updating
> 6. How is the performance of this realtime index updating? Suppose we are
> updating a million records for a huge index with billions of records
> frequently. Can Solr provides a reasonable performance and low latency on
> that? (Probably it is related to Lucene library)
>
>
How frequently? With careful sharding, you can distribute your write load.
Depending on your data, you may also be able to split you indexes into a
more frequently updated on and an older archive index.

A lot of work is in progress in this area. Lucene 2.9 has support for near
real time search with more improvements planned in the coming days. Solr 1.4
will not have support for these new Lucene features but with 1.5 things
should be a lot better.

-- 
Regards,
Shalin Shekhar Mangar.

Re: Field Collapsing (was Re: Schema for group/child entity setup)

2009-09-03 Thread R. Tan

Thanks Uri. How does paging and scoring work when using field collapsing?
What patch works with 1.3? Is it production ready?

R


On Thu, Sep 3, 2009 at 3:54 PM, Uri Boness  wrote:

> The development on this patch is quite active. It works well for single
> solr instance, but distributed search (ie. shards) is not yet supported.
> Using this page you can group search results based on a specific field.
> There are two flavors of field collapsing - adjacent and non-adjacent, the
> former collapses only document which happen to be located next to each other
> in the otherwise-non-collapsed results set. The later (the non-adjacent) one
> collapses all documents with the same field value (regardless of their
> position in the otherwise-non-collapsed results set). Note, that
> non-adjacent performs better than adjacent one. There's currently discussion
> to extend this support so in addition to collapsing the documents, extra
> information will be returned for the collapsed documents (see the discussion
> on the issue page).
>
> Uri
>
>
> R. Tan wrote:
>
>> I think this is what I'm looking for. What is the status of this patch?
>>
>> On Thu, Sep 3, 2009 at 12:00 PM, R. Tan  wrote:
>>
>>
>>
>>> Hi Solrers,
>>> I would like to get your opinion on how to best approach a search
>>> requirement that I have. The scenario is I have a set of business
>>> listings
>>> that may be group into one parent business (such as 7-eleven having
>>> several
>>> locations). On the results page, I only want 7-eleven to show up once but
>>> also show how many locations matched the query (facet filtered by state,
>>> for
>>> example) and maybe a preview of the some of the locations.
>>>
>>> Searching for the business name is straightforward but the locations
>>> within
>>> the a result is quite tricky. I can do the opposite, searching for the
>>> locations and faceting on business names, but it will still basically be
>>> the
>>> same thing and repeat results with the same business name.
>>>
>>> Any advice?
>>>
>>> Thanks,
>>> R
>>>
>>>
>>>
>>
>>
>>
>

Question: How do I run the solr analysis tool programtically ?

2009-09-03 Thread Yatir


Form java code I want to contact solr through Http and supply a text buffer
(or a url that returns text, whatever is easier) and I want to get in return
the final list of tokens (or the final text buffer) after it went through
all the query time filters defined for this solr instance (stemming, stop
words etc)
thanks in advance

-- 
View this message in context: 
http://www.nabble.com/Question%3A-How-do-I-run-the-solr-analysis-tool-programtically---tp25273484p25273484.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Question: How do I run the solr analysis tool programtically ?

2009-09-03 Thread Chris Male

Hi Yatir,

The FieldAnalysisRequestHandler has the same behavior as the analysis tool.
It will show you the list of tokens that are created after each of the
filters have been applied.  It can be used through normal HTTP requests, or
you can use SolrJ's support.

Thanks,
Chris

On Thu, Sep 3, 2009 at 12:42 PM, Yatir  wrote:

>
> Form java code I want to contact solr through Http and supply a text buffer
> (or a url that returns text, whatever is easier) and I want to get in
> return
> the final list of tokens (or the final text buffer) after it went through
> all the query time filters defined for this solr instance (stemming, stop
> words etc)
> thanks in advance
>
> --
> View this message in context:
> http://www.nabble.com/Question%3A-How-do-I-run-the-solr-analysis-tool-programtically---tp25273484p25273484.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>

Re: Solr question

2009-09-03 Thread Erik Hatcher



On Sep 3, 2009, at 1:24 AM, SEZNEC Bruno wrote:


Hi,

Following solr tuto,
I send doc to solr by request :
curl
'http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&map 
.

content=attr_content&commit=true' --F "myfi...@oxiane.pdf"

023717


Reply seems OK, content is in the index,
but after no query match the doc...


Not even a *:* query?  What queries are you trying?  What's your  
default search field?  What does the query parse to, as seen in the  
response using &debugQuery=true ?   Likely the problem is that you  
aren't searching on the field the content was indexed into, or that it  
was not analyzed as you need.


Erik

Re: Exact Word Search

On Thu, Sep 3, 2009 at 1:33 PM, bhaskar chandrasekar
wrote:

> Hi,
>
> Can any one help me with the below scenario?.
>
> Scenario :
>
> I have integrated Solr with Carrot2.
> The issue is
> Assuming i give "bhaskar" as input string for search.
> It should give me search results pertaining to bhaskar only.
>  Example: It should not display search results as "chandarbhaskar" or
>  "bhaskarc".
>  Basically search should happen based on the exact word match. I am not
> bothered about case sensitive here
>  How to achieve the above Scenario in Carrot2 ?.
>

Bhaskar, I think this question is better suited for the Carrot mailing
lists. Unless you yourself control how the solr query is created, we will
not be able to help you.

-- 
Regards,
Shalin Shekhar Mangar.

Re: Using SolrJ with Tika

2009-09-03 Thread Abdullah Shaikh

Hi Laurent,

I am not sure if this is what you need, but you can extract the content from
the uploaded document (MS Docs, PDF etc) using TIKA and then send it to SOLR
for indexing.

String CONTENT = extract the content using TIKA (you can use
AutoDetectParser)

and then,

SolrInputDocument doc = new SolrInputDocument();
doc.addField("DOC_CONTENT", CONTENT);

solrServer.add(doc);
soltServer.commit();


On Wed, Sep 2, 2009 at 5:26 PM, Angel Ice  wrote:

> Hi everybody.
>
> I hope it's the right place for questions, if not sorry.
>
> I'm trying to index rich documents (PDF, MS docs etc) in SolR/Lucene.
> I have seen a few examples explaining how to use tika to solve this. But
> most of these examples are using curl to send documents to Solr or an HTML
> POST with an input file.
> But i'd like to do it in full java.
> Is there a way to use Solrj to index the documents with the
> ExtractingRequestHandler of SolR or at least to get the extracted xml back
> (with the extract.only option) ?
>
> Many thanks.
>
> Laurent.
>
>
>
>

Indexing docs using TIKA

2009-09-03 Thread Abdullah Shaikh

I am not sure if this went to Mailing List before.. hence forwarding again

Hi All,

I want to search for a document containing "string to search", price between
100 to 200 and weight 10-20.

SolrQuery query = new SolrQuery();
query.setQuery( "DOC_CONTENT: string to search");

query.setFilterQueries("PRICE:[100 TO 200]");
query.setFilterQueries("WEIGHT:[10 TO 20]");

QueryResponse response = server.query(query);

The DOC_CONTENT contains the content extracted from the file uploaded by the
user, extracted using TIKA.

Is the above approach correct ?

Re : Using SolrJ with Tika

2009-09-03 Thread Angel Ice

Hi

This is the solution I was testing.
I got some difficulties with AutoDetectParser but I think it's the solution I 
will use in the end.


Thanks for the advice anyway :)

Regards,

Laurent





De : Abdullah Shaikh 
À : solr-user@lucene.apache.org
Envoyé le : Jeudi, 3 Septembre 2009, 14h31mn 10s
Objet : Re: Using SolrJ with Tika

Hi Laurent,

I am not sure if this is what you need, but you can extract the content from
the uploaded document (MS Docs, PDF etc) using TIKA and then send it to SOLR
for indexing.

String CONTENT = extract the content using TIKA (you can use
AutoDetectParser)

and then,

SolrInputDocument doc = new SolrInputDocument();
doc.addField("DOC_CONTENT", CONTENT);

solrServer.add(doc);
soltServer.commit();


On Wed, Sep 2, 2009 at 5:26 PM, Angel Ice  wrote:

> Hi everybody.
>
> I hope it's the right place for questions, if not sorry.
>
> I'm trying to index rich documents (PDF, MS docs etc) in SolR/Lucene.
> I have seen a few examples explaining how to use tika to solve this. But
> most of these examples are using curl to send documents to Solr or an HTML
> POST with an input file.
> But i'd like to do it in full java.
> Is there a way to use Solrj to index the documents with the
> ExtractingRequestHandler of SolR or at least to get the extracted xml back
> (with the extract.only option) ?
>
> Many thanks.
>
> Laurent.
>
>
>
>

RE: Solr question

2009-09-03 Thread SEZNEC Bruno


 Thanks
My idea was that is I have 

in schema.xml
Eveything was stored in the index.
The query "solr" or other stuff works well only with text given in the sample
files
Rgds
Bruno


> -Message d'origine-
> De : Erik Hatcher [mailto:erik.hatc...@gmail.com] 
> Envoyé : jeudi 3 septembre 2009 13:40
> À : solr-user@lucene.apache.org
> Objet : Re: Solr question
> 
> 
> On Sep 3, 2009, at 1:24 AM, SEZNEC Bruno wrote:
> 
> > Hi,
> >
> > Following solr tuto,
> > I send doc to solr by request :
> > curl
> > 
> 'http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=att
> > r_&map
> > .
> > content=attr_content&commit=true' --F "myfi...@oxiane.pdf"
> > 
> > 0 > name="QTime">23717 
> >
> > Reply seems OK, content is in the index, but after no query 
> match the 
> > doc...
> 
> Not even a *:* query?  What queries are you trying?  What's 
> your default search field?  What does the query parse to, as 
> seen in the  
> response using &debugQuery=true ?   Likely the problem is that you  
> aren't searching on the field the content was indexed into, 
> or that it was not analyzed as you need.
> 
>   Erik
> 
>

Re: score = sum of boosts

2009-09-03 Thread Walter Underwood

You could start with a TF formula that ignores frequencies above 1.  
"onOffTF", I guess, returning 1 if the term is there one or more times.


Or, you could tell us what you are trying to achieve.

wunder

On Sep 3, 2009, at 12:28 AM, Shalin Shekhar Mangar wrote:

On Thu, Sep 3, 2009 at 4:09 AM, Joe Calderon  
 wrote:



hello *, what would be the best approach to return the sum of boosts
as the score?

ex:
a dismax handler boosts matches to field1^100 and field2^50, a query
matches both fields hence the score for that row would be 150


Not really. The tf-idf score would be multiplied by 100 for field1  
and by 50

for field2. The score can be more than 150 if both fields match.




is this something i could do with a function query or do i need to
hack up DisjunctionMaxScorer ?


Can you give a little more background on what you want to achieve  
this way?


--
Regards,
Shalin Shekhar Mangar.

Best way to do a lucene matchAllDocs not using q.alt=:

2009-09-03 Thread Marc Sturlese


Hey there,
I need a query to get the total number of documents in my index. I can get
if I do this using DismaxRequestHandler:
q.alt=*:*&facet=false&hl=false&rows=0
I have noticed this query is very memory consuming. Is there any more
optimized way in trunk to get the total number of documents of my index?
Thanks in advanced

-- 
View this message in context: 
http://www.nabble.com/Best-way-to-do-a-lucene-matchAllDocs-not-using-q.alt%3D*%3A*-tp25277585p25277585.html
Sent from the Solr - User mailing list archive at Nabble.com.

Default Query Type For Facet Queries

2009-09-03 Thread Stephen Duncan Jr

We have a custom query parser plugin registered as the default for searches,
and we'd like to have the same parser used for facet.query.

Is there a way to register it as the default for FacetComponent in
solrconfig.xml?

I know I can add {!type=customparser} to each query as a workaround, but I'd
rather register it in the config that make my code send that and strip it
off on every facet query.

-- 
Stephen Duncan Jr
www.stephenduncanjr.com

RE: Solr question

2009-09-03 Thread SEZNEC Bruno

Response with id:doc4 is OK


−

0
3
−

on
0
id:doc4
2.2
10


−

−

−

Sami Siren

−

application/pdf

−

−

   Example PDF document Tika Solr Cell
This is a sample piece of content for Tika Solr Cell article.


−

Wed Dec 31 10:17:13 CET 2008

−

Writer

−

OpenOffice.org 3.0

−

application/octet-stream

−

SampleDocument.pdf

−

18408

−

myfile

doc4
Example PDF document




What I don't understand is why a simple search on title or content
Doesn't works
:

−

0
3
−

on
0
PDF
2.2
10





Thanks 

> -Message d'origine-
> De : Erik Hatcher [mailto:erik.hatc...@gmail.com] 
> Envoyé : jeudi 3 septembre 2009 13:40
> À : solr-user@lucene.apache.org
> Objet : Re: Solr question
> 
> 
> On Sep 3, 2009, at 1:24 AM, SEZNEC Bruno wrote:
> 
> > Hi,
> >
> > Following solr tuto,
> > I send doc to solr by request :
> > curl
> > 
> 'http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=att
> > r_&map
> > .
> > content=attr_content&commit=true' --F "myfi...@oxiane.pdf"
> > 
> > 0 > name="QTime">23717 
> >
> > Reply seems OK, content is in the index, but after no query 
> match the 
> > doc...
> 
> Not even a *:* query?  What queries are you trying?  What's 
> your default search field?  What does the query parse to, as 
> seen in the  
> response using &debugQuery=true ?   Likely the problem is that you  
> aren't searching on the field the content was indexed into, 
> or that it was not analyzed as you need.
> 
>   Erik
> 
>

how to scan dynamic field without specifying each field in query

2009-09-03 Thread gdeconto


say I have a dynamic field called Foo* (where * can be in the hundreds) and
want to search Foo* for a value of 3 (for example)

I know I can do this via this:

http://localhost:8994/solr/select?q=(Foo1:3 OR Foo2:3 OR Foo3:3 OR …
Foo999:3)

However, is there a better way?  i.e. is there some way to query by a
function I create, possibly something like this:

http://localhost:8994/solr/select?q=myfunction(‘Foo’, 3)

where myfunction itself iterates thru all the instances of Foo*

any help appreciated

-- 
View this message in context: 
http://www.nabble.com/how-to-scan-dynamic-field-without-specifying-each-field-in-query-tp25280228p25280228.html
Sent from the Solr - User mailing list archive at Nabble.com.

RE: how to scan dynamic field without specifying each field in query

2009-09-03 Thread Manepalli, Kalyan

You can copy the dynamic fields value into a different field and query on that 
field.

Thanks,
Kalyan Manepalli

-Original Message-
From: gdeconto [mailto:gerald.deco...@topproducer.com] 
Sent: Thursday, September 03, 2009 12:06 PM
To: solr-user@lucene.apache.org
Subject: how to scan dynamic field without specifying each field in query


say I have a dynamic field called Foo* (where * can be in the hundreds) and
want to search Foo* for a value of 3 (for example)

I know I can do this via this:

http://localhost:8994/solr/select?q=(Foo1:3 OR Foo2:3 OR Foo3:3 OR ...
Foo999:3)

However, is there a better way?  i.e. is there some way to query by a
function I create, possibly something like this:

http://localhost:8994/solr/select?q=myfunction('Foo', 3)

where myfunction itself iterates thru all the instances of Foo*

any help appreciated

-- 
View this message in context: 
http://www.nabble.com/how-to-scan-dynamic-field-without-specifying-each-field-in-query-tp25280228p25280228.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: how to scan dynamic field without specifying each field in query

2009-09-03 Thread Avlesh Singh

>
> I know I can do this via this: http://localhost:8994/solr/select?q=(Foo1:3OR 
> Foo2:3 OR Foo3:3 OR ... Foo999:3)
>
Careful! You may hit the upper limit for MAX_BOOLEAN_CLAUSES this way.


> You can copy the dynamic fields value into a different field and query on
> that field.
>
Good idea!

Cheers
**Avlesh

On Thu, Sep 3, 2009 at 10:47 PM, Manepalli, Kalyan <
kalyan.manepa...@orbitz.com> wrote:

> You can copy the dynamic fields value into a different field and query on
> that field.
>
> Thanks,
> Kalyan Manepalli
>
> -Original Message-
> From: gdeconto [mailto:gerald.deco...@topproducer.com]
> Sent: Thursday, September 03, 2009 12:06 PM
> To: solr-user@lucene.apache.org
> Subject: how to scan dynamic field without specifying each field in query
>
>
> say I have a dynamic field called Foo* (where * can be in the hundreds) and
> want to search Foo* for a value of 3 (for example)
>
> I know I can do this via this:
>
> http://localhost:8994/solr/select?q=(Foo1:3OR
>  Foo2:3 OR Foo3:3 OR ...
> Foo999:3)
>
> However, is there a better way?  i.e. is there some way to query by a
> function I create, possibly something like this:
>
> http://localhost:8994/solr/select?q=myfunction('Foo',
> 3)
>
> where myfunction itself iterates thru all the instances of Foo*
>
> any help appreciated
>
> --
> View this message in context:
> http://www.nabble.com/how-to-scan-dynamic-field-without-specifying-each-field-in-query-tp25280228p25280228.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>

RE: how to scan dynamic field without specifying each field in query

2009-09-03 Thread gdeconto

thx for the reply.

you mean into a multivalue field?  possible, but was wondering if there was
something more flexible than that.  the ability to use a function (ie
myfunction) would open up some possibilities for more complex searching and
search syntax.

I could write my own query parser with special extended syntax, but that is
farther than I wanted to go.

Manepalli, Kalyan wrote:
> 
> You can copy the dynamic fields value into a different field and query on
> that field.
> 
> Thanks,
> Kalyan Manepalli
> 
> 

-- 
View this message in context: 
http://www.nabble.com/how-to-scan-dynamic-field-without-specifying-each-field-in-query-tp25280228p25280669.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: How to use DataImportHandler with ExtractingRequestHandler?

2009-09-03 Thread Sascha Szott


Hi Khai,

a few weeks ago, I was facing the same problem.

In my case, this workaround helped (assuming, you're using Solr 1.3): 
For each row, extract the content from the corresponding pdf file using 
a parser library of your choice (I suggest Apache PDFBox or Apache Tika 
in case you need to process other file types as well), put it between




and store it in a text file. To keep the relationship between a file and 
its corresponding database row, use the primary key as the file name.


Within data-config.xml use the XPathEntityProcessor as follows (replace 
dbRow and primaryKey respectively):



  



And, by the way, in Solr 1.4 you do not have to put your content between 
xml tags: use the PlainTextEntityProcessor instead of XPathEntityProcessor.


Best,
Sascha

Khai Doan schrieb:

Hi all,

My name is Khai.  I have a table in a relational database.  I have
successfully use DataImportHandler to import this data into Apache Solr.
However, one of the column store the location of PDF file.  How can I
configure DataImportHandler to use ExtractingRequestHandler to extract the
content of the PDF?

Thanks!

Khai Doan

Re: how to scan dynamic field without specifying each field in query

2009-09-03 Thread Avlesh Singh

A query parser, may be.
But that would not help either. End of the day, someone has to create those
many boolean queries in your case.

Cheers
Avlesh

On Thu, Sep 3, 2009 at 10:59 PM, gdeconto wrote:

>
> thx for the reply.
>
> you mean into a multivalue field?  possible, but was wondering if there was
> something more flexible than that.  the ability to use a function (ie
> myfunction) would open up some possibilities for more complex searching and
> search syntax.
>
> I could write my own query parser with special extended syntax, but that is
> farther than I wanted to go.
>
>
>
> Manepalli, Kalyan wrote:
> >
> > You can copy the dynamic fields value into a different field and query on
> > that field.
> >
> > Thanks,
> > Kalyan Manepalli
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/how-to-scan-dynamic-field-without-specifying-each-field-in-query-tp25280228p25280669.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>

Re: how to scan dynamic field without specifying each field in query

2009-09-03 Thread Renaud Delbru


Hi,

maybe SIREn [1] can help you for this task. SIREn is a Lucene plugin 
that allows to index and query tabular data. You can for example create 
a SIREn field "foo", index n values in n cells, and then query a 
specific cell or a range of cells. Unfortunately, the Solr plugin is not 
yet available, and therefore you will have to write your own query 
syntax and parser for this task.


Regards,

[1] http://siren.sindice.com
--
Renaud Delbru

gdeconto wrote:

thx for the reply.

you mean into a multivalue field?  possible, but was wondering if there was
something more flexible than that.  the ability to use a function (ie
myfunction) would open up some possibilities for more complex searching and
search syntax.

I could write my own query parser with special extended syntax, but that is
farther than I wanted to go.



Manepalli, Kalyan wrote:
  

You can copy the dynamic fields value into a different field and query on
that field.

Thanks,
Kalyan Manepalli

Re: how to scan dynamic field without specifying each field in query

2009-09-03 Thread gdeconto


I am thinking that my example was too simple/generic :-U.  It is possible for
more several dynamic fields to exist and other functionality to be required.
i.e. what about if my example had read:

http://localhost:8994/solr/select?q=((Foo1:3 OR Foo2:3 OR Foo3:3 OR …
Foo999:3) AND (Bar1:1 OR Bar2:1 OR Bar3:1...Bar999:1) AND (Etc1:7 OR Etc2:7
OR Etc3:7...Etc:999:7)

obviously a nasty query (and care would be needed for MAX_BOOLEAN_CLAUSES). 
that said, are there other mechanisms to better handle that type of query,
i.e.:

http://localhost:8994/solr/select?q=(myfunction(‘Foo’, 3) AND
myfunction('Bar', 1) AND (myfunction('Etc', 7))


gdeconto wrote:
> 
> say I have a dynamic field called Foo* (where * can be in the hundreds)
> and want to search Foo* for a value of 3 (for example)
> 
> I know I can do this via this:
> 
> http://localhost:8994/solr/select?q=(Foo1:3 OR Foo2:3 OR Foo3:3 OR …
> Foo999:3)
> 
> However, is there a better way?  i.e. is there some way to query by a
> function I create, possibly something like this:
> 
> http://localhost:8994/solr/select?q=myfunction(‘Foo’, 3)
> 
> where myfunction itself iterates thru all the instances of Foo*
> 
> any help appreciated
> 
> 

-- 
View this message in context: 
http://www.nabble.com/how-to-scan-dynamic-field-without-specifying-each-field-in-query-tp25280228p25283094.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Field Collapsing (was Re: Schema for group/child entity setup)

The collapsed documents are represented by one "master" document which 
can be part of the normal search result (the doc list), so pagination 
just works as expected, meaning taking only the returned documents in 
account (ignoring the collapsed ones). As for the scoring, the "master" 
document is actually the document with the highest score in the 
collapsed group.


As for Solr 1.3 compatibility... well... it's very hart to tell. All 
latest patch are certainly *not* 1.3 compatible (I think they're also 
depending on some changes in lucene which are not available for solr 
1.3). I guess you'll have to try some of the old patches, but I'm not 
sure about their stability.


cheers,
Uri

R. Tan wrote:

Thanks Uri. How does paging and scoring work when using field collapsing?
What patch works with 1.3? Is it production ready?

R


On Thu, Sep 3, 2009 at 3:54 PM, Uri Boness  wrote:

  

The development on this patch is quite active. It works well for single
solr instance, but distributed search (ie. shards) is not yet supported.
Using this page you can group search results based on a specific field.
There are two flavors of field collapsing - adjacent and non-adjacent, the
former collapses only document which happen to be located next to each other
in the otherwise-non-collapsed results set. The later (the non-adjacent) one
collapses all documents with the same field value (regardless of their
position in the otherwise-non-collapsed results set). Note, that
non-adjacent performs better than adjacent one. There's currently discussion
to extend this support so in addition to collapsing the documents, extra
information will be returned for the collapsed documents (see the discussion
on the issue page).

Uri


R. Tan wrote:



I think this is what I'm looking for. What is the status of this patch?

On Thu, Sep 3, 2009 at 12:00 PM, R. Tan  wrote:



  

Hi Solrers,
I would like to get your opinion on how to best approach a search
requirement that I have. The scenario is I have a set of business
listings
that may be group into one parent business (such as 7-eleven having
several
locations). On the results page, I only want 7-eleven to show up once but
also show how many locations matched the query (facet filtered by state,
for
example) and maybe a preview of the some of the locations.

Searching for the business name is straightforward but the locations
within
the a result is quite tricky. I can do the opposite, searching for the
locations and faceting on business names, but it will still basically be
the
same thing and repeat results with the same business name.

Any advice?

Thanks,
R

Re: Best way to do a lucene matchAllDocs not using q.alt=:


you can use LukeRequestHandler http://localhost:8983/solr/admin/luke

Marc Sturlese wrote:

Hey there,
I need a query to get the total number of documents in my index. I can get
if I do this using DismaxRequestHandler:
q.alt=*:*&facet=false&hl=false&rows=0
I have noticed this query is very memory consuming. Is there any more
optimized way in trunk to get the total number of documents of my index?
Thanks in advanced

Using scoring from another program

2009-09-03 Thread Paul Tomblin

Every document I put into Solr has a field "origScore" which is a
floating point number between 0 and 1 that represents a score assigned
by the program that generated the document.  I would like it that when
I do a query, it uses that origScore in the scoring, perhaps
multiplying the Solr score to find a weighted score and using that to
determine which are the highest scoring matches.  Can I do that?

-- 
http://www.linkedin.com/in/paultomblin

Re: Using scoring from another program


Function queries is what you need: http://wiki.apache.org/solr/FunctionQuery

Paul Tomblin wrote:

Every document I put into Solr has a field "origScore" which is a
floating point number between 0 and 1 that represents a score assigned
by the program that generated the document.  I would like it that when
I do a query, it uses that origScore in the scoring, perhaps
multiplying the Solr score to find a weighted score and using that to
determine which are the highest scoring matches.  Can I do that?

Sanity check: ResonseWriter directly to a database?

2009-09-03 Thread seanoc5

Hello all,
Are there any hidden gotchas--or even basic suggestions--regarding
implementing something like a DBResponseWriter that puts responses right
into a database? My specific questions are:

1) Any problems adding non-trivial jars to a solr plugin? I'm thinkin JDBC
and then perhaps Hibernate libraries?
I don't believe so, but I have just enough understanding to be dangerous at
the moment.

2) Is JSONResponseWriter a reasonable copy/paste starting point for me? Is
there anything that might match better, especially regarding initialization
and connection pooling?

3) Say I have a read-write single-core solr server: a vanilla-out-of-the-box
example install. Can I concurrently update the underlying index safely with
EmbeddedSolrServer? (This is my backup approach, less preferred)
I assume "no", one of them has to be read only, but I've learned not to
under-estimate the lucene/solr developers.

I'm starting with adapting JSONResponseWriter and the
http://wiki.apache.org/solr/SolrPlugins wiki notes . The docs seem to
indicate all I need to do is package up the appropriate supporting (jdbc)
jar files into my MyDBResponse.jar, and drop it into the ./lib dir (e.g.
c:\solr-svn\example\solr\lib). Of course, I need to update my solrconfig.xml
to use the new DBResponseWriter.

Straight straight JDBC seems like the easiest starting point. If that works,
perhaps move the DB stuff to hibernate. Does anyone have a "best practice"
suggestion for database access inside a plugin? I rather expect the answer
might be "use JNDI and well-configured hibernate; no special problems
related to 'inside' a solr plugin." I will eventually be interested in
saving both query results and document indexing information, so I expect to
do this in both a (custom) ResponseWriter, and ... um... a
DocumentAnalysisRequestHandler?

I realize embedded solr might be a better choice (performance has been a big
issue in my current implementation), and I am looking into that as well. If
feasible, I'd like to keep solr "in charge" of the database content through
plugins and extensions, rather than keeping both solr and db synced from my
(grails) app.
Thanks,

Sean

--
View this message in context:
http://www.nabble.com/Sanity-check%3A-ResonseWriter-directly-to-a-database--tp25284734p25284734.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Clarifications to Synonym Filter Wiki entry? (1 of 2)

: I believe the following section is a bit misleading; I'm sure it's correct
: for the case it describes, but there's another case I've tested, which on
: the surface seemed similar, but where the actual results were different and
: in hindsight not really a conflict, just a surprise.

the crux of the issue is that *lines* in the file with only commas (no =>) 
are ambiguious, and only have meaning once the "expand" property is evaluated.  
once that's done then you have a list of *mappings* ... and it's the 
mappings that get merged.

: I tested this by actually looking at the word index with Luke.

FYI: an easy way to test it would probably be the analysis.jsp page

: If you DID want the merged behavior, where D would expand to match all 9
: letters you can either:
: 1: Put the synonym filter in the pipeline twice, along with the remove
: duplicates filter
: OR
: 2: Use the synonym filter at both index and query time

using the filter at query time with expand=true would wreck havoc with 
phrase queries ... your best bet is to be more explicit when expressing 
the mappings in the file.

: And what should be added to the Wiki doc?

Add whatever you think would help ... users discovering behavior for hte 
first time are the best people to write documentation, because the devs 
who know the code really well don't apprecaite what isn't obvious.



-Hoss

Re: Best way to do a lucene matchAllDocs not using q.alt=:

The statistics page will also give you numDocs (it is an xml response).

On Fri, Sep 4, 2009 at 2:24 AM, Uri Boness  wrote:

> you can use LukeRequestHandler http://localhost:8983/solr/admin/luke
>
>
> Marc Sturlese wrote:
>
>> Hey there,
>> I need a query to get the total number of documents in my index. I can get
>> if I do this using DismaxRequestHandler:
>> q.alt=*:*&facet=false&hl=false&rows=0
>> I have noticed this query is very memory consuming. Is there any more
>> optimized way in trunk to get the total number of documents of my index?
>> Thanks in advanced
>>
>>
>>
>


-- 
Regards,
Shalin Shekhar Mangar.

Re: Clarifications to Synonym Filter Wiki entry? (2 of 2)


: Earlier on the thread repeats the claim that, if you use index side
: expansion, you won't have a problem.  But it doesn't explain how/why that
: fixes it, given that the Lucene parser still breaks on white space.

because at query time, nothing knows (or cares) that that multiple 
variants were indexed ... if your feld contains "sea" and 
"biscut" and "seabiscut" the query parser doesn't care  a querystring 
whose parsed form results in the query (field:seabiscut) is going 
to match, ditto for (field:sea field:biscut) ... the only place things 
start getting interesting is with phrase queries: because the synonyms 
are put at the same term position, things typically work ok, but you 
sometimes (ie: when the synonyms have differnet number of tokens) need a 
non-zero slop factor to help bridge the gap.

: Later there's a clue, it seems that even single words of a multi-word
: thesaurus entry are matched - so I guess Lucene doesn't need to see both
: words in a multi-word query, it just picks up either word, so it works
: around the multi-word parsing problem, but adds the undesireable side effect
: of false positive matches?

no ... A multi word (phrase) query needs to match all the words ... what 
that's referign to is that if a document orriginall contained "seabiscut" 
and synonyms caused "sea" and "biscut" to be added, then a search for just 
the term "sea" will match.



-Hoss

Re: how to get highlighter to only show matched term


: text. Basically, I just want to know which of the terms in my query 
: matched and in which field they matched (could be different from my 
: example). I assume that I may need to write my own Formatter for just 
: outputting nothing. But, I'm not sure where to start to get only my 
: needed term. Do I need a custom Fragmenter?

I believe so yes ... i think you might be able to get away with a Fragment 
that always returns "true" from isNewFragment, because i *think* the 
highlighter will do the right thing and merge adjacent fragments that 
both contain matches (in the case of phrase queries) but i'm not 100% 
certain.

for questions about implementing Formatters & Fragmenters, you'll probably 
want to ask on java-u...@lucene ... there are a lot more people over there 
that udnerstand teh internals of the highlighting code.


-Hoss

Re: Problem with ResponseBuilder


: DocListAndSet results = new DocListAndSet();
: Hits h = searcher.search(rb.getQuery());
...
: Is this the correct way to obtain the docs?

Uh not really.  why are you using the Hits method at all?  why don't 
you call the searcher.search method that returns a DocListAndSet instead?  
(Hits is a deprecated method in Lucene, and in Solr it doesnt' take 
advantage of any of hte caches)

: I'm receiving a null java.lang.NullPointerException with this code.

FYI: that's because results.docList is null until something assigns a new 
DocList to it ... if you want to build you'r own from scratch, you've got 
to instantiate it.



-Hoss

Single Core or Multiple Core?

2009-09-03 Thread Jonathan Ariel

It seems like it is really hard to decide when the Multiple Core solution is
more appropriate.As I could understand from this list and wiki the Multiple
Core feature was designed to address the need of handling different sets of
data within the same solr instance, where the sets of data don't need to be
joined.
In my case the documents are of a specific site and country. So document A
can be of Site 1 / Country 1, B of Site 2 / Country 1, C of Site 1 / Country
2, and so on.
For the use cases of my application I will never query across countries or
sites. I will always have to provide to the query the country id and the
site id.
Would you suggest to split my data into cores? I have few sites (around 20)
and more countries (around 90).
Should I split my data into sites (around 20 cores) and within a core filter
by site? Should I split by Site and Country (around 1800 cores)?
What should I consider when splitting my data into multiple cores?

Thanks

Jonathan

Re: Searching with or without diacritics


Take a look at the MappingCharFilterFactory (in Solr 1.4) and/or the 
ISOLatin1AccentFilterFactory.

: Date: Thu, 27 Aug 2009 16:30:08 +0200
: From: "[ISO-8859-1] Gy�rgy Frivolt" 
: Reply-To: solr-user@lucene.apache.org
: To: solr-user 
: Subject: Searching with or without diacritics
: 
: Hello,
: 
:  I started to use solr only recently using the ruby/rails sunspot-solr
: client. I use solr on a slovak/czech data set and realized one not wanted
: behaviour of the search. When the user searches an expression or word which
: contains dicritics, letters like š, č, ť, ä, ô,... usually the special
: characters are omitted in the search query. In this case solr does not
: return records which contain the expression intended to be found by the
: user.
:  How can I configure solr in a way, that it founds records containing
: special characters, even if they are without special accents in the query?
: 
:  Some info about my solr instance: Solr Specification Version: 1.3.0Solr
: Implementation Version: 1.3.0 694707 - grantingersoll - 2008-09-12
: 11:06:47Lucene Specification Version: 2.4-devLucene Implementation Version:
: 2.4-dev 691741 - 2008-09-03 15:25:16
: 
: Thank for your help, regards,
: 
:  Georg
: 



-Hoss

Re: SnowballPorterFilterFactory stemming word question


: If i give "machine" why is that it stems to "machin", now from where does
: this word come from
: If i give "revolutionary" it stems to "revolutionari", i thought it should
: stem to revolution.
: 
: How does stemming work?

the porter stemmer (and all of the stemmers provided with solr) are 
programtic stemmers ... they don't actually know the root of any words the 
use an aproximate algorithm to compute a *token* from a word based on a 
set of rules ... these tokens aren't neccessarily real words (and most of 
the time they aren't words) but the same token tends to be produced from 
words with similar roots.

if you want to see the actaul root word, you'll have to use a dictionary 
based stemmer.


-Hoss

Re: Impact of compressed=true attribute (in schema.xml) on Indexing/Query


: Now the question is, how the compressed=true flag impacts the indexing 
: and Querying operations. I am sure that there will be CPU utilization 
: spikes as there will be operation of compressing(during indexing) and 
: uncompressing(during querying) of the indexed data. I am mainly looking 
: for any bench marks for the above scenario.

i don't have any hard numbers for you, but the stored data isn't 
uncompressed when executing aquery -- queries are executed against the 
indexed terms (which are never compressed) ... the only time the data will 
be uncompressed is when returning results to the client -- so if you set 
rows=17 in your request, only the values for the 17 docs returned  (or 
less if there were fewer then 17 matches) will be uncompressed.



-Hoss

Re: Optimal Cache Settings, complicated by regular commits


: I'm trying to work out the optimum cache settings for our Solr server, I'll
: begin by outlining our usage.

...but you didn't give any information about what your cache settings look 
like ... size is only part of the picture, the autowarm counts are more 
significant.

: Commit frequency: sometimes we do massive amounts of sequential commits,

if you know you are going to be indexing more docs soon, then you can hold 
off on issuing a commit ... it really comes down to what kind of SLA you 
have to provide on how quickly an add/update is visible in the index -- 
don't commit any more often then that.

: The problem we have is that the default cache settings resulting in very low
: hit rates (less than 30% for documents, less than 1% for filterCache), so we

under 1% for filterCache sounds like you either have some really unique 
filter queries, or you are using enum based faceting on a huge field and 
the LRU cache is working against you by expunging values during a single 
request ... what version of solr are you using? what do the fieldtype 
declarations look like for the fields you are faceting on? what do the 
luke stats look like for hte fields you are faceting on?

: now we have the issue of commits being very slow (more than 5 seconds for a
: document), to the point where it causes a timeout elsewhere in our systems.
: This is made worse by the fact that committing seems to empty the cache,
: given that it takes about an hour to get the cache to a good state this is
: obviously very problematic.

1) using waitSearch=false can help speed up the commit if all you care 
about is not having your client time out.

2) using autowarming can help fill the caches up prior to users making 
requests (you may already know that, but since you didn't provide your 
cache configs i have no idea) .. they key is finding a good autowarm count 
that helps your cache stats w/o taking too long to fill up.


-Hoss

Re: Sorting performance + replication of index between cores

2009-09-03 Thread Sreeram Vaidyanathan


Did u guys find a solution?
I am having a similar issue.

Setup:
One indexer box & 2 searcher box. Each having 6 different solr-cores
We have a lot of updates (in the range of a couple thousand items every few
mins).
The Snappuller/Snapinstaller pulls and commits every 5 mins.

Query response time peaks to 60+ seconds when a new searcher is being
prepared.
I have disabled the caches (filter, query & document). 

We have a strict requirement of response time < 10 secs all the time.

Thanks
Sreeram


sunnyfr wrote:
> 
> Hi Christophe, 
> 
> Did you find a way to fix up your problem, cuz even with replication will
> have this problem, lot of update means clear cache and manage that.
> I've the same issue, I just wondering if I won't turn off servers during
> update ??? 
> How did you fix that ? 
> 
> Thanks,
> sunny
> 
> 
> christophe-2 wrote:
>> 
>> Hi,
>> 
>> After fully reloading my index, using another field than a Data does not 
>> help that much.
>> Using a warmup query avoids having the first request slow, but:
>>  - Frequents commits means that the Searcher is reloaded frequently 
>> and, as the warmup takes time, the clients must wait.
>>  - Having warmup slows down the index process (I guess this is 
>> because after a commit, the Searchers are recreated)
>> 
>> So I'm considering, as suggested,  to have two instances: one for 
>> indexing and one for searching.
>> I was wondering if there are simple ways to replicate the index in a 
>> single Solr server running two cores ? Any such config already tested ? 
>> I guess that the standard replication based on rsync can be simplified a 
>> lot in this case as the two indexes are on the same server.
>> 
>> Thanks
>> Christophe
>> 
>> Beniamin Janicki wrote:
>>> :so you can send your updates anytime you want, and as long as you only 
>>> :commit every 5 minutes (or commit on a master as often as you want, but 
>>> :only run snappuller/snapinstaller on your slaves every 5 minutes) your 
>>> :results will be at most 5minutes + warming time stale.
>>>
>>> This is what I do as well ( commits are done once per 5 minutes ). I've
>>> got
>>> master - slave configuration. Master has turned off all caches
>>> (commented in
>>> solrconfig.cml) and setup only 2 maxWarmingSearchers. Index size has 5GB
>>> ,Xmx= 1GB and committing takes around 10 secs ( on default configuration
>>> with warming it took from 30 mins up to 2 hours). 
>>>
>>> Slave caches are configured to have autowarmCount="0" and
>>> maxWarmingSearchers=1 , and I have new data 1 second after snapshoot is
>>> done. I haven't noticed any huge delays while serving search request.
>>> Try to use those values - may be they'll help in your case too.
>>>
>>> Ben Janicki
>>>
>>>
>>> -Original Message-
>>> From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
>>> Sent: 22 October 2008 04:56
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Sorting performance
>>>
>>>
>>> : The problem is that I will have hundreds of users doing queries, and a
>>> : continuous flow of document coming in.
>>> : So a delay in warming up a cache "could" be acceptable if I do it a
>>> few
>>> times
>>> : per day. But not on a too regular basis (right now, the first query
>>> that
>>> loads
>>> : the cache takes 150s).
>>> : 
>>> : However: I'm not sure why it looks not to be a good idea to update the
>>> caches
>>>
>>> you can refresh the caches automaticly after updating, the "newSearcher" 
>>> event is fired whenever a searcher is opened (but before it's used by 
>>> clients) so you can configure warming queries for it -- it doesn't have
>>> to 
>>> be done manually (or by the first user to use that reader)
>>>
>>> so you can send your updates anytime you want, and as long as you only 
>>> commit every 5 minutes (or commit on a master as often as you want, but 
>>> only run snappuller/snapinstaller on your slaves every 5 minutes) your 
>>> results will be at most 5minutes + warming time stale.
>>>
>>>
>>> -Hoss
>>>
>>>   
>> 
>> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Sorting-performance-tp20037712p25286018.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Re : Using SolrJ with Tika

2009-09-03 Thread Grant Ingersoll


See https://issues.apache.org/jira/browse/SOLR-1411

On Sep 3, 2009, at 6:47 AM, Angel Ice wrote:


Hi

This is the solution I was testing.
I got some difficulties with AutoDetectParser but I think it's the  
solution I will use in the end.



Thanks for the advice anyway :)

Regards,

Laurent





De : Abdullah Shaikh 
À : solr-user@lucene.apache.org
Envoyé le : Jeudi, 3 Septembre 2009, 14h31mn 10s
Objet : Re: Using SolrJ with Tika

Hi Laurent,

I am not sure if this is what you need, but you can extract the  
content from
the uploaded document (MS Docs, PDF etc) using TIKA and then send it  
to SOLR

for indexing.

String CONTENT = extract the content using TIKA (you can use
AutoDetectParser)

and then,

SolrInputDocument doc = new SolrInputDocument();
doc.addField("DOC_CONTENT", CONTENT);

solrServer.add(doc);
soltServer.commit();


On Wed, Sep 2, 2009 at 5:26 PM, Angel Ice  wrote:


Hi everybody.

I hope it's the right place for questions, if not sorry.

I'm trying to index rich documents (PDF, MS docs etc) in SolR/Lucene.
I have seen a few examples explaining how to use tika to solve  
this. But
most of these examples are using curl to send documents to Solr or  
an HTML

POST with an input file.
But i'd like to do it in full java.
Is there a way to use Solrj to index the documents with the
ExtractingRequestHandler of SolR or at least to get the extracted  
xml back

(with the extract.only option) ?

Many thanks.

Laurent.










--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search

Solr, JNDI config, dataDir, and solr home problem

2009-09-03 Thread Archon810

Here's my problem.

I'm trying to follow a multi Solr setup, straight from the Solr wiki -
http://wiki.apache.org/solr/SolrTomcat#head-024d7e11209030f1dbcac9974e55106abae837ac.

Here's the relevant code:

Now I want to set the Solr in solrconfig.xml, relative to
the solr home property. The instructions
http://wiki.apache.org/solr/SolrConfigXml#head-e8fbf2d748d90c5900aac712d0e3385ced5bd128
say is used to specify an alternate directory to hold all
index data other than the default ./data under the Solr home. If replication
is in use, this should match the replication configuration. If this
directory is not absolute, then it is relative to the current working
directory of the servlet container.

However, no matter how I try to set the dataDir property, solr home is not
being found. For example,
${solr.home}/data

What's even more confusing are these INFO notices in the log:
INFO: No /solr/home in JNDI
Sep 3, 2009 4:33:26 PM org.apache.solr.core.SolrResourceLoader
locateSolrHome
INFO: solr home defaulted to 'solr/' (could not find system property or
JNDI)

The JNDI instructions instruct to specify "solr/home", the log complains
about "/solr/home" (extra slash), the solrconfig.xml file seems to expect
${solr.home} - how more confusing can it get?

This person is having the same issue:
http://mysolr.com/tips/setting-solr-home-solrhome-in-jndi-on-tomcat-55/

So, how does one refer to solr home from solrconfig.xml in a JNDI
configuration scenario? Also, is there a way to debug/see variables that are
defined in a specific context, such as solrconfig.xml? I feel like I'm
completely blind here.

Thank you!
--
View this message in context:
http://www.nabble.com/Solr%2C-JNDI-config%2C-dataDir%2C-and-solr-home-problem-tp25286277p25286277.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Logging solr requests


: - I think that the use  of log files is discouraged, but i don't know if i
: can modify solr settings to log to a server (via rmi or http)
: - Don't want to drop down solr response performance

discouraged by who? ... having aseperate process tail your log file and 
build an index that way is the simplest way to do this without impact 
Solr's performace ... alternately you could write a custom LogHandler that 
sends the data anywhere you want (so you never need a log file) but that 
would require some non-trivial asynch code in your LogHandler to keep the 
budiling of your new idex from affecting hte performace (log calls are 
synchronous)


-Hoss

Re: Problem querying for a value with a "space"