Re: Calculating distances in Solr using longitude latitude

2010-09-22 Thread Jan Høydahl / Cominvent
:-)

Also, that Wiki page clearly states in the very first line that it talks about 
uncommitted stuff "Solr4.0". I think that is pretty clear.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 22. sep. 2010, at 03.31, Lance Norskog wrote:

> Developers, like marketers, have this confusing habit of speaking both in the 
> present and an imaginary utopian future.
> 
> Dennis Gearon wrote:
>> This is what made me think it was doable now.
>> 
>> http://wiki.apache.org/solr/SpatialSearch
>> 
>> 
>> Dennis Gearon
>> 
>> Signature Warning
>> 
>> EARTH has a Right To Life,
>>   otherwise we all die.
>> 
>> Read 'Hot, Flat, and Crowded'
>> Laugh at http://www.yert.com/film.php
>> 
>> 
>> --- On Tue, 9/21/10, PeterKerk  wrote:
>> 
>>   
>>> From: PeterKerk
>>> Subject: Re: Calculating distances in Solr using longitude latitude
>>> To: solr-user@lucene.apache.org
>>> Date: Tuesday, September 21, 2010, 1:18 AM
>>> 
>>> It would be such a shame if there's no way to get it now
>>> already...particularly in today's online services,
>>> geolocation is one of the
>>> hottest things there is, I definitely hope this feature
>>> gets major priority
>>> over other features.
>>> 
>>> Also same question as Dennis: what's the timeline on this
>>> feature? Or even a
>>> way to get it running in the current release?
>>> -- 
>>> View this message in context: 
>>> http://lucene.472066.n3.nabble.com/Calculating-distances-in-Solr-using-longitude-latitude-tp1524297p1533936.html
>>> Sent from the Solr - User mailing list archive at
>>> Nabble.com.
>>> 
>>> 



Re: Calculating distances in Solr using longitude latitude

2010-09-22 Thread Dennis Gearon
I'm pretty new to the Apache JIRA pages; I've not worked with a public, OSS 
version control situation before. The 'issue' numbers seem concise, but when I click 
what I think will give more information about that issue, it takes me to the 
Apache project as a whole, not the Lucene/Spatial Search 'issue' info page (if 
there is one). I'm used to the more casual use of SourceForge.

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Wed, 9/22/10, Jan Høydahl / Cominvent  wrote:

> From: Jan Høydahl / Cominvent 
> Subject: Re: Calculating distances in Solr using longitude latitude
> To: solr-user@lucene.apache.org
> Date: Wednesday, September 22, 2010, 12:11 AM
> :-)
> 
> Also, that Wiki page clearly states in the very first line
> that it talks about uncommitted stuff "Solr4.0". I think
> that is pretty clear.
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> 
> On 22. sep. 2010, at 03.31, Lance Norskog wrote:
> 
> > Developers, like marketers, have this confusing habit
> of speaking both in the present and an imaginary utopian
> future.
> > 
> > Dennis Gearon wrote:
> >> This is what made me think it was doable now.
> >> 
> >> http://wiki.apache.org/solr/SpatialSearch
> >> 
> >> 
> >> Dennis Gearon
> >> 
> >> Signature Warning
> >> 
> >> EARTH has a Right To Life,
> >>   otherwise we all die.
> >> 
> >> Read 'Hot, Flat, and Crowded'
> >> Laugh at http://www.yert.com/film.php
> >> 
> >> 
> >> --- On Tue, 9/21/10, PeterKerk 
> wrote:
> >> 
> >>   
> >>> From: PeterKerk
> >>> Subject: Re: Calculating distances in Solr
> using longitude latitude
> >>> To: solr-user@lucene.apache.org
> >>> Date: Tuesday, September 21, 2010, 1:18 AM
> >>> 
> >>> It would be such a shame if there's no way to
> get it now
> >>> already...particularly in today's online
> services,
> >>> geolocation is one of the
> >>> hottest things there is, I definitely hope
> this feature
> >>> gets major priority
> >>> over other features.
> >>> 
> >>> Also same question as Dennis: what's the
> timeline on this
> >>> feature? Or even a
> >>> way to get it running in the current release?
> >>> -- 
> >>> View this message in context: 
> >>> http://lucene.472066.n3.nabble.com/Calculating-distances-in-Solr-using-longitude-latitude-tp1524297p1533936.html
> >>> Sent from the Solr - User mailing list archive
> at
> >>> Nabble.com.
> >>> 
> >>>     
> 
>


Concurrent DB updates and delta import misses a few records

2010-09-22 Thread Shashikant Kore
Hi,

I'm using DIH to index records from a (MySQL) database. After every update on
the DB, Solr DIH is invoked for a delta import. In my tests, I have observed
that if DB updates and a DIH import happen concurrently, the import misses a
few records.

Here is how it happens.

The table has a column 'lastUpdated' which has default value of current
timestamp. Many records are added to database in a single transaction that
takes several seconds. For example, if 10,000 rows are being inserted, the
rows may get timestamp values from '2010-09-20 18:21:20' to '2010-09-20
18:21:26'. These rows become visible only after transaction is committed.
That happens at, say, '2010-09-20 18:21:30'.

If a Solr import gets triggered at '18:21:29', it will use the timestamp of the
last import for the delta query. This import will not see the records added in
the aforementioned transaction, as the transaction was not committed at that
instant. After this import, dataimport.properties will have a last index
time of '18:21:29'. The next import will not be able to get all the rows of the
previously referred transaction, as some of the rows have timestamps earlier
than '18:21:29'.

While I am testing extreme conditions, there is a possibility of missing out
on some data.

I could not find any solution in the Solr framework to handle this. The table
has an auto-increment key, and all updates are deletes followed by inserts. So
having a last_indexed_id would have helped, where last_indexed_id is the max
value of id fetched in that import. The delta query would then become "SELECT id
WHERE id > last_indexed_id". I suppose Solr does not have any provision like
this.

Two options I could think of are:
(a) Ensure at the application level that DB updates and DIH import requests
never run concurrently.
(b) Use exclusive locking during DB updates.

What is the best way to address this problem?

Thank you,

--shashi


Re: Calculating distances in Solr using longitude latitude

2010-09-22 Thread PeterKerk


Dennis Gearon wrote:
> 
> Soo, the short term answer is to use a function column to query
> against? Preferably with a bounding box, of course. :-)
> 

What do you mean by that? Calculate all locations within a certain range
(bounding box) and query on that?
I hope not, because that would be a very unfriendly solution.

I'm still surprised that this is not a bigger priority for the Solr
developers.

I think the better solution would then be to use LocalSolr, but I don't know:
1. whether their implementation offers the same functions as the current Solr
release, and
2. whether, to use LocalSolr, I would simply use their distribution of Solr
instead of stock Solr, or whether I would have to copy some classes into my
current deployment of Solr.

pff :)
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Calculating-distances-in-Solr-using-longitude-latitude-tp1524297p1559728.html
Sent from the Solr - User mailing list archive at Nabble.com.


is indexing single-threaded?

2010-09-22 Thread Andy
Does Solr index data in a single thread or can data be indexed concurrently in 
multiple threads?

Thanks
Andy


  


Different analyzers for different documents in different languages?

2010-09-22 Thread Andy
I have documents that are in different languages. There's a field in the 
documents specifying what language it's in.

Is it possible to index the documents such that based on what language a 
document is in, a different analyzer will be used on that document?

What is the "normal" way to handle documents in different languages?

Thanks
Andy


  


Re: Different analyzers for different documents in different languages?

2010-09-22 Thread Jan Høydahl / Cominvent
See this thread: http://search-lucene.com/m/FgbDS1JL3J1

Basically, what we normally do is to rename the fields with a language suffix, 
so if you have language=en and text="A red fox", then you would index it as 
text_en="A red fox". You would either have to do this outside Solr or write an 
UpdateRequestProcessor which does the renaming for you.
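
A minimal sketch of such an UpdateRequestProcessor, assuming a document field
named "language" and a source field named "text" (both hypothetical; package
names may differ slightly between Solr versions):

import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

// Renames "text" to e.g. "text_en", based on the document's "language" field.
public class LangFieldRenameProcessorFactory extends UpdateRequestProcessorFactory {
  public UpdateRequestProcessor getInstance(SolrQueryRequest req,
      SolrQueryResponse rsp, UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object lang = doc.getFieldValue("language");
        Object text = doc.getFieldValue("text");
        if (lang != null && text != null) {
          doc.setField("text_" + lang, text); // e.g. text_en="A red fox"
          doc.removeField("text");
        }
        super.processAdd(cmd); // pass the document on down the chain
      }
    };
  }
}

The factory would then be registered in an updateRequestProcessorChain in
solrconfig.xml so that it runs on every add.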

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 22. sep. 2010, at 12.01, Andy wrote:

> I have documents that are in different languages. There's a field in the 
> documents specifying what language it's in.
> 
> Is it possible to index the documents such that based on what language a 
> document is in, a different analyzer will be used on that document?
> 
> What is the "normal" way to handle documents in different languages?
> 
> Thanks
> Andy
> 
> 
> 



Autocomplete: match words anywhere in the token

2010-09-22 Thread Arunkumar Ayyavu
It's been over a week since I started learning Solr. Now, I'm using the
electronics store example to explore the autocomplete feature in Solr.

When I send the query terms.fl=name&terms.prefix=canon to terms request
handler, I get the following response:

<lst name="terms">
  <lst name="name">
    <int name="canon">2</int>
  </lst>
</lst>


But I expect the following results in the response.
canon pixma mp500 all-in-one photo printer
canon powershot sd500

So, I changed the schema for textgen fieldType to use
KeywordTokenizerFactory and also removed WordDelimiterFilterFactory. That
gives me the expected result.

Now, I also want Solr to return "canon pixma mp500 all-in-one photo
printer" when I send the query terms.fl=name&terms.prefix=pixma. Could you
gurus help me get the expected result?

BTW, I couldn't quite understand the behavior of terms.lower and terms.upper
(I tried these with the electronics store example). Could you also help me
understand these two query parameters?
Thanks.

-- 
Arun


Re: DIH: alternative approach to deltaQuery

2010-09-22 Thread Lukas Kahwe Smith

On 20.09.2010, at 08:32, Lukas Kahwe Smith wrote:

> Hi,
> 
> ok since it didn't seem like there was interest in documenting this approach on 
> the wiki, I have simply documented it on my blog:
> http://pooteeweet.org/blog/1827


Sorry for the spam. Lance (and Erik) did think it would be good to add it, so I 
placed a note with a link on the DIH page [1] and added a new page with the 
details [2].

The entire DIH page is kinda monolithic and maybe should be split up a bit.

regards,
Lukas Kahwe Smith
m...@pooteeweet.org

[1] http://wiki.apache.org/solr/DataImportHandler#Using_delta-import_command
[2] http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport




Re: Concurrent DB updates and delta import misses a few records

2010-09-22 Thread Shawn Heisey

 On 9/22/2010 1:39 AM, Shashikant Kore wrote:

Hi,

I'm using DIH to index records from a (MySQL) database. After every update on
the DB, Solr DIH is invoked for a delta import. In my tests, I have observed
that if DB updates and a DIH import happen concurrently, the import misses a
few records.

Here is how it happens.

The table has a column 'lastUpdated' which has default value of current
timestamp. Many records are added to database in a single transaction that
takes several seconds. For example, if 10,000 rows are being inserted, the
rows may get timestamp values from '2010-09-20 18:21:20' to '2010-09-20
18:21:26'. These rows become visible only after transaction is committed.
That happens at, say, '2010-09-20 18:21:30'.

If a Solr import gets triggered at '18:21:29', it will use the timestamp of the
last import for the delta query. This import will not see the records added in
the aforementioned transaction, as the transaction was not committed at that
instant. After this import, dataimport.properties will have a last index
time of '18:21:29'. The next import will not be able to get all the rows of the
previously referred transaction, as some of the rows have timestamps earlier
than '18:21:29'.

While I am testing extreme conditions, there is a possibility of missing out
on some data.

I could not find any solution in the Solr framework to handle this. The table
has an auto-increment key, and all updates are deletes followed by inserts. So
having a last_indexed_id would have helped, where last_indexed_id is the max
value of id fetched in that import. The delta query would then become "SELECT id
WHERE id > last_indexed_id". I suppose Solr does not have any provision like
this.

Two options I could think of are:
(a) Ensure at the application level that DB updates and DIH import requests
never run concurrently.
(b) Use exclusive locking during DB updates.

What is the best way to address this problem?


Shashi,

I was not solving the same problem, but perhaps you can adapt my 
solution to yours.  My main problem was that I don't have a modified 
date in my database, and due to the size of the table, it is impractical 
to add one.  Instead, I chose to track the database primary key (a 
simple autoincrement) outside of Solr and pass min/max values into DIH 
for it to use in the SELECT statement.  You can see a simplified version 
of my entity here, with a URL showing how to send the parameters in via 
the dataimport GET:


http://www.mail-archive.com/solr-user@lucene.apache.org/msg40466.html

The update script that runs every two minutes gets MAX(did) from the 
database, retrieves the minDid from a file on an NFS share, and runs a 
delta-import with those two values.  When the import is reported 
successful, it writes the maxDid value to the minDid file on the network 
share for the next run.  If the import fails, it sends an alarm and 
doesn't update the minDid.
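
For anyone scripting this, below is a minimal sketch of such a driver in Java.
The JDBC URL, table and column names, file path, and the minDid/maxDid
parameter names are all assumptions; inside the DIH config they would be
referenced as ${dataimporter.request.minDid} and ${dataimporter.request.maxDid}.
Also note that delta-import returns immediately, so a real driver must poll
command=status before declaring the run successful.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.net.HttpURLConnection;
import java.net.URL;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DeltaImportDriver {
  public static void main(String[] args) throws Exception {
    // 1. Read the high-water mark (minDid) from the shared file.
    BufferedReader in = new BufferedReader(new FileReader("/mnt/share/minDid.txt"));
    long minDid = Long.parseLong(in.readLine().trim());
    in.close();

    // 2. Fetch the current MAX(did) from the database.
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://dbhost/mydb", "user", "password");
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT MAX(did) FROM documents");
    rs.next();
    long maxDid = rs.getLong(1);
    conn.close();

    // 3. Trigger the delta import with both values as request parameters.
    URL url = new URL("http://localhost:8983/solr/dataimport"
        + "?command=delta-import&minDid=" + minDid + "&maxDid=" + maxDid);
    HttpURLConnection http = (HttpURLConnection) url.openConnection();
    if (http.getResponseCode() != 200) {
      throw new IOException("delta-import request failed");
    }
    // (A real driver would now poll ?command=status until the import finishes.)

    // 4. Only after a successful import, advance the high-water mark.
    Writer out = new FileWriter("/mnt/share/minDid.txt");
    out.write(Long.toString(maxDid));
    out.close();
  }
}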


Shawn



Re: Different analyzers for different documents in different languages?

2010-09-22 Thread Bernd Fehling
Actually, this is one of the biggest disadvantages of Solr for multilingual
content.
Solr is field based, which means you have to know the language _before_ you feed
the content to a specific field and process the content for that field.
This results in having separate fields for each language.
E.g. for Europe this will be 24 to 26 languages for each title, keyword,
description, ...

I guess when they started with Lucene/Solr they never had multilingual content
on their mind.

The alternative is to have a separate index for each language.
For this you also have to know the language of the content _before_ feeding it
to the core.
E.g. again for Europe you end up with 24 to 26 cores.

Another option is to "see" the multilingual fields (title, keywords,
description, ...) as a "subdocument": write a filter class as a subpipeline, use
language and encoding detection as the first step in that pipeline, then go on
with all other linguistic processing within that pipeline and return the
processed content back to the field for further filtering and storing.

Many solutions, but nothing out of the box :-)

Bernd

Am 22.09.2010 12:01, schrieb Andy:
> I have documents that are in different languages. There's a field in the 
> documents specifying what language it's in.
> 
> Is it possible to index the documents such that based on what language a 
> document is in, a different analyzer will be used on that document?
> 
> What is the "normal" way to handle documents in different languages?
> 
> Thanks
> Andy
> 
> 
>   


Re: Different analyzers for different documents in different languages?

2010-09-22 Thread Andrzej Bialecki

On 2010-09-22 15:30, Bernd Fehling wrote:

Actually, this is one of the biggest disadvantages of Solr for multilingual
content.
Solr is field based, which means you have to know the language _before_ you feed
the content to a specific field and process the content for that field.
This results in having separate fields for each language.
E.g. for Europe this will be 24 to 26 languages for each title, keyword,
description, ...

I guess when they started with Lucene/Solr they never had multilingual content
on their mind.

The alternative is to have a separate index for each language.
For this you also have to know the language of the content _before_ feeding it
to the core.
E.g. again for Europe you end up with 24 to 26 cores.

Another option is to "see" the multilingual fields (title, keywords,
description, ...) as a "subdocument": write a filter class as a subpipeline, use
language and encoding detection as the first step in that pipeline, then go on
with all other linguistic processing within that pipeline and return the
processed content back to the field for further filtering and storing.

Many solutions, but nothing out of the box :-)


Take a look at SOLR-1536, it contains an example of a tokenizing chain 
that could use a language detector to create different fields (or 
tokenize differently) based on this decision.



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Autocomplete: match words anywhere in the token

2010-09-22 Thread Jan Høydahl / Cominvent
Hmm, the terms component can only give you terms, so I don't think you can use 
that method.
Try creating a new Solr core for your use case. A bit more work, but much more 
flexible.

See http://search-lucene.com/m/Zfxp52FX49G1

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 22. sep. 2010, at 13.41, Arunkumar Ayyavu wrote:

> It's been over a week since I started learning Solr. Now, I'm using the
> electronics store example to explore the autocomplete feature in Solr.
> 
> When I send the query terms.fl=name&terms.prefix=canon to terms request
> handler, I get the following response
> <lst name="terms">
>   <lst name="name">
>     <int name="canon">2</int>
>   </lst>
> </lst>
> 
> But I expect the following results in the response.
> canon pixma mp500 all-in-one photo printer
> canon powershot sd500
> 
> So, I changed the schema for textgen fieldType to use
> KeywordTokenizerFactory and also removed WordDelimiterFilterFactory. That
> gives me the expected result.
> 
> Now, I also want the Solr to return "canon pixma mp500 all-in-one photo
> printer"  when I send the query terms.fl=name&terms.prefix=pixma. Could you
> gurus help me get the expected result?
> 
> BTW, I couldn't quite understand the behavior of terms.lower and terms.upper
> (I tried these with the electronics store example). Could you also help me
> understand these 2 query fields?
> Thanks.
> 
> -- 
> Arun



Re: Autocomplete: match words anywhere in the token

2010-09-22 Thread Jason Rutherglen
This may be what you're looking for.
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

On Wed, Sep 22, 2010 at 4:41 AM, Arunkumar Ayyavu
 wrote:
> It's been over a week since I started learning Solr. Now, I'm using the
> electronics store example to explore the autocomplete feature in Solr.
>
> When I send the query terms.fl=name&terms.prefix=canon to terms request
> handler, I get the following response
> <lst name="terms">
>   <lst name="name">
>     <int name="canon">2</int>
>   </lst>
> </lst>
>
> But I expect the following results in the response.
> canon pixma mp500 all-in-one photo printer
> canon powershot sd500
>
> So, I changed the schema for textgen fieldType to use
> KeywordTokenizerFactory and also removed WordDelimiterFilterFactory. That
> gives me the expected result.
>
> Now, I also want the Solr to return "canon pixma mp500 all-in-one photo
> printer"  when I send the query terms.fl=name&terms.prefix=pixma. Could you
> gurus help me get the expected result?
>
> BTW, I couldn't quite understand the behavior of terms.lower and terms.upper
> (I tried these with the electronics store example). Could you also help me
> understand these 2 query fields?
> Thanks.
>
> --
> Arun
>


Re: Calculating distances in Solr using longitude latitude

2010-09-22 Thread Erick Erickson
Well, rather than be surprised, you could jump right in and *make*
it a priority by contributing a patch :).

But before you do, you might want to look at the ongoing discussion at:

https://issues.apache.org/jira/browse/SOLR-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913577#action_12913577

Best
Erick


On Wed, Sep 22, 2010 at 4:41 AM, PeterKerk  wrote:

>
>
> Dennis Gearon wrote:
> >
> > Soo, the short term answer is to use a function column to query
> > against? Preferably with a bounding box, of course. :-)
> >
>
> What do you mean by that? Calculate all locations within a certain range
> (bounding box) and query on that?
> I hope not, because that would be a very unfriendly solution.
>
> I'm still surprised that this is not a bigger priority for the Solr
> developers.
>
> I think the better solution would then be to use LocalSolr, but I don't know:
> 1. whether their implementation offers the same functions as the current Solr
> release, and
> 2. whether, to use LocalSolr, I would simply use their distribution of Solr
> instead of stock Solr, or whether I would have to copy some classes into my
> current deployment of Solr.
>
> pff :)
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Calculating-distances-in-Solr-using-longitude-latitude-tp1524297p1559728.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Calculating distances in Solr using longitude latitude

2010-09-22 Thread Dennis Gearon
Well, do the least expensive operations first: all the other filtering on 
keywords, dates, and other fields, then the bounding box, then generate the 
distances.

Really not sure what LocalSolr is, but going to look now.
Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Wed, 9/22/10, PeterKerk  wrote:

> From: PeterKerk 
> Subject: Re: Calculating distances in Solr using longitude latitude
> To: solr-user@lucene.apache.org
> Date: Wednesday, September 22, 2010, 1:41 AM
> 
> 
> Dennis Gearon wrote:
> > 
> > Soo, the short term answer is to use a function column to query
> > against? Preferably with a bounding box, of course. :-)
> > 
> 
> What do you mean by that? Calculate all locations within a certain range
> (bounding box) and query on that?
> I hope not, because that would be a very unfriendly solution.
> 
> I'm still surprised that this is not a bigger priority for the Solr
> developers.
> 
> I think the better solution would then be to use LocalSolr, but I don't know:
> 1. whether their implementation offers the same functions as the current Solr
> release, and
> 2. whether, to use LocalSolr, I would simply use their distribution of Solr
> instead of stock Solr, or whether I would have to copy some classes into my
> current deployment of Solr.
> 
> pff :)
> -- 
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Calculating-distances-in-Solr-using-longitude-latitude-tp1524297p1559728.html
> Sent from the Solr - User mailing list archive at
> Nabble.com.
> 


Can Solr do approximate matching?

2010-09-22 Thread Igor Chudov
Hi guys. I am new here. So if I am unwittingly violating any rules,
let me know.

I am working with Solr because I own algebra.com, where I have a
database of 250,000 or so answered math questions. I want to use Solr
to provide approximate matching functionality called "similar items".
So that users looking at a problem could see how similar ones were
answered.

And my question is: does Solr support some "find similar"
functionality? For example, in my mind the sentence "I like tasty
strawberries" is 'similar' to a sentence such as "I like yummy
strawberries", just because both have a few of the same words.

So, to end my long-winded query, how would I implement a "find the top ten
items similar to this one" functionality?

Thanks!


Re: Can Solr do approximate matching?

2010-09-22 Thread Li Li
It seems there is a MoreLikeThis in Lucene; I don't know whether there is a
counterpart in Solr. It just uses the found document as a query to find
similar documents. Or you can just use a boolean OR query, and similar
questions will get a higher score. Of course, you can analyse the
question using some NLP techniques such as identifying entities and ignoring
less useful words such as "which" and "is" ... but I guess a tf*idf scoring
function will also work well
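
A minimal SolrJ sketch of the MoreLikeThis approach, enabling the component on
a regular query (the field names "id" and "body" and the document id are
hypothetical; CommonsHttpSolrServer is the SolrJ client class in the 1.4 line):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.util.NamedList;

public class SimilarQuestions {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("id:12345"); // the source question
    q.set("mlt", true);          // enable the MoreLikeThis component
    q.set("mlt.fl", "body");     // field to mine for "interesting" terms
    q.set("mlt.mintf", 1);       // min term frequency in the source doc
    q.set("mlt.mindf", 1);       // min document frequency in the index
    q.set("mlt.count", 10);      // top ten similar items
    QueryResponse rsp = solr.query(q);
    // Similar docs come back in a "moreLikeThis" section keyed by doc id.
    NamedList<?> mlt = (NamedList<?>) rsp.getResponse().get("moreLikeThis");
    SolrDocumentList similar = (SolrDocumentList) mlt.get("12345");
    for (SolrDocument d : similar) {
      System.out.println(d.getFieldValue("id"));
    }
  }
}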

2010/9/22 Igor Chudov :
> Hi guys. I am new here. So if I am unwittingly violating any rules,
> let me know.
>
> I am working with Solr because I own algebra.com, where I have a
> database of 250,000 or so answered math questions. I want to use Solr
> to provide approximate matching functionality called "similar items".
> So that users looking at a problem could see how similar ones were
> answered.
>
> And my question is, does Solr support some "find similar"
> functionality. For example, in my mind, sentence "I like tasty
> strawberries" is 'similar' to a sentence such as "I like yummy
> strawberries", just because both have a few of the same words.
>
> So, to end my long winded query, how would I implement a "find top ten
> similar items to this one" functionality?
>
> Thanks!
>


Re: Calculating distances in Solr using longitude latitude

2010-09-22 Thread Thomas Joiner
Re: your problems with JIRA

I have no idea what caused it or what resolved it, but I have had the same
problem as you. Assuming, that is, that the problem is that when you click on
a link to an issue, it instead takes you to
https://issues.apache.org/jira/secure/Dashboard.jspa  Or perhaps the page
above it... I don't remember clearly anymore.

If you click "Browse Project" then navigate to Solr, and type the issue
number in the quick search box, it should get you to the issue.  I'm sorry I
don't know a simpler way to do this/why it isn't working correctly, but
maybe in your case it will magically resolve itself like it did in my case.

(Sorry for the somewhat off-topic post, however since he did make a comment
about it on this thread, I felt it appropriate to respond on this thread as
well.)

On Wed, Sep 22, 2010 at 10:30 AM, Dennis Gearon wrote:

> Well, do the least expensive operations first: all the other filtering on
> keywords, dates, and other fields, then the bounding box, then generate the
> distances.
>
> Really not sure what LocalSolr is, but going to look now.
> Dennis Gearon
>
> Signature Warning
> 
> EARTH has a Right To Life,
>  otherwise we all die.
>
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
>
>
> --- On Wed, 9/22/10, PeterKerk  wrote:
>
> > From: PeterKerk 
> > Subject: Re: Calculating distances in Solr using longitude latitude
> > To: solr-user@lucene.apache.org
> > Date: Wednesday, September 22, 2010, 1:41 AM
> >
> >
> > Dennis Gearon wrote:
> > >
> > > Soo, the short term answer is to use a function column to query
> > > against? Preferably with a bounding box, of course. :-)
> > >
> >
> > What do you mean by that? Calculate all locations within a certain range
> > (bounding box) and query on that?
> > I hope not, because that would be a very unfriendly solution.
> >
> > I'm still surprised that this is not a bigger priority for the Solr
> > developers.
> >
> > I think the better solution would then be to use LocalSolr, but I don't know:
> > 1. whether their implementation offers the same functions as the current Solr
> > release, and
> > 2. whether, to use LocalSolr, I would simply use their distribution of Solr
> > instead of stock Solr, or whether I would have to copy some classes into my
> > current deployment of Solr.
> >
> > pff :)
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/Calculating-distances-in-Solr-using-longitude-latitude-tp1524297p1559728.html
> > Sent from the Solr - User mailing list archive at
> > Nabble.com.
> >
>


Re: Can Solr do approximate matching?

2010-09-22 Thread Erik Hatcher
Try a search at http://search-lucene.com (then narrow to
wiki to find things in "documentation"),

which will get you to http://wiki.apache.org/solr/MoreLikeThis

Erik


On Sep 22, 2010, at 12:12 PM, Li Li wrote:

> It seems there is a MoreLikeThis in Lucene; I don't know whether there is a
> counterpart in Solr. It just uses the found document as a query to find
> similar documents. Or you can just use a boolean OR query, and similar
> questions will get a higher score. Of course, you can analyse the
> question using some NLP techniques such as identifying entities and ignoring
> less useful words such as "which" and "is" ... but I guess a tf*idf scoring
> function will also work well
> 
> 2010/9/22 Igor Chudov :
>> Hi guys. I am new here. So if I am unwittingly violating any rules,
>> let me know.
>> 
>> I am working with Solr because I own algebra.com, where I have a
>> database of 250,000 or so answered math questions. I want to use Solr
>> to provide approximate matching functionality called "similar items".
>> So that users looking at a problem could see how similar ones were
>> answered.
>> 
>> And my question is, does Solr support some "find similar"
>> functionality. For example, in my mind, sentence "I like tasty
>> strawberries" is 'similar' to a sentence such as "I like yummy
>> strawberries", just because both have a few of the same words.
>> 
>> So, to end my long winded query, how would I implement a "find top ten
>> similar items to this one" functionality?
>> 
>> Thanks!
>> 



Improvements to SpellCheckComponent "Collate" -- Patch available for v1.4.1 (SOLR-2010)

2010-09-22 Thread Dyer, James
A couple of people have asked about getting SOLR-2010 to work on v1.4.1.  I 
uploaded a backported patch to JIRA today.  See 
https://issues.apache.org/jira/browse/SOLR-2010 and file "SOLR-2010_141.patch".

SOLR-2010 improves the SpellCheckComponent "Collate" functionality. Specifically,

1. Only return collations that are guaranteed to result in hits if re-queried 
(applying original fq params also). This is especially helpful when there is 
more than one correction per query.
2. Provide the option to get multiple collation suggestions
3. Provide extended collation results including the # of hits re-querying will 
return and a breakdown of each misspelled word and its correction.
4. Provide the option to create a "master" dictionary consisting of terms from 
several fields and letting the Collator remove any spurious suggestions that 
would result. (useful if using the dismax deftype to search across multiple 
fields.)

More information is available from the JIRA case.

If you are interested in the additional functionality, I would encourage you to 
try the patch out, and let me know if you run into any problems or have 
suggestions for improvements.

Help on applying patches is here:  
http://wiki.apache.org/solr/HowToContribute#TestingPatches
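
As a rough illustration, the new options could be exercised from SolrJ roughly
like this (parameter names as described in the JIRA case; double-check them
against the patch you apply):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class CollateExample {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("kanon powershott"); // two misspelled words
    q.set("spellcheck", true);
    q.set("spellcheck.collate", true);
    q.set("spellcheck.maxCollations", 5);             // several collation suggestions
    q.set("spellcheck.maxCollationTries", 10);        // verify collations by re-querying
    q.set("spellcheck.collateExtendedResults", true); // include hit counts per collation
    System.out.println(solr.query(q).getResponse().get("spellcheck"));
  }
}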

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311



Regarding Capacity of Solr Indexes.

2010-09-22 Thread Vaibhav Shrivastava
Hi,
I am a new user in this group. I wish to use Solr to deploy an index over
about 2 TB of document data. I would like to know if someone could help me
estimate the number of machines required for serving this index, assuming I
shall use Amazon machine instances for this.

The details regarding the system are as follows:

1. Number of fields to be indexed: 3 to 5.
2. The fields should be relatively small; the corresponding document size
could be around 5-10 KB.
3. The index should be updated on a daily basis.

-- 
Vaibhav Shrivastava,
Graduate Student,
MS Computer Science,
Stony Brook University.


Re: NullpointerException when combining spellcheck component and synonyms

2010-09-22 Thread Stefan Moises

 Hi all,

wow, this is weird...
now before I file the JIRA issue - one thing I forgot to mention is that 
I am using the edismax query parser.

I've just done the following:

1) searched with edismax parser:
/select?indent=on&version=2.2&q=beffy&fq=&start=0&rows=10&fl=*%2Cscore&qt=dismax&wt=standard&debugQuery=on&explainOther=&hl.fl=&spellcheck=true
and got the NullPointer as usual

2) changed "dismax" to "standard", just to make sure - using the same query:
/select?indent=on&version=2.2&q=beffy&fq=&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&debugQuery=on&explainOther=&hl.fl=&spellcheck=true
and got 6 results

3) changed back to "dismax" and got the 6 results, again - no more 
NullPointer... !?


Is that something cache-related maybe? Because I've had other indexes 
which at first gave me the NullPointer exception using this setup and 
after a while the synonyms were working, even with spellchecking turned 
on and using edismax...


Just for reference, here is my handler config:


edismax
explicit
0.05

oxtitle^136.9 oxartnum^20.9 oxartnum_exact^30 oxartnum_rev^30 
oxtags^20 seokeywords^25 seodesc^15 oxlongdesc^10 oxcattitle^80.4 
allcattitles_exact^55 allcattitles^50 allcattitles_rev^45 
allcattitles_preserve_rev^50 oxmanu_title^60.0 oxvendor_title^60.0



oxtitle^136.9 oxartnum^20.9 oxartnum_exact^30 oxartnum_rev^30 
oxtags^20 seokeywords^25 seodesc^15 oxlongdesc^10 oxcattitle^80.4 
allcattitles_exact^55 allcattitles^50 allcattitles_rev^45 
allcattitles_preserve_rev^50 oxmanu_title^60.0 oxvendor_title^60.0



2<-1 5<-2 6<90%

100
*:*

oxcattitle oxtitle oxshortdesc oxlongdesc

0

oxtitle
regex 


spellcheck
elevator




Thanks for any hint :)
Stefan

Am 21.09.2010 21:14, schrieb Stefan Moises:

 Sure, no problem, I'll submit a JIRA entry :)

Am 21.09.2010 16:13, schrieb Robert Muir:
I don't think you should get an error like this from SynonymFilter... 
would

you mind opening a JIRA issue?

On Tue, Sep 21, 2010 at 9:49 AM, Stefan Moises  
wrote:



  Hi again,

well it turns out that it still doesn't work ...
Sometimes it works (i.e. for some cores), sometimes I still get the
nullpointer - e.g. if I create a new core and use the same settings 
as a
working one, but index different data, then I add a synonym (e.g. 
"foo =>

bar") and activate spellchecking, then search for "foo" -  boom :(
And I can't really tell where this error comes from... any idea 
where to

start looking?

Thanks,
Stefan

Am 01.09.2010 17:20, schrieb Stefan Moises:

   doh, looks like I only forgot to add the spellcheck component to my

edismax request handler... now it works with:

...

spellcheck
elevator


What's strange is that spellchecking seemed to work *without* that 
entry,

too

Cheers,
Stefan

Am 01.09.2010 13:33, schrieb Stefan Moises:


  Hi there,

I am using Solr from SVN,
https://svn.apache.org/repos/asf/lucene/dev/trunk (my last 
update/build

on my dev server was in July I think)...

I've encountered a strange problem when using the Spellcheck 
component in

combination with the SynonymFilter...
My "text" field is pretty standard, using the default synonyms.txt 
file:
positionIncrementGap="100">























I have only added some terms at the end of synonyms.txt:
...
# Synonym mappings can be used for spelling correction too
pixima =>  pixma

tekanne =>  teekanne
teekane =>  teekanne
flashen =>  flaschen
flasen =>  flaschen

Here is my query and the exception... if I turn off spellcheck,
everything works as expected and the synonyms are found...

INFO: [db] webapp=/solr path=/select
params={mlt.minwl=3&spellcheck=true&facet=true&mlt.fl=oxmanu_oxid,oxvendor_oxid,oxtags,oxsearchkeys&spellcheck.q=flasen&mlt.mintf=1&facet.limit=-1&mlt=true& 

json.nl=map&hl.fl=oxtitle&hl.fl=oxshortdesc&hl.fl=oxlongdesc&hl.fl=oxtags&hl.fl=seodesc&hl.fl=seokeywords&wt=json&hl=true&rows=10&version=1.2&mlt.mindf=1&debugQuery=true&facet.sort=lex&start=0&q=flasen&facet.field=oxcat_oxid&facet.field=oxcat_oxidtitle&facet.field=oxprice&facet.field=oxmanu_oxid&facet.field=oxmanu_oxidtitle&facet.field=oxvendor_oxid&facet.field=oxvendor_oxidtitle&facet.field=attrgroup_oxid&facet.field=attrgroup_oxidtitle&facet.field=attrgroup_oxidvalue&facet.field=attrvalue_oxid&facet.field=attrvalue_oxidtitle&facet.field=attr2attrgroup_oxidtitle&qt=dismax&spellcheck.build=false} 


hits=2 status=500 QTime=14
01.09.2010 12:54:47 org.apache.solr.common.SolrException log
SCHWERWIEGEND: java.lang.NullPointerException
at
org.apache.lucene.util.AttributeSource.cloneAttributes(AttributeSource.java:470) 


at
org.apache.lucene.analysis.synonym.SynonymFilter.incrementToken(SynonymFilter.java:128) 


at
org.apache.lucene.analysis.core.StopFilter.incrementToken(StopFilter.java:260) 


at
org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter.incrementToken(WordDelimiterFilter.java:336) 


at
org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:62) 

I was at a search vendor round table today...

2010-09-22 Thread Smiley, David W.
(I don't twitter or blog so I thought I'd send this message here)

Today at work (at MITRE outside DC) there was (is) a day of technical 
presentations about topics related to information dissemination and discovery 
(broad squishy words there, but mostly covered "search") at which I spoke about 
the value of faceting, and gave a quick Solr pitch.  There was an hour vendor 
panel in which a representative from Autonomy, Microsoft (i.e. FAST), Google, 
Vivisimo, and Endeca had the opportunity to espouse the virtues of their 
product, and fit in an occasional jab at their competitors next to them.  In 
the absence of a suitable representative for Solr (e.g. Lucid) I pointed out 
how open-source Solr has "democratized" (i.e. made free) search and faceting 
when it used to require paying lots of money.  And I asked them how their 
products have reacted to this new reality.  Autonomy acknowledged they used to 
make millions on simple engagements in the distant past but that isn't the case 
these days.  He said some other things about a huge petabyte hosted search 
collection they have used by banks... I forget what else he said.  I forgot 
what Google said.  Vivisimo quoted Steve Ballmer, saying "open source is as 
free as a free puppy" (not a bad point IMO).  Endeca claimed to be happy Solr 
exists because it raises the awareness of faceted search, but then claimed it 
would not scale and they should then upgrade to Endeca.  (!)  I found that 
claim ridiculous, of course.

Speaking of performance, on a large scale search project where we're using Solr 
in place of a MarkLogic prototype (because ML is so friggin expensive, for one 
reason), the search results were so fast (~150ms) vs. the ML's results of 2-3 
seconds, that the UI engineers building the interface on top of the XML output 
thought Solr was broken because it was so fast.  The quote was "It's so fast, 
it's broken."  In other words, they were used to 2-3 second response times 
and so if the results came back as fast as what Solr has been doing, then 
surely there's a bug.  There's no bug.  :)  Admittedly, I think it was a bit of 
an apples and oranges comparison but I love that quote nonetheless.

~ David Smiley
 Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book




Re: Autocomplete: match words anywhere in the token

2010-09-22 Thread Arunkumar Ayyavu
Thanks for the responses. Now I have included the EdgeNGramFilter. But I get
the following results when I search for "canon pixma":
Canon PIXMA MP500 All-In-One Photo Printer
Canon PowerShot SD500

As you can guess, I'm not expecting the 2nd result entry. Though I
understand why I'm getting the 2nd entry, I don't know how to ask Solr to
exclude it (I could filter it in my application, though). :-( Looks like I
should study more of Solr's capabilities to find the solution.

It would be very nice if you could provide more pointers to the solution.
Thanks a lot.
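
One way to exclude it, as a sketch: instead of a KeywordTokenizer over the
whole name, use a word-level index analyzer (e.g. WhitespaceTokenizer +
LowerCaseFilter + EdgeNGramFilter), leave the n-gram step out of the query
analyzer, and send the user's input as a phrase query. "pixma" alone still
matches mid-name, but "canon pixma" as a phrase no longer matches the
PowerShot, since no token in that name starts with "pixma". The field name
"name_ac" below is hypothetical:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class AutocompleteQuery {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    // Send each keystroke's input as a phrase query against the n-gram field.
    SolrQuery q = new SolrQuery("name_ac:\"canon pixma\"");
    q.setFields("name");
    for (SolrDocument d : solr.query(q).getResults()) {
      System.out.println(d.getFieldValue("name")); // expect only the PIXMA printer
    }
  }
}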

On Wed, Sep 22, 2010 at 7:48 PM, Jan Høydahl / Cominvent <
jan@cominvent.com> wrote:

> Hmm, the terms component can only give you terms, so I don't think you can
> use that method.
> Try creating a new Solr core for your use case. A bit more work,
> but much more flexible.
>
> See http://search-lucene.com/m/Zfxp52FX49G1
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> On 22. sep. 2010, at 13.41, Arunkumar Ayyavu wrote:
>
> > It's been over a week since I started learning Solr. Now, I'm using the
> > electronics store example to explore the autocomplete feature in Solr.
> >
> > When I send the query terms.fl=name&terms.prefix=canon to terms request
> > handler, I get the following response
> > <lst name="terms">
> >   <lst name="name">
> >     <int name="canon">2</int>
> >   </lst>
> > </lst>
> >
> > But I expect the following results in the response.
> > canon pixma mp500 all-in-one photo printer
> > canon powershot sd500
> >
> > So, I changed the schema for textgen fieldType to use
> > KeywordTokenizerFactory and also removed WordDelimiterFilterFactory. That
> > gives me the expected result.
> >
> > Now, I also want the Solr to return "canon pixma mp500 all-in-one photo
> > printer"  when I send the query terms.fl=name&terms.prefix=pixma. Could
> you
> > gurus help me get the expected result?
> >
> > BTW, I couldn't quite understand the behavior of terms.lower and
> terms.upper
> > (I tried these with the electronics store example). Could you also help
> me
> > understand these 2 query fields?
> > Thanks.
> >
> > --
> > Arun
>
>


-- 
Arun


Re: NullpointerException when combining spellcheck component and synonyms

2010-09-22 Thread Stefan Moises
Well, to sum it up... it doesn't really matter whether I use standard or 
dismax; at the moment both give me NullPointers for the same query, 
although I didn't change anything since it was working. It seems 
totally random: sometimes it works a couple of times, sometimes it 
doesn't :(


Weird...

Stefan

Am 22.09.2010 19:58, schrieb Stefan Moises:

 Hi all,

wow, this is weird...
now before I file the JIRA issue - one thing I forgot to mention is 
that I am using the edismax query parser.

I've just done the following:

1) searched with edismax parser:
/select?indent=on&version=2.2&q=beffy&fq=&start=0&rows=10&fl=*%2Cscore&qt=dismax&wt=standard&debugQuery=on&explainOther=&hl.fl=&spellcheck=true 


and got the NullPointer as usual

2) changed "dismax" to "standard", just to make sure - using the same 
query:
/select?indent=on&version=2.2&q=beffy&fq=&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&debugQuery=on&explainOther=&hl.fl=&spellcheck=true 


and got 6 results

3) changed back to "dismax" and got the 6 results, again - no more 
NullPointer... !?


Is that something cache-related maybe? Because I've had other indexes 
which at first gave me the NullPointer exception using this setup and 
after a while the synonyms were working, even with spellchecking 
turned on and using edismax...


Just for reference, here is my handler config:


edismax
explicit
0.05

oxtitle^136.9 oxartnum^20.9 oxartnum_exact^30 oxartnum_rev^30 
oxtags^20 seokeywords^25 seodesc^15 oxlongdesc^10 oxcattitle^80.4 
allcattitles_exact^55 allcattitles^50 allcattitles_rev^45 
allcattitles_preserve_rev^50 oxmanu_title^60.0 oxvendor_title^60.0



oxtitle^136.9 oxartnum^20.9 oxartnum_exact^30 oxartnum_rev^30 
oxtags^20 seokeywords^25 seodesc^15 oxlongdesc^10 oxcattitle^80.4 
allcattitles_exact^55 allcattitles^50 allcattitles_rev^45 
allcattitles_preserve_rev^50 oxmanu_title^60.0 oxvendor_title^60.0



2<-1 5<-2 6<90%

100
*:*

oxcattitle oxtitle oxshortdesc oxlongdesc

0

oxtitle
regex 


spellcheck
elevator




Thanks for any hint :)
Stefan

Am 21.09.2010 21:14, schrieb Stefan Moises:

 Sure, no problem, I'll submit a JIRA entry :)

Am 21.09.2010 16:13, schrieb Robert Muir:
I don't think you should get an error like this from 
SynonymFilter... would

you mind opening a JIRA issue?

On Tue, Sep 21, 2010 at 9:49 AM, Stefan Moises  
wrote:



  Hi again,

well it turns out that it still doesn't work ...
Sometimes it works (i.e. for some cores), sometimes I still get the
nullpointer - e.g. if I create a new core and use the same settings 
as a
working one, but index different data, then I add a synonym (e.g. 
"foo =>

bar") and activate spellchecking, then search for "foo" -  boom :(
And I can't really tell where this error comes from... any idea 
where to

start looking?

Thanks,
Stefan

Am 01.09.2010 17:20, schrieb Stefan Moises:

   doh, looks like I only forgot to add the spellcheck component to my

edismax request handler... now it works with:

...

spellcheck
elevator


What's strange is that spellchecking seemed to work *without* that 
entry,

too

Cheers,
Stefan

Am 01.09.2010 13:33, schrieb Stefan Moises:


  Hi there,

I am using Solr from SVN,
https://svn.apache.org/repos/asf/lucene/dev/trunk (my last 
update/build

on my dev server was in July I think)...

I've encountered a strange problem when using the Spellcheck 
component in

combination with the SynonymFilter...
My "text" field is pretty standard, using the default 
synonyms.txt file:
positionIncrementGap="100">





generateWordParts="1"

generateNumberParts="1" catenateWords="1" catenateNumbers="1"
catenateAll="0" splitOnCaseChange="1"/>









generateWordParts="1"

generateNumberParts="1" catenateWords="0" catenateNumbers="0"
catenateAll="0" splitOnCaseChange="1"/>







I have only added some terms at the end of synonyms.txt:
...
# Synonym mappings can be used for spelling correction too
pixima =>  pixma

tekanne =>  teekanne
teekane =>  teekanne
flashen =>  flaschen
flasen =>  flaschen

Here is my query and the exception... if I turn off spellcheck,
everything works as expected and the synonyms are found...

INFO: [db] webapp=/solr path=/select
params={mlt.minwl=3&spellcheck=true&facet=true&mlt.fl=oxmanu_oxid,oxvendor_oxid,oxtags,oxsearchkeys&spellcheck.q=flasen&mlt.mintf=1&facet.limit=-1&mlt=true& 

json.nl=map&hl.fl=oxtitle&hl.fl=oxshortdesc&hl.fl=oxlongdesc&hl.fl=oxtags&hl.fl=seodesc&hl.fl=seokeywords&wt=json&hl=true&rows=10&version=1.2&mlt.mindf=1&debugQuery=true&facet.sort=lex&start=0&q=flasen&facet.field=oxcat_oxid&facet.field=oxcat_oxidtitle&facet.field=oxprice&facet.field=oxmanu_oxid&facet.field=oxmanu_oxidtitle&facet.field=oxvendor_oxid&facet.field=oxvendor_oxidtitle&facet.field=attrgroup_oxid&facet.field=attrgroup_oxidtitle&facet.field=attrgroup_oxidvalue&facet.field=attrvalue_oxid&facet.field=attrvalue_oxidtitle&facet.field=attr2attrgroup_oxidtitle&qt=dismax&spellcheck.build=false} 


hits=2 status=500 QTime=14
01.09.2010 

Re: I was at a search vendor round table today...

2010-09-22 Thread Grant Ingersoll

On Sep 22, 2010, at 2:04 PM, Smiley, David W. wrote:

> (I don't twitter or blog so I thought I'd send this message here)
> 
> Today at work (at MITRE outside DC) there was (is) a day of technical 
> presentations about topics related to information dissemination and discovery 
> (broad squishy words there, but mostly covered "search") at which I spoke 
> about the value of faceting, and gave a quick Solr pitch.  There was an hour 
> vendor panel in which a representative from Autonomy, Microsoft (i.e. FAST), 
> Google, Vivisimo, and Endeca had the opportunity to espouse the virtues of 
> their product, and fit in an occasional jab at their competitors next to 
> them.  In the absence of a suitable representative for Solr (e.g. Lucid) I 
> pointed out how open-source Solr has "democratized" (i.e. made free) search 
> and faceting when it used to require paying lots of money.  And I asked them 
> how their products have reacted to this new reality.  Autonomy acknowledged 
> they used to make millions on simple engagements in the distant past but that 
> isn't the case these days.  He said some other things about a huge petabyte 
> hosted search collection they have used by banks... I forget what else he 
> said.  I forgot what Google said.  Vivisimo quoted Steve Ballmer, saying 
> "open source is as free as a free puppy" (not a bad point IMO).  

Too funny.  Hadn't heard that one before.  Presumably meaning you have to care 
for and feed it, despite the fact that you really do love it and it is cute as 
hell?  The care and feeding is true of the commercial ones, too, especially in 
terms of cost for supporting features you never use; but love (as in, we love 
using this tool) is usually not a word I hear associated with those too 
often, though of course that is likely self-selecting.

> Endeca claimed to be happy Solr exists because it raises the awareness of 
> faceted search, but then claimed it would not scale and they should then 
> upgrade to Endeca.  (!)  I found that claim ridiculous, of course.

Having replaced all the above on a number of occasions w/ Solr at both a 
significant cost savings on licensing, dev time, and hardware, I would agree 
that claim is quite ridiculous.  Besides, in my experience, the scale claim is 
silly.  Everyone (customers) says they need scale, but few of them really know 
what scale is, so it is all relative.   For some, scale is 1M docs, for others 
it's 1B+ docs;  for others it's 100K queries per day, for others it's 100M per 
day.  (BTW, I've seen Lucene/Solr do both, just fine.  Not that it is a free 
lunch, but neither are the other ones despite what they say.)

> 
> Speaking of performance, on a large scale search project where we're using 
> Solr in place of a MarkLogic prototype (because ML is so friggin expensive, 
> for one reason), the search results were so fast (~150ms) vs. the ML's 
> results of 2-3 seconds, that the UI engineers building the interface on top 
> of the XML output thought Solr was broken because it was so fast.  The quote 
> was "It's so fast, it's broken".In other words, they were used to 2-3 
> second response times and so if the results came back as fast as what Solr 
> has been doing, then surely there's a bug.  There's no bug.  :)  Admittedly, 
> I think it was a bit of an apples and oranges comparison but I love that 
> quote nonetheless.


I love it.  I have had the same experience where people think it's broken b/c 
it's so fast.  Large vendor named above took 24 hours to index 4M records (they 
weren't even doing anything fancy on the indexing side) and search was slow 
too.  Solr took about 40 minutes to index all the content and search was 
blazing.  Same content, faster indexing, better search results, a lot less 
time. 

At any rate, enough of tooting our own horn.  Thanks for sharing!

-Grant


--
Grant Ingersoll
http://www.lucidimagination.com/



Re: I was at a search vendor round table today...

2010-09-22 Thread Walter Underwood
On Sep 22, 2010, at 11:04 AM, Smiley, David W. wrote:

> Speaking of performance, on a large scale search project where we're using 
> Solr in place of a MarkLogic prototype (because ML is so friggin expensive, 
> for one reason), the search results were so fast (~150ms) vs. the ML's 
> results of 2-3 seconds, that the UI engineers building the interface on top 
> of the XML output thought Solr was broken because it was so fast.  The quote 
> was "It's so fast, it's broken".In other words, they were used to 2-3 
> second response times and so if the results came back as fast as what Solr 
> has been doing, then surely there's a bug.  There's no bug.  :) Admittedly, I 
> think it was a bit of an apples and oranges comparison but I love that quote 
> nonetheless.

I implemented Solr at Netflix and now I work at MarkLogic, and I strongly agree 
that the comparison is apples and oranges. MarkLogic does run very fast on very 
large datasets, so maybe that prototype was built to show functionality instead 
of speed. Also, MarkLogic already has a lot of stuff that is still in the 
future for Solr, like true real-time search, updating fields, and geospatial 
search.

Next time, invite the MarkLogic people, too. :-)

wunder
--
Walter Underwood
Lead Engineer
MarkLogic



Re: I was at a search vendor round table today...

2010-09-22 Thread Alexander Kanarsky
>  He said some other things about a huge petabyte hosted search collection 
> they have used by banks..

In the context of your discussion, this reference sounds really, really funny... :)

-Alexander

On Wed, Sep 22, 2010 at 1:17 PM, Grant Ingersoll  wrote:
>
> On Sep 22, 2010, at 2:04 PM, Smiley, David W. wrote:
>
>> (I don't twitter or blog so I thought I'd send this message here)
>>
>> Today at work (at MITRE outside DC) there was (is) a day of technical 
>> presentations about topics related to information dissemination and 
>> discovery (broad squishy words there, but mostly covered "search") at which 
>> I spoke about the value of faceting, and gave a quick Solr pitch.  There was 
>> an hour vendor panel in which a representative from Autonomy, Microsoft 
>> (i.e. FAST), Google, Vivisimo, and Endeca had the opportunity to espouse the 
>> virtues of their product, and fit in an occasional jab at their competitors 
>> next to them.  In the absence of a suitable representative for Solr (e.g. 
>> Lucid) I pointed out how open-source Solr has "democratized" (i.e. made 
>> free) search and faceting when it used to require paying lots of money.  And 
>> I asked them how their products have reacted to this new reality.  Autonomy 
>> acknowledged they used to make millions on simple engagements in the distant 
>> past but that isn't the case these days.  He said some other things about a 
>> huge petabyte hosted search collection they have used by banks... I forget 
>> what else he said.  I forgot what Google said.  Vivisimo quoted Steve 
>> Ballmer, saying "open source is as free as a free puppy" (not a bad point 
>> IMO).
>
> Too funny.  Hadn't heard that one before.  Presumably meaning you have to 
> care and feed it, despite the fact that you really do love it and it is cute 
> as hell?  The care and feeding is true of the commercial ones, too, 
> especially in terms of cost for supporting features you never use, but love 
> (as in we love using this tool) is usually not a word I hear associated in 
> those respects too often, but of course that is likely self selecting.
>
>> Endeca claimed to be happy Solr exists because it raises the awareness of 
>> faceted search, but then claimed it would not scale and they should then 
>> upgrade to Endeca.  (!)  I found that claim ridiculous, of course.
>
> Having replaced all the above on a number of occasions w/ Solr at both a 
> significant cost savings on licensing, dev time, and hardware, I would agree 
> that claim is quite ridiculous.  Besides, in my experience, the scale claim 
> is silly.  Everyone (customers) says they need scale, but few of them really 
> know what scale is, so it is all relative.   For some, scale is 1M docs, for 
> others it's 1B+ docs;  for others it's 100K queries per day, for others it's 
> 100M per day.  (BTW, I've seen Lucene/Solr do both, just fine.  Not that it 
> is a free lunch, but neither are the other ones despite what they say.)
>
>>
>> Speaking of performance, on a large scale search project where we're using 
>> Solr in place of a MarkLogic prototype (because ML is so friggin expensive, 
>> for one reason), the search results were so fast (~150ms) vs. the ML's 
>> results of 2-3 seconds, that the UI engineers building the interface on top 
>> of the XML output thought Solr was broken because it was so fast.  The quote 
>> was "It's so fast, it's broken".    In other words, they were used to 2-3 
>> second response times and so if the results came back as fast as what Solr 
>> has been doing, then surely there's a bug.  There's no bug.  :)  Admittedly, 
>> I think it was a bit of an apples and oranges comparison but I love that 
>> quote nonetheless.
>
>
> I love it.  I have had the same experience where people think it's broken b/c 
> it's so fast.  Large vendor named above took 24 hours to index 4M records 
> (they weren't even doing anything fancy on the indexing side) and search was 
> slow too.  Solr took about 40 minutes to index all the content and search was 
> blazing.  Same content, faster indexing, better search results, a lot less 
> time.
>
> At any rate, enough of tooting our own horn.  Thanks for sharing!
>
> -Grant
>
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
>


Delete Dynamic Fields

2010-09-22 Thread Moiz Bhukhiya
Hi All:

I had used dynamic fields for some of my fields and then later decided to
make them static. I removed the dynamic field from the schema, but I still see
it in the admin interface (FIELD LIST). Could somebody please point out how I
can remove these dynamic fields?

Thanks,
Moiz


Re: Delete Dynamic Fields

2010-09-22 Thread Tom Hill
Delete all docs with the dynamic fields, and then optimize.

On Wed, Sep 22, 2010 at 1:58 PM, Moiz Bhukhiya  wrote:
> Hi All:
>
> I had used dynamic fields for some of my fields and then later decided to
> make them static. I removed the dynamic field from the schema, but I still see
> it in the admin interface (FIELD LIST). Could somebody please point out how I
> can remove these dynamic fields?
>
> Thanks,
> Moiz
>
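
A minimal SolrJ sketch of that, assuming the old dynamic field instances are
known by name (Lucene query syntax has no wildcard for field names, so each
concrete field, such as the hypothetical "color_s" here, must be targeted):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class PurgeDynamicField {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer solr =
        new CommonsHttpSolrServer("http://localhost:8983/solr");
    // Delete every document that has a value in the old dynamic field,
    // then optimize so the field drops out of the index metadata.
    solr.deleteByQuery("color_s:[* TO *]");
    solr.commit();
    solr.optimize();
    // (Re-index those documents afterwards with the new static field.)
  }
}

After the optimize rewrites the segments, the unused field should disappear
from the admin field list.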


Searches with a period (.) in the query

2010-09-22 Thread Siddharth Powar
Hi,

I am getting some weird results when searching in Solr. For certain searches
that have a period in the search term (e.g. q=ab.xyz), Solr returns the
results perfectly, but for some other searches (e.g. q=ab.pqr) Solr
returns 0 results even though the document is present. On the other hand, if the
search query is q=abpqr (without the period), Solr returns the correct documents.

The field type belongs to the class "solr.StrField".

Please let me know if i can provide with more info.

Thanks,
Sid


Re: Searches with a period (.) in the query

2010-09-22 Thread kenf_nc

Could it be a case-sensitivity issue? The StrField type is not analyzed, but
indexed/stored verbatim (from the schema comments). If you are looking for
ab.pqr but it is in fact ab.Pqr in the Solr document, it wouldn't find it.
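
One quick way to rule out query parsing is to quote (or escape) the term so
the period reaches the string field untouched. A sketch with SolrJ, where
the field name code_s and the URL are hypothetical:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ExactMatchCheck {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");

        String term = "ab.pqr";
        // quoting keeps the query parser from splitting the term;
        // newer SolrJ also offers ClientUtils.escapeQueryChars(term)
        SolrQuery q = new SolrQuery("code_s:\"" + term + "\"");
        QueryResponse rsp = server.query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}

If the quoted query matches but the bare q=ab.pqr does not, the issue is in
query parsing rather than in how the data was indexed.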
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Searches-with-a-period-in-the-query-tp1564780p1565057.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr Reporting

2010-09-22 Thread Adeel Qureshi
This probably isn't directly a solr-user type question, but it's close enough
that I am going to post it here. I have been using Solr for a few months now
and it works just out of this world, so I definitely love the software (and
obviously Lucene too). But I feel that the Solr output XML is in a kind of
weird format, a format that makes it difficult to plug Solr output XML into
any XML-reading tool or API. This whole concept of using

<int name="id">123</int>

instead of

<id>123</id>

doesn't make sense to me.

What I am trying to do now is set up a reporting system off of Solr. The
concept is simple: let the user do all the searches, faceting, etc., and
once they have settled on some results, allow them to export those results
to an Excel or PDF file. What I have set up right now is this: the export
feature reuses the same Solr query the user used to search, sends that query
to Solr again, gets all the results back, and simply iterates over the XML
and dumps all the data into an Excel file.

This has worked fine in most situations, but I want to improve the process
and specifically use JasperReports for reporting, with iReport to design my
report templates. That's where the Solr output XML format causes problems: I
can't figure out how to make it work with iReport, because the Solr XML has
no named nodes. Every field looks like the same node, so iReport can't
distinguish one column from another. I am thinking of a couple of solutions
and wanted some suggestions from you on how best to do it:

1. Receive the Solr output XML and convert it to a more readable XML form
that uses named nodes instead of nodes named by data type, e.g.

<int name="id">123</int>
<str name="name">xyz</str>

=>

<id>123</id>
<name>xyz</name>

and then feed that to the Jasper report template (see the sketch after this
list).

2. Use SolrJ to receive the Solr output as the NamedList result set it
returns. I haven't tried this method, so I am not sure how easy the
NamedList structure is to work with; I would be feeding a collection of
NamedList items to Jasper. If you have tried something like this, please
let me know how it worked out for you (a sketch follows below).
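
A minimal sketch of how the two options might fit together, assuming SolrJ's
CommonsHttpSolrServer and a hypothetical Solr URL: fetch the results once via
SolrJ (option 2) and rewrite them as the named-node XML of option 1 for the
Jasper template. Real code would also escape XML special characters in the
values:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class NamedNodeExport {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("*:*");   // reuse the user's query here
        query.setRows(1000);                      // export page size

        SolrDocumentList docs = server.query(query).getResults();

        // emit <id>123</id>-style XML instead of <int name="id">123</int>
        StringBuilder xml = new StringBuilder("<results>\n");
        for (SolrDocument doc : docs) {
            xml.append("  <doc>\n");
            for (String field : doc.getFieldNames()) {
                Object value = doc.getFieldValue(field); // NB: escape in real code
                xml.append("    <").append(field).append(">")
                   .append(value)
                   .append("</").append(field).append(">\n");
            }
            xml.append("  </doc>\n");
        }
        xml.append("</results>\n");
        System.out.println(xml);
    }
}

Field names that are not valid XML element names, and multi-valued fields,
would need extra handling; that is exactly the kind of thing the
SolrDocument/NamedList access makes explicit.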

I would appreciate absolutely any kind of comment on this.

Thanks
Adeel


How can I delete the entire contents of the index?

2010-09-22 Thread Igor Chudov
Let's say that I added a number of elements to Solr (I use
Webservice::Solr as the interface to do so).

Then I change my mind and want to delete them all.

How can I delete all contents of the database, but leave the database
itself, just empty?

Thanks

i


Re: Searches with a period (.) in the query

2010-09-22 Thread Siddharth Powar
Hey Ken,

Thanks for the reply. It's not a case-sensitivity issue. I wonder if this is
a bug in the way the data is indexed in Solr, as the behavior is not
consistent across searches of a similar type when a period (.) is used. Do
you think there could be some other issue?

Thanks,
Sid

On Wed, Sep 22, 2010 at 6:18 PM, kenf_nc  wrote:

>
> Could it be a case-sensitivity issue? The StrField type is not analyzed, but
> indexed/stored verbatim (from the schema comments). If you are looking for
> ab.pqr but it is in fact ab.Pqr in the Solr document, it wouldn't find it.
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Searches-with-a-period-in-the-query-tp1564780p1565057.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: How can I delete the entire contents of the index?

2010-09-22 Thread xu cheng
<delete><query>the query that fetches the data you wanna delete</query></delete>

I did it like this to delete my data.
best regards

2010/9/23 Igor Chudov 

> Let's say that I added a number of elements to Solr (I use
> Webservice::Solr as the interface to do so).
>
> Then I change my mind and want to delete them all.
>
> How can I delete all contents of the database, but leave the database
> itself, just empty?
>
> Thanks
>
> i
>


Re: How can I delete the entire contents of the index?

2010-09-22 Thread Gora Mohanty
On Thu, Sep 23, 2010 at 9:05 AM, Igor Chudov  wrote:
> Let's say that I added a number of elements to Solr (I use
> Webservice::Solr as the interface to do so).
>
> Then I change my mind and want to delete them all.
>
> How can I delete all contents of the database, but leave the database
> itself, just empty?

Not sure what you mean by "leave the database itself". Solr is not a normal
database, and thus there is not much sense in an empty index. In any case,
to delete all entries, see entry 2.7 in the Solr FAQ on the Wiki:
http://wiki.apache.org/solr/FAQ#How_can_I_delete_all_documents_from_my_index.3F

Regards,
Gora


Re: is indexing single-threaded?

2010-09-22 Thread Andy

--- On Wed, 9/22/10, Andy  wrote:

> Does Solr index data in a single
> thread or can data be indexed concurrently in multiple
> threads?
> 

Can anyone help?


Re: is indexing single-threaded?

2010-09-22 Thread Ryan McKinley
Multiple threads work well.

If you are using SolrJ, check StreamingUpdateSolrServer for an
implementation that will keep X number of threads busy.

Your mileage will vary, but in general I find a reasonable thread
count is ~ (number of cores) + 1.
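
A sketch of multi-threaded indexing with StreamingUpdateSolrServer; the URL,
queue size, and thread count are illustrative values to tune for your own
hardware:

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ConcurrentIndexing {
    public static void main(String[] args) throws Exception {
        // queue up to 100 docs and drain them with 4 background threads
        // (rule of thumb above: threads ~ number of cores + 1)
        StreamingUpdateSolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);

        for (int i = 0; i < 10000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", i);
            doc.addField("title", "document " + i);
            server.add(doc);   // returns quickly; worker threads do the sending
        }
        server.commit();       // queued adds are flushed before the commit goes out
    }
}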


On Wed, Sep 22, 2010 at 5:52 AM, Andy  wrote:
> Does Solr index data in a single thread or can data be indexed concurrently 
> in multiple threads?
>
> Thanks
> Andy
>
>
>
>


Re: How can I delete the entire contents of the index?

2010-09-22 Thread Ryan McKinley
<delete><query>*:*</query></delete>

will leave you a fresh index
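
For SolrJ users, a self-contained sketch of the same operation (the URL is
hypothetical); the commit makes the now-empty index visible to searchers:

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class DeleteAll {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        server.deleteByQuery("*:*"); // match every document
        server.commit();             // expose the empty index
    }
}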


On Thu, Sep 23, 2010 at 12:50 AM, xu cheng  wrote:
> <delete><query>the query that fetches the data you wanna delete</query></delete>
>
> I did it like this to delete my data.
> best regards
>
> 2010/9/23 Igor Chudov 
>
>> Let's say that I added a number of elements to Solr (I use
>> Webservice::Solr as the interface to do so).
>>
>> Then I change my mind and want to delete them all.
>>
>> How can I delete all contents of the database, but leave the database
>> itself, just empty?
>>
>> Thanks
>>
>> i
>>
>


Re: Concurrent DB updates and delta import misses few records

2010-09-22 Thread Shashikant Kore
Thanks for the pointer, Shawn. It definitely is useful.

I am wondering whether you could retrieve minDid from Solr itself rather
than storing it externally. The max id from the Solr index and the max id
from the DB should define the lower and upper bounds, respectively, of the
delta range. Am I missing something?
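
A sketch of that idea with SolrJ and JDBC, using the hypothetical names from
Shawn's setup (a numeric unique field did, a docs table, and a DIH handler
that accepts minDid/maxDid parameters); the MySQL driver is assumed to be on
the classpath:

import java.net.URL;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocumentList;

public class DeltaImportTrigger {
    public static void main(String[] args) throws Exception {
        // lower bound: highest did already in the index
        CommonsHttpSolrServer solr =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(1);
        q.addSortField("did", SolrQuery.ORDER.desc);
        SolrDocumentList hits = solr.query(q).getResults();
        long minDid = hits.isEmpty()
            ? 0L : Long.parseLong(hits.get(0).getFieldValue("did").toString());

        // upper bound: highest did currently in the database
        Connection conn = DriverManager.getConnection(
            "jdbc:mysql://localhost/mydb", "user", "pass");
        ResultSet rs = conn.createStatement()
            .executeQuery("SELECT MAX(did) FROM docs");
        rs.next();
        long maxDid = rs.getLong(1);
        conn.close();

        // hand both bounds to DIH, as in Shawn's entity
        new URL("http://localhost:8983/solr/dataimport?command=delta-import"
                + "&minDid=" + minDid + "&maxDid=" + maxDid)
            .openStream().close();
    }
}

Since the lower bound is re-derived from the index on every run, a failed
import simply leaves minDid where it was, so nothing needs to be persisted
externally.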

--shashi

On Wed, Sep 22, 2010 at 6:47 PM, Shawn Heisey  wrote:

>  On 9/22/2010 1:39 AM, Shashikant Kore wrote:
>
>> Hi,
>>
>> I'm using DIH to index records from a database. After every update on
>> (MySQL) DB, Solr DIH is invoked for delta import.  In my tests, I have
>> observed that if db updates and DIH import is happening concurrently,
>> import
>> misses few records.
>>
>> Here is how it happens.
>>
>> The table has a column 'lastUpdated' which has default value of current
>> timestamp. Many records are added to database in a single transaction that
>> takes several seconds. For example, if 10,000 rows are being inserted, the
>> rows may get timestamp values from '2010-09-20 18:21:20' to '2010-09-20
>> 18:21:26'. These rows become visible only after transaction is committed.
>> That happens at, say, '2010-09-20 18:21:30'.
>>
>> If the Solr import gets triggered at '18:21:29', it will use the timestamp
>> of the last import for the delta query. This import will not see the
>> records added in the aforementioned transaction, as the transaction was not
>> committed at that instant. After this import, dataimport.properties will
>> have the last index time as '18:21:29'. The next import will not be able to
>> get all the rows of the previously referred transaction, as some of the
>> rows have timestamps earlier than '18:21:29'.
>>
>> While I am testing extreme conditions, there is a possibility of missing
>> out
>> on some data.
>>
>> I could not find any solution in the Solr framework to handle this. The
>> table has an auto-increment key, and all updates are deletes followed by
>> inserts. So, having a last_indexed_id would have helped, where
>> last_indexed_id is the max value of id fetched in that import. The query
>> would then become "SELECT id ... WHERE id > last_indexed_id". I suppose
>> Solr does not have any provision like this.
>>
>> Two options I could think of are:
>> (a) Ensure at the application level that DB updates and DIH import
>> requests do not run concurrently.
>> (b) Use exclusive locking during DB updates.
>>
>> What is the best way to address this problem?
>>
>
> Shashi,
>
> I was not solving the same problem, but perhaps you can adapt my solution
> to yours.  My main problem was that I don't have a modified date in my
> database, and due to the size of the table, it is impractical to add one.
>  Instead, I chose to track the database primary key (a simple autoincrement)
> outside of Solr and pass min/max values into DIH for it to use in the SELECT
> statement.  You can see a simplified version of my entity here, with a URL
> showing how to send the parameters in via the dataimport GET:
>
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg40466.html
>
> The update script that runs every two minutes gets MAX(did) from the
> database, retrieves the minDid from a file on an NFS share, and runs a
> delta-import with those two values.  When the import is reported successful,
> it writes the maxDid value to the minDid file on the network share for the
> next run.  If the import fails, it sends an alarm and doesn't update the
> minDid.
>
> Shawn
>
>