Re: MoreLikeThis Question

2012-02-15 Thread Michael Jakl
Hi!

On Wed, Feb 15, 2012 at 07:27, Jamie Johnson  wrote:
> Is there anyway with MLT to say get similar based on all fields or is
> it always a requirement to specify the fields?

It seems that's not the case. But you could append the fields parameter
in the solrconfig.xml:

 ...
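
A sketch of what such a configuration might look like (handler name and
field list are illustrative, not from the original message):

<requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
  <lst name="defaults">
    <str name="mlt.fl">title,description,tags</str>
  </lst>
</requestHandler>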


Cheers,
Michael


RE: OR-FilterQuery

2012-02-15 Thread spring
> > q=some text
> > fq=id:(1 OR 2 OR 3...)
> >
> > Or should I rather use q=some text AND id:(1 OR 2 OR 3...)?
> >
> 1. These two options produce different scoring.
> 2. If you hit the same fq=id:(1 OR 2 OR 3...) many times you get
> a benefit due
> to reading the docset from the heap (the filterCache) instead of searching on disk.
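
(To illustrate the two forms -- values illustrative:

q=some text&fq=id:(1 OR 2 OR 3)      <- id clause cached as a filter, no effect on the score
q=(some text) AND id:(1 OR 2 OR 3)   <- id clause parsed into the main query and scored)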

OK, understood.
Thank you.



RE: OR-FilterQuery

2012-02-15 Thread spring
> In other words, there's no attempt to decompose the fq clause
> and store parts of it in the cache, it's exact-match or
> nothing.
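
(For example -- illustrative:

fq=id:(1 OR 2 OR 3)   <- cached under this exact query string
fq=id:(3 OR 2 OR 1)   <- logically the same set, but a separate cache entry)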

Ah ok, thank you.



Solr as a part of an API to unburden databases

2012-02-15 Thread Ramo Karahasan
Hi,

 

does anyone on this mailing list use Solr as an API backend to avoid database
queries? I know that this depends on the type of data. Imagine you have
something like Quora's Q&A system, which is mostly just text. If I embedded
some of these Q&A items into my personal site and invoked the Quora API, I
guess they would do some database operations.

Would it be possible to call the Quora API so that it internally calls Solr
and streams the results back to my website?

This should be highly configurable, but the advantage would be that it
would unburden the databases.

 

There would be something like a three-layer architecture:

Client -> API (does some authorization/authentication checks) -> Solr
Solr -> API (maybe filters the data, removes unofficial data, etc.) -> Client

 

 

I'm not really familiar with that kind of architecture, and therefore don't
know if it makes any sense.

Any comments are appreciated!

 

Best regards,

Ramo



MoreLikeThis Requesthandler

2012-02-15 Thread Molidor, Robert
Hi,
I'm quite new to Solr. We want to find similar documents based on a 
MoreLikeThis query. In general this works fine and gives us reasonable results. 
Now we want to influence the result score by ranking more recent documents 
higher than older documents. Is this possible with the MoreLikeThis 
Requesthandler? If so, how can we achieve this?

Thanks in advance,
Robert



Error Indexing in solr 3.5

2012-02-15 Thread mechravi25
Hi,

When I tried to index in Solr 3.5 I got the following exception:

org.apache.solr.client.solrj.SolrServerException: Error executing query
at
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
at com.quartz.test.FullImport.callIndex(FullImport.java:80)
at
com.quartz.test.GetObjectTypes.checkObjectTypeProp(GetObjectTypes.java:245)
at com.quartz.test.GetObjectTypes.execute(GetObjectTypes.java:640)
at com.quartz.test.QuartzSchedMain.main(QuartzSchedMain.java:55)
Caused by: java.lang.RuntimeException: Invalid version or the data in not in
'javabin' format
at 
org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
at
org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:39)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:466)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
at
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)



I placed the latest SolrJ 3.5 jar in the example/solr/lib directory and then
restarted Solr, but I am still getting the above-mentioned exception.

Please let me know if I am missing anything.


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Error-Indexing-in-solr-3-5-tp3746735p3746735.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Highlighting stopwords

2012-02-15 Thread O. Klein

Koji Sekiguchi wrote
> 
> (12/02/14 22:25), O. Klein wrote:
>> I have not been able to find any logic in the behavior of hl.q and how it
>> analyses the query. Could you explain how it is supposed to work?
> 
> Nothing special on hl.q. If you use hl.q, the value of it will be used for
> highlighting rather than the value of q. There's no tricks, I think.
> 
> koji
> -- 
> Apache Solr Query Log Visualizer
> http://soleami.com/
> 

Field definitions:
content_text (no stopwords, only synonyms in index)
content_hl (stopwords, synonyms in index and query, and only field in hl.fl)

Searching is done with edismax on content_text

1. If I use a query like hl.q=spell Check it doesn't highlight terms with
uppercase, synonyms get highlighted (all fields have LowerCaseFilterFactory)

2. hl.q=content_hl:(spell Check) also highlights terms with uppercase,
synonyms are not highlighted

3. hl.q=content_hl:(spell Check) content_text:(spell Check) highlights terms
with uppercase and synonyms, but sometimes gives no highlights at all.

So if 1 also highlighted terms with uppercase I would get the behavior I need. I
can work around this on the client side, but maybe it's a bug?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Highlighting-stopwords-tp3681901p3746817.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr binary response for C#?

2012-02-15 Thread Jan Høydahl
Hi,

I just created a JIRA to investigate an Avro based serialization format for 
Solr: https://issues.apache.org/jira/browse/SOLR-3135
You're welcome to contribute. Guess we'll first need to define schemas, then
create an AvroResponseWriter, and then add support in the C# Solr client.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 14. feb. 2012, at 15:14, Erick Erickson wrote:

> It's not as compact as binary format, but would just using something
> like JSON help enough? This is really simple, just specify
> &wt=json (there's a method to set this on the server, at least in Java).
> 
> Otherwise, you might get a more knowledgeable response on the
> C# java list, I'm frankly clueless
> 
> Best
> Erick
> 
> On Mon, Feb 13, 2012 at 1:15 PM, naptowndev  wrote:
>> Admittedly I'm new to this, but the project we're working on feeds results
>> from Solr to an ASP.net application.  Currently we are using XML, but our
>> payloads can be rather large, some up to 17MB.  We are looking for a way to
>> minimize that payload and increase performance and I'm curious if there's
>> anything anyone has been working out that creates a binary response that can
>> be read by C# (similar to the javabin response built into Solr).
>> 
>> That, or if anyone has experience implementing an external protocol like
>> Thrift with Solr and consuming it with C# - again all in the effort to
>> increase performance across the wire and while being consumed.
>> 
>> Any help and direction would be greatly appreciated!
>> 
>> Thanks!
>> 
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/Solr-binary-response-for-C-tp3741101p3741101.html
>> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Stemming and accents (HunspellStemFilterFactory)

2012-02-15 Thread Jan Høydahl
Or if you know that you'll always strip accents in your search you may 
pre-process your pt_PT.dic to remove accents from it and use that custom 
dictionary instead in Solr.

Another alternative could be to extend HunSpellFilter so that it can take in 
the class name of a TokenFilter class to apply when parsing the dictionary into 
memory.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 14. feb. 2012, at 16:27, Chantal Ackermann wrote:

> Hi Bráulio,
> 
> I don't know about HunspellStemFilterFactory especially but concerning
> accents:
> 
> There are several accent filters that will remove accents from your
> tokens. If the Hunspell filter factory requires the accents, then simply
> add the accent filters after Hunspell in your index and query filter
> chains.
> 
> You would then have Hunspell produce the tokens as result of the
> stemming and only afterwards the accents would be removed (your example:
> 'forum' instead of 'fórum'). Do the same on the query side in case
> someone inputs accents.
> 
> Accent filters are:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUTokenizerFactory
> (lowercases, as well!)
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory
> 
> and others on that page.
> 
> Chantal
> 
> 
> On Tue, 2012-02-14 at 14:48 +0100, Bráulio Bhavamitra wrote:
>> Hello all,
>> 
> >> I'm evaluating the HunspellStemFilterFactory and found that it works with a
> >> pt_PT dictionary.
> >> 
> >> For example, if I search for 'fóruns' it stems it to 'fórum' and then finds
> >> 'fórum' references.
> >> 
> >> But if I search for 'foruns' (without accent),
> >> then HunspellStemFilterFactory cannot stem the
> >> word, as it does not exist in its dictionary.
> >> 
> >> Is there any way to make HunspellStemFilterFactory work regardless of accent
> >> differences?
>> 
>> best,
>> bráulio
> 
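
A minimal sketch of such a chain (tokenizer choice and field type name are
illustrative; the same chain would be applied on the index and query side):

<fieldType name="text_pt" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.HunspellStemFilterFactory" dictionary="pt_PT.dic" affix="pt_PT.aff" ignoreCase="true"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>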



Re: Semantic autocomplete with Solr

2012-02-15 Thread Jan Høydahl
Check out 
http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/
You can feed it anything, such as a log of previous searches, or a pre-computed 
dictionary of "item" + "color" combinations that exist in your DB etc.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 14. feb. 2012, at 23:46, Roman Chyla wrote:

> done something along these lines:
> 
> https://svnweb.cern.ch/trac/rcarepo/wiki/InspireAutoSuggest#Autosuggestautocompletefunctionality
> 
> but you would need MontySolr for that - 
> https://github.com/romanchyla/montysolr
> 
> roman
> 
> On Tue, Feb 14, 2012 at 11:10 PM, Octavian Covalschi
>  wrote:
>> Hey guys,
>> 
>> Has anyone done any kind of "smart" autocomplete? Let's say we have a web
>> store, and we'd like to autocomplete user's searches. So if I'll type in
>> "jacket" next word that will be suggested should be something related to
>> jacket (color, fabric) etc...
>> 
>> It seems to me I have to structure this data in a particular way, but structured
>> that way I could do it without Solr, so I was wondering if Solr could help us.
>> 
>> Thank you in advance.



Re: Solr as a part of an API to unburden databases

2012-02-15 Thread Tomas Zerolo
On Wed, Feb 15, 2012 at 11:48:14AM +0100, Ramo Karahasan wrote:
> Hi,
> 
>  
> 
> does anyone on this mailing list use Solr as an API backend to avoid database
> queries? [...]

Like in a... cache?

Why not use a cache then? (memcached, for example, but there are more).

Regards
-- tomás


Re: Solr soft commit feature

2012-02-15 Thread Nagendra Nagarajayya


If you are looking for NRT functionality with Solr 3.5, you may want to 
take a look at Solr 3.5 with RankingAlgorithm. This allows you to 
add/update documents without a commit while being able to search 
concurrently. The add/update performance to add 1m docs is about 5000 
docs in about 498 ms  with one concurrent searcher. You can get more 
information about Solr 3.5 with RankingAlgorithm from here:


http://tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x

Regards,

- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

On 2/14/2012 4:41 PM, Dipti Srivastava wrote:

Hi All,
Is there a way to soft commit in the current released version of solr 3.5?

Regards,
Dipti Srivastava










Re: MoreLikeThis Question

2012-02-15 Thread Chantal Ackermann
Hi,

you would not want to include the unique ID and similar stuff, though?
No idea whether it would impact the number of hits but it would most
probably influence the scoring if nothing else.

E.g. if you compare by certain fields, I would expect that a score of
1.0 indicates a match on all of those fields (haven't tested that
explicitly, though). If the unique ID is included you could never reach
that score.

Just my 2 cents...

Chantal


On Wed, 2012-02-15 at 07:27 +0100, Jamie Johnson wrote:
> Is there anyway with MLT to say get similar based on all fields or is
> it always a requirement to specify the fields?



Re: Solr as a part of an API to unburden databases

2012-02-15 Thread Chantal Ackermann
> > 
> > does anyone on this mailing list use Solr as an API backend to avoid database
> > queries? [...]
> 
> Like in a... cache?
> 
> Why not use a cache then? (memcached, for example, but there are more).
> 

Good point. A cache only uses lookup by one kind of cache key while SOLR
provides lookup by ... well... any search configuration that your index
setup (mainly the schema) supports.

If the "database queries" always do a find by unique id, then use a
cache. Otherwise using SOLR is a valid option.


Chantal



Re: Error Indexing in solr 3.5

2012-02-15 Thread Chantal Ackermann
Hi,

I've got these errors when my client used a different SolrJ version from
the SOLR server it connected to:

SERVER 3.5  responding ---> CLIENT some other version

You haven't provided any information on your client, though.

Chantal

On Wed, 2012-02-15 at 13:09 +0100, mechravi25 wrote:
> Hi,
> 
> When I tried to index in Solr 3.5 I got the following exception:
> 
> org.apache.solr.client.solrj.SolrServerException: Error executing query
>   at
> org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:95)
>   at com.quartz.test.FullImport.callIndex(FullImport.java:80)
>   at
> com.quartz.test.GetObjectTypes.checkObjectTypeProp(GetObjectTypes.java:245)
>   at com.quartz.test.GetObjectTypes.execute(GetObjectTypes.java:640)
>   at com.quartz.test.QuartzSchedMain.main(QuartzSchedMain.java:55)
> Caused by: java.lang.RuntimeException: Invalid version or the data in not in
> 'javabin' format
>   at 
> org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
>   at
> org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:39)
>   at
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:466)
>   at
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:243)
>   at
> org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
> 
> 
> 
> I placed the latest SolrJ 3.5 jar in the example/solr/lib directory and then
> restarted Solr, but I am still getting the above-mentioned exception.
> 
> Please let me know if I am missing anything.
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Error-Indexing-in-solr-3-5-tp3746735p3746735.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Facet on TrieDateField field without including date

2012-02-15 Thread Yonik Seeley
On Wed, Feb 15, 2012 at 8:58 AM, Jamie Johnson  wrote:
> I would like to be able to facet based on the time of
> day items are purchased across a date span.  I was hoping that I could
> do a query of something like date:[NOW-1WEEK TO NOW] and then specify
> I wanted facet broken into hourly bins.  Is this possible?  Do I

Will range faceting do everything you need?
http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range
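
For example, hourly buckets over the last week might look like this
(parameter values illustrative; the "+" in the gap has to be URL-escaped):

facet=true&facet.range=date&facet.range.start=NOW/DAY-7DAYS&facet.range.end=NOW&facet.range.gap=%2B1HOUR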

-Yonik
lucidimagination.com


Re: Facet on TrieDateField field without including date

2012-02-15 Thread Jamie Johnson
I think it would if I indexed the time information separately.  Which
was my original thought, but I was hoping to store this in one field
instead of 2.  So my idea was I'd store the time portion as a
number (an int might suffice from 0 to 24 since I only need this to
have that level of granularity) then do range queries over that.  I
couldn't think of a way to do this using the date field though because
it would give me bins broken up by hours in a particular day,
something like

2012-01-01-00:00:00 - 2012-01-01-01:00:00 10
2012-01-01-01:00:00 - 2012-01-01-02:00:00 20
2012-01-01-02:00:00 - 2012-01-01-03:00:00 5

But what I really want is just the time portion across all days

00:00:00 - 01:00:00 10
01:00:00 - 02:00:00 20
02:00:00 - 03:00:00 5

I would then use the date field to limit the time range in which the
facet was operating.  Does that make sense?  Is there a more efficient
way of doing this?

On Wed, Feb 15, 2012 at 9:16 AM, Yonik Seeley
 wrote:
> On Wed, Feb 15, 2012 at 8:58 AM, Jamie Johnson  wrote:
>> I would like to be able to facet based on the time of
>> day items are purchased across a date span.  I was hoping that I could
>> do a query of something like date:[NOW-1WEEK TO NOW] and then specify
>> I wanted facet broken into hourly bins.  Is this possible?  Do I
>
> Will range faceting do everything you need?
> http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range
>
> -Yonik
> lucidimagination.com


Re: MoreLikeThis Question

2012-02-15 Thread Jamie Johnson
Yes, agree that ID would be one that would need to be ignored.  I
don't think specifying them is too difficult; I was just curious if it
was possible to do this or not.

On Wed, Feb 15, 2012 at 8:41 AM, Chantal Ackermann
 wrote:
> Hi,
>
> you would not want to include the unique ID and similar stuff, though?
> No idea whether it would impact the number of hits but it would most
> probably influence the scoring if nothing else.
>
> E.g. if you compare by certain fields, I would expect that a score of
> 1.0 indicates a match on all of those fields (haven't tested that
> explicitly, though). If the unique ID is included you could never reach
> that score.
>
> Just my 2 cents...
>
> Chantal
>
>
> On Wed, 2012-02-15 at 07:27 +0100, Jamie Johnson wrote:
>> Is there anyway with MLT to say get similar based on all fields or is
>> it always a requirement to specify the fields?
>


Re: Facet on TrieDateField field without including date

2012-02-15 Thread Yonik Seeley
On Wed, Feb 15, 2012 at 9:30 AM, Jamie Johnson  wrote:
> I think it would if I indexed the time information separately.  Which
> was my original thought, but I was hoping to store this in one field
> instead of 2.  So my idea was I'd store the time portion as a
> number (an int might suffice from 0 to 24 since I only need this to
> have that level of granularity) then do range queries over that.  I
> couldn't think of a way to do this using the date field though because
> it would give me bins broken up by hours in a particular day,
> something like
>
> 2012-01-01-00:00:00 - 2012-01-01-01:00:00 10
> 2012-01-01-01:00:00 - 2012-01-01-02:00:00 20
> 2012-01-01-02:00:00 - 2012-01-01-03:00:00 5
>
> But what I really want is just the time portion across all days
>
> 00:00:00 - 01:00:00 10
> 01:00:00 - 02:00:00 20
> 02:00:00 - 03:00:00 5
>
> I would then use the date field to limit the time range in which the
> facet was operating.  Does that make sense?  Is there a more efficient
> way of doing this?

Hmm, no there's no way to do this.
Even if you were to write a custom faceting component, it seems like
it would still be very expensive to derive the hour of the day from ms
for every doc.

-Yonik
lucidimagination.com




> On Wed, Feb 15, 2012 at 9:16 AM, Yonik Seeley
>  wrote:
>> On Wed, Feb 15, 2012 at 8:58 AM, Jamie Johnson  wrote:
>>> I would like to be able to facet based on the time of
>>> day items are purchased across a date span.  I was hoping that I could
>>> do a query of something like date:[NOW-1WEEK TO NOW] and then specify
>>> I wanted facet broken into hourly bins.  Is this possible?  Do I
>>
>> Will range faceting do everything you need?
>> http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range
>>
>> -Yonik
>> lucidimagination.com


Re: Semantic autocomplete with Solr

2012-02-15 Thread Octavian Covalschi
Thank you! I'll check them out.

On Wed, Feb 15, 2012 at 6:50 AM, Jan Høydahl  wrote:

> Check out
> http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/
> You can feed it anything, such as a log of previous searches, or a
> pre-computed dictionary of "item" + "color" combinations that exist in your
> DB etc.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 14. feb. 2012, at 23:46, Roman Chyla wrote:
>
> > done something along these lines:
> >
> >
> https://svnweb.cern.ch/trac/rcarepo/wiki/InspireAutoSuggest#Autosuggestautocompletefunctionality
> >
> > but you would need MontySolr for that -
> https://github.com/romanchyla/montysolr
> >
> > roman
> >
> > On Tue, Feb 14, 2012 at 11:10 PM, Octavian Covalschi
> >  wrote:
> >> Hey guys,
> >>
> >> Has anyone done any kind of "smart" autocomplete? Let's say we have a
> web
> >> store, and we'd like to autocomplete user's searches. So if I'll type in
> >> "jacket" next word that will be suggested should be something related to
> >> jacket (color, fabric) etc...
> >>
> >> It seems to me I have to structure this data in a particular way, but
> >> structured that way I could do it without Solr, so I was wondering if
> >> Solr could help us.
> >>
> >> Thank you in advance.
>
>


Re: Facet on TrieDateField field without including date

2012-02-15 Thread Ted Dunning
Use multiple fields and you get what you want.  The extra fields are going
to cost very little and will have a big positive impact.

On Wed, Feb 15, 2012 at 9:30 AM, Jamie Johnson  wrote:

> I think it would if I indexed the time information separately.  Which
> was my original thought, but I was hoping to store this in one field
> instead of 2.  So my idea was I'd store the time portion as a
> number (an int might suffice from 0 to 24 since I only need this to
> have that level of granularity) then do range queries over that.  I
> couldn't think of a way to do this using the date field though because
> it would give me bins broken up by hours in a particular day,
> something like
>
> 2012-01-01-00:00:00 - 2012-01-01-01:00:00 10
> 2012-01-01-01:00:00 - 2012-01-01-02:00:00 20
> 2012-01-01-02:00:00 - 2012-01-01-03:00:00 5
>
> But what I really want is just the time portion across all days
>
> 00:00:00 - 01:00:00 10
> 01:00:00 - 02:00:00 20
> 02:00:00 - 03:00:00 5
>
> I would then use the date field to limit the time range in which the
> facet was operating.  Does that make sense?  Is there a more efficient
> way of doing this?
>
> On Wed, Feb 15, 2012 at 9:16 AM, Yonik Seeley
>  wrote:
> > On Wed, Feb 15, 2012 at 8:58 AM, Jamie Johnson 
> wrote:
> >> I would like to be able to facet based on the time of
> >> day items are purchased across a date span.  I was hoping that I could
> >> do a query of something like date:[NOW-1WEEK TO NOW] and then specify
> >> I wanted facet broken into hourly bins.  Is this possible?  Do I
> >
> > Will range faceting do everything you need?
> > http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range
> >
> > -Yonik
> > lucidimagination.com
>


Re: Facet on TrieDateField field without including date

2012-02-15 Thread Jamie Johnson
Thanks guys that's what I figured, just wanted to make sure I was
going down the right path.

On Wed, Feb 15, 2012 at 9:55 AM, Ted Dunning  wrote:
> Use multiple fields and you get what you want.  The extra fields are going
> to cost very little and will have a big positive impact.
>
> On Wed, Feb 15, 2012 at 9:30 AM, Jamie Johnson  wrote:
>
>> I think it would if I indexed the time information separately.  Which
>> was my original thought, but I was hoping to store this in one field
>> instead of 2.  So my idea was I'd store the time portion as a
>> number (an int might suffice from 0 to 24 since I only need this to
>> have that level of granularity) then do range queries over that.  I
>> couldn't think of a way to do this using the date field though because
>> it would give me bins broken up by hours in a particular day,
>> something like
>>
>> 2012-01-01-00:00:00 - 2012-01-01-01:00:00 10
>> 2012-01-01-01:00:00 - 2012-01-01-02:00:00 20
>> 2012-01-01-02:00:00 - 2012-01-01-03:00:00 5
>>
>> But what I really want is just the time portion across all days
>>
>> 00:00:00 - 01:00:00 10
>> 01:00:00 - 02:00:00 20
>> 02:00:00 - 03:00:00 5
>>
>> I would then use the date field to limit the time range in which the
>> facet was operating.  Does that make sense?  Is there a more efficient
>> way of doing this?
>>
>> On Wed, Feb 15, 2012 at 9:16 AM, Yonik Seeley
>>  wrote:
>> > On Wed, Feb 15, 2012 at 8:58 AM, Jamie Johnson 
>> wrote:
>> >> I would like to be able to facet based on the time of
>> >> day items are purchased across a date span.  I was hoping that I could
>> >> do a query of something like date:[NOW-1WEEK TO NOW] and then specify
>> >> I wanted facet broken into hourly bins.  Is this possible?  Do I
>> >
>> > Will range faceting do everything you need?
>> > http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range
>> >
>> > -Yonik
>> > lucidimagination.com
>>


Re: Facet on TrieDateField field without including date

2012-02-15 Thread Chantal Ackermann
I've done something like that by calculating the hours during indexing
time (in the script part of the DIH config using java.util.Calendar
which gives you all those field values without effort). I've also
extracted information on which weekday it is (using the integer
constants of Calendar).
If you need this only for one timezone it is straightforward, but if the
queries come from different time zones you'll have to shift
appropriately.

I found that pre-calculating has the advantage that you end up with very
simple data: simple integers. And it makes it quite easy to build more
complex queries on that. For example I have created a grid (built from
facets) where the columns are the weekdays and the rows are the hours of
day. The facets are created using a field containing the combination of
weekday and hour of day.
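
A rough sketch of that kind of DIH script transformer (entity, column and
field names are illustrative, not the actual config; assumes the source
column arrives as a java.util.Date/Timestamp):

<dataConfig>
  <script><![CDATA[
    // add hour-of-day and weekday fields computed from the date column
    function addTimeFields(row) {
      var cal = java.util.Calendar.getInstance();
      cal.setTime(row.get('purchase_date'));
      row.put('hour_of_day', cal.get(java.util.Calendar.HOUR_OF_DAY)); // 0..23
      row.put('weekday', cal.get(java.util.Calendar.DAY_OF_WEEK));     // 1..7
      return row;
    }
  ]]></script>
  <document>
    <entity name="orders" transformer="script:addTimeFields"
            query="select * from orders">
      ...
    </entity>
  </document>
</dataConfig>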


Chantal



On Wed, 2012-02-15 at 15:49 +0100, Yonik Seeley wrote:
> On Wed, Feb 15, 2012 at 9:30 AM, Jamie Johnson  wrote:
> > I think it would if I indexed the time information separately.  Which
> > was my original thought, but I was hoping to store this in one field
> > instead of 2.  So my idea was I'd store the time portion as a
> > number (an int might suffice from 0 to 24 since I only need this to
> > have that level of granularity) then do range queries over that.  I
> > couldn't think of a way to do this using the date field though because
> > it would give me bins broken up by hours in a particular day,
> > something like
> >
> > 2012-01-01-00:00:00 - 2012-01-01-01:00:00 10
> > 2012-01-01-01:00:00 - 2012-01-01-02:00:00 20
> > 2012-01-01-02:00:00 - 2012-01-01-03:00:00 5
> >
> > But what I really want is just the time portion across all days
> >
> > 00:00:00 - 01:00:00 10
> > 01:00:00 - 02:00:00 20
> > 02:00:00 - 03:00:00 5
> >
> > I would then use the date field to limit the time range in which the
> > facet was operating.  Does that make sense?  Is there a more efficient
> > way of doing this?
> 
> Hmm, no there's no way to do this.
> Even if you were to write a custom faceting component, it seems like
> it would still be very expensive to derive the hour of the day from ms
> for every doc.
> 
> -Yonik
> lucidimagination.com
> 
> 
> 
> 
> > On Wed, Feb 15, 2012 at 9:16 AM, Yonik Seeley
> >  wrote:
> >> On Wed, Feb 15, 2012 at 8:58 AM, Jamie Johnson  wrote:
> >>> I would like to be able to facet based on the time of
> >>> day items are purchased across a date span.  I was hoping that I could
> >>> do a query of something like date:[NOW-1WEEK TO NOW] and then specify
> >>> I wanted facet broken into hourly bins.  Is this possible?  Do I
> >>
> >> Will range faceting do everything you need?
> >> http://wiki.apache.org/solr/SimpleFacetParameters#Facet_by_Range
> >>
> >> -Yonik
> >> lucidimagination.com



Solr multiple cores - multiple databases approach

2012-02-15 Thread Radu Toev
Hello,

I have a use case where I'm trying to integrate Solr:
 - 2 databases with the same schema
 - I want to index multiple entities from those databases
My question is: what is the best way of approaching this topic:
 - should I create a core for each database and inside that core create a
document with all information that I need?


Re: Solr multiple cores - multiple databases approach

2012-02-15 Thread Em
Hello Radu,

>  - I want to index multiple entities from those databases
Do you want to combine data of both databases within one document or are
you just interested in indexing both databases on their own?

If the second applies: you can do it within one core by using a field
(e.g. "source") to filter on, or create a core per database, which
would completely separate both indices from each other.
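
For the single-core variant each document carries its origin, and queries
filter on it, e.g. (field name illustrative):

q=some query&fq=source:db1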

It depends on your use case and access patterns. To tell you more, you
should provide us with more information.

Regards,
Em

Am 15.02.2012 16:23, schrieb Radu Toev:
> Hello,
> 
> I have a use case where I'm trying to integrate Solr:
>  - 2 databases with the same schema
>  - I want to index multiple entities from those databases
> My question is: what is the best way of approaching this topic:
>  - should I create a core for each database and inside that core create a
> document with all information that I need?
> 


Re: Solr soft commit feature

2012-02-15 Thread Dipti Srivastava
Hi Nagendra,

Certainly interesting! Would this work in a Master/slave setup where the
reads are from the slaves and all writes are to the master?

Regards,
Dipti Srivastava


On 2/15/12 5:40 AM, "Nagendra Nagarajayya" 
wrote:

>
>If you are looking for NRT functionality with Solr 3.5, you may want to
>take a look at Solr 3.5 with RankingAlgorithm. This allows you to
>add/update documents without a commit while being able to search
>concurrently. While adding 1m docs, the add/update throughput was about
>5000 docs per 498 ms with one concurrent searcher. You can get more
>information about Solr 3.5 with RankingAlgorithm from here:
>
>http://tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x
>
>Regards,
>
>- Nagendra Nagarajayya
>http://solr-ra.tgels.org
>http://rankingalgorithm.tgels.org
>
>On 2/14/2012 4:41 PM, Dipti Srivastava wrote:
>> Hi All,
>> Is there a way to soft commit in the current released version of solr
>>3.5?
>>
>> Regards,
>> Dipti Srivastava
>>
>>
>>
>>
>>
>>
>
>






Search for hashtags and mentions

2012-02-15 Thread Rohit
Hi,

 

We are using Solr version 3.5 to search through Tweets. I am using
WordDelimiterFilterFactory with the following setting, to be able to search for
@username or #hashtags:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1"
catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
preserveOriginal="1" handleAsChar="@#"/>
 

I saw the following patch but this doesn't seem to be working as I expected,
am I missing something?  

 

https://issues.apache.org/jira/browse/SOLR-2059 

 

But searching for @username is also returning results for just username, and
#hashtag is just returning results for hashtag. How can I achieve this?

 

Regards,

Rohit



problem with accents

2012-02-15 Thread R M


Hi,
I've got a problem with the configuration of Solr.
I have defined a new type of data, "text_fr", to handle accents like "é à è". I
have added this to my fieldtype definition: <filter class="solr.ISOLatin1AccentFilterFactory"/>
Everything seems to be OK, the data is added correctly. But when I go to this
address, http://localhost:8983/solr/admin, to do a search, there is a
problem.
If I search for "cherche" and "cherché" the results are different although they
should be the same, shouldn't they?
Thank you guys
Romain 
  

Re: problem with accents

2012-02-15 Thread Erick Erickson
Did you specify the correct field with the search? If you just entered the
word in the search box without the field, the search would be made against
your default search field (defined in schema.xml).

If you go to the "full interface" link on the admin page, you can then click
the debug:enable checkbox which will give you a lot more information
about what the parsed query looks like.

Best
Erick

On Wed, Feb 15, 2012 at 2:12 PM, R M  wrote:
>
>
> Hi,
> I've got a problem with the configuration of Solr.
> I have defined a new type of data, "text_fr", to handle accents like "é à è". I
> have added this to my fieldtype definition: <filter class="solr.ISOLatin1AccentFilterFactory"/>
> Everything seems to be OK, the data is added correctly. But when I go to this
> address, http://localhost:8983/solr/admin, to do a search, there is a
> problem.
> If I search for "cherche" and "cherché" the results are different although they
> should be the same, shouldn't they?
> Thank you guys
> Romain
>


update extracted docs

2012-02-15 Thread Harold Frayman
Hi

I have a solr 3.5 database which is populated by using /update/extract
(configured pretty much as per the examples) and additional metadata. The
uploads are handled by a perl-driven webapp which uses WebService::Solr
(which uses behind-the-scenes POSTing). That all works fine.

When I come to update the metadata associated with the stored docs, again
using my perl web app, I find the solr doc (by id), amend or append all the
changed metadata and use /update to re-post them. Again that works fine ...
but I'm getting nervous because I'm not sure why it works.

If I try to update only the changed fields for a single doc, the unchanged
fields are removed. Slightly surprising, but if that's what I should
expect, it's not difficult to accept.

So how come using /update doesn't remove the text content (and the indexing
on it) which was originally obtained using /update/extract? And can I
depend on it being there in future, after optimization, for example?

And if I can't, what is the best technique for updating metadata under
these circumstances?

Harold Frayman



Re: Search for hashtags and mentions

2012-02-15 Thread Emmanuel Espina
Do you want to index the hashtags and usernames to different fields?
Probably using

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternTokenizerFactory

will solve your problem.

However, I don't fully understand the problem you describe when you search.

Thanks
Emmanuel


2012/2/15 Rohit :
> Hi,
>
>
>
> We are using solr version 3.5 to search though Tweets, I am using
> WordDelimiterFactory with the following setting, to be able to search for
> @username or #hashtags
>
>
>
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
> preserveOriginal="1" handleAsChar="@#"/>
>
>
>
> I saw the following patch but this doesn't seem to be working as I expected,
> am I missing something?
>
>
>
> https://issues.apache.org/jira/browse/SOLR-2059
>
>
>
> But searching for @username is also returning results for just username, and
> #hashtag is just returning results for hashtag. How can I achieve this?
>
>
>
> Regards,
>
> Rohit
>


Re: update extracted docs

2012-02-15 Thread Emmanuel Espina
Solr or Lucene does not update documents. It deletes the old one and
replaces it with a new one when it has the same id.
So if you create a document with the changed fields only, and the same
id, and upload that one, the old one will be erased and replaced with
the new one. So THAT behaviour is expected.

For updating documents you simply add the entire document again with
the modified fields, or, if that is an expensive procedure and you want to
avoid re-running the metadata extraction, you can store all the fields,
retrieve the full document, create a new document with all the fields,
even the unmodified ones, and use the /update handler to add it
again.
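
In other words, the re-posted document must carry every field, not just the
changed ones. A minimal sketch (field names illustrative):

<add>
  <doc>
    <field name="id">doc1</field>
    <field name="title">updated title</field>
    <!-- ...every other stored field, re-sent in full... -->
  </doc>
</add>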

Does that answer your question?

Thanks
Emmanuel






2012/2/15 Harold Frayman :
> Hi
>
> I have a solr 3.5 database which is populated by using /update/extract
> (configured pretty much as per the examples) and additional metadata. The
> uploads are handled by a perl-driven webapp which uses WebService::Solr
> (which uses behind-the-scenes POSTing). That all works fine.
>
> When I come to update the metadata associated with the stored docs, again
> using my perl web app, I find the solr doc (by id), amend or append all the
> changed metadata and use /update to re-post them. Again that works fine ...
> but I'm getting nervous because I'm not sure why it works.
>
> If I try to update only the changed fields for a single doc, the unchanged
> fields are removed. Slightly surprising, but if that's what I should
> expect, it's not difficult to accept.
>
> So how come using /update doesn't remove the text content (and the indexing
> on it) which was originally obtained using /update/extract? And can I
> depend on it being there in future, after optimization, for example?
>
> And if I can't, what is the best technique for updating metadata under
> these circumstances?
>
> Harold Frayman
>


Re: feeding mahout cluster output back to solr

2012-02-15 Thread abhayd
I was looking at this
http://java.dzone.com/videos/configuring-mahout-clustering

Seems like it's possible, but can anyone shed more light, especially on the
part about mapping clusters back to the original docs?

abhay

--
View this message in context: 
http://lucene.472066.n3.nabble.com/feeding-mahout-cluster-output-back-to-solr-tp3745883p3748349.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can I rebuild an index and remove some fields?

2012-02-15 Thread Robert Stewart
I implemented an index shrinker and it works.  I reduced my test index
from 6.6 GB to 3.6 GB by removing a single shingled field I did not
need anymore.  I'm actually using Lucene.Net for this project so the code
is C#, using the Lucene.Net 2.9.2 API.  But the basic idea is:

Create an IndexReader wrapper that only enumerates the terms you want
to keep, and that removes terms from documents when returning
documents.

Use the SegmentMerger to re-write each segment (where each segment is
wrapped by the wrapper class), writing each new segment to a new directory.
Collect the SegmentInfos and do a commit in order to create a new
segments file in the new index directory.

Done - you now have a shrunk index with specified terms removed.

Implementation uses separate thread for each segment, so it re-writes
them in parallel.  Took about 15 minutes to do 770,000 doc index on my
macbook.


On Tue, Feb 14, 2012 at 10:12 PM, Li Li  wrote:
> I have roughly read the codes of 4.0 trunk. maybe it's feasible.
>    SegmentMerger.add(IndexReader) will add to be merged Readers
>    merge() will call
>      mergeTerms(segmentWriteState);
>      mergePerDoc(segmentWriteState);
>
>   mergeTerms() will construct fields from IndexReaders
>    for(int readerIndex=0;readerIndex<mergeState.readers.size();readerIndex++) {
>      final MergeState.IndexReaderAndLiveDocs r =
> mergeState.readers.get(readerIndex);
>      final Fields f = r.reader.fields();
>      final int maxDoc = r.reader.maxDoc();
>      if (f != null) {
>        slices.add(new ReaderUtil.Slice(docBase, maxDoc, readerIndex));
>        fields.add(f);
>      }
>      docBase += maxDoc;
>    }
>    So if you wrap your IndexReader and override its fields() method,
> maybe it will work for merging terms.
>
>    for DocValues, it can also override AtomicReader.docValues(). just
> return null for fields you want to remove. maybe it should
> traverse CompositeReader's getSequentialSubReaders() and wrapper each
> AtomicReader
>
>    other things like term vectors norms are similar.
> On Wed, Feb 15, 2012 at 6:30 AM, Robert Stewart wrote:
>
>> I was thinking if I make a wrapper class that aggregates another
>> IndexReader and filter out terms I don't want anymore it might work.   And
>> then pass that wrapper into SegmentMerger.  I think if I filter out terms
>> on GetFieldNames(...) and Terms(...) it might work.
>>
>> Something like:
>>
>> HashSet<string> ignoredTerms = ...;
>>
>> FilteringIndexReader wrapper=new FilterIndexReader(reader);
>>
>> SegmentMerger merger=new SegmentMerger(writer);
>>
>> merger.add(wrapper);
>>
>> merger.Merge();
>>
>>
>>
>>
>>
>> On Feb 14, 2012, at 1:49 AM, Li Li wrote:
>>
>> > for method 2, delete is wrong. we can't delete terms.
>> >   you also should hack with the tii and tis file.
>> >
>> > On Tue, Feb 14, 2012 at 2:46 PM, Li Li  wrote:
>> >
>> >> method1, dumping data
>> >> for stored fields, you can traverse the whole index and save it to
>> >> somewhere else.
>> >> for indexed but not stored fields, it may be more difficult.
>> >>    if the indexed and not stored field is not analyzed(fields such as
>> >> id), it's easy to get from FieldCache.StringIndex.
>> >>    But for analyzed fields, though theoretically it can be restored from
>> >> term vector and term position, it's hard to recover from index.
>> >>
>> >> method 2, hack with metadata
>> >> 1. indexed fields
>> >>      delete by query, e.g. field:*
>> >> 2. stored fields
>> >>       because all fields are stored sequentially. it's not easy to
>> delete
>> >> some fields. this will not affect search speed. but if you want to get
>> >> stored fields,  and the useless fields are very long, then it will slow
>> >> down.
>> >>       also it's possible to hack with it. but need more effort to
>> >> understand the index file format  and traverse the fdt/fdx file.
>> >>
>> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html
>> >>
>> >> this will give you some insight.
>> >>
>> >>
>> >> On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart > >wrote:
>> >>
>> >>> Lets say I have a large index (100M docs, 1TB, split up between 10
>> >>> indexes).  And a bunch of the "stored" and "indexed" fields are not
>> used in
>> >>> search at all.  In order to save memory and disk, I'd like to rebuild
>> that
>> >>> index *without* those fields, but I don't have original documents to
>> >>> rebuild entire index with (don't have the full-text anymore, etc.).  Is
>> >>> there some way to rebuild or optimize an existing index with only a
>> sub-set
>> >>> of the existing indexed fields?  Or alternatively is there a way to
>> avoid
>> >>> loading some indexed fields at all ( to avoid loading term infos and
>> terms
>> >>> index ) ?
>> >>>
>> >>> Thanks
>> >>> Bob
>> >>
>> >>
>> >>
>>
>>


Size of suggest dictionary

2012-02-15 Thread Mike Hugo
Hello,

We're building an auto suggest component based on the "label" field of
documents.  Is there a way to see how many terms are in the dictionary, or
how much memory it's taking up?  I looked on the statistics page but didn't
find anything obvious.

Thanks in advance,

Mike

ps- here's the config:

<searchComponent name="suggestlabel" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">suggestlabel</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">label</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

<requestHandler name="..." class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggestlabel</str>
    <str name="spellcheck.count">10</str>
  </lst>
  <arr name="components">
    <str>suggestlabel</str>
  </arr>
</requestHandler>


Date formatting issue

2012-02-15 Thread Zajkowski, Radoslaw
Hi all, here's an interesting one: in my XML import, if I use a very simple
xpath like this



I will get the date properly imported, however if I use this expression for 
another node which is nested:



I will receive this type of exception:  java.text.ParseException: Unparseable 
date: "Tue Aug 16 20:10:23 EDT 2011"

I have to use the xpath above, as I have a few of those release date nodes and I
need to flatten them so we can look at dates per audience/group.

I've also run just this /document/audiences/audience/audience_release_date and 
it works; however, I need a more precise result than that, since different groups 
could have different release dates.

Any help greatly appreciated,

Radek.
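
(For what it's worth, "Tue Aug 16 20:10:23 EDT 2011" is the default
java.util.Date toString() format; if DIH's DateFormatTransformer is involved,
a mapping along these lines might parse it -- xpath elided as in the original,
and transformer="DateFormatTransformer" assumed on the entity:

<field column="release_date" xpath="..." dateTimeFormat="EEE MMM dd HH:mm:ss zzz yyyy"/>)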


Radoslaw Zajkowski
Senior Developer
O°
proximity
CANADA
t: 416-972-1505 ext.7306
c: 647-281-2567
f: 416-944-7886

2011 ADCC Interactive Agency of the Year
2011 Strategy Magazine Digital Agency of the Year

http://www.proximityworld.com/

Join us on:
Facebook - http://www.facebook.com/ProximityCanada
Twitter - http://twitter.com/ProximityWW
YouTube - http://www.youtube.com/proximitycanada









Re: Size of suggest dictionary

2012-02-15 Thread Em
Hello Mike,

have a look at Solr's Schema Browser. Click on "FIELDS", select "label"
and have a look at the number of distinct (term-)values.
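The same number is also available over HTTP from the Luke request handler
that backs that page (URL assumes the default path), e.g.:

http://localhost:8983/solr/admin/luke?fl=label

The response should include a "distinct" term count for the field.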

Regards,
Em


Am 15.02.2012 23:07, schrieb Mike Hugo:
> Hello,
> 
> We're building an auto suggest component based on the "label" field of
> documents.  Is there a way to see how many terms are in the dictionary, or
> how much memory it's taking up?  I looked on the statistics page but didn't
> find anything obvious.
> 
> Thanks in advance,
> 
> Mike
> 
> ps- here's the config:
> 
> <searchComponent name="suggestlabel" class="solr.SpellCheckComponent">
>   <lst name="spellchecker">
>     <str name="name">suggestlabel</str>
>     <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
>     <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
>     <str name="field">label</str>
>     <str name="buildOnCommit">true</str>
>   </lst>
> </searchComponent>
> 
> <requestHandler name="..." class="org.apache.solr.handler.component.SearchHandler">
>   <lst name="defaults">
>     <str name="spellcheck">true</str>
>     <str name="spellcheck.dictionary">suggestlabel</str>
>     <str name="spellcheck.count">10</str>
>   </lst>
>   <arr name="components">
>     <str>suggestlabel</str>
>   </arr>
> </requestHandler>


Re: Query in starting solr 3.5

2012-02-15 Thread Chris Hostetter

: WARNING: XML parse warning in "solrres:/dataimport.xml", line 2, column 95:
: Include operation failed, reverting to fallback. Resource error reading file
: as XML (href='solr/conf/solrconfig_master.xml'). Reason: Can't find resource
: 'solr/conf/solrconfig_master.xml' in classpath or
: '/solr/apache-solr-3.5.0/example/multicore/core1/conf/',
: cwd=/solr/apache-solr-3.5.0/example
: 
: The partial content of dataimport file that I used in solr1.4 is as follows
: 
: ... xmlns:xi="http://www.w3.org/2001/XInclude">

I *think* what happened there is that some fixes were made to what path 
was used for relative includes -- before it was inconsistent and 
undefined, and now it's a true relative path from where you do the 
include.  so in your case, (i think) it is looking for 
/solr/apache-solr-3.5.0/example/multicore/core1/conf/solr/conf/solrconfig_master.xml
 
and not finding it -- so just fix the path to be what you actually want it 
to be relative to that file

(If you look for SOLR-1656 in Solr's CHANGES.txt file it has all the 
details)

: The 3 files given in Fallback tag are present in the location. Does solr 3.5
: support fallback? Can someone please suggest a solution?

I think the fallback should be working fine (particularly since they are 
absolute paths in your case) ... nothing about that error says it's not, 
it actually says it's using the fallback because the include itself is 
failing. (so unless you see a *subsequent* error you are getting the 
fallbacks)

: WARNING: the luceneMatchVersion is not specified, defaulting to LUCENE_24
: emulation. You should at some point declare and reindex to at least 3.0,
: because 2.4 emulation is deprecated and will be removed in 4.0. This
: parameter will be mandatory in 4.0.
: 
: The solution i got after googling is to apply a patch. Is there any other

citation please?  where did you read that you need a patch to get rid of 
that warning?

This warning is just letting you know that in the absense of explicit 
confiugration, it's assuming you want the legacy behavior you would get if 
you explicitly configured the option with LUCENE_24.

if you add this line to your solrconfig.xml...

  <luceneMatchVersion>LUCENE_24</luceneMatchVersion>

...no behavior will change, and the warning will go away.  but as the 
warning points out, you should give serious consideration (on every 
upgrade) to whether or not you can re-index after upgrade, and then change 
it to the current value (LUCENE_35) to eliminate some buggy behavior that 
is supported for back compat with existing indexes.


-Hoss


Re: Language specific tokenizer for purpose of multilingual search in single-core solr,

2012-02-15 Thread Chris Hostetter

: I want to do multilingual search in single-core solr. That requires to
: define language specific tokenizers in schema.xml. Say for example, I have
: two tokenizers, one for English ("en") and one for simplified Chinese
: ("zh-cn"). Can I just put following definitions together in one schema.xml,
: and both sets of the files ( stopwords, synonym, and protwords) in one
: directory? 

absolutely.


-Hoss


Re: Search for hashtags and mentions

2012-02-15 Thread Erick Erickson
We need the rest of your fieldType, it's quite possible
that other parts of it are stripping out the characters
in question. Try looking at the admin/analysis page.

If that doesn't help, please show us the whole fieldType
definition and the results of attaching &debugQuery=on
to the URL.

Best
Erick

On Wed, Feb 15, 2012 at 2:04 PM, Rohit  wrote:
> Hi,
>
>
>
> We are using solr version 3.5 to search though Tweets, I am using
> WordDelimiterFactory with the following setting, to be able to search for
> @username or #hashtags
>
>
>
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
> preserveOriginal="1" handleAsChar="@#"/>
>
>
>
> I saw the following patch but this doesn't seem to be working as I expected,
> am I missing something?
>
>
>
> https://issues.apache.org/jira/browse/SOLR-2059
>
>
>
> But searching for @username is also returning results for just username, and
> #hashtag is just returning results for hashtag. How can I achieve this?
>
>
>
> Regards,
>
> Rohit
>


Spatial Search and faceting

2012-02-15 Thread Eric Grobler
Hi Solr community,

I am doing a spatial search and then do a facet by city.
Is it possible to then sort the faceted cities by distance?

We would like to display the hits per city, but sort them by distance.

Thanks & Regards
Ericz

q=iphone
fq={!bbox}
sfield=geopoint
pt=49.594857,8.468614
d=50
fl=id,description,city,geopoint

facet=true
facet.field=city
f.city.facet.limit=10
f.city.facet.sort=count //geodist() asc


Re: Search for hashtags and mentions

2012-02-15 Thread Robert Muir
On Wed, Feb 15, 2012 at 2:04 PM, Rohit  wrote:
> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
> preserveOriginal="1" handleAsChar="@#"/>

There is no such parameter as 'handleAsChar'. If you want to do this,
you need to use a custom types file.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
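
For example (filename illustrative; the '#' must be escaped in the types file
because it would otherwise start a comment):

In the analyzer chain:
  <filter class="solr.WordDelimiterFilterFactory" types="wdfftypes.txt" preserveOriginal="1"/>

wdfftypes.txt:
  \# => ALPHA
  @ => ALPHA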

-- 
lucidimagination.com


Re: Can I rebuild an index and remove some fields?

2012-02-15 Thread Li Li
great. I think you could make it a public tool. maybe others also need such
functionality.

On Thu, Feb 16, 2012 at 5:31 AM, Robert Stewart wrote:

> I implemented an index shrinker and it works.  I reduced my test index
> from 6.6 GB to 3.6 GB by removing a single shingled field I did not
> need anymore.  I'm actually using Lucene.Net for this project so the code
> is C#, using the Lucene.Net 2.9.2 API.  But the basic idea is:
>
> Create an IndexReader wrapper that only enumerates the terms you want
> to keep, and that removes terms from documents when returning
> documents.
>
> Use the SegmentMerger to re-write each segment (where each segment is
> wrapped by the wrapper class), writing each new segment to a new directory.
> Collect the SegmentInfos and do a commit in order to create a new
> segments file in the new index directory.
>
> Done - you now have a shrunk index with specified terms removed.
>
> Implementation uses separate thread for each segment, so it re-writes
> them in parallel.  Took about 15 minutes to do 770,000 doc index on my
> macbook.
>
>
> On Tue, Feb 14, 2012 at 10:12 PM, Li Li  wrote:
> > I have roughly read the codes of 4.0 trunk. maybe it's feasible.
> >SegmentMerger.add(IndexReader) will add to be merged Readers
> >merge() will call
> >  mergeTerms(segmentWriteState);
> >  mergePerDoc(segmentWriteState);
> >
> >   mergeTerms() will construct fields from IndexReaders
> >    for(int readerIndex=0;readerIndex<mergeState.readers.size();readerIndex++) {
> >      final MergeState.IndexReaderAndLiveDocs r =
> > mergeState.readers.get(readerIndex);
> >  final Fields f = r.reader.fields();
> >  final int maxDoc = r.reader.maxDoc();
> >  if (f != null) {
> >slices.add(new ReaderUtil.Slice(docBase, maxDoc, readerIndex));
> >fields.add(f);
> >  }
> >  docBase += maxDoc;
> >}
> >So if you wrap your IndexReader and override its fields() method,
> > maybe it will work for merging terms.
> >
> >for DocValues, it can also override AtomicReader.docValues(). just
> > return null for fields you want to remove. maybe it should
> > traverse CompositeReader's getSequentialSubReaders() and wrapper each
> > AtomicReader
> >
> >other things like term vectors norms are similar.
> > On Wed, Feb 15, 2012 at 6:30 AM, Robert Stewart  >wrote:
> >
> >> I was thinking if I make a wrapper class that aggregates another
> >> IndexReader and filter out terms I don't want anymore it might work.
> And
> >> then pass that wrapper into SegmentMerger.  I think if I filter out
> terms
> >> on GetFieldNames(...) and Terms(...) it might work.
> >>
> >> Something like:
> >>
> >> HashSet<string> ignoredTerms = ...;
> >>
> >> FilteringIndexReader wrapper=new FilterIndexReader(reader);
> >>
> >> SegmentMerger merger=new SegmentMerger(writer);
> >>
> >> merger.add(wrapper);
> >>
> >> merger.Merge();
> >>
> >>
> >>
> >>
> >>
> >> On Feb 14, 2012, at 1:49 AM, Li Li wrote:
> >>
> >> > for method 2, delete is wrong. we can't delete terms.
> >> >   you also should hack with the tii and tis file.
> >> >
> >> > On Tue, Feb 14, 2012 at 2:46 PM, Li Li  wrote:
> >> >
> >> >> method1, dumping data
> >> >> for stored fields, you can traverse the whole index and save it to
> >> >> somewhere else.
> >> >> for indexed but not stored fields, it may be more difficult.
> >> >>if the indexed and not stored field is not analyzed(fields such as
> >> >> id), it's easy to get from FieldCache.StringIndex.
> >> >>But for analyzed fields, though theoretically it can be restored
> from
> >> >> term vector and term position, it's hard to recover from index.
> >> >>
> >> >> method 2, hack with metadata
> >> >> 1. indexed fields
> >> >>  delete by query, e.g. field:*
> >> >> 2. stored fields
> >> >>   because all fields are stored sequentially. it's not easy to
> >> delete
> >> >> some fields. this will not affect search speed. but if you want to
> get
> >> >> stored fields,  and the useless fields are very long, then it will
> slow
> >> >> down.
> >> >>   also it's possible to hack with it. but need more effort to
> >> >> understand the index file format  and traverse the fdt/fdx file.
> >> >>
> >>
> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html
> >> >>
> >> >> this will give you some insight.
> >> >>
> >> >>
> >> >> On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart <bstewart...@gmail.com> wrote:
> >> >>
> >> >>> Lets say I have a large index (100M docs, 1TB, split up between 10
> >> >>> indexes). And a bunch of the "stored" and "indexed" fields are not
> >> >>> used in search at all. In order to save memory and disk, I'd like to
> >> >>> rebuild that index *without* those fields, but I don't have the
> >> >>> original documents to rebuild the entire index with (don't have the
> >> >>> full-text anymore, etc.). Is there some way to rebuild or optimize
> >> >>> an existing index with only a sub-set of the existing indexed
> >> >>> fields? Or alternatively is there a way to avoid loading those
> >> >>> fields?
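
For reference, a rough sketch of what such a FilteringIndexReader wrapper
might look like on the Lucene 3.x API. This is illustrative only and untested
as pasted (not the actual implementation); a complete version would also have
to override terms(Term), termDocs()/termPositions(), norms() and document()
so that postings, norms, stored fields and term vectors of the dropped fields
are skipped as well:

import java.io.IOException;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermEnum;

public class FilteringIndexReader extends FilterIndexReader {

  private final Set<String> ignoredFields;

  public FilteringIndexReader(IndexReader in, Set<String> ignoredFields) {
    super(in);
    this.ignoredFields = ignoredFields;
  }

  // Hide the ignored fields from SegmentMerger's FieldInfos/norms merging.
  @Override
  public Collection<String> getFieldNames(IndexReader.FieldOption option) {
    Collection<String> names = new HashSet<String>(super.getFieldNames(option));
    names.removeAll(ignoredFields);
    return names;
  }

  // Skip every term that belongs to an ignored field during the merge.
  @Override
  public TermEnum terms() throws IOException {
    // Inside the anonymous class, 'in' refers to FilterTermEnum's wrapped
    // TermEnum, not to the wrapped IndexReader.
    return new FilterTermEnum(in.terms()) {
      @Override
      public boolean next() throws IOException {
        while (in.next()) {
          if (!ignoredFields.contains(term().field())) {
            return true;
          }
        }
        return false;
      }
    };
  }
}

Each segment reader is wrapped this way before being passed to
SegmentMerger.add(), as in the pseudo-code further up the thread.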

Re: Spatial Search and faceting

2012-02-15 Thread William Bell
One way to do it is to group by city and then sort=geodist() asc

select?group=true&group.field=city&sort=geodist() asc&rows=10&fl=city

It might require 2 calls to SOLR to get it the way you want.
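
For example, the second (distance-ordered) call might look like this (a
sketch, untested; the sfield/pt/d values are just taken from the query
below, and result grouping needs Solr 3.3 or later):

q=iphone
fq={!bbox}
sfield=geopoint
pt=49.594857,8.468614
d=50
group=true
group.field=city
sort=geodist() asc
fl=city

The facet query below then supplies the hits per city for the first call.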

On Wed, Feb 15, 2012 at 5:51 PM, Eric Grobler  wrote:
> Hi Solr community,
>
> I am doing a spatial search and then do a facet by city.
> Is it possible to then sort the faceted cities by distance?
>
> We would like to display the hits per city, but sort them by distance.
>
> Thanks & Regards
> Ericz
>
> q=iphone
> fq={!bbox}
> sfield=geopoint
> pt=49.594857,8.468614
> d=50
> fl=id,description,city,geopoint
>
> facet=true
> facet.field=city
> f.city.facet.limit=10
> f.city.facet.sort=count //geodist() asc



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


RE: Search for hashtags and mentions

2012-02-15 Thread Rohit
Got the problem, I need to use the "types=" parameter to stop
WordDelimiterFilterFactory from treating characters like # and @ as delimiters.
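
Something along these lines, i.e. an untested sketch ("wdfftypes.txt" is just
an arbitrary file name, and the other parameters stay as before):

 <filter class="solr.WordDelimiterFilterFactory" types="wdfftypes.txt"/>

with wdfftypes.txt containing (the backslash escape is needed because lines
starting with # are treated as comments):

 @ => ALPHA
 \# => ALPHA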

Regards,
Rohit
Mobile: +91-9901768202
About Me: http://about.me/rohitg

-Original Message-
From: Robert Muir [mailto:rcm...@gmail.com] 
Sent: 16 February 2012 06:22
To: solr-user@lucene.apache.org
Subject: Re: Search for hashtags and mentions

On Wed, Feb 15, 2012 at 2:04 PM, Rohit  wrote:
> <filter class="solr.WordDelimiterFilterFactory"
> generateNumberParts="1" catenateWords="1" catenateNumbers="1"
> catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"
> preserveOriginal="1" handleAsChar="@#"/>

There is no such parameter as 'handleAsChar'. If you want to do this,
you need to use a custom types file.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

-- 
lucidimagination.com



is it possible to run deltaimport command with out delta query?

2012-02-15 Thread nagarjuna
hi all..
  i am new to solr. can anybody explain the delta-import and delta query to
me? i also have the questions below:
1. is it possible to run delta-import without a delta query?
2. is it possible to write a delta query without having a last_modified column
in the database? if yes pls explain


pls help me anybody
thanx in advance.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/is-it-possible-to-run-deltaimport-command-with-out-delta-query-tp3749328p3749328.html
Sent from the Solr - User mailing list archive at Nabble.com.


Using Solr for a rather busy "Yellow Pages"-type index - good idea or not really?

2012-02-15 Thread Alexey Verkhovsky
Hi, all,

I'm new here. Used Solr on a couple of projects before, but didn't need to
dive deep into anything until now. These days, I'm doing a spike for a
"yellow pages" type search server with the following technical requirements:

~10 mln listings in the database. A listing has a name, address,
description, coordinates and a number of tags / filtering fields; no more
than a kilobyte all told; i.e. theoretically the whole thing should fit in
RAM without sharding. A typical query is either "all text matches on name
and/or description within a bounded box", or "some combination of tag
matches within a bounded box". Bounded boxes are 1 to 50 km wide, and
contain up to 10^5 unfiltered listings (the average is more like 10^3).
More than 50% of all the listings are in the frequently requested bounding
boxes; however, a vast majority of listings are almost never displayed
(because they don't match the other filters).
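
(As a concrete example, I picture a typical filtered query looking roughly
like this; the field names are made up, and I haven't tested it:

q=name:pizza OR description:pizza
fq={!bbox sfield=coords pt=45.5236,-122.675 d=25}
fq=tag:(restaurant AND delivery)

where d is the search radius in km.)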

Data "never changes" (i.e., a daily batch update; rebuild of the entire
index and restart of all search servers is feasible, as long as it takes
minutes, not hours). This thing ideally should serve up to 10^3 requests
per second on a small (as in, "less than 10 commodity boxes") cluster. In
other words, a typical request should be CPU bound and take ~100-200 msec
to process. Because of coordinates (that are almost never the same),
caching of queries makes no sense; from what little I understand about
Lucene internals, caching of filters probably doesn't make sense either.

After perusing documentation and some googling (but almost no source code
exploring yet), I understand how the schema and the queries will look like,
and now have to figure out a specific configuration that fits the
performance/scalability requirements. Here is what I'm thinking:

1. Search server is an internal service that uses embedded Solr for the
indexing part. RAMDirectoryFactory as index storage.
2. All data is in some sort of persistent storage on a file system, and is
loaded into the memory when a search server starts up.
3. Data updates are handled as "update the persistent storage, start
another cluster, load the world into RAM, flip the load balancer, kill the
old cluster"
4. Solr returns IDs with relevance scores; actual presentations of listings
(as JSON documents) are constructed outside of Solr and cached in
Memcached, as mostly static content with a few templated bits, like
<%=DISTANCE_TO(-123.0123, 45.6789) %>.
5. All Solr caching is switched off.
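
(Concretely, I assume points 1 and 5 translate to solrconfig.xml roughly as
follows; untested, just my reading of the docs:

<directoryFactory name="DirectoryFactory" class="solr.RAMDirectoryFactory"/>

plus commenting out the filterCache, queryResultCache and documentCache
sections.)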

Obviously, we are not the first people to do something like this with Solr,
so I'm hoping for some collective wisdom on the following:

Does this sounds like a feasible set of requirements in terms of
performance and scalability for Solr? Are we on the right path to solving
this problem well? If not, what should we be doing instead? What nasty
technical/architectural gotchas are we probably missing at this stage?

One particular piece of advice I'd be really happy to hear is "you may not
need RAMDirectoryFactory if you use <X> instead".

Also, is there a blog, wiki page or a mailing list thread where a similar
problem is discussed? Yes, we have seen
http://www.ibm.com/developerworks/opensource/library/j-spatial; it's a good
introduction, but it's outdated and doesn't go into the nasty bits anyway.

Many thanks in advance,
-- Alex Verkhovsky


AW: is it possible to run deltaimport command with out delta query?

2012-02-15 Thread Ramo Karahasan
Hi,

you may want to have a look at
http://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport
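
The short version of the trick from that page (an untested sketch; note that
it still needs some change marker in the database, last_modified in this
case):

<entity name="item" pk="ID"
        query="SELECT * FROM item
               WHERE '${dataimporter.request.clean}' != 'false'
                  OR last_modified > '${dataimporter.last_index_time}'">

You then always run /dataimport?command=full-import&clean=false, and the
entity behaves like a delta import without needing a deltaQuery.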

hth,
Ramo

-Ursprüngliche Nachricht-
Von: nagarjuna [mailto:nagarjuna.avul...@gmail.com] 
Gesendet: Donnerstag, 16. Februar 2012 07:27
An: solr-user@lucene.apache.org
Betreff: is it possible to run deltaimport command with out delta query?

hi all..
  i am new to solr. can anybody explain the delta-import and delta query to
me? i also have the questions below: 1. is it possible to run
delta-import without a delta query?
2. is it possible to write a delta query without having a last_modified column
in the database? if yes pls explain


pls help me anybody
thanx in advance.

--
View this message in context:
http://lucene.472066.n3.nabble.com/is-it-possible-to-run-deltaimport-command
-with-out-delta-query-tp3749328p3749328.html
Sent from the Solr - User mailing list archive at Nabble.com.