Problems with dates!!!

2008-12-19 Thread lunapolar

Hello!!

I'm new to working with Solr. I've got a MySQL database with some dates like
these:

2009-01-31
2009-02-01
2008-11-29
2008-11-30

but when I ran a query (*:*, for example), I noticed that Solr had stored:

2009-01-30T23:00:00Z
2009-01-31T23:00:00Z
2008-11-28T23:00:00Z
2008-11-29T23:00:00Z

Each date is a day behind!! How can I fix that, please?
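[Editor's note: a likely cause, sketched here and not stated in the thread, is that the date-only MySQL values are being read as midnight *local* time (UTC+1) and then converted to UTC, which Solr uses for all dates. A minimal illustration of the arithmetic:]

```python
from datetime import datetime, timedelta, timezone

# Solr stores dates in UTC. If 2009-01-31 (no time component) is read as
# midnight local time in UTC+1 and then converted to UTC, it slips back
# an hour -- onto the previous day.
local_midnight = datetime(2009, 1, 31, tzinfo=timezone(timedelta(hours=1)))
print(local_midnight.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"))
# 2009-01-30T23:00:00Z  -- the "day behind" value from the question

# The usual fix is to treat the date as already being UTC when formatting:
utc_midnight = datetime(2009, 1, 31, tzinfo=timezone.utc)
print(utc_midnight.strftime("%Y-%m-%dT%H:%M:%SZ"))
# 2009-01-31T00:00:00Z
```

In Java indexing code the equivalent is setting the formatter's time zone to UTC before producing the `yyyy-MM-dd'T'HH:mm:ss'Z'` string.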

-- 
View this message in context: 
http://www.nabble.com/Problems-with-dates%21%21%21-tp21088241p21088241.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Change in config file (synonym.txt) requires container restart?

2008-12-19 Thread Plaatje, Patrick
Hi,

I'm wondering if you couldn't implement a custom filter which reads the
file in real time (you might even keep the created synonym map in memory
for a predefined time). That way, no restart of the container is needed.

Best,

Patrick
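[Editor's note: Patrick's idea, sketched generically rather than against the Solr filter API. The class name, TTL, and the one-comma-separated-group-per-line file format are all assumptions for illustration.]

```python
import time


class SynonymCache:
    """Keep the parsed synonym map in memory and re-read the source file
    only after a time-to-live expires, so edits are picked up without a
    container restart."""

    def __init__(self, path, ttl_seconds=60.0):
        self.path = path
        self.ttl = ttl_seconds
        self._loaded_at = float("-inf")  # force a load on first access
        self._map = {}

    def get(self):
        if time.monotonic() - self._loaded_at > self.ttl:
            self._map = self._parse()
            self._loaded_at = time.monotonic()
        return self._map

    def _parse(self):
        # Assumed format: one comma-separated synonym group per line.
        mapping = {}
        with open(self.path) as f:
            for line in f:
                terms = [t.strip() for t in line.split(",") if t.strip()]
                for t in terms:
                    mapping[t] = terms
        return mapping
```

A real Solr filter factory would wrap the same logic around building the `SynonymMap`; checking the file's mtime instead of a TTL is an equally valid trigger.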

-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] 
Sent: vrijdag 19 december 2008 7:30
To: solr-user@lucene.apache.org
Subject: Re: Change in config file (synonym.txt) requires container
restart?

Please note that a core reload will also stop Solr from serving any
search requests in the time it reloads.

On Fri, Dec 19, 2008 at 8:24 AM, Sagar Khetkade wrote:

>
> But I am using CommonsHttpSolrServer for the Solr server configuration,
> as it accepts the URL. So how can I reload the core from there?
>
> -Sagar
>
> > Date: Thu, 18 Dec 2008 07:55:02 -0500
> > From: markrmil...@gmail.com
> > To: solr-user@lucene.apache.org
> > Subject: Re: Change in config file (synonym.txt) requires container
> > restart?
> >
> > Sagar Khetkade wrote:
> > > Hi,
> > >
> > > I am using the SolrJ client to connect to the Solr 1.3 server, and
> > > the whole POC (doing a feasibility study) resides in a Tomcat web
> > > server. If I make any change to synonym.txt, I have to restart the
> > > Tomcat server for it to take effect. The synonym filter factory I am
> > > using is in the analyzers for both the index and query types in
> > > schema.xml. Please tell me whether this approach is good, or if
> > > there is another way to make the change take effect in searching
> > > without restarting the Tomcat server.
> > >
> > > Thanks and Regards,
> > > Sagar Khetkade
> > >
> > > _________________________________________________________________
> > > Chose your Life Partner? Join MSN Matrimony FREE
> > > http://in.msn.com/matrimony
> >
> > You can also reload the core.
> >
> > - Mark
>
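[Editor's note: in a multicore setup a core reload is just an HTTP call to the CoreAdmin handler, which a CommonsHttpSolrServer user can issue like any other request. Host, port, and core name below are assumptions.]

```python
from urllib.parse import urlencode

# Hypothetical base URL and core name; RELOAD makes the core re-read its
# config files (including synonyms.txt) without restarting the container.
solr_base = "http://localhost:8983/solr"
reload_url = solr_base + "/admin/cores?" + urlencode(
    {"action": "RELOAD", "core": "core0"})
print(reload_url)
# http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0
```

Fetching that URL with any HTTP client triggers the reload; note Shalin's caveat above that the core does not serve searches while it reloads.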



--
Regards,
Shalin Shekhar Mangar.


Re: looking for multilanguage indexing best practice/hint

2008-12-19 Thread Sujatha Arun
Thanks Daniel and Erik,

The requirement from the user end is to only search in that particular
language and not across languages.

Also going forward we will be adding more languages.

So if I have a separate field for each language, we need to change the
schema every time, and that will not scale very well.

So there are two options: either use dynamic fields or use multicore.

Please advise which is better in terms of scaling and optimum use of
existing resources (the available RAM is about 4 GB for several instances
of Solr).

If we use multicore, will it degrade in terms of speed, etc.?

Any pointers will be helpful

Regards
Sujatha
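[Editor's note: a sketch of the dynamic-field option in schema.xml. The field type name, analyzer choice, and suffix convention are assumptions, not from the thread.]

```xml
<!-- One fieldType per language/analyzer; e.g. a CJK type: -->
<fieldType name="text_cjk" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
</fieldType>

<!-- A dynamic field per type suffix: documents can then use content_cjk,
     title_cjk, etc. without a schema edit for each new field name.
     Each new *language* still needs its own fieldType, though. -->
<dynamicField name="*_cjk" type="text_cjk" indexed="true" stored="true"/>
```

This avoids a schema change per field, but not per analyzer, so adding a language with a genuinely new analysis chain still touches the schema under either option.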




On 12/19/08, Julian Davchev  wrote:
>
> Thanks Erick,
> I think I will go with different language fields, as I want to use
> different stop words, analyzers, etc.
> I might also consider a schema per language so scaling is more flexible,
> as I was already advised, but I guess that will really make sense only
> if I have more than one server; otherwise all the other data is just
> duplicated for no reason.
> We have already decided that the language will be passed with each
> search, so it doesn't make sense to search the query in every language.
>
> As for CJKAnalyzer, at first look it doesn't seem to be in Solr (haven't
> tried yet), and since I am a noob in Java I will check how it's done.
> Will definitely give it a try.
>
> Thanks a lot for the help.
>
> Erick Erickson wrote:
> > See the CJKAnalyzer for a start, StandardAnalyzer won't
> > help you much.
> >
> > Also, tell us a little more about your requirements. For instance,
> > if a user submits a query in Japanese, do you want to search
> > across documents in the other languages too? And will you want
> > to associate different analyzers with the content from different
> > languages? You really have two options:
> >
> > if you want different analyzers used with the different languages,
> > you probably have to index the content in different fields. That is
> > a Chinese document would have a chinese_content field, a Japanese
> > document would have a japanese_content field etc. Now you can
> > associate a different analyzer with each *_content field.
> >
> > If the same analyzer would work for all three languages, you
> > can just index all the content in a "content" field, and if you
> > need to restrict searching to the language in which the query
> > was submitted, you could always add a clause on the
> > language, e.g. AND language:chinese
> >
> > Hope this helps
> > Erick
> >
> > On Wed, Dec 17, 2008 at 11:15 PM, Sujatha Arun 
> wrote:
> >
> >
> >> Hi,
> >>
> >> I am prototyping language search using Solr 1.3. I have 3 fields in
> >> the schema: id, content, and language.
> >>
> >> I am indexing 3 PDF files; the languages are Foroyo, Chinese, and
> >> Japanese.
> >>
> >> I use xpdf to convert the content of the PDFs to text and push the
> >> text to Solr in the content field.
> >>
> >> What is the analyzer that I need to use for the above?
> >>
> >> By using the default text analyzer and posting this content to Solr,
> >> I am not getting any results.
> >>
> >> Does Solr support stemming for the above languages?
> >>
> >> Regards
> >> Sujatha
> >>
> >>
> >>
> >>
> >> On 12/18/08, Feak, Todd  wrote:
> >>
> >>> Don't forget to consider scaling concerns (if there are any). There are
> >>> strong differences in the number of searches we receive for each
> >>> language. We chose to create separate schema and config per language so
> >>> that we can throw servers at a particular language (or set of
> languages)
> >>> if we needed to. We see 2 orders of magnitude difference between our
> >>> most popular language and our least popular.
> >>>
> >>> -Todd Feak
> >>>
> >>> -Original Message-
> >>> From: Julian Davchev [mailto:j...@drun.net]
> >>> Sent: Wednesday, December 17, 2008 11:31 AM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: looking for multilanguage indexing best practice/hint
> >>>
> >>> Hi,
> >>> From my study of Solr and Lucene so far, it seems that I will use a
> >>> single schema; at least, I don't see a scenario where I'd need more
> >>> than that. So the question is how to approach multilanguage indexing
> >>> and multilanguage searching. Does it really make sense to search by
> >>> just the word, or should I supply a language param to the search as
> >>> well?
> >>>
> >>> I see there are those filters, and I was already advised on them,
> >>> but I guess the question is more one of best practice:
> >>> solr.ISOLatin1AccentFilterFactory, solr.SnowballPorterFilterFactory
> >>>
> >>> So the solution I see is using copyField to have the same field in
> >>> different languages, or something using a distinct filter.
> >>> Cheers
> >>>
> >>>
> >>>
> >>>
> >>>
> >
> >
>
>


Re: Multi language search help

2008-12-19 Thread Sujatha Arun
Thanks Grant,

The requirement from the user end is to only search in that particular
language and not across languages.

Also going forward we will be adding more languages.

So if I have a separate field for each language, we need to change the
schema every time, and that will not scale very well.

So there are two options: either use dynamic fields or use multicore.

Please advise which is better in terms of scaling and optimum use of
existing resources (the available RAM is about 4 GB for several instances
of Solr).

If we use multicore, will it degrade in terms of speed, etc.?

Any pointers will be helpful

Regards
Sujatha


On 12/19/08, Grant Ingersoll  wrote:
>
>
> On Dec 18, 2008, at 6:25 AM, Sujatha Arun wrote:
>
> Hi,
>> I am prototyping language search using Solr 1.3. I have 3 fields in the
>> schema -id,content and language.
>>
>> I am indexing 3 pdf files ,the languages are foroyo,chinese and japanese.
>>
>> I use xpdf to convert the content of pdf to text and push the text to solr
>> in the content field.
>>
>> What is the analyzer  that i need to use for the above.
>>
>> By using the default text analyzer and posting this content to solr, i am
>> not getting any  results.
>>
>> Does solr support stemming for the above languages.
>>
>
> I'm not familiar with Foroyo, but there should be tokenizers/analysis
> available for Chinese and Japanese.  Are you putting all three languages into
> the same field?  If that is the case, you will need some type of language
> detection piece that can choose the correct analyzer.
>
> How are your users searching?  That is, do you know the language they want
> to search in?  If so, then you can have a field for each language.
>
> -Grant
>
>


FileBasedSpellChecker Multiple wordlist source files

2008-12-19 Thread tushar kapoor

I am using FileBasedSpellChecker and currently configuring it through one
source file. Something like this:

<lst name="spellchecker">
  <str name="name">default</str>
  <str name="classname">solr.spelling.FileBasedSpellChecker</str>
  <str name="sourceLocation">./files/spellings.txt</str>
  <str name="characterEncoding">UTF-8</str>
  <str name="spellcheckIndexDir">./spellcheckerindex</str>
</lst>

I have a whole bunch of other wordlist files which I want to use for
spell-check. However I do not want to merge all these into one file. 

Is it possible to specify multiple wordlist sources in the same spell
checker configuration ?

I have tried using wildcards (*.txt), but they don't seem to work.



-- 
View this message in context: 
http://www.nabble.com/FileBasedSpellChecker-Multiple-wordlist-source-files-tp21090710p21090710.html
Sent from the Solr - User mailing list archive at Nabble.com.



Date Stats support using Solr

2008-12-19 Thread Jana, Kumar Raja
Hi,

 

I was searching for features in Solr which would give me the maximum and
minimum values for various numeric and name fields. I found the Stats
Component (SOLR-680), and thanks a ton for that! :)

 

Is there a similar component for date fields too? I played a bit with
the stats component but I could not get it to work with dates. Of
course, converting the date to milliseconds and indexing it as long is
an option but can the StatsComponent be extended or tampered with to
return the max date and min date in the result set?

 

Thanks,

Kumar
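[Editor's note: the workaround Kumar mentions — indexing the date as milliseconds in a long field so StatsComponent's min/max apply — is a simple conversion. A sketch, assuming the dates are UTC:]

```python
from datetime import datetime, timezone

# Index this value in a "long" field; StatsComponent's min/max then work,
# and the result converts back to a date on the client side.
d = datetime(2008, 12, 19, tzinfo=timezone.utc)
millis = int(d.timestamp() * 1000)
restored = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
print(millis, restored.isoformat())
```

The round trip is lossless down to the millisecond, which is also Solr's date precision.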



Re: FileBasedSpellChecker Multiple wordlist source files

2008-12-19 Thread Grant Ingersoll

Unfortunately, it doesn't support that right now.

One thought, though, keep them as separate files, but then just have  
your build process cat them together for deployment.


-Grant
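[Editor's note: a sketch of Grant's build-step approach; all file and directory names here are assumptions.]

```shell
# Demo setup: two hypothetical wordlists (any number of files works).
mkdir -p wordlists files
printf 'solr\nlucene\n' > wordlists/tech.txt
printf 'apache\n' > wordlists/misc.txt

# The build/deploy step: keep the lists separate in source control, but
# cat them into the single sourceLocation file the spellchecker reads.
cat wordlists/*.txt > files/spellings.txt
wc -l files/spellings.txt
```

Re-run the step (and reload the core) whenever any individual list changes.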

On Dec 19, 2008, at 7:49 AM, tushar kapoor wrote:

> I am using FileBasedSpellChecker and currently configuring it through
> one source file. Something like this:
>
> <lst name="spellchecker">
>   <str name="name">default</str>
>   <str name="classname">solr.spelling.FileBasedSpellChecker</str>
>   <str name="sourceLocation">./files/spellings.txt</str>
>   <str name="characterEncoding">UTF-8</str>
>   <str name="spellcheckIndexDir">./spellcheckerindex</str>
> </lst>
>
> I have a whole bunch of other wordlist files which I want to use for
> spell-check. However I do not want to merge all these into one file.
>
> Is it possible to specify multiple wordlist sources in the same spell
> checker configuration ?
>
> I have tried using wildcards - *.txt but they dont seem to work.
>
> --
> View this message in context:
> http://www.nabble.com/FileBasedSpellChecker-Multiple-wordlist-source-files-tp21090710p21090710.html
> Sent from the Solr - User mailing list archive at Nabble.com.



--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



Re: Distributed Searching - Limitations?

2008-12-19 Thread Yonik Seeley
On Fri, Dec 19, 2008 at 1:59 AM, Pooja Verlani  wrote:
> Hi,
> I am planning to use Solr's distributed searching for my project. But while
> going through http://wiki.apache.org/solr/DistributedSearch, i found a few
> limitations with it. Can anyone please explain the 2nd and 3rd points in the
> limitations sections on the page. The points are:
>
>   When duplicate doc IDs are received, Solr chooses the first doc and
>   discards subsequent ones

This one isn't a limitation IMO... IDs are supposed to be unique.
If someone indexes the same document (same meaning it has the same
ID/uniqueKey), then distributed search handles this error condition
relatively gracefully by using the first doc returned and ignoring the
others.

>   No distributed idf

idf is part of scoring, based on the inverse document frequency of
terms in the query - rarer terms count more than common terms.

When the index is split across multiple shards, scoring is currently
done locally on each shard, not globally across all shards. If the
index isn't well mixed, say all documents pertaining to one subject
are in one shard, then that can skew the scoring since a term that is
rare on one shard may not be rare across the whole collection.

-Yonik
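[Editor's note: to make the skew concrete, a toy calculation using the classic Lucene idf formula; the shard sizes and document frequencies are invented for illustration.]

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    # Classic Lucene DefaultSimilarity idf: 1 + ln(numDocs / (docFreq + 1)).
    return 1 + math.log(num_docs / (doc_freq + 1))


# Two shards of 1000 docs; a term appears once on shard A, 499 times on B.
print(idf(1000, 1))    # shard A's local idf: the term looks very rare
print(idf(1000, 499))  # shard B's local idf: the term looks common
print(idf(2000, 500))  # the global idf that per-shard scoring never sees
```

A match on shard A is scored as if the term were globally rare, so its documents can outrank genuinely better matches from shard B; a well-mixed index keeps the local and global values close.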


Re: Get All terms from all documents

2008-12-19 Thread roberto
Erick,

Thanks, this sounds good. I'll try it.

Mike,

Could you give more details about query logs?

Thanks

On Fri, Dec 19, 2008 at 12:02 AM, Mike Klaas  wrote:

>
> On 18-Dec-08, at 10:53 AM, roberto wrote:
>
>  Erick,
>>
>> Thanks for the answer, let me clarify the thing, we would like to have a
>> combobox with the terms to guide the user in the search i mean, if a have
>> thousands of documents and want to tell them how many documents in the
>> base
>> have the particular word, how can i do that?
>>
>
> Sounds like you want query autocomplete.  The best way to do this
> (including if you want the box filled with some queries), is to use the
> query logs, not the documents.
>
> -Mike
>



-- 
"Without love, we are birds with broken wings."
Morrie


MLT.FL - Invalid Date String

2008-12-19 Thread gullywompr

I'm trying to boost more-like-this queries with a timestamp field.  The field
is indexed in universal format, including the Z at the end.  I can include
the timestamp field in fl, qf, and mlt.qf, but when I try to add the field to
the mlt.fl list, I get a 400 Bad Request error with "Invalid Date String:
2008-12-19T16:13:58".  Note that there is no trailing Z on the string that is
reported as invalid, which would make sense, except that it is stored in the
index properly with the trailing Z, and as I said, the field can be retrieved
in other parameters, just not mlt.fl.  Normal string fields are working well
in mlt.fl, however.

Weird

Cheers,
Curtis Olson
CACI
-- 
View this message in context: 
http://www.nabble.com/MLT.FL---Invalid-Date-String-tp21095097p21095097.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Get All terms from all documents

2008-12-19 Thread Grant Ingersoll
I'd add that you probably don't want just the query logs; people may
search for things that aren't in the index, too.  Your call as to whether
that is useful or not.  Also, have a look at the TermsComponent, as it
will tell you the doc freq for terms.



On Dec 19, 2008, at 10:08 AM, roberto wrote:

> Erick,
>
> Thanks this sounds good, i'll try.
>
> Mike,
>
> Could you give more details about query logs?
>
> Thanks
>
> On Fri, Dec 19, 2008 at 12:02 AM, Mike Klaas wrote:
>
>> On 18-Dec-08, at 10:53 AM, roberto wrote:
>>
>>> Erick,
>>>
>>> Thanks for the answer, let me clarify the thing, we would like to
>>> have a combobox with the terms to guide the user in the search i
>>> mean, if a have thousands of documents and want to tell them how
>>> many documents in the base have the particular word, how can i do
>>> that?
>>
>> Sounds like you want query autocomplete.  The best way to do this
>> (including if you want the box filled with some queries), is to use
>> the query logs, not the documents.
>>
>> -Mike
>
> --
> "Without love, we are birds with broken wings."
> Morrie


--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



Re: Get All terms from all documents

2008-12-19 Thread Walter Underwood
At Netflix, we load the completion lexicon with movie titles, person
names, and a few aliases. Even then, we find a few misspellings in
our metadata (is it "NWA" or "N.W.A."?). Extracting terms from
documents will find a lot of misspellings.

You really do not want to rely on random users to correctly spell
things like Ratatouille and Koyaanisqatsi. Trust me.

Autocomplete needs to be really fast, so we use a dedicated
in-memory index (RAMDirectory) in the front end webapp and
also use an HTTP cache in the load balancer.

We get at least 25 million autocomplete requests a day, more
than 10X the number of search requests. I would plan for
10-15X search traffic.

wunder

On 12/19/08 10:45 AM, "Grant Ingersoll"  wrote:

> I'd add you probably don't want just the query logs, people may search
> for things that aren't in the index, too.  Your call as to whether
> that is useful or not.  Also, have a look at the TermsComponent, as it
> will tell you the doc freq for terms.
> 
> On Dec 19, 2008, at 10:08 AM, roberto wrote:
> 
>> Erick,
>> 
>> Thanks this sounds good, i'll try.
>> 
>> Mike,
>> 
>> Could you give more details about query logs?
>> 
>> Thanks
>> 
>> On Fri, Dec 19, 2008 at 12:02 AM, Mike Klaas 
>> wrote:
>> 
>>> 
>>> On 18-Dec-08, at 10:53 AM, roberto wrote:
>>> 
>>> Erick,
 
 Thanks for the answer, let me clarify the thing, we would like to
 have a
 combobox with the terms to guide the user in the search i mean, if
 a have
 thousands of documents and want to tell them how many documents in
 the
 base
 have the particular word, how can i do that?
 
>>> 
>>> Sounds like you want query autocomplete.  The best way to do this
>>> (including if you want the box filled with some queries), is to use
>>> the
>>> query logs, not the documents.
>>> 
>>> -Mike
>>> 
>> -- 
>> "Without love, we are birds with broken wings."
>> Morrie
> 
> --
> Grant Ingersoll
> 
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ




Solrj: Getting response attributes from QueryResponse

2008-12-19 Thread Mark Ferguson
Hello,

I am trying to get the numFound attribute from a returned QueryResponse
object, but for the life of me I can't find where it is stored. When I view
a response in XML format, it is stored as an attribute on the response node,
e.g.:

<result name="response" numFound="1228" start="0" maxScore="3.633028">


However, I can't find a way to retrieve these attributes (numFound, start
and maxScore). When I look at the QueryResponse itself, I can see that the
attributes are being stored somewhere, because the toString method returns
them. For example, queryResponse.toString() returns:

{responseHeader={status=0,QTime=139,params={wt=javabin,hl=true,rows=15,version=2.2,fl=urlmd5,start=0,q=java}},response={
*numFound=1228*,start=0,maxScore=3.633028,docs=[SolrDocument[{urlmd5=...

The problem is that when I call queryResponse.get('response'), all I get is
the list of SolrDocuments; I don't have any other attributes. Am I missing
something, or are these attributes just not publicly available? If they're
not, shouldn't they be? Thanks a lot,

Mark Ferguson


Re: Solrj: Getting response attributes from QueryResponse

2008-12-19 Thread Kevin Hagel
http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/response/QueryResponse.html#getResults()

returns a SolrDocumentList

http://lucene.apache.org/solr/api/org/apache/solr/common/SolrDocumentList.html

which has that information

On Fri, Dec 19, 2008 at 2:22 PM, Mark Ferguson wrote:

> Hello,
>
> I am trying to get the numFound attribute from a returned QueryResponse
> object, but for the life of me I can't find where it is stored. When I view
> a response in XML format, it is stored as an attribute on the response
> node,
> e.g.:
>
> 
>
> However, I can't find a way to retrieve these attributes (numFound, start
> and maxScore). When I look at the QueryResponse itself, I can see that the
> attributes are being stored somewhere, because the toString method returns
> them. For example, queryResponse.toString() returns:
>
>
> {responseHeader={status=0,QTime=139,params={wt=javabin,hl=true,rows=15,version=2.2,fl=urlmd5,start=0,q=java}},response={
> *numFound=1228*,start=0,maxScore=3.633028,docs=[SolrDocument[{urlmd5=...
>
> The problem is that when I call queryResponse.get('response'), all I get is
> the list of SolrDocuments, I don't have any other attributes. Am I missing
> something or are these attributes just not publically available? If they're
> not, shouldn't they be? Thanks a lot,
>
> Mark Ferguson
>


Re: Solrj: Getting response attributes from QueryResponse

2008-12-19 Thread Mark Ferguson
Oops .. thanks for the quick reply, I shouldn't have missed this. :)

Mark

On Fri, Dec 19, 2008 at 1:25 PM, Kevin Hagel  wrote:

>
> http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/response/QueryResponse.html#getResults()
>
> returns a SolrDocumentList
>
>
> http://lucene.apache.org/solr/api/org/apache/solr/common/SolrDocumentList.html
>
> which has that information
>
> On Fri, Dec 19, 2008 at 2:22 PM, Mark Ferguson  >wrote:
>
> > Hello,
> >
> > I am trying to get the numFound attribute from a returned QueryResponse
> > object, but for the life of me I can't find where it is stored. When I
> view
> > a response in XML format, it is stored as an attribute on the response
> > node,
> > e.g.:
> >
> > 
> >
> > However, I can't find a way to retrieve these attributes (numFound, start
> > and maxScore). When I look at the QueryResponse itself, I can see that
> the
> > attributes are being stored somewhere, because the toString method
> returns
> > them. For example, queryResponse.toString() returns:
> >
> >
> >
> {responseHeader={status=0,QTime=139,params={wt=javabin,hl=true,rows=15,version=2.2,fl=urlmd5,start=0,q=java}},response={
> > *numFound=1228*,start=0,maxScore=3.633028,docs=[SolrDocument[{urlmd5=...
> >
> > The problem is that when I call queryResponse.get('response'), all I get
> is
> > the list of SolrDocuments, I don't have any other attributes. Am I
> missing
> > something or are these attributes just not publically available? If
> they're
> > not, shouldn't they be? Thanks a lot,
> >
> > Mark Ferguson
> >
>


Re: [ANNOUNCE] Solr Logo Contest Results

2008-12-19 Thread Paul Borgermans
Maybe one remark: shouldn't it be Apache instead of apache in the logo
(first letter capitalized)?

Otherwise I like it, even though my first choice didn't make it.

Cheers and congrats to Michiel!
Paul

On Thu, Dec 18, 2008 at 8:50 PM, Jeryl Cook  wrote:

> looks cool :),  how about a talking mascot as
>
> Jeryl Cook
> twoenc...@gmail.com
>
> On Thu, Dec 18, 2008 at 1:38 PM, Mathijs Homminga
>  wrote:
> > Good choice!
> >
> > Mathijs Homminga
> >
> > Chris Hostetter wrote:
> >>
> >> (replies to solr-user please)
> >>
> >> On behalf of the Solr Committers, I'm happy to announce that we the Solr
> >> Logo Contest is officially concluded. (Woot!)
> >>
> >> And the Winner Is...
> >>
> >>
> https://issues.apache.org/jira/secure/attachment/12394264/apache_solr_a_red.jpg
> >> ...by Michiel
> >>
> >> We ran into a few hiccups during the contest, making it take longer than
> >> intended, but the result was a thorough process in which everyone went
> >> above and beyond to ensure that the final choice best reflected the
> >> wishes of the community.
> >>
> >> You can expect to see the new logo appear on the site (and in the Solr
> >> app) in the next few weeks.
> >>
> >> Congrats Michiel!
> >>
> >>
> >> -Hoss
> >>
> >
> > --
> > Knowlogy
> > Helperpark 290 C
> > 9723 ZA Groningen
> > +31 (0)50 2103567
> > http://www.knowlogy.nl
> >
> > mathijs.hommi...@knowlogy.nl
> > +31 (0)6 15312977
> >
> >
> >
>
>
>
> --
> Jeryl Cook
> /^\ Pharaoh /^\
> http://pharaohofkush.blogspot.com/
> "Whether we bring our enemies to justice, or bring justice to our
> enemies, justice will be done."
> --George W. Bush, Address to a Joint Session of Congress and the
> American People, September 20, 2001
>


highlighting and stemming

2008-12-19 Thread David Bowen
We have two text fields, one for author names, and the other for the body of
the document.  It often happens that the author names also appear in the
body of the document.  We turned off stemming for the author field to avoid
unexpected matches when searching by author.

Now, suppose we have an author named "Joe Bloggs" whose name appears in both
the fields.  If the user searches for him by author, we get correct
highlighting in the author field, but only "Joe" and not "Bloggs" is
highlighted in the main body field.  Conversely, if the user searches for
"Joe Bloggs" in the main body field, the highlighting is correct in that
field but this time only "Joe" is highlighted in the author field.

Any suggestions on how we could make this work as we expected (name properly
highlighted in both fields)? Is it a bug that the query isn't re-tokenized
when highlighting a field that has different tokenization specified than was
used for the search?