Re: delete snapshot??

2009-02-17 Thread sunnyfr

How can I remove them from time to time? With the snapcleaner script I only seem
to have the option to delete snapshots from the last day.
Thanks a lot, Noble, and sorry again for all these questions,


Noble Paul നോബിള്‍  नोब्ळ् wrote:
> 
> The hardlinks will prevent the unused files from getting cleaned up.
> So the diskspace is consumed for unused index files also. You may need
> to delete unused snapshots from time to time
> --Noble
> 
> On Tue, Feb 17, 2009 at 5:24 AM, sunnyfr  wrote:
>>
>> Hi Noble,
>>
>> I maybe don't get something
>> Ok if it's hard link but how come i've not space left on device error and
>> 30G shown on the data folder ??
>> sorry I'm quite new
>>
>> 6.0G  /data/solr/book/data/snapshot.20090216214502
>> 35M   /data/solr/book/data/snapshot.20090216195003
>> 12M   /data/solr/book/data/snapshot.20090216195502
>> 12K   /data/solr/book/data/spellchecker2
>> 36M   /data/solr/book/data/snapshot.20090216185502
>> 37M   /data/solr/book/data/snapshot.20090216203502
>> 6.0M  /data/solr/book/data/index
>> 12K   /data/solr/book/data/snapshot.20090216204002
>> 5.8G  /data/solr/book/data/snapshot.20090216172020
>> 12K   /data/solr/book/data/spellcheckerFile
>> 28K   /data/solr/book/data/snapshot.20090216200503
>> 40K   /data/solr/book/data/snapshot.20090216194002
>> 24K   /data/solr/book/data/snapshot.2009021622
>> 32K   /data/solr/book/data/snapshot.20090216184502
>> 20K   /data/solr/book/data/snapshot.20090216191004
>> 1.1M  /data/solr/book/data/snapshot.20090216213502
>> 1.1M  /data/solr/book/data/snapshot.20090216201502
>> 1.1M  /data/solr/book/data/snapshot.20090216213005
>> 24K   /data/solr/book/data/snapshot.20090216191502
>> 1.1M  /data/solr/book/data/snapshot.20090216212503
>> 107M  /data/solr/book/data/snapshot.20090216212002
>> 14M   /data/solr/book/data/snapshot.20090216190502
>> 32K   /data/solr/book/data/snapshot.20090216201002
>> 2.3M  /data/solr/book/data/snapshot.20090216204502
>> 28K   /data/solr/book/data/snapshot.20090216184002
>> 5.8G  /data/solr/book/data/snapshot.20090216181425
>> 44K   /data/solr/book/data/snapshot.20090216190001
>> 20K   /data/solr/book/data/snapshot.20090216183401
>> 1.1M  /data/solr/book/data/snapshot.20090216203002
>> 44K   /data/solr/book/data/snapshot.20090216194502
>> 36K   /data/solr/book/data/snapshot.20090216185004
>> 12K   /data/solr/book/data/snapshot.20090216182720
>> 12K   /data/solr/book/data/snapshot.20090216214001
>> 5.8G  /data/solr/book/data/snapshot.20090216175106
>> 1.1M  /data/solr/book/data/snapshot.20090216202003
>> 5.8G  /data/solr/book/data/snapshot.20090216173224
>> 12K   /data/solr/book/data/spellchecker1
>> 1.1M  /data/solr/book/data/snapshot.20090216202502
>> 30G   /data/solr/book/data
>>  thanks a lot,
>>
>>
>> Noble Paul നോബിള്‍  नोब्ळ् wrote:
>>>
>>> they are just hardlinks. they do not consume space on disk
>>>
>>> On Mon, Feb 16, 2009 at 10:34 PM, sunnyfr  wrote:

 Hi,

 Ok but can I use it more often then every day like every three hours,
 because snapshot are quite big.

 Thanks a lot,


 Bill Au wrote:
>
> The --delete option of the rsync command deletes extraneous files from
> the
> destination directory.  It does not delete Solr snapshots.  To do that
> you
> can use the snapcleaner on the master and/or slave.
>
> Bill
>
> On Fri, Feb 13, 2009 at 10:15 AM, sunnyfr 
> wrote:
>
>>
>> root 26834 16.2  0.0  19412   824 ?S16:05   0:08
>> rsync
>> -Wa
>> --delete rsync://##.##.##.##:18180/solr/snapshot.20090213160051/
>> /data/solr/books/data/snapshot.20090213160051-wip
>>
>> Hi obviously it can't delete them because the adress is bad it
>> shouldnt
>> be
>> :
>> rsync://##.##.##.##:18180/solr/snapshot.20090213160051/
>> but:
>> rsync://##.##.##.##:18180/solr/books/snapshot.20090213160051/
>>
>> Where should I change this, I checked my script.conf on the slave
>> server
>> but
>> it seems good.
>>
>> Because files can be very big and my server in few hours is getting
>> full.
>>
>> So actually snapcleaner is not necessary on the master ? what about
>> the
>> slave?
>>
>> Thanks a lot,
>> Sunny
>> --
>> View this message in context:
>> http://www.nabble.com/delete-snapshot---tp21998333p21998333.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
>
>

 --
 View this message in context:
 http://www.nabble.com/delete-snapshot---tp21998333p22041332.html
 Sent from the Solr - User mailing list archive at Nabble.com.


>>>
>>>
>>>
>>> --
>>> --Noble Paul
>>>
>>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/delete-snapshot---tp21998333p22048398.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 
> 
> -- 
> --Noble Paul
> 

Re: dealing with logs - feature advice based on a use case

2009-02-17 Thread Otis Gospodnetic
Marc,

I don't have a Multicore setup that's itching for better logging, but I 
think what you are suggesting is good.  If I had a multicore setup I might want 
either separate logs or the option to log the core name.  Perhaps an 
Enhancement type JIRA entry is in order?

Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch 





From: Marc Sturlese 
To: solr-user@lucene.apache.org
Sent: Wednesday, January 14, 2009 11:54:09 PM
Subject: dealing with logs - feature advice based on a use case


Hey there,
Just want to describe a feature I think would be really useful for the
future.
In my use case I need a log per core. I spoke about this feature before. My
idea was to separate the logs with log4j, but I saw it was not that easy. In
the other thread we spoke about passing the core name to the loggers. Doing
that would require so much hacking that I decided against it (otherwise it
would be almost impossible to upgrade to new releases). I think it would be
great to have this in Solr.

To work around it, what I have done is use log4j and log all messages to the
syslog. Once they are there, I have bash scripts that redirect the messages
depending on the core name they carry. This would solve my problem, except
that many messages don't contain the core name, so I can't redirect them to
the needed log file.
So, another possible solution would be to include the core name in all log
messages.

Don't you think that would be useful in many use cases?
Thanks in advance
-- 
View this message in context: 
http://www.nabble.com/dealing-with-logs---feature-advice-based-on-a-use-case-tp21458747p21458747.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: delete snapshot??

2009-02-17 Thread Otis Gospodnetic
Hi,

snapcleaner lets you delete snapshots by one of the following two criteria:
- delete all but last N snapshots
- delete all snapshots older than N days

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





From: sunnyfr 
To: solr-user@lucene.apache.org
Sent: Tuesday, February 17, 2009 4:17:39 PM
Subject: Re: delete snapshot??


How can I remove from time to time, because for the script snapcleaner I just
have the option to delete last day ??? 
thanks a lot Noble and sorry again for all this question,


Noble Paul നോബിള്‍  नोब्ळ् wrote:
> 
> The hardlinks will prevent the unused files from getting cleaned up.
> So the diskspace is consumed for unused index files also. You may need
> to delete unused snapshots from time to time
> --Noble
> 
> On Tue, Feb 17, 2009 at 5:24 AM, sunnyfr  wrote:
>>
>> Hi Noble,
>>
>> I maybe don't get something
>> Ok if it's hard link but how come i've not space left on device error and
>> 30G shown on the data folder ??
>> sorry I'm quite new
>>
>> 6.0G    /data/solr/book/data/snapshot.20090216214502
>> 35M    /data/solr/book/data/snapshot.20090216195003
>> 12M    /data/solr/book/data/snapshot.20090216195502
>> 12K    /data/solr/book/data/spellchecker2
>> 36M    /data/solr/book/data/snapshot.20090216185502
>> 37M    /data/solr/book/data/snapshot.20090216203502
>> 6.0M    /data/solr/book/data/index
>> 12K    /data/solr/book/data/snapshot.20090216204002
>> 5.8G    /data/solr/book/data/snapshot.20090216172020
>> 12K    /data/solr/book/data/spellcheckerFile
>> 28K    /data/solr/book/data/snapshot.20090216200503
>> 40K    /data/solr/book/data/snapshot.20090216194002
>> 24K    /data/solr/book/data/snapshot.2009021622
>> 32K    /data/solr/book/data/snapshot.20090216184502
>> 20K    /data/solr/book/data/snapshot.20090216191004
>> 1.1M    /data/solr/book/data/snapshot.20090216213502
>> 1.1M    /data/solr/book/data/snapshot.20090216201502
>> 1.1M    /data/solr/book/data/snapshot.20090216213005
>> 24K    /data/solr/book/data/snapshot.20090216191502
>> 1.1M    /data/solr/book/data/snapshot.20090216212503
>> 107M    /data/solr/book/data/snapshot.20090216212002
>> 14M    /data/solr/book/data/snapshot.20090216190502
>> 32K    /data/solr/book/data/snapshot.20090216201002
>> 2.3M    /data/solr/book/data/snapshot.20090216204502
>> 28K    /data/solr/book/data/snapshot.20090216184002
>> 5.8G    /data/solr/book/data/snapshot.20090216181425
>> 44K    /data/solr/book/data/snapshot.20090216190001
>> 20K    /data/solr/book/data/snapshot.20090216183401
>> 1.1M    /data/solr/book/data/snapshot.20090216203002
>> 44K    /data/solr/book/data/snapshot.20090216194502
>> 36K    /data/solr/book/data/snapshot.20090216185004
>> 12K    /data/solr/book/data/snapshot.20090216182720
>> 12K    /data/solr/book/data/snapshot.20090216214001
>> 5.8G    /data/solr/book/data/snapshot.20090216175106
>> 1.1M    /data/solr/book/data/snapshot.20090216202003
>> 5.8G    /data/solr/book/data/snapshot.20090216173224
>> 12K    /data/solr/book/data/spellchecker1
>> 1.1M    /data/solr/book/data/snapshot.20090216202502
>> 30G    /data/solr/book/data
>>  thanks a lot,
>>
>>
>> Noble Paul നോബിള്‍  नोब्ळ् wrote:
>>>
>>> they are just hardlinks. they do not consume space on disk
>>>
>>> On Mon, Feb 16, 2009 at 10:34 PM, sunnyfr  wrote:

 Hi,

 Ok but can I use it more often then every day like every three hours,
 because snapshot are quite big.

 Thanks a lot,


 Bill Au wrote:
>
> The --delete option of the rsync command deletes extraneous files from
> the
> destination directory.  It does not delete Solr snapshots.  To do that
> you
> can use the snapcleaner on the master and/or slave.
>
> Bill
>
> On Fri, Feb 13, 2009 at 10:15 AM, sunnyfr 
> wrote:
>
>>
>> root    26834 16.2  0.0  19412  824 ?        S    16:05  0:08
>> rsync
>> -Wa
>> --delete rsync://##.##.##.##:18180/solr/snapshot.20090213160051/
>> /data/solr/books/data/snapshot.20090213160051-wip
>>
>> Hi obviously it can't delete them because the adress is bad it
>> shouldnt
>> be
>> :
>> rsync://##.##.##.##:18180/solr/snapshot.20090213160051/
>> but:
>> rsync://##.##.##.##:18180/solr/books/snapshot.20090213160051/
>>
>> Where should I change this, I checked my script.conf on the slave
>> server
>> but
>> it seems good.
>>
>> Because files can be very big and my server in few hours is getting
>> full.
>>
>> So actually snapcleaner is not necessary on the master ? what about
>> the
>> slave?
>>
>> Thanks a lot,
>> Sunny
>> --
>> View this message in context:
>> http://www.nabble.com/delete-snapshot---tp21998333p21998333.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
>
>

 --
 View this message in context:
 http://www.nabble.com/delete-snapshot---tp21

Re: Outofmemory error for large files

2009-02-17 Thread Shalin Shekhar Mangar
On Tue, Feb 17, 2009 at 1:10 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Right.  But I was trying to point out that a single 150MB Document is not
> in fact what the o.p. wants to do.  For example, if your 150MB represents,
> say, a whole book, should that really be a single document?  Or should
> individual chapters be separate documents, for example?
>
>
Yes, a 150MB document is probably not a good idea. I am only trying to point
out that even if he writes multiple documents in a 150MB batch, he may still
hit the OOME because all the XML is written to memory first and then out to
the server.

-- 
Regards,
Shalin Shekhar Mangar.
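
A minimal SolrJ sketch of the batching idea discussed above, assuming a server
at http://localhost:8983/solr and a schema with "id" and "text" fields
(CommonsHttpSolrServer is the SolrJ 1.3-era HTTP client). Sending modest
batches keeps the client from ever serializing a 150MB XML payload in memory:

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder URL -- point this at your own Solr instance.
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("text", "... body of document " + i + " ...");
            batch.add(doc);

            // Flush a modest batch instead of one huge request, so the
            // client never has to hold the whole payload in memory.
            if (batch.size() == 500) {
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.commit();
    }
}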


Facet search on Multi-Valued Fields

2009-02-17 Thread Wang Guangchen
Hi all,
I have been experimenting with Solr faceted search for 2 weeks, but I have
hit a performance limitation on facet search.
My Solr index contains 4,000,000 documents. Normal searching is fairly fast, but
faceted search is extremely slow.

I am trying to facet on 3 fields (all multi-valued fields) in one
query. field1 has 2 million distinct values, field2 has 1.5 million distinct
values, field3 has 50,000 distinct values.

I have already set the filterCache size to 3,000,000, but searching is still
very slow. Each query normally takes 5 minutes or more. As I narrow down
the search, the speed increases dramatically.

Is there any way to optimize the faceted search? Any help is appreciated.
Thanks in advance.



Regards

GC


Re: Facet search on Multi-Valued Fields

2009-02-17 Thread Marc Sturlese

Have you tried a nightly build with the new facet algorithm (it is
activated by default)?
http://www.nabble.com/new-faceting-algorithm-td20674902.html


Wang Guangchen wrote:
> 
> Hi all,
> I have been experimenting solr faceted search for 2 weeks. But I meet
> performance limitation on facet Search.
> My solr contains 4,000,000 documents. Normal searching is fairly fast, But
> faceted search is extremely slow.
> 
> I am trying to do facet search on 3 fields (all multivalued fields) in one
> query. field1 has 2 million distinct values, field2 has 1.5 million
> distinct
> values, field3 has 50,000 distinct values.
> 
> I already set the filterCache to 3,000,000, But the searching speed is
> still
> very slow. Normally each query will took 5 mins or more.  As I narrow down
> the search, the speed will increase dramatically.
> 
> Is there anyway to optimize the faceted search?  Every help is
> appreciated.
> Thanks in advanced.
> 
> 
> 
> Regards
> 
> GC
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Facet-search-on-Multi-Valued-Fields-tp22053260p22053578.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Multilanguage

2009-02-17 Thread Paul Libbrecht

I was looking for such a tool and haven't found it yet.
Using StandardAnalyzer one can obtain some form of token stream which
can be used for language-agnostic analysis.
Clearly, then, something that matches words in a dictionary and
decides on the language based on the language of the majority of matches
could do a decent job of choosing the analyzer.


Does such a tool exist?
It doesn't seem too hard for Lucene.

paul
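
A rough Java sketch of that dictionary-majority idea; the per-language word
lists below are tiny illustrative placeholders, and a real implementation
would load much larger lists (or use n-gram profiles instead):

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MajorityLanguageGuesser {
    private final Map<String, Set<String>> dictionaries = new HashMap<String, Set<String>>();

    public void addLanguage(String lang, String... commonWords) {
        dictionaries.put(lang, new HashSet<String>(Arrays.asList(commonWords)));
    }

    /** Returns the language whose word list matches the most tokens. */
    public String guess(String text) {
        // Naive whitespace/punctuation tokenization; fine for a sketch.
        String[] tokens = text.toLowerCase().split("\\W+");
        String best = null;
        int bestHits = -1;
        for (Map.Entry<String, Set<String>> e : dictionaries.entrySet()) {
            int hits = 0;
            for (String t : tokens) {
                if (e.getValue().contains(t)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        MajorityLanguageGuesser g = new MajorityLanguageGuesser();
        // Tiny illustrative word lists; real ones would be much larger.
        g.addLanguage("en", "the", "and", "of", "to", "is");
        g.addLanguage("de", "der", "die", "und", "ist", "nicht");
        g.addLanguage("fr", "le", "la", "et", "est", "les");
        System.out.println(g.guess("the cat is on the mat and the dog is not"));
    }
}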


Le 17-févr.-09 à 04:44, Otis Gospodnetic a écrit :

The best option would be to identify the language after parsing the  
PDF and then index it using an appropriate analyzer defined in  
schema.xml.






Re: Facet search on Multi-Valued Fields

2009-02-17 Thread Wang Guangchen
Nope, I am using the latest stable version of Solr, 1.3.0.

Thanks for your tips.

Besides this, is there anything else I should do? I am reading some
previous threads about index optimization
(http://www.mail-archive.com/solr-user@lucene.apache.org/msg05290.html). Will
it improve the facet search speed?

GC



On Tue, Feb 17, 2009 at 5:30 PM, Marc Sturlese wrote:

>
> Have you tired with a  nightly build with the new facet algorithm (it is
> activated by default)?
> http://www.nabble.com/new-faceting-algorithm-td20674902.html
>
>
> Wang Guangchen wrote:
> >
> > Hi all,
> > I have been experimenting solr faceted search for 2 weeks. But I meet
> > performance limitation on facet Search.
> > My solr contains 4,000,000 documents. Normal searching is fairly fast,
> But
> > faceted search is extremely slow.
> >
> > I am trying to do facet search on 3 fields (all multivalued fields) in
> one
> > query. field1 has 2 million distinct values, field2 has 1.5 million
> > distinct
> > values, field3 has 50,000 distinct values.
> >
> > I already set the filterCache to 3,000,000, But the searching speed is
> > still
> > very slow. Normally each query will took 5 mins or more.  As I narrow
> down
> > the search, the speed will increase dramatically.
> >
> > Is there anyway to optimize the faceted search?  Every help is
> > appreciated.
> > Thanks in advanced.
> >
> >
> >
> > Regards
> >
> > GC
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Facet-search-on-Multi-Valued-Fields-tp22053260p22053578.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: Multilanguage

2009-02-17 Thread Till Kinstler

Paul Libbrecht schrieb:

Clearly, then, something that matches words in a dictionary and decides 
on the language based on the language of the majority could do a decent 
job to decide the analyzer.


Does such a tool exist?


I once played around with http://ngramj.sourceforge.net/ for language 
guessing. It did a good job. It doesn't use dictionaries for language 
identification but a statistical approach using ngrams.
I don't have any precise numbers, but out of about 1 documents in 
different languages (most in English, German and French, few in other 
european languages like Polish) there were only some 10 not identified 
correctly.


Till

--
Till Kinstler
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
Platz der Göttinger Sieben 1, D 37073 Göttingen
kinst...@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de


Re: Facet search on Multi-Valued Fields

2009-02-17 Thread Marc Sturlese

Well, optimizing after indexing will always improve your search speed a
little bit, but with the new facet algorithm you will notice a huge
improvement...
Other things to consider: only index and store the necessary fields, and set
omitNorms wherever possible... there are many tips around... keep
reading ;) 
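
For reference, a facet request boils down to a handful of parameters. A
minimal SolrJ sketch follows; the field names are placeholders, and the
facet.method parameter is an assumption that only applies on builds that
include the new faceting code:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetQueryExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("*:*");
        query.setRows(10);
        query.setFacet(true);
        query.addFacetField("field1");   // placeholder multi-valued field names
        query.addFacetField("field2");
        query.setFacetMinCount(1);       // skip zero-count values
        query.setFacetLimit(20);         // usually only the top values are needed
        // Assumption: on nightly builds that ship the new algorithm,
        // facet.method=fc selects it explicitly.
        query.set("facet.method", "fc");

        QueryResponse rsp = server.query(query);
        for (FacetField ff : rsp.getFacetFields()) {
            System.out.println(ff.getName());
            for (FacetField.Count c : ff.getValues()) {
                System.out.println("  " + c.getName() + " : " + c.getCount());
            }
        }
    }
}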


Wang Guangchen wrote:
> 
> Nope, I am using the latest stable version of solr 1.3.0.
> 
> Thanks for your tips.
> 
> Besides this, Is there any other thing I should do?  I am reading some
> previous threads about index optimization. (
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg05290.html),
> Will
> it improve the facet search speed?
> 
> GC
> 
> 
> 
> On Tue, Feb 17, 2009 at 5:30 PM, Marc Sturlese
> wrote:
> 
>>
>> Have you tired with a  nightly build with the new facet algorithm (it is
>> activated by default)?
>> http://www.nabble.com/new-faceting-algorithm-td20674902.html
>>
>>
>> Wang Guangchen wrote:
>> >
>> > Hi all,
>> > I have been experimenting solr faceted search for 2 weeks. But I meet
>> > performance limitation on facet Search.
>> > My solr contains 4,000,000 documents. Normal searching is fairly fast,
>> But
>> > faceted search is extremely slow.
>> >
>> > I am trying to do facet search on 3 fields (all multivalued fields) in
>> one
>> > query. field1 has 2 million distinct values, field2 has 1.5 million
>> > distinct
>> > values, field3 has 50,000 distinct values.
>> >
>> > I already set the filterCache to 3,000,000, But the searching speed is
>> > still
>> > very slow. Normally each query will took 5 mins or more.  As I narrow
>> down
>> > the search, the speed will increase dramatically.
>> >
>> > Is there anyway to optimize the faceted search?  Every help is
>> > appreciated.
>> > Thanks in advanced.
>> >
>> >
>> >
>> > Regards
>> >
>> > GC
>> >
>> >
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Facet-search-on-Multi-Valued-Fields-tp22053260p22053578.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Facet-search-on-Multi-Valued-Fields-tp22053260p22054095.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Facet search on Multi-Valued Fields

2009-02-17 Thread Wang Guangchen
Thank you very much.

On Tue, Feb 17, 2009 at 6:04 PM, Marc Sturlese wrote:

>
> Well doing an optimization after you do indexing will always improve your
> search speed a little bit. But with the new facet algorithm you will note a
> huge improvement ...
> Other things to consider is to just index and store the necessary fields,
> omitNorms always that is possible... there are many tips around... keep
> reading ;)
>
>
> Wang Guangchen wrote:
> >
> > Nope, I am using the latest stable version of solr 1.3.0.
> >
> > Thanks for your tips.
> >
> > Besides this, Is there any other thing I should do?  I am reading some
> > previous threads about index optimization. (
> > http://www.mail-archive.com/solr-user@lucene.apache.org/msg05290.html),
> > Will
> > it improve the facet search speed?
> >
> > GC
> >
> >
> >
> > On Tue, Feb 17, 2009 at 5:30 PM, Marc Sturlese
> > wrote:
> >
> >>
> >> Have you tired with a  nightly build with the new facet algorithm (it is
> >> activated by default)?
> >> http://www.nabble.com/new-faceting-algorithm-td20674902.html
> >>
> >>
> >> Wang Guangchen wrote:
> >> >
> >> > Hi all,
> >> > I have been experimenting solr faceted search for 2 weeks. But I meet
> >> > performance limitation on facet Search.
> >> > My solr contains 4,000,000 documents. Normal searching is fairly fast,
> >> But
> >> > faceted search is extremely slow.
> >> >
> >> > I am trying to do facet search on 3 fields (all multivalued fields) in
> >> one
> >> > query. field1 has 2 million distinct values, field2 has 1.5 million
> >> > distinct
> >> > values, field3 has 50,000 distinct values.
> >> >
> >> > I already set the filterCache to 3,000,000, But the searching speed is
> >> > still
> >> > very slow. Normally each query will took 5 mins or more.  As I narrow
> >> down
> >> > the search, the speed will increase dramatically.
> >> >
> >> > Is there anyway to optimize the faceted search?  Every help is
> >> > appreciated.
> >> > Thanks in advanced.
> >> >
> >> >
> >> >
> >> > Regards
> >> >
> >> > GC
> >> >
> >> >
> >>
> >> --
> >> View this message in context:
> >>
> http://www.nabble.com/Facet-search-on-Multi-Valued-Fields-tp22053260p22053578.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
> >>
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Facet-search-on-Multi-Valued-Fields-tp22053260p22054095.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Finding total range of dates for date faceting

2009-02-17 Thread Jacob Singh
Hi,

I'm trying to write some code to build a facet list for a date field,
but I don't know what the first and last available dates are.  I would
adjust the gap param accordingly.  If there is a 10yr stretch between
min(date) and max(date) I'd want to facet by year.  If it is a 1 month
gap, I'd want to facet by day.

Is there a way to do this?

Thanks,
Jacob

-- 

+1 510 277-0891 (o)
+91  33 7458 (m)

web: http://pajamadesign.com

Skype: pajamadesign
Yahoo: jacobsingh
AIM: jacobsingh
gTalk: jacobsi...@gmail.com


Re: Multilanguage

2009-02-17 Thread revathy arun
Does Apache Tika help find the language of the given document?



On 2/17/09, Till Kinstler  wrote:
>
> Paul Libbrecht schrieb:
>
> Clearly, then, something that matches words in a dictionary and decides on
>> the language based on the language of the majority could do a decent job to
>> decide the analyzer.
>>
>> Does such a tool exist?
>>
>
> I once played around with http://ngramj.sourceforge.net/ for language
> guessing. It did a good job. It doesn't use dictionaries for language
> identification but a statistical approach using ngrams.
> I don't have any precise numbers, but out of about 1 documents in
> different languages (most in English, German and French, few in other
> european languages like Polish) there were only some 10 not identified
> correctly.
>
> Till
>
> --
> Till Kinstler
> Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
> Platz der Göttinger Sieben 1, D 37073 Göttingen
> kinst...@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de
>


DIH full-import with clean=true fails and rollback empties index

2009-02-17 Thread Steffen B.

Hi there,
I've got a pretty simple question regarding the DIH full-import command.
I have a SOLR server running that has a full index with lots of documents in
it. Once a day, a full-import is run, which uses the default parameters
(clean=true, because it's not an incremental index).
When I run a full-import, the first step is cleaning up the whole index:

Feb 7, 2009 2:12:01 AM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [] REMOVING ALL DOCUMENTS FROM INDEX

After that, suppose the import suddenly fails for one reason or another (e.g.
an SQL error), which initiates a rollback:

Feb 7, 2009 2:12:02 AM org.apache.solr.handler.dataimport.DataImporter
doFullImport
SEVERE: Full Import failed
[...]
Feb 7, 2009 2:12:02 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
Feb 7, 2009 2:12:02 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback
Feb 7, 2009 2:12:02 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=false,waitSearcher=true)

Unfortunately, this rollback does not "refill" the index with the old data,
nor does it keep the old index from being overwritten with the new,
erroneous one. Now my question is: is there anything I can do to keep Solr
from trashing my index on a full-import when there is a problem with the
database?
Or should I use clean=false, even though 99% of the imported documents are
not incremental but the same documents that already were in the index, only
with new data?
Any tips will be greatly appreciated! :)
- Steffen
-- 
View this message in context: 
http://www.nabble.com/DIH-full-import-with-clean%3Dtrue-fails-and-rollback-empties-index-tp22055065p22055065.html
Sent from the Solr - User mailing list archive at Nabble.com.



Query regarding setTimeAllowed(Integer) and setRows(Integer)

2009-02-17 Thread Jana, Kumar Raja
Hi,

 

I am trying to avoid queries which take a lot of server time. For this I
plan to use setRows(Integer) and setTimeAllowed(Integer) methods while
creating the SolrQuery. I would like to know the following:

 

1.   If I set SolrQuery.setRows(5000), will the processing of the query
stop once 5000 results are found, or will the query be completely
processed and the result set then sorted based on rank boosting,
with the top 5000 results returned?

2.   If I set SolrQuery.setTimeAllowed(2000), will this kill query
processing after 2 secs? (I know this question sounds silly but I just
want a confirmation from the experts :) )

 

Is there anything else I can do to get the desired results?

 

Thanks,

Kumar
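
For reference, a minimal SolrJ sketch combining the two settings with paging;
the URL is a placeholder, and note that rows only limits what is returned,
not what gets matched and scored:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BoundedQueryExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("some query");
        query.setRows(10);           // only return a page at a time; matching and
                                     // scoring still consider every hit
        query.setStart(0);           // advance this to page through the result set
        query.setTimeAllowed(2000);  // ask Solr to stop collecting after ~2 seconds

        QueryResponse rsp = server.query(query);
        System.out.println("total matches: " + rsp.getResults().getNumFound());
        System.out.println("rows returned: " + rsp.getResults().size());
    }
}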



Re: DIH full-import with clean=true fails and rollback empties index

2009-02-17 Thread Shalin Shekhar Mangar
On Tue, Feb 17, 2009 at 4:42 PM, Steffen B. wrote:

>
> Unfortunately, this rollback does not "refill" the index with the old data,
> and neither keeps the old index from being overwritten with the new,
> erroneous index. Now my question is: is there anything I can do to keep
> Solr
> from trashing my index on a full-import when there is a problem with the
> database?


This is not good. I'll try to write some tests and try to find the cause.


>
> Or should I use clean=false, even though 99% of the imported documents are
> not incremental but the same documents that already were in the index, only
> with new data?


Use clean=false for the time being. The old documents will be replaced with
the new ones (old and new must have same uniqueKey).
-- 
Regards,
Shalin Shekhar Mangar.
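
For reference, clean is just a request parameter on the DIH handler, so a
full-import that keeps existing documents looks like this (assuming the
handler is registered at /dataimport in solrconfig.xml):

http://localhost:8983/solr/dataimport?command=full-import&clean=false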


Re: DIH full-import with clean=true fails and rollback empties index

2009-02-17 Thread Noble Paul നോബിള്‍ नोब्ळ्
Maybe you can try "postImportDeleteQuery" (not yet documented,
SOLR-801) on a root entity.
You can keep a timestamp field that stores the value of
${dataimporter.index_start_time}. Use that to remove old
docs which may already have been in the index before the indexing started.
--Noble
On Tue, Feb 17, 2009 at 4:42 PM, Steffen B.  wrote:
>
> Hi there,
> I've got a pretty simple question regarding the DIH full-import command.
> I have a SOLR server running that has a full index with lots of documents in
> it. Once a day, a full-import is run, which uses the default parameters
> (clean=true, because it's not an incremental index).
> When I run a full-import, the first step is cleaning up the whole index:
>
> Feb 7, 2009 2:12:01 AM org.apache.solr.update.DirectUpdateHandler2 deleteAll
> INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
>
> After that, suppose the import suddenly fails for one reason or another (ie.
> SQL error), which initiates a rollback:
>
> Feb 7, 2009 2:12:02 AM org.apache.solr.handler.dataimport.DataImporter
> doFullImport
> SEVERE: Full Import failed
> [...]
> Feb 7, 2009 2:12:02 AM org.apache.solr.update.DirectUpdateHandler2 rollback
> INFO: start rollback
> Feb 7, 2009 2:12:02 AM org.apache.solr.update.DirectUpdateHandler2 rollback
> INFO: end_rollback
> Feb 7, 2009 2:12:02 AM org.apache.solr.update.DirectUpdateHandler2 commit
> INFO: start commit(optimize=false,waitFlush=false,waitSearcher=true)
>
> Unfortunately, this rollback does not "refill" the index with the old data,
> and neither keeps the old index from being overwritten with the new,
> erroneous index. Now my question is: is there anything I can do to keep Solr
> from trashing my index on a full-import when there is a problem with the
> database?
> Or should I use clean=false, even though 99% of the imported documents are
> not incremental but the same documents that already were in the index, only
> with new data?
> Any tips will be greatly appreciated! :)
> - Steffen
> --
> View this message in context: 
> http://www.nabble.com/DIH-full-import-with-clean%3Dtrue-fails-and-rollback-empties-index-tp22055065p22055065.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
--Noble Paul


2 strange behaviours with DIH full-import.

2009-02-17 Thread Marc Sturlese

Hey, I have 2 problems that I think are really important and can be useful
for other users:

1.) I am running 3 cores in a Solr instance. Each core contains about a
million and a half docs. Once a full-import is run in a core, only a little
of the Java memory is freed. Once that first full-import is done and I
run another full-import on another core, the memory used by the first
full-import is never released. Once the second full-import is done I
run the third... and I run out of memory! Is this a Solr bug in freeing memory,
or am I missing something? Is there any way to tell Solr to free all
memory after a full-import? It's a really severe problem in my case as I cannot
keep restarting the Tomcat server (I have other cron actions synchronized with
it).

2.) I run a full-import and everything works fine... I run another
full-import on the same core and everything seems to work fine. But I have
noticed that the index in the /data/index dir is twice as big. I have seen
that Solr uses this IndexWriter constructor when it executes a deleteAll at the
beginning of the full-import:
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/index/IndexWriter.html#IndexWriter(org.apache.lucene.store.Directory,%20org.apache.lucene.analysis.Analyzer,%20boolean,%20org.apache.lucene.index.IndexDeletionPolicy,%20org.apache.lucene.index.IndexWriter.MaxFieldLength)

Why is Lucene not deleting the data of the old index if the boolean argument of
the constructor is set to true? (The results are not duplicated, but
physically the /index directory is double the size.) Does this have something
to do with the deletionPolicy that is saving commits, or a Lucene 2.9-dev bug,
or something like that?

I am running a nightly build (from the beginning of January, with some of the
patches that have been appearing for concurrent-indexing problems) and Lucene
2.9-dev.
I would appreciate any advice, as these two problems are really driving me
crazy and I don't know how to sort them out... especially the first one.

Thanks in advance!!

-- 
View this message in context: 
http://www.nabble.com/2-strange-behaviours-with-DIH-full-import.-tp22055769p22055769.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Multilanguage

2009-02-17 Thread Otis Gospodnetic
Hi,

No, Tika doesn't do LangID.  I haven't used ngramj, so I can't speak for its 
accuracy nor speed (but I know the code has been around for years).  Another 
LangID implementation is at the URL below my name.

Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch 





From: revathy arun 
To: solr-user@lucene.apache.org
Sent: Tuesday, February 17, 2009 6:39:40 PM
Subject: Re: Multilanguage

Does Apache Tika help find the language of the given document?



On 2/17/09, Till Kinstler  wrote:
>
> Paul Libbrecht schrieb:
>
> Clearly, then, something that matches words in a dictionary and decides on
>> the language based on the language of the majority could do a decent job to
>> decide the analyzer.
>>
>> Does such a tool exist?
>>
>
> I once played around with http://ngramj.sourceforge.net/ for language
> guessing. It did a good job. It doesn't use dictionaries for language
> identification but a statistical approach using ngrams.
> I don't have any precise numbers, but out of about 1 documents in
> different languages (most in English, German and French, few in other
> european languages like Polish) there were only some 10 not identified
> correctly.
>
> Till
>
> --
> Till Kinstler
> Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
> Platz der Göttinger Sieben 1, D 37073 Göttingen
> kinst...@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de
>


Re: indexing Chinese language

2009-02-17 Thread Koji Sekiguchi
CharFilter can normalize (convert) traditional Chinese to simplified
Chinese or vice versa,
if you define a mapping.txt. Here is a sample of Chinese character
normalization:


https://issues.apache.org/jira/secure/attachment/12392639/character-normalization.JPG

See SOLR-822 for the detail:

https://issues.apache.org/jira/browse/SOLR-822

Koji


revathy arun wrote:

Hi,

When I index chinese content using chinese tokenizer and analyzer in solr
1.3 ,some of the chinese text files are getting indexed but others are not.

Since chinese has got many different language subtypes as in standard
chinese,simplified chinese etc which of these does the chinese tokenizer
support and is there any method to find the type of  chiense language  from
the file?

Rgds

  




Re: Word Locations & Search Components

2009-02-17 Thread Koji Sekiguchi

Hmm, Otis, very nice!

Koji

Otis Gospodnetic wrote:

Hi,

Wouldn't this be as easy as:
- split email into "paragraphs"
- for each paragraph compute signature (MD5 or something fuzzier, like in 
SOLR-799)
- for each signature look for other emails with this signature
- when you find an email with an identical signature, you know you've found the 
"banner"

I'd do this in a pre-processing phase.  You may have to add special logic for 
">" and other email-quoting characters.  Perhaps you can make use of assumption 
that banners always come at the end of emails.  Perhaps you can make use of situations where 
the banner appears multiple times in a single email (the one with lots of back-and-forth 
replies, for example).

This is similar to MoreLikeThis on paragraph level.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
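
A small Java sketch of the per-paragraph signature step described above,
splitting on blank lines and using a plain MD5 digest (a SOLR-799-style fuzzy
signature could be dropped in instead):

import java.security.MessageDigest;
import java.util.LinkedHashMap;
import java.util.Map;

public class ParagraphSignatures {

    /** Maps each paragraph to its MD5 signature (hex-encoded). */
    public static Map<String, String> signatures(String emailBody) throws Exception {
        Map<String, String> sigs = new LinkedHashMap<String, String>();
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        // Paragraphs are assumed to be separated by blank lines.
        for (String paragraph : emailBody.split("\\n\\s*\\n")) {
            String normalized = paragraph.replaceAll("(?m)^>+\\s*", "") // strip quote markers
                                         .replaceAll("\\s+", " ")
                                         .trim()
                                         .toLowerCase();
            if (normalized.length() == 0) {
                continue;
            }
            byte[] digest = md5.digest(normalized.getBytes("UTF-8"));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            sigs.put(paragraph, hex.toString());
        }
        return sigs;
    }

    public static void main(String[] args) throws Exception {
        String mail = "Hello,\n\nSome reply text.\n\n> quoted banner\n> appears in many mails\n";
        System.out.println(signatures(mail));
    }
}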
  




Re: Finding total range of dates for date faceting

2009-02-17 Thread Peter Wolanin
It *looks* as though Solr supports returning the results of arbitrary
calculations:

http://wiki.apache.org/solr/SolrQuerySyntax

However, I am so far unable to get any example working except in the
context of a dismax bf.  It seems like one ought to be able to write a
query to return the doc matching the max OR the min of a particular
field.

-Peter
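
One workaround that does not need function queries is to ask for the single
oldest and newest document by sorting. A SolrJ sketch, assuming a field
literally named "date" that is both indexed and stored:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocumentList;

public class DateRangeProbe {

    // Fetch the "date" value of the first document when sorted by date.
    private static Object boundaryDate(SolrServer server, SolrQuery.ORDER order) throws Exception {
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(1);
        q.setFields("date");
        q.setSortField("date", order);   // asc = oldest first, desc = newest first
        SolrDocumentList docs = server.query(q).getResults();
        return docs.isEmpty() ? null : docs.get(0).getFieldValue("date");
    }

    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        Object min = boundaryDate(server, SolrQuery.ORDER.asc);
        Object max = boundaryDate(server, SolrQuery.ORDER.desc);
        // Pick facet.date.gap from the span between min and max,
        // e.g. +1YEAR for a ten-year spread, +1DAY for a one-month spread.
        System.out.println("min=" + min + " max=" + max);
    }
}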

On Tue, Feb 17, 2009 at 5:33 AM, Jacob Singh  wrote:
> Hi,
>
> I'm trying to write some code to build a facet list for a date field,
> but I don't know what the first and last available dates are.  I would
> adjust the gap param accordingly.  If there is a 10yr stretch between
> min(date) and max(date) I'd want to facet by year.  If it is a 1 month
> gap, I'd want to facet by day.
>
> Is there a way to do this?
>
> Thanks,
> Jacob
>
> --
>
> +1 510 277-0891 (o)
> +91  33 7458 (m)
>
> web: http://pajamadesign.com
>
> Skype: pajamadesign
> Yahoo: jacobsingh
> AIM: jacobsingh
> gTalk: jacobsi...@gmail.com
>



-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
peter.wola...@acquia.com


Re: DIH transformers - sect 2

2009-02-17 Thread Fergus McMenemie
>On Mon, Feb 16, 2009 at 3:22 PM, Fergus McMenemie  wrote:
>>
>>  2) Having used TemplateTransformer to assign a value to an
>> entity column that column cannot be used in other
>> TemplateTransformer operations. In my project I am
>> attempting to reuse "x.fileWebPath". To fix this, the
>> last line of transformRow() in TemplateTransformer.java
>> needs replaced with the following which as well as
>> 'putting' the templated-ed string in 'row' also saves it
>> into the 'resolver'.
>>
>> **originally**
>>  row.put(column, resolver.replaceTokens(expr));
>>  }
>>
>> **new**
>>  String columnName = map.get(DataImporter.COLUMN);
>>  expr=resolver.replaceTokens(expr);
>>  row.put(columnName, expr);
>>  resolverMapCopy.put(columnName, expr);
>>  }
>
>isn't it better to write a custom transformer to achieve this. I did
>not want a standard component to change the state of the
>VariableResolver .
>
>I am not sure what is the best way.
>

Noble, (Good to have email working :-)

Hmm not sure why this requires a custom transformer. Why is this not 
more in the nature of a bug fix? Also the current behavior temporarily
adds all the column names into the resolver for the duration of the 
TemplateTransformer's operation, removing them again at the end. I
do not think there is any permanent change to the state of the 
VariableResolver.

Surely if we have defined a value for a column, that value should be
temporarily available in subsequent template or regexp operations?

Fergus.

>>
>>
>>   
>>   
>>
>>>   processor="FileListEntityProcessor"
>>   fileName="^.*\.xml$"
>>   newerThan="'NOW-1000DAYS'"
>>   recursive="true"
>>   rootEntity="false"
>>   dataSource="null"
>>   baseDir="/Volumes/spare/ts/solr/content"
>>   >
>>>  dataSource="myfilereader"
>>  processor="XPathEntityProcessor"
>>  url="${jc.fileAbsolutePath}"
>>  rootEntity="true"
>>  stream="false"
>>  forEach="/record | /record/mediaBlock"
>>  
>> transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer">
>>
>> 
>> > replaceWith="/ford$1" sourceColName="fileAbsolutePath"/>
>> 
>> 
>> 
>> > xpath="/record/metadata/da...@qualifier='pubDate']" 
>> dateTimeFormat="MMdd"   />
>>
>> > xpath="/record/mediaBlock/mediaObject/@vurl" />
>> > template="${dataimporter.request.fordinstalldir}" />
>> 
>>
>> > template="${dataimporter.request.contentinstalldir}" />
>> 
>> > replaceWith="$1/imagery/${x.vurl}s.jpg" sourceColName="fileWebPath"/>
>> > replaceWith="$1/imagery/${x.vurl}.jpg"  sourceColName="fileWebPath"/>
>> > template="${jc.fileAbsolutePath}#${x.vurl}" />
>>   
>>   
>>   
>>

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: snapshot created if there is no documente updated/new?

2009-02-17 Thread Bill Au
A snapshot is created every time snapshooter is invoked, even if there is no
change in the index.  However, since snapshots are created using hard
links, no additional space is used if there are no changes to the index.  It
does use up one directory entry in the data directory.

Bill

On Mon, Feb 16, 2009 at 5:03 AM, sunnyfr  wrote:

>
> Hi
>
> I would like to know if a snapshot is automaticly created even if there is
> no document update or added ?
>
> Thanks a lot,
> --
> View this message in context:
> http://www.nabble.com/snapshot-created-if-there-is-no-documente-updated-new--tp22034462p22034462.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: snapshot as big as the index folder?

2009-02-17 Thread Bill Au
Snapshots are created using hard links.  So even though it is as big as the
index, it is not taking up any more space on the disk.  The size of the
snapshot will change as the size of the index changes.

Bill

On Mon, Feb 16, 2009 at 9:50 AM, sunnyfr  wrote:

>
> It change a lot in few minute ?? is it normal ? thanks
>
> 5.8G  book/data/snapshot.20090216153346
> 4.0K  book/data/index
> 5.8G  book/data/
> r...@search-07:/data/solr# du -h book/data/
> 5.8G  book/data/snapshot.20090216153346
> 3.7G  book/data/index
> 4.0K  book/data/snapshot.20090216153759
> 9.4G  book/data/
> r...@search-07:/data/solr# du -h book/data/
> 5.8G  video/data/snapshot.20090216153346
> 4.4G  book/data/index
> 4.0K  book/data/snapshot.20090216153759
> 11G   book/data/
> r...@search-07:/data/solr# du -h book/data/
> 5.8G  book/data/snapshot.20090216153346
> 5.8G  book/data/index
> 4.0K  book/data/snapshot.20090216154819
> 4.0K  book/data/snapshot.20090216154820
> 15M   book/data/snapshot.20090216153759
> 12G   book/data/
>
>
>
>
> sunnyfr wrote:
> >
> > Hi,
> >
> > Is it normal or did I miss something ??
> > 5.8G  book/data/snapshot.20090216153346
> > 12K   book/data/spellchecker2
> > 4.0K  book/data/index
> > 12K   book/data/spellcheckerFile
> > 12K   book/data/spellchecker1
> > 5.8G  book/data/
> >
> > Last update ?
> > 92562
> > 45492
> > 0
> > 2009-02-16 15:20:01
> > 2009-02-16 15:20:01
> > 2009-02-16 15:20:42
> > 2009-02-16 15:20:42
> > 13223
> > -
> > 
> > Indexing completed. Added/Updated: 13223 documents. Deleted 0 documents.
> > 
> > 2009-02-16 15:33:50
> > 2009-02-16 15:33:50
> > 0:13:48.853
> >
> >
> > Thanks a lot,
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/snapshot-as-big-as-the-index-folder--tp22038427p22038656.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: delete snapshot??

2009-02-17 Thread Bill Au
usage: snapcleaner -D <days> | -N <num> [-d dir] [-u username] [-v]
       -D <days>      cleanup snapshots more than <days> days old
       -N <num>       keep the most recent <num> number of snapshots and
                      cleanup up the remaining ones that are not being pulled
       -d <dir>       specify directory holding index data
       -u <username>  specify user to sudo to before running script
       -v             increase verbosity
       -V             output debugging info

Bill

On Tue, Feb 17, 2009 at 3:24 AM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Hi,
>
> snapcleaner lets you delete snapshots by one of the following two criteria:
> - delete all but last N snapshots
> - delete all snapshots older than N days
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
>
> 
> From: sunnyfr 
> To: solr-user@lucene.apache.org
> Sent: Tuesday, February 17, 2009 4:17:39 PM
> Subject: Re: delete snapshot??
>
>
> How can I remove from time to time, because for the script snapcleaner I
> just
> have the option to delete last day ???
> thanks a lot Noble and sorry again for all this question,
>
>
> Noble Paul നോബിള്‍  नोब्ळ् wrote:
> >
> > The hardlinks will prevent the unused files from getting cleaned up.
> > So the diskspace is consumed for unused index files also. You may need
> > to delete unused snapshots from time to time
> > --Noble
> >
> > On Tue, Feb 17, 2009 at 5:24 AM, sunnyfr  wrote:
> >>
> >> Hi Noble,
> >>
> >> I maybe don't get something
> >> Ok if it's hard link but how come i've not space left on device error
> and
> >> 30G shown on the data folder ??
> >> sorry I'm quite new
> >>
> >> 6.0G/data/solr/book/data/snapshot.20090216214502
> >> 35M/data/solr/book/data/snapshot.20090216195003
> >> 12M/data/solr/book/data/snapshot.20090216195502
> >> 12K/data/solr/book/data/spellchecker2
> >> 36M/data/solr/book/data/snapshot.20090216185502
> >> 37M/data/solr/book/data/snapshot.20090216203502
> >> 6.0M/data/solr/book/data/index
> >> 12K/data/solr/book/data/snapshot.20090216204002
> >> 5.8G/data/solr/book/data/snapshot.20090216172020
> >> 12K/data/solr/book/data/spellcheckerFile
> >> 28K/data/solr/book/data/snapshot.20090216200503
> >> 40K/data/solr/book/data/snapshot.20090216194002
> >> 24K/data/solr/book/data/snapshot.2009021622
> >> 32K/data/solr/book/data/snapshot.20090216184502
> >> 20K/data/solr/book/data/snapshot.20090216191004
> >> 1.1M/data/solr/book/data/snapshot.20090216213502
> >> 1.1M/data/solr/book/data/snapshot.20090216201502
> >> 1.1M/data/solr/book/data/snapshot.20090216213005
> >> 24K/data/solr/book/data/snapshot.20090216191502
> >> 1.1M/data/solr/book/data/snapshot.20090216212503
> >> 107M/data/solr/book/data/snapshot.20090216212002
> >> 14M/data/solr/book/data/snapshot.20090216190502
> >> 32K/data/solr/book/data/snapshot.20090216201002
> >> 2.3M/data/solr/book/data/snapshot.20090216204502
> >> 28K/data/solr/book/data/snapshot.20090216184002
> >> 5.8G/data/solr/book/data/snapshot.20090216181425
> >> 44K/data/solr/book/data/snapshot.20090216190001
> >> 20K/data/solr/book/data/snapshot.20090216183401
> >> 1.1M/data/solr/book/data/snapshot.20090216203002
> >> 44K/data/solr/book/data/snapshot.20090216194502
> >> 36K/data/solr/book/data/snapshot.20090216185004
> >> 12K/data/solr/book/data/snapshot.20090216182720
> >> 12K/data/solr/book/data/snapshot.20090216214001
> >> 5.8G/data/solr/book/data/snapshot.20090216175106
> >> 1.1M/data/solr/book/data/snapshot.20090216202003
> >> 5.8G/data/solr/book/data/snapshot.20090216173224
> >> 12K/data/solr/book/data/spellchecker1
> >> 1.1M/data/solr/book/data/snapshot.20090216202502
> >> 30G/data/solr/book/data
> >>  thanks a lot,
> >>
> >>
> >> Noble Paul നോബിള്‍  नोब्ळ् wrote:
> >>>
> >>> they are just hardlinks. they do not consume space on disk
> >>>
> >>> On Mon, Feb 16, 2009 at 10:34 PM, sunnyfr 
> wrote:
> 
>  Hi,
> 
>  Ok but can I use it more often then every day like every three hours,
>  because snapshot are quite big.
> 
>  Thanks a lot,
> 
> 
>  Bill Au wrote:
> >
> > The --delete option of the rsync command deletes extraneous files
> from
> > the
> > destination directory.  It does not delete Solr snapshots.  To do
> that
> > you
> > can use the snapcleaner on the master and/or slave.
> >
> > Bill
> >
> > On Fri, Feb 13, 2009 at 10:15 AM, sunnyfr 
> > wrote:
> >
> >>
> >> root26834 16.2  0.0  19412  824 ?S16:05  0:08
> >> rsync
> >> -Wa
> >> --delete rsync://##.##.##.##:18180/solr/snapshot.20090213160051/
> >> /data/solr/books/data/snapshot.20090213160051-wip
> >>
> >> Hi obviously it can't delete them because the adress is bad it
> >> shouldnt
> >> be
> >> :
> >> rsync://##.##.##.##:18180/solr/snapshot

Re: delete snapshot??

2009-02-17 Thread Walter Underwood
I run snapcleaner from cron. That cleans up old snapshots once
each day. Here is a crontab line that runs it at 30 minutes past
the hour, every hour.

30 * * * * /apps/wss/solr_home/bin/snapcleaner -N 3

wunder

On 2/17/09 7:23 AM, "Bill Au"  wrote:

> usage: snapcleaner -D <days> | -N <num> [-d dir] [-u username] [-v]
>        -D <days>      cleanup snapshots more than <days> days old
>        -N <num>       keep the most recent <num> number of snapshots and
>                       cleanup up the remaining ones that are not being pulled
>        -d <dir>       specify directory holding index data
>        -u <username>  specify user to sudo to before running script
>        -v             increase verbosity
>        -V             output debugging info
> 
> Bill
> 
> On Tue, Feb 17, 2009 at 3:24 AM, Otis Gospodnetic <
> otis_gospodne...@yahoo.com> wrote:
> 
>> Hi,
>> 
>> snapcleaner lets you delete snapshots by one of the following two criteria:
>> - delete all but last N snapshots
>> - delete all snapshots older than N days
>> 
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> 
>> 
>> 
>> 
>> 
>> From: sunnyfr 
>> To: solr-user@lucene.apache.org
>> Sent: Tuesday, February 17, 2009 4:17:39 PM
>> Subject: Re: delete snapshot??
>> 
>> 
>> How can I remove from time to time, because for the script snapcleaner I
>> just
>> have the option to delete last day ???
>> thanks a lot Noble and sorry again for all this question,
>> 
>> 
>> Noble Paul നോബിള്‍  नोब्ळ् wrote:
>>> 
>>> The hardlinks will prevent the unused files from getting cleaned up.
>>> So the diskspace is consumed for unused index files also. You may need
>>> to delete unused snapshots from time to time
>>> --Noble
>>> 
>>> On Tue, Feb 17, 2009 at 5:24 AM, sunnyfr  wrote:
 
 Hi Noble,
 
 I maybe don't get something
 Ok if it's hard link but how come i've not space left on device error
>> and
 30G shown on the data folder ??
 sorry I'm quite new
 
 6.0G/data/solr/book/data/snapshot.20090216214502
 35M/data/solr/book/data/snapshot.20090216195003
 12M/data/solr/book/data/snapshot.20090216195502
 12K/data/solr/book/data/spellchecker2
 36M/data/solr/book/data/snapshot.20090216185502
 37M/data/solr/book/data/snapshot.20090216203502
 6.0M/data/solr/book/data/index
 12K/data/solr/book/data/snapshot.20090216204002
 5.8G/data/solr/book/data/snapshot.20090216172020
 12K/data/solr/book/data/spellcheckerFile
 28K/data/solr/book/data/snapshot.20090216200503
 40K/data/solr/book/data/snapshot.20090216194002
 24K/data/solr/book/data/snapshot.2009021622
 32K/data/solr/book/data/snapshot.20090216184502
 20K/data/solr/book/data/snapshot.20090216191004
 1.1M/data/solr/book/data/snapshot.20090216213502
 1.1M/data/solr/book/data/snapshot.20090216201502
 1.1M/data/solr/book/data/snapshot.20090216213005
 24K/data/solr/book/data/snapshot.20090216191502
 1.1M/data/solr/book/data/snapshot.20090216212503
 107M/data/solr/book/data/snapshot.20090216212002
 14M/data/solr/book/data/snapshot.20090216190502
 32K/data/solr/book/data/snapshot.20090216201002
 2.3M/data/solr/book/data/snapshot.20090216204502
 28K/data/solr/book/data/snapshot.20090216184002
 5.8G/data/solr/book/data/snapshot.20090216181425
 44K/data/solr/book/data/snapshot.20090216190001
 20K/data/solr/book/data/snapshot.20090216183401
 1.1M/data/solr/book/data/snapshot.20090216203002
 44K/data/solr/book/data/snapshot.20090216194502
 36K/data/solr/book/data/snapshot.20090216185004
 12K/data/solr/book/data/snapshot.20090216182720
 12K/data/solr/book/data/snapshot.20090216214001
 5.8G/data/solr/book/data/snapshot.20090216175106
 1.1M/data/solr/book/data/snapshot.20090216202003
 5.8G/data/solr/book/data/snapshot.20090216173224
 12K/data/solr/book/data/spellchecker1
 1.1M/data/solr/book/data/snapshot.20090216202502
 30G/data/solr/book/data
  thanks a lot,
 
 
 Noble Paul നോബിള്‍  नोब्ळ् wrote:
> 
> they are just hardlinks. they do not consume space on disk
> 
> On Mon, Feb 16, 2009 at 10:34 PM, sunnyfr 
>> wrote:
>> 
>> Hi,
>> 
>> Ok but can I use it more often then every day like every three hours,
>> because snapshot are quite big.
>> 
>> Thanks a lot,
>> 
>> 
>> Bill Au wrote:
>>> 
>>> The --delete option of the rsync command deletes extraneous files
>> from
>>> the
>>> destination directory.  It does not delete Solr snapshots.  To do
>> that
>>> you
>>> can use the snapcleaner on the master and/or slave.
>>> 
>>> Bill
>>> 
>>> On Fri, Feb 13, 2009 at 10:15 AM, sunnyfr 
>>> wrote:
>>> 
 
 root26834 16.2  0.0  19412  824 ?S16:0

Re: Query regarding setTimeAllowed(Integer) and setRows(Integer)

2009-02-17 Thread Walter Underwood
Requesting 5000 rows will use a lot of server time, because
it has to fetch the information for 5000 results when it
makes the response.

It is much more efficient to request only the results you
will need, usually 10 at a time.

wunder

On 2/17/09 3:30 AM, "Jana, Kumar Raja"  wrote:

> Hi,
> 
>  
> 
> I am trying to avoid queries which take a lot of server time. For this I
> plan to use setRows(Integer) and setTimeAllowed(Integer) methods while
> creating the SolrQuery. I would like to know the following:
> 
>  
> 
> 1.   I set SolrQuery.setRows(5000) Will the processing of the query
> stop once 5000 results are found or the query will be completely
> processed and then the result set is sorted out based on Rank Boosting
> and the top 5000 results are returned?
> 
> 2.   If I set SolrQuery.setTimeAllowed(2000) Will this kill query
> processing after 2 secs? (I know this question sounds silly but I just
> want a confirmation from the experts J )
> 
>  
> 
> Is there anything else I can do to get the desired results?
> 
>  
> 
> Thanks,
> 
> Kumar
> 



Store content out of solr

2009-02-17 Thread roberto
Hello,

We are indexing information from different sources, so we would like to
centralize the content in one place and retrieve it using the ID
provided by Solr.

Has anyone done something like this, and do you have any advice? I am
thinking of storing the content in a database like MySQL.

Thanks,
-- 
"Without love, we are birds with broken wings."
Morrie


Re: Multilanguage

2009-02-17 Thread revathy arun
Hi Otis,

But this is not freeware, right?




On 2/17/09, Otis Gospodnetic  wrote:
>
> Hi,
>
> No, Tika doesn't do LangID.  I haven't used ngramj, so I can't speak for
> its accuracy nor speed (but I know the code has been around for
> years).  Another LangID implementation is at the URL below my name.
>
> Otis --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
>
> 
> From: revathy arun 
> To: solr-user@lucene.apache.org
> Sent: Tuesday, February 17, 2009 6:39:40 PM
> Subject: Re: Multilanguage
>
> Does Apache Tika help find the language of the given document?
>
>
>
> On 2/17/09, Till Kinstler  wrote:
> >
> > Paul Libbrecht schrieb:
> >
> > Clearly, then, something that matches words in a dictionary and decides
> on
> >> the language based on the language of the majority could do a decent job
> to
> >> decide the analyzer.
> >>
> >> Does such a tool exist?
> >>
> >
> > I once played around with http://ngramj.sourceforge.net/ for language
> > guessing. It did a good job. It doesn't use dictionaries for language
> > identification but a statistical approach using ngrams.
> > I don't have any precise numbers, but out of about 1 documents in
> > different languages (most in English, German and French, few in other
> > european languages like Polish) there were only some 10 not identified
> > correctly.
> >
> > Till
> >
> > --
> > Till Kinstler
> > Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
> > Platz der Göttinger Sieben 1, D 37073 Göttingen
> > kinst...@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de
> >
>


Re: Store content out of solr

2009-02-17 Thread Peter Wolanin
Sure, we are doing essentially that with our Drupal integration module
- each search result contains a link to the "real" content, which is
stored in MySQL, etc, and presented via the Drupal CMS.

http://drupal.org/project/apachesolr

-Peter

On Tue, Feb 17, 2009 at 11:57 AM, roberto  wrote:
> Hello,
>
> We are indexing information from diferent sources so we would like to
> centralize the information content so i can retrieve using the ID
> provided buy solr?
>
> Does anyone did something like this, and have some advices ? I
> thinking in store the information into a database like mysql ?
>
> Thanks,
> --
> "Without love, we are birds with broken wings."
> Morrie
>



-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
peter.wola...@acquia.com


Re: Query regarding setTimeAllowed(Integer) and setRows(Integer)

2009-02-17 Thread Sean Timm

Jana, Kumar Raja wrote:

2.   If I set SolrQuery.setTimeAllowed(2000) Will this kill query
processing after 2 secs? (I know this question sounds silly but I just
want a confirmation from the experts J 
That is the idea, but only some of the code is within the timer.  So, 
there are cases where a query could exceed the timeAllowed specified 
because the bulk of the work for that particular query is not in the 
actual collect, for example, an expensive range query.


-Sean


Re: Store content out of solr

2009-02-17 Thread Renaud Delbru
A common approach (for web search engines) is to use HBase [1] as a
"Document Repository". Each document indexed inside Solr will have an
entry (a row, identified by the document URL) in the HBase table. This
works great when you deal with a large data collection (it scales better
than a SQL database). The trade-off is that it is slightly slower than
a local database.


[1] http://hadoop.apache.org/hbase/
--
Renaud Delbru

roberto wrote:

Hello,

We are indexing information from diferent sources so we would like to
centralize the information content so i can retrieve using the ID
provided buy solr?

Does anyone did something like this, and have some advices ? I
thinking in store the information into a database like mysql ?

Thanks,
  




Re: Multilanguage

2009-02-17 Thread Grant Ingersoll
There are a number of options for freeware here, just do some  
searching on your favorite Internet search engine.


TextCat is one of the more popular, as I seem to recall:  
http://odur.let.rug.nl/~vannoord/TextCat/

I believe Karl Wettin submitted a Lucene patch for a Language guesser: http://issues.apache.org/jira/browse/LUCENE-826 
 but it is marked as won't fix.


Nutch has a Language Identification plugin as well (the document in  
the link below) that probably isn't too hard to extract the source  
from for your needs


Also see http://www.lucidimagination.com/search/?q=multilingual+detection 
 and also http://www.lucidimagination.com/search/?q=language 
+detection for help


If you are purchasing, several companies offer solutions, but I don't know
that their quality is any better than what you can get through open
source; generally speaking, the problem is solved with a high
degree of accuracy through n-gram analysis.
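
To make the n-gram idea concrete, here is a toy sketch of the general
technique; this is not ngramj or TextCat, just an illustration: build a
character trigram profile per language from sample text, then pick the
language whose profile overlaps most with the input's profile.

import java.util.HashMap;
import java.util.Map;

public class TrigramGuesser {
    private final Map<String, Map<String, Integer>> profiles =
            new HashMap<String, Map<String, Integer>>();

    // Build and store a trigram profile for a language from sample text.
    public void train(String language, String sampleText) {
        profiles.put(language, profile(sampleText));
    }

    // Return the trained language whose profile best overlaps the input text.
    public String guess(String text) {
        Map<String, Integer> input = profile(text);
        String best = null;
        long bestScore = -1;
        for (Map.Entry<String, Map<String, Integer>> entry : profiles.entrySet()) {
            long score = 0;
            for (Map.Entry<String, Integer> gram : input.entrySet()) {
                Integer count = entry.getValue().get(gram.getKey());
                if (count != null) {
                    score += (long) count * gram.getValue();
                }
            }
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }

    // Count character trigrams in lowercased text.
    private static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        String normalized = text.toLowerCase();
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            String gram = normalized.substring(i, i + 3);
            Integer old = counts.get(gram);
            counts.put(gram, old == null ? 1 : old + 1);
        }
        return counts;
    }
}

Real implementations such as TextCat rank the n-grams and compare rank
distances instead of raw overlap, which is more robust on short texts.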


-Grant

On Feb 17, 2009, at 11:57 AM, revathy arun wrote:


Hi Otis,

But this is not freeware, right?




On 2/17/09, Otis Gospodnetic  wrote:


Hi,

No, Tika doesn't do LangID.  I haven't used ngramj, so I can't speak for
its accuracy or speed (but I know the code has been around for
years).  Another LangID implementation is at the URL below my name.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





From: revathy arun 
To: solr-user@lucene.apache.org
Sent: Tuesday, February 17, 2009 6:39:40 PM
Subject: Re: Multilanguage

Does Apache Tika help find the language of the given document?



On 2/17/09, Till Kinstler  wrote:


Paul Libbrecht wrote:

Clearly, then, something that matches words in a dictionary and decides
on the language based on the language of the majority could do a decent
job to decide the analyzer.

Does such a tool exist?



I once played around with http://ngramj.sourceforge.net/ for language
guessing. It did a good job. It doesn't use dictionaries for language
identification but a statistical approach using ngrams.
I don't have any precise numbers, but out of about 1 documents in
different languages (most in English, German and French, few in other
European languages like Polish) there were only some 10 not identified
correctly.

Till

--
Till Kinstler
Verbundzentrale des Gemeinsamen Bibliotheksverbundes (VZG)
Platz der Göttinger Sieben 1, D 37073 Göttingen
kinst...@gbv.de, +49 (0) 551 39-13431, http://www.gbv.de





--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search



Re: Multilanguage

2009-02-17 Thread Walter Underwood
On 2/17/09 12:26 PM, "Grant Ingersoll"  wrote:

> If purchasing, several companies offer solutions, but I don't know
> that their quality is any better than what you can get through open
> source, as generally speaking, the problem is solved with a high
> degree of accuracy through n-gram analysis.

The expensive part of the problem is getting a good corpus in each
language, tuning the classifier, and QA. The commercial ones usually
recognize encoding and language, which is more complicated. Sorting
out the ISO-2022 codes is a real mess, for example.

Pre-Unicode PDF files are also a horror. To do it right, you need
to recognize which fonts are Central European, and so on.
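
For the encoding-plus-language side of the problem, one open-source option is
ICU4J's charset detector; a small sketch, assuming the icu4j jar is on the
classpath (it only reports a rough language hint alongside the charset, for
some charsets):

import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

public class EncodingGuess {
    public static void main(String[] args) throws Exception {
        byte[] raw = "Ein kleines Beispiel in deutscher Sprache.".getBytes("ISO-8859-1");

        CharsetDetector detector = new CharsetDetector();
        detector.setText(raw);
        CharsetMatch match = detector.detect();

        // Best guess at the charset, a confidence score, and possibly a language hint.
        System.out.println(match.getName() + " / " + match.getLanguage()
                + " (confidence " + match.getConfidence() + ")");
    }
}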

wunder



making changes to solr schema

2009-02-17 Thread Jonathan Haddad
Preface: This is my first attempt at using solr.

What happens if I need to make a change to a Solr schema that's already
in production?  Can fields be added or removed?

Can a type change from an integer to a float?

Thanks in advance,
Jon

-- 
Jonathan Haddad
http://www.rustyrazorblade.com


embedded wildcard search not working?

2009-02-17 Thread Jim Adams
This is a straightforward question, but I haven't been able to figure out
what is up with my application.

I seem to be able to search on trailing wildcards just fine.  For example,
fieldName:a* will return documents with apple, aardvark, etc. in them.  But
if I try to search a field containing 'apple' with 'a*e', I get
nothing back.

My gut is telling me that I should be using a different data type or a
different filter option.  Here is how my text type is defined:

[the fieldType definition was stripped from the archived message]

Thanks for your help.



Reading Core-Specific Config File in a Row Transformer

2009-02-17 Thread wojtekpia

I'm using the DataImportHandler to load data. I created a custom row
transformer, and inside of it I'm reading a configuration file. I am using
the system's solr.solr.home property to figure out which directory the file
should be in. That works for a single-core deployment, but not for
multi-core deployments (since I'm always looking in
solr.solr.home/conf/file.txt). Is there a clean way to resolve the actual
conf directory path from within a custom row transformer so that it works
for both single-core and multi-core deployments?

Thanks,

Wojtek
-- 
View this message in context: 
http://www.nabble.com/Reading-Core-Specific-Config-File-in-a-Row-Transformer-tp22069449p22069449.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Reading Core-Specific Config File in a Row Transformer

2009-02-17 Thread Shalin Shekhar Mangar
On Wed, Feb 18, 2009 at 5:53 AM, wojtekpia  wrote:

>
> Is there a clean way to resolve the actual
> conf directory path from within a custom row transformer so that it works
> for both single-core and multi-core deployments?
>

You can use Context.getSolrCore().getInstanceDir()
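
A minimal sketch of how that looks inside a custom transformer; the class and
file names here are hypothetical:

import java.io.File;
import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class ConfigAwareTransformer extends Transformer {

    @Override
    public Object transformRow(Map<String, Object> row, Context context) {
        // Resolve the conf/ directory of the core this import runs in, so the
        // same code works for single-core and multi-core deployments.
        String instanceDir = context.getSolrCore().getInstanceDir();
        File configFile = new File(new File(instanceDir, "conf"), "file.txt");

        // ... read configFile and adjust the row as needed ...
        return row;
    }
}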

-- 
Regards,
Shalin Shekhar Mangar.


Re: making changes to solr schema

2009-02-17 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Wed, Feb 18, 2009 at 3:37 AM, Jonathan Haddad  wrote:
> Preface: This is my first attempt at using solr.
>
> What happens if I need to do a change to a solr schema that's already
> in production?  Can fields be added or removed?
You may need a core reload or a server restart.
Fields can be added, and subsequent document additions will take advantage of them.
Fields can be removed if you are no longer going to use them in queries.
>
> Can a type change from an integer to a float?
In general, type changes may require re-indexing of the data.
>
> Thanks in advance,
> Jon
>
> --
> Jonathan Haddad
> http://www.rustyrazorblade.com
>



-- 
--Noble Paul
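
As a small illustration of the reload mentioned above: with a multi-core
solr.xml you can ask the CoreAdmin handler to reload a core over HTTP. The
host, port, and core name below are hypothetical; a single-core deployment of
this vintage generally needs a container restart instead.

import java.io.InputStream;
import java.net.URL;

public class ReloadCore {
    public static void main(String[] args) throws Exception {
        // Hypothetical host/port and core name; adjust to your setup.
        URL reload = new URL("http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0");
        InputStream in = reload.openStream();
        in.close();  // we only care that the request succeeded
    }
}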


Data Normalization in Solr.

2009-02-17 Thread Kalidoss MM
Hi,

  I want to store normalized data in Solr. For example, I am splitting
personal information (fname, lname, mname) into one Solr record and the
address (personal, office) into another Solr record, with different ids:
123212_name and 123212_add.

  Now, in some cases I need both the personal and address records in a
single XML response (say fname, lname, and officeaddress only) with a single
HTTP request. Is that possible?

Thanks,
kalidoss.m,


RE: Query regarding setTimeAllowed(Integer) and setRows(Integer)

2009-02-17 Thread Jana, Kumar Raja
Thanks wunder for the response.

So I would like to know: if I limit the result set from Solr to 10
and my query actually matches, say, 1000 documents, will query
processing stop the moment the search finds the first 10 documents? Or
will the entire search be carried out, the matches sorted by
rank, and the top 10 results returned?

-Kumar

-Original Message-
From: Walter Underwood [mailto:wunderw...@netflix.com] 
Sent: Tuesday, February 17, 2009 10:06 PM
To: solr-user@lucene.apache.org
Subject: Re: Query regarding setTimeAllowed(Integer) and
setRows(Integer)

Requesting 5000 rows will use a lot of server time, because
it has to fetch the information for 5000 results when it
makes the response.

It is much more efficient to request only the results you
will need, usually 10 at a time.

wunder

On 2/17/09 3:30 AM, "Jana, Kumar Raja"  wrote:

> Hi,
> 
>  
> 
> I am trying to avoid queries which take a lot of server time. For this I
> plan to use setRows(Integer) and setTimeAllowed(Integer) methods while
> creating the SolrQuery. I would like to know the following:
> 
>  
> 
> 1.   If I set SolrQuery.setRows(5000), will the processing of the query
> stop once 5000 results are found, or will the query be completely
> processed and then the result set sorted based on rank boosting
> and the top 5000 results returned?
> 
> 2.   If I set SolrQuery.setTimeAllowed(2000), will this kill query
> processing after 2 secs? (I know this question sounds silly, but I just
> want a confirmation from the experts :) )
> 
>  
> 
> Is there anything else I can do to get the desired results?
> 
>  
> 
> Thanks,
> 
> Kumar
> 



RE: Query regarding setTimeAllowed(Integer) and setRows(Integer)

2009-02-17 Thread Jana, Kumar Raja
Thanks Sean. That clears up the timer concept.

Is there any other way through which I can make sure that the server
time is not wasted?

-Original Message-
From: Sean Timm [mailto:tim...@aol.com] 
Sent: Wednesday, February 18, 2009 1:00 AM
To: solr-user@lucene.apache.org
Subject: Re: Query regarding setTimeAllowed(Integer) and
setRows(Integer)

Jana, Kumar Raja wrote:
> 2.   If I set SolrQuery.setTimeAllowed(2000), will this kill query
> processing after 2 secs? (I know this question sounds silly, but I just
> want a confirmation from the experts :) )
That is the idea, but only some of the code is within the timer.  So, 
there are cases where a query could exceed the timeAllowed specified 
because the bulk of the work for that particular query is not in the 
actual collect, for example, an expensive range query.

-Sean


Re: Data Normalization in Solr.

2009-02-17 Thread Otis Gospodnetic
Hi,

There are no entity relationships in Solr and there are no joins, so the 
simplest thing to do in this case is to issue two requests.  You could also 
write a custom SearchComponent that internally does two requests and returns a 
single unified response.
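
A rough sketch of the two-request option, reusing the ids and field names from
the original question; the URL is hypothetical and error handling is omitted:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class TwoRequestLookup {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr URL; no error handling for brevity.
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // First request: the "name" record.
        SolrDocument name = server.query(new SolrQuery("id:123212_name")).getResults().get(0);
        // Second request: the "address" record.
        SolrDocument address = server.query(new SolrQuery("id:123212_add")).getResults().get(0);

        // Stitch the fields the caller asked for into one result.
        System.out.println(name.getFieldValue("fname") + " " + name.getFieldValue("lname")
                + ", " + address.getFieldValue("officeaddress"));
    }
}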

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





From: Kalidoss MM 
To: solr-user@lucene.apache.org
Sent: Wednesday, February 18, 2009 2:44:15 PM
Subject: Data Normalization in Solr.

Hi,

          I want to store normalized data in Solr. For example, I am splitting
personal information (fname, lname, mname) into one Solr record and the
address (personal, office) into another Solr record, with different ids:
123212_name and 123212_add.

          Now, in some cases I need both the personal and address records in a
single XML response (say fname, lname, and officeaddress only) with a single
HTTP request. Is that possible?

Thanks,
kalidoss.m,


Re: embedded wildcard search not working?

2009-02-17 Thread Otis Gospodnetic
Jim,

Does app*l or even a*p* work?  Perhaps "apple" gets stemmed to something that 
doesn't end in "e", such as "appl"?
Regarding your config, you probably want to lowercase before removing stop 
words, so you'll want to change the order of those filters a bit.  That's not 
related to your wildcard question.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch 





From: Jim Adams 
To: solr-user@lucene.apache.org
Sent: Wednesday, February 18, 2009 6:30:22 AM
Subject: embedded wildcard search not working?

This is a straightforward question, but I haven't been able to figure out
what is up with my application.

I seem to be able to search on trailing wildcards just fine.  For example,
fieldName:a* will return documents with apple, aardvark, etc. in them.  But
if I try to search a field containing 'apple' with 'a*e', I get
nothing back.

My gut is telling me that I should be using a different data type or a
different filter option.  Here is how my text type is defined:

[the fieldType definition was stripped from the archived message]

Thanks for your help.