Re: Rebuild Spellchecker based on cron expression

2010-12-13 Thread Peter Karich



Building on optimize is not possible as index optimization is done on
the master and the slaves don't even run an optimize but only fetch
the optimized index.


isn't the spellcheck index replicated to the slaves too?

--
http://jetwick.com open twitter search



gotchas, issues with document deletions/replacements/edits

2010-12-13 Thread Dennis Gearon
I am about to set up live editing of database contents that get indexed in a
Solr instance.

I seem to remember that edits in the index are actually deletes and 
replacements?

The deleted items don't really disappear, right? How do they affect queries?

Counts?
Return results?
?

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



Re: Solr on Google App Engine

2010-12-13 Thread Praveen Agrawal
Thanks a lot, Mauricio.

Does anyone have any experience with Amazon EC2, or can you point me to existing
discussions?

Appreciate your help.
Thanks.
Praveen

On Thu, Dec 9, 2010 at 6:20 PM, Mauricio Scheffer <
mauricioschef...@gmail.com> wrote:

> Solr on GAE has been discussed a couple of times, see these threads:
>
> http://www.mail-archive.com/java-user@lucene.apache.org/msg26010.html
> 
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg24473.html
> 
>
> http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3co2w952e01251005032245r79d6bfd6zbe08ece212c82...@mail.gmail.com%3e
> <
> http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3co2w952e01251005032245r79d6bfd6zbe08ece212c82...@mail.gmail.com%3e
> >
>
> --
> Mauricio
>
>
>
> On Thu, Dec 9, 2010 at 9:07 AM, Praveen Agrawal  wrote:
>
> > Hi,
> > I was wondering if Solr can be deployed/run on Google App Engine. GAE has
> > some restrictions, notably no local file write access is allowed, instead
> > applications must use JDO/JPA etc.
> >
> > I believe Solr can be deployed/run on Amazon EC2.
> >
> > Has anyone tried Solr on these two hosts?
> >
> > Thanks.
> > Praveen
> >
>


When all cores are ready to be used?

2010-12-13 Thread De Stefano, Giovanni, VF-Group
Hello all,
 
I have a component that uses SolrServer(s) in a multicore environment.
 
I would like to cache these solr servers in a map (core name, server).
 
When is the right time to create this map? 
 
I tried with a custom ContextListener but it seems that the cores are
not ready yet: I have to reinitialize the cores but this doesn't sound
right.
 
Basically I would like to have a listener of an event "Solr created
everything it needs, including cores, etc".
 
How can I do this?
 
Thanks,
Giovanni
 
 


RE: Solr on Google App Engine

2010-12-13 Thread Dave Searle
EC2 installations are just Windows/Linux machines, so this would just be a
normal setup. I have a Solr server running on a small instance with 1.7GB RAM,
mounted to an EBS volume of 50GB; it seems to run fine. Costs about $115 a month.

-Original Message-
From: Praveen Agrawal [mailto:pkal...@gmail.com] 
Sent: 13 December 2010 09:20
To: solr-user@lucene.apache.org
Subject: Re: Solr on Google App Engine

Thanks a lot, Mauricio.

Does anyone has any experience on Amazon EC2, or can point me to existing
discussions?

Appreciate your help.
Thanks.
Praveen

On Thu, Dec 9, 2010 at 6:20 PM, Mauricio Scheffer <
mauricioschef...@gmail.com> wrote:

> Solr on GAE has been discussed a couple of times, see these threads:
>
> http://www.mail-archive.com/java-user@lucene.apache.org/msg26010.html
> 
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg24473.html
> 
>
> http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3co2w952e01251005032245r79d6bfd6zbe08ece212c82...@mail.gmail.com%3e
> <
> http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3co2w952e01251005032245r79d6bfd6zbe08ece212c82...@mail.gmail.com%3e
> >
>
> --
> Mauricio
>
>
>
> On Thu, Dec 9, 2010 at 9:07 AM, Praveen Agrawal  wrote:
>
> > Hi,
> > I was wondering if Solr can be deployed/run on Google App Engine. GAE has
> > some restrictions, notably no local file write access is allowed, instead
> > applications must use JDO/JPA etc.
> >
> > I believe Solr can be deployed/run on Amazon EC2.
> >
> > Has anyone tried Solr on these two hosts?
> >
> > Thanks.
> > Praveen
> >
>


Re: Highlighting for non-stored fields

2010-12-13 Thread Alessandro Benedetti
We developed a custom Highlighter to solve this issue.
We added a "url" field to the Solr schema doc for our domain, and when
highlighting is called, we access the file, extract the information and send
it to the custom highlighter.

If you still need some help, I can provide our solution in detail!
Cheers

2010/10/26 Phong Dais 

> Hi,
>
> I've been looking thru the mailing archive for the past week and I haven't
> found any useful info regarding this issue.
>
> My requirement is to index a few terabytes worth of data to be searched.
> Due to the size of the data, I would like to index without storing but I
> would like to use the highlighting feature.  Is this even possible?  What
> are my options?
>
> I've read about termOffsets, payload that could possibly be used to do this
> but I have no idea how this could be done.
>
> Any pointers greatly appreciated.  Someone please point me in the right
> direction.
>
>  I don't mind having to write some code or digging thru existing code to
> accomplish this task.
>
> Thanks,
> P.
>



-- 
--

Benedetti Alessandro
Personal Page: http://tigerbolt.altervista.org

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: Rebuild Spellchecker based on cron expression

2010-12-13 Thread Erick Erickson
***
Just wondering what's the reason that this patch receives that little
interest. Anything wrong with it?
***

Nobody got behind it and pushed, I suspect. And since it's been a long time
since it was updated, there's no guarantee that it would apply cleanly any
more, or that it will perform as intended.

So, if you're really interested, I'd suggest you ping the dev list and ask
whether this is valuable or if it's been superseded. If the feedback is that
this
would be valuable, you can see what you can do to make it happen.

Once it's working to your satisfaction and you've submitted a patch, let
people
know it's ready and ask them to commit it or critique it. You might have to
remind
the committers after a few days that it's ready and get it applied to trunk
and/or 3.x.

But I really wouldn't start working with it until I got some feedback from
the people who are actively working on Solr about whether it's been superseded
by other functionality first; sometimes bugs just aren't closed when something
else makes it obsolete.

Here's a place to start: http://wiki.apache.org/solr/HowToContribute

Best
Erick

On Mon, Dec 13, 2010 at 2:58 AM, Martin Grotzke <
martin.grot...@googlemail.com> wrote:

> Hi,
>
> when thinking further about it it's clear that
>  https://issues.apache.org/jira/browse/SOLR-433
> would be even better - we could generate the spellchecker indices on
> commit/optimize on the master and replicate them to all slaves.
>
> Just wondering what's the reason that this patch receives that little
> interest. Anything wrong with it?
>
> Cheers,
> Martin
>
>
> On Mon, Dec 13, 2010 at 2:04 AM, Martin Grotzke
>  wrote:
> > Hi,
> >
> > the spellchecker component already provides a buildOnCommit and
> > buildOnOptimize option.
> >
> > Since we have several spellchecker indices building on each commit is
> > not really what we want to do.
> > Building on optimize is not possible as index optimization is done on
> > the master and the slaves don't even run an optimize but only fetch
> > the optimized index.
> >
> > Therefore I'm thinking about an extension of the spellchecker that
> > allows you to rebuild the spellchecker based on a cron-expression
> > (e.g. rebuild each night at 1 am).
> >
> > What do you think about this, is there anybody else interested in this?
> >
> > Regarding the lifecycle, is there already some executor "framework" or
> > any regularly running process in place, or would I have to pull up my
> > own thread? If so, how can I stop my thread when solr/tomcat is
> > shutdown (I couldn't see any shutdown or destroy method in
> > SearchComponent)?
> >
> > Thanx for your feedback,
> > cheers,
> > Martin
> >
>
>
>
> --
> Martin Grotzke
> http://www.javakaffee.de/blog/
>


Re: gotchas, issues with document deletions/replacements/edits

2010-12-13 Thread Erick Erickson
You're right, updates are really deletes/adds. Deleted documents are NOT
found in future queries, so that's not a problem.

However, the terms in a deleted document still affect the relevance
calculations, but in most cases you'll never notice this. By that I mean
that the term frequency counts are still influenced by the terms from the
deleted documents, etc.

Even this abstruse effect is removed upon the first optimize after a delete.
That's when the document, terms, etc. are removed from the index files.

Best
Erick
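
For reference, that optimize can be triggered with a plain HTTP request to the
update handler; a minimal sketch assuming the default example Solr URL:

    curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' --data-binary '<optimize/>'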

On Mon, Dec 13, 2010 at 4:16 AM, Dennis Gearon wrote:

> I am about to set  up a live edit of database contents that get indexed in
> a
> Solr Instance.
>
> I seem to remember that edits in the index are actually deletes and
> replacements?
>
> The deleted items don't really disappear, right? What about queries do they
> affect?
>
> Counts?
> Return results?
> ?
>
>  Dennis Gearon
>
>
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a
> better
> idea to learn from others’ mistakes, so you do not have to make them
> yourself.
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>
>
> EARTH has a Right To Life,
> otherwise we all die.
>
>


Re: Rebuild Spellchecker based on cron expression

2010-12-13 Thread Martin Grotzke
On Mon, Dec 13, 2010 at 4:01 AM, Erick Erickson  wrote:
> I'm shooting in the dark here, but according to this:
> http://wiki.apache.org/solr/SolrReplication
> after the slave pulls the index
> down, it issues a commit. So if your
> slave is configured to generate the dictionary on commit, will it
> "just happen"?

Our slaves' spellcheckers are not configured to buildOnCommit,
so it shouldn't just happen.
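
For context, that flag lives on the spellchecker definition in solrconfig.xml;
a minimal sketch (the spellchecker and field names here are just placeholders):

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <lst name="spellchecker">
        <str name="name">default</str>
        <str name="field">spell</str>
        <!-- set to true to rebuild the spellcheck index on every commit -->
        <str name="buildOnCommit">false</str>
      </lst>
    </searchComponent>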

>
> But according to this: https://issues.apache.org/jira/browse/SOLR-866
> this is an open issue

Thanx for the pointer! SOLR-866 is even better suited for us - after
reading SOLR-433 again I realized that it targets script-based
replication (which we're going to leave behind).

Cheers,
Martin


>
> Best
> Erick
>
> On Sun, Dec 12, 2010 at 8:30 PM, Martin Grotzke <
> martin.grot...@googlemail.com> wrote:
>
>> On Mon, Dec 13, 2010 at 2:12 AM, Markus Jelsma
>>  wrote:
>> > Maybe you've overlooked the build parameter?
>> > http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.build
>> I'm aware of this, but we don't want to maintain cron-jobs on all
>> slaves for all spellcheckers for all cores.
>> That's why I'm thinking about a more integrated solution. Or did I
>> really overlook s.th.?
>>
>> Cheers,
>> Martin
>>
>>
>> >
>> >> Hi,
>> >>
>> >> the spellchecker component already provides a buildOnCommit and
>> >> buildOnOptimize option.
>> >>
>> >> Since we have several spellchecker indices building on each commit is
>> >> not really what we want to do.
>> >> Building on optimize is not possible as index optimization is done on
>> >> the master and the slaves don't even run an optimize but only fetch
>> >> the optimized index.
>> >>
>> >> Therefore I'm thinking about an extension of the spellchecker that
>> >> allows you to rebuild the spellchecker based on a cron-expression
>> >> (e.g. rebuild each night at 1 am).
>> >>
>> >> What do you think about this, is there anybody else interested in this?
>> >>
>> >> Regarding the lifecycle, is there already some executor "framework" or
>> >> any regularly running process in place, or would I have to pull up my
>> >> own thread? If so, how can I stop my thread when solr/tomcat is
>> >> shutdown (I couldn't see any shutdown or destroy method in
>> >> SearchComponent)?
>> >>
>> >> Thanx for your feedback,
>> >> cheers,
>> >> Martin
>> >
>>
>>
>>
>> --
>> Martin Grotzke
>> http://twitter.com/martin_grotzke
>>
>



-- 
Martin Grotzke
http://www.javakaffee.de/blog/


Re: Rebuild Spellchecker based on cron expression

2010-12-13 Thread Martin Grotzke
Hi Erick,

thanx for your advice! I'll check the options with our client and see
how we'll proceed. My spare time right now is already full with other
open source stuff, otherwise it'd be fun contributing s.th. to solr!
:-)

Cheers,
Martin


On Mon, Dec 13, 2010 at 2:46 PM, Erick Erickson  wrote:
> ***
> Just wondering what's the reason that this patch receives that little
> interest. Anything wrong with it?
> ***
>
> Nobody got behind it and pushed I suspect. And since it's been a long time
> since it was updated, there's no guarantee that it would apply cleanly any
> more.
> Or that it will perform as intended.
>
> So, if you're really interested, I'd suggest you ping the dev list and ask
> whether this is valuable or if it's been superseded. If the feedback is that
> this
> would be valuable, you can see what you can do to make it happen.
>
> Once it's working to your satisfaction and you've submitted a patch, let
> people
> know it's ready and ask them to commit it or critique it. You might have to
> remind
> the committers after a few days that it's ready and get it applied to trunk
> and/or 3.x.
>
> But I really wouldn't start working with it until I got some feedback from
> the
> people who are actively working on Solr whether it's been superseded by
> other functionality first, sometimes bugs just aren't closed when something
> else makes it obsolete.
>
> Here's a place to start: http://wiki.apache.org/solr/HowToContribute
>
> Best
> Erick
>
> On Mon, Dec 13, 2010 at 2:58 AM, Martin Grotzke <
> martin.grot...@googlemail.com> wrote:
>
>> Hi,
>>
>> when thinking further about it it's clear that
>>  https://issues.apache.org/jira/browse/SOLR-433
>> would be even better - we could generate the spellchecker indices on
>> commit/optimize on the master and replicate them to all slaves.
>>
>> Just wondering what's the reason that this patch receives that little
>> interest. Anything wrong with it?
>>
>> Cheers,
>> Martin
>>
>>
>> On Mon, Dec 13, 2010 at 2:04 AM, Martin Grotzke
>>  wrote:
>> > Hi,
>> >
>> > the spellchecker component already provides a buildOnCommit and
>> > buildOnOptimize option.
>> >
>> > Since we have several spellchecker indices building on each commit is
>> > not really what we want to do.
>> > Building on optimize is not possible as index optimization is done on
>> > the master and the slaves don't even run an optimize but only fetch
>> > the optimized index.
>> >
>> > Therefore I'm thinking about an extension of the spellchecker that
>> > allows you to rebuild the spellchecker based on a cron-expression
>> > (e.g. rebuild each night at 1 am).
>> >
>> > What do you think about this, is there anybody else interested in this?
>> >
>> > Regarding the lifecycle, is there already some executor "framework" or
>> > any regularly running process in place, or would I have to pull up my
>> > own thread? If so, how can I stop my thread when solr/tomcat is
>> > shutdown (I couldn't see any shutdown or destroy method in
>> > SearchComponent)?
>> >
>> > Thanx for your feedback,
>> > cheers,
>> > Martin
>> >
>>
>>
>>
>> --
>> Martin Grotzke
>> http://www.javakaffee.de/blog/
>>
>



-- 
Martin Grotzke
http://www.javakaffee.de/blog/


Separate Lines Like Google

2010-12-13 Thread Alejandro Delgadillo

Hi everybody,

I'm having some trouble trying to figure out how to separate lines in a
paragraph from a search result. I'm indexing PDFs, but when I search the
highlighted terms I cannot tell where the first line ends and the next one
begins.

Is there a way to put a [...] like Google, or a paragraph symbol?

I'll appreciate all the help I can get.

-- Alex.


Re: Solr on Google App Engine

2010-12-13 Thread Praveen Agrawal
Thanks Dave..

On Mon, Dec 13, 2010 at 4:06 PM, Dave Searle wrote:

> EC2 installations are just windows/linux machines, so this would just be a
> normal setup. I have a solr server running on a small instance with 1.7gb
> ram mounted to an EBS volume of 50gb, seems to run fine. Costs about $115 a
> month
>
> -Original Message-
> From: Praveen Agrawal [mailto:pkal...@gmail.com]
> Sent: 13 December 2010 09:20
> To: solr-user@lucene.apache.org
> Subject: Re: Solr on Google App Engine
>
> Thanks a lot, Mauricio.
>
> Does anyone has any experience on Amazon EC2, or can point me to existing
> discussions?
>
> Appreciate your help.
> Thanks.
> Praveen
>
> On Thu, Dec 9, 2010 at 6:20 PM, Mauricio Scheffer <
> mauricioschef...@gmail.com> wrote:
>
> > Solr on GAE has been discussed a couple of times, see these threads:
> >
> > http://www.mail-archive.com/java-user@lucene.apache.org/msg26010.html
> > 
> > http://www.mail-archive.com/solr-user@lucene.apache.org/msg24473.html
> > 
> >
> >
> http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3co2w952e01251005032245r79d6bfd6zbe08ece212c82...@mail.gmail.com%3e
> > <
> >
> http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3co2w952e01251005032245r79d6bfd6zbe08ece212c82...@mail.gmail.com%3e
> > >
> >
> > --
> > Mauricio
> >
> >
> >
> > On Thu, Dec 9, 2010 at 9:07 AM, Praveen Agrawal 
> wrote:
> >
> > > Hi,
> > > I was wondering if Solr can be deployed/run on Google App Engine. GAE
> has
> > > some restrictions, notably no local file write access is allowed,
> instead
> > > applications must use JDO/JPA etc.
> > >
> > > I believe Solr can be deployed/run on Amazon EC2.
> > >
> > > Has anyone tried Solr on these two hosts?
> > >
> > > Thanks.
> > > Praveen
> > >
> >
>


Re: [pubDate] is not converting correctly

2010-12-13 Thread Adam Estrada
+1  If I knew enough about how to do this in Java I would, but I do not.
What is the correct way to add or suggest enhancements to Solr
core?

Adam

On Sun, Dec 12, 2010 at 11:38 PM, Lance Norskog  wrote:

> Nice find!  This is Apache 2.0, copyright SUN.
>
> O Great Apache Elders: Is it kosher to add this to the Solr
> distribution? It's not in the JDK and is also com.sun.*
>
> On Sun, Dec 12, 2010 at 5:33 PM, Adam Estrada
>  wrote:
> > Thanks for the feedback! There are quite a few formats that can be used.
> I
> > am experiencing at least 5 of them. Would something like this work? Note
> > that there are 2 different formats separated by a comma.
> >
> >  > dateTimeFormat="EEE, dd MMM  HH:mm:ss zzz, -MM-dd'T'HH:mm:ss'Z'"
> />
> >
> > I don't suppose it will because there is already a comma in the first
> > parser. I guess I am reallly looking for an all purpose data time parser
> but
> > even if I have that, would I still be able to query *all* fields in the
> > index?
> >
> > Good article:
> >
> http://www.java2s.com/Open-Source/Java-Document/RSS-RDF/Rome/com/sun/syndication/io/impl/DateParser.java.htm
> >
> > Adam
> >
> > On Sun, Dec 12, 2010 at 7:31 PM, Koji Sekiguchi 
> wrote:
> >
> >> (10/12/13 8:49), Adam Estrada wrote:
> >>
> >>> All,
> >>>
> >>> I am having some difficu"lties parsing the pubDate field that is part
> of
> >>> the?
> >>> RSS spec (I believe). I get the warning that "states, "Dec 12, 2010
> >>> 6:45:26
> >>> PM org.apache.solr.handler.dataimport.DateFormatTransformer
> >>>  transformRow
> >>> WARNING: Could not parse a Date field
> >>> java.text.ParseException: Unparseable date: "Thu, 30 Jul 2009 14:41:43
> >>> +"
> >>> at java.text.DateFormat.parse(Unknown Source)"
> >>>
> >>> Does anyone know how to fix this? I would eventually like to do a date
> >>> query
> >>> but without the ability to properly parse them I don't know if it's
> going
> >>> to
> >>> work.
> >>>
> >>> Thanks,
> >>> Adam
> >>>
> >>
> >> Adam,
> >>
> >> How does your data-config.xml look like for that field?
> >> Have you looked at rss-data-config.xml file
> >> under example/example-DIH/solr/rss/conf directory?
> >>
> >> Koji
> >> --
> >> http://www.rondhuit.com/en/
> >>
> >
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Newbie: Indexing unrelated MySQL tables

2010-12-13 Thread Jaakko Rajaniemi

Hello,

Alright, let's describe the situation. I have a website and the website has a 
database with at least three tables.

- "users" table  - (id, firstname, lastname)
- "artwork" table  - (id, user, name, description)
- "jobs" table  - (id, company, position, location, description)

I want to implement a multi-purpose search field. I want the search field to treat rows 
from each of these tables as independent results. For example if I type 
"Paris", an example result set might look like a list of links like this:

- Paris Hilton (from the "users" table)
- A day in Paris (from the "artwork" table)
- My Paris (from the "artwork" table)
- Art teacher (from the "jobs" table if the location is Paris)

I tried to search for a solution to this on the Internet and found a somewhat
similar thread floating on this mailing list, but I just couldn't understand it,
and the Solr documentation has its main focus on syntax, not implementation. I
figured I would create three entities and relevant schema.xml entries in this
way:

dataimport.xml:




schema.xml:










This obviously does not work as I want. I only get results from the "users" table, and I
cannot get results from either "artwork" or "jobs". I have found out that the possible
solution is in putting <field> tags in the <entity> tag and somehow aliasing column names
for Solr, but the logic behind this is completely alien to me and the blind tests I tried
did not yield anything. My logic says that the "id" field is getting replaced by the "id"
field of other entities and indexes are being overwritten. But if I aliased all "id" fields
in all entities into something else, such as "user_id" and "job_id", I couldn't figure out
what to put in the <uniqueKey> configuration in schema.xml because I have three different
id fields from three different tables that are all primary keys in the database!

Obviously I'm not quite on track so some help would be greatly appreciated. 
Thanks!
- Jaakko


How to implement and a system based on IMAP auth

2010-12-13 Thread milomalo2...@libero.it
Hi Guys,

I am new to the Solr world and I was trying to figure out how to implement an
application which would be able to connect to our business mail server through an
IMAP connection (1000 users) and to index the related e-mail contents.

I tried to use DIH import with the preconfigured IMAP class provided in the
Solr example, but as far as I could see there is no way to fetch 1000 users and
retrieve information for them.

What would you suggest as a first step to follow?
Should I use SolrJ as a client in order to reach user content across an IMAP
connection?
Does anyone have experience with that?

thanks in advance





Re: Newbie: Indexing unrelated MySQL tables

2010-12-13 Thread Stefan Matheis
To avoid overwrites in your case, use a combined id - e.g. $table_$id - which
results in user_1, job_1 and so on ..


Re: Newbie: Indexing unrelated MySQL tables

2010-12-13 Thread Stefan Matheis
And yes, sorry for the short answer ..
http://wiki.apache.org/solr/DataImportHandler#TemplateTransformer would be
good for that :)
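
A rough sketch of what that combined id could look like in data-config.xml,
using the "users" table from the original question (the "user_" prefix and the
solr_id column name are illustrative):

    <entity name="users" transformer="TemplateTransformer"
            query="SELECT id, firstname, lastname FROM users">
      <!-- templated column mapped to the schema's uniqueKey field -->
      <field column="solr_id" template="user_${users.id}" name="id"/>
      <field column="firstname" name="firstname"/>
      <field column="lastname" name="lastname"/>
    </entity>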


Re: How to implement and a system based on IMAP auth

2010-12-13 Thread Erick Erickson
I don't see where the MailEntityProcessor really has anything built
into it for indexing somebody else's mail, so you're probably going
to need to go down the SolrJ route. SolrJ is actually quite easy to
use; there are only a very few classes you'll need, so I'd go there.
The "Usage" section here will get you started:
http://wiki.apache.org/solr/Solrj

Best
Erick

On Mon, Dec 13, 2010 at 9:32 AM, milomalo2...@libero.it <
milomalo2...@libero.it> wrote:

> Hi Guys,
>
> i am new in Solr world and i was trying to figure out how to implement an
> application which would be able to connect to our business mail server
> throug
> IMAP connection (1000 users) and to index the information related e-mail
> contents.
>
> I tried to use DH- import with the preconfigured imap class provided in the
> solr example but as i could see there is no way to fetch 1000 user and
> retrieve
> information for them
>
> What would you suggest as first step to follow ?
> should i use SOLRJ as client in order to reach user content across imap
> connection?
> Doesn anyone had experience with that ?
>
> thanks in advance
>
>
>
>


Re: Separate Lines Like Google

2010-12-13 Thread Koji Sekiguchi

(10/12/13 23:00), Alejandro Delgadillo wrote:


Hi everybody,

I'm having some troubles trying to figure out how to separate lines in a
paragraph from a search result, I'm indexing PDFs but when I search the
highlight terms I can not know when the first line ends and the next one
begins,

Is there a way to put a [...] like google or a Paragraph symbol?

I'll appreciate all the help I can get.

-- Alex.


Alex,

Use the hl.snippets=n parameter, where n is a number (2, 3, ...).
Then you'll get at most that number of snippets, and you can
append these snippets with "..." between them.
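
For example (the host and the field name "content" are placeholders):

    http://localhost:8983/solr/select?q=love&hl=true&hl.fl=content&hl.snippets=3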

Koji
--
http://www.rondhuit.com/en/


Re: Concurrent DIH calls

2010-12-13 Thread Juan Manuel Alvarez
Thanks for the answer Barani!
I was doing the same thing (queuing requests and querying solr
status), but I was hoping some flag/configuration would do the trick.
I will continue with that approach then! =o)

Thanks!
Juan M.
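
For anyone following along, that status poll is just a request to the DIH
handler; a sketch assuming the default handler path:

    curl 'http://localhost:8983/solr/dataimport?command=status'

The response carries a "status" value of "idle" or "busy", which is what the
batch program checks before submitting the next import.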

On Sat, Dec 11, 2010 at 3:50 AM, bbarani  wrote:
>
> Hi,
>
> As far as I know there is no queuing mechanism in SOLR for concurrent
> indexing request. It would simple ignore the concurrent request (first come
> first serve basis).. Solr experts, please correct me if I am wrong..
>
> To achieve concurrency,  we have implemented a queue using JMS and we send
> the data one by one for indexing (for performing push indexing / real time
> indexing)..
>
> We have also written a simple java program with SOLRj which will check if
> the status is idle or busy before it starts indexing next batch (This is for
> batch indexing program)..
>
> I would say the same thing applies for commit also.. As far as I know there
> is not inbuilt queuing system in SOLR for indexing.
>
> Thanks,
> Barani
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Concurrent-DIH-calls-tp2059517p2067937.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Solr replication, HAproxy and data management

2010-12-13 Thread Paolo Castagna

Paolo Castagna wrote:

Hi,
we are using Solr v1.4.x with multi-cores and a master/slaves 
configuration.

We also use HAProxy [1] to load balance search requests amongst slaves.
Finally, we use MapReduce to create new Solr indexes.

I'd like to share with you what I am doing when I need to:

 1. add a new index
 2. replace an existing index with an new/updated one
 3. add a slave
 4. remove a slave (or a slave died)

I am interested in knowing what are the best practices in these scenarios.


[...]


Does all this seems sensible to you?

Do you have best practices, suggestions to share?



Well, maybe these are two too broad questions...

I have a very specific one, related to all this.

Let's say I have a Solr master with multiple cores and I want to add a new
slave. Can I tell the slave to replicate all the indexes from the master?
How?
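
(The built-in replication in 1.4 is configured and polled per core, so a new
slave pulls each core's index through that core's own replication handler; a
manual pull looks roughly like this, with host and core name as placeholders:

    curl 'http://slave:8983/solr/website/replication?command=fetchindex'
)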

Any comment/advice regarding my original message are still more than
welcome.

Thank you,
Paolo


Re: Taxonomy and Faceting

2010-12-13 Thread webdev1977

Based on this:

VALID_ALCHEMYAPI_KEY 

  VALID_ALCHEMYAPI_KEY 

  VALID_ALCHEMYAPI_KEY 

  VALID_ALCHEMYAPI_KEY 

  VALID_ALCHEMYAPI_KEY 

  VALID_OPENCALAIS_KEY 


...this can't be used unless you use some sort of processing engine?  I am
playing around with some other open source tagging software, but I have yet
to get very far.


Re: Concurrent DIH calls

2010-12-13 Thread Stefan Matheis
I don't know if this is helpful .. but there is
http://wiki.apache.org/solr/DataImportHandler#EventListeners which would
trigger on 'onImportEnd'
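
A listener is registered on the <document> element of data-config.xml; a
minimal sketch (the class name is hypothetical, and the class must implement
org.apache.solr.handler.dataimport.EventListener):

    <document onImportEnd="com.example.ImportEndListener">
      ...
    </document>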


Re: How to implement and a system based on IMAP auth

2010-12-13 Thread Peter Sturge
imap has no intrinsic functionality for logging in as a user then
'impersonating' someone else.
What you can do is setup your email server so that your administrator
account or similar has access to other users via shared folders (this
is supported in imap2 servers - e.g. Exchange).
This is done all the time, for example if a manager wants his/her
secretary to have access to his/her mailbox.
Of course all access in this way needs to be in line with privacy policies etc.

When you connect as, say, 'admin', you can then see the shared folders
you have access to.
These folders are accessible via imap.
This is more of an imap thing, and isn't really related to DIH/Solr per se.

For Exchange servers, have a look at:
   http://www.petri.co.il/grant_full_mailbox_rights_on_exchange_2000_2003.htm
and
   http://www.ehow.com/how_5656820_share-exchange-mailboxes.html

HTH

Peter




On Mon, Dec 13, 2010 at 2:32 PM, milomalo2...@libero.it
 wrote:
> Hi Guys,
>
> i am new in Solr world and i was trying to figure out how to implement an
> application which would be able to connect to our business mail server throug
> IMAP connection (1000 users) and to index the information related e-mail
> contents.
>
> I tried to use DH- import with the preconfigured imap class provided in the
> solr example but as i could see there is no way to fetch 1000 user and 
> retrieve
> information for them
>
> What would you suggest as first step to follow ?
> should i use SOLRJ as client in order to reach user content across imap
> connection?
> Doesn anyone had experience with that ?
>
> thanks in advance
>
>
>
>


Re: Newbie: Indexing unrelated MySQL tables

2010-12-13 Thread Erick Erickson
Warning: I haven't tried this, but maybe it's relevant.
See: http://wiki.apache.org/solr/DataImportHandler
particularly the "multiple
datasources" section. I'm thinking here that you have to
define a different data source for each separate table you want to extract.

Stefan's comments about using a transformer to make unique ids per table
seem spot on for that part of the problem.

Best
Erick

On Mon, Dec 13, 2010 at 9:31 AM, Jaakko Rajaniemi
wrote:

> Hello,
>
> Alright, let's describe the situation. I have a website and the website has
> a database with at least three tables.
>
> - "users" table  - (id, firstname, lastname)
> - "artwork" table  - (id, user, name, description)
> - "jobs" table  - (id, company, position, location, description)
>
> I want to implement a multi-purpose search field. I want the search field
> to treat rows from each of these tables as independent results. For example
> if I type "Paris", an example result set might look like a list of links
> like this:
>
> - Paris Hilton (from the "users" table)
> - A day in Paris (from the "artwork" table)
> - My Paris (from the "artwork" table)
> - Art teacher (from the "jobs" table if the location is Paris)
>
> I tried to search for solution for this on the Internet and found a
> somewhat similar thread floating on this mailing list, but I just couldn't
> understand it and the Solr documentation has its main focus on syntax, not
> implementation. I figured I would create three entities and relevant
> schema.xml entries in this way:
>
> dataimport.xml:
> 
> 
> 
>
> schema.xml:
> 
> 
> 
> 
> 
> 
> 
> 
> 
>
> This obviously does not work as I want. I only get results from the "users"
> table, and I cannot get results from neither "artwork" nor "jobs". I have
> found out that the possible solution is in putting  tags in the
>  tag and somehow aliasing column names for Solr, but the logic
> behind this is completely alien to me and the blind tests I tried did not
> yield anything. My logic says that the "id" field is getting replaced by the
> "id" field of other entities and indexes are being overwritten. But if I
> aliased all "id" fields in all entities into something else, such as
> "user_id" and "job_id", I couldn't figure what to put in the 
> configuration in schema.xml because I have three different id fields from
> three different tables that are all primary keyed in the database!
>
> Obviously I'm not quite on track so some help would be greatly appreciated.
> Thanks!
> - Jaakko
>


Re: Very high load after replicating

2010-12-13 Thread Mark

Markus,

My configuration is as follows...






...
false
2
...
false
64
10
false
true

No cache warming queries, and our machines have 8GB of memory in them with
about 5120MB of RAM dedicated to Solr. When our index is around 10-11GB
in size everything runs smoothly. At around 20GB+ it just falls apart.


Can you (or anyone) provide some suggestions? Thanks


On 12/12/10 1:11 PM, Markus Jelsma wrote:

There can be numerous explanations such as your configuration (cache warm
queries, merge factor, replication events etc) but also I/O having trouble
flushing everything to disk. It could also be a memory problem, the OS might
start swapping if you allocate too much RAM to the JVM leaving little for the
OS to work with.

You need to provide more details.


After replicating an index of around 20g my slaves experience very high
load (50+!!)

Is there anything I can do to alleviate this problem?  Would solr cloud
be of any help?

thanks


Re: Taxonomy and Faceting

2010-12-13 Thread Tommaso Teofili
With the SOLR-2129 patch you enable an Apache UIMA [1] pipeline to enrich
documents being indexed.
The base pipeline provided with the patch uses the following blocks (see
OverridingParamsExtServicesAE.xml):

AggregateSentenceAE

OpenCalaisAnnotator

TextKeywordExtractionAEDescriptor

TextLanguageDetectionAEDescriptor

TextCategorizationAEDescriptor

TextConceptTaggingAEDescriptor

TextRankedEntityExtractionAEDescriptor
This enables tokenizing, adding part of speech to tokens and extracting sentences
with WhitespaceTokenizer and HMMTagger, then inserting named entities and
language extracted with OpenCalaisAnnotator and AlchemyAPIAnnotator.
The parameters you underlined are relevant only if you use
OpenCalaisAnnotator and AlchemyAPIAnnotator; as you may see, those are
runtime parameters, so depending on which Analysis Engine you're executing
you may or may not need such parameters, or need other ones.
However you can change the pipeline blocks to whatever you want,
provided that they are UIMA compliant, by specifying the relevant Analysis
Engine descriptor inside the tag:
   /org/apache/uima/desc/OverridingParamsExtServicesAE.xml
There are many other engines you can use and configure with SOLR-2129, see
[2] and [3].
I hope this clarifies things a little more.
Cheers,
Tommaso

[1] : http://uima.apache.org
[2] : http://uima.apache.org/sandbox.html
[3] : http://uima.apache.org/external-resources.html

2010/12/13 webdev1977 

>
> Based on this:
>
> VALID_ALCHEMYAPI_KEY
>
>  VALID_ALCHEMYAPI_KEY
>
>  VALID_ALCHEMYAPI_KEY
>
>  VALID_ALCHEMYAPI_KEY
>
>  VALID_ALCHEMYAPI_KEY
>
>  VALID_OPENCALAIS_KEY
>
>
> ...this can't be used unless you use some sort of processing engine?  I am
> playing around with some other open source tagging software, but I have yet
> to get very far.
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Taxonomy-and-Faceting-tp2028442p2079148.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Indexing pdf files - question.

2010-12-13 Thread Siebor, Wlodek [USA]
Hi,
Can somebody please send me a command for indexing, with ExtractingRequestHandler,
the sample PDF file available in the /docs directory. I have LucidWorks Solr
installed on Linux, with standard schema.xml and solrconfig.xml files (unchanged).
I want to pass the name of the file as the unique id.
I'm trying various curl commands and so far I get either "... missing required
field: id" or ".. missing content stream" errors.
Thanks for your help,
Wlodek


Strange replication problem

2010-12-13 Thread Ralf Mattes
Hello list,

I'm trying to set up a replicating solr system (one master, one slave) here. 
Everything _looks_ o.k. but replication fails. A little debugging shows the 
following:

 r...@slave:~# curl 
'http://master:8180/solr/website/replication?command=indexversion&wt=json' && 
echo ''
{"responseHeader":{"status":0,"QTime":0},"indexversion":0,"generation":0}

 r...@slave:~# curl 
'http://master:8180/solr/website/replication?command=details&wt=json' && echo ''
{"responseHeader":{"status":0,"QTime":1},"details":{"indexSize":"6.76 
GB","indexPath":"/var/lib/solr/data/website/index","commits":[["indexVersion",1292192351652,"generation",5,"filelist",["_7e.fdx","_7e.tii","_7e.frq","_7e.prx","_7e.fdt","segments_5","_7e.fnm","_7e.nrm","_7e.tis"]]],"isMaster":"true","isSlave":"false","indexVersion":1292192351652,"generation":5},"WARNING":"This
 response format is experimental.  It is likely to change in the future."}

 r...@slave:~# 

Note that the indexversion returned by the indexversion command is 0, while the same
information from the details command is 1292192351652 ...
Any idea what's going on here?

 TIA Ralf Mattes
 



Re: full text search in multiple fields

2010-12-13 Thread PeterKerk

whoops :)
It was directed at iorixxx, in the first post before me


Re: Indexing pdf files - question.

2010-12-13 Thread Adam Estrada
Hi,

I use the following command to post PDF files.

$ curl "http://localhost:8983/solr/update/extract?stream.file=C:\temp\document.docx&stream.contentType=application/msword&literal.id=esc.doc&commit=true"
$ curl "http://localhost:8983/solr/update/extract?stream.file=C:\temp\features.pdf&stream.contentType=application/pdf&literal.id=esc2.doc&commit=true"
$ curl "http://localhost:8983/solr/update/extract?stream.file=C:\temp\Memo_ocrd.pdf&stream.contentType=application/pdf&literal.id=Memo_ocrd.pdf&defaultField=text&commit=true"

The PDFs have to be OCR'd.

Adam

On Mon, Dec 13, 2010 at 11:01 AM, Siebor, Wlodek [USA] <
siebor_wlo...@bah.com> wrote:

> HI,
> Can sombody, please, send me a command for indexing a sample pdf with
> ExtractngRequestHandler file available in the /docs directory. I have
> lucidworks solr installed on linux, with standard schema.xml and
> solrconfig.xml files (unchanged). I want to pass as the unique id the name
> of the file.
> I’m trying various curl commands and so far I have either  “… missing
> required field: id” or “.. missing content stream” errors.
> Thanks for your help,
> Wlodek
>


Re: How to get all the search results?

2010-12-13 Thread Solr User
Hi,

I tried *:* using dismax and I get no results.

Is there a way that I can get all the search results using dismax?

Thanks,
Murali

On Mon, Dec 6, 2010 at 11:17 AM, Savvas-Andreas Moysidis <
savvas.andreas.moysi...@googlemail.com> wrote:

> Hello,
>
> shouldn't that query syntax be *:* ?
>
> Regards,
> -- Savvas.
>
> On 6 December 2010 16:10, Solr User  wrote:
>
> > Hi,
> >
> > First off thanks to the group for guiding me to move from default search
> > handler to dismax.
> >
> > I have a question related to getting all the search results. In the past
> > with the default search handler I was getting all the search results
> (8000)
> > if I pass q=* as search string but with dismax I was getting only 16
> > results
> > instead of 8000 results.
> >
> > How to get all the search results using dismax? Do I need to configure
> > anything to make * (asterisk) work?
> >
> > Thanks,
> > Solr User
> >
>


Re: How to get all the search results?

2010-12-13 Thread Erick Erickson
Can we see the results with &debugQuery=on, as well as the entire HTTP
string you use?

Also, are you sure you've put documents in your index and committed
afterwards?

Best
Erick
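
For reference, the kind of full request being asked for, with host and handler
names as placeholders:

    http://localhost:8983/solr/select?qt=dismax&q=*:*&debugQuery=on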

On Mon, Dec 13, 2010 at 11:59 AM, Solr User  wrote:

> Hi,
>
> I tried *:* using dismax and I get no results.
>
> Is there a way that I can get all the search results using dismax?
>
> Thanks,
> Murali
>
> On Mon, Dec 6, 2010 at 11:17 AM, Savvas-Andreas Moysidis <
> savvas.andreas.moysi...@googlemail.com> wrote:
>
> > Hello,
> >
> > shouldn't that query syntax be *:* ?
> >
> > Regards,
> > -- Savvas.
> >
> > On 6 December 2010 16:10, Solr User  wrote:
> >
> > > Hi,
> > >
> > > First off thanks to the group for guiding me to move from default
> search
> > > handler to dismax.
> > >
> > > I have a question related to getting all the search results. In the
> past
> > > with the default search handler I was getting all the search results
> > (8000)
> > > if I pass q=* as search string but with dismax I was getting only 16
> > > results
> > > instead of 8000 results.
> > >
> > > How to get all the search results using dismax? Do I need to configure
> > > anything to make * (asterisk) work?
> > >
> > > Thanks,
> > > Solr User
> > >
> >
>


Re: Strange replication problem

2010-12-13 Thread Xin Li
" indexversion returned by the indexversion command is 0 while the
same information from the details command is 292192351652 ..."

This only happens to a Slave machine. For a Master machine,
indexversion returns the same number as details command.





On Mon, Dec 13, 2010 at 11:06 AM, Ralf Mattes  wrote:
> Hello list,
>
> I'm trying to set up a replicating solr system (one master, one slave) here.
> Everything _looks_ o.k. but replication fails. A little debugging shows the 
> following:
>
>  r...@slave:~# curl 
> 'http://master:8180/solr/website/replication?command=indexversion&wt=json' && 
> echo ''
> {"responseHeader":{"status":0,"QTime":0},"indexversion":0,"generation":0}
>
>  r...@slave:~# curl 
> 'http://master:8180/solr/website/replication?command=details&wt=json' && echo 
> ''
> {"responseHeader":{"status":0,"QTime":1},"details":{"indexSize":"6.76 
> GB","indexPath":"/var/lib/solr/data/website/index","commits":[["indexVersion",1292192351652,"generation",5,"filelist",["_7e.fdx","_7e.tii","_7e.frq","_7e.prx","_7e.fdt","segments_5","_7e.fnm","_7e.nrm","_7e.tis"]]],"isMaster":"true","isSlave":"false","indexVersion":1292192351652,"generation":5},"WARNING":"This
>  response format is experimental.  It is likely to change in the future."}
>
>  r...@slave:~#
>
> Note that indexversion returned by the indexversion command is 0 while the 
> same information from the details command is 292192351652 ...
> Any idea what's going on here?
>
>  TIA Ralf Mattes
>
>
>


Re: Strange replication problem

2010-12-13 Thread Ralf Mattes
On Mon, 13 Dec 2010 12:31:27 -0500, Xin Li wrote:

> " indexversion returned by the indexversion command is 0 while the same
> information from the details command is 292192351652 ..."
> 
> This only happens to a Slave machine. For a Master machine, indexversion
> returns the same number as details command.

??? What part of my posted example did you not read? :-)
Both requests were sent to the same machine (configured as master) - and I get
exactly the result described. So, no, for my setup (pretty much out-of-the-box
with minimal master/slave configuration) your statement is wrong :-/

 Thanks, RalfD
 

 

> 
> 
> 
> On Mon, Dec 13, 2010 at 11:06 AM, Ralf Mattes  wrote:
>> Hello list,
>>
>> I'm trying to set up a replicating solr system (one master, one slave)
>> here. Everything _looks_ o.k. but replication fails. A little debugging
>> shows the following:
>>
>>  r...@slave:~# curl
>>  'http://master:8180/solr/website/replication?command=indexversion&wt=json'
>>  && echo ''
>> {"responseHeader":{"status":0,"QTime":0},"indexversion":0,"generation":0}
>>
>>  r...@slave:~# curl
>>  'http://master:8180/solr/website/replication?command=details&wt=json'
>>  && echo ''
>> {"responseHeader":{"status":0,"QTime":1},"details":{"indexSize":"6.76
>> GB","indexPath":"/var/lib/solr/data/website/index","commits":[["indexVersion",1292192351652,"generation",5,"filelist",["_7e.fdx","_7e.tii","_7e.frq","_7e.prx","_7e.fdt","segments_5","_7e.fnm","_7e.nrm","_7e.tis"]]],"isMaster":"true","isSlave":"false","indexVersion":1292192351652,"generation":5},"WARNING":"This
>> response format is experimental.  It is likely to change in the
>> future."}
>>
>>  r...@slave:~#
>>
>> Note that indexversion returned by the indexversion command is 0 while
>> the same information from the details command is 292192351652 ... Any
>> idea what's going on here?
>>
>>  TIA Ralf Mattes
>>
>>
>>




Re: Indexing pdf files - question.

2010-12-13 Thread Wodek Siebor

The sample /docs/tutorial.pdf does not require OCR.


Re: Geospatial search w/polygon bounding box?

2010-12-13 Thread Erick Erickson
It really doesn't look like it. As part of some other work I'm doing I ran
across this: https://issues.apache.org/jira/browse/SOLR-2155
which seems to speak (on cursory glance) to the polygon-bounding-box
issue. But as you can see, it hasn't been committed. Want to push it forward?

Best
Erick

On Mon, Nov 8, 2010 at 11:57 AM, Jonathan Gill
wrote:

> Hi-
>
> Based on my limited but growing understanding of bounding box spatial
> filtering in Solr, I think I know the answer to my question, but I'm
> looking
> for confirmation.  Is there a way to specify a polygon bounding box for
> spatial searches?  If so, is there a limit to the number of points that
> define the polygon shape?
>
>
>


access to environment variables in solrconfig.xml and/or schema.xml?

2010-12-13 Thread Burton-West, Tom
I see variables used to access java system properties in solrconfig.xml and 
schema.xml:

http://wiki.apache.org/solr/SolrConfigXml#System_property_substitution
${solr.data.dir:}
or
${solr.abortOnConfigurationError:true}

Is there a way to access environment variables or does everything have to be 
stuffed into a java system property?

Tom Burton-West
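
For illustration, the usual route is to export the environment variable and
hand it to the JVM as a system property at startup; a sketch assuming the
example Jetty launcher:

    export SOLR_DATA_DIR=/var/lib/solr/data
    java -Dsolr.data.dir=$SOLR_DATA_DIR -jar start.jar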





Re: Strange replication problem

2010-12-13 Thread Xin Li
Did you double-check
http://machine:port/solr/website/admin/replication/ to see whether the
"master" is indeed a master?

On Mon, Dec 13, 2010 at 1:01 PM, Ralf Mattes  wrote:
> On Mon, 13 Dec 2010 12:31:27 -0500, Xin Li wrote:
>
>> " indexversion returned by the indexversion command is 0 while the same
>> information from the details command is 292192351652 ..."
>>
>> This only happens to a Slave machine. For a Master machine, indexversion
>> returns the same number as details command.
>
> ??? What part of my posted example did you not read? :-)
> Both requests where sent to the same machine (configured as master) - and I 
> get
> exactly the result described. So, no, for my setup (pretty much out-of-the-box
> with minimal master/slave configuration) your statement is wrong :-/
>
>  Thanks, RalfD
>
>
>
>
>>
>>
>>
>> On Mon, Dec 13, 2010 at 11:06 AM, Ralf Mattes  wrote:
>>> Hello list,
>>>
>>> I'm trying to set up a replicating solr system (one master, one slave)
>>> here. Everything _looks_ o.k. but replication fails. A little debugging
>>> shows the following:
>>>
>>>  r...@slave:~# curl
>>>  'http://master:8180/solr/website/replication?command=indexversion&wt=json'
>>>  && echo ''
>>> {"responseHeader":{"status":0,"QTime":0},"indexversion":0,"generation":0}
>>>
>>>  r...@slave:~# curl
>>>  'http://master:8180/solr/website/replication?command=details&wt=json'
>>>  && echo ''
>>> {"responseHeader":{"status":0,"QTime":1},"details":{"indexSize":"6.76
>>> GB","indexPath":"/var/lib/solr/data/website/index","commits":[["indexVersion",1292192351652,"generation",5,"filelist",["_7e.fdx","_7e.tii","_7e.frq","_7e.prx","_7e.fdt","segments_5","_7e.fnm","_7e.nrm","_7e.tis"]]],"isMaster":"true","isSlave":"false","indexVersion":1292192351652,"generation":5},"WARNING":"This
>>> response format is experimental.  It is likely to change in the
>>> future."}
>>>
>>>  r...@slave:~#
>>>
>>> Note that indexversion returned by the indexversion command is 0 while
>>> the same information from the details command is 292192351652 ... Any
>>> idea what's going on here?
>>>
>>>  TIA Ralf Mattes
>>>
>>>
>>>
>
>
>


Re: Is it possible to assign default value for a particular record when using multivalued field type?

2010-12-13 Thread bbarani

Hi,

Is there a template transformer which can act on each and every record of a
multivalued attribute?

The issue is that some of the records might have null data in the source, and I
want that data to be replaced with some default value.

Also, if the value is blank I just see an empty tag in the XML. Any idea how
to parse this tag using SolrNet?

The problem is that we are parsing the XML tags of two attributes (for example
objectid and objectname) and mapping them in the UI, something like a key-value
pair.

If either attribute (e.g. objectname) is null or blank, the
mapping order gets changed (between objectid and objectname).

I am not sure if the only solution is to handle this in the Solr DIH query.

Thanks,
Barani


Re: Separate Lines Like Google

2010-12-13 Thread Alejandro Delgadillo
Koji,

Thank you for helping me with my questions, but I still don't get how
it's done. Let's say I search for the term "love" and I get something like
this:

LoveLove may also
refer to: Contents. 1 Film and television.

As you can see, the second term is from the same document but it is from a
different paragraph; aside from the highlight there is no way to tell them
apart. That's the problem I've been having.

I put the hl.snippets under my default search handler...



On 12/13/10 8:45 AM, "Koji Sekiguchi"  wrote:

> (10/12/13 23:00), Alejandro Delgadillo wrote:
>> 
>> Hi everybody,
>> 
>> I'm having some troubles trying to figure out how to separate lines in a
>> paragraph from a search result, I'm indexing PDFs but when I search the
>> highlight terms I can not know when the first line ends and the next one
>> begins,
>> 
>> Is there a way to put a [...] like google or a Paragraph symbol?
>> 
>> I'll appreciate all the help I can get.
>> 
>> -- Alex.
>> 
> Alex,
> 
> Use hl.snippets=n parameter, where n is a number (2, 3, ...).
> Then you'll get the number of snippets at maximum, you can
> appends these snippets with "..." between them.
> 
> Koji




Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-13 Thread John Russell
Thanks for the response.

The date types are defined in our schema file like this






Which appears to be what you mentioned.  Then we use them in fields like
this

   
   

So I think we have the right datatypes for the dates.  Most of the other
ones are strings.

As for the doc we are adding, I don't think it would be considered "huge".
It is basically blog posts and tweets broken out into fields like author,
title, summary etc.  Each doc probably isn't more than 1 or 2k tops.  Some
probably smaller.

We do create them once and then update the indexes as we perform work on the
documents.  For example, we create the doc for the original incoming post
and then update that doc with tags or the results of filtering so we can
look for them later.

We have solr set up as a separate JVM which we talk to over HTTP on the same
box using the solrj client java library.  Unfortunately we are on 32 bit
hardware so solr can only get 2.6GB of memory.  Any more than that and the
JVM won't start.

I really just need a way to keep the cache from breaking the bank.  As I
pasted below there are some config elements in the XML that appear to be
related to caching but I'm not sure that they are related to that specific
hashmap which eventually grows to 2.1GB of our 2.6GB heap.  It never
actually runs out of heap space but GC's the CPU to death.

Thanks again.

John

On Sat, Dec 11, 2010 at 17:46, Erick Erickson wrote:

> "unfortunately I can't check the statistics page.  For some reason the solr
> webapp itself is only returning a directory listing."
>
> This is very weird and makes me wonder if there's something really wonky
> with your system. I'm assuming when you say "the solr webapp itself" you're
> taking about ...localhost:8983/solr/admin/.. You might want to be
> looking
> at the stats page (and frantically hitting refresh) before you have
> problems.
> Alternately, you could record the queries as they are sent to solr to see
> what
> the offending
>
> But onwards Tell us more about your dates. One of the very common
> ways people get into trouble is to use dates that are unix-style
> timestamps,
> i.e. in milliseconds (either as ints or strings) and sort on them. Trie
> fields
> are very much preferred for this.
>
> Your index isn't all that large by regular standards, so I think that
> there's
> hope that you can get this working.
>
>
> Wait, wait, wait. Looking again at the stack trace I see that your OOM
> is happening when you *add* a document. Tell us more about the
> document, perhaps you can print out some characteristics of the doc
> before you add it? Is it always the same doc? Are you indexing and
> searching on the same machine? Is the doc really huge?
>
> Best
> Erick
>
>
> On Fri, Dec 10, 2010 at 4:33 PM, John Russell  wrote:
>
> > Thanks a lot for the response.
> >
> > Unfortunately I can't check the statistics page.  For some reason the
> solr
> > webapp itself is only returning a directory listing.  This is sometimes
> > fixed when I restart but if I do that I'll lose the state I have now.  I
> > can
> > get at the JMX interface.  Can I check my insanity level from there?
> >
> > We did change two parts of the solr config to raise the size of the query
> > Results and document cache.  I assume from what you were saying that this
> > does not have an effect on the cache I mentioned taking up all of the
> > space.
> >
> >>
> > <queryResultCache
> >   class="solr.LRUCache"
> >   size="16384"
> >   initialSize="4096"
> >   autowarmCount="0"/>
> >
> > <documentCache
> >   class="solr.LRUCache"
> >   size="16384"
> >   initialSize="16384"
> >   autowarmCount="0"/>
> >
> >
> > This problem gets worse as our index grows (5.0GB now).  Unfortunately we
> > are maxed out on memory for our hardware.
> >
> > We aren't using faceting at all in our searches right now.  We usually
> sort
> > on 1 or 2 fields at the most.  I think the types of our fields are pretty
> > accurate, unfortunately they are mostly strings, and some dates.
> >
> > How do the field definitions effect that cache? Is it simply that fewer
> > fields mean less to cache? Does it not cache some fields configured in a
> > certain way?
> >
> > Is there a way to throw out an IndexReader after a while and restart,
> just
> > to restart the cache? Or maybe explicitly clear it if we see it getting
> out
> > of hand through JMX or something?
> >
> > Really anything to stop it from choking like this would be awesome.
> >
> > Thanks again.
> >
> > John
> >
> > On Fri, Dec 10, 2010 at 16:02, Tom Hill  wrote:
> >
> > > Hi John,
> > >
> > > WeakReferences allow things to get GC'd, if there are no other
> > > references to the object referred to.
> > >
> > > My understanding is that WeakHashMaps use weak references for the Keys
> > > in the HashMap.
> > >
> > > What this means is that the keys in HashMap can be GC'd, once there
> > > are no other references to the key. I _think_ this occ

Re: [pubDate] is not converting correctly

2010-12-13 Thread Lance Norskog
Please file a JIRA requesting this.

On Mon, Dec 13, 2010 at 6:29 AM, Adam Estrada  wrote:
> +1  If I knew enough about how to do this in Java I would but I do not
> s.What is the correct way to add or suggest enhancements to Solr
> core?
>
> Adam
>
> On Sun, Dec 12, 2010 at 11:38 PM, Lance Norskog  wrote:
>
>> Nice find!  This is Apache 2.0, copyright SUN.
>>
>> O Great Apache Elders: Is it kosher to add this to the Solr
>> distribution? It's not in the JDK and is also com.sun.*
>>
>> On Sun, Dec 12, 2010 at 5:33 PM, Adam Estrada
>>  wrote:
>> > Thanks for the feedback! There are quite a few formats that can be used.
>> I
>> > am experiencing at least 5 of them. Would something like this work? Note
>> > that there are 2 different formats separated by a comma.
>> >
>> > > > dateTimeFormat="EEE, dd MMM  HH:mm:ss zzz, -MM-dd'T'HH:mm:ss'Z'"
>> />
>> >
>> > I don't suppose it will because there is already a comma in the first
>> > parser. I guess I am reallly looking for an all purpose data time parser
>> but
>> > even if I have that, would I still be able to query *all* fields in the
>> > index?
>> >
>> > Good article:
>> >
>> http://www.java2s.com/Open-Source/Java-Document/RSS-RDF/Rome/com/sun/syndication/io/impl/DateParser.java.htm
>> >
>> > Adam
>> >
>> > On Sun, Dec 12, 2010 at 7:31 PM, Koji Sekiguchi 
>> wrote:
>> >
>> >> (10/12/13 8:49), Adam Estrada wrote:
>> >>
>> >>> All,
>> >>>
>> >>> I am having some difficu"lties parsing the pubDate field that is part
>> of
>> >>> the?
>> >>> RSS spec (I believe). I get the warning that "states, "Dec 12, 2010
>> >>> 6:45:26
>> >>> PM org.apache.solr.handler.dataimport.DateFormatTransformer
>> >>>  transformRow
>> >>> WARNING: Could not parse a Date field
>> >>> java.text.ParseException: Unparseable date: "Thu, 30 Jul 2009 14:41:43
>> >>> +"
>> >>>         at java.text.DateFormat.parse(Unknown Source)"
>> >>>
>> >>> Does anyone know how to fix this? I would eventually like to do a date
>> >>> query
>> >>> but without the ability to properly parse them I don't know if it's
>> going
>> >>> to
>> >>> work.
>> >>>
>> >>> Thanks,
>> >>> Adam
>> >>>
>> >>
>> >> Adam,
>> >>
>> >> How does your data-config.xml look like for that field?
>> >> Have you looked at rss-data-config.xml file
>> >> under example/example-DIH/solr/rss/conf directory?
>> >>
>> >> Koji
>> >> --
>> >> http://www.rondhuit.com/en/
>> >>
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: [pubDate] is not converting correctly

2010-12-13 Thread Lance Norskog
Create an account at
https://issues.apache.org/jira/secure/Dashboard.jspa and do 'Create
New Issue' for the Solr project.

On Mon, Dec 13, 2010 at 2:13 PM, Lance Norskog  wrote:
> Please file a JIRA requesting this.
>
> On Mon, Dec 13, 2010 at 6:29 AM, Adam Estrada  wrote:
>> +1  If I knew enough about how to do this in Java I would but I do not
>> s.What is the correct way to add or suggest enhancements to Solr
>> core?
>>
>> Adam
>>
>> On Sun, Dec 12, 2010 at 11:38 PM, Lance Norskog  wrote:
>>
>>> Nice find!  This is Apache 2.0, copyright SUN.
>>>
>>> O Great Apache Elders: Is it kosher to add this to the Solr
>>> distribution? It's not in the JDK and is also com.sun.*
>>>
>>> On Sun, Dec 12, 2010 at 5:33 PM, Adam Estrada
>>>  wrote:
>>> > Thanks for the feedback! There are quite a few formats that can be used.
>>> I
>>> > am experiencing at least 5 of them. Would something like this work? Note
>>> > that there are 2 different formats separated by a comma.
>>> >
>>> > >> > dateTimeFormat="EEE, dd MMM  HH:mm:ss zzz, -MM-dd'T'HH:mm:ss'Z'"
>>> />
>>> >
>>> > I don't suppose it will because there is already a comma in the first
>>> > parser. I guess I am reallly looking for an all purpose data time parser
>>> but
>>> > even if I have that, would I still be able to query *all* fields in the
>>> > index?
>>> >
>>> > Good article:
>>> >
>>> http://www.java2s.com/Open-Source/Java-Document/RSS-RDF/Rome/com/sun/syndication/io/impl/DateParser.java.htm
>>> >
>>> > Adam
>>> >
>>> > On Sun, Dec 12, 2010 at 7:31 PM, Koji Sekiguchi 
>>> wrote:
>>> >
>>> >> (10/12/13 8:49), Adam Estrada wrote:
>>> >>
>>> >>> All,
>>> >>>
>>> >>> I am having some difficu"lties parsing the pubDate field that is part
>>> of
>>> >>> the?
>>> >>> RSS spec (I believe). I get the warning that "states, "Dec 12, 2010
>>> >>> 6:45:26
>>> >>> PM org.apache.solr.handler.dataimport.DateFormatTransformer
>>> >>>  transformRow
>>> >>> WARNING: Could not parse a Date field
>>> >>> java.text.ParseException: Unparseable date: "Thu, 30 Jul 2009 14:41:43
>>> >>> +"
>>> >>>         at java.text.DateFormat.parse(Unknown Source)"
>>> >>>
>>> >>> Does anyone know how to fix this? I would eventually like to do a date
>>> >>> query
>>> >>> but without the ability to properly parse them I don't know if it's
>>> going
>>> >>> to
>>> >>> work.
>>> >>>
>>> >>> Thanks,
>>> >>> Adam
>>> >>>
>>> >>
>>> >> Adam,
>>> >>
>>> >> How does your data-config.xml look like for that field?
>>> >> Have you looked at rss-data-config.xml file
>>> >> under example/example-DIH/solr/rss/conf directory?
>>> >>
>>> >> Koji
>>> >> --
>>> >> http://www.rondhuit.com/en/
>>> >>
>>> >
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>>>
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>



-- 
Lance Norskog
goks...@gmail.com


Re: How to get all the search results?

2010-12-13 Thread Shawn Heisey

On 12/13/2010 9:59 AM, Solr User wrote:

Hi,

I tried *:* using dismax and I get no results.

Is there a way that I can get all the search results using dismax?


For dismax, use an empty q parameter (q=) or simply leave the q parameter off the URL
entirely.  It appears that you need to have q.alt set to *:* for this to
work.  It would be a good idea to include this in your handler definition:


<str name="q.alt">*:*</str>
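
For illustration, a handler definition along those lines might look like the
sketch below (the handler name and other defaults are placeholders, not a
required setup):

<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>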

Two people (myself and Peter Karich) gave this answer on this thread 
last week, within 15 minutes of the time your original question was 
posted.  Here's the entire thread on nabble:


http://lucene.472066.n3.nabble.com/How-to-get-all-the-search-results-td2028233.html

Shawn



Re: How to get all the search results?

2010-12-13 Thread Solr User
Hi Shawn,

Yes you did.

I tried it and it did not work, so I asked the same question again.

Now I understand. I tried it directly on the Solr admin and got all the
search results. I will implement the same on the website.

Thank you so much Shawn.


On Mon, Dec 13, 2010 at 5:16 PM, Shawn Heisey  wrote:

> On 12/13/2010 9:59 AM, Solr User wrote:
>
>> Hi,
>>
>> I tried *:* using dismax and I get no results.
>>
>> Is there a way that I can get all the search results using dismax?
>>
>
> For dismax, use q= or simply leave the q parameter off the URL entirely.
>  It appears that you need to have q.alt set to *:* for this to work.  It
> would be a good idea to include this in your handler definition:
>
> *:*
>
> Two people (myself and Peter Karich) gave this answer on this thread last
> week, within 15 minutes of the time your original question was posted.
>  Here's the entire thread on nabble:
>
>
> http://lucene.472066.n3.nabble.com/How-to-get-all-the-search-results-td2028233.html
>
> Shawn
>
>


Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-13 Thread Jonathan Rochkind
Forgive me if I've said this in this thread already, but I'm beginning 
to think this is the main 'mysterious' cause of Solr RAM/gc issues.


Are you committing very frequently?  So frequently that you commit 
faster than it takes for warming operations on a new Solr index to 
complete, and you're getting over-lapping indexes being prepared?


But if the problem really is just GC issues and not actually too much 
RAM being used, try this JVM setting:


-XX:+UseConcMarkSweepGC

Will make GC happen in a different thread, instead of the same thread as 
solr operations.
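
As a sketch, assuming the example Jetty setup, the flag just goes on the
startup command line (the heap size shown is only illustrative):

java -Xmx2048m -XX:+UseConcMarkSweepGC -jar start.jar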


I think that is also something that many many Solr installations 
probably need, but don't realize they need.


On 12/13/2010 3:42 PM, John Russell wrote:

Thanks for the response.

The date types are defined in our schema file like this

 

 
 

Which appears to be what you mentioned.  Then we use them in fields like
this




So I think we have the right datatypes for the dates.  Most of the other
ones are strings.

As for the doc we are adding, I don't think it would be considered "huge".
It is basically blog posts and tweets broken out into fields like author,
title, summary etc.  Each doc probably isn't more than 1 or 2k tops.  Some
probably smaller.

We do create them once and then update the indexes as we perform work on the
documents.  For example, we create the doc for the original incoming post
and then update that doc with tags or the results of filtering so we can
look for them later.

We have solr set up as a separate JVM which we talk to over HTTP on the same
box using the solrj client java library.  Unfortunately we are on 32 bit
hardware so solr can only get 2.6GB of memory.  Any more than that and the
JVM won't start.

I really just need a way to keep the cache from breaking the bank.  As I
pasted below there are some config elements in the XML that appear to be
related to caching but I'm not sure that they are related to that specific
hashmap which eventually grows to 2.1GB of our 2.6GB heap.  It never
actually runs out of heap space but GC's the CPU to death.

Thanks again.

John

On Sat, Dec 11, 2010 at 17:46, Erick Ericksonwrote:


"unfortunately I can't check the statistics page.  For some reason the solr
webapp itself is only returning a directory listing."

This is very weird and makes me wonder if there's something really wonky
with your system. I'm assuming when you say "the solr webapp itself" you're
taking about ...localhost:8983/solr/admin/.. You might want to be
looking
at the stats page (and frantically hitting refresh) before you have
problems.
Alternately, you could record the queries as they are sent to solr to see
what
the offending

But onwards Tell us more about your dates. One of the very common
ways people get into trouble is to use dates that are unix-style
timestamps,
i.e. in milliseconds (either as ints or strings) and sort on them. Trie
fields
are very much preferred for this.

Your index isn't all that large by regular standards, so I think that
there's
hope that you can get this working.


Wait, wait, wait. Looking again at the stack trace I see that your OOM
is happening when you *add* a document. Tell us more about the
document, perhaps you can print out some characteristics of the doc
before you add it? Is it always the same doc? Are you indexing and
searching on the same machine? Is the doc really huge?

Best
Erick


On Fri, Dec 10, 2010 at 4:33 PM, John Russell  wrote:


Thanks a lot for the response.

Unfortunately I can't check the statistics page.  For some reason the

solr

webapp itself is only returning a directory listing.  This is sometimes
fixed when I restart but if I do that I'll lose the state I have now.  I
can
get at the JMX interface.  Can I check my insanity level from there?

We did change two parts of the solr config to raise the size of the query
Results and document cache.  I assume from what you were saying that this
does not have an effect on the cache I mentioned taking up all of the
space.

   








This problem gets worse as our index grows (5.0GB now).  Unfortunately we
are maxed out on memory for our hardware.

We aren't using faceting at all in our searches right now.  We usually

sort

on 1 or 2 fields at the most.  I think the types of our fields are pretty
accurate, unfortunately they are mostly strings, and some dates.

How do the field definitions effect that cache? Is it simply that fewer
fields mean less to cache? Does it not cache some fields configured in a
certain way?

Is there a way to throw out an IndexReader after a while and restart,

just

to restart the cache? Or maybe explicitly clear it if we see it getting

out

of hand through JMX or something?

Really anything to stop it from choking like this would be awesome.

Thanks again.

John

On Fri, Dec 10, 2010 at 16:02, Tom Hill  wrote:


Hi John,

WeakReferences allow things to get GC'd, if there are no other
references to the object referred to.

My understanding is

Re: Separate Lines Like Google

2010-12-13 Thread Koji Sekiguchi

(10/12/14 5:06), Alejandro Delgadillo wrote:

Koji,

Thank you for helping me with my questions, but I still don't get how
it's done. Let's say I search for the term "love" and I get something like
this:

LoveLove  may also
refer to: Contents. 1 Film and television.

As you can see, the second one is from the same document but from a
different paragraph; aside from the highlighting there is no way to tell them
apart. That's the problem I've been having.

I put the hl.snippets under my default search handler...


Alex,

You need to pre-process them by paragraph. When you index your docs,
you should add them like this:


  
Love is an intense feeling  of affection
Love may also refer to: Contents. 1 Film and 
television.
  


instead of:


  
Love is an intense feeling  of affection
Love may also refer to: Contents. 1 Film and television.
  


Koji
--
http://www.rondhuit.com/en/


Re: SolrEventListeners are instantiated twice

2010-12-13 Thread Chris Hostetter

: SolrEventListener. Even though I only register the listener in the query
: section of solrconfig.xml, listening to the firstSearcher event, the
: listener is also attached to the UpdateHandler and thus the init-method runs
: twice because there is two instances of the class. To eliminate any other

Jørgen: thank you for reporting this.

It is definitely a bug, and i have opened a jira trakcing issue with an 
attached test demonstrating your problem and a proposed fix that i am 
currently testing...

https://issues.apache.org/jira/browse/SOLR-2285


-Hoss

Re: access to environment variables in solrconfig.xml and/or schema.xml?

2010-12-13 Thread Koji Sekiguchi

(10/12/14 4:28), Burton-West, Tom wrote:

I see variables used to access java system properties in solrconfig.xml and 
schema.xml:

http://wiki.apache.org/solr/SolrConfigXml#System_property_substitution
${solr.data.dir:}
or
${solr.abortOnConfigurationError:true}

Is there a way to access environment variables or does everything have to be 
stuffed into a java system property?


Tom,

No, there is no way to access environment variables. But you can
access your environment variable through a java system property
if you assign the environment variable to a java system property
when you launch JVM:

$ java -Djava.system.property1=$ENVIRONMENT_VAR1 \
-Djava.system.property2=$ENVIRONMENT_VAR2 -jar start.jar
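
The property can then be referenced in solrconfig.xml or schema.xml like any
other system property; a minimal sketch, with the element and property name
purely illustrative:

<dataDir>${java.system.property1}</dataDir>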

Koji
--
http://www.rondhuit.com/en/


Re: Problem with loading a class

2010-12-13 Thread Chris Hostetter

: Caused by: java.lang.ClassNotFoundException:
: solr.StempelPolishStemFilterFactory
: 
: So I tried putting
: contrib/analysis-extras/lucene-libs/lucene-stempel-3.1-2010-12-06_10-23-49.jar
: in ./lib and ./lucene-libs - same result.

The lucene-stempel-*.jar file contains the StempelPolishStemFilter, but to 
use it in Solr you also need the StempelPolishStemFilterFactory which is 
in the apache-solr-analysis-extras-*.jar
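
With both jars on Solr's lib path, a minimal sketch of using the factory in
schema.xml would be something like this (the field type name and the rest of
the analyzer chain are only illustrative):

<fieldType name="text_pl" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StempelPolishStemFilterFactory"/>
  </analyzer>
</fieldType>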



-Hoss


Re: JMX Cache values are wrong

2010-12-13 Thread Chris Hostetter

: I've used three different JMX clients to query
...
: beans and they appear to return old cache information.
: 
: As new searchers come online, the newer caches dosen't appear to be
: registered perhaps?
: I can see this when I query JMX for the 'description' attribute and
: the regenerator JMX output shows a different
: org.apache.solr.search.SolrIndexSearcher to that which appears in the
: stats.jsp page.

Hmmm... using jconsole and the example jetty instance i can't reproduce 
this behavior (on trunk)

I can view details on those beans, and click the "refresh" button for 
those beans to see updated stats as requests come in.  When a new searcher 
is opened, those beans go away complete (jsonsole detects that they are 
gone and closes the active pane) and new beans come along which have the 
updated stats for the new cache instances.

Perhaps either the servlet container or JMX client you are using aren't 
recognizing when one bean goes away and a new one is registered with the 
same name?



-Hoss


Re: [pubDate] is not converting correctly

2010-12-13 Thread Adam Estrada
My first submission ;-)

https://issues.apache.org/jira/browse/SOLR-2286

Adam

On Mon, Dec 13, 2010 at 5:14 PM, Lance Norskog  wrote:

> Create an account at
> https://issues.apache.org/jira/secure/Dashboard.jspa and do 'Create
> New Issue' for the Solr project.
>
> On Mon, Dec 13, 2010 at 2:13 PM, Lance Norskog  wrote:
> > Please file a JIRA requesting this.
> >
> > On Mon, Dec 13, 2010 at 6:29 AM, Adam Estrada 
> wrote:
> >> +1  If I knew enough about how to do this in Java I would but I do not
> >> s.What is the correct way to add or suggest enhancements to Solr
> >> core?
> >>
> >> Adam
> >>
> >> On Sun, Dec 12, 2010 at 11:38 PM, Lance Norskog 
> wrote:
> >>
> >>> Nice find!  This is Apache 2.0, copyright SUN.
> >>>
> >>> O Great Apache Elders: Is it kosher to add this to the Solr
> >>> distribution? It's not in the JDK and is also com.sun.*
> >>>
> >>> On Sun, Dec 12, 2010 at 5:33 PM, Adam Estrada
> >>>  wrote:
> >>> > Thanks for the feedback! There are quite a few formats that can be
> used.
> >>> I
> >>> > am experiencing at least 5 of them. Would something like this work?
> Note
> >>> > that there are 2 different formats separated by a comma.
> >>> >
> >>> >  >>> > dateTimeFormat="EEE, dd MMM  HH:mm:ss zzz,
> -MM-dd'T'HH:mm:ss'Z'"
> >>> />
> >>> >
> >>> > I don't suppose it will because there is already a comma in the first
> >>> > parser. I guess I am reallly looking for an all purpose data time
> parser
> >>> but
> >>> > even if I have that, would I still be able to query *all* fields in
> the
> >>> > index?
> >>> >
> >>> > Good article:
> >>> >
> >>>
> http://www.java2s.com/Open-Source/Java-Document/RSS-RDF/Rome/com/sun/syndication/io/impl/DateParser.java.htm
> >>> >
> >>> > Adam
> >>> >
> >>> > On Sun, Dec 12, 2010 at 7:31 PM, Koji Sekiguchi 
> >>> wrote:
> >>> >
> >>> >> (10/12/13 8:49), Adam Estrada wrote:
> >>> >>
> >>> >>> All,
> >>> >>>
> >>> >>> I am having some difficu"lties parsing the pubDate field that is
> part
> >>> of
> >>> >>> the?
> >>> >>> RSS spec (I believe). I get the warning that "states, "Dec 12, 2010
> >>> >>> 6:45:26
> >>> >>> PM org.apache.solr.handler.dataimport.DateFormatTransformer
> >>> >>>  transformRow
> >>> >>> WARNING: Could not parse a Date field
> >>> >>> java.text.ParseException: Unparseable date: "Thu, 30 Jul 2009
> 14:41:43
> >>> >>> +"
> >>> >>> at java.text.DateFormat.parse(Unknown Source)"
> >>> >>>
> >>> >>> Does anyone know how to fix this? I would eventually like to do a
> date
> >>> >>> query
> >>> >>> but without the ability to properly parse them I don't know if it's
> >>> going
> >>> >>> to
> >>> >>> work.
> >>> >>>
> >>> >>> Thanks,
> >>> >>> Adam
> >>> >>>
> >>> >>
> >>> >> Adam,
> >>> >>
> >>> >> How does your data-config.xml look like for that field?
> >>> >> Have you looked at rss-data-config.xml file
> >>> >> under example/example-DIH/solr/rss/conf directory?
> >>> >>
> >>> >> Koji
> >>> >> --
> >>> >> http://www.rondhuit.com/en/
> >>> >>
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Lance Norskog
> >>> goks...@gmail.com
> >>>
> >>
> >
> >
> >
> > --
> > Lance Norskog
> > goks...@gmail.com
> >
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-13 Thread John Russell
Wow, you read my mind.  We are committing very frequently.  We are trying to
get as close to realtime access to the stuff we put in as possible.  Our
current commit time is... ahem every 4 seconds.

Is that insane?

I'll try the ConcMarkSweep as well and see if that helps.

On Mon, Dec 13, 2010 at 17:38, Jonathan Rochkind  wrote:

> Forgive me if I've said this in this thread already, but I'm beginning to
> think this is the main 'mysterious' cause of Solr RAM/gc issues.
>
> Are you committing very frequently?  So frequently that you commit faster
> than it takes for warming operations on a new Solr index to complete, and
> you're getting over-lapping indexes being prepared?
>
> But if the problem really is just GC issues and not actually too much RAM
> being used, try this JVM setting:
>
> -XX:+UseConcMarkSweepGC
>
> Will make GC happen in a different thread, instead of the same thread as
> solr operations.
>
> I think that is also something that many many Solr installations probably
> need, but don't realize they need.
>
>
> On 12/13/2010 3:42 PM, John Russell wrote:
>
>> Thanks for the response.
>>
>> The date types are defined in our schema file like this
>>
>> > precisionStep="0" positionIncrementGap="0"/>
>>
>> 
>> > precisionStep="6" positionIncrementGap="0"/>
>>
>> Which appears to be what you mentioned.  Then we use them in fields like
>> this
>>
>>> stored="false"
>> required="false" multiValued="false" />
>>> required="false" multiValued="false" />
>>
>> So I think we have the right datatypes for the dates.  Most of the other
>> ones are strings.
>>
>> As for the doc we are adding, I don't think it would be considered "huge".
>> It is basically blog posts and tweets broken out into fields like author,
>> title, summary etc.  Each doc probably isn't more than 1 or 2k tops.  Some
>> probably smaller.
>>
>> We do create them once and then update the indexes as we perform work on
>> the
>> documents.  For example, we create the doc for the original incoming post
>> and then update that doc with tags or the results of filtering so we can
>> look for them later.
>>
>> We have solr set up as a separate JVM which we talk to over HTTP on the
>> same
>> box using the solrj client java library.  Unfortunately we are on 32 bit
>> hardware so solr can only get 2.6GB of memory.  Any more than that and the
>> JVM won't start.
>>
>> I really just need a way to keep the cache from breaking the bank.  As I
>> pasted below there are some config elements in the XML that appear to be
>> related to caching but I'm not sure that they are related to that specific
>> hashmap which eventually grows to 2.1GB of our 2.6GB heap.  It never
>> actually runs out of heap space but GC's the CPU to death.
>>
>> Thanks again.
>>
>> John
>>
>> On Sat, Dec 11, 2010 at 17:46, Erick Erickson> >wrote:
>>
>>  "unfortunately I can't check the statistics page.  For some reason the
>>> solr
>>> webapp itself is only returning a directory listing."
>>>
>>> This is very weird and makes me wonder if there's something really wonky
>>> with your system. I'm assuming when you say "the solr webapp itself"
>>> you're
>>> taking about ...localhost:8983/solr/admin/.. You might want to be
>>> looking
>>> at the stats page (and frantically hitting refresh) before you have
>>> problems.
>>> Alternately, you could record the queries as they are sent to solr to see
>>> what
>>> the offending
>>>
>>> But onwards Tell us more about your dates. One of the very common
>>> ways people get into trouble is to use dates that are unix-style
>>> timestamps,
>>> i.e. in milliseconds (either as ints or strings) and sort on them. Trie
>>> fields
>>> are very much preferred for this.
>>>
>>> Your index isn't all that large by regular standards, so I think that
>>> there's
>>> hope that you can get this working.
>>>
>>>
>>> Wait, wait, wait. Looking again at the stack trace I see that your OOM
>>> is happening when you *add* a document. Tell us more about the
>>> document, perhaps you can print out some characteristics of the doc
>>> before you add it? Is it always the same doc? Are you indexing and
>>> searching on the same machine? Is the doc really huge?
>>>
>>> Best
>>> Erick
>>>
>>>
>>> On Fri, Dec 10, 2010 at 4:33 PM, John Russell
>>>  wrote:
>>>
>>>  Thanks a lot for the response.

 Unfortunately I can't check the statistics page.  For some reason the

>>> solr
>>>
 webapp itself is only returning a directory listing.  This is sometimes
 fixed when I restart but if I do that I'll lose the state I have now.  I
 can
 get at the JMX interface.  Can I check my insanity level from there?

 We did change two parts of the solr config to raise the size of the
 query
 Results and document cache.  I assume from what you were saying that
 this
 does not have an effect on the cache I mentioned taking up all of the
 space.

   >>>
  class=*"solr.LRUCache"*

  s

Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-13 Thread Yonik Seeley
On Mon, Dec 13, 2010 at 8:47 PM, John Russell  wrote:
> Wow, you read my mind.  We are committing very frequently.  We are trying to
> get as close to realtime access to the stuff we put in as possible.  Our
> current commit time is... ahem every 4 seconds.
>
> Is that insane?

Not necessarily insane, but challenging ;-)
I'd start by setting maxWarmingSearchers to 1 in solrconfig.xml.  When
that is exceeded, a commit will fail (this just means a new searcher
won't be opened on that commit... the docs will be visible with the
next commit that does succeed.)
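
For reference, that is a single setting in solrconfig.xml, shown here with the
suggested value:

<maxWarmingSearchers>1</maxWarmingSearchers>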

-Yonik
http://www.lucidimagination.com


RE: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-13 Thread Jonathan Rochkind
ConcMarkSweep probably won't help.

Solr 1.4 is not very good at 'near real time' committing.  There are some 
features post-1.4, that I don't know if they are in trunk yet or still just 
patches, that I have not investigated myself, but google (or JIRA search) for 
'near real time'.

http://wiki.apache.org/solr/SolrPerformanceFactors#Updates_and_Commit_Frequency_Tradeoffs



This seems to be a very frequent issue these days; everyone running Solr should 
at least read that wiki section to understand what's going on.
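
One related knob, sketched here with purely illustrative values, is to let Solr
batch commits itself via autoCommit in solrconfig.xml rather than issuing an
explicit commit every few seconds:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs> <!-- commit after this many added docs... -->
    <maxTime>60000</maxTime> <!-- ...or after this many milliseconds -->
  </autoCommit>
</updateHandler>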



From: John Russell [jjruss...@gmail.com]
Sent: Monday, December 13, 2010 8:47 PM
To: solr-user@lucene.apache.org
Subject: Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap 
getting collected?

Wow, you read my mind.  We are committing very frequently.  We are trying to
get as close to realtime access to the stuff we put in as possible.  Our
current commit time is... ahem every 4 seconds.

Is that insane?

I'll try the ConcMarkSweep as well and see if that helps.

On Mon, Dec 13, 2010 at 17:38, Jonathan Rochkind  wrote:

> Forgive me if I've said this in this thread already, but I'm beginning to
> think this is the main 'mysterious' cause of Solr RAM/gc issues.
>
> Are you committing very frequently?  So frequently that you commit faster
> than it takes for warming operations on a new Solr index to complete, and
> you're getting over-lapping indexes being prepared?
>
> But if the problem really is just GC issues and not actually too much RAM
> being used, try this JVM setting:
>
> -XX:+UseConcMarkSweepGC
>
> Will make GC happen in a different thread, instead of the same thread as
> solr operations.
>
> I think that is also something that many many Solr installations probably
> need, but don't realize they need.
>
>
> On 12/13/2010 3:42 PM, John Russell wrote:
>
>> Thanks for the response.
>>
>> The date types are defined in our schema file like this
>>
>> > precisionStep="0" positionIncrementGap="0"/>
>>
>> 
>> > precisionStep="6" positionIncrementGap="0"/>
>>
>> Which appears to be what you mentioned.  Then we use them in fields like
>> this
>>
>>> stored="false"
>> required="false" multiValued="false" />
>>> required="false" multiValued="false" />
>>
>> So I think we have the right datatypes for the dates.  Most of the other
>> ones are strings.
>>
>> As for the doc we are adding, I don't think it would be considered "huge".
>> It is basically blog posts and tweets broken out into fields like author,
>> title, summary etc.  Each doc probably isn't more than 1 or 2k tops.  Some
>> probably smaller.
>>
>> We do create them once and then update the indexes as we perform work on
>> the
>> documents.  For example, we create the doc for the original incoming post
>> and then update that doc with tags or the results of filtering so we can
>> look for them later.
>>
>> We have solr set up as a separate JVM which we talk to over HTTP on the
>> same
>> box using the solrj client java library.  Unfortunately we are on 32 bit
>> hardware so solr can only get 2.6GB of memory.  Any more than that and the
>> JVM won't start.
>>
>> I really just need a way to keep the cache from breaking the bank.  As I
>> pasted below there are some config elements in the XML that appear to be
>> related to caching but I'm not sure that they are related to that specific
>> hashmap which eventually grows to 2.1GB of our 2.6GB heap.  It never
>> actually runs out of heap space but GC's the CPU to death.
>>
>> Thanks again.
>>
>> John
>>
>> On Sat, Dec 11, 2010 at 17:46, Erick Erickson> >wrote:
>>
>>  "unfortunately I can't check the statistics page.  For some reason the
>>> solr
>>> webapp itself is only returning a directory listing."
>>>
>>> This is very weird and makes me wonder if there's something really wonky
>>> with your system. I'm assuming when you say "the solr webapp itself"
>>> you're
>>> taking about ...localhost:8983/solr/admin/.. You might want to be
>>> looking
>>> at the stats page (and frantically hitting refresh) before you have
>>> problems.
>>> Alternately, you could record the queries as they are sent to solr to see
>>> what
>>> the offending
>>>
>>> But onwards Tell us more about your dates. One of the very common
>>> ways people get into trouble is to use dates that are unix-style
>>> timestamps,
>>> i.e. in milliseconds (either as ints or strings) and sort on them. Trie
>>> fields
>>> are very much preferred for this.
>>>
>>> Your index isn't all that large by regular standards, so I think that
>>> there's
>>> hope that you can get this working.
>>>
>>>
>>> Wait, wait, wait. Looking again at the stack trace I see that your OOM
>>> is happening when you *add* a document. Tell us more about the
>>> document, perhaps you can print out some characteristics of the doc
>>> before you add it? Is it always the same doc? Are you indexing and
>>> searching on the same machine? Is the doc really huge?
>>>
>>> Best
>>

Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-13 Thread Shawn Heisey

On 12/13/2010 3:38 PM, Jonathan Rochkind wrote:
But if the problem really is just GC issues and not actually too much 
RAM being used, try this JVM setting:


-XX:+UseConcMarkSweepGC


That's what I use on my shards; I've never had any visible problems with
memory or garbage collection delays.  I have not done any kind of 
profiling, though.


The servers (CentOS Xen VMs) have 9GB of total RAM and serve indexes 
that are nearing 15GB in size and have over 8 million documents.  
Important parts of my java commandline:


-Xms512M -Xmx2048M -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode

java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)

Shawn



RE: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-13 Thread Jonathan Rochkind
Wow, really, it's that easy?  I could swear there's a wiki page somewhere that 
suggests otherwise, but I believe Yonik today over a wiki page last edited 
wherever.  

But this should be well-publicized; it's a pretty easy solution that will at
least give you "as up to date as your Solr can handle" for a problem that many
people seem to be having.  I would suggest a maxWarmingSearchers=1 example
should at least be included, commented out, in the example solrconfig.xml, if not
even included live.

(This would be even better if, on a commit failing due to maxWarmingSearchers, 
Solr would automatically commit them when the warming is complete -- instead of 
relying on another commit manually being made at some future point.  Is there 
any built-in hook for 'warming complete' or 'index fully ready' that could be 
used to jury-rig this?)

Yonik, how will maxWarmingSearchers in this scenario affect replication?  If a
slave is pulling down new indexes so quickly that the warming searchers would
ordinarily pile up, but maxWarmingSearchers is set to 1, what happens?


From: ysee...@gmail.com [ysee...@gmail.com] On Behalf Of Yonik Seeley 
[yo...@lucidimagination.com]
Sent: Monday, December 13, 2010 9:07 PM
To: solr-user@lucene.apache.org
Subject: Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap 
getting collected?

On Mon, Dec 13, 2010 at 8:47 PM, John Russell  wrote:
> Wow, you read my mind.  We are committing very frequently.  We are trying to
> get as close to realtime access to the stuff we put in as possible.  Our
> current commit time is... ahem every 4 seconds.
>
> Is that insane?

Not necessarily insane, but challenging ;-)
I'd start by setting maxWarmingSearchers to 1 in solrconfig.xml.  When
that is exceeded, a commit will fail (this just means a new searcher
won't be opened on that commit... the docs will be visible with the
next commit that does succeed.)

-Yonik
http://www.lucidimagination.com


Userdefined Field type - Faceting

2010-12-13 Thread Viswa S

Hello,

We implemented an IP-Addr field type which internally stores the IPs as a hex-encoded
string (e.g. "192.2.103.29" is stored as "c002671d"). My "toExternal" and
"toInternal" methods for the conversion seem to be working well for
query results; however, when faceting on this field it returns the raw
strings. In other words the query response would have "192.2.103.29", but a facet
on the field would return <int name="c002671d">1</int>.

Why are these methods not used by the faceting component to convert the 
resulting values?

Thanks
Viswa
  

SpatialTierQueryParserPlugin Loading Error

2010-12-13 Thread Adam Estrada
All,

Can anyone shed some light on this error. I can't seem to get this
class to load. I am using the distribution of Solr from Lucid
Imagination and the Spatial Plugin from here
https://issues.apache.org/jira/browse/SOLR-773. I don't know how to
apply a patch but the jar file is in there. What else can I do?

org.apache.solr.common.SolrException: Error loading class
'org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin'
at 
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:373)
at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413)
at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:435)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1498)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1492)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1525)
at org.apache.solr.core.SolrCore.initQParsers(SolrCore.java:1442)
at org.apache.solr.core.SolrCore.(SolrCore.java:548)
at 
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
at 
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at 
org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
at 
org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
at 
org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
at 
org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at 
org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
at org.mortbay.jetty.Server.doStart(Server.java:210)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.mortbay.start.Main.invokeMain(Main.java:183)
at org.mortbay.start.Main.start(Main.java:497)
at org.mortbay.start.Main.main(Main.java:115)
Caused by: java.lang.ClassNotFoundException:
org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Unknown Source)
at 
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:357)
... 33 more


Re: SpatialTierQueryParserPlugin Loading Error

2010-12-13 Thread Erick Erickson
This page shows you how to apply a patch:
http://wiki.apache.org/solr/HowToContribute
However, are you aware that
this is a patch to the *source* code and then you have
to compile it? A simpler approach would be to just grab either the trunk
build or a very recent
3.x build. See: https://hudson.apache.org/hudson/

You'll find both solr trunk and solr 3x. note that Hudson (and NOT
specifically Solr/Lucene) is
suffering some bogus "failures" so don't be alarmed by that status.

Follow the "build artifacts" link, then on down to "dist" and you'll find a
full package awaiting
installation with all the geospatial stuff already built...

Best
Erick

On Mon, Dec 13, 2010 at 10:06 PM, Adam Estrada <
estrada.adam.gro...@gmail.com> wrote:

> All,
>
> Can anyone shed some light on this error. I can't seem to get this
> class to load. I am using the distribution of Solr from Lucid
> Imagination and the Spatial Plugin from here
> https://issues.apache.org/jira/browse/SOLR-773. I don't know how to
> apply a patch but the jar file is in there. What else can I do?
>
> org.apache.solr.common.SolrException: Error loading class
> 'org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin'
>at
> org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:373)
>at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413)
>at
> org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:435)
>at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1498)
>at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1492)
>at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1525)
>at org.apache.solr.core.SolrCore.initQParsers(SolrCore.java:1442)
>at org.apache.solr.core.SolrCore.(SolrCore.java:548)
>at
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
>at
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
>at
> org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
>at
> org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594)
>at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
>at
> org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
>at
> org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
>at
> org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
>at
> org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
>at
> org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
>at
> org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
>at
> org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
>at org.mortbay.jetty.Server.doStart(Server.java:210)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
>at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>at java.lang.reflect.Method.invoke(Unknown Source)
>at org.mortbay.start.Main.invokeMain(Main.java:183)
>at org.mortbay.start.Main.start(Main.java:497)
>at org.mortbay.start.Main.main(Main.java:115)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin
>at java.net.URLClassLoader$1.run(Unknown Source)
>at java.security.AccessController.doPrivileged(Native Method)
>at java.net.URLClassLoader.findClass(Unknown Source)
>at java.lang.ClassLoader.loadClass(Unknown Source)
>at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
>at java.lang.ClassLoader.loadClass(Unknown Source)
>at java.lang.Class.forName0(Native Method)
>at java.lang.Class.forName(Unknown Source)
>at
> org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:357)
>... 33 more
>


Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-13 Thread Yonik Seeley
On Mon, Dec 13, 2010 at 9:27 PM, Jonathan Rochkind  wrote:
> Yonik, how will maxWarmingSearchers in this scenario effect replication?  If 
> a slave is pulling down new indexes so quickly that the warming searchers 
> would ordinarily pile up, but maxWarmingSearchers is set to 1 what 
> happens?

Like any other commits, this will limit the number of searchers
warming in the background to 1.  If a commit is called, and that tries
to open a new searcher while another is already warming, it will fail.
 The next commit that does succeed will have all the updates though.

Today, this maxWarmingSearchers check is done after the writer has
closed and before a new searcher is opened... so calling commit too
often won't affect searching, but it will currently affect indexing
speed (since the IndexWriter is constantly being closed/flushed).

-Yonik
http://www.lucidimagination.com


Re: Userdefined Field type - Faceting

2010-12-13 Thread Yonik Seeley
Perhaps try overriding indexedToReadable() also?
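
A minimal sketch of that, with the package and class names assumed rather than
taken from your actual code:

package com.example.schema;

import org.apache.solr.schema.StrField;

public class IPAddrField extends StrField {
  // Faceting works off the raw indexed terms, so the field type has to map
  // them back to the external form itself via indexedToReadable().
  @Override
  public String indexedToReadable(String indexedForm) {
    long v = Long.parseLong(indexedForm, 16);   // e.g. "c002671d"
    return ((v >>> 24) & 0xff) + "." + ((v >>> 16) & 0xff) + "."
        + ((v >>> 8) & 0xff) + "." + (v & 0xff);
  }
}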

-Yonik
http://www.lucidimagination.com

On Mon, Dec 13, 2010 at 10:00 PM, Viswa S  wrote:
>
> Hello,
>
> We implemented an IP-Addr field type which internally stored the ips as 
> hex-ed string (e.g. "192.2.103.29" will be stored as "c002671d"). My 
> "toExternal" and "toInternal" methods for appropriate conversion seems to be 
> working well for query results, but however when faceting on this field it 
> returns the raw strings. in other words the query response would have 
> "192.2.103.29", but facet on the field would return " name="c002671d">1"
>
> Why are these methods not used by the faceting component to convert the 
> resulting values?
>
> Thanks
> Viswa
>


RE: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-13 Thread Jonathan Rochkind
Sorry, I guess I don't understand the details of replication enough. 

So a slave tries to replicate. It pulls down the new index files. It tries to do 
a commit but fails.  But "the next commit that does succeed will have all the 
updates." Since it's a slave, it doesn't get any commits of its own. But then 
some amount of time later, it does another replication pull. There are at this 
time maybe no _new_ changes since the last failed replication pull. Does this 
trigger a commit that will get those previous changes actually added to the 
slave?

In the meantime, between commits.. are those potentially large pulled new index 
files sitting around somewhere but not replacing the old slave index files, 
doubling disk space for those files?

Thanks for any clarification. 

Jonathan

From: ysee...@gmail.com [ysee...@gmail.com] On Behalf Of Yonik Seeley 
[yo...@lucidimagination.com]
Sent: Monday, December 13, 2010 10:41 PM
To: solr-user@lucene.apache.org
Subject: Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap 
getting collected?

On Mon, Dec 13, 2010 at 9:27 PM, Jonathan Rochkind  wrote:
> Yonik, how will maxWarmingSearchers in this scenario effect replication?  If 
> a slave is pulling down new indexes so quickly that the warming searchers 
> would ordinarily pile up, but maxWarmingSearchers is set to 1 what 
> happens?

Like any other commits, this will limit the number of searchers
warming in the background to 1.  If a commit is called, and that tries
to open a new searcher while another is already warming, it will fail.
 The next commit that does succeed will have all the updates though.

Today, this maxWarmingSearchers check is done after the writer has
closed and before a new searcher is opened... so calling commit too
often won't affect searching, but it will currently affect indexing
speed (since the IndexWriter is constantly being closed/flushed).

-Yonik
http://www.lucidimagination.com


Re: Very high load

2010-12-13 Thread Mark
Changing the subject: it's not related to replication after all. It only 
appeared after indexing an extra field which increased our index size 
from 12g to 20g+.


On 12/13/10 7:57 AM, Mark wrote:

Markus,

My configuration is as follows...






...
false
2
...
false
64
10
false
true

No cache warming queries and our machines have 8g of memory in them 
with about 5120m of RAM dedicated to Solr. When our index is around 
10-11g in size everything runs smoothly. At around 20g+ it just falls 
apart.


Can you (or anyone) provide some suggestions? Thanks


On 12/12/10 1:11 PM, Markus Jelsma wrote:
There can be numerous explanations such as your configuration (cache 
warm
queries, merge factor, replication events etc) but also I/O having 
trouble
flushing everything to disk. It could also be a memory problem, the 
OS might
start swapping if you allocate too much RAM to the JVM leaving little 
for the

OS to work with.

You need to provide more details.


After replicating an index of around 20g my slaves experience very high
load (50+!!)

Is there anything I can do to alleviate this problem?  Would solr cloud
be of any help?

thanks


Solr Memory Usage

2010-12-13 Thread Cameron Hurst
Hello all,

I am a new user to Solr and am currently in a testing phase before I try
and take my server into production. For my system I have a tomcat6
servlet running solr 1.4.1. Everything is running currently on my local
computer and it is parsing data from a local dump of the production
MySQL server. Things seem stable initially and I am able to query
everything and have not experienced any sort of errors. The problem has
to do with how much RAM the server uses compared to my expectations.

On initial start up the tomcat6 server, 90MB of RAM is used. This seems
normal compared to what I am expecting and what google searches give me.
From there, in my solrconfig.xml settings I have the maximum RAM buffer set
to be 32 MB. Because of this, I am expecting that on top of the 90MB used at
startup, indexing and other running operations will load an additional 32MB of
information into RAM. Along with HTTP request and other buffers, I am assuming
that the total amount of RAM usage for the servlet and Solr should be about
150MB with these settings.
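
In solrconfig.xml that is presumably the setting below:

<ramBufferSizeMB>32</ramBufferSizeMB>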

As I start to index data and pass queries to the database I notice a
steady rise in the RAM, but it doesn't stop at 150MB. If I continue to
reindex the exact same data set with no additional data entries, the RAM
continuously increases. I stopped looking as the RAM increased beyond
350MB, started to try to debug it, and can't find anything obvious
from my beginner's viewpoint. I ran a memory leak check from the
web manager and that came up with no leaks.

Are my expectations here unreasonable? Am I completely wrong with my
assumptions that I should only have about 150MB and that 350 is
perfectly fine? I am just trying to sort this out because on my
production server I have a limited amount of RAM and I need to minimize
this as much as possible.

Thanks,

Cameron


RAM usage issues

2010-12-13 Thread Cameron Hurst
hello all,

I am a new user to Solr and I am having a few issues with the setup and
wondering if anyone had some suggestions. I am currently running this as
just a test environment before I go into production. I am using a
tomcat6 environment for my servlet and solr 1.4.1 as the solr build. I
set it up following the guide here:
http://wiki.apache.org/solr/SolrTomcat
The issue that I am having is
that the memory usage seems high for the settings I have.

When I start the server I am using about 90MB of RAM, which is fine and
from the Google searches I found that is normal. The issue comes when I
start indexing data. In my solrconfig.xml file my maximum RAM buffer
is 32MB. In my mind that means that the maximum RAM being used by the
servlet should be 122MB, but increasing to 150MB isn't out of my reach.
When I start indexing data and calling searches my memory usages slowly
keeps on increasing. The odd thing about it is that when I reindex the
exact same data set the memory usage increases every time but no new
data has been entered to be indexed. I stopped watching as the usage went over
350MB of RAM.

So my question in all of this is whether this is normal and why the RAM
buffer setting isn't being observed. Are my expectations unreasonable and
flawed? Or could there be something else in my settings that is causing
the memory usage to increase like this.

Thanks for the help,

Cameron


RE: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-13 Thread Upayavira
The second commit will bring in all changes, from both syncs. 

Think of the sync part as a glorified rsync of files on disk. So the
files will have been copied to disk, but the in memory index on the
slave will not have noticed that those files have changed. The commit is
intended to remedy that - it causes a new index reader to be created,
based upon the new on disk files, which will include updates from both
syncs.

Upayavira

On Mon, 13 Dec 2010 23:11 -0500, "Jonathan Rochkind" 
wrote:
> Sorry, I guess I don't understand the details of replication enough. 
> 
> So slave tries to replicate. It pulls down the new index files. It tries
> to do a commit but fails.  But "the next commit that does succeed will
> have all the updates." Since it's a slave, it doesn't get any commits of
> it's own. But then some amount of time later, it does another replication
> pull. There are at this time maybe no _new_ changes since the last failed
> replication pull. Does this trigger a commit that will get those previous
> changes actually added to the slave?
> 
> In the meantime, between commits.. are those potentially large pulled new
> index files sitting around somewhere but not replacing the old slave
> index files, doubling disk space for those files?
> 
> Thanks for any clarification. 
> 
> Jonathan
> 
> From: ysee...@gmail.com [ysee...@gmail.com] On Behalf Of Yonik Seeley
> [yo...@lucidimagination.com]
> Sent: Monday, December 13, 2010 10:41 PM
> To: solr-user@lucene.apache.org
> Subject: Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't
> WeakHashMap getting collected?
> 
> On Mon, Dec 13, 2010 at 9:27 PM, Jonathan Rochkind 
> wrote:
> > Yonik, how will maxWarmingSearchers in this scenario effect replication?  
> > If a slave is pulling down new indexes so quickly that the warming 
> > searchers would ordinarily pile up, but maxWarmingSearchers is set to 1 
> > what happens?
> 
> Like any other commits, this will limit the number of searchers
> warming in the background to 1.  If a commit is called, and that tries
> to open a new searcher while another is already warming, it will fail.
>  The next commit that does succeed will have all the updates though.
> 
> Today, this maxWarmingSearchers check is done after the writer has
> closed and before a new searcher is opened... so calling commit too
> often won't affect searching, but it will currently affect indexing
> speed (since the IndexWriter is constantly being closed/flushed).
> 
> -Yonik
> http://www.lucidimagination.com
>