Re: Rebuild Spellchecker based on cron expression

2010-12-13 Thread Peter Karich



Building on optimize is not possible as index optimization is done on
the master and the slaves don't even run an optimize but only fetch
the optimized index.


isn't the spellcheck index replicated to the slaves too?

--
http://jetwick.com open twitter search



gotchas, issues with document deletions/replacements/edits

2010-12-13 Thread Dennis Gearon
I am about to set up live editing of database contents that get indexed in a
Solr instance.

I seem to remember that edits in the index are actually deletes and 
replacements?

The deleted items don't really disappear, right? How do they affect queries?

Counts?
Return results?
?

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



Re: Solr on Google App Engine

2010-12-13 Thread Praveen Agrawal
Thanks a lot, Mauricio.

Does anyone have any experience with Amazon EC2, or can you point me to existing
discussions?

Appreciate your help.
Thanks.
Praveen

On Thu, Dec 9, 2010 at 6:20 PM, Mauricio Scheffer <
mauricioschef...@gmail.com> wrote:

> Solr on GAE has been discussed a couple of times, see these threads:
>
> http://www.mail-archive.com/java-user@lucene.apache.org/msg26010.html
> 
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg24473.html
> 
>
> http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3co2w952e01251005032245r79d6bfd6zbe08ece212c82...@mail.gmail.com%3e
> <
> http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3co2w952e01251005032245r79d6bfd6zbe08ece212c82...@mail.gmail.com%3e
> >
>
> --
> Mauricio
>
>
>
> On Thu, Dec 9, 2010 at 9:07 AM, Praveen Agrawal  wrote:
>
> > Hi,
> > I was wondering if Solr can be deployed/run on Google App Engine. GAE has
> > some restrictions, notably no local file write access is allowed, instead
> > applications must use JDO/JPA etc.
> >
> > I believe Solr can be deployed/run on Amazon EC2.
> >
> > Has anyone tried Solr on these two hosts?
> >
> > Thanks.
> > Praveen
> >
>


When all cores are ready to be used?

2010-12-13 Thread De Stefano, Giovanni, VF-Group
Hello all,
 
I have a component that uses SolrServer(s) in a multicore environment.
 
I would like to cache these solr servers in a map (core name, server).
 
When is the right time to create this map? 
 
I tried with a custom ContextListener but it seems that the cores are
not ready yet: I have to reinitialize the cores but this doesn't sound
right.
 
Basically I would like to have a listener of an event "Solr created
everything it needs, including cores, etc".
 
How can I do this?
 
Thanks,
Giovanni
 
 


RE: Solr on Google App Engine

2010-12-13 Thread Dave Searle
EC2 installations are just Windows/Linux machines, so this would just be a
normal setup. I have a Solr server running on a small instance with 1.7GB RAM,
mounted to an EBS volume of 50GB; it seems to run fine. Costs about $115 a month.

-Original Message-
From: Praveen Agrawal [mailto:pkal...@gmail.com] 
Sent: 13 December 2010 09:20
To: solr-user@lucene.apache.org
Subject: Re: Solr on Google App Engine

Thanks a lot, Mauricio.

Does anyone has any experience on Amazon EC2, or can point me to existing
discussions?

Appreciate your help.
Thanks.
Praveen

On Thu, Dec 9, 2010 at 6:20 PM, Mauricio Scheffer <
mauricioschef...@gmail.com> wrote:

> Solr on GAE has been discussed a couple of times, see these threads:
>
> http://www.mail-archive.com/java-user@lucene.apache.org/msg26010.html
> 
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg24473.html
> 
>
> http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3co2w952e01251005032245r79d6bfd6zbe08ece212c82...@mail.gmail.com%3e
> <
> http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3co2w952e01251005032245r79d6bfd6zbe08ece212c82...@mail.gmail.com%3e
> >
>
> --
> Mauricio
>
>
>
> On Thu, Dec 9, 2010 at 9:07 AM, Praveen Agrawal  wrote:
>
> > Hi,
> > I was wondering if Solr can be deployed/run on Google App Engine. GAE has
> > some restrictions, notably no local file write access is allowed, instead
> > applications must use JDO/JPA etc.
> >
> > I believe Solr can be deployed/run on Amazon EC2.
> >
> > Has anyone tried Solr on these two hosts?
> >
> > Thanks.
> > Praveen
> >
>


Re: Highlighting for non-stored fields

2010-12-13 Thread Alessandro Benedetti
We developed a custom Highlighter to solve this issue.
We added a "url" field to the Solr schema doc for our domain, and when
highlighting is called, we access the file, extract the information and send
it to the custom highlighter.

If you still need some help, I can provide our solution in detail!
Cheers

2010/10/26 Phong Dais 

> Hi,
>
> I've been looking thru the mailing archive for the past week and I haven't
> found any useful info regarding this issue.
>
> My requirement is to index a few terabytes worth of data to be searched.
> Due to the size of the data, I would like to index without storing but I
> would like to use the highlighting feature.  Is this even possible?  What
> are my options?
>
> I've read about termOffsets, payload that could possibly be used to do this
> but I have no idea how this could be done.
>
> Any pointers greatly appreciated.  Someone please point me in the right
> direction.
>
>  I don't mind having to write some code or digging thru existing code to
> accomplish this task.
>
> Thanks,
> P.
>



-- 
--

Benedetti Alessandro
Personal Page: http://tigerbolt.altervista.org

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: Rebuild Spellchecker based on cron expression

2010-12-13 Thread Erick Erickson
***
Just wondering what's the reason that this patch receives that little
interest. Anything wrong with it?
***

Nobody got behind it and pushed, I suspect. And since it's been a long time
since it was updated, there's no guarantee that it would apply cleanly any
more, or that it will perform as intended.

So, if you're really interested, I'd suggest you ping the dev list and ask
whether this is valuable or if it's been superseded. If the feedback is that
this
would be valuable, you can see what you can do to make it happen.

Once it's working to your satisfaction and you've submitted a patch, let
people
know it's ready and ask them to commit it or critique it. You might have to
remind
the committers after a few days that it's ready and get it applied to trunk
and/or 3.x.

But I really wouldn't start working with it until I got some feedback from
the people who are actively working on Solr about whether it's been superseded
by other functionality first; sometimes bugs just aren't closed when something
else makes it obsolete.

Here's a place to start: http://wiki.apache.org/solr/HowToContribute

Best
Erick

On Mon, Dec 13, 2010 at 2:58 AM, Martin Grotzke <
martin.grot...@googlemail.com> wrote:

> Hi,
>
> when thinking further about it it's clear that
>  https://issues.apache.org/jira/browse/SOLR-433
> would be even better - we could generate the spellchecker indices on
> commit/optimize on the master and replicate them to all slaves.
>
> Just wondering what's the reason that this patch receives that little
> interest. Anything wrong with it?
>
> Cheers,
> Martin
>
>
> On Mon, Dec 13, 2010 at 2:04 AM, Martin Grotzke
>  wrote:
> > Hi,
> >
> > the spellchecker component already provides a buildOnCommit and
> > buildOnOptimize option.
> >
> > Since we have several spellchecker indices building on each commit is
> > not really what we want to do.
> > Building on optimize is not possible as index optimization is done on
> > the master and the slaves don't even run an optimize but only fetch
> > the optimized index.
> >
> > Therefore I'm thinking about an extension of the spellchecker that
> > allows you to rebuild the spellchecker based on a cron-expression
> > (e.g. rebuild each night at 1 am).
> >
> > What do you think about this, is there anybody else interested in this?
> >
> > Regarding the lifecycle, is there already some executor "framework" or
> > any regularly running process in place, or would I have to pull up my
> > own thread? If so, how can I stop my thread when solr/tomcat is
> > shutdown (I couldn't see any shutdown or destroy method in
> > SearchComponent)?
> >
> > Thanx for your feedback,
> > cheers,
> > Martin
> >
>
>
>
> --
> Martin Grotzke
> http://www.javakaffee.de/blog/
>


Re: gotchas, issues with document deletions/replacements/edits

2010-12-13 Thread Erick Erickson
You're right, updates are really deletes/adds. Deleted documents are NOT
found in future queries, so that's not a problem.

However, the terms in a deleted document still affect the relevance
calculations, but in most cases you'll never notice this. By that I mean
that the term frequency counts are still influenced by the terms from the
deleted documents, etc.

Even this abstruse effect is removed upon the first optimize after a delete.
That's when the document, terms, etc. are removed from the index files.

Best
Erick
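
For reference, that optimize can be triggered with a plain HTTP request to the
update handler; a minimal sketch assuming the default example Solr URL:

    curl 'http://localhost:8983/solr/update' -H 'Content-Type: text/xml' --data-binary '<optimize/>'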

On Mon, Dec 13, 2010 at 4:16 AM, Dennis Gearon wrote:

> I am about to set  up a live edit of database contents that get indexed in
> a
> Solr Instance.
>
> I seem to remember that edits in the index are actually deletes and
> replacements?
>
> The deleted items don't really disappear, right? What about queries do they
> affect?
>
> Counts?
> Return results?
> ?
>
>  Dennis Gearon
>
>
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a
> better
> idea to learn from others’ mistakes, so you do not have to make them
> yourself.
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>
>
> EARTH has a Right To Life,
> otherwise we all die.
>
>


Re: Rebuild Spellchecker based on cron expression

2010-12-13 Thread Martin Grotzke
On Mon, Dec 13, 2010 at 4:01 AM, Erick Erickson  wrote:
> I'm shooting in the dark here, but according to this:
> http://wiki.apache.org/solr/SolrReplication
> after the slave pulls the index
> down, it issues a commit. So if your
> slave is configured to generate the dictionary on commit, will it
> "just happen"?

Our slaves' spellcheckers are not configured to buildOnCommit,
so it shouldn't just happen.
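
For context, that flag lives on the spellchecker definition in solrconfig.xml;
a minimal sketch (the spellchecker and field names here are just placeholders):

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <lst name="spellchecker">
        <str name="name">default</str>
        <str name="field">spell</str>
        <!-- set to true to rebuild the spellcheck index on every commit -->
        <str name="buildOnCommit">false</str>
      </lst>
    </searchComponent>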

>
> But according to this: https://issues.apache.org/jira/browse/SOLR-866
> this is an open issue

Thanx for the pointer! SOLR-866 is even better suited for us - after
reading SOLR-433 again I realized that it targets script-based
replication (which we're going to leave behind).

Cheers,
Martin


>
> Best
> Erick
>
> On Sun, Dec 12, 2010 at 8:30 PM, Martin Grotzke <
> martin.grot...@googlemail.com> wrote:
>
>> On Mon, Dec 13, 2010 at 2:12 AM, Markus Jelsma
>>  wrote:
>> > Maybe you've overlooked the build parameter?
>> > http://wiki.apache.org/solr/SpellCheckComponent#spellcheck.build
>> I'm aware of this, but we don't want to maintain cron-jobs on all
>> slaves for all spellcheckers for all cores.
>> That's why I'm thinking about a more integrated solution. Or did I
>> really overlook s.th.?
>>
>> Cheers,
>> Martin
>>
>>
>> >
>> >> Hi,
>> >>
>> >> the spellchecker component already provides a buildOnCommit and
>> >> buildOnOptimize option.
>> >>
>> >> Since we have several spellchecker indices building on each commit is
>> >> not really what we want to do.
>> >> Building on optimize is not possible as index optimization is done on
>> >> the master and the slaves don't even run an optimize but only fetch
>> >> the optimized index.
>> >>
>> >> Therefore I'm thinking about an extension of the spellchecker that
>> >> allows you to rebuild the spellchecker based on a cron-expression
>> >> (e.g. rebuild each night at 1 am).
>> >>
>> >> What do you think about this, is there anybody else interested in this?
>> >>
>> >> Regarding the lifecycle, is there already some executor "framework" or
>> >> any regularly running process in place, or would I have to pull up my
>> >> own thread? If so, how can I stop my thread when solr/tomcat is
>> >> shutdown (I couldn't see any shutdown or destroy method in
>> >> SearchComponent)?
>> >>
>> >> Thanx for your feedback,
>> >> cheers,
>> >> Martin
>> >
>>
>>
>>
>> --
>> Martin Grotzke
>> http://twitter.com/martin_grotzke
>>
>



-- 
Martin Grotzke
http://www.javakaffee.de/blog/


Re: Rebuild Spellchecker based on cron expression

2010-12-13 Thread Martin Grotzke
Hi Erick,

thanx for your advice! I'll check the options with our client and see
how we'll proceed. My spare time right now is already full with other
open source stuff, otherwise it'd be fun contributing s.th. to solr!
:-)

Cheers,
Martin


On Mon, Dec 13, 2010 at 2:46 PM, Erick Erickson  wrote:
> ***
> Just wondering what's the reason that this patch receives that little
> interest. Anything wrong with it?
> ***
>
> Nobody got behind it and pushed I suspect. And since it's been a long time
> since it was updated, there's no guarantee that it would apply cleanly any
> more.
> Or that it will perform as intended.
>
> So, if you're really interested, I'd suggest you ping the dev list and ask
> whether this is valuable or if it's been superseded. If the feedback is that
> this
> would be valuable, you can see what you can do to make it happen.
>
> Once it's working to your satisfaction and you've submitted a patch, let
> people
> know it's ready and ask them to commit it or critique it. You might have to
> remind
> the committers after a few days that it's ready and get it applied to trunk
> and/or 3.x.
>
> But I really wouldn't start working with it until I got some feedback from
> the
> people who are actively working on Solr whether it's been superseded by
> other functionality first, sometimes bugs just aren't closed when something
> else makes it obsolete.
>
> Here's a place to start: http://wiki.apache.org/solr/HowToContribute
>
> Best
> Erick
>
> On Mon, Dec 13, 2010 at 2:58 AM, Martin Grotzke <
> martin.grot...@googlemail.com> wrote:
>
>> Hi,
>>
>> when thinking further about it it's clear that
>>  https://issues.apache.org/jira/browse/SOLR-433
>> would be even better - we could generate the spellchecker indices on
>> commit/optimize on the master and replicate them to all slaves.
>>
>> Just wondering what's the reason that this patch receives that little
>> interest. Anything wrong with it?
>>
>> Cheers,
>> Martin
>>
>>
>> On Mon, Dec 13, 2010 at 2:04 AM, Martin Grotzke
>>  wrote:
>> > Hi,
>> >
>> > the spellchecker component already provides a buildOnCommit and
>> > buildOnOptimize option.
>> >
>> > Since we have several spellchecker indices building on each commit is
>> > not really what we want to do.
>> > Building on optimize is not possible as index optimization is done on
>> > the master and the slaves don't even run an optimize but only fetch
>> > the optimized index.
>> >
>> > Therefore I'm thinking about an extension of the spellchecker that
>> > allows you to rebuild the spellchecker based on a cron-expression
>> > (e.g. rebuild each night at 1 am).
>> >
>> > What do you think about this, is there anybody else interested in this?
>> >
>> > Regarding the lifecycle, is there already some executor "framework" or
>> > any regularly running process in place, or would I have to pull up my
>> > own thread? If so, how can I stop my thread when solr/tomcat is
>> > shutdown (I couldn't see any shutdown or destroy method in
>> > SearchComponent)?
>> >
>> > Thanx for your feedback,
>> > cheers,
>> > Martin
>> >
>>
>>
>>
>> --
>> Martin Grotzke
>> http://www.javakaffee.de/blog/
>>
>



-- 
Martin Grotzke
http://www.javakaffee.de/blog/


Separate Lines Like Google

2010-12-13 Thread Alejandro Delgadillo

Hi everybody,

I'm having some trouble trying to figure out how to separate lines in a
paragraph from a search result. I'm indexing PDFs, but when I search the
highlighted terms I cannot tell where the first line ends and the next one
begins.

Is there a way to put a [...] like Google, or a paragraph symbol?

I'll appreciate all the help I can get.

-- Alex.


Re: Solr on Google App Engine

2010-12-13 Thread Praveen Agrawal
Thanks Dave..

On Mon, Dec 13, 2010 at 4:06 PM, Dave Searle wrote:

> EC2 installations are just windows/linux machines, so this would just be a
> normal setup. I have a solr server running on a small instance with 1.7gb
> ram mounted to an EBS volume of 50gb, seems to run fine. Costs about $115 a
> month
>
> -Original Message-
> From: Praveen Agrawal [mailto:pkal...@gmail.com]
> Sent: 13 December 2010 09:20
> To: solr-user@lucene.apache.org
> Subject: Re: Solr on Google App Engine
>
> Thanks a lot, Mauricio.
>
> Does anyone has any experience on Amazon EC2, or can point me to existing
> discussions?
>
> Appreciate your help.
> Thanks.
> Praveen
>
> On Thu, Dec 9, 2010 at 6:20 PM, Mauricio Scheffer <
> mauricioschef...@gmail.com> wrote:
>
> > Solr on GAE has been discussed a couple of times, see these threads:
> >
> > http://www.mail-archive.com/java-user@lucene.apache.org/msg26010.html
> > 
> > http://www.mail-archive.com/solr-user@lucene.apache.org/msg24473.html
> > 
> >
> >
> http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3co2w952e01251005032245r79d6bfd6zbe08ece212c82...@mail.gmail.com%3e
> > <
> >
> http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3co2w952e01251005032245r79d6bfd6zbe08ece212c82...@mail.gmail.com%3e
> > >
> >
> > --
> > Mauricio
> >
> >
> >
> > On Thu, Dec 9, 2010 at 9:07 AM, Praveen Agrawal 
> wrote:
> >
> > > Hi,
> > > I was wondering if Solr can be deployed/run on Google App Engine. GAE
> has
> > > some restrictions, notably no local file write access is allowed,
> instead
> > > applications must use JDO/JPA etc.
> > >
> > > I believe Solr can be deployed/run on Amazon EC2.
> > >
> > > Has anyone tried Solr on these two hosts?
> > >
> > > Thanks.
> > > Praveen
> > >
> >
>


Re: [pubDate] is not converting correctly

2010-12-13 Thread Adam Estrada
+1  If I knew enough about how to do this in Java I would, but I do not.
What is the correct way to add or suggest enhancements to Solr
core?

Adam

On Sun, Dec 12, 2010 at 11:38 PM, Lance Norskog  wrote:

> Nice find!  This is Apache 2.0, copyright SUN.
>
> O Great Apache Elders: Is it kosher to add this to the Solr
> distribution? It's not in the JDK and is also com.sun.*
>
> On Sun, Dec 12, 2010 at 5:33 PM, Adam Estrada
>  wrote:
> > Thanks for the feedback! There are quite a few formats that can be used.
> I
> > am experiencing at least 5 of them. Would something like this work? Note
> > that there are 2 different formats separated by a comma.
> >
> >  > dateTimeFormat="EEE, dd MMM  HH:mm:ss zzz, -MM-dd'T'HH:mm:ss'Z'"
> />
> >
> > I don't suppose it will because there is already a comma in the first
> > parser. I guess I am reallly looking for an all purpose data time parser
> but
> > even if I have that, would I still be able to query *all* fields in the
> > index?
> >
> > Good article:
> >
> http://www.java2s.com/Open-Source/Java-Document/RSS-RDF/Rome/com/sun/syndication/io/impl/DateParser.java.htm
> >
> > Adam
> >
> > On Sun, Dec 12, 2010 at 7:31 PM, Koji Sekiguchi 
> wrote:
> >
> >> (10/12/13 8:49), Adam Estrada wrote:
> >>
> >>> All,
> >>>
> >>> I am having some difficu"lties parsing the pubDate field that is part
> of
> >>> the?
> >>> RSS spec (I believe). I get the warning that "states, "Dec 12, 2010
> >>> 6:45:26
> >>> PM org.apache.solr.handler.dataimport.DateFormatTransformer
> >>>  transformRow
> >>> WARNING: Could not parse a Date field
> >>> java.text.ParseException: Unparseable date: "Thu, 30 Jul 2009 14:41:43
> >>> +"
> >>> at java.text.DateFormat.parse(Unknown Source)"
> >>>
> >>> Does anyone know how to fix this? I would eventually like to do a date
> >>> query
> >>> but without the ability to properly parse them I don't know if it's
> going
> >>> to
> >>> work.
> >>>
> >>> Thanks,
> >>> Adam
> >>>
> >>
> >> Adam,
> >>
> >> How does your data-config.xml look like for that field?
> >> Have you looked at rss-data-config.xml file
> >> under example/example-DIH/solr/rss/conf directory?
> >>
> >> Koji
> >> --
> >> http://www.rondhuit.com/en/
> >>
> >
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Newbie: Indexing unrelated MySQL tables

2010-12-13 Thread Jaakko Rajaniemi

Hello,

Alright, let's describe the situation. I have a website and the website has a 
database with at least three tables.

- "users" table  - (id, firstname, lastname)
- "artwork" table  - (id, user, name, description)
- "jobs" table  - (id, company, position, location, description)

I want to implement a multi-purpose search field. I want the search field to treat rows 
from each of these tables as independent results. For example if I type 
"Paris", an example result set might look like a list of links like this:

- Paris Hilton (from the "users" table)
- A day in Paris (from the "artwork" table)
- My Paris (from the "artwork" table)
- Art teacher (from the "jobs" table if the location is Paris)

I tried to search for a solution to this on the Internet and found a somewhat
similar thread floating on this mailing list, but I just couldn't understand it,
and the Solr documentation has its main focus on syntax, not implementation. I
figured I would create three entities and relevant schema.xml entries in this
way:

dataimport.xml:




schema.xml:










This obviously does not work as I want. I only get results from the "users" table, and I
cannot get results from either "artwork" or "jobs". I have found out that the possible
solution is in putting <field> tags in the <entity> tag and somehow aliasing column names
for Solr, but the logic behind this is completely alien to me and the blind tests I tried
did not yield anything. My logic says that the "id" field is getting replaced by the "id"
field of other entities and indexes are being overwritten. But if I aliased all "id" fields
in all entities into something else, such as "user_id" and "job_id", I couldn't figure out
what to put in the <uniqueKey> configuration in schema.xml because I have three different
id fields from three different tables that are all primary keys in the database!

Obviously I'm not quite on track so some help would be greatly appreciated. 
Thanks!
- Jaakko


How to implement and a system based on IMAP auth

2010-12-13 Thread milomalo2...@libero.it
Hi Guys,

I am new to the Solr world and I was trying to figure out how to implement an
application which would be able to connect to our business mail server through an
IMAP connection (1000 users) and to index the related e-mail contents.

I tried to use DIH import with the preconfigured IMAP class provided in the
Solr example, but as far as I could see there is no way to fetch 1000 users and
retrieve information for them.

What would you suggest as a first step to follow?
Should I use SolrJ as a client in order to reach user content across an IMAP
connection?
Does anyone have experience with that?

thanks in advance





Re: Newbie: Indexing unrelated MySQL tables

2010-12-13 Thread Stefan Matheis
To avoid overwrites in your case, use a combined id - e.g. $table_$id - which
results in user_1, job_1 and so on ..


Re: Newbie: Indexing unrelated MySQL tables

2010-12-13 Thread Stefan Matheis
And yes, sorry for the short answer ..
http://wiki.apache.org/solr/DataImportHandler#TemplateTransformer would be
good for that :)
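
A rough sketch of what that combined id could look like in data-config.xml,
using the "users" table from the original question (the "user_" prefix and the
solr_id column name are illustrative):

    <entity name="users" transformer="TemplateTransformer"
            query="SELECT id, firstname, lastname FROM users">
      <!-- templated column mapped to the schema's uniqueKey field -->
      <field column="solr_id" template="user_${users.id}" name="id"/>
      <field column="firstname" name="firstname"/>
      <field column="lastname" name="lastname"/>
    </entity>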


Re: How to implement and a system based on IMAP auth

2010-12-13 Thread Erick Erickson
I don't see where the MailEntityProcessor really has anything built
into it for indexing somebody else's mail, so you're probably going
to need to go down the SolrJ route. SolrJ is actually quite easy to
use; there are only a very few classes you'll need, so I'd go there.
The "Usage" section here will get you started:
http://wiki.apache.org/solr/Solrj

Best
Erick

On Mon, Dec 13, 2010 at 9:32 AM, milomalo2...@libero.it <
milomalo2...@libero.it> wrote:

> Hi Guys,
>
> i am new in Solr world and i was trying to figure out how to implement an
> application which would be able to connect to our business mail server
> throug
> IMAP connection (1000 users) and to index the information related e-mail
> contents.
>
> I tried to use DH- import with the preconfigured imap class provided in the
> solr example but as i could see there is no way to fetch 1000 user and
> retrieve
> information for them
>
> What would you suggest as first step to follow ?
> should i use SOLRJ as client in order to reach user content across imap
> connection?
> Doesn anyone had experience with that ?
>
> thanks in advance
>
>
>
>


Re: Separate Lines Like Google

2010-12-13 Thread Koji Sekiguchi

(10/12/13 23:00), Alejandro Delgadillo wrote:


Hi everybody,

I'm having some troubles trying to figure out how to separate lines in a
paragraph from a search result, I'm indexing PDFs but when I search the
highlight terms I can not know when the first line ends and the next one
begins,

Is there a way to put a [...] like google or a Paragraph symbol?

I'll appreciate all the help I can get.

-- Alex.


Alex,

Use the hl.snippets=n parameter, where n is a number (2, 3, ...).
Then you'll get at most that number of snippets, and you can
append these snippets with "..." between them.
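
For example (the host and the field name "content" are placeholders):

    http://localhost:8983/solr/select?q=love&hl=true&hl.fl=content&hl.snippets=3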

Koji
--
http://www.rondhuit.com/en/


Re: Concurrent DIH calls

2010-12-13 Thread Juan Manuel Alvarez
Thanks for the answer Barani!
I was doing the same thing (queuing requests and querying solr
status), but I was hoping some flag/configuration would do the trick.
I will continue with that approach then! =o)

Thanks!
Juan M.
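
For anyone following along, that status poll is just a request to the DIH
handler; a sketch assuming the default handler path:

    curl 'http://localhost:8983/solr/dataimport?command=status'

The response carries a "status" value of "idle" or "busy", which is what the
batch program checks before submitting the next import.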

On Sat, Dec 11, 2010 at 3:50 AM, bbarani  wrote:
>
> Hi,
>
> As far as I know there is no queuing mechanism in SOLR for concurrent
> indexing request. It would simple ignore the concurrent request (first come
> first serve basis).. Solr experts, please correct me if I am wrong..
>
> To achieve concurrency,  we have implemented a queue using JMS and we send
> the data one by one for indexing (for performing push indexing / real time
> indexing)..
>
> We have also written a simple java program with SOLRj which will check if
> the status is idle or busy before it starts indexing next batch (This is for
> batch indexing program)..
>
> I would say the same thing applies for commit also.. As far as I know there
> is not inbuilt queuing system in SOLR for indexing.
>
> Thanks,
> Barani
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Concurrent-DIH-calls-tp2059517p2067937.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Solr replication, HAproxy and data management

2010-12-13 Thread Paolo Castagna

Paolo Castagna wrote:

Hi,
we are using Solr v1.4.x with multi-cores and a master/slaves 
configuration.

We also use HAProxy [1] to load balance search requests amongst slaves.
Finally, we use MapReduce to create new Solr indexes.

I'd like to share with you what I am doing when I need to:

 1. add a new index
 2. replace an existing index with an new/updated one
 3. add a slave
 4. remove a slave (or a slave died)

I am interested in knowing what are the best practices in these scenarios.


[...]


Does all this seems sensible to you?

Do you have best practices, suggestions to share?



Well, maybe these are two too broad questions...

I have a very specific one, related to all this.

Let's say I have a Solr master with multiple cores and I want to add a new
slave. Can I tell the slave to replicate all the indexes from the master?
How?
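
(The built-in replication in 1.4 is configured and polled per core, so a new
slave pulls each core's index through that core's own replication handler; a
manual pull looks roughly like this, with host and core name as placeholders:

    curl 'http://slave:8983/solr/website/replication?command=fetchindex'
)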

Any comment/advice regarding my original message are still more than
welcome.

Thank you,
Paolo


Re: Taxonomy and Faceting

2010-12-13 Thread webdev1977

Based on this:

VALID_ALCHEMYAPI_KEY 

  VALID_ALCHEMYAPI_KEY 

  VALID_ALCHEMYAPI_KEY 

  VALID_ALCHEMYAPI_KEY 

  VALID_ALCHEMYAPI_KEY 

  VALID_OPENCALAIS_KEY 


...this can't be used unless you use some sort of processing engine?  I am
playing around with some other open source tagging software, but I have yet
to get very far.


Re: Concurrent DIH calls

2010-12-13 Thread Stefan Matheis
I don't know if this is helpful .. but there is
http://wiki.apache.org/solr/DataImportHandler#EventListeners which would
trigger on 'onImportEnd'
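
A listener is registered on the <document> element of data-config.xml; a
minimal sketch (the class name is hypothetical, and the class must implement
org.apache.solr.handler.dataimport.EventListener):

    <document onImportEnd="com.example.ImportEndListener">
      ...
    </document>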


Re: How to implement and a system based on IMAP auth

2010-12-13 Thread Peter Sturge
imap has no intrinsic functionality for logging in as a user then
'impersonating' someone else.
What you can do is setup your email server so that your administrator
account or similar has access to other users via shared folders (this
is supported in imap2 servers - e.g. Exchange).
This is done all the time, for example if a manager wants his/her
secretary to have access to his/her mailbox.
Of course all access in this way needs to be in line with privacy policies etc.

When you connect as, say, 'admin', you can then see the shared folders
you have access to.
These folders are accessible via imap.
This is more of an imap thing, and isn't really related to DIH/Solr per se.

For Exchange servers, have a look at:
   http://www.petri.co.il/grant_full_mailbox_rights_on_exchange_2000_2003.htm
and
   http://www.ehow.com/how_5656820_share-exchange-mailboxes.html

HTH

Peter




On Mon, Dec 13, 2010 at 2:32 PM, milomalo2...@libero.it
 wrote:
> Hi Guys,
>
> i am new in Solr world and i was trying to figure out how to implement an
> application which would be able to connect to our business mail server throug
> IMAP connection (1000 users) and to index the information related e-mail
> contents.
>
> I tried to use DH- import with the preconfigured imap class provided in the
> solr example but as i could see there is no way to fetch 1000 user and 
> retrieve
> information for them
>
> What would you suggest as first step to follow ?
> should i use SOLRJ as client in order to reach user content across imap
> connection?
> Doesn anyone had experience with that ?
>
> thanks in advance
>
>
>
>


Re: Newbie: Indexing unrelated MySQL tables

2010-12-13 Thread Erick Erickson
Warning: I haven't tried this, but maybe it's relevant.
See: http://wiki.apache.org/solr/DataImportHandler
particularly the "multiple
datasources" section. I'm thinking here that you have to
define a different data source for each separate table you want to extract.

Stefan's comments about using a transformer to make unique ids per table
seem spot on for that part of the problem.

Best
Erick

On Mon, Dec 13, 2010 at 9:31 AM, Jaakko Rajaniemi
wrote:

> Hello,
>
> Alright, let's describe the situation. I have a website and the website has
> a database with at least three tables.
>
> - "users" table  - (id, firstname, lastname)
> - "artwork" table  - (id, user, name, description)
> - "jobs" table  - (id, company, position, location, description)
>
> I want to implement a multi-purpose search field. I want the search field
> to treat rows from each of these tables as independent results. For example
> if I type "Paris", an example result set might look like a list of links
> like this:
>
> - Paris Hilton (from the "users" table)
> - A day in Paris (from the "artwork" table)
> - My Paris (from the "artwork" table)
> - Art teacher (from the "jobs" table if the location is Paris)
>
> I tried to search for solution for this on the Internet and found a
> somewhat similar thread floating on this mailing list, but I just couldn't
> understand it and the Solr documentation has its main focus on syntax, not
> implementation. I figured I would create three entities and relevant
> schema.xml entries in this way:
>
> dataimport.xml:
> 
> 
> 
>
> schema.xml:
> 
> 
> 
> 
> 
> 
> 
> 
> 
>
> This obviously does not work as I want. I only get results from the "users"
> table, and I cannot get results from neither "artwork" nor "jobs". I have
> found out that the possible solution is in putting  tags in the
>  tag and somehow aliasing column names for Solr, but the logic
> behind this is completely alien to me and the blind tests I tried did not
> yield anything. My logic says that the "id" field is getting replaced by the
> "id" field of other entities and indexes are being overwritten. But if I
> aliased all "id" fields in all entities into something else, such as
> "user_id" and "job_id", I couldn't figure what to put in the 
> configuration in schema.xml because I have three different id fields from
> three different tables that are all primary keyed in the database!
>
> Obviously I'm not quite on track so some help would be greatly appreciated.
> Thanks!
> - Jaakko
>


Re: Very high load after replicating

2010-12-13 Thread Mark

Markus,

My configuration is as follows...






...
false
2
...
false
64
10
false
true

No cache warming queries, and our machines have 8GB of memory in them with
about 5120MB of RAM dedicated to Solr. When our index is around 10-11GB
in size everything runs smoothly. At around 20GB+ it just falls apart.


Can you (or anyone) provide some suggestions? Thanks


On 12/12/10 1:11 PM, Markus Jelsma wrote:

There can be numerous explanations such as your configuration (cache warm
queries, merge factor, replication events etc) but also I/O having trouble
flushing everything to disk. It could also be a memory problem, the OS might
start swapping if you allocate too much RAM to the JVM leaving little for the
OS to work with.

You need to provide more details.


After replicating an index of around 20g my slaves experience very high
load (50+!!)

Is there anything I can do to alleviate this problem?  Would solr cloud
be of any help?

thanks


Re: Taxonomy and Faceting

2010-12-13 Thread Tommaso Teofili
With the SOLR-2129 patch you enable an Apache UIMA [1] pipeline to enrich
documents being indexed.
The base pipeline provided with the patch uses the following blocks (see
OverridingParamsExtServicesAE.xml):

AggregateSentenceAE

OpenCalaisAnnotator

TextKeywordExtractionAEDescriptor

TextLanguageDetectionAEDescriptor

TextCategorizationAEDescriptor

TextConceptTaggingAEDescriptor

TextRankedEntityExtractionAEDescriptor
This enables tokenizing, adding part of speech to tokens and extracting sentences
with WhitespaceTokenizer and HMMTagger, then inserting named entities and
language extracted with OpenCalaisAnnotator and AlchemyAPIAnnotator.
The parameters you underlined are relevant only if you use
OpenCalaisAnnotator and AlchemyAPIAnnotator; as you may see, those are
runtime parameters, so depending on which Analysis Engine you're executing
you may or may not need such parameters, or need other ones.
However you can change the pipeline blocks to whatever you want,
provided that they are UIMA compliant, by specifying the relevant Analysis
Engine descriptor inside the tag:
   /org/apache/uima/desc/OverridingParamsExtServicesAE.xml
There are many other engines you can use and configure with SOLR-2129, see
[2] and [3].
I hope this clarifies things a little more.
Cheers,
Tommaso

[1] : http://uima.apache.org
[2] : http://uima.apache.org/sandbox.html
[3] : http://uima.apache.org/external-resources.html

2010/12/13 webdev1977 

>
> Based on this:
>
> VALID_ALCHEMYAPI_KEY
>
>  VALID_ALCHEMYAPI_KEY
>
>  VALID_ALCHEMYAPI_KEY
>
>  VALID_ALCHEMYAPI_KEY
>
>  VALID_ALCHEMYAPI_KEY
>
>  VALID_OPENCALAIS_KEY
>
>
> ...this can't be used unless you use some sort of processing engine?  I am
> playing around with some other open source tagging software, but I have yet
> to get very far.
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Taxonomy-and-Faceting-tp2028442p2079148.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Indexing pdf files - question.

2010-12-13 Thread Siebor, Wlodek [USA]
Hi,
Can somebody please send me a command for indexing, with ExtractingRequestHandler,
the sample PDF file available in the /docs directory. I have LucidWorks Solr
installed on Linux, with standard schema.xml and solrconfig.xml files (unchanged).
I want to pass the name of the file as the unique id.
I'm trying various curl commands and so far I get either "... missing required
field: id" or ".. missing content stream" errors.
Thanks for your help,
Wlodek


Strange replication problem

2010-12-13 Thread Ralf Mattes
Hello list,

I'm trying to set up a replicating solr system (one master, one slave) here. 
Everything _looks_ o.k. but replication fails. A little debugging shows the 
following:

 r...@slave:~# curl 
'http://master:8180/solr/website/replication?command=indexversion&wt=json' && 
echo ''
{"responseHeader":{"status":0,"QTime":0},"indexversion":0,"generation":0}

 r...@slave:~# curl 
'http://master:8180/solr/website/replication?command=details&wt=json' && echo ''
{"responseHeader":{"status":0,"QTime":1},"details":{"indexSize":"6.76 
GB","indexPath":"/var/lib/solr/data/website/index","commits":[["indexVersion",1292192351652,"generation",5,"filelist",["_7e.fdx","_7e.tii","_7e.frq","_7e.prx","_7e.fdt","segments_5","_7e.fnm","_7e.nrm","_7e.tis"]]],"isMaster":"true","isSlave":"false","indexVersion":1292192351652,"generation":5},"WARNING":"This
 response format is experimental.  It is likely to change in the future."}

 r...@slave:~# 

Note that the indexversion returned by the indexversion command is 0, while the same
information from the details command is 1292192351652 ...
Any idea what's going on here?

 TIA Ralf Mattes
 



Re: full text search in multiple fields

2010-12-13 Thread PeterKerk

whoops :)
It was directed at iorixxx, in the first post before me


Re: Indexing pdf files - question.

2010-12-13 Thread Adam Estrada
Hi,

I use the following command to post PDF files.

$ curl "http://localhost:8983/solr/update/extract?stream.file=C:\temp\document.docx&stream.contentType=application/msword&literal.id=esc.doc&commit=true"
$ curl "http://localhost:8983/solr/update/extract?stream.file=C:\temp\features.pdf&stream.contentType=application/pdf&literal.id=esc2.doc&commit=true"
$ curl "http://localhost:8983/solr/update/extract?stream.file=C:\temp\Memo_ocrd.pdf&stream.contentType=application/pdf&literal.id=Memo_ocrd.pdf&defaultField=text&commit=true"

The PDFs have to be OCR'd.

Adam

On Mon, Dec 13, 2010 at 11:01 AM, Siebor, Wlodek [USA] <
siebor_wlo...@bah.com> wrote:

> HI,
> Can sombody, please, send me a command for indexing a sample pdf with
> ExtractngRequestHandler file available in the /docs directory. I have
> lucidworks solr installed on linux, with standard schema.xml and
> solrconfig.xml files (unchanged). I want to pass as the unique id the name
> of the file.
> I’m trying various curl commands and so far I have either  “… missing
> required field: id” or “.. missing content stream” errors.
> Thanks for your help,
> Wlodek
>


Re: How to get all the search results?

2010-12-13 Thread Solr User
Hi,

I tried *:* using dismax and I get no results.

Is there a way that I can get all the search results using dismax?

Thanks,
Murali

On Mon, Dec 6, 2010 at 11:17 AM, Savvas-Andreas Moysidis <
savvas.andreas.moysi...@googlemail.com> wrote:

> Hello,
>
> shouldn't that query syntax be *:* ?
>
> Regards,
> -- Savvas.
>
> On 6 December 2010 16:10, Solr User  wrote:
>
> > Hi,
> >
> > First off thanks to the group for guiding me to move from default search
> > handler to dismax.
> >
> > I have a question related to getting all the search results. In the past
> > with the default search handler I was getting all the search results
> (8000)
> > if I pass q=* as search string but with dismax I was getting only 16
> > results
> > instead of 8000 results.
> >
> > How to get all the search results using dismax? Do I need to configure
> > anything to make * (asterisk) work?
> >
> > Thanks,
> > Solr User
> >
>


Re: How to get all the search results?

2010-12-13 Thread Erick Erickson
Can we see the results with &debugQuery=on, as well as the entire HTTP
string you use?

Also, are you sure you've put documents in your index and committed
afterwards?

Best
Erick
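
For reference, the kind of full request being asked for, with host and handler
names as placeholders:

    http://localhost:8983/solr/select?qt=dismax&q=*:*&debugQuery=on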

On Mon, Dec 13, 2010 at 11:59 AM, Solr User  wrote:

> Hi,
>
> I tried *:* using dismax and I get no results.
>
> Is there a way that I can get all the search results using dismax?
>
> Thanks,
> Murali
>
> On Mon, Dec 6, 2010 at 11:17 AM, Savvas-Andreas Moysidis <
> savvas.andreas.moysi...@googlemail.com> wrote:
>
> > Hello,
> >
> > shouldn't that query syntax be *:* ?
> >
> > Regards,
> > -- Savvas.
> >
> > On 6 December 2010 16:10, Solr User  wrote:
> >
> > > Hi,
> > >
> > > First off thanks to the group for guiding me to move from default
> search
> > > handler to dismax.
> > >
> > > I have a question related to getting all the search results. In the
> past
> > > with the default search handler I was getting all the search results
> > (8000)
> > > if I pass q=* as search string but with dismax I was getting only 16
> > > results
> > > instead of 8000 results.
> > >
> > > How to get all the search results using dismax? Do I need to configure
> > > anything to make * (asterisk) work?
> > >
> > > Thanks,
> > > Solr User
> > >
> >
>


Re: Strange replication problem

2010-12-13 Thread Xin Li
" indexversion returned by the indexversion command is 0 while the
same information from the details command is 292192351652 ..."

This only happens to a Slave machine. For a Master machine,
indexversion returns the same number as details command.





On Mon, Dec 13, 2010 at 11:06 AM, Ralf Mattes  wrote:
> Hello list,
>
> I'm trying to set up a replicating solr system (one master, one slave) here.
> Everything _looks_ o.k. but replication fails. A little debugging shows the 
> following:
>
>  r...@slave:~# curl 
> 'http://master:8180/solr/website/replication?command=indexversion&wt=json' && 
> echo ''
> {"responseHeader":{"status":0,"QTime":0},"indexversion":0,"generation":0}
>
>  r...@slave:~# curl 
> 'http://master:8180/solr/website/replication?command=details&wt=json' && echo 
> ''
> {"responseHeader":{"status":0,"QTime":1},"details":{"indexSize":"6.76 
> GB","indexPath":"/var/lib/solr/data/website/index","commits":[["indexVersion",1292192351652,"generation",5,"filelist",["_7e.fdx","_7e.tii","_7e.frq","_7e.prx","_7e.fdt","segments_5","_7e.fnm","_7e.nrm","_7e.tis"]]],"isMaster":"true","isSlave":"false","indexVersion":1292192351652,"generation":5},"WARNING":"This
>  response format is experimental.  It is likely to change in the future."}
>
>  r...@slave:~#
>
> Note that indexversion returned by the indexversion command is 0 while the 
> same information from the details command is 292192351652 ...
> Any idea what's going on here?
>
>  TIA Ralf Mattes
>
>
>


Re: Strange replication problem

2010-12-13 Thread Ralf Mattes
On Mon, 13 Dec 2010 12:31:27 -0500, Xin Li wrote:

> " indexversion returned by the indexversion command is 0 while the same
> information from the details command is 292192351652 ..."
> 
> This only happens to a Slave machine. For a Master machine, indexversion
> returns the same number as details command.

??? What part of my posted example did you not read? :-)
Both requests were sent to the same machine (configured as master) - and I get
exactly the result described. So, no, for my setup (pretty much out-of-the-box
with minimal master/slave configuration) your statement is wrong :-/

 Thanks, RalfD
 

 

> 
> 
> 
> On Mon, Dec 13, 2010 at 11:06 AM, Ralf Mattes  wrote:
>> Hello list,
>>
>> I'm trying to set up a replicating solr system (one master, one slave)
>> here. Everything _looks_ o.k. but replication fails. A little debugging
>> shows the following:
>>
>>  r...@slave:~# curl
>>  'http://master:8180/solr/website/replication?command=indexversion&wt=json'
>>  && echo ''
>> {"responseHeader":{"status":0,"QTime":0},"indexversion":0,"generation":0}
>>
>>  r...@slave:~# curl
>>  'http://master:8180/solr/website/replication?command=details&wt=json'
>>  && echo ''
>> {"responseHeader":{"status":0,"QTime":1},"details":{"indexSize":"6.76
>> GB","indexPath":"/var/lib/solr/data/website/index","commits":[["indexVersion",1292192351652,"generation",5,"filelist",["_7e.fdx","_7e.tii","_7e.frq","_7e.prx","_7e.fdt","segments_5","_7e.fnm","_7e.nrm","_7e.tis"]]],"isMaster":"true","isSlave":"false","indexVersion":1292192351652,"generation":5},"WARNING":"This
>> response format is experimental.  It is likely to change in the
>> future."}
>>
>>  r...@slave:~#
>>
>> Note that indexversion returned by the indexversion command is 0 while
>> the same information from the details command is 292192351652 ... Any
>> idea what's going on here?
>>
>>  TIA Ralf Mattes
>>
>>
>>




Re: Indexing pdf files - question.

2010-12-13 Thread Wodek Siebor

The sample /docs/tutorial.pdf does not require OCR.


Re: Geospatial search w/polygon bounding box?

2010-12-13 Thread Erick Erickson
It really doesn't look like it. As part of some other work I'm doing I ran
across this: https://issues.apache.org/jira/browse/SOLR-2155
which seems to speak (on cursory glance) to the polygon-bounding-box
issue. But as you can see, it hasn't been committed. Want to push it forward?

Best
Erick

On Mon, Nov 8, 2010 at 11:57 AM, Jonathan Gill
wrote:

> Hi-
>
> Based on my limited but growing understanding of bounding box spatial
> filtering in Solr, I think I know the answer to my question, but I'm
> looking
> for confirmation.  Is there a way to specify a polygon bounding box for
> spatial searches?  If so, is there a limit to the number of points that
> define the polygon shape?
>
>
>


access to environment variables in solrconfig.xml and/or schema.xml?

2010-12-13 Thread Burton-West, Tom
I see variables used to access java system properties in solrconfig.xml and 
schema.xml:

http://wiki.apache.org/solr/SolrConfigXml#System_property_substitution
${solr.data.dir:}
or
${solr.abortOnConfigurationError:true}

Is there a way to access environment variables or does everything have to be 
stuffed into a java system property?

Tom Burton-West
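
For illustration, the usual route is to export the environment variable and
hand it to the JVM as a system property at startup; a sketch assuming the
example Jetty launcher:

    export SOLR_DATA_DIR=/var/lib/solr/data
    java -Dsolr.data.dir=$SOLR_DATA_DIR -jar start.jar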





Re: Strange replication problem

2010-12-13 Thread Xin Li
Did you double-check
http://machine:port/solr/website/admin/replication/ to see whether the
"master" is indeed a master?

On Mon, Dec 13, 2010 at 1:01 PM, Ralf Mattes  wrote:
> On Mon, 13 Dec 2010 12:31:27 -0500, Xin Li wrote:
>
>> " indexversion returned by the indexversion command is 0 while the same
>> information from the details command is 292192351652 ..."
>>
>> This only happens to a Slave machine. For a Master machine, indexversion
>> returns the same number as details command.
>
> ??? What part of my posted example did you not read? :-)
> Both requests where sent to the same machine (configured as master) - and I 
> get
> exactly the result described. So, no, for my setup (pretty much out-of-the-box
> with minimal master/slave configuration) your statement is wrong :-/
>
>  Thanks, RalfD
>
>
>
>
>>
>>
>>
>> On Mon, Dec 13, 2010 at 11:06 AM, Ralf Mattes  wrote:
>>> Hello list,
>>>
>>> I'm trying to set up a replicating solr system (one master, one slave)
>>> here. Everything _looks_ o.k. but replication fails. A little debugging
>>> shows the following:
>>>
>>>  r...@slave:~# curl
>>>  'http://master:8180/solr/website/replication?command=indexversion&wt=json'
>>>  && echo ''
>>> {"responseHeader":{"status":0,"QTime":0},"indexversion":0,"generation":0}
>>>
>>>  r...@slave:~# curl
>>>  'http://master:8180/solr/website/replication?command=details&wt=json'
>>>  && echo ''
>>> {"responseHeader":{"status":0,"QTime":1},"details":{"indexSize":"6.76
>>> GB","indexPath":"/var/lib/solr/data/website/index","commits":[["indexVersion",1292192351652,"generation",5,"filelist",["_7e.fdx","_7e.tii","_7e.frq","_7e.prx","_7e.fdt","segments_5","_7e.fnm","_7e.nrm","_7e.tis"]]],"isMaster":"true","isSlave":"false","indexVersion":1292192351652,"generation":5},"WARNING":"This
>>> response format is experimental.  It is likely to change in the
>>> future."}
>>>
>>>  r...@slave:~#
>>>
>>> Note that indexversion returned by the indexversion command is 0 while
>>> the same information from the details command is 292192351652 ... Any
>>> idea what's going on here?
>>>
>>>  TIA Ralf Mattes
>>>
>>>
>>>
>
>
>


Re: Is it possible to assign default value for a particular record when using multivalued field type?

2010-12-13 Thread bbarani

Hi,

Is there a template transformer which can act on each and every record of a
multivalued attribute?

The issue is that some of the records might have null data in the source, and I
want that data to be replaced with some default value.

Also, if the value is blank I just see an empty tag in the XML. Any idea how
to parse this tag using SolrNet?

The problem is that we are parsing the XML tags of two attributes (for example
objectid and objectname) and mapping them in the UI, something like a key-value
pair.

If either attribute (e.g. objectname) is null or blank, the
mapping order gets changed (between objectid and objectname).

I am not sure if the only solution is to handle this in the Solr DIH query.

Thanks,
Barani


Re: Separate Lines Like Google

2010-12-13 Thread Alejandro Delgadillo
Koji,

Thank you for helping me with my questions, but I still don't get how
it's done. Let's say I search for the term "love" and I get something like
this:

LoveLove may also
refer to: Contents. 1 Film and television.

As you can see, the second term is from the same document but it is from a
different paragraph; aside from the highlight there is no way to tell them
apart. That's the problem I've been having.

I put the hl.snippets under my default search handler...



On 12/13/10 8:45 AM, "Koji Sekiguchi"  wrote:

> (10/12/13 23:00), Alejandro Delgadillo wrote:
>> 
>> Hi everybody,
>> 
>> I'm having some troubles trying to figure out how to separate lines in a
>> paragraph from a search result, I'm indexing PDFs but when I search the
>> highlight terms I can not know when the first line ends and the next one
>> begins,
>> 
>> Is there a way to put a [...] like google or a Paragraph symbol?
>> 
>> I'll appreciate all the help I can get.
>> 
>> -- Alex.
>> 
> Alex,
> 
> Use hl.snippets=n parameter, where n is a number (2, 3, ...).
> Then you'll get the number of snippets at maximum, you can
> appends these snippets with "..." between them.
> 
> Koji




Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-13 Thread John Russell
Thanks for the response.

The date types are defined in our schema file like this






Which appears to be what you mentioned.  Then we use them in fields like
this

   
   

So I think we have the right datatypes for the dates.  Most of the other
ones are strings.

As for the doc we are adding, I don't think it would be considered "huge".
It is basically blog posts and tweets broken out into fields like author,
title, summary etc.  Each doc probably isn't more than 1 or 2k tops.  Some
probably smaller.

We do create them once and then update the indexes as we perform work on the
documents.  For example, we create the doc for the original incoming post
and then update that doc with tags or the results of filtering so we can
look for them later.

We have solr set up as a separate JVM which we talk to over HTTP on the same
box using the solrj client java library.  Unfortunately we are on 32 bit
hardware so solr can only get 2.6GB of memory.  Any more than that and the
JVM won't start.

I really just need a way to keep the cache from breaking the bank.  As I
pasted below there are some config elements in the XML that appear to be
related to caching but I'm not sure that they are related to that specific
hashmap which eventually grows to 2.1GB of our 2.6GB heap.  It never
actually runs out of heap space but GC's the CPU to death.

Thanks again.

John

On Sat, Dec 11, 2010 at 17:46, Erick Erickson wrote:

> "unfortunately I can't check the statistics page.  For some reason the solr
> webapp itself is only returning a directory listing."
>
> This is very weird and makes me wonder if there's something really wonky
> with your system. I'm assuming when you say "the solr webapp itself" you're
> taking about ...localhost:8983/solr/admin/.. You might want to be
> looking
> at the stats page (and frantically hitting refresh) before you have
> problems.
> Alternately, you could record the queries as they are sent to solr to see
> what
> the offending
>
> But onwards Tell us more about your dates. One of the very common
> ways people get into trouble is to use dates that are unix-style
> timestamps,
> i.e. in milliseconds (either as ints or strings) and sort on them. Trie
> fields
> are very much preferred for this.
>
> Your index isn't all that large by regular standards, so I think that
> there's
> hope that you can get this working.
>
>
> Wait, wait, wait. Looking again at the stack trace I see that your OOM
> is happening when you *add* a document. Tell us more about the
> document, perhaps you can print out some characteristics of the doc
> before you add it? Is it always the same doc? Are you indexing and
> searching on the same machine? Is the doc really huge?
>
> Best
> Erick
>
>
> On Fri, Dec 10, 2010 at 4:33 PM, John Russell  wrote:
>
> > Thanks a lot for the response.
> >
> > Unfortunately I can't check the statistics page.  For some reason the
> solr
> > webapp itself is only returning a directory listing.  This is sometimes
> > fixed when I restart but if I do that I'll lose the state I have now.  I
> > can
> > get at the JMX interface.  Can I check my insanity level from there?
> >
> > We did change two parts of the solr config to raise the size of the query
> > Results and document cache.  I assume from what you were saying that this
> > does not have an effect on the cache I mentioned taking up all of the
> > space.
> >
> >>
> > <queryResultCache
> >   class="solr.LRUCache"
> >   size="16384"
> >   initialSize="4096"
> >   autowarmCount="0"/>
> >
> > <documentCache
> >   class="solr.LRUCache"
> >   size="16384"
> >   initialSize="16384"
> >   autowarmCount="0"/>
> >
> >
> > This problem gets worse as our index grows (5.0GB now).  Unfortunately we
> > are maxed out on memory for our hardware.
> >
> > We aren't using faceting at all in our searches right now.  We usually
> sort
> > on 1 or 2 fields at the most.  I think the types of our fields are pretty
> > accurate, unfortunately they are mostly strings, and some dates.
> >
> > How do the field definitions effect that cache? Is it simply that fewer
> > fields mean less to cache? Does it not cache some fields configured in a
> > certain way?
> >
> > Is there a way to throw out an IndexReader after a while and restart,
> just
> > to restart the cache? Or maybe explicitly clear it if we see it getting
> out
> > of hand through JMX or something?
> >
> > Really anything to stop it from choking like this would be awesome.
> >
> > Thanks again.
> >
> > John
> >
> > On Fri, Dec 10, 2010 at 16:02, Tom Hill  wrote:
> >
> > > Hi John,
> > >
> > > WeakReferences allow things to get GC'd, if there are no other
> > > references to the object referred to.
> > >
> > > My understanding is that WeakHashMaps use weak references for the Keys
> > > in the HashMap.
> > >
> > > What this means is that the keys in HashMap can be GC'd, once there
> > > are no other references to the key. I _think_ this occ

Re: [pubDate] is not converting correctly

2010-12-13 Thread Lance Norskog
Please file a JIRA requesting this.

On Mon, Dec 13, 2010 at 6:29 AM, Adam Estrada  wrote:
> +1  If I knew enough about how to do this in Java I would but I do not
> s.What is the correct way to add or suggest enhancements to Solr
> core?
>
> Adam
>
> On Sun, Dec 12, 2010 at 11:38 PM, Lance Norskog  wrote:
>
>> Nice find!  This is Apache 2.0, copyright SUN.
>>
>> O Great Apache Elders: Is it kosher to add this to the Solr
>> distribution? It's not in the JDK and is also com.sun.*
>>
>> On Sun, Dec 12, 2010 at 5:33 PM, Adam Estrada
>>  wrote:
>> > Thanks for the feedback! There are quite a few formats that can be used.
>> I
>> > am experiencing at least 5 of them. Would something like this work? Note
>> > that there are 2 different formats separated by a comma.
>> >
>> > > > dateTimeFormat="EEE, dd MMM  HH:mm:ss zzz, -MM-dd'T'HH:mm:ss'Z'"
>> />
>> >
>> > I don't suppose it will because there is already a comma in the first
>> > parser. I guess I am reallly looking for an all purpose data time parser
>> but
>> > even if I have that, would I still be able to query *all* fields in the
>> > index?
>> >
>> > Good article:
>> >
>> http://www.java2s.com/Open-Source/Java-Document/RSS-RDF/Rome/com/sun/syndication/io/impl/DateParser.java.htm
>> >
>> > Adam
>> >
>> > On Sun, Dec 12, 2010 at 7:31 PM, Koji Sekiguchi 
>> wrote:
>> >
>> >> (10/12/13 8:49), Adam Estrada wrote:
>> >>
>> >>> All,
>> >>>
>> >>> I am having some difficu"lties parsing the pubDate field that is part
>> of
>> >>> the?
>> >>> RSS spec (I believe). I get the warning that "states, "Dec 12, 2010
>> >>> 6:45:26
>> >>> PM org.apache.solr.handler.dataimport.DateFormatTransformer
>> >>>  transformRow
>> >>> WARNING: Could not parse a Date field
>> >>> java.text.ParseException: Unparseable date: "Thu, 30 Jul 2009 14:41:43
>> >>> +"
>> >>>         at java.text.DateFormat.parse(Unknown Source)"
>> >>>
>> >>> Does anyone know how to fix this? I would eventually like to do a date
>> >>> query
>> >>> but without the ability to properly parse them I don't know if it's
>> going
>> >>> to
>> >>> work.
>> >>>
>> >>> Thanks,
>> >>> Adam
>> >>>
>> >>
>> >> Adam,
>> >>
>> >> How does your data-config.xml look like for that field?
>> >> Have you looked at rss-data-config.xml file
>> >> under example/example-DIH/solr/rss/conf directory?
>> >>
>> >> Koji
>> >> --
>> >> http://www.rondhuit.com/en/
>> >>
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>>
>



-- 
Lance Norskog
goks...@gmail.com


Re: [pubDate] is not converting correctly

2010-12-13 Thread Lance Norskog
Create an account at
https://issues.apache.org/jira/secure/Dashboard.jspa and do 'Create
New Issue' for the Solr project.

On Mon, Dec 13, 2010 at 2:13 PM, Lance Norskog  wrote:
> Please file a JIRA requesting this.
>
> On Mon, Dec 13, 2010 at 6:29 AM, Adam Estrada  wrote:
>> +1  If I knew enough about how to do this in Java I would but I do not
>> s.What is the correct way to add or suggest enhancements to Solr
>> core?
>>
>> Adam
>>
>> On Sun, Dec 12, 2010 at 11:38 PM, Lance Norskog  wrote:
>>
>>> Nice find!  This is Apache 2.0, copyright SUN.
>>>
>>> O Great Apache Elders: Is it kosher to add this to the Solr
>>> distribution? It's not in the JDK and is also com.sun.*
>>>
>>> On Sun, Dec 12, 2010 at 5:33 PM, Adam Estrada
>>>  wrote:
>>> > Thanks for the feedback! There are quite a few formats that can be used.
>>> I
>>> > am experiencing at least 5 of them. Would something like this work? Note
>>> > that there are 2 different formats separated by a comma.
>>> >
>>> > >> > dateTimeFormat="EEE, dd MMM  HH:mm:ss zzz, -MM-dd'T'HH:mm:ss'Z'"
>>> />
>>> >
>>> > I don't suppose it will because there is already a comma in the first
>>> > parser. I guess I am reallly looking for an all purpose data time parser
>>> but
>>> > even if I have that, would I still be able to query *all* fields in the
>>> > index?
>>> >
>>> > Good article:
>>> >
>>> http://www.java2s.com/Open-Source/Java-Document/RSS-RDF/Rome/com/sun/syndication/io/impl/DateParser.java.htm
>>> >
>>> > Adam
>>> >
>>> > On Sun, Dec 12, 2010 at 7:31 PM, Koji Sekiguchi 
>>> wrote:
>>> >
>>> >> (10/12/13 8:49), Adam Estrada wrote:
>>> >>
>>> >>> All,
>>> >>>
>>> >>> I am having some difficu"lties parsing the pubDate field that is part
>>> of
>>> >>> the?
>>> >>> RSS spec (I believe). I get the warning that "states, "Dec 12, 2010
>>> >>> 6:45:26
>>> >>> PM org.apache.solr.handler.dataimport.DateFormatTransformer
>>> >>>  transformRow
>>> >>> WARNING: Could not parse a Date field
>>> >>> java.text.ParseException: Unparseable date: "Thu, 30 Jul 2009 14:41:43
>>> >>> +"
>>> >>>         at java.text.DateFormat.parse(Unknown Source)"
>>> >>>
>>> >>> Does anyone know how to fix this? I would eventually like to do a date
>>> >>> query
>>> >>> but without the ability to properly parse them I don't know if it's
>>> going
>>> >>> to
>>> >>> work.
>>> >>>
>>> >>> Thanks,
>>> >>> Adam
>>> >>>
>>> >>
>>> >> Adam,
>>> >>
>>> >> How does your data-config.xml look like for that field?
>>> >> Have you looked at rss-data-config.xml file
>>> >> under example/example-DIH/solr/rss/conf directory?
>>> >>
>>> >> Koji
>>> >> --
>>> >> http://www.rondhuit.com/en/
>>> >>
>>> >
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>>>
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>



-- 
Lance Norskog
goks...@gmail.com


Re: How to get all the search results?

2010-12-13 Thread Shawn Heisey

On 12/13/2010 9:59 AM, Solr User wrote:

Hi,

I tried *:* using dismax and I get no results.

Is there a way that I can get all the search results using dismax?


For dismax, use an empty q parameter (q=) or simply leave the q parameter off the URL
entirely.  It appears that you need to have q.alt set to *:* for this to
work.  It would be a good idea to include this in your handler definition:


<str name="q.alt">*:*</str>
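
For illustration, a handler definition along those lines might look like the
sketch below (the handler name and other defaults are placeholders, not a
required setup):

<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>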

Two people (myself and Peter Karich) gave this answer on this thread 
last week, within 15 minutes of the time your original question was 
posted.  Here's the entire thread on nabble:


http://lucene.472066.n3.nabble.com/How-to-get-all-the-search-results-td2028233.html

Shawn



Re: How to get all the search results?

2010-12-13 Thread Solr User
Hi Shawn,

Yes you did.

I tried it and it did not work, so I asked the same question again.

Now I understand. I tried it directly on the Solr admin and got all the
search results. I will implement the same on the website.

Thank you so much Shawn.


On Mon, Dec 13, 2010 at 5:16 PM, Shawn Heisey  wrote:

> On 12/13/2010 9:59 AM, Solr User wrote:
>
>> Hi,
>>
>> I tried *:* using dismax and I get no results.
>>
>> Is there a way that I can get all the search results using dismax?
>>
>
> For dismax, use q= or simply leave the q parameter off the URL entirely.
>  It appears that you need to have q.alt set to *:* for this to work.  It
> would be a good idea to include this in your handler definition:
>
> *:*
>
> Two people (myself and Peter Karich) gave this answer on this thread last
> week, within 15 minutes of the time your original question was posted.
>  Here's the entire thread on nabble:
>
>
> http://lucene.472066.n3.nabble.com/How-to-get-all-the-search-results-td2028233.html
>
> Shawn
>
>


Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-13 Thread Jonathan Rochkind
Forgive me if I've said this in this thread already, but I'm beginning 
to think this is the main 'mysterious' cause of Solr RAM/gc issues.


Are you committing very frequently?  So frequently that you commit 
faster than it takes for warming operations on a new Solr index to 
complete, and you're getting over-lapping indexes being prepared?


But if the problem really is just GC issues and not actually too much 
RAM being used, try this JVM setting:


-XX:+UseConcMarkSweepGC

Will make GC happen in a different thread, instead of the same thread as 
solr operations.
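
As a sketch, assuming the example Jetty setup, the flag just goes on the
startup command line (the heap size shown is only illustrative):

java -Xmx2048m -XX:+UseConcMarkSweepGC -jar start.jar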


I think that is also something that many many Solr installations 
probably need, but don't realize they need.


On 12/13/2010 3:42 PM, John Russell wrote:

Thanks for the response.

The date types are defined in our schema file like this

 

 
 

Which appears to be what you mentioned.  Then we use them in fields like
this




So I think we have the right datatypes for the dates.  Most of the other
ones are strings.

As for the doc we are adding, I don't think it would be considered "huge".
It is basically blog posts and tweets broken out into fields like author,
title, summary etc.  Each doc probably isn't more than 1 or 2k tops.  Some
probably smaller.

We do create them once and then update the indexes as we perform work on the
documents.  For example, we create the doc for the original incoming post
and then update that doc with tags or the results of filtering so we can
look for them later.

We have solr set up as a separate JVM which we talk to over HTTP on the same
box using the solrj client java library.  Unfortunately we are on 32 bit
hardware so solr can only get 2.6GB of memory.  Any more than that and the
JVM won't start.

I really just need a way to keep the cache from breaking the bank.  As I
pasted below there are some config elements in the XML that appear to be
related to caching but I'm not sure that they are related to that specific
hashmap which eventually grows to 2.1GB of our 2.6GB heap.  It never
actually runs out of heap space but GC's the CPU to death.

Thanks again.

John

On Sat, Dec 11, 2010 at 17:46, Erick Ericksonwrote:


"unfortunately I can't check the statistics page.  For some reason the solr
webapp itself is only returning a directory listing."

This is very weird and makes me wonder if there's something really wonky
with your system. I'm assuming when you say "the solr webapp itself" you're
taking about ...localhost:8983/solr/admin/.. You might want to be
looking
at the stats page (and frantically hitting refresh) before you have
problems.
Alternately, you could record the queries as they are sent to solr to see
what
the offending

But onwards Tell us more about your dates. One of the very common
ways people get into trouble is to use dates that are unix-style
timestamps,
i.e. in milliseconds (either as ints or strings) and sort on them. Trie
fields
are very much preferred for this.

Your index isn't all that large by regular standards, so I think that
there's
hope that you can get this working.


Wait, wait, wait. Looking again at the stack trace I see that your OOM
is happening when you *add* a document. Tell us more about the
document, perhaps you can print out some characteristics of the doc
before you add it? Is it always the same doc? Are you indexing and
searching on the same machine? Is the doc really huge?

Best
Erick


On Fri, Dec 10, 2010 at 4:33 PM, John Russell  wrote:


Thanks a lot for the response.

Unfortunately I can't check the statistics page.  For some reason the

solr

webapp itself is only returning a directory listing.  This is sometimes
fixed when I restart but if I do that I'll lose the state I have now.  I
can
get at the JMX interface.  Can I check my insanity level from there?

We did change two parts of the solr config to raise the size of the query
Results and document cache.  I assume from what you were saying that this
does not have an effect on the cache I mentioned taking up all of the
space.

   








This problem gets worse as our index grows (5.0GB now).  Unfortunately we
are maxed out on memory for our hardware.

We aren't using faceting at all in our searches right now.  We usually

sort

on 1 or 2 fields at the most.  I think the types of our fields are pretty
accurate, unfortunately they are mostly strings, and some dates.

How do the field definitions effect that cache? Is it simply that fewer
fields mean less to cache? Does it not cache some fields configured in a
certain way?

Is there a way to throw out an IndexReader after a while and restart,

just

to restart the cache? Or maybe explicitly clear it if we see it getting

out

of hand through JMX or something?

Really anything to stop it from choking like this would be awesome.

Thanks again.

John

On Fri, Dec 10, 2010 at 16:02, Tom Hill  wrote:


Hi John,

WeakReferences allow things to get GC'd, if there are no other
references to the object referred to.

My understanding is

Re: Separate Lines Like Google

2010-12-13 Thread Koji Sekiguchi

(10/12/14 5:06), Alejandro Delgadillo wrote:

Koji,

Thank you for helping me with my questions, but I still don't get how
it's done. Let's say I search for the term "love" and I get something like
this:

LoveLove  may also
refer to: Contents. 1 Film and television.

As you can see, the second one is from the same document but from a
different paragraph; aside from the highlighting there is no way to tell them
apart. That's the problem I've been having.

I put the hl.snippets under my default search handler...


Alex,

You need to pre-process them by paragraph. When you index your docs,
you should add them like this:


  
Love is an intense feeling  of affection
Love may also refer to: Contents. 1 Film and 
television.
  


instead of:


  
Love is an intense feeling  of affection
Love may also refer to: Contents. 1 Film and television.
  


Koji
--
http://www.rondhuit.com/en/


Re: SolrEventListeners are instantiated twice

2010-12-13 Thread Chris Hostetter

: SolrEventListener. Even though I only register the listener in the query
: section of solrconfig.xml, listening to the firstSearcher event, the
: listener is also attached to the UpdateHandler and thus the init-method runs
: twice because there is two instances of the class. To eliminate any other

Jørgen: thank you for reporting this.

It is definitely a bug, and i have opened a jira trakcing issue with an 
attached test demonstrating your problem and a proposed fix that i am 
currently testing...

https://issues.apache.org/jira/browse/SOLR-2285


-Hoss

Re: access to environment variables in solrconfig.xml and/or schema.xml?

2010-12-13 Thread Koji Sekiguchi

(10/12/14 4:28), Burton-West, Tom wrote:

I see variables used to access java system properties in solrconfig.xml and 
schema.xml:

http://wiki.apache.org/solr/SolrConfigXml#System_property_substitution
${solr.data.dir:}
or
${solr.abortOnConfigurationError:true}

Is there a way to access environment variables or does everything have to be 
stuffed into a java system property?


Tom,

No, there is no way to access environment variables. But you can
access your environment variable through a java system property
if you assign the environment variable to a java system property
when you launch JVM:

$ java -Djava.system.property1=$ENVIRONMENT_VAR1 \
-Djava.system.property2=$ENVIRONMENT_VAR2 -jar start.jar
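
The property can then be referenced in solrconfig.xml or schema.xml like any
other system property; a minimal sketch, with the element and property name
purely illustrative:

<dataDir>${java.system.property1}</dataDir>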

Koji
--
http://www.rondhuit.com/en/


Re: Problem with loading a class

2010-12-13 Thread Chris Hostetter

: Caused by: java.lang.ClassNotFoundException:
: solr.StempelPolishStemFilterFactory
: 
: So I tried putting
: contrib/analysis-extras/lucene-libs/lucene-stempel-3.1-2010-12-06_10-23-49.jar
: in ./lib and ./lucene-libs - same result.

The lucene-stempel-*.jar file contains the StempelPolishStemFilter, but to 
use it in Solr you also need the StempelPolishStemFilterFactory which is 
in the apache-solr-analysis-extras-*.jar
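
With both jars on Solr's lib path, a minimal sketch of using the factory in
schema.xml would be something like this (the field type name and the rest of
the analyzer chain are only illustrative):

<fieldType name="text_pl" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StempelPolishStemFilterFactory"/>
  </analyzer>
</fieldType>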



-Hoss


Re: JMX Cache values are wrong

2010-12-13 Thread Chris Hostetter

: I've used three different JMX clients to query
...
: beans and they appear to return old cache information.
: 
: As new searchers come online, the newer caches dosen't appear to be
: registered perhaps?
: I can see this when I query JMX for the 'description' attribute and
: the regenerator JMX output shows a different
: org.apache.solr.search.SolrIndexSearcher to that which appears in the
: stats.jsp page.

Hmmm... using jconsole and the example jetty instance i can't reproduce 
this behavior (on trunk)

I can view details on those beans, and click the "refresh" button for 
those beans to see updated stats as requests come in.  When a new searcher 
is opened, those beans go away complete (jsonsole detects that they are 
gone and closes the active pane) and new beans come along which have the 
updated stats for the new cache instances.

Perhaps either the servlet container or JMX client you are using aren't 
recognizing when one bean goes away and a new one is registered with the 
same name?



-Hoss


Re: [pubDate] is not converting correctly

2010-12-13 Thread Adam Estrada
My first submission ;-)

https://issues.apache.org/jira/browse/SOLR-2286

Adam

On Mon, Dec 13, 2010 at 5:14 PM, Lance Norskog  wrote:

> Create an account at
> https://issues.apache.org/jira/secure/Dashboard.jspa and do 'Create
> New Issue' for the Solr project.
>
> On Mon, Dec 13, 2010 at 2:13 PM, Lance Norskog  wrote:
> > Please file a JIRA requesting this.
> >
> > On Mon, Dec 13, 2010 at 6:29 AM, Adam Estrada 
> wrote:
> >> +1  If I knew enough about how to do this in Java I would but I do not
> >> s.What is the correct way to add or suggest enhancements to Solr
> >> core?
> >>
> >> Adam
> >>
> >> On Sun, Dec 12, 2010 at 11:38 PM, Lance Norskog 
> wrote:
> >>
> >>> Nice find!  This is Apache 2.0, copyright SUN.
> >>>
> >>> O Great Apache Elders: Is it kosher to add this to the Solr
> >>> distribution? It's not in the JDK and is also com.sun.*
> >>>
> >>> On Sun, Dec 12, 2010 at 5:33 PM, Adam Estrada
> >>>  wrote:
> >>> > Thanks for the feedback! There are quite a few formats that can be
> used.
> >>> I
> >>> > am experiencing at least 5 of them. Would something like this work?
> Note
> >>> > that there are 2 different formats separated by a comma.
> >>> >
> >>> >  >>> > dateTimeFormat="EEE, dd MMM  HH:mm:ss zzz,
> -MM-dd'T'HH:mm:ss'Z'"
> >>> />
> >>> >
> >>> > I don't suppose it will because there is already a comma in the first
> >>> > parser. I guess I am reallly looking for an all purpose data time
> parser
> >>> but
> >>> > even if I have that, would I still be able to query *all* fields in
> the
> >>> > index?
> >>> >
> >>> > Good article:
> >>> >
> >>>
> http://www.java2s.com/Open-Source/Java-Document/RSS-RDF/Rome/com/sun/syndication/io/impl/DateParser.java.htm
> >>> >
> >>> > Adam
> >>> >
> >>> > On Sun, Dec 12, 2010 at 7:31 PM, Koji Sekiguchi 
> >>> wrote:
> >>> >
> >>> >> (10/12/13 8:49), Adam Estrada wrote:
> >>> >>
> >>> >>> All,
> >>> >>>
> >>> >>> I am having some difficu"lties parsing the pubDate field that is
> part
> >>> of
> >>> >>> the?
> >>> >>> RSS spec (I believe). I get the warning that "states, "Dec 12, 2010
> >>> >>> 6:45:26
> >>> >>> PM org.apache.solr.handler.dataimport.DateFormatTransformer
> >>> >>>  transformRow
> >>> >>> WARNING: Could not parse a Date field
> >>> >>> java.text.ParseException: Unparseable date: "Thu, 30 Jul 2009
> 14:41:43
> >>> >>> +"
> >>> >>> at java.text.DateFormat.parse(Unknown Source)"
> >>> >>>
> >>> >>> Does anyone know how to fix this? I would eventually like to do a
> date
> >>> >>> query
> >>> >>> but without the ability to properly parse them I don't know if it's
> >>> going
> >>> >>> to
> >>> >>> work.
> >>> >>>
> >>> >>> Thanks,
> >>> >>> Adam
> >>> >>>
> >>> >>
> >>> >> Adam,
> >>> >>
> >>> >> How does your data-config.xml look like for that field?
> >>> >> Have you looked at rss-data-config.xml file
> >>> >> under example/example-DIH/solr/rss/conf directory?
> >>> >>
> >>> >> Koji
> >>> >> --
> >>> >> http://www.rondhuit.com/en/
> >>> >>
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Lance Norskog
> >>> goks...@gmail.com
> >>>
> >>
> >
> >
> >
> > --
> > Lance Norskog
> > goks...@gmail.com
> >
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-13 Thread John Russell
Wow, you read my mind.  We are committing very frequently.  We are trying to
get as close to realtime access to the stuff we put in as possible.  Our
current commit time is... ahem every 4 seconds.

Is that insane?

I'll try the ConcMarkSweep as well and see if that helps.

On Mon, Dec 13, 2010 at 17:38, Jonathan Rochkind  wrote:

> Forgive me if I've said this in this thread already, but I'm beginning to
> think this is the main 'mysterious' cause of Solr RAM/gc issues.
>
> Are you committing very frequently?  So frequently that you commit faster
> than it takes for warming operations on a new Solr index to complete, and
> you're getting over-lapping indexes being prepared?
>
> But if the problem really is just GC issues and not actually too much RAM
> being used, try this JVM setting:
>
> -XX:+UseConcMarkSweepGC
>
> Will make GC happen in a different thread, instead of the same thread as
> solr operations.
>
> I think that is also something that many many Solr installations probably
> need, but don't realize they need.
>
>
> On 12/13/2010 3:42 PM, John Russell wrote:
>
>> Thanks for the response.
>>
>> The date types are defined in our schema file like this
>>
>> > precisionStep="0" positionIncrementGap="0"/>
>>
>> 
>> > precisionStep="6" positionIncrementGap="0"/>
>>
>> Which appears to be what you mentioned.  Then we use them in fields like
>> this
>>
>>> stored="false"
>> required="false" multiValued="false" />
>>> required="false" multiValued="false" />
>>
>> So I think we have the right datatypes for the dates.  Most of the other
>> ones are strings.
>>
>> As for the doc we are adding, I don't think it would be considered "huge".
>> It is basically blog posts and tweets broken out into fields like author,
>> title, summary etc.  Each doc probably isn't more than 1 or 2k tops.  Some
>> probably smaller.
>>
>> We do create them once and then update the indexes as we perform work on
>> the
>> documents.  For example, we create the doc for the original incoming post
>> and then update that doc with tags or the results of filtering so we can
>> look for them later.
>>
>> We have solr set up as a separate JVM which we talk to over HTTP on the
>> same
>> box using the solrj client java library.  Unfortunately we are on 32 bit
>> hardware so solr can only get 2.6GB of memory.  Any more than that and the
>> JVM won't start.
>>
>> I really just need a way to keep the cache from breaking the bank.  As I
>> pasted below there are some config elements in the XML that appear to be
>> related to caching but I'm not sure that they are related to that specific
>> hashmap which eventually grows to 2.1GB of our 2.6GB heap.  It never
>> actually runs out of heap space but GC's the CPU to death.
>>
>> Thanks again.
>>
>> John
>>
>> On Sat, Dec 11, 2010 at 17:46, Erick Erickson> >wrote:
>>
>>  "unfortunately I can't check the statistics page.  For some reason the
>>> solr
>>> webapp itself is only returning a directory listing."
>>>
>>> This is very weird and makes me wonder if there's something really wonky
>>> with your system. I'm assuming when you say "the solr webapp itself"
>>> you're
>>> taking about ...localhost:8983/solr/admin/.. You might want to be
>>> looking
>>> at the stats page (and frantically hitting refresh) before you have
>>> problems.
>>> Alternately, you could record the queries as they are sent to solr to see
>>> what
>>> the offending
>>>
>>> But onwards Tell us more about your dates. One of the very common
>>> ways people get into trouble is to use dates that are unix-style
>>> timestamps,
>>> i.e. in milliseconds (either as ints or strings) and sort on them. Trie
>>> fields
>>> are very much preferred for this.
>>>
>>> Your index isn't all that large by regular standards, so I think that
>>> there's
>>> hope that you can get this working.
>>>
>>>
>>> Wait, wait, wait. Looking again at the stack trace I see that your OOM
>>> is happening when you *add* a document. Tell us more about the
>>> document, perhaps you can print out some characteristics of the doc
>>> before you add it? Is it always the same doc? Are you indexing and
>>> searching on the same machine? Is the doc really huge?
>>>
>>> Best
>>> Erick
>>>
>>>
>>> On Fri, Dec 10, 2010 at 4:33 PM, John Russell
>>>  wrote:
>>>
>>>  Thanks a lot for the response.

 Unfortunately I can't check the statistics page.  For some reason the

>>> solr
>>>
 webapp itself is only returning a directory listing.  This is sometimes
 fixed when I restart but if I do that I'll lose the state I have now.  I
 can
 get at the JMX interface.  Can I check my insanity level from there?

 We did change two parts of the solr config to raise the size of the
 query
 Results and document cache.  I assume from what you were saying that
 this
 does not have an effect on the cache I mentioned taking up all of the
 space.

   >>>
  class=*"solr.LRUCache"*

  s

Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-13 Thread Yonik Seeley
On Mon, Dec 13, 2010 at 8:47 PM, John Russell  wrote:
> Wow, you read my mind.  We are committing very frequently.  We are trying to
> get as close to realtime access to the stuff we put in as possible.  Our
> current commit time is... ahem every 4 seconds.
>
> Is that insane?

Not necessarily insane, but challenging ;-)
I'd start by setting maxWarmingSearchers to 1 in solrconfig.xml.  When
that is exceeded, a commit will fail (this just means a new searcher
won't be opened on that commit... the docs will be visible with the
next commit that does succeed.)
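
For reference, that is a single setting in solrconfig.xml, shown here with the
suggested value:

<maxWarmingSearchers>1</maxWarmingSearchers>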

-Yonik
http://www.lucidimagination.com


RE: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-13 Thread Jonathan Rochkind
ConcMarkSweep probably won't help.

Solr 1.4 is not very good at 'near real time' committing.  There are some 
features post-1.4, that I don't know if they are in trunk yet or still just 
patches, that I have not investigated myself, but google (or JIRA search) for 
'near real time'.

http://wiki.apache.org/solr/SolrPerformanceFactors#Updates_and_Commit_Frequency_Tradeoffs



This seems to be a very frequent issue these days; everyone running Solr should 
at least read that wiki section to understand what's going on.
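
One related knob, sketched here with purely illustrative values, is to let Solr
batch commits itself via autoCommit in solrconfig.xml rather than issuing an
explicit commit every few seconds:

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs> <!-- commit after this many added docs... -->
    <maxTime>60000</maxTime> <!-- ...or after this many milliseconds -->
  </autoCommit>
</updateHandler>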



From: John Russell [jjruss...@gmail.com]
Sent: Monday, December 13, 2010 8:47 PM
To: solr-user@lucene.apache.org
Subject: Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap 
getting collected?

Wow, you read my mind.  We are committing very frequently.  We are trying to
get as close to realtime access to the stuff we put in as possible.  Our
current commit time is... ahem every 4 seconds.

Is that insane?

I'll try the ConcMarkSweep as well and see if that helps.

On Mon, Dec 13, 2010 at 17:38, Jonathan Rochkind  wrote:

> Forgive me if I've said this in this thread already, but I'm beginning to
> think this is the main 'mysterious' cause of Solr RAM/gc issues.
>
> Are you committing very frequently?  So frequently that you commit faster
> than it takes for warming operations on a new Solr index to complete, and
> you're getting over-lapping indexes being prepared?
>
> But if the problem really is just GC issues and not actually too much RAM
> being used, try this JVM setting:
>
> -XX:+UseConcMarkSweepGC
>
> Will make GC happen in a different thread, instead of the same thread as
> solr operations.
>
> I think that is also something that many many Solr installations probably
> need, but don't realize they need.
>
>
> On 12/13/2010 3:42 PM, John Russell wrote:
>
>> Thanks for the response.
>>
>> The date types are defined in our schema file like this
>>
>> > precisionStep="0" positionIncrementGap="0"/>
>>
>> 
>> > precisionStep="6" positionIncrementGap="0"/>
>>
>> Which appears to be what you mentioned.  Then we use them in fields like
>> this
>>
>>> stored="false"
>> required="false" multiValued="false" />
>>> required="false" multiValued="false" />
>>
>> So I think we have the right datatypes for the dates.  Most of the other
>> ones are strings.
>>
>> As for the doc we are adding, I don't think it would be considered "huge".
>> It is basically blog posts and tweets broken out into fields like author,
>> title, summary etc.  Each doc probably isn't more than 1 or 2k tops.  Some
>> probably smaller.
>>
>> We do create them once and then update the indexes as we perform work on
>> the
>> documents.  For example, we create the doc for the original incoming post
>> and then update that doc with tags or the results of filtering so we can
>> look for them later.
>>
>> We have solr set up as a separate JVM which we talk to over HTTP on the
>> same
>> box using the solrj client java library.  Unfortunately we are on 32 bit
>> hardware so solr can only get 2.6GB of memory.  Any more than that and the
>> JVM won't start.
>>
>> I really just need a way to keep the cache from breaking the bank.  As I
>> pasted below there are some config elements in the XML that appear to be
>> related to caching but I'm not sure that they are related to that specific
>> hashmap which eventually grows to 2.1GB of our 2.6GB heap.  It never
>> actually runs out of heap space but GC's the CPU to death.
>>
>> Thanks again.
>>
>> John
>>
>> On Sat, Dec 11, 2010 at 17:46, Erick Erickson> >wrote:
>>
>>  "unfortunately I can't check the statistics page.  For some reason the
>>> solr
>>> webapp itself is only returning a directory listing."
>>>
>>> This is very weird and makes me wonder if there's something really wonky
>>> with your system. I'm assuming when you say "the solr webapp itself"
>>> you're
>>> taking about ...localhost:8983/solr/admin/.. You might want to be
>>> looking
>>> at the stats page (and frantically hitting refresh) before you have
>>> problems.
>>> Alternately, you could record the queries as they are sent to solr to see
>>> what
>>> the offending
>>>
>>> But onwards Tell us more about your dates. One of the very common
>>> ways people get into trouble is to use dates that are unix-style
>>> timestamps,
>>> i.e. in milliseconds (either as ints or strings) and sort on them. Trie
>>> fields
>>> are very much preferred for this.
>>>
>>> Your index isn't all that large by regular standards, so I think that
>>> there's
>>> hope that you can get this working.
>>>
>>>
>>> Wait, wait, wait. Looking again at the stack trace I see that your OOM
>>> is happening when you *add* a document. Tell us more about the
>>> document, perhaps you can print out some characteristics of the doc
>>> before you add it? Is it always the same doc? Are you indexing and
>>> searching on the same machine? Is the doc really huge?
>>>
>>> Best
>>

Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-13 Thread Shawn Heisey

On 12/13/2010 3:38 PM, Jonathan Rochkind wrote:
But if the problem really is just GC issues and not actually too much 
RAM being used, try this JVM setting:


-XX:+UseConcMarkSweepGC


That's what I use on my shards; I've never had any visible problems with
memory or garbage collection delays.  I have not done any kind of 
profiling, though.


The servers (CentOS Xen VMs) have 9GB of total RAM and serve indexes 
that are nearing 15GB in size and have over 8 million documents.  
Important parts of my java commandline:


-Xms512M -Xmx2048M -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode

java version "1.6.0_22"
Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)

Shawn



RE: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-13 Thread Jonathan Rochkind
Wow, really, it's that easy?  I could swear there's a wiki page somewhere that 
suggests otherwise, but I believe Yonik today over a wiki page last edited 
wherever.  

But this should be well-publicized; it's a pretty easy solution that will at
least give you "as up to date as your Solr can handle" for a problem that many
people seem to be having.  I would suggest a maxWarmingSearchers=1 example
should at least be included, commented out, in the example solrconfig.xml, if not
even included live.

(This would be even better if, on a commit failing due to maxWarmingSearchers, 
Solr would automatically commit them when the warming is complete -- instead of 
relying on another commit manually being made at some future point.  Is there 
any built-in hook for 'warming complete' or 'index fully ready' that could be 
used to jury-rig this?)

Yonik, how will maxWarmingSearchers in this scenario affect replication?  If a
slave is pulling down new indexes so quickly that the warming searchers would
ordinarily pile up, but maxWarmingSearchers is set to 1, what happens?


From: ysee...@gmail.com [ysee...@gmail.com] On Behalf Of Yonik Seeley 
[yo...@lucidimagination.com]
Sent: Monday, December 13, 2010 9:07 PM
To: solr-user@lucene.apache.org
Subject: Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap 
getting collected?

On Mon, Dec 13, 2010 at 8:47 PM, John Russell  wrote:
> Wow, you read my mind.  We are committing very frequently.  We are trying to
> get as close to realtime access to the stuff we put in as possible.  Our
> current commit time is... ahem every 4 seconds.
>
> Is that insane?

Not necessarily insane, but challenging ;-)
I'd start by setting maxWarmingSearchers to 1 in solrconfig.xml.  When
that is exceeded, a commit will fail (this just means a new searcher
won't be opened on that commit... the docs will be visible with the
next commit that does succeed.)

-Yonik
http://www.lucidimagination.com


Userdefined Field type - Faceting

2010-12-13 Thread Viswa S

Hello,

We implemented an IP-Addr field type which internally stores the IPs as a hex-encoded
string (e.g. "192.2.103.29" is stored as "c002671d"). My "toExternal" and
"toInternal" methods for the conversion seem to be working well for
query results; however, when faceting on this field it returns the raw
strings. In other words the query response would have "192.2.103.29", but a facet
on the field would return <int name="c002671d">1</int>.

Why are these methods not used by the faceting component to convert the 
resulting values?

Thanks
Viswa
  

SpatialTierQueryParserPlugin Loading Error

2010-12-13 Thread Adam Estrada
All,

Can anyone shed some light on this error. I can't seem to get this
class to load. I am using the distribution of Solr from Lucid
Imagination and the Spatial Plugin from here
https://issues.apache.org/jira/browse/SOLR-773. I don't know how to
apply a patch but the jar file is in there. What else can I do?

org.apache.solr.common.SolrException: Error loading class
'org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin'
at 
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:373)
at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413)
at org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:435)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1498)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1492)
at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1525)
at org.apache.solr.core.SolrCore.initQParsers(SolrCore.java:1442)
at org.apache.solr.core.SolrCore.(SolrCore.java:548)
at 
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
at 
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at 
org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
at 
org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
at 
org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
at 
org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at 
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at 
org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
at org.mortbay.jetty.Server.doStart(Server.java:210)
at 
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.mortbay.start.Main.invokeMain(Main.java:183)
at org.mortbay.start.Main.start(Main.java:497)
at org.mortbay.start.Main.main(Main.java:115)
Caused by: java.lang.ClassNotFoundException:
org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin
at java.net.URLClassLoader$1.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Unknown Source)
at 
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:357)
... 33 more


Re: SpatialTierQueryParserPlugin Loading Error

2010-12-13 Thread Erick Erickson
This page shows you how to apply a patch:
http://wiki.apache.org/solr/HowToContribute
However, are you aware that
this is a patch to the *source* code and then you have
to compile it? A simpler approach would be to just grab either the trunk
build or a very recent
3.x build. See: https://hudson.apache.org/hudson/

You'll find both solr trunk and solr 3x. note that Hudson (and NOT
specifically Solr/Lucene) is
suffering some bogus "failures" so don't be alarmed by that status.

Follow the "build artifacts" link, then on down to "dist" and you'll find a
full package awaiting
installation with all the geospatial stuff already built...

Best
Erick

On Mon, Dec 13, 2010 at 10:06 PM, Adam Estrada <
estrada.adam.gro...@gmail.com> wrote:

> All,
>
> Can anyone shed some light on this error. I can't seem to get this
> class to load. I am using the distribution of Solr from Lucid
> Imagination and the Spatial Plugin from here
> https://issues.apache.org/jira/browse/SOLR-773. I don't know how to
> apply a patch but the jar file is in there. What else can I do?
>
> org.apache.solr.common.SolrException: Error loading class
> 'org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin'
>at
> org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:373)
>at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413)
>at
> org.apache.solr.core.SolrCore.createInitInstance(SolrCore.java:435)
>at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1498)
>at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1492)
>at org.apache.solr.core.SolrCore.initPlugins(SolrCore.java:1525)
>at org.apache.solr.core.SolrCore.initQParsers(SolrCore.java:1442)
>at org.apache.solr.core.SolrCore.(SolrCore.java:548)
>at
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
>at
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
>at
> org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
>at
> org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594)
>at org.mortbay.jetty.servlet.Context.startContext(Context.java:139)
>at
> org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218)
>at
> org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500)
>at
> org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
>at
> org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
>at
> org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
>at
> org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
>at
> org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117)
>at org.mortbay.jetty.Server.doStart(Server.java:210)
>at
> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40)
>at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
>at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
>at java.lang.reflect.Method.invoke(Unknown Source)
>at org.mortbay.start.Main.invokeMain(Main.java:183)
>at org.mortbay.start.Main.start(Main.java:497)
>at org.mortbay.start.Main.main(Main.java:115)
> Caused by: java.lang.ClassNotFoundException:
> org.apache.solr.spatial.tier.SpatialTierQueryParserPlugin
>at java.net.URLClassLoader$1.run(Unknown Source)
>at java.security.AccessController.doPrivileged(Native Method)
>at java.net.URLClassLoader.findClass(Unknown Source)
>at java.lang.ClassLoader.loadClass(Unknown Source)
>at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
>at java.lang.ClassLoader.loadClass(Unknown Source)
>at java.lang.Class.forName0(Native Method)
>at java.lang.Class.forName(Unknown Source)
>at
> org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:357)
>... 33 more
>


Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-13 Thread Yonik Seeley
On Mon, Dec 13, 2010 at 9:27 PM, Jonathan Rochkind  wrote:
> Yonik, how will maxWarmingSearchers in this scenario effect replication?  If 
> a slave is pulling down new indexes so quickly that the warming searchers 
> would ordinarily pile up, but maxWarmingSearchers is set to 1 what 
> happens?

Like any other commits, this will limit the number of searchers
warming in the background to 1.  If a commit is called, and that tries
to open a new searcher while another is already warming, it will fail.
 The next commit that does succeed will have all the updates though.

Today, this maxWarmingSearchers check is done after the writer has
closed and before a new searcher is opened... so calling commit too
often won't affect searching, but it will currently affect indexing
speed (since the IndexWriter is constantly being closed/flushed).

-Yonik
http://www.lucidimagination.com


Re: Userdefined Field type - Faceting

2010-12-13 Thread Yonik Seeley
Perhaps try overriding indexedToReadable() also?
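
A minimal sketch of that, with the package and class names assumed rather than
taken from your actual code:

package com.example.schema;

import org.apache.solr.schema.StrField;

public class IPAddrField extends StrField {
  // Faceting works off the raw indexed terms, so the field type has to map
  // them back to the external form itself via indexedToReadable().
  @Override
  public String indexedToReadable(String indexedForm) {
    long v = Long.parseLong(indexedForm, 16);   // e.g. "c002671d"
    return ((v >>> 24) & 0xff) + "." + ((v >>> 16) & 0xff) + "."
        + ((v >>> 8) & 0xff) + "." + (v & 0xff);
  }
}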

-Yonik
http://www.lucidimagination.com

On Mon, Dec 13, 2010 at 10:00 PM, Viswa S  wrote:
>
> Hello,
>
> We implemented an IP-Addr field type which internally stored the ips as 
> hex-ed string (e.g. "192.2.103.29" will be stored as "c002671d"). My 
> "toExternal" and "toInternal" methods for appropriate conversion seems to be 
> working well for query results, but however when faceting on this field it 
> returns the raw strings. in other words the query response would have 
> "192.2.103.29", but facet on the field would return " name="c002671d">1"
>
> Why are these methods not used by the faceting component to convert the 
> resulting values?
>
> Thanks
> Viswa
>


RE: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-13 Thread Jonathan Rochkind
Sorry, I guess I don't understand the details of replication enough. 

So a slave tries to replicate. It pulls down the new index files. It tries to do 
a commit but fails.  But "the next commit that does succeed will have all the 
updates." Since it's a slave, it doesn't get any commits of its own. But then 
some amount of time later, it does another replication pull. There are at this 
time maybe no _new_ changes since the last failed replication pull. Does this 
trigger a commit that will get those previous changes actually added to the 
slave?

In the meantime, between commits.. are those potentially large pulled new index 
files sitting around somewhere but not replacing the old slave index files, 
doubling disk space for those files?

Thanks for any clarification. 

Jonathan

From: ysee...@gmail.com [ysee...@gmail.com] On Behalf Of Yonik Seeley 
[yo...@lucidimagination.com]
Sent: Monday, December 13, 2010 10:41 PM
To: solr-user@lucene.apache.org
Subject: Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap 
getting collected?

On Mon, Dec 13, 2010 at 9:27 PM, Jonathan Rochkind  wrote:
> Yonik, how will maxWarmingSearchers in this scenario effect replication?  If 
> a slave is pulling down new indexes so quickly that the warming searchers 
> would ordinarily pile up, but maxWarmingSearchers is set to 1 what 
> happens?

Like any other commits, this will limit the number of searchers
warming in the background to 1.  If a commit is called, and that tries
to open a new searcher while another is already warming, it will fail.
 The next commit that does succeed will have all the updates though.

Today, this maxWarmingSearchers check is done after the writer has
closed and before a new searcher is opened... so calling commit too
often won't affect searching, but it will currently affect indexing
speed (since the IndexWriter is constantly being closed/flushed).

-Yonik
http://www.lucidimagination.com


Re: Very high load

2010-12-13 Thread Mark
Changing the subject: it's not related to replication after all. It only 
appeared after indexing an extra field which increased our index size 
from 12g to 20g+.


On 12/13/10 7:57 AM, Mark wrote:

Markus,

My configuration is as follows...






...
false
2
...
false
64
10
false
true

No cache warming queries and our machines have 8g of memory in them 
with about 5120m of RAM dedicated to Solr. When our index is around 
10-11g in size everything runs smoothly. At around 20g+ it just falls 
apart.


Can you (or anyone) provide some suggestions? Thanks


On 12/12/10 1:11 PM, Markus Jelsma wrote:
There can be numerous explanations such as your configuration (cache 
warm
queries, merge factor, replication events etc) but also I/O having 
trouble
flushing everything to disk. It could also be a memory problem, the 
OS might
start swapping if you allocate too much RAM to the JVM leaving little 
for the

OS to work with.

You need to provide more details.


After replicating an index of around 20g my slaves experience very high
load (50+!!)

Is there anything I can do to alleviate this problem?  Would solr cloud
be of any help?

thanks


Solr Memory Usage

2010-12-13 Thread Cameron Hurst
Hello all,

I am a new user to Solr and am currently in a testing phase before I try
and take my server into production. For my system I have a tomcat6
servlet running solr 1.4.1. Everything is running currently on my local
computer and it is parsing data from a local dump of the production
MySQL server. Things seem stable initially and I am able to query
everything and have not experienced any sort of errors. The problem has
to do with how much RAM the server uses compared to my expectations.

On initial start up the tomcat6 server, 90MB of RAM is used. This seems
normal compared to what I am expecting and what google searches give me.
From there, in my solrconfig.xml settings I have the maximum RAM buffer set
to be 32 MB. Because of this, I am expecting that on top of the 90MB used at
startup, indexing and other running operations will load an additional 32MB of
information into RAM. Along with HTTP request and other buffers, I am assuming
that the total amount of RAM usage for the servlet and Solr should be about
150MB with these settings.
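
In solrconfig.xml that is presumably the setting below:

<ramBufferSizeMB>32</ramBufferSizeMB>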

As I start to index data and pass queries to the database I notice a
steady rise in the RAM, but it doesn't stop at 150MB. If I continue to
reindex the exact same data set with no additional data entries, the RAM
continuously increases. I stopped looking as the RAM increased beyond
350MB, started to try to debug it, and can't find anything obvious
from my beginner's viewpoint. I ran a memory leak check from the
web manager and that came up with no leaks.

Are my expectations here unreasonable? Am I completely wrong with my
assumptions that I should only have about 150MB and that 350 is
perfectly fine? I am just trying to sort this out because on my
production server I have a limited amount of RAM and I need to minimize
this as much as possible.

Thanks,

Cameron


RAM usage issues

2010-12-13 Thread Cameron Hurst
hello all,

I am a new user to Solr and I am having a few issues with the setup and
wondering if anyone had some suggestions. I am currently running this as
just a test environment before I go into production. I am using a
tomcat6 environment for my servlet and solr 1.4.1 as the solr build. I
set it up following the guide here:
http://wiki.apache.org/solr/SolrTomcat
The issue that I am having is
that the memory usage seems high for the settings I have.

When I start the server I am using about 90MB of RAM, which is fine and
from the Google searches I found that is normal. The issue comes when I
start indexing data. In my solrconfig.xml file my maximum RAM buffer
is 32MB. In my mind that means that the maximum RAM being used by the
servlet should be 122MB, but increasing to 150MB isn't out of my reach.
When I start indexing data and calling searches my memory usages slowly
keeps on increasing. The odd thing about it is that when I reindex the
exact same data set the memory usage increases every time but no new
data has been entered to be indexed. I stopped watching as the usage went over
350MB of RAM.

So my question in all of this is whether this is normal and why the RAM
buffer setting isn't being observed. Are my expectations unreasonable and
flawed? Or could there be something else in my settings that is causing
the memory usage to increase like this.

Thanks for the help,

Cameron


RE: OutOfMemory GC: GC overhead limit exceeded - Why isn't WeakHashMap getting collected?

2010-12-13 Thread Upayavira
The second commit will bring in all changes, from both syncs. 

Think of the sync part as a glorified rsync of files on disk. So the
files will have been copied to disk, but the in memory index on the
slave will not have noticed that those files have changed. The commit is
intended to remedy that - it causes a new index reader to be created,
based upon the new on disk files, which will include updates from both
syncs.

Upayavira

On Mon, 13 Dec 2010 23:11 -0500, "Jonathan Rochkind" 
wrote:
> Sorry, I guess I don't understand the details of replication enough. 
> 
> So slave tries to replicate. It pulls down the new index files. It tries
> to do a commit but fails.  But "the next commit that does succeed will
> have all the updates." Since it's a slave, it doesn't get any commits of
> it's own. But then some amount of time later, it does another replication
> pull. There are at this time maybe no _new_ changes since the last failed
> replication pull. Does this trigger a commit that will get those previous
> changes actually added to the slave?
> 
> In the meantime, between commits.. are those potentially large pulled new
> index files sitting around somewhere but not replacing the old slave
> index files, doubling disk space for those files?
> 
> Thanks for any clarification. 
> 
> Jonathan
> 
> From: ysee...@gmail.com [ysee...@gmail.com] On Behalf Of Yonik Seeley
> [yo...@lucidimagination.com]
> Sent: Monday, December 13, 2010 10:41 PM
> To: solr-user@lucene.apache.org
> Subject: Re: OutOfMemory GC: GC overhead limit exceeded - Why isn't
> WeakHashMap getting collected?
> 
> On Mon, Dec 13, 2010 at 9:27 PM, Jonathan Rochkind 
> wrote:
> > Yonik, how will maxWarmingSearchers in this scenario effect replication?  
> > If a slave is pulling down new indexes so quickly that the warming 
> > searchers would ordinarily pile up, but maxWarmingSearchers is set to 1 
> > what happens?
> 
> Like any other commits, this will limit the number of searchers
> warming in the background to 1.  If a commit is called, and that tries
> to open a new searcher while another is already warming, it will fail.
>  The next commit that does succeed will have all the updates though.
> 
> Today, this maxWarmingSearchers check is done after the writer has
> closed and before a new searcher is opened... so calling commit too
> often won't affect searching, but it will currently affect indexing
> speed (since the IndexWriter is constantly being closed/flushed).
> 
> -Yonik
> http://www.lucidimagination.com
>