problem of solr replication's speed

2010-10-31 Thread kafka0102
It takes about one hour to replicate a 6G index for Solr in my env. But my 
network can transfer files at about 10-20M/s using scp. So Solr's HTTP 
replication seems too slow. Is this normal, or am I doing something wrong?


Re: problem of solr replication's speed

2010-10-31 Thread Peter Karich

 we have an identical-sized index and it takes ~5 minutes


It takes about one hour to replicate a 6G index for Solr in my env. But 
my network can transfer files at about 10-20M/s using scp. So Solr's HTTP 
replication seems too slow. Is this normal, or am I doing something wrong?






Re: Newbie to Solr, LIKE:foo

2010-10-31 Thread Erick Erickson
Not really. The problem here is that to perform this as a raw wildcard query,
you'd need to enumerate every term in the index, which is pretty slow.

One solution is to use one of the ngram tokenizers, probably the
NGramFilterFactory, to process the output of your tokenizer. Here's a
related place to start...
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
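
A rough sketch of why this works, assuming era-appropriate Lucene (~3.0)
contrib classes (NGramTokenFilter is what NGramFilterFactory wraps): with
3-grams, every 3-character window of a term is indexed, so a gram like "foo"
matches without enumerating terms:

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class NGramDemo {
    public static void main(String[] args) throws Exception {
        // Tokenize "foobar", then emit every 3-character window.
        TokenStream ts = new NGramTokenFilter(
                new WhitespaceTokenizer(new StringReader("foobar")), 3, 3);
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
            System.out.println(term.term()); // foo, oob, oba, bar
        }
    }
}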

HTH
Erick

On Fri, Oct 29, 2010 at 3:32 AM, MilleBii  wrote:

> I'm Nutch user but I'm considering to use Solr for the following reason.
>
> I need a LIKE:foo, which turns into a *foo* query. I saw the built-in
> prefix query parser, but it only looks for foo*, if I understand it well.
> So is there a query parser that does what I'm looking for?
> If not, how difficult is it to build one with Solr?
>
> --
> -MilleBii-
>


Re: org.tartarus package in lucene/solr?

2010-10-31 Thread Erick Erickson
In what? Where? What's the problem you're seeing? Why do you ask?

Please review: http://wiki.apache.org/solr/UsingMailingLists


Best
Erick

On Fri, Oct 29, 2010 at 4:19 AM, Tharindu Mathew wrote:

> Hi,
>
> How come $subject is present??
>
> --
> Regards,
>
> Tharindu
>


Re: Ensuring stable timestamp ordering

2010-10-31 Thread Erick Erickson
Oh, I didn't realize that, thanks!

Erick

On Sat, Oct 30, 2010 at 10:27 PM, Lance Norskog  wrote:

> Hi-
>
> NOW does not get re-run for each document. If you give a large upload
> batch, the same NOW is given to each document.
>
> It would be handy to have an auto-incrementing date field, so that
> each document would get a unique number and the timestamp would then
> be the unique ID of the document.
>
> On Sat, Oct 30, 2010 at 7:19 PM, Erick Erickson 
> wrote:
> > What are the actual values in your index? I'm wondering if they
> > all get the same values somehow, perhaps due to the granularity
> > of your dates? And (and I'm really grasping at straws here) your
> > <commit/> is causing enough delay to have time intervals be greater
> > than your granularity.
> >
> > Unfortunately, that doesn't make much sense either. If you sort on a
> > field, the tiebreaker should be the document ID order absent secondary
> > sorts...
> >
> > So, can you post the results of adding &debugQuery=on to your URL?
> > Also, use the schema browser from the admin page to see what you
> > actually have in your index.
> >
> > Not much help, but the best I can do this evening.
> >
> > Erick
> >
> > On Thu, Oct 28, 2010 at 9:58 PM, Michael Sokolov wrote:
> >
> >> (Sorry - fumble finger sent too soon.)
> >>
> >>
> >> My confusion stems from the fact that in my test I insert a number of
> >> documents, and then retrieve them ordered by timestamp, and they don't
> come
> >> back in the same order they were inserted (the order seems random),
> unless
> >> I
> >> commit after each insert.
> >>
> >> Is that expected?  I could create my own timestamp values easily enough,
> >> but
> >> would just as soon not do so if I could use a pre-existing feature that
> >> seems tailor-made.
> >>
> >> -Mike
> >>
> >> > -Original Message-
> >> > From: Michael Sokolov [mailto:soko...@ifactory.com]
> >> > Sent: Thursday, October 28, 2010 9:55 PM
> >> > To: 'solr-user@lucene.apache.org'
> >> > Subject: Ensuring stable timestamp ordering
> >> >
> >> > I'm curious what if any guarantees there are regarding the
> >> > "timestamp" field that's defined in the sample solr
> >> > schema.xml.  Just for completeness, the definition is:
> >> >
> >> <field name="timestamp" type="date" indexed="true" stored="true"
> >>        default="NOW" multiValued="false"/>
> >>
> >
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Re: Basic Document Question

2010-10-31 Thread Erick Erickson
I guess that depends on what you mean by re-index, but here are some
guesses.
All of them share the assumption that you can determine #what# you want to
index from the various sites. That is, you have some way of identifying
the content you care about.

Solr won't help you at all in identifying what you really want, it just
follows the orders you give it when you tell it to index content.


> if you already have junk in your Solr index that you want to remove, you can
delete by query (and risk removing valuable stuff); see the sketch below. You
could also reindex from scratch.

> #Assuming# you have a unique key defined, and you're really asking about
updating documents, you don't have to do anything. If your schema.xml file
has <uniqueKey> identifying a particular field, just add your document again
and Solr will automatically delete the old version and add the new one.
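
A minimal SolrJ sketch of the delete-by-query option (hedged: the host URL
and the "site" field are illustrative assumptions, not from this thread;
SolrJ 1.4 API):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class PurgeJunk {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr URL and field name; adjust to your schema.
        CommonsHttpSolrServer solr =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        // Remove everything crawled from an unwanted host, then commit.
        solr.deleteByQuery("site:lawfirm.example.com");
        solr.commit();
    }
}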

If none of this makes sense, perhaps you can give us a better idea of what
updating means in your use case...

This forum concentrates on Solr; there's a Nutch forum that'll help you
there, and I haven't a clue about Drupal.

Best
Erick

On Sat, Oct 30, 2010 at 3:34 PM, Eric Martin  wrote:

> Hi everyone,
>
>
>
> I'm new which won't be hard to figure out after I ask this question:
>
>
>
> I use Drupal/Solr/Nutch
>
>
>
>
> http://svn.apache.org/viewvc/lucene/dev/trunk/solr/example/solr/conf/schema.
> xml?view=markup
>
>
>
> Solr specific:
>
> How do I re-index for specific content only? I am starting a legal index
> specifically geared for law students and lawyers. I am crawling law-related
> sites but I really don't want to index law firms, just the law content on
> places like:
>
> http://www.ecasebriefs.com/blog/law/
>
> http://www.lawnix.com/cases/cases-index/
>
> http://www.oyez.org/
>
> http://www.4lawnotes.com/
>
> http://www.docstoc.com/documents/education/law-school/case-briefs
>
> http://www.lawschoolcasebriefs.com/
>
> http://dictionary.findlaw.com 
>
>
>
> As I was saying, while crawling I get all kinds of extraneous information
> put into the Solr index. How do I combat that?
>
>
>
> I am assuming (cough) that I can do this but I am really at a loss as to
> where I start to look to get this done. I prefer to learn and I definitely
> don't want to waste anyone's time.
>
>
>
> Non-Solr Specific
>
> Does anyone here help with nutch or is this Solr only?
>
>
>
> I am sorry if I am asking elementary questions and am asking in the wrong
> place. I just need to be pointed to the right place. I'm sort of
> lost. (Imagine that.)
>
>
>
> Thanks
>
>
>
> Eric
>
>
>
>
>
>
>
>


RE: Ensuring stable timestamp ordering

2010-10-31 Thread Toke Eskildsen
Lance Norskog [goks...@gmail.com] wrote:
> It would be handy to have an auto-incrementing date field, so that
> each document would get a unique number and the timestamp would then
> be the unique ID of the document.

If someone wants to implement this, I'll just note that the granularity of Solr 
dates is fixed at milliseconds:
http://lucene.apache.org/solr/api/org/apache/solr/schema/DateField.html

Using ms for unique timestamps means limiting the index rate to 1000 
documents/second. That might be okay for some applications but a serious 
limiter for others (our Lucene index update rate varies between 300 and 1600 
documents/second, depending on content; I am sure others have much higher 
rates). One could do tricks, but it is just plain ugly to use something like 
"Tenths of milliseconds since epoch", so switching to longs and nanoseconds 
seems to be the clean choice if we want the timestamps to be "true" timestamps 
and not just a unique integer-ID generator.

Re: Ensuring stable timestamp ordering

2010-10-31 Thread Michael Sokolov
Hmm - personally, I wouldn't want to rely on timestamps as a unique-id 
generation scheme.  Might we not one day want to have distributed 
parallel indexing that merges lazily?  Keeping timestamps unique and in 
sync across multiple nodes would be a tough requirement. I would be 
happy simply having NOW be more fine-grained, and this does seem like 
something that would be nice to have at a fairly low level, but as I 
said, if it would introduce backward-compatibility problems, it's easy 
enough to create a timestamp field in the indexing feed.


Thank you for clarifying this.

-Mike


On 10/31/2010 11:33 AM, Toke Eskildsen wrote:

Lance Norskog [goks...@gmail.com] wrote:

It would be handy to have an auto-incrementing date field, so that
each document would get a unique number and the timestamp would then
be the unique ID of the document.

If someone wants to implement this, I'll just note that the granularity of Solr 
dates is fixed at milliseconds:
http://lucene.apache.org/solr/api/org/apache/solr/schema/DateField.html

Using ms for unique timestamps means limiting the index rate to 1000 documents/second. That might 
be okay for some applications but a serious limiter for others (our Lucene index update rate varies 
between 300 and 1600 documents/second, depending on content; I am sure others have much higher 
rates). One could do tricks, but it is just plain ugly to use something like "Tenths of 
milliseconds since epoch", so switching to longs and nanoseconds seems to be the clean choice 
if we want the timestamps to be "true" timestamps and not just a unique integer-ID 
generator.




indexing '-

2010-10-31 Thread PeterKerk

I have a city named 's-Hertogenbosch

I want it to be indexed exactly like that, so "'s-Hertogenbosch" (without
the surrounding quotes).

But now I get:

1
1
1


What filter should I add/remove from my field definition?

I already tried a new fieldtype with just this, but no luck:

  


  



My schema.xml


  








  











Re: indexing '-

2010-10-31 Thread Ken Stanley
On Sun, Oct 31, 2010 at 12:12 PM, PeterKerk  wrote:

>
> I have a city named 's-Hertogenbosch
>
> I want it to be indexed exactly like that, so "'s-Hertogenbosch" (without
> the surrounding quotes).
>
> But now I get:
> 
>1
>1
>1
> 
>
> What filter should I add/remove from my field definition?
>
> I already tried a new fieldtype with just this, but no luck:
> positionIncrementGap="100" >
>  
>
> ignoreCase="true" expand="false"/>
>  
>
>
>
> My schema.xml
>
> positionIncrementGap="100" >
>  
>
> ignoreCase="true" expand="false"/>
> words="stopwords_dutch.txt" />
> generateWordParts="0" generateNumberParts="0" catenateWords="1"
> catenateNumbers="1" catenateAll="0"/>
>
>
> protected="protwords.txt"/>
>
>  
>
>
> 
>
>
>
>
>
>
>

For exact text, you should try using either the string type, or a type that
only uses the KeywordTokenizer. Other field types may perform
transformations on the text similar to what you are seeing.
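
A small sketch of this point (hedged: era-appropriate Lucene ~3.0 classes,
not from the thread). KeywordAnalyzer wraps the KeywordTokenizer and passes
the whole input through as a single, unmodified token, which is the behavior
wanted here:

import java.io.StringReader;

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class KeywordDemo {
    public static void main(String[] args) throws Exception {
        // The entire field value survives as one token, punctuation intact.
        TokenStream ts = new KeywordAnalyzer()
                .tokenStream("city", new StringReader("'s-Hertogenbosch"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        while (ts.incrementToken()) {
            System.out.println(term.term()); // 's-Hertogenbosch
        }
    }
}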

- Ken


Re: indexing '-

2010-10-31 Thread PeterKerk

I already tried the normal string type, but that doesn't work either.
I now use this:

  

  


But that doesn't do it either... what else can I try?

Thanks!


Re: Modelling Access Control

2010-10-31 Thread Dennis Gearon
Ah haaa. I see now. :-) 

I didn't make that connection. Hopefully I would have before I ever tried to 
implement that :-)

Kind of like user names and icons on a windows login :-)

Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


--- On Sat, 10/30/10, Erick Erickson  wrote:

> From: Erick Erickson 
> Subject: Re: Modelling Access Control
> To: solr-user@lucene.apache.org
> Date: Saturday, October 30, 2010, 6:01 PM
> If that's in response to Lance's comment, the answer is that if you return
> autosuggest possibilities you effectively allow users to see data they
> shouldn't. Imagine you have a field of the real names of spies. You only
> want the persons way high up in the security chain to access these names and
> you control that on a document level.
> 
> Allowing autocomplete on that field would be...er...very tough on your
> spies' health...
> 
> HTH
> Erick
> 
> On Tue, Oct 26, 2010 at 2:24 PM, Dennis Gearon wrote:
> 
> > "Son, don't touch that stove . . . .",
> >
> > "OUCH! Hey Dad, I BURNED my hand on that stove, why didn't you tell me
> > that?!?#! You know I need to know WHY, not just DON'T!"
> >
> > Dennis Gearon
> >
> > > Very important: do not make a spelling or autosuggest index from a
> > > text field which some people can see and other people can't.
> > >
> >
> >
>


Re: Consulting in Solr tuning, stop words, dictionary, etc

2010-10-31 Thread Dennis Gearon
Thanks Erick.

Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


--- On Sat, 10/30/10, Erick Erickson  wrote:

> From: Erick Erickson 
> Subject: Re: Consulting in Solr tuning, stop words, dictionary, etc
> To: solr-user@lucene.apache.org
> Date: Saturday, October 30, 2010, 6:59 PM
> Well, that all depends on what you want. Here's a list of Solr consultants.
> 
> http://wiki.apache.org/solr/Support
> 
> HTH
> Erick
> 
> On Thu, Oct 28, 2010 at 4:21 PM, Dennis Gearon wrote:
> 
> > Speaking of jobs on this list . . . .
> >
> > How much does a good consultant for Solr work cost?
> >
> > I am interested first in English, but then in other languages around the
> > world. Just need budgetary amounts for a business plan.
> >
> > 1-6mos, or till BIG DOLLARS, whichever comes first ;-)
> >
> >
> > Dennis Gearon
> >
> > Signature Warning
> > 
> > It is always a good idea to learn from your own
> mistakes. It is usually a
> > better idea to learn from others’ mistakes, so you
> do not have to make them
> > yourself. from '
> > http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
> >
> > EARTH has a Right To Life,
> >  otherwise we all die.
> >
>


RE: Ensuring stable timestamp ordering

2010-10-31 Thread Dennis Gearon
Even microseconds may not be enough on some really good, fast machines.
Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


--- On Sun, 10/31/10, Toke Eskildsen  wrote:

> From: Toke Eskildsen 
> Subject: RE: Ensuring stable timestamp ordering
> To: "solr-user@lucene.apache.org" 
> Date: Sunday, October 31, 2010, 8:33 AM
> Lance Norskog [goks...@gmail.com]
> wrote:
> > It would be handy to have an auto-incrementing date
> field, so that
> > each document would get a unique number and the
> timestamp would then
> > be the unique ID of the document.
> 
> If someone wants to implement this, I'll just note that the
> granularity of Solr dates is fixed at milliseconds:
> http://lucene.apache.org/solr/api/org/apache/solr/schema/DateField.html
> 
> Using ms for unique timestamps means limiting the index
> rate to 1000 documents/second. That might be okay for some
> applications but a serious limiter for others (our Lucene
> index update rate varies between 300 and 1600
> documents/second, depending on content; I am sure others
> have much higher rates). One could do tricks, but it is just
> plain ugly to use something like "Tenths of milliseconds
> since epoch", so switching to longs and nanoseconds seems to
> be the clean choice if we want the timestamps to be "true"
> timestamps and not just a unique integer-ID generator.


Re: indexing '-

2010-10-31 Thread Savvas-Andreas Moysidis
One way to view how your Tokenizer/Filter chain transforms your input
terms is to use the analysis page of the Solr admin web application. This
is very handy when troubleshooting issues related to how terms are indexed.

On 31 October 2010 17:13, PeterKerk  wrote:

>
> I already tried the normal string type, but that doesn't work either.
> I now use this:
> omitNorms="true">
>  
>
>  
>
>
> But that doesn't do it either... what else can I try?
>
> Thanks!
>


Re: Commit/Optimise question

2010-10-31 Thread Savvas-Andreas Moysidis
Thanks Erick. For the record, we are using 1.4.1 and SolrJ.

On 31 October 2010 01:54, Erick Erickson  wrote:

> What version of Solr are you using?
>
> About committing. I'd just let the Solr defaults handle that. You configure
> this in the autocommit section of solrconfig.xml. I'm pretty sure this gets
> triggered even if you're using SolrJ.
>
> That said, it's probably wise to issue a commit after all your data is
> indexed
> too, just to flush any remaining documents since the last autocommit.
>
> Optimize should not be issued until you're all done, if at all. If
> you're not deleting (or updating) documents, don't bother to optimize
> unless the number of files in your index directory gets really large.
> Recent Solr code almost removes the need to optimize unless you
> delete documents, but I confess I don't know the revision number
> "recent" refers to, perhaps only trunk...
>
> HTH
> Erick
>
> On Thu, Oct 28, 2010 at 9:56 AM, Savvas-Andreas Moysidis <
> savvas.andreas.moysi...@googlemail.com> wrote:
>
> > Hello,
> >
> > We currently index our data through a SQL-DIH setup but due to our model
> > (and therefore sql query) becoming complex we need to index our data
> > programmatically. As we didn't have to deal with commit/optimise before,
> we
> > are now wondering whether there is an optimal approach to that. Is there
> a
> > batch size after which we should fire a commit or should we execute a
> > commit
> > after indexing all of our data? What about optimise?
> >
> > Our document corpus is > 4m documents and through DIH the resulting index
> > is
> > around 1.5G
> >
> > We have searched previous posts but couldn't find a definite answer. Any
> > input much appreciated!
> >
> > Regards,
> > -- Savvas
> >
>
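
A minimal SolrJ 1.4 sketch of the pattern Erick describes above: batch the
adds, let autocommit handle intermediate flushes, issue one commit at the
end, and treat optimize as optional. The URL, field names, and batch size
are illustrative assumptions, not from the thread:

import java.util.ArrayList;
import java.util.Collection;

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer solr =
            new CommonsHttpSolrServer("http://localhost:8983/solr");

        Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 4000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", i);
            doc.addField("text", "document body " + i);
            batch.add(doc);
            if (batch.size() == 1000) {  // send in chunks; autocommit does the rest
                solr.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) solr.add(batch);
        solr.commit();    // flush whatever arrived since the last autocommit
        solr.optimize();  // optional; mainly worthwhile after deletes/updates
    }
}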


Start parameter and result grouping

2010-10-31 Thread Pavel Minchenkov
Hi,

I'm trying to implement paging when grouping is on.

The start parameter works, but the result contains all the documents that
come before it.

http://localhost:8983/solr/select?q=test&group=true&group.field=marketplaceId&group.limit=1&rows=1&start=0 (I get 1 document).
http://localhost:8983/solr/select?q=test&group=true&group.field=marketplaceId&group.limit=1&rows=1&start=1 (I get 2 documents).
...
http://localhost:8983/solr/select?q=test&group=true&group.field=marketplaceId&group.limit=1&rows=1&start=N (I get N documents).

But in all these queries, I should get only one document in results. Am I
right?

I'm using this build: https://hudson.apache.org/hudson/job/Solr-trunk/1297/

Thanks.

-- 
Pavel Minchenkov


Re: Start parameter and result grouping

2010-10-31 Thread Markus Jelsma
Ah, seems you're just one day behind. SOLR-2207, paging with field collapsing, 
has just been resolved:
https://issues.apache.org/jira/browse/SOLR-2207


> Hi,
> 
> I'm trying to implement paging when grouping is on.
> 
> The start parameter works, but the result contains all the documents that
> come before it.
> 
> http://localhost:8983/solr/select?q=test&group=true&group.field=marketplaceId&group.limit=1&rows=1&start=0 (I get 1 document).
> http://localhost:8983/solr/select?q=test&group=true&group.field=marketplaceId&group.limit=1&rows=1&start=1 (I get 2 documents).
> ...
> http://localhost:8983/solr/select?q=test&group=true&group.field=marketplaceId&group.limit=1&rows=1&start=N (I get N documents).
> 
> But in all these queries, I should get only one document in results. Am I
> right?
> 
> I'm using this build: https://hudson.apache.org/hudson/job/Solr-trunk/1297/
> 
> Thanks.


Re: Start parameter and result grouping

2010-10-31 Thread Markus Jelsma
Oh, and see the just updated wiki page as well:
http://wiki.apache.org/solr/FieldCollapsing

> Ah, seems you're just one day behind. SOLR-2207, paging with field
> collapsing, has just been resolved:
> https://issues.apache.org/jira/browse/SOLR-2207
> 
> > Hi,
> > 
> > I'm trying to implement paging when grouping is on.
> > 
> > The start parameter works, but the result contains all the documents
> > that come before it.
> > 
> > http://localhost:8983/solr/select?q=test&group=true&group.field=marketplaceId&group.limit=1&rows=1&start=0 (I get 1 document).
> > http://localhost:8983/solr/select?q=test&group=true&group.field=marketplaceId&group.limit=1&rows=1&start=1 (I get 2 documents).
> > ...
> > http://localhost:8983/solr/select?q=test&group=true&group.field=marketplaceId&group.limit=1&rows=1&start=N (I get N documents).
> > 
> > But in all these queries, I should get only one document in results. Am I
> > right?
> > 
> > I'm using this build:
> > https://hudson.apache.org/hudson/job/Solr-trunk/1297/
> > 
> > Thanks.


RE: Ensuring stable timestamp ordering

2010-10-31 Thread Toke Eskildsen
Dennis Gearon [gear...@sbcglobal.net] wrote:
> Even microseconds may not be enough on some really good, fast machine.

True, especially since the timer might not provide microsecond granularity 
although the returned value is in microseconds. However, a unique timestamp 
generator should keep track of the previous timestamp to guard against 
duplicates. Uniqueness can thus be guaranteed by waiting a bit or cheating on 
the decimals. With microseconds we can produce 1 million timestamps / second. 
While I agree that duplicates within microseconds can occur on a fast machine, 
guaranteeing uniqueness by waiting should only be a performance problem when 
the number of duplicates is high. That's still a few years off, I think.
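
A minimal sketch of such a generator (an illustration of the description
above, not existing Solr code): it remembers the last value handed out and
"cheats on the decimals" by bumping forward whenever the clock has not
advanced:

public final class UniqueTimestampGenerator {
    private long last = 0L;

    // Returns microseconds since epoch, strictly increasing within this JVM.
    // System.currentTimeMillis() only has millisecond granularity, so repeated
    // calls in the same millisecond are disambiguated by incrementing the
    // otherwise-unused microsecond decimals.
    public synchronized long nextMicros() {
        long now = System.currentTimeMillis() * 1000L;
        if (now <= last) {
            now = last + 1;  // cheat on the decimals instead of waiting
        }
        last = now;
        return now;
    }
}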

As Michael pointed out, using normal timestamps as unique IDs might not be such 
a great idea as it effectively locks index-building to a single JVM. By going 
the ugly route of expressing the time in nanos with only microsecond 
granularity and using the last 3 decimals for a builder ID, this could be fixed. 
Not very clean though, as the contract is not expressed in the data themselves 
but must nevertheless be obeyed by all builders to avoid collisions. It also 
raises the question of who should assign the builder IDs. Not trivial in an 
anarchistic setup where new builders can be added by different controllers.

Pragmatists might use the PID % 1000 or similar for the builder ID as it does 
not require coordination, but this is where the Birthday Paradox hits us again: 
The chance of two processes on different machines having the same PID is 10% if 
just 15 machines are used (1% for 5 machines, 50% for 37 machines). I don't 
like those odds and that's assuming that the PIDs will be randomly distributed, 
which they won't. It could be lowered by reserving more decimals for the salt, 
but then we would decrease the maximum amount of timestamps / second, still 
without guaranteed uniqueness. Guys a lot smarter than me have spent time on the 
unique ID problem and it's clearly not easy: Java's UUID takes up 128 bits.

- Toke

Re: indexing '-

2010-10-31 Thread Erick Erickson
Did you restart Solr after the changes? Did you reindex? Because the string
type should do what you want.

And you've shown us <fieldType> definitions. What <field> are you using with
them?

Best
Erick

On Sun, Oct 31, 2010 at 1:13 PM, PeterKerk  wrote:

>
> I already tried the normal string type, but that doesn't work either.
> I now use this:
> omitNorms="true">
>  
>
>  
>
>
> But that doesn't do it either... what else can I try?
>
> Thanks!
>


Re: Searching with wrong keyboard layout or using translit

2010-10-31 Thread Alexey Serba
Another approach for this problem is to use another Solr core for
storing users queries for auto complete functionality ( see
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
) and index not only user_query field, but also transliterated and
diff_layout versions and use dismax query parser to search suggestions
in all fields.

This solution is only viable if you have a huge log of user queries
(which I believe Google does).

HTH,
Alex



2010/10/29 Alexander Kanarsky :
> Pavel,
>
> it depends on the size of your document corpus, the complexity and types of
> the queries you plan to use, etc. I would recommend searching for
> the discussions on synonym expansion in Lucene (index time vs. query
> time tradeoffs etc.) since your problem is quite similar to that
> (think Moskva vs. Moskwa). Unless you have a small corpus, I would go
> with the second approach and expand the terms during the query time.
> However, the first approach might be useful, too: say, you may want to
> boost the score for the documents that naturally contain the word
> 'Moskva', so such documents will be at the top of the result list.
> Having both forms indexed will allow you to achieve this easily by
> utilizing Solr's dismax query (to boost the results from the field
> with the original terms):
> http://localhost:8983/solr/select/?q=Moskva&defType=dismax&qf=text^10.0+text_translit^0.1
> ('text' field has the original Cyrillic tokens, 'text_translit' is for
> transliterated ones)
>
> -Alexander
>
>
> 2010/10/28 Pavel Minchenkov :
>> Alexander,
>>
>> Thanks,
>> What variant has better performance?
>>
>>
>> 2010/10/28 Alexander Kanarsky 
>>
>>> Pavel,
>>>
>>> I think there is no single way to implement this. Some ideas that
>>> might be helpful:
>>>
>>> 1. Consider adding additional terms while indexing. This assumes
>>> conversion of Russian text to both "translit" and "wrong keyboard"
>>> forms and index converted terms along with original terms (i.e. your
>>> Analyzer/Filter should produce Moskva and Vjcrdf for term Москва). You
>>> may re-use the same field (if you plan for a simple term queries) or
>>> create a separate fields for the generated terms (better for phrase,
>>> proximity queries etc. since it keeps the original text positional
>>> info). Then the query could use any of these forms to fetch the
>>> document. If you use separate fields, you'll need to expand/create
>>> your query to search for them, of course.
>>> 2. If you have to index just the original Russian text, you might
>>> generate all term forms while analyzing the query, then you could
>>> treat the converted terms as synonyms and use the combination of
>>> TermQuery for all term forms or the MultiPhraseQuery for the phrases.
>>> For Solr in this case you probably will need to add a custom filter
>>> similar to SynonymFilter.
>>>
>>> Hope this helps,
>>> -Alexander
>>>
>>> On Wed, Oct 27, 2010 at 1:31 PM, Pavel Minchenkov 
>>> wrote:
>>> > Hi,
>>> >
>>> > When I'm trying to search Google with wrong keyboard layout -- it
>>> corrects
>>> > my query, example: http://www.google.ru/search?q=vjcrdf (I typed word
>>> > "Moscow" in Russian but in English keyboard layout).
>>> > Also, when I'm searching using
>>> > translit, It does the same: http://www.google.ru/search?q=moskva
>>> >
>>> > What is the right way to implement this feature in Solr?
>>> >
>>> > --
>>> > Pavel Minchenkov
>>> >
>>>
>>
>>
>>
>> --
>> Pavel Minchenkov
>>
>
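
To make the two generated forms concrete, here is a toy sketch (not from the
thread; the maps cover only the letters of "москва" for illustration, and a
real filter would handle the full alphabet, including multi-character
transliterations) of the transliteration and wrong-keyboard-layout
conversions such an Analyzer/Filter would apply:

import java.util.HashMap;
import java.util.Map;

public class QueryForms {
    private static final Map<Character, Character> TRANSLIT = new HashMap<Character, Character>();
    private static final Map<Character, Character> LAYOUT = new HashMap<Character, Character>();
    static {
        // Partial transliteration map (Cyrillic -> Latin sound-alike).
        TRANSLIT.put('м', 'm'); TRANSLIT.put('о', 'o'); TRANSLIT.put('с', 's');
        TRANSLIT.put('к', 'k'); TRANSLIT.put('в', 'v'); TRANSLIT.put('а', 'a');
        // Partial layout map (Cyrillic key -> same key on an English layout).
        LAYOUT.put('м', 'v'); LAYOUT.put('о', 'j'); LAYOUT.put('с', 'c');
        LAYOUT.put('к', 'r'); LAYOUT.put('в', 'd'); LAYOUT.put('а', 'f');
    }

    static String convert(String ru, Map<Character, Character> map) {
        StringBuilder sb = new StringBuilder();
        for (char c : ru.toLowerCase().toCharArray()) {
            sb.append(map.containsKey(c) ? map.get(c) : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(convert("москва", TRANSLIT)); // moskva
        System.out.println(convert("москва", LAYOUT));   // vjcrdf
    }
}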


Solr in virtual host as opposed to /lib

2010-10-31 Thread Eric Martin
Is there an issue running Solr in /home/lib as opposed to running it
somewhere outside of the virtual hosts like /lib?

Eric



Design and Usage Questions

2010-10-31 Thread getagrip

Hi,

I've got some basic usage / design questions.

1. The SolrJ wiki proposes to use the same CommonsHttpSolrServer
   instance for all requests to avoid connection leaks.
   So if I create a singleton instance upon application startup, I can
   safely use this instance for ALL queries/updates throughout my
   application without running into performance issues?

2. My System's documents are stored in a Subversion repository.
   For fast search results I want to periodically index new documents
   from the repository.

   What I get from the repository is a ByteArrayOutputStream. How can I
   pass this Stream to Solr?

   I only see possibilities to pass Files but in my case it does not
   make sense to write the ByteArrayOutputStream to disk again as this
   would cause performance issues apart from making no sense anyway.

3. Are there any disadvantages using Solrj over some other HTTP based
   solution e.g. creating & sending my own HTTP requests? Do I even
   have to use HTTP?
   I see the EmbeddedSolrServer exists. Any drawbacks using that?

Any hints are welcome, Thanks!


Re: Solr in virtual host as opposed to /lib

2010-10-31 Thread Erick Erickson
Can you expand on your question? Are you having a problem? Is this idle
curiosity?

Because I have no idea how to respond when there is so little information.

Best
Erick

On Sun, Oct 31, 2010 at 5:32 PM, Eric Martin  wrote:

> Is there an issue running Solr in /home/lib as opposed to running it
> somewhere outside of the virtual hosts like /lib?
>
> Eric
>
>


RE: Solr in virtual host as opposed to /lib

2010-10-31 Thread Eric Martin
Hi,

Thank you. This is more than idle curiosity. I am trying to debug an issue I
am having with my installation and this is one step in verifying that I have
a setup that does not consume resources. I am trying to debunk my internal
myth that having Solr and Nutch in a virtual host would be causing these
issues. Here is the main issue that involves Nutch/Solr and Drupal:

/home/mootlaw/lib/solr
/home/mootlaw/lib/nutch
/home/mootlaw/www/

I'm running a 1333 FSB Dual Socket Xeon 5500 Series @ 2.4ghz, Enterprise
Linux - x86_64 - OS, 12 Gig RAM. My Solr and Nutch are running. I am using
jetty for my Solr. My server is not rooted.

Nutch is using 100% of my cpus. I see this in my CPU utilization in my whm:

/usr/bin/java -Xmx1000m -Dhadoop.log.dir=/home/mootlaw/lib/nutch/logs
-Dhadoop.log.file=hadoop.log
-Djava.library.path=/home/mootlaw/lib/nutch/lib/native/Linux-amd64-64
-classpath
/home/mootlaw/lib/nutch/conf:/usr/lib/tools.jar:/home/mootlaw/lib/nutch/buil
d:/home/mootlaw/lib/nutch/build/test/classes:/home/mootlaw/lib/nutch/build/n
utch-1.2.job:/home/mootlaw/lib/nutch/nutch-*.job:/home/mootlaw/lib/nutch/lib
/apache-solr-core-1.4.0.jar:/home/mootlaw/lib/nutch/lib/apache-solr-solrj-1.
4.0.jar:/home/mootlaw/lib/nutch/lib/commons-beanutils-1.8.0.jar:/home/mootla
w/lib/nutch/lib/commons-cli-1.2.jar:/home/mootlaw/lib/nutch/lib/commons-code
c-1.3.jar:/home/mootlaw/lib/nutch/lib/commons-collections-3.2.1.jar:/home/mo
otlaw/lib/nutch/lib/commons-el-1.0.jar:/home/mootlaw/lib/nutch/lib/commons-h
ttpclient-3.1.jar:/home/mootlaw/lib/nutch/lib/commons-io-1.4.jar:/home/mootl
aw/lib/nutch/lib/commons-lang-2.1.jar:/home/mootlaw/lib/nutch/lib/commons-lo
gging-1.0.4.jar:/home/mootlaw/lib/nutch/lib/commons-logging-api-1.0.4.jar:/h
ome/mootlaw/lib/nutch/lib/commons-net-1.4.1.jar:/home/mootlaw/lib/nutch/lib/
core-3.1.1.jar:/home/mootlaw/lib/nutch/lib/geronimo-stax-api_1.0_spec-1.0.1.
jar:/home/mootlaw/lib/nutch/lib/hadoop-0.20.2-core.jar:/home/mootlaw/lib/nut
ch/lib/hadoop-0.20.2-tools.jar:/home/mootlaw/lib/nutch/lib/hsqldb-1.8.0.10.j
ar:/home/mootlaw/lib/nutch/lib/icu4j-4_0_1.jar:/home/mootlaw/lib/nutch/lib/j
akarta-oro-2.0.8.jar:/home/mootlaw/lib/nutch/lib/jasper-compiler-5.5.12.jar:
/home/mootlaw/lib/nutch/lib/jasper-runtime-5.5.12.jar:/home/mootlaw/lib/nutc
h/lib/jcl-over-slf4j-1.5.5.jar:/home/mootlaw/lib/nutch/lib/jets3t-0.6.1.jar:
/home/mootlaw/lib/nutch/lib/jetty-6.1.14.jar:/home/mootlaw/lib/nutch/lib/jet
ty-util-6.1.14.jar:/home/mootlaw/lib/nutch/lib/junit-3.8.1.jar:/home/mootlaw
/lib/nutch/lib/kfs-0.2.2.jar:/home/mootlaw/lib/nutch/lib/log4j-1.2.15.jar:/h
ome/mootlaw/lib/nutch/lib/lucene-core-3.0.1.jar:/home/mootlaw/lib/nutch/lib/
lucene-misc-3.0.1.jar:/home/mootlaw/lib/nutch/lib/oro-2.0.8.jar:/home/mootla
w/lib/nutch/lib/resolver.jar:/home/mootlaw/lib/nutch/lib/serializer.jar:/hom
e/mootlaw/lib/nutch/lib/servlet-api-2.5-6.1.14.jar:/home/mootlaw/lib/nutch/l
ib/slf4j-api-1.5.5.jar:/home/mootlaw/lib/nutch/lib/slf4j-log4j12-1.4.3.jar:/
home/mootlaw/lib/nutch/lib/taglibs-i18n.jar:/home/mootlaw/lib/nutch/lib/tika
-core-0.7.jar:/home/mootlaw/lib/nutch/lib/wstx-asl-3.2.7.jar:/home/mootlaw/l
ib/nutch/lib/xercesImpl.jar:/home/mootlaw/lib/nutch/lib/xml-apis.jar:/home/m
ootlaw/lib/nutch/lib/xmlenc-0.52.jar:/home/mootlaw/lib/nutch/lib/jsp-2.1/jsp
-2.1.jar:/home/mootlaw/lib/nutch/lib/jsp-2.1/jsp-api-2.1.jar
org.apache.nutch.fetcher.Fetcher
/home/mootlaw/lib/nutch/crawl/segments/2010103113 -threads 50

My PIDS cannot be traced and my mem usage is at 5%

My hadoop logs show:

2010-10-31 15:44:11,040 INFO  fetcher.Fetcher - fetching
http://caselaw.findlaw.com/us-5th-circuit/1454354.html
2010-10-31 15:44:11,294 INFO  fetcher.Fetcher - fetching
http://www.dallastxcriminaldefenseattorney.com/atom.xml
2010-10-31 15:44:11,337 INFO  fetcher.Fetcher - -activeThreads=50,
spinWaiting=48, fetchQueues.totalSize=2499
2010-10-31 15:44:12,339 INFO  fetcher.Fetcher - -activeThreads=50,
spinWaiting=50, fetchQueues.totalSize=2500
2010-10-31 15:44:13,341 INFO  fetcher.Fetcher - -activeThreads=50,
spinWaiting=50, fetchQueues.totalSize=2500
2010-10-31 15:44:14,344 INFO  fetcher.Fetcher - -activeThreads=50,
spinWaiting=50, fetchQueues.totalSize=2500
2010-10-31 15:44:15,346 INFO  fetcher.Fetcher - -activeThreads=50,
spinWaiting=50, fetchQueues.totalSize=2500
2010-10-31 15:44:16,349 INFO  fetcher.Fetcher - -activeThreads=50,
spinWaiting=50, fetchQueues.totalSize=2500
2010-10-31 15:44:16,568 INFO  fetcher.Fetcher - fetching
http://caselaw.findlaw.com/il-court-of-appeals/1542438.html
2010-10-31 15:44:17,308 INFO  fetcher.Fetcher - fetching
http://lcweb2.loc.gov/const/const.html
2010-10-31 15:44:17,352 INFO  fetcher.Fetcher - -activeThreads=50,
spinWaiting=49, fetchQueues.totalSize=2499
2010-10-31 15:44:18,354 INFO  fetcher.Fetcher - -activeThreads=50,
spinWaiting=49, fetchQueues.totalSize=2500
2010-10-31 15:44:19,356 INFO  fetcher.Fetcher - -activeThreads=50,
spinWaiting=49, fetchQueues.totalSize=2500
2010-10-31 15:44:20,358 INFO  fetcher.Fetcher - -activeT

RE: indexing '-

2010-10-31 Thread Jonathan Rochkind
What do you actually want to do? Give an example of a string that would be 
found in the source document (to index), and a few queries that you want to 
match it (and that presumably aren't matching it with the methods you've tried, 
since you say "it doesn't work").

Both a string type and a text type set to KeywordTokenizer (and with no other 
analyzers, as in your example) should/will index exactly what is in your source 
document. 

My guess is that you aren't happy with this because in fact you DO want 
tokenization, which neither of those options will get you. But you haven't 
given enough information for us to know what you actually want to do, and 
without knowing what you're trying to do we can't tell you why what you've 
tried doesn't do it, or brainstorm for ways to do it differently. What 
"doesn't work"? 

From: PeterKerk [vettepa...@hotmail.com]
Sent: Sunday, October 31, 2010 1:13 PM
To: solr-user@lucene.apache.org
Subject: Re: indexing '-

I already tried the normal string type, but that doesn't work either.
I now use this:

  

  


But that doesn't do it either... what else can I try?

Thanks!


RE: Solr in virtual host as opposed to /lib

2010-10-31 Thread Jonathan Rochkind
What servlet container are you putting your Solr in? Jetty? Tomcat? Something 
else?  Are you fronting it with apache on top of that? (I think maybe you are, 
otherwise I'm not sure how the phrase 'virtual host' applies). 

In general, Solr of course doesn't care what directory it's in on disk, so long 
as the process running solr has the necessary read/write permissions to the 
necessary directories (and if it doesn't, you'd usually find out right away 
with an error message).  And clients to Solr don't care what directory it's in 
on disk either, they only care that they can get to it by connecting to a 
certain port at a certain hostname. In general, if they can't get to it on a 
certain port at a certain hostname, that's something you'd discover right away, 
not something that would be intermittent.  But I'm not familiar with nutch, you 
may want to try connecting to the port you have Solr running on (the 
hostname/port you have told nutch to find solr on?) yourself manually, and just 
make sure it is connectable. 

I can't think of any reason that what directory you have Solr in could cause 
CPU utilization issues. I think it's got nothing to do with that. 

I am not familiar with nutch, if it's nutch that's taking 100% of your CPU, you 
might want to find some nutch experts to ask. Perhaps there's a nutch listserv? 
 I am also not familiar with hadoop; you mention just in passing that you're 
using hadoop too, maybe that's an added complication, I don't know. 

One obvious reason nutch could be taking 100% cpu would be simply because 
you've asked it to do a lot of work quickly, and it's trying to. 

One reason I have seen Solr take 100% of CPU and become unresponsive is when the 
Solr process gets caught up in terrible Java garbage collection. If that's 
what's happening, then giving the Solr JVM a higher maximum heap size can 
sometimes help (although confusingly, I've seen people suggest that if you give 
the Solr JVM too MUCH heap it can also result in long GC pauses), and if you 
have a multi-core/multi-CPU machine, I've found the JVM argument 
-XX:+UseConcMarkSweepGC to be very helpful. 

Other than that, it sounds to me like you've got a nutch/hadoop issue, not a 
Solr issue. 

From: Eric Martin [e...@makethembite.com]
Sent: Sunday, October 31, 2010 7:16 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr in virtual host as opposed to /lib

Hi,

Thank you. This is more than idle curiosity. I am trying to debug an issue I
am having with my installation and this is one step in verifying that I have
a setup that does not consume resources. I am trying to debunk my internal
myth that having Solr nad Nutch in a virtual host would be causing these
issues. Here is the main issue that involves Nutch/Solr and Drupal:

/home/mootlaw/lib/solr
/home/mootlaw/lib/nutch
/home/mootlaw/www/

I'm running a 1333 FSB Dual Socket Xeon 5500 Series @ 2.4ghz, Enterprise
Linux - x86_64 - OS, 12 Gig RAM. My Solr and Nutch are running. I am using
jetty for my Solr. My server is not rooted.

Nutch is using 100% of my cpus. I see this in my CPU utilization in my whm:

/usr/bin/java -Xmx1000m -Dhadoop.log.dir=/home/mootlaw/lib/nutch/logs
-Dhadoop.log.file=hadoop.log
-Djava.library.path=/home/mootlaw/lib/nutch/lib/native/Linux-amd64-64
-classpath
/home/mootlaw/lib/nutch/conf:/usr/lib/tools.jar:/home/mootlaw/lib/nutch/buil
d:/home/mootlaw/lib/nutch/build/test/classes:/home/mootlaw/lib/nutch/build/n
utch-1.2.job:/home/mootlaw/lib/nutch/nutch-*.job:/home/mootlaw/lib/nutch/lib
/apache-solr-core-1.4.0.jar:/home/mootlaw/lib/nutch/lib/apache-solr-solrj-1.
4.0.jar:/home/mootlaw/lib/nutch/lib/commons-beanutils-1.8.0.jar:/home/mootla
w/lib/nutch/lib/commons-cli-1.2.jar:/home/mootlaw/lib/nutch/lib/commons-code
c-1.3.jar:/home/mootlaw/lib/nutch/lib/commons-collections-3.2.1.jar:/home/mo
otlaw/lib/nutch/lib/commons-el-1.0.jar:/home/mootlaw/lib/nutch/lib/commons-h
ttpclient-3.1.jar:/home/mootlaw/lib/nutch/lib/commons-io-1.4.jar:/home/mootl
aw/lib/nutch/lib/commons-lang-2.1.jar:/home/mootlaw/lib/nutch/lib/commons-lo
gging-1.0.4.jar:/home/mootlaw/lib/nutch/lib/commons-logging-api-1.0.4.jar:/h
ome/mootlaw/lib/nutch/lib/commons-net-1.4.1.jar:/home/mootlaw/lib/nutch/lib/
core-3.1.1.jar:/home/mootlaw/lib/nutch/lib/geronimo-stax-api_1.0_spec-1.0.1.
jar:/home/mootlaw/lib/nutch/lib/hadoop-0.20.2-core.jar:/home/mootlaw/lib/nut
ch/lib/hadoop-0.20.2-tools.jar:/home/mootlaw/lib/nutch/lib/hsqldb-1.8.0.10.j
ar:/home/mootlaw/lib/nutch/lib/icu4j-4_0_1.jar:/home/mootlaw/lib/nutch/lib/j
akarta-oro-2.0.8.jar:/home/mootlaw/lib/nutch/lib/jasper-compiler-5.5.12.jar:
/home/mootlaw/lib/nutch/lib/jasper-runtime-5.5.12.jar:/home/mootlaw/lib/nutc
h/lib/jcl-over-slf4j-1.5.5.jar:/home/mootlaw/lib/nutch/lib/jets3t-0.6.1.jar:
/home/mootlaw/lib/nutch/lib/jetty-6.1.14.jar:/home/mootlaw/lib/nutch/lib/jet
ty-util-6.1.14.jar:/home/mootlaw/lib/nutch/lib/junit-3.8.1.jar:/home/mootlaw
/lib/nutch/lib/kfs-0.2.2.jar:/home/mootlaw/lib/nutch/l

RE: Solr in virtual host as opposed to /lib

2010-10-31 Thread Eric Martin
Excellent information. Thank you. Solr is acting just fine then. I can
connect to it with no issues, it indexes fine, and there didn't seem to be any
complication with it. Now I can rule it out and go about solving what you
pointed out (and I agree): a Java/Nutch issue.

Nutch is a crawler I use to feed URLs into Solr for indexing. Nutch is open
source and found on apache.org.

Thanks for your time.

-Original Message-
From: Jonathan Rochkind [mailto:rochk...@jhu.edu] 
Sent: Sunday, October 31, 2010 4:33 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr in virtual host as opposed to /lib

What servlet container are you putting your Solr in? Jetty? Tomcat?
Something else?  Are you fronting it with apache on top of that? (I think
maybe you are, otherwise I'm not sure how the phrase 'virtual host'
applies). 

In general, Solr of course doesn't care what directory it's in on disk, so
long as the process running solr has the necessary read/write permissions to 
the necessary directories (and if it doesn't, you'd usually find out right 
away with an error message).  And clients to Solr don't care what directory 
it's in on disk either, they only care that they can get to it by connecting 
to a certain port at a certain hostname. In general, if they can't get to it
on a certain port at a certain hostname, that's something you'd discover
right away, not something that would be intermittent.  But I'm not familiar
with nutch, you may want to try connecting to the port you have Solr running
on (the hostname/port you have told nutch to find solr on?) yourself
manually, and just make sure it is connectable. 

I can't think of any reason that what directory you have Solr in could cause
CPU utilization issues. I think it's got nothing to do with that. 

I am not familiar with nutch, if it's nutch that's taking 100% of your CPU, 
you might want to find some nutch experts to ask. Perhaps there's a nutch
listserv?  I am also not familiar with hadoop; you mention just in passing
that you're using hadoop too, maybe that's an added complication, I don't
know. 

One obvious reason nutch could be taking 100% cpu would be simply because
you've asked it to do a lot of work quickly, and it's trying to. 

One reason I have seen Solr take 100% of CPU and become unresponsive is when
the Solr process gets caught up in terrible Java garbage collection. If
that's what's happening, then giving the Solr JVM a higher maximum heap size
can sometimes help (although confusingly, I've seen people suggest that if
you give the Solr JVM too MUCH heap it can also result in long GC pauses),
and if you have a multi-core/multi-CPU machine, I've found the JVM argument
-XX:+UseConcMarkSweepGC to be very helpful. 

Other than that, it sounds to me like you've got a nutch/hadoop issue, not a
Solr issue. 

From: Eric Martin [e...@makethembite.com]
Sent: Sunday, October 31, 2010 7:16 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr in virtual host as opposed to /lib

Hi,

Thank you. This is more than idle curiosity. I am trying to debug an issue I
am having with my installation and this is one step in verifying that I have
a setup that does not consume resources. I am trying to debunk my internal
myth that having Solr nad Nutch in a virtual host would be causing these
issues. Here is the main issue that involves Nutch/Solr and Drupal:

/home/mootlaw/lib/solr
/home/mootlaw/lib/nutch
/home/mootlaw/www/

I'm running a 1333 FSB Dual Socket Xeon 5500 Series @ 2.4ghz, Enterprise
Linux - x86_64 - OS, 12 Gig RAM. My Solr and Nutch are running. I am using
jetty for my Solr. My server is not rooted.

Nutch is using 100% of my cpus. I see this in my CPU utilization in my whm:

/usr/bin/java -Xmx1000m -Dhadoop.log.dir=/home/mootlaw/lib/nutch/logs
-Dhadoop.log.file=hadoop.log
-Djava.library.path=/home/mootlaw/lib/nutch/lib/native/Linux-amd64-64
-classpath
/home/mootlaw/lib/nutch/conf:/usr/lib/tools.jar:/home/mootlaw/lib/nutch/buil
d:/home/mootlaw/lib/nutch/build/test/classes:/home/mootlaw/lib/nutch/build/n
utch-1.2.job:/home/mootlaw/lib/nutch/nutch-*.job:/home/mootlaw/lib/nutch/lib
/apache-solr-core-1.4.0.jar:/home/mootlaw/lib/nutch/lib/apache-solr-solrj-1.
4.0.jar:/home/mootlaw/lib/nutch/lib/commons-beanutils-1.8.0.jar:/home/mootla
w/lib/nutch/lib/commons-cli-1.2.jar:/home/mootlaw/lib/nutch/lib/commons-code
c-1.3.jar:/home/mootlaw/lib/nutch/lib/commons-collections-3.2.1.jar:/home/mo
otlaw/lib/nutch/lib/commons-el-1.0.jar:/home/mootlaw/lib/nutch/lib/commons-h
ttpclient-3.1.jar:/home/mootlaw/lib/nutch/lib/commons-io-1.4.jar:/home/mootl
aw/lib/nutch/lib/commons-lang-2.1.jar:/home/mootlaw/lib/nutch/lib/commons-lo
gging-1.0.4.jar:/home/mootlaw/lib/nutch/lib/commons-logging-api-1.0.4.jar:/h
ome/mootlaw/lib/nutch/lib/commons-net-1.4.1.jar:/home/mootlaw/lib/nutch/lib/
core-3.1.1.jar:/home/mootlaw/lib/nutch/lib/geronimo-stax-api_1.0_spec-1.0.1.
jar:/home/mootlaw/lib/nutch/lib/hadoop-0.20.2-core.jar:/home/mootlaw/lib/nut
ch/lib/had

Re: problem of solr replication's speed

2010-10-31 Thread Lance Norskog
If you are copying from an indexer while you are indexing new content,
this would cause contention for the disk head. Does indexing slow down
during this period?

Lance

2010/10/31 Peter Karich :
> we have an identical-sized index and it takes ~5 minutes
>
>
>> It takes about one hour to replicate a 6G index for Solr in my env. But my
>> network can transfer files at about 10-20M/s using scp. So Solr's HTTP
>> replication seems too slow. Is this normal, or am I doing something wrong?
>>
>
>



-- 
Lance Norskog
goks...@gmail.com


Re: Design and Usage Questions

2010-10-31 Thread Lance Norskog
2.
The SolrJ library handling of content streams is "pull", not "push".
That is, you give it a reader and it pulls content when it feels like
it. If the software feeding the connection wants to write the data,
you have to either buffer the whole thing or do a dual-thread
writer/reader pair.

The easiest way to pull stuff from SVN is to use one of the web server
apps. Solr takes a "stream.url" parameter. (Also stream.file.) Note
that there is no outbound authentication supported; your web server
has to be open (at least to the Solr instance).
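
Since the poster already has the whole document buffered in a
ByteArrayOutputStream, another option is to hand SolrJ the bytes directly.
A minimal sketch (hedged: assumes SolrJ 1.4's ContentStreamUpdateRequest and
an extracting handler configured at /update/extract; the URL is an
assumption), which also shows the shared CommonsHttpSolrServer singleton
asked about in question 1:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.ContentStreamBase;

public class SvnIndexer {
    // One shared instance for all requests, per the SolrJ wiki advice.
    private static final CommonsHttpSolrServer SOLR;
    static {
        try {
            SOLR = new CommonsHttpSolrServer("http://localhost:8983/solr");
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Wraps the in-memory bytes so SolrJ can pull them; no temp file needed.
    public static void index(ByteArrayOutputStream doc) throws Exception {
        final byte[] bytes = doc.toByteArray();
        ContentStreamUpdateRequest req =
            new ContentStreamUpdateRequest("/update/extract");
        req.addContentStream(new ContentStreamBase() {
            public InputStream getStream() throws IOException {
                return new ByteArrayInputStream(bytes);
            }
        });
        SOLR.request(req);
    }
}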


On Sun, Oct 31, 2010 at 4:06 PM, getagrip  wrote:
> Hi,
>
> I've got some basic usage / design questions.
>
> 1. The SolrJ wiki proposes to use the same CommonsHttpSolrServer
>   instance for all requests to avoid connection leaks.
>   So if I create a Singleton instance upon application-startup I can
>   securely use this instance for ALL queries/updates throughout my
>   application without running into performance issues?
>
> 2. My System's documents are stored in a Subversion repository.
>   For fast searchresults I want to periodically index new documents
>   from the repository.
>
>   What I get from the repository is a ByteArrayOutputStream. How can I
>   pass this Stream to Solr?
>
>   I only see possibilities to pass Files but in my case it does not
>   make sense to write the ByteArrayOutputStream to disk again as this
>   would cause performance issues apart from making no sense anyway.
>
> 3. Are there any disadvantages using Solrj over some other HTTP based
>   solution e.g. creating & sending my own HTTP requests? Do I even
>   have to use HTTP?
>   I see the EmbeddedSolrServer exists. Any drawbacks using that?
>
> Any hints are welcome, Thanks!
>



-- 
Lance Norskog
goks...@gmail.com


Re: Solr in virtual host as opposed to /lib

2010-10-31 Thread Lance Norskog
With virtual hosting you can give CPU & memory quotas to your
different VMs. This allows you to control the Nutch vs. The World
problem. Unfortunately, you cannot allocate disk channel. With two I/O-bound
apps, this is a problem.

On Sun, Oct 31, 2010 at 4:38 PM, Eric Martin  wrote:
> Excellent information. Thank you. Solr is acting just fine then. I can
> connect to it no issues, it indexes fine and there didn't seem to be any
> complication with it. Now I can rule it out and go about solving, what you
> pointed out, and I agree, to be a java/nutch issue.
>
> Nutch is a crawler I use to feed URL's into Solr for indexing. Nutch is open
> source and found on apache.org
>
> Thanks for your time.
>
> -Original Message-
> From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
> Sent: Sunday, October 31, 2010 4:33 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr in virtual host as opposed to /lib
>
> What servlet container are you putting your Solr in? Jetty? Tomcat?
> Something else?  Are you fronting it with apache on top of that? (I think
> maybe you are, otherwise I'm not sure how the phrase 'virtual host'
> applies).
>
> In general, Solr of course doesn't care what directory it's in on disk, so
> long as the process running solr has the necessary read/write permissions to
> the necessary directories (and if it doesn't, you'd usually find out right
> away with an error message).  And clients to Solr don't care what directory
> it's in on disk either, they only care that they can get to it by connecting
> to a certain port at a certain hostname. In general, if they can't get to it
> on a certain port at a certain hostname, that's something you'd discover
> right away, not something that would be intermittent.  But I'm not familiar
> with nutch, you may want to try connecting to the port you have Solr running
> on (the hostname/port you have told nutch to find solr on?) yourself
> manually, and just make sure it is connectable.
>
> I can't think of any reason that what directory you have Solr in could cause
> CPU utilization issues. I think it's got nothing to do with that.
>
> I am not familiar with nutch, if it's nutch that's taking 100% of your CPU,
> you might want to find some nutch experts to ask. Perhaps there's a nutch
> listserv?  I am also not familiar with hadoop; you mention just in passing
> that you're using hadoop too, maybe that's an added complication, I don't
> know.
>
> One obvious reason nutch could be taking 100% cpu would be simply because
> you've asked it to do a lot of work quickly, and it's trying to.
>
> One reason I have seen Solr take 100% of CPU and become unresponsive is when
> the Solr process gets caught up in terrible Java garbage collection. If
> that's what's happening, then giving the Solr JVM a higher maximum heap size
> can sometimes help (although confusingly, I've seen people suggest that if
> you give the Solr JVM too MUCH heap it can also result in long GC pauses),
> and if you have a multi-core/multi-CPU machine, I've found the JVM argument
> -XX:+UseConcMarkSweepGC to be very helpful.
>
> Other than that, it sounds to me like you've got a nutch/hadoop issue, not a
> Solr issue.
> 
> From: Eric Martin [e...@makethembite.com]
> Sent: Sunday, October 31, 2010 7:16 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr in virtual host as opposed to /lib
>
> Hi,
>
> Thank you. This is more than idle curiosity. I am trying to debug an issue I
> am having with my installation and this is one step in verifying that I have
> a setup that does not consume resources. I am trying to debunk my internal
> myth that having Solr nad Nutch in a virtual host would be causing these
> issues. Here is the main issue that involves Nutch/Solr and Drupal:
>
> /home/mootlaw/lib/solr
> /home/mootlaw/lib/nutch
> /home/mootlaw/www/
>
> I'm running a 1333 FSB Dual Socket Xeon 5500 Series @ 2.4ghz, Enterprise
> Linux - x86_64 - OS, 12 Gig RAM. My Solr and Nutch are running. I am using
> jetty for my Solr. My server is not rooted.
>
> Nutch is using 100% of my cpus. I see this in my CPU utilization in my whm:
>
> /usr/bin/java -Xmx1000m -Dhadoop.log.dir=/home/mootlaw/lib/nutch/logs
> -Dhadoop.log.file=hadoop.log
> -Djava.library.path=/home/mootlaw/lib/nutch/lib/native/Linux-amd64-64
> -classpath
> /home/mootlaw/lib/nutch/conf:/usr/lib/tools.jar:/home/mootlaw/lib/nutch/buil
> d:/home/mootlaw/lib/nutch/build/test/classes:/home/mootlaw/lib/nutch/build/n
> utch-1.2.job:/home/mootlaw/lib/nutch/nutch-*.job:/home/mootlaw/lib/nutch/lib
> /apache-solr-core-1.4.0.jar:/home/mootlaw/lib/nutch/lib/apache-solr-solrj-1.
> 4.0.jar:/home/mootlaw/lib/nutch/lib/commons-beanutils-1.8.0.jar:/home/mootla
> w/lib/nutch/lib/commons-cli-1.2.jar:/home/mootlaw/lib/nutch/lib/commons-code
> c-1.3.jar:/home/mootlaw/lib/nutch/lib/commons-collections-3.2.1.jar:/home/mo
> otlaw/lib/nutch/lib/commons-el-1.0.jar:/home/mootlaw/lib/nutch/lib/commons-h
> ttpclient-3.1.jar:/home/mootlaw/l

RE: Solr in virtual host as opposed to /lib

2010-10-31 Thread Eric Martin
Oh. So I should take out the installations and move them to / as 
opposed to inside my virtual host of /home//www?

-Original Message-
From: Lance Norskog [mailto:goks...@gmail.com] 
Sent: Sunday, October 31, 2010 7:26 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr in virtual host as opposed to /lib

With virtual hosting you can give CPU & memory quotas to your
different VMs. This allows you to control the Nutch v.s. The World
problem. Unforch, you cannot allocate disk channel. With two i/o bound
apps, this is a problem.

On Sun, Oct 31, 2010 at 4:38 PM, Eric Martin  wrote:
> Excellent information. Thank you. Solr is acting just fine then. I can
> connect to it no issues, it indexes fine and there didn't seem to be any
> complication with it. Now I can rule it out and go about solving, what you
> pointed out, and I agree, to be a java/nutch issue.
>
> Nutch is a crawler I use to feed URL's into Solr for indexing. Nutch is open
> source and found on apache.org
>
> Thanks for your time.
>
> -Original Message-
> From: Jonathan Rochkind [mailto:rochk...@jhu.edu]
> Sent: Sunday, October 31, 2010 4:33 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr in virtual host as opposed to /lib
>
> What servlet container are you putting your Solr in? Jetty? Tomcat?
> Something else?  Are you fronting it with apache on top of that? (I think
> maybe you are, otherwise I'm not sure how the phrase 'virtual host'
> applies).
>
> In general, Solr of course doesn't care what directory it's in on disk, so
> long as the process running solr has the necessary read/write permissions to
> the necessary directories (and if it doesn't, you'd usually find out right
> away with an error message).  And clients to Solr don't care what directory
> it's in on disk either, they only care that they can get to it by connecting
> to a certain port at a certain hostname. In general, if they can't get to it
> on a certain port at a certain hostname, that's something you'd discover
> right away, not something that would be intermittent.  But I'm not familiar
> with nutch, you may want to try connecting to the port you have Solr running
> on (the hostname/port you have told nutch to find solr on?) yourself
> manually, and just make sure it is connectable.
>
> I can't think of any reason that what directory you have Solr in could cause
> CPU utilization issues. I think it's got nothing to do with that.
>
> I am not familiar with nutch, if it's nutch that's taking 100% of your CPU,
> you might want to find some nutch experts to ask. Perhaps there's a nutch
> listserv?  I am also not familiar with hadoop; you mention just in passing
> that you're using hadoop too, maybe that's an added complication, I don't
> know.
>
> One obvious reason nutch could be taking 100% cpu would be simply because
> you've asked it to do a lot of work quickly, and it's trying to.
>
> One reason I have seen Solr take 100% of CPU and become unresponsive is when
> the Solr process gets caught up in terrible Java garbage collection. If
> that's what's happening, then giving the Solr JVM a higher maximum heap size
> can sometimes help (although confusingly, I've seen people suggest that if
> you give the Solr JVM too MUCH heap it can also result in long GC pauses),
> and if you have a multi-core/multi-CPU machine, I've found the JVM argument
> -XX:+UseConcMarkSweepGC to be very helpful.
>
> Other than that, it sounds to me like you've got a nutch/hadoop issue, not a
> Solr issue.
> 
> From: Eric Martin [e...@makethembite.com]
> Sent: Sunday, October 31, 2010 7:16 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Solr in virtual host as opposed to /lib
>
> Hi,
>
> Thank you. This is more than idle curiosity. I am trying to debug an issue I
> am having with my installation and this is one step in verifying that I have
> a setup that does not consume resources. I am trying to debunk my internal
> myth that having Solr nad Nutch in a virtual host would be causing these
> issues. Here is the main issue that involves Nutch/Solr and Drupal:
>
> /home/mootlaw/lib/solr
> /home/mootlaw/lib/nutch
> /home/mootlaw/www/
>
> I'm running a 1333 FSB Dual Socket Xeon 5500 Series @ 2.4ghz, Enterprise
> Linux - x86_64 - OS, 12 Gig RAM. My Solr and Nutch are running. I am using
> jetty for my Solr. My server is not rooted.
>
> Nutch is using 100% of my cpus. I see this in my CPU utilization in my whm:
>
> /usr/bin/java -Xmx1000m -Dhadoop.log.dir=/home/mootlaw/lib/nutch/logs
> -Dhadoop.log.file=hadoop.log
> -Djava.library.path=/home/mootlaw/lib/nutch/lib/native/Linux-amd64-64
> -classpath
> /home/mootlaw/lib/nutch/conf:/usr/lib/tools.jar:/home/mootlaw/lib/nutch/buil
> d:/home/mootlaw/lib/nutch/build/test/classes:/home/mootlaw/lib/nutch/build/n
> utch-1.2.job:/home/mootlaw/lib/nutch/nutch-*.job:/home/mootlaw/lib/nutch/lib
> /apache-solr-core-1.4.0.jar:/home/mootlaw/lib/nutch/lib/apache-solr-solrj-1.
> 4.0.jar:/home/mootlaw/lib/nutch