Broken stats.jsp

2011-03-25 Thread Mark Mandel
Relatively new to SOLR (only JUST deployed my first SOLR app to production,
very proud ;o) )

I went to check out the solr/mycore/admin/stats.jsp page... and all I get is
a blank page.

Looking into it deeper, it seems that SOLR is returning badly encoded XML to
the browser, so it's not rendering.

I can't seem to find any references to this issue anywhere except:
https://issues.apache.org/jira/browse/SOLR-1750

(which offers more of a workaround), and it seems that the SolrInfoMBeanHandler
is not in the 1.4.1 build.

Any help would be appreciated, so I can tune the caching settings on my SOLR
install (which so far is screaming along, but it's always good to have more
speed).

Thanks in advance,

Mark

-- 
E: mark.man...@gmail.com
T: http://www.twitter.com/neurotic
W: www.compoundtheory.com

cf.Objective(ANZ) - Nov 17, 18 - Melbourne Australia
http://www.cfobjective.com.au

Hands-on ColdFusion ORM Training
www.ColdFusionOrmTraining.com


Re: Newbie wants to index XML content.

2011-03-25 Thread Martin Rödig
You can use the DIH (Data Import Handler) to split up and index that XML.
 http://wiki.apache.org/solr/DataImportHandler
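
As a rough illustration, a data-config.xml using the XPathEntityProcessor
described there might look like this (the file path, entity name, and xpaths
are hypothetical; they would have to match the actual XML structure):

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="contact"
            processor="XPathEntityProcessor"
            url="/path/to/contacts.xml"
            forEach="/contacts/contact">
      <!-- element text content -->
      <field column="name" xpath="/contacts/contact/name"/>
      <!-- element attribute -->
      <field column="email" xpath="/contacts/contact/email/@address"/>
    </entity>
  </document>
</dataConfig>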


Kind regards
M.Sc. Dipl.-Inf. (FH) Martin Rödig
 
SHI Elektronische Medien GmbH
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
- - - - - -
CURRENT - NEW - STARTING NOW
Solr/Lucene training, 19-21 April in Berlin

As the first certified Lucid Imagination training partner in Germany, Austria,
and Switzerland, SHI now offers German-language Solr training.
More information: www.shi-gmbh.com/services/solr-training
Note: the number of seats is limited!
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
- - - - - -
Postal address: Watzmannstr. 23, 86316 Friedberg
Visitor address: Curt-Frenzel-Str. 12, 86167 Augsburg
Tel.: 0821 7482633 18
Tel.: 0821 7482633 0 (switchboard)
Fax: 0821 7482633 29

Internet: http://www.shi-gmbh.com
Commercial register: Augsburg HRB 17382
Managing director: Peter Spiske
Tax number: 103/137/30412

-----Original Message-----
From: Marcelo Iturbe [mailto:marc...@santiago.cl]
Sent: Thursday, 24 March 2011 21:55
To: solr-user@lucene.apache.org
Subject: Newbie wants to index XML content.

Hello,
I've been reading up on how to index XML content but have a few questions.

How is data in element attributes handled or defined? How are nested elements 
handled?

In the following XML structure, I want to index the content of what is between 
the  tags.
In one XML document, there can be up to 100  tags.
So the  tag would be equivalent to the  tag...

Can I somehow index this XML "as is" or will I have to parse it, creating the 
 tag and placing all the elements on the same level?

Thanks for your help.



manual

MC Anon User
mca...@mcdomain.com




John Smith

jsmit...@gmail.com




First Last
First
Last


MC S.A.
CIO

fi...@mcdomain.com
flas...@yahoo.com
+5629460600
fi...@mcdomain.com
First.Last
111 Bude St, Toronto
http://blog.mcdomain.com/



regards
Marcelo


Re: stopwords not working in multicore setup

2011-03-25 Thread Martin Rödig
I have some questions about your config:

Is the stopwords-de.txt file in the same directory as the schema.xml?
Is the title field of type text?
Do you have the same problem with German stopwords without umlauts (ü, ö, ä),
such as the word "denn"?

One possible problem is that stopwords-de.txt is not saved as UTF-8, so the
filter cannot read the umlaut ü in the file.
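
For comparison, a minimal sketch of a text field type wired to that stopword
file (the type name here is illustrative, not taken from the actual schema):

<fieldType name="text_de" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords-de.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>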


Kind regards
M.Sc. Dipl.-Inf. (FH) Martin Rödig
 
SHI Elektronische Medien GmbH
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
- - - - - -
CURRENT - NEW - STARTING NOW
Solr/Lucene training, 19-21 April in Berlin

As the first certified Lucid Imagination training partner in Germany, Austria,
and Switzerland, SHI now offers German-language Solr training.
More information: www.shi-gmbh.com/services/solr-training
Note: the number of seats is limited!
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
- - - - - -
Postal address: Watzmannstr. 23, 86316 Friedberg
Visitor address: Curt-Frenzel-Str. 12, 86167 Augsburg
Tel.: 0821 7482633 18
Tel.: 0821 7482633 0 (switchboard)
Fax: 0821 7482633 29

Internet: http://www.shi-gmbh.com
Commercial register: Augsburg HRB 17382
Managing director: Peter Spiske
Tax number: 103/137/30412

-----Original Message-----
From: Christopher Bottaro [mailto:cjbott...@onespot.com]
Sent: Friday, 25 March 2011 05:37
To: solr-user@lucene.apache.org
Subject: stopwords not working in multicore setup

Hello,

I'm running a Solr server with 5 cores.  Three are for English content and two 
are for German content.  The default stopwords setup works fine for the English 
cores, but the German stopwords aren't working.

The German stopwords file is stopwords-de.txt and resides in the same directory 
as stopwords.txt.  The German cores use a different schema (named
schema.page.de.xml) which has the following text field definition:
http://pastie.org/1711866

The stopwords-de.txt file looks like this:  http://pastie.org/1711869

The query I'm doing is this:  q => "title:für"

And it's returning documents with für in the title.  Title is a text field 
which should use the stopwords-de.txt, as seen in the aforementioned pastie.

Any ideas?  Thanks for the help.


SOLR - problems with non-english symbols when extracting HTML

2011-03-25 Thread kushti
When I send plain UTF-8 (non-English) text for indexing, everything is OK, but
with HTML I get wrong characters instead of non-ASCII symbols. So

$this->solr->extractContents($url, strip_tags($code),
    array("literal.url" => $url, "fmap.content" => "body"));

works well, but

$this->solr->extractContents($url, $code,
    array("literal.url" => $url, "fmap.content" => "body"));

does not! What's the problem?

I'm using the Solr PHP client (code.google.com/p/solr-php-client/), but I
don't think the problem is there.

In both cases the "text/plain" content type is set in the request (I've
updated the standard library code).

SOLR 1.4.1 / Tomcat 6 / Fedora 12

--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-problems-with-non-english-symbols-when-extracting-HTML-tp2729126p2729126.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: problem with snowballporterfilterfactory

2011-03-25 Thread anurag.walia
Thanks in advance.
Please help me resolve this issue.
Please find the screenshot from the analyzer. I have a problem with the number
of characters in the term text after SnowballPorterFilterFactory. I entered
"Polymer", but after SnowballPorterFilterFactory it becomes "Polym", even
though the word does not exist in the protwords.txt file. I want the term text
to remain the whole word, like "Polymer", whenever a word does not exist in
protwords.txt.
Regards
Anurag Walia
http://lucene.472066.n3.nabble.com/file/n2729734/solrproblem.jpg
solrproblem.jpg

--
View this message in context: 
http://lucene.472066.n3.nabble.com/problem-with-snowballporterfilterfactory-tp2729589p2729734.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Parent-child options

2011-03-25 Thread Jan Høydahl
Otis,

Impressive list of possible solutions you've come up with :)

I've used Jonathan's "pattern" in several projects, but it quickly becomes
unmanageable. My plan was to try to come up with a new FieldType inspired by
FAST's Scope field, which would take JSON in and be able to match hierarchical
relationships with a syntax such as q=itemType:shoes AND
items_json:"item:and(color:red,size:10)". The FieldType would make sure that
the sub-tags within the and() actually exist within the scope of the same
item. It's not trivial, as you implement a mini matching engine inside a field
type plus a new query syntax, but it should be possible for simple string-type
metadata. The FieldType would need to convert the JSON structure into some
internal tree structure which is easily matched against the query.

I also thought about a JSON PolyField, where inserting one JSON string into the
poly field would generate a bunch of sub-fields (_items_json_item1_color,
_items_json_item1_size, ...) to be able to re-use Lucene's matching
capabilities, but I did not get it to support all use cases in my head.

Did anyone try SIREn?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 17. mars 2011, at 16.48, Jonathan Rochkind wrote:

> The standard answer, which is a kind of de-normalizing, is to index tokens 
> like this:
> 
> red_10   red_11orange_12
> 
> in another field, you could do these things with size first:
> 
> 10_red 11_red 12_orange
> 
> Now if you want to see what sizes of red you have, you can do a facet query 
> with facet.prefix=red_ .  You'll need to do a bit of parsing/interpreting 
> client side to translate from the results you get ("red_10", "red_11") to 
> telling the users "sizes 10 and 11 are available".  The second field with 
> size first lets you do the same thing to answer "what colors do we have in 
> size X?".
> 
> That gets unmanageable with more than 2-3 facet combinations, but with just 2 
> (or, pushing it, 3), can work out okay. You'd probably ALSO want to keep the 
> facets you have with plain values "red red orange" etc., to support that first 
> level of user navigation. There is a bit more work to do on the client side 
> with this approach; Solr isn't just giving you exactly what you want in its 
> response, you've got to have logic for when to use the top-level facets and 
> when to go to that second-level combo facet ("red_12"), but it's do-able.
> 
> On 3/17/2011 11:21 AM, Otis Gospodnetic wrote:
>> Hi,
>> 
>> 
>> 
>> - Original Message 
>>> From: Yonik Seeley
>>> Subject: Re: Parent-child options
>>> 
>>> On Thu, Mar 17, 2011 at 1:49 AM, Otis Gospodnetic
>>>wrote:
>>>> The dreaded parent-child without denormalization question. What are one's
>>>> options for the following example:
>>>>
>>>> parent: shoes
>>>> 3 children, each with 2 attributes/fields: color and size
>>>>   * color: red black orange
>>>>   * size: 10 11 12
>>>>
>>>> The goal is to be able to search for:
>>>> 1) color:red AND size:10 and get 1 hit for the above
>>>> 2) color:red AND size:12 and get *no* matches because there are no red shoes of
>>>> size 12, only size 10.
>>> What if you had this instead:
>>> 
>>>   color: red red orange
>>>   size: 10 11 12
>>> 
>>> Do you need for color:red to return 1 or 2 (i.e. is the final answer
>>> in units of child hits or parent hits)?
>> The final answer is the parent, which is "shoes" in this example.
>> So:
>> if the query is color:red AND size:10 the answer is: Yes, we got red shoes size 10
>> if the query is color:red AND size:11 the answer is: Yes, we got red shoes size 11
>> if the query is color:red AND size:12 the answer is: No, we don't have red shoes size 12
>> 
>> Thanks,
>> Otis
>> 
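
As a concrete illustration of Jonathan's facet.prefix pattern quoted above
(the combined field name color_size is hypothetical), the "what sizes of red
do we have?" question becomes a request like:

http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=color_size&facet.prefix=red_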



Re: Detecting an empty index during start-up

2011-03-25 Thread David McLaughlin
Thanks Chris. I dug into the SolrCore code and after reading some of the
code I ended up going with core.getNewestSearcher(true) and this fixed the
problem.


David

On Thu, Mar 24, 2011 at 7:20 PM, Chris Hostetter
wrote:

> : I am not familiar with Solr internals, so the approach I wanted to take
> was
> : to basically check the numDocs property of the index during start-up and
> set
> : a READABLE state in the ZooKeeper node if it's greater than 0. I also
> : planned to create a commit hook for replication and updating which
> : controlled the READABLE property based on numDocs also.
> :
> : This just leaves the problem of finding out the number of documents
> during
> : start-up. I planned to have something like:
>
> Most of the ZK stuff you mentioned is over my head, but i get the general
> gist of what you want:
>
>  * a hook on startup that checks numDocs
>  * if not empty, trigger some logic
>
> My suggestion would be to implement this as a "firstSearcher"
> SolrEventListener.  when that runs, you'll have easy access to a
> SOlrIndexSearcher (and you won't even have to refcount it) and you can
> fire whatever logic you want based on what you find when looking at it.
>
>
> -Hoss
>
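
A minimal sketch of the wiring in the <query> section of solrconfig.xml (the
listener class name is hypothetical; it would implement SolrEventListener and
inspect numDocs on the searcher handed to it):

<listener event="firstSearcher" class="com.example.EmptyIndexListener"/>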


Re: Detecting an empty index during start-up

2011-03-25 Thread Andrzej Bialecki

On 3/25/11 11:25 AM, David McLaughlin wrote:

Thanks Chris. I dug into the SolrCore code and after reading some of the
code I ended up going with core.getNewestSearcher(true) and this fixed the
problem.


FYI, openNew=true is not implemented and can result in an 
UnsupportedOperationException. For now it's better to pass openNew=false 
and be prepared to get a null.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Multiple Cores with Solr Cell for indexing documents

2011-03-25 Thread Upayavira
There are options in solr.xml that point to lib dirs. Make sure you get
them right.

Upayavira

On Thu, 24 Mar 2011 23:28 +0100, "Markus Jelsma"
 wrote:
> I believe it's example/solr/lib where it looks for shared libs in
> multicore. 
> But each core can have its own lib dir, usually in core/lib. This is
> referenced in solrconfig.xml; see the example config for the lib
> directive.
> 
> > Well, there lies the problem--it's not JUST the Tika jar.  If it's not one
> > thing, it's another, and I'm not even sure which directory Solr actually
> > looks in.  In my Solr.xml file I have it use a shared library folder for
> > every core.  Since each core will be holding very homologous data, there's
> > no need to have any different library modules for each.
> > 
> > The relevant line in my solr.xml file is <solr ... sharedLib="lib">.  That is housed in .../example/solr/.  So, does it look
> > in .../example/lib or .../example/solr/lib?
> > 
> > ~Brandon Waterloo
> > 
> > From: Markus Jelsma [markus.jel...@openindex.io]
> > Sent: Thursday, March 24, 2011 11:29 AM
> > To: solr-user@lucene.apache.org
> > Cc: Brandon Waterloo
> > Subject: Re: Multiple Cores with Solr Cell for indexing documents
> > 
> > Sounds like the Tika jar is not on the class path. Add it to a directory
> > where Solr's looking for libs.
> > 
> > On Thursday 24 March 2011 16:24:17 Brandon Waterloo wrote:
> > > Hello everyone,
> > > 
> > > I've been trying for several hours now to set up Solr with multiple cores
> > > with Solr Cell working on each core. The only items being indexed are
> > > PDF, DOC, and TXT files (with the possibility of expanding this list,
> > > but for now, just assume the only things in the index should be
> > > documents).
> > > 
> > > I never had any problems with Solr Cell when I was using a single core.
> > > In fact, I just ran the default installation in example/ and worked from
> > > that. However, trying to migrate to multi-core has been a never ending
> > > list of problems.
> > > 
> > > Any time I try to add a document to the index (using the same curl
> > > command as I did to add to the single core, of course adding the core
> > > name to the request URL-- host/solr/corename/update/extract...), I get
> > > HTTP 500 errors due to classes not being found and/or lazy loading
> > > errors. I've copied the exact example/lib directory into the cores, and
> > > that doesn't work either.
> > > 
> > > Frankly the only libraries I want are those relevant to indexing files.
> > > The less bloat, the better, after all. However, I cannot figure out
> > > where to put what files, and why the example installation works
> > > perfectly for single-core but not with multi-cores.
> > > 
> > > Here is an example of the errors I'm receiving:
> > > 
> > > command prompt> curl
> > > "host/solr/core0/update/extract?literal.id=2-3-1&commit=true" -F
> > > "myfile=@test2.txt"
> > > 
> > > 
> > > 
> > > 
> > > Error 500
> > > 
> > > HTTP ERROR: 500
> > > org/apache/tika/exception/TikaException
> > > 
> > > java.lang.NoClassDefFoundError: org/apache/tika/exception/TikaException
> > > at java.lang.Class.forName0(Native Method)
> > > at java.lang.Class.forName(Class.java:247)
> > > at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:359)
> > > at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413)
> > > at org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:449)
> > > at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:240)
> > > at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231)
> > > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> > > at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
> > > at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
> > > at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
> > > at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
> > > at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> > > at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
> > > at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
> > > at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
> > > at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
> > > at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
> > > at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
> > > at org.mortbay.jetty.Server.handle(Server.java:285)
> > > at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
> > >

Re: solr on the cloud

2011-03-25 Thread Dmitry Kan
Hi Otis,

Ok, thanks.

No, the question about distributed faceting was more of a guess, as faceting
seems like a good fit for MR. I probably need to follow the JIRA tickets more
closely for a follow-up, but I was initially wondering whether I had missed
some documentation on the topic, which apparently I had not.

On Fri, Mar 25, 2011 at 5:34 AM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Hi,
>
>
> > I have tried running the sharded solr with zoo keeper on a  single
> machine.
>
> > The SOLR code is from current trunk. It runs nicely. Can you  please
> point me
> > to a page, where I can check the status of the solr on the  cloud
> development
> > and available features, apart from http://wiki.apache.org/solr/SolrCloud?
>
> I'm afraid that's the most comprehensive documentation so far.
>
> > Basically, of high interest  is checking out the Map-Reduce for
> distributed
> > faceting, is it even possible  with the trunk?
>
> Hm, MR for distributed faceting?  Maybe I missed this... can you point to a
> place that mentions this?
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>



-- 
Regards,

Dmitry Kan


Re: SOLR - problems with non-english symbols when extracting HTML

2011-03-25 Thread Grijesh
Try sending the HTML data wrapped in a CDATA section.
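
For example, in an XML update message the raw HTML can be wrapped like this
(the field name is assumed):

<field name="content"><![CDATA[<p>Some HTML with non-ASCII text: für, Straße</p>]]></field>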

-
Thanx: 
Grijesh 
www.gettinhahead.co.in 
--
View this message in context: 
http://lucene.472066.n3.nabble.com/SOLR-problems-with-non-english-symbols-when-extracting-HTML-tp2729126p2729923.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr on the cloud

2011-03-25 Thread Yonik Seeley
On Tue, Mar 22, 2011 at 7:51 AM, Dmitry Kan  wrote:
> Basically, of high interest is checking out the Map-Reduce for distributed
> faceting, is it even possible with the trunk?

Solr already has distributed faceting, and it's much more performant
than a map-reduce implementation would be.

I've also seen a product use the term "map reduce" incorrectly... as in,
we "map" the request to each shard, and then "reduce" the results to a
single list (of course, that's not actually map-reduce at all ;-)

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco


Re: solr on the cloud

2011-03-25 Thread Dmitry Kan
Hi Yonik,

Oh, this is great. Is distributed faceting available in the trunk? What is
the basic server setup needed for trying this out: is it a cloud with HDFS and
Solr with ZooKeeper?
Any chance to see the related documentation? :)

On Fri, Mar 25, 2011 at 1:35 PM, Yonik Seeley wrote:

> On Tue, Mar 22, 2011 at 7:51 AM, Dmitry Kan  wrote:
> > Basically, of high interest is checking out the Map-Reduce for
> distributed
> > faceting, is it even possible with the trunk?
>
> Solr already has distributed faceting, and it's much more performant
> than a map-reduce implementation would be.
>
> I've also seen a product use the term "map reduce" incorrectly... as in,
> we "map" the request to each shard, and then "reduce" the results to a
> single list (of course, that's not actually map-reduce at all ;-)
>
>
:) this sounds pretty strange to me as well. It was only my guess that if
you have MR as the computational model and a cloud beneath it, you could
naturally map facet fields to their counts inside individual documents (no
matter where they are, be it shards or a "single" index) and pass them on to
reducers.


> -Yonik
> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> 25-26, San Francisco
>



-- 
Regards,

Dmitry Kan


Re: solr on the cloud

2011-03-25 Thread Upayavira


On Fri, 25 Mar 2011 13:44 +0200, "Dmitry Kan" 
wrote:
> Hi Yonik,
> 
> Oh, this is great. Is distributed faceting available in the trunk? What
> is
> the basic server setup needed for trying this out, is it cloud with HDFS
> and
> SOLR with zookepers?
> Any chance to see the related documentation? :)

Distributed faceting has been available for a long time, and is
available in the 1.4.1 release.

The distribution of facet requests across hosts happens in the
background. There's no real difference (in query syntax) between a
standard facet query and a distributed one.

i.e. you don't need SolrCloud nor Zookeeper for it. (they may provide
other benefits, but you don't need them for distributed faceting).
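
For example, the same facet request works against a single index or several
shards (the host names here are hypothetical):

http://host1:8983/solr/select?q=*:*&facet=true&facet.field=color&shards=host1:8983/solr,host2:8983/solr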

Upayavira

> On Fri, Mar 25, 2011 at 1:35 PM, Yonik Seeley
> wrote:
> 
> > On Tue, Mar 22, 2011 at 7:51 AM, Dmitry Kan  wrote:
> > > Basically, of high interest is checking out the Map-Reduce for
> > distributed
> > > faceting, is it even possible with the trunk?
> >
> > Solr already has distributed faceting, and it's much more performant
> > than a map-reduce implementation would be.
> >
> > I've also seen a product use the term "map reduce" incorrectly... as in,
> > we "map" the request to each shard, and then "reduce" the results to a
> > single list (of course, that's not actually map-reduce at all ;-)
> >
> >
> :) this sounds pretty strange to me as well. It was only my guess, that
> if
> you have MR as computational model and a cloud beneath it, you could
> naturally map facet fields to their counts inside single documents (no
> matter, where they are, be it shards or "single" index) and pass them
> onto
> reducers.
> 
> 
> > -Yonik
> > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> > 25-26, San Francisco
> >
> 
> 
> 
> -- 
> Regards,
> 
> Dmitry Kan
> 
--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source



Suggester spellcheck component and infix

2011-03-25 Thread Kai Schlamp-2
Does the suggester component of Solr also support infix search? (like
.*ompute.*)

Kai

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Suggester-spellcheck-component-and-infix-tp2729996p2729996.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr on the cloud

2011-03-25 Thread Dmitry Kan
Hi, Upayavira

Probably I'm confusing the terms here. When I say "distributed faceting" I
mean Solr on the cloud (e.g. HDFS + MR + a cloud of commodity machines) rather
than traditional multicore/sharded Solr on a single server or multiple servers
with non-distributed file systems (is that what you mean when you refer to
"distribution of facet requests across hosts"?).

On Fri, Mar 25, 2011 at 1:57 PM, Upayavira  wrote:

>
>
> On Fri, 25 Mar 2011 13:44 +0200, "Dmitry Kan" 
> wrote:
> > Hi Yonik,
> >
> > Oh, this is great. Is distributed faceting available in the trunk? What
> > is
> > the basic server setup needed for trying this out, is it cloud with HDFS
> > and
> > SOLR with zookepers?
> > Any chance to see the related documentation? :)
>
> Distributed faceting has been available for a long time, and is
> available in the 1.4.1 release.
>
> The distribution of facet requests across hosts happens in the
> background. There's no real difference (in query syntax) between a
> standard facet query and a distributed one.
>
> i.e. you don't need SolrCloud nor Zookeeper for it. (they may provide
> other benefits, but you don't need them for distributed faceting).
>
> Upayavira
>
> > On Fri, Mar 25, 2011 at 1:35 PM, Yonik Seeley
> > wrote:
> >
> > > On Tue, Mar 22, 2011 at 7:51 AM, Dmitry Kan 
> wrote:
> > > > Basically, of high interest is checking out the Map-Reduce for
> > > distributed
> > > > faceting, is it even possible with the trunk?
> > >
> > > Solr already has distributed faceting, and it's much more performant
> > > than a map-reduce implementation would be.
> > >
> > > I've also seen a product use the term "map reduce" incorrectly... as
> in,
> > > we "map" the request to each shard, and then "reduce" the results to a
> > > single list (of course, that's not actually map-reduce at all ;-)
> > >
> > >
> > :) this sounds pretty strange to me as well. It was only my guess, that
> > if
> > you have MR as computational model and a cloud beneath it, you could
> > naturally map facet fields to their counts inside single documents (no
> > matter, where they are, be it shards or "single" index) and pass them
> > onto
> > reducers.
> >
> >
> > > -Yonik
> > > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> > > 25-26, San Francisco
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Dmitry Kan
> >
> ---
> Enterprise Search Consultant at Sourcesense UK,
> Making Sense of Open Source
>
>


-- 
Regards,

Dmitry Kan


Deduplication questions

2011-03-25 Thread eks dev
Q1. Is it possible to pass *analyzed* content to the

public abstract class Signature {
  public void init(SolrParams nl) {  }
  public abstract String calculate(String content);
}


Q2. The calculate() method uses fields concatenated from e.g. name,features,cat.
Is there any mechanism by which I could build "field-dependent signatures"?

Use case for this: I have two fields,
OWNER and TEXT.
I need to prevent *fuzzy* duplicates per owner; one clean way
would be to make a prefixed signature "OWNER/FUZZY_SIGNATURE".

Is the idea of making two UpdateProcessors and chaining them OK? (It's ugly,
but it would work.)

  <updateRequestProcessorChain>
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <bool name="overwriteDupes">false</bool>
      <str name="signatureField">exact_signature</str>
      <str name="fields">OWNER</str>
      <str name="signatureClass">ExactSignature</str>
    </processor>

(hard_signature should be a field that is neither stored nor indexed)

    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <bool name="overwriteDupes">true</bool>
      <str name="signatureField">mixed_signature</str>
      <str name="fields">exact_signature, TEXT</str>
      <str name="signatureClass">MixedSignature</str>
    </processor>
  </updateRequestProcessorChain>

Assuming I know how long my exact_signature is, I could calculate the fuzzy
part and mix it in properly.

Any better ideas?

Thanks,
eks


Re: Newbie wants to index XML content.

2011-03-25 Thread Erick Erickson
Solr does not index arbitrary XML documents (but see Martin's comments
about DIH). Solr will, however, index XML documents that have a specific
format. The general form is:

<add>
  <doc>
    <field name="fieldname">value to index</field>
    <field name="anotherfield">value for this field</field>
  </doc>
</add>

So you can either try DIH or parse the raw XML yourself and put it in the above
form for indexing...

Best
Erick

On Thu, Mar 24, 2011 at 4:54 PM, Marcelo Iturbe  wrote:
> Hello,
> I've been reading up on how to index XML content but have a few questions.
>
> How is data in element attributes handled or defined? How are nested
> elements handled?
>
> In the following XML structure, I want to index the content of what is
> between the  tags.
> In one XML document, there can be up to 100  tags.
> So the  tag would be equivalent to the  tag...
>
> Can I somehow index this XML "as is" or will I have to parse it, creating
> the  tag and placing all the elements on the same level?
>
> Thanks for your help.
>
> 
> 
>    manual
>    
>        MC Anon User
>        mca...@mcdomain.com
>    
>
>    
>        
>            John Smith
>        
>        jsmit...@gmail.com
>    
>
>    
>        
>            First Last
>            First
>            Last
>        
>        
>            MC S.A.
>            CIO
>        
>        fi...@mcdomain.com
>        flas...@yahoo.com
>        +5629460600
>        fi...@mcdomain.com
>        First.Last
>        111 Bude St, Toronto
>        http://blog.mcdomain.com/
>    
> 
>
> regards
> Marcelo
>


Re: problem with snowballporterfilterfactory

2011-03-25 Thread Erick Erickson
Why are you using the stemmer at all then? This is the exact
inverse of how protwords.txt is usually used

You might think about removing the stemmer from the analysis
chain and using synonyms to transform your list of words

Best
Erick
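
A sketch of what Erick's suggestion could look like in schema.xml (the field
type name and file names here are illustrative):

<fieldType name="text_nostem" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="false"/>
  </analyzer>
</fieldType>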

On Fri, Mar 25, 2011 at 5:59 AM, anurag.walia  wrote:
> Thanks in advance.
> Please help me resolve this issue.
> Please find the screenshot from the analyzer. I have a problem with the number
> of characters in the term text after SnowballPorterFilterFactory. I entered
> "Polymer", but after SnowballPorterFilterFactory it becomes "Polym", even though
> the word does not exist in the protwords.txt file. I want the term text to
> remain the whole word, like "Polymer", whenever a word does not exist in
> protwords.txt.
> Regards
> Anurag Walia
> http://lucene.472066.n3.nabble.com/file/n2729734/solrproblem.jpg
> solrproblem.jpg
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/problem-with-snowballporterfilterfactory-tp2729589p2729734.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: solr on the cloud

2011-03-25 Thread Upayavira


On Fri, 25 Mar 2011 14:26 +0200, "Dmitry Kan" 
wrote:
> Hi, Upayavira
> 
> Probably I'm confusing the terms here. When I say "distributed faceting"
> I'm
> more into SOLR on the cloud (e.g. HDFS + MR + cloud of commodity
> machines)
> rather than into traditional multicore/sharded SOLR on a single or
> multiple
> servers with non-distributed file systems (is that what you mean when you
> refer to "distribution of facet requests across hosts"?)

I mean the latter, I'm afraid. I'm very interested in how the former
might be implemented, but as far as I understand it, Zookeeper does not
take you all the way there. It co-ordinates nodes (e.g. telling a slave
where its master is), but if you have to distribute an index over
multiple hosts, it will be sharded between multiple solr hosts, with
each of those hosts having a local index.

You are presumably talking about a scenario where you effectively have
one index, spanning multiple hosts (we have code to distribute queries
across multiple segments, why can't we do it across multiple hosts?).
I've heard of work to do this with Infinispan underneath, but not within
the core Lucene/Solr.

Upayavira

> On Fri, Mar 25, 2011 at 1:57 PM, Upayavira  wrote:
> 
> >
> >
> > On Fri, 25 Mar 2011 13:44 +0200, "Dmitry Kan" 
> > wrote:
> > > Hi Yonik,
> > >
> > > Oh, this is great. Is distributed faceting available in the trunk? What
> > > is
> > > the basic server setup needed for trying this out, is it cloud with HDFS
> > > and
> > > SOLR with zookepers?
> > > Any chance to see the related documentation? :)
> >
> > Distributed faceting has been available for a long time, and is
> > available in the 1.4.1 release.
> >
> > The distribution of facet requests across hosts happens in the
> > background. There's no real difference (in query syntax) between a
> > standard facet query and a distributed one.
> >
> > i.e. you don't need SolrCloud nor Zookeeper for it. (they may provide
> > other benefits, but you don't need them for distributed faceting).
> >
> > Upayavira
> >
> > > On Fri, Mar 25, 2011 at 1:35 PM, Yonik Seeley
> > > wrote:
> > >
> > > > On Tue, Mar 22, 2011 at 7:51 AM, Dmitry Kan 
> > wrote:
> > > > > Basically, of high interest is checking out the Map-Reduce for
> > > > distributed
> > > > > faceting, is it even possible with the trunk?
> > > >
> > > > Solr already has distributed faceting, and it's much more performant
> > > > than a map-reduce implementation would be.
> > > >
> > > > I've also seen a product use the term "map reduce" incorrectly... as
> > in,
> > > > we "map" the request to each shard, and then "reduce" the results to a
> > > > single list (of course, that's not actually map-reduce at all ;-)
> > > >
> > > >
> > > :) this sounds pretty strange to me as well. It was only my guess, that
> > > if
> > > you have MR as computational model and a cloud beneath it, you could
> > > naturally map facet fields to their counts inside single documents (no
> > > matter, where they are, be it shards or "single" index) and pass them
> > > onto
> > > reducers.
> > >
> > >
> > > > -Yonik
> > > > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> > > > 25-26, San Francisco
> > > >
> > >
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Dmitry Kan
> > >
> > ---
> > Enterprise Search Consultant at Sourcesense UK,
> > Making Sense of Open Source
> >
> >
> 
> 
> -- 
> Regards,
> 
> Dmitry Kan
> 
--- 
Enterprise Search Consultant at Sourcesense UK, 
Making Sense of Open Source



Search in database and documents

2011-03-25 Thread Deepak Singh
I'm new to Solr search. I want to change the schema to search in a database
and in documents.


Re: Problems with creating a query that matches all the documents I want to display

2011-03-25 Thread Jan-Eirik B . Nævdal
Hi and thanks for all the answers.

I finally managed to construct an fq that did what I wanted:

fq=(-(-obj_todate_dt:[NOW/MINUTE TO *] AND obj_todate_dt:[* TO *]) AND
-(-obj_fromdate_dt:[* TO NOW/MINUTE] AND obj_fromdate_dt:[* TO *]))

This gave me all documents without opening and closing times, plus those
documents that should currently be viewable. I have to use minutes since the
solution tracks opening and closing times at minute granularity.

friendly
Jan Eirik




On Tue, Mar 22, 2011 at 1:41 AM, Jonathan Rochkind  wrote:

> If the "OR" actually worked to do what it's trying to say, would it be what
> you wanted?
>
> Because I can't believe I didn't recognize this is an instance of the very
> thing I posted on this list this morning, where the solr-lucene query parser
> has problems with some kinds of 'pure negative' queries. Try this as a
> workaround to those problems:
>
> fq=(*:* AND -openingtime:[* TO *]) OR openingtime:[* TO NOW]
>
> If the semantics we're trying to express there are in fact what you want,
> there's probably a way to make it work, if the problem is just how to
> actually get solr to give you "everything without an opening time or an
> opening time before NOW" in an fq.
>
> Jonathan
> 
> From: Jan-Eirik B. Nævdal [jan-eirik.naev...@iterate.no]
> Sent: Monday, March 21, 2011 6:55 PM
> To: Jonathan Rochkind
> Cc: solr-user@lucene.apache.org
> Subject: Re: Problems with creating a query that matches all the documents
> I want to display
>
> Unfortunately, I have already tried the OR approach in the fq:
> with the positive filter query first I get document 4; with the negative filter
> query first I get no results.
> This request gives me 1 reply:
>
> http://localhost:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=100&indent=on]&fq=obj_todate_dt%3A[*%20TO%20NOW]%20&fq=obj_todate_dt%3A[NOW%20TO%20*
> ]
>
> same as this (but if %20 is used between OR it would not give any hits)
>
>
> http://localhost:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=100&indent=on]&fq=obj_fromdate_dt%3A[*%20TO%20NOW]OR-obj_fromdate_dt%3A[*%20TO%20*]&fq=obj_todate_dt%3A[NOW%20TO%20*]OR-obj_todate_dt%3A[*%20TO%20*
> ]
>
>
> This does not give any result:
>
> http://localhost:8983/solr/select/?q=*%3A*&version=2.2&start=0&rows=100&indent=on]&fq=-obj_fromdate_dt%3A[*%20TO%20*]ORobj_fromdate_dt%3A[*%20TO%20NOW]&fq=-obj_todate_dt%3A[*%20TO%20*]ORobj_todate_dt%3A[NOW%20TO%20*
> ]
>
>
> I tested on both 1.3 and 1.4 just to be sure, also changing the default
> operator to OR.
> I'm running a standard Solr 1.4.1 schema and just added these as dynamic
> fields in the example docs, to make it easy for me to test.
>
>
> On Mon, Mar 21, 2011 at 10:53 PM, Jonathan Rochkind  > wrote:
> You can put an actual OR in the fq (an fq, by default, is in the
> solr-lucene query parser language). Might that achieve what you want?
>
> &fq=  -openingtime:[* TO *] OR openingtime:[* TO NOW]
> &fq=  -closingtime:[* TO *] OR closingtime:[NOW TO *]
>
> Does that, or some variation of it, do what you need?
>
>
> On 3/21/2011 5:43 PM, Jan-Eirik B. Nævdal wrote:
> Hi,
> I have a problem I tried to solve with filter queries, but I think I'm stuck
> now and don't see a way to solve it.
>
> My problem is that I want a result page that shows the documents that match
> these filter queries: fq=openingtime:[* TO NOW], fq=closingtime:[NOW TO *],
> for the documents with limited time access,
> but I also want all documents that do not have the fields openingtime and
> closingtime defined,
> as in these filter queries: fq=-openingtime:[* TO *] and fq=-closingtime:[* TO *]
>
> Is there some solution that allows me to make a "join" of these two filter
> queries that supports pagination?
> A client-side "manual" join would not be the best solution here because of
> the system the results are displayed in.
>
> Simple example:
> Document 1: openingtime = 1545 1. May 2050, closingtime = 1453 1. June 2050
> // available in the future
> Document 2: does not have the fields openingtime and closingtime
> Document 3: does not have the fields openingtime and closingtime
> Document 4: openingtime = 1545 1. May 2010, closingtime = 1453 1. June 2030
> // available now
>
> My result page should then show documents 2, 3 and 4, but not document 1.
> Can anyone point me in the direction of how to solve this?
>
>
> Technical information:
> Solr 1.4.1 (being ported from 1.3); reasons for going to 3.x would be
> appreciated.
> Default operator: AND
> Several different documents where one type should only be displayed in the
> results for a limited time. This information is indexed as dates in that
> type of document.
> Those fields does not exist on the other documents in the index.
> Not any major changes in the schema.xml
> A solr instance here can contain between 5K-10M documents
>
>
>
> JanEirik
>
>
>
>
> --
> Jan Eirik B. Nævdal
> Solutions Engineer | +47 982 65 347
> Iterate AS | www.iterate.no
> Th

Re: Search in database and documents

2011-03-25 Thread Jan Høydahl
Hi again :)

Please elaborate on what you are trying to do in more detail, and we'll be able 
to suggest a way forward.

Read this page carefully: http://wiki.apache.org/solr/UsingMailingLists

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 25. mars 2011, at 14.16, Deepak Singh wrote:

> I'm new to Solr search. I want to change the schema to search in a database
> and in documents.



Create 2 index with solr

2011-03-25 Thread Amel Fraisse
Hi,

I am using Solr to index documents, and I would like to index my documents
with 2 different analyzers and generate 2 indexes.

So far I don't know how I could generate 2 different indexes.

Thank you for your help.

Amel.


Re: Create 2 index with solr

2011-03-25 Thread Dmitry Kan
Hi Amel,

If you copy example dir from the solr distribution dir to example2 and
change jetty's port in the example2/etc/jetty.xml to something different
from the one in example/etc/jetty.xml, you'll effectively have two different
servers with two separate SOLRs.

Now you can independently modify your schemas as you want, start up your
servers and post your documents to them, generating 2 different indices.


On Fri, Mar 25, 2011 at 4:16 PM, Amel Fraisse wrote:

> Hi,
>
> I am using Solr to index documents, and I would like to index my documents
> with 2 different analyzers and generate 2 indexes.
>
> So far I don't know how I could generate 2 different indexes.
>
> Thank you for your help.
>
> Amel.
>



-- 
Regards,

Dmitry Kan


Synonyms: whitespace problem

2011-03-25 Thread royr
Hello,

I have a problem with the synonyms. Solr splits the synonyms on whitespace.
An example:

manchester united, reds, manunited

My index looks like this:

manchester
united
red
manunited

I want this:
manchester united
red
manunited

my configuration: [analyzer definition not preserved in the archive]

How can I fix this problem?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Synonyms-whitespace-problem-tp2730953p2730953.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Wanted: a directory of quick-and-(not too)dirty analyzers for multi-language RDF.

2011-03-25 Thread Grant Ingersoll
You are looking for a language identification tool.  You could check 
https://issues.apache.org/jira/browse/SOLR-1979 for the start of this.  
Otherwise, you have to roll your own or buy a third party one.

On Mar 24, 2011, at 12:24 PM, fr.jur...@voila.fr wrote:

> Hello Solrists,
> 
> As it says in the subject line, I'm looking for a Java component that,
> given an ISO 639-1 code or some equivalent,
> would return a Lucene Analyzer ready to gobble documents in the corresponding
> language.
> Solr looks like it has to contain one,
> only I've not been able to locate it so far;
> can you point me to the spot?
>
> I've found org.apache.solr.analysis,
> and things like org.apache.lucene.analysis.bg etc. in lucene/modules,
> with many classes which I'm sure are related; however, the factory itself
> still eludes me.
> I mean the Java class/method that would decide, on request, what to do with
> all these packages
> to bring the requisite object into existence once the language is specified.
> Where should I look? Or was I mistaken, and Solr has nothing of the kind, at
> least in Java?
> Thanks in advance for your help.
> 
> Best regards,
>François Jurain.
> 
> 
> 
> 
> 
> 

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search



Re: Multiple Cores with Solr Cell for indexing documents

2011-03-25 Thread Markus Jelsma
You can only set properties for a lib dir that must be used in solrconfig.xml. 
You can use sharedLib in solr.xml though.
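
For reference, a minimal multicore solr.xml with a shared lib dir looks
something like this (the core names are hypothetical; sharedLib is typically
resolved relative to the solr home directory):

<solr persistent="true" sharedLib="lib">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0"/>
    <core name="core1" instanceDir="core1"/>
  </cores>
</solr>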

> There's options in solr.xml that point to lib dirs. Make sure you get
> them right.
> 
> Upayavira
> 
> On Thu, 24 Mar 2011 23:28 +0100, "Markus Jelsma"
> 
>  wrote:
> > I believe it's example/solr/lib where it looks for shared libs in
> > multicore.
> > But, each core can has its own lib dir, usually in core/lib. This is
> > referenced to in solrconfig.xml, see the example config for the lib
> > directive.
> > 
> > > Well, there lies the problem--it's not JUST the Tika jar.  If it's not
> > > one thing, it's another, and I'm not even sure which directory Solr
> > > actually looks in.  In my Solr.xml file I have it use a shared library
> > > folder for every core.  Since each core will be holding very
> > > homologous data, there's no need to have any different library modules
> > > for each.
> > > 
> > > The relevant line in my solr.xml file is <solr ... sharedLib="lib">.  That is housed in .../example/solr/.  So, does it
> > > look in .../example/lib or .../example/solr/lib?
> > > 
> > > ~Brandon Waterloo
> > > 
> > > From: Markus Jelsma [markus.jel...@openindex.io]
> > > Sent: Thursday, March 24, 2011 11:29 AM
> > > To: solr-user@lucene.apache.org
> > > Cc: Brandon Waterloo
> > > Subject: Re: Multiple Cores with Solr Cell for indexing documents
> > > 
> > > Sounds like the Tika jar is not on the class path. Add it to a
> > > directory where Solr's looking for libs.
> > > 
> > > On Thursday 24 March 2011 16:24:17 Brandon Waterloo wrote:
> > > > Hello everyone,
> > > > 
> > > > I've been trying for several hours now to set up Solr with multiple
> > > > cores with Solr Cell working on each core. The only items being
> > > > indexed are PDF, DOC, and TXT files (with the possibility of
> > > > expanding this list, but for now, just assume the only things in the
> > > > index should be documents).
> > > > 
> > > > I never had any problems with Solr Cell when I was using a single
> > > > core. In fact, I just ran the default installation in example/ and
> > > > worked from that. However, trying to migrate to multi-core has been
> > > > a never ending list of problems.
> > > > 
> > > > Any time I try to add a document to the index (using the same curl
> > > > command as I did to add to the single core, of course adding the core
> > > > name to the request URL-- host/solr/corename/update/extract...), I
> > > > get HTTP 500 errors due to classes not being found and/or lazy
> > > > loading errors. I've copied the exact example/lib directory into the
> > > > cores, and that doesn't work either.
> > > > 
> > > > Frankly the only libraries I want are those relevant to indexing
> > > > files. The less bloat, the better, after all. However, I cannot
> > > > figure out where to put what files, and why the example installation
> > > > works perfectly for single-core but not with multi-cores.
> > > > 
> > > > Here is an example of the errors I'm receiving:
> > > > 
> > > > command prompt> curl
> > > > "host/solr/core0/update/extract?literal.id=2-3-1&commit=true" -F
> > > > "myfile=@test2.txt"
> > > > 
> > > > 
> > > > 
> > > > Error 500
> > > >
> > > > HTTP ERROR: 500
> > > > org/apache/tika/exception/TikaException
> > > >
> > > > java.lang.NoClassDefFoundError: org/apache/tika/exception/TikaException
> > > > at java.lang.Class.forName0(Native Method)
> > > > at java.lang.Class.forName(Class.java:247)
> > > > at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:359)
> > > > at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413)
> > > > at org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:449)
> > > > at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWrappedHandler(RequestHandlers.java:240)
> > > > at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:231)
> > > > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> > > > at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
> > > > at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
> > > > at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
> > > > at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
> > > > at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> > > > at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
> > > > at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
> > > > at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
> > > > at org.mortbay.jetty.han

RE: Multiple Cores with Solr Cell for indexing documents

2011-03-25 Thread Brandon Waterloo
I did finally manage to deploy Solr with multiple cores but we've been running 
into so many problems with permissions, index location, and other things that I 
(quite fortunately) convinced my boss that multiple cores are not the way to go 
here.  I had in place a single-core system that would filter the results based 
on their ID numbers, and show only the subset of results that you wanted to 
see.  The disadvantage is that it's a single core and thus will take longer to 
search over the entire index.  The advantage is that it's better in every other 
way.

So the plan now is to move back to single-core searching and then test it with 
a huge amount of documents to see whether performance is seriously impacted or 
not.  So for now, I guess we can consider this thread resolved.

Thanks for all your help guys!

~Brandon Waterloo



From: Markus Jelsma [markus.jel...@openindex.io]
Sent: Friday, March 25, 2011 1:23 PM
To: solr-user@lucene.apache.org
Cc: Upayavira
Subject: Re: Multiple Cores with Solr Cell for indexing documents

You can only set properties for a lib dir that must be used in solrconfig.xml.
You can use sharedLib in solr.xml though.

> There's options in solr.xml that point to lib dirs. Make sure you get
> them right.
>
> Upayavira
>
> On Thu, 24 Mar 2011 23:28 +0100, "Markus Jelsma"
>
>  wrote:
> > I believe it's example/solr/lib where it looks for shared libs in
> > multicore.
> > But, each core can has its own lib dir, usually in core/lib. This is
> > referenced to in solrconfig.xml, see the example config for the lib
> > directive.
> >
> > > Well, there lies the problem--it's not JUST the Tika jar.  If it's not
> > > one thing, it's another, and I'm not even sure which directory Solr
> > > actually looks in.  In my Solr.xml file I have it use a shared library
> > > folder for every core.  Since each core will be holding very
> > > homologous data, there's no need to have any different library modules
> > > for each.
> > >
> > > The relevant line in my solr.xml file is  > > sharedLib="lib">.  That is housed in .../example/solr/.  So, does it
> > > look in .../example/lib or .../example/solr/lib?
> > >
> > > ~Brandon Waterloo
> > > 
> > > From: Markus Jelsma [markus.jel...@openindex.io]
> > > Sent: Thursday, March 24, 2011 11:29 AM
> > > To: solr-user@lucene.apache.org
> > > Cc: Brandon Waterloo
> > > Subject: Re: Multiple Cores with Solr Cell for indexing documents
> > >
> > > Sounds like the Tika jar is not on the class path. Add it to a
> > > directory where Solr's looking for libs.
> > >
> > > On Thursday 24 March 2011 16:24:17 Brandon Waterloo wrote:
> > > > Hello everyone,
> > > >
> > > > I've been trying for several hours now to set up Solr with multiple
> > > > cores with Solr Cell working on each core. The only items being
> > > > indexed are PDF, DOC, and TXT files (with the possibility of
> > > > expanding this list, but for now, just assume the only things in the
> > > > index should be documents).
> > > >
> > > > I never had any problems with Solr Cell when I was using a single
> > > > core. In fact, I just ran the default installation in example/ and
> > > > worked from that. However, trying to migrate to multi-core has been
> > > > a never ending list of problems.
> > > >
> > > > Any time I try to add a document to the index (using the same curl
> > > > command as I did to add to the single core, of course adding the core
> > > > name to the request URL-- host/solr/corename/update/extract...), I
> > > > get HTTP 500 errors due to classes not being found and/or lazy
> > > > loading errors. I've copied the exact example/lib directory into the
> > > > cores, and that doesn't work either.
> > > >
> > > > Frankly the only libraries I want are those relevant to indexing
> > > > files. The less bloat, the better, after all. However, I cannot
> > > > figure out where to put what files, and why the example installation
> > > > works perfectly for single-core but not with multi-cores.
> > > >
> > > > Here is an example of the errors I'm receiving:
> > > >
> > > > command prompt> curl
> > > > "host/solr/core0/update/extract?literal.id=2-3-1&commit=true" -F
> > > > "myfile=@test2.txt"
> > > >
> > > > 
> > > > 
> > > >  Error 500 
> > > > 
> > > > HTTP ERROR:
> > > > 500org/apache/tika/exception/TikaException
> > > >
> > > > java.lang.NoClassDefFoundError:
> > > > org/apache/tika/exception/TikaException at
> > > > java.lang.Class.forName0(Native Method)
> > > > at java.lang.Class.forName(Class.java:247)
> > > > at
> > > > org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.
> > > > java
> > > >
> > > > : 359) at
> > > > : org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413)
> > > >
> > > > at
> > > > org.apache.solr.core.SolrCore.createRequestHandler(SolrCore.java:449
> > > > ) at
> > > > org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.getWra
> > > > ppe dH andler(Requ

Re: Multiple Cores with Solr Cell for indexing documents

2011-03-25 Thread Erick Erickson
Right, and you can go to sharding rather than managing your multiple
cores if that's warranted.

Erick

On Fri, Mar 25, 2011 at 1:31 PM, Brandon Waterloo
 wrote:
> I did finally manage to deploy Solr with multiple cores but we've been 
> running into so many problems with permissions, index location, and other 
> things that I (quite fortunately) convinced my boss that multiple cores are 
> not the way to go here.  I had in place a single-core system that would 
> filter the results based on their ID numbers, and show only the subset of 
> results that you wanted to see.  The disadvantage is that it's a single core 
> and thus will take longer to search over the entire index.  The advantage is 
> that it's better in every other way.
>
> So the plan now is to move back to single-core searching and then test it 
> with a huge amount of documents to see whether performance is seriously 
> impacted or not.  So for now, I guess we can consider this thread resolved.
>
> Thanks for all your help guys!
>
> ~Brandon Waterloo
>
>
> 
> From: Markus Jelsma [markus.jel...@openindex.io]
> Sent: Friday, March 25, 2011 1:23 PM
> To: solr-user@lucene.apache.org
> Cc: Upayavira
> Subject: Re: Multiple Cores with Solr Cell for indexing documents
>
> You can only set properties for a lib dir that must be used in solrconfig.xml.
> You can use sharedLib in solr.xml though.
>
>> There's options in solr.xml that point to lib dirs. Make sure you get
>> them right.
>>
>> Upayavira
>>
>> On Thu, 24 Mar 2011 23:28 +0100, "Markus Jelsma"
>>
>>  wrote:
>> > I believe it's example/solr/lib where it looks for shared libs in
>> > multicore.
>> > But, each core can has its own lib dir, usually in core/lib. This is
>> > referenced to in solrconfig.xml, see the example config for the lib
>> > directive.
>> >
>> > > Well, there lies the problem--it's not JUST the Tika jar.  If it's not
>> > > one thing, it's another, and I'm not even sure which directory Solr
>> > > actually looks in.  In my Solr.xml file I have it use a shared library
>> > > folder for every core.  Since each core will be holding very
>> > > homologous data, there's no need to have any different library modules
>> > > for each.
>> > >
>> > > The relevant line in my solr.xml file is <solr ... sharedLib="lib">.  That is housed in .../example/solr/.  So, does it
>> > > look in .../example/lib or .../example/solr/lib?
>> > >
>> > > ~Brandon Waterloo
>> > > 
>> > > From: Markus Jelsma [markus.jel...@openindex.io]
>> > > Sent: Thursday, March 24, 2011 11:29 AM
>> > > To: solr-user@lucene.apache.org
>> > > Cc: Brandon Waterloo
>> > > Subject: Re: Multiple Cores with Solr Cell for indexing documents
>> > >
>> > > Sounds like the Tika jar is not on the class path. Add it to a
>> > > directory where Solr's looking for libs.
>> > >
>> > > On Thursday 24 March 2011 16:24:17 Brandon Waterloo wrote:
>> > > > Hello everyone,
>> > > >
>> > > > I've been trying for several hours now to set up Solr with multiple
>> > > > cores with Solr Cell working on each core. The only items being
>> > > > indexed are PDF, DOC, and TXT files (with the possibility of
>> > > > expanding this list, but for now, just assume the only things in the
>> > > > index should be documents).
>> > > >
>> > > > I never had any problems with Solr Cell when I was using a single
>> > > > core. In fact, I just ran the default installation in example/ and
>> > > > worked from that. However, trying to migrate to multi-core has been
>> > > > a never ending list of problems.
>> > > >
>> > > > Any time I try to add a document to the index (using the same curl
>> > > > command as I did to add to the single core, of course adding the core
>> > > > name to the request URL-- host/solr/corename/update/extract...), I
>> > > > get HTTP 500 errors due to classes not being found and/or lazy
>> > > > loading errors. I've copied the exact example/lib directory into the
>> > > > cores, and that doesn't work either.
>> > > >
>> > > > Frankly the only libraries I want are those relevant to indexing
>> > > > files. The less bloat, the better, after all. However, I cannot
>> > > > figure out where to put what files, and why the example installation
>> > > > works perfectly for single-core but not with multi-cores.
>> > > >
>> > > > Here is an example of the errors I'm receiving:
>> > > >
>> > > > command prompt> curl
>> > > > "host/solr/core0/update/extract?literal.id=2-3-1&commit=true" -F
>> > > > "myfile=@test2.txt"
>> > > >
>> > > > Error 500
>> > > > HTTP ERROR: 500 org/apache/tika/exception/TikaException
>> > > >
>> > > > java.lang.NoClassDefFoundError: org/apache/tika/exception/TikaException
>> > > > at java.lang.Class.forName0(Native Method)
>> > > > at java.lang.Class.forName(Class.java:247)
>> > > > at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java
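
For reference on the sharedLib question above, a minimal multicore solr.xml, as a sketch (the core names and instanceDirs are placeholders):

<solr persistent="true">
  <cores adminPath="/admin/cores" sharedLib="lib">
    <core name="core0" instanceDir="core0" />
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>

As far as I know, sharedLib is resolved relative to the solr home directory (the one containing solr.xml), so in this setup it is .../example/solr/lib rather than .../example/lib; jars such as Tika's belong there if every core should share them.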

Re: stopwords not working in multicore setup

2011-03-25 Thread Christopher Bottaro
Ahh, thank you for the hints Martin... German stopwords without Umlaut work
correctly.

So I'm trying to figure out where the UTF-8 chars are getting messed up.
Using the Solr admin web UI, I did a search for title:für and the xml (or
json) output in the browser shows the query with the proper encoding, but
the Solr logs show this:

INFO: [page_30d_de] webapp=/solr path=/select
params={explainOther=&fl=*,score&indent=on&start=0&q=title:f?r&hl.fl=&qt=standard&wt=xml&fq=&version=2.2&rows=10}
hits=76 status=0 QTime=2

Notice the title:f?r.  How do I fix that?  I'm using Jetty btw...

Thanks for the help.
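
One thing worth trying, as a sketch (assuming the Jetty 6 that ships with the Solr example): both the charset Jetty uses to decode request URIs and the JVM default encoding used for log output can turn a ü into ?. Both can be forced to UTF-8 at startup:

java -Dorg.mortbay.util.URI.charset=UTF-8 -Dfile.encoding=UTF-8 -jar start.jar

If the query then shows up correctly in the log, the ? above was only a logging artifact; if results are still wrong, check that the client URL-encodes ü as UTF-8 (%C3%BC) rather than Latin-1 (%FC).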

On Fri, Mar 25, 2011 at 3:05 AM, Martin Rödig  wrote:

> I have some questions about your config:
>
> Is the stopwords-de.txt in the same directory as the schema.xml?
> Is the title field of type text?
> Do you have the same problem with German stopwords without an umlaut (ü, ö, ä),
> like the word "denn"?
>
> A problem can be that the stopwords-de.txt is not saved as UTF-8, so the
> filter cannot read the umlaut ü in the file.
>
>
> -----Original Message-----
> From: Christopher Bottaro [mailto:cjbott...@onespot.com]
> Sent: Friday, 25 March 2011 05:37
> To: solr-user@lucene.apache.org
> Subject: stopwords not working in multicore setup
>
> Hello,
>
> I'm running a Solr server with 5 cores.  Three are for English content and
> two are for German content.  The default stopwords setup works fine for the
> English cores, but the German stopwords aren't working.
>
> The German stopwords file is stopwords-de.txt and resides in the same
> directory as stopwords.txt.  The German cores use a different schema (named
> schema.page.de.xml) which has the following text field definition:
> http://pastie.org/1711866
>
> The stopwords-de.txt file looks like this:  http://pastie.org/1711869
>
> The query I'm doing is this:  q => "title:für"
>
> And it's returning documents with für in the title.  Title is a text field
> which should use the stopwords-de.txt, as seen in the aforementioned pastie.
>
> Any ideas?  Thanks for the help.
>


Re: Synonyms: whitespace problem

2011-03-25 Thread Ahmet Arslan
> I have a problem with the synonyms. SOLR strips the
> synonyms on white space.
> An example:
> 
> manchester united, reds, manunited
> 
> My index looks like this:
> 
> manchester
> united
> red
> manunited
> 
> I want this:
> manchester united
> red
> manunited


You can escape whitespace with a backslash:

manchester\ united, reds, manunited
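
For context, a sketch of how the file is typically wired into the analyzer (the field type name here is a placeholder; adjust to the real schema):

<fieldType name="text_syn" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The escaped entry is read as a single token containing a space, so it is worth confirming on the analysis page (analysis.jsp) that the rule matches and emits what you expect.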





Dismax and worddelimiterfilter

2011-03-25 Thread David Yang
Hi,

 

I am having some really strange issues matching "N61JQ-B2". If I had a
field "N61JQ-B2", and I wanted to match "N61JQ", "N61JQB2", "N61JQ-B2"
and "N61JQ B2" in dismax, what fieldtype should it have? My final
fallback is to use ngrams but that would impose a pretty large overhead,
since the field could be a long normal string with one model number in
it.

 

I noticed that when I used WordDelimiterFilterFactory, dismax converted the
parsed query into some pre-analyzed form.

 

Cheers,

David
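
A sketch of a field type that should produce those matches, using stock WordDelimiterFilterFactory options (the name text_model is a placeholder, and the flags are worth verifying on analysis.jsp):

<fieldType name="text_model" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- index N61JQ, B2, N61JQB2 and the original N61JQ-B2 side by side -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1" catenateAll="1"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- split but do not catenate at query time -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Catenating only at index time is the usual recommendation; catenation at query time is one common cause of the strange pre-analyzed queries described above.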



Re: solr on the cloud

2011-03-25 Thread Otis Gospodnetic
Hi Dan,

This feels a bit like a buzzword soup with mushrooms. :)

MR jobs, at least the ones in Hadoopland, are very batch oriented, so that 
wouldn't be very suitable for most search applications.  There are some 
technologies like Riak that combine MR and search.  Let me use this funny 
little 
link: http://lmgtfy.com/?q=riak%20mapreduce%20search


Sure, you can put indices on HDFS (but don't expect searches to be fast).  Sure 
you can create indices using MapReduce, we've done that successfully for 
customers bringing long indexing jobs from many hours to minutes by using, yes, 
a cluster of machines (actually EC2 instances).
But when you say "more into SOLR on the cloud (e.g. HDFS + MR +  cloud of 
commodity machines)", I can't actually picture what precisely you mean...  


Otis
---
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
> From: Dmitry Kan 
> To: solr-user@lucene.apache.org
> Cc: Upayavira 
> Sent: Fri, March 25, 2011 8:26:33 AM
> Subject: Re: solr on the cloud
> 
> Hi, Upayavira
> 
> Probably I'm confusing the terms here. When I say  "distributed faceting" I'm
> more into SOLR on the cloud (e.g. HDFS + MR +  cloud of commodity machines)
> rather than into traditional multicore/sharded  SOLR on a single or multiple
> servers with non-distributed file systems (is  that what you mean when you
> refer to "distribution of facet requests across  hosts"?)
> 
> On Fri, Mar 25, 2011 at 1:57 PM, Upayavira   wrote:
> 
> >
> >
> > On Fri, 25 Mar 2011 13:44 +0200, "Dmitry Kan"  
> >  wrote:
> > > Hi Yonik,
> > >
> > > Oh, this is great. Is  distributed faceting available in the trunk? What
> > > is
> > >  the basic server setup needed for trying this out, is it cloud with HDFS
> >  > and
> > > SOLR with zookepers?
> > > Any chance to see the  related documentation? :)
> >
> > Distributed faceting has been  available for a long time, and is
> > available in the 1.4.1  release.
> >
> > The distribution of facet requests across hosts happens  in the
> > background. There's no real difference (in query syntax) between  a
> > standard facet query and a distributed one.
> >
> > i.e. you  don't need SolrCloud nor Zookeeper for it. (they may provide
> > other  benefits, but you don't need them for distributed faceting).
> >
> >  Upayavira
> >
> > > On Fri, Mar 25, 2011 at 1:35 PM, Yonik  Seeley
> > > wrote:
> >  >
> > > > On Tue, Mar 22, 2011 at 7:51 AM, Dmitry Kan 
> >  wrote:
> > > > > Basically, of high interest is checking out the  Map-Reduce for
> > > > distributed
> > > > > faceting, is  it even possible with the trunk?
> > > >
> > > > Solr  already has distributed faceting, and it's much more performant
> > >  > than a map-reduce implementation would be.
> > > >
> > >  > I've also seen a product use the term "map reduce" incorrectly...  as
> > in,
> > > > we "map" the request to each shard, and then  "reduce" the results to a
> > > > single list (of course, that's not  actually map-reduce at all ;-)
> > > >
> > > >
> > >  :) this sounds pretty strange to me as well. It was only my guess, that
> >  > if
> > > you have MR as computational model and a cloud beneath it,  you could
> > > naturally map facet fields to their counts inside single  documents (no
> > > matter, where they are, be it shards or "single"  index) and pass them
> > > onto
> > > reducers.
> >  >
> > >
> > > > -Yonik
> > > > http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> >  > > 25-26, San Francisco
> > > >
> > >
> >  >
> > >
> > > --
> > > Regards,
> > >
> >  > Dmitry Kan
> > >
> > ---
> > Enterprise Search Consultant at  Sourcesense UK,
> > Making Sense of Open  Source
> >
> >
> 
> 
> -- 
> Regards,
> 
> Dmitry Kan
> 
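
For concreteness, a sketch of what Upayavira describes: the same facet query becomes distributed just by adding a shards parameter (host names and the facet field are placeholders):

http://host1:8983/solr/select?q=*:*&facet=true&facet.field=category&shards=host1:8983/solr,host2:8983/solr

Each shard computes its own counts and the coordinating node merges them; no Hadoop, SolrCloud, or ZooKeeper is involved.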


Using ExtractRequestHandler when source site has redirects

2011-03-25 Thread Daniel Sharkey
Hi all,

I'm trying to execute the following command:

   curl "http://localhost:8983/solr/update/extract?extractOnly=true&stream.url=http://www.nytimes.com/2011/03/26/world/middleeast/26syria.html?hp"


but it doesn't work because the NYTimes url has a redirect in it. Is there
any way to tell Tika to follow redirects?


Thanks,

DJ
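
One workaround, sketched on the assumption that Solr's remote streaming does not follow redirects: fetch the final page yourself (curl -L follows redirects) and post the file to the extract handler instead of passing stream.url:

curl -sL "http://www.nytimes.com/2011/03/26/world/middleeast/26syria.html?hp" -o article.html
curl "http://localhost:8983/solr/update/extract?extractOnly=true" -F "myfile=@article.html"

Note also that a stream.url value containing ? and & must itself be URL-encoded, which is another possible cause of the failure here.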


Default operator

2011-03-25 Thread Brian Lamb
Hi all,

I know that I can change the default operator in two ways:

1) <solrQueryParser defaultOperator="AND|OR"/> in schema.xml
2) Add q.op=AND

I'm wondering if it is possible to change the default operator for a
specific field only? For example, if I use the URL:

http://localhost:8983/solr/search/?q=animal:german shepherd&type:dog canine

I would want it to effectively be:

http://localhost:8983/solr/search/?q=animal:german AND shepherd&type:dog OR
canine

Other than parsing the URL before I send it out, is there a way to do this?

Thanks,

Brian Lamb
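
One workaround, as a sketch using only standard Lucene query syntax: group each field's terms and spell the operator out per group, since defaultOperator and q.op apply to the whole query:

http://localhost:8983/solr/search/?q=animal:(german AND shepherd) OR type:(dog OR canine)

(The spaces and parentheses need URL encoding in a real request.) Anything finer-grained than this means rewriting the query on the client side.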


Re: Dismax and worddelimiterfilter

2011-03-25 Thread lboutros
You could develop your own tokenizer to extract the different forms of your
ids.

It is possible to extend the pattern tokenizer.

Ludovic.

On 25 March 2011 at 21:13, "David Yang [via Lucene]" <
ml-node+2732007-1439913827-383...@n3.nabble.com> wrote:
>
>
> Hi,
>
>
>
> I am having some really strange issues matching "N61JQ-B2". If I had a
> field "N61JQ-B2", and I wanted to match "N61JQ", "N61JQB2", "N61JQ-B2"
> and "N61JQ B2" in dismax, what fieldtype should it have? My final
> fallback is to use ngrams but that would impose a pretty large overhead,
> since the field could be a long normal string with one model number in
> it.
>
>
>
> I noticed when I used WordDelimiterFilterFactory the dismax would
> convert the parsed query to some pre-analyzed query.
>
>
>
> Cheers,
>
> David
>
>


-
Jouve
France.

Re: solr on the cloud

2011-03-25 Thread Dmitry Kan
Hi Otis,

Thanks for elaborating on this and the link (funny!).

I have quite a big dataset growing all the time. The problems I am starting
to face are pretty predictable:
1. Scalability: this includes indexing time (now some days! better hours or
even minutes, if that's possible) along with handling the rapid growth.
2. Robustness: the entire system (distributed or single server or anything
else) should be fault-tolerant, e.g. if one shard goes down, another catches
up (master-slave scheme).
3. Some apps that we run on SOLR are pretty computationally demanding, like
faceting over uni+bi+trigrams of hundreds of millions of documents (index
size of half a TB) -> a single server with one shard of data does not seem
to be enough for realtime search.

This is just for a bit of background. I agree with you that Hadoop and the
cloud probably best suit massive batch processes rather than realtime
search. I'm curious whether anyone out there has made SOLR shine through the
cloud for realtime search over large datasets.

By "SOLR on the cloud (e.g. HDFS + MR + cloud of commodity machines)" I mean
what you've done for your customers using EC2. Any chance the
guidelines/articles on setting up indices on HDFS are available in some open
or paid area?

To sum this up, I didn't mean to create a buzz on the cloud solutions in
this thread, just was wondering what is practically available / going on in
SOLR development in this regard.

Thanks,

Dmitry


On Fri, Mar 25, 2011 at 10:28 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> [...]

Re: solr on the cloud

2011-03-25 Thread Jason Rutherglen
Dmitry,

If you're planning on using HBase you can take a look at
https://issues.apache.org/jira/browse/HBASE-3529  I think we may even
have a reasonable solution for reading the index [randomly] out of
HDFS.  Benchmarking'll be implemented next.  It's not production
ready, suggestions are welcome.

Jason

On Fri, Mar 25, 2011 at 2:03 PM, Dmitry Kan  wrote:
> [...]

Re: Wanted: a directory of quick-and-(not too)dirty analyzers for multi-language RDF.

2011-03-25 Thread François Schiettecatte
François

I think there is a language identification tool in the Nutch code base, 
otherwise I have written one in Perl which could easily be translated to Java. 
I won't have access to it for 10 days (I am traveling), but I am happy to send 
you a link to it when I get back (and anyone else who wants it).

Cheers

François

On Mar 25, 2011, at 11:59 AM, Grant Ingersoll wrote:

> You are looking for a language identification tool.  You could check 
> https://issues.apache.org/jira/browse/SOLR-1979 for the start of this.  
> Otherwise, you have to roll your own or buy a third party one.
> 
> On Mar 24, 2011, at 12:24 PM, fr.jur...@voila.fr wrote:
> 
>> Hello Solrists,
>> 
>> As it says in the subject line, I'm looking for a Java component that,
>> given an ISO 639-1 code or some equivalent,
>> would return a Lucene Analyzer ready to gobble documents in the 
>> corresponding language.
>> Solr looks like it has to contain one,
>> only I've not been able to locate it so far; 
>> can you point the spot?
>> 
>> I've found org.apache.solr.analysis,
>> and things like org.apache.lucene.analysis.bg &c in lucene/modules,
>> with many classes which I'm sure are related, however the factory itself 
>> still eludes me;
>> I mean the Java class.method that'd decide on request, what to do with all 
>> these packages
>> to bring the requisite object to existence, once the language is specified.
>> Where should I look? Or was I mistaken & Solr has nothing of the kind, at 
>> least in Java?
>> Thanks in advance for your help.
>> 
>> Best regards,
>>   François Jurain.
>> 
> 
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
> 
> Search the Lucene ecosystem docs using Solr/Lucene:
> http://www.lucidimagination.com/search
> 
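
Since no single built-in factory maps an ISO 639-1 code to a ready analyzer, a minimal hand-rolled sketch in Java (AnalyzerRegistry is a hypothetical helper, not a Solr or Lucene API; the analyzer classes live in Lucene's contrib/modules analyzers, and the Version constant depends on the release in use):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.de.GermanAnalyzer;
import org.apache.lucene.analysis.fr.FrenchAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class AnalyzerRegistry {
    // map ISO 639-1 codes to language-specific analyzers by hand
    private static final Map<String, Analyzer> BY_LANG = new HashMap<String, Analyzer>();
    static {
        BY_LANG.put("de", new GermanAnalyzer(Version.LUCENE_30));
        BY_LANG.put("fr", new FrenchAnalyzer(Version.LUCENE_30));
        // ... one entry per language you index
    }

    // fall back to StandardAnalyzer for unknown languages
    public static Analyzer forLanguage(String iso639_1) {
        Analyzer a = BY_LANG.get(iso639_1);
        return a != null ? a : new StandardAnalyzer(Version.LUCENE_30);
    }
}

This only covers picking the analyzer; detecting which language a document is in is the separate identification step discussed above.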



Re: Wanted: a directory of quick-and-(not too)dirty analyzers for multi-language RDF.

2011-03-25 Thread François Schiettecatte
I had meant to also include a link to a blog post of mine that lists some 
useful links:

http://fschiettecatte.wordpress.com/2008/07/23/language-recognition/

François

On Mar 25, 2011, at 11:59 AM, Grant Ingersoll wrote:

> [...]