strategy for post-processing answer set

2011-09-22 Thread Fred Zimmerman
Hi,


I would like to take the HTML documents that are the result of a Solr search
and merge them into a single HTML document containing the body text of
each individual document.  What is a good strategy for this? I am crawling
with Nutch and using Carrot2 for clustering.
Fred


Re: strategy for post-processing answer set

2011-09-22 Thread Fred Zimmerman
can you say a bit more about this? I see Velocity and will download it and
start playing around but I am not quite sure I understand all the steps that
you are suggesting.  Fred


On Thu, Sep 22, 2011 at 19:51, Markus Jelsma wrote:

> Hi,
>
> Solr supports the Velocity template engine, and the support is very good. Ideal
> for
> generating properly formatted output from the search engine. There's a
> clustering example and it's easy to format documents indexed by Nutch.
>
> http://wiki.apache.org/solr/VelocityResponseWriter
>
> Cheers
>
> > > Hi,
> >
> > I would like to take the HTML documents that are the result of a Solr
> > search and combine them into a single HTML document that combines the
> body
> > text of each individual document.  What is a good strategy for this? I am
> > crawling with Nutch and Carrot2 for clustering.
> > Fred
>
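
A minimal sketch of the moving parts, assuming the stock Solr 3.x example
(the template name "combined" is illustrative): the writer is registered in
solrconfig.xml and selected per request with wt=velocity.

    <!-- solrconfig.xml: register the Velocity response writer -->
    <queryResponseWriter name="velocity" class="solr.VelocityResponseWriter"/>

    # render results through the template conf/velocity/combined.vm
    curl "http://localhost:8983/solr/select?q=*:*&wt=velocity&v.template=combined"

The template itself can loop over the result documents and emit whatever
HTML wrapper is wanted around each body.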


Re: strategy for post-processing answer set

2011-09-23 Thread Fred Zimmerman
This seems to be out of date. I am running Solr 3.4

* the file structure of apachehome/contrib is different and I don't see
velocity anywhere underneath
* the page referenced below only talks about Solr 1.4 and 4.0

?

On Thu, Sep 22, 2011 at 19:51, Markus Jelsma wrote:

> Hi,
>
> Solr supports the Velocity template engine, and the support is very good. Ideal
> for
> generating properly formatted output from the search engine. There's a
> clustering example and it's easy to format documents indexed by Nutch.
>
> http://wiki.apache.org/solr/VelocityResponseWriter
>
> Cheers
>
> > > Hi,
> >
> > I would like to take the HTML documents that are the result of a Solr
> > search and combine them into a single HTML document that combines the
> body
> > text of each individual document.  What is a good strategy for this? I am
> > crawling with Nutch and Carrot2 for clustering.
> > Fred
>


Re: strategy for post-processing answer set

2011-09-23 Thread Fred Zimmerman
OK, answered my own question: found the Velocity response writer in
solrconfig.xml.  Next question:

where does velocity look for its templates?




On Fri, Sep 23, 2011 at 11:57, Fred Zimmerman  wrote:

> This seems to be out of date. I am running Solr 3.4
>
> * the file structure of apachehome/contrib is different and I don't see
> velocity anywhere underneath
> * the page referenced below only talks about Solr 1.4 and 4.0
>
> ?
>
> On Thu, Sep 22, 2011 at 19:51, Markus Jelsma 
> wrote:
>
>> Hi,
>>
>> Solr supports the Velocity template engine, and the support is very good. Ideal
>> for
>> generating properly formatted output from the search engine. There's a
>> clustering example and it's easy to format documents indexed by Nutch.
>>
>> http://wiki.apache.org/solr/VelocityResponseWriter
>>
>> Cheers
>>
>> > > Hi,
>> >
>> > I would like to take the HTML documents that are the result of a Solr
>> > search and combine them into a single HTML document that combines the
>> body
>> > text of each individual document.  What is a good strategy for this? I
>> am
>> > crawling with Nutch and Carrot2 for clustering.
>> > Fred
>>
>
>


Re: strategy for post-processing answer set

2011-09-24 Thread Fred Zimmerman
ok.  this is a very basic question so please bear with me.

I see where the velocity templates are and I have looked at the
documentation and get the idea of how to write them.

It looks to me as if Solr just brings back the URLs. What I want to do is to
get the actual documents in the answer set, simplify their HTML and remove
all the javascript, ads, etc., and append them into a single document.

Now... does Nutch already have the documents? Can I get them from its db,
or do I have to go get the documents again with something like wget?

Fred

On Fri, Sep 23, 2011 at 16:02, Erik Hatcher  wrote:

> conf/velocity by default.  See Solr's example configuration.
>
>   Erik
>
> On Sep 23, 2011, at 12:37, Fred Zimmerman  wrote:
>
> > ok, answered my own question, found velocity rw in solrconfig.xml.  next
> > question:
> >
> > where does velocity look for its templates?
> >
> >
> >
> >
> > On Fri, Sep 23, 2011 at 11:57, Fred Zimmerman 
> wrote:
> >
> >> This seems to be out of date. I am running Solr 3.4
> >>
> >> * the file structure of apachehome/contrib is different and I don't see
> >> velocity anywhere underneath
> >> * the page referenced below only talks about Solr 1.4 and 4.0
> >>
> >> ?
> >>
> >> On Thu, Sep 22, 2011 at 19:51, Markus Jelsma <
> markus.jel...@openindex.io>wrote:
> >>
> >>> Hi,
> >>>
> >>> Solr supports the Velocity template engine, and the support is very good.
> Ideal
> >>> for
> >>> generating properly formatted output from the search engine. There's a
> >>> clustering example and it's easy to format documents indexed by Nutch.
> >>>
> >>> http://wiki.apache.org/solr/VelocityResponseWriter
> >>>
> >>> Cheers
> >>>
> >>>>> Hi,
> >>>>
> >>>> I would like to take the HTML documents that are the result of a Solr
> >>>> search and combine them into a single HTML document that combines the
> >>> body
> >>>> text of each individual document.  What is a good strategy for this? I
> >>> am
> >>>> crawling with Nutch and Carrot2 for clustering.
> >>>> Fred
> >>>
> >>
> >>
>
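
On the last question: Nutch keeps the fetched pages in its crawl segments,
so there should be no need to wget them again. A sketch, assuming Nutch 1.x
and its readseg tool (the segment path is illustrative):

    # dump only the raw fetched content of one segment to plain text files
    bin/nutch readseg -dump crawl/segments/20111001123456 segdump \
        -nofetch -nogenerate -noparse -noparsedata -noparsetext

The dumped HTML can then be cleaned and concatenated by a post-processing
script.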


http request works, but wget same URL fails

2011-10-04 Thread Fred Zimmerman
This http request works as desired (bringing back a csv file)

http://zimzazsearch3-1.bitnamiapp.com:8983/solr/select?indent=on&version=2.2&q=battleship&wt=csv

but the same URL submitted via wget produces the 500 error reproduced below.

I want the wget to download the csv file.  What's going on?

FredZ


bitnami@ip-10-202-202-68:/opt/bitnami/apache2/htdocs/scripts$ --2011-10-04
19:33:41--  http://zimzazsearch3-1.bitnamiapp.com:8983/solr/select?indent=on
Resolving zimzazsearch3-1.bitnamiapp.com... 75.101.204.213
Connecting to zimzazsearch3-1.bitnamiapp.com|75.101.204.213|:8983...
connected.
HTTP request sent, awaiting response... 500 null
 java.lang.NullPointerException
        at java.io.StringReader.<init>(StringReader.java:33)
        at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:203)
        at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:80)
        at org.apache.solr.search.QParser.getQuery(QParser.java:142)
        at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:81)
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
        at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
        at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
        at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
        at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
        at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:326)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
        at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
        at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
        ...
2011-10-04 19:33:41 ERROR 500: null
        [same NullPointerException stack trace as above, repeated by wget]



-
Subscribe to the Nimble Books Mailing List  http://eepurl.com/czS- for
monthly updates


Re: http request works, but wget same URL fails

2011-10-04 Thread Fred Zimmerman
got it.

curl "
http://zimzazsearch3-1.bitnamiapp.com:8983/solr/select/?indent=on&q=video&fl=name,id&wt=csv";
works like a champ.
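
The transcript in the first message shows what went wrong: without quotes,
the shell treats each & in the URL as a command separator, so wget only
requested the URL up to "select?indent=on" and Solr hit a
NullPointerException on the missing q parameter. Quoting the URL should
make wget behave the same way (a sketch; the output filename is
illustrative):

    wget -O battleship.csv "http://zimzazsearch3-1.bitnamiapp.com:8983/solr/select?indent=on&version=2.2&q=battleship&wt=csv"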





On Tue, Oct 4, 2011 at 15:35, Fred Zimmerman  wrote:

> This http request works as desired (bringing back a csv file)
>
>
> http://zimzazsearch3-1.bitnamiapp.com:8983/solr/select?indent=on&version=2.2&q=battleship&wt=csv
>
> but the same URL submitted via wget produces the 500 error reproduced
> below.
>
> I want the wget to download the csv file.  What's going on?
>
> FredZ
>
>
> [wget transcript and stack trace quoted in full; see the original message above]

"more like this"

2011-10-05 Thread Fred Zimmerman
Hi,

for my application, I would like to be able to create web queries
(wget/curl) that get "more like this" for either a single arbitrarily
specified URL or for the first x terms in a search query.  I want to return
the results to myself as a csv file using wt=csv. How can I accomplish the
MLT piece of it?

Fred Z.

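One approach, assuming the MoreLikeThisHandler described at
http://wiki.apache.org/solr/MoreLikeThisHandler (the handler name and the
mlt.fl field are illustrative):

    <!-- solrconfig.xml: dedicated MLT handler -->
    <requestHandler name="/mlt" class="solr.MoreLikeThisHandler"/>

    # MLT for the top document matching a query, results as CSV
    curl "http://localhost:8983/solr/mlt?q=battleship&mlt.fl=content&mlt.mintf=1&mlt.mindf=1&wt=csv"

    # MLT for an arbitrary external URL; this needs
    # enableRemoteStreaming="true" on <requestParsers> in solrconfig.xml
    curl "http://localhost:8983/solr/mlt?mlt.fl=content&mlt.mintf=1&mlt.mindf=1&wt=csv&stream.url=http://example.com/some-page.html"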


getting started with Solr Flare

2011-10-05 Thread Fred Zimmerman
Hi,

I followed the very simple instructions found at

http://wiki.apache.org/solr/Flare/HowTo

but run into a problem at step 4

Launch Solr:
cd <solr example dir>; java -Dsolr.solr.home=<flare solr home> -jar start.jar

where Solr complains that it can't find solrconfig.xml in either the
classpath or the solr-ruby home dir. Can anyone help me disentangle this?

FredZ



Re: Search Relevance Assistance

2011-10-05 Thread Fred Zimmerman
probably can't help, but pls keep the topic on list, as it is important for
me too!


On Wed, Oct 5, 2011 at 14:12, FionaY  wrote:

> We have Solr integrated, but we are having some issues with search
> relevance
> and we need some help fine tuning the search results. Anyone think they can
> help?
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Search-Relevance-Assistance-tp3397404p3397404.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


how to determine whether indexing is occurring?

2011-10-07 Thread Fred Zimmerman
I am running a big nutch job which is supposed to be sending information to
solr for indexing, but it does not seem to be occurring. the number of docs
and max docs in solr statistics is not changing. how can I figure out what's
happening here?


Re: how to determine whether indexing is occurring?

2011-10-07 Thread Fred Zimmerman
I did this


bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

per http://wiki.apache.org/nutch/NutchTutorial




On Fri, Oct 7, 2011 at 13:36, Andy Lindeman  wrote:

> On Fri, Oct 7, 2011 at 13:32, Fred Zimmerman  wrote:
> > I am running a big nutch job which is supposed to be sending information
> to
> > solr for indexing, but it does not seem to be occurring. the number of
> docs
> > and max docs in solr statistics is not changing. how can I figure out
> what's
> > happening here?
>
> Have you issued a commit?
>
> --
> Andy Lindeman
> http://www.andylindeman.com/
>
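
If no commit has been issued, new documents won't be visible to searches or
the statistics page. A quick way to force one by hand, assuming the stock
example URL:

    curl "http://localhost:8983/solr/update?commit=true"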


how to add search terms to output of wt=csv?

2011-10-14 Thread Fred Zimmerman
Hi,

I want to include the search query in the output of wt=csv (or a duplicate
of it) so that the process that receives this output can do something with
the search terms. How would I accomplish this?

Fred
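
As far as I know the CSV writer only emits documents, not the request
parameters, but since the caller supplies q it can prepend the query itself.
A client-side sketch (the query value is illustrative):

    Q="battleship"
    { echo "query,$Q"; curl -s "http://localhost:8983/solr/select?q=$Q&wt=csv"; } > results.csv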


changing base URLs in indexes

2011-10-18 Thread Fred Zimmerman
Hi,

I am getting ready to index a recent copy of Wikipedia's pages-articles
dump.  I have two servers, foo and bar.  On foo.com/mediawiki I have a
Mediawiki install serving up the pages. On bar.com/solr I have my solr
install. I have the pages-articles.xml file from Wikipedia and the solr
instructions  at
http://wiki.apache.org/solr/DataImportHandler#Example:_Indexing_wikipedia.
 It looks pretty straightforward but I have a couple of preparatory
questions.

If I index the pages-articles.xml on bar.com/solr, they will then be
pointing to the relative links on solr.com/mediawiki, which don't exist,
right?  So is there a way to tell solr that the base url for a bunch of
index records is different than what it thinks they are? Or would it be
easier simply to put a solr installation on foo.com?





FredZ
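
One way, sketched here on the assumption that the DIH wikipedia example's
"page" entity and "title" column are in play (the hostname is illustrative),
is to build an absolute link at import time with DIH's TemplateTransformer:

    <field column="url"
           template="http://foo.com/mediawiki/index.php/${page.title}"/>

with TemplateTransformer added to the entity's transformer list; search
results then carry absolute URLs pointing at the server that actually
serves the pages.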


dataimport indexing fails: where are my log files ? ;-)

2011-10-19 Thread Fred Zimmerman
dumb question ...

Today I set up solr3.4/example. Indexing to 8983 via post is working, and so
is search, but solr/dataimport reports:

0
0
0
2011-10-19 18:13:57
Indexing failed. Rolled back all changes.

Google tells me to look at the exception logs to find out what's happening
... but I can't find the logs!  Where are they? example/logs is an empty
directory.
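
The stock Jetty example logs to the console via stdout/stderr, so one way
to capture the exception (a sketch, run from the example directory):

    cd apache-solr-3.4.0/example
    java -jar start.jar > logs/solr.log 2>&1

The DIH stack trace should then show up in logs/solr.log.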


where is solr data import handler looking for my file?

2011-10-19 Thread Fred Zimmerman
Solr dataimport is reporting file not found when it looks for foo.xml.

Where is it looking for /data? Is this a URL off the apache2/htdocs on the
server, or is it a URL within example/solr/...?


  <entity
      processor="XPathEntityProcessor"
      stream="true"
      forEach="/mediawiki/page/"
      url="/data/foo.xml"
      transformer="RegexTransformer,DateFormatTransformer"
      >


success with indexing Wikipedia - lessons learned

2011-10-21 Thread Fred Zimmerman
http://business.zimzaz.com/wordpress/2011/10/how-to-clone-wikipedia-mirror-and-index-wikipedia-with-solr/


Re: where is solr data import handler looking for my file?

2011-10-23 Thread Fred Zimmerman
Figured it out.  See step 12 in
http://business.zimzaz.com/wordpress/2011/10/how-to-clone-wikipedia-mirror-and-index-wikipedia-with-solr/.
 Thanks!

On Sun, Oct 23, 2011 at 1:31 PM, Erick Erickson wrote:

> I think you need to back up and state the problem you're trying to
> solve. Offhand, it looks as though you're trying to do something
> with DIH that it wasn't intended to do. But that's just a guess
> since the details of what you're trying to do are so sparse...
>
> Best
> Erick
>
> On Wed, Oct 19, 2011 at 10:49 PM, Fred Zimmerman 
> wrote:
> > Solr dataimport is reporting file not found when it looks for foo.xml.
> >
> > Where is it looking for /data? Is this a URL off the apache2/htdocs on
> > the server, or is it a URL within example/solr/...?
> >
> >
> >   <entity
> >       processor="XPathEntityProcessor"
> >       stream="true"
> >       forEach="/mediawiki/page/"
> >       url="/data/foo.xml"
> >       transformer="RegexTransformer,DateFormatTransformer"
> >       >
> >
>


schema.xml bloat?

2011-10-23 Thread Fred Zimmerman
Hi,

it seems from my limited experience thus far that as new data types are
added, schema.xml will tend to become bloated with many different field and
fieldtype definitions.  Is this a problem in real life, and if so, what
strategies are used to address it?

FredZ


Re: schema.xml bloat?

2011-10-23 Thread Fred Zimmerman
So, basically, yes, it is a real problem and there is no designed solution?
 e.g. optional sub-schema files that can be turned off and on?

On Sun, Oct 23, 2011 at 6:38 PM, Erik Hatcher wrote:

>
> On Oct 23, 2011, at 19:34 , Fred Zimmerman wrote:
> > it seems from my limited experience thus far that as new data types are
> > added, schema.xml will tend to become bloated with many different field
> and
> > fieldtype definitions.  Is this a problem in real life, and if so, what
> > strategies are used to address it?
>
> ... by keeping your schema lean and clean, with only what YOU need in it.
>  Granted, I'd personally keep all the built-in Solr primitive field types
> defined even if I didn't use them, but there aren't very many and they
> don't really clutter things up.
>
> Defined fields should ONLY be what you need for your application, and
> generally that should be a tractable (and necessary) reasonably sized set.
>
>Erik
>


Re: Is there a good web front end application / interface for solr

2011-10-25 Thread Fred Zimmerman
what about something that's a bit less discovery-oriented? for my particular
application I am most concerned with bringing back a straightforward "top
ten" answer set and having users look at it. I actually don't want to bother
them with faceting, etc. at this juncture.

Fred

On Tue, Oct 25, 2011 at 7:40 AM, Erik Hatcher wrote:

>
> On Oct 25, 2011, at 07:24 , Robert Stewart wrote:
>
> > It is really not very difficult to build a decent web front-end to SOLR
> using one of the available client libraries
>
> Or even just not using any client library at all (other than an HTTP
> library).  I've done a bit of proof-of-concept/prototyping with a super
> light weight (and of course Ruby!) approach with my Prism tinkering: <
> https://github.com/lucidimagination/Prism>
>
> Yes, in general it's very straightforward to build a search UI that shows
> results, pages through them, displays facets, and allows them to be clicked
> and filter results and so on.  Devil is always in the details, and having
> saved searches, export, customizability, authentication, and so on makes it
> a more involved proposition.
>
> If you're in a PHP environment, there is VuFind... again pretty
> library-centric at first, but likely flexible enough to handle any Solr
> setup.  For the Pythonistas, there's Kochief -
> http://code.google.com/p/kochief/
>
> Being a Rubyist myself (and founder of Blacklight), I'm not intimately
> familiar with the other solutions but the library world has done a lot to
> get this sort of thing off the ground in many environments.
>
>Erik
>
>


missing core name in path

2011-10-26 Thread Fred Zimmerman
It is not a multi-core setup.  The solr.xml has a null value for the core name. ?
HTTP ERROR 404

Problem accessing /solr/admin/index.jsp. Reason:

missing core name in path



2011-10-26 13:40:21.182:WARN::/solr/admin/
java.lang.IllegalStateException: STREAM
at org.mortbay.jetty.Response.getWriter(Response.java:616)
at
org.apache.jasper.runtime.JspWriterImpl.initOut(JspWriterImpl.java:187)


fixed schema problems, now running out of memory?

2011-10-26 Thread Fred Zimmerman
It's a small indexing job coming from nutch.

2011-10-26 15:07:29,039 WARN  mapred.LocalJobRunner - job_local_0011
java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error
executi$
at
org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRec$
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: org.apache.solr.client.solrj.SolrServerException: Error executing
query
at
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.ja$
at
org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
at
org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRec$
... 3 more
Caused by: org.apache.solr.common.SolrException: Java heap space
 java.lang.OutOfMem$

Java heap space  java.lang.OutOfMemoryError: Java heap spaceat
org.apache.lucene$

request: localhost/solr/select?q=id:[* TO *]&fl=id,boost,tstamp,$
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHt$
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHt$
at
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.ja$
... 5 more


Re: fixed schema problems, now running out of memory?

2011-10-26 Thread Fred Zimmerman
More on what's happening. It seems to be timing out during the commit.

The new documents are small, but the existing index is large (11 million
records).

INFO: Closing Searcher@4a7df6 main
>
> fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
> ...
>


> Oct 26, 2011 4:51:17 PM org.apache.solr.update.processor.LogUpdateProcessor finish
> INFO: {commit=} 0 2453
> Oct 26, 2011 4:51:17 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/update
> params={waitSearcher=true&waitFlush=true&wt=javabin&commit=true&version=2}
> status=0 QTime=2453
> Oct 26, 2011 4:51:52 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/select
> params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=2} hits=11576871
> status=0 QTime=35298
> Oct 26, 2011 4:51:53 PM org.apache.solr.core.SolrCore execute
> INFO: [] webapp=/solr path=/select
> params={fl=id&wt=javabin&q=id:[*+TO+*]&rows=1&version=2} hits=11576871
> status=0 QTime=1
> java.lang.OutOfMemoryError: Java heap space
> Dumping heap to /home/bitnami/apache-solr-3.4.0/example/heaplog ...
> Heap dump file created [306866344 bytes in 32.376 secs]



On Wed, Oct 26, 2011 at 11:09 AM, Fred Zimmerman wrote:

> It's a small indexing job coming from nutch.
>
> [stack trace quoted in full; see the previous message in this thread]
>
>


Re: fixed schema problems, now running out of memory?

2011-10-26 Thread Fred Zimmerman
http://wiki.apache.org/solr/SolrPerformanceFactors#Schema_Design_Considerations

> The number of indexed fields greatly increases the following:
>
>    - Memory usage during indexing
>    - Segment merge time
>    - Optimization times
>    - Index size
>
> These impacts can be reduced by the use of omitNorms="true"


http://lucene.472066.n3.nabble.com/What-is-omitNorms-td2987547.html

> 1. Length normalization will not work on the specific field --
> which means matching documents with shorter length will not be
> preferred/boosted over matching documents with greater length for the
> specific field at search time.
> For my application, I actually prefer documents with greater length.
> 2. Index-time boosting will not be available on the field.
> If both of the above cases are not required by you, then you can set
> "omitNorms=true" for the specific fields.
> This has an added advantage: it will save you some (or a lot of) RAM,
> since "omitNorms=false" on a total of "N" fields in the index will require
> RAM of size:
>  Total docs in index * 1 byte * N
> I have a lot of fields: I count 31 without omitNorms values, which means
> false by default.


Gak!  11,576,871 * 1 * 31 is roughly 340MB of RAM all by itself.
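
A sketch of what the suggested change looks like in schema.xml (the field
name and type are illustrative):

    <field name="title" type="text" indexed="true" stored="true"
           omitNorms="true"/>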

On Wed, Oct 26, 2011 at 1:01 PM, Fred Zimmerman wrote:

> More on what's happening. It seems to be timing out during the commit.
>
> The new documents are small, but the existing index is large (11 million
> records).
>
> [log output and earlier messages quoted in full; trimmed]


limiting searches to particular sources

2011-11-02 Thread Fred Zimmerman
I want to be able to limit some searches to particular sources, e.g. "wiki
only", "crawled only", etc.  So I think I need to create a source field in
the schema.xml.  However, the native data for these sources does not
contain source info (e.g. "crawled").  So I want to use (I think)
copyField to add a string to each data set as I import it, e.g.
"website-X-crawl".  So my question is, how do I insert a string value into
a blank field?


Re: limiting searches to particular sources

2011-11-04 Thread Fred Zimmerman
Yes -- how do I specify the field as a constant in DIH?

On Fri, Nov 4, 2011 at 11:17 AM, Erick Erickson wrote:

> How are you crawling your info? Somewhere you have to inject the
> source into the document; copyField won't do the trick because
> there's no source available
>
> If you're crawling the data by yourself, you can just add the source
> to the document.
>
> If you're using DIH, you can specify the field as a constant. Or you
> could implement a custom Transformer that inserted it for you.
>
> Best
> Erick
>
> On Wed, Nov 2, 2011 at 10:52 AM, Fred Zimmerman 
> wrote:
> > I want to be able to limit some searches to particular sources, e.g. "wiki
> > only", "crawled only", etc.  So I think I need to create a source field
> in
> > the schema.xml.  However, the native data for these sources does not
> > contain source info (e.g. "crawled").  So I want to use (I think)
> > copyField to add a string to each data set as I import it, e.g.
> > "website-X-crawl".  So my question is, how do I insert a string value
> into
> > a blank field?
> >
>


Re: Aggregated indexing of updating RSS feeds

2011-11-07 Thread Fred Zimmerman
Any options that do not require adding new software?

On Mon, Nov 7, 2011 at 11:11 AM, Nagendra Nagarajayya <
nnagaraja...@transaxtions.com> wrote:

> Shaun:
>
> You should try the NRT support available in Solr with RankingAlgorithm. You
> should be able to add docs in real time and also query them in real time.
>  If DIH does not retain the old index, you may be able to convert the rss
> fields to an XML format as needed by Solr and update the docs (make sure
> there is a unique id)
>
> http://solr-ra.tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x
>
> You can download Solr 3.4.0 with RankingAlgorithm 1.3 from here:
> http://solr-ra.tgels.org
>
> Regards,
>
> - Nagendra Nagarajayya
> http://solr-ra.tgels.org
> http://rankingalgorithm.tgels.org
>
>
> On 11/6/2011 1:22 PM, Shaun Barriball wrote:
>
>> Hi all,
>>
>> We've successfully setup Solr 3.4.0 to parse and import multiple news RSS
>> feeds (based on the slashdot example on
>> http://wiki.apache.org/solr/DataImportHandler) using
>> the HttpDataSource.
>> The objective is for Solr to index ALL news items published on this feed
>> (ever) - not just the current contents of the feed. I've read that the
>> delta import is not supported for XML imports. I've therefore tried to use
>> "command=full-impor&clean=**false".
>> But still the number of Documents Processed seems to be stuck at a fixed
>> number of items looking at the Stats and the 'numFound' result for a
>> generic '*:*' search. New items are being added to the feeds all the time
>> (and old ones dropping off).
>>
>> Is it possible for Solr to incrementally build an index of a live RSS
>> feed which is changing but retain the index of its archive?
>>
>> All help appreciated.
>> Shaun
>>
>
>


remove answers with identical scores

2011-11-24 Thread Fred Zimmerman
I have a corpus that has a lot of identical or nearly identical documents.
I'd like to return only the unique ones (excluding the "nearly identical"
which are redirects).  I notice that all the identical/nearly identicals
have identical Solr scores. How can I tell Solr to throw out all the
successive documents in an answer set that have identical scores?

doc 1 score 5.0
doc 2  score 5.0
doc 3 score 5.0
doc 4 score 4.9

skip docs 2 and 3

bring back 10 docs with unique scores


Re: remove answers with identical scores

2011-11-25 Thread Fred Zimmerman
Thanks. I did consider postprocessing and may wind up doing that; I was
hoping there was a way to have Solr do it for me! That I have to ask this
question is probably not a good sign, but what is LSH clustering?

On Fri, Nov 25, 2011 at 4:34 AM, Ted Dunning  wrote:

> You can do that pretty easily by just retrieving extra documents and post
> processing the results list.
>
> You are likely to have a significant number of apparent duplicates this
> way.
>
> To really get rid of duplicates in results, it might be better to remove
> them from the corpus by deploying something like LSH clustering.
>
> On Thu, Nov 24, 2011 at 5:04 PM, Fred Zimmerman wrote:
>
> > I have a corpus that has a lot of identical or nearly identical
> documents.
> > I'd like to return only the unique ones (excluding the "nearly identical"
> > which are redirects).  I notice that all the identical/nearly identicals
> > have identical Solr scores. How can I tell Solr to  throw out all the
> > successive documents in an answer set that have identical scores?
> >
> > doc 1 score 5.0
> > doc 2  score 5.0
> > doc 3 score 5.0
> > doc 4 score 4.9
> >
> > skip docs 2 and 3
> >
> > bring back 10 docs with unique scores
> >
>
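
For the postprocessing route Ted suggests, a sketch that fetches extra rows
and keeps only the first document of each run of identical scores (the URL,
query, and row count are illustrative; with fl=id,score the score is the
second CSV column):

    curl -s "http://localhost:8983/solr/select?q=foo&fl=id,score&wt=csv&rows=50" \
      | awk -F, 'NR==1 || $2 != prev { print; prev=$2 }' \
      | head -n 11

head -n 11 keeps the CSV header plus ten documents with unique scores.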