Re: remove answers with identical scores

2011-11-25 Thread Fred Zimmerman
…oying something like LSH clustering. On Thu, Nov 24, 2011 at 5:04 PM, Fred Zimmerman wrote: > I have a corpus that has a lot of identical or nearly identical documents. I'd like to return only the unique ones (excluding the "nearly identical"…
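
The reply above points toward LSH clustering; a stock-Solr alternative is the deduplication support built around SignatureUpdateProcessorFactory, which computes a fuzzy per-document signature at index time so that "nearly identical" documents collapse onto one entry. A minimal sketch for solrconfig.xml, assuming a string field named "signature" has been added to the schema (the field, chain name, and source fields here are illustrative, not from the thread):

  <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">signature</str>
      <!-- overwriteDupes=true deletes earlier docs that produce the same signature -->
      <bool name="overwriteDupes">true</bool>
      <str name="fields">title,text</str>
      <!-- TextProfileSignature is fuzzy, so near-duplicates hash alike -->
      <str name="signatureClass">solr.processor.TextProfileSignature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

The chain then has to be attached to the update handler (the parameter is update.processor or update.chain, depending on Solr version) and the corpus reindexed; this removes duplicates at index time rather than filtering them per query.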

remove answers with identical scores

2011-11-24 Thread Fred Zimmerman
I have a corpus that has a lot of identical or nearly identical documents. I'd like to return only the unique ones (excluding the "nearly identical", which are redirects). I notice that all the identical/nearly identical documents have identical Solr scores. How can I tell Solr to throw out all the success…

Re: Aggregated indexing of updating RSS feeds

2011-11-07 Thread Fred Zimmerman
Any options that do not require adding new software? On Mon, Nov 7, 2011 at 11:11 AM, Nagendra Nagarajayya <nnagaraja...@transaxtions.com> wrote: > Shaun: You should try NRT available with Solr with RankingAlgorithm here. You should be able to add docs in real time and also query them in r…
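
One knob that needs no extra software is the commit window: stock Solr 3.x will not show new documents until a commit, but autoCommit in solrconfig.xml can shrink the delay to seconds. Not true NRT, and frequent commits are not free, but it is the closest the unmodified distribution gets. A sketch with illustrative thresholds:

  <updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
      <maxDocs>1000</maxDocs>   <!-- commit after this many pending docs -->
      <maxTime>60000</maxTime>  <!-- or after this many milliseconds -->
    </autoCommit>
  </updateHandler>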

Re: limiting searches to particular sources

2011-11-04 Thread Fred Zimmerman
> If you're crawling the data by yourself, you can just add the source to the document. If you're using DIH, you can specify the field as a constant. Or you could implement a custom Transformer that inserted it for you. Best, Erick. On Wed, Nov…
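
Pulling the reply's suggestions together: a keyword field in schema.xml, a constant value stamped on each document at import time (TemplateTransformer can do this in DIH), and a filter query at search time. All names below are illustrative:

  <!-- schema.xml -->
  <field name="source" type="string" indexed="true" stored="true" />

  <!-- data-config.xml: stamp every row from this entity as "crawled" -->
  <entity name="crawled_docs" transformer="TemplateTransformer" ...>
    <field column="source" template="crawled" />
  </entity>

Then a "wiki only" search is just the ordinary query plus a filter: &fq=source:wiki.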

limiting searches to particular sources

2011-11-02 Thread Fred Zimmerman
I want to be able to limit some searches to particular sources, e.g. "wiki only", "crawled only", etc. So I think I need to create a source field in the schema.xml. However, the native data for these sources does not contain source info (e.g. "crawled"). So I want to use (I think) … to add a strin…

Re: fixed schema problems, now running out of memory?

2011-10-26 Thread Fred Zimmerman
I have a lot of fields: I count 31 without omitNorms values, which means false by default. Gak! 11,000,000 docs x 1 byte x 31 fields = ~341MB of RAM all by itself. On Wed, Oct 26, 2011 at 1:01 PM, Fred Zimmerman wrote: > More on what's happening. It seems to be timing out during the commit…
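
The arithmetic above is the standard norms cost: one byte of heap per document per field that keeps norms, regardless of whether the field holds a value. The fix is to declare omitNorms="true" on fields that need neither length normalization nor index-time boosts, then reindex. An illustrative sketch:

  <!-- schema.xml: norms off where relevance by field length is not wanted -->
  <field name="url"   type="string" indexed="true" stored="true" omitNorms="true" />
  <field name="title" type="text"   indexed="true" stored="true" omitNorms="true" />

With 31 fields trimmed down to the handful that genuinely need norms, the ~341MB figure drops proportionally.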

Re: fixed schema problems, now running out of memory?

2011-10-26 Thread Fred Zimmerman
…on=2} hits=11576871 status=0 QTime=1 *java.lang.OutOfMemoryError: Java heap space* Dumping heap to /home/bitnami/apache-solr-3.4.0/example/heaplog ... Heap dump file created [306866344 bytes in 32.376 secs] On Wed, Oct 26, 2011 at 11:09 AM, Fred Zimmerman wrote…

fixed schema problems, now running out of memory?

2011-10-26 Thread Fred Zimmerman
It's a small indexing job coming from nutch. 2011-10-26 15:07:29,039 WARN mapred.LocalJobRunner - job_local_0011 java.io.IOException: org.apache.solr.client.solrj.SolrServerException: Error executi… at org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat.getRec… at o…

missing core name in path

2011-10-26 Thread Fred Zimmerman
It is not a multi-core setup. The solr.xml has a null value for …? HTTP ERROR 404. Problem accessing /solr/admin/index.jsp. Reason: missing core name in path. 2011-10-26 13:40:21.182:WARN::/solr/admin/ java.lang.IllegalStateException: STREAM at org.mortbay.jetty.Response.getWriter(Re…
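
For context: this 404 typically means Jetty found a solr.xml and started Solr in multicore mode, so /solr/admin/ has no core to bind to. Either address a core explicitly (/solr/core0/admin/) or name a default core, assuming the Solr version supports defaultCoreName. An illustrative solr.xml:

  <solr persistent="false">
    <cores adminPath="/admin/cores" defaultCoreName="core0">
      <core name="core0" instanceDir="." />
    </cores>
  </solr>

Alternatively, removing solr.xml entirely puts the 3.x example server back in single-core mode, where /solr/admin/ works as before.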

Re: Is there a good web front end application / interface for solr

2011-10-25 Thread Fred Zimmerman
What about something that's a bit less discovery-oriented? For my particular application I am most concerned with bringing back a straightforward "top ten" answer set and having users look at it. I actually don't want to bother them with faceting, etc. at this juncture. Fred. On Tue, Oct 25, 2011…
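
For that use case the stock select handler is already the whole front end; everything beyond q just trims the payload. An illustrative request (host and field names assumed):

  curl "http://localhost:8983/solr/select?q=battleship&rows=10&fl=title,url,score&wt=json&indent=on"

rows=10 caps the answer set at ten, and fl keeps the response down to the fields a minimal results page needs.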

Re: schema.xml bloat?

2011-10-23 Thread Fred Zimmerman
So, basically, yes, it is a real problem and there is no designed solution? E.g. optional sub-schema files that can be turned off and on? On Sun, Oct 23, 2011 at 6:38 PM, Erik Hatcher wrote: > On Oct 23, 2011, at 19:34, Fred Zimmerman wrote: > it seems from my limited experie…
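
There is no official sub-schema mechanism, but Solr's config loader honors XInclude (reliably for solrconfig.xml; for schema.xml it has varied by version), which gets close to "optional sub-schema files": each include splices in the root element of the target file, so a definition can be disabled by commenting out one line. A hedged sketch with illustrative file names:

  <types xmlns:xi="http://www.w3.org/2001/XInclude">
    <!-- each included file holds exactly one fieldType element -->
    <xi:include href="fieldtype-text-wikipedia.xml" />
    <xi:include href="fieldtype-text-rss.xml" />
  </types>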

schema.xml bloat?

2011-10-23 Thread Fred Zimmerman
Hi, it seems from my limited experience thus far that as new data types are added, schema.xml will tend to become bloated with many different field and fieldtype definitions. Is this a problem in real life, and if so, what strategies are used to address it? FredZ

Re: where is solr data import handler looking for my file?

2011-10-23 Thread Fred Zimmerman
> Offhand, it looks as though you're trying to do something with DIH that it wasn't intended to do. But that's just a guess, since the details of what you're trying to do are so sparse... Best, Erick. On Wed, Oct 19, 2011 at 10:49 PM, Fred Zimmerman…

success with indexing Wikipedia - lessons learned

2011-10-21 Thread Fred Zimmerman
http://business.zimzaz.com/wordpress/2011/10/how-to-clone-wikipedia-mirror-and-index-wikipedia-with-solr/

where is solr data import handler looking for my file?

2011-10-19 Thread Fred Zimmerman
Solr dataimport is reporting "file not found" when it looks for foo.xml. Where is it looking for /data? Is this a URL off the apache2/htdocs on the server, or is it a URL within example/solr/...?
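
With DIH's FileDataSource, a relative url in data-config.xml is resolved against the Solr process's working directory (or a configured basePath), not against the web server's docroot. Spelling the path out removes the guesswork; a sketch with illustrative paths and fields:

  <dataConfig>
    <!-- url below resolves against basePath, not apache2/htdocs -->
    <dataSource type="FileDataSource" basePath="/home/user/solr-import/" />
    <document>
      <entity name="foo" processor="XPathEntityProcessor"
              url="foo.xml" forEach="/records/record">
        <field column="title" xpath="/records/record/title" />
      </entity>
    </document>
  </dataConfig>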

dataimport indexing fails: where are my log files ? ;-)

2011-10-19 Thread Fred Zimmerman
Dumb question ... today I set up solr3.4/example; indexing to 8983 via post is working, and so is search, but solr/dataimport reports "0 0 0 2011-10-19 18:13:57 Indexing failed. Rolled back all changes." Google tells me to look at the exception logs to find out what's happening ... but I can't find the l…
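
In the stock example server, Solr logs through java.util.logging and everything, stack traces included, goes to the console that started Jetty. To capture it in a file instead, point the JVM at a JDK logging config; paths here are illustrative:

  # logging.properties
  handlers = java.util.logging.FileHandler
  .level = INFO
  java.util.logging.FileHandler.pattern = /home/user/solr.log
  java.util.logging.FileHandler.formatter = java.util.logging.SimpleFormatter

  # start Solr with it
  java -Djava.util.logging.config.file=logging.properties -jar start.jar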

changing base URLs in indexes

2011-10-18 Thread Fred Zimmerman
Hi, I am getting ready to index a recent copy of Wikipedia's pages-articles dump. I have two servers, foo and bar. On foo.com/mediawiki I have a Mediawiki install serving up the pages. On bar.com/solr I have my solr install. I have the pages-articles.xml file from Wikipedia and the solr instruct…
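
If the dump is indexed through DIH, the stored link can be rewritten at index time so every hit points at the foo.com/mediawiki install regardless of where the XML came from. RegexTransformer does this per field; the column names and pattern below are illustrative:

  <entity name="page" processor="XPathEntityProcessor"
          transformer="RegexTransformer" ...>
    <!-- build the public URL from the page title -->
    <field column="url" sourceColName="title"
           regex="^(.*)$" replaceWith="http://foo.com/mediawiki/index.php/$1" />
  </entity>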

how to add search terms to output of wt=csv?

2011-10-14 Thread Fred Zimmerman
Hi, I want to include the search query in the output of wt=csv (or a duplicate of it) so that the process that receives this output can do something with the search terms. How would I accomplish this? Fred
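
The CSV writer emits only document fields, so the query itself has to be carried alongside the download. The least invasive route is to have the fetching script prepend it, since the client already knows q. A shell sketch (host and fields assumed):

  Q="battleship"
  { echo "# query: $Q"
    curl -s "http://localhost:8983/solr/select?q=$Q&wt=csv&fl=title,url"
  } > results.csv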

Re: how to determine whether indexing is occurring?

2011-10-07 Thread Fred Zimmerman
I did this: bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5, per http://wiki.apache.org/nutch/NutchTutorial. On Fri, Oct 7, 2011 at 13:36, Andy Lindeman wrote: > On Fri, Oct 7, 2011 at 13:32, Fred Zimmerman wrote: > I am running a big nutch job which is su…
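
Two quick checks, since numDocs only moves after a commit: force a commit by hand, then read the core statistics. Illustrative requests against the stock 3.x admin:

  # make any pending adds visible
  curl "http://localhost:8983/solr/update?commit=true"

  # numDocs / maxDoc appear in the statistics page
  curl -s "http://localhost:8983/solr/admin/stats.jsp" | grep numDocs

If numDocs still does not move after an explicit commit, the documents are not reaching Solr at all and the nutch side is the place to look.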

how to determine whether indexing is occurring?

2011-10-07 Thread Fred Zimmerman
I am running a big nutch job which is supposed to be sending information to solr for indexing, but it does not seem to be occurring. The number of docs and max docs in the solr statistics is not changing. How can I figure out what's happening here?

Re: Search Relevance Assistance

2011-10-05 Thread Fred Zimmerman
Probably can't help, but pls keep the topic on list, as it is important for me too! On Wed, Oct 5, 2011 at 14:12, FionaY wrote: > We have Solr integrated, but we are having some issues with search relevance and we need some help fine tuning the search results. Anyone think they can help?

getting started with Solr Flare

2011-10-05 Thread Fred Zimmerman
Hi, I followed the very simple instructions found at http://wiki.apache.org/solr/Flare/HowTo but run into a problem at step 4, Launch Solr: cd …; java -Dsolr.solr.home=… -jar start.jar, where Solr complains that it can't find solrconfig.xml in either the classpath or the solr-ruby home dir. Can…
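
That complaint usually means -Dsolr.solr.home is empty or wrong: it must name a directory that contains conf/solrconfig.xml. Spelled out with illustrative paths (the HowTo's placeholders were lost above):

  cd /home/user/apache-solr-3.4.0/example
  java -Dsolr.solr.home=/home/user/apache-solr-3.4.0/example/solr -jar start.jar
  # sanity check: this file must exist
  ls /home/user/apache-solr-3.4.0/example/solr/conf/solrconfig.xml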

"more like this"

2011-10-05 Thread Fred Zimmerman
Hi, for my application I would like to be able to create web queries (wget/curl) that get "more like this" for either a single arbitrarily specified URL or for the first x terms in a search query. I want to return the results to myself as a csv file using wt=csv. How can I accomplish the MLT pie…
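
MoreLikeThisHandler covers both cases: an already-indexed document is selected with q, and an arbitrary external URL can be streamed in with stream.url (remote streaming must be enabled in solrconfig.xml's <requestParsers>). The similarity fields need to be stored or have term vectors. A sketch with illustrative hosts, ids, and fields:

  <!-- solrconfig.xml -->
  <requestHandler name="/mlt" class="solr.MoreLikeThisHandler" />

  # MLT seeded by a document already in the index
  curl "http://localhost:8983/solr/mlt?q=id:12345&mlt.fl=title,text&mlt.mintf=1&mlt.mindf=1&rows=10&wt=csv"

  # MLT seeded by an arbitrary page (requires enableRemoteStreaming="true")
  curl "http://localhost:8983/solr/mlt?stream.url=http://example.com/page&mlt.fl=title,text&rows=10&wt=csv"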

Re: http request works, but wget same URL fails

2011-10-04 Thread Fred Zimmerman
Got it. curl "http://zimzazsearch3-1.bitnamiapp.com:8983/solr/select/?indent=on&q=video&fl=name,id&wt=csv" works like a champ. On Tue, Oct 4, 2011 at 15:35, Fred Zimmerman wrote: > This http request works as desired (bringing back a csv file): htt…
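
The underlying problem is the shell, not wget: an unquoted & backgrounds the command and truncates the URL, so Solr receives a partial request. Quoted, wget behaves exactly like the curl above (host illustrative):

  # unquoted: the shell eats everything after the first &
  wget http://host:8983/solr/select/?q=video&fl=name,id&wt=csv

  # quoted: the full parameter string reaches Solr
  wget -O results.csv "http://host:8983/solr/select/?q=video&fl=name,id&wt=csv"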

http request works, but wget same URL fails

2011-10-04 Thread Fred Zimmerman
This http request works as desired (bringing back a csv file): http://zimzazsearch3-1.bitnamiapp.com:8983/solr/select?indent=on&version=2.2&q=battleship&wt=csv. But the same URL submitted via wget produces the 500 error reproduced below. I want the wget to download the csv file. What's going on…

Re: strategy for post-processing answer set

2011-09-24 Thread Fred Zimmerman
…wrote: > conf/velocity by default. See Solr's example configuration. Erik. On Sep 23, 2011, at 12:37, Fred Zimmerman wrote: > ok, answered my own question, found velocity rw in solrconfig.xml. Next question:…
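
For reference, the pieces this thread converges on: the writer is registered in solrconfig.xml, templates live in conf/velocity, and the writer is selected per request with wt. A sketch (the template name is an assumption; in 3.x the contrib/velocity jars must also be on the lib path):

  <!-- solrconfig.xml -->
  <queryResponseWriter name="velocity" class="solr.VelocityResponseWriter" />

  # render results through conf/velocity/combined.vm
  curl "http://localhost:8983/solr/select?q=nutch&wt=velocity&v.template=combined"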

Re: strategy for post-processing answer set

2011-09-23 Thread Fred Zimmerman
…11:57, Fred Zimmerman wrote: > This seems to be out of date. I am running Solr 3.4. * The file structure of apachehome/contrib is different and I don't see velocity anywhere underneath. * The page referenced below only talks about Solr 1.4 and 4.0. ?…

Re: strategy for post-processing answer set

2011-09-23 Thread Fred Zimmerman
This seems to be out of date. I am running Solr 3.4. * The file structure of apachehome/contrib is different and I don't see velocity anywhere underneath. * The page referenced below only talks about Solr 1.4 and 4.0. ? On Thu, Sep 22, 2011 at 19:51, Markus Jelsma wrote: > Hi, Solr supports the Velocity template engine and has very g…

Re: strategy for post-processing answer set

2011-09-22 Thread Fred Zimmerman
Can you say a bit more about this? I see Velocity and will download it and start playing around, but I am not quite sure I understand all the steps that you are suggesting. Fred. On Thu, Sep 22, 2011 at 19:51, Markus Jelsma wrote: > Hi, Solr supports the Velocity template engine and has very g…

strategy for post-processing answer set

2011-09-22 Thread Fred Zimmerman
Hi, I would like to take the HTML documents that are the result of a Solr search and combine them into a single HTML document that combines the body text of each individual document. What is a good strategy for this? I am crawling with Nutch and using Carrot2 for clustering. Fred