Re: solr with hadoop

2010-06-22 Thread Jon Baer
I was playing around w/ Sqoop the other day; it's a simple Cloudera tool for 
imports (mysql -> hdfs) @ http://www.cloudera.com/developers/downloads/sqoop/

It seems to me it would be pretty efficient to dump to HDFS and have 
something like the DataImportHandler read from hdfs:// directly ...

Has this route been discussed / developed before (i.e. DIH w/ an hdfs:// handler)?
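
(Sketching roughly what such a handler might look like: DIH lets you plug in your 
own DataSource<Reader>, so something along these lines should be possible.  The 
class name, charset and error handling below are just illustrative, not an 
existing Solr or Hadoop class:)

    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.net.URI;
    import java.util.Properties;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.solr.handler.dataimport.Context;
    import org.apache.solr.handler.dataimport.DataSource;

    // Hypothetical DIH data source: the entity's url attribute would be something
    // like "hdfs://namenode:9000/dumps/part-00000" and is handed in here as "query".
    public class HdfsDataSource extends DataSource<Reader> {
        private Configuration conf;

        public void init(Context context, Properties initProps) {
            conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml from the classpath
        }

        public Reader getData(String query) {
            try {
                FileSystem fs = FileSystem.get(URI.create(query), conf);
                return new InputStreamReader(fs.open(new Path(query)), "UTF-8");
            } catch (Exception e) {
                throw new RuntimeException("Unable to open " + query, e);
            }
        }

        public void close() {
            // nothing to clean up in this sketch
        }
    }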

- Jon

On Jun 22, 2010, at 12:29 PM, MitchK wrote:

> 
> I wanted to add a Jira issue about exactly what Otis is asking here.
> Unfortunately, I haven't had time for it because of my exams.
> 
> However, I'd like to add a question to Otis' ones:
> If you distribute the indexing process this way, are you able to replicate
> the different documents correctly?
> 
> Thank you.
> - Mitch
> 
> Otis Gospodnetic-2 wrote:
>> 
>> Stu,
>> 
>> Interesting!  Can you provide more details about your setup?  By "load
>> balance the indexing stage" you mean "distribute the indexing process",
>> right?  Do you simply take your content to be indexed, split it into N
>> chunks where N matches the number of TaskNodes in your Hadoop cluster and
>> provide a map function that does the indexing?  What does the reduce
>> function do?  Does that call IndexWriter.addIndexes or do you do that
>> outside Hadoop?
>> 
>> Thanks,
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>> 
>> - Original Message 
>> From: Stu Hood 
>> To: solr-user@lucene.apache.org
>> Sent: Monday, January 7, 2008 7:14:20 PM
>> Subject: Re: solr with hadoop
>> 
>> As Mike suggested, we use Hadoop to organize our data en route to Solr.
>> Hadoop allows us to load balance the indexing stage, and then we use
>> the raw Lucene IndexWriter.addIndexes method to merge the data to be
>> hosted on Solr instances.
>> 
>> Thanks,
>> Stu
>> 
>> 
>> 
>> -Original Message-
>> From: Mike Klaas 
>> Sent: Friday, January 4, 2008 3:04pm
>> To: solr-user@lucene.apache.org
>> Subject: Re: solr with hadoop
>> 
>> On 4-Jan-08, at 11:37 AM, Evgeniy Strokin wrote:
>> 
>>> I have huge index base (about 110 millions documents, 100 fields  
>>> each). But size of the index base is reasonable, it's about 70 Gb.  
>>> All I need is increase performance, since some queries, which match  
>>> big number of documents, are running slow.
>>> So I was thinking is any benefits to use hadoop for this? And if  
>>> so, what direction should I go? Is anybody did something for  
>>> integration Solr with Hadoop? Does it give any performance boost?
>>> 
>> Hadoop might be useful for organizing your data enroute to Solr, but  
>> I don't see how it could be used to boost performance over a huge  
>> Solr index.  To accomplish that, you need to split it up over two  
>> machines (for which you might find hadoop useful).
>> 
>> -Mike
>> 
>> 
>> 
>> 
>> 
>> 
>> 



Re: DIH and Cassandra

2010-08-05 Thread Jon Baer
That is not 100% true.  I would think RDBMS and XML would be the most common 
importers, but the real flexibility is with the TikaEntityProcessor [1] that 
comes w/ DIH ...

http://wiki.apache.org/solr/TikaEntityProcessor

I'm pretty sure it would be able to handle any type of serde (in the case of 
Cassandra I believe it is Thrift) on its own w/ the dependency libraries.

I find the TEP to be underutilized sometimes; I think it's because the DIH docs 
lack info on what it can do.

[1] - http://tika.apache.org

- Jon

On Aug 4, 2010, at 3:00 PM, Andrei Savu wrote:

> DIH only works with relational databases and XML files [1], you need
> to write custom code in order to index data from Cassandra.
> 
> It should be pretty easy to map documents from Cassandra to Solr.
> There are a lot of client libraries available [2] for Cassandra.
> 
> [1] http://wiki.apache.org/solr/DataImportHandler
> [2] http://wiki.apache.org/cassandra/ClientOptions
> 
> On Wed, Aug 4, 2010 at 6:41 PM, Mark  wrote:
>> Is it possible to use DIH with Cassandra either out of the box or with
>> something more custom? Thanks
>> 
> 
> 
> 
> -- 
> Indekspot -- http://www.indekspot.com -- Managed Hosting for Apache Solr
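
(A rough sketch of the "custom code" route Andrei describes, SolrJ side only; 
how you pull the rows out of Cassandra, raw Thrift, Hector, etc., is left 
abstract, and the field names are made up:)

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CassandraIndexer {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
            List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();

            // fetchRows() stands in for whatever Cassandra client call you use;
            // each row is key -> columns.
            for (Map.Entry<String, Map<String, String>> row : fetchRows().entrySet()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", row.getKey());
                for (Map.Entry<String, String> col : row.getValue().entrySet()) {
                    doc.addField(col.getKey(), col.getValue()); // column name == Solr field name
                }
                docs.add(doc);
            }
            solr.add(docs);
            solr.commit();
        }

        private static Map<String, Map<String, String>> fetchRows() {
            throw new UnsupportedOperationException("read from Cassandra here");
        }
    }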



Re: Arguments for Solr implementation at public web site

2009-11-13 Thread Jon Baer
For this list I usually end up @ http://solr.markmail.org (which I believe also 
uses Lucene under the hood).

Google is such a black box ... 

Pros:
+ 1 Open Source (enough said :-)

There also seems to always be the notion that "crawling" lends itself to 
producing the best results, but that is rarely the case.  And unless you are a 
"special" type of site, Google will not overlay your results w/ some type of 
context in the search (ie news or sports, etc).  

What I think really needs to happen in Solr (and is a bit missing @ the moment) 
is a common interface for "reindexing" another index (if that 
makes sense) ... something akin to OpenSearch 
(http://www.opensearch.org/Community/OpenSearch_software).

For example, what I would like to do is have my site, have my search index, and 
point Google at just my search index (and not have it crawl the site) ... 
the only current option for something like that is sitemaps, which I think Solr 
should have a contrib project for (sitemap templates), but you would have to 
generate these offline for sure.

- Jon  

On Nov 13, 2009, at 6:00 AM, Lukáš Vlček wrote:

> Hi,
> 
> thanks for inputs so far... however, let's put it this way:
> 
> When you need to search for something Lucene or Solr related, which one do
> you use:
> - generic Google
> - go to a particular mail list web site and search from here (if there is
> any search form at all)
> - go to LucidImagination.com and use its search capability
> 
> Regards,
> Lukas
> 
> 
> On Fri, Nov 13, 2009 at 11:50 AM, Andrew Clegg wrote:
> 
>> 
>> 
>> Lukáš Vlček wrote:
>>> 
>>> I am looking for good arguments to justify implementation a search for
>>> sites
>>> which are available on the public internet. There are many sites in
>>> "powered
>>> by Solr" section which are indexed by Google and other search engines but
>>> still they decided to invest resources into building and maintenance of
>>> their own search functionality and not to go with [user_query site:
>>> my_site.com] google search. Why?
>>> 
>> 
>> You're assuming that Solr is just used in these cases to index discrete web
>> pages which Google etc. would be able to access via following navigational
>> links.
>> 
>> I would imagine that in a lot of cases, Solr is used to index database
>> entities which are used to build [parts of] pages dynamically, and which
>> might be viewable in different forms in various different pages.
>> 
>> Plus, with stored fields, you have the option of actually driving a website
>> off Solr instead of directly off a database, which might make sense from a
>> speed perspective in some cases.
>> 
>> And further, going back to page-only indexing -- you have no guarantee when
>> Google will decide to recrawl your site, so there may be a delay before
>> changes show up in their index. With an in-house search engine you can
>> reindex as often as you like.
>> 
>> Andrew.
>> 
>> 
>> 



Reseting doc boosts

2009-11-13 Thread Jon Baer
Hi,

I'm trying to figure out if there is an easy way to basically "reset" any 
doc boosts which you have made (for analytical purposes) ... for example, I 
run an index, gather a report, doc-boost based on the report, and reset the boosts @ 
the time of the next index ... 

From just knowing how Lucene works, it would seem that I would really need 
to reindex, since the boost is an attribute on the doc itself which would have to be 
modified, and there is no easy way to query for docs which have been boosted 
either.  Any insight?

Thanks.

- Jon

Re: Reseting doc boosts

2009-11-13 Thread Jon Baer
This looks exactly like what I was needing ... it would make a 
great tool / addition to the Solr web interface, but it looks like it only takes 
(Directory d, Similarity s) (vs. a subset collection of documents) ...
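
(For reference, a minimal sketch of running it against an index directory, 
assuming the Lucene 2.x contrib/misc API; the path and field name are made up:)

    import java.io.IOException;

    import org.apache.lucene.misc.FieldNormModifier;
    import org.apache.lucene.search.DefaultSimilarity;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class ResetNorms {
        public static void main(String[] args) throws IOException {
            // Point it at the Solr index directory (no IndexWriter may have it open).
            Directory dir = FSDirectory.getDirectory("/opt/solr/data/index");
            // Recomputes the norm for every doc, which effectively wipes any index-time boost.
            FieldNormModifier fnm = new FieldNormModifier(dir, new DefaultSimilarity());
            fnm.reSetNorms("body");   // repeat per boosted field
            dir.close();
        }
    }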

Either way great find, thanks for your help ...

- Jon

On Nov 13, 2009, at 6:40 PM, Koji Sekiguchi wrote:

> I'm not sure this is what you are looking for,
> but there is FieldNormModifier tool in Lucene.
> 
> Koji
> 
> -- 
> 
> http://www.rondhuit.com/en/
> 
> 
> Avlesh Singh wrote:
>> AFAIK there is no way to "reset" the doc boost. You would need to re-index.
>> Moreover, there is no way to "search by boost".
>> 
>> Cheers
>> Avlesh
>> 
>> On Fri, Nov 13, 2009 at 8:17 PM, Jon Baer  wrote:
>> 
>>  
>>> Hi,
>>> 
>>> Im trying to figure out if there is an easy way to basically "reset" all of
>>> any doc boosts which you have made (for analytical purposes) ... for example
>>> if I run an index, gather report, doc boost on the report, and reset the
>>> boosts @ time of next index ...
>>> 
>>> It would seem to be from just knowing how Lucene works that I would really
>>> need to reindex since its a attrib on the doc itself which would have to be
>>> modified, but there is no easy way to query for docs which have been boosted
>>> either.  Any insight?
>>> 
>>> Thanks.
>>> 
>>> - Jon
>>>
>> 
>>  
> 



Re: Reseting doc boosts

2009-11-13 Thread Jon Baer
Yeah, I ended up creating a "boosted" field for @ least debugging, but I might 
patch / extend / create my own FieldNormModifier using just that criterion + 
doing the reset.

- Jon

On Nov 13, 2009, at 12:21 PM, Avlesh Singh wrote:

> AFAIK there is no way to "reset" the doc boost. You would need to re-index.
> Moreover, there is no way to "search by boost".
> 
> Cheers
> Avlesh
> 
> On Fri, Nov 13, 2009 at 8:17 PM, Jon Baer  wrote:
> 
>> Hi,
>> 
>> Im trying to figure out if there is an easy way to basically "reset" all of
>> any doc boosts which you have made (for analytical purposes) ... for example
>> if I run an index, gather report, doc boost on the report, and reset the
>> boosts @ time of next index ...
>> 
>> It would seem to be from just knowing how Lucene works that I would really
>> need to reindex since its a attrib on the doc itself which would have to be
>> modified, but there is no easy way to query for docs which have been boosted
>> either.  Any insight?
>> 
>> Thanks.
>> 
>> - Jon



SOLR-469 - bad patch?

2008-06-24 Thread Jon Baer
It seems the new patch @ https://issues.apache.org/jira/browse/SOLR-469 
is 2x the size, but it turns out the patch itself might be bad?


I.e., it dumps build.xml twice; is it just me?

Thanks.

- Jon


Nutch <-> Solr latest?

2008-06-24 Thread Jon Baer

Hi,

I'm curious, is there a spot / patch for the latest on Nutch / Solr 
integration?  I've found a few pages (a few outdated it seems).  It would 
be nice (?) if it worked as a DataSource type for DataImportHandler, 
but I'm not sure if that fits w/ how it works.  Either way, a contrib 
patch done the way the DIH is already set up would be nice to have.


Is there currently work ongoing on this?  It seems like it belongs in 
one project or the other, not both.


Thanks.

- Jon


DataImportHandler - combined DataSource possible?

2008-07-01 Thread Jon Baer

Hi,

Is it currently possible to define a db-data-config.xml that includes 
both an HttpDataSource and a JdbcDataSource @ all?  I can't tell if 
this is possible or not (although it seems that dataConfig might only 
take a single dataSource child element).


Thanks.

- Jon


Filter query + holding previous Facet query counts

2008-07-07 Thread Jon Baer

Hi,

Is there an easy way to use fq to filter down but retain the overall 
facet query counts?  I can't seem to find how to accomplish this, but it 
seems like a common item needed for navigating through a result set.  I 
need to do this w/o holding a session, and the counts always seem to 
reflect the current fq instead of the overall lookup.


Thanks.

- Jon


HttpDataSource questions

2008-07-11 Thread Jon Baer

Hi,

On the wiki it says that the url attribute can be templatized, but I'm not 
sure how that happens; do I need to create something that reads from a 
database column in order to use that type of function?  I.e., I'd like to 
run over some RSS feeds for multiple URLs (~ 30); do I need to copy 1 
per URL I want to read, or is there an easier method?


Also, does anything currently get compared when doing a delta-import 
for these types of data sources?  Does the dataimport.properties 
compare itself to anything?  (i.e. a pubdate on RSS, etc).


Thanks!

- Jon 


Re: HttpDataSource questions

2008-07-11 Thread Jon Baer
Ahhh, very cool, did not realize that one.  I was actually able to use 
a db entity over the http entity, so I pulled a list of subdomains and 
included it that way.


One small *possible* feature request (or is it possible already) is to 
load entities by name.  For example, if I wanted to cron up something 
like:


?command=delta-import&entity=newyork,chicago,etc

I'm guessing this is something that would be possible using 
javax.script, but I'm on Java 5 and the ScriptTransformer does not seem to work w/ 
BSF.  It would be nice to have something like that *without* having to 
add anything like a conditional tag.


- Jon

On Jul 12, 2008, at 12:38 AM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:



On Fri, Jul 11, 2008 at 11:46 PM, Jon Baer <[EMAIL PROTECTED]> wrote:

Hi,

On the wiki it says that url attribute can be templatized but Im  
not sure
how that happens, do you I need to create something read from a  
database
column in order to use that type of function?  ie Id like to run  
over some
RSS feeds for multiple URLs (~ 30), do I need to copy 1 per URL I  
want to

read or is there an easier method?
It is a simple passthrough of parameters you passed over to the http 
request.

e.g.:
If you want to read a feed with, say, an extra attribute (date) which
you will only know at runtime,
you can make the http request with command=full-import&date=some_date.
The date can be consumed in the url as
url="http://xyz.com?a=b&thedate=${dataimporter.request.date}"


if you could give me a sample on how the diffrent urls look like I may
be able to suggest you something. Actually there are many ways to
achieve it.



Also does anything currently get compared when doing a delta-import  
for
these types of data sources?  Does the dataimport.properties  
compare itself

to anything?  (ie a pubdate on RSS, etc).

Thanks!

- Jon





--
--Noble Paul




Specifying explicit FacetQuery w/ a normal query?

2008-07-17 Thread Jon Baer
I've gone from a complex multicore setup back to a single solrconfig 
setup using a doctype field (since the index is pretty small); 
however, there are a few spots where items are laid out in tabs and 
each tab has a count of docs associated, i.e.:


News (123) | Images (345) | Video (678) | Blogs (901)

Unfortunately the tab handling is server side, and I'm trying to grab 
a facet count on doctype w/ a filter query; I can't seem to do it w/o 
having to send both the small facet query (for the counts on all items) and 
the filter query itself.  Is there any way to do this in a single 
request w/ any params I'm missing?  (Using SolrJ if that helps.)
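
(For what it's worth, what I'm doing today is two SolrJ calls along these lines; 
the query string and doctype values are just examples:)

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class TabCounts {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            // Request 1: overall tab counts (no filter, no rows needed).
            SolrQuery counts = new SolrQuery("lucene");
            counts.setRows(0);
            counts.setFacet(true);
            counts.addFacetField("doctype");
            QueryResponse countsRsp = server.query(counts);
            for (FacetField.Count c : countsRsp.getFacetField("doctype").getValues()) {
                System.out.println(c.getName() + " (" + c.getCount() + ")"); // News (123) | Images (345) ...
            }

            // Request 2: the actual results for the selected tab.
            SolrQuery tab = new SolrQuery("lucene");
            tab.addFilterQuery("doctype:news");
            QueryResponse tabRsp = server.query(tab);
            System.out.println(tabRsp.getResults().getNumFound() + " docs in the news tab");
        }
    }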


Thanks.

- Jon 
 


SolrJ + Spellcheck

2008-07-21 Thread Jon Baer

Hi,

I can't seem to locate any info on how to get SolrJ + spellcheck 
working together.  I'd like to query the spellchecker if 0 items were 
matched; is SolrJ "generic" enough to pick apart added component 
results from the bottom of a query?
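
(The generic route I'm aware of is digging through QueryResponse.getResponse(); 
a rough sketch, assuming the SpellCheckComponent is wired into the handler and 
returns its usual "spellcheck" / "suggestions" section:)

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.util.NamedList;

    public class SpellcheckFallback {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery q = new SolrQuery("holliday");
            q.set("spellcheck", "true");       // component params pass straight through
            q.set("spellcheck.count", "5");
            QueryResponse rsp = server.query(q);

            if (rsp.getResults().getNumFound() == 0) {
                // Dig the raw component output out of the generic response tree.
                NamedList spellcheck = (NamedList) rsp.getResponse().get("spellcheck");
                if (spellcheck != null) {
                    NamedList suggestions = (NamedList) spellcheck.get("suggestions");
                    System.out.println("Did you mean: " + suggestions);
                }
            }
        }
    }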


Thanks.

- Jon


Re: Specifying explicit FacetQuery w/ a normal query?

2008-07-21 Thread Jon Baer

Well, that's my problem ... I can't :-)

When you put an fq=doctype:news in there you can't get an explicit 
facet.query; it will only let you deal w/ the stuff you have already 
filtered down to.  I think what I want is possible, I just need to dig into the 
code more.


- Jon

On Jul 21, 2008, at 9:14 PM, Mike Klaas wrote:



On 17-Jul-08, at 6:27 AM, Jon Baer wrote:

Ive gone from a complex multicore setup back to a single solrconfig  
setup and using a doctype field (since the index is pretty small),  
however there are a few spots where items are laid out in tabs and  
each tab has a count of docs associated, ie:


News (123) | Images (345) | Video (678) | Blogs (901)

Unfortunately the tab controlling is server side and Im trying to  
grab a facet count on doctype w/ a filter query and can't seem to  
do it w/o having to send the small facet query (for the counts on  
all items) and the filter query itself.  Is there any way to do  
this in a single request w/ any params Im missing?  (Using SolrJ if  
that helps).


No, there isn't, but does this really bother you?  It doesn't seem  
that the advantages to combining everything in one request are huge.


-Mike








Re: facets and filter query

2008-07-22 Thread Jon Baer

This is *exactly* my issue ... very nicely worded :-)

I would have thought facet.query=*:* would have been the solution, but 
it does not seem to work.  I'm interested in getting these *total* 
counts for UI display.


- Jon

On Jul 22, 2008, at 6:05 AM, Stefan Oestreicher wrote:


Hi,

I have a category field in my index which I'd like to use as a facet.
However my search frontend only allows you to search in one category  
at a
time for which I'm using a filter query. Unfortunately the filter  
query

restricts the facets as well.

My query looks like this:
?q=content:foo&fq=cat:default&fl=title,content&facet=true&facet.field=cat


What I'd like is to search only in the "default" category but get  
the result

count of that query for all categories. I thought maybe I can use the
facet.query parameter but this doesn't seem to do what I want,  
because the

result is the same.

Is there any way to accomplish this with only one request?

I'm using version 1.3 from trunk.

TIA,

Stefan Oestreicher





Restricting spellchecker for certain words

2008-07-22 Thread Jon Baer
It seems that the spellchecker works great, except all of the "7 words you 
can't say on TV" resolve to very important people; is there a way to 
exclude certain words so they don't resolve?


Thanks.

- Jon


Recognizing date inputs

2008-08-20 Thread Jon Baer

Hi,

(I'm sure this was asked before, but I found nothing on markmail) ... 
Wondering if Solr can handle this on its own or if something needs to 
be written ... I would like to handle recognizing date inputs to a 
search box for news articles, items such as "August 1", "August 1st" or 
"08/01/2008" ... it's a bit different from synonym-like handling in 
that I have to transform the query to a specific field.  Any thoughts?
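
(By "transform the query" I mean something like this small front-end step; the 
pubdate field name and the accepted patterns are just examples of the idea, not 
anything built into Solr:)

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Calendar;
    import java.util.Date;
    import java.util.Locale;
    import java.util.TimeZone;

    public class DateQueryRewriter {
        // Patterns w/o a year get the current year appended before parsing.
        private static final String[] PATTERNS = { "MM/dd/yyyy", "MMMM d yyyy", "MMMM d" };

        public static String rewrite(String userInput) {
            String cleaned = userInput.replaceAll("(?i)(\\d+)(st|nd|rd|th)", "$1"); // "August 1st" -> "August 1"
            for (String p : PATTERNS) {
                String pattern = p, text = cleaned;
                if (!p.contains("y")) {
                    pattern = p + " yyyy";
                    text = cleaned + " " + Calendar.getInstance().get(Calendar.YEAR);
                }
                SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.US);
                fmt.setLenient(false);
                fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
                try {
                    Date start = fmt.parse(text);
                    Calendar end = Calendar.getInstance();
                    end.setTime(start);
                    end.add(Calendar.DAY_OF_MONTH, 1);
                    SimpleDateFormat solr = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
                    solr.setTimeZone(TimeZone.getTimeZone("UTC"));
                    return "pubdate:[" + solr.format(start) + " TO " + solr.format(end.getTime()) + "]";
                } catch (ParseException ignore) {
                    // not this pattern, try the next one
                }
            }
            return userInput; // not a date, leave the query alone
        }
    }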


Thanks.

- Jon


"Multicore" and snapshooter / snappuller

2008-08-21 Thread Jon Baer

Hi,

I've started putting together a small cluster and going through the 
setup of some of the scripts; do they have any awareness of a 
multicore setup?  It seems like I can only snapshot a single master 
directory.  I'm assuming these tools are compatible with that type of 
setup, but I just want to check ...


Also, is there a more step-by-step doc outlining the tool usage?  I'd 
like to send a copy to our sysadmins, but the docs seem to be more in-depth 
on each tool versus a workflow on using them.


Thanks.

- Jon


Re: "Multicore" and snapshooter / snappuller

2008-08-21 Thread Jon Baer
Thanks ... on a somewhat related note, does having the index on ZFS 
buy me anything?  Has anyone toyed w/ ZFS snapshots / send / recv to 
automount?  Does it work?


- Jon

On Aug 21, 2008, at 6:43 PM, Alexander Ramos Jardim wrote:


You need to setup one snapshooter for each index

2008/8/21 Jon Baer <[EMAIL PROTECTED]>


Hi,

Ive started putting together a small cluster and going through the  
setup on
some of the scripts, do they have any awareness of a multicore  
setup?  It
seems like I can only snapshot a single master directory, Im  
assuming these
tools are compatible with that type of setup but just want to  
check ...


Also is there a more step by step doc outlining the tool usage.  Id  
like to
send a copy to our sysadmins but they seem to be more indepth w/  
each tool

verses a workflow on using them.

Thanks.

- Jon





--
Alexander Ramos Jardim




Re: "Multicore" and snapshooter / snappuller

2008-08-25 Thread Jon Baer
Yeah, I think the snapshot techniques that ZFS provides would be very 
nice for handling indexes, although that remains to be seen, as I have not 
seen too much info pertaining to it.


I'm hoping to have a chance to put Solr on OpenSolaris soon and will 
see what works / what doesn't.  (BTW, this combo can be tried out for 
free w/ the VirtualBox and OpenSolaris .iso if anyone is interested.)


- Jon

On Aug 25, 2008, at 7:37 AM, Norberto Meijome wrote:


On Fri, 22 Aug 2008 12:21:53 -0700
"Lance Norskog" <[EMAIL PROTECTED]> wrote:


Apparently the ZFS (Silicon Graphics
originally) is great for really huge files.


hi Lance,
You may be  confusing Sun's ZFS with SGI's XFS. The OP referred, i  
think, to ZFS.


B

_
{Beto|Norberto|Numard} Meijome

"The greatest dangers to liberty lurk in insidious encroachment by  
men of zeal, well-meaning but without understanding."

  Justice Louis D. Brandeis

I speak for myself, not my employer. Contents may be hot. Slippery  
when wet. Reading disclaimers makes you go blind. Writing them is  
worse. You have been Warned.




Check on Solr 1.3?

2008-09-02 Thread Jon Baer

Hi,

Was wondering if there was an update on the push for a final 1.3? 
I wanted to build a final .war but am wondering about the status and if I should hold 
off ... everything in trunk seems promising; any major issues?


Thanks.

- Jon


Re: SolrJ and JSON in Solr -1.3

2008-09-14 Thread Jon Baer
Hmm, am I missing something, or isn't the real point of SolrJ to be 
able to use the binary (javabin) format to keep it small / tight / 
compressed?  I have had to proxy Solr recently and found that just dumping 
a SolrDocumentList into a JSONArray (via the json.org libs) works pretty 
well (YMMV).  I was just under the impression that the Java-to-Java 
bridge was the best way to go ...


It would be nice to have util methods on the SolrDocumentList 
(toJSON(), toXML(), etc) maybe?
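
(Roughly what I'm doing now, in case it is useful to anyone; a toJSON()-style 
helper built on the json.org classes:)

    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;
    import org.json.JSONArray;
    import org.json.JSONException;
    import org.json.JSONObject;

    public class SolrJson {
        public static JSONArray toJson(SolrDocumentList docs) throws JSONException {
            JSONArray out = new JSONArray();
            for (SolrDocument doc : docs) {
                JSONObject o = new JSONObject();
                for (String field : doc.getFieldNames()) {
                    o.put(field, doc.getFieldValue(field)); // multi-valued fields come through as collections
                }
                out.put(o);
            }
            return out;
        }
    }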


- Jon

On Sep 14, 2008, at 11:14 PM, Erik Hatcher wrote:



On Sep 14, 2008, at 2:51 PM, Julio Castillo wrote:

What is the status of JSON support via SolrJ?


Requires a custom ResponseParser.  See SOLR-402 for a couple of 
implementation ideas:

<https://issues.apache.org/jira/browse/SOLR-402>

Maybe this code is no longer current to trunk?

I want to be able to specify a parser such as the XMLResponseParser  
on my

SolrServer. What are my options?


Use SolrServer#setParser() for one of the above implementations.

I guess I could get an XML response and then convert it to JSON? I  
rather

not.


Ewww, don't do that.

There is a JIRA entry SOLR-402, but real resolution to it per the  
comments

that follow in the feature request.
https://issues.apache.org/jira/browse/SOLR-402


Did the RawResponseParser work for you?   If so, we can build that  
into Solr trunk - +1.  I shoulda done that a while ago, sorry.  This  
actually fits well with SOLR-620, in my nefarious plans to build a  
web framework out of Solr ;)


Erik





Re: SolrJ and JSON in Solr -1.3

2008-09-15 Thread Jon Baer
From what I understand you don't have to select a thing, the SolrCore  
would detect SolrJ and do it automatically(?) ...


44. SOLR-486: Binary response format, faster and smaller
than XML and JSON response formats (use wt=javabin).
BinaryResponseParser for utilizing the binary format via SolrJ
and is now the default.
(Noble Paul, yonik)
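
(So on the SolrJ side there is nothing to pick; the only knob I know of is the 
parser, e.g. if you ever want the XML wire format back for debugging a proxy. 
A minimal sketch:)

    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.impl.XMLResponseParser;

    public class ParserSetup {
        public static void main(String[] args) throws Exception {
            CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            // wt=javabin / BinaryResponseParser is already the default in 1.3; only
            // switch parsers if you explicitly want XML on the wire:
            server.setParser(new XMLResponseParser());
        }
    }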

On Sep 15, 2008, at 2:40 PM, Julio Castillo wrote:


Jon,
Is the binary (javabin) format implied by selecting the  
RawResponseParser? I

guess I don't know what the javabin format is.

So you took a SolrDocumentList and converted it into a JSON Array?

Thanks

** julio

-Original Message-----
From: Jon Baer [mailto:[EMAIL PROTECTED]
Sent: Sunday, September 14, 2008 9:01 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrJ and JSON in Solr -1.3

Hmm am I missing something but isn't the real point of SolrJ to be  
able to
use the binary (javabin) format to keep it small / tight /  
compressed?  I
have had to proxy Solr recently and found just throwing a  
SolrDocumentList
as a JSONArray (via json.org libs) works pretty well (YMMV).  I was  
just
under the impression that the Java to Java bridge was the best way  
to go ...


It would be nice to have util methods on the SolrDocumentList  
(toJSON(),

toXML(), etc) maybe?

- Jon

On Sep 14, 2008, at 11:14 PM, Erik Hatcher wrote:



On Sep 14, 2008, at 2:51 PM, Julio Castillo wrote:

What is the status of JSON support via SolrJ?


Requires a custom ResponseParser.  See SOLR-402 for a couple of
implementation ideas:

<https://issues.apache.org/jira/browse/SOLR-402>

Maybe this code is no longer current to trunk?


I want to be able to specify a parser such as the XMLResponseParser
on my SolrServer. What are my options?


Use SolrServer#setParser() for one of the above implementations.


I guess I could get an XML response and then convert it to JSON? I
rather not.


Ewww, don't do that.


There is a JIRA entry SOLR-402, but real resolution to it per the
comments that follow in the feature request.
https://issues.apache.org/jira/browse/SOLR-402


Did the RawResponseParser work for you?   If so, we can build that
into Solr trunk - +1.  I shoulda done that a while ago, sorry.  This
actually fits well with SOLR-620, in my nefarious plans to build a  
web

framework out of Solr ;)

Erik







DIH + RSS (@attributes)

2008-09-16 Thread Jon Baer

Hi,

For some reason my XPath attribute keeps failing to get picked up here  
(is that the proper format?):




- Jon



Re: DIH + RSS (@attributes)

2008-09-16 Thread Jon Baer

That was it, thanks Shalin.

On Sep 16, 2008, at 1:41 PM, Shalin Shekhar Mangar wrote:


On Tue, Sep 16, 2008 at 10:41 PM, Jon Baer <[EMAIL PROTECTED]> wrote:

For some reason my XPath attribute keeps failing to get picked up  
here (is

that the proper format?):





Put a slash between node and attribute.
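
For example, a hypothetical enclosure-URL field would be written 
xpath="/rss/channel/item/enclosure/@url" (slash before the @) rather than 
"/rss/channel/item/enclosure@url".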



--
Regards,
Shalin Shekhar Mangar.




Re: [ANNOUNCE] Solr 1.3.0 Released

2008-09-16 Thread Jon Baer

Another +1 for Shalin and Noble for DIH ...

On Sep 16, 2008, at 9:50 PM, Erik Hatcher wrote:

+1 for Grant's efforts!   He put a lot of sweat into making this  
release a reality.


Erik

On Sep 16, 2008, at 9:29 PM, Grant Ingersoll wrote:

The Apache Solr team is happy to announce the availability of Solr  
1.3.0 for public download.  This version contains many enhancements  
and bug fixes, including:

- Distributed search capabilities
- Numerous Lucene and other performance improvements
- Support for multiple indexes in a single deployment
- SolrJ client and a binary response protocol for faster client- 
server communication
- Search Components that can be chained together to offer flexible  
query processing.  Components include existing functionality like  
faceting and add More Like This, Editorial Boosting (Query  
Elevation) and Spell Checking
- New DataImportHandler for easily indexing database content into  
Solr


See http://svn.apache.org/repos/asf/lucene/solr/tags/release-1.3.0/CHANGES.txt 
for more details.  The download is available from 
http://www.apache.org/dyn/closer.cgi/lucene/solr/.  See the Solr Wiki for documentation: http://wiki.apache.org/solr/


About Apache Solr:
Solr is an open source enterprise search server based on the Lucene  
Java search library, with XML/HTTP and JSON APIs, hit highlighting,  
faceted search, caching, replication, a web administration  
interface and many more features. It runs in a Java servlet  
container such as Tomcat.  For more information, refer to the Solr  
website at http://lucene.apache.org/solr/.






Delta importing issues

2008-09-19 Thread Jon Baer

Question -

So if I issued a dataimport?command=delta-import&entity=one,two,three

Would this also hit entities w/o a delta-import defined, like four,five,six, etc? 
I'm trying to set something up and I ended up with 28k+ documents, which 
seems more like a full import, so do I need to do something like 
delta-query="" to say no delta?


@ the moment I don't have anything defined for those since I don't need 
it; just wondering what the proper behavior is supposed to be?


Thanks.

- Jon


Re: Delta importing issues

2008-09-19 Thread Jon Baer
Actually, how does ${deltaimporter.last_index_time} know which entity 
I'm specifically updating?  I feel like I'm missing something; can it 
work like that?


Thanks.

- Jon

On Sep 19, 2008, at 4:14 PM, Jon Baer wrote:


Question -

So if I issued a dataimport?command=delta-import&entity=one,two,three

Would this also hit items w/o a delta-import like four,five,six,  
etc?  Im trying to set something up and I ended up with 28k+  
documents which seems more like a full import, so do I need to do  
something like delta-query="" to say no delta?


@ the moment I dont have anything defined for those since I don't  
need it, just wondering what the proper behavior is suppose to be?


Thanks.

- Jon




Re: Delta importing issues

2008-09-20 Thread Jon Baer
Would that context be available for *each* entity?  @ present it 
seems like there should be a last_index_time written for each top-level 
entity ... no?


Umm, would it be possible to hack something like ${deltaimporter.[name 
of entity].last_index_time} as is, or are there too many moving parts?


Thanks.

- Jon

On Sep 20, 2008, at 9:21 AM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:



If an entity is specified like entity=one&entity=two, the command
will be run only for those entities. Absence of the entity parameter
means all entities will be executed.

The last_index_time is another piece which must be improved.

It is hard to get use cases.  If users can give me more use cases it
would be great.

One thing I have in mind is allowing users to store arbitrary properties
through an API, say context.persistProperty("key","value"),
and you must be able to read it back using 
context.getPersistedProperty("key");


This would be generic enough for users to get going.

Thoughts?

--Noble

On Sat, Sep 20, 2008 at 1:52 AM, Jon Baer <[EMAIL PROTECTED]> wrote:
Actually how does ${deltaimporter.last_index_time} know which  
entity Im
specifically updating?  I feel like Im missing something, can it  
work like

that?

Thanks.

- Jon

On Sep 19, 2008, at 4:14 PM, Jon Baer wrote:


Question -

So if I issued a dataimport?command=delta- 
import&entity=one,two,three


Would this also hit items w/o a delta-import like four,five,six,  
etc?  Im
trying to set something up and I ended up with 28k+ documents  
which seems
more like a full import, so do I need to do something like delta- 
query="" to

say no delta?

@ the moment I dont have anything defined for those since I don't  
need it,

just wondering what the proper behavior is suppose to be?

Thanks.

- Jon







--
--Noble Paul




Re: most searched keyword in solr

2008-09-25 Thread Jon Baer

Why even do any of the work :-)

I'm not sure any of the free analytics apps (à la Google) can, but the 
paid ones do; just drop the query into one of those and let them 
analyze ...


http://www.google.com/analytics/

Then just parse the reports.

- Jon

On Sep 25, 2008, at 8:39 AM, Mark Miller wrote:


sanraj25 wrote:

hi,
how will we find most searched keyword in solr?
If anybody can suggest us a good solution, it would be helpful
thank you

with  Regards,
P.Parkavi


Write some code to record every query/keyword. Could be done at  
different places depending on how you define 'keyword' compared to  
how things are tokenized.


Or, you should also be able to parse the solr logs and extract query  
information and figure it out based on that.


Or...? Haven't seen any code to help with this out there, but maybe 
there is some?
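
(A rough sketch of that log-parsing idea; the log line format and file name here 
are only illustrative, so adjust the pattern to whatever your servlet container 
actually writes:)

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.net.URLDecoder;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TopQueries {
        public static void main(String[] args) throws Exception {
            // Matches the q= parameter inside the params={...} portion of Solr's request log lines.
            Pattern q = Pattern.compile("[?&{]q=([^&}\\s]+)");
            Map<String, Integer> counts = new HashMap<String, Integer>();

            BufferedReader in = new BufferedReader(new FileReader("solr.log"));
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = q.matcher(line);
                if (m.find()) {
                    String query = URLDecoder.decode(m.group(1), "UTF-8").toLowerCase();
                    Integer n = counts.get(query);
                    counts.put(query, n == null ? 1 : n + 1);
                }
            }
            in.close();

            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                System.out.println(e.getValue() + "\t" + e.getKey());
            }
        }
    }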


- Mark




Re: DataImportHandler: way to merge multiple db-rows to 1 doc using transformer?

2008-09-27 Thread Jon Baer
If I understand your question right ... you would not need a 
transformer; basically you nest entities under each other ... i.e.:



driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/nhldb? 
connectTimeout=0&autoReconnect=true" user="root" password=""  
batchSize="-1"/>




  
  
  processor="org.apache.solr.handler.dataimport.CachedSqlEntityProcessor">


  




I believe those are the basic steps.  Look up CachedSqlEntityProcessor 
to see if you need it.


- Jon

On Sep 27, 2008, at 5:47 PM, Britske wrote:



Looking at the wiki, code of DataImportHandler and it looks  
impressive.
There's talk about ways to use Transformers to be able to create  
several

rows (solr docs) based on a single db row.

I'd like to know if it's possible to do the exact opposite: to build
customer transformers that take multiple db-rows and merge it to a  
single

solr-row/document. If so, how?

Thanks,
Britske





DIH - Full imports + ?entity=param

2008-10-02 Thread Jon Baer

Just curious,

Currently a full-import call does a delete all even when appending an  
entity param ... wouldn't it be possible to pick up the param and just  
delete on that entity somehow?  It would be nice if there was  
something involved w/ having an entity field name that worked w/ DIH  
to do some better introspection like that ...


Is that something which is currently doable?

Thanks.

- Jon




Re: DIH - Full imports + ?entity=param

2008-10-03 Thread Jon Baer
I like both ideas ... maybe the deleteByQuery attribute idea a little 
better, since it keeps things w/ the <entity> inside the config + you would not 
really be mucking w/ urls too much.


On Oct 2, 2008, at 11:51 PM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:



DIH does not know the rows created by that entity. So we do not really
have any knowledge on how to delete specific rows.

how about passing a deleteQuery=type:x in the request params
or having a deleteByQuery on each top-level entity which can be used
when that entity is doing a full-import
--Noble

On Fri, Oct 3, 2008 at 4:32 AM, Jon Baer <[EMAIL PROTECTED]> wrote:

Just curious,

Currently a full-import call does a delete all even when appending  
an entity
param ... wouldn't it be possible to pick up the param and just  
delete on
that entity somehow?  It would be nice if there was something  
involved w/

having an entity field name that worked w/ DIH to do some better
introspection like that ...

Is that something which is currently doable?

Thanks.

- Jon







--
--Noble Paul




Re: Solr indexing not taking all values from DB.

2008-10-12 Thread Jon Baer
What is your <uniqueKey> set to?  Could it be you have duplicates in 
your uniqueKey setup (thus producing only 10 rows in the index)?


- Jon

On Oct 12, 2008, at 1:30 PM, con wrote:



I wrote a jdbc program to implement the same query. But it is  
returning all

the responses, 25 nos.
But the solr is still indexing only 10 rows.
Is there any optimization settings by default in the solrconfig.xml  
that

restricts the responses to 10 ?
thanks
con.





Noble Paul നോബിള്‍ नोब्ळ् wrote:


template transformer does not eat up rows.

I am almost sure that the query returns only 10 rows in that case.
could you write a quick jdbc program and verify that (not the oralce
client)

everything else looks fine

On Sat, Oct 11, 2008 at 4:52 PM, con <[EMAIL PROTECTED]> wrote:


Hi Noble
Thanks for your reply

In my data-config.xml I have;

  

  
  
  
  
  

  

  
  
  
  

Whether this, TemplateTransformer, is the one that is restricting  
the

resultset count to 10?
Where can I find it out?
I need this TemplateTransformer because I want to query the  
responses of

either one of these at a time using the URL like,

http://localhost:8983/solr/select/?q=(Bob%20AND%20rowtype:customers)&version=2.2&start=0&rows=10&indent=on&wt=json

I tried in the debug mode:
(http://localhost:8983/solr/dataimport?command=full-import&debug=on&verbose=on 
)

, But it is not all mentioning anything after the 10th document.


Thanks and regards
con



Noble Paul നോബിള്‍ नोब्ळ् wrote:


The DIH status says 10 rows which means only 10 rows got fetched  
for

that query. Do you have any custom transformers which eats up rows?

Try the debug page of DIH and see what is happening to the rest  
of the

rows.



On Fri, Oct 10, 2008 at 5:32 PM, con <[EMAIL PROTECTED]> wrote:


A simple question:
I performed the following steps to index data from a oracle db  
to solr

index
and then search:
a) I have the configurations for indexing data from a oracle db
b) started the server.
c) Done a full-import:
http://localhost:8983/solr/dataimport?command=full-import

But when I do a search using http://localhost:8983/solr/select/? 
q=
Not all the result sets that matches the search string are  
displayed.


1) Is the above steps enough for getting db values to solr index?
My configurations (data-config.xml and schema.xml )are quite  
correct

because
I am getting SOME of the result sets as search result(not all).
2) Is there some value in sorconfig.xml, or some other files that
limits
the
number of items being indexed? [For the time being I have only a  
few

hundreds of records in my db. ]
The query that I am specifying in data-config yields around 25  
results

if
i
execute it in a oracle client, where as the status of full- 
import is

something like:
idle
Configuration Re-loaded sucessfully


  1
  10
  0
  2008-10-10 17:29:03
  0:0:0.513










--
--Noble Paul










--
--Noble Paul









SolrJ + HTTP caching

2008-10-15 Thread Jon Baer

Hi,

What is the proper behavior supposed to be between SolrJ and caching? 
I'm proxying through a framework and wondering if it is possible to 
turn caching on / off programmatically depending on the type of 
query (or if this will have no effect whatsoever) ... since SolrJ uses 
the Apache HTTP client libs, can it negotiate anything here?


SOLR-127: HTTP Caching awareness.  Solr now recognizes HTTP Request
headers related to HTTP Caching (see RFC 2616 sec13) and will  
respond
with "304 Not Modified" when appropriate.  New options have been  
added

to solrconfig.xml to influence this behavior.
(Thomas Peuss via hossman)

Thanks.

- Jon


RegexTransformer debugging (DIH)

2008-10-16 Thread Jon Baer
Is there a way to prevent this from occurring (or a way to nail down 
the doc which is causing it)?


INFO: [news] webapp=/solr path=/admin/dataimport  
params={command=status} status=0 QTime=0

Exception in thread "Thread-14" java.lang.StackOverflowError
at java.util.regex.Pattern$Single.match(Pattern.java:3313)
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4763)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4637)
at java.util.regex.Pattern$All.match(Pattern.java:4079)
at java.util.regex.Pattern$Branch.match(Pattern.java:4538)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4578)
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4767)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4637)
at java.util.regex.Pattern$All.match(Pattern.java:4079)
at java.util.regex.Pattern$Branch.match(Pattern.java:4538)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4578)
at java.util.regex.Pattern$LazyLoop.match(Pattern.java:4767)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4637)
at java.util.regex.Pattern$All.match(Pattern.java:4079)

Thanks.

- Jon



Ocean realtime search + Solr

2008-10-21 Thread Jon Baer

Hi,

I'm pretty intrigued by the Ocean search stuff and the Lucene patch; I'm 
wondering if it's something that a tweaked Solr w/ a modded Lucene can run 
now?  Has anyone tried merging that patch and running w/ Solr?  I'm 
sure there is more to it than just swapping out the libs, but the real-time 
indexing I'm sure would be possible, no?


Thanks.

- Jon


Re: Solr for Whole Web Search

2008-10-22 Thread Jon Baer
If that is the case you should look @ the DataImportHandler examples, 
as they can already index RSS; I'm doing it now for ~ a dozen feeds on 
an hourly basis.  (This also works for any XML-based feed: XHTML, XML, 
etc.)  I find Nutch more useful for plain vanilla HTML (something that 
was built non-dynamically), since otherwise you can bring in the DB content 
that you would have put on the page to begin with.  Nutch is also useful 
for other types of documents I think (PDF) and anything that Tika 
(http://incubator.apache.org/tika/) can extract.


- Jon

On Oct 22, 2008, at 11:08 AM, John Martyniak wrote:


Grant thanks for the response.

A couple of other people have recommended trying the Nutch + Solr  
approach, but I am not sure what the real benefit of doing that is.   
Since Nutch provides most of the same features as Solr and Solr has  
some nice additional features (like spell checking, incremental  
index).


So I currently have a Nutch Index of around 500,000+ Urls, but  
expect it to get much bigger.  And am generally pretty happy with  
it, but I just want to make sure that I am going down the correct  
path, for the best feature set.  As far as implementation to the  
front end is concerned, I have been using the Nutch search app as  
basically a webservice to feed the main app (So using RSS).  The  
main app takes that and manipulates the results for display.


As far as the Hadoop + Lucene integration, I haven't used that  
directly just the Hadoop integration with Nutch.  And of course  
Hadoop independently.


-John


On Oct 22, 2008, at 10:08 AM, Grant Ingersoll wrote:



On Oct 22, 2008, at 7:57 AM, John Martyniak wrote:


I am very new to Solr, but I have played with Nutch and Lucene.

Has anybody used Solr for a whole web indexing application?

Which Spider did you use?

How does it compare to Nutch?


There is a patch that combines Nutch + Solr.  Nutch is used for  
crawling, Solr for searching.  Can't say I've used it for whole web  
searching, but I believe some are trying it.


At the end of the day, I'm sure Solr could do it, but it will take  
some work to setup the architecture (distributed, replicated) and  
deal properly with fault tolerance and fail over.There are also  
some examples on Hadoop about Hadoop + Lucene integration.


How big are you talking?




Thanks in advance for all of the info.

-John



--
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com


Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ















Re: DIH and rss feeds

2008-10-30 Thread Jon Baer
I'd like to say that deal is part of https://issues.apache.org/jira/browse/SOLR-783, 
but looking @ it closely it might be different.


I think the issue is that delta-import does not have anything to match 
its last_index_time against when doing feeds.  I'm also interested in 
that type of merge functionality.


- Jon

On Oct 30, 2008, at 11:46 PM, Lance Norskog wrote:

I have a DataImportHandler configured to index from an RSS feed. It  
is a

"latest stuff" feed. It reads the feed and indexes the 100 documents
harvested from the feed. So far, works great.

Now: a few hours later there are a different 100 "lastest"  
documents. How do
I add those to the index so I will have 200 documents?  'full- 
import' throws
away the first 100. 'delta-import' is not implemented. What is the  
special

trick here?  I'm using the Solr-1.3.0 release.

Thanks,

Lance Norskog




Re: DIH and rss feeds

2008-10-31 Thread Jon Baer
Is that right?  I find the wording of "clean" a little confusing.  I 
would have thought this is what I had needed earlier, but the topic 
came up regarding the fact that you cannot deleteByQuery for an 
entity you want to flush w/ delta-import.


I just noticed that the original JIRA request says it was implemented 
recently ...


https://issues.apache.org/jira/browse/SOLR-801

I'm assuming this means your .war needs to come from a trunk copy?  Does 
this patch affect that param @ all?


- Jon

On Oct 31, 2008, at 2:05 AM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:



run full-import with clean=false

for full-import clean is set to true by default and for delta-import
clean is false by default.

On Fri, Oct 31, 2008 at 9:16 AM, Lance Norskog <[EMAIL PROTECTED]>  
wrote:
I have a DataImportHandler configured to index from an RSS feed. It  
is a

"latest stuff" feed. It reads the feed and indexes the 100 documents
harvested from the feed. So far, works great.

Now: a few hours later there are a different 100 "lastest"  
documents. How do
I add those to the index so I will have 200 documents?  'full- 
import' throws
away the first 100. 'delta-import' is not implemented. What is the  
special

trick here?  I'm using the Solr-1.3.0 release.

Thanks,

Lance Norskog





--
--Noble Paul




TermVectorComponent for tag generation?

2008-10-31 Thread Jon Baer

Hi,

So I'm looking to either use this or build a component which might do 
what I'm looking for.  I'd like to figure out if it's possible to use a 
single doc to get tag generation based on the matches within that 
document, for example:


1 News Doc -> contains 5 Players and 8 Teams (show them as possible 
tags for this article)


In this case Players and Teams are also docs.  It's almost like I want 
to use MoreLikeThis w/ a different filter query than what I'm using.


Is there any easy hack to get this going?

Thanks.

- Jon 


Re: TermVectorComponent for tag generation?

2008-10-31 Thread Jon Baer

Well, for example, in any given text (which is a field on a document):

"While suitable for any application which requires full text indexing  
and searching capability, Lucene has been widely recognized for its  
utility in the implementation of Internet search engines and local,  
single-site searching.


At the core of Lucene's logical architecture is the idea of a document  
containing fields of text. This flexibility allows Lucene's API to be  
independent of file format. Text from PDFs, HTML, Microsoft Word  
documents, as well as many others can all be indexed so long as their  
textual information can be extracted."


I'd like to be able to say the tags for this article should be [Lucene, 
PDF, HTML, Microsoft Word] because they are field values from other 
documents.  Basically, how do I generate tags for a single document 
based on other documents' field values?
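
(Something as dumb as the sketch below would cover my case; the doctype / field 
names are made up, and the server is whatever SolrServer you already have:)

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class TagSuggester {
        // Returns the player/team names that actually occur in the article text.
        public static List<String> suggestTags(SolrServer server, String articleText) throws Exception {
            SolrQuery q = new SolrQuery("doctype:player OR doctype:team");
            q.setRows(1000);
            q.setFields("name");
            QueryResponse rsp = server.query(q);

            String text = articleText.toLowerCase();
            List<String> tags = new ArrayList<String>();
            for (SolrDocument d : rsp.getResults()) {
                String name = (String) d.getFieldValue("name");
                if (name != null && text.contains(name.toLowerCase())) {
                    tags.add(name);
                }
            }
            return tags;
        }
    }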


- Jon


On Oct 31, 2008, at 6:17 PM, Grant Ingersoll wrote:


Hey Jon,

Not following how the TVC (TermVectorComp) would help here.I  
suppose you could use the "most important" terms, as defined by TF- 
IDF, as suggested tags.  The MLT (MoreLikeThis) uses this to  
generate query terms.


However, I'm not following the different filter query piece.  Can  
you provide a bit more details?


One thing you did make me think, though, is it might be interesting  
to extend TermVectorMapper so that it can output a NamedList and  
then allow people to implement their own SolrTermVectorMapper and  
have it customize the TV output...


Thanks,
Grant

On Oct 31, 2008, at 5:20 PM, Jon Baer wrote:


Hi,

So Im looking to either use this or build a component which might  
do what Im looking for.  Id like to figure out if its possible use  
a single doc to get tag generation based on the matches within that  
document for example:


1 News Doc -> contains 5 Players and 8 Teams (show them as possible  
tags for this article)


In this case Players and Teams are also docs.  It's almost like I  
want to use MoreLikeThis w/ a different filter query than what Im  
using.


Is there any easy hack to get this going?

Thanks.

- Jon


--
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com


Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ













Re: TermVectorComponent for tag generation?

2008-11-01 Thread Jon Baer


On Nov 1, 2008, at 1:16 PM, Grant Ingersoll wrote:


How do you propose to distinguish those words from the other ones?


** They are field values from other documents

 The problem you are addressing is often called keyword extraction.   
In general, it 's a difficult problem, but you may have domain  
knowledge that can help.


** I'm finding it hard to believe Lucene can do an amazing job @ search yet 
has nothing to tell me if a generated list of content is present in a 
resulting document.  The other options of the TVC are what piqued my 
interest in the beginning ...


Other Options
* tv.fl - List of fields to get TV information from. Optional. If  
not specified, the fl parameter is used.
* tv.docIds - List of Lucene document ids (not the Solr Unique  
Key) to get term vectors for.


I'm pretty sure that might work for what I need it for.

- Jon


Re: DIH Http input bug - problem with two-level RSS walker

2008-11-01 Thread Jon Baer
Another idea is to create the logic you need and dump to a temp 
MySQL table, and then fetch the feeds; that has worked pretty nicely 
for me, and it removes the need for the outer feed to do the work.  @ 
first I could not figure out if this was a bug or a feature ... 
Something like ...

	processor="org.apache.solr.handler.dataimport.CachedSqlEntityProcessor">
			http://{$db.id}.somedomain.com/ 
feed.xml" name="feeds" pk="link"  
processor="org.apache.solr.handler.dataimport.XPathEntityProcessor"  
forEach="/rss/channel/item"  
transformer="org.apache.solr.handler.dataimport.TemplateTransformer,  
org.apache.solr.handler.dataimport.DateFormatTransformer">







dateTimeFormat="-MM-dd'T'hh:mm:ss'Z'"/>




- Jon

On Nov 1, 2008, at 3:26 PM, Norskog, Lance wrote:

The inner entity drills down and gets more detail about each item in  
the

outer loop. It creates one document.

-Original Message-
From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED]
Sent: Friday, October 31, 2008 10:24 PM
To: solr-user@lucene.apache.org
Subject: Re: DIH Http input bug - problem with two-level RSS walker

On Sat, Nov 1, 2008 at 10:30 AM, Lance Norskog <[EMAIL PROTECTED]>
wrote:


I wrote a nested HttpDataSource RSS poller. The outer loop reads an
rss feed which contains N links to other rss feeds. The nested loop
then reads each one of those to create documents. (Yes, this is an
obnoxious thing to do.) Let's say the outer RSS feed gives 10 items.
Both feeds use the same
structure: /rss/channel with a  node and then N  nodes
inside the channel. This should create two separate XML streams with
two separate Xpath iterators, right?


  
  

  
  
  


This does indeed walk each url from the outer feed and then fetch the
inner rss feed. Bravo!

However, I found two separate problems in xpath iteration. They may  
be



related. The first problem is that it only stores the first document
from each "inner" feed. Each feed has several documents with  
different



title fields but it only grabs the first.



The idea behind nested entities is to join them together so that one
Solr document is created for each root entity and the child entities
provide more fields which are added to the parent document.

I guess you want to create separate Solr documents from the root  
entity

as well as the child entities. I don't think that is possible with
nested entities. Essentially, you are trying to crawl feeds, not join
them.

Probably an integration with Apache Droids can be thought about.
http://incubator.apache.org/projects/droids.html
http://people.apache.org/~thorsten/droids/

If you are going to crawl only one level, there may be a workaround.
However, it may be easier to implement all this with your own Java
program and just post results to Solr as usual.



The other is an off-by-one bug. The outer loop iterates through the  
10



items and then tries to pull an 11th.  It then gives this exception
trace:

INFO: Created URL to:  [inner url]
Oct 31, 2008 11:21:20 PM
org.apache.solr.handler.dataimport.HttpDataSource
getData
SEVERE: Exception thrown while getting data
java.net.MalformedURLException: no protocol: null/account.rss
  at java.net.URL.(URL.java:567)
  at java.net.URL.(URL.java:464)
  at java.net.URL.(URL.java:413)
  at

org 
.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSour

ce.jav
a:90)
  at

org 
.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSour

ce.jav
a:47)
  at

org.apache.solr.handler.dataimport.DebugLogger 
$2.getData(DebugLogger.j

ava:18
3)
  at

org 
.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPat

hEntit
yProcessor.java:210)
  at

org 
.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(X

PathEn
tityProcessor.java:180)
  at

org 
.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathE

ntityP
rocessor.java:160)
  at


org 
.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.j

ava:

285)
...
Oct 31, 2008 11:21:20 PM  
org.apache.solr.handler.dataimport.DocBuilder

buildDocument
SEVERE: Exception while processing: album document :
SolrInputDocumnt[{name=name(1.0)={Groups of stuff}}]
org.apache.solr.handler.dataimport.DataImportHandlerException:
Exception in invoking url null Processing Document # 11
  at

org 
.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSour

ce.jav
a:115)
  at

org 
.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSour

ce.jav
a:47)









--
Regards,
Shalin Shekhar Mangar.




Re: DIH Http input bug - problem with two-level RSS walker

2008-11-02 Thread Jon Baer
On a side note ... it would be nice if your data source could also be 
the result of a script (instead of trying to hack around it w/ 
JdbcDataSource) ...


Something similar to what ScriptTransformer does ...
(http://wiki.apache.org/solr/DataImportHandler#head-27fcc2794bd71f7d727104ffc6b99e194bdb6ff9)


An example would be:

<dataSource ... script="outerloop.js" />


(The script would basically contain just a callback - getData(String 
query) - that returns an array / result set or sets values on its 
children, etc.)


- Jon

On Nov 3, 2008, at 12:40 AM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:



Hi Lance,
I guess I got your problem
So you wish to create docs for both entities (as suggested by Jon
Baer). So the best solution would be to create two root entities. The
first one should be the outer and write a transformer to store all the
urls into the db . The JdbcDataSource can do inserts/update too (the
method is same getData()). The second entity can read from db and
create docs  (see Jon baer's suggestion) using the
XPathEntityProcessor as a sub-entity
--Noble

On Mon, Nov 3, 2008 at 9:44 AM, Noble Paul നോബിള്‍  
नोब्ळ्

<[EMAIL PROTECTED]> wrote:

Hi Lance,
Do a full import w/o debug and let us know if my suggestion worked
(rootEntity="false" ) . If it didn't , I can suggest u something else
(Writing a Transformer )


On Sun, Nov 2, 2008 at 8:13 AM, Noble Paul നോബിള്‍  
नोब्ळ्

<[EMAIL PROTECTED]> wrote:

If you wish to create 1 doc per inner entity the set
rootEntity="false" for the entity outer.
The exception is because the url is wrong

On Sat, Nov 1, 2008 at 10:30 AM, Lance Norskog <[EMAIL PROTECTED]>  
wrote:
I wrote a nested HttpDataSource RSS poller. The outer loop reads  
an rss feed
which contains N links to other rss feeds. The nested loop then  
reads each
one of those to create documents. (Yes, this is an obnoxious  
thing to do.)
Let's say the outer RSS feed gives 10 items. Both feeds use the  
same
structure: /rss/channel with a  node and then N   
nodes inside
the channel. This should create two separate XML streams with two  
separate

Xpath iterators, right?


  
  

  
  
  


This does indeed walk each url from the outer feed and then fetch  
the inner

rss feed. Bravo!

However, I found two separate problems in xpath iteration. They  
may be
related. The first problem is that it only stores the first  
document from
each "inner" feed. Each feed has several documents with different  
title

fields but it only grabs the first.

The other is an off-by-one bug. The outer loop iterates through  
the 10 items
and then tries to pull an 11th.  It then gives this exception  
trace:


INFO: Created URL to: [inner url]
Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.HttpDataSource getData
SEVERE: Exception thrown while getting data
java.net.MalformedURLException: no protocol: null/account.rss
  at java.net.URL.<init>(URL.java:567)
  at java.net.URL.<init>(URL.java:464)
  at java.net.URL.<init>(URL.java:413)
  at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:90)
  at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
  at org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:183)
  at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:210)
  at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:180)
  at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
  at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:285)
...
Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
SEVERE: Exception while processing: album document :
SolrInputDocumnt[{name=name(1.0)={Groups of stuff}}]
org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in
invoking url null Processing Document # 11
  at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:115)
  at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)










--
--Noble Paul





--
--Noble Paul





--
--Noble Paul




Re: DataImportHandler not indexing all the records

2008-11-15 Thread Jon Baer
Ive also had the same issues here, but when trying to switch to
HTMLStripWhitespaceTokenizerFactory I found that it only removes the
tags; when it comes to all forms of javascript includes in a
document it keeps them all intact, so I ended up w/ scripts in the
document text. Is there any easy way to get around that (w/o having to
use the RegEx processor)?


Thanks.

- Jon

On Nov 15, 2008, at 2:21 PM, Shalin Shekhar Mangar wrote:


I think the problem is that DIH catches Exception but not Error, so a
StackOverflowError will slip past it. Normally, the SolrDispatchFilter will
log such errors but the import is performed in a new thread, so the  
error is
not logged anywhere. However, DIH will not commit documents in this  
case

(and there is no mention of a commit in your DIH status).

We should change the catch clause to catch Throwable so that this is  
not

repeated. I'll open an issue and give a patch.

Btw, Ahmed, Solr has a Tokenizer which is much better at stripping html --
HTMLStripWhitespaceTokenizerFactory which you can use for such tasks.

On Sun, Nov 16, 2008 at 12:30 AM, Ahmed Hammad <[EMAIL PROTECTED]>  
wrote:


I had a similar problem to Giri's. I have 17,000 records in one table and
DIH can import only 12464.

After some investigation, I found my problem.

I have a regular expression to strip off html tags from input text, as
follows:


replaceWith=" "/>

The DIH RegEx has a stack overflow on record 17,000 due to an error in the
content, and then DIH exits without any error in the log or in the status
command. Here is the status:


0:0:31.657
1
12464
12464
0
2008-11-15 20:40:58


I found the error in Eclipse Console window while debugging; it was  
a stack

overflow in the RegEx library.

The problem is that DIH does not show any problem in the log file or in the
status message.
What I think is important is to show whatever error happens in the log file.


I noticed also that, in case of no error, a log message shows completeness:


Nov 15, 2008 8:57:34 PM org.apache.solr.handler.dataimport.DocBuilder
execute
INFO: Time taken = 0:0:40.656

In case of RegEx stack overflow error, this log message does not  
appear.


I am researching on how to catch such error in DIH. Any ideas?


Regards,
ahmd

On Sat, Nov 15, 2008 at 6:32 AM, Noble Paul നോബിള്‍  
नोब्ळ् <

[EMAIL PROTECTED]> wrote:


There is no obvious problem

I can be reasonably sure that
the query

select * from climatedata.ws_record limit 100

would have fetched only  615360 rows.
This is a very reliable piece of information
615360

On Sat, Nov 15, 2008 at 12:41 AM, Giri <[EMAIL PROTECTED]>  
wrote:

Hi Noble,
thanks for the help, here are the details: the field "id" is  
unique,

when

I

did a select distinct(id), it returned 1 million rows.

---
db-data-config.xml
note: I limit the resultset to 1 million in the select query
---

  url="jdbc:mysql://localhost:3306/climatedata" user="user"  
password="pw"

batchSize ="-1"/>
  
  
  
  
  
  
  
  
  
  
  
   
  


-
in the solr Schema.xml:


 
  
  
  
  
  
  
  
  

 
 

 
 stored="false"

required="false"/>


 

 
stored="true"/>

 
stored="true"/>

 
stored="true"/>

 
stored="true"/>

 
stored="true"/>

 
stored="true"/>

 
stored="true"/>

 
stored="true"/>




I run the index via  firefox browser using
http://localhost:8080/solr/dataimport?command=full-import
I checked the status using
http://localhost:8080/solr/dataimport?command=status
initially the status increased steadily, but after reaching  
613071, the

status stayed for a while (as below), and then it displayed the

completed

message :


-

0
1

-

-

db-data-config.xml


status
busy
A command is still running...
-

0:3:24.266
1
613071
613070
0
2008-11-14 12:12:16

-

This response format is experimental.  It is likely to change in  
the

future.




---


NOTE: this is the status result after it completed

---


-

0
1

-

-

db-data-config.xml


status
idle

-

1
615360
0
2008-11-14 12:12:16
-

Indexing completed. Added/Updated: 615360 documents. Deleted 0

documents.


2008-11-14 12:16:32
2008-11-14 12:16:32
0:4:16.154

-

This response format is experimental.  It is likely to change in  
the

future.




-

here is the full solr scehma.xml content:






  

  
  
sortMissingLast="true"/>


  
  

  

  

  
  
  
  


  
  
  
  
  


  
  

Solr schema 1.3 -> 1.4-dev (changes?)

2008-11-19 Thread Jon Baer

Hi,

I wanted to try the TermVectorComponent w/ current schema setup and I  
did a build off trunk but it's giving me something like ...


org.apache.solr.common.SolrException: ERROR:unknown field 'DOCTYPE'

Even though it is declared in schema.xml (lowercase), before I grep  
replace the entire file would that be my issue?


Thanks.

- Jon


Re: Solr schema 1.3 -> 1.4-dev (changes?)

2008-11-19 Thread Jon Baer
Sorry, I should have mentioned this is from using the
DataImportHandler ... 1.3 seems case insensitive ... ie my columns are
UPPERCASE and schema field names are lowercase, and it works fine in
1.3 but not in 1.4 ... 1.4 seems strict.  Going to change all the
field names to uppercase to see if that resolves the problem.  Thanks.


- Jon

On Nov 19, 2008, at 6:44 PM, Ryan McKinley wrote:


schema fields should be case sensitive...  so DOCTYPE != doctype

is the behavior different for you in 1.3 with the same file/schema?


On Nov 19, 2008, at 6:26 PM, Jon Baer wrote:


Hi,

I wanted to try the TermVectorComponent w/ current schema setup and  
I did a build off trunk but it's giving me something like ...


org.apache.solr.common.SolrException: ERROR:unknown field 'DOCTYPE'

Even though it is declared in schema.xml (lowercase), before I grep  
replace the entire file would that be my issue?


Thanks.

- Jon






Re: Solr schema 1.3 -> 1.4-dev (changes?)

2008-11-19 Thread Jon Baer

Schema:


DIH:



The column is uppercase ... isn't there some automagic happening now  
where DIH will introspect the fields @ load time?


- Jon

On Nov 19, 2008, at 11:11 PM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:



Hi John,
this is probably not the expected behavior.

only 'explicit' fields must be case-sensitive.

Could you tell me the usecase or can you paste the data-config?

--Noble






On Thu, Nov 20, 2008 at 8:55 AM, Jon Baer <[EMAIL PROTECTED]> wrote:
Sorry I should have mentioned this is from using the  
DataImportHandler ...
it seems case insensitive ... ie my columns are UPPERCASE and  
schema field
names are lowercase and it works fine in 1.3 but not in 1.4 ... it  
seems
strict.  Going to resolve all the field names to uppercase to see  
if that

resolves the problem.  Thanks.

- Jon

On Nov 19, 2008, at 6:44 PM, Ryan McKinley wrote:


schema fields should be case sensitive...  so DOCTYPE != doctype

is the behavior different for you in 1.3 with the same file/schema?


On Nov 19, 2008, at 6:26 PM, Jon Baer wrote:


Hi,

I wanted to try the TermVectorComponent w/ current schema setup  
and I did

a build off trunk but it's giving me something like ...

org.apache.solr.common.SolrException: ERROR:unknown field 'DOCTYPE'

Even though it is declared in schema.xml (lowercase), before I grep
replace the entire file would that be my issue?

Thanks.

- Jon









--
--Noble Paul




Re: Solr schema 1.3 -> 1.4-dev (changes?)

2008-11-19 Thread Jon Baer
Correct ... it is the unfortunate side effect of having some legacy  
tables in uppercase :-\  I thought the explicit declaration of field  
name attribute was ok.


- Jon

On Nov 19, 2008, at 11:53 PM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:



So originally you had the field declaration as follows . right?


we did some refactoring to minimize the object creation for
case-insensitive comparisons.

I guess it should be rectified soon.

Thanks for bringing it to our notice.
--Noble





On Thu, Nov 20, 2008 at 10:05 AM, Jon Baer <[EMAIL PROTECTED]> wrote:

Schema:


DIH:



The column is uppercase ... isn't there some automagic happening  
now where

DIH will introspect the fields @ load time?

- Jon

On Nov 19, 2008, at 11:11 PM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:



Hi John,
it could probably not the expected behavior?

only 'explicit' fields must be case-sensitive.

Could you tell me the usecase or can you paste the data-config?

--Noble






On Thu, Nov 20, 2008 at 8:55 AM, Jon Baer <[EMAIL PROTECTED]> wrote:


Sorry I should have mentioned this is from using the  
DataImportHandler

...
it seems case insensitive ... ie my columns are UPPERCASE and  
schema

field
names are lowercase and it works fine in 1.3 but not in 1.4 ...  
it seems
strict.  Going to resolve all the field names to uppercase to see  
if that

resolves the problem.  Thanks.

- Jon

On Nov 19, 2008, at 6:44 PM, Ryan McKinley wrote:


schema fields should be case sensitive...  so DOCTYPE != doctype

is the behavior different for you in 1.3 with the same file/ 
schema?



On Nov 19, 2008, at 6:26 PM, Jon Baer wrote:


Hi,

I wanted to try the TermVectorComponent w/ current schema setup  
and I

did
a build off trunk but it's giving me something like ...

org.apache.solr.common.SolrException: ERROR:unknown field  
'DOCTYPE'


Even though it is declared in schema.xml (lowercase), before I  
grep

replace the entire file would that be my issue?

Thanks.

- Jon









--
--Noble Paul







--
--Noble Paul




Re: Build Solr to run SolrJS

2008-11-22 Thread Jon Baer
Maybe another template idea ... I just started playing around w/ this  
plugin:


http://malsup.com/jquery/taconite/

Would be pretty neat to have that as a response (or @ least the  
technique), not sure how well known it is or if there is something W3C- 
based in the pipeline that is similar.  Pretty neat though.


- Jon

On Nov 22, 2008, at 4:26 AM, Erik Hatcher wrote:

I just got the client-side demo on trunk to work (with a few tweaks  
to make it work with the example core Solr data).


On trunk follow these steps:

 * root directory: ant example
 * separate console, index data: cd example/exampledocs; java -jar  
post.jar *.xml

 * open contrib/javascript/example/testClientside.html in your browser

The serverSide.html example is still not quite wired together  
properly to be an easy-to-run-demo, as what's on trunk doesn't  
include the reuters data and the velocity stuff is not wired in with  
it yet either.


We'll get this working better/cleaner as we go, so we appreciate  
your early adopter help ironing out this stuff.


Erik

On Nov 20, 2008, at 5:44 PM, JCodina wrote:



I could not manage, yet to use it. :confused:
My doubts are:
- must I  download solr from svn - trunk?
- then, must I apply the patches of solrjs and velocity and unzip  
the files?

or is this  already in trunk?
because  trunk contains velocity and javascript in contrib.
 but does not find the velocity
- How do I edit/activate SolrJs to adapt it to my data, the wiki  
page says
how to deploy the sample, and I looked at the sample page from the  
sample

site, but I don't find how to manually install it on a tomcat server.
PD. If I get how to do it, I promise I will introduce that  
information in

the solrjs wiki page  =) .


Matthias Epheser wrote:


Erik Hatcher schrieb:


On Nov 16, 2008, at 1:40 PM, Matthias Epheser wrote:

Matthias and Ryan - let's get SolrJS integrated into
contrib/velocity.  Any objections/reservations?


As SolrJS may be used without velocity at all (using eg.
ClientSideWidgets), is it possible to put it into "contrib/ 
javascript"
and create a "dependency" to contrib/velocity for  
ServerSideWidgets?


Sure, contrib/javascript sounds perfect.


If that's ok, I'll have a look at the directory structure and the
current ant build.xml to make them fit into the common solr  
structure

and build.


Awesome, thanks!


Just uploaded solrjs.zip to
https://issues.apache.org/jira/browse/SOLR-868. It
is intended to be extracted in contrib/javascript and supports the
following ant
targets:

* ant dist -> creates a single js file and a jar that holds velocity
templates.
* ant docs -> creates js docs. test in browser: doc/index.html
* ant example-init -> (depends ant dist on solr root) copies the  
current

built
of solr.war and solr-velocity.jar to example/testsolr/..
* ant example-start -> starts the testsolr server on port 8983
* ant example-import -> imports 3000 test data rows (requires a  
started

testserver)




  Erik







--
View this message in context: 
http://www.nabble.com/Build-Solr-to-run-SolrJS-tp20526644p20611635.html
Sent from the Solr - User mailing list archive at Nabble.com.






Re: [VOTE] Community Logo Preferences

2008-11-23 Thread Jon Baer

https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png
https://issues.apache.org/jira/secure/attachment/12394266/apache_solr_b_red.jpg


Re: Unknown field error using JDBC

2008-11-25 Thread Jon Baer
This sounds exactly like the same issue I had when going from 1.3 to 1.4 ... it
sounds like DIH is trying to automagically figure out the columns :-\


- Jon

On Nov 25, 2008, at 6:37 AM, Joel Karlsson wrote:


Hello,

I get an Unknown field error when I'm indexing an Oracle dB. I've reduced the
number of fields/columns in order to troubleshoot. If I change the uniqueKey
to timestamp (for example) and create a dynamic field <dynamicField name="*"
type="text" indexed="true" stored="true"/> the indexing works fine, except
the id-field is empty.

--data- 
config 
.xml---

...



...


   


...

--

--
schema 
.xml 
---

...

required="true" />


...

id

...



--ERROR- 
message


2008-nov-25 12:25:25 org.apache.solr.handler.dataimport.SolrWriter  
upload

VARNING: Error creating document :
SolrInputDocument[{PUBID=PUBID(1.0)={43392}}]

org.apache.solr.common.SolrException: ERROR:unknown field 'PUBID'
   at
org 
.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java: 
274)


...

---

Anyone who had similar problems or knows how to solve this!? Any  
help is

truly appreciated!!

// Joel




Re: Using Solr with Hadoop ....

2008-11-29 Thread Jon Baer

HadoopEntityProcessor for the DIH?

Ive wondered about this as they make HadoopCluster LiveCDs and EC2  
have images but best way to make use of them is always a challenge.


- Jon

On Nov 29, 2008, at 3:34 AM, Erik Hatcher wrote:



On Nov 28, 2008, at 8:38 PM, Yonik Seeley wrote:

Or, it would be relatively trivial to write a Lucene program
to merge the indexes.


FYI, such a tool exists in Lucene's API already:

 


Erik
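
The kind of merge being discussed can be done with IndexWriter's addIndexes call; a minimal sketch against a recent Lucene (directory paths are hypothetical, and 2.x-era releases spelled this addIndexesNoOptimize plus optimize()):

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeIndexes {
    public static void main(String[] args) throws Exception {
        // Destination index that Solr will eventually serve (path is hypothetical).
        Directory merged = FSDirectory.open(Paths.get("/indexes/merged"));
        IndexWriter writer = new IndexWriter(merged,
                new IndexWriterConfig(new StandardAnalyzer()));

        // Per-task shard indexes to fold in (paths are hypothetical).
        writer.addIndexes(
                FSDirectory.open(Paths.get("/indexes/shard1")),
                FSDirectory.open(Paths.get("/indexes/shard2")));

        writer.forceMerge(1);  // optional: collapse to a single segment
        writer.close();
    }
}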





Re: NIO not working yet

2008-11-30 Thread Jon Baer
Sorry missed that (and probably dumb question), does that -D flag work  
for setting as a RAMDirectory as well?


- Jon

On Nov 30, 2008, at 8:42 PM, Yonik Seeley wrote:


OK, the development version of Solr should now be fixed (i.e. NIO
should be the default for non-Windows platforms).  The next nightly
build (Dec-01-2008) should have the changes.

-Yonik

On Wed, Nov 12, 2008 at 2:59 PM, Yonik Seeley <[EMAIL PROTECTED]>  
wrote:

NIO support in the latest Solr development versions does not work yet
(I previously advised that some people with possible lock contention
problems try it out).  We'll let you know when it's fixed, but in the
meantime you can always set the system property
"org.apache.lucene.FSDirectory.class" to
"org.apache.lucene.store.NIOFSDirectory" to try it out.

for example:

java - 
Dorg 
.apache 
.lucene.FSDirectory.class=org.apache.lucene.store.NIOFSDirectory

-jar start.jar

-Yonik




Re: Solr on Solaris

2008-12-04 Thread Jon Baer

Just curious, is this off a "zone" by any chance?

- Jon

On Dec 4, 2008, at 10:40 PM, Kashyap, Raghu wrote:

We are running solr on a solaris box with 4 CPU's(8 cores) and  3GB  
Ram.

When we try to index sometimes the HTTP Connection just hangs and the
client which is posting documents to solr doesn't get any response  
back.
We since then have added timeouts to our http requests from the  
clients.




I then get this error.



java.lang.OutOfMemoryError: requested 239848 bytes for Chunk::new. Out
of swap space?

java.lang.OutOfMemoryError: unable to create new native thread

Exception in thread "JmxRmiRegistryConnectionPoller"
java.lang.OutOfMemoryError: unable to create new native thread



We are running JDK 1.6_10 on the solaris box. . The weird thing is we
are running the same application on linux box with JDK 1.6 and we
haven't seen any problem like this.



Any suggestions?



-Raghu





Re: Solr on Solaris

2008-12-05 Thread Jon Baer
Are you running Solr in a container, more specifically?  Ive had a few
issues w/ zones and Solr in the past (I believe there are some
networking issues w/ older Solaris versions) ...


They are basically where you can slice ("virtualize") your resources  
and divide a box up into something similar to a VPS ...


http://www.sun.com/bigadmin/content/zones/

- Jon

On Dec 5, 2008, at 10:58 AM, Kashyap, Raghu wrote:


Jon,

What do you mean by off a "Zone"? Please clarify

-Raghu


-Original Message-
From: Jon Baer [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 04, 2008 9:56 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr on Solaris

Just curious, is this off a "zone" by any chance?

- Jon

On Dec 4, 2008, at 10:40 PM, Kashyap, Raghu wrote:


We are running solr on a solaris box with 4 CPU's(8 cores) and  3GB
Ram.
When we try to index sometimes the HTTP Connection just hangs and the
client which is posting documents to solr doesn't get any response
back.
We since then have added timeouts to our http requests from the
clients.



I then get this error.



java.lang.OutOfMemoryError: requested 239848 bytes for Chunk::new.  
Out

of swap space?

java.lang.OutOfMemoryError: unable to create new native thread

Exception in thread "JmxRmiRegistryConnectionPoller"
java.lang.OutOfMemoryError: unable to create new native thread



We are running JDK 1.6_10 on the solaris box. . The weird thing is we
are running the same application on linux box with JDK 1.6 and we
haven't seen any problem like this.



Any suggestions?



-Raghu







Re: Delta-import hack to use last indexed id document

2008-12-06 Thread Jon Baer
This sounds a little like my original problem of deltaQuery imports  
per entity ...


https://issues.apache.org/jira/browse/SOLR-783

I wonder if those 2 hacks could be combined to fix the issue.

- Jon

On Dec 6, 2008, at 12:29 PM, Marc Sturlese wrote:



Hey there,
I am doing some hacks to some parts of the solr source. I am adding a feature
so that every time I use the delta import handler it starts getting info from
the db starting from the last indexed document id (from the latest
execution).

The point of doing that is that if I start a full import and the  
process is
aborted for any reason, I want to be able to start a delta import  
and start

indexing from the last indexed id of the full import.

To do that basically I have created functions in solrwriter.java and
dataimporter.java. The funcions I have created are the same as the  
ones to
write and retrieve the timestamp to the dataimport.properties but  
mines do

it with an id (long instead of date).
I call this functions in docbuilder.java (in the places were  
functions for

timestamp were created)
I do one more thing... i write in the dataimport.properties every  
time I

call the function upload in docbuilder to upload a document.

The problem is that not every time the upload function (in docbuilder) is
called is a commit called as well. So, if I kill -9 the process in the middle
of the execution I will have in the dataimport.properties the last uploaded
id, but in the index (opening it with Luke) I will have the last committed one.


I have done some tests calling writer.commit(false) just after the upload, or
setting in solrconfig.xml  2.
With both
it works fine but obviously the indexer works extremely slowly.

Is there any way to write to the dataimport.properties
(writer.persistIndexLastID(arow.get("id").toString())) just after every
commit without calling the commit function myself? If not, I would appreciate
any advice about other ways to reach this goal.

If I get it done I will open an issue and upload the patch there, because I
think this can be a common use case.
Thanks in advance



--
View this message in context: 
http://www.nabble.com/Delta-import-hack-to-use-last-indexed-id-document-tp20872450p20872450.html
Sent from the Solr - User mailing list archive at Nabble.com.





DataImportHandler (reading XML w/ paging)

2009-01-06 Thread Jon Baer
Hi,

Anyone have a quick, clever way of dealing w/ paged XML for
DataImportHandler?  I have metadata like this:


1
3
15


I unfortunately can not get all the data in one shot so I need to
maybe a number of requests obtained from the paging meta, but can't
figure out if this is dynamically possible w/ the current DIH setup.
Any tips?

Thanks.

- Jon


Re: Reading database parameters from outside data-config.xml

2009-01-18 Thread Jon Baer
I think DIH would have to support JNDI which it current does not (I  
think).  Id also be interested in this (or where the credentials came  
from the db itself).


- Jon

On Jan 18, 2009, at 11:37 AM, con wrote:



Hi all
Currently i am defining database parameters like the url, username and
password and also the query in data-config.xml.
How can I change this scenario by which the query still remains in the
data-config.xml and all other DB details from a different file[Not in
solrconfig.xml]. This is because of security reasons and also i have
multiple DBs but same tables. So the query remains same but parameters
changes.

Waiting for a positive suggestion
Thanks
Con
--
View this message in context: 
http://www.nabble.com/Reading-database-parameters-from-outside-data-config.xml-tp21529799p21529799.html
Sent from the Solr - User mailing list archive at Nabble.com.





Re: Connection mismanagement in Solrj?

2009-01-27 Thread Jon Baer
Could it be the framework you are using around it?  I know some IOC
containers will auto pool objects underneath as a service without you really
knowing it is being done or has to be explicitly turned off.  Just a
thought.  I use a single server for all requests behind a Hivemind setup ...
umm not by choice :-\

- Jon

On Tue, Jan 27, 2009 at 12:32 PM, Ryan McKinley  wrote:

> if you use this constructor:
>
>  public CommonsHttpSolrServer(URL baseURL, HttpClient client)
>
> then solrj never touches the HttpClient configuration.
>
> I normally reuse a single CommonsHttpSolrServer as well.
>
>
>
> On Jan 27, 2009, at 9:52 AM, Walter Underwood wrote:
>
>  Making requests in parallel, using the default connection manager,
>> which is multi-threaded, and we are reusing a single CommonsHttpSolrServer
>> for all requests.
>>
>> wunder
>>
>> On 1/26/09 10:59 PM, "Noble Paul നോബിള്‍  नोब्ळ्" 
>> wrote:
>>
>>  are you making requests in parallel ?
>>> which ConnectionManager are you using for HttpClient?
>>>
>>> On Tue, Jan 27, 2009 at 11:58 AM, Noble Paul നോബിള്‍  नोब्ळ्
>>>  wrote:
>>>
 you can set any connection parameters for the HttpClient and pass on
 the instance to CommonsHttpSolrServer and that will be used for making
 requests

 make sure that you are not reusing instance of CommonsHttpSolrServer

 On Tue, Jan 27, 2009 at 10:59 AM, Walter Underwood
  wrote:

> We just switched to Solrj from a home-grown client and we have a huge
> jump in the number of connections to the server, enough that our
> load balancer was rejecting connections in production tonight.
>
> Does that sound familiar? We're running 1.3.
>
> I set the timeouts and connection pools to the same values I'd
> used in my other code, also based on HTTPClient.
>
> We can roll back to my code temporarily, but we want some of
> the Solrj facet support for a new project.
>
> wunder
>
>
>


 --
 --Noble Paul


>>>
>>>
>>
>


1.3 <-> 1.4 patch for onError handling

2009-01-30 Thread Jon Baer
Hi,

Ive just had a bump in the night where some feeds have disappeared, Im
wondering since Im running the base 1.3 copy would patching it w/

https://issues.apache.org/jira/browse/SOLR-842

Break anything?  Has anyone done this yet?

Thanks.

- Jon


DIH - Example of using $nextUrl and $hasMore

2009-02-02 Thread Jon Baer
Hi,

Sorry I know this exists ...

"If an API supports chunking (when the dataset is too large) multiple calls
need to be made to complete the process. XPathEntityprocessor supports this
with a transformer. If the transformer returns a row which contains a field
*$hasMore* with the value "true" the Processor makes another request with
the same url template (the actual value is recomputed before invoking). A
transformer can pass a totally new url too for the next call by returning a
row which contains a field *$nextUrl* whose value must be the complete url
for the next call."

But is there a true example of it's use somewhere?  Im trying to figure out
if I know before import that I have 56 "pages" to index how to set this up
properly.  (And how to set it up if pages need to be determined by something
in the feed, etc).

Thanks.

- Jon
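
A rough sketch of what such a transformer could look like, assuming the feed exposes paging metadata that the entity maps into hypothetical pageNum/totalPages fields (the URL is hypothetical too):

import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class PagingTransformer extends Transformer {

    @Override
    public Object transformRow(Map<String, Object> row, Context context) {
        Object pageNum = row.get("pageNum");
        Object totalPages = row.get("totalPages");
        if (pageNum == null || totalPages == null) {
            return row; // nothing to page on
        }

        int page = Integer.parseInt(pageNum.toString());
        int total = Integer.parseInt(totalPages.toString());

        if (page < total) {
            // Tells XPathEntityProcessor to issue one more request once this page is done.
            row.put("$hasMore", "true");
            // Hypothetical feed URL; normally built from the same template as the entity's url attribute.
            row.put("$nextUrl", "http://example.com/feed.rss?page=" + (page + 1));
        }
        return row;
    }
}

The entity would then list this class in its transformer attribute alongside the others.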


Re: DIH - Example of using $nextUrl and $hasMore

2009-02-02 Thread Jon Baer
Yes I think what Jared mentions in the JIRA is what I was thinking about
when it is recommended to always return true for $hasMore ...

"The transformer must know somehow when $hasMore should be true. If the
transformer always give $hasMore a value "true", will there be infinite
requests made or will it stop on the first empty request? Using the
EnumeratedEntityTransformer, a user can specify from the config xml when
$hasMore should be true using the chunkSize attribute. This solves a general
case of "request N rows at a time until no more are available". I agree, a
combination of 'rowsFetchedCount' and a HasMoreUntilEmptyTransformer would
also make this doable from the configuration"

This makes sense.

- Jon
Jared Flatow - 28/Jan/09 09:16 PM (comment on SOLR-994):
The transformer must know somehow when $hasMore should be true. If
the transformer always give $hasMore a value "true", will there be infinite
requests made or will it stop on the first empty request? Using the
EnumeratedEntityTransformer, a user can specify from the config xml when
$hasMore should be true using the chunkSize attribute. This solves a general
case of "request N rows at a time until no more are available". I agree, a
combination of 'rowsFetchedCount' and a HasMoreUntilEmptyTransformer would
also make this doable from the configuration.

On Mon, Feb 2, 2009 at 11:53 AM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> On Mon, Feb 2, 2009 at 9:20 PM, Jon Baer  wrote:
>
> > Hi,
> >
> > Sorry I know this exists ...
> >
> > "If an API supports chunking (when the dataset is too large) multiple
> calls
> > need to be made to complete the process. XPathEntityprocessor supports
> this
> > with a transformer. If transformer returns a row which contains a field *
> > $hasMore* with a the value "true" the Processor makes another request
> with
> > the same url template (The actual value is recomputed before invoking ).
> A
> > transformer can pass a totally new url too for the next call by returning
> a
> > row which contains a field *$nextUrl* whose value must be the complete
> url
> > for the next call."
> >
> > But is there a true example of it's use somewhere?  Im trying to figure
> out
> > if I know before import that I have 56 "pages" to index how to set this
> up
> > properly.  (And how to set it up if pages need to be determined by
> > something
> > in the feed, etc).
> >
>
> No, there is no example (yet). You'll put the url with variables for the
> corresponding 'start' and 'count' parameters and a custom transformer can
> specify if another request needs to be made. I know it's not much to go on.
> I'll try to write some documentation on the wiki.
>
> SOLR-994 might be interesting to you. I haven't been able to look at the
> patch though.
>
>  https://issues.apache.org/jira/browse/SOLR-994
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: DIH - Example of using $nextUrl and $hasMore

2009-02-02 Thread Jon Baer
See I think Im just misunderstanding how this entity is suppose to be setup
... for example, using the patch on 1.3 I ended up in a loop where .n is
never set ...

Feb 2, 2009 1:31:02 PM org.apache.solr.handler.dataimport.HttpDataSource
getData
INFO: Created URL to: http://subdomain.site.com/feed.rss?page=

<entity url="http://subdomain.site.com/boards.rss?page=${blogs.n}" chunkSize="50"
name="docs" pk="link" processor="XPathEntityProcessor"
forEach="/rss/channel/item" transformer="RegexTransformer,
com.nhl.solr.DateFormatTransformer, TemplateTransformer,
com.nhl.solr.EnumeratedEntityTransformer">

I guess what Im looking for is that snippet which shows how it is setup (the
initial counter) ...

- Jon

On Mon, Feb 2, 2009 at 12:39 PM, Noble Paul നോബിള്‍ नोब्ळ् <
noble.p...@gmail.com> wrote:

> On Mon, Feb 2, 2009 at 11:01 PM, Jon Baer  wrote:
> > Yes I think what Jared mentions in the JIRA is what I was thinking about
> > when it is recommended to always return true for $hasMore ...
> >
> > "The transformer must know somehow when $hasMore should be true. If the
> > transformer always give $hasMore a value "true", will there be infinite
> > requests made or will it stop on the first empty request? Using the
> > EnumeratedEntityTransformer, a user can specify from the config xml when
> > $hasMore should be true using the chunkSize attribute. This solves a
> general
> > case of "request N rows at a time until no more are available". I agree,
> a
> > combination of 'rowsFetchedCount' and a HasMoreUntilEmptyTransformer
> would
> > also make this doable from the configuration"
> why cant a Tranformer put a $hasMore=false?
> >
> > This makes sense.
> >
> > - Jon
> >  [ Show » <https://issues.apache.org/jira/browse/SOLR-994> ]
> >  Jared Flatow<
> https://issues.apache.org/jira/secure/ViewProfile.jspa?name=jflatow>-
> > 28/Jan/09
> > 09:16 PM The transformer must know somehow when $hasMore should be true.
> If
> > the transformer always give $hasMore a value "true", will there be
> infinite
> > requests made or will it stop on the first empty request? Using the
> > EnumeratedEntityTransformer, a user can specify from the config xml when
> > $hasMore should be true using the chunkSize attribute. This solves a
> general
> > case of "request N rows at a time until no more are available". I agree,
> a
> > combination of 'rowsFetchedCount' and a HasMoreUntilEmptyTransformer
> would
> > also make this doable from the configuration.
> >
> > On Mon, Feb 2, 2009 at 11:53 AM, Shalin Shekhar Mangar <
> > shalinman...@gmail.com> wrote:
> >
> >> On Mon, Feb 2, 2009 at 9:20 PM, Jon Baer  wrote:
> >>
> >> > Hi,
> >> >
> >> > Sorry I know this exists ...
> >> >
> >> > "If an API supports chunking (when the dataset is too large) multiple
> >> calls
> >> > need to be made to complete the process. XPathEntityprocessor supports
> >> this
> >> > with a transformer. If transformer returns a row which contains a
> field *
> >> > $hasMore* with a the value "true" the Processor makes another request
> >> with
> >> > the same url template (The actual value is recomputed before invoking
> ).
> >> A
> >> > transformer can pass a totally new url too for the next call by
> returning
> >> a
> >> > row which contains a field *$nextUrl* whose value must be the complete
> >> url
> >> > for the next call."
> >> >
> >> > But is there a true example of it's use somewhere?  Im trying to
> figure
> >> out
> >> > if I know before import that I have 56 "pages" to index how to set
> this
> >> up
> >> > properly.  (And how to set it up if pages need to be determined by
> >> > something
> >> > in the feed, etc).
> >> >
> >>
> >> No, there is no example (yet). You'll put the url with variables for the
> >> corresponding 'start' and 'count' parameters and a custom transformer
> can
> >> specify if another request needs to be made. I know it's not much to go
> on.
> >> I'll try to write some documentation on the wiki.
> >>
> >> SOLR-994 might be interesting to you. I haven't been able to look at the
> >> patch though.
> >>
> >>  https://issues.apache.org/jira/browse/SOLR-994
> >> --
> >> Regards,
> >> Shalin Shekhar Mangar.
> >>
> >
>
>
>
> --
> --Noble Paul
>


Good strategy for news in Solr?

2009-02-18 Thread Jon Baer
Ive spent a few months trying different techniques w/ regards to  
searching just news articles w/ players and can't seem to find the  
perfect setup.


Normally I take into consideration date (frequency + recently  
published), title (which boosts on relevancy) and general mm in body  
text (and score)


Sometimes its more of a preference on how to drill into news (own  
being most recently published) vs. historical where it can be more  
context based ...


Is anyone else toying w/ different setups for searching news in general?

- Jon


Re: why don't we have a forum for discussion?

2009-02-18 Thread Jon Baer
I don't think "general" discussion forums really help ... it would be  
great if every major page in the Solr wiki had a discuss link off to  
somewhere though +1 for that ...


Ie:
http://wiki.apache.org/solr/SolrRequestHandler
http://wiki.apache.org/solr/SolrReplication
etc.

For me even panning over discussion history on topics would be helpful.

- Jon

On Feb 18, 2009, at 2:56 PM, Martin Lamothe wrote:


Yep, I second the motion.
This mailing list overloads my poor BB curve.

-M

2009/2/18 Tony Wang 

I am just curious why we don't have a forum for discussion or you  
guys

think
it's really necessary to receive lots of crap information about  
Solr and

nutch in email? I can offer you a forum for discussion anyway.

--
Are you RCholic? www.RCholic.com
温 良 恭 俭 让 仁 义 礼 智 信





--
Martin Lamothe
Business Development and Operations
Wiser Web Solutions Inc.
Direct: (613) 262-5558
Toll-free: 1-800-949-4737
E-mail: m.lamo...@wiserweb.com
http://www.wiserweb.com




Re: Good strategy for news in Solr?

2009-02-19 Thread Jon Baer
Yes, more or less; most of my tries have not been function query based,
just basic dismax handler stuff.  I have a bit of a unique case where
Im dealing w/ last names + multiple players (think Staal ;-) and non-
tagged content (feeds), so it's a bit trickier than plain news articles,
and some tend to overload stories w/ names which don't pertain to
the story context.


I have been trying on the side to do a "3-pass" phase where it's a  
normal index, then doc boost on date (ie yesterday's games) and then  
taking analytic reports and applying another boost.  I may end up w/  
another factor where if a user is signed in, the favorite team is also  
pushed as a document boost as well.  This all happens offline though +  
really wonder if this type of operation can / will ever reach realtime  
to some degree.  Id really like to find some time to develop the whole  
thing as more of a DIH plugin (post op).


Any other cool tips to try? :-)

- Jon
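
One recipe for the date side of this, hedged since it assumes Solr 1.4's ms() function and a hypothetical pubdate field: fold recency into the dismax score with a bf function instead of a hard sort. A SolrJ sketch (field names and boosts are hypothetical):

import org.apache.solr.client.solrj.SolrQuery;

public class NewsQueryExample {
    public static SolrQuery build(String userQuery) {
        SolrQuery q = new SolrQuery(userQuery);
        q.set("defType", "dismax");
        q.set("qf", "title^2.0 body^1.0");
        q.set("pf", "title^3.0");
        // Fresher pubdate -> bigger boost; the constant makes the boost decay over roughly a year.
        q.set("bf", "recip(ms(NOW,pubdate),3.16e-11,1,1)");
        return q;
    }
}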

On Feb 19, 2009, at 8:20 AM, Grant Ingersoll wrote:


Hey Jon,

If I understand right, you want news about a particular player,  
right?  And you need it to be fresh.


Can you share more about what you've done so far?  It sounds like  
you have tried out some function query stuff, but can you share what  
you did there?


-Grant

On Feb 18, 2009, at 1:54 PM, Jon Baer wrote:

Ive spent a few months trying different techniques w/ regards to  
searching just news articles w/ players and can't seem to find the  
perfect setup.


Normally I take into consideration date (frequency + recently  
published), title (which boosts on relevancy) and general mm in  
body text (and score)


Sometimes its more of a preference on how to drill into news (own  
being most recently published) vs. historical where it can be more  
context based ...


Is anyone else toying w/ different setups for searching news in  
general?


- Jon


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search





Re: Realtime Searching..

2009-02-19 Thread Jon Baer

This part:

The part of Zoie that enables real-time searchability is the fact that  
ZoieSystem contains three IndexDataLoader objects:


* a RAMLuceneIndexDataLoader, which is a simple wrapper around a  
RAMDirectory,
* a DiskLuceneIndexDataLoader, which can index directly to the  
FSDirectory (followed by an optimize() call if a specified  
optimizeDuration has been exceeded) in batches via an intermediary
* BatchedIndexDataLoader, whose primary job is to queue up and  
batch DataEvents that need to be flushed to disk


Sounds like it (might) be / (can) be layered into Solr somehow, has  
anyone been using this project or testing it?


- Jon

On Feb 19, 2009, at 9:44 AM, Genta Kaneyama wrote:


Michael,

I think you might be get interested in "zoie".

zoie: real-time search and indexing system built on Apache Lucene
http://code.google.com/p/zoie/

Zoie is realtime search project for lucene by Linkedin.
Basically, I think it is similar technique to a Otis's trick.

In the mean time you can use the trick of one large and less  
frequently updated core and one small and more frequently  
>>updated core + distributed search across them.


Otis


Genta


On Sat, Feb 7, 2009 at 3:02 AM, Michael Austin   
wrote:
I need to find a solution for our current social application. It's  
low
traffic now because we are early on.. However I'm expecting and  
want to be

prepaired to grow.  We have messages of different "types" that are
aggregated into one stream. Each of these message types have much  
different
data so that our main queries have a few unions and many joins.  I  
know that

Solr would work great for searching but we need a realtime system
(twitter-like) to view user updates.  I'm not interested in a few  
minutes
delay; I need something that will be fast updating and searchable  
and have n

columns per record/document. Can solor do this? what is Ocean?

Thanks





Verbose(r) logging in DIH?

2009-03-09 Thread Jon Baer

Hi,

Is there currently anything in DIH to allow for more verbose logging?   
(something more than status) ... was there a way to hook in your own  
for debugging purposes?  I can't seem to locate the options in the  
Wiki or remember if it was available.


Thanks.

- Jon


Re: SolrJ XML indexing

2009-03-11 Thread Jon Baer
Id suggest what someone else mentioned to just do a full clean up of  
the index.  Sounds like you might have kill -9 or stopped the process  
manually while indexing (would be only reason for a left over lock).


- Jon

On Mar 11, 2009, at 5:16 AM, Ashish P wrote:



I added <lockType>single</lockType> in indexDefaults; that made the error
from before go away, but now I am getting the following error:

Mar 11, 2009 6:12:56 PM org.apache.solr.common.SolrException log
SEVERE: java.io.IOException: Cannot overwrite:
C:\dw-solr\solr\data\index\_1o.fdt
	at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:440)
	at org.apache.lucene.index.FieldsWriter.<init>(FieldsWriter.java:62)
	at org.apache.lucene.index.StoredFieldsWriter.initFieldsWriter(StoredFieldsWriter.java:65)


Please help..


Ashish P wrote:


Thanks man.
I just tried what u suggested but I am getting following error when
performing request
Mar 11, 2009 6:00:28 PM org.apache.solr.update.SolrIndexWriter
getDirectory
WARNING: No lockType configured for C:\dw-solr\solr\./data/index/  
assuming

'simple'
Mar 11, 2009 6:00:29 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out:
SimpleFSLock@C:\dw-solr\solr\.\data\index\lucene-1d6c0059ac2f9f2c83acf749af7e0906-write.lock
	at org.apache.lucene.store.Lock.obtain(Lock.java:85)
	at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1140)
	at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:938)
	at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:116)


Any ideas???

-Ashish


Noble Paul നോബിള്‍  नोब्ळ् wrote:


String xml = null;//load the file to the xml string
DirectXmlRequest up = new DirectXmlRequest( "/update", xml );
solrServer.request( up );

On Wed, Mar 11, 2009 at 2:19 PM, Ashish P 
wrote:


I have an XML file with structure :

  ...
  ...
  .
  .


It is present on disk on some location let's say C:\\documents.xml

Q.1. Using solrJ can I index all docs in this file directly?? or  
do I

have
to convert each document to solrInputDocument by parsing XML

Q.2 How to use DirectXmlRequest?? any example

Thanks in advance...
Ashish




--
View this message in context:
http://www.nabble.com/SolrJ-XML-indexing-tp22450845p22450845.html
Sent from the Solr - User mailing list archive at Nabble.com.






--
--Noble Paul







--
View this message in context: 
http://www.nabble.com/SolrJ-XML-indexing-tp22450845p22451235.html
Sent from the Solr - User mailing list archive at Nabble.com.
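
For completeness, a fuller version of the DirectXmlRequest snippet quoted in this thread might look like the following (file path and Solr URL are hypothetical, and the file is assumed to already be in Solr's <add><doc>...</doc></add> update format; newer SolrJ versions use HttpSolrClient instead of CommonsHttpSolrServer):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.DirectXmlRequest;

public class PostXmlFile {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Read the already-formatted update XML as a string.
        String xml = new String(Files.readAllBytes(Paths.get("C:/documents.xml")),
                StandardCharsets.UTF_8);

        // Send it straight to the update handler and commit.
        DirectXmlRequest up = new DirectXmlRequest("/update", xml);
        server.request(up);
        server.commit();
    }
}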





Re: Version 1.4 of Solr

2009-03-11 Thread Jon Baer

Are you using the replication feature by any chance?

- Jon

On Mar 10, 2009, at 2:28 PM, Matthew Runo wrote:

We're currently using 1.4 in production right now, using a recent  
nightly. It's working fine for us.


Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
mr...@zappos.com - 702-943-7833

On Mar 10, 2009, at 10:25 AM, Vauthrin, Laurent wrote:


Hello,



I'm not sure if this is the right forum for this, but I'm wondering  
if I

could get a rough timeline of when version 1.4 of Solr might be out?
I'm trying to figure out whether we will be able to use the new  
built-in

replication as opposed to the current rsync collection distribution.



Thanks,

Laurent







Re: Verbose(r) logging in DIH?

2009-03-11 Thread Jon Baer
+1 for this (as it would be an added bonus to do something based on
the log events) ... so in this case, if you have that transformer, does
it mean it will get events before and after the import?  Correct me if
Im wrong: there are currently (1.4) preImportDeleteQuery and
postImportDeleteQuery hooks for the entire import, just nothing on the
entity level?


- Jon

On Mar 9, 2009, at 2:48 PM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:



it is really not available. probably we can have a LogTransformer
which can Log using slf4j






On Mon, Mar 9, 2009 at 11:55 PM, Jon Baer  wrote:

Hi,

Is there currently anything in DIH to allow for more verbose logging?
 (something more than status) ... was there a way to hook in your  
own for
debugging purposes?  I can't seem to locate the options in the Wiki  
or

remember if it was available.

Thanks.

- Jon





--
--Noble Paul
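
A custom transformer along the lines Noble sketches, assuming slf4j on the classpath (class name and log format are hypothetical), could be as small as:

import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Logs every row DIH pushes through the entity it is attached to.
public class RowLoggingTransformer extends Transformer {

    private static final Logger LOG = LoggerFactory.getLogger(RowLoggingTransformer.class);

    @Override
    public Object transformRow(Map<String, Object> row, Context context) {
        if (LOG.isDebugEnabled()) {
            LOG.debug("entity {} row: {}", context.getEntityAttribute("name"), row);
        }
        return row; // pass the row through unchanged
    }
}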




Re: DIH use of the ?command=full-import entity= command option

2009-03-13 Thread Jon Baer
Bear in mind (and correct me if Im wrong) that a "full-import" is still
a "full-import" no matter what entity you tack onto the param.


Thus I think clean=false should be appended (a friend starting off in  
Solr was really confused by this + could not understand why it did a  
delete on all documents).


Im not sure if that is clearly stated in the Wiki ...

- Jon
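
For example (entity name hypothetical), a targeted re-import that leaves the rest of the index alone would look like:
http://localhost:8983/solr/dataimport?command=full-import&entity=blogs&clean=false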

On Mar 13, 2009, at 1:34 AM, Shalin Shekhar Mangar wrote:

On Fri, Mar 13, 2009 at 10:44 AM, Fergus McMenemie  
wrote:



If my data-config.xml contains multiple root level entities
what is the expected action if I call full-import without an
entity=XXX sub-command?

Does it process all entities one after the other or only the
first? (It would be useful IMHO if it only did the first.)



It processes all entities one after the other. If you want to import  
only

one, use the entity parameter.

--
Regards,
Shalin Shekhar Mangar.




Caching question + "smart" autowarming

2009-03-13 Thread Jon Baer

I have a few general questions re: caching ...

1. The FastLRU cache in 1.4 seems promising but is there a more  
comprehensive list of benefits?  Is there a huge speed boost for using  
this type of cache?


2. What are the possibilities to using external caches for scaling out  
like memcachedb or redis?  Is this just a matter of interfacing  
SolrCache?


3. Does the autowarming of a new searcher use any type of statistics  
on the old one?  How does it figure what to pull from old cache if  
autowarmCount is set low?  IE can I get the most popular or is it  
ordered based on something in the docs themselves?


Thanks.

- Jon


Re: DIH use of the ?command=full-import entity= command option

2009-03-15 Thread Jon Baer
I think it could be as simple as: if you have one or more entities in the
param, then default to clean=false as well (because you are specifically
interested in just targeting that entity import) ...


- Jon

On Mar 15, 2009, at 3:07 AM, Shalin Shekhar Mangar wrote:


On Fri, Mar 13, 2009 at 9:56 PM, Jon Baer  wrote:

Bare in mind (and correct me if Im wrong) but a "full-import" is  
still a

"full-import" no matter what entity you tack onto the param.

Thus I think clean=false should be appended (a friend starting off  
in Solr
was really confused by this + could not understand why it did a  
delete on

all documents).

Im not sure if that is clearly stated in the Wiki ...



Yes it is confusing and even more now that we have  
preImportDeleteQuery.


For a full-import command, the default is clean=true. If clean=false  
is
specified, then no cleanup is done (not even pre/ 
postImportDeleteQuery).
Even if there is a pre/postImportDeleteQuery, if the first root  
entity does
not have a preImportDeleteQuery then all documents are deleted  
(which I
guess is a bug). For a delta-import command, the default is  
clean=false (and

no pre/postImportDeleteQuery is run).

I think we should open an issue to figure out and implement an  
acceptable

behavior before we release 1.4
--
Regards,
Shalin Shekhar Mangar.




Re: More contextual information in analyser

2010-03-08 Thread Jon Baer
Isn't this what Lucene/Solr payloads are theoretically for?

ie: 
http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/

- Jon

On Mar 8, 2010, at 11:15 PM, Lance Norskog wrote:

> This is an interesting idea. There are other projects to make the
> analyzer/filter chain more "porous", or open to outside interaction.
> 
> A big problem is that queries are analyzed, too. If you want to give
> the same metadata to the analyzer when doing a query against the
> field, things get tough. You would need a special query parser to
> implement your own syntax to do that. However, the analyzer chain in
> the query phase does not receive the parsed query, so you have to in
> some way change this.
> 
> On Mon, Mar 8, 2010 at 2:14 AM, dbejean  wrote:
>> 
>> Hello,
>> 
>> If I write a custom analyser that accept a specific attribut in the
>> constructor
>> 
>> public MyCustomAnalyzer(String myAttribute);
>> 
>> Is there a way to dynamically send a value for this attribute from Solr at
>> index time in the XML Message ?
>> 
>> 
>>  
>>.
>> 
>> 
>> Obviously, in Sorl shema.xml, the "content" field is associated to my custom
>> Analyser.
>> 
>> Thank you.
>> 
>> Dominique
>> 
>> --
>> View this message in context: 
>> http://old.nabble.com/More-contextual-information-in-analyser-tp27819298p27819298.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>> 
>> 
> 
> 
> 
> -- 
> Lance Norskog
> goks...@gmail.com



Re: QueryElevationComponent blues

2010-03-08 Thread Jon Baer
Maybe some things to try:

* make sure your uniqueKey is string field type (ie if using int it will not 
work)
* forceElevation to true (if sorting)

- Jon

On Mar 9, 2010, at 12:34 AM, Ryan Grange wrote:

> Using Solr 1.4.
> Was using the standard query handler, but needed the boost by field 
> functionality of qf from dismax.
> So we altered the query to boost certain phrases against a given field.
> We were using QueryElevationComponent ("elevator" from solrconfig.xml) for 
> one particular entry we wanted at the top, but because we aren't using a pure 
> q value, elevator never finds a match to boost.  We didn't realize it at the 
> time because the record we were elevating eventually became the top response 
> anyway.
> Recently added a _val_:formula to the q value to juice records based on a 
> value in the record.
> Now we have need to push a few other records to the top, but we've lost the 
> ability to use elevate.xml to do it.
> 
> Tried switching to dismax using qf, pf, qs, ps, and bf with a "pure" q value, 
> and debug showed queryBoost with a match and records, but they weren't moved 
> to the top of the result set.
> 
> What would really help is if there was something for elevator akin to 
> spellcheck.q like elevation.q so I could pass in the actual user phrase while 
> still performing all the other field score boosts in the q parameter. 
> Alternatively, if anyone can explain why I'm running into problems getting 
> QueryElevationComponent to move the results in a dismax query, I'd be very 
> thankful.
> 
> -- 
> Ryan T. Grange
> 



Re: SolrJ - how separte different results from the same facet query?

2010-03-15 Thread Jon Baer
I am interested in this as well ... Im also having the issue of understanding
if a result has been elevated by the QueryElevation component.  It sounds like
SolrJ would need to know about some type of metadata contained within the docs
but I haven't seen SolrJ dealing w/ payloads specifically yet.

I also can't tell if these would require some feature request on those 
components or if it's something that is too custom that it would require 
writing new components.  

It sounds like retrieving a document should answer questions like ...

"did this document come from a facet query?"
"was this document elevated?"

Etc.  Maybe something the Debug component can handle if it can write payloads 
back to the results, etc.

- Jon

On Mar 15, 2010, at 7:56 AM, Saïd Radhouani wrote:

> I'm faceting with two different query ranges while using addFacetQuery. I
> wonder whether it's possible using SolrJ to extract the result of each query
> range separately. Here is an example:
> 
> addFacetQuery("price:[* TO 150]"); addFacetQuery("price:[151 TO 300]"); etc.
> addFacetQuery("length:[* TO 5]");addFacetQuery("length:[5 TO 10]"); etc.
> 
> When I use getFacetQuery, SolrJ gives me the responses of both query ranges
> (prices and lengths) mixed in the same list. I wonder whether it's possible
> to tell SolrJ to extract the response of a specific query range, i.e., tell
> it to extract the price-based response in a list and the length-based
> response in another list. It would be helpful to have something like
> getFacetQuery(field=price), getFacetQuery(field=length), etc.
> 
> Any ideas?
> 
> Thanks.
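
One client-side workaround, sketched under the assumption that the facet queries follow the field:range pattern shown above: getFacetQuery() comes back as a Map keyed by the original query string, so the keys can be regrouped by their field prefix:

import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetQuerySplitter {

    // Groups facet.query counts by the field name before the first ':'.
    public static Map<String, Map<String, Integer>> splitByField(QueryResponse rsp) {
        Map<String, Map<String, Integer>> byField = new HashMap<String, Map<String, Integer>>();
        for (Map.Entry<String, Integer> e : rsp.getFacetQuery().entrySet()) {
            String key = e.getKey();                      // e.g. "price:[* TO 150]"
            String field = key.substring(0, key.indexOf(':'));
            Map<String, Integer> ranges = byField.get(field);
            if (ranges == null) {
                ranges = new LinkedHashMap<String, Integer>();
                byField.put(field, ranges);
            }
            ranges.put(key, e.getValue());
        }
        return byField;
    }
}

splitByField(response).get("price") would then hold only the price ranges.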



Re: Generating a sitemap

2010-03-18 Thread Jon Baer
It's also possible to try and use the Velocity contrib response writer and 
paging it w/ the sitemap elements.

BTW generating a sitemap was a big reason of a switch we did from GSA to Solr 
because (for some reason) the map took way too long to generate (even simple 
requests).

If you page through w/ Solr (ie rows=100&wt=velocity&v.template=sitemap) its 
fairly painless to build on cron.

- Jon

On Mar 18, 2010, at 6:25 PM, Chris Hostetter wrote:

> 
> : Been testing nutch to crawl for solr and I was wondering if anyone had
> : already worked on a system for getting the urls out of solr and generating
> : an XML sitemap for Google.
> 
> it's pretty easy to just paginate through all docs in solr, so you could 
> do that -- but I'd be really suprised if Nutch wasn't also loggign all the 
> URLs it indexed, so you could just post-process that log to build the 
> sitemap as well.
> 
> 
> 
> -Hoss
> 
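
A sketch of the paging approach in SolrJ, assuming a hypothetical stored url field (the Velocity route above produces the same output, with start/rows doing the paging):

import java.io.PrintWriter;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;

public class SitemapBuilder {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        PrintWriter out = new PrintWriter("sitemap.xml", "UTF-8");
        out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        out.println("<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">");

        int rows = 100;
        for (int start = 0; ; start += rows) {
            SolrQuery q = new SolrQuery("*:*").setStart(start).setRows(rows).setFields("url");
            SolrDocumentList docs = server.query(q).getResults();
            for (SolrDocument doc : docs) {
                // URLs containing & would need XML-escaping before being written out.
                out.println("  <url><loc>" + doc.getFieldValue("url") + "</loc></url>");
            }
            if (start + rows >= docs.getNumFound()) break;  // walked past the last page
        }

        out.println("</urlset>");
        out.close();
    }
}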



Re: Generating a sitemap

2010-03-19 Thread Jon Baer
It's unfortunately a pretty domain-specific thing (urls, content,
etc); there are also limits @ certain points (see ... but we took CNN.com as a
model, for example:

http://www.cnn.com/video_sitemap_index.xml
http://www.cnn.com/sitemap_videos_0001.xml

Then you just line up the big 3 w/ the static URLs, etc.

http://en.wikipedia.org/wiki/Sitemaps (the submission URLs are there)
http://www.bing.com/toolbox/posts/archive/2009/10/09/submit-a-sitemap-to-bing.aspx

In general though it's great to create custom handlers and use Velocity 
templates for pretty much anything + its great for prototyping.

- Jon

On Mar 19, 2010, at 8:55 AM, Erik Hatcher wrote:

> Jon -
> 
> Very cool use of VelocityResponseWriter!
> 
> Would you happen to have a sitemap.vm template to contribute?   I realize 
> there'd need to be an external URL configurable, but this would be trivially 
> added as a request parameter and leveraged in the template.
> 
>   Erik
> 
> p.s. Anyone else using VelocityResponseWriter out there?   Sitemaps is a 
> great use of it.  And also I've got a report of a big company in Brazil using 
> it for e-mail generation of search results.   I'm in the process of baking 
> VrW into the main Solr example (it's there on trunk, basically) and more 
> examples are better.
> 
> On Mar 18, 2010, at 7:40 PM, Jon Baer wrote:
> 
>> It's also possible to try and use the Velocity contrib response writer and 
>> paging it w/ the sitemap elements.
>> 
>> BTW generating a sitemap was a big reason of a switch we did from GSA to 
>> Solr because (for some reason) the map took way too long to generate (even 
>> simple requests).
>> 
>> If you page through w/ Solr (ie rows=100&wt=velocity&v.template=sitemap) its 
>> fairly painless to build on cron.
>> 
>> - Jon
>> 
>> On Mar 18, 2010, at 6:25 PM, Chris Hostetter wrote:
>> 
>>> 
>>> : Been testing nutch to crawl for solr and I was wondering if anyone had
>>> : already worked on a system for getting the urls out of solr and generating
>>> : an XML sitemap for Google.
>>> 
>>> it's pretty easy to just paginate through all docs in solr, so you could
>>> do that -- but I'd be really suprised if Nutch wasn't also loggign all the
>>> URLs it indexed, so you could just post-process that log to build the
>>> sitemap as well.
>>> 
>>> 
>>> 
>>> -Hoss
>>> 
>> 
> 



Re: wikipedia and teaching kids search engines

2010-03-25 Thread Jon Baer
Just throwing this out there ... I recently saw something I found pretty 
interesting from CMU ...

http://csunplugged.org/activities

The search algorithm exercise was focused on a Battleship lookup I think.  

- Jon 

On Mar 24, 2010, at 10:40 AM, Erik Hatcher wrote:

> I've got a couple of questions for the community...
> 
>  * what's the simplest way to get Solr up and running with a relatively 
> richly schema'd index of a Wikipedia dump?
> 
> What I'm looking for is something as easy as something along these lines:
> 
>  java -Dsolr.solr.home=./wikipedia_solr_home -jar start.jar
> 
>  cat wikipedia.bz2 | wikipedia_solr_indexer
> 
> My goal is to index wikipedia in order to demonstrate search to a class of 
> middle school kids that I've volunteered to teach for a couple of hours.  
> Which brings me to my next question...
> 
> * anyone have ideas on some basic hands-on ways of teaching search engine 
> fundamentals?
> 
> One idea I have is to bring some actual "documents", say a poster board with 
> a sentence written largely on it, have the students physically *tokenize* the 
> document by cutting it up and lexicographically building the term dictionary. 
>  Thoughts on taking it further welcome!
> 
> Thanks all.
> 
>   Erik
> 



Getting /handlers from response and dynamically removing them

2010-03-29 Thread Jon Baer
This is just something that seems to come up now and then ...

* - I'd like to write a last-component which does something specific for a 
particular declared handler (/handler1 for example), but there is no way to 
determine which handler the request came from @ the moment (or is there?)
* - It would be nice if there were some way to dynamically update 
(enable/disable) handlers on the fly, specifically update handlers. I'd imagine 
something working like the way logging is currently laid out in the admin.

Any thoughts on these 2?

- Jon

Re: Getting /handlers from response and dynamically removing them

2010-03-29 Thread Jon Baer
Thanks for the qt tip, I will try that.
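
Presumably something along these lines (rough, untested sketch; the class name is made up):

import java.io.IOException;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;

public class HandlerAwareComponent extends SearchComponent {

  @Override
  public void prepare(ResponseBuilder rb) throws IOException {
    // nothing to do up front
  }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    // qt is only present when the client actually passed it; a handler hit
    // directly by path won't necessarily set it
    String qt = rb.req.getParams().get(CommonParams.QT);
    if ("/handler1".equals(qt)) {
      // handler-specific logic goes here
    }
  }

  public String getDescription() { return "handler-aware last component (sketch)"; }
  public String getSource() { return "$URL$"; }
  public String getSourceId() { return "$Id$"; }
  public String getVersion() { return "1.0"; }
}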

I'm building a Solr installation as a small standalone and I'd like to disable 
everything but /select after an import has been completed.  In normal 
situations just the master would be set up to index and the slaves would be 
read-only, but in this case I need to allow imports on a standalone w/ a small 
index and allow updates only when the handler is enabled.

Also, is it correct that it's not currently possible to reload a handler w/o a restart?

- Jon

On Mar 29, 2010, at 3:22 PM, Erik Hatcher wrote:

> You can get the qt parameter, at least, in your search component.
> 
> What's the use case for controlling handlers enabled flag on the fly?
> 
>   Erik
> 
> 
> On Mar 29, 2010, at 3:02 PM, Jon Baer wrote:
> 
>> This is just something that seems to come up now and then ...
>> 
>> * - Id like to write a last-component which does something specific for a 
>> particular declared handler /handler1 for example and there is no way to 
>> determine which handler it came from @ the moment (or can it?)
>> * - It would be nice if there was someway to dynamically update 
>> (enable/disable) handlers on the fly, specifically update handlers, Id 
>> imagine something working like the way logging currently is laid out in the 
>> admin.
>> 
>> Any thoughts on these 2?
>> 
>> - Jon
> 



Listeners (Enable / Disable)

2010-04-06 Thread Jon Baer
Before digging through src ...

Docs say ... "Every component can have an extra attribute enable which can be 
set as true/false."

It doesn't seem that listeners are part of the PluginInfo scheme though ... for 
example, is this possible?

<listener event="postCommit" enable="false" class="solr.RunExecutableListener" ... />


Re: Is there any other tool other than DIH to index a database

2010-04-07 Thread Jon Baer
There is the LuSQL tool which Ive used a few times.  

http://lab.cisti-icist.nrc-cnrc.gc.ca/cistilabswiki/index.php/LuSql
http://www.slideshare.net/eby/lusql-quickly-and-easily-getting-your-data-from-your-dbms-into-lucene

- Jon

On Apr 7, 2010, at 11:26 PM, bbarani wrote:

> 
> Hi,
> 
> I am currently using DIH to index the data from a database. I am just trying
> to figure out if there are any other open source tools which I can use just
> for indexing purpose and use SOLR for querying.
> 
> I also thought of writing a custom code for retrieving the data from
> database and use SOLRJ to add the data as documents in to lucene. One doubt
> here is that if I use the custom code for retrieving the data and use SOLRJ
> to commit that data, will the schema file be still used? I mean the field
> types / analyzers / tokenizers etc.. present in schema file? or do I need to
> manipulate each data (to fit to corresponding data type) in my SOLRJ
> program?
> 
> Please let me know your thoughts.
> 
> Thanks,
> Barani
> -- 
> View this message in context: 
> http://n3.nabble.com/Is-there-any-other-tool-other-than-DIH-to-index-a-database-tp705002p705002.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Need help with StackOverflowError

2010-04-07 Thread Jon Baer
You should maybe scan your db for bad data ...

This bit ...
at sun.nio.cs.UTF_8$Decoder.decodeLoop(UTF_8.java:324)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:561)

That is probably happening on a specific record somewhere.  In the query, limit the id 
range and try to narrow down which one is throwing the decoder for a spin.
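
If it helps, a rough sketch of that kind of scan straight over JDBC (untested; the JDBC URL, table and column names are placeholders):

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BadUtf8Finder {
  public static void main(String[] args) throws Exception {
    // a fresh decoder REPORTs malformed input instead of silently replacing it
    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
    // connection details / table / columns are placeholders -- swap in your own
    Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/yourdb", "user", "pass");
    Statement st = conn.createStatement();
    ResultSet rs = st.executeQuery("select id, content from items order by id");
    while (rs.next()) {
      byte[] bytes = rs.getBytes("content");
      if (bytes == null) continue;
      try {
        decoder.decode(ByteBuffer.wrap(bytes));
      } catch (CharacterCodingException e) {
        System.out.println("bad row id=" + rs.getString("id"));
      }
    }
    rs.close();
    st.close();
    conn.close();
  }
}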

- Jon

On Apr 7, 2010, at 11:39 PM, Blargy wrote:

> 
> If it helps at all to mention, I manually updated the last_index_time in
> conf/dataimport.properties so I could select a smaller subset and the
> delta-import worked which leads me to believe there is nothing wrong with my
> DIH delta queries themselves. There must be something wrong with my dataset
> that ends up in this circular recursion?
> 
> Any thoughts?
> -- 
> View this message in context: 
> http://n3.nabble.com/Need-help-with-StackOverflowError-tp704451p705022.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: port application using solr to Android device

2010-04-11 Thread Jon Baer
How large is the index?  There is probably a lot of work in getting Solr and its 
dependencies over (for example Lucene / RMI, from what I have read) ...

Interestingly enough there is a Jetty container for it ...

http://code.google.com/p/i-jetty/

I think Solr itself would be OK to port to Dalvik; it's just the Lucene part I'm not 
sure about ...

- Jon

On Apr 11, 2010, at 11:15 AM, Samk wrote:

> 
> 
> I have an application using solr that runs on an enterprise (computer). 
> I want to port the same application to an Android handheld device. I tried
> searching for a light weight Solr server for Android but was unsuccessful. 
> 
> (Please note that I'm not interested about porting a client but the full
> application. Assume my android device has no wifi access)
> 
> Any ideas on how to port the application to Android device? 
> 
> -- 
> View this message in context: 
> http://n3.nabble.com/port-application-using-solr-to-Android-device-tp711716p711716.html
> Sent from the Solr - User mailing list archive at Nabble.com.



admin-extra file in multicore

2010-04-16 Thread Jon Baer
Hi,

It looks like Im trying to do the same thing in this open JIRA here ...

https://issues.apache.org/jira/browse/SOLR-975

I noticed in index.jsp it has a reference to:

<%
 // a quick hack to get rid of get-file.jsp -- note this still spits out 
invalid HTML
 out.write( 
org.apache.solr.handler.admin.ShowFileRequestHandler.getFileContents( 
"admin-extra.html" ) );
%>

Instead of resolving with the core.getName() path ...

I was trying to avoid building a custom solr.war for this project.  Is there 
another quick hack to include content in the admin backend, or is patching the only 
way?

Thanks.

- Jon




Re: Retrieve time of last optimize

2010-04-22 Thread Jon Baer
I don't think there is anything low level in Lucene that will specifically 
output anything like lastOptimized() to you, since it can be set up a few ways.  

Your best bet is probably adding a postOptimize hook and dumping it to a log / 
file / monitor / etc, probably something like ...

<listener event="postOptimize" class="solr.RunExecutableListener">
  <str name="exe">lastOptimize.sh</str>
  <str name="dir">solr/bin</str>
  <bool name="wait">true</bool>
</listener>

Or writing to a file and reading it back into the admin if you need to display 
it there.

More @ http://wiki.apache.org/solr/SolrConfigXml#Update_Handler_Section

- Jon

On Apr 22, 2010, at 11:16 AM, Shawn Heisey wrote:

> On 4/21/2010 1:24 PM, Shawn Heisey wrote:
>> Is it possible to issue some kind of query to a Solr core that will return 
>> the last time the index was optimized?  Every day, one of my shards should 
>> get optimized, so I would like my monitoring system to tell me when the 
>> newest optimize date is more than 24 hours ago.  I could not find a way to 
>> get this.  The /admin/cores page has a lot of other useful information, but 
>> not that particular piece.
> 
> I have found some other useful information on the stats.jsp page, like the 
> number of segments, the size of the index on disk, and so on.  Still have not 
> been able to locate the last optimize date, which would simply be the 
> timestamp on the earliest disk segment.
> 
> Thanks,
> Shawn
> 



SolrJ + BasicAuth

2010-04-23 Thread Jon Baer
Uggg I just got bit hard by this on a Tomcat project ... 

https://issues.apache.org/jira/browse/SOLR-1238

Is there any way to get access to that RequestEntity w/o patching?  Also, are 
there security implications w/ using the repeatable payloads?
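
One workaround that might dodge it is making the auth preemptive, so HttpClient never has to replay the (non-repeatable) request entity after a 401; a rough, untested sketch:

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.UsernamePasswordCredentials;
import org.apache.commons.httpclient.auth.AuthScope;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class AuthedSolrClient {
  public static CommonsHttpSolrServer create(String url, String user, String pass) throws Exception {
    HttpClient client = new HttpClient();
    client.getState().setCredentials(AuthScope.ANY, new UsernamePasswordCredentials(user, pass));
    // send the Authorization header up front so there is no 401-then-retry cycle
    client.getParams().setAuthenticationPreemptive(true);
    return new CommonsHttpSolrServer(url, client);
  }
}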

Thanks.

- Jon

Re: multiple cores on SOLR under Tomcat

2010-04-27 Thread Jon Baer
I would not use this layout; you are putting important Solr config files 
out onto the docroot (presuming we are looking @ the webapps folder) ... 
here is my current Tomcat project layout (if it helps):

[507][jonbaer.MBP: tomcat]$ pwd
/Users/jonbaer/WORKAREA/SVN_HOME/my-project/tomcat

[508][jonbaer.MBP: tomcat]$ ls
bin conf  lib   logs  solr  tempwebapps work

[509][jonbaer.MBP: tomcat]$ ls -l solr
total 8
drwxr-xr-x   5 jonbaer  staff  170 Apr 15 11:40 core0
drwxr-xr-x  12 jonbaer  staff  408 Apr 18 11:57 lib
drwxr-xr-x   5 jonbaer  staff  170 Apr 26 10:46 logs
drwxr-xr-x   5 jonbaer  staff  170 Apr 15 11:40 core1
-rw-r--r--   1 jonbaer  staff  217 Apr 15 11:40 solr.xml

[510][jonbaer.MBP: tomcat]$ ls -l webapps
total 0
drwxr-xr-x   7 jonbaer  staff  238 Apr 16 23:17 solr

[511][jonbaer.MBP: tomcat]$ cat solr/solr.xml 
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0" />
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>

[512][jonbaer.MBP: tomcat]$ cat conf/server.xml 
...





...

- Jon

On Apr 27, 2010, at 8:26 AM, Dimitrios Sferopoulos wrote:

> Hi all,
> 
> I have been trying to set up multiple cores on SOLR that runs under Apache 
> Tomcat but haven't had much luck. I followed the instruction on the wiki but 
> that didn't help much.
> 
> This is what I get when I browse in: 
> http://devel.edina.ac.uk:20232/solr/admin/cores
> 
> My SOLR directory structure is:
> 
> solr
> admin
> home
>bin
>conf
>data
>solr.xml
>multicore
>   core0
>   data
>   conf
>   core1
>   data
>   conf
> META-INF
> WEB-INF
> and my solr.xml is:
> 
> 
> 
> 
>   
>   
> 
> 
> 
> 
> 
> Could someone let me know what the correct directory structure is as well as 
> the solr.xml? Are there any environmental variables that I need to set for 
> multiplecores to work?
> 
> Thanks
> Dimitrios
> 
> 
> 
> 
> 
> -- 
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
> 



Re: Replicate cores from master to slave

2010-04-28 Thread Jon Baer
Correct me if I'm wrong, but I think the problem here is that while there is a 
"fetchindex" command in replication, the handler and the master/slave setup 
pertain to the per-core config.

For example, for this to work properly the solr.xml configuration would need to 
set up some type of "global" replication handler for all cores.

Just throwing that out there since I'm not sure even the ZooKeeper setup would 
include something like this.

- Jon 

On Apr 28, 2010, at 10:14 AM, Jason Rutherglen wrote:

> I guess I didn't explain it properly. I want to create a core on
> the master, and then have N slaves also (aka replicate) create
> those new core(s) on the slave servers, then of course, begin to
> replicate (yeah, got that part). There doesn't appear to be
> anything today that does this, it's unclear how/if Solr Cloud
> enables the described functionality because, while yes Solr
> Cloud can be used to *create* new cores out of a config in ZK,
> that isn't what I'm asking. I'll open an issue and submit a
> patch?
> 
> On Wed, Apr 28, 2010 at 1:53 AM, Yonik Seeley
>  wrote:
>> On Tue, Apr 27, 2010 at 9:56 PM, Chris Hostetter
>>  wrote:
>>> but as i understand the new cloud stuff (by which i mean: i don't
>>> understand the new cloud stuff, but i've heard rumors) this will be
>>> possible with that functionality.
>> 
>> Yeah, that should be the goal.
>> The solrconfig.xml, etc, are stored in zookeeper, so creating a new
>> core is (should be) trivial.
>> There's not anything in cloud yet to automatically set up the
>> master/slave replication relationship though.
>> 
>> -Yonik
>> Apache Lucene Eurocon 2010
>> 18-21 May 2010 | Prague
>> 



Solr Cloud & Gossip Protocols

2010-04-28 Thread Jon Baer
Just a general theory question ...

From what I understand Cassandra uses a generic (custom) gossip protocol for node 
discovery; will Solr-Cloud have something similar? 

I was looking through both projects and it seems like this "protocol" type could 
be ripped from the org.apache.cassandra.gms package ... what I can't figure out is 
why they did not end up w/ something like JGroups. 

Beyond something akin to Bonjour and Apache River, is there something preferable 
for this stuff?  (To be clear, this is about zero configuration of slave nodes.)

- Jon 

Re: Problem with DIH delta-import on JDBC

2010-04-28 Thread Jon Baer
You should end up w/ a file like "conf/dataimport.properties" @ full-import 
time; it might be that it did not get written out?

- Jon

On Apr 28, 2010, at 3:05 PM, safl wrote:

> 
> Hello,
> 
> I'm just new on the list.
> I searched a lot on the list, but I didn't find an answer to my question.
> 
> I'm using Solr 1.4 on Windows with an Oracle 10g database.
> I am able to do full-import without any problem, but I'm not able to get
> delta-import working.
> 
> I have the following in the data-config.xml:
> 
> ...
> query="select * from table"
>deltaImportQuery="select * from table where
> objectid='${dataimporter.delta.id}'"
>deltaQuery="select objectid from table where lastupdate >
> '${dataimporter.last_index_time}'">
> 
> ...
> 
> I update some records in the table and the try to run a delta-import.
> I track the SQL queries on DB with P6Spy, and I always see a query like
> 
> select * from table where objectid=''
> 
> Of course, with such an SQL query, nothing is updated in my index.
> 
> It behaves the same if I replace ${dataimporter.delta.id} by
> ${dataimporter.delta.objectid}.
> Can someone tell what is wrong with it?
> 
> Thanks a lot,
> Florian
> -- 
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Problem-with-DIH-delta-import-on-JDBC-tp763469p763469.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to make documents low priority

2010-04-29 Thread Jon Baer
Does a "sort=field5+desc" on the query param not work?

- Jon

On Apr 29, 2010, at 9:32 AM, Doddamani, Prakash wrote:

> Hi,
> 
> 
> 
> I am using the boost factor as below
> 
> 
> 
> <str name="qf">field1^20.0 field2^5 field3^2.5 field4^.5</str>
> 
> 
> 
> 
> 
> Where it searches first in field1 then field2 and so on
> 
> 
> 
> Is there a way, where I can make some documents very low priority so
> that they come at the end?
> 
> 
> 
> Scenario :
> 
> 
> 
> 
> 
> aaa
> 
> bbb
> 
> 
> 
> 
> 
> 1
> 
> 
> 
> 2010-04-29T12:40:05.589Z
> 
> 
> 
> 
> 
> I want all the documents which have field5=1 come last and documents
> which have field5=0 should come first while searching.
> 
> Any advise is greatly appreciated.
> 
> 
> 
> Thanks
> 
> Prakash
> 



Re: Problem with DIH delta-import on JDBC

2010-04-29 Thread Jon Baer
All that stuff happens in the JDBC driver associated w/ the DataSource, so 
probably not, unless there is something which can be set in the Oracle driver 
itself.

One thing that might have helped in this case is if readFieldNames() in the 
JdbcDataSource dumped its return value to the debug log for you.  
That might be something that can be JIRA(ed).

- Jon

On Apr 29, 2010, at 9:45 AM, safl wrote:

> 
> Hi,
> 
> I did a debugger session and found that the column names are case sensitive
> (at least with Oracle).
> The column names are retrieved from the JDBC metadata and I found that my
> objectid is in fact OBJECTID.
> 
> So now, I'm able to do an update with the following config (pay attention to
> the OBJECTID):
> 
> query="select * from table"
>deltaImportQuery="select * from table where
> objectid='${dataimporter.delta.OBJECTID}'"
>deltaQuery="select objectid from table where lastupdate >
> '${dataimporter.last_index_time}'">
> 
> 
> 
> Is there a way to be "case insensitive" ?
> 
> Anyway, it works now and that's the most important thing!
> :-)
> 
> Thanks to all,
> Florian
> 
> 
> 
> cbennett wrote:
>> 
>> Hi,
>> 
>> It looks like the deltaImportQuery needs to be changed you are using
>> dataimporter.delta.id which is not correct, you are selecting objected in
>> the deltaQuery, so the deltaImportQuery should be using
>> dataimporter.delta.objectid
>> 
>> So try this:
>> 
>> >query="select * from table"
>>deltaImportQuery="select * from table where
>> objectid='${dataimporter.delta.objectid}'"
>>deltaQuery="select objectid from table where lastupdate >
>> '${dataimporter.last_index_time}'">
>> 
>> 
>> Colin.
>> 
> -- 
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Problem-with-DIH-delta-import-on-JDBC-tp763469p765262.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr Cloud & Gossip Protocols

2010-04-29 Thread Jon Baer
Thanks, Im looking @ the atomic broadcast messaging protocol of Zookeeper and 
think I have found what I was looking for ...

- Jon

On Apr 28, 2010, at 11:27 PM, Yonik Seeley wrote:

> On Wed, Apr 28, 2010 at 2:23 PM, Jon Baer  wrote:
>> From what I understand Cassandra uses a generic gossip protocol for node 
>> discovery (custom), will the Solr-Cloud have something similar?
> 
> SolrCloud uses zookeeper, so node discovery is a simple matter of
> looking there.  Nodes are responsible for registering themselves in
> zookeeper.
> 
> -Yonik
> Apache Lucene Eurocon 2010
> 18-21 May 2010 | Prague



Re: Solr configuration to enable indexing/searching webapp log files

2010-04-29 Thread Jon Baer
Good question, +1 on finding answer, my take ...

Depending on how large the log files you are talking about are, you might be better 
off doing this w/ HDFS / Hadoop (and a scripting language like Pig) or Amazon EMR.

http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873

Theoretically you could split the logs into fields, use a dataimporter, and search 
/ sort w/ something like LineEntityProcessor.

http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor
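
Once the lines are split into fields, queries like "most requested page" are basically just facets; a rough SolrJ sketch (untested, field names like "url" are assumptions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TopPages {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0);                 // only interested in the facet counts
    q.setFacet(true);
    q.addFacetField("url");       // assumed field parsed out of each log line
    q.setFacetLimit(10);
    QueryResponse rsp = solr.query(q);
    for (FacetField.Count c : rsp.getFacetField("url").getValues()) {
      System.out.println(c.getName() + " -> " + c.getCount());
    }
  }
}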

I've tried to use Solr as a log analytics tool (before DataImportHandler) and 
it was neither practical nor worth the disk space, but I'd love to hear otherwise.  
In general you could flush daily logs to an index, but if you had to work w/ the 
data in another context it seems a better fit for HDFS (I think).

- Jon

On Apr 29, 2010, at 1:46 PM, Stefan Maric wrote:

> 
> I thought i remembered seeing some information about this, but have been
> unable to find it
> 
> Does anyone know if there is a configuration / module that would allow us to
> setup Solr to take in the (large) log files generated by our web/app
> servers, so that we can query for things like peak time requests or most
> frequently requested web page etc
> 
> Thanks
> Stefan Maric
> 


