Re: Color search for images

2010-09-16 Thread Li Li
Do you mean content-based image retrieval, or just searching images by tag?
If the former, you can try LIRE.

2010/9/15 Shawn Heisey :
>  My index consists of metadata for a collection of 45 million objects, most
> of which are digital images.  The executives have fallen in love with
> Google's color image search.  Here's a search for "flower" with a red color
> filter:
>
> http://www.google.com/images?q=flower&tbs=isch:1,ic:specific,isc:red
>
> I am interested in duplicating this.  Can this group of fine people point me
> in the right direction?  I don't want anyone to do it for me, just help me
> find software and/or algorithms that can extract the color information, then
> find a way to get Solr to index and search it.
>
> Thanks,
> Shawn
>
>


Re: Color search for images

2010-09-16 Thread Lance Norskog
Yes, notice the flowers are all a medium-dark crimson red. There are a 
bunch of these image-indexing & search technologies, but there is no (to 
my knowledge) "finished technology"- it's very much an area of research. 
If you want to search the word 'flower' and index data that can find 
blobs of red, that might be easy with public tools. But there are many 
hard problems.
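
To make the "blobs of red" idea concrete, here is a crude sketch of the
public-tools route (plain Java 2D; the thresholds and the idea of emitting a
color keyword are my own assumptions, not a finished recipe): bucket the
pixels by dominant channel and, if enough of them are clearly red, tag the
document with a color keyword that Solr can index as an ordinary string field.

    import java.awt.image.BufferedImage;
    import java.io.File;
    import java.io.IOException;
    import javax.imageio.ImageIO;

    public class RedDetector {

        /** True if at least minFraction of the pixels are clearly red-dominant. */
        public static boolean looksRed(File imageFile, double minFraction)
                throws IOException {
            BufferedImage img = ImageIO.read(imageFile);
            if (img == null) return false;          // unsupported image format
            long redPixels = 0;
            long total = (long) img.getWidth() * img.getHeight();
            for (int y = 0; y < img.getHeight(); y++) {
                for (int x = 0; x < img.getWidth(); x++) {
                    int rgb = img.getRGB(x, y);
                    int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
                    // "red-dominant": red clearly above the other channels
                    // (the 100/50 thresholds are arbitrary)
                    if (r > 100 && r > g + 50 && r > b + 50) {
                        redPixels++;
                    }
                }
            }
            return total > 0 && (double) redPixels / total >= minFraction;
        }
    }

The hard problems start exactly where this naive bucketing stops: lighting,
perception, and the difference between "red" and "medium-dark crimson".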


Lance

Stephen Weiss wrote:

There's a project out there called LIRE (I heard about it on this list) that's 
supposed to create a Lucene-based CBIR index for images.  I wonder if this 
could be integrated with Solr?  Personally I don't really care about the flower 
part; I'm more worried about searching for whether the flower is red... we have 
good object keywording but not good color keywording - and color is so much 
more subjective too; red can mean a lot of things.  I'm already working on 
testing it separately, but it sure would be more useful if the scoring could be 
integrated with the rest of the search index.

--
Steve

On Sep 15, 2010, at 11:56 PM, Shashi Kant wrote:

   

I'm sure there's some post doctoral types who could get a graphic shape 
analyzer, color analyzer, to at least say it's a flower.

However, even Google would have to build new datacenters to have the horsepower 
to do that kind of graphic processing.

   

Not necessarily true. Like.com - which incidentally got acquired by
Google recently - built a true visual search technology and applied it
on a large scale.
 
   


Re: Null Pointer Exception while indexing

2010-09-16 Thread Lance Norskog
Which version of Solr? 1.4? 1.4.1? The 3.x branch? Trunk? If the 3.x branch or 
trunk, when did you pull it?


andrewdps wrote:

What could be possible error for

14-Sep-10 4:28:47 PM org.apache.solr.common.SolrException log
SEVERE: java.util.concurrent.ExecutionException:
java.lang.NullPointerException
at java.util.concurrent.FutureTask$Sync.innerGet(libgcj.so.90)
at java.util.concurrent.FutureTask.get(libgcj.so.90)
at
org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:439)
at
org.apache.solr.update.DirectUpdateHandler2$CommitTracker.run(DirectUpdateHandler2.java:602)
at java.util.concurrent.Executors$RunnableAdapter.call(libgcj.so.90)
at java.util.concurrent.FutureTask$Sync.innerRun(libgcj.so.90)
at java.util.concurrent.FutureTask.run(libgcj.so.90)
at
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$2(libgcj.so.90)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(libgcj.so.90)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(libgcj.so.90)
at java.lang.Thread.run(libgcj.so.90)
Caused by: java.lang.NullPointerException
at
org.apache.solr.search.FastLRUCache.getStatistics(FastLRUCache.java:252)
at org.apache.solr.search.FastLRUCache.toString(FastLRUCache.java:280)
at java.lang.StringBuilder.append(libgcj.so.90)
at
org.apache.solr.search.SolrIndexSearcher.close(SolrIndexSearcher.java:223)
at org.apache.solr.core.SolrCore$6.close(SolrCore.java:1246)
at org.apache.solr.util.RefCounted.decref(RefCounted.java:57)
at org.apache.solr.core.SolrCore$5.call(SolrCore.java:1192)
at java.util.concurrent.FutureTask$Sync.innerRun(libgcj.so.90)
at java.util.concurrent.FutureTask.run(libgcj.so.90)
...3 more

I get this error when I try to index the MARC records on the server (after
indexing a few records I get the above error, then indexing starts again; I
get the same error after every few hundred records). It worked fine on the
local system.

Thanks
   


bug in dataimport context ?

2010-09-16 Thread Marc Emery
Hi,

It seems the dataimport context session setter is hardcoded to the
"entitySession":

  private void putVal(String name, Object val, Map map) {
if(val == null) map.remove(name);
else entitySession.put(name, val);
  }

shouldn't it rather be like this:

  private void putVal(String name, Object val, Map map) {
if(val == null) map.remove(name);
else map.put(name, val);
  }

Cheers
marc


RE: Simple Filter Query (fq) Use Case Question

2010-09-16 Thread Chantal Ackermann
Hi Andre,

changing the entity in your index from donor to gift of course changes
the scope of your search results. I found it helpful to re-think such a
change from that "other" side (the result side).
If the users of your search application are, in the end, looking for
individual gifts, then changing the index to gift is for the better.

If they are searching for donors, then I would rethink the change but
not discard it completely: you can still get the list of distinct donors
by faceting on the donor. You can show the users that list of donors
(the facets), and they can choose from it and get all information on that
donor (restricted to the original query, of course). The information
would include the actual search result: the list of gifts that passed
the query.
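
For example (just a sketch, reusing Andre's field names and assuming a
hypothetical donorId field in the gift-based index):

  q=name:Jones
  fq=giftDate:[NOW/MONTH-1 TO NOW/MONTH]
  fq=giftAmount:[0 TO 100]
  facet=true
  facet.field=donorId
  facet.mincount=1

The facet counts give you the distinct donors (and how many matching gifts
each one has), while the document list itself stays at the gift level.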

Cheers,
Chantal

On Wed, 2010-09-15 at 21:49 +0200, Andre Bickford wrote:
> Thanks for the response Erick.
> 
> I did actually try exactly what you suggested. I flipped the index over so 
> that a gift is the document. This solution certainly solves the previous 
> problem, but introduces a new issue where the search results show duplicate 
> donors. If a donor gave 12 times in a year, and we offer full years as facet 
> ranges, my understanding is that you'd see that donor 12 times in the search 
> results, once for each gift document. Obviously I could do some client side 
> filtering to list only distinct donors, but I was hoping to avoid that.
> 
> If I've simply stumbled into the basic tradeoffs of denormalization, I can 
> live with client side de-duplication, but if you have any further suggestions 
> I'm all eyes.
> 
> As for sizing, we have some huge charities as clients. However, right now I'm 
> testing on a copy of prod data from a smaller client with ~350,000 donors and 
> ~8,000,000 gift records. So, when I "flipped" the index around as you 
> suggested, it went from 350,000 documents to 8,000,000 documents. No issues 
> with performance at all.
> 
> Thanks again,
> Andre
> 
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com] 
> Sent: Wednesday, September 15, 2010 3:09 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Simple Filter Query (fq) Use Case Question
> 
> One strategy is to denormalize all the way. That is, each
> Solr "document" is Gift Amount and Gift Date would not be multiValued.
> You'd create a different "document" for each gift, so you'd have multiple
> documents with the same Id, Name, and Address. Be careful, though,
> if you've defined Id as a UniqueKey, you'd only have one record/donor. You
> can handle this easily enough by making a composite key of Id+Gift Date
> (assuming no donor made more than one gift on exactly the same date).
> 
> I know this goes completely against all the reflexes you've built up with
> working with DBs, but...
> 
> Can you give us a clue how many donations we're talking about here?
> You'd have to be working with a really big nonprofit to get enough documents
> to have to start worrying about making your index smaller.
> 
> HTH
> Erick
> 
> On Wed, Sep 15, 2010 at 1:41 PM, Andre Bickford wrote:
> 
> > I'm working on creating a solr index search for a charitable organization.
> > The solr index stores documents of donors. Each donor document has the
> > following four fields:
> >
> > Id
> > Name
> > Address
> > Gift Amount (multiValued)
> > Gift Date (multiValued)
> >
> > In our relational database, there is a one-to-many relationship between the
> > DONOR table and the GIFT table. One donor can of course give many gifts over
> > time. Consequently, I created the Gift Amount and Gift Date fields to be
> > mutiValued.
> >
> > Now, consider the following query filtered for gifts last month between $0
> > and $100:
> >
> > q=name:Jones
> > fq=giftDate:[NOW/MONTH-1 TO NOW/MONTH]
> > fq=giftAmount:[0 TO 100]
> >
> > The results show me donors who donated ANY amount in the past month and
> > donors who had EVER in the past given a gift between $0 and $100. I was
> > hoping to only see donors who had given a gift between $0 and $100 in the
> > past month exclusively. I believe the problem is that I neglected to
> > consider that for two multiValued fields, while the values might align
> > "index wise", there is really no other association between the two fields,
> > so the filter query intersection isn't really behaving as I expected.
> >
> > I think this is a fundamental question of one-to-many denormalization, but
> > obviously I'm not yet experienced enough with Lucene/Solr to find a
> > solution. As to why not just keep using a relational database, it's because
> > I'm trying to provide a faceting solution to "drill down" to donors. The
> > aforementioned fq parameters would come from faceting. Oh, that and Oracle
> > Text indexes are a PITA. :-)
> >
> > Thanks for any help you can provide.
> >
> > André Bickford
> > Software Engineering Team Leader
> > SofTrek Corporation
> > 30 Bryant Woods North  Amherst, NY 14228
> > 716.691.2800 x154  800.442.9211  Fax: 71

Re: Boosting specific field value

2010-09-16 Thread Chantal Ackermann
Hi Ravi,

with dismax, use the parameter "q.alt" (instead of "q"), which expects
standard Lucene syntax. If "q.alt" is present in the query, "q" is not
required. Add the parameter "qt=dismax".
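
A minimal sketch using Ravi's own fields (the boost value is arbitrary):

  qt=dismax
  q.alt=primarysection:(Politics* OR Nation*)
  bq=source:("BBC" OR "Associated Press")^10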

Chantal

On Thu, 2010-09-16 at 06:22 +0200, Ravi Kiran wrote:
> Hello Mr.Rochkind,
>I am using StandardRequestHandler so I presume I
> cannot use bq param right ?? Is there a way we can mix dismax and
> standardhandler i.e use lucene syntax for query and use dismax style for bq
> using localparams/nested queries? I remember seeing your post related to
> localparams and nested queries and got thoroughly confused
> 
> On Wed, Sep 15, 2010 at 10:28 PM, Jonathan Rochkind wrote:
> 
> > Maybe you are looking for the 'bq' (boost query) parameter in dismax?
> >
> > http://wiki.apache.org/solr/DisMaxQParserPlugin#bq_.28Boost_Query.29
> > 
> > From: Ravi Kiran [ravi.bhas...@gmail.com]
> > Sent: Wednesday, September 15, 2010 10:02 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Boosting specific field value
> >
> > Erick,
> > I afraid you misinterpreted my issueif I query like you said
> > i.e q=source(bbc OR "associated press")^10  I will ONLY get documents with
> > source BBC or Associated Press...what I am asking is - if my query query
> > does not deal with source at all but uses some other field...since the
> > field
> > "source" will be in the result , is there a way to still boost such a
> > document
> >
> > To re-iterate, If my query is as follows
> >
> > q=primarysection:(Politics* OR Nation*)&fq=contenttype:("Blog" OR "Photo
> > Gallery") pubdatetime:[NOW-3MONTHS TO NOW]
> >
> > and say the resulting docs have "source" field, is there any way I can
> > boost
> > the resulting doc/docs that have either BBC/Associated Press as the value
> > in
> > source field to be on top
> >
> > Can a filter query (fq) have a boost ? if yes, then probably I could
> > rewrite
> > the query as follows in a round about way
> >
> > q=primarysection:(Politics* OR Nation*)&fq=contenttype:("Blog" OR "Photo
> > Gallery) pubdatetime:[NOW-3MONTHS TO NOW] (source:("BBC" OR "Associated
> > Press")^10 OR -source:("BBC" OR "Associated Press")^5)
> >
> > Theoretically, I have to write source in the fq 2 times as I need docs that
> > have source values too just that they will have a lower boost
> >
> > Thanks,
> >
> > Ravi Kiran Bhaskar
> >
> > On Wed, Sep 15, 2010 at 1:34 PM, Erick Erickson  > >wrote:
> >
> > > This seems like a simple query-time boost, although I may not be
> > > understanding
> > > your problem well. That is, q=source(bbc OR "associated press")^10
> > >
> > > As for boosting more recent documents, see:
> > >
> > >
> > http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
> > >
> > > HTH
> > > Erick
> > >
> > >
> > > On Wed, Sep 15, 2010 at 12:44 PM, Ravi Kiran 
> > > wrote:
> > >
> > > > Hello,
> > > >I am currently querying solr for a "*primarysection*" which will
> > > > return documents like - *q=primarysection:(Politics* OR
> > > > Nation*)&fq=contenttype:("Blog" OR "Photo Gallery)
> > > pubdatetime:[NOW-3MONTHS
> > > > TO NOW]"*. Each document has several fields of which I am most
> > interested
> > > > in
> > > > single valued field called "*source*" ...I want to boost documents
> > which
> > > > contain "*source*" value say "Associated Press" OR "BBC" and also by
> > > newer
> > > > documents. The returned documents may have several other source values
> > > > other
> > > > than "BBC" or "Associated Press". since I specifically don't query on
> > > these
> > > > source values I am not sure how I can boost them, Iam using *
> > > > StandardRequestHandler*
> > > >
> > >
> >





Re: Handling Aggregate Records/Roll-up in Solr

2010-09-16 Thread Markus Jelsma
You should "just flatten the representation of the shirt in the data model."
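
That is, one document per colour variant, all sharing a common style
identifier. A rough sketch in Solr's XML update format (field names are just
placeholders):

  <doc>
    <field name="id">polo-123-red</field>
    <field name="styleId">polo-123</field>
    <field name="color">red</field>
  </doc>
  <doc>
    <field name="id">polo-123-blue</field>
    <field name="styleId">polo-123</field>
    <field name="color">blue</field>
  </doc>

Faceting on styleId (or the field collapsing work mentioned elsewhere in this
thread) then lets you show one representative record per shirt until the user
drills in.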


On Wednesday 15 September 2010 22:23:17 Thomas Martin wrote:
> Can someone point me to the mechanism in Sol that might allow me to
> roll-up or aggregate records for display.  We have many items that are
> similar and only want to show a representative record to the user until
> they select that record.
> 
> 
> 
> As an example - We carry a polo shirt and have 15 records that represent
> the individual colors for that shirt.  Does the query API provide anyway
> to rollup the records passed on a property or do we need to just flatten
> the representation of the shirt in the data model.
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350



Solr for statistical data

2010-09-16 Thread Kjetil Ødegaard
Hi all,


we're currently using Solr 1.4.0 in a project for statistical data, where we
group and sum a number of "double" values. Probably not what most people use
Solr for, but it seems to be working fine for us :-)


We do have some challenges, especially with memory use, so I thought I'd
check here if anybody has done something similar.


Some details:


- The index is currently around 30 GB and growing. The data is indexed
directly from a database, each row ends up as a document. I think we have
around 100 million documents now, the largest core is about 40 million. The
data is split in different cores for different statistics data.


- Heap size is currently 4 GB. We're currently running all the cores in a
single JVM on WebSphere (WAS) 6.1. We have a couple of GB left for OS disk
cache. Initially we used a 1 GB heap, so we had to split cores in different
shards in order to avoid OutOfMemoryErrors because of the FieldCache (I
think).


- The grouping is done by a custom Solr component which takes parameters
that specify which fields to group by (like in SQL) and sums up values for
the group. This uses the FieldCache for speedy retrieval. We did a PoC on
using Documents instead, but this seemed to go a lot slower. I've done a
memory dump and the combined FieldCache looks to be about 3 GB (taken with a
grain of salt since I'm not sure all the data was cached).


I guess this is different from normal Solr searches since we have to process
all the documents in a core in order to calculate results, we can't just
return the first 10 (or whatever) documents.
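
In essence the component does something like the following (a very rough
sketch, not the actual code; the field names and the plain Lucene 2.9
FieldCache calls here are just assumptions):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;

    public class GroupSumSketch {

        /** Sum valueField per distinct groupField value, straight off the FieldCache. */
        public static Map<String, Double> sumByGroup(IndexReader reader,
                String groupField, String valueField) throws IOException {
            // Both arrays are indexed by Lucene doc id and held in the FieldCache,
            // which is where the multi-GB heap usage comes from.
            String[] groups = FieldCache.DEFAULT.getStrings(reader, groupField);
            double[] values = FieldCache.DEFAULT.getDoubles(reader, valueField);

            Map<String, Double> sums = new HashMap<String, Double>();
            for (int docId = 0; docId < reader.maxDoc(); docId++) {
                if (reader.isDeleted(docId)) continue;   // skip deleted docs
                String group = groups[docId];
                if (group == null) continue;             // no value for the group field
                Double current = sums.get(group);
                sums.put(group, current == null ? values[docId]
                                                : current + values[docId]);
            }
            return sums;
        }
    }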


Any tips or similar experiences?



---Kjetil


Re: Color search for images

2010-09-16 Thread Shawn Heisey

 On 9/15/2010 10:50 AM, Shashi Kant wrote:

Shawn, I have done some research into this, machine-vision especially
on a large scale is a hard problem, not to be entered into lightly. I
would recommend starting with OpenCV - a comprehensive toolkit for
extracting various features such as Color, Edge etc from images. Also
there is a project LIRE http://www.semanticmetadata.net/lire/ which
attempts to do something along what you are thinking of. Not sure how
well it works.



Lire looks promising, but how hard is it to integrate the content-based 
search into Solr as opposed to Lucene?  I myself am not a Java 
developer.  I have access to people who are, but their time is scarce.


I use DIH to populate my index, so I would have to do analysis outside 
of Solr to populate the database.  From there, I would come up with a 
new schema and DIH config to re-import either the entire index or just 
documents that have been recently updated.  I have a build system to 
handle these things on all my shards.


OpenCV looks intimidating, but potentially very useful and for most 
things would probably not require custom code in Solr.  To mention the 
most obvious capability I can find, I think many of our customers would 
love to be able to check a box to include or exclude photos with faces 
in them.


I can tell it's getting late ... I imagined a scenario similar to the 
Kohler commercial where a woman pulls out a faucet at an architect's 
office ... "Design a website around #00ebc9."


Thanks,
Shawn



Re: Solr for statistical data

2010-09-16 Thread Peter Karich
Hi Kjetil,

is this custom component (which performs the group-by and calculates the
stats) available somewhere?
I would like to do something similar. Would you mind sharing it if it
isn't already available?

The grouping stuff sounds similar to
https://issues.apache.org/jira/browse/SOLR-236

where you can have mem problems too ;-) or see:
https://issues.apache.org/jira/browse/SOLR-1682

> Any tips or similar experiences?

Do you want to decrease memory usage?

Regards,
Peter.


> Hi all,
>
>
> we're currently using Solr 1.4.0 in a project for statistical data, where we
> group and sum a number of "double" values. Probably not what most people use
> Solr for, but it seems to be working fine for us :-)
>
>
> We do have some challenges, especially with memory use, so I thought I'd
> check here if anybody has done something similar.
>
>
> Some details:
>
>
> - The index is currently around 30 GB and growing. The data is indexed
> directly from a database, each row ends up as a document. I think we have
> around 100 million documents now, the largest core is about 40 million. The
> data is split in different cores for different statistics data.
>
>
> - Heap size is currently 4 GB. We're currently running all the cores in a
> single JVM on WebSphere (WAS) 6.1. We have a couple of GB left for OS disk
> cache. Initially we used a 1 GB heap, so we had to split cores in different
> shards in order to avoid OutOfMemoryErrors because of the FieldCache (I
> think).
>
>
> - The grouping is done by a custom Solr component which takes parameters
> that specify which fields to group by (like in SQL) and sums up values for
> the group. This uses the FieldCache for speedy retrieval. We did a PoC on
> using Documents instead, but this seemed to go a lot slower. I've done a
> memory dump and the combined FieldCache looks to be about 3 GB (taken with a
> grain of salt since I'm not sure all the data was cached).
>
>
> I guess this is different from normal Solr searches since we have to process
> all the documents in a core in order to calculate results, we can't just
> return the first 10 (or whatever) documents.
>
>
> Any tips or similar experiences?
>
>
>
> ---Kjetil



Re: Null Pointer Exception while indexing

2010-09-16 Thread Israel Ekpo
Try removing the data directory and then restart your Servlet container and
see if that helps.

On Thu, Sep 16, 2010 at 3:28 AM, Lance Norskog  wrote:

> Which version of Solr? 1.4?, 1.4.1? 3.x branch? trunk? if the 3.x or the
> trunk, when did you pull it?
>
>
> andrewdps wrote:
>
>> What could be possible error for
>>
>> 14-Sep-10 4:28:47 PM org.apache.solr.common.SolrException log
>> SEVERE: java.util.concurrent.ExecutionException:
>> java.lang.NullPointerException
>>at java.util.concurrent.FutureTask$Sync.innerGet(libgcj.so.90)
>>at java.util.concurrent.FutureTask.get(libgcj.so.90)
>>at
>>
>> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:439)
>>at
>>
>> org.apache.solr.update.DirectUpdateHandler2$CommitTracker.run(DirectUpdateHandler2.java:602)
>>at java.util.concurrent.Executors$RunnableAdapter.call(libgcj.so.90)
>>at java.util.concurrent.FutureTask$Sync.innerRun(libgcj.so.90)
>>at java.util.concurrent.FutureTask.run(libgcj.so.90)
>>at
>>
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$2(libgcj.so.90)
>>at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(libgcj.so.90)
>>at java.util.concurrent.ThreadPoolExecutor$Worker.run(libgcj.so.90)
>>at java.lang.Thread.run(libgcj.so.90)
>> Caused by: java.lang.NullPointerException
>>at
>> org.apache.solr.search.FastLRUCache.getStatistics(FastLRUCache.java:252)
>>at org.apache.solr.search.FastLRUCache.toString(FastLRUCache.java:280)
>>at java.lang.StringBuilder.append(libgcj.so.90)
>>at
>> org.apache.solr.search.SolrIndexSearcher.close(SolrIndexSearcher.java:223)
>>at org.apache.solr.core.SolrCore$6.close(SolrCore.java:1246)
>>at org.apache.solr.util.RefCounted.decref(RefCounted.java:57)
>>at org.apache.solr.core.SolrCore$5.call(SolrCore.java:1192)
>>at java.util.concurrent.FutureTask$Sync.innerRun(libgcj.so.90)
>>at java.util.concurrent.FutureTask.run(libgcj.so.90)
>>...3 more
>>
>> I get this error when I try to index the MARC records on the server (after
>> indexing a few records I get the above error, then indexing starts again; I
>> get the same error after every few hundred records). It worked fine on the
>> local system.
>>
>> Thanks
>>
>>
>


-- 
°O°
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/


Full text search in facet scope

2010-09-16 Thread Bogdan Gusiev
I need to build a faceted search.
Each facet consists of keywords that should also be applied to the search
query, in addition to the main query string.
For instance: I am searching for "operating system" and I need two facets,
"linux" and "windows". Each should append its keyword to the query string to
get its count.
In the documentation I only see value, range and date match conditions
supported, but not keyword match:
http://wiki.apache.org/solr/SimpleFacetParameters


Does Solr support such queries?


-- 
Bogdan Gusiev.
agre...@gmail.com


Re: Null Pointer Exception while indexing

2010-09-16 Thread Yonik Seeley
On Wed, Sep 15, 2010 at 2:01 PM, andrewdps  wrote:
> I still get the same error when I try to index the mrc file...

If you get the exact same error, then you are still using GCJ.
When you type "java" it's probably going to GCJ because of your path
(i.e. change it or directly specify the path to the new JVM you just
installed).
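
For example (the path is only an illustration - use wherever the new JVM
actually lives):

  export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.20
  export PATH=$JAVA_HOME/bin:$PATH
  java -version   # should no longer report GCJ/libgcj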

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8


Re: Full text search in facet scope

2010-09-16 Thread Peter Karich
Hi,

if you index your doc with text='operating system' and an additional
keyword field='linux' (of type string, can be multiValued), then Solr
faceting should be what you want:

solr/select?q=*:*&facet=true&facet.field=keyword&rows=10 (or rows=0,
depending on your needs)
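
The keyword field itself would be declared along these lines in schema.xml
(a sketch):

  <field name="keyword" type="string" indexed="true" stored="true"
         multiValued="true"/>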

Does this help?

Regards,
Peter.

> I need to build a faceted search.
> Each facet consists of keywords that should also be applied to search query
> in addition to mail query string
> For instance: I am searching for "operating system"  and I need two facets
> "linux" and "windows". Each should append it's keyword to query string to
> get count.
> I saw theirs is only value, range and date match condition supported but not
> keyword match in documentation
> http://wiki.apache.org/solr/SimpleFacetParameters
>
>
> Does Solr support such queries?
>
>
>   



RE: Boosting specific field value

2010-09-16 Thread Jonathan Rochkind
Nice, I didn't know about q.alt.  Or, alternately, yes, you could use a nested 
query, good call.   Which, yes, I agree is kind of confusing at first. 

&qt=dismax # use dismax for the overall query
&bq=whatever   # so we can use bq, since we're using dismax
&q=_query_:"{!lucene} solr-lucene syntax query" # but now make our entire 'q' a 
nested query, which is set to use lucene query parser. 

What gets really confusing there is that the nested query expression needs to 
be in quotes -- so if you need quotes within the actual query itself (say for a 
phrase), you need to escape them. And the whole thing needs to be URI-encoded, 
of course.  It does get confusing, but is quite powerful. 
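
For instance, a made-up nested phrase query (hypothetical "title" field) like

  q=_query_:"{!lucene} title:\"photo gallery\""

ends up on the wire as

  q=_query_%3A%22%7B%21lucene%7D+title%3A%5C%22photo+gallery%5C%22%22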

Jonathan

From: Chantal Ackermann [chantal.ackerm...@btelligent.de]
Sent: Thursday, September 16, 2010 4:10 AM
To: solr-user@lucene.apache.org
Subject: Re: Boosting specific field value

Hi Ravi,

with dismax, use the parameter "q.alt" which expects standard lucene
syntax (instead of "q"). If "q.alt" is present in the query, "q" is not
required. Add the parameter "qt=dismax".

Chantal

On Thu, 2010-09-16 at 06:22 +0200, Ravi Kiran wrote:
> Hello Mr.Rochkind,
>I am using StandardRequestHandler so I presume I
> cannot use bq param right ?? Is there a way we can mix dismax and
> standardhandler i.e use lucene syntax for query and use dismax style for bq
> using localparams/nested queries? I remember seeing your post related to
> localparams and nested queries and got thoroughly confused
>
> On Wed, Sep 15, 2010 at 10:28 PM, Jonathan Rochkind wrote:
>
> > Maybe you are looking for the 'bq' (boost query) parameter in dismax?
> >
> > http://wiki.apache.org/solr/DisMaxQParserPlugin#bq_.28Boost_Query.29
> > 
> > From: Ravi Kiran [ravi.bhas...@gmail.com]
> > Sent: Wednesday, September 15, 2010 10:02 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Boosting specific field value
> >
> > Erick,
> > I afraid you misinterpreted my issueif I query like you said
> > i.e q=source(bbc OR "associated press")^10  I will ONLY get documents with
> > source BBC or Associated Press...what I am asking is - if my query query
> > does not deal with source at all but uses some other field...since the
> > field
> > "source" will be in the result , is there a way to still boost such a
> > document
> >
> > To re-iterate, If my query is as follows
> >
> > q=primarysection:(Politics* OR Nation*)&fq=contenttype:("Blog" OR "Photo
> > Gallery") pubdatetime:[NOW-3MONTHS TO NOW]
> >
> > and say the resulting docs have "source" field, is there any way I can
> > boost
> > the resulting doc/docs that have either BBC/Associated Press as the value
> > in
> > source field to be on top
> >
> > Can a filter query (fq) have a boost ? if yes, then probably I could
> > rewrite
> > the query as follows in a round about way
> >
> > q=primarysection:(Politics* OR Nation*)&fq=contenttype:("Blog" OR "Photo
> > Gallery) pubdatetime:[NOW-3MONTHS TO NOW] (source:("BBC" OR "Associated
> > Press")^10 OR -source:("BBC" OR "Associated Press")^5)
> >
> > Theoretically, I have to write source in the fq 2 times as I need docs that
> > have source values too just that they will have a lower boost
> >
> > Thanks,
> >
> > Ravi Kiran Bhaskar
> >
> > On Wed, Sep 15, 2010 at 1:34 PM, Erick Erickson  > >wrote:
> >
> > > This seems like a simple query-time boost, although I may not be
> > > understanding
> > > your problem well. That is, q=source(bbc OR "associated press")^10
> > >
> > > As for boosting more recent documents, see:
> > >
> > >
> > http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
> > >
> > > HTH
> > > Erick
> > >
> > >
> > > On Wed, Sep 15, 2010 at 12:44 PM, Ravi Kiran 
> > > wrote:
> > >
> > > > Hello,
> > > >I am currently querying solr for a "*primarysection*" which will
> > > > return documents like - *q=primarysection:(Politics* OR
> > > > Nation*)&fq=contenttype:("Blog" OR "Photo Gallery)
> > > pubdatetime:[NOW-3MONTHS
> > > > TO NOW]"*. Each document has several fields of which I am most
> > interested
> > > > in
> > > > single valued field called "*source*" ...I want to boost documents
> > which
> > > > contain "*source*" value say "Associated Press" OR "BBC" and also by
> > > newer
> > > > documents. The returned documents may have several other source values
> > > > other
> > > > than "BBC" or "Associated Press". since I specifically don't query on
> > > these
> > > > source values I am not sure how I can boost them, Iam using *
> > > > StandardRequestHandler*
> > > >
> > >
> >





Re: Color search for images

2010-09-16 Thread Shashi Kant
On Thu, Sep 16, 2010 at 3:21 AM, Lance Norskog  wrote:
> Yes, notice the flowers are all a medium-dark crimson red. There are a bunch
> of these image-indexing & search technologies, but there is no (to my
> knowledge) "finished technology"- it's very much an area of research. If you
> want to search the word 'flower' and index data that can find blobs of red,
> that might be easy with public tools. But there are many hard problems.
>

Lance, is there *ever* a "finished technology"? >-)


Re: Color search for images

2010-09-16 Thread Shashi Kant
> Lire looks promising, but how hard is it to integrate the content-based
> search into Solr as opposed to Lucene?  I myself am not a Java developer.  I
> have access to people who are, but their time is scarce.
>


LIRE is a nascent effort and, based on a cursory overview a while back,
IMHO an over-simplified version of what a CBIR engine should be.
They use CEDD (color & edge descriptors).
It wouldn't work for the kind of applications I am working on, which
need, among other things, Color, Shape, Orientation, Pose, Edge/Corner,
etc.

OpenCV has a steep learning curve, but having been through it, I can say
it is a very powerful toolkit - the best there is by far! BTW the code is
in C++, but it has both Java & .NET bindings.
This is a fabulous book to get hold of:
http://www.amazon.com/Learning-OpenCV-Computer-Vision-Library/dp/0596516134,
if you are seriously into OpenCV.

Please feel free to reach out if you need any help with OpenCV +
Solr/Lucene. I have spent quite a bit of time on this.


Index update issue

2010-09-16 Thread maggie chen

Dear All,

I use Solr in Rails. When I add a new item, the index update takes a long
time (one hour).
For example, if the index count is now "97" and I add a new item, it only
becomes "98" after one hour.
I checked all of the Solr config files, but I couldn't find the setting that
controls this.
I commented out

 1
 1000


in solrconfig.xml 

But...

Thanks




-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Index-update-issue-tp1487956p1487956.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: using variables/properties in dataconfig.xml

2010-09-16 Thread Ephraim Ofir
No, it's not possible.  See workaround: 
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201008.mbox/%3c9f8b39cb3b7c6d4594293ea29ccf438b01702...@icq-mail.icq.il.office.aol.com%3e
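
(One common workaround - not necessarily the exact one in that thread, and the
parameter name here is made up - is to pass the value as a request parameter
on the import call, e.g.

  /dataimport?command=full-import&myBaseUrl=http://example.com

and reference it inside data-config.xml as ${dataimporter.request.myBaseUrl}.)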

Ephraim Ofir | Reporting and Host Development team | ICQ 
P: +972 3 7665510 | M: + 972 52 4888510 | F: +972 3 7665566 | ICQ#: 18981 | E: 
ephra...@icq.com 
 


-Original Message-
From: Jason Chaffee [mailto:jchaf...@ebates.com] 
Sent: Wednesday, September 15, 2010 9:58 PM
To: solr-user@lucene.apache.org
Subject: using variables/properties in dataconfig.xml

Is it possible to use the same type of property configuration in
dataconfig.xml as is possible in solrconfig.xml?

 

I tried it and it didn't seem to work.  For example,

 

<dataDir>${solr.data.dir:/opt/search/store/solr/data}</dataDir>

 

And in the dataconfig.xml, I would like to do this to configure the
baseUrl:

 

  

 

Thanks,

 

Jason



Re: Null Pointer Exception while indexing

2010-09-16 Thread andrewdps

Thanks for all the suggestions.

As far as Java is concerned, I'm worried because I'm seeing different things,
and I'm afraid something is wrong with the settings.

r...@zoombox:/etc# echo $JAVA_HOME
/usr/lib/jvm/default-java
r...@zoombox:/etc# java -version
java version "1.6.0_0"
OpenJDK  Runtime Environment (build 1.6.0_0-b11)
OpenJDK 64-Bit Server VM (build 1.6.0_0-b11, mixed mode)
r...@zoombox:/etc# locate javac |grep bin
/usr/bin/javac
/usr/lib/jvm/java-1.5.0-gcj-4.3-1.5.0.0/bin/javac
/usr/lib/jvm/java-6-sun-1.6.0.20/bin/javac

I've many sub-directories in JVM directory

default-java
java-6-sun-1.6.0.20
java-6-sun
java-6-openjdk
java-1.5.0-gcj-4.3-1.5.0.0
java-gcj

and the /etc/profile looks like

/etc# more profile
# /etc/profile: system-wide .profile file for the Bourne shell (sh(1))
# and Bourne compatible shells (bash(1), ksh(1), ash(1), ...).

if [ -d /etc/profile.d ]; then
  for i in /etc/profile.d/*.sh; do
if [ -r $i ]; then
  . $i
fi
  done
  unset i
fi

if [ "$PS1" ]; then
  if [ "$BASH" ]; then
PS1='\...@\h:\w\$ '
if [ -f /etc/bash.bashrc ]; then
. /etc/bash.bashrc
fi
  else
if [ "`id -u`" -eq 0 ]; then
  PS1='# '
else
  PS1='$ '
fi
  fi
fi

umask 022
export JAVA_HOME="/usr/lib/jvm/java-6-sun-1.6.0.20"
export PATH="$JAVA_HOME/bin:$PATH"

Please let me know what went wrong.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Null-Pointer-Exception-while-indexing-tp1481154p1488279.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Null Pointer Exception while indexing

2010-09-16 Thread andrewdps

Also, the Solr Java properties look like this (still using GCJ), despite
setting JAVA_HOME in /etc/profile:

jetty.logs = /usr/local/vufind/solr/jetty/logs
path.separator = :
java.vm.name = GNU libgcj
java.vm.specification.name = Java(tm) Virtual Machine Specification
java.runtime.version = 1.5.0
java.home = /usr/lib/jvm/java-1.5.0-gcj-4.3-1.5.0.0/jre
java.vm.specification.version = 1.0
line.separator = 


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Null-Pointer-Exception-while-indexing-tp1481154p1488292.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Null Pointer Exception while indexing

2010-09-16 Thread andrewdps

Lance,

We are on Solr Specification Version: 1.4.1
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Null-Pointer-Exception-while-indexing-tp1481154p1488320.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Null Pointer Exception while indexing

2010-09-16 Thread Thomas Joiner
My guess would be that Jetty has some configuration somewhere that is
telling it to use GCJ.  Is it possible to completely remove GCJ from the
system?  Another possibility would be to uninstall Jetty, and then reinstall
it, and hope that on the reinstall it would pick up on the OpenJDK.

What distro of Linux are you using?  How to set the JVM probably depends on
that.

On Thu, Sep 16, 2010 at 10:22 AM, andrewdps  wrote:

>
> Lance,
>
> We are on Solr Specification Version: 1.4.1
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Null-Pointer-Exception-while-indexing-tp1481154p1488320.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Boosting specific field value

2010-09-16 Thread Ravi Kiran
Awesome, I also did not know about q.alt accepting Lucene syntax... Thanks to
both of you, Mr. Ackermann and Mr. Rochkind; I learnt more in just this
thread than I have in the last 6 months of reading and dealing with
Solr.

As you folks pointed out, q.alt and nested queries are both great options; I
shall pursue them... asking a question on a forum is fun when you have
knowledgeable people, isn't it? That's true reuse of resources in software
terms :-) - reuse of knowledge in developer space!

Ravi Kiran Bhaskar
Principal Software Engineer
The Washington Post

On Thu, Sep 16, 2010 at 9:25 AM, Jonathan Rochkind  wrote:

> Nice, I didn't know about q.alt.  Or, alternately, yes, you could use a
> nested query, good call.   Which, yes, I agree is kind of confusing at
> first.
>
> &qt=dismax # use dismax for the overall query
> &bq=whatever   # so we can use bq, since we're using dismax
> &q=_query_:"{!lucene} solr-lucene syntax query" # but now make our entire
> 'q' a nested query, which is set to use lucene query parser.
>
> What gets really confusing there is that the nested query expression needs
> to be in quotes -- so if you need quotes within the actual query itself (say
> for a phrase), you need to escape them. And the whole thing needs to be
> URI-encoded, of course.  It does get confusing, but is quite powerful.
>
> Jonathan
> 
> From: Chantal Ackermann [chantal.ackerm...@btelligent.de]
> Sent: Thursday, September 16, 2010 4:10 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Boosting specific field value
>
> Hi Ravi,
>
> with dismax, use the parameter "q.alt" which expects standard lucene
> syntax (instead of "q"). If "q.alt" is present in the query, "q" is not
> required. Add the parameter "qt=dismax".
>
> Chantal
>
> On Thu, 2010-09-16 at 06:22 +0200, Ravi Kiran wrote:
> > Hello Mr.Rochkind,
> >I am using StandardRequestHandler so I presume
> I
> > cannot use bq param right ?? Is there a way we can mix dismax and
> > standardhandler i.e use lucene syntax for query and use dismax style for
> bq
> > using localparams/nested queries? I remember seeing your post related to
> > localparams and nested queries and got thoroughly confused
> >
> > On Wed, Sep 15, 2010 at 10:28 PM, Jonathan Rochkind  >wrote:
> >
> > > Maybe you are looking for the 'bq' (boost query) parameter in dismax?
> > >
> > > http://wiki.apache.org/solr/DisMaxQParserPlugin#bq_.28Boost_Query.29
> > > 
> > > From: Ravi Kiran [ravi.bhas...@gmail.com]
> > > Sent: Wednesday, September 15, 2010 10:02 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Boosting specific field value
> > >
> > > Erick,
> > > I afraid you misinterpreted my issueif I query like you
> said
> > > i.e q=source(bbc OR "associated press")^10  I will ONLY get documents
> with
> > > source BBC or Associated Press...what I am asking is - if my query
> query
> > > does not deal with source at all but uses some other field...since the
> > > field
> > > "source" will be in the result , is there a way to still boost such a
> > > document
> > >
> > > To re-iterate, If my query is as follows
> > >
> > > q=primarysection:(Politics* OR Nation*)&fq=contenttype:("Blog" OR
> "Photo
> > > Gallery") pubdatetime:[NOW-3MONTHS TO NOW]
> > >
> > > and say the resulting docs have "source" field, is there any way I can
> > > boost
> > > the resulting doc/docs that have either BBC/Associated Press as the
> value
> > > in
> > > source field to be on top
> > >
> > > Can a filter query (fq) have a boost ? if yes, then probably I could
> > > rewrite
> > > the query as follows in a round about way
> > >
> > > q=primarysection:(Politics* OR Nation*)&fq=contenttype:("Blog" OR
> "Photo
> > > Gallery) pubdatetime:[NOW-3MONTHS TO NOW] (source:("BBC" OR "Associated
> > > Press")^10 OR -source:("BBC" OR "Associated Press")^5)
> > >
> > > Theoretically, I have to write source in the fq 2 times as I need docs
> that
> > > have source values too just that they will have a lower boost
> > >
> > > Thanks,
> > >
> > > Ravi Kiran Bhaskar
> > >
> > > On Wed, Sep 15, 2010 at 1:34 PM, Erick Erickson <
> erickerick...@gmail.com
> > > >wrote:
> > >
> > > > This seems like a simple query-time boost, although I may not be
> > > > understanding
> > > > your problem well. That is, q=source(bbc OR "associated press")^10
> > > >
> > > > As for boosting more recent documents, see:
> > > >
> > > >
> > >
> http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
> > > >
> > > > HTH
> > > > Erick
> > > >
> > > >
> > > > On Wed, Sep 15, 2010 at 12:44 PM, Ravi Kiran  >
> > > > wrote:
> > > >
> > > > > Hello,
> > > > >I am currently querying solr for a "*primarysection*" which
> will
> > > > > return documents like - *q=primarysection:(Politics* OR
> > > > > Nation*)&fq=contenttype:("Blog" OR "Photo Gallery)
> > > > pubdatetime:[NOW-3MONTHS
> > > 

Re: Index update issue

2010-09-16 Thread Erik Hatcher
Be sure to issue a commit after updates (either with a separate <commit/>
request, or append ?commit=true to your update requests).
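
For example (host, port and file name are just placeholders):

  curl 'http://localhost:8983/solr/update?commit=true' \
       -H 'Content-Type: text/xml' --data-binary @your_docs.xml

or, once the adds are done, post a bare commit:

  curl 'http://localhost:8983/solr/update' \
       -H 'Content-Type: text/xml' --data-binary '<commit/>'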


Out of curiosity are you using any Ruby library to speak to Solr?  Or  
hand rolling some Net::HTTP stuff?


Erik

On Sep 16, 2010, at 9:29 AM, maggie chen wrote:



Dear All,

I use Solr in Rails. I added a new item, the index number update  
took a long

time (one hour).
For example, now the index is "97" and add a new item, the index  
will become

"98" in one hour.
I checked all of Solr config files, but I couldn't find the setting  
about

that.
I comment out

1
1000


in solrconfig.xml

But...

Thanks




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Index-update-issue-tp1487956p1487956.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: Boosting specific field value

2010-09-16 Thread Jonathan Rochkind
Actually, I don't think you need to use a nested query for my idea; you can
just use "LocalParams":


&q={!lucene} your query in lucene syntax

I think that'll work - you can use "LocalParams" directly in the 'q', with no
need for a nested query.  If it does work, it avoids the escaping
nightmares of nested queries.
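
If that does work, Ravi's case would look something like this (sketch; the
boost value is arbitrary):

  qt=dismax
  q={!lucene}primarysection:(Politics* OR Nation*)
  bq=source:("BBC" OR "Associated Press")^10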


Ravi Kiran wrote:

Awesome I also did not know about q.alt accepting lucene style...Thanks to
both of you, Mr. Ackermann and Mr. Rockkind, I learnt a lot in just this
thread than I have done in the last 6 months of reading and dealing with
solr.

As you folks pointed out q.alt and nested queries are both great options, I
shall pursue them...asking a question on a forum is fun when you have
knowledgeable people, isnt it??? That's true reuse of resources in software
terms :-) , reuse of knowledge in developer space !!!.

Ravi Kiran Bhaskar
Principal Software Engineer
The Washington Post

On Thu, Sep 16, 2010 at 9:25 AM, Jonathan Rochkind  wrote:

  

Nice, I didn't know about q.alt.  Or, alternately, yes, you could use a
nested query, good call.   Which, yes, I agree is kind of confusing at
first.

&qt=dismax # use dismax for the overall query
&bq=whatever   # so we can use bq, since we're using dismax
&q=_query_:"{!lucene} solr-lucene syntax query" # but now make our entire
'q' a nested query, which is set to use lucene query parser.

What gets really confusing there is that the nested query expression needs
to be in quotes -- so if you need quotes within the actual query itself (say
for a phrase), you need to escape them. And the whole thing needs to be
URI-encoded, of course.  It does get confusing, but is quite powerful.

Jonathan

From: Chantal Ackermann [chantal.ackerm...@btelligent.de]
Sent: Thursday, September 16, 2010 4:10 AM
To: solr-user@lucene.apache.org
Subject: Re: Boosting specific field value

Hi Ravi,

with dismax, use the parameter "q.alt" which expects standard lucene
syntax (instead of "q"). If "q.alt" is present in the query, "q" is not
required. Add the parameter "qt=dismax".

Chantal

On Thu, 2010-09-16 at 06:22 +0200, Ravi Kiran wrote:


Hello Mr.Rochkind,
   I am using StandardRequestHandler so I presume
  

I


cannot use bq param right ?? Is there a way we can mix dismax and
standardhandler i.e use lucene syntax for query and use dismax style for
  

bq


using localparams/nested queries? I remember seeing your post related to
localparams and nested queries and got thoroughly confused

On Wed, Sep 15, 2010 at 10:28 PM, Jonathan Rochkind   

Maybe you are looking for the 'bq' (boost query) parameter in dismax?

http://wiki.apache.org/solr/DisMaxQParserPlugin#bq_.28Boost_Query.29

From: Ravi Kiran [ravi.bhas...@gmail.com]
Sent: Wednesday, September 15, 2010 10:02 PM
To: solr-user@lucene.apache.org
Subject: Re: Boosting specific field value

Erick,
I afraid you misinterpreted my issueif I query like you


said


i.e q=source(bbc OR "associated press")^10  I will ONLY get documents


with


source BBC or Associated Press...what I am asking is - if my query


query


does not deal with source at all but uses some other field...since the
field
"source" will be in the result , is there a way to still boost such a
document

To re-iterate, If my query is as follows

q=primarysection:(Politics* OR Nation*)&fq=contenttype:("Blog" OR


"Photo


Gallery") pubdatetime:[NOW-3MONTHS TO NOW]

and say the resulting docs have "source" field, is there any way I can
boost
the resulting doc/docs that have either BBC/Associated Press as the


value


in
source field to be on top

Can a filter query (fq) have a boost ? if yes, then probably I could
rewrite
the query as follows in a round about way

q=primarysection:(Politics* OR Nation*)&fq=contenttype:("Blog" OR


"Photo


Gallery) pubdatetime:[NOW-3MONTHS TO NOW] (source:("BBC" OR "Associated
Press")^10 OR -source:("BBC" OR "Associated Press")^5)

Theoretically, I have to write source in the fq 2 times as I need docs


that


have source values too just that they will have a lower boost

Thanks,

Ravi Kiran Bhaskar

On Wed, Sep 15, 2010 at 1:34 PM, Erick Erickson <


erickerick...@gmail.com


wrote:
  
This seems like a simple query-time boost, although I may not be

understanding
your problem well. That is, q=source(bbc OR "associated press")^10

As for boosting more recent documents, see:


  

http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents


HTH
Erick


On Wed, Sep 15, 2010 at 12:44 PM, Ravi Kiran   
wrote:


  

Hello,
   I am currently querying solr for a "*primarysection*" which


will


return documents like - *q=primarysection:(Politics* OR
Nation*)&fq=contenttype:("Blog" OR "Photo Gallery)

RE: Simple Filter Query (fq) Use Case Question

2010-09-16 Thread Dennis Gearon
There's something that works a little bit like 'DISTINCT' called field 
collapsing. Take a look in the archives for it.

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Wed, 9/15/10, Andre Bickford  wrote:

> From: Andre Bickford 
> Subject: RE: Simple Filter Query (fq) Use Case Question
> To: solr-user@lucene.apache.org
> Date: Wednesday, September 15, 2010, 12:49 PM
> Thanks for the response Erick.
> 
> I did actually try exactly what you suggested. I flipped
> the index over so that a gift is the document. This solution
> certainly solves the previous problem, but introduces a new
> issue where the search results show duplicate donors. If a
> donor gave 12 times in a year, and we offer full years as
> facet ranges, my understanding is that you'd see that donor
> 12 times in the search results, once for each gift document.
> Obviously I could do some client side filtering to list only
> distinct donors, but I was hoping to avoid that.
> 
> If I've simply stumbled into the basic tradeoffs of
> denormalization, I can live with client side de-duplication,
> but if you have any further suggestions I'm all eyes.
> 
> As for sizing, we have some huge charities as clients.
> However, right now I'm testing on a copy of prod data from a
> smaller client with ~350,000 donors and ~8,000,000 gift
> records. So, when I "flipped" the index around as you
> suggested, it went from 350,000 documents to 8,000,000
> documents. No issues with performance at all.
> 
> Thanks again,
> Andre
> 
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> 
> Sent: Wednesday, September 15, 2010 3:09 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Simple Filter Query (fq) Use Case Question
> 
> One strategy is to denormalize all the way. That is, each
> Solr "document" is Gift Amount and Gift Date would not be
> multiValued.
> You'd create a different "document" for each gift, so you'd
> have multiple
> documents with the same Id, Name, and Address. Be careful,
> though,
> if you've defined Id as a UniqueKey, you'd only have one
> record/donor. You
> can handle this easily enough by making a composite key of
> Id+Gift Date
> (assuming no donor made more than one gift on exactly the
> same date).
> 
> I know this goes completely against all the reflexes you've
> built up with
> working with DBs, but...
> 
> Can you give us a clue how many donations we're talking
> about here?
> You'd have to be working with a really big nonprofit to get
> enough documents
> to have to start worrying about making your index smaller.
> 
> HTH
> Erick
> 
> On Wed, Sep 15, 2010 at 1:41 PM, Andre Bickford wrote:
> 
> > I'm working on creating a solr index search for a
> charitable organization.
> > The solr index stores documents of donors. Each donor
> document has the
> > following four fields:
> >
> > Id
> > Name
> > Address
> > Gift Amount (multiValued)
> > Gift Date (multiValued)
> >
> > In our relational database, there is a one-to-many
> relationship between the
> > DONOR table and the GIFT table. One donor can of
> course give many gifts over
> > time. Consequently, I created the Gift Amount and Gift
> Date fields to be
> > mutiValued.
> >
> > Now, consider the following query filtered for gifts
> last month between $0
> > and $100:
> >
> > q=name:Jones
> > fq=giftDate:[NOW/MONTH-1 TO NOW/MONTH]
> > fq=giftAmount:[0 TO 100]
> >
> > The results show me donors who donated ANY amount in
> the past month and
> > donors who had EVER in the past given a gift between
> $0 and $100. I was
> > hoping to only see donors who had given a gift between
> $0 and $100 in the
> > past month exclusively. I believe the problem is that
> I neglected to
> > consider that for two multiValued fields, while the
> values might align
> > "index wise", there is really no other association
> between the two fields,
> > so the filter query intersection isn't really behaving
> as I expected.
> >
> > I think this is a fundamental question of one-to-many
> denormalization, but
> > obviously I'm not yet experienced enough with
> Lucene/Solr to find a
> > solution. As to why not just keep using a relational
> database, it's because
> > I'm trying to provide a faceting solution to "drill
> down" to donors. The
> > aforementioned fq parameters would come from faceting.
> Oh, that and Oracle
> > Text indexes are a PITA. :-)
> >
> > Thanks for any help you can provide.
> >
> > André Bickford
> > Software Engineering Team Leader
> > SofTrek Corporation
> > 30 Bryant Woods North  Amherst, NY 14228
> > 716.691.2800 x154  800.442.9211  Fax:
> 716.691.2828
> > abickf...@softrek.com 
> www.softrek.com
> >
> >
> >
> 
> 
>


Re: Handling Aggregate Records/Roll-up in Solr

2010-09-16 Thread Dennis Gearon
Look for faceting or field collapsing.


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Wed, 9/15/10, Thomas Martin  wrote:

> From: Thomas Martin 
> Subject: Handling Aggregate Records/Roll-up in Solr
> To: solr-user@lucene.apache.org
> Date: Wednesday, September 15, 2010, 1:23 PM
> Can someone point me to the mechanism
> in Sol that might allow me to
> roll-up or aggregate records for display.  We have
> many items that are
> similar and only want to show a representative record to
> the user until
> they select that record.  
> 
>  
> 
> As an example - We carry a polo shirt and have 15 records
> that represent
> the individual colors for that shirt.  Does the query
> API provide anyway
> to rollup the records passed on a property or do we need to
> just flatten
> the representation of the shirt in the data model.
> 
>  
> 
>  
> 
>  
> 
>


Re: Color search for images

2010-09-16 Thread Dennis Gearon
That's impressive!

So Google has BOUGHT some doctoral types, or highly specialized geeks, and is
looking at X number of images.

I bet the number of images in its video/film library is at least several orders
of magnitude above what Like.com deals with.

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Wed, 9/15/10, Shashi Kant  wrote:

> From: Shashi Kant 
> Subject: Re: Color search for images
> To: solr-user@lucene.apache.org
> Date: Wednesday, September 15, 2010, 8:56 PM
> > I'm sure there's some post
> doctoral types who could get a graphic shape analyzer, color
> analyzer, to at least say it's a flower.
> >
> > However, even Google would have to build new
> datacenters to have the horsepower to do that kind of
> graphic processing.
> >
> 
> Not necessarily true. Like.com - which incidentally got
> acquired by
> Google recently - built a true visual search technology and
> applied it
> on a large scale.
> 


RE: Simple Filter Query (fq) Use Case Question

2010-09-16 Thread Dennis Gearon
This brings me to ask a question that's been on my mind for awhile.

Are indexes set up for the whole site, or per set of searches, with several
different indexes for a site?

How many indexes does one Solr/Lucene instance have access to (not counting
shards/segments)?
Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 9/16/10, Chantal Ackermann  wrote:

> From: Chantal Ackermann 
> Subject: RE: Simple Filter Query (fq) Use Case Question
> To: "solr-user@lucene.apache.org" 
> Date: Thursday, September 16, 2010, 1:05 AM
> Hi Andre,
> 
> changing the entity in your index from donor to gift
> changes of course
> the scope of your search results. I found it helpful to
> re-think such
> change from that "other" side (the result side).
> If the users of your search application look for individual
> gifts, in
> the end, then changing the index to gift is for the
> better.
> 
> If they are searching for donors, then I would rethink the
> change but
> not discard it completely: you can still get the list of
> distinct donors
> by facetting over donors. You can show the users that list
> of donors
> (the facets), and they can chose from it and get all
> information on that
> donor (restricted to the original query, of course). The
> information
> would include the actual search result of a list of gifts
> that passed
> the query.
> 
> Cheers,
> Chantal
> 
> On Wed, 2010-09-15 at 21:49 +0200, Andre Bickford wrote:
> > Thanks for the response Erick.
> > 
> > I did actually try exactly what you suggested. I
> flipped the index over so that a gift is the document. This
> solution certainly solves the previous problem, but
> introduces a new issue where the search results show
> duplicate donors. If a donor gave 12 times in a year, and we
> offer full years as facet ranges, my understanding is that
> you'd see that donor 12 times in the search results, once
> for each gift document. Obviously I could do some client
> side filtering to list only distinct donors, but I was
> hoping to avoid that.
> > 
> > If I've simply stumbled into the basic tradeoffs of
> denormalization, I can live with client side de-duplication,
> but if you have any further suggestions I'm all eyes.
> > 
> > As for sizing, we have some huge charities as clients.
> However, right now I'm testing on a copy of prod data from a
> smaller client with ~350,000 donors and ~8,000,000 gift
> records. So, when I "flipped" the index around as you
> suggested, it went from 350,000 documents to 8,000,000
> documents. No issues with performance at all.
> > 
> > Thanks again,
> > Andre
> > 
> > -Original Message-
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> 
> > Sent: Wednesday, September 15, 2010 3:09 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Simple Filter Query (fq) Use Case
> Question
> > 
> > One strategy is to denormalize all the way. That is,
> each
> > Solr "document" is Gift Amount and Gift Date would not
> be multiValued.
> > You'd create a different "document" for each gift, so
> you'd have multiple
> > documents with the same Id, Name, and Address. Be
> careful, though,
> > if you've defined Id as a UniqueKey, you'd only have
> one record/donor. You
> > can handle this easily enough by making a composite
> key of Id+Gift Date
> > (assuming no donor made more than one gift on exactly
> the same date).
> > 
> > I know this goes completely against all the reflexes
> you've built up with
> > working with DBs, but...
> > 
> > Can you give us a clue how many donations we're
> talking about here?
> > You'd have to be working with a really big nonprofit
> to get enough documents
> > to have to start worrying about making your index
> smaller.
> > 
> > HTH
> > Erick
> > 
> > On Wed, Sep 15, 2010 at 1:41 PM, Andre Bickford 
> > wrote:
> > 
> > > I'm working on creating a solr index search for a
> charitable organization.
> > > The solr index stores documents of donors. Each
> donor document has the
> > > following four fields:
> > >
> > > Id
> > > Name
> > > Address
> > > Gift Amount (multiValued)
> > > Gift Date (multiValued)
> > >
> > > In our relational database, there is a
> one-to-many relationship between the
> > > DONOR table and the GIFT table. One donor can of
> course give many gifts over
> > > time. Consequently, I created the Gift Amount and
> Gift Date fields to be
> > > mutiValued.
> > >
> > > Now, consider the following query filtered for
> gifts last month between $0
> > > and $100:
> > >
> > > q=name:Jones
> > > fq=giftDate:[NOW/MONTH-1 TO NOW/MONTH]
> > > fq=giftAmount:[0 TO 100]
> > >
> > > The results show me donors who donated ANY amount
> in the past month and
> > > donors who had EVER in the past given a gift
> between $0 and $100. I was
> > > hoping to only see donors who had given a gift
> between $0 and $100 in the
> > > past month exclusively. I believe the problem i

Re: Color search for images

2010-09-16 Thread Dennis Gearon
LOL! now that is one of the wisest things I've seen in a while.
Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 9/16/10, Shashi Kant  wrote:

> From: Shashi Kant 
> Subject: Re: Color search for images
> To: solr-user@lucene.apache.org
> Date: Thursday, September 16, 2010, 6:36 AM
> On Thu, Sep 16, 2010 at 3:21 AM,
> Lance Norskog 
> wrote:
> > Yes, notice the flowers are all a medium-dark crimson
> red. There are a bunch
> > of these image-indexing & search technologies, but
> there is no (to my
> > knowledge) "finished technology"- it's very much an
> area of research. If you
> > want to search the word 'flower' and index data that
> can find blobs of red,
> > that might be easy with public tools. But there are
> many hard problems.
> >
> 
> Lance, is there *ever* a "finished technology"? >-)
> 


Re: Simple Filter Query (fq) Use Case Question

2010-09-16 Thread Jonathan Rochkind
One solr core has essentially one index in it (not only one 'field', 
but one indexed collection of documents). There are weird hacks, like I 
believe the spellcheck component kind of creates its own sub-indexes, 
not sure how it does that.


You can have more than one core in a single solr instance, but they're 
essentially separate; there's no easy way to 'join' across them or 
anything, a given request targets one core.


Dennis Gearon wrote:

This brings me to ask a question that's been on my mind for awhile.

Are indexes set up for the whole site, or a set of searches, with several
different indexes for a site?

How many instances does one Solr/Lucene instance have access to, (not counting
shards/segments)?
Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 9/16/10, Chantal Ackermann  wrote:

From: Chantal Ackermann 
Subject: RE: Simple Filter Query (fq) Use Case Question
To: "solr-user@lucene.apache.org" 
Date: Thursday, September 16, 2010, 1:05 AM
Hi Andre,

changing the entity in your index from donor to gift changes of course
the scope of your search results. I found it helpful to re-think such
change from that "other" side (the result side).
If the users of your search application look for individual gifts, in
the end, then changing the index to gift is for the better.

If they are searching for donors, then I would rethink the change but
not discard it completely: you can still get the list of distinct donors
by facetting over donors. You can show the users that list of donors
(the facets), and they can chose from it and get all information on that
donor (restricted to the original query, of course). The information
would include the actual search result of a list of gifts that passed
the query.

Cheers,
Chantal

On Wed, 2010-09-15 at 21:49 +0200, Andre Bickford wrote:

Thanks for the response Erick.

I did actually try exactly what you suggested. I flipped the index over so
that a gift is the document. This solution certainly solves the previous
problem, but introduces a new issue where the search results show duplicate
donors. If a donor gave 12 times in a year, and we offer full years as facet
ranges, my understanding is that you'd see that donor 12 times in the search
results, once for each gift document. Obviously I could do some client side
filtering to list only distinct donors, but I was hoping to avoid that.

If I've simply stumbled into the basic tradeoffs of denormalization, I can
live with client side de-duplication, but if you have any further suggestions
I'm all eyes.

As for sizing, we have some huge charities as clients. However, right now I'm
testing on a copy of prod data from a smaller client with ~350,000 donors and
~8,000,000 gift records. So, when I "flipped" the index around as you
suggested, it went from 350,000 documents to 8,000,000 documents. No issues
with performance at all.

Thanks again,
Andre

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, September 15, 2010 3:09 PM
To: solr-user@lucene.apache.org
Subject: Re: Simple Filter Query (fq) Use Case Question

One strategy is to denormalize all the way. That is, each Solr "document" is
Gift Amount and Gift Date would not be multiValued. You'd create a different
"document" for each gift, so you'd have multiple documents with the same Id,
Name, and Address. Be careful, though, if you've defined Id as a UniqueKey,
you'd only have one record/donor. You can handle this easily enough by making
a composite key of Id+Gift Date (assuming no donor made more than one gift on
exactly the same date).

I know this goes completely against all the reflexes you've built up with
working with DBs, but...

Can you give us a clue how many donations we're talking about here? You'd
have to be working with a really big nonprofit to get enough documents to
have to start worrying about making your index smaller.

HTH
Erick

On Wed, Sep 15, 2010 at 1:41 PM, Andre Bickford wrote:

I'm working on creating a solr index search for a charitable organization.
The solr index stores documents of donors. Each donor document has the
following four fields:

Id
Name
Address
Gift Amount (multiValued)
Gift Date (multiValued)

In our relational database, there is a one-to-many relationship between the
DONOR table and the GIFT table. One donor can of course give many gifts over
time. Consequently, I created the Gift Amount and Gift Date fields to be
mutiValued.

Now, consider the following query filtered for gifts last month between $0
and $100:

q=name:Jones

Re: Simple Filter Query (fq) Use Case Question

2010-09-16 Thread Dennis Gearon
So THAT'S what a core is! I have been wondering. Thank you very much!
Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 9/16/10, Jonathan Rochkind  wrote:

> From: Jonathan Rochkind 
> Subject: Re: Simple Filter Query (fq) Use Case Question
> To: "solr-user@lucene.apache.org" 
> Date: Thursday, September 16, 2010, 11:20 AM
> One solr core has essentially one
> index in it. (not only one 'field', 
> but one indexed collection of documents) There are weird
> hacks, like I 
> believe the spellcheck component kind of creates it's own
> sub-indexes, 
> not sure how it does that.
> 
> You can have more than one core in a single solr instance,
> but they're 
> essentially seperate, there's no easy way to 'join' accross
> them or 
> anything, a given request targets one core.
> 
> Dennis Gearon wrote:
> > This brings me to ask a question that's been on my
> mind for awhile.
> >
> > Are indexes set up for the whole site, or a set of
> searches, with several different indexes for a site?
> >
> > How many instances does one Solr/Lucene instance have
> access to, (not counting shards/segments)?
> > Dennis Gearon
> >
> > Signature Warning
> > 
> > EARTH has a Right To Life,
> >   otherwise we all die.
> >
> > Read 'Hot, Flat, and Crowded'
> > Laugh at http://www.yert.com/film.php
> >
> >
> > --- On Thu, 9/16/10, Chantal Ackermann 
> wrote:
> >
> >   
> >> From: Chantal Ackermann 
> >> Subject: RE: Simple Filter Query (fq) Use Case
> Question
> >> To: "solr-user@lucene.apache.org"
> 
> >> Date: Thursday, September 16, 2010, 1:05 AM
> >> Hi Andre,
> >>
> >> changing the entity in your index from donor to
> gift
> >> changes of course
> >> the scope of your search results. I found it
> helpful to
> >> re-think such
> >> change from that "other" side (the result side).
> >> If the users of your search application look for
> individual
> >> gifts, in
> >> the end, then changing the index to gift is for
> the
> >> better.
> >>
> >> If they are searching for donors, then I would
> rethink the
> >> change but
> >> not discard it completely: you can still get the
> list of
> >> distinct donors
> >> by facetting over donors. You can show the users
> that list
> >> of donors
> >> (the facets), and they can chose from it and get
> all
> >> information on that
> >> donor (restricted to the original query, of
> course). The
> >> information
> >> would include the actual search result of a list
> of gifts
> >> that passed
> >> the query.
> >>
> >> Cheers,
> >> Chantal
> >>
> >> On Wed, 2010-09-15 at 21:49 +0200, Andre Bickford
> wrote:
> >>     
> >>> Thanks for the response Erick.
> >>>
> >>> I did actually try exactly what you suggested.
> I
> >>>       
> >> flipped the index over so that a gift is the
> document. This
> >> solution certainly solves the previous problem,
> but
> >> introduces a new issue where the search results
> show
> >> duplicate donors. If a donor gave 12 times in a
> year, and we
> >> offer full years as facet ranges, my understanding
> is that
> >> you'd see that donor 12 times in the search
> results, once
> >> for each gift document. Obviously I could do some
> client
> >> side filtering to list only distinct donors, but I
> was
> >> hoping to avoid that.
> >>     
> >>> If I've simply stumbled into the basic
> tradeoffs of
> >>>       
> >> denormalization, I can live with client side
> de-duplication,
> >> but if you have any further suggestions I'm all
> eyes.
> >>     
> >>> As for sizing, we have some huge charities as
> clients.
> >>>       
> >> However, right now I'm testing on a copy of prod
> data from a
> >> smaller client with ~350,000 donors and ~8,000,000
> gift
> >> records. So, when I "flipped" the index around as
> you
> >> suggested, it went from 350,000 documents to
> 8,000,000
> >> documents. No issues with performance at all.
> >>     
> >>> Thanks again,
> >>> Andre
> >>>
> >>> -Original Message-
> >>> From: Erick Erickson [mailto:erickerick...@gmail.com]
> >>>       
> >>> Sent: Wednesday, September 15, 2010 3:09 PM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Re: Simple Filter Query (fq) Use
> Case
> >>>       
> >> Question
> >>     
> >>> One strategy is to denormalize all the way.
> That is,
> >>>       
> >> each
> >>     
> >>> Solr "document" is Gift Amount and Gift Date
> would not
> >>>       
> >> be multiValued.
> >>     
> >>> You'd create a different "document" for each
> gift, so
> >>>       
> >> you'd have multiple
> >>     
> >>> documents with the same Id, Name, and Address.
> Be
> >>>       
> >> careful, though,
> >>     
> >>> if you've defined Id as a UniqueKey, you'd
> only have
> >>>       
> >> one record/donor. You
> >>     
> >>> can handle this easily enough by making a
> composite
> >>>       
> >> key of Id+Gift Date
> >>     
> >>> (assuming no donor made more than one gift on
> exactly
> 

Re: Simple Filter Query (fq) Use Case Question

2010-09-16 Thread Dennis Gearon
Is a core a running piece of software, or just an index/config pairing?
Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 9/16/10, Jonathan Rochkind  wrote:

> From: Jonathan Rochkind 
> Subject: Re: Simple Filter Query (fq) Use Case Question
> To: "solr-user@lucene.apache.org" 
> Date: Thursday, September 16, 2010, 11:20 AM
> One solr core has essentially one
> index in it. (not only one 'field', 
> but one indexed collection of documents) There are weird
> hacks, like I 
> believe the spellcheck component kind of creates it's own
> sub-indexes, 
> not sure how it does that.
> 
> You can have more than one core in a single solr instance,
> but they're 
> essentially seperate, there's no easy way to 'join' accross
> them or 
> anything, a given request targets one core.
> 
> Dennis Gearon wrote:
> > This brings me to ask a question that's been on my
> mind for awhile.
> >
> > Are indexes set up for the whole site, or a set of
> searches, with several different indexes for a site?
> >
> > How many instances does one Solr/Lucene instance have
> access to, (not counting shards/segments)?
> > Dennis Gearon
> >
> > Signature Warning
> > 
> > EARTH has a Right To Life,
> >   otherwise we all die.
> >
> > Read 'Hot, Flat, and Crowded'
> > Laugh at http://www.yert.com/film.php
> >
> >
> > --- On Thu, 9/16/10, Chantal Ackermann 
> wrote:
> >
> >   
> >> From: Chantal Ackermann 
> >> Subject: RE: Simple Filter Query (fq) Use Case
> Question
> >> To: "solr-user@lucene.apache.org"
> 
> >> Date: Thursday, September 16, 2010, 1:05 AM
> >> Hi Andre,
> >>
> >> changing the entity in your index from donor to
> gift
> >> changes of course
> >> the scope of your search results. I found it
> helpful to
> >> re-think such
> >> change from that "other" side (the result side).
> >> If the users of your search application look for
> individual
> >> gifts, in
> >> the end, then changing the index to gift is for
> the
> >> better.
> >>
> >> If they are searching for donors, then I would
> rethink the
> >> change but
> >> not discard it completely: you can still get the
> list of
> >> distinct donors
> >> by facetting over donors. You can show the users
> that list
> >> of donors
> >> (the facets), and they can chose from it and get
> all
> >> information on that
> >> donor (restricted to the original query, of
> course). The
> >> information
> >> would include the actual search result of a list
> of gifts
> >> that passed
> >> the query.
> >>
> >> Cheers,
> >> Chantal
> >>
> >> On Wed, 2010-09-15 at 21:49 +0200, Andre Bickford
> wrote:
> >>     
> >>> Thanks for the response Erick.
> >>>
> >>> I did actually try exactly what you suggested.
> I
> >>>       
> >> flipped the index over so that a gift is the
> document. This
> >> solution certainly solves the previous problem,
> but
> >> introduces a new issue where the search results
> show
> >> duplicate donors. If a donor gave 12 times in a
> year, and we
> >> offer full years as facet ranges, my understanding
> is that
> >> you'd see that donor 12 times in the search
> results, once
> >> for each gift document. Obviously I could do some
> client
> >> side filtering to list only distinct donors, but I
> was
> >> hoping to avoid that.
> >>     
> >>> If I've simply stumbled into the basic
> tradeoffs of
> >>>       
> >> denormalization, I can live with client side
> de-duplication,
> >> but if you have any further suggestions I'm all
> eyes.
> >>     
> >>> As for sizing, we have some huge charities as
> clients.
> >>>       
> >> However, right now I'm testing on a copy of prod
> data from a
> >> smaller client with ~350,000 donors and ~8,000,000
> gift
> >> records. So, when I "flipped" the index around as
> you
> >> suggested, it went from 350,000 documents to
> 8,000,000
> >> documents. No issues with performance at all.
> >>     
> >>> Thanks again,
> >>> Andre
> >>>
> >>> -Original Message-
> >>> From: Erick Erickson [mailto:erickerick...@gmail.com]
> >>>       
> >>> Sent: Wednesday, September 15, 2010 3:09 PM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Re: Simple Filter Query (fq) Use
> Case
> >>>       
> >> Question
> >>     
> >>> One strategy is to denormalize all the way.
> That is,
> >>>       
> >> each
> >>     
> >>> Solr "document" is Gift Amount and Gift Date
> would not
> >>>       
> >> be multiValued.
> >>     
> >>> You'd create a different "document" for each
> gift, so
> >>>       
> >> you'd have multiple
> >>     
> >>> documents with the same Id, Name, and Address.
> Be
> >>>       
> >> careful, though,
> >>     
> >>> if you've defined Id as a UniqueKey, you'd
> only have
> >>>       
> >> one record/donor. You
> >>     
> >>> can handle this easily enough by making a
> composite
> >>>       
> >> key of Id+Gift Date
> >>     
> >>> (assuming no donor made more than one gift on
> exactly

SOLR interface with PHP using javabin?

2010-09-16 Thread onlinespend...@gmail.com
 I am planning on creating a website that has some SOLR search 
capabilities for the users, and was also planning on using PHP for the 
server-side scripting.


My goal is to find the most efficient way to submit search queries from 
the website, interface with SOLR, and display the results back on the 
website.  If I use PHP, it seems that all the solutions use some form of 
character based stream for the interface.  It would seem that using a 
binary representation, such as javabin, would be more efficient.


If using javabin, or some similar efficient binary stream to interface 
SOLR with PHP is not possible, what do people recommend as the most 
efficient solution that provides the best performance, even if that 
means not using PHP and going with some other alternative?


Thank you,
Ben


Re: SOLR interface with PHP using javabin?

2010-09-16 Thread Yonik Seeley
On Thu, Sep 16, 2010 at 2:30 PM, onlinespend...@gmail.com
 wrote:
>  I am planning on creating a website that has some SOLR search capabilities
> for the users, and was also planning on using PHP for the server-side
> scripting.
>
> My goal is to find the most efficient way to submit search queries from the
> website, interface with SOLR, and display the results back on the website.
>  If I use PHP, it seems that all the solutions use some form of character
> based stream for the interface.  It would seem that using a binary
> representation, such as javabin, would be more efficient.
>
> If using javabin, or some similar efficient binary stream to interface SOLR
> with PHP is not possible, what do people recommend as the most efficient
> solution that provides the best performance, even if that means not using
> PHP and going with some other alternative?

I'd recommend going with JSON - it will be quite a bit smaller than
XML, and the parsers are generally quite efficient.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8
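
For reference, a minimal sketch of the wt=json round trip, written in Ruby to
match the other client script in this digest; a PHP version would be the same
GET plus json_decode. The host, handler path, query and the 'id' field are
placeholders, not anything from this thread:

require 'net/http'
require 'uri'
require 'cgi'
require 'json'   # json gem

params = {
  'q'    => 'ipod',   # placeholder query
  'wt'   => 'json',   # ask Solr's JSON response writer for the output
  'rows' => '10'
}
qs   = params.map { |k, v| "#{k}=#{CGI.escape(v)}" }.join('&')
uri  = URI.parse("http://localhost:8983/solr/select?#{qs}")
data = JSON.parse(Net::HTTP.get(uri))   # one GET, one parse

puts data['response']['numFound']
data['response']['docs'].each do |doc|
  puts doc['id']   # assumes an 'id' field in the schema
end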


Re: SOLR interface with PHP using javabin?

2010-09-16 Thread Thomas Joiner
If you wish to interface to Solr from PHP, and decide to go with Yonik's
suggestion to use JSON, I would suggest using
http://code.google.com/p/solr-php-client/

It has served my needs for the most part.

On Thu, Sep 16, 2010 at 1:33 PM, Yonik Seeley wrote:

> On Thu, Sep 16, 2010 at 2:30 PM, onlinespend...@gmail.com
>  wrote:
> >  I am planning on creating a website that has some SOLR search
> capabilities
> > for the users, and was also planning on using PHP for the server-side
> > scripting.
> >
> > My goal is to find the most efficient way to submit search queries from
> the
> > website, interface with SOLR, and display the results back on the
> website.
> >  If I use PHP, it seems that all the solutions use some form of character
> > based stream for the interface.  It would seem that using a binary
> > representation, such as javabin, would be more efficient.
> >
> > If using javabin, or some similar efficient binary stream to interface
> SOLR
> > with PHP is not possible, what do people recommend as the most efficient
> > solution that provides the best performance, even if that means not using
> > PHP and going with some other alternative?
>
> I'd recommend going with JSON - it will be quite a bit smaller than
> XML, and the parsers are generally quite efficient.
>
> -Yonik
> http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8
>


DataImportHandler with multiline SQL

2010-09-16 Thread David Yang
Hi

 

I am using the DIH to retrieve data, and as part of the process, I
wanted to create a temporary table and then import data from that. I
have played around a little with DIH and it seems like for a query like:
"select x; select y;" you can have select y to return no results and do
random stuff, but the first select x needs to return results.

Does anybody know exactly how DIH handles multiple sql statements in the
query?

 

Cheers,

David



Re: DataImportHandler with multiline SQL

2010-09-16 Thread Lukas Kahwe Smith

On 16.09.2010, at 21:07, David Yang wrote:

> Hi
> 
> 
> 
> I am using the DIH to retrieve data, and as part of the process, I
> wanted to create a temporary table and then import data from that. I
> have played around a little with DIH and it seems like for a query like:
> "select x; select y;" you can have select y to return no results and do
> random stuff, but the first select x needs to return results.
> 
> Does anybody know exactly how DIH handles multiple sql statements in the
> query?

I do not know the answer to that question, but you can have multiple <entity>
tags, so maybe you can just split up your queries like that?

regards,
Lukas Kahwe Smith
m...@pooteeweet.org





DIH: alternative approach to deltaQuery

2010-09-16 Thread Lukas Kahwe Smith
Hi,

I think I have mentioned this approach before on this list, but I really think 
that the deltaQuery approach which is currently explained as the "way to do 
updates" is far from ideal. It seems to add a lot of redundant queries.

I therefore propose to merge the initial import and delta queries using the 
below approach:



Using this approach, when clean = true the "last_updated > '${dataimporter.last_index_time}'" 
condition should be optimized out by any sane RDBMS. And if clean = false, it 
basically triggers the delta query part to be evaluated.

Is there any downside to this approach? Should this be added to the wiki?

regards.
Lukas Kahwe Smith
m...@pooteeweet.org





Understanding Lucene's File Format

2010-09-16 Thread Giovanni Fernandez-Kincade
Hi,
I've been trying to understand Lucene's file format and I keep getting hung up 
on one detail - how can Lucene quickly find the frequency data (or proximity 
data) for a particular term? According to the file formats page on the Lucene 
website,
 the FreqDelta field in the Term Info file (.tis) is relative to the previous 
term. How is this helpful? The few references I've found on the web for this 
subject make it sound like the Term Dictionary has direct pointers to the 
frequency data for a given term, but that isn't consistent with the 
aforementioned reference.

Thanks for your help,
Gio.
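
A note that may help with the hung-up detail: the deltas only make sense when
the .tis entries are read in order. The reader keeps running totals, and the
much smaller .tii index (one entry every indexInterval terms, loaded into
memory) gives it an absolute position to start scanning from, so a lookup
never walks the whole dictionary. A rough sketch of that accumulation, with
made-up numbers and field names rather than Lucene's actual API:

# Made-up entries standing in for .tis records after one .tii index point.
# FreqDelta/ProxDelta are stored relative to the previous term; absolute
# file positions fall out of a running sum while scanning.
index_point = { :freq_pointer => 1024, :prox_pointer => 4096 }  # absolute start
entries = [
  { :text => 'apple',  :freq_delta => 12, :prox_delta => 40 },
  { :text => 'banana', :freq_delta =>  7, :prox_delta => 25 },
  { :text => 'cherry', :freq_delta =>  9, :prox_delta => 31 },
]

freq_pointer = index_point[:freq_pointer]
prox_pointer = index_point[:prox_pointer]
entries.each do |t|
  freq_pointer += t[:freq_delta]
  prox_pointer += t[:prox_delta]
  puts "#{t[:text]}: .frq @ #{freq_pointer}, .prx @ #{prox_pointer}"
end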


Get all results from a solr query

2010-09-16 Thread Christopher Gross
I have some queries that I'm running against a solr instance (older,
1.2 I believe), and I would like to get *all* the results back (and
not have to put an absurdly large number as a part of the rows
parameter).

Is there a way that I can do that?  Any help would be appreciated.

-- Chris


Re: Get all results from a solr query

2010-09-16 Thread Shashi Kant
q=*:*

On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross  wrote:
> I have some queries that I'm running against a solr instance (older,
> 1.2 I believe), and I would like to get *all* the results back (and
> not have to put an absurdly large number as a part of the rows
> parameter).
>
> Is there a way that I can do that?  Any help would be appreciated.
>
> -- Chris
>


Re: Get all results from a solr query

2010-09-16 Thread Christopher Gross
That will still just return 10 rows for me.  Is there something else in
the configuration of solr to have it return all the rows in the
results?

-- Chris



On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant  wrote:
> q=*:*
>
> On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross  wrote:
>> I have some queries that I'm running against a solr instance (older,
>> 1.2 I believe), and I would like to get *all* the results back (and
>> not have to put an absurdly large number as a part of the rows
>> parameter).
>>
>> Is there a way that I can do that?  Any help would be appreciated.
>>
>> -- Chris
>>
>


RE: Re: Get all results from a solr query

2010-09-16 Thread Markus Jelsma
Not according to the wiki;

http://wiki.apache.org/solr/CommonQueryParameters#rows

 

But you could always create an issue for this one. 
 
-Original message-
From: Christopher Gross 
Sent: Thu 16-09-2010 22:50
To: solr-user@lucene.apache.org; 
Subject: Re: Get all results from a solr query

That will stil just return 10 rows for me.  Is there something else in
the configuration of solr to have it return all the rows in the
results?

-- Chris



On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant  wrote:
> q=*:*
>
> On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross  wrote:
>> I have some queries that I'm running against a solr instance (older,
>> 1.2 I believe), and I would like to get *all* the results back (and
>> not have to put an absurdly large number as a part of the rows
>> parameter).
>>
>> Is there a way that I can do that?  Any help would be appreciated.
>>
>> -- Chris
>>
>


Re: Get all results from a solr query

2010-09-16 Thread Shashi Kant
Start with a *:*, then the “numFound” attribute of the <result>
element should give you the rows to fetch by a 2nd request.
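
A compact sketch of that two-step pattern (Ruby, wt=json; the URL and query
are placeholders). For a big index you would still page through the results
in chunks, as in the script further down this thread, rather than asking for
everything in one response:

require 'net/http'
require 'uri'
require 'json'

base  = 'http://localhost:8983/solr/select?q=*:*&wt=json'
# 1st request: rows=0, we only want numFound
found = JSON.parse(Net::HTTP.get(URI.parse("#{base}&rows=0")))['response']['numFound']
# 2nd request: ask for exactly that many rows
docs  = JSON.parse(Net::HTTP.get(URI.parse("#{base}&rows=#{found}")))['response']['docs']
puts docs.size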


On Thu, Sep 16, 2010 at 4:49 PM, Christopher Gross  wrote:
> That will stil just return 10 rows for me.  Is there something else in
> the configuration of solr to have it return all the rows in the
> results?
>
> -- Chris
>
>
>
> On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant  wrote:
>> q=*:*
>>
>> On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross  wrote:
>>> I have some queries that I'm running against a solr instance (older,
>>> 1.2 I believe), and I would like to get *all* the results back (and
>>> not have to put an absurdly large number as a part of the rows
>>> parameter).
>>>
>>> Is there a way that I can do that?  Any help would be appreciated.
>>>
>>> -- Chris
>>>
>>
>


getting a list of top page-ranked webpages

2010-09-16 Thread Ian Upright
Hi, this question is a little off topic, but I thought since so many people
on this are probably experts in this field, someone may know.

I'm experimenting with my own semantic-based search engine, but I want to
test it with a large corpus of web pages.  Ideally I would like to have a
list of the top 10M or top 100M page-ranked URL's in the world.

Short of using Nutch to crawl the entire web and build this page-rank, are
there any other ways?  What other ways or resources might be available for
me to get this (smaller) corpus of top webpages?

Thanks, Ian


Re: Get all results from a solr query

2010-09-16 Thread Scott Gonyea
If you want to do it in Ruby, you can use this script as scaffolding:
require 'rsolr' # run `gem install rsolr` to get this
solr  = RSolr.connect(:url => 'http://ip-10-164-13-204:8983/solr')
# rsolr returns the response as a hash with string keys; a :q is needed too
total = solr.select({:q => '*:*', :rows => 0})["response"]["numFound"]
rows  = 10
query = {
  :q      => '*:*',
  :rows   => rows,
  :start  => 0
}
pages = (total.to_f / rows.to_f).ceil # round up
(1..pages).each do |page|
  query[:start] = (page-1) * rows
  results = solr.select(query)
  docs    = results["response"]["docs"]
  # Do stuff here
  #
  docs.each do |doc|
    doc["content"] = "IN UR SOLR MESSIN UP UR CONTENT!#{doc['content']}"
  end
  # Add it back in to Solr
  solr.add(docs)
  solr.commit
end

Scott

On Thu, Sep 16, 2010 at 2:27 PM, Shashi Kant  wrote:
>
> Start with a *:*, then the “numFound” attribute of the 
> element should give you the rows to fetch by a 2nd request.
>
>
> On Thu, Sep 16, 2010 at 4:49 PM, Christopher Gross  wrote:
> > That will stil just return 10 rows for me.  Is there something else in
> > the configuration of solr to have it return all the rows in the
> > results?
> >
> > -- Chris
> >
> >
> >
> > On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant  wrote:
> >> q=*:*
> >>
> >> On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross  
> >> wrote:
> >>> I have some queries that I'm running against a solr instance (older,
> >>> 1.2 I believe), and I would like to get *all* the results back (and
> >>> not have to put an absurdly large number as a part of the rows
> >>> parameter).
> >>>
> >>> Is there a way that I can do that?  Any help would be appreciated.
> >>>
> >>> -- Chris
> >>>
> >>
> >


Re: Get all results from a solr query

2010-09-16 Thread Scott Gonyea
lol, note to self: scratch out IPs.  Good thing firewalls exist to
keep my stupidity at bay.

Scott

On Thu, Sep 16, 2010 at 2:55 PM, Scott Gonyea  wrote:
> If you want to do it in Ruby, you can use this script as scaffolding:
> require 'rsolr' # run `gem install rsolr` to get this
> solr  = RSolr.connect(:url => 'http://ip-10-164-13-204:8983/solr')
> total = solr.select({:rows => 0})["response"]["numFound"]
> rows  = 10
> query = {
>   :rows   => rows,
>   :start  => 0
> }
> pages = (total.to_f / rows.to_f).ceil # round up
> (1..pages).each do |page|
>   query[:start] = (page-1) * rows
>   results = solr.select(query)
>   docs    = results[:response][:docs]
>   # Do stuff here
>   #
>   docs.each do |doc|
>     doc[:content] = "IN UR SOLR MESSIN UP UR CONTENT!#{doc[:content]}"
>   end
>   # Add it back in to Solr
>   solr.add(docs)
>   solr.commit
> end
>
> Scott
>
> On Thu, Sep 16, 2010 at 2:27 PM, Shashi Kant  wrote:
>>
>> Start with a *:*, then the “numFound” attribute of the 
>> element should give you the rows to fetch by a 2nd request.
>>
>>
>> On Thu, Sep 16, 2010 at 4:49 PM, Christopher Gross  wrote:
>> > That will stil just return 10 rows for me.  Is there something else in
>> > the configuration of solr to have it return all the rows in the
>> > results?
>> >
>> > -- Chris
>> >
>> >
>> >
>> > On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant  wrote:
>> >> q=*:*
>> >>
>> >> On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross  
>> >> wrote:
>> >>> I have some queries that I'm running against a solr instance (older,
>> >>> 1.2 I believe), and I would like to get *all* the results back (and
>> >>> not have to put an absurdly large number as a part of the rows
>> >>> parameter).
>> >>>
>> >>> Is there a way that I can do that?  Any help would be appreciated.
>> >>>
>> >>> -- Chris
>> >>>
>> >>
>> >
>


Re: getting a list of top page-ranked webpages

2010-09-16 Thread Ken Krugler

Hi Ian,

On Sep 16, 2010, at 2:44pm, Ian Upright wrote:

> Hi, this question is a little off topic, but I thought since so many people
> on this are probably experts in this field, someone may know.
>
> I'm experimenting with my own semantic-based search engine, but I want to
> test it with a large corpus of web pages.  Ideally I would like to have a
> list of the top 10M or top 100M page-ranked URL's in the world.
>
> Short of using Nutch to crawl the entire web and build this page-rank, is
> there any other ways?  What other ways or resources might be available for
> me to get this (smaller) corpus of top webpages?

The public terabyte dataset project would be a good match for what you need.

http://bixolabs.com/datasets/public-terabyte-dataset-project/

Of course, that means we have to actually finish the crawl & finalize
the Avro format we use for the data :)

There are other free collections of data around, though none that I
know of which target top-ranked pages.

-- Ken

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g







Re: SOLR interface with PHP using javabin?

2010-09-16 Thread onlinespend...@gmail.com
 OK, thanks for the suggestion.  Why do you recommend using JSON over 
simply using the built-in PHPSerializedResponseWriter?


I find using an interface that requires the data to be parsed to be 
inefficient (this would include the aforementioned 
PHPSerializedResponseWriter as well).  Wouldn't it be far better to use 
some standard data structure that is sent as a bit stream?


Ben

On 9/16/2010 11:38 AM, Thomas Joiner wrote:

If you wish to interface to Solr from PHP, and decide to go with Yonik's
suggestion to use JSON, I would suggest using
http://code.google.com/p/solr-php-client/

It has served my needs for the most part.

On Thu, Sep 16, 2010 at 1:33 PM, Yonik Seeley wrote:

On Thu, Sep 16, 2010 at 2:30 PM, onlinespend...@gmail.com  wrote:

I am planning on creating a website that has some SOLR search capabilities
for the users, and was also planning on using PHP for the server-side
scripting.

My goal is to find the most efficient way to submit search queries from the
website, interface with SOLR, and display the results back on the website.
If I use PHP, it seems that all the solutions use some form of character
based stream for the interface.  It would seem that using a binary
representation, such as javabin, would be more efficient.

If using javabin, or some similar efficient binary stream to interface SOLR
with PHP is not possible, what do people recommend as the most efficient
solution that provides the best performance, even if that means not using
PHP and going with some other alternative?

I'd recommend going with JSON - it will be quite a bit smaller than
XML, and the parsers are generally quite efficient.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8



Re: Simple Filter Query (fq) Use Case Question

2010-09-16 Thread Andre Bickford
Thanks to everyone for your suggestions.

It seems that creating the index using gifts as the top level entity is the 
appropriate approach so I can effectively filter gifts  on both the gift amount 
and gift date without running into multiValued field issues. It introduces a 
problem of listing donors multiple times, but that can be addressed by the 
field collapsing feature which will hopefully be completed in trunk soon.

For anyone else who is looking for information on the Solr equivalent of 
"select distinct", check out these resources:

http://wiki.apache.org/solr/FieldCollapsing
https://issues.apache.org/jira/browse/SOLR-236
 


On Sep 16, 2010, at 2:26 PM, Dennis Gearon wrote:

> So THAT'S what a core is! I have been wondering. Thank you very much!
> Dennis Gearon
> 
> Signature Warning
> 
> EARTH has a Right To Life,
>  otherwise we all die.
> 
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
> 
> 
> --- On Thu, 9/16/10, Jonathan Rochkind  wrote:
> 
>> From: Jonathan Rochkind 
>> Subject: Re: Simple Filter Query (fq) Use Case Question
>> To: "solr-user@lucene.apache.org" 
>> Date: Thursday, September 16, 2010, 11:20 AM
>> One solr core has essentially one
>> index in it. (not only one 'field', 
>> but one indexed collection of documents) There are weird
>> hacks, like I 
>> believe the spellcheck component kind of creates it's own
>> sub-indexes, 
>> not sure how it does that.
>> 
>> You can have more than one core in a single solr instance,
>> but they're 
>> essentially seperate, there's no easy way to 'join' accross
>> them or 
>> anything, a given request targets one core.
>> 
>> Dennis Gearon wrote:
>>> This brings me to ask a question that's been on my
>> mind for awhile.
>>> 
>>> Are indexes set up for the whole site, or a set of
>> searches, with several different indexes for a site?
>>> 
>>> How many instances does one Solr/Lucene instance have
>> access to, (not counting shards/segments)?
>>> Dennis Gearon
>>> 
>>> Signature Warning
>>> 
>>> EARTH has a Right To Life,
>>>otherwise we all die.
>>> 
>>> Read 'Hot, Flat, and Crowded'
>>> Laugh at http://www.yert.com/film.php
>>> 
>>> 
>>> --- On Thu, 9/16/10, Chantal Ackermann 
>> wrote:
>>> 
>>>
 From: Chantal Ackermann 
 Subject: RE: Simple Filter Query (fq) Use Case
>> Question
 To: "solr-user@lucene.apache.org"
>> 
 Date: Thursday, September 16, 2010, 1:05 AM
 Hi Andre,
 
 changing the entity in your index from donor to
>> gift
 changes of course
 the scope of your search results. I found it
>> helpful to
 re-think such
 change from that "other" side (the result side).
 If the users of your search application look for
>> individual
 gifts, in
 the end, then changing the index to gift is for
>> the
 better.
 
 If they are searching for donors, then I would
>> rethink the
 change but
 not discard it completely: you can still get the
>> list of
 distinct donors
 by facetting over donors. You can show the users
>> that list
 of donors
 (the facets), and they can chose from it and get
>> all
 information on that
 donor (restricted to the original query, of
>> course). The
 information
 would include the actual search result of a list
>> of gifts
 that passed
 the query.
 
 Cheers,
 Chantal
 
 On Wed, 2010-09-15 at 21:49 +0200, Andre Bickford
>> wrote:
  
> Thanks for the response Erick.
> 
> I did actually try exactly what you suggested.
>> I
>
 flipped the index over so that a gift is the
>> document. This
 solution certainly solves the previous problem,
>> but
 introduces a new issue where the search results
>> show
 duplicate donors. If a donor gave 12 times in a
>> year, and we
 offer full years as facet ranges, my understanding
>> is that
 you'd see that donor 12 times in the search
>> results, once
 for each gift document. Obviously I could do some
>> client
 side filtering to list only distinct donors, but I
>> was
 hoping to avoid that.
  
> If I've simply stumbled into the basic
>> tradeoffs of
>
 denormalization, I can live with client side
>> de-duplication,
 but if you have any further suggestions I'm all
>> eyes.
  
> As for sizing, we have some huge charities as
>> clients.
>
 However, right now I'm testing on a copy of prod
>> data from a
 smaller client with ~350,000 donors and ~8,000,000
>> gift
 records. So, when I "flipped" the index around as
>> you
 suggested, it went from 350,000 documents to
>> 8,000,000
 documents. No issues with performance at all.
  
> Thanks again,
> Andre
> 
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
>
> Sent: Wednesday, September 15, 2010 3:09 PM
> To: solr-user@lucene.a

Index partitioned/ Full indexing by MSSQL or MySQL

2010-09-16 Thread Tommy Molto
Hi,



My company has a site of ads with 2 types of data:
active ads (ads that are still valid) and inactive ads (that are no longer valid,
but whose pages we still have to show so users see related ads, related
searches, etc).

Some doubts crossed the mind of the team:

   - Should we build 2 indexes, one for each status, or only one with a
     “flag” of active/inactive (see the sketch after this message)? Will it
     impact performance? Can we make something that will “partition” our
     index guided by this field, for that matter?
   - We are thinking, guided by a consulting company, of doing a full
     reindex monthly or so. If we get the data directly from our database,
     which do you think is best, leaving the software license aside:
     Microsoft SQL Server or MySQL? I saw an article that encourages us to
     do it with MySQL:
     http://www.cabotsolutions.com/blog/200905/using-solr-lucene-for-full-text-search-with-mysql/



[]s

Paulo Marinho
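
On the first doubt: if the team keeps a single index with an active/inactive
flag, the usual pattern is to leave the flag out of q and apply it as a filter
query, so Solr caches that restriction in its filterCache and reuses it across
searches. A hedged sketch using rsolr as in the script earlier in this digest;
the field name "status" and its values are invented for illustration:

require 'rsolr' # gem install rsolr

solr = RSolr.connect(:url => 'http://localhost:8983/solr')

# Search only active ads; the fq is cached and reused, so the
# active/inactive restriction adds very little per-query cost.
active = solr.select(:q => 'car', :fq => 'status:active')

# An inactive ad's page can still pull related ads from the same index.
related = solr.select(:q => 'car', :fq => 'status:inactive', :rows => 5)

puts active['response']['numFound'], related['response']['numFound']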


Re: DIH: alternative approach to deltaQuery

2010-09-16 Thread Lance Norskog
Database optimization is not like program optimization- it is wildly 
unpredictable.


What bugs me about the delta approach is using the last time DIH ran, 
rather than a timestamp from the DB. Oh well. Also, with SOLR-1499 you 
can query Solr directly to see what it has.


Lukas Kahwe Smith wrote:

Hi,

I think i have mentioned this approach before on this list, but I really think that the 
deltaQuery approach which is currently explained as the "way to do updates" is 
far from ideal. It seems to add a lot of redundant queries.

I therefore propose to merge the initial import and delta queries using the 
below approach:

 

Using this approach when clean = true the "last_updated > 
'${dataimporter.last_index_time}" should be optimized out by any sane RDBMS. And if 
clean = false it basically triggers the delta query part to be evaluated.

Is there any downside to this approach? Should this be added to the wiki?

regards.
Lukas Kahwe Smith
m...@pooteeweet.org



   


Re: Null Pointer Exception while indexing

2010-09-16 Thread Lance Norskog
Good eye, Thomas! Yes, GCJ is a non-starter. You're best off downloading 
Java 1.6 yourself, but I understand that it is easier to use the public 
package repositories.


Thomas Joiner wrote:

My guess would be that Jetty has some configuration somewhere that is
telling it to use GCJ.  Is it possible to completely remove GCJ from the
system?  Another possibility would be to uninstall Jetty, and then reinstall
it, and hope that on the reinstall it would pick up on the OpenJDK.

What distro of linux are you using?  It probably depends on that how to set
the JVM.

On Thu, Sep 16, 2010 at 10:22 AM, andrewdps  wrote:

   

Lance,

We are on Solr Specification Version: 1.4.1
--
View this message in context:
http://lucene.472066.n3.nabble.com/Null-Pointer-Exception-while-indexing-tp1481154p1488320.html
Sent from the Solr - User mailing list archive at Nabble.com.

 
   


Re: Null Pointer Exception while indexing

2010-09-16 Thread Lance Norskog
Andrew, you should download Solr from the apache site. This packaging is 
wrong-headed.


As to Java, a Linux person would know the system for picking which is 
the standard Java.


andrewdps wrote:

Also,the solr Java properties looks like this using gcj,despite setting
java_home in /etc/profile

jetty.logs = /usr/local/vufind/solr/jetty/logs
path.separator = :
java.vm.name = GNU libgcj
java.vm.specification.name = Java(tm) Virtual Machine Specification
java.runtime.version = 1.5.0
java.home = /usr/lib/jvm/java-1.5.0-gcj-4.3-1.5.0.0/jre
java.vm.specification.version = 1.0
line.separator =


   


Re: Solr Rolling Log Files

2010-09-16 Thread Lance Norskog

Rolling logfiles is configured in the servlet container, not Solr.
Indexing logfiles is a pain because of multiline log outputs like 
Exceptions.


Vladimir Sutskever wrote:

Can SOLR be configured out of the box to handle rolling log files?


Kind regards,

Vladimir Sutskever
Investment Bank - Technology
JPMorgan Chase, Inc.
Tel: (212) 552.5097



This email is confidential and subject to important disclaimers and
conditions including on offers for the purchase or sale of
securities, accuracy and completeness of information, viruses,
confidentiality, legal privilege, and legal entity disclaimers,
available at http://www.jpmorgan.com/pages/disclosures/email.
   


Re: Color search for images

2010-09-16 Thread Shawn Heisey

 On 9/16/2010 7:45 AM, Shashi Kant wrote:

Lire is a nascent effort and based on a cursory overview a while back,
IMHO was an over-simplified version of what a CBIR engine should be.
They use CEDD (color & edge descriptors).
Wouldn't work for the kind of applications I am working on - which
needs among other things, Color, Shape, Orientation, Pose, Edge/Corner
etc.

OpenCV has a steep learning curve, but having been through it, is very
powerful toolkit - the best there is by far! BTW the code is in C++,
but has both Java & .NET bindings.
This is a fabulous book to get hold of:
http://www.amazon.com/Learning-OpenCV-Computer-Vision-Library/dp/0596516134,
if you are seriously into OpenCV.

Pls feel free to reach out of if you need any help with OpenCV +
Solr/Lucene. I have spent quite a bit of time on this.


What I am envisioning (at least to start) is to have all of this add two 
fields in the index.  One would be for color information for the color 
similarity search.  The other would be a simple multivalued text field 
that we put keywords into based on what OpenCV can detect about the 
image.  If it detects faces, we would put "face" into this field.  Other 
things that it can detect would result in other keywords.


For the color search, I have a few inter-related hurdles.  I've got to 
figure out what form the color data actually takes and how to represent 
it in Solr.  I need Java code for Solr that can take an input color 
value and find similar values in the index.  Then I need some code that 
can go in our feed processing scripts for new content.  That code would 
also go into a crawler script to handle existing images.
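
One low-tech way to make "similar color" searchable with ordinary Solr fields
(not what LIRE or Google do, just a sketch of the general idea) is to quantize
each image's pixels into a coarse palette and index one token per occupied bin
in a multivalued field; images then match roughly in proportion to how much
their dominant colors overlap. Everything below (bin count, token names, the
fake pixel data) is invented for illustration:

# Quantize RGB pixels into a 4x4x4 palette and emit one token per bin,
# e.g. "c3_0_0" = red bucket 3, green bucket 0, blue bucket 0. Indexing
# these tokens makes "similar color" plain term matching. The pixel
# array stands in for whatever image decoder produces the RGB values.
def color_tokens(pixels, buckets = 4)
  step   = 256 / buckets
  counts = Hash.new(0)
  pixels.each do |(r, g, b)|
    counts["c#{r / step}_#{g / step}_#{b / step}"] += 1
  end
  min = pixels.size * 0.05   # ignore bins covering < 5% of the image
  counts.select { |_, n| n >= min }.sort_by { |_, n| -n }.map { |k, _| k }
end

pixels = [[220, 30, 40], [210, 25, 35], [10, 10, 200]]  # fake "mostly red" image
puts color_tokens(pixels).inspect   # => ["c3_0_0", "c0_0_3"]

Weighting (how much of the image each bin covers) could be layered on top by
repeating tokens or using payloads, and the multivalued keyword field described
above would simply sit alongside this one.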


We can probably handle most of the development if we can figure out the 
methods and data formats.  Naturally we would be interested in using 
off-the-shelf stuff as much as possible.  Today I learned that our CTO 
has already been looking into OpenCV and has a copy of the O'Reilly book.


Thanks,
Shawn



Re: DIH: alternative approach to deltaQuery

2010-09-16 Thread Lukas Kahwe Smith

On 17.09.2010, at 05:40, Lance Norskog wrote:

> Database optimization is not like program optimization- it is wildly 
> unpredictable.

Well, an RDBMS that cannot handle true != false as a no-op during the planning 
stage doesn't even do the basics of optimization.

But this approach is so much more efficient than the approach of reading out 
the id's of the changed rows in any RDBMS. Furthermore it gets rid of an 
essentially redundant query definition which improves readability and 
maintainability.

> What bugs me about the delta approach is using the last time DIH ran, rather 
> than a timestamp from the DB. Oh well. Also, with SOLR-1499 you can query 
> Solr directly to see what it has.

Yeah, it would be nice to be able to tell DIH to store the timestamp in some 
table. That is, there should be a way to run arbitrary SQL before and after the 
import, with the new last-update timestamp that is about to be stored made 
available to it.

> 
> Lukas Kahwe Smith wrote:
>> Hi,
>> 
>> I think i have mentioned this approach before on this list, but I really 
>> think that the deltaQuery approach which is currently explained as the "way 
>> to do updates" is far from ideal. It seems to add a lot of redundant queries.
>> 
>> I therefore propose to merge the initial import and delta queries using the 
>> below approach:
>> 
>> 
>> 
>> Using this approach when clean = true the "last_updated > 
>> '${dataimporter.last_index_time}" should be optimized out by any sane RDBMS. 
>> And if clean = false it basically triggers the delta query part to be 
>> evaluated.
>> 
>> Is there any downside to this approach? Should this be added to the wiki?

Lukas Kahwe Smith
m...@pooteeweet.org





Re: getting a list of top page-ranked webpages

2010-09-16 Thread Dennis Gearon
There's a great web page somewhere that shows the popularity as the subway map 
of Tokyo.

And, most popular in the world, per dominant culture in each country, per 
religious majority, per language culture . . .

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 9/16/10, Ian Upright  wrote:

> From: Ian Upright 
> Subject: getting a list of top page-ranked webpages
> To: solr-user@lucene.apache.org
> Date: Thursday, September 16, 2010, 2:44 PM
> Hi, this question is a little off
> topic, but I thought since so many people
> on this are probably experts in this field, someone may
> know.
> 
> I'm experimenting with my own semantic-based search engine,
> but I want to
> test it with a large corpus of web pages.  Ideally I
> would like to have a
> list of the top 10M or top 100M page-ranked URL's in the
> world.
> 
> Short of using Nutch to crawl the entire web and build this
> page-rank, is
> there any other ways?  What other ways or resources
> might be available for
> me to get this (smaller) corpus of top webpages?
> 
> Thanks, Ian
>


Re: Simple Filter Query (fq) Use Case Question

2010-09-16 Thread Dennis Gearon
Yes, field collapsing is like faceting, only more so, and very useful, I 
believe. As my project gets going, I have already imagined uses for it.


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 9/16/10, Andre Bickford  wrote:

> From: Andre Bickford 
> Subject: Re: Simple Filter Query (fq) Use Case Question
> To: solr-user@lucene.apache.org
> Date: Thursday, September 16, 2010, 4:45 PM
> Thanks to everyone for your
> suggestions.
> 
> It seems that creating the index using gifts as the top
> level entity is the appropriate approach so I can
> effectively filter gifts  on both the gift amount and
> gift date without running into multiValued field issues. It
> introduces a problem of listing donors multiple times, but
> that can be addressed by the field collapsing feature which
> will hopefully be completed in trunk soon.
> 
> For anyone else who is looking for information on the Solr
> equivalent of "select distinct", check out these resources:
> 
> http://wiki.apache.org/solr/FieldCollapsing
> https://issues.apache.org/jira/browse/SOLR-236
>  
> 
> 
> On Sep 16, 2010, at 2:26 PM, Dennis Gearon wrote:
> 
> > So THAT'S what a core is! I have been wondering. Thank
> you very much!
> > Dennis Gearon
> > 
> > Signature Warning
> > 
> > EARTH has a Right To Life,
> >  otherwise we all die.
> > 
> > Read 'Hot, Flat, and Crowded'
> > Laugh at http://www.yert.com/film.php
> > 
> > 
> > --- On Thu, 9/16/10, Jonathan Rochkind 
> wrote:
> > 
> >> From: Jonathan Rochkind 
> >> Subject: Re: Simple Filter Query (fq) Use Case
> Question
> >> To: "solr-user@lucene.apache.org"
> 
> >> Date: Thursday, September 16, 2010, 11:20 AM
> >> One solr core has essentially one
> >> index in it. (not only one 'field', 
> >> but one indexed collection of documents) There are
> weird
> >> hacks, like I 
> >> believe the spellcheck component kind of creates
> it's own
> >> sub-indexes, 
> >> not sure how it does that.
> >> 
> >> You can have more than one core in a single solr
> instance,
> >> but they're 
> >> essentially seperate, there's no easy way to
> 'join' accross
> >> them or 
> >> anything, a given request targets one core.
> >> 
> >> Dennis Gearon wrote:
> >>> This brings me to ask a question that's been
> on my
> >> mind for awhile.
> >>> 
> >>> Are indexes set up for the whole site, or a
> set of
> >> searches, with several different indexes for a
> site?
> >>> 
> >>> How many instances does one Solr/Lucene
> instance have
> >> access to, (not counting shards/segments)?
> >>> Dennis Gearon
> >>> 
> >>> Signature Warning
> >>> 
> >>> EARTH has a Right To Life,
> >>>    otherwise we all die.
> >>> 
> >>> Read 'Hot, Flat, and Crowded'
> >>> Laugh at http://www.yert.com/film.php
> >>> 
> >>> 
> >>> --- On Thu, 9/16/10, Chantal Ackermann 
> >> wrote:
> >>> 
> >>>    
>  From: Chantal Ackermann 
>  Subject: RE: Simple Filter Query (fq) Use
> Case
> >> Question
>  To: "solr-user@lucene.apache.org"
> >> 
>  Date: Thursday, September 16, 2010, 1:05
> AM
>  Hi Andre,
>  
>  changing the entity in your index from
> donor to
> >> gift
>  changes of course
>  the scope of your search results. I found
> it
> >> helpful to
>  re-think such
>  change from that "other" side (the result
> side).
>  If the users of your search application
> look for
> >> individual
>  gifts, in
>  the end, then changing the index to gift
> is for
> >> the
>  better.
>  
>  If they are searching for donors, then I
> would
> >> rethink the
>  change but
>  not discard it completely: you can still
> get the
> >> list of
>  distinct donors
>  by facetting over donors. You can show the
> users
> >> that list
>  of donors
>  (the facets), and they can chose from it
> and get
> >> all
>  information on that
>  donor (restricted to the original query,
> of
> >> course). The
>  information
>  would include the actual search result of
> a list
> >> of gifts
>  that passed
>  the query.
>  
>  Cheers,
>  Chantal
>  
>  On Wed, 2010-09-15 at 21:49 +0200, Andre
> Bickford
> >> wrote:
>       
> > Thanks for the response Erick.
> > 
> > I did actually try exactly what you
> suggested.
> >> I
> >        
>  flipped the index over so that a gift is
> the
> >> document. This
>  solution certainly solves the previous
> problem,
> >> but
>  introduces a new issue where the search
> results
> >> show
>  duplicate donors. If a donor gave 12 times
> in a
> >> year, and we
>  offer full years as facet ranges, my
> understanding
> >> is that
>  you'd see that donor 12 times in the
> search
> >> results, once
>  for each gift document. Obviously I could
> do some
> >> client
>  side filtering to list only distinct
> donors, but I
> >

Re: Color search for images

2010-09-16 Thread Dennis Gearon
Sounds like someone is/has going to say/said:

"Make it so, number one"

There are some good links off of this article about the color Magenta, (like, 
uh, who knows what 'cyan' or 'magenta' are anyway? So I looked it up. Refilling 
my printer cartridges required an explanation.)

http://en.wikipedia.org/wiki/Magenta


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 9/16/10, Shawn Heisey  wrote:

> From: Shawn Heisey 
> Subject: Re: Color search for images
> To: solr-user@lucene.apache.org
> Date: Thursday, September 16, 2010, 7:58 PM
>  On 9/16/2010 7:45 AM, Shashi Kant
> wrote:
> > Lire is a nascent effort and based on a cursory
> overview a while back,
> > IMHO was an over-simplified version of what a CBIR
> engine should be.
> > They use CEDD (color&  edge descriptors).
> > Wouldn't work for the kind of applications I am
> working on - which
> > needs among other things, Color, Shape, Orientation,
> Pose, Edge/Corner
> > etc.
> > 
> > OpenCV has a steep learning curve, but having been
> through it, is very
> > powerful toolkit - the best there is by far! BTW the
> code is in C++,
> > but has both Java&  .NET bindings.
> > This is a fabulous book to get hold of:
> > http://www.amazon.com/Learning-OpenCV-Computer-Vision-Library/dp/0596516134,
> > if you are seriously into OpenCV.
> > 
> > Pls feel free to reach out of if you need any help
> with OpenCV +
> > Solr/Lucene. I have spent quite a bit of time on
> this.
> 
> What I am envisioning (at least to start) is have all this
> add two fields in the index.  One would be for color
> information for the color similarity search.  The other
> would be a simple multivalued text field that we put
> keywords into based on what OpenCV can detect about the
> image.  If it detects faces, we would put "face" into
> this field.  Other things that it can detect would
> result in other keywords.
> 
> For the color search, I have a few inter-related
> hurdles.  I've got to figure out what form the color
> data actually takes and how to represent it in Solr.  I
> need Java code for Solr that can take an input color value
> and find similar values in the index.  Then I need some
> code that can go in our feed processing scripts for new
> content.  That code would also go into a crawler script
> to handle existing images.
> 
> We can probably handle most of the development if we can
> figure out the methods and data formats.  Naturally we
> would be interested in using off-the-shelf stuff as much as
> possible.  Today I learned that our CTO has already
> been looking into OpenCV and has a copy of the O'Reilly
> book.
> 
> Thanks,
> Shawn
> 
>


Re: getting a list of top page-ranked webpages

2010-09-16 Thread Dennis Gearon
This was supposed to be a question:
> And, most popular in the world, per dominant culture in
> each country, per religious majority, per language culture .
> . .
> 

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 9/16/10, Dennis Gearon  wrote:

> From: Dennis Gearon 
> Subject: Re: getting a list of top page-ranked webpages
> To: solr-user@lucene.apache.org, i...@upright.net
> Date: Thursday, September 16, 2010, 11:28 PM
> There's a great web page somewhere
> that shows the popularity as the subway map of tokyo.
> 
> Dennis Gearon
> 
> Signature Warning
> 
> EARTH has a Right To Life,
>   otherwise we all die.
> 
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
> 
> 
> --- On Thu, 9/16/10, Ian Upright 
> wrote:
> 
> > From: Ian Upright 
> > Subject: getting a list of top page-ranked webpages
> > To: solr-user@lucene.apache.org
> > Date: Thursday, September 16, 2010, 2:44 PM
> > Hi, this question is a little off
> > topic, but I thought since so many people
> > on this are probably experts in this field, someone
> may
> > know.
> > 
> > I'm experimenting with my own semantic-based search
> engine,
> > but I want to
> > test it with a large corpus of web pages.  Ideally I
> > would like to have a
> > list of the top 10M or top 100M page-ranked URL's in
> the
> > world.
> > 
> > Short of using Nutch to crawl the entire web and build
> this
> > page-rank, is
> > there any other ways?  What other ways or resources
> > might be available for
> > me to get this (smaller) corpus of top webpages?
> > 
> > Thanks, Ian
> >
>