Entity with multiple datasources

2012-02-16 Thread Radu Toev
Hello,

I created a data-config.xml file where I define a datasource and an entity
with 12 fields.
In my use case I have 2 databases with the same schema, so I want to
combine the 2 databases into one index.
I defined a second dataSource tag and duplicated the entity with its
fields (changing the name and the datasource).
What I'm expecting is to get around 7k results (I have around 6k in the
first db and 1k in the second). However, I'm getting a total of 2k.
Where could the problem be?

Thanks


Re: 'foruns' don't match 'forum' with NGramFilterFactory (or EdgeNGramFilterFactory)

2012-02-16 Thread Dirceu Vieira
Hi,

It's funny that if you try "fóruns" it matches:
http://bhakta.casadomato.org:8982/solr/select/?q=f%C3%B3runs&version=2.2&start=0&rows=10&indent=on
But when you try "foruns", it does not match.

Check this out...

http://bhakta.casadomato.org:8982/solr/admin/analysis.jsp?nt=type&name=text&verbose=on&highlight=on&val=f%C3%B3rum&qverbose=on&qval=foruns

See that stemming does not work for the word foruns.

Could it be because fórum is part of the PT dictionary but not forum?
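One thing that might be worth trying (just a sketch, not verified against your
schema): adding an ASCIIFoldingFilterFactory to both the index- and query-time
analyzers of the "text" field type, so that the accented indexed token "fóruns"
and the unaccented query "foruns" are folded to the same form before stemming
or n-gramming is applied:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- fold fóruns -> foruns on both index and query side -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!-- existing stemmer / (Edge)NGram filters would follow here -->
  </analyzer>
</fieldType>

It would not fix the singular/plural case by itself, but it removes the accent
mismatch.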

Regards,

2012/2/14 Bráulio Bhavamitra 

> Hello all,
>
> I'm experimenting with NGramFilterFactory and EdgeNGramFilterFactory.
>
> Both of them show a match in my Solr admin analysis, but when I query
> 'foruns'
> it doesn't find any 'forum'.
> analysis
>
> http://bhakta.casadomato.org:8982/solr/admin/analysis.jsp?nt=type&name=text&verbose=on&highlight=on&val=f%C3%B3runs&qverbose=on&qval=f%C3%B3runs
> search
>
> http://bhakta.casadomato.org:8982/solr/select/?q=foruns&version=2.2&start=0&rows=10&indent=on
>
> Anybody knows what's the problem?
>
> bráulio
>



-- 
Dirceu Vieira Júnior
---
+47 9753 2473
dirceuvjr.blogspot.com
twitter.com/dirceuvjr


problem to indexing pdf directory

2012-02-16 Thread alessio crisantemi
Hi all,
I have a problem configuring PDF indexing from a directory in my Solr
with DIH:

with this data-config:

[data-config.xml stripped by the list archive; it uses a FileListEntityProcessor
over D:\gioconews_archivio\marzo2011 with fileName=".*pdf" and a second entity
that reads each file; a partial copy is quoted in Gora Mohanty's reply below]

I obtain this result (DIH status response, tags stripped by the archive):

  full-import
  idle
  0:0:2.44
  0
  43
  0
  2012-02-12 19:06:00
  Indexing failed. Rolled back all changes.
  2012-02-12 19:06:00

suggestions?
thank you
alessio


Best requestHandler for "typing error".

2012-02-16 Thread stockii
Hello.

Which request handler do you use to find typing errors, like "goolge" => do you
mean "google"?!

I want to combine my "EdgeNGram" autosuggestion with a clever autocorrection!



What do you use ?
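Not sure what others use, but the stock SpellCheckComponent with collation is
one option; a rough solrconfig.xml sketch (the field and handler names here are
just placeholders):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <!-- field the spelling index is built from -->
    <str name="field">suggest_text</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

<requestHandler name="/didyoumean" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.count">5</str>
    <!-- collate returns a rewritten query, e.g. "google" for "goolge" -->
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>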

-
--- System 

One Server, 12 GB RAM, 2 Solr Instances, 8 Cores, 
1 Core with 45 Million Documents other Cores < 200.000

- Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
- Solr2 for Update-Request  - delta every Minute - 4GB Xmx
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Best-requestHandler-for-typing-error-tp3749576p3749576.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Do we need reindexing from solr 1.4.1 to 3.5.0?

2012-02-16 Thread Kashif Khan
I kept the old schema and solrconfig files, but there were some errors due to
which Solr was not loading. I don't know what those errors are. We have a few of
our own custom plugins developed against 1.4.1.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Do-we-need-reindexing-from-solr-1-4-1-to-3-5-0-tp3739353p3749629.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Do we need reindexing from solr 1.4.1 to 3.5.0?

2012-02-16 Thread Kashif Khan
We have both stored=true and stored=false fields in the schema, so we can't
reindex the way you suggested. We tried that earlier.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Do-we-need-reindexing-from-solr-1-4-1-to-3-5-0-tp3739353p3749631.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Using Solr for a rather busy "Yellow Pages"-type index - good idea or not really?

2012-02-16 Thread Mikhail Khludnev
Please find my replies inline.

On Thu, Feb 16, 2012 at 10:30 AM, Alexey Verkhovsky <
alexey.verkhov...@gmail.com> wrote:

> Hi, all,
>
> I'm new here. Used Solr on a couple of projects before, but didn't need to
> dive deep into anything until now. These days, I'm doing a spike for a
> "yellow pages" type search server with the following technical
> requirements:
>
> ~10 mln listings in the database. A listing has a name, address,
> description, coordinates and a number of tags / filtering fields; no more
> than a kilobyte all told; i.e. theoretically the whole thing should fit in
> RAM without sharding. A typical query is either "all text matches on name
> and/or description within a bounded box", or "some combination of tag
> matches within a bounded box". Bounded boxes are 1 to 50 km wide, and
> contain up to 10^5 unfiltered listings (the average is more like 10^3).
> More than 50% of all the listings are in the frequently requested bounding
> boxes, however a vast majority of listings are almost never displayed
> (because they don't match the other filters).
>
> Data "never changes" (i.e., a daily batch update; rebuild of the entire
> index and restart of all search servers is feasible, as long as it takes
> minutes, not hours).

Everybody starts with a daily bounce but ends up with an UPDATED_AT column and
delta updates; just consider the urgent-content-fix use case. I don't think it's
worth relying on a daily bounce as a cornerstone of the architecture.


> This thing ideally should serve up to 10^3 requests
> per second on a small (as in, "less than 10 commodity boxes") cluster. In
> other words, a typical request should be CPU bound and take ~100-200 msec
> to process. Because of coordinates (that are almost never the same),
> caching of queries makes no sense;

You can use a grid of coordinates to reduce their entropy; if you filter by
bounding box, the argument is the bounding box, not the coordinates. Anyway, use
post-filtering and cache=false for such filters:
http://yonik.wordpress.com/2012/02/10/advanced-filter-caching-in-solr/
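For example, roughly (a sketch; "geopoint" is a made-up field name here, and
whether a given filter runs as a true post-filter depends on its implementation,
frange being the case documented in the post above):

fq={!bbox cache=false cost=100 sfield=geopoint pt=45.6789,-123.0123 d=25}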


> from what little I understand about
> Lucene internals, caching of filters probably doesn't make sense either.
>
But Solr does it: http://wiki.apache.org/solr/SolrCaching#filterCache

>
> After perusing documentation and some googling (but almost no source code
> exploring yet), I understand how the schema and the queries will look like,
> and now have to figure out a specific configuration that fits the
> performance/scalability requirements. Here is what I'm thinking:
>
> 1. Search server is an internal service that uses embedded Solr for the
> indexing part. RAMDirectoryFactory as index storage.
>
Bad idea. It's intended mostly for tests; the closest production-oriented
analogue is org.apache.lucene.store.instantiated.InstantiatedIndex.


> 2. All data is in some sort of persistent storage on a file system, and is
> loaded into the memory when a search server starts up.
>
AFAIK the state of the art is to use a file-based directory (MMap or whatever)
and rely on the Linux file-system RAM cache. Also, Solr (and partially Lucene)
caches some things on the heap itself:
http://wiki.apache.org/solr/SolrCaching#Types_of_Caches_and_Example_Configuration.
So this is mostly done already.
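In solrconfig.xml that would be roughly (a sketch; the default
StandardDirectoryFactory already picks a sensible implementation per platform):

<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>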


> 3. Data updates are handled as "update the persistent storage, start
> another cluster, load the world into RAM, flip the load balancer, kill the
> old cluster"
>
No again. Lucene has a pretty cool model of segments and generations designed
for incremental updates, and Solr does a lot to search the old generation and
warm up the new one simultaneously (it just takes some memory, you know, two
times). I don't think a manual A/B scheme is applicable. Anyway, you can (but
don't really need to) play with the replication facilities, e.g. disable traffic
for half of the nodes, push the new index to them, let them warm up, then enable
traffic (such machinery never works smoothly due to the number of moving parts).


> 4. Solr returns IDs with relevance scores; actual presentations of listings
> (as JSON documents) are constructed outside of Solr and cached in
> Memcached, as a mostly static content with a few templated bits, like
> <%=DISTANCE_TO(-123.0123, 45.6789) %>.
>
Using separate nodes to do the search and other nodes to stream the content
sounds good (it's mentioned in every book). It looks like besides the score you
can also return the distance to the user, i.e. there is no need for
<%=DISTANCE_TO(-123.0123, 45.6789) %>, just <%=doc.DISTANCE%>; see
http://wiki.apache.org/solr/SpatialSearch?#Returning_the_distance
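For illustration, one way that page describes is returning the distance as the
score, roughly (a sketch, parameter values invented; note the main query then
ranks by distance rather than relevance):

q={!func}geodist()&sfield=geopoint&pt=45.6789,-123.0123&sort=score asc&fl=*,score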



> 5. All Solr caching is switched off.
>
But why?



>
> Obviously, we are not the first people to do something like this with Solr,
> so I'm hoping for some collective wisdom on the following:
>
> Does this sounds like a feasible set of requirements in terms of
> performance and scalability for Solr? Are we on the right path to solving
> this problem well? If not, what should we be doing instead? What nasty
> technical/architectural gotchas are we probably missing at this stage?
>
> One particular 

Re: Entity with multiple datasources

2012-02-16 Thread Dmitry Kan
1. Do you see any errors / exceptions in the logs?
2. Could you have duplicates?

On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev  wrote:

> Hello,
>
> I created a data-config.xml file where I define a datasource and an entity
> with 12 fields.
> In my use case I have 2 databases with the same schema, so I want to
> combine in one index the 2 databases.
> I defined a second dataSource tag and duplicateed the entity with its
> field(changed the name and the datasource).
> What I'm expecting is to get around 7k results(I have around 6k in the
> first db and 1k in the second). However I'm getting a total of 2k.
> Where could be the problem?
>
> Thanks
>



-- 
Regards,

Dmitry Kan


Re: Spatial Search and faceting

2012-02-16 Thread Eric Grobler
Hi William,

Thanks for the feedback.

I will try the group query and see how the performance with 2 queries is.
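For reference, gluing the grouping suggestion onto the original parameters would
look roughly like this (an untested sketch using the field names from the first
message):

q=iphone
fq={!bbox}
sfield=geopoint
pt=49.594857,8.468614
d=50
group=true
group.field=city
group.limit=1
sort=geodist() asc
fl=id,city,geopoint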

Best Regards
Ericz

On Thu, Feb 16, 2012 at 4:06 AM, William Bell  wrote:

> One way to do it is to group by city and then sort=geodist() asc
>
> select?group=true&group.field=city&sort=geodist() desc&rows=10&fl=city
>
> It might require 2 calls to SOLR to get it the way you want.
>
> On Wed, Feb 15, 2012 at 5:51 PM, Eric Grobler 
> wrote:
> > Hi Solr community,
> >
> > I am doing a spatial search and then do a facet by city.
> > Is it possible to then sort the faceted cities by distance?
> >
> > We would like to display the hits per city, but sort them by distance.
> >
> > Thanks & Regards
> > Ericz
> >
> > q=iphone
> > fq={!bbox}
> > sfield=geopoint
> > pt=49.594857,8.468614
> > d=50
> > fl=id,description,city,geopoint
> >
> > facet=true
> > facet.field=city
> > f.city.facet.limit=10
> > f.city.facet.sort=count //geodist() asc
>
>
>
> --
> Bill Bell
> billnb...@gmail.com
> cell 720-256-8076
>


Realtime search with multi clients updating index simultaneously.

2012-02-16 Thread v_shan
I have a helpdesk application developed in PHP/MySQL. I want to implement
real-time full-text search and I have shortlisted Solr. The MySQL database will
store all the tickets and their updates, and that data will be imported to build
the Solr index. All search requests will be handled by Solr.

What I want is a real time search. The moment someone updates a ticket, it
should be available for search. 

As per my understanding of Solr, this is how I think the system will work. 
A user updates a ticket -> database record is modified -> a request is sent
to Solr server to modify corresponding document in index.

I have read a book on Solr and below questions are troubling me.
1. The book mentions that "commits are slow in Solr. Depending on the index
size, Solr's auto-warming
configuration, and Solr's cache state prior to committing, a commit can take
a non-trivial amount of time. Typically, it takes a few seconds, but it can
take
some number of minutes in extreme cases". If this is true, then how will I
know when the data will be available for search, and how can I implement
real-time search? Also, I don't want the ticket update operation to be slowed
down (by adding the extra step of updating the Solr index).
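One bounded-latency option that already exists in 3.5 is commitWithin on the
update request: it asks Solr to make the document searchable within the given
number of milliseconds without the client issuing (and waiting on) an explicit
commit. A rough sketch (field names invented):

<add commitWithin="5000">
  <doc>
    <field name="id">TICKET-123</field>
    <field name="subject">Printer jams on floor 3</field>
  </doc>
</add>

True soft commits only arrive with Solr 4.x / NRT, so a few seconds of latency
is about the best you can expect on 3.5.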

2. It is also mentioned that "there is no transaction isolation. This means
that if more than one Solr client
were to submit modifications and commit them at overlapping times, it is
possible for part of one client's set of changes to be committed before that
client told Solr to commit. This applies to rollback as well. If this is a
problem
for your architecture then consider using one client process responsible for
updating Solr."

Does it mean that due to lack of transactional commits, Solr can mess up the
updates when multiple people update the ticket simultaneously?

Now the question before me is: Is Solr fit in my case? If yes, How?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Realtime-search-with-multi-clients-updating-index-simultaneously-tp3749881p3749881.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Entity with multiple datasources

2012-02-16 Thread Radu Toev
1. Nothing in the logs
2. No.

On Thu, Feb 16, 2012 at 12:44 PM, Dmitry Kan  wrote:

> 1. Do you see any errors / exceptions in the logs?
> 2. Could you have duplicates?
>
> On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev  wrote:
>
> > Hello,
> >
> > I created a data-config.xml file where I define a datasource and an
> entity
> > with 12 fields.
> > In my use case I have 2 databases with the same schema, so I want to
> > combine in one index the 2 databases.
> > I defined a second dataSource tag and duplicateed the entity with its
> > field(changed the name and the datasource).
> > What I'm expecting is to get around 7k results(I have around 6k in the
> > first db and 1k in the second). However I'm getting a total of 2k.
> > Where could be the problem?
> >
> > Thanks
> >
>
>
>
> --
> Regards,
>
> Dmitry Kan
>


Re: Entity with multiple datasources

2012-02-16 Thread Dmitry Kan
It sounds a bit, as if SOLR stopped processing data once it queried all
from the smaller dataset. That's why you have 2000. If you just have a
handler pointed to the bigger data set (6k), do you manage to get all 6k db
entries into solr?

On Thu, Feb 16, 2012 at 1:46 PM, Radu Toev  wrote:

> 1. Nothing in the logs
> 2. No.
>
> On Thu, Feb 16, 2012 at 12:44 PM, Dmitry Kan  wrote:
>
> > 1. Do you see any errors / exceptions in the logs?
> > 2. Could you have duplicates?
> >
> > On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev  wrote:
> >
> > > Hello,
> > >
> > > I created a data-config.xml file where I define a datasource and an
> > entity
> > > with 12 fields.
> > > In my use case I have 2 databases with the same schema, so I want to
> > > combine in one index the 2 databases.
> > > I defined a second dataSource tag and duplicateed the entity with its
> > > field(changed the name and the datasource).
> > > What I'm expecting is to get around 7k results(I have around 6k in the
> > > first db and 1k in the second). However I'm getting a total of 2k.
> > > Where could be the problem?
> > >
> > > Thanks
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Dmitry Kan
> >
>



-- 
Regards,

Dmitry Kan


Re: Entity with multiple datasources

2012-02-16 Thread Radu Toev
I tried running with just one datasource (the one that has 6k entries) and
it indexes them OK.
The same if I do the 1k database separately: it indexes OK.

On Thu, Feb 16, 2012 at 2:11 PM, Dmitry Kan  wrote:

> It sounds a bit, as if SOLR stopped processing data once it queried all
> from the smaller dataset. That's why you have 2000. If you just have a
> handler pointed to the bigger data set (6k), do you manage to get all 6k db
> entries into solr?
>
> On Thu, Feb 16, 2012 at 1:46 PM, Radu Toev  wrote:
>
> > 1. Nothing in the logs
> > 2. No.
> >
> > On Thu, Feb 16, 2012 at 12:44 PM, Dmitry Kan 
> wrote:
> >
> > > 1. Do you see any errors / exceptions in the logs?
> > > 2. Could you have duplicates?
> > >
> > > On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev 
> wrote:
> > >
> > > > Hello,
> > > >
> > > > I created a data-config.xml file where I define a datasource and an
> > > entity
> > > > with 12 fields.
> > > > In my use case I have 2 databases with the same schema, so I want to
> > > > combine in one index the 2 databases.
> > > > I defined a second dataSource tag and duplicateed the entity with its
> > > > field(changed the name and the datasource).
> > > > What I'm expecting is to get around 7k results(I have around 6k in
> the
> > > > first db and 1k in the second). However I'm getting a total of 2k.
> > > > Where could be the problem?
> > > >
> > > > Thanks
> > > >
> > >
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Dmitry Kan
> > >
> >
>
>
>
> --
> Regards,
>
> Dmitry Kan
>


Re: Entity with multiple datasources

2012-02-16 Thread Dmitry Kan
OK, maybe you can show the db-data-config.xml just in case?
Also, in schema.xml, does your uniqueKey correspond to the unique field in
the db?

On Thu, Feb 16, 2012 at 2:13 PM, Radu Toev  wrote:

> I tried running with just one datasource(the one that has 6k entries) and
> it indexes them ok.
> The same, if I do sepparately the 1k database. It indexes ok.
>
> On Thu, Feb 16, 2012 at 2:11 PM, Dmitry Kan  wrote:
>
> > It sounds a bit, as if SOLR stopped processing data once it queried all
> > from the smaller dataset. That's why you have 2000. If you just have a
> > handler pointed to the bigger data set (6k), do you manage to get all 6k
> db
> > entries into solr?
> >
> > On Thu, Feb 16, 2012 at 1:46 PM, Radu Toev  wrote:
> >
> > > 1. Nothing in the logs
> > > 2. No.
> > >
> > > On Thu, Feb 16, 2012 at 12:44 PM, Dmitry Kan 
> > wrote:
> > >
> > > > 1. Do you see any errors / exceptions in the logs?
> > > > 2. Could you have duplicates?
> > > >
> > > > On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev 
> > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I created a data-config.xml file where I define a datasource and an
> > > > entity
> > > > > with 12 fields.
> > > > > In my use case I have 2 databases with the same schema, so I want
> to
> > > > > combine in one index the 2 databases.
> > > > > I defined a second dataSource tag and duplicateed the entity with
> its
> > > > > field(changed the name and the datasource).
> > > > > What I'm expecting is to get around 7k results(I have around 6k in
> > the
> > > > > first db and 1k in the second). However I'm getting a total of 2k.
> > > > > Where could be the problem?
> > > > >
> > > > > Thanks
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > Dmitry Kan
> > > >
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Dmitry Kan
> >
>



-- 
Regards,

Dmitry Kan


Re: Entity with multiple datasources

2012-02-16 Thread Radu Toev

[data-config.xml stripped by the list archive; it defines two SQL Server
dataSources ("s" and "p") and two entities, each selecting the same columns
from the Machine table of one database; the full queries are quoted in
Dmitry Kan's reply further down]
I've removed the connection params
The unique key is id.

On Thu, Feb 16, 2012 at 2:27 PM, Dmitry Kan  wrote:

> OK, maybe you can show the db-data-config.xml just in case?
> Also in schema.xml, does you  correspond to the unique field in
> the db?
>
> On Thu, Feb 16, 2012 at 2:13 PM, Radu Toev  wrote:
>
> > I tried running with just one datasource(the one that has 6k entries) and
> > it indexes them ok.
> > The same, if I do sepparately the 1k database. It indexes ok.
> >
> > On Thu, Feb 16, 2012 at 2:11 PM, Dmitry Kan 
> wrote:
> >
> > > It sounds a bit, as if SOLR stopped processing data once it queried all
> > > from the smaller dataset. That's why you have 2000. If you just have a
> > > handler pointed to the bigger data set (6k), do you manage to get all
> 6k
> > db
> > > entries into solr?
> > >
> > > On Thu, Feb 16, 2012 at 1:46 PM, Radu Toev  wrote:
> > >
> > > > 1. Nothing in the logs
> > > > 2. No.
> > > >
> > > > On Thu, Feb 16, 2012 at 12:44 PM, Dmitry Kan 
> > > wrote:
> > > >
> > > > > 1. Do you see any errors / exceptions in the logs?
> > > > > 2. Could you have duplicates?
> > > > >
> > > > > On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev 
> > > wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I created a data-config.xml file where I define a datasource and
> an
> > > > > entity
> > > > > > with 12 fields.
> > > > > > In my use case I have 2 databases with the same schema, so I want
> > to
> > > > > > combine in one index the 2 databases.
> > > > > > I defined a second dataSource tag and duplicateed the entity with
> > its
> > > > > > field(changed the name and the datasource).
> > > > > > What I'm expecting is to get around 7k results(I have around 6k
> in
> > > the
> > > > > > first db and 1k in the second). However I'm getting a total of
> 2k.
> > > > > > Where could be the problem?
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Regards,
> > > > >
> > > > > Dmitry Kan
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Dmitry Kan
> > >
> >
>
>
>
> --
> Regards,
>
> Dmitry Kan
>


How to loop through the DataImportHandler query results?

2012-02-16 Thread K, Baraneetharan

Hi Solr community,

I'm new to Solr and DataImportHandler. I have a requirement to fetch data
from a database table and pass it to Solr.

Part of existing data-config.xml and solr schema.xml are given below,

data-config.xml

[excerpt stripped by the list archive; it defines an entity whose query selects
from the adap table, with one field tag per column]

Schema.xml

[excerpt stripped by the list archive; it declares the corresponding Solr fields]

The table used in the query (adap) is often modified; the number of columns in
this table changes frequently. Hence we have to change the data-config.xml
whenever a field is added or deleted.
To avoid that, we don't want to mention the column names in the field tags, but
instead want to write a query that maps all the fields in the table to Solr
fields even if we don't know how many columns are in the table. I need a kind of
loop which runs through all the query results and maps them to Solr fields.

Please help me.

Regards,
Baranee


Re: SolrCloud Replication Question

2012-02-16 Thread Mark Miller

On Feb 14, 2012, at 10:57 PM, Jamie Johnson wrote:

>  Not sure if this is
> expected or not.

Nope - should be already resolved or will be today though.

- Mark Miller
lucidimagination.com

Re: SolrCloud Replication Question

2012-02-16 Thread Jamie Johnson
Ok, great.  Just wanted to make sure someone was aware.  Thanks for
looking into this.

On Thu, Feb 16, 2012 at 8:26 AM, Mark Miller  wrote:
>
> On Feb 14, 2012, at 10:57 PM, Jamie Johnson wrote:
>
>>  Not sure if this is
>> expected or not.
>
> Nope - should be already resolved or will be today though.
>
> - Mark Miller
> lucidimagination.com


PatternReplaceFilterFactory group

2012-02-16 Thread O. Klein
PatternReplaceFilterFactory has no option to select the group to replace.

Is there a reason for this, or could this be a nice feature? 
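As far as I know, the usual workaround is to capture the parts you want to keep
and echo them back with group references in the replacement string, e.g. (a
sketch):

<filter class="solr.PatternReplaceFilterFactory"
        pattern="([a-z]+)-(\d+)" replacement="$1" replace="all"/>

which keeps group 1 and drops group 2, so in effect the group is selected
through the replacement rather than through a dedicated option.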

--
View this message in context: 
http://lucene.472066.n3.nabble.com/PatternReplaceFilterFactory-group-tp3750201p3750201.html
Sent from the Solr - User mailing list archive at Nabble.com.


custom scoring

2012-02-16 Thread Carlos Gonzalez-Cadenas
Hello all:

We'd like to score the matching documents using a combination of SOLR's IR
score with another application-specific score that we store within the
documents themselves (i.e. a float field containing the app-specific
score). In particular, we'd like to calculate the final score doing some
operations with both numbers (i.e product, sqrt, ...)

According to what we know, there are two ways to do this in SOLR:

A) Sort by function [1]: We've tested an expression like
"sort=product(score, query_score)" in the SOLR query, where score is the
common SOLR IR score and query_score is our own precalculated score, but it
seems that SOLR can only do this with stored/indexed fields (and obviously
"score" is not stored/indexed).

B) Function queries: We've used _val_ and function queries like max, sqrt
and query, and we've obtained the desired results from a functional point
of view. However, our index is quite large (400M documents) and the
performance degrades heavily, given that function queries are AFAIK
matching all the documents.

I have two questions:

1) Apart from the two options I mentioned, is there any other (simple) way
to achieve this that we're not aware of?

2) If we have to choose the function queries path, would it be very
difficult to modify the actual implementation so that it doesn't match all
the documents, that is, to pass a query so that it only operates over the
documents matching the query?. Looking at the FunctionQuery.java source
code, there's a comment that says "// instead of matching all docs, we
could also embed a query. the score could either ignore the subscore, or
boost it", which is giving us some hope that maybe it's possible and even
desirable to go in this direction. If you can give us some directions about
how to go about this, we may be able to do the actual implementation.

BTW, we're using Lucene/SOLR trunk.

Thanks a lot for your help.
Carlos

[1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
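A third option that may be worth checking (a sketch, not verified on an index of
that size): the boost query parser, which multiplies the IR score of the wrapped
query by a function value and only evaluates that function for documents that
actually match the query:

q={!boost b=sqrt(query_score) v=$qq}&qq=the user query here

Because the function is computed per matching document during scoring, it should
avoid the match-all behaviour of a bare function query.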


Re: problem to indexing pdf directory

2012-02-16 Thread Gora Mohanty
On 16 February 2012 14:33, alessio crisantemi
 wrote:
> Hi all,
> I have a problem to configure a pdf indexing from a directory in my solr
> wit DIH:
>
> with this data-config
>
>
> 
>  
>  
>      name="tika-test"
>    processor="FileListEntityProcessor"
>    baseDir="D:\gioconews_archivio\marzo2011"
>    fileName=".*pdf"
>    recursive="true"
>    rootEntity="false"
>    dataSource="null"/>
>   url="D:\gioconews_archivio\marzo2011" format="text" >
>   
>   
>     
>     
>
>     
>     
>  
>  
> 
[...]

You should look in your Solr logs for more details about
the exception, but as things stand, the above setup will
not work for indexing PDF files. You need Tika. Searching
Google for "solr tika index pdf" turns up many possibilities,
e.g.,
http://www.abcseo.com/tech/search/integrating-solr-and-tika
http://solr.pl/en/2011/04/04/indexing-files-like-doc-pdf-solr-and-tika-integration/
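For what it's worth, a minimal sketch of the usual FileListEntityProcessor +
TikaEntityProcessor combination (untested; the path is taken from the original
mail, the field names must match your schema, and the dataimporthandler-extras
and Tika jars need to be on the classpath):

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="D:\gioconews_archivio\marzo2011" fileName=".*pdf"
            recursive="true" rootEntity="false" dataSource="null">
      <entity name="tika" processor="TikaEntityProcessor" dataSource="bin"
              url="${files.fileAbsolutePath}" format="text">
        <!-- "text" is the body extracted by Tika -->
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>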

Regards,
Gora


Re: How to loop through the DataImportHandler query results?

2012-02-16 Thread Mikhail Khludnev
Hi Baranee,

Some time ago I played with
http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer - it was
pretty good stuff.
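A rough, untested sketch of that idea: a ScriptTransformer that copies every
column the query returns into a dynamic field, so data-config.xml never has to
list the columns (this assumes a *_t dynamic field is declared in schema.xml,
and the uniqueKey column would still need an explicit mapping):

<dataConfig>
  <dataSource driver="..." url="..." user="..." password=""/>
  <script><![CDATA[
    function mapAllColumns(row) {
      var names = row.keySet().toArray();
      for (var i = 0; i < names.length; i++) {
        // copy each DB column into a dynamic Solr field, e.g. NAME -> name_t
        row.put(names[i].toLowerCase() + '_t', row.get(names[i]));
      }
      return row;
    }
  ]]></script>
  <document>
    <entity name="adap" query="SELECT * FROM adap"
            transformer="script:mapAllColumns"/>
  </document>
</dataConfig>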

Regards


On Thu, Feb 16, 2012 at 3:53 PM, K, Baraneetharan wrote:

> To avoid that we don't want to mention the column names in the field tag ,
> but want to write a query to map all the fields in the table with solr
> fileds even if we don't know, how many columns are there in the table.  I
> need a kind of loop which runs through all the query results and map that
> with solr fileds.




-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics


 


Re: Solr soft commit feature

2012-02-16 Thread Nagendra Nagarajayya
The slaves will be able to replicate from the master as before, but not in NRT,
depending on your commit interval. The commit interval can be set higher for NRT
since it is not needed for searches except for consolidating the index changes
on the master, and can be an hour or even more. It may be easier to update the
slaves directly, as the update/query performance is high (replication in the
cloud in 4.0 also follows a similar paradigm, as the docs are sent across as a
whole to be replicated, so for now you may have to do this manually).


- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org


On 2/15/2012 8:35 AM, Dipti Srivastava wrote:

Hi Nagendra,

Certainly interesting! Would this work in a Master/slave setup where the
reads are from the slaves and all writes are to the master?

Regards,
Dipti Srivastava


On 2/15/12 5:40 AM, "Nagendra Nagarajayya"
wrote:


If you are looking for NRT functionality with Solr 3.5, you may want to
take a look at Solr 3.5 with RankingAlgorithm. This allows you to
add/update documents without a commit while being able to search
concurrently. The add/update performance to add 1m docs is about 5000
docs in about 498 ms  with one concurrent searcher. You can get more
information about Solr 3.5 with RankingAlgorithm from here:

http://tgels.org/wiki/en/Near_Real_Time_Search_ver_3.x

Regards,

- Nagendra Nagarajayya
http://solr-ra.tgels.org
http://rankingalgorithm.tgels.org

On 2/14/2012 4:41 PM, Dipti Srivastava wrote:

Hi All,
Is there a way to soft commit in the current released version of solr
3.5?

Regards,
Dipti Srivastava



Re: Can I rebuild an index and remove some fields?

2012-02-16 Thread Robert Stewart
I will test it with my big production indexes first, if it works I
will port to Java and add to contrib I think.

On Wed, Feb 15, 2012 at 10:03 PM, Li Li  wrote:
> great. I think you could make it a public tool. maybe others also need such
> functionality.
>
> On Thu, Feb 16, 2012 at 5:31 AM, Robert Stewart wrote:
>
>> I implemented an index shrinker and it works.  I reduced my test index
>> from 6.6 GB to 3.6 GB by removing a single shingled field I did not
>> need anymore.  I'm actually using Lucene.Net for this project so code
>> is C# using Lucene.Net 2.9.2 API.  But basic idea is:
>>
>> Create an IndexReader wrapper that only enumerates the terms you want
>> to keep, and that removes terms from documents when returning
>> documents.
>>
>> Use the SegmentMerger to re-write each segment (where each segment is
>> wrapped by the wrapper class), writing new segment to a new directory.
>> Collect the SegmentInfos and do a commit in order to create a new
>> segments file in new index directory
>>
>> Done - you now have a shrunk index with specified terms removed.
>>
>> Implementation uses separate thread for each segment, so it re-writes
>> them in parallel.  Took about 15 minutes to do 770,000 doc index on my
>> macbook.
>>
>>
>> On Tue, Feb 14, 2012 at 10:12 PM, Li Li  wrote:
>> > I have roughly read the codes of 4.0 trunk. maybe it's feasible.
>> >    SegmentMerger.add(IndexReader) will add to be merged Readers
>> >    merge() will call
>> >      mergeTerms(segmentWriteState);
>> >      mergePerDoc(segmentWriteState);
>> >
>> >   mergeTerms() will construct fields from IndexReaders
>> >    for(int
>> > readerIndex=0;readerIndex> >      final MergeState.IndexReaderAndLiveDocs r =
>> > mergeState.readers.get(readerIndex);
>> >      final Fields f = r.reader.fields();
>> >      final int maxDoc = r.reader.maxDoc();
>> >      if (f != null) {
>> >        slices.add(new ReaderUtil.Slice(docBase, maxDoc, readerIndex));
>> >        fields.add(f);
>> >      }
>> >      docBase += maxDoc;
>> >    }
>> >    So If you wrapper your IndexReader and override its fields() method,
>> > maybe it will work for merge terms.
>> >
>> >    for DocValues, it can also override AtomicReader.docValues(). just
>> > return null for fields you want to remove. maybe it should
>> > traverse CompositeReader's getSequentialSubReaders() and wrapper each
>> > AtomicReader
>> >
>> >    other things like term vectors norms are similar.
>> > On Wed, Feb 15, 2012 at 6:30 AM, Robert Stewart > >wrote:
>> >
>> >> I was thinking if I make a wrapper class that aggregates another
>> >> IndexReader and filter out terms I don't want anymore it might work.
>> And
>> >> then pass that wrapper into SegmentMerger.  I think if I filter out
>> terms
>> >> on GetFieldNames(...) and Terms(...) it might work.
>> >>
>> >> Something like:
>> >>
>> >> HashSet ignoredTerms=...;
>> >>
>> >> FilteringIndexReader wrapper=new FilterIndexReader(reader);
>> >>
>> >> SegmentMerger merger=new SegmentMerger(writer);
>> >>
>> >> merger.add(wrapper);
>> >>
>> >> merger.Merge();
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Feb 14, 2012, at 1:49 AM, Li Li wrote:
>> >>
>> >> > for method 2, delete is wrong. we can't delete terms.
>> >> >   you also should hack with the tii and tis file.
>> >> >
>> >> > On Tue, Feb 14, 2012 at 2:46 PM, Li Li  wrote:
>> >> >
>> >> >> method1, dumping data
>> >> >> for stored fields, you can traverse the whole index and save it to
>> >> >> somewhere else.
>> >> >> for indexed but not stored fields, it may be more difficult.
>> >> >>    if the indexed and not stored field is not analyzed(fields such as
>> >> >> id), it's easy to get from FieldCache.StringIndex.
>> >> >>    But for analyzed fields, though theoretically it can be restored
>> from
>> >> >> term vector and term position, it's hard to recover from index.
>> >> >>
>> >> >> method 2, hack with metadata
>> >> >> 1. indexed fields
>> >> >>      delete by query, e.g. field:*
>> >> >> 2. stored fields
>> >> >>       because all fields are stored sequentially. it's not easy to
>> >> delete
>> >> >> some fields. this will not affect search speed. but if you want to
>> get
>> >> >> stored fields,  and the useless fields are very long, then it will
>> slow
>> >> >> down.
>> >> >>       also it's possible to hack with it. but need more effort to
>> >> >> understand the index file format  and traverse the fdt/fdx file.
>> >> >>
>> >>
>> http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/fileformats.html
>> >> >>
>> >> >> this will give you some insight.
>> >> >>
>> >> >>
>> >> >> On Tue, Feb 14, 2012 at 6:29 AM, Robert Stewart <
>> bstewart...@gmail.com
>> >> >wrote:
>> >> >>
>> >> >>> Lets say I have a large index (100M docs, 1TB, split up between 10
>> >> >>> indexes).  And a bunch of the "stored" and "indexed" fields are not
>> >> used in
>> >> >>> search at all.  In order to save memory and disk, I'd like to
>> rebuild
>> >> that
>> >> >>> index *without* those fields, but I don't have o

Payload and exact search - 2

2012-02-16 Thread leonardo2
Hello,
I already posted this question, but for some reason it was attached to a
thread with a different topic.


Is it possible to perform an 'exact search' on a payload field?

I have to index text with auxiliary info for each word. In particular, each
word is associated with the bounding box containing it in the original PDF page
(it is used for highlighting the search terms in the PDF). I used the payload
to store that information.

In the schema.xml, the fieldType definition is: 

---
[fieldType definition stripped by the list archive]
---

while the field definition is: 

---
[field definition stripped by the list archive]
---

When indexing, the field 'words' contains a list of word|box as in the
following example: 

--- 
doc_id=example 
words={Fonte:|307.62,948.16,324.62,954.25 Comune|326.29,948.16,349.07,954.25
di|350.74,948.16,355.62,954.25 Bologna|358.95,948.16,381.28,954.25} 
--- 

This solution works well except in the case of an exact (phrase) search. For
example, assuming the only indexed doc is the 'example' doc shown above, the
query words:"Comune di Bologna" returns no results.

Does someone know if it is possible to perform an 'exact search' on a payload
field?

Thanks in advance, 
Leonardo

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Payload-and-exact-search-2-tp3750355p3750355.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Entity with multiple datasources

2012-02-16 Thread Dmitry Kan
I think the problem here is that you are actually trying to create separate
documents for two different tables, while your config is aiming to create
only one document. Here is one solution (not tried by me):

--
You can have multiple documents generated by the same data-config:

[example config stripped by the list archive]

It's the rootEntity="false" that makes the child entity a document.
--

Dmitry

On Thu, Feb 16, 2012 at 2:37 PM, Radu Toev  wrote:

> 
>   name="s"
> driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> url=""
> user=""
> password=""/>
>   name="p"
>  driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> url=""
> user=""
> password=""/>
>  
>datasource="s"
> query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
> m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as m_delivery_date,
> m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as m_warranty,
> m.contract as m_contract,
>   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
> sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
> c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code as
> m_c_code
>   FROM Machine AS m
>   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
>   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
>   LEFT JOIN Platform AS p ON m.fk_platform = p.id
>   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
>   LEFT JOIN Country AS c ON fk_country = c.id"
> readOnly="true"
> transformer="DateFormatTransformer">
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>   
>
>   datasource="p"
> query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
> m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as m_delivery_date,
> m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as m_warranty,
> m.contract as m_contract,
>   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
> sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
> c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code as
> m_c_code
>   FROM Machine AS m
>   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
>   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
>   LEFT JOIN Platform AS p ON m.fk_platform = p.id
>   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
>   LEFT JOIN Country AS c ON fk_country = c.id"
> readOnly="true"
> transformer="DateFormatTransformer">
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
>   
>  
> 
>
> I've removed the connection params
> The unique key is id.
>
> On Thu, Feb 16, 2012 at 2:27 PM, Dmitry Kan  wrote:
>
> > OK, maybe you can show the db-data-config.xml just in case?
> > Also in schema.xml, does you  correspond to the unique field
> in
> > the db?
> >
> > On Thu, Feb 16, 2012 at 2:13 PM, Radu Toev  wrote:
> >
> > > I tried running with just one datasource(the one that has 6k entries)
> and
> > > it indexes them ok.
> > > The same, if I do sepparately the 1k database. It indexes ok.
> > >
> > > On Thu, Feb 16, 2012 at 2:11 PM, Dmitry Kan 
> > wrote:
> > >
> > > > It sounds a bit, as if SOLR stopped processing data once it queried
> all
> > > > from the smaller dataset. That's why you have 2000. If you just have
> a
> > > > handler pointed to the bigger data set (6k), do you manage to get all
> > 6k
> > > db
> > > > entries into solr?
> > > >
> > > > On Thu, Feb 16, 2012 at 1:46 PM, Radu Toev 
> wrote:
> > > >
> > > > > 1. Nothing in the logs
> > > > > 2. No.
> > > > >
> > > > > On Thu, Feb 16, 2012 at 12:44 PM, Dmitry Kan  >
> > > > wrote:
> > > > >
> > > > > > 1. Do you see any errors / exceptions in the logs?
> > > > > > 2. Could you have duplicates?
> > > > > >
> > > > > > On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev 
> > > > wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > I created a data-config.xml file where I define a datasource
> and
> > an
> > > > > > entity
> > > > > > > with 12 fields.
> > > > > > > In my use case I have 2 databases with the same schema, so I
> want
> > > to
> > > > > > > combine in one index the 2 databases.
> > > > > > > I defined a second dataSource tag and duplicateed the entity
> with
> > > its
> > > > > > > field(changed the name and the datasource).
> > > > > > > What I'm expecting is to get around 7k results(I have around 6k
> > in
> > > > the
> > > > > > > first db and 1k in the second). However I'm getting a total of
> > 2k.
> > > > > > > Where could be the problem?
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > >
> > > > > > Dmitry Kan
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > Dmitry Kan
> > > >
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Dmitry Kan
> >
>



-- 
Regards,

Dmitry Kan


Re: Entity with multiple datasources

2012-02-16 Thread Radu Toev
I'm not sure I follow.
The idea is to have only one document. Do the multiple documents have the
same structure then (different datasources), and if so, how are they actually
indexed?

Thanks.

On Thu, Feb 16, 2012 at 4:40 PM, Dmitry Kan  wrote:

> I think the problem here is that initially you trying to create separate
> documents for two different tables, while your config is aiming to create
> only one document. Here there is one solution (not tried by me):
>
> --
> You can have multiple documents generated by the same data-config:
>
> 
>  
>  
>  
>  
>   
>   
>  
>   
>   
>  
>   
>  
> 
>
> It's the 'rootEntity="false" that makes the child entity a document.
> --
>
> Dmitry
>
> On Thu, Feb 16, 2012 at 2:37 PM, Radu Toev  wrote:
>
> > 
> >   > name="s"
> > driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> > url=""
> > user=""
> > password=""/>
> >   > name="p"
> >  driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> > url=""
> > user=""
> > password=""/>
> >  
> > >datasource="s"
> > query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
> > m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
> m_delivery_date,
> > m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
> m_warranty,
> > m.contract as m_contract,
> >   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
> > sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
> > c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code as
> > m_c_code
> >   FROM Machine AS m
> >   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
> >   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
> >   LEFT JOIN Platform AS p ON m.fk_platform = p.id
> >   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
> >   LEFT JOIN Country AS c ON fk_country = c.id"
> > readOnly="true"
> > transformer="DateFormatTransformer">
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> >   
> >
> >>datasource="p"
> > query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
> > m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
> m_delivery_date,
> > m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
> m_warranty,
> > m.contract as m_contract,
> >   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
> > sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
> > c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code as
> > m_c_code
> >   FROM Machine AS m
> >   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
> >   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
> >   LEFT JOIN Platform AS p ON m.fk_platform = p.id
> >   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
> >   LEFT JOIN Country AS c ON fk_country = c.id"
> > readOnly="true"
> > transformer="DateFormatTransformer">
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> >   
> >  
> > 
> >
> > I've removed the connection params
> > The unique key is id.
> >
> > On Thu, Feb 16, 2012 at 2:27 PM, Dmitry Kan 
> wrote:
> >
> > > OK, maybe you can show the db-data-config.xml just in case?
> > > Also in schema.xml, does you  correspond to the unique field
> > in
> > > the db?
> > >
> > > On Thu, Feb 16, 2012 at 2:13 PM, Radu Toev  wrote:
> > >
> > > > I tried running with just one datasource(the one that has 6k entries)
> > and
> > > > it indexes them ok.
> > > > The same, if I do sepparately the 1k database. It indexes ok.
> > > >
> > > > On Thu, Feb 16, 2012 at 2:11 PM, Dmitry Kan 
> > > wrote:
> > > >
> > > > > It sounds a bit, as if SOLR stopped processing data once it queried
> > all
> > > > > from the smaller dataset. That's why you have 2000. If you just
> have
> > a
> > > > > handler pointed to the bigger data set (6k), do you manage to get
> all
> > > 6k
> > > > db
> > > > > entries into solr?
> > > > >
> > > > > On Thu, Feb 16, 2012 at 1:46 PM, Radu Toev 
> > wrote:
> > > > >
> > > > > > 1. Nothing in the logs
> > > > > > 2. No.
> > > > > >
> > > > > > On Thu, Feb 16, 2012 at 12:44 PM, Dmitry Kan <
> dmitry@gmail.com
> > >
> > > > > wrote:
> > > > > >
> > > > > > > 1. Do you see any errors / exceptions in the logs?
> > > > > > > 2. Could you have duplicates?
> > > > > > >
> > > > > > > On Thu, Feb 16, 2012 at 10:15 AM, Radu Toev <
> radut...@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > I created a data-config.xml file where I define a datasource
> > and
> > > an
> > > > > > > entity
> > > > > > > > with 12 fields.
> > > > > > > > In my use case I have 2 databases with the same schema, so I
> > want
> > > > to
> > > > > > > > combine in one index the 2 databases.
> > > > > > > > I defined a second dataSource tag and duplicateed the entity
> > with
> > > > its
> > > > > > > > field(changed the name and the datasource).
> > > 

Re: problem to indexing pdf directory

2012-02-16 Thread alessio crisantemi
yes, but if I use TikaEntityProcessor the result of my full-import is:

  0
  1
  0
  Indexing failed. Rolled back all changes.

2012/2/16 alessio crisantemi 

> Hi all,
> I have a problem to configure a pdf indexing from a directory in my solr
> wit DIH:
>
> with this data-config
>
>
> 
>  
>  
>name="tika-test"
> processor="FileListEntityProcessor"
> baseDir="D:\gioconews_archivio\marzo2011"
> fileName=".*pdf"
> recursive="true"
> rootEntity="false"
> dataSource="null"/>
>url="D:\gioconews_archivio\marzo2011" format="text" >
>
>
>  
>  
>
>  
>  
>   
>  
> 
>
> I obtain this result:
>
>
>
>   full-import
>
>   idle
>
>   
>
> - 
>
>   0:0:2.44
>
>   0
>
>   43
>
>   0
>
>   2012-02-12 19:06:00
>
>   Indexing failed. Rolled back all changes.
>
>   2012-02-12 19:06:00
>   
>
>
> suggestions?
> thank you
> alessio
>


Re: Entity with multiple datasources

2012-02-16 Thread Dmitry Kan
Each document in SOLR will correspond to one db record and since both
databases have the same schema, you can't index two records from two
databases into the same SOLR document.

So after indexing, you should have 7k different documents, each of which
holds data from a db record.

Also, one problem I see here is that since the record id in each table is
unique only within the table and (most probably) not globally, there will
be collisions. To avoid this, I would prepend the record id with some static
value, like: concat("t1", CONVERT(id, CHAR(8))).
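In the config quoted above that would mean changing the select list of each
entity along these lines (a sketch; SQL Server syntax assumed, since the driver
is the sqlserver JDBC driver):

query="SELECT 's_' + CONVERT(VARCHAR(16), m.id) AS id, m.serial AS m_machine_serial, ... FROM Machine AS m ..."

with a different prefix (e.g. 'p_') in the second entity, so the two databases
can never collide on the uniqueKey.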

Dmitry

On Thu, Feb 16, 2012 at 4:47 PM, Radu Toev  wrote:

> I'm not sure I follow.
> The idea is to have only one document. Do the multiple documents have the
> same structure then(different datasources), and if so how are they actually
> indexed?
>
> Thanks.
>
> On Thu, Feb 16, 2012 at 4:40 PM, Dmitry Kan  wrote:
>
> > I think the problem here is that initially you trying to create separate
> > documents for two different tables, while your config is aiming to create
> > only one document. Here there is one solution (not tried by me):
> >
> > --
> > You can have multiple documents generated by the same data-config:
> >
> > 
> >  
> >  
> >  
> >  
> >   
> >   
> >  
> >   
> >   
> >  
> >   
> >  
> > 
> >
> > It's the 'rootEntity="false" that makes the child entity a document.
> > --
> >
> > Dmitry
> >
> > On Thu, Feb 16, 2012 at 2:37 PM, Radu Toev  wrote:
> >
> > > 
> > >   > > name="s"
> > > driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> > > url=""
> > > user=""
> > > password=""/>
> > >   > > name="p"
> > >  driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> > > url=""
> > > user=""
> > > password=""/>
> > >  
> > > > >datasource="s"
> > > query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
> > > m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
> > m_delivery_date,
> > > m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
> > m_warranty,
> > > m.contract as m_contract,
> > >   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
> > > sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
> > > c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code
> as
> > > m_c_code
> > >   FROM Machine AS m
> > >   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
> > >   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
> > >   LEFT JOIN Platform AS p ON m.fk_platform = p.id
> > >   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
> > >   LEFT JOIN Country AS c ON fk_country = c.id"
> > > readOnly="true"
> > > transformer="DateFormatTransformer">
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > >   
> > >
> > >> >datasource="p"
> > > query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
> > > m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
> > m_delivery_date,
> > > m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
> > m_warranty,
> > > m.contract as m_contract,
> > >   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
> > > sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
> > > c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code
> as
> > > m_c_code
> > >   FROM Machine AS m
> > >   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
> > >   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
> > >   LEFT JOIN Platform AS p ON m.fk_platform = p.id
> > >   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
> > >   LEFT JOIN Country AS c ON fk_country = c.id"
> > > readOnly="true"
> > > transformer="DateFormatTransformer">
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > >   
> > >  
> > > 
> > >
> > > I've removed the connection params
> > > The unique key is id.
> > >
> > > On Thu, Feb 16, 2012 at 2:27 PM, Dmitry Kan 
> > wrote:
> > >
> > > > OK, maybe you can show the db-data-config.xml just in case?
> > > > Also in schema.xml, does you  correspond to the unique
> field
> > > in
> > > > the db?
> > > >
> > > > On Thu, Feb 16, 2012 at 2:13 PM, Radu Toev 
> wrote:
> > > >
> > > > > I tried running with just one datasource(the one that has 6k
> entries)
> > > and
> > > > > it indexes them ok.
> > > > > The same, if I do sepparately the 1k database. It indexes ok.
> > > > >
> > > > > On Thu, Feb 16, 2012 at 2:11 PM, Dmitry Kan 
> > > > wrote:
> > > > >
> > > > > > It sounds a bit, as if SOLR stopped processing data once it
> queried
> > > all
> > > > > > from the smaller dataset. That's why you have 2000. If you just
> > have
> > > a
> > > > > > handler pointed to the bigger data set (6k), do you manage to get
> > all
> > > > 6k
> > > > > db
> > > > > > entries into solr?
> > > > > >
> > > > > > On Thu, Feb 16, 2012 at 1:46 PM

Re: custom scoring

2012-02-16 Thread Em
Hello Carlos,

could you show us what your Solr call looks like?

Regards,
Em

Am 16.02.2012 14:34, schrieb Carlos Gonzalez-Cadenas:
> Hello all:
> 
> We'd like to score the matching documents using a combination of SOLR's IR
> score with another application-specific score that we store within the
> documents themselves (i.e. a float field containing the app-specific
> score). In particular, we'd like to calculate the final score doing some
> operations with both numbers (i.e product, sqrt, ...)
> 
> According to what we know, there are two ways to do this in SOLR:
> 
> A) Sort by function [1]: We've tested an expression like
> "sort=product(score, query_score)" in the SOLR query, where score is the
> common SOLR IR score and query_score is our own precalculated score, but it
> seems that SOLR can only do this with stored/indexed fields (and obviously
> "score" is not stored/indexed).
> 
> B) Function queries: We've used _val_ and function queries like max, sqrt
> and query, and we've obtained the desired results from a functional point
> of view. However, our index is quite large (400M documents) and the
> performance degrades heavily, given that function queries are AFAIK
> matching all the documents.
> 
> I have two questions:
> 
> 1) Apart from the two options I mentioned, is there any other (simple) way
> to achieve this that we're not aware of?
> 
> 2) If we have to choose the function queries path, would it be very
> difficult to modify the actual implementation so that it doesn't match all
> the documents, that is, to pass a query so that it only operates over the
> documents matching the query?. Looking at the FunctionQuery.java source
> code, there's a comment that says "// instead of matching all docs, we
> could also embed a query. the score could either ignore the subscore, or
> boost it", which is giving us some hope that maybe it's possible and even
> desirable to go in this direction. If you can give us some directions about
> how to go about this, we may be able to do the actual implementation.
> 
> BTW, we're using Lucene/SOLR trunk.
> 
> Thanks a lot for your help.
> Carlos
> 
> [1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
> 


Re: Entity with multiple datasources

2012-02-16 Thread Radu Toev
Really good point on the ids, I completely overlooked that matter.
I will give it a try.
Thanks again.

On Thu, Feb 16, 2012 at 5:00 PM, Dmitry Kan  wrote:

> Each document in SOLR will correspond to one db record and since both
> databases have the same schema, you can't index two records from two
> databases into the same SOLR document.
>
> So after indexing, you should have 7k different documents, each of which
> holds data from a db record.
>
> Also one problem I see here is that since the record id in each table is
> unique only within the table and (most probably) not globally, there will
> be collisions. To aviod this, I would prepend a record_id with some static
> value, like: concat("t1",  CONVERT(id, CHAR(8))).
>
> Dmitry
>
> On Thu, Feb 16, 2012 at 4:47 PM, Radu Toev  wrote:
>
> > I'm not sure I follow.
> > The idea is to have only one document. Do the multiple documents have the
> > same structure then(different datasources), and if so how are they
> actually
> > indexed?
> >
> > Thanks.
> >
> > On Thu, Feb 16, 2012 at 4:40 PM, Dmitry Kan 
> wrote:
> >
> > > I think the problem here is that initially you trying to create
> separate
> > > documents for two different tables, while your config is aiming to
> create
> > > only one document. Here there is one solution (not tried by me):
> > >
> > > --
> > > You can have multiple documents generated by the same data-config:
> > >
> > > 
> > >  
> > >  
> > >  
> > >  
> > >   
> > >   
> > >  
> > >   
> > >   
> > >  
> > >   
> > >  
> > > 
> > >
> > > It's the 'rootEntity="false" that makes the child entity a document.
> > > --
> > >
> > > Dmitry
> > >
> > > On Thu, Feb 16, 2012 at 2:37 PM, Radu Toev  wrote:
> > >
> > > > 
> > > >   > > > name="s"
> > > > driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> > > > url=""
> > > > user=""
> > > > password=""/>
> > > >   > > > name="p"
> > > >  driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> > > > url=""
> > > > user=""
> > > > password=""/>
> > > >  
> > > > > > >datasource="s"
> > > > query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
> > > > m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
> > > m_delivery_date,
> > > > m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
> > > m_warranty,
> > > > m.contract as m_contract,
> > > >   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
> > > > sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
> > > > c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code
> > as
> > > > m_c_code
> > > >   FROM Machine AS m
> > > >   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
> > > >   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
> > > >   LEFT JOIN Platform AS p ON m.fk_platform = p.id
> > > >   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
> > > >   LEFT JOIN Country AS c ON fk_country = c.id"
> > > > readOnly="true"
> > > > transformer="DateFormatTransformer">
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > >   
> > > >
> > > >> > >datasource="p"
> > > > query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
> > > > m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
> > > m_delivery_date,
> > > > m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
> > > m_warranty,
> > > > m.contract as m_contract,
> > > >   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
> > > > sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
> > > > c.clusterMinor as m_c_cluster_minor, c.country as m_c_country, c.code
> > as
> > > > m_c_code
> > > >   FROM Machine AS m
> > > >   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
> > > >   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
> > > >   LEFT JOIN Platform AS p ON m.fk_platform = p.id
> > > >   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
> > > >   LEFT JOIN Country AS c ON fk_country = c.id"
> > > > readOnly="true"
> > > > transformer="DateFormatTransformer">
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > >   
> > > >  
> > > > 
> > > >
> > > > I've removed the connection params
> > > > The unique key is id.
> > > >
> > > > On Thu, Feb 16, 2012 at 2:27 PM, Dmitry Kan 
> > > wrote:
> > > >
> > > > > OK, maybe you can show the db-data-config.xml just in case?
> > > > > Also in schema.xml, does your  correspond to the unique
> > field
> > > > in
> > > > > the db?
> > > > >
> > > > > On Thu, Feb 16, 2012 at 2:13 PM, Radu Toev 
> > wrote:
> > > > >
> > > > > > I tried running with just one datasource (the one that has 6k
> > entries)
> > > > and
> > > > > > it indexes them ok.
> > > > > > The same, if I do separately the

Re: Entity with multiple datasources

2012-02-16 Thread Dmitry Kan
no problem, hope it helps, you're welcome.

On Thu, Feb 16, 2012 at 5:03 PM, Radu Toev  wrote:

> Really good point on the ids, I completely overlooked that matter.
> I will give it a try.
> Thanks again.
>
> On Thu, Feb 16, 2012 at 5:00 PM, Dmitry Kan  wrote:
>
> > Each document in SOLR will correspond to one db record and since both
> > databases have the same schema, you can't index two records from two
> > databases into the same SOLR document.
> >
> > So after indexing, you should have 7k different documents, each of which
> > holds data from a db record.
> >
> > Also one problem I see here is that since the record id in each table is
> > unique only within the table and (most probably) not globally, there will
> > be collisions. To avoid this, I would prepend a record_id with some
> static
> > value, like: concat("t1",  CONVERT(id, CHAR(8))).
> >
> > Dmitry
> >
> > On Thu, Feb 16, 2012 at 4:47 PM, Radu Toev  wrote:
> >
> > > I'm not sure I follow.
> > > The idea is to have only one document. Do the multiple documents have
> the
> > > same structure then(different datasources), and if so how are they
> > actually
> > > indexed?
> > >
> > > Thanks.
> > >
> > > On Thu, Feb 16, 2012 at 4:40 PM, Dmitry Kan 
> > wrote:
> > >
> > > > I think the problem here is that initially you trying to create
> > separate
> > > > documents for two different tables, while your config is aiming to
> > create
> > > > only one document. Here there is one solution (not tried by me):
> > > >
> > > > --
> > > > You can have multiple documents generated by the same data-config:
> > > >
> > > > 
> > > >  
> > > >  
> > > >  
> > > >  
> > > >   
> > > >   
> > > >  
> > > >   
> > > >   
> > > >  
> > > >   
> > > >  
> > > > 
> > > >
> > > > It's the 'rootEntity="false" that makes the child entity a document.
> > > > --
> > > >
> > > > Dmitry
> > > >
> > > > On Thu, Feb 16, 2012 at 2:37 PM, Radu Toev 
> wrote:
> > > >
> > > > > 
> > > > >   > > > > name="s"
> > > > > driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> > > > > url=""
> > > > > user=""
> > > > > password=""/>
> > > > >   > > > > name="p"
> > > > >  driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
> > > > > url=""
> > > > > user=""
> > > > > password=""/>
> > > > >  
> > > > > > > > >datasource="s"
> > > > > query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
> > > > > m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
> > > > m_delivery_date,
> > > > > m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
> > > > m_warranty,
> > > > > m.contract as m_contract,
> > > > >   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
> > > > > sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
> > > > > c.clusterMinor as m_c_cluster_minor, c.country as m_c_country,
> c.code
> > > as
> > > > > m_c_code
> > > > >   FROM Machine AS m
> > > > >   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
> > > > >   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
> > > > >   LEFT JOIN Platform AS p ON m.fk_platform = p.id
> > > > >   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
> > > > >   LEFT JOIN Country AS c ON fk_country = c.id"
> > > > > readOnly="true"
> > > > > transformer="DateFormatTransformer">
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > >   
> > > > >
> > > > >> > > >datasource="p"
> > > > > query="SELECT m.id as id, m.serial as m_machine_serial, m.ivk as
> > > > > m_machine_ivk, m.sitename as m_sitename, m.deliveryDate as
> > > > m_delivery_date,
> > > > > m.hotsite as m_hotsite, m.guardian as m_guardian, m.warranty as
> > > > m_warranty,
> > > > > m.contract as m_contract,
> > > > >   st.name as m_st_name, pm.name as m_pm_name, p.name as m_p_name,
> > > > > sv.shortName as m_sv_name, c.clusterMajor as m_c_cluster_major,
> > > > > c.clusterMinor as m_c_cluster_minor, c.country as m_c_country,
> c.code
> > > as
> > > > > m_c_code
> > > > >   FROM Machine AS m
> > > > >   LEFT JOIN SystemType AS st ON m.fk_systemType=st.id
> > > > >   LEFT JOIN ProductModel AS pm ON fk_productModel = pm.id
> > > > >   LEFT JOIN Platform AS p ON m.fk_platform = p.id
> > > > >   LEFT JOIN SoftwareVersion AS sv ON fk_softwareVersion = sv.id
> > > > >   LEFT JOIN Country AS c ON fk_country = c.id"
> > > > > readOnly="true"
> > > > > transformer="DateFormatTransformer">
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > >   
> > > > >  
> > > > > 
> > > > >
> > > > > I've removed the connection params
> > > > > The unique key is id.
> > > > >
> > > > > On Thu, Feb 16, 2012 at 2:27 PM, Dmitry Kan 
> > > > wrote:
> > > > 

Frequent garbage collections after a day of operation

2012-02-16 Thread Matthias Käppler
Hey everyone,

we're running into some operational problems with our SOLR production
setup here and were wondering if anyone else is affected or has even
solved these problems before. We're running a vanilla SOLR 3.4.0 in
several Tomcat 6 instances, so nothing out of the ordinary, but after
a day or so of operation we see increased response times from SOLR, up
to 3 times higher on average. During this time we see increased CPU
load due to heavy garbage collection in the JVM, which bogs down the
whole system, so throughput decreases, naturally. When restarting
the slaves, everything goes back to normal, but that's more like a
brute force solution.

The thing is, we don't know what's causing this and we don't have that
much experience with Java stacks since we're for most parts a Rails
company. Are Tomcat 6 or SOLR known to leak memory? Is anyone else
seeing this, or can you think of a reason for this? Most of our
queries to SOLR involve the DismaxHandler and the spatial search query
components. We don't use any custom request handlers so far.

Thanks in advance,
-Matthias

-- 
Matthias Käppler
Lead Developer API & Mobile

Qype GmbH
Großer Burstah 50-52
20457 Hamburg
Telephone: +49 (0)40 - 219 019 2 - 160
Skype: m_kaeppler
Email: matth...@qype.com

Managing Director: Ian Brotherston
Amtsgericht Hamburg
HRB 95913

This e-mail and its attachments may contain confidential and/or
privileged information. If you are not the intended recipient (or have
received this e-mail in error) please notify the sender immediately
and destroy this e-mail and its attachments. Any unauthorized copying,
disclosure or distribution of this e-mail and  its attachments is
strictly forbidden. This notice also applies to future messages.


RE: PatternReplaceFilterFactory group

2012-02-16 Thread Steven A Rowe
Hi O.,

PatternReplaceFilter(Factory) uses Matcher.replaceAll() or replaceFirst(), both 
of which take in a string that can include any or all groups using the syntax 
"$n", where n is the group number.  See the Matcher.appendReplacement() 
javadocs for an explanation of the functionality and syntax.
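
For example, the same "$n" semantics in plain java.util.regex (a quick
illustrative sketch, not taken from the filter's source; pattern and input
are made up):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupReplaceDemo {
    public static void main(String[] args) {
        // Group 1 captures the digits, group 2 the unit; "$1" keeps only group 1.
        Pattern p = Pattern.compile("(\\d+)(px)");
        Matcher m = p.matcher("width 100px height 50px");
        System.out.println(m.replaceAll("$1")); // prints: width 100 height 50
    }
}

The same "$1" string can go into the replacement attribute of
PatternReplaceFilterFactory.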


Steve

> -Original Message-
> From: O. Klein [mailto:kl...@octoweb.nl]
> Sent: Thursday, February 16, 2012 8:34 AM
> To: solr-user@lucene.apache.org
> Subject: PatternReplaceFilterFactory group
> 
> PatternReplaceFilterFactory has no option to select the group to replace.
> 
> Is there a reason for this, or could this be a nice feature?
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/PatternReplaceFilterFactory-group-
> tp3750201p3750201.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: custom scoring

2012-02-16 Thread Carlos Gonzalez-Cadenas
Hello Em:

The URL is quite large (w/ shards, ...), maybe it's best if I paste the
relevant parts.

Our "q" parameter is:

  
"q":"_val_:\"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)\"",

The subqueries q8, q7, q4 and q3 are regular queries, for example:

"q7":"stopword_phrase:colomba~1 AND stopword_phrase:santa AND
wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR
(stopword_phrase:las AND stopword_phrase:de)"

We've executed the subqueries q3-q8 independently and they're very fast,
but when we introduce the function queries as described below, it all goes
10X slower.

Let me know if you need anything else.

Thanks
Carlos


Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


On Thu, Feb 16, 2012 at 4:02 PM, Em  wrote:

> Hello carlos,
>
> could you show us how your Solr-call looks like?
>
> Regards,
> Em
>
> Am 16.02.2012 14:34, schrieb Carlos Gonzalez-Cadenas:
> > Hello all:
> >
> > We'd like to score the matching documents using a combination of SOLR's
> IR
> > score with another application-specific score that we store within the
> > documents themselves (i.e. a float field containing the app-specific
> > score). In particular, we'd like to calculate the final score doing some
> > operations with both numbers (i.e product, sqrt, ...)
> >
> > According to what we know, there are two ways to do this in SOLR:
> >
> > A) Sort by function [1]: We've tested an expression like
> > "sort=product(score, query_score)" in the SOLR query, where score is the
> > common SOLR IR score and query_score is our own precalculated score, but
> it
> > seems that SOLR can only do this with stored/indexed fields (and
> obviously
> > "score" is not stored/indexed).
> >
> > B) Function queries: We've used _val_ and function queries like max, sqrt
> > and query, and we've obtained the desired results from a functional point
> > of view. However, our index is quite large (400M documents) and the
> > performance degrades heavily, given that function queries are AFAIK
> > matching all the documents.
> >
> > I have two questions:
> >
> > 1) Apart from the two options I mentioned, is there any other (simple)
> way
> > to achieve this that we're not aware of?
> >
> > 2) If we have to choose the function queries path, would it be very
> > difficult to modify the actual implementation so that it doesn't match
> all
> > the documents, that is, to pass a query so that it only operates over the
> > documents matching the query?. Looking at the FunctionQuery.java source
> > code, there's a comment that says "// instead of matching all docs, we
> > could also embed a query. the score could either ignore the subscore, or
> > boost it", which is giving us some hope that maybe it's possible and even
> > desirable to go in this direction. If you can give us some directions
> about
> > how to go about this, we may be able to do the actual implementation.
> >
> > BTW, we're using Lucene/SOLR trunk.
> >
> > Thanks a lot for your help.
> > Carlos
> >
> > [1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
> >
>


Re: How to loop through the DataImportHandler query results?

2012-02-16 Thread Chantal Ackermann
If your script turns out too complex to maintain, and you are developing
in Java anyway, you could extend EntityProcessor and handle the data in
a custom way. I've done that to transform a datamart-like data structure
back into a row-based one.

Basically you override the method that gets the data in a Map and
transform it into a different Map which contains the fields as
understood by your schema.
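
Roughly along these lines -- an untested sketch against the 3.x DIH API,
with hypothetical column names ("FIELD_NAME"/"FIELD_VALUE") standing in
for the datamart layout:

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.handler.dataimport.SqlEntityProcessor;

public class PivotingEntityProcessor extends SqlEntityProcessor {
    @Override
    public Map<String, Object> nextRow() {
        Map<String, Object> row = super.nextRow();
        if (row == null) return null;  // end of the result set
        // Turn key/value style rows into the flat field names the schema expects.
        Map<String, Object> out = new HashMap<String, Object>();
        out.put(String.valueOf(row.get("FIELD_NAME")).toLowerCase(),
                row.get("FIELD_VALUE"));
        return out;
    }
}

The class is then referenced via the processor attribute of the entity.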

Chantal


On Thu, 2012-02-16 at 14:59 +0100, Mikhail Khludnev wrote:
> Hi Baranee,
> 
> Some time ago I played with
> http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer - it was a
> pretty good stuff.
> 
> Regards
> 
> 
> On Thu, Feb 16, 2012 at 3:53 PM, K, Baraneetharan 
> wrote:
> 
> > To avoid that we don't want to mention the column names in the field tag,
> > but want to write a query to map all the fields in the table with solr
> > fields even if we don't know how many columns are there in the table.  I
> > need a kind of loop which runs through all the query results and map that
> > with solr fields.
> 
> 
> 
> 



Re: Frequent garbage collections after a day of operation

2012-02-16 Thread Chantal Ackermann
Make sure your Tomcat instances are started each with a max heap size
that adds up to something a lot lower than the complete RAM of your
system.

Frequent Garbage collection means that your applications request more
RAM but your Java VM has no more resources, so it requires the Garbage
Collector to free memory so that the requested new objects can be
created. It's not indicating a memory leak unless you are running a
custom EntityProcessor in DIH that runs into an infinite loop and
creates huge amounts of schema fields. ;-)

Also - if you are doing hot deploys on Tomcat, you will have to restart
the Tomcat instance on a regular basis as hot deploys DO leak memory
after a while. (You might be seeing class undeploy messages in
catalina.out and later on OutOfMemory error messages.)

If this is not of any help you will probably have to provide a bit more
information on your Tomcat and SOLR configuration setup.

Chantal


On Thu, 2012-02-16 at 16:22 +0100, Matthias Käppler wrote:
> Hey everyone,
> 
> we're running into some operational problems with our SOLR production
> setup here and were wondering if anyone else is affected or has even
> solved these problems before. We're running a vanilla SOLR 3.4.0 in
> several Tomcat 6 instances, so nothing out of the ordinary, but after
> a day or so of operation we see increased response times from SOLR, up
> to 3 times higher on average. During this time we see increased CPU
> load due to heavy garbage collection in the JVM, which bogs down the
> whole system, so throughput decreases, naturally. When restarting
> the slaves, everything goes back to normal, but that's more like a
> brute force solution.
> 
> The thing is, we don't know what's causing this and we don't have that
> much experience with Java stacks since we're for most parts a Rails
> company. Are Tomcat 6 or SOLR known to leak memory? Is anyone else
> seeing this, or can you think of a reason for this? Most of our
> queries to SOLR involve the DismaxHandler and the spatial search query
> components. We don't use any custom request handlers so far.
> 
> Thanks in advance,
> -Matthias
> 



RE: PatternReplaceFilterFactory group

2012-02-16 Thread O. Klein

steve_rowe wrote
> 
> Hi O.,
> 
> PatternReplaceFilter(Factory) uses Matcher.replaceAll() or replaceFirst(),
> both of which take in a string that can include any or all groups using
> the syntax "$n", where n is the group number.  See the
> Matcher.appendReplacement() javadocs for an explanation of the
> functionality and syntax:
> ;
> 
> Steve
> 
>> -Original Message-
>> From: O. Klein [mailto:klein@]
>> Sent: Thursday, February 16, 2012 8:34 AM
>> To: solr-user@.apache
>> Subject: PatternReplaceFilterFactory group
>> 
>> PatternReplaceFilterFactory has no option to select the group to replace.
>> 
>> Is there a reason for this, or could this be a nice feature?
>> 
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/PatternReplaceFilterFactory-group-
>> tp3750201p3750201.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
> 

Thanks. I should get it working then.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/PatternReplaceFilterFactory-group-tp3750201p3750650.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: is it possible to run deltaimport command with out delta query?

2012-02-16 Thread Shawn Heisey

On 2/15/2012 11:26 PM, nagarjuna wrote:

hi all..
   I am new to Solr. Can anybody explain to me the delta-import and
delta query? I also have the questions below:
1. Is it possible to run deltaimport without a deltaQuery?
2. Is it possible to write a delta query without having a last_modified column
in the database? If yes, please explain.


Assuming I understand what you're asking:

Define deltaImportQuery to be the same as query, then set deltaQuery to 
something that always returns some kind of value in the field you have 
designated as your primary key.  The data doesn't have to be relevant to 
anything at all, it just needs to return something for the primary key 
field.  Here's what I have in mine, my pk is did:


  deltaQuery="SELECT 1 AS did"

If you wish, you can completely ignore lastModified and track your own 
information about what data is new, then pass parameters via the 
dataimport handler URL to be used in your queries.  This is what both my 
query and deltaImportQuery are set to:


SELECT * FROM ${dataimporter.request.dataView}
WHERE (
  (
did > ${dataimporter.request.minDid}
AND did <= ${dataimporter.request.maxDid}
  )
  ${dataimporter.request.extraWhere}
) AND (crc32(did) % ${dataimporter.request.numShards})
  IN (${dataimporter.request.modVal})
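
A rough, untested SolrJ sketch of passing such custom parameters to the
handler (URL, handler path and values are illustrative, and it assumes the
default /select handler dispatches on qt):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.params.ModifiableSolrParams;

public class TriggerDeltaImport {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        ModifiableSolrParams p = new ModifiableSolrParams();
        p.set("qt", "/dataimport");       // route the request to the DIH handler
        p.set("command", "delta-import");
        p.set("clean", "false");
        p.set("minDid", "1000000");       // becomes ${dataimporter.request.minDid}
        p.set("maxDid", "1010000");       // becomes ${dataimporter.request.maxDid}
        server.query(p);
    }
}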

Thanks,
Shawn



Re: problem to indexing pdf directory

2012-02-16 Thread alessio crisantemi
here the log:


org.apache.solr.handler.dataimport.DataImporter doFullImport
Grave: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' is
a required attribute Processing Document # 1
 at
org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:117)
 at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)
 at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)
 at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
 at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
 at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
 at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
 at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
feb 12, 2012 7:06:00 PM org.apache.solr.update.DirectUpdateHandler2 rollback
Informazioni: start rollback
feb 12, 2012 7:06:00 PM org.apache.solr.update.DirectUpdateHandler2 rollback
Informazioni: end_rollback
feb 12, 2012 7:06:02 PM org.apache.solr.handler.dataimport.DataImporter
doFullImport
Informazioni: Starting Full Import
feb 12, 2012 7:06:02 PM org.apache.solr.core.SolrCore execute
Informazioni: [] webapp=/solr path=/select
params={clean=false&commit=true&command=full-import&qt=/dataimport}
status=0 QTime=16
feb 12, 2012 7:06:02 PM org.apache.solr.handler.dataimport.SolrWriter
readIndexerProperties
Informazioni: Read dataimport.properties
feb 12, 2012 7:06:02 PM org.apache.solr.handler.dataimport.DataImporter
doFullImport
Grave: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' is
a required attribute Processing Document # 1
 at
org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:117)
 at
org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)
 at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:319)
 at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:242)
 at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:180)
 at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:331)
 at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:389)
 at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:370)
feb 12, 2012 7:06:02 PM org.apache.solr.update.DirectUpdateHandler2 rollback
Informazioni: start rollback
feb 12, 2012 7:06:02 PM org.apache.solr.update.DirectUpdateHandler2 rollback
Informazioni: end_rollback
feb 12, 2012 7:06:42 PM org.apache.coyote.AbstractProtocol pause
Informazioni: Pausing ProtocolHandler ["http-bio-8983"]
feb 12, 2012 7:06:42 PM org.apache.coyote.AbstractProtocol pause
Informazioni: Pausing ProtocolHandler ["ajp-bio-8009"]
feb 12, 2012 7:06:42 PM org.apache.catalina.core.StandardService
stopInternal
Informazioni: Stopping service Catalina
feb 12, 2012 7:06:42 PM org.apache.solr.core.SolrCore close
Informazioni: []  CLOSING SolrCore org.apache.solr.core.SolrCore@7d1217
feb 12, 2012 7:06:42 PM org.apache.solr.core.SolrCore closeSearcher
Informazioni: [] Closing main searcher on request.
feb 12, 2012 7:06:42 PM org.apache.solr.search.SolrIndexSearcher close
Informazioni: Closing Searcher@19fabda main
 
fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
 
filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
 
queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=2,evictions=0,size=2,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
 
documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
feb 12, 2012 7:06:42 PM org.apache.solr.update.DirectUpdateHandler2 close
Informazioni: closing
DirectUpdateHandler2{commits=0,autocommits=0,optimizes=0,rollbacks=4,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0}
feb 12, 2012 7:06:42 PM org.apache.solr.update.DirectUpdateHandler2 close
Informazioni: closed
DirectUpdateHandler2{commits=0,autocommits=0,optimizes=0,rollbacks=4,expungeDeletes=0,docsPending=0,adds=0,deletesById=0,deletesByQuery=0,errors=0,cumulative_adds=0,cumulative_deletesById=0,cumulative_deletesByQuery=0,cumulative_errors=0}
feb 12, 2012 7:06:42 PM org.apache.coyote.AbstractProtocol stop
Informazioni: Stopping Protoco

Re: problem to indexing pdf directory

2012-02-16 Thread Gora Mohanty
On 16 February 2012 21:37, alessio crisantemi
 wrote:
> here the log:
>
>
> org.apache.solr.handler.dataimport.DataImporter doFullImport
> Grave: Full Import failed
> org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir' is
> a required attribute Processing Document # 1
[...]

The exception message above is pretty clear. You need to define a
baseDir attribute for the second entity.

However, even if you fix this, the setup will *not* work for indexing
PDFs. Did you read the URLs that I sent earlier?

Regards,
Gora


Re: Setting solrj server connection timeout

2012-02-16 Thread Shawn Heisey

On 2/3/2012 1:12 PM, Shawn Heisey wrote:
Is the following a reasonable approach to setting a connection timeout 
with SolrJ?


queryCore.getHttpClient().getHttpConnectionManager().getParams()
.setConnectionTimeout(15000);
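
Equivalently, CommonsHttpSolrServer exposes per-instance setters for both
timeouts -- a minimal sketch, the timeout values here are just examples:

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class TimeoutSetup {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        server.setConnectionTimeout(15000); // ms to establish the TCP connection
        server.setSoTimeout(120000);        // ms to wait for data on the socket
    }
}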

Right now I have all my solr server objects sharing a single 
HttpClient that gets created using the multithreaded connection 
manager, where I set the timeout for all of them.  Now I will be 
letting each server object create its own HttpClient object, and using 
the above statement to set the timeout on each one individually.  
It'll use up a bunch more memory, as there are 56 server objects, but 
maybe it'll work better.  The total of 56 objects comes about from 7 
shards, a build core and a live core per shard, two complete index 
chains, and for each of those, one server object for access to 
CoreAdmin and another for the index.


The impetus for this, as it's possible I'm stating an XY problem: 
Currently I have an occasional problem where SolrJ connections throw 
an exception.  When it happens, nothing is logged in Solr.  My code is 
smart enough to notice the problem, send an email alert, and simply 
try again at the top of the next minute.  The simple explanation is 
that this is a Linux networking problem, but I never had any problem 
like this when I was using Perl with LWP to keep my index up to date.  
I sent a message to the list some time ago on this exception, but I 
never got a response that helped me figure it out.


Caused by: org.apache.solr.client.solrj.SolrServerException: 
java.net.SocketException: Connection reset


at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:480)


at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:246)


at 
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)


at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:276)

at com.newscom.idxbuild.solr.Core.getCount(Core.java:325)

... 3 more

Caused by: java.net.SocketException: Connection reset

at java.net.SocketInputStream.read(SocketInputStream.java:168)

at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)

at java.io.BufferedInputStream.read(BufferedInputStream.java:237)

at 
org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:78)


at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:106)

at 
org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1116)


at 
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1413)


at 
org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1973)


at 
org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1735)


at 
org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1098)


at 
org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:398)


at 
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)


at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)


at 
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)


at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:424)


... 7 more


No response in quite some time, so I'm bringing it up again.  I brought 
up the Exception issue before, and though I did get some responses, I 
didn't feel that I got an answer.


http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201112.mbox/%3c4eeaf6e5.9030...@elyograg.org%3E

Thanks,
Shawn



Re: problem to indexing pdf directory

2012-02-16 Thread alessio crisantemi
Yes, I read it, but I don't know the cause.
What's more: I work on Windows, so I configured Tika and Solr manually
because I don't have Maven...

2012/2/16 Gora Mohanty 

> On 16 February 2012 21:37, alessio crisantemi
>  wrote:
> > here the log:
> >
> >
> > org.apache.solr.handler.dataimport.DataImporter doFullImport
> > Grave: Full Import failed
> > org.apache.solr.handler.dataimport.DataImportHandlerException: 'baseDir'
> is
> > a required attribute Processing Document # 1
> [...]
>
> The exception message above is pretty clear. You need to define a
> baseDir attribute for the second entity.
>
> However, even if you fix this, the setup will *not* work for indexing
> PDFs. Did you read the URLs that I sent earlier?
>
> Regards,
> Gora
>


Re: Do we need reindexing from solr 1.4.1 to 3.5.0?

2012-02-16 Thread tamanjit.bin...@yahoo.co.in
There may be issues with your solrconfig. Kindly post the exception that you
are receiving.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Do-we-need-reindexing-from-solr-1-4-1-to-3-5-0-tp3739353p3750937.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: is it possible to run deltaimport command with out delta query?

2012-02-16 Thread Dyer, James
There is a good example on how to do a delta update using 
"command=full-update&clean=false" on the wiki, here:  
http://wiki.apache.org/solr/DataImportHandlerFaq#fullimportdelta

This can be advantageous if you are updating a ton of data at once and do not 
want it executing as many queries to the database.  It also can be easier to 
maintain just 1 set of queries for both full and delta imports.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Thursday, February 16, 2012 10:04 AM
To: solr-user@lucene.apache.org
Subject: Re: is it possible to run deltaimport command with out delta query?

On 2/15/2012 11:26 PM, nagarjuna wrote:
> hi all..
> I am new to Solr. Can anybody explain to me the delta-import and
> delta query? I also have the questions below:
> 1. Is it possible to run deltaimport without a deltaQuery?
> 2. Is it possible to write a delta query without having a last_modified column
> in the database? If yes, please explain.

Assuming I understand what you're asking:

Define deltaImportQuery to be the same as query, then set deltaQuery to 
something that always returns some kind of value in the field you have 
designated as your primary key.  The data doesn't have to be relevant to 
anything at all, it just needs to return something for the primary key 
field.  Here's what I have in mine, my pk is did:

   deltaQuery="SELECT 1 AS did"

If you wish, you can completely ignore lastModified and track your own 
information about what data is new, then pass parameters via the 
dataimport handler URL to be used in your queries.  This is what both my 
query and deltaImportQuery are set to:

 SELECT * FROM ${dataimporter.request.dataView}
 WHERE (
   (
 did > ${dataimporter.request.minDid}
 AND did <= ${dataimporter.request.maxDid}
   )
   ${dataimporter.request.extraWhere}
 ) AND (crc32(did) % ${dataimporter.request.numShards})
   IN (${dataimporter.request.modVal})

Thanks,
Shawn



Re: Best requestHandler for "typing error".

2012-02-16 Thread tamanjit.bin...@yahoo.co.in
You can enable the spellcheck component and add it to your default request
handler.

This might be of use:
http://wiki.apache.org/solr/SpellCheckComponent

It can be used both for autosuggest and for "did you mean" suggestions.
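
A minimal SolrJ sketch of asking for suggestions at query time (it assumes a
spellcheck component is already wired into the handler; the URL and the
example term are illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse;

public class DidYouMeanDemo {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("goolge");
        q.set("spellcheck", "true");
        q.set("spellcheck.collate", "true"); // build a ready-to-use "did you mean" query
        QueryResponse rsp = server.query(q);
        SpellCheckResponse sc = rsp.getSpellCheckResponse();
        if (sc != null && !sc.isCorrectlySpelled()) {
            System.out.println("Did you mean: " + sc.getCollatedResult());
        }
    }
}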

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Best-requestHandler-for-typing-error-tp3749576p3750995.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr edismax clarification

2012-02-16 Thread Indika Tantrigoda
Hi All,

I am using the edismax search handler and I have some issues with the
search results. As I understand it, if the "defaultOperator" is set to OR, the
search query "The quick brown fox" is implicitly passed as "The OR quick OR
brown OR fox". However, if I search for The quick brown fox, I get fewer
results than when explicitly adding the OR. Another issue is that if I search
for The quick brown fox, other documents that contain the word fox are not in
the search results.
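
For reference, debugQuery=true shows how edismax actually expands the input,
which makes this easier to diagnose -- a quick SolrJ sketch (the URL and qf
field are illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ParsedQueryCheck {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("The quick brown fox");
        q.set("defType", "edismax");
        q.set("qf", "text");          // field(s) to search, illustrative
        q.set("debugQuery", "true");  // ask Solr to echo the parsed query
        QueryResponse rsp = server.query(q);
        // The "parsedquery" entry shows exactly how the input was expanded.
        System.out.println(rsp.getDebugMap().get("parsedquery"));
    }
}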

Thanks.


copyField: multivalued field to joined singlevalue field

2012-02-16 Thread flyingeagle-de
Hello,

I want to copy all values from a multivalued field, joined together, into a
single-valued field.

Is there any way to do this using standard Solr features?

kind regards

--
View this message in context: 
http://lucene.472066.n3.nabble.com/copyField-multivalued-field-to-joined-singlevalue-field-tp3750857p3750857.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: copyField: multivalued field to joined singlevalue field

2012-02-16 Thread Yonik Seeley
On Thu, Feb 16, 2012 at 11:35 AM, flyingeagle-de
 wrote:
> Hello,
>
> I want to copy all values from a multivalued field, joined together, into a
> single-valued field.
>
> Is there any way to do this using standard Solr features?

There is not currently, but it certainly makes sense.

Anyone know of an open issue for this yet?  If not, we should create one!
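
Until there is built-in support, a custom UpdateRequestProcessor can do the
join at index time -- a rough, untested sketch (the class name, chain wiring
and field names "tags"/"tags_joined" are illustrative):

import java.io.IOException;
import java.util.Collection;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class JoinCopyFieldProcessorFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
            SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                Collection<Object> vals = doc.getFieldValues("tags"); // multivalued source
                if (vals != null) {
                    StringBuilder joined = new StringBuilder();
                    for (Object v : vals) {
                        if (joined.length() > 0) joined.append(' ');
                        joined.append(v);
                    }
                    doc.setField("tags_joined", joined.toString()); // single-valued target
                }
                super.processAdd(cmd);
            }
        };
    }
}

The factory would be referenced from an updateRequestProcessorChain in
solrconfig.xml.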

-Yonik
lucidimagination.com


Distributed Faceting Bug?

2012-02-16 Thread Jamie Johnson
I am attempting to execute a query with the following parameters

q=*:*
distrib=true
facet=true
facet.limit=10
facet.field=manu
f.manu.facet.mincount=1
f.manu.facet.limit=10
f.manu.facet.sort=index
rows=10

When doing this I get the following exception

null  java.lang.ArrayIndexOutOfBoundsException

request: http://hostname:8983/solr/select
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
at 
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at 
org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
at 
org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
at 
org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

if I play with some of the parameters the query works as expected, i.e.

q=*:*
distrib=true
facet=true
facet.limit=10
facet.field=manu
f.manu.facet.mincount=1
f.manu.facet.limit=10
f.manu.facet.sort=index
rows=0

q=*:*
distrib=true
facet=true
facet.limit=10
facet.field=manu
f.manu.facet.mincount=1
f.manu.facet.limit=10
f.manu.facet.sort=count
rows=10

q=*:*
distrib=true
facet=true
facet.limit=10
facet.field=manu
f.manu.facet.mincount=1
f.manu.facet.sort=index
rows=10


I am running on an old snapshot of Solr, but will try this on a new
version relatively soon.  Unfortunately I can't duplicate locally so
I'm a bit baffled by the error.

All of the shards have the field which we are faceting on


Re: custom scoring

2012-02-16 Thread Em
Hello Carlos,

well, you must take into account that you are executing up to 8 queries
per request instead of one query per request.

I am not totally sure about the details of the implementation of the
max-function-query, but I guess it first iterates over the results of
the first max-query, afterwards over the results of the second max-query
and so on. This is a much higher complexity than in the case of a normal
query.

I would suggest you to optimize your request. I don't think that this
particular function query is matching *all* docs. Instead I think it
just matches those docs specified by your inner-query (although I might
be wrong about that).

What are you trying to achieve by your request?

Regards,
Em

Am 16.02.2012 16:24, schrieb Carlos Gonzalez-Cadenas:
> Hello Em:
> 
> The URL is quite large (w/ shards, ...), maybe it's best if I paste the
> relevant parts.
> 
> Our "q" parameter is:
> 
>   
> "q":"_val_:\"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)\"",
> 
> The subqueries q8, q7, q4 and q3 are regular queries, for example:
> 
> "q7":"stopword_phrase:colomba~1 AND stopword_phrase:santa AND
> wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR
> (stopword_phrase:las AND stopword_phrase:de)"
> 
> We've executed the subqueries q3-q8 independently and they're very fast,
> but when we introduce the function queries as described below, it all goes
> 10X slower.
> 
> Let me know if you need anything else.
> 
> Thanks
> Carlos
> 
> 
> Carlos Gonzalez-Cadenas
> CEO, ExperienceOn - New generation search
> http://www.experienceon.com
> 
> Mobile: +34 652 911 201
> Skype: carlosgonzalezcadenas
> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
> 
> 
> On Thu, Feb 16, 2012 at 4:02 PM, Em  wrote:
> 
>> Hello carlos,
>>
>> could you show us how your Solr-call looks like?
>>
>> Regards,
>> Em
>>
>> Am 16.02.2012 14:34, schrieb Carlos Gonzalez-Cadenas:
>>> Hello all:
>>>
>>> We'd like to score the matching documents using a combination of SOLR's
>> IR
>>> score with another application-specific score that we store within the
>>> documents themselves (i.e. a float field containing the app-specific
>>> score). In particular, we'd like to calculate the final score doing some
>>> operations with both numbers (i.e product, sqrt, ...)
>>>
>>> According to what we know, there are two ways to do this in SOLR:
>>>
>>> A) Sort by function [1]: We've tested an expression like
>>> "sort=product(score, query_score)" in the SOLR query, where score is the
>>> common SOLR IR score and query_score is our own precalculated score, but
>> it
>>> seems that SOLR can only do this with stored/indexed fields (and
>> obviously
>>> "score" is not stored/indexed).
>>>
>>> B) Function queries: We've used _val_ and function queries like max, sqrt
>>> and query, and we've obtained the desired results from a functional point
>>> of view. However, our index is quite large (400M documents) and the
>>> performance degrades heavily, given that function queries are AFAIK
>>> matching all the documents.
>>>
>>> I have two questions:
>>>
>>> 1) Apart from the two options I mentioned, is there any other (simple)
>> way
>>> to achieve this that we're not aware of?
>>>
>>> 2) If we have to choose the function queries path, would it be very
>>> difficult to modify the actual implementation so that it doesn't match
>> all
>>> the documents, that is, to pass a query so that it only operates over the
>>> documents matching the query?. Looking at the FunctionQuery.java source
>>> code, there's a comment that says "// instead of matching all docs, we
>>> could also embed a query. the score could either ignore the subscore, or
>>> boost it", which is giving us some hope that maybe it's possible and even
>>> desirable to go in this direction. If you can give us some directions
>> about
>>> how to go about this, we may be able to do the actual implementation.
>>>
>>> BTW, we're using Lucene/SOLR trunk.
>>>
>>> Thanks a lot for your help.
>>> Carlos
>>>
>>> [1]: http://wiki.apache.org/solr/FunctionQuery#Sort_By_Function
>>>
>>
> 


Re: Distributed Faceting Bug?

2012-02-16 Thread Jamie Johnson
please ignore this, it has nothing to do with the faceting component.
I was able to disable a custom component that I had and it worked
perfectly fine.


On Thu, Feb 16, 2012 at 12:42 PM, Jamie Johnson  wrote:
> I am attempting to execute a query with the following parameters
>
> q=*:*
> distrib=true
> facet=true
> facet.limit=10
> facet.field=manu
> f.manu.facet.mincount=1
> f.manu.facet.limit=10
> f.manu.facet.sort=index
> rows=10
>
> When doing this I get the following exception
>
> null  java.lang.ArrayIndexOutOfBoundsException
>
> request: http://hostname:8983/solr/select
>        at 
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
>        at 
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
>        at 
> org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
>        at 
> org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
>        at 
> org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>        at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>        at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>        at java.lang.Thread.run(Thread.java:662)
>
> if I play with some of the parameters the query works as expected, i.e.
>
> q=*:*
> distrib=true
> facet=true
> facet.limit=10
> facet.field=manu
> f.manu.facet.mincount=1
> f.manu.facet.limit=10
> f.manu.facet.sort=index
> rows=0
>
> q=*:*
> distrib=true
> facet=true
> facet.limit=10
> facet.field=manu
> f.manu.facet.mincount=1
> f.manu.facet.limit=10
> f.manu.facet.sort=count
> rows=10
>
> q=*:*
> distrib=true
> facet=true
> facet.limit=10
> facet.field=manu
> f.manu.facet.mincount=1
> f.manu.facet.sort=index
> rows=10
>
>
> I am running on an old snapshot of Solr, but will try this on a new
> version relatively soon.  Unfortunately I can't duplicate locally so
> I'm a bit baffled by the error.
>
> All of the shards have the field which we are faceting on


Re: Distributed Faceting Bug?

2012-02-16 Thread Em
Hi Jamie,

what version of Solr/SolrJ are you using?

Regards,
Em

Am 16.02.2012 18:42, schrieb Jamie Johnson:
> I am attempting to execute a query with the following parameters
> 
> q=*:*
> distrib=true
> facet=true
> facet.limit=10
> facet.field=manu
> f.manu.facet.mincount=1
> f.manu.facet.limit=10
> f.manu.facet.sort=index
> rows=10
> 
> When doing this I get the following exception
> 
> null  java.lang.ArrayIndexOutOfBoundsException
> 
> request: http://hostname:8983/solr/select
>   at 
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
>   at 
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
>   at 
> org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
>   at 
> org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
>   at 
> org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>   at java.lang.Thread.run(Thread.java:662)
> 
> if I play with some of the parameters the query works as expected, i.e.
> 
> q=*:*
> distrib=true
> facet=true
> facet.limit=10
> facet.field=manu
> f.manu.facet.mincount=1
> f.manu.facet.limit=10
> f.manu.facet.sort=index
> rows=0
> 
> q=*:*
> distrib=true
> facet=true
> facet.limit=10
> facet.field=manu
> f.manu.facet.mincount=1
> f.manu.facet.limit=10
> f.manu.facet.sort=count
> rows=10
> 
> q=*:*
> distrib=true
> facet=true
> facet.limit=10
> facet.field=manu
> f.manu.facet.mincount=1
> f.manu.facet.sort=index
> rows=10
> 
> 
> I am running on an old snapshot of Solr, but will try this on a new
> version relatively soon.  Unfortunately I can't duplicate locally so
> I'm a bit baffled by the error.
> 
> All of the shards have the field which we are faceting on
> 


Re: Distributed Faceting Bug?

2012-02-16 Thread Em
Hi Jamie,

Nice to hear.
Maybe you can share what kind of bug you ran into, so that other
developers with similarly buggy components can benefit from your
experience. :)

Regards,
Em

Am 16.02.2012 19:23, schrieb Jamie Johnson:
> please ignore this, it has nothing to do with the faceting component.
> I was able to disable a custom component that I had and it worked
> perfectly fine.
> 
> 
> On Thu, Feb 16, 2012 at 12:42 PM, Jamie Johnson  wrote:
>> I am attempting to execute a query with the following parameters
>>
>> q=*:*
>> distrib=true
>> facet=true
>> facet.limit=10
>> facet.field=manu
>> f.manu.facet.mincount=1
>> f.manu.facet.limit=10
>> f.manu.facet.sort=index
>> rows=10
>>
>> When doing this I get the following exception
>>
>> null  java.lang.ArrayIndexOutOfBoundsException
>>
>> request: http://hostname:8983/solr/select
>>at 
>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
>>at 
>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
>>at 
>> org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
>>at 
>> org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
>>at 
>> org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
>>at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>at 
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>>at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>at java.lang.Thread.run(Thread.java:662)
>>
>> if I play with some of the parameters the query works as expected, i.e.
>>
>> q=*:*
>> distrib=true
>> facet=true
>> facet.limit=10
>> facet.field=manu
>> f.manu.facet.mincount=1
>> f.manu.facet.limit=10
>> f.manu.facet.sort=index
>> rows=0
>>
>> q=*:*
>> distrib=true
>> facet=true
>> facet.limit=10
>> facet.field=manu
>> f.manu.facet.mincount=1
>> f.manu.facet.limit=10
>> f.manu.facet.sort=count
>> rows=10
>>
>> q=*:*
>> distrib=true
>> facet=true
>> facet.limit=10
>> facet.field=manu
>> f.manu.facet.mincount=1
>> f.manu.facet.sort=index
>> rows=10
>>
>>
>> I am running on an old snapshot of Solr, but will try this on a new
>> version relatively soon.  Unfortunately I can't duplicate locally so
>> I'm a bit baffled by the error.
>>
>> All of the shards have the field which we are faceting on
> 


Re: How to loop through the DataImportHandler query results?

2012-02-16 Thread Mikhail Khludnev
Chantal,

If you prefer Java, here is http://wiki.apache.org/solr/DIHCustomTransform
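
For the original requirement (mapping every column without listing each one
in the config), a transformer along these lines could work -- an untested
sketch, the class name is illustrative, and every target field still needs a
matching (dynamic) field in schema.xml:

import java.util.HashMap;
import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class CopyAllColumnsTransformer extends Transformer {
    @Override
    public Object transformRow(Map<String, Object> row, Context context) {
        // Copy every column of the JDBC row to a field of the same lowercased name.
        Map<String, Object> out = new HashMap<String, Object>(row);
        for (Map.Entry<String, Object> e : row.entrySet()) {
            out.put(e.getKey().toLowerCase(), e.getValue());
        }
        return out;
    }
}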



On Thu, Feb 16, 2012 at 7:24 PM, Chantal Ackermann <
chantal.ackerm...@btelligent.de> wrote:

> If your script turns out too complex to maintain, and you are developing
> in Java, anyway, you could extend EntityProcessor and handle the data in
> a custom way. I've done that to transform a datamart like data structure
> back into a row based one.
>
> Basically you override the method that gets the data in a Map and
> transform it into a different Map which contains the fields as
> understood by your schema.
>
> Chantal
>
>
> On Thu, 2012-02-16 at 14:59 +0100, Mikhail Khludnev wrote:
> > Hi Baranee,
> >
> > Some time ago I played with
> > http://wiki.apache.org/solr/DataImportHandler#ScriptTransformer - it
> was a
> > pretty good stuff.
> >
> > Regards
> >
> >
> > On Thu, Feb 16, 2012 at 3:53 PM, K, Baraneetharan <
> baraneethara...@hp.com>wrote:
> >
> > > To avoid that we don't want to mention the column names in the field
> tag ,
> > > but want to write a query to map all the fields in the table with solr
> > > fileds even if we don't know, how many columns are there in the table.
>  I
> > > need a kind of loop which runs through all the query results and map
> that
> > > with solr fileds.
> >
> >
> >
> >
>
>


-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics


 


Re: custom scoring

2012-02-16 Thread Carlos Gonzalez-Cadenas
Hello Em:

Thanks for your answer.

Yes, we initially also thought that the excessive increase in response time
was caused by the several queries being executed, and we did another test.
We executed one of the subqueries that I've shown to you directly in the
"q" parameter and then we tested this same subquery (only this one, without
the others) with the function query "query($q1)" in the "q" parameter.

Theoretically the times for these two queries should be more or less the
same, but the second one is several times slower than the first one. After
this observation we learned more about function queries and we learned from
the code and from some comments in the forums [1] that the FunctionQueries
are expected to match all documents.

We have some more tests on that matter: now we're moving from issuing this
large query through the SOLR interface to creating our own QueryParser. The
initial tests we've done in our QParser (that internally creates multiple
queries and inserts them inside a DisjunctionMaxQuery) are very good, we're
getting very good response times and high quality answers. But when we've
tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a
QueryValueSource that wraps the DisMaxQuery), then the times move from
10-20 msec to 200-300msec.
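
One alternative worth noting (an untested sketch, using Lucene 3.x package
names -- trunk has moved these classes around) is to wrap the
DisjunctionMaxQuery in a CustomScoreQuery, which only visits documents that
match the wrapped query:

import org.apache.lucene.search.Query;
import org.apache.lucene.search.function.CustomScoreQuery;
import org.apache.lucene.search.function.FieldScoreQuery;

public class AppScoreWrapper {
    // Multiplies the IR score of the wrapped query by the stored per-document
    // query_score field, without iterating the whole index.
    public static Query wrap(Query userQuery) {
        FieldScoreQuery appScore =
            new FieldScoreQuery("query_score", FieldScoreQuery.Type.FLOAT);
        return new CustomScoreQuery(userQuery, appScore); // default: product of scores
    }
}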

Note that we're using early termination of queries (via a custom
collector), and therefore (as shown by the numbers I included above) even
if the query is very complex, we're getting very fast answers. The only
situation where the response time explodes is when we include a
FunctionQuery.

Re: your question of what we're trying to achieve ... We're implementing a
powerful query autocomplete system, and we use several fields to a) improve
performance on wildcard queries and b) have a very precise control over the
score.

Thanks a lot for your help,
Carlos

[1]: http://grokbase.com/p/lucene/solr-user/11bjw87bt5/functionquery-score-0

Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


On Thu, Feb 16, 2012 at 7:09 PM, Em  wrote:

> Hello Carlos,
>
> well, you must take into account that you are executing up to 8 queries
> per request instead of one query per request.
>
> I am not totally sure about the details of the implementation of the
> max-function-query, but I guess it first iterates over the results of
> the first max-query, afterwards over the results of the second max-query
> and so on. This is a much higher complexity than in the case of a normal
> query.
>
> I would suggest you to optimize your request. I don't think that this
> particular function query is matching *all* docs. Instead I think it
> just matches those docs specified by your inner-query (although I might
> be wrong about that).
>
> What are you trying to achieve by your request?
>
> Regards,
> Em
>
> Am 16.02.2012 16:24, schrieb Carlos Gonzalez-Cadenas:
> > Hello Em:
> >
> > The URL is quite large (w/ shards, ...), maybe it's best if I paste the
> > relevant parts.
> >
> > Our "q" parameter is:
> >
> >
> "q":"_val_:\"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)\"",
> >
> > The subqueries q8, q7, q4 and q3 are regular queries, for example:
> >
> > "q7":"stopword_phrase:colomba~1 AND stopword_phrase:santa AND
> > wildcard_stopword_phrase:car^0.7 AND stopword_phrase:hoteles OR
> > (stopword_phrase:las AND stopword_phrase:de)"
> >
> > We've executed the subqueries q3-q8 independently and they're very fast,
> > but when we introduce the function queries as described below, it all
> goes
> > 10X slower.
> >
> > Let me know if you need anything else.
> >
> > Thanks
> > Carlos
> >
> >
> > Carlos Gonzalez-Cadenas
> > CEO, ExperienceOn - New generation search
> > http://www.experienceon.com
> >
> > Mobile: +34 652 911 201
> > Skype: carlosgonzalezcadenas
> > LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
> >
> >
> > On Thu, Feb 16, 2012 at 4:02 PM, Em 
> wrote:
> >
> >> Hello carlos,
> >>
> >> could you show us how your Solr-call looks like?
> >>
> >> Regards,
> >> Em
> >>
> >> Am 16.02.2012 14:34, schrieb Carlos Gonzalez-Cadenas:
> >>> Hello all:
> >>>
> >>> We'd like to score the matching documents using a combination of SOLR's
> >> IR
> >>> score with another application-specific score that we store within the
> >>> documents themselves (i.e. a float field containing the app-specific
> >>> score). In particular, we'd like to calculate the final score doing
> some
> >>> operations with both numbers (i.e product, sqrt, ...)
> >>>
> >>> According to what we know, there are two ways to do this in SOLR:
> >>>
> >>> A) Sort by function [1]: We've tested an expression like
> >>> "sort=product(score, query_score)" in the SOLR query, where score is
> the
> >>> common SOLR IR score and query_score is our own precalculated score,
> but
> >> it
> >>> seems that SOLR can only do this w

Re: custom scoring

2012-02-16 Thread Em
Hello Carlos,

> We have some more tests on that matter: now we're moving from issuing this
> large query through the SOLR interface to creating our own
QueryParser. The
> initial tests we've done in our QParser (that internally creates multiple
> queries and inserts them inside a DisjunctionMaxQuery) are very good,
we're
> getting very good response times and high quality answers. But when we've
> tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a
> QueryValueSource that wraps the DisMaxQuery), then the times move from
> 10-20 msec to 200-300msec.
I reviewed the source code and yes, the FunctionQuery iterates over the
whole index, however... let's see!

In relation to the DisMaxQuery you create within your parser: What kind
of clause is the FunctionQuery and what kind of clause are your other
queries (MUST, SHOULD, MUST_NOT...)?

*I* would expect that with a shrinking set of matching documents to the
overall-query, the function query only checks those documents that are
guaranteed to be within the result set.

> Note that we're using early termination of queries (via a custom
> collector), and therefore (as shown by the numbers I included above) even
> if the query is very complex, we're getting very fast answers. The only
> situation where the response time explodes is when we include a
> FunctionQuery.
Could you give us some details about how/where did you plugin the
Collector, please?

Kind regards,
Em

Am 16.02.2012 19:41, schrieb Carlos Gonzalez-Cadenas:
> Hello Em:
> 
> Thanks for your answer.
> 
> Yes, we initially also thought that the excessive increase in response time
> was caused by the several queries being executed, and we did another test.
> We executed one of the subqueries that I've shown to you directly in the
> "q" parameter and then we tested this same subquery (only this one, without
> the others) with the function query "query($q1)" in the "q" parameter.
> 
> Theoretically the times for these two queries should be more or less the
> same, but the second one is several times slower than the first one. After
> this observation we learned more about function queries and we learned from
> the code and from some comments in the forums [1] that the FunctionQueries
> are expected to match all documents.
> 
> We have some more tests on that matter: now we're moving from issuing this
> large query through the SOLR interface to creating our own QueryParser. The
> initial tests we've done in our QParser (that internally creates multiple
> queries and inserts them inside a DisjunctionMaxQuery) are very good, we're
> getting very good response times and high quality answers. But when we've
> tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a
> QueryValueSource that wraps the DisMaxQuery), then the times move from
> 10-20 msec to 200-300msec.
> 
> Note that we're using early termination of queries (via a custom
> collector), and therefore (as shown by the numbers I included above) even
> if the query is very complex, we're getting very fast answers. The only
> situation where the response time explodes is when we include a
> FunctionQuery.
> 
> Re: your question of what we're trying to achieve ... We're implementing a
> powerful query autocomplete system, and we use several fields to a) improve
> performance on wildcard queries and b) have a very precise control over the
> score.
> 
> Thanks a lot for your help,
> Carlos
> 
> [1]: http://grokbase.com/p/lucene/solr-user/11bjw87bt5/functionquery-score-0
> 
> Carlos Gonzalez-Cadenas
> CEO, ExperienceOn - New generation search
> http://www.experienceon.com
> 
> Mobile: +34 652 911 201
> Skype: carlosgonzalezcadenas
> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
> 
> 
> On Thu, Feb 16, 2012 at 7:09 PM, Em  wrote:
> 
>> Hello Carlos,
>>
>> well, you must take into account that you are executing up to 8 queries
>> per request instead of one query per request.
>>
>> I am not totally sure about the details of the implementation of the
>> max-function-query, but I guess it first iterates over the results of
>> the first max-query, afterwards over the results of the second max-query
>> and so on. This is a much higher complexity than in the case of a normal
>> query.
>>
>> I would suggest you to optimize your request. I don't think that this
>> particular function query is matching *all* docs. Instead I think it
>> just matches those docs specified by your inner-query (although I might
>> be wrong about that).
>>
>> What are you trying to achieve by your request?
>>
>> Regards,
>> Em
>>
>> Am 16.02.2012 16:24, schrieb Carlos Gonzalez-Cadenas:
>>> Hello Em:
>>>
>>> The URL is quite large (w/ shards, ...), maybe it's best if I paste the
>>> relevant parts.
>>>
>>> Our "q" parameter is:
>>>
>>>
>> "q":"_val_:\"product(query_score,max(query($q8),max(query($q7),max(query($q4),query($q3)\"",
>>>
>>> The subqueries q8, q7, q4 and q3 are regular queries, for example:
>>>
>>> "q7":"stopword_phrase:colomba

Re: custom scoring

2012-02-16 Thread Carlos Gonzalez-Cadenas
Hello Em:

1) Here's a printout of an example DisMax query (as you can see mostly MUST
terms except for some SHOULD terms used for boosting scores for stopwords)

((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) |
 (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en) |
 (+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) |
 (+stopword_phrase:hoteles +stopword_phrase:barcelona stopword_phrase:en) |
 (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) |
 (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en) |
 (+stopword_shortened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en) |
 (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona stopword_phrase:en))

2) The collector is inserted in the SolrIndexSearcher (replacing the
TimeLimitingCollector). We trigger it through the SOLR interface by passing
the timeAllowed parameter. We know this is a hack but AFAIK there's no
out-of-the-box way to specify custom collectors by now (
https://issues.apache.org/jira/browse/SOLR-1680). In any case the collector
part works perfectly as of now, so clearly this is not the problem.

3) Re: your sentence:

"*I* would expect that with a shrinking set of matching documents to
the overall-query, the function query only checks those documents that are
guaranteed to be within the result set."

Yes, I agree with this, but this snippet of code in FunctionQuery.java
seems to say otherwise:

// instead of matching all docs, we could also embed a query.
// the score could either ignore the subscore, or boost it.
// Containment:  floatline(foo:myTerm, "myFloatField", 1.0, 0.0f)
// Boost:foo:myTerm^floatline("myFloatField",1.0,0.0f)
@Override
public int nextDoc() throws IOException {
  for(;;) {
++doc;
if (doc>=maxDoc) {
  return doc=NO_MORE_DOCS;
}
if (acceptDocs != null && !acceptDocs.get(doc)) continue;
return doc;
  }
}

It seems that the author also thought of maybe embedding a query in order
to restrict matches, but this doesn't seem to be in place as of now (or
maybe I'm not understanding how the whole thing works :) ).

Thanks
Carlos

Carlos Gonzalez-Cadenas
CEO, ExperienceOn - New generation search
http://www.experienceon.com

Mobile: +34 652 911 201
Skype: carlosgonzalezcadenas
LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas


On Thu, Feb 16, 2012 at 8:09 PM, Em  wrote:

> Hello Carlos,
>
> > We have some more tests on that matter: now we're moving from issuing
> this
> > large query through the SOLR interface to creating our own
> QueryParser. The
> > initial tests we've done in our QParser (that internally creates multiple
> > queries and inserts them inside a DisjunctionMaxQuery) are very good,
> we're
> > getting very good response times and high quality answers. But when we've
> > tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a
> > QueryValueSource that wraps the DisMaxQuery), then the times move from
> > 10-20 msec to 200-300msec.
> I reviewed the sourcecode and yes, the FunctionQuery iterates over the
> whole index, however... let's see!
>
> In relation to the DisMaxQuery you create within your parser: What kind
> of clause is the FunctionQuery and what kind of clause are your other
> queries (MUST, SHOULD, MUST_NOT...)?
>
> *I* would expect that with a shrinking set of matching documents to the
> overall-query, the function query only checks those documents that are
> guaranteed to be within the result set.
>
> > Note that we're using early termination of queries (via a custom
> > collector), and therefore (as shown by the numbers I included above) even
> > if the query is very complex, we're getting very fast answers. The only
> > situation where the response time explodes is when we include a
> > FunctionQuery.
> Could you give us some details about how/where did you plugin the
> Collector, please?
>
> Kind regards,
> Em
>
> Am 16.02.2012 19:41, schrieb Carlos Gonzalez-Cadenas:
> > Hello Em:
> >
> > Thanks for your answer.
> >
> > Yes, we initially also thought that the excessive increase in response
> time
> > was caused by the several queries being executed, and we did another
> test.
> > We executed one of the subqueries that I've shown to you directly in the
> > "q" parameter and then we tested this same subquery (only this one,
> without
> > the others) with the function query "query($q1)" in the "q" parameter.
> >
> > Theoretically the times for these two queries should be more or less the
> > same, but the second one is several times slower than the first one.
> After
> > this observation we learned more about function queries and we learned
> from
> > the code and from some comments in the forums [1] that the

Re: copyField: multivalued field to joined singlevalue field

2012-02-16 Thread Chris Hostetter

: > I want to copy all data from a multivalued field joined together in a single
: > valued field.
: >
: > Is there any opportunity to do this by using solr-standards?
: 
: There is not currently, but it certainly makes sense.

Part of it has just recently been commited to trunk actually...

https://issues.apache.org/jira/browse/SOLR-2802

https://builds.apache.org/job/Solr-trunk/javadoc/org/apache/solr/update/processor/ConcatFieldUpdateProcessorFactory.html

...with that, it's easy to say "anytime multiple values are found for a 
single valued string field, join them together with a comma".
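
As an illustrative sketch of an update chain using it (the parameter names 
below are my own shorthand, so double check the javadoc linked above before 
copying anything):

  <updateRequestProcessorChain name="concat-single-valued">
    <processor class="solr.ConcatFieldUpdateProcessorFactory">
      <!-- illustrative selector and delimiter -->
      <str name="fieldName">title</str>
      <str name="delimiter">, </str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>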

the only piece that's missing is to "copy" from a source field in an 
(earlier) UpdateProcessor.  

There's a patch for this in SOLR-2599 but i haven't had a chance to look at 
it yet.




-Hoss


Re: Specify a cores roles through core add command

2012-02-16 Thread Mark Miller
https://issues.apache.org/jira/browse/SOLR-3138

On Feb 9, 2012, at 4:16 PM, Jamie Johnson wrote:

> per SOLR-2765 we can add roles to specific cores such that it's
> possible to give custom roles to solr instances, is it possible to
> specify this when adding a core through curl
> 'http://host:port/solr/admin/cores...'?
> 
> 
> https://issues.apache.org/jira/browse/SOLR-2765

- Mark Miller
lucidimagination.com













Re: Using Solr for a rather busy "Yellow Pages"-type index - good idea or not really?

2012-02-16 Thread Alexey Verkhovsky
On Thu, Feb 16, 2012 at 3:37 AM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Everybody start from daily bounce, but end up with UPDATED_AT column and
> delta updates , just consider urgent content fix usecase. Don't think it's
> worth to rely on daily bounce as a cornerstone of architecture.
>

I'd be happy to avoid it, for all the obvious reasons.

I do know that performance of this type of services tends to be not that
great (as in "700 to 5000 msec"), and there should be ways to do it several
times faster than this.


> you can use grid of coordinates to reduce their entropy


I don't understand this statement. Can you elaborate, please?

Since my bounding boxes are small, one [premature optimization] idea could
be to divide Earth into 2x2 degree overlapping tiles at 1 degree step in
both directions (such that any bounding box fits within at least one of
them, and any location belongs to 4 of them), then use tileId=X as a cached
filter and geofilt as a post-filter. Is that along the lines of what you
are talking about?
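
(To make that concrete, I'm picturing request parameters roughly like the
following; the field names are invented and I haven't verified the
post-filter "cost" syntax against trunk:

  fq={!cache=true}tile_id:4217
  fq={!geofilt cache=false cost=200 sfield=location pt=45.15,-93.85 d=5}

i.e. the cheap tile filter gets cached and the exact geofilt runs last over
whatever survives.)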


> 
> > Lucene internals, caching of filters probably doesn't make sense either.
> > from what little I understand about
> But solr does it http://wiki.apache.org/solr/SolrCaching#filterCache
>

I didn't realize that multiple fq's in the same query were applied in
parallel as set intersections. In that case, the non-geography filters
should be cached (and added to the prewarming routine, I guess) even when
they are usually far less specific than the bounding box. Makes sense.


> > 1. Search server is an internal service that uses embedded Solr for the
> > indexing part. RAMDirectoryFactory as index storage.
> Bad idea. It's purposed mostly for tests, the closest purposed for
> production analogue is
> org.apache.lucene.store.instantiated.InstantiatedIndex
>
...

> AFAIK the state of art is use file directory (MMAP or whatever), rely on
> Linux file system RAM cache.
>

OK, I may as well start the spike from this angle, too. By the way, this is
precisely the kind of advice I was hoping for. Thanks a lot.

> 5. All Solr caching is switched off.

> But why?
>

Because (a) I shouldn't need to cache documents, if they are all in memory
anyway; (b) query caching will have abysmal hit/miss because of the spatial
component; and (c) I misunderstood how query filters work. So, now I'm
thinking a FastLFU query filter cache for non-geo filters.


> Btw, if you need multivalue geofield pls vote for SOLR-2155
>
Our data has one lon/lat pair per entity... so no, I don't need it. Or at
least haven't figured out that I do yet. :)

-- 
Alexey Verkhovsky
http://alex-verkhovsky.blogspot.com/
CruiseControl.rb [http://cruisecontrolrb.thoughtworks.com]


Re: Distributed Faceting Bug?

2012-02-16 Thread Jamie Johnson
still digging ;)  Once I figure it out I'll be happy to share.

On Thu, Feb 16, 2012 at 1:32 PM, Em  wrote:
> Hi Jamie,
>
> nice to hear.
> Maybe you can share in what kind of bug you ran, so that other
> developers with similar bugish components can benefit from your
> experience. :)
>
> Regards,
> Em
>
> Am 16.02.2012 19:23, schrieb Jamie Johnson:
>> please ignore this, it has nothing to do with the faceting component.
>> I was able to disable a custom component that I had and it worked
>> perfectly fine.
>>
>>
>> On Thu, Feb 16, 2012 at 12:42 PM, Jamie Johnson  wrote:
>>> I am attempting to execute a query with the following parameters
>>>
>>> q=*:*
>>> distrib=true
>>> facet=true
>>> facet.limit=10
>>> facet.field=manu
>>> f.manu.facet.mincount=1
>>> f.manu.facet.limit=10
>>> f.manu.facet.sort=index
>>> rows=10
>>>
>>> When doing this I get the following exception
>>>
>>> null  java.lang.ArrayIndexOutOfBoundsException
>>>
>>> request: http://hostname:8983/solr/select
>>>        at 
>>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
>>>        at 
>>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
>>>        at 
>>> org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
>>>        at 
>>> org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
>>>        at 
>>> org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
>>>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>        at 
>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>>>        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>>        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>        at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>>        at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>>        at java.lang.Thread.run(Thread.java:662)
>>>
>>> if I play with some of the parameters the query works as expected, i.e.
>>>
>>> q=*:*
>>> distrib=true
>>> facet=true
>>> facet.limit=10
>>> facet.field=manu
>>> f.manu.facet.mincount=1
>>> f.manu.facet.limit=10
>>> f.manu.facet.sort=index
>>> rows=0
>>>
>>> q=*:*
>>> distrib=true
>>> facet=true
>>> facet.limit=10
>>> facet.field=manu
>>> f.manu.facet.mincount=1
>>> f.manu.facet.limit=10
>>> f.manu.facet.sort=count
>>> rows=10
>>>
>>> q=*:*
>>> distrib=true
>>> facet=true
>>> facet.limit=10
>>> facet.field=manu
>>> f.manu.facet.mincount=1
>>> f.manu.facet.sort=index
>>> rows=10
>>>
>>>
>>> I am running on an old snapshot of Solr, but will try this on a new
>>> version relatively soon.  Unfortunately I can't duplicate locally so
>>> I'm a bit baffled by the error.
>>>
>>> All of the shards have the field which we are faceting on
>>


Re: SolrCloud - issues running with embedded zookeeper ensemble

2012-02-16 Thread arin g
i have the same problem, it seems that there is a bug in SolrZkServer class
(parseProperties method), that doesn't work well when you have an external
zookeeper ensemble.

Thanks,
 arin

--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrCloud-issues-running-with-embedded-zookeeper-ensemble-tp3694004p3751629.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Using Solr for a rather busy "Yellow Pages"-type index - good idea or not really?

2012-02-16 Thread Yonik Seeley
On Thu, Feb 16, 2012 at 3:03 PM, Alexey Verkhovsky
 wrote:
>> 5. All Solr caching is switched off.
>
>> But why?
>>
>
> Because (a) I shouldn't need to cache documents, if they are all in memory
> anyway;

You're making many assumptions about how Solr works internally.

One example of many:
  Solr streams documents (requests the stored fields right before they
are written to the response stream) to support returning any number of
documents.
If you highlight documents, the stored fields need to be retrieved
first.  When streaming those same documents later, Solr will retrieve
the stored fields again - relying on the fact that they should be
cached by the document cache since they were just used.

There are tons of examples of how things are architected to take
advantage of the caches - it pretty much never makes sense to outright
disable them.  If they take up too much memory, then just reduce the
size.
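
For example, something small in solrconfig.xml (the sizes here are just
placeholders):

  <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>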

-Yonik
lucidimagination.com


Re: custom scoring

2012-02-16 Thread Chris Hostetter

: We'd like to score the matching documents using a combination of SOLR's IR
: score with another application-specific score that we store within the
: documents themselves (i.e. a float field containing the app-specific
: score). In particular, we'd like to calculate the final score doing some
: operations with both numbers (i.e product, sqrt, ...)

let's back up a minute.

if your ultimate goal is to have the final score of all documents be a 
simple multiplication of an indexed field ("query_score") against the 
score of your "base" query, that's fairely trivial use of the 
BoostQParser...

q={!boost f=query_score}your base query

...or to split it out using pram derefrencing...

q={!boost f=query_score v=$qq}
qq=your base query

: A) Sort by function [1]: We've tested an expression like
: "sort=product(score, query_score)" in the SOLR query, where score is the
: common SOLR IR score and query_score is our own precalculated score, but it
: seems that SOLR can only do this with stored/indexed fields (and obviously
: "score" is not stored/indexed).

you could do this by replacing "score" with the query whose score you 
want, which could be a ref back to "$q" -- but that's really only needed 
if you want the "scores" returned for each document to be differnt then the 
value used for sorting (ie: score comes from solr, sort value includes you 
query_score and the score from the main query -- or some completley diff 
query)
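
something along these lines (an untested sketch, assuming query_score is an 
indexed numeric field)...

  sort=product(query($q),query_score) desc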

based on what you've said, you don't need that and it would be 
unnecessary overhead.

: B) Function queries: We've used _val_ and function queries like max, sqrt
: and query, and we've obtained the desired results from a functional point
: of view. However, our index is quite large (400M documents) and the
: performance degrades heavily, given that function queries are AFAIK
: matching all the documents.

based on the examples you've given in your subsequent queries, it's not 
hard to see why...

> "q":"_val_:\"product(query_score,max(query($q8),max(query($q7),

wrapping queries in functions in queries can have that effect, because 
functions ultimatley match all documents -- even when that function wraps 
a query -- so your outermost query is still scoring every document in the 
index.

you want to do as much "pruning" with the query as possible, and only 
multiply by your boost function on matching docs, hence the 
purpose of the BoostQParser.

-Hoss


Re: Distributed Faceting Bug?

2012-02-16 Thread Jamie Johnson
The issue appears to be that I put an empty array into the doc scores
instead of null in DocSlice.  DocSlice then just checks if scores is
null when hasScore is called which caused a further issue down the
line.  I'll follow up with anything else that I find along the way.

On Thu, Feb 16, 2012 at 3:05 PM, Jamie Johnson  wrote:
> still digging ;)  Once I figure it out I'll be happy to share.
>
> On Thu, Feb 16, 2012 at 1:32 PM, Em  wrote:
>> Hi Jamie,
>>
>> nice to hear.
>> Maybe you can share in what kind of bug you ran, so that other
>> developers with similar bugish components can benefit from your
>> experience. :)
>>
>> Regards,
>> Em
>>
>> Am 16.02.2012 19:23, schrieb Jamie Johnson:
>>> please ignore this, it has nothing to do with the faceting component.
>>> I was able to disable a custom component that I had and it worked
>>> perfectly fine.
>>>
>>>
>>> On Thu, Feb 16, 2012 at 12:42 PM, Jamie Johnson  wrote:
 I am attempting to execute a query with the following parameters

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.limit=10
 f.manu.facet.sort=index
 rows=10

 When doing this I get the following exception

 null  java.lang.ArrayIndexOutOfBoundsException

 request: http://hostname:8983/solr/select
        at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:435)
        at 
 org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
        at 
 org.apache.solr.client.solrj.impl.LBHttpSolrServer.request(LBHttpSolrServer.java:249)
        at 
 org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:517)
        at 
 org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:482)
        at 
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
 java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at 
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

 if I play with some of the parameters the query works as expected, i.e.

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.limit=10
 f.manu.facet.sort=index
 rows=0

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.limit=10
 f.manu.facet.sort=count
 rows=10

 q=*:*
 distrib=true
 facet=true
 facet.limit=10
 facet.field=manu
 f.manu.facet.mincount=1
 f.manu.facet.sort=index
 rows=10


 I am running on an old snapshot of Solr, but will try this on a new
 version relatively soon.  Unfortunately I can't duplicate locally so
 I'm a bit baffled by the error.

 All of the shards have the field which we are faceting on
>>>


Re: custom scoring

2012-02-16 Thread Robert Muir
On Thu, Feb 16, 2012 at 8:34 AM, Carlos Gonzalez-Cadenas
 wrote:
> Hello all:
>
> We'd like to score the matching documents using a combination of SOLR's IR
> score with another application-specific score that we store within the
> documents themselves (i.e. a float field containing the app-specific
> score). In particular, we'd like to calculate the final score doing some
> operations with both numbers (i.e product, sqrt, ...)
...
>
> 1) Apart from the two options I mentioned, is there any other (simple) way
> to achieve this that we're not aware of?
>

In general there is always a third option, that may or may not fit,
depending really upon how you are trying to model relevance and how
you want to integrate with scoring, and thats to tie in your factors
directly into Similarity (lucene's term weighting api). For example,
some people use index-time boosting, but in lucene index-time boost
really just means 'make the document appear shorter'. You might for
example, have other boosts that modify term-frequency before
normalization, or however you want to do it. Similarity is pluggable
into Solr via schema.xml.
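
For example, a top-level schema.xml element along the lines of (the class
name is obviously just an illustration):

  <similarity class="com.example.MyAppSimilarity"/>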

Since you are using trunk, this is a lot more flexible than previous
releases, e.g. you can access things from FieldCache, DocValues, or
even your own rapidly-changing float[] or whatever you want :) There
are also a lot more predefined models than just the vector space model
to work with if you find you can easily imagine your notion of
relevance in terms of an existing model.

-- 
lucidimagination.com


Re: Using Solr for a rather busy "Yellow Pages"-type index - good idea or not really?

2012-02-16 Thread Alexey Verkhovsky
On Thu, Feb 16, 2012 at 1:32 PM, Yonik Seeley wrote:

> Your're making many assumptions about how Solr works internally.
>

True that. If this spike turns into a project, digging through the source
code will come. Meantime, we have to start somewhere, and the default
configuration may not be the greatest starting point for this problem.

We don't need highlighting, and only need ids, scores and total number of
results out of Solr. Presentation of selected entities will have to include
some write-heavy data (from RDBMS and/or memcached), therefore won't be
Solr's business anyway.

From what you said, I guess it won't hurt to give it a small document
cache, just big enough to prevent streaming the same document twice within
the same query. Still don't have a reason to have a query cache - because
of lon/lat coming from the mobile devices, there are virtually no repeated
queries in our production logs. Or am I making a bad assumption here, too?

-- 
Alexey Verkhovsky
http://alex-verkhovsky.blogspot.com/
CruiseControl.rb [http://cruisecontrolrb.thoughtworks.com]


Re: Using Solr for a rather busy "Yellow Pages"-type index - good idea or not really?

2012-02-16 Thread Yonik Seeley
On Thu, Feb 16, 2012 at 4:06 PM, Alexey Verkhovsky
 wrote:
> only need ids, scores and total number of results out of Solr. Presentation of
> selected entities will have to include some write-heavy data (from RDBMS
> and/or memcached), therefore won't be Solr's business anyway.

It depends on if you're going to be doing distributed search - there
may be some scenarios there where it's used, but in general the query
cache is the least useful.
The filterCache is useful in a ton of ways if you're doing faceting too.

-Yonik
lucidimagination.com


Re: SolrCloud - issues running with embedded zookeeper ensemble

2012-02-16 Thread Mark Miller

On Feb 16, 2012, at 2:53 PM, arin g wrote:

> i have the same problem, it seems that there is a bug in SolrZkServer class
> (parseProperties method), that doesn't work well when you have an external
> zookeeper ensemble.
> 

This issue was around using an embedded ensemble - an external ensemble makes 
SolrZkServer irrelevant.

What issue are you having? I just tried a basic test against an external 
ensemble.

What version are you using?

> Thanks,
> arin
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SolrCloud-issues-running-with-embedded-zookeeper-ensemble-tp3694004p3751629.html
> Sent from the Solr - User mailing list archive at Nabble.com.

- Mark Miller
lucidimagination.com













Re: custom scoring

2012-02-16 Thread Em
Hello Carlos,

I think we missunderstood eachother.

As an example:
BooleanQuery (
  clauses: (
 MustMatch(
   DisjunctionMaxQuery(
   TermQuery("stopword_field", "barcelona"),
   TermQuery("stopword_field", "hoteles")
   )
 ),
 ShouldMatch(
  FunctionQuery(
*please insert your function here*
 )
 )
  )
)

Explanation:
You construct an artificial BooleanQuery which wraps your user's query
as well as your function query.
Your user's query - in that case - is just a DisjunctionMaxQuery
consisting of two TermQueries.
In the real world you might construct another BooleanQuery around your
DisjunctionMaxQuery in order to have more flexibility.
However the interesting part of the given example is, that we specify
the user's query as a MustMatch-condition of the BooleanQuery and the
FunctionQuery just as a ShouldMatch.
Constructed that way, I am expecting the FunctionQuery only scores those
documents which fit the MustMatch-Condition.
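
In plain Lucene terms the shape would be roughly as follows (package names
for FunctionQuery/ValueSource moved between 3.x and trunk, and the
ValueSource is only a placeholder, so treat this as a sketch rather than
working code):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.BooleanClause.Occur;
  import org.apache.lucene.search.BooleanQuery;
  import org.apache.lucene.search.DisjunctionMaxQuery;
  import org.apache.lucene.search.TermQuery;

  DisjunctionMaxQuery userQuery = new DisjunctionMaxQuery(0.0f);
  userQuery.add(new TermQuery(new Term("stopword_field", "barcelona")));
  userQuery.add(new TermQuery(new Term("stopword_field", "hoteles")));

  BooleanQuery combined = new BooleanQuery();
  // the user's query prunes the candidate set
  combined.add(userQuery, Occur.MUST);
  // the FunctionQuery should then only have to score the surviving docs;
  // yourValueSource stands in for whatever function you want to boost by
  combined.add(new FunctionQuery(yourValueSource), Occur.SHOULD);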

I conclude that from the fact that the FunctionQuery-class also has a
skipTo-method and I would expect that the scorer will use it to score
only matching documents (however I did not search where and how it might
get called).

If my conclusion is wrong than hopefully Robert Muir (as far as I can
see the author of that class) can tell us what was the intention by
constructing an every-time-match-all-function-query.

Can you validate whether your QueryParser constructs a query in the form
I drew above?

Regards,
Em

Am 16.02.2012 20:29, schrieb Carlos Gonzalez-Cadenas:
> Hello Em:
> 
> 1) Here's a printout of an example DisMax query (as you can see mostly MUST
> terms except for some SHOULD terms used for boosting scores for stopwords)
> *
> *
> *((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona
> stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
> +stopword_phrase:barcelona
> stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +stopword_short
> ened_phrase:barcelona stopword_shortened_phrase:en) | 
> (+stopword_phrase:hoteles
> +stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shor
> tened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona
> stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +wildcard_stopw
> ord_phrase:barcelona stopword_phrase:en) | (+stopword_shortened_phrase:hoteles
> +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
> | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona
> stopword_phrase:en))*
> *
> *
> 2)* *The collector is inserted in the SolrIndexSearcher (replacing the
> TimeLimitingCollector). We trigger it through the SOLR interface by passing
> the timeAllowed parameter. We know this is a hack but AFAIK there's no
> out-of-the-box way to specify custom collectors by now (
> https://issues.apache.org/jira/browse/SOLR-1680). In any case the collector
> part works perfectly as of now, so clearly this is not the problem.
> 
> 3) Re: your sentence:
> *
> *
> **I* would expect that with a shrinking set of matching documents to
> the overall-query, the function query only checks those documents that are
> guaranteed to be within the result set.*
> *
> *
> Yes, I agree with this, but this snippet of code in FunctionQuery.java
> seems to say otherwise:
> 
> // instead of matching all docs, we could also embed a query.
> // the score could either ignore the subscore, or boost it.
> // Containment:  floatline(foo:myTerm, "myFloatField", 1.0, 0.0f)
> // Boost:foo:myTerm^floatline("myFloatField",1.0,0.0f)
> @Override
> public int nextDoc() throws IOException {
>   for(;;) {
> ++doc;
> if (doc>=maxDoc) {
>   return doc=NO_MORE_DOCS;
> }
> if (acceptDocs != null && !acceptDocs.get(doc)) continue;
> return doc;
>   }
> }
> 
> It seems that the author also thought of maybe embedding a query in order
> to restrict matches, but this doesn't seem to be in place as of now (or
> maybe I'm not understanding how the whole thing works :) ).
> 
> Thanks
> Carlos
> *
> *
> 
> Carlos Gonzalez-Cadenas
> CEO, ExperienceOn - New generation search
> http://www.experienceon.com
> 
> Mobile: +34 652 911 201
> Skype: carlosgonzalezcadenas
> LinkedIn: http://www.linkedin.com/in/carlosgonzalezcadenas
> 
> 
> On Thu, Feb 16, 2012 at 8:09 PM, Em  wrote:
> 
>> Hello Carlos,
>>
>>> We have some more tests on that matter: now we're moving from issuing
>> this
>>> large query through the SOLR interface to creating our own
>> QueryParser. The
>>> initial tests we've done in our QParser (that internally creates multiple
>>> queries and inserts them inside a DisjunctionMaxQuery) are very good,
>> we're
>>> getting very good response times and high quality answers. But when we've
>>> tried to wrap the DisjunctionMaxQuery within a FunctionQuery (i.e. with a
>>> QueryValueSource that wraps th

files left open?

2012-02-16 Thread Paulo Magalhaes
Hi all,

I was loading a big (60 million docs) csv in solr 4 when something odd
happened.
I got a solr error in the log saying that it could not write the file.
du -s indicated I had used 30GB of the 50GB available, but df -k indicated
that the disk was 100% used.
du and df giving different results could be an indication that there are
file descriptors left open.
After a Solr bounce, df -k came down and agreed with du.
Has anyone seen anything like that?

Thanks,
Paulo.

environment;
Linux 2.6.18-238.19.1.el5.centos.
plus #1 SMP Mon Jul 18 10:05:09 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
Java(TM) SE Runtime Environment (build 1.6.0_17-b04) Java HotSpot(TM)
64-Bit Server VM (build 14.3-b01, mixed mode)
apache-solr-4.0-2012-02-10_09-58-50
solr config is the one in the distribution package. i had my own schema.


Re: custom scoring

2012-02-16 Thread Em
I just modified some TestCases a little bit to see how the FunctionQuery
behaves.

Given that you got an index containing 14 docs, where 13 of them
contain the term "batman" and two contain the term "superman", a
search for

q=+text:superman _val_:"query($qq)"&qq=text:superman

Leads to two hits and the FunctionQuery has two iterations.

If you remove that little plus-symbol before "text:superman", it
wouldn't be a mustMatch-condition anymore and the whole query results
into 14 hits (default-operator is OR):

q=text:superman _val_:"query($qq)"&qq=text:superman

If both queries, the TermQuery and the FunctionQuery must match, it
would also result into two hits:

q=text:superman AND _val_:"query($qq)"&qq=text:superman

There is some behaviour that I currently don't understand (if 14 docs
match, the FunctionQuery's AllScorer iterates twice over the
0th and the 1st doc, and the reason for that seems to be the construction
of two AllScorers), but as far as I can see the performance of your
queries *should* increase if you construct your query as I explained in
my last eMail.

Kind regards,
Em

Am 16.02.2012 23:43, schrieb Em:
> Hello Carlos,
> 
> I think we missunderstood eachother.
> 
> As an example:
> BooleanQuery (
>   clauses: (
>  MustMatch(
>DisjunctionMaxQuery(
>TermQuery("stopword_field", "barcelona"),
>TermQuery("stopword_field", "hoteles")
>)
>  ),
>  ShouldMatch(
>   FunctionQuery(
> *please insert your function here*
>  )
>  )
>   )
> )
> 
> Explanation:
> You construct an artificial BooleanQuery which wraps your user's query
> as well as your function query.
> Your user's query - in that case - is just a DisjunctionMaxQuery
> consisting of two TermQueries.
> In the real world you might construct another BooleanQuery around your
> DisjunctionMaxQuery in order to have more flexibility.
> However the interesting part of the given example is, that we specify
> the user's query as a MustMatch-condition of the BooleanQuery and the
> FunctionQuery just as a ShouldMatch.
> Constructed that way, I am expecting the FunctionQuery only scores those
> documents which fit the MustMatch-Condition.
> 
> I conclude that from the fact that the FunctionQuery-class also has a
> skipTo-method and I would expect that the scorer will use it to score
> only matching documents (however I did not search where and how it might
> get called).
> 
> If my conclusion is wrong than hopefully Robert Muir (as far as I can
> see the author of that class) can tell us what was the intention by
> constructing an every-time-match-all-function-query.
> 
> Can you validate whether your QueryParser constructs a query in the form
> I drew above?
> 
> Regards,
> Em
> 
> Am 16.02.2012 20:29, schrieb Carlos Gonzalez-Cadenas:
>> Hello Em:
>>
>> 1) Here's a printout of an example DisMax query (as you can see mostly MUST
>> terms except for some SHOULD terms used for boosting scores for stopwords)
>> *
>> *
>> *((+stopword_shortened_phrase:hoteles +stopword_shortened_phrase:barcelona
>> stopword_shortened_phrase:en) | (+stopword_phrase:hoteles
>> +stopword_phrase:barcelona
>> stopword_phrase:en) | (+stopword_shortened_phrase:hoteles +stopword_short
>> ened_phrase:barcelona stopword_shortened_phrase:en) | 
>> (+stopword_phrase:hoteles
>> +stopword_phrase:barcelona stopword_phrase:en) | (+stopword_shor
>> tened_phrase:hoteles +wildcard_stopword_shortened_phrase:barcelona
>> stopword_shortened_phrase:en) | (+stopword_phrase:hoteles +wildcard_stopw
>> ord_phrase:barcelona stopword_phrase:en) | 
>> (+stopword_shortened_phrase:hoteles
>> +wildcard_stopword_shortened_phrase:barcelona stopword_shortened_phrase:en)
>> | (+stopword_phrase:hoteles +wildcard_stopword_phrase:barcelona
>> stopword_phrase:en))*
>> *
>> *
>> 2)* *The collector is inserted in the SolrIndexSearcher (replacing the
>> TimeLimitingCollector). We trigger it through the SOLR interface by passing
>> the timeAllowed parameter. We know this is a hack but AFAIK there's no
>> out-of-the-box way to specify custom collectors by now (
>> https://issues.apache.org/jira/browse/SOLR-1680). In any case the collector
>> part works perfectly as of now, so clearly this is not the problem.
>>
>> 3) Re: your sentence:
>> *
>> *
>> **I* would expect that with a shrinking set of matching documents to
>> the overall-query, the function query only checks those documents that are
>> guaranteed to be within the result set.*
>> *
>> *
>> Yes, I agree with this, but this snippet of code in FunctionQuery.java
>> seems to say otherwise:
>>
>> // instead of matching all docs, we could also embed a query.
>> // the score could either ignore the subscore, or boost it.
>> // Containment:  floatline(foo:myTerm, "myFloatField", 1.0, 0.0f)
>> // Boost:foo:myTerm^floatline("myFloatField",1.0,0.0f)
>> @Override
>> public int nextDoc() thr

Re: files left open?

2012-02-16 Thread Yonik Seeley
On Thu, Feb 16, 2012 at 5:56 PM, Paulo Magalhaes
 wrote:
> I was loading a big (60 million docs) csv in solr 4 when something odd
> happened.
> I got a solr error in the log saying that it could not write the file.
> du -s indicated I had used 30Gb of a 50Gb available but df -k  indicated
> that the disk was I00% used.

You probably hit a big segment merge, which does require more disk
space temporarily.
The difference between du and df probably just indicates how they
internally work (du may just look at file sizes, and non-closed files
can register as smaller or 0 than the amount of disk space they
actually take up).
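
If you want to confirm it next time, listing deleted-but-still-open files
usually shows the culprit (where SOLR_PID is the Solr JVM's process id):

  lsof -p $SOLR_PID | grep -i deleted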

-Yonik
lucidimagination.com


Re: Setting solrj server connection timeout

2012-02-16 Thread Mark Miller
I'm not sure that timeout will help you here - I believe it's the timeout on
'creating' the connection.

Try setting the socket timeout (setSoTimeout) - that should let you try
sooner.
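
Something along these lines on each of your server objects (the values are
arbitrary):

  // CommonsHttpSolrServer server = ...;
  server.setConnectionTimeout(15000); // time allowed to establish the connection
  server.setSoTimeout(30000);         // socket read timeout while waiting for a response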

It looks like perhaps the server is timing out and closing the connection.

I guess all you can do is time out reasonably (if it takes too long to wait
for the exception) and retry.

On Fri, Feb 3, 2012 at 3:12 PM, Shawn Heisey  wrote:

> Is the following a reasonable approach to setting a connection timeout
> with SolrJ?
>
>queryCore.getHttpClient().**getHttpConnectionManager().**
> getParams()
>.setConnectionTimeout(15000);
>
> Right now I have all my solr server objects sharing a single HttpClient
> that gets created using the multithreaded connection manager, where I set
> the timeout for all of them.  Now I will be letting each server object
> create its own HttpClient object, and using the above statement to set the
> timeout on each one individually.  It'll use up a bunch more memory, as
> there are 56 server objects, but maybe it'll work better.  The total of 56
> objects comes about from 7 shards, a build core and a live core per shard,
> two complete index chains, and for each of those, one server object for
> access to CoreAdmin and another for the index.
>
> The impetus for this, as it's possible I'm stating an XY problem:
> Currently I have an occasional problem where SolrJ connections throw an
> exception.  When it happens, nothing is logged in Solr.  My code is smart
> enough to notice the problem, send an email alert, and simply try again at
> the top of the next minute.  The simple explanation is that this is a Linux
> networking problem, but I never had any problem like this when I was using
> Perl with LWP to keep my index up to date.  I sent a message to the list
> some time ago on this exception, but I never got a response that helped me
> figure it out.
>
> Caused by: org.apache.solr.client.solrj.**SolrServerException:
> java.net.SocketException: Connection reset
>
> at org.apache.solr.client.solrj.**impl.CommonsHttpSolrServer.**
> request(CommonsHttpSolrServer.**java:480)
>
> at org.apache.solr.client.solrj.**impl.CommonsHttpSolrServer.**
> request(CommonsHttpSolrServer.**java:246)
>
> at org.apache.solr.client.solrj.**request.QueryRequest.process(**
> QueryRequest.java:89)
>
> at org.apache.solr.client.solrj.**SolrServer.query(SolrServer.**java:276)
>
> at com.newscom.idxbuild.solr.**Core.getCount(Core.java:325)
>
> ... 3 more
>
> Caused by: java.net.SocketException: Connection reset
>
> at java.net.SocketInputStream.**read(SocketInputStream.java:**168)
>
> at java.io.BufferedInputStream.**fill(BufferedInputStream.java:**218)
>
> at java.io.BufferedInputStream.**read(BufferedInputStream.java:**237)
>
> at org.apache.commons.httpclient.**HttpParser.readRawLine(**
> HttpParser.java:78)
>
> at org.apache.commons.httpclient.**HttpParser.readLine(**
> HttpParser.java:106)
>
> at org.apache.commons.httpclient.**HttpConnection.readLine(**
> HttpConnection.java:1116)
>
> at org.apache.commons.httpclient.**MultiThreadedHttpConnectionMan**
> ager$HttpConnectionAdapter.**readLine(**MultiThreadedHttpConnectionMan**
> ager.java:1413)
>
> at org.apache.commons.httpclient.**HttpMethodBase.readStatusLine(**
> HttpMethodBase.java:1973)
>
> at org.apache.commons.httpclient.**HttpMethodBase.readResponse(**
> HttpMethodBase.java:1735)
>
> at org.apache.commons.httpclient.**HttpMethodBase.execute(**
> HttpMethodBase.java:1098)
>
> at org.apache.commons.httpclient.**HttpMethodDirector.**executeWithRetry(*
> *HttpMethodDirector.java:398)
>
> at org.apache.commons.httpclient.**HttpMethodDirector.**executeMethod(**
> HttpMethodDirector.java:171)
>
> at org.apache.commons.httpclient.**HttpClient.executeMethod(**
> HttpClient.java:397)
>
> at org.apache.commons.httpclient.**HttpClient.executeMethod(**
> HttpClient.java:323)
>
> at org.apache.solr.client.solrj.**impl.CommonsHttpSolrServer.**
> request(CommonsHttpSolrServer.**java:424)
>
> ... 7 more
>
>
> Thanks,
> Shawn
>
>


-- 
- Mark

http://www.lucidimagination.com


how to delta index linked entities in 3.5.0

2012-02-16 Thread AdamLane
The delta instructions from
https://wiki.apache.org/solr/DataImportHandler#Using_delta-import_command
works for me in solr 1.4 but crashes in 3.5.0 (error: "deltaQuery has no
column to resolve to declared primary key pk='ITEM_ID, CATEGORY_ID'"  issue:
https://issues.apache.org/jira/browse/SOLR-2907) 

Is there anyone out there that can confirm my bug?  Because I am new to solr
and hopefully I am just doing something wrong based on a misunderstanding of
the wiki.  Anyone successfully indexing the join of items and multiple
item_categories just like the wiki example that would be willing to share
their workaround or suggest a workaround?

Thanks,
Adam   


--
View this message in context: 
http://lucene.472066.n3.nabble.com/how-to-delta-index-linked-entities-in-3-5-0-tp3752455p3752455.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Setting solrj server connection timeout

2012-02-16 Thread Shawn Heisey

On 2/16/2012 6:28 PM, Mark Miller wrote:

I'm not sure that timeout will help you here - I believe it's the timeout on
'creating' the connection.

Try setting the socket timeout (setSoTimeout) - that should let you try
sooner.

It looks like perhaps the server is timing out and closing the connection.

I guess all you can do is time out reasonably (if it takes too long to wait
for the exception) and retry.


When the timeout exception happens, it is happening within the same 
second as the beginning of the update cycle, which involves a lot of 
other things happening (such as talking to a database) before it even 
gets around to talking to Solr.  I do not have millisecond timestamps, 
but from what little I can tell, it's a handful of milliseconds from 
when SolrJ starts the request until the exception is logged.  It happens 
relatively rarely - no more than once every few days, usually less often 
than that.  I cannot reproduce it at will.  Nobody is doing any work on 
either Solr or the network when it happens.  Nothing is logged in the 
Solr server log or syslog at the OS level, the only mention of anything 
bad going on is in the log of my SolrJ application.


I never had this problem when my build system was written in Perl, using 
LWP to make HTTP requests with URLs that I constructed myself.  The perl 
system ran on CentOS 5 with Xen virtualization, now I'm running CentOS 6 
on the bare metal.  I'm using a bonded interface (for failover, not load 
balancing) comprised of two NICs plugged into separate switches.  When 
it was virtualized, the Xen host was also using an identically 
configured bonded interface, bridged to the guests, which used eth0.


The last time the error happened, which was on Feb 15th at 2:04 PM MST, 
the query that failed was 'did:(289800299 OR 289800157)', a very simple 
query against a tlong field.  The application tests for the existence of 
the did values that it is trying to delete before it issues the delete 
request.


I'm willing to look deeper into possible networking issues, but I am 
skeptical about that being the problem, and because there are no log 
messages to investigate, I have no idea how to proceed.  The application 
runs on one of four Solr servers, sometimes the error even happens when 
connecting to Solr on the same server it's running on, which takes the 
gigabit switches out of the equation.  If it's an actual networking 
problem, it's either in the hardware (Dell PowerEdge 2950 III, built-in 
NICs) or the CentOS 6 kernel.


At this point, I am thinking it's one of the following problems, in 
order of decreasing probability: 1) I am using SolrJ incorrectly. 2) 
There is a SolrJ problem that only appears under specific circumstances 
that happen to exist in my setup. 3) My hardware or OS software has an 
extremely intermittent problem.


What other info can I provide?

Thanks,
Shawn



Sort by the number of matching terms (coord value)

2012-02-16 Thread Nicholas Clark
Hi,

I'm looking for a way to sort results by the number of matching terms.
Being able to sort by the coord() value or by the overlap value that gets
passed into the coord() function would do the trick. Is there a way I can
expose those values to the sort function?

I'd appreciate any help that points me in the right direction. I'm OK with
making basic code modifications.

Thanks!

-Nick


Re: how to delta index linked entities in 3.5.0

2012-02-16 Thread Shawn Heisey

On 2/16/2012 6:31 PM, AdamLane wrote:

The delta instructions from
https://wiki.apache.org/solr/DataImportHandler#Using_delta-import_command
works for me in solr 1.4 but crashes in 3.5.0 (error: "deltaQuery has no
column to resolve to declared primary key pk='ITEM_ID, CATEGORY_ID'"  issue:
https://issues.apache.org/jira/browse/SOLR-2907)

Is there anyone out there that can confirm my bug?  Because I am new to solr
and hopefully I am just doing something wrong based on a misunderstanding of
the wiki.  Anyone successfully indexing the join of items and multiple
item_categories just like the wiki example that would be willing to share
their workaround or suggest a workaround?


I ran into something like this, possibly even this exact problem.

Things have been tightened up in 3.x.  All query results now need to 
have a field corresponding to what you've defined as pk, or it's 
considered an error.  I was not using the results from my deltaQuery, 
but I still had to adjust it so that it returned a field with the same 
name as my primary key.  You have defined more than one field for your 
pk, so I don't really know exactly what you'll have to do - perhaps you 
need to have both ITEM_ID and CATEGORY_ID fields in your query results.
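
As a sketch of what I mean for a single-column pk (the table and column
names are invented, and I have not tried this against your schema):

  <entity name="item" pk="ITEM_ID"
          query="SELECT * FROM item"
          deltaQuery="SELECT ITEM_ID FROM item
                      WHERE last_modified &gt; '${dataimporter.last_index_time}'"
          deltaImportQuery="SELECT * FROM item WHERE ITEM_ID='${dih.delta.ITEM_ID}'"/>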


Thanks,
Shawn



Re: Sort by the number of matching terms (coord value)

2012-02-16 Thread Li Li
You can fool the Lucene scoring function: override each function such as idf,
queryNorm, and lengthNorm and let them simply return 1.0f.
I don't know whether Lucene 4 will expose more details, but for 2.x/3.x Lucene can only
score by the vector space model and the formula can't be replaced by users.
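
A rough sketch of such a Similarity for 3.x (method signatures vary a bit
across 3.x releases, so adjust to whatever version you build against):

  import org.apache.lucene.index.FieldInvertState;
  import org.apache.lucene.search.DefaultSimilarity;

  public class CoordOnlySimilarity extends DefaultSimilarity {
    @Override public float tf(float freq) { return freq > 0 ? 1.0f : 0.0f; }
    @Override public float idf(int docFreq, int numDocs) { return 1.0f; }
    @Override public float queryNorm(float sumOfSquaredWeights) { return 1.0f; }
    @Override public float computeNorm(String field, FieldInvertState state) { return 1.0f; }
    // coord(overlap, maxOverlap) is inherited unchanged, so the score is
    // dominated by how many of the query terms actually match
  }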

On Fri, Feb 17, 2012 at 10:47 AM, Nicholas Clark  wrote:

> Hi,
>
> I'm looking for a way to sort results by the number of matching terms.
> Being able to sort by the coord() value or by the overlap value that gets
> passed into the coord() function would do the trick. Is there a way I can
> expose those values to the sort function?
>
> I'd appreciate any help that points me in the right direction. I'm OK with
> making basic code modifications.
>
> Thanks!
>
> -Nick
>


Improving proximity search performance

2012-02-16 Thread Bryan Loofbourrow
Here’s my use case. I expect to set up a Solr index that is approximately
1.4GB (this is a real number from the proof-of-concept using the real data,
which consists of about 10 million documents, many of significant size, and
making use of the FastVectorHighlighter to do highlighting on the body text
field, which is of course stored, and with termVectors, termPositions, and
termOffsets on).



I no longer have the proof-of-concept Solr core available (our live site
uses Solr 1.4 and the ordinary Highlighter), so I can’t get an empirical
answer to this question: Will storing that extra information about the
location of terms help the performance of proximity searches?



A significant and important subset of my users make extensive use of
proximity searches. These sophisticated users have found that they are best
able to locate what they want by doing searches about THISWORD within 5
words of THATWORD, or much more sophisticated variants on that theme,
including plenty of booleans and wildcards. The problem I’m facing is
performance. Some of these searches, when common words are used, can take
many minutes, even with the index on an SSD.



The question is, how to improve the performance. It occurred to me as
possible that all of that term vector information, stored for the benefit
of the FastVectorHighlighter, might be a significant aid to the performance
of these searches.



First question: is that already the case? Will storing this extra
information automatically improve my proximity search performance?



Second question: If not, I’m very willing to dive into the code and come up
with a patch that would do this. Can someone with knowledge of the
internals comment on whether this is a plausible strategy for improving
performance, and, if so, give tips about the outlines of what a successful
approach to the problem might look like?



Third question: Any tips in general for improving the performance of these
proximity searches? I have explored the question of whether the customers
might be weaned off of them, and that does not appear to be an option.



Thanks,



-- Bryan Loofbourrow


RE: Frequent garbage collections after a day of operation

2012-02-16 Thread Bryan Loofbourrow
A couple of thoughts:

We wound up doing a bunch of tuning on the Java garbage collection.
However, the pattern we were seeing was periodic very extreme slowdowns,
because we were then using the default garbage collector, which blocks
when it has to do a major collection. This doesn't sound like your
problem, but it's something to be aware of.
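
If it ever does become your problem, the usual starting point is a
concurrent collector plus GC logging; the flags below are illustrative
rather than a recommendation for your particular heap:

  -Xms4g -Xmx4g -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
  -verbose:gc -XX:+PrintGCDetails -Xloggc:gc.log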

One thing that could fit the pattern you describe would be Solr caches
filling up and getting you too close to your JVM or memory limit. For
example, if you have large documents, and have defined a large document
cache, that might do it.

I found it useful to point jconsole (free with the JDK) at my JVM, and
watch the pattern of memory usage. If the troughs at the bottom of the GC
cycles keep rising, you know you've got something that is continuing to
grab more memory and not let go of it. Now that our JVM is running
smoothly, we just see a sawtooth pattern, with the troughs approximately
level. When the system is under load, the frequency of the wave rises. Try
it and see what sort of pattern you're getting.

-- Bryan

> -Original Message-
> From: Matthias Käppler [mailto:matth...@qype.com]
> Sent: Thursday, February 16, 2012 7:23 AM
> To: solr-user@lucene.apache.org
> Subject: Frequent garbage collections after a day of operation
>
> Hey everyone,
>
> we're running into some operational problems with our SOLR production
> setup here and were wondering if anyone else is affected or has even
> solved these problems before. We're running a vanilla SOLR 3.4.0 in
> several Tomcat 6 instances, so nothing out of the ordinary, but after
> a day or so of operation we see increased response times from SOLR, up
> to 3 times increases on average. During this time we see increased CPU
> load due to heavy garbage collection in the JVM, which bogs down the
> the whole system, so throughput decreases, naturally. When restarting
> the slaves, everything goes back to normal, but that's more like a
> brute force solution.
>
> The thing is, we don't know what's causing this and we don't have that
> much experience with Java stacks since we're for most parts a Rails
> company. Are Tomcat 6 or SOLR known to leak memory? Is anyone else
> seeing this, or can you think of a reason for this? Most of our
> queries to SOLR involve the DismaxHandler and the spatial search query
> components. We don't use any custom request handlers so far.
>
> Thanks in advance,
> -Matthias
>
> --
> Matthias Käppler
> Lead Developer API & Mobile
>
> Qype GmbH
> Großer Burstah 50-52
> 20457 Hamburg
> Telephone: +49 (0)40 - 219 019 2 - 160
> Skype: m_kaeppler
> Email: matth...@qype.com
>
> Managing Director: Ian Brotherston
> Amtsgericht Hamburg
> HRB 95913
>
> This e-mail and its attachments may contain confidential and/or
> privileged information. If you are not the intended recipient (or have
> received this e-mail in error) please notify the sender immediately
> and destroy this e-mail and its attachments. Any unauthorized copying,
> disclosure or distribution of this e-mail and  its attachments is
> strictly forbidden. This notice also applies to future messages.


Re: distributed deletes working?

2012-02-16 Thread Mark Miller
Yup - deletes are fine.


On Thu, Feb 16, 2012 at 8:56 PM, Jamie Johnson  wrote:

> With solr-2358 being committed to trunk do deletes and updates get
> distributed/routed like adds do? Also when a down shard comes back up are
> the deletes/updates forwarded as well? Reading the jira I believe the
> answer is yes, I just want to verify before bringing the latest into my
> environment.
>



-- 
- Mark

http://www.lucidimagination.com


Re: Sort by the number of matching terms (coord value)

2012-02-16 Thread Nicholas Clark
I want to leave the score intact so I can sort by matching term frequency
and then by score. I don't think I can do that if I modify all the
similarity functions, but I think your solution would have worked otherwise.

It would be great if there was a way I could expose this information
through a function query (similar to the new relevance functions in version
4.0). I'll have to see if I can figure out how those functions work.

-Nick


On Thu, Feb 16, 2012 at 6:58 PM, Li Li  wrote:

> you can fool the lucene scoring fuction. override each function such as idf
> queryNorm lengthNorm and let them simply return 1.0f.
> I don't lucene 4 will expose more details. but for 2.x/3.x, lucene can only
> score by vector space model and the formula can't be replaced by users.
>
> On Fri, Feb 17, 2012 at 10:47 AM, Nicholas Clark 
> wrote:
>
> > Hi,
> >
> > I'm looking for a way to sort results by the number of matching terms.
> > Being able to sort by the coord() value or by the overlap value that gets
> > passed into the coord() function would do the trick. Is there a way I can
> > expose those values to the sort function?
> >
> > I'd appreciate any help that points me in the right direction. I'm OK
> with
> > making basic code modifications.
> >
> > Thanks!
> >
> > -Nick
> >
>


Ranking based on number of matches in a multivalued field?

2012-02-16 Thread Steven Ou
So suppose I have a multivalued field for categories. Let's say we have 3
items with these categories:

Item 1: category ids [1,2,5,7,9]
Item 2: category ids [4,8,9]
Item 3: category ids [1,4,9]

I now run a filter query for any of the following category ids [1,4,9]. I
should get all of them back as results because they all include at least
one category which I'm querying.

Now, how do I order it based on the number of matching categories?? In this
case, I would like Item 3 (matched all [1,4,9]) to be ranked higher,
followed by Item 2 (matched [4,9]) and Item 1 (matched [1,9]). Is there a
way I can boost documents based on the number of matches?

I don't want an "absolute" rank where Item 3 is definitely the first
result, but rather a way to boost Item 3's score higher than that of Item 1
and 2 so that it's more likely to show up higher (depending on the query
string).

Thanks!
--
Steven Ou | 歐偉凡

*ravn.com* | Chief Technology Officer
steve...@gmail.com | +1 909-569-9880


UpdateRequestHandler coding

2012-02-16 Thread Lance Norskog
If I want to write a complex UpdateRequestHandler should I do it on
trunk or the 3.x branch? The criteria are a stable, debugged,
full-featured environment.

-- 
Lance Norskog
goks...@gmail.com


Re: Size of suggest dictionary

2012-02-16 Thread Mike Hugo
Thanks Em!

What if we use a threshold value in the suggest configuration, like 

  <float name="threshold">0.005</float>

I assume the dictionary size will then be smaller than the total number of 
distinct terms, is there anyway to determine what that size is?

Thanks,

Mike


On Wednesday, February 15, 2012 at 4:39 PM, Em wrote:

> Hello Mike,
> 
> have a look at Solr's Schema Browser. Click on "FIELDS", select "label"
> and have a look at the number of distinct (term-)values.
> 
> Regards,
> Em
> 
> 
> Am 15.02.2012 23:07, schrieb Mike Hugo:
> > Hello,
> > 
> > We're building an auto suggest component based on the "label" field of
> > documents. Is there a way to see how many terms are in the dictionary, or
> > how much memory it's taking up? I looked on the statistics page but didn't
> > find anything obvious.
> > 
> > Thanks in advance,
> > 
> > Mike
> > 
> > ps- here's the config:
> > 
> > <searchComponent name="..." class="solr.SpellCheckComponent">
> >   <lst name="spellchecker">
> >     <str name="name">suggestlabel</str>
> >     <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
> >     <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
> >     <str name="field">label</str>
> >     <str name="buildOnCommit">true</str>
> >   </lst>
> > </searchComponent>
> > 
> > <requestHandler name="..."
> > class="org.apache.solr.handler.component.SearchHandler">
> >   <lst name="defaults">
> >     <str name="spellcheck">true</str>
> >     <str name="spellcheck.dictionary">suggestlabel</str>
> >     <str name="spellcheck.count">10</str>
> >   </lst>
> >   <arr name="components">
> >     <str>suggestlabel</str>
> >   </arr>
> > </requestHandler>
> > 
> 
> 
> 
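
Not an answer from the thread, but one rough way to estimate the
post-threshold dictionary size: the Suggester's HighFrequencyDictionary
keeps terms whose document frequency is at least roughly threshold times the
number of documents, so the TermsComponent can approximate the count.
Assuming a /terms handler wired to the TermsComponent (the example
solrconfig ships one), a hypothetical index of about 200,000 documents, and
threshold=0.005 (i.e. a mincount of about 1,000), with host and core as
placeholders:

    http://localhost:8983/solr/terms?terms.fl=label&terms.mincount=1000&terms.limit=-1&omitHeader=true

Counting the terms returned gives an estimate of the dictionary size.
terms.limit=-1 removes the default cap of 10, so the response can be large
on a field with many distinct values.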




Re: Frequent garbage collections after a day of operation

2012-02-16 Thread Jason Rutherglen
> One thing that could fit the pattern you describe would be Solr caches
> filling up and getting you too close to your JVM or memory limit

This [uncommitted] issue would solve that problem by allowing the GC
to collect caches that become too large, though in practice, the cache
setting would need to be fairly large for an OOM to occur from them:
https://issues.apache.org/jira/browse/SOLR-1513

On Thu, Feb 16, 2012 at 7:14 PM, Bryan Loofbourrow
 wrote:
> A couple of thoughts:
>
> We wound up doing a bunch of tuning on the Java garbage collection.
> However, the pattern we were seeing was periodic very extreme slowdowns,
> because we were then using the default garbage collector, which blocks
> when it has to do a major collection. This doesn't sound like your
> problem, but it's something to be aware of.
>
> One thing that could fit the pattern you describe would be Solr caches
> filling up and getting you too close to your JVM or memory limit. For
> example, if you have large documents, and have defined a large document
> cache, that might do it.
>
> I found it useful to point jconsole (free with the JDK) at my JVM, and
> watch the pattern of memory usage. If the troughs at the bottom of the GC
> cycles keep rising, you know you've got something that is continuing to
> grab more memory and not let go of it. Now that our JVM is running
> smoothly, we just see a sawtooth pattern, with the troughs approximately
> level. When the system is under load, the frequency of the wave rises. Try
> it and see what sort of pattern you're getting.
>
> -- Bryan
>
>> -Original Message-
>> From: Matthias Käppler [mailto:matth...@qype.com]
>> Sent: Thursday, February 16, 2012 7:23 AM
>> To: solr-user@lucene.apache.org
>> Subject: Frequent garbage collections after a day of operation
>>
>> Hey everyone,
>>
>> we're running into some operational problems with our SOLR production
>> setup here and were wondering if anyone else is affected or has even
>> solved these problems before. We're running a vanilla SOLR 3.4.0 in
>> several Tomcat 6 instances, so nothing out of the ordinary, but after
>> a day or so of operation we see response times from SOLR increase by up
>> to 3x on average. During this time we see increased CPU
>> load due to heavy garbage collection in the JVM, which bogs down the
>> whole system, so throughput decreases, naturally. When restarting
>> the slaves, everything goes back to normal, but that's more like a
>> brute force solution.
>>
>> The thing is, we don't know what's causing this and we don't have that
>> much experience with Java stacks since we're for most parts a Rails
>> company. Are Tomcat 6 or SOLR known to leak memory? Is anyone else
>> seeing this, or can you think of a reason for this? Most of our
>> queries to SOLR involve the DismaxHandler and the spatial search query
>> components. We don't use any custom request handlers so far.
>>
>> Thanks in advance,
>> -Matthias
>>
>> --
>> Matthias Käppler
>> Lead Developer API & Mobile
>>
>> Qype GmbH
>> Großer Burstah 50-52
>> 20457 Hamburg
>> Telephone: +49 (0)40 - 219 019 2 - 160
>> Skype: m_kaeppler
>> Email: matth...@qype.com
>>
>> Managing Director: Ian Brotherston
>> Amtsgericht Hamburg
>> HRB 95913
>>
>> This e-mail and its attachments may contain confidential and/or
>> privileged information. If you are not the intended recipient (or have
>> received this e-mail in error) please notify the sender immediately
>> and destroy this e-mail and its attachments. Any unauthorized copying,
>> disclosure or distribution of this e-mail and  its attachments is
>> strictly forbidden. This notice also applies to future messages.
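
As a concrete starting point for the GC tuning and monitoring Bryan
describes above (a sketch only, assuming a Sun/Oracle JDK 6 under Tomcat;
heap sizes and the log path are placeholders to adjust per machine),
switching the slaves to the concurrent collector and turning on GC logging
makes it much easier to see whether the old-generation troughs keep rising:

    -Xms2g -Xmx2g
    -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
    -XX:+CMSParallelRemarkEnabled
    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
    -Xloggc:/var/log/tomcat6/gc.log

These options would typically go into JAVA_OPTS or CATALINA_OPTS for the
Tomcat instances. With the log (or jconsole, or jstat -gcutil <pid> 5s) you
can check whether the heap low-water mark after each full collection keeps
creeping up, which would point at something holding on to memory (for
example oversized Solr caches) rather than at the collector itself.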