Fuzzy searching documents over multiple fields using Solr

2013-05-09 Thread britske
Not sure if this has ever come up (or perhaps has even been implemented without me
knowing), but I'm interested in doing fuzzy search over multiple fields
using Solr.

What I mean is the ability to return documents based on some 'distance
calculation', without documents having to match the query 100%.

Use-case: a user is searching for a TV with a couple of filters selected. No
TV matches all filters. How do I come up with a bunch of suggestions that
match the selected filters as closely as possible? The hard part is to
determine what 'closely' means in this context, etc.

This relates to (approximate) nearest neighbor, kd-trees, etc. Has anyone
ever tried to do something similar? Any plugins, etc.? Or reasons why Solr/Lucene
would or wouldn't be the correct system to build on?

Thanks





multiple dateranges/timeslots per doc: modeling openinghours.

2011-09-26 Thread britske
Sorry for the somewhat lengthy post; I would like to make clear that I have covered
my bases here and am looking for an alternative solution, because the more
trivial solutions don't seem to work for my use-case.

Consider bars, museums, etc.

These places have multiple openinghours that can depend on:
REQ 1. day of week
REQ 2. special days on which they are closed, or have openinghours that
differ from their related 'day of week'

Now, I want to model these 'places' in a way so I'm able to do temporal
queries like: 
- which bars are open NOW (and stay open for at least another 3 hours)
- which museums are (already) open at 25-12-2011 10AM and stay open until
(at least) 3PM.

I believe having opening/closing hours available for each day at least gives
me the data needed to query the above. (Note that having
dayOfWeek*openinghours is not enough, because of the special cases in REQ 2.)

Okay, knowing I need openinghours*dates for each place, how would I model
this in documents?

OPTION A) 
---
Considering granularity: I want documents to represent Places and not
Places*dates. Although the latter would trivially allow me to do the querying
mentioned above, it has these disadvantages:

 - the same place is returned multiple times (each with a different date) when
queries are not constrained to a date.
 - lots of data needs to be duplicated, all for the conceptually 'simple'
functionality of needing multiple date-ranges. It feels bad and a simpler
solution should exist?
 - exploding the resultset (documents = say, 100 dates * 1.000.000 places =
100.000.000 documents). Suddenly the size of the resultset goes from 'easily
doable' to 'hmmm, I have to think about this'. Given that places also have
some other fields to sort on, Lucene fieldcache mem-usage would explode by a
factor of 100.

OPTION B)
--
Another, faulty, option would be to model opening/closing hours as 2
multivalued date-fields, i.e. 'open' and 'close', and insert an open/close pair
for each day, e.g.:

open: 2011-11-08:1800 - close: 2011-11-09:0300
open: 2011-11-09:1700 - close: 2011-11-10:0500
open: 2011-11-10:1700 - close: 2011-11-11:0300

And queries would be of the form:

'open < now && close > now+3h'

But since there is no way to indicate that 'open' and 'close' are pairwise
related, I will get a lot of false positives; e.g. the above document would be
returned for:

open < 2011-11-09:0100 && close > 2011-11-09:0600
because SOME open date is before 2011-11-09:0100 (i.e. 2011-11-08:1800) and
SOME close date is after 2011-11-09:0600 (for example 2011-11-11:0300), but
these open and close dates are not pairwise related.
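
(To make the false positive concrete: a quick SolrJ sketch of that option B
filter, assuming two multivalued date fields named 'open' and 'close'. The two
range clauses can each be satisfied by a different value of the multivalued
fields, so the document above matches even though no single open/close pair
covers the requested span.)

  import org.apache.solr.client.solrj.SolrQuery;

  public class OptionBFilterSketch {
      public static void main(String[] args) {
          SolrQuery q = new SolrQuery("*:*");
          // "some open value before 01:00" ...
          q.addFilterQuery("open:[* TO 2011-11-09T01:00:00Z]");
          // ... AND "some close value after 06:00" -- not necessarily the same pair
          q.addFilterQuery("close:[2011-11-09T06:00:00Z TO *]");
          System.out.println(q); // prints the raw request parameters
      }
  }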

OPTION C) The best of what I have now:
---
I have been thinking about a totally different approach using Solr dynamic
fields, in which each and every opening and closing date gets its own
dynamic field, e.g.:

_date_2011-11-08_open: 1800
_date_2011-11-09_close: 0300
_date_2011-11-09_open: 1700
_date_2011-11-10_close: 0500
_date_2011-11-10_open: 1700
_date_2011-11-11_close: 0300

Then, the client should know the date to query, and thus the correct fields
to query. This would solve the problem, since startdate/enddate are now
pairwise related, but I fear this can be a big issue from a performance
standpoint (especially memory consumption of the Lucene fieldcache).


IDEAL OPTION D) 

I'm pretty sure this does not exist out-of-the-box, but might be built as an extension.
Okay, Solr has a fieldtype 'date', but what if it also had a fieldtype
'Daterange'? A Daterange would be modeled as something like [startdate, enddate].


Then this problem could be really easily modelled as a multivalued field
'openinghours' of type 'Daterange'.
However, I have the feeling that the standard range-query implementation
can't be used on this fieldtype, or would perhaps have to be run for each of
the N daterange-values in 'openinghours'.

To make matters worse (I didn't want to introduce this above):
REQ 3: It may be possible that certain places have multiple opening-hours /
timeslots each day. Consider a museum in Spain which closes around noon
for siesta.
OPTION D) would be able to handle this natively; none of the other options can.

I would very much appreciate any pointers to: 
 - how to start with option D, and whether this approach is at all feasible.
 - whether option C would suffice (excluding REQ 3.), and whether I'm likely to run
into performance / memory trouble.
 - any other possible solutions I haven't thought of to tackle this.

Thanks a lot. 

Cheers,
Geert-Jan








Modeling openinghours using multipoints

2012-12-08 Thread britske
Hi all, 

Over a year ago I posted a use-case to the (in this context familiar) issue
SOLR-2155 about modelling openinghours using multivalued points.

https://issues.apache.org/jira/browse/SOLR-2155?focusedCommentId=13114839&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13114839

David (Smiley) gave two possible solutions that would work, but I'm
wondering if the latest advancements in spatial search have made a more
straightforward implementation possible. 

The crux: 
 - A venue can have multiple openinghours (depending on day of week, special
holidays, and sometimes even multiple timeslots per day)
 - queries like the following should be possible: "which venues are open at
least for the following timespan: [NOW, NOW+3h]", or [this Monday 6AM, this
Monday 11PM]
 - no need to search in the past. 

To me, such an [open,close]-timespan could be nicely modelled as a point;
thus all openinghours of a venue could be defined as multiple points
(multivalued points, multipoint, shape; not sure of the recent nomenclature).

In the open/close domain the general query would be:
Given a user-defined query Q(open,close), return all venues that have a
timespan T(open,close) (out of many timespans) for which the following
holds:
T.open <= Q.open AND Q.close <= T.close

Mapping 'open' to latitude and 'close' to longitude results in:

Given a user-defined point X, return all docs that have a point P defined
(out of many points) for which the following holds:
P.latitude <= X.latitude AND X.longitude <= P.longitude

The question: is such a query on multipoints now doable out-of-the-box with
spatial4j (or any other supported plugin, for that matter)?

Any help highly appreciated! 

Kind regards, 
Geert-Jan. 

Oh, btw: the translation function becomes easy because I don't
need to search dates in the past. Moreover, a reindex takes place every
night, meaning today 0AM can be defined as 0. With a granularity of 15
minutes and wanting to search 100 days ahead, the transform is simply a
mapping of 9600 intervals (100*24*4) for open and close onto [-90,90] and
[0,180] respectively.
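
(A minimal sketch of that transform, just to make it concrete; the 15-minute
granularity and the nightly-reindex epoch are the assumptions stated above,
everything else is arbitrary.)

  import java.util.concurrent.TimeUnit;

  public class TimeToInterval {
      // reindexMidnightMillis = midnight (0AM) of the nightly reindex day, in epoch millis.
      static int toInterval(long timestampMillis, long reindexMidnightMillis) {
          long minutes = TimeUnit.MILLISECONDS.toMinutes(timestampMillis - reindexMidnightMillis);
          return (int) (minutes / 15); // 0 .. 9599 for a 100-day window
      }
  }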





Re: Modeling openinghours using multipoints

2012-12-08 Thread britske
Brilliant! Got some great ideas for this. Indeed all sorts of use-cases which
use multiple temporal ranges could benefit.

E.g. another guy on Stack Overflow asked me about this some days ago. He wants
to model multiple temporary offers per product (free shipping for Christmas,
20% discount for Black Friday, etc.). All possible with this out of the box.
Factor 'offer category' into x and y as well for some extra powerful
querying.

Yup, I'm enthusiastic about it, which I'm sure you can tell :)

Thanks a lot David,

Cheers,
Geert-Jan 



Sent from my iPhone

On 9 Dec 2012, at 05:35, "David Smiley (@MITRE.org) [via Lucene]" wrote:

> britske wrote
> That's seriously awesome! 
> 
> Some change in the query though: 
> You described: "To query for a business that is open during at least some 
> part of a given time duration" 
> I want "To query for a business that is open during at least the entire 
> given time duration". 
> 
> Feels like a small difference but probably isn't (I'm still wrapping my 
> head on the intersect query I must admit)
> So this would be a slightly different rectangle query.  Interestingly, you 
> simply swap the location in the rectangle where you put the start and end 
> time.  In summary: 
> 
> Indexed span CONTAINS query span: 
> minX minY maxX maxY -> 0 end start * 
> 
> Indexed span INTERSECTS (i.e. OVERLAPS) query span: 
> minX minY maxX maxY -> 0 start end * 
> 
> Indexed span WITHIN query span: 
> minX minY maxX maxY -> start 0 * end 
> 
> I'm using '*' here to denote the max possible value.  At some point I may add 
> that as a feature. 
> 
> That was a fun exercise!  I give you credit in prodding me in this direction 
> as I'm not sure if this use of spatial would have occurred to me otherwise. 
> 
> britske wrote
> Moreover, any indication on performance? Should, say, 50.000 docs with 
> about 100-200 points each (1 to 2 open-close spans per day) be ok? ( I know 
> 'your mileage may vary' etc. but just a guesstimate :)
> You should have absolutely no problem.  The real clincher in your favor is 
> the fact that you only need 9600 discrete time values (so you said), not 
> Long.MAX_VALUE.  Using Long.MAX_VALUE would simply not be possible with the 
> current implementation because it's using Doubles which has 52 bits of 
> precision not the 64 that would be required to be a complete substitute for 
> any time/date.  Even given the 52 bits, a quad SpatialPrefixTree with 
> maxLevels="52" would probably not perform well or might fail; not sure.  
> Eventually when I have time to work on an implementation that can be based on 
> a configurable number of grid cells (not unlike how you can configure 
> precisionStep on the Trie numeric fields), 52 should be no problem. 
> 
> I'll have to remember to refer back to this email on the approach if I create 
> a field type that wraps this functionality. 
> 
> ~ David 
> 
> britske wrote
> Again, this looks good! 
> Geert-Jan 
> 
> 2012/12/8 David Smiley (@MITRE.org) [via Lucene] < 
> [hidden email]> 
> 
> > Hello again Geert-Jan! 
> > 
> > What you're trying to do is indeed possible with Solr 4 out of the box. 
> >  Other terminology people use for this is multi-value time duration.  This 
> > creative solution is a pure application of spatial without the geospatial 
> > notion -- we're not using an earth or other sphere model -- it's a flat 
> > plane.  So no need to make reference to longitude & latitude, it's x & y. 
> > 
> > I would put opening time into x, and closing time into y.  To express a 
> > point, use "x y" (x space y), and supply this as a string to your 
> > SpatialRecursivePrefixTreeFieldType based field for indexing.  You can give 
> > it multiple values and it will work correctly; this is one of RPT's main 
> > features that set it apart from Solr 3 spatial.  To query for a business 
> > that is open during at least some part of a given time duration, say 6-8 
> > o'clock, the query would look like openDuration:"Intersects(minX minY maxX 
> > maxY)"  and put 0 or minX (always), 6 for minY (start time), 8 for maxX 
> > (end time), and the largest possible value for maxY.  You wouldn't actually 
> > use 6 & 8, you'd use the number of 15 minute intervals since your epoch for 
> > this equivalent time span. 
> > 
> > You'll need to configure the field correctly: geo="false" worldBounds="0 0 
> > maxTime maxTime" substituting an appropriate value for maxTime based on 
> > your unit of time
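
(Putting the pieces of this exchange together, a minimal SolrJ sketch. The
field name 'openDuration', the example interval values and the 9600 upper
bound are my own assumptions; the field is assumed to be a multiValued
SpatialRecursivePrefixTreeFieldType configured with geo="false" and
worldBounds="0 0 9600 9600" as David describes above.)

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.common.SolrInputDocument;

  public class OpeningHoursSketch {
      public static void main(String[] args) {
          // Indexing: one "open close" point per opening span, values in 15-minute intervals.
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", "venue-1");
          doc.addField("openDuration", "72 108");   // open 18:00 day 0, close 03:00 day 1
          doc.addField("openDuration", "164 212");  // open 17:00 day 1, close 05:00 day 2

          // "Open during the ENTIRE query span [start, end]" = indexed span CONTAINS query span,
          // i.e. the rectangle minX=0, minY=end, maxX=start, maxY=maxTime from the summary above.
          int start = 80, end = 104, maxTime = 9600;  // e.g. 20:00 .. 02:00 of the first night
          SolrQuery q = new SolrQuery("*:*");
          q.addFilterQuery("openDuration:\"Intersects(0 " + end + " " + start + " " + maxTime + ")\"");
          System.out.println(q);  // the first span (72..108) matches, the second does not
      }
  }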

modeling prices based on daterange using multipoints

2012-12-11 Thread britske
Hi all,

Based on some good discussion in the thread "Modeling openinghours using
multipoints", I was triggered to revisit an old painpoint of mine: modeling
pricing & availability of hotels, which depends on a couple of factors
including date of arrival, length of stay & roomtype.

This question is to see if it would be possible to model the above using
multipoints (or some other technique I'm not aware of that has come into
existence in Lucene / Solr in the last 2 years or so).

Let me explain: hotels (in my implementation) have pricing & availability
based on: date, duration, nr of persons, roomtype (e.g. single, double,
twin, triple, family). Instead of modeling these as separate documents,
currently I model 1 doc per hotel where each
<date, duration, nr of persons, roomtype> combo has its own price and is
modeled as a separate field (configured in the backend as dynamic fields: ddp-*).
Non-availability is just modeled as the absence of the particular field.

The advantage of modeling 1 doc per hotel is clear: users have no chance of
seeing multiple offers per hotel in the frontend. It's just what they have
become accustomed to with this type of travel / hotel search engine.

Now there's also a big disadvantage to my current setup: Lucene/Solr just
isn't really built for having 20.000+ fields that can be sorted and
filtered on. (I could go into this, but it's not really the point of this
question.)

I realize the new spatial stuff in Solr 4 is no magic bullet, but I'm
wondering if I could model multiple prices per day as multipoints, where:

 - date*duration*nr of persons*roomtype is modeled as point.x (discretized
into some 20.000 values)
 - price is modeled as point.y (in dollarcents, normalized as avg price per
day: range [0,20], covering a max price of $2.000/day); see the sketch below
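
(A minimal sketch of the encoding I have in mind; the dimension sizes and the
class/method names below are just example assumptions that happen to multiply
out to 20.000 combos.)

  public class OfferPointSketch {
      // Example dimensions: 100 dates * 10 durations * 5 person-counts * 4 roomtypes = 20.000 combos.
      static int encodeX(int dateIndex, int durationIndex, int personsIndex, int roomtypeIndex) {
          return ((dateIndex * 10 + durationIndex) * 5 + personsIndex) * 4 + roomtypeIndex;
      }

      // One offer becomes one "x y" point string: x = the combo, y = the (normalized) price.
      static String toPoint(int x, int normalizedPricePerDay) {
          return x + " " + normalizedPricePerDay;
      }

      public static void main(String[] args) {
          int x = encodeX(3, 2, 1, 0);          // some <date, duration, persons, roomtype> combo
          System.out.println(toPoint(x, 95));   // prints "644 95"
      }
  }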

The stuff that needs to be possible:
 A) 1 required filter on point.x (filtering on 1 particular <date, duration, nr of persons, roomtype> combo)
 B) an optional range query on point.y (min and/or max price filter)
 C) optional sorting on point.y (sorting on price, normal or reverse)

I'm pretty certain A) and B) won't be a problem as far as functionality is
concerned, but how about performance? I.e. would some sort of cached Solr
filter jump in for a given <date, duration, nr of persons, roomtype> combo,
for quick doc-intersection, just as it would with multiple dynamic fields in
my described as-is case?

How about C)? Is sorting on point.y possible (potentially in conjunction
with other sorting-fields used as a tiebreaker, to give a stable sort)? I
remember having read that any filterquery can be used for sorting combined
with multipoints (which would make the above work, I guess) but I would just
like to confirm.

Looking forward to your feedback, 

Best, 
Geert-Jan










Re: modeling prices based on daterange using multipoints

2012-12-11 Thread britske
Hi David,

Yeah, an interesting (as well as problematic, as far as implementing goes) use-case
indeed :)

1. You mention "there are no special caches / memory requirements inherent
in this.". For a given user-query this would mean that all hotels would have to
be searched on point.x each time, right? What would be a good plugin-point to
build in some custom cached filter code for this (perhaps using the Solr
filter cache)? As I see it, determining all hotels that have a particular
point.x value is probably: A) pretty costly to do on each user query, B)
static and easily cached without a lot of memory (relatively
speaking), i.e. 20.000 filters (representing all of the 20.000 different
point.x values, that is, <date, duration, nr of persons, roomtype> combos) with a
bitset per filter representing the ids of hotels that have the said point.x.

2. I'm not sure I explained C. (sorting) well, since I believe you're
talking about implementing custom code to sort multiple point.y's per
hotel, correct? That's not what I need. Instead, for every user-query at
most 1 point ever matches. I.e. a hotel has a price for a particular
<date, duration, nr of persons, roomtype> combo (P.x) or it hasn't.

Say a user queries for the combo <21
dec 2012, 3 days, 2 persons, double>. This might be encoded into a value,
say: 12345.
Now, for the hotels that do match that query (i.e. those hotels that have a
point P for which P.x=12345) I want to sort those hotels on P.y (the price
for the requested P.x).

Geert-Jan




2012/12/11 David Smiley (@MITRE.org) [via Lucene] <
ml-node+s472066n4026151...@n3.nabble.com>

> Hi Britske,
>   This is a very interesting question!
>
> britske wrote
> ...
> I realize the new spatial-stuff in Solr 4 is no magic bullet, but I'm
> wondering if I could model multiple prices per day as multipoints, whereas:
>
>  - date*duration*nr of persons*roomtype is modeled as point.x (discretized
> in some 20.000 values)
>  - price modeled as point.y ( in dollarcents / normalized as avg price per
> day: range:  [0,20] covering a max price of $2.000/day)
>
> The stuff that needs to be possible:
>  A) 1 required filter on point.x (filtering a 1 particular
>  combo.
>  B) an optional range query on point.y (min and./or max price filter)
>  C) optional soring on point.y (sorting on price (normal or reverse))
>
> I'm pretty certain A) and B) won't be a problem as far is functionality is
> concerned, but how about performance? I.e: would some sort of cached Solr
> filter jump in for a given  combo,
> for quick doc-interesection, just as would with multiple dynamic fields in
> my desribed as-is-case?
>
> A & B are indeed not a problem and there are no special caches / memory
> requirements inherent in this.
>
> britske wrote
> How about C)? Is sorting on point.y possible? (potenially in conjunction
> with other sorting-fields used as tiebreaker, to give a stable sort? I
> remember to have read that any filterquery can be used for sorting combined
> with multipoints (which would make the above work I guess) but just would
> like to confirm.
> ...
>
> 'C' (sorting) is the challenge.  As it stands, you will have to implement
> a variation of this class:
> http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/lucene/spatial/src/java/org/apache/lucene/spatial/util/ShapeFieldCacheDistanceValueSource.java?view=markup
> Unlike this implementation, your implementation should  ensure the point is
> indeed in the query shape, and it should be configured to take the smallest
> or largest 'y' as desired.  Note that the cache infrastructure that this is
> built on is flakey right now -- a memory hog in multiple ways.  There will
> be a Point implementation in memory for all of your indexed points, and an
> ArrayList per doc.  And it's not NRT search friendly, and doesn't
> relinquish its resources (i.e. on commit) as quickly as it should.  I know
> what it's problems are but I have been quite busy.
>
> ~ David
>  Author:
> http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
>
>

solr 1.4: extending StatsComponent to recognize localparm {!ex}

2009-08-25 Thread Britske

hi,

I'm looking for a way to extend StatsComponent to recognize localparams,
especially the {!ex}-param.
To my knowledge this isn't implemented in the current trunk.

One of my use-cases for this is to be able to have a javascript
price-slider, where the user can operate the slider and thus set a
price-filter (fq) (e.g. see kayak.com). Regardless of this setting, min and
max values for the price should be available so as to properly scale the
slider. I have a couple of other uses for it too, so I'm really interested
in this feature.
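
(To make it concrete, a SolrJ sketch of the kind of request I mean; field and
tag names are just examples. The tagged-filter exclusion is what facet.field
already supports, and the stats.field variant is exactly the part that isn't
supported yet.)

  import org.apache.solr.client.solrj.SolrQuery;

  public class StatsExcludeSketch {
      public static void main(String[] args) {
          SolrQuery q = new SolrQuery("*:*");
          // The price filter set by the slider, tagged so it can be excluded elsewhere:
          q.addFilterQuery("{!tag=priceFq}price:[100 TO 200]");
          // FacetComponent already honors the exclusion:
          q.setFacet(true);
          q.addFacetField("{!ex=priceFq}price");
          // What I'd like StatsComponent to honor as well (not supported today):
          q.set("stats", true);
          q.set("stats.field", "{!ex=priceFq}price");
          System.out.println(q);
      }
  }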

I'm trying to get my head around how the SearchComponents relate to each other
and how the general (component) flow works. Judging from the source of
SearchHandler it seems that when no sharding is used (I can use this case for
now) the flow is basically that for all registered SearchComponents the
process() method is called, so there is nothing like stages, etc. to take
into account?

I was thinking of letting the modified StatsComponent issue a new request in
ResponseBuilder.outgoing that would be picked up by FacetComponent when
StatsComponent found a stats.field with the {!ex}-param. Afterwards the
modified StatsComponent would somehow pick up the result (a widened set of
docs that match the original query excluding the specified filter) and
process it further like before.

However, as described above, this code seems to only work when using
sharding, so I'm quite unsure how to proceed further. 

Moreover, I can't seem to find the actual code in FacetComponent, or anywhere
else for that matter, where the {!ex}-param case is treated. I assume it's in
FacetComponent.refineFacets but I can't seem to get a grip on it. Perhaps
it's late here.

So, does someone care to shed some light on how this might be done? (I only need
some general directions, I hope.)

thanks,

Geert-Jan



Re: solr 1.4: extending StatsComponent to recognize localparm {!ex}

2009-08-26 Thread Britske

Thanks for that.
It works now ;-)


Erik Hatcher-4 wrote:
> 
> 
> On Aug 25, 2009, at 6:35 PM, Britske wrote:
>> Moreover, I can't seem to find the actual code in FacetComponent or  
>> anywhere
>> else for that matter where the {!ex}-param case is treated. I assume  
>> it's in
>> FacetComponent.refineFacets but I can't seem to get a grip on it..  
>> Perhaps
>> it's late here..
>>
>> So, somone care to shed a light on how this might be done? (I only  
>> need some
>> general directions I hope..)
> 
> It's in SimpleFacets, that does a call to QueryParsing.getLocalParams().
> 
>   Erik
> 
> 
> 




If field A is empty take field B. Functionality available?

2009-08-28 Thread Britske

I have 2 fields:  
realprice
avgprice

I'd like to be able to take the contents of avgprice if realprice is not
available.
Due to the design, the average price cannot be encoded in the 'realprice' field.

Since I need to be able to filter, sort and facet on these fields, it would
be really nice to be able to do that just on something like a virtual field
called 'price' or something. That field should contain the conditional logic
to know which actual field to take the contents from.

I was looking at using functionqueries, but to my knowledge these can't be
used to filter and facet on.

Would creating a custom field work for this, or does a field know nothing
about its sibling fields? What would the performance impact be like, since this
is really important in this instance?

Any better ways? Subclassing StandardRequestHandler and hacking it all
together seems rather ugly to me, but if it's needed...
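
(For comparison, the straightforward route would be to compute a derived
'price' field at index time, roughly like the SolrJ sketch below; field names
are just examples, and of course this only helps if the choice between the two
fields can already be made while indexing.)

  import org.apache.solr.common.SolrInputDocument;

  public class DerivedPriceField {
      static SolrInputDocument buildDoc(String id, Integer realPrice, Integer avgPrice) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", id);
          if (realPrice != null) doc.addField("realprice", realPrice);
          if (avgPrice != null) doc.addField("avgprice", avgPrice);
          // Derived field that filtering, sorting and faceting can target directly:
          doc.addField("price", realPrice != null ? realPrice : avgPrice);
          return doc;
      }
  }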

Thanks, 
Geert-Jan




Re: If field A is empty take field B. Functionality available?

2009-08-28 Thread Britske

I oversimplified my question somewhat. 

The realprice can actually be 1 out of +/- 10 thousand dynamic fields. Which
one is determined at query time (the price depends on date, class, and some
other properties; the product of the options of these properties amounts to
the 10k number). The avg field follows a similar pattern: it's actually 1
out of +/- 10 dynamic fields. So populating a field at index-time sadly is
out of the question.

Would a custom fieldtype (with possibly a custom valuesource that determines
from which index, realprice or avgprice, to actually take the value) be a
possibility here? I'm looking for a way that would work transparently even
if used for faceting or filtering. Would that be the case in this scenario?

Sorting is a special case as well since, independent of what sort is
used, products with real prices should always come before products with
only avg prices.

Thanks, 
Geert-Jan



ryantxu wrote:
> 
> can you just add a new field that has the real or ave price?
> Just populate that field at index time...  make it indexed but not  
> stored
> 
> If you want the real or average price to be treated the same in  
> faceting, you are really going to want them in the same field.
> 
> 
> On Aug 28, 2009, at 1:16 PM, Britske wrote:
> 
>>
>> I have 2 fields:
>> realprice
>> avgprice
>>
>> I'd like to be able to take the contents of avgprice if realprice is  
>> not
>> available.
>> due to design the average price cannot be encoded in the 'realprice'- 
>> field.
>>
>> Since I need to be able to filter, sort and facet on these fields,  
>> it would
>> be really nice to be able to do that just on something like a  
>> virtual-field
>> called 'price' or something. That field should contain the  
>> conditional logic
>> to know from which actual field to take the contents from.
>>
>> I was looking at using functionqueries, but to me knowledge these  
>> can't be
>> used to filter and facet on.
>>
>> Would creating a custom field work for this or does a field know  
>> nothing
>> from its sibling-fields? What would performance impact be like,  
>> since this
>> is really important in this instance.
>>
>> Any better ways? Subclassing standardrequestHandler and hacking it all
>> together seems rather ugly to me, but if it's needed...
>>
>> Thanks,
>> Geert-Jan
>>
>>
> 
> 
> 




manually creating indices to speed up indexing with app-knowledge

2009-11-02 Thread Britske

This may seem like a strange question, but here it goes anyway. 

I'm considering the possibility of constructing indices at a low level for about
20.000 indexed fields (type sInt), if at all possible. (With indices in this
context I mean the inverted indices from term to document id, just to be 100%
complete.)
These indices have to be recreated each night, along with the normal
reindex. 

Globally it should go something like this (each night) : 
 - documents (consisting of about 20 stored fields and about 10 stored &
indexed fields) are indexed through the normal 'code-path' (solrJ in my
case) 
- After all docs are persisted (max 200.000) I want to extract the mapping
from 'lucene docid' --> 'stored/indexed product key'
I believe this should work, because after all docs are persisted the
internal docids aren't altered, so the relationship between 'lucene docid'
--> 'stored/indexed product key' is invariant from that point forward.
(please correct if wrong) 
- construct the 20.000 inverted indices at a low enough level that I do
not have to go through IndexWriter, if possible, so I do not need to
construct Documents; I only need to construct the native format of the
indices themselves. Ideally this should work on multiple servers, so that the
indices can be created in parallel and the index files later simply copied
to the index directory of the master.

Basically what it boils down to is that indexing time (a reindex should be
done each night) is a big show-stopper at the moment, although we've tried
and tested all the more standard optimization tricks & techniques, as well
as having built a home-grown shard-like indexing strategy which uses 20
pretty big servers in parallel. The 20.000 indexed fields are still simply
killing us.

At the same time the app has a lot of knowledge of the 20.000 indices:
- All indices consist of prices (ints) between 0 and 10.000
- and most importantly: as part of the document construction process the
ordering of each of the 20.000 indices is known for all documents that are
processed by the document-construction server in question. (This part is
needed, and is already performing at light speed.)

For the sake of argument say we have 5 document-construction servers. Each
server processes 40.000 documents. Each server has 20.000 ordered indices in
its own format readily available for the 40.000 documents it's processing,
something like a LinkedHashMap per field holding the ordered (product key, price) pairs.


Say we have 20 indexing servers. Each server has to calculate 1.000 indices
(totalling the 20.000).
We have the 5 doc-construction servers distribute the ordered sub-indices to
the correct servers.
Each server constructs an index from 5 ordered sub-indices coming from 5
different construction servers. This can be done efficiently using a
mergesort (since the sub-indices are already sorted), as in the sketch below.
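
(A rough sketch of that merge step; the Entry/Head types and the in-memory
result list are my own simplifications of whatever the application format
really is.)

  import java.util.ArrayList;
  import java.util.Comparator;
  import java.util.Iterator;
  import java.util.List;
  import java.util.PriorityQueue;

  public class SubIndexMerge {

      static class Entry {
          final int value;          // e.g. a price between 0 and 10.000
          final long productKey;    // the stored/indexed product key
          Entry(int value, long productKey) { this.value = value; this.productKey = productKey; }
      }

      // One sorted sub-index slice plus its current head entry.
      static class Head {
          Entry entry;
          final Iterator<Entry> rest;
          Head(Entry entry, Iterator<Entry> rest) { this.entry = entry; this.rest = rest; }
      }

      // K-way merge of the (already sorted) slices delivered by the construction servers.
      static List<Entry> merge(List<Iterator<Entry>> sortedSlices) {
          PriorityQueue<Head> heap = new PriorityQueue<>(Comparator.comparingInt((Head h) -> h.entry.value));
          for (Iterator<Entry> slice : sortedSlices) {
              if (slice.hasNext()) heap.add(new Head(slice.next(), slice));
          }
          List<Entry> merged = new ArrayList<>();
          while (!heap.isEmpty()) {
              Head head = heap.poll();
              merged.add(head.entry);
              if (head.rest.hasNext()) {
                  head.entry = head.rest.next();
                  heap.add(head);   // re-insert with its new head entry
              }
          }
          return merged;
      }
  }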

All that is missing (oversimplifying here) is going from the ordered
indices in application format to the index format of Lucene (substituting
the product ids with the Lucene docids along the way) and streaming it to disk.
I believe this would quite possibly give a really big indexing improvement.

Is my thinking correct in the steps involved?
Do you believe that this would indeed give a big speedup for this specific
situation?
Where would I hook into the Solr / Lucene code to construct the native format?


Thanks in advance (and for making it to here) 

Geert-Jan




big discrepancy between elapsedtime and qtime although enableLazyFieldLoading= true

2008-07-28 Thread Britske

Hi all,

For some queries I need to return a lot of rows at once (say 100). 
When performing these queries I notice a big difference between qTime (which
is mostly in the 15-30 ms range due to caching) and total time taken to
return the response (measured through SolrJ's elapsedTime), which takes
between 500-1600 ms. 

For queries which return fewer rows the difference becomes smaller.

I presume (after reading some threads in the past) that this is due to Solr
constructing and streaming the response (which includes retrieving the
stored fields), which is something that is not included in qTime.

Documents have a lot of stored fields (more than 10.000), but for any given
query a maximum of, say, 20 are returned (through the fl parameter) or used (as
part of filtering, faceting, sorting).

I would have thought that enabling enableLazyFieldLoading in this situation
would mean a lot, since so many stored fields can be skipped, but I notice
no real difference when measuring total elapsed time (or qTime for that
matter).

Am I missing something here? What criteria would need to be met for a field
to not be loaded for instance? Should I see a big performance boost in this
situation?

Thanks,
Britske



Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading= true

2008-07-28 Thread Britske

Size on disk is 1.84 GB (of which 1.3 GB sits in FDT files if that matters)
Physical RAM is 2 GB with -Xmx800M set to Solr. 


Yonik Seeley wrote:
> 
> That high of a difference is due to the part of the index containing
> these particular stored fields not being in OS cache.  What's the size
> on disk of your index compared to your physical RAM?
> 
> -Yonik
> 
> On Mon, Jul 28, 2008 at 4:10 PM, Britske <[EMAIL PROTECTED]> wrote:
>>
>> Hi all,
>>
>> For some queries I need to return a lot of rows at once (say 100).
>> When performing these queries I notice a big difference between qTime
>> (which
>> is mostly in the 15-30 ms range due to caching) and total time taken to
>> return the response (measured through SolrJ's elapsedTime), which takes
>> between 500-1600 ms.
>>
>> For queries which return less rows the difference becomes less big.
>>
>> I presume (after reading some threads in the past) that this is due to
>> solr
>> constructing and streaming the response (which includes retrieving the
>> stored fields) , which is something that is not calculated in qTime.
>>
>> Documents have a lot of stored fields (more than 10.000), but at any
>> given
>> query a maximum of say 20 are returned (through fl-field ) or used (as
>> part
>> of filtering, faceting, sorting)
>>
>> I would have thought that enabling enableLazyFieldLoading for this
>> situation
>> would mean a lot, since so many stored fields can be skipped, but I
>> notice
>> no real difference in measuring total elapsed time (or qTime for that
>> matter).
>>
>> Am I missing something here? What criteria would need to be met for a
>> field
>> to not be loaded for instance? Should I see a big performance boost in
>> this
>> situation?
>>
>> Thanks,
>> Britske
>>
>>
> 
> 




Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading= true

2008-07-28 Thread Britske

I'm on a development box currently and production servers will be bigger, but
at the same time the index will be too. 

Each query requests at most 20 stored fields. Why doesn't lazyfieldloading help
in this situation?
I don't need to retrieve all stored fields, and I thought I wasn't doing this
(through limiting the fields returned using the fl param), but if I read
your comment correctly, apparently I am retrieving them all, I'm just not
displaying them all?

Also, if I understand correctly, for optimal performance I need to have at
least enough RAM to put the entire index in the OS cache (thus RAM) + the
amount of RAM that Solr / Lucene consumes directly through the JVM (which
among other things includes the Lucene field cache + all of Solr's caches on
top of that)?

I've never read of the requirement of having the entire index in the OS cache
before; is this because in normal situations (with fewer stored fields) it
doesn't matter much? I'm just surprised to hear of this for the first time,
since it will likely have a big impact on my design.

Luckily most of the normal queries return 10 documents each, which results
in a discrepancy between total elapsed time and qTime of about 15-30 ms.
Doesn't this seem strange, since to me it would seem logical that the
discrepancy would be at least 1/10th of that of fetching 100 documents?

Hmm, hope you can shed some light on this,

Thanks a lot,
Britske



Yonik Seeley wrote:
> 
> That's a bit too tight to have *all* of the index cached...your best
> bet is to go to 4GB+, or figure out a way not to have to retrieve so
> many stored fields.
> 
> -Yonik
> 
> On Mon, Jul 28, 2008 at 4:27 PM, Britske <[EMAIL PROTECTED]> wrote:
>>
>> Size on disk is 1.84 GB (of which 1.3 GB sits in FDT files if that
>> matters)
>> Physical RAM is 2 GB with -Xmx800M set to Solr.
>>
>>
>> Yonik Seeley wrote:
>>>
>>> That high of a difference is due to the part of the index containing
>>> these particular stored fields not being in OS cache.  What's the size
>>> on disk of your index compared to your physical RAM?
>>>
>>> -Yonik
>>>
>>> On Mon, Jul 28, 2008 at 4:10 PM, Britske <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> For some queries I need to return a lot of rows at once (say 100).
>>>> When performing these queries I notice a big difference between qTime
>>>> (which
>>>> is mostly in the 15-30 ms range due to caching) and total time taken to
>>>> return the response (measured through SolrJ's elapsedTime), which takes
>>>> between 500-1600 ms.
>>>>
>>>> For queries which return less rows the difference becomes less big.
>>>>
>>>> I presume (after reading some threads in the past) that this is due to
>>>> solr
>>>> constructing and streaming the response (which includes retrieving the
>>>> stored fields) , which is something that is not calculated in qTime.
>>>>
>>>> Documents have a lot of stored fields (more than 10.000), but at any
>>>> given
>>>> query a maximum of say 20 are returned (through fl-field ) or used (as
>>>> part
>>>> of filtering, faceting, sorting)
>>>>
>>>> I would have thought that enabling enableLazyFieldLoading for this
>>>> situation
>>>> would mean a lot, since so many stored fields can be skipped, but I
>>>> notice
>>>> no real difference in measuring total elapsed time (or qTime for that
>>>> matter).
>>>>
>>>> Am I missing something here? What criteria would need to be met for a
>>>> field
>>>> to not be loaded for instance? Should I see a big performance boost in
>>>> this
>>>> situation?
>>>>
>>>> Thanks,
>>>> Britske
>>>>
>>>>
>>>
>>>
>>
>>
>>
> 
> 




Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading= true

2008-07-28 Thread Britske

Thanks for clearing that up for me.
I'm going to investigate some more...



Yonik Seeley wrote:
> 
> On Mon, Jul 28, 2008 at 4:53 PM, Britske <[EMAIL PROTECTED]> wrote:
>> Each query requests at most 20 stored fields. Why doesn't help
>> lazyfieldloading in this situation?
> 
> It's the disk seek that kills you... loading 1 byte or 1000 bytes per
> document would be about the same speed.
> 
>> Also, if I understand correctly, for optimal performance I need to have
>> at
>> least enough RAM to put the entire Index size in OS cache (thus RAM) +
>> the
>> amount of RAM that SOLR / Lucene consumes directly through the JVM?
> 
> The normal usage is to just retrieve the stored fields for the top 10
> (or a window of 10 or 20) documents.  Under this scenario, the
> slowdown from not having all of the stored fields cached is usually
> acceptable.  Faster disks (seek time) can also help.
> 
>> Luckily most of the normal queries return 10 documents each, which
>> results
>> in a discrepancy between total elapsed time and qTIme of about 15-30 ms.
>> Doesn't this seem strange, since to me it would seem logical that the
>> discrepancy would be at least 1/10th of fetching 100 documents.
> 
> Yes, in general 1/10th the cost is what one would expect on average.
> But some of the docs you are trying to retrieve *will* be in cache, so
> it's hard to control this test.
> You could try forcing the index out of memory by "cat"ing some other
> big files multiple times and then re-trying or do a reboot to be
> sure.
> 
> -Yonik
> 
> 




Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading= true

2008-07-28 Thread Britske

I'm using the solr-nightly of 2008-04-05



Grant Ingersoll-6 wrote:
> 
> What version of Solr/Lucene are you using?
> 
> On Jul 28, 2008, at 4:53 PM, Britske wrote:
> 
>>
>> I'm on a development box currently and production servers will be  
>> bigger, but
>> at the same time the index will be too.
>>
>> Each query requests at most 20 stored fields. Why doesn't help
>> lazyfieldloading in this situation?
>> I don't need to retrieve all stored fields and I thought I wasn't  
>> doing this
>> (through limiting the fields returned using the FL-param), but if I  
>> read
>> your comment correctly, apparently I am retrieving them all, I'm  
>> just not
>> displaying them all?
>>
>> Also, if I understand correctly, for optimal performance I need to  
>> have at
>> least enough RAM to put the entire Index size in OS cache (thus RAM)  
>> + the
>> amount of RAM that SOLR / Lucene consumes directly through the JVM?  
>> (which
>> among other things includes the Lucene field-cache + all of SOlr's  
>> caches on
>> top of that).
>>
>> I've never read the requirement of having the entire index in OS cache
>> before, is this because in normal situations (with less stored  
>> fields) it
>> doesn't matter much? I'm just surprised to hear of this for the  
>> first time,
>> since it will likely give a big impact on my design.
>>
>> Luckily most of the normal queries return 10 documents each, which  
>> results
>> in a discrepancy between total elapsed time and qTIme of about 15-30  
>> ms.
>> Doesn't this seem strange, since to me it would seem logical that the
>> discrepancy would be at least 1/10th of fetching 100 documents.
>>
>> hmm, hope you can shine some light on this,
>>
>> Thanks a lot,
>> Britske
>>
>>
>>
>> Yonik Seeley wrote:
>>>
>>> That's a bit too tight to have *all* of the index cached...your best
>>> bet is to go to 4GB+, or figure out a way not to have to retrieve so
>>> many stored fields.
>>>
>>> -Yonik
>>>
>>> On Mon, Jul 28, 2008 at 4:27 PM, Britske <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Size on disk is 1.84 GB (of which 1.3 GB sits in FDT files if that
>>>> matters)
>>>> Physical RAM is 2 GB with -Xmx800M set to Solr.
>>>>
>>>>
>>>> Yonik Seeley wrote:
>>>>>
>>>>> That high of a difference is due to the part of the index  
>>>>> containing
>>>>> these particular stored fields not being in OS cache.  What's the  
>>>>> size
>>>>> on disk of your index compared to your physical RAM?
>>>>>
>>>>> -Yonik
>>>>>
>>>>> On Mon, Jul 28, 2008 at 4:10 PM, Britske <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> For some queries I need to return a lot of rows at once (say 100).
>>>>>> When performing these queries I notice a big difference between  
>>>>>> qTime
>>>>>> (which
>>>>>> is mostly in the 15-30 ms range due to caching) and total time  
>>>>>> taken to
>>>>>> return the response (measured through SolrJ's elapsedTime),  
>>>>>> which takes
>>>>>> between 500-1600 ms.
>>>>>>
>>>>>> For queries which return less rows the difference becomes less  
>>>>>> big.
>>>>>>
>>>>>> I presume (after reading some threads in the past) that this is  
>>>>>> due to
>>>>>> solr
>>>>>> constructing and streaming the response (which includes  
>>>>>> retrieving the
>>>>>> stored fields) , which is something that is not calculated in  
>>>>>> qTime.
>>>>>>
>>>>>> Documents have a lot of stored fields (more than 10.000), but at  
>>>>>> any
>>>>>> given
>>>>>> query a maximum of say 20 are returned (through fl-field ) or  
>>>>>> used (as
>>>>>> part
>>>>>> of filtering, faceting, sorting)
>>>>>>
>>>>>> I would have thought that enabling enableLazyFieldLoading for this
>>>>>> situation
>>>>>> would mean a lot, sin

Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading= true

2008-07-28 Thread Britske

That sounds interesting. Let me explain my situation, which may be a variant
of what you are proposing. My documents contain more than 10.000 fields, but
these fields are divided as follows: 

1. about 20 general purpose fields, of which more than 1 can be selected in
a query. 
2. about 10.000 fields, of which each query, based on some criteria, selects
exactly one.

Obviously 2. is killing me here, but given the above perhaps it would be
possible to make 10.000 vertical slices / indices, and based on the field to
be selected (from point 2) select the slice/index to search in.
The 10.000 indices would run on the same box, and the 20 general purpose
fields would have to be copied to all slices (which means some increase in
overall index size, but manageable), but this would give me far more
reasonably sized and compact documents, which would mean documents are far
more likely to be in the same cached slot, and be accessed in the same disk
seek.

Does this make sense? Am I correct that this has nothing to do with
Distributed search, since that really is all about horizontal splitting /
sharding of the index, and what I'm suggesting is splitting vertically? Is
there some other part of Solr that I can use for this, or would it be all
home-grown?

Thanks,
Britske


Mike Klaas wrote:
> 
> Another possibility is to partition the stored fields into a  
> frequently-accessed set and a full set.  If the frequently-accessed  
> set is significantly smaller (in terms of # bytes), then the documents  
> will be tightly-packed on disk and the os caching will be much more  
> effective given the same amount of ram.
> 
> The situation you are experiencing is one-seek-per-doc, which is  
> performance death.
> 
> -Mike
> 
> On 28-Jul-08, at 1:34 PM, Yonik Seeley wrote:
> 
>> That's a bit too tight to have *all* of the index cached...your best
>> bet is to go to 4GB+, or figure out a way not to have to retrieve so
>> many stored fields.
>>
>> -Yonik
>>
>> On Mon, Jul 28, 2008 at 4:27 PM, Britske <[EMAIL PROTECTED]> wrote:
>>>
>>> Size on disk is 1.84 GB (of which 1.3 GB sits in FDT files if that  
>>> matters)
>>> Physical RAM is 2 GB with -Xmx800M set to Solr.
>>>
>>>
>>> Yonik Seeley wrote:
>>>>
>>>> That high of a difference is due to the part of the index containing
>>>> these particular stored fields not being in OS cache.  What's the  
>>>> size
>>>> on disk of your index compared to your physical RAM?
>>>>
>>>> -Yonik
>>>>
>>>> On Mon, Jul 28, 2008 at 4:10 PM, Britske <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> For some queries I need to return a lot of rows at once (say 100).
>>>>> When performing these queries I notice a big difference between  
>>>>> qTime
>>>>> (which
>>>>> is mostly in the 15-30 ms range due to caching) and total time  
>>>>> taken to
>>>>> return the response (measured through SolrJ's elapsedTime), which  
>>>>> takes
>>>>> between 500-1600 ms.
>>>>>
>>>>> For queries which return less rows the difference becomes less big.
>>>>>
>>>>> I presume (after reading some threads in the past) that this is  
>>>>> due to
>>>>> solr
>>>>> constructing and streaming the response (which includes  
>>>>> retrieving the
>>>>> stored fields) , which is something that is not calculated in  
>>>>> qTime.
>>>>>
>>>>> Documents have a lot of stored fields (more than 10.000), but at  
>>>>> any
>>>>> given
>>>>> query a maximum of say 20 are returned (through fl-field ) or  
>>>>> used (as
>>>>> part
>>>>> of filtering, faceting, sorting)
>>>>>
>>>>> I would have thought that enabling enableLazyFieldLoading for this
>>>>> situation
>>>>> would mean a lot, since so many stored fields can be skipped, but I
>>>>> notice
>>>>> no real difference in measuring total elapsed time (or qTime for  
>>>>> that
>>>>> matter).
>>>>>
>>>>> Am I missing something here? What criteria would need to be met  
>>>>> for a
>>>>> field
>>>>> to not be loaded for instance? Should I see a big performance  
>>>>> boost in
>>>>> this
>>>>> situation?
>>>>>
>>>>> Thanks,
>>>>> Britske
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
> 
> 
> 




Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading= true

2008-07-30 Thread Britske



Currently, I can't say what the data actually represents, but the analogy of t

Mike Klaas wrote:
> 
> On 28-Jul-08, at 11:16 PM, Britske wrote:
> 
>>
>> That sounds interesting. Let me explain my situation, which may be a  
>> variant
>> of what you are proposing. My documents contain more than 10.000  
>> fields, but
>> these fields are divided like:
>>
>> 1. about 20 general purpose fields, of which more than 1 can be  
>> selected in
>> a query.
>> 2. about 10.000 fields of which each query based on some criteria  
>> exactly
>> selects one field.
>>
>> Obviously 2. is killing me here, but given the above perhaps it  
>> would be
>> possible to make 10.000 vertical slices/ indices, and based on the  
>> field to
>> be selected (from point 2) select the slice/index to search in.
>> The 10.000 indices would run on the same box, and the 20 general  
>> purpose
>> fields have have to be copied to all slices (which means some  
>> increase in
>> overall index size, but managable), but this would give me far more
>> reasonable sized and compact documents, which would mean (documents  
>> are far
>> more likely to be in the same cached slot, and be accessed in the  
>> same disk
>> -seek.
> 
> Are all 10k values equally-likely to be retrieved?
> 
> 
Well, not exactly, but let's say the probabilities of access between the most
probable and least probable choice differ by roughly a factor of 100. Of course
this also gives me massive room for optimizing (different indices on different
boxes with tuned memory for each) if I would indeed have 10k separate
indices. (For simplicity I'm forgetting the remaining 20 fields here.)



>> Does this make sense?
> 
> Well, I would probably split into two indices, one containing the 20  
> fields and one containing the 10k.  However, if the 10k fields are  
> equally likely to be chosen, this will not help in the long term,  
> since the working set of disk blocks is still going to be all of them.
> 

I figured that having 10k separate indices (if that's at all feasible) would
result in having the values for the same field packed more closely together
on disk, thus resulting in fewer disk seeks. Since for any given query only 1
out of the 10k fields would be chosen, I could delegate each particular query
to exactly 1 out of the 10k indices. Wouldn't this limit the disk blocks
touched?



> 
>> Am I correct that this has nothing to do with
>> Distributed search, since that really is all about horizontal  
>> splitting /
>> sharding of the index, and what I'm suggesting is splitting  
>> vertically? Is
>> there some other part of Solr that I can use for this, or would it  
>> be all
>> home-grown?
> 
> There is some stuff that is coming down the pipeline in lucene, but  
> nothing is currently there.  Honestly, it sounds like these extra  
> fields should just be stored in a separate file/database.  I also  
> wonder if solving the underlying problem really requires storing 10k  
> values per doc (you haven't given us many clues in this regard)?
> 
> -Mike
> 
> 

Well, without being able to go into what the data represents (hopefully
later), I found that this analogy works well:

- Rows in Solr represent productcategories. I will have up to 100k of them.

- Each productcategory can have 10k products. These are encoded as the
10k columns / fields (all 10k fields are int values).

- At any given time at most 1 product per productcategory is returned
(analogous to selecting 1 out of 10k columns). (This is the requirement
that makes this scheme possible.)

- Products in the same column have certain characteristics in common, which
are encoded in the column name (using dynamic fields). So the combination of
these characteristics uniquely determines 1 out of the 10k columns. When the
user hasn't supplied all characteristics, good defaults for these
characteristics can be chosen, so a column can always be determined.

- On top of that each row has 20 productcategory-fields (which all possible
10k products of that category share).

- The row x column matrix is essentially sparse (between 10 and 50% is
filled).

The queries performed are a combination of productcategory-filters, and
filters that together uniquely determine 1 out of the 10k columns.

Returned results are those products for which:
- the productcategory filters hold true
- they are contained in the selected (1 out of 10k) column

The default way of sorting is by the int-values in the particular selected
column

Since the column-values are used to both filter and sort the results it
should be clear that I can't externalise the 10k fields to a database.
However, I've looked at the possibili

Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading= true

2008-07-30 Thread Britske

Hi Fuad, 


Funtick wrote:
> 
> 
> Britske wrote:
>> 
>> When performing these queries I notice a big difference between qTime
>> (which is mostly in the 15-30 ms range due to caching) and total time
>> taken to return the response (measured through SolrJ's elapsedTime),
>> which takes between 500-1600 ms. 
>> Documents have a lot of stored fields (more than 10.000), but at any
>> given query a maximum of say 20 are returned (through fl-field ) or used
>> (as part of filtering, faceting, sorting)
>> 
> 
> 
> Hi Britske, how do you manage 10.000 field  definitions? Sorry I didn't
> understand...
> 

I use dynamic fields. My 10k fields span all possible combinations of
variables, say, x, y, z.
Then I can uniquely determine a column by specifying _d_<x>_<y>_<z>, for
example, while the field def is simply a single dynamicField pattern (_d_*).
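
(A sketch of what the query side looks like with this scheme; the concrete
naming pattern and the example criteria values are just placeholders.)

  import org.apache.solr.client.solrj.SolrQuery;

  public class DynamicColumnQuery {
      public static void main(String[] args) {
          // The user's criteria (x, y, z) pick exactly one of the ~10k dynamic int fields:
          String x = "2008-09-01", y = "2", z = "double";
          String column = "_d_" + x + "_" + y + "_" + z;
          SolrQuery q = new SolrQuery("*:*");
          q.addFilterQuery(column + ":[* TO *]");   // only docs that actually have this column
          q.set("sort", column + " asc");           // sort by the selected column's int value
          System.out.println(q);
      }
  }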


Funtick wrote:
> 
> Guys, I am constantly seeing the same problem, athough I have just a few
> small fields defined, lazyLoading is disabled, and memory is more than
> enough (25Gb for SOLR, 7Gb for OS, 3Gb index).
> 
> Britske, do you see the difference with faceted queries only?
> 

No, the difference is there all the time (faceting or not), but it gets very
noticeable when a lot of rows are returned. As commented earlier, it is
highly likely that this is due to random disk seek times. Because a lot of
fields are stored in my situation, the harddisk has to cover many blocks to
fetch all requested documents.


Funtick wrote:
> 
> Yonik, 
> 
> I am suspecting there is _bug_ with SOLR faceting so that faceted query
> time (qtime) is 10-20ms and elapsed time is huge; SOLR has filterCache
> where Key is 'filter'; SOLR does not have any 
> where Key is 'query' and Value is 'facets'...
> 
> Am I right?
> 
> -Fuad
> 
> 




Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading= true

2008-07-30 Thread Britske



Funtick wrote:
> 
> 
> Britske wrote:
>> 
>> - Rows in solr represent productcategories. I will have up to 100k of
>> them. 
>> - Each product category can have 10k products each. These are encoded as
>> the 10k columns / fields (all 10k fields are int values) 
>> 
> 
> You are using multivalued fields, you are not using 10k fields. And 10k is
> huge.
> 
> Design is wrong... you should define two fields only: <Category, Product>.
> Lucene will do the rest.
> 
> -Fuad
> 

;-). Well I wish it was that simple. 
-- 
View this message in context: 
http://www.nabble.com/big-discrepancy-between-elapsedtime-and-qtime-although-enableLazyFieldLoading%3D-true-tp18698590p18744539.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: big discrepancy between elapsedtime and qtime although enableLazyFieldLoading= true

2008-07-31 Thread Britske

No, I'm using dynamic fields; they've been around for a pretty long time. 
I use int values in the 10k fields for filtering and sorting. On top of that
I use a lot of full-text filtering on the other fields, as well as faceting,
etc. 

I do understand that, at first glance, it seems possible to use multivalued
fields, but with a multivalued field it's not possible to pinpoint the exact
value within that field that I need. Consider the case with 1 multivalued
field, category, as you called it, which would have at most 10k values. The
meaning of the individual values within the field is completely lost, even
though it is a requirement to fetch products (thus values in the multivalued
field) given a specific set of criteria. In other words, there is no way of
getting a specific value from a multivalued field given a set of criteria.
Now compare that with my current design, in which these criteria pinpoint a
specific field / column to use, and the difference should be clear. 
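As an illustration of that difference, here is a sketch of what the two
designs look like at indexing time with SolrJ; all field names and values are
made up:

import org.apache.solr.common.SolrInputDocument;

public class ColumnVsMultivaluedSketch {
    static SolrInputDocument build() {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "category-42");

        // Multivalued field: just a bag of values; there is no way to ask
        // "give me the value that belongs to characteristics (2, 5, 1)".
        doc.addField("productvalue", 129);
        doc.addField("productvalue", 259);

        // Dynamic fields: the characteristics are encoded in the field name,
        // so a set of criteria selects exactly one field to filter and sort on.
        doc.addField("_d_2_5_1", 129);
        doc.addField("_d_3_1_4", 259);
        return doc;
    }
}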

regards,
Britske


Funtick wrote:
> 
> 
> Yes, it should be extremely simple! I simply can't understand how you
> describe it:
> 
> Britske wrote:
>> 
>> Rows in solr represent productcategories. I will have up to 100k of them. 
>> 
>> - Each product category can have 10k products each. These are encoded as
>> the 10k columns / fields (all 10k fields are int values) 
>>   
>> - At any given at most 1 product per productcategory is returned,
>> (analoguous to selecting 1 out of 10k columns). (This is the requirements
>> that makes this scheme possible) 
>> 
>> -products in the same column have certain characteristics in common,
>> which are encoded in the column name (using dynamic fields). So the
>> combination of these characteristics uniquely determines 1 out of 10k
>> columns. When the user hasn't supplied all characteristics good defaults
>> for these characteristics can be chosen, so a column can always be
>> determined. 
>> 
>> - on top of that each row has 20 productcategory-fields (which all
>> possible 10k products of that category share). 
>> 
> 
> 1. You can't really define 10.000 columns; you are probably using
> multivalued field for that. (sorry if I am not familiar with
> newest-greatest features of SOLR such as 'dynamic fields')
> 
> 2. You are trying to pass to Lucene 'normalized data'
> - But it is indeed the job of Lucene, to normalize data!
> 
> 3. All 10k fields are int values!? Lucene is designed for full-text
> search... are you trying to use Lucene instead of a database?
> 
> Sorry if I don't understand your design...
> 
> 
> 
> 
> Britske wrote:
>> 
>> 
>> 
>> Funtick wrote:
>>> 
>>> 
>>> Britske wrote:
>>>> 
>>>> - Rows in solr represent productcategories. I will have up to 100k of
>>>> them. 
>>>> - Each product category can have 10k products each. These are encoded
>>>> as the 10k columns / fields (all 10k fields are int values) 
>>>> 
>>> 
>>> You are using multivalued fields, you are not using 10k fields. And 10k
>>> is huge.
>>> 
>>> Design is wrong... you should define two fields only: <Category, Product>. Lucene will do the rest.
>>> 
>>> -Fuad
>>> 
>> 
>> ;-). Well I wish it was that simple. 
>> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/big-discrepancy-between-elapsedtime-and-qtime-although-enableLazyFieldLoading%3D-true-tp18698590p18757094.html
Sent from the Solr - User mailing list archive at Nabble.com.



DataImportHandler: way to merge multiple db-rows to 1 doc using transformer?

2008-09-27 Thread Britske

Looking at the wiki and the code of DataImportHandler, it looks impressive. 
There's talk about ways to use Transformers to create several rows (Solr
docs) based on a single db row. 

I'd like to know if it's possible to do the exact opposite: to build
custom transformers that take multiple db rows and merge them into a single
Solr row/document. If so, how?

Thanks, 
Britske
-- 
View this message in context: 
http://www.nabble.com/DataImportHandler%3A-way-to-merge-multiple-db-rows-to-1-doc-using-transformer--tp19706722p19706722.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: DataImportHandler: way to merge multiple db-rows to 1 doc using transformer?

2008-09-29 Thread Britske

Well, merging from different tables (using pk's, etc.) is pretty clear from
the Wiki (nice work). However, here is the difficult part: currently I have a
table called availabilities, and a table called product (all simplified):

----------------
| availabilities
----------------
| productid   k
| factid      k
| providerid  k
| value


Each row in this table represents a temporal availability of 1 product of 1
provider. An availability is unique by the combination of fields
productid-providerid-factid.

However, a row / document in Solr needs to represent a product. I want to
merge these product availabilities with product in such a way that a product
document contains (among other fields) a field for each 'fact'. The value of
such a fact field contains the availability.value of the 'best' availability
found under the constraints productid and factid. 

Note that: 
- there are a lot of facts (resulting in a lot of fact columns --> 1,000+) 
- fact fields are defined as dynamic fields, since the possible facts are not
known at design time
- I want to use a custom Transformer to calculate what the 'best'
availability is, given the factid and productid (a rough sketch follows below)

Currently I have a more or less working home-grown solution, but I would
like to be able to set it up with DataImportHandler. 
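A minimal sketch of what such a custom transformer could look like, assuming
the availability rows for a product are already joined/ordered by the entity
query so that the 'best' row per factid arrives first. The class, package and
column names are made up, and the actual merging of rows into one document
would still have to happen in the entity setup:

package example.dih;

import java.util.Map;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class BestAvailabilityTransformer extends Transformer {
    @Override
    public Object transformRow(Map<String, Object> row, Context context) {
        // Turn this row's (factid, value) pair into a dynamic fact field,
        // e.g. fact_<factid> = value, so the merged product document ends up
        // with one field per fact.
        Object factId = row.get("factid");
        Object value = row.get("value");
        if (factId != null && value != null) {
            row.put("fact_" + factId, value);
        }
        return row;
    }
}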

thanks for your help,
Britske



Noble Paul നോബിള്‍ नोब्ळ् wrote:
> 
> What is the basis on which you merge rows ? Then I may be able to
> suggest an easy way of doing that
> 
> 
> On Sun, Sep 28, 2008 at 3:17 AM, Britske <[EMAIL PROTECTED]> wrote:
>>
>> Looking at the wiki, code of DataImportHandler and it looks impressive.
>> There's talk about ways to use Transformers to be able to create several
>> rows (solr docs) based on a single db row.
>>
>> I'd like to know if it's possible to do the exact opposite: to build
>> customer transformers that take multiple db-rows and merge it to a single
>> solr-row/document. If so, how?
>>
>> Thanks,
>> Britske
>> --
>> View this message in context:
>> http://www.nabble.com/DataImportHandler%3A-way-to-merge-multiple-db-rows-to-1-doc-using-transformer--tp19706722p19706722.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 
> 
> -- 
> --Noble Paul
> 
> 

-- 
View this message in context: 
http://www.nabble.com/DataImportHandler%3A-way-to-merge-multiple-db-rows-to-1-doc-using-transformer--tp19706722p19722396.html
Sent from the Solr - User mailing list archive at Nabble.com.



solr on raid 0 --> no performance gain while indexing?

2008-10-15 Thread Britske

Hi, 

I understand that this may not be a 100% related question to the forum
(perhaps it's more Lucene than Solr) but perhaps someone here has seen
similar things...

I'm experimenting on Amazon Ec2 with indexing a solr / lucene index on a
striped (Raid 0) partition. 

While searching shows a clear benefit over a single hard disk, I see no
improvement in indexing over a single disk.  
Now I'm not at all a Linux guru, but basic random write / read io-testing
with bonnie++ leads me to conclude that the raid is properly set up and is
performing well. I'm running Ubuntu 8.04 / mdadm as software raid / XFS as
file system, btw.

The data I'm creating is very index-heavy, e.g. over 1000 indexed fields. 
Would this be a reason for not seeing better indexing performance than on a
single disk? I'm guessing here: perhaps creating / shifting / altering the
indices after each insert creates such a load across the physical disks that
the normal write scenario of software raid 0 (writing sequential chunks in
round-robin fashion to all the disks in the array) no longer holds? 

Does this seem logical, or does someone know another reason?

Thanks,
Britske
-- 
View this message in context: 
http://www.nabble.com/solr-on-raid-0---%3E-no-performance-gain-while-indexing--tp20002623p20002623.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: solr on raid 0 --> no performance gain while indexing?

2008-10-15 Thread Britske

As a 'workaround': would it be an option to not stripe the available disks,
but instead treat them as N silos and merge the indices afterwards?


Britske wrote:
> 
> Hi, 
> 
> I understand that this may not be a 100% related question to the forum
> (perhaps it's more Lucene than Solr) but perhaps someone here has seen
> similar things...
> 
> I'm experimenting on Amazon Ec2 with indexing a solr / lucene index on a
> striped (Raid 0) partition. 
> 
> While searching gives good benefits of using a single harddisk I see no
> improvement is indexing over a single disk.  
> Now I'm not at all a linux-guru but doing basic random write / read
> io-testing with bonnie+ leads me to conclude that the raid is properly
> setup, and is performing good.  I'm running Ubuntu 8.0.4 / Mdadm as
> software raid / Xfs as file system btw.
> 
> The data i'm creating is very index heavy, e.g: over 1000 indices. 
> Would this be a reason for not seeing better performance with indexing
> than on a single disk? I'm guessing here: perhaps creating / shifting /
> altering the indices after each insert creates such a load between
> physical disks that the normal write scenario (of software raid 0) of
> writing sequential chunks in round-robin fashion to all the disks in the
> array no longer holds? 
> 
> Does this seem logical or does someone know another reason?
> 
> Thanks,
> Britske
> 

-- 
View this message in context: 
http://www.nabble.com/solr-on-raid-0---%3E-no-performance-gain-while-indexing--tp20002623p20002667.html
Sent from the Solr - User mailing list archive at Nabble.com.



solr 1.4: multi-select for statscomponent

2009-02-25 Thread Britske

Is there a way to exclude filters from a stats field, like it is possible to
exclude filters from a facet.field? It didn't work for me. 

I.e.: I have a field price, and although I filter on price, I would like to
be able to get the entire range (min, max) of prices as if I hadn't specified
the filter. Obviously, without excluding the filter the min/max range is
constrained to [50,100].

Part of query: 
stats=true&stats.field={!ex=p1}price&fq={!tag=p1}price:[50 TO 100]
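Expressed through SolrJ the attempted request looks roughly like this (a
sketch; whether the StatsComponent honours the {!ex=p1} exclusion is exactly
the open question here):

import org.apache.solr.client.solrj.SolrQuery;

public class TaggedStatsSketch {
    static SolrQuery build() {
        SolrQuery q = new SolrQuery("*:*");
        q.set("stats", true);
        q.set("stats.field", "{!ex=p1}price");          // ask for price stats, excluding the tagged filter
        q.addFilterQuery("{!tag=p1}price:[50 TO 100]"); // the user's price filter, tagged as p1
        return q;
    }
}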

USE-CASE:
I show a double slider using javascript to display possible prices (2
handles, one to set the min price and the other to set the max price). 
The slider has a range of [0, max price without price filter set]; the max
price is obtained from 'stats=true&stats.field=price'.

When the user moves the slider, a filter (fq) is set constraining the
resultset to the selected min and max prices. 
After the page updates I still want to show the price slider, with the min
and max handles set to the prices selected by the user, so the user can
alter the filter quickly.

However (and here it comes) I would also like to get the 'max price without
the price filter set', because I need it to set the max range of the slider. 

Is there any (undocumented) feature that makes this possible? If not, would
it be easy to add?

Thanks, 
Britske

-- 
View this message in context: 
http://www.nabble.com/solr-1.4%3A-multi-select-for-statscomponent-tp22202971p22202971.html
Sent from the Solr - User mailing list archive at Nabble.com.



speeding up indexing with a LOT of indexed fields

2009-03-25 Thread Britske

hi, 

I'm having difficulty indexing a collection of documents in a reasonable
time: it's now going at 20 docs / sec on a c1.xlarge instance of Amazon EC2,
which just isn't enough. 
This box has 8GB RAM and the equivalent of 20 Xeon processors.   

These documents have a couple of stored, indexed, multi- and single-valued
fields, but the main problem lies in them having about 1500 indexed fields of
type sint, range [0,1]. (Yes, I know this is a lot.) 

I'm looking for some guidance as to what strategies to try out to improve
indexing throughput. I could slam in some more servers (I will), but my
feeling tells me I can get more out of this box.

Some additional info: 
- I'm indexing to 10 cores in parallel. This is done because:
  - at query time, 1 particular index will always fulfill all requests,
so we can prune the search space to 1/10th of its original size;
  - each document as represented in a core is actually 1/10th of a
'conceptual' document (which would contain up to 15,000 indexed fields) if I
indexed to 1 core. Indexing 1 doc containing 15,000 indexed fields proved
to give far worse results in searching and indexing than the solution I'm
going with now;
  - the alternative of simply putting all docs with 1500 indexed fields
each in the same core isn't really possible either, because this quickly
results in OOM errors when sorting on a couple of fields (even though 9/10ths
of all docs in this case would not have the field sorted on, they would
still end up in a Lucene fieldCache for this field). 

- To be clear: the 20 docs / second means 2 docs / second / core, or 2
'conceptual' docs / second overall. 

- Each core has maxBufferedDocs ~20 and mergeFactor ~10. (I actually set
them differently for each partition so that merges of different partitions
don't all happen at the same time. This seemed to help a bit.) 

- running jvm with -server -Xmx6000M -Xms6000M -XX:+UseParallelGC
-XX:+CMSPermGenSweepingEnabled -XX:MaxPermSize=128M to leave room for
diskcaching. 

- I'm spreading the 10 indices over 2 physical disks. 5 to /dev/sda1 5 to
/dev/sdb 


observations: 
- Within minutes after feeding starts, the server reaches its max RAM. 
- Until then the processors are running at ~70%.
- Although I throw in a commit at random intervals (between 600 and 800 secs,
again so as not to commit all partitions at the same time), the JVM just keeps
eating all the RAM (a sketch of a batched feeding loop follows below). 
- Not a lot seems to be happening on disk (using dstat) when the RAM hasn't
maxed out. Obviously, afterwards the disk is flooded with swapping. 
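For illustration, a batched feeding loop with a single commit at the end might
look roughly like this in SolrJ (a sketch; the core URL, field names, batch
size and document count are all made up and not taken from the actual setup):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class FeedCoreSketch {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/core0");
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 10000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            doc.addField("fact_0_i", i % 100);        // one of the many int fields
            batch.add(doc);
            if (batch.size() == 100) {                // send in batches instead of one-by-one
                server.add(batch);
                batch.clear();
            }
        }
        server.add(batch);
        server.commit();   // commit rarely; server-side buffering is governed by ramBufferSizeMB
    }
}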

questions: 
- Is there a good reason why all the RAM stays occupied even though I commit
regularly? Perhaps fieldcaches get populated while indexing? I guess not, but
I'm not sure what else could explain this.

- Would splitting the 'conceptual docs' into even more partitions help at
indexing time? From an application standpoint it's possible, it just
requires some work, and it's hard to compare figures, so I'd like to know if
it's worth it.

- How is a flush different from a commit, and would it help in getting the
RAM usage down?

- Because all 15,000 indexed fields look very similar in structure (they are
all sints [0,1] to start with), I was looking for more efficient ways to
get them into an index using some low-level indexing operations. For example:
for given documents X and Y, and indexed fields 1, 2, ..., i, ..., N: if
X.a < Y.a, then this ordering in a lot of cases also holds for fields 2, ..., N.
Because of these special properties I could possibly create a sorting
algorithm that takes advantage of this and thus would make indexing faster. 
Would even considering this path be useful, given that it would obviously
involve some work to make it work, and presumably a lot more work to get it to
go faster than out of the box?

- Lastly: should I be able to get more out of this box, or am I just
complaining? ;-) 

Thanks for making it to here, 
and hoping to receive some valuable info, 

Cheers, 
Britske
-- 
View this message in context: 
http://www.nabble.com/speeding-up-indexing-with-a-LOT-of-indexed-fields-tp22702364p22702364.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: speeding up indexing with a LOT of indexed fields

2009-03-25 Thread Britske

Thanks for the quick reply.

The box has 8 real CPUs. Perhaps a good idea then to reduce the number of
cores to 8 as well. I'm testing out a different scenario with multiple boxes
as well, where clients persist docs to multiple cores on multiple boxes
(which is what multicore was invented for, after all). 

I set maxBufferedDocs this low (instead of using ramBufferSizeMB) because I
was worried about the impact on RAM and wanted to get a grip on when docs were
persisted to disk. I'm still not sure it matters much for the big amounts of
RAM consumed; this can't all be coming from buffering docs, can it? On the
other hand, maxBufferedDocs (20) is set per core, so in total the number of
buffered docs is at most 200. Of course still on the low side, but I have
some draconian docs here... ;-) 

I will try ramBufferSizeMB and set it higher, but I first have to get a grip
on why RAM usage is maxed all the time before that will make any difference,
I guess. 

Thanks, and please keep the suggestions coming. 

Britske.


Otis Gospodnetic wrote:
> 
> 
> Britske,
> 
> Here are a few quick ones:
> 
> - Does that machine really have 10 CPU cores?  If it has significantly
> less, you may be beyond the "indexing sweet spot" in terms of indexer
> threads vs. CPU cores
> 
> - Your maxBufferedDocs is super small.  Comment that out anyway.  use
> ramBufferedSizeMB and set it as high as you can afford.  No need to commit
> very often, certainly no need to flush or optimize until the end.
> 
> There is a page about indexing performance on either Solr or Lucene Wiki
> that will help.
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> 
> - Original Message 
>> From: Britske 
>> To: solr-user@lucene.apache.org
>> Sent: Wednesday, March 25, 2009 10:05:17 AM
>> Subject: speeding up indexing with a LOT of indexed fields
>> 
>> 
>> hi, 
>> 
>> I'm having difficulty indexing a collection of documents in a reasonable
>> time. 
>> it's now going at 20 docs / sec on a c1.xlarge instance of amazon ec2
>> which
>> just isnt enough. 
>> This box has 8GB ram and the equivalent of 20 xeon processors.  
>> 
>> these document have a couple of stored, indexed, multi and single-valued
>> fields, but the main problem lies in it having about 1500 indexed fields
>> of
>> type sint.  Range [0,1] (Yes, I know this is a lot) 
>> 
>> I'm looking for some guidance as what strategies to try out to improve
>> throughput in indexing. I could slam in some more servers (I will) but my
>> feeling tells me I can get more out of this.
>> 
>> some additional info: 
>> - I'm indexing to 10 cores in parallel.  This is done because :
>>   - at query time, 1 particular index will always fullfill all
>> requests
>> so we can prune the search space to 1/10th of its original size. 
>>   - each document as represented in a core is actually 1/10th of a
>> 'conceptual' document (which would contain up to 15000 indexed fields) if
>> I
>> indexed to 1 core. Indexing as 1 doc containing 15.000 indexed fields
>> proved
>> to give far worse results in searching and indexing than the solution i'm
>> going with now. 
>>  - the alternative of simply putting all docs with 1500 indexed field
>> each in the same core isn't really possible either, because this quickly
>> results in OOM-errors when sorting on a couple of fields. (even though
>> 9/10
>> th of all docs in this case would not have the field sorted on, they
>> would
>> still end up in a lucene fieldCache for this field) 
>> 
>> - to be clear: the 20 docs / second means 2 docs / second / core. Or 2
>> 'conceptual' docs / second overall. 
>> 
>> - each core has maxBufferedDocs ~20 and mergeFactor~10 .  (I actually set
>> them differently for each partition so that merges of different
>> partitions
>> don't happen altogether. This seemed to help a bit) 
>> 
>> - running jvm with -server -Xmx6000M -Xms6000M -XX:+UseParallelGC
>> -XX:+CMSPermGenSweepingEnabled -XX:MaxPermSize=128M to leave room for
>> diskcaching. 
>> 
>> - I'm spreading the 10 indices over 2 physical disks. 5 to /dev/sda1 5 to
>> /dev/sdb 
>> 
>> 
>> observations: 
>> - within minutes after feeding the server reaches it's max ram. 
>> - until then the processors are running on ~70%
>> - although I throw in a commit at random intervals (between 600 to 800
>> secs,
>> again so not to commit al partitions at the same time) the jvm just stays
>> eating all 

correct? impossible to filter / facet on ExternalFileField

2009-06-11 Thread Britske

In our design some often-changing fields would benefit from being defined as
ExternalFileFields, so we can update them more often than the rest. 

However, we need to filter and facet on them. 
I don't think that this is currently possible with ExternalFileField, but I
just want to make sure.

If it's not possible, is it on the roadmap? 

Thanks, 
Britske
-- 
View this message in context: 
http://www.nabble.com/correct--impossible-to-filter---facet-on-ExternalFileField-tp23985106p23985106.html
Sent from the Solr - User mailing list archive at Nabble.com.



how to get to highlighting results using solrJ

2009-06-11 Thread Britske

First time I'm using highlighting, and the results work OK. 
I'm using it for an auto-suggest function. For reference I used the 
following query: 

http://localhost:8983/solr/autocompleteCore/select?fl=name_display,importance,score,hl&id&wt=xml&q={!boost
b=log(importance)}(prefix1:"or" OR prefix2:"or")&hl=true&hl.fl=prefix1

However, when using SolrJ I can't get to the actual highlighted results,
i.e.:  

QueryResponse.getHighlighting() shows me a map as follows: 
{2-1-57010={}, 2-7-8481={}, ...} which I can't use because the results are
empty. (?) 

But while debugging I see a field 
QueryResponse._highlightingInfo with contents: 
{1-4167147={prefix1=[Orlando Verenigde Staten]}, ...}
which is exactly what I need. 

However, there is no (public) method 
QueryResponse.getHighlightingInfo()!

What am I missing? 
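For reference, the usual way to read highlighting results from SolrJ is via
getHighlighting(), which is keyed first by the document's uniqueKey value and
then by field name; a minimal sketch, assuming the prefix1 field from the
query above:

import java.util.List;
import java.util.Map;
import org.apache.solr.client.solrj.response.QueryResponse;

public class HighlightReadSketch {
    static void print(QueryResponse rsp) {
        Map<String, Map<String, List<String>>> hl = rsp.getHighlighting();
        for (Map.Entry<String, Map<String, List<String>>> perDoc : hl.entrySet()) {
            // perDoc.getKey() is the document's uniqueKey value
            List<String> snippets = perDoc.getValue().get("prefix1");
            if (snippets != null) {
                System.out.println(perDoc.getKey() + " -> " + snippets);
            }
        }
    }
}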

thanks, 
Britske
-- 
View this message in context: 
http://www.nabble.com/how-to-get-to-highlitghting-results-using-solrJ-tp23986063p23986063.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: how to get to highlighting results using solrJ

2009-06-11 Thread Britske

The query contained some experimental code. The correct one is: 
http://localhost:8983/solr/autocompleteCore/select?fl=name_display,importance,score&wt=xml&q={!boost
b=log(importance)}(prefix1:"or" OR prefix2:"or")&hl=true&hl.fl=prefix1

Moreover, is there a way to simply add the result of highlighting to the
fl parameter, so I can just read the annotated name (including highlighting)
instead of the normal name (analogously to how you can add 'score' to fl)? 
To me this would seem like the perfect way to get the highlighted result
without having to supply additional code in the client: you would only need
to refer to the annotated field name...


Britske wrote:
> 
> first time I'm using highlighting and results work ok. 
> Im using it for an auto-suggest function. For reference I used the 
> following query: 
> 
> http://localhost:8983/solr/autocompleteCore/select?fl=name_display,importance,score,hl&id&wt=xml&q={!boost
> b=log(importance)}(prefix1:"or" OR prefix2:"or")&hl=true&hl.fl=prefix1
> 
> However, when using solrJ I can't get to the actual highlighted results,
> i.e:  
> 
> QueryResponse.getHighlighting() shows me a map  as follows: 
> {2-1-57010={}, 2-7-8481={}, } which I can't use because the result is
> empty.(?) 
> 
> but debugging I see a field: 
> QueryResponse._highlightingInfo with contents: 
> {1-4167147={prefix1=[Orlando Verenigde Staten]},}
> which is exactly what I need. 
> 
> However there is no (public) method: 
> QueryRepsonse.getHighlightingInfo() !
> 
> what am I missing? 
> 
> thanks, 
> Britske
> 

-- 
View this message in context: 
http://www.nabble.com/how-to-get-to-highlitghting-results-using-solrJ-tp23986063p23986127.html
Sent from the Solr - User mailing list archive at Nabble.com.



highlighting on edgeGramTokenized field --> highlighting incorrect bc. position not incremented..

2009-06-12 Thread Britske

Hi, 

I'm trying to highlight based on a (multivalued) field (prefix2) that has
(among other things) an EdgeNGramFilterFactory defined. 
Highlighting doesn't increment the start position of the highlighted
portion; in other words, the highlighted portion is always the beginning
of the field. 




for example: 
for prefix2: "Orlando Verenigde Staten"
the query:
http://localhost:8983/solr/autocompleteCore/select?fl=prefix2,id&q=prefix2:%22ver%22&wt=xml&hl=true&&hl.fl=prefix2

returns: 
<em>Orl</em>ando Verenigde Staten
while it should be: 
Orlando <em>Ver</em>enigde Staten

the field def: 


  



  
  


  


I checked that removing the EdgeNGramFilterFactory results in correct
positioning of  highlighting. (But then I can't search for ngrams...) 

What am I missing? 
Thanks in advance, 
Britske



-- 
View this message in context: 
http://www.nabble.com/highlighting-on-edgeGramTokenized-field---%3E-hightlighting-incorrect-bc.-position-not-incremented..-tp23996196p23996196.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: highlighting on edgeGramTokenized field --> highlighting incorrect bc. position not incremented..

2009-06-12 Thread Britske

Thanks, I'll check it out. 


Otis Gospodnetic wrote:
> 
> 
> Britske,
> 
> I'd have to dig, but there are a couple of JIRA issues in Lucene's JIRA
> (the actual ngram code is part of Lucene) that have to do with ngram
> positions.  I have a feeling that may be the problem.
> 
>  Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> 
> - Original Message 
>> From: Britske 
>> To: solr-user@lucene.apache.org
>> Sent: Friday, June 12, 2009 6:15:36 AM
>> Subject: highlighting on edgeGramTokenized field --> hightlighting
>> incorrect bc. position not incremented..
>> 
>> 
>> Hi, 
>> 
>> I'm trying to highlight based on a (multivalued) field (prefix2) that has
>> (among other things) a EdgeNGramFilterFactory defined. 
>> highlighting doesn't increment the start-position of the highlighted
>> portion, so in other words the highlighted portion is always the
>> beginning
>> of the field. 
>> 
>> 
>> 
>> 
>> for example: 
>> for prefix2: "Orlando Verenigde Staten"
>> the query:
>> http://localhost:8983/solr/autocompleteCore/select?fl=prefix2,id&q=prefix2:%22ver%22&wt=xml&hl=true&&hl.fl=prefix2
>> 
>> returns: 
>> Orlando Verenigde Staten
>> while it should be: 
>> Orlando Verenigde Staten
>> 
>> the field def: 
>> 
>> 
>> positionIncrementGap="1">
>>   
>> 
>> 
>> 
>> maxGramSize="20"/>
>>   
>>   
>> 
>> 
>>   
>> 
>> 
>> I checked that removing the EdgeNGramFilterFactory results in correct
>> positioning of  highlighting. (But then I can't search for ngrams...) 
>> 
>> What am I missing? 
>> Thanks in advance, 
>> Britske
>> 
>> 
>> 
>> -- 
>> View this message in context: 
>> http://www.nabble.com/highlighting-on-edgeGramTokenized-field---%3E-hightlighting-incorrect-bc.-position-not-incremented..-tp23996196p23996196.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/highlighting-on-edgeGramTokenized-field---%3E-hightlighting-incorrect-bc.-position-not-incremented..-tp23996196p24006375.html
Sent from the Solr - User mailing list archive at Nabble.com.



Universal DataImport(AndExport)Handler

2010-06-08 Thread britske

Recently I looked a bit at the DataImportHandler and I'm really impressed with
the flexibility of its transform / import options. 
Especially with the integrations with Solr Cell / Tika this has become a great
data importer.

Besides some use-cases that import to Solr (which I plan to migrate to DIH
asap), DIH would imo be ideally suited to export to other datasources
(besides Solr) as well. 

From a roadmap I've read somewhere it was suggested that Lucene (without
Solr in the middle) would be supported in time. 

I'd like to know if other output targets are on the agenda/horizon as well,
with RDBMS and Tokyo Tyrant (and other KV-stores supporting the memcached
protocol) as my preferred options ;-) 

I realize this is a far leap from being an importer meant for Solr alone,
but the flexibility of the data transformations, new multi-threaded support,
etc. could be put to great use in a wider context imho. 

Any roadmap or ideas in this direction? 

Geert-Jan
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Universal-DataImport-AndExport-Handler-tp878881p878881.html
Sent from the Solr - User mailing list archive at Nabble.com.


how to make sure a particular query is ALWAYS cached

2007-10-04 Thread Britske

I want a couple of costly queries to be cached in the queryResultCache at all
times (unless I have a new searcher, of course). 

As far as I know the only parameters to be supplied to the LRU implementation
of the queryResultCache are size-related, which doesn't give me this
guarantee. 

What would be my best bet to implement this functionality with the least
impact?
1. Use a User/Generic cache. This would result in a separate code path in the
application, which I would like to avoid (a sketch of this option follows
below). 
2. Extend the LRU cache, and extend the request handler so that a query can be
extended with a parameter indicating that it should be cached at all times.
However, this seems like a lot of cluttering up of these interfaces for a
relatively small change. 
3. Another option...
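For option 1, a rough sketch of what using a named user cache from a custom
request handler could look like; it assumes a user cache named
"permanentResults" is declared in solrconfig.xml, and the cache key and the
query variables are made up:

import java.io.IOException;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.DocList;
import org.apache.solr.search.SolrIndexSearcher;

public class PinnedQuerySketch {
    static DocList topDocs(SolrQueryRequest req, Query query, Query filter, Sort sort)
            throws IOException {
        SolrIndexSearcher searcher = req.getSearcher();
        String key = "navigation-page-1";   // made-up cache key for this costly query
        DocList cached = (DocList) searcher.cacheLookup("permanentResults", key);
        if (cached == null) {
            cached = searcher.getDocList(query, filter, sort, 0, 10);
            searcher.cacheInsert("permanentResults", key, cached);
        }
        return cached;
    }
}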

best regards,
Geert-Jan
-- 
View this message in context: 
http://www.nabble.com/how-to-make-sure-a-particular-query-is-ALWAYS-cached-tf4566711.html#a13035381
Sent from the Solr - User mailing list archive at Nabble.com.



Re: how to make sure a particular query stays cached (and is not overwritten)

2007-10-04 Thread Britske

The title of my original post was misleading. 

// Geert-Jan


Britske wrote:
> 
> I want a couple of costly queries to be cached at all times in the
> queryResultCache. (unless I have a new searcher of course) 
> 
> As for as I know the only parameters to be supplied to the
> LRU-implementation of the queryResultCache are size-related, which doens't
> give me this guarentee. 
> 
> what would be my best bet to implement this functionality with the least
> impact?
> 1. use User/Generic-cache. This would result in seperate coding-path in
> application which I would like to avoid. 
> 2. exend LRU-cache, and extend request-handler so that a query can be
> extended with a parameter indicating that it should be cached at all
> times. However, this seems like a lot of cluttering-up these interfaces,
> for a relatively small change. 
> 3. another option..
> 
> best regards,
> Geert-Jan
> 

-- 
View this message in context: 
http://www.nabble.com/how-to-make-sure-a-particular-query-is-ALWAYS-cached-tf4566711.html#a13037820
Sent from the Solr - User mailing list archive at Nabble.com.



Re: how to make sure a particular query is ALWAYS cached

2007-10-04 Thread Britske


hossman wrote:
> 
> 
> : I want a couple of costly queries to be cached at all times in the
> : queryResultCache. (unless I have a new searcher of course) 
> 
> first off: you can ensure that certain queries are in the cache, even if 
> there is a newSearcher, just configure a newSearcher Event Listener that 
> forcibly warms the queries you care about.
> 
> (this is particularly handy to ensure FieldCache gets populated before any 
> user queries are processed)
> 
> Second: if i understand correctly, you want a way to put an object in the 
> cache, and garuntee that it's always in the cache, even if other objects 
> are more frequetnly used or more recently used?
> 
> that's kind of a weird use case ... can you elaborate a little more on 
> what exactly your end goal is?
> 
> 

Sure. Actually I got the idea from another thread, posted by Thomas, to which
you gave a reply a few days ago: 
http://www.nabble.com/Result-grouping-options-tf4522284.html#a12900630. 
I quote the relevant bits below, although I think you remember: 


hossman wrote:
> 
> : Is it possible to use faceting to not only get the facet count but also
> the
> : top-n documents for every facet
> : directly? If not, how hard would it be to implement this as an
> extension?
> 
> not hard ... a custom request handler could subclass
> StandardRequestHandler, call super.handleRequest, and then pull the field
> faceting info out of the response object, and fetch a DocList for each of
> the top N field constraints.
> 

I have about a dozen queries that I want to have permanently cached, each
corresponding to a particular navigation page. Each of these pages has up to
about 10 top-N lists which are populated as discussed above. These lists are
pretty static (updated once a day, together with the index). 

The above would enable me to populate all the lists on a single page in 1
pass. Correct? 
Although I haven't tried yet, I can't imagine that this request returns in
sub-second time, which is what I want (having an index of about 1M docs with
6000 fields per doc and about 10 complex facet queries per request). 

The navigation pages are pretty important for, eh well, navigation ;-) and
although I can rely on frequent access to these pages most of the time, that
is not guaranteed (so neither is the caching).


hossman wrote:
> 
> the most straightforward approach i can think of would be a new cache 
> implementation that "permenantly" stores the first N items you put in it.  
> that in combination with the newSearcher warming i described above should 
> work.
> 
> : 1. use User/Generic-cache. This would result in seperate coding-path in
> : application which I would like to avoid. 
> : 2. exend LRU-cache, and extend request-handler so that a query can be
> : extended with a parameter indicating that it should be cached at all
> times.
> : However, this seems like a lot of cluttering-up these interfaces, for a
> : relatively small change. 
> 
> #1 wouldn't really accomplish what you want without #2 as well.
> 
> 
> 
> -Hoss
> 
> 
> 

Regarding #1: 
Wouldn't making a user cache for the sole purpose of storing these queries
be enough? I could then reference this user cache by name and extract the
correct query result (at least that's how I read the documentation; I have
no previous experience with the user-cache mechanism). In that case I don't
need #2, right? Or is this for another reason not a good way to handle
things? 

//Geert-Jan

-- 
View this message in context: 
http://www.nabble.com/how-to-make-sure-a-particular-query-is-ALWAYS-cached-tf4566711.html#a13048285
Sent from the Solr - User mailing list archive at Nabble.com.



RE: how to make sure a particular query is ALWAYS cached

2007-10-04 Thread Britske

I need the documents in order, so the filterCache is of no use. Moreover, I
already use a lot of the filterCache for other fq queries: about 99% of the
6000 fields I mentioned have their values separately in the filterCache. There
must be room for optimization there, but that's a different story ;-)

//Geert-Jan


Lance Norskog wrote:
> 
> You could make these filter queries. Filters are a separate cache and as
> long as you have more cache than queries they will remain pinned in RAM.
> Your code has to remember these special queries in special-case code, and
> create dummy query strings to fetch the filter query.  "field:[* TO *]"
> will
> do nicely.
> 
> Cheers,
> 
> Lance Norskog 
> 
> -Original Message-
> From: Britske [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, October 04, 2007 1:38 PM
> To: solr-user@lucene.apache.org
> Subject: Re: how to make sure a particular query is ALWAYS cached
> 
> 
> 
> hossman wrote:
>> 
>> 
>> : I want a couple of costly queries to be cached at all times in the
>> : queryResultCache. (unless I have a new searcher of course)
>> 
>> first off: you can ensure that certain queries are in the cache, even 
>> if there is a newSearcher, just configure a newSearcher Event Listener 
>> that forcibly warms the queries you care about.
>> 
>> (this is particularly handy to ensure FieldCache gets populated before 
>> any user queries are processed)
>> 
>> Second: if i understand correctly, you want a way to put an object in 
>> the cache, and garuntee that it's always in the cache, even if other 
>> objects are more frequetnly used or more recently used?
>> 
>> that's kind of a weird use case ... can you elaborate a little more on 
>> what exactly your end goal is?
>> 
>> 
> 
> Sure.  Actually i got the idea of another thread posted by Thomas to which
> you gave a reply a few days ago: 
> http://www.nabble.com/Result-grouping-options-tf4522284.html#a12900630. 
> I quote the relevant bits below, although I think you remember: 
> 
> 
> hossman wrote:
>> 
>> : Is it possible to use faceting to not only get the facet count but 
>> also the
>> : top-n documents for every facet
>> : directly? If not, how hard would it be to implement this as an 
>> extension?
>> 
>> not hard ... a custom request handler could subclass 
>> StandardRequestHandler, call super.handleRequest, and then pull the 
>> field faceting info out of the response object, and fetch a DocList 
>> for each of the top N field constraints.
>> 
> 
> I have about a dozen queries that I want to have permanently cached, each
> corresponding to a particular navigation page. Each of these pages has up
> to
> about 10 top-N lists which are populated as discussed above. These lists
> are
> pretty static (updated once a day, together with the index). 
> 
> The above would enable me to populate all the lists on a single page in 1
> pass. Correct? 
> Although I haven't tried yet, I can't imagine that this request returns in
> sub-zero seconds, which is what I want (having a index of about 1M docs
> with
> 6000 fields/ doc and about 10 complex facetqueries / request). 
> 
> The navigation-pages are pretty important for, eh well navigation ;-) and
> although I can rely on frequent access of these pages most of the time, it
> is not guarenteed (so neither is the caching)
> 
> 
> hossman wrote:
>> 
>> the most straightforward approach i can think of would be a new cache 
>> implementation that "permenantly" stores the first N items you put in it.
>> that in combination with the newSearcher warming i described above 
>> should work.
>> 
>> : 1. use User/Generic-cache. This would result in seperate coding-path 
>> in
>> : application which I would like to avoid. 
>> : 2. exend LRU-cache, and extend request-handler so that a query can 
>> be
>> : extended with a parameter indicating that it should be cached at all 
>> times.
>> : However, this seems like a lot of cluttering-up these interfaces, 
>> for a
>> : relatively small change. 
>> 
>> #1 wouldn't really accomplish what you want without #2 as well.
>> 
>> 
>> 
>> -Hoss
>> 
>> 
>> 
> 
> regarding #1. 
> Wouldn't making a user-cache for the sole-purpose of storing these queries
> be enough? I could then reference this user-cache by name, and extract the
> correct queryresult. (at least that's how I read the documentation, I have
> no previous experience with the user-cache mechanism).  In that case I
> don't
> need #2 right? Or is this for another reason not a good way to handle
> things? 
> 
> //Geert-Jan
> 
> --
> View this message in context:
> http://www.nabble.com/how-to-make-sure-a-particular-query-is-ALWAYS-cached-t
> f4566711.html#a13048285
> Sent from the Solr - User mailing list archive at Nabble.com.
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/how-to-make-sure-a-particular-query-is-ALWAYS-cached-tf4566711.html#a13050087
Sent from the Solr - User mailing list archive at Nabble.com.



Re: how to make sure a particular query is ALWAYS cached

2007-10-09 Thread Britske

Separating requests over 2 ports is a nice solution when having multiple
user types. I like that, although I don't think I need it for this case. 

I'm just going to go the 'normal' caching route and see where that takes me,
instead of deciding upfront that it can't be done :-) 

Thanks!



hossman wrote:
> 
> 
> : Although I haven't tried yet, I can't imagine that this request returns
> in
> : sub-zero seconds, which is what I want (having a index of about 1M docs
> with
> : 6000 fields/ doc and about 10 complex facetqueries / request). 
> 
> i wouldn't neccessarily assume that :)  
> 
> If you have a request handler which does a query with a facet.field, and 
> then does a followup query for the top N constraings in that facet.field, 
> the time needed to execute that handler on a cold index should primarily 
> depend on the faceting aspect and how many unique terms there are in that 
> field.  try it and see.
> 
> : The navigation-pages are pretty important for, eh well navigation ;-)
> and
> : although I can rely on frequent access of these pages most of the time,
> it
> : is not guarenteed (so neither is the caching)
> 
> if i were in your shoes: i wouldn't worry about it.  i would setup 
> "cold cache warming" of the important queries using a firstSearcher event 
> listener, i would setup autowarming on the caches, i would setup explicit 
> warming of queries using sort fields i care about in a newSearcher event 
> listener, andi would make sure to tune my caches so that they were big 
> enough to contain a much larger number of entries then are used by my 
> custom request handler for the queris i care about (especially if my index 
> only changed a few times a day, the caches become a huge win in that case, 
> so throw everything you've got at them)
> 
> and for the record: i've been in your shoes.
> 
> From a purely theoretical standpoint: if enough other requests are coming 
> in fast enough to expunge the objects used by your "important" navigation 
> pages from the caches ... then those pages aren't that important (at least 
> not to your end users as an aggregate)
> 
> on the other hand: if you've got discreet pools of users (like say: 
> customers who do searches, vs your boss who thiks navigation pages are 
> really important) then another appraoch is to have to ports searching 
> queries -- one that you send your navigation type queries to (with the 
> caches tuned appropriately) and one that you send other traffic to (with 
> caches tuned appropriately) ... i do that for one major index, it makes a 
> lot of sense when you have very distinct usage profiles and you want to 
> get the most bang for your buck cache wise.
> 
> 
> : > #1 wouldn't really accomplish what you want without #2 as well.
> 
> : regarding #1. 
> : Wouldn't making a user-cache for the sole-purpose of storing these
> queries
> : be enough? I could then reference this user-cache by name, and extract
> the
> 
> only if you also write a custom request handler ... that was my point 
> before it was clear that you were already doing that no matter what (you 
> had custom request handler listed in #2)
> 
> you could definitely make sure to explicitly put all of your DocLists in 
> your own usercache, that will certainly work.  but frankly, based on 
> what you've described about your use case, and how often your data 
> cahnges, it would probably be easier to set up a layer of caching in front 
> of Solr (since you are concerned with ensuring *all* of the date 
> for these important pages gets cached) ... something like an HTTP reverse 
> proxy cache (aka: acelerator proxy) would help you ensure that thes whole 
> pages were getting cached.
> 
> i've never tried it, but in theory: you could even setup a newSearcher 
> event listener to trigger a little script to ping your proxy with a 
> request thatforced it to revalidate the query when your index changes.
> 
> 
> 
> -Hoss
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/how-to-make-sure-a-particular-query-is-ALWAYS-cached-tf4566711.html#a13110514
Sent from the Solr - User mailing list archive at Nabble.com.



extending StandardRequestHandler gives ClassCastException

2007-10-09 Thread Britske

I'm trying to add a new requestHandler plugin to Solr by extending
StandardRequestHandler.
However, when starting the Solr server after configuring it I get a
ClassCastException: 

SEVERE: java.lang.ClassCastException:
wrappt.solr.requesthandler.TopListRequestHandler cannot be cast to
org.apache.solr.request.SolrRequestHandler  at
org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:149)

I can't get my head around what might be wrong, as I am extending
org.apache.solr.handler.StandardRequestHandler, which already implements
org.apache.solr.request.SolrRequestHandler, so it must be castable, I
figure. 

Anyone any ideas? Below is the code / setup I used.

My handler: 
---
package wrappt.solr.requesthandler;

import org.apache.solr.handler.StandardRequestHandler;
import org.apache.solr.request.SolrRequestHandler;

public class TopListRequestHandler extends StandardRequestHandler implements
SolrRequestHandler
{
//no code here (so it mimicks StandardRequestHandler)
}
--

configured in solrconfig as: 


I added this handler to a jar called solrRequestHandler1.jar and added this
jar, along with apache-solr-nightly.jar, to the \lib directory of my server.
(It needs the last jar for resolving StandardRequestHandler. Isn't this
strange, btw? I thought that it would be resolved from solr.war
automatically.) 

General solr-info of the server: 
Solr Specification Version: 1.2.2007.10.07.08.05.52
Solr Implementation Version: nightly ${svnversion} - yonik - 2007-10-07
08:05:52

I double-checked that the included apache-solr-nightly.jar is the same
version as the deployed server by getting the latest nightly build and
taking the .jars and .war from it. 

Furthermore, I noticed that org.apache.solr.request.StandardRequestHandler
is deprecated. Note that I'm extending
org.apache.solr.handler.StandardRequestHandler. Is it possible that this has
anything to do with it? 

with regards,
Geert-Jan


-- 
View this message in context: 
http://www.nabble.com/extending-StandardRequestHandler-gives-ClassCastException-tf4594102.html#a13115182
Sent from the Solr - User mailing list archive at Nabble.com.



Re: extending StandardRequestHandler gives ClassCastException

2007-10-09 Thread Britske

Yeah, I'm compiling with a reference to apache-solr-nightly.jar which is from
the same nightly build (7 October 2007) as the apache-solr-nightly.war I'm
deploying against. I include this same apache-solr-nightly.jar in the lib
folder of my deployed server. 

It still seems odd that I have to include the jar, since
StandardRequestHandler should be picked up from the war, right? Is this also a
sign that there must be something wrong with the deployment?

Btw: I deployed by copying a directory which contains the example
deployment, and swapped in the apache-solr-nightly.war in the 'webapps' dir
after renaming it to solr.war. This enables me to start the new server
using: java -jar start.jar. I don't know if this is common practice or
considered 'exotic', but it might just be causing the problem... Anyway,
after deploying, the server picks up the correct war, as solr/admin shows the
correct Solr Specification Version: 1.2.2007.10.07.08.05.52.

Other options?
Geert-Jan



Erik Hatcher wrote:
> 
> Are you compiling your custom request handler against the same  
> version of Solr that you are deploying with?   My hunch is that  
> you're compiling against an older version.
> 
>   Erik
> 
> 
> On Oct 9, 2007, at 9:04 AM, Britske wrote:
> 
>>
>> I'm trying to add a new requestHandler-plugin to Solr by extending
>> StandardRequestHandler.
>> However, when starting solr-server after configuration i get a
>> ClassCastException:
>>
>> SEVERE: java.lang.ClassCastException:
>> wrappt.solr.requesthandler.TopListRequestHandler cannot be cast to
>> org.apache.solr.request.SolrRequestHandler  at
>> org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java: 
>> 149)
>>
>> I can't get my head around what might be wrong, as I am extending
>> org.apache.solr.handler.StandardRequestHandler which already  
>> implements
>> org.apache.solr.request.SolrRequestHandler so it must be able to  
>> cast i
>> figure.
>>
>> Anyone any ideas? below is the code / setup I used.
>>
>> My handler:
>> ---
>> package wrappt.solr.requesthandler;
>>
>> import org.apache.solr.handler.StandardRequestHandler;
>> import org.apache.solr.request.SolrRequestHandler;
>>
>> public class TopListRequestHandler extends StandardRequestHandler  
>> implements
>> SolrRequestHandler
>> {
>>  //no code here (so it mimicks StandardRequestHandler)
>> }
>> --
>>
>> configured in solrconfig as:
>> > class="wrappt.solr.requesthandler.TopListRequestHandler"/>
>>
>> added this handler to a jar called: solrRequestHandler1.jar and  
>> added this
>> jar along with  apache-solr-nightly.jar to the \lib directory of my  
>> server.
>> (It needs the last jar  for resolving the StandardRequestHandler.  
>> Isnt this
>> strange btw, because I thought that it would be resolved from solr.war
>> automatically. )
>>
>> general solr-info of the server:
>> Solr Specification Version: 1.2.2007.10.07.08.05.52
>>  Solr Implementation Version: nightly ${svnversion} - yonik -  
>> 2007-10-07
>> 08:05:52
>>
>> I double-checked that the included apache-solr-nightly.jar are the  
>> same
>> version as the deployed server by getting the latest nightly build and
>> getting the .jars and .war from it.
>>
>> Furthermore, I noticed that  
>> org.apache.solr.request.StandardRequestHandler
>> is deprecated. Note that I'm extending
>> org.apache.solr.handler.StandardRequestHandler. Is it possible that  
>> this has
>> anything to do with it?
>>
>> with regards,
>> Geert-Jan
>>
>>
>> -- 
>> View this message in context: http://www.nabble.com/extending- 
>> StandardRequestHandler-gives-ClassCastException- 
>> tf4594102.html#a13115182
>> Sent from the Solr - User mailing list archive at Nabble.com.
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/extending-StandardRequestHandler-gives-ClassCastException-tf4594102.html#a13118296
Sent from the Solr - User mailing list archive at Nabble.com.



Re: extending StandardRequestHandler gives ClassCastException

2007-10-09 Thread Britske

Thanks, but I'm using the updated o.a.s.handler.StandardRequestHandler. I'm
going to try on 1.2 instead to see if it changes things. 

Geert-Jan



ryantxu wrote:
> 
> 
>> It still seems odd that I have to include the jar, since the
>> StandardRequestHandler should be picked up in the war right? Is this also
>> a
>> sign that there must be something wrong with the deployment?
>> 
> 
> Note that in 1.3, the StandardRequestHandler was moved from 
> o.a.s.request to o.a.s.handler:
> 
> http://svn.apache.org/repos/asf/lucene/solr/trunk/src/java/org/apache/solr/request/StandardRequestHandler.java
> http://svn.apache.org/repos/asf/lucene/solr/trunk/src/java/org/apache/solr/handler/StandardRequestHandler.java
> 
> If you are subclassing StandardRequestHandler, make sure you are using a 
> consistent versions
> 
> ryan
> 
> 

-- 
View this message in context: 
http://www.nabble.com/extending-StandardRequestHandler-gives-ClassCastException-tf4594102.html#a13121575
Sent from the Solr - User mailing list archive at Nabble.com.



Re: extending StandardRequestHandler gives ClassCastException

2007-10-09 Thread Britske

Thanks, that was the problem! I mistakenly thought the lib folder containing
jetty.jar etc. was the folder to put plugins into. After adding a lib folder
to solr-home everything is resolved. 

Geert-Jan



hossman wrote:
> 
> 
> : SEVERE: java.lang.ClassCastException:
> : wrappt.solr.requesthandler.TopListRequestHandler cannot be cast to
> : org.apache.solr.request.SolrRequestHandler  at
> : org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:149)
> 
> : added this handler to a jar called: solrRequestHandler1.jar and added
> this
> : jar along with  apache-solr-nightly.jar to the \lib directory of my
> server.
> : (It needs the last jar  for resolving the StandardRequestHandler. Isnt
> this
> : strange btw, because I thought that it would be resolved from solr.war
> : automatically. ) 
> 
> classpaths are very very very tricky and anoying.  i believe the problem 
> you are seeing is that the SolrCore knows about the copy of 
> StandardREquestHandler in the Classloader for your war, but because of 
> where you put your custom request handler, the war's classloader is 
> delegating "up" to it's parent (the containers class loader) to find it, 
> at which point the containers class loader also needs to resolve 
> StandardRequestHandler (hence you put apache-solr-nightly.jar in that lib 
> so that classloader can find it)  now the container classloader has 
> resolved all of the classes it needs for Solr to finsh constructing your 
> hanlder -- except that your handler doesn't extend the "copy"
> of StandardRequestHandler Solr knows about -- it extends one up in in the 
> parent classloader.
> 
> try creating a lib directory in your solrhome and putting your jar there 
> ... make sure you get rid of your jar (and the solr-nightly jar) that you 
> put in the containers main lib directory.  they will cause you nothing but 
> problems.  if that *still* doesn't work, try unpacking the Solr war, and 
> adding your class directly to it ... that *completeley* eliminates any 
> possibility of classpath issues and will help identify if it's some other 
> random problem (but it's a last resort since it makes upgrading later 
> hard)
> 
>   http://wiki.apache.org/solr/SolrPlugins
> 
> 
> -Hoss
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/extending-StandardRequestHandler-gives-ClassCastException-tf4594102.html#a13130439
Sent from the Solr - User mailing list archive at Nabble.com.



showing results per facet-value efficiently

2007-10-10 Thread Britske

First of all, I just wanted to say that I just started working with Solr and
really like the results I'm getting from Solr (in terms of performance and
flexibility) as well as the good responses I'm getting from this group.
Hopefully I will be able to contribute in one way or another to this
wonderful application in the future!

The current issue that I'm having is the following (I tried not to be
long-winded, but somehow that didn't work out :-) ):

I'm extending StandardRequestHandler to not only show the counts per
facet value but also the top-N results per facet value (where N is
configurable). 
(See http://www.nabble.com/Result-grouping-options-tf4522284.html#a12900630
for where I got the idea from.) 
I quickly implemented this by fetching a DocList for each of my facet values
and appending these to the result as suggested in the referred post; no
problems there. 

However, I realized that for calculating the count for each of the
facet values, the original StandardRequestHandler already loops over the
doclist to check for matches. Therefore my implementation actually does double
work, since it fetches doclists for each of the facet values again. 

My question: 
is there a way to get to the already calculated doclist per facet value from
a subclassed StandardRequestHandler, and so get a nice speedup? This
facet calculation seems to go deep into the core of Solr
(SimpleFacets.getFacetTermEnumCounts) and does not seem very sensible to alter
for just this requirement. Opinions appreciated. 

Some additional info:

I have a  requirement to be able to limit the result to explicitly specified
facet-values. For that I do something like: 
select?
 qt=toplist
&q=name:A OR name:B OR  name:C 
&sort=sortfield asc 
&facet=true
&facet.field=name
&facet.limit=1
&rows=2

This all works okay and results in a faceting/grouping by field: 'name', 
where for each facetvalue (A, B, C)
2 results are shown (ordered by sortfield). 

The relevant code from the subclassed StandardRequestHandler is below. As
can be seen, I alter the query by adding the facet value as a filter (which is
almost guaranteed to already exist in FQ, btw). 

Therefore a second question is: 
will there be a noticable speedup when persuing the above, since the request
that is done per facet-value is nothing more than giving the ordered result
of the intersection of the overall query (which is in the querycache) and
the facetvalue itself (which is almost certainly in the filtercache). 

As a last and somewhat related question: 
is there a way to explicity specify facet-values that I want to include in
the faceting without (ab)using Q? This is  relevant for me since the perfect
solution would be to have the ability to orthogonally get multiple toplists
in 1 query. Given the current implementation, this orthogonality is now
'corrupted' as injection of a fieldvalue in Q for one facetfield influences
the outcome of another facetfield. 

kind regards, 
Geert-Jan



---
if (true) // TODO: this needs facetinfo as a precondition.
{
  NamedList facetFieldList = (NamedList) facetInfo.get("facet_fields");
  for (int i = 0; i < facetFieldList.size(); i++)
  {
    NamedList facetValList = (NamedList) facetFieldList.getVal(i);
    for (int j = 0; j < facetValList.size(); j++)
    {
      NamedList facetValue = new SimpleOrderedMap();
      // facetValue.add("count", valList.getVal(j));

      // build a query "field:value" for this facet value and use it as an
      // extra filter on the original query
      DocListAndSet resultList = new DocListAndSet();
      Query facetq = QueryParsing.parseQuery(
          facetFieldList.getName(i) + ":" + facetValList.getName(j),
          req.getSchema());

      // top-N docs (ordered by the request's sort) within this facet value
      resultList.docList = s.getDocList(query, facetq, sort,
          p.getInt(CommonParams.START, 0),
          p.getInt(CommonParams.ROWS, 3));

      facetValue.add("results", resultList.docList);
      facetValList.setVal(j, facetValue);
    }
  }
  rsp.add("facet_results", facetFieldList);
}
-- 
View this message in context: 
http://www.nabble.com/showing-results-per-facet-value-efficiently-tf4600154.html#a13133815
Sent from the Solr - User mailing list archive at Nabble.com.
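
For reference, the select?qt=toplist request shown in the message above maps one-to-one onto SolrJ. A minimal sketch, assuming the custom 'toplist' handler is registered and using the field names from the example; everything else is standard SolrQuery usage:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ToplistRequestSketch {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery query = new SolrQuery("name:A OR name:B OR name:C");
    query.setQueryType("toplist");       // qt=toplist (the subclassed handler)
    query.set("sort", "sortfield asc");  // sort=sortfield asc
    query.setFacet(true);                // facet=true
    query.addFacetField("name");         // facet.field=name
    query.setFacetLimit(1);              // facet.limit=1
    query.setRows(2);                    // rows=2

    QueryResponse rsp = server.query(query);
    // "facet_results" is the section the subclassed handler adds (see the code above)
    System.out.println(rsp.getResponse().get("facet_results"));
  }
}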



implemented StandardRequestHandler to show top-results per facet-value. Is this the fastest way?

2007-10-11 Thread Britske

Since the title of my original post may not have been so clear, here is a
repost. 
//Geert-Jan


Britske wrote:
> 
> First of all, I just wanted to say that I just started working with Solr
> and really like the results I'm getting from Solr (in terms of
> performance, flexibility) as well as the good responses I'm getting from
> this group. Hopefully I will be able to contribute in one way or another
> to this wonderful application in the future!
> 
> The current issue that I'm having is the following ( I tried not to be
> long-winded, but somehow that didn't work out :-)   ):
> 
> I'm extending StandardRequestHandler to not only show the counts per
> facet-value but also the top-N results per facet-value (where N is
> configurable). 
> (See
> http://www.nabble.com/Result-grouping-options-tf4522284.html#a12900630 for
> where I got the idea from). 
> I quickly implemented this by fetching a doclist for each of my
> facet-values and appending these to the result as suggested in the referred
> post, no problems there. 
> 
> However, I realized that for calculating the count for each of the
> facetvalues, the original standardrequesthandler already loops the doclist
> to check for matches. Therefore my implementation actually does double
> work, since it gets doclists for each of the facetvalues again. 
> 
> My question: 
> is there a way to get to the already calculated doclist per facetvalue
> from a subclassed StandardRequestHandler, and so get a nice speedup?  This
> facet calculation seems to go deep into the core of Solr
> (SimpleFacets.getFacetTermEnumCounts) and seems not very sensible to alter
> for just this requirement. Opinions appreciated. 
> 
> Some additional info:
> 
> I have a  requirement to be able to limit the result to explicitly
> specified facet-values. For that I do something like: 
> select?
>  qt=toplist
> &q=name:A OR name:B OR  name:C 
> &sort=sortfield asc 
> &facet=true
> &facet.field=name
> &facet.limit=1
> &rows=2
> 
> This all works okay and results in a faceting/grouping by field: 'name', 
> where for each facetvalue (A, B, C)
> 2 results are shown (ordered by sortfield). 
> 
> The relevant code from the subclassed StandardRequestHandler is below. As
> can be seen I alter the query by adding the facetvalue to FQ (which is
> almost guaranteed to already exist in FQ btw.) 
> 
> Therefore a second question is: 
> will there be a noticeable speedup when pursuing the above, since the
> request that is done per facet-value is nothing more than giving the
> ordered result of the intersection of the overall query (which is in the
> querycache) and the facetvalue itself (which is almost certainly in the
> filtercache). 
> 
> As a last and somewhat related question: 
> is there a way to explicitly specify facet-values that I want to include in
> the faceting without (ab)using Q? This is  relevant for me since the
> perfect solution would be to have the ability to orthogonally get multiple
> toplists in 1 query. Given the current implementation, this orthogonality
> is now 'corrupted' as injection of a fieldvalue in Q for one facetfield
> influences the outcome of another facetfield. 
> 
> kind regards, 
> Geert-Jan
> 
> 
> 
> ---
> if(true) //TODO: this needs facetinfo as a precondition. 
> {
>   NamedList facetFieldList =
> ((NamedList)facetInfo.get("facet_fields"));
>for(int i = 0; i < facetFieldList.size(); i++)
>{
>   NamedList facetValList = (NamedList)facetFieldList.getVal(i); 
>   for(int j = 0; j < facetValList.size(); j++)
>  {
>   NamedList facetValue = new SimpleOrderedMap(); 
> // facetValue.add("count", valList.getVal(j));
>   
>  DocListAndSet resultList = new DocListAndSet();
>Query facetq = QueryParsing.parseQuery(
> facetFieldList.getName(i) + ":" + facetValList.getName(j),
> req.getSchema());
>  resultList.docList = s.getDocList(query,facetq,
> sort,p.getInt(CommonParams.START,0), 
> p.getInt(CommonParams.ROWS,3));
> 
>facetValue.add("results",resultList.docList);
>  facetValList.setVal(j, facetValue);
>  }
>}
>rsp.add("facet_results", facetFieldList);
> }
> 

-- 
View this message in context: 
http://www.nabble.com/showing-results-per-facet-value-efficiently-tf4600154.html#a13150519
Sent from the Solr - User mailing list archive at Nabble.com.



Re: showing results per facet-value efficiently

2007-10-11 Thread Britske

yup that clarifies things a lot, thanks.



Mike Klaas wrote:
> 
> On 10-Oct-07, at 4:16 AM, Britske wrote:
> 
>>
>> However, I realized that for calculating the count for each of the
>> facetvalues, the original standardrequesthandler already loops the  
>> doclist
>> to check for matches. Therefore my implementation actually does  
>> double work,
>> since it gets doclists for each of the facetvalues again.
> 
> Well, not quite.  Once you get into the faceting code, everything is  
> in terms of DocSets, which are unordered collections of doc ids.   
> Also, faceting employs efficient algorithms for counting the  
> cardinality of intersections without actually materializing them,  
> which is another difficulty to reusing the code.
> 
>> My question:
>> is there a way to get to the already calculated doclist per  
>> facetvalue from
>> a subclassed StandardRequestHandler, and so get a nice speedup?  This
>> facet calculation seems to go deep into the core of Solr
>> (SimpleFacets.getFacetTermEnumCounts) and seems not very sensible  
>> to alter
>> for just this requirement. opinions appreciated.
> 
> Solr never really materializes much of the DocList for a query-- 
> almost all docs are dropped as soon as it is clear that they are not  
> in the top N results.
> 
> It should be possible to produce an approximation which is more  
> efficient, like collecting the DocList for the top 1000 docs,  
> converting it to a DocSet, find the set intersections (instead of  
> using SimpleFacets), and re-order the resulting sets in terms of the  
> original DocList.
> 
> It would take a bit of work to implement, however.
> 
>>
>> As a last and somewhat related question:
>> is there a way to explicitly specify facet-values that I want to  
>> include in
>> the faceting without (ab)using Q? This is  relevant for me since  
>> the perfect
>> solution would be to have the ability to orthogonally get multiple  
>> toplists
>> in 1 query. Given the current implementation, this orthogonality is  
>> now
>> 'corrupted' as injection of a fieldvalue in Q for one facetfield  
>> influences
>> the outcome of another facetfield.
> 
> I'm not quite sure what you are asking here.  You can specify  
> arbitrary facet values using facet.query or facet.prefix.  If you  
> want to facet multiple doclists from different queries in one  
> request, just write your own request handler that takes a multi- 
> valued q param and facets on each.
> 
> I didn't answer all the questions in your email, but I hope this  
> clarifies things a bit.  Good luck!
> 
> -Mike
> 
> 

-- 
View this message in context: 
http://www.nabble.com/showing-results-per-facet-value-efficiently-tf4600154.html#a13163943
Sent from the Solr - User mailing list archive at Nabble.com.
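
A rough sketch of the approximation Mike describes above (collect the top-K of the base DocList, turn it into a DocSet, intersect with each facet value's DocSet, and re-order by the original DocList). DocList, DocSet and HashDocSet are Solr's internal classes; exact signatures may differ between versions, so treat this as a sketch:

import java.io.IOException;
import org.apache.lucene.search.Query;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocList;
import org.apache.solr.search.DocSet;
import org.apache.solr.search.HashDocSet;
import org.apache.solr.search.SolrIndexSearcher;

public class FacetTopNSketch {

  /** The internal ids of a DocList, in rank order. */
  static int[] rankedIds(DocList docs) {
    int[] ids = new int[docs.size()];
    DocIterator it = docs.iterator();
    for (int i = 0; it.hasNext(); i++) {
      ids[i] = it.nextDoc();
    }
    return ids;
  }

  /**
   * Top-n docs for one facet value, taken from (and ordered like) the already
   * sorted top-K of the base query. Docs outside that top-K are ignored, which
   * is what makes this an approximation.
   */
  static int[] topNForFacetValue(SolrIndexSearcher searcher, DocList baseTopK,
                                 Query facetValueQuery, int n) throws IOException {
    int[] ranked = rankedIds(baseTopK);
    DocSet topSet = new HashDocSet(ranked, 0, ranked.length);
    DocSet facetSet = searcher.getDocSet(facetValueQuery);  // typically served from the filterCache
    DocSet hits = topSet.intersection(facetSet);

    int[] result = new int[Math.min(n, hits.size())];
    int found = 0;
    for (int id : ranked) {                 // walk in rank order to preserve the sort
      if (found == result.length) break;
      if (hits.exists(id)) result[found++] = id;
    }
    return result;
  }
}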



quickie: do facetfields use same cached items in field cache as FQ-param?

2007-10-11 Thread Britske

say I have the following (partial)
querystring:...&facet=true&facet.field=country
field 'country' is not tokenized, not multi-valued, and not boolean, so the
field-cache approach is used.

Moreover, the following (partial) querystring is used as well: 
..fq=country:france

do these queries share cached items in the fieldcache? (in this example:
country:france) or do they somehow live as separate entities in the cache?
The latter would explain my fieldcache having evictions at the moment.

Geert-Jan



-- 
View this message in context: 
http://www.nabble.com/quickie%3A-do-facetfields-use-same-cached-items-in-field-cache-as-FQ-param--tf4609795.html#a13164249
Sent from the Solr - User mailing list archive at Nabble.com.



Re: quickie: do facetfields use same cached items in field cache as FQ-param?

2007-10-11 Thread Britske

Yeah i meant filter-cache, thanks. 
It seemed that the particular field (cityname) was using a keywordtokenizer
(which doesn't show at the front), which is why I missed it, I guess :-S. This
means the field is tokenized, so the termEnums approach is used. This
results in about 10.000 inserts on facet.field=cityname on a cold searcher,
which matches the nr of different terms in that field. At least that
explains that. 

So if I understand correctly if I use that same field in a FQ-param, say
fq=cityname:amsterdam and amsterdam is a term of field cityname, than the
FQ-query can utilize the cached 'query': cityname:amsterdam which is already
put into the filtercache by the query facet.field=cityname right?

The thing that I still don't get is why my filtercache starts to have
evictions although its size is 16.000+.  This shouldn't be happening given
that:
I currently only use faceting on cityname and use this field on FQ as well,
as already said (which adds +/- 10.000 items to the filtercache, given that
faceting and fq share cached items). 
Moreover I use FQ on about 2500 different fields (named _ddp*), but only
check to see if a value exists by doing for example: fq=_ddp1234:[* TO *]. I
sometimes add them together like so: fq=_ddp1234:[* TO *] &fq=_ddp2345:[* TO
*]. But never like so: fq=_ddp1234:[* TO *] +_ddp2345:[* TO *]. Which means
each _ddp*-field is only added once to the filtercache. 

Wouldn't this mean that at a maximum I can only have 12500 items in the
filtercache?
Still my filtercache starts to have evictions although its size is 16.000+. 

What am I missing here?
Geert-Jan


hossman wrote:
> 
> 
> : ..fq=country:france
> : 
> : do these queries share cached items in the fieldcache? (in this example:
> : country:france) or do they somehow live as seperate entities in the
> cache?
> : The latter would explain my fieldcache having evictions at the moment.
> 
> FieldCache can't have evictions.  it's a really low level "cache" where 
> the key is field name and the value is an array containing a value for 
> every document (you can think of it as an inverted-inverted-index) that 
> Lucene maintains directly.  items are never removed they just get garbage 
> collected when the IndexReader is no longer used.  It's primarily for 
> sorting, but the SimpleFacets code also leverages it for facets in some 
> cases -- Solr has no way of showing you what's in the FieldCache, because 
> Lucene doesn't expose any inspection APIs to query it (it's a heisenberg 
> cache .. once you ask if something is in it, it's in it)
> 
> are you refering to the "filterCache" ?  
> 
> filterCache contains records whose key is a "query" and whose value is a 
> DocSet (an unordered collection of all docs matching a query) ... it's 
> used whenever you use an "fq" param, for faceting on some fields (when the 
> TermEnum method is used, a filterCache entry is added for each term 
> tested), and even for some sorted queries if the 
>  config option is set to true.
> 
> the easiest way to know whether your faceting is using the FieldCache is 
> to start your server cold (no newSearcher warming) and then send it a 
> simple query with a single facet.field.  depending on the query, you might 
> get 0 or 1 entries in the filterCache if SimpleFacets is using the 
> FieldCache -- but if it's using the TermEnums, and generating a DocSet per 
> term, you'll see *lots* of inserts into the filterCache.
> 
> 
> 
> -Hoss
> 
> 

-- 
View this message in context: 
http://www.nabble.com/quickie%3A-do-facetfields-use-same-cached-items-in-field-cache-as-FQ-param--tf4609795.html#a13169935
Sent from the Solr - User mailing list archive at Nabble.com.



Re: quickie: do facetfields use same cached items in field cache as FQ-param?

2007-10-12 Thread Britske

As a related question: is there a way to inspect the queries currently in the
filtercache?


Britske wrote:
> 
> Yeah i meant filter-cache, thanks. 
> It seemed that the particular field (cityname) was using a
> keywordtokenizer (which doesn't show at the front), which is why I missed
> it, I guess :-S. This means the field is tokenized, so the
> termEnums approach is used. This results in about 10.000 inserts on
> facet.field=cityname on a cold searcher, which matches the nr of different
> terms in that field. At least that explains that. 
> 
> So if I understand correctly if I use that same field in a FQ-param, say
> fq=cityname:amsterdam and amsterdam is a term of field cityname, than the
> FQ-query can utilize the cached 'query': cityname:amsterdam which is
> already put into the filtercache by the query facet.field=cityname right?
> 
> The thing that I still don't get is why my filtercache starts to have
> evictions although its size is 16.000+.  This shouldn't be happening given
> that:
> I currently only use faceting on cityname and use this field on FQ as
> well, as already said (which adds +/- 10.000 items to the filtercache,
> given that faceting and fq share cached items). 
> Moreover I use FQ on about 2500 different fields (named _ddp*), but only
> check to see if a value exists by doing for example: fq=_ddp1234:[* TO *].
> I sometimes add them together like so: fq=_ddp1234:[* TO *]
> &fq=_ddp2345:[* TO *]. But never like so: fq=_ddp1234:[* TO *]
> +_ddp2345:[* TO *]. Which means each _ddp*-field is only added once to the
> filtercache. 
> 
> Wouldn't this mean that at a maximum I can only have 12500 items in the
> filtercache?
> Still my filtercache starts to have evictions although its size is
> 16.000+. 
> 
> What am I missing here?
> Geert-Jan
> 
> 
> hossman wrote:
>> 
>> 
>> : ..fq=country:france
>> : 
>> : do these queries share cached items in the fieldcache? (in this
>> example:
>> : country:france) or do they somehow live as seperate entities in the
>> cache?
>> : The latter would explain my fieldcache having evictions at the moment.
>> 
>> FieldCache can't have evictions.  it's a really low level "cache" where 
>> the key is field name and the value is an array containing a value for 
>> every document (you can think of it as an inverted-inverted-index) that 
>> Lucene maintains directly.  items are never removed they just get garbage 
>> collected when the IndexReader is no longer used.  It's primarily for 
>> sorting, but the SimpleFacets code also leverages it for facets in some 
>> cases -- Solr has no way of showing you what's in the FieldCache, because 
>> Lucene doesn't expose any inspection APIs to query it (it's a heisenberg 
>> cache .. once you ask if something is in it, it's in it)
>> 
>> are you refering to the "filterCache" ?  
>> 
>> filterCache contains records whose key is a "query" and whose value is a 
>> DocSet (an unordered collection of all docs matching a query) ... it's 
>> used whenever you use an "fq" param, for faceting on some fields (when
>> the 
>> TermEnum method is used, a filterCache entry is added for each term 
>> tested), and even for some sorted queries if the 
>> <useFilterForSortedQuery> config option is set to true.
>> 
>> the easiest way to know whether your faceting is using the FieldCache is 
>> to start your server cold (no newSearcher warming) and then send it a 
>> simple query with a single facet.field.  depending on the query, you
>> might 
>> get 0 or 1 entries in the filterCache if SimpleFacets is using the 
>> FieldCache -- but if it's using the TermEnums, and generating a DocSet
>> per 
>> term, you'll see *lots* of inserts into the filterCache.
>> 
>> 
>> 
>> -Hoss
>> 
>> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/quickie%3A-do-facetfields-use-same-cached-items-in-field-cache-as-FQ-param--tf4609795.html#a13170530
Sent from the Solr - User mailing list archive at Nabble.com.



solr-139: support for adding fields which are not known at design-time?

2007-10-26 Thread Britske

is it / will it be possible to add previously non-existing fields to a document
with the upcoming solr-139? 
for instance, would something like this work? 

<add>
  <doc>
    <field name="id">318127</field>
    <field name="newfield">12</field>
  </doc>
</add>

with schema.xml: 
...
  <field name="id" ... />
  <!-- no explicit field (and no matching dynamicField) declared for "newfield" -->
...

btw: how is solr-139 coming along? Judging by the latest posts on JIRA,
there was still a lot of discussion going on, until about a month ago or so.
Has there been any recent activity? 
Moreover, which patch should I try?

kind regards,
Geert-Jan


-- 
View this message in context: 
http://www.nabble.com/solr-139%3A-support-for-adding-fields-which-are-not-known-at-design-time--tf4696143.html#a13423709
Sent from the Solr - User mailing list archive at Nabble.com.



copyField with functionquery as source

2007-10-26 Thread Britske

is it possible to have a copyField with a functionquery as its source? 
for instance: 

<copyField source="somefunction(fieldA,fieldB)" dest="derivedfield"/> 

If not, I think this would make a nice addition.

thanks, 
Geert-Jan




-- 
View this message in context: 
http://www.nabble.com/copyField-with-functionquery-as-source-tf4696019.html#a13423343
Sent from the Solr - User mailing list archive at Nabble.com.



SOLR 1.3: defaultOperator always defaults to OR although AND is specifed.

2007-11-01 Thread Britske

experimenting with SOLR 1.3 and discovered that although I specified 
<solrQueryParser defaultOperator="AND"/> in schema.xml

q=a+b behaves as q=a OR b instead of q=a AND b

Obviously this is not correct.
I used the nightly of 29 oct.

Cheers, 
Geert-Jan

-- 
View this message in context: 
http://www.nabble.com/SOLR-1.3%3A-defaultOperator-always-defaults-to-OR-although-AND-is-specifed.-tf4731773.html#a13529997
Sent from the Solr - User mailing list archive at Nabble.com.



Solr-J: automatic url-escaping gives invalid uri exception. How to workaround?

2007-11-01 Thread Britske

I have a custom requesthandler which does some very basic dynamic parameter
substitution. 
dynamic params are params which are enclosed in braces ({}). 

So this means i can do something like this: 
q={order}...

where {order} is substituted by the name of an existing order-column. 
Now this all works well when I supply such a query directly as a URL in
Firefox / IE. 

However when I supply a query through SOLR-J I get an "invalid URI"
exception as SOLR-J automatically URL-encodes the braces and then passes this
on to Apache HttpClient, which chokes on the URL-encoded URI. 

Is there any way to pass braces through SOLR-J such that the
resulting URL is correctly interpreted by HttpClient? 

Geert-Jan
-- 
View this message in context: 
http://www.nabble.com/Solr-J%3A-automatic-url-escaping-gives-invalid-uri-exception.-How-to-workaround--tf4733909.html#a13536871
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr-J: automatic url-escaping gives invalid uri exception. How to workaround?

2007-11-01 Thread Britske

I replaced { and } by (( resp. )). Not ideal (I like braces...) but it
suffices for now. 
Still, if someone knows a general solution to the UrlEscaping-issue with
Solr-J I'd love to hear it.

Cheers,
Geert-Jan 


Britske wrote:
> 
> I have a custom requesthandler which does some very basic dynamic
> parameter substitution. 
> dynamic params are params which are enclosed in braces ({}). 
> 
> So this means i can do something like this: 
> q={order}...
> 
> where {order} is substituted by the name of an existing order-column. 
> Now this all works well when I supply such a query directly as a URL in
> Firefox / IE. 
> 
> However when I supply a query through SOLR-J I get an "invalid URI"
> exception as SOLR-J automatically URL-encodes the braces and then passes
> this on to Apache HttpClient, which chokes on the URL-encoded URI. 
> 
> Is there any way to pass braces through SOLR-J such that
> the resulting URL is correctly interpreted by HttpClient? 
> 
> Geert-Jan
> 

-- 
View this message in context: 
http://www.nabble.com/Solr-J%3A-automatic-url-escaping-gives-invalid-uri-exception.-How-to-workaround--tf4733909.html#a13537284
Sent from the Solr - User mailing list archive at Nabble.com.
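
Another way to sidestep the escaping problem entirely is to not put the placeholder in q at all, but to pass the value as its own request parameter and let the custom handler read it from the params. A sketch; the handler name and the "order.field" parameter are made up:

import org.apache.solr.client.solrj.SolrQuery;

public class NoBracesSketch {
  // Client side: no braces anywhere in the URL, so nothing for HttpClient to choke on.
  public static SolrQuery build(String orderField) {
    SolrQuery query = new SolrQuery("*:*");
    query.setQueryType("myhandler");       // the custom handler (name made up)
    query.set("order.field", orderField);  // the value that used to be injected as {order}
    return query;
  }
  // Handler side, inside handleRequestBody(req, rsp):
  //   String orderField = req.getParams().get("order.field", "score");
  //   ...substitute orderField wherever {order} was used before...
}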



where to hook in to SOLR to read field-label from functionquery

2007-11-05 Thread Britske

My question sounds strange I know, but I'll try to explain:

Say I have a custom functionquery MinFloatFunction which takes as its
arguments an array of valuesources. 

MinFloatFunction(ValueSource[] sources)

In my case all these valuesources are the values of a collection of fields.
What I need is to get the value and the fieldname of the lowest scoring
field provided in the above array. Obviously the result of the function is
the value of the lowest scoring field, but is there any way to get to the
fieldname of that lowest scoring field? (with or without extending Solr a
little bit)

This not-so-standard need comes from the following:
My index consists of 'products' which can have a lot of variants (up to 5000
per product). Each of these variants can have their own price and a number
of characteristics (the latter of which I don't need to filter, sort or search
by). Moreover, every search should only return any given product 0 or 1
time. (So 2 variants of the same product can never be returned in the same
search). 

For this I designed a schema in which each 'row' in the index represents a
product (independent of variants) (which takes care of the 1 variant max) and
every variant is represented as 2 fields in this row:

variant_p_* <-- represents price (stored / indexed)
variant_source_*  <-- represents the other fields dependent on the
variant (stored / multivalued)

Here for example variant_p_xyz and variant_source_xyz belong together. 

The specific use case now is that sometimes a user would be satisfied with a
range of variants and wants the lowest price over all those variants.
To return for each product the variant with the smallest price alongside its
characteristics I need the name of the lowest scoring field (say,
variant_p_xyz) so that I can give back the contents of variant_source_xyz. 

sure, other routes would be possible and I'm open for suggestions, but at
least the following routes don't work: 

- give back all the fields to the client and let the client get the min over
the fields, etc. --> too much data over the line.
- store the minima of a certain range of variant_p_* values alongside the
corresponding variant_source_* at INDEX-time, when I have all the
variant-fields ready in the client. --> the collections over which I need to
take the minima are not known a priori. 

Any help is highly appreciated! 

Cheers,
Geert-Jan
-- 
View this message in context: 
http://www.nabble.com/where-to-hook-in-to-SOLR-to-read-field-label-from-functionquery-tf4751109.html#a13585389
Sent from the Solr - User mailing list archive at Nabble.com.



Re: where to hook in to SOLR to read field-label from functionquery

2007-11-10 Thread Britske



hossman wrote:
> 
> 
> : Say I have a custom functionquery MinFloatFunction which takes as its
> : arguments an array of valuesources. 
> : 
> : MinFloatFunction(ValueSource[] sources)
> : 
> : In my case all these valuesources are the values of a collection of
> fields.
> 
> a ValueSource isn't required to be field specific (it may already be the 
> mathematical combination of other multiple fields) so there is no generic 
> way to get the "field name" from a ValueSource ... but you could define 
> your MinFloatFunction to only accept FieldCacheSource[] as input ... hmmm, 
> except that FieldCacheSource doesn't expose the field name.  so instead you 
> write...
> 
>   public class MyFieldCacheSource extends FieldCacheSource {
> public MyFieldCacheSource(String field) {
>   super(field);
> }
> public String getField() {
>   return field;
> }
>   }
>   public class MinFloatFunction ... {
> public MinFloatFunction(MyFieldCacheSource[] values);
>   }
> 
Thanks for this. I'm going to look into this a little further. 


hossman wrote:
> 
> 
> : For this I designed a schema in which each 'row' in the index represents
> a
> : product (indepdent of variants) (which takes care of the 1 variant max)
> and
> : every variant is represented as 2 fields in this row:
> : 
> : variant_p_* <-- represents price (stored / indexed)
> : variant_source_*  <-- represents the other fields dependent on
> the
> : variant (stored / multivalued)
> 
> Note: if you have a lot of variants you may wind up with the same problem 
> as described here...
> 
> http://www.nabble.com/sorting-on-dynamic-fields---good%2C-bad%2C-neither--tf4694098.html
> 
> ...because of the underlying FieldCache usage in FieldCacheValueSource
> 
> 
> -Hoss
> 
> 
> 

Hmmm. Thanks for pointing me to that one (I guess ;-). I totally
underestimated the memory requirements of the underlying Lucene FieldCache
implementation. 
Having the option to sort on about 10.000 variantfields with about 400.000
docs will consume about 16 GB max. Definitely not doable in my situation. An
LRU implementation of the Lucene field cache would help big time in this
situation to at least not get OOM errors. Perhaps you know of any
existing implementations? 

Thanks a lot, 
Geert-Jan
-- 
View this message in context: 
http://www.nabble.com/where-to-hook-in-to-SOLR-to-read-field-label-from-functionquery-tf4751109.html#a13682698
Sent from the Solr - User mailing list archive at Nabble.com.
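
Following Hoss's suggestion quoted above, the argmin bookkeeping could look roughly like the sketch below. It is written against the old org.apache.solr.search.function API (getValues(IndexReader), per-document floatVal/strVal); the exact set of DocValues methods differs between versions, and NamedFloatFieldSource is simply FloatFieldSource plus the getField() accessor Hoss mentions, so treat this as a sketch rather than drop-in code:

import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.index.IndexReader;
import org.apache.solr.search.function.DocValues;
import org.apache.solr.search.function.FloatFieldSource;
import org.apache.solr.search.function.ValueSource;

/** A FloatFieldSource that also tells you which field it reads. */
class NamedFloatFieldSource extends FloatFieldSource {
  public NamedFloatFieldSource(String field) { super(field); }
  public String getField() { return field; }
}

/** min(...) over named float fields; strVal(doc) reports the field that held the minimum. */
public class MinFloatFunction extends ValueSource {
  private final NamedFloatFieldSource[] sources;

  public MinFloatFunction(NamedFloatFieldSource[] sources) { this.sources = sources; }

  public DocValues getValues(IndexReader reader) throws IOException {
    final DocValues[] vals = new DocValues[sources.length];
    for (int i = 0; i < sources.length; i++) vals[i] = sources[i].getValues(reader);

    return new DocValues() {
      private int argMin(int doc) {          // index of the lowest-valued source for this doc
        int best = 0;
        for (int i = 1; i < vals.length; i++)
          if (vals[i].floatVal(doc) < vals[best].floatVal(doc)) best = i;
        return best;
      }
      public float  floatVal(int doc)  { return vals[argMin(doc)].floatVal(doc); }
      public int    intVal(int doc)    { return (int) floatVal(doc); }
      public long   longVal(int doc)   { return (long) floatVal(doc); }
      public double doubleVal(int doc) { return floatVal(doc); }
      public String strVal(int doc)    { return sources[argMin(doc)].getField(); }
      public String toString(int doc)  { return "min=" + floatVal(doc) + " from " + strVal(doc); }
    };
  }

  public String description() { return "min over " + sources.length + " variant price fields"; }
  public boolean equals(Object o) {
    return o instanceof MinFloatFunction && Arrays.equals(((MinFloatFunction) o).sources, sources);
  }
  public int hashCode() { return Arrays.hashCode(sources); }
}

Note that the memory caveat Hoss raised applies unchanged: every field read this way ends up in the Lucene FieldCache.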



how to load custom valuesource as plugin

2007-11-14 Thread Britske

I've created a simple valueSource which is supposed to calculate a weighted
sum over a list of supplied valuesources. 

How can I let Solr recognise this valuesource?

I tried to simply upload it as a plugin and reference it by its name (wsum)
in a functionquery, but got an "Unknown function wsum in FunctionQuery". 

Can anybody tell me what I'm missing here? 

Thanks in advance, 
Geert-Jan

-- 
View this message in context: 
http://www.nabble.com/how-to-load-custom-valuesource-as-plugin-tf4807284.html#a13754005
Sent from the Solr - User mailing list archive at Nabble.com.



Re: how to load custom valuesource as plugin

2007-11-14 Thread Britske

How would you then use a custom valueSource? Would extending a
requesthandler and explicitly calling the valuesource in the requesthandler
work? 

Obviously, that's not very flexible, but it might do for now. Is a pluggable
query parser on the agenda? If so, where can I vote ;-)

Geert-Jan 



Yonik Seeley wrote:
> 
> Unfortunately, the function query parser isn't currently pluggable.
> 
> -Yonik
> 
> On Nov 14, 2007 2:02 PM, Britske <[EMAIL PROTECTED]> wrote:
>>
>> I've created a simple valueSource which is supposed to calculate a
>> weighted
>> sum over a list of supplied valuesources.
>>
>> How can I let Solr recognise this valuesource?
>>
>> I tried to simply upload it as a plugin, and reference is by its name
>> (wsum)
>> in a functionquery, but got a "Unknown function wsum in FunctionQuery".
>>
>> Can anybody tell me what I'm missing here?
>>
>> Thanks in advance,
>> Geert-Jan
> 
> 

-- 
View this message in context: 
http://www.nabble.com/how-to-load-custom-valuesource-as-plugin-tf4807284.html#a13755695
Sent from the Solr - User mailing list archive at Nabble.com.
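
Until the function query parser becomes pluggable, one workaround along the lines discussed above is to construct the FunctionQuery programmatically inside a custom request handler. A sketch: WeightedSumFunction stands for the custom ValueSource from the question (not a Solr class), the field names are made up, and the getDocList signature should be checked against the Solr version in use:

import org.apache.lucene.search.Query;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.search.DocList;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.search.function.FloatFieldSource;
import org.apache.solr.search.function.FunctionQuery;
import org.apache.solr.search.function.ValueSource;

public class WsumHandlerSketch {
  // Would be called from a custom handler's handleRequestBody(req, rsp).
  static void rankByWeightedSum(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
    ValueSource[] sources = new ValueSource[] {
        new FloatFieldSource("price"), new FloatFieldSource("rating") };  // example inputs
    float[] weights = new float[] { 0.7f, 0.3f };

    ValueSource wsum = new WeightedSumFunction(sources, weights);  // the custom ValueSource
    Query q = new FunctionQuery(wsum);                             // wrap it so it can score docs

    SolrIndexSearcher searcher = req.getSearcher();
    DocList results = searcher.getDocList(q, (Query) null, null /* sort by score */, 0, 10);
    rsp.add("response", results);
  }
}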



Re: possible to set mincount on facetquery?

2007-12-05 Thread Britske

It seemed handy in the mentioned case where it's not certain if there are
products in each of the budget categories, so you simply ask for them all and
only get back the categories which contain at least 1 product. 

From a functional perspective to me that's kind of on par with doing
facet.mincount=1 (the only difference being that the data just doesn't
happen to be in 1 field). However, the client ideally shouldn't need to
bother with that difference imho.  

Cheers,
Geert-Jan



Erik Hatcher wrote:
> 
> 
> On Dec 5, 2007, at 8:33 AM, Yonik Seeley wrote:
>> On Dec 5, 2007 7:45 AM, Erik Hatcher <[EMAIL PROTECTED]>  
>> wrote:
>>> On Dec 5, 2007, at 5:12 AM, Erik Hatcher wrote:
 In my perusal of the code (SimpleFacets#getFacetQueryCounts), I'm
 not seeing that facet.query respects facet.limit even.  If you
 asked for a count for a query, you get it regardless of any other
 parameters such as mincount or limit.
>>>
>>> sorry, facet.limit doesn't make sense for a facet.query  what I
>>> meant was that it doesn't support mincount.
>>
>> Right... in most cases I don't think mincount really makes sense for a
>> facet.query either.
> 
> I agree with that as well.  If you ask for a count for a query, you  
> get it regardless of whether it is zero or not.  You asked for it,  
> you got it!
> 
>   Erik
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/possible-to-set-mincount-on-facetquery--tf4948462.html#a14176570
Sent from the Solr - User mailing list archive at Nabble.com.
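
For what it's worth, until facet.query supports a mincount the zero-count entries can simply be dropped on the client; a small SolrJ sketch (assumes the facet.query params, e.g. price:[0 TO 50], were set on the request):

import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetQueryMinCount {
  /** Client-side equivalent of facet.mincount=1 for facet.query counts. */
  public static Map<String, Integer> nonEmptyFacetQueries(QueryResponse rsp) {
    Map<String, Integer> result = new LinkedHashMap<String, Integer>();
    Map<String, Integer> counts = rsp.getFacetQuery();  // e.g. {"price:[0 TO 50]" -> 12, ...}
    if (counts != null) {
      for (Map.Entry<String, Integer> e : counts.entrySet()) {
        if (e.getValue() != null && e.getValue() > 0) {
          result.put(e.getKey(), e.getValue());
        }
      }
    }
    return result;
  }
}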



possible to set mincount on facetquery?

2007-12-05 Thread Britske

is it possible to set a mincount on a facetquery as well as on a facetfield?
I have a situation in which I want to group facetqueries (price-ranges) but
I obviously dont want to show ranges with 0 results. 

I tried things like: f.price:[0 TO 50].facet.mincount=1 and f.price:[0 TO
50].query.mincount=1.

Thanks,
Geert-Jan
-- 
View this message in context: 
http://www.nabble.com/possible-to-set-mincount-on-facetquery--tf4948462.html#a14168182
Sent from the Solr - User mailing list archive at Nabble.com.



how to do most efficiently: collapsing facets into top-N results

2007-12-13 Thread Britske

I've subclassed StandardRequestHandler to be able to show top-N results for
some of the facet-values that I'm interested in. The functionality resembles
the solr-236 field collapsing a bit, with the difference that I can
arbitrarily specify which facet-query to collapse and to what extent.
(possibility to specify N independently)

The code for this is now quite simple, but (maybe because of that) I've got
the feeling that it can be optimized quite a bit. The question is how? 

first some explanation and code:

I extended the standardrequesthandler and execute
super.handleRequestBody(req,rsp) to be able to fetch the facetquery results.
From that I copy the facets that I wish to collapse to a NamedList
facet_results and execute code (see below) that basically splits a (possibly
combined) facetquery into independent queries which are added to a FQ-list. 
That FQ-list is appended to the original query and FQ-list and the new query
is executed.

for (int i = 0; i < facetresults.size(); i++)
{
  // split a (possibly combined) facet-query such as "price:[0 TO 50] + idA:123"
  // into its independent parts and turn each part into a filter query
  List fqList = new ArrayList();
  String[] fqsplit = facetresults.getName(i).split("[+]");
  for (int j = 0; j < fqsplit.length; j++)
  {
    Query fqNew = QueryParsing.parseQuery(fqsplit[j].trim(), req.getSchema());
    fqList.add(fqNew);
  }
  fqList.addAll(fqsExisting);   // keep the filters of the original request

  // re-run the base query (same q and sort) restricted to this facet-query
  DocListAndSet resultList = new DocListAndSet();
  SolrIndexSearcher s = req.getSearcher();
  resultList.docList = s.getDocList(query, fqList, sort, start, rows, 0);

  NamedList facetValue = new SimpleOrderedMap();
  facetValue.add("results", resultList.docList);
  facetresults.setVal(i, facetValue);
}

This all works okay, but I'm still thinking that there must be a better way
than executing queries over and over again, for which only the fq's are
different: Q and Sort are the same for the executed queries per facet as for
the already executed overall query.

Obviously doing an intersect on the original result would by far be the
fastest solution but Mike mentioned that this wasn't doable, since the
overall sorted resultlist is not available. see: 
http://www.nabble.com/showing-results-per-facet-value-efficiently-to13133815.html

Is there anything else I can do to speedup the queries? 

For reference, I'm now seeing 15-16 ms for each executed query which is not in
the query-cache.
This seems independent of whether the FQ's are already in the filtercache or
not, which strikes me as odd.

For example see the performance measurements of the collapsed facet-queries below
(which together make up 1 call to Solr). Tested on an unwarmed solr-server. 20.000
docs. intel Core 2 Duo 2ghz. 800 MB Ram assigned to Solr. 

15 : ms for: _ddp_p_dc_dc_2_dc_dc:[0 TO 50]
16 : ms for: _ddp_p_dc_dc_2_dc_dc:[51 TO 100]
16 : ms for: _ddp_p_dc_dc_2_dc_dc:[101 TO 200]
15 : ms for: _ddp_p_dc_dc_2_dc_dc:[201 TO 300]
16 : ms for: idA:2140479
15 : ms for: idA:1456928
16 : ms for: idA:2601889
0 : ms for: _ddp_p_dc_dc_2_dc_dc:[0 TO 50]
0 : ms for: _ddp_p_dc_dc_2_dc_dc:[51 TO 100]
0 : ms for: _ddp_p_dc_dc_2_dc_dc:[101 TO 200]
0 : ms for: _ddp_p_dc_dc_2_dc_dc:[201 TO 300]
15 : ms for: _ddp_p_dc_dc_2_dc_dc:[0 TO 50] + idA:2140479
16 : ms for: _ddp_p_dc_dc_2_dc_dc:[0 TO 50] + idA:1456928
16 : ms for: _ddp_p_dc_dc_2_dc_dc:[0 TO 50] + idA:2601889
15 : ms for: _ddp_p_dc_dc_2_dc_dc:[51 TO 100] + idA:2140479
16 : ms for: _ddp_p_dc_dc_2_dc_dc:[51 TO 100] + idA:1456928
15 : ms for: _ddp_p_dc_dc_2_dc_dc:[51 TO 100] + idA:2601889
16 : ms for: _ddp_p_dc_dc_2_dc_dc:[101 TO 200] + idA:2140479
16 : ms for: _ddp_p_dc_dc_2_dc_dc:[101 TO 200] + idA:1456928
15 : ms for: _ddp_p_dc_dc_2_dc_dc:[101 TO 200] + idA:2601889
16 : ms for: _ddp_p_dc_dc_2_dc_dc:[201 TO 300] + idA:2140479
16 : ms for: _ddp_p_dc_dc_2_dc_dc:[201 TO 300] + idA:1456928
15 : ms for: _ddp_p_dc_dc_2_dc_dc:[201 TO 300] + idA:2601889
 
The strange thing here is that for example the query: 

_ddp_p_dc_dc_2_dc_dc:[0 TO 50] + idA:2140479 

takes 15 ms 
although its independent parts: 
-  _ddp_p_dc_dc_2_dc_dc:[0 TO 50] 
-  idA:2140479

have already been executed (they also take 15/16 ms)

so all FQ's for _ddp_p_dc_dc_2_dc_dc:[0 TO 50] + idA:2140479 must be in the
filter-cache and hence the query must execute quicker than the very first
query: 
_ddp_p_dc_dc_2_dc_dc:[0 TO 50] for which the FQ wasn't in the filter-cache
at that moment.

So to summarize my 2 questions: 
1. is there any way to get better performance for what 'm trying to achieve?
Perhaps a custom hitcollector or something? 
2. do you have any explanation for the fact that the filter-cache doesn't
seem to matter for executing the queries? 

Thanks in advance for making it to the end of this post and for any help you
might give me ;-)

Geert-Jan

-- 
View this message in context: 
http://www.nabble.com/how-do-do-most-efficient%3A-collapsing-facets-into-top-N-results-tp14318577p14318577.html
Sent from the Solr - User mailing list archive at Nabble.com.



how to intersect a doclist with a docset and get a doclist back?

2007-12-14 Thread Britske

Is there a way to get a doclist based on intersecting an existing doclist
with a docset? 

However, doing doclist.intersection(docset) returns a docset. 
Is there something I'm missing here? 

I figured this must be possible since the order of the returned doclist would be
the same as the order of the inserted doclist. 

Thanks,
Geert-Jan
-- 
View this message in context: 
http://www.nabble.com/how-to-intersect-a-doclist-with-a-docset-and-get-a-doclist-back--tp14338755p14338755.html
Sent from the Solr - User mailing list archive at Nabble.com.
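
Since there is no DocList-returning intersection, the usual workaround (the same idea as in the sketch earlier in this archive) is to walk the DocList in its existing order and keep only the ids that are in the DocSet. A minimal sketch against Solr's internal classes:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.search.DocIterator;
import org.apache.solr.search.DocList;
import org.apache.solr.search.DocSet;

public class OrderedIntersection {
  /** Ids of the intersection of docList and docSet, in the order of docList. */
  public static List<Integer> intersect(DocList docList, DocSet docSet) {
    List<Integer> ids = new ArrayList<Integer>();
    for (DocIterator it = docList.iterator(); it.hasNext();) {
      int id = it.nextDoc();
      if (docSet.exists(id)) {
        ids.add(id);  // DocList order is preserved
      }
    }
    return ids;
  }
}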



big perf-difference between solr-server vs. SOlrJ req.process(solrserver)

2007-12-27 Thread Britske

Hi, 

I am using SolrJ to communicate with SOLR. My Solr-queries perform within
range (between 50 ms and 300 ms) by looking at the Solr log as outputted on
my (Windows) commandline. 

However I discovered that the following command at all times takes
significantly longer than the number outputted in the solr-log, (sometimes
about 400% longer):
 
SolrQuery query = /* some query */;
QueryRequest req = new QueryRequest(query);
req.process(solrServer);  // solrServer is an instance of CommonsHttpSolrServer

Of course, this includes the time to parse and transfer the response (about
10-12k) from the solr-server to SolrJ, but I can't imagine that it can all
be attributed to this.

I would really appreciate anyone giving me some pointers as to what would
make up this difference. And/Or whether this difference seems normal or not. 

As a reference, some figures I've seen are: 
30 ms on server 125ms on solrJ
250ms on server 550ms on solrJ. 
at all times the response is between 10-12k which includes facetfields and
facetqueries. (A facetfield has up to a max of about 100 values). 

Thanks in advance, 
Geert-Jan


-- 
View this message in context: 
http://www.nabble.com/big-perf-difference-between-solr-server-vs.--SOlrJ-req.process%28solrserver%29-tp14513964p14513964.html
Sent from the Solr - User mailing list archive at Nabble.com.
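
One quick way to see where the time goes is to compare the wall-clock time around process() with the QTime Solr reports in the response (QTime is the number that shows up in the Solr log). A small SolrJ sketch:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TimingCheck {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery query = new SolrQuery("*:*");

    long start = System.currentTimeMillis();
    QueryResponse rsp = new QueryRequest(query).process(server);
    long clientMs = System.currentTimeMillis() - start;

    // The gap between the two numbers is response writing (stored field retrieval),
    // network transfer and SolrJ parsing.
    System.out.println("client-side: " + clientMs + " ms, server QTime: " + rsp.getQTime() + " ms");
  }
}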



Re: big perf-difference between solr-server vs. SOlrJ req.process(solrserver)

2007-12-27 Thread Britske



Yonik Seeley wrote:
> 
> On Dec 27, 2007 9:45 AM, Britske <[EMAIL PROTECTED]> wrote:
>> I am using SolrJ to communicate with SOLR. My Solr-queries perform within
>> range (between 50 ms and 300 ms) by looking at the solr log as ouputted
>> on
>> my (windows) commandline.
>>
>> However I discovered that the following command at all times takes
>> significantly longer than the number outputted in the solr-log,
>> (sometimes
>> about 400% longer):
> 
> It's probably due to stored field retrieval.
> The time in the response includes everything except the time to write
> the response (since it appears at the beginning).  Writing the
> response involves reading the stored fields of documents (this was
> done to allow one to stream a large number of documents w/o having
> them all in memory).
> 
> SolrJ's parsing of the response should be a relatively small constant
> cost.
> 
> -Yonik
> 
> 

Is it normal to see this much time taken in stored field retrieval? And
where would I start to make sure that it is indeed caused by stored field
retrieval? 

It seems like quite a lot to me, although I have kind of an out-of-the-ordinary
setup with between 2000-4000 stored fields per document. By far the largest
part is taken by various 'product-variants' and their respective prices
(indexed field) and other characteristics (stored only). 
However only about 10 stored fields per document are returned for any
possible query. 

Would the time taken still include iterating the non-returned fields (of
which there are many in my case), or are only the returned fields retrieved
in a map-like implementation? 

Thanks, 
Geert-Jan
-- 
View this message in context: 
http://www.nabble.com/big-perf-difference-between-solr-server-vs.--SOlrJ-req.process%28solrserver%29-tp14513964p14514441.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: big perf-difference between solr-server vs. SOlrJ req.process(solrserver)

2007-12-27 Thread Britske

after inspecting solrconfig.xml I see that I already have enabled lazy field
loading by: 
<enableLazyFieldLoading>true</enableLazyFieldLoading> (I guess it was
enabled by default) 

Since any query returns about 10 fields (which differ from query to query) ,
would this mean that only these 10 of about 2000-4000 fields are retrieved /
loaded? 

Thanks,
Geert-Jan 



Erick Erickson wrote:
> 
> From a Lucene perspective, it's certainly possible to do lazy field
> loading. That is, when loading a document you can determine at
> run time what fields to load, even on a per-document basis. I'm
> not entirely sure how to accomplish this in Solr, but I'd give
> long odds that there's a way.
> 
> I did a writeup on this on the Wiki, see:
> 
> http://wiki.apache.org/lucene-java/FieldSelectorPerformance?highlight=%28fieldselectorperformance%29
> 
> 
> The title is FieldSelectorPerformance if you need to search the Wiki...
> 
> Best
> Erick
> 
> On Dec 27, 2007 10:28 AM, Britske <[EMAIL PROTECTED]> wrote:
> 
>>
>>
>>
>> Yonik Seeley wrote:
>> >
>> > On Dec 27, 2007 9:45 AM, Britske <[EMAIL PROTECTED]> wrote:
>> >> I am using SolrJ to communicate with SOLR. My Solr-queries perform
>> within
>> >> range (between 50 ms and 300 ms) by looking at the solr log as
>> ouputted
>> >> on
>> >> my (windows) commandline.
>> >>
>> >> However I discovered that the following command at all times takes
>> >> significantly longer than the number outputted in the solr-log,
>> >> (sometimes
>> >> about 400% longer):
>> >
>> > It's probably due to stored field retrieval.
>> > The time in the response includes everything except the time to write
>> > the response (since it appears at the beginning).  Writing the
>> > response involves reading the stored fields of documents (this was
>> > done to allow one to stream a large number of documents w/o having
>> > them all in memory).
>> >
>> > SolrJ's parsing of the response should be a relatively small constant
>> > cost.
>> >
>> > -Yonik
>> >
>> >
>>
>> Is it normal to see this much time taken in stored field retrieval? And
>> where would I start to make sure that it is indeed caused by stored field
>> retrieval?
>>
>> It seems like quite a lot to me, although I have kind of an out-of-the-ordinary
>> setup with between 2000-4000 stored fields per document. By far the
>> largest
>> part is taken by various 'product-variants' and their respective prices
>> (indexed field) and other characteristics (stored only).
>> However only about 10 stored fields per document are returned for any
>> possible query.
>>
>> Would the time taken still include iterating the non-returned fields (of
>> which there are many in my case), or are only the returned fields
>> retrieved
>> in a map-like implementation?
>>
>> Thanks,
>> Geert-Jan
>> --
>> View this message in context:
>> http://www.nabble.com/big-perf-difference-between-solr-server-vs.--SOlrJ-req.process%28solrserver%29-tp14513964p14514441.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/big-perf-difference-between-solr-server-vs.--SOlrJ-req.process%28solrserver%29-tp14513964p14514852.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: big perf-difference between solr-server vs. SOlrJ req.process(solrserver)

2007-12-31 Thread Britske

I imagine then that this "scanning-cost" is proportional to the number of
stored fields, correct? 

I tested this with generating a second index with 1/10th of the
product-variants (and thus 1/10th of the stored fields). However I really
don't see the expected (at least by me) drop in post-processing time (which
includes lazy loading the needed fields and scanning all the stored fields).

Moreover, I realized that I'm using an xsl-transform in the post-processing
phase. This would contribute to the high cost I'm seeing as well I think.
Can this XSL-transform in general be considered small in relation to the
abovementioned costs?

Thanks, 
Geert-Jan 


Yonik Seeley wrote:
> 
> On Dec 27, 2007 11:01 AM, Britske <[EMAIL PROTECTED]> wrote:
>> after inspecting solrconfig.xml I see that I already have enabled lazy
>> field
>> loading by:
>> <enableLazyFieldLoading>true</enableLazyFieldLoading> (I guess it was
>> enabled by default)
>>
>> Since any query returns about 10 fields (which differ from query to
>> query) ,
>> would this mean that only these 10 of about 2000-4000 fields are
>> retrieved /
>> loaded?
> 
> Yes, but that's not the whole story.
> Lucene stores all of the fields back-to-back with no index (there is
> no random access to particular stored fields)... so all of the fields
> must be at least scanned.
> 
> -Yonik
> 
> 

-- 
View this message in context: 
http://www.nabble.com/big-perf-difference-between-solr-server-vs.--SOlrJ-req.process%28solrserver%29-tp14513964p14557779.html
Sent from the Solr - User mailing list archive at Nabble.com.



batch indexing takes more time than shown on SOLR output --> something to do with IO?

2008-01-14 Thread Britske

I have a batch program which inserts items into a Solr/Lucene index. 
All is going fine and I get update messages in the console like: 

14-jan-2008 16:40:52 org.apache.solr.update.processor.LogUpdateProcessor
finish
INFO: {add=[10485, 10488, 10489, 10490, 10491, 10495, 10497, 10498, ...(42
more)
]} 0 875

However, when timing this instruction on the client-side (I use SolrJ -->
req.process(server)) I get totally different numbers: in the beginning the
client-side measured time is about 2 seconds on average, but after some time
this goes up to about 30-40 seconds, although the Solr-outputted time
stays between 0.8-1.3 seconds. 

Does this have anything to do with costly IO-activity that is not accounted for
in the SOLR output? If this is true, what tool do you recommend using to
monitor IO-activity?

Thanks, 
Geert-Jan 
-- 
View this message in context: 
http://www.nabble.com/batch-indexing-takes-more-time-than-shown-on-SOLR-output%3E-something-to-do-with-IO--tp14804471p14804471.html
Sent from the Solr - User mailing list archive at Nabble.com.



indexing slow, IO-bound?

2008-04-05 Thread Britske

Hi, 

I have a schema with a lot of (about 10.000) non-stored indexed fields, which
I use for sorting. (no really, that is needed). Moreover I have about 30
stored fields. 

Indexing of these documents takes a long time. Because of the size of the
documents (because of the indexed fields) I am currently batching 50
documents at once, which takes about 2 seconds. Without adding the 10.000
indexed fields to the document, indexing flies at about 15 ms for these 50
documents. Indexing is done using SolrJ.

This is on a intel core 2 6400 @2.13ghz and 2 gb ram. 

To speed this up I let 2 threads do the indexing in parallel. What happens
is that solr just takes double the time (about 4 seconds) to complete these
two jobs of 50 docs each in parallel. I figured because of the multi-core
setup indexing should improve, which it doesn't. 

Does this perhaps indicate that the setup is IO-bound? What would be your
best guess  (given the fact that the schema has a big amount of indexed
fields) to try next to improve indexing performance? 

Geert-Jan
-- 
View this message in context: 
http://www.nabble.com/indexing-slow%2C-IO-bound--tp16513196p16513196.html
Sent from the Solr - User mailing list archive at Nabble.com.
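
For concreteness, the two-thread client setup described above looks roughly like this with SolrJ (URL and field names made up). Whether it helps depends on where the server-side bottleneck is (CPU spent analyzing the many indexed fields vs. disk I/O in the index writer), so it is worth watching both while this runs:

import java.util.ArrayList;
import java.util.Collection;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
  public static void main(String[] args) throws Exception {
    final CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    ExecutorService pool = Executors.newFixedThreadPool(2);   // the two indexing threads

    for (int t = 0; t < 2; t++) {
      pool.submit(new Callable<Object>() {
        public Object call() throws Exception {
          Collection<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
          for (int i = 0; i < 50; i++) {                      // one batch of 50 docs
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", java.util.UUID.randomUUID().toString());
            // ...the indexed sort fields and ~30 stored fields go here...
            batch.add(doc);
          }
          return server.add(batch);                           // one HTTP request per batch
        }
      });
    }
    pool.shutdown();
  }
}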



reusing docset to limit new query

2008-04-16 Thread Britske

I'm creating a custom handler where I have a base query and a resulting
doclistandset. 
I need to do some extra queries to get top-results per facet. There are 2
cases: 

1. the sorting used for the top-results for a particular facet is the same
as the sorting used for the already returned doclistandset. This means that
I can return a docslice of the doclist (contained in the doclistandset)  
after doing some intersections. This is quick and works well.

2.  The sorting is different. In this case I need to do the query again (I
think, please let me know if there's a better option), by using
SolrIndexSearcher.getDocList(...). 

I'm looking for a way to tell the SolrIndexSearcher that it can limit its
query (including sorting) to the docset that I got by 1. (original docset +
some intersections), because I figured it must be quicker (is it?). 

I've found a method SolrIndexSearcher.cacheDocSet(..) but am not entirely
sure what it does (side effects?). 

Can someone please elaborate on this? 

Britske 
-- 
View this message in context: 
http://www.nabble.com/reusing-docset-to-limit-new-query-tp16721670p16721670.html
Sent from the Solr - User mailing list archive at Nabble.com.
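
For case 2 there is a getDocList overload that takes a DocSet as the filter, which is effectively "run this query, with its own sort, restricted to these docs". A sketch; the exact signature should be checked against the SolrIndexSearcher version in use:

import java.io.IOException;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.solr.search.DocList;
import org.apache.solr.search.DocSet;
import org.apache.solr.search.SolrIndexSearcher;

public class ResortWithinDocSet {
  /** Re-run/re-sort the base query, restricted to an already computed DocSet. */
  public static DocList resort(SolrIndexSearcher searcher, Query baseQuery,
                               DocSet allowed, Sort newSort, int rows) throws IOException {
    // 'allowed' is the original docset plus the intersections from case 1; using it
    // as the filter means the differently-sorted result can only contain those docs.
    return searcher.getDocList(baseQuery, allowed, newSort, 0, rows);
  }
}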