Indexing Failed rolled back

2011-01-25 Thread Dinesh

i did some research on the schema and the DIH config file and created my own DIH; i'm
getting this error when i run it:


[DIH status response - the XML element tags were stripped by the mailing list archive. What survives: responseHeader status 0 / QTime 0, config try.xml, command full-import, status idle, Time Elapsed 0:0:0.163, the counters 0 / 1 / 0 / 0 (their element names are lost), and the status messages at 2011-01-25 13:56:48: "Indexing failed. Rolled back all changes." The response ends with the usual note: "This response format is experimental. It is likely to change in the future."]



-
DINESHKUMAR . M
I am neither especially clever nor especially gifted. I am only very, very
curious.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-Failed-rolled-back-tp2327412p2327412.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: MySQL + DIH + SpatialSearch

2011-01-25 Thread Stefan Matheis
Hey Eric,

On Mon, Jan 24, 2011 at 7:23 PM, Eric Angel  wrote:

> * Or you can typecast before you concat:
> > * 
>

Casting before or after concat'ing works both ways - as we've seen two weeks
ago, in a similar thread (
http://search.lucidimagination.com/search/document/250975238eaeb9e0/solr_4_0_spatial_search_how_to#60c4c05c9b482df1
)

But anyway, thanks for pointing out, that it's really a (confirmed)
MySQL-Bug - very annoying :/
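
(For illustration - a minimal sketch of what "typecast before you concat" can look like inside a DIH entity query; the table and column names are hypothetical, not taken from this thread:)

<!-- data-config.xml: build the "lat,lng" string on the MySQL side, casting each
     coordinate to CHAR *before* CONCAT so the result is not hit by the
     decimal-formatting bug discussed above -->
<entity name="place"
        query="SELECT id,
                      CONCAT(CAST(lat AS CHAR), ',', CAST(lng AS CHAR)) AS latlng
               FROM places">
  <field column="latlng" name="location"/>
</entity>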

Regards
Stefan


Re: DIH serialize

2011-01-25 Thread Stefan Matheis
Rich,

i played around for a few minutes with Script-Transformers, but i don't have
enough knowledge to get anything done right now :/
My Idea was: looping over the given row, which should be a Java HashMap or
something like that? and do sth like this (pseudo-code):

var row_data = []
for( var key in row )
{
  row_data.push( '"' + key + '" : "' + row[key] + '"' );
}
row.put( 'whatever_field', '{' + row_data.join( ',' ) + '}' );

Which should result in a json-object like {'key1':'value1', 'key2':'value2'}
- and that should be okay to work with?
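
(An editorial sketch of how such a transformer can be wired up in data-config.xml; the dataSource, entity and field names are placeholders, and the JSON is built naively - values containing quotes are not escaped:)

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/test" user="u" password="p"/>
  <script><![CDATA[
    // DIH passes the current row in as a java.util.Map; return it after adding the new field
    function serializeRow(row) {
      var parts = [];
      var keys = row.keySet().toArray();
      for (var i = 0; i < keys.length; i++) {
        parts.push('"' + keys[i] + '" : "' + row.get(keys[i]) + '"');
      }
      row.put('serialized', '{' + parts.join(',') + '}');
      return row;
    }
  ]]></script>
  <document>
    <entity name="item" query="SELECT * FROM items" transformer="script:serializeRow">
      <field column="serialized" name="serialized"/>
    </entity>
  </document>
</dataConfig>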

Regards
Stefan

On Mon, Jan 24, 2011 at 7:53 PM, Papp Richard  wrote:

> Hi Stefan,
>
>  yes, this is exactly what I intend - I don't want to search in this field
> - just quickly return me the result in a serialized form (the search
> criteria
> is on other fields). Well, if I could serialize the data exactly as like
> the
> PHP serialize() does I would be maximally satisfied, but any other form in
> which I could compact the data easily into one field I would be pleased.
>  Can anyone help me? I guess the  is quite a good way, but I don't
> know which function should I use there to compact the data to be easily
> usable in PHP. Or any other method?
>
> thanks,
>  Rich
>
> -Original Message-
> From: Stefan Matheis [mailto:matheis.ste...@googlemail.com]
> Sent: Monday, January 24, 2011 18:23
> To: solr-user@lucene.apache.org
> Subject: Re: DIH serialize
>
> Hi Rich,
>
> i'm a bit confused after reading your post .. what exactly you wanna try to
> achieve? Serializing (like http://php.net/serialize) your complete row
> into
> one field? Don't wanna search in them, just store and deliver them in your
> results? Does that make sense? Sounds a bit strange :)
>
> Regards
> Stefan
>
> On Mon, Jan 24, 2011 at 10:03 AM, Papp Richard  wrote:
>
> > Hi Dennis,
> >
> >  thank you for your answer, but didn't understand why you say it doesn't
> > need serialization. I'm with the option "C".
> >  but the main question is, how to put into one field a result of many
> > fields: "SELECT * FROM".
> >
> > thanks,
> >  Rich
> >
> > -Original Message-
> > From: Dennis Gearon [mailto:gear...@sbcglobal.net]
> > Sent: Monday, January 24, 2011 02:07
> > To: solr-user@lucene.apache.org
> > Subject: Re: DIH serialize
> >
> > Depends on your process chain to the eventual viewer/consumer of the
> data.
> >
> > The questions to ask are:
> >
> >  A/ Is the data IN Solr going to be viewed or processed in its original form?
> >   -->set stored="true"
> >   -->no serialization needed.
> >
> >  B/ Is it going to be analyzed and searched for separately from any other field?
> >   The analyzing will put it into an unreadable form. If you need to see it, then
> >   -->set indexed="true" and stored="true"
> >   -->no serialization needed.
> >
> >  C/ Is it NOT going to be viewed AS IS, and not going to be searched for AS IS
> >   (i.e. other columns will be how the data is found), and you have another,
> >   serializable format?
> >   -->set indexed="false" and stored="true"
> >   -->serialize AS PER THE INTENDED APPLICATION,
> >   not sure that Solr can do that at all.
> >
> >  D/ Is it NOT going to be viewed AS IS, BUT it is going to be searched for AS IS
> >   (this column will be how the data is found), and you have another,
> >   serializable format?
> >   -->you need to put it into TWO columns
> >   -->A SERIALIZED FIELD
> >   -->set indexed="false" and stored="true"
> >   -->AN UNSERIALIZED FIELD
> >   -->set indexed="false" and stored="true"
> >   -->serialize AS PER THE INTENDED APPLICATION,
> >   not sure that Solr can do that at all.
> >
> > Hope that helps!
> >
> >
> > Dennis Gearon
> >
> >
> > Signature Warning
> > 
> > It is always a good idea to learn from your own mistakes. It is usually a
> > better
> > idea to learn from others' mistakes, so you do not have to make them
> > yourself.
> > from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
> >
> >
> > EARTH has a Right To Life,
> > otherwise we all die.
> >
> >
> >
> > - Original Message 
> > From: Papp Richard 
> > To: solr-user@lucene.apache.org
> > Sent: Sun, January 23, 2011 

Re: synonyms file, and example cases

2011-01-25 Thread Stefan Matheis
Cam,

the examples with the provided inline-documentation should help you, no?
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory

The Backslash \ in that context looks like an Escaping-Character, to avoid
the => to be interpreted as "assign-command"
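
(To make the escaping concrete: the entry a\=>a => b\=>b maps the literal token "a=>a" to "b=>b", and a\,a => b\,b maps "a,a" to "b,b". The file itself is wired in via the filter declaration in schema.xml - a minimal sketch, where the file name and flags are the usual defaults rather than anything from this thread:)

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"/>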

Regards
Stefan

On Tue, Jan 25, 2011 at 2:31 AM, Cam Bazz  wrote:

> Hello,
>
> I have been looking at the solr synonym file that was an example, I
> did not understand some notation:
>
> aaa => 
>
> bbb => 1 2
>
> ccc => 1,2
>
> a\=>a => b\=>b
>
> a\,a => b\,b
>
> fooaaa,baraaa,bazaaa
>
> The first one says search for  when query is aaa. am I correct?
> the second one finds "1 2" when query is bbb
> the third one is find 1 or 2 when query is ccc
>
> the fourth, and fifth one I have not understood.
>
> the last one, i assume is a group, bidirectional mapping between
> fooaaa,baraaa,bazaaa
>
> I am especially interested with this last one, if I do aaa,bbb it will
> find aaa and bbb when either aaa or bbb is queryied?
>
> am I correct in those assumptions?
>
> Best regards,
> C.B.
>


Performance optimization of Proximity/Wildcard searches

2011-01-25 Thread Salman Akram
Hi,

I am facing performance issues in three types of queries (and their
combination). Some of the queries take more than 2-3 mins. Index size is
around 150GB.


   - Wildcard
   - Proximity
   - Phrases (with common words)

I know CommonGrams and Stop words are a good way to resolve such issues but
they don't fulfill our functional requirements (Common Grams seem to have
issues with phrase proximity, stop words have issues with exact match etc).
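
(For readers who haven't used it: CommonGrams glues frequent words to their neighbours at index time so phrase queries over common words touch far fewer positions. A typical, purely illustrative analyzer pair - the field type name and word list are placeholders:)

<fieldType name="text_cg" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>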

Sharding is an option too but that too comes with limitations so want to
keep that as a last resort but I think there must be other things coz 150GB
is not too big for one drive/server with 32GB Ram.

Cache warming is a good option too but the index get updated every hour so
not sure how much would that help.

What are the other main tips that can help in performance optimization of
the above queries?

Thanks

-- 
Regards,

Salman Akram


Re: please help >>Problem with dataImportHandler

2011-01-25 Thread Stefan Matheis
Caused by: org.xml.sax.SAXParseException: Element type "field" must be
followed by either attribute specifications, ">" or "/>".

Sounds like invalid XML in your .. dataimport-config?
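
(The usual culprit is a field element that is never closed - an illustrative fragment only, the column/name values are placeholders:)

<!-- broken: the parser hits the next '<' before this element is finished -->
<field column="DHCPMESSAGE" name="message"
<!-- fixed: self-close the element -->
<field column="DHCPMESSAGE" name="message"/>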

On Tue, Jan 25, 2011 at 5:41 AM, Dinesh wrote:

>
> http://pastebin.com/tjCs5dHm
>
> this is the log produced by the solr server
>
> -
> DINESHKUMAR . M
> I am neither especially clever nor especially gifted. I am only very, very
> curious.
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/please-help-Problem-with-dataImportHandler-tp2318585p2326659.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: please help >>Problem with dataImportHandler

2011-01-25 Thread Dinesh

ya, even after correcting it, it is still throwing an exception

-
DINESHKUMAR . M
I am neither especially clever nor especially gifted. I am only very, very
curious.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/please-help-Problem-with-dataImportHandler-tp2318585p2327662.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Getting started with writing parser

2011-01-25 Thread Gora Mohanty
On Tue, Jan 25, 2011 at 10:05 AM, Dinesh  wrote:
>
> http://pastebin.com/CkxrEh6h
>
> this is my sample log
[...]

And, which portions of the log text do you want to preserve?
Does it go into Solr as a single error message, or do you want
to separate out parts of it.

Regards,
Gora


Re: Getting started with writing parser

2011-01-25 Thread Dinesh

i want to take the month, time, DHCPMESSAGE, from_mac, gateway_ip, net_ADDR

-
DINESHKUMAR . M
I am neither especially clever nor especially gifted. I am only very, very
curious.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Getting-started-with-writing-parser-tp2278092p2327738.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: please help >>Problem with dataImportHandler

2011-01-25 Thread Dinesh

http://lucene.472066.n3.nabble.com/Getting-started-with-writing-parser-tp2278092p2327738.html

this thread explains my problem

-
DINESHKUMAR . M
I am neither especially clever nor especially gifted. I am only very, very
curious.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/please-help-Problem-with-dataImportHandler-tp2318585p2327745.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Getting started with writing parser

2011-01-25 Thread Gora Mohanty
On Tue, Jan 25, 2011 at 11:44 AM, Dinesh  wrote:
>
> i don't even know whether the regex expression that i'm using for my log is
> correct or no..

If it is the same try.xml that you posted earlier, it is very likely not
going to work. You seem to have just cut and pasted entries from
the Hathi Trust blog, without understanding how they work.

Could you take a fresh look at http://wiki.apache.org/solr/DataImportHandler
and explain in words the following:
* What is your directory structure for storing the log files?
* What parts of the log file do you want to keep (you have already explained
  this in another message)?
* How would the above translate into:
  - A Solr schema
  - Setting up (a) a data source, (b) processor(s), and (c) transformers.

>i very much worried i couldn't proceed in my 
> project already
> 1/3 rd of the timing is over.. please help.. this is just the first stage..
> after this i have ti setup up all the log to be redirected to SYSLOG and
> from there i'll send it to SOLR server.. then i have to analyse all the
> data's that i obtained from DNS, DHCP, WIFI, SWITCES.. and i have to prepare
> a user based report on his actions.. please help me cause the day's i have
> keeps reducing.. my project leader is questioning me a lot.. pls..
[...]

Well, I am sorry, but at least I strongly feel that we should
not be doing your work for you, and especially not if it is a
student project, as seems to be the case.

If you can address the above points one by one (stay on
this thread, please), people should be able to help you.
However, it is up to you to get to understand Solr well
enough.

Regards,
Gora


Re: Getting started with writing parser

2011-01-25 Thread Dinesh

no i actually changed the directory to mine where i stored the log files.. it
is /home/exam/apa..solr/example/exampledocs

i specified it in a solr schema.. i created an DataImportHandler for that in
try.xml.. then in that i changed that file name to sample.txt

that new try.xml is
http://pastebin.com/pfVVA7Hs

i changed the log into one word per line thinking there might be error in my
regex expression.. now i'm completely stuck..

-
DINESHKUMAR . M
I am neither especially clever nor especially gifted. I am only very, very
curious.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Getting-started-with-writing-parser-tp2278092p2327920.html
Sent from the Solr - User mailing list archive at Nabble.com.


Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Gary Taylor

Hi,

I posted a question in November last year about indexing content from 
multiple binary files into a single Solr document and Jayendra responded 
with a simple solution to zip them up and send that single file to Solr.


I understand that the Tika 0.4 JARs supplied with Solr 1.4.1 don't 
currently allow this to work and only the file names of the zipped files 
are indexed (and not their contents).


I've tried downloading and building the latest Tika (0.8) and replacing 
the tika-parsers and tika-core JARS in 
\contrib\extraction\lib but this still isn't indexing the 
file contents, and now doesn't even index the file names!


Is there a version of Tika that works with the Solr 1.4.1 released 
distribution which does index the contents of the zipped files?


Thanks and kind regards,
Gary



DIH From various File system locations

2011-01-25 Thread pankaj bhatt
Hi All,
 I need to index the documents presents in my file system at various
locations (e.g. C:\docs , d:\docs ).
Is there any way through which i can specify this in my DIH
Configuration.
Here is my configuration:-

[data-config.xml stripped by the mailing list archive. The copies quoted in the replies below preserve the essentials: a FileListEntityProcessor entity (baseDir="G:\\Desktop\\", fileName="docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$", recursive="false", rootEntity="true", transformer="DateFormatTransformer", onerror="continue") wrapping a TikaEntityProcessor entity with url="${sd.fileAbsolutePath}", format="text", dataSource="bin".]
/ Pankaj Bhatt.
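
(An editorial sketch, not from this thread: DIH has no multi-valued baseDir, but you can declare one FileListEntityProcessor root entity per location inside the same document element. Entity names, the dataSource and the field mapping below are placeholders:)

<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- one root entity per directory to crawl -->
    <entity name="docsC" processor="FileListEntityProcessor" baseDir="C:\docs"
            fileName="docx$|doc$|pdf$|xls$|xlsx$|html$|rtf$|txt$|zip$"
            recursive="false" rootEntity="true" onerror="continue">
      <entity name="tikaC" processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
              url="${docsC.fileAbsolutePath}" format="text" dataSource="bin">
        <field column="text" name="text"/>
      </entity>
    </entity>
    <entity name="docsD" processor="FileListEntityProcessor" baseDir="D:\docs"
            fileName="docx$|doc$|pdf$|xls$|xlsx$|html$|rtf$|txt$|zip$"
            recursive="false" rootEntity="true" onerror="continue">
      <entity name="tikaD" processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
              url="${docsD.fileAbsolutePath}" format="text" dataSource="bin">
        <field column="text" name="text"/>
      </entity>
    </entity>
  </document>
</dataConfig>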


Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Erlend Garåsen


There seems to be a bug with the current 1.4.1 release. You cannot 
extract any content at all, regardless of content type.


Try to get a fresh version from the SVN repository. I did that earlier 
today and can verify that Tika now will extract the content. I'm not 
sure about zip files.


Tika version 0.8 is not included in the latest release/trunk from SVN.

Erlend

On 25.01.11 11.19, Gary Taylor wrote:

Hi,

I posted a question in November last year about indexing content from
multiple binary files into a single Solr document and Jayendra responded
with a simple solution to zip them up and send that single file to Solr.

I understand that the Tika 0.4 JARs supplied with Solr 1.4.1 don't
currently allow this to work and only the file names of the zipped files
are indexed (and not their contents).

I've tried downloading and building the latest Tika (0.8) and replacing
the tika-parsers and tika-core JARS in
\contrib\extraction\lib but this still isn't indexing the
file contents, and now doesn't even index the file names!

Is there a version of Tika that works with the Solr 1.4.1 released
distribution which does index the contents of the zipped files?

Thanks and kind regards,
Gary




--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: Performance optimization of Proximity/Wildcard searches

2011-01-25 Thread Toke Eskildsen
On Tue, 2011-01-25 at 10:20 +0100, Salman Akram wrote:
> Cache warming is a good option too but the index get updated every hour so
> not sure how much would that help.

What is the time difference between queries with a warmed index and a
cold one? If the warmed index performs satisfactory, then one answer is
to upgrade your underlying storage. As always for IO-caused performance
problem in Lucene/Solr-land, SSD is the answer.



Recommendation on RAM-/Cache configuration

2011-01-25 Thread Martin Grotzke
Hi,

recently we're experiencing OOMEs (GC overhead limit exceeded) in our
searches. Therefore I want to get some clarification on heap and cache
configuration.

This is the situation:
- Solr 1.4.1 running on tomcat 6, Sun JVM 1.6.0_13 64bit
- JVM Heap Params: -Xmx8G -XX:MaxPermSize=256m -XX:NewSize=2G
-XX:MaxNewSize=2G -XX:SurvivorRatio=6 -XX:+UseParallelOldGC
-XX:+UseParallelGC
- The machine has 32 GB RAM
- Currently there are 4 processors/cores in the machine, this shall be
changed to 2 cores in the future.
- The index size in the filesystem is ~9.5 GB
- The index contains ~ 5.500.000 documents
- 1.500.000 of those docs are available for searches/queries, the rest are
inactive docs that are excluded from searches (via a flag/field), but
they're still stored in the index as need to be available by id (solr is the
main document store in this app)
- Caches are configured with a big size (the idea was to prevent filesystem
access / disk i/o as much as possible):
  - filterCache (solr.LRUCache): size=20, initialSize=3,
autowarmCount=1000, actual size =~ 60.000, hitratio =~ 0.99
  - documentCache (solr.LRUCache): size=20, initialSize=10,
autowarmCount=0, actual size =~ 160.000 - 190.000, hitratio =~ 0.74
  - queryResultCache (solr.LRUCache): size=20, initialSize=3,
autowarmCount=1, actual size =~ 10.000 - 60.000, hitratio =~ 0.71
- Searches are performed using a catchall text field using standard request
handler, all fields are fetched (no fl specified)
- Normally ~ 5 concurrent requests, peaks up to 30 or 40 (mostly during GC)
- Recently we also added a feature that adds weighted search for special
fields, so that the query might become s.th. like this
  q=(some query) OR name_weighted:(some query)^2.0 OR brand_weighted:(some
query)^4.0 OR longDescription_weighted:(some query)^0.5
  (it seemed as if this was the cause of the OOMEs, but IMHO it only
increased RAM usage so that now GC could not free enough RAM)

The OOMEs that we get are of type "GC overhead limit exceeded", one of the
OOMEs was thrown during auto-warming.

I checked two different heapdumps, the first one autogenerated
(by -XX:+HeapDumpOnOutOfMemoryError) the second one generated manually via
jmap.
These show the following distribution of used memory - the autogenerated
dump:
 - documentCache: 56% (size ~ 195.000)
- filterCache: 15% (size ~ 60.000)
- queryResultCache: 8% (size ~ 61.000)
- fieldCache: 6% (fieldCache referenced  by WebappClassLoader)
- SolrIndexSearcher: 2%

The manually generated dump:
- documentCache: 48% (size ~ 195.000)
- filterCache: 20% (size ~ 60.000)
- fieldCache: 11% (fieldCache referenced by WebappClassLoader)
- queryResultCache: 7% (size ~ 61.000)
- fieldValueCache: 3%

We are also running two search engines with 17GB heap, these don't run into
OOMEs. Though, with these bigger heap sizes the longest requests are even
longer due to longer stop-the-world gc cycles.
Therefore my goal is to run with a smaller heap, IMHO even smaller than 8GB
would be good to reduce the time needed for full gc.

So what's the right path to follow now? What would you recommend to change
on the configuration (solr/jvm)?

Would you say it is ok to reduce the cache sizes? Would this increase disk
i/o, or would the index be hold in the OS's disk cache?

Do have other recommendations to follow / questions?

Thanx && cheers,
Martin


Re: Specifying an AnalyzerFactory in the schema

2011-01-25 Thread Renaud Delbru

Hi Chris,

On 24/01/11 21:18, Chris Hostetter wrote:

: I notice that in the schema, it is only possible to specify a Analyzer class,
: but not a Factory class as for the other elements (Tokenizer, Fitler, etc.).
: This limits the use of this feature, as it is impossible to specify parameters
: for the Analyzer.
: I have looked at the IndexSchema implementation, and I think this requires a
: simple fix. Do I open an issue about it ?

Support for constructing Analyzers directly is very crude, and primarily
existed for making it easy for people with old indexes and analyzers to
keep working.

moving forward, Lucene/Solr eventually won't "ship" concrete Analyzer
implementations at all (at least, that's the last consensus i remember) so
enhancing support for loading Analyzers (or AnalyzerFactories) doesn't
make much sense.

Practically speaking, if you have an existing Analyzer that you want to
use in Solr, instead of writing an "AnalyzerFactory" for it, you could
just write a "TokenizerFactory" that wraps it instead -- functinally that
would let you achieve everything an AnalyzerFactory would, except that
Solr would already handle letting the schema.xml specify the
positionIncrementGap (which you could happily ignore if you wanted).

Thanks for the trick, I haven't thought about doing that. This should
work indeed.


cheers
--
Renaud Delbru


Use terracotta bigmemory for solr-caches

2011-01-25 Thread Martin Grotzke
Hi,

as the biggest parts of our jvm heap are used by solr caches I asked myself
if it wouldn't make sense to run solr caches backed by terracotta's
bigmemory (http://www.terracotta.org/bigmemory).
The goal is to reduce the time needed for full / stop-the-world GC cycles,
as with our 8GB heap the longest requests take up to several minutes.

What do you think?

Cheers,
Martin


Re: Performance optimization of Proximity/Wildcard searches

2011-01-25 Thread Salman Akram
By warmed index you only mean warming the SOLR cache or OS cache? As I said
our index is updated every hour so I am not sure how much SOLR cache would
be helpful but OS cache should still be helpful, right?

I haven't compared the results with a proper script but from manual testing
here are some of the observations.

'Recent' queries which are in cache of course return immediately (only if
they are exactly same - even if they took 3-4 mins first time). I will need
to test how many recent queries stay in cache but still this would work only
for very common queries. User can run different queries and I want at least
them to be at 'acceptable' level (5-10 secs) even if not very fast.

Our warm up script currently executes all distinct queries in our logs
having count > 5. It was run yesterday (with all the indexing update every
hour after that) and today when I executed some of the same queries again
their time seemed a little less (around 15-20%), I am not sure if this means
anything. However, still their time is not acceptable.

What do you think is the best way to compare results? First run all the warm
up queries and then execute same randomly and compare?

We are using Windows server, would it make a big difference if we move to
Linux? Our load is not high but some queries are really complex.

Also I was hoping to move to SSD in last after trying out all software
options. Is that an agreed fact that on large indexes (which don't fit in
RAM) proximity/wildcard/phrase queries (on common words) would be slow and
it can be only improved by cache warm up and better hardware? Otherwise with
an index of around 150GB such queries will take more than a min?

If that's the case I know this question is very subjective but if a single
query takes 2 min on SAS 10K RPM what would its approx time be on a good SSD
(everything else same)?

Thanks!


On Tue, Jan 25, 2011 at 3:44 PM, Toke Eskildsen wrote:

> On Tue, 2011-01-25 at 10:20 +0100, Salman Akram wrote:
> > Cache warming is a good option too but the index get updated every hour
> so
> > not sure how much would that help.
>
> What is the time difference between queries with a warmed index and a
> cold one? If the warmed index performs satisfactory, then one answer is
> to upgrade your underlying storage. As always for IO-caused performance
> problem in Lucene/Solr-land, SSD is the answer.
>
>


-- 
Regards,

Salman Akram


Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene

2011-01-25 Thread Markus Jelsma
Hi,

Are you sure you need CMS incremental mode? It's only advised when running on 
a machine with one or two processors. If you have more you should consider 
disabling the incremental flags.

Cheers,

On Monday 24 January 2011 19:32:38 Simon Wistow wrote:
> We have two slaves replicating off one master every 2 minutes.
> 
> Both using the CMS + ParNew Garbage collector. Specifically
> 
> -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
> -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing
> 
> but periodically they both get into a GC storm and just keel over.
> 
> Looking through the GC logs the amount of memory reclaimed in each GC
> run gets less and less until we get a concurrent mode failure and then
> Solr effectively dies.
> 
> Is it possible there's a memory leak? I note that later versions of
> Lucene have fixed a few leaks. Our current versions are relatively old
> 
>   Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17
> 18:06:42
> 
>   Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55
> 
> so I'm wondering if upgrading to later version of Lucene might help (of
> course it might not but I'm trying to investigate all options at this
> point). If so what's the best way to go about this? Can I just grab the
> Lucene jars and drop them somewhere (or unpack and then repack the solr
> war file?). Or should I use a nightly solr 1.4?
> 
> Or am I barking up completely the wrong tree? I'm trawling through heap
> logs and gc logs at the moment trying to to see what other tuning I can
> do but any other hints, tips, tricks or cluebats gratefully received.
> Even if it's just "Yeah, we had that problem and we added more slaves
> and periodically restarted them"
> 
> thanks,
> 
> Simon

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Weird behaviour with phrase queries

2011-01-25 Thread Erick Erickson
Frankly, this puzzles me. It *looks* like it should be OK. One warning, the
analysis page sometimes is a bit misleading, so beware of that.

But the output of your queries make it look like the query is parsing as you
expect, which leaves the question of whether your index contains what
you think it does. You might get a copy of Luke, which allows you to examine
what's actually in your index instead of what you think is in there.
Sometimes
there are surprises here!

I didn't mean to re-index your whole corpus, I was thinking that you could
just index a few documents in a test index so you have something small to
look at.

Sorry I can't spot what's happening right away.

Good luck!
Erick

On Tue, Jan 25, 2011 at 2:45 AM, Jerome Renard wrote:

> Erick,
>
> On Mon, Jan 24, 2011 at 9:57 PM, Erick Erickson 
> wrote:
>
>> Hmmm, I don't see any screen shots. Several things:
>> 1> If your stopword file has comments, I'm not sure what the effect would
>> be.
>>
>
> Ha, I thought comments were supported in stopwords.txt
>
>
>> 2> Something's not right here, or I'm being fooled again. Your withresults
>> xml has this line:
>> +DisjunctionMaxQuery((meta_text:"ecol d
>> ingenieur")~0.01) ()
>> and your noresults has this line:
>> +DisjunctionMaxQuery((meta_text:"academi
>> charpenti")~0.01) DisjunctionMaxQuery((meta_text:"academi
>> charpenti"~100)~0.01)
>>
>> the empty () in the first one often means you're NOT going to your
>> configured dismax parser in solrconfig.xml. Yet that doesn't square with
>> your custom qt, so I'm puzzled.
>>
>> Could we see your raw query string on the way in? It's almost as if you
>> defined qt in one and defType in the other, which are not equivalent.
>>
>
> You are right I fixed this problem (my bad).
>
> 3> It may take 12 hours to index, but you could experiment with a smaller
>> subset. You say you know that the noresults one should return documents,
>> what proof do
>> you have? If there's a single document that you know should match this,
>> just
>> index it and a few others and you should be able to make many runs until
>> you
>> get
>> to the bottom of this...
>>
>>
> I could but I always thought I had to fully re-index after updating
> schema.xml. If
> I update only few documents will that take the changes into account without
> breaking
> the rest ?
>
>
>> And obviously your stemming is happening on the query, are you sure it's
>> happening at index time too?
>>
>>
> Since you did not get the screenshots you will find attached the full
> output of the analysis
> for a phrase that works and for another that does not.
>
> Thanks for your support
>
> Best Regards,
>
> --
> Jérôme
>


Re: Recommendation on RAM-/Cache configuration

2011-01-25 Thread Markus Jelsma
On Tuesday 25 January 2011 11:54:55 Martin Grotzke wrote:
> Hi,
> 
> recently we're experiencing OOMEs (GC overhead limit exceeded) in our
> searches. Therefore I want to get some clarification on heap and cache
> configuration.
> 
> This is the situation:
> - Solr 1.4.1 running on tomcat 6, Sun JVM 1.6.0_13 64bit
> - JVM Heap Params: -Xmx8G -XX:MaxPermSize=256m -XX:NewSize=2G
> -XX:MaxNewSize=2G -XX:SurvivorRatio=6 -XX:+UseParallelOldGC
> -XX:+UseParallelGC

Consider switching to HotSpot JVM, use the -server as the first switch.

> - The machine has 32 GB RAM
> - Currently there are 4 processors/cores in the machine, this shall be
> changed to 2 cores in the future.
> - The index size in the filesystem is ~9.5 GB
> - The index contains ~ 5.500.000 documents
> - 1.500.000 of those docs are available for searches/queries, the rest are
> inactive docs that are excluded from searches (via a flag/field), but
> they're still stored in the index as need to be available by id (solr is
> the main document store in this app)

How do you exclude them? It should use filter queries. I also remember (but i 
just cannot find it back so please correct me if i'm wrong) that in 1.4.x 
sorting is done before filtering. It should be an improvement if filtering is 
done before sorting.
If you use sorting, it takes up a huge amount of RAM if filtering is not done 
first.

> - Caches are configured with a big size (the idea was to prevent filesystem
> access / disk i/o as much as possible):

There is only disk I/O if the kernel can't keep the index (or parts) in its 
page cache.

>   - filterCache (solr.LRUCache): size=20, initialSize=3,
> autowarmCount=1000, actual size =~ 60.000, hitratio =~ 0.99
>   - documentCache (solr.LRUCache): size=20, initialSize=10,
> autowarmCount=0, actual size =~ 160.000 - 190.000, hitratio =~ 0.74
>   - queryResultCache (solr.LRUCache): size=20, initialSize=3,
> autowarmCount=1, actual size =~ 10.000 - 60.000, hitratio =~ 0.71

You should decrease the initialSize values. But your hitratio's seem very 
nice.
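
(For reference, the three caches under discussion are declared in solrconfig.xml roughly like this; the numbers are purely illustrative, not a recommendation for this index. Note that initialSize only pre-sizes the backing map - it is size that bounds how many entries, and therefore how much memory, the cache can pin:)

<filterCache      class="solr.LRUCache" size="16384" initialSize="512" autowarmCount="256"/>
<queryResultCache class="solr.LRUCache" size="16384" initialSize="512" autowarmCount="256"/>
<documentCache    class="solr.LRUCache" size="16384" initialSize="512" autowarmCount="0"/>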

> - Searches are performed using a catchall text field using standard request
> handler, all fields are fetched (no fl specified)
> - Normally ~ 5 concurrent requests, peaks up to 30 or 40 (mostly during GC)
> - Recently we also added a feature that adds weighted search for special
> fields, so that the query might become s.th. like this
>   q=(some query) OR name_weighted:(some query)^2.0 OR brand_weighted:(some
> query)^4.0 OR longDescription_weighted:(some query)^0.5
>   (it seemed as if this was the cause of the OOMEs, but IMHO it only
> increased RAM usage so that now GC could not free enough RAM)
> 
> The OOMEs that we get are of type "GC overhead limit exceeded", one of the
> OOMEs was thrown during auto-warming.

Warming takes additional RAM. The current searcher still has its caches full 
and newSearcher is getting filled up. Decreasing sizes might help.

> 
> I checked two different heapdumps, the first one autogenerated
> (by -XX:+HeapDumpOnOutOfMemoryError) the second one generated manually via
> jmap.
> These show the following distribution of used memory - the autogenerated
> dump:
>  - documentCache: 56% (size ~ 195.000)
> - filterCache: 15% (size ~ 60.000)
> - queryResultCache: 8% (size ~ 61.000)
> - fieldCache: 6% (fieldCache referenced  by WebappClassLoader)
> - SolrIndexSearcher: 2%
> 
> The manually generated dump:
> - documentCache: 48% (size ~ 195.000)
> - filterCache: 20% (size ~ 60.000)
> - fieldCache: 11% (fieldCache referenced by WebappClassLoader)
> - queryResultCache: 7% (size ~ 61.000)
> - fieldValueCache: 3%
> 
> We are also running two search engines with 17GB heap, these don't run into
> OOMEs. Though, with these bigger heap sizes the longest requests are even
> longer due to longer stop-the-world gc cycles.
> Therefore my goal is to run with a smaller heap, IMHO even smaller than 8GB
> would be good to reduce the time needed for full gc.
> 
> So what's the right path to follow now? What would you recommend to change
> on the configuration (solr/jvm)?

Try tuning the GC
http://java.sun.com/performance/reference/whitepapers/tuning.html
http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html

> 
> Would you say it is ok to reduce the cache sizes? Would this increase disk
> i/o, or would the index be hold in the OS's disk cache?

Yes! If you also allocate less RAM to the JVM then there is more for the OS to 
cache.

> 
> Do have other recommendations to follow / questions?
> 
> Thanx && cheers,
> Martin

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Adding weightage to the facets count

2011-01-25 Thread Johannes Goll
Hi Siva,

try using the Solr Stats Component
http://wiki.apache.org/solr/StatsComponent

similar to
select/?&q=*:*&stats=true&stats.field={your-weight-field}&stats.facet={your-facet-field}

and get the sum field from the response. You may need to resort the weighted
facet counts to get a descending list of facet counts.

Note, there is a bug for using the Stats Component with multi-valued facet
fields.

For details see
https://issues.apache.org/jira/browse/SOLR-1782

Johannes

2011/1/24 Chris Hostetter 

>
> : prod1 has tag called “Light Weight” with weightage 20,
> : prod2 has tag called “Light Weight” with weightage 100,
> :
> : If i get facet for “Light Weight” , i will get Light Weight (2) ,
> : here i need to consider the weightage in to account, and the result will
> be
> : Light Weight (120)
> :
> : How can we achieve this?Any ideas are really helpful.
>
>
> It's not really possible with Solr out of the box.  Faceting is fast and
> efficient in Solr because it's all done using set intersections (and most
> of the sets can be kept in ram very compactly and reused).  For what you
> are describing you'd need to not only associate a weighted payload with
> every TermPosition, but also factor that weight in when doing the
> faceting, which means efficient set operations are now out the window.
>
> If you know java it would probably be possible to write a custom
> SolrPlugin (a SearchComponent) to do this type of faceting in special
> cases (assuming you indexed in a particular way) but i'm not sure off the
> top of my head how well it would scale -- the basic algo i'm thinking of
> is (after indexing each facet term with a weight payload) to iterate over
> the DocSet of all matching documents in parallel with an iteration over
> a TermPositions, skipping ahead to only the docs that match the query, and
> recording the sum of the payloads for each term.
>
> Hmmm...
>
> except TermPositions iterates over >> tuples,
> so you would have to iterate over every term, and for every term then loop
> over all matching docs ... like i said, not sure how efficient it would
> wind up being.
>
> You might be happier all around if you just do some sampling -- store the
> tag+weight pairs so that they can be retrieved with each doc, and then
> when you get your top facet constraints back, look at the first page of
> results, and figure out what the sum "weight" is for each of those
> constraints based solely on the page#1 results.
>
> i've had happy users using a similar approach in the past.
>
> -Hoss




-- 
Johannes Goll
211 Curry Ford Lane
Gaithersburg, Maryland 20878


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread johnnyisrael

Hi Eric,

You are right, there is a copy field to EdgeNgram. I tried the configuration
below but it is not working as expected.

Configuration I tried:

[schema.xml snippet stripped by the mailing list archive. The copy quoted in Erick's reply below preserves only a few attributes: termVectors="true" and omitNorms="true"/omitTermFreqAndPositions="true" on the fields, a field type with positionIncrementGap="100" containing an EdgeNGram filter with maxGramSize="25", and a copyField destination named edgy_user_query.]

==

When I search for the term "apple".

It is returning results for "pineapple vers apple", "milk with apple",
"apple milk shake" ...

Is there any other way to overcome this problem?

Thanks,

Johnny


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2329370.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: DIH From various File system locations

2011-01-25 Thread Estrada Groups
I would just use Nutch and specify the -solr param on the command line. That 
will add the extracted content to your instance of Solr.

Adam

Sent from my iPhone

On Jan 25, 2011, at 5:29 AM, pankaj bhatt  wrote:

> Hi All,
> I need to index the documents presents in my file system at various
> locations (e.g. C:\docs , d:\docs ).
>Is there any way through which i can specify this in my DIH
> Configuration.
>Here is my configuration:-
> 
> 
>  processor="FileListEntityProcessor"
>fileName="docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$"
> *baseDir="G:\\Desktop\\"*
>recursive="false"
>rootEntity="true"
>transformer="DateFormatTransformer"
> onerror="continue">
> processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
> url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
>  
>  
>  
>  
>
> 
>
>
>
>
> 
>  
> 
> / Pankaj Bhatt.


Re: Recommendation on RAM-/Cache configuration

2011-01-25 Thread Martin Grotzke
On Tue, Jan 25, 2011 at 2:06 PM, Markus Jelsma
wrote:

> On Tuesday 25 January 2011 11:54:55 Martin Grotzke wrote:
> > Hi,
> >
> > recently we're experiencing OOMEs (GC overhead limit exceeded) in our
> > searches. Therefore I want to get some clarification on heap and cache
> > configuration.
> >
> > This is the situation:
> > - Solr 1.4.1 running on tomcat 6, Sun JVM 1.6.0_13 64bit
> > - JVM Heap Params: -Xmx8G -XX:MaxPermSize=256m -XX:NewSize=2G
> > -XX:MaxNewSize=2G -XX:SurvivorRatio=6 -XX:+UseParallelOldGC
> > -XX:+UseParallelGC
>
> Consider switching to HotSpot JVM, use the -server as the first switch.

The jvm options I mentioned were not all, we're running the jvm with -server
(of course).


>
> > - The machine has 32 GB RAM
> > - Currently there are 4 processors/cores in the machine, this shall be
> > changed to 2 cores in the future.
> > - The index size in the filesystem is ~9.5 GB
> > - The index contains ~ 5.500.000 documents
> > - 1.500.000 of those docs are available for searches/queries, the rest
> are
> > inactive docs that are excluded from searches (via a flag/field), but
> > they're still stored in the index as need to be available by id (solr is
> > the main document store in this app)
>
> How do you exclude them? It should use filter queries.

The docs are indexed with a field "findable" on which we do a filter query.


> I also remember (but i
> just cannot find it back so please correct my if i'm wrong) that in 1.4.x
> sorting is done before filtering. It should be an improvement if filtering
> is
> done before sorting.
>
Hmm, I cannot imagine a case where it makes sense to sort before filtering.
Can't believe that solr does it like this.
Can anyone shed a light on this?


> If you use sorting, it takes up a huge amount of RAM if filtering is not
> done
> first.
>
> > - Caches are configured with a big size (the idea was to prevent
> filesystem
> > access / disk i/o as much as possible):
>
> There is only disk I/O if the kernel can't keep the index (or parts) in its
> page cache.
>
Yes, I'll keep an eye on disk I/O.



> >   - filterCache (solr.LRUCache): size=20, initialSize=3,
> > autowarmCount=1000, actual size =~ 60.000, hitratio =~ 0.99
> >   - documentCache (solr.LRUCache): size=20, initialSize=10,
> > autowarmCount=0, actual size =~ 160.000 - 190.000, hitratio =~ 0.74
> >   - queryResultCache (solr.LRUCache): size=20, initialSize=3,
> > autowarmCount=1, actual size =~ 10.000 - 60.000, hitratio =~ 0.71
>
> You should decrease the initialSize values. But your hitratio's seem very
> nice.
>
Does the initialSize have a real impact? According to
http://wiki.apache.org/solr/SolrCaching#initialSize it's the initial size of
the HashMap backing the cache.
What would you say are reasonable values for size/initialSize/autowarmCount?

Cheers,
Martin


Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Erlend Garåsen

On 25.01.11 11.30, Erlend Garåsen wrote:


Tika version 0.8 is not included in the latest release/trunk from SVN.


Ouch, I wrote "not" instead of "now". Sorry, I replied in a hurry.

And to clarify, by "content" I mean the main content of a Word file. 
Title and other kinds of metadata are successfully extracted by the old 
0.4 version of Tika, but you need a newer Tika version (0.8) in order to 
fetch the main content as well. So try the newest Solr version from trunk.


Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


Re: Getting started with writing parser

2011-01-25 Thread Gora Mohanty
On Tue, Jan 25, 2011 at 3:46 PM, Dinesh  wrote:
>
> no i actually changed the directory to mine where i stored the log files.. it
> is /home/exam/apa..solr/example/exampledocs
>
> i specified it in a solr schema.. i created an DataImportHandler for that in
> try.xml.. then in that i changed that file name to sample.txt
>
> that new try.xml is
> http://pastebin.com/pfVVA7Hs
[...]

Let us take this one part at a time.

In your inner nested entity,
  

Re: Use terracotta bigmemory for solr-caches

2011-01-25 Thread Em

Hi Martin,

are you sure that your GC is well tuned?
A request that needs more than a minute isn't the standard, even when I
consider all the other postings about response-performance...

Regards
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Use-terracotta-bigmemory-for-solr-caches-tp2328257p2330652.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Gary Taylor

Thanks Erlend.

Not used SVN before, but have managed to download and build latest trunk 
code.


Now I'm getting an error when trying to access the admin page (via 
Jetty) because I specify HTMLStripStandardTokenizerFactory in my 
schema.xml, but this appears to be no-longer supplied as part of the 
build so I get an exception cos it can't find that class.  I've checked 
the CHANGES.txt and found the following in the change list to 1.4.0 (!?) :


66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader, 
HTMLStripWhitespaceTokenizerFactory and
HTMLStripStandardTokenizerFactory deprecated. To strip HTML tags, 
HTMLStripCharFilter can be used with an arbitrary Tokenizer. (koji)


Unfortunately, I can't seem to get that to work correctly.  Does anyone 
have an example fieldType stanza (for schema.xml) for stripping out HTML ?
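
(A minimal sketch of the kind of stanza being asked for, based on the CHANGES.txt note above; the field type name and the rest of the chain are placeholders:)

<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- charFilters run before the tokenizer, so the markup never reaches it -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>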


Thanks and kind regards,
Gary.



On 25/01/2011 14:17, Erlend Garåsen wrote:

On 25.01.11 11.30, Erlend Garåsen wrote:


Tika version 0.8 is not included in the latest release/trunk from SVN.


Ouch, I wrote "not" instead of "now". Sorry, I replied in a hurry.

And to clarify, by "content" I mean the main content of a Word file. 
Title and other kinds of metadata are successfully extracted by the 
old 0.4 version of Tika, but you need a newer Tika version (0.8) in 
order to fetch the main content as well. So try the newest Solr 
version from trunk.


Erlend






List of indexed or stored fields

2011-01-25 Thread kenf_nc

I use a lot of dynamic fields, so looking at my schema isn't a good way to
see all the field names that may be indexed across all documents. Is there a
way to query solr for that information? All field names that are indexed, or
stored? Possibly a count by field name? Is there any other metadata about a
field that can be queried?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/List-of-indexed-or-stored-fields-tp2330986p2330986.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Gary Taylor

OK, got past the schema.xml problem, but now I'm back to square one.

I can index the contents of binary files (Word, PDF etc...), as well as 
text files, but it won't index the content of files inside a zip.


As an example, I have two txt files - doc1.txt and doc2.txt.  If I index 
either of them individually using:


curl 
"http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5"; 
-F "file=@doc1.txt"


and commit, Solr will index the contents and searches will match.

If I zip those two files up into solr1.zip, and index that using:

curl 
"http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5"; 
-F "file=@solr1.zip"


and commit, the file names are indexed, but not their contents.

I have checked that Tika can correctly process the zip file when used 
standalone with the tika-app jar - it outputs both the filenames and 
contents.  Should I be able to index the contents of files stored in a 
zip by using extract ?


Thanks and kind regards,
Gary.


On 25/01/2011 15:32, Gary Taylor wrote:

Thanks Erlend.

Not used SVN before, but have managed to download and build latest 
trunk code.


Now I'm getting an error when trying to access the admin page (via 
Jetty) because I specify HTMLStripStandardTokenizerFactory in my 
schema.xml, but this appears to be no-longer supplied as part of the 
build so I get an exception cos it can't find that class.  I've 
checked the CHANGES.txt and found the following in the change list to 
1.4.0 (!?) :


66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader, 
HTMLStripWhitespaceTokenizerFactory and
HTMLStripStandardTokenizerFactory deprecated. To strip HTML tags, 
HTMLStripCharFilter can be used with an arbitrary Tokenizer. (koji)


Unfortunately, I can't seem to get that to work correctly.  Does 
anyone have an example fieldType stanza (for schema.xml) for stripping 
out HTML ?


Thanks and kind regards,
Gary.



On 25/01/2011 14:17, Erlend Garåsen wrote:

On 25.01.11 11.30, Erlend Garåsen wrote:


Tika version 0.8 is not included in the latest release/trunk from SVN.


Ouch, I wrote "not" instead of "now". Sorry, I replied in a hurry.

And to clarify, by "content" I mean the main content of a Word file. 
Title and other kinds of metadata are successfully extracted by the 
old 0.4 version of Tika, but you need a newer Tika version (0.8) in 
order to fetch the main content as well. So try the newest Solr 
version from trunk.


Erlend








Re: List of indexed or stored fields

2011-01-25 Thread Juan Grande
You can query all the indexed or stored fields (including dynamic fields)
using the LukeRequestHandler: http://localhost:8983/solr/example/admin/luke

See also: http://wiki.apache.org/solr/LukeRequestHandler

Regards,
*
**Juan G. Grande*
-- Solr Consultant @ http://www.plugtree.com
-- Blog @ http://juanggrande.wordpress.com

On Tue, Jan 25, 2011 at 12:39 PM, kenf_nc  wrote:

>
> I use a lot of dynamic fields, so looking at my schema isn't a good way to
> see all the field names that may be indexed across all documents. Is there
> a
> way to query solr for that information? All field names that are indexed,
> or
> stored? Possibly a count by field name? Is there any other metadata about a
> field that can be queried?
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/List-of-indexed-or-stored-fields-tp2330986p2330986.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: DIH From various File system locations

2011-01-25 Thread pankaj bhatt
Thanks Adam. It seems like Nutch would solve most of my concerns.
It would be great if you could share resources for Nutch with us.

/ Pankaj Bhatt.

On Tue, Jan 25, 2011 at 7:21 PM, Estrada Groups <
estrada.adam.gro...@gmail.com> wrote:

> I would just use Nutch and specify the -solr param on the command line.
> That will add the extracted content your instance of solr.
>
> Adam
>
> Sent from my iPhone
>
> On Jan 25, 2011, at 5:29 AM, pankaj bhatt  wrote:
>
> > Hi All,
> > I need to index the documents presents in my file system at
> various
> > locations (e.g. C:\docs , d:\docs ).
> >Is there any way through which i can specify this in my DIH
> > Configuration.
> >Here is my configuration:-
> >
> > 
> >   >processor="FileListEntityProcessor"
> >fileName="docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$"
> > *baseDir="G:\\Desktop\\"*
> >recursive="false"
> >rootEntity="true"
> >transformer="DateFormatTransformer"
> > onerror="continue">
> > > processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
> > url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
> >  
> >  
> >  
> >  
> >
> >
> >
> >
> >
> >
> > 
> >  
> >
> > / Pankaj Bhatt.
>


Re: Extracting contents of zipped files with Tika and Solr 1.4.1

2011-01-25 Thread Jayendra Patil
Hi Gary,

The latest Solr Trunk was able to extract and index the contents of the zip
file using the ExtractingRequestHandler.
The snapshot of Trunk we worked upon had the Tika 0.8 snapshot jars and
worked pretty well.

Tested again with sample url and works fine -
curl "
http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true
"

You would probably need to drill down to the Tika Jars and
the apache-solr-cell-4.0-dev.jar used for Rich documents indexing.

Regards,
Jayendra

On Tue, Jan 25, 2011 at 11:08 AM, Gary Taylor  wrote:

> OK, got past the schema.xml problem, but now I'm back to square one.
>
> I can index the contents of binary files (Word, PDF etc...), as well as
> text files, but it won't index the content of files inside a zip.
>
> As an example, I have two txt files - doc1.txt and doc2.txt.  If I index
> either of them individually using:
>
> curl "
> http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5";
> -F "file=@doc1.txt"
>
> and commit, Solr will index the contents and searches will match.
>
> If I zip those two files up into solr1.zip, and index that using:
>
> curl "
> http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5";
> -F "file=@solr1.zip"
>
> and commit, the file names are indexed, but not their contents.
>
> I have checked that Tika can correctly process the zip file when used
> standalone with the tika-app jar - it outputs both the filenames and
> contents.  Should I be able to index the contents of files stored in a zip
> by using extract ?
>
>
> Thanks and kind regards,
> Gary.
>
>
> On 25/01/2011 15:32, Gary Taylor wrote:
>
>> Thanks Erlend.
>>
>> Not used SVN before, but have managed to download and build latest trunk
>> code.
>>
>> Now I'm getting an error when trying to access the admin page (via Jetty)
>> because I specify HTMLStripStandardTokenizerFactory in my schema.xml, but
>> this appears to be no-longer supplied as part of the build so I get an
>> exception cos it can't find that class.  I've checked the CHANGES.txt and
>> found the following in the change list to 1.4.0 (!?) :
>>
>> 66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader,
>> HTMLStripWhitespaceTokenizerFactory andHTMLStripStandardTokenizerFactory
>> deprecated. To strip HTML tags, HTMLStripCharFilter can be used with an
>> arbitrary Tokenizer. (koji)
>>
>> Unfortunately, I can't seem to get that to work correctly.  Does anyone
>> have an example fieldType stanza (for schema.xml) for stripping out HTML ?
>>
>> Thanks and kind regards,
>> Gary.
>>
>>
>>
>> On 25/01/2011 14:17, Erlend Garåsen wrote:
>>
>>> On 25.01.11 11.30, Erlend Garåsen wrote:
>>>
>>>  Tika version 0.8 is not included in the latest release/trunk from SVN.

>>>
>>> Ouch, I wrote "not" instead of "now". Sorry, I replied in a hurry.
>>>
>>> And to clarify, by "content" I mean the main content of a Word file.
>>> Title and other kinds of metadata are successfully extracted by the old 0.4
>>> version of Tika, but you need a newer Tika version (0.8) in order to fetch
>>> the main content as well. So try the newest Solr version from trunk.
>>>
>>> Erlend
>>>
>>>
>>
>>
>


How to Configure Solr to pick my lucene custom filter

2011-01-25 Thread Valiveti

Hi ,

I have written a lucene custom filter.
I could not figure out on how to configure Solr to pick this custom filter
for search.

How to configure Solr to pick my custom filter?
Will the Solr standard search handler pick this custom filter?

Thanks,
Valiveti

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-Configure-Solr-to-pick-my-lucene-custom-filter-tp2331928p2331928.html
Sent from the Solr - User mailing list archive at Nabble.com.


in-index representaton of tokens

2011-01-25 Thread Dennis Gearon
So, the index is a list of tokens per column, right?

There's a table per column that lists the analyzed tokens?

And the tokens per column are represented as what, system integers? 32/64 bit 
unsigned ints?

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



Re: in-index representaton of tokens

2011-01-25 Thread Jonathan Rochkind

Why does it matter?  You can't really get at them unless you store them.

I don't know what "table per column" means, there's nothing in Solr 
architecture called a "table" or a "column". Although by column you 
probably mean more or less Solr "field".  There is nothing like a 
"table" in Solr.


Solr is still not an rdbms.

On 1/25/2011 12:26 PM, Dennis Gearon wrote:

So, the index is a list of tokens per column, right?

There's a table per column that lists the analyzed tokens?

And the tokens per column are represented as what, system integers? 32/64 bit
unsigned ints?

  Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a better
idea to learn from others’ mistakes, so you do not have to make them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Erick Erickson
Let's back up here because now I'm not clear what you actually want.
EdgeNGrams
are a way of matching substrings, which is what's happening here. Of course
searching "apple" matches any of the three examples - just as searching for
"apple" without grams would - so that's the expected behavior.

So, we need a clear problem definition of what you're trying to do, along
with
example queries (please post the results of adding &debugQuery=on).

Best
Erick

On Tue, Jan 25, 2011 at 8:29 AM, johnnyisrael wrote:

>
> Hi Eric,
>
> You are right, there is a copy field to EdgeNgram, I tried the
> configuration
> but it not working as expected.
>
> Configuration I tried:
>
> 
>
>  termVectors=”true”>
> 
> 
> 
> 
> 
> 
> 
> 
> 
>
>  positionIncrementGap=”100″>
> 
> 
> 
>  maxGramSize=”25″/>
> 
> 
> 
> 
> 
> 
>
>  omitNorms=”true” omitTermFreqAndPositions=”true” />
>  omitNorms=”true” omitTermFreqAndPositions=”true” />
>
> edgy_user_query
> 
>
> ==
>
> When I search for the term "apple".
>
> It is returning results for "pineapple vers apple", "milk with apple",
> "apple milk shake" ...
>
> Is there any other way to overcome this problem?
>
> Thanks,
>
> Johnny
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2329370.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Highlighting with/without Term Vectors

2011-01-25 Thread Salman Akram
Anyone?

On Tue, Jan 25, 2011 at 12:57 AM, Salman Akram <
salman.ak...@northbaysolutions.net> wrote:

> Just to add one thing, in case it makes a difference.
>
> Max document size on which highlighting needs to be done is few hundred
> kb's (in file system). In index its compressed so should be much smaller.
> Total documents are more than 100 million.
>
>
> On Tue, Jan 25, 2011 at 12:42 AM, Salman Akram <
> salman.ak...@northbaysolutions.net> wrote:
>
>> Hi,
>>
>> Does anyone have any benchmarks how much highlighting speeds up with Term
>> Vectors (compared to without it)? E.g. if highlighting on 20 documents takes
>> 1 sec with Term Vectors, any idea how long it will take without them?
>>
>> I need to know because the index used for highlighting has a TVF file of
>> around 450GB (approx 65% of the total index size), so I am trying to see whether
>> decreasing the index size by dropping the TVF would be more helpful for
>> performance (less RAM, and it should be good for I/O too, I guess) or whether
>> keeping it is still better?
>>
>> I know the best way is to try it out, but indexing takes a very long time, so I am
>> trying to see whether it's even worth it or not.
>>
>> --
>> Regards,
>>
>> Salman Akram
>>
>>
>
>
> --
> Regards,
>
> Salman Akram
>



-- 
Regards,

Salman Akram


Re: How to Configure Solr to pick my lucene custom filter

2011-01-25 Thread Erick Erickson
Presumably your custom filter is in a jar file. Drop that jar file in
/lib
and refer it from your schema.xml file by its full name
(e.g. com.yourcompany.filter.yourcustomfilter) just like the other filters
and it should
work fine.

You can also put your jar anywhere you'd like and alter solrconfig.xml with an
additional <lib> directive (see the example solrconfig.xml).
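
For example (paths and class names here are placeholders, not taken from your setup,
and this assumes the jar provides a TokenFilterFactory), the two pieces would look
roughly like this.

In solrconfig.xml:

  <lib dir="/path/to/your/custom/jars" />

In schema.xml, inside the analyzer chain of the field type that should use it:

  <fieldType name="text_custom" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="com.yourcompany.filter.YourCustomFilterFactory"/>
    </analyzer>
  </fieldType>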

Best
Erick

On Tue, Jan 25, 2011 at 12:07 PM, Valiveti wrote:

>
> Hi ,
>
> I have written a lucene custom filter.
> I could not figure out on how to configure Solr to pick this custom filter
> for search.
>
> How to configure Solr to pick my custom filter?
> Will the Solr standard search handler pick this custom filter?
>
> Thanks,
> Valiveti
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-Configure-Solr-to-pick-my-lucene-custom-filter-tp2331928p2331928.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: List of indexed or stored fields

2011-01-25 Thread kenf_nc

That's exactly what I wanted, thanks. Any idea what

  <long name="version">1294513299077</long>

refers to under the <index> section? I have 2 cores on one Tomcat instance,
and 1 on a second instance (different server) and all 3 have different
numbers for "version", so I don't think it's the version of Luke.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/List-of-indexed-or-stored-fields-tp2330986p2333281.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: List of indexed or stored fields

2011-01-25 Thread Markus Jelsma
The index version. Can be used in replication to determine whether to 
replicate or not.

On Tuesday 25 January 2011 20:30:21 kenf_nc wrote:
> refers to under the  section? I have 2 cores on one Tomcat instance,
> and 1 on a second instance (different server) and all 3 have different
> numbers for "version", so I don't think it's the version of Luke.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread johnnyisrael

Hi Eric,

What I want here is, lets say I have 3 documents like 

["pineapple vers apple", "milk with apple", "apple milk shake" ]

and If i search for "apple", it should return only "apple milk shake"
because that term alone starts with the letter "apple" which I typed in. It
should not bring others and if I type "milk" it should return only "milk
with apple"

I want an output Similar like a Google auto suggest.

Is there a way to achieve  this without encapsulating with double quotes.

Thanks,

Johnny
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2333602.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: DIH From various File system locations

2011-01-25 Thread Adam Estrada
There are a few tutorials out there.

1. http://wiki.apache.org/nutch/RunningNutchAndSolr (not the most practical)
2. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ (similar to 1.)
3. Build the latest from branch
http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/ and read
this one.

http://www.adamestrada.com/2010/04/24/web-crawling-with-nutch/

but add the "solr" parameter at the end bin/nutch crawl urls -depth 5
-topN 100 -solr http://localhost:8983/solr

This will automatically add the data Nutch collected to Solr. For
larger files I would also increase your JAVA_OPTS env to something
like JAVA_OPTS='-Xmx2048m'

Adam




On Tue, Jan 25, 2011 at 11:41 AM, pankaj bhatt  wrote:
> Thanks Adam, it seems like Nutch would solve most of my concerns.
> It would be great if you could share some Nutch resources with us.
>
> / Pankaj Bhatt.
>
> On Tue, Jan 25, 2011 at 7:21 PM, Estrada Groups <
> estrada.adam.gro...@gmail.com> wrote:
>
>> I would just use Nutch and specify the -solr param on the command line.
>> That will add the extracted content your instance of solr.
>>
>> Adam
>>
>> Sent from my iPhone
>>
>> On Jan 25, 2011, at 5:29 AM, pankaj bhatt  wrote:
>>
>> > Hi All,
>> >         I need to index the documents presents in my file system at
>> various
>> > locations (e.g. C:\docs , d:\docs ).
>> >    Is there any way through which i can specify this in my DIH
>> > Configuration.
>> >    Here is my configuration:-
>> >
>> > 
>> >      > >        processor="FileListEntityProcessor"
>> >        fileName="docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$"
>> > *baseDir="G:\\Desktop\\"*
>> >        recursive="false"
>> >        rootEntity="true"
>> >        transformer="DateFormatTransformer"
>> > onerror="continue">
>> >        > > processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
>> > url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
>> >          
>> >          
>> >          
>> >          
>> >        
>> >
>> >        
>> >        
>> >        
>> >    
>> > 
>> >  
>> >
>> > / Pankaj Bhatt.
>>
>


Re: DIH From various File system locations

2011-01-25 Thread Adam Estrada
I take that back... I am currently using version 1.2; make sure
that the latest versions of Tika and PDFBox are in the contrib folder.
1.3 is structured a bit differently and it doesn't look like there is
a contrib directory. Maybe one of the Nutch contributors can comment
on this?

Adam

On Tue, Jan 25, 2011 at 3:21 PM, Adam Estrada
 wrote:
> There are a few tutorials out there.
>
> 1. http://wiki.apache.org/nutch/RunningNutchAndSolr (not the most practical)
> 2. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ (similar to 1.)
> 3. Build the latest from branch
> http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/ and read
> this one.
>
> http://www.adamestrada.com/2010/04/24/web-crawling-with-nutch/
>
> but add the "solr" parameter at the end bin/nutch crawl urls -depth 5
> -topN 100 -solr http://localhost:8983/solr
>
> This will automatically add the data nutch collected to Solr. For
> larger files I would also increase your JAVA_OPTS env to something
> like JAVA_OPTS=' Xmx2048m'
>
> Adam
>
>
>
>
> On Tue, Jan 25, 2011 at 11:41 AM, pankaj bhatt  wrote:
>> Thanks Adam, It seems like Nutch use to solve most of my concerns.
>> i would be great if you can have share resources for Nutch with us.
>>
>> / Pankaj Bhatt.
>>
>> On Tue, Jan 25, 2011 at 7:21 PM, Estrada Groups <
>> estrada.adam.gro...@gmail.com> wrote:
>>
>>> I would just use Nutch and specify the -solr param on the command line.
>>> That will add the extracted content your instance of solr.
>>>
>>> Adam
>>>
>>> Sent from my iPhone
>>>
>>> On Jan 25, 2011, at 5:29 AM, pankaj bhatt  wrote:
>>>
>>> > Hi All,
>>> >         I need to index the documents presents in my file system at
>>> various
>>> > locations (e.g. C:\docs , d:\docs ).
>>> >    Is there any way through which i can specify this in my DIH
>>> > Configuration.
>>> >    Here is my configuration:-
>>> >
>>> > 
>>> >      >> >        processor="FileListEntityProcessor"
>>> >        fileName="docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$"
>>> > *baseDir="G:\\Desktop\\"*
>>> >        recursive="false"
>>> >        rootEntity="true"
>>> >        transformer="DateFormatTransformer"
>>> > onerror="continue">
>>> >        >> > processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
>>> > url="${sd.fileAbsolutePath}" format="text" dataSource="bin">
>>> >          
>>> >          
>>> >          
>>> >          
>>> >        
>>> >
>>> >        
>>> >        
>>> >        
>>> >    
>>> > 
>>> >  
>>> >
>>> > / Pankaj Bhatt.
>>>
>>
>


CFP - Berlin Buzzwords 2011 - Search, Store, Scale

2011-01-25 Thread Isabel Drost
This is to announce the Berlin Buzzwords 2011. The second edition of the 
successful conference on scalable and open search, data processing and data 
storage in Germany, taking place in Berlin.

Call for Presentations Berlin Buzzwords
   http://berlinbuzzwords.de
  Berlin Buzzwords 2011 - Search, Store, Scale
6/7 June 2011

The event will comprise presentations on scalable data processing. We invite 
you 
to submit talks on the topics:

   * IR / Search - Lucene, Solr, katta or comparable solutions
   * NoSQL - like CouchDB, MongoDB, Jackrabbit, HBase and others
   * Hadoop - Hadoop itself, MapReduce, Cascading or Pig and relatives
   * Closely related topics not explicitly listed above are welcome. We are
 looking for presentations on the implementation of the systems themselves,
 real world applications and case studies.

Important Dates (all dates in GMT +2)
   * Submission deadline: March 1st 2011, 23:59 MEZ
   * Notification of accepted speakers: March 22nd, 2011, MEZ.
   * Publication of final schedule: April 5th, 2011.
   * Conference: June 6/7. 2011

High quality, technical submissions are called for, ranging from principles to 
practice. We are looking for real world use cases, background on the 
architecture of specific projects and a deep dive into architectures built on 
top of e.g. Hadoop clusters.

Proposals should be submitted at http://berlinbuzzwords.de/content/cfp-0 no 
later than March 1st, 2011. Acceptance notifications will be sent out soon 
after 
the submission deadline. Please include your name, bio and email, the title of 
the talk, a brief abstract in English language. Please indicate whether you 
want 
to give a lightning (10min), short (20min) or long (40min) presentation and 
indicate the level of experience with the topic your audience should have (e.g. 
whether your talk will be suitable for newbies or is targeted for experienced 
users.) If you'd like to pitch your brand new product in your talk, please let 
us know as well - there will be extra space for presenting new ideas, awesome 
products and great new projects.

The presentation format is short. We will be enforcing the schedule rigorously.

If you are interested in sponsoring the event (e.g. we would be happy to 
provide 
videos after the event, free drinks for attendees as well as an after-show 
party), please contact us.

Follow @hadoopberlin on Twitter for updates. Tickets, news on the conference, 
and the final schedule will be published at http://berlinbuzzwords.de.

Program Chairs: Isabel Drost, Jan Lehnardt, and Simon Willnauer.

Please re-distribute this CfP to people who might be interested.

If you are local and wish to meet us earlier, please note that this Thursday 
evening there will be an Apache Hadoop Get Together (videos kindly sponsored by 
Cloudera, venue kindly provided for free by Zanox) featuring talks on Apache 
Hadoop in production as well as news on current Apache Lucene developments.

Contact us at:

newthinking communications 
GmbH Schönhauser Allee 6/7 
10119 Berlin, 
Germany 

Julia Gemählich
Isabel Drost 

+49(0)30-9210 596




Re: How to Configure Solr to pick my lucene custom filter

2011-01-25 Thread Valiveti

Hi Eric,

Thanks for the reply.

I Did see some entries in the solrconfig.xml for adding custom
reposneHandlers, queryParsers and queryResponseWriters.

Bit could not find the one for adding the custom filter.

Could you point to the exact location or syntax to be used.

Thanks,
Valiveti


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-Configure-Solr-to-pick-my-lucene-custom-filter-tp2331928p2334120.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Jonathan Rochkind
I haven't figured out any way to achieve that AT ALL without making a 
seperate Solr index just to serve autosuggest queries. At least when you 
want to auto-suggest on a multi-value field. Someone posted a crazy 
tricky way to do it with a single-valued field a while ago.  If you 
can/are willing to make a seperate Solr index with a schema set up for 
auto-suggest specifically, it's easy. But from an existing schema, where 
you want to auto-suggest just based on the values in one field, it's a 
multi-valued field, and you want to allow matches in the middle of the 
field -- I don't think there's a way to do it.


On 1/25/2011 3:03 PM, johnnyisrael wrote:

Hi Eric,

What I want here is, lets say I have 3 documents like

["pineapple vers apple", "milk with apple", "apple milk shake" ]

and If i search for "apple", it should return only "apple milk shake"
because that term alone starts with the letter "apple" which I typed in. It
should not bring others and if I type "milk" it should return only "milk
with apple"

I want an output Similar like a Google auto suggest.

Is there a way to achieve  this without encapsulating with double quotes.

Thanks,

Johnny


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Markus Jelsma
Then you don't need NGrams at all. A wildcard will suffice or you can use the 
TermsComponent.

If these strings are indexed as single tokens (KeywordTokenizer with 
LowercaseFilter) you can simply do field:app* to retrieve the "apple milk 
shake". You can also use the string field type but then you must make sure the 
values are already lowercased before indexing.

Be careful though, there is no query time analysis for wildcard (and fuzzy) 
queries, so make sure the prefix you send is already lowercased as well.
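
As a rough sketch (the field and type names below are made up, not from this thread):

  <fieldType name="suggest_exact" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="suggest_text" type="suggest_exact" indexed="true" stored="true"/>

A query like suggest_text:app* then matches "apple milk shake" but not
"milk with apple", provided the prefix is lowercased before it is sent.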

> Hi Eric,
> 
> What I want here is, lets say I have 3 documents like
> 
> ["pineapple vers apple", "milk with apple", "apple milk shake" ]
> 
> and If i search for "apple", it should return only "apple milk shake"
> because that term alone starts with the letter "apple" which I typed in. It
> should not bring others and if I type "milk" it should return only "milk
> with apple"
> 
> I want an output Similar like a Google auto suggest.
> 
> Is there a way to achieve  this without encapsulating with double quotes.
> 
> Thanks,
> 
> Johnny


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Markus Jelsma
Oh, I should perhaps mention that EdgeNGrams will yield results a lot more quickly 
than wildcards, at the cost of a larger index. You should, of course, use 
EdgeNGrams if you are worried about performance and have a huge index and a high 
number of queries per second.

> Then you don't need NGrams at all. A wildcard will suffice or you can use
> the TermsComponent.
> 
> If these strings are indexed as single tokens (KeywordTokenizer with
> LowercaseFilter) you can simply do field:app* to retrieve the "apple milk
> shake". You can also use the string field type but then you must make sure
> the values are already lowercased before indexing.
> 
> Be careful though, there is no query time analysis for wildcard (and fuzzy)
> queries so make sure
> 
> > Hi Eric,
> > 
> > What I want here is, lets say I have 3 documents like
> > 
> > ["pineapple vers apple", "milk with apple", "apple milk shake" ]
> > 
> > and If i search for "apple", it should return only "apple milk shake"
> > because that term alone starts with the letter "apple" which I typed in.
> > It should not bring others and if I type "milk" it should return only
> > "milk with apple"
> > 
> > I want an output Similar like a Google auto suggest.
> > 
> > Is there a way to achieve  this without encapsulating with double quotes.
> > 
> > Thanks,
> > 
> > Johnny


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread mesenthil

The index contains around 1.5 million documents. As this is used for an
autosuggest feature, performance is an important factor.

So it looks like it is difficult to achieve the following using edgeNgram:

The result should return only those terms where the typed letters match the
first word. For example, when we type "M", it should return
"Mumford and Sons" and not "Jackson Michael".


Jonathan,

Is it possible to achieve this when we have separate index using edgeNgram?
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2334538.html
Sent from the Solr - User mailing list archive at Nabble.com.


Specifying optional terms with standard (lucene) request handler?

2011-01-25 Thread Daniel Pötzinger
Hi

I am searching for a way to specify optional terms in a query (terms that don't need 
to match, but that should influence the scoring if they do match).

Using the dismax parser a query like this:
mm=2
debugQuery=on
q=+lorem ipsum dolor amet
qf=content
defType=dismax
Will be parsed into something like this:

+((+(content:lor) (content:ipsum) (content:dolor) (content:amet))~2) ()

Which means that only 2 of the 3 optional terms need to match, right?


How can optional terms be specified using the standard request handler?
My concrete requirement is that a certain term must match while another is 
optional, but if the optional part matches it should give the document an 
extra score.
Something like :-)
content:lorem #optional#content:optionalboostword^10

An idea would be to use a function query to boost the document:

content:lorem _val_:"query({!lucene v='optionalword^20'})"

Which will result in:

+content:forum +query(content:optionalword^20.0,def=0.0)

Is this a good way or are there other suggestions?

Thanks for any opinion and tips on this

Daniel




Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Jonathan Rochkind
Ah, sorry, I got confused about your requirements. If you just want to 
match at the beginning of the field, it may well be possible, using 
edgegrams or a wildcard, if you have a single-valued field. Do you have a 
single-valued or a multi-valued field?  That is, does each document have 
just one value, or multiple?   I still get confused about how to do it 
with edgegrams, even with single-valued field, but I think maybe it's 
possible.


_Definitely_ possible, with or without edgegrams, if you are 
willing/able to make a completely seperate Solr index where each term 
for auto-suggest is a "document".  Yes.


The problem lies in what "results" are. In general, Solr's results are 
the documents you have in the Solr index. Thus it makes everything a lot 
easier to deal with if you have an index where each document in the 
index is a "term" for auto-suggest.   But that doesn't always meet 
requirements if you need to auto-suggest within existing fq's and such, 
and of course it takes more resources to run an additional solr index.


On 1/25/2011 5:03 PM, mesenthil wrote:

The index contains around 1.5 million documents. As this is used for
autosuggest feature, performance is an important factor.

So it looks like, using edgeNgram it is difficult to achieve the the
following

Result should return only those terms where search letter is matching with
the first word only. For example, when we type "M",  it should return
"Mumford and Sons" and not "jackson Michael".


Jonathan,

Is it possible to achieve this when we have separate index using edgeNgram?



Re: Specifying optional terms with standard (lucene) request handler?

2011-01-25 Thread Jonathan Rochkind

With the 'lucene' query parser?

include &q.op=OR and then put a "+" ("mandatory") in front of every term 
in the 'q' that is NOT optional; the rest will be optional.  I think 
that will do what you want.
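
For Daniel's example that would be something along these lines (untested, and
URL-encoding aside):

  q=+content:lorem content:optionalboostword^10&q.op=OR

content:lorem is required, while the boosted term stays optional and only adds to
the score when it matches.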


Jonathan

On 1/25/2011 5:07 PM, Daniel Pötzinger wrote:

Hi

I am searching for a way to specify optional terms in a query ( that dont need 
to match (But if they match should influence the scoring) )

Using the dismax parser a query like this:
2
on
+lorem ipsum dolor amet
content

dismax
Will be parsed into something like this:

+((+(content:lor) (content:ipsum) (content:dolor) (content:amet))~2) ()

Which will result that only 2 of the 3 optional terms need to match?


How can optional terms be specified using the standard request handler?
My concrete requirement is that a certain term should match but another is 
optional. But if the optional part matches - it should give the document an 
extra score.
Something like :-)
content:lorem #optional#content:optionalboostword^10

An idea would be to use a function query to boost the document:

content:lorem _val_:"query({!lucene v='optionalword^20'})"

Which will result in:

+content:forum +query(content:optionalword^20.0,def=0.0)

Is this a good way or are there other suggestions?

Thanks for any opinion and tips on this

Daniel




Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread mesenthil

Right now our configuration says multiValued="true", but that need not be
the case for us. I will make it false, try it, and update this thread with
more details.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2334627.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr set up issues with Magento

2011-01-25 Thread Sandhya Padala
Thank you Markus. I have added a few more fields to schema.xml.

Now it looks like the products are getting indexed, but there are no search results.

If I configure Magento to use Solr as the search engine, search does not
return any results.  If I change the search engine to use Magento's
inbuilt MySQL, search results are returned.  Can you please direct me on
where/how I should start the debugging process?

If I use the Solr admin and enter the search query, that doesn't return any
results either.

Thank you,
Sandhya

On Mon, Jan 24, 2011 at 4:11 PM, Markus Jelsma
wrote:

> Hi,
>
> You haven't defined the field in Solr's schema.xml configuration so it
> needs to
> be added first. Perhaps following the tutorial might be a good idea.
>
> http://lucene.apache.org/solr/tutorial.html
>
> Cheers.
>
> > Hello Team:
> >
> >
> >   I am in the process of setting up Solr 1.4 with Magento ENterprise
> > Edition 1.9.
> >
> > When I try to index the products I get the following error message.
> >
> > Jan 24, 2011 3:30:14 PM
> org.apache.solr.update.processor.LogUpdateProcessor
> > fini
> > sh
> > INFO: {} 0 0
> > Jan 24, 2011 3:30:14 PM org.apache.solr.common.SolrException log
> > SEVERE: org.apache.solr.common.SolrException: ERROR:unknown field
> > 'in_stock' at
> > org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.jav
> > a:289)
> > at
> > org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpd
> > ateProcessorFactory.java:60)
> > at
> > org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
> > at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
> > at
> > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Co
> > ntentStreamHandlerBase.java:54)
> > at
> > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandl
> > erBase.java:131)
> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> > at
> > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter
> > .java:338)
> > at
> > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilte
> > r.java:241)
> > at
> > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Appl
> > icationFilterChain.java:244)
> > at
> > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationF
> > ilterChain.java:210)
> > at
> > org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperV
> > alve.java:240)
> > at
> > org.apache.catalina.core.StandardContextValve.invoke(StandardContextV
> > alve.java:161)
> > at
> > org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.j
> > ava:164)
> > at
> > org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.j
> > ava:100)
> > at
> > org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:
> > 550)
> > at
> > org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineVal
> > ve.java:118)
> > at
> > org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.jav
> > a:380)
> > at
> > org.apache.coyote.http11.Http11Processor.process(Http11Processor.java
> >
> > :243)
> >
> > at
> > org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.proce
> > ss(Http11Protocol.java:188)
> > at
> > org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.proce
> > ss(Http11Protocol.java:166)
> > at
> > org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoin
> > t.java:288)
> > at
> > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExec
> > utor.java:886)
> > at
> > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
> > .java:908)
> > at java.lang.Thread.run(Thread.java:662)
> >
> > Jan 24, 2011 3:30:14 PM org.apache.solr.core.SolrCore execute
> > INFO: [] webapp=/solr path=/update params={wt=json} status=400 QTime=0
> > Jan 24, 2011 3:30:14 PM org.apache.solr.update.DirectUpdateHandler2
> > rollback INFO: start rollback
> > Jan 24, 2011 3:30:14 PM org.apache.solr.update.DirectUpdateHandler2
> > rollback INFO: end_rollback
> > Jan 24, 2011 3:30:14 PM
> org.apache.solr.update.processor.LogUpdateProcessor
> > fini
> > sh
> > INFO: {rollback=} 0 16
> > Jan 24, 2011 3:30:14 PM org.apache.solr.core.SolrCore execute
> >
> > I am a new to both Magento and SOlr. I could have done some thing stupid
> > during installation. I really look forward for your help.
> >
> > Thank you,
> > Sandhya
>


Best way to build a solr-based m2 project

2011-01-25 Thread Paul Libbrecht

Hello list,

Apologies if this was already asked, I haven't found the answer in the archive.
I've been out of this list for quite some time now, hence.

I am looking for a good way to package a maven2-based project that would give me a 
solr-based webapp.
I would expect projects such as the velocity contrib, or even the default solr, to 
include everything ready for this, but I don't see it organized that way and, in 
particular, I see nothing that contains a packaging of type war.

Have I missed something?
Should I simply copy some bits into my source tree and then make sure they get 
copied to the right place?

I found a solr archetype, but it only delivers a standalone solr, which does not 
interest me.

thanks in advance

paul

Re: in-index representaton of tokens

2011-01-25 Thread Dennis Gearon
Am I right that there is a list of tokens that have been parsed (a table of them) 
for each column? Or is there one for the whole index?

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better 
idea to learn from others’ mistakes, so you do not have to make them yourself. 
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Jonathan Rochkind 
To: "solr-user@lucene.apache.org" 
Sent: Tue, January 25, 2011 9:29:36 AM
Subject: Re: in-index representaton of tokens

Why does it matter?  You can't really get at them unless you store them.

I don't know what "table per column" means, there's nothing in Solr 
architecture called a "table" or a "column". Although by column you 
probably mean more or less Solr "field".  There is nothing like a 
"table" in Solr.

Solr is still not an rdbms.

On 1/25/2011 12:26 PM, Dennis Gearon wrote:
> So, the index is a list of tokens per column, right?
>
> There's a table per column that lists the analyzed tokens?
>
> And the tokens per column are represented as what, system integers? 32/64 bit
> unsigned ints?
>
>   Dennis Gearon
>
>
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a 
>better
> idea to learn from others’ mistakes, so you do not have to make them yourself.
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>
>
> EARTH has a Right To Life,
> otherwise we all die.
>



Re: How to Configure Solr to pick my lucene custom filter

2011-01-25 Thread Erick Erickson
First, let's be sure we're talking about the same thing. My response was for
adding
a filter to your analysis chain for a field in Schema.xml. Are you talking
about a different
sort of filter?

Best
Erick

On Tue, Jan 25, 2011 at 4:09 PM, Valiveti wrote:

>
> Hi Eric,
>
> Thanks for the reply.
>
> I Did see some entries in the solrconfig.xml for adding custom
> reposneHandlers, queryParsers and queryResponseWriters.
>
> Bit could not find the one for adding the custom filter.
>
> Could you point to the exact location or syntax to be used.
>
> Thanks,
> Valiveti
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-Configure-Solr-to-pick-my-lucene-custom-filter-tp2331928p2334120.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: in-index representaton of tokens

2011-01-25 Thread Markus Jelsma
This should shed some light on the matter
http://lucene.apache.org/java/2_9_0/fileformats.html

> I am saying there is a list of tokens that have been parsed (a table of
> them) for each column? Or one for the whole index?
> 
>  Dennis Gearon
> 
> 
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a
> better idea to learn from others’ mistakes, so you do not have to make
> them yourself. from
> 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
> 
> 
> EARTH has a Right To Life,
> otherwise we all die.
> 
> 
> 
> - Original Message 
> From: Jonathan Rochkind 
> To: "solr-user@lucene.apache.org" 
> Sent: Tue, January 25, 2011 9:29:36 AM
> Subject: Re: in-index representaton of tokens
> 
> Why does it matter?  You can't really get at them unless you store them.
> 
> I don't know what "table per column" means, there's nothing in Solr
> architecture called a "table" or a "column". Although by column you
> probably mean more or less Solr "field".  There is nothing like a
> "table" in Solr.
> 
> Solr is still not an rdbms.
> 
> On 1/25/2011 12:26 PM, Dennis Gearon wrote:
> > So, the index is a list of tokens per column, right?
> > 
> > There's a table per column that lists the analyzed tokens?
> > 
> > And the tokens per column are represented as what, system integers? 32/64
> > bit unsigned ints?
> > 
> >   Dennis Gearon
> > 
> > Signature Warning
> > 
> > It is always a good idea to learn from your own mistakes. It is usually a
> >
> >better
> >
> > idea to learn from others’ mistakes, so you do not have to make them
> > yourself. from
> > 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
> > 
> > 
> > EARTH has a Right To Life,
> > otherwise we all die.


RE: DIH serialize

2011-01-25 Thread Papp Richard
Dear Stefan,

  thank you for your help! 
  Well, I wrote a small script; it is not JSON, but it works:
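
A rough sketch of such a DIH ScriptTransformer (the field name "serialized_data",
the key=value format and the entity details are assumptions, not Rich's actual
script):

  <dataConfig>
    <!-- dataSource definition omitted for brevity -->
    <script><![CDATA[
      function packRow(row) {
        // join every column into one key=value list stored in a single extra field
        var parts = [];
        var keys = row.keySet().toArray();
        for (var i = 0; i < keys.length; i++) {
          parts.push(keys[i] + '=' + row.get(keys[i]));
        }
        row.put('serialized_data', parts.join('||'));
        return row;
      }
    ]]></script>
    <document>
      <entity name="item" query="SELECT * FROM item" transformer="script:packRow">
        <field column="serialized_data" name="serialized_data"/>
      </entity>
    </document>
  </dataConfig>

The consuming PHP side can then split on '||' and '=' to rebuild the row.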

  

regards,
  Rich

-Original Message-
From: Stefan Matheis [mailto:matheis.ste...@googlemail.com] 
Sent: Tuesday, January 25, 2011 11:13
To: solr-user@lucene.apache.org
Subject: Re: DIH serialize

Rich,

i played around for a few minutes with Script-Transformers, but i have not
enough knowledge to get anything done right know :/
My Idea was: looping over the given row, which should be a Java HashMap or
something like that? and do sth like this (pseudo-code):

var row_data = []
for( var key in row )
{
  row_data.push( '"' + key + '" : '" + row[key] + '"' );
}
row.put( 'whatever_field', '{' + row_data.join( ',' ) + '}' );

Which should result in a json-object like {'key1':'value1', 'key2':'value2'}
- and that should be okay to work with?

Regards
Stefan

On Mon, Jan 24, 2011 at 7:53 PM, Papp Richard  wrote:

> Hi Stefan,
>
>  yes, this is exactly what I intend - I don't want to search in this field
> - just quicly return me the result in a serialized form (the search
> criteria
> is on other fields). Well, if I could serialize the data exactly as like
> the
> PHP serialize() does I would be maximally satisfied, but any other form in
> which I could compact the data easily into one field I would be pleased.
>  Can anyone help me? I guess the  is quite a good way, but I don't
> know which function should I use there to compact the data to be easily
> usable in PHP. Or any other method?
>
> thanks,
>  Rich
>
> -Original Message-
> From: Stefan Matheis [mailto:matheis.ste...@googlemail.com]
> Sent: Monday, January 24, 2011 18:23
> To: solr-user@lucene.apache.org
> Subject: Re: DIH serialize
>
> Hi Rich,
>
> i'm a bit confused after reading your post .. what exactly you wanna try
to
> achieve? Serializing (like http://php.net/serialize) your complete row
> into
> one field? Don't wanna search in them, just store and deliver them in your
> results? Does that make sense? Sounds a bit strange :)
>
> Regards
> Stefan
>
> On Mon, Jan 24, 2011 at 10:03 AM, Papp Richard  wrote:
>
> > Hi Dennis,
> >
> >  thank you for your answer, but didn't understand why you say it doesn't
> > need serialization. I'm with the option "C".
> >  but the main question is, how to put into one field a result of many
> > fields: "SELECT * FROM".
> >
> > thanks,
> >  Rich
> >
> > -Original Message-
> > From: Dennis Gearon [mailto:gear...@sbcglobal.net]
> > Sent: Monday, January 24, 2011 02:07
> > To: solr-user@lucene.apache.org
> > Subject: Re: DIH serialize
> >
> > Depends on your process chain to the eventual viewer/consumer of the
> data.
> >
> > The questions to ask are:
> >  A/ Is the data IN Solr going to be viewed or processed in its original
> > form:
> >  -->set stored = 'true'
> > --->no serialization needed.
> >  B/ If it's going to be anayzed and searched for separate from any other
> > field,
> >
> >  the analyzing will put it into  an unreadable form. If you need to
> see
> > it,
> > then
> > --->set indexed="true" and stored="true"
> > --->no serializaton needed.   C/ If it's NOT going to be viewed AS
> IS,
> > and
> > it's not going to be searched for AS IS,
> >   (i.e. other columns will be how the data is found), and you have
> > another,
> >
> >   serialzable format:
> >   -->set indexed="false" and stored="true"
> >   -->serialize AS PER THE INTENDED APPLICATION,
> >   not sure that Solr can do that at all.
> >  C/ If it's NOT going to be viewed AS IS, and it's not going to be
> searched
> > for
> > AS IS,
> >   (i.e. other columns will be how the data is found), and you have
> > another,
> >
> >   serialzable format:
> >   -->set indexed="false" and stored="true"
> >   -->serialize AS PER THE INTENDED APPLICATION,
> >   not sure that Solr can do that at all.
> >  D/ If it's NOT going to be viewed AS IS, BUT it's going to be searched
> for
> > AS
> > IS,
> >   (this column will be how the data is found), and you have another,
> >   serialzable format:
> >   -->you need to pu

RE: in-index representaton of tokens

2011-01-25 Thread Jonathan Rochkind
There aren't any tables involved. There's basically one list (per field) of 
unique tokens for the entire index, and also, a list for each token of which 
documents contain that token. Which is efficiently encoded, but I don't know 
the details of that encoding, maybe someone who does can tell you, or you can 
look at the lucene source, or get one of the several good books on lucene.  
These 'lists' are set up so you can efficiently look up a token, and see what 
documents contain that token.  That's basically what lucene does, the purpose 
of lucene. Oh, and then there's term positions and such too, so not only can 
you see what documents contain that token but you can do proximity searches and 
stuff. 
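
Very roughly, and purely as an illustration (this is not the actual file format):

  field "title":
    "apple" -> doc 3 (position 1), doc 7 (position 4)
    "milk"  -> doc 3 (position 2), doc 12 (position 1)
    "shake" -> doc 3 (position 3)

i.e. a dictionary of unique terms, and for each term a postings list of documents
(plus positions) that lucene can walk at query time.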

This all gets into lucene implementation details I am not familiar with though. 
 

Why do you want to know?  If you have specific concerns about disk space or RAM 
usage or something and how different schema choices affect it, ask them, and 
someone can probably tell you more easily than someone can explain the total 
architecture of lucene in a short listserv message. But, hey, maybe someone 
other than me can do that too!

From: Dennis Gearon [gear...@sbcglobal.net]
Sent: Tuesday, January 25, 2011 7:02 PM
To: solr-user@lucene.apache.org
Subject: Re: in-index representaton of tokens

I am saying there is a list of tokens that have been parsed (a table of them)
for each column? Or one for the whole index?

 Dennis Gearon


Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a better
idea to learn from others’ mistakes, so you do not have to make them yourself.
from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'


EARTH has a Right To Life,
otherwise we all die.



- Original Message 
From: Jonathan Rochkind 
To: "solr-user@lucene.apache.org" 
Sent: Tue, January 25, 2011 9:29:36 AM
Subject: Re: in-index representaton of tokens

Why does it matter?  You can't really get at them unless you store them.

I don't know what "table per column" means, there's nothing in Solr
architecture called a "table" or a "column". Although by column you
probably mean more or less Solr "field".  There is nothing like a
"table" in Solr.

Solr is still not an rdbms.

On 1/25/2011 12:26 PM, Dennis Gearon wrote:
> So, the index is a list of tokens per column, right?
>
> There's a table per column that lists the analyzed tokens?
>
> And the tokens per column are represented as what, system integers? 32/64 bit
> unsigned ints?
>
>   Dennis Gearon
>
>
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a
>better
> idea to learn from others’ mistakes, so you do not have to make them yourself.
> from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>
>
> EARTH has a Right To Life,
> otherwise we all die.
>



Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Erick Erickson
OK, try this.

Use some analysis chain for your field like:
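
(Something along these lines; the exact classes are assumed here, the point being
a single lowercased token per field value:)

  <fieldType name="suggest_term" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="blivet" type="suggest_term" indexed="true" multiValued="true"/>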






This can be a multiValued field, BTW.

now use the TermsComponent to fetch your data. See:
http://wiki.apache.org/solr/TermsComponent

and specify terms.prefix=apple e.g.
http://localhost:8983/solr/terms?terms.prefix=app&terms.fl=blivet

The return list should be what you want. Note that the returned
values will be lower cased, and you can only specify
lower case in your search term (all because of specifying
the lowercase filter in my example).

This should be very fast no matter what your index size, as the
return list size defaults to 10 (though you can specify different
numbers).

Best
Erick

On Tue, Jan 25, 2011 at 3:03 PM, johnnyisrael wrote:

>
> Hi Eric,
>
> What I want here is, lets say I have 3 documents like
>
> ["pineapple vers apple", "milk with apple", "apple milk shake" ]
>
> and If i search for "apple", it should return only "apple milk shake"
> because that term alone starts with the letter "apple" which I typed in. It
> should not bring others and if I type "milk" it should return only "milk
> with apple"
>
> I want an output Similar like a Google auto suggest.
>
> Is there a way to achieve  this without encapsulating with double quotes.
>
> Thanks,
>
> Johnny
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2333602.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Solr set up issues with Magento

2011-01-25 Thread Erick Erickson
There's almost no information to go on here. Please review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Tue, Jan 25, 2011 at 6:13 PM, Sandhya Padala  wrote:

> Thank you Markus. I have added few more fields to schema.xml.
>
> Now looks like the products are getting indexed. But no search results.
>
> In Magento if I configure to use SOlr as the search engine. Search is not
> returning any results.  If I change the search engine to use Magento's
> inbuilt MYSQL , Search results are returned.  Can you please direct me on
> where/how I  should start debug process.
>
> If I use Solr admin and enter the search query that doesn't return any
> results either.
>
> Thank you,
> Sandhya
>
> On Mon, Jan 24, 2011 at 4:11 PM, Markus Jelsma
> wrote:
>
> > Hi,
> >
> > You haven't defined the field in Solr's schema.xml configuration so it
> > needs to
> > be added first. Perhaps following the tutorial might be a good idea.
> >
> > http://lucene.apache.org/solr/tutorial.html
> >
> > Cheers.
> >
> > > Hello Team:
> > >
> > >
> > >   I am in the process of setting up Solr 1.4 with Magento ENterprise
> > > Edition 1.9.
> > >
> > > When I try to index the products I get the following error message.
> > >
> > > Jan 24, 2011 3:30:14 PM
> > org.apache.solr.update.processor.LogUpdateProcessor
> > > fini
> > > sh
> > > INFO: {} 0 0
> > > Jan 24, 2011 3:30:14 PM org.apache.solr.common.SolrException log
> > > SEVERE: org.apache.solr.common.SolrException: ERROR:unknown field
> > > 'in_stock' at
> > > org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.jav
> > > a:289)
> > > at
> > > org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpd
> > > ateProcessorFactory.java:60)
> > > at
> > > org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139)
> > > at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69)
> > > at
> > > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Co
> > > ntentStreamHandlerBase.java:54)
> > > at
> > > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandl
> > > erBase.java:131)
> > > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> > > at
> > > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter
> > > .java:338)
> > > at
> > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilte
> > > r.java:241)
> > > at
> > > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Appl
> > > icationFilterChain.java:244)
> > > at
> > > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationF
> > > ilterChain.java:210)
> > > at
> > > org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperV
> > > alve.java:240)
> > > at
> > > org.apache.catalina.core.StandardContextValve.invoke(StandardContextV
> > > alve.java:161)
> > > at
> > > org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.j
> > > ava:164)
> > > at
> > > org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.j
> > > ava:100)
> > > at
> > > org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:
> > > 550)
> > > at
> > > org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineVal
> > > ve.java:118)
> > > at
> > > org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.jav
> > > a:380)
> > > at
> > > org.apache.coyote.http11.Http11Processor.process(Http11Processor.java
> > >
> > > :243)
> > >
> > > at
> > > org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.proce
> > > ss(Http11Protocol.java:188)
> > > at
> > > org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.proce
> > > ss(Http11Protocol.java:166)
> > > at
> > > org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoin
> > > t.java:288)
> > > at
> > > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExec
> > > utor.java:886)
> > > at
> > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
> > > .java:908)
> > > at java.lang.Thread.run(Thread.java:662)
> > >
> > > Jan 24, 2011 3:30:14 PM org.apache.solr.core.SolrCore execute
> > > INFO: [] webapp=/solr path=/update params={wt=json} status=400 QTime=0
> > > Jan 24, 2011 3:30:14 PM org.apache.solr.update.DirectUpdateHandler2
> > > rollback INFO: start rollback
> > > Jan 24, 2011 3:30:14 PM org.apache.solr.update.DirectUpdateHandler2
> > > rollback INFO: end_rollback
> > > Jan 24, 2011 3:30:14 PM
> > org.apache.solr.update.processor.LogUpdateProcessor
> > > fini
> > > sh
> > > INFO: {rollback=} 0 16
> > > Jan 24, 2011 3:30:14 PM org.apache.solr.core.SolrCore execute
> > >
> > > I am a new to both Magento and SOlr. I could have done some thing
> stupid
> > > during installation. I really look forward for your help.
> > >
> > > Thank you,
> > > Sandhya
> >
>


DIH clean=false

2011-01-25 Thread cyang2010

I am not sure I really understand what is meant by clean=false.

In my understanding, for a full-import with the default clean=true, it will blow
away all documents in the existing index and then do a full import of the data from
a table into the index.  Is that right?

Then for clean=false, my understanding is that it won't blow away the existing
index.   For data that exists in both the index and the db table (matched by the
same uniqueKey) it will update the index data regardless of whether any field
actually changed.  For index data that does not exist in the table (comparing by
uniqueKey), it will be left in the index.   Is that correct?  Otherwise, what is
the difference from clean=true?

Looking forward to your thoughts on this.  Thanks!
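
For reference, the two invocations being compared would look roughly like this
(host, port and handler path assumed):

  http://localhost:8983/solr/dataimport?command=full-import&clean=true&commit=true
  http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true

with clean=true wiping the index before the import starts, and clean=false
importing on top of whatever is already there.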
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-clean-false-tp2351120p2351120.html
Sent from the Solr - User mailing list archive at Nabble.com.