SolrQuery and escaping special characters
Hi,
I am using Solr v1.4 and SolrJ on the client side. I am not sure how SolrJ behaves with respect to escaping special characters [1] in a query string. SolrJ URL-encodes the query string it sends to Solr. Do I need to escape special characters [1] when I construct a SolrQuery object or not?

For example, if I want to search for "http://example.com#foo" in a "uri" field, should I use:

 (a) SolrQuery query = new SolrQuery("uri:http://example.com#foo");
 (b) SolrQuery query = new SolrQuery("uri:http\\://example.com#foo");

which become, respectively:

 (a') q=uri%3Ahttp%3A%2F%2Fexample.com%23foo
 (b') q=uri%3Ahttp%5C%3A%2F%2Fexample.com%23foo

My understanding is that SolrJ users are expected to escape special characters themselves, therefore (b) is the correct way. If that is the case, what is the best way to escape a query string which might contain field names and URIs as field values?

Thanks,
Paolo

[1] http://lucene.apache.org/java/2_9_1/queryparsersyntax.html#Escaping%20Special%20Characters
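For reference, a minimal sketch of option (b) that escapes only the field value, using ClientUtils.escapeQueryChars from SolrJ (assuming that helper is available in your SolrJ version; the field name and the ':' that belong to the query syntax stay unescaped):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.util.ClientUtils;

  public class EscapeSketch {
      public static void main(String[] args) {
          String uri = "http://example.com#foo";
          // escapeQueryChars escapes characters that are special in the Lucene query syntax, such as ':'
          String escaped = ClientUtils.escapeQueryChars(uri);
          SolrQuery query = new SolrQuery("uri:" + escaped);
          System.out.println(query.getQuery());
      }
  }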
facet.method: enum vs. fc
Hi,
I am using Solr v1.4 and I am not sure which facet.method I should use. What should I use if I do not know in advance whether the number of values for a given field will be high or low? What are the pros and cons of facet.method=enum vs. facet.method=fc? When should I use enum rather than fc?

I have found some comments and suggestions here:

"enum enumerates all terms in a field, calculating the set intersection of documents that match the term with documents that match the query. This was the default (and only) method for faceting multi-valued fields prior to Solr 1.4."

"fc (stands for field cache), the facet counts are calculated by iterating over documents that match the query and summing the terms that appear in each document. This was the default method for single valued fields prior to Solr 1.4. The default value is fc (except for BoolField) since it tends to use less memory and is faster when a field has many unique terms in the index."
 -- http://wiki.apache.org/solr/SimpleFacetParameters#facet.method

"facet.method=enum [...] this is excellent for fields where there is a small set of distinct values. The average number of values per document does not matter. facet.method=fc [...] this is excellent for situations where the number of indexed values for the field is high, but the number of values per document is low. For multi-valued fields, a hybrid approach is used that uses term filters from the filterCache for terms that match many documents."
 -- http://wiki.apache.org/solr/SolrFacetingOverview

"If you are faceting on a field that you know only has a small number of values (say less than 50), then it is advisable to explicitly set this to enum. When faceting on multiple fields, remember to set this for the specific fields desired and not universally for all facets. The request handler configuration is a good place to put this."
 -- Book: "Solr 1.4 Enterprise Search Server", page 148

This is the part of the Solr code which deals with the facet.method parameter:

  if (enumMethod) {
    counts = getFacetTermEnumCounts([...]);
  } else {
    if (multiToken) {
      UnInvertedField uif = [...]
      counts = uif.getCounts([...]);
    } else {
      [...]
      if (per_segment) {
        [...]
        counts = ps.getFacetCounts([...]);
      } else {
        counts = getFieldCacheCounts([...]);
      }
    }
  }
 -- https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/java/org/apache/solr/request/SimpleFacets.java

See also:
 - http://stackoverflow.com/questions/2902680/how-well-does-solr-scale-over-large-number-of-facet-values

In the end, since I do not know in advance the number of distinct values for my fields, I went for facet.method=fc. Does this seem reasonable to you?

Thank you,
Paolo
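For what it's worth, the parameter can also be set per field from SolrJ along these lines (a minimal sketch; "category" is just an example field name, and f.<field>.facet.method is the standard per-field override form of the facet parameter):

  import org.apache.solr.client.solrj.SolrQuery;

  public class FacetMethodSketch {
      public static void main(String[] args) {
          SolrQuery q = new SolrQuery("*:*");
          q.setFacet(true);
          q.addFacetField("category");
          q.set("facet.method", "fc");              // default for all facet fields
          q.set("f.category.facet.method", "enum"); // override for a known low-cardinality field
          System.out.println(q);                    // prints the encoded request parameters
      }
  }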
Re: facet.method: enum vs. fc
Thank you Erick, your explanation was helpful. I'll stick with fc and come back to this later if I need further tuning.

Paolo

Erick Erickson wrote:
  Yep, that was probably the best choice. It's a classic time/space tradeoff. The enum method creates a bitset for *each* unique facet value. The bitset is (maxDoc / 8) bytes in size (I'm ignoring some overhead here). So if your facet field has 10 unique values and 8M documents, you'll use up 10M bytes or so; 20 unique values will use up 20M bytes, and so on. But this is very, very fast.

  fc, on the other hand, eats up cache for storing the string value of each unique value, plus various counter arrays (several bytes per doc). For most cases it will use less memory than enum, but it will be slower.

  I'd stick with fc for the time being and think about enum if 1) you have a good idea of what the number of unique terms is, or 2) you start to need to finely tune your speed.

  HTH
  Erick

  On Mon, Oct 11, 2010 at 11:30 AM, Paolo Castagna <castagna.li...@googlemail.com> wrote:
  [original question quoted above, trimmed]
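For the archives, Erick's back-of-the-envelope numbers work out like this (a tiny sketch using only the figures from his example):

  public class EnumMemorySketch {
      public static void main(String[] args) {
          long maxDoc = 8000000L;                    // documents in the index (Erick's example)
          long uniqueValues = 10;                    // distinct values in the facet field
          long bytesPerBitSet = maxDoc / 8;          // one bitset per unique value
          long totalBytes = uniqueValues * bytesPerBitSet;
          System.out.println(totalBytes + " bytes (~" + totalBytes / (1024 * 1024) + " MB)");
      }
  }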
Faceting and omitNorms=true
Hi,
I am not completely sure what the recommended settings are for fields used for faceting, with regard to omitNorms and positionIncrementGap. Should I use omitNorms="true"? What about positionIncrementGap?

At the moment I have this in my schema.xml:

  <field ... stored="false" multiValued="true" />

And I was thinking of changing it to:

  <field ... stored="false" multiValued="true" omitNorms="true" positionIncrementGap="100" />

There is documentation, but I did not find a definitive answer for omitNorms in relation to faceting:

"faceting: indexed='true'"
 -- http://wiki.apache.org/solr/FieldOptionsByUseCase

"Only full-text fields or fields that need an index-time boost need norms."
 -- http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/conf/schema.xml

Thank you,
Paolo
Re: Faceting and omitNorms=true
Thank you Markus for your quick reply. Do you have a recommendation or suggestion about positionIncrementGap?

Paolo

Markus Jelsma wrote:
  You can omit norms on fields that you only facet on. The matrix only lists mandatory values for those parameters.

  On Tuesday, October 12, 2010 10:15:35 am Paolo Castagna wrote:
  [original question quoted above, trimmed]
EmbeddedSolrServer with one core and schema.xml loaded via ClassLoader, is it possible?
Hi, I am trying to use EmbeddedSolrServer with just one core and I'd like to load solrconfig.xml, schema.xml and other configuration files from a jar via getResourceAsStream(...). I've tried to use SolrResourceLoader, but all my attempts failed with a RuntimeException: Can't find resource [...]. Is it possible to construct an EmbeddedSolrServer loading all the config files from a jar file? Thank you in advance for your help, Paolo
Re: EmbeddedSolrServer with one core and schema.xml loaded via ClassLoader, is it possible?
I've found two ways which allow me to load all the config files from a jar file; however, with the first solution I cannot specify the dataDir.

This is the first way:

  System.setProperty("solr.solr.home", solrHome);
  CoreContainer.Initializer initializer = new CoreContainer.Initializer();
  CoreContainer coreContainer = initializer.initialize();
  EmbeddedSolrServer server = new EmbeddedSolrServer(coreContainer, coreName);

This is what http://wiki.apache.org/solr/Solrj suggests; however, this way it is not possible to specify the dataDir, which is, by default, ${solr.solr.home}/data/index.

This is my attempt to do the same, but in a way that lets me specify the dataDir:

  System.setProperty("solr.solr.home", solrHome);
  System.setProperty("solr.core.dataDir", dataDir);
  CoreContainer coreContainer = new CoreContainer();
  SolrConfig solrConfig = new SolrConfig();
  IndexSchema indexSchema = new IndexSchema(solrConfig, null, null);
  SolrCore core = new SolrCore(dataDir, indexSchema);
  core.setName(coreName);
  coreContainer.register(core, false);
  EmbeddedSolrServer server = new EmbeddedSolrServer(coreContainer, coreName);

Do you see any problems with the second solution? Is there a better way?

Paolo

Paolo Castagna wrote:
  [original question quoted above, trimmed]
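In case it helps anyone searching the archives, here is a rough sketch of a third variant that builds SolrConfig and IndexSchema from streams obtained via getResourceAsStream(), so nothing needs to live on the file system except the data directory. The constructor signatures are the ones I believe Solr 1.4 exposes and the classpath resource names are assumptions; treat it as a sketch, not a recipe:

  import java.io.InputStream;
  import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
  import org.apache.solr.core.CoreContainer;
  import org.apache.solr.core.CoreDescriptor;
  import org.apache.solr.core.SolrConfig;
  import org.apache.solr.core.SolrCore;
  import org.apache.solr.core.SolrResourceLoader;
  import org.apache.solr.schema.IndexSchema;

  public class EmbeddedFromClasspathSketch {
      public static EmbeddedSolrServer create(String solrHome, String dataDir, String coreName) throws Exception {
          // solrconfig.xml and schema.xml are read from the classpath (e.g. packed in a jar)
          InputStream configIn = EmbeddedFromClasspathSketch.class.getResourceAsStream("/solr/conf/solrconfig.xml");
          InputStream schemaIn = EmbeddedFromClasspathSketch.class.getResourceAsStream("/solr/conf/schema.xml");

          SolrConfig solrConfig = new SolrConfig(solrHome, "solrconfig.xml", configIn);
          IndexSchema indexSchema = new IndexSchema(solrConfig, "schema.xml", schemaIn);

          CoreContainer coreContainer = new CoreContainer(new SolrResourceLoader(solrHome));
          CoreDescriptor descriptor = new CoreDescriptor(coreContainer, coreName, solrHome);
          SolrCore core = new SolrCore(coreName, dataDir, solrConfig, indexSchema, descriptor);
          coreContainer.register(coreName, core, false);
          return new EmbeddedSolrServer(coreContainer, coreName);
      }
  }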
Solr replication, HAproxy and data management
Hi,
we are using Solr v1.4.x with multiple cores and a master/slaves configuration. We also use HAProxy [1] to load balance search requests amongst the slaves. Finally, we use MapReduce to create new Solr indexes.

I'd like to share with you what I am doing when I need to:

 1. add a new index
 2. replace an existing index with a new/updated one
 3. add a slave
 4. remove a slave (or a slave died)

I am interested in knowing what the best practices are in these scenarios.

1. Add a new index

Copy the index onto the master, in the correct location. Use CREATE [2] to load the new index:
  http://host:port/solr/admin/cores?action=CREATE&name=[...]&instanceDir=[...]&dataDir=[...]
Use CREATE to create a new empty index/core on each slave.

2. Replace an existing index with a new/updated one

Copy the index onto the master, in the correct location. Use CREATE [2] to load the new index. Use SWAP [3] to swap the old index with the new one:
  http://host:port/solr/admin/cores?action=SWAP&core=[...]&other=[...]
Updates for that core on the master can continue during the operation, can't they?

Or: use UNLOAD [4] to remove the core from the master:
  http://host:port/solr/admin/cores?action=UNLOAD&core=[...]
Copy the index onto the master, in the correct location, and use CREATE [2] to load the new index. Updates for that core on the master are not possible during this window (but we queue updates, so for us it just means delaying a few updates for a few seconds).

Doing this I saw a strange thing, but I am not sure what the problem was: the index version and generation on the master were different from the index version and generation on the slave, yet replication did not happen. A RELOAD on the master seemed to trigger the replication.

Also... I know I should not do it, but... what happens if you swap the directories on disk while Solr is running?

3. Add a slave

Install/configure and start up a new slave. Use CREATE [2] to create new empty indexes/cores. The slave will start to replicate the indexes from the master. Add the new slave to the HAProxy pool. This way, however, I need to CREATE all the cores, one by one. Is there a way to replicate all the cores available on the master?

4. Remove a slave

Remove the slave from the HAProxy pool. Or HAProxy automatically removes it from the pool, if it is dead.

Does all this seem sensible to you? Do you have best practices or suggestions to share?

Thank you,
Paolo

[1] http://haproxy.1wt.eu/
[2] http://wiki.apache.org/solr/CoreAdmin#CREATE
[3] http://wiki.apache.org/solr/CoreAdmin#SWAP
[4] http://wiki.apache.org/solr/CoreAdmin#UNLOAD
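As an aside, the same CoreAdmin calls can be scripted from SolrJ rather than via raw URLs. A sketch follows; the static helpers on CoreAdminRequest are the ones I believe SolrJ 1.4 provides, the SWAP request is built by hand and the setOtherCoreName call is an assumption worth double-checking, and host names, core names and paths are placeholders:

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.request.CoreAdminRequest;
  import org.apache.solr.common.params.CoreAdminParams.CoreAdminAction;

  public class CoreAdminSketch {
      public static void main(String[] args) throws Exception {
          // Point at the Solr root (not at a specific core) for core admin operations
          SolrServer master = new CommonsHttpSolrServer("http://master:8983/solr");

          // 1. add a new index
          CoreAdminRequest.createCore("core-new", "/path/to/instanceDir", master);

          // 2. replace an existing index: CREATE the new core, then SWAP it with the old one
          CoreAdminRequest swap = new CoreAdminRequest();
          swap.setAction(CoreAdminAction.SWAP);
          swap.setCoreName("core-old");
          swap.setOtherCoreName("core-new");   // assumption: check this against your SolrJ version
          swap.process(master);

          // and/or UNLOAD a core that is no longer needed
          CoreAdminRequest.unloadCore("core-old", master);
      }
  }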
Re: Solr replication, HAproxy and data management
Paolo Castagna wrote:
  Hi,
  we are using Solr v1.4.x with multiple cores and a master/slaves configuration. We also use HAProxy [1] to load balance search requests amongst the slaves. Finally, we use MapReduce to create new Solr indexes. I'd like to share with you what I am doing when I need to:
   1. add a new index
   2. replace an existing index with a new/updated one
   3. add a slave
   4. remove a slave (or a slave died)
  I am interested in knowing what the best practices are in these scenarios.
  [...]
  Does all this seem sensible to you? Do you have best practices or suggestions to share?

Well, maybe those are two too-broad questions... so here is a very specific one, related to all this.

Let's say I have a Solr master with multiple cores and I want to add a new slave. Can I tell the slave to replicate all the indexes from the master? How?

Any comments/advice regarding my original message are still more than welcome.

Thank you,
Paolo
Backup/restore strategies for Solr cores and "legacy" Lucene applications
Hi,
I have an existing web application which is using Lucene (v2.1.0 and/or v2.4.x) and which I'd like to gradually migrate to Solr. I am already using multiple cores, master/slave replication and SolrJ to re-implement the current functionality.

One use case I have is: backup/restore of indexes.

I am thinking of using another Solr master (to which I will not submit updates) to expose the current Lucene indexes (on NFS) and allow them to be replicated to my real Solr master (which I am using to submit updates to). This way, I can reuse the restore capabilities I already have and propagate "restored" indexes to my Solr cluster.

To clarify:

  +---------------+        +----------------+        +--------+
  |  Solr master  |<-(r)---|  Solr master   |<-(r)---| Slaves |
  |  (read-only)  |        |  (read/write)  |        +--------+
  +---------------+        +----------------+
         ^                         ^
         |                         |
      +-------+              updates via SolrJ
      |  NFS  |
      +-------+
         ^
         |
   updates via legacy Lucene restore

  (r) = replicates from

Before trying to test this setup, I'd like to get some feedback and see whether there are issues with it and/or better alternatives.

Also, I am not 100% sure I can have Solr v1.4 running concurrently with other applications using Lucene (v2.1.0 and/or 2.4.x) pointing at the same index (on an NFS mounted file system). Is the "simple" lockType the recommended setting in this scenario?

Paolo
Re: Searching Lucene Indexes with Solr
Erick Erickson wrote:
  It is possible, but you have to take care to match Solr's schema with the structure of the documents in the Lucene index. The correct field names and query analyzers should be configured in schema.xml.

Is it possible to use Solr v1.4 together with a legacy Lucene (v2.1.0 and/or v2.4.x) application using the same index (on an NFS mounted file system)? Is the "simple" lockType the recommended setting in this scenario?

Thanks,
Paolo
Is it possible to have Lucene and Solr (or two Solr instances) pointing at the same index directory?
Hi,
(I know that this is probably not recommended and not a common scenario, but...)

Is it possible to have an application using Lucene and a separate (i.e. different JVM) instance of Solr both pointing at the same index, and to read/write the index from both applications?

I am trying (separately) two lockType settings in solrconfig.xml, "native" and "simple", and the corresponding NativeFSLockFactory and SimpleFSLockFactory with Lucene.

I have noticed that if I use:

  Directory dir = FSDirectory.open(new File(path), new SimpleFSLockFactory(path));

the lock file is called "writer.lock", while if I use:

  Directory dir = FSDirectory.open(new File(path));
  dir.setLockFactory(new SimpleFSLockFactory(path));

the lock file is called "lucene-{number}-writer.lock". Solr uses the second method to set a custom lock factory. Is the {number} supposed to be unique and always the same across different JVMs?

I have also noticed that when Solr starts it creates a lock file even if there are no updates or commits to the index. Why? Is this normal?

Finally, I would like to know if what I am doing is possible, what the potential problems are, and whether people with more experience with Lucene and Solr have suggestions on recommended settings or best practices.

Thanks,
Paolo
Re: solr best practice to submit many documents
Hi Brian,
I had similar questions when I began to try and evaluate Solr. If you use Java and SolrJ you might find these useful:

 - http://wiki.apache.org/solr/Solrj#Streaming_documents_for_an_update
 - http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html

I am also interested in knowing what the best and most efficient way to index a large number of documents is.

Paolo

Wawok, Brian wrote:
  Hello,

  I am using SOLR for some proof of concept work, and was wondering if anyone has guidance on a best practice.

  Background: nightly we get a delivery of a few 1000 reports. Each report is between 1 and 500,000 pages. For my proof of concept I am using a single 100,000 page report. I want to see how fast I can make SOLR handle this single report, and then see how we can scale out to meet the total indexing demand (if needed).

  Trial 1:
  1) Set up a SOLR server on server A with the default settings. Added a few new fields to index, including a full-text index of the report.
  2) Set up a simple Python script on server B. It splits the report into 100,000 small documents, pulls out a few key fields to be sent along to index, and uses a Python implementation of curl to shove the documents into the server (with 4 threads posting away).
  3) After all 100,000 documents are posted, we post a commit and let the server index.

  I was able to get this method to work, and it took around 340 seconds for the posting and 10 seconds for the indexing. I am not sure if that indexing speed is a red herring and it was really doing a little bit of the indexing during the posts, or what. Regardless, it seems less than ideal to make 100,000 requests to the server to index 100,000 documents.

  Does anyone have an idea of how to make this process more efficient? Should I look into making an XML document with 100,000 documents enclosed? Or what will give me the best performance? Will this be much better than what I am seeing with my post method? I am not against writing a custom parser on the SOLR side, but if there is already a way in SOLR to send many documents efficiently, that is better.

  Thanks!

  Brian Wawok
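In case it is useful, this is roughly what the streaming approach looks like from SolrJ (a minimal sketch; the URL, field names, document count, queue size and thread count are all placeholders, not recommendations):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BulkIndexSketch {
      public static void main(String[] args) throws Exception {
          // queueSize=1000, threads=4: illustrative values, tune for your hardware
          SolrServer server = new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 4);
          for (int i = 0; i < 100000; i++) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", "page-" + i);
              doc.addField("text", "page body goes here");  // placeholder content
              server.add(doc);  // documents are buffered and streamed in background threads
          }
          server.commit();
      }
  }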
Re: Benchmarking Solr
Hi,
I do not have an answer to your questions, but I have the same issue/problem you have.

It would be good if the Solr community would agree on and share an approach for benchmarking Solr. Indeed, it would be good to have a benchmark for "information retrieval" systems in general; AFAIK there isn't one. :-/

The content on the wiki [1] is better than nothing, but in practice more is needed, IMHO. I have seen JMeter being used in ElasticSearch [2]. Solr could do the same to help users and new adopters get started. Some guidelines/advice (I know it's hard) would be useful as well. I ended up writing my own "crappy" multi-threaded benchmarking tool.

Also, are you using Jetty? At a certain point, in particular when you are hitting the Solr cache and returning a large number of results, the transfer time is a significant part of your response time. Tuning Jetty or Tomcat or something else is essential. Are you using Jetty or Tomcat?

I would also be interested in understanding the impact of the slave polling interval on searches, and the impact of the number of slaves and the polling interval on updates on the master.

Paolo

[1] http://wiki.apache.org/solr/SolrPerformanceData
[2] http://github.com/elasticsearch/elasticsearch/tree/master/modules/benchmark/jmeter

Blargy wrote:
  I am about to deploy Solr into our production environment and I would like to do some benchmarking to determine how many slaves I will need to set up. Currently the only way I know how to benchmark is to use Apache Benchmark, but I would like to be able to send random requests to Solr... not just one request over and over. I have a sample data set of 5000 user-entered queries and I would like to be able to use AB to benchmark against all these random queries. Is this possible?

  FYI our current index is ~1.5 GB with ~5M documents and we will be using faceting quite extensively. Our average requests per day are ~2M. We will be running RHEL with about 8-12 GB RAM. Any idea how many slaves might be required to handle our load?

  Thanks
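For what it's worth, the "crappy" tool is not much more than this (a sketch; the queries file, thread count and URL are placeholders, and a real benchmark should at least separate warm-up from measurement and record latency percentiles):

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.util.ArrayList;
  import java.util.List;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

  public class QueryBenchSketch {
      public static void main(String[] args) throws Exception {
          final SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

          // one query string per line, e.g. the 5000 user-entered queries mentioned above
          List<String> queries = new ArrayList<String>();
          BufferedReader in = new BufferedReader(new FileReader("queries.txt"));
          for (String line; (line = in.readLine()) != null; ) queries.add(line);
          in.close();

          int threads = 4;  // illustrative
          ExecutorService pool = Executors.newFixedThreadPool(threads);
          long start = System.currentTimeMillis();
          for (final String q : queries) {
              pool.submit(new Runnable() {
                  public void run() {
                      try { server.query(new SolrQuery(q)); }
                      catch (Exception e) { e.printStackTrace(); }
                  }
              });
          }
          pool.shutdown();
          pool.awaitTermination(1, TimeUnit.HOURS);
          long elapsed = System.currentTimeMillis() - start;
          System.out.println(queries.size() + " queries in " + elapsed + " ms with " + threads + " threads");
      }
  }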
Re: Benchmarking Solr
Paolo Castagna wrote:
  I do not have an answer to your questions. But, I have the same issue/problem you have.

Some related threads:

 - http://markmail.org/message/pns4dtfvt54mu3vs
 - http://markmail.org/message/7on6lvabsosvj7bc
 - http://markmail.org/message/ftz7tkd7ekhnk4bc
 - http://markmail.org/message/db2cv3dzakdp23qm
 - http://markmail.org/message/m3x6ogkfdhcwae6z
 - http://markmail.org/message/xoe3ny7dldnx4wby
 - http://markmail.org/message/eoqty4ralk34rgzk

Paolo
Re: Benchmarking Solr
Shawn Heisey wrote:
  Anyone got a recommendation about where to put it on the wiki?

There are already two related pages:

 - http://wiki.apache.org/solr/SolrPerformanceFactors
 - http://wiki.apache.org/solr/SolrPerformanceData

Why not create a new page?

 - http://wiki.apache.org/solr/BenchmarkingSolr (?)

It would be good to have someone using JMeter share their config files as well.

Paolo
Re: LucidWorks Solr
Thanks for asking, I am interested as well in reading the responses to your questions.

Paolo

Andy wrote:
  Just wanted to know if anyone has used LucidWorks Solr.

   - How do you compare it to the standard Apache Solr?
   - The non-blocking IO of LucidWorks Solr -- is that for network IO or disk IO? What are its effects?
   - The LucidWorks website also talked about "significantly improved faceting performance" -- what improvements are they? How much improvement?

  Would you recommend using it?

  Thanks.
Can I use per field analyzers and dynamic fields?
Hi, I have an existing Lucene application which I want to port to Solr. A scenario I need to support requires me to use dynamic fields with Solr, since users can add new fields at runtime. At the same time, the existing Lucene application is using a PerFieldAnalyzerWrapper in order to use different analyzers for different fields. One possible solution (server side) requires a custom QParser which would use a PerFieldAnalyzerWrapper, but perhaps there is a better (client side only) way to do that. Do you have any suggestion on how I could use per field analyzers with dynamic fields? Regards, Paolo
Re: Can I use per field analyzers and dynamic fields?
Hi Erik,
first of all, thanks for your reply.

The "source" of my problems is the fact that I do not know the field names in advance. Users are allowed to decide their own field names; they can add new fields at runtime, and different Lucene documents might have different field names.

So, in addition to some custom and known field names, I have a dynamicField in my schema.xml file:

  <dynamicField name="*" ... />

The corresponding fieldType is:

  <fieldType ...>
    ...
  </fieldType>

This allows me to specify a fixed (i.e. it cannot change at runtime) and "common" (i.e. it's the same for everything matched by the dynamicField with name="*") set of analyzers.

At the same time, in my Lucene application, users are allowed to configure different analyzers per field at runtime. With Lucene I achieve this using a PerFieldAnalyzerWrapper at indexing time (i.e. IndexWriter and IndexModifier allow me to specify an Analyzer in their constructors) and at query time (i.e. QueryParser allows me to specify an Analyzer in its constructor).

Dynamic field patterns would allow me to create "groups" of different types of fields, but they would expose the users to the field patterns themselves and remove their freedom to choose field names as they want.

Perhaps another way to express my problem is: could I use a PerFieldAnalyzerWrapper in the fieldType section above? If I do that, how can I configure it at runtime?

Thanks again,
Paolo

On 5 May 2010 14:19, Erik Hatcher wrote:
> Paolo,
>
> Solr takes care of associating fields with the proper analysis defined in
> schema.xml already. This, of course, depends on which query parser you're
> using, but both the standard Solr query parser and dismax do the right thing
> analysis-wise automatically.
>
> But, I think you need to elaborate on what you're doing in your Lucene
> application to know more specifically. A dynamic field specification in
> Solr is associated with only a single field type, so you'll want to use
> different dynamic field patterns for different types of fields.
>
> Erik
>
> On May 5, 2010, at 9:14 AM, Paolo Castagna wrote:
>
>> [original question quoted above, trimmed]
Re: Can I use per field analyzers and dynamic fields?
On 5 May 2010 14:19, Erik Hatcher wrote:
> But, I think you need to elaborate on what you're doing in your Lucene
> application to know more specifically.

Hi Erik,
perhaps this is another way to explain, and maybe solve, my issue...

At query time (everything here is just an illustrative example):

  PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new WhitespaceAnalyzer());
  analyzer.addAnalyzer("title", new SimpleAnalyzer());
  analyzer.addAnalyzer("author", new StandardAnalyzer());
  ...
  // Lucene is doing the analysis client side...
  QueryParser parser = new QueryParser("", analyzer);
  Query lucene_query = parser.parse("title:dog title:The author:me author:the the cat is on the table");
  ...
  // the Solr query is built from the query string analyzed by Lucene
  SolrQuery solr_query = new SolrQuery();
  solr_query.setQuery(lucene_query.toString());

This way, I don't need to do the per-field analysis over dynamic fields with Solr (on the server side).

Similarly, but a little bit more convoluted, at indexing time:

  String value = "The CAT is on the table";

Instead of (i.e. the legacy/old existing Lucene application):

  IndexWriter writer = new IndexWriter(directory, analyzer);
  Document lucene_document = new Document();
  Field field = new Field("title", value, Field.Store.YES, Field.Index.TOKENIZED);
  lucene_document.add(field);
  writer.addDocument(lucene_document);

I will do something like:

  StringBuffer solr_value = new StringBuffer();
  TokenStream ts = analyzer.tokenStream("title", new StringReader(value));
  Token token;
  while ((token = ts.next()) != null) {
    solr_value.append(token.termText()).append(" ");
  }
  SolrInputDocument solr_document = new SolrInputDocument();
  solr_document.addField("title", solr_value.toString());
  ...

What do you think?

Thanks again,
Paolo
Re: Can I use per field analyzers and dynamic fields?
Hi,
thank you for your reply. What you suggested is a good idea and I am probably going to follow it.

However, I'd like to hear a comment on the approach of doing the parsing with Lucene and then constructing a SolrQuery from a Lucene Query:

  QueryParser parser = new QueryParser("", analyzer);
  Query lucene_query = parser.parse("title:dog title:The author:me author:the the cat is on the table");
  ...
  SolrQuery solr_query = new SolrQuery();
  solr_query.setQuery(lucene_query.toString());

What are the drawbacks of this approach?

Similarly, at indexing time:

  StringBuffer solr_value = new StringBuffer();
  TokenStream ts = analyzer.tokenStream("title", new StringReader(value));
  Token token;
  while ((token = ts.next()) != null) {
    solr_value.append(token.termText()).append(" ");
  }
  SolrInputDocument solr_document = new SolrInputDocument();
  solr_document.addField("title", solr_value.toString());
  ...

What are the drawbacks of this approach?

Paolo

Chris Hostetter wrote:
  : The "source" of my problems is the fact that I do not know in advance the
  : field names. Users are allowed to decide their own field names, they can,
  : at runtime, add new fields and different Lucene documents might have
  : different field names.

  I would suggest you abstract away the field names your users pick from the
  underlying field names you use when dealing with Solr -- so create the list
  of fieldTypes you want to support (with all of the individual analyzer
  configurations that are valid) and then create a dynamicField corresponding
  to each one.

  Then, if your user tells you they want an "author" field associated with the
  type "text_en", you can map that in your application to "author_text_en" at
  both indexing and query time.

  This will also let you map the same "logical field names" (from your user's
  perspective) to different "internal field names" (from Solr's perspective)
  based on usage -- searching the "author" field might be against
  "author_text_en" but sorting on "author" might use "author_string".

  (Some notes were drafted up a while back on making this kind of field name
  aliasing a feature of Solr, but nothing ever came of it...
  http://wiki.apache.org/solr/FieldAliasesAndGlobsInParams )

  -Hoss
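Just to make Hoss's suggestion concrete for myself, the aliasing could be as simple as something like this (a sketch; the field names, the type suffixes and the assumption of matching dynamicField patterns such as *_text_en and *_string in schema.xml are all illustrative):

  import java.util.HashMap;
  import java.util.Map;

  public class FieldAliasSketch {
      // logical (user-chosen) field name -> field type suffix, e.g. "author" -> "text_en"
      private final Map<String, String> typeByLogicalName = new HashMap<String, String>();

      public void register(String logicalName, String fieldType) {
          typeByLogicalName.put(logicalName, fieldType);
      }

      // rewrite the field name at both indexing and query time,
      // assuming a dynamicField pattern like *_text_en exists in schema.xml
      public String internalName(String logicalName) {
          return logicalName + "_" + typeByLogicalName.get(logicalName);
      }

      public static void main(String[] args) {
          FieldAliasSketch aliases = new FieldAliasSketch();
          aliases.register("author", "text_en");
          System.out.println(aliases.internalName("author"));  // author_text_en
      }
  }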
Re: Can I use per field analyzers and dynamic fields?
Chris Hostetter wrote:
  : However, I'd like to hear a comment on the approach of doing the parsing
  : using Lucene and then constructing a SolrQuery from a Lucene Query:

  I believe you are asking about doing this in the client code? Using the
  Lucene QueryParser to parse a string with an analyzer, then toString'ing
  that and sending it across the wire to Solr?

Yes.

  I would strongly advise against it.

Thank you.

  Query.toString() is intended purely as a debugging tool, not as a
  serialization mechanism. It's very possible for the toString() value of a
  query to not be useful in attempting to recreate the query -- particularly
  if the analyzer being used by Solr for the "re-parse" doesn't know to expect
  terms that have already been stemmed, or modified in the various ways the
  client may have done so (and if you have to go to all that work to make Solr
  know about what you've pre-analyzed, why not just let Solr do it for you?)

Is there a (better) way to construct a Solr SolrQuery object from a Lucene Query object?

  : Similarly, at indexing time: ...
  : What are the drawbacks of this approach?

  Hmmm... well, besides the drawback of doing all the hard work Solr will do
  for you, I suppose that as long as you are extremely careful to manage both
  the indexing side and the query side externally from Solr, then there is
  nothing wrong with this approach -- you would essentially just have a single
  field type in your schema.xml that would use a whitespace tokenizer -- but
  again, this would make you lose out on a lot of Solr's features (notably:
  the stored values in your index would be the post-analysis tokens, you would
  be forced to trust the clients 100% to send you clean data at index and
  query time instead of being able to configure it centrally, etc...)

The rationale for wanting to do all the analysis (both query time and indexing time) client side is that I have an application which is using Lucene and is already doing that, and it has some "unusual" requirements (i.e. almost all fields are dynamicFields with custom/configurable analyzers per field).

I completely agree with everything you said and with the "dangers" of doing the analysis client side and then letting Solr re-analyze it again server side. However, as you suggested, a simple whitespace tokenizer on the Solr side should be relatively safe.

Definitely, your previous suggestion of using dynamicFields for each of the possible analyzer configurations and transparently mapping field names with prefixes/postfixes to select the right dynamicField "type" is a better option.

  In short: I don't see any advantages, but I see a lot of room for error.

  -Hoss

Yep. Got it.

Paolo