SolrQuery and escaping special characters
Hi,
I am using Solr v1.4 and SolrJ on the client side. I am not sure how SolrJ behaves with respect to escaping special characters [1] in a query string. SolrJ URL-encodes the query string it sends to Solr. Do I need to escape special characters [1] when I construct a SolrQuery object or not?

For example, if I want to search for "http://example.com#foo" in a "uri" field, should I use:

 (a) SolrQuery query = new SolrQuery("uri:http://example.com#foo");
 (b) SolrQuery query = new SolrQuery("uri:http\\://example.com#foo");

which become, respectively:

 (a') q=uri%3Ahttp%3A%2F%2Fexample.com%23foo
 (b') q=uri%3Ahttp%5C%3A%2F%2Fexample.com%23foo

My understanding is that SolrJ users are expected to escape special characters themselves, therefore (b) is the correct way. If that is the case, what is the best way to escape a query string which might contain field names and URIs as field values?

Thanks,
Paolo

[1] http://lucene.apache.org/java/2_9_1/queryparsersyntax.html#Escaping%20Special%20Characters
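For reference, a minimal sketch of option (b) that escapes only the field value, using ClientUtils.escapeQueryChars from SolrJ (assuming that helper is available in your SolrJ version; the field name and the ':' that belong to the query syntax stay unescaped):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.util.ClientUtils;

  public class EscapeSketch {
      public static void main(String[] args) {
          String uri = "http://example.com#foo";
          // escapeQueryChars escapes characters that are special in the Lucene query syntax, such as ':'
          String escaped = ClientUtils.escapeQueryChars(uri);
          SolrQuery query = new SolrQuery("uri:" + escaped);
          System.out.println(query.getQuery());
      }
  }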
facet.method: enum vs. fc
Hi,
I am using Solr v1.4 and I am not sure which facet.method I should use. What should I use if I do not know in advance whether the number of values for a given field will be high or low? What are the pros and cons of facet.method=enum vs. facet.method=fc? When should I use enum rather than fc?

I have found some comments and suggestions here:

"enum enumerates all terms in a field, calculating the set intersection of documents that match the term with documents that match the query. This was the default (and only) method for faceting multi-valued fields prior to Solr 1.4."

"fc (stands for field cache), the facet counts are calculated by iterating over documents that match the query and summing the terms that appear in each document. This was the default method for single valued fields prior to Solr 1.4. The default value is fc (except for BoolField) since it tends to use less memory and is faster when a field has many unique terms in the index."
 -- http://wiki.apache.org/solr/SimpleFacetParameters#facet.method

"facet.method=enum [...] this is excellent for fields where there is a small set of distinct values. The average number of values per document does not matter. facet.method=fc [...] this is excellent for situations where the number of indexed values for the field is high, but the number of values per document is low. For multi-valued fields, a hybrid approach is used that uses term filters from the filterCache for terms that match many documents."
 -- http://wiki.apache.org/solr/SolrFacetingOverview

"If you are faceting on a field that you know only has a small number of values (say less than 50), then it is advisable to explicitly set this to enum. When faceting on multiple fields, remember to set this for the specific fields desired and not universally for all facets. The request handler configuration is a good place to put this."
 -- Book: "Solr 1.4 Enterprise Search Server", page 148

This is the part of the Solr code which deals with the facet.method parameter:

  if (enumMethod) {
    counts = getFacetTermEnumCounts([...]);
  } else {
    if (multiToken) {
      UnInvertedField uif = [...]
      counts = uif.getCounts([...]);
    } else {
      [...]
      if (per_segment) {
        [...]
        counts = ps.getFacetCounts([...]);
      } else {
        counts = getFieldCacheCounts([...]);
      }
    }
  }
 -- https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/java/org/apache/solr/request/SimpleFacets.java

See also:
 - http://stackoverflow.com/questions/2902680/how-well-does-solr-scale-over-large-number-of-facet-values

In the end, since I do not know in advance the number of distinct values for my fields, I went for facet.method=fc. Does this seem reasonable to you?

Thank you,
Paolo
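For what it's worth, the parameter can also be set per field from SolrJ along these lines (a minimal sketch; "category" is just an example field name, and f.<field>.facet.method is the standard per-field override form of the facet parameter):

  import org.apache.solr.client.solrj.SolrQuery;

  public class FacetMethodSketch {
      public static void main(String[] args) {
          SolrQuery q = new SolrQuery("*:*");
          q.setFacet(true);
          q.addFacetField("category");
          q.set("facet.method", "fc");              // default for all facet fields
          q.set("f.category.facet.method", "enum"); // override for a known low-cardinality field
          System.out.println(q);                    // prints the encoded request parameters
      }
  }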
Re: facet.method: enum vs. fc
Thank you Erick, your explanation was helpful. I'll stick with fc and come back to this later if I need further tuning.

Paolo

Erick Erickson wrote:
  Yep, that was probably the best choice. It's a classic time/space tradeoff. The enum method creates a bitset for *each* unique facet value. The bitset is (maxDoc / 8) bytes in size (I'm ignoring some overhead here). So if your facet field has 10 unique values and 8M documents, you'll use up 10M bytes or so; 20 unique values will use up 20M bytes, and so on. But this is very, very fast.

  fc, on the other hand, eats up cache for storing the string value of each unique value, plus various counter arrays (several bytes per doc). For most cases it will use less memory than enum, but it will be slower.

  I'd stick with fc for the time being and think about enum if 1) you have a good idea of what the number of unique terms is, or 2) you start to need to finely tune your speed.

  HTH
  Erick

  On Mon, Oct 11, 2010 at 11:30 AM, Paolo Castagna <castagna.li...@googlemail.com> wrote:
  [original question quoted above, trimmed]
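For the archives, Erick's back-of-the-envelope numbers work out like this (a tiny sketch using only the figures from his example):

  public class EnumMemorySketch {
      public static void main(String[] args) {
          long maxDoc = 8000000L;                    // documents in the index (Erick's example)
          long uniqueValues = 10;                    // distinct values in the facet field
          long bytesPerBitSet = maxDoc / 8;          // one bitset per unique value
          long totalBytes = uniqueValues * bytesPerBitSet;
          System.out.println(totalBytes + " bytes (~" + totalBytes / (1024 * 1024) + " MB)");
      }
  }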
Faceting and omitNorms=true
Hi,
I am not completely sure what the recommended settings are for fields used for faceting, with regard to omitNorms and positionIncrementGap. Should I use omitNorms="true"? What about positionIncrementGap?

At the moment I have this in my schema.xml:

  <field ... stored="false" multiValued="true" />

And I was thinking of changing it to:

  <field ... stored="false" multiValued="true" omitNorms="true" positionIncrementGap="100" />

There is documentation, but I did not find a definitive answer for omitNorms in relation to faceting:

"faceting: indexed='true'"
 -- http://wiki.apache.org/solr/FieldOptionsByUseCase

"Only full-text fields or fields that need an index-time boost need norms."
 -- http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/conf/schema.xml

Thank you,
Paolo
Re: Faceting and omitNorms=true
Thank you Markus for your quick reply. Do you have a recommendation or suggestion about positionIncrementGap?

Paolo

Markus Jelsma wrote:
  You can omit norms on fields that you only facet on. The matrix only lists mandatory values for those parameters.

  On Tuesday, October 12, 2010 10:15:35 am Paolo Castagna wrote:
  [original question quoted above, trimmed]
EmbeddedSolrServer with one core and schema.xml loaded via ClassLoader, is it possible?
Hi, I am trying to use EmbeddedSolrServer with just one core and I'd like to load solrconfig.xml, schema.xml and other configuration files from a jar via getResourceAsStream(...). I've tried to use SolrResourceLoader, but all my attempts failed with a RuntimeException: Can't find resource [...]. Is it possible to construct an EmbeddedSolrServer loading all the config files from a jar file? Thank you in advance for your help, Paolo
Re: EmbeddedSolrServer with one core and schema.xml loaded via ClassLoader, is it possible?
I've found two ways which allow me to load all the config files from a jar file; however, with the first solution I cannot specify the dataDir.

This is the first way:

  System.setProperty("solr.solr.home", solrHome);
  CoreContainer.Initializer initializer = new CoreContainer.Initializer();
  CoreContainer coreContainer = initializer.initialize();
  EmbeddedSolrServer server = new EmbeddedSolrServer(coreContainer, coreName);

This is what http://wiki.apache.org/solr/Solrj suggests; however, this way it is not possible to specify the dataDir, which is, by default, ${solr.solr.home}/data/index.

This is my attempt to do the same, but in a way that lets me specify the dataDir:

  System.setProperty("solr.solr.home", solrHome);
  System.setProperty("solr.core.dataDir", dataDir);
  CoreContainer coreContainer = new CoreContainer();
  SolrConfig solrConfig = new SolrConfig();
  IndexSchema indexSchema = new IndexSchema(solrConfig, null, null);
  SolrCore core = new SolrCore(dataDir, indexSchema);
  core.setName(coreName);
  coreContainer.register(core, false);
  EmbeddedSolrServer server = new EmbeddedSolrServer(coreContainer, coreName);

Do you see any problems with the second solution? Is there a better way?

Paolo

Paolo Castagna wrote:
  [original question quoted above, trimmed]
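In case it helps anyone searching the archives, here is a rough sketch of a third variant that builds SolrConfig and IndexSchema from streams obtained via getResourceAsStream(), so nothing needs to live on the file system except the data directory. The constructor signatures are the ones I believe Solr 1.4 exposes and the classpath resource names are assumptions; treat it as a sketch, not a recipe:

  import java.io.InputStream;
  import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
  import org.apache.solr.core.CoreContainer;
  import org.apache.solr.core.CoreDescriptor;
  import org.apache.solr.core.SolrConfig;
  import org.apache.solr.core.SolrCore;
  import org.apache.solr.core.SolrResourceLoader;
  import org.apache.solr.schema.IndexSchema;

  public class EmbeddedFromClasspathSketch {
      public static EmbeddedSolrServer create(String solrHome, String dataDir, String coreName) throws Exception {
          // solrconfig.xml and schema.xml are read from the classpath (e.g. packed in a jar)
          InputStream configIn = EmbeddedFromClasspathSketch.class.getResourceAsStream("/solr/conf/solrconfig.xml");
          InputStream schemaIn = EmbeddedFromClasspathSketch.class.getResourceAsStream("/solr/conf/schema.xml");

          SolrConfig solrConfig = new SolrConfig(solrHome, "solrconfig.xml", configIn);
          IndexSchema indexSchema = new IndexSchema(solrConfig, "schema.xml", schemaIn);

          CoreContainer coreContainer = new CoreContainer(new SolrResourceLoader(solrHome));
          CoreDescriptor descriptor = new CoreDescriptor(coreContainer, coreName, solrHome);
          SolrCore core = new SolrCore(coreName, dataDir, solrConfig, indexSchema, descriptor);
          coreContainer.register(coreName, core, false);
          return new EmbeddedSolrServer(coreContainer, coreName);
      }
  }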
Solr replication, HAproxy and data management
Hi,
we are using Solr v1.4.x with multiple cores and a master/slaves configuration. We also use HAProxy [1] to load balance search requests amongst the slaves. Finally, we use MapReduce to create new Solr indexes.

I'd like to share with you what I am doing when I need to:

 1. add a new index
 2. replace an existing index with a new/updated one
 3. add a slave
 4. remove a slave (or a slave died)

I am interested in knowing what the best practices are in these scenarios.

1. Add a new index

Copy the index onto the master, in the correct location. Use CREATE [2] to load the new index:
  http://host:port/solr/admin/cores?action=CREATE&name=[...]&instanceDir=[...]&dataDir=[...]
Use CREATE to create a new empty index/core on each slave.

2. Replace an existing index with a new/updated one

Copy the index onto the master, in the correct location. Use CREATE [2] to load the new index. Use SWAP [3] to swap the old index with the new one:
  http://host:port/solr/admin/cores?action=SWAP&core=[...]&other=[...]
Updates for that core on the master can continue during the operation, can't they?

Or: use UNLOAD [4] to remove the core from the master:
  http://host:port/solr/admin/cores?action=UNLOAD&core=[...]
Copy the index onto the master, in the correct location, and use CREATE [2] to load the new index. Updates for that core on the master are not possible during this window (but we queue updates, so for us it just means delaying a few updates for a few seconds).

Doing this I saw a strange thing, but I am not sure what the problem was: the index version and generation on the master were different from the index version and generation on the slave, yet replication did not happen. A RELOAD on the master seemed to trigger the replication.

Also... I know I should not do it, but... what happens if you swap the directories on disk while Solr is running?

3. Add a slave

Install/configure and start up a new slave. Use CREATE [2] to create new empty indexes/cores. The slave will start to replicate the indexes from the master. Add the new slave to the HAProxy pool. This way, however, I need to CREATE all the cores, one by one. Is there a way to replicate all the cores available on the master?

4. Remove a slave

Remove the slave from the HAProxy pool. Or HAProxy automatically removes it from the pool, if it is dead.

Does all this seem sensible to you? Do you have best practices or suggestions to share?

Thank you,
Paolo

[1] http://haproxy.1wt.eu/
[2] http://wiki.apache.org/solr/CoreAdmin#CREATE
[3] http://wiki.apache.org/solr/CoreAdmin#SWAP
[4] http://wiki.apache.org/solr/CoreAdmin#UNLOAD
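As an aside, the same CoreAdmin calls can be scripted from SolrJ rather than via raw URLs. A sketch follows; the static helpers on CoreAdminRequest are the ones I believe SolrJ 1.4 provides, the SWAP request is built by hand and the setOtherCoreName call is an assumption worth double-checking, and host names, core names and paths are placeholders:

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.request.CoreAdminRequest;
  import org.apache.solr.common.params.CoreAdminParams.CoreAdminAction;

  public class CoreAdminSketch {
      public static void main(String[] args) throws Exception {
          // Point at the Solr root (not at a specific core) for core admin operations
          SolrServer master = new CommonsHttpSolrServer("http://master:8983/solr");

          // 1. add a new index
          CoreAdminRequest.createCore("core-new", "/path/to/instanceDir", master);

          // 2. replace an existing index: CREATE the new core, then SWAP it with the old one
          CoreAdminRequest swap = new CoreAdminRequest();
          swap.setAction(CoreAdminAction.SWAP);
          swap.setCoreName("core-old");
          swap.setOtherCoreName("core-new");   // assumption: check this against your SolrJ version
          swap.process(master);

          // and/or UNLOAD a core that is no longer needed
          CoreAdminRequest.unloadCore("core-old", master);
      }
  }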
Re: Solr replication, HAproxy and data management
Paolo Castagna wrote:
  Hi,
  we are using Solr v1.4.x with multiple cores and a master/slaves configuration. We also use HAProxy [1] to load balance search requests amongst the slaves. Finally, we use MapReduce to create new Solr indexes. I'd like to share with you what I am doing when I need to:
   1. add a new index
   2. replace an existing index with a new/updated one
   3. add a slave
   4. remove a slave (or a slave died)
  I am interested in knowing what the best practices are in these scenarios.
  [...]
  Does all this seem sensible to you? Do you have best practices or suggestions to share?

Well, maybe those are two too-broad questions... so here is a very specific one, related to all this.

Let's say I have a Solr master with multiple cores and I want to add a new slave. Can I tell the slave to replicate all the indexes from the master? How?

Any comments/advice regarding my original message are still more than welcome.

Thank you,
Paolo
Backup/restore strategies for Solr cores and "legacy" Lucene applications
Hi,
I have an existing web application which is using Lucene (v2.1.0 and/or v2.4.x) and which I'd like to gradually migrate to Solr. I am already using multiple cores, master/slave replication and SolrJ to re-implement the current functionality.

One use case I have is: backup/restore of indexes.

I am thinking of using another Solr master (to which I will not submit updates) to expose the current Lucene indexes (on NFS) and allow them to be replicated to my real Solr master (which I am using to submit updates to). This way, I can reuse the restore capabilities I already have and propagate "restored" indexes to my Solr cluster.

To clarify:

  +---------------+        +----------------+        +--------+
  |  Solr master  |<-(r)---|  Solr master   |<-(r)---| Slaves |
  |  (read-only)  |        |  (read/write)  |        +--------+
  +---------------+        +----------------+
         ^                         ^
         |                         |
      +-------+              updates via SolrJ
      |  NFS  |
      +-------+
         ^
         |
   updates via legacy Lucene restore

  (r) = replicates from

Before trying to test this setup, I'd like to get some feedback and see whether there are issues with it and/or better alternatives.

Also, I am not 100% sure I can have Solr v1.4 running concurrently with other applications using Lucene (v2.1.0 and/or 2.4.x) pointing at the same index (on an NFS mounted file system). Is the "simple" lockType the recommended setting in this scenario?

Paolo
Re: Searching Lucene Indexes with Solr
Erick Erickson wrote:
  It is possible, but you have to take care to match Solr's schema with the structure of the documents in the Lucene index. The correct field names and query analyzers should be configured in schema.xml.

Is it possible to use Solr v1.4 together with a legacy Lucene (v2.1.0 and/or v2.4.x) application using the same index (on an NFS mounted file system)? Is the "simple" lockType the recommended setting in this scenario?

Thanks,
Paolo
Is it possible to have Lucene and Solr (or two Solr instances) pointing at the same index directory?
Hi,
(I know that this is probably not recommended and not a common scenario, but...)

Is it possible to have an application using Lucene and a separate (i.e. different JVM) instance of Solr both pointing at the same index, and to read/write the index from both applications?

I am trying (separately) two lockType settings in solrconfig.xml, "native" and "simple", and the corresponding NativeFSLockFactory and SimpleFSLockFactory with Lucene.

I have noticed that if I use:

  Directory dir = FSDirectory.open(new File(path), new SimpleFSLockFactory(path));

the lock file is called "writer.lock", while if I use:

  Directory dir = FSDirectory.open(new File(path));
  dir.setLockFactory(new SimpleFSLockFactory(path));

the lock file is called "lucene-{number}-writer.lock". Solr uses the second method to set a custom lock factory. Is the {number} supposed to be unique and always the same across different JVMs?

I have also noticed that when Solr starts it creates a lock file even if there are no updates or commits to the index. Why? Is this normal?

Finally, I would like to know if what I am doing is possible, what the potential problems are, and whether people with more experience with Lucene and Solr have suggestions on recommended settings or best practices.

Thanks,
Paolo
Re: solr best practice to submit many documents
Hi Brian,
I had similar questions when I began to try and evaluate Solr. If you use Java and SolrJ you might find these useful:

 - http://wiki.apache.org/solr/Solrj#Streaming_documents_for_an_update
 - http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html

I am also interested in knowing what the best and most efficient way to index a large number of documents is.

Paolo

Wawok, Brian wrote:
  Hello,

  I am using SOLR for some proof of concept work, and was wondering if anyone has guidance on a best practice.

  Background: nightly we get a delivery of a few 1000 reports. Each report is between 1 and 500,000 pages. For my proof of concept I am using a single 100,000 page report. I want to see how fast I can make SOLR handle this single report, and then see how we can scale out to meet the total indexing demand (if needed).

  Trial 1:
  1) Set up a SOLR server on server A with the default settings. Added a few new fields to index, including a full-text index of the report.
  2) Set up a simple Python script on server B. It splits the report into 100,000 small documents, pulls out a few key fields to be sent along to index, and uses a Python implementation of curl to shove the documents into the server (with 4 threads posting away).
  3) After all 100,000 documents are posted, we post a commit and let the server index.

  I was able to get this method to work, and it took around 340 seconds for the posting and 10 seconds for the indexing. I am not sure if that indexing speed is a red herring and it was really doing a little bit of the indexing during the posts, or what. Regardless, it seems less than ideal to make 100,000 requests to the server to index 100,000 documents.

  Does anyone have an idea of how to make this process more efficient? Should I look into making an XML document with 100,000 documents enclosed? Or what will give me the best performance? Will this be much better than what I am seeing with my post method? I am not against writing a custom parser on the SOLR side, but if there is already a way in SOLR to send many documents efficiently, that is better.

  Thanks!

  Brian Wawok
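In case it is useful, this is roughly what the streaming approach looks like from SolrJ (a minimal sketch; the URL, field names, document count, queue size and thread count are all placeholders, not recommendations):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BulkIndexSketch {
      public static void main(String[] args) throws Exception {
          // queueSize=1000, threads=4: illustrative values, tune for your hardware
          SolrServer server = new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 4);
          for (int i = 0; i < 100000; i++) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", "page-" + i);
              doc.addField("text", "page body goes here");  // placeholder content
              server.add(doc);  // documents are buffered and streamed in background threads
          }
          server.commit();
      }
  }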
Re: Benchmarking Solr
Hi,
I do not have an answer to your questions, but I have the same issue/problem you have.

It would be good if the Solr community would agree on and share an approach for benchmarking Solr. Indeed, it would be good to have a benchmark for "information retrieval" systems in general; AFAIK there isn't one. :-/

The content on the wiki [1] is better than nothing, but in practice more is needed, IMHO. I have seen JMeter being used in ElasticSearch [2]. Solr could do the same to help users and new adopters get started. Some guidelines/advice (I know it's hard) would be useful as well. I ended up writing my own "crappy" multi-threaded benchmarking tool.

Also, are you using Jetty? At a certain point, in particular when you are hitting the Solr cache and returning a large number of results, the transfer time is a significant part of your response time. Tuning Jetty or Tomcat or something else is essential. Are you using Jetty or Tomcat?

I would also be interested in understanding the impact of the slave polling interval on searches, and the impact of the number of slaves and the polling interval on updates on the master.

Paolo

[1] http://wiki.apache.org/solr/SolrPerformanceData
[2] http://github.com/elasticsearch/elasticsearch/tree/master/modules/benchmark/jmeter

Blargy wrote:
  I am about to deploy Solr into our production environment and I would like to do some benchmarking to determine how many slaves I will need to set up. Currently the only way I know how to benchmark is to use Apache Benchmark, but I would like to be able to send random requests to Solr... not just one request over and over. I have a sample data set of 5000 user-entered queries and I would like to be able to use AB to benchmark against all these random queries. Is this possible?

  FYI our current index is ~1.5 GB with ~5M documents and we will be using faceting quite extensively. Our average requests per day are ~2M. We will be running RHEL with about 8-12 GB RAM. Any idea how many slaves might be required to handle our load?

  Thanks
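For what it's worth, the "crappy" tool is not much more than this (a sketch; the queries file, thread count and URL are placeholders, and a real benchmark should at least separate warm-up from measurement and record latency percentiles):

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.util.ArrayList;
  import java.util.List;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;
  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

  public class QueryBenchSketch {
      public static void main(String[] args) throws Exception {
          final SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

          // one query string per line, e.g. the 5000 user-entered queries mentioned above
          List<String> queries = new ArrayList<String>();
          BufferedReader in = new BufferedReader(new FileReader("queries.txt"));
          for (String line; (line = in.readLine()) != null; ) queries.add(line);
          in.close();

          int threads = 4;  // illustrative
          ExecutorService pool = Executors.newFixedThreadPool(threads);
          long start = System.currentTimeMillis();
          for (final String q : queries) {
              pool.submit(new Runnable() {
                  public void run() {
                      try { server.query(new SolrQuery(q)); }
                      catch (Exception e) { e.printStackTrace(); }
                  }
              });
          }
          pool.shutdown();
          pool.awaitTermination(1, TimeUnit.HOURS);
          long elapsed = System.currentTimeMillis() - start;
          System.out.println(queries.size() + " queries in " + elapsed + " ms with " + threads + " threads");
      }
  }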
Re: Benchmarking Solr
Paolo Castagna wrote:
  I do not have an answer to your questions. But, I have the same issue/problem you have.

Some related threads:

 - http://markmail.org/message/pns4dtfvt54mu3vs
 - http://markmail.org/message/7on6lvabsosvj7bc
 - http://markmail.org/message/ftz7tkd7ekhnk4bc
 - http://markmail.org/message/db2cv3dzakdp23qm
 - http://markmail.org/message/m3x6ogkfdhcwae6z
 - http://markmail.org/message/xoe3ny7dldnx4wby
 - http://markmail.org/message/eoqty4ralk34rgzk

Paolo
Re: Benchmarking Solr
Shawn Heisey wrote:
  Anyone got a recommendation about where to put it on the wiki?

There are already two related pages:

 - http://wiki.apache.org/solr/SolrPerformanceFactors
 - http://wiki.apache.org/solr/SolrPerformanceData

Why not create a new page?

 - http://wiki.apache.org/solr/BenchmarkingSolr (?)

It would be good to have someone using JMeter share their config files as well.

Paolo
Re: LucidWorks Solr
Thanks for asking, I am interested as well in reading the responses to your questions.

Paolo

Andy wrote:
  Just wanted to know if anyone has used LucidWorks Solr.

   - How do you compare it to the standard Apache Solr?
   - The non-blocking IO of LucidWorks Solr -- is that for network IO or disk IO? What are its effects?
   - The LucidWorks website also talked about "significantly improved faceting performance" -- what improvements are they? How much improvement?

  Would you recommend using it?

  Thanks.
Can I use per field analyzers and dynamic fields?
Hi, I have an existing Lucene application which I want to port to Solr. A scenario I need to support requires me to use dynamic fields with Solr, since users can add new fields at runtime. At the same time, the existing Lucene application is using a PerFieldAnalyzerWrapper in order to use different analyzers for different fields. One possible solution (server side) requires a custom QParser which would use a PerFieldAnalyzerWrapper, but perhaps there is a better (client side only) way to do that. Do you have any suggestion on how I could use per field analyzers with dynamic fields? Regards, Paolo
Re: Can I use per field analyzers and dynamic fields?
Hi Erik,
first of all, thanks for your reply.

The "source" of my problems is the fact that I do not know the field names in advance. Users are allowed to decide their own field names; they can add new fields at runtime, and different Lucene documents might have different field names.

So, in addition to some custom and known field names, I have a dynamicField in my schema.xml file:

  <dynamicField name="*" ... />

The corresponding fieldType is:

  <fieldType ...>
    ...
  </fieldType>

This allows me to specify a fixed (i.e. it cannot change at runtime) and "common" (i.e. it's the same for everything matched by the dynamicField with name="*") set of analyzers.

At the same time, in my Lucene application, users are allowed to configure different analyzers per field at runtime. With Lucene I achieve this using a PerFieldAnalyzerWrapper at indexing time (i.e. IndexWriter and IndexModifier allow me to specify an Analyzer in their constructors) and at query time (i.e. QueryParser allows me to specify an Analyzer in its constructor).

Dynamic field patterns would allow me to create "groups" of different types of fields, but they would expose the users to the field patterns themselves and remove their freedom to choose field names as they want.

Perhaps another way to express my problem is: could I use a PerFieldAnalyzerWrapper in the fieldType section above? If I do that, how can I configure it at runtime?

Thanks again,
Paolo

On 5 May 2010 14:19, Erik Hatcher wrote:
> Paolo,
>
> Solr takes care of associating fields with the proper analysis defined in
> schema.xml already. This, of course, depends on which query parser you're
> using, but both the standard Solr query parser and dismax do the right thing
> analysis-wise automatically.
>
> But, I think you need to elaborate on what you're doing in your Lucene
> application to know more specifically. A dynamic field specification in
> Solr is associated with only a single field type, so you'll want to use
> different dynamic field patterns for different types of fields.
>
> Erik
>
> On May 5, 2010, at 9:14 AM, Paolo Castagna wrote:
>
>> [original question quoted above, trimmed]
Re: Can I use per field analyzers and dynamic fields?
On 5 May 2010 14:19, Erik Hatcher wrote:
> But, I think you need to elaborate on what you're doing in your Lucene
> application to know more specifically.

Hi Erik,
perhaps this is another way to explain, and maybe solve, my issue...

At query time (everything here is just an illustrative example):

  PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new WhitespaceAnalyzer());
  analyzer.addAnalyzer("title", new SimpleAnalyzer());
  analyzer.addAnalyzer("author", new StandardAnalyzer());
  ...
  // Lucene is doing the analysis client side...
  QueryParser parser = new QueryParser("", analyzer);
  Query lucene_query = parser.parse("title:dog title:The author:me author:the the cat is on the table");
  ...
  // the Solr query is built from the query string analyzed by Lucene
  SolrQuery solr_query = new SolrQuery();
  solr_query.setQuery(lucene_query.toString());

This way, I don't need to do the per-field analysis over dynamic fields with Solr (on the server side).

Similarly, but a little bit more convoluted, at indexing time:

  String value = "The CAT is on the table";

Instead of (i.e. the legacy/old existing Lucene application):

  IndexWriter writer = new IndexWriter(directory, analyzer);
  Document lucene_document = new Document();
  Field field = new Field("title", value, Field.Store.YES, Field.Index.TOKENIZED);
  lucene_document.add(field);
  writer.addDocument(lucene_document);

I will do something like:

  StringBuffer solr_value = new StringBuffer();
  TokenStream ts = analyzer.tokenStream("title", new StringReader(value));
  Token token;
  while ((token = ts.next()) != null) {
    solr_value.append(token.termText()).append(" ");
  }
  SolrInputDocument solr_document = new SolrInputDocument();
  solr_document.addField("title", solr_value.toString());
  ...

What do you think?

Thanks again,
Paolo
Re: Can I use per field analyzers and dynamic fields?
Hi,
thank you for your reply. What you suggested is a good idea and I am probably going to follow it.

However, I'd like to hear a comment on the approach of doing the parsing with Lucene and then constructing a SolrQuery from a Lucene Query:

  QueryParser parser = new QueryParser("", analyzer);
  Query lucene_query = parser.parse("title:dog title:The author:me author:the the cat is on the table");
  ...
  SolrQuery solr_query = new SolrQuery();
  solr_query.setQuery(lucene_query.toString());

What are the drawbacks of this approach?

Similarly, at indexing time:

  StringBuffer solr_value = new StringBuffer();
  TokenStream ts = analyzer.tokenStream("title", new StringReader(value));
  Token token;
  while ((token = ts.next()) != null) {
    solr_value.append(token.termText()).append(" ");
  }
  SolrInputDocument solr_document = new SolrInputDocument();
  solr_document.addField("title", solr_value.toString());
  ...

What are the drawbacks of this approach?

Paolo

Chris Hostetter wrote:
  : The "source" of my problems is the fact that I do not know in advance the
  : field names. Users are allowed to decide their own field names, they can,
  : at runtime, add new fields and different Lucene documents might have
  : different field names.

  I would suggest you abstract away the field names your users pick from the
  underlying field names you use when dealing with Solr -- so create the list
  of fieldTypes you want to support (with all of the individual analyzer
  configurations that are valid) and then create a dynamicField corresponding
  to each one.

  Then, if your user tells you they want an "author" field associated with the
  type "text_en", you can map that in your application to "author_text_en" at
  both indexing and query time.

  This will also let you map the same "logical field names" (from your user's
  perspective) to different "internal field names" (from Solr's perspective)
  based on usage -- searching the "author" field might be against
  "author_text_en" but sorting on "author" might use "author_string".

  (Some notes were drafted up a while back on making this kind of field name
  aliasing a feature of Solr, but nothing ever came of it...
  http://wiki.apache.org/solr/FieldAliasesAndGlobsInParams )

  -Hoss
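Just to make Hoss's suggestion concrete for myself, the aliasing could be as simple as something like this (a sketch; the field names, the type suffixes and the assumption of matching dynamicField patterns such as *_text_en and *_string in schema.xml are all illustrative):

  import java.util.HashMap;
  import java.util.Map;

  public class FieldAliasSketch {
      // logical (user-chosen) field name -> field type suffix, e.g. "author" -> "text_en"
      private final Map<String, String> typeByLogicalName = new HashMap<String, String>();

      public void register(String logicalName, String fieldType) {
          typeByLogicalName.put(logicalName, fieldType);
      }

      // rewrite the field name at both indexing and query time,
      // assuming a dynamicField pattern like *_text_en exists in schema.xml
      public String internalName(String logicalName) {
          return logicalName + "_" + typeByLogicalName.get(logicalName);
      }

      public static void main(String[] args) {
          FieldAliasSketch aliases = new FieldAliasSketch();
          aliases.register("author", "text_en");
          System.out.println(aliases.internalName("author"));  // author_text_en
      }
  }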
Re: Can I use per field analyzers and dynamic fields?
Chris Hostetter wrote:
  : However, I'd like to hear a comment on the approach of doing the parsing
  : using Lucene and then constructing a SolrQuery from a Lucene Query:

  I believe you are asking about doing this in the client code? Using the
  Lucene QueryParser to parse a string with an analyzer, then toString'ing
  that and sending it across the wire to Solr?

Yes.

  I would strongly advise against it.

Thank you.

  Query.toString() is intended purely as a debugging tool, not as a
  serialization mechanism. It's very possible for the toString() value of a
  query to not be useful in attempting to recreate the query -- particularly
  if the analyzer being used by Solr for the "re-parse" doesn't know to expect
  terms that have already been stemmed, or modified in the various ways the
  client may have done so (and if you have to go to all that work to make Solr
  know about what you've pre-analyzed, why not just let Solr do it for you?)

Is there a (better) way to construct a Solr SolrQuery object from a Lucene Query object?

  : Similarly, at indexing time: ...
  : What are the drawbacks of this approach?

  Hmmm... well, besides the drawback of doing all the hard work Solr will do
  for you, I suppose that as long as you are extremely careful to manage both
  the indexing side and the query side externally from Solr, then there is
  nothing wrong with this approach -- you would essentially just have a single
  field type in your schema.xml that would use a whitespace tokenizer -- but
  again, this would make you lose out on a lot of Solr's features (notably:
  the stored values in your index would be the post-analysis tokens, you would
  be forced to trust the clients 100% to send you clean data at index and
  query time instead of being able to configure it centrally, etc...)

The rationale for wanting to do all the analysis (both query time and indexing time) client side is that I have an application which is using Lucene and is already doing that, and it has some "unusual" requirements (i.e. almost all fields are dynamicFields with custom/configurable analyzers per field).

I completely agree with everything you said and with the "dangers" of doing the analysis client side and then letting Solr re-analyze it again server side. However, as you suggested, a simple whitespace tokenizer on the Solr side should be relatively safe.

Definitely, your previous suggestion of using dynamicFields for each of the possible analyzer configurations and transparently mapping field names with prefixes/postfixes to select the right dynamicField "type" is a better option.

  In short: I don't see any advantages, but I see a lot of room for error.

  -Hoss

Yep. Got it.

Paolo