Re: Problem with XML encode UFT-8
Hi,

Please explain some more.
a) What version of Solr?
b) Are you trying to feed XML or PDF?
c) What request handler are you feeding to? /update or /update/extract?
d) Can you copy/paste some more lines from the error log?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 21. feb. 2011, at 15.02, jayronsoares wrote:
>
> Hi, I'm using solrpy to store PDF files; however, when I run the script
> it shows me this issue:
>
> An invalid XML character (Unicode: 0xc) was found in the element content of
> the document.
>
> Could someone give some help?
>
> cheers
> jayron
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Any-new-python-libraries-tp493419p2545020.html
> Sent from the Solr - User mailing list archive at Nabble.com.
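The character in the quoted error, 0xC, is a form feed, which XML 1.0 does not allow in element content. A minimal sketch of one possible workaround, assuming the PDF text is extracted on the client side before being posted to /update as XML; the helper name `strip_invalid_xml_chars` is hypothetical and not part of solrpy:

```python
import re

# XML 1.0 allows tab, LF, CR and the ranges below; everything else
# (including 0x0C, form feed) is rejected by XML parsers.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text, replacement=" "):
    """Replace characters that are illegal in XML 1.0 (hypothetical helper)."""
    return _INVALID_XML_CHARS.sub(replacement, text)


# Extracted PDF text often carries a form feed at page breaks.
sample = "page one\x0cpage two"
cleaned = strip_invalid_xml_chars(sample)
```

The cleaned string can then be passed to whatever call adds the document. If the PDF is sent to /update/extract instead, the extraction happens on the server and this client-side step would not apply.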
Re: Faceting
Hi,

Even if the customer types a correct product name, how do you know that merchant A and merchant B have both registered that exact product in the same way? Merchant A may give the product name as "White Sony LCD TV XY123" while the other says "Sony XY123 LCD TV", colour=white.

If you're serious about a price comparison service, I think you need to invest in finding which products are the same before indexing, and then tagging them with some unique normalized name. Then, after a search, you show a facet with that normalized name, and only once the user has selected the correct facet can you be 100% certain that you're comparing apples to apples.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 22. feb. 2011, at 07.23, Praveen Parameswaran wrote:
> Hi,
> @Tommaso @Jan Høydahl Thanks for the response :)
>
> I've done it almost the way Tommaso suggested, and yes, it's about
> 70-80% accurate.
> I understand the contradiction in the search: the customer should find stuff
> without the exact right wording (recall) at the same time as you want the
> query to be precise (precision).
>
> In my scenario both cases are there as well, but mostly a customer would
> know which product name he is searching for and he will be interested in
> comparing the prices that different merchants offer. What I feel is that
> maybe the "Search" itself has to be classified based on the context.
>
> Will it be possible in Solr to have the below:
> 1. A customer uses the correct product name to search, and gets the accurate
> results.
> 2. A customer uses a keyword, or searches without the exact name, and gets
> the most relevant results.
>
> The 2nd part is fine as it's working well. The 1st part is where I'm struggling.
>
> thanks
> Praveen
>
> On Mon, Feb 21, 2011 at 5:23 PM, Tommaso Teofili wrote:
>
>> Hi Praveen,
>> as far as I understand you have to set the type of the field(s) you are
>> searching over to be conservative.
>> So for example you won't include stemmer and lowercase filters and will use
>> only a whitespace tokenizer; moreover, you should search with the default
>> operator set to AND.
>> Then faceting over those field(s) will depend on those type settings.
>> You may find the following wiki page useful:
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>> My 2 cents,
>>
>> 2011/2/21 Praveen Parameswaran
>>
>>> Hi,
>>>
>>> Is it possible to have 100% accuracy for facet counts using Solr? Since
>>> this is for a product price comparison site I would need the search to
>>> return accurate results. For example, if I search "sony lcd Tv" I do not
>>> want "sony Led Tv" to be returned in the results. Please let me know if
>>> this is possible and how.
>>>
>>> Thanks
>>>
>>> Prav
Configure 2 or more Tomcat instances.
I have a Tomcat 6.0 instance running on my system, with connector port 8090, shutdown port 8005, AJP/1.3 port 8009 and redirect port 8443 in server.xml (path = C:\Program Files\Apache Software Foundation\Tomcat 6.0\conf\server.xml).

How do I configure one more independent Tomcat instance on the same system? I went through many sites but couldn't fix this. If anyone knows the proper configuration steps, please reply.

Regards,
Rajani Maski
Re: Configure 2 or more Tomcat instances.
Hey Rajani,

From what I've seen, you just need to copy the Tomcat folder and change the following ports in server.xml: shutdown, connector, ajp. Then you can start them up independently.

Regards,
Jonathan

On Tue, Feb 22, 2011 at 3:15 PM, rajini maski wrote:
> I have a tomcat6.0 instance running in my system, with
> connector port-8090, shutdown port -8005 ,AJP/1.3 port-8009 and redirect
> port-8443 in server.xml (path = C:\Program Files\Apache Software
> Foundation\Tomcat 6.0\conf\server.xml)
>
> How do I configure one more independent tomcat instance
> in the same system..? I went through many sites.. but couldn't fix
> this. If anyone one know the proper configuration steps please reply..
>
> Regards,
> Rajani Maski
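The port change described in this reply can also be scripted. A minimal sketch, assuming the second instance's server.xml is a copy of the first and that every port should be moved by a fixed offset; the function name `shift_ports` and the sample XML are illustrative only, and real server.xml files contain more elements (comments in the file are dropped when it is re-serialized this way):

```python
import xml.etree.ElementTree as ET


def shift_ports(server_xml, offset):
    """Return server.xml text with all ports moved by `offset` (sketch)."""
    root = ET.fromstring(server_xml)
    # The port attribute on <Server> is the shutdown port.
    root.set("port", str(int(root.get("port")) + offset))
    # Each <Connector> (HTTP, AJP) has its own port; redirectPort is
    # shifted too so that it stays consistent within the copied instance.
    for connector in root.iter("Connector"):
        for attr in ("port", "redirectPort"):
            if connector.get(attr) is not None:
                connector.set(attr, str(int(connector.get(attr)) + offset))
    return ET.tostring(root, encoding="unicode")


# Cut-down stand-in for conf/server.xml using the ports from the question.
sample = """<Server port="8005" shutdown="SHUTDOWN">
  <Service name="Catalina">
    <Connector port="8090" protocol="HTTP/1.1" redirectPort="8443"/>
    <Connector port="8009" protocol="AJP/1.3" redirectPort="8443"/>
  </Service>
</Server>"""

shifted = ET.fromstring(shift_ports(sample, 100))
connector_ports = [c.get("port") for c in shifted.iter("Connector")]
```

With an offset of 100 the second instance would listen on 8190 (HTTP), 8109 (AJP) and 8105 (shutdown), leaving the first instance's ports untouched.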
solr indexing
Hi all,

Out of keen interest in Solr's indexing mechanism I started digging into the Solr indexing code (/update/extract). I have read about the index file formats and the scoring procedure, and I have a question about this.

1) The score is computed from dynamic and precalculated values (doc boost, field boost, lengthNorm). Suppose a term in the index matches nearly one million docs: does Solr calculate the score for each and every doc containing the term and then take the top docs from the index? Or is there some mechanism that limits score calculation to only particular docs?

If anybody knows about this, or knows of any documentation on it, please let me know.

Regards,
satya
Re: Any plan to make Field Collapsing available for distributed search?
(11/02/22 13:46), Andy wrote:
> Hello,
>
> I'm looking into Field Collapsing. According to the documentation one
> limitation is that "distributed search support for result grouping has
> not yet been implemented."
>
> Just wondered if there's any plan to add distributed search support to
> field collapsing. Or is there any technical obstacle that make such a
> feature unlikely?
>
> Thanks
> Andy

Andy,

There is an open ticket for it:
https://issues.apache.org/jira/browse/SOLR-2066

Koji
--
http://www.rondhuit.com/en/
disable replication in a persistent way
Hello,

solr/replication?command=disablepoll disables replication on slave(s). However, it is not persistent. After a solr/tomcat restart, slave(s) will continue polling.

Is there a built-in way to disable replication on the slave side in a persistent manner?

Currently I am using system property substitution along with a solrcore.properties file to simulate this.

${enable.slave:false}

#solrcore.properties in slave
enable.master=true

And I modify solrcore.properties with a custom Solr request handler after the disablepoll command, to make it persistent. It seems that there is no existing mechanism to write the solrconfig.properties file, am I correct?

Thanks,
Ahmet
Re: Datetime problems with dataimport
OK, I got it.

It should look like yyyy-mm-ddThh:mm:ssZ
for example: 2011-02-22T15:07:00Z
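The format found above can be produced programmatically. A minimal sketch, assuming the database hands back strings such as '2009-12-09 00:00:00' (one of the rejected values reported in this thread) and that they are already in UTC; both function names are illustrative:

```python
from datetime import datetime, timezone


def to_solr_date(dt):
    """Format a datetime in the yyyy-mm-ddThh:mm:ssZ form, e.g. 2011-02-22T15:07:00Z."""
    if dt.tzinfo is not None:
        # Convert aware datetimes to UTC; the trailing Z denotes UTC.
        dt = dt.astimezone(timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")


def db_string_to_solr_date(value):
    # Assumption: the database value looks like 'YYYY-MM-DD HH:MM:SS' in UTC.
    return to_solr_date(datetime.strptime(value, "%Y-%m-%d %H:%M:%S"))


print(db_string_to_solr_date("2009-12-09 00:00:00"))  # 2009-12-09T00:00:00Z
```

The same conversion can presumably be done inside the import configuration itself; this sketch only shows the target string shape.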
Re: Datetime problems with dataimport
I logged an issue in Jira that relates to this and it looks like Yonik picked it up.

https://issues.apache.org/jira/browse/SOLR-2286

Adam

On Feb 22, 2011, at 9:07 AM, MOuli wrote:
>
> Ok i got it.
>
> It should look like yyyy-mm-ddThh:mm:ssZ
> for example: 2011-02-22T15:07:00Z
Multiple Blocked threads on UnInvertedField.getUnInvertedField() & SegmentReader$CoreReaders.getTermsReader
Hi Solr Users,

We are upgrading from Solr 1.3 to Solr 1.4.1.
While using Solr 1.3, we were seeing multiple blocked active threads on "org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal()".

After upgrading to Solr 1.4.1 to utilize the benefits of NIO, we see a different set of blocked threads, on "org.apache.solr.request.UnInvertedField.getUnInvertedField() & SegmentReader$CoreReaders.getTermsReader".
Due to this, QTime shoots up from a few hundred to thousands of msec, even going up to 30-40 secs for a single query.

- The multiple blocked threads show up after a few thousand queries.
- We do not have faceting and sorting on the same fields.
- Our facet fields are multivalued text fields, but no large text values are present.
- Index size: around 10 GB.
- We have not specified any method for faceting in our schema.xml.
- Our field value cache settings are:
<fieldValueCache class="solr.FastLRUCache" size="175" autowarmCount="0" showItems="10" />

Can someone please tell us why we are seeing these blocked threads?
Also, if they are related to our field value cache, then a cache of size 175 will be filled up by very few initial queries, and right after that we should see multiple blocked threads?
What difference will it make if we have "facet.method = enum"?
Is this all related to fieldValueCache, or is there some other configuration which we need to set to avoid these blocked threads?
Thanks,
Rachita

*Cache values example:*
facetField1_27443 : {field=facet1_27443,memSize=4214884,tindexSize=52,time=22,phase1=15,nTerms=4,bigTerms=0,termInstances=6,uses=1}
facetField1_70 : {field=facetField1_70,memSize=4223310,tindexSize=308,time=28,phase1=21,nTerms=636,bigTerms=0,termInstances=14404,uses=1}
facetField2 : {field=facetField2,memSize=4262644,tindexSize=3156,time=273,phase1=267,nTerms=12188,bigTerms=0,termInstances=1255522,uses=7031}

*Stack trace for "org.apache.solr.request.UnInvertedField.getUnInvertedField() - BLOCKED"*
at org.apache.solr.request.UnInvertedField.getUnInvertedField (UnInvertedField.java:837)
at org.apache.solr.request.SimpleFacets.getTermCounts (SimpleFacets.java:250)
at org.apache.solr.request.SimpleFacets.getFacetFieldCounts (SimpleFacets.java:283)
at org.apache.solr.request.SimpleFacets.getFacetCounts (SimpleFacets.java:166)
at org.apache.solr.handler.component.FacetComponent.process (FacetComponent.java:72)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody (SearchHandler.java:195)
at org.apache.solr.handler.RequestHandlerBase.handleRequest (RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute (SolrCore.java:1316)
at org.apache.solr.servlet.SolrDispatchFilter.execute (SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter (SolrDispatchFilter.java:241)
at com.caucho.server.dispatch.FilterFilterChain.doFilter (FilterFilterChain.java:87)
at com.caucho.server.webapp.WebAppFilterChain.doFilter (WebAppFilterChain.java:187)
at com.caucho.server.dispatch.ServletInvocation.service (ServletInvocation.java:266)
at com.caucho.server.http.HttpRequest.handleRequest (HttpRequest.java:270)
at com.caucho.server.port.TcpConnection.run (TcpConnection.java:678)
at com.caucho.util.ThreadPool$Item.runTasks (ThreadPool.java:721)
at com.caucho.util.ThreadPool$Item.run (ThreadPool.java:643)
at java.lang.Thread.run (Thread.java:595)

*org.apache.lucene.index.SegmentReader$CoreReaders.getTermsReader() - BLOCKED*
at org.apache.lucene.index.SegmentReader$CoreReaders.getTermsReader (SegmentReader.java:170)
at org.apache.lucene.index.SegmentTermDocs.<init> (SegmentTermDocs.java:52)
at org.apache.lucene.index.SegmentReader.termDocs (SegmentReader.java:987)
at org.apache.lucene.index.IndexReader.termDocs (IndexReader.java:1102)
at org.apache.lucene.index.SegmentReader.termDocs (SegmentReader.java:981)
at org.apache.solr.search.SolrIndexReader.termDocs (SolrIndexReader.java:320)
at org.apache.solr.search.SolrIndexSearcher.getDocSetNC (SolrIndexSearcher.java:640)
at org.apache.solr.search.SolrIndexSearcher.getPositiveDocSet (SolrIndexSearcher.java:563)
at org.apache.solr.search.SolrIndexSearcher.numDocs (SolrIndexSearcher.java:1422)
at com.askme.solrenhancements.facet.ExtendedFacet.getCustomFacetCount (ExtendedFacet.java:132)
at com.askme.solrenhancements.facet.ExtendedFacet.getCustomFacetCount (ExtendedFacet.java:92)
at com.askme.solrenhancements.facet.ExtendedFacet.getFacetAdditionalInfo (ExtendedFacet.java:69)
at com.askme.solrenhancements.facet.ExtendedFacet.getFacetInfo (ExtendedFacet.java:56)
at com.askme.solrenhancements.facet.CustomFacetComponent.process (CustomFacetComponent.java:43)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody (SearchHandler.java:195)
at org.apache.solr.handler.RequestHandlerBase.handleRequest (RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute (SolrCore.java:1316)
at org.apache.solr.servlet.SolrDispatchFilter.execute (SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter (SolrDi
Re: Datetime problems with dataimport
Can you give me an example? Should it look like 2011-02-22'T'14:55:20, 2011-02-22T14:55:20, or 2011-02-22 14:55:20?

I tested every one of these formats, but got the exception anyway:

Invalid Date String:'2009-12-09'T'00:00:00'
Invalid Date String:'2009-12-09 00:00:00'
Invalid Date String:'2009-12-09T00:00:00'
Re: Configure 2 or more Tomcat instances.
Rajini,

you need to make the (~3) ports defined in conf/server.xml different.

paul

On 22 Feb 2011, at 12:15, rajini maski wrote:
> I have a tomcat6.0 instance running in my system, with
> connector port-8090, shutdown port -8005 ,AJP/1.3 port-8009 and redirect
> port-8443 in server.xml (path = C:\Program Files\Apache Software
> Foundation\Tomcat 6.0\conf\server.xml)
>
> How do I configure one more independent tomcat instance
> in the same system..? I went through many sites.. but couldn't fix
> this. If anyone one know the proper configuration steps please reply..
>
> Regards,
> Rajani Maski
Re: Multiple Blocked threads on UnInvertedField.getUnInvertedField() & SegmentReader$CoreReaders.getTermsReader
+1 for more investigation

Bill Bell
Sent from mobile
Re: Question About Highlighting
Hi All,

I even tried that (appending &hl.usePhraseHighlighter=true) but it still does not work.

Please help.

Regards
Ahsan Iqbal

On Fri, Feb 18, 2011 at 12:30 AM, Ahmet Arslan wrote:
> > I had a requirement to implement phrase proximity like ["a
> > b c" w/5 "d e f"] for
> > this i have implemented a custom query parser plug in which
> > I make use of nested
> > span queries to fulfill this requirement. Now it looks that
> > documents are
> > filtered correctly, but there is an issue in highlighting
> > that also highlights
> > the terms that are alone(not in phrase), can some body
> > suggest me a fix to this
> > issue
>
> Appending &hl.usePhraseHighlighter=true should work.
Question about Nested Span Near Query
Hi All,

I have a requirement to implement queries that involve phrase proximity. For example, a user should be able to search "ab cd" w/5 "de fg", meaning both phrases as a whole should be within 5 words of each other. For this I implemented a query parser that makes use of nested span queries, so the above query would be parsed as

spanNear([spanNear([Contents:ab, Contents:cd], 0, true), spanNear([Contents:de, Contents:fg], 0, true)], 5, false)

Queries like this seem to work really well when the phrases are small, but when the phrases are large this doesn't work properly. Now my question: is there any limitation of SpanNearQuery such that we cannot handle large phrases in this way?

Please help.

Regards
Ahsan
Tokenizer that Protects Phrases
Hi, I am trying to tokenize a string field of products. Two different products are: "camera", "security camera". What I would like is for "security camera" to be treated differently to "camera" - and only be displayed when the search is for "security camera", otherwise, the results should only display "camera". In other words, even though they share the English word "camera", their meanings are different. Now my guess about the best way to deal with this is just to manually provide a file of words that together is a token. For ex. "laptop battery", "security camera". Kind of like protwords, but like protphrases. Is this a good idea to solve this problem? How do I implement it if it is the right way? If there is a better way of dealing with this what is it? Thanks for your time, David
Re: Multiple Blocked threads on UnInvertedField.getUnInvertedField() & SegmentReader$CoreReaders.getTermsReader
On Tue, Feb 22, 2011 at 9:13 AM, Rachita Choudhary wrote:
> Hi Solr Users,
>
> We are upgrading from Solr 1.3 to Solr 1.4.1.
> While using Solr 1.3 , we were seeing multiple blocking active threads on
> "org.apache.lucene.store.FSDirectory$FSIndexInput.readInternal() ".
>
> To utilize the benefits of NIO, on upgrading to Solr 1.4.1, we see other
> type of multiple blocking threads on
> "org.apache.solr.request.UnInvertedField.getUnInvertedField() &
> SegmentReader$CoreReaders.getTermsReader".
> Due to this, the QTimes shoots up from few hundreds to thousand of
> msec.. even going upto 30-40 secs for a single query.
>
> - The multiple blocking threads show up after few thousands of queries.
> - We do not have faceting and sorting on the same fields.
> - Our facet fields are multivalued text fields, but no large text values are
> present.
> - Index size - around 10 GB
> - We have not specified any method for faceting in our schema.xml.
> - Our field value cache settings are:
> <fieldValueCache
> class="solr.FastLRUCache"
> size="175"
> autowarmCount="0"
> showItems="10"
> />
>
> Can someone please tell us the why we are seeing these blocked threads ?
> Also if they are related to our field value cache , then a cache of size 175
> will be filled up with very few initial queries and right after that we
> should see multiple blocking threads ?
> What difference it will make if we have "facet.method = enum" ?

The fc method on a multivalued field instantiates an UnInvertedField (like a multi-valued field cache), which can take some time. Just like sorting, you may want to use some warming faceting queries to make sure that real queries don't pay the cost of the initial entry construction.

From your fieldValueCache statistics, it looks like the number of terms is low enough that the enum method may be fine here.

-Yonik
http://lucidimagination.com

> Is this all related to fieldValueCache or is there some other configuration
> which we need to set to avoid these blocking threads?
>
> Thanks,
> Rachita
>
> *Cache values example:*
> facetField1_27443 :
> {field=facet1_27443,memSize=4214884,tindexSize=52,time=22,phase1=15,nTerms=4,bigTerms=0,termInstances=6,uses=1}
>
> facetField1_70 :
> {field=facetField1_70,memSize=4223310,tindexSize=308,time=28,phase1=21,nTerms=636,bigTerms=0,termInstances=14404,uses=1}
>
> facetField2 :
> {field=facetField2,memSize=4262644,tindexSize=3156,time=273,phase1=267,nTerms=12188,bigTerms=0,termInstances=1255522,uses=7031}
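Both suggestions in this reply (switching to the enum method, and issuing warming facet queries) can be tried from a client before touching solrconfig.xml. A minimal sketch that only builds the request URLs, assuming a Solr instance at the hypothetical address below and the field names from the quoted cache statistics; `facet.method` and the per-field form `f.<field>.facet.method` are standard Solr facet parameters:

```python
from urllib.parse import urlencode

# Hypothetical base URL; adjust host, port and core path for a real setup.
SOLR_SELECT = "http://localhost:8983/solr/select"


def facet_url(query, fields, method=None):
    """Build a facet-only request URL (rows=0), optionally forcing a method."""
    params = [("q", query), ("rows", "0"), ("facet", "true")]
    for field in fields:
        params.append(("facet.field", field))
        if method:
            # Per-field override, e.g. f.facetField2.facet.method=enum
            params.append(("f.%s.facet.method" % field, method))
    return SOLR_SELECT + "?" + urlencode(params)


# Try the enum method on one field to compare QTime against the default.
enum_url = facet_url("*:*", ["facetField2"], method="enum")

# A warming-style request: faceting on the fields real queries will use,
# so the UnInvertedField entries are built before user traffic arrives.
warm_url = facet_url("*:*", ["facetField1_70", "facetField2"])
```

For warming to survive restarts and commits, equivalent queries would normally be registered on the firstSearcher and newSearcher listeners in solrconfig.xml rather than sent by hand.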
Snipet in results
Hi,

I would like to have a Google-like snippet of 2-3 lines from each doc in my search results.

Something like:

TITLE <- full title of doc
Description <- extracts the sentence or some text before and after the keywords, with highlighting, and merges a couple of these extracted pieces together

Thanks for your help,

Rosa
Sorting - bad performance
The performance factors wiki says: "If you do a lot of field based sorting, it is advantageous to add explicitly warming queries to the "newSearcher" and "firstSearcher" event listeners in your solrconfig which sort on those fields, so the FieldCache is populated prior to any queries being executed by your users."

I've got an index with 24+ million docs of forum posts from users. I want to be able to get a given user's posts sorted by date. It's taking 20 seconds right now. What would I put in the newSearcher/firstSearcher listeners to make that quicker? Is there any other general approach I can use to speed up sorting?

The schema looks like (cistring is a case-insensitive string type I created):
UpdateProcessor and copyField
Can fields created by copyField instructions be processed by UpdateProcessors? Or only raw input fields can? So far my experiment is suggesting the latter. T. "Kuro" Kurosaka
Indexing languages, dataimporthandler
Hello all,

I have just gone through the mailing list and have set up my different field type analysers for my 6 different languages in my schema.xml. Here is my question. I am using the DataImportHandler to import data from my database into my index. In my table, the documentname column's data can be in any of the 6 languages. Let's say I want to index this data and apply the different language analysers for certain cases: what would be the best way in my case? The real problem is that I do not know the language of the string in the documentname column when I create my index, therefore I cannot apply the correct field type. Should I create a custom transformer?

Thanks

Greg
Re: Snipet in results
http://wiki.apache.org/solr/HighlightingParameters

[ ]'s
Leonardo Souza
Linux user #375225
http://counter.li.org/

On Tue, Feb 22, 2011 at 3:39 PM, Rosa (Anuncios) <rosaemailanunc...@gmail.com> wrote:
> Hi,
>
> I would like to have a google similar snipet of 2-3 lines of docs in my
> search results.
>
> Something like:
>
> TITLE <- full title of doc
> Description <- that extract the sentence or some text before and after
> keywords with highlightining and merge a couple of these extracted piece
> together
>
> Thanks for your help,
>
> Rosa
DIH and updating specific record
Hi all- I am trying to determine if there is a way to tell Solr to update its index with a specific ID to a record in the database. All the examples and documentation seems to discuss using a "last updated" date/time field, but in this case modifying the table would not be an option. Instead, I'd like to invoke Solr's DIH delta query with a specific ID to say "here's something new or updated, please update your index with it". I apologize if this is a trivial thing, but I can't seem to find any documentation on how to do it. Thanks, Ron DISCLAIMER: This electronic message, including any attachments, files or documents, is intended only for the addressee and may contain CONFIDENTIAL, PROPRIETARY or LEGALLY PRIVILEGED information. If you are not the intended recipient, you are hereby notified that any use, disclosure, copying or distribution of this message or any of the information included in or with it is unauthorized and strictly prohibited. If you have received this message in error, please notify the sender immediately by reply e-mail and permanently delete and destroy this message and its attachments, along with any copies thereof. This message does not create any contractual obligation on behalf of the sender or Law Bulletin Publishing Company. Thank you.
RE: XML Stripping from DIH
Thanks a lot! I thought I'd looked on this page but didn't see this one, not sure why. I greatly appreciate it!

Ron

-----Original Message-----
From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
Sent: Sunday, February 20, 2011 5:59 AM
To: solr-user@lucene.apache.org
Subject: Re: XML Stripping from DIH

Ron,

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
> From: "Olson, Ron"
> To: "solr-user@lucene.apache.org"
> Sent: Fri, February 18, 2011 4:05:15 PM
> Subject: XML Stripping from DIH
>
> Hi all-
>
> I have some XML in a database that I am trying to index and store; I am
> interested in the various pieces of text, but none of the tags. I've been
> trying to figure out a way to strip all the tags out, but haven't found
> anything within Solr to do so; the XML parser seems to want XPath to get the
> various element values, when all I want is to turn the whole thing into one
> blob of text, regardless of whether it makes any "contextual" sense.
>
> Is there something in Solr to do this, or is it something I'd have to write
> myself (which I'm willing to do if necessary)?
>
> Thanks for any info,
>
> Ron
Re: Indexing languages, dataimporthandler
Greg,

You could use copyField to copy the column in question to 6 fields, one for each of your 6 languages, and hope that each of the analyzers does something reasonable without crashing. Or apply the white-space tokenizer and hope for the best?

If the column has long enough text, you could try a language detector. My company, Basis Technology, sells one, and it can plug into Solr easily.
http://www.basistech.com/language-identification/

On 2/22/11 11:50 AM, "Greg Georges" wrote:

>Hello all,
>
>I have just gone through the mailing list and have set up my different
>field type analysers for my 6 different languages in my shema.xml. Here
>is my question. I am using the dataimporthandler to import data from my
>database into my index. In my table, the documentname column's data can
>be in any of the 6 languages. Lets say I want to index this data and
>apply the different language analysers for certain cases, what would be
>the best way in my case. The real problem is that I do not know the
>language of the string in the documentname column once I create my index,
>therefore I cannot apply the correct field type. Should I create a custom
>transformer?
>
>Thanks
>
>Greg

T. "Kuro" Kurosaka, 415-227-9600x122, 617-386-7122(direct)
Re: Passing parameters to DataImportHandler
: It'd be nice to be able to pass HTTP parameters into DataImportHandler
: that'd be passed into the SQL as parameters, is this possible?

there is a specific sub-section about this in the docs...

http://wiki.apache.org/solr/DataImportHandler#Accessing_request_parameters

-Hoss
Sort Stability With Date Boosting and Rounding
I'm trying to use http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents as a bf parameter to my dismax handler. The problem is, the value of NOW can cause documents in a similar range (date value within a few seconds of each other) to sometimes round to be equal, and sometimes not, changing their sort order (when equal, falling back to a secondary sort). This, in turn, screws up paging. The problem is that score is rounded to a lower level of precision than what the suggested formula produces as a difference between two values within seconds of each other. It seems to me if I could round the value to minutes or hours, where the difference will be large enough to not be rounded-out, then I wouldn't have problems with order changing on me. But it's not legal syntax to specify something like: recip(ms(NOW,manufacturedate_dt/HOUR),3.16e-11,1,1) Is this a problem anyone has faced and solved? Anyone have suggested solutions, other than indexing a copy of the date field that's rounded to the hour? -- Stephen Duncan Jr www.stephenduncanjr.com
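The fallback mentioned at the end of the question (indexing a copy of the date field rounded to the hour) could be prepared on the client side before the document is sent to Solr. The following is only a sketch under assumptions: the target field name `manufacturedate_hour_dt` and the helper names are made up for illustration, and the input is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timezone


def floor_to_hour(dt: datetime) -> datetime:
    """Convert to UTC and drop minutes, seconds and microseconds (never rounds up)."""
    return dt.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)


def solr_date(dt: datetime) -> str:
    """Format a UTC datetime in the form Solr date fields expect."""
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")


def with_hourly_copy(doc: dict, source_field: str, target_field: str) -> dict:
    """Return a copy of doc with an extra field holding the hour-truncated date."""
    out = dict(doc)
    out[target_field] = solr_date(floor_to_hour(doc[source_field]))
    return out


doc = {
    "id": "1",
    "manufacturedate_dt": datetime(2011, 2, 22, 18, 3, 41, tzinfo=timezone.utc),
}
# "manufacturedate_hour_dt" is a hypothetical field name for the rounded copy.
indexed = with_hourly_copy(doc, "manufacturedate_dt", "manufacturedate_hour_dt")
print(indexed["manufacturedate_hour_dt"])
```

The boost function would then read the rounded field, so two documents created within the same hour always get an identical boost and always fall through to the secondary sort.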
RE: DIH and updating specific record
Chris Hostetter answered this just recently: http://wiki.apache.org/solr/DataImportHandler#Accessing_request_parameters My addition: Pass a parameter like command=delta-import&idz=31415 and access it via 'sql where id=${dataimporter.request.idz}'. If the idz is a string you might need to quote the idz value yourself. -----Original Message----- From: Olson, Ron [mailto:rol...@lbpc.com] Sent: Tuesday, February 22, 2011 3:18 PM To: solr-user@lucene.apache.org Subject: DIH and updating specific record Hi all- I am trying to determine if there is a way to tell Solr to update its index with a specific ID to a record in the database. All the examples and documentation seem to discuss using a "last updated" date/time field, but in this case modifying the table would not be an option. Instead, I'd like to invoke Solr's DIH delta query with a specific ID to say "here's something new or updated, please update your index with it". I apologize if this is a trivial thing, but I can't seem to find any documentation on how to do it. Thanks, Ron
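A client could trigger that kind of targeted import by building the request URL with the extra parameter. This is a minimal sketch, not a tested recipe: the host, port and `/dataimport` handler path are assumptions about a typical setup, and `idz` is simply the parameter name used in the reply above.

```python
from urllib.parse import urlencode


def dih_delta_url(base_url: str, record_id) -> str:
    """Build a DataImportHandler URL that passes record_id as the idz parameter.

    On the Solr side the value is expected to be read in the entity's SQL as
    ${dataimporter.request.idz}, as described in the message above.
    """
    params = {"command": "delta-import", "idz": record_id}
    return base_url + "?" + urlencode(params)


# The base URL is an assumed local install, not taken from the thread.
url = dih_delta_url("http://localhost:8983/solr/dataimport", 31415)
print(url)
```

urlencode also escapes string IDs that contain spaces or other reserved characters, but it does not add SQL quotes; those still have to come from the query in the DIH configuration or from the caller.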
Re: UpdateProcessor and copyField
Yes. But did you actually search the mailing list or Solr's wiki? I guess not. Here it is: http://wiki.apache.org/solr/UpdateRequestProcessor > Can fields created by copyField instructions be processed by > UpdateProcessors? > Or only raw input fields can? > > So far my experiment is suggesting the latter. > > > T. "Kuro" Kurosaka
RE: Sort Stability With Date Boosting and Rounding
One suggestion: use logarithms to compress the large time range into something easier to compare: 1/log(ms(NOW,date)) -----Original Message----- From: Stephen Duncan Jr [mailto:stephen.dun...@gmail.com] Sent: Tuesday, February 22, 2011 6:03 PM To: solr-user@lucene.apache.org Subject: Sort Stability With Date Boosting and Rounding I'm trying to use http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents as a bf parameter to my dismax handler. The problem is, the value of NOW can cause documents in a similar range (date value within a few seconds of each other) to sometimes round to be equal, and sometimes not, changing their sort order (when equal, falling back to a secondary sort). This, in turn, screws up paging. The problem is that score is rounded to a lower level of precision than what the suggested formula produces as a difference between two values within seconds of each other. It seems to me if I could round the value to minutes or hours, where the difference will be large enough to not be rounded-out, then I wouldn't have problems with order changing on me. But it's not legal syntax to specify something like: recip(ms(NOW,manufacturedate_dt/HOUR),3.16e-11,1,1) Is this a problem anyone has faced and solved? Anyone have suggested solutions, other than indexing a copy of the date field that's rounded to the hour? -- Stephen Duncan Jr www.stephenduncanjr.com
Re: Sort Stability With Date Boosting and Rounding
You could always use a secondary sort as a tie-breaker, e.g. something unique like 'documentid'. That would ensure a stable sort. 2011/2/23 Stephen Duncan Jr > I'm trying to use > > http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents > as > a bf parameter to my dismax handler. The problem is, the value of NOW can > cause documents in a similar range (date value within a few seconds of each > other) to sometimes round to be equal, and sometimes not, changing their > sort order (when equal, falling back to a secondary sort). This, in turn, > screws up paging. > > The problem is that score is rounded to a lower level of precision than > what > the suggested formula produces as a difference between two values within > seconds of each other. It seems to me if I could round the value to > minutes > or hours, where the difference will be large enough to not be rounded-out, > then I wouldn't have problems with order changing on me. But it's not > legal > syntax to specify something like: > recip(ms(NOW,manufacturedate_dt/HOUR),3.16e-11,1,1) > > Is this a problem anyone has faced and solved? Anyone have suggested > solutions, other than indexing a copy of the date field that's rounded to > the hour? > > -- > Stephen Duncan Jr > www.stephenduncanjr.com >
Re: Sort Stability With Date Boosting and Rounding
Hi, You're right, it's illegal syntax to use other functions in the ms function, which is a pity indeed. However, you reduce the score by 50% for each year. Therefore paging through the results shouldn't make that much of a difference because the difference in score with NOW+2 minutes has a negligible impact on the total score. I had some thoughts on this issue as well but I decided the impact was too little to bother about. Cheers, > I'm trying to use > http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents as > a bf parameter to my dismax handler. The problem is, the value of NOW can > cause documents in a similar range (date value within a few seconds of each > other) to sometimes round to be equal, and sometimes not, changing their > sort order (when equal, falling back to a secondary sort). This, in turn, > screws up paging. > > The problem is that score is rounded to a lower level of precision than > what the suggested formula produces as a difference between two values > within seconds of each other. It seems to me if I could round the value > to minutes or hours, where the difference will be large enough to not be > rounded-out, then I wouldn't have problems with order changing on me. But > it's not legal syntax to specify something like: > recip(ms(NOW,manufacturedate_dt/HOUR),3.16e-11,1,1) > > Is this a problem anyone has faced and solved? Anyone have suggested > solutions, other than indexing a copy of the date field that's rounded to > the hour? > > -- > Stephen Duncan Jr > www.stephenduncanjr.com
hierarchical faceting, SOLR-792 - confused on config
I'm using solr 4.0 and trying to implement a hierarchical faceting example. The example I'm trying to implement is taken from the webcast "Mastering the Power of Faceted Search." (http://www.lucidimagination.com/solutions/webcasts/faceting) Around minute 30, Chris Hostetter gives a very nice "tips & tricks" example he described as Taxonomy facets. Where I'm confused is how to get the data indexed/organized into the "taxonomy facets" (0/NonFic, 1/NonFic/Law, 0/NonFic, 1/NonFic/Sci, 0/NonFic, 1/NonFic/Hist, 1/NonFic/Sci, 2/NonFic/Sci/Phys). Since I'm using DIH to import my data from a DB, do I create a TemplateTransformer to produce the indexed data? Do I have to do something special within schema.xml and/or solrconfig.xml? Once I figure out the correct config setup, I assume it's simply a matter of creating the correct solr query like he describes in the video? Thanks, kmf -- View this message in context: http://lucene.472066.n3.nabble.com/hierarchical-faceting-SOLR-792-confused-on-config-tp2556394p2556394.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: hierarchical faceting, SOLR-792 - confused on config
(11/02/23 8:26), kmf wrote: I'm using solr 4.0 and trying to implement a hierarchical faceting example. The example I'm trying to implement is taken from the webcast "Mastering the Power of Faceted Search." (http://www.lucidimagination.com/solutions/webcasts/faceting) Around minute 30, Chris Hostetter gives a very nice "tips & tricks" example he described as Taxonomy facets. Where I'm confused is how to get the data indexed/organized into the "taxonomy facets" (0/NonFic, 1/NonFic/Law, 0/NonFic, 1/NonFic/Sci, 0/NonFic, 1/NonFic/Hist, 1/NonFic/Sci, 2/NonFic/Sci/Phys). Since I'm using DIH to import my data from a DB, do I create a TemplateTransformer to produce the indexed data? Do I have to do something special within schema.xml and/or solrconfig.xml? Once I figure out the correct config setup, I assume it's simply a matter of creating the correct solr query like he describes in the video? Thanks, kmf kmf, disclaimer: I haven't seen the webcast yet. First, SOLR-792 is not for hierarchical faceting. Please see SOLR-64. Second, please take a look at PathHierarchyTokenizer in trunk and 3x. It cannot output the depth factor ("0/", "1/", ...), though. Hmm, does anyone think it would be better if it output the depth factors to the token type, a payload, or somewhere else? Koji -- http://www.rondhuit.com/en/
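Since PathHierarchyTokenizer does not emit the depth prefix, one option is to generate the depth-prefixed values before indexing and store them in a multi-valued string field. The sketch below only shows the token generation; the function names are hypothetical, and using the result with a facet.prefix drill-down is an assumption about how the webcast example is meant to be queried.

```python
def taxonomy_tokens(path: str, sep: str = "/") -> list:
    """Expand 'NonFic/Sci/Phys' into one depth-prefixed token per ancestor level."""
    parts = [p for p in path.split(sep) if p]
    return [
        f"{depth}{sep}{sep.join(parts[:depth + 1])}"
        for depth in range(len(parts))
    ]


def child_prefix(token: str, sep: str = "/") -> str:
    """Given a selected token such as '0/NonFic', return the prefix of its children."""
    depth, rest = token.split(sep, 1)
    return f"{int(depth) + 1}{sep}{rest}{sep}"


print(taxonomy_tokens("NonFic/Sci/Phys"))
print(child_prefix("0/NonFic"))
```

A document filed under NonFic/Sci/Phys would get all three tokens, and a query that wants the children of NonFic would pass the string from child_prefix as the facet.prefix value for that field.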
Re: UpdateProcessor and copyField
Markus, I searched but I couldn't find a definite answer, so I posted this question. The article you quoted talks about implementing a copyField-like operation using UpdateProcessor. It doesn't talk about the relationship between the copyField operation proper and UpdateProcessors. Kuro On 2/22/11 3:00 PM, "Markus Jelsma" wrote: >Yes. But did you actually search the mailing list or Solr's wiki? I guess >not. > >Here it is: >http://wiki.apache.org/solr/UpdateRequestProcessor > >> Can fields created by copyField instructions be processed by >> UpdateProcessors? >> Or only raw input fields can? >> >> So far my experiment is suggesting the latter. >> >> >> T. "Kuro" Kurosaka
Re: Date Math
: org.apache.lucene.queryParser.ParseException: Cannot parse 'last_modified:-DAY': ... : Are they not supported as a short-cut for "NOW-1DAY"? I'm using Solr 1.4. No. "-1DAY" is a valid DateMath string (to the DateMathParser), but as a field value you must specify a valid date string, which can *end* with a DateMath string. So "NOW-1DAY" is legal, as is "2011-02-22T12:34:56Z-1DAY". Note also: you didn't do "-1DAY", you tried "-DAY", which isn't valid anywhere. -Hoss
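The rule above (a date value, optionally followed by date math) can be approximated with a client-side pre-check before a query is sent. This is a rough sketch and not Solr's actual parser: the regular expression covers only a handful of units and is an assumption about the common cases, so Solr remains the authority on what is valid.

```python
import re

# Deliberately partial list of units; the real DateMathParser accepts more.
_UNIT = r"(?:YEAR|MONTH|DAY|HOUR|MINUTE|SECOND)S?"
# A date value is either NOW or a full UTC timestamp.
_BASE = r"(?:NOW|\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)"
# Zero or more date math operations: +N UNIT, -N UNIT, or /UNIT (rounding).
_MATH = rf"(?:[+-]\d+{_UNIT}|/{_UNIT})*"
_PATTERN = re.compile(rf"{_BASE}{_MATH}")


def looks_like_solr_date(value: str) -> bool:
    """Return True if value is a date, optionally followed by date math."""
    return _PATTERN.fullmatch(value) is not None


for candidate in ["NOW-1DAY", "2011-02-22T12:34:56Z-1DAY", "-1DAY", "-DAY"]:
    print(candidate, looks_like_solr_date(candidate))
```

The first two candidates are the legal forms named in the reply; the last two lack the leading date value, which is what triggered the ParseException in the original question.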
Re: Sort Stability With Date Boosting and Rounding
The problem comes when you have results that are all the same natural score (because you've filtered them, with no primary search, for instance), and are very close together in time. Then, as you page through, the order changes. So the user experience is that they see duplicate documents, and miss out on some of the docs in the overall set. It's not something negligible that I can ignore. I either have to come up with a fix for this, or get rid of the boost function altogether. Stephen Duncan Jr www.stephenduncanjr.com On Tue, Feb 22, 2011 at 6:09 PM, Markus Jelsma wrote: > Hi, > > You're right, it's illegal syntax to use other functions in the ms > function, > which is a pity indeed. > > However, you reduce the score by 50% for each year. Therefore paging > through > the results shouldn't make that much of a difference because the difference > in > score with NOW+2 minutes has a negligible impact on the total score. > > I had some thoughts on this issue as well but I decided the impact was too > little to bother about. > > Cheers, > > > I'm trying to use > > http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents as > > a bf parameter to my dismax handler. The problem is, the value of NOW > can > > cause documents in a similar range (date value within a few seconds of > each > > other) to sometimes round to be equal, and sometimes not, changing their > > sort order (when equal, falling back to a secondary sort). This, in > turn, > > screws up paging. > > > > The problem is that score is rounded to a lower level of precision than > > what the suggested formula produces as a difference between two values > > within seconds of each other. It seems to me if I could round the value > > to minutes or hours, where the difference will be large enough to not be > > rounded-out, then I wouldn't have problems with order changing on me.
> But > > it's not legal syntax to specify something like: > > recip(ms(NOW,manufacturedate_dt/HOUR),3.16e-11,1,1) > > > > Is this a problem anyone has faced and solved? Anyone have suggested > > solutions, other than indexing a copy of the date field that's rounded to > > the hour? > > > > -- > > Stephen Duncan Jr > > www.stephenduncanjr.com >
Re: Sort Stability With Date Boosting and Rounding
No, the problem is that, due to rounding, sometimes the docs ARE considered ties, and therefore the secondary sort is used, but sometimes they don't round to exactly equal, and the tiebreaker isn't used, and the results get shuffled. Stephen Duncan Jr www.stephenduncanjr.com On Tue, Feb 22, 2011 at 6:09 PM, Geert-Jan Brits wrote: > You could always use a secondary sort as a tie-breaker, e.g. something > unique like 'documentid'. That would ensure a stable sort. > > 2011/2/23 Stephen Duncan Jr > > > I'm trying to use > > > > > http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents > > as > > a bf parameter to my dismax handler. The problem is, the value of NOW > can > > cause documents in a similar range (date value within a few seconds of > each > > other) to sometimes round to be equal, and sometimes not, changing their > > sort order (when equal, falling back to a secondary sort). This, in > turn, > > screws up paging. > > > > The problem is that score is rounded to a lower level of precision than > > what > > the suggested formula produces as a difference between two values within > > seconds of each other. It seems to me if I could round the value to > > minutes > > or hours, where the difference will be large enough to not be > rounded-out, > > then I wouldn't have problems with order changing on me. But it's not > > legal > > syntax to specify something like: > > recip(ms(NOW,manufacturedate_dt/HOUR),3.16e-11,1,1) > > > > Is this a problem anyone has faced and solved? Anyone have suggested > > solutions, other than indexing a copy of the date field that's rounded to > > the hour? > > > > -- > > Stephen Duncan Jr > > www.stephenduncanjr.com > > >
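The effect described here can be reproduced outside Solr with a little arithmetic. The sketch below is a simplified model under stated assumptions: it treats the boost as the whole score, assumes the score is held as a 32-bit float, and uses recip(x,m,a,b) = a/(m*x+b) with the constants from the thread. The document ages and the sweep of NOW values are made up for illustration.

```python
import struct


def recip(x: float, m: float, a: float, b: float) -> float:
    """Reciprocal function a / (m*x + b), as used in the boost formula."""
    return a / (m * x + b)


def to_float32(value: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]


def boost(now_ms: int, doc_ms: int) -> float:
    """Boost for a document dated doc_ms, evaluated at now_ms, stored as float32."""
    return to_float32(recip(now_ms - doc_ms, 3.16e-11, 1, 1))


DAY_MS = 86_400_000
doc_a = 0        # hypothetical document timestamp (ms)
doc_b = 1_000    # a second document, one second newer

ties = 0
non_ties = 0
# Evaluate the same pair of documents at 100 slightly different values of NOW.
for step in range(100):
    now = DAY_MS + step * 100
    if boost(now, doc_a) == boost(now, doc_b):
        ties += 1
    else:
        non_ties += 1

print("requests where the two docs tie:   ", ties)
print("requests where the two docs differ:", non_ties)
```

For documents about a day old and one second apart, the difference in boost is roughly half of one float32 step, so whether the two values land on the same float depends on the exact NOW of each request. That is why the tie-breaker is applied on some pages and not on others.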