Re: DataImport TXT file entity processor
An EntityProcessor looks right to me. It may help us add more attributes if needed. PlainTextEntityProcessor looks like a good name. It can also be used to read html etc. --Noble

On Sat, Jan 24, 2009 at 12:37 PM, Shalin Shekhar Mangar wrote: > On Sat, Jan 24, 2009 at 5:56 AM, Nathan Adams wrote: > >> Is there a way to use Data Import Handler to index non-XML (i.e. simple >> text) files (either via HTTP or FileSystem)? I need to put the entire >> contents of a text file into a single field of a document and the other >> fields are being pulled out of Oracle... > > > Not yet. But I think it will be nice to have. Can you open an issue in Jira? > > I think importing from HTTP was something another user had asked for > recently. How do you get the url/path of this text file? That would help > decide if we need a Transformer or EntityProcessor for these tasks. > -- > Regards, > Shalin Shekhar Mangar. > -- --Noble Paul
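A hypothetical sketch of how such a processor could be wired into a data-config for the Oracle-plus-text-file case above. None of this existed when the thread was written: the entity and column names, the Oracle query, and the assumption that the processor exposes the file body under a single plainText column are all illustrative guesses.

    <dataConfig>
      <dataSource name="db" driver="oracle.jdbc.OracleDriver"
                  url="jdbc:oracle:thin:@//dbhost:1521/orcl" user="user" password="pass" />
      <dataSource name="fds" type="FileDataSource" />
      <document>
        <entity name="row" dataSource="db"
                query="select id, title, file_path from docs">
          <!-- hypothetical: reads the whole text file into one column -->
          <entity name="txt" dataSource="fds"
                  processor="PlainTextEntityProcessor"
                  url="${row.file_path}">
            <field column="plainText" name="content" />
          </entity>
        </entity>
      </document>
    </dataConfig>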
Re: Should I extend DIH to handle POST too?
That does not look like a great option. DIH looks like overkill for this use case. You can write a simple UpdateHandler to do that. All that you need to do is extend ContentStreamHandlerBase and register it as an UpdateHandler.

On Sat, Jan 24, 2009 at 12:34 PM, Shalin Shekhar Mangar wrote: > There's another option. Using DIH with Solrj. Take a look at: > > https://issues.apache.org/jira/browse/SOLR-853 > > There's a patch there but it hasn't been updated to trunk. A contribution > would be most welcome. > > On Sat, Jan 24, 2009 at 3:11 AM, Gunaranjan Chandraraju < > chandrar...@apple.com> wrote: > >> Hi >> I had earlier described my requirement of needing to 'post XMLs as-is' to >> SOLR and have it handled just as the DIH would do on import using the >> mapping in data-config.xml. I got multiple answers for the 'post approach' >> - the top two being >> >> - Use SOLR CELL >> - Use SOLRJ >> >> In general I would like to keep all the 'data conversion' inside the SOLR >> powered search system rather than having clients do the XSL and transforming >> the XML before sending them (CELL approach). >> >> My question is? How should I design this >> - Tomcat Servlet that provides this 'post' endpoint. Accepts the XML over >> HTTP, transforms it and calls SOLRJ to update. This is the same TOMCAT that >> houses SOLR. >> - SOLR Handler (Is this the right way?) >> - Take this a step further and implement it as an extension to DIH - a >> handler that will refer to DIH data-config xml and use the same >> transformation. This way I can invoke an import for 'batched files' or do a >> 'post 'for the same XML with the same data-config mapping being applied. >> Maybe it can be a separate handler that just refers to the same >> data-config.xml and not necessarily bundled with DIH handler code. >> >> Looking for some advise. If the DIH extension is the way to go then I >> would be happy to extend it and contribute that back to SOLR. >> >> Regards, >> Guna >> > > > > -- > Regards, > Shalin Shekhar Mangar. > -- --Noble Paul
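A rough sketch of the handler Noble describes, written against the Solr 1.3/1.4-era API (signatures have moved around between versions, and buildDoc() is a hypothetical placeholder for whatever XML-to-document transformation is needed):

    import java.io.Reader;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.solr.common.util.ContentStream;
    import org.apache.solr.handler.ContentStreamHandlerBase;
    import org.apache.solr.handler.ContentStreamLoader;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.request.SolrQueryResponse;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;

    public class XmlPostHandler extends ContentStreamHandlerBase {
      @Override
      protected ContentStreamLoader newLoader(SolrQueryRequest req,
                                              final UpdateRequestProcessor processor) {
        return new ContentStreamLoader() {
          @Override
          public void load(SolrQueryRequest req, SolrQueryResponse rsp,
                           ContentStream stream) throws Exception {
            Reader xml = stream.getReader();
            SolrInputDocument doc = buildDoc(xml); // hypothetical transformation step
            AddUpdateCommand cmd = new AddUpdateCommand();
            cmd.solrDoc = doc;
            processor.processAdd(cmd); // runs the normal update chain
          }
        };
      }

      SolrInputDocument buildDoc(Reader xml) {
        SolrInputDocument doc = new SolrInputDocument();
        // ... parse the incoming XML and populate fields here ...
        return doc;
      }

      public String getDescription() { return "custom XML post handler"; }
      public String getSourceId() { return ""; }
      public String getSource() { return ""; }
      public String getVersion() { return ""; }
    }

It would then be registered in solrconfig.xml with something like <requestHandler name="/update/custom" class="XmlPostHandler" />.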
Re: Solr Replication: disk space consumed on slave much higher than on master
hi Jaco, We owe you a big THANK YOU. We were planning to roll out this feature into production in the next week or so. Our internal testing could not find this out. --Noble

On Fri, Jan 23, 2009 at 6:36 PM, Jaco wrote: > Hi, > > I have tested this as well, looking fine! Both issues are indeed fixed, and > the index directory of the slaves gets cleaned up nicely. I will apply the > changes to all systems I've got running and report back in this thread in > case any issues are found. > > Thanks for the very fast help! I usually need much, much more patience with > commercial software vendors.. > > Cheers, > > Jaco. > > > 2009/1/23 Noble Paul നോബിള് नोब्ळ् > >> I have opened an issue to track this >> https://issues.apache.org/jira/browse/SOLR-978 >> >> On Fri, Jan 23, 2009 at 5:22 PM, Noble Paul നോബിള് नोब्ळ् >> wrote: >> > I tested with the patch >> > it has solved both the issues >> > >> > On Fri, Jan 23, 2009 at 5:00 PM, Shalin Shekhar Mangar >> > wrote: >> >> >> >> >> >> On Fri, Jan 23, 2009 at 2:12 PM, Jaco wrote: >> >>> >> >>> Hi, >> >>> >> >>> I applied the patch and did some more tests - also adding some >> LOG.info() >> >>> calls in delTree to see if it actually gets invoked (LOG.info("START: >> >>> delTree: "+dir.getName()); at the start of that method). I don't see >> any >> >>> entries of this showing up in the log file at all, so it looks like >> >>> delTree >> >>> doesn't get invoked at all. >> >>> >> >>> To be sure, explaining the issue to prevent misunderstanding: >> >>> - The number of files in the index directory on the slave keeps >> increasing >> >>> (in my very small test core, there are now 128 files in the slave's >> index >> >>> directory, and only 73 files in the master's index directory) >> >>> - The directories index.x are still there after replication, but >> they >> >>> are empty >> >>> >> >>> Are there any other things I can do check, or more info that I can >> provide >> >>> to help fix this? >> >> >> >> The problem is that when we do a commit on the slave after replication >> is >> >> done, the commit does not re-open the IndexWriter. Therefore, the >> deletion >> >> policy does not take effect and older files are left as is. This can >> keep on >> >> building up. The only solution is to re-open the index writer. >> >> >> >> I think the attached patch can solve this problem. Can you try this and >> let >> >> us know? Thank you for your patience. >> >> >> >> -- >> >> Regards, >> >> Shalin Shekhar Mangar. >> >> >> > >> > >> > >> > -- >> > --Noble Paul >> > >> >> >> >> -- >> --Noble Paul >> > -- --Noble Paul
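For readers hitting the same symptom: the reason re-opening the writer matters is that Lucene only consults its IndexDeletionPolicy when an IndexWriter is opened or commits. A minimal sketch against the Lucene 2.4-era API (the index path is a placeholder):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy;
    import org.apache.lucene.store.FSDirectory;

    public class ReopenWriter {
      public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.getDirectory("/path/to/data/index");
        // Opening the writer runs the deletion policy: with the default
        // KeepOnlyLastCommitDeletionPolicy, files that belong only to older
        // commit points become eligible for deletion at this moment.
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false,
            new KeepOnlyLastCommitDeletionPolicy(),
            IndexWriter.MaxFieldLength.UNLIMITED);
        writer.close();
      }
    }

Until that happens, files from superseded commits simply accumulate on the slave's disk.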
Re: How to make Relationships work for Multi-valued Index Fields?
Hello, I am also a newbie and was wanting to do almost the exact same thing. I was planning on doing the equivalent of:-

<dataConfig>
 <dataSource type="FileDataSource" />
 <document>
  <entity name="f"
          processor="FileListEntityProcessor"
          baseDir="***"
          fileName=".*xml"
          rootEntity="false"
          dataSource="null">
   <entity name="record"
           processor="XPathEntityProcessor"
           stream="false"
           rootEntity="false"   ***changed***
           forEach="/record"
           url="${f.fileAbsolutePath}">
    ***change**
    <entity name="record_adr"
            processor="XPathEntityProcessor"
            stream="false"
            forEach="/record/address"
            url="${f.fileAbsolutePath}">
     <field column="address_street" xpath="/record/address/@street" />
     <field column="address_state" xpath="/record/address//@state" />
     <field column="address_type" xpath="/record/address//@type" />
    </entity>
   </entity>
  </entity>
 </document>
</dataConfig>

ID is no longer unique within Solr. There would be multiple "documents" with a given ID; one for each address. You can then search on ID and get the three addresses, you can also search on an address more sensibly. I have not been able to try this yet as other issues are still to be dealt with. Comments?

>Hi >I may be completely off on this being new to SOLR but I am not sure >how to index related groups of fields in a document and preserve >their 'grouping'. I would appreciate any help on this. Detailed >description of the problem below. > >I am trying to index an entity that can have multiple occurrences in >the same document - e.g. Address. The address could be Shipping, >Home, Office etc. Each address element has multiple values in it >like street, state etc. Thus each address element is a group with >the state and street in one address element being related to each other. > >It looks like this in my source xml > > > > > > > > >I have setup my DIH to treat these as entities as below > > > > >baseDir="***" > fileName=".*xml" > rootEntity="false" > dataSource="null" > > name="record" > processor="XPathEntityProcessor" > stream="false" > forEach="/record" >url="${f.fileAbsolutePath}"> > > > >name="record_adr" >processor="XPathEntityProcessor" >stream="false" >forEach="/record/address" >url="${f.fileAbsolutePath}"> > > xpath="/record/address//@state" /> > > > > > > > > >The problem is as follows. DIH seems to treat these as entities but >solr seems to flatten them out on indexing to fields in a document >(losing the entity part). > >So when I search for the an ID - in the response all the street fields >are bunched to-gather, followed by all the state fields type etc. >Thus I can't associate which street address corresponds to which >address type in the response. > >What seems harder is this - say I need to query on 'Street' = XYZ1 and >type="Office". This should NOT return a document since the street for >the office address is "XY2" and not "XYZ1". However when I query for >address_state:"XYZ1" and address_type:"Office" I get back this document. > >The problem seems to be that while DIH allows 'entities' within a >document the SOLR schema does not preserve them - it 'flattens' all >of them out as indices for the document. > >I could work around the problem by creating SOLR fields like >"home_address_street" and "office_address_street" and do some xpath >mapping. However I don't want to do it as we can have multiple >'other' addresses. Also I have other fields whose type is not easily >distinguished like address. > >As I mentioned being new to SOLR I might have completely goofed on a >way to set it up - much appreciate any direction on it. I am using >SOLR 1.3 > >Regards, >Guna

--
===
Fergus McMenemie Email:fer...@twig.me.uk
Techmore Ltd Phone:(UK) 07721 376021
Unix/Mac/Intranets Analyst Programmer
===
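To make the intended result concrete, the index would then hold one Solr document per address, along these lines (field names and values are illustrative, using the street/type values from Guna's example; schema.xml must not declare ID as the uniqueKey, or each add would overwrite the previous one):

    <add>
      <doc>
        <field name="ID">R100</field>
        <field name="address_type">Home</field>
        <field name="address_street">XYZ1</field>
      </doc>
      <doc>
        <field name="ID">R100</field>
        <field name="address_type">Office</field>
        <field name="address_street">XY2</field>
      </doc>
    </add>

A query like address_type:Office AND address_street:XYZ1 then correctly matches nothing, because the street/type pairing is preserved per document.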
Re: How to make Relationships work for Multi-valued Index Fields?
nesting of an XPathEntityProcessor into another XPathEntityProcessor is possible only if a field in an xml is a filename/url . what is the purpose of nesting like this? is it because you have multiple addresses? the possible solutions are discussed elsewhere in this thread On Sat, Jan 24, 2009 at 2:41 PM, Fergus McMenemie wrote: > Hello, > > I am also a newbie and was wanting to do almost the exact same thing. > I was planning on doing the equivalent of:- > > > > >baseDir="***" > fileName=".*xml" > rootEntity="false" > dataSource="null" > >name="record" > processor="XPathEntityProcessor" > stream="false" > rootEntity="false"***changed*** > forEach="/record" > url="${f.fileAbsolutePath}"> > > ***change** > > name="record_adr" > processor="XPathEntityProcessor" > stream="false" > forEach="/record/address" > url="${f.fileAbsolutePath}"> > > xpath="/record/address//@state" /> > > > > > > > > ID is no longer unique within Solr, There would be multiple "documents" > with a given ID; one for each address. You can then search on ID and get > the three addresses, you can also search on an address more sensibly. > > I have not been able to try this yet as other issues are still to be > dealt with. > > Comments? > >>Hi >>I may be completely off on this being new to SOLR but I am not sure >>how to index related groups of fields in a document and preserver >>their 'grouping'. I would appreciate any help on this.Detailed >>description of the problem below. >> >>I am trying to index an entity that can have multiple occurrences in >>the same document - e.g. Address. The address could be Shipping, >>Home, Office etc. Each address element has multiple values in it >>like street, state etc.Thus each address element is a group with >>the state and street in one address element being related to each other. >> >>It looks like this in my source xml >> >> >> >> >> >> >> >> >>I have setup my DIH to treat these as entities as below >> >> >> >> >> > baseDir="***" >> fileName=".*xml" >> rootEntity="false" >> dataSource="null" > >> >name="record" >> processor="XPathEntityProcessor" >> stream="false" >> forEach="/record" >>url="${f.fileAbsolutePath}"> >> >> >> >> > name="record_adr" >>processor="XPathEntityProcessor" >>stream="false" >>forEach="/record/address" >>url="${f.fileAbsolutePath}"> >> >>> xpath="/record/address//@state" /> >> >> >> >> >> >> >> >> >>The problem is as follows. DIH seems to treat these as entities but >>solr seems to flatten them out on indexing to fields in a document >>(losing the entity part). >> >>So when I search for the an ID - in the response all the street fields >>are bunched to-gather, followed by all the state fields type etc. >>Thus I can't associate which street address corresponds to which >>address type in the response. >> >>What seems harder is this - say I need to query on 'Street' = XYZ1 and >>type="Office". This should NOT return a document since the street for >>the office address is "XY2" and not "XYZ1". However when I query for >>address_state:"XYZ1" and address_type:"Office" I get back this document. >> >>The problem seems to be that while DIH allows 'entities' within a >>document the SOLR schema does not preserve them - it 'flattens' all >>of them out as indices for the document. >> >>I could work around the problem by creating SOLR fields like >>"home_address_street" and "office_address_street" and do some xpath >>mapping. However I don't want to do it as we can have multiple >>'other' addresses. 
Also I have other fields whose type is not easily >>distinguished like address. >> >>As I mentioned being new to SOLR I might have completely goofed on a >>way to set it up - much appreciate any direction on it. I am using >>SOLR 1.3 >> >>Regards, >>Guna > > -- > > === > Fergus McMenemie Email:fer...@twig.me.uk > Techmore Ltd Phone:(UK) 07721 376021 > > Unix/Mac/Intranets Analyst Programmer > === > -- --Noble Paul
Re: Master failover - seeking comments
Did you look at the new in-built replication? http://wiki.apache.org/solr/SolrReplication#head-0e25211b6ef50373fcc2f9a6ad40380c169a5397 It can help you decide where to replicate from during runtime. Look at the snappull command; you can pass the masterUrl at the time of replication.

On Fri, Jan 23, 2009 at 7:55 PM, edre...@ha wrote: > > Thanks for the response. Let me clarify things a bit. > > Regarding the Slaves: > Our project is a web application. It is our desire to embedd Solr into the > web application. The web applications are configured with a local embedded > Solr instance configured as a slave, and a remote Solr instance configured > as a master. > > We have a requirement for real-time updates to the Solr indexes. Our > strategy is to use the local embedded Solr instance as a read-only > repository. Any time a write is made, we will send it to the remote Master. > Once a user pushes a write operation to the remote Master, all subsequent > read operations for this user now are made against the Master for the > duration of the session. This approximates "realtime" updates and seems to > work for our purposes. Writes to our system are a small percentage of Read > operations. > > Now, back to the original question. We're simply looking for failover > solution if the Master server goes down. Oh, and we are using the > replication scripts to sync the servers. > > > >> It seems like you are trying to write to Solr directly from your front end >> application. This is why you are thinking of multiple masters. I'll let >> others comment on how easy/hard/correct the solution would be. >> > > Well, yes. We have business requirements that want updates to Solr to be > realtime, or as close to that as possible, so when a user changes something, > our strategy was to save it to the DB and push it to the Solr Master as > well. Although, we will have a background application that will help ensure > that Solr is in sync with the DB for times that Solr is down and the DB is > not. > > > >> But, do you really need to have live writes? Can they be channeled through >> a >> background process? Since you anyway cannot do a commit per-write, the >> advantage of live writes is minimal. Moreover you would need to invest a >> lot >> of time in handling availability concerns to avoid losing updates. If you >> log/record the write requests to an intermediate store (or queue), you can >> do with one master (with another host on standby acting as a slave). >> > > We do need to have live writes, as I mentioned above. The concern you > mention about losing live writes is exactly why we are looking at a Master > Solr server failover strategy. We thought about having a backup Solr server > that is a Slave to the Master and could be easily reconfigured as a new > Master in a pinch. Our operations team has pushed us to come up with a > solution that would be more seamless. This is why we came up with a > Master/Master solution where both Masters are also slaves to each other. > > > >>> >>> To test this, I ran the following scenario. >>> >>> 1) Slave 1 (S1) is configured to use M2 as it's master. >>> 2) We push an update to M2. >>> 3) We restart S1, now pointing to M1. >>> 4) We wait for M1 to sync from M2 >>> 5) We then sync S1 to M1. >>> 6) Success! >>> >> >> How do you co-ordinate all this? >> > > This was just a test scenario I ran manually to see if the setup I described > above would even work. > > Is there a Wiki page that outlines typical web application Solr deployment > strategies? 
There are a lot of questions on the forum about this type of > thing (including this one). For those who have expertise in this area, I'm > sure there are many who could benefit from this (hint hint). > > As before, any comments or suggestions on the above would be much > appreciated. > > Thanks, > Erik > -- > View this message in context: > http://www.nabble.com/Master-failover---seeking-comments-tp21614750p21625324.html > Sent from the Solr - User mailing list archive at Nabble.com. > > -- --Noble Paul
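For reference, the kind of runtime redirection Noble mentions would look roughly like this (hosts and port are placeholders; in the trunk of that period the command was named snappull, later renamed fetchindex):

    http://slave_host:8983/solr/replication?command=snappull&masterUrl=http://standby_master:8983/solr/replication

so a slave can be told, per request, which master to pull from, without editing its solrconfig.xml.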
Re: Random queries extremely slow
Use multiple boxes, with a mirroring delay from one to another, like a pipeline.

2009/1/22 oleg_gnatovskiy > > Well this probably isn't the cause of our random slow queries, but might be > the cause of the slow queries after pulling a new index. Is there anything > we could do to reduce the performance hit we take from this happening? > > > > Otis Gospodnetic wrote: > > > > Here is one example: pushing a large newly optimized index onto the > > server. > > > > Otis > > -- > > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > > > > > > - Original Message > >> From: oleg_gnatovskiy > >> To: solr-user@lucene.apache.org > >> Sent: Thursday, January 22, 2009 2:22:51 PM > >> Subject: Re: Random queries extremely slow > >> > >> > >> What are some things that could happen to force files out of the cache > on > >> a > >> Linux machine? I don't know what kinds of events to look for... > >> > >> > >> > >> > >> yonik wrote: > >> > > >> > On Thu, Jan 22, 2009 at 1:46 PM, oleg_gnatovskiy > >> > wrote: > >> >> Hello. Our production servers are operating relatively smoothly most > >> of > >> >> the > >> >> time running Solr with 19 million listings. However every once in a > >> while > >> >> the same query that used to take 100 miliseconds takes 6000. > >> > > >> > Anything else happening on the system that may have forced some of the > >> > index files out of operating system disk cache at these times? > >> > > >> > -Yonik > >> > > >> > > >> > >> -- > >> View this message in context: > >> > http://www.nabble.com/Random-queries-extremely-slow-tp21610568p21611240.html > >> Sent from the Solr - User mailing list archive at Nabble.com. > > > > > > > > -- > View this message in context: > http://www.nabble.com/Random-queries-extremely-slow-tp21610568p21611454.html > Sent from the Solr - User mailing list archive at Nabble.com. > > -- Alexander Ramos Jardim
Re: Results not appearing
They all appear in the stats admin page under the NumDocs & maxDocs fields. I don't explicitly send a commit command, but my posting ends like this (suggesting they are committed):

SimplePostTool: POSTing file 21166.xml
SimplePostTool: POSTing file 21169.xml
SimplePostTool: COMMITting Solr index changes..

I just tried re-posting all the documents set as "text" -- will that update the current documents indexed? (bearing in mind the unique key, message-id, will be included again) When I try searching I still get 0 results for anything included in the message-id and content fields, both of which should be indexed and returning results... Cheers for any help!

ryguasu wrote: > > These might be obvious, but: > > * I assume you did a Solr commit command after indexing, right? > > * If you are using the fieldtype definitions from the default > schema.xml, then your "string" fields are not being analyzed, which > means you should expect search results only if you enter the entire, > exact value of one of the Message-ID or Date fields in your query. Is > that your intention? > > And yes, your analysis of "stored" seems correct. Stored fields are > those whose values you need back at query time, and indexed fields are > those you can do queries on. For a few complications, see > http://wiki.apache.org/solr/FieldOptionsByUseCase > > On Fri, Jan 23, 2009 at 8:04 PM, Johnny X > wrote: >> >> I've indexed my XML using the below in the schema: >> >> > required="true"/> >> >> >> >> >> > stored="true"/> >> > stored="true"/> >> > stored="true"/> >> >> >> >> >> >> >> >> >> >> Message-ID >> >> However searching via the Message-ID or Content fields returns 0. Using >> Luke >> I can still see these fields are stored however. >> >> Out of interest, by setting the other fields to just "stored=true", can >> they >> be returned in a query as part of a search? >> >> >> Cheers. >> -- >> View this message in context: >> http://www.nabble.com/Results-not-appearing-tp21637069p21637069.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> >> > > -- View this message in context: http://www.nabble.com/Results-not-appearing-tp21637069p21640562.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How to make Relationships work for Multi-valued Index Fields?
Hi Fergus, XPathEntityProcessor can read multivalued fields easily, e.g.:

<entity name="record_adr"
        processor="XPathEntityProcessor"
        stream="false"
        forEach="/record/address"
        url="${f.fileAbsolutePath}">
  <field column="address_street" xpath="/record/address/@street" />
  <field column="address_state" xpath="/record/address/@state" />
  <field column="address_type" xpath="/record/address/@type" />
</entity>

In this case all of address_street, address_state, address_type will be returned as separate lists while parsing. If you wish to put them into multiple fields you can write a transformer and iterate through the lists and put them into separate fields. If there are 3 tags then you get a List for each field where the length of the list == 3. If an item is missing it will be added as a null. Ensure that the fields are marked as multiValued="true" in the schema.xml. Otherwise it does not return a List. If there is no corresponding mapping in schema.xml you can explicitly put it here in the dataconfig.xml. I saw the syntax '/record/address//@state'. '//' is not supported. You will have to explicitly give the full path. --Noble

On Sat, Jan 24, 2009 at 2:57 PM, Noble Paul നോബിള് नोब्ळ् wrote: > nesting of an XPathEntityProcessor into another XPathEntityProcessor > is possible only if a field in an xml is a filename/url . > what is the purpose of nesting like this? > is it because you have multiple addresses? the possible solutions are > discussed elsewhere in this thread > > On Sat, Jan 24, 2009 at 2:41 PM, Fergus McMenemie wrote: >> Hello, >> >> I am also a newbie and was wanting to do almost the exact same thing. >> I was planning on doing the equivalent of:- >> >> >> >> >> > baseDir="***" >> fileName=".*xml" >> rootEntity="false" >> dataSource="null" > >> > name="record" >> processor="XPathEntityProcessor" >> stream="false" >> rootEntity="false"***changed*** >> forEach="/record" >> url="${f.fileAbsolutePath}"> >> >> ***change** >> >> > name="record_adr" >> processor="XPathEntityProcessor" >> stream="false" >> forEach="/record/address" >> url="${f.fileAbsolutePath}"> >> >> > xpath="/record/address//@state" /> >> >> >> >> >> >> >> >> ID is no longer unique within Solr, There would be multiple "documents" >> with a given ID; one for each address. You can then search on ID and get >> the three addresses, you can also search on an address more sensibly. >> >> I have not been able to try this yet as other issues are still to be >> dealt with. >> >> Comments? >> >>>Hi >>>I may be completely off on this being new to SOLR but I am not sure >>>how to index related groups of fields in a document and preserver >>>their 'grouping'. I would appreciate any help on this.Detailed >>>description of the problem below. >> >>>I am trying to index an entity that can have multiple occurrences in >>>the same document - e.g. Address. The address could be Shipping, >>>Home, Office etc. Each address element has multiple values in it >>>like street, state etc.Thus each address element is a group with >>>the state and street in one address element being related to each other. >>> >>>It looks like this in my source xml >>> >>> >>> >>> >>> >>> >>> >>>I have setup my DIH to treat these as entities as below >>> >>> >>> >>> >>> >> baseDir="***" >>> fileName=".*xml" >>> rootEntity="false" >>> dataSource="null" > >>> >>name="record" >>> processor="XPathEntityProcessor" >>> stream="false" >>> forEach="/record" >>>url="${f.fileAbsolutePath}"> >>> >>> >>> >>> >> name="record_adr" >>>processor="XPathEntityProcessor" >>>stream="false" >>>forEach="/record/address" >>>url="${f.fileAbsolutePath}"> >>> >>>>> xpath="/record/address//@state" /> >>> >>> >>> >>> >>> >>> >>> >>> >>>The problem is as follows. DIH seems to treat these as entities but >>>solr seems to flatten them out on indexing to fields in a document >>>(losing the entity part). 
>>> >>>So when I search for the an ID - in the response all the street fields >>>are bunched to-gather, followed by all the state fields type etc. >>>Thus I can't associate which street address corresponds to which >>>address type in the response. >>> >>>What seems harder is this - say I need to query on 'Street' = XYZ1 and >>>type="Office". This should NOT return a document since the street for >>>the office address is "XY2" and not "XYZ1". However when I quer
Re: Results not appearing
If it helps, everything appears when I use Luke to search through the index...but the search in that returns nothing either. When I search using the admin page for the word 'Phillip' (which appears the most in all of the documents) I get the following:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">phillip</str>
      <str name="rows">10</str>
      <str name="version">2.2</str>
    </lst>
  </lst>
  <result name="response" numFound="0" start="0"/>
</response>

Duh...?

Johnny X wrote: > > They all appear in the stats admin page under the NumDocs & maxDocs > fields. > > I don't explicitly send a commit command, but my posting ends like this > (suggesting they are commited): > > SimplePostTool: POSTing file 21166.xml > SimplePostTool: POSTing file 21169.xml > SimplePostTool: COMMITting Solr index changes.. > > I just tried re-posting all the documents set as "text" -- will that > update the current documents indexed? (bearing in mind the unique key, > message-id, will be included again) > > When I try searching I still get 0 results for anything included in the > message-id and content fields, both of which should be indexed and > returning results... > > > Cheers for any help! > > > ryguasu wrote: >> >> These might be obvious, but: >> >> * I assume you did a Solr commit command after indexing, right? >> >> * If you are using the fieldtype definitions from the default >> schema.xml, then your "string" fields are not being analyzed, which >> means you should expect search results only if you enter the entire, >> exact value of one of the Message-ID or Date fields in your query. Is >> that your intention? >> >> And yes, your analysis of "stored" seems correct. Stored fields are >> those whose values you need back at query time, and indexed fields are >> those you can do queries on. For a few complications, see >> http://wiki.apache.org/solr/FieldOptionsByUseCase >> >> On Fri, Jan 23, 2009 at 8:04 PM, Johnny X >> wrote: >>> >>> I've indexed my XML using the below in the schema: >>> >>> >> required="true"/> >>> >>> >>> >>> >>> >> stored="true"/> >>> >> stored="true"/> >>> >> stored="true"/> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> Message-ID >>> >>> However searching via the Message-ID or Content fields returns 0. Using >>> Luke >>> I can still see these fields are stored however. >>> >>> Out of interest, by setting the other fields to just "stored=true", can >>> they >>> be returned in a query as part of a search? >>> >>> >>> Cheers. >>> -- >>> View this message in context: >>> http://www.nabble.com/Results-not-appearing-tp21637069p21637069.html >>> Sent from the Solr - User mailing list archive at Nabble.com. >>> >>> >> >> > > -- View this message in context: http://www.nabble.com/Results-not-appearing-tp21637069p21641692.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: solr-duplicate post management
On Thu, Jan 22, 2009 at 2:33 PM, S.Selvam Siva wrote: > > > On Thu, Jan 22, 2009 at 7:12 AM, Chris Hostetter > wrote: > >> >> : what i need is ,to log the existing urlid and new urlid(of course both >> will >> : not be same) ,when a .xml file of same id(unique field) is posted. >> : >> : I want to make this by modifying the solr source.Which file do i need to >> : modify so that i could get the above details in log ? >> : >> : I tried with DirectUpdateHandler2.java(which removes the duplicate >> : entries),but efforts in vein. >> >> DirectUpdateHandler2.java (on the trunk) delegates to Lucene-Java's >> IndexWriter.updateDocument method when you have a uniqueKey and you aren't >> allowing duplicates -- this method doesn't give you any way to access the >> old document(s) that had that existing key. >> >> The easiest way to make a change like what you are interested in might be >> an UpdateProcessor that does a lookup/search for the uniqueKey of each >> document about to be added to see if it already exists. that's probably >> about as efficient as you can get, and would be nicely encapsulated. >> >> You might also want to take a look at SOLR-799, where some work is being >> done to create UpdateProcessors that can do "near duplicate" detection... >> >> http://wiki.apache.org/solr/Deduplication >> https://issues.apache.org/jira/browse/SOLR-799 >> >> >> >> >> >> >> -Hoss >> > >

Hi, i added some code to *DirectUpdateHandler2.java's doDeletions()* (solr 1.2.0), and got the solution i wanted (logging the duplicate post entry, i.e. the old field and new field of the duplicate post):

    Document d1 = searcher.doc(prev);         // existing doc to be deleted
    Document d2 = searcher.doc(tdocs.doc());  // new doc
    String oldname = d1.get("name");
    String id1 = d1.get("id");
    String newname = d2.get("name");
    String id2 = d2.get("id");                // the new doc's id
    out3.write(id1 + "," + oldname + "," + newname + "\n");

But I don't know whether the performance of Solr will be affected by this. Any comment on the performance issue for the above solution is welcome... -- Yours, S.Selvam
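For anyone who would rather not patch DirectUpdateHandler2, here is a sketch of the UpdateProcessor route Hoss suggested, written against the Solr 1.3-era trunk API (update processors do not exist in 1.2). The uniqueKey "id" and the logged "name" field are carried over from the snippet above, and writing to stdout is a placeholder for real logging:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.request.SolrQueryResponse;
    import org.apache.solr.search.SolrIndexSearcher;
    import org.apache.solr.update.AddUpdateCommand;
    import org.apache.solr.update.processor.UpdateRequestProcessor;
    import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

    public class DuplicateLogProcessorFactory extends UpdateRequestProcessorFactory {
      @Override
      public UpdateRequestProcessor getInstance(final SolrQueryRequest req,
          SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
          @Override
          public void processAdd(AddUpdateCommand cmd) throws IOException {
            SolrIndexSearcher searcher = req.getSearcher();
            Object id = cmd.solrDoc.getFieldValue("id");
            int docId = searcher.getFirstMatch(new Term("id", id.toString()));
            if (docId != -1) { // a document with this id is already indexed
              Document old = searcher.doc(docId);
              System.out.println(id + "," + old.get("name") + ","
                  + cmd.solrDoc.getFieldValue("name"));
            }
            super.processAdd(cmd); // hand off to the rest of the update chain
          }
        };
      }
    }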
Re: faceting question
Is there no other way than to use the patch, since query A is a superset of B? If not doable, I will probably use some caching technique. Best.

On Sat, Jan 24, 2009 at 9:14 AM, Shalin Shekhar Mangar wrote: > On Sat, Jan 24, 2009 at 6:56 AM, Cam Bazz wrote: > >> Hello; >> >> I got a multiField named tagList which may contain multiple tags. I am >> making a query like: >> >> tagList:a AND tagList:b AND tagList:c >> >> and I am also getting a tagList facet returning me some values. >> >> What I would like is Solr to return me facets as if the query was: >> tagList:a AND tagList:b >> >> is it even possible? >> > > If I understand correctly, > 1. You want to query for tagList:a AND tagList:b AND tagList:c > 2. At the same time, you want to request facets for tagList but only for > tagList:a and tagList:b > > If that is correct, you can use the features introduced by > https://issues.apache.org/jira/browse/SOLR-911 > > However you may need to put #1 as fq instead of q. > -- > Regards, > Shalin Shekhar Mangar. >
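For the record, with that patch applied the request would look roughly like this (the tag name tc is arbitrary): the tagList:c filter is tagged and then excluded from the facet computation, so the tagList facet counts come out as if only tagList:a AND tagList:b had been applied.

    q=*:*
    &fq=tagList:a
    &fq=tagList:b
    &fq={!tag=tc}tagList:c
    &facet=true
    &facet.field={!ex=tc}tagList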
Re: Results not appearing
I should clarify that I misspoke before; I thought you had indexed="true" on Message-Id and Date, whereas you had it on Message-Id and Content. It sounds like you figured this out and interpreted my reply in a useful way nonetheless, though. So that's good. The post tool should be a valid way to commit. As for your technique of updating the field types and reindexing the documents, I think it should be fine provided you kept the field type for the Message-Id field as string. If you changed it to text along with the other field types, then there's a chance your "update" technique might instead have had the effect of inserting a duplicate copy of each document, so there are two copies of each document: one searchable and one not searchable. (I'm not totally sure about this, but it's a worry I would have.) That doesn't sound like what's happened to you, though. Could the problem be that you're not specifying which field to query? If you're using the standard query analyzer and the stock schema.xml, then the default field name is "text", whereas you don't have a field called "text" in your schema. In that setup if you want to search on the Content field you need to say so explicitly, like so: Content:phillip

On Sat, Jan 24, 2009 at 7:25 AM, Johnny X wrote: > > If it helps, everything appears when I use Luke to search through the > index...but the search in that returns nothing either. > > When I search using the admin page for the word 'Phillip' (which appears the > most in all of the documents) I get the following: > > > - > - > 0 > 0 > - > on > 0 > phillip > 10 > 2.2 > > > > > > > Duh...? > > > > Johnny X wrote: >> >> They all appear in the stats admin page under the NumDocs & maxDocs >> fields. >> >> I don't explicitly send a commit command, but my posting ends like this >> (suggesting they are commited): >> >> SimplePostTool: POSTing file 21166.xml >> SimplePostTool: POSTing file 21169.xml >> SimplePostTool: COMMITting Solr index changes.. >> >> I just tried re-posting all the documents set as "text" -- will that >> update the current documents indexed? (bearing in mind the unique key, >> message-id, will be included again) >> >> When I try searching I still get 0 results for anything included in the >> message-id and content fields, both of which should be indexed and >> returning results... >> >> >> Cheers for any help! >> >> >> ryguasu wrote: >>> >>> These might be obvious, but: >>> >>> * I assume you did a Solr commit command after indexing, right? >>> >>> * If you are using the fieldtype definitions from the default >>> schema.xml, then your "string" fields are not being analyzed, which >>> means you should expect search results only if you enter the entire, >>> exact value of one of the Message-ID or Date fields in your query. Is >>> that your intention? >>> >>> And yes, your analysis of "stored" seems correct. Stored fields are >>> those whose values you need back at query time, and indexed fields are >>> those you can do queries on. For a few complications, see >>> http://wiki.apache.org/solr/FieldOptionsByUseCase >>> >>> On Fri, Jan 23, 2009 at 8:04 PM, Johnny X >>> wrote: I've indexed my XML using the below in the schema: >>> required="true"/> >>> stored="true"/> >>> stored="true"/> >>> stored="true"/> Message-ID However searching via the Message-ID or Content fields returns 0. Using Luke I can still see these fields are stored however. Out of interest, by setting the other fields to just "stored=true", can they be returned in a query as part of a search? Cheers. 
-- View this message in context: http://www.nabble.com/Results-not-appearing-tp21637069p21637069.html Sent from the Solr - User mailing list archive at Nabble.com. >>> >>> >> >> > > -- > View this message in context: > http://www.nabble.com/Results-not-appearing-tp21637069p21641692.html > Sent from the Solr - User mailing list archive at Nabble.com. > >
Re: Solr stemming -> preserve original words
I still don't understand your final goal, but if you want to get an output in the form of "run(40) => 20 from running, 10 from run, 8 from runners, 2 from runner" you need to index your documents using the standard analyzer, walk through the index using org.apache.lucene.index.IndexReader, and stem each term using a stemmer. Storing stems (key) and the original word list (value) in a map will give that kind of output.

However, if seeing something like the following list (not exactly what you want but similar) on schema.jsp will help you

run=>run
run=>running
run=>runner
run=>runners

add one line of code

newstr = newstr + "=>" + new String(termBuffer, 0, len);

to org.apache.solr.analysis.EnglishPorterFilterFactory.java between lines #116 and #117. Rename the file, compile the code, put your jar file in the lib directory under your Solr home. Now you can use your new FilterFactory in your schema.xml

--- On Sat, 1/24/09, Thushara Wijeratna wrote: > From: Thushara Wijeratna > Subject: Re: Solr stemming -> preserve original words > To: solr-user@lucene.apache.org, iori...@yahoo.com > Date: Saturday, January 24, 2009, 1:53 AM > Chris, Ahmet - thanks for the responses. > > Ahmet - yes, i want to see "run" as a top term + > the original words that > formed that term > The reason is that due to mis-stemming, the terms could > become non-english. > ex: "permanent" would stem to "perm", > "archive" would become "archiv". > > I need to extract a set of keywords from the indexed > content - I'd like > these to be correct full english words. > > thanks, > thushara
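A sketch of the IndexReader walk described in the first paragraph, against the Lucene 2.4-era API. The index path and the "content" field name are placeholders, and EnglishStemmer is the Snowball stemmer that EnglishPorterFilterFactory wraps:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;
    import org.tartarus.snowball.ext.EnglishStemmer;

    public class StemReport {
      public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");
        EnglishStemmer stemmer = new EnglishStemmer();
        // stem -> (original surface form -> document frequency)
        Map<String, Map<String, Integer>> stems =
            new HashMap<String, Map<String, Integer>>();
        TermEnum terms = reader.terms();
        while (terms.next()) {
          Term t = terms.term();
          if (!"content".equals(t.field())) continue; // placeholder field name
          stemmer.setCurrent(t.text());
          stemmer.stem();
          String stem = stemmer.getCurrent();
          Map<String, Integer> originals = stems.get(stem);
          if (originals == null) {
            originals = new HashMap<String, Integer>();
            stems.put(stem, originals);
          }
          originals.put(t.text(), terms.docFreq());
        }
        terms.close();
        reader.close();
        // e.g. prints something like {running=20, run=10, runners=8, runner=2}
        System.out.println(stems.get("run"));
      }
    }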
size of solr update document a limitation?
Hello Solr experts, Is it good practice to post large Solr update documents (e.g. 100kb-2mb)? Will Solr do the necessary tricks to make the field use a reader instead of strings? thanks in advance paul
Re: Results not appearing
Thanks for the reply. I ended up fixing it by re-installing Tomcat and starting over. Searches now appear to work. Because I'm testing atm however, is it possible to delete the index and start afresh in future. At the moment I backed up the original index folder...if I just replace that with the current one including an index will that work...or will other parts of Solr recognise it's changed and as a result not work? What's the best solution for removing the index? Cheers. ryguasu wrote: > > I should clarify that I misspoke before; I thought you had > indexed="true" on Message-Id and Date, whereas you had it on > Message-Id and Content. It sounds like you figured this out and > interpreted my reply in a useful way nonetheless, though. So that's > good. > > The post tool should be a valid way to commit. > > As for your technique of updating the field types and reindexing the > documents, I think it should be fine provided you kept the field type > for the Message-Id field as string. If you changed it to text along > with the other field types, then there's a chance your "update" > technique might instead of the effect of inserting a duplicate copy of > each document, so there are two copies of each document, one > searchable, and one not searchable. (I'm not totally sure about this, > but it's a worry I would have.) That doesn't sound like what's > happened to you, though. > > Could the problem be that you're not specifying which field to query? > If you're using the standard query analyzer and the stock schema.xml, > then the default field name is "text", whereas you don't have a field > called "text" in your schema. In that setup if you want to search on > the Content field you need to say so explicitly, like so: > > Content:phillip > > On Sat, Jan 24, 2009 at 7:25 AM, Johnny X > wrote: >> >> If it helps, everything appears when I use Luke to search through the >> index...but the search in that returns nothing either. >> >> When I search using the admin page for the word 'Phillip' (which appears >> the >> most in all of the documents) I get the following: >> >> >> - >> - >> 0 >> 0 >> - >> on >> 0 >> phillip >> 10 >> 2.2 >> >> >> >> >> >> >> Duh...? >> >> >> >> Johnny X wrote: >>> >>> They all appear in the stats admin page under the NumDocs & maxDocs >>> fields. >>> >>> I don't explicitly send a commit command, but my posting ends like this >>> (suggesting they are commited): >>> >>> SimplePostTool: POSTing file 21166.xml >>> SimplePostTool: POSTing file 21169.xml >>> SimplePostTool: COMMITting Solr index changes.. >>> >>> I just tried re-posting all the documents set as "text" -- will that >>> update the current documents indexed? (bearing in mind the unique key, >>> message-id, will be included again) >>> >>> When I try searching I still get 0 results for anything included in the >>> message-id and content fields, both of which should be indexed and >>> returning results... >>> >>> >>> Cheers for any help! >>> >>> >>> ryguasu wrote: These might be obvious, but: * I assume you did a Solr commit command after indexing, right? * If you are using the fieldtype definitions from the default schema.xml, then your "string" fields are not being analyzed, which means you should expect search results only if you enter the entire, exact value of one of the Message-ID or Date fields in your query. Is that your intention? And yes, your analysis of "stored" seems correct. Stored fields are those whose values you need back at query time, and indexed fields are those you can do queries on. 
For a few complications, see http://wiki.apache.org/solr/FieldOptionsByUseCase On Fri, Jan 23, 2009 at 8:04 PM, Johnny X wrote: > > I've indexed my XML using the below in the schema: > > required="true"/> > > > > > stored="true"/> > stored="true"/> > indexed="false" > stored="true"/> > > > > > > > stored="true"/> > > > Message-ID > > However searching via the Message-ID or Content fields returns 0. > Using > Luke > I can still see these fields are stored however. > > Out of interest, by setting the other fields to just "stored=true", > can > they > be returned in a query as part of a search? > > > Cheers. > -- > View this message in context: > http://www.nabble.com/Results-not-appearing-tp21637069p21637069.html > Sent from the Solr - User mailing list archive at Nabble.com. > > >>> >>> >> >> -- >> View this message in context: >> http://www.nabble.com/Results-not-appearing-tp21637069p21641692.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> >> > > -- View this message in context: http://www.nabble.com/Results-not-appearing-tp21637069p216
Re: Results not appearing
Without you stopping Solr itself, a solr client can remove all the documents in an index by doing a delete-by-query with the query "*:*" (without quotes). For XML interface clients, see http://wiki.apache.org/solr/UpdateXmlMessage. Solrj would have another way to do it. You'll need to do a commit after this to flush your changes. Alternatively, you can stop Solr and delete the whole data/ directory, which includes the index directory. If you do this, Solr will create a new fresh one the next time it starts up. For backups it might be a better habit to backup the data/ directory, rather than just the data/index directory. Assuming your schema.xml hasn't changed, then you should be able to restore one data/ directory with another. If you're changing your schema file, though, you need to make sure you restore a version of that file that is consistent with the one that you indexed with. On Sat, Jan 24, 2009 at 5:43 PM, Johnny X wrote: > > Thanks for the reply. > > I ended up fixing it by re-installing Tomcat and starting over. Searches now > appear to work. > > Because I'm testing atm however, is it possible to delete the index and > start afresh in future. > > At the moment I backed up the original index folder...if I just replace that > with the current one including an index will that work...or will other parts > of Solr recognise it's changed and as a result not work? > > What's the best solution for removing the index? > > > Cheers. > > > > ryguasu wrote: >> >> I should clarify that I misspoke before; I thought you had >> indexed="true" on Message-Id and Date, whereas you had it on >> Message-Id and Content. It sounds like you figured this out and >> interpreted my reply in a useful way nonetheless, though. So that's >> good. >> >> The post tool should be a valid way to commit. >> >> As for your technique of updating the field types and reindexing the >> documents, I think it should be fine provided you kept the field type >> for the Message-Id field as string. If you changed it to text along >> with the other field types, then there's a chance your "update" >> technique might instead of the effect of inserting a duplicate copy of >> each document, so there are two copies of each document, one >> searchable, and one not searchable. (I'm not totally sure about this, >> but it's a worry I would have.) That doesn't sound like what's >> happened to you, though. >> >> Could the problem be that you're not specifying which field to query? >> If you're using the standard query analyzer and the stock schema.xml, >> then the default field name is "text", whereas you don't have a field >> called "text" in your schema. In that setup if you want to search on >> the Content field you need to say so explicitly, like so: >> >> Content:phillip >> >> On Sat, Jan 24, 2009 at 7:25 AM, Johnny X >> wrote: >>> >>> If it helps, everything appears when I use Luke to search through the >>> index...but the search in that returns nothing either. >>> >>> When I search using the admin page for the word 'Phillip' (which appears >>> the >>> most in all of the documents) I get the following: >>> >>> >>> - >>> - >>> 0 >>> 0 >>> - >>> on >>> 0 >>> phillip >>> 10 >>> 2.2 >>> >>> >>> >>> >>> >>> >>> Duh...? >>> >>> >>> >>> Johnny X wrote: They all appear in the stats admin page under the NumDocs & maxDocs fields. 
I don't explicitly send a commit command, but my posting ends like this (suggesting they are commited): SimplePostTool: POSTing file 21166.xml SimplePostTool: POSTing file 21169.xml SimplePostTool: COMMITting Solr index changes.. I just tried re-posting all the documents set as "text" -- will that update the current documents indexed? (bearing in mind the unique key, message-id, will be included again) When I try searching I still get 0 results for anything included in the message-id and content fields, both of which should be indexed and returning results... Cheers for any help! ryguasu wrote: > > These might be obvious, but: > > * I assume you did a Solr commit command after indexing, right? > > * If you are using the fieldtype definitions from the default > schema.xml, then your "string" fields are not being analyzed, which > means you should expect search results only if you enter the entire, > exact value of one of the Message-ID or Date fields in your query. Is > that your intention? > > And yes, your analysis of "stored" seems correct. Stored fields are > those whose values you need back at query time, and indexed fields are > those you can do queries on. For a few complications, see > http://wiki.apache.org/solr/FieldOptionsByUseCase > > On Fri, Jan 23, 2009 at 8:04 PM, Johnny X > wrote: >> >> I've indexed my XML using the below in the schema: >> >> > required="true"/> >> >>
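Concretely, the delete-everything option described above is just two small XML messages POSTed to the /update handler (host and port are the usual example values):

    curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' \
         --data-binary '<delete><query>*:*</query></delete>'
    curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' \
         --data-binary '<commit/>'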
Re: How to make Relationships work for Multi-valued Index Fields?
I made this approach work with XPATH and XSL. However, this approach creates multiple fields like this

address_state_1
address_state_2
...
address_state_10

and

credit_card_1
credit_card_2
credit_card_3

How do I search for a credit card? The query syntax does not seem to support wildcards in field names; e.g. I can't seem to do this -> credit_card*:1234 4567 7890 1234

On the search side I would not know how many credit card fields got created for a document and so I need that to be dynamic. -g

On Jan 22, 2009, at 11:54 PM, Shalin Shekhar Mangar wrote: Oops, one more gotcha. The dynamic field support is only in 1.4 trunk. On Fri, Jan 23, 2009 at 1:24 PM, Shalin Shekhar Mangar < shalinman...@gmail.com> wrote: On Fri, Jan 23, 2009 at 1:08 PM, Gunaranjan Chandraraju < chandrar...@apple.com> wrote: I have setup my DIH to treat these as entities as below I think the only way is to create a dynamic field for each attribute (street, state etc.). Write a transformer to copy the fields from your data config to appropriately named dynamic field (e.g. street_1, state_1, etc). To maintain this counter you will need to get/store it with Context#getSessionAttribute(name, val, Context.SCOPE_DOC) and Context#setSessionAttribute(name, val, Context.SCOPE_DOC). I cant't think of an easier way. -- Regards, Shalin Shekhar Mangar. -- Regards, Shalin Shekhar Mangar.
Re: How to make Relationships work for Multi-valued Index Fields?
For searching you need to put them in a single field. Use <copyField/> in schema.xml to achieve that.

On Sun, Jan 25, 2009 at 7:39 AM, Gunaranjan Chandraraju wrote: > I make this approach work with XPATH and XSL. However, this approach > creates multiple fields of like this > > address_state_1 > address_state_2 > ... > address_state_10 > > and > > credit_card_1 > credit_card_2 > credit_card_3 > > > How do I search for a credit_card.The query syntax does not seem to > support wild cards in field names. For e.g. I cant seem to do this -> > credit_card*:1234 4567 7890 1234 > > On the search side I would not know how many credit card fields got created > for a document and so I need that to be dynamic. > > -g > > > On Jan 22, 2009, at 11:54 PM, Shalin Shekhar Mangar wrote: > >> Oops, one more gotcha. The dynamic field support is only in 1.4 trunk. >> >> On Fri, Jan 23, 2009 at 1:24 PM, Shalin Shekhar Mangar < >> shalinman...@gmail.com> wrote: >> >>> On Fri, Jan 23, 2009 at 1:08 PM, Gunaranjan Chandraraju < >>> chandrar...@apple.com> wrote: >>> I have setup my DIH to treat these as entities as below >>> baseDir="***" fileName=".*xml" rootEntity="false" dataSource="null" > >>> name="record" processor="XPathEntityProcessor" stream="false" forEach="/record" url="${f.fileAbsolutePath}"> >>> name="record_adr" processor="XPathEntityProcessor" stream="false" forEach="/record/address" url="${f.fileAbsolutePath}"> >>> xpath="/record/address/@street" /> >>> xpath="/record/address//@state" /> >>> xpath="/record/address//@type" /> >>> >>> I think the only way is to create a dynamic field for each attribute >>> (street, state etc.). Write a transformer to copy the fields from your >>> data >>> config to appropriately named dynamic field (e.g. street_1, state_1, >>> etc). >>> To maintain this counter you will need to get/store it with >>> Context#getSessionAttribute(name, val, Context.SCOPE_DOC) and >>> Context#setSessionAttribute(name, val, Context.SCOPE_DOC). >>> >>> I cant't think of an easier way. >>> -- >>> Regards, >>> Shalin Shekhar Mangar. >>> >> >> >> >> -- >> Regards, >> Shalin Shekhar Mangar. > > -- --Noble Paul
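Spelled out with the hypothetical field names from this thread (credit_card_all and the type are illustrative; the destination field needs multiValued="true" since several source fields copy into it):

    <dynamicField name="credit_card_*" type="text" indexed="true" stored="true" />
    <field name="credit_card_all" type="text" indexed="true" stored="false" multiValued="true" />
    <copyField source="credit_card_*" dest="credit_card_all" />

The search can then be written against the single field: credit_card_all:"1234 4567 7890 1234"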