Indexing Failed rolled back
I did some research on the DIH config file and the schema and created my own DIH. I'm getting this error when I run a full-import on try.xml; the dataimport status response reports:

0 0 try.xml full-import idle 0:0:0.163 0 1 0 0
2011-01-25 13:56:48 Indexing failed. Rolled back all changes.
2011-01-25 13:56:48
This response format is experimental. It is likely to change in the future.

- DINESHKUMAR . M I am neither especially clever nor especially gifted. I am only very, very curious. -- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-Failed-rolled-back-tp2327412p2327412.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: MySQL + DIH + SpatialSearch
Hey Eric, On Mon, Jan 24, 2011 at 7:23 PM, Eric Angel wrote: > * Or you can typecast before you concat: > > * > Casting before or after concat'ing works both ways - as we've seen two weeks ago, in a similar thread ( http://search.lucidimagination.com/search/document/250975238eaeb9e0/solr_4_0_spatial_search_how_to#60c4c05c9b482df1 ) But anyway, thanks for pointing out that it's really a (confirmed) MySQL bug - very annoying :/ Regards Stefan
Re: DIH serialize
Rich, I played around for a few minutes with script transformers, but I don't have enough knowledge to get anything done right now :/ My idea was: loop over the given row, which should be a Java HashMap or something like that, and do something like this (pseudo-code): var row_data = []; for( var key in row ) { row_data.push( '"' + key + '" : "' + row[key] + '"' ); } row.put( 'whatever_field', '{' + row_data.join( ',' ) + '}' ); Which should result in a JSON object like {"key1":"value1", "key2":"value2"} - and that should be okay to work with? Regards Stefan

On Mon, Jan 24, 2011 at 7:53 PM, Papp Richard wrote:
> Hi Stefan,
>
> yes, this is exactly what I intend - I don't want to search in this field - just quickly return me the result in a serialized form (the search criteria is on other fields). Well, if I could serialize the data exactly as the PHP serialize() does I would be maximally satisfied, but any other form in which I could compact the data easily into one field would please me. Can anyone help me? I guess the script transformer is quite a good way, but I don't know which function I should use there to compact the data to be easily usable in PHP. Or any other method?
>
> thanks,
> Rich
>
> -----Original Message-----
> From: Stefan Matheis [mailto:matheis.ste...@googlemail.com]
> Sent: Monday, January 24, 2011 18:23
> To: solr-user@lucene.apache.org
> Subject: Re: DIH serialize
>
> Hi Rich,
>
> I'm a bit confused after reading your post .. what exactly are you trying to achieve? Serializing (like http://php.net/serialize) your complete row into one field? You don't want to search in them, just store and deliver them in your results? Does that make sense? Sounds a bit strange :)
>
> Regards
> Stefan
>
> On Mon, Jan 24, 2011 at 10:03 AM, Papp Richard wrote:
> >
> > Hi Dennis,
> >
> > thank you for your answer, but I didn't understand why you say it doesn't need serialization. I'm with option "C".
> > But the main question is: how to put the result of many fields ("SELECT * FROM ...") into one field.
> >
> > thanks,
> > Rich
> >
> > -----Original Message-----
> > From: Dennis Gearon [mailto:gear...@sbcglobal.net]
> > Sent: Monday, January 24, 2011 02:07
> > To: solr-user@lucene.apache.org
> > Subject: Re: DIH serialize
> >
> > It depends on your process chain to the eventual viewer/consumer of the data.
> >
> > The questions to ask are:
> > A/ Is the data IN Solr going to be viewed or processed in its original form?
> >     --> set stored="true"
> >     --> no serialization needed.
> > B/ If it's going to be analyzed and searched for separate from any other field, the analyzing will put it into an unreadable form. If you need to see it, then
> >     --> set indexed="true" and stored="true"
> >     --> no serialization needed.
> > C/ If it's NOT going to be viewed AS IS, and it's not going to be searched for AS IS (i.e. other columns will be how the data is found), and you have another, serializable format:
> >     --> set indexed="false" and stored="true"
> >     --> serialize AS PER THE INTENDED APPLICATION; not sure that Solr can do that at all.
> > D/ If it's NOT going to be viewed AS IS, BUT it's going to be searched for AS IS (this column will be how the data is found), and you have another, serializable format:
> >     --> you need to put it into TWO columns
> >     --> A SERIALIZED FIELD
> >         --> set indexed="false" and stored="true"
> >     --> AN UNSERIALIZED FIELD
> >         --> set indexed="false" and stored="true"
> >     --> serialize AS PER THE INTENDED APPLICATION; not sure that Solr can do that at all.
> >
> > Hope that helps!
> >
> > Dennis Gearon
> >
> > Signature Warning
> > It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others' mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
> >
> > EARTH has a Right To Life, otherwise we all die.
> >
> > ----- Original Message -----
> > From: Papp Richard
> > To: solr-user@lucene.apache.org
> > Sent: Sun, January 23, 2011
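For reference, a rough sketch of what Stefan's looping idea could look like as a DIH ScriptTransformer in data-config.xml. The entity, the SQL query and the target field name "serialized" are invented for illustration, and the script does no escaping of quotes inside values - treat it as an untested starting point, not a known-good config:

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/db" user="..." password="..."/>
  <script><![CDATA[
    // Called once per row. "row" is a Java Map, so iterate its keySet()
    // rather than using a JavaScript for-in loop.
    function serializeRow(row) {
      var parts = [];
      var keys = row.keySet().toArray();
      for (var i = 0; i < keys.length; i++) {
        parts.push('"' + keys[i] + '":"' + row.get(keys[i]) + '"');
      }
      // store the JSON-ish string in an extra column
      row.put('serialized', '{' + parts.join(',') + '}');
      return row;
    }
  ]]></script>
  <document>
    <entity name="item" query="SELECT * FROM item" transformer="script:serializeRow">
      <field column="id" name="id"/>
      <field column="serialized" name="serialized"/>
    </entity>
  </document>
</dataConfig>

On the PHP side, json_decode() on the stored field should then give roughly what serialize()/unserialize() would.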
Re: synonyms file, and example cases
Cam, the examples with the provided inline documentation should help you, no? http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory The backslash \ in that context looks like an escape character, to avoid the => being interpreted as the mapping operator. Regards Stefan On Tue, Jan 25, 2011 at 2:31 AM, Cam Bazz wrote: > Hello, > > I have been looking at the solr synonym file that was an example, I > did not understand some notation: > > aaa => > > bbb => 1 2 > > ccc => 1,2 > > a\=>a => b\=>b > > a\,a => b\,b > > fooaaa,baraaa,bazaaa > > The first one says search for when query is aaa. am I correct? > the second one finds "1 2" when query is bbb > the third one is find 1 or 2 when query is ccc > > the fourth, and fifth one I have not understood. > > the last one, i assume is a group, bidirectional mapping between > fooaaa,baraaa,bazaaa > > I am especially interested with this last one, if I do aaa,bbb it will > find aaa and bbb when either aaa or bbb is queried? > > am I correct in those assumptions? > > Best regards, > C.B. >
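For what it's worth, a tiny synonyms.txt along the lines of the example file behaves like this (the tokens are invented here, and the behaviour assumes the default SynonymFilterFactory settings for ignoreCase and expand):

# one-way mapping: aaa is replaced by xxxx
aaa => xxxx
# one-way mapping to several tokens
bbb => xxxx 1 2
# the backslash makes => and , literal characters, not operators:
# the literal token "a=>a" maps to "b=>b", and "a,a" maps to "b,b"
a\=>a => b\=>b
a\,a => b\,b
# comma-separated group: with expand="true" each of these is expanded
# to all three; with expand="false" they all collapse to the first
fooaaa,baraaa,bazaaa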
Performance optimization of Proximity/Wildcard searches
Hi, I am facing performance issues with three types of queries (and their combinations). Some of the queries take more than 2-3 mins. Index size is around 150GB. - Wildcard - Proximity - Phrases (with common words) I know CommonGrams and stop words are a good way to resolve such issues, but they don't fulfill our functional requirements (CommonGrams seem to have issues with phrase proximity, stop words have issues with exact match, etc.). Sharding is an option too, but that also comes with limitations, so I want to keep it as a last resort; I think there must be other things, because 150GB is not too big for one drive/server with 32GB RAM. Cache warming is a good option too, but the index gets updated every hour, so I'm not sure how much that would help. What are the other main tips that can help in performance optimization of the above queries? Thanks -- Regards, Salman Akram
Re: please help >>Problem with dataImportHandler
Caused by: org.xml.sax.SAXParseException: Element type "field" must be followed by either attribute specifications, ">" or "/>". Sounds like invalid XML in your .. dataimport-config? On Tue, Jan 25, 2011 at 5:41 AM, Dinesh wrote: > > http://pastebin.com/tjCs5dHm > > this is the log produced by the solr server > > - > DINESHKUMAR . M > I am neither especially clever nor especially gifted. I am only very, very > curious. > -- > View this message in context: > http://lucene.472066.n3.nabble.com/please-help-Problem-with-dataImportHandler-tp2318585p2326659.html > Sent from the Solr - User mailing list archive at Nabble.com. >
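That parser error typically points at a <field ...> tag in the data-config that is malformed or never closed. As a hedged illustration only (the column and name values below are placeholders, not taken from Dinesh's try.xml), each field element needs either a self-closing slash or an explicit end tag:

<!-- broken: <field column="month" name="month"   (tag never closed) -->
<field column="month" name="month"/>
<field column="time" name="time"></field>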
Re: please help >>Problem with dataImportHandler
Yes, even after correcting it, it is still throwing an exception. - DINESHKUMAR . M I am neither especially clever nor especially gifted. I am only very, very curious. -- View this message in context: http://lucene.472066.n3.nabble.com/please-help-Problem-with-dataImportHandler-tp2318585p2327662.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Getting started with writing parser
On Tue, Jan 25, 2011 at 10:05 AM, Dinesh wrote: > > http://pastebin.com/CkxrEh6h > > this is my sample log [...] And, which portions of the log text do you want to preserve? Does it go into Solr as a single error message, or do you want to separate out parts of it. Regards, Gora
Re: Getting started with writing parser
i want to take the month, time, DHCPMESSAGE, from_mac, gateway_ip, net_ADDR - DINESHKUMAR . M I am neither especially clever nor especially gifted. I am only very, very curious. -- View this message in context: http://lucene.472066.n3.nabble.com/Getting-started-with-writing-parser-tp2278092p2327738.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: please help >>Problem with dataImportHandler
http://lucene.472066.n3.nabble.com/Getting-started-with-writing-parser-tp2278092p2327738.html this thread explains my problem - DINESHKUMAR . M I am neither especially clever nor especially gifted. I am only very, very curious. -- View this message in context: http://lucene.472066.n3.nabble.com/please-help-Problem-with-dataImportHandler-tp2318585p2327745.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Getting started with writing parser
On Tue, Jan 25, 2011 at 11:44 AM, Dinesh wrote: > > i don't even know whether the regex expression that i'm using for my log is > correct or not.. If it is the same try.xml that you posted earlier, it is very likely not going to work. You seem to have just cut and pasted entries from the Hathi Trust blog, without understanding how they work. Could you take a fresh look at http://wiki.apache.org/solr/DataImportHandler and explain in words the following: * What is your directory structure for storing the log files? * What parts of the log file do you want to keep (you have already explained this in another message)? * How would the above translate into: - A Solr schema - Setting up (a) a data source, (b) processor(s), and (c) transformers. > i'm very much worried that i can't proceed with my > project; already 1/3rd of the time is over.. please help.. this is just the first stage.. > after this i have to set up all the logs to be redirected to SYSLOG and > from there i'll send them to the SOLR server.. then i have to analyse all the > data that i obtained from DNS, DHCP, WIFI, SWITCHES.. and i have to prepare > a user based report on his actions.. please help me cause the days i have > keep reducing.. my project leader is questioning me a lot.. pls.. [...] Well, I am sorry, but I strongly feel that we should not be doing your work for you, and especially not if it is a student project, as seems to be the case. If you can address the above points one by one (stay on this thread, please), people should be able to help you. However, it is up to you to get to understand Solr well enough. Regards, Gora
Re: Getting started with writing parser
No, I actually changed the directory to mine, where I stored the log files: it is /home/exam/apa..solr/example/exampledocs. I specified it in a Solr schema and created a DataImportHandler for that in try.xml; then in that I changed the file name to sample.txt. That new try.xml is http://pastebin.com/pfVVA7Hs. I changed the log into one word per line, thinking there might be an error in my regex expression. Now I'm completely stuck. - DINESHKUMAR . M I am neither especially clever nor especially gifted. I am only very, very curious. -- View this message in context: http://lucene.472066.n3.nabble.com/Getting-started-with-writing-parser-tp2278092p2327920.html Sent from the Solr - User mailing list archive at Nabble.com.
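In case it helps to see the general shape, here is a heavily hedged sketch of the kind of data-config usually used to index a log file line by line with DIH: LineEntityProcessor reads each line into a rawLine column, and RegexTransformer splits it into fields via capture groups. The regex, the field names and the file path below are placeholders that would have to be adapted to the real dhcpd/syslog line format, so this is a pattern to start from, not a working config for this log:

<dataConfig>
  <dataSource type="FileDataSource"/>
  <document>
    <entity name="logline"
            processor="LineEntityProcessor"
            url="/path/to/sample.txt"
            transformer="RegexTransformer"
            rootEntity="true">
      <!-- groupNames maps the regex capture groups, in order, to Solr fields -->
      <field column="rawLine"
             regex="^(\w+ +\d+ [\d:]+) \S+ dhcpd: (DHCP\w+) .* from ([0-9a-f:]+) .* via (\S+)$"
             groupNames="timestamp,dhcp_message,from_mac,gateway_ip"/>
    </entity>
  </document>
</dataConfig>

Each named column (timestamp, dhcp_message, from_mac, gateway_ip) would also need a matching field in schema.xml.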
Extracting contents of zipped files with Tika and Solr 1.4.1
Hi, I posted a question in November last year about indexing content from multiple binary files into a single Solr document, and Jayendra responded with a simple solution: zip them up and send that single file to Solr. I understand that the Tika 0.4 JARs supplied with Solr 1.4.1 don't currently allow this to work and only the file names of the zipped files are indexed (and not their contents). I've tried downloading and building the latest Tika (0.8) and replacing the tika-parsers and tika-core JARs in \contrib\extraction\lib, but this still isn't indexing the file contents, and now doesn't even index the file names! Is there a version of Tika that works with the Solr 1.4.1 released distribution which does index the contents of the zipped files? Thanks and kind regards, Gary
DIH From various File system locations
Hi All, I need to index the documents present in my file system at various locations (e.g. C:\docs, D:\docs). Is there any way through which I can specify this in my DIH configuration? Here is my configuration:- / Pankaj Bhatt.
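One way multiple base directories are often handled with DIH - offered only as a hedged sketch, since the entity names, the field mapping and the "bin" BinFileDataSource below are assumptions rather than details from the original config - is to declare one FileListEntityProcessor entity per location and nest the same Tika entity under each:

<document>
  <!-- assumes a dataSource of type BinFileDataSource named "bin" is declared in the dataConfig -->
  <entity name="filesC" processor="FileListEntityProcessor"
          baseDir="C:\docs"
          fileName="docx$|doc$|pdf$|xls$|xlsx$|html$|rtf$|txt$|zip$"
          recursive="true" rootEntity="false" onError="continue">
    <entity name="docC" processor="TikaEntityProcessor"
            url="${filesC.fileAbsolutePath}" format="text" dataSource="bin">
      <field column="text" name="content"/>
    </entity>
  </entity>
  <entity name="filesD" processor="FileListEntityProcessor"
          baseDir="D:\docs"
          fileName="docx$|doc$|pdf$|xls$|xlsx$|html$|rtf$|txt$|zip$"
          recursive="true" rootEntity="false" onError="continue">
    <entity name="docD" processor="TikaEntityProcessor"
            url="${filesD.fileAbsolutePath}" format="text" dataSource="bin">
      <field column="text" name="content"/>
    </entity>
  </entity>
</document>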
Re: Extracting contents of zipped files with Tika and Solr 1.4.1
There seems to be a bug with the current 1.4.1 release. You cannot extract any content at all, regardless of content type. Try to get a fresh version from the SVN repository. I did that earlier today and can verify that Tika now will extract the content. I'm not sure about zip files. Tika version 0.8 is not included in the latest release/trunk from SVN. Erlend On 25.01.11 11.19, Gary Taylor wrote: Hi, I posted a question in November last year about indexing content from multiple binary files into a single Solr document and Jayendra responded with a simple solution to zip them up and send that single file to Solr. I understand that the Tika 0.4 JARs supplied with Solr 1.4.1 don't currently allow this to work and only the file names of the zipped files are indexed (and not their contents). I've tried downloading and building the latest Tika (0.8) and replacing the tika-parsers and tika-core JARS in \contrib\extraction\lib but this still isn't indexing the file contents, and not doesn't even index the file names! Is there a version of Tika that works with the Solr 1.4.1 released distribution which does index the contents of the zipped files? Thanks and kind regards, Gary -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: Performance optimization of Proximity/Wildcard searches
On Tue, 2011-01-25 at 10:20 +0100, Salman Akram wrote: > Cache warming is a good option too but the index get updated every hour so > not sure how much would that help. What is the time difference between queries with a warmed index and a cold one? If the warmed index performs satisfactorily, then one answer is to upgrade your underlying storage. As always for IO-caused performance problems in Lucene/Solr-land, SSD is the answer.
Recommendation on RAM-/Cache configuration
Hi, recently we're experiencing OOMEs (GC overhead limit exceeded) in our searches. Therefore I want to get some clarification on heap and cache configuration. This is the situation: - Solr 1.4.1 running on tomcat 6, Sun JVM 1.6.0_13 64bit - JVM Heap Params: -Xmx8G -XX:MaxPermSize=256m -XX:NewSize=2G -XX:MaxNewSize=2G -XX:SurvivorRatio=6 -XX:+UseParallelOldGC -XX:+UseParallelGC - The machine has 32 GB RAM - Currently there are 4 processors/cores in the machine, this shall be changed to 2 cores in the future. - The index size in the filesystem is ~9.5 GB - The index contains ~ 5.500.000 documents - 1.500.000 of those docs are available for searches/queries, the rest are inactive docs that are excluded from searches (via a flag/field), but they're still stored in the index as they need to be available by id (solr is the main document store in this app) - Caches are configured with a big size (the idea was to prevent filesystem access / disk i/o as much as possible): - filterCache (solr.LRUCache): size=20, initialSize=3, autowarmCount=1000, actual size =~ 60.000, hitratio =~ 0.99 - documentCache (solr.LRUCache): size=20, initialSize=10, autowarmCount=0, actual size =~ 160.000 - 190.000, hitratio =~ 0.74 - queryResultCache (solr.LRUCache): size=20, initialSize=3, autowarmCount=1, actual size =~ 10.000 - 60.000, hitratio =~ 0.71 - Searches are performed using a catchall text field with the standard request handler, all fields are fetched (no fl specified) - Normally ~ 5 concurrent requests, peaks up to 30 or 40 (mostly during GC) - Recently we also added a feature that adds weighted search for special fields, so that the query might become something like this: q=(some query) OR name_weighted:(some query)^2.0 OR brand_weighted:(some query)^4.0 OR longDescription_weighted:(some query)^0.5 (it seemed as if this was the cause of the OOMEs, but IMHO it only increased RAM usage so that now GC could not free enough RAM) The OOMEs that we get are of type "GC overhead limit exceeded"; one of the OOMEs was thrown during auto-warming. I checked two different heapdumps, the first one autogenerated (by -XX:+HeapDumpOnOutOfMemoryError), the second one generated manually via jmap. These show the following distribution of used memory - the autogenerated dump: - documentCache: 56% (size ~ 195.000) - filterCache: 15% (size ~ 60.000) - queryResultCache: 8% (size ~ 61.000) - fieldCache: 6% (fieldCache referenced by WebappClassLoader) - SolrIndexSearcher: 2% The manually generated dump: - documentCache: 48% (size ~ 195.000) - filterCache: 20% (size ~ 60.000) - fieldCache: 11% (fieldCache referenced by the WebappClassLoader) - queryResultCache: 7% (size ~ 61.000) - fieldValueCache: 3% We are also running two search engines with 17GB heap, these don't run into OOMEs. Though, with these bigger heap sizes the longest requests are even longer due to longer stop-the-world gc cycles. Therefore my goal is to run with a smaller heap, IMHO even smaller than 8GB would be good to reduce the time needed for full gc. So what's the right path to follow now? What would you recommend changing in the configuration (solr/jvm)? Would you say it is ok to reduce the cache sizes? Would this increase disk i/o, or would the index be held in the OS's disk cache? Do you have other recommendations to follow / questions? Thanx && cheers, Martin
Re: Specifying an AnalyzerFactory in the schema
Hi Chris, On 24/01/11 21:18, Chris Hostetter wrote: : I notice that in the schema, it is only possible to specify an Analyzer class, : but not a Factory class as for the other elements (Tokenizer, Filter, etc.). : This limits the use of this feature, as it is impossible to specify parameters : for the Analyzer. : I have looked at the IndexSchema implementation, and I think this requires a : simple fix. Should I open an issue about it? Support for constructing Analyzers directly is very crude, and primarily existed for making it easy for people with old indexes and analyzers to keep working. Moving forward, Lucene/Solr eventually won't "ship" concrete Analyzer implementations at all (at least, that's the last consensus I remember), so enhancing support for loading Analyzers (or AnalyzerFactories) doesn't make much sense. Practically speaking, if you have an existing Analyzer that you want to use in Solr, instead of writing an "AnalyzerFactory" for it, you could just write a "TokenizerFactory" that wraps it instead -- functionally that would let you achieve everything an AnalyzerFactory would, except that Solr would already handle letting the schema.xml specify the positionIncrementGap (which you could happily ignore if you wanted) Thanks for the trick, I hadn't thought about doing that. This should work indeed. cheers -- Renaud Delbru
Use terracotta bigmemory for solr-caches
Hi, as the biggest parts of our jvm heap are used by solr caches I asked myself if it wouldn't make sense to run solr caches backed by terracotta's bigmemory (http://www.terracotta.org/bigmemory). The goal is to reduce the time needed for full / stop-the-world GC cycles, as with our 8GB heap the longest requests take up to several minutes. What do you think? Cheers, Martin
Re: Performance optimization of Proximity/Wildcard searches
By warmed index you only mean warming the SOLR cache or OS cache? As I said our index is updated every hour so I am not sure how much SOLR cache would be helpful but OS cache should still be helpful, right? I haven't compared the results with a proper script but from manual testing here are some of the observations. 'Recent' queries which are in cache of course return immediately (only if they are exactly same - even if they took 3-4 mins first time). I will need to test how many recent queries stay in cache but still this would work only for very common queries. User can run different queries and I want at least them to be at 'acceptable' level (5-10 secs) even if not very fast. Our warm up script currently executes all distinct queries in our logs having count > 5. It was run yesterday (with all the indexing update every hour after that) and today when I executed some of the same queries again their time seemed a little less (around 15-20%), I am not sure if this means anything. However, still their time is not acceptable. What do you think is the best way to compare results? First run all the warm up queries and then execute same randomly and compare? We are using Windows server, would it make a big difference if we move to Linux? Our load is not high but some queries are really complex. Also I was hoping to move to SSD in last after trying out all software options. Is that an agreed fact that on large indexes (which don't fit in RAM) proximity/wildcard/phrase queries (on common words) would be slow and it can be only improved by cache warm up and better hardware? Otherwise with an index of around 150GB such queries will take more than a min? If that's the case I know this question is very subjective but if a single query takes 2 min on SAS 10K RPM what would its approx time be on a good SSD (everything else same)? Thanks! On Tue, Jan 25, 2011 at 3:44 PM, Toke Eskildsen wrote: > On Tue, 2011-01-25 at 10:20 +0100, Salman Akram wrote: > > Cache warming is a good option too but the index get updated every hour > so > > not sure how much would that help. > > What is the time difference between queries with a warmed index and a > cold one? If the warmed index performs satisfactory, then one answer is > to upgrade your underlying storage. As always for IO-caused performance > problem in Lucene/Solr-land, SSD is the answer. > > -- Regards, Salman Akram
Re: Possible Memory Leaks / Upgrading to a Later Version of Solr or Lucene
Hi, Are you sure you need CMS incremental mode? It's only adviced when running on a machine with one or two processors. If you have more you should consider disabling the incremental flags. Cheers, On Monday 24 January 2011 19:32:38 Simon Wistow wrote: > We have two slaves replicating off one master every 2 minutes. > > Both using the CMS + ParNew Garbage collector. Specifically > > -server -XX:+UseConcMarkSweepGC -XX:+UseParNewGC > -XX:+CMSIncrementalMode -XX:+CMSIncrementalPacing > > but periodically they both get into a GC storm and just keel over. > > Looking through the GC logs the amount of memory reclaimed in each GC > run gets less and less until we get a concurrent mode failure and then > Solr effectively dies. > > Is it possible there's a memory leak? I note that later versions of > Lucene have fixed a few leaks. Our current versions are relatively old > > Solr Implementation Version: 1.4.1 955763M - mark - 2010-06-17 > 18:06:42 > > Lucene Implementation Version: 2.9.3 951790 - 2010-06-06 01:30:55 > > so I'm wondering if upgrading to later version of Lucene might help (of > course it might not but I'm trying to investigate all options at this > point). If so what's the best way to go about this? Can I just grab the > Lucene jars and drop them somewhere (or unpack and then repack the solr > war file?). Or should I use a nightly solr 1.4? > > Or am I barking up completely the wrong tree? I'm trawling through heap > logs and gc logs at the moment trying to to see what other tuning I can > do but any other hints, tips, tricks or cluebats gratefully received. > Even if it's just "Yeah, we had that problem and we added more slaves > and periodically restarted them" > > thanks, > > Simon -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
Re: Weird behaviour with phrase queries
Frankly, this puzzles me. It *looks* like it should be OK. One warning, the analysis page sometimes is a bit misleading, so beware of that. But the output of your queries make it look like the query is parsing as you expect, which leaves the question of whether your index contains what you think it does. You might get a copy of Luke, which allows you to examine what's actually in your index instead of what you think is in there. Sometimes there are surprises here! I didn't mean to re-index your whole corpus, I was thinking that you could just index a few documents in a test index so you have something small to look at. Sorry I can't spot what's happening right away. Good luck! Erick On Tue, Jan 25, 2011 at 2:45 AM, Jerome Renard wrote: > Erick, > > On Mon, Jan 24, 2011 at 9:57 PM, Erick Erickson > wrote: > >> Hmmm, I don't see any screen shots. Several things: >> 1> If your stopword file has comments, I'm not sure what the effect would >> be. >> > > Ha, I thought comments were supported in stopwords.txt > > >> 2> Something's not right here, or I'm being fooled again. Your withresults >> xml has this line: >> +DisjunctionMaxQuery((meta_text:"ecol d >> ingenieur")~0.01) () >> and your noresults has this line: >> +DisjunctionMaxQuery((meta_text:"academi >> charpenti")~0.01) DisjunctionMaxQuery((meta_text:"academi >> charpenti"~100)~0.01) >> >> the empty () in the first one often means you're NOT going to your >> configured dismax parser in solrconfig.xml. Yet that doesn't square with >> your custom qt, so I'm puzzled. >> >> Could we see your raw query string on the way in? It's almost as if you >> defined qt in one and defType in the other, which are not equivalent. >> > > You are right I fixed this problem (my bad). > > 3> It may take 12 hours to index, but you could experiment with a smaller >> subset. You say you know that the noresults one should return documents, >> what proof do >> you have? If there's a single document that you know should match this, >> just >> index it and a few others and you should be able to make many runs until >> you >> get >> to the bottom of this... >> >> > I could but I always thought I had to fully re-index after updating > schema.xml. If > I update only few documents will that take the changes into account without > breaking > the rest ? > > >> And obviously your stemming is happening on the query, are you sure it's >> happening at index time too? >> >> > Since you did not get the screenshots you will find attached the full > output of the analysis > for a phrase that works and for another that does not. > > Thanks for your support > > Best Regards, > > -- > Jérôme >
Re: Recommendation on RAM-/Cache configuration
On Tuesday 25 January 2011 11:54:55 Martin Grotzke wrote: > Hi, > > recently we're experiencing OOMEs (GC overhead limit exceeded) in our > searches. Therefore I want to get some clarification on heap and cache > configuration. > > This is the situation: > - Solr 1.4.1 running on tomcat 6, Sun JVM 1.6.0_13 64bit > - JVM Heap Params: -Xmx8G -XX:MaxPermSize=256m -XX:NewSize=2G > -XX:MaxNewSize=2G -XX:SurvivorRatio=6 -XX:+UseParallelOldGC > -XX:+UseParallelGC Consider switching to HotSpot JVM, use the -server as the first switch. > - The machine has 32 GB RAM > - Currently there are 4 processors/cores in the machine, this shall be > changed to 2 cores in the future. > - The index size in the filesystem is ~9.5 GB > - The index contains ~ 5.500.000 documents > - 1.500.000 of those docs are available for searches/queries, the rest are > inactive docs that are excluded from searches (via a flag/field), but > they're still stored in the index as need to be available by id (solr is > the main document store in this app) How do you exclude them? It should use filter queries. I also remember (but i just cannot find it back so please correct my if i'm wrong) that in 1.4.x sorting is done before filtering. It should be an improvement if filtering is done before sorting. If you use sorting, it takes up a huge amount of RAM if filtering is not done first. > - Caches are configured with a big size (the idea was to prevent filesystem > access / disk i/o as much as possible): There is only disk I/O if the kernel can't keep the index (or parts) in its page cache. > - filterCache (solr.LRUCache): size=20, initialSize=3, > autowarmCount=1000, actual size =~ 60.000, hitratio =~ 0.99 > - documentCache (solr.LRUCache): size=20, initialSize=10, > autowarmCount=0, actual size =~ 160.000 - 190.000, hitratio =~ 0.74 > - queryResultCache (solr.LRUCache): size=20, initialSize=3, > autowarmCount=1, actual size =~ 10.000 - 60.000, hitratio =~ 0.71 You should decrease the initialSize values. But your hitratio's seem very nice. > - Searches are performed using a catchall text field using standard request > handler, all fields are fetched (no fl specified) > - Normally ~ 5 concurrent requests, peaks up to 30 or 40 (mostly during GC) > - Recently we also added a feature that adds weighted search for special > fields, so that the query might become s.th. like this > q=(some query) OR name_weighted:(some query)^2.0 OR brand_weighted:(some > query)^4.0 OR longDescription_weighted:(some query)^0.5 > (it seemed as if this was the cause of the OOMEs, but IMHO it only > increased RAM usage so that now GC could not free enough RAM) > > The OOMEs that we get are of type "GC overhead limit exceeded", one of the > OOMEs was thrown during auto-warming. Warming takes additional RAM. The current searcher still has its caches full and newSearcher is getting filled up. Decreasing sizes might help. > > I checked two different heapdumps, the first one autogenerated > (by -XX:+HeapDumpOnOutOfMemoryError) the second one generated manually via > jmap. 
> These show the following distribution of used memory - the autogenerated > dump: > - documentCache: 56% (size ~ 195.000) > - filterCache: 15% (size ~ 60.000) > - queryResultCache: 8% (size ~ 61.000) > - fieldCache: 6% (fieldCache referenced by WebappClassLoader) > - SolrIndexSearcher: 2% > > The manually generated dump: > - documentCache: 48% (size ~ 195.000) > - filterCache: 20% (size ~ 60.000) > - fieldCache: 11% (fieldCache referenced by the WebappClassLoader) > - queryResultCache: 7% (size ~ 61.000) > - fieldValueCache: 3% > > We are also running two search engines with 17GB heap, these don't run into > OOMEs. Though, with these bigger heap sizes the longest requests are even > longer due to longer stop-the-world gc cycles. > Therefore my goal is to run with a smaller heap, IMHO even smaller than 8GB > would be good to reduce the time needed for full gc. > > So what's the right path to follow now? What would you recommend to change > on the configuration (solr/jvm)? Try tuning the GC http://java.sun.com/performance/reference/whitepapers/tuning.html http://www.oracle.com/technetwork/java/gc-tuning-5-138395.html > > Would you say it is ok to reduce the cache sizes? Would this increase disk > i/o, or would the index be hold in the OS's disk cache? Yes! If you also allocate less RAM to the JVM then there is more for the OS to cache. > > Do have other recommendations to follow / questions? > > Thanx && cheers, > Martin -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
Re: Adding weightage to the facets count
Hi Siva, try using the Solr Stats Component http://wiki.apache.org/solr/StatsComponent similar to select/?q=*:*&stats=true&stats.field={your-weight-field}&stats.facet={your-facet-field} and get the sum field from the response. You may need to re-sort the weighted facet counts to get a descending list of facet counts. Note, there is a bug when using the Stats Component with multi-valued facet fields. For details see https://issues.apache.org/jira/browse/SOLR-1782 Johannes 2011/1/24 Chris Hostetter > > : prod1 has a tag called "Light Weight" with weightage 20, > : prod2 has a tag called "Light Weight" with weightage 100, > : > : If i get the facet for "Light Weight", i will get Light Weight (2), > : here i need to take the weightage into account, and the result will be > : Light Weight (120) > : > : How can we achieve this? Any ideas are really helpful. > > > It's not really possible with Solr out of the box. Faceting is fast and > efficient in Solr because it's all done using set intersections (and most > of the sets can be kept in ram very compactly and reused). For what you > are describing you'd need to not only associate a weighted payload with > every TermPosition, but also factor that weight in when doing the > faceting, which means efficient set operations are now out the window. > > If you know java it would probably be possible to write a custom > SolrPlugin (a SearchComponent) to do this type of faceting in special > cases (assuming you indexed in a particular way) but i'm not sure off the > top of my head how well it would scale -- the basic algo i'm thinking of > is (after indexing each facet term with a weight payload) to iterate over > the DocSet of all matching documents in parallel with an iteration over > TermPositions, skipping ahead to only the docs that match the query, and > recording the sum of the payloads for each term. > > Hmmm... > > except TermPositions iterates over tuples, > so you would have to iterate over every term, and for every term then loop > over all matching docs ... like i said, not sure how efficient it would > wind up being. > > You might be happier all around if you just do some sampling -- store the > tag+weight pairs so that they can be retrieved with each doc, and then > when you get your top facet constraints back, look at the first page of > results, and figure out what the sum "weight" is for each of those > constraints based solely on the page#1 results. > > i've had happy users using a similar approach in the past. > > -Hoss -- Johannes Goll 211 Curry Ford Lane Gaithersburg, Maryland 20878
Re: EdgeNgram Auto suggest - doubles ignore
Hi Eric, You are right, there is a copy field to EdgeNgram. I tried the configuration but it is not working as expected. Configuration I tried: edgy_user_query == When I search for the term "apple", it is returning results for "pineapple vers apple", "milk with apple", "apple milk shake" ... Is there any other way to overcome this problem? Thanks, Johnny -- View this message in context: http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2329370.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: DIH From various File system locations
I would just use Nutch and specify the -solr param on the command line. That will add the extracted content your instance of solr. Adam Sent from my iPhone On Jan 25, 2011, at 5:29 AM, pankaj bhatt wrote: > Hi All, > I need to index the documents presents in my file system at various > locations (e.g. C:\docs , d:\docs ). >Is there any way through which i can specify this in my DIH > Configuration. >Here is my configuration:- > > > processor="FileListEntityProcessor" >fileName="docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$" > *baseDir="G:\\Desktop\\"* >recursive="false" >rootEntity="true" >transformer="DateFormatTransformer" > onerror="continue"> > processor="org.apache.solr.handler.dataimport.TikaEntityProcessor" > url="${sd.fileAbsolutePath}" format="text" dataSource="bin"> > > > > > > > > > > > > > > / Pankaj Bhatt.
Re: Recommendation on RAM-/Cache configuration
On Tue, Jan 25, 2011 at 2:06 PM, Markus Jelsma wrote: > On Tuesday 25 January 2011 11:54:55 Martin Grotzke wrote: > > Hi, > > > > recently we're experiencing OOMEs (GC overhead limit exceeded) in our > > searches. Therefore I want to get some clarification on heap and cache > > configuration. > > > > This is the situation: > > - Solr 1.4.1 running on tomcat 6, Sun JVM 1.6.0_13 64bit > > - JVM Heap Params: -Xmx8G -XX:MaxPermSize=256m -XX:NewSize=2G > > -XX:MaxNewSize=2G -XX:SurvivorRatio=6 -XX:+UseParallelOldGC > > -XX:+UseParallelGC > > Consider switching to HotSpot JVM, use the -server as the first switch. The jvm options I mentioned were not all, we're running the jvm with -server (of course). > > > - The machine has 32 GB RAM > > - Currently there are 4 processors/cores in the machine, this shall be > > changed to 2 cores in the future. > > - The index size in the filesystem is ~9.5 GB > > - The index contains ~ 5.500.000 documents > > - 1.500.000 of those docs are available for searches/queries, the rest > are > > inactive docs that are excluded from searches (via a flag/field), but > > they're still stored in the index as need to be available by id (solr is > > the main document store in this app) > > How do you exclude them? It should use filter queries. The docs are indexed with a field "findable" on which we do a filter query. > I also remember (but i > just cannot find it back so please correct my if i'm wrong) that in 1.4.x > sorting is done before filtering. It should be an improvement if filtering > is > done before sorting. > Hmm, I cannot imagine a case where it makes sense to sort before filtering. Can't believe that solr does it like this. Can anyone shed a light on this? > If you use sorting, it takes up a huge amount of RAM if filtering is not > done > first. > > > - Caches are configured with a big size (the idea was to prevent > filesystem > > access / disk i/o as much as possible): > > There is only disk I/O if the kernel can't keep the index (or parts) in its > page cache. > Yes, I'll keep an eye on disk I/O. > > - filterCache (solr.LRUCache): size=20, initialSize=3, > > autowarmCount=1000, actual size =~ 60.000, hitratio =~ 0.99 > > - documentCache (solr.LRUCache): size=20, initialSize=10, > > autowarmCount=0, actual size =~ 160.000 - 190.000, hitratio =~ 0.74 > > - queryResultCache (solr.LRUCache): size=20, initialSize=3, > > autowarmCount=1, actual size =~ 10.000 - 60.000, hitratio =~ 0.71 > > You should decrease the initialSize values. But your hitratio's seem very > nice. > Does the initialSize have a real impact? According to http://wiki.apache.org/solr/SolrCaching#initialSize it's the initial size of the HashMap backing the cache. What would you say are reasonable values for size/initialSize/autowarmCount? Cheers, Martin
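For illustration only, a trimmed-down cache section in solrconfig.xml could look like the following; the numbers are placeholder guesses to be validated against the hit ratios and eviction counts on the admin stats page, not measured recommendations for this index:

<!-- filterCache: one entry per unique fq; autowarm a modest number -->
<filterCache class="solr.FastLRUCache" size="16384" initialSize="4096" autowarmCount="512"/>

<!-- queryResultCache: caches ordered lists of matching doc ids per query -->
<queryResultCache class="solr.LRUCache" size="16384" initialSize="4096" autowarmCount="256"/>

<!-- documentCache: stored fields of recently returned docs; not autowarmed -->
<documentCache class="solr.LRUCache" size="32768" initialSize="4096" autowarmCount="0"/>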
Re: Extracting contents of zipped files with Tika and Solr 1.4.1
On 25.01.11 11.30, Erlend Garåsen wrote: Tika version 0.8 is not included in the latest release/trunk from SVN. Ouch, I wrote "not" instead of "now". Sorry, I replied in a hurry. And to clarify, by "content" I mean the main content of a Word file. Title and other kinds of metadata are successfully extracted by the old 0.4 version of Tika, but you need a newer Tika version (0.8) in order to fetch the main content as well. So try the newest Solr version from trunk. Erlend -- Erlend Garåsen Center for Information Technology Services University of Oslo P.O. Box 1086 Blindern, N-0317 OSLO, Norway Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
Re: Getting started with writing parser
On Tue, Jan 25, 2011 at 3:46 PM, Dinesh wrote: > > no i actually changed the directory to mine where i stored the log files.. it > is /home/exam/apa..solr/example/exampledocs > > i specified it in a solr schema.. i created an DataImportHandler for that in > try.xml.. then in that i changed that file name to sample.txt > > that new try.xml is > http://pastebin.com/pfVVA7Hs [...] Let us take this one part at a time. In your inner nested entity,
Re: Use terracotta bigmemory for solr-caches
Hi Martin, are you sure that your GC is well tuned? A request that needs more than a minute isn't the standard, even when I consider all the other postings about response-performance... Regards -- View this message in context: http://lucene.472066.n3.nabble.com/Use-terracotta-bigmemory-for-solr-caches-tp2328257p2330652.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Extracting contents of zipped files with Tika and Solr 1.4.1
Thanks Erlend. I've not used SVN before, but have managed to download and build the latest trunk code. Now I'm getting an error when trying to access the admin page (via Jetty) because I specify HTMLStripStandardTokenizerFactory in my schema.xml, but this appears to be no longer supplied as part of the build, so I get an exception because it can't find that class. I've checked the CHANGES.txt and found the following in the change list for 1.4.0 (!?): 66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader, HTMLStripWhitespaceTokenizerFactory and HTMLStripStandardTokenizerFactory deprecated. To strip HTML tags, HTMLStripCharFilter can be used with an arbitrary Tokenizer. (koji) Unfortunately, I can't seem to get that to work correctly. Does anyone have an example fieldType stanza (for schema.xml) for stripping out HTML? Thanks and kind regards, Gary. On 25/01/2011 14:17, Erlend Garåsen wrote: On 25.01.11 11.30, Erlend Garåsen wrote: Tika version 0.8 is not included in the latest release/trunk from SVN. Ouch, I wrote "not" instead of "now". Sorry, I replied in a hurry. And to clarify, by "content" I mean the main content of a Word file. Title and other kinds of metadata are successfully extracted by the old 0.4 version of Tika, but you need a newer Tika version (0.8) in order to fetch the main content as well. So try the newest Solr version from trunk. Erlend
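In case an example helps, a minimal fieldType along these lines is the usual replacement for the deprecated tokenizer: the HTML stripping moves into a charFilter that runs before an ordinary tokenizer. The type name and the choice of StandardTokenizer/LowerCaseFilter are arbitrary, and this is untested against the trunk build in question:

<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- strip markup before tokenization -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>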
List of indexed or stored fields
I use a lot of dynamic fields, so looking at my schema isn't a good way to see all the field names that may be indexed across all documents. Is there a way to query solr for that information? All field names that are indexed, or stored? Possibly a count by field name? Is there any other metadata about a field that can be queried? -- View this message in context: http://lucene.472066.n3.nabble.com/List-of-indexed-or-stored-fields-tp2330986p2330986.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Extracting contents of zipped files with Tika and Solr 1.4.1
OK, got past the schema.xml problem, but now I'm back to square one. I can index the contents of binary files (Word, PDF etc...), as well as text files, but it won't index the content of files inside a zip. As an example, I have two txt files - doc1.txt and doc2.txt. If I index either of them individually using: curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5"; -F "file=@doc1.txt" and commit, Solr will index the contents and searches will match. If I zip those two files up into solr1.zip, and index that using: curl "http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5"; -F "file=@solr1.zip" and commit, the file names are indexed, but not their contents. I have checked that Tika can correctly process the zip file when used standalone with the tika-app jar - it outputs both the filenames and contents. Should I be able to index the contents of files stored in a zip by using extract ? Thanks and kind regards, Gary. On 25/01/2011 15:32, Gary Taylor wrote: Thanks Erlend. Not used SVN before, but have managed to download and build latest trunk code. Now I'm getting an error when trying to access the admin page (via Jetty) because I specify HTMLStripStandardTokenizerFactory in my schema.xml, but this appears to be no-longer supplied as part of the build so I get an exception cos it can't find that class. I've checked the CHANGES.txt and found the following in the change list to 1.4.0 (!?) : 66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader, HTMLStripWhitespaceTokenizerFactory and HTMLStripStandardTokenizerFactory deprecated. To strip HTML tags, HTMLStripCharFilter can be used with an arbitrary Tokenizer. (koji) Unfortunately, I can't seem to get that to work correctly. Does anyone have an example fieldType stanza (for schema.xml) for stripping out HTML ? Thanks and kind regards, Gary. On 25/01/2011 14:17, Erlend Garåsen wrote: On 25.01.11 11.30, Erlend Garåsen wrote: Tika version 0.8 is not included in the latest release/trunk from SVN. Ouch, I wrote "not" instead of "now". Sorry, I replied in a hurry. And to clarify, by "content" I mean the main content of a Word file. Title and other kinds of metadata are successfully extracted by the old 0.4 version of Tika, but you need a newer Tika version (0.8) in order to fetch the main content as well. So try the newest Solr version from trunk. Erlend
Re: List of indexed or stored fields
You can query all the indexed or stored fields (including dynamic fields) using the LukeRequestHandler: http://localhost:8983/solr/example/admin/luke See also: http://wiki.apache.org/solr/LukeRequestHandler Regards, * **Juan G. Grande* -- Solr Consultant @ http://www.plugtree.com -- Blog @ http://juanggrande.wordpress.com On Tue, Jan 25, 2011 at 12:39 PM, kenf_nc wrote: > > I use a lot of dynamic fields, so looking at my schema isn't a good way to > see all the field names that may be indexed across all documents. Is there > a > way to query solr for that information? All field names that are indexed, > or > stored? Possibly a count by field name? Is there any other metadata about a > field that can be queried? > -- > View this message in context: > http://lucene.472066.n3.nabble.com/List-of-indexed-or-stored-fields-tp2330986p2330986.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: DIH From various File system locations
Thanks Adam. It seems like Nutch would solve most of my concerns. It would be great if you could share resources for Nutch with us. / Pankaj Bhatt. On Tue, Jan 25, 2011 at 7:21 PM, Estrada Groups < estrada.adam.gro...@gmail.com> wrote: > I would just use Nutch and specify the -solr param on the command line. > That will add the extracted content to your instance of solr. > > Adam > > Sent from my iPhone > > On Jan 25, 2011, at 5:29 AM, pankaj bhatt wrote: > > > Hi All, > > I need to index the documents presents in my file system at > various > > locations (e.g. C:\docs , d:\docs ). > > Is there any way through which i can specify this in my DIH > > Configuration. > > Here is my configuration:- > > > > > > processor="FileListEntityProcessor" > > fileName="docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$" > > *baseDir="G:\\Desktop\\"* > > recursive="false" > > rootEntity="true" > > transformer="DateFormatTransformer" > > onerror="continue"> > > processor="org.apache.solr.handler.dataimport.TikaEntityProcessor" > > url="${sd.fileAbsolutePath}" format="text" dataSource="bin"> > > > > > > > > > > > > > > > > / Pankaj Bhatt.
Re: Extracting contents of zipped files with Tika and Solr 1.4.1
Hi Gary, The latest Solr Trunk was able to extract and index the contents of the zip file using the ExtractingRequestHandler. The snapshot of Trunk we worked upon had the Tika 0.8 snapshot jars and worked pretty well. Tested again with sample url and works fine - curl " http://localhost:8080/solr/core0/update/extract?stream.file=C:/temp/extract/777045.zip&literal.id=777045&literal.title=Test&commit=true " You would probably need to drill down to the Tika Jars and the apache-solr-cell-4.0-dev.jar used for Rich documents indexing. Regards, Jayendra On Tue, Jan 25, 2011 at 11:08 AM, Gary Taylor wrote: > OK, got past the schema.xml problem, but now I'm back to square one. > > I can index the contents of binary files (Word, PDF etc...), as well as > text files, but it won't index the content of files inside a zip. > > As an example, I have two txt files - doc1.txt and doc2.txt. If I index > either of them individually using: > > curl " > http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5"; > -F "file=@doc1.txt" > > and commit, Solr will index the contents and searches will match. > > If I zip those two files up into solr1.zip, and index that using: > > curl " > http://localhost:8983/solr/core0/update/extract?literal.docid=74&fmap.content=text&literal.type=5"; > -F "file=@solr1.zip" > > and commit, the file names are indexed, but not their contents. > > I have checked that Tika can correctly process the zip file when used > standalone with the tika-app jar - it outputs both the filenames and > contents. Should I be able to index the contents of files stored in a zip > by using extract ? > > > Thanks and kind regards, > Gary. > > > On 25/01/2011 15:32, Gary Taylor wrote: > >> Thanks Erlend. >> >> Not used SVN before, but have managed to download and build latest trunk >> code. >> >> Now I'm getting an error when trying to access the admin page (via Jetty) >> because I specify HTMLStripStandardTokenizerFactory in my schema.xml, but >> this appears to be no-longer supplied as part of the build so I get an >> exception cos it can't find that class. I've checked the CHANGES.txt and >> found the following in the change list to 1.4.0 (!?) : >> >> 66. SOLR-1343: Added HTMLStripCharFilter and marked HTMLStripReader, >> HTMLStripWhitespaceTokenizerFactory andHTMLStripStandardTokenizerFactory >> deprecated. To strip HTML tags, HTMLStripCharFilter can be used with an >> arbitrary Tokenizer. (koji) >> >> Unfortunately, I can't seem to get that to work correctly. Does anyone >> have an example fieldType stanza (for schema.xml) for stripping out HTML ? >> >> Thanks and kind regards, >> Gary. >> >> >> >> On 25/01/2011 14:17, Erlend Garåsen wrote: >> >>> On 25.01.11 11.30, Erlend Garåsen wrote: >>> >>> Tika version 0.8 is not included in the latest release/trunk from SVN. >>> >>> Ouch, I wrote "not" instead of "now". Sorry, I replied in a hurry. >>> >>> And to clarify, by "content" I mean the main content of a Word file. >>> Title and other kinds of metadata are successfully extracted by the old 0.4 >>> version of Tika, but you need a newer Tika version (0.8) in order to fetch >>> the main content as well. So try the newest Solr version from trunk. >>> >>> Erlend >>> >>> >> >> >
How to Configure Solr to pick my lucene custom filter
Hi, I have written a Lucene custom filter, but I could not figure out how to configure Solr to pick up this custom filter for search. How do I configure Solr to pick up my custom filter? Will the Solr standard search handler pick up this custom filter? Thanks, Valiveti -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-Configure-Solr-to-pick-my-lucene-custom-filter-tp2331928p2331928.html Sent from the Solr - User mailing list archive at Nabble.com.
in-index representaton of tokens
So, the index is a list of tokens per column, right? There's a table per column that lists the analyzed tokens? And the tokens per column are represented as what, system integers? 32/64 bit unsigned ints? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Re: in-index representaton of tokens
Why does it matter? You can't really get at them unless you store them. I don't know what "table per column" means, there's nothing in Solr architecture called a "table" or a "column". Although by column you probably mean more or less Solr "field". There is nothing like a "table" in Solr. Solr is still not an rdbms. On 1/25/2011 12:26 PM, Dennis Gearon wrote: So, the index is a list of tokens per column, right? There's a table per column that lists the analyzed tokens? And the tokens per column are represented as what, system integers? 32/64 bit unsigned ints? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die.
Re: EdgeNgram Auto suggest - doubles ignore
Let's back up here because now I'm not clear what you actually want. EdgeNGrams are a way of matching substrings, which is what's happening here. Of course searching "apple" against any of the three examples, just as searching for "apple" without grams would match, that's the expected behavior. So, we need a clear problem definition of what you're trying to do, along with example queries (please post the results of adding &debugQuery=on). Best Erick On Tue, Jan 25, 2011 at 8:29 AM, johnnyisrael wrote: > > Hi Eric, > > You are right, there is a copy field to EdgeNgram, I tried the > configuration > but it not working as expected. > > Configuration I tried: > > > > termVectors=”true”> > > > > > > > > > > > positionIncrementGap=”100″> > > > > maxGramSize=”25″/> > > > > > > > > omitNorms=”true” omitTermFreqAndPositions=”true” /> > omitNorms=”true” omitTermFreqAndPositions=”true” /> > > edgy_user_query > > > == > > When I search for the term "apple". > > It is returning results for "pineapple vers apple", "milk with apple", > "apple milk shake" ... > > Is there any other way to overcome this problem? > > Thanks, > > Johnny > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2329370.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: Highlighting with/without Term Vectors
Anyone? On Tue, Jan 25, 2011 at 12:57 AM, Salman Akram < salman.ak...@northbaysolutions.net> wrote: > Just to add one thing, in case it makes a difference. > > Max document size on which highlighting needs to be done is few hundred > kb's (in file system). In index its compressed so should be much smaller. > Total documents are more than 100 million. > > > On Tue, Jan 25, 2011 at 12:42 AM, Salman Akram < > salman.ak...@northbaysolutions.net> wrote: > >> Hi, >> >> Does anyone have any benchmarks how much highlighting speeds up with Term >> Vectors (compared to without it)? e.g. if highlighting on 20 documents take >> 1 sec with Term Vectors any idea how long it will take without them? >> >> I need to know since the index used for highlighting has a TVF file of >> around 450GB (approx 65% of total index size) so I am trying to see whether >> the decreasing the index size by dropping TVF would be more helpful for >> performance (less RAM, should be good for I/O too I guess) or keeping it is >> still better? >> >> I know the best way is try it out but indexing takes a very long time so >> trying to see whether its even worthy or not. >> >> -- >> Regards, >> >> Salman Akram >> >> > > > -- > Regards, > > Salman Akram > -- Regards, Salman Akram
Re: How to Configure Solr to pick my lucene custom filter
Presumably your custom filter is in a jar file. Drop that jar file in your Solr home's lib directory and refer to it from your schema.xml file by its full name (e.g. com.yourcompany.filter.yourcustomfilter), just like the other filters, and it should work fine. You can also put your jar anywhere you'd like and alter solrconfig.xml with an additional lib directive (see the example solrconfig.xml). Best Erick On Tue, Jan 25, 2011 at 12:07 PM, Valiveti wrote: > > Hi , > > I have written a lucene custom filter. > I could not figure out on how to configure Solr to pick this custom filter > for search. > > How to configure Solr to pick my custom filter? > Will the Solr standard search handler pick this custom filter? > > Thanks, > Valiveti > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/How-to-Configure-Solr-to-pick-my-lucene-custom-filter-tp2331928p2331928.html > Sent from the Solr - User mailing list archive at Nabble.com. >
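A hedged sketch of the two pieces Erick describes; the directory path, the placeholder factory class com.yourcompany.solr.MyCustomFilterFactory and its someParam attribute are all invented for illustration, and the class referenced in schema.xml has to be a TokenFilterFactory that creates your filter:

<!-- solrconfig.xml: load jars from an extra directory -->
<lib dir="/path/to/your/jars" />

<!-- schema.xml: use the factory in an analyzer chain -->
<fieldType name="text_custom" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="com.yourcompany.solr.MyCustomFilterFactory" someParam="value"/>
  </analyzer>
</fieldType>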
Re: List of indexed or stored fields
That's exactly what I wanted, thanks. Any idea what 1294513299077 refers to under the index section? I have 2 cores on one Tomcat instance, and 1 on a second instance (different server), and all 3 have different numbers for "version", so I don't think it's the version of Luke. -- View this message in context: http://lucene.472066.n3.nabble.com/List-of-indexed-or-stored-fields-tp2330986p2333281.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: List of indexed or stored fields
The index version. Can be used in replication to determine whether to replicate or not. On Tuesday 25 January 2011 20:30:21 kenf_nc wrote: > refers to under the section? I have 2 cores on one Tomcat instance, > and 1 on a second instance (different server) and all 3 have different > numbers for "version", so I don't think it's the version of Luke. -- Markus Jelsma - CTO - Openindex http://www.linkedin.com/in/markus17 050-8536620 / 06-50258350
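For reference, the same number can be read straight from the replication handler, if it is enabled in solrconfig.xml; host, port and core path below are just placeholders:

  http://localhost:8983/solr/replication?command=indexversion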
Re: EdgeNgram Auto suggest - doubles ignore
Hi Eric, What I want here is, let's say I have 3 documents like ["pineapple vers apple", "milk with apple", "apple milk shake" ] and if I search for "apple", it should return only "apple milk shake", because that entry alone starts with the word "apple" which I typed in. It should not bring the others, and if I type "milk" it should return only "milk with apple". I want output similar to Google auto-suggest. Is there a way to achieve this without encapsulating the query in double quotes? Thanks, Johnny -- View this message in context: http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2333602.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: DIH From various File system locations
There are a few tutorials out there. 1. http://wiki.apache.org/nutch/RunningNutchAndSolr (not the most practical) 2. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ (similar to 1.) 3. Build the latest from branch http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/ and read this one. http://www.adamestrada.com/2010/04/24/web-crawling-with-nutch/ but add the "solr" parameter at the end bin/nutch crawl urls -depth 5 -topN 100 -solr http://localhost:8983/solr This will automatically add the data nutch collected to Solr. For larger files I would also increase your JAVA_OPTS env to something like JAVA_OPTS=' Xmx2048m' Adam On Tue, Jan 25, 2011 at 11:41 AM, pankaj bhatt wrote: > Thanks Adam, It seems like Nutch use to solve most of my concerns. > i would be great if you can have share resources for Nutch with us. > > / Pankaj Bhatt. > > On Tue, Jan 25, 2011 at 7:21 PM, Estrada Groups < > estrada.adam.gro...@gmail.com> wrote: > >> I would just use Nutch and specify the -solr param on the command line. >> That will add the extracted content your instance of solr. >> >> Adam >> >> Sent from my iPhone >> >> On Jan 25, 2011, at 5:29 AM, pankaj bhatt wrote: >> >> > Hi All, >> > I need to index the documents presents in my file system at >> various >> > locations (e.g. C:\docs , d:\docs ). >> > Is there any way through which i can specify this in my DIH >> > Configuration. >> > Here is my configuration:- >> > >> > >> > > > processor="FileListEntityProcessor" >> > fileName="docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$" >> > *baseDir="G:\\Desktop\\"* >> > recursive="false" >> > rootEntity="true" >> > transformer="DateFormatTransformer" >> > onerror="continue"> >> > > > processor="org.apache.solr.handler.dataimport.TikaEntityProcessor" >> > url="${sd.fileAbsolutePath}" format="text" dataSource="bin"> >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > >> > / Pankaj Bhatt. >> >
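If Nutch turns out to be overkill and DIH is enough, one option for the original question (several root directories) is simply to declare one file-listing entity per directory in the same data-config. A rough, untested sketch; entity names, paths and the file-name regex are made up, and a BinFileDataSource named "bin" is assumed to be declared elsewhere in the config:

  <document>
    <entity name="files_c" processor="FileListEntityProcessor"
            baseDir="C:\docs" fileName="doc$|docx$|pdf$|txt$"
            recursive="true" rootEntity="false">
      <entity name="tika_c"
              processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
              url="${files_c.fileAbsolutePath}" format="text" dataSource="bin"/>
    </entity>
    <entity name="files_d" processor="FileListEntityProcessor"
            baseDir="D:\docs" fileName="doc$|docx$|pdf$|txt$"
            recursive="true" rootEntity="false">
      <entity name="tika_d"
              processor="org.apache.solr.handler.dataimport.TikaEntityProcessor"
              url="${files_d.fileAbsolutePath}" format="text" dataSource="bin"/>
    </entity>
  </document>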
Re: DIH From various File system locations
I take that back...Use am currently using version 1.2 and make sure that the latest versions of Tika and PDFBox is in the contrib folder. 1.3 is structured a bit differently and it doesn't look like there is a contrib directory. Maybe one of the Nutch contributors can comment on this? Adam On Tue, Jan 25, 2011 at 3:21 PM, Adam Estrada wrote: > There are a few tutorials out there. > > 1. http://wiki.apache.org/nutch/RunningNutchAndSolr (not the most practical) > 2. http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/ (similar to 1.) > 3. Build the latest from branch > http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/ and read > this one. > > http://www.adamestrada.com/2010/04/24/web-crawling-with-nutch/ > > but add the "solr" parameter at the end bin/nutch crawl urls -depth 5 > -topN 100 -solr http://localhost:8983/solr > > This will automatically add the data nutch collected to Solr. For > larger files I would also increase your JAVA_OPTS env to something > like JAVA_OPTS=' Xmx2048m' > > Adam > > > > > On Tue, Jan 25, 2011 at 11:41 AM, pankaj bhatt wrote: >> Thanks Adam, It seems like Nutch use to solve most of my concerns. >> i would be great if you can have share resources for Nutch with us. >> >> / Pankaj Bhatt. >> >> On Tue, Jan 25, 2011 at 7:21 PM, Estrada Groups < >> estrada.adam.gro...@gmail.com> wrote: >> >>> I would just use Nutch and specify the -solr param on the command line. >>> That will add the extracted content your instance of solr. >>> >>> Adam >>> >>> Sent from my iPhone >>> >>> On Jan 25, 2011, at 5:29 AM, pankaj bhatt wrote: >>> >>> > Hi All, >>> > I need to index the documents presents in my file system at >>> various >>> > locations (e.g. C:\docs , d:\docs ). >>> > Is there any way through which i can specify this in my DIH >>> > Configuration. >>> > Here is my configuration:- >>> > >>> > >>> > >> > processor="FileListEntityProcessor" >>> > fileName="docx$|doc$|pdf$|xls$|xlsx|html$|rtf$|txt$|zip$" >>> > *baseDir="G:\\Desktop\\"* >>> > recursive="false" >>> > rootEntity="true" >>> > transformer="DateFormatTransformer" >>> > onerror="continue"> >>> > >> > processor="org.apache.solr.handler.dataimport.TikaEntityProcessor" >>> > url="${sd.fileAbsolutePath}" format="text" dataSource="bin"> >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> > / Pankaj Bhatt. >>> >> >
CFP - Berlin Buzzwords 2011 - Search, Score, Scale
This is to announce Berlin Buzzwords 2011, the second edition of the successful conference on scalable and open search, data processing and data storage in Germany, taking place in Berlin. Call for Presentations Berlin Buzzwords http://berlinbuzzwords.de Berlin Buzzwords 2011 - Search, Store, Scale 6/7 June 2011 The event will comprise presentations on scalable data processing. We invite you to submit talks on the topics: * IR / Search - Lucene, Solr, katta or comparable solutions * NoSQL - like CouchDB, MongoDB, Jackrabbit, HBase and others * Hadoop - Hadoop itself, MapReduce, Cascading or Pig and relatives * Closely related topics not explicitly listed above are welcome. We are looking for presentations on the implementation of the systems themselves, real world applications and case studies. Important Dates (all dates in GMT +2) * Submission deadline: March 1st 2011, 23:59 MEZ * Notification of accepted speakers: March 22nd, 2011, MEZ. * Publication of final schedule: April 5th, 2011. * Conference: June 6/7, 2011 High quality, technical submissions are called for, ranging from principles to practice. We are looking for real world use cases, background on the architecture of specific projects and a deep dive into architectures built on top of e.g. Hadoop clusters. Proposals should be submitted at http://berlinbuzzwords.de/content/cfp-0 no later than March 1st, 2011. Acceptance notifications will be sent out soon after the submission deadline. Please include your name, bio and email, the title of the talk, and a brief abstract in English. Please indicate whether you want to give a lightning (10min), short (20min) or long (40min) presentation and indicate the level of experience with the topic your audience should have (e.g. whether your talk will be suitable for newbies or is targeted at experienced users). If you'd like to pitch your brand new product in your talk, please let us know as well - there will be extra space for presenting new ideas, awesome products and great new projects. The presentation format is short. We will be enforcing the schedule rigorously. If you are interested in sponsoring the event (e.g. we would be happy to provide videos after the event, free drinks for attendees as well as an after-show party), please contact us. Follow @hadoopberlin on Twitter for updates. Tickets, news on the conference, and the final schedule will be published at http://berlinbuzzwords.de. Program Chairs: Isabel Drost, Jan Lehnardt, and Simon Willnauer. Please re-distribute this CfP to people who might be interested. If you are local and wish to meet us earlier, please note that this Thursday evening there will be an Apache Hadoop Get Together (videos kindly sponsored by Cloudera, venue kindly provided for free by Zanox) featuring talks on Apache Hadoop in production as well as news on current Apache Lucene developments. Contact us at: newthinking communications GmbH Schönhauser Allee 6/7 10119 Berlin, Germany Julia Gemählich Isabel Drost +49(0)30-9210 596
Re: How to Configure Solr to pick my lucene custom filter
Hi Eric, Thanks for the reply. I did see some entries in solrconfig.xml for adding custom responseHandlers, queryParsers and queryResponseWriters. But I could not find the one for adding the custom filter. Could you point me to the exact location or syntax to be used? Thanks, Valiveti -- View this message in context: http://lucene.472066.n3.nabble.com/How-to-Configure-Solr-to-pick-my-lucene-custom-filter-tp2331928p2334120.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: EdgeNgram Auto suggest - doubles ignore
I haven't figured out any way to achieve that AT ALL without making a separate Solr index just to serve autosuggest queries. At least when you want to auto-suggest on a multi-valued field. Someone posted a crazy tricky way to do it with a single-valued field a while ago. If you can/are willing to make a separate Solr index with a schema set up for auto-suggest specifically, it's easy. But from an existing schema, where you want to auto-suggest just based on the values in one field, it's a multi-valued field, and you want to allow matches in the middle of the field -- I don't think there's a way to do it. On 1/25/2011 3:03 PM, johnnyisrael wrote: Hi Eric, What I want here is, lets say I have 3 documents like ["pineapple vers apple", "milk with apple", "apple milk shake" ] and If i search for "apple", it should return only "apple milk shake" because that term alone starts with the letter "apple" which I typed in. It should not bring others and if I type "milk" it should return only "milk with apple" I want an output Similar like a Google auto suggest. Is there a way to achieve this without encapsulating with double quotes. Thanks, Johnny
Re: EdgeNgram Auto suggest - doubles ignore
Then you don't need NGrams at all. A wildcard will suffice or you can use the TermsComponent. If these strings are indexed as single tokens (KeywordTokenizer with LowercaseFilter) you can simply do field:app* to retrieve the "apple milk shake". You can also use the string field type but then you must make sure the values are already lowercased before indexing. Be careful though, there is no query time analysis for wildcard (and fuzzy) queries, so make sure the query term is lowercased as well. > Hi Eric, > > What I want here is, lets say I have 3 documents like > > ["pineapple vers apple", "milk with apple", "apple milk shake" ] > > and If i search for "apple", it should return only "apple milk shake" > because that term alone starts with the letter "apple" which I typed in. It > should not bring others and if I type "milk" it should return only "milk > with apple" > > I want an output Similar like a Google auto suggest. > > Is there a way to achieve this without encapsulating with double quotes. > > Thanks, > > Johnny
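A field type along those lines might look like this (untested sketch, names are arbitrary); each value is kept as a single lowercased token, so a prefix wildcard only matches at the start of the value:

  <fieldType name="suggest_exact" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

With the example documents above, a query like suggest_field:app* would then match "apple milk shake" but not "milk with apple".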
Re: EdgeNgram Auto suggest - doubles ignore
Oh, I should perhaps mention that EdgeNGrams will yield results a lot quicker than using wildcards at the cost of a larger index. You should, of course, use EdgeNGrams if you worry about performance and have a huge index and a high number of queries per second. > Then you don't need NGrams at all. A wildcard will suffice or you can use > the TermsComponent. > > If these strings are indexed as single tokens (KeywordTokenizer with > LowercaseFilter) you can simply do field:app* to retrieve the "apple milk > shake". You can also use the string field type but then you must make sure > the values are already lowercased before indexing. > > Be careful though, there is no query time analysis for wildcard (and fuzzy) > queries so make sure > > > Hi Eric, > > > > What I want here is, lets say I have 3 documents like > > > > ["pineapple vers apple", "milk with apple", "apple milk shake" ] > > > > and If i search for "apple", it should return only "apple milk shake" > > because that term alone starts with the letter "apple" which I typed in. > > It should not bring others and if I type "milk" it should return only > > "milk with apple" > > > > I want an output Similar like a Google auto suggest. > > > > Is there a way to achieve this without encapsulating with double quotes. > > > > Thanks, > > > > Johnny
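The EdgeNGram variant mentioned above would be the same chain plus an index-time edge n-gram filter, roughly (again just a sketch, names arbitrary):

  <fieldType name="suggest_edge" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Because the whole value is one token before the grams are cut, a plain query for "app" only matches values that start with "app", no wildcard needed.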
Re: EdgeNgram Auto suggest - doubles ignore
The index contains around 1.5 million documents. As this is used for the autosuggest feature, performance is an important factor. So it looks like, using edgeNgram, it is difficult to achieve the following: results should contain only those terms where the typed letters match the first word. For example, when we type "M", it should return "Mumford and Sons" and not "jackson Michael". Jonathan, Is it possible to achieve this when we have a separate index using edgeNgram? -- View this message in context: http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2334538.html Sent from the Solr - User mailing list archive at Nabble.com.
Specifying optional terms with standard (lucene) request handler?
Hi I am searching for a way to specify optional terms in a query (terms that don't need to match, but that should influence the scoring if they do match). Using the dismax parser a query like this: 2 on +lorem ipsum dolor amet content dismax Will be parsed into something like this: +((+(content:lor) (content:ipsum) (content:dolor) (content:amet))~2) () which means that only 2 of the 3 optional terms need to match. How can optional terms be specified using the standard request handler? My concrete requirement is that a certain term should match but another is optional. But if the optional part matches - it should give the document an extra score. Something like :-) content:lorem #optional#content:optionalboostword^10 An idea would be to use a function query to boost the document: content:lorem _val_:"query({!lucene v='optionalword^20'})" Which will result in: +content:forum +query(content:optionalword^20.0,def=0.0) Is this a good way or are there other suggestions? Thanks for any opinion and tips on this Daniel
Re: EdgeNgram Auto suggest - doubles ignore
Ah, sorry, I got confused about your requirements; if you just want to match at the beginning of the field, it may be more possible. Using edgegrams or wildcard. If you have a single-valued field. Do you have a single-valued or a multi-valued field? That is, does each document have just one value, or multiple? I still get confused about how to do it with edgegrams, even with a single-valued field, but I think maybe it's possible. _Definitely_ possible, with or without edgegrams, if you are willing/able to make a completely separate Solr index where each term for auto-suggest is a "document". Yes. The problem lies in what "results" are. In general, Solr's results are the documents you have in the Solr index. Thus it makes everything a lot easier to deal with if you have an index where each document in the index is a "term" for auto-suggest. But that doesn't always meet requirements if you need to auto-suggest within existing fq's and such, and of course it takes more resources to run an additional solr index. On 1/25/2011 5:03 PM, mesenthil wrote: The index contains around 1.5 million documents. As this is used for autosuggest feature, performance is an important factor. So it looks like, using edgeNgram it is difficult to achieve the the following Result should return only those terms where search letter is matching with the first word only. For example, when we type "M", it should return "Mumford and Sons" and not "jackson Michael". Jonathan, Is it possible to achieve this when we have separate index using edgeNgram?
Re: Specifying optional terms with standard (lucene) request handler?
With the 'lucene' query parser? Include &q.op=OR and then put a "+" ("mandatory") in front of every term in the 'q' that is NOT optional; the rest will be optional. I think that will do what you want. Jonathan On 1/25/2011 5:07 PM, Daniel Pötzinger wrote: Hi I am searching for a way to specify optional terms in a query ( that dont need to match (But if they match should influence the scoring) ) Using the dismax parser a query like this: 2 on +lorem ipsum dolor amet content dismax Will be parsed into something like this: +((+(content:lor) (content:ipsum) (content:dolor) (content:amet))~2) () Which will result that only 2 of the 3 optional terms need to match? How can optional terms be specified using the standard request handler? My concrete requirement is that a certain term should match but another is optional. But if the optional part matches - it should give the document an extra score. Something like :-) content:lorem #optional#content:optionalboostword^10 An idea would be to use a function query to boost the document: content:lorem _val_:"query({!lucene v='optionalword^20'})" Which will result in: +content:forum +query(content:optionalword^20.0,def=0.0) Is this a good way or are there other suggestions? Thanks for any opinion and tips on this Daniel
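Concretely, with the lucene parser that might look like the following (field and term taken from Daniel's example; untested):

  q.op=OR
  q=+content:lorem content:optionalboostword^10

(URL-encoded when sent over HTTP.) The "+" term must match; the boosted term is optional but lifts the score of any document that also contains it.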
Re: EdgeNgram Auto suggest - doubles ignore
Right now our configuration says multiValued="true". But that need not be "true" in our case. Will make it false, try it, and update this thread with more details. -- View this message in context: http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2334627.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr set up issues with Magento
Thank you Markus. I have added few more fields to schema.xml. Now looks like the products are getting indexed. But no search results. In Magento if I configure to use SOlr as the search engine. Search is not returning any results. If I change the search engine to use Magento's inbuilt MYSQL , Search results are returned. Can you please direct me on where/how I should start debug process. If I use Solr admin and enter the search query that doesn't return any results either. Thank you, Sandhya On Mon, Jan 24, 2011 at 4:11 PM, Markus Jelsma wrote: > Hi, > > You haven't defined the field in Solr's schema.xml configuration so it > needs to > be added first. Perhaps following the tutorial might be a good idea. > > http://lucene.apache.org/solr/tutorial.html > > Cheers. > > > Hello Team: > > > > > > I am in the process of setting up Solr 1.4 with Magento ENterprise > > Edition 1.9. > > > > When I try to index the products I get the following error message. > > > > Jan 24, 2011 3:30:14 PM > org.apache.solr.update.processor.LogUpdateProcessor > > fini > > sh > > INFO: {} 0 0 > > Jan 24, 2011 3:30:14 PM org.apache.solr.common.SolrException log > > SEVERE: org.apache.solr.common.SolrException: ERROR:unknown field > > 'in_stock' at > > org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.jav > > a:289) > > at > > org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpd > > ateProcessorFactory.java:60) > > at > > org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139) > > at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69) > > at > > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Co > > ntentStreamHandlerBase.java:54) > > at > > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandl > > erBase.java:131) > > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) > > at > > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter > > .java:338) > > at > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilte > > r.java:241) > > at > > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Appl > > icationFilterChain.java:244) > > at > > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationF > > ilterChain.java:210) > > at > > org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperV > > alve.java:240) > > at > > org.apache.catalina.core.StandardContextValve.invoke(StandardContextV > > alve.java:161) > > at > > org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.j > > ava:164) > > at > > org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.j > > ava:100) > > at > > org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java: > > 550) > > at > > org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineVal > > ve.java:118) > > at > > org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.jav > > a:380) > > at > > org.apache.coyote.http11.Http11Processor.process(Http11Processor.java > > > > :243) > > > > at > > org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.proce > > ss(Http11Protocol.java:188) > > at > > org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.proce > > ss(Http11Protocol.java:166) > > at > > org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoin > > t.java:288) > > at > > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExec > > utor.java:886) > > at > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor > > 
.java:908) > > at java.lang.Thread.run(Thread.java:662) > > > > Jan 24, 2011 3:30:14 PM org.apache.solr.core.SolrCore execute > > INFO: [] webapp=/solr path=/update params={wt=json} status=400 QTime=0 > > Jan 24, 2011 3:30:14 PM org.apache.solr.update.DirectUpdateHandler2 > > rollback INFO: start rollback > > Jan 24, 2011 3:30:14 PM org.apache.solr.update.DirectUpdateHandler2 > > rollback INFO: end_rollback > > Jan 24, 2011 3:30:14 PM > org.apache.solr.update.processor.LogUpdateProcessor > > fini > > sh > > INFO: {rollback=} 0 16 > > Jan 24, 2011 3:30:14 PM org.apache.solr.core.SolrCore execute > > > > I am a new to both Magento and SOlr. I could have done some thing stupid > > during installation. I really look forward for your help. > > > > Thank you, > > Sandhya >
Best way to build a solr-based m2 project
Hello list, Apologies if this was already asked; I haven't found the answer in the archive. I've been out of this list for quite some time now, hence the question. I am looking for a good way to package a maven2-based project that would produce a solr-based webapp. I would expect projects such as the velocity contrib, or even the default solr, to include everything needed for this, but I don't see it organized that way and, in particular, I see nothing that contains a packaging of type war. Have I missed something? Should I simply copy some bits into my source and make sure they get copied to the right place? I found a solr archetype but it only delivers a standalone solr, which does not interest me. thanks in advance paul
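One approach that is sometimes used (not verified here; version and coordinates should be checked against the repository) is to depend on the published solr war and let the maven-war-plugin overlay it onto your own war project:

  <dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr</artifactId>
    <version>1.4.1</version>
    <type>war</type>
  </dependency>

With <packaging>war</packaging> in your pom, the war plugin merges the Solr war into your build; your own additions (extra jars, web.xml tweaks) are layered on top from src/main/webapp, while solrconfig.xml and schema.xml usually live in an external solr home.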
Re: in-index representation of tokens
I am saying there is a list of tokens that have been parsed (a table of them) for each column? Or one for the whole index? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Jonathan Rochkind To: "solr-user@lucene.apache.org" Sent: Tue, January 25, 2011 9:29:36 AM Subject: Re: in-index representaton of tokens Why does it matter? You can't really get at them unless you store them. I don't know what "table per column" means, there's nothing in Solr architecture called a "table" or a "column". Although by column you probably mean more or less Solr "field". There is nothing like a "table" in Solr. Solr is still not an rdbms. On 1/25/2011 12:26 PM, Dennis Gearon wrote: > So, the index is a list of tokens per column, right? > > There's a table per column that lists the analyzed tokens? > > And the tokens per column are represented as what, system integers? 32/64 bit > unsigned ints? > > Dennis Gearon > > > Signature Warning > > It is always a good idea to learn from your own mistakes. It is usually a >better > idea to learn from others’ mistakes, so you do not have to make them yourself. > from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' > > > EARTH has a Right To Life, > otherwise we all die. >
Re: How to Configure Solr to pick my lucene custom filter
First, let's be sure we're talking about the same thing. My response was for adding a filter to your analysis chain for a field in Schema.xml. Are you talking about a different sort of filter? Best Erick On Tue, Jan 25, 2011 at 4:09 PM, Valiveti wrote: > > Hi Eric, > > Thanks for the reply. > > I Did see some entries in the solrconfig.xml for adding custom > reposneHandlers, queryParsers and queryResponseWriters. > > Bit could not find the one for adding the custom filter. > > Could you point to the exact location or syntax to be used. > > Thanks, > Valiveti > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/How-to-Configure-Solr-to-pick-my-lucene-custom-filter-tp2331928p2334120.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: in-index representation of tokens
This should shed some light on the matter http://lucene.apache.org/java/2_9_0/fileformats.html > I am saying there is a list of tokens that have been parsed (a table of > them) for each column? Or one for the whole index? > > Dennis Gearon > > > Signature Warning > > It is always a good idea to learn from your own mistakes. It is usually a > better idea to learn from others’ mistakes, so you do not have to make > them yourself. from > 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' > > > EARTH has a Right To Life, > otherwise we all die. > > > > - Original Message > From: Jonathan Rochkind > To: "solr-user@lucene.apache.org" > Sent: Tue, January 25, 2011 9:29:36 AM > Subject: Re: in-index representaton of tokens > > Why does it matter? You can't really get at them unless you store them. > > I don't know what "table per column" means, there's nothing in Solr > architecture called a "table" or a "column". Although by column you > probably mean more or less Solr "field". There is nothing like a > "table" in Solr. > > Solr is still not an rdbms. > > On 1/25/2011 12:26 PM, Dennis Gearon wrote: > > So, the index is a list of tokens per column, right? > > > > There's a table per column that lists the analyzed tokens? > > > > And the tokens per column are represented as what, system integers? 32/64 > > bit unsigned ints? > > > > Dennis Gearon > > > > Signature Warning > > > > It is always a good idea to learn from your own mistakes. It is usually a > > > >better > > > > idea to learn from others’ mistakes, so you do not have to make them > > yourself. from > > 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' > > > > > > EARTH has a Right To Life, > > otherwise we all die.
RE: DIH serialize
Dear Stefan, thank you for your help! Well, I wrote a small script, even if not json, but works: regards, Rich -Original Message- From: Stefan Matheis [mailto:matheis.ste...@googlemail.com] Sent: Tuesday, January 25, 2011 11:13 To: solr-user@lucene.apache.org Subject: Re: DIH serialize Rich, i played around for a few minutes with Script-Transformers, but i have not enough knowledge to get anything done right know :/ My Idea was: looping over the given row, which should be a Java HashMap or something like that? and do sth like this (pseudo-code): var row_data = [] for( var key in row ) { row_data.push( '"' + key + '" : '" + row[key] + '"' ); } row.put( 'whatever_field', '{' + row_data.join( ',' ) + '}' ); Which should result in a json-object like {'key1':'value1', 'key2':'value2'} - and that should be okay to work with? Regards Stefan On Mon, Jan 24, 2011 at 7:53 PM, Papp Richard wrote: > Hi Stefan, > > yes, this is exactly what I intend - I don't want to search in this field > - just quicly return me the result in a serialized form (the search > criteria > is on other fields). Well, if I could serialize the data exactly as like > the > PHP serialize() does I would be maximally satisfied, but any other form in > which I could compact the data easily into one field I would be pleased. > Can anyone help me? I guess the is quite a good way, but I don't > know which function should I use there to compact the data to be easily > usable in PHP. Or any other method? > > thanks, > Rich > > -Original Message- > From: Stefan Matheis [mailto:matheis.ste...@googlemail.com] > Sent: Monday, January 24, 2011 18:23 > To: solr-user@lucene.apache.org > Subject: Re: DIH serialize > > Hi Rich, > > i'm a bit confused after reading your post .. what exactly you wanna try to > achieve? Serializing (like http://php.net/serialize) your complete row > into > one field? Don't wanna search in them, just store and deliver them in your > results? Does that make sense? Sounds a bit strange :) > > Regards > Stefan > > On Mon, Jan 24, 2011 at 10:03 AM, Papp Richardwrote: > > > Hi Dennis, > > > > thank you for your answer, but didn't understand why you say it doesn't > > need serialization. I'm with the option "C". > > but the main question is, how to put into one field a result of many > > fields: "SELECT * FROM". > > > > thanks, > > Rich > > > > -Original Message- > > From: Dennis Gearon [mailto:gear...@sbcglobal.net] > > Sent: Monday, January 24, 2011 02:07 > > To: solr-user@lucene.apache.org > > Subject: Re: DIH serialize > > > > Depends on your process chain to the eventual viewer/consumer of the > data. > > > > The questions to ask are: > > A/ Is the data IN Solr going to be viewed or processed in its original > > form: > > -->set stored = 'true' > > --->no serialization needed. > > B/ If it's going to be anayzed and searched for separate from any other > > field, > > > > the analyzing will put it into an unreadable form. If you need to > see > > it, > > then > > --->set indexed="true" and stored="true" > > --->no serializaton needed. C/ If it's NOT going to be viewed AS > IS, > > and > > it's not going to be searched for AS IS, > > (i.e. other columns will be how the data is found), and you have > > another, > > > > serialzable format: > > -->set indexed="false" and stored="true" > > -->serialize AS PER THE INTENDED APPLICATION, > > not sure that Solr can do that at all. > > C/ If it's NOT going to be viewed AS IS, and it's not going to be > searched > > for > > AS IS, > > (i.e. 
other columns will be how the data is found), and you have > > another, > > > > serialzable format: > > -->set indexed="false" and stored="true" > > -->serialize AS PER THE INTENDED APPLICATION, > > not sure that Solr can do that at all. > > D/ If it's NOT going to be viewed AS IS, BUT it's going to be searched > for > > AS > > IS, > > (this column will be how the data is found), and you have another, > > serialzable format: > > -->you need to pu
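Rich's actual script isn't shown above, but a minimal ScriptTransformer along the lines Stefan sketched might look like this in data-config.xml (untested; the target field name is made up and the "JSON" is naive, with no escaping of quotes in the values):

  <dataConfig>
    <script><![CDATA[
      function serializeRow(row) {
        // row is a java.util.Map; build a crude JSON-ish string from it
        var parts = [];
        var keys = row.keySet().toArray();
        for (var i = 0; i < keys.length; i++) {
          parts.push('"' + keys[i] + '":"' + row.get(keys[i]) + '"');
        }
        row.put('serialized_s', '{' + parts.join(',') + '}');
        return row;
      }
    ]]></script>
    <document>
      <entity name="item" query="SELECT * FROM items"
              transformer="script:serializeRow"/>
    </document>
  </dataConfig>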
RE: in-index representation of tokens
There aren't any tables involved. There's basically one list (per field) of unique tokens for the entire index, and also, a list for each token of which documents contain that token. Which is efficiently encoded, but I don't know the details of that encoding, maybe someone who does can tell you, or you can look at the lucene source, or get one of the several good books on lucene. These 'lists' are set up so you can efficiently look up a token, and see what documents contain that token. That's basically what lucene does, the purpose of lucene. Oh, and then there's term positions and such too, so not only can you see what documents contain that token but you can do proximity searches and stuff. This all gets into lucene implementation details I am not familiar with though. Why do you want to know? If you have specific concerns about disk space or RAM usage or something and how different schema choices effect it, ask them, and someone can probably tell you more easily than someone can explain the total architecture of lucene in a short listserv message. But, hey, maybe someone other than me can do that too! From: Dennis Gearon [gear...@sbcglobal.net] Sent: Tuesday, January 25, 2011 7:02 PM To: solr-user@lucene.apache.org Subject: Re: in-index representaton of tokens I am saying there is a list of tokens that have been parsed (a table of them) for each column? Or one for the whole index? Dennis Gearon Signature Warning It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others’ mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' EARTH has a Right To Life, otherwise we all die. - Original Message From: Jonathan Rochkind To: "solr-user@lucene.apache.org" Sent: Tue, January 25, 2011 9:29:36 AM Subject: Re: in-index representaton of tokens Why does it matter? You can't really get at them unless you store them. I don't know what "table per column" means, there's nothing in Solr architecture called a "table" or a "column". Although by column you probably mean more or less Solr "field". There is nothing like a "table" in Solr. Solr is still not an rdbms. On 1/25/2011 12:26 PM, Dennis Gearon wrote: > So, the index is a list of tokens per column, right? > > There's a table per column that lists the analyzed tokens? > > And the tokens per column are represented as what, system integers? 32/64 bit > unsigned ints? > > Dennis Gearon > > > Signature Warning > > It is always a good idea to learn from your own mistakes. It is usually a >better > idea to learn from others’ mistakes, so you do not have to make them yourself. > from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036' > > > EARTH has a Right To Life, > otherwise we all die. >
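As a toy illustration only (nothing like Lucene's actual on-disk encoding), the structure described above is essentially a per-field term dictionary pointing at postings lists:

  field "content"
    "apple"  -> docs 1, 3, 7   (with per-doc frequencies and positions)
    "banana" -> docs 2, 7
    "cherry" -> docs 3

Lookups go term -> documents, which is why it is cheap to find every document containing a token, but not to list the tokens of one document unless the field is stored (or term vectors are kept).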
Re: EdgeNgram Auto suggest - doubles ignore
OK, try this. Use some analysis chain for your field like: This can be a multiValued field, BTW. now use the TermsComponent to fetch your data. See: http://wiki.apache.org/solr/TermsComponent and specify terms.prefix=apple e.g. http://localhost:8983/solr/terms?terms.prefix=app&terms.fl=blivet The return list should be what you want. Note that the returned values will be lower cased, and you can only specify lower case in your search term (all because of specifying the lowercase filter in my example). This should be very fast no matter what your index size, as the return list size defaults to 10 (though you can specify different numbers). Best Erick On Tue, Jan 25, 2011 at 3:03 PM, johnnyisrael wrote: > > Hi Eric, > > What I want here is, lets say I have 3 documents like > > ["pineapple vers apple", "milk with apple", "apple milk shake" ] > > and If i search for "apple", it should return only "apple milk shake" > because that term alone starts with the letter "apple" which I typed in. It > should not bring others and if I type "milk" it should return only "milk > with apple" > > I want an output Similar like a Google auto suggest. > > Is there a way to achieve this without encapsulating with double quotes. > > Thanks, > > Johnny > -- > View this message in context: > http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2333602.html > Sent from the Solr - User mailing list archive at Nabble.com. >
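For reference, the /terms URL above assumes the TermsComponent is wired up in solrconfig.xml, roughly as in the example config:

  <searchComponent name="terms" class="solr.TermsComponent"/>

  <requestHandler name="/terms" class="solr.SearchHandler">
    <lst name="defaults">
      <bool name="terms">true</bool>
    </lst>
    <arr name="components">
      <str>terms</str>
    </arr>
  </requestHandler>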
Re: Solr set up issues with Magento
There's almost no information to go on here. Please review: http://wiki.apache.org/solr/UsingMailingLists Best Erick On Tue, Jan 25, 2011 at 6:13 PM, Sandhya Padala wrote: > Thank you Markus. I have added few more fields to schema.xml. > > Now looks like the products are getting indexed. But no search results. > > In Magento if I configure to use SOlr as the search engine. Search is not > returning any results. If I change the search engine to use Magento's > inbuilt MYSQL , Search results are returned. Can you please direct me on > where/how I should start debug process. > > If I use Solr admin and enter the search query that doesn't return any > results either. > > Thank you, > Sandhya > > On Mon, Jan 24, 2011 at 4:11 PM, Markus Jelsma > wrote: > > > Hi, > > > > You haven't defined the field in Solr's schema.xml configuration so it > > needs to > > be added first. Perhaps following the tutorial might be a good idea. > > > > http://lucene.apache.org/solr/tutorial.html > > > > Cheers. > > > > > Hello Team: > > > > > > > > > I am in the process of setting up Solr 1.4 with Magento ENterprise > > > Edition 1.9. > > > > > > When I try to index the products I get the following error message. > > > > > > Jan 24, 2011 3:30:14 PM > > org.apache.solr.update.processor.LogUpdateProcessor > > > fini > > > sh > > > INFO: {} 0 0 > > > Jan 24, 2011 3:30:14 PM org.apache.solr.common.SolrException log > > > SEVERE: org.apache.solr.common.SolrException: ERROR:unknown field > > > 'in_stock' at > > > org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.jav > > > a:289) > > > at > > > org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpd > > > ateProcessorFactory.java:60) > > > at > > > org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:139) > > > at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:69) > > > at > > > org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(Co > > > ntentStreamHandlerBase.java:54) > > > at > > > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandl > > > erBase.java:131) > > > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) > > > at > > > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter > > > .java:338) > > > at > > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilte > > > r.java:241) > > > at > > > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Appl > > > icationFilterChain.java:244) > > > at > > > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationF > > > ilterChain.java:210) > > > at > > > org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperV > > > alve.java:240) > > > at > > > org.apache.catalina.core.StandardContextValve.invoke(StandardContextV > > > alve.java:161) > > > at > > > org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.j > > > ava:164) > > > at > > > org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.j > > > ava:100) > > > at > > > org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java: > > > 550) > > > at > > > org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineVal > > > ve.java:118) > > > at > > > org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.jav > > > a:380) > > > at > > > org.apache.coyote.http11.Http11Processor.process(Http11Processor.java > > > > > > :243) > > > > > > at > > > org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.proce > > > ss(Http11Protocol.java:188) > > > at > > > 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.proce > > > ss(Http11Protocol.java:166) > > > at > > > org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoin > > > t.java:288) > > > at > > > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExec > > > utor.java:886) > > > at > > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor > > > .java:908) > > > at java.lang.Thread.run(Thread.java:662) > > > > > > Jan 24, 2011 3:30:14 PM org.apache.solr.core.SolrCore execute > > > INFO: [] webapp=/solr path=/update params={wt=json} status=400 QTime=0 > > > Jan 24, 2011 3:30:14 PM org.apache.solr.update.DirectUpdateHandler2 > > > rollback INFO: start rollback > > > Jan 24, 2011 3:30:14 PM org.apache.solr.update.DirectUpdateHandler2 > > > rollback INFO: end_rollback > > > Jan 24, 2011 3:30:14 PM > > org.apache.solr.update.processor.LogUpdateProcessor > > > fini > > > sh > > > INFO: {rollback=} 0 16 > > > Jan 24, 2011 3:30:14 PM org.apache.solr.core.SolrCore execute > > > > > > I am a new to both Magento and SOlr. I could have done some thing > stupid > > > during installation. I really look forward for your help. > > > > > > Thank you, > > > Sandhya > > >
DIH clean=false
I am not sure if I really understand what is meant by clean=false. In my understanding, for full-import with the default clean=true, it will blow away all documents in the existing index, then do a full import of data from a table into the index. Is that right? Then for clean=false, my understanding is that it won't blow away the existing index. For data that exists in both the index and the DB table (by the same uniqueKey) it will update the index data regardless of whether there is an actual field update. For data existing in the index but not in the table (by comparing uniqueKey), it will leave it in the index. Is that correct? Otherwise, what is the difference from clean=true? Looking for your knowledge on this. Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/DIH-clean-false-tp2351120p2351120.html Sent from the Solr - User mailing list archive at Nabble.com.
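For what it's worth, the parameter just rides along on the import URL, e.g. (handler path as in the example solrconfig.xml):

  http://localhost:8983/solr/dataimport?command=full-import&clean=false&commit=true

As far as I understand it, that reading is essentially right: clean=true deletes the whole index before the import starts, while clean=false leaves existing documents alone and simply re-adds whatever the import produces, overwriting documents that share a uniqueKey and leaving everything else untouched.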