criteria for using the property stored="true" and indexed="true"
Hi, I would like some clarifications on which fields we should assign the properties stored="true" and indexed="true". What are the criteria for these property assignments? What would be the impact if no field is assigned these properties? Thanks in advance, Regards, Dilip TS Starmark Services Pvt. Ltd.
Solr and word frequencies?
Hi, I am working on the following task. I have a big Solr index "B" (around 2 million forum-post entries) and 50 sub-indices "S1-50" (sub-forum entries) which are also included in "B". Now I want Solr to compare the word frequency of each word in "S1-50" to the word frequency of the whole big index "B", to find the words of special interest in "S1-50" compared to "B". My questions are: I guess Solr uses word frequency itself... is it possible to just access this Solr functionality for my task (and if yes, how?), or do I have to write something from scratch? Do I need to put "S1-50" in standalone Solr instances as well, or is it enough to set a field in "B" called 'S' with values (1-50)? Thanks in advance! -- View this message in context: http://www.nabble.com/Solr-and-word-frequencies--tp14292112p14292112.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: SOLR X FAST
On Dec 12, 2007, at 2:50 AM, Nuno Leitao wrote: FAST uses two pipelines - an ingestion pipeline (for document feeding) and a query pipeline - which are fully programmable (i.e., you can customize them fully). At ingestion time you typically prepare documents for indexing (tokenize, character normalize, lemmatize, clean up text, perform entity extraction for facets, perform static boosting for certain documents, etc.), while at query time you can expand synonyms and do other general query-side tasks (not unlike Solr). Horizontal scalability means the ability to cluster your search engine across a large number of servers, so you can scale up on the number of documents, queries, crawls, etc. There are FAST deployments out there which run on dozens, in some cases hundreds, of nodes serving multi-terabyte indexes and achieving hundreds of queries per second. Yet again, if your requirements are relatively simple then Lucene might do the job just fine. Hope this helps. With FAST, you will also get things like: - categorization - clustering - more flexible collapsing / grouping - more scalable facets (navigators) - at least for multivalued fields - gigabytes of poorly documented software - operations from hell - a huge number of bugs - high bills, both for software and hardware. As for linguistic features (named entity extraction, dictionary-based lemmatization and so on) and things like categorization / clustering, these should not be expected to work too well unless you put a huge amount of work into them, and some of the features are really primitive. To sum up, if Solr meets your needs I would highly recommend Solr. If you need some additional features and have the knowledge, integrate other products with Solr. If you really need the scalability, go for FAST or some other commercial software. As for document preprocessing and connectors for Solr, if you need them, you could have a look at OpenPipe, http://openpipe.berlios.de/ (not yet announced). Svein
Leading WildCard in Query
Hi All, I understand that a leading wildcard search is not allowed as it is a very costly operation. There is an issue logged for it (http://issues.apache.org/jira/browse/SOLR-218). Is there any other way of enabling leading wildcards apart from doing it in code by calling QueryParser.setAllowLeadingWildcard(true)? Regards, Eswar
Re: criteria for using the property stored="true" and indexed="true"
See: http://wiki.apache.org/solr/SchemaXml#head-af67aefdc51d18cd8556de164606030446f56554 indexed means searchable (faceting and sorting also need this); stored, instead, is needed only when you need the original text (i.e. not tokenized/analyzed) to be returned. When stored and indexed are not present, I think Solr defaults them both to true. Dilip.TS wrote: > Hi, > > I would like some clarifications on which fields we should assign the properties > stored="true" and indexed="true". > What are the criteria for these property assignments? > What would be the impact if no field is assigned these properties? > > Thanks in advance, > > Regards, > Dilip TS > Starmark Services Pvt. Ltd.
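To make the trade-off concrete, here is an illustrative schema.xml fragment (the field names are hypothetical, not from the thread) showing the three useful combinations:

```xml
<fields>
  <!-- searchable AND returnable in results -->
  <field name="title" type="text" indexed="true" stored="true"/>
  <!-- searchable only: saves disk/RAM when the original value is never displayed -->
  <field name="body" type="text" indexed="true" stored="false"/>
  <!-- returnable only: cannot be searched, faceted, or sorted on -->
  <field name="thumbnail_url" type="string" indexed="false" stored="true"/>
</fields>
```

A field that is neither indexed nor stored would take up no space but also be useless, which is why the defaults lean toward true.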
Re: Leading WildCard in Query
Please vote for SOLR-218. I'm not aware of any other way to accomplish the leading wildcard functionality that would be convenient. SOLR-218 is not asking that it be enabled by default, only that it be functionality exposed to Solr admins via solrconfig.xml. On Dec 12, 2007 6:29 AM, Eswar K <[EMAIL PROTECTED]> wrote: > Hi All, > > I understand that a leading wildcard search is not allowed as it is a very > costly operation. There is an issue logged for it > (http://issues.apache.org/jira/browse/SOLR-218). Is there any other way of > enabling leading wildcards apart from doing it in code by calling > QueryParser.setAllowLeadingWildcard(true)? > > Regards, > Eswar > -- Michael Kimsal http://webdevradio.com
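One well-known workaround, not mentioned in the thread (so treat this as a sketch of an idea, not a description of anyone's setup): index a reversed copy of each term, which turns a costly leading wildcard like `*ing` into an efficient prefix query `gni*` against the reversed field. A toy illustration of the rewrite:

```python
def reverse_terms(terms):
    """Index-time step: keep a parallel list of each term reversed."""
    return [t[::-1] for t in terms]

def leading_wildcard_matches(suffix, terms):
    """Answer a *suffix query as a prefix query over the reversed terms."""
    prefix = suffix[::-1]  # "*ing" becomes a search for terms starting "gni"
    reversed_index = reverse_terms(terms)
    return [terms[i] for i, rt in enumerate(reversed_index)
            if rt.startswith(prefix)]

print(leading_wildcard_matches("ing", ["indexing", "search", "warming"]))
# -> ['indexing', 'warming']
```

The prefix form is fast because it can walk the sorted term dictionary instead of scanning every term.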
Re: display tokens
Chris Hostetter wrote: : Subject: display tokens : : How can I retrieve the "analyzed tokens" (e.g. the stemmed values) of a : specific field? for a field by name independent of documents? the LukeRequestHandler can give you the top N terms for a field ... but if you mean "i did a search, i found a document, show me the analyzed tokens for that document in this field" there is no easy way to get that information. if you have a stored value for that field you can feed it into the analysis.jsp to see what the analyzed tokens are. also check out faceting. This returns the analyzed tokens, not the stored fields. ryan
Re: Creating document schema at runtime
Shalin Shekhar Mangar wrote: Hi, I'm looking on some tips on how to create a new document schema and add it to solr core at runtime. The use case that I'm trying to solve is: 1. Using a custom configuration tool, user creates a solr schema 2. The schema is added (uploaded) to a solr instance (on a remote machine). 3. Documents corresponding to the newly added schema are added to solr. I understand that with SOLR-215, I can create a new core by specifying the config and schema but still, there is no way for me to do this from a remote machine using HTTP calls. Check SOLR-350 and: http://wiki.apache.org/solr/MultiCore the 'LOAD' method isn't implemented yet, but that sounds like what you want. If this capability does not exist, I would be happy to open an issue in JIRA and contribute patches. patches are always welcome! ryan
Re: Solr and word frequencies?
Recono, This would be easier to do with Lucene. Solr uses Lucene under the hood, so just write an app that opens the appropriate indices and makes use of the various docFreq methods in the Lucene API. Look at TermDocs, IndexReader, TermEnum, etc. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Recono <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Wednesday, December 12, 2007 5:00:49 AM Subject: Solr and word frequencies? Hi, I am working on the following task. I have a big Solr index "B" (around 2 million forum-post entries) and 50 sub-indices "S1-50" (sub-forum entries) which are also included in "B". Now I want Solr to compare the word frequency of each word in "S1-50" to the word frequency of the whole big index "B", to find the words of special interest in "S1-50" compared to "B". My questions are: I guess Solr uses word frequency itself... is it possible to just access this Solr functionality for my task (and if yes, how?), or do I have to write something from scratch? Do I need to put "S1-50" in standalone Solr instances as well, or is it enough to set a field in "B" called 'S' with values (1-50)? Thanks in advance!
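Once the per-index document frequencies are in hand (e.g. via Lucene's IndexReader.docFreq, as Otis suggests), the comparison itself is simple arithmetic. A sketch of one common "interestingness" score - the log-ratio of relative frequencies; the function names and the scoring choice are mine, not from the thread:

```python
import math

def interestingness(df_sub, docs_sub, df_big, docs_big):
    """Log-ratio of a word's relative document frequency in the sub-index
    vs. the whole index; > 0 means over-represented in the sub-index."""
    p_sub = df_sub / docs_sub  # fraction of sub-index docs containing word
    p_big = df_big / docs_big  # fraction of all docs containing word
    return math.log(p_sub / p_big)

# A word in 50 of 1,000 sub-forum posts, but only 200 of 2,000,000 overall:
score = interestingness(50, 1000, 200, 2_000_000)
print(round(score, 2))  # -> 6.21 (strongly over-represented)
```

Running this score over every term (e.g. enumerated with TermEnum) and sorting descending gives the "words of special interest" for each S1-50.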
Solr 1.3 expected release date
What date or year do we believe Solr 1.3 will be released? Regards, Martin Owens
Re: Solr 1.3 expected release date
Owens, Martin wrote: What date or year do we believe Solr 1.3 will be released? Regards, Martin Owens 2008 for sure. It will be after Lucene 2.3, and that is a month (more?) away. My honest guess is late Jan to mid Feb. I think the last *major* change going into 1.3 is SOLR-303 (Distributed Search over HTTP) -- this will require some reworking of new features like SearchComponents and solrj. After that, changes will mostly be for stability and clarity. I don't really want to promote using nightly builds, but if you need 1.3 features, the current ones are stable. The interfaces may change, but it should not crash or anything like that. ryan
RE: Solr, search result format
>> I think your biggest problem is requesting 70,000 records from Solr. >> That is not going to be fast. I know it, but the limits on the development don't lend themselves to putting all of the fields into Lucene so a proper search can be conducted. We need to return them all because more work is done on the results webserver-side (much to my chagrin), so paging is out of the question. >> 2. Since you are running out of memory parsing XML, I'm guessing >> that you're using a DOM-style parser. Don't do that. You do not >> need to create elaborate structures, strip mine the data, then >> throw those structures away. Instead, use a streaming parser, like StAX. Oh, I know there are better ways of doing it, I just can't do any of them; constraints and all that. I was looking at the PythonResponseWriter; I'm trying to find a howto, since a response writer would be responsible for writing the response after a search, right? Best regards, Martin Owens
Re: Solr, search result format
I think your biggest problem is requesting 70,000 records from Solr. That is not going to be fast. Two suggestions: 1. Use paging. Get the results in chunks, 10, 25, 100, whatever. 2. Since you are running out of memory parsing XML, I'm guessing that you're using a DOM-style parser. Don't do that. You do not need to create elaborate structures, strip mine the data, then throw those structures away. Instead, use a streaming parser, like StAX. This sounds like an XY problem. What are you trying to achieve by fetching 10,000 records? There is probably a better way to do it. wunder On 12/12/07 11:58 AM, "Owens, Martin" <[EMAIL PROTECTED]> wrote: > Hello everyone, > > I'm looking for a better solution than the current xml output we're currently > getting; if you return more than 70k records the webserver can no longer cope > with parsing the xml and the machine falls over out of memory. > > Ideally what we'd like is for the search results to go directly into a > temporary mysql table so we can link against it in a further request from the > web server. Does anyone know any plugins or people who have done anything along > these lines? > > We might be able to settle for receiving the single field column as a csv type > file, that would at least let us cut down on the processing and parsing. I see > there is a csv indexer but do we have a csv output plugin? > > Once again thank you all for your help. > > Best Regards, Martin Owens
Re: Solr, search result format
Owens, Martin wrote: Hello everyone, I'm looking for a better solution than the current xml output we're currently getting; if you return more than 70k records the webserver can no longer cope with parsing the xml and the machine falls over out of memory. Ideally what we'd like is for the search results to go directly into a temporary mysql table so we can link against it in a further request from the web server. Does anyone know any plugins or people who have done anything along these lines? "out of the box" Solr does not do that... maybe try a custom RequestHandler that extends StandardRequestHandler. Let the base handler do everything, then in handleRequestBody, pull the results out of the response and use JDBC to fill your SQL tables. Otherwise try paging through the results... 70 x 1K results or something like that... ryan
Re: Solr, search result format
Fetch your 70,000 results in 70 chunks of 1000 results. Parse each chunk and add it to your internal list. If you are allowed to parse Python results, why can't you use a different XML parser? What sort of "more work" are you doing? I've implemented lots of stuff on top of a paged model, including customizing the relevance formula and re-ranking. wunder On 12/12/07 12:31 PM, "Owens, Martin" <[EMAIL PROTECTED]> wrote: > >>> I think your biggest problem is requesting 70,000 records from Solr. >>> That is not going to be fast. > > I know it, but the limits on the development don't lend themselves to putting > all of the fields into Lucene so a proper search can be conducted. We need to > return them all because more work is done on the results webserver-side (much > to my chagrin), so paging is out of the question. > >>> 2. Since you are running out of memory parsing XML, I'm guessing >>> that you're using a DOM-style parser. Don't do that. You do not >>> need to create elaborate structures, strip mine the data, then >>> throw those structures away. Instead, use a streaming parser, like StAX. > > Oh, I know there are better ways of doing it, I just can't do any of them; > constraints and all that. > > I was looking at the PythonResponseWriter, I'm trying to find a howto since a > response writer would be responsible for writing the response after a search > right? > > Best regards, Martin Owens
Re: Solr, search result format
On 12-Dec-07, at 11:58 AM, Owens, Martin wrote: Hello everyone, Hi Martin, It is usually preferable not to reply to an existing message on the group when starting a new thread. Some people (like me) use clients that properly track the In-Reply-To/References headers that get added, so multiple threads get all jumbled together. (For instance, "Solr and word frequencies?", "Solr 1.3 expected release date", and "Solr, search result format" are all now mixed together in my client.) Thanks! -Mike
Autocommit
Hello UG, I already posted a while ago about a problem where one of the Solr threads starts using 100% of one of the processor cores on a 4-core system. This doesn't happen right after the start; it slowly increases over about a week until the process runs constantly at 100%. I couldn't figure out a solution for this. I could live with this problem, but I think it has a side effect: while the processor load increases, the time between two autocommits increases as well. Currently autocommit is set to 3 minutes. After 4 weeks the commits run only every 40 minutes. I have the following Solr version installed: Solr Specification Version: 1.2.0 Solr Implementation Version: 1.2.0 - Yonik - 2007-06-02 17:35:12 Lucene Specification Version: 2007-05-20_00-04-53 Lucene Implementation Version: build 2007-05-20 Does anyone have a hint on what I could look for? Thanks Michael -- Michael Thessel <[EMAIL PROTECTED]> Gossamer Threads Inc. http://www.gossamer-threads.com/ Tel: (604) 687-5804 Fax: (604) 687-5806
Re: Autocommit
On Dec 12, 2007 6:15 PM, Michael Thessel <[EMAIL PROTECTED]> wrote: > I already posted a while ago about a problem where one of the Solr threads > starts using 100% of one of the processor cores on a 4-core > system. This sounds like warming / autowarming. The other possibility is garbage collection. > This doesn't happen right after the start; it slowly > increases over about a week until the process runs constantly at > 100%. Is it still just one CPU at 100%, or is it ever 2 or more at 100%? That would tell us if it were due to overlapping autowarming. What happens to your index over time? Does maxDoc() keep increasing? > I couldn't figure out a solution for this. I could live with this > problem, but I think it has a side effect: while the processor load > increases, the time between two autocommits increases as well. Currently > autocommit is set to 3 minutes. After 4 weeks the commits run only > every 40 minutes. > > I have the following Solr version installed: > > Solr Specification Version: 1.2.0 > Solr Implementation Version: 1.2.0 - Yonik - 2007-06-02 17:35:12 > Lucene Specification Version: 2007-05-20_00-04-53 > Lucene Implementation Version: build 2007-05-20 > > Does anyone have a hint on what I could look for? Perhaps post the XML you get from the statistics page so we might know more. Try looking in the logs to see what part of autowarming is taking so long. -Yonik
Re: Autocommit
> > I already posted a while ago about a problem where one of the Solr threads > > starts using 100% of one of the processor cores on a 4-core > > system. > > This sounds like warming / autowarming. > The other possibility is garbage collection. What can I do here? Decrease the autowarmCount? My current settings: filterCache autowarmCount="256" queryResultCache autowarmCount="256" documentCache autowarmCount="0" > > This doesn't happen right after the start; it slowly > > increases over about a week until the process runs constantly at > > 100%. > > Is it still just one CPU at 100%, or is it ever 2 or more at 100%? > That would tell us if it were due to overlapping autowarming. It is always just one core. All the other cores run once in a while at 100%, but only one core is constantly at 100%. > What happens to your index over time? Does maxDoc() keep increasing? Yes, maxDoc is always increasing; it is pretty much the same as the total number of indexed documents. > Perhaps post the XML you get from the statistics page so we might > know more. Try looking in the logs to see what part of autowarming is > taking so long. While checking the logs I ran into some performance warnings. I run an optimize every night. Minutes after I start the optimize I get a: INFO: PERFORMANCE WARNING: Overlapping onDeckSearchers=2 Could the problem be related to this? I disabled the optimize for now. If the optimize is the problem, what would be a better strategy for running the optimize? Thanks for your help. Cheers, Michael -- Michael Thessel <[EMAIL PROTECTED]> Gossamer Threads Inc. http://www.gossamer-threads.com/ Tel: (604) 687-5804 Fax: (604) 687-5806
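For reference, the autowarm counts Michael quotes live on the cache declarations in solrconfig.xml. A hedged sketch of what lowering them might look like (sizes and counts are illustrative only; tune against your own hit-rate statistics):

```xml
<!-- solrconfig.xml: a smaller autowarmCount shortens post-commit warming -->
<filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="64"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="64"/>
<documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
```

The trade-off: less warming means commits complete sooner (and searchers overlap less), at the cost of colder caches right after each commit.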
RE: Solr 1.3 expected release date
... SOLR-303 (Distributed Search over HTTP)... Woo-hoo! -Original Message- From: Ryan McKinley [mailto:[EMAIL PROTECTED] Sent: Wednesday, December 12, 2007 12:09 PM To: solr-user@lucene.apache.org Subject: Re: Solr 1.3 expected release date Owens, Martin wrote: > What date or year do we believe Solr 1.3 will be released? > > Regards, Martin Owens 2008 for sure. It will be after lucene 2.3 and that is a month(more?) away. My honest guess is late Jan to mid Feb. I think the last *major* change going into 1.3 is SOLR-303 (Distributed Search over HTTP) -- this will require some reworking of new features like SearchComponents and solrj. After that, changes will mostly be for stability and clarity. I don't really want to promote using nightly builds, but if you need 1.3 features, the current ones are stable. The interfaces may change, but it should not crash or anything like that. ryan
Re: Solr 1.3 expected release date
On Dec 13, 2007 1:38 AM, Ryan McKinley <[EMAIL PROTECTED]> wrote: > > I think the last *major* change going into 1.3 is SOLR-303 (Distributed > Search over HTTP) -- this will require some reworking of new features > like SearchComponents and solrj. After that, changes will mostly be for > stability and clarity. > > interesting !! ...are you planning to use Hadoop?? Can you brief the DL on the architecture? -- Venkat Blog @ http://blizzardzblogs.blogspot.com
Re: Solr and Flex
I presume you understand the difference between Solr and Flex - I am not sure what you need the code for. Do you want an AS3 implementation/wrapper for Solr, or are you expecting an application which uses Solr (to index the docs), retrieves the docs using some web services, and presents them to the users in a Flex app? Either way - you can code :) On Dec 12, 2007 3:47 AM, jenix <[EMAIL PROTECTED]> wrote: > > Has anyone used Solr in a Flex application? > Any code snipplets to share? > > Thank you. > Jennifer > -- > View this message in context: > http://www.nabble.com/Solr-and-Flex-tp14284703p14284703.html > Sent from the Solr - User mailing list archive at Nabble.com.
Re: SOLR X FAST
: Why use FAST and not use SOLR ? For example. : What will FAST offer that will justify the investment ? Am I the only one that finds these questions incredibly hilarious? particularly on this list? You should also email FAST customer service and ask them "Why use Solr and not use FAST ?" :) -Hoss
RE: Two Solr Webapps, one folder for the index data?
: I asked a question similar to this back in : http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200709.mbox/[EMAIL PROTECTED] : SolrDispatchFilter and stored in the global Config). This way, I can : have multiple instances of Solr up and running with the exact same : configuration, and their indices contained wholly within their : deployment directories. As I mentioned in that thread (and I don't think you ever replied), this seems like a really bad idea ... anytime you want to upgrade Solr, your configs and data all get completely blown away. I think if people want to reuse the same configs multiple times with only small variations (for things like the dataDir), it makes a lot more sense to add support for variable substitution based on JNDI variables... : : I actually have a patch for the solr config parser which allows you to use : : context environment variables in the solrconfig.xml : : I generally use it for development when I'm working with multiple : : instances and different data dirs. I'll add it to jira today if you : : want it. -Hoss
Re: criteria for using the property stored="true" and indexed="true"
: http://wiki.apache.org/solr/SchemaXml#head-af67aefdc51d18cd8556de164606030446f56554 : : indexed means searchable (facet and sort also need this), stored instead : is needed only when you need the original text (i.e. not : tokenized/analyzed) to be returned. : When stored and indexed are not present, I think solr put them to a : default true (both of them) not exactly ... if you don't have them on the field, they are inherited from the <fieldType> ... if you don't have them on the <fieldType>, then it's whatever the default behavior is for the FieldType class used by that <fieldType>. : > What would be the impact if no field is assigned with this property? if no fields are "stored" then you can't see your search results. if no fields are "indexed" you can't search for anything. -Hoss
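To make the inheritance concrete, a hedged schema.xml sketch (type and field names invented for illustration): attributes set on the <fieldType> act as defaults that individual <field> declarations can override:

```xml
<types>
  <!-- type-level defaults: fields of this type are indexed but not stored -->
  <fieldType name="text_unstored" class="solr.TextField" indexed="true" stored="false"/>
</types>
<fields>
  <field name="body" type="text_unstored"/>                 <!-- inherits indexed=true, stored=false -->
  <field name="title" type="text_unstored" stored="true"/>  <!-- overrides stored for this field -->
</fields>
```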
Re: does solr handle hierarchical facets?
: > such that if you search for category, you get all those documents that have : > been tagged with the category AND any sub categories. If this is possible I : > think I'll investigate using solr in place of some existing code we have : > that deals with indexing and searching of such data. : : sort of. you can index a field literally as "category/subcategory/ : subsubcategory" and query for category/* to get all documents in that category : and below. I deal with this kind of stuff all the time ... if you can model your hierarchy using unique categoryIds (numbers are easiest) such that no categoryId appears in more than one place in the hierarchy (something that frequently isn't possible with "category names"), then it's really easy to just index the entire "breadcrumb" for a document, and then you can search on any categoryId and get all of the documents in any "descendant" category. ie, if this is your hierarchy... Products/ Products/Computers/ Products/Computers/Laptops Products/Computers/Desktops Products/Cases Products/Cases/Laptops Products/Cases/CellPhones Then this trick won't work (because Laptops appears twice), but if you have numeric IDs that correspond with each of those categories (so that the two instances of Laptops are unique)... 1/ 1/2/ 1/2/3 1/2/4 1/5/ 1/5/6 1/5/7 ...then you just index the full "path" (a pattern tokenizer can work fine) and then you can search on "5" and get all products which are "Cases". -Hoss
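The id-path trick Hoss describes can be sketched in a few lines (the document data below is invented for illustration): index each document's full breadcrumb of category ids, tokenize on "/" as a pattern tokenizer would, and a search on any single id then matches every descendant category:

```python
# Hypothetical docs indexed with their full numeric breadcrumb paths,
# using Hoss's example hierarchy (1=Products, 2=Computers, 5=Cases...)
docs = {
    "thinkpad":     "1/2/3",  # Products/Computers/Laptops
    "macbook-case": "1/5/6",  # Products/Cases/Laptops
    "iphone-case":  "1/5/7",  # Products/Cases/CellPhones
}

def in_category(doc_path, category_id):
    """True if the doc lives in category_id or any descendant of it;
    splitting on '/' mimics index-time pattern tokenization."""
    return category_id in doc_path.split("/")

def search(category_id):
    return sorted(d for d, path in docs.items() if in_category(path, category_id))

print(search("5"))  # -> ['iphone-case', 'macbook-case']  (both kinds of Cases)
```

Note that a search on "3" (Computers/Laptops) does not pick up "macbook-case", even though both leaf categories are named Laptops - exactly the ambiguity the numeric ids resolve.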
Re: Solr, Multiple processes running
: Subject: Solr, Multiple processes running : References: <[EMAIL PROTECTED]> : <[EMAIL PROTECTED]> : <[EMAIL PROTECTED]> ... http://people.apache.org/~hossman/#threadhijack Thread Hijacking on Mailing Lists When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to and your question is "hidden" in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult. See Also: http://en.wikipedia.org/wiki/Thread_hijacking -Hoss