Multicore Issue - Server Restart

2012-05-28 Thread Sujatha Arun
Hello, We have a multicore webapp for every 50 cores. Currently there are 3 multicore webapps and 150 cores distributed across the 3 webapps. When we restarted the server [Tomcat], we noticed that the solr.xml was wiped out and we could not see any cores in webapp1 and webapp3, but only a few cores in we

Re: Accent Characters

2012-05-28 Thread Jack Krupansky
The query seems fine - as far as the URL being UTF-8. It seems that the documents are not being passed to Solr with UTF-8 encoding. The document is not part of the URL. It is HTTP POST data. Try an explicit curl command to add a document and see if it is indexed with the accents. -- Jack Kru
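The distinction between the URL and the POST body can be illustrated with a small sketch. It only uses the Python standard library and the word from the query in this thread; it does not talk to Solr, and the Latin-1 case is an assumption about what a misconfigured client might send.

```python
from urllib.parse import quote

word = "présenta"

# Query-string side: the accent percent-encoded as UTF-8 vs. Latin-1.
utf8_query = quote(word, encoding="utf-8")
latin1_query = quote(word, encoding="latin-1")

# POST-body side: the same text as raw bytes in each encoding.
utf8_body = word.encode("utf-8")
latin1_body = word.encode("latin-1")

# A receiver that expects UTF-8 cannot decode the Latin-1 bytes cleanly.
garbled = latin1_body.decode("utf-8", errors="replace")

print(utf8_query, latin1_query, garbled)
```

If the bytes actually posted look like the Latin-1 variant, the accents are lost before indexing, regardless of how the query URL is encoded.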

Re: Negative value in numFound

2012-05-28 Thread Jack Krupansky
"... is this limitation documented anywhere..." Kind of, but not very well, at least at the Lucene level. The Lucene File Formats page says "Lucene uses a Java int to refer to document numbers, and the index file format uses an Int32 on-disk to store document numbers. This is a limitation of b

RE: useFastVectorHighlighter doesn't work

2012-05-28 Thread ZHANG Liang F
Hi, The reason why I use useFastVectorHighlighter is because I want to set stored="false", and with more settings like termVectors="true" termPositions="true" termOffsets="true". If stored="true", what is the difference between normal highlight and useFastVectorHighlighter? What is the right

Re: boost date for fresh result query

2012-05-28 Thread Jack Krupansky
Add &debugQuery=true to your query and look at the scores of the older vs. newer docs compared to the boost. Maybe the boost needs to be increased. -- Jack Krupansky -Original Message- From: Jonty Rhods Sent: Monday, May 28, 2012 5:51 AM To: solr-user@lucene.apache.org Subject: boost
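A minimal sketch of building such a debug request, assuming the host, port, core name and the two visible parameters from the original question; any further parameters from the original query are truncated in this listing and are deliberately left out.

```python
from urllib.parse import urlencode

# Base URL and parameters taken from the question in this thread.
BASE_URL = "http://localhost:8083/solr/movie/select/"
params = {
    "defType": "dismax",
    "q": "titanic",
    "debugQuery": "true",  # asks Solr to include score explanations
}

debug_url = BASE_URL + "?" + urlencode(params)
print(debug_url)
```

The explanation section of the response can then be compared for an older and a newer document to see how much the boost contributes to each score.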

Re: boost date for fresh result query

2012-05-28 Thread Jonty Rhods
Please advise, I am stuck here. On Mon, May 28, 2012 at 3:21 PM, Jonty Rhods wrote: > Hi > > I am facing a problem boosting on a date field. > I have the following field in the schema > > > Solr version 3.4 > I don't want to sort by date but want to give a 50 to 60% boost to those results > which have lat

Re: suggestions developing a multi-version concurrency control (MVCC) mechanism

2012-05-28 Thread Lance Norskog
You can use the document id and timestamp as a compound unique id. Then the search would also sort by id, then by timestamp. Result grouping might let you pick the most recent document from each of the sorted docs. On Mon, May 28, 2012 at 3:15 PM, Nicholas Ball wrote: > > Hello all, > > For the f
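The idea can be sketched outside Solr in plain Python. The field names `doc_id`, `timestamp` and `body`, the `!` separator and the sample data are all hypothetical; the sketch only shows the compound key, the sort order, and picking the newest version per document.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical stored versions: several versions per logical document.
versions = [
    {"doc_id": "A", "timestamp": 1, "body": "a-v1"},
    {"doc_id": "B", "timestamp": 2, "body": "b-v1"},
    {"doc_id": "A", "timestamp": 3, "body": "a-v2"},
]


def compound_id(version):
    """Unique key per version: document id plus timestamp."""
    return f"{version['doc_id']}!{version['timestamp']}"


# Sort by id, then by timestamp, as suggested above.
ordered = sorted(versions, key=itemgetter("doc_id", "timestamp"))

# Within each id group the last entry is the most recent version.
latest = {
    doc_id: list(group)[-1]
    for doc_id, group in groupby(ordered, key=itemgetter("doc_id"))
}

print([compound_id(v) for v in ordered])
print({k: v["body"] for k, v in latest.items()})
```

Whether result grouping can do the "latest per group" step inside Solr is left open in the message above, so this only illustrates the intended outcome.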

Re: xpathentityprocessor not import all documents

2012-05-28 Thread Jack Krupansky
Try adding rootEntity="false" to the FilePath entity. The DIH code ends up ignoring your rootEntity="true" on the XPathEntityProcessor entity if the parent does not have rootEntity="false". I'm not sure if that is really correct, but that's the way the code is. -- Jack Krupansky -Original

suggestions developing a multi-version concurrency control (MVCC) mechanism

2012-05-28 Thread Nicholas Ball
Hello all, For the first step of the distributed snapshot isolation system I'm developing for Solr, I'm going to need to have a MVCC mechanism as opposed to the single-version concurrency control mechanism already developed (DistributedUpdateProcessor class). I'm trying to find the very best way

Re: Negative value in numFound

2012-05-28 Thread William Bell
You went over the max limit for number of docs. On Monday, May 28, 2012, tosenthu wrote: > Hi > > I have an index of size 1 TB, and I prepared it by setting up a > background > script to index records. The index was fine for the last 2 days, and I have not > disturbed the process. Suddenly when I queri

Re: Negative value in numFound

2012-05-28 Thread Jack Krupansky
I think 100 million documents is a realistic number for a single shard. Maybe 250 million depending on your data. But I would say that beyond that is being unrealistic. In some cases, even 50 million might be too much for a single shard, depending on the data and query usage. Sure, maybe dependi

xpathentityprocessor not import all documents

2012-05-28 Thread Sagar Joshi
I have XML files that I need to import into Solr. The XML looks like below: 1 albert LA 2 john NY. The XML file path is in a SQL database, so I have created the DataImportHandler file as below

Re: Negative value in numFound

2012-05-28 Thread tosenthu
The RAM is about 14.5G, allocated for Tomcat. I now have 2 shards. I was under the impression that I could handle it with a couple of shards, but in this case I need shards which can each only grow up to 2^31-1 records, and many such shards to support 12 billion records. I will try to have more cores and
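As a rough worked example of the shard arithmetic, assuming the 12 billion record target mentioned here, the hard 2^31-1 limit per Lucene index, and the 100 to 250 million documents per shard suggested elsewhere in this thread as a practical range:

```python
TOTAL_DOCS = 12_000_000_000
INT_MAX = 2**31 - 1  # hard per-index document limit


def shards_needed(total_docs, docs_per_shard):
    """Smallest number of shards that keeps each shard at or below the cap."""
    return -(-total_docs // docs_per_shard)  # integer ceiling division


hard_minimum = shards_needed(TOTAL_DOCS, INT_MAX)
at_250m = shards_needed(TOTAL_DOCS, 250_000_000)
at_100m = shards_needed(TOTAL_DOCS, 100_000_000)

print(hard_minimum, at_250m, at_100m)
```

The hard limit alone gives only a lower bound; the per-shard sizes that are workable in practice depend on the data and the query load, so the larger counts are the more relevant ones for planning.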

Re: UpdateRequestProcessor : flattened values

2012-05-28 Thread Jack Krupansky
And it might make sense to have a "multi-value flattening" attribute for Solr itself rather than in SolrCell. -- Jack Krupansky -Original Message- From: Raphaël Sent: Monday, May 28, 2012 12:56 PM To: solr-user@lucene.apache.org Subject: Re: UpdateRequestProcessor : flattened values

Re: UpdateRequestProcessor : flattened values

2012-05-28 Thread Raphaël
On Mon, May 28, 2012 at 10:30:03AM -0400, Jack Krupansky wrote: > "... the access to individual literal fields seems (currently) very limited > as they appear to be flattened." > > That is a "feature" of SolrCell, to flatten multiple values for a > non-multi-valued field into a string concatenat

Re: Negative value in numFound

2012-05-28 Thread Jack Krupansky
"numFound="-390662429"" That suggests that you have at least two shards which each have > 2G docs (2^31-1). How many shards do you have and how big do you think they should be in terms of number of documents? Are you being careful to distribute your update requests between shards so that n

Re: Negative value in numFound

2012-05-28 Thread Jack Krupansky
OOM is a problem. You need more RAM and more machines, and maybe more shards. -- Jack Krupansky -Original Message- From: tosenthu Sent: Monday, May 28, 2012 11:29 AM To: solr-user@lucene.apache.org Subject: Re: Negative value in numFound There was an Out Of Memory.. But still the in

Re: Negative value in numFound

2012-05-28 Thread tosenthu
Hi, It is a multicore, but even when I searched with the shards query I get this response, which is again a negative value. The total number of records may be > 2147483647 (2^31-1), but is this limitation documented anywhere? What is the strategy to overcome this situation? Expectation

Re: Negative value in numFound

2012-05-28 Thread ku3ia
In some cases multi-shard architecture might significantly slow down the search process at this index size... By the way, how much RAM do you use? -- View this message in context: http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986438.html Sent from the Solr - User maili

Re: Negative value in numFound

2012-05-28 Thread tosenthu
There was an Out Of Memory error, but the indexing still continued. -- View this message in context: http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986437.html Sent from the Solr - User mailing list archive at Nabble.com.

Re: indexing unstructured text (tweets)

2012-05-28 Thread Gora Mohanty
On 28 May 2012 20:12, Jack Krupansky wrote: > Ah, okay. Here's some PHP regexp code for parsing a raw tweet to get user > names and hash tags: > > http://saturnboy.com/2010/02/parsing-twitter-with-regexp/ [...] One could also use the Solr DataImportHandler, and RegexTransformer to do the job: htt

Re: indexing unstructured text (tweets)

2012-05-28 Thread Giovanni Gherdovich
2012/5/28 Jack Krupansky : > Ah, okay. Here's some PHP regexp code for parsing a raw tweet to get user > names and hash tags: > > http://saturnboy.com/2010/02/parsing-twitter-with-regexp/ Awesome! thank you very much Jack. GGhh

Re: indexing unstructured text (tweets)

2012-05-28 Thread Jack Krupansky
Ah, okay. Here's some PHP regexp code for parsing a raw tweet to get user names and hash tags: http://saturnboy.com/2010/02/parsing-twitter-with-regexp/ -- Jack Krupansky -Original Message- From: Giovanni Gherdovich Sent: Monday, May 28, 2012 10:35 AM To: solr-user@lucene.apache.org
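For comparison, a minimal Python sketch of the same idea. The patterns below are simplified assumptions (word characters only) and are not taken from the linked article or from the Twitter API's own entity extraction.

```python
import re

# "@name" or "#tag" not preceded by a word character, so that
# e-mail addresses such as a@b.com are not treated as mentions.
MENTION_RE = re.compile(r"(?<!\w)@(\w+)")
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")


def parse_tweet(text):
    """Return the user mentions and hashtags found in a raw tweet."""
    return {
        "mentions": MENTION_RE.findall(text),
        "hashtags": HASHTAG_RE.findall(text),
    }


sample = "RT @solr_user: indexing #tweets with #Solr, mail a@b.com"
parsed = parse_tweet(sample)
print(parsed)
```

The extracted lists could then be sent as values of multi-valued fields next to the raw tweet text.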

Re: Negative value in numFound

2012-05-28 Thread Jack Krupansky
Is this for a single-shard or multi-shard index? There is a 2^31-1 limit for a single Lucene index since document numbers are "int" (32-bit signed in Java) in Lucene, but with Solr shards you can have a multiple of that, based on number of shards. If you are multi-shard, maybe one of the shar
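How a count above 2^31-1 turns negative can be sketched by emulating 32-bit signed wrap-around. The negative value is the numFound reported in this thread; the candidate true total is only an assumption (a single wrap), and the sketch does not claim where in Solr or Lucene the overflow happens.

```python
INT_MAX = 2**31 - 1


def to_int32(n):
    """Wrap an integer the way a 32-bit signed int (Java int) would."""
    n &= 0xFFFFFFFF
    return n - 0x1_0000_0000 if n >= 0x8000_0000 else n


reported = -390662429               # numFound seen in this thread
candidate_total = reported + 2**32  # assumes the count wrapped exactly once

print(INT_MAX, candidate_total, to_int32(candidate_total))
```

Under that assumption the real total would be about 3.9 billion documents, which is consistent with an index that has grown past the single-index limit.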

Re: indexing unstructured text (tweets)

2012-05-28 Thread Giovanni Gherdovich
Hello Jack and Anuj, 2012/5/28 Jack Krupansky : > The Twitter API extracts hash tag and user mentions for you, in addition to > giving you the full raw text. You'll have to read up on the Twitter API. That's what I thought just after hitting "send" on the message above ;-) I am pretty sure the Tw

Re: UpdateRequestProcessor : flattened values

2012-05-28 Thread Jack Krupansky
"... the access to individual literal fields seems (currently) very limited as they appear to be flattened." That is s "feature" of SolrCell, to flatten multiple values for a non-multi-valued field into a string concatenation of the values. All you need to do is add "multiValued="true"" to th

Re: Negative value in numFound

2012-05-28 Thread ku3ia
Hm... Do you have any errors in the logs, during search or during indexing? -- View this message in context: http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986426.html

Re: indexing unstructured text (tweets)

2012-05-28 Thread Jack Krupansky
The Twitter API extracts hash tag and user mentions for you, in addition to giving you the full raw text. You'll have to read up on the Twitter API. -- Jack Krupansky -Original Message- From: Giovanni Gherdovich Sent: Monday, May 28, 2012 10:09 AM To: solr-user@lucene.apache.org Subje

Re: indexing unstructured text (tweets)

2012-05-28 Thread Giovanni Gherdovich
Hello Jack, hi all, 2012/5/28 Jack Krupansky : > Other obvious metadata from the Twitter API to index would be hashtags, user > mentions (both the user id/screen name and user name), date/time, urls > mentioned (expanded if a URL shortener is used), and possibly coordinates > for spatial search.

Re: indexing unstructured text (tweets)

2012-05-28 Thread Anuj Kumar
This is a bit old but provides good information for schema design- http://www.readwriteweb.com/archives/this_is_what_a_tweet_looks_like.php Found this link as well- https://gist.github.com/702360 The types of the field may depend on the search requirements. Regards, Anuj On Mon, May 28, 2012 at

Re: Accent Characters

2012-05-28 Thread couto.vicente
Hi, Jack. First of all thank you for your help. Well, I tried again then I realized that my problem is not really with solr. I did run this query against solr after start it up with the command "java -jar start.jar": http://localhost:8983/solr/coreFR/spell?q=content:pr%C3%A9senta&spellcheck=true&sp

Re: indexing unstructured text (tweets)

2012-05-28 Thread Jack Krupansky
Other obvious metadata from the Twitter API to index would be hashtags, user mentions (both the user id/screen name and user name), date/time, urls mentioned (expanded if a URL shortener is used), and possibly coordinates for spatial search. You would have to add all these fields and values yo

Re: Negative value in numFound

2012-05-28 Thread tosenthu
The details are below:
Solr : 3.5, using a schema file with 53 fields, 8 of them indexed
OS : CentOS 5.4 64 Bit
Java : 1.6.0 64 Bit
Apache Tomcat : 7.0.22
Intel(R) Xeon(R) CPU L5518 @ 2.13GHz (16 Processors)
/dev/mapper/index 5.9T 1.9T 4.0T 33% /Index
Had around 2 Billion Record

Re: [Announce] Solr 3.6 with RankingAlgorithm 1.4.2 - NRT support

2012-05-28 Thread Darren Govoni
I don't recall anyone being able to get acceptable performance with a single index that large with solr/lucene. The conventional wisdom is that parallel searching across cores (or shards in SolrCloud) is the best way to handle index sizes in the "illions". So it's of great interest how you did. Any

Re: Negative value in numFound

2012-05-28 Thread ku3ia
Hi! Can you please show your hardware parameters, the version of Solr that you're using, and your schema.xml file? Thanks. -- View this message in context: http://lucene.472066.n3.nabble.com/Negative-value-in-numFound-tp3986398p3986408.html

Negative value in numFound

2012-05-28 Thread tosenthu
Hi, I have an index of size 1 TB, and I prepared it by setting up a background script to index records. The index was fine for the last 2 days, and I have not disturbed the process. Suddenly when I queried the index I get this response, where the value of numFound is negative. Can anyone say why/how th

Re: [Announce] Solr 3.6 with RankingAlgorithm 1.4.2 - NRT support

2012-05-28 Thread Nagendra Nagarajayya
It is a single node. I am trying to find out if the performance can be referenced. Regarding information on Solr with RankingAlgorithm, you can find all the information here: http://solr-ra.tgels.org On RankingAlgorithm: http://rankingalgorithm.tgels.org Regards, - NN On 5/27/2012 4:50 PM

Re: indexing unstructured text (tweets)

2012-05-28 Thread Giovanni Gherdovich
Hello Dmitry and David, 2012/5/28 Dmitry Kan : > [...] If you just want to > index the text contents of tweets (including web links etc), using just > off-the-shelf Solr is enough. You'll have to wrap your text input (per each > tweet I would assume) into an xml [...] > So design your schema firs

Re: indexing unstructured text (tweets)

2012-05-28 Thread David Radunz
Hey, I think you might be over-thinking this. Tweets are structured. You have the content (tweet), the user who tweeted it and various other meta data. So your 'document', might look like this: ABCD1234 I bought some apples JohnnyBoy To get this structure, you can use any programming
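A minimal sketch of wrapping such a tweet in Solr's XML update format (`<add><doc><field name="...">`). The field names `id`, `text` and `user` are hypothetical, since the markup of the original example did not survive in this listing; they would have to match the fields declared in schema.xml.

```python
import xml.etree.ElementTree as ET


def tweet_to_add_xml(fields):
    """Build a Solr <add> message containing one document."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


# Values from the example above; the field names are assumptions.
tweet = {"id": "ABCD1234", "text": "I bought some apples", "user": "JohnnyBoy"}
add_xml = tweet_to_add_xml(tweet)
print(add_xml)
```

The resulting string is what would be posted to the update handler; building it with an XML library rather than string concatenation avoids escaping problems with characters such as `&` in tweet text.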

Re: indexing unstructured text (tweets)

2012-05-28 Thread Dmitry Kan
Hi, You want to use Tika, if you have your data in some binary format, like pdf or excel. It extracts text from the binary for you. If you just want to index the text contents of tweets (including web links etc), using just off-the-shelf Solr is enough. You'll have to wrap your text input (per eac

indexing unstructured text (tweets)

2012-05-28 Thread Giovanni Gherdovich
Hi all. I am in the process of setting up Solr for my application, which is full text search on a bunch of tweets from twitter. I am afraid I am missing something. From the books I am reading, "Apache Solr 3 Enterprise Search Server", it looks like Solr works with structured input, like XML or C

Re: UpdateRequestProcessor : flattened values

2012-05-28 Thread Raphaël
On Sun, May 27, 2012 at 11:54:02PM -0400, Jack Krupansky wrote: > You can create your own "update processor" that gets control between the > output of Tika and the indexing of the document. > > See: > http://wiki.apache.org/solr/UpdateRequestProcessor Seems to be exactly what I was looking for,

Re: Is there any relationship between size of index and indexing performance?

2012-05-28 Thread bilal dadanlar
Indexing performance is mostly about the number of docs, but when you are optimizing, a large index takes quite a bit more time. On Mon, May 28, 2012 at 12:48 PM, Aditya wrote: > Hi Ivan, > > It depends on the number of terms it has to load. If you index a small amount of > data but store a large amount of data

boost date for fresh result query

2012-05-28 Thread Jonty Rhods
Hi, I am facing a problem boosting on a date field. I have the following field in the schema. Solr version 3.4. I don't want to sort by date but want to give a 50 to 60% boost to those results which have the latest date... The following is the query: http://localhost:8083/solr/movie/select/?defType=dismax&q=titanic&f

Re: Is there any relationship between size of index and indexing performance?

2012-05-28 Thread Aditya
Hi Ivan, It depends on the number of terms it has to load. If you index a small amount of data but store a large amount of data, then your index size may be big but the actual number of terms may be small. It is not directly proportional. Regards Aditya www.findbestopensource.com On Mon, May 28, 2012 at 3:00 PM, Ivan

Which time consuming processes are executed during Solr startup?

2012-05-28 Thread Ivan Hrytsyuk
For example we know that cache warming is executed during startup. Are any other processes executed during Solr startup? Thank you, Ivan

Re: useFastVectorHighlighter doesn't work

2012-05-28 Thread Ahmet Arslan
> I had a schema defined as indexed="true" stored="false" termVectors="true" > termPositions="true" termOffsets="true"/> You need to mark your text field as stored="true" to use &hl.useFastVectorHighlighter=true