Proper analyzer / tokenizer for syslog data?

2011-11-04 Thread Peter Spam
Example data:
01/23/2011 05:12:34 [Test] a=1; hello_there=50; data=[1,5,30%];

I would love to be able to just "grep" the data - ie. if I search for "ello", 
it finds and returns "ello", and if I search for "hello_there=5", it would 
match too.

Here's what I'm using now:

   <fieldType name="..." class="solr.TextField">
     <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.WordDelimiterFilterFactory"
               generateWordParts="0" generateNumberParts="0"
               catenateWords="0" catenateNumbers="0" catenateAll="0"
               splitOnCaseChange="0"/>
     </analyzer>
   </fieldType>

The problem with this is that if I search for a substring, I don't get anything 
back.  For example, searching for "ello" or "*ello*" doesn't return.  Any ideas?

http://localhost:8983/solr/select?q=*ello*&start=0&rows=50&hl.maxAnalyzedChars=2147483647&hl.useFastVectorHighlighter=true&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400


Thanks!
Pete


Re: Can Solr handle large text files?

2011-11-04 Thread Peter Spam
Solr 4.0 (11/1 snapshot)
Data: 80k files, average size 2.5MB, largest is 750MB; 
Solr: Each document is max 256k; total docs = 800k
Machine: Early 2009 Mac Pro, 6GB RAM, 1GBmin/2GBmax given to Solr Java; Admin 
shows 30% mem usage

I originally tried injecting the entire file into a single Solr document, and 
this had disastrous results when trying to highlight.  I've now tried splitting 
each file into 256k segments per Solr document, and the results are better, but 
still not what I was hoping for.  Queries are around 2-8 seconds, with some 
reaching into 30+ second territory.

Ideally, I'd like to feed Solr the metadata and the entire file at once, and 
have the back-end split the file into thousands of pieces.  Is this possible?
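
For reference, a client-side sketch of roughly the kind of chunking I'm doing now
(SolrJ; the field names are illustrative, not my actual schema):

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ChunkIndexer {
      public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        File logFile = new File(args[0]);
        BufferedReader in = new BufferedReader(new FileReader(logFile));
        StringBuilder chunk = new StringBuilder();
        int chunkNo = 0;
        String line;
        while ((line = in.readLine()) != null) {
          chunk.append(line).append('\n');
          if (chunk.length() >= 256 * 1024) {          // ~256k of text per Solr document
            addChunk(server, logFile.getName(), chunkNo++, chunk.toString());
            chunk.setLength(0);
          }
        }
        if (chunk.length() > 0) {                      // index the final partial chunk
          addChunk(server, logFile.getName(), chunkNo, chunk.toString());
        }
        in.close();
        server.commit();
      }

      private static void addChunk(SolrServer server, String name, int no, String body)
          throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", name + "-chunk-" + no);     // unique id per chunk
        doc.addField("filename", name);                // shared field, useful for grouping
        doc.addField("body", body);
        server.add(doc);
      }
    }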


Thanks!
Pete

On Nov 1, 2011, at 5:15 PM, Peter Spam wrote:

> Wow, 50 lines is tiny!  Is that how small you need to go, to get good 
> highlighting performance?
> 
> I'm looking at documents that can be up to 800MB in size, so I've decided to 
> split them down into 256k chunks.  I'm still indexing right now - I'm curious 
> to see how performance is when the injection is finished.
> 
> Has anyone done analysis on where the knee in the curve is, wrt document size 
> vs. # of documents?
> 
> 
> Thanks!
> Pete
> 
> On Oct 31, 2011, at 9:28 PM, anand.ni...@rbs.com wrote:
> 
>> Hi,
>> 
>> Basically I need to index very large log files. I have modified the 
>> ExtractingDocumentLoader to create a new document for every 50 lines of the 
>> log file being indexed (the line count is made configurable via a system 
>> property). The 'Filename' field for documents created from one log file is kept 
>> the same, and a unique id is generated by appending the line numbers to the file 
>> name, e.g. 'log.txt (line no. 100-150)'. Each doc is given a custom score, 
>> stored in a field called 'custom_score', which is directly proportional to its 
>> distance from the beginning of the file.
>> 
>> I have also found 'hitGrouped.vm' on the net. Since I am reading only 50 
>> lines for each document, the default max chunk size works for me, but it 
>> can easily be adjusted depending upon the number of lines you are reading per 
>> doc.
>> 
>> Now I have done the grouping based on the 'filename' field and show the 
>> results from the docs having the highest score; as a result I am able to show the 
>> last matching results from the log file. The query parameters that I am using for 
>> search are:
>> 
>> http://localhost:8080/solr/select?defType=dismax&qf=Content&q=Solr&fl=id,score&defType=dismax&bf=sub(1000,caprice_score)&group=true&group.field=FileName
>> 
>> The results are amazing; I am able to index and search very large log 
>> files (a few hundred MB) with very low memory requirements. Highlighting is also 
>> working fine.
>> 
>> Thanks & Regards,
>> Anand
>> 
>> 
>> 
>> 
>> 
>> Anand Nigam
>> RBS Global Banking & Markets
>> Office: +91 124 492 5506   
>> 
>> -Original Message-
>> From: Peter Spam [mailto:ps...@mac.com] 
>> Sent: 21 October 2011 23:04
>> To: solr-user@lucene.apache.org
>> Subject: Re: Can Solr handle large text files?
>> 
>> Thanks for your note, Anand.  What was the maximum chunk size for you?  
>> Could you post the relevant portions of your configuration file?
>> 
>> 
>> Thanks!
>> Pete
>> 
>> On Oct 21, 2011, at 4:20 AM, anand.ni...@rbs.com wrote:
>> 
>>> Hi,
>>> 
>>> I was also facing the issue of highlighting large text files. I applied 
>>> the solution proposed here and it worked. But I am getting the following error:
>>> 
>>> 
>>> Basically, 'hitGrouped.vm' is not found. I am using solr-3.4.0. Where 
>>> can I get this file from? It is referenced in browse.vm.
>>> 
>>> 
>>> #if($response.response.get('grouped'))
>>>  #foreach($grouping in $response.response.get('grouped'))
>>>#parse("hitGrouped.vm")
>>>  #end
>>> #else
>>>  #foreach($doc in $response.results)
>>>#parse("hit.vm")
>>>  #end
>>> #end
>>> 
>>> 
>>> 
>>> HTTP Status 500 - Can't find resource 'hitGrouped.vm' in classpath or 
>>> 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', 
>>> cwd=C:\glassfish3\glassfish\domains\domain1\config 
>>> java.lang.RuntimeException: Can't find resource 'hitGrouped.vm' in 
>>> classpath or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', 
>>> cwd=C:\glassfish3\glassfish\domains\domain1\config at 
>>> org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoade
>>> r.java:268) at 
>>> org.apache.solr.response.SolrVelocityResourceLoader.getResourceStream(
>>> SolrVelocityResourceLoader.java:42) at 
>>> org.apache.velocity.Template.process(Template.java:98) at 
>>> org.apache.velocity.runtime.resource.ResourceManagerImpl.loadResource(
>>> ResourceManagerImpl.java:446) at
>>> 
>>> Thanks & Regards,
>>> Anand
>>> Anand Nigam
>>> RBS Global Banking & Markets
>>> Office: +91 124 492 5506   
>>> 
>>> 
>>> -Original Message-
>>> From: karsten-s...@gmx.de [mailto:karsten-s...@gmx.de]
>>> Sent: 21 October 2011 14:58
>>> To: solr-user@lucene.apache.org
>>> Subjec

Zookeeper aware Replication in SolrCloud

2011-11-04 Thread prakash chandrasekaran


Hi,
I am using SolrCloud and I wanted to add the replication feature to it.
I followed the steps in the Solr wiki, but when the client tried to poll data
from the server I got the error messages below.
In the master log:

Nov 3, 2011 8:34:00 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.cloud.ZooKeeperException: ZkSolrResourceLoader does not 
support getConfigDir() - likely, what you are trying to do is not supported in ZooKeeper mode
at org.apache.solr.cloud.ZkSolrResourceLoader.getConfigDir(ZkSolrResourceLoader.java:99)
at org.apache.solr.handler.ReplicationHandler.getConfFileInfoFromCache(ReplicationHandler.java:378)
at org.apache.solr.handler.ReplicationHandler.getFileList(ReplicationHandler.java:364)

In the slave log:

Nov 3, 2011 8:34:00 PM org.apache.solr.handler.ReplicationHandler doFetch
SEVERE: SnapPull failed org.apache.solr.common.SolrException: Request failed for the url 
org.apache.commons.httpclient.methods.PostMethod@18eabf6
at org.apache.solr.handler.SnapPuller.getNamedListResponse(SnapPuller.java:197)
at org.apache.solr.handler.SnapPuller.fetchFileList(SnapPuller.java:219)
at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:281)
at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:284)
But I can see the slave pointing to the correct master from the link:
http://localhost:7574/solr/replication?command=details

I am also seeing these values in the replication details link (same URL):

Thu Nov 03 20:28:00 PDT 2011
Thu Nov 03 20:27:00 PDT 2011
Thu Nov 03 20:26:00 PDT 2011
Thu Nov 03 20:25:00 PDT 2011

Thu Nov 03 20:28:00 PDT 2011
Thu Nov 03 20:27:00 PDT 2011
Thu Nov 03 20:26:00 PDT 2011
Thu Nov 03 20:25:00 PDT 2011


Thanks,
Prakash

Re: Proper analyzer / tokenizer for syslog data?

2011-11-04 Thread Ahmet Arslan
> Example data:
> 01/23/2011 05:12:34 [Test] a=1; hello_there=50;
> data=[1,5,30%];
> 
> I would love to be able to just "grep" the data - ie. if I
> search for "ello", it finds and returns "ello", and if I
> search for "hello_there=5", it would match too.
> 
> Here's what I'm using now:
> 
>    <fieldType name="..." class="solr.TextField">
>      <analyzer>
>        <tokenizer class="solr.StandardTokenizerFactory"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.WordDelimiterFilterFactory"
>                generateWordParts="0" generateNumberParts="0"
>                catenateWords="0" catenateNumbers="0" catenateAll="0"
>                splitOnCaseChange="0"/>
>      </analyzer>
>    </fieldType>
> 
> The problem with this is that if I search for a substring,
> I don't get anything back.  For example, searching for
> "ello" or "*ello*" doesn't return.  Any ideas?
> 
> http://localhost:8983/solr/select?q=*ello*&start=0&rows=50&hl.maxAnalyzedChars=2147483647&hl.useFastVectorHighlighter=true&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400

For sub-string matching, NGramFilterFactory is required at index time.

<filter class="solr.NGramFilterFactory" minGramSize="..." maxGramSize="15"/>

Plus you may want to use WhitespaceTokenizerFactory instead of 
StandardTokenizerFactory. The Analysis admin page displays the behavior of each 
tokenizer.
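
A field type along those lines (the name and gram sizes are illustrative; note
the n-gram filter is applied only at index time, so query terms are not
re-grammed):

  <fieldType name="text_substring" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>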


Solr, MultiValues and links...

2011-11-04 Thread Tiernan OToole
Right, not sure how to ask this question or what the terminology is, but
hopefully my explanation will help...

We are chucking data into Solr for queries. I can't mention the exact data,
but the closest thing I can think of is as follows:


   - A unique ID for the Solr record (the DB ID in this case)
   - A not-so-unique ID (say an ISBN number)
   - The name of the book (multiple names for this, each with an ISBN and a
   unique DB ID)
   - A status of the book (books aren't the correct term, but, say, a book
   has 4 editions but keeps the ISBN; we would have 4 names in Solr, each
   queryable, so searching for the first edition title will return the
   correct ISBN).

Anyway, what I want to be able to do is search for a single title (say Solr
or Dummies) and find all instances of that title in the index, but for each
name I want to be able to link the status of each title with each one.
There are other "statuses" for each item also...

Anyway, I had 2 ways of doing this:


   - the first way was using multi-values, storing the names in a multi-value
   list and also the statuses, but the linking doesn't seem to be working
   correctly...
   - the second way is storing each record uniquely, but with this I would
   need to run a second query to get all records for the ISBN...

Any ideas which one I should be using? Any tips on this?

Thanks.


-- 
Tiernan O'Toole
blog.lotas-smartman.net
www.geekphotographer.com
www.tiernanotoole.ie


Comparing apples & oranges?

2011-11-04 Thread Martin Koch
Hi List

I have a solr index where I want to include numerical fields in my ranking
function as well as keyword relevance. For example, each document has a
document view count, and I'd like to increase the relevancy of documents
that are read often, and penalize documents with a very low view count. I'm
aware that this could be achieved with a filter as well, but ignore that
for this question :) since this will be extended to other numerical fields.

The keyword scoring works just fine and I can include the view count as a
factor in the scoring, but I would like to somehow express that the view
count accounts for e.g. 25% of the total score. This could be achieved by
mapping the view count into some predetermined fixed range and then
performing suitable arithmetic to scale to the score of the query. The
score of the term query is normalized to queryNorm, so I'd like somehow to
express that the view count score should be normalized to the queryNorm.

If I look at the explain of how the score below is computed, the 17.4 is
the part of the score that comes from term relevancy. Searching for another
(set of) terms yields a different queryNorm, so I can't see how I can
a-priori pick a scaling function (I've used log for this example) and boost
factor that will give control of the final contribution of the view count
to the score.

19.14161 = (MATCH) sum of:
  17.403849 = (MATCH) max plus 0.1 times others of:
16.747877 = (MATCH) weight(document:water^4.0 in 1076362), product of:
  0.22298127 = queryWeight(document:water^4.0), product of:
4.0 = boost
2.939238 = idf(docFreq=527730, maxDocs=3669552)
0.018965907 = queryNorm
  75.108894 = (MATCH) fieldWeight(document:water in 1076362), product
of:
25.553865 = tf(termFreq(document:water)=653)
2.939238 = idf(docFreq=527730, maxDocs=3669552)
1.0 = fieldNorm(field=document, doc=1076362)
[snip]
  1.7377597 = (MATCH) FunctionQuery(log(map(int(views),0.0,0.0,1.0))),
product of:
1.8325089 = log(map(int(views)=68,min=0.0,max=0.0,target=1.0))
50.0 = boost
0.018965907 = queryNorm
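
For reference, the function-query syntax does offer scale(), which maps a
function's values across the whole index into a fixed target range; that bounds
the function part of the score, but it still does not tie the text part to a
known range (parameter values here are illustrative):

bf=scale(log(map(int(views),0.0,0.0,1.0)),0,1)^10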

Thanks in advance for your help,
/Martin


Solr

2011-11-04 Thread KARHU Toni
Hi, when is the SolrCloud version planned to be released/stable, and what are your 
thoughts on using it in a serious production environment?

Br,

Toni


**
IMPORTANT: This message is intended exclusively for information purposes. It 
cannot be considered as an 
official OHIM communication concerning procedures laid down in the Community 
Trade Mark Regulations 
and Designs Regulations. It is therefore not legally binding on the OHIM for 
the purpose of those procedures.
The information contained in this message and attachments is intended solely 
for the attention and use of the 
named addressee and may be confidential. If you are not the intended recipient, 
you are reminded that the 
information remains the property of the sender. You must not use, disclose, 
distribute, copy, print or rely on this 
e-mail. If you have received this message in error, please contact the sender 
immediately and irrevocably 
delete or destroy this message and any copies.

**

Re: Ordered proximity search

2011-11-04 Thread Rahul Warawdekar
Hi Thomas,

Do you always need the ordered proximity search by default ?
You may want to check SpanNearQuery at
http://www.lucidimagination.com/blog/2009/07/18/the-spanquery/

We are using the edismax query parser provided by Solr.
I had a similar type of requirement in our project and here is how we
addressed it:

1. Wrote a customized query parser similar to edismax.
2. Identified the method in the code which takes care of "PhraseQuery" and
replaced it with a snippet of "SpanNearQuery" code.

Please check more on SpanNearQuery if that works for you.
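
A minimal sketch of the ordered case (Lucene 3.x; the field name is illustrative):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.spans.SpanNearQuery;
  import org.apache.lucene.search.spans.SpanQuery;
  import org.apache.lucene.search.spans.SpanTermQuery;

  SpanQuery first  = new SpanTermQuery(new Term("body", "term1"));
  SpanQuery second = new SpanTermQuery(new Term("body", "term2"));
  // slop = Integer.MAX_VALUE, inOrder = true: term1 must appear before term2
  SpanNearQuery ordered =
      new SpanNearQuery(new SpanQuery[] { first, second }, Integer.MAX_VALUE, true);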



On Thu, Nov 3, 2011 at 2:11 PM, LT.thomas  wrote:

> Hi,
>
> By ordered I mean term1 will always come before term2 in the document.
>
> I have two documents:
> 1. "By ordered I mean term1 will always come before term2 in the document"
> 2. "By ordered I mean term2 will always come before term1 in the document"
>
> if I make the query:
>
> "term1 term2"~Integer.MAX_VALUE
>
> my results is: 2 documents
>
> How can I query to have one result (only if term1 come before term2):
> "By ordered I mean term1 will always come before term2 in the document"
>
> Thanks
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Ordered-proximity-search-tp3477946p3477946.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Thanks and Regards
Rahul A. Warawdekar


Re: Ordered proximity search

2011-11-04 Thread LT.thomas
Thanks for your reply, I will check this advice

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Ordered-proximity-search-tp3477946p3480321.html
Sent from the Solr - User mailing list archive at Nabble.com.


Fwd: Assist please

2011-11-04 Thread Oleg Tikhonov
-- Forwarded message --
From: NDIAYE Bacar 
Date: Fri, Nov 4, 2011 at 12:05 PM
Subject: Assist please
To: d...@tika.apache.org, u...@tika.apache.org


  Hi, 

I need your assistance, please, with configuring Apache Tika for Solr
attachments in Drupal 7. 

I have tried to configure the Solr attachments after installing the Apache Solr
search module; now I want to know how to configure tika-app-0.10.jar and
configure attachments so that attached files show up in search. 

Regards,

NDIAYE Bacar
01 44 05 58 42




Re: limiting searches to particular sources

2011-11-04 Thread Erick Erickson
How are you crawling your info? Somewhere you have to inject the
source into the document,  won't do the trick because
there's no source available

If you're crawling the data by yourself, you can just add the source
to the document.

If you're using DIH, you can specify the field as a constant. Or you
could implement a custom Transformer that inserted it for you.
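
For example, DIH's TemplateTransformer can write a constant into a field
(entity and column names here are illustrative):

  <entity name="page" transformer="TemplateTransformer" ...>
    <!-- every row imported by this entity gets the same literal value -->
    <field column="source" template="website-X-crawl"/>
  </entity>

A default="..." value on the field definition in schema.xml is another way to
fill a field that the incoming data never supplies.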

Best
Erick

On Wed, Nov 2, 2011 at 10:52 AM, Fred Zimmerman  wrote:
> I want to be able to list some searches to particular sources, e.g. "wiki
> only", "crawled only", etc.  So I think I need to create a source field in
> the schema.xml.  However, the native data for these sources does not
> contain source info (e.g. "crawled").  So I want to use (I think)
>  to add a string to each data set as I import it, e.g.
> "website-X-crawl".  So my question is, how do I insert a string value into
> a blank field?
>


Re: Highlighting "text" field when query is for "string" field

2011-11-04 Thread Erick Erickson
Try this with &debugQuery=on. I suspect you're not getting the query you
think you are and I'd straighten that out before worrying about highlighting.

Usually, for instance, AND should be capitalized to be an operator.

So try with &debugQuery=on and see what happens. The highlighter, I
believe, will try to highlight on all returned fields by default so it *should*
work.

And I assume you're setting 'stored="true" ' on your excerpt field?

Best
Erick

On Wed, Nov 2, 2011 at 5:44 PM, solrdude  wrote:
> I have a situation where I need to highlight matching phrases in a "text" field
> whereas the query is against a "string" field. It's not highlighting now, maybe
> because in the text field they are all terms and hence not a match for the phrase.
> How do I do it? With hl.alternateField, it identifies those things in the
>  field, but it is not applying the default  around the matching phrase.
> How do I get it to mark it?
>
> Eg:
> 
> smooth skin  // field type: string
> Smooth skin    // field type: text
> 
>
> query:
> http://localhost:8080/mycore/select?facet=true&group.ngroups=true&facet.mincount=1&group.limit=3&facet.limit=10&hl=true&rows=10&version=2&start=0&q=keyword:%22smooth+skin%22+and+publishStatus:Live&group.field=productName&group=true&facet.field=brand&hl.fl=excerpt&hl.alternateField=excerpt
>
> Thanks
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Highlighting-text-field-when-query-is-for-string-field-tp3475334p3475334.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Extended Dismax and Proximity Queries

2011-11-04 Thread Erick Erickson
Yes. Just try it with &debugQuery=on and you can see the parsed
form of the query.

Best
Erick

On Wed, Nov 2, 2011 at 6:20 PM, Jamie Johnson  wrote:
> Is it possible to do Proximity queries using edismax?  I saw I could
> do the following
>
> q="batman movie"&qs=100
>
> but I wanted to be able to handle queries like "batman movie"~100
>
> I know I can do
>
> text:"batman movie"~100
>
> but I'm trying to do this without specifying a field.  Is this possible?
>


Re: limiting searches to particular sources

2011-11-04 Thread Markus Jelsma
Your Nutch indexes the site and host fields. If that is not enough you can use 
its subcollection plugin to write values for URL patterns.

On Wednesday 02 November 2011 15:52:37 Fred Zimmerman wrote:
> I want to be able to list some searches to particular sources, e.g. "wiki
> only", "crawled only", etc.  So I think I need to create a source field in
> the schema.xml.  However, the native data for these sources does not
> contain source info (e.g. "crawled").  So I want to use (I think)
>  to add a string to each data set as I import it, e.g.
> "website-X-crawl".  So my question is, how do I insert a string value into
> a blank field?

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Solr real-time update taking time

2011-11-04 Thread Erick Erickson
I think that a 1-2 second requirement is unreasonable. The first thing I'd
do is push back and understand whether this is actually a requirement or
just somebody picking numbers out of thin air.

Committing often enough for this to work is just *asking* for trouble
with 3.3. I'd
take a look at the Near Real Time (NRT) stuff happening on trunk if this turns
out to be a hard requirement.

Best
Erick

On Wed, Nov 2, 2011 at 11:30 PM, Vijay Sampath
 wrote:
> Hi Jan,
>
>  Thanks very much for the suggestion. I used CommitWithin(5000) and the
> response came down to less than a second.  But I see an inconsistent
> behaviour on the response times. Sometimes it's taking more than 20-25
> seconds. May be I'll open up a separate thread.
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Solr-real-time-update-taking-time-tp3472709p3476091.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: best way for sum of fields

2011-11-04 Thread Erick Erickson
Please define "sum of fields". The total number of unique terms for
all the fields?
The sum of some values of some fields for each document?
The count of the number of fields in the index?
Other???

Best
Erick

On Thu, Nov 3, 2011 at 11:43 AM, stockii  wrote:
> i am searching for the best way to get the sum of fields.
>
> I know the StatsComponent, but this component is not fast enough for 40-60
> thousands documents.
>
> exists some other components or methods form solr ?
>
> -
> --- System 
> 
>
> One Server, 12 GB RAM, 2 Solr Instances, 8 Cores,
> 1 Core with 45 Million Documents other Cores < 200.000
>
> - Solr1 for Search-Requests - commit every Minute  - 5GB Xmx
> - Solr2 for Update-Request  - delta every Minute - 4GB Xmx
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/best-way-for-sum-of-fields-tp3477517p3477517.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Three questions about: Commit, single index vs multiple indexes and implementation advice

2011-11-04 Thread Erick Erickson
Let's see...
1> Committing every second, even with commitWithin is probably going
to be a problem.
 I usually think that 1 second latency is overkill, but that's up to your
 product manager. Look at the NRT (Near Real Time) stuff if you really need this.
 I thought that NRT was only on trunk, but it *might* be in the 3.4 code base.
2> Don't understand what "a single index per entity" is. How many cores do you
 have total? For not very many records, I'd put everything in a
single index and
 use filterqueries to restrict views.
3> I guess this relates to <2>. And I'd use a single core. If, for
some reason, you decide
 that you need multiple indexes, use several cores with ONE Solr
rather than start
 a new Solr per core, it's more resource expensive to have
multiple JVMs around.

Best
Erick

On Thu, Nov 3, 2011 at 2:03 PM, Gustavo Falco
 wrote:
> Hi guys!
>
> I have a couple of questions that I hope someone could help me with:
>
> 1) Recently I've implemented Solr in my app. My use case is not
> complicated. Suppose that there will be 50 concurrent users tops. This is
> an app like, let's say, a CRM. I tell you this so you have an idea in terms
> of how many read and write operations will be needed. What I do need is
> that the data that is added / updated be available right after it's added /
> updated (maybe a second later it's ok). I know that the commit operation is
> expensive, so maybe doing a commit right after each write operation is not
> a good idea. I'm trying to use the autoCommit feature with a maxTime of
> 1000ms, but then the question arised: Is this the best way to handle this
> type of situation? and if not, what should I do?
>
> 2) I'm using a single index per entity type because I've read that if the
> app is not handling lots of data (let's say, 1 million of records) then
> it's "safe" to use a single index. Is this true? if not, why?
>
> 3) Is it a problem if I use a simple setup of Solr using a single core for
> this use case? if not, what do you recommend?
>
>
>
> Any help in any of these topics would be greatly appreciated.
>
> Thanks in advance!
>


Re: [Profiling] How to profile/tune Solr server

2011-11-04 Thread karsten-solr
Hi Spark,

In 2009 there was a monitor from Lucid Imagination:
http://www.lucidimagination.com/about/news/releases/lucid-imagination-releases-performance-monitoring-utility-open-source-apache-lucene

A colleague of mine calls the sematext-monitor "trojan" because "SPM phone 
home":
"Easy in, easy out - if you try SPM and don't like it, simply stop and remove 
the small client-side piece that sends us your data"
http://sematext.com/spm/solr-performance-monitoring/index.html

Looks like other people using a "real profiler" like YourKit Java Profiler
http://forums.yourkit.com/viewtopic.php?f=3&t=3850

There is also an article about Zabbix
http://www.lucidimagination.com/blog/2011/10/02/monitoring-apache-solr-and-lucidworks-with-zabbix/

In your case any profiler would do, but if you find a profiler with
Solr-specific default filters, let me know.



Best regards
  Karsten

P.S. eMail in context 
http://lucene.472066.n3.nabble.com/Profiling-How-to-profile-tune-Solr-server-td3467027.html

 Original-Nachricht 
> Datum: Mon, 31 Oct 2011 18:35:32 +0800
> Von: yu shen 
> An: solr-user@lucene.apache.org
> Betreff: Re: [Profiling] How to profile/tune Solr server

> No idea so far, try to figure out.
> 
> Spark
> 
> 2011/10/31 Jan Høydahl 
> 
> > Hi,
> >
> > There are no official tools other than looking at the built-in stats
> pages
> > and perhaps using JConsole or similar JVM monitoring tools. Note that
> > Solr's JMX capabilities may let you hook your enterprise's existing
> > monitoring dashboard up with Solr.
> >
> > Also check out the new monitoring service from Sematext which will give
> > you graphs and all. So far it's free evaluation:
> > http://sematext.com/spm/index.html
> >
> > Do you have a clue for why the indexing is slow?
> >
> > --
> > Jan Høydahl, search solution architect
> > Cominvent AS - www.cominvent.com
> > Solr Training - www.solrtraining.com
> >
> > On 31. okt. 2011, at 04:59, yu shen wrote:
> >
> > > Hi All,
> > >
> > > I am a solr newbie. I find solr documents easy to access and use,
> which
> > is
> > > really good thing. While my problem is I did not find a solr home
> grown
> > > profiling/monitoring tool.
> > >
> > > I set up the server as a multi-core server, each core has
> approximately
> > 2GB
> > > index. And I need to update solr and re-generate index in a real time
> > > manner (In java code, using SolrJ). Sometimes the update operation is
> > slow.
> > > And it is expected that in a year, the index size may increase to 4GB.
> > And
> > > I need to do something to prevent performance downgrade.
> > >
> > > Is there any solr official monitoring & profiling tool for this?
> > >
> > > Spark
> >
> >


InvalidTokenOffsetsException when using MappingCharFilterFactory, DictionaryCompoundWordTokenFilterFactory and Highlighting

2011-11-04 Thread Edwin Steiner
Hello all

I would like to handle german accents (Umlaute) by replacing the accented char 
with its two-letter substitute (e.g ä => ae). For this reason I use the 
char-filter solr.MappingCharFilterFactory configured with a mapping file 
containing entries like “ä” => “ae”. I also want to use the 
solr.DictionaryCompoundWordTokenFilterFactory to find words which are part of 
compound words (e.g. revision in totalrevision). And finally I want to use Solr 
highlighting. But there seems to be a problem if I combine the char filter and 
the compound word filter in combination with highlighting (an 
org.apache.lucene.search.highlight.InvalidTokenOffsetsException is raised).

Here are the details:

types:


  



  


schema:
---
  
  
 
  

document:
--
  
 1 
 banküberfall 
  

mapping.txt:
-
"ü" => "ue"

words.txt:
--
fall
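
Putting those pieces together, the analyzer chain is of roughly this shape (the
type name, tokenizer choice and compound-word parameters are illustrative):

  <fieldType name="text_de_compound" class="solr.TextField">
    <analyzer>
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
              dictionary="words.txt" minWordSize="5"
              minSubwordSize="2" maxSubwordSize="15" onlyLongestMatch="false"/>
    </analyzer>
  </fieldType>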

The resulting error when searching with:

http://localhost:8080/solr/select/?q=banküberfall&hl=true&hl.fl=title

Nov 4, 2011 4:29:12 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select/ 
params={q=bank?berfall&hl.fl=title_hl&hl=true} hits=1 status=0 QTime=13 
Nov 4, 2011 4:29:16 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: 
org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token fall 
exceeds length of provided text sized 12
at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:469)
at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:378)
at 
org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:116)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
at 
org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:462)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:164)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
at 
org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:851)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:405)
at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:278)
at 
org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:515)
at 
org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:302)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:680)
Caused by: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: 
Token fall exceeds length of provided text sized 12
at 
org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:228)
at 
org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:462)
... 23 more

Thanks a lot for any suggestions and best regards,
Edwin

Re: [Profiling] How to profile/tune Solr server

2011-11-04 Thread Andre Bois-Crettez
SolrMeter is useful too; it can be plugged into a production server just 
to watch the evolution of cache usage:

http://code.google.com/p/solrmeter/wiki/Screenshots#CacheHistoryStatistic
André



Re: limiting searches to particular sources

2011-11-04 Thread Fred Zimmerman
Yes -- how do I specify the field as a constant in DIH?

On Fri, Nov 4, 2011 at 11:17 AM, Erick Erickson wrote:

> How are you crawling your info? Somewhere you have to inject the
> source into the document,  won't do the trick because
> there's no source available
>
> If you're crawling the data by yourself, you can just add the source
> to the document.
>
> If you're using DIH, you can specify the field as a constant. Or you
> could implement a custom Transformer that inserted it for you.
>
> Best
> Erick
>
> On Wed, Nov 2, 2011 at 10:52 AM, Fred Zimmerman 
> wrote:
> > I want to be able to list some searches to particular sources, e.g. "wiki
> > only", "crawled only", etc.  So I think I need to create a source field
> in
> > the schema.xml.  However, the native data for these sources does not
> > contain source info (e.g. "crawled").  So I want to use (I think)
> >  to add a string to each data set as I import it, e.g.
> > "website-X-crawl".  So my question is, how do I insert a string value
> into
> > a blank field?
> >
>


Batch indexing documents using ContentStreamUpdateRequest

2011-11-04 Thread Tod
This is a code fragment of how I am doing a ContentStreamUpdateRequest 
using CommonsHttpSolrServer:



  // csur is a ContentStreamUpdateRequest and server a CommonsHttpSolrServer,
  // both created earlier (not shown in this fragment)
  ContentStreamBase.URLStream csbu = new ContentStreamBase.URLStream(url);
  InputStream is = csbu.getStream();
  FastInputStream fis = new FastInputStream(is);

  csur.addContentStream(csbu);
  csur.setParam("literal.content_id","00");
  csur.setParam("literal.contentitle","This is a test");
  csur.setParam("literal.title","This is a test");
  server.request(csur);
  server.commit();

  fis.close();


This works fine for one document (a pdf in this case).  When I surround 
this with a while loop and try adding multiple documents I get:


org.apache.solr.client.solrj.SolrServerException: java.io.IOException: 
stream is closed


I've tried commenting out the fis.close, and also using just a plain 
InputStream with and without a .close() call - neither work.  Is there a 
way to do this that I'm missing?



Thanks - Tod


overwrite=false support with SolrJ client

2011-11-04 Thread Ken Krugler
Hi list,

I'm working on improving the performance of the Solr scheme for Cascading.

This supports generating a Solr index as the output of a Hadoop job. We use 
SolrJ to write the index locally (via EmbeddedSolrServer).

There are mentions of using overwrite=false with the CSV request handler, as a 
way of improving performance.
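
For reference, with the CSV handler that is just a request parameter on the
update URL, roughly (URL and file path are illustrative; stream.file needs
remote streaming enabled in solrconfig.xml):

http://localhost:8983/solr/update/csv?stream.file=/path/to/data.csv&overwrite=false&commit=true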

I see that https://issues.apache.org/jira/browse/SOLR-653 removed this support 
from SolrJ, because it was deemed too dangerous for mere mortals.

My question is whether anyone knows just how much performance boost this really 
provides.

For Hadoop-based workflows, it's straightforward to ensure that the unique key 
field is really unique, thus if the performance gain is significant, I might 
look into figuring out some way (with a trigger lock) of re-enabling this 
support in SolrJ.

Thanks,

-- Ken

--
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr






Re: performance - dynamic fields versus static fields

2011-11-04 Thread Erick Erickson
Dynamic fields are just fields, man. There's really no overhead that I know of.

I tend to prefer non-dynamic fields whenever possible to reduce
hard-to-find errors where, say, I've made a typo and the dynamic
pattern matches, but that's largely a personal preference.

Best
Erick

On Thu, Nov 3, 2011 at 3:15 PM, Memory Makers  wrote:
> Hi,
>
> Is there a handy resource on the:
>  a. performance of: dynamic fields versus static fields
>  b. other pros-cons?
>
> Thanks.
>


Re: overwrite=false support with SolrJ client

2011-11-04 Thread Jason Rutherglen
It should be supported in SolrJ, I'm surprised it's been lopped out.
Bulk indexing is extremely common.

On Fri, Nov 4, 2011 at 1:16 PM, Ken Krugler  wrote:
> Hi list,
>
> I'm working on improving the performance of the Solr scheme for Cascading.
>
> This supports generating a Solr index as the output of a Hadoop job. We use 
> SolrJ to write the index locally (via EmbeddedSolrServer).
>
> There are mentions of using overwrite=false with the CSV request handler, as 
> a way of improving performance.
>
> I see that https://issues.apache.org/jira/browse/SOLR-653 removed this 
> support from SolrJ, because it was deemed too dangerous for mere mortals.
>
> My question is whether anyone knows just how much performance boost this 
> really provides.
>
> For Hadoop-based workflows, it's straightforward to ensure that the unique 
> key field is really unique, thus if the performance gain is significant, I 
> might look into figuring out some way (with a trigger lock) of re-enabling 
> this support in SolrJ.
>
> Thanks,
>
> -- Ken
>
> --
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
>
>
>
>
>


Term frequency question

2011-11-04 Thread Craig Stadler
I am using this reference link: 
http://www.mail-archive.com/solr-user@lucene.apache.org/msg26389.html


However the article is a bit old and when I try to compile the class (using 
newest solr 3.4 / java version "1.7.0_01" / Java(TM) SE Runtime Environment 
(build 1.7.0_01-b08) / Java HotSpot(TM) 64-Bit Server VM (build 21.1-b02, 
mixed mode)


I get :
-
javac -classpath /var/lib/lucene_master/lib/lucene-core-3.4.0.jar 
./NoLengthNormAndTfSimilarity.java


./NoLengthNormAndTfSimilarity.java:7: error: lengthNorm(String,int) in 
NoLengthNormAndTfSimilarity cannot override lengthNorm(String,int) in 
Similarity

 public float lengthNorm(String fieldName, int numTerms) {
  ^
 overridden method is final
1 error
-
What am I doing wrong, is there a better way or newer way to do this?

The bottom line is I want the number of words to NOT factor into the score, and I 
DO want to keep using the related functions. I tried omitTermFreqAndPositions="true" 
and it did not allow me to use, for example, "spiderman doc"~100 - a loss of all that 
functionality.


-Craig

--- actual code ---
package mypackagexyz...

import org.apache.lucene.search.DefaultSimilarity;

public class NoLengthNormAndTfSimilarity extends DefaultSimilarity {

public float lengthNorm(String fieldName, int numTerms) {
return numTerms > 0 ? 1.0f : 0.0f;
}

public float tf(float freq) {
return freq > 0 ? 1.0f : 0.0f;
}
}
---
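
I'm wondering whether overriding computeNorm() - apparently the non-final hook
that superseded lengthNorm() in the 3.x API - is the right replacement. A sketch
of what I mean (untested):

  import org.apache.lucene.index.FieldInvertState;
  import org.apache.lucene.search.DefaultSimilarity;

  public class NoLengthNormAndTfSimilarity extends DefaultSimilarity {

    // produce the field norm directly instead of going through lengthNorm()
    @Override
    public float computeNorm(String field, FieldInvertState state) {
      return state.getLength() > 0 ? state.getBoost() : 0.0f;
    }

    @Override
    public float tf(float freq) {
      return freq > 0 ? 1.0f : 0.0f;
    }
  }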



Re: Three questions about: Commit, single index vs multiple indexes and implementation advice

2011-11-04 Thread Gustavo Falco
First of all, thanks a lot for your answer.

1) I could use 5 to 15 seconds between each commit and give it a try. Is
this an acceptable configuration? I'll take a look at NRT.
2) Currently I'm using a single core, the simplest setup. I don't expect to
have an overwhelming quantity of records, but I do have lots of classes to
persist, and I need to search all of them at the same time, and not per
class (entity). For now it is working well. With multiple indexes I mean using
an index for each entity. Let's say, an index for "Articles", another for
"Users", etc. The thing is that I don't know when I should divide it and
use one index for each entity (or if it's possible to make a "UNION" like
search between every index). I've read that when an entity reaches the size
of one million records then it's best to give it a dedicated index, even
though I don't expect to have that size even with all my entities. But I
wanted to know from you just to be sure.
3) Great! for now I think I'll stick with one index, but it's good to know
that in case I need to change later for some reason.



Again, thanks a lot for your help!

2011/11/4 Erick Erickson 

> Let's see...
> 1> Committing every second, even with commitWithin is probably going
> to be a problem.
> I usually think that 1 second latency is usually overkill, but
> that's up to your
> product manager. Look at the NRT (Near Real Time) stuff if you
> really need this.
> I thought that NRT was only on trunk, but it *might* be in the
> 3.4 code base.
> 2> Don't understand what "a single index per entity" is. How many cores do
> you
> have total? For not very many records, I'd put everything in a
> single index and
> use filterqueries to restrict views.
> 3> I guess this relates to <2>. And I'd use a single core. If, for
> some reason, you decide
> that you need multiple indexes, use several cores with ONE Solr
> rather than start
> a new Solr per core, it's more resource expensive to have
> multiple JVMs around.
>
> Best
> Erick
>
> On Thu, Nov 3, 2011 at 2:03 PM, Gustavo Falco
>  wrote:
> > Hi guys!
> >
> > I have a couple of questions that I hope someone could help me with:
> >
> > 1) Recently I've implemented Solr in my app. My use case is not
> > complicated. Suppose that there will be 50 concurrent users tops. This is
> > an app like, let's say, a CRM. I tell you this so you have an idea in
> terms
> > of how many read and write operations will be needed. What I do need is
> > that the data that is added / updated be available right after it's
> added /
> > updated (maybe a second later it's ok). I know that the commit operation
> is
> > expensive, so maybe doing a commit right after each write operation is
> not
> > a good idea. I'm trying to use the autoCommit feature with a maxTime of
> > 1000ms, but then the question arised: Is this the best way to handle this
> > type of situation? and if not, what should I do?
> >
> > 2) I'm using a single index per entity type because I've read that if the
> > app is not handling lots of data (let's say, 1 million of records) then
> > it's "safe" to use a single index. Is this true? if not, why?
> >
> > 3) Is it a problem if I use a simple setup of Solr using a single core
> for
> > this use case? if not, what do you recommend?
> >
> >
> >
> > Any help in any of these topics would be greatly appreciated.
> >
> > Thanks in advance!
> >
>


solr/jetty request log strangeness

2011-11-04 Thread Shawn Heisey
If the URL being sent to Solr is too long to be completely displayed in 
the jetty request log, the next log entry is recorded on the same line.  
The following line from my log is actually three separate requests:


10.100.0.240 -  -  [04/Nov/2011:00:00:00 +] "GET 
/solr/s1live/select?qt=lbcheck&rows=0&q=did%3A%28236297360+OR+236297361+OR+236297362+OR+236297363+OR+236297364+OR+236297375+OR+282948606+OR+281822104+OR+236297379+OR+236297378+OR+23629

7377+OR+236297376+OR+236297387+OR+236297386+OR+236297385+OR+236297390+OR+236297389+OR+236297388+OR+236297413+OR+236297412+OR+236297414+OR+278631444+OR+275337739+OR+236297233+OR+236297235+OR+236297234+OR+281822007+OR+236297236+OR+27894483
7+OR+278944836+OR+279328339+OR+278944833+OR+279328340+OR+278944832+OR+278944835+OR+279328342+OR+278944834+OR+236297231+OR+279328344+OR+236297216+OR+279328345+OR+279328347+OR+279328348+OR+279328350+OR+279328352+OR+282672994+OR+236297279+O
R+282672995+OR+282794348+OR+282672993+OR+236297264+OR+236297262+OR+236297263+OR+236297260+OR+236297261+OR+236297250+OR+236297251+OR+236297248+OR+236297249+OR+236297254+OR+236297252+OR+236297253+OR+278631560+OR+212057029+OR+225234809+OR+2
78631559+OR+278631558+OR+236297288+OR+236297289+OR+236297290+OR+236297291+OR10.100.0.240 
-  -  [04/Nov/2011:00:00:00 +] "GET 
/solr/s0live/select?qt=lbcheck&rows=0&q=did%3A%28236297360+OR+236297361+OR+236297362+OR+236297363+OR+2362973

64+OR+236297375+OR+282948606+OR+281822104+OR+236297379+OR+236297378+OR+236297377+OR+236297376+OR+236297387+OR+236297386+OR+236297385+OR+236297390+OR+236297389+OR+236297388+OR+236297413+OR+236297412+OR+236297414+OR+278631444+OR+275337739+
OR+236297233+OR+236297235+OR+236297234+OR+281822007+OR+236297236+OR+278944837+OR+278944836+OR+279328339+OR+278944833+OR+279328340+OR+278944832+OR+278944835+OR+279328342+OR+278944834+OR+236297231+OR+279328344+OR+236297216+OR+279328345+OR+
279328347+OR+279328348+OR+279328350+OR+279328352+OR+282672994+OR+236297279+OR+282672995+OR+282794348+OR+282672993+OR+236297264+OR+236297262+OR+236297263+OR+236297260+OR+236297261+OR+236297250+OR+236297251+OR+236297248+OR+236297249+OR+236
297254+OR+236297252+OR+236297253+OR+278631560+OR+212057029+OR+225234809+OR+278631559+OR+278631558+OR+236297288+OR+236297289+OR+236297290+OR+236297291+OR10.100.0.240 
-  -  [04/Nov/2011:00:00:01 +] "POST /solr/s0live/select HTTP/1.1" 2

00 193

I assume that this is probably a jetty problem, but since it's the jetty 
included with Solr, I am bringing it up here.  This happens with both 
Solr 3.2.0 and Solr 3.4.0.  Is there a config change I can make to fix 
it?  Almost every request that gets sent to my Solr server is too long 
to be fully represented in the request log.


As long as I am bringing this up, is there any way to get the full URL 
in that log?  Some of my URLs get close to 2 bytes.


Thanks,
Shawn



RE: Three questions about: Commit, single index vs multiple indexes and implementation advice

2011-11-04 Thread Brian Gerby

Gustavo - 

Even with the most basic requirements, I'd recommend setting up a multi-core 
configuration so you can RELOAD the main core you will be using when you make 
simple changes to config files. This is much cleaner than bouncing solr each 
time. There are other benefits to doing it, but this is the main reason I do 
it.  
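
For example (host, port and core name are illustrative), a reload is just a
CoreAdmin call:

http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0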

Brian 

> Date: Fri, 4 Nov 2011 15:34:27 -0300
> Subject: Re: Three questions about: Commit, single index vs multiple indexes 
> and implementation advice
> From: comfortablynum...@gmail.com
> To: solr-user@lucene.apache.org
> 
> First of all, thanks a lot for your answer.
> 
> 1) I could use 5 to 15 seconds between each commit and give it a try. Is
> this an acceptable configuration? I'll take a look at NRT.
> 2) Currently I'm using a single core, the simplest setup. I don't expect to
> have an overwhelming quantity of records, but I do have lots of classes to
> persist, and I need to search all of them at the same time, and not per
> class (entity). For now is working good. With multiple indexes I mean using
> an index for each entity. Let's say, an index for "Articles", another for
> "Users", etc. The thing is that I don't know when I should divide it and
> use one index for each entity (or if it's possible to make a "UNION" like
> search between every index). I've read that when an entity reaches the size
> of one million records then it's best to give it a dedicated index, even
> though I don't expect to have that size even with all my entities. But I
> wanted to know from you just to be sure.
> 3) Great! for now I think I'll stick with one index, but it's good to know
> that in case I need to change later for some reason.
> 
> 
> 
> Again, thanks a lot for your help!
> 
> 2011/11/4 Erick Erickson 
> 
> > Let's see...
> > 1> Committing every second, even with commitWithin is probably going
> > to be a problem.
> > I usually think that 1 second latency is usually overkill, but
> > that's up to your
> > product manager. Look at the NRT (Near Real Time) stuff if you
> > really need this.
> > I thought that NRT was only on trunk, but it *might* be in the
> > 3.4 code base.
> > 2> Don't understand what "a single index per entity" is. How many cores do
> > you
> > have total? For not very many records, I'd put everything in a
> > single index and
> > use filterqueries to restrict views.
> > 3> I guess this relates to <2>. And I'd use a single core. If, for
> > some reason, you decide
> > that you need multiple indexes, use several cores with ONE Solr
> > rather than start
> > a new Solr per core, it's more resource expensive to have
> > multiple JVMs around.
> >
> > Best
> > Erick
> >
> > On Thu, Nov 3, 2011 at 2:03 PM, Gustavo Falco
> >  wrote:
> > > Hi guys!
> > >
> > > I have a couple of questions that I hope someone could help me with:
> > >
> > > 1) Recently I've implemented Solr in my app. My use case is not
> > > complicated. Suppose that there will be 50 concurrent users tops. This is
> > > an app like, let's say, a CRM. I tell you this so you have an idea in
> > terms
> > > of how many read and write operations will be needed. What I do need is
> > > that the data that is added / updated be available right after it's
> > added /
> > > updated (maybe a second later it's ok). I know that the commit operation
> > is
> > > expensive, so maybe doing a commit right after each write operation is
> > not
> > > a good idea. I'm trying to use the autoCommit feature with a maxTime of
> > > 1000ms, but then the question arised: Is this the best way to handle this
> > > type of situation? and if not, what should I do?
> > >
> > > 2) I'm using a single index per entity type because I've read that if the
> > > app is not handling lots of data (let's say, 1 million of records) then
> > > it's "safe" to use a single index. Is this true? if not, why?
> > >
> > > 3) Is it a problem if I use a simple setup of Solr using a single core
> > for
> > > this use case? if not, what do you recommend?
> > >
> > >
> > >
> > > Any help in any of these topics would be greatly appreciated.
> > >
> > > Thanks in advance!
> > >
> >
  

Re: Three questions about: Commit, single index vs multiple indexes and implementation advice

2011-11-04 Thread Gustavo Falco
Hi Brian,

I'll take a look at what you mentioned. I didn't think about that. I'll
finish the implementation at the app level and then I'll read a little more
about multi-core setups. Maybe I don't know yet all the benefits it has.


Thanks a lot for your advice.

2011/11/4 Brian Gerby 

>
> Gustavo -
>
> Even with the most basic requirements, I'd recommend setting up a
> multi-core configuration so you can RELOAD the main core you will be using
> when you make simple changes to config files. This is much cleaner than
> bouncing solr each time. There are other benefits to doing it, but this is
> the main reason I do it.
>
> Brian
>
> > Date: Fri, 4 Nov 2011 15:34:27 -0300
> > Subject: Re: Three questions about: Commit, single index vs multiple
> indexes and implementation advice
> > From: comfortablynum...@gmail.com
> > To: solr-user@lucene.apache.org
> >
> > First of all, thanks a lot for your answer.
> >
> > 1) I could use 5 to 15 seconds between each commit and give it a try. Is
> > this an acceptable configuration? I'll take a look at NRT.
> > 2) Currently I'm using a single core, the simplest setup. I don't expect
> to
> > have an overwhelming quantity of records, but I do have lots of classes
> to
> > persist, and I need to search all of them at the same time, and not per
> > class (entity). For now is working good. With multiple indexes I mean
> using
> > an index for each entity. Let's say, an index for "Articles", another for
> > "Users", etc. The thing is that I don't know when I should divide it and
> > use one index for each entity (or if it's possible to make a "UNION" like
> > search between every index). I've read that when an entity reaches the
> size
> > of one million records then it's best to give it a dedicated index, even
> > though I don't expect to have that size even with all my entities. But I
> > wanted to know from you just to be sure.
> > 3) Great! for now I think I'll stick with one index, but it's good to
> know
> > that in case I need to change later for some reason.
> >
> >
> >
> > Again, thanks a lot for your help!
> >
> > 2011/11/4 Erick Erickson 
> >
> > > Let's see...
> > > 1> Committing every second, even with commitWithin is probably going
> > > to be a problem.
> > > I usually think that 1 second latency is usually overkill, but
> > > that's up to your
> > > product manager. Look at the NRT (Near Real Time) stuff if you
> > > really need this.
> > > I thought that NRT was only on trunk, but it *might* be in the
> > > 3.4 code base.
> > > 2> Don't understand what "a single index per entity" is. How many
> cores do
> > > you
> > > have total? For not very many records, I'd put everything in a
> > > single index and
> > > use filterqueries to restrict views.
> > > 3> I guess this relates to <2>. And I'd use a single core. If, for
> > > some reason, you decide
> > > that you need multiple indexes, use several cores with ONE Solr
> > > rather than start
> > > a new Solr per core, it's more resource expensive to have
> > > multiple JVMs around.
> > >
> > > Best
> > > Erick
> > >
> > > On Thu, Nov 3, 2011 at 2:03 PM, Gustavo Falco
> > >  wrote:
> > > > Hi guys!
> > > >
> > > > I have a couple of questions that I hope someone could help me with:
> > > >
> > > > 1) Recently I've implemented Solr in my app. My use case is not
> > > > complicated. Suppose that there will be 50 concurrent users tops.
> This is
> > > > an app like, let's say, a CRM. I tell you this so you have an idea in
> > > terms
> > > > of how many read and write operations will be needed. What I do need
> is
> > > > that the data that is added / updated be available right after it's
> > > added /
> > > > updated (maybe a second later it's ok). I know that the commit
> operation
> > > is
> > > > expensive, so maybe doing a commit right after each write operation
> is
> > > not
> > > > a good idea. I'm trying to use the autoCommit feature with a maxTime
> of
> > > > 1000ms, but then the question arised: Is this the best way to handle
> this
> > > > type of situation? and if not, what should I do?
> > > >
> > > > 2) I'm using a single index per entity type because I've read that
> if the
> > > > app is not handling lots of data (let's say, 1 million of records)
> then
> > > > it's "safe" to use a single index. Is this true? if not, why?
> > > >
> > > > 3) Is it a problem if I use a simple setup of Solr using a single
> core
> > > for
> > > > this use case? if not, what do you recommend?
> > > >
> > > >
> > > >
> > > > Any help in any of these topics would be greatly appreciated.
> > > >
> > > > Thanks in advance!
> > > >
> > >
>
>


Re: Batch indexing documents using ContentStreamUpdateRequest

2011-11-04 Thread Tod

Answering my own question.

The ContentStreamUpdateRequest (csur) needs to be within the while loop, not 
outside as I had it.  Still not seeing any dramatic performance 
improvements over perl though (the point of this exercise).  Indexing 
locks up after about 30-45 minutes of activity; even a commit won't budge it.
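
For reference, a sketch of the corrected shape (SolrJ 3.x; the handler path and
literal params are illustrative) - a fresh request per document, one commit at
the end:

  import java.net.URL;
  import java.util.List;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
  import org.apache.solr.common.util.ContentStreamBase;

  public class BatchExtract {
    // server: an existing CommonsHttpSolrServer; urls: the documents to index
    static void index(SolrServer server, List<URL> urls) throws Exception {
      int id = 0;
      for (URL u : urls) {
        // a new request (and therefore a new stream) for every document
        ContentStreamUpdateRequest csur =
            new ContentStreamUpdateRequest("/update/extract");
        csur.addContentStream(new ContentStreamBase.URLStream(u));
        csur.setParam("literal.content_id", String.valueOf(id++));
        csur.setParam("literal.title", "This is a test");
        server.request(csur);
      }
      server.commit();   // one commit at the end instead of per document
    }
  }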




On 11/04/2011 12:36 PM, Tod wrote:

This is a code fragment of how I am doing a ContentStreamUpdateRequest
using CommonHTTPSolrServer:


ContentStreamBase.URLStream csbu = new ContentStreamBase.URLStream(url);
InputStream is = csbu.getStream();
FastInputStream fis = new FastInputStream(is);

csur.addContentStream(csbu);
csur.setParam("literal.content_id","00");
csur.setParam("literal.contentitle","This is a test");
csur.setParam("literal.title","This is a test");
server.request(csur);
server.commit();

fis.close();


This works fine for one document (a pdf in this case). When I surround
this with a while loop and try adding multiple documents I get:

org.apache.solr.client.solrj.SolrServerException: java.io.IOException:
stream is closed

I've tried commenting out the fis.close, and also using just a plain
InputStream with and without a .close() call - neither work. Is there a
way to do this that I'm missing?


Thanks - Tod




Solrj 3.3.0 Method SolrQuery.setSortField not working

2011-11-04 Thread tech20nn
Calling SolrQuery.setSortField("field1", ORDER.asc) on a SolrQuery is not
adding the sort parameter to the Solr query. Has anyone faced this issue?
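
For reference, this is the minimal shape I would expect to work (3.x SolrJ;
"server" is an existing SolrServer instance):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.SolrQuery.ORDER;
  import org.apache.solr.client.solrj.response.QueryResponse;

  SolrQuery q = new SolrQuery("*:*");
  q.setSortField("field1", ORDER.asc);
  System.out.println(q);                 // expect ...&sort=field1+asc
  QueryResponse rsp = server.query(q);

If the printed parameters already lack the sort, the problem is on the client
side before the request ever reaches Solr.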

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solrj-3-3-0-Method-SolrQuery-setSortField-not-working-tp3481239p3481239.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Proper analyzer / tokenizer for syslog data?

2011-11-04 Thread Peter Spam
Wow, I tried with minGramSize=1 and maxGramSize=1000 (I want someone to be able 
to search on any substring, just like "grep"), and the index is multiple orders 
of magnitude larger than my data!

There's got to be a better way to support full grep-like searching?


Thanks!
Pete

On Nov 4, 2011, at 1:20 AM, Ahmet Arslan wrote:

>> Example data:
>> 01/23/2011 05:12:34 [Test] a=1; hello_there=50;
>> data=[1,5,30%];
>> 
>> I would love to be able to just "grep" the data - ie. if I
>> search for "ello", it finds and returns "ello", and if I
>> search for "hello_there=5", it would match too.
>> 
>> Here's what I'm using now:
>> 
>>    <fieldType name="..." class="solr.TextField">
>>      <analyzer>
>>        <tokenizer class="solr.StandardTokenizerFactory"/>
>>        <filter class="solr.LowerCaseFilterFactory"/>
>>        <filter class="solr.WordDelimiterFilterFactory"
>>                generateWordParts="0" generateNumberParts="0"
>>                catenateWords="0" catenateNumbers="0" catenateAll="0"
>>                splitOnCaseChange="0"/>
>>      </analyzer>
>>    </fieldType>
>>
>> 
>> The problem with this is that if I search for a substring,
>> I don't get anything back.  For example, searching for
>> "ello" or "*ello*" doesn't return.  Any ideas?
>> 
>> http://localhost:8983/solr/select?q=*ello*&start=0&rows=50&hl.maxAnalyzedChars=2147483647&hl.useFastVectorHighlighter=true&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400
> 
> For sub-string match NGramFilterFactory is required at index time.
> 
> <filter class="solr.NGramFilterFactory" minGramSize="..." maxGramSize="15"/>
> 
> Plus you may want to use WhiteSpaceTokenizer instead of 
> StandardTokenizerFactory. Analysis admin page displays behavior of each 
> tokenizer.



Solr Score Normalization

2011-11-04 Thread sangrish

Hi,


I have a (dismax) request handler which has the following 3 scoring
components (1 qf & 2 bf) :

qf = "field1^2 field2^3"
bf = func1(field3)^2 func2(field4)^3

  Both func1 & func2 return scores between 0 & 1. The score returned by
textual match (qf) ranges from 0 to 

   To allow better combination of text match & my functions, I want the text
score to be normalized between 0 & 1. Is there any way I can achieve that
here?

Thanks
Sid




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Score-Normalization-tp3481627p3481627.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Score Normalization

2011-11-04 Thread Chris Hostetter


:To allow better combination of text match & my functions, I want the text
: score to be normalized between 0 & 1. Is there any way I can achieve that
: here?

It is achievable, but it is not usually meaningful...

https://wiki.apache.org/lucene-java/ScoresAsPercentages


-Hoss


Re: Question about dismax and score boost with date

2011-11-04 Thread Chris Hostetter

: /solr/ftf/dismax/?q=libya
: &debugQuery=off
: &hl=true
: &start=
: &rows=10
: --
: 
: I am trying to factor the created date into the SCORE (boost). I have tried a million
: ways to do this with no success. I know the dates are populating correctly because
: I can sort by them. Can anyone help me implement date boosting with dismax
: under this scenario???

https://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
https://wiki.apache.org/solr/FunctionQuery#Date_Boosting


-Hoss
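
For illustration, following the linked Date_Boosting recipe, one way this can look
with dismax (assuming the date field is named "created" as in the question; the
constants are the wiki's suggested values for dates measured in milliseconds):

  /solr/ftf/dismax/?q=libya&bf=recip(ms(NOW,created),3.16e-11,1,1)&hl=true&rows=10

recip(ms(NOW,created),3.16e-11,1,1) is close to 1 for very recent documents and
decays toward 0 as they age (roughly halving after a year), and bf adds that value
to the dismax score.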


Re: [Profiling] How to profile/tune Solr server

2011-11-04 Thread yu shen
Really helpful, thanks so much.

Spark

2011/11/4 

> Hi Spark,
>
> 2009 there was a monitor from lucidimagination:
>
> http://www.lucidimagination.com/about/news/releases/lucid-imagination-releases-performance-monitoring-utility-open-source-apache-lucene
>
> A colleague of mine calls the sematext monitor a "trojan" because SPM "phones
> home":
> "Easy in, easy out - if you try SPM and don't like it, simply stop and
> remove the small client-side piece that sends us your data"
> http://sematext.com/spm/solr-performance-monitoring/index.html
>
> It looks like other people are using a "real profiler" like YourKit Java Profiler:
> http://forums.yourkit.com/viewtopic.php?f=3&t=3850
>
> There is also an article about Zabbix
>
> http://www.lucidimagination.com/blog/2011/10/02/monitoring-apache-solr-and-lucidworks-with-zabbix/
>
> In your case any profiler would do, but if you find a profiler with
> Solr-specific default filters, let me know.
>
>
>
> Best regards
>  Karsten
>
> P.S. eMail in context
>
> http://lucene.472066.n3.nabble.com/Profiling-How-to-profile-tune-Solr-server-td3467027.html
>
>  Original-Nachricht 
> > Datum: Mon, 31 Oct 2011 18:35:32 +0800
> > Von: yu shen 
> > An: solr-user@lucene.apache.org
> > Betreff: Re: [Profiling] How to profile/tune Solr server
>
> > No idea so far, try to figure out.
> >
> > Spark
> >
> > 2011/10/31 Jan Høydahl 
> >
> > > Hi,
> > >
> > > There are no official tools other than looking at the built-in stats
> > pages
> > > and perhaps using JConsole or similar JVM monitoring tools. Note that
> > > Solr's JMX capabilities may let you hook your enterprise's existing
> > > monitoring dashboard up with Solr.
> > >
> > > Also check out the new monitoring service from Sematext which will give
> > > you graphs and all. So far it's free evaluation:
> > > http://sematext.com/spm/index.html
> > >
> > > Do you have a clue for why the indexing is slow?
> > >
> > > --
> > > Jan Høydahl, search solution architect
> > > Cominvent AS - www.cominvent.com
> > > Solr Training - www.solrtraining.com
> > >
> > > On 31. okt. 2011, at 04:59, yu shen wrote:
> > >
> > > > Hi All,
> > > >
> > > > I am a solr newbie. I find solr documents easy to access and use,
> > which
> > > is
> > > > really good thing. While my problem is I did not find a solr home
> > grown
> > > > profiling/monitoring tool.
> > > >
> > > > I set up the server as a multi-core server, each core has
> > approximately
> > > 2GB
> > > > index. And I need to update solr and re-generate index in a real time
> > > > manner (In java code, using SolrJ). Sometimes the update operation is
> > > slow.
> > > > And it is expected that in a year, the index size may increase to
> 4GB.
> > > And
> > > > I need to do something to prevent performance downgrade.
> > > >
> > > > Is there any solr official monitoring & profiling tool for this?
> > > >
> > > > Spark
> > >
> > >
>


Re: [Profiling] How to profile/tune Solr server

2011-11-04 Thread yu shen
Thank you for the information.

2011/11/5 yu shen 

> Really helpful, thanks so much.
>
> Spark
>
> 2011/11/4 
>
> Hi Spark,
>>
>> 2009 there was a monitor from lucidimagination:
>>
>> http://www.lucidimagination.com/about/news/releases/lucid-imagination-releases-performance-monitoring-utility-open-source-apache-lucene
>>
>> A colleague of mine calls the sematext monitor a "trojan" because SPM "phones
>> home":
>> "Easy in, easy out - if you try SPM and don't like it, simply stop and
>> remove the small client-side piece that sends us your data"
>> http://sematext.com/spm/solr-performance-monitoring/index.html
>>
>> It looks like other people are using a "real profiler" like YourKit Java Profiler:
>> http://forums.yourkit.com/viewtopic.php?f=3&t=3850
>>
>> There is also an article about Zabbix
>>
>> http://www.lucidimagination.com/blog/2011/10/02/monitoring-apache-solr-and-lucidworks-with-zabbix/
>>
>> In your case any profiler would do, but if you find a profiler with
>> Solr-specific default filters, let me know.
>>
>>
>>
>> Best regards
>>  Karsten
>>
>> P.S. eMail in context
>>
>> http://lucene.472066.n3.nabble.com/Profiling-How-to-profile-tune-Solr-server-td3467027.html
>>
>>  Original-Nachricht 
>> > Datum: Mon, 31 Oct 2011 18:35:32 +0800
>> > Von: yu shen 
>> > An: solr-user@lucene.apache.org
>> > Betreff: Re: [Profiling] How to profile/tune Solr server
>>
>> > No idea so far, try to figure out.
>> >
>> > Spark
>> >
>> > 2011/10/31 Jan Høydahl 
>> >
>> > > Hi,
>> > >
>> > > There are no official tools other than looking at the built-in stats
>> > pages
>> > > and perhaps using JConsole or similar JVM monitoring tools. Note that
>> > > Solr's JMX capabilities may let you hook your enterprise's existing
>> > > monitoring dashboard up with Solr.
>> > >
>> > > Also check out the new monitoring service from Sematext which will
>> give
>> > > you graphs and all. So far it's free evaluation:
>> > > http://sematext.com/spm/index.html
>> > >
>> > > Do you have a clue for why the indexing is slow?
>> > >
>> > > --
>> > > Jan Høydahl, search solution architect
>> > > Cominvent AS - www.cominvent.com
>> > > Solr Training - www.solrtraining.com
>> > >
>> > > On 31. okt. 2011, at 04:59, yu shen wrote:
>> > >
>> > > > Hi All,
>> > > >
>> > > > I am a solr newbie. I find solr documents easy to access and use,
>> > which
>> > > is
>> > > > really good thing. While my problem is I did not find a solr home
>> > grown
>> > > > profiling/monitoring tool.
>> > > >
>> > > > I set up the server as a multi-core server, each core has
>> > approximately
>> > > 2GB
>> > > > index. And I need to update solr and re-generate index in a real
>> time
>> > > > manner (In java code, using SolrJ). Sometimes the update operation
>> is
>> > > slow.
>> > > > And it is expected that in a year, the index size may increase to
>> 4GB.
>> > > And
>> > > > I need to do something to prevent performance downgrade.
>> > > >
>> > > > Is there any solr official monitoring & profiling tool for this?
>> > > >
>> > > > Spark
>> > >
>> > >
>>
>
>
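
As a concrete follow-up to the JMX suggestion above, a minimal sketch (not from
this thread) of exposing Solr's statistics as JMX MBeans so JConsole or an
existing monitoring dashboard can attach; the port and flags are example values:

  <!-- solrconfig.xml: register Solr's MBeans with the JVM's platform MBean server -->
  <jmx/>

  # JVM options for the servlet container, so a remote JConsole can connect:
  -Dcom.sun.management.jmxremote
  -Dcom.sun.management.jmxremote.port=9999
  -Dcom.sun.management.jmxremote.authenticate=false
  -Dcom.sun.management.jmxremote.ssl=false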


Re: DIH doesn't handle bound namespaces?

2011-11-04 Thread Lance Norskog
Yes, the xpath support is a custom, lightweight implementation intended for high-speed use.

There is a separate full XSL processor.
http://wiki.apache.org/solr/DataImportHandler#Configuration_in_data-config.xml-1

I think this lets you run real XSL on input files. I assume it lets you
throw in your favorite XSL implementation.
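
For illustration, a rough sketch of what that can look like in data-config.xml
(the URL and stylesheet path are placeholders; the xsl attribute applies a full
XSL transform to the fetched document before the simplified xpath parsing, and
useSolrAddSchema assumes the stylesheet emits standard <add><doc> output):

  <dataConfig>
    <dataSource type="URLDataSource"/>
    <document>
      <entity name="docs"
              processor="XPathEntityProcessor"
              url="http://example.com/feed.xml"
              xsl="xslt/strip-namespaces.xsl"
              useSolrAddSchema="true"/>
    </document>
  </dataConfig>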

On Thu, Nov 3, 2011 at 12:45 PM, Chris Hostetter
wrote:

>
> : *It does not support namespaces, but it can handle xmls with namespaces.
>
> The real crux of the issue is that XPathEntityProcessor is terribly named.
> It should have been called "LimitedXPathishSyntaxEntityProcessor" or
> something like that, because it doesn't support full xpath syntax...
>
> "The XPathEntityProcessor implements a streaming parser which supports a
> subset of xpath syntax. Complete xpath syntax is not supported but most of
> the common use cases are covered..."
>
> ...I thought there was a DIH FAQ about this, but if not, there really
> should be.
>
>
> -Hoss
>



-- 
Lance Norskog
goks...@gmail.com